Best Practices for Optimal Performance
This article provides performance tips based on our experience maintaining Apache Cassandra clusters of all sizes, for a large number of clients, across cloud deployments.
Optimal setup and configuration
Replication factor, number of disks, number of nodes, and SKUs
Most clouds are designed to support multiple availability zones in most regions, and we recommend mapping availability zones to racks via the cassandra-rackdc.properties file (see cassandra-rackdc.properties file | Apache Cassandra Documentation). In accordance with this, for the best level of reliability and fault tolerance, we highly recommend configuring a replication factor equal to the number of availability zones in the region the data center is in. We also recommend making the number of nodes a multiple of the replication factor, for example 3, 6, 9, and so on.
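For example, in a region with three availability zones mapped to racks, a keyspace with a matching replication factor of 3 can be created as follows (the keyspace and data center names are placeholders; substitute your own):

```sql
-- Hypothetical keyspace and data center names; replace with your own.
-- RF = 3 matches the three availability zones mapped to racks.
CREATE KEYSPACE app_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
  };
```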
Depending on your reliability constraints, we recommend using remote disks or local disks. To work around disk size restrictions, we recommend striping multiple disks together in a RAID 0 configuration. Particularly with cloud-based disks, it's important to review the IOPS restrictions on each disk as well as on the chosen host type, since they might differ.
We strongly recommend extensive benchmarking of your workload against the chosen compute instance and disks. Benchmarking is especially important for VMs with eight cores or fewer. Our research shows that eight-core CPUs only work for the least demanding workloads; most workloads need a minimum of 16 cores to be performant.
Analytical vs. Transactional workloads
Transactional workloads typically need a data center optimized for low latency, while analytical workloads often use more complex queries, which take longer to execute. In most cases you would want separate data centers:
- One optimized for low latency
- One optimized for analytical workloads
Optimizing for analytical workloads
We recommend you apply the following cassandra.yaml settings for analytical (OLAP) workloads:
Timeouts
| Setting | Cassandra MI Default | Recommendation for OLAP |
| --- | --- | --- |
| read_request_timeout_in_ms | 5,000 | 10,000 |
| range_request_timeout_in_ms | 10,000 | 20,000 |
| counter_write_request_timeout_in_ms | 5,000 | 10,000 |
| cas_contention_timeout_in_ms | 5,000 | 10,000 |
| truncate_request_timeout_in_ms | 60,000 | 120,000 |
| slow_query_log_timeout_in_ms | 500 | 1,000 |
| roles_validity_in_ms | 2,000 | 120,000 |
| permissions_validity_in_ms | 2,000 | 120,000 |
Caches
| Setting | Cassandra MI Default | Recommendation for OLAP |
| --- | --- | --- |
| file_cache_size_in_mb | 2,048 | 6,144 |
More recommendations
| Setting | Cassandra MI Default | Recommendation for OLAP |
| --- | --- | --- |
| commitlog_total_space_in_mb | 8,192 | 16,384 |
| column_index_size_in_kb | 64 | 16 |
| compaction_throughput_mb_per_sec | 128 | 256 |
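Taken together, the OLAP recommendations above correspond to the following cassandra.yaml fragment, shown here only as a convenience; apply it through your usual configuration mechanism:

```yaml
# Recommended overrides for analytical (OLAP) workloads
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
counter_write_request_timeout_in_ms: 10000
cas_contention_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 120000
slow_query_log_timeout_in_ms: 1000
roles_validity_in_ms: 120000
permissions_validity_in_ms: 120000
file_cache_size_in_mb: 6144
commitlog_total_space_in_mb: 16384
column_index_size_in_kb: 16
compaction_throughput_mb_per_sec: 256
```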
Client settings
We recommend boosting Cassandra client driver timeouts in accordance with the timeouts applied on the server.
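As a sketch, with the DataStax Java v4 driver the request timeout can be raised in application.conf to match the server-side OLAP timeouts (the 20-second value below is an example to tune, not a prescription):

```
# application.conf (DataStax Java v4 driver); example value only
datastax-java-driver {
  basic.request.timeout = 20 seconds
}
```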
Optimizing for low latency
Cassandra’s default settings are already suitable for low-latency workloads. To ensure the best performance for tail latencies, we highly recommend using a client driver that supports speculative execution and configuring your client accordingly (see https://docs.datastax.com/en/developer/java-driver/4.15/manual/core/speculative_execution/). For the Java v4 driver, you can find a demo illustrating how this works and how to enable the policy in the Azure-Samples/azure-cassandra-mi-java-v4-speculative-execution repository on GitHub.
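For reference, with the Java v4 driver a constant speculative execution policy can be enabled in application.conf roughly as follows. The delay and execution count are example values to tune for your workload, and speculative execution only applies to statements marked idempotent:

```
datastax-java-driver {
  advanced.speculative-execution-policy {
    class = ConstantSpeculativeExecutionPolicy
    max-executions = 3       # at most 3 total attempts per query
    delay = 100 milliseconds # wait 100 ms before each speculative attempt
  }
  # Speculative execution is only used for idempotent statements.
  basic.request.default-idempotence = true
}
```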
Monitoring for performance bottlenecks
CPU performance
Like every database system, Cassandra works best if CPU utilization is around 50% and never gets above 80%. It is paramount to set up a metrics pipeline to monitor and alert on these metrics; most cloud providers offer a built-in metrics service for this purpose (figure 1).
Figure 1 – Sample CPU metrics.
If the CPU is permanently above 80% for most nodes, the database will become overloaded, manifesting as multiple client timeouts. In this scenario, we recommend taking the following actions:
- Vertically scale up to a VM with more CPU cores (especially if the current VM has only eight cores or fewer).
- Horizontally scale by adding more nodes (as mentioned earlier, the number of nodes should be a multiple of the replication factor).
If the CPU is only high for a few nodes but low for the others, it indicates a hot partition and needs further investigation.
Disk performance
Clouds impose several restrictions on IOPS, and some support “burst IOPS”, i.e. getting more IOPS than configured for a short amount of time to deal with short periods of higher load. Some restrictions apply to the disk (how many IOPS a single disk supports), others to the virtual machine (how many IOPS a VM can handle), which makes reasoning about this complex. Careful monitoring is therefore required when it comes to disk-related performance bottlenecks. It’s important to gather and review IOPS metrics as shown in figure 2:
Figure 2 – IOPS metrics showing the IOPS for reads and writes on each Cassandra node. Note that the resolution in this chart is 5 minutes, so each data point shows the number of I/O operations per 5-minute window; to obtain IOPS, divide the value by 300 (5 × 60 seconds).
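The conversion described in the caption above can be sketched as a small helper (the sample data point is hypothetical):

```python
# Convert an aggregated I/O count from a 5-minute metrics window into average IOPS.
WINDOW_SECONDS = 5 * 60  # chart resolution: 5 minutes

def to_iops(ops_per_window: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Average I/O operations per second over the aggregation window."""
    return ops_per_window / window_seconds

# Hypothetical data point: 1,500,000 operations in one 5-minute window
print(to_iops(1_500_000))  # 5000.0
```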
If the metrics show one or all of the following characteristics, it can indicate that you need to scale up:
- Consistently at or above the IOPS configured for the chosen disk or VM. If you only see elevated IOPS on a few nodes, you might have a hot partition and need to review your data for a potential skew.
- If your IOPS are lower than what the chosen VM supports, but higher than or equal to the disk IOPS, you can take the following actions:
- Add more disks to increase performance, or choose a more performant disk type. Remember that we recommended using RAID 0 earlier, which makes adding disks easier.
- Scale up the data center(s) by adding more nodes.
- If your IOPS max out what your VM supports, you can:
- Vertically scale up to a different VM type supporting more IOPS.
- Scale up the data center(s) by adding more nodes.
Network performance
In most cases network performance is sufficient. However, if you frequently stream data (for example, through frequent horizontal scale-up/scale-down) or there are huge ingress/egress data movements, this can become a problem, and you may need to evaluate the network performance of your VM. It is important to gather and monitor the bytes-in/bytes-out metrics on each VM (figure 3) and, if necessary, on the switches and routers between nodes:
Figure 3 – There are several network-related metrics; we recommend monitoring received bytes and transmitted bytes.
If the network metrics reach the limits of your VM and/or the underlying network connections, there are several changes to consider:
- If you only see elevated network traffic for a small number of nodes, you might have a hot partition and need to review your data distribution and/or access patterns for a potential skew.
- Vertically scale up to a different VM type supporting more network I/O.
- Horizontally scale up the cluster by adding more nodes.
Too many connected clients
Deployments should be planned and provisioned to support the maximum number of parallel requests required for the desired latency of an application. For a given deployment, introducing more load to the system above a minimum threshold increases overall latency. Gather and monitor the number of connected clients (figure 4) to ensure this does not exceed tolerable limits.
Figure 4 – Connected clients per node on a busier cluster. Increases or drops can provide insight into right-sizing the cluster.
Disk space
We advise reviewing disk space metrics occasionally. Cassandra accumulates a lot of disk usage and then reduces it when compaction is triggered. Additionally, backups require disk space as well, depending on how much data is frequently added and removed. Hence it is important to review disk usage (figure 5) over longer periods to establish trends, such as compaction being unable to recoup space.
Figure 5 – Disk utilization for a healthy data center.
Note: In order to ensure available space for compaction, disk utilization should be kept to around 50%.
If disk utilization keeps trending upward, consider the following actions:
- If you only see this behavior for a few nodes, you might have a hot partition and need to review your data distribution and/or access patterns for a potential skew.
- Add more disks, but be mindful of the IOPS limits imposed by your chosen VM type.
- Horizontally scale up the cluster.
JVM memory
Cassandra’s default formula assigns about half of the VM’s memory to the JVM, with an upper limit of 31 GB, which in most cases is a good balance between performance and memory. Some workloads, especially ones with frequent cross-partition reads or range scans, might be memory challenged.
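As an illustration of the sizing rule just described (half of system memory, capped at 31 GB), here is a minimal sketch; Cassandra’s actual cassandra-env.sh calculation has additional branches, so treat this only as a mental model:

```python
# Sketch of the heap-sizing rule described above: half of VM memory, capped at 31 GB.
# Cassandra's real cassandra-env.sh logic has more special cases.
def jvm_heap_gb(vm_memory_gb: float) -> float:
    return min(vm_memory_gb / 2, 31.0)

print(jvm_heap_gb(32))   # 16.0 -> half of memory
print(jvm_heap_gb(128))  # 31.0 -> capped at 31 GB
```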
In most cases memory is reclaimed effectively by the Java garbage collector, but if the CPU is often above 80%, there aren’t enough CPU cycles left for the garbage collector. So any CPU performance problems should be addressed before memory problems.
If the CPU hovers below 70% and garbage collection isn’t able to reclaim memory, you might need more JVM memory. This is especially the case if you are on a VM type with limited memory. In most cases, you will need to review your queries and client settings and reduce fetch_size, along with the LIMIT chosen in your CQL queries.
If you indeed need more memory, you can:
- Increase the JVM memory settings, but be mindful that the memory you take from the system won’t be available for OS-level disk caches and therefore might reduce read/write performance.
- Scale vertically to a VM type that has more memory available.
Tombstones
The default in cassandra-reaper is to run repairs every 7 days, which removes rows whose TTL has expired (called “tombstones”). Some workloads have more frequent deletes, and you might find warnings in the Cassandra log like “Read 96 live rows and 5035 tombstone cells for query SELECT …; token <token> (see tombstone_warn_threshold)”, or even errors indicating that a query couldn’t be fulfilled due to excessive tombstones.
A short-term mitigation if queries don’t get fulfilled is to increase tombstone_failure_threshold in the Cassandra config from the default of 100,000 to a higher value.
In addition to this, we recommend reviewing the TTL on the keyspace and potentially running repairs daily to clear out more tombstones. If the TTLs are short, for example less than two days, and data flows in and gets deleted quickly, we recommend reviewing the compaction strategy and favoring Leveled Compaction Strategy. In some cases, such actions may be an indication that a review of the data model is required.
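The relevant cassandra.yaml knobs look as follows; the raised failure threshold is only an example of the short-term mitigation described above, and the warn threshold is shown at its default:

```yaml
tombstone_warn_threshold: 1000       # default: warn when a query scans this many tombstones
tombstone_failure_threshold: 150000  # example raised value; default is 100000
```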
Batch warnings
You might encounter this warning in the CassandraLogs and potentially related failures:
Batch for [<table>] is of size 6.740KiB, exceeding specified threshold of 5.000KiB by 1.740KiB.
In this case, you should review your queries to stay below the recommended batch size. In rare cases, and as a short-term mitigation, you can increase batch_size_fail_threshold_in_kb in the Cassandra config from the default of 50 to a higher value.
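In cassandra.yaml these thresholds look as follows; the raised failure threshold is only an example of the short-term mitigation, and the warn threshold is shown at its default:

```yaml
batch_size_warn_threshold_in_kb: 5   # default; source of the warning shown above
batch_size_fail_threshold_in_kb: 64  # example raised value; default is 50
```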
Large partition warning
You might encounter this warning in the CassandraLogs:
Writing large partition <table> (105.426MiB) to sstable <file>
This indicates a problem in the data model; large partitions can cause severe performance issues and need to be addressed.
Next steps
In this article, we laid out some best practices for optimal performance. Each workload is different and we recommend setting up a robust benchmarking environment to explore the effect of different settings on your performance.