This resource is based on an article originally published here.
Introduction
When using Apache Cassandra a strong understanding of the concept and role of partitions is crucial for design, performance, and scalability. This blog covers the key information you need to know about partitions to get started with Cassandra. It covers topics including how to define partitions, how Cassandra uses them, what are the best practices and known issues.
Cassandra stores data with tunable consistency in partitions across a cluster, with each partition representing a set of rows. Partitioning is performed through a mathematical function and data locality is determined by the partition key.
Data partitioning is a common concept amongst distributed data systems. Such systems distribute incoming data into chunks called ‘partitions’. Features such as replication, data distribution, and indexing use a partition as their atomic unit. Data partitioning is usually performed using a simple mathematical function such as identity, hashing, etc. The function uses a configured data attribute called ‘partition key’ to group data in distinct partitions.
Consider an example where we have server logs as incoming data. This data can be partitioned using the log timestamp rounded to the hour value — this partitioning configuration results in data partitions with one hour worth of logs each. Here the partitioning function used is the ‘identity function and the partition key used is a timestamp with a rounded hour.
Apache Cassandra Data Partitions
Apache Cassandra, a NoSQL database, belongs to the big data family of applications and operates as a distributed system, and uses the principle of data partitioning as explained above. Data partitioning is performed using a partitioning algorithm which is configured at the cluster level while the partition key is configured at the table level.
The Cassandra Query Language (CQL) is designed on SQL terminologies of table, rows and columns. A table is configured with the ‘partition key’ as a component of its primary key. Let’s take a deeper look at the usage of the Primary key in the context of a Cassandra cluster.
A primary key in Cassandra represents a unique data partition and data arrangement within a partition. The optional clustering columns handle the data arrangement part. A unique partition key represents a set of rows in a table which are managed within a server (including all servers managing its replicas).
A primary key has the following CQL syntax representations:
Definition 1:
CREATE TABLE server_logs(
log_hour timestamp PRIMARYKEY,
log_level text,
message text,
server text
)
partition key: log_hour
clustering columns: none
Definition 2:
CREATE TABLE server_logs(
log_hour timestamp,
log_level text,
message text,
server text,
PRIMARY KEY (log_hour, log_level)
)
partition key: log_hour
clustering columns: log_level
Definition 3:
CREATE TABLE server_logs(
log_hour timestamp,
log_level text,
message text,
server text,
PRIMARY KEY ((log_hour, server))
)
partition key: log_hour,server
clustering columns: none
Definition 4:
CREATE TABLE server_logs(
log_hour timestamp,
log_level text,
message text,
server text,
PRIMARY KEY ((log_hour, server),log_level)
)WITH CLUSTERING ORDER BY (column3 DESC);
partition key: log_hour,server
clustering columns: log_level
This set of rows is generally referred to as a ‘partition’.
Definition1 has all the rows sharing a ‘log_hour’ as a single partition.
Definition2 has the same partition key as Definition1, but all rows in each partition are arranged with the ascending order ‘log_level’.
Definition3 has all the rows sharing a ‘log_hour’ for each distinct ‘server’ as a single partition.
Definition4 has the same partition as Definition3, but it arranges the rows with descending order of ‘log_level’ within the partition.
Cassandra read and write operations are performed using a partition key on a table. Cassandra uses ‘tokens’ (a long value out of range -2^63 to +2^63-1) for data distribution and indexing. The tokens are mapped to the partition keys using a ‘partitioner’. The partitioner applies a partitioning function to convert any given partition key to a token. Each node in a Cassandra cluster owns a set of data partitions using this token mechanism. The data is then indexed on each node with the help of the partition key. The takeaway here is, Cassandra uses a partition key to determine which node store data on and where to find data when it’s needed.
See below diagram of Cassandra cluster with 3 nodes and token-based ownership.
*This is a simple representation of tokens, the actual implementation uses Vnodes.
Learn why Cassandra is a winning solution for Big Data: The Unmatchable ROI of Managed Apache Cassandra® Service
Controlling the size of the data stored in each partition is essential to ensure even distribution of data across the cluster and to get good I/O performance. Below are the impacts Partitioning has on some of the different aspects of a Cassandra cluster:
Read Performance: Cassandra maintains caches, indexes and index summaries to locate partitions within SSTables files on disk. Large partitions cause inefficiency in maintaining these data structures and result in performance degradation. The Cassandra project has made several improvements in this area, especially in version 3.6 where the engine was restructured to be more performant for large partitions and more resilient against memory issues and crashing.
Memory Usage: The partition size directly impacts the JVM heap size and garbage collection mechanism. Large partitions increase pressure on the JVM heap and make garbage collection inefficient.
Cassandra Repairs: Repair is a maintenance operation to make data consistent. It involves scanning data and comparing it with other data replicas followed by data streaming if required. Large partition sizes make it hard to repair data.
Tombstones Eviction: Cassandra uses unique markers called ‘tombstones’ to mark data deletion. Large partitions can contribute to difficulties in tombstone eviction if data deletion pattern and compaction strategy are not appropriately implemented.
Being aware of these impacts helps in an optimal partition key design while deploying Cassandra. It might be tempting to design the partition key to having only one row or a few rows per partition. However, a few other factors might influence the design decision, primarily, the data access pattern and ideal partition size.
The access pattern and its influence on partitioning key design are explained in-depth in one of our ‘Data modelling’ articles here – A 6 step guide to Apache Cassandra data modelling. In a nutshell, an ‘access pattern’ is the way a table is going to be queried, i.e. a set of all ‘select’ queries for a table. Ideal CQL select queries always have a single partition key in the ‘where’ clause. This means Cassandra works most efficiently when queries are designed to get data from a single partition.
Partitioning Key design
Now let’s look into designing the partitioning key that leads to an ‘ideal partition size’. The practical limit on the size of a partition is two billion cells, but it is not ideal to have such large partitions. The maximum partition size in Cassandra should be under100MB and ideally less than 10MB. Application workload and its schema design haves an effect on the optimal partition value. However, a maximum of 100MB is a rule of thumb. A ‘large/wide partition’ is hence defined in the context of the standard mean and maximum values.
In the versions after 3.6, it may be possible to operate with larger partition sizes. However, thorough testing and benchmarking for each specific workload is required to ensure there is no impact of your partition key design on the cluster performance.
Below are some best practices to consider when designing an optimal partition key:
A partition key for a table should be designed to satisfy its access pattern and with the ideal amount of data to fit into partitions.
A partition key should not allow ‘unbounded partitions’. An unbounded partition grows indefinitely in size as time passes.
In the server_logs table example, if the server column is used as a partition key it will create unbounded partitions as logs for a server will increase with time. The time attribute of log_hour, in this case, puts a bound on each partition to accommodate an hour worth of data.
A partition key should not create partition skew, in order to avoid uneven partitions and hotspots. A partition skew is a condition in which there is more data assigned to a partition as compared to other partitions and the partition grows indefinitely over time.
In the server_logs table example, suppose the partition key is server and if one server generates way more logs than other servers, it will create a skew.
Partition skew can be avoided by introducing some other attribute from the table in the partition key so that all partitions get even data. If it is not feasible to use a real attribute to remove skew, a dummy column can be created and introduced to the partition key. The dummy column then distinguishes partitions and it can be controlled from an application without disturbing the data semantics.
In the skew example above, consider a dummy column partition smallint is introduced and the partition key is altered to server, partition. Now the application logic sets the partition attribute to 1 until there are enough rows in a partition and then it sets partition to 2 for the same server.
Time Series data can be partitioned using a time element in the partition key along with other attributes. This helps in multiple ways –
it works as a safeguard against unbounded partitions
access patterns can use the time attribute to query specific data
data deletion can be performed for a time-bound etc.
In the server_logs table, all four definitions use the time attribute log_hour. All four definitions are good examples of bounded partitions by the hour value.
Summary
The important elements of the Cassandra partition key discussion are summarized below:
Each Cassandra table has a partition key which can be standalone or composite.
The partition key determines data locality through indexing in Cassandra.
The partition size is a crucial attribute for Cassandra performance and maintenance.
The ideal size of a Cassandra partition is equal to or lower than 10MB with a maximum of 100MB.
The partition key should be designed carefully to create bounded partitions with size in the ideal range.
It is essential to understand your data demographics and consider partition size and data distribution when designing your schema. There are several tools to test, analyse and monitor Cassandra partitions.
The Cassandra version 3.6 and above incorporates significant improvements in the storage engine which provides much better partition handling.