New token allocation algorithm in Cassandra 3.0

Background

Token allocation for a distributed database like Cassandra is not a trivial problem. One wants to have an even split of the token range, so that load1 can be well distributed between nodes, as well as the ability to add new nodes and have them take a fair share of the load without the necessity to move data between the existing nodes. The two requirements are at odds, as having a well-split range means that new nodes can easily break the balance.

A common solution to the problem, also the one usually taken by Cassandra until recently, is to use a high number of randomly-allocated token ranges per node ("virtual nodes" or "vnodes"). Since the vnode tokens are random it is very easy to add a new node without significantly affecting the load distribution of the cluster. The ranges served by individual vnodes may vary wildly in size, but the averaging effect of the number of vnodes should keep the load variation between nodes in control. Or, at least, this is the expectation.

Problem

Unfortunately, as the number of nodes in the cluster grows, the disproportions in the size of the individual vnode ranges, as well as in the overall size of the token space served by a node, are theoretically guaranteed to grow continuously as more nodes are added. Significantly underused, as well as significantly overused nodes always emerge. To make matters worse it is not trivial to improve the situation when a heavily loaded node is identified -- simply adding new members to the cluster would usually not affect the problem one. The effects of replication provide an additional challenge if one tries to allocate manually. And to prevent this from happening, i.e. to be able to keep the distribution under control, one needs to increase the number of vnodes per node as the cluster increases in size, which brings with itself a range of complications, including performance problems from dealing with a superlinearly increasing overall vnode count and the need to move data between existing nodes.

Solution

A better method of generation of tokens for new nodes was needed. Such a method is provided in Cassandra 3.0, where a token allocation algorithm can be triggered during bootstrap. The algorithm is engaged by specifying the allocate_tokens_for_keyspace parameter in cassandra.yaml in combination with num_tokens. As the replication strategies and factors are specified as part of the keyspace definition, the former parameter is needed to specify a keyspace from which the algorithm can find the replication to optimize for. When bootstrap is complete, the new node will have been allocated num_tokens new tokens which try to optimize the replicated token ownership distribution in the cluster.

For new clusters choosing this method over random allocation permits good distributions for much smaller numbers of vnodes per node: for example, experimenting with the distributions generated randomly shows that for 1000 nodes with replication factor 3, one would require about 250 vnodes to keep overutilization below 30%. With the new algorithm a similar or lower maximum can be achieved with just 4, and more importantly, the expected over- and underutilization would be stable and not degrade as more and mode machines are added to the cluster.

For existing clusters using the method to add new machines should quickly take away responsibility from the most heavily overutilized nodes, and will gradually improve the spread as new nodes are added, until it achieves a couple of percent variability in the allocated shares of both old and new nodes (for the old default of 256 vnodes).

Algorithm

The central idea of the algorithm is to generate candidate tokens, and figure out what would be the effect of adding each of them to the ring as part of the new node. The new token will become primary for part of the range of the next one in the ring, but it will also affect the replication of preceding ones.

The algorithm is able to quickly assess the effects thanks to some observations which lead to a simplified but equivalent version of the replication topology2:

Replication is defined per datacentre and replicas for data for this datacentre are only picked from local nodes. That is, no matter how we change nodes in other datacentres, this cannot affect what replicates where in the local one. Therefore in analysing the effects of adding a new token to the ring, we can work with a local version of the ring that only contains the tokens belonging to local nodes.

If there are no defined racks (or the datacentre is a single rack), data must be replicated in distinct nodes. If racks are defined, data must be replicated in distinct racks3. In either case, there is a clearly defined separation of all token ranges in the local ring into groups where only one replica of any data can reside.

The <n> token ranges where a data item is replicated are allocated going onwards in the ring, skipping token ranges in the same replication group. "Skipping" is difficult to deal with efficiently, but this turns out to be equivalent to saying that a vnode has responsibility for a contiguous span of tokens that ends at its token, and begins at the nearest token of the <n>-th distinct replication group that precedes it, or at the nearest other member of the same replication group, whichever is closer.

The latter observation makes it relatively easy to assess the changes to the responsibility caused by adding a new token in the ring.

Under the assumptions1 of the algorithm, the size of that responsibility ("replicated ownership") is proportional to the load that the presence of this vnode causes. The load over the whole node is thus proportional to the sum of the replicated ownerships of its vnodes.

The latter, the sum of the replicated ownerships of each node's vnodes, is what the algorithm tries to distribute evenly. We do this by evaluating the standard deviation in the ownership of all nodes and the effect on this deviation of selecting a specific token and pick the best in a set of candidates. To keep complexity under control, the candidate tokens are chosen to be the midpoints between existing ones4. Doing this repeatedly for all requested vnodes plus some optimizations gives the allocation algorithm.

To provide some further flexibility in situations where different sizes of nodes are present, we modulate the load distribution by the number of vnodes in a node -- similarly to what random allocation results in for the majority of the nodes. For example, if a user specifies node A to use 4 vnodes, and node B to use 6, we assume that node B needs to take 50% higher load than node A. Another example would be an existing cluster where nodes use 32 vnodes each and several years later new machines, 4 times as powerful as the original members of the cluster, need to be added. In that case the user can and should request an allocation of 128 vnodes for the new machines, and the algorithm will try to allocate 4 times more load for them.

Footnotes

1 To be able to reason about load at a good level of abstraction, this post assumes perfect partition key hashing, as well as (relatively) even load per partition.2 given for NetworkTopologyStrategy as SimpleStrategy is just its simplification with no racks and datacentres3 Cassandra also attempts to deal with situations where we have defined racks, but they are fewer than the replication factor by allocating in all of the distinct racks and then allowing rack repeats to fill in the required replication count. This is a very rarely used scenario that complicates reasoning significantly and isn't supported by the algorithm.4 Empirical testing shows this to be no worse than using greater variability, e.g. choosing 4 different candidates for each existing range at 1/5 intervals.

DataStax has many ways for you to advance in your career and knowledge.

As of the last two days, trying to setup multi-DC or restore snapshots, … I’m pretty disappointed in your docs. Google pointed me to stackoverflow or githubs that were way more explainatory (and accurate)…
Too bad my CTO is already convinced we need to migrate our DBs to cassandra, …