Understanding Hinted Handoff (in Cassandra 0.8)

This post describes hinted handoff for obsolete versions of Cassandra. For a description of how it works today, see this update.

Hinted Handoff is an optional part of writes whose primary purpose is to provide extreme write availability when consistency is not required.

Secondarily, Hinted Handoff can reduce the time required for a temporarily failed node to become consistent again with live ones. This is especially useful when a flaky network causes false-positive failures.

How it works

If a write is made and a replica node for the row is known to be down (see “Guarantees,” below), Cassandra will write a hint to a live replica node indicating that the write needs to be replayed to the unavailable node. If no replica nodes are alive for this row key, the coordinating node will write the hint (and the update itself) locally.

Once a node discovers via Gossip that a node for which it holds hints has recovered, it will send the data row corresponding to each hint to the target.
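The placement rule above can be sketched as a small toy model. This is not Cassandra's code: the class and method names are hypothetical, nodes are plain strings, and the choice of the first live replica as the hint holder is an assumption made only to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the hint-placement rule described above. Not Cassandra's code.
final class HintPlacementSketch {

    /**
     * Maps each down replica to the node that would hold its hint.
     * Assumption: the first live replica in the list holds all hints;
     * if no replica is live, the coordinator holds them.
     */
    static Map<String, String> placeHints(List<String> replicas,
                                          Set<String> liveNodes,
                                          String coordinator) {
        String hintHolder = coordinator;
        for (String replica : replicas) {
            if (liveNodes.contains(replica)) {
                hintHolder = replica;
                break;
            }
        }
        Map<String, String> hints = new LinkedHashMap<>();
        for (String replica : replicas) {
            if (!liveNodes.contains(replica)) {
                hints.put(replica, hintHolder);
            }
        }
        return hints;
    }

    public static void main(String[] args) {
        // Replica A is down, B and C are up: B holds the hint for A.
        System.out.println(placeHints(List.of("A", "B", "C"), Set.of("B", "C", "D"), "D"));
        // Every replica is down: coordinator D holds all three hints.
        System.out.println(placeHints(List.of("A", "B", "C"), Set.of("D"), "D"));
    }
}
```

In the second call no replica is alive, so the coordinator ends up holding every hint, which is the case that matters for ConsistencyLevel.ANY.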

Hinted Handoff and ConsistencyLevel

A hinted write does not count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. If insufficient replica targets are alive to satisfy a requested ConsistencyLevel, UnavailableException will be thrown with or without Hinted Handoff. (This is an important difference from Dynamo’s replication model; Cassandra does not default to sloppy quorum. But see “Extreme write availability” below.)

To see why, let’s look at a simple cluster of two nodes, A and B, and a replication factor (RF) of 1: each row is stored on one node.

Suppose node A is down while we write row K to it with ConsistencyLevel.ONE. We must fail the write: recall that the ConsistencyLevel contract is that “reads always reflect the most recent write when W + R > RF, where W is the number of nodes to block for on write, and R the number to block for on reads.”

If we wrote a hint to B and called the write good because it was written “somewhere,” the contract would be violated, because there would be no way to read the data at any ConsistencyLevel until A came back up and B forwarded the data to it.
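The two-node example can be checked with a few lines of arithmetic. The sketch below is a toy model with hypothetical names, not Cassandra's implementation; it assumes QUORUM means RF / 2 + 1 (integer division) and that a hint stored on a non-replica never counts as a live replica.

```java
// Toy model of the ConsistencyLevel contract quoted above. Not Cassandra's code.
final class ConsistencyContractSketch {

    enum Level { ONE, QUORUM, ALL }

    /** Number of replicas an operation must block for at the given level. */
    static int blockFor(Level level, int rf) {
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return rf / 2 + 1;
            default:     return rf; // ALL
        }
    }

    /** The contract: reads reflect the most recent write when W + R > RF. */
    static boolean readsSeeLatestWrite(int w, int r, int rf) {
        return w + r > rf;
    }

    /** A hint stored elsewhere does not count as a live replica. */
    static boolean canSatisfy(Level level, int rf, int liveReplicas) {
        return liveReplicas >= blockFor(level, rf);
    }

    public static void main(String[] args) {
        int rf = 1;
        // Node A, the only replica for row K, is down: zero live replicas.
        System.out.println(canSatisfy(Level.ONE, rf, 0));   // false, so the write fails
        // A real write at ONE plus a read at ONE keeps the contract.
        System.out.println(readsSeeLatestWrite(1, 1, rf));  // true
        // Counting a hint on B as success means W is effectively 0.
        System.out.println(readsSeeLatestWrite(0, 1, rf));  // false
    }
}
```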

Extreme write availability

For applications that want Cassandra to accept writes even when all the normal replicas are down (so even ConsistencyLevel.ONE cannot be satisfied), Cassandra provides ConsistencyLevel.ANY. ConsistencyLevel.ANY guarantees that the write is durable and will be readable once an appropriate replica target becomes available and receives the hint replay.
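As a rough illustration of the difference, the sketch below contrasts the outcome of a write at ONE and at ANY for a given number of live replicas. The names are hypothetical and the model assumes the coordinator itself is up, since it is the node handling the request.

```java
// Toy model of how ConsistencyLevel.ANY differs from ONE. Not Cassandra's code.
final class AnyLevelSketch {

    enum Outcome { WRITTEN_TO_REPLICA, HINT_ONLY, UNAVAILABLE }

    /** At ONE, at least one real replica must take the write. */
    static Outcome writeAtOne(int liveReplicas) {
        return liveReplicas >= 1 ? Outcome.WRITTEN_TO_REPLICA : Outcome.UNAVAILABLE;
    }

    /** At ANY, a hint stored on the coordinator is enough to accept the write. */
    static Outcome writeAtAny(int liveReplicas) {
        return liveReplicas >= 1 ? Outcome.WRITTEN_TO_REPLICA : Outcome.HINT_ONLY;
    }

    public static void main(String[] args) {
        System.out.println(writeAtOne(0)); // UNAVAILABLE
        System.out.println(writeAtAny(0)); // HINT_ONLY
    }
}
```

A HINT_ONLY write is durable but cannot be read until the hint has been replayed to a replica.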

Performance

Cassandra’s hinted handoff is designed to minimize the extra load on the cluster to avoid cascading failure.

There are two parts to a hinted write: the hint, which specifies where to replay a given row, and the data being written. Writing a hint to a replica node that is already going to store the data row is almost free: the hint will be included in the data row’s commitlog append, and the hint itself is small compared to most data rows.

But in the worst case, if all replicas in the cluster are down, and if you are writing at CL.ANY, Hinted Handoff can increase the effective load on the cluster. Here’s how:

Recall that cluster write throughput is directly proportional to the number of nodes N, and inversely proportional to the number of replicas RF. If a single node writes 15,000 rows per second, then you would expect a 5-node cluster writing 3 replicas to handle roughly 15,000 * N / RF, or 25,000 rows/s.

Without Hinted Handoff, if you lose a node, you lose 1/N of your cluster capacity but you also have to write 1/N fewer rows, so there is no increase in load on the rest of the cluster. If you lose all the replicas for a range of rows, you lose RF/N of capacity and write RF/N fewer rows. But if you are writing at CL.ANY, then you only reduce the write load by (RF-1)/N, because you still have to write the hinted row to the coordinator.

So although the worst-case scenario is relatively mild, it is something you should take into consideration for capacity planning if you plan to do writes at CL.ANY.
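For capacity planning, the back-of-the-envelope figures in this section can be reproduced with a few helper functions. This is only a restatement of the formulas in the text with hypothetical names; it is not a measurement or a model of any real cluster.

```java
// Back-of-the-envelope model of the capacity figures in this section.
final class HintLoadSketch {

    /** Expected cluster write throughput: perNode * N / RF. */
    static double clusterRowsPerSecond(double perNodeRowsPerSecond, int n, int rf) {
        return perNodeRowsPerSecond * n / rf;
    }

    /** Load removed when all RF replicas of a range are down, without hints. */
    static double loadReductionWithoutHints(int n, int rf) {
        return (double) rf / n;
    }

    /** Same failure at CL.ANY: the coordinator still writes the hinted row. */
    static double loadReductionAtAny(int n, int rf) {
        return (double) (rf - 1) / n;
    }

    public static void main(String[] args) {
        int n = 5;
        int rf = 3;
        System.out.println(clusterRowsPerSecond(15_000, n, rf)); // 25000.0
        System.out.println(loadReductionWithoutHints(n, rf));    // 0.6
        System.out.println(loadReductionAtAny(n, rf));           // 0.4
    }
}
```

With N = 5 and RF = 3, losing every replica of a range removes 3/5 of the capacity but only 2/5 of the write load at CL.ANY, and that gap is the extra load to budget for.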

Operations

When removing a node from the cluster (with decommission or removetoken), Cassandra automatically removes hints targeting the node that no longer exists.

If you are doing writes at CL.ANY, you may also have data rows stored on nodes where they don’t “belong,” as part of the hinting process. (That is, the hint consists of a “pointer” to the data row. This makes hinting repeated updates of the same row more efficient.) You can purge these by running cleanup.

Repair and the fine print

A common misunderstanding is that Hinted Handoff lets you safely get away without needing repair. This is not the case (as of Cassandra 0.8.x), because Hinted Handoff only happens once the failure detector has recognized that a replica is unavailable. That means that (for ConsistencyLevel < ALL) there is a window during which a node is dead and not accepting writes, but no hints are created for it.

Thus, we often say that Hinted Handoff is an optimization for consistency. Say, for instance, that you are doing reads and writes at CL.ONE, with 3 replicas. You are thus implicitly accepting the possibility of stale reads in exchange for higher availability. Hinted Handoff won’t change the worst-case behavior (you can still see stale reads), but you will see fewer of them with hints enabled than without.

Future work

CASSANDRA-2034 is open to make Read Repair unnecessary when Hinted Handoff is enabled. (Even with this done, that will only allow disabling Read Repair, not Anti-Entropy Repair, which would still be required whenever a node fails so completely that its data has to be re-replicated from scratch.)

CASSANDRA-2045 is open to allow storing the data changes to replay inline with the hint, which is more efficient for some workloads.

Comments

In this article, you have written that hinted handoff only happens when the failure detector marks a node as down. I think the coordinator node should write a hint if it could not write to a node, or to another coordinator in the case of a multi-DC write.

Looking at the code, I think this is already happening. Read this comment in MessagingService.java:
/**
* Send a message to a given endpoint. This method specifies a callback
* which is invoked with the actual response.
* Also holds the message (only mutation messages) to determine if it
* needs to trigger a hint (uses StorageProxy for that).
* @param message message to be sent.
* @param to endpoint to which the message needs to be sent
* @param cb callback interface which is used to pass the responses or
* suggest that a timeout occurred to the invoker of the send().
* @param timeout the timeout used for expiration
* @return an reference to message id used to match with the result
*/
public String sendRR(Message message, InetAddress to, IMessageCallback cb, long timeout)