Wednesday, November 04, 2009

Eventual Consistency by Example

Recently, there has been a lot of chitchat about the eventual consistency model as illustrated in the famous Amazon Dynamo paper, and today employed by several non-relational databases such as Voldemort or Cassandra.Everything starts with this blog post by the Facebook Infrastructure Lead, claiming: "Dynamo: A flawed architecture", where he makes a few points against the eventual consistency model and the related "sloppy" quorum approach.

However, his points seems to be based on a few misconceptions which the Amazon paper doesn't help to clarify, so let's try to spread some light by first giving a few definitions, and then a simple example.

Definitions.

We may sum up theeventual consistency model in the following statement:

Given a total number on T nodes, we choose a subset of N nodes for holding key/value replicas, arrange them in a preference list calling the top node "coordinator", and pick the minimum number of writes (W) and reads (R) that must be executed by the coordinator on nodes belonging to the preference list (including itself) in order to define the write and read as "successful".

Here are the key concepts, extracted from the statement above:

N is the number of nodes defining the number of replicas for a given key/value.

Those nodes are arranged in a preference list.

The node at the top of the preference list is called coordinator.

W is the minimum number of nodes where the write must be successfully replicated.

R is the minimum number of nodes where the read must be successfully executed.

Writes are executed by the coordinator on the first W available nodes on the preference list.Reads are executed by the coordinator on the first R available nodes on the preference list.This is eventually consistent because consistency itself depends on W and R: you may successfully write and read from a number of nodes less than the total number of replicas (N), with conflicts resolved at read time and values asynchronously propagated by hinted-handoff or alike (which could both be a topic for another post).The same way, this is a sloppy quorum because writes and reads may succeed even if not all replica nodes (N) are up and running, depending again on W and R.

More specifically, values for N, W and R can be tuned in order to:

Achieve high write availability by setting: W < N

Achieve high read availability by setting: R < N

Achieve full consistency by setting: W + R > N

This seems to be the most difficult part to grasp, so ... let's proceed with our example!

An example.

Let me show you the simplest possible example: a cluster of two nodes.

We have:

Node1 : the coordinator.

Node2 : guess what, the second cluster node.

N = 2 : our replication factor.

Let's examine all possible eventual consistency scenarios.

W = 1, R = 1 -> W < N, R < N

You get write/read high availability: you can write/read to/from whatever available node, but you may get inconsistent reads. For example, if you write to node1 and then read from node2 (maybe because node1 has suddenly failed), you'll get an inconsistent read.With two nodes, this is the only eventually consistent scenario.Let's see why.

W = 1, R = 2 -> W + R > N

This is a fully consistent scenario.Here, you can write to whatever node, but you can only read when both nodes are online, more specifically:

If you write to node1 and then read from node2 without reading from node1 because suddenly death, the read will fail.

Same if writing and reading from the same node: if you write to node1 and then read from node1 without reading from node2 because suddenly death, the read will fail.

Finally, the two most misunderstood scenarios:

If you write to node2 because node1 is offline, then node1 gets online and you read the same value from it, your read will succeed because you have both nodes online and you will not get any inconsistent value! That's because node2 will have the most up-to-date version which will be "acquired" by node1 and returned to you!

If you write to node1, then it goes offline and you write again on node2, you will never lose the old value! That's because you will be able to read only after node1 coming online again: at that time, a read from node1 and node2 will return both values (because divergent), leaving the reconciliation process to the client.

W = 2, R = 1 -> W + R > NW = 2, R = 2 -> W + R > N

Even easier: consistent reads are guaranteed by the W value equal to the replication factor.

Finally ...

Hope this write-up helped you clarifying a few misconceptions about eventual consistency.Feedback is obviously highly appreciated.

16 comments:

Thanks for the explanation. When you say that (R+W)>N ensures full consistency, the natural question is, "Which N?" Ensuring that addition or deletion or (especially) replacement of nodes doesn't lead to a violation of the full consistency that practically every description of Dynamo has led people to expect in that case is definitely one of the Hard Parts, and I for one am not convinced that all of the Dynamo-alikes get that part right.

Check out Section 4.6 of the Dynamo paper. I will just quote the first para:

"If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To remedy this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring. "

As the previous commentor asked - which N? The Dynamo paper clearly says - first healthy N. So there's no consistent quorum group for a given key range.

Now - certain people (Jonathan) has argued that Cassandra has consistent quorum groups (does not use first healthy N). However, if this were the case - as the Dynamo paper points out - one loses availability (since loss of 1 node prevents writes (in ur W=2,R=1,N=2 case).

What i have been trying to point out is that Dynamo (at best) presents people with a hard choice between consistency and availability. I think this is a bad choice - because both consistency and availability are achievable when one operates within the context of systems that don't suffer from partitions. Because of this reason - it is better to build a layered system where partition tolerance is provided at a higher level in the system.

N is the number of requested replicas, which may be less than the total number of nodes (T).Then, if T increases, N may or may not be increased as well.

The hard part about this, IMHO poorly explained in the Dynamo paper, is: if N < T and all nodes in N die at a certain point in time, will other nodes (T - N) come up in the preference list replacing the old ones?If this needs to be true, nodes in T - N must have aligned their state with old nodes in N (i.e. using merkle trees) ... if not so, I agree: leaving of *all* nodes in N may lead to data loss.

That's exactly the point of the eventual consistency model employed by Dynamo, where you can turn the sloppy quorum in a consistent one by setting R + W > N.There's no problem with it, it's your choice.

On the other side, saying that:

"I think this is a bad choice - because both consistency and availability are achievable when one operates within the context of systems that don't suffer from partitions."

It's a whole different point: we may agree or disagree, it doesn't matter in the context of the eventual consistency model.You may say you don't like it, but you can't say it's flawed because in your statement you're making different trade-offs: loosing partition tolerance (you) vs eventually loosing consistency (dynamo).

"You may say you don't like it, but you can't say it's flawed because in your statement you're making different trade-offs: loosing partition tolerance (you) vs eventually loosing consistency (dynamo)."

Joydeep is saying it's flawed because, in his opinion, it makes a bad trade-off. Whether you agree with his opinion, that's a different story. :)

Anyway, I definitely believe that this whole debate would have been less of a flamefest if Joydeep had focused on what he said above and if he had used a more moderate tone.

A different take on how to go about the inevitable trade-offs in this space from someone who has experience dealing with a very large infrastructure should be welcome.

Then the loss of consistency "may or may not" occur as well, and "ensures" doesn't allow for that. Therefore, the common statement that W+R>N ensures consistency, without mentioning this caveat, is untrue. I've been explaining consistency and distributed systems to people for twenty years, and it's hard enough for most people to "get it" without vague or misleading claims.

Again, that's IMHO the weakest point of the whole Dynamo paper (at least regarding the eventual consistency model): and you're right, such a scenario should be taken into account when discussing the consistency features (and I should update my post as well).

@sbtourist: the Dynamo paper itself is crystal clear - i don't know what the ambiguity about section 4.6 and the para i quoted are. first N healthy != consistent quorum group = sloppy quoruum = eventual consistency. It has nothing to do with T (if one simply reads the paper straight).

Kannan has also pointed out in comments on my blog that there's also no consistency because failed writes are not backed out (one could even get uncommitted reads).

this is however not the only point i have tried to make. imho - the problems with Dynamo arise from the design principles (symmetry, decentralization) that were laid out. once those principles are set in stone - there is little choice in the design space. more pragmatic designers (take Yahoo PNUTS for example) worked backwards from the problem at hand and came up with a much more better balance of C and P. i am sure there are plenty of more works like that.

sometimes it takes a little act of insanity to change the world. if i had written a nicely diplomatic review of Dynamo's problems - no one would have noticed (clearly - i am no Stonebraker). That sort of discussion happens everyday in grad schools.

so yeah - i have become a little bit of a lightning rod. (it's not pleasant to get called the Fox News of Distributed systems :-(). But now that a whole lot more people are talking about this (and will no doubt converge to better design points) - i think i have (at least partially) succeeded in my goal of initiating change (even if it at some personal cost).

Yeah so the idea is to set W=2,R=2,N=3. This way you write to a majority of all nodes holding a replica, and you read from a majority as well. Thus you can be sure that you have seen the latest version, and you can allow for one node to fail. If you'd have N=5 you'd set R=3,W=3

Sergio, could you provide some more info on reconciliation when retrieving two different versions from the different nodes? Also, do such mechanisms exist in all eventually consistent document storages?

I would like ask a naive question - If in case of W = 2 where every write has to write synchronously to 2 nodes, doesn't it affect the performance and overall scalability?

Also in case of R = 2, does it mean that every read has to atleast read from 2 nodes from amongst the set of N nodes? What happens if a request is directed to one of the (T-N) nodes? How does this actually work? Can somebody throw some light on this?

There are two kind of data-reconciliation strategies: server-based, relying on vector-clocks resolution, or client-based, which returns all conflicting values letting the client do the actual resolution.

While client-based is pretty easy to implement (all the burden is left to the reader), server-based is not implemented by all stores AFAIK.

@Anonymous

If you set W = S (whatever S), then yes, S writes must synchronously happen: that's why you should be careful with it.

Regarding your second question, the request must be routed to the preference-list N nodes: what are your concerns?

Guild Wars 2 Gold Therefore any amount associated with Closed circuit anchoring screws us more than. Buy D3 Gold Nevertheless each and every occupation offers it is countertop. Whether it wasn't for tough Closed circuit the actual robber could be past Website author

The best place to learn Chinese online is in China. However, we understand that it isn't always possible to move here to learn Chinese language. The next best thing is to study with our experienced teachers in a virtual classroom. Online students enjoy the same excellent way of mandarin online lessons and custom designed courseware that we provide for our face to face clients.