Post navigation

Understanding race-induced conflicts in BigCouch

Distributed databases with a near-real-time multi-master configuration – such as BigCouch, coming soon to Apache CouchDB – must deal with the potential of simultaneous modifications of a single resource. While the approach taken by multiple single-machine Apache CouchDB servers using regular HTTP replication is well understood, the situation changes a little bit when dealing with BigCouch-style internal replication inside a cluster.

I think it’s time to have a better understanding of what this means, and what impact this has on you as an application developer. Most of the time, there’s no change – to your app, a BigCouch-style cluster looks and feels like a single Apache CouchDB node. But when making near-simultaneous writes to the same document from different clients, you may experience document conflicts that you wouldn’t have with an Apache CouchDB 1.x single server.

How does this happen? Bear with me – this gets a bit complex. Hopefully this diagram will help.

The sequence diagram below depicts a scenario where two very closely spaced writes to the same document in a 3-node BigCouch cluster will lead to a document conflict.

In this example, the database cluster is a 3-node cluster, with default settings of n=3, r=2 and w=2. (This means that a successful write to a document must write 2 copies to disk before an HTTP 201 status code is returned.) Client 1 and Client 2, both external client processes talking to the cluster, are both trying to update /db/doc, which is currently at revision 3. Client 1 is trying to write rev 4-1, and Client 2 is trying to write rev 4-2.

For the purposes of this example, we are going to state a specific ordering for document writes, and treat the write as being processed serially in a specific order. In reality, writes are issued in parallel with no coordination of writes. The scenario as shown is simply one of the possible event sequences you may run into, in the wild.

Client 1’s write is being mediated by Node A. For this example, Node A issues writes to nodes A, B and C, in that order. Client 2’s write is being mediated by Node C. For this example, Node C issues writes to nodes C, B and A, in that order. Both Clients’ first writes succeed (Client 1/Node A and Client 2/Node C) and return the internal equivalent of an HTTP 201 status code.

Arbitrarily, we say Client 1’s request to Node B arrives prior to Client 2’s request to Node B. (One will always be processed before the other; it doesn’t matter which one gets there first.) Node B makes the write of revision 4-1, and returns a 201 equivalent to Node A.

At this point, Node A has 2 successful write responses back. Since the cluster is configured with w=2, a 201 is returned to Client 1 and the client disconnects; its work finished. The third write, already issued by Node A to Node C, will eventually be handled, the conflict recorded to disk and a 409 sent back to Node A. Note that both copies of the document are kept, capturing the document conflict on Node C when this write occurs. Node A’s work is now done.

Just after completing its write of revision 4-1, Node B then processes Node C’s write attempt of rev 4-2 from Client 2. This results in the conflict being written to disk, and returns the 409 equivalent to Node C. The same happens when Node C’s write to Node A is processed. Node C now has a non-conflict 201 response from itself, and the 409 responses from Node B and Node A, so it sends the client a 202 status.

At the end of the process, all 3 nodes have both versions of the document recorded, fulfilling CouchDB’s promise of eventual consistency.

Still with me? Good.

So which document “wins”? By design, the document with the higher hash value (the second part of the _rev token, i.e. _rev=##-hash) will win. If 4-1 and 4-2 were the actual _rev values, 4-2 would win. As such, there is no guarantee that the write with a 201 response will be the ‘winner.’[1]

The closer together writes to the same document occur, the more likely it is that the cluster may still be processing a previous write when the subsequent write comes in. Even with resolution and DELETEs of the losing conflicts, document “tombstones” will be left behind on these leaf nodes to ensure replication results in eventual consistency (CouchDB’s guarantee! [2])

The best approach is to avoid these kinds of document conflicts via an access pattern where simultaneous writes are as unlikely to occur as possible. There are a number of resources out there on how to design apps this way, but one of my favourites is to never re-write a document, store all operations as new documents, and use a view or _all_docs and the startkey/endkey parameters to retrieve the latest state of a given resource.

Barring an application redesign, your CouchDB client should look for 202s and conflicts in documents consistently, and provide evidence of this result to the application layer. [3] You can also create a conflicts view to review all conflicts in a given database.

Resist the temptation to try and resolve the conflict in the CouchDB library! Only at the application layer can you best decide how to deal with a document conflict. You might initially choose to ignore conflicts, but probably it’s in your best interest to perform some sort of manual resolution and write a new, merged version based on data in all the conflicted versions.

[2] Doing better is a provably difficult problem. For reference, start with L. Lamport’s Time, Clocks and the Ordering of Events in a Distributed System, Communications of the ACM 21, 7 (July 1978), 558-565.

[3] Take note: some CouchDB libraries, such as python-couchdb, do not differentiate between a 201 and a 202 response!