Concurrency bugs in datacenter distributed systems

The paper provides a comprehensive study on real-world distributed concurrency bugs and is a good complement to the paper I reviewed before: “Why does cloud stop computing”. While the previous paper looked at all possible bugs that lead to service outages, this paper focuses only on distributed concurrency (DC) bugs. This type of bug happens to be my favorite kind of bug. I am fascinated with “failures”, and with DC bugs doubly so. DC bugs are execution timing/ordering/scheduling related, they occur nondeterministically, and are extremely difficult to detect, diagnose, and fix in production systems. I love seeing those long traces of improbable DC bugs surface when I model check distributed algorithms in TLA+. The experience keeps me humble.

The paper presents analysis of 104 DC bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper. These bugs are categorized and studied according to the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, as shown in Table 1. The bug taxonomy database is available here.

Trigger warning

More than 60% of DC bugs are triggered by a single untimely message delivery that commits order violation or atomicity violation, with regard to other messages or computation. Figure 1 shows possible triggering patterns.

This sounds like a very striking and surprising finding. How is it that simple? How do we reduce most DC bugs to untimely message delivery? If you think about it, it is actually not very surprising. What is a DC bug? It is a state inconsistency across processes. What makes the state inconsistency into a bug? A communication/message-exchange between the two inconsistent processes.

Of course the root cause leading to the state inconsistency can be pretty complex. Figure 3 shows that many DC bugs need complex input preconditions, such as faults (63% in Figure 3.b), multiple protocols (80% in Figure 3.f), and background protocols (81% in Figure 3g). As an example, consider the juicy bug described in Figure 4.

Can we fix it? Yes, we can!

The paper analyzes bug patches to understand developers’ fix strategies for DC bugs. They find that DC bugs can be fixed by either disabling the triggering timing by adding extra synchronization, or by changing the system’s handling logic by ignoring/neutralizing the untimely message.

Yes, this sounds very simple. But what this perspective hides is that you first need to figure out the bugs before you can fix them. And identifying the DC bugs is the hard question as they occur nondeterministically and are extremely difficulty to detect and diagnose. For example, Figure 3 shows 47% of DC bugs lead to silent failures and hare hard to detect and debug in production and reproduce offline.

This is also a losing game, as you always need to play catch up with the new corner cases arising and haunting your system. Instead of doing a case-by-case fixing of the systems, it is better to fix our misconceptions and approach to designing these distributed protocols.

I like the section on Root Causes, which talks about developers misconceptions about distributed systems that leads to these bugs. These misconceptions are simple to correct, and doing so can eradicate many DC bugs.

One hop is faster than two hops.

No hop is faster than one hop.

Atomic blocks cannot be broken.

Interactions between multiple protocols seem to be safe.

Enough states are maintained to detect/handle problems. (Upon observing that some fixes add new in-memory/on-disk state variables to handle untimely message and fault timings.)

Another thing to improve would be to use better testing tools. Recall that 63% of DC bugs surface in the presence of hardware faults such as machine crashes (and reboots), network delay and partition (timeouts), and disk errors. The paper doesn’t mention the Jepsen testing tool, but it would be a good fit to detect many DC bugs mentioned in this study.

And finally ounce of prevention is worth a pound of cure. Design your distributed protocols right in the first place. You can use TLA+ to specify and model-check your distributed protocols/algorithms before you start implementing them.

In the lessons learned section the paper discusses about what can be done to detect/prevent DC bugs, and says that “No matter how sophisticated the tools are, they are ineffective without accurate specifications. This motivates the creation or inference of local specifications that can show early errors or symptoms of DC bugs.” The implication is that since the protocols do not come with good formal specifications, the tool support for testing, checking, detection, etc. becomes ineffective. This point further motivates that it is important to get the distributed protocol right first, and the specification should accompany the protocol so that the implementation can be made reliable. TLA+ provides support for writing the specifications. You can see my previous posts on TLA+ below.