Building on quicksand

October 10, 2011

“Building on Quicksand,” by Pat Helland and David Campbell, discusses basic principles for building fault-tolerant, stateful systems. The context is a familiar one: as systems scale, designers must confront a fundamental tradeoff between availability and consistency. The paper offers insights for designing in the face of this inconvenient reality.

Rather than get wrapped up in the CAP theorem or NoSQL debate, the paper focuses on asynchrony as the fundamental issue. In scalable designs, local subsystems must be able to move forward without waiting for acknowledgements from other parts of the system; that is, they must operate asynchronously. Unfortunately, most applications are built around CRUD transactions, which are not an asynchronous model. The paper argues that applications must embrace a new model, one centered on associative, commutative, idempotent operations.

“Building on Quicksand” uses Dynamo as an example. While the Dynamo paper provides interesting details on Dynamo itself, the real lesson is how the shopping cart application deals with the asynchrony of Dynamo. This application does not use Dynamo to directly store a cart’s contents in a CRUD manner. Instead, it uses Dynamo to store a history of associative, commutative, idempotent shopping-cart operations (e.g., “add item” and “remove item”). The application also has logic for reconciling inconsistent histories that might come from different Dynamo replicas. This non-CRUD shopping cart is more complicated than a CRUD one would be, but, the paper contends, app developers cannot be shielded from this complexity.
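To make the idea of an operation history concrete, here is a minimal sketch in Python. It is not the actual Amazon or Dynamo implementation: the `Op` record, its `op_id` and `target` fields, and the `merge` and `cart_contents` functions are all hypothetical names chosen for illustration. The sketch assumes every operation carries a unique id and that a "remove" names the "add" it cancels, so two replica histories can be reconciled with a plain set union.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One shopping-cart operation (hypothetical record, for illustration)."""
    op_id: str        # unique id, so a replayed operation is just a duplicate
    kind: str         # "add" or "remove"
    item: str
    target: str = ""  # for "remove": the op_id of the "add" being cancelled


def merge(history_a: frozenset, history_b: frozenset) -> frozenset:
    """Reconcile two replica histories.

    Set union is associative, commutative, and idempotent, so replicas can
    exchange histories in any order, any number of times.
    """
    return history_a | history_b


def cart_contents(history: frozenset) -> list:
    """Derive the visible cart from the history instead of storing it directly."""
    removed = {op.target for op in history if op.kind == "remove"}
    return sorted(
        op.item
        for op in history
        if op.kind == "add" and op.op_id not in removed
    )


# Two replicas that saw different subsets of the operations.
replica_a = frozenset({Op("1", "add", "book"), Op("2", "add", "pen")})
replica_b = frozenset({Op("1", "add", "book"), Op("3", "remove", "pen", "2")})

reconciled = merge(replica_a, replica_b)
print(cart_contents(reconciled))
```

Because the cart is computed from the merged history, neither replica needs to know the order in which the other received its operations.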

The good news is that working with asynchrony is more familiar than we may think. The paper gives an example (edited for conciseness):

In the past, a form would have multiple carbon copies with a printed serial number on top. When a purchase-order was submitted, a copy was kept in the file of the submitter. If the form and its work were not completed by the expected date, the submitter would follow up. Even if the work was lost, the purchase-order would be resubmitted without modification to ensure a lack of confusion. For example, you wouldn’t change the number of items being ordered, as that may cause confusion. The unique serial number would ensure the work was not performed twice.

The paper points out that our “forefathers” were clever in dealing with asynchronous business processes, and suggests that we look for inspiration in the patterns that they developed.
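The carbon-copy pattern translates fairly directly into code. The following is only a sketch under assumptions: `OrderDesk`, `submit`, and the `work_done` counter are hypothetical names, and a real system would keep the record of completed serial numbers in durable storage rather than in memory. The point it illustrates is that resubmitting the same form with the same serial number is safe, because the work is performed at most once.

```python
class OrderDesk:
    """Processes purchase orders idempotently, keyed by serial number."""

    def __init__(self):
        self.completed = {}  # serial number -> result of the finished work
        self.work_done = 0   # how many times the real work was performed

    def submit(self, serial: str, quantity: int) -> str:
        # A resubmitted form is recognized by its serial number and answered
        # from the record, so the work is not performed twice.
        if serial in self.completed:
            return self.completed[serial]

        self.work_done += 1
        result = f"order {serial}: {quantity} item(s) fulfilled"
        self.completed[serial] = result
        return result


desk = OrderDesk()
first = desk.submit("PO-1001", 3)
# No answer arrived by the expected date, so the submitter follows up
# by resubmitting the form without modification.
second = desk.submit("PO-1001", 3)
print(first == second, desk.work_done)
```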

“Building on Quicksand” gives practical advice for dealing with asynchrony. For example, it describes “guesses and apologies.” Asynchronous subsystems act on local “guesses” about the overall state of the system (e.g., a guess as to how many copies of Harry Potter are in stock). Occasionally, these guesses will be wrong, so the system must be prepared to issue an “apology” (e.g., an email indicating that Harry Potter will ship later than expected). Issuing “apologies” is a common part of doing business (think about over-booked airplanes), but it isn’t typically seen as a solution to the CAP limitation (it should be).
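A small sketch may help show how "guesses and apologies" could look in code. Everything here is an assumption made for illustration: the `Storefront` class, its `take_order` and `reconcile` methods, and the rule that the most recent orders are the ones delayed are not taken from the paper. The sketch accepts orders against a local guess of the stock level and, once the authoritative count becomes known, issues apologies for any orders the guess wrongly allowed.

```python
class Storefront:
    """Accepts orders on a local stock guess and apologizes when it was wrong."""

    def __init__(self, guessed_stock: int):
        self.guessed_stock = guessed_stock
        self.accepted = []   # order ids accepted on the strength of the guess
        self.apologies = []  # order ids that will ship later than promised

    def take_order(self, order_id: str) -> bool:
        # Decide locally, without waiting for the rest of the system.
        if self.guessed_stock <= 0:
            return False
        self.guessed_stock -= 1
        self.accepted.append(order_id)
        return True

    def reconcile(self, actual_stock: int) -> list:
        # actual_stock: authoritative number of copies that were really
        # available for the accepted orders (hypothetical input).
        shortfall = max(0, len(self.accepted) - actual_stock)
        late = self.accepted[len(self.accepted) - shortfall:]
        self.apologies.extend(late)
        return late


store = Storefront(guessed_stock=3)
for order_id in ["a", "b", "c"]:
    store.take_order(order_id)

# The warehouse later reports that only two copies were actually on hand.
print(store.reconcile(actual_stock=2))
```

The apology is treated as an ordinary business outcome rather than an error path, which is the shift in perspective the paper recommends.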

I highly recommend this article to anyone charged with building large-scale, distributed systems.