October 10, 2011

“Building on Quicksand,” by Pat Helland and David Campbell, discusses basic principles for building fault-tolerant, stateful systems. The context is a familiar one: as systems scale, designers must confront a fundamental tradeoff between availability and consistency. The paper offers insights for designing in the face of this inconvenient reality.

Rather than get wrapped-up in the CAP theorem or NoSQL debate, the paper focuses on asynchronyas the fundamental issue. In scalable designs, local subsystems must be able to move forward without waiting for acknowledgements from other parts of the system – i.e., they must operate asynchronously. Unfortunately, most applications are built around CRUD transactions, which is not an asynchronous model. The paper argues that applications must embrace a new model, a model centered on associative, commutative, idempotent operations.

“Building on Quicksand” uses Dynamo as an example. While the Dynamo paper provides interesting details on Dynamo itself, the real lesson is how the shopping cart application deals with the asynchrony of Dynamo. This application does not use Dynamo to directly store a cart’s contents in a CRUD manner. Instead, it uses Dynamo to store a history of associative, commutative, idempotent shopping-cart operations (e.g., “add item,” and “remove item”). The application also has logic for reconciling inconsistent histories that might come from different Dynamo replicas. This non-CRUD shopping cart is more complicated than a CRUD one would be, but, the paper contends, app developers cannot be shielded from this complexity.

The good news is that working with asynchrony is more familiar than we may think. The paper gives an example (edited for conciseness):

In the past, a form would have multiple carbon copies with a printed serial number on top. When a purchase-order was submitted, a copy was kept in the file of the submitter. If the form and its work were not completed by the expected date, the submitter would follow up. Even if the work was lost, the purchase-order would be resubmitted without modification to ensure a lack of confusion. For example, you wouldn’t change the number of items being ordered, as that may cause confusion. The unique serial number would ensure the work was not performed twice.

The paper points out that our “forefathers” were clever in dealing with asynchronous business processes, and suggests that we look for inspiration in the patterns that they developed.

“Building on Quicksand” gives practical advice for dealing with asynchrony. For example, it describes “guesses and apologies.” Asynchronous subsystems act on local “guesses” about the overall state of the system (e.g., a guess as to how many copies of Harry Potter are in stock). Occasionally, these guesses will be wrong, so the system must be prepared to issue an “apology” (e.g., an email indicating that Harry Potter will ship later than expected). Issuing “apologies” is a common part of doing business (think about over-booked airplanes), but it isn’t typically seen as a solution to the CAP limitation (it should be).

I highly recommend this article to anyone charged with building large-scale, distributed systems.

July 2, 2011

I’ve finally posted my second System Note (SN-1): “Audit, Patch, and Bootstrap.” Written with my co-worker, Rohit Chandra, this note summarizes a recipe that many co-workers have helped refine to minimize the impact of Byzantine errors in systems that replicate state.

For performance and availability reasons, replication of state is commonplace in large-scale systems. But it is error-prone, especially at scale, where Byzantine errors are inevitable. And, when errors occur, they can require painful repairs. After losing more weekends than we’d care to admit to cleaning corrupted data, we developed a simple, three-step pattern that minimizes the impact of replication errors:

Where is the Master? The first step is to identify clearly the master copy of your data — and ensure that it is resilient.

How do you audit? The next step is to identify strategies for auditing replicas against that source of truth.

How do you repair? The final step is to identify procedures for repairing discrepancies uncovered by audits. We recommend defining both “patch” procedures that repairs small, simple errors, and “bootstrap” procedures that rewrite replicas from the ground-up.

These steps are simple to remember, simple to apply — and always yields actionable insights. SN-1 describes these steps in a bit more detail, and discusses some practical issues that arise when following them. If you or some friends are building systems that replicate data, give them a try, you won’t be sorry.

June 26, 2010

The fable of The Chicken and the Pig is about commitment to a project or cause.

The Story

The Chicken and the Pig were passing a church on a country road, where they saw a sign reading “Charity meals for the poor, please contribute.” Says the Chicken to the Pig “Sounds like a worthy cause, let’s contribute a ham-and-egg breakfast.” Responds the Pig, thoughtfully, “Madam, for you that would be contribution, for me a total commitment.”

Interpretation and lessons

This fable is commonly referenced in the Agile software development community to to illustrate two types of project members: pigs, who are totally committed to the project and accountable for its outcome, and chickens, who consult on the project and are informed of its progress. By extension, a rooster can be defined as a person who struts around offering uninformed, unhelpful opinions.

A successful project needs both chickens and pigs (roosters are seen as unproductive). However, given the sacrifice required of being a pig — foreswearing other projects and opportunities – they can be difficult to collect. Thus, the construction of a successful project-team must ensure that the project has sufficient “pigs” – and ensure that they are empowered to drive the project in return for committing to and taking accountability for it.

External links

Epilogue

I originally posted this to Wikipedia. Over time, other contributors changed it, watering down my voice — and, in my biased opinion, the punch of the original. I love Wikipedia, it’s a phenomenal resource for mankind. But this experience illustrates the value of individual voice. The future of media has yet to be written…

April 4, 2010

One of the best articles on software design is David Parnas’ 1972 article, On the criteria to be used in decomposing systems into modules” (CACM 15:2). In this paper, Parnas describes two decompositions of a small program. On the surface, these decompositions seem almost identical. However, on closer inspection, Parnas shows that one of these is driven by a functional decomposition of the problem (what he called the “flow chart” approach), while the other one is driven by the principle of information hiding. He goes on to illustrate the benefits of driving design by the principle of the later.

In the flow chart approach, one identifies the computational steps implied by the requirements of the system (first do A, then do B, then do C), and then decomposes the system according to those steps. Parnas claims that the flow chart approach typically leads to software that is not extensible and is otherwise hard to maintain (mostly because the data structures passed between the modules end up being fragile).

The information-hiding approach is based on two simple steps. First and foremost, identify critical subproblems whose solutions are likely to change over time (e.g., because a better algorithm is found, or because different solutions work better in different environments). These subproblems have become known as the secrets to be encapsulated by a module. Second, define abstract interfaces for these secrets, i.e., interfaces that allow clients to access any solution to the problem that may be provided by the module, without exposing any details of a particular solution.

In my experience (and Parnas’ too, see “The secret history of information hiding”), programmers still too often generate designs using the flow-chart approach. True information hiding is hard, and most programmers haven’t learned how it’s done. Here are some tips on mastering this art:

— Read a bunch of Parnas’ papers. He wrote extensively on the topic, using case studies, which really help you understand what he’s saying.

— Consciously stay away from flow-chart thinking. Focusing on data structures is one way to break the flow-chart mindset. Also, Parnas’ writings on “Virtual Machines” might inspire you to think in another direction (see, for example, “Designing software for ease of extension and contraction”).

— Ask about each module in your design: What is the “secret” being encapsulated by this module? What is the subproblem I’m trying to carve off? If you can’t answer this question in a sentence or two — or if your answer sounds like the summary of a step in a larger process — you probably have a problem.

— Focus on problems not solutions. When confronted with a design problem, don’t generate a solution and then attempt to organize that solution into modules. Rather, decompose the larger problem into a series of subproblems. It’s these subproblems that are exposed in the design; solutions to those subproblems are the secrets being encapsulated.

March 29, 2010

As mentioned in my first post, I often write memos to help me and (secondarily) my colleagues better understand an algorithm, design, or design rational. In this I’ve been inspired by the amazing example of Edsgar Dijkstra and his “EWDs.”

While these memos are typically short, they are often at least a couple of pages in length. In my opinion, this is too long for a blog post, which should be a few short paragraphs easy to read while standing in line for coffee. So I created a Web site, http://www.sysnotes.org/, to host these longer memos. When I post memos over there, I’ll post a summary here in this blog, along with a link so those interested can read the entire note.

To kick sysnotes off, I’ve posted SN-0, “An Introduction to System Notes.” This introductory note describes the motivation for sysnotes a little deeper. It also encourages other system builders to submit their own notes for publication. So please pile on!

March 2, 2010

Larry Bird is known for saying “First master the fundamentals.” Bird was obsessed with the basics, showing up early to every practice, performing basic drills repeatedly until he could literally perform them with his eyes closed. While not as flashy as contemporaries like Earvin “Magic” Johnson, Bird’s devotion to the basics earned him permanent recognition as one of the all-time greats of basketball.

In my experience, great builders of computer systems are similarly devoted to the fundamentals of their profession. They recognize that our field has amassed an enormous body of theoretical and experiential knowledge. Great systems builders make it a point to visit and revisit this accumulated knowledge, making it their own, so they can deploy it “with their eyes closed,” so to speak, when they’re in the throes of designing a new system.

In my own attempts to internalize the fundamentals of systems building, I’ve developed the habit of writing short memos that articulate what I’m learning in my own words. In sharing these pieces with colleagues at the office, I’ve been blessed with feedback that has deepened my understanding of the point at hand. In hopes of receiving even broader feedback, I’ve decided to try my hand at becoming a blogger. So if you find something of interest in here, please let me know what you think.