Errors in Database Systems, Eventual Consistency, and the CAP Theorem

Recently, there has been considerable renewed interest in the CAP theorem [1] for database management system (DBMS) applications that span multiple processing sites. In brief, this theorem states that there are three interesting properties that could be desired by DBMS applications:

C: Consistency. The goal is to allow multisite transactions to have the familiar all-or-nothing semantics, commonly supported by commercial DBMSs. In addition, when replicas are supported, one would want the replicas to always have consistent states.

A: Availability. The goal is to support a DBMS that is always up. In other words, when a failure occurs, the system should keep going, switching over to a replica, if required. This feature was popularized by Tandem Computers more than 20 years ago.

P: Partition-tolerance. If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups.

The CAP theorem is a negative result that says you cannot simultaneously achieve all three goals in the presence of errors. Hence, you must pick one objective to give up.

In the NoSQL community, this theorem has been used as the justification for giving up consistency. Because most NoSQL systems disallow transactions that cross a node boundary, consistency applies only to replicas. Therefore, the CAP theorem is used to justify giving up consistent replicas, replacing this goal with “eventual consistency.” With this relaxed notion, one guarantees only that all replicas will converge to the same state eventually, i.e., when network connectivity has been re-established and enough subsequent time has elapsed for replica cleanup. The justification for giving up C is that A and P can then be preserved.
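To make "converge to the same state eventually" concrete, here is a minimal sketch of one common reconciliation rule, last-writer-wins. It is an illustration only: the `Replica` class, the `reconcile` function, and the assumption that every write carries a unique timestamp are mine, not taken from any particular NoSQL system.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica storing key -> (timestamp, value)."""
    store: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write locally without contacting any other replica.
        # Assumption: timestamps are unique, so ties never occur.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def merge_from(self, other):
        # Replica cleanup: for each key, keep whichever version is newer.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


def reconcile(a, b):
    """Run after connectivity returns; both replicas end in the same state."""
    a.merge_from(b)
    b.merge_from(a)


# During a partition the two replicas accept conflicting writes.
left, right = Replica(), Replica()
left.write("balance", 100, timestamp=1)
right.write("balance", 250, timestamp=2)
diverged = left.store != right.store

# After the partition heals, cleanup makes them agree.
reconcile(left, right)
converged = left.store == right.store
```

Note what convergence costs in this sketch: the write with the older timestamp is silently discarded. The replicas agree afterwards, but nothing says the surviving value is the one the application wanted.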

The purpose of this blog post is to assert that the above analysis is suspect, and that recovery from errors has more dimensions to consider. We assume a typical hardware model of a collection of local processing and storage nodes assembled into a cluster using LAN networking. The clusters, in turn, are wired together using WAN networking.

Let’s start with a discussion of what causes errors in databases. The following is at least a partial list:

1) Application errors. The application performed one or more incorrect updates. Generally, this is not discovered for minutes to hours thereafter. The database must be backed up to a point before the offending transaction(s), and subsequent activity redone.

2) Repeatable DBMS errors. The DBMS crashed at a processing node. Executing the same transaction on a processing node with a replica will cause the backup to crash. These errors have been termed Bohr bugs. [2]

3) Unrepeatable DBMS errors. The database crashed, but a replica is likely to be ok. These are often caused by weird corner cases dealing with asynchronous operations, and have been termed Heisenbugs. [2]

4) Operating system errors. The OS crashed at a node, generating the “blue screen of death.”

5) A hardware failure in a local cluster. These include memory failures, disk failures, etc. Generally, these cause a “panic stop” by the OS or the DBMS. However, sometimes these failures appear as Heisenbugs.

6) A network partition in a local cluster. The LAN failed and the nodes can no longer all communicate with each other.

7) A disaster. The local cluster is wiped out by a flood, earthquake, etc. The cluster no longer exists.

8) A network failure in the WAN connecting clusters together. The WAN failed and clusters can no longer all communicate with each other.

First, note that errors 1 and 2 will cause problems with any high availability scheme. In these two scenarios, there is no way to keep going; i.e., availability is impossible to achieve. Also, replica consistency is meaningless; the current DBMS state is simply wrong. Error 7 is recoverable only if a local transaction is committed only after the assurance that the transaction has been received by another WAN-connected cluster. Few application builders are willing to accept this kind of latency. Hence, eventual consistency cannot be guaranteed, because a transaction may be completely lost if a disaster occurs at a local cluster before the transaction has been successfully forwarded elsewhere. Put differently, the application designer chooses to suffer data loss when a rare event (such as a disaster) occurs, because the performance penalty for avoiding it is too high.
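The trade-off behind error 7 can be sketched in a few lines. This is a toy model under stated assumptions: the `Cluster` class, the `commit`, `forward_outbox`, and `lost_in_disaster` functions, and the `wait_for_remote` flag are all hypothetical names, and WAN latency is not simulated, only the ordering of events.

```python
class Cluster:
    """Hypothetical cluster: a list of durable transactions plus an outbox."""

    def __init__(self, name):
        self.name = name
        self.committed = []   # transactions durable at this cluster
        self.outbox = []      # committed locally, not yet sent over the WAN


def commit(local, remote, txn, wait_for_remote):
    if wait_for_remote:
        # Pay the WAN round trip: the remote cluster has the transaction
        # before the local commit is acknowledged.
        remote.committed.append(txn)
    else:
        # Acknowledge immediately and forward later.
        local.outbox.append(txn)
    local.committed.append(txn)


def forward_outbox(local, remote):
    # Background shipping of locally committed transactions.
    remote.committed.extend(local.outbox)
    local.outbox.clear()


def lost_in_disaster(local, remote):
    """Transactions acknowledged locally that the surviving cluster never saw."""
    return [t for t in local.committed if t not in remote.committed]


# Asynchronous forwarding: t2 commits just before the local cluster is destroyed.
boston, denver = Cluster("boston"), Cluster("denver")
commit(boston, denver, "t1", wait_for_remote=False)
forward_outbox(boston, denver)
commit(boston, denver, "t2", wait_for_remote=False)
async_losses = lost_in_disaster(boston, denver)   # ["t2"]
```

With `wait_for_remote=True` the same sequence loses nothing, at the price of a WAN round trip on every commit, which is exactly the latency few application builders accept.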

As such, errors 1, 2, and 7 are examples of cases for which the CAP theorem simply does not apply. Any real system must be prepared to deal with recovery in these cases. The CAP theorem cannot be appealed to for guidance.

Let us now turn to cases where the CAP theorem might apply. Consider error 6 where a LAN partitions. In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did). Considering local failures (3, 4, 5, and 6), the overwhelming majority cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms. Hence, in my opinion, one is much better off giving up P rather than sacrificing C. (In a LAN environment, I think one should choose CA rather than AP.) Newer SQL OLTP systems (e.g., VoltDB and NimbusDB) appear to do exactly this.

Next, consider error 8, a partition in a WAN network. There is enough redundancy engineered into today’s WANs that a partition is quite rare. My experience is that local failures and application errors are way more likely. Moreover, the most likely WAN failure is to separate a small portion of the network from the majority. In this case, the majority can continue with straightforward algorithms, and only the small portion must block. Hence, it seems unwise to give up consistency all the time in exchange for availability of a small subset of the nodes in a fairly rare scenario.
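The "majority continues, small portion blocks" rule can be sketched as a strict-majority check. This is a minimal illustration, assuming a fixed and known total number of nodes; the function names `may_continue` and `partition_decisions` are hypothetical.

```python
def may_continue(reachable: int, total: int) -> bool:
    """A group keeps processing only if it holds a strict majority of nodes."""
    return 2 * reachable > total


def partition_decisions(group_sizes):
    """Given the sizes of the groups a partition created, decide each one's fate."""
    total = sum(group_sizes)
    return [
        "continue" if may_continue(size, total) else "block"
        for size in group_sizes
    ]


# The likely case: a small portion is cut off from the majority.
typical = partition_decisions([8, 2])   # ['continue', 'block']

# The unlucky case: an even split, where neither side has a majority.
even = partition_decisions([5, 5])      # ['block', 'block']
```

Because at most one group can hold a strict majority, two sides never both accept updates, so consistency is preserved. The cost is that the minority side, and both sides in an exact even split, must wait for the network to heal.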

Lastly, consider a slowdown either in the OS, the DBMS, or the network manager. This may be caused by skew in load, buffer pool issues, or innumerable other reasons. The only decision one can make in these scenarios is to “fail” the offending component; i.e., turn the slow response time into a failure of one of the cases mentioned earlier. In my opinion, this is almost always a bad thing to do. One simply pushes the problem somewhere else and adds a noticeable processing load to deal with the subsequent recovery. Also, such problems invariably occur under a heavy load, and dealing with this by subtracting hardware is going in the wrong direction.

Obviously, one should write software that can deal with load spikes without failing; for example, by shedding load or operating in a degraded mode. Also, good monitoring software will help identify such problems early, since the real solution is to add more capacity. Lastly, self-reconfiguring software that can absorb additional resources quickly is obviously a good idea.
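As one example of shedding load rather than failing, here is a minimal sketch of an admission controller that rejects new work once a fixed number of requests are in flight. The `AdmissionController` class and its `max_in_flight` parameter are hypothetical, and the sketch assumes single-threaded use; a real server would need locking or an equivalent.

```python
class AdmissionController:
    """Hypothetical load shedder: cap concurrent work, reject the excess."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed = 0   # count of rejected requests, useful for monitoring

    def try_admit(self) -> bool:
        # Reject immediately instead of queueing without bound.
        if self.in_flight >= self.max_in_flight:
            self.shed += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Called when an admitted request finishes.
        if self.in_flight > 0:
            self.in_flight -= 1


controller = AdmissionController(max_in_flight=2)
spike = [controller.try_admit() for _ in range(3)]   # [True, True, False]
controller.release()
after_release = controller.try_admit()               # True
```

The node stays up and keeps serving the requests it admitted; the `shed` counter is the kind of signal monitoring software can use to show that more capacity is needed.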

In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.

Disclosure: In addition to being an adjunct professor at the Massachusetts Institute of Technology, Michael Stonebraker is associated with four startups that are either producers or consumers of database technology.

Comments

Daniel Creswell

June 25, 2010 04:12

"Unfortunately, I can't seem to locate that paper, which was Tandem-specific, in any case."

I think you meant: Why Do Computers Stop and What Can Be Done About It?

Which is here: http://www.hpl.hp.com/techreports/tandem/TR-85.7.html

John Schlesinger

October 04, 2010 07:49

It is really good to have a serious discussion of this topic. I do not, however, agree that CA is better than AP.

I have been convinced that distributed two phase commit with ACID transactions is not the right way forward since 1988. At that time I was responsible for the application API in CICS at the IBM Hursley lab. The problem is very simple: even in a homogeneous environment the two phase commit takes too long and is too liable to fail.

Although the Open Group (X/Open at the time) managed to create a standard for transaction managers talking to resource managers (XA), they never managed to stabilise the standard for transaction manager to transaction manager communication (XA+). We had first hand experience of this in IBM trying to get IMS, CICS and OS/400 to do distributed two phase commits. This failure to standardise means that even if you go for CA rather than AP, you won't find the middleware to help you do it in a heterogeneous environment.

Even if you standardise on a single transaction manager which can do distributed two phase commits (an area where there are many Bohr type errors still in the code of most commercial application servers), you will find that the architecture doesn't scale. The bank I am currently consulting with is putting in a new core banking system. A customer update (a new address say) implies updating five other systems in three data centres. The latency of this would be unbearable. Customer updates are master data changes and happen about 1000 times less often than transactional data changes, like making payments. There the problem is different but even worse. A distributed two phase commit keeps a session open to each transaction manager for each transaction until it commits. Using ACID for payment processing for instance would quickly lead to running out of resources in current operating systems.

There is a management argument too. Two phase commit only buys you consistency because, in the case of a crash, you can recover all resources back to a consistent state. However, if you use this as the architecture for consistency, then, in a typical enterprise environment, you might expect each transaction to have sessions to five other transaction managers and ten resource managers (at least a database and a queue for each) across say three data centres. That means coordinating log identifier exchange (an unpleasant feature of distributed two phase commit) across three data centres and fifteen parties. Our experience was that as soon as two data centres were implicated, heuristic commit ruled the day. In other words, practically speaking, distributed two phase commit results in consistency by guesswork.
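To give a feel for the coordination cost with fifteen parties, here is a minimal sketch of textbook two phase commit that counts messages. It assumes the basic protocol with no optimisations (one prepare, one vote, one decision, and one acknowledgement per participant) and ignores timeouts, logging, and heuristic outcomes; the function name and participant labels are hypothetical.

```python
def two_phase_commit(votes):
    """votes maps participant name -> True if it is prepared to commit.

    Returns the decision and the number of messages exchanged.
    """
    messages = 0

    # Phase 1: a prepare request to each participant, and a vote back.
    messages += 2 * len(votes)
    decision = "commit" if all(votes.values()) else "abort"

    # Phase 2: the decision to each participant, and an acknowledgement back.
    messages += 2 * len(votes)
    return decision, messages


# Five transaction managers and ten resource managers, as in the example above.
parties = {f"tm{i}": True for i in range(5)}
parties.update({f"rm{i}": True for i in range(10)})
all_yes = two_phase_commit(parties)     # ('commit', 60)

# A single unprepared participant aborts the transaction for everyone.
parties["rm7"] = False
one_no = two_phase_commit(parties)      # ('abort', 60)
```

Under these assumptions, every transaction costs sixty messages across three data centres, and every participant holds its session and locks until the final round completes.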

Back in 1988 we found that customers were already voting with their feet. A large rail operator in the US was distributing train schedule updates (a safety critical transaction) using messages rather than CICS' built in two phase commit because they found it more reliable. This was a large part of the reason we decided to implement transactional MQ - it is a much better way to distribute transactions.

The reasons for many of us preferring AP to CA have nothing to do with whether we prefer NoSQL to SQL but everything to do with the inability of the vendors to scale distributed two phase commit in real world environments. Jim Gray knew this very well. He tried hard to create a nested transaction semantic that vendors could implement, but all of us went to messaging instead. He admitted part of this to the Register in one of his last interviews (http://www.theregister.co.uk/2006/05/30/jim_gray/) where he says:

"Frankly, over 25 years there's been a lot of work in this area [when to use transactions] and not much success. Workflow systems have, by and large, come to the conclusion that what the accountants told us when we started was correct - the best thing you can do is have compensation if you run a transaction and something goes wrong. You really can't undo the transaction, the only thing you can do is run a new transaction that reverses the effects of that previous transaction as best you can."
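The compensation idea in that quote can be sketched with an append-only ledger, in the accountants' style. This is an illustration only: the ledger layout and the `transfer`, `compensate`, and `balance` functions are hypothetical, and the sketch assumes the reversal can always be applied, which is not guaranteed in practice.

```python
def balance(ledger, account):
    """Sum every entry recorded against the account."""
    return sum(amount for acct, amount, _ in ledger if acct == account)


def transfer(ledger, src, dst, amount, note="transfer"):
    """Record a transfer as two entries and return a record describing it."""
    ledger.append((src, -amount, note))
    ledger.append((dst, amount, note))
    return (src, dst, amount)


def compensate(ledger, record):
    """Run a new transaction that reverses an earlier one.

    Nothing is erased: the original entries stay in the ledger.
    """
    src, dst, amount = record
    return transfer(ledger, dst, src, amount, note="compensation")


ledger = [("alice", 100, "opening"), ("bob", 0, "opening")]
record = transfer(ledger, "alice", "bob", 30)
after_transfer = (balance(ledger, "alice"), balance(ledger, "bob"))      # (70, 30)

compensate(ledger, record)
after_compensation = (balance(ledger, "alice"), balance(ledger, "bob"))  # (100, 0)
```

The balances return to their starting values, but the ledger now holds six entries rather than two. The earlier transaction was not undone; its effects were reversed by a later one, and anything that happened in between saw the intermediate state.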