Friday, December 14, 2018

Partitioned consensus and its impact on Spanner’s latency

In a post that I published in September, I described two primary approaches for performing consensus in distributed systems, and how the choice of approach affects the consistency guarantees of the system. In particular, consensus can either be unified, such that all writes in the system participate in the same distributed consensus protocol, or it can be partitioned, such that different subsets of the data participate in distinct consensus protocols.

The primary downside of partitioned consensus was that achieving global consistency is much more complicated. Consistency guarantees require that requests submitted after previous requests complete will never “go back in time” and view a state of the system that existed prior to the completed request. Such guarantees are hard to enforce in partitioned consensus systems since different partitions operate independently from each other: Consistency requires a notion of “before” and “after” --- even for events on separate machines or separate partitions. For partitions that operate completely independently, the most natural way to define “before” and “after” is to use real time --- the time on the clocks of the different partitions. However, clocks tend to skew at the millisecond granularity, and keeping clocks in sync is nontrivial. We discussed how Google has a hardware solution that aids in clock synchronization, while other solutions attempt to use software-only clock synchronization algorithms.

In contrast, unified consensus results in a global order of all requests. This global order can be used to implement the notion of “before” and “after” without having to rely on local clock time, which entirely avoids the need for clock synchronization. This results in stronger consistency guarantees: unified consensus systems can guarantee consistency at all times, while partitioned consensus systems can only guarantee consistency if the clock skew stays within an expected bound. For software-only implementations, it is hard to avoid occasionally violating the maximum clock skew bound assumption, and the violations themselves may not be discoverable. Therefore, unified consensus is the safer option.

The post led to several interesting debates, most of which are beyond the scope of this post. However, there was one interesting debate I’d like to explore more deeply in this post. In particular, the question arose: Are there any fundamental latency differences between unified-consensus systems and partitioned-consensus systems? When I read the comments to my post (both on the post itself and also external discussion boards), I noticed that there appears to be a general assumption amongst my readers that unified consensus systems must have higher latency than partitioned consensus systems. One reader even accused me of purposely avoiding discussing latency in that post in order to cover up a disadvantage of unified consensus systems. In this post, I want to clear up some of the misconceptions and inaccurate assumptions around these latency tradeoffs, and present a deeper (and technical) analysis on how these different approaches to consensus have surprisingly broad consequences on transaction latency. We will analyze the latency tradeoff from three perspectives: (1) Latency for write transactions, (2) Latency for linearizable read-only transactions and (3) Latency for serializable snapshot transactions.

Latency for write transactions

The debate around write transactions is quite interesting since valid arguments can be made for both sides.

The partitioned-consensus side points out two latency downsides of unified consensus: (1) As mentioned in my previous post, in order to avoid scalability bottlenecks, unified consensus algorithms perform consensus batch-at-a-time instead of on individual requests. They, therefore, have to pay the additional latency of collecting transactions into batches prior to running consensus. In the original Calvin paper, batch windows were 10ms (so the average latency would be 5ms); however, we have subsequently halved the batch window latency in my labs at Yale/UMD. FaunaDB (which uses Calvin’s unified consensus approach) also limits the batch window to 10ms. (2) For unified consensus, there will usually be one extra network hop to get the request to the participant of the consensus protocol for its local region. This extra hop is local, and therefore can be assumed to take single-digit milliseconds. If you combine latency sources (1) and (2), the extra latency incurred by the preparation stage for unified consensus is approximately 10-15ms.

On the other hand, the unified-consensus side points out that multi-partition atomic write transactions require two-phase commit for partitioned-consensus systems. For example, let’s say that a transaction writes data in two partitions: A and B. In a partitioned-consensus system, the write that occurred in each partition achieves consensus separately. It is very possible that the consensus in partition A succeeds while in B it fails. If the system guarantees atomicity for transactions, then the whole transaction must fail, which requires coordination across the partitions --- usually via two-phase commit. Two-phase commit can result in significant availability outages (if the coordinator fails at the wrong time) unless it runs on top of consensus protocols. Thus Spanner and Spanner-derivative systems all run two-phase commit over partitioned consensus for multi-partition transactions. The latency cost of the Raft/Paxos protocol itself (once it gets started) is the same for unified and partitioned consensus, but two-phase commit requires two rounds of consensus to commit such transactions. A single round of consensus may take between 5ms and 200ms, depending on how geographically disperse the deployment is. Since Spanner requires two sequential rounds, the minimum transaction latency is double that --- between 10ms for single-region deployments to 400ms for geographically disperse deployments.

In practice, this two-phase commit also has an additional issue: a transaction cannot commit until all partitions vote to commit. A simple majority is not sufficient --- rather every partition must vote to commit. A single slow partition (for example, a partition undergoing leader election) stalls the entire transaction. This increases the observed long tail latency in proportion to transaction complexity.

In contrast, unified consensus systems such as Calvin and its derivatives such as FaunaDB do not require two-phase commit. [Side note: a discussion of how to avoid two-phase commit in a unified consensus system can be found in this VLDB paper. FaunaDB’s approach is slightly different, but the end result is the same: no two-phase commit.] As a result, unified consensus systems such as Calvin and FaunaDB only require one round of consensus to commit all transactions --- even transactions that access data on many different machines.

The bottom line is that the better latency option between unified or partitioned consensus for write transactions is somewhat workload dependent. Unified consensus increases latency for all transactions by a little, but partitioned consensus can increase latency by a lot more for multi-partition transitions. For most applications, it is impossible to avoid multi-partition transactions. For example, many applications allow transactional interactions between arbitrary users (payments between users, exchanges between users, “friendship” status updates between users, gaming interactions, etc.). Although it is possible to group users into partitions such that many of their interactions will be with other users within that partition (e.g. partition by a user’s location), as long as arbitrary interactions are allowed, there will always be interactions between users in different partitions. These multi-partition interactions are much slower in partitioned-consensus systems. Thus, for most workloads, unified-consensus is the better latency option for write transactions.

Latency for linearizable read-only transactions

Linearizable read-only transactions are generally sent to the consensus leader’s region and performed (or at least receive a timestamp) there [other options exist, but this is what Spanner and other systems mentioned in my previous post do]. In unified-consensus, there is only one leader region for the whole system. Linearizable read transactions that initiate from near this region will be processed with low latency, while transactions that initiate from farther away will observe correspondingly higher latency.

Meanwhile, in partitioned-consensus, there is one leader per partition, and these leaders can be located in different regions. The partitioned-consensus supporters argue that this can lead to lower latency in an array of common scenarios. An application developer can specify a location-based partitioning algorithm. All data that is usually accessed from region X should be placed in the same partition, with a consensus leader located in region X. All data that is usually accessed from region Y should be placed in the same partition, with a consensus leader located in region Y. In doing so, a larger number of read-only transactions will observe lower latency.

The downside of this approach is that it breaks the abstraction of the consensus protocol as a separate component of the system --- now the consensus protocol and data storage layer become much more intertwined, increasing the monolithicity of the system. Furthermore, consensus protocols run leader elections after a lease expires, and would have to reduce the democratic nature of this protocol in order to ensure the leader remains in the closest possible region. Finally, it increases complexity and reduces the flexibility of the partitioning protocol. As far as I know, the most well-known example of a partitioned-consensus system --- Spanner --- does not take advantage of this potential optimization for these reasons.

Consequently, although in theory, there is a potential latency advantage for partitioned-consensus systems for linearizable read-only transactions, in practice this advantage is not realized.

On the other hand, there is a fundamental latency advantage for unified-consensus systems in the presence of multi-partitioned transactions. A multi-partition transaction in a partitioned-consensus system must involve more than one leader. The leaders of each partition accessed by the read transaction must communicate with each other in order to figure out a timestamp at which this linearizable read can be safely performed (see sections 4.1.4 and 4.2.2 of the Spanner paper). Alternatively, a timestamp sufficiently into the future (at the end of the clock skew uncertainty bound) can be chosen at which to perform the read. Either way ---- whether it is communication across leaders (that may be located in different geographic regions) or whether it is waiting until the end of the clock skew uncertainty bound --- multi-partition reads pay an additional latency cost in order to ensure linearizability. In contrast, unified consensus systems have only a single leader region, and can perform linearizable reads across the servers in this region, without any communication with other regions or waiting for clock skew uncertainty windows to close.

Latency for serializable snapshot read-only transactions

Many applications --- even if they require linearizable write transactions --- do not require linearizable read-only transactions, and instead are content to read from a recent snapshot of the database state. However, such snapshot reads must be serializable --- they should reflect the database state as of a particular transaction in the serial order, and none of the writes from transactions after that transaction.

Recall that transactions may write data on different machines / partitions. For example, a transaction, T, may write data on partition A and partition B. A serializable snapshot that includes data from both partition A and partition B must therefore include both of T’s writes to those partitions, or neither. In particular, it should include both of T’s writes if the snapshot is as of a point in time “after” T, and neither of T’s writes if the snapshot is as of a point in time “before” T. Note that this notion of “point in time” must exist --- even across partitions.Therefore, once again, there needs to be a global notion of “before” and “after” --- even for writes across different machines. As long as this notion of “before” and “after” exists, such snapshot reads can be sent to any replica to be processed there, and do not require any consensus or interaction with consensus leader regions. This is critically important to support low latency reads in a geographically disperse deployment.

As mentioned in the introductory paragraph of this post, both unified- and partitioned-consensus systems are capable of generating global notions of “before” and “after”, and thus both types of systems are able to achieve the performance advantage of being able to perform these reads from any replica. However, as we mentioned above, unified-consensus systems can achieve this global notion of “before” and “after” without any clock synchronization, while partitioned-consensus systems use clock synchronization. Thus, unified-consensus can always achieve correct serializable snapshot reads, while partitioned-consensus can only achieve the same result if the maximum clock skew bound assumption is not violated.

Conclusion

The latency debate between unified vs. partitioned consensus is an intricate one. However, it is clear that multi-partition transactions exacerbate the disadvantages of partitioned-consensus transactions in (at least) three dimensions:

Multi-partition transactions require two-phase commit on top of the consensus protocol in partitioned-consensus systems. In many deployments, consensus across replicas is the latency bottleneck. By requiring two-phase commit on top of consensus, partitioned-consensus systems result in (approximately) double the latency (relative to unified-consensus) in such deployments, and higher long tail latency.

Partitioned-consensus systems require clock synchronization in order to achieve low latency snapshot reads (in addition to all linearizable operations). Any incorrect assumptions of the maximum clock skew across machines may result in serializability violations (and thus incorrect results being returned).

As we discussed above, it is usually impossible to avoid multi-partition transactions in most real-world applications. Furthermore, as an application scales, the number of partitions must increase, and thus the probability of multi-partition transactions is also likely to increase. Therefore, the disadvantages of partitioned-consensus systems relative to unified-consensus systems accentuate as the application scales.

6 comments:

Can you clarify the "Latency for write transactions" section? Paragraph 2 says that consensus is possible in 10-15ms for a unified-consensus system, while paragraph 3 says consensus requires 10-400ms for a partitioned-consensus system. That seems to suggest that there are cases where the partitioned system might take (400/15)=~25x as long. I don't think that's what you're saying, right?

To illustrate, and to make sure I'm correctly understanding, take the following example. Say that (using Calvin terms) replica A is in a US DC and replica B is in a EU DC. Say that roundtrip latency is 100ms. A client transaction is initiated from the US DC that includes writes to multiple partitions. Calvin will take ~100ms to commit that txn and ack the client (ignoring the batch accumulation time). A partitioned consensus system with a traditional 2PC protocol would take ~200ms. Is that correct? If so, then the takeaway would be that writes to geo-distributed replicas take roughly 2x as long for a partitioned-consensus system using traditional 2PC.

Ha, yes, I figured you were referencing me. I do want to point out that I never "accused [you] of purposely avoiding discussing latency". That would require me to know your intent, which I'd never presume to do. My point was that to get a complete and fair understanding of the tradeoffs involved, it's essential to discuss latency in geo-distributed use cases. This latest post suggests that you agree.

Regarding comparisons of latency I'll make two points (assuming geo-distributed use cases):

1) You mentioned that "it is possible to group users into partitions such that many of their interactions will be with other users within that partition (e.g. partition by a user’s location)". I'd strengthen this; not only is it possible, it is *vital* if you want users around the world to enjoy low latencies for common operations. Every serious commerce application will do this. Today that's typically done by hosting separate DBs in each region of the world (i.e. today's architectures are already partitioned). A database worthy of the "global" adjective should give those same latency and consistency advantages for data that can be partitioned, but *also* allow cross-region transactions for data that can't be partitioned. This is not something that unified-consensus systems can do without sacrificing either consistency or latency. But partitioned-consensus systems can do this (at the cost of accepting clock-skew risk).

2) It's both possible and practical to use a distributed txn protocol that can ack the client after paying latency equivalent to only one round of consensus (at cost of more recovery work if coordinator fails). I'm not sure what's been published on this topic, and it's challenging to discuss details in a blog comment. But I will say that the latency advantage that you're asserting is due to choice of algorithm, not due to some limitation that is theoretically impossible to address. Referring back to my example in my previous comment, I fully expect that tomorrow's partitioned-consensus systems will commit that transaction in ~100ms rather than ~200ms.

Don't worry --- I wasn't talking about you when I said somebody accused me of hiding latency. That was somebody else (either on Twitter or HackerNews or elsewhere --- I don't remember). But obviously our conversation on my previous post was primarily about latency, and half of the section on "Latency for linearizable read-only transactions" came out from that conversation.

Daniel Abadi

About Me

Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park, doing research primarily in database system
architecture and implementation. He received a Ph.D. from MIT and a M.Phil. from Cambridge. He is best known for his research in column-store database systems (the
C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, which was commercialized by VoltDB and the Calvin project which inspired FaunaDB),
and Hadoop (the HadoopDB project, which was commercialized by Hadapt). Abadi has been a recipient of a Churchill
Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD
Jim Gray Doctoral Dissertation Award, a VLDB best paper award, a VLDB 10 year best paper award, the 2013-2014 Yale Provost's Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He blogs at http://dbmsmusings.blogspot.com and
tweets at http://twitter.com/#!/daniel_abadi.