Friday, December 14, 2018

In a post that I published in September, I described two primary approaches for performing consensus in distributed systems, and how the choice of approach affects the consistency guarantees of the system. In particular, consensus can either be unified, such that all writes in the system participate in the same distributed consensus protocol, or it can be partitioned, such that different subsets of the data participate in distinct consensus protocols.

The primary downside of partitioned consensus was that achieving global consistency is much more complicated. Consistency guarantees require that requests submitted after previous requests complete will never “go back in time” and view a state of the system that existed prior to the completed request. Such guarantees are hard to enforce in partitioned consensus systems since different partitions operate independently from each other: Consistency requires a notion of “before” and “after” --- even for events on separate machines or separate partitions. For partitions that operate completely independently, the most natural way to define “before” and “after” is to use real time --- the time on the clocks of the different partitions. However, clocks tend to skew at the millisecond granularity, and keeping clocks in sync is nontrivial. We discussed how Google has a hardware solution that aids in clock synchronization, while other solutions attempt to use software-only clock synchronization algorithms.

In contrast, unified consensus results in a global order of all requests. This global order can be used to implement the notion of “before” and “after” without having to rely on local clock time, which entirely avoids the need for clock synchronization. This results in stronger consistency guarantees: unified consensus systems can guarantee consistency at all times, while partitioned consensus systems can only guarantee consistency if the clock skew stays within an expected bound. For software-only implementations, it is hard to avoid occasionally violating the maximum clock skew bound assumption, and the violations themselves may not be discoverable. Therefore, unified consensus is the safer option.

The post led to several interesting debates, most of which are beyond the scope of this post. However, there was one interesting debate I’d like to explore more deeply in this post. In particular, the question arose: Are there any fundamental latency differences between unified-consensus systems and partitioned-consensus systems? When I read the comments to my post (both on the post itself and also external discussion boards), I noticed that there appears to be a general assumption amongst my readers that unified consensus systems must have higher latency than partitioned consensus systems. One reader even accused me of purposely avoiding discussing latency in that post in order to cover up a disadvantage of unified consensus systems. In this post, I want to clear up some of the misconceptions and inaccurate assumptions around these latency tradeoffs, and present a deeper (and technical) analysis on how these different approaches to consensus have surprisingly broad consequences on transaction latency. We will analyze the latency tradeoff from three perspectives: (1) Latency for write transactions, (2) Latency for linearizable read-only transactions and (3) Latency for serializable snapshot transactions.

Latency for write transactions

The debate around write transactions is quite interesting since valid arguments can be made for both sides.

The partitioned-consensus side points out two latency downsides of unified consensus: (1) As mentioned in my previous post, in order to avoid scalability bottlenecks, unified consensus algorithms perform consensus batch-at-a-time instead of on individual requests. They, therefore, have to pay the additional latency of collecting transactions into batches prior to running consensus. In the original Calvin paper, batch windows were 10ms (so the average latency would be 5ms); however, we have subsequently halved the batch window latency in my labs at Yale/UMD. FaunaDB (which uses Calvin’s unified consensus approach) also limits the batch window to 10ms. (2) For unified consensus, there will usually be one extra network hop to get the request to the participant of the consensus protocol for its local region. This extra hop is local, and therefore can be assumed to take single-digit milliseconds. If you combine latency sources (1) and (2), the extra latency incurred by the preparation stage for unified consensus is approximately 10-15ms.

On the other hand, the unified-consensus side points out that multi-partition atomic write transactions require two-phase commit for partitioned-consensus systems. For example, let’s say that a transaction writes data in two partitions: A and B. In a partitioned-consensus system, the write that occurred in each partition achieves consensus separately. It is very possible that the consensus in partition A succeeds while in B it fails. If the system guarantees atomicity for transactions, then the whole transaction must fail, which requires coordination across the partitions --- usually via two-phase commit. Two-phase commit can result in significant availability outages (if the coordinator fails at the wrong time) unless it runs on top of consensus protocols. Thus Spanner and Spanner-derivative systems all run two-phase commit over partitioned consensus for multi-partition transactions. The latency cost of the Raft/Paxos protocol itself (once it gets started) is the same for unified and partitioned consensus, but two-phase commit requires two rounds of consensus to commit such transactions. A single round of consensus may take between 5ms and 200ms, depending on how geographically disperse the deployment is. Since Spanner requires two sequential rounds, the minimum transaction latency is double that --- between 10ms for single-region deployments to 400ms for geographically disperse deployments.

In practice, this two-phase commit also has an additional issue: a transaction cannot commit until all partitions vote to commit. A simple majority is not sufficient --- rather every partition must vote to commit. A single slow partition (for example, a partition undergoing leader election) stalls the entire transaction. This increases the observed long tail latency in proportion to transaction complexity.

In contrast, unified consensus systems such as Calvin and its derivatives such as FaunaDB do not require two-phase commit. [Side note: a discussion of how to avoid two-phase commit in a unified consensus system can be found in this VLDB paper. FaunaDB’s approach is slightly different, but the end result is the same: no two-phase commit.] As a result, unified consensus systems such as Calvin and FaunaDB only require one round of consensus to commit all transactions --- even transactions that access data on many different machines.

The bottom line is that the better latency option between unified or partitioned consensus for write transactions is somewhat workload dependent. Unified consensus increases latency for all transactions by a little, but partitioned consensus can increase latency by a lot more for multi-partition transitions. For most applications, it is impossible to avoid multi-partition transactions. For example, many applications allow transactional interactions between arbitrary users (payments between users, exchanges between users, “friendship” status updates between users, gaming interactions, etc.). Although it is possible to group users into partitions such that many of their interactions will be with other users within that partition (e.g. partition by a user’s location), as long as arbitrary interactions are allowed, there will always be interactions between users in different partitions. These multi-partition interactions are much slower in partitioned-consensus systems. Thus, for most workloads, unified-consensus is the better latency option for write transactions.

Latency for linearizable read-only transactions

Linearizable read-only transactions are generally sent to the consensus leader’s region and performed (or at least receive a timestamp) there [other options exist, but this is what Spanner and other systems mentioned in my previous post do]. In unified-consensus, there is only one leader region for the whole system. Linearizable read transactions that initiate from near this region will be processed with low latency, while transactions that initiate from farther away will observe correspondingly higher latency.

Meanwhile, in partitioned-consensus, there is one leader per partition, and these leaders can be located in different regions. The partitioned-consensus supporters argue that this can lead to lower latency in an array of common scenarios. An application developer can specify a location-based partitioning algorithm. All data that is usually accessed from region X should be placed in the same partition, with a consensus leader located in region X. All data that is usually accessed from region Y should be placed in the same partition, with a consensus leader located in region Y. In doing so, a larger number of read-only transactions will observe lower latency.

The downside of this approach is that it breaks the abstraction of the consensus protocol as a separate component of the system --- now the consensus protocol and data storage layer become much more intertwined, increasing the monolithicity of the system. Furthermore, consensus protocols run leader elections after a lease expires, and would have to reduce the democratic nature of this protocol in order to ensure the leader remains in the closest possible region. Finally, it increases complexity and reduces the flexibility of the partitioning protocol. As far as I know, the most well-known example of a partitioned-consensus system --- Spanner --- does not take advantage of this potential optimization for these reasons.

Consequently, although in theory, there is a potential latency advantage for partitioned-consensus systems for linearizable read-only transactions, in practice this advantage is not realized.

On the other hand, there is a fundamental latency advantage for unified-consensus systems in the presence of multi-partitioned transactions. A multi-partition transaction in a partitioned-consensus system must involve more than one leader. The leaders of each partition accessed by the read transaction must communicate with each other in order to figure out a timestamp at which this linearizable read can be safely performed (see sections 4.1.4 and 4.2.2 of the Spanner paper). Alternatively, a timestamp sufficiently into the future (at the end of the clock skew uncertainty bound) can be chosen at which to perform the read. Either way ---- whether it is communication across leaders (that may be located in different geographic regions) or whether it is waiting until the end of the clock skew uncertainty bound --- multi-partition reads pay an additional latency cost in order to ensure linearizability. In contrast, unified consensus systems have only a single leader region, and can perform linearizable reads across the servers in this region, without any communication with other regions or waiting for clock skew uncertainty windows to close.

Latency for serializable snapshot read-only transactions

Many applications --- even if they require linearizable write transactions --- do not require linearizable read-only transactions, and instead are content to read from a recent snapshot of the database state. However, such snapshot reads must be serializable --- they should reflect the database state as of a particular transaction in the serial order, and none of the writes from transactions after that transaction.

Recall that transactions may write data on different machines / partitions. For example, a transaction, T, may write data on partition A and partition B. A serializable snapshot that includes data from both partition A and partition B must therefore include both of T’s writes to those partitions, or neither. In particular, it should include both of T’s writes if the snapshot is as of a point in time “after” T, and neither of T’s writes if the snapshot is as of a point in time “before” T. Note that this notion of “point in time” must exist --- even across partitions.Therefore, once again, there needs to be a global notion of “before” and “after” --- even for writes across different machines. As long as this notion of “before” and “after” exists, such snapshot reads can be sent to any replica to be processed there, and do not require any consensus or interaction with consensus leader regions. This is critically important to support low latency reads in a geographically disperse deployment.

As mentioned in the introductory paragraph of this post, both unified- and partitioned-consensus systems are capable of generating global notions of “before” and “after”, and thus both types of systems are able to achieve the performance advantage of being able to perform these reads from any replica. However, as we mentioned above, unified-consensus systems can achieve this global notion of “before” and “after” without any clock synchronization, while partitioned-consensus systems use clock synchronization. Thus, unified-consensus can always achieve correct serializable snapshot reads, while partitioned-consensus can only achieve the same result if the maximum clock skew bound assumption is not violated.

Conclusion

The latency debate between unified vs. partitioned consensus is an intricate one. However, it is clear that multi-partition transactions exacerbate the disadvantages of partitioned-consensus transactions in (at least) three dimensions:

Multi-partition transactions require two-phase commit on top of the consensus protocol in partitioned-consensus systems. In many deployments, consensus across replicas is the latency bottleneck. By requiring two-phase commit on top of consensus, partitioned-consensus systems result in (approximately) double the latency (relative to unified-consensus) in such deployments, and higher long tail latency.

Partitioned-consensus systems require clock synchronization in order to achieve low latency snapshot reads (in addition to all linearizable operations). Any incorrect assumptions of the maximum clock skew across machines may result in serializability violations (and thus incorrect results being returned).

As we discussed above, it is usually impossible to avoid multi-partition transactions in most real-world applications. Furthermore, as an application scales, the number of partitions must increase, and thus the probability of multi-partition transactions is also likely to increase. Therefore, the disadvantages of partitioned-consensus systems relative to unified-consensus systems accentuate as the application scales.

Friday, September 21, 2018

The CAP theorem famously states that it is impossible to guarantee both consistency and availability in the event of a network partition. Since network partitions are always theoretically possible in a scalable, distributed system, the architects of modern scalable database systems fractured into two camps: those that prioritized availability (the NoSQL camp) and those that prioritized consistency (the NewSQL camp). For a while, the NoSQL camp was clearly the more dominant of the two --- in an “always-on” world, downtime is unacceptable, and developers were forced into handling the reduced consistency levels of scalable NoSQL systems. [Side note: NoSQL is a broad umbrella that contains many different systems with different features and innovations. When this post uses the term “NoSQL”, we are referring to the subset of the umbrella that is known for building scalable systems that prioritize availability over consistency, such as Cassandra, DynamoDB (default settings), Voldemort, CouchDB, Riak, and multi-region deployments of Azure CosmosDB.]

Over the past decade, application developers have discovered that it is extremely difficult to build bug-free applications over database systems that do not guarantee consistency. This has led to a surprising shift in momentum, with many of the more recently released systems claiming to guarantee consistency (and be CP from CAP). Included in this list of newer systems are: Spanner (and its Cloud Spanner counterpart), FaunaDB, CockroachDB, and YugaByte. In this post, we will look more deeply into the consistency claims of these four systems (along with similar systems) and note that while some do indeed guarantee consistency, way too many of them fail to completely guarantee consistency. We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.

What is consistency anyway?

Consistency, also known as “atomic consistency” or “linearizability”, guarantees that once a write completes, all future reads will reflect that value of the write. For example, let’s say that we have a variable called X, whose value is currently 4. If we run the following code:

X = 10;Y = X + 8;

In a consistent system, there is only one possible value for Y after running this code (assuming the second statement is run after the first statement completes): 18. Everybody who has completed an “Introduction to Programming” course understands how this works, and relies on this guarantee when writing code.

In a system that does not guarantee consistency, the value of Y after running this code is also probably 18. But there’s a chance it might be 12 (since the original value of X was 4). Even if the system returns an explicit message: “I have completed the X = 10 statement”, it is nonetheless still a possibility that the subsequent read of X will reflect the old value (4) and Y will end up as 12. Consequently, the application developer has to be aware of the non-zero possibility that Y is not 18, and must deal with all possible values of Y in subsequent code. This is MUCH more complicated, and beyond the intellectual capabilities of a non-trivial subset of application developers.

[Side note: Another name for "consistency" is "strong consistency". This alternate name was coined in order to distinguish the full consistency guarantee from weaker consistency levels that also use the word "consistency" in their name (despite not providing the complete consistency guarantee). Indeed, some of these weaker consistency levels, such as "causal consistency", "session consistency", and "bounded staleness consistency" provide useful guarantees that somewhat reduce complexity for application developers. Nonetheless, the best way to avoid the existence of corner case bugs in an application is to build it on top of a system that guarantees complete, strong consistency.]

Why give up on consistency?

Consistency is a basic staple, a guarantee that is extremely hard to live without. So why do most NoSQL systems fail to guarantee consistency? They blame the CAP theorem. (For example,the Amazon Dynamo paper, which inspired many widely used NoSQL systems, such as Cassandra, DynamoDB, and Riak, mention the availability vs. consistency tradeoff in the first paragraph of the section that discussed their “Design Considerations”, which lead to their famous “eventually consistent” architecture.) It is very hard, but not impossible, to build applications over systems that do not guarantee consistency. But the CAP theorem says that it is impossiblefor a system that guarantees consistency to guarantee 100% availability in the presence of a network partition. So if you can only choose one, it makes sense to choose availability. As we said above, once the system fails to guarantee consistency, developing applications on top of it without ugly corner case bugs is extremely challenging, and generally requires highly-skilled application developers that are able to handle the intellectual rigors of such development environments. Nonetheless, such skilled developers do exist, and this is the only way to avoid the impossibility proof from the CAP theorem of 100% availability.

The reasoning of the previous paragraph, although perhaps well-thought out and convincing, is fundamentally flawed. The CAP theorem lives in a theoretical world where there is such a thing as 100% availability. In the real world, there is no such thing as 100% availability. Highly available systems are defined in terms of ‘9s’. Are you 99.9% available? Or 99.99% available? The more 9s, the better. Availability is fundamentally a pursuit in imperfection. No system can guarantee availability.

This fact has significant ramifications when considering the availability vs. consistency tradeoff that was purported by the CAP theorem. It is not the case that if we guarantee consistency, we have to give up the guarantee of availability. We never had a guarantee of availability in the first place! Rather, guaranteeing consistency causes a reduction to our already imperfect availability.

Therefore: the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability. Consequently, the real world impact of guaranteeing consistency on availability is often negligible. It is very possible to have a system that guarantees consistency and achieves high availability at the same time.

The glorious return of consistent NewSQL systems

The argument above actually results in 3 distinct reasons for modern systems to be CP from CAP, instead of AP (i.e. choose consistency over availability):

(1) Systems that fail to guarantee consistency result in complex, expensive, and often buggy application code.

(2) The reduction of availability that is caused by the guarantee of consistency is minute, and hardly noticeable for many deployments.

(3) The CAP theorem is fundamentally asymmetrical. CP systems can guarantee consistency. AP systems do not guarantee availability (no system can guarantee 100% availability). Thus only one side of the CAP theorem opens the door for any useful guarantees.

I believe that the above three points is what has caused the amazing renaissance of distributed, transactional database systems --- many of which have become commercially available in the past few years --- that choose to be CP from CAP instead of AP. There is still certainly a place for AP systems, and their associated NoSQL implementations. But for most developers, building on top of a CP system is a safer bet.

However, when I say that CP systems are the safer bet, I intend to refer to CP systems that actuallyguarantee consistency. Unfortunately, way too many of these modern NewSQL systems fail to guarantee consistency, despite their claims to the contrary. And once the guarantee is removed, the corner case bugs, complexity, and costs return.

Spanner is the source of the problem

I have discussed in previous posts that there are many ways to guarantee consistency in distributed systems. The most popular mechanism, which guarantees consistency at minimal cost to availability, is to use the Paxos or Raft consensus protocols to enforce consistency across multiple replicas of the data. At a simplified level, these protocols work via a majority voting mechanism. Any change to the data requires a majority of replicas to agree to the change. This allows the minority of replicas to be down or unavailable and the system can nonetheless continue to read or write data.

Most NewSQL systems use consensus protocols to enforce consistency. However, they differ in a significant way in how they use these protocols. I divide NewSQL systems into two categories along this dimension: The first category, as embodied in systems such as Calvin (which came out of my research group) and FaunaDB, uses a single, global consensus protocol per database. Every transaction participates in the same global protocol. The second category, as embodied in systems such as Spanner, CockroachDB, and YugaByte, partitions the data into ‘shards’, and applies a separate consensus protocol per shard.

The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total amount of transactions per second that the system can handle.

Calvin and FaunaDB get around this downside via batching. Rather than voting on each transaction individually, they vote on batches of transactions. Each server batches all transactions that it receives over a fixed time period (e.g., 10 ms), and then initiates a vote on that entire batch at once. With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads [Update: there has been some discussion about these numbers from my readers. The number for NASDAQ might be closer to 100,000 orders per second. I have not seen anybody dispute the 10,000 orders per second number from Amazon.com, but readers have pointed out that they issue more than 10,000 writes to the database per second. However, this blog post is focused on strictly serializable transactions rather than individual write operations. For Calvin's 500,000 transactions per second number, each transaction included many write operations.]

The main downside of the second category is that by localizing consensus on a per-shard basis, it becomes nontrivial to guarantee consistency in the presence of transactions that touch data in multiple shards. The quintessential example is the case of someone performing a sequence of two actions on a photo-sharing application (1) Removing her parents from having permission to see her photos (2) Posting her photos from spring break. Even though there was a clear sequence of these actions from the vantage point of the user, if the permissions data and the photo data are located in separate shards, and the shards perform consensus separately, there is a risk that the parents will nonetheless be able to see the user’s recently uploaded photos.

Spanner famously got around this downside with their TrueTime API. All transactions receive a timestamp which is based on the actual (wall-clock) current time. This enables there to be a concept of “before” and “after” for two different transactions, even those that are processed by completely disjoint set of servers. The transaction with a lower timestamp is “before” the transaction with a higher timestamp. Obviously, there may be a small amount of skew across the clocks of the different servers. Therefore, Spanner utilizes the concept of an “uncertainty” window which is based on the maximum possible time skew across the clocks on the servers in the system. After completing their writes, transactions wait until after this uncertainty window has passed before they allow any client to see the data that they wrote.

Spanner thus faces a potentially uncomfortable tradeoff. It is desirable that the uncertainty window should be as small as possible, since as it gets larger, the latency of transactions increases, and the overall concurrency of the system decreases. On the other hand, it needs to 100% sure that clock skew never gets larger than the uncertainty window (since otherwise the guarantee of consistency would no longer exist), and thus larger windows are safer than smaller ones.

Spanner handles this tradeoff with a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers. This solution allows the system to keep the uncertainty window relatively narrow while at the same time keeping the probability of incorrect uncertainty window estimates (and corresponding consistency violations) to be extremely small. Indeed, the probability is so small that Spanner’s architects feel comfortable claiming that Spanner “guarantees” consistency.

[It is worth noting at this point that systems that use global consensus avoid this problem entirely. If every transaction goes through the same protocol, then a natural order of all transactions emerges --- the order is simply the order in which transactions were voted on during the protocol. When batches are used instead of transactions, it is the batches that are ordered during the protocol, and transactions are globally ordered by combining their batch identifier with their sequence number within the batch. There is no need for clock time to be used in order to create a notion of before or after. Instead, the consensus protocol itself can be used to elegantly create a global order.]

Spanner Derivatives

Spanner is a beautiful and innovative system. It was also invented by Google and widely used there. Either because of the former or latter (or both), it has been extremely influential, and many systems (e.g., CockroachDB and YugaByte) have been inspired by the architectural decisions by Spanner. Unfortunately, these derivative systems are software-only, which implies that they have inherited only the software innovations without the hardware and infrastructure upon which Spanner relies at Google. In light of Spanner’s decision to have separate consensus protocols per shard, software-only derivatives are extremely dangerous. Like Spanner, these systems rely on real-world time in order to enforce consistency --- CockroachDB on HLC (hybrid logical clocks) and YugaByte on Hybrid Time. Like Spanner, these systems rely on knowing the maximum clock skew across servers in order to avoid consistency violations. But unlike Spanner, these systems lack hardware and infrastructure support for minimizing and measuring clock skew uncertainty.

CockroachDB, to its credit, hasacknowledged that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability).

YugaByte, however, continues to claim a guarantee of consistency [Edit for clarification: YugaByte only makes this claim for single key operations; however, YugaByte also relies on time synchronization for reading correct snapshots for transactions running under snapshot isolation.]. I would advise people to be wary of these claims which are based on assumptions of maximum clock skew. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. This can happen under a variety of scenarios such as when a VM that is running YugaByte freezes or migrates to a different machine. Even without sudden jumps, YugaByte’s free edition relies on the user to set the assumptions about maximum clock skew. Any mistaken assumptions on behalf of the user can result in consistency violations.

In contrast to CockroachDB and YugaByte, FaunaDB was inspired by Calvin instead of Spanner. [Historical note: the Calvin and Spanner papers were both published in 2012]. FaunaDB therefore has a single, elegant, global consensus protocol, and needs no small print regarding clock skew assumptions. Consequently, FaunaDB is able to guarantee consistency of transactions that modify any data in the database without concern for the corner case violations that can plague software-only derivatives of Spanner-style systems.

There are other differences between Calvin-style systems and Spanner-style systems that I’ve talked about in the past. In this post we focused on perhaps the most consequential difference: global consensus vs. partitioned consensus. As with any architectural decision, there are tradeoffs between these two options. For the vast majority of applications, exceeding 500,000 transactions a second is beyond their wildest dreams. If so, then the decision is clear. Global consensus is probably the better choice.

Tuesday, March 27, 2018

In my previous blog post, I discussed the relatively new Apache Arrow project, and compared it with two similar column-oriented storage formats in ORC and Parquet. In particular, I explained how storage formats targeted for main memory have fundamental differences from storage formats targeted for disk-resident data. There was a surprising amount of activity surrounding the post --- the post received 28,000 visits (making it the 7th most popular post on my blog all time), and 86 comments in a HackerNews thread discussing the post. Given this clear interest of my readers in Apache Arrow, I would like to take a deeper look into the project in this post, and present my analysis of both some specific decisions that were made regarding the format itself, and also my personal experience with installing and running experiments on the code base.

A quick caveat before we begin: Many of the comments on the HackerNews thread revolved around a back-and-forth between fans and contributors to the Apache Arrow project who went ballistic when they read my title with a sarcastic tone (the title was: “Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?”) and more thoughtful and thorough readers who tried to calm them down and explain that the entire post was there to explain precisely why it makes sense to have Arrow as a separate project. However, one common point that was brought up by the pro-Arrow crowd was that my post was narrow in the sense that I only looked at Arrow from the perspective of using it as a storage format in the context of database and data analytics engines, whereas Arrow, as a general standard for representing data in main memory could also be used outside of this space. I should have been clearer about the scope of my analysis in that post, so this time around I want to be more clear: the scope of my analysis in this post is solely from the perspective of using Apache Arrow as a storage format in the context of database and data analytics engines and tools. I limit the scope to this context for two reasons: (1) I predict that the majority of Arrow’s use cases will be in that context (where I define data analytics tools broadly enough to include projects like Pandas) (2) As someone who has spent his entire career as a database system researcher, this is the only context in which I am qualified to present my opinion.

What exactly is Apache Arrow?

Arrow’s homepage self-describes in the following way: “Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.”

In other words, the creators of Arrow envision the project having impact in three ways: (1) as a development platform, (2) as a columnar memory format standard, and (3) as a set of useful libraries.

In practice, the majority of the code in the Github repository at the time of my interactions with the codebase was for constructing, accessing, and testing data structures using the Arrow standard. So my analysis in this article will just focus on the Arrow standard and the set of code that is provided to help in the implementation of this standard.

Is it even possible to have everybody agree on a data representation standard?

For decades, it was impossible to fathom that there could be a standard representation for data in a database system. Database systems have historically been horribly monolithic and complex pieces of software. The many components of the system --- the storage layer, the transaction manager, the access manager, the recovery manager, the optimizer, etc. --- were significantly intertwined, and designed assuming particular architectural choices for the other components. Therefore, a “page” of data on storage was not a simple block of data, but also contained information critical to the recovery manager (e.g. the identifier of the log record that most recently wrote to this page), the transaction manager (e.g. timestamps required by multi-version concurrency control schemes), access manager, and so on. Each unique database system had different concurrency control schemes, different logging structures, and different index implementations; therefore a page of data in one system looked vastly different than a page of data in other system.

Therefore, if you wanted to move data from one system to another one, you would have to “export” the data, which involved rewriting the data from the complex page format stored inside the system to a simpler representation of the data. This simple representation would be passed to the other system which would then rewrite the simple representation into its own proprietary standard. This process of rewriting the data before export is called “serialization” and rewriting it back before import is called “deserialization”. Serialization and deserialization costs when transferring data between systems have historically been necessary and unavoidable.

Over the past few decades, the database system world has changed significantly. First, people stopped believing that one size fits all for database systems, and different systems started being used for different workloads. Most notably, systems that specialized in data analysis separated from systems that specialized in transactional processing. Systems that specialized in data analysis tend to either be read-only or read-mostly systems, and therefore generally have far simpler concurrency control and recovery logic. Second, as the price of memory has rapidly declined, a larger percentage of database applications fit entirely in main memory. This also resulted in simpler recovery and buffer manager logic, which further simplified data representation. Finally, as open source database systems started to proliferate, a greater emphasis was placed on modular design and clean interfaces between the components of the system, in order to accommodate the typical distributed development of open source projects.

All of this has lead to much simpler data representations in main memory and analytical database engines, especially those in the open source sphere. Fixed width data types are often just represented in arrays, and variable-width data types in only slightly more complicated data structures. All of a sudden, the prospect of standardizing the data representation across main memory analytical database systems has become a realistic goal, thereby enabling the transfer of data between systems without having to pay serialization and deserialization costs.

This is exactly the goal of Apache Arrow. Arrow is, in its essence, a data representation specification --- a standard that can be implemented by any engine that processes data in main memory. Engines that use this standard internally can avoid any kind of serialization and deserialization costs when moving data between each other, which several other blog posts (e.g. here and here) have shown to result in significant performance gains. 13 major open source projects, including Pandas, Spark, Hadoop and Dremio have already embraced the standard, which I believe is enough critical mass for the Arrow standard to become ubiquitous in the data analytics industry. Even if existing systems do not adopt the standard for their own internal data representation, I expect they will at least support data exports in Arrow. This increases the motivation for any new main memory analytics engine being designed to adopt it.

While ubiquity is usually a good indicator of quality, there are plenty of languages, APIs, and pieces of software that become ubiquitous for reasons entirely unrelated to their quality. For example, the SQL interface to database systems took off due to the business dominance of the systems that used SQL, even though there were arguably better interfaces to database systems that had been proposed prior to SQL’s take-over. Furthermore, even high quality things are often optimized for certain scenarios, and yield suboptimal performance in scenarios outside of the intended sweet spot. Therefore, I took a deeper look at Apache Arrow without any preconceived biases. Below, I present my analysis and experience with playing with the code of the project, and discuss some of the design decisions of Arrow, and the tradeoffs associated with those decisions.

Columnar

It would be easy for someone who sees Arrow’s self-description of being “columnar” to mistakenly assume that Arrow’s scope is limited to two dimensional data structures that have rows and columns, and that by being “columnar”, it is possible to derive that Arrow stores data column-by-column instead of row-by-row. In fact, Arrow is actually a more general standard --- including specification for one-dimensional data structures such as arrays and lists, and also data structures with more than two dimensions through its support for nesting. Nonetheless, we have all become accustomed to interacting with data through relational database systems and spreadsheets, both of which store data in two dimensional objects, where each row corresponds to an entity, and each column an attribute about that entity. By being columnar, Apache Arrow stores such two dimensional objects attribute by attribute instead of entity by entity. In other words, the first attribute of each entity is stored contiguously, then the second attribute of every entity, and so on.

This columnar representation means that if you want to extract all attributes for a particular entity, the system must jump around memory, finding the data for that entity from each of the separately-stored attributes. This results in a random memory access pattern which results in slower access times than sequential access patterns (my previous post discusses this point in more detail). Therefore, Arrow is not ideal for workloads that tend to read and write a small number of entire entities at once, such as OLTP (transactional) workloads. On the other hand, data analytics workloads tend to focus on only a subset of the attributes at once; scanning through large quantities of entities to aggregate values of these attributes. Therefore, storing data in a columnar fashion results in sequential, high performance access patterns for these workloads.

Storing data column-by-column instead of row-by-row has other advantages for analytical workloads as well --- for example it enables SIMD acceleration and potentially increases compression rates (my previous post goes into more detail on these subjects). The bottom line is that by choosing a columnar representation for two-dimensional data structures, Arrow is clearly positioning itself for adoption by data analytics workloads that do not access individual data items, but rather access a subset of the attributes (columns) from many data items. Indeed, many of the open source projects that have been built natively on Arrow, such as Dremio and NVIDIA’s Open GPU Accelerated Analytics (GOAI), are focused on analytics.

Fixed-width data types

Note that attributes of entities tend to have a uniform data type. Therefore, by choosing a columnar data representation, Arrow can store columns of two dimensional tables in an identical way to how it stores data in one dimension of uniform type. In particular, fixed-width data types such as integers and floats can simply be stored in arrays. Arrow is little endian by default and pads arrays to 64-byte boundaries. Aside from the extra padding, Arrow arrays store data in memory in an equivalent fashion to arrays in C, except that Arrow arrays have three extra pieces of metadata that are not present in C arrays: (1) the length of the array, (2) the number of null elements in the array, and (3) a bitmap indicating which elements of the array are null. One interesting design decision made by Arrow is that null elements of the array take up an identical amount of space as non-null elements --- the only way to know if an element is null is by checking to see if there is a 1-bit in the associated bit for that element in null-bitmap that is part of the metadata for the array. The alternative design would have been to not waste storage at all on the null elements, and instead derive the location of null elements by inspection of the null bitmap. The tradeoff here is storage space vs random access performance. By expending space on null elements, the nth element of the array can be quickly located by simply multiplying n by the fixed-width size of each element in the array. However, if the null elements are removed from the array (and their location derived from the null bitmap), the amount of space needed to store the array will be smaller, but additional calculations and bit counting must occur before finding the value for an element in the array at a particular index. On the other hand, sequential scans of the entire array may be faster if the system is bottlenecked by memory bandwidth, since the array is smaller.

Since Arrow’s design decision was made to optimize for random array element access, I ran a simple benchmark where I created an array of size 100,000,000 32-bit integers, put random values in each element of the array, and then searched for the value at 50,000 different locations in the array. I first tried this experiment in a regular C array that allowed null elements to take up an identical amount of space as non-null elements (similar to Arrow). I then tried a different C array where nulls take up no space, and a high performance index is used to speed up random access of the array . I then installed Apache Arrow, built the same array using the Int32Builder in the Arrow C++ API and accessed it through the Arrow Int32Array API. I ran these experiments on an EC2 t2.medium instance. The results are shown below:

As expected, the version of the C array where nulls take up no space was much slower than the other options for this random access workload. Even though we used a high performance index, direct address offset calculations are always faster. Accessing data through the API that comes with the Arrow codebase was slightly slower than accessing data from an array directly. However, this is not because of the Arrow data format itself. When, after building the Arrow Array, instead of accessing the array through the Arrow API, I instead accessed a pointer to the raw data in the array, cast it as a const int*, and then proceeded to access this raw data directly in C, I saw equivalent performance to a normal C array. This cause of the slowdown from accessing data through the Arrow API is presumably from the C++ compiler failing to inline all of the extra function calls (despite the -O3 flag). I therefore conclude that for applications the are extremely performance sensitive, it is better to access raw data created in accordance to the Arrow specification than to use the API to access the data. But for most cases, using the API will be sufficient. As far as the decision to allow nulls to take up space, that was certainly a win for this random-access workload. But for a workload that scans through the entire dataset, it would have been up to 10% faster for the C array in which nulls take up no space, since in our experiment, 10% of all the values were null, and thus that version of the C array was 10% smaller than for the Arrow-specified arrays.

Variable-width data types

For variable width data types, Arrow stores the data for each element contiguously in memory without any separator bytes between the elements. In order to determine where one element ends and the next one begins, Arrow stores the byte offset of the first byte of each element of the array inside an integer array next to the raw data (there is an example in the next section below). In order to access a random element of the variable-width array, the integer array is accessed to find out the starting position of this and the next element in the raw data (the difference between these positions is the length of the element), and then the raw data is accessed.

The decision not to include separator bytes in the raw data between the elements makes the solution more general --- you don’t have to reserve special byte values for these separators. However, it slows down certain types of sequential access patterns. For example, I ran an experiment where I created an array of 12,500,000 variable-sized strings (average of 8 characters per string) using the StringBuilder API, and searched for a substring of size two characters within all elements of the array (extracting the index of all elements that contain the substring). I measured how long this query took when accessing the array both through the Arrow StringArray C++ API, and also over the raw Arrow data directly. Thirdly, I measured how long the same query took over a string array that included a separator byte between each element. The results are shown below:

In this case, the best performance was the array that was not created according to the Arrow specification. The reason for this is that the raw data could not be searched directly for the two-byte substring in the dataset created according to the Arrow specification, because the companion integer array containing the list of element boundaries needed to be repeatedly consulted to ensure that substring matches did not span multiple elements. However, when seperator bytes were located inside the array itself, no secondary array needed to be scanned.

It should be noted that string separators only accelerate certain types of queries, and I purposely chose one such query for this example. For queries that they do not accelerate, they tend to have to opposite effect, decreasing performance by bloating the size of the array. Furthermore, it should be reiterated at this point that reserving byte values for string separators would have prevented any application that do not reserve the same byte values from using Arrow, thereby limiting the scope of Arrow’s utility. In addition, many other queries can actually benefit from having the companion integer array. For example, an equality comparison (name == "johndoe") can utilize the integer array to ignore any value that has a different length. It should also be noted that any application that wishes to have string separators can simply add them to their strings directly, before creating the array using the StringBuilder API. So this experiment does not show a fundamental weakness of the Arrow standard --- it just indicates that in some cases you may have to add to the raw data in order to get optimal performance.

Nested Data

As self-describing data formats such as JSON become more popular, users are increasingly dropping the two-dimensional restrictions of relational tables and spreadsheets, and instead using nested models for their data. Arrow elegantly deals with nested data without requiring conceptual additions to the basic layout principles described above, where raw data is stored contiguously and offset arrays are used to quickly find particular data elements. For example, in a data set describing classes at the University of Maryland, I may want to nest the list of students in each class. For example, the data set:

Note that the nested student attribute required two different offset lists: (1) Students are variable length and thus we need one offset list to specify where one student ends and the next one begins, just as for any variable length attribute; and (2) We need second offset list to indicate how many students exist per class. The "Nested student list offsets" accomplish this second goal --- it indicates that the first class has (3-0) students, and the second class has (5-3) students, etc.

Arrow currently allows list, struct, and various types of union type data to be nested in an attribute value.

Conclusion

It is important to separate out the specification of a standard from the tools and libraries that are provided in the current codebase that help developers with implementing this standard. As long as you are performing in-memory analytics where your workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard. The time is right for database systems architects to agree on and adhere to a main memory data representation standard. The proposed Arrow standard fits the bill, and I would encourage designers of main memory data analytics systems to adopt the standard by default unless they can find a compelling reason that representing their data in a different way will result in a significantly different performance profile (for example, Arrow’s attribute-contiguous memory layout is not ideal if your workloads are typically accessing multiple attributes from only a single entity, as is common in OLTP workloads). I also found the tools available in the codebase to read and write data using this standard to be easy to use and quick to get started with. However, I did find that at times, the code was slightly slower than the raw (and less general) implementation of the standard I wrote myself. Nonetheless, the existing codebase is good enough for most use cases and will likely help to further the acceleration of the adoption of the standard. Furthermore, additional performance enhancements to the codebase appear to be on their way, such as optimized LLVM-based processing modules.

Daniel Abadi

About Me

Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park, doing research primarily in database system
architecture and implementation. He received a Ph.D. from MIT and a M.Phil. from Cambridge. He is best known for his research in column-store database systems (the
C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, which was commercialized by VoltDB and the Calvin project which inspired FaunaDB),
and Hadoop (the HadoopDB project, which was commercialized by Hadapt). Abadi has been a recipient of a Churchill
Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD
Jim Gray Doctoral Dissertation Award, a VLDB best paper award, a VLDB 10 year best paper award, the 2013-2014 Yale Provost's Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He blogs at http://dbmsmusings.blogspot.com and
tweets at http://twitter.com/#!/daniel_abadi.