Google’s new paper is about a distributed relational database called Spanner, a follow-up to a presentation from earlier in the year about a new database for AdWords called F1. If you recall, that presentation revealed Google’s migration of AdWords from MySQL to a new database that supported SQL and hierarchical schemas — two ideas that buck the industry’s trend away from relational databases.

Meet Spanner

This new database, Spanner, is unlike anything we’ve seen: a database that embraces ACID, SQL, and transactions, and that can be distributed across thousands of nodes spanning multiple data centers in multiple regions. The paper dwells on two main features that define this database:

Schematized Semi-relational Tables — A hierarchical approach to grouping tables that allows Spanner to co-locate related data into directories that can be easily stored, replicated, locked, and managed on what Google calls spanservers. Spanner has a modified SQL syntax that allows the data to be interleaved, and the paper mentions some changes to support columns encoded with Protocol Buffers.

“Reification of Clock Uncertainty” — This is the real emphasis of the paper. The missing link in relational database scalability was a strong emphasis on coordination backed by a serious attempt to minimize time uncertainty. In Google’s new global-scale database, the variable that matters is epsilon — time uncertainty. Google has achieved very low overhead for read-write (RW) transactions (in this paper, Spanner introduces about 14 ms of overhead for data centers at 1 ms network distance) by creating a system that facilitates distributed transactions bound only by network distance (measured in milliseconds) and time uncertainty (epsilon).

Correction (10/5/12): Peter Norton points out the obvious: 14 ms coast-to-coast (US) is impossible, since light over glass takes at least 40 ms to cross North America. Google ran these tests on networks at 1 ms network distance. The previous paragraph has been altered to reflect that 14 ms is the overhead introduced by Spanner at this distance.

A Spanner deployment consists of a few management servers that manage multiple “zones” across data centers. A “zonemaster” and a series of “location proxies” manage hundreds or thousands of “spanservers” that perform the bulk of the work in the Spanner database. Spanservers house units of data called “directories,” and each spanserver implements a Paxos state machine atop a storage abstraction called a tablet. Tablets store data in B-tree-like structures as a mapping from a composite key and a timestamp to a value.
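The paper describes a tablet as a bag of mappings of the form (key, timestamp) → value. As a rough illustration (the class and method names here are my own, not Spanner’s), a minimal versioned map might look like this:

```python
class Tablet:
    """Toy sketch of a Spanner-style tablet: a mapping from
    (key, timestamp) to value, readable as of a chosen timestamp."""

    def __init__(self):
        self._rows = {}  # key -> list of (timestamp, value), kept sorted

    def write(self, key, timestamp, value):
        versions = self._rows.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort()  # keep versions ordered by timestamp

    def read(self, key, timestamp):
        """Return the latest value written at or before `timestamp`."""
        result = None
        for ts, value in self._rows.get(key, []):
            if ts <= timestamp:
                result = value
            else:
                break
        return result


t = Tablet()
t.write("users/1", 10, "alice")
t.write("users/1", 20, "alice@example.com")
t.read("users/1", 15)  # -> "alice" (latest version at or before ts 15)
t.read("users/1", 25)  # -> "alice@example.com"
```

Keeping every timestamped version around is what makes Spanner’s lock-free snapshot reads possible: a reader at timestamp 15 sees a consistent past state without blocking writers.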

“Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.”

In other words, Paxos is about reaching consensus under potentially sketchy circumstances. This is very important to Spanner because spanserver nodes need to agree on a “leader” for each transaction — Paxos provides a mechanism to ensure consensus about which node is running a particular transaction.
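To get a feel for what Paxos buys you, here is a toy single-decree Paxos sketch (my own simplification for illustration — Spanner actually runs long-lived Paxos groups with leader leases, not one-shot elections):

```python
class Acceptor:
    """Toy Paxos acceptor: remembers the highest proposal number it has
    promised and the highest-numbered proposal it has accepted."""

    def __init__(self):
        self.promised = 0     # highest proposal number promised
        self.accepted = None  # (proposal_number, value) or None

    def prepare(self, n):
        # Promise to ignore proposals numbered below n
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """One round of single-decree Paxos: returns the chosen value,
    or None if this proposal failed to reach a majority."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the
    # highest-numbered one -- this is what makes consensus stick.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior)[1]

    # Phase 2: ask acceptors to accept the value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
winner = propose(acceptors, 1, "node-A")  # "node-A" is chosen as leader
rival = propose(acceptors, 2, "node-B")   # a late rival still learns "node-A"
```

The key property for leader election: once a majority has accepted “node-A,” any later proposal (even with a higher number) is forced to re-propose the same value, so the cluster can never split on who the leader is.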

Think of Spanner as a database whose data is distributed among thousands (or tens of thousands) of these spanservers. Zonemasters and location proxies keep track of where data lives and direct clients to the right spanservers, and the spanservers use Paxos and other protocols to coordinate and manage read-write transactions among themselves. All of this is made possible because Spanner’s TrueTime API allows each node participating in a transaction to minimize time uncertainty.

Clocks galore: Armageddon masters and GPS clocks

To implement a continent-wide relational database with support for distributed two-phase commits that complete in a reasonable amount of time (14.1 ms), Google had to find a way to master time. In Spanner there are snapshot reads, which don’t need to read at the latest timestamp; read-only transactions, which need a stronger guarantee that they are reading the latest version; and read-write transactions. Read-write transactions that span multiple spanservers in different data centers are the real prize. Read the paper and you can see what it takes for a group of these spanservers to coordinate. Here’s a segment that describes what happens in a read-write transaction:

“[The coordinator leader] first acquires write locks, but skips the prepare phase. It chooses a timestamp for the entire transaction after hearing from all other participant leaders. The commit timestamp s must be greater or equal to all prepare timestamps” … “and greater than any timestamps the leader has assigned to previous transactions (again, to preserve monotonicity). The coordinator leader then logs a commit record through Paxos (or an abort if it timed out while waiting on the other participants).”
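The timestamp rule in that excerpt can be sketched with toy integer timestamps (the function name and arguments are mine, and real Spanner timestamps come from TrueTime intervals, not plain integers):

```python
def choose_commit_ts(prepare_timestamps, prev_assigned_ts, tt_now_latest):
    """Pick a commit timestamp s such that:
      - s >= every participant's prepare timestamp,
      - s > any timestamp this leader previously assigned (monotonicity),
      - s > TT.now().latest when the commit request arrived.
    """
    return max(max(prepare_timestamps), prev_assigned_ts + 1, tt_now_latest + 1)


# Example: participants prepared at 5, 7, and 6; this leader last
# assigned timestamp 4; TT.now().latest was 3 at commit time.
s = choose_commit_ts([5, 7, 6], prev_assigned_ts=4, tt_now_latest=3)  # s == 7
```

Because the commit timestamp dominates every prepare timestamp and every earlier assignment, readers at any snapshot timestamp see transactions in an order consistent with when they actually committed.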

In essence, RW transactions are possible because low time uncertainty reduces the amount of time that these independent “leaders” in a transaction need to wait before concluding that consensus has been reached. There’s a built-in latency of 2 * epsilon, and a good deal of the paper focuses on how Google reduces time uncertainty. Time uncertainty has become the metric to measure for Spanner.
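That 2 * epsilon latency is the paper’s “commit wait”: after choosing a commit timestamp s, a leader must wait until TrueTime guarantees s is in the past before releasing locks. A toy sketch (the class and the EPSILON constant are my own simplifications; real TrueTime is backed by GPS and atomic clocks, as the next excerpt describes):

```python
import time

EPSILON = 0.007  # assumed clock uncertainty in seconds (a few ms, for illustration)


class TrueTime:
    """Toy TrueTime: now() returns an interval [earliest, latest] that is
    guaranteed to contain the true absolute time."""

    def now(self):
        t = time.time()
        return (t - EPSILON, t + EPSILON)

    def after(self, s):
        # True only once timestamp s is definitely in the past
        return self.now()[0] > s


def commit_wait(tt, s):
    """Block until TrueTime guarantees s has passed. When s was picked as
    TT.now().latest, this waits roughly 2 * EPSILON."""
    while not tt.after(s):
        time.sleep(0.001)


tt = TrueTime()
s = tt.now()[1]                # commit timestamp: latest possible "now"
start = time.time()
commit_wait(tt, s)
waited = time.time() - start   # roughly 2 * EPSILON
```

This is why epsilon is the metric that matters: the wait shrinks linearly as clock uncertainty shrinks, so better clocks directly mean faster transactions.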

“TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with dedicated antennas; these masters are separated physically to reduce the effects of antenna failures, radio interference, and spoofing. The remaining masters (which we refer to as Armageddon masters) are equipped with atomic clocks. An atomic clock is not that expensive: the cost of an Armageddon master is of the same order as that of a GPS master”

The evolution of persistence at Google

In the beginning there was BigTable … BigTable was the inspiration for Cassandra, HBase, and a number of other early offerings in the NoSQL space. BigTable had a simple data model consisting of rows with columns and column families. All operations on a row were atomic, and BigTable was the perfect match for some of the huge data problems that Google had to solve. Google Earth, Google Analytics, Personalized Search: all of these applications had to juggle petabytes of data using something a step up from the filesystem that would allow applications to work with data at scale. BigTable was all about “row mutations” and “scanners,” and if you read between the lines of the BigTable paper in 2006, not everyone was a fan of BigTable’s access pattern:

“Given the unusual interface to BigTable, an interesting question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the BigTable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use BigTable demonstrates that our design works well in practice.” [Emphasis added.]

My translation: “We think BigTable works for everything, but the AdWords group isn’t convinced, so they’re still using MySQL.”

Enter Megastore

A few years after BigTable, Megastore was created by what seems to be a completely separate team. Megastore was Google’s internal reaction to BigTable: a sort of middle ground between SQL and NoSQL, built atop BigTable. Here’s the rationale behind Megastore from the Megastore paper (my emphasis included):

“NoSQL datastores such as Google’s BigTable, Apache Hadoop’s HBase, or Facebook’s Cassandra are highly scalable, but their limited API and loose consistency models complicate application development. Replicating data across distant datacenters while providing low latency is challenging, as is guaranteeing a consistent view of replicated data, especially during faults.”

Reading between the lines here, this strikes me as: “Yes, BigTable scales, but it’s difficult to work with and we’ve had some annoying downtime because replication is a challenge.” Megastore, like Spanner after it, made use of Paxos and also had a similar hierarchical data model (although not exactly the same). From the Megastore paper:

“Megastore tables are either entity group root tables or child tables. Each child table must declare a single distinguished foreign key referencing a root table … Thus each child entity references a particular entity in its root table (called the root entity). An entity group consists of a root entity along with all entities in child tables that reference it.”

There’s another excerpt from the Megastore paper that foreshadows the weakness of this approach. While transactions were supported, they were discouraged. Here’s the excerpt:

“Megastore supports two-phase commit for atomic updates across entity groups. Since these transactions have much higher latency and increase the risk of contention, we generally discourage applications from using the feature in favor of queues. Nevertheless, they can be useful in simplifying application code for unique secondary key enforcement.”

This quote stands out knowing what we know now about the motivation to create Spanner. While Megastore provided transactions, it appears that using them created massive latency, inefficiency, and contention. In other words, and I’m reading between the lines again: “Megastore supports transactions, but don’t use them; they will ruin performance.” It should also be noted that some of the applications using Megastore are the same applications that experienced widespread downtime a few years ago — Gmail among them.

If you remember Google’s public explanation for the day-long problems with Google Mail, it was the combination of the loss of a single data center and problems with replication. My guess is that those very public problems from years ago triggered an investment in the development of Spanner. There was an incentive to find a solution that could ensure consistency among data centers without having to worry about a separate replication process. There was also continued pressure to get AdWords off of the franken-MySQL deployment that is referenced in the Spanner paper.

Hey, need some continent-wide ACID? Here’s Spanner.

If I’m reading the author list correctly, the Spanner paper appears to be a merger of two separate teams: the BigTable authors and the Megastore authors collaborated to create what is likely the successor to BigTable. Spanner takes some ideas from Megastore, building upon the hierarchical schema with root tables (in Spanner, “directories”), but it also redefines the low-level approach to storage. Where Megastore relies on BigTable for storage, Spanner takes responsibility for storage itself, defining a new B-tree-based approach to storing segmented keys and data that correspond to these “schematized semi-relational tables.”

The BigTable team defined an approach to persistence that could scale in 2006; in 2009–2010, the Megastore team built a solution with a more natural data model and transaction support atop BigTable. Megastore, while satisfying an internal need for storage with structure, still presented significant challenges because truly distributed transactions were only possible with significant latency penalties. So what did Google do? They solved the fundamental problem — time uncertainty — and retooled the underlying approach to BigTable tablets to store hierarchical, semi-relational data.

Did Google just prove an entire industry wrong?

My read of this paper is that Google just proved a lot of NoSQL proponents wrong. Most of the rationale I read for switching to NoSQL is the claimed inability of relational databases to support both transactions and horizontally scaled, distributed systems. Like the BigTable paper, this Spanner paper will take some time to percolate through the industry. We’ve been playing catch-up with Google since the early part of the last decade, and it looks like we’ll be playing catch-up for some time, because Google just proved that you can scale the relational database horizontally and have consistent transactions across a continent.

So the next time someone tells you that the relational database is “over” or “dead,” point them at the Spanner paper.
