Like many startups, Quizlet faces a huge challenge in scaling our databases. We use MySQL to power our website, which allows us to serve millions of students every month, but is difficult to scale up — we need our database to handle more writes than a single machine can process. There are many solutions to this problem, but these can be complex to run or require extensive refactoring of your application's SQL queries. Google recently announced Cloud Spanner, a hosted and distributed relational database which offers a possible solution to scaling these kinds of queries. Cloud Spanner is a very novel technology — it relies on Google's powerful infrastructure, including atomic clocks in its data centers, to coordinate execution of SQL queries among many distributed nodes.

In this post we'll describe the scaling challenge, examine Cloud Spanner's architecture, and test how a Quizlet production query workload[1] would perform. Because Cloud Spanner is a new product, there's little information available beyond the official documentation, so we describe the lessons learned during testing and give guidance on Cloud Spanner's pitfalls. The results illustrate Cloud Spanner's performance characteristics and hopefully will give you (the SQL-querying reader) a deeper understanding of this technology. Our tests suggest that Cloud Spanner is the most compelling product we've seen for scaling a high-throughput relational workload like Quizlet's.

A Typical Problem

Quizlet's history of database scaling is typical of many growing tech companies. You start with an application which keeps state in a relational database, MySQL in our case. As the application grows you're forced to optimize. We've been pragmatic about our optimization, increasing our capacity as we prepare for jumps in our traffic around back-to-school and exam periods. Our strategies include splitting tables into their own hardware (vertical sharding) and adding replicas[2].

Our current architecture is multiple pods of MySQL machines. Each pod has a master, at least one replica we keep in case the master fails (and for executing queries that don't need to be consistent), and a replica we snapshot every hour for backups. A tier of Memcached machines acts as a cache for some queries. Our tables are divided among the pods, but we've avoided horizontal sharding (splitting a single table across multiple machines) because of the complexity involved.

Quizlet's MySQL architecture

The fundamental problem with this architecture is that it's limited by the capacity of the master. Without horizontal sharding, all of the writes and consistent reads must happen on that single machine. When that database reaches a certain size and query throughput (the database we test in this post is around 700 GiB) we are at risk of hitting a performance ceiling. Solving this problem means a major rearchitecture.

What are the options?

Some cloud services offer hosted relational databases (RDS and Aurora on AWS, Cloud SQL on GCP), but these do not offer scaling beyond a single node. In other words, these services don't solve the scaling problem.

You can use an existing clustering technology, such as XtraDB or Vitess for MySQL, and CitusDB for Postgres. These products solve the problem for you, but may not fit your application's queries precisely and carry complexity and maintenance costs. These databases have a different architecture than vanilla MySQL or Postgres that may also require partial application redesign.

You can roll your own scaling technology. You can design exactly what you need, but it involves a major engineering effort to build and maintain it.

You can switch to a horizontally scalable NoSQL technology. Examples are Cassandra, HBase, MongoDB, DynamoDB, and BigTable. These can be a harder transition, since they disallow some relational features like joins, and your query options become more limited. DynamoDB and BigTable are hosted by AWS and GCP, respectively. A hosted database can be a major win in terms of maintenance, but rewriting an application with hundreds of tables and queries for NoSQL may simply not be an option.

You can continue to scale vertically with large, expensive, enterprise-grade hardware. For example, machines with 10 TiB memory exist that might alleviate a database bottleneck. These aren't available in a cloud context and essentially punt the single-point-of-failure problem.

Many, perhaps most, scaled-up consumer tech companies develop their own MySQL sharding technology. This is the case at Facebook[3], Pinterest[4], Tumblr[5], Uber[6], Yelp[7], Twitter[8], Dropbox[9], Airbnb[10], Asana[11], Square[12], YouTube[13], etc. Each of these deployments requires careful design, implementation, and significant maintenance. Teams of engineers are necessary. Real, live humans wake up and fix it when it breaks. For a small company like Quizlet (4 infrastructure engineers, 50 total employees), scaling and managing a high-throughput database is a huge concern.

The database scalability challenge is particularly acute for Quizlet for two reasons: our growth rate and our seasonality. We've been lucky to have strong organic user growth for many years, which we must plan for in the future. As an education product, our traffic follows the school year; summers and holidays are very quiet, and exam periods are the busiest times. These factors combine to make database scalability one of the most critical factors in our infrastructure's uptime and stability.

Cloud Spanner

Based on tests we've conducted for Quizlet, Cloud Spanner is the most compelling solution to this problem that we've seen thus far. Spanner[15] was developed at Google to be a highly-scalable distributed relational database that could replace workloads you might run on a sharded MySQL deployment. It carries innovations that make it very scalable while providing strong consistency guarantees. It couldn't exist without deep integration with Google's internal storage and compute services. We're not aware of any comparable hosted products. It's hard to overemphasize this: as a product available to cloud customers, Cloud Spanner is something completely new.

Quizlet is especially well-positioned to take advantage of Spanner, given that we run entirely on Google Cloud Platform. One of the reasons we switched to GCP was the belief that Google could offer unique cloud products based on internally developed technology[16]. Cloud Spanner appears to be an example of this effect.

Cloud Spanner has a quite novel architecture — it has been developed specifically to take advantage of Google's massively scaled cloud and uses GPS and atomic clocks to coordinate its distributed nodes. Google has spent a great deal of time, money, and human energy to take Spanner to production first internally and now as a cloud product.

Something important we've learned: though the core concepts are the same, Cloud Spanner differs in some significant ways from the system described in the original paper published in 2012. You shouldn't necessarily assume that the performance or failure characteristics of Cloud Spanner match those described in the paper.

Architecture

Spanner is a distributed relational database with a SQL interface. Like other stateful distributed databases, Spanner's nodes require frequent coordination with one another. One of Spanner's key innovations is to reduce the amount of communication required during read and write operations by maintaining consistency through extremely accurate clocks. Each node running Spanner has access to these clocks through the TrueTime API. The original Spanner paper[15] describes the clock architecture as follows:

The underlying time references used by TrueTime are GPS and atomic clocks. TrueTime uses two forms of time reference because they have different failure modes... TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with dedicated antennas; these masters are separated physically to reduce the effects of [GPS] antenna failures, radio interference, and spoofing. The remaining masters (which we refer to as Armageddon masters) are equipped with atomic clocks. An atomic clock is not that expensive: the cost of an Armageddon master is of the same order as that of a GPS master.

Having an accurate measure of time is very useful in a distributed database. When you read data, you have to ask: is this replica up to date? If you read and see multiple transactions on a single row, you must ask: did transaction A occur before transaction B? When you write to a row, you must ask: when was the lock on this row last released? When you're reading data from multiple places and it must be consistent, you ask: did this transaction happen after time t? Most distributed databases establish these properties by communicating between machines, but if each node has a very accurate clock, the job becomes much easier.

This glosses over much of the complexity of how Spanner uses the TrueTime timestamps, which is covered in more detail in the Spanner paper[15].
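To make the role of clock uncertainty concrete, here is a minimal sketch of the "commit wait" idea from the Spanner paper: the time API returns an interval rather than a single instant, and a writer waits until its chosen timestamp is definitely in the past before making the write visible. The `FakeTrueTime` class, the 4 ms uncertainty, and the 1 ms tick are assumptions made for illustration; this is a toy simulation, not how Cloud Spanner is implemented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TTInterval:
    earliest: int  # true time is guaranteed to be >= earliest (microseconds)
    latest: int    # true time is guaranteed to be <= latest (microseconds)


class FakeTrueTime:
    """Hypothetical stand-in for TrueTime: a manually advanced clock
    with a fixed uncertainty bound (epsilon)."""

    def __init__(self, epsilon_us: int) -> None:
        self.epsilon_us = epsilon_us
        self._true_time_us = 0

    def advance(self, microseconds: int) -> None:
        self._true_time_us += microseconds

    def now(self) -> TTInterval:
        return TTInterval(self._true_time_us - self.epsilon_us,
                          self._true_time_us + self.epsilon_us)

    def after(self, t: int) -> bool:
        """True once timestamp t has definitely passed."""
        return self.now().earliest > t


def commit(tt: FakeTrueTime, tick_us: int = 1_000) -> Tuple[int, int]:
    """Pick a commit timestamp, then wait until it is definitely in the past.

    Returns (commit_timestamp_us, simulated_wait_us).
    """
    ts = tt.now().latest          # no node can believe "now" is later than this
    waited = 0
    while not tt.after(ts):       # commit wait: roughly 2 * epsilon
        tt.advance(tick_us)
        waited += tick_us
    return ts, waited


tt = FakeTrueTime(epsilon_us=4_000)
ts_a, wait_a = commit(tt)
ts_b, wait_b = commit(tt)
print(ts_a, wait_a, ts_b, wait_b)
```

Because the second commit starts only after the first timestamp has definitely passed, its timestamp is strictly larger, without the two writers exchanging any messages. The cost is the wait itself, which grows with the clock uncertainty.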

Spanner partitions its data into splits, a concept called shards or tablets in other distributed databases (though each database divides and manages these differently). A row's primary key determines the split on which it is stored.

Cloud Spanner optimizes its split configurations based on both data and query workload. By “split configuration” we mean the number of splits and the way that data is partitioned among splits. Split optimization is opaque to the user, but because it factors in query workload, the splits in a database can change even if no data is being written. Spanner will create more splits over time with a high throughput workload, and consolidate to fewer splits with a low throughput workload. This makes testing more difficult because it creates a warming effect for a high throughput workload — performance will get better as Spanner gradually optimizes the splits.
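As a rough mental model of splits, the sketch below treats each split as a contiguous range of primary keys and counts how many splits a set of keys touches. The boundary values are hypothetical, and the range-based layout is a simplifying assumption: Cloud Spanner chooses and moves split boundaries itself and does not expose them.

```python
import bisect
from typing import Iterable, Sequence


def split_for_key(boundaries: Sequence[int], key: int) -> int:
    """Index of the split holding `key`.

    Split 0 covers keys below boundaries[0]; split i covers
    [boundaries[i-1], boundaries[i]); the last split is open-ended.
    """
    return bisect.bisect_right(boundaries, key)


def splits_touched(boundaries: Sequence[int], keys: Iterable[int]) -> int:
    """Number of distinct splits a query over `keys` would have to visit."""
    return len({split_for_key(boundaries, k) for k in keys})


# Hypothetical boundaries producing four splits.
boundaries = [1000, 2000, 3000]

clustered = splits_touched(boundaries, [10, 20, 30])          # neighbouring keys
scattered = splits_touched(boundaries, [10, 1500, 2500, 3500])  # spread-out keys
print(clustered, scattered)
```

Keys that sit close together in the primary key space land on one split, while scattered keys fan out across several, which is why key design matters for a workload's locality.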

Spanner Virtualization

If you're running a Cloud Spanner cluster, you configure its performance capacity with the number of nodes. By increasing the number of nodes you add capacity, achieving scalability. Google doesn't expose the exact amount of compute that a node gives you, but the rule of thumb is that a node can handle around 2,000 writes per second and 10,000 reads per second. In our testing, we found this to be workload-dependent.

Both storage and compute are deeply virtualized in Cloud Spanner, so it's important to understand what a node actually means.

A Cloud Spanner “node” isn't a physical machine, per se. It is an allocation of compute capacity and disk access throughput on a set of machines with access to the storage API. A scheduling mechanism manages jobs executed on this pool of machines. This compute architecture means you can make sub-second changes to the number of nodes on an active/production Spanner cluster. That's not the amount of time it takes to call the API. That's the amount of time it takes to adjust your node capacity and see the change's effect on an output metric like query latency.

The number of nodes you have is architecturally separate from the amount of data you store, so these can scale up and down separately, but Cloud Spanner has a limit of 2 TiB per node[17].
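The figures above can be turned into a back-of-the-envelope lower bound on cluster size. The sketch below uses the quoted rule of thumb (2,000 writes and 10,000 reads per second per node) and the 2 TiB per node limit; treating read and write load as additive shares of a node is an assumption, and since the rule of thumb is workload-dependent, the result is at best a starting point for testing.

```python
import math

WRITES_PER_NODE = 2_000             # rule of thumb quoted in the text
READS_PER_NODE = 10_000             # rule of thumb quoted in the text
STORAGE_PER_NODE_GIB = 2 * 1024     # 2 TiB per node limit


def min_nodes(reads_per_s: float, writes_per_s: float, storage_gib: float) -> int:
    """Optimistic lower bound on node count.

    Assumption: reads and writes consume a node's capacity additively.
    """
    compute = reads_per_s / READS_PER_NODE + writes_per_s / WRITES_PER_NODE
    storage = storage_gib / STORAGE_PER_NODE_GIB
    return max(1, math.ceil(max(compute, storage)))


# Hypothetical workloads, not Quizlet measurements.
compute_bound = min_nodes(reads_per_s=30_000, writes_per_s=8_000, storage_gib=5_120)
storage_bound = min_nodes(reads_per_s=100, writes_per_s=10, storage_gib=9_000)
print(compute_bound, storage_bound)
```

The first workload is limited by throughput and the second by storage, which shows how the two dimensions scale independently until one of them sets the floor.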

Cloud Spanner's data is written to Google's Colossus[18], a storage substrate that splits and replicates it. This means that a specific bit is written to disk dozens of times. The data for a particular table could be divided and replicated among thousands of disks. So Cloud Spanner doesn't use a disk so much as a storage API. The storage API abstracts the difficult details of reliability while giving clients an expectation of access latency and throughput. This means the data is written to multiple physical data centers and failure of any single disk is effectively abstracted away from Cloud Spanner.

Cloud Native Database

Much of Spanner's power stems from its uniquely deep pairing of infrastructure with software. It depends on a storage layer, a compute layer, and an extremely accurate time API; this kind of technology isn't possible without a scaled cloud. Even if Spanner were open-source, it wouldn't really be possible to run it outside of a Google datacenter. Conversely, a distributed database written specifically to run on GCP wouldn't have the same capabilities as Spanner without direct access to Google's storage and time APIs. CockroachDB is an open-source database written to replicate Spanner outside of Google, and even though it has a similar architecture it comes with major disadvantages[19].

Consider the cost of duplicating Spanner in your own environment. For example, you would need to run your own atomic clocks. You would need not only a significant datacenter, but you would need scaled storage and compute layers that operate over thousands of machines. Some software already exists to do this, e.g. the Hadoop ecosystem, but these are far from easy to run at scale. Scaled-up hosted infrastructure is a compelling economic argument for using a cloud platform like GCP, AWS, or Azure. You get access to services that wouldn't be cost effective for smaller businesses to build and maintain.

Interface

SQL Dialect

Cloud Spanner uses a SQL dialect which matches the ANSI SQL:2011 standard with some extensions for Spanner-specific features. This is a SQL standard simpler than that used in non-distributed databases such as vanilla MySQL, but still supports the relational model (e.g. JOINs). It includes data-definition language statements like CREATE TABLE. Spanner supports 7 data types: bool, int64, float64, string, bytes, date, timestamp[20].

Cloud Spanner doesn't, however, support data manipulation language (DML) statements. DML includes SQL queries like INSERT and UPDATE. Instead, Spanner's interface definition includes RPCs for mutating rows given their primary key[21]. This is a bit annoying. You would expect a fully-featured SQL database to include DML statements. Even if you don't use DML in your application you'll almost certainly want them for one-off queries you run in a query console.

Though Cloud Spanner supports a smaller set of SQL than many other relational databases, its dialect is well-documented and fits our use case well. Our requirements for a MySQL replacement are that it supports secondary indices and common SQL aggregations, such as the GROUP BY clause. We've eliminated most of the joins we do, so we haven't tested Cloud Spanner's join performance.

Data Locality

Cloud Spanner offers two powerful tools to manage data locality on the tables. This is worth highlighting because MySQL and Postgres don't have directly equivalent tools.

The first tool is interleaved tables. Many query workloads define a close relationship between multiple tables. The example given in the Cloud Spanner documentation is a database with a table of musical artists and a table of albums. There's an obvious relationship between these two tables. You can optimize some queries by storing the data for a particular album near (on disk) the data for that album's artist. Interleaved tables explicitly define this relationship in the schema, giving the user a simple and powerful way to manipulate data locality.
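The sketch below shows what interleaving looks like in practice. The DDL string follows the style of the artist/album example in the Cloud Spanner documentation, with column names chosen here for illustration. The `physical_order` function is a toy model that assumes rows are stored in primary key order, so each child row sorts directly after its parent.

```python
from typing import List, Tuple

# Illustrative DDL; table and column names are hypothetical.
INTERLEAVED_DDL = """
CREATE TABLE Singers (
  SingerId INT64 NOT NULL,
  Name     STRING(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId INT64 NOT NULL,
  AlbumId  INT64 NOT NULL,
  Title    STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
"""


def physical_order(singers: List[Tuple[int, str]],
                   albums: List[Tuple[int, int, str]]) -> List[tuple]:
    """Toy model of interleaved storage: sort all rows by primary key.

    A child key (SingerId, AlbumId) extends its parent's key (SingerId,),
    so children sort immediately after their parent.
    """
    rows = [((singer_id,), "Singers", name) for singer_id, name in singers]
    rows += [((singer_id, album_id), "Albums", title)
             for singer_id, album_id, title in albums]
    return sorted(rows, key=lambda row: row[0])


layout = physical_order(
    singers=[(1, "Artist A"), (2, "Artist B")],
    albums=[(2, 1, "First"), (1, 2, "Second"), (1, 1, "Debut")],
)
for key, table, value in layout:
    print(key, table, value)
```

Reading an artist together with all of that artist's albums then becomes a scan over one contiguous key range instead of lookups in two separate tables.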

The second tool for managing data locality is stored indexes. Cloud Spanner allows you to explicitly define which data from a table is stored in a secondary index. This is useful because if you've traversed a secondary index then you've already done a disk access. If the index's leaf node stores the additional columns a query needs, the query can be fulfilled without reading the full row that the index references, saving a disk access.
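To illustrate the saving, the toy model below counts base-table reads for the same lookup with a plain index and with an index that stores an extra column. The DDL string shows the `STORING` clause Cloud Spanner uses for this; the table, column names, and data are hypothetical, and the in-memory dictionaries only stand in for real index structures.

```python
from typing import Dict, List, Sequence, Tuple

# Illustrative DDL; index and column names are hypothetical.
STORING_DDL = "CREATE INDEX AlbumsByTitle ON Albums(Title) STORING (ReleaseYear)"


class Table:
    """In-memory table that counts how often a full row is fetched."""

    def __init__(self, rows: Dict[tuple, dict]) -> None:
        self.rows = rows
        self.row_reads = 0

    def read_row(self, pk: tuple) -> dict:
        self.row_reads += 1
        return self.rows[pk]


def build_index(table: Table, indexed_col: str,
                stored_cols: Sequence[str] = ()) -> Dict[object, List[Tuple[tuple, dict]]]:
    """Map indexed value -> [(primary key, stored column values)]."""
    index: Dict[object, List[Tuple[tuple, dict]]] = {}
    for pk, row in table.rows.items():
        entry = (pk, {col: row[col] for col in stored_cols})
        index.setdefault(row[indexed_col], []).append(entry)
    return index


def lookup(table: Table, index: dict, value: object, wanted_col: str) -> list:
    results = []
    for pk, stored in index.get(value, []):
        if wanted_col in stored:
            results.append(stored[wanted_col])               # served from the index
        else:
            results.append(table.read_row(pk)[wanted_col])   # extra base-table read
    return results


albums = Table({
    (1, 1): {"Title": "Alpha", "ReleaseYear": 2001},
    (1, 2): {"Title": "Beta", "ReleaseYear": 2005},
    (2, 1): {"Title": "Alpha", "ReleaseYear": 2010},
})
plain = build_index(albums, "Title")
covering = build_index(albums, "Title", stored_cols=("ReleaseYear",))

years_plain = lookup(albums, plain, "Alpha", "ReleaseYear")
reads_plain = albums.row_reads
albums.row_reads = 0
years_covering = lookup(albums, covering, "Alpha", "ReleaseYear")
reads_covering = albums.row_reads
print(reads_plain, reads_covering)
```

Both lookups return the same values, but only the plain index has to go back to the table. The trade-off is that stored columns make the index larger and must be updated on every write to those columns.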

Communication Protocol

Cloud Spanner uses gRPC[22] as its client communication framework, which represents a vast improvement over how most databases communicate. gRPC helps with some of the important details of interacting with a database in a complex production environment.

Most databases are accessed over the network and thus need a protocol for accepting client connections and passing messages back and forth. The common practice is to design a custom binary protocol for handling this communication, then implement clients in all the supported programming languages that speak this protocol. MySQL and Postgres are examples[23,24]. There's no real standard among different databases for the communication protocol, and the only true implementation of each is the database's source code[25]. This means writing clients or proxies for a database is a difficult and error-prone process.

Cloud Spanner improves on this by using gRPC as its communication framework. Using gRPC doesn't mean that we have a universal standard for database communication, but it does help with the problem. The advantages of using gRPC in this context are:

The communication protocol is written in a machine readable format, Protobufs, which is visible to users[26]. It's easy to examine all of the endpoints and message types. Compare this to the MySQL or Postgres wire specifications, which require careful study to understand what messages are being passed around.

gRPC automatically generates clients in 10 officially supported languages. In practice these clients need some wrapping for usability, but it gives you a solid starting place. For example, enums will be deterministically defined.

gRPC communicates over HTTP2, so you can leverage existing layer 7 tooling, like an HTTP2 proxy.

Authentication with Spanner is handled using gRPC's (mostly) generic authentication mechanism. In other protocols authentication is custom-tied to the binary protocol.

A REST HTTP interface can be generated from the gRPC definition, so users have that option as well.

Using gRPC as the communication layer is a more modern and convenient architecture than the older-style practice of implementing a custom binary protocol — this is the future. We suspect that new databases written in the future will adopt gRPC or similar frameworks. For example, CockroachDB, started in 2014, uses gRPC as its communication protocol.

Testing Methodology

The goal of our testing is to determine how one of Quizlet's production MySQL workloads would perform on Cloud Spanner. Our strategy is to create a synthetic query workload that closely matches a production MySQL sample and execute the workload in a controlled environment against both MySQL and Cloud Spanner.

Testing Schema

We've chosen to test our Terms table, one of our most critical datasets and query workloads. Quizlet has around 150 million flashcard study sets and the Terms table holds a row for every term/definition pair on Quizlet. Think of these as the front and back of a flashcard. The Terms table holds 6 billion rows, 625 GiB of data, 80 GiB of indexes, and handles around 3,000 queries per second at peak.

In MySQL, we define the Terms table with the following schema:

After mapping this into Cloud Spanner's data types, we define Terms with:

Synthetic load testing

We captured a sample of the workload running in production on the MySQL-hosted Terms table. The query sample can be aggregated into a discrete number of query patterns by extracting the query parameters from each executed query, for example SELECT a FROM b WHERE a = $1. For testing, we trimmed our more esoteric queries and consolidated into 18 query patterns.

We mapped each query pattern into Spanner's SQL dialect and wrote an engine to run these queries against Cloud Spanner, attempting to replicate the characteristics of the original workload. We executed each pattern at the same frequency as it's observed in the sample workload; if the above query comprises 15% of the sample, we replicate that. We also attempt to match the distribution of values for the query parameters. For example, row ids aren't usually queried with a uniform distribution; recently inserted rows are queried with greater frequency. Though the queries are generated, the data on which we test is a copy of our production database.
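The steps described above can be sketched as follows: normalize captured queries into patterns, measure each pattern's share of the sample, then draw patterns and row ids at random with matching frequencies. This is a simplified illustration, not the engine used for the tests. The regular expressions, the helper names, and the power-law skew toward recent ids are all assumptions.

```python
import random
import re
from collections import Counter
from typing import Dict, Iterable

# Matches quoted string literals or standalone numbers.
_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")
# Matches a parenthesized list of placeholders, e.g. (?, ?, ?).
_PLACEHOLDER_LIST = re.compile(r"\(\s*\?(?:\s*,\s*\?)*\s*\)")


def to_pattern(query: str) -> str:
    """Replace literals with ? so that queries differing only in
    parameters map to the same pattern."""
    pattern = _LITERAL.sub("?", query)
    pattern = _PLACEHOLDER_LIST.sub("(??)", pattern)  # IN lists of any length
    return re.sub(r"\s+", " ", pattern).strip()


def pattern_weights(sample: Iterable[str]) -> Dict[str, float]:
    """Share of the sample made up by each query pattern."""
    counts = Counter(to_pattern(q) for q in sample)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def next_pattern(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw a pattern with the same frequency as in the sample."""
    patterns = list(weights)
    return rng.choices(patterns, weights=[weights[p] for p in patterns], k=1)[0]


def skewed_row_id(max_id: int, rng: random.Random, skew: float = 3.0) -> int:
    """Draw an id biased toward recently inserted (larger) ids.
    The power-law shape is an assumption for illustration."""
    return max(1, int(max_id * rng.random() ** (1.0 / skew)))


sample = [
    "SELECT word FROM terms WHERE id = 1",
    "SELECT word FROM terms WHERE id = 22",
    "SELECT word FROM terms WHERE id = 333",
    "UPDATE terms SET rank = 4 WHERE id = 5",
]
weights = pattern_weights(sample)
rng = random.Random(0)
picked = next_pattern(weights, rng)
ids = [skewed_row_id(1000, rng) for _ in range(1000)]
print(weights, picked, sum(ids) / len(ids))
```

A real capture would need a proper SQL tokenizer rather than regular expressions, and the parameter distributions would be fitted to the production sample instead of assumed.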

We call this strategy synthetic load testing, because we're generating queries rather than executing the exact queries from the original workload. Though we must take care to ensure the workload replicates the sample, synthetic testing gives us a controlled, clean, and scalable way to experiment with the workload on Spanner. For example, we can run tests with arbitrarily high query rates regardless of the size or throughput of the original sample. A synthetic workload at n queries per second tends to be easier for a database to handle than n production queries per second because edge case queries have an outsized effect on performance. Still, we believe our synthetic testing is robust. It provides a clean environment for testing and optimization, and we use it to prepare to scale our systems up ~6x when Quizlet transitions from our summer traffic lull into back-to-school.

Here are the queries included in the synthetic workload. It's important not only to represent the queries themselves in the synthetic test, but also to replicate the locality characteristics of the queries. For example, query #13 selects all of the terms from a list of sets. To replicate the production characteristics of this query, we take care to generate queries with the same distribution of ids[28].

| # | Share of workload | Query |
| --- | --- | --- |
| 14 | | UPDATE `terms` SET `definition`=?, `last_modified`=? WHERE `id`=? AND `set_id`=? |
| 15 | 0.33% | UPDATE `terms` SET `is_deleted`=?, `last_modified`=? WHERE `id` IN (??) AND `set_id`=?? |
| 16 | 12.56% | UPDATE `terms` SET `rank`=?, `last_modified`=? WHERE `id`=? AND `set_id`=? |
| 17 | 1.06% | UPDATE `terms` SET `word`=?, `last_modified`=? WHERE `id`=? AND `set_id`=? |
| 18 | 0.32% | UPDATE `terms` SET `definition`=?, `word`=?, `last_modified`=? WHERE `id`=? AND `set_id`=? |

Overall this workload is 73% reads and 27% writes. Since our production architecture caches some query results in Memcached, this is skewed more towards writes than the overall read/write load of the application.

We ran our tests using a high-memory 64-core GCE instance running Percona MySQL 5.7 using a pd-ssd disk. This machine type has 416 GiB of memory and we've tuned the InnoDB buffer pool to 340 GiB. All disks on GCE are mounted remotely.

Bear in mind that this is a pretty narrow test. We'll show how a Quizlet production workload would map into Cloud Spanner (rather than exhaustively evaluate Spanner against every dimension of MySQL or other databases). Other workloads may perform much differently. Nevertheless, we've attempted to fairly compare MySQL and Cloud Spanner performance in this specific case.

Test Results

The key takeaway of our testing was that Cloud Spanner queries have higher latency at low throughputs compared with a virtual machine running MySQL. Spanner's scalability, however, means that a high-capacity cluster can easily handle workloads that stretch our MySQL infrastructure.

Running our suite of queries at 3,000 queries per second establishes a baseline of performance. Spanner results here have median latencies between 6 and 12ms, while MySQL is able to respond to queries in 1-2ms.

Below is a table of results, comparing some of the interesting queries from the query suite between MySQL and Spanner execution at 3,000 qps.

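| Query | DB | Min (ms) | Mean (ms) | Median (ms) | p90 (ms) | p99 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | mysql | 0.46 | 0.91 | 0.74 | 1.19 | 2.46 | 51.05 |
| 4 | spanner | 3.60 | 5.95 | 5.66 | 7.33 | 10.33 | 248.83 |
| 5 | mysql | 12.95 | 42.78 | 35.03 | 53.11 | 136.31 | 5,083.88 |
| 5 | spanner | 10.79 | 18.27 | 17.44 | 21.27 | 29.72 | 1,870.98 |
| 7 | mysql | 0.64 | 1.70 | 0.96 | 2.83 | 9.12 | 506.86 |
| 7 | spanner | 2.16 | 5.41 | 4.93 | 7.78 | 11.31 | 1,014.82 |
| 8 | mysql | 1.49 | 7.98 | 4.60 | 14.04 | 64.30 | 242.77 |
| 8 | spanner | 7.78 | 21.16 | 19.43 | 30.69 | 43.64 | 411.44 |
| 12 | mysql | 0.47 | 1.25 | 0.94 | 1.74 | 5.16 | 342.33 |
| 12 | spanner | 2.48 | 6.33 | 5.83 | 8.94 | 12.90 | 319.42 |
| 16 | mysql | 0.44 | 0.86 | 0.73 | 1.14 | 2.38 | 78.96 |
| 16 | spanner | 0.30 | 7.50 | 7.12 | 9.82 | 13.49 | 2,585.78 |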

Query 5, which selects multiple full sets and then filters them, is the most expensive of the query suite. Compared with the other queries, it runs with significantly higher latency on MySQL and somewhat higher latency on Spanner.

An interesting pattern above is that queries 14-18, which are all updates, perform with higher latency on Spanner than the easy selects and non-bulk inserts.
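For readers reproducing this kind of comparison, the columns in the table above can be computed from raw per-query latency samples along these lines. The article doesn't state how its percentiles were calculated, so the nearest-rank method used here is an assumption, and the sample data is synthetic.

```python
import math
import statistics
from typing import Dict, Iterable, Sequence


def percentile(sorted_values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of an already sorted sequence."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms: Iterable[float]) -> Dict[str, float]:
    """Min, mean, median, p90, p99, and max for one query pattern."""
    values = sorted(latencies_ms)
    return {
        "min": values[0],
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "max": values[-1],
    }


# Synthetic latencies: 1 ms, 2 ms, ..., 100 ms.
summary = summarize(range(1, 101))
print(summary)
```

Reporting the tail (p90, p99, max) alongside the median matters here because the two databases differ most in how their tails behave under load.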

As we reach our performance ceiling and stress MySQL, we begin to see median latencies rise. Observe that at 9,000 qps, Spanner latency is basically unchanged, while MySQL latency has jumped.

Below are the numbers for MySQL and Spanner at 9,000 qps.

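| Query | DB | Min (ms) | Mean (ms) | Median (ms) | p90 (ms) | p99 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | mysql | 0.52 | 18.45 | 14.32 | 40.40 | 61.28 | 77.57 |
| 4 | spanner | 3.29 | 5.65 | 5.34 | 6.89 | 9.50 | 336.97 |
| 5 | mysql | 18.27 | 57.99 | 45.67 | 108.75 | 199.18 | 300.63 |
| 5 | spanner | 10.04 | 19.32 | 16.91 | 21.17 | 93.41 | 815.62 |
| 7 | mysql | 0.72 | 20.78 | 15.51 | 42.39 | 96.93 | 190.09 |
| 7 | spanner | 2.11 | 5.16 | 4.62 | 7.32 | 10.82 | 404.61 |
| 8 | mysql | 0.50 | 17.20 | 14.78 | 34.47 | 56.66 | 101.92 |
| 8 | spanner | 7.39 | 21.13 | 18.72 | 30.64 | 51.23 | 605.43 |
| 12 | mysql | 0.73 | 17.91 | 14.65 | 42.12 | 73.50 | 73.50 |
| 12 | spanner | 2.35 | 6.12 | 5.55 | 8.54 | 12.46 | 1,072.98 |
| 16 | mysql | 0.50 | 16.56 | 14.42 | 34.72 | 58.29 | 82.25 |
| 16 | spanner | 3.51 | 7.16 | 6.76 | 9.35 | 12.70 | 1,007.84 |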

This is a very simple test. We've intentionally targeted the performance ceiling of our 64-core MySQL machine with the described workload. The point is to demonstrate the limitation of a single-node database versus Spanner. In practice, you could shard the workload among multiple MySQL machines, meaning that it is possible to run a workload of this size on MySQL, but to do so you would introduce an additional layer of complexity.

Latency as Throughput Increases

Database scalability is the major goal of our experimentation with Spanner, so we compared query latency at 9, 15, and 30 Spanner nodes. We chose 9 because it was the minimum number of nodes possible given our data size and the way we'd loaded our cluster. In each of these configurations we found the maximum throughput at which we could execute the query workload.

When MySQL is near its limit on throughput, latency increases drastically. However, when Spanner reaches its throughput capacity its median latency is largely unchanged, though latency increases at the tail, which you can see in the p99 chart.

You can also observe in these charts that Spanner scales almost linearly. Distributed databases typically scale sublinearly because there's always coordination overhead in communicating with additional nodes. Spanner caps out around 17,000 qps with 15 nodes and 33,000 qps with 30 nodes, meaning there's fairly little additional overhead as the cluster size is doubled.
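The near-linear claim can be checked with simple arithmetic on the two figures quoted above. The function names below are illustrative; a scaling efficiency of 1.0 would mean perfectly linear scaling.

```python
def per_node_qps(max_qps: float, nodes: int) -> float:
    """Maximum sustained throughput divided evenly across nodes."""
    return max_qps / nodes


def scaling_efficiency(qps_small: float, nodes_small: int,
                       qps_large: float, nodes_large: int) -> float:
    """Per-node throughput of the larger cluster relative to the smaller one."""
    return per_node_qps(qps_large, nodes_large) / per_node_qps(qps_small, nodes_small)


small = per_node_qps(17_000, 15)    # about 1,133 qps per node
large = per_node_qps(33_000, 30)    # 1,100 qps per node
efficiency = scaling_efficiency(17_000, 15, 33_000, 30)
print(round(small), round(large), round(efficiency, 3))
```

Doubling the cluster from 15 to 30 nodes keeps roughly 97% of the per-node throughput, so only about 3% is lost to coordination overhead in this measurement.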

Nodes vs Throughput

How does query throughput change as we increase the number of nodes? Above, we saw how latency changes on these two systems when altering the queries per second. To understand Spanner better, we also experiment with the number of Spanner nodes. In this test, we've varied the number of nodes and observed the maximum throughput Spanner was able to handle given a fixed number of clients executing tests. The maximum throughput was found by sending queries without a rate limit to Cloud Spanner, observing the queries per second (which varies when it's at capacity), then dialing down qps and targeting the nearest increment of 100 that Cloud Spanner could comfortably execute without falling behind the target.
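The search procedure described above can be sketched as a simple step-down loop. Here `run_at_target` is a hypothetical callback that runs the workload at a target rate and returns the achieved qps, and the 0.5% tolerance used to decide whether the system "comfortably" keeps up is an assumption. The simulated system is a stand-in for a real load test.

```python
from typing import Callable


def find_max_sustainable_qps(run_at_target: Callable[[int], float],
                             unthrottled_qps: float,
                             step: int = 100,
                             tolerance: float = 0.995) -> int:
    """Step down from the unthrottled rate in increments of `step`
    until the system keeps up with the target."""
    target = int(unthrottled_qps // step) * step  # round down to an increment
    while target > 0:
        achieved = run_at_target(target)
        if achieved >= tolerance * target:        # not falling behind the target
            return target
        target -= step
    return 0


def simulated_cluster(target_qps: int) -> float:
    """Hypothetical system that saturates at 16,950 qps."""
    return min(float(target_qps), 16_950.0)


# Unthrottled runs fluctuate at capacity; suppose one was observed at 17,380 qps.
max_qps = find_max_sustainable_qps(simulated_cluster, unthrottled_qps=17_380)
print(max_qps)
```

In a real test each call to `run_at_target` would need to run long enough for split optimization and caches to settle, since the warming effect described earlier can shift the result.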

Nodes vs Latency

How does query latency change as we increase the number of nodes? We varied the number of nodes while holding the query throughput constant. We observed a negative correlation between nodes and the median query latency. This is interesting but not very surprising — it seems that adding compute capacity distributes the query load more and reduces contention, which lowers latency. The strength of this effect likely depends on the query workload.

Lessons from Testing

Cloud Spanner has thorough documentation[29], but of course some things you discover through experimentation.

Here's the most important thing: with Cloud Spanner you have slightly higher latency for simple queries compared with MySQL, but get a vastly more scalable database. Not every application can handle Spanner's ~5ms minimum query time, but if you can, then you can have that latency for a very high-throughput workload[30].

Schema Optimization

Spanner query latency is positively correlated with the number of splits that a given query must access. So a query that accesses 10 rows in disparate parts of the primary key space will take longer than one where the keys reside on the same splits. This is expected with a distributed system. What we failed to anticipate, however, was the effect of a secondary index on query performance.

When testing a write-heavy workload, we discovered that bulk writes would severely impact the performance of queries using the secondary index. Even though our writes had strong locality, the secondary index meant that many splits needed to be updated during a bulk write. Because Spanner uses pessimistic locking, a bulk write that updates many splits through a secondary index creates contention for reads that use that index. Our takeaway is that Spanner's secondary indexes are a powerful tool but come with a major cost for certain schemas and workloads. Our optimization was to change our primary key to include the column in the secondary index, which let us eliminate the secondary index and drastically lowered latency[31]. This isn't a bug or problem with Spanner: any database makes architecture decisions around which you should optimize.

Node-to-Split Ratio

We also discovered that there's a maximum number of splits per Spanner node. That threshold combined with Spanner's split optimization produces a surprising effect: it might be impossible to reduce the number of nodes on your database, even if you previously ran the database with that number of nodes. Since Cloud Spanner doesn't expose the number of splits in the database, it's not possible to predict the minimum number of nodes you can set without experimentation. This can cause problems with capacity planning. For example, you may not be able to autoscale a Cloud Spanner cluster down to a size of your choice if it optimized the splits for a larger cluster.

Cost

Cloud Spanner's cost as compared to other options is another area that very much depends on your workload. For very small or low-throughput databases Cloud Spanner is overkill, with a single node costing $0.90 per hour, or $648 for a 30-day month, plus storage costs[32]. For significant workloads it makes more sense.

The example workload we show above runs in production on 3 large GCE instances with SSD drives. We estimate that it costs roughly as much to run as a 10-node Cloud Spanner cluster, which makes Cloud Spanner comparable or slightly cheaper based on the performance in our testing. This is good news, given that Spanner presumably would reduce the human maintenance cost of running a database and makes it possible to autoscale the number of nodes in a cluster based on traffic during the day or week, which has the potential to reduce costs below those numbers. It's hard to say precisely how to think of these prices, given that we don't have experience running Cloud Spanner in production, but for a workload like ours it appears to carry similar infrastructure costs to our self-managed MySQL instances.
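The node pricing arithmetic is easy to reproduce. The sketch below uses the $0.90 per node-hour figure quoted above and covers compute only; storage costs are deliberately left out because the article doesn't give a storage price.

```python
NODE_PRICE_PER_HOUR_USD = 0.90   # per-node price quoted in the text


def monthly_node_cost(nodes: int, days: int = 30,
                      price_per_hour: float = NODE_PRICE_PER_HOUR_USD) -> float:
    """Compute-only cost in USD of running `nodes` nodes for `days` days."""
    return round(nodes * price_per_hour * 24 * days, 2)


single_node = monthly_node_cost(1)    # the $648 figure from the text
ten_nodes = monthly_node_cost(10)     # the 10-node cluster used for comparison
print(single_node, ten_nodes)
```

A cluster that scales its node count down during quiet hours would cost less than the flat figure, though the split-per-node limit described earlier puts a floor on how far it can shrink.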

Operational Considerations

A previous boss of mine once told me, “It takes a decade to write a database.” This trivially false statement carries the truth that writing real production software infrastructure takes years of iteration. The feedback loop of running a production workload, optimizing that workload, then running again, exposes bottlenecks and flaws that would be impossible to predict a priori. Maturing to real production quality takes time, and it's hard to trust a database before then. Consider that MySQL didn't include by default the now-standard InnoDB storage engine until 6 years after its initial release[33].

Spanner is compelling partly because it has been under development internally at Google for around ten years[15]. It powers Google's Ads and Analytics businesses, high-throughput workloads which demand reliability. Spanner's development carries extreme self-interest for Google, and it's been proven as a production system, at least internally. It's rare that a newly released database carries this level of investment and production-hardening.

Failure Conditions

Before taking a system to production it's important to consider how it will fail. Without having run Spanner in production, the best we can do now is study it and test as thoroughly as possible to understand its failure characteristics.

The documentation describes Cloud Spanner's zone replication as follows:

For any regional configuration, Cloud Spanner maintains 3 replicas, each within a different Google Cloud Platform availability zone in that region. Each replica is a full copy of your operational database that is able to serve read-write and read-only requests. Cloud Spanner uses replicas in different zones so that if a single zone failure occurs, your database remains available.[34]

At the very least it's clear that Spanner has been designed for robustness to zone failure. This means that if an entire data center within a GCP region goes down, Spanner should continue responding to queries. This protects against a category of failures, but doesn't fully answer the question of “what can go wrong?” Here's how we've come to think about Spanner's risks.

Spanner's major resources are compute, disk, and network. Let's define 4 failure modes — 1 for each of those resources, and 1 for everything else.

Google's suggestion was that the best way to simulate compute failure would be to manipulate the number of nodes while running load testing. Again, a node doesn't correspond with a specific machine, it's more like an allocation of compute resources to your Spanner cluster. We've been told that dropping the number of nodes while running a test is a close approximation to what would happen with a compute failure. So how does query latency change when the number of nodes is altered?

Median, p90, and p99 query latency over time as we randomly changed Cloud Spanner cluster size in nodes.

In the above graph, we ran 2,000 queries per second while changing the number of nodes and recording latency. Median latency is largely unaffected, except when the number of nodes is decreased below a threshold, in this case 2 nodes. p90 and p99 latency change dramatically. As we discuss above, nodes are an abstract notion of compute, rather than physical machines. Cloud Spanner impressively applies the change immediately, as demonstrated by the rapid change in latency after a change in nodes. Note that this is a slightly different query workload and schema than in the charts shown above.
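For readers who want to produce the same three series from their own load tests, here is a minimal sketch of how median, p90, and p99 can be computed from raw latency samples. It uses the nearest-rank definition of a percentile, which is one of several common conventions and not necessarily the one behind our graph; the function names and the sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one latency sample")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Summarize one measurement window as the three series plotted in the graph."""
    return {
        "median": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Hypothetical window: most queries are fast, a few are very slow.
    window = [5] * 89 + [20] * 9 + [200] * 2
    print(summarize(window))
```

In this made-up window the median stays low while a handful of slow queries dominate p99, which is the pattern the graph shows when nodes are removed.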

Disk is abstracted in Cloud Spanner, so an individual disk failure is negligible and not worth thinking about. A single bit is written dozens of times when you account for replication both locally and across multiple zones. So the question worth asking here is “What happens when there is a storage system failure?” Without witnessing such a failure it's unclear what would happen — it would likely be very disruptive.

Since compute and disk are virtualized, network access to and between these systems is critical for Spanner's functionality. Spanner is robust to any single node disappearing, so again, the thing to ask is “What if there is a systemic failure?” With network in particular, Cloud Spanner would likely be hosed. We don't know exactly how. We've seen network partitions in GCP before but they've usually been resolved quickly. Cloud Spanner's architecture means that data consistency should be guaranteed during a partition, but it's likely that latency would be impacted if some portion of Spanner's compute nodes were inaccessible. If a partition means that some Spanner splits (segments of data) are inaccessible, then logically some queries would fail.

As for other failure modes — this remains an open question. Spanner has been running at Google internally for years, which gives us confidence in the core technology's failure modes. As a public cloud product, it is very new.

Backups

Spanner's architecture is particularly helpful for database backups, as described in the original paper[15].

Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent reads and writes, and globally-consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions, and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.

By setting a timestamp for the data you want to read in the query, the client can define the historical point at which to read the database. Scanning the entire database at this timestamp yields a consistent backup.
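The idea can be illustrated with a toy multi-version store in Python. This is only a model of the principle (every write keeps its commit timestamp, and a read at a timestamp sees the newest version at or before it); it is not Cloud Spanner's client API, and the class, method names, and sample data are hypothetical.

```python
from bisect import bisect_right


class VersionedStore:
    """Toy multi-version key-value store; every write keeps its commit timestamp."""

    def __init__(self):
        # key -> (ascending commit timestamps, values written at those timestamps)
        self._versions = {}

    def write(self, key, value, commit_ts):
        stamps, values = self._versions.setdefault(key, ([], []))
        if stamps and commit_ts <= stamps[-1]:
            raise ValueError("commit timestamps must increase per key")
        stamps.append(commit_ts)
        values.append(value)

    def read_at(self, key, read_ts):
        """Return the newest value committed at or before read_ts, or None."""
        stamps, values = self._versions.get(key, ([], []))
        idx = bisect_right(stamps, read_ts)
        return values[idx - 1] if idx else None

    def scan_at(self, read_ts):
        """Read every key at the same timestamp: a consistent point-in-time copy."""
        snapshot = {}
        for key in sorted(self._versions):
            value = self.read_at(key, read_ts)
            if value is not None:
                snapshot[key] = value
        return snapshot


if __name__ == "__main__":
    store = VersionedStore()
    store.write("term:1", "chat", commit_ts=10)
    store.write("term:2", "chien", commit_ts=12)
    store.write("term:1", "gato", commit_ts=20)  # lands after the backup timestamp
    print(store.scan_at(15))
```

A scan at timestamp 15 ignores the write at timestamp 20, so the backup stays consistent even though writes continue while it runs.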

We expect in the future Cloud Spanner will have push-button backup functionality that will create and export a backup with little client interaction. Even without this feature, however, the architecture described above makes it possible to implement backups on the client side. For large data sizes doing a full export from Spanner would likely require real work.

Pitfalls

Cloud Spanner shines as a well-designed and scalable datastore. However, it still has a few rough edges as a production-ready database. We expect these will be addressed over time, but are worth knowing about if you use it now.

Picking Indexes

Cloud Spanner isn't yet very accurate when it decides which indexes to use. Often a query engine will have a choice between the primary key index and a secondary index before executing a query, which it picks based on estimates of the running time with each. In our testing, Spanner doesn't yet appear to be very accurate in these estimates. The upshot is that if you want to use a secondary index, you should be explicit in the query, for example SELECT a FROM my_table@{FORCE_INDEX=my_index} WHERE a = @a_value. Similarly, to force use of the primary index on a table, use SELECT a FROM my_table@{FORCE_INDEX=_base_table} WHERE a = @a_value. We don't consider this to be a major problem; arguably a developer writing a query for a high-performance database should pick the index explicitly. Automatic selection is a convenience feature.

Index Traversal

Fast index traversal for certain queries appears to support only one end of the index. The query select max(id) from mytable, where id is the primary key, requires a full table scan if the index is in ascending order. select min(id) from mytable skips the scan and executes quickly, as it should, since this query only requires traversing down the leftmost path of the primary key index tree. The opposite is also true: max(id) executes quickly if the index is in descending order. This seems mostly like a quirk, but might be annoying in certain cases.

Monitoring

Despite its sophistication, or perhaps because of it, introspection into Cloud Spanner's internals is currently its biggest problem.

Cloud Spanner has no interface for monitoring your instance's currently running queries. Clients can launch expensive queries that take a long time to execute, such as a full-table scan. Since an instance has a finite resource pool, resource-intensive queries can impact the performance and latency of other queries running on the cluster. From an operational standpoint, it's important to have visibility into the currently running workloads to diagnose production issues and understand performance. This is also important for index build jobs, which can be started from any client and may take hours to run for a large table.

Cloud Spanner has some metrics about resource use but as a hosted product it offers much less introspection into system performance than a self-hosted solution. Currently Cloud Spanner gives you metrics about CPU utilization (mean and max across the cluster), read/write operations and throughput, and the total storage consumed. These are helpful, but don't give enough visibility to solve certain categories of problems. For instance, diagnosing a hot shard problem and attributing it to a specific query remains difficult. When running a complex production system, more information is always better. Cloud Spanner is still missing introspection into key information, such as the number of splits on a cluster, the actions of the split optimizer, and the longest-running queries.

Spanner's query explainer is detailed in describing the query plan, but it doesn't apply the query plan to your data. For example, it doesn't offer any information on how many rows a query would scan. MySQL and Postgres are more helpful when you run an EXPLAIN, which greatly helps in understanding and optimizing a query.

Overall Spanner gives the user some critical information but lacks visibility into other spaces. To be considered fully production-ready, Cloud Spanner must add additional introspection so that users can diagnose problems and plan their node capacity.

Moving Forward

Cloud Spanner is the most compelling cloud service we've seen for scaling a high-throughput relational workload, MySQL in our case. It has some rough edges as a production system, but its latency and scalability are unique. We have not seen another database, relational or otherwise, that can scale as smoothly and rapidly as Cloud Spanner. And the fact that it is hosted eliminates an entire category of maintenance.

Quizlet's primary datastore is the most critical element of our infrastructure, but also the hardest system to scale. Given our evaluation of Cloud Spanner and the fact we run our infrastructure on GCP, we plan to continue experimenting with it and will likely migrate one of our production databases over in late spring of 2017.

While Cloud Spanner is still unproven, it holds great promise. Google is one of a handful of companies capable of delivering a hosted service of this sort, one that requires deep integration between the database software and underlying cloud infrastructure and depends on significant existing storage and compute resources. We're optimistic about its future and its potential to help us smoothly scale our core infrastructure.

Footnotes

1 - We use query workload and workload frequently in this post. To define more precisely what we mean: a query workload is the set of queries, the frequency of these queries, and the query throughput from an application. The Quizlet web app queries MySQL with a mix of SELECT, UPDATE, INSERT, and DELETE queries, and our MySQL architecture must respond to these queries within a latency boundary, otherwise we can't successfully serve web and API requests.

21 - While a Cloud Spanner mutation must be done using the primary key, you can build your own writes that depend on a secondary index by opening a transaction, doing a read using the secondary index to fetch the primary key of a row, doing a write with that row, and closing the transaction.
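Here is a minimal sketch of that read-then-write pattern. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; it is not the Cloud Spanner client API. The terms table, the terms_by_word index, and both function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table with a secondary index (hypothetical schema)."""
    # isolation_level=None means we manage transactions explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, word TEXT, definition TEXT)")
    conn.execute("CREATE UNIQUE INDEX terms_by_word ON terms (word)")
    conn.executemany(
        "INSERT INTO terms (id, word, definition) VALUES (?, ?, ?)",
        [(1, "chat", "dog"), (2, "chien", "dog")],
    )
    return conn


def update_definition_by_word(conn, word, new_definition):
    """Look up the primary key through the secondary index, then write by primary key."""
    conn.execute("BEGIN IMMEDIATE")  # open the transaction
    try:
        # Step 1: read via the secondary index to fetch the primary key.
        row = conn.execute(
            "SELECT id FROM terms INDEXED BY terms_by_word WHERE word = ?", (word,)
        ).fetchone()
        # Step 2: write to the row identified by that primary key.
        if row is not None:
            conn.execute("UPDATE terms SET definition = ? WHERE id = ?", (new_definition, row[0]))
        conn.execute("COMMIT")  # close the transaction
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row is not None


if __name__ == "__main__":
    db = make_demo_db()
    print(update_definition_by_word(db, "chat", "cat"))
```

Keeping the read and the write inside one transaction is what prevents another writer from changing the row between the two steps.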

25 - This architecture means that each database communicates in its own special way, and requires custom clients in many different languages, with no standardization or shared tooling. It's particularly painful that you can't use HTTP tooling like nginx (though there's definitely an overhead to communication over HTTP). Let's say you wanted to write a service that pools connections to an upstream database that has a custom binary protocol - many clients connecting to this proxy, which mediates these connections to the upstream database, a simple task. If this was HTTP, you would just configure something like nginx, but in this case you would need to implement the server-side of the custom protocol to receive connections, and a client for the custom protocol to speak to the upstream server. Suddenly just proxying a database connection becomes a complex job.

You might be thinking, “but there is a standard for this, just use ODBC.” This is even further removed from the root of the problem. Rather than a communication standard, ODBC is a library standard, meaning it specifies the function calls you must support for ODBC-compliance. You still need to implement a custom client for your custom database communication protocol, but since this is a library, it's now specific to a certain operating system. ODBC support in newer programming languages is patchy at best. For example, there's no official ODBC client for Go. It's like if the Facebook API was implemented as a binary network protocol, then OS-specific drivers were offered as the means for accessing it: insane. ODBC is a 25-year-old standard that doesn't hold up well to how modern applications are developed.

We've mapped last_modified in the MySQL schema from a Unix time integer to a TIMESTAMP column type.

Each Term is a member of a Set, so there is a one-to-many relationship between Sets and Terms. Spanner allows you to manipulate row locality in cases like this, where two tables have a parent-child relationship and it might make sense for the child rows to be stored near the parent row, but still represented as a separate table. We haven't done this in our testing because we want to compare directly between the MySQL workload and the Spanner testing workload.

We've optimized the MySQL workload by partitioning locally on set_id. This is a good partition column because we get locality among the Terms that correspond to a Set. The consequence of this decision, however, is that queries without a set_id must execute on each partition. With Spanner, we've created this same locality by flipping the ordering of id and set_id in the primary key.
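The effect of the key ordering on locality can be seen in a toy model. Spanner stores rows in primary key order, so the sketch below sorts a handful of rows by each candidate key and measures how far apart the Terms of one Set end up. The data and the function name are hypothetical, and this models only the ordering, not Spanner's splits.

```python
def scan_span(rows, key_columns, set_id):
    """Count the key-ordered rows between the first and last Term of a Set, inclusive."""
    ordered = sorted(rows, key=lambda row: tuple(row[col] for col in key_columns))
    positions = [i for i, row in enumerate(ordered) if row["set_id"] == set_id]
    if not positions:
        return 0
    return positions[-1] - positions[0] + 1


# Hypothetical data: 12 Terms assigned round-robin to 3 Sets.
ROWS = [{"id": i, "set_id": i % 3} for i in range(1, 13)]

if __name__ == "__main__":
    print("key (id, set_id):", scan_span(ROWS, ("id", "set_id"), set_id=0))
    print("key (set_id, id):", scan_span(ROWS, ("set_id", "id"), set_id=0))
```

With set_id first, the four Terms of a Set sit in four adjacent rows; with id first, they are interleaved with Terms from other Sets.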

28 - Our rows are queried in an exponential distribution. There's extreme temporal locality in most workloads like ours, meaning that recently written rows are much more likely to be read. To match this effect, we generate ids on an exponential distribution with coefficients trained using the distribution we see in production.
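A rough sketch of this kind of id generation in Python follows. It assumes ids increase over time, so the highest id is the most recently written row. The mean_age parameter stands in for the coefficients we trained from production and its value here is made up, as are the function name and the table size.

```python
import random


def sample_row_id(max_id, mean_age, rng):
    """Pick an id in [1, max_id] where recently written (higher) ids are exponentially more likely."""
    while True:
        # Distance back from the newest row, drawn from an exponential distribution.
        age = int(rng.expovariate(1.0 / mean_age))
        if age < max_id:  # resample the rare draw that falls before the first row
            return max_id - age


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so load tests are repeatable
    ids = [sample_row_id(1_000_000, 10_000, rng) for _ in range(10)]
    print(ids)
```

With these hypothetical parameters, roughly 63% of sampled ids (1 - 1/e) fall within the newest 10,000 rows.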

30 - The latency makes a big difference for us. As noted earlier, we use Memcached as a cache for some query results. This has important implications on consistency, because Memcached must be kept in sync with the database. If it's not, then the cached results are incorrect. If Cloud Spanner had latency around 1ms we could potentially drop our caching, though this would have implications on total query throughput and may not be cost effective. The point here is: latency matters and lower is always better.

31 - To be more precise: our initial schema design on Cloud Spanner was to make the primary key a compound index of (column A, column B), and then add a secondary index for column B. This more closely matched our MySQL workload, which has that primary key but is partitioned on column B. We found it advantageous with Spanner to flip the order of the primary key to (column B, column A), which allowed us to drop the secondary index and was a big performance win. This example is very specific to our use case; it's easy to imagine a different workload where you would need that secondary index.