Author: rayokota

This week Google released Cloud Spanner [1], a publicly available version of their Spanner database. This completes the public release of their 3 main databases: Bigtable (released as Cloud Bigtable), Megastore (released as Cloud Datastore), and Spanner. Spanner, the culmination of Google’s research in data stores, is a globally distributed, relational database that is both strongly consistent and highly available.

But doesn’t the CAP theorem state that we have to choose consistency over availability, or availability over consistency? Over the years, Google has been arguing that you can have both strong consistency and high availability, and that you don’t have to settle for eventual consistency. In fact, all 3 of Google’s data stores are strongly consistent systems.

Some Background

In 2000, Brewer came up with the CAP conjecture [2], which was later proved as a theorem by Gilbert and Lynch [3]. It states that you can choose at most 2 of the following 3 properties:

C: consistency (or linearizability)

A: 100% availability (in the context of network partitions)

P: tolerance of network partitions

Later Coda Hale made the point that you can’t sacrifice partition tolerance, so really the choice is between CP and AP (and not CA) [4].

What is the tradeoff?

According to the CAP theorem, when you choose a data store, you must choose either an AP system (that is eventually consistent) or a CP system (that is strongly consistent). But Google would argue the following points:

In AP systems, client code becomes more complex and error-prone in order to deal with inconsistencies.

AP systems are not 100% available in practice.

CP systems can be made to be highly available in practice.

Taken together, these points mean that when you choose availability over consistency, you are not actually gaining 100% availability; you are losing consistency and gaining complexity.

Let’s drill down into these points.

Client complexity

Here is what Google has to say about using AP systems:

“We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.” [5]

This has led Google to focus on data stores that are CP.

AP systems in practice

Many engineers are confused about the definition of “availability” in the CAP theorem. Most engineers think of availability in terms of a service level agreement (SLA) or a service level objective (SLO), which is typically measured in “9s”. However, as Kleppmann has pointed out, the “availability” in the CAP theorem is not a measurement or a metric, but a liveness property of an algorithm [6]. I am going to distinguish between the two types of availability by referring to them as “effective availability” and “algorithmic availability”.

Effective availability: the empirically measured percentage of successful requests over some period, often measured in “9s”.

Algorithmic availability: a liveness property of an algorithm where every request to a non-failing node must eventually return a valid response.

The CAP theorem is only concerned with algorithmic availability. An algorithmic availability of 100% does not guarantee an effective availability of 100%: the algorithmic availability from the CAP theorem only applies if both the implementation and the execution of the algorithm are without error. In practice, most outages of an AP system are not due to network issues, which the algorithm can handle, but rather to implementation defects, user errors, misconfiguration, resource limits, and misbehaving clients. Google found that in Spanner only 7.6% of its errors were network-related, whereas 52.5% of errors were user-related (such as overload and misconfiguration) and 13.3% of errors were due to bugs. Google actually refers to these errors as “incidents”, since they were able to prevent most of them from affecting availability [7].

At Yammer we have experience with AP systems, and we’ve seen loss of availability for both Cassandra and Riak for various reasons. Our AP systems have not been more reliable than our CP systems, yet they have been more difficult to work with and reason about in the presence of inconsistencies. Other companies have also seen outages with AP systems in production [8]. So in practice, AP systems are just as susceptible as CP systems to outages due to issues such as human error and buggy code, both on the client side and the server side.

CP systems in practice

With Spanner, Google is able to attain an availability of 5 “9s”, which allows for only 5.26 minutes of downtime per year [7]. Likewise, Facebook uses HBase, another CP system modeled after Bigtable, and claims to attain an availability of between 4 and 5 “9s” [9]. In practice, mature CP systems can be made to be highly available. In fact, due to its strong consistency and high availability, Google refers to Spanner as “effectively” CA, which means they are focusing on effective availability (a practical measure) and not algorithmic availability (a theoretical property).
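As a quick sanity check on that figure: 5 “9s” corresponds to 99.999% uptime, which allows at most 365.25 × 24 × 60 × (1 − 0.99999) ≈ 5.26 minutes of downtime per year.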

A bad tradeoff?

With an AP system, you are giving up consistency without really gaining anything in terms of effective availability, the type of availability you actually care about. Some might think you can regain strong consistency in an AP system by using strict quorums (where the number of nodes written + number of nodes read > number of replicas). Cassandra calls this “tunable consistency”. However, Kleppmann has shown that even with strict quorums, inconsistencies can result [10]. So when choosing (algorithmic) availability over consistency, you are giving up consistency for not much in return, while also adding complexity to clients that must deal with inconsistencies.
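To make the strict quorum condition concrete: with N replicas, W nodes acknowledging each write, and R nodes consulted on each read, a strict quorum requires W + R > N; for example, N = 3 with W = 2 and R = 2 satisfies it. Kleppmann’s point is that even when this inequality holds, certain interleavings (such as partially failed writes, or concurrent writes resolved by last-write-wins) can still produce reads that are not linearizable.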

Summary

There’s nothing wrong with using an AP system in general. An AP system might exhibit the lower latencies that you require (such as with a cache), or perhaps your data is immutable so you don’t care as much about strong consistency, or perhaps 99.9% consistency is “good enough” [11]. These are all valid reasons for accepting eventual consistency. However, in practice AP systems are not necessarily more highly available than CP systems, so don’t settle for eventual consistency in order to gain availability. The availability you think you will be getting (effective) is not the availability you will actually get (algorithmic), and the latter is not as useful as you might think.

HGraphDB is a client framework for HBase that provides a TinkerPop Graph API. HGraphDB also provides integration with Apache Giraph, a graph compute engine for analyzing graphs that Facebook has shown to be massively scalable. In this blog we will show how to convert a sample Giraph computation that works with text files to instead work with HGraphDB.

In the Giraph quick start, the SimpleShortestPathsComputation is used to show how to run a Giraph computation against a graph contained in a file as a JSON representation. Here are the contents of the JSON file:
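[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

(This is the sample graph from the Giraph quick start; each line has the form [vertex id, vertex value, [[target vertex id, edge weight], ...]].)

To run the same computation with HGraphDB, the graph first needs to exist in HBase. Here is a minimal sketch of loading the graph above through the TinkerPop API; the HBaseGraphConfiguration setters shown are based on the HGraphDB documentation and may differ by version.

// Imports from io.hgraphdb.*, org.apache.tinkerpop.gremlin.structure.*, and
// org.apache.tinkerpop.gremlin.structure.util.GraphFactory are omitted.
HBaseGraphConfiguration cfg = new HBaseGraphConfiguration()
    .setInstanceType(HBaseGraphConfiguration.InstanceType.DISTRIBUTED)  // assumed setter names
    .setGraphNamespace("testgraph")
    .setCreateTables(true);
HBaseGraph graph = (HBaseGraph) GraphFactory.open(cfg);

// Create the five vertices with user-supplied IDs
Vertex[] v = new Vertex[5];
for (int i = 0; i < 5; i++) {
    v[i] = graph.addVertex(T.id, (long) i);
}

// Add the weighted edges from the JSON file above: {source, target, weight}
long[][] edges = {
    {0, 1, 1}, {0, 3, 3}, {1, 0, 1}, {1, 2, 2}, {1, 3, 1}, {2, 1, 2},
    {2, 4, 4}, {3, 0, 3}, {3, 1, 1}, {3, 4, 4}, {4, 3, 4}, {4, 2, 4}};
for (long[] e : edges) {
    v[(int) e[0]].addEdge("links", v[(int) e[1]], "weight", e[2]);
}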

There is also a class called HBaseBulkLoader that can be used for more efficient creation of larger graphs.

Instead of using the JSON input format above, HGraphDB provides two input formats, HBaseVertexInputFormat and HBaseEdgeInputFormat, which will read from the vertices table and edges table in HBase, respectively. To use these formats, the Giraph computation needs to be changed slightly. Here is the original SimpleShortestPathsComputation:
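(The following is abridged from the Giraph examples module; imports and logging are omitted.)

public class SimpleShortestPathsComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  /** The source vertex id from which shortest paths are computed */
  public static final LongConfOption SOURCE_ID =
      new LongConfOption("SimpleShortestPathsVertex.sourceId", 1, "The shortest paths id");

  private boolean isSource(Vertex<LongWritable, ?, ?> vertex) {
    return vertex.getId().get() == SOURCE_ID.get(getConf());
  }

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // The shortest distance seen so far is the minimum over all incoming messages
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If we found a shorter path, update our value and notify our neighbors
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        double distance = minDist + edge.getValue().get();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }
    vertex.voteToHalt();
  }
}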

The major difference is that when using HBaseVertexInputFormat, the “value” of a Giraph vertex is an instance of type VertexValueWritable, which is comprised of an HBaseVertex and a Writable value. Likewise when using HBaseEdgeInputFormat, the “value” of a Giraph edge is an instance of type EdgeValueWritable, which is comprised of an HBaseEdge and a Writable value. The instances of HBaseVertex and HBaseEdge should be considered read-only and only be used to obtain IDs and property values.
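As a rough sketch, the body of the adapted compute() starts by unwrapping the HGraphDB value; the accessor names getVertex() and getValue() below are assumptions based on the description above, not necessarily the exact HGraphDB API.

// Unwrap the HGraphDB wrapper to get at the underlying data
VertexValueWritable vertexValue = vertex.getValue();
HBaseVertex hbaseVertex = vertexValue.getVertex();                   // read-only: IDs and properties
DoubleWritable distance = (DoubleWritable) vertexValue.getValue();   // the Giraph vertex value

The rest of the shortest-paths logic then operates on the unwrapped Writable value as before.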

Running the above Giraph computation against HBase is similar to running the original example. Note that we also have to customize IdWithValueTextOutputFormat to work properly with VertexValueWritable.

As an alternative to using a text-based output format such as IdWithValueTextOutputFormat, HGraphDB provides two abstract output formats, HBaseVertexOutputFormat and HBaseEdgeOutputFormat, that can be used to modify the graph after a Giraph computation. For example, the shortest path result for each vertex could be set as a property on the vertex by extending HBaseVertexOutputFormat and implementing its vertex-writing method, as sketched below.
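Here is a hypothetical sketch of such an output format; the actual abstract method name and signature are defined by HBaseVertexOutputFormat and may differ from what is shown here.

public class ShortestPathsVertexOutputFormat extends HBaseVertexOutputFormat {
    // Hypothetical hook invoked once per vertex after the computation finishes
    @Override
    public void writeVertex(HBaseGraph graph, HBaseVertex vertex, Writable value) {
        // Store the computed shortest-path distance back on the HGraphDB vertex
        vertex.setProperty("shortestPath", ((DoubleWritable) value).get());
    }
}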

The use of graph databases is common among social networking companies. A social network can easily be represented as a graph model, so a graph database is a natural fit. For instance, Facebook has a graph database called Tao, Twitter has FlockDB, and Pinterest has Zen. At Yammer, an enterprise social network, we rely on HBase for much of our messaging infrastructure, so I decided to see if HBase could also be used for some graph modelling and analysis.

Below I put together a wish list of what I wanted to see in a graph database.

It should support the TinkerPop 3 API.

It should support user-supplied IDs for both vertices and edges.

It should allow property values to be of arbitrary type, including maps, arrays, and serializable objects.

It should support indexing vertices by label and property.

It should support indexing edges by label and property, specific to a given vertex.

It should support range queries and pagination with both vertex indices and edge indices.

I did not find a graph database that met all of the above criteria. For instance, Titan is a graph database that supports the TinkerPop API, but it is not implemented directly on HBase. Rather, it is implemented on top of an abstraction layer that can be integrated with HBase, Cassandra, or Berkeley DB as its underlying store. Also, Titan does not support user-supplied IDs. S2Graph is a graph database that is implemented directly on HBase, and it supports both user-supplied IDs and indices on edges, but it does not yet support the TinkerPop API nor does it support indices on vertices.

This led me to create HGraphDB, a TinkerPop 3 layer for HBase. It provides support for all of the above bullet points. Feel free to try it out if you are interested in using HBase as a graph database.

Two of the most useful and powerful features of HBase are its support for server-side filters and coprocessors. For example, custom filters can be used for efficient pagination, while custom coprocessors can provide endpoints for efficient aggregation of data in HBase. In addition, more sophisticated filters and coprocessors can be used to turn HBase into an entirely different data store, such as a JSON document store (HDocDB) or a relational database (Phoenix), among others.

While working with custom filters, I ran into a couple of issues that I didn’t find documented elsewhere (perhaps I missed them), so I thought I’d jot them down here to benefit others.

First, when writing a custom filter, the cells passed to the filterKeyValue method are a superset of the cells that will be returned to the client. The main reason for this is that even though a column family may be specified to retain only one version of a cell, multiple versions of the cell may still exist in the store because a compaction has not yet taken place, and the pruning of versions in the query result doesn’t happen until after filterKeyValue is called. This actually took me by surprise, as I didn’t find it documented anywhere, and my initial mental model assumed that the pruning of versions would happen before this method was called. (Update: This has since been filed as HBASE-17125.)

The second tip is in regard to the filterRowCells method. This method gives you the list of cells that have passed previous filter methods, and allows you to modify it before it is passed to the next phase of the filter pipeline. For example, here is how the DependentColumnFilter in HBase uses this method to filter out cells that don’t have a matching timestamp.
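The relevant part looks roughly like this (paraphrased from the HBase source at the time):

@Override
public void filterRowCells(List<Cell> kvs) {
  Iterator<? extends Cell> it = kvs.iterator();
  while (it.hasNext()) {
    Cell kv = it.next();
    // Drop any cell whose timestamp was not seen on the dependent column
    if (!stampSet.contains(kv.getTimestamp())) {
      it.remove();
    }
  }
}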

However, when implementing filterRowCells, the Iterator.remove method should not be used. This is because the underlying list of cells is passed as an ArrayList, and Iterator.remove is an O(n) operation for instances of ArrayList. As more and more elements are removed from within filterRowCells, the time complexity of this operation will begin to approach O(n²). Instead, the Guava method Iterables.removeIf should be preferred (or Collection.removeIf, if you are using Java 8).

The Iterables.removeIf method will check to see if the Iterable passed to it is an instance of RandomAccess (which is true for ArrayList), and if so, will remove all elements that pass the specified Predicate in total O(n) time (by making use of ArrayList.set).
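For example, the filterRowCells implementation above could be rewritten along these lines (using a Java 8 lambda as the Guava Predicate):

@Override
public void filterRowCells(List<Cell> kvs) {
  // A single O(n) pass: surviving cells are compacted in place via ArrayList.set
  Iterables.removeIf(kvs, kv -> !stampSet.contains(kv.getTimestamp()));
}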

One of our queries using a custom filter was passing tens of thousands of cells to filterRowCells and filtering a majority of the cells out using Iterator.remove. After changing the custom filter to use Iterables.removeIf, the query time dropped from 800 ms to 250 ms.

Since HBase already uses the Iterables class from Guava, I’ve submitted HBASE-16893 and PHOENIX-3393 to change the filters in the HBase and Phoenix codebases to use Iterables.removeIf instead of Iterator.remove.

When using HBase, it is often desirable to encrypt data in transit between an HBase client and an HBase server. This might be the case, for example, when storing PII (Personally Identifiable Information) in HBase, or when running HBase in a multi-tenant cloud environment.

Transport encryption is often enabled by configuring HBase to use SASL with GSSAPI/Kerberos to provide data confidentiality and integrity on a per-connection basis. However, the default implementation of GSSAPI/Kerberos does not seem to make use of AES-NI hardware acceleration. In our testing, we have seen up to a 50% increase in P75 latencies for some of our HBase applications when using GSSAPI/Kerberos encryption versus no encryption.

One workaround is to bypass the encryption used by SASL and use an encryption library that can support AES-NI acceleration. This effort has already been completed for HDFS (HDFS-6606) and is in progress for Hadoop (HADOOP-10768). Based on some of this earlier work, similar changes can be made for HBase.

The way that the fix for HADOOP-10768 works is conceptually as follows. If the Hadoop client has been configured to negotiate a cipher suite in place of the one negotiated by SASL, then the following actions will take place:

The client will send the server a set of cipher suites that it supports.

The server will negotiate a mutually acceptable cipher suite.

At the end of the SASL handshake, the server will generate a pair of encryption keys using the cipher suite and send them to the client via the secure SASL channel.

The generated encryption keys, instead of the SASL layer, will be used to encrypt all subsequent traffic between the client and server.

Originally I was hoping that the work for HADOOP-10768 would be easily portable to the HBase codebase. It seems that some of the HBase code for SASL support originated from the corresponding Hadoop code, but has since diverged. For example, when performing the SASL handshake, the Hadoop client and server use protocol buffers to wrap the SASL state and SASL token, whereas the HBase client and server do not use protocol buffers when passing this data.

Instead, in HBase, during the SASL handshake the client sends

The integer length of the SASL token

The bytes of the SASL token

whereas the server sends

An integer which is either 0 for success or 1 for failure

In the case of success,

The integer length of the SASL token

The bytes of the SASL token

In the case of failure,

A string representing the class of the Exception

A string representing an error message

There is one exception to the above scheme: if the server sends a special integer, SWITCH_TO_SIMPLE_AUTH (represented as -88), in place of the length of the SASL token, the rest of the message is ignored and the client falls back to simple authentication instead of completing the SASL handshake.

In order to adapt the fix for HADOOP-10768 for HBase, I decided to use another special integer called USE_NEGOTIATED_CIPHER (represented as -89) for messages related to cipher suite negotiation between client and server. If the client is configured to negotiate a cipher suite, then at the beginning of the SASL handshake, in place of a message containing only the length and bytes of a SASL token, it will send a message of the form

USE_NEGOTIATED_CIPHER (-89)

A string representing the acceptable cipher suites

The integer length of the SASL token

The bytes of the SASL token

And at the end of the SASL handshake, the server will send one additional message of the form

A zero for success

USE_NEGOTIATED_CIPHER (-89)

A string representing the negotiated cipher suite

A pair of encryption keys

A pair of initialization vectors

We can turn on DEBUG logging for HBase to see what the client and server SASL negotiation normally looks like, without the custom cipher negotiation. Here is the client:

Creating SASL GSSAPI client. Server's Kerberos principal name is XXXX
Have sent token of size 688 from initSASLContext.
Will read input token of size 108 for processing by initSASLContext
Will send token of size 0 from initSASLContext.
Will read input token of size 32 for processing by initSASLContext
Will send token of size 32 from initSASLContext.
SASL client context established. Negotiated QoP: auth-conf

Once the cipher suite negotiation is complete, both the client and server will have created an instance of SaslCryptoCodec to perform the encryption. The client will call SaslCryptoCodec.wrap()/unwrap() instead of SaslClient.wrap()/unwrap() while the server will call SaslCryptoCodec.wrap()/unwrap() instead of SaslServer.wrap()/unwrap(). This is the same technique as used in HADOOP-10768.

With the above code deployed to our production servers, we can compare the latencies of different encryption modes for one of our HBase applications. (In order to run clients in different modes we have also patched our HBase servers with the fix for HBASE-14865.) Below we show the P50, P75, and P95 latencies over a 12 hour period. The higher line is an HBase client configured with GSSAPI/Kerberos encryption (higher is worse), the middle line is an HBase client configured with accelerated encryption, and the lower line is an HBase client configured with no encryption.

Also, here is the user CPU time for the three differently configured HBase clients (GSSAPI/Kerberos encryption, accelerated encryption, no encryption).

We can see that accelerated encryption provides a significant performance improvement over GSSAPI/Kerberos encryption. The changes I made to HBase in order to support accelerated encryption are available at HBASE-16633.

Recently I noticed that several NoSQL stores that claim to be multi-model data stores are implemented on top of a key-value layer. By using simple key-value pairs, such stores are able to support both documents and graphs.

A wide column store such as HBase seems like a more natural fit for a multi-model data store, since a key-value pair is just a row with a single column. There are many graph stores built on top of HBase, such as Zen, Titan, and S2Graph. However, I couldn’t find any document stores built on top of HBase. So I decided to see how hard it would be to create a document layer for HBase, which I call HDocDB.

Document databases tend to provide three different types of APIs: language-specific client APIs (MongoDB), REST APIs (CouchDB), and SQL-like APIs (Couchbase, Azure DocumentDB). For HDocDB, I decided to use a Java client library called OJAI.
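Here is a minimal sketch of storing and retrieving a document; the HDocumentDB entry point and its getCollection method are assumptions based on the HDocDB README, while the Document and DocumentStore calls are standard OJAI.

// Connect to HBase and obtain a document collection (backed by an HBase table)
Configuration conf = HBaseConfiguration.create();
HDocumentDB hdocdb = new HDocumentDB(conf);                  // assumed entry point
DocumentStore contacts = hdocdb.getCollection("contacts");   // assumed accessor

// Build and store a JSON document using the OJAI API
Document doc = Json.newDocument()
    .set("firstName", "John")
    .set("lastName", "Doe")
    .set("address.city", "San Francisco")
    .set("address.zip", "94105");
contacts.insert("jdoe", doc);

// Read it back by ID
Document found = contacts.findById("jdoe");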

One nice characteristic of HBase is that multiple operations to the same row can be performed atomically. If a document can be stored in columns that all reside in the same row, then the document can be kept consistent when modifications are made to different columns that comprise the document. Many graph layers on top of HBase use a “tall table” model where edges for the same graph are stored in different rows. Since operations which span rows are not atomic in HBase, inconsistencies can arise in a graph, which must be fixed using other means (batch jobs, map-reduce, etc.). By storing a single document using a single row, situations involving inconsistent documents can be avoided.

To store a document in a single row, we use a strategy called “shredding” that was developed when researchers first attempted to store XML in relational databases. In the case of JSON, which is easier to store than XML (due to the lack of schema and no requirement for preserving document order except in the case of arrays), we use a technique called key-flattening that was developed for the research system Argo. When key-flattening is adapted to HBase, each scalar value in the document is stored as a separate column, with the column qualifier being the full JSON path to that value. This allows different parts of the document to be read and modified independently.
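As an illustration, a document such as

{"name": "Alice", "address": {"city": "NYC", "zip": "10001"}, "tags": ["a", "b"]}

would be shredded into a single row with roughly the following columns (the exact qualifier encoding used by HDocDB may differ):

column qualifier   value
name               "Alice"
address.city       "NYC"
address.zip        "10001"
tags[0]            "a"
tags[1]            "b"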

For HDocDB, I also added basic support for global secondary indexes. The implementation is based on techniques described by Hofhansl and Yates. For more sophisticated indexing, one can integrate HDocDB with Elasticsearch or Solr.

Since OJAI is integrated with Jackson, it is also easy to store plain Java objects into HDocDB. This means that HDocDB can also be seen as an object layer on top of HBase. We can now say HBase supports the following models:

Key-value

Wide column

Document (HDocDB)

Graph (Titan, Zen, S2Graph)

SQL (Phoenix)

Object (HDocDB)

So not only can HBase be seen as a solid CP store (as shown in a previous blog), but it can also be seen as a solid multi-model store.

The following post originally appeared in the Yammer Engineering blog on September 10, 2014.

In a previous blog, I demonstrated some good results for HBase using an automated test framework called Jepsen. In fact, they may have seemed too good. HBase is designed for strong consistency, yet it also seemed to exhibit extraordinary availability during a network partition. How was this possible? Apparently, HBase clients will retry operations when they fail. This can be seen more clearly in the sample run below:

During the network partition, no requests are successful; after the partition is healed, requests are able to succeed, and the request latencies slowly decrease.

Here is a longer network partition showing much greater latencies:

In fact, if the network partition is long enough, the HBase client will start to report timeouts:

Timed out waiting for some tasks to complete!

In such cases, not all requests will be successfully processed. Here is a typical result:

The following post originally appeared in the Yammer Engineering blog on September 10, 2014.

The Yammer architecture has evolved from a monolithic Rails application to a micro-service based architecture. Our services share many commonalities, including the use of Dropwizard and Metrics. However, as with many micro-service based architectures, each micro-service may have very different needs in terms of persistence. This has led to the adoption of polyglot persistence here at Yammer.

As such, engineers at Yammer are very interested in understanding and evaluating the differences between various databases. One particularly enlightening exposition has been the Call Me Maybe series by Aphyr. This series demonstrates how various databases behave in the presence of network partitions caused by an automated test framework called Jepsen(*). Some of the databases it covers are Postgres, Redis, MongoDB, Riak, Cassandra, and FoundationDB. I have often wondered how HBase would behave under Jepsen. Let’s try augmenting Jepsen to find out.

Jepsen actually consists of two parts. The first part is a provisioning framework written using salticid. This provisions a five-node Ubuntu cluster, where the nodes are referred to as n1, n2, n3, n4, and n5. It can then install and set up the desired database. The second part is a set of runtime tests written in Clojure. (Aphyr also has a great tutorial on Clojure called Clojure from the ground up.)

As I don’t have much experience with HBase, I decided to forgo the salticid customizations and simply use Cloudera Manager to set up a five-node HBase cluster on EC2. I used CDH 5.1.2, which bundles hbase-0.98.1+cdh5.1.2+70. I then modified the /etc/hosts file on each of the nodes to add entries for n1 through n5 (with n1 being the ZooKeeper master). With that step done, it was time to write the actual tests, which are available here.

The first test I wrote, hbase-app, is based on one of the Postgres tests. It simply adds a single row for each number. While it is running, Jepsen uses iptables to simulate a network partition within the cluster. Let’s see how it behaves.

All 2000 writes succeeded and there was no data loss. So far, so good.

The second test I wrote, hbase-append-app, is based on one of the FoundationDB tests. It repeatedly writes to the same cell by attempting to append to a list stored as a blob while a network partition occurs. The test uses the checkAndPut call, which allows for atomic read-modify-write operations.
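The read-modify-write that each client performs looks roughly like the following sketch, written against the HBase 0.98 client API; appendElement is a hypothetical helper that serializes the list with the new element added.

HTableInterface table = connection.getTable("jepsen");   // connection setup omitted
byte[] row = Bytes.toBytes("a");
byte[] cf = Bytes.toBytes("cf");
byte[] q = Bytes.toBytes("elements");

boolean added = false;
while (!added) {
  // Read the current list, stored as a blob in a single cell
  Result r = table.get(new Get(row).addColumn(cf, q));
  byte[] current = r.getValue(cf, q);                 // null if the cell does not exist yet
  byte[] updated = appendElement(current, element);   // hypothetical helper
  Put put = new Put(row);
  put.add(cf, q, updated);
  // The put is applied only if the cell still contains exactly the value we read
  added = table.checkAndPut(row, cf, q, current, put);
}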

This time not all writes succeeded, since all clients are attempting to write to the same cell, and the chance is high that another client will write to the cell between the read and write of a given client’s read-modify-write operation. However, no data loss occurred, as all 282 successful writes are apparent in the final result.

The third test I wrote, hbase-isolation-app, is based on one of the Cassandra tests. It modifies two cells in the same row while a network partition occurs, to test if row updates are atomic.

Nice. Row updates are atomic as HBase claims. However, the above test modifies two cells in the same column family. Let’s try modifying two cells in different column families, but in the same row. From what I understand of HBase, cells in different column families are stored in different HBase “stores”. I ran a fourth test, hbase-isolation-multiple-cf-app, to see if it made a difference.
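The update in this multi-column-family case is conceptually just a single Put touching two families in the same row, roughly as follows:

// Both cells live in the same row but in different column families;
// HBase still promises to apply the Put atomically, which is what the test verifies
byte[] value = Bytes.toBytes("1");   // the same payload is written to both cells
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("a"), value);
put.add(Bytes.toBytes("cf2"), Bytes.toBytes("b"), value);
table.put(put);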

Finally, HBase claims to have atomic counters. In Dynamo-based systems such as Cassandra and Riak, a counter needs to be a CRDT to behave properly. Let’s see how HBase counters behave. I used a fifth test, hbase-counter-app, that is also based on one of the Cassandra tests.
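In this test, each client simply issues a server-side atomic increment, roughly:

// Atomically adds 1 to the counter cell on the region server and returns the new value
long count = table.incrementColumnValue(
    Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);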

HBase performed well in all five of the above tests. The claims in its documentation concerning atomic row updates and counter operations held up. I’m looking forward to learning more about HBase, which so far appears to be a very solid database.