Apache HBase Internals: Locking and Multiversion Concurrency Control

The following post was originally published via blog.apache.org; we republish it here for your convenience.

NOTE: This blog post describes how Apache HBase does concurrency control. This assumes knowledge of the HBase write path, which you can read more about in this other blog post.

Introduction

Apache HBase provides a consistent and understandable data model to the user while still offering high performance. In this blog, we’ll first discuss the guarantees of the HBase data model and how they differ from those of a traditional database. Next, we’ll motivate the need for concurrency control by studying concurrent writes and then introduce a simple concurrency control solution. Finally, we’ll study read/write concurrency control and discuss an efficient solution called Multiversion Concurrency Control.

Why Concurrency Control?

In order to understand HBase concurrency control, we first need to understand why HBase needs concurrency control at all; in other words, what properties does HBase guarantee about the data that requires concurrency control?

If you have experience with traditional relational databases, the ACID terms (Atomicity, Consistency, Isolation, Durability) may be familiar to you. Traditional relational databases typically provide ACID semantics across all the data in the database; for performance reasons, HBase only provides ACID semantics on a per-row basis. If you are not familiar with these terms, don’t worry. Instead of dwelling on the precise definitions, let’s look at a couple of examples.

That is, we write to the WAL for disaster recovery purposes and then update an in-memory copy (MemStore) of the data.

Now, assume we have no concurrency control over the writes and consider the following order of events:

Image 2. One possible order of events for two writes

At the end, we are left with the following state:

Image 3. Inconsistent result in absence of write-write synchronization

which is a role I’ve never held. In ACID terms, we have not provided Isolation for the writes, as the two writes became intermixed.

We clearly need some concurrency control. The simplest solution is to provide exclusive locks per row in order to provide isolation for writes that update the same row. So, our new list of steps for writes is as follows (new steps are in blue).
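The per-row exclusive lock described above can be sketched in a few lines of Java. This is a minimal illustration, not HBase's actual implementation: the class `RowLockSketch`, its methods, the row key `row1`, the column names `company` and `role`, and the value `Engineer` are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative sketch only; these are not HBase's actual classes. */
final class RowLockSketch {
    // One exclusive lock per row key, created lazily on first use.
    private final ConcurrentHashMap<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    // row key -> (column -> value); a simplified stand-in for the MemStore.
    private final ConcurrentHashMap<String, Map<String, String>> memStore = new ConcurrentHashMap<>();

    /** Applies every column of one write while holding the row's exclusive lock. */
    void put(String row, Map<String, String> columns) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, k -> new ReentrantLock());
        lock.lock();
        try {
            // A real write path would append to the WAL here before touching memory.
            Map<String, String> cells = memStore.computeIfAbsent(row, k -> new HashMap<>());
            for (Map.Entry<String, String> e : columns.entrySet()) {
                cells.put(e.getKey(), e.getValue());
                Thread.yield(); // invites interleaving; the lock keeps it harmless
            }
        } finally {
            lock.unlock();
        }
    }

    /** Returns a copy of the row; it takes the lock only to keep this sketch thread-safe. */
    Map<String, String> get(String row) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, k -> new ReentrantLock());
        lock.lock();
        try {
            Map<String, String> cells = memStore.get(row);
            return cells == null ? new HashMap<>() : new HashMap<>(cells);
        } finally {
            lock.unlock();
        }
    }

    /** Runs two conflicting writes to the same row on separate threads. */
    static Map<String, String> runTwoConcurrentWrites() {
        RowLockSketch store = new RowLockSketch();
        Map<String, String> first = new HashMap<>();
        first.put("company", "Cloudera");
        first.put("role", "Engineer"); // assumed placeholder value
        Map<String, String> second = new HashMap<>();
        second.put("company", "Restaurant");
        second.put("role", "Waiter");
        Thread t1 = new Thread(() -> store.put("row1", first));
        Thread t2 = new Thread(() -> store.put("row1", second));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return store.get("row1");
    }

    public static void main(String[] args) {
        System.out.println(runTwoConcurrentWrites());
    }
}
```

Whichever thread takes the lock second overwrites the whole row, so the final state is one write or the other, never a mix of the two.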

Read-Write Synchronization

So far, we’ve added row locks to writes in order to guarantee ACID semantics. Do we need to add any concurrency control for reads? Let’s consider another order of events for our example above (note that this order follows the rules in List 2):

Image 4. One possible order of operations for two writes and a read

Assume no concurrency control for reads and that we request a read concurrently with the two writes. Assume the read is executed directly before “Waiter” is written to the MemStore; this read action is represented by a red line above. In that case, we will again read the inconsistent row:

Image 5. Inconsistent result in absence of read-write synchronization

Therefore, we need some concurrency control to deal with read-write synchronization. The simplest solution would be to have the reads obtain and release the row locks in the same manner as the writes. This would resolve the ACID violation, but the downside is that our reads and writes would both contend for the row locks, slowing each other down.

Instead, HBase uses a form of Multiversion Concurrency Control (MVCC) to avoid requiring the reads to obtain row locks. Multiversion Concurrency Control works in HBase as follows:

For writes:
(w1) After acquiring the RowLock, each write operation is immediately assigned a write number.
(w2) Each data cell in the write stores its write number.
(w3) A write operation completes by declaring it is finished with the write number.

For reads:
(r1) Each read operation is first assigned a read timestamp, called a read point.
(r2) The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
(r3) A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.

List 3. Multiversion Concurrency Control steps
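The bookkeeping behind steps w1, w3, r1 and r2 can be sketched as a small tracker. This is a hedged illustration rather than HBase's actual class: the name `ReadPointTracker`, its method names, and the choice to start write numbers at 1 with an initial read point of 0 are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of write-number and read-point bookkeeping; not HBase's actual class. */
final class ReadPointTracker {
    private long nextWriteNumber = 1; // assumed starting value
    private long readPoint = 0;       // every write numbered <= readPoint has completed
    // Write numbers that have completed but sit above a still-running earlier write.
    private final Set<Long> completedAhead = new HashSet<>();

    /** w1: assign the next write number (the caller is assumed to hold the row lock). */
    synchronized long beginWrite() {
        return nextWriteNumber++;
    }

    /** w3: declare the write number finished, then advance the read point as far as possible. */
    synchronized void completeWrite(long writeNumber) {
        completedAhead.add(writeNumber);
        // r2: the read point is the highest x such that all writes <= x have completed.
        while (completedAhead.remove(readPoint + 1)) {
            readPoint++;
        }
    }

    /** r1: a new read takes the current read point. */
    synchronized long currentReadPoint() {
        return readPoint;
    }
}
```

In this sketch, if write 2 finishes before write 1, the read point stays at 0 until write 1 also completes, and then jumps straight to 2. That is the property that keeps a reader from seeing a later write while an earlier one is still in flight.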

Let’s look at the operations in Image 4 again, this time using Multiversion Concurrency Control:

Image 6. Write steps with Multiversion Concurrency Control

Notice the new steps introduced for Multiversion Concurrency Control. Each write is assigned a write number (step w1), each data cell is written to the MemStore with its write number (step w2, e.g. “Cloudera [wn=1]”), and each write completes by finishing its write number (step w3).

Now, let’s consider the read in Image 4, i.e. a read that begins after step “Restaurant [wn=2]” but before the step “Waiter [wn=2]”. Write 2 has not yet completed at that moment, so by rules r1 and r2 the read point will be 1. By r3, the read will return the values with a write number of 1, leaving us with:

Image 7. Consistent answer with Multiversion Concurrency Control

A consistent response without requiring locking the row for the reads!
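Rule r3 and the read just described can be replayed with a short sketch. It is an illustration under stated assumptions, not HBase code: the class `MvccReadSketch`, the row key `row1`, the column names `company` and `role`, and the value `Engineer` for the first write's role are hypothetical. The read point of 1 is taken as given from rules r1 and r2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of an MVCC read (rule r3); not HBase's actual classes. */
final class MvccReadSketch {
    /** One data cell tagged with the number of the write that produced it (step w2). */
    static final class Cell {
        final String row;
        final String column;
        final String value;
        final long writeNumber;

        Cell(String row, String column, String value, long writeNumber) {
            this.row = row;
            this.column = column;
            this.value = value;
            this.writeNumber = writeNumber;
        }
    }

    /** r3: return the matching cell with the largest write number <= readPoint. */
    static Optional<String> read(List<Cell> memStore, String row, String column, long readPoint) {
        Cell best = null;
        for (Cell c : memStore) {
            if (!c.row.equals(row) || !c.column.equals(column)) {
                continue; // different (row, column)
            }
            if (c.writeNumber > readPoint) {
                continue; // written by a write that is not yet visible to this read
            }
            if (best == null || c.writeNumber > best.writeNumber) {
                best = c;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.value);
    }

    /** MemStore contents at the moment of the read: "Waiter [wn=2]" is not yet written. */
    static List<Cell> memStoreAtRead() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("row1", "company", "Cloudera", 1));
        cells.add(new Cell("row1", "role", "Engineer", 1)); // assumed placeholder value
        cells.add(new Cell("row1", "company", "Restaurant", 2));
        return cells;
    }

    public static void main(String[] args) {
        long readPoint = 1; // write 2 has not completed, so the read point is 1
        System.out.println(read(memStoreAtRead(), "row1", "company", readPoint));
        System.out.println(read(memStoreAtRead(), "row1", "role", readPoint));
    }
}
```

Although “Restaurant [wn=2]” is already in the MemStore, the read skips it because its write number exceeds the read point, so both columns come from write 1.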

Let’s put this all together by listing the steps for a write with Multiversion Concurrency Control (new steps required for read-write synchronization are in red):

Conclusion

In this blog we first defined HBase’s row-level ACID guarantees. We then demonstrated the need for concurrency control by studying concurrent writes and introduced a row-level locking solution. Finally, we investigated read-write concurrency control and presented an efficient mechanism called Multiversion Concurrency Control (MVCC).

This blog post is accurate as of HBase 0.92. HBase 0.94 has various optimizations, e.g. HBASE-5541, that will be described in a future blog post.

Gregory Chanan is a Software Engineer at Cloudera and an HBase committer.