Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1

Today, rather than discussing new projects or use cases built on top of CDH, I’d like to switch gears a bit and share some details about the engineering that goes into our products. In this post, I’ll explain the MemStore-Local Allocation Buffer, a new component in the guts of Apache HBase which dramatically reduces the frequency of long garbage collection pauses. While you won’t need to understand these details to use Apache HBase, I hope it will provide an interesting view into the kind of work that engineers at Cloudera do.

This post will be the first in a three part series on this project.

Background

Heaps and heaps of RAM!

Over the last few years, the amount of memory available on inexpensive commodity servers has gone up and up. When the Apache HBase project started in 2007, typical machines running Hadoop had 4-8GB of RAM. Today, most Cloudera customers run with at least 24G of RAM, and larger amounts like 48G or even 72G are becoming increasingly common as costs continue to come down. On the surface, all this new memory capacity seems like a great win for latency-sensitive database software like HBase — with a lot of RAM, more data can fit in cache, avoiding expensive disk seeks on reads, and more data can fit in the memstore, the memory area that buffers writes before they flush to disk.

In practice, however, as typical heap sizes for HBase have crept up and up, the garbage collection algorithms available in production-quality JDKs have remained largely the same. This has led to one major problem for many users of HBase: lengthy stop-the-world garbage collection pauses which get longer and longer as heap sizes continue to grow. What does this mean in practice?

During a stop-the-world pause, any client requests to HBase are stalled, causing user-visible latency or even timeouts. If a request takes over a minute to respond because of a pause, HBase may as well be down – there’s often little value in such a delayed response.

HBase relies on Apache ZooKeeper to track cluster membership and liveness. If a server pauses for a significant amount of time, it will be unable to send heartbeat ping messages to the ZooKeeper quorum, and the rest of the servers will presume that the server has died. This causes the master to initiate certain recovery processes to account for the presumed-dead server. When the server comes out of its pause, it will find all of its leases revoked, and commit suicide. The HBase development team has affectionately dubbed this scenario a Juliet Pause — the master (Romeo) presumes the region server (Juliet) is dead when it’s really just sleeping, and thus takes some drastic action (recovery). When the server wakes up, it sees that a great mistake has been made and takes its own life. Makes for a good play, but a pretty awful failure scenario!

The above issues will be familiar to most who have done serious load testing of an HBase cluster. On typical hardware, it can pause 8-10 seconds per GB of heap — a 8G heap may pause for upwards of a minute. No matter how much tuning one might do, it turns out this problem is completely unavoidable in HBase 0.90.0 or older with today’s production-ready garbage collectors. Since this is such a common issue, and only getting worse, it became a priority for Cloudera at the beginning of the year. In the rest of this post, I’ll describe a solution we developed that largely eliminates the problem.

Java GC Background

In order to understand the GC pause issue thoroughly, it’s important to have some background in Java’s garbage collection techniques. Some simplifications will be made, so I highly encourage you to do further research for all the details.

If you’re already an expert in GC, feel free to skip this section.

Generational GC

Java’s garbage collector typically operates in a generational mode, relying on an assumption called the generational hypothesis: we assume that most objects either die young, or stick around for quite a long time. For example, the buffers used in processing an RPC request will only last for some milliseconds, whereas the data in a cache or the data in the HBase MemStore will likely survive for many minutes.

Given that objects have two different lifetime profiles, it’s intuitive that different garbage collection algorithms might do a better job on one profile than another. So, we split up the heap into two generations: the young (a.k.a new) generation and the old (a.k.a tenured). When objects are allocated, they start in the young generation, where we prefer an algorithm that operates efficiently when most of the data is short-lived. If an object survives several collections inside the young generation, we tenure it by relocating it into the old generation, where we assume that data is likely to die out much more slowly.

In most latency-sensitive workloads like HBase, we recommend the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. This enables the Parallel New collector for the young generation and the Concurrent Mark-Sweep collector for the old generation.

Young Generation – Parallel New Collector

The Parallel New collector is a stop-the-world copying collector. Whenever it runs, it first stops the world, suspending all Java threads. Then, it traces object references to determine which objects are live (still referenced by the program). Lastly, it moves the live objects over to a free section of the heap and updates any pointers into those objects to point to the new addresses. There are a few important points here about this collector:

It stops the world, but not for very long. Because the young generation is usually fairly small, and this collector runs with many threads, it can accomplish its work very quickly. For production workloads we usually recommend a young generation no larger than 512MB, which results in pauses of less than a few hundred milliseconds at the worst case.

It copies the live objects into a free heap area. This has the side effect of compacting the free space – after every collection, the free space in the young generation is one contiguous chunk, which means that allocation can be very efficient.

Each time the Parallel New collector copies an object, it increments a counter for that object. After an object has been copied around in the young generation several times, the algorithm decides that it must belong to the long-lived class of objects, and moves it to the old generation (tenures it). The number of times an object is copied inside the young generation before being tenured is called the tenuring threshold.

Old Generation – Concurrent Mark-Sweep

Every time the parallel new collector runs, it will tenure some objects into the old generation. So, of course, the old generation will eventually fill up, and we need a strategy for collecting it as well. The Concurrent-Mark-Sweep collector (CMS) is responsible for clearing dead objects in this generation.

The CMS collector operates in a series of phases. Some phases stop the world, and others run concurrently with the Java application. The major phases are:

initial-mark (stops the world). In this phase, the CMS collector places a mark on the root objects. A root object is something directly referenced from a live Thread – for example, the local variables in use by that thread. This phase is short because the number of roots is very small.

concurrent-mark (concurrent). The collector now follows every pointer starting from the root objects until it has marked all live objects in the system.

remark (stops the world). Since objects might have had references changed, and new objects might have been created during concurrent-mark, we need to go back and take those into account in this phase. This is short because a special data structure allows us to only inspect those objects that were modified during the prior phase.

concurrent-sweep (concurrent). Now, we proceed through all objects in the heap. Any object without a mark is collected and considered free space. New objects allocated during this time are marked as they are created so that they aren’t accidentally collected.

The important things to note here are:

The stop-the-world phases are made to be very short. The long work of scanning the whole heap and sweeping up the dead objects happens concurrently.

This collector does not relocate the live objects, so free space can be spread in different chunks throughout the heap. We’ll come back to this later!

CMS Failure Modes

As I described it, the CMS collector sounds pretty great – it only pauses for very short times and most of the heavy lifting is done concurrently. So how is it that we see multi-minute pauses when we run HBase under heavy load with large heaps? It turns out that the CMS collector has two failure modes.

Concurrent Mode Failure

The first failure mode, and the one more often discussed, is simple concurrent mode failure. This is best described with an example: suppose that there is an 8GB heap. When the heap is 7GB full, the CMS collector may begin its first phase. It’s happily chugging along with the concurrent-mark phase. Meanwhile, more data is being allocated and tenured into the old generation. If the tenuring rate is too fast, the generation may completely fill up before the collector is done marking. At that point, the program may not proceed because there is no free space to tenure more objects! The collector must abort its concurrent work and fall back to a stop-the-world single-threaded copying collection algorithm. This algorithm relocates all live objects to the beginning of the heap, and frees up all of the dead space. After the long pause, the program may proceed.

It turns out this is fairly easy to avoid with tuning: we simply need to encourage the collector to start its work earlier! Thus, it’s less likely that it will get overrun with new allocations before it’s done with its collection. This is tuned by setting -XX:CMSInitiatingOccupancyFraction=N where N is the percent of heap at which to start the collection process. The HBase region server carefully accounts its memory usage to stay within 60% of the heap, so we usually set this value to around 70.

Promotion Failure due to Fragmentation

This failure mode is a little bit more complicated. Recall that the CMS collector does not relocate objects, but simply tracks all of the separate areas of free space in the heap. As a thought experiment, imagine that I allocate 1 million objects, each 1KB, for a total usage of 1GB in a heap that is exactly 1GB. Then I free every odd-numbered object, so I have 500MB live. However, the free space will be solely made up of 1KB chunks. If I need to allocate a 2KB object, there is nowhere to put it, even though I ostensibly have 500MB of space free. This is termed memory fragmentation. No matter how early I ask the CMS collector to start, since it does not relocate objects, it cannot solve this problem!

When this problem occurs, the collector again falls back to the copying collector, which is able to compact all the objects and free up space.

Enough GC! Back to HBase!

Let’s come back up for air and use what we learned about Java GC to think about HBase. We can make two observations about the pauses we see in HBase:

By setting the CMSInitiatingOccupancyFraction tunable lower, we’ve seen that some users can avoid the GC issue. But for other workloads, it will always happen, no matter how low we set this tuning parameter.

We often see these pauses even when metrics and logs indicate that the heap has several GB of free space!

Given these observations, we hypothesize that our problem must be caused by fragmentation, rather than some kind of memory leak or improper tuning.

An experiment: measuring fragmentation

To confirm this hypothesis, we’ll run an experiment. The first step is to collect some measurements about heap fragmentation. After spelunking in the OpenJDK source code, I discovered the little-known parameter -XX:PrintFLSStatistics=1 which, when combined with other verbose GC logging options, causes the CMS collector to print statistical information about its free space before and after every collection. In particular, the metrics we care about are:

Free space – the total amount of free memory in the old generation

Num chunks – the total number of non-contiguous free chunks of memory

Max chunk size – the size of the largest one of these chunks (i.e the biggest single allocation we can satisfy without a pause)

I enabled this option, started up a cluster, and then ran three separate stress workloads against it using Yahoo Cloud Serving Benchmark (YCSB):

Read-only with cache churn: reads data randomly for 100M distinct row keys, so that the data does not fit in the LRU cache.

Read-only without cache churn: reads data randomly for 10K distinct row keys, so that the data fits entirely in the LRU cache.

Each workload will run at least an hour, so we can collect some good data about the GC behavior under that workload. The goal of this experiment is first to verify our hypothesis that the pauses are caused by fragmentation, and second to determine whether these issues were primarily caused by the read path (including the LRU cache) or the write path (including the MemStores for each region).

To be continued…

The next post in the series will show the results of this experiment and dig into HBase’s internals to understand how the different workloads affect memory layout.

Meanwhile, if you want to learn more about Java’s garbage collectors, I recommend the following links: