December 18, 2013

My phone rang at 4am one day last spring. When I dug it out from under my pillow, I wasn't greeted by the automated PagerDuty voice, which makes sense; I wasn't on call. It was the lead developer at 8thBridge, the social commerce startup where I do operations, and he didn't sound happy. "The HBase cluster is gone," he said. "Amazon says the instances are terminated. Can you fix it?"

Spoiler alert: the answer to that question turned out to be yes. In the process (and while stabilizing our cluster), I learned an assortment of things that weren't entirely clear to me from the AWS docs. This SysAdvent offering is not a step-by-step how-to; it's a collection of observations, anecdotes, and breadcrumbs for future googling.

You'll notice that Amazon Web Services calls its Elastic MapReduce Hadoop clusters "job flows". The term reveals a significant assumption about your likely workflow: you are expected to spin up a job flow, load data, crunch that data, send the results elsewhere in a pipeline, and terminate the job flow. There is some mention of data warehousing in the docs, but the defaults are geared towards loading data in from somewhere external to your cluster (often S3).

Since AWS expects you to be launching and terminating clusters regularly, their config examples are either in the form of "bootstrap actions" (configuration options you can only pass to a cluster at start time; they run after instance launch but before daemons start) or "job flow steps" (commands you can run against your existing cluster while it is operational). The cluster lifecycle image in the AWS documentation makes this clear.

Because we don't launch clusters with the CLI but rather via the boto Python interface, we start HBase ourselves with a bootstrap action after instance launch.
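A sketch of such a launch with boto 2.x follows. The bootstrap-action S3 path, instance counts, and sizes here are illustrative assumptions, not our production values; check the current AWS docs for the real install path:

```python
# Sketch of launching a long-running HBase job flow with boto 2.x.
# The setup-hbase S3 path and the instance sizes are assumptions.
JOBFLOW_ARGS = {
    'name': 'hbase-production',
    'keep_alive': True,               # --alive: survive job completion
    'num_instances': 5,               # 1 Master + 4 Core
    'master_instance_type': 'm1.xlarge',
    'slave_instance_type': 'm1.xlarge',
}

def launch_hbase_cluster(conn):
    """conn: a boto.emr.connection.EmrConnection. Returns the jobflow id."""
    from boto.emr.bootstrap_action import BootstrapAction
    install_hbase = BootstrapAction(
        'Install HBase',
        's3://elasticmapreduce/bootstrap-actions/setup-hbase',  # assumed path
        [])
    jobflow_id = conn.run_jobflow(bootstrap_actions=[install_hbase],
                                  **JOBFLOW_ARGS)
    # Launched this way, the cluster is NOT automatically
    # termination-protected; set it explicitly:
    conn.set_termination_protection(jobflow_id, True)
    return jobflow_id
```

Calling set_termination_protection immediately after launch is what closes the gap discussed next.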

When AWS support says that clusters running HBase are automatically termination-protected, they mean "only if you launched them with the --hbase option or its GUI equivalent".

There's also overlap in their terms. The options for long-running clusters show an example of setting "Auto-terminate" to No. This is the "Keep Alive" setting (--alive with the CLI) that prevents automatic cluster termination when a job ends successfully; it's not the same as Termination Protection, which prevents automatic cluster termination due to errors (human or machine). You'll want to set both if you're using HBase.

In our case, the cluster hit a bug in the Amazon Machine Image and crashed, which then led to automatic termination. Lesson the first: you can prevent this from happening to you!

Master group

For whatever size of cluster you launch, EMR will assign one node to be your Master node (the sole member of the Master group). The Master won't be running the map-reduce jobs; rather, it will be directing the work of the other nodes. You can choose a different-sized instance for your Master node, but we run the same size as we do for the others, since it actually does need plenty of memory and CPU. The Master node will also govern your HBase cluster.

Core group

By default, after one node is added to the Master group, EMR will assign the rest of the nodes in your cluster to what it terms the Core group; these are slave nodes in standard Hadoop terms. The Master, Core, and Task nodes (more on those in a moment) will all need to talk to one another. A lot. On ports you won't anticipate. Put them in security groups you open completely to one another (though not, obviously, the world - that's what SSH tunnels are for).

Every now and again, a Core node will become unreachable (like any EC2 instance can). These aren't EBS-backed instances; you can't stop and re-launch them. If they cannot be rebooted, you will need to terminate them and let them be automatically replaced. So, having each of your blocks replicated to more than one instance's local disk is wise. Also, there's a cluster ID file in HDFS called /hbase/hbase.id which HBase needs in order to work. You don't want to lose the instance with the only copy of that, or you'll need to restore it from backup.
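HDFS block replication is governed by dfs.replication in hdfs-site.xml; EMR reportedly picks a default based on cluster size, so it's worth checking what yours is. A fragment with an illustrative value:

```xml
<!-- hdfs-site.xml: keep more than one copy of each block,
     /hbase/hbase.id included (the value here is illustrative) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```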

If you are co-locating map-reduce jobs on the cluster where you also run HBase, you'll notice that AWS decreases the map and reduce slots available on Core nodes when you install HBase. For this use case, a Task group is very helpful; you can allow more mappers and reducers for that group.

Also, while you can grow a Core group, you can never shrink it. (Terminating instances leads to them being marked as dead and a replacement instance spawning.) If you want some temporary extra mapping and reducing power, you don't want to add Core nodes; you want to add a Task group.

Task group

Task nodes will only run TaskTracker, not any of the data-hosting processes. So, they'll help alleviate any mapper or reducer bottlenecks, but they won't help with resource starvation on your RegionServers.

A Task group can shrink and grow as needed. Setting a bid price at Task group creation is how to make the Task group use Spot Instances instead of on-demand instances. You cannot modify the bid price after Task group creation. Review the pricing history for your desired instance type in your desired region before you choose your bid price; you also cannot change instance type after Task group creation, and you can only have one Task group per cluster. If you intend an HBase cluster to persist, I do not recommend launching its Master or Core groups as Spot Instances; no matter how high your bid price, there's always a chance someone will outbid you and your cluster will be summarily terminated.

If you'd like a new node to choose its configs based on whether or not it's in the Task group, that information is available on the instance itself.
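On the EMR AMIs we've run, per-instance metadata lives in a local JSON file; the path and key names below are assumptions from our image, so verify them on yours. A sketch of the test:

```python
import json

# Path and keys observed on 2013-era EMR AMIs - verify on your image.
INSTANCE_INFO = '/mnt/var/lib/info/instance.json'

def is_task_node(info):
    """Task nodes are neither the Master nor HDFS-serving Core nodes."""
    return (not info.get('isMaster', False)
            and not info.get('isRunningDataNode', False))

def load_instance_info(path=INSTANCE_INFO):
    with open(path) as f:
        return json.load(f)
```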

With whatever you're using to configure new instances, you can tell the Task nodes to increase mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. These are set in mapred-site.xml, and AWS documents recommended values for the various instance types. Now your Task nodes won't have the job-running constraints of the Core nodes that are busy serving HBase.
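The corresponding mapred-site.xml fragment looks like this (the values here are placeholders; substitute AWS's recommendation for your instance type):

```xml
<!-- mapred-site.xml on Task nodes: values are placeholders -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```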

For any configuration setting, you'll need to specify which file you expect to find it in. This will vary greatly between Hadoop versions; check the defaults in your conf directory as a starting point. You can also specify your own bootstrap actions from a file on S3, like this:

s3://bucketname/identifying_info/bootstrap.sh

Bootstrap actions are performed on every newly-instantiated instance (including when you replace or add a Core instance), so if you want changes for new Task instances only, your custom bootstrap will need logic to detect them, as mentioned above.

As mentioned above, the AWS way to make changes after your cluster is running is something called "job flow steps".

While you may find it easier to edit the config files in ~hadoop/conf/ directly (or have your config management replace them), framing changes as bootstrap actions or job flow steps in your launch script captures what you did, so you can replicate it when you relaunch your cluster.

Note that a config change made in a job flow step just logs in and updates your config files for you; it does not restart any relevant daemons. You'll need to determine which one(s) you need to restart, and do so with the appropriate init script.

The recommended way to restart a daemon is to use the init script to stop it (or kill it, if necessary) and then let service nanny (part of EMR's stock image) restart it.

The service nanny process is supposed to keep your cluster humming along smoothly, restarting any processes that may have died. A warning, though: if you're running your Core nodes out of memory, service nanny might also get the OOM hatchet. I ended up dropping in a once-a-minute nanny-cam cron job, so that if the Nagios process check found service nanny wasn't running, it would get restarted.
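A sketch of that nanny-cam as a cron entry; I actually call our Nagios process check, but a pgrep stands in for it here, and the init-script path is an assumption from the stock image:

```shell
# /etc/cron.d/nanny-cam -- who watches the watcher? (sketch)
* * * * * root pgrep -f service-nanny >/dev/null || /etc/init.d/service-nanny restart
```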

Ganglia lets you see how memory use looks across your cluster and helps you visualize what might be going on when cluster problems strike. You can install Ganglia on EMR quite easily, as long as you decide before cluster launch and set the two required bootstrap actions.

Use this bootstrap action to install Ganglia:

s3://elasticmapreduce/bootstrap-actions/install-ganglia

Use this bootstrap action to configure Ganglia for HBase:

s3://elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

If you're using Task nodes which come and go while your job flow persists, your Ganglia console will be cluttered with the ghosts of Task nodes past, and you'll be muttering, "No, my cluster is not 80% down." This is easy to fix by restarting gmond on your EMR Master (ideally via cron on a regular basis - I do so hourly).
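Mine amounts to an hourly cron entry on the Master that kills gmond and lets it come back with a clean node list (a sketch; adjust the paths and the process match for your image):

```shell
# /etc/cron.d/ganglia-ghostbuster (sketch)
# The [g] keeps grep from matching itself; on the stock image,
# service nanny brings gmond back up on its own.
0 * * * * root kill $(ps aux | grep [g]mond | awk '{print $2}')
```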

You don't need to restart gmetad; that will happen automagically. (And yes, you can get fancier with awk; going for straightforward, here.)

As for Nagios or your preferred alerting mechanism, answering on a port isn't any indication of HBase actually working. Certainly if a RegionServer process dies and doesn't restart, you'll want to correct that, but the most useful check is hbase hbck, which is simple to wrap in a Nagios plugin. It will alert in most cases that would cause HBase not to function as desired. I also run the usual NRPE checks on the Master and Core nodes, though allowing for the much higher memory use and loads that typify EMR instances. I don't bother monitoring the Task group nodes, as they are typically short-lived and run no daemons but TaskTracker. When memory frequently spikes on a Core node, that's a good sign that region hotspotting is happening. (More on that later.)
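A minimal sketch of wrapping hbck for Nagios; the "Status: OK" string is what 0.92-era hbck prints for a healthy cluster, so verify it against your version's output:

```shell
#!/bin/sh
# Nagios-style check around `hbase hbck` (sketch, 0.92-era output assumed).
check_hbck_output() {
    # $1: captured hbck output
    if printf '%s\n' "$1" | grep -q 'Status: OK'; then
        echo 'OK: hbck reports consistent'
        return 0
    fi
    echo 'CRITICAL: hbck did not report OK'
    return 2
}

# Real invocation on the Master:
# check_hbck_output "$(hbase hbck 2>/dev/null)"; exit $?
```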

Other than HBase being functional, you might also want to keep an eye on region balance. In our long-running clusters, the HBase balancer, which is supposed to distribute regions evenly across the RegionServers, turns itself off after a RegionServer restart. I check the HBase console page with Nagios and alert if any RegionServer has fewer than 59 regions. (Configure that according to your own expected region counts.)

We're trying to keep around 70 regions per RegionServer, and if a RegionServer restarts, it often won't serve as many regions as it previously did. You can manually run the balancer from the HBase shell. The balance_switch command returns its previous status, while the balancer command returns its own success or failure.
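In the HBase shell, that exchange looks something like this (output illustrative):

```text
hbase(main):001:0> balance_switch true
false
hbase(main):002:0> balancer
true
```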

Regions aren't well-balanced per-table in HBase 0.92.x, but that is reportedly improved in 0.94.x. You can manually move regions around, if you need to get a specific job such as a backup to run; you can also manually split regions. (I'll elaborate on that in a bit.) Note that the automatic balancer won't run immediately after a manual region move.

Make sure that in a case of an incremental backup failure, you immediately run a full backup and then carry on with periodic incrementals after that. If you need to restore from these backups, a failed incremental will break your chain back to the most recent full backup, and the restore will fail. It's possible to get around this via manual edits to the Manifest the backup stores on S3, but you're better off avoiding that.

To verify that a backup succeeded, look for its success line on your EMR Master in the file /mnt/var/log/hbase/hbase-hadoop-master-YOUR_EMR_MASTER.out.

Backups created with S3DistCp leave temporary files in /tmp on your HDFS; failed backups leave even larger temporary files. To be safe you need as much room to run backups as your cluster occupies (that is, don't allow HDFS to get over 50% full, or your backups will fail.) This isn't as burdensome as it sounds; given the specs of the available EMR instance options, long before you run out of disk, you'll lack enough memory for your jobs to run.

If a backup job hangs, it is likely to hang your HBase cluster, and backups can hang if RegionServers crash. If you need to kill the backup, remember that it runs on your HBase cluster like any map-reduce job, and it can be killed from your jobtracker like any other job.

HBase cluster replication is not supported on EMR images before you get to AWS's premium offerings. If you ever need to migrate your data to a new cluster, you will be wanting replication, because backing up to and then restoring from S3 is not fast (and we haven't even discussed the write lock that consistent backups would want). If you plan to keep using your old cluster until your new one is up and operational, you'll end up needing to use CopyTable or Export/Import.

I've found it's easy to run your Core instances out of memory and hang your cluster with CopyTable if you use it on large tables with many regions. I've gotten better results using a time-delimited Export starting from before your last backup began, then Importing it to the new cluster. Also note that although the docs on Export don't make it explicit, CopyTable's example implies that the expected time format is epoch time in milliseconds (UTC); Export requires the same. Export respects HBase's internal versioning, so it won't overwrite newer data.
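Since those start/end times must be epoch milliseconds in UTC, a tiny helper avoids off-by-timezone mistakes (the Export invocation in the comment shows the argument order; the table and paths are placeholders):

```python
from calendar import timegm
from datetime import datetime

def to_epoch_ms(dt):
    """UTC datetime -> epoch milliseconds, the format Export/CopyTable expect."""
    return timegm(dt.timetuple()) * 1000

# Example: export everything written since just before the last good backup:
#   start = to_epoch_ms(datetime(2013, 12, 1, 3, 0, 0))
#   hbase org.apache.hadoop.hbase.mapreduce.Export \
#       <table> <hdfs-outputdir> <versions> <start_ms> <end_ms>
```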

After asking AWS Support about it over many months, I was delighted to see that Hadoop MRv2 and HBase 0.94.x became available at the end of October. We're on the previous offering of MRv1 with HBase 0.92.0, and with retail clients we aren't going to upgrade during prime shopping season, but I look forward to January for reasons beyond great snowshoeing. For everything in this post, assume Hadoop 1.0.3 and HBase 0.92.0.

Since we are using Python to talk to HBase, we use the lightweight Thrift interface. (For actual savings, you want Spot Instances.) Running the Thrift daemon on the EMR Master and then querying it from our applications led to your friend and mine, the OOM-killer, hitting Thrift more often than not. Running it on our Core nodes didn't work well either; they need all their spare memory for the RegionServer processes. (Conspiracy theory sidebar: Java is memory-gobbling, and Sun Microsystems (RIP) made hardware. I'm just saying.) I considered and rejected a dedicated Thrift server, since I didn't want to introduce a new single point of failure. What ended up working best was installing a local Thrift daemon on select servers (via a Chef recipe applied to their roles). We also use MongoDB, and talking to local Thrift ended up working much like talking to local mongos.

There's a lot of info out there about building Thrift from source. That's entirely unnecessary, since a Thrift daemon is included with HBase. So, all you need is a JVM and HBase (not that you'll use most of it). Install HBase from a tarball or via your preferred method, then configure hbase-site.xml so that it can find your EMR Master; that's all the file needs.
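A sketch of that hbase-site.xml: pointing hbase.zookeeper.quorum at your EMR Master's internal hostname is the one property a Thrift-serving box needs (the hostname below is a placeholder):

```xml
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- placeholder: your EMR Master's internal hostname -->
    <value>ip-10-0-0-1.ec2.internal</value>
  </property>
</configuration>
```

Then start the daemon with hbase-daemon.sh start thrift (or hbase thrift start to run it in the foreground).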

Corner Cases

You know, the corner cases that give you a heart attack and then you wake up in the morgue, only to actually wake up and realize you've been having a nightmare where you are trampled by yellow elephants... just me, then?

HBase HMaster process won't start

You know it's going to be a fun oncall when the HBase HMaster will not start, and logs this error:

NotServingRegionException: Region is not online: -ROOT-,,0

The AWS support team for EMR is very helpful, but of course none of them were awake when this happened. Enough googling eventually led me to the exact info I needed.

In this case, Zookeeper (which you will usually treat as a black box) is holding onto an incorrect instance IP for where the -ROOT- table is being hosted. Fortunately, this is easy to correct:

$ hbase zkcli
zookeeper_cli> rmr /hbase/root-region-server

Now you can restart HMaster if service-nanny hasn't beat you to it:

$ /etc/init.d/hbase-hmaster start

Instance Controller

If the instance controller stops running on your Master node, you can see strange side effects like an inability to launch new Task nodes or an inability to reschedule the time your backups run.

It's possible that you might need to edit /usr/bin/instance-controller and increase the amount of memory allocated to it in the -Xmx directive.

Another cause for failure is if the instance controller has too many logs it hasn't yet archived to S3, or if the disk with the logs fills up.

If the instance controller dies it can then go into a tailspin with the service-nanny attempting to respawn it forever. You may need to disable service-nanny, then stop and start the instance-controller with its init script before re-enabling service-nanny.

A Word On Hotspotting

Choose your keys in this key-value game carefully. If they are sequential, you're likely to end up with hotspotting. While you can certainly turn the balancer off and manually move your regions around between RegionServers, using a hashed key will save you a lot of hassle, if it's possible. (In some cases we key off organic characteristics in our data we can neither control nor predict, so it's not possible in our most hotspotting-prone tables.)

If you limit automatic splitting, you might need to manually split a hot region before your backups will succeed. Your task logs will likely indicate which region is hot, though you can also check filesize changes in HDFS. The HBase console on your-emr-master:60010 has links to each table, and a section at the bottom of a table's page where you can do a split.

Optionally you can specify a "Region Key", but it took a bit to figure out which format it expects. (The "Region Start Key" isn't the same thing.) The format you want for a Region Key when doing a manual split is what is listed as "Name" on that HBase console page: the full region name, which ends with an epoch timestamp in milliseconds (such as 1379638609844) followed by the region ID (such as dace217f50fb37b69844a0df864999bc).
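Since the Name packs several fields together, a small parser is handy for sanity-checking what you paste into the split box. This is a sketch; the field layout (table,start-key,timestamp.regionid.) is an assumption from 0.92-era region names, so check it against your console:

```python
def parse_region_name(name):
    """Split an HBase region Name (assumed 0.92-era format:
    'table,start-key,timestamp.regionid.') into its parts."""
    table, rest = name.split(',', 1)
    start_key, tail = rest.rsplit(',', 1)   # start keys may contain commas
    timestamp, region_id = tail.rstrip('.').split('.', 1)
    return {'table': table, 'start_key': start_key,
            'timestamp_ms': int(timestamp), 'region_id': region_id}
```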

hbase.regionserver.max.filesize

The Apache project has a page called "the important configurations". An alternate title might be "if you don't set these, best of luck to you, because you're going to need it". Plenty of detailed diagrams out there to explain "regions", but from a resource consumption standpoint, you minimally want to know this:

A table starts out in one region.

If the region grows to the hbase.regionserver.max.filesize, the region splits in two. Now your table has two regions. Rinse, repeat.

Each region takes a minimum amount of memory to serve.

If your RegionServer processes end up burdened by serving more regions than ideal, they stand a good chance of encountering the Out of Memory killer (especially while running backups or other HBase-intensive jobs).
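The arithmetic behind those points is worth running before you tune. A rough sketch of steady-state region count (illustrative only; cell sizes, compression, and flush patterns all move the real number):

```python
def estimated_regions(table_bytes, max_filesize_bytes):
    """Regions split once they pass the max filesize setting, so a table
    settles near ceil(table_size / max_filesize) regions."""
    return max(1, -(-table_bytes // max_filesize_bytes))

GB = 1024 ** 3
# A 500 GB table at a 1 GB max filesize ends up around 500 regions;
# raising the setting to 4 GB brings that down to around 125.
```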

I chased a lot of ghosts (Juliet Pause, how I wanted it to be you!) before finally increasing hbase.regionserver.max.filesize. If you're running 0.92.x, it's not possible to use online merges to decrease the number of regions in a table. The best way I found to shrink our largest table's region count was to simply CopyTable it to a new name, cut production over to writing to the new name, then Export/Import the changes. (Table renaming is supported starting in 0.94.x with the snapshot facility.)

The conventional wisdom says to limit the number of regions served by any given RegionServer to around 100. In my experience, while it's possible to serve three to four times more on m1.xlarges, you're liable to OOM your RegionServer processes every few days. This isn't great for cluster stability.

Closure on the Opening Anecdote

After that 4am phone call, I did get our HBase cluster and (nearly) all its data back. It was a monumental yak-shave spanning multiple days, but in short, here's how: I started a new cluster that was a restore from the last unbroken spot in the incremental chain. On that cluster, I restored the data from any other valid incremental backups on top of the original restore. Splitlog files in the partial backups on S3 turned out to be unhelpful here; editing the Manifest so it wasn't looking for them would have saved hassle. And for the backups that failed and couldn't be used for a restore with the AWS tools, I asked one of our developers with deep Java skills to parse out the desired data from each partial backup's files on S3 and write it to TSV for me. I then used Import to read those rows into any tables missing them. And we were back in business, map-reducing for fun and profit!

Now that you know everything I didn't, you can prevent a cascading failure of flaky RegionServers leading to failed backups on a cluster unprotected against termination. If you run HBase on EMR after reading all this, you may have new and exciting things go pear-shaped. Please blog about them, ideally somewhere that they will show up in my search results. :)

Lessons learned? Launch your HBase clusters with both Keep Alive and Termination Protection. Config early and often. Use Spot Instances (but only for Task nodes). Monitor and alert for great justice. Make sure your HBase backups are succeeding. If they aren't, take a close look at your RegionServers. Don't allow too much splitting.

And most important of all, have a happy and healthy holiday season while you're mapping, reducing, and enjoying wide columns in your data store!