
Building a Hadoop 0.20.x Version for HBase 0.90.2

Apr 14th, 2011

As of today, Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked as ready for production
(neither 0.21 nor 0.22 is). Unfortunately, the Hadoop 0.20.2 release is not compatible with the latest stable version of
HBase: if you run HBase on top of Hadoop 0.20.2, you risk losing data! Hence HBase users must build their own
Hadoop 0.20.x version if they want to run HBase on a production Hadoop cluster. In this article, I describe how to
build such a production-ready version of Hadoop 0.20.x that is compatible with HBase 0.90.2.

Update October 17, 2011: As of version 0.20.205.0 (marked as a beta release), Hadoop supports HDFS append/hsync/hflush out of the box and is thus compatible with HBase 0.90.x. You can still follow the instructions described in this article to build your own version of Hadoop.

Before we start

The examples below use git (not svn)

In the following sections, I will use git as the version control system to work on the Hadoop source
code. Why? Because I am much more comfortable with git than svn, so please bear with me.

If you are using Subversion, feel free to adapt the git commands described below. You are invited to write a comment to
this article about your SVN experience so that other SVN users can benefit, too!

Hadoop 0.20.2 versus 0.20.203.0

Update June 11, 2011: Hadoop 0.20.203.0 and HBase 0.90.3 were released a few weeks after this article was published. While the article talks mostly about Hadoop 0.20.2, the build instructions should also work for Hadoop 0.20.203.0 but I haven’t had the time to test it yet myself. Feel free to leave a comment at the end of the article if you have run into any issues!

Hadoop is covered. What about HBase then?

In this article, I focus solely on building a Hadoop 0.20.x version (see the Background section below) that is compatible
with HBase 0.90.2. In a future article, I may describe how to actually install and set up HBase 0.90.2 on the Hadoop
0.20.x version that we created here.

Version of Hadoop 0.20-append used in this article

The instructions below use the latest version of branch-0.20-append. As of this writing, the latest commit to the
append branch is git commit df0d79cc aka Subversion rev 1057313. For reference, the corresponding commit
message is “HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.” from January 10, 2011.

That said, the steps should also work for newer versions of branch-0.20-append.

Background

Hadoop and HBase: Which versions to pick for production clusters?

Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked ready for
production. Unfortunately, the latest stable release of Apache HBase, i.e. HBase 0.90.2, is
not compatible with Hadoop 0.20.2: If you try to run HBase 0.90.2 on an unmodified version of Hadoop 0.20.2 release,
you might lose data!

This version of HBase [0.90.2] will only run on Hadoop 0.20.x. It will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch.

Here is a quick overview:

Hadoop version            HBase version   Compatible?
-----------------------   -------------   -----------
0.20.2 release            0.90.2          NO
0.20-append               0.90.2          YES
0.21.0 release            0.90.2          NO
0.22.x (in development)   0.90.2          NO

To be honest, it took me quite some time to get up to speed with the various requirements, dependencies, project
statuses, etc. for marrying Hadoop 0.20.x and HBase 0.90.2. Hence I want to contribute back to the Hadoop and HBase
communities by writing this article.

Alternatives to what we are doing here

Another option to get HBase up and running on Hadoop – rather than building Hadoop 0.20-append yourself – is
to use Cloudera’s CDH3 distribution. CDH3 contains the Hadoop 0.20-append patches needed to add
a durable sync, i.e. to make Hadoop 0.20.x compatible with HBase 0.90.2.

A word of caution and a Thank You

First, a warning: while I have taken great care to compile and describe the steps in the following sections, I still
cannot give you any guarantees. If in doubt, join our discussions on the HBase mailing list.

Second, I am only stitching together the pieces of the puzzle here. The heavy lifting has been done by others. Hence I
would like to thank Michael Stack for his great feedback while
preparing the information for this article, and both him and the rest of the HBase developers for their help on the
HBase mailing list. It’s much appreciated!

Building Hadoop 0.20-append from branch-0.20-append

Retrieve the Hadoop 0.20-append sources

Hadoop 0.20.x is not yet split into the separate Common, HDFS, and MapReduce components that versions >= 0.21.0
are. Hence you will find all the required code in the Hadoop Common repository.
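The sources can be retrieved with a git clone of the Hadoop Common repository. Note that the git:// mirror URL below is the one from the time of writing and may have moved since:

```shell
# Clone the read-only git mirror of Hadoop Common (this may take a few minutes)
git clone git://git.apache.org/hadoop-common.git
cd hadoop-common
```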

However, the previous git command only retrieved the latest version of Hadoop Common, i.e. the tip aka HEAD
of Hadoop Common development. We, however, are only interested in the code tree for Hadoop 0.20-append, i.e.
the branch branch-0.20-append. Because git by default does not check out remote branches from a cloned
repository, we must explicitly instruct it to do so:
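A sketch of the commands, run inside the hadoop-common clone:

```shell
# List the remote branches to confirm that branch-0.20-append exists
git branch -r

# Create a local branch that tracks the remote append branch
git checkout -t origin/branch-0.20-append

# Verify that we are now on the append branch
git branch
```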

Hadoop 0.20.2 release vs. Hadoop 0.20-append

Up to now, you might have asked yourself what the difference between the 0.20.2 release of Hadoop and its append branch actually is. Here’s the answer: The Hadoop 0.20-append branch is effectively a superset of Hadoop 0.20.2 release. In other words, there is not a single “real” commit in Hadoop 0.20.2 release that is not also in Hadoop 0.20-append. This means that Hadoop 0.20-append brings all the goodies that Hadoop 0.20.2 release has, great!
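You can verify this yourself with git. Assuming your clone has a branch or tag named release-0.20.2 pointing at the 0.20.2 release (the exact ref name is an assumption here; check git tag in your clone), the commits that are in the release but not in the append branch are listed by:

```shell
# Commits reachable from release-0.20.2 but not from branch-0.20-append
git log --oneline branch-0.20-append..release-0.20.2
```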

As you can see, there are only two commits in 0.20.2 release that are not in branch-0.20-append, namely the
commits “Hadoop 0.20.2 release” and “Hadoop 0.20.2-rc4”. Both of these commits are simple tagging commits, i.e. they
are just used for release management but do not introduce any changes to the content of the Hadoop source code.

Run the build process

Build commands

First, create a build.properties file with the following contents:

#this is essential
resolvers=internal
#you can increment this number as you see fit
version=0.20-append-for-hbase
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

Note: The “version” key in build.properties will also determine the names of the generated Hadoop JAR files. If, for instance, you set “version” to “0.20-append-for-hbase”, the build process will generate files named hadoop-core-0.20-append-for-hbase.jar etc. Basically, you can use any version identifier that you like (though it would help if it makes sense).

The build.properties file should be placed (or available) in the hadoop-common top directory, i.e.
hadoop-common/build.properties. You can either place the file there directly or you can follow the
recommended approach, where you place the file in a parent directory
and create a symlink to it. The latter approach is convenient if you also have checked out the repositories of the
Hadoop sub-projects hadoop-hdfs and hadoop-mapreduce and thus want to use the same build.properties file
for all three sub-projects.
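As a sketch of the symlink approach (the directory layout below is an assumption), with the shared build.properties in the common parent directory:

```shell
# Layout assumed for this sketch:
#   src/build.properties        <- the shared file
#   src/hadoop-common/          <- the checkout
cd src/hadoop-common
ln -s ../build.properties build.properties
```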

Now we are ready to compile Hadoop from source with ant. I used the command ant mvn-install as described on
Git and Hadoop. The build itself should only take a few minutes. Be
sure to run ant test as well (or only ant test-core if you’re lazy) but be aware that the tests take much
longer than the build (two hours on my 3-year-old MacBook Pro, for instance).

# Make sure we are using the branch-0.20-append sources
$ git checkout branch-0.20-append

# Run the build process
$ ant mvn-install

# Optional: run the full test suite or just the core test suite
$ ant test
$ ant test-core

If you want to re-run builds or build tests: By default, “ant mvn-install” places the build output into $HOME/.m2/repository. In case you re-run the compile you might want to remove the previous build output from $HOME/.m2/repository, e.g. via “rm -rf $HOME/.m2/repository”. You might also want to run “ant clean-cache”. For details, see Git and Hadoop.

The build test fails, now what?

Now comes the more delicate part: If you run the build tests via ant test, you will notice that the build test
process always fails! One consistent test error is reported by TestFileAppend4 and logged to the file
build/test/TEST-org.apache.hadoop.hdfs.TestFileAppend4.txt. Here is a short excerpt of the test’s output:


2011-04-06 09:40:28,666 INFO ipc.Server (Server.java:run(970)) - IPC Server handler 5 on 47574, call append(/bbw.test, DFSClient_1066000827) from 127.0.0.1:45323: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /bbw.test for DFSClient_1066000827 on client 127.0.0.1, because this file is already being created by DFSClient_-95621936 on 127.0.0.1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1054)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1221)
at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
[...]
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
2011-04-06 09:40:28,667 INFO hdfs.TestFileAppend4 (TestFileAppend4.java:recoverFile(161)) - Failed open for append, waiting on lease recovery
[...]
Testcase: testRecoverFinalizedBlock took 5.555 sec
Caused an ERROR
No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1439)
[...]
at org.apache.hadoop.hdfs.TestFileAppend4$1.run(TestFileAppend4.java:636)

I do not know what causes these occasional errors – maybe it is a problem with the machine I am running the tests
on. I am still working on this.

Frankly, what I wrote above may sound disconcerting to you. At least it does to me. Still, the feedback I have
received on the HBase mailing list indicates that the Hadoop 0.20-append build as described above is indeed correct.

Locate the build output (Hadoop JAR files)

By default, the build run via ant mvn-install places the generated Hadoop JAR files in $HOME/.m2/repository.
You can find the actual JAR files with the following command.
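For example (the directory layout under the local Maven repository depends on the version string you set in build.properties):

```shell
# List all Hadoop JAR files in the local Maven repository
find "$HOME/.m2/repository" -name "hadoop-*.jar"
```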

Install your Hadoop 0.20-append build in your Hadoop cluster

The only thing left to do now is to install the Hadoop 0.20-append build in your cluster. This step is easy: simply
replace the Hadoop JAR files of your existing installation of Hadoop 0.20.2 release with the ones you just created
above. You will also have to replace the Hadoop core JAR file in your HBase 0.90.2 installation
($HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar) with the Hadoop core JAR file you created above
(hadoop-core-0.20-append-for-hbase.jar if you followed the instructions above).
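As a sketch (the paths and environment variables are assumptions; adjust them to your installation, and stop Hadoop and HBase before swapping the files):

```shell
# Replace the core JAR in the Hadoop installation with our build
cp hadoop-core-0.20-append-for-hbase.jar $HADOOP_HOME/

# Swap the Hadoop core JAR that ships with HBase for the matching one
rm $HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar
cp hadoop-core-0.20-append-for-hbase.jar $HBASE_HOME/lib/
```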

Important: Since this is such an important step, I will repeat it again: The Hadoop JAR files used by Hadoop itself and by HBase must match!

Rename the build JAR files if you run Hadoop 0.20.2

Update June 11, 2011: The renaming instructions of this section are NOT required if you are using the latest stable release Hadoop 0.20.203.0.

Hadoop 0.20.2 release names its JAR files in the form hadoop-VERSION-PACKAGE.jar, e.g.
hadoop-0.20.2-examples.jar. The build process above uses a different scheme, hadoop-PACKAGE-VERSION.jar,
e.g. hadoop-examples-0.20-append-for-hbase.jar. You might therefore want to rename the JAR files you created in
the previous section so that they match the naming scheme of Hadoop 0.20.2 release (otherwise the bin/hadoop script
will not be able to add the Hadoop core JAR file to its CLASSPATH, and command examples such as
hadoop jar hadoop-*-examples.jar pi 50 1000 in the Hadoop docs will not work as is).
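A sketch of the renaming, assuming the version string 0.20-append-for-hbase from above and the standard Hadoop package names:

```shell
VERSION=0.20-append-for-hbase
for pkg in core examples test tools ant; do
  jar="hadoop-$pkg-$VERSION.jar"
  # Rename hadoop-PACKAGE-VERSION.jar to hadoop-VERSION-PACKAGE.jar
  if [ -f "$jar" ]; then
    mv "$jar" "hadoop-$VERSION-$pkg.jar"
  fi
done
```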

In contrast, HBase uses the hadoop-PACKAGE-VERSION.jar scheme. So when you replace the Hadoop core JAR file
shipped with HBase 0.90.2 in $HBASE_HOME/lib, you can here opt for leaving the name of the newly built Hadoop core
JAR file as is.

Note for users running HBase 0.90.0 or 0.90.1: The Hadoop 0.20-append JAR files we created above are based on the tip of "branch-0.20-append" and thus use an RPC version of 43. This is ok for HBase 0.90.2 but it will cause problems for HBase 0.90.0 and 0.90.1. See HBASE-3520 or Michael Stack’s comment for more information.

Maintaining your own version of Hadoop 0.20-append

If you must integrate additional patches into Hadoop 0.20.2 and/or Hadoop 0.20-append (normally in the form of
backports of patches for Hadoop 0.21 or 0.22), you can create a local branch based on the Hadoop version you are
interested in. Yes, this creates some effort on your part, so you should be sure to weigh the pros and cons of doing
so.

Imagine that, for instance, you use Hadoop 0.20-append based on branch-0.20-append because you also want to run the
latest stable release of HBase on your Hadoop cluster. While doing your
benchmarking and stress testing of your cluster, you have unfortunately discovered a problem that you could track down to
HDFS-611. Now a patch is actually available (you might have to do
some tinkering to backport it properly) but it is not in the version of Hadoop you are running, i.e. it is not in the
vanilla branch-0.20-append.

What you can do is create a local git branch based on your Hadoop version (here: branch-0.20-append) where you
can integrate and test any relevant patches you need. Please understand that I will only describe the basic approach
here – I will not go into details on how you can stay current with any changes to the Hadoop version you
are tracking after you have followed the steps below. There are a lot of splendid git introductions, such as the
Git Community Book, that can explain this much better and more thoroughly than I am able to.

# Make sure we are in branch-0.20-append before running the next command
$ git checkout branch-0.20-append

# Create your own local branch based on the latest version (HEAD) of the official branch-0.20-append
$ git checkout -b branch-0.20-append-yourbranch

Apply the relevant patch to your branch. In the example below, I apply a backport of the patch for
HDFS-611 for branch-0.20-append via the file
HDFS-611.branch-0.20-append.v1.patch. Note that this backport is not available on the HDFS-611 page – I created
the backport myself based on the HDFS-611 patch for Hadoop 0.20.2 release
(HDFS-611.branch-20.v6.patch).
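Applying the patch could look like this (a sketch; the patch file name is from above, and the -p strip level depends on how the patch file was generated, so try -p0 or -p1):

```shell
# Work on our own branch, not on the vanilla append branch
git checkout branch-0.20-append-yourbranch

# Apply the backported patch
patch -p0 < HDFS-611.branch-0.20-append.v1.patch

# Review the changes, then commit them to our branch
git status
git commit -a -m "Backport of HDFS-611 for branch-0.20-append"
```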

And by the way, if you want to see the commit differences between Hadoop 0.20.2 release, the official
branch-0.20-append and your own, patched branch-0.20-append-yourbranch, run the following git command: