To move forward on this topic without a major refactoring, I started working on standalone, atomic bundles that can be deployed in Karaf. The goal is to avoid changing Hadoop core while still providing good Hadoop support directly in Karaf.

I also deployed the bundles on my Maven repository to give users the possibility to deploy hadoop-karaf directly in a running Karaf instance.

The purpose is to explain what you can do with it and the value it brings, and maybe you will vote to “include” it in Hadoop directly 😉

To explain exactly what you can do, I prepared a series of blog posts:

Article 1: Karaf as HDFS client. This is the first post. We will see the hadoop-karaf bundle installation, the hadoop and hdfs Karaf shell commands, and how you can use HDFS to store bundles or features using the HDFS URL handler.

Article 2: Karaf as MapReduce job client. We will see how to run MapReduce jobs directly from Karaf, and the “hot-deploy-and-run” of MapReduce jobs using the Hadoop deployer.

Article 3: Exposing Hadoop, HDFS, YARN, and MapReduce features as OSGi services. We will see how to use Hadoop features programmatically thanks to OSGi services.

Article 4: Karaf as an HDFS datanode (and possibly namenode). Here, rather than using Karaf as a simple HDFS client, Karaf will be part of HDFS itself, acting as a datanode and/or namenode.

Article 5: Karaf, Camel, Hadoop all together. In this article, we will use the Hadoop OSGi services now available in Karaf inside Camel routes (plus the camel-hdfs component).

Article 6: Karaf as a complete Hadoop container. I will explain what I did in Hadoop to add complete OSGi and Karaf support.

Karaf as HDFS client

Just a reminder about HDFS (Hadoop Distributed FileSystem).

HDFS is composed of:
– a namenode, hosting the metadata of the filesystem (directories, block locations, file permissions and modes, …). There is only one namenode per HDFS cluster, and the metadata is kept in memory by default.
– a set of datanodes, hosting the file blocks. Files are split into blocks (as in most filesystems), the blocks are distributed across the datanodes, and the blocks can be replicated.

An HDFS client connects to the namenode to execute actions on the filesystem (ls, rm, mkdir, cat, …).
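To make the block mechanics concrete, here is a toy Python sketch of how a file could be split into blocks and its replicas spread across datanodes. This is an illustrative model only, not Hadoop code: the node names, the round-robin placement, and the 128 MB default block size are assumptions (real HDFS uses rack-aware placement policies, and older releases default to 64 MB blocks).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default HDFS block size (bytes)

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=3):
    """Split a file of file_size bytes into blocks and assign each block's
    replicas to datanodes (simple round-robin, unlike the real namenode)."""
    replication = min(replication, len(datanodes))
    num_blocks = max(1, math.ceil(file_size / block_size))
    placement = []
    for b in range(num_blocks):
        # each block gets `replication` distinct datanodes
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append(replicas)
    return placement

# A 300 MB file on a 4-datanode cluster: 3 blocks, 3 replicas each
layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(layout))   # 3 blocks
print(layout[0])     # ['dn1', 'dn2', 'dn3']
```

The namenode only keeps this kind of mapping (block → datanodes) in its metadata; the actual bytes live on the datanodes.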

Preparing HDFS

The first step is to set up the HDFS filesystem.

I am going to use a “pseudo-cluster”: an HDFS instance with the namenode and a single datanode running on one machine.
To do so, I configure the $HADOOP_INSTALL/etc/hadoop/core-site.xml file like this:
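A minimal pseudo-cluster configuration only needs to define the default filesystem URI. The port 9000 is a common but arbitrary choice, and on trunk the property is fs.defaultFS (older releases use fs.default.name):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

With a single datanode, you would also typically set dfs.replication to 1 in $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml, since the default replication factor of 3 cannot be satisfied.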

Configuration and installation of hadoop-karaf

I created the hadoop-karaf bundle as a standalone bundle: it embeds most of its dependencies internally (directly in the bundle classloader).

The purpose is to:

– avoid altering anything in Hadoop core. Thanks to this approach, I can provide the hadoop-karaf bundle for different Hadoop versions without having to modify Hadoop itself.

– ship all dependencies in the same bundle classloader. Of course, this is not ideal in terms of OSGi, but to provide a very easy, ready-to-use bundle, I gathered most of the dependencies in the hadoop-karaf bundle.

I worked directly on trunk (Hadoop 3.0.0-SNAPSHOT); for now, if you are interested, I can provide hadoop-karaf for existing Hadoop releases.

Before deploying the hadoop-karaf bundle, we have to prepare the Hadoop configuration. To integrate with Karaf, I implemented a mechanism that creates and populates the Hadoop configuration from the OSGi ConfigAdmin service.
The only requirement for the user is to create an org.apache.hadoop PID in the Karaf etc folder containing the Hadoop properties. Concretely, this means creating a $KARAF_INSTALL/etc/org.apache.hadoop.cfg file containing:
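At a minimum, this file holds the default filesystem property, i.e. the same key/value pairs you would put in core-site.xml (hdfs://localhost:9000 is an assumed pseudo-cluster address; adjust it to wherever your namenode listens):

```properties
fs.defaultFS=hdfs://localhost:9000
```

Any other Hadoop property can be added to this file in the same key=value form, and the ConfigAdmin mechanism injects them all into the Hadoop configuration.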

Conclusion

This first blog post shows how to use Karaf as an HDFS client. The big advantage is that the hadoop-karaf bundle doesn’t change anything in Hadoop core, so I can provide it for Hadoop 0.20.x, 1.x, 2.x, or trunk (3.0.0-SNAPSHOT).
In Article 3, you will see how to leverage HDFS directly through OSGi services (and so use it in your own bundles, Camel routes, …).

Again, if you find this article series interesting and you would like to see Karaf support in Hadoop, feel free to post a comment, send a message on the Hadoop mailing list, or do whatever helps promote it 😉