Hadoop

I have not given a formal introduction to HBase, but this post should help those who have already set up and have an active HBase installation. I will be dealing with the administrative work that can be done on HBase using the Java API. The API is vast and easy to use. I have explained the code wherever I found it necessary, but this
post is by no means complete. As usual, I have provided the full code at the end. Cheers. 🙂

If you want to follow along, you'll need to pull in the right imports; if you are using an IDE like Eclipse, you'll follow along just fine as it automatically fixes up your imports. The only thing you need to do is set the classpath to include all the jar files from the Hadoop installation and/or HBase installation, especially the hadoop-0.*.*-core.jar and the jar files inside the lib folder. I'll put up another post on that later.

I’d like to talk about doing some day-to-day administrative tasks on a Hadoop system. Although the hadoop fs <commands> can get most of the things done, it's still worthwhile to explore the rich Java API for Hadoop. This post is by no means complete, but it can get you started well.

The most basic step is to create an object of the HDFSClient class.

HDFSClient client = new HDFSClient();

Of course, you need to import a bunch of stuff. But if you are using an IDE like Eclipse, you'll follow along just fine just by importing these. The same imports should work fine for the entire code.
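For reference, these are roughly the imports the snippets below rely on (a minimal set; your exact list may differ depending on the full code):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;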

1. Copying from Local file system to HDFS.
This copies a local file onto HDFS. The Hadoop file system shell has a command to do the same:

hadoop fs -copyFromLocal <local fs> <hadoop fs>
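Doing the same through the Java API comes down to a single call on FileSystem. Here is a minimal sketch of how such a method inside HDFSClient could look (the method name, paths and checks are my own illustration, not necessarily the exact code from the full listing):

public void copyFromLocal(String source, String dest) throws IOException {

    // Build a configuration; point it at your own *-site.xml files (see below).
    Configuration conf = new Configuration();
    FileSystem fileSystem = FileSystem.get(conf);

    Path srcPath = new Path(source); // file on the local file system
    Path dstPath = new Path(dest);   // target directory on HDFS

    // Bail out if the target directory does not exist on HDFS.
    if (!fileSystem.exists(dstPath)) {
        System.out.println("No such destination: " + dstPath);
        return;
    }

    // Equivalent of: hadoop fs -copyFromLocal <local fs> <hadoop fs>
    fileSystem.copyFromLocalFile(srcPath, dstPath);

    fileSystem.close();
}

You would call it as, say, client.copyFromLocal("/tmp/sample.txt", "/user/hadoop/"); both paths are, of course, just placeholders for whatever suits your setup.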

I am not explaining much here, as the comments in the code are quite helpful. Of course, when loading the configuration files, make sure to point them to your own Hadoop installation's location.
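That part looks something like this (the paths here are just an example; adjust them to wherever your conf directory actually lives):

Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));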

It took a while for me to get a Hadoop cluster up and running, especially after wading through all the documentation and tutorials available on the internet. Moreover, for someone starting out with the Hadoop ecosystem, it can be quite frustrating to decide between a distribution like Cloudera or MapR and a direct installation from the Apache site. I have chosen the latter and it works fine for me. Yes, there are a number of good tutorials available on the internet, but I am sure this will still help a few out there like me. Before I start, I do assume that you have a basic understanding of how Hadoop works, or at least a general overview. If not, I suggest you get one first.

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

Now, as you may know, the main configuration files are core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml. Refer to one of the tutorials mentioned above on how to set them, or better, the book “Hadoop: The Definitive Guide”. Once you have them ready, copy core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml to each of your datanodes (or just the namenode, if this is your first install) and configure them appropriately. This itself would take a long time to explain, so I'll write another post on it. If you already have them ready on another Hadoop server, simply copy them over from there. Yes, it is important that they all share the same settings.

If you are adding a datanode to an existing Hadoop system, you should add an entry to /etc/hosts for every new datanode.
Next, set up passphrase-less SSH login from the namenode to the datanodes. The idea is to copy the namenode's public key id_dsa.pub to each new datanode and append it to its /home/hadoop/.ssh/authorized_keys. If you don't know how to create the keys, follow this link. It is explained very lucidly.

If you are doing this for the namenode, you need to format HDFS before you start the daemons. Do not format a running Hadoop filesystem; this will cause all your data to be erased. Before formatting, ensure that the dfs.name.dir directory exists. More on this here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop

hadoop namenode -format

Stop the cluster if it is already running.

stop-mapred.sh
stop-dfs.sh

Do add the IP (or hostname) of the new datanode to conf/slaves and conf/includes, and then start/restart the cluster.

start-dfs.sh
start-mapred.sh
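To stay with the theme of this post, you can also sanity-check from the Java API that the new datanode has actually joined, by asking HDFS for a datanode report. A rough sketch, assuming the old 0.20/1.x-era API that matches the hadoop-0.*.*-core.jar mentioned earlier (the class name and output format are my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDatanodes {

    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // getDataNodeStats() is only available when the configured
        // default filesystem is HDFS (a DistributedFileSystem).
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.println(node.getName());
        }

        fs.close();
    }
}

If the new datanode shows up in the output, it has registered with the namenode correctly.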

This should get you up and running. Although this is by no means a complete listing, I have tried to keep it short and clean. I'll write more on the configuration files and other administrative stuff in later posts. Comments and suggestions are appreciated! 🙂