Hadoop Tutorial : Installing Hadoop on a Single Node Cluster – A Walkthrough

This article attempts to give a step-by-step walkthrough for creating a single-node Hadoop cluster. It is a hands-on tutorial, so even a novice user can follow the steps and create the Hadoop cluster.

We are creating an RSA key, as indicated by the '-t' flag. Normally we should not keep the password empty; it is done here to enable seamless interaction of the Hadoop system with your node.
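
As a sketch, the key-generation command run as the 'hadoop' user looks like the following; the empty '-P' argument is what leaves the passphrase blank:
$ ssh-keygen -t rsa -P ""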

We need to indicate that the public keys are authorized for SSH access. This is done using the command:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Let's now test our setup:
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is f9:be:8b:17:5a:8a:95:13:fa:96:22:c2:45:2b:08:cf.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
This gives a warning about the 'unknown' host. If you accept and go ahead, this host is added to the known_hosts file in your .ssh directory. After this, you can verify again that you are able to log in as the 'hadoop' user without needing to enter your password.

Disable IPv6
I wanted to disable IPv6 only for Hadoop and not for the complete setup, so I chose to update the hadoop-env.sh file later, after installing Hadoop.
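
The usual hadoop-env.sh change for this (my assumption of what that later edit looks like, based on the standard Hadoop 1.x JVM option) is to make the JVM prefer IPv4:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true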

We will now set HADOOP_HOME, JAVA_HOME and add HADOOP_HOME to the path by editing .bashrc of the 'hadoop' user.
# Add HADOOP_HOME, JAVA_HOME and update PATH
export HADOOP_HOME="/usr/local/hadoop-1.0.3"
export JAVA_HOME="/usr/lib/jvm/java-6-openjdk-i386"
export PATH=$PATH:$HADOOP_HOME/bin

If these changes do not take effect when you switch user to hadoop or when you ssh in, add this line to the .bash_profile file in your home directory (create the file first if it does not exist):
source $HOME/.bashrc

Configuration
We need to configure JAVA_HOME variable for the hadoop environment as well. The configuration files will be usually in the ‘conf’ subdirectory while the executables will be in the ‘bin’ subdirectory.
The important files in ‘conf’ directory are
hadoop-env.sh, hdfs-site.xml, core-site.xml, mapred-site.xml.

hadoop-env.sh – Open the hadoop-env.sh file. It says at the top that Hadoop-specific environment variables are stored here. The only required variable is JAVA_HOME. In this file, the variable is already defined but the line is commented out. Edit the line to update the JAVA_HOME variable. In our case, JAVA_HOME is the same OpenJDK path we used in .bashrc.
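
The edited line in conf/hadoop-env.sh would therefore look like this (same path as set earlier):
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386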

conf/*-site.xml – The earlier hadoop-site.xml file is now replaced with three different settings files – core-site.xml, hdfs-site.xml and mapred-site.xml. The main parameters that you need to refer to or modify in these three files are:
core-site.xml – hadoop.tmp.dir, fs.default.name
hdfs-site.xml – dfs.replication
mapred-site.xml – mapred.job.tracker

hadoop.tmp.dir is used as a temporary directory for both the local file system and for HDFS. We will use the directory '/app/hadoop/tmp/' (same as Michael Noll). We need to create the directory and change its ownership:
$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for sumod:
$ sudo chown hadoop:hadoop /app/hadoop/tmp
In the configuration files, add the properties mentioned above.

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>Host and port for the JobTracker. As we use localhost,
there will be a single map and reduce task.</description>
</property>
</configuration>
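
For reference, the matching entries in core-site.xml and hdfs-site.xml would look roughly like this. The hadoop.tmp.dir value is the directory created above; the HDFS port 54310 and a replication factor of 1 are the conventional single-node choices, not values dictated by this setup.

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>

In hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>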

Take some time to think about why we are using these different parameters and what their purpose is. Remember that HDFS is like a virtual file system on top of the actual local file system, virtual in the sense that the different nodes should not appear separately to the user. To the end user, HDFS should still appear homogeneous.

Now that we have downloaded, extracted and configured Hadoop, it is time to start it up. The first step is to format the NameNode. This initializes the FSNameSystem specified by the 'dfs.name.dir' variable. It also writes a VERSION file that records the namespace ID of this instance, the ctime and the version. If you format the NameNode, you also have to clean up the DataNodes. Note that if you are just adding new DataNodes to the cluster, you do not need to format the NameNode.
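
A sketch of the commands, assuming $HADOOP_HOME/bin is on the PATH as set earlier:
$ hadoop namenode -format
$ start-all.sh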

Use jps to make sure all services are running as expected.
Note that if jps is not found in your version of OpenJDK, you can install the full JDK and then use jps, for example by running 'sudo apt-get install openjdk-6-jdk'. I updated my JDK while Hadoop was running and Hadoop was not affected, but I do not advise that.
$ jps
9855 Jps
9488 SecondaryNameNode
9575 JobTracker
9810 TaskTracker
9266 DataNode
9053 NameNode

In this part, we will see how to run a sample MapReduce – MR – job. We will run the WordCount example. It counts the number of times each word appears and writes the counts out as text files.

We will download books from Project Gutenberg to serve as inputs. I have selected the following books and downloaded them in Text UTF-8 format.
1. The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
2. Pride and Prejudice by Jane Austen
3. Ulysses by James Joyce
4. War and Peace by graf Leo Tolstoy
5. Anna Karenina by graf Leo Tolstoy

You can see that the job run is a success. There is one output file and one log file. There is one file that indicates success of the job run.
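
You can list the output directory to see these files; the path here is illustrative and matches the run sketched below:
$ hadoop dfs -ls /user/hadoop/gutenberg-output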

Note the way I ran the jar file. Sometimes people run the job from the Hadoop folder and give only the name of the file. I have chosen to run the job from my home directory and then specify the path under HADOOP_HOME so that Hadoop can locate the jar file correctly.
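
A sketch of what such an invocation could look like, assuming the input books were copied into HDFS under /user/hadoop/gutenberg and using the examples jar that ships with Hadoop 1.0.3:
$ hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output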

You can specify parameters on the command line using the option '-D' followed by the parameter in property=value format.

Note that the quotes do not have much significance from Hadoop's point of view; how they are handled depends on the string tokenizer.
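
For example, a run that overrides the reducer count might look like this; the property name is the Hadoop 1.x one and the paths are illustrative:
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount -D mapred.reduce.tasks=2 /user/hadoop/gutenberg /user/hadoop/gutenberg-output2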

Hadoop Web Interfaces
According to Michael Noll's tutorial, the web interface settings are documented in the file conf/hadoop-default.xml. However, in my particular setup the settings for the NameNode and JobTracker web UIs were found under src/packages/templates/conf, in hdfs-site.xml and mapred-site.xml respectively. The setting for the TaskTracker daemon was found in src/mapred/mapred-default.xml. The web URLs are:
NameNode daemon – http://localhost:50070/
JobTracker daemon – http://localhost:50030/
TaskTracker daemon – http://localhost:50060/

Using the NameNode web interface, we can browse the Hadoop file system and logs; this is the HDFS layer of the system. Using the JobTracker, we can see the job history, and using the TaskTracker web interface, we can view the log files. JobTracker and TaskTracker belong to the MapReduce layer of the system. We can also view the number of Map and Reduce tasks scheduled. Using the NameNode, we can view the input and output files and the status of the nodes. In my setup the default block size is 64 MB; newer Hadoop releases default to 128 MB.

Well, that was pretty much about setting up Hadoop on a single node Ubuntu cluster. Thanks to Michael Noll for the helpful tutorial which is a fantastic reference. My goal is to provide more of a workshop than a tutorial. So I plan to experiment with the system further and update the blog. Thanks for reading!
