Sunday, February 24, 2013

Starting with Hadoop: Installation

Purpose
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
Prerequisites
1. Make sure all required software is installed on all nodes in your cluster (e.g. JDK 1.6 or later).
2. Download the Hadoop software.

dfs.replication is the number of replicas kept of each block. dfs.name.dir is the path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. dfs.data.dir is a comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
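
As a rough illustration, the three HDFS properties could be set as shown below. This sketch assumes they live in a file named hdfs-site.xml under the conf directory. The replication factor and all paths are placeholder values, and write_hdfs_site is a hypothetical helper, not part of Hadoop.

```shell
# Sketch only: write a minimal hdfs-site.xml into the directory given as $1.
# Every value below is a placeholder (assumption), not a recommendation.
write_hdfs_site() {
  cat > "$1/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data1/hadoop/data,/data2/hadoop/data</value>
  </property>
</configuration>
EOF
}

# Demo: write into a scratch directory rather than a live conf/ directory.
demo_dir=$(mktemp -d)
write_hdfs_site "$demo_dir"
cat "$demo_dir/hdfs-site.xml"
```

On a real node the function would be pointed at the Hadoop conf directory instead of a scratch directory.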

mapreduce.jobtracker.address is the host (or IP address) and port of the JobTracker. mapreduce.jobtracker.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapreduce.cluster.local.dir is a comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
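
The Map/Reduce properties could be written out in a similar way. The sketch below assumes the file is named mapred-site.xml under the conf directory. The host name, port and paths are made-up placeholders, and property and write_mapred_site are hypothetical helpers.

```shell
# Print one <property> element; $1 is the property name, $2 its value.
property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# write_mapred_site CONF_DIR JOBTRACKER_HOST_PORT
# Paths below are placeholders (assumptions) for illustration only.
write_mapred_site() {
  {
    printf '<?xml version="1.0"?>\n<configuration>\n'
    property mapreduce.jobtracker.address "$2"
    property mapreduce.jobtracker.system.dir /hadoop/mapred/system
    property mapreduce.cluster.local.dir /data1/mapred/local,/data2/mapred/local
    printf '</configuration>\n'
  } > "$1/mapred-site.xml"
}

# Demo: write into a scratch directory with a made-up JobTracker address.
demo_dir=$(mktemp -d)
write_mapred_site "$demo_dir" "jobtracker.example.com:9001"
cat "$demo_dir/mapred-site.xml"
```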

5. conf/masters
Delete localhost and add the name of each NameNode host, one per line.
For example:
<IP of Namenode>

6. conf/slaves
Delete localhost and add the names of all the slave nodes (the hosts that run the DataNode and TaskTracker daemons), one per line.
For example:
<IP of Slave 1>
<IP of Slave 2>
….
….
<IP of Slave n>
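
Both files are plain lists of hosts, so they can also be generated rather than edited by hand. A minimal sketch, assuming a hypothetical helper write_hosts and made-up addresses from the 192.0.2.0/24 documentation range:

```shell
# write_hosts FILE HOST... : overwrite FILE with one host per line.
write_hosts() {
  file=$1
  shift
  printf '%s\n' "$@" > "$file"
}

# Demo in a scratch directory; the addresses are placeholders.
demo_dir=$(mktemp -d)
write_hosts "$demo_dir/masters" 192.0.2.10
write_hosts "$demo_dir/slaves" 192.0.2.11 192.0.2.12 192.0.2.13
cat "$demo_dir/slaves"
```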

7. Configuring SSH
In fully distributed mode, the daemons are started over SSH, so SSH must be installed. Hadoop's start scripts do nothing more than SSH to each host in the cluster (as defined by the slaves file) and start a daemon process there. So we need to make sure that we can SSH to localhost and log in without having to enter a password.

$ sudo apt-get install ssh

Then to enable password-less login, generate a new SSH key with an empty passphrase:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test this with:

$ ssh localhost

You should be logged in without having to type a password.
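To get passwordless access from the master node to the slave nodes as well, the master's public key must be present in ~/.ssh/authorized_keys on every slave. Copying the contents of the master's ~/.ssh/ directory to each slave achieves this, although appending only id_rsa.pub to each slave's authorized_keys is sufficient and avoids distributing the private key.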
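
One way to do this for every host in the slaves file is sketched below. It assumes the ssh-copy-id utility that ships with OpenSSH is available on the master. push_key and DRY_RUN are hypothetical names, and the addresses are placeholders. By default the sketch only prints the commands it would run.

```shell
# push_key SLAVES_FILE : install the master's public key on each listed host.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
push_key() {
  # Read hosts on file descriptor 3 so that ssh keeps the normal stdin.
  while IFS= read -r host <&3; do
    [ -n "$host" ] || continue          # skip blank lines
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "ssh-copy-id -i $HOME/.ssh/id_rsa.pub $host"
    else
      ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$host"
    fi
  done 3< "$1"
}

# Demo with a scratch slaves file and made-up addresses.
demo_slaves=$(mktemp)
printf '%s\n' 192.0.2.11 192.0.2.12 > "$demo_slaves"
DRY_RUN=1
push_key "$demo_slaves"
```

Setting DRY_RUN=0 would run the commands for real. Each host then asks for its password once, after which logins from the master should no longer prompt.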

8. Duplicate Hadoop configuration files to all nodes
We may duplicate the configuration files under the conf directory to all nodes. The script mentioned above can be used. By now, we have finished copying the Hadoop software and configuring Hadoop. Now let's have some fun with Hadoop.
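
For readers without such a script at hand, a stand-in could look like the sketch below. It assumes Hadoop is installed under the same path on every node (here the placeholder /opt/hadoop) and that passwordless SSH already works. copy_conf and DRY_RUN are hypothetical names, and by default the commands are only printed.

```shell
# copy_conf CONF_DIR REMOTE_HADOOP_HOME SLAVES_FILE
# Copies CONF_DIR into REMOTE_HADOOP_HOME on each host listed in SLAVES_FILE.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
copy_conf() {
  while IFS= read -r host <&3; do
    [ -n "$host" ] || continue          # skip blank lines
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "scp -r $1 $host:$2/"
    else
      scp -r "$1" "$host:$2/"
    fi
  done 3< "$3"
}

# Demo with a scratch slaves file; addresses and paths are placeholders.
demo_slaves=$(mktemp)
printf '%s\n' 192.0.2.11 192.0.2.12 > "$demo_slaves"
DRY_RUN=1
copy_conf conf /opt/hadoop "$demo_slaves"
```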

Hadoop Startup
To start a Hadoop cluster you need to start both HDFS and the Map/Reduce cluster. First, format a new distributed file system:

$ bin/hadoop namenode -format
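
Formatting discards any namespace already stored in dfs.name.dir, so it can be worth checking that directory first. The sketch below assumes that an already formatted NameNode directory contains a current/ subdirectory. name_dir_is_empty is a hypothetical helper, and the format command itself is left commented out.

```shell
# Succeed only if the given dfs.name.dir holds no existing namespace image.
name_dir_is_empty() {
  [ ! -d "$1/current" ]
}

# Demo against a scratch directory standing in for dfs.name.dir.
demo_dir=$(mktemp -d)
if name_dir_is_empty "$demo_dir"; then
  echo "no existing namespace found in $demo_dir"
  # bin/hadoop namenode -format     # deliberately not run in this sketch
else
  echo "refusing to format: $demo_dir/current already exists"
fi
```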

Start HDFS with the following command, run on the designated NameNode:

$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
Start Map/Reduce with the following command, run on the designated JobTracker:

$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
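
After both scripts have run, each slave should have a DataNode and a TaskTracker process. One way to check is the JDK's jps tool, which prints one "PID Name" line per running JVM. The sketch below uses canned input in place of real jps output, and missing_daemons is a hypothetical helper.

```shell
# Print each expected daemon name that does not appear in jps-style input.
# Reads "PID Name" lines on stdin; the expected names are the arguments.
missing_daemons() {
  running=$(awk '{print $2}')
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qx "$name" || echo "$name"
  done
}

# Demo with canned input; on a real slave this would be:
#   jps | missing_daemons DataNode TaskTracker
printf '%s\n' '4021 DataNode' '4300 Jps' |
  missing_daemons DataNode TaskTracker
# prints: TaskTracker
```

Empty output means every expected daemon was found.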

Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:

$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:

$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.