This post assumes that the reader already knows how to install and configure Hadoop on a single-node cluster. If not, it is better to first go through the post Installing Hadoop on single node cluster.

In this post we will briefly discuss installing and configuring the hadoop-2.3.0 version on a multi-node cluster. For simplicity, we will consider a small cluster of 3 nodes, each with the minimum configuration described below.

Install Hadoop on a Multi-Node Cluster:

Prerequisite

All three machines have the latest 64-bit Ubuntu OS installed. At the time of writing this post, Ubuntu 14.04 is the latest version available.

All three machines must have Java version 1.6 or higher installed. If not, follow the instructions in the post Installing Java on Ubuntu on all three machines and set up the JAVA_HOME and PATH environment variables appropriately.

All three machines must have SSH (Secure Shell) installed. If not already installed, please follow the instructions in the post Installing SSH on Ubuntu.

Let's assume the IP addresses and hostnames of these three machines, as they will appear in the /etc/hosts file of each machine, are as shown below.

Shell

<IP>           <hostname>
172.100.10.1   master
172.100.10.2   slave-1
172.100.10.3   slave-2

We will set up the master node to run the name node and resource manager daemons along with a data node and node manager, while slave-1 and slave-2 will run only the data node and node manager daemons.

Create a separate user hduser on all three machines with the command shown below.

$ sudo adduser hduser

Configure /etc/hosts file on each machine

By default on Ubuntu, the /etc/hosts file on each machine contains an IP address and hostname as shown below.

127.0.0.1      localhost
127.0.1.1      <hostname-1>    # the machine's own hostname

For all three machines to recognize each other, we need to update this /etc/hosts file on each machine with the IP addresses and hostnames of all three machines. Only a superuser (via sudo) has access to edit this file. The format for specifying the hostnames and IPs is shown below.

Shell

127.0.0.1      localhost
172.100.10.1   master
172.100.10.2   slave-1
172.100.10.3   slave-2

Copy the above four lines into the /etc/hosts file of all three machines. With this setting, each machine can reach the others by hostname alone, instead of specifying the IP address every time.
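Before moving on, it is worth confirming that each hostname actually resolves. This is a small sanity-check sketch using the example hostnames from above (master, slave-1, slave-2); getent consults /etc/hosts the same way most programs do.

```shell
# Check that each cluster hostname resolves via /etc/hosts.
for host in master slave-1 slave-2; do
  if getent hosts "$host" > /dev/null; then
    echo "$host resolves"
  else
    echo "$host is missing from /etc/hosts" >&2
  fi
done
```

Run this on each of the three machines; every hostname should report as resolving.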

SSH Configuration for Cluster setup

Before installing Hadoop on the three machines, we need to set them up as a cluster in which the master node can connect to the slave nodes without requiring a password, and can also connect to itself without any authentication.

As per our prerequisites, SSH is already installed on all three machines.

Create a new RSA key pair on the master node with the command below.

Shell

hduser@master$ ssh-keygen

Press the Enter key at the first prompt to accept the default file in which to save the key, and press Enter again at the passphrase prompt, leaving it empty, so that no password is needed to log in.
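If you prefer to skip the prompts entirely, the same key can be generated non-interactively. This is a sketch assuming OpenSSH's ssh-keygen; the flags mirror the interactive answers described above.

```shell
# Non-interactive equivalent of the prompts above:
#   -t rsa  selects the RSA key type
#   -P ""   sets an empty passphrase (no password on login)
#   -f      writes the key to the default location explicitly
ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa"
```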

We need to copy the generated public key into the list of authorized keys on the master node with the command below.

Shell

hduser@master$ cp .ssh/id_rsa.pub .ssh/authorized_keys

Now we will be able to log in to the master through SSH without giving any password.

Shell

hduser@master$ ssh master

Now, we need to copy the same public key generated on the master node (in the $HOME/.ssh folder) into the corresponding SSH authorized keys files on all slave nodes. We can do this with the help of the commands below.

Shell

hduser@master$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-1
hduser@master$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-2

Note: The above two commands need to be issued from the master node, not from the slave nodes.

Here hduser is the username, and slave-1 and slave-2 are the hostnames.
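To confirm the key copy worked, a quick check from the master node can be sketched as below. The BatchMode=yes option makes ssh fail immediately instead of prompting for a password when key-based login is not in place; hduser, slave-1, and slave-2 are the example names used in this post.

```shell
# Try running a passwordless command on each slave;
# a failure here means key-based authentication is not set up.
for host in slave-1 slave-2; do
  ssh -o BatchMode=yes "hduser@$host" hostname \
    && echo "passwordless login to $host OK" \
    || echo "passwordless login to $host FAILED" >&2
done
```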

Install Hadoop on each Machine

Now, we need to download the latest stable version of Hadoop and install it on each node, usually in the /usr/lib/hadoop location. This mainly includes the three activities below on each node.
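As a rough sketch of the download-and-unpack step, assuming the Apache release archive URL for the 2.3.0 tarball (verify the URL against a current Apache mirror before using it):

```shell
# Download, unpack, and move Hadoop 2.3.0 into place on one node,
# then give hduser ownership of the installation directory.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.3.0/hadoop-2.3.0.tar.gz
tar -xzf hadoop-2.3.0.tar.gz
sudo mv hadoop-2.3.0 /usr/lib/hadoop
sudo chown -R hduser:hduser /usr/lib/hadoop
```

Repeat the same steps on slave-1 and slave-2 so that the installation path is identical on every node.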