How to Install and Set Up a 3-Node Hadoop Cluster

What is Hadoop?

Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It's composed of the Hadoop Distributed File System (HDFS™), which handles scalability and redundancy of data across nodes, and Hadoop YARN, a framework for job scheduling that executes data processing tasks on all nodes.

Before You Begin

Follow the Getting Started guide to create three (3) Linodes. They’ll be referred to throughout this guide as node-master, node1 and node2. It’s recommended that you set the hostname of each Linode to match this naming convention.
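On distributions using systemd, one way to set the hostname is with hostnamectl; run the matching command on each Linode with its own name:

sudo hostnamectl set-hostname node-master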

Run the steps in this guide from the node-master unless otherwise specified.

Follow the Securing Your Server guide to harden the three servers. Create a normal user for the install, and a user called hadoop for any Hadoop daemons. Do not create SSH keys for the hadoop user; SSH keys will be addressed in a later section.
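On Debian or Ubuntu, for example, the hadoop user can be created with adduser (on CentOS, use useradd -m followed by passwd):

sudo adduser hadoop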

Install the JDK using the appropriate guide for your distribution (Debian, CentOS, or Ubuntu), or grab the latest JDK from Oracle.

The steps below use example IPs for each node. Adjust each example according to your configuration:

node-master: 192.0.2.1

node1: 192.0.2.2

node2: 192.0.2.3

Note

This guide is written for a non-root user. Commands that require elevated privileges are prefixed with sudo. If you’re not familiar with the sudo command, see the Users and Groups guide. All commands in this guide are run with the hadoop user if not specified otherwise.

Architecture of a Hadoop Cluster

Before configuring the master and slave nodes, it’s important to understand the different components of a Hadoop cluster.

A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. node-master will handle this role in this guide, and host two daemons:

The NameNode: manages the distributed file system and knows where data blocks are stored inside the cluster.

The ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on slave nodes.

Slave nodes store the actual data and provide processing power to run the jobs. They'll be node1 and node2, and will host two daemons:

The DataNode: manages the data physically stored on the node.

The NodeManager: manages the execution of tasks on the node.

Configure the System

Create Host File on Each Node

For each node to communicate with the others by name, edit the /etc/hosts file to add the IP addresses of the three servers. Don't forget to replace the sample IPs with your own:

/etc/hosts

192.0.2.1 node-master
192.0.2.2 node1
192.0.2.3 node2

Distribute Authentication Key-pairs for the Hadoop User

To manage the cluster, the master node will use an SSH connection to connect to the other nodes with key-pair authentication.

Log in to node-master as the hadoop user, and generate an SSH key:

ssh-keygen -b 4096

Copy the key to the other nodes. It's good practice to also copy the key to node-master itself, so that you can use it as a DataNode if needed. Type the following commands, and enter the hadoop user's password when asked. If you are prompted whether or not to add the key to known hosts, enter yes:
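ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node-master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node2

This assumes the key was generated at the default path, ~/.ssh/id_rsa; adjust the path if you chose another location.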

Set Environment Variables

Add Hadoop binaries to your PATH. Edit /home/hadoop/.profile and add the following line:

/home/hadoop/.profile

PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH

Configure the Master Node

Configuration will be done on node-master and replicated to other nodes.

Set JAVA_HOME

Get your Java installation path. If you installed OpenJDK from your package manager, you can get the path with the command:

update-alternatives --display java

Take the value of the current link and remove the trailing /bin/java. For example on Debian, the link is /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java, so JAVA_HOME should be /usr/lib/jvm/java-8-openjdk-amd64/jre.

If you installed Java from Oracle, JAVA_HOME is the path where you unzipped the Java archive.

Edit ~/hadoop/etc/hadoop/hadoop-env.sh and replace this line:

export JAVA_HOME=${JAVA_HOME}

with your actual Java installation path. For example, on Debian with openjdk-8:

~/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

Set NameNode Location

Update ~/hadoop/etc/hadoop/core-site.xml to set the NameNode location to node-master on port 9000:
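A minimal sketch of that configuration; fs.defaultFS is the current property name (older releases use fs.default.name):

~/hadoop/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>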

The dfs.replication property, set in ~/hadoop/etc/hadoop/hdfs-site.xml, indicates how many times data is replicated in the cluster. You can set it to 2 to have all the data duplicated on the two nodes. Don't enter a value higher than the actual number of slave nodes.
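A sketch of that file, assuming HDFS data lives under /home/hadoop/data (the storage paths are illustrative; adjust them to your layout):

~/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>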

Set YARN as Job Scheduler

In ~/hadoop/etc/hadoop/, rename mapred-site.xml.template to mapred-site.xml:

cd ~/hadoop/etc/hadoop
mv mapred-site.xml.template mapred-site.xml

Edit the file, setting yarn as the default framework for MapReduce operations:
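~/hadoop/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>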

Configure Slaves

The file slaves is used by startup scripts to start required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to be:

~/hadoop/etc/hadoop/slaves

node1
node2

Configure Memory Allocation

Memory allocation can be tricky on low-RAM nodes because the default values are not suitable for nodes with less than 8GB of RAM. This section will highlight how memory allocation works for MapReduce jobs, and provide a sample configuration for 2GB RAM nodes.

The Memory Allocation Properties

A YARN job is executed with two kinds of resources:

An Application Master (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster.

Executors, created by the AM, actually run the job. For a MapReduce job, they perform the map and reduce operations in parallel.

Both run in containers on slave nodes. Each slave node runs a NodeManager daemon that's responsible for container creation on the node. The whole cluster is managed by a ResourceManager that schedules container allocation on all the slave nodes, depending on capacity requirements and current load.

Four types of resource allocations need to be configured properly for the cluster to work. These are:

How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.

This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.

How much memory a single container can consume and the minimum memory allocation allowed. A container will never be bigger than the maximum (or else allocation will fail) and is always allocated as a multiple of the minimum amount of RAM.

Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.

How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size.

This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.

How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.

This is configured in mapred-site.xml with properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

To summarize the relationship between those properties: yarn.nodemanager.resource.memory-mb bounds the total memory for containers on a node, yarn.scheduler.maximum-allocation-mb caps any single container within that total, and both the ApplicationMaster allocation and each map or reduce allocation must fit under that cap.

Sample Configuration for 2GB Nodes

For 2GB nodes, a working configuration may be:

Property                                  Value (MB)
yarn.nodemanager.resource.memory-mb       1536
yarn.scheduler.maximum-allocation-mb      1536
yarn.scheduler.minimum-allocation-mb      128
yarn.app.mapreduce.am.resource.mb         512
mapreduce.map.memory.mb                   256
mapreduce.reduce.memory.mb                256

Edit /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and add the following lines inside the <configuration> element:
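A sketch using the values from the table above:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>

The remaining three properties from the table (yarn.app.mapreduce.am.resource.mb, mapreduce.map.memory.mb, and mapreduce.reduce.memory.mb) go in mapred-site.xml, as described above.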

Run YARN

HDFS is a distributed storage system; it doesn't provide any services for running and scheduling tasks in the cluster. That is the role of the YARN framework. The following section covers starting, monitoring, and submitting jobs to YARN.

Start and Stop YARN

Start YARN with the script:

start-yarn.sh

Check that everything is running with the jps command on each node. In addition to the HDFS daemons started earlier, you should see a ResourceManager on node-master, and a NodeManager on node1 and node2.

To stop YARN, run the following command on node-master:

stop-yarn.sh

Monitor YARN

The yarn command provides utilities to manage your YARN cluster. You can also print a report of running nodes with the command:
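yarn node -list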

As with HDFS, YARN provides a friendlier web UI, started by default on port 8088 of the ResourceManager. Point your browser to http://node-master-IP:8088 and browse the UI.

Submit MapReduce Jobs to YARN

YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You'll use them to run a word count on the three books previously uploaded to HDFS.
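A sketch of such an invocation, assuming the sample applications jar shipped with your release and input files in a books directory on HDFS (adjust the jar name to match your Hadoop version):

yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount "books/*" output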

Next Steps

More Information

You may wish to consult the following resources for additional information on this topic. While these are provided in the hope that they will be useful, please note that we cannot vouch for the accuracy or timeliness of externally hosted materials.