Sunday, January 5, 2014

Big Data: Hadoop 2.x/YARN Multi-Node Cluster Installation

Apache Hadoop 2/YARN/MR2 Multi-Node Cluster Installation for Beginners:
In this blog, I will describe the steps for setting up a distributed,
multi-node Hadoop cluster running on Red Hat Linux/CentOS Linux distributions. By now we are comfortable with installing and running MapReduce applications on a single node in pseudo-distributed mode [click here for the details on single-node installation]. Let us move one step forward and deploy a multi-node cluster.

Hadoop Cluster: A Hadoop cluster is designed for distributed processing of large data sets across a group of commodity machines (low-cost servers). The data can be unstructured, semi-structured, or structured. The cluster is
designed to scale up to thousands of machines with a high degree of
fault tolerance, and the software has the intelligence to detect and handle
failures at the application layer.

There are 3 types of machines, based on their specific roles, in a Hadoop cluster environment:

1] Client machines:
- Load the data (input files) into the cluster
- Submit jobs (in our case, a MapReduce job)
- Collect the results and view the analytics

2] Master nodes:
- The NameNode coordinates the data storage function (HDFS), keeping the metadata information.
- The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application.

3] Slave nodes:
The major part of the cluster consists of slave nodes, which perform the computation. The NodeManager manages each node within a YARN cluster and provides per-node services, from managing a container over its life cycle to monitoring resources and tracking the health of its node. A Container represents an allocated resource in the cluster; the ResourceManager is the sole authority that allocates any container to applications. An allocated container is always on a single node, has a unique containerId, and has a specific amount of resources allocated to it. Typically, an ApplicationMaster receives containers from the ResourceManager during resource negotiation and then talks to the NodeManager to start/stop containers. A Resource models a set of computer resources. Currently it models only memory [other resources, such as CPU, may be added in the future].
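The memory-based resource model described above surfaces directly in YARN's configuration. As a rough sketch, a cluster's yarn-site.xml might bound per-node and per-container memory like this (the values here are purely illustrative; tune them to your hardware):

```
<!-- yarn-site.xml: illustrative memory settings -->
<configuration>
  <property>
    <!-- total memory the NodeManager offers to containers on this node -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <!-- smallest container the ResourceManager will allocate -->
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <!-- largest single container allocation -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```

With these settings, every container request is rounded up to a multiple of the minimum allocation, so an ApplicationMaster asking for 1.5 GB would receive a 2 GB container.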

Step 1: The first thing is to establish a network between the master node and the slave node.
Assign an IP address to the eth0 interface of node1 and node2, and add those IP addresses and hostnames to the /etc/hosts file on each node, as shown here.
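For example, the /etc/hosts file on every node could contain entries along these lines (the IP addresses and hostnames are placeholders; substitute your own network's addresses):

```
# /etc/hosts -- identical on all nodes
192.168.1.101   master    # NameNode / ResourceManager
192.168.1.102   slave1    # DataNode / NodeManager
```

After editing the file, verify connectivity by pinging each node by hostname from the other.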

WordCount Example:
The WordCount example reads text files and counts how often words occur. The
input is text files, and the output is text files, each line of which
contains a word and the count of how often it occurred, separated by a
tab. Each mapper takes a line as input and breaks it into words. It then
emits a key/value pair of the word and 1. Each reducer sums the counts
for each word and emits a single key/value pair with the word and the sum.
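Hadoop MapReduce jobs are typically written in Java, but the map/reduce logic just described can be sketched in a few lines of framework-free Python (the same split/emit/sum shape you would use for a Hadoop Streaming job):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: break the line into words and emit (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Reduce phase: sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hello hadoop", "hello yarn"]
pairs = [kv for line in lines for kv in mapper(line)]
print(reducer(pairs))  # {'hello': 2, 'hadoop': 1, 'yarn': 1}
```

In a real cluster, the framework runs many mappers in parallel (one per input split), shuffles the pairs so that all counts for a given word reach the same reducer, and writes the reducer output back to HDFS.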