I have had Hadoop experience for more than a year now, thanks to a great series of Cloud Computing courses on Coursera.org. After ~6 months of running it on several cloud systems, I finally have time to put some of my more practical notes into an article here. I will not go much into theory; my target is to help someone construct their first small Hadoop cluster at home and to show some of my amateur “HelloWorld” code that counts all words in all works of W. Shakespeare using MapReduce. This should leave you with both a small cluster and a working Maven project that you can expand on your own later …

What I have used for my cluster is a home PC with 32G of RAM, running everything inside VMware Workstation. But this guide is applicable even if you run it using VirtualBox, physical machines, or virtual machines on some Internet cloud (e.g. AWS/Azure). The point is simply to have 4 independent Linux boxes that sit together on a shared LAN so they can communicate with each other.

Lab Topology

There is not much to say about the topology for this one. I simplified everything on the network level to a single logical segment by bridging the virtual network to my real home LAN to make my own access simple. However, in any real deployment with more systems you should consider your logical network (ergo splitting into VLANs/subnets based on function), your rack structure and your physical network, as Hadoop and other cloud systems are very delay sensitive.

LAB topology for Hadoop small cluster

Versions of software used

This is a combination that I found stable over the last 6 months. Of course you can try the latest versions of everything, just a friendly note that with these cloud systems, troubleshooting library and version compatibility can take days (I am not kidding). So if you are new to Hadoop, rather take the recommendations before getting angry over weird dependency troubles (which you will hit sooner or later yourself).

Step 1) Preparing the environment on Ubuntu Server 14.04

There are several prerequisites that you need to take care of in order to have Hadoop working correctly. In a nutshell you need to:

Make the cluster nodes resolvable either via DNS or via local /etc/hosts file

Create password-less SSH login between the cluster nodes

Install Java

Setup environmental variables for Hadoop and Java

A. Update /etc/hosts

For my Lab and the IPs shown in the LAB topology, I needed to add this to the /etc/hosts file on all cluster nodes:

@on ALL nodes add this to /etc/hosts:

#master
192.168.10.135 master
#secondary name node
192.168.10.136 secondarymaster
#slave 1
192.168.10.140 slave1
#slave 2
192.168.10.141 slave2
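To confirm that all four names really made it into the file on a given node, a small check like the following can help. It is only a sketch: the function name missing_hosts is my own, and it merely looks for each name in the alias columns of a hosts-format file, ignoring commented-out lines.

```shell
#!/bin/bash
# Print every required host name that is missing from a hosts-format file.
# Usage: missing_hosts FILE NAME...
missing_hosts() {
  local file="$1"; shift
  local name
  for name in "$@"; do
    # Strip comments, then look for the name in any alias column (field 2+).
    if ! sed 's/#.*//' "$file" | awk -v n="$name" \
         '{ for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }'; then
      echo "$name"
    fi
  done
}

# Check the four lab names against the local file.
missing_hosts /etc/hosts master secondarymaster slave1 slave2
```

An empty output means every name was found; any name that is printed still needs an entry on that node.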

B. Create password-less SSH login between nodes

This is essentially about generating a private/public DSA key pair and redistributing it to all nodes as trusted. You can do this with the following steps:

@on MASTER node:

# GENERATE DSA KEY-PAIR
ssh-keygen -t dsa -f ~/.ssh/id_dsa
# MAKE THE KEYPAIR TRUSTED
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# COPY THIS KEY TO ALL OTHER NODES:
scp -r ~/.ssh ubuntu@secondarymaster:~/
scp -r ~/.ssh ubuntu@slave1:~/
scp -r ~/.ssh ubuntu@slave2:~/
# (cassandra1 is not part of the topology above; skip this line unless you have such a node)
scp -r ~/.ssh ubuntu@cassandra1:~/

Now test with ssh that you can log in to each server without a password. For example, from the master node run “ssh ubuntu@slave1” to jump to the slave1 console without being prompted for a password. Hadoop needs this to operate, so it has to work!

NOTE: In production, you should only ever move the public part of the key (id_dsa.pub), not the private key, which should be unique for each server. Ergo the previous key generation procedure should be done on each server and then only the public keys should be exchanged between all the servers. What I am doing here, with all servers using the same private key, is very insecure! If this one key is compromised, all servers are compromised.
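The safer procedure from the NOTE can be sketched as follows. This is an illustration, not what I ran in the lab: the function name plan_key_exchange is made up, it only prints the commands so you can review them, and it uses an RSA key instead of DSA (my substitution, because newer OpenSSH releases disable DSA keys by default).

```shell
#!/bin/bash
# Print the commands for a per-node key exchange instead of running them,
# so they can be reviewed first. Run the printed lines on EACH node.
# Usage: plan_key_exchange USER HOST...
plan_key_exchange() {
  local user="$1"; shift
  local host
  # One private key per node; -N '' means no passphrase (needed for unattended logins).
  echo "ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa"
  for host in "$@"; do
    # ssh-copy-id appends only the PUBLIC key to the remote authorized_keys.
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub ${user}@${host}"
  done
}

plan_key_exchange ubuntu master secondarymaster slave1 slave2
```

With this approach no private key ever leaves the machine it was generated on, at the cost of repeating the exchange on every node.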

C. Install Java

We will simply install java and test we have correct version for Hadoop 2.7.1:

@on ALL nodes:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get -y install oracle-jdk7-installer

Afterwards you should test that you have the correct Java version with the command “java -version”, or with the full path: “/usr/lib/jvm/java-7-oracle/bin/java -version”

D. Setup environmental variables

Add this to your ~/.bashrc file. We are also already preparing some variables here for the Hadoop installation folder:
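As a minimal sketch, assuming Hadoop was unpacked to /home/ubuntu/hadoop (a hypothetical location, adjust it to yours), the variables could look like this. JAVA_HOME matches the Oracle JDK 7 path used elsewhere in this article, and HADOOP_CONF_DIR follows the usual layout of a Hadoop 2.x tarball.

```shell
# Java location, matching the Oracle JDK 7 install from the previous step
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# ASSUMPTION: hypothetical unpack location of the Hadoop tarball, adjust to yours
export HADOOP_HOME=/home/ubuntu/hadoop

# Usual layout of a Hadoop 2.x tarball: config files live under etc/hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Make the hadoop/hdfs/yarn commands and the start/stop scripts available
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After editing the file, run “source ~/.bashrc” or log in again so the variables take effect in your current shell.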

Step 3) Configuring Hadoop for first run

Out of the box, Hadoop is configured for a single-node, non-distributed mode, which means you can execute it all inside one server. That is not why we are here, so our target is to configure it for a real cluster. Here are the high-level steps.

@on ALL nodes:

edit $HADOOP_CONF_DIR/core-site.xml

change from:

<configuration>
</configuration>

change to:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/hdfstmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>

@on ALL nodes:

edit $HADOOP_CONF_DIR/hdfs-site.xml

change from:

<configuration>
</configuration>

change to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>secondarymaster:50090</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/ubuntu/hdfstmp/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/ubuntu/hdfstmp/dfs/name</value>
    <final>true</final>
  </property>
</configuration>

@on ALL nodes:

edit (or create since missing) $HADOOP_CONF_DIR/mapred-site.xml

change to:

<configuration>
  <!-- legacy MRv1 setting (host:port); not used when the framework below is yarn -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:8021</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

@on ALL nodes:

edit $HADOOP_CONF_DIR/yarn-site.xml

change from:

<configuration>
</configuration>

change to:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

@on ALL nodes:

edit $HADOOP_CONF_DIR/hadoop-env.sh

change from:

export JAVA_HOME=${JAVA_HOME}

change to:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

@on MASTER

Remove “yarn.resourcemanager.hostname” property from yarn-site.xml ONLY ON MASTER node otherwise your master ResourceManager will listen only on localhost and other nodes will not be able to connect to it!

@on SECONDARYMASTER

Remove “dfs.namenode.secondary.http-address” from hdfs-site.xml ONLY ON SECONDARYMASTER node

@on MASTER and SECONDARYMASTER

edit $HADOOP_CONF_DIR/slaves

change from:

localhost

change to:

slave1
slave2

Step 4) Format HDFS and first run of Hadoop

Since we now have Hadoop fully configured, we can format the HDFS on all nodes and try to run it from Master.
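A typical format-and-start sequence for Hadoop 2.x can be sketched like this. The commands hdfs namenode -format, start-dfs.sh and start-yarn.sh ship with Hadoop; the run wrapper and the DRY_RUN switch are my own hypothetical helpers, so the script only prints the commands until you explicitly enable execution. The format command initialises the NameNode metadata directory, so the master is where it matters most.

```shell
#!/bin/bash
# Hypothetical helper: print commands by default, execute only with DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

first_start() {
  # Initialise the NameNode metadata (this wipes any existing HDFS metadata).
  run hdfs namenode -format
  # Start NameNode, SecondaryNameNode and the DataNodes listed in the slaves file.
  run start-dfs.sh
  # Start ResourceManager and the NodeManagers.
  run start-yarn.sh
}

first_start
```

Once the printed sequence looks right, running the script on the master with DRY_RUN=0 in the environment executes it for real.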

Step 5) Verification of start

The basic test is to check which Hadoop services are running in Java with the “jps” command. This is how it should look on each node:

@MASTER:

ubuntu@master:~$ jps
4525 NameNode
5048 Jps
4791 ResourceManager

@SECONDARY MASTER

ubuntu@secondarymaster:~$ jps
4088 SecondaryNameNode
4140 Jps

@SLAVE1

ubuntu@slave1:~$ jps
3406 DataNode
3645 Jps
3547 NodeManager

@SLAVE2

ubuntu@slave2:~$ jps
3536 NodeManager
3395 DataNode
3634 Jps

The explanation is that ResourceManager is the YARN master component, while NodeManager is the YARN component on the slaves. HDFS is composed of the NameNode and SecondaryNameNode on the masters, while DataNode is the HDFS component on the slaves. All these components have to exist (and be able to communicate with each other via the LAN) for the Hadoop cluster to work.
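If you do not want to compare the jps output by eye on every node, the comparison can be scripted. This is a sketch with a made-up function name (missing_daemons); it reads jps output on standard input and prints each expected daemon name it cannot find.

```shell
#!/bin/bash
# Read `jps` output on stdin and print every expected daemon that is absent.
# Usage: jps | missing_daemons NAME...
missing_daemons() {
  local running name
  # jps prints "<pid> <class name>"; keep only the class name column.
  running="$(awk '{ print $2 }')"
  for name in "$@"; do
    if ! printf '%s\n' "$running" | grep -qxF "$name"; then
      echo "$name"
    fi
  done
}

# Example with the master output shown above; DataNode is not expected
# on the master, so it is reported as missing here.
printf '4525 NameNode\n5048 Jps\n4791 ResourceManager\n' \
  | missing_daemons NameNode ResourceManager DataNode
```

On a real node you would run, for example, “jps | missing_daemons DataNode NodeManager” on a slave; an empty output means everything expected is running.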

Additional verification can be done by checking the web interfaces. Most importantly (and you should bookmark this for checking the status of applications), open your browser and go to “http://master:8088”. Port 8088 is the web interface of the YARN scheduler. The most important things to check there are:

You are able to actually visit port 8088 on master (this means the ResourceManager is running)

Check the number of active nodes visible to YARN (picture below). If you see them, it means that the slaves have managed to register to the ResourceManager as available resources.

YARN ResourceManager WEB interface on port 8088, here it also proves that master can see two “Active Nodes”
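The same check can be scripted. As a sketch, the ResourceManager also exposes a REST endpoint at /ws/v1/cluster/metrics on the same port, and its JSON contains an activeNodes counter. The helper name active_nodes is mine, and the grep-based parsing is a shortcut that assumes the counter appears as "activeNodes":<number>; a real JSON parser would be more robust.

```shell
#!/bin/bash
# Extract the activeNodes counter from the ResourceManager metrics JSON on stdin.
active_nodes() {
  grep -o '"activeNodes" *: *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Offline example with a shortened sample of the JSON shape:
echo '{"clusterMetrics":{"appsSubmitted":0,"activeNodes":2,"lostNodes":0}}' | active_nodes

# Against the live cluster from this lab (needs curl and a running ResourceManager):
#   curl -s http://master:8088/ws/v1/cluster/metrics | active_nodes
```

For the lab in this article the live call should report the same number of active nodes as the web page, which is two.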

Step 6) Running your first “HelloWorld” Hadoop application

OK, there are two paths here.

Use the Pi example program provided with Hadoop, but this might have high RAM requirements that my 2G slaves had trouble meeting

Use the super-small Hadoop Java program that I provide below, step by step, to build your own application and run it to count all words in all the plays of W. Shakespeare.

Option #1:

There is already a pre-compiled example program that calculates Pi (ergo approximating the number Pi to a high number of decimal places). You can immediately run it with the following command using the pre-compiled examples JAR that came with the Hadoop installation:
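As a sketch, assuming the standard tarball layout where the examples JAR sits under share/hadoop/mapreduce, the invocation looks like this. The two numbers (10 maps, 100 samples per map) are arbitrary small values, and HADOOP_HOME falls back to a hypothetical /home/ubuntu/hadoop if it is not set.

```shell
#!/bin/bash
# Build the path of the bundled examples JAR.
# ASSUMPTION: /home/ubuntu/hadoop is only a hypothetical fallback location.
examples_jar() {
  local version="${1:-2.7.1}"
  echo "${HADOOP_HOME:-/home/ubuntu/hadoop}/share/hadoop/mapreduce/hadoop-mapreduce-examples-${version}.jar"
}

# pi <number of maps> <samples per map>; small values keep the job short.
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$(examples_jar)" pi 10 100
else
  echo "hadoop is not on PATH; would run: hadoop jar $(examples_jar) pi 10 100"
fi
```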

However, for me this didn’t work by default, because the Pi example asked for 8G of RAM from the YARN scheduler, which my 2G slaves were not able to allocate. This resulted in the application being “ACCEPTED”, but never scheduled for execution by YARN. To solve this, check the extra notes on RAM management below, which you can optionally use here.

So right now you should be inside the project directory where I provided the following files:

ubuntu@master:~/maven_wordcount_example_networkgeekstuff$ ls -l
total 32
drwxrwxr-x 2 ubuntu ubuntu 4096 Jun  2 09:53 input
-rw-r--r-- 1 ubuntu ubuntu 4030 Jun  2 09:53 pom.xml
drwxrwxr-x 3 ubuntu ubuntu 4096 Jun  2 09:53 src
-rwxr--r-- 1 ubuntu ubuntu  197 Jun  2 09:53 step0_prepare_HDFS_input
-rwxr--r-- 1 ubuntu ubuntu   87 Jun  2 09:53 step1_compile
-rwxr--r-- 1 ubuntu ubuntu  476 Jun  2 09:53 step2_execute
-rwxr--r-- 1 ubuntu ubuntu  102 Jun  2 09:53 step3_read_output
drwxrwxr-x 2 ubuntu ubuntu 4096 Jun  2 09:53 target

To make this SIMPLE FOR YOU, notice that I have provided these 4 super-small scripts:

step0_prepare_HDFS_input

step1_compile

step2_execute

step3_read_output

So you can simply execute these one by one, and at the end you will get the result of counting the words in all the works of William Shakespeare (provided as txt inside the “./input” directory from the download).

But let's go through these files one by one for explanation:

@step0_prepare_HDFS_input

Simply uses HDFS manipulation commands to create input and output directories in HDFS and upload a local file with Shakespeare texts into the input folder.

#!/bin/bash
echo "putting shakespear into HDFS";
hadoop fs -mkdir /networkgeekstuff_input
hadoop fs -mkdir /networkgeekstuff_output
hadoop fs -put ./input/shakespear.txt /networkgeekstuff_input/

@step1_compile

This one is more interesting: it uses the Maven framework to download all the Java library dependencies (these are described in the pom.xml file together with compilation parameters, names and other build details for the target JAR).

#!/bin/bash
echo "==========="
echo "Compilation"
echo "==========="
mvn clean package

The result of this simple command is a new Java JAR file inside the “target” directory that we can later use with Hadoop. Please take a good look at the compilation process. To save space in this article I didn’t provide the whole output, but at the very end you should get a message like this:

NOTE: As the first step I always remove the output folder. The point is that the Java JAR does not check whether the output files already exist, and if there is a collision the execution fails. Therefore ALWAYS delete all output files before attempting to re-run your programs with HDFS.

The “hadoop jar” command takes the following arguments:

./target/hadoop_wordcount_project-0.0.1-jar-with-dependencies.jar – the JAR file to run

com.examples.WordCount – the Java class that is to be executed by the YARN scheduler on the slaves

/networkgeekstuff_input – the first argument passed to the Java class; the Java code treats it as the INPUT folder

/networkgeekstuff_output – the second argument passed to the Java class; the Java code treats it as the OUTPUT folder for storing results
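Putting the NOTE and the argument list together, a minimal sketch of what such an execute script might contain is shown here. The JAR path, class name and HDFS folders are the ones listed above; the helper name wordcount_cmd and the “command -v” guard are mine, and the actual step2_execute from the download may differ.

```shell
#!/bin/bash
JAR=./target/hadoop_wordcount_project-0.0.1-jar-with-dependencies.jar
MAIN_CLASS=com.examples.WordCount
INPUT=/networkgeekstuff_input
OUTPUT=/networkgeekstuff_output

# Print the job submission command so it can be inspected or logged.
wordcount_cmd() {
  echo "hadoop jar $JAR $MAIN_CLASS $INPUT $OUTPUT"
}

if command -v hadoop >/dev/null 2>&1; then
  # Remove results of a previous run first, otherwise the job fails on collision.
  hadoop fs -rm -r "$OUTPUT"
  hadoop jar "$JAR" "$MAIN_CLASS" "$INPUT" "$OUTPUT"
else
  echo "hadoop is not on PATH; would run: $(wordcount_cmd)"
fi
```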

This is how a successful run of the Hadoop program should look. Notice that since this is a very small program, it very quickly jumped to “map 100% reduce 100%”; in larger programs you would see many, many lines showing the progress of both the map and reduce parts:

NOTE: you can see that you can get a WEB url in this output (in the example above it was : http://hadoopmaster:8088/proxy/application_1464859507107_0002/) to track the application progress (very useful in large computations that take many hours)

@step3_read_output

The last step is simply to read the results of the Hadoop job by reading all the TXT files in the OUTPUT folder.

#!/bin/bash
echo "This is result of our wordcount example";
hadoop fs -cat /networkgeekstuff_output/*

The output will be really long, because this very simple program does not remove special characters and as such the results are not very clean. As homework, I challenge you to work on the Java code to strip special characters before counting. A second interesting problem to solve is sorting, which works very differently in the Hadoop MapReduce logic.

Step 7) MapReduce Java code from the HelloWorld example we just ran

Now that we have run this code successfully, let's have a look at it. If you open the single .java file in the src directory, it will look like this:

For now I will only tell you that Hadoop uses a programming methodology called MapReduce, where you first divide the input based on a defined key (here simply every word is a key) in the Map phase, and then group the records together by key while counting the number of instances of each key during the Reduce phase.
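To make the two phases tangible without the Java boilerplate, here is a toy single-machine analogy in shell. It is not Hadoop code and not the WordCount class from the download: tr plays the mapper (emit one word per line), sort stands in for the shuffle that brings equal keys together, and uniq -c plays the reducer (count per key).

```shell
#!/bin/bash
# Toy analogy of the MapReduce word count, running on one machine.
wordcount() {
  tr -s '[:space:]' '\n' |   # "map": split the text into one word (key) per line
    grep -v '^$' |           # drop the empty line a leading space would produce
    sort |                   # "shuffle": bring identical keys next to each other
    uniq -c |                # "reduce": count the occurrences of each key
    awk '{ print $2 "\t" $1 }'   # print as word<TAB>count
}

printf 'to be or not to be\n' | wordcount
```

The real framework does the same three things, only spread over many machines: mappers run where the data blocks are, the shuffle moves records over the network, and reducers each handle a share of the keys.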

I do not want to go into explaining this in detail if you are new, and I would very much like to recommend a free online course from which I learned how to program in Hadoop. I highly recommend visiting: https://www.coursera.org/specializations/cloud-computing

(optional) Step 8) RAM management for YARN cluster

One thing that you might have noticed is that, by default, YARN does not set many limits on the so-called “containers” in which applications run. This means that an application can request 15G of RAM and YARN will accept this, but if it doesn’t find these resources available, it will block the execution: your application will be accepted by YARN, but never scheduled. One way to help in these situations is to configure YARN with much more realistic RAM expectations for small VM nodes like the ones we used here (you remember we have slaves with 2G of RAM each).

Before showing you my solution to push RAM utilization down to 1G of RAM per slave: the underlying logic of how to calculate these numbers for your cluster can be found in these two best resources:

Please consider them mandatory reading, because you WILL have RAM-related problems very soon yourself. If not with low RAM, then alternatively if you are using slaves with more than 8G of RAM: by default Hadoop will not use it, so you have to do these configurations to also avoid under-utilization on large clusters.

In my own cluster, in the end I set 512MB of RAM per application container as the minimum, with a maximum of 4096MB (because 2G RAM + 2G SWAP might be able to handle this on my slaves). Additionally, you also have to consider the Java JVM overhead of each process, so you run all Java code with optional arguments that limit the Java heap to 450m of RAM (the recommendation is about 80% of the container RAM, so this is my best guess for 512MB).
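The arithmetic behind these numbers is simple enough to sketch. A strict 80% rule gives roughly 409MB of heap for a 512MB container, so the 450m used here is a bit more aggressive than that, and a node that advertises 4096MB to YARN can host at most eight minimum-size containers. The function names are only illustrative.

```shell
#!/bin/bash
# Integer arithmetic on the values used in this section (all in MB).
CONTAINER_MB=512     # yarn.scheduler.minimum-allocation-mb
NODE_MB=4096         # yarn.nodemanager.resource.memory-mb

# Heap at a given percentage of the container size.
heap_mb() {
  echo $(( $1 * $2 / 100 ))
}

# How many minimum-size containers fit on one node.
containers_per_node() {
  echo $(( $1 / $2 ))
}

echo "80% heap for a ${CONTAINER_MB}MB container: $(heap_mb "$CONTAINER_MB" 80)MB"
echo "containers per node: $(containers_per_node "$NODE_MB" "$CONTAINER_MB")"
```

Changing the two variables at the top lets you redo the calculation for nodes with more RAM before touching the XML files.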

Here is the configuration that needs to be added to yarn-site.xml between the <configuration> tags:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

And here is the configuration for mapred-site.xml, also to be added between the <configuration> tags:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx450m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx450m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx450m</value>
</property>

Summary and where to go next …

So at this point you should have your own small home Hadoop cluster, and you have successfully compiled your first HelloWorld “Word Counting” application, written with the MapReduce approach to count all the words in all works of W. Shakespeare and store the results in an HDFS cluster.

Your next step from here should be to explore my example Java code of this Word Counting example (because there are a bazillion explanations of WordCount in Hadoop on the net, I didn’t put one here). If you want to truly understand the principles and go further to writing more useful applications, I cannot recommend enough the free coursera.org Cloud Computing specialization courses. I spent a lot of time on them last year learning not only Hadoop, but also Cassandra DB, Spark, Kafka and other trendy names, and how to write useful code for them.

My next step is to try to write a similar quick LAB example, expanding this LAB with Spark (as it also uses YARN in the background), which is a representative system for stream processing. Stream processing is a very interesting alternative to the MapReduce approach, with its own set of problems where it can be more useful than basic MapReduce.

Final NOTE: Hadoop and the whole ecosystem is very much a live project that is constantly changing. For example, before using Hadoop 2.7.1 I literally spent hours and hours troubleshooting other versions until I found out that they are not compatible with some libraries on Ubuntu 14.04. Or, for example, I spent more hours integrating Cassandra DB and the Java API for Cassandra (the DataStax driver) until I realized that these simply cannot be combined inside one server, as each demands a different Java version and some libraries. As such, my warning when going into open-source BigData is that you will definitely get tired/angry/mad before you have a working cluster if you run into a wrong combination of versions. Just be ready for it and accept it as a fact.