Running word-count example on a Hadoop commodity-hardware cluster and on a Hadoop local installation

Sep 20, 2012

Last weekend I spent some hours assembling old computer parts to build a commodity-hardware cluster for running Hadoop. I already had a local installation on my notebook, so I thought it would be cool to run the word-count example in both scenarios and compare the results.

But first, let's review the hardware configurations:

Cluster setup

devcluster01 (NameNode)
CPU: Intel Core 2 Duo 2.3 GHz
RAM: 4 GB
Disk: 200 GB
Network: 100 Mbps full-duplex card

devcluster02
CPU: AMD Athlon 1.6 GHz
RAM: 500 MB
Disk: 4 GB
Network: 100 Mbps full-duplex card

devcluster03
CPU: Celeron 1.3 GHz
RAM: 300 MB
Disk: 10 GB
Network: 100 Mbps full-duplex card

devcluster04 (JobTracker)

CPU: Celeron 2.26 GHz
RAM: 512 MB
Disk: 80 GB
Network: 100 Mbps full-duplex card

devcluster05
CPU: AMD Duron 1.1 GHz
RAM: 512 MB
Disk: 40 GB
Network: 100 Mbps full-duplex card

Network

Ethernet 10/100 D-Link hub

Standalone installation

CPU: Intel Core i5 2.3 GHz (quad core)
RAM: 6 GB
Disk: 500 GB

Notice that there is one NameNode (exclusive) and one JobTracker (also exclusive). I'm following the defaults for a cluster installation, but I will try swapping one of the less powerful DataNode/TaskTracker machines in as the NameNode or the JobTracker (the current masters were picked at random), and also running both master servers as DataNode/TaskTracker.
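For context, in a Hadoop 1.x style installation (the current release line at the time of writing) this master/worker split is declared in the conf files distributed to every node. A minimal sketch, assuming devcluster01 as the NameNode; the JobTracker hostname and the ports below are placeholders to adjust for your own setup:

```xml
<!-- core-site.xml: where the HDFS master (NameNode) lives -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://devcluster01:9000</value>
</property>

<!-- mapred-site.xml: where the MapReduce master (JobTracker) lives -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value> <!-- replace with the JobTracker's hostname -->
</property>
```

Every worker listed in conf/slaves then starts a DataNode and a TaskTracker pointing at these two masters.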

First I used the default data from this tutorial, which includes only three books. Then I increased it to 8 books. Not happy with the results, I tried 30, and finally 66 books. You can get the data from the same GitHub repository mentioned above.
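The word-count job itself is the example bundled with Hadoop, but its logic is simple enough to sketch as a Hadoop Streaming style mapper/reducer pair. This is an illustrative Python equivalent, not the Java example actually used in the runs below:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one 'word<TAB>1' pair per token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word. The input must be
    sorted by key, which Hadoop's shuffle phase guarantees."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

if __name__ == "__main__":
    # Simulate map -> sort (shuffle) -> reduce on a tiny input.
    pairs = sorted(mapper(["to be or not to be"]))
    print(dict(reducer(pairs)))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

With Hadoop Streaming the same two functions would run as separate mapper and reducer scripts reading stdin, and the framework takes care of the sorting in between.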

Using the web interface I retrieved the total time to execute each job and, with the following R script, plotted the graph below (for more on plotting graphs in R, check this link).

a1 = c(73, 75, 132, 248)   # time in the cluster
a2 = c(40, 48, 121, 224)   # time running locally
files = c(3, 8, 30, 66)    # number of files used
plot(x = files, xlab = "Number of files", y = a1, ylab = "Time (s)", col = "red", type = "o")  # plot cluster line
lines(y = a2, x = files, type = "o", pch = 22, lty = 2, col = "blue")  # add the local line
title(main = "Hadoop Execution in seconds", col.main = "black", font.main = 2)
g_range <- range(0, a1, files)
legend(2, g_range[2], c("Cluster", "Local"), cex = 0.8, col = c("red", "blue"), pch = 21:22, lty = 1:2)  # legend

The cluster is running slower than the standalone installation. During this week I'll investigate how to get better results from the cluster. I have only three computers running tasks in the distributed cluster (the other two are the NameNode and the JobTracker), while my notebook has four cores, which may be influencing the results. Other factors to look at are the network latency, the low memory on some nodes, and moving the NameNode and JobTracker to different machines.
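A quick back-of-the-envelope check on the times above suggests the overhead matters less as the input grows: the cluster/local ratio drops from roughly 1.8x with 3 files to roughly 1.1x with 30 or more. A small sketch using the numbers from the R script:

```python
cluster = [73, 75, 132, 248]  # cluster times (s), from the plot data above
local = [40, 48, 121, 224]    # local times (s)
files = [3, 8, 30, 66]        # number of books

# Print how much slower the cluster is for each input size.
for n, c, l in zip(files, cluster, local):
    print("%2d files: cluster/local = %.2f" % (n, c / l))
```

If this trend holds, a large enough data set might be where the cluster finally pulls ahead of the notebook.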

All in all, it's been fun to configure the cluster and run the experiments. It's good practice with Hadoop and HDFS, and it gives a better idea of how to manage a cluster.