Jan 28, 2011

Setting up Pseudo-Distributed Apache Hadoop 0.21.0 in 10 minutes

I'm providing this as collateral material for my Hadoop presentation at Data Day Austin. Pseudo-Distributed mode is effectively a one-node Hadoop cluster. It's the best way to get started with Hadoop, because once you have a handle on the basics it's easy to modify the config to be fully distributed. It's also a good developer setup. You'll notice that in some of the modified *-site config files, the path values I provide are relative to my Hadoop install directory; that's because I run several different Hadoop installations on one machine.

Ready? Here we go:

1) Setup Passwordless SSH:

$ ssh-keygen -t dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Test that you can ssh without password:

$ ssh localhost

2) Download Hadoop and untar it in your desired directory (make sure your user has write permission to this directory)

$ tar -xf hadoop-0.21.0.tar.gz

3) Uncomment and set JAVA_HOME in the $HADOOP_HOME/conf/hadoop-env.sh
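The resulting line in hadoop-env.sh looks something like this; the JDK path below is only an illustration, so substitute the location of your own Java installation:

```shell
# In $HADOOP_HOME/conf/hadoop-env.sh -- uncomment the JAVA_HOME line and
# point it at your JDK (the path here is an example, not a requirement)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```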

4) Insert the following XML between <configuration> tags in the $HADOOP_HOME/conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

5) Insert the following XML between the <configuration> tags in the $HADOOP_HOME/conf/mapred-site.xml

For mapred.system.dir, create a $HADOOP_HOME/tmp directory and specify your own path.
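As a sketch, the typical pseudo-distributed mapred-site.xml properties look like the following. The JobTracker port (54311) is a conventional choice, and the mapred.system.dir value is an illustrative path under a Hadoop install directory; adjust both to your setup:

```xml
<property>
  <name>mapred.job.tracker</name>
  <!-- 54311 is a conventional local JobTracker port, not mandatory -->
  <value>localhost:54311</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <!-- Illustrative path: point this at the tmp directory you created
       under your own Hadoop install directory -->
  <value>/home/me/hadoop-0.21.0/tmp/mapred/system</value>
</property>
```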