Joe Crobak's Website

Getting Started with Apache Hadoop 0.23.0

Dec 4, 2011

Hadoop 0.23.0 was released on November 11, 2011. As the future of the Hadoop platform, it's worth checking out even though it is an alpha release.

Note: Many of the instructions in this article came from trial and error, and there are alternative (and possibly better) ways to configure the system. Please feel free to suggest improvements in the comments. Also, all commands were tested only on Mac OS X.

Download

Download a release tarball from an Apache mirror. Once downloaded, decompress the file. The bundled documentation is available in share/doc/hadoop/index.html.
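For example, fetching from the Apache archive (the URL below is one option; any Apache mirror works):

$ curl -O http://archive.apache.org/dist/hadoop/common/hadoop-0.23.0/hadoop-0.23.0.tar.gz
$ tar xzf hadoop-0.23.0.tar.gz
$ cd hadoop-0.23.0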

Notes for Users of Previous Versions of Hadoop

The directory layout of the Hadoop distribution changed in 0.23.0 and 0.20.204 relative to previous versions. In particular, there are now sbin, libexec, and etc directories in the root of the distribution tarball.

scripts and executables

In Hadoop 0.23.0, a number of commonly used scripts have been removed from the bin directory or drastically changed. Specifically, the following scripts were removed (compared to 0.20.205.0):

hadoop-config.sh

hadoop-daemon(s).sh

start-balancer.sh and stop-balancer.sh

start-dfs.sh and stop-dfs.sh

start-jobhistoryserver.sh and stop-jobhistoryserver.sh

start-mapred.sh and stop-mapred.sh

task-controller

The start/stop mapred-related scripts have been replaced by "map-reduce 2.0" scripts called yarn-*. The start-all.sh and stop-all.sh scripts no longer start or stop HDFS, but they are used to start and stop the yarn daemons. Finally, bin/hadoop has been deprecated. Instead, users should use bin/hdfs and bin/mapred.
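For example, where you might previously have run bin/hadoop fs or bin/hadoop job, the rough equivalents are now (these invocations are shown for illustration):

$ bin/hdfs dfs -ls /
$ bin/mapred job -list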

Hadoop distributions now also include scripts in an sbin directory. The scripts include start-all.sh, start-dfs.sh, and start-balancer.sh (and the stop versions of those scripts).

configuration directories and files

The conf directory that comes with Hadoop is no longer the default configuration directory. Rather, Hadoop looks in etc/hadoop for configuration files. The libexec directory contains the scripts hadoop-config.sh and hdfs-config.sh for configuring where Hadoop pulls configuration information, and it's possible to override the location of the configuration directory in the following ways:

hadoop-config.sh accepts a --config option for specifying a config directory, or the directory can be specified using $HADOOP_CONF_DIR (see the example after this list).

The script also accepts a --hosts parameter for specifying the hosts/slaves file.

The script uses variables typically set in hadoop-env.sh, such as $JAVA_HOME, $HADOOP_HEAPSIZE, $HADOOP_CLASSPATH, $HADOOP_LOG_DIR, $HADOOP_LOGFILE, and more. See the file for a full list of variables.
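For example, either of the following (with a hypothetical directory) should point the scripts at an alternate configuration:

$ sbin/start-dfs.sh --config /path/to/alt-conf
$ HADOOP_CONF_DIR=/path/to/alt-conf sbin/start-dfs.sh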

Configure HDFS

To start HDFS, we will use sbin/start-dfs.sh, which pulls configuration from etc/hadoop by default. We'll be putting configuration files in that directory, starting with core-site.xml, in which we must specify fs.default.name.
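A minimal core-site.xml looks something like the following (hdfs://localhost:9000 is a common choice for a pseudo-distributed setup, not a required value):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>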

Next, we want to override the locations where the NameNode and DataNode store data, so that the data lives in a non-transient location. The two relevant parameters are dfs.namenode.name.dir and dfs.datanode.data.dir. We also set replication to 1, since we're running a single DataNode.

As of HDFS-456 and HDFS-873, the NameNode and DataNode dirs should be specified with a full URI.
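Putting those three properties together, an hdfs-site.xml along these lines should work (the file:// paths are placeholders; substitute directories of your choosing):

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///path/to/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///path/to/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>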

By default, Hadoop starts up with 1000 megabytes of RAM allocated to each daemon. You can change this by adding a hadoop-env.sh to etc/hadoop. There's a template that can be copied into place with:

$ cp ./share/hadoop/common/templates/conf/hadoop-env.sh etc/hadoop

The template sets up a bogus value for HADOOP_LOG_DIR.

HADOOP_PID_DIR defaults to /tmp, so you might want to change that variable, too; an example follows.
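For instance, a trimmed-down hadoop-env.sh based on the template might set the following (the paths and heap size here are illustrative, not values from the release):

export JAVA_HOME=/Library/Java/Home
export HADOOP_HEAPSIZE=500
export HADOOP_LOG_DIR=/path/to/hadoop/logs
export HADOOP_PID_DIR=/path/to/hadoop/pids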

Start HDFS

Start the NameNode:

sbin/hadoop-daemon.sh start namenode

Start a DataNode:

sbin/hadoop-daemon.sh start datanode

(Optionally) start the SecondaryNameNode (not required for local development, but definitely required for production):

sbin/hadoop-daemon.sh start secondarynamenode

To confirm that the processes are running, issue jps and look for lines for NameNode, DataNode and SecondaryNameNode:

$ jps
55036 Jps
55000 SecondaryNameNode
54807 NameNode
54928 DataNode

Notes:

The Hadoop daemons log to the logs dir. Stdout goes to a file ending in ".out", and the logfile ends in ".log". If a daemon doesn't start up, check the files that include that daemon's name (e.g. logs/hadoop-joecrow-datanode-jcmba.local.out).

The commands might print "Unable to load realm info from SCDynamicStore" (at least on Mac OS X). This appears to be harmless output; see HADOOP-7489 for details.

Stopping HDFS

Eventually you'll want to stop HDFS. Here are the commands to execute, in the given order:
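These should be the stop counterparts of the start commands above; a reasonable sequence is the reverse of startup (omit the SecondaryNameNode line if you didn't start one):

sbin/hadoop-daemon.sh stop secondarynamenode
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop namenode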

Running an example MR Job

This section just gives the commands for configuring and starting the Resource Manager, Node Manager, and Job History Server, but it doesn't explain the details of those. Please refer to the References and Links section for more details.

The Yarn daemons use the conf directory in the distribution for configuration by default. Since we used etc/hadoop as the configuration directory for HDFS, it would be nice to use that as the config directory for MapReduce, too. To do so, we update the following file:

In conf/yarn-env.sh, add the following lines under the definition of YARN_CONF_DIR:
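Something along these lines does the trick (YARN_HOME is assumed to point at the distribution root; the exact form may vary):

export HADOOP_CONF_DIR=$YARN_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR

With that in place, the Resource Manager, Node Manager, and Job History Server can be started with yarn-daemon.sh (whether the script lives in bin or sbin, and whether historyserver is available as a subcommand, may vary with your build of 0.23.0):

bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemon.sh start nodemanager
bin/yarn-daemon.sh start historyserver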

Conclusion

While Hadoop 0.23 is an alpha release, getting it up and running in pseudo-distributed mode isn't too difficult. The new architecture will take some getting used to for users of previous releases of Hadoop, but it's an exciting step forward.

Observations and Notes

There are a few bugs and gotchas that I discovered or verified along the way, and that you should keep an eye on as you go through these steps. These include: