Monday, June 2, 2014

Little Lessons in Hadoop

Hadoop is notoriously under-documented, as I recently discovered. I am using Hadoop in my summer research position, and have launched myself into the wonderful and aggravating world of servers and open-source map-reduce programs. And one of the fun aspects of releasing open-source software, I suppose, is that no one can complain if you leave it largely undocumented.

However, this does make installing and running Hadoop a rather harrowing experience for the uninitiated. But hands-on learning is the best way! And there are some pretty good, if often incomplete or outdated, tutorials out there, including this and this.

Those, along with a few dozen web searches, and hours of pain, struggle, and frustration, led me to the successful operation of Hadoop on the standard WordCount trial code.

I record my efforts, failures, and discoveries now for my own benefit as well as for any who might be struggling with the same.

The "No such file or directory" error.
When Hadoop is set up, and you attempt to start the instance using start-all.sh or start-dfs.sh, you may get the error noted above. It is likely that either your HADOOP_HOME variable is not set for the user Hadoop is running under, or mkdir failed to create the log directory due to permissions errors.
To check for the first of these cases, type "echo $HADOOP_HOME" to see if the variable is set. If you see nothing but a blank line, or it prints a directory that does not exist, you'll need to set this variable to the true Hadoop installation directory (like "/home/<user>/hadoop" or wherever you placed it). You can set it with the export command.
If HADOOP_HOME prints correctly, the likely culprit is permissions: use chmod so that the user running Hadoop can write to the Hadoop directory. Instructions on using chmod can be found here. Remember the -R flag to include subdirectories.
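
The two checks above might look something like this in practice. This is only a sketch: the install location "$HOME/hadoop" (held in the made-up variable HADOOP_DIR) is an assumption, so substitute wherever you actually unpacked Hadoop.

```shell
# Assumption: Hadoop was unpacked to $HOME/hadoop; change this to your own location.
HADOOP_DIR="$HOME/hadoop"

# Show the current value; empty quotes mean the variable is unset or empty.
echo "HADOOP_HOME is: '${HADOOP_HOME:-}'"

# Set the variable only if it is currently empty.
if [ -z "${HADOOP_HOME:-}" ]; then
  export HADOOP_HOME="$HADOOP_DIR"
fi

# Give the owning user read/write access everywhere under the directory,
# plus execute on directories and on files that are already executable.
if [ -d "$HADOOP_HOME" ]; then
  chmod -R u+rwX "$HADOOP_HOME"
else
  echo "No directory at $HADOOP_HOME; check the path" >&2
fi
```

Keep in mind that export only lasts for the current shell session, so you may want to put the export line in your shell startup file (such as ~/.bashrc) as well.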

HADOOP_OPTS and HADOOP_CLASSPATH
Contrary to what several tutorials indicate, you will likely not need to have your HADOOP_OPTS variable set -- in fact, it can be empty.
On the other hand, the HADOOP_CLASSPATH should contain the location of the hadoop/lib directory, e.g. "<user>/hadoop/lib" (use the export command for this as well).
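
For example, something along these lines should do it. The fallback location "$HOME/hadoop" is an assumption on my part; the rest just prepends the lib directory while keeping anything already in the variable.

```shell
# Assumption: fall back to $HOME/hadoop if HADOOP_HOME is not already set.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop}"

# Put the lib directory first, and keep any existing entries after a ":".
export HADOOP_CLASSPATH="$HADOOP_HOME/lib${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"

echo "HADOOP_CLASSPATH is: $HADOOP_CLASSPATH"
```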

Other Small but Important Items

Don't forget your 'sudo'. If you're operating on files from a different user's directory (like if you're using a Hadoop-specific user but saving files on the standard user), you'll need to sudo most of your commands.

Likewise, chmod all the important directories (the ones Hadoop reads from and writes to) before you get started.

The PATH environment variable must include Hadoop's "bin" folder, e.g. "/home/<user>/hadoop/bin". You can add this with the export command (don't forget to use the ":" separator and include the existing $PATH, to avoid overwriting existing locations).
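
A minimal sketch of that export, again assuming Hadoop lives in "$HOME/hadoop" unless HADOOP_HOME says otherwise:

```shell
# Assumption: fall back to $HOME/hadoop if HADOOP_HOME is not already set.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop}"

# Append Hadoop's bin folder to whatever is already in PATH.
export PATH="$PATH:$HADOOP_HOME/bin"

# Print the PATH one entry per line and show the Hadoop one.
echo "$PATH" | tr ':' '\n' | grep hadoop
```

Writing "$PATH:" first is what preserves the existing entries; leaving it out would replace your whole PATH with the Hadoop folder.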

When creating new directories for input or output files, etc., use the -p flag so that mkdir creates any missing parent directories along the way instead of failing. For instance, if your <user>/Documents directory is empty, you can create <user>/Documents/hadoop-output/wordcount-results in one step using mkdir with the -p flag.
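
Here is a small demonstration of the -p flag. To keep it harmless, it uses a throwaway directory from mktemp as a stand-in for your Documents folder; the variable name "base" is just for illustration.

```shell
# A scratch directory standing in for /home/<user>/Documents.
base="$(mktemp -d)"

# Without -p this would fail, because hadoop-output does not exist yet.
mkdir -p "$base/hadoop-output/wordcount-results"

# List everything that was created.
ls -R "$base"
```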

When running a program such as WordCount, you will need to work with files in HDFS, Hadoop's distributed file system. If you're not sure how it is laid out, you can look around with Hadoop's ls command, which works much like its ordinary command-line equivalent: "hadoop fs -ls <directory>".
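
As a rough example, this lists the top level of HDFS. Starting at "/" is just my choice of starting point, and the check around the command is only there so the snippet fails gracefully if the hadoop command is not on your PATH yet.

```shell
# Hypothetical starting point: the root of HDFS.
HDFS_DIR="/"

if command -v hadoop >/dev/null 2>&1; then
  # Works like plain ls, but against HDFS rather than the local disk.
  hadoop fs -ls "$HDFS_DIR"
else
  echo "hadoop is not on PATH yet" >&2
fi
```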

Attempting to test the Hadoop setup, I had difficulty ascertaining the location of the WordCount example -- every tutorial seemed to show it in a different place. As of Hadoop 2.3.0, the jar with this example is in "<main Hadoop directory>/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar".
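
Putting that together, a WordCount run might look roughly like the sketch below. The HDFS input and output paths are made up for illustration, and the fallback install location "$HOME/hadoop" is an assumption. Note that the output directory must not already exist, or the job will refuse to run.

```shell
# Assumption: fall back to $HOME/hadoop if HADOOP_HOME is not already set.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop}"
EXAMPLES_JAR="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar"

# Hypothetical HDFS locations; use whatever directories you have set up.
HDFS_USER="${USER:-$(id -un)}"
INPUT_DIR="/user/$HDFS_USER/wordcount/input"
OUTPUT_DIR="/user/$HDFS_USER/wordcount/output"

if command -v hadoop >/dev/null 2>&1 && [ -f "$EXAMPLES_JAR" ]; then
  # Run the wordcount program from the examples jar, then list its results.
  hadoop jar "$EXAMPLES_JAR" wordcount "$INPUT_DIR" "$OUTPUT_DIR"
  hadoop fs -ls "$OUTPUT_DIR"
else
  echo "hadoop or the examples jar was not found; check PATH and HADOOP_HOME" >&2
fi
```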

To save some typing, of which you will be doing plenty, consider using aliases for the more common commands. For instance, you might use "h-start" as an alias for "<main Hadoop directory>/sbin/start-all.sh" (Hadoop 2.x keeps the start scripts in sbin; older releases kept them in bin). You can learn about aliases here.
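
A couple of aliases along those lines are sketched below. The script path is an assumption: in the Hadoop 2.x layout I am familiar with, the start scripts sit under sbin, so adjust it if yours are elsewhere. The second alias name, "h-ls", is made up for this example.

```shell
# Assumption: fall back to $HOME/hadoop if HADOOP_HOME is not already set.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop}"

# Double quotes expand $HADOOP_HOME now, when the alias is defined.
alias h-start="$HADOOP_HOME/sbin/start-all.sh"

# A shortcut for listing HDFS directories, e.g. "h-ls /".
alias h-ls='hadoop fs -ls'

# Print the definition to confirm it took.
alias h-start
```

Aliases defined at the prompt disappear when you close the terminal, so put them in your shell startup file (such as ~/.bashrc) if you want to keep them.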

Good luck with your Hadooping! I will add more hints and tips as I encounter them.