Setting up Hadoop on OSX Mountain Lion

24 Feb 2013

Everyone I know who deals with large amounts of data has been looking
more closely at Hadoop as it's matured.
Especially with tools like Hive, old data warehouse hands are taking a
serious look at it as a better kind of long-term data archive and
storage. You probably should too.

While for most of the real work you'd be doing it makes more sense to
spin up Amazon's EC2, Elastic MapReduce or another flavour of
virtualized Hadoop instance in the cloud for the clustering and
crunching benefits, it's very good to have a local install for
development and testing.

This walkthrough should take you about 15 minutes to get Hadoop up and
running.

So What Is Hadoop?

Basically, what it allows you to do is take huge amounts of data, or
computationally difficult problems, chop them into little bits, solve
the little bits and then recombine them all to arrive at an answer for
the original big problem.

While it uses an open source implementation of Google's MapReduce and
GFS file system to do that, Hadoop is also the open-source set of tools
around this incredible ability to solve data-intensive, distributed
applications.

Most importantly perhaps, those tools allow the running of those
applications on large clusters of cheap computers, making problems
formerly the domain of super computers within the reach of us normal
mortals and even more quickly solvable in some cases.

Learning it, and all the tools, does have a fairly steep learning curve,
so a nice little HOWTO to get it installed locally is a Good Thing. ™

Here we go.

Prerequisites

There are a couple. Just for reference, at the time of writing I'm
running OSX 10.8.2 with all System Updates, and Hadoop is at 1.1.1. The
amazing Homebrew needs to be installed, as that's what we'll be using to
get the package (if you're not using it, you really should be).

Java

You need Java installed, and it needs to be version 1.6.x, so do make
sure you have it installed and that it's working:

If you don't have java installed, checking for the version
should get you a prompt asking if you'd like it installed.
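The quick check (on OSX, if Java isn't installed, this same command is
what triggers the install prompt mentioned above):

```
$ java -version
```

The exact version string varies by machine; you just want to see
something in the 1.6.x line.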

ssh

Hadoop nodes are managed via ssh. So there are two things here:

You need to have Remote Logins checked on your System Preferences |
Sharing panel

You need a ssh key public/private keypair to be able to ssh into your node

I would normally assume anyone attempting to install Hadoop has already
been exposed to ssh, and probably uses it in their dev work (if you do,
skip to the last two commands and just log into your local box, saying
yes to the authorization prompt). Just in case you don't though, type:

$ ssh-keygen -t rsa -P ""

and let the keys be installed to their default locations. When that's
done, you need to make sure your public key is authorized. The easiest
CLI-fu to do that is:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

ssh into localhost and into your machine's actual hostname to make sure
both work:

$ ssh localhost
$ ssh Gunter.local

(replace Gunter with your machine's network name, of course).

Installing on Mountain Lion

We're just going to install a single-node Hadoop running in
pseudo-distributed mode, in which each Hadoop daemon runs in a separate
Java process.

The wonders of homebrew never cease.
This is how you install hadoop with homebrew:

$ brew install hadoop

Note that when it installs, it will set $JAVA_HOME via
/usr/libexec/java_home, just in case you run into trouble with a custom
configuration or have been using other Javas installed elsewhere.

Configuring Hadoop

So, it's not quite that easy. Now you have to do a bit of fiddling
with conf files before you can fire up your own data crunching elephant
of fury.

There are four config files that need to be modified and they are all
located in /usr/local/Cellar/hadoop/1.1.1/libexec/conf

hadoop-env.sh

core-site.xml

hdfs-site.xml

mapred-site.xml

hadoop-env.sh

With Homebrew, most of your work is already done here, though a bug
introduced in OSX 10.7 Lion still appears to affect Mountain Lion, and
may give you an "Unable to load realm info from SCDynamicStore" error.

Add the following line to the file with your text editor of choice.
Personally, I put it on line 19, where HADOOP_OPTS is asked for.
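The commonly cited workaround for that error is to hand the JVM empty
Kerberos realm and KDC settings (this is the fix that's been floating
around since Lion; treat it as a starting point and verify it actually
clears the error on your machine):

```shell
# Workaround for "Unable to load realm info from SCDynamicStore"
export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
```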

core-site.xml

I've used localhost here just because I'm assuming I'll only use this
myself for development. If you wanted to make it available to others on
the network, you should change this to the .local name for your machine.
For example, I would replace localhost with hdfs://Gunter.local:9000 for
users across my network.

Hadoop must be able to write to its tmp directories to work.
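For reference, a minimal core-site.xml along those lines might look like
this (the port 9000 matches the hdfs:// URL above, and the /tmp location
matches where the namenode format output lands later in this guide;
treat both as defaults to adjust, not gospel):

```xml
<configuration>
  <!-- Where Hadoop keeps its working files; must be writable by your user -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
  <!-- The HDFS namenode address; swap localhost for Gunter.local to share it -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```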

hdfs-site.xml

This file controls the configuration of the Hadoop distributed file
system. Since we're only running a single node here, we just need to let
it know that, and to keep one copy of each block. You can optionally
also tell it where to keep the data, in Hadoop-writable directories, but
that appears to be extraneous.
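In config terms, "keep one copy" is just the replication factor; a
minimal hdfs-site.xml sketch:

```xml
<configuration>
  <!-- One copy of each block: there's only one node to put it on -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```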

mapred-site.xml

Unsurprisingly, this file controls MapReduce overrides. The maximum
values for the map and reduce tasks are completely optional, though
wise. Note that in newer betas of Hadoop you can also choose MapReduce 1
or MapReduce 2 in this file, by setting the mapreduce.framework.name
property to classic or yarn.

A good starting point for the maximums seems to be one Mapper per
virtual core you possess, and one Reducer per physical disk, or figure
two per SSD (thanks to @andykent for that tip, though he says to use it
as a starting point and then experiment). On my 2012 i5 Macbook Air,
I've got 4 virtual cores thanks to Hyperthreading (on a chip with two
physical cores).

Again, as with the core-site file, I've simply used localhost here. To
make this available across the network, you could put in the network
name for your machine, like Gunter.local, and you should be able to
access it.
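Putting that rule of thumb together for a machine like mine (4 virtual
cores, one SSD), a sketch of mapred-site.xml might look like this (9001
is the conventional Hadoop 1.x JobTracker port; tune the maximums to
your own hardware):

```xml
<configuration>
  <!-- The JobTracker address; swap localhost for your .local name to share -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Rule of thumb: one map slot per virtual core -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Rule of thumb: two reduce slots per SSD -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```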

Spooling up the FTLs - Initializing

The HDFS system needs to be initialized before you can use it. This also
ensures that Hadoop can write to the directories it needs to.

$ hadoop namenode -format

This should give you a nice few lines of output which, if successful,
should end with a message to the effect that the storage directory (in
my case /tmp/hadoop-daryl/dfs/name) has been successfully formatted,
followed by a shutdown message.

Congratulations! Your hadoop is ready to rock!

Jump! - Running Hadoop

You can get Hadoop started simply enough with:

$ /usr/local/Cellar/hadoop/1.1.1/libexec/bin/start-all.sh

and a few entries of your password if you haven't gone for the
passphraseless ssh-key option above (I always use a password - I also
recommend an ln -s to shorten the above command, as it's handy).
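For example (hstart and hstop are names I've made up here; point them
wherever you like on your PATH):

```
$ ln -s /usr/local/Cellar/hadoop/1.1.1/libexec/bin/start-all.sh /usr/local/bin/hstart
$ ln -s /usr/local/Cellar/hadoop/1.1.1/libexec/bin/stop-all.sh /usr/local/bin/hstop
```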

Just to check Hadoop is running, you can use the handy jps command. You
should see output very similar to this:
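On a healthy pseudo-distributed node, jps lists one JVM per daemon,
something along these lines (the process IDs here are made up and will
differ on your machine):

```
$ jps
61892 NameNode
61972 DataNode
62054 SecondaryNameNode
62129 JobTracker
62212 TaskTracker
62301 Jps
```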

Shutting Down Hadoop

Why you ever would is beyond me, but just in case you don't like all
those background processes lying around, you can run:

$ /usr/local/Cellar/hadoop/1.1.1/libexec/bin/stop-all.sh

Again, this is something I'd personally ln -s, and you'll need to type
your passphrase a few times if you haven't set up a passphraseless ssh
key as above.

Conclusion

And that's about it. I intend to tweak this install guide as necessary,
as I find out more about the best ways to run Hadoop, and as I learn
whether there are better ways to configure the defaults (so if you see
anything amiss, please let me know - happy to hear from you).

In later posts, I'll go over installing and using Pig and some of the
other tools in the Hadoop stack, as well as share some knowledge on
Wukong, the Ruby on Hadoop DSL (um, as soon as I learn something about
it myself).