Hadoop in Azure

Is it possible to deploy a Hadoop cluster in Azure? It sure is, and setting one up is not difficult. Here’s how you do it.

Update: Microsoft has recently announced a new 'Big Data' strategy, and Hadoop is an integral part of it. Microsoft will provide an optimized experience for Hadoop running on Azure and on Windows Server.

In this post I will demonstrate how to create a typical cluster with a Name Node, a Job Tracker and a customizable number of Slaves. You will also be able to dynamically change the number of Slaves using the Azure Management Portal. I will save the explanation of the mechanics for another post.

Follow these steps to create an Azure package for your Hadoop cluster:

Install the latest Azure SDK. As of this writing, the latest version is 1.4.

Download the Hadoop binaries. I used version 0.21. Hadoop is distributed as a tar.gz file; you will need to convert it to a ZIP file. You can use 7-Zip for the task.
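If you would rather script the conversion than click through 7-Zip, a minimal Python sketch could look like this (the archive names are illustrative; use the file you actually downloaded):

```python
# Repackage a .tar.gz as a .zip without extracting it to disk first.
import tarfile
import zipfile

def targz_to_zip(src, dst):
    """Copy every regular file from a .tar.gz into a .zip, preserving paths."""
    with tarfile.open(src, "r:gz") as tar, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in tar.getmembers():
            if member.isfile():
                zf.writestr(member.name, tar.extractfile(member).read())

if __name__ == "__main__":
    # File names are assumptions; adjust to your download.
    targz_to_zip("hadoop-0.21.0.tar.gz", "hadoop-0.21.0.zip")
```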

Now install Cygwin and package it in a single ZIP file. Hadoop 0.21 requires Cygwin under Windows. It’s fine if you don’t know anything about it; Hadoop uses it behind the scenes, so you won’t even need to launch it. There is an ongoing effort to remove this dependency for Hadoop 0.22, but it’s not ready yet. Just run the Cygwin installer and accept all defaults. You should end up with Cygwin installed in c:\cygwin. Create a compressed folder of c:\cygwin called cygwin.zip.
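The compressed folder can also be produced from Python’s standard library. A sketch, assuming Cygwin was installed to c:\cygwin:

```python
# Zip a folder so the archive keeps the folder itself as the top-level entry,
# mirroring Windows' "Send to > Compressed (zipped) folder".
import os
import shutil

def zip_folder(folder, dst_basename):
    """Create <dst_basename>.zip with every entry under the folder's own name."""
    folder = os.path.abspath(folder)
    shutil.make_archive(dst_basename, "zip",
                        root_dir=os.path.dirname(folder),
                        base_dir=os.path.basename(folder))
    return dst_basename + ".zip"

if __name__ == "__main__":
    zip_folder(r"C:\cygwin", "cygwin")  # produces cygwin.zip
```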

The last dependency is a Java VM to host Hadoop and YAJSW (Yet Another Java Service Wrapper). If you don’t want to update any of the configuration files in this guide, you will need to bundle your favorite JVM in a ZIP file called jdk.zip. All JVM files must be in a folder, also called jdk, inside the ZIP file. If you have your JVM installed under C:\Program Files\Java\jdk1.6.0_<revision>\ you will need to rename (or copy) the jdk1.6.0_<revision> folder to jdk and zip it.
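You don’t actually have to copy or rename the whole install on disk; you can rewrite the root folder while zipping. A sketch, assuming the install path above (keep the <revision> placeholder until you fill in your own):

```python
# Zip a directory tree while renaming its top-level folder in the archive,
# so a jdk1.6.0_<revision> install lands in jdk.zip under jdk/.
import os
import zipfile

def zip_with_root(src_dir, dst_zip, root):
    """Zip src_dir so every archive entry starts with root/."""
    with zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, src_dir)
                zf.write(full, os.path.join(root, rel))

if __name__ == "__main__":
    # Replace <revision> with the revision you actually have installed.
    zip_with_root(r"C:\Program Files\Java\jdk1.6.0_<revision>", "jdk.zip", "jdk")
```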

Configure your cluster

The cluster-config.zip you downloaded has all the files needed to configure your Hadoop cluster. You will find the familiar [core|hdfs|mapred]-site.xml files there. Ignore all other files for now; I’ll explain what they are for in a later post. Edit the *-site.xml files as needed for your cluster configuration. Make sure to only add properties, not change any of the existing ones.
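For example, to lower the HDFS block replication factor you might add a property like this to hdfs-site.xml (dfs.replication is a standard Hadoop property; the value here is illustrative):

```xml
<!-- hdfs-site.xml: add new <property> elements inside the existing
     <configuration> element; leave the properties already there untouched. -->
<property>
  <name>dfs.replication</name>
  <value>2</value> <!-- illustrative value -->
</property>
```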

Create a new cluster-config.zip if you updated any of the XML files.

Upload all dependencies to your Azure Storage account

Create a container called bin and upload all the ZIP files to it. Use your favorite tool for the job; I like ClumsyLeaf’s CloudXplorer. You should end up with these files in the bin container:

Configure your Azure Deployment

Unzip the Visual Studio 2010 project. You can either use Visual Studio from here or update the required files using any text editor. I included a batch file to package the deployment if you are going down the command-line route.

If you are using Visual Studio, the only file you must change is NameNode\SetEnvironment.cmd; the projects that require this file have links to it. If you are not using Visual Studio, you have to change it in three other places: NameNode\bin\Debug, JobTracker\bin\Debug, and Slave\bin\Debug. Get an access key from your storage account, construct a connection string, and paste it into the first line, replacing [your connection string]. An Azure connection string has this format:

DefaultEndpointsProtocol=https;AccountName=<your account name>;AccountKey=<your access key>

Unless you used different versions of any of the dependencies you won’t need to change anything else.

The Azure deployment is set to use one Large VM for the Name Node, one Large VM for the Job Tracker, and four Extra Large VMs as Slaves. If you are OK with that configuration, skip to the next step. If you would like anything different, change the Roles configuration directly in Visual Studio, or edit both HadoopAzure\ServiceDefinition.csdef and HadoopAzure\ServiceConfiguration.cscfg to the desired VM sizes and counts.
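If you edit the XML by hand, the relevant pieces look roughly like this (the role name Slave is an assumption about this project’s layout; check the actual role names in the files):

```xml
<!-- HadoopAzure\ServiceDefinition.csdef: the VM size per role -->
<WorkerRole name="Slave" vmsize="ExtraLarge">
  <!-- ... endpoints, imports, etc. ... -->
</WorkerRole>

<!-- HadoopAzure\ServiceConfiguration.cscfg: the instance count per role -->
<Role name="Slave">
  <Instances count="4" />
</Role>
```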

Deploy your cluster

Create a new service to host your Hadoop cluster. The project is pre-configured for remote access to the machines in the cluster. If you didn’t change the project configuration, you will need to upload the certificate AzureHadoop.pfx, found in the root of the project, to your service. The certificate password is hadoop. The deployment will fail if you don’t have this certificate.

If you are using Visual Studio 2010, you can deploy by right-clicking on the Cloud project and selecting Deploy. If you are not, just run buildPackage.cmd from the root of the project using a Windows Azure SDK Command Prompt. You will get the Azure package Hadoop.cspkg to deploy using the Azure Management Portal.

Deploy your service (you can ignore the warning message). Wait for it to complete; you should see something like this:

Using your Hadoop cluster

Now that everything is up and running you can navigate to the Name Node Summary page. The URL is http://<your service name>.cloudapp.net:50070.

If you click on “Browse the filesystem”, Hadoop will construct a URL with the IP address of one of the Slaves. That IP address is not accessible from the Internet, so you will need to replace it in the URL with <your service name>.cloudapp.net; then you will be able to browse the file system:

Let’s run one of the example jobs Hadoop provides to confirm the cluster is working. As it is configured right now, you must log in to the Job Tracker to start a new job. I will present alternatives in a future post (hint: Azure Connect).

Go back to the Azure Management Portal and RDP into the Job Tracker by selecting it and clicking “Connect” in the toolbar. The username is hadoop and the password is H1Doop. After you log in, open a command prompt window and execute the following commands:

E:\AppRoot\SetEnvironment.cmd

cd /d %HADOOP_HOME%

Now you are ready to run a job. I put together a hadoop script so you don’t have to deal with Cygwin for launching jobs. The syntax is the same as the regular hadoop scripts. Let’s launch a small job:

bin\hadoop jar hadoop-mapred-examples-0.21.0.jar pi 20 200

If you navigate to the Job Tracker page you will see the job running. The URL is http://<your service name>.cloudapp.net:50030.

Congratulations, you just ran your first Hadoop job in Azure!

What can I do with my Hadoop cluster?

The cluster is fully operational; you can run any job you would like. You can also use the Azure Management Portal to dynamically change the number of Slaves. Hadoop will discover new nodes, or find out that nodes were removed, and reconfigure the cluster accordingly.

I added an extra Slave node:

And my cluster changed to:

If you have used Hadoop in production, you know you must take extra steps to prepare the Name Node, mostly around high availability. That’s a topic in itself, which I plan to discuss in another post. If you can’t wait and want to set up a backup and/or a checkpoint node, go ahead; that’s certainly part of the solution. Azure Drive is probably another piece.

There are plenty of alternatives to YAJSW. I’ll discuss them, along with their pros and cons, in another post when I explain how all of this works. If you want to try one now but don’t want to use beta software, you can use the Java Service Wrapper (wrapper.tanukisoftware.com). It’s a fully supported commercial product, and its configuration is not much different from YAJSW’s; it should be relatively easy to convert the configuration files.