Category Archives: Microsoft Azure

Introduction

Our last post was about Microsoft and Hortonworks joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features.These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on Apache Mahout wiki.

Recommendation engine on Azure

A standard recommender example in machine learning is a movie recommender. Surprisingly enough this example is not among the provided HDInsight examples so we need to implement it on our own using mahout-0.5 components.

The movie recommender is an item based recommendation algorithm: the key concept is that having a large dataset of users, movies and values indicating how much a user liked that particular movie, the algorithm will recommend movies to the users. A commonly used dataset for movie recommendation is from GroupLens.

The downloadable file that is part of the 100K dataset (u.data) is not suitable for Mahout as is because its format is like:

Mahout requires the data to be in the following format: userid,itemid,value
so the content has to be converted to

196,242,3
186,302,3
22,377,1
....

There is no web based console to execute Mahout on Azure, we need to go to Remote Desktop to download the RDP configuration and then login to Azure headnode via RDP. Then we have to run Hadoop command line to get a prompt.

The standard mahout.cmd seems to have a few bugs, if we run mahout.cmd then it will throw an error complaining about java usage. I had to modify the file to remove setting HADOOP_CLASSPATH envrionment variable, see the changes in bold-italic:

The similarityClass is the name of the distributed similarity class and it can be SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, etc. This class determine the algorithm to calculate similarities between the items.

The execution of MapReduce tasks can be monitored via Hadoop MapReduce admin console:

Once the job is finished, we need to use hadoop filesystem commands to display the output file produced by the RecommenderJob:

For those inclined to theory and scientific papers, I suggest to read the paper from Sarwar, Karypis, Konstand and Riedl that provides the background of the item based recommendation algorithms.

Mahout examples on Azure

Hadoop on Azure comes with two predefined examples: one for classification, one for clustering. They require command line to be executed – a smilar way as described above for the item based recommendation engine.

The classification demo is based on naive Bayes classifier- first you need to train your classifier with a set of known data and then you can run the algorithm on the actual data set. This concept is called supervised learning.

This will kick off the Hadoop MapReduce job and after a while it will spit out the confusion matrix based on Bayes algorithm. The confusion matrix will tell us what categories were correctly identified by the classifier and what were incorrect.

For instance, it has a category called rec.motorcycles (column a), and the classifier correctly identified 381 items out of 398 belonging to this cathegory, while it defined 9 items incorrectly as belonging to rec.autos (column f), 2 items incorrectly as belonging to sci.electronics (column n), etc.

Introduction

Traditionally Microsoft Windows used to be a sort of stepchild in Hadoop world – the ‘hadoop’ command to manage actions from command line and the startup/shutdown scripts were written in Linux/*nix in mind assuming bash. Thus if you wanted to run Hadoop on Windows, you had to install cygwin. Also Apache Hadoop document states the following (quotes from Hadoop R1.1.0 documentation):
“•GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes
•Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.”

Microsoft and Hortonworks joined their forces to make Hadoop available on Windows Server for on-premise deployments as well as on Windows Azure to support big data in the cloud, too.

This post covers Windows Azure HDInsight (Hadoop on Azure, see https://www.hadooponazure.com) . As of writing, the service requires an invitation to participate in the CTP (Community Technology Preview) but the invitation process is very efficiently managed – after filling in the survey, I received the service access code within a couple of days.

New Cluster Request

The first step is to request a new cluster, you need to define the cluster name and the credentials to be able to login to the headnode. By default the cluster consists of 3 nodes.

After a few minutes, you will have a running cluster, then click on the “Go to Cluster” link to navigate to the main page.

WordCount with HDInsight on Azure

No Hadoop test is complete without the standard WordCount application – Microsoft Azure HDInsight provides an example file (davinci.txt) and the Java jar file to run wordcount – the Hello World of Hadoop.

First you need to go to the JavaScript console to upload the text file using fs.put():

Microsoft HDInsight Streaming – Hadoop job in C#

Hadoop Streaming is a utility to support running external map and reduce jobs. These external jobs can be written in various programming languages such as Python or Ruby – should we talk about Microsoft HDInsight, the example better be based on .NET C#…

The demo application for C# streaming is again a wordcount example using the imitation of Unix cat and wc commands. You could run the demo from the “Samples” tile but I prefer to demonstrate Hadoop Streaming from the command line to have a closer look at what is going on under the hood.

In order to run Hadoop command line from Windows cmd prompt, you need to login to the HDInsight headnode using Remote Desktop. First you need to click on “Remote Desktop” tile, then login the remote node using the credentials you defined at cluster creation time. Once you logged in, click on Hadoop Coomand Line shortcut.

In Hadoop Command Line, go to the Hadoop distribution directory (As of writing this post, Microsoft Azure HDInsight is based on Hadoop 1.1.0):

Under the hood

Microsoft and Hortonworks have re-implemented the key binaries (namenode, jobtracker, secondarynamenode, datanode, tasktracker) as executables (exe files) and they are running as services in the background. The key ‘hadoop’ command – which is traditionally a bash script – is also re-implemented as hadoop.cmd.