Hadoop, bigdata, cloud computing and mobile BI

Main menu

Category Archives: Machine learning

Introduction

One of the exciting APIs among the 50+ APIs offered by Google is the Prediction API. It provides pattern matching and machine learning capabilities like recommendations or categorization. The notion is similar to the machine learning capabilities that we can see in other solutions (e.g. in Apache Mahout): we can train the system with a set of training data and then the applications based on Prediction API can recommend (“predict”) what products the user might like or they can categories spams, etc.

In this post we go through an example how to categorize SMS messages – whether they are spams or valuable texts (“hams”).

Using Prediction API

In order to be able to use Prediction API, the service needs to be enabled via Google API console. To upload training data, Prediction API also requires Google Cloud Storage.

The dataset used in this post is from UCI Machine Learning Repository. UCI Machine Learning repository has 235 datasets publicly available, this post is based on SMS Spam Collections dataset.

To upload the training data first we need to create a bucket in Google Cloud Storage. From Google API console we need to click on Google Cloud Storage and then on Google Cloud Storage Manager: This will open a webpage whe we can create new buckets and upload or delete files.

The UCI SMS Spam Collection file is not suitable as is for Prediction API, it needs to be converted into the following format (the categories – ham/spam – need to be quoted as well as the SMS text):

“ham” “Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…”

Google Prediction API offers a handful of commands that can be invoked via REST interface. The simplest way of testing Prediction API is to use Prediction API explorer.

Once the training data is available on Google Cloud Storage, we can start training the machine learning system behind Prediction API. To begin training our model, we need to run prediction.trainedmodels.insert. All commands require authentication, it is based on OAuth 2.0 standard.

In the insert menu we need to specify the fields that we want to be included in the response. In the request body we need to define an id (this will be used as a reference to the model in the commands used later on), a storageDataLocation where we have the training data uploaded (the Google Cloud Storage path) and the modelType (could be regression or classification, for spam filtering it is classification):

The training runs for a while, we can check the status using prediction.trainedmodels.get command. The status field is going to be RUNNING and then will be changed to DONE, once the training is finished.

Now we are ready to run our test against the machine learning system and it is going to classify whether the given text is spam or ham. The Prediction API command for this action is prediction.trainedmodels.predict. In the id field we have to refer to the id that we defined for the prediction.trainedmodels.insert command (bighadoop-00001) and we also need to specify the request body – input will be csvInstance and then we enter the text that we want to get categorized (e.g. “Free entry”)

The system then returns with the category (spam) and the score (0.822158 for spam, 0.177842 for ham):

Google Prediction API libraries

Google also offers a featured sample application that includes all the code required to run it on Google App Engine. It is called Try-Prediction and the code is written in Python and also in Java. The application can be tested at http://try-prediction.appspot.com.

For instance, if we enter a quote for the Language Detection model from Niels Bohr: “Prediction is very difficult, especially if it’s about the future.”, it will return that it is likely to be an English text (54,4%).

Introduction

Our last post was about Microsoft and Hortonworks joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features.These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on Apache Mahout wiki.

Recommendation engine on Azure

A standard recommender example in machine learning is a movie recommender. Surprisingly enough this example is not among the provided HDInsight examples so we need to implement it on our own using mahout-0.5 components.

The movie recommender is an item based recommendation algorithm: the key concept is that having a large dataset of users, movies and values indicating how much a user liked that particular movie, the algorithm will recommend movies to the users. A commonly used dataset for movie recommendation is from GroupLens.

The downloadable file that is part of the 100K dataset (u.data) is not suitable for Mahout as is because its format is like:

Mahout requires the data to be in the following format: userid,itemid,value
so the content has to be converted to

196,242,3
186,302,3
22,377,1
....

There is no web based console to execute Mahout on Azure, we need to go to Remote Desktop to download the RDP configuration and then login to Azure headnode via RDP. Then we have to run Hadoop command line to get a prompt.

The standard mahout.cmd seems to have a few bugs, if we run mahout.cmd then it will throw an error complaining about java usage. I had to modify the file to remove setting HADOOP_CLASSPATH envrionment variable, see the changes in bold-italic:

The similarityClass is the name of the distributed similarity class and it can be SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, etc. This class determine the algorithm to calculate similarities between the items.

The execution of MapReduce tasks can be monitored via Hadoop MapReduce admin console:

Once the job is finished, we need to use hadoop filesystem commands to display the output file produced by the RecommenderJob:

For those inclined to theory and scientific papers, I suggest to read the paper from Sarwar, Karypis, Konstand and Riedl that provides the background of the item based recommendation algorithms.

Mahout examples on Azure

Hadoop on Azure comes with two predefined examples: one for classification, one for clustering. They require command line to be executed – a smilar way as described above for the item based recommendation engine.

The classification demo is based on naive Bayes classifier- first you need to train your classifier with a set of known data and then you can run the algorithm on the actual data set. This concept is called supervised learning.

This will kick off the Hadoop MapReduce job and after a while it will spit out the confusion matrix based on Bayes algorithm. The confusion matrix will tell us what categories were correctly identified by the classifier and what were incorrect.

For instance, it has a category called rec.motorcycles (column a), and the classifier correctly identified 381 items out of 398 belonging to this cathegory, while it defined 9 items incorrectly as belonging to rec.autos (column f), 2 items incorrectly as belonging to sci.electronics (column n), etc.