Classification algorithms can be used to automatically classify documents and images, implement spam filters, and solve problems in many other domains. In this tutorial we are going to use Mahout to classify tweets using the Naive Bayes classifier. The algorithm works from a training set: a set of documents already associated with a category. Using this set, the classifier determines, for each word, the probability that its presence makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of its words in that category. The category with the highest probability is the one the document most likely belongs to.
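The per-word multiplication described above can be sketched in a few lines of Java. The word probabilities below are made-up numbers, not values learned by Mahout, and real implementations sum log probabilities instead of multiplying raw probabilities, which is mathematically equivalent but avoids numerical underflow on long documents:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {
    // Toy per-category word probabilities (made-up numbers, not learned values).
    static final Map<String, Map<String, Double>> WORD_PROB = Map.of(
            "tech", Map.of("laptop", 0.40, "deal", 0.30, "shoes", 0.05),
            "apparel", Map.of("laptop", 0.05, "deal", 0.30, "shoes", 0.40));

    // Score a document for one category by summing log probabilities:
    // equivalent to multiplying the probabilities, but underflow-safe.
    static double score(String category, List<String> words) {
        double logProb = 0.0;
        for (String word : words) {
            // Small floor probability for words never seen in this category.
            logProb += Math.log(WORD_PROB.get(category).getOrDefault(word, 1e-6));
        }
        return logProb;
    }

    // Pick the category with the highest score.
    static String classify(List<String> words) {
        return WORD_PROB.keySet().stream()
                .max(Comparator.comparingDouble(c -> score(c, words)))
                .get();
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of("laptop", "deal"))); // prints "tech"
    }
}
```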

To get more details on how the Naive Bayes classifier is implemented, you can look at the Mahout wiki page.

This tutorial gives a step-by-step description of how to create a training set, train the Naive Bayes classifier, and then use it to classify new tweets.

Requirements

For this tutorial, you will need:

jdk >= 1.6

maven

hadoop (preferably 1.1.1)

mahout >= 0.7

To install hadoop and mahout, you can follow the steps described in a previous post that shows how to use the Mahout recommender.

When you are done installing hadoop and mahout, make sure to add them to your PATH so you can call them easily:

export PATH=$PATH:[HADOOP_DIR]/bin:[MAHOUT_DIR]/bin

In our tutorial, we will limit the tweets to deals by fetching the tweets containing the hashtags #deal, #deals and #discount. We will classify them into the following categories:

apparel (clothes, shoes, watches, …)

art (books, DVDs, music, …)

camera

event (travel, concert, …)

health (beauty, spa, …)

home (kitchen, furniture, garden, …)

tech (computer, laptop, tablet, …)

You can get the scripts and Java programs used in this tutorial from our Git repository on GitHub:

$ git clone https://github.com/fredang/mahout-naive-bayes-example.git

You can compile the Java programs by typing:

$ mvn clean package assembly:single

Preparing the training set

UPDATE (2013/06/23): this section was updated to support the Twitter 1.1 API (1.0 was just shut down).

As preparing the training set is very time consuming, we have provided a training set in the source repository so that you don’t need to build it yourself. The file is data/tweets-train.tsv. If you choose to use it, you can jump directly to the next section.

To prepare a training set, we fetched tweets containing the hashtags #deals, #deal or #discount using the script twitter_fetcher.py. It uses the tweepy 2.1 library (make sure to install the latest version, as we now have to use the Twitter 1.1 API). You can install it by typing:

You need a consumer key/secret and an access token key/secret to use the API. If you don’t have them, log in to the Twitter website, go to https://dev.twitter.com/apps, and create a new application.
When you are done, you should see the Consumer key and Consumer secret in the ‘OAuth settings’ section, and the Access token and Access token secret in the ‘Your access token’ section.

Edit the file script/twitter_fetcher.py and change the following lines to use your Twitter keys and secrets:

If the percentage of correctly classified instances is too low, you may need to improve your training set: add more tweets, merge categories that are too similar, or remove categories that are rarely used. After making your changes, you will need to rerun the training process.
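For reference, the “percentage of correctly classified instances” that Mahout prints during testing is plain accuracy: the share of test documents whose predicted category matches the actual one. A minimal sketch (the category names are illustrative):

```java
import java.util.List;

public class AccuracySketch {
    // Accuracy: percentage of test documents where the predicted
    // category matches the actual category.
    static double accuracy(List<String> actual, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return 100.0 * correct / actual.size();
    }

    public static void main(String[] args) {
        List<String> actual = List.of("tech", "apparel", "home", "tech");
        List<String> predicted = List.of("tech", "apparel", "camera", "tech");
        System.out.println(accuracy(actual, predicted) + "%"); // prints "75.0%"
    }
}
```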

To use the classifier to classify new documents, we need to copy several files from HDFS:

model (the word id × label id weight matrix)

labelindex (mapping between a label and its id)

dictionary.file-0 (mapping between a word and its id)

df-count (document frequency: the number of documents each word appears in)
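To illustrate how these four artifacts fit together at classification time, here is a self-contained sketch with plain Java structures standing in for the HDFS files. All of the values below are made up, and in the real flow these files are read with Mahout and Hadoop classes rather than hard-coded:

```java
import java.util.Map;

public class ClassifySketch {
    // dictionary.file-0: word -> word id (toy stand-in values)
    static final Map<String, Integer> DICTIONARY = Map.of("camera", 0, "lens", 1, "sofa", 2);
    // labelindex: label id -> category name
    static final String[] LABELS = {"camera", "home"};
    // df-count: number of training documents each word appears in
    static final int[] DF_COUNT = {40, 25, 30};
    static final int NUM_DOCS = 100;
    // model: per-label weight for each word id (word id x label id matrix)
    static final double[][] MODEL = {{2.0, 1.5, 0.1}, {0.2, 0.1, 2.5}};

    static String classify(String tweet) {
        double[] scores = new double[LABELS.length];
        for (String word : tweet.toLowerCase().split("\\s+")) {
            Integer id = DICTIONARY.get(word);
            if (id == null) continue; // word not seen during training
            // tf = 1 per occurrence; idf derived from the document frequency
            double tfidf = Math.log((double) NUM_DOCS / DF_COUNT[id]);
            for (int label = 0; label < LABELS.length; label++) {
                scores[label] += MODEL[label][id] * tfidf;
            }
        }
        int best = 0;
        for (int label = 1; label < LABELS.length; label++) {
            if (scores[label] > scores[best]) best = label;
        }
        return LABELS[best];
    }

    public static void main(String[] args) {
        System.out.println(classify("camera lens deal")); // prints "camera"
    }
}
```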

Most of the tweets are classified properly, but some are not. For example, the tweet “J and R – Roku 3 Streaming Player 4200R $89.99” is incorrectly classified as camera. To fix that, we can add this tweet to the training set and label it as tech. You can do the same for the other tweets that are incorrectly classified. When you are done, repeat the training process and check the results again.

Conclusion

In this tutorial we have seen how to build a training set and how to use it with Mahout to train a Naive Bayes model. We showed how to test the classifier and how to improve the training set to get better classification. Finally, we used the model to build an application that automatically assigns a category to a tweet. In this post, we only studied one Mahout classifier among many others (SGD, SVM, neural networks, random forests, …); we will see how to use them in future posts.

Misc

View content of sequence files

To show the content of a file in HDFS, you can use the command:

$ hadoop fs -text [FILE_NAME]

However, some sequence files are encoded using Mahout classes. You can tell Hadoop where to find those classes by editing the file [HADOOP_DIR]/conf/hadoop-env.sh and adding the following line:

Using your own testing set with mahout

Previously, we showed how to generate a testing set from the training set using the Mahout split command.

In this section, we describe how to use our own testing set and run Mahout to check the classifier’s accuracy on it.

We have a small testing set in data/tweets-test-set.tsv that we transform into a tf-idf vector sequence file: the tweet words are converted into word ids using the dictionary file and associated with their tf × idf values:
