Sentiment analysis using Mahout naive Bayes

Sentiment analysis, or opinion mining, is the identification of subjective information in text. This tutorial will show how to do sentiment analysis on Twitter feeds using the naive Bayes classification algorithm available in Apache Mahout. Although far from a production-ready implementation, this simple Java demo application will help you understand how to use Mahout’s naive Bayes algorithm to classify text. We will start by explaining the problem, then see how naive Bayes can help us solve it, and finally build a working sample to see the algorithm in action. Basic Java programming knowledge is required for this tutorial.

Naive Bayes for sentiment analysis

Sentiment analysis aims to detect the attitude of a text. A simple subtask of sentiment analysis is to determine the polarity of the text: positive, negative or neutral. In this tutorial we concentrate on detecting if a short text like a Twitter message is positive or negative. For example:

for the tweet “Have a nice day!” the algorithm should tell us that this is a positive message.

for the tweet “I had a bad day” the algorithm should tell us that this is a negative message.

From a machine learning point of view this can be seen as a classification task, and naive Bayes is an algorithm well suited to this kind of task.

The naive Bayes algorithm uses probabilities to decide which class best matches for a given input text. The classification decision is based on a model obtained after the training process. Model training is done by analysing the relationship between the words in the training text and their classification categories. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features (Naive Bayes classifier on Wikipedia).

Each text we will classify contains words denoted Wi (i = 1..n). For each word Wi from the training data set we can extract the following probabilities (denoted P):

P(Wi given Positive) = (The number of positive texts containing Wi) / (The number of positive texts)

P(Wi given Negative) = (The number of negative texts containing Wi) / (The number of negative texts)

For the entire training set we will have:

P(Positive) = (The number of positive texts) / (The total number of texts)

P(Negative) = (The number of negative texts) / (The total number of texts)

For calculating the probability of a text being positive or negative, given the words it contains, we will use Bayes’ theorem. Because the words are assumed to be independent given the class, this gives:

P(Positive given Text) = P(Positive) × P(W1 given Positive) × … × P(Wn given Positive) / P(Text)

and similarly for the negative class. The P(Text) denominator is the same for both classes, so it can be ignored when comparing them.

At the end we compare P(Positive given Text) and P(Negative given Text), and the higher probability decides whether the text is positive or negative. To increase the quality of the classifier, instead of using the raw term frequency we will use TF-IDF weighting. This way the least significant words carry less weight when calculating the probabilities.
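Before moving on to Mahout, the decision rule above can be illustrated with a few lines of plain Java. This is a toy sketch, not Mahout’s implementation: the word counts and tweet totals below are invented for illustration, raw counts are used instead of TF-IDF, and add-one smoothing is applied so that unseen words do not zero out the product.

```java
import java.util.Map;

public class ToyNaiveBayes {
    // Hypothetical word counts from a toy training set:
    // {occurrences in positive tweets, occurrences in negative tweets}.
    static final Map<String, int[]> COUNTS = Map.of(
            "nice", new int[]{4, 1},
            "day",  new int[]{3, 3},
            "bad",  new int[]{1, 5});
    static final int POSITIVE_TWEETS = 10, NEGATIVE_TWEETS = 10;

    // Log-probability of a class given the words, up to the shared P(Text) term:
    // log P(class) + sum of log P(Wi given class).
    static double logScore(String[] words, int classIndex, int classTotal, int allTweets) {
        double score = Math.log((double) classTotal / allTweets); // log P(class)
        for (String w : words) {
            int[] c = COUNTS.getOrDefault(w, new int[]{0, 0});
            // Add-one (Laplace) smoothing for words never seen in a class.
            score += Math.log((c[classIndex] + 1.0) / (classTotal + 2.0));
        }
        return score;
    }

    static String classify(String tweet) {
        String[] words = tweet.toLowerCase().split("\\W+");
        int all = POSITIVE_TWEETS + NEGATIVE_TWEETS;
        double pos = logScore(words, 0, POSITIVE_TWEETS, all);
        double neg = logScore(words, 1, NEGATIVE_TWEETS, all);
        return pos >= neg ? "positive" : "negative";
    }

    public static void main(String[] args) {
        System.out.println(classify("Have a nice day!")); // prints "positive"
        System.out.println(classify("I had a bad day"));  // prints "negative"
    }
}
```

With these toy counts the two example tweets from the beginning of the article fall into the expected categories: “nice” pushes the first tweet towards positive and “bad” pushes the second towards negative.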

Java project for sentiment analysis

The project will use 100 tweets as input for the training phase. The input file will contain 100 lines, each line having the category (1 for positive and 0 for negative) and the tweet text.
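A few lines of such an input file might look as follows. The first two lines reuse the article’s example tweets; the remaining lines and the tab separator are hypothetical, since the exact format depends on the parsing code:

```
1	Have a nice day!
0	I had a bad day
1	Just got a promotion, so happy!
0	My flight was delayed again
```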

In a real-world project this dataset should contain millions of tweets for accurate results. The initial data is usually split into training and test sets, but for this simple demo we will use all the data for training.

As you can see in the main method, we start by transforming the input file into the sequence file format. This is the file format used by Hadoop, which Mahout in turn uses for parallel processing.
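A sketch of this conversion step, assuming the tab-separated input file shown earlier and hypothetical file names (`tweets.txt`, `tweets-seq/chunk-0`): each tweet is written as a Text key/value pair, where the key follows Mahout’s `/category/documentId` convention so the label can later be extracted during training.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqPath = new Path("tweets-seq/chunk-0"); // hypothetical output location
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqPath),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class));
             BufferedReader reader = new BufferedReader(new FileReader("tweets.txt"))) {
            String line;
            int id = 0;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // "1<TAB>Have a nice day!"
                // Key format "/category/documentId" lets Mahout recover the label later.
                writer.append(new Text("/" + parts[0] + "/tweet-" + id++),
                              new Text(parts[1]));
            }
        }
    }
}
```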

The next method, sequenceFileToSparseVector(), uses the previously created sequence file to create SparseVectors. These vectors contain the TFIDF measurement for the words in the tweets and will be used to train the classifier.
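This step corresponds to Mahout’s `seq2sparse` job. One possible sketch of it, driving `SparseVectorsFromSequenceFiles` programmatically with the hypothetical directory names from the previous step:

```java
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class SequenceFileToSparseVector {
    public static void main(String[] args) throws Exception {
        // Equivalent of "mahout seq2sparse": tokenizes the tweets and produces
        // TF-IDF weighted sparse vectors under the output directory.
        ToolRunner.run(new SparseVectorsFromSequenceFiles(), new String[]{
                "-i", "tweets-seq",     // input: sequence files from the previous step
                "-o", "tweets-vectors", // output directory (hypothetical name)
                "-wt", "tfidf",         // weighting: TF-IDF instead of raw term frequency
                "-ow"                   // overwrite any existing output
        });
    }
}
```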

The trainNaiveBayesModel() method creates the model file, starting from the TF-IDF vectors.
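The training step corresponds to Mahout’s `trainnb` job. A sketch, again with hypothetical paths carried over from the previous steps:

```java
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

public class TrainModel {
    public static void main(String[] args) throws Exception {
        // Equivalent of "mahout trainnb": builds the naive Bayes model
        // from the TF-IDF vectors produced by seq2sparse.
        ToolRunner.run(new TrainNaiveBayesJob(), new String[]{
                "-i", "tweets-vectors/tfidf-vectors", // TF-IDF vectors (hypothetical path)
                "-o", "model",                        // where the model files are written
                "-li", "labelindex",                  // label index mapping the categories
                "-el",                                // extract labels from the /category/ keys
                "-ow"
        });
    }
}
```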

The last method, classifyNewTweet(), takes a new tweet, builds the TF-IDF vector from its words and calculates the probability of the tweet being positive or negative. The higher probability decides the polarity of the tweet.
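The classification step can be sketched with Mahout’s classifier API. The model path is the hypothetical one used above, and the construction of the tweet’s TF-IDF vector is elided, since in the real application it must reuse the dictionary and IDF weights produced by `seq2sparse`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class ClassifyTweet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load the model trained in the previous step (hypothetical path).
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // The new tweet must be vectorized with the same dictionary and
        // TF-IDF weights used at training time; that step is omitted here.
        Vector tweetVector = new RandomAccessSparseVector(model.numFeatures());
        // ... fill tweetVector with the TF-IDF weight of each word ...

        // classifyFull returns one score per category; the index with the
        // highest score decides whether the tweet is positive or negative.
        Vector scores = classifier.classifyFull(tweetVector);
        System.out.println("Best category index: " + scores.maxValueIndex());
    }
}
```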

Conclusion

For every business it is important to gather feedback about its own products and services. Reviews, ratings, comments, recommendations, tweets, blogs etc. are a rich source of information which can help a company improve and evolve. In this context, sentiment analysis is a valuable tool which automates the process of extracting sentiment from different content sources. Besides companies, political parties are also increasingly interested in this kind of analysis, to extract opinion polarity from tweets, Facebook messages and blogs.

Comments

Is it possible to have 3 tweet categories (positive, negative, neutral) instead of just 2 (positive, negative)? If possible, which part should change?
And what is the code for splitting the data into training and testing?

Hello there,
I am developing a system for sentiment analysis and I am using the above code to classify Facebook comments. The code above is not working on the Windows platform; in all the methods it throws chmod exceptions. Could you please tell me how I could execute these methods in a Windows environment and then start using the classifier?

In the article’s prerequisites section I mentioned that the code will run only under Linux or MacOS. The problem is that Hadoop (which in turn is used by the Mahout algorithm), needs special setup under Windows. You can follow this tutorial for Windows: http://alans.se/blog/2010/mahout-on-hadoop-in-cygwin/. Otherwise you can run the sample code on Linux or a Linux virtual machine.

Good observation! The probability of a document belonging to a category should be between 0 and 1.

To compute the probability of a document belonging to a category, we compute the product of the probabilities of each word belonging to that category. As these probabilities are small, multiplying them loses precision. For this reason the naive Bayes implementation in Mahout uses the logarithmic function: Sum(log(probabilities)). Taking into consideration that the logarithm of a value between 0 and 1 is negative, the sum of these log probabilities will also be negative.
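The precision problem described above can be demonstrated in a few lines of plain Java: multiplying many small word probabilities underflows to zero, while summing their logarithms stays representable (the probability value 1e-5 and the count of 1000 words are arbitrary choices for the demonstration).

```java
public class LogProbDemo {
    // Product of 1000 small word probabilities: underflows to 0.0.
    static double product() {
        double p = 1.0;
        for (int i = 0; i < 1000; i++) {
            p *= 1e-5; // a typical small word probability
        }
        return p;
    }

    // Sum of the logs of the same probabilities: stays finite, but negative,
    // because the log of a value between 0 and 1 is negative.
    static double logSum() {
        double s = 0.0;
        for (int i = 0; i < 1000; i++) {
            s += Math.log(1e-5);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(product()); // prints 0.0: all information is lost
        System.out.println(logSum());  // about -11512.9: negative but usable
    }
}
```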