Binary Classification: Twitter sentiment analysis

This experiment demonstrates the use of the Execute R Script, Feature Selection, and Feature Hashing modules to train a text sentiment classification engine.

# Binary Classification: Twitter sentiment analysis
In this article, we'll explain how to build an experiment for sentiment analysis using *Microsoft Azure Machine Learning Studio*. Sentiment analysis is a special case of text mining that is increasingly important in business intelligence and social media analysis. For example, sentiment analysis of user reviews and tweets can help companies monitor public sentiment about their brands, or help consumers who want to identify opinion polarity before purchasing a product.
This experiment demonstrates the use of the **Feature Hashing**, **Execute R Script**, and **Filter Based Feature Selection** modules to train a sentiment analysis engine. We use a data-driven machine learning approach instead of a lexicon-based approach, because a lexicon-based approach is known to have high precision but low coverage compared to an approach that learns from a corpus of annotated tweets.
The hashing features are used to train a model using the **Two-Class Support Vector Machine** (SVM), and the trained model is used to predict the opinion polarity of unseen tweets. The output predictions can be aggregated over all the tweets containing a certain keyword, such as a brand, celebrity, product, or book name, to find the overall sentiment around that keyword. The experiment is generic enough that you could use this framework to solve any text classification task, given a reasonable amount of labeled training data.
## Experiment Creation
The main steps of the experiment are:
- [Step 1: Get data]
- [Step 2: Text preprocessing using R]
- [Step 3: Feature engineering]
- [Step 4: Split the data into train and test]
- [Step 5: Train prediction model]
- [Step 6: Evaluate model performance]
- [Step 7: Publish prediction web service]
[Step 1: Get data]:#step-1-get-data
[Step 2: Text preprocessing using R]:#step-2-pre-process-text
[Step 3: Feature engineering]:#step-3-feature-engineering
[Step 4: Split the data into train and test]:#step-4-split-data
[Step 5: Train prediction model]:#step-5-train-model
[Step 6: Evaluate model performance]:#step-6-evaluate-model
[Step 7: Publish prediction web service]:#step-7-publish-web-service
![][image-overall]
### <a name="step-1-get-data"></a>Step 1: Get data
The data used in this experiment is the [Sentiment140 dataset](http://help.sentiment140.com/), a publicly available data set created by three graduate students at Stanford University: Alec Go, Richa Bhayani, and Lei Huang. The data comprises approximately 1,600,000 automatically annotated tweets.
The tweets were collected by using the Twitter Search API and keyword search. During automatic annotation, any tweet with positive emoticons, such as :), was assumed to bear positive sentiment, and any tweet with negative emoticons, such as :(, was assumed to bear negative sentiment. Tweets containing both positive and negative emoticons were removed. Additional information about this data and the automatic annotation process can be found in the 2009 technical report by Alec Go, Richa Bhayani, and Lei Huang, *Twitter Sentiment Classification using Distant Supervision*.
For this experiment, we extracted a 10% sample of the data and shared it as a public Blob in a Windows Azure Storage account. You can use this shared data to follow the steps in this experiment, or you can get the full data set from the Sentiment140 dataset home page.
![][image-data-reader]
Each instance in the data set has 6 fields:
* sentiment_label - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* tweet_id - the id of the tweet
* time_stamp - the date of the tweet (for example, Sat May 16 23:58:44 UTC 2009)
* target - the query (for example, lyx). If there is no query, then this value is NO_QUERY.
* user_id - the user who posted the tweet
* tweet_text - the text of the tweet
We uploaded to the experiment only the two fields that are required for training, as shown below:
![][image-data-view]
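As a rough illustration of the six-field layout described above, the following Python sketch reads rows in that column order and keeps only two of them. It assumes the file is a quoted CSV without a header row, that the two fields kept for training are `sentiment_label` and `tweet_text`, and the two sample rows are made up for the example; none of this is taken from the experiment itself.

```python
import csv
import io

# Two made-up rows in the six-column layout (assumed: quoted CSV, no header row).
SAMPLE = (
    '"0","1001","Sat May 16 23:58:44 UTC 2009","NO_QUERY","user_a","Missed my bus again :("\n'
    '"4","1002","Sat May 16 23:59:10 UTC 2009","NO_QUERY","user_b","Loving the sunshine today"\n'
)

COLUMNS = ["sentiment_label", "tweet_id", "time_stamp", "target", "user_id", "tweet_text"]


def load_label_and_text(handle):
    """Yield (sentiment_label, tweet_text) pairs, dropping the other four fields."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    for row in reader:
        yield int(row["sentiment_label"]), row["tweet_text"]


rows = list(load_label_and_text(io.StringIO(SAMPLE)))
print(rows)
```

To read a real file instead of the in-memory sample, pass an open file handle to `load_label_and_text`.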
### <a name="step-2-pre-process-text"></a>Step 2: Text preprocessing using R
Unstructured text such as a tweet usually requires some preprocessing before it can be analyzed. We used the following R code to remove punctuation marks, special characters, and digits, and then performed case normalization:
![][image-text-preprocess]
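The R script itself appears only in the screenshot. For readers who want to try the same cleaning steps outside the experiment, here is a minimal Python sketch of the idea. It is an approximation, not a transcription of the R code: the choice to replace removed characters with a space and to collapse repeated whitespace is an assumption, and `clean_tweet` is a hypothetical helper name.

```python
import re


def clean_tweet(text):
    """Drop punctuation, special characters, and digits, then lowercase."""
    # Replace every character that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the removals.
    collapsed = re.sub(r"\s+", " ", letters_only).strip()
    # Case normalization.
    return collapsed.lower()


print(clean_tweet("@Bob I LOVE my new phone!!! 10/10 :)"))
# prints: bob i love my new phone
```

Note that this removes emoticons along with other punctuation, which matters for this data set because the labels were derived from emoticons.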
After the text was cleaned, we used the **Metadata Editor** module to change the metadata of the text column as follows.
- We marked the text column as a non-categorical column.
- We also marked the text column as a non-feature.
The reason is that we want the learner to ignore the source text and not use it as a feature when training the model, but rather to use the extracted features that we build in the next step.
![][image-metadata-editor]
### <a name="step-3-feature-engineering"></a> Step 3: Feature engineering
#### Feature hashing
The **Feature Hashing** module can be used to represent variable-length text documents as equal-length numeric feature vectors. An added benefit of using feature hashing is that it reduces the dimensionality of the data, and makes lookup of feature weights faster by replacing string comparison with hash value comparison.
In this experiment, we set the number of hashing bits to 17 and the number of N-grams to 2. With these settings, the hash table can hold 2^17 or 131,072 entries in which each hashing feature will represent one or more unigram or bigram features. For many problems, this is plenty, but in some cases, more space is needed to avoid collisions. You should experiment with a different number of bits and evaluate the performance of your machine learning solution.
![][image-feat-hash]
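To make the idea concrete, the following Python sketch hashes the unigrams and bigrams of a cleaned tweet into 2^17 buckets. It illustrates the hashing trick in general and is not the **Feature Hashing** module's implementation: CRC32 from the standard library stands in for whatever hash function the module uses internally, and `ngrams` and `hash_features` are hypothetical helper names.

```python
import zlib
from collections import Counter

HASH_BITS = 17
NUM_BUCKETS = 2 ** HASH_BITS  # 131,072 possible feature indices


def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hash_features(text, bits=HASH_BITS, max_n=2):
    """Map a cleaned tweet to a sparse {bucket_index: count} vector."""
    mask = (1 << bits) - 1  # keep only the lowest `bits` bits of the hash
    counts = Counter()
    for gram in ngrams(text.split(), max_n):
        bucket = zlib.crc32(gram.encode("utf-8")) & mask
        counts[bucket] += 1  # colliding n-grams share a bucket
    return dict(counts)


vector = hash_features("i love my new phone")
print(NUM_BUCKETS, sum(vector.values()))
```

A five-word tweet produces five unigrams and four bigrams, so the counts in `vector` sum to nine. Lowering `bits` makes collisions more likely, which is the trade-off described above.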
#### Feature selection
The classification complexity of a linear model is linear with respect to the number of features. However, even with feature hashing, a text classification model can have too many features for a good solution. Therefore, we used the **Filter Based Feature Selection** module to select a compact feature subset from the exhaustive list of extracted hashing features. The aim is to reduce the computational complexity without affecting classification accuracy.
We chose the Chi-squared score function to rank the hashing features in descending order, and returned the top 20,000 most relevant features with respect to the sentiment label, out of the 2^17 extracted features.
![][image-feature-selection]
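The ranking step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the module's algorithm: each hashed feature is reduced to present or absent, the label is binary, and the statistic is the standard chi-squared test on the resulting 2x2 table. The function names and the toy data are hypothetical.

```python
def chi_squared(present, labels):
    """Chi-squared statistic of the 2x2 table: feature present/absent vs. label 1/0."""
    a = sum(1 for p, y in zip(present, labels) if p and y)          # present, positive
    b = sum(1 for p, y in zip(present, labels) if p and not y)      # present, negative
    c = sum(1 for p, y in zip(present, labels) if not p and y)      # absent, positive
    d = sum(1 for p, y in zip(present, labels) if not p and not y)  # absent, negative
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or label carries no information
    return n * (a * d - b * c) ** 2 / denom


def top_k_features(vectors, labels, k):
    """Rank bucket indices by chi-squared score (descending) and keep the best k."""
    buckets = sorted({bucket for v in vectors for bucket in v})
    scores = {
        bucket: chi_squared([bucket in v for v in vectors], labels)
        for bucket in buckets
    }
    ranked = sorted(buckets, key=lambda bucket: (-scores[bucket], bucket))
    return ranked[:k], scores


# Toy data: buckets 1 and 2 track the label, buckets 7 and 9 do not.
vectors = [{1: 1, 7: 1}, {1: 1, 9: 1}, {2: 1, 7: 1}, {2: 1, 9: 1}]
labels = [1, 1, 0, 0]
selected, scores = top_k_features(vectors, labels, k=2)
print(selected)  # prints: [1, 2]
```

In the experiment, `k` corresponds to the 20,000 features that are kept.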
### <a name="step-4-split-data"></a>Step 4: Split the data into train and test
We used the **Split** module in Azure ML to divide the data into training and test sets with a stratified split. Stratification maintains the class ratios in the two output groups. We use 80% of the Sentiment140 sample tweets for training and the remaining 20% for testing the performance of the trained model.
![][image-data-split]
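The effect of stratification can be sketched in Python as follows. This is an illustration only: the **Split** module's internals are not described here, and the per-class shuffle with a fixed seed is an assumption made to keep the example reproducible.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each class by train_fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # assumed: rows are shuffled within each class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = [0] * 50 + [4] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # prints: 80 20
```

Because each class is split separately, both output sets keep the same 50/50 class ratio as the input.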
### <a name="step-5-train-model"></a>Step 5: Train prediction model
To train the model, we connected the text features created in the previous steps (the training data) to the **Train Model** module. Microsoft Azure Machine Learning Studio supports a number of learning algorithms; we selected the **Two-Class Support Vector Machine** (SVM) for illustration.
![][image-train-model]
The parameters used in the **Two-Class Support Vector Machine** module are shown in the following graphic:
![][image-svm-parameters]
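For intuition about what training a linear SVM on sparse hashed features involves, here is a small Python sketch that uses Pegasos-style stochastic subgradient descent on the regularized hinge loss. It is not the **Two-Class Support Vector Machine** module's implementation, it has no bias term, and the `epochs` and `lam` values are illustrative assumptions rather than the parameters in the screenshot.

```python
import random


def train_linear_svm(vectors, labels, epochs=20, lam=0.01, seed=0):
    """Pegasos-style training. vectors: sparse {index: value} dicts; labels: +1 or -1."""
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(vectors)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            score = sum(weights.get(f, 0.0) * v for f, v in vectors[i].items())
            margin = labels[i] * score
            # Shrink all weights: the gradient step for the L2 penalty.
            scale = 1.0 - eta * lam
            for f in weights:
                weights[f] *= scale
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for f, v in vectors[i].items():
                    weights[f] = weights.get(f, 0.0) + eta * labels[i] * v
    return weights


def predict(weights, vector):
    """Return +1 or -1 from the sign of the linear score."""
    score = sum(weights.get(f, 0.0) * v for f, v in vector.items())
    return 1 if score >= 0 else -1


# Toy data: odd bucket indices appear only in positive tweets, even ones in negative.
train_x = [{1: 1, 3: 1}, {1: 1, 5: 1}, {2: 1, 4: 1}, {2: 1, 6: 1}]
train_y = [1, 1, -1, -1]
model = train_linear_svm(train_x, train_y)
print([predict(model, x) for x in train_x])  # prints: [1, 1, -1, -1]
```

The sparse dictionary representation means each update touches only the features present in one tweet, apart from the shrink step.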
### <a name="step-6-evaluate-model"></a>Step 6: Evaluate trained model performance
To evaluate how well the trained SVM model generalizes to unseen data, we connected the trained model and the test data set to the **Score Model** module, which scores the tweets in the test set. We then connected the scored predictions to the **Evaluate Model** module to get the evaluation metrics (ROC, precision/recall, and lift) shown in the following charts.
Note that the metrics shown here resulted from training the model on the full Sentiment140 dataset. To reproduce these results, replace the 10% sample dataset attached to the experiment with the full data set.
![][image-evaluate-model]
#### ROC curve
![][image-ROC]
#### Precision/Recall curve
![][image-PR-Curve]
#### Lift curve
![][image-Lift]
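The quantities behind these charts can be computed by hand on a small example. The following Python sketch derives precision, recall, and accuracy at one threshold, plus the area under the ROC curve, from true labels and predicted scores. The toy labels and scores are made up, the 0.5 threshold is an assumption, and the function names are hypothetical; the numbers are unrelated to the results shown above.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Count true/false positives and negatives at the given score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_accuracy(y_true, y_score, threshold=0.5):
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy


def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(precision_recall_accuracy(y_true, y_score))
print(roc_auc(y_true, y_score))
```

On this toy input, two of the three positives are found and one negative is misclassified, so precision, recall, and accuracy are all 2/3, and the AUC is 8/9.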
### <a name="step-7-publish-web-service"></a>Step 7: Publish prediction web service
A key feature of Azure Machine Learning is the ability to easily publish models as web services on Windows Azure. In order to publish the trained sentiment prediction model, first we must save the trained model. To do this, just click the output port of the **Train Model** module and select **Save as Trained Model**.
![][image-save-trained-model]
Next, we created a new experiment that has only the scoring module, with the saved model attached. We also provided a sample schema for the input data, which we created by sampling one percent of the tweets in the **`Sentiment140`** dataset and saving that as a dataset.
Web service entry and exit points are defined using the special Web Service modules. Note that the **Web service input** module is attached to the node in the experiment where input data would enter.
![][image-scoring-exp]
To map the confidence scores into sentiment labels (positive, neutral and negative), we added the following R code to an **Execute R Script** module.
![][image-output-preparation]
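The R script for this step appears only in the screenshot. As a hedged illustration of the idea, the Python sketch below maps a positive-class probability to one of three labels by using two cutoffs. The cutoff values 0.4 and 0.6 and the function name `score_to_sentiment` are hypothetical and are not taken from the experiment.

```python
def score_to_sentiment(probability, lower=0.4, upper=0.6):
    """Map a positive-class probability to a sentiment label.

    Scores at or above `upper` are positive, scores at or below `lower`
    are negative, and anything in between is treated as neutral.
    The default cutoffs are illustrative assumptions.
    """
    if probability >= upper:
        return "positive"
    if probability <= lower:
        return "negative"
    return "neutral"


print([score_to_sentiment(p) for p in (0.95, 0.5, 0.05)])
# prints: ['positive', 'neutral', 'negative']
```

Because the training data contains only positive and negative tweets, a neutral band around the decision boundary is one simple way to produce a third label from a two-class model.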
After the experiment runs successfully, you can publish it by clicking **Publish Web Service** at the bottom of the experiment canvas.
![][image-publish-web-service]
<!-- Images -->
[image-data-reader]:http://az712634.vo.msecnd.net/samplesimg/v1/13/data-reader.PNG
[image-data-view]:http://az712634.vo.msecnd.net/samplesimg/v1/13/data-view.PNG
[image-overall]:http://az712634.vo.msecnd.net/samplesimg/v1/13/training-exp.PNG
[image-text-preprocess]:http://az712634.vo.msecnd.net/samplesimg/v1/13/text-preprocessing-R.PNG
[image-metadata-editor]:http://az712634.vo.msecnd.net/samplesimg/v1/13/metadata-editor.PNG
[image-feat-hash]:http://az712634.vo.msecnd.net/samplesimg/v1/13/feature-hashing.PNG
[image-data-split]:http://az712634.vo.msecnd.net/samplesimg/v1/13/data-split.PNG
[image-feature-selection]:http://az712634.vo.msecnd.net/samplesimg/v1/13/feature-selection.PNG
[image-train-model]:http://az712634.vo.msecnd.net/samplesimg/v1/13/train-model.PNG
[image-svm-parameters]:http://az712634.vo.msecnd.net/samplesimg/v1/13/svm-parameters.PNG
[image-save-trained-model]:http://az712634.vo.msecnd.net/samplesimg/v1/13/save-trained-model.PNG
[image-scoring-exp]:http://az712634.vo.msecnd.net/samplesimg/v1/13/scoring-exp.PNG
[image-evaluate-model]:http://az712634.vo.msecnd.net/samplesimg/v1/13/evaluate-model.PNG
[image-ROC]:http://az712634.vo.msecnd.net/samplesimg/v1/13/ROC.PNG
[image-PR-Curve]:http://az712634.vo.msecnd.net/samplesimg/v1/13/PR-Curve.PNG
[image-Lift]:http://az712634.vo.msecnd.net/samplesimg/v1/13/Lift.PNG
[image-publish-web-service]:http://az712634.vo.msecnd.net/samplesimg/v1/13/publish-web-service.PNG
[image-output-preparation]:http://az712634.vo.msecnd.net/samplesimg/v1/13/output-preparation.PNG