Classifier Precision

Precision measures the exactness of a classifier: higher precision means fewer false positives, while lower precision means more false positives. Precision is often at odds with recall, since an easy way to improve precision is to decrease recall.

Classifier Recall

Recall measures the completeness, or sensitivity, of a classifier: higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision, because it gets increasingly harder to be precise as the sample space increases.

F-measure Metric

Precision and recall can be combined to produce a single metric known as F-measure, the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy. In other words, compared to precision & recall, F-measure is mostly useless, as you'll see below.

Measuring Precision and Recall of a Naive Bayes Classifier

The NLTK metrics module provides functions for calculating all three metrics mentioned above. To do so, you need to build two sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes classifier. This time, instead of measuring accuracy, we'll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the Naive Bayes classifier. The actual values collected are simply the index of each featureset, obtained using enumerate.
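The core pattern looks roughly like the following minimal sketch. A tiny made-up dataset stands in for the movie_reviews featuresets from the previous article, and word_feats is an illustrative assumption, not the article's exact code:

```python
import collections
from nltk import NaiveBayesClassifier
from nltk.metrics import precision, recall, f_measure

# Illustrative bag-of-words feature extractor (assumed, not the article's code).
def word_feats(words):
    return {w: True for w in words}

# Toy stand-in for the movie_reviews train/test featuresets.
trainfeats = [
    (word_feats(['great', 'fun']), 'pos'),
    (word_feats(['great', 'film']), 'pos'),
    (word_feats(['awful', 'boring']), 'neg'),
    (word_feats(['boring', 'plot']), 'neg'),
]
testfeats = [
    (word_feats(['great']), 'pos'),
    (word_feats(['boring']), 'neg'),
]

classifier = NaiveBayesClassifier.train(trainfeats)

# Build a reference set of correct labels and a test set of observed
# labels, keyed by label; the set members are featureset indexes.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))
print('pos F-measure:', f_measure(refsets['pos'], testsets['pos']))
```

The metric functions simply compare the two sets: precision is the fraction of the observed set that is also in the reference set, and recall is the fraction of the reference set that was observed.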

Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.

But a file given a pos classification is only 65% likely to be correct. This mediocre precision means 35% of pos classifications are false positives.

Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.

But many files that are actually neg are incorrectly classified. The low recall means 52% of neg files are false negatives.

F-measure provides no useful information. There’s no insight to be gained from having it, and we wouldn’t lose any knowledge if it was taken away.

Improving Results with Better Feature Selection

One possible explanation for the above results is that people normally use positive words in negative reviews, but precede them with “not” (or some other negation word), as in “not great”. And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that “not great” is negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I’ll explore in a future article.
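As a rough sketch of what “train on multiple words” could mean, a feature extractor can include word pairs alongside single words, so that “not great” becomes a feature of its own (the helper name here is illustrative, not from the article):

```python
from nltk import bigrams

# Hypothetical feature extractor: bag of words plus bag of bigrams, so a
# pair like ('not', 'great') is a single feature the classifier can learn.
def bigram_word_feats(words):
    feats = {w: True for w in words}
    feats.update({bg: True for bg in bigrams(words)})
    return feats

print(bigram_word_feats(['not', 'great']))
```

With features like this, a Naive Bayes classifier can associate the bigram ('not', 'great') with the neg class even though 'great' alone leans pos.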

Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn’t know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I’ll also explore in a future article.

If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.

Thanks Stijn. I'm hoping my posts help increase NLP appreciation, or at least awareness of how to do it effectively. Looks like you're trying to do the same for information design, and now I have a new blog to explore 🙂

Thanks Jacob, good stuff here. I'm really wanting to do this on some of my own data but am scratching my head as to how my sentiment data needs to be formatted to read in the sentiment id. Any idea how this works?

The movie reviews corpus is in 2 directories, one for “pos” and another for “neg”. Then it uses the CategorizedCorpusReader to specify the categories based on which directory each file is in. So I'd read up on the CategorizedCorpusReader at http://nltk.googlecode.com/svn/trunk/doc/api/nl… and look at some of the other categorized corpora for examples (brown, reuters) to figure out what would work best for organizing your own data.

Adam P Leary

Thanks Jacob. After I posted this, I found that info as well. I am using the CategorizedPlainTextCorpusReader. It looks like I use the constructor to create my own reader. I am including another category neutral to the mix.

guest

Thanks for the information. But I’d like to know how we can categorize a text, i.e. how can we determine whether a text (given by the user) is pos or neg?

Once you have a trained classifier, then for every piece of text you want to classify, get the bag of words and pass that into the classifier, like classifier.classify(word_feats(text)). This will return one of the known labels, such as pos or neg.
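A minimal sketch of that call, with a toy classifier and a simple whitespace split standing in for the trained classifier and tokenization from the article:

```python
from nltk import NaiveBayesClassifier

# Illustrative bag-of-words feature extractor (assumed, not the article's code).
def word_feats(words):
    return {w: True for w in words}

# Toy classifier standing in for the one trained on movie_reviews.
classifier = NaiveBayesClassifier.train([
    (word_feats(['great', 'movie']), 'pos'),
    (word_feats(['terrible', 'movie']), 'neg'),
])

text = 'a great movie'
# Tokenize the raw text, build the featureset, then classify.
label = classifier.classify(word_feats(text.split()))
print(label)
```

Note that classify() takes a featureset dict, not a raw string, which is why the text has to go through word_feats first.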

Thank you, but sorry for troubling you again. Since I am new to this, I just want to make sure what I am doing is correct. I downloaded the subjectivity datasets into a new directory called rotten_imdb under nltk_data/corpora/ and changed the file names to neutral.txt and polar.txt. Then I created a reader and categorized the new corpora as polar_neutral_review = CategorizedPlaintextCorpusReader(root, '.*txt', cat_pattern=r'(\w+)\.txt'). Then I used the same code given for text classification using the bigram algorithm, except I changed the following line: trainfeats = negfeats + posfeats, since there is only one file for each category. Up to this point it doesn't give any error, but after training, when I categorize the user-given text, it always returns 'polar'. I have no idea where I went wrong. Need help!

Have you measured the precision & recall of the trainfeats? I'm pretty sure it'll be skewed one way, and you'll have to follow the instructions in the next article of this series: use only high information words. Or, since you have the files named appropriately, you should also be able to use train_classifier.py in github.com/japerk/nltk-trainer with options like --sents --min_score 3 to train a classifier.

Hi Jacob! Thank you for your brilliant tutorials, they are really helpful =)
I have couple of questions.

1) When training the classifier with a smaller corpus, sometimes the most informative features function shows words with the value 'None'. According to NLTK's documentation, "The feature value 'None' is reserved for unseen feature values", but using your example I get None values for words which are in both the training and test sets. Why? I don't get it.

2) To implement neutrality you just used another classifier with the subjectivity dataset, right?

We’re reducing our sets to 10 files for each label. Obviously this is not what we want in a real environment, and this doesn’t happen when training on a relatively large corpus, but it just caught my attention to have ‘None’ values for words that appear in both the training and test sets. Shouldn’t they all be True?

2) Great! I have to make a similar implementation using the Spanish language so it’s time for me to compile a neutral corpus 🙂

So what I think is going on is that words with None have been seen in one of the categories, but not the other, and so not seeing them becomes an indicator of the category to choose. For example, “really = None pos : neg = 3.7 : 1.0” would mean that if “really” is not in a featureset, it’s more likely to be positive (since the classifier only saw it in negative training examples).

Schillermika

Hi,

I’m classifying text as either vulgar or clean. I ran the metrics on the data several times and got slightly different results each time. For example, notice that every score has changed slightly from the first to the second evaluation even though it’s the exact same data. Any idea why this might be?

Here’s your problem: random.shuffle(featuresets)
If you do this before splitting the train_feats & test_feats, then of course you’ll be getting different results, because the training features are different each time.

Schillermika

Why didn’t I see that?…thnx! Really good blog, btw. I’ve learned more practical stuff on here and through your book than anywhere else.

Fahd

Thanks Jac,
I found this article very helpful.

I wonder if there is a way to calculate the overall precision, recall and F-measure for all classes.

I believe this would be very helpful instead of calculating the average of these measures.

You can do this with binary classifiers, if you assume one class is the positive class, and the other is the negative class (this isn’t referring to sentiment, but positive as in true, and negative as in false). Then you count the number of true positives, false positives, and false negatives, and calculate the precision and recall as defined at https://en.wikipedia.org/wiki/Precision_and_recall#Definition_.28classification_context.29
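A minimal sketch of that counting approach, with made-up reference and observed label lists:

```python
# Hypothetical gold labels and classifier outputs, treating 'pos' as the
# positive (true) class and 'neg' as the negative (false) class.
reference = ['pos', 'pos', 'neg', 'neg', 'pos']
observed  = ['pos', 'neg', 'neg', 'pos', 'pos']

# Count true positives, false positives, and false negatives.
tp = sum(1 for r, o in zip(reference, observed) if r == 'pos' and o == 'pos')
fp = sum(1 for r, o in zip(reference, observed) if r == 'neg' and o == 'pos')
fn = sum(1 for r, o in zip(reference, observed) if r == 'pos' and o == 'neg')

overall_precision = tp / (tp + fp)  # 2 / 3 here
overall_recall = tp / (tp + fn)     # 2 / 3 here
print(overall_precision, overall_recall)
```

For more than two classes, the same counts can be accumulated per class and summed to get a single micro-averaged precision and recall.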

In “Improving Results with Better Feature Selection”, you mentioned that sentiment changes caused by negation words, as in “not great”, could be handled by training on multiple words/n-grams. Have you had a chance to do that? Let me know where I can find that work. Thanks.

Outstanding tutorial. You make NLTK easy for “human beginners”.
I’m trying to generate a ROC curve after the analysis, but so far the most approachable library for NLTK is PyROC, and even that is hard to use because of the never-ending incompatibility between lists/strings/dicts.

Hi Jacob,
Thanks so much for the support you are rendering via your blog.

I have a couple of questions to ask you.

1. We are working on a similar text classification for sentiment analysis problem, on pull request comments, with positive, negative, and neutral labels. Using the NLTK libraries and bigram techniques as explained here, can we split our dataset into three files (positive_f, negative_f, and neutral_f), each file containing only the corpus comments for the purpose of text mining?

2. If we want to plot the distribution of the said labels across the dataset, or aggregate the data by any attribute in the dataset, how do we accomplish these using your guidelines?
Thanks in advance.

1. Yes, and if you separate each comment with blank lines, then it should be easy to treat each comment as a separate instance, maybe by using one of NLTK’s corpus readers and the paras() method.
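As a rough illustration of the blank-line idea, here is plain Python standing in for a corpus reader, with made-up comments:

```python
# Hypothetical file contents: one comment per blank-line-separated block.
raw = 'Great change, merging.\n\nThis breaks the build.\n\nLooks fine to me.'

# Split on blank lines so each comment becomes a separate instance,
# analogous to what a corpus reader's paragraph method would yield.
comments = [c.strip() for c in raw.split('\n\n') if c.strip()]
print(len(comments))
```

Each element of comments can then be tokenized and turned into a featureset on its own, exactly like the individual movie review files.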

2. I’m not sure about plotting, lots of people do that differently, but matplotlib is fairly popular.

K-IFY

Thanks so much and sorry for late response.

Regarding question 1, I understand from your code that I could use one instance of the dataset to account for the three classes, as follows:

neuids = dataset_name.fileids('neu')…
Thus, we could plot the distribution of the said labels across the dataset, or aggregate the data by any attribute in the dataset, using the Matplotlib library you suggested.

I’m a greenhorn in Machine Learning and Python language. I will appreciate it if you could give me more guidance on how to go about it.

I don’t know matplotlib, so I tend to stick with simple counting, such as “how many pos fileids?”

Bhushan Tembhurne

The code is not working!
The previous code to compute sentiment analysis using the Naive Bayes classifier worked fine for me, but when I tried to run this code it did not work. It gives the error "name nltk is not defined".
When I add "import nltk" to the code, it returns another error: AttributeError: 'module' object has no attribute 'precision'.
Please resolve the problem ASAP!

Maybe your code is wrong. I don’t see any code above that is just “import nltk”.

Nikos Spatiotis

I had the same problem… you have to insert this code at the beginning: from nltk.metrics import precision, recall, f_measure, and then change the print statements to:

print 'pos precision:', precision(refsets['pos'], testsets['pos'])
print 'pos recall:', recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', precision(refsets['neg'], testsets['neg'])
print 'neg recall:', recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', f_measure(refsets['neg'], testsets['neg'])

Adaboo Azeem Jnr

Hello, I am using this tutorial to build a sentiment analyzer on a Twitter feed. How do I classify a single tweet or sentence, e.g. tweet_classified = classifier.classify("the car is beautiful")?
But I get an error… Thanks
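The error above is most likely because classify() expects a featureset dict rather than a raw string. A hedged sketch of the fix, with a toy classifier and whitespace tokenization standing in for the trained one:

```python
from nltk import NaiveBayesClassifier

# Illustrative bag-of-words feature extractor (assumed, not the article's code).
def word_feats(words):
    return {w: True for w in words}

# Toy classifier standing in for one trained on real tweet data.
classifier = NaiveBayesClassifier.train([
    (word_feats(['beautiful', 'car']), 'pos'),
    (word_feats(['ugly', 'car']), 'neg'),
])

tweet = 'the car is beautiful'
# Convert the raw string into a featureset before classifying.
tweet_classified = classifier.classify(word_feats(tweet.split()))
print(tweet_classified)
```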