The most direct definition of the task is: “Does a text express a positive or negative sentiment?” Usually, we assign a polarity value to the text. This value is typically in the [-1, 1] interval, where 1 is very positive and -1 is very negative.

Why is sentiment analysis useful?

Sentiment analysis can have a multitude of uses, some of the most prominent being:

Tracking a brand’s or product’s presence online

Analyzing the reviews for a product

Triaging customer support requests

Why sentiment analysis is hard

There are a few problems that make sentiment analysis particularly hard:

1. Negations

Negations are the classic argument for why a bag-of-words model doesn’t work properly for sentiment analysis. “I like the product” and “I do not like the product” should be opposites, yet a classic machine learning approach would probably score these sentences identically.
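Here’s a toy illustration of the problem. The stopword list below is my own, purely for this demo; the point is that once common function words are discarded, as many naive pipelines do, the two sentences collapse into the same bag of words:

from collections import Counter

# tiny stopword list, purely for illustration
STOPWORDS = {"i", "do", "not", "the"}

def bag_of_words(text):
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

print(bag_of_words("I like the product"))         # Counter({'like': 1, 'product': 1})
print(bag_of_words("I do not like the product"))  # Counter({'like': 1, 'product': 1}) - identical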

2. Metaphors, Irony, Jokes

Computers have a hard time understanding figurative language. “The best I can say about this product is that it was definitely interesting …” Here, the word “interesting” plays a very different role from its usual, positive meaning.

3. Multiple sentiments in the same text

A complex text can be segmented into different sections. Some sections can be positive, others negative. How do we aggregate the polarities?

“The phone’s design is the best I’ve seen so far, but the battery can definitely use some improvements”

Here we can see two sentiments in the same review. Is the review positive or negative overall? Is a not-so-great battery a deal breaker?

These are indeed complex problems, and the solutions aren’t simple at all. In fact, all of these issues are open problems in the field of Natural Language Processing.

For now, the best approach is to tune your algorithm to your problem as well as possible. If you are analyzing tweets, take emoticons seriously into account. If you are studying political commentary, correlate the polarity with current events. In the case of the phone review, weigh the different properties of the phone according to a set of rules, perhaps combined with some domain-specific knowledge.

Available Corpora

There are a few resources that can come in handy when doing sentiment analysis.

In this tutorial, we’ll use the IMDB movie reviews corpus. It has enough samples to do some interesting analysis. Download it from here: IMDB movie reviews on kaggle. The corpus comes in several files, containing unlabeled data and test data; we’re only interested in the labeledTrainData.tsv.zip file. Unzip the file somewhere convenient and let’s get started.
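The loading step isn’t shown in this copy, so here’s a minimal version, assuming the unzipped labeledTrainData.tsv sits in your working directory (quoting=3 is csv.QUOTE_NONE, which tells pandas to ignore the stray quote characters inside the reviews):

import pandas as pd

# tab-separated file with id, sentiment and review columns
data = pd.read_csv("labeledTrainData.tsv", sep="\t", quoting=3)

print(data.shape)  # (25000, 3) - 25,000 labeled reviews
print(data.head())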

The sentiment label in this corpus is 0 for negative and 1 for positive. As you can see, the reviews also contain some HTML tags, so remember to clean those up later. Let’s shuffle the data and split it into training and test sets.


import random

# pair each review with its label, then shuffle
sentiment_data = list(zip(data["review"], data["sentiment"]))
random.shuffle(sentiment_data)

# 80% for training
train_X, train_y = zip(*sentiment_data[:20000])

# Keep 20% for testing
test_X, test_y = zip(*sentiment_data[20000:])

Using SentiWordNet

One of the most straightforward approaches is to use SentiWordNet to compute a polarity score for each word and average those scores over the whole text. The plan is to use this model as a baseline for future approaches. It’s also good to know about SentiWordNet and how to use it.


from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

lemmatizer = WordNetLemmatizer()

def penn_to_wn(tag):
    """
    Convert from PennTreebank tags to simple WordNet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

def clean_text(text):
    # strip the HTML line breaks present in the IMDB reviews
    text = text.replace("<br />", " ")
    return text

def swn_polarity(text):
    """
    Return a sentiment polarity: 0 = negative, 1 = positive
    """
    sentiment = 0.0
    tokens_count = 0

    text = clean_text(text)
    raw_sentences = sent_tokenize(text)

    for raw_sentence in raw_sentences:
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))

        for word, tag in tagged_sentence:
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue

            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue

            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue

            # Take the first sense, the most common
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())

            sentiment += swn_synset.pos_score() - swn_synset.neg_score()
            tokens_count += 1

    # judgment call? Default to positive or negative
    if not tokens_count:
        return 0

    # sum greater than 0 => positive sentiment
    if sentiment >= 0:
        return 1

    # negative sentiment
    return 0

# Since we're shuffling, you'll get different results
print(swn_polarity(train_X[0]), train_y[0])  # 1 1
print(swn_polarity(train_X[1]), train_y[1])  # 0 0
print(swn_polarity(train_X[2]), train_y[2])  # 0 1
print(swn_polarity(train_X[3]), train_y[3])  # 1 1
print(swn_polarity(train_X[4]), train_y[4])  # 1 1

Let’s compute the accuracy of the SentiWordNet method:


from sklearn.metrics import accuracy_score

pred_y = [swn_polarity(text) for text in test_X]
print(accuracy_score(test_y, pred_y))  # 0.6518

The SentiWordNet approach produced an accuracy of only 0.6518. In case this figure looks good, keep in mind that for binary classification, 0.5 is chance accuracy: if the test examples are equally distributed between the two classes, flipping a coin yields a 0.5 accuracy.

NLTK SentimentAnalyzer

NLTK has some neat built-in utilities for doing sentiment analysis. I wouldn’t call them “industry ready,” but they are definitely useful for didactic purposes. Let’s check out the SentimentAnalyzer.


from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats, mark_negation

# mark_negation appends "_NEG" to words after a negation, until a punctuation mark.
# This means that the same word after a negation will be handled differently
# than the word that's not after a negation by the classifier
print(mark_negation("I like the movie.".split()))        # ['I', 'like', 'the', 'movie.']
print(mark_negation("I don't like the movie.".split()))  # ['I', "don't", 'like_NEG', 'the_NEG', 'movie._NEG']

# The nltk classifier won't be able to handle the whole training set
TRAINING_COUNT = 5000
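The rest of this walkthrough didn’t survive in this copy. Here is a minimal sketch of how the pieces fit together, reusing clean_text from the SentiWordNet section; the min_freq value is an illustrative choice, not necessarily the original setting:

# Clean, transliterate to ASCII, tokenize and mark negations
def preprocess(text):
    return mark_negation(word_tokenize(unidecode(clean_text(text))))

analyzer = SentimentAnalyzer()

# Build the unigram vocabulary from the reduced training set
train_docs = [preprocess(text) for text in train_X[:TRAINING_COUNT]]
vocabulary = analyzer.all_words(train_docs)

# Keep only reasonably frequent unigrams as features
unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# Turn the token lists into feature dictionaries, paired with their labels
training_set = analyzer.apply_features(
    list(zip(train_docs, train_y[:TRAINING_COUNT])), labeled=True)
test_set = analyzer.apply_features(
    list(zip([preprocess(text) for text in test_X], test_y)), labeled=True)

# Train a Naive Bayes classifier and check its accuracy on the held-out reviews
analyzer.train(NaiveBayesClassifier.train, training_set)
print(analyzer.evaluate(test_set)['Accuracy'])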

NLTK VADER Sentiment Intensity Analyzer

VADER is a lexicon- and rule-based sentiment analysis tool created specifically for working with messy social media text. Let’s see how well it works on our movie reviews.


from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score

vader = SentimentIntensityAnalyzer()

def vader_polarity(text):
    """ Transform the output to a binary 0/1 result """
    score = vader.polarity_scores(text)
    return 1 if score['pos'] > score['neg'] else 0

print(vader_polarity(train_X[0]), train_y[0])  # 0 1
print(vader_polarity(train_X[1]), train_y[1])  # 0 0
print(vader_polarity(train_X[2]), train_y[2])  # 1 1
print(vader_polarity(train_X[3]), train_y[3])  # 0 1
print(vader_polarity(train_X[4]), train_y[4])  # 0 0

pred_y = [vader_polarity(text) for text in test_X]
print(accuracy_score(test_y, pred_y))  # 0.6892
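As an aside, polarity_scores also returns a normalized compound score. A common alternative, using the cutoff suggested by VADER’s authors, is to binarize on that instead:

def vader_polarity_compound(text):
    # compound >= 0.05 is the conventional cutoff for "positive"
    return 1 if vader.polarity_scores(text)['compound'] >= 0.05 else 0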

Pretty disappointing: 0.6892 accuracy. I know for a fact that VADER works well on other types of text; it’s just not a good fit for this problem. Keep the tool in mind for your projects. Now let’s tie things up and build a proper classifier with Scikit-Learn.

Building a binary classifier with scikit-learn

For our last experiment, we’re going to play with an SVM model from Scikit-Learn. We’ve already covered text classification in the Text Classification Recipe, so make sure you brush up on that task.

One important new addition is bigrams. Bigrams are pairs of consecutive words; in general, N-grams are tuples of N consecutive words. Here’s what I mean:
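The snippet that illustrated this didn’t survive in this copy; a quick stand-in using NLTK’s ngrams helper:

from nltk import ngrams

print(list(ngrams("the best phone I have seen".split(), 2)))
# [('the', 'best'), ('best', 'phone'), ('phone', 'I'), ('I', 'have'), ('have', 'seen')]

The classifier code itself is also missing here. Below is a minimal sketch of the kind of pipeline this section describes, a vectorizer over unigrams and bigrams feeding a linear SVM; the vectorizer choice and its parameters are assumptions, not the article’s verified configuration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

clf = Pipeline([
    # unigrams and bigrams; clean_text strips the HTML line breaks first
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC()),
])

clf.fit([clean_text(text) for text in train_X], train_y)
pred_y = clf.predict([clean_text(text) for text in test_X])
print(accuracy_score(test_y, pred_y))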

That’s it. This was an extensive introduction to sentiment analysis. Hopefully, you now have a good sense of what the task implies, the most important problems we face, and how to work around them.

