Many sentiment applications rely on lexicons to supply features to
a model. This section reviews some publicly available resources and
their relationships, and it seeks to identify some best practices for
using sentiment lexicons effectively.

SentiWordNet (Baccianella, Esuli, and Sebastiani 2010) attaches positive and
negative real-valued sentiment scores to WordNet synsets (Fellbaum 1998).
(Note: the SentiWordNet site was hacked recently; take care when visiting it.)
It is freely distributed for noncommercial use, and licenses are available for
commercial applications. (See the website for details.)
Table tab:sentiwordnet summarizes its structure. (For extensive discussion of
WordNet synsets and related objects, see this introduction.)

Linguistic Inquiry and Word Count (LIWC) is a proprietary database consisting
of a large set of categorized regular expressions. It costs about $90. Its
classifications are highly correlated with those of the Harvard General
Inquirer. Table tab:liwc gives some of its sentiment-relevant categories with
example regular expressions.

All of the above lexicons provide basic polarity classifications. Their
underlying vocabularies differ, so it is difficult to compare them
comprehensively, but we can see how often they explicitly disagree with each
other, in the sense that they supply opposite polarity values for a given
word. Table tab:lexicon_disagreement reports the results of such comparisons.

(Where a lexicon had part-of-speech tags, I removed them and selected the most
sentiment-rich sense available for the resulting string. For SentiWordNet, I
counted a word as positive if its positive score was larger than its negative
score, negative if its negative score was larger than its positive score, and
neutral otherwise; this means that words with equal non-0 positive and
negative scores count as neutral.)
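
For concreteness, here is a minimal sketch of this collapsing rule using NLTK's SentiWordNet interface. Reading "most sentiment-rich sense" as the sense with the largest gap between positive and negative scores is my assumption for the sketch, not something the resource specifies:

```python
# Sketch of the SentiWordNet collapsing rule described above.
from nltk.corpus import sentiwordnet as swn

def polarity(word):
    senses = list(swn.senti_synsets(word))
    if not senses:
        return None
    # Assumed reading of "most sentiment-rich": largest |pos - neg| gap.
    s = max(senses, key=lambda sense: abs(sense.pos_score() - sense.neg_score()))
    if s.pos_score() > s.neg_score():
        return "positive"
    if s.neg_score() > s.pos_score():
        return "negative"
    return "neutral"  # includes equal non-0 positive and negative scores
```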

Table tab:lexicon_disagreement

Disagreement levels for the sentiment lexicons reviewed above. Each cell gives
the number of disagreements over the size of the shared vocabulary.

                 MPQA  Opinion Lexicon  Inquirer      SentiWordNet     LIWC
MPQA             –     33/5402 (0.6%)   49/2867 (2%)  1127/4214 (27%)  12/363 (3%)
Opinion Lexicon        –                32/2411 (1%)  1004/3994 (25%)  9/403 (2%)
Inquirer                                –             520/2306 (23%)   1/204 (0.5%)
SentiWordNet                                          –                174/694 (25%)
LIWC                                                                   –

I can imagine two equally reasonable reactions to the disagreements. The first
would be to resolve them in favor of some particular sense. The second would
be to combine the values derived from these resources, thereby allowing the
conflicts to persist, as a way of capturing the fact that the disagreements
arise from genuine sense ambiguities.

The above lexicons are useful for a wide range of tasks, but they are fixed
resources. This section is devoted to developing new resources. Doing so can
have three benefits, which we will see in various combinations:

Much larger lexicons can be developed inferentially.

We can capture different dimensions of sentiment that might be pressing for specific tasks.

We can develop lexicons that are sensitive to the norms of specific domains.

The algorithm begins with n small, hand-crafted seed-sets
and then follows WordNet relations from them, thereby expanding their
size. The expanded sets of iteration i are used as seed-sets
for iteration i+1, generally after pruning any pairwise
overlap between them.
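
Here is a minimal sketch of the propagation loop, using NLTK's WordNet interface. Taking synonymy as the SamePolarity relation and antonymy as the OtherPolarity relation is just one instantiation of the free parameters discussed below:

```python
# Seed-set propagation over WordNet relations (a sketch, not the demo's code).
from nltk.corpus import wordnet as wn

def same_polarity(word):
    # SamePolarity here: lemmas sharing a synset with the word (synonymy).
    return {lemma.name() for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

def other_polarity(word):
    # OtherPolarity here: antonyms of any of the word's lemmas.
    return {ant.name() for synset in wn.synsets(word)
            for lemma in synset.lemmas()
            for ant in lemma.antonyms()}

def expand(same_set, other_set):
    # One expansion step for a single polarity class.
    out = set()
    for w in same_set:
        out |= same_polarity(w)
    for w in other_set:
        out |= other_polarity(w)
    return out

def propagate(pos_seeds, neg_seeds, iterations=2, prune_overlap=True):
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(iterations):
        pos, neg = pos | expand(pos, neg), neg | expand(neg, pos)
        if prune_overlap:  # prune pairwise overlap between the sets
            overlap = pos & neg
            pos, neg = pos - overlap, neg - overlap
    return pos, neg

pos, neg = propagate({"good", "excellent"}, {"bad", "terrible"})
```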

The algorithm has a number of free parameters: the seed-sets, the
WordNet relations called in SamePolarity and OtherPolarity, the number
of iterations, the decision to remove overlap. The demo allows you to
try out different combinations of values:

Table tab:wnpropagate_exs provides some additional seed-sets, drawing on other
distinctions found in the Harvard Inquirer. These can be pasted into the demo
if one wants a sense of how well new lexical classes propagate.

Table tab:wnpropagate_exs

Propagation example seed-sets to try. (Category names follow the Harvard
Inquirer's spellings.)

Category  Seed set
Pleasur   amuse, calm, ecstasy, enjoy, joy
Pain      agony, disconcerted, fearful, regret, remorse
Strong    illustrious, rich, control, perseverance
Weak      lowly, poor, sorry, sluggish, weak
MALE      boy, brother, gentleman, male, guy
Female    girl, sister, bride, female, lady

To assess how well the algorithm preserves polarity, I began with the
seed-sets in table tab:seeds and then allowed the propagation algorithm to run
for 20 iterations, checking each iteration for its effectiveness at
reproducing the Positiv/Negativ/Neither distinctions in the subset of the
Harvard General Inquirer that is also in WordNet.

Table tab:seeds

Seed sets used to evaluate the WordNet
propagation algorithm against the Harvard General Inquirer.

WordNet score propagation example. The authors propose a
further rescaling of the scores: log(abs(s)) * sign(s) if abs(s) >
1, else 0. However, in the example, we would lose the sentiment
score for good if we stopped before
iteration 6. In my experiments, rescaling resulted in dramatically
fewer non-0 values.
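
The rescaling rule is a one-liner; this is a direct transcription, assuming s is a real-valued score from the propagation output:

```python
import math

def rescale(s):
    # log(abs(s)) * sign(s) if abs(s) > 1, else 0.
    return math.copysign(math.log(abs(s)), s) if abs(s) > 1 else 0.0
```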

I ran the algorithm using the full Harvard General Inquirer
Positiv/Negativ/Neither classes as seed-sets. The output is available in
archived CSV format:

In my informal assessment, the positive and negative scores it assigns tend to
be accurate. The disappointment is that so many of the scores are 0, as seen
in figure fig:wnscores_scoredist. I think this could be addressed by following
more relations than just the basic synset one, as we do for the simple WordNet
propagation algorithm, but I've not tried it yet.

The focus of this section is the relationship between the review
authors' language and the star ratings they choose to assign, from the
range 1-10 stars (with the exception of This is Spinal Tap,
which goes to 11). Intuitively, the idea is that the author's chosen
star rating affects, and is affected by, the text she produces. The
star rating is a particular kind of high-level summary of the
evaluative aspects of the review text, and thus we can use that
high-level summary to get a grip on what's happening
linguistically.

The data I'll be working with are all in the format described in
table tab:data. Each row represents a star-rating category. Thus, for example,
in these data, (bad, a) is used 122,232 times in 1-star reviews, and the total
token count for 1-star reviews is 25,395,214.

Table tab:data

The data format. Some of the files linked above do not have the Tag column,
and most of them are based on 5-star rather than 10-star scales.

Word  Tag  Category  Count   Total
bad   a    1         122232  25395214
bad   a    2         40491   11755132
bad   a    3         37787   13995838
bad   a    4         33070   14963866
bad   a    5         39205   20390515
bad   a    6         43101   27420036
bad   a    7         46696   40192077
bad   a    8         42228   48723444
bad   a    9         29588   40277743
bad   a    10        51778   73948447

The next few sections describe methods for deriving sentiment
lexicons from such data. The methods should generalize to other kinds
of ordered sentiment metadata (e.g., helpfulness ratings, confidence
ratings).

As we saw above, the raw Count values are likely to be misleading
due to the very large size imbalances among the categories. For
example, there are more tokens of (bad, a)
in 10-star reviews than in 2-star ones, which seems highly
counter-intuitive. Plotting the values reveals that the Count
distribution is very heavily influenced by the overall distribution of
words (figure fig:counts).

Figure fig:counts

Count distribution
for (bad, a) (left) and the overall
category size (right; repeated from
figure fig:totals). The
distribution is heavily influenced by the category sizes.

The source of this odd picture is clear: the 10-star category is 7
times bigger than the 1-star category, so the absolute counts do not
necessarily reflect the rate of usage.

One drawback to RelFreq values is that they are highly sensitive to
overall frequency. For example, (bad, a) is
significantly more frequent than
(horrible, a), which means that the RelFreq
values for the two words are hard to directly
compare. Figure fig:relfreq_cmp
nonetheless attempts a comparison.

Figure fig:relfreq_cmp

Comparing words via their RelFreq
distributions.

It is possible to discern that (bad, a)
is less extreme in its negativity than (horrible,
a). However, the effect looks subtle. The next measure we look
at abstracts away from overall frequency, which facilitates this kind
of direct comparison.

A drawback to RelFreq values, at least for present purposes, is
that they are extremely sensitive to the overall frequency of the word
in question. There is a comparable value that is insensitive to this
quantity:

Definition: Pr values

Pr = RelFreq / sum(RelFreq), where the sum runs over all rating categories for
the word in question

Pr values are just rescaled RelFreq values: we divide by a constant
to get from RelFreq to Pr. As a result, the distributions have
exactly the same shape, as we see
in figure fig:pr.
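
Both quantities are easy to compute from data in the format of table tab:data. Here is a sketch using pandas; the file name is a placeholder for any of the CSV files described above:

```python
import pandas as pd

df = pd.read_csv("imdb-counts.csv")  # columns: Word, Tag, Category, Count, Total

# RelFreq estimates P(word|rating): the word's usage rate in each category.
df["RelFreq"] = df["Count"] / df["Total"]

# Pr rescales RelFreq to sum to 1 over the categories for each (Word, Tag),
# giving the word's distribution over ratings.
df["Pr"] = df.groupby(["Word", "Tag"])["RelFreq"].transform(lambda x: x / x.sum())
```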

Figure fig:pr

Comparing Pr
values (left) with RelFreq values
(right; repeated from figure fig:relfreq). The shapes are exactly the same
(Pr is a rescaling of RelFreq).

A technical note: The move from RelFreq to Pr involves an
application of Bayes Rule.

RelFreq values can be thought of as estimates of the conditional
distribution P(word|rating): given
that I am in rating category rating, how likely am I
to produce word?

Bayes Rule allows us to obtain the inverse
distribution P(rating|word):

P(rating|word) = P(word|rating)P(rating) / P(word)

However, we would not want to directly apply this rule, because
of the term P(rating) in the
numerator. That would naturally be approximated by the
distribution given by Total, as in
figure fig:totals,
which would simply re-introduce all of those unwanted biases.

Thus, we keep P(rating) constant, which is just to say that we leave it out:

P(rating|word) ∝ P(word|rating)

Renormalizing the right-hand side so that it sums to 1 across rating
categories yields exactly the Pr values defined above.

Comparing the Pr distributions
of (bad, a)
and (horrible, a). The comparison is
easier than it was with RelFreq values
(figure fig:relfreq).

I think these plots clearly convey that (bad,
a) is less intensely negative
than (horrible, a). For example,
whereas (bad, a) is at least used
throughout the scale, even at the top, (horrible,
a) is effectively never used at the top of the scale.

Expected ratings are easy to calculate and quite intuitive, but it is hard to
know how confident we can be in them, because they are insensitive to the
amount and kind of data that went into them. Suppose the ERs for
words v and w are both 10, but we have 500 tokens of v and just 10 tokens
of w. This suggests that we can have a high degree of confidence in our ER
for v, but not for w. However, ER values don't encode this uncertainty, nor is
there an obvious way to capture it.
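
Assuming the usual definition of the expected rating, the Pr-weighted average of the rating categories, the calculation is one line on top of the df built above:

```python
# Expected rating: sum of Category * Pr for each (Word, Tag) pair.
er = (df["Category"] * df["Pr"]).groupby([df["Word"], df["Tag"]]).sum()
print(er.loc[("bad", "a")])  # should fall toward the low end of the 1-10 scale
```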

Logistic regression provides a useful way to do the work of ERs but
with the added benefits of having a model and associated test
statistics and measures of confidence. For our purposes, we can stick
to a simple model that uses Category values to predict word usage. The
intuition here is just the one that we have been working with so far:
the star-ratings are correlated with the usage of some words. For a
word like (bad, a), the correlation is
negative: usage drops as the ratings get higher. For a word
like (amazing, a), the correlation is
positive.

With our logistic regression models, we will essentially fit lines
through our RelFreq data points, just as
one would with a linear regression involving one predictor. However,
the logistic regression model fits these values in log-odds space and
uses the inverse logit function (plogis in R) to ensure that all the
predicted values lie in [0,1], i.e., that they are all true
probability values. Unfortunately, there is not enough time to go into
much more detail about the nature of this kind of modeling. I refer
to Gelman and Hill 2008, §5-6
for an accessible, empirically-driven overview. Instead, let's simply
fit a model and try to build up intuitions about what it does and
says.

The simple logistic regression model for bad is given in table tab:bad_fit.
The model simply uses the rating values to predict the usage (log-odds) of the
word in each category.
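
Here is a sketch of fitting such a model with statsmodels; treating each category's Count as successes out of Total trials in a binomial GLM is my rendering of the model described above:

```python
import statsmodels.api as sm

# Per-category successes/failures for (bad, a), from the df built earlier.
bad = df[(df["Word"] == "bad") & (df["Tag"] == "a")]
X = sm.add_constant(bad["Category"])         # intercept + rating predictor
y = bad[["Count"]].copy()
y["Failures"] = bad["Total"] - bad["Count"]  # tokens that are not the word

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params["Category"])   # slope in log-odds space: the sentiment score
print(fit.pvalues["Category"])  # used below to filter unreliable scores
```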

Here, we can use the coefficient for Category as our sentiment
score. Where the value is negative (negative slope), the word is
negative. Where it is positive, the word is positive. Informally, we
can also use the size of the coefficient as a measure of its
intensity.

The great strength of this approach is that we can use the p-values
to determine whether a score is
trustworthy. Figure fig:cmp
helps to convey why this is an important new power. (Here and in later
plots, I've rescaled the values into Pr space to facilitate comparisons.)

Figure fig:cmp

Comparing words using our assessment values.

This leads to the following method for inducing a sentiment
lexicon from these data:

Definition: Sentiment lexicon via logistic regression

Let Coef(w) be the Category coefficient for w if that coefficient is
significant at the chosen level, else 0

If Coef(w) = 0, then w is objective/neutral

If Coef(w) > 0, then w is positive

If Coef(w) < 0, then w is negative

A word's intensity is abs(Coef(w))
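
Put together, the induction method is a loop over the vocabulary. This sketch reuses the model-fitting code above, with a conventional 0.05 significance level:

```python
def coef(group, alpha=0.05):
    # Fit the per-word binomial GLM and return the Category coefficient,
    # zeroed out when it fails the significance test.
    X = sm.add_constant(group["Category"])
    y = group[["Count"]].copy()
    y["Failures"] = group["Total"] - group["Count"]
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    b = fit.params["Category"]
    return b if fit.pvalues["Category"] < alpha else 0.0

# Positive coefficient: positive word; negative: negative; 0: objective/neutral.
lexicon = {key: coef(group) for key, group in df.groupby(["Word", "Tag"])}
```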

Depending on where the significance value is set, this can learn
conservative lexicons of a few thousand words or very liberal lexicons
of tens of thousands.

This method of comparing coefficient values is likely to irk
statisticians, but it works well in practice. For a more exact and
careful method, as well as a proposal for how to compare words with
non-linear relationships to the ratings,
see this
talk I gave recently on creating lexical scales.

The Experience Project is a social networking website that allows users to
share stories about their own personal experiences. At the confessions portion
of the site, users write typically very emotional stories about themselves,
and readers can then choose from among five reaction categories for the story
by clicking on one of the five icons in figure fig:ep_cats. The categories
provide rich new dimensions of sentiment, ones that are generally orthogonal
to the positive/negative one that most people study but that nonetheless model
important aspects of sentiment expression and social interaction (Potts 2010b;
Socher, Pennington, Huang, Ng and Manning 2011).

Figure fig:ep_cats

Experience Project categories. "You rock" is a positive exclamative
category. "Teehee" is a playful, lighthearted category. "I understand" is an
expression of solidarity. "Sorry, hugs" is a sympathetic category. And
"Wow, just wow" is a negative exclamative category, the least used on the
site.

This section presents a simple method for using these data to
develop sentiment lexicons.

As with the IMDB data above, I've put the word-level information into an
easy-to-use CSV format, as in table tab:ep_data. Thus, as long as you require
only word-level statistics, you needn't scrape the site yourself.

There are a number of good fixed lexicons for sentiment. They display
negligible to high levels of disagreement with each other. These disagreements
can be handled strategically: resolve the conflicts somehow, or allow them to
persist as genuine points of uncertainty.

WordNet can be used to derive interesting lexicons from small seed-sets, even
for distinctions that are not directly encoded in WordNet's structure.

Naturally occurring metadata are a rich source of lexical
entries. Statistical models are valuable for such lexicon
induction.

A major advantage of inducing a lexicon directly from data is that one can
then capture domain-specific effects, which are very common in sentiment. (See
also the discussion of vector-space models for lexicon induction methods that
don't require any metadata.)