
The Road to Gamma

Gamma: The Learned Classifier

In machine learning and statistics, classification is the problem of identifying a semantic class for an observation, on the basis of a training set of data containing observations (or instances) whose class membership is known.

a learned classifier:

γ : d → c

a function γ that takes a document d and returns its class c.

Gamma is a classifier that, given a document, will give us the class that document belongs to (with an associated degree of probability to boot!). Our job is to build the function gamma, which takes a document and returns a class.

Application: Sentiment Analysis

Given a movie review, we want to apply the classifier (gamma). The classifier will give us a degree of probability that this movie review is either Positive or Negative. We could define more classes (Excellent, Good, Neutral, Weak, Terrible), but we'll keep this example simple.

If a reviewer states that the latest Bruce Willis movie was "quite simply one of the worst days of the 'Die Hard' series", the gamma classifier should return

x : "quite simply one of the worst days of the 'Die Hard' series" => Positive

y : "quite simply one of the worst days of the 'Die Hard' series" => Negative

Where x and y are % probabilities, and we might assume the probability the classifier would consider this review Negative to be far higher than the probability of this review being Positive.

Such results however, would depend entirely upon how we supervise (train) the classifier.

What data is the classifier learning on?

Who supervises that learning?

And how?

Classification (Per Token)

One approach to building a classifier would be to simply take all words that seem to suggest positive sentiment and put them in one list, and take all words that would seem to suggest negative sentiment and put them in another list. For a given review, see how many words occur in each list, and return a probability based on that.
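A minimal sketch of this word-list approach (the word lists and thresholds here are illustrative assumptions, not a real lexicon):

```python
# Two hand-picked word lists -- illustrative assumptions, not a real lexicon.
POSITIVE = {"great", "greatest", "best", "richly", "zany"}
NEGATIVE = {"worst", "pathetic", "disappointing"}

def classify(review):
    """Return class probabilities based purely on word-list hits."""
    tokens = review.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos + neg == 0:
        return {"Positive": 0.5, "Negative": 0.5}  # no evidence either way
    return {"Positive": pos / (pos + neg), "Negative": neg / (pos + neg)}
```

Even this crude version classifies our example review as Negative, since "worst" is the only listed word it contains. The shortcomings of treating each token in isolation become clear below.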

Senti-Wordnet considers that the token "worst" has a NegScore (Negative Score) of 0.75 and a PosScore (Positive Score) of 0.25. This indicates that, on average, one in four uses of the token "worst" is in a positive application (perhaps "I wanted to see this movie in the worst way!").

The presence of the word "worst" in the review:

"quite simply one of the worst days of the 'Die Hard' series"

Would contribute toward both negative and positive sentiment classification.

The application of both a PosScore and NegScore to a token may seem counter-intuitive at first, but consider this example:

unbelievably disappointing.

Full of zany characters and richly applied satire, and some great plot twists

This is the greatest screwball comedy ever filmed

It was pathetic. The worst part about it was the boxing scenes.

The first and last reviews are negative, and the middle reviews are positive. But language is ambiguous.

What about these examples?

This movie was bad ass!

That was such a bad dude!

... Historically, he was a terrible villain ...

Not so good

Not so great

About as exciting as watching grass grow

These might get annotated as:

This movie was bad ass! => Positive

That was such a bad dude! => Positive

... Historically, he was a terrible villain ... => Positive

Not so good => Negative

Not so bad => Positive

About as exciting as watching grass grow => Negative

Tokens that might seem "negative" can be used in a positive connotation, hence the use of both positive and negative scores in Senti-Wordnet.

We could choose to search Senti-Wordnet for each token in the review and add up the scores. We might conceivably choose to skip certain parts-of-speech, such as articles or determiners, and, if our application supports it, to skip named entities (such as movie titles).

Given this process, we could end up with:

| Token   | Positive Score | Negative Score |
|---------|----------------|----------------|
| quite   | 0.75           | 0.0            |
| simply  | 0.75           | 0.0            |
| one     | —              | —              |
| worst   | 0.25           | 0.75           |
| days    | 0.13           | 0.0            |
| series  | 0.0            | 0.0            |
| Total   | 1.38           | 1.00           |
| Average | 0.28           | 0.20           |

Which would seem to indicate that this was more likely to have been a positive review, which is an erroneous conclusion. Clearly, more context is needed.
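The scoring procedure can be sketched as follows. The per-token scores mirror the table above (exact totals depend on which tokens get skipped), and the positive average comes out higher, reproducing the erroneous conclusion:

```python
# Hypothetical per-token (PosScore, NegScore) values, mirroring the table.
SCORES = {
    "quite":  (0.75, 0.0),
    "simply": (0.75, 0.0),
    "worst":  (0.25, 0.75),
    "days":   (0.13, 0.0),
    "series": (0.0,  0.0),
}

def sentiment_averages(tokens):
    """Average the positive and negative scores of all known tokens."""
    scored = [SCORES[t] for t in tokens if t in SCORES]
    pos = sum(p for p, _ in scored) / len(scored)
    neg = sum(n for _, n in scored) / len(scored)
    return pos, neg

pos_avg, neg_avg = sentiment_averages(
    "quite simply one of the worst days of the series".split())
# pos_avg > neg_avg: the token-by-token view calls this negative review positive.
```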

Classification (Language Model)

Context can be provided through the use of a language model.

Basic Probability

Assume we have some finite vocabulary

V = { bruce, willis, does, what, he, best }

It is not uncommon for this vocabulary to be very large, but we'll assume a small set for this example. Given the vocabulary, there is an infinite variety of possible sentences that can be created from it (assuming no restrictions on the number of times a token can be used):

V1 = { bruce STOP }

V2 = { bruce willis STOP }

V3 = { bruce willis does STOP }

V4 = { bruce bruce bruce STOP }

V5 = { what he does he does best STOP }

V6 = { STOP }

A well-formed sentence has:

Zero-or-more words (where each word is drawn from the vocabulary V)

followed by a special symbol (STOP)
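This definition can be checked mechanically. A minimal sketch, using the small vocabulary above:

```python
# Well-formed: zero or more words drawn from V, followed by STOP.
V = {"bruce", "willis", "does", "what", "he", "best"}

def well_formed(sentence):
    tokens = sentence.split()
    return bool(tokens) and tokens[-1] == "STOP" and all(
        t in V for t in tokens[:-1])
```

Note that "STOP" alone is well-formed (zero words, then the stop symbol), matching example V6.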

A sentence doesn't have to "make sense" to be considered well-formed. Now let's assume we have a training sample in English. Maybe you collect all the sentences you see in the WSJ for the last 20 years. Or all the tweets in the last month.

This training sample can be quite large.

| Year                       | Training Set Size  |
|----------------------------|--------------------|
| mid-90s                    | 20 million words   |
| late-90s                   | 1 billion words    |
| Last few years (Web Scale) | 100+ billion words |

The Distribution P

Given a training sample, we want to learn a distribution (P) over sentences in a language.

P is going to be a function that satisfies two conditions. For any sentence x:

p(x) >= 0

and, if we sum over all the sentences in the language:

Σ p(x) = 1

This can be read as: the probability of any sentence is greater than or equal to 0, and the probabilities of all sentences in the language sum to a value of 1 (i.e. 100%). In other words, P gives the probability of any given sentence over the vocabulary.
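As a toy illustration, here is a distribution over a tiny set of sentences (the sentences and probabilities are made up) that satisfies both conditions:

```python
# A toy distribution p over three sentences -- illustrative numbers only.
p = {
    "bruce willis does STOP": 0.5,
    "what he does best STOP": 0.3,
    "STOP": 0.2,
}

# Condition 1: every sentence has probability >= 0.
non_negative = all(prob >= 0 for prob in p.values())
# Condition 2: summing over all sentences gives 1 (100%).
sums_to_one = abs(sum(p.values()) - 1.0) < 1e-9
```

A real language model assigns a (tiny, non-zero) probability to every possible sentence, not just three, but the two conditions are the same.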

Language Model Basics

A language model is a collection of tokens that occur next to each other (collocated) and their frequencies:

| Name     | # of Tokens | Sample                   | Sample Frequency |
|----------|-------------|--------------------------|------------------|
| Unigram  | 1 Token     | "quick"                  | 10^-5            |
| Bigram   | 2 Tokens    | "quick brown"            | 10^-6            |
| Trigram  | 3 Tokens    | "quick brown fox"        | 10^-8            |
| Quadgram | 4 Tokens    | "quick brown fox jumped" | 10^-8            |
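Collecting n-gram counts from tokenized text can be sketched in a few lines:

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the quick brown fox jumped".split()
trigram_counts = ngrams(tokens, 3)
# Three trigrams, each occurring once in this short sample.
```

The sample frequencies in the table come from dividing such counts by the total number of n-grams in the corpus.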

Let's assume I have 10,000 movie reviews. Perhaps the reviews come from a site like imdb.com where reviewers score their own reviews on a scale of 1-10 stars:

Suppose the reviewer scores the movie 5/10 (50%), and 42 out of 75 people agree with the reviewer. With enough information we could weight this as a meta-data score. If only 5% of the population agreed with this reviewer, we might choose to discount the negative sentiment.

A trigram language model of this review would look like this:

| Trigram           | Term Frequency | Document Frequency |
|-------------------|----------------|--------------------|
| Bruce Willis does | 1              | 1                  |
| Willis does what  | 1              | 1                  |
| does what he      | 1              | 1                  |
| what he does      | 1              | 1                  |
| he does best      | 1              | 1                  |
| does best but     | 1              | 1                  |
| best but this     | 1              | 1                  |
| but this is       | 1              | 1                  |
| this is quite     | 1              | 1                  |
| is quite simply   | 1              | 1                  |
| quite simply one  | 1              | 1                  |
| simply one of     | 1              | 1                  |
| one of the        | 1              | 1                  |
| of the worst      | 1              | 1                  |
| the worst days    | 1              | 1                  |
| worst days of     | 1              | 1                  |
| days of the       | 1              | 1                  |
| of the Die        | 1              | 1                  |
| the Die Hard      | 1              | 1                  |
| Die Hard Trilogy  | 1              | 1                  |

Naturally the list of trigrams would increase greatly with a larger corpus of reviews. The term frequency and document frequency counts will help build out a TF/IDF score. A partial (top 500) trigram model for the NY Times from January, 1987 with TF/IDF frequencies is here. Not surprisingly, New York City is the second most common trigram from that list.
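Building term and document frequencies from a small corpus might look like this (the second review is a made-up addition so that some counts exceed 1):

```python
from collections import Counter

def trigrams(text):
    tokens = text.split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

reviews = [
    "Bruce Willis does what he does best",
    "does what he does best in this one",  # hypothetical second review
]

# Term frequency: total occurrences across all reviews.
tf = Counter(t for review in reviews for t in trigrams(review))
# Document frequency: number of reviews containing the trigram.
df = Counter(t for review in reviews for t in set(trigrams(review)))
```

A trigram that appears twice in one review and never elsewhere would have a term frequency of 2 but a document frequency of 1, which is exactly the distinction TF/IDF weighting exploits.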

Given a trigram language model of movie reviews, we can apply the same technique that we used with Senti-Wordnet, but rather than dealing with individual tokens, we would be dealing with the trigram "is quite simply", or "quite simply one". In our example, the trigram is related to a negative sentiment, but this could change as the model changes.

Shortcomings

While the approach of creating a language model and associating n-grams to classes (positive or negative sentiment in this case) is a valid one, there is a shortcoming in the formula we use. If we come across a trigram, phrase, or token that is not already associated with a class in our language model, the probability of this association will be 0.
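One standard remedy for this zero-probability problem is add-one (Laplace) smoothing. A sketch, with made-up counts and vocabulary size:

```python
# Trigram counts observed in one class -- illustrative numbers.
counts = {("one", "of", "the"): 5, ("of", "the", "worst"): 3}
total = sum(counts.values())
vocab_size = 1000  # assumed number of distinct trigrams overall

def p_unsmoothed(trigram):
    return counts.get(trigram, 0) / total

def p_smoothed(trigram):
    # Add-one smoothing: pretend every trigram was seen once more than it was.
    return (counts.get(trigram, 0) + 1) / (total + vocab_size)

unseen = ("the", "Die", "Hard")
# Unsmoothed, an unseen trigram gets probability 0, which wipes out any
# product of probabilities it participates in; smoothed, it stays non-zero.
```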

In Senti-Wordnet, "quite" and "simply" both carry a degree of positive sentiment, with no negative sentiment. In a trigram model of movie reviews, we might have a better idea of how often those tokens actually appear in positive versus negative contexts.

Building the Language Model

So we build our language model based on reviews that have been analyzed as positive / negative / neutral. Then we can analyze each new document with a high degree of confidence. Note that memes on Twitter change frequently. Due to the nature of this dynamic corpus, language models built around Twitter will need frequent updating.

This is called "Supervised Machine Learning" (contrast "Naive Sentiment Analysis" with "Supervised Machine Learning"). We have a training set of documents that have been hand-labeled with their class.

Input:

a document d

a fixed set of classes C = { c1, c2, ..., cj }

a training set of m hand-labeled documents: (d1, c1), ..., (dm, cm)

Output:

a learned classifier: γ : d → c

The Road to Gamma

The goal of this supervised machine learning is to produce gamma.

For each class, compute the probability of each word occurring in that class:

P(w | c) = count(w, c) / count(c)

This could also be calculated with higher precision for n-gram language models, where n > 1. The learned classifier then returns the class with the highest probability given the document's words:

C_NB = argmax over c of P(c) × Π P(x_i | c)

The "NB" in C_NB stands for "Naive Bayes": the best class under the Naive Bayes assumption, i.e. the class that maximizes these prior and conditional probabilities.
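Putting the pieces together, here is a minimal Naive Bayes classifier over unigrams. The tiny hand-labeled training set is an illustrative assumption; log-probabilities avoid numeric underflow, and add-one smoothing avoids the zero-count problem discussed earlier:

```python
import math
from collections import Counter

# A tiny hand-labeled training set -- an illustrative assumption.
training = [
    ("the worst movie ever", "Negative"),
    ("simply pathetic and disappointing", "Negative"),
    ("the greatest comedy ever filmed", "Positive"),
    ("great plot twists and zany characters", "Positive"),
]

classes = {c for _, c in training}
doc_counts = Counter(c for _, c in training)
word_counts = {c: Counter() for c in classes}
for doc, c in training:
    word_counts[c].update(doc.split())
vocab = {w for wc in word_counts.values() for w in wc}

def classify(doc):
    """Return the class maximizing log P(c) + sum over words of log P(w|c)."""
    def score(c):
        log_prior = math.log(doc_counts[c] / len(training))
        total = sum(word_counts[c].values())
        return log_prior + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in doc.split())
    return max(classes, key=score)
```

With more (and better-labeled) training documents, the per-class word counts sharpen and the classifier's decisions become correspondingly more reliable.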

Addendum / Misc

Rules for hand-labeling:

Any text with overly profane or overtly racist remarks will be classified as negative

Business advertisements will be considered neutral

Mere statements of fact without obvious positive or negative sentiment will be annotated as unknown