Vector-space models for sentiment

In our previous classes, we relied on naturalistic annotations — star ratings at review sites and reader reactions at the Experience Project — to develop sentiment-rich lexicons. In NLP parlance, this was supervised lexicon building, in the sense that we made inferences only where we could be guided by the annotations.

There is an abundance of such data on the Web these days, at sites covering a variety of topics and social situations, so we could use those same methods to develop lots of specialized lexicons.

However, inevitably, there will be situations in which we simply don't have annotated data of the right type. Some examples:

You have star ratings for albums, but they were written by the general population, whereas you care about the particular ways in which long-suffering but devoted fans express themselves. Thus, your star ratings might mislead you, and they are unlikely to capture the emotive dimensions you care about.

You are interested in how people express nervous anticipation, as distinct from anxiety and excitement, but no site even comes close to asking readers to provide such labels or make such distinctions.

Sentiment attaches to specific people in different ways; geeky, wonky, flamboyant, stern, and unpredictable might be positive or negative depending on whether they are applied to a movie star, a politician, or an academic advisor. Thus, even where there are star ratings, one might want to relativize them to particular topics.

The goal of this lecture is to provide you with some simple but powerful methods for capturing some of these phenomena where you have no sentiment labels. The approaches fall under the heading of unsupervised vector space models. Such approaches tend to be less accurate than the supervised ones we covered before, but they are considerably more flexible.

1. Define a notion of co-occurrence context. This could be an entire document, a paragraph, a sentence, a clause, an NP: whatever domain seems likely to capture the associations you care about.

2. Scan through your corpus building a dictionary d mapping word pairs to counts. Every time a pair of words w and w' occurs in the same context (as you defined it in step 1), increment d[(w, w')] by 1.

3. Using the count dictionary d that you collected in step 2, establish your full vocabulary V, an ordered list of word types. For large collections of documents, |V| will typically be huge. You will probably want to winnow the vocabulary at this point, to get it down to fewer than 5000 words. You might do this by filtering to a specific subset, or just imposing a minimum count threshold. You might impose a minimum count threshold even if |V| is small; for words with very low counts, you simply don't have enough evidence to say anything interesting.

4. Now build a matrix M of dimension |V| × |V|. Both the rows and the columns of M represent words. Each cell M[i, j] is filled with the count d[(w_i, w_j)].
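The steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code used to build the course matrices: the toy corpus, the choice of a whole toy document as the context, the convention of counting each pair of word types at most once per context, and the names contexts, vocab, M, min_count, and M_small are all hypothetical.

```r
# Assumption: a context is one whole toy "document", already tokenized.
contexts <- list(
  c("good", "movie", "good", "plot"),
  c("bad", "movie"),
  c("good", "plot", "bad", "ending")
)

# The vocabulary V: an ordered list of word types.
vocab <- sort(unique(unlist(contexts)))

# A |V| x |V| count matrix with words on both the rows and the columns.
M <- matrix(0, nrow = length(vocab), ncol = length(vocab),
            dimnames = list(vocab, vocab))
for (ctx in contexts) {
  types <- unique(ctx)
  # Assumption: each pair of co-occurring types is counted once per context.
  M[types, types] <- M[types, types] + 1
}

# Winnow the vocabulary with a minimum count threshold. Under the counting
# convention above, the diagonal holds the number of contexts per word.
min_count <- 2
keep <- diag(M) >= min_count
M_small <- M[keep, keep, drop = FALSE]
M_small
```

Other counting conventions (for example, counting every pair of tokens rather than every pair of types) fit the same recipe and change only the body of the loop.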

That's the basic recipe. For different design matrices, the procedure differs slightly.

For example, if you are building a word × document matrix, then the rows of M represent words and the columns of M represent documents. The scan in step 2 then just keeps track of (word, document) pairs, compiling the number of times that each word appears in each document. Such matrices are often used in information retrieval, because the columns are multi-set representations of documents. They are much sparser than the word × word matrices we will work with here. (In my experience, they yield lower quality lexicons, but others have reported good results with them.)

We'll return to matrix design types near the end of this lecture. Before doing that, let's get a feel for what the word × word matrices can tell us.

The code distribution for this course includes a few different word × word matrices, and code for working with them. Here's a look at a slice of the IMDB word × word matrix, which is a (5000 x 5000)-sized matrix derived from this large public release of IMDB reviews. The notion of context is co-occurrence in the same review. The vocabulary was chosen by sorting the words by overall frequency and keeping the top 5000 items.

## Load the code for this unit:

source('vsm.R')

## The matrix is large and so might take a while to load:

imdb = Csv2Matrix('imdb-wordword.csv')

## Now we look at a slice; I didn't use the top-left corner because the first group of words

## involves punctuation that R's display distorts:

imdb[100:110, 100:110]

against age agent ages ago agree ahead ain.t air aka al

against 2003 90 39 20 88 57 33 15 58 22 24

age 90 1492 14 39 71 38 12 4 18 4 39

agent 39 14 507 2 21 5 10 3 9 8 25

ages 20 39 2 290 32 5 4 3 6 1 6

ago 88 71 21 32 1164 37 25 11 34 11 38

agree 57 38 5 5 37 627 12 2 16 19 14

ahead 33 12 10 4 25 12 429 4 12 10 7

ain't 15 4 3 3 11 2 4 166 0 3 3

air 58 18 9 6 34 16 12 0 746 5 11

aka 22 4 8 1 11 19 10 3 5 261 9

al 24 39 25 6 38 14 7 3 11 9 861

The diagonals in this matrix give the total token count of the word, and the values in imdb[i, j] and imdb[j, i] are always identical. Our guiding idea will be that the rows in these matrices (equivalently, the columns, given our design) capture important aspects of the meanings of the words in our lexicon. Additionally, we will assume that similarity between rows correlates with similarity of meaning.

It's worth quickly noting that you might consider an initial smoothing step at this stage. In the simplest case, this just means adding a small uniform value to each cell. In R, you can do this to a matrix m with m = m + x, where x is the smoothing factor. Since it is easy to smooth the matrix after building it, I recommend not smoothing at the initial count stage, so that you have flexibility later.

If we stick with vectors of two elements, treating the first as the x coordinate and the second as the y coordinate, then we can plot our vectors on a two-dimensional plane. Euclidean distance then measures the length of the shortest line we can draw between the points.

The vsm.R code that you already loaded (with source('vsm.R')) contains a function EuclideanDistance that implements this for vectors. Here is an illustration:

The numerical values of course jibe well with the raw visual distance in the plot. However, suppose we think of the vectors as word meanings in the vector-space sense. In that case, the values don't look good: the distributions of b and c are more or less directly opposed, suggesting very different meanings, whereas a and b are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.

In terms of the large vector space models we will soon explore, a and b resemble a pair like superb and good, which have similar meanings but very different frequencies. In contrast, b and c are like good and bad — similar overall frequencies but different distributions with respect to the overall vocabulary.
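To make the comparison concrete, here is a minimal sketch of Euclidean distance in base R. The function name euclidean_distance is a hypothetical stand-in for the EuclideanDistance function in vsm.R. The three vectors are an assumption: they are chosen to be consistent with the cosine distances reported later in this section, with a pointing in nearly the same direction as b but much shorter, and c about as long as b but pointing elsewhere.

```r
# Euclidean distance: the length of the straight line between two points.
euclidean_distance <- function(x, y) {
  sqrt(sum((x - y)^2))
}

# Assumed two-dimensional example vectors (not taken from vsm.R).
a <- c(2, 4)
b <- c(10, 15)
c <- c(14, 10)

d_ab <- euclidean_distance(a, b)  # sqrt(185), about 13.60
d_ac <- euclidean_distance(a, c)  # sqrt(180), about 13.42
d_bc <- euclidean_distance(b, c)  # sqrt(41), about 6.40
```

On these numbers b and c come out closest, even though a and b point in nearly the same direction, which is the frequency effect described above.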

These affinities are immediately apparent if we normalize the vectors by their length:

Vector length

Given a vector \(x\) of dimension \(n\), the length of \(x\) is \[ VectorLength(x) =\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\]

Length (L2) normalization

Given a vector \(x\) of dimension \(n\), the normalization of
\(x\) is a vector \(\hat{x}\) also of dimension \(n\) obtained by
dividing each element of \(x\) by \(VectorLength(x)\).

The following plot shows the effect that cosine similarity has on our running example. The distances can be generated with CosineDistance:

CosineDistance(a, b)

[1] 0.007722123

CosineDistance(a, c)

[1] 0.1162121

CosineDistance(b, c)

[1] 0.06500247

Cosine similarity is typically better for capturing semantic similarities, because these are usually not dependent upon overall frequency, but rather only on how the words are distributed with respect to other words.
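The definitions of vector length and length normalization translate directly into base R. The sketch below is illustrative only: the names vector_length, length_norm, and cosine_distance are hypothetical stand-ins for the vsm.R functions, cosine distance is assumed to be one minus cosine similarity, and the vectors a, b, and c are assumed values chosen because they reproduce the three distances reported above.

```r
# Length of a vector: square root of the sum of squared elements.
vector_length <- function(x) sqrt(sum(x^2))

# L2 normalization: divide each element by the vector's length.
length_norm <- function(x) x / vector_length(x)

# Assumption: cosine distance is 1 minus cosine similarity.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (vector_length(x) * vector_length(y))
}

# Assumed example vectors (not taken from vsm.R).
a <- c(2, 4)
b <- c(10, 15)
c <- c(14, 10)

cos_ab <- cosine_distance(a, b)  # about 0.0077
cos_ac <- cosine_distance(a, c)  # about 0.1162
cos_bc <- cosine_distance(b, c)  # about 0.0650

# Equivalent route: normalize first, then take one minus the dot product.
cos_ab_via_norm <- 1 - sum(length_norm(a) * length_norm(b))
```

Because both vectors are scaled to length 1 before comparison, multiplying a by any positive constant leaves its cosine distances unchanged.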

The vsm.R code also makes available KL divergence (KLDivergence) and skew (Skew), two other metrics that also correct for overall frequency. Additional measures can easily be added to the code. Some examples: Jaccard distance, Dice distance, Manhattan distance, matching coefficient, and overlap. As long as your implementation takes in an R numeric vector and returns a single numeric value, it should be compatible with the functions described below.
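For readers who want to see what such a metric looks like, here is a minimal sketch of KL divergence and of the skew variant defined in the exercises below, \(Skew_{\alpha}(p, q) = D(p||\alpha q + (1 - \alpha)p)\). The names kl_divergence and skew_sketch are hypothetical, and details such as normalizing raw counts inside the function and skipping zero-probability terms are assumptions rather than a description of the vsm.R implementations.

```r
# KL divergence D(p || q) for two count or probability vectors.
kl_divergence <- function(p, q) {
  p <- p / sum(p)   # normalize to probability distributions
  q <- q / sum(q)
  nz <- p > 0       # terms with p_i = 0 contribute nothing to the sum
  sum(p[nz] * log(p[nz] / q[nz]))
}

# Skew: KL divergence from p to a mixture of q and p, weighted by alpha.
skew_sketch <- function(p, q, alpha) {
  p <- p / sum(p)
  q <- q / sum(q)
  kl_divergence(p, alpha * q + (1 - alpha) * p)
}

p <- c(1, 2, 7)
q <- c(7, 2, 1)
kl_pq <- kl_divergence(p, q)
skew_half <- skew_sketch(p, q, alpha = 0.5)
```

Both functions take numeric vectors and return a single number, so they meet the compatibility requirement stated above.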

Exercise vsm:ex:dist
Use KLDivergence on the vectors a, b, and c defined above. How do the results compare with EuclideanDistance and CosineDistance?

Exercise vsm:ex:skew
The Skew function uses KL-divergence but with a prior value between 0 and 1 that pushes the distribution being compared to be more like the reference distribution:

\[
Skew_{\alpha}(p, q) = D(p||\alpha q + (1 - \alpha)p)
\]

Beginning with the count vectors p = c(1,2,7) and q = c(7,2,1), run Skew(p, q, alpha=x) for values of x, including at least 0 and 1. Describe what happens as alpha goes from 0 to 1.

Exercise vsm:ex:others
Implement Jaccard and Dice distance metrics. What is noteworthy about the values and rankings that these two deliver?

The function Neighbors in vsm.R is a highly flexible interface for seeing which associations a given distance metric finds in a matrix.

By default, if you give Neighbors just a matrix as its first argument and a word as its second argument, it will assume that the word labels a row and build a data.frame that gives the distance from that word to each word in the vocabulary (each row name):

df = Neighbors(imdb, 'happy')

head(df)

Neighbor Distance

happy happy 0.000000000

when when 0.002304525

now now 0.002390762

again again 0.002392370

go go 0.002453869

find find 0.002494124

To find the things that are farthest from happy, use tail(df).

Here is an equivalent call with the default arguments specified:

df = Neighbors(imdb, 'happy', byrow=TRUE, distfunc=CosineDistance)

head(df)

Neighbor Distance

happy happy 0.000000000

when when 0.002304525

now now 0.002390762

again again 0.002392370

go go 0.002453869

find find 0.002494124

If byrow=FALSE, then Neighbors will assume you've given a column name and will thus give column neighbors. For our word × word matrix, this is equivalent (though, due to floating point imprecision, the values are sometimes slightly shuffled).

You can give different distance functions as the value of distfunc, either those already provided by vsm.R or ones you defined on your own.
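To show how such an interface can work, here is a minimal sketch of a neighbors function in base R. It is not the vsm.R implementation: the name neighbors_sketch, the two helper distance functions, and the toy matrix are hypothetical, with only the byrow and distfunc argument names borrowed from the call shown above.

```r
# Hypothetical stand-ins for the distance functions in vsm.R.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))

# Distances from one labeled row (or column) to every other, sorted.
neighbors_sketch <- function(m, word, byrow = TRUE,
                             distfunc = cosine_distance) {
  if (!byrow) m <- t(m)  # column neighbors: transpose and proceed as usual
  target <- m[word, ]
  dists <- apply(m, 1, function(row) distfunc(target, row))
  df <- data.frame(Neighbor = rownames(m), Distance = dists,
                   row.names = rownames(m), stringsAsFactors = FALSE)
  df[order(df$Distance), ]
}

# Toy matrix (assumed values): similar direction vs. similar magnitude.
toy <- rbind(superb = c(2, 4), good = c(10, 15), bad = c(14, 10))
colnames(toy) <- c("x", "y")

nb_cos <- neighbors_sketch(toy, "good")
nb_euc <- neighbors_sketch(toy, "good", distfunc = euclidean_distance)
```

With the cosine default, superb is the nearest neighbor of good; with the Euclidean function, bad is. Any function that maps two numeric vectors to a single number can be passed in the same way.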

Exercise vsm:ex:neighbors
Pick a few words from a single domain and see what their neighbors are like using Neighbors, comparing CosineDistance with EuclideanDistance.

Exercise vsm:ex:freq1
We saw that Euclidean distance favors raw frequencies. Find words in the matrix imdb that help make this point: a pair that are semantically unrelated but close according to EuclideanDistance, and a pair that are semantically related but far apart according to EuclideanDistance.

Exercise vsm:ex:freq2
To what extent does using CosineDistance address the problem you uncovered in exercise vsm:ex:freq1?

We've seen that, to capture semantic similarities, we need to normalize the vectors, either using LengthNorm or by turning them into probability distributions (in KL divergence and its variants). This step needs to be taken, but it is risky to take alone, because it erases the amount of evidence we have about each word. For example, intuitively, we are in different evidential situations with the following vectors:

a = c(1000, 2000, 3000)

b = c(1, 2, 3)

However, if we turn them into probability distributions, then the distinction is completely erased:

a/sum(a)

[1] 0.1666667 0.3333333 0.5000000

b/sum(b)

[1] 0.1666667 0.3333333 0.5000000

The same thing happens with length normalization:

LengthNorm(a)

[1] 0.2672612 0.5345225 0.8017837

LengthNorm(b)

[1] 0.2672612 0.5345225 0.8017837

This section seeks to address this problem by re-weighting the counts in the matrix to better capture the underlying relationships, amplifying the things for which we have evidence and reducing the things that we have little evidence for.

Before moving on, it's worth noting that remapping each row/column by turning it into a vector of probabilities and length normalizing are both kinds of re-weighting. They suffer from the above drawback, which is why we need to do more, but the more complex weighting schemes all capture the normalization insight that these remappings embody.

The method I focus on is a variant of pointwise mutual information (PMI). The basic PMI calculation takes a count matrix M and reweights each cell as follows:

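Turn M into a matrix of joint probabilities by dividing it by the sum of all its cells: F = M/sum(M)

Replace each cell with the log of the joint probability divided by the product of the corresponding row and column probabilities, treating log(0) as 0: \[ PMI(F)_{ij} = \log\frac{F_{ij}}{rowprob(F, i) \cdot colprob(F, j)} \]

The value sum(F[i, ]) is the row probability rowprob(F, i), and the value sum(F[ , j]) is the column probability colprob(F, j). This is implemented in vsm.R as PMI: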

m = matrix(

c(10,10,10,10,

5, 5, 5, 5,

10,10, 0, 0,

0, 0, 0, 1), nrow=4, byrow=TRUE)

p = PMI(m)

p

[,1] [,2] [,3] [,4]

[1,] -0.2107210 -0.2107210 0.3001046 0.2355661

[2,] -0.2107210 -0.2107210 0.3001046 0.2355661

[3,] 0.4824261 0.4824261 0.0000000 0.0000000

[4,] 0.0000000 0.0000000 0.0000000 1.6218604

What have we done? Well, the first two rows are identical, a kind of normalization as we saw above. The first two values in row 3 have been greatly amplified in virtue of the fact that the rows have relatively low probability. This is arguably good; whereas the first two rows involve things that look like function words (they occur almost everywhere), row 3 contains something that might be genuinely informative. The biggest worry I see is that p[4,4] is positively enormous. Thinking row-wise, it is basically an isolate, and column-wise it is also surprising. PMI treats it as though it were extremely valuable and special. However, in truth, it is probably some kind of weird and untrustworthy oddity in our data.

A common initial smoothing step in vector-space models is to reduce all negative values to 0: Positive PMI (PPMI).

pp = PMI(m, positive=TRUE)

pp

[,1] [,2] [,3] [,4]

[1,] 0.0000000 0.0000000 0.3001046 0.2355661

[2,] 0.0000000 0.0000000 0.3001046 0.2355661

[3,] 0.4824261 0.4824261 0.0000000 0.0000000

[4,] 0.0000000 0.0000000 0.0000000 1.6218604
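The PMI and PPMI values shown above can be reproduced with a short base R function. This is a sketch rather than the vsm.R source: the name pmi_sketch is hypothetical, and mapping the log of a zero cell to 0 is an assumption that is consistent with the zeros in the printed output.

```r
pmi_sketch <- function(m, positive = FALSE) {
  f <- m / sum(m)                             # joint probabilities
  expected <- outer(rowSums(f), colSums(f))   # row prob. times column prob.
  p <- log(f / expected)
  p[!is.finite(p)] <- 0                       # assumption: log(0) becomes 0
  if (positive) p[p < 0] <- 0                 # PPMI: clip negatives to 0
  p
}

# The toy matrix from the text.
m <- matrix(c(10, 10, 10, 10,
               5,  5,  5,  5,
              10, 10,  0,  0,
               0,  0,  0,  1), nrow = 4, byrow = TRUE)

p <- pmi_sketch(m)
pp <- pmi_sketch(m, positive = TRUE)
```

For instance, the top-left cell works out to log((10/81) / ((40/81) × (25/81))) = log(0.81), which is the -0.2107210 shown in the first table.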

The more important step is what I'll call contextual discounting. It pushes back against the tendency of PMI to favor things with extremely low counts. It does this by penalizing items that find themselves in rows or columns that are sparse.
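One published formulation that matches this description is the discount discussed in Turney and Pantel's 2010 survey of vector space models, which multiplies each PMI value by count/(count + 1) and by min(rowsum, colsum)/(min(rowsum, colsum) + 1). The sketch below implements that formulation; whether vsm.R uses exactly this discount is an assumption, and the names ppmi_sketch and discount_sketch are hypothetical.

```r
# Positive PMI, with the log of zero cells mapped to 0 (an assumption).
ppmi_sketch <- function(m) {
  f <- m / sum(m)
  p <- log(f / outer(rowSums(f), colSums(f)))
  p[!is.finite(p) | p < 0] <- 0
  p
}

# Assumed discount: shrink cells with small counts and cells that sit
# in sparse rows or columns.
discount_sketch <- function(m, weighted) {
  cell <- m / (m + 1)                             # small counts shrink most
  mins <- outer(rowSums(m), colSums(m), pmin)     # sparser of row and column
  context <- mins / (mins + 1)
  weighted * cell * context
}

m <- matrix(c(10, 10, 10, 10,
               5,  5,  5,  5,
              10, 10,  0,  0,
               0,  0,  0,  1), nrow = 4, byrow = TRUE)

pp <- ppmi_sketch(m)
ppcd <- discount_sketch(m, pp)
```

On the toy matrix, the isolated cell in row 4, column 4 is multiplied by 1/2 twice, which brings it from about 1.62 down to about 0.41, while the well-attested cells in row 3 lose only a small fraction of their value.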

TF-IDF is implemented in vsm.R as TFIDF. Run TFIDF on the toy matrix m that we defined above, and describe the results. How do they compare with those of PMI and its variants?

Exercise vsm:ex:tfidf2
Reweight the imdb matrix using TFIDF, and then explore the results using the Neighbors function. To my eye, the results are not as good as those of PPMI with contextual discounting. What features of the matrix are likely culprits? Are there matrix types where you would expect TF-IDF to do better?

You can begin to get a feel for what your matrix is like by poking around with the Neighbors function to see who is close to or far from whom. But this kind of sampling is unlikely to lead to new insights, unless you luck out and start to see an interesting cluster of associations developing.

The t-SNE visualization technique is a large-scale, intuitive way of identifying associations, to guide later and more precise investigations. It takes the matrix, does some dimensionality reduction on it, and then finds a clever way to display the resulting distances in two dimensions.

The vsm.R function for visualizing a matrix this way is Matrix2Tsne. The underlying R implementation is slow, so it's a good idea to check the code with a toy example before waiting a long time only to encounter a bug:

library(tsne)

tsneViz(m)

I've found that the best results come from first reweighting the matrix using Positive PMI with contextual discounting and then visualizing:

tsneViz(imdb.ppcd)

Figure vsm:tsne-imdb-rows

t-SNE visualization of the IMDB matrix with PPMI and contextual discounting.

The R implementation is slow — the above takes more than an hour on my laptop, but the results are worth it! In interpreting the results, you should regard tight clusters of words as interestingly related. Conversely, large evenly spaced groups of words are those for which t-SNE could not find associations. At the macro-level, distance between clusters seems not to be meaningful, and overall position in the diagram (top, left, etc.) is arbitrary — chosen by t-SNE to deliver a perspicuous representation, not to capture anything about the data directly.

Here's an example involving adjective-adverb pairs derived from the NYT section of the English Gigaword. The matrix design is adjective × adverb. The pairs come from advmod dependencies as defined by the Stanford Dependency Parser. The first plot clusters adjectives, the second adverbs.

Figure vsm:tsne-imdb-cols

t-SNE visualization of the rows in an adjective × adverb matrix derived from dependency parses of the NYT section of Gigaword.

Figure vsm:tsne-advadj

t-SNE visualization of the columns in an adjective × adverb matrix derived from dependency parses of the NYT section of Gigaword.

The t-SNE visualization of our running example suggests some lexical clusters that in turn suggest the beginnings of a lexicon. The semantic orientation method of Turney and Littman is a general method for building such lexicons for any desired semantic dimension.

The SemanticOrientation function allows you to leave off the word argument. If you do this, it churns through the whole vocabulary, scoring it against the supplied seed sets. The result is an ordering of the vocabulary, with the top (lowest scoring) elements being nearest to seeds1 and the bottom (highest scoring) elements associating with seeds2.
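A scoring scheme with this behavior can be sketched as follows: each word's score is its summed distance to the words in seeds1 minus its summed distance to the words in seeds2, so low scores mean proximity to seeds1. This is an assumed formulation in the spirit of Turney and Littman's method, not the vsm.R source; the function name semantic_orientation_sketch, the helper cosine_distance, and the toy matrix are hypothetical.

```r
# Hypothetical stand-in for the cosine distance function in vsm.R.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Score every row label against two seed sets and sort the scores.
semantic_orientation_sketch <- function(m, seeds1, seeds2,
                                        distfunc = cosine_distance) {
  score_word <- function(w) {
    d1 <- sum(sapply(seeds1, function(s) distfunc(m[w, ], m[s, ])))
    d2 <- sum(sapply(seeds2, function(s) distfunc(m[w, ], m[s, ])))
    d1 - d2  # negative: closer to seeds1; positive: closer to seeds2
  }
  scores <- sapply(rownames(m), score_word)
  sort(scores)
}

# Toy matrix with assumed values.
toy <- rbind(superb = c(2, 4), good = c(10, 15),
             bad = c(14, 10), awful = c(6, 4))

so <- semantic_orientation_sketch(toy, seeds1 = "good", seeds2 = "bad")
```

On the toy data the resulting order is superb, good, bad, awful. In practice the seed sets would each contain several words, and the matrix would be reweighted before scoring.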

The above methods deliver useable lexicons for a wide variety of seed-sets. However, they are not capable of capturing higher-order associations in the data. For example, both gnarly and wicked are used as slangily positive adjectives. We thus expect them to have many of the same neighbors. However, at least stereotypically, gnarly is Californian and wicked is Bostonian. Thus, they are unlikely to occur often in the same texts. Dimensionality reduction techniques are often capable of capturing their semantic similarity (and have the added advantage of shrinking the size of our data structures).

The following matrix implements my gnarly/wicked example in a word × document matrix:

gw = matrix(c(

1,0,1,0,0,0,

0,1,0,1,0,0,

1,1,1,1,0,0,

0,0,0,0,1,1,

0,0,0,0,0,1), nrow=5, byrow=TRUE)

rownames(gw) = c('gnarly', 'wicked', 'awesome', 'lame', 'terrible')

colnames(gw) = paste(rep('d', 6), seq(1,6), sep='')

gw

The two words of interest never co-occur, but they co-occur with roughly the same set of other items. As expected, our usual favored scheme cannot capture this:

Neighbors(gw, 'gnarly')

Neighbor Distance

1 gnarly 2.220446e-16

2 awesome 2.928932e-01

3 lame 1.000000e+00

4 terrible 1.000000e+00

5 wicked 1.000000e+00

Neighbors(gw, 'wicked')

Neighbor Distance

1 wicked 2.220446e-16

2 awesome 2.928932e-01

3 gnarly 1.000000e+00

4 lame 1.000000e+00

5 terrible 1.000000e+00

PPMI with discounting is no help:

Neighbors(gw.ppcd, 'gnarly')

Neighbor Distance

1 gnarly 1.110223e-16

2 awesome 2.928932e-01

3 lame 1.000000e+00

4 terrible 1.000000e+00

5 wicked 1.000000e+00

Neighbors(gw.ppcd, 'wicked')

Neighbor Distance

1 wicked 1.110223e-16

2 awesome 2.928932e-01

3 gnarly 1.000000e+00

4 lame 1.000000e+00

5 terrible 1.000000e+00

Latent Semantic Analysis (LSA) to the rescue. LSA is based on singular value decomposition (SVD), a technique for decomposing a single matrix into three matrices: one for the rows, one for the columns, and one representing the singular values. The rows and columns of the new matrices are orthonormal, which means roughly that all correlations they contain have been factored out. In addition, the singular values are organized from most to least important. If we simply multiply all three matrices together again, we get back to our original matrix. However, if we multiply the first \(k\) columns of the row matrix by the first \(k\) singular values, then we obtain a reduced dimensional version of our row matrix, one in which all correlations have been removed.

LSA is implemented in vsm.R as LSA. The only preliminary step is to run svd. (This can take a long time, so it is best to do it once and then fiddle around with it.)

s = svd(gw)

s

$d

[1] 2.449490 1.618034 1.414214 0.618034 0.000000

$u

[,1] [,2] [,3] [,4] [,5]

[1,] 0.4082483 0.0000000 7.071068e-01 0.0000000 -0.5773503

[2,] 0.4082483 0.0000000 -7.071068e-01 0.0000000 -0.5773503

[3,] 0.8164966 0.0000000 -1.110223e-16 0.0000000 0.5773503

[4,] 0.0000000 0.8506508 0.000000e+00 -0.5257311 0.0000000

[5,] 0.0000000 0.5257311 0.000000e+00 0.8506508 0.0000000

$v

[,1] [,2] [,3] [,4] [,5]

[1,] 0.5 0.000000e+00 0.5 0.000000e+00 -0.7071068

[2,] 0.5 7.850462e-17 -0.5 5.887847e-17 0.0000000

[3,] 0.5 0.000000e+00 0.5 0.000000e+00 0.7071068

[4,] 0.5 -7.850462e-17 -0.5 -5.887847e-17 0.0000000

[5,] 0.0 5.257311e-01 0.0 -8.506508e-01 0.0000000

[6,] 0.0 8.506508e-01 0.0 5.257311e-01 0.0000000

We can then call LSA with the svd output as the first argument and optional additional arguments. For a large matrix like imdb, k=100 is a good starting point. Here, we use k=2 on our small example:
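A minimal sketch of that reduction in base R follows. It is not the vsm.R LSA function: the name lsa_sketch and the helper cosine_distance are hypothetical, and the function simply multiplies the first k columns of the row matrix by the first k singular values, as described above.

```r
# The gnarly/wicked matrix from the text.
gw <- matrix(c(1, 0, 1, 0, 0, 0,
               0, 1, 0, 1, 0, 0,
               1, 1, 1, 1, 0, 0,
               0, 0, 0, 0, 1, 1,
               0, 0, 0, 0, 0, 1), nrow = 5, byrow = TRUE)
rownames(gw) <- c("gnarly", "wicked", "awesome", "lame", "terrible")

# Keep the first k columns of u, scaled by the first k singular values.
lsa_sketch <- function(s, k) {
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
}

cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

s <- svd(gw)
gw_lsa <- lsa_sketch(s, k = 2)
rownames(gw_lsa) <- rownames(gw)

d_gnarly_wicked <- cosine_distance(gw_lsa["gnarly", ], gw_lsa["wicked", ])
d_gnarly_lame <- cosine_distance(gw_lsa["gnarly", ], gw_lsa["lame", ])
```

In the reduced space, gnarly and wicked end up pointing in the same direction (cosine distance close to 0), while gnarly and lame remain at a distance close to 1, even though gnarly and wicked never share a document in gw.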

Exercise vsm:ex:lsa
Run LSA on the imdb and imdb.ppcd matrices and then check a few words you care about, ideally some that mix different semantic dimensions and different overall frequencies. How do the results compare with what you saw for these matrices before reduction?

Exercise vsm:ex:k
What happens if you set k=1 using LSA? What do the results look like then? What do you think this first (and now only) dimension is capturing?

I focussed on word × word matrices above, and we saw a few word × document matrices in illustrative examples. I also showed some pictures of an adjective × adverb matrix. This is just a glimpse of the infinitude of possible designs. Some others, including one that goes beyond two dimensions:

word × document

word × search query

word × syntactic context

pair × pattern (e.g., mason : stone, cuts)

adj. × modified noun

word × dependency rel.

person × product

word × person

word × word × pattern

verb × subject × object

In any vector-space project, the matrix design will be the most important step you take. No amount of reweighting and dimensionality reduction will transcend a bad choice at this stage, and smart choices will amplify the utility of those steps.

Exercise vsm:ex:additions
Propose three additions to the above list of matrix designs, and say briefly what research questions those designs would engage (preferably in the area of sentiment, but not necessarily).

Exercise vsm:ex:worddoc
The code distribution for today also contains a word × document matrix derived from the same data as imdb used above: imdb-worddoc.csv. Using the above techniques and measures, try to get a feel for how this matrix behaves: how it differs from the word × word version and what that means for analysis.

Exercise vsm:ex:advmod
The code distribution for today also contains the adjective × adverb matrix derived from Gigaword, as used in figure vsm:tsne-advadj: gigawordnyt-advmod-matrix.csv. Using the above techniques and measures, try to get a feel for what can be done with this matrix.

Exercise vsm:ex:scaletypes
A bit of background: Syrett and Lidz (2010; 30-Month-olds use the distribution and meaning of adverbs to interpret novel adjectives. Language Learning and Development 6(4): 258-282) report on a corpus study of adverb–adjective combinations in English, seeking to find patterns that can be traced to interpretive restrictions and preferences. For example, half full sounds normal and is easy to interpret, whereas half tall is odd and hard to interpret. This pattern presumably traces to the sort of scales along which we measure full and tall. (You might mutter to yourself different combinations of completely, somewhat, and mostly with full, smart, and damp to get a feel for what the patterns are like.) Use the gigawordnyt-advmod-matrix.csv matrix to explore these claims.