Sentiment and semantic composition

We've so far studied just word meanings. We have tried to embrace the fact that word senses are constantly being pushed around by the morphosyntactic and discourse context in which they occur, but we were still, in the end, just creating lexicons. A viable theory of meaning has to come to grips with semantic composition — how word meanings combine to form more complex meanings. The goal of this lecture is to take some tentative steps in that direction. This should also provide some connections with Noah Goodman's course and Shalom Lappin's lecture today (this place, next period!).

Semantic composition is an extremely active area of research in natural language processing; having gotten good at building rich lexicons, the NLPers have now turned their attention to the core problem of formal semantics, and they are making rapid progress, with new papers appearing all the time. Here's a partial list of people who are pursuing this issue, from a variety of perspectives (please let me know whom I've forgotten!):

In addition, a number of computational workshops and conferences have come together this year to form *Sem, which will probably evolve into a primary outlet for work on this topic.

It's too much to tackle the full problem of compositionality. Thus, I'm going to focus on a particular kind of compositional interaction, namely, adverb–adjective combinations. In our ratings data, we were starting to see that modifiers have complex but systematic effects on the sentiment profiles of adjectives, so this seems like a natural place to start. It's also a vital part of good sentiment analysis, since one's text-level predictions about sentiment can be greatly improved simply by being sensitive to negation, attenuation, and emphasis as they relate to sentiment adjectives.

Here is my framing of the problem: suppose we have a bigram w1 w2 (like absolutely amazing) that has vector representation V. Suppose also that we also have separate vector representations for w1 and w2, call them v1 and v2. How can we use v1 and v2 to construct V? This is just the question of semantic compositionality (how does the meaning of the parts determine the meaning of the whole?), but phrased in terms of vector-space representations of meaning.

In principle, we can go after this question using any vector representation. In the last class, our vectors were derived from co-occurrence patterns. In the classes before that, our vectors were probability values, one for each category (star rating or EP reaction distribution). Here, we'll use just the star-rating vectors obtained from IMDB data, because it's possible to grasp intuitively what those vectors are like, whereas the 1000-plus dimensions of the unsupervised approaches are mind-boggling.

As before, I've written some code to facilitate interacting with the data:

source('composition.R')

The basic structure of the code is similar to previous lectures: functions for extracting subframes of the data and functions for visualizing the results. What's really new is the set of functions for making predictions about compositionality.

The subframe extraction function is bigramCollapsedFrame. It allows you to extract bigrams, or even unigrams, based on their column values. Its only required argument is a data.frame like bi above, though calling it with only this argument might cause some kind of explosion (not literally), since it will try to collapse the entire file.

The Category values are centered at 0 to create a more intuitive scale, with negativity corresponding to negative numbers, as in our first lecture.
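The centering itself is simple arithmetic; here is a minimal sketch, assuming the raw scale is the ten star-rating categories 1 through 10:

```r
## Centering the raw star ratings: shift the 1..10 scale by its
## midpoint (5.5) so that it runs symmetrically from -4.5 to 4.5,
## with negative reviews below 0 and positive ones above it.
stars <- seq(1, 10)
centered <- stars - mean(stars)
centered
## [1] -4.5 -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5
```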

You can leave off one or both of the tag arguments:

bigramCollapsedFrame(bi, word1='very', tag1='r', word2='best')

        Phrase Category Count    Total         Freq         Pr
1  very/r best     -4.5    97 25395214 3.819617e-06 0.03780675
2  very/r best     -3.5    48 11755132 4.083323e-06 0.04041692
3  very/r best     -2.5    65 13995838 4.644238e-06 0.04596888
4  very/r best     -1.5    69 14963866 4.611108e-06 0.04564096
5  very/r best     -0.5   112 20390515 5.492750e-06 0.05436749
6  very/r best      0.5   190 27420036 6.929240e-06 0.06858593
7  very/r best      1.5   414 40192077 1.030054e-05 0.10195519
8  very/r best      2.5   662 48723444 1.358689e-05 0.13448364
9  very/r best      3.5   822 40277743 2.040829e-05 0.20200222
10 very/r best      4.5  2008 73948447 2.715405e-05 0.26877204

You can even leave off one or both of the word arguments:

bigramCollapsedFrame(bi, word1='very', tag1='r', tag2='a')

     Phrase Category  Count    Total        Freq         Pr
1  very/r a     -4.5  28379 25395214 0.001117494 0.06905723
2  very/r a     -3.5  15310 11755132 0.001302410 0.08048438
3  very/r a     -2.5  20293 13995838 0.001449931 0.08960068
4  very/r a     -1.5  23060 14963866 0.001541046 0.09523124
5  very/r a     -0.5  31801 20390515 0.001559598 0.09637769
6  very/r a      0.5  45374 27420036 0.001654775 0.10225934
7  very/r a      1.5  75510 40192077 0.001878728 0.11609886
8  very/r a      2.5  98178 48723444 0.002015005 0.12452029
9  very/r a      3.5  77862 40277743 0.001933127 0.11946051
10 very/r a      4.5 127933 73948447 0.001730030 0.10690979

You can look at a specific value for just Word1 or Word2, with or without their tag specifications:

bigramCollapsedFrame(bi, word2='amazing', tag2='a')

      Phrase Category Count    Total         Freq         Pr
1  amazing/a        1  1158 25395214 4.559914e-05 0.04336273
2  amazing/a        2   468 11755132 3.981240e-05 0.03785979
3  amazing/a        3   734 13995838 5.244416e-05 0.04987203
4  amazing/a        4   744 14963866 4.971977e-05 0.04728126
5  amazing/a        5  1128 20390515 5.531984e-05 0.05260667
6  amazing/a        6  1915 27420036 6.983944e-05 0.06641415
7  amazing/a        7  3553 40192077 8.840051e-05 0.08406489
8  amazing/a        8  6479 48723444 1.329750e-04 0.12645322
9  amazing/a        9  7994 40277743 1.984719e-04 0.18873781
10 amazing/a       10 23589 73948447 3.189925e-04 0.30334746

A word of caution: it often matters whether you use word1 or word2 to specify the word you want to look at (and similarly for tag1 and tag2). This is because we are looking at a selected subset of the bigrams data, and thus the unigram distributions derived from the two slots can differ.

The word and tag arguments behave like the corresponding arguments for bigramCollapsedFrame.

You can use ylim to adjust the y-axis, which is important if you need a uniform y-axis in order to do comparisons.

The col argument gives the color of the plot lines. 1 is the same as "black". (Colors can be specified with integers or with strings like "black", "blue", and so forth.)

You can mostly ignore the add argument. If set to TRUE, it will try to add your plot line to an existing plot. This is used by the function bigramCompositionPlot, which is described in the next section.

Exercise ex:plot
Play around with bigramPlot. In anticipation of the work we want to do with this data, you might focus on some particular interactions — for example, can you start to discern how a given modifier interacts with the things it modifies?

The function bigramCompositionPlot allows you to compare a bigram with its constituent parts. Here is a typical call:

bigramCompositionPlot(bi, 'really','good')

Figure fig:reallygood

The bigram really good compared with its constituents.

You can optionally specify tag1 and tag2 to further restrict your gaze. ylim allows you to specify the y-axis values.

To add expected category values, use ec=TRUE.

Exercise ex:ec
Continue the investigations you began above, now using bigramCompositionPlot to get an even sharper perspective on what modifiers do to their arguments. At this point, you might pick a single modifier and sample argument (word2) values to see whether patterns emerge. The ec=TRUE flag might start to suggest a coarse-grained semantics.

Semantic theory strongly suggests that, in adverb–adjective constructions, the adverb will take the adjective as its argument and do something with it to create a new adjectival meaning. In present terms, this comes down to seeing how adverbs modulate the distributional profiles of the things they modify.

One simple high-level method for doing this is to look exhaustively at all of the modification data we have for a particular pattern. Expected Category (EC) values provide a rough first summary: we might expect some adverbs to push these values out towards the edges, with others pushing them to the middle.

Thus, for any adverb Adv and adjective Adj, we can look at the difference between the EC for the bigram Adv Adj and the EC for Adj on its own. Perhaps Adv is a function that does something systematic to EC values.

Compiling all of these differences takes a lot of computing time and some extra trickery with R tables. Thus, I've pre-compiled it into a CSV file — just for the adverbs and adjectives, because my laptop was struggling under the weight of the full vocabulary for the bigrams data:

aa = read.csv('advadj-ecs-probs.csv')

head(aa)

  Word1 Tag1     Word2 Tag2    BigramEC         B1         B2         B3
1 about    r       bad    a -0.08451206 0.34973878 0.00000000 0.00000000
2 about    r       big    a -0.25807297 0.24525559 0.07569117 0.00000000
3 about    r different    a  0.25645245 0.11588358 0.16093879 0.09011521
4 about    r    enough    a -1.14190434 0.06129747 0.19863609 0.11122323
5 about    r      good    a -0.22050796 0.12391260 0.13384737 0.08647590
6 about    r     great    a  0.57533210 0.05244627 0.00000000 0.19032577
          B4         B5         B6         B7         B8         B9
1 0.00000000 0.00000000 0.10797082 0.22098114 0.06076261 0.22051114
2 0.05946055 0.08727192 0.12979702 0.13282614 0.07304572 0.08836242
3 0.02809519 0.05154512 0.04599694 0.09937082 0.11648541 0.14091083
4 0.15604213 0.22902742 0.02838549 0.11619173 0.07987235 0.01932410
5 0.09705802 0.10684109 0.09269264 0.06624853 0.07948881 0.08413698
6 0.26702009 0.00000000 0.00000000 0.09941394 0.02733559 0.16533750
         B10 ArgumentEC         A1         A2         A3         A4
1 0.04003551 -1.7473124 0.23192727 0.17646539 0.14067880 0.11415540
2 0.10828946 -0.1378916 0.09608838 0.10263114 0.10698114 0.10981386
3 0.15065812  0.7925735 0.06265967 0.06635101 0.07147188 0.08086372
4 0.00000000 -0.4101139 0.09909160 0.10491410 0.11348335 0.12606232
5 0.12929807 -0.0333225 0.08795664 0.09388235 0.09935886 0.10677708
6 0.19812082  1.1029807 0.05216052 0.05749075 0.06556529 0.07425254
          A5         A6         A7         A8         A9        A10
1 0.09778160 0.07773984 0.05537165 0.03957109 0.03343531 0.03287364
2 0.10520268 0.10940290 0.10078790 0.09548835 0.08733057 0.08627308
3 0.08848067 0.10295605 0.12084785 0.13491833 0.14058823 0.13086260
4 0.12717739 0.12065487 0.09754466 0.07370361 0.06967177 0.06769633
5 0.11172191 0.11513720 0.11482930 0.10185136 0.08713044 0.08135486
6 0.08586029 0.09679526 0.11185751 0.13135473 0.14841190 0.17625122

As you can see, this contains the distributions for both the bigram and for Word2. We will use these later. For now, we can focus on the columns relevant for ECs:

The function ecAdjustmentPlot plots, for any given adverb (value of Word1), the distribution of ArgumentEC - BigramEC.

ecAdjustmentPlot(aa, 'really')

Figure fig:really

The distribution of EC adjustments imposed by really. The green line just marks 0, to help orient you.

Exercise ex:ecadj
Continue your adverbial investigations (begun in exercise ex:plot and exercise ex:ec), but now using ecAdjustmentPlot. Try to use the distributions you see to formulate generalizations about how specific classes of adverbs work.

As before, EC values are useful but too limited (and sometimes too untrustworthy) to carry the day. Since they are point estimates, they ignore a lot of the information we have in the sentiment distributions.

What we really want is to define adverbs as functions that take in adjectival sentiment distributions and morph them somehow. A good theory will be one that morphs them into something resembling the distributions we observe for the corresponding bigram. In terms of these sentiment distributions, that just is semantic composition.

The language of probability suggests two simple hypotheses right off the bat:

We can multiply the adverb vector and the adjectival vector, point-wise, and then renormalize those values to create a new probability distribution. This corresponds to the semantic claim that composition is intersective.

We can add the adverb vector and the adjectival vector, point-wise, and then renormalize the values. This corresponds to the claim that semantic composition is disjunctive.
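Both hypotheses take only a few lines of code. Here is a minimal sketch of what such functions might look like (the names intersective and disjunctive here are illustrative; the actual Intersective and Disjunctive functions in composition.R may differ in their interfaces and edge-case handling):

```r
## Symmetric composition sketches: each takes two probability vectors
## of the same length and returns a new probability vector.

## Multiplicative (intersective): point-wise product, renormalized.
intersective <- function(v1, v2) {
  p <- v1 * v2
  p / sum(p)
}

## Additive (disjunctive): point-wise sum, renormalized.
disjunctive <- function(v1, v2) {
  p <- v1 + v2
  p / sum(p)
}
```

Note that when v1 and v2 are both proper probability distributions, the disjunctive combination is just their average, whereas the intersective one concentrates mass wherever the two distributions agree.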

I've called this section "Symmetric compositional hypotheses" because both of these analyses assume that the adverb and the adjective are equal partners, with neither truly a functor on the other.

Both of these analyses will seem wrong to the semanticist in you, but bear with me! It is still instructive to see how well they work.

Returning to bigramCompositionPlot, we can plot these predictions alongside the empirical estimates by filling in values for the optional prediction.func argument. The function Intersective implements the multiplicative hypothesis, and the function Disjunctive implements the additive hypothesis:

Our guiding insight from compositional analyses is that the adverb will be a functor on the adjective, taking in its meaning and transforming it in some way to produce a meaning for the whole. The above simple analyses don't make good on this.

As a first pass, suppose you want to analyze a particular adverb Adv. Consider all of the adverb–adjective bigrams it participates in. In each case, it warps the adjective to create a new vector. That is, in each case, for each star rating, it performs an adjustment, moving the probability up or down:

Hypothesis: an adverb is a function that takes an adjective meaning (qua probability vector) V and adjusts each Vi by the mean difference it imposes for category i, with the mean taken over all the adjectives in our data.

The mean differences are the maximum likelihood estimates, so this is a natural hypothesis to start with, assuming that the data are underlyingly linear in a way that makes differences appropriate.

The file that we loaded above as aa

aa = read.csv('advadj-ecs-probs.csv')

contains the values we need to calculate the mean difference vector for each adverb. Here is an example of how to do that:

mod = subset(aa, Word1=='really')

## Bigram vectors:

bipr = mod[, paste('B', seq(1,10), sep='')]

## Adjective vectors:

adjpr = mod[, paste('A', seq(1,10), sep='')]

## Differences

diffs = bipr - adjpr

## Averages; the function apply takes in the data.frame as first argument

## The 2 says to calculate via columns (1 = rows)

## The third argument is the function:

apply(diffs, 2, mean)

          B1           B2           B3           B4           B5
-0.002339578 -0.004261728  0.001659273 -0.001756302  0.006486497
          B6           B7           B8           B9          B10
 0.009839113  0.002702927 -0.004702173 -0.004021621 -0.003606409

The adjustments are then made by adding the means to the probability values for the adjective, and then renormalizing to get back into probability space.
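A minimal sketch of that adjustment step (the actual Differences function in composition.R may handle edge cases differently; in particular, clipping negative values to 0 before renormalizing is my assumption):

```r
## Sketch of the mean-difference adjustment: add the adverb's mean
## difference vector to the adjective's probability vector and
## renormalize. Clipping negatives to 0 first is an assumption about
## how to stay in probability space; composition.R may do otherwise.
applyMeanDiffs <- function(adjpr, meandiffs) {
  p <- pmax(adjpr + meandiffs, 0)
  p / sum(p)
}
```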

To see these predictions in plots, we can again use bigramCompositionPlot, here with prediction.func given by Differences. You also need to use the keyword argument prediction.func.arg=aa so that Differences gets that table of values as one of its arguments.

Predicted value for really good given good and our additive, mean-based model for really.

Exercise ex:additive
Use bigramCompositionPlot to try to home in on the strengths and weaknesses of this additive hypothesis about how modifiers work. Are there inherent limitations to this approach that we should be aware of?

To assess the above functions, I calculated all the predictions for the bigrams in our aa data.frame and used the KL-divergence as a measure of how close we had come to reconstructing the observed bigram distribution. Here are the (surprising) results; lower (smaller divergence from truth) is better. The symmetric intersective analysis is in the lead!
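For reference, the measure used is D(p || q) = sum_i p_i * log(p_i / q_i), with the observed bigram distribution as p and a model's prediction as q. A sketch (the epsilon smoothing of zero cells is my assumption; the evaluation reported here may treat zeros differently):

```r
## KL-divergence D(p || q): how poorly q approximates p, in nats.
## Zero cells are smoothed with a small epsilon and the vectors
## renormalized -- an assumption about handling the many 0 entries
## in the sentiment distributions.
klDivergence <- function(p, q, eps = 1e-12) {
  p <- (p + eps) / sum(p + eps)
  q <- (q + eps) / sum(q + eps)
  sum(p * log(p / q))
}
```

KL-divergence is 0 when the prediction matches the observed distribution exactly, and grows as the prediction puts mass in the wrong places, which is why lower is better in the results below.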

Figure fig:results

Assessing the composition methods using KL-divergence. All the differences are pairwise significant at 0.05 according to wilcox.test, which implements the Wilcoxon signed rank test (an improvement on the t-test that does not presuppose normally distributed data). The full plot is on the left, and an easier-to-read detail is on the right.

Figure fig:socher

Table of results from Socher et al. 2012, who use the same data set as we do. The third model is our disjunctive one, and the fourth is our intersective one. I am not sure why their numbers are so notably better than ours above.

Exercise ex:mult
Another natural asymmetric analysis is a multiplicative version of the additive one above. Here, we would multiply the bigram and adjective probabilities to get a vector of adjustments X and then use a ratio X/A, where A is the vector of adjective probabilities, to make adjustments during composition. Implement this using FunctorStyleModel in composition.R, on the model of Differences, and see how it does.