26 July 2016

I remember back in grad school days some subset of the field was thinking about the following question. I train an unsupervised HMM on some language data to get something-like-part-of-speech tags out. And naturally the question arises: these tags that come out... what are they actually encoding?

At the time, there were essentially three ways of approaching this question that I knew about:

Do a head-to-head comparison, in which you build an offline matching between induced tags and "true" tags, and then evaluate the accuracy of that matching. This was the standard evaluation strategy for unsupervised POS tagging, but is really just trying to get at the question of: how correlated are the induced tags with what we hope comes out.

Take a system that expects true POS tags and give it induced POS tags instead (at both training and test time). See how much it suffers (if at all). Joshua Goodman told me a few times (though I can't find his paper on this) that word clusters were just as good as POS tags if your task was NER.

Do something like #2, but also give the system both POS tags and induced tags, and see if the POS tags give you anything above and beyond the induced tags.

Now, ten years later since we're in "everything old is new again" phase, we're going through the same exercises, but with word embeddings instead of induced tags. This makes things slightly more complicated because it means that we have to have mechanisms that deal with continuous representations rather than discrete representations, but you basically see the same ideas floating around.

In fact, of the above approaches, the only one that requires any modification is #1 because there's not an obvious way to do the matching. The alternative is to let a classifier do the matching, rather than an offline process. In particular, you take your embeddings, and then try to train a classifier that predicts POS tags from the embeddings directly. (Note: I claim this generalizes #1 because if you did this with discrete tags, the classifier would simply learn to do the matching that we used to compute "by hand" offline.) If your classifier can do a good job, then you're happy.

This approach naturally has flaws (all do), but I think it's worth thinking about this seriously. To do so, we have to take a step back and ask ourselves: what are we trying to do? Typically, it seems we want to make an argument that a system that was not (obviously) designed to encode some phenomenon (like POS tags) and was not trained (specifically) to predict that phenomenon has nonetheless managed to infer that structure. (We then typically go on to say something like "who needs no POS tags" even though we just demonstrated our belief that they're meaningful by evaluating them... but okay.)

As a first observation, there is an entire field of study dedicated to answering questions like this: (psycho)linguists. Admittedly they only answer questions like this in humans and not in machines, but if you've ever posed to yourself the question "do humans encode/represented phrase structures in their brains" and don't know the answer (or if you've never thought about this question!) then you should go talk to some linguists. More classical linguists would answer these questions with tests like, for instance, constituency tests or scoping tests. I like Colin Phillips' encyclopedia article on syntax for a gentle introduction (and is what I start with for syntax in intro NLP).

So, as a starting point for "has my system learned X" we might ask our linguist friends how they determine if a human has learned X. Some techniques are difficult to replicate in machine (e.g., eye movement experiments, though of course models that have something akin to alignment---or "attention" if you must---could be thought of as having something like eye movements, though I would be hesitant to take this analogy too far). But many are not, for instance behavioral experiments, analyzing errors, and, I hesitate somewhat to say it, grammaticality judgements.

My second comment has to do with the notion of "can these encodings be used to predict POS tags." Suppose the answer is "yes." What does that mean? Suppose the answer is "no."

In order to interpret the answer to these questions, we have to get a bit more formal. We're going to train a classifier to do something like "predict POS given embedding." Okay, so what hypothesis space does that classifier have access to? Perhaps you say it gets a linear hypothesis space, in which case I ask: if it fails, why is that useful? It just means that POS cannot be decoded linearly from this encoding. Perhaps you make the hypothesis space outrageously complicated, in which case I ask: if it succeeds, what does that tell us?

The reason I ask these questions is because I think it's useful to think about two extreme cases.

We know that we can embed 200k words in about 300 dimensions with nearly orthogonal vectors. This means that for all intents and purposes, if we wanted, we could consider ourselves to be working with a one-hot word representation. We know that, to some degree, POS tags are predictable from words, especially if we allow for complex hypothesis spaces. But this is uninteresting because by any reasonable account, this representation has not encoded anything interesting: it's just the output classifier that's doing something interesting. That is to say: if your test can do well on the raw words as input, then it's dubious as a test.

We also know that some things are just unpredictable. Suppose I had a representation that perfectly encoded everything I could possibly want. But then in the "last layer" it got run through some encryption protocol. All of the information is still there, so the representation in some sense "contains" the POS tags, but no classifier is going to be able to extract it. That is to say, just because the encoded isn't on the "surface" doesn't mean it's not there. Now, one could reasonably argue something like "well if the information is there in an impossible-to-decode format then it might as well not be there" but this slope gets slippery very quickly.

Currently, I much prefer to think about essentially the equivalent of "behavioral" experiments. For instance, if you're machine translating and want to know if your system can handle scoping, then give it a minimal pair to translate that only differs in the scoping of some negation. Or if you're interesting in knowing whether it knows about POS tags, perhaps look at errors in its output and see if they fall along POS categories.

EDIT 26 Jul 2016, 8:24p Eastern: It's unclear to a few people so clarification. I'm mostly not talking about type-level word embeddings above, but embeddings in context. At a type-level, you could imagine evaluating (1) on out of vocabular terms, which would be totally reasonable. I'm think more something like: the state of your biLSTM in a neural MT system. The issue is that if, for instance, this biLSTM can repredict the input (as in an autoencoder), then it could be that the POS tagger is doing all the work. See this conversation thread with Yoav Goldberg for some discussion.

6 comments:

"Perhaps you make the hypothesis space outrageously complicated, in which case I ask: if it succeeds, what does that tell us?" -- can't cross-validation solve this in a very simple way? A complex hypothesis space would have trouble generalizing to unseen word embeddings. Of course, one can't still solve the "encrypted" case (which is as hard as solving decryption), but it seems feasible to spot simple correlations if we avoid overfitting.

My argument above still works with token-based (context dependent) embeddings. Suppose you have a sequence of word tokens X and some model (a BiLSTM if you like) that reads X and produces a sequence of embeddings E. In addition you have a sequence of gold POS tags Y. Then if you create splits E', Y' and E'', Y'' you can train a classifier that tries to predict Y' from E' and evaluate it on the other split. (CV is even better since it can give a std deviation.) As long as there are no repeated sentences in the two splits this should be fine.

I think this is roughly what Yoav was proposing. I still don't buy it, because if E' == X, you'll still do totally reasonably according to this metric, but E'==X could not really be claimed to "encode" POS.

I see what you mean. I was thinking that the performance with E' := X' should be a lower bound (I'm assuming the POS classifier E' -> Y' does independent decisions for every token, i.e., is not allowed to use the context as typical POS tagger) but this might still be a hard baseline to beat (actually, is it? does anyone ever compared the performance with E' = X' with that of a more reasonable E'? If E' really encodes contextual information it could beat this lower bound.)

In any case, I think the fundamental problem here is exactly the same as in classical many-to-1 matching evaluations of unsupervised POS induction: unless one bounds the number of clusters (e.g. make it equal to the number of POS tags) each word might in the extreme case get its own singleton cluster (same situation as E' = X') which breaks the evaluation. One possibility is to try to emulate 1-to-1 matching -- for your example, this could mean something like: - having good accuracy predicting Y from E- having good accuracy predicting E from Y (with a squared loss, this would be something like the sum of the variances of the clusters whose centroids are the average embedding of all words for each POS tag).