29 January 2006

There are at least two reasonable goals for NLP research: (1) produce predictive systems that mimic humans; (2) produce explanatory models. (2) gets more into CL and CogSci and I may discuss it at a later date. This post is about (1), at least the machine learning approach to (1).

Our goal is to mimic a human on some task. This means we will, at the end of the day, want to measure our performance against human performance. This measurement is our loss function, and is, in my mind, the only thing that really matters. Coming up with a good loss function is really really hard and I will certainly discuss this issue in another post. But suppose we can come up with one: l(t,h) = how much it hurts to produce a hypothesis h when the true answer was t.

The subsequent question is: given some collected data, how can I minimize l. The dominant approach seems to be: approximate l by some well-known loss (binary loss, squared error, absolute loss, etc.) and then use one of the many good classifiers/regressors to solve our original problem.

This step irks me greatly. Not because this approximation is always bad, but because it is never (rarely) stated. Many papers say something like "Our task in XYZ is to minimize misclassification error of problems created by ABC." This is rarely actually true. I would be much happier with the following explanation: "Our task in XYZ is to minimize TUV loss, but we don't know how to do that, so we'll instead minimize misclassification error." This issue manifests itself commonly in another way: putting the cart before the horse. We often gather a data set, label it, solve the labeling problem (eg., by multiclass classification) and then go back and define the problem in such a way that this is an appropriate solution. This is harder to detect, but equally disingenuous.

So let's say we've done things right: we've defined our loss function and we've gotten some labeled data. Now we have to minimize our loss on this data. If we can do this directly, great. If not, we can say so and that we'll approximate it by some other loss l'.

Now we can start looking at features. We have an l' that we can minimize and we want to come up with features that are useful for this minimization. In particular, if ours is a structured problem, and l' is invariant to some change in structure, we needn't reflect this aspect of the structure in our feature functions function (the prototypical example here being Markov features in sequence labeling for Hamming [per-label] loss, though this issue itself could take a whole other post). We should set aside part of the training data for manual inspection to do feature engineering.

Experimentation is the last real step. We've split off a training set, development set and test set (*). We train a model on the training set, evaluate it on the development set. We look at the dev predictions and see what went wrong. We reengineer some features to fix these problems and repeat. Only at the end do we run on the test set and report our results in a paper.

(*) I'm not a fan of cross-validation in such a cyclic process. The features will be engineered based on the training and development data sets, and to then fold all this back into the bag and do cross-validation will give us unrealistically good results.

And as a very final step, in the case that our optimized l' is not the same as l, I would like to see some results that show that they at least correlate reasonably well. And easy way to do this is to run several versions of the model and to report both l and l' scores and let us see that improving l' does indeed improve l.