18 January 2007

The "Last Words" article in the Dec 2006 issue of Computational Linguistics is by Annie Zaenen from PARC. (I hope that everyone can access this freely, but I sadly suspect it is not so... I'm half tempted to reproduce it, since I think it's really worth reading, but I don't want to piss off the Gods at MIT Press too much.)

The main point I got from the article is that we really need to pay attention to how annotation is done. A lot of our exuberance for annotating is due to the success of machine learning approaches on the Treebank, so we have since gone out and annotated probably hundreds of corpora for dozens of other tasks. The article focuses on coreference, but I think most of the claims apply broadly. The first point made is that the Treebank annotation was controlled, and done by experts (linguists). Many other annotation efforts are not: they are done without real standards and without deep analysis of the task. The immediate problem, then, is that a learning algorithm that "succeeds" on the annotated data is not necessarily solving the right task.

There was a similar story that my ex-office-mate Alex Fraser ran across in machine translation; specifically, with evaluating alignments for machine translation. The basic problem was two-fold. First, the dataset that everyone used (the French-English data from Aachen) was essentially broken, due largely to its distinction between "sure" and "possible" links -- almost every word pair was possibly linked. Second, the evaluation metric (alignment error rate, or AER) was itself broken. Together, these made results on this dataset virtually useless. The conclusion is essentially: don't use the Aachen data and don't use AER. That is, don't use them if you want improved MT performance, i.e., if you expect higher alignment performance to imply higher MT performance. (If this sounds familiar, it's perhaps because I mentioned it before.)
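To make the "possible links" problem concrete, here is a minimal Python sketch of AER under its standard definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted alignment, S the sure links and P the possible links (with S a subset of P). The function name and the toy 3x3 sentence pair are my own invention for illustration; nothing here is taken from the Aachen data.

```python
from itertools import product


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with links as (i, j) pairs."""
    possible = possible | sure  # sure links count as possible by convention
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing required: no error
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Hypothetical toy annotation: one sure link, and every word pair marked possible.
sure = {(0, 0)}
possible = set(product(range(3), range(3)))

# Two quite different alignments that agree only on the sure link.
diagonal = {(0, 0), (1, 1), (2, 2)}
crossed = {(0, 0), (1, 2), (2, 1)}

print(alignment_error_rate(diagonal, sure, possible))
print(alignment_error_rate(crossed, sure, possible))
```

Both alignments get an AER of zero here, even though they disagree on two of three links. When nearly everything is "possible," the metric stops distinguishing between alignments once the few sure links are covered, which is one way to see why a lower AER need not mean better MT.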

I should say I largely agree with the article. Where I differ (perhaps only by epsilon) is that the article seems to pick on annotation for machine learning, but I really don't see any reason why the fact that we're using machine learning matters. The issue is really one of evaluation: we need to know that at the end of the day, when we compute a number, that number matters. We can compute a number intrinsically or extrinsically. In the extrinsic case, we are golden, assuming the extrinsic task is real (turtles upon turtles). In the intrinsic case, the situation is fishy. We have to make sure that both our annotations mean something and our method of computing error rate means something (à la the error metric types and the use of F for named entities). While I've argued on this blog that the error metric is important, the CL article argues that the annotation is important. As someone on the machine learning side, I think this is easy to forget.