11 June 2007

Our M.O. in NLP land is to evaluate our systems in a first-best setting, typically against a balanced F measure (balanced F means that precision and recall are weighted equally). Occasionally we see precision/recall curves, but this is typically in straightforward classification tasks, not in more complex applications.

Why is this (potentially) bad? Well, it's typically because our evaluation criterion is uncalibrated against human use studies. In other words, picking on balanced F for a second, it may turn out that for some applications it's better to have higher precision, while for others it's better to have higher recall. Reporting a balanced F removes our ability to judge this. Sure, one can report precision, recall and F (and people often do this), but this doesn't give us a good sense of the trade-off. For instance, if I report P=70, R=50, F=58, can I conclude that I could just as easily get P=50, R=70, F=58 or P=R=F=58 using the same system but tweaked differently? Likely not. But this seems to be the sort of conclusion we like to draw, especially when we compare across systems by using balanced F as a summary.
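To make the P=70/R=50 example concrete: the weighted F measure, F_beta = (1 + beta²)PR / (beta²P + R), weights recall beta times as much as precision. A tiny sketch showing that balanced F (beta=1) is symmetric in P and R, while an unbalanced F is not:

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Balanced F cannot distinguish the direction of the trade-off:
print(round(f_beta(0.70, 0.50), 3))  # 0.583
print(round(f_beta(0.50, 0.70), 3))  # 0.583
# A recall-weighted F (beta=2) can:
print(round(f_beta(0.70, 0.50, beta=2), 3))  # 0.530
print(round(f_beta(0.50, 0.70, beta=2), 3))  # 0.648
```

So two systems that look identical under balanced F can differ by more than ten points once we commit to which side of the trade-off the application actually cares about.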

The issue is that it's essentially impossible for any single metric to capture everything we need to know about the performance of a system. This holds even further up the line, in applications like MT. The sort of translations that are required to do cross-lingual IR, for instance, are of a different nature than those that are required to put a translation in front of a human. (I'm told that for cross-lingual IR, it's hard to beat just doing "query expansion" using model 1 translation tables.)

I don't think the solution is to proliferate error metrics, as has been seemingly popular recently. The problem is that once you start to apply 10 different metrics to a single problem (something I'm guilty of myself), you actually cease to be able to understand the results. It's reasonable for someone to develop a sufficiently deep intuition about a single metric, or two metrics, or maybe even three metrics, to be able to look at numbers and have an idea what they mean. I feel that this is pretty impossible with ten very diverse metrics. (And even if possible, it may just be a waste of time.)

One solution is to evaluate at different "cutoffs" à la precision/recall curves, or ROC curves. The problem is that while this is easy for thresholded binary classifiers (just change the threshold), it is less clear for other classifiers, let alone for more complex applications. For instance, in my named entity tagger, I can trade off precision and recall by postprocessing the weights and increasing the "bias" toward the "out of entity" tag. While this is an easy hack to accomplish, there's nothing to guarantee that it's actually doing the right thing. In other words, I might be able to do much better were I to directly optimize some sort of unbalanced F. For a brain teaser, how might one do this in Pharaoh? (Solutions welcome in comments!)

Another option is to force systems to produce more than a first-best output. In the limit, if you can get every possible output together with a probability, you can compute something like expected loss. This is good, but limits you to probabilistic classifiers, which makes life really hard in structure land, where things quickly become #P-hard or worse to normalize. Alternatively, one could produce ranked lists (up to, say, 100 best) and then look at something like precision at 5, 10, 20, 40, etc., as they do in IR. But this presupposes that your algorithm can produce k-best lists. Moreover, it doesn't answer the question of how to optimize for producing k-best lists.
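Precision at k itself is trivial to compute once you have the k-best list judged; the hard part, as noted, is getting the list. A sketch, assuming each ranked output has been marked correct (1) or not (0):

```python
def precision_at_k(relevance, k):
    """relevance: 0/1 correctness judgments in rank order."""
    top = relevance[:k]
    return sum(top) / len(top)

# Hypothetical judgments on a system's 10-best list:
judged = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(judged, 5))   # 0.6
print(precision_at_k(judged, 10))  # 0.4
```

Note that this just pushes the trade-off question into the choice of k, which is again something that ought to come from how the application consumes the output.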

I don't think there's a one-size-fits-all answer. Depending on your application and your system, some of the above options may work. Some may not. I think the important thing to keep in mind is that it's entirely possible (and likely) that different approaches will be better at different points of the trade-off.