18 October 2006

Shared tasks have been increasing in popularity over the past half decade. These are effectively competitions (though perhaps that word is rightfully disdained) for building systems that perform well on a given task, for a specific data set. Typically a lot of stuff is given to you for free: the data, various preprocessing steps, evaluation scripts, etc. Anywhere from a handful of people to dozens enter these shared tasks. Probably the best known are the CoNLL shared tasks, but they have also taken place in other workshops (e.g., the two SMT workshops and many others). Government-run competitions (e.g., GALE, ACE, DUC (to some degree) and others) are somewhat similar, with the added bonus that money is often contingent on performance, but for the most part, I'll be talking about the community-driven shared tasks. (I'll note that shared tasks exist in other communities, but not to the extent that they exist in NLP, to my knowledge.)

I think there are both good and bad things about having these shared tasks, and a lot depends on how they are run. Perhaps some analysis (and discussion?) can serve to help future shared task organizers make decisions about how to run these things.

Many pros of shared tasks are perhaps obvious:

1. Increases community attention to the task.

2. Often leads to development or convergence of techniques by getting lots of people together to talk about the same problem.

3. Significantly reduces the barrier of entry to the task (via the freely available, preprocessed data and evaluation scripts).

4. (Potentially) enables us to learn what works and what doesn't work for the task.

5. Makes a standardized benchmark against which future algorithms can be compared.

Many of these are quite compelling. I think (3) and (5) are the biggest wins (with the caveat that it's dangerous to test against the same data set for an extended period of time). My impression (which may be dead wrong) is that, per (1), there has been a huge surge of interest in semantic role labeling due to the CoNLL shared task. I can't comment on how useful (2) is, though it seems that there is at least quite a bit of potential there. I know there have been at least a handful of shared task papers that I've read that gave me an idea along the lines of "I should try that feature."

In my opinion, (4) seems like it should be the real reason to do these things. I think the reason why people don't tend to learn as much as might be possible about what does and does not work is that there's very little systematization in the shared tasks. At the very least, almost everyone will use (A) a different learning algorithm and (B) a different feature set. This means that it's often very hard to tell -- when someone does well -- whether it was the learning or the features.

Unfortunately (were it not the case!) there are some cons associated with shared tasks, generally closely tied to corresponding pros.

1. May artificially bloat the attention given to one particular task.

2. Usefulness of results is sometimes obscured by multiple dimensions of variability.

3. Standardization can lead to inapplicability of certain options that might otherwise work well.

4. Leads to repeated testing on the same data.

Many of these are matters of personal taste, but I think some argument can be made for them all. For (1), it is certainly true that having a shared task on X increases the amount of time the collective research community spends on X. If X is chosen well, this is often fine. But, in general, there are lots of really interesting problems to work on, and this increased focus might lead to narrowing. There's recently been something of a narrowing in our field, and there is certainly a correlation (though I make no claim of causation) with the increase in shared tasks.

(2) and (3) are, unfortunately, almost opposed. You can, for instance, fix the feature set and only allow people to vary the learning. Then we can see who does learning best. Aside from the obvious problem here, there's an additional problem that another learning algorithm might do better, if it had different features. Alternatively, you could fix the learning and let people do feature engineering. I think this would actually be quite interesting. I've thought for a while about putting out a version of Searn for a particular task and just charge people with coming up with better features. This might be especially interesting if we did it for, say, both Searn and Mallet (the UMass CRF implementation) so we can get a few more points of comparison.

To be more concrete about (3), a simple example is in machine translation. The sort of preprocessing (e.g., tokenization) that is good for one MT system (e.g., a phrase-based system) may be very different from the preprocessing that is good for another (e.g., a syntax-based one). One solution here is to give multiple versions of the data (raw, preprocessed, etc.), but then this makes the (2) situation worse: how can we tell who is doing best, and is it just because they have a darn good tokenizer? (Don't underestimate the importance of this!)

(4) doesn't really need any extra discussion.

My personal take-away from putting some extra thought into this is that it can be very beneficial to have shared tasks, if we decide at the outset what the goals are. If our goal is to understand what features are important, maybe we should consider fixing the learning to a small set of algorithms. If our goal is learning, do the opposite. If we want both, maybe ask people to do feature ablation and/or try with a few different learning techniques (this is perhaps too much burden, though). I think we should definitely keep the low barrier of entry from pro (3): to me, this is one of the biggest pros. I think the SMT workshops did a phenomenal job here, for a task as complex as MT. And, of course, we should choose the tasks carefully.
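To make the "feature ablation with a few different learning techniques" idea concrete, here is a minimal sketch of what a participant might be asked to report: one score per learner on the full feature set, plus one score per learner with each feature group left out. Everything named here (`ablation_grid`, `ablation_drops`, `toy_evaluate`, the feature group names and the weights) is hypothetical and made up for illustration; in a real shared task, `evaluate` would train on the released data and call the official scoring script.

```python
def ablation_grid(learners, feature_groups, evaluate):
    """Score each learner on all feature groups and on each leave-one-out subset.

    Returns {learner_name: {"all": score, "-<group>": score, ...}}.
    """
    results = {}
    for name, learner in learners.items():
        row = {"all": evaluate(learner, set(feature_groups))}
        for held_out in feature_groups:
            kept = set(feature_groups) - {held_out}
            row["-" + held_out] = evaluate(learner, kept)
        results[name] = row
    return results


def ablation_drops(results):
    """For each learner, how far the score falls when a feature group is removed."""
    return {
        name: {key: row["all"] - score for key, score in row.items() if key != "all"}
        for name, row in results.items()
    }


# Toy stand-in for "train on the shared data, then run the official scorer".
# The weights below are placeholders, not results from any real system.
TOY_WEIGHTS = {"lexical": 3, "pos": 2, "gazetteer": 1}


def toy_evaluate(learner, groups):
    return learner["bonus"] + sum(TOY_WEIGHTS[g] for g in groups)


learners = {"learner_a": {"bonus": 0}, "learner_b": {"bonus": 1}}
grid = ablation_grid(learners, list(TOY_WEIGHTS), toy_evaluate)
drops = ablation_drops(grid)
```

Reading across a row of `drops` shows which feature groups matter for one learner; reading down a column shows whether that holds across learners, which is the kind of comparison that is hard to make when every team varies both at once. The cost is one training run per learner per feature group, which is the burden mentioned above.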

10 comments:

I totally agree about the pros and cons, but shared tasks mean something a bit different to me. They are the places where the tools don't matter, only your performance. Some people tune the features, others the learning; that isn't a problem (in my personal view). Shared tasks can highlight new topics that are worth spending time on (at least when you are building an application); e.g., a simple system with a postprocessing step (containing a few expert rules) can beat the most sophisticated algorithms. Obviously this is interesting only if the goal is to solve problems, not just to publish theories.

Nice post, Hal. Having participated in a couple of shared tasks in the past, I have often thought about this. The idea of fixing either the learning or features and varying the other is a reasonable thought, but I am not sure how feasible it is. In my opinion, the most interesting part of these shared tasks is representation. For example, at last year's CoNLL task on dependency parsing, people used spanning tree, stack-based, and CFG representations of the problem, plus some others. In this case, it is not always true that you can fix the learning or feature set, since each representation of the problem may be incompatible with a particular choice. For instance, let's say we want the learning to be discriminative max-likelihood (CRF). For the spanning tree parsing methods it is not clear to me that this can be done (i.e., I know how to do inference and normalization, but I do not know how to compute feature expectations. Just curious, does anyone?). I think leaving as many dimensions free as possible is beneficial. After the shared task, people can then go and create studies comparing various techniques in a more controlled setting. The only dimension I think is important to be strict about is resources, especially now that people are finding good ways to use unlabeled data. In this case one can have two tracks, "open" and "closed", to allow people to experiment.

Another serious drawback of shared tasks is that many people see them not as research, a shared opportunity to learn, but as a competitive event. This leads to behaviour that increases the idiosyncrasies of submitted systems (adding ad-hoc hacks to "rank" better, to improve the score) at the cost of control. This is why there is often no true learning experience as a result.