25 June 2009

At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of Hal Daume, David McClosky, and Andrew Goldberg as panelists and audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.

Three motivations for a shared task

A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.

A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we evaluate the same method in both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? Should we evaluate the same method under different ratios of labeled to unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means the completeness and quality of experimental sections in research papers vary considerably.

A3. Encourage more research in the area: A shared task can potentially lower the barrier to entry for SSLNLP, especially if it involves pre-processed data and a community support network. This will make it easier for researchers from outside fields, or researchers with smaller budgets, to contribute their expertise. Furthermore, a shared task can potentially direct the community's research efforts toward a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified in the panel discussion as new, promising areas to focus on. A shared task along those lines may help rally the community behind these efforts.

Arguments against the above points

B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exists a *common* dataset or task that everyone is interested in. At the SSLNLP Workshop, for example, we had papers in areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?

B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently of a shared task through, e.g., an influential paper or gradual recognition of its importance by reviewers. Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?

B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?

Rejoinder

A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:

We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as named-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.

If participants are allowed to pre-process their own data, the evaluation might require participants to submit a supervised system along with their semi-supervised system, using the same feature set and setup where possible. This may make it easier to learn from results when there are differences in pre-processing.

There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.
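The submission protocol suggested above (a supervised and a semi-supervised variant built on the *same* feature extractor, plus a common baseline) could look roughly like the harness below. This is a hypothetical sketch: the `compare` function, the identity features, and the toy 1-nearest-neighbour and fall-back models are all illustrative assumptions, not part of any actual shared-task design.

```python
def compare(features, train_sup, train_ssl, labeled, unlabeled, test):
    """Score a supervised and a semi-supervised system under identical features."""
    X = [(features(x), y) for x, y in labeled]
    U = [features(x) for x in unlabeled]
    T = [(features(x), y) for x, y in test]
    sup = train_sup(X)          # supervised system: labeled data only
    ssl = train_ssl(X, U)       # semi-supervised system: labeled + unlabeled
    score = lambda m: sum(m(x) == y for x, y in T) / len(T)
    return {"supervised": score(sup), "semi-supervised": score(ssl)}

def identity(x):
    return x

def train_sup(X):
    """Toy supervised baseline: 1-nearest-neighbour on the labeled points."""
    return lambda q: min(X, key=lambda p: abs(q - p[0]))[1]

def train_ssl(X, U):
    """Placeholder SSL baseline; here it just ignores U and falls back."""
    return train_sup(X)

scores = compare(identity, train_sup, train_ssl,
                 [(0.0, 0), (1.0, 1)], [0.5], [(0.2, 0), (0.8, 1)])
```

Forcing both systems through the same `features` function is what makes the supervised score a meaningful control for the semi-supervised one.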

40 comments:

I find it rather ironic that people want a bakeoff corpus in this area. Isn't the whole point getting by with fewer labeled resources?

The evaluation I've seen used again and again is to use whatever evaluation you had for the fully supervised task and draw learning curves vs. amount of data deleted from the supervised set. (This also seems to be the standard paradigm in active learning evals.)

I'll actually come out as against so-called "fair evaluations". I think having everyone work on the same evaluation strongly encourages solutions that are very strongly fit to the eval data (e.g. Penn Treebank POS tagging and parsing).

Although I don't see this widely discussed, after the held out evaluation of the main bakeoff, shared data sets become development sets, not test sets.

While I'm sure none of Hal's readers would do this, I see too many papers that compare their results on a "standard" eval by choosing their best operating point. I'm all for reporting full performance curves (ROC or precision/recall), but the maximum point on the curve shouldn't be misinterpreted as an estimate of system performance on held-out data.
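The learning-curve evaluation described above (delete labels from the supervised set, re-train, and plot accuracy against labeled-set size) can be sketched as follows. The nearest-centroid learner, the synthetic data, and all function names are illustrative assumptions; a real supervised or semi-supervised system would slot into `train`/`evaluate`.

```python
import random

def train(labeled):
    """Toy learner: fit a per-class mean on 1-D inputs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def evaluate(model, test):
    """Accuracy of nearest-class-mean prediction."""
    correct = sum(1 for x, y in test
                  if min(model, key=lambda c: abs(x - model[c])) == y)
    return correct / len(test)

def learning_curve(data, test, fractions, seed=0):
    """Accuracy at each labeled-set size, as in active-learning evals."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = max(2, int(frac * len(shuffled)))
        curve.append((n, evaluate(train(shuffled[:n]), test)))
    return curve

# Synthetic two-class data: class 0 near 0.0, class 1 near 1.0.
rng = random.Random(42)
data = [(rng.gauss(y, 0.3), y) for y in (0, 1) for _ in range(100)]
test = [(rng.gauss(y, 0.3), y) for y in (0, 1) for _ in range(50)]
for n, acc in learning_curve(data, test, [0.05, 0.25, 1.0]):
    print(n, round(acc, 2))
```

For an SSL eval, the deleted labels would go back in as the unlabeled pool, giving one curve for the supervised system and one for the semi-supervised one.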

Perhaps I should've made this clearer: The idea is to release standard sets of both labeled and unlabeled data--we definitely weren't thinking of releasing just the labeled part and letting participants gather their own unlabeled data.

Bob, I agree it's important for papers to show results for all operating points, rather than a select few. That is one thing I hope a shared task can help establish.

If anyone has any really good ideas about the shared task, please post/email! This would be a strong reason for a second SSL-NLP workshop. :)

I agree with Bob's arguments against "fair evaluations" or "shared-tasks", but I think the positives still outweigh the negatives. It is hard not to find tremendous value in the efforts of CoNLL shared-task, TREC, DUC, BioCreative, etc. organizers. I think we understand the space of problems these tasks covered much more now than we ever would have if these efforts were not made.

SSL seems like a good candidate for a shared-task. Many SSL techniques are either non-trivial to implement or require a significant amount of tuning to work. As a result, no one really compares to anyone else.

I think if such a task were to be developed, it must be for atomic classification. If you start looking at sequence labeling or parsing, the barrier will be too high for entry.

I'll echo Ryan on the value of bakeoffs in general, and I'd add machine translation (the NIST evaluations) to the list of bakeoffs that have spurred interesting research.

I also thought Ryan would have mentioned the CoNLL 2007 dependency parsing domain adaptation task as an earlier example of a "semi-supervised" bakeoff. If I understand your planned setup correctly, Kevin, the CoNLL task was similar (albeit only for parsing). The system that won that year (from Kenji Sagae) used bootstrapped "self-training" of several discriminative dependency parsers. This seems like a fairly general "semi-supervised" technique as well.

It sounds like you're leaning this way anyway, but to encourage more general solutions, it seems that we would want to evaluate on several tasks simultaneously. Of course, in this case, it is difficult to enforce a uniform learning technique. If the bakeoff was well-designed, though, I think it would be well worth it!

So since I wasn't at the straw poll, I'll give you a cautiously optimistic "yes" :).
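The bootstrapped self-training mentioned in connection with Sagae's CoNLL 2007 system follows a simple loop: train on the labeled data, label the unlabeled pool, keep the confident predictions as extra training data, and repeat. A minimal sketch, assuming a generic `train_fn`/`predict_fn` interface, a toy 1-D model, and an arbitrary confidence threshold (none of these come from the actual system):

```python
def self_train(train_fn, predict_fn, labeled, unlabeled, rounds=3, threshold=0.85):
    """Iteratively promote confident predictions on unlabeled data to training data."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train_fn(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            label, conf = predict_fn(model, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break
        labeled += confident            # treat confident guesses as gold
        pool = [x for x, _ in rest]
        model = train_fn(labeled)       # retrain on the enlarged set
    return model

def train_fn(labeled):
    """Toy 1-D per-class mean model."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_fn(model, x):
    """Label = nearest class mean; confidence shrinks with distance."""
    label = min(model, key=lambda c: abs(x - model[c]))
    return label, 1.0 / (1.0 + abs(x - model[label]))

model = self_train(train_fn, predict_fn,
                   [(0.0, 0), (1.0, 1)], [0.1, 0.9, 0.2, 0.8])
```

Sagae's version bootstrapped several parsers against each other rather than a single model against itself, but the promote-and-retrain loop is the same shape.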

Shared community resources do reduce the barriers to entry, because new researchers don't have to do all that boring data preparation. It seems pretty clear that you get more research done if there is a shared data set.

In the end, however, the goal is to make scientific and technological progress, so the quality of the new ideas and reusable techniques generated matters more than the sheer volume of research (well, OK, if the shared data brings more creative researchers into the field, then the provision of opportunities to write papers that give them tenure is good in itself, but that's secondary). So the question is: will having a standard dataset produce influential research?

There are many possible ways of providing a small measure of supervision, definitely not limited to a mix of labelled and unlabelled data (you could have lexical information but no labels for instances, for example). I'd hate it if the term "semi-supervised" were co-opted for some standard setting, and we began to lose the impulse to try out different ways of working with the data.

To clarify, I'm all for community data. Almost everyone working on NLP as currently understood by conference and journal reviewers relies on annotated data. It's a great community service when projects like Genia or CoNLL generate data and give it away.

But does the "tremendous value" cited by Ryan come from the data or the bakeoffs themselves? For instance, I don't know if the Penn Treebank ever had a formal bakeoff. But it does have an "official" train/test split, the focus on which I feel is counterproductive.

Doesn't everyone remember the Banko/Brill paper? Hasn't anyone measured variance across cross-validation folds and seen that it dominates small tweaks to parameters, not all of which optimize globally? Specifically, we're fooling ourselves if we use these highly tuned results as a proxy for progress on the "state of the art". Luckily (for the field), more papers are doing wider meta-analyses across different data sets.

I still don't see the point of creating data specifically for semi-supervised learning (at least in Hal's version where you take a supervised task and add unsupervised data). We can use any old labeled data that's out there for the supervised portion, but the unsupervised data's too huge, too volatile, and too tied up with IP to easily distribute (e.g. TREC-size web corpora, Wikipedia, MEDLINE, Entrez-Gene, Wordnet, etc.).
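The point about fold-to-fold variance is easy to check empirically: compute one accuracy per cross-validation fold and compare the spread to the size of a typical parameter tweak. A toy sketch (the data, the majority-class model, and the function names are all illustrative assumptions):

```python
import random
import statistics

def cv_accuracies(data, k, train_fn, eval_fn, seed=0):
    """Per-fold accuracies for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accs.append(eval_fn(train_fn(train), folds[i]))
    return accs

def train_majority(train):
    """Toy model: predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def eval_majority(label, test):
    return sum(y == label for _, y in test) / len(test)

# Skewed two-class data: two thirds label 1, one third label 0.
data = [(i, 1 if i % 3 else 0) for i in range(90)]
accs = cv_accuracies(data, 5, train_majority, eval_majority)
print(round(statistics.mean(accs), 3), round(statistics.stdev(accs), 3))
```

If the standard deviation across folds is larger than the gain from a tweak, the tweak's "improvement" on a single fixed split tells you very little.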

Chris raised an especially intriguing point: "There are many possible ways of providing a small measure of supervision, definitely not limited to a mix of labelled and unlabelled data... I'd hate it if the term "semi-supervised" was co-opted for some standard setting." This reminds me of Jason Eisner's argument for joint inference on heterogeneously-labeled data in his SSL-NLP keynote, where the philosophy is to throw in whatever resources we have whenever they are available (rather than restricting ourselves to clear-cut labeled/unlabeled splits).

I feel there are two distinct camps in SSL nowadays. Folks with more of an ML focus tend to think of SSL in a clear-cut way: one set of totally labeled data and another set of totally unlabeled data. Folks from NLP (especially info extraction and lexical semantics work, in my experience) see SSL as adding supervision (from heterogeneous sources) to an otherwise unsupervised process. The question is: should we view both of these as living in the same SSL world? Should they talk to each other, and are the innovations shareable? This somewhat corresponds (but not exactly) to the "semi-supervised vs. semi-unsupervised" dichotomy Hal referred to before.

There were definitely papers from both camps at our workshop, and I would hope that there was value in getting these distinct approaches together. But if not, maybe it is worth considering splitting into two sub-fields? My feeling is there is still value in keeping both under the same roof, but as Chris said, a clearly-defined shared task would favor one at the expense of the other.

Instead of diving head on into the fine print of SSL, we might want to ask first what are really valuable tasks that we cannot do well without SSL. The CoNLL and Biocreative tasks I was involved in were valuable on their own, independently of particular methods. The bake-offs were interesting because they identified good learning methods for those valuable tasks. An SSL shared task is upside down: pick a class of methods, and then look for tasks to compare the methods in the class. I'd rather that we focus on something of greater importance: for most Web-scale learning tasks, the plausible fraction of accurately labeled training data is trivial, so we have no choice but to learn all we can from unlabeled data.
