05 June 2007

I believe a much-needed resource in the ACL community is a "state-of-the-art repository", that is, a public location in which one can find information about the current state-of-the-art results, papers and software for various NLP tasks (e.g. NER, Parsing, WSD, PP-Attachment, Chunking, Dependency Parsing, Summarization, QA, ...). This would help newcomers to the field get a feel for "what's available" in terms of both tasks and tools, and would allow active researchers to keep current on fields other than their own.

For example, I am currently quite up to date with what's going on in parsing, PoS tagging and chunking (and of course the CoNLL shared tasks are great when available, yet in many cases not updated often enough), but I recently needed to do some Anaphora Resolution, and was quite lost as to where to start looking...

I think the ACL Wiki is an ideal platform for this, and if enough people show some interest, I will create a "StateOfTheArt" page and start populating it. But, before I do that, I would like to (a) know if there really is an interest in something like this and (b) hear any comments you might have about it (how you think it should be organized, what the scope should be, how it can be advertised other than on this blog, etc.).

I find this especially amusing because this is something that I'd been planning to blog about for a few weeks and just haven't found the time! I think that this is a great idea. If we could start a community effect where every time you publish a paper with new results on a common task, you also publish those results on the wiki, it would make life a lot easier for everyone.

I would suggest that the pages essentially consist of a table with the following columns: paper reference (and link), scores in whatever the appropriate metric(s) are, and a brief description of extra resources used. If people feel compelled, they would also be encouraged to write a paragraph summary under the table with a bit more detail.

I would certainly agree to use this, and to support the effort I would be happy to go back through all my old papers and post their results on this page. It would be nice if someone (Yoav perhaps???) could initialize pages for the main tasks, so that burden is lifted.

This is a great idea. In general, wikis seem to be a great medium for keeping track of the "state of the art" in any field. There are the appropriate incentives for individual authors to post their own results in such a wiki, so this seems to be a self-sustaining approach. Whoever believes they have the best tool for some task can post their entry and gain visibility.

I was thinking of doing the same for survey papers that summarize the state of the art in a particular field. (See the related blog entry.)

One of the issues raised for maintaining such "state of the art" lists was the lack of support in current wikis for adding semantically meaningful links that can connect the different papers, techniques, tools, and so on (e.g., tool A "complements" tool B, tool C "outperforms" tool D). Still, I believe this approach has potential.

The mandate of the ACL Wiki is "to facilitate the sharing of information on all aspects of Computational Linguistics". Survey papers and state-of-the-art-repositories fit the mandate perfectly. As they say at Wikipedia, "Be bold!"

One worry with this proposal is that published results do not define the state-of-the-art; reproduced results are what is needed. All too often, published results are not reproducible or very difficult to reproduce. I have seen instances of papers that were rejected because their results were not better than a "state-of-the-art" that no one could reproduce. At the very least, state-of-the-art status requires published code and data that will yield the state-of-the-art results. That is not the standard in our field yet.

I agree with Fernando that reproducible results are far more important than claimed results, and think an "Available Software" column in the listing can go a long way in solving this issue.

Another issue that I would like to hear comments about before I bootstrap some pages in the wiki is how to deal with similar-yet-different tasks. Three instantiations of this are: (1) tasks that really have a lot in common, or that subsume each other (e.g. NP Bracketing vs. NP Chunking vs. Chunking); (2) different learning frameworks (e.g. rule-based vs. supervised vs. semi-supervised vs. unsupervised); and (3) languages other than English.

How should these be organized? Should they be considered the same task? A completely different task? Subtasks in some kind of a hierarchy? Any other suggestions?

Funny this should come up. I am trying to do the very same thing Hal is discussing here, and let me tell you, it's a mess out there. What should I read up on while unemployed?

At UC Berkeley they parse really fast, and they have a great POS tagging demo. Wow. Should I quickly learn their techniques?

I am reading Bikel et al.'s NER extraction paper. Great paper, but a thorough understanding of HMMs is required. Once you get the main technique, does the remainder of the paper (endless smoothing formulae with lambdas) just contain lab-specific solutions, and is it worth wading through?

What is the basic knowledge an NLPer out on the job market needs, anyway?

I have tried to sift out what *I* think are the highlights of the past 10 years, and really, there are not that many. I said 'highlights'; that does not mean I dismiss all the intense research as having no potential highlights.

One thing that struck me, for instance, is that Jurafsky and Gildea are working on lexical semantics, but this was tackled at BBN 14 years ago. Penelope Sibun, in what is almost an afterthought in Cutting's Xerox paper, claims good results relating arguments and assigning semantic roles.

I am lost. My interpretation of all this is: there is not all that much ground-breaking innovation, and if there is, we don't know it yet (or I don't know it yet).

I have tried to put all this together on a fledgling set of webpages. If you think that's a contribution to this conversation, great.

"One thing that struck me, for instance, is that Jurafsky and Gildea are working on lexical semantics, but this was tackled at BBN 14 years ago. Penelope Sibun, in what is almost an afterthought in Cutting's Xerox paper, claims good results relating arguments and assigning semantic roles.

I am lost. My interpretation of all this is: there is not all that much ground-breaking innovation, and if there is, we don't know it yet (or I don't know it yet)."

I suggest you re-read the Cutting paper and the latest papers on semantic role labeling (perhaps the work of Pradhan et al.). The Cutting paper reports 80% accuracy on a coarse-grained classification task, whereas modern papers report 90%+ on a more fine-grained classification task.

I am not sure where you get the impression that there has been no ground-breaking research. What about machine translation? Systems have gone from language-specific and totally unusable to robust, language-general and very much useful (though with many more improvements still needed).

Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. Entity taggers are so robust today that they can often be (and are) used out of the box for many real-world applications.

It might be true that it is rare for a single paper to be considered "ground-breaking innovation". However, I think it is simplistic to expect that. Language is complex and difficult. Though we want our solutions to ultimately be as simple as possible, we should expect the path by which we reach those solutions to be complex and, as a result, incremental. When taken as a whole, I think it would be hard to argue that the body of research over the past 10 years has not been innovative.

An interesting take on incremental research can be seen in a post by Fernando Pereira.

yoav -- available software is a big plus. i'm not sure how to handle the similar tasks -- a reasonably dense linking structure might be the way to go. imo, you should make it so that it is as easy as possible for people to add their info, even if this makes it slightly harder to find. if it's hard to enter, no one will and it will be useless. if it's easy to enter but hard(er) to find, then it's still better than combing 100s of papers, so there's still benefit.

koos/ryan: i think ryan is right. a lot of times it's somewhat hard to track progress because the problems are a bit amorphous. the same problem goes by different names; similar yet different problems by the same name. i would say that while there have been few papers over the past decade that alone have been amazingly groundbreaking, the sum progress is huge. i'm oversimplifying here, but 10 years ago things didn't work at all. today many things work well enough.

Ryan wrote in response to my posting: I suggest you re-read the Cutting paper and the latest papers on semantic role labeling (perhaps the work of Pradhan et al.). The Cutting paper reports 80% accuracy on a coarse-grained classification task, whereas modern papers report 90%+ on a more fine-grained classification task.

My reply: Thank you for your reaction (and man, do my typos look embarrassing). Please realize my post should be taken in the spirit of this discussion, which I interpret to be "how can we see the forest for the trees?"

Ryan wrote in response to my posting: I am not sure where you get the impression that there has been no ground breaking research.

My reply: There are a number of reasons why I have that impression, the main one being *I* am having a hard time seeing the forest for the trees. (This implies others may not have a similarly hard time).

As an 'industrial linguist', but not one working at a major research lab, it is hard for me to determine which particular line of research is important and will bear fruit in the (near) future. I might have formulated my anguish ( :) ) as a question very much in keeping with this particular topic: how will any serious researcher determine which papers/lines of research are the Churches, the Cuttings and the Weischedels of the present? In other words, I am not saying there is no progress per se (I did say that verbatim, but phrased it awkwardly); I am saying: what is the most effective way for an 'industrial linguist' to stay informed of significant research?

Ryan wrote in response to my posting: What about machine translation? Systems have gone from language specific and totally unusable to robust, language general and very much useful (though with many more improvements still needed).

My reply: I am all-too-happy to hear it, having done some actual work in MT. And yes, it used to be an intractable problem. My current interest, however, lies in working with other textual technologies.

Ryan wrote in response to my posting: Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. Entity taggers are so robust today that they can often be (and are) used out of the box for many real-world applications.

My reply: I am aware of this, but, in a way, my awareness is too dim. And that's in keeping with the purpose of this particular conversation: how do we see the forest for the trees?

Ryan wrote in response to my posting: It might be true that it is rare for a single paper to be considered "ground breaking innovation". However, I think it is simplistic to expect that. Language is complex and difficult. Though we want our solutions to ultimately be as simple as possible, we should expect the path in which we reach those solutions to also be complex and as a result incremental. When taken as a whole, I think it would be hard to argue that the body of research over the past 10 years has not been innovative.

My reply: You are absolutely correct in the previous paragraph. Again, though, my question is: "How do we in the field, with CTOs and CEOs that expect results, effectively wade through the deluge of papers and information to keep up?" There are several routes one can take to read/study incrementally.

Again, take my web visit to Berkeley as an example. The demo there is downright impressive. The tagger is incredibly fast, and the parser even faster. It's also accurate, and it deals with unseen data. Does this imply I should start reading their every research paper? Of course not, but then what *should* I read? Again, that seems to be what this conversation is supposed to address, correct?

Ok, I created a new Wiki category called "State of the Art" with a link from the main ACL Wiki page. I populated it with skeletons for some core NLP tasks, and started filling in some of the entries (for now some POS tagging and some Parsing; more will follow soon).

I checked out the new Wiki for results. In the POS tagging entry, I noticed Libin Shen et al.'s new tagging paper from ACL '07.

It reports an improvement from Toutanova et al.'s 97.24 to 97.33 on the same old sections of the treebank (testing on sections 22-24). I can't afford the treebank, so I'm just estimating here, but there are about 1M words and about 25 sections, so the test set is only about 120K words.

A simple binomial hypothesis test would put a one-sigma confidence interval at sqrt(.97 * (1 - .97) / 120,000), or about 0.0005. The 95% confidence interval would be 2 sigma, or about .001, or about .1%, which is just about the improvement noted in the paper.
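The back-of-the-envelope arithmetic above can be checked in a few lines of Python. This is only a sketch: the ~97% accuracy and ~120K-token test set are the rough figures estimated in this comment, not exact numbers from the papers.

```python
import math

# Normal-approximation half-width for a binomial proportion.
# p = assumed accuracy, n = assumed test-set size; both are the rough
# figures estimated above (~97% on ~120K tokens), not exact.
def ci_half_width(p, n, z=2.0):
    return z * math.sqrt(p * (1 - p) / n)

print(round(ci_half_width(0.97, 120_000, z=1.0), 4))  # one sigma, ~0.0005
print(round(ci_half_width(0.97, 120_000, z=2.0), 4))  # two sigma, ~0.001
```

The two-sigma half-width comes out to roughly the size of the reported improvement, which is the point being made here.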

So is the result "significant"? No, it's not, because the confidence interval is still too fat. For it to be a true confidence interval, the test items would have to be sampled at random. But they're not -- they're all taken from sections 22-24 of the Treebank, in which there are all kinds of temporal and topical dependencies within the whole articles making up the corpus. For instance, the same phrase shows up again and again referring to a person, but the evals treat those occurrences as independent.

Another assumption is that we don't build gazillions of systems and then choose the best one post hoc. A multi-way significance eval would be much stricter.
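One way to make that multi-way check concrete is a Bonferroni correction: if k systems were tried before the best one was reported, each comparison must clear a significance level of alpha/k, which widens the required interval. A minimal sketch, reusing the rough figures from above; the choice of k here is purely illustrative.

```python
import math
from statistics import NormalDist

# Bonferroni-corrected half-width: trying k systems and reporting the
# best one post-hoc means each comparison faces a stricter threshold.
def corrected_half_width(p, n, alpha=0.05, k=1):
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    return z * math.sqrt(p * (1 - p) / n)

print(round(corrected_half_width(0.97, 120_000, k=1), 4))   # ~0.001
print(round(corrected_half_width(0.97, 120_000, k=20), 4))  # noticeably wider
```

With even a modest number of tried-and-discarded systems, the interval grows past the fractional improvements being reported.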

I don't mean to pick on Shen et al.; I had the same reaction to Michael Collins's paper improving his parser by some fractional degree. And reimplementations of the same "idea" often have this much noise in them (e.g. Bikel's reimplementation of Collins's parser).

This reflects a problem with how our field misunderstands significant improvements. I've had papers rejected for not evaluating on a "standard" test set, even when there wasn't one.

Finally, I'd like to plead for memory and time reporting alongside results, ideally with the amount of human effort spent on feature tweaking. When I'm shopping for a technique for a commercial app, these are overriding concerns that dwarf 0.001 improvements in accuracy on an "easy" test set that matches the training data. In that vein, I'd love to see results on words not in the training set.

Wouldn't it also be nice to have information about the language for which the results were obtained? I'm new to this field, but I assume most results are language-dependent, and I can also imagine that there are languages for which performance will forever lag behind that for, say, English. Moreover, I agree that reproducibility is crucial; thus, it would be nice to have an indication of whether and where the results have been reproduced.
