28 February 2007

I feel like this is starting to be a bit of a dead horse, but I wanted to follow up a bit on previous posts talking about f-score versus accuracy for chunking problems.

An easy observation is that if you have a chunking problem for which the majority of the chunks are multi-word tokens, then it is possible to get a model that achieves quite good accuracy, but abysmal f-score. Of course, real world taggers may not actually do this. What I wanted to know was the extent to which accuracy, f-score and ACE score are not correlated when used with a real world tagger.
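To make the easy observation concrete, here is a minimal sketch with made-up data (the tag sequences, the BIO encoding and the helper names are all assumptions for illustration, not output from any real tagger). Every gold chunk is five words long, and the hypothetical tagger gets exactly one word wrong inside each chunk:

```python
def chunks(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def hamming_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold, pred):
    """Chunk-level F1: a predicted chunk counts only if span and type both match."""
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Four five-word chunks (hypothetical); one mislabeled word per chunk.
gold = ["B-NP", "I-NP", "I-NP", "I-NP", "I-NP"] * 4
pred = ["B-NP", "I-NP", "O", "I-NP", "I-NP"] * 4

accuracy = hamming_accuracy(gold, pred)
f1 = f_score(gold, pred)
print(accuracy, f1)
```

Four out of every five tags are right, yet no predicted chunk matches a gold chunk exactly, so the chunk-level F is zero.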

Here's the experiment. I use TagChunk for all the experiments. I do experiments on both syntactic chunking (CoNLL data) and NER (also CoNLL data), both in English. In the experiments, I vary (A) the amount of training data used, (B) the size of the beam used by the model, (C) the number of iterations of training run. When (A) is varied, I run five sets of training data, each randomly selected.

The data set sizes I used are: 8, 16, 32, 64, 125, 250, 500, 1000, 2000, 4000 and 8000. (These are numbers of sentences, not words. For the NER data, I remove the "DOCSTART" sentences.) The beam sizes I use are 1, 5 and 10. The number of iterations is from 1 to 10.

For the chunking problem, I tracked Hamming accuracy and f-score (on test) in each of these settings. These are drawn below (click on the image for a bigger version):

As we can see, the relationship is ridiculously strongly linear. Basically once we've gotten above an accuracy of 80%, it would be really really hard to improve accuracy and not improve F-score. The correlation coefficient for this data is 0.9979 and Kendall's tau is 0.9795, both indicating incredibly strong correlation (formal caveat: these samples are not independent).
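For anyone who wants to compute the same two statistics on their own runs, here is a rough sketch. The (accuracy, F) pairs are toy numbers invented for illustration, not the data behind the plot, and the tau here is the simple tau-a variant with no tie correction, which is an assumption since the variant used above isn't stated:

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy (accuracy, F) pairs, one per hypothetical experimental setting.
toy_accuracy = [0.62, 0.71, 0.80, 0.88, 0.93]
toy_f = [0.35, 0.52, 0.70, 0.83, 0.91]

r = pearson(toy_accuracy, toy_f)
tau = kendall_tau(toy_accuracy, toy_f)
print(r, tau)
```

Pearson measures how close the points are to a straight line, while tau only looks at whether the two metrics rank the settings in the same order, which is why it is worth reporting both.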

For the NER task, I do the same experiments, but this time I keep track of accuracy, F-score and (a slightly simplified version of) the ACE metric. The results are below (again, click for a bigger version):

The left-most image is accuracy-versus-F, the middle is accuracy-versus-ACE and the right is F-versus-ACE. The ACE seems to be the outlier: it produces the least correlation. As before, with accuracy-to-F, we get a ridiculously high correlation coefficient (0.9959) and tau (0.9761). This drops somewhat when going to accuracy-to-ACE (corr=0.9596 and tau=0.9286) or to F-to-ACE (corr=0.9715 and tau=0.9253).

Nevertheless, the majority of the non-linearity occurs in the "very low accuracy" region. Here, that region is in the 0.8-0.9 range, not the 0.1-0.5 range as in chunking. This is because in chunking, almost every word is in a chunk, whereas in NER there are a ton of "out of chunk" words.
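A tiny worked example of why the floor sits so high for NER, using a made-up 20-token sentence (the tags and counts are assumptions for illustration only). A degenerate tagger that labels every word "O" already gets most tokens right while finding no entities at all:

```python
# Hypothetical 20-token sentence with two entities covering three tokens.
gold = ["O"] * 8 + ["B-PER", "I-PER"] + ["O"] * 6 + ["B-LOC"] + ["O"] * 3
pred = ["O"] * len(gold)  # degenerate tagger: everything is out-of-chunk

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold_entities = sum(tag.startswith("B-") for tag in gold)
pred_entities = sum(tag.startswith("B-") for tag in pred)

# No entities are predicted, so none can be correct: recall is 0 and so is F.
correct = 0
recall = correct / gold_entities
f1 = 0.0

print(accuracy, gold_entities, pred_entities, recall, f1)
```

In chunking the same trick buys almost nothing, because nearly every word belongs to some chunk.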

The take-away from these experiments is that it seems like, so long as you have a reasonably good model (i.e., once you're getting accuracies that are sufficiently high), it doesn't really matter what you optimize. If your model is terrible or if you don't have much data, then it does. It also seems to matter a great deal whether the end metric is F or ACE. For F, it's pretty much always okay to just optimize accuracy. For ACE, it's not so much, particularly if you don't have sufficient data.

25 February 2007

I feel it is increasingly interesting to look at problems that have both a linguistic and a non-linguistic signal. The simplest example I can think of here is recent work in looking at images and their captions (eg., the matching words and pictures approach, though this is not the only such example). The idea in many of these techniques is that we have a single data set that contains two types of data. One is language (eg., the captions) and the other is something else (eg., the images). There are many other potentially interesting sources of data like this that I'll discuss after saying why I think these problems are interesting.

I think there are two basic reasons why such "dual" problems are interesting:

1. By looking at multiple modalities, we can get cross-modality reinforcement of what we're learning. I think a great example of that is this paper, which uses face recognition together with textual coreference in order to do unsupervised coref in captions. The idea is that if you can tell that two different pictures contain the entity we call "Bill Clinton" in them, then it's more likely that a name in their caption will corefer. This gives us a greater ability to share data across a single data set.

2. When language is treated in the context of some other source of information, we get a sort of grounding effect. That is, continuing the coref example from #1, we have--in some sense--grounded the string "Bill Clinton" to a pictorial representation of a person. Sure, this representation is still just a bunch of symbols, but they're markedly different symbols from the ones we use to represent the entity in purely linguistic terms. Perhaps hard-core cognitive scientists wouldn't call this grounding, but it's getting there. (Also along these lines, one of my favorite papers from a few years ago was my ex-ISI friend Mike doing grounding by looking at language and action in video games.)

There are probably other reasons why such tasks are interesting, but these are the two that appeal to me most strongly.

It seems that vision+language is the hot topic, or at least the warmest of the bunch. This is probably because vision people and language people tend to overlap and meet at machine learning conferences. It's probably also because the sorts of techniques used by the two communities are perhaps the most similar. But I think there's lots of room for looking at other "X"s. For instance, biology. There are lots of data sets (eg., GEO) that contain textual information ("these are cell lines depicting ovarian cancer") plus the actual cell lines. Heck, many papers in PubMed contain such information, albeit in figures rather than matrices. Robotics is another option. Ray Mooney's group down in Texas has worked at understanding orders given to robocup competitors based on language information (eg., this paper). Perhaps the oldest that actually lives within the NLP community is NLP + databases, which we really haven't seen much of in the past 5-10 years.

I think this is an interesting and fruitful area of future research and is one that I'll probably be exploring myself (but I won't give away what "X"s!).

13 February 2007

I've posted on structured prediction a few times before and then some. This is another post, but with a different perspective. Besides, the most recent time was last May, so it's been a while. This post is prompted by a recent OpEd by John Lafferty and Larry Wasserman on Challenges in Machine Learning in Statistica Sinica, where they mention SP as one of three major open questions in machine learning (the other two are semi-sup and sparse learning in high dimensions).

I've been trying to think of a good way of ontologizing SP methods recently, primarily because I want to understand them better. The more I think about it, the more I feel like we really don't understand the important issues. This is partially evidenced by an observation that in the past six years (essentially, since the first CRF paper), there have been like 30 different algorithms proposed for this problem. The fact that there are so many proposed algorithms tells me that we don't really understand what's going on, otherwise we wouldn't need to keep coming up with new techniques. (I'm not innocent.)

To me, the strongest division between techniques is between techniques that require tractable inference in the underlying problem (eg, CRFs, M3Ns, SVM-struct, structured perceptron, MIRA, MEMMs, HMMs, etc.) and those that don't (eg, incremental perceptron, LaSO, Searn, the Liang/Lacoste-Julien/Klein/Taskar MT algorithm, local predictors à la Roth and colleagues, etc.). Whether this truly is the most important distinction is unclear, but to me it is. I think of the former set as a sort of "top down" or "technology-push" techniques, while the latter are a sort of "bottom up" or "application-pull" techniques. Both are entirely reasonable and good ways of looking at the problem, but as of now the connections between the two types are only weakly understood.

An alternative division is between generative techniques (HMMs), conditional methods (CRFs) and margin-based techniques (everything else, minus Searn which is sort of non-committal with respect to this issue). I don't really think this is an important distinction, because with the exception of generative being quite different, conditional methods and margin-based methods are essentially the same in my mind. (Yes, I understand there are important differences, but it seems that in the grand scheme of things, this is not such a relevant distinction.)

A related issue is whether the algorithms admit a kernelized version. Pretty much all do, and even if not, I also don't see this as a very important issue.

There are other issues, some of which are mentioned in the Lafferty/Wasserman paper. One is consistency. (For those unfamiliar with the term, a consistent algorithm is essentially one that is guaranteed to find the optimal solution if given infinite amounts of data.) CRFs are consistent. M3Ns are not. Searn is not, even if you use a consistent classifier as the base learning algorithm. My sense is that to statisticians, things like consistency matter a lot. In practice, my opinion is that they're less important because we never have that much data.

Even within the context of techniques that don't require tractability, there is a great deal of variation. To my knowledge, this family of techniques was essentially spawned by Collins and Roark with the incremental perceptron. My LaSO thing was basically just a minor tweak on the IP. Searn is rather different, but other published methods are largely other minor variants on the IP and/or LaSO, depending on where you measure from. And I truly mean minor tweaks: the essential difference between LaSO and IP is whether you restart your search at the truth when you err. Doing restarts tends to help, and it lets you prove a theorem. We later had a NIPS workshop paper that restarted, but allowed the algorithm to take a few steps off the right path before doing so. This helped experimentally and also admitted another theorem. I've seen other work that does essentially the same thing, but tweaks how updates are done exactly and/or how restarts happen. The fact that we're essentially trying all permutations tells me that many things are reasonable, and we're currently in the "gather data" part of trying to figure out the best way to do things.
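To show how small the IP-versus-LaSO tweak is, here is a toy sketch of beam-search training for sequence labeling, where a single flag decides whether to restart at the truth after an error. This is a simplified illustration and not either published algorithm in full: the emission/transition features, the perceptron-style update, the three-word sentence, the tag set and every function name are assumptions made up for the example.

```python
from collections import defaultdict


def features(words, tags):
    """Count emission and transition features for a (possibly partial) tag prefix."""
    feats = defaultdict(int)
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats


def score(weights, words, tags):
    return sum(weights[f] * c for f, c in features(words, tags).items())


def update(weights, words, gold_prefix, pred_prefix):
    """Perceptron-style update: toward the gold prefix, away from the predicted one."""
    for f, c in features(words, gold_prefix).items():
        weights[f] += c
    for f, c in features(words, pred_prefix).items():
        weights[f] -= c


def train_example(weights, words, gold, tagset, beam_size, restart_at_truth):
    """One pass over one sentence; returns the number of updates made."""
    beam, updates = [()], 0
    for i in range(len(words)):
        candidates = [prefix + (t,) for prefix in beam for t in tagset]
        candidates.sort(key=lambda p: score(weights, words, p), reverse=True)
        beam = candidates[:beam_size]
        truth = tuple(gold[: i + 1])
        last = i == len(words) - 1
        # An error: the truth fell off the beam, or it is not on top at the end.
        if truth not in beam or (last and beam[0] != truth):
            update(weights, words, truth, beam[0])
            updates += 1
            if not restart_at_truth:
                break           # IP-style: stop processing this sentence
            beam = [truth]      # LaSO-style: restart the search at the truth
    return updates


def run(restart_at_truth, epochs=5):
    """Train on one hypothetical sentence; return updates made in each epoch."""
    weights = defaultdict(float)
    words, gold = ["the", "dog", "barks"], ["D", "N", "V"]
    return [train_example(weights, words, gold, ("N", "V", "D"), 1, restart_at_truth)
            for _ in range(epochs)]


ip_updates = run(restart_at_truth=False)
laso_updates = run(restart_at_truth=True)
print(ip_updates, laso_updates)
```

With restarts, one pass over the sentence can trigger several updates, because the search keeps going from the truth after each mistake. Without them, each pass yields at most one.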

05 February 2007

I was recently asked by a colleague if I knew what the first paper was that used the bag of words model. I'm pretty certain it would be an IR paper, but have no idea what it would be. Manning+Schutze and Jurafsky+Martin don't have it. I know tf-idf is due to Sparck-Jones, but presumably BOW existed before that. The vector space model is often credited to Salton, which is probably the earliest thing I know of, but my guess is that BOW predated even that. Anyone know a citation?

03 February 2007

Errors happen and sometimes get into papers. A recent example is the JAIR paper I had with Daniel on Domain Adaptation last year. I actually didn't catch the error myself -- it was caught by someone who was reimplementing the technique. And it's a totally not-insignificant error: essentially, the update equation for the generative parameters is completely botched. If you look through the derivation in the Appendix, it's clear where the error crept in.

Thankfully, this sort of error is essentially a typo. That is, the error was introduced when I was typing up the paper, not when I was doing the research. Why this is important is that it means the implementation reflects the correct updates: only the paper has the mistake. This means that the experimental results from the paper are valid, provided that you rederive the updates yourself, or just ask me what they should be.

I'm writing this post because it's somewhat unclear what to do when such a thing arises. One temptation is to do nothing. I have to admit that I was completely embarrassed when this was pointed out to me. There was a part of me that wanted to ignore it. It seems that this is the wrong approach for a variety of reasons, not the least of which is to make sure that correct information does get out. The question, to some degree, is exactly how to do this. I have a blog, which means I can write an entry like this. I can also put an erratum on my web page that points out the errors (I'm writing this up as we "speak"). Given that this is a pub in an online journal, I believe I am able to submit updates, or at least additional appendices, which means that the "official version" can probably be remedied.

But what about conference pubs? If this had appeared in ACL and I didn't have a blog, the situation would be something different (ironically, an earlier version with the correct updates had been rejected from ACL because the derivations were omitted for space and two reviewers couldn't verify them). Also, what if someone hadn't pointed it out to me? I certainly wouldn't have noticed -- that paper was behind me. But then anyone who noticed the errors might dismiss the results on the grounds that they could assume that the implementation was also incorrect (it's not inconceivable that an erroneous implementation can still get good results). This would also not be good because the idea in the paper (any paper with such errors) might actually be interesting.

False things are published all the time. The STOC/FOCS community (i.e., theory community) has a handful of examples...for them, errors are easy to identify because you can prove the opposite of any theorem. I recall hearing of a sequence of several papers that incrementally used results from a previous one, but the first was in error, putting the rest in error (I also recall hearing that many of the subsequent results could be salvaged, despite the ancestral mistake).

I don't know if there's a good solution, given our publication mechanisms (essentially, publish-once-then-appear-in-the-anthology). But I'm pretty sure mine is not the first paper with such errors. At least I hope not :).