20 December 2006

The holidays are upon us (hence the lack of posts), but I wanted to give a brief nod to NIPS, which I thought went pretty well this year. The highlights for me (note that I didn't attend everything, so the usual qualifications apply) were:

The Netflex Prize talk: Bennett talked about the origins of the prize and how well people are doing. They seem to firmly believe that the goal will be met. There's some indication it actually won't take all that long. Currently leading the pack at the time of the talk was wxyz, who is one of my old CMU buddies, Yi Zhang, now at UCSC. Nearly tied for first was Geoff Hinton's group up at Toronto. There weren't any details given about what people are doing to do so well, but it seems reasonable to assume that Toronto is doing something with deep belief nets (more later).

Free Lunches (Invited talk by Dan Ariely). This was probably my favorite talk of the whole conference. The gist of the talk is that it's really easy to get humans to behave in ways that defy the standard "cost/benefit analysis" setting. A few fun examples. Split a bunch of people into two groups. Have one group write down 3 things they like about their significant other; have the other group write 10 things. Then ask them how much they love their SO. The result is that the "3 things" group loves them much more (presumably because no one can actually list 10 things). There were also a lot of examples about how often people cheat; the basic result is that it seems that knowing you cannot be caught does not necessarily make you cheat more. Also, if you prime people by having them sign an honor code, they cheat less. There were many more examples. I'm not quite positive what the take-home message was, but it was a very interesting talk.

Analysis of Representations for Domain Adaptation (Blitzer et al). This is the first compelling analysis I've seen of the domain adaptation problem. The basic idea is to bound generalization error based on the distance between the source and target distributions. Quite clever and John says that they may even try to develop algorithms that explicitly minimize this bound.

Boosting Structured Prediction for Imitation Learning (Ratliff et al). A cute feature-boosting algorithm for SP problems, used for the "following a map" problem.

Large Margin Gaussian MMs for ASR (Sha, Saul). Do standard Gaussian mixture modeling for ASR, but put a large margin constraint on the Gaussians. Do a hard-EM-like thing to get fixed cluster assignments, and you can write this as a semi-definite program. Very cute.

Greedy Layer-wise Training of Deep Networks (Bengio et al). Deep belief nets can do a really good job at some vision problems, but they're quite hard to train (symmetries, etc.). The basic idea here is to initialize the training by a simple layer-at-a-time method. The thing I found especially interesting is that their initialization looks a lot like a "predict yourself" sort of strategy, which seems to be a recurring theme in a lot of unsupervised/semi-supervised learning problem.

There were also a lot of good workshops. I most enjoyed the one on covariate shift (i.e., domain adaptation). I enjoyed it so much, I plan to post separately on it in a few days (holiday schedule permitting).

20 November 2006

I gave a talk today at our small group meeting about feature engineering for coreference resolution problems. I think it is oft underappreciated by people (myself occasionally included) who spend a lot of time working on fancy machine learning algorithms that the difference between success and failure often hinges more on clever features than on fancy learning. For those who want to look at the slides, you can take an OpenOffice or PDF version for your perusal. I forewarn that it may be a bit hard to follow without the soundtrack, but it's sort of an "untold story" version of the coref paper we had at HLT/EMNLP 2005.

So what did we do that's clever (or, at least, as far clever as I knew at the time and know now)? Well, a few things. First, I tried extending the notion of Hobbs' distance to discourse trees. This actually works surprisingly well. The true referent is the first element according to syntactic Hobbs' distance in about 70-80% of the cases and we (almost) see the same thing at the discourse level. The only oddity is that you have to flip the nuclearity of the attribution relation. Then you get about 80% holding, assuming perfect discourse trees. Of course, we never have perfect discourse trees and the noise introduced by poor parsing is enough to make this feature more or less useless.

The second thing we tried was to use this name/instance data that Mike Fleischman gathered at ISI before going off to MIT. This data was gathered in the context of Q/A to answer questions like "Who is the president of the U.S.?" by using lookup-tables. But it's great data for coref! It includes about 2 million examples of names paired with their occupations. Maybe you can find this data interesting too. Mike describes it more in his ACL 2003 paper.These guys make a huge difference in coref performance.

The next thing that helped a lot was a clever, non-standard use of gazetteers. What we do is take the ID of the gazetteer to which the anaphor belongs and pair it with the lexical item of the antecedent. Why? Well, suppose we have a gazetteer that contains words like "California" and "Utah" and "New York." By pairing this list id (eg., "list-5") with antecedents, we get features like "list-5 *AND* state" or "list-5 and country" etc. We'll probably learn that the former is a good feature and the latter is not. We also have a feature that checks to see if the two referents are on the same gazetteer but are not the same word. Why? This is a negative feature. "California" and "Utah" appear on the same list (states) but are not identical (string-wise). So they're probably not coreferent.

We tried pretty hard to get WordNet to work, but it just kept not helping. WordNet distance is a terrible measure (eg., "Japan" and "Russia" have distance 2 -- they are both under "country"). On the other hand, like in the gazetteer case, "nearby in WordNet but not the same word" is a reasonably good feature, though it turned out not to help much. Hypernym and hyponym were the most promising, but we never actually got any benefit from them.

The rest of the stuff in the talk made it fairly close to the front of the paper. The most surprising result was that count-based features help a lot! These features look at things like the entity to mention ratio, the number of entities detected so far, etc. These can't be included in a simple coref model that makes pairwise decisions, but for us it's easy.

The end result is that using just string-based features (standard "substring" and edit distance stuff), lexical features (word pairs), count-based features and Mike's data, we can get an ACE score of 87.6 using LaSO for doing the learning (this goes up about 0.8 if you use Searn in most cases, but I don't have all the numbers). Adding in some inference to predict things like number and gender gets you up to 88.3. Gazetteers get you up to 88.7. After that, the rest of the features (regular expression patterns, word clusters, WordNet, discourse, and syntax) don't help...all together, they get you up to 89.1, but that's not much bang for the buck.

My conclusion is basically that while learning makes a difference (Searn is a bit better than LaSO), the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off.

[*] This is assuming that your learning is robust enough to handle the features you want: for instance, the standard pairwise models could not handle the count-based features that helped us a lot.

13 November 2006

I was listening to NPR a few weeks ago while pre-election news was still accosting my ears. Though not in my state (shockingly!), there was apparently a measure on the ballots in a lot of states referencing stem cell research. The title of the measure was, apparently, "Should any stem cell research be allowed?"

Okay, quick quiz: how did you first interpret that sentence? If you were in favor of certain varieties, but not all varieties of stem cell research, would you vote for it? In other words, does "any" mean "at least one type of" or "all types of"?

To me, and at least a few other people I've talked to, the prefered interpretation is "all"...it's also much easier to reach this interpretation if you put a bit of stress on "any." But it's also really easy to get the other interpretation by unstressing "any" or by putting another sentence in front of it (eg., "Stem cell research in all forms is purely evil." :P).

The strange thing is that I'm really having a hard time coming up with other examples where the preference for "all" is so strong (even with the stress). It seems that the fact that stem cell research is a class of things, rather than a single thing is important. It also seems that the presence of "should" is pretty much necessary. But even so, I really can't get anything else to come out with such a strong preference for "all".

So what does this have to do with NLP? Well, not too much other than language is hard. Something like this would probably kill a textual entailment system, but given that it's somewhat ambiguous even to people (the degree of ambiguity is person relative though: a Brit here tells me that he has a really hard time getting the "all" interpretation at all), maybe there's just nothing that can be done about it.

07 November 2006

The vast majority of work on summarization is at the text level. A document (or collection of documents) comes in as text, and a summary goes out as text. Although I'm certainly not one of the ones pushing summarization of speech, I think it's quite an interesting task. (Here, by speech summarization, I pretty much mean speech in, speech out ... though similar arguments could be made for text in, speech out applications.) A lot of my conclusion comes from personal experience, but I also had a really great conversation with Alan Black (who works on speech synthesis problems).

Let me motivate this by an personal anecdote. My mom tends to leave really long voicemail messages. Like, really long. Usually she overruns the alloted time and has to call back. It's not uncommon for her messages to span three mailbox slots. (Sorry, mom, if you're reading this!) The most interesting thing about these messages is that they really don't contain that much information. In fact, there's usually only one or two sentences in the whole thing that really matter (at least from an "information" sense).

The problem is that you can't "skim" voicemail. If I get an equally long email (say, one that would take me 3 minutes to read), I can fairly easily skim it to get the gist. Especially with some tailored highlighting, I'm able to absorb much more text per second than speech per second. Word on the street (or at least around the croissant table at ACL) is that studies have been done that show that people can get at information from text just as fast if its presented in a summary as if its not summarized at all, essentially because we're really good at skimming (I knew high school English taught me something!). (Incidentally, if anyone has a reference for this, let me know. I believe I saw it ages ago, but I can't remember where and I'd love to have it.)

So one solution to the "Mom's message" problem is to run automatic speech recognition and turn it into an email. This has the added advantage that I check my email much more frequently than I check my voicemail, it gives me a record of the message after I've deleted the speech signal to save space, and allows for indexing. But there are also times when I don't have easy access to a computer and I really do want listen to the voicemail. This is where speech summarization becomes interesting.

Anyway, if you're now convinced that speech summarization is interesting, well, I don't really know what to tell you since I really haven't looked at this problem. The work I am aware of (and this is a very biased sample...I apologize that I missed something) is:

Sentence-extractive automatic speech summarization and evaluation techniques by Makoto Hirohata, Yosuke Shinnaka, Koji Iwano and Sadaoki Furui (Furui has been working in this area for a while...I can't find this exact paper, but his lab has others)

01 November 2006

Okay, at long last it's time for another edition of "Getting Started In" (p.s., if you want to write one of these for your area of expertise, I'm very open to it!).

So, what is sequence labeling? It's the meta problem that we face all the time in NLP tasks where we want to assign a single label to each element in a sequence. For us, a sequence is usually a sentence and a word is an element. The elements we are trying to assign are usually things like parts of speech, syntactic chunk labels (is this part of a noun phrase, verb phrase, etc.), named entity labels (is this a person?) and so on. Information extraction systems (i.e., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.

There are roughly two varieties of sequence labeling: (1) raw labeling and (2) joint segmentation and labeling. The latter is (IMO) more common. Raw labeling is something like POS tagging where each element gets a single tag. Joint segmentation and labeling is where whole segments get the same label. For instance, in named entity recognition, a sentence like "Yesterday , George Bush gave a speech ." contains example one named entity ("George Bush"). Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to individual words (for instance, if two named abut, we need to know where they separate).

The easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word by "B-X", "I-X" or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" and "O" means "not in a phrase." For instance, the Bush sentence would be labeled as: "Yesterday/O ,/O George/B-PER Bush/I-PER gave/O a/O speech/O ./O" Once can now treat this as a raw labeling problem, perhaps being careful to avoid producing impossible sequences at test time (eg., an "I-X" can only follow a "B-X" or another "I-X"). Other encodings are also possible. So for now, I'll concentrate on raw labeling and revisit the true joint segmentation problem at the end.

In raw labeling, we're faced with the problem of assigning a single label to each word in a sequence. Perhaps the easiest approach is to predict each label independently. Then, we just have a collection of multiclass classification tasks (each label is a different class). We can use whatever classifier we want for this. Despite it's simplicity, this approach is actually quite effective for many problems.

Intuitively, we usually believe that the labels in these problems are not independent. For instance, in POS tagging, it's basically impossible to have a verb immediately following a determiner. We would therefore like to use some local sequence information to improve our performance. The "old school" approach to doing this is to use a hidden Markov model (HMM). I'll refer you to Manning+Schutze for details here (keeping in mind that really what we're using is just a Markov model...in these problems, there is nothing that's hidden). But the basic idea here is that we have two probability distributions: a transition distribution (how likely is it that a verb will follow a determiner) and an emission distribution (how likely is it that we'll see the word "the" given that we know this word should be a determiner). If our transition probabilities are local, then the Viterbi algorithm will run efficiently at test time. This locality is referred to as the "Markov assumption." Specifically, a first-order Markov assumption says that the probability of label at time t+2 is independent of label at time t given the label at time t+1. The higher Markov order you use, the harder it is to decode (complexity-wise).

The potential problem with using HMMs is that the emission probabilities are of the form p(word | tag), where we'd really like to model p(tag | word). The latter is preferable because it's easier to include lots of overlapping features (capitalization, word identity, prefixes, suffixes, stems, etc.). The partial solution is to use a maximum entropy Markov model (MEMM), where we model p(tag | word) using a maximum entropy model, but keep everything else as in an HMM. MEMMs are only slightly more complex to train than HMMs, but work a whole lot better. At training time, we essentially include features having to do with the previous labels, but otherwise this is just as in the independent classifiers approach. Viterbi search still runs at test time, so we're limited to the same Markov order constraints as HMMs.

The potential problem with MEMMs (noticing a trend here? :P) is that when the models are trained, they are trained against CORRECT previous labels. That is, when we create a classification example corresponding to the label at time t+1, we include features that depend on the label at time t. But these will always be correct at training time, but can be wrong at test time. This leads to the infamous "label bias" problem. The conditional random field (CRF) is essentially an answer to this problem. Instead of training to predict each label independently, but then running Viterbi at test time, we train to get the whole sequence right. (This is a good instance of having identical training and testing situations.) Training CRFs is a bit of a bore, since each iteration of training requires one to run the forward-backward algorithm over each training instance. But CRFs do often perform better than plain MEMMs. The UMass group has been nice enough to release their CRF implementation.

Once we get to CRFs, we can play a bunch of other games. For instance, we can essentially come up with a margin-based version of the CRF. This is roughly like moving from maxent to SVMs. The result is either max-margin Markov networks or SVMstruct, depending on exactly how you formulate the problem. The latter has a nice implementation available to play with.

So that's a whole lot of learning, but where the action really is in the features. For me, there's essentially a standard feature set I always use for these problems, and then add and subtract as I see fit. The features I usually use are the following (with examples based on the word "George": the exact word ("George"), the stem ("george"), prefixes of length 1-3 ("G", "Ge" and "Geo"), suffixes of length 1-3 ("rge", "ge" and "e") and some regexps. The most useful I use is to transform all capital letters to "A", all lower-case to "a", all numbers to "0", and all punctuation to ".". Then, we collapse identical adjacent letters. So George -> Aaaaaa -> Aa. Finally, I use list of people, places and organizations collected from various gazetteers and check whether each word falls in any of these lists. (Incidentally, these are available as part of my TagChunk program for solving joint tagging/chunking problems.) All of these features are typically applied in a window of +/- 1,2 or 3 words around the given word. You can see exactly what I calculate in this perl script.

The final issue in features is what to do with the transition features. In particular, one can think of putting the lexical features on "nodes" or "edges". By putting them on nodes, I mean that you have features like "previous-tag-is-DT current-word-is-George current-case-is-Aa". But you can also do a conjunction thing and say "previous-tag-is-DT current-word-is-George-and-previous-tag-is-DT current-case-is-Aa-and-previous-tag-is-DT". This obviously blows up your feature space, but if you have enough data, it's almost always worth doing.

Now, getting back to the segmentation issue. The fact that there are multiple valid encodings for mapping segmentation -> raw labeling is probably a bad thing. It seems that, ideally, we'd like to learn to do the segmentation directly. This is really not that hard, and it comes with a bunch of benefits, especially in what features you can employ. For instance, you can check if the whole string "George Bush" falls in a list of names. Or you can notice that both words are capitalized. These are both great "phrase-based" features. Moreevidence has been published that treating these problems as segmentation problems directly is beneficial.

26 October 2006

So, why am I interesting in how you save read papers? Well, I don't want to ruin the surprise yet. First, let's take a look at the (still incoming) results. The most popular method (roughly 60% of the population surveyed) is to save them locally. People have also pointed to some tools for archiving, though my guess is that these are probably under utilized. I'm actually a bit surprised more people don't use delicious, though I do not so perhaps I shouldn't be surprised. (Incidentally, I fall into the majority class.)

The reason I'm curious is that I spend a nontrivial amount of time browsing people's web pages to see what papers they put up. Some of this has to do with the fact that I only follow about a dozen conferences with regularity, which means that something that appears in, say, CIKM, often falls off my radar. Moreover, it seems to be increasingly popular to simply put papers up on web pages before they are published formally. Whether this is good or not is a whole separate debate, but it is happening more and more. And I strongly believe that it will continue to increase in popularity. So, I have a dozen or so researcher's whose web pages I visit once a month or so to see if they have any new papers out that I care about. And, just like the dozen conferences I follow, there are lots that fall off my radar here.

But this is (almost) exactly the sort of research problem I like to solve: we have too much information and we need it fed to us. I've recently been making a fairly obvious extension to my Bayesian query-focused summarization system that enables one to also account for "prior knowledge" (i.e., I've read such and such news stories -- give me a summary that updates me). I've been thinking about whether to try such a thing out on research articles. The basic idea would be to feed it your directory containing the papers you've read, and then it would routinely go around and find new papers that you should find interesting. Such a thing could probably be hooked into something like delicious, though given the rather sparse showing here, it's unclear that would be worthwhile.

Of course, it's a nontrivial undertaking to get such a thing actually running beyond my controlled research environment (my desktop), so I wanted to get a sense of whether anyone might actually be interested. Ross's comment actually really got my attention because it would be probably easier technologically if everything could be done online (so one wouldn't have to worry about cross-platform, etc.).

Anyway, this is something I've been thinking about for a while and it seems like a lot of the tools exist out there already.

20 October 2006

I'm going to have a go at doing a mini-poll. Basically, I'm interested in whether or not papers you have read (and, presumably, find interesting) find their way into a permanent spot on your machine or your physical space. Please vote :).

18 October 2006

Shared tasks have been increasing in popularity over the past half decade. These are effectively competitions (though perhaps that word is rightfully disdained) for building systems that perform well on a given task, for a specific data set. Typically a lot of stuff is given to you for free: the data, variously preprocessing steps, evaluation scripts, etc. Anywhere from a handful of people to dozens enter these shared tasks. Probably the most well known are the CoNLL shared tasks, but they have also taken place in other workshops (eg., the twoSMT workshops and many others). Goverment-run competitions (eg., GALE, ACE, DUC (to some degree) and others) are somehow similar, with the added bonus that money is often contingent on performance, but for the most part, I'll be talking about the community-driven shared tasks. (I'll note that shared tasks exist in other communities, but not to the extent that they exist in NLP, to my knowledge.)

I think there are both good and bad things about having these shared tasks, and a lot depends on how they are run. Perhaps some analysis (and discussion?) can serve to help future shared task organizers make decisions about how to run these things.

Many pros of shared tasks are perhaps obvious:

Increases community attention to the task.

Often leads to development or convergence of techniques by getting lots of people together to talk about the same problem.

Significantly reduces the barrier of entry to the task (via the freely available, preprocessed data and evaluation scripts).

(Potentially) enables us to learn what works and what doesn't work for the task.

Makes a standardized benchmark against which future algorithms can be compared.

Many of these are quite compelling. I think (3) and (5) are the biggest wins (with the caveat that it's dangerous to test against the same data set for an extended period of time). My impression (which may be dead wrong) is that cf. (1), there has been a huge source of interest in semantic role labeling due to the CoNLL shared task. I can't comment on how useful (2) is, though it seems that there is at least quite a bit of potential there. I know there have been at least a handful of shared task paper that I've read that gave me an idea along the lines of "I should try that feature."

In my opinion, (4) seems like it should be the real reason to do these things. I think the reason why people don't tend to learn as much as might be possible about what does and does not work is that there's very little systematization in the shared tasks. At the very least, almost everyone will use (A) a different learning algorithm and (B) a different feature set. This means that it's often very hard to tell -- when someone does well -- whether it was the learning or the features.

Unfortunately (were it not the case!) there are some cons associated with shared tasks, generally closely tied to corresponding pros.

May artificially bloat the attention given to one particular task.

Usefulness of results is sometimes obscured by multiple dimensions of variability.

Standardization can lead to inapplicability of certain options that might otherwise work well.

Leads to repeated testing on the same data.

Many of these are personal taste issues, but I think some argument can be made for them all. For (1), it is certainly true that having a shared task on X increases the amount of time the collective research community spends on X. If X is chosen well, this is often fine. But, in general, there are lots of really interesting problems to work on, and this increased focus might lead to narrowing. There's recently been something of a narrowing in our field, and there is certainly a correlation (though I make no claim of causation) with increased shared tasks.

(2) and (3) are, unfortunately, almost opposed. You can, for instance, fix the feature set and only allow people to vary the learning. Then we can see who does learning best. Aside from the obvious problem here, there's an additional problem that another learning algorithm might do better, if it had different features. Alternatively, you could fix the learning and let people do feature engineering. I think this would actually be quite interesting. I've thought for a while about putting out a version of Searn for a particular task and just charge people with coming up with better features. This might be especially interesting if we did it for, say, both Searn and Mallet (the UMass CRF implementation) so we can get a few more points of comparison.

To be more concrete about (3), a simple example is in machine translation. The sort of preprocessing (eg., tokenization) that is good for one MT (eg., a phrase-based system) may be very different from the preprocessing that is good for another (eg., syntax-based). One solution here is to give multiple versions of the data (raw, preprocessed, etc.), but then this makes the (2) situation worse: how can we tell who is doing best, and is it just because they have a darn good tokenizer (don't under-estimate the importance of this!).

(4) doesn't really need any extra discussion.

My personal take-away from putting some extra thought into this is that it can be very beneficial to have shared tasks, if we set at the beginning what are the goals. If our goal is to understand what features are important, maybe we should consider fixing the learning to a small set of algorithms. If our goal is learning, do the opposite. If we want both, maybe ask people to do feature ablation and/or try with a few different learning techniques (this is perhaps too much burden, though). I think we should definitely keep the (3) of low barrier of entry: to me, this is one of the biggest pros. I think the SMT workshops did a phenomenal job here, for a task as complex as MT. And, of course, we should choose the tasks carefully.

10 October 2006

In NLP, we often live in the idealized learning world where we have more data than we really know what to do with. The oft-cited Banko + Brill results are perhaps extreme in this regard (in the sense that we rarely have quite that much data), but we certainly have far more than most fields. The great thing about having lots of data is that large data sets support complex statistical analysis. As a stupid example, consider estimating a Gaussian. We estimate the mean and covariance (or generate a posterior over these quantities, if you prefer to be Bayesian). In a small data setting, we'd almost always approximate the Gaussian by either a diagonal, or constant diagonal covariance matrix. Especially if the number of data points is less than the number of dimensions (true Bayesians might not do this, but this is probably tangengtial). But if we have billions of data points, there's likely enough information in there to reliably estimate quite a few parameters (or approximate their posteriors) and we can do the full covariance matrix estimation.

The problem is that the full covariance estimation is computationally really expensive. Not only do we have to play with O(D^2) parameters (D is the dimensionality), but we also have to perform complex operations on the data that typically scale at least as O(N^2) (N Is the number of data points).

This is incredibly frustrating. We have the data to support a complex statistical analysis, but we don't have the computation time to actually perform the analysis. So we either throw out data to get the computation time down and do something more complex (which may now not be supported by the data) or, more often than not, do something simple on the large data set. Now, there is often nothing wrong with doing something simple, but if we cannot even try to do things that are more complex, then it's hard to say for sure whether simple is enough.

So then the question is: how can we scale. I only know a handful of answers to this question, but maybe other people can contribute some.

Get a job at Google and just use a billion machines (and/or some really clever Google engineers, ala the Google SMT system). This is obviously not a very satisfying option for everyone.

Subsample the data. This is also not very satisfying (and, perhaps, even worse than the first option).

Use a randomized algorithm, such as what Deepak did in his thesis. The message here is that if your complexity hinges on pairwise computations that look something like distance metrics, you can introduce randomization and do this in something like O(N) rather than O(N^2) time.

Use smart data structures. Things like kd-trees are becoming increasingly popular in the ML community for solving pairwise problems. The idea is to recursively divide your data space (in an intelligent fashion) so that you can store sufficient statistics about what's under a node at that node itself. (I think one reason these haven't taken off in NLP is that they appear at first glance to be much better suited to real-valued mid-dimensional data, rather than sparse, discrete, super-high-dimensional data...is there an alternative for us?)

There may be other general solutions, but I'm not aware of them. As it stands, with the exception of Deepak and few others, the solution appears to be basically to hire a bunch of smart people (and/or grad students) to do lots of engineering. But I'd prefer general principles.

06 October 2006

Just a quick pointer that was referred to me. In addition to the well known Stanford StatNLP link list, Francois-Régis Chaumartin also maintains a list of NLP resources and tools at proxem.com. Any other lists people find especially useful (I suppose this would lead to a meta-list :P)?

02 October 2006

If you feel like you have the world's greatest recommender system, you should enter the NetFlix challenge for improving their movie recs. In addition to the possibility of winning a lot of money and achieving fame, you also get an order-of-magnitude larger data set for this task than has been available to date. (Note that in order to win, you have to improve performance over their system for 10%, which is a steep requirement.) I'll offer an additional reward: if you do this using NLP technology (by analysing movie information, rather than just the review matrix), I'll sweeten the pot by $10.

29 September 2006

Dirichlet process techniques are increasingly popular tools for Bayesian analysis. There's not enough space here to describe how they work, so I'll assume you know. With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest. In many cases, the properties of interest are simply the cluster assignments. Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood. This is, of course, not really Bayesian, but it still seems a reasonable thing to do.

For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta. In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3). This algorithm is nice because it converges much faster than ones that also have to sample over theta. (I know there are other good algorithms...MH works, as do split/merge proposals.)

The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm? The answer is no.

Let's consider another way to search. We arbitrarily order the observed data, then label left-to-right over this ordering. When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices. It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search. But we can (often) do better. If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).

Here are some artificial experiments. I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data. The first search was full A*. The second was beamed A* with a beam of 5. The last was A* with an inadmissible, but in some sense tigher, heuristic, that's even easier to compute. The results are in the following graph:

The y axis is negative log probability (lower is better) and the x axis is time in seconds. This is all in matlab and I didn't do anything fancy to make either algorithm especially fast. The horizonal lines are, top to bottom, heuristic A*, beam A* and full A*. The timing are, <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).

So that's for small data. It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this). But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily. I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4. The Gibbs here actually sucks. Within about ten iterations, it gets suck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration). The heuristic A* takes about 2300 seconds total and ends with a neg log like of 707020 (and 20 clusters), quite significantly better. Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but no where near as bad as Gibbs) in only 1800 seconds.

(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum. With shorter documents, this doesn't happen as much.)

I'm still working on this a bit...I'm using DP clustering enough that this result make a huge difference for me. I think there's a lot of room for improvement, even over what I have so far. I'm also applying it to real data to see if it still helps. But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is typically not something that we would typically attack in a "left-to-right" fashion).

23 September 2006

Several months ago I became temporarily interested in trying to automatically identify if entries in online discussions are informative, interesting, humorous, etc. (This was somewhat in the context of a summarization sort of system, but the problem seems more generic.) It turns out that in the comments section of slashdot, people manually tag comments into such categories. I spent a few weeks crawling slashdot (eventually getting my IP banned because this is apparently not allowed) and grabbed a few thousand stories and associated comments. I spent a few hours building a straightforward classifier based on the comment labels. It turns out one of the hardest sorts of comments to classify correctly are the funny ones.

In general, I think identifying humor (or attempted humor) is a very hard problem. It seems to almost require a substantial amount of world knowledge and inference capabilities, since humorous comments are rarely signalled by straightforward lexical cues (though having three exclamation points or a smiley is a good indicator, these actually occur surprisingly rarely).

To get a sense of why this is so hard, let's look at some examples. These are grabbed from slashdot two days ago (the 21st).

In one article titled Motorola Unveils Phone Vending Machines (which talks about how you can buy cell phones from vending machines and they they are delivered by robotic arm rather than dropping ala sodas), we have the following comments marked humorous: "can i use the cell phones I want to purchases to purchases the cell phone I am purchasing?" and "I have a hard enough time trying to pull a big old stuffed animal out with those robotic arms much less a tiny tiny phone. At 50 bucks a pop rather than 50 cents, I'm going to waste a lot of money."

In another article about Googling for ATM Master Passwords, we have the following comments. "[Subj: The default password is...] I thought it was up, up, down, down, left, right, left, right, B, A, Start ..." (for those not of my generation, this is the cheat code for the NES game Contra and several other Konami games). Additionally, in response to "Whoever makes these ATMs deserves all the bad publicity that they get." someone comments "Might it be Diebold, by any chance?"

Finally, in commenting about the article Fish Work as Anti-terror Agents (which discusses how fish like the bluegill help detect poisonous substances in water supplies), we get comments like "In Australia, we have stingrays guarding us from pests." and "How do we know this isn't a red herring by some terroist group?" and finally "Does this mean we can carry water bottles on planes again -- if they have bluefish swimming in them?"

You may take issue with the degree to which these comments are funny, but regardless of whether they actually are funny, the certainly were intended to be funny.

What I find fascinating about all these examples is that they're essentially playing the game of drawing surprising comparisons between the article at hand and other common knowledge. For instance, the "robotic arms" comment is based on our shared experience of failing at fairs to get stuffed animals. The stingray comment is in regards to Steve Irwin's recent death, and the waterbottle joke is in reference to the new airline policies. While some (eg., the waterbottle joke) are perhaps easy to identify because they seem "off topic" somehow, other ones (like the Diebold comment or the stingray comment) really are on topic for the article, but just play against some alternative story that we're all expected to know.

I'm not sure what my conclusion is, but if you're out there looking for a really hard text classification problem for which it at least seems that a lot of knowledge and inference is required, you may find humor detection fun.

17 September 2006

bact' brings up an interesting point, perhaps more provocative than my original (intended-to-be provocative) pseudo-question. To quote, he says:

and some also said,statistical natural language processing is not language processing at all, only statistics :P

My impression is that the only sense in which this sentence is true is if you insist that what goes on inside the black box of statistical NLP is somehow explaining what goes on inside our heads. I see it as essentially parallel to the argument against "neural-style" machine learning. Some neural networks people used to claim (some still do, I hear) that what happens in an artificial neural net is essentially the same as what goes on in our minds. My impression (though this is now outside what I really know for sure) is that most cognitive scientists would strongly disagree with this claim. I get the sense that the majority of people who use NNets in practice use them because they work well, not out of some desire to mimic what goes on in our heads.

I feel the same is probably true for most statistical NLP. I don't know of anyone who would claim that when people parse sentences they do chart parsing (I know some people claim something more along the lines of incremental parsing actually does happen and this seems somewhat plausible to me). Or that when people translate sentences they apply IBM Model 4 :).

On the other hand, the alternative to statistical NLP is essentially rule-based NLP. I have an equally hard time believing that we behave simply as rule processing machines when parsing or translating, and that we efficiently store and search through millions of rules in order to do processing. In fact, I think I have a harder time believing this than believing the model 4 story :P.

Taking a step back, it seems that there are several goals one can have with dealing with language on a computer. One can be trying to carry out tasks that have to do with language, which I typically refer to as NLP. Alternatively, one can be trying to model how humans work with language. I would probably call this CogNLP or something like that. One could instead try to use computers and language data to uncover "truths" about language. This is typically considered computational linguistics. I don't think any of these goals is a priori better than the others, but they are very different. My general feeling is that NLPers cannot solve all problems, CogNLPers don't really know what goes on in our minds and CLers are a long way from understanding how language functions. Given this, I think it's usually best to confine a particular piece of work to one of the fields, since trying to solve two or three at a time is likely going to basically be impossible.

08 September 2006

There has been a trend for quite some time now toward developing algorithms and techniques to be applicable to a wide range of languages. Examples include parsing (witness the recent CoNLL challenge), machine translation, named entity recognition, etc. I know that in at least one or two of my own papers, I have claimed (without any experimental substantiation, of course :P) that there is no reason why the exact same system could not be run on languages other than English, provided a sufficient amount of labeled training data (and a native speaker who can deal with the annoying tokenization/normalization issues in the non-English language).

I get the feeling that a large part of the surge is blowback against older NLP systems, for which hundreds of/or thousands of human hours were put into writing language-specific grammars and rules and lexicons. The replacement idea is to spend thouse hundreds of/or thousands of hours annotating data, and then repeatedly reusing this data to solve different problems (or to try to come up with better solutions to an existing problem, despite the associated fears in doing so).

I think that, overall, this is a good trend. The problem that I see is that it is potentially limiting. In order to develop a system that could plausibly be applied to (nearly) any language, one has to resort to features that are universal across all languages. This is fine, but for the most part the only universal features we know of that are reasonably computable are things like "language is made up of words and words are sort of semanticy units on their own" (of course, this misses a lot of compounds in German and is hard to do in Chinese without spaces) and "words sometimes have prefixes and suffixes and these are syntactically useful" (oops, Arabic has infixes) and "capitalization is often a good indicator of something proper-noun-like" (except for German where many common nouns are capitalized or Japanese where there isn't case marking). These are sometimes compounded "adjacent words carry semantic meaning." But all in all, these features are relatively weak from the perspective of "language understanding."

This distinction seems analogous to the "domain independent" versus "domain specific" one that we've also seen. If you are willing to limit yourself to a specific domain (eg., counter-terrorism), you can probably do a pretty good job doing reasonably deep understanding. On the other hand, if you want to work at the other end---applicable across all domains---there's little you can do because you're better off going for shallow with complete coverage rather than deep but sparse. Where I think that the domain specific people have it right is that they actually do take advantage of being in a specific domain. I know that when I work on a problem that's language specific (eg., summarization or coreference), I've only seldom taken advantage of the fact that the language is English. Sure, for summarization I've occasionally made use of an English parser and for coreference I've made use of mined data that's specific to English, but overall, I treat it as pretty much "any old language." This would probably be fine if I then ran my system on Arabic and Chinese and Portuguese and showed that it worked. But I don't. This seems to tell me that I'm missing something: that I have not been clear about my goal. Maybe I should take a hint from the domain specific people and decide which side of the language independent camp I want to be on.

(The one counterargument that I will use to save face is that applying to other languages is often a lot of relatively needless work...you often have a pretty good idea of what's going to happen and I'd be surprised if people have strongly believed they've built something that's reasonably language independent and it turns out not to be.)

26 August 2006

I heard a story on NPR while driving today that referenced a recent Financial Times story about the use of natural language generation technology for producing financial stories. The financial company, Thomson, is now generating some of its stories using NLG. The NLG technology, based on the little bit of information I gathered from the NPR story and reading the associated article, seems to be straightforward template filling. This is not terribly surprising, given the stagnant nature of financial articles (grab any such article from WSJ and you see things like "The DOW went up ____ percent (___ points) because of ___." Most of this can be easily gathered from databases.

One might suspect that the final "___" (the reason) would be hard to come up with automatically, but apparently this is highly standardized as well. According to the NPR story following the one on the NLG technology, the number of reasons that fill that blank is nearly closed class ("profits" or "middle east" or "terrorism" or "bin Laden videotape" or "strong trading" or ...). And, based on some hear-say evidence, the reasons are largely bogus anyway (this is not surprising: summarizing one billion trades as being driven by a bin Laden videotape is highly suspect).

An amusing comment was made when the journalist at NPR asked the Thomson rep if his job was in danger. The response was that it was, only if he believed that his abilities topped out at filling out templates.

Anyway, if anyone knows more about the technology, and if there's anything interesting in it, it would be fun to know. Beyond that, I just thought it was fun that NLP technology made a spot on NPR.

25 August 2006

Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character. You wouldn't want to optimize for it!

This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!). I agree with Hal: the paper was a very nice piece of work technically. I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.

Everyone is familiar with the F1 measure for simple classification decisions. You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall. But now think about Named Entity Recognition. You're chugging through text, and every now-and-again there is an entity, which your system recognizes or doesn't or fantasizes. I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity. So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O). These are the non-coding regions of NER. Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/OPalo/LOC/LOCAlto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).

Things look good up until here: those events map naturally on to the false negatives (fn), true positives (tp), false negatives (fp), and false positives (fp) of the simple classification case. The problem is that there are other events that can happen. A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O). A system can notice that there isan entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O). Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O). I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).

I started thinking along these lines just as an intuitive, natural way to characterize happenings in NER output, where entities are sparse occurrences in stretches of background text. But you can make it formal (I wrote a Perl script!). Moving along the sequence, the subsequence boundaries are: (i) at start and end of document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both systems change their class assignment simultaneously, regardless of whether they agree. If you chop into subsequences like that, each can be assigned to one of the above seven classes.

Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall. In the simple classification case, that's the end of the story and the F1 measure is sensible. But when doing precision and recall over subsequences, there are these other three event types. Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit. Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.

(Two notes:

(i) As I've defined events, the possible demerits for an event in the last three classes is unbounded, though in practice 2 is the most common case. For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.

(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!). The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)

Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful. But, here, optimizing for F1 is encouraging a system to not mark entities.

Now let's look at some data. I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data. Here is how many events of each type there were:

tn 5583 tp 4792 fn 118 fp 120 le 472 be 102 lbe 75

Note in particular that over 2/3 of the errors are in those 3 extra categories that are multiply penalized. The ratios of classes vary with the task. For example, in biological NER, you tend to get many more boundary errors. But in my experience it is always the case that lots of the errors are in the last 3 classes.

Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard. For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene. Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city. In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely. In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.

Has this been noticed before? I think so. The ACE program has a system for giving partial credit. But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants.... Do these observations undermine the last decade of work in NER? I don't think so. It turns out that there are lots of measures that are pretty okay providing you do not specifically optimize for them, but are dysfunctional if you do. A well-known example is traditional readability measures.

p.s. As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it. If you're reading this post, I guess that at least shows that such content isn't actively filtered out!

24 August 2006

When I was visiting Singapore, Lan Man gave a talk at NUS right before mine (for extra fun, try to find her web site without using my link!). The work she presented is, for the purposes of this discussion, essentially a feature weighting technique for document classification.

More specifically, let's consider the standard bag of words model for text classification. Here, we have a feature set that is exactly the size of the vocabulary. The easiest thing to do is to use binary features (word present or not), but other things seem to work better. For instance, the number of times the word appears (term frequency, or tf) often works better. This is not so surprising. The problem with just using tf is that words like "the" appear a lot and they're somehow less "important." So instead, what we often do is tf*idf, where idf is the "inverse document frequency." There are a few variants of idf (that use logs, etc.) but the basic idea is to count how in many documents each word appears at least once. Then, we divide the number of documents by this count. So the idf value is always at least one, and is one only for words that appear in all documents (eg., "the"). What Lan Man was proposing was essentially a different value to use instead of idf (I don't recall the details, but it was similarly an easy-to-compute function of the corpus). The result is essentially that binary does worst, tf does next best, followed by tf*idf and then tf*her_proposal worked best (for most data sets).

The importance of using tf*idf arose in IR, where people started computing cosine similarities between word vectors. Here, downweighting things like "the" was of utmost importance, due to how cosine works. But it makes much less sense in supervised learning. That is, for linear classifiers (which are the norm, and which she used for many of her experiments), it really shouldn't matter if we apply a constant scaling factor to each feature. The learned feature weight itself is what is supposed to determine the importance of a feature. If we learn weights for tf*idf, we could easily convert them into just-as-good weights for tf by multiplying each weight by the corresponding df value.

I can think of a few explanations of this phenomenon, but none is at all satisfying (eg., by changing the scaling, we may be increasing the size of the margin while reducing the maximum norm vector size, which, for most generalization bounds, would be a good thing. Another answer is an interplay with regularization -- it becomes less "expensive" to have an overall high weight -- learned weight times df -- for infrequent informative words than for irrelevant words.). The problem is that this is a problem! It means that even for applications as simple as text categorization, not only is feature engineering an important issue, but feature weighting seems also to be.

(Incidentally, I'm curious if other fields -- eg., vision -- have similar techniques to "idf." For instance, one could imagine weighting pixel values by "idf"s of how often a pixel is on in a dataset. I wonder if this helps. If not, it may be because for smallish images, the number of features is small already. I wonder if there is an algorithm design for machine learning that would get around this problem.)

18 August 2006

Ed Hovy has a particular label he assigns to a large variety of work in NLP: it performs a change of notation. The canonical example is POS tagging, where we are changing from one notation (words) to another notation (POS tags). He also applies this notion to most forms of parsing (though this is a bit more like adding notation), machine translation (one notation is Arabic, the other English), etc. My impression (though it's sometimes hard to tell where Ed really stands on such things) is that solving a task with a change-of-notation technique is suboptimal. I've never fully grasped what Ed was trying to say specifically until recently (and I make no promises I didn't get it wrong, but this is my interpretation).

I think what Ed means by change of notation is that what is being done is pure symbol manipulation. That is, we have a bunch of symbols (typically words) and we just push them around and change them, but the symbols don't actually "mean" anything. There's no sense that the symbol "chair" is grounded to an actual chair. Or, as a logician might say, the manipulation is purely syntactic, not semantic. There is no model in which statements are evaluated. My interpretation is that by "notation" Ed simply means "syntax" (the mathematical notion of syntax, not the linguistic one).

On the other hand, it's hard not to say that everything we do, by definition, must be symbol manipulation. Computer science is all about playing with symbols. There may be a sense in which the word "chair" is grounded to an actual chair, but this actual chair, once inside the computer, will again be a symbol!

Despite this, I think that what one can think about is something like "how large is the connection between my model and the real world?" For most applications, the connection is probably limited to the specific form of training data. This is probably as close to the smallest connection possible. Sometimes people attempt to integrate ontological information, either handcrafted (via something like WordNet) or automatically derived. This is, in some sense, attempting to open the "program to world" pipe a bit more. And there is some work that explicitly focuses on learning grounding, but I'm not aware of many actual uses for this yet. Additionally, some applications cannot help but to be tied more closely to the real world (eg., robot navigation). But for the majority of tasks, the connection is quite small.

13 August 2006

What we, as NLPers, typically do vis-a-vis machine learning is what might be called "Human in the Loop" machine learning. Why? Typically, we develop a model and set of features, train it on some training data and evaluate it on some development data. We then augment/change the model and features, retrain and retest on the dev data. Only once we are satisfied with our dev results do we run it on the test data and report scores in a paper. This is "human in the loop" because in each iteration we're using our human knowledge to, say, add some additional features that will hopefully correct for errors on the development data.

To many machine learning people, this is not ideal: the whole point of machine learning is that the human is not in the loop. One can liken the "adding more features" to something akin to adding new sensors to a robot: when it is deployed on Mars, we cannot easily send a person to duct-tape on a new sensor.

But I'm not sure that this argument really applies here. We are, after all, trying to either learn new things about language or build systems that work. For both of these goals, retwiddling models and adding features is the right thing to do. If we could have an automated system to help us with this, that would be great. But I don't think we should shy away from keeping ourselves in the loop.

Perhaps another reason why machine learning tends to shy away from HITLL is that it goes against the long-standing models of supervised learning, which typically begin by saying something like "Let D be a distribution over pairs of inputs x and outputs y and let L be a labeled training set drawn iid from D." Here, the inputs (and hence the distribution) specify exactly what information we have access to in terms of features. This model essentially takes features for granted, as simply part of the learning problem. This is quite reasonable from the ML perspective, but is somehow lacking if one wants to talk about feature engineering, etc. There may be another, more apt model out there, but everything I can think of reduces trivially to the standard supervised model, unless inappropriate assumptions are made.

04 August 2006

The Document Understanding Conference features a yearly summarization competition. For the past few years, the task has been query-focused summarization of clusters of (essentially entirely) news documents. There will be a pilot task next year and based on comments made during DUC 2006, it appears it will be one of the following:

Multidocument, (probably) query-focused summarization of blog posts.

Multidocument summarization of news, with respect to known information.

The idea in (1) is that there are several "novel" aspects one has to deal with. First, blog posts are out of domain for most parsers, etc., which means we'll get noisy input but not as noisy as speech. Second, although the blog posts (the blogs would be from the TREC blog collection) will essentially all focus on news topics (saldy, NLPers is not in the corpus), they are almost certainly more emotionally fueled than vanilla news. The identification of sentiment and opinion, which are both in vogue these days, will potentially become more useful.

The idea in (2) is that in most real world situations, the user who desires the summary has some background information on the topic. The idea is that the summarization engine would be handed a collection of 5-10 documents that the user has presumably read, then 5-10 new documents to be summarized. The novel aspect of this task is, essentially, detecting novelty.

Personally, I think both are potentially interesting, though not without their drawbacks. The biggest potential problem I see with the blogs idea is that I think we're reentering the phase of not being able to achieve any sort of human agreement without fairly strict guidelines. It's unclear if, say, two viewpoints are expressed, how a summary should reflect these. The biggest problem I see with idea (2) is that it is very reminiscent of some TREC-style tasks, like TDT, and I'm not sure that doing anything more than essentially doing normal query-focused summarization with an MMR-style term to account for "known information." That's not to say these aren't worth exploring -- I think both are quite interesting -- but, as always, we should be careful.

28 July 2006

There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce low F-scores for problems like NE tagging and chunking. The technique is fairly clever and is based on the observation that you can use very similar dynamic programming techniques to do max F-score as to do max log-probability in CRFs.

The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it? Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline. (It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)

I think Chris' point is well taken. Absent any other information, conditional probably seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk (in theory, at least).

On the other hand, there is an interesting subtlety here. Conditional probability of what? The standard CRF optimization tries to maximize conditional probability of the entire sequence. The "alternative objective" of Kakade, Teh and Roweis optimizes the sub of the conditional probabilities of each label. These are two quite different criterea, and which one should we choose? In fact, neither really seems appropriate. Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right. But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well. This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.

If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks! But how do we formulate this and how do we optimize it? My impression is that a direct modification of the paper Chris was asking about would actually enable us to do exactly that. So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.

25 July 2006

Back from Sydney now, and despite being a bit jetlagged, I thought I'd begin with an COLING/ACL-related post. There was a paper presented by Ben Wellington that discusses whether current syntax-driven models for translation can capture the sorts of reorderings/alignments observed in hand-aligned data. I'm not looking at the paper, but what I remember from the talk is that something like 5-10% of sentences are missed assuming an ITG-like model, without gapping.

At the end of the talk, Dekai Wu commented that they are aware of this for ITG and that an analysis of where it happens is essentially is "weird" extrapositions that occur in English, such as wh-movement, topical preposing, adverbial movements, etc. Apparently such things occur in roughly 5-10% of sentences (from whatever corpora were being used). This talk was near the end of the conference, so my buffer was near full, but my recollection is that the suggestion (implied or explicit) was that English could (should?) be "repaired" by unextraposing such examples, at which time a model like ITG could be directly applied to nearly all the data.

To me this seems like a strange suggestion. ITG is a very simple, very easy to understand model of translation; more natural, in many regards, than, say, the IBM models. (Okay, maybe in all regards.) One nice thing about the model is that it is largely un-hackish. Given this, the last thing I'd really want to do is to fix English by applying some preprocessing step to it, so that the output could then be fed into this nice pretty model.

There's an obvious generalization to this question, hinted at in the title of this post. It seems that in many problems (translation with ITG being but one example), we can either choose a theoretically clean but maybe not completely appropriate model plus some preprocessing, or a somewhat messier but maybe 99.9% recall model with no preprocessing. I suppose one could argue that this is the long sought after integration of rule-based and statistically-based NLP, but this seems uncompelling, because it's not really an integration. I suppose that I'd almost always prefer the latter sort of solution, but that's not to say the first isn't also useful. For instance, in the ITG case, after identifying cases where the ITG assumption fails, we could try to minimally enhance the ITG model to be able to capture these corner cases.

15 July 2006

I just spent a few days visiting Yee Whye and NUS (photos to prove it!). YW and I talked about many things, but one that stood out as a ripper is attempting to answer the question: as a machine learning person (i.e., YW), what problems are there in NLP that are likely to be amenable to interesting machine learning techniques (i.e., Bayesian methods, or non-parametric methods, etc.). This was a question we tried to answer at the workshop last year, but I don't think we reached a conclusion.

At first, we talked about some potential areas, mostly focusing around problems for which one really needs to perform some sort of inference at a global scale, rather than just locally. I think that this answer is something of a pearler, but not the one I want to dwell on.

Another potential partial answer arose, which I think bears consideration: it will not be on any problem that is popular right now. Why? Well, what we as NLPers usually do these days is use simple (often hueristic) techniques to solve problems. And we're doing a sick job at it, for the well studied tasks (translation, factoid QA, ad hoc search, parsing, tagging, etc.). The hunch is that one of the reasons such problems are so popular these days is because such techniques work so bloody well. Given this, you'd have to be a flamin' galah to try to apply something really fancy to solve one of these tasks.

This answer isn't incompatible with the original answer (globalness). After all, most current techniques only use local information. There is a push toward "joint inference" problems and reducing our use of pipelining, but this tends to be at a fairly weak level.

This is not to say that Bayesian techniques (or, fancy machine learning more generally) is not applicable to problems with only local information, but there seems to be little need to move toward integrating in large amounts of global uncertainty. Of course, you may disagree and if you do, no wuckers.

13 July 2006

There are two conference formats I'm familiar with. One is typified by ACL and ICML; the other by NIPS. The ACL format is a roughly 3 day schedule, with three parallel sessions, which gives time for somewhere around 80 presentations, each 20 mins + 5 for questions. ACL has smallish poster sessions that are typically associated with "short papers." My impression is that many people do not attend the poster sessions.

In contrast, NIPS typically accepts many more papers (120ish per year), but the vast majority are presented as posters. There is only one track at the conference, with several invited talks, a few paper presentations (maybe 15 total) that last 20 minutes. They also have "spotlight presentations", which are 2 slide, 2 minute talks designed to draw attention to particularly interesting posters. The poster sessions run from (roughly) 6 until 10 every night and are attended by everyone.

My feeling is that both formats have advantages and disadvantages. I personally like the NIPS format, essentially because I feel that at the end of the conference, I have seen more of what is there (because I spend so much time at posters). This also means that I'm completely exhausted at the end of NIPS. The major disadvantage to the NIPS format seems to be that I'm used to using post-conference time as dinner socialization time, and this seems to happen somewhat less at NIPS (this is confirmed by a friend who is more in the "in crowd" at NIPS than I am). I think that it might be a little intimidating for junior researchers to go up to a "famous" poster, but on the other hand, this forces you to immediately get to know everyone. The "lots of posters" format also means that the decision about the number of papers to accept at NIPS is essentially limited by physical space, rather than length of the conference. Ken Church seems to think we should be accepting more papers, anyway.

Are there other major disadvantages to having a single-track conference, with the majority of papers presented as posters? I don't expect to see a shift in how our conferences are held, but I'd like to understand all the pros and cons.

07 July 2006

I've read lots of papers and seen lots of talks that justify a task as being useful to doing "natural language understanding." The problem I have here is that I really have no idea what NLU actually is. I think the standard line is that NLU is the process of transforming natural language text into "some representation" that is "easy" for a computer to "manipulate." Of course, the quoted words are so vacuous in meaning that I can easily imagine reasonable definitions for them that make this task either trivial or essentially impossible.

What I find unfortunate here is that, without trying hard to pin down definitions, this seems like a completely reasonable goal for NLP technology; in fact, my impression is that up until 10 or 20 years ago, this actually was one of the major goals of the field. According to the anthology, roughly 25% of the papers in 1979 contained the terms "NLU" or "language understanding", compared to 20% in the 1980s, 10% in the 90s and 7% in the 2000s. One possible explanation of the dwindling of this topic is that publication has become increasingly driven by experimental results, and if one cannot pin down a definition of a task, one cannot reliable compare results across multiple papers.

The recent push on textual entailment bears some semblance to NLU, but is simultaneously better defined and more restricted (though I admit to having heard some grumbling that some more work needs to be done to get textual entailment to be a very well defined task). There also continues to be a modicum of work on database filling and natural langauge database queries, though certainly not at the same rate as before.

03 July 2006

(Minor note: the latest version of the EDT section didn't get folded in properly to the version of the thesis that went up yesterday; this is fixed now.)

Many people have asked me how I settled on a thesis topic, I think largely because they are trying to find their own paths. My story is (at least to me) a bit odd and circuitous, but I'll tell it anyway.

When I came to USC, I knew I wanted to do NLP and, more specifically, I wanted to do summarization. Working with Daniel was a natural choice. I was particularly interested in coming up with good models for getting around the sentence extraction paradigm that dominates the field. My first work was on extending Kevin and Daniel's sentence compression work to the document level by using discourse information. This worked reasonably well. My next milestone was to try to leverage alignment information of document/abstract pairs in order to learn complex transformations. This led to the segment HMM model for doc/abs alignment, that met with reasonable success (considering how darned hard this problem is).

At that point, it became clear that trying to do a full abstractive system just wasn't going to work. So I started looking at interesting subproblems, for instance sentence fusion. Unfortunately, humans cannot reliably do this task, so I threw that out, along with a few other ideas. Around the same time, I began noticing that to have any sort of reasonably interesting model that did more than sentence extraction, I really was going to need to run a coreference resolution system. So I went to the web (this was back in 2003) and found one to download.

Except I didn't because there weren't any publicly available.

So I went to build my own. I wanted to do is "right" in the sense that I wanted a principled, joint system that could be trained and run efficiently. I read a lot of literature and didn't find anything. So then I read a lot of machine learning literature (I was beginning to get into ML fairly hardcore at this point) to find a framework that I could apply. I couldn't find anything.

So I decided to build my own thing and came up with LaSO, which is essentially a formalization and tweak on Mike Collins and Brian Roark's incremental perceptron. My thesis proposal used LaSO to solve the EDT problem, and I presented LaSO at ICML 2005. Also at ICML 2005 I met John Langford, having previously noticed that what I was doing with LaSO looked something like some recent work he had done on reinforcement learning. We had a long dinner and conversation and, after a few visits to Chicago, many emails, and lots of phone calls, came up with Searn. With both LaSO and Searn, I always had in the back of my mind that I didn't want to make any assumptions that would render it inapplicable to MT, since everyone else at ISI only does MT :).

At about the same time, I started thinking about how to do a better job of sentence compression and came up with the vine-growth model that eventually made it into my thesis. This was really the first point that I started thinking about summarization again, and, in the evolution of LaSO to Searn, it now became possible to do learning in this model.

So, in a sense, I had come full circle. I started with summarization, lost hope and interest, began to get more interested in EDT and machine learning, and then finally returned to summarization, this time with a new hammer.

27 June 2006

Our entry into the NIST MT eval this year has a recapitalization component, currently being done by a large language model (500mb, gzipped) together with SRILM's "disambig" tool. I was recently conscripted to try to improve this. After about a week's worth of work, using a bunch of tools and rules and a bunch of other stuff (though not 100% tuned), I've eeked out a 0.25 increase in BLEU scores on the 2003 and 2004 test data for three of our MT systems (phrase-based, syntax-based and Hiero). This could easily be upped to maybe 0.5 with a few more hours of work (there are tokenization differences between my training and test data).

The story that it's hard to beat an n-gram language model is fairly ubiquitous. In fact, my most useful features for classification are based on the n-best outputs of disambig using the large LM. There are older results that suggest that once you have enough data, using stupid techniques is sufficient. This seems to be more true for NLP than for many other areas I know about, though perhaps this is because there exist NLP problems for which it is even possible to get a "sufficient" amount of data.

While this story is true for English, I'm not sure that it would hold for other languages. My guess is that n-gram models work significantly worse in languages with more free word order and (though this usually comes as a package deal) stronger morpology. As a counter-argument, though, some of my friends who have contact with people who do commercial speech recognition in languages other than English do actually use vanilla n-gram models in those other languages. I don't know whether this is for lack of an alternative or just because they remain so good even outside of English.

I don't know the secret sauce to beating an n-gram. They appear to work so well because language (where, by "language" I mean "English") is really not that ambiguous, in a Shannon-experiment sense. Moreover, raw English words seem to really be just about right when it comes to information content. Sure, we can get some mileage by looking at suffixes or bigrams, but, comparing to, say, German (where one word can correspond to 2-3 English words), English really seems to strike the right balance of amount of information in a word to sparsity (where "right" = "right for NLP"). I'm sure other languages fall fairly close by and I recall seeing a study comparing Shannon-style tests in multiple languages (anyone know a pointer?), but, if pressed, this is my guess as to why n-grams work so well. To beat them, it seems we are almost forced to look at problems with less data, or problems which are significantly noisier.