23 June 2008

This is a request to the community. If you have published a paper in an ACL-related venue in the past ten years or so, please consider contributing the LaTeX source. Please also consider contributing talk slides! It's an relatively painless process: just point your browser here and upload! (Note that we're specifically requesting that associated style files be included, though figures are not necessary.)

21 June 2008

This is the first in what I hope will be a long line of "Area Chair Summaries" from NLP-related conferences. If you would like to contribute one, please contact me!

For ACL this year, I was lucky to be the Area Chair for the summarization track. I know I've said it before, but we really got a ton of great papers this year. In the end, seven were accepted for presentation (note there are also some summarization-related papers that officially fell under "Generation" for which I was not the area chair). I would like to say that there was some sort of unifying theme this year, but there's not one that I can actually come up with. The papers were:

Most of these you can guess the contents more-or-less by their titles, but here's a quick run-down. Nomoto's is probably the hardest to guess. I dare-say it actually sounds a bit boring from just the title; it leads me to think it's yet another sentence compression method. But if you start reading the paper, you find out all about compression under dependency structures, summarization of Japanese text, and an fairly thorough evaluation.

The Titov and McDonald paper attempts to model associations between fine-grained user reviews of restaurants (eg: how do you rate the food versus the ambiance?) and the actual text in the reviews. This enables them to produce summaries that are specific to particular aspects of the review.

Biadsi, Hircshberg and Filatova present a model for producing biographies, by trying to identify biography-like sentences from Wikipedia (a source that's gaining more an more attention these days). One aspect that I found most interesting here was that they attempt to do full-on reference resolution and referring expression generation. This has always been something I've been too scared to touch, but they actually present some results that show that it's worthwhile!

Mei and Zhai talk about a sentence-retrieval method for summarizing scientific documents, which they gathered from DBLP. They take advantage of citation sentences (called "citances" by Marti Hearst and also explored deeply by Simone Teufel) and make a "citation" language model. This language model is interpolated with the standard document language model to perform extraction. This allows them to extract sentences that readers care about, not what the author thinks you should care about. The key limitation, of course, is that it only works once the paper has been cited for a while. (It would be nice to see how many citations you need before it's worth doing this.)

The paper by Nenkova and Louis describes a model for predicting if a batch of documents is going to be difficult (for a machine) to summarize. This is akin to the notion of query-difficulty that you see in IR. The results are about what you'd expect, but it's nice to see them played out. In particular, you see that more cohesive sets of documents are easier to summarize. Read the paper for more.

Penn and Zhu look at how well Rouge works when trying to summarize speech. They look at both Switchboard data (telephone conversations) and lectures. They have many interesting findings that cast doubt not only on the role that Rouge plays in speech summarization, but also on what sort of baselines are reasonable for speech, and what role meta-data plays in speech summarization. If you care at all about the intersection of speech and summarization, this truly is a must-read.

Last but not least, Carenini, Ng and Zhou discuss the task of summarizing emails (following up on previously work by the same authors on roughly the same task). Since the most relevant past work appeared in WWW07, I'll actually comment on that too (maybe unknown to many readers). There is quite a bit of work here, that primarily aims at finding useful features for summarizing chains of emails. They look at things like cue words, semantic similarity, lexical similarity, "PageRank" measures and a handful of others. This is all done in a graph-based framework that is pieced together based on processing the email chain (from what I can tell, this is highly non-trivial).

After writing this, I think that maybe the connecting thread is that there's a lot of work that aims at doing summarization for things other than straight news stories. I think this is a fantastic step for the community. Keep up the good work, guys!

18 June 2008

Just officially announced. Minor details: as of Mar 2009 (first issue next year), there will be no print version (electronic only) and will be open access. Also switched to an online electronic management system.

Obviously I think this is fantastic. Many many thanks go to Robert Dale and the other ACL and CL board members for making this happen.

11 June 2008

These days, the "conference blog" has come to be a fairly accepted part of an academic blogger's repertoire. I've actually received an email on the third day of a conference that people knew I was attending asking "why haven't you blogged yet!" Colleagues who blog have shared similar stories. How did the world manage without us?!

I occasionally like to read through old papers, like from the late 70s, early 80s, mostly in stats, but occasionally in NLP or ML. Mostly it's fun; often I learn things. The old NLP paper typically amuse me more than anything else, but it's a bit of a relief to see the set of things that were considered interesting or important 20-30 years ago. I was browsing through the ACL 1980 proceedings (I won't share the answer to "how old were you then?") on the anthology and came across a paper titled "Parsing." I thought that was quite impressive that someone would be audacious enough to title their paper "Parsing." I mean, can you imagine going to a talk at ACL 2008 and seeing the title "Machine Translation?" It's absurd.

Well, it turns out it's not really a research paper. It's the 1980 ACL's parsing area chair's impression of all the parsing papers that year, and how they related to the previous year.

"Fantastic!" I thought.

These are like pre-web-era conference blogs. But it's not just some random guy blogging about his favorite ACL papers across a variety of areas, but rather the one person who probably knew all the parsing papers in ACL that year better than anyone. Having area chaired myself, I know what all the summarization submissions were about, and I know quite well what all the accepted submissions were about. (Or, at least, I did right after we issues accept/reject notifications.) If I had sat down right then and written a page summary, relating the the past year, it probably would have taken about two hours. (An average blog post takes 30 mins, but I figure I would want to be more diligent, and also be sure to look up what happened the year before.)

I would actually love to see something like that reinstated. I think it captures an important area that's missing from you standard conference blogs -- namely, that you get coverage. As I age, I attend fewer and fewer talks, so my conference blog posts are necessarily really spotty. But if a chair from every area were to write a one page summary, I think that would be great. And I really don't think it's that much added effort -- I easily spent 10-20 times that going over reviews, looking for inconsistencies, reading papers, etc.

However, while I think it's useful, I also don't think that it's really something that needs to be in our proceedings. So I'm going to try to start a trend. Every time I am area chair for a conference, I will post to the blog a summary of the papers in my area. If you ever are an area chair, and would like to participate, please contact me. Perhaps I'll be the only one who ever does this, but I hope not. I'll start with ACL 2008 summarization as my next post.

10 June 2008

I think it's fair to say that I am a fan of the Bayesian lifestyle. I have at least a handful of papers with "Bayesian" in the title, and no in the misleading "I used Bayes' rule on a noisy-channel model" sense.

It's probably also fair to say that I'm a fan of NLP.

So... let's think about it. A fan of Bayes, a fan of NLP. He must do topic modeling (ie LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.

The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is because it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss how one can try to evaluate topic models.

At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have . In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as and let's assume p and q are not conjugate (if they were, life would be too easy).

Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or more precisely, are the better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate according to the model. Whichever model assigns higher probability to the heldout data is probably better.

At this point, we need to take a moment to talk about inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes p(a|w). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.)

Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute , we have to do more inference. In particular, we have to marginalize over the zs for the heldout data. In the MC camp, this would mean taking our samples to describe a posterior distribution on a given w (marginalizing out z) and then using this posterior to evaluate the heldout likelihood. This would involve another run of a sampler to marginalize out the zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (eg., running EM on the test data).

All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM, is going to be computationally expensive. Life is probably slightly better in the deterministic camp where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and, (b) (even worse!) it's just plain hard to evaluate evidence in a sampler. At least I don't know of any really good ways and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).

So, what recourse do we have?

One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are other issues, too, though these are less irksome to me in a topic-modeling setting.)

Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (eg) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use these as parameters to, say, logistic regression and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.

The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.

The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. We then evaluate likelihood according to q instead of p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). In most cases, we'll have this. If we have an ML-II estimate of a, we just follow the topic models' generative story; if we have samples over a, we just use those samples. Easy enough.

So, what we're going to do is generate a ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is really easy. We just take our test documents, compute their average likelihood under each of these faux multinomials, and voila -- we're done!

This method is, of course, not without it's own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.

But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC.