10 June 2008

I think it's fair to say that I am a fan of the Bayesian lifestyle. I have at least a handful of papers with "Bayesian" in the title, and not in the misleading "I used Bayes' rule on a noisy-channel model" sense.

It's probably also fair to say that I'm a fan of NLP.

So... let's think about it. A fan of Bayes, a fan of NLP. He must do topic modeling (i.e., LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.

The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is because it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss how one can try to evaluate topic models.

At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have $p(w \mid \alpha, \eta) = \int d\theta \, d\beta \; p(\theta \mid \alpha) \, p(\beta \mid \eta) \prod_n \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid \beta_{z_n})$. In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as $p(w \mid a) = \int dz \, p(z \mid a) \, q(w \mid z)$, and let's assume p and q are not conjugate (if they were, life would be too easy).

Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or more precisely, are they better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate its probability according to the model. Whichever model assigns higher probability to the held-out data is probably better.

At this point, we need to take a moment to talk about inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes the marginal likelihood p(w|a). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.)

Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute the likelihood p(w|a) of the held-out data w -- we have to do more inference. In particular, we have to marginalize over the zs for the held-out data. In the MC camp, this would mean taking our samples to describe a posterior distribution on a given the training w (marginalizing out z) and then using this posterior to evaluate the held-out likelihood. This would involve another run of a sampler to marginalize out the zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (e.g., running EM on the test data).

All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM is going to be computationally expensive. Life is probably slightly better in the deterministic camp where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and (b) (even worse!) it's just plain hard to evaluate evidence in a sampler. At least I don't know of any really good ways and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).

So, what recourse do we have?

One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are other issues, too, though these are less irksome to me in a topic-modeling setting.)
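As a rough illustration of the per-document word hold-out, here is a minimal Python sketch. The function names (`split_words`, `heldout_log_likelihood`) are hypothetical, and it assumes that normal inference on the observed words has already produced a point estimate of each document's topic proportions (`doc_topic`) and smoothed per-topic word distributions (`topic_word`) that cover the whole vocabulary.

```python
import math
import random

def split_words(doc, holdout_frac=0.1, rng=None):
    """Split one document (a list of tokens) into observed and held-out tokens."""
    if rng is None:
        rng = random.Random(0)
    # Hold out at least one token, unless the document is too short to split.
    n_held = max(1, round(holdout_frac * len(doc))) if len(doc) > 1 else 0
    held_idx = set(rng.sample(range(len(doc)), n_held))
    observed = [w for i, w in enumerate(doc) if i not in held_idx]
    held_out = [w for i, w in enumerate(doc) if i in held_idx]
    return observed, held_out

def heldout_log_likelihood(held_out, doc_topic, topic_word):
    """Sum of log p(w | document) over held-out tokens.

    doc_topic:  list of topic proportions for this document (assumed to come
                from inference on the observed tokens).
    topic_word: list of dicts, one per topic, mapping word -> probability
                (assumed smoothed, so every word has nonzero probability).
    """
    total = 0.0
    for w in held_out:
        p_w = sum(theta_k * topic_word[k][w] for k, theta_k in enumerate(doc_topic))
        total += math.log(p_w)
    return total
```

Because the document-level variables are already in hand, the held-out words only need a sum over topics, not another run of inference.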

Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (e.g.) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use these as features for, say, logistic regression and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.

The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.)

The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. We then evaluate likelihood according to q instead of p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). In most cases, we'll have this. If we have an ML-II estimate of a, we just follow the topic model's generative story; if we have samples over a, we just use those samples. Easy enough.

So, what we're going to do is generate a ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is really easy. We just take our test documents, compute their likelihood under each of these faux multinomials, average, and voila -- we're done!
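To make the recipe concrete, here is a minimal Python sketch of the idea, not the HBC or Pachinko Allocation implementation. It assumes an LDA-style generative story with known hyperparameters `alpha` and known topic-word distributions `topics` (standing in for whatever training-time inference produced); all function names are made up for illustration. The additive `smoothing` in `fit_multinomial` is also an assumption, added so that words missing from a short faux document do not get zero probability.

```python
import math
import random
from collections import Counter

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_faux_document(alpha, topics, vocab, length, rng):
    """Follow an LDA-style generative story for one faux document."""
    theta = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]   # pick a topic
        doc.append(rng.choices(vocab, weights=topics[k])[0])    # pick a word
    return doc

def fit_multinomial(doc, vocab, smoothing=0.01):
    """Smoothed bag-of-words estimate; returns word -> log probability."""
    counts = Counter(doc)
    total = len(doc) + smoothing * len(vocab)
    return {w: math.log((counts[w] + smoothing) / total) for w in vocab}

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def empirical_log_likelihood(test_doc, components):
    """Log of the average likelihood of test_doc under evenly weighted multinomials.

    Assumes every word of test_doc is in the vocabulary used for the components.
    """
    per_component = [sum(logp[w] for w in test_doc) for logp in components]
    return log_sum_exp(per_component) - math.log(len(components))

# Toy usage with made-up topics over a four-word vocabulary.
rng = random.Random(0)
vocab = ["gene", "worm", "ball", "game"]
topics = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
alpha = [0.1, 0.1]
components = [
    fit_multinomial(sample_faux_document(alpha, topics, vocab, 50, rng), vocab)
    for _ in range(200)
]
on_topic = empirical_log_likelihood(["gene", "worm", "gene", "worm"], components)
mixed = empirical_log_likelihood(["gene", "ball", "worm", "game"], components)
```

With a sparse `alpha` like this, most faux documents stick to one topic, so a test document that mixes both topics should score lower than one that stays on a single topic.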

This method is, of course, not without its own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.

But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC.

This is a task eval if you're doing language modeling, which has been a hotbed of applied clustering research in both acoustic and language modeling.

The most compelling applications of LDA are natural ones like coreference (e.g. Haghighi and Klein's LDA coref paper) or collaborative filtering (as evaluated in the original Blei et al. paper and as compared to pLSA, SVD, etc. in subsequent papers).

Steyvers and Griffiths' Gibbs LDA paper showed that the Gibbs samples from LDA are surprisingly stable topic-wise if you allow alignment by KL divergence. They also showed that on synthetic data that matched the model, the inference procedure would uncover the underlying structure. We replicate both these results on different corpora in LingPipe's LDA tutorial.

The promise of a pure Bayesian approach is that you can estimate the probability that two objects are in the same cluster. That is, you get doc-doc and word-word similarity out the other end. Steyvers and Griffiths use human word association tests to evaluate the word-word similarities. Doc-doc similarities are natural implementations of "more like this" or query refinement for search, which still begs the question of evaluation.

Blei et al.'s nematode abstract LDA paper, which modeled abstracts related to the nematode worm, showed that the clusters corresponded to some genetic pathways for aging. We've also replicated that work in LingPipe's LDA tutorial.

Yes, pretty much everything I said corresponds to straightforward (Bayesian) mixture models as well.

Maybe I'm being naive here, but how do you compute perplexity on held-out data with, say, LDA? This requires marginalizing out latent variables, no? Which in turn requires rerunning EM.

Hrm... maybe I'm the odd one out here, but I don't really think of the Haghighi and Klein paper as being really LDA like at all. I also don't really think of LDA as being clustering. (Yes, there's a huge argument as to whether LDA is clustering or dimensionality reduction... I don't really think it matters.)

But yes, I agree -- if you have an external source of information about word clusters (eg, in the S+G paper or the nematode paper), then you're set. I'm more interested right now in how to do an intrinsic evaluation of these models -- for any topic specific question, you can always do something extrinsic.

(Also, I don't see what's "Bayesian" about being able to estimate the probability that two things are in the same cluster -- I can do that in a classical setting as well.)

I don't know how to do inference at the perplexity level on held-out docs using a fitted LDA model without more iterative inference (EM or Gibbs). I didn't even try to implement it as it doesn't seem useful for anything other than evaluating LDA as an LM!

The probability of two docs being in the same cluster with an LDA model (or simpler model) is just:

SUM_topic p(topic|doc1) * p(topic|doc2)
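A small Python sketch of that sum, under the assumption that each document's topic distribution is available as a list of probabilities in the same topic order. The function names are hypothetical. The second function averages the same quantity over several posterior samples (e.g., from Gibbs sampling), assuming both documents' distributions in each pair come from the same sample.

```python
def same_topic_probability(theta1, theta2):
    """SUM_topic p(topic|doc1) * p(topic|doc2) for two topic distributions."""
    if len(theta1) != len(theta2):
        raise ValueError("both documents need the same number of topics")
    return sum(p * q for p, q in zip(theta1, theta2))

def averaged_same_topic_probability(samples1, samples2):
    """Average the per-sample value over paired posterior samples.

    samples1[s] and samples2[s] are assumed to be the two documents' topic
    distributions taken from the same sample s.
    """
    values = [same_topic_probability(t1, t2) for t1, t2 in zip(samples1, samples2)]
    return sum(values) / len(values)
```

Since the sum does not depend on how topics are labeled within a sample, the average is not affected by topic labels switching from one sample to the next.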

I was getting caught up in being Bayesian, which requires estimating the posterior distribution. This is easy with Gibbs sampling, but not possible as far as I know with the EM-like point estimates.

Haghighi and Klein isn't LDA per se, but it does use a Dirichlet Process prior on reference chains. Simple coreference as opposed to database linkage is a classical clustering problem, with the "reference chains" being the clusters.

Un- or semi-supervised classifier inference (e.g. EM) is a kind of un- or semi-supervised soft clustering. Once you have the classifier, the likelihood of two elements being in the same cluster is defined as above. Turning it into a hard clustering requires a heuristic of some kind or running some other clusterer on top of the probabilistic notion of proximity.

to try to answer one technical question: computing perplexity on a held out document with a fitted LDA model amounts to running variational inference to obtain a bound on the likelihood. comparing bounds is troubling to some. minka and lafferty's 2002 UAI paper uses importance sampling to get around this problem. wei li and andrew mccallum's solution is also good. (i wonder how it relates.)

i like testing algorithms on predictive distributions of the form p(w_new | a few other w's) for previously unseen documents. i think that a number of real world problems, like collaborative filtering, have this form. perplexity is nice, but imagining the probability of a 1000 word document broken down by the chain rule, we see that terms like p(w_601 | w_1, ..., w_600) dominate the computation and don't really test the predictive qualities of the model.
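For reference, the chain-rule breakdown mentioned above can be written out as follows (standard identities, with N the document length; the notation is only a sketch of the point being made):

```latex
\log p(w_1, \dots, w_N) = \sum_{n=1}^{N} \log p(w_n \mid w_1, \dots, w_{n-1}),
\qquad
\text{perplexity} = \exp\!\left( -\frac{1}{N} \log p(w_1, \dots, w_N) \right)
```

For N = 1000, most of the terms in the sum condition on hundreds of already-observed words from the same document, which is a much easier prediction problem than p(w_new | a few other w's).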

more generally, i agree with bob's sentiments that the best evaluation of these kinds of models is to test them on whatever it is that you are doing. our hope is that better held out likelihood means better everything else, but this is not always true!

moreover, we often want to do vague things like using the low dimensional structure to help browse and explore a large collection of documents. andrew mccallum's rexa project is a great example of this. how can we evaluate that? a user study is the natural answer but is there an easier way?

[shameless plug alert.] jon mcauliffe and i have been working on supervised topic models, where we developed a hybrid topic model/generalized linear model for doing prediction of a per-document response variable. (the examples in the paper are web-page popularity and movie reviewer sentiment.) in that work, evaluation was clear. for held out documents, we compared our predictions to the truth.

So I totally agree that if there is an extrinsic evaluation, you should do that (see, e.g., this post). I also agree with Dave that for a lot of problems (collaborative filtering), the p(w | other ws) is pretty reasonable.

Regarding "probability of being in same topic", you can do this for, eg, a non-Bayesian mixture of Gaussians... sum_z p(x_i | mu_z) p(x_j | mu_z).

Sadly, I don't think the LaTeX works in comments... it's a GreaseMonkey plugin that just attaches itself to the post interface in Blogger.

Very nice thread, and in fact if you search for "evaluating topic models" on Google, it comes up as the number 1 hit (at least today, June 15, 2008).

So how is it that I find myself posting to this blog? Well, one of the problems I think about quite a bit is how to identify terms from collections of documents that will describe the content of that collection and (maybe) discriminate that collection from the others. Given what topic models try to do, they seem at least like worthy candidates for the descriptive part of the above, or maybe even the discriminating too...

Anyway, I've read a few topic model papers, and they usually go something like this - they start with an appealing introduction to the problem, that is we want to find terms that help us organize information, navigate through complex information spaces, etc. Then you hit this rather terrifying middle section of equations and terminology that usually takes up about half of the page limit (thus far I've mostly skipped those parts to be honest), and then there is an "evaluation" at the end of the paper that includes some tables that are essentially ranked lists of words that look pretty good. The ranked list of terms is indeed made up of terms that are descriptive, they are content rich words, and certainly they are useful pieces of information.

But then, I think about that middle section of the paper, and I think about something like TF/IDF, which can do a fairly good job of some of the above with quite a bit less algorithmic and probably computational complexity. Or I think about measures of association like pointwise mutual information (PMI) that can be used to identify interesting bigrams that are often very descriptive of content, and I wonder if topic models are really providing anything that different or that much better than some of the techniques we already know.

That's actually what I was looking for when I found this discussion - that is, has anyone of a similar frame of mind said "let's see how the lists of terms we get from a topic model compare to TF/IDF, etc."? Thus far I haven't found anything, but I'm still curious.

To tie this back to the thread of evaluation - what about comparative evaluation? Most of the topic model work I've seen doesn't seem to go outside of the probabilistic box when looking for methods to compare to, so the discussion quickly focuses on different sorts of parameters and what not, but doesn't really seem to acknowledge that there might be completely different alternatives that accomplish the same task, and might do so pretty well.

I'd actually be content to know if topic models provided something different than you would get from TF/IDF, etc., because as has been said, evaluation of lists of ranked terms outside of some application setting is always a tricky business. (This is the same thing that has long bedeviled work on methods to identify and rank collocations, btw.) But, at the very least it would be useful to know that the top 10 terms provided by a certain topic model correlate to X degree with the top 10 terms provided by TF-IDF, etc.
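One simple way to put a number on that kind of comparison is set overlap between the two top-k lists. The Python sketch below is only an illustration: `tfidf_top_terms` uses one common smoothed TF-IDF variant (an assumption, there are many), `jaccard_overlap` is a plain Jaccard index, and the list of terms from the topic model is taken as given.

```python
import math
from collections import Counter

def tfidf_top_terms(doc, corpus, k=10):
    """Top-k terms of doc by TF-IDF, with idf = log((1 + N) / (1 + df))."""
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))  # document frequency: count each word once per document
    tf = Counter(doc)
    scores = {w: tf[w] * math.log((1 + n_docs) / (1 + df[w])) for w in tf}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

def jaccard_overlap(terms_a, terms_b):
    """Size of the intersection over size of the union of two term lists."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A rank-aware measure (e.g., a rank correlation over the shared terms) would be a natural alternative if the order within the top 10 matters.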

Anyway, I fear this sounds painfully naive, but you seem like a friendly group. :) I think the truth is that when complicated methods like topic models are used to attack problems that look simple (give me the 10 words that best describe the content of this article or set of articles), there is a sort of natural tendency on the part of a person trying to solve that problem in the context of a larger problem to ask whether all that machinery is really necessary or not, and that seems like it would best be settled to some degree via comparative evaluations with quite different methods that try to accomplish the same thing.

I also believe that the LDA model just reduces the dimension. But when I apply the model to training data, I find there is no guidance on how to choose proper parameters. So I want to know how to choose the parameters. Or are there some articles that explain this?

While I am passing by on this interesting thread, let me point out some of the latest interesting developments on this topic:
- Evaluation Methods for Topic Models, Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno, ICML 2009. An interesting comparison of different sampling methods to compute perplexity (it is missing the comparison to variational methods, though).
- Reading Tea Leaves: How Humans Interpret Topic Models, Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David Blei, NIPS 2009. At last some more objective empirical evaluation of whether the topics from topic models are meaningful for humans; and the interesting result that better perplexity doesn't necessarily translate to better topics (for humans).