Archive for the ‘Reviews’ Category

For those of you who like mystery novels, Mitzi’s just written one. The added bonus for readers of this blog is that there’s natural language processing involved in the detective work (I don’t want to give too much away, so I can’t tell you how).

Poetic Justice is in the cozy mystery sub-genre, where the focus is on the amateur sleuths and their milieu, not on grisly multiple homicides.

The Back-Cover Blurb

Once you’ve made it in Manhattan, why would you be caught dead in Staten Island? That’s what Jay Alfred, editor-in-chief of Ars Longa Press, can’t understand. Jay and his partner Ken live on the best block in Chelsea. They’re an attractive pair of opposites. Jay would never stoop to snoop. Ken exercises his right to know every chance he gets.

When Sheba Miller, literary agent and downtown doyenne, is found dead in a bar on Staten Island, Ken can’t wait to investigate. He hustles Jay onto the Staten Island Ferry and into adventure. Then Sheba’s tell-all memoir surfaces. It’s a catalog of white nights with hot artists and liquid lunches with idiot publishers. Jay’s the idiot-in-chief, but he’s not alone. It’s a good thing that Sheba’s dead, because half of literary New York is ready to kill her.

That’s not the only book in town and Ken’s not the only amateur detective. Sheba’s old friends and lovers and the junior members of Ars Longa are all ready and willing to explore New York City and beyond in search of authors, books, killers, and a killer martini.

Look Inside!

If you follow the Amazon link below, the first six chapters are available free online through Amazon’s “Look Inside!” feature.

Early Reviews

For what it’s worth, at least twenty people have read it and said they enjoyed it. Some (including me) are already clamoring for the second book in the series. Other mystery writers told Mitzi this would happen; luckily for us fans, she already has two follow-ons in the pipeline.

First, I have to mention how blown away I was that Apple has a feature (the “Migration Assistant“) that lets you clone your last computer. An hour or two after setting up, the new MacBook Pro had all the software, data, and settings (well, almost all) from my previous computer, a MacBook Air. All done over my home wireless network (though our sysadmin here at Columbia strongly recommended a wired connection, my Air doesn’t have a port and I didn’t have a dongle).

Yes, text is just as beautiful as on the iPad3. So are photos and images. Everything else I use is looking awfully pixelated in comparison (such as this blog post I’m typing into Safari on my 27″ iMac).

The biggest downside is that it’s big (15″ diagonal screen vs. 13″ on my MacBook Air) and heavy (4.5 lbs vs. 2.9 lbs for the Air). Though big isn’t so bad — the 15″ screen seems luxurious after the Air’s rather cramped confines. Some software’s not up to the display, so the text looks really bad on the new MacBook Pro. Firefox and Thunderbird, for instance, look terrible. Overall, it’s just not as nice to handle as the Air. (Not to mention Columbia slapping the ugliest anti-theft stickers ever on it. Now I look like both a hipster clone and a corporate drone at the same time.) The magsafe cable has a very strong magnet compared to the Air’s and sticks out a bit more. And to add insult to injury, they’re not interchangeable, so we had to throw more money toward Cupertino.

I’d say the price is a downside (mine came out to about $2700 before Columbia discounts, including AppleCare). Even if I were buying this myself it’d be worth it, because I’ll average at least 20 hours/week use for two or more years.

Additional upsides are 16GB of memory and four cores. With that, it runs the Stan C++ unit tests in under 3 minutes (it takes around 12 minutes on the Air and the Air starts buzzing like an angry fly). The HDMI port saves a dongle, but then the change to Thunderbolt meant buying another one. I don’t know that I’ll get much use of out of USB 3.0 (the iPad 3 is only USB 2.0). I also get 256GB of SSD, though I never filled the 128GB I had on the Air. The ethernet port and HDMI port are handy — two less dongles compared to the MacBook Air if you need either of these ports.

I haven’t heard the fan. I’ve heard about it — it’s asymmetrical, which according to my signal processing geek friends, reduces the noise tremendously. It’s either super quiet or the machine’s so powerful the Stan unit tests don’t stress it out.

We’ve switched the version control system for Stan (my project at Columbia Uni) from Subversion to Git. I was skeptical when everyone told me how great Git was; the move from CVS to Subversion didn’t buy us much.

Git, on the other hand, is worth it. What I’ve liked about Git so far is:

Local Repository Copies: Every user gets a full copy of the repository. So you can work on a local version of the entire repository before “pushing” any changes to the main repository. (So what was a commit in Subversion is now a commit followed by a push.) This makes it easy to work on the subway, but it also means you can keep things under version control without polluting the public server.

Ryan tells me that GitHub is the bomb, too, and when Ryan recommends something, I listen (he told me the move to Subversion was minor, by the way). It apparently has a great community and a great way to suggest pushes to other projects. We may move the Columbia project to there from Google Code. (We can’t do the same for LingPipe, at least in their free open source area, because of our quirky license.)

The main body of the paper is about they reapplication to malaria diagnosis. But I’m more interested in the statistical techniques they used for crowd sourcing.

None of the nine authors, the reviewer(s) or editor(s) knew that their basic technique for analyzing crowd sourced data has been around for over 30 years. (I’m talking about the statistical technique here, not the application to distributed diagnosis of diseases, which I don’t know anything about.)

Of course, many of us reinvented this particular wheel over the past three decades, and the lack of any coherent terminology for the body of work across computer science, statistics, and epidemiology is part of the problem.

Previous Work

The authors should’ve cited the seminal paper in this field (at least it’s the earliest one I know — if you know earlier refs, please let me know):

Mavandadi et al.

Mavandadi et al. use an approach they call a “binary channel model for gamers”. On page 4 of part II of the supplement to their paper, they define a maximum a posteriori estimate that is the same as Dawid and Skene’s maximum likelihood estimate. It’s the same wheel I reinvented in 2008 (I added hierarchical priors because I was asking Andrew Gelman and Jennifer Hill for advice) and that several groups have subsequently reinvented.

I didn’t understand the section about “error control coding” (starting with whether they meant the same thing as what I know as an “error correcting code”). Why have an annotator annotate an item an odd number of times and then take a majority vote? You can build a probabilistic model for reannotation of any number of votes (that presumably would take into account the correlation (fixed effect) of having the same annotator).

Role of Automatic Classifiers

As in Raykar et al.’s 2009 JMLR paper, Mavandadi et al. also include a machine-based system. But it is not tightly linked as in the work of Raykar et al. It’s just trained from the data a la Padhraic Smyth’s mid-1990s model of crowdsourcing crater location data and then training image analysis models on the resulting crowdsourced data.

Mavandadi et al. instead run their automatic classifier first, then if it’s not confident, hand it over to the crowd. This is, by the way, the standard practice in speech-recognition-based automated call centers.

Mavandadi et al. should check out (Sheng et al. 2010), which analyzes when you need to find another label, also using a Dawid-and-Skene-type model of data annotation. It’s also a rather common topic in the epidemiology literature, because it’s the basis of the decision as to which diagnostic test to administer next, if any, in situations like breast cancer diagnosis (which involves notoriously false-positive-prone image tests and notoriously false-negative-prone tissue tests).

I didn’t see any attempt by Mavandadi et al. to calibrate (or even measure) their system’s confidence assessments. I’d wait for that analysis before trusting their output.

My iPad 3 arrived Friday afternoon. I’ve been using the iPad 1 for the past year and a half or so for all of my technical reading.

Mirroring My Old iPad

After synching my iPad 1 with my Macbook Air, when I plugged in the iPad 3 for the first time, it gave me the option of just mirroring what I had on the old iPad. Yes, please. It worked like a charm. A guy could get spoiled with this kind of treatment. (On the other hand, I still feel like configuring an iPad is making a deal with the Borg; see my previous post, Resistance is futile — I’ve been assimilated by Apple.)

Pro

The Retina display on the iPad 3 is breathtaking. It’s a qualitatively different experience for reading text.

The iPad 1 was good, but it still looked like reading text on a computer. The iPad 3 feels more like reading a magazine or a journal article. The text is that sharp. Even on tiny subscripts in formulas, there’s no aliasing. The following link is to Apple’s demo, which, if anything, understates the perceivable difference:

The jump in quality from iPad 1 to iPad 3 seems much more noticeable than the jump from standard def video (480 vertical lines) to full 1080p high def (1080 vertical lines). In fact, HD video on the iPad 3 is just stunning (I’m running out of adjectives here). The speaker also sounds surprisingly clear for such a little device.

Cons

It’s heavier and fatter than the iPad 1. It’s just enough heavier that it’s much more uncomfortable to hold with one hand while reading, which is what I’m often trying to do on the subway. The iPad 2 is the thinnest and lightest of the three, but it’s hardly a Kindle.

The iPad 3 runs considerably hotter than the iPad 1. It doesn’t get as hot as my Macbook Air when running statistical simulations. But hot enough to notice. Nothing to worry about, but it adds to the unpleasantness of holding it.

I don’t notice any speed difference in the things I do, which is a bummer. It still takes GoodReader a dog’s age to load an old scanned PDF and flip the pages. It’s just that they’re much sharper when they come up.

It seems to take longer to recharge, but I’m not 100% sure.

Conclusion

For my use case, which is mainly for reading technical papers at home, on the subway, and at work, and secondarily for board games (Carcassone, Neuroshima Hex, Ticket to Ride) and for video (Vimeo and YouTube HD look awfully nice), it’s a no brainer. The iPad 3 blows away anything else I’ve ever seen, no contest.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

The Big Picture

The easiest way to see what’s going on is with a screenshot of DUALIST, the system on which the paper is based:

It’s basically a tag-a-little, learn-a-little annotation tool for classifiers. I wrote something along these lines for chunk tagging (named entities, etc.) — you can find it in the LingPipe sandbox project citationEntities (called that because I originally used it to zone bibliogrphies in docs, citations in bibliographies and fields in citations). Mitzi just brought it up to date with the current LingPipe and generalized it for some large multi-part document settings.

In DUALIST, users provide two kinds of input:

category classifications for documents

words associated with categories

The left-hand-side of the interface presents a scrolling list of documents, with buttons for categories. There are then columns for categories with words listed under them. Users can highlight words in the lists that they believe are associated with the category. They may also enter new words that don’t appear on the lists.

Settles points out a difficult choice in the design. If you update the underlying model after every user choice, the GUI items are going to rearrange themselves. Microsoft tried this with Word, etc., arranging menus by frequency of use, and I don’t think anyone liked it. Constancy of where something’s located is very important. So what he did was let the user mark up a bunch of choices of categories and words, then hit the big submit button at the top, which would update the model. I did roughly the same thing with our chunk annotation interface.

There’s always a question in this kind of design whether to pre-populate the answers based on the model’s guesses (as far as I can tell, DUALIST does not pre-populate answers). Pre-populating answers makes the user’s life easier in that if the system is halfway decent, there’s less clicking. But it raises the possibility of bias, with users just going with what the system suggests without thinking too hard.

Naive-Bayes Classifier

The underlying classification model is naive Bayes with a Dirichlet prior. Approximate inference is carried out using a maximum a posteriori (MAP) estimate of parameters. It’s pretty straightforward to implement naive Bayes this in a way that’s fast enough to use in this setting. The Dirichlet is conjugate to the multinomial so the posteriors are analytically tractable and the sufficient statistics are just counts of documents in each category and the count of words in documents of each category.

The Innovations

The innovation here is twofold.

The first innovation is that Settles uses EM to create a semi-supervised MAP estimate. As we’ve said before, it’s easy to use EM or some kind of posterior sampling like Gibbs sampling over a directed graphical model with any subset of its parameters or labels being unknown. So technically, this is straightforward. But it’s still a really good idea. Although semi-supervised classifiers are very popular, I’ve never seen it used in this kind of active-learning tagging interface. I should probably add this to our chunking tagger.

The second (and in my opinion more important) innovation is in letting users single out some words as being important words in categories. The way this gets pushed through to the model is by setting the component of the Dirichlet prior corresponding to the word/category pair to a larger value. Settles fits this value using held-out data rather than rolling it into the model itself with a prior. The results seem oddly insensitive to it, which surprised me (but see below).

Comments on the Classifier

Gadzooks! Settles seems to be missing the single biggest tuning parameter typically applied to naive Bayes — the document length normalizer. Perhaps he did this because when you document-length normalize, you no longer have a properly generative model that corresponds to the naive Bayes paradigm. But it makes a huge difference.

The LingPipe EM tutorial uses naive Bayes and the same 20 Newsgroups corpus as (Nigam, McCallum and Mitchell 2000) used for evaluation, and I showed the effect of document length normalization is huge (the Nigam et al. article has been cited nearly 2000 times!). You can do way better than Nigam et al.’s reported results by setting the document length norm to a small value like 5. (What document length norm’s doing is trying to correct for the lack of covariance and overdispersion modeling in naive multinomial document model — it’s the same kind of shenanigans you see in speech recognition in weighting the acoustic and language models and the same trick I just saw Kevin Knight pull out during a talk last week about decoding encrypted documents and doing machine translation.)

I think one of the reasons that the setting of the prior for important words has so little effect (see the performance figures) is that all of the priors are too high. If Settles really is starting with the Laplace prior (aka add 1), then that’s already too big for naive Bayes in this setting. Even the uniform prior (aka add 0) is too big. We’ve found that we need very small (less than 1) prior parameters for word-in-topic models unless there’s a whole lot of data (and the quantity Settles is using hasn’t gotten there by a long shot — you need to get enough so that the low counts dominate the prior before the effect of the prior washes out, so we’re talking gigabytes of text, at least).

Confusion with Dirichlet-Multinomial Parameterization and Inference

There’s a confusion in the presentation of the Dirichlet prior and consequent estimation. The problem is that the prior parameter for a Dirichlet is conventionally the prior count (amount you add to the usual frequency counts) plus one. That’s why a prior of 1 is uniform (you add nothing to the frequency counts) and why a prior parameter of 2 corresponds to Laplace’s approach (add one to all frequency counts). The parameter is constrained to be positive, so what does a prior of 0.5 mean? It’s sort of like subtracting 1/2 from all the counts (sound familiar from Kneser-Ney LM smoothing?).

Now the maximum a posteriori estimate is just the estimate you get from adding the prior counts (parameter minus one) to all the empirical counts. It doesn’t even exist if the counts are less than 1, which can happen with Dirichlet parameter components that are less than 1. But Settles says he’s looking at the posterior expectation (think conditional expectation of parameters given data — the mean of the posterior distribution). The posterior average always exist (it has to given the bounded support here), but it requires you to add another one to all the counts.

To summarize, the mean of the Dirichlet distribution is

,

whereas the maximum (or mode) is

.

where the is read componentwise, so . This only exists if all .

That’s why a parameter of 1 corresponds to the uniform distribution and why a parameter of 2 (aka Laplace’s “add-one” prior) is not uniform.

Settles says he’s using Laplace and taking the posterior mean (which, by the way, is the “Bayesian point estimate” (oxymoron warning) minimizing expected square loss). But that’s not right. If he truly adds 1 to the empirical frequency counts, then he’s taking the posterior average with a Laplace prior (which is not uniform). This is equivalent to the posterior mode with a prior parameter of 3. But it’s not equivalent to either the posterior mode or mean with a uniform prior (i.e., prior parameter of 1).

Active Learning

Both documents and words are sorted by entropy-based active learning measures which the paper presents very clearly.

Documents are sorted by the conditional entropy of the category given the words in the model.

Word features are sorted by information gain, which is the reduction in entropy from the prevalence category distribution to the expected category distribution entropy conditoned on knowing the feature’s value.

Rather than sorting docs by highest classification uncertainty, as conditional entropy does, we’ve found it useful to sort docs by the lowest classification uncertainty! That is, we ask humans to label the docs about which the classifier is least uncertain. The motivation for this is that we’re often building high-precision (and relatively low recall) classifiers for customers and thus have a relatively high probability threshold to return a guess. So the higher ranked items are closer to the boundary we’re trying to learn. Also, we find in real world corpora that if we go purely by uncertainty, we get a long stream of outliers.

Settles does bring up the issue of whether using what’s effectively a kind of active learning mechanism trained with one classifier will be useful for other classifiers. We need to get someone like John Langford or Tong Zhang in here to prove some useful bounds. Their other work on active learning with weights is very cool.

GUI comments

I see a big problem with this interface for situations with more than a handful of categories. What would the full 20 Newsgroups look like? There aren’t enough room for more columns or a big stack of buttons.

Also, buttons seems like the wrong choice for selecting categories. These should probably be radio buttons to express the exclusivity and the fact that they don’t take action themselves. Typically, buttons cause some action.

Discriminative Classifiers

Given the concluding comments, Settles doesn’t seem to know that you can do pretty much exactly the same thing in a “discriminative” classifier setting. For instance, logistic regression can be cast as just another directed graphical model with parameters for each word/category pair. So we could do full Bayes with no problem.

There are also plenty of online estimation procedures for weighted examples; you’ll need the weighting to deal with EM (see, e.g., (Karampatziakis and Langford 2010) for an online weighted training method that adjusts neatly for curvature in the objective; it’s coincidentally the paper I’m covering for the next Columbia Machine Learning Reading Group meeting).

The priors can be carried over to this setting, too, only now they’re priors on regression coefficients. See (Genkin, Lewis and Madigan 2007) for guidance. One difference is that you get a mean and a variance to play with.

Building it in LingPipe

LingPipe’s traditional naive Bayes implementation contains all that you need to build a system like DUALIST. Semi-supervised learning with EM is covered in our EM tutorial with naive Bayes as an example.

To solve some of the speed issues Settles brings up in the discussion section, you can always thread the retraining in the background. That’s what I did in the chunk tagger. With a discriminative “online” method, you can just keep cycling through epochs in the background, which gives you the effect of a hot warmup for subsequent examples. Also, you don’t need to run to convergence — the background models are just being used to select instances for labeling.

The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

You can get their code from their project home page, linked above.

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co https://aliasi.devguard.com/svn/sandbox/twitter-pos

The code’s in a single file, stored under the src subdirectory of the package:

LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.

Finally, here’s the confusion matrix for tags in the first-best output:

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.

Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7,which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).

Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posterior (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.

Really Adversarial Turkers?

How in the world did they get a Mechanical turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Adding a Model

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is pairwise independent.

One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.

Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1]). In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.

The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is to then induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotator’s accuracies (but not biases in this case).

Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurston in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

Steyvers et al.’s Model

The Items

The basic idea is to introuce a latent scalar for each item being ranked. The ordering of the latent scalars induces a complete ordering of items.

The Annotators

Each annotator is characterized by a single noise parameter . These are given what seem like a rather arbitrary prior:

where is a constant hyperprior (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for , which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

The Annotations

The annotations that are observed are of the form of complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoneix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar for each item and annotator (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

.

That is, the latent position assigned to item by annotator is drawn from a normal centered around the true latent location for item with noise determined by the annotator-specific deviation parameter .

The Sampling Trick

There’s only one problem in implemeting this model: the latent must be consistent with the observed ranking . As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position , use a normal distribution truncated to the interval bounded by the next lowest-ranked and next highest-rank item than (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the , yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are . Given samples for , we can estimate the rank of each item by looking at the rank in each sample . We also get a direct characterization of the noise of an annotator through . We probably don’t care at all about the annotator-specific latent positions .

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots of histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or good general overview — you find introductions in any book on survey sampling). It’s not uncommon to have the basic data be measured with noise. Especially in an epidemiology setting or any kind of subjective coding by human subjects, like survey responses.

Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by , we’d have an individual level error for each individual across tasks, and an error for each task for each user that’s sampled in the neighborhood of . It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.

The second most popular post ever on this blog was my review of Lenovo’s Thinkpad W510. I used to really like the IBM Thinkpads. (The number one post is the link to 64-bit versions of Eclipse, and I don’t even use Eclipse except for an occassional refactoring or stepwise debug).

I just gave my 15-month old quad-core, high-res, 8GB Thinkpad W510 with a 128GB SSD to one of our grad students after Andrew bought me a shiny new Macbook Air 13″ with 4GB and a 256GB SSD. Apple only just a few weeks ago came out with similar hardware for the Macbook Pro, by the way, and Lenovo’s not exactly rushing out of the gate with new CPUs.

Evaluation Period Honeymoon

I thought at the very least, if I hated Mac OS X, I could just install Windows. If the Air was too small, I could just use it for travel.

Within about 24 hours, having watched Apple’s converting from Windows to Mac video, I’m a convert. So much so that Mitzi’s been getting sick of hearing me tell her how great the new Macs are. We’d both last used them with any frequency when I had a Mac Quadra in the early 1990s.

Among my favorite things is emacs commands in text windows. And I’m not sure how I lived without the gestures for scrolling and navigating the web. It really is a unix machine underneath, though I never had any problems with Cygwin on Windows. I also love the keyboards. That and the weight are what finally soured me on the Thinkpads — their new ones are terrible.

Did I mention the insane battery life?

Or just how nice it feels? It’s like moving from a Kodak Brownie to a Leica.

The last time I was this impressed with a piece of hardware was when my grad school department got its first Sun Workstation I way back in 1985 or 1986.

It’s not perfect. Does Microsoft own the patent on grabbing windows by their sides to resize them? It’s not possible in either Linux or the Mac [I was corrected in the comments; it is possible in Linux, just very trying as a test of hand-eye-mouse coordination]. And why do I have a delete key that’s really a backspace and no real delete key? (Yes, I realize fn-delete adds up to a delete.) And what’s the function key doing where my control key should be? Does anyone ever use caps lock for anything? Why do computers still have them?

Also, I don’t find Macs as blindingly obvious as everyone else says they are. For instance, I had no clue as to what to do with the “power cord”. I went online. I read the manual. I’d have thought it was the wrong piece, but it was listed in the parts. Turns out you need to take their power supply apart and then it clips in. Huh? It reminds me of getting an iPod many years ago and having no clue how to turn it off.

No Half Measures

Given that the money was turned on at Columbia, I ordered the Macbook Air for home and travel, as well as a 27″ iMac for the office, and an iPad 2. The iMac is very sweet, but somehow not nearly as cool as the Air. I can’t wait until the iPad shows up — I’m tired of printing out PDFs. As soon as the iPhone 5 shows up, I’m getting one from Verizon (I don’t even have a cellphone now).

Welcome to the New Borg

The old joke used to be that Microsoft was like The Borg, whose tagline was “you will be assimilated”. Slashdot even uses a Borged-out Bill Gates as an icon.

It turns out that as hard as Google’s trying to be the new Borg, the current Borg headgear apparatus rests on Apple’s head. After all, they’ve not just locked down their hardware, they’re also trying to take over the music and movie distribution business. That’s why I’m so surprised I couldn’t find a Steve-Jobs-as-Borg image. C’mon Slashdot, help me out here.

Tip of the Day

When your significant other says he or she thinks you like your new Macbook Air more than him or her, do NOT reply, “It’s so thin!”. Luckily, Mitzi has a sense of humor to go with her yoga-and-running-toned body.