The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is to then induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotator’s accuracies (but not biases in this case).

Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurston in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

Steyvers et al.’s Model

The Items

The basic idea is to introuce a latent scalar for each item being ranked. The ordering of the latent scalars induces a complete ordering of items.

The Annotators

Each annotator is characterized by a single noise parameter . These are given what seem like a rather arbitrary prior:

where is a constant hyperprior (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for , which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

The Annotations

The annotations that are observed are of the form of complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoneix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar for each item and annotator (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

.

That is, the latent position assigned to item by annotator is drawn from a normal centered around the true latent location for item with noise determined by the annotator-specific deviation parameter .

The Sampling Trick

There’s only one problem in implemeting this model: the latent must be consistent with the observed ranking . As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position , use a normal distribution truncated to the interval bounded by the next lowest-ranked and next highest-rank item than (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the , yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are . Given samples for , we can estimate the rank of each item by looking at the rank in each sample . We also get a direct characterization of the noise of an annotator through . We probably don’t care at all about the annotator-specific latent positions .

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots of histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or good general overview — you find introductions in any book on survey sampling). It’s not uncommon to have the basic data be measured with noise. Especially in an epidemiology setting or any kind of subjective coding by human subjects, like survey responses.

Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by , we’d have an individual level error for each individual across tasks, and an error for each task for each user that’s sampled in the neighborhood of . It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.