Spinning Straw into Gold: How to do gold standard data right

We have been struggling with how to evaluate whether we are finding ALL the genes in MEDLINE/PubMed abstracts. And we want to do it right. There has been a fair amount of work on how to evaluate natural language problems (search via TREC, BioCreative, MEDTAG), but nothing out there really covers what we consider a key problem in text-based bioinformatics: coverage, or recall, measured against an existing database like Entrez Gene.

“In order to make himself appear more important, a miller lied to the king that his daughter could spin straw into gold. The king called for the girl, shut her in a tower room with straw and a spinning wheel, and demanded that she spin the straw into gold by morning, for three nights, or be executed.” Much drama ensues, but in the end a fellow named Rumpelstiltskin saves the day.

The cast breaks down as follows:

The king: The National Institutes of Health (NIH), which really prefers that you deliver on what you promise in grants.

The miller: Our NIH proposal, in which we say “We are committed to making all the facts, or total recall, available to scientists…” Even worse, this is from the one-paragraph summary of how we were going to spend $750,000 of the NIH’s money. They will be asking about this.

The daughter: I (Breck), who sees lots of straw and no easy path to developing adequate gold standard data to evaluate ourselves against.

The straw: 15 million MEDLINE/PubMed abstracts and 500,000 genes that need to be connected in order to produce the gold. Really just a subset of it.

The gold: A scientifically valid sample of mappings between genes and abstracts against which we can test our claims of total recall. This is commonly called “gold standard data.”

Rumpelstiltskin: Bob, and lucky for me I do know his name.

Creating Gold from Straw

Creating linguistic gold standard data is difficult, detail-oriented, frustrating, and ultimately some of the most important work one can do to take on natural language problems seriously. I was around when version 1 of the Penn Treebank was created and would chat with Beatrice Santorini about the difficulties they encountered with things as simple-seeming as part-of-speech tagging. I annotated MUC-6 data for named entities and coreference, did the John Smith corpus of cross-document coreference with Amit Bagga, and have done countless customer projects. All of those efforts gave me insights I would not otherwise have had into how language is actually used, rather than the idealized version you get in standard linguistics classes.

The steps for creating a gold standard are:

Define what you are trying to annotate: We started with a very open-ended “let’s see what looks annotatable” attitude toward linking Entrez Gene to MEDLINE/PubMed. By the time we felt we had a sufficiently robust linguistic phenomenon, we had a standard that mapped abstracts as a whole to gene entries in Entrez Gene. The relevant question was: “Does this abstract mention anywhere a literal instance of the gene?” Gene families were not taken to mention the member genes, so “the EXT family of genes” would not count, but “EXT1 and EXT2 are not implicated in autism” would.

Validate that you can get multiple people to do the same annotation: Bob and I sat down, independently annotated the same 20 abstracts, and compared our results. We found 36 shared mappings from gene to abstract, with Bob finding 3 mappings that I did not and me finding 4 that Bob did not. In terms of recall, I found 92% (36/39) of what Bob did, and Bob found 90% (36/40) of what I did. Pretty good, eh? Not really; see below.

Annotate enough data to be statistically meaningful: Once we are convinced we have a reliable phenomenon, then we need to be sure we have enough examples to minimize chance occurrences.
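The interannotator comparison in the second step is just set arithmetic over the two annotators’ mapping sets. Here is a minimal Python sketch with toy data sized to match the counts above (36 shared mappings, 3 found only by Bob, 4 found only by me); the integer IDs are made up stand-ins for (gene, abstract) pairs:

```python
# Pairwise recall between two annotators over (gene, abstract) mappings.
# Toy data only: integers stand in for hypothetical mapping pairs.
def pairwise_recall(found_by_a, found_by_b):
    """Fraction of B's mappings that A also found."""
    return len(found_by_a & found_by_b) / len(found_by_b)

shared = set(range(36))                  # mappings we both found
bob = shared | {100, 101, 102}           # 39 total for Bob
breck = shared | {200, 201, 202, 203}    # 40 total for Breck

print(round(pairwise_recall(breck, bob), 2))  # I found 92% of Bob's (36/39)
print(round(pairwise_recall(bob, breck), 2))  # Bob found 90% of mine (36/40)
```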

The Tricky Bit

I (the daughter) need to stand in front of the king (the NIH) and say how good our recall is. Better if the number is close to 100% recall. But what is 100% recall?

Even a corpus annotation with an outrageously high 90% interannotator agreement leads to problems:

A marketing problem: Even if we hit 99.99% recall on the corpus, that number holds against the corpus, not against the truth, and we don’t know what’s up with the roughly 5-10% of true mappings each annotator missed. So picture it: after being total rock stars and modeling Bob at 99.99%, I have to show a slide that says we can only claim recall of 85-95% on the data. I get to throw out the 99.99% number and introduce a salad of footnotes and diagrams. I see congressional investigations in my future.

A scientific problem: It bugs me that I don’t have a handle on what truth looks like. We really do think recall is the key to text bioinformatics and that text bioinformatics is the key to curing lots of diseases.

Rumpelstiltskin Saves the Day

So, here we are in our hip Brooklyn office space, sun setting beautifully over the Williamsburg Bridge, and Bob and I are sitting around with a lot of straw. It is getting tense as I imagine the king’s reaction to the “standard approach” of working from an interannotator-agreement-validated data set. Phrases like “cannot be done in a scientifically robust way,” “we should just do what everyone else does,” and “maybe we should focus on precision” are bandied about with increasing panic. But the next morning Rumpelstiltskin walked in with the gold. It goes like this:

The problem is in estimating what the truth is given somewhat unreliable annotators. After adjudication (we both looked at where we differed and decided what the real errors were), we figured that each of us would miss 5% (1/20) of the abstract-to-gene mappings. If we take the union of our annotations, and if Bob and I make independent errors, we end up missing only 0.25% of mappings (1/20 × 1/20 = 1/400). Independence of errors is a big assumption.
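The arithmetic behind the 1/400 figure is just a product of per-annotator miss rates; a minimal Python sketch, using the adjudicated 5% estimate from above:

```python
# Miss rate of the union of annotators, assuming each annotator
# independently misses a given mapping with their own probability.
def union_miss_rate(per_annotator_miss_rates):
    rate = 1.0
    for p in per_annotator_miss_rates:
        rate *= p  # the union misses only if every annotator misses
    return rate

print(round(union_miss_rate([0.05, 0.05]), 6))  # 0.0025, i.e. 0.25% = 1/400
```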

Now we have a much better upper limit, in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard: take annotations from all remotely qualified annotators and not worry about it. We know that will push down our precision, but we are not in that business anyway.

An apology

At ISMB in Detroit, I stood up and criticized the BioCreative/GENETAG folks for adopting a crazy-seeming annotation standard that went something like this: “Annotate all the gene mentions in this data. Don’t get too worried about the phrase boundaries, but make sure the mention is specific.” I now see that approach as a sane way to increase recall. I see the error of my ways and feel much better since we have demonstrated 99.99% recall against gene mentions for that task (note this is a different, but related, task from linking Entrez Gene ids to text abstracts). And thanks to the BioCreative folks for all the hard work pulling together those annotations and running the bakeoffs.

3 Responses to “Spinning Straw into Gold: How to do gold standard data right”

Reporting 90% recall for an annotator assumes the 90% is 90% of something. Philosophically old-fashioned (or naive) practitioners might call this “the truth”, but Breck and I studied Quine and have adopted pragmatism as the official semantic theory of Alias-i.

What we’re really proposing is the adoption of a pragmatic evaluation of recall that bypasses the need for “truth”.

On a technical note, what I (Bob) would actually like to do is get multiple annotators and try to measure just how correlated the errors are. I’m guessing they’ll be highly correlated, and thus we’ll need more than two 90% recall annotations to get to 99% recall even measured pragmatically.

Does the need for 90% recall have to do with only using two raters? Imagine using 100 raters, or 1000. How low could the agreement requirements go given larger numbers of annotators? I’m asking because I’ve just blogged about the feasibility of using a web-based app to get raters from all over the world to complete large-scale annotation projects. It may be a pipe dream, even if not a LingPipe dream … rimshot!

Given that we suspect my and Bob’s recall errors are correlated (meaning we tend to miss the same things), the question becomes: how many annotators do we need to get to 100% recall? I made the estimate that I was missing 5% of the mappings after adjudication, as was Bob. We need a third annotator to check whether the union of my and Bob’s annotations misses 0.25% of recall against the third annotation. If so, that supports independence of my and Bob’s recall oversights. If the union annotation misses more than that, then we have evidence that our errors are correlated. All of this has to be wrapped in a meaningful statistical analysis over many iterations, but that is the general idea.

It would be a fairly easy experiment to run: for 20 abstracts, see at what point you stop getting increased recall from adding new annotators. I am guessing you would start to see little new found around 4-5 annotators, but I don’t know.
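That experiment can be dry-run in simulation before recruiting anyone. A minimal Python sketch, with every rate made up, in which a shared “hard” subset of mappings makes the annotators’ misses correlated:

```python
import random

def union_recall(n_annotators, n_mappings=10_000, base_miss=0.05,
                 hard_frac=0.02, hard_miss=0.8, seed=42):
    """Recall of the union of n annotators when a 'hard' subset of
    mappings tends to be missed by everyone (correlated errors)."""
    rng = random.Random(seed)
    found = 0
    for _ in range(n_mappings):
        # a few mappings are hard for every annotator -- the correlation
        p = hard_miss if rng.random() < hard_frac else base_miss
        # the union finds the mapping if at least one annotator does
        if any(rng.random() >= p for _ in range(n_annotators)):
            found += 1
    return found / n_mappings

for n in (1, 2, 3, 4, 5):
    print(n, round(union_recall(n), 4))
```

With correlated errors the curve flattens well short of 100%: the hard mappings keep getting missed no matter how many annotators you add, which is why two 90% annotators may not be enough.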