Original (Erroneous) Results

In those papers, I reported 7% precision at 99.99% recall, 8% precision at 99.9% recall, and 11% precision at 99% recall.

Corrected Results

It turns out that with default LingPipe settings (n-gram = 5, interpolation = 5) and a maximum of 1024 chunks per sentence in our chunk.CharLmHmmChunker, the actual results are:

Recall    Precision
99%       3.6%
99.9%     0.9%
99.99%    0.6%
100%      0.5%

Reducing the n-gram length to 4 raises precision at 99.9% recall to 1%, and reducing it to 3 raises it to 1.3%. As I speculated in the paper, less tightly fit models do a bit better in high-recall settings, even though they do worse in high-F-measure evaluations.
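To make these operating points concrete, here is a minimal sketch (not LingPipe's actual API; the class, record, and method names are made up for illustration) of reading precision off a scored result list at a target recall. The key point from the bug fix below is the `totalReference` count: it must include reference chunks the chunker never returned at any score, or recall gets overreported.

```java
import java.util.*;

public class PrecisionAtRecall {

    // A returned chunk with its confidence score and whether it
    // matches a reference chunk. Names are hypothetical.
    record Scored(String chunk, boolean correct, double score) {}

    // Precision at the first operating point whose recall reaches
    // the target. totalReference counts every reference chunk,
    // including ones never returned at any score (the "misses");
    // leaving those out is exactly how recall gets overreported.
    static double precisionAtRecall(List<Scored> results,
                                    int totalReference,
                                    double targetRecall) {
        List<Scored> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble(Scored::score).reversed());
        int truePos = 0;
        int returned = 0;
        for (Scored s : sorted) {
            ++returned;
            if (s.correct())
                ++truePos;
            if ((double) truePos / totalReference >= targetRecall)
                return (double) truePos / returned;
        }
        return 0.0; // target recall unreachable given the misses
    }

    // Toy data: two correct chunks returned, plus one reference
    // chunk that the chunker never returned (a miss).
    static List<Scored> sample() {
        return List.of(
            new Scored("A", true, 0.9),
            new Scored("B", false, 0.8),
            new Scored("C", true, 0.7),
            new Scored("D", false, 0.6));
    }

    public static void main(String[] args) {
        // 3 reference chunks total: A, C, and one unreturned miss.
        System.out.println(precisionAtRecall(sample(), 3, 0.5));
    }
}
```

With the miss counted, 100% recall is simply unreachable here, which is why the corrected curve above bottoms out at 0.5% precision rather than climbing.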

Luckily, a complete confidence-based 20-fold cross-validation still takes only 2 minutes in a single thread.

The Bug

The bug causing the problem is described in LingPipe's release notes for 3.8.2:

3.8.2: Bug Fix: Scored Precision-Recall and Chunker Evaluations
We made major bug fixes for the precision-recall evaluations. There were two bugs. First, a tree set was being used where a list should've been used, causing some items to be ignored. Second, there was no way to add counts for missed items.

This bug affected the confidence-based chunker evaluator by overreporting recall in cases where the chunker did not return every reference chunk with at least some score. A new method addMisses(int) was added to the scored precision recall evaluation and called from the chunker evaluator.
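The first bug is easy to reproduce in miniature. Here is a hypothetical sketch (not LingPipe's actual code; the class and record names are invented) of how storing scored items in a TreeSet ordered by score silently drops items the comparator considers equal, where a List keeps every item:

```java
import java.util.*;

public class TreeSetBugDemo {

    // A scored chunk; the record and its fields are made up.
    record Scored(String chunk, double score) {}

    // Two chunks share the score 0.9.
    static List<Scored> sampleResults() {
        return List.of(
            new Scored("Washington", 0.9),
            new Scored("Lincoln", 0.9),
            new Scored("Adams", 0.7));
    }

    // Buggy storage: a TreeSet ordered only by score treats
    // equal-scoring items as duplicates and keeps just one.
    static int buggySize() {
        TreeSet<Scored> set =
            new TreeSet<>(Comparator.comparingDouble(Scored::score));
        set.addAll(sampleResults());
        return set.size();
    }

    // Fixed storage: a List preserves score ties.
    static int fixedSize() {
        return new ArrayList<>(sampleResults()).size();
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + buggySize()); // one tie dropped
        System.out.println("fixed: " + fixedSize());
    }
}
```

The dropped tie never shows up as an error; the set just ends up one element short, which is why this kind of bug slips past tests that only check the items that do make it in.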

As I've said before, unit tests are great, but they only catch the bugs you think to check for. Lack of imagination in testing is still a killer. I'm sure it'd help to have an independent tester.

Thanks again to Mike Ross for finding the bug. He was suspicious of the results he was getting using LingPipe's chunker evaluator, and subsequently wrote an independent P/R evaluator that isolated the bug.