Confidence intervals on recall and eRecall

There is an ongoing discussion about methods of estimating the recall of a production, as well as estimating a confidence interval on that recall. One approach is to take the control set sample, which is drawn at the start of production to estimate collection richness and guide the predictive coding process, and use it to estimate the final confidence interval as well. This requires some care, however, to avoid contaminating the control set with the training set. Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.

Another approach, first suggested by Herb Roitblat, is to calculate what Herb calls eRecall. This involves drawing a sample from the null set at the conclusion of the production, and taking the ratio between the richness of this null set sample and the richness of the original control set sample. Since we are using the control set sample only to estimate overall richness and yield, not the richness of either the production or the null set, contamination with training is not an issue (though the method does not address concerns about the reliability of the coding decisions made on the control set at the start of the review process). A confidence interval can be calculated on eRecall using the Koopman ratio-of-binomial method; this appears to be the method that Herb himself recommends.

An apparent problem with the eRecall estimator is that the collection and null set samples overlap on the null set. We are in effect estimating null set yield twice and independently, and using the two estimates on different sides of the recall division. A little reflection suggests that this should increase the variability of our estimator, making our point estimate less reliable, and our confidence intervals wider. It is possible, for instance, for the null set sample to lead to a larger estimate of yield than the collection sample, in which case our estimate of recall will be negative, a nonsensical result. While this is perhaps unlikely to occur in practice, it does suggest problems with the estimator.
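To make the negative-recall case concrete, here is a minimal Python sketch of an eRecall point estimate. It assumes the estimator takes the form 1 - (estimated null set yield) / (estimated collection yield), which is how the paragraph above describes it. The function name, the collection sizes, and the sample counts are all hypothetical, chosen only for illustration.

```python
def erecall(collection_size, null_size,
            collection_sample_relevant, collection_sample_size,
            null_sample_relevant, null_sample_size):
    """Point estimate of recall from two independent yield estimates.

    Assumed form: 1 - (estimated null set yield) / (estimated collection yield).
    """
    # Estimated number of relevant documents in the whole collection.
    collection_yield = collection_size * collection_sample_relevant / collection_sample_size
    # Estimated number of relevant documents left behind in the null set.
    null_yield = null_size * null_sample_relevant / null_sample_size
    return 1.0 - null_yield / collection_yield


# Hypothetical samples: 75 of 1500 relevant in the collection sample,
# 20 of 1500 relevant in the null set sample.
typical = erecall(1_000_000, 925_000, 75, 1500, 20, 1500)

# An unlucky pair of samples: the null set sample implies more relevant
# documents than the collection sample does, so the estimate goes negative.
unlucky = erecall(1_000_000, 925_000, 60, 1500, 70, 1500)
```

Because the two yield estimates come from independent samples, nothing in the arithmetic stops the second from exceeding the first.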

Let's examine the accuracy of the eRecall estimator with a simple production scenario. Assume that we have a collection of 1 million documents with 5% richness, and that we achieve 75% recall with 50% precision (though of course we don't know this in advance of sampling and estimation). Now consider three methods of estimating a confidence interval:

eRecall / Koopman: a collection sample of 1500 (for instance, the control set sample), and a null set sample of 1500, for a total sample size of 3000.

Direct method: a simple random sample of 1500 from the entire collection post-production, with an exact binomial confidence interval on the retrieved proportion of the relevant documents that turn up in the sample.

Segmented / Koopman: proportionally allocate a sample of 1500 documents between the production and the null sets (for our scenario, this works out to 113 for the production, and 1387 for the null set), then calculate a confidence interval using my recci package with the Koopman method. (The BetaBinomial-Half method is the default for my package, but I use the Koopman here for comparability; the results are very similar for this scenario.)
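The proportional allocation quoted for the segmented method can be checked with a few lines of Python. This sketch uses only the scenario figures above; the variable names are illustrative, and rounding the half-document up is an assumption made here to match the 113 / 1387 split.

```python
from fractions import Fraction
from math import floor

collection_size = 1_000_000
richness = Fraction(5, 100)     # 5% of the collection is relevant
recall = Fraction(75, 100)      # 75% of relevant documents are produced
precision = Fraction(50, 100)   # 50% of the production is relevant
sample_size = 1500

relevant = collection_size * richness              # 50,000 relevant documents
retrieved_relevant = relevant * recall             # 37,500 of them are produced
production_size = retrieved_relevant / precision   # 75,000 documents produced
null_size = collection_size - production_size      # 925,000 documents in the null set

# Allocate the sample in proportion to segment size (1500 * 0.075 = 112.5),
# rounding half up (an assumption; Python's round() would give 112).
production_sample = floor(sample_size * production_size / collection_size + Fraction(1, 2))
null_sample = sample_size - production_sample

print(production_size, null_size, production_sample, null_sample)
```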

Note that the latter two methods ignore the control set sample, even if we have one, and just use the 1500 sample that in the eRecall method is allocated to the null set.
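For the direct method, the interval is an exact binomial interval on the number of retrieved documents among the relevant documents that turn up in the sample. The following is a minimal standard-library sketch of such an interval, assuming the usual Clopper-Pearson construction is what is meant; it is not the code used for the experiment, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve(f, target, iters=60):
    """Bisection on [0, 1] for a function f that decreases in p, solving f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid   # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k, n, level=0.95):
    """Clopper-Pearson interval for k successes in n trials."""
    alpha = 1 - level
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical sample: 75 relevant documents found, of which 56 were produced.
lower, upper = exact_binomial_ci(56, 75)
```

Note that the number of trials here is the number of relevant documents in the sample (around 75 in this scenario), not the full 1500, which is why the interval is as wide as it is.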

I repeatedly sample 10,000 times for the above scenario and sampling allocations, and calculate a confidence interval for each sample by the three methods described. Across these 10,000 repeated simulations, I calculate the average lower bound and upper bound on a 95% confidence interval on recall, along with the average width of the interval end-to-end, and the width from the true recall value (0.75) to the lower bound. (The code for this experiment is here: it relies upon Version 0.5.0 of my recci package.) Here are the results:

Mean CI

Method       Lower    Upper    Width    True - Lower
eRecall      0.589    0.842    0.253    0.161
Direct       0.637    0.843    0.206    0.113
Segmented    0.650    0.829    0.179    0.100

As our intuition suggested, the confidence interval on eRecall is much wider than necessary: 40% wider than the segmented interval, and 23% wider than even the simple direct method. Moreover, the additional width is largely on the downside: the eRecall lower bound is 60% further from the true value than the segmented estimator. And this is despite using an effective sample of twice the size. Put another way, it is better to throw away the original control set sample and resample from scratch, than to use the control set sample within the eRecall estimator. And not using the control set sample also removes concerns about the reliability of the coding decisions made prior to review.
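The percentages quoted above follow directly from the mean bounds in the table. A short Python check, using only those reported figures and the true recall of 0.75 (the variable names are illustrative):

```python
true_recall = 0.75

# Mean lower and upper bounds as reported in the table.
mean_ci = {
    "eRecall": (0.589, 0.842),
    "Direct": (0.637, 0.843),
    "Segmented": (0.650, 0.829),
}

width = {m: hi - lo for m, (lo, hi) in mean_ci.items()}
below_truth = {m: true_recall - lo for m, (lo, _) in mean_ci.items()}

# How much wider the eRecall interval is than each alternative.
wider_than_segmented = width["eRecall"] / width["Segmented"] - 1   # about 0.41
wider_than_direct = width["eRecall"] / width["Direct"] - 1         # about 0.23

# How much further below the true value the eRecall lower bound sits.
further_below = below_truth["eRecall"] / below_truth["Segmented"] - 1  # about 0.61
```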

We can use the same experiments to consider the accuracy of the point estimates on recall that the three sampling methods produce. The results are as follows:

Point estimate

Method       Mean     Std.dev.
eRecall      0.747    0.0635
Direct       0.751    0.0466
Segmented    0.751    0.0466

All methods are approximately unbiased, but eRecall has considerably higher variance than the others. And again, eRecall uses twice the effective sample size of the other methods, and brings with it concerns about the reliability of the control set labels.
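Readers who want to reproduce the flavour of this comparison without the recci package can use a rough standard-library simulation of the three point estimators. This is a sketch under stated assumptions, not the experiment's code: it approximates sampling without replacement by independent draws, assumes the eRecall form 1 - (null set yield) / (collection yield), uses fewer trials than the 10,000 above, and all names are hypothetical. Exact figures will vary with the seed.

```python
import random
import statistics

N, PRODUCTION, NULL = 1_000_000, 75_000, 925_000   # scenario sizes
P_RELEVANT = 0.05              # collection richness
P_TP, P_FN = 0.0375, 0.0125    # relevant and produced / relevant and missed
PRECISION = 0.5                # richness of the production
ELUSION = 12_500 / NULL        # richness of the null set


def hits(rng, n, p):
    """Number of successes in n independent draws with probability p."""
    return sum(rng.random() < p for _ in range(n))


def simulate(trials=500, seed=1):
    """Return {method: (mean, std.dev.)} of the recall point estimates."""
    rng = random.Random(seed)
    estimates = {"eRecall": [], "Direct": [], "Segmented": []}
    for _ in range(trials):
        # eRecall: 1500 from the collection plus 1500 from the null set.
        c, e = hits(rng, 1500, P_RELEVANT), hits(rng, 1500, ELUSION)
        if c > 0:
            estimates["eRecall"].append(1 - (NULL * e / 1500) / (N * c / 1500))
        # Direct: 1500 from the collection, each document classified three ways.
        tp = fn = 0
        for _ in range(1500):
            u = rng.random()
            if u < P_TP:
                tp += 1
            elif u < P_TP + P_FN:
                fn += 1
        if tp + fn > 0:
            estimates["Direct"].append(tp / (tp + fn))
        # Segmented: 113 from the production, 1387 from the null set.
        found = PRODUCTION * hits(rng, 113, PRECISION) / 113
        missed = NULL * hits(rng, 1387, ELUSION) / 1387
        if found + missed > 0:
            estimates["Segmented"].append(found / (found + missed))
    return {m: (statistics.mean(v), statistics.stdev(v)) for m, v in estimates.items()}


if __name__ == "__main__":
    for method, (mean, sd) in simulate().items():
        print(f"{method:10s} mean={mean:.3f} sd={sd:.4f}")
```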

These figures are for a single scenario only. I suspect that the performance of eRecall would be worse at lower collection richness, and relatively somewhat better at higher richness. Nevertheless, in the light of these results, the use of the eRecall estimator cannot be recommended, either for point estimates or confidence intervals.

3 Responses to “Confidence intervals on recall and eRecall”

Roitblat gives two inconsistent definitions for eRecall. I'm wondering which you used.

Definition 1. In his earlier work (cited above) he uses prevalence to estimate TP+FN (the total number of relevant documents) and he uses elusion to estimate FN (the total number of missed relevant documents). He then plugs these estimates into a contingency table with N (the total number of documents) and D (the total number of discarded documents). If I am not mistaken, the resulting formula is