How accurate can manual review be?

One of the chief pleasures for me of this year’s SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion.

Grossman and Cormack compare the reliability of manual and technology-assisted review through an analysis of the assessment results for the Interactive Task of the TREC 2009 Legal Track. In the Interactive Task, runs produced by participating teams (using various degrees of automation) had documents sampled from them for initial relevance assessment by student or professional assessors; unretrieved documents were also sampled, though much more sparsely. Teams could appeal the initial assessments to be adjudicated by a figure called the Topic Authority, an experienced e-discovery lawyer who had advised the teams in their production and directed the assessors in their review.

Grossman and Cormack’s insight is to view this assessment process as an experiment comparing manual and technology-assisted review. The post-adjudication relevance assessments are treated as the gold standard; the assessors for each topic as the manual review team; and the documents they assessed as relevant (extrapolated based upon sampling) as the productions resulting from exhaustive manual review. The effectiveness of this manual review is then compared with the technology-assisted submissions of participating teams against the adjudicated gold standard.

Grossman and Cormack find that the best technology-assisted productions were as accurate as the best manual review team, and more accurate than the majority of the manual review efforts. The following table (adapted from Table 7 of Grossman and Cormack) summarizes their results:

| Topic | Team                 | Rec  | Prec | F1   |
|-------|----------------------|------|------|------|
| t201  | System A             | 0.78 | 0.91 | 0.84 |
| t201  | TREC (Law Students)  | 0.76 | 0.05 | 0.09 |
| t202  | System A             | 0.67 | 0.88 | 0.76 |
| t202  | TREC (Law Students)  | 0.80 | 0.27 | 0.40 |
| t203  | System A             | 0.86 | 0.69 | 0.77 |
| t203  | TREC (Professionals) | 0.25 | 0.12 | 0.17 |
| t204  | System I             | 0.76 | 0.84 | 0.80 |
| t204  | TREC (Professionals) | 0.37 | 0.26 | 0.30 |
| t207  | System A             | 0.76 | 0.91 | 0.83 |
| t207  | TREC (Professionals) | 0.79 | 0.89 | 0.84 |
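The F1 column is just the harmonic mean of the recall and precision columns; a quick sketch (a hypothetical helper, not part of the original analysis) confirms the table's figures to within rounding:

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Spot-checks against the table above:
print(round(f1(0.78, 0.91), 2))  # t201, System A: 0.84
print(round(f1(0.76, 0.05), 2))  # t201, law students: 0.09
print(round(f1(0.79, 0.89), 2))  # t207, professionals: 0.84
```

A couple of rows (for instance t203's manual review) come out a point off in the second decimal place, presumably because the published F1 scores were computed from unrounded recall and precision.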

My paper at SIRE was in part an attempt to pick holes in Grossman and Cormack’s analysis; their conclusions proved to hold up pretty well to scrutiny. One possible criticism is that the effectiveness figures quoted above are extrapolated from an unequal sample, and that a small number of appeal decisions on sparsely sampled unretrieved documents have a disproportionate effect on measures of effectiveness. This criticism doesn’t have a lot of statistical purchase — the sample was random and the extrapolations are correct, unless one thinks there is likely to be greater bias in the unretrieved segment — but practical considerations give it some more bite: a Boolean pre-filter (itself, admittedly, a blunt sword) would reduce the size and impact of this segment; and errors here are mostly false positives, which would be picked up (presumably) in a second pass. However, even if you calculate effectiveness only on the documents actually sampled and assessed (Table 4 of my paper), the best technology-assisted system still comes out on top for three topics out of five, and is close behind for the other two.
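To see why sparse sampling of the unretrieved segment gives individual adjudications such leverage, consider a toy extrapolation (the segment sizes and counts below are invented for illustration, not TREC's actual figures):

```python
def estimate_relevant(segment_size, sample_size, relevant_in_sample):
    """Estimate total relevant documents in a segment from a simple random sample."""
    return segment_size * relevant_in_sample / sample_size

# A densely sampled retrieved segment and a sparsely sampled unretrieved one.
rel_retrieved = estimate_relevant(10_000, 1_000, 400)     # estimate: 4,000
rel_unretrieved = estimate_relevant(500_000, 500, 2)      # estimate: 2,000

recall = rel_retrieved / (rel_retrieved + rel_unretrieved)
print(round(recall, 2))  # 0.67
```

With these numbers, each positive assessment in the unretrieved sample stands for a thousand estimated relevant documents, so reversing a single adjudication there moves the recall estimate substantially; the extrapolation is unbiased, but its variance is concentrated in very few decisions.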

A second criticism might be that the appeals process could be incomplete: not all assessor errors may be appealed, and therefore some will remain unfound. But this incompleteness would in general boost the apparent effectiveness of the manual review teams more than that of the technology-assisted productions. (That said, I do suspect that absolute recall scores are overstated for both manual and technology-assisted review, as false negatives in the unretrieved segment would not get appealed.)

The remainder of my paper examines what TREC 2009 tells us about assessor reliability and variability, and whether the TREC setup really is a fair depiction of manual review. One of my findings is that there is great variability in the reliability of different reviewers on the one team, as the following figure shows:

Each circle represents the precision and recall of a single assessor, measured against the adjudicated assessments; the red cross indicates the effectiveness of the technology-assisted production. (Measures are extrapolated from the sample to the population.) The best reviewers have a reliability at or above that of the technology-assisted system, with recall at 0.7 and precision at 0.9, while other reviewers have recall and precision scores as low as 0.1. This suggests that using more reliable reviewers, or (more to the point) a better review process, would lead to substantially more consistent and better quality review. In particular, the assessment process at TREC provided only for assessors to receive written instructions from the topic authority, not for the TA to actively manage the process, by (for instance) performing an early check on assessments and correcting misconceptions of relevance or excluding unreliable assessors. Now, such supervision of review teams by overseeing attorneys may (regrettably) not always occur in real productions, but it should surely represent best practice.

My SIRE paper has since generously been picked up by Ralph Losey in his blog post, Secrets of Search — Part One. Ralph’s post first stresses the inadequacy of keyword search alone as a tool in e-discovery. Predominant current practice in e-discovery is to create an initial Boolean query or queries, often based upon keywords negotiated between the two sides; run that query against the corpus; and then subject the documents matched by the query, and only those documents, to full manual review. The query aims at recall; the review, at precision. However, previous work by Blair and Maron (1985) — almost three decades ago now — found that such Boolean queries typically achieve less than 20% recall, even when formulated interactively. (Note, in passing, that the concepts of “Boolean search” and of “keyword-based ranked retrieval” are frequently confounded under the term “keyword search” in the e-discovery literature.) Ralph also questions whether the assessment process followed at TREC really represents best, or even acceptable, manual review practice, due to the lack of active involvement of a supervising attorney.

The most interesting part of Ralph’s post, and the most provocative, both for practitioners and for researchers, arises from his reflections on the low levels of assessor agreement, at TREC and elsewhere, surveyed in the background section of my SIRE paper. Overlap (measured as the Jaccard coefficient; that is, size of intersection divided by size of union) between relevant sets of assessors is typically found to be around 0.5, and in some (notably, legal) cases can be as low as 0.28. If one assessor were taken as the gold standard, and the effectiveness of the other evaluated against it, then these overlaps would set an upper limit on F1 score (harmonic mean of precision and recall) of 0.66 and 0.44, respectively. Ralph then provocatively asks, if this is the ground truth on which we are basing our measures of effectiveness, whether in research or in quality assurance and validation of actual productions, then how meaningful are the figures we report? At the least, we need to normalize reported effectiveness scores to account for natural disagreement between human assessors (something which can hardly be done without task-specific experimentation, since it varies so greatly between tasks). But if our upper bound F1 is 0.66, then what are we to make of rules-of-thumb such as “75% recall is the threshold for an acceptable production”?
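The arithmetic behind these limits is worth making explicit. With assessor A taken as gold standard and assessor B evaluated against it, precision is |A∩B|/|B| and recall is |A∩B|/|A|, so F1 = 2|A∩B|/(|A|+|B|). Writing I = |A∩B| and U = |A∪B| = |A|+|B|-I, this is 2I/(U+I); dividing through by U expresses it in terms of the Jaccard coefficient J = I/U:

```python
def f1_from_jaccard(j):
    """F1 of one assessor evaluated against the other, given their Jaccard overlap."""
    return 2 * j / (1 + j)

print(f1_from_jaccard(0.5))   # about two-thirds
print(f1_from_jaccard(0.28))  # 0.4375
```

This gives roughly two-thirds for an overlap of 0.5, and 0.44 for an overlap of 0.28, reproducing the limits quoted above.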

These are sobering thoughts; but there are perhaps reasons not to surrender to such a gloomy conclusion. We need to remember that the goal of the production (immediately, at least) is to replicate the conception of relevance of a particular person, namely the supervising attorney (or topic authority), not to render a production that all possible assessors would agree upon. And some productions appear surprisingly good at replicating this conception. Consider the effectiveness scores reported for the top-performing systems (and the most reliable manual review team) in the above table; these systems are achieving F1 scores from the high 70s to the mid 80s, as evaluated against the TA’s conception of relevance (as reflected in the adjudicated assessments). That is, while two assessors or producers might independently come to differing conceptions of relevance, it seems that if one producer or (attentive) assessor is asked to reproduce the conception of relevance of another, they are able to do a reasonably faithful job. (Mind you, we are eliding the possibility here that the TA is confused about their own conception of relevance. Certainly, their conception of relevance can shift over time; and the detailed criteria of relevance against which assessments and appeals were made were not formulated at the beginning of the production process, but after tens of hours of conception-clarifying interactions with the participating teams.)

A distinction also needs to be drawn between disagreement arising from differing conceptions of relevance, and disagreement arising from assessor error (that is, assessors making decisions through inattention or incompetence that they would not make if paying full attention and possessing the requisite skills). The distinction between disagreement and error is important to the current discussion, because error can be corrected (albeit at some expense), while disagreement is largely irreducible in the typical assessment setting.

Maura Grossman and Gord Cormack address just this question in another paper, “Inconsistent assessment of responsiveness in e-discovery: difference of opinion or human error?” (DESI, 2011). In that paper, they re-reviewed documents from TREC 2009 on which the topic authority and the first-pass assessor disagreed. Re-reviewed documents were rated inarguably responsive, arguable, or inarguably non-responsive, based upon the topic authority’s responsiveness guidelines. They then compared these ratings with the first-pass and official assessments, dividing them into cases in which the TA’s adjudication was inarguably correct (and the assessor had made an inarguable error); cases in which responsiveness was arguable; and cases in which the TA was inarguably incorrect (and the assessor inarguably correct). While rates vary between topics, on average they found the TA’s adjudication to be inarguably correct in 88% to 89% of cases. In other words, Grossman and Cormack conclude that the great majority of assessor errors are due to inattention, or to an inability properly to comprehend either the relevance criteria or the document. And since some 70% of relevant assessments were not contested, a rough estimate of achievable overlap would set it at around 95%.

Meeting the supervising attorney’s conception of relevance, though, is not the whole picture. The production ultimately has to be accepted as reasonable by the opposing side and by the judge, where there may be a genuine difference in conception of relevance, irreducible by careful weeding out of errors. Here, it is interesting to reflect that what Grossman and Cormack were able to do was to recreate the TA’s conception of relevance from the criteria guidelines, even for contested documents. That is, the TA’s conception of relevance was adequately externalizable, and in that sense objective (by which I mean that, given the guidelines, one no longer needed to refer to the TA’s subjective reaction to individual documents in order to make reasonably accurate relevance assessments, though of course there will always be particular cases which the guidelines don’t cover). (Unfortunately, as I’ve since learnt from Maura and Gord, they were aware at the time of their re-review what the TA’s adjudication was. It would be interesting to rerun the experiment blind, without this knowledge, to see how accurately experienced, careful assessors are able to recreate the original conception.)

Now, such explicit criteria or guidelines are not (to my understanding) typically created where predictive coding (that is, text classification, usually with active learning) is employed; instead, the supervising attorney conveys their conception of relevance by directly labelling documents selected by the learning algorithm. As Gord Cormack has suggested to me, we can consider the supervising attorney’s conception of relevance as a (developing) latent trait, which can be externalized on the one hand as a set of relevance labels, or on the other as a list of relevance criteria. And I’m assured by at least one prominent and technically-savvy e-discovery practitioner that the coherence of relevance assessment and quality of error detection that such predictive coding systems offer are such as to make explicit relevance criteria (let alone full manual review) redundant. I’d be interested to see this last claim verified empirically (and I’ve got an idea of how one might do so).
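As a concrete (and entirely hypothetical) illustration of the predictive-coding loop described above, here is a minimal uncertainty-sampling sketch in which an "attorney" oracle conveys the conception of relevance only through labels on learner-selected documents; the bag-of-words scoring function and toy corpus are invented for illustration, not anything a real system uses:

```python
def tokens(doc):
    return set(doc.lower().split())

def score(doc, rel_terms, nonrel_terms):
    """Crude relevance score: term overlap with labelled-relevant documents,
    minus overlap with labelled-non-relevant ones."""
    t = tokens(doc)
    return len(t & rel_terms) - len(t & nonrel_terms)

def active_learning(corpus, oracle, seed_idx, rounds=4):
    """Iteratively query the oracle (the supervising attorney) on the
    document whose current score is most uncertain (closest to zero)."""
    labelled = {seed_idx: oracle(corpus[seed_idx])}
    for _ in range(rounds):
        rel_terms, nonrel_terms = set(), set()
        for i, lab in labelled.items():
            (rel_terms if lab else nonrel_terms).update(tokens(corpus[i]))
        unlabelled = [i for i in range(len(corpus)) if i not in labelled]
        if not unlabelled:
            break
        query = min(unlabelled,
                    key=lambda i: abs(score(corpus[i], rel_terms, nonrel_terms)))
        labelled[query] = oracle(corpus[query])
    return labelled
```

Real systems use proper classifiers over richer text features; the point here is only the protocol: the conception of relevance enters the system solely as a stream of labels, never as written criteria.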

But even if predictive coding makes stated criteria of relevance functionally redundant, then how is the conception of relevance employed in the production conveyed to the opposing side and justified to the court — if not routinely (the presumption of good faith in such matters still pertaining), then at least when contested? And what is the basis upon which disputes over whether a certain document was unreasonably withheld or produced are resolved? And, finally, how are we to proceed if we doubt the competence or the good-will of the lawyer training the machine? Ralph Losey has argued to me that this assurance comes through transparency and cooperation between the requesting and producing party. Perhaps; but how much visibility does the requesting party have of the producing party’s negative assessments? The vision of an externalizable and verifiable description of a conception of relevance still seems to me highly attractive.