Non-authoritative relevance coding degrades classifier accuracy

There has been considerable attention paid to the high level of disagreement between assessors on the relevance of documents, not least on this blog. This level of disagreement has been cited to argue in favour of the use of automated text analytics (or predictive coding) in e-discovery: not only do humans make mistakes, but they may make as many mistakes as, or more than, automated systems do. But automated systems are only as good as the data used to train them, and production managers have an important choice to make in generating this training data. Should training annotations be performed by an expert, but expensive, senior attorney? Or can they be farmed out to the less expensive, but possibly less reliable, contract attorneys typically used for manual review? This choice comes down to a trade-off between cost and reliability—though ultimately reliability itself can be (at least partly) reduced to cost, too. The cost question still needs to be addressed; but Jeremy Pickens (of Catalyst) and I have made a start on the question of reliability in our recent SIGIR paper, Assessor Disagreement and Text Classifier Accuracy.
The basic question that Jeremy and I ask is the following. We have two assessors available, one whose conception of relevance is authoritative, and the other whose is not (one could think of these as the senior and the contract attorney respectively, though we are addressing primarily the question of assessor disagreement, not expertise). The effectiveness of the final production is to be measured against the authoritative conception of relevance. The predictive coding system (or text classifier) can be trained on annotations from either the authoritative or the non-authoritative assessor. What (if any) loss in classifier effectiveness is there from training with the non-authoritative assessor, compared to training with the authoritative one?

Working with TREC (non-e-discovery) data that has been used in past research on assessor disagreement, we find that training with the non-authoritative assessor does indeed lead to lower effectiveness than training with the authoritative one. For the dataset examined, binary F1 score was 0.629 with authoritative training, against 0.456 with non-authoritative training. The losses of effectiveness from assessor disagreement and from using a machine classifier rather than the assessor directly are additive: annotation by the non-authoritative assessor followed by automated review leads to lower effectiveness than either non-authoritative human review, or automated review with authoritative training examples (though the additive degradation is not quite as great as the level of human disagreement would predict, based on randomization experiments).
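To make the shape of the experimental protocol concrete, here is a minimal, runnable sketch of the self- versus cross-classification comparison. The keyword-overlap classifier, documents, and labels below are all invented for illustration; none of this is the paper's actual TREC data or classifier. The point is only the protocol itself: train on one assessor's labels, evaluate against the authoritative assessor's.

```python
# Toy version of the self- vs cross-classification protocol. Documents are
# sets of terms; the "classifier" simply checks term overlap with terms seen
# in positive training examples (an illustrative stand-in, not a real model).

def f1(predicted, truth):
    """Binary F1 of predicted labels against the authoritative truth."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def train(docs, labels):
    """'Train' by collecting terms seen in positive training documents."""
    positive_terms = set()
    for doc, lab in zip(docs, labels):
        if lab:
            positive_terms |= doc
    return positive_terms

def classify(model, doc):
    """Predict responsive if the doc shares at least two terms with the model."""
    return len(doc & model) >= 2

# Invented collection: each document is a set of terms.
train_docs = [{"merger", "price", "email"}, {"lunch", "menu"},
              {"merger", "valuation"}, {"picnic", "memo"}]
test_docs = [{"merger", "price", "deal"}, {"lunch", "picnic"},
             {"valuation", "merger", "offer"}]

auth_train = [True, False, True, False]      # authoritative training labels
nonauth_train = [True, False, False, False]  # non-authoritative: misses doc 3
auth_test = [True, False, True]              # test truth is always authoritative

self_model = train(train_docs, auth_train)      # self-classification
cross_model = train(train_docs, nonauth_train)  # cross-classification

self_f1 = f1([classify(self_model, d) for d in test_docs], auth_test)
cross_f1 = f1([classify(cross_model, d) for d in test_docs], auth_test)
```

With these toy labels, the cross-trained classifier misses one responsive test document, so its F1 drops from 1.0 to 0.67; the paper's 0.629 versus 0.456 result is the same comparison at realistic scale.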

What does this mean in terms of overall production effort? A common pattern in e-discovery is to have two rounds of human involvement: one to train the classifier; another to review the classifier's positive predictions before production. Poor performance by the classifier can therefore (to some extent) be compensated for by extending the review further down the prediction ranking that the classifier produces, to bring recall up to the required level; in this way, poor reliability is converted into extra cost. Our experiments found that the increase in review depth needed to achieve a recall target of 75%, from using a non-authoritative trainer, was on average 24%; but for one in eight tasks, the required review depth doubled.
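The review-depth calculation itself is simple to state in code. Below is a small sketch: walk down the ranking, accumulating relevant documents, and stop at the shallowest depth that reaches the recall target. The two rankings are invented for illustration, not taken from our experiments.

```python
# How much deeper must review go to hit a recall target? Given a classifier's
# ranking and the authoritative relevance (1/0) of each ranked document, find
# the shallowest review depth achieving the target recall.

def depth_for_recall(ranked_relevance, target=0.75):
    """Smallest review depth achieving `target` recall, or None if unreachable."""
    total_relevant = sum(ranked_relevance)
    needed = target * total_relevant
    found = 0
    for depth, rel in enumerate(ranked_relevance, start=1):
        found += rel
        if found >= needed:
            return depth
    return None

# Hypothetical rankings over the same 12-document collection (1 = relevant
# under the authoritative conception, 0 = not).
auth_trained  = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
cross_trained = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

d_auth = depth_for_recall(auth_trained)    # 3 of 4 relevant found by depth 3
d_cross = depth_for_recall(cross_trained)  # relevant docs sit deeper: depth 6
```

Here the cross-trained ranking needs a review depth of 6 rather than 3 to reach 75% recall, i.e. double the review effort: the kind of degradation we observed on the worst one in eight tasks.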

These results do not directly answer the question of cost. One may come out ahead overall even with the deeper review, due to the savings from using a cheaper trainer—though this largely assumes that the reviewers are also cheaper contract attorneys (which rather raises the question of the reliability of the review itself). Also, the experimental data and setup are more than a little removed from those of an actual e-discovery production, so the precise numerical results may not be directly translatable. Nevertheless, we have demonstrated that who does the training can have a substantial effect on how reliable the (machine) trainee is. As other commentators have urged, consideration of these human factors needs to be a central part of ESI protocol design.

As an appendix, I provide a couple of figures that did not make it into the published paper, but which speak to the relative reliability of non-authoritative training and automated review.

The first of these figures, above, shows the F1 score of inter-assessor agreement (on the X axis), compared to the F1 score of the classifier when trained using the non-authoritative trainer ("cross-classification", in the jargon of the paper, on the Y axis; this is equivalent to the Y axis of Figure 2 of the paper). Each data point represents one of the two alternative trainers on each of the 39 topics included in our experiments. We can see that the classifier with non-authoritative training is substantially less accurate than inter-assessor agreement alone.

F1 of inter-assessor agreement compared to F1 of classifier trained by authoritative assessor.

The second of these figures, above, compares the non-authoritative assessor's agreement with the authoritative assessor (on the Y axis) against that of the classifier trained by the authoritative assessor ("self-classification" in our jargon, on the X axis). This figure is essentially asking the same question as Grossman and Cormack, namely whether predictive coding (with an authoritative annotator) is as reliable as (non-authoritative) human review; and, as with Grossman and Cormack, we find (as far as our dataset is able to answer this question) that the answer is, on average, yes, though with enormous variation between topics and assessors. Indeed, the above figure may understate the accuracy of the machine classifier, since training set size was limited to the relatively small amount available in the dataset, and substantially larger training sets would be standard in e-discovery—though again many other aspects of the data and process would be different as well.

...though this largely assumes that the reviewers are also cheaper contract attorneys (which rather raises the question of the reliability of the review itself).

What about your experiments? Did you compare outcomes in these scenarios:
a) reviews with cheap workers
b) reviews with expensive workers
Can cheap workers help improve results produced by other cheap workers? Or do we need expensive ones at the review stage? Though it may be less expensive, of course.

Thank you for sharing this interesting study. I will want to blog about it for sure, but in the meantime here is my immediate reaction.

The issue of cost is important, but secondary to the issue of SME, subject matter expertise. The subject matter expert is capable of recognizing all relevant or responsive documents, especially with close call and unusual documents. The non-SMEs are not. GIGO. If you continue reviews by non-SMEs, then you just compound the errors. No matter how many times you multiply zero the result is still zero.

Without an SME there is no final judge of relevance. No calculations of recall or precision are possible.

Without an SME, wrongfully withheld documents are in most cases never seen by the requesting party or the judge. Conducting a review without the supervision of a qualified SME is for this reason not permitted by law or by legal ethics. Competence and reasonable efforts are legal requirements, not optional expense considerations. For these reasons a review of documents cannot be carried out ethically in a litigated matter without the review being conducted by a lawyer, or in many states, by a paralegal under the direct supervision of a lawyer.

But just being a lawyer does not make you an SME. Far from it. For instance, even though I have 33 years of experience as a practicing lawyer I am not an SME in most areas of the law. I am just an SME in the types of cases and issues that I have extensive experience with. My experience allows me to quickly learn a new area from a bona fide SME in that area, but it does not make me an SME. The ability of inexperienced and low paid contract lawyers to learn from an SME and act as the SME's agent with any accuracy is highly questionable. The studies indicate it is not possible. Even if it was, there is the problem of multiplying errors by the use of many contract lawyers. It may only work IMO if one SME teaches one non-SME who is also a search expert, and power user of the software in question. But I digress.

The point here is that a document review by just lawyers, who are not also SMEs, is not reasonable and not permitted by law or ethics. Never has been, even in the old paper discovery days. You could not ethically just hand a file to a new associate and tell them to find relevant documents without proper training and supervision.

It may also be helpful for your non-attorney readers to understand that legal SMEs are different than typical SMEs. In the law the ultimate SME is the judge in the lawsuit. The lawyer reviewer SME analyzes each document to understand its legal significance and responsiveness, but also to predict how the assigned judge would rule on the issue of the discoverability of the document. Usually the SME's prediction and personal opinion are aligned, but if there is divergence, and if there is no desire by the SME to try to change the law, then the prediction governs. In fact, even outside of document review I spend most of my time as an SME on e-discovery law giving other lawyers my prediction of how the judge will rule in a case on specific issues that I am questioned about. Then I explain why I think they will do that. Sometimes I am wrong, but most often I am right. Plus, I may often express my prediction of a court ruling with probability qualifications. So, I may say that Judge X is almost certain to rule in a particular way given these facts and legal issues, or I might say it is a toss up, but he will probably rule one way (as we say in the law, depending on what the judge had for breakfast).

A good legal SME, when analyzing each document, will base the prediction of judicial ruling on the assumption that the issue was properly and fully briefed and presented to the presiding judge. This determination involves an internal analysis based on what can sometimes be a complex analysis of the law itself, plus analysis of the application of the law to the facts of the case in general, and its application to each document reviewed in particular.

This internal debate may be less than a micro-second on most irrelevant classifications, but may take several minutes of active mentation for some close-call documents, or complex issues. Consultation with other SMEs may also sometimes be required, but this is very rare when considering relevance. Usually in such cases it means the document is responsive.

Classifications on privilege are typically much more difficult than relevance or responsiveness classifications, and multiple SME consults on privilege are more common, sometimes even legal research is required (although typically the SME knows the law in his head, which is why he is a subject matter expert of the legal issues under review). Also be aware that privilege law is a separate legal subject matter, and an expert in one area of law, unlawful discrimination for example, may not also be an SME on privilege.

Thanks again for this paper in particular, and your interest in legal review in general. The new team for the search for truth in order to do justice is Law, Technology and Science.

Hi! We looked at combining two cheap workers, which led to a small improvement over just one. But the other experiments we didn't try (our dataset didn't really support them). A related question, which our dataset also didn't really support, but which we hope to look at (or hope someone else will look at) on a more suitable dataset (one with a larger number of judgments per query), is how many more training judgments (not review judgments) you need from a non-authoritative assessor to get the same classifier effectiveness as from an authoritative assessor (and indeed whether it is possible at all to reach the same effective upper bound on performance).
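For the curious, here are two simple ways one might pool two cheap assessors' labels before training. These are generic illustrations of the idea, not necessarily the combination method used in our experiments.

```python
# Two hypothetical ways to pool a pair of non-authoritative assessors'
# binary relevance labels before training a classifier.

def pool_union(a, b):
    """Responsive if either assessor said so (favours recall)."""
    return [x or y for x, y in zip(a, b)]

def pool_intersection(a, b):
    """Responsive only if both assessors agreed (favours precision)."""
    return [x and y for x, y in zip(a, b)]

# Invented labels from two cheap assessors over the same four documents.
asr1 = [True, True, False, False]
asr2 = [True, False, True, False]

pooled_u = pool_union(asr1, asr2)         # [True, True, True, False]
pooled_i = pool_intersection(asr1, asr2)  # [True, False, False, False]
```

Which pooling rule helps depends on whether the classifier suffers more from missed positives or from noisy ones, which is again an empirical question.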

Hi! Thanks for your detailed and enlightening comment. I am glad to see so clearly explained that the SME's job is to predict what the judge would think if asked. Thus, neither the SME nor the non-SME is themselves authoritative (in the sense used in our paper), but the SME is a better predictor of the authoritative judgment.

Without an SME there is no final judge of relevance. No calculations of recall or precision are possible.

Let me note that eDiscovery as a whole consists of two main tasks: (1) Finding all the responsive documents, and (2) Verifying that all responsive documents have been found. In our study, William and I are only concentrating on part (1), the finding. The dataset as it is constructed allows us to verify this finding using the final expert judgment, so we are not in disagreement about needing that final expert judgment in order to calculate recall-precision curves, yield curves, etc. But again, the purpose of this study is to focus on the finding part, and what effect non-authoritative judgments have on the ability for documents to be found.

The issue of cost is important, but secondary to the issue of SME, subject matter expertise. The subject matter expert is capable of recognizing all relevant or responsive documents, especially with close call and unusual documents. The non-SMEs are not. GIGO. If you continue reviews by non-SMEs, then you just compound the errors. No matter how many times you multiply zero the result is still zero.

So let me give a little background. My understanding is that the current manner in which most, if not all, of eDiscovery operates is that SMEs are not used to judge every document in the collection. Suppose you've got 5 million documents. SMEs are not looking at all of these documents anyway, even in the days of manual linear review, correct? And in these bold new days of TAR, when you only have to look at 10% of the collection, the SME is still only looking at a few thousand, but not the full 500,000 (10%) out of the 5 million.

If so, that means that even with TAR, the non-SMEs are looking at (judging) 495,000 documents anyway. That is the real-world, practical reality. If I'm not stating this correctly, please pipe up. But if I do have a correct understanding, then the debate we're having becomes a question of whether you really need an SME to look at those 5,000 documents, or whether you could instead have non-SMEs look at not only those 5,000, but perhaps even 20,000, 50,000, 100,000. At what point do the TAR results that you get from training with 100,000 non-SME judgments allow an ordering or classification of the remainder of the collection that is equal to (or perhaps better than?) the one you get from training with the 5,000 SME judgments? And there it does become an issue of cost, if the non-SMEs never catch up to the SME, as to how much more work will be required of the non-SMEs until they have eyeballs on those same original 495,000 documents that they would have had a chance to put eyeballs on, had you trained with the SME. Again, the non-SMEs are already judging all 495,000 documents anyway. It's just a question of whether they have to look at exactly 495,000 documents to see those 495,000, or if they have to look at 550,000 to see those 495,000. Go back and read this paragraph again.. it's a bit tricky.. d'ya see what I'm saying?
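To put rough numbers on that trade-off, here is a back-of-envelope sketch using the document counts from the paragraph above and entirely hypothetical per-document rates (the $4 and $1 figures are assumptions for illustration only, not data from anywhere).

```python
# Back-of-envelope cost comparison for the two training regimes, with
# ASSUMED per-document rates: $4/doc for the SME, $1/doc for contract
# reviewers. Document counts come from the discussion above.

def production_cost(train_docs, train_rate, review_docs, review_rate):
    """Total cost = training judgments + review judgments, each at its rate."""
    return train_docs * train_rate + review_docs * review_rate

SME_RATE, CONTRACT_RATE = 4.0, 1.0  # hypothetical $/document

# SME trains on 5,000 docs; contract reviewers then review 495,000.
sme_trained = production_cost(5_000, SME_RATE, 495_000, CONTRACT_RATE)

# Non-SMEs train on 100,000 docs; the weaker ranking forces review of 550,000.
nonsme_trained = production_cost(100_000, CONTRACT_RATE, 550_000, CONTRACT_RATE)
```

Under these assumed rates the SME-trained production actually comes out cheaper ($515,000 versus $650,000); the break-even point shifts with the rate ratio, the extra training volume, and the extra review depth, which is exactly why this is an empirical cost question rather than one with a fixed answer.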

The question about who you train with is an empirical question, not a philosophical one. I think there are reasons that even correct SME judgments can be problematic for training, while incorrect non-SME judgments will result in a better outcome. For example, suppose a document really needs to contain all four issues, A, B, C, and D, in order for it to truly be responsive. A good SME will recognize this, and recognize that the document contains all four issues. And that same SME will not only mark another document that contains issues F and G as nonresponsive, but also (correctly) mark another document that only contains A, C, and D as nonresponsive. Whereas the non-SME will make a mistake and mark the A, C, D document as responsive.

This is going to start to confuse the machine. It is very easy to distinguish A, B, C, D from F, G. But when negative examples also include A, C, and D, machines get confused. So what will end up happening is that the A, C, D negative training example will "pull down" a lot of truly responsive A, B, C, D documents. If, on the other hand, the non-SME labels things the following way:

doc 1 (A, B, C, D): responsive
doc 2 (F, G): nonresponsive
doc 3 (A, C, D): responsive

..then doc 3 is indeed INcorrectly labeled. But it won't confuse the machine as much, and the machine will put more A, B, C, D truly (as an SME would agree) responsive documents toward the top of the ranked list, or onto the correct side of the classification.
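To see the mechanism concretely, here is a tiny numerical toy (constructed purely for illustration, not part of the paper's experiments): a centroid-difference linear scorer over six made-up issue features, trained under the two labelings.

```python
# Toy demonstration: with the correct SME negative (A,C,D nonresponsive),
# a centroid-difference linear model must lean on issue B to separate the
# classes; with the non-SME mislabel, A, C, D keep strong weight.

ISSUES = ["A", "B", "C", "D", "F", "G"]

def vec(present):
    """Binary feature vector: 1 where the issue appears in the document."""
    return [1.0 if i in present else 0.0 for i in ISSUES]

def centroid(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(ISSUES))]

def weights(docs, labels):
    """Linear weights = positive-class centroid minus negative-class centroid."""
    pos = centroid([d for d, l in zip(docs, labels) if l])
    neg = centroid([d for d, l in zip(docs, labels) if not l])
    return [p - q for p, q in zip(pos, neg)]

docs = [vec("ABCD"), vec("FG"), vec("ACD")]  # the three training documents
sme_labels = [True, False, False]     # SME: A,C,D alone is NOT responsive
nonsme_labels = [True, False, True]   # non-SME mistake: A,C,D responsive

w_sme = dict(zip(ISSUES, weights(docs, sme_labels)))
w_nonsme = dict(zip(ISSUES, weights(docs, nonsme_labels)))
```

With the correct SME negative, the separating weight concentrates on issue B (w_B = 1.0 versus w_A = 0.5), so a truly responsive document whose B evidence happens to be textually weak gets pulled down the ranking; with the non-SME mislabeling, A, C, and D keep strong weight (w_A = 1.0 versus w_B = 0.5).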

But again, how much of an effect this has is not a philosophical question, but an empirical question. William and I have some additional research that we've done, extensions of this current paper, that we're in the process of writing up.

WILLIAM - One comment on your data. It seems all of the judgments under review were by SMEs. There was no equivalent in the data to contract lawyers. You just chose to call the second set of SME judgments the non-authoritative judgments.

"... the organizers arranged for selected documents to be triply assessed, first by the author of the TREC topic, and then by two additional assessors, who were authors of other TREC topics [8]. We treat the original assessor as the authoritative, testing assessor, and separately treat each additional assessor as a training assessor for cross-classification."

But the differences in expertise were not shown. In fact, it seems more likely they all had about equivalent expertise in these made-up research topics. A better test would be the review classifications of the actual volunteer reviewers, the contract lawyers and law students. That would better model the differences between use of contract lawyers and SMEs. But perhaps I'm missing something here? And no doubt there was a good reason for you not to do so?

Hi! Yes; as I mention in my blog post, we're not considering the impact of differing level of expertise here, solely of assessor disagreement. (Though the assessors are not quite interchangeable: the "authoritative" original assessor is the one who wrote the topic, and so they are the one who "knew what they meant".) The reasons for only looking at the effect of disagreement are, first, one variable at a time; and second (and perhaps more importantly), this data set was already available, and only incorporated disagreement.

You're right that, as the question applies to training practice in e-discovery, what one really wants to do is compare actual contract attorneys with actual SMEs. This sort of data should be collectable by production managers (and indeed is the sort of human-factors quality control that, in my opinion, should be performed regularly). Jeremy and I are hoping to get a closer simulation to this situation using the TREC Legal Track data, but there are a number of data cleaning issues before we can do that.

"If so, that means that even with TAR, the non-SMEs are looking at (judging) 495,000 documents anyway."

It is true that the SME only reviews a small percentage. That part is certainly right. But you no longer have armies of low paid lawyers review the rest. That is a big waste of time and money. They are no longer needed with today's software. The assumption you make was true in the old days, say 3-5 yrs ago, before SMEs had powerful active machine learning tools. Then the SME reviewed only a few thousand and armies of underpaid, low skilled and barely motivated persons with law degrees reviewed the rest. But that is old school. That is not the current paradigm. That is not the kind of CAR I drive, or Maura, etc. We do ALL of the relevance culling ourselves. Then the non-SMEs do the follow-up drudge work of redaction, confidentiality, privilege, and sometimes noting possible errors in relevancy determination for the SME to double check. (They can change a relevant to irrelevant and a mere relevant to highly relevant, but we double check those changes and they make such changes under our direct supervision and training.)

The SMEs do not have to actually work on each document individually to do their classifications, and neither do the non-SMEs. But the SMEs read even fewer docs than the non-SMEs, usually, or maybe better said, sometimes. Certainly the SMEs never read all of the documents they classify. The computer does that for them.

No more eyeballs on the docs culled out as probable irrelevant by the computer trained by SME. So I disagree with your premise. You are working on an obsolete issue. Although most lawyers do not realize that yet, so I can well understand your confusion.

Once again I may well have misunderstood your premise, and if so, please let me know. But perhaps we should try a real time phone chat for that? I have had similar conversations with some software companies, and I know many lawyer users of predictive coding still use both predictive coding and manual review by contract lawyers. But that is an unnecessary step and a big time waster. Defeats the whole purpose really, and I think your study helps show that.

Hi! One model that I have in mind for e-discovery driven by predictive coding is as follows (leaving out control samples, production certification, and review for privilege):

1. An expert trainer provides a limited number of training examples to the predictive coder (say, 2,000).

2. The predictive coder ranks the collection (excluding the training examples) by decreasing predicted probability of responsiveness.

3. A cutoff point is selected in the ranking in order to achieve the required level of recall.

4. The documents above the cutoff point are sent to contract attorneys for manual review, to cull out the remaining non-responsive documents.

5. What remains after this culling is the production.
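The steps above can be sketched in code as follows. The keyword scorer, documents, and fixed cutoff below are toy stand-ins (a real predictive coder, and a cutoff estimated from a control sample to hit the recall target, would take their place).

```python
# Runnable sketch of the five-step workflow; everything here is a toy
# stand-in for illustration only.

def train_scorer(train_docs, labels):
    """Step 1: learn a scoring function from the expert's labeled examples."""
    positive_terms = set()
    for doc, lab in zip(train_docs, labels):
        if lab:
            positive_terms |= doc
    return lambda doc: len(doc & positive_terms)  # crude relevance score

def run_production(collection, train_docs, labels, cutoff, review):
    score = train_scorer(train_docs, labels)
    # Step 2: rank the rest of the collection by decreasing predicted score.
    remainder = [d for d in collection if d not in train_docs]
    ranked = sorted(remainder, key=score, reverse=True)
    # Step 3: cut the ranking off (here `cutoff` is given directly; in
    # practice it would be chosen to achieve the required recall).
    candidates = ranked[:cutoff]
    # Steps 4-5: manual review culls non-responsive candidates; the
    # surviving documents are the production.
    return [d for d in candidates if review(d)]

# Toy collection: each document is a frozenset of terms.
col = [frozenset(s.split()) for s in [
    "merger price deal", "lunch menu", "merger valuation offer",
    "picnic memo", "merger deal terms"]]
produced = run_production(col, train_docs=col[:2], labels=[True, False],
                          cutoff=2, review=lambda d: "merger" in d)
```

The division of labour is visible in the signature: the expert supplies `train_docs` and `labels`, while `review` is the contract attorneys' pass over the candidates above the cutoff.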

I think this is the model where in Step 4, Jeremy refers to the non-SMEs still having to look at many tens or even hundreds of thousands of documents.

Ralph, I'm thinking that you're working with another model in mind, one in which Step 4 is not performed as manual review by contract lawyers. But presumably the SME also does not look through all the candidate production documents to weed out non-responsive ones. What, then, is done instead? Does the SME bulk-code groups of documents? Or are all documents with a high-enough predicted relevancy score produced, without manual verification by a human?

William, I believe Ralph is talking about bulk coding without eyeballs-on manual verification. I mean, that's my understanding. But I'll let him correct me/us if not.

Ralph, you write:

"No more eyeballs on the docs culled out as probable irrelevant by the computer trained by SME. So I disagree with your premise. You are working on an obsolete issue. Although most lawyers do not realize that yet, so I can well understand your confusion."

No, putting eyeballs on docs culled out (actually, I prefer thinking about it as "deprioritized" rather than culled out, but.. same effect) as probably irrelevant is NOT what I am talking about. I am thinking closer to the model that William cites above. I have a slightly better workflow in mind than this model, but at its core, what he is saying is the model I am working with as the current state of industry acceptance.

Go back to my original comment, and I talk about having a collection with 5 million documents, the bottom 90% of which are deprioritized/culled out because of TAR technology. That is a huge savings, and I am NOT talking about putting any kind of eyeball.. SME or non-SME.. on those 4,500,000 documents. But the remaining 10% is still 500,000 documents. Are you going to have the SME put eyeballs on all of those? Or are you going to bulk code, as William suggests?

But essentially the question is this: Let's set aside the issue of whether ANYONE (SME or non-SME) looks at the 495,000 documents ABOVE the deprioritization/culling point. We all agree that no one is going to look at the 4,500,000 below that culling point. But let's not care for the moment if there actually will be eyeballs on documents above that culling point, other than the documents used to establish that culling point (i.e. training docs, seeds).

In that case, the question becomes: Is the only way to select and review documents used for training to have the SME come in and do it, 100%? Or might it be possible to use non-SME eyeballs and feedback as part of the process? If so, if one does allow non-SME input into those training documents, then there exists the possibility of making the process both faster (parallelism) and cheaper.

Don't get me wrong; I am not against the SME. As you know, I am a big, big fan of HCIR and human intuition and knowledge as part of the process. I am also not suggesting full-on Borg-ism. I certainly do not want to get rid of the expert, SME or not, in any way.

Rather, I am.. er, William and I, through some initial steps taken in this research.. beginning to open up the possibility of an SME-nonSME hybrid model. I love hybrid models. You love hybrid models. We all love hybrid models. So why not a hybrid model to more quickly (and perhaps even better?) establish that cutoff point in your collection, whether or not you batch code or do put eyeballs on that 10%? Maybe with a hybrid model, you might be able to move that 10% cutoff up to a 7% cutoff.. at the same level of recall.. thereby reducing your risk and exposure. Know what I mean?