cicayda blog

Search is Easy; Search is Hard, Part 2

This post is Part 2 in the "Search is Easy; Search is Hard" series. To view Part 1, click here.

Why Are We Searching?

Them: “Give me all of your… Aces!”

Us: “Go Fish”

Them: “I don’t believe you. Let me see your cards!”

Search is a tool that Discovery practitioners often use in the process of answering a “Request for Production”. Examine the following excerpt from a fictitious Request for Production (Request):

1. Any and all written correspondence between ABC and Widgets Incorporated between December 1, 2001 and the present date that relate directly to the December 30, 2001 contract between said parties for 1,710 PCUs.

2. Any and all records of oral communication between ABC and Widgets Incorporated between December 1, 2001 and the present date that relate directly to the contract described in Request #1.

3. Any and all internal communication between December 1, 2001 and the present date between any of the following parties: John Smith, Sandy Taylor, and Kristin Fielding, relating to the Widgets Incorporated contract described in Request #1.

4. Any and all internal communication between any of the following parties: John Smith, Sandy Taylor, and Kristin Fielding relating to the agreement with ABC from December 1, 2001 that involved the 1,710 PCUs for the Widgets Incorporated contract described in Request #1, that are not duplicative of those documents produced pursuant to Request #2.

Go ahead and paste that into Google….

These requests are simple enough to understand. Any person familiar with the matter and the businesses involved would likely recognize whether or not a particular document is relevant to one or more of these requests. Even so, there is evidence that the “gold standard” of exhaustive manual review is less sacrosanct than you might guess (Ellen M. Voorhees, “Variations in relevance judgements and the measurement of retrieval effectiveness”, 1999).

This is our primary challenge in eDiscovery: Relevance is inherently subjective. That means that responding to one of these requests for production cannot be handled solely by technology until we decide to abdicate our role as judge to the machines.

In the Meantime...We Search

It's always search, all of the time. There are many kinds of search: comparative, keyword, boolean, proximity, fuzzy, expression, similarity, and conceptual, to name a few. Even manually reading a document is a form of search. The outcomes of these searches can all be characterized by two measures, precision and recall (reference: https://en.wikipedia.org/wiki/Precision_and_recall).

Think of precision as a measure of the percent of retrieved documents that are relevant. A perfectly precise search would return no documents that are irrelevant. Precision alone is not a sufficient measure of success. After all, a search that returns zero results returns no irrelevant documents, so by that reasoning it is perfectly precise (strictly speaking its precision is undefined, because there are no retrieved documents to divide by). Producing zero documents is probably not going to be acceptable to the requesting party.

Recall is a measure of the percent of relevant documents retrieved in the search. A perfect recall search would return 100 percent of the documents that are relevant to the request. Recall percentage is also not a sufficient measure of search success by itself. Perfect recall is easily achieved by always returning all documents. Producing all documents may not be the best strategy for the client.
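To make the two measures concrete, here is a minimal Python sketch. It takes precision to be the share of retrieved documents that are relevant and recall to be the share of relevant documents that are retrieved, as in the Wikipedia reference above. The ten-document collection and the function name precision_recall are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    A value is None when its denominator is zero (nothing to divide by).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical collection: 10 documents, 4 of which are relevant.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}

narrow = precision_recall({1, 2}, relevant)          # small, clean result set
everything = precision_recall(collection, relevant)  # return all documents
nothing = precision_recall(set(), relevant)          # return zero documents

print(narrow)      # (1.0, 0.5)
print(everything)  # (0.4, 1.0)
print(nothing)     # (None, 0.0)
```

The last two calls are the degenerate cases described above: returning everything buys perfect recall at the cost of precision, and returning nothing leaves no irrelevant documents in the result but also no relevant ones.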

Recall before Precision or Precision before Recall

It turns out that high precision and high recall seem to work against one another. In a common-sense kind of way they are working towards opposite purposes: one is to eliminate items (precision) and the other is to include them (recall). Is it then possible to achieve high scores in both? Two techniques dominate the market now. For our purposes I’ll call them “Linear Review” and “Technology Assisted Review”.
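The tension between the two measures can be sketched with a ranked search. The example below assumes each document comes back with a relevance score and that we keep everything at or above a cutoff; the scores, the labels, and the cutoffs are all invented for illustration.

```python
# Hypothetical ranked results: (score assigned by the search, truly relevant?)
scored = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, True), (0.2, False), (0.1, False),
]
total_relevant = sum(1 for _, is_relevant in scored if is_relevant)


def measure(cutoff):
    """Keep documents scoring at or above the cutoff; return (precision, recall)."""
    kept = [is_relevant for score, is_relevant in scored if score >= cutoff]
    hits = sum(kept)
    return hits / len(kept), hits / total_relevant


# Lowering the cutoff includes more documents.
results = {cutoff: measure(cutoff) for cutoff in (0.75, 0.5, 0.25, 0.0)}
for cutoff, (precision, recall) in results.items():
    print(f"cutoff {cutoff:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

In this made-up data, each looser cutoff raises or holds recall while precision falls, which is the pattern the two review techniques try to work around from opposite directions.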

A linear review is a process where documents are first retrieved according to some search criteria, often derived by analysis of the request for production and negotiation with opposing counsel. The retrieved documents are subjected to a manual review by attorneys. A sample of the reviewing attorneys’ decisions is reviewed by other, more senior attorneys in the matter, who also resolve disputes and provide clarifications when needed. Linear review is a “recall before precision” approach, aimed at achieving high recall at the beginning of the process and an increase in precision by virtue of a subsequent manual review process.
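The two stages of a linear review can be sketched in a few lines of Python. Everything here is hypothetical: the six documents, the search terms, and the relevance flags, which stand in for the decisions a reviewing attorney would make.

```python
# Hypothetical mini-collection: id -> (text, relevant as an attorney would judge it)
documents = {
    1: ("Widgets contract for 1,710 PCUs signed", True),
    2: ("Follow-up on the Widgets PCU delivery schedule", True),
    3: ("Widgets holiday party invitation", False),
    4: ("Lunch order for Friday", False),
    5: ("PCU pricing under the December contract", True),
    6: ("Widgets parking passes", False),
}


def keyword_search(docs, terms):
    """Stage 1: retrieve every document containing any of the search terms."""
    terms = [term.lower() for term in terms]
    return {doc_id for doc_id, (text, _) in docs.items()
            if any(term in text.lower() for term in terms)}


def manual_review(docs, doc_ids):
    """Stage 2: stand-in for attorney review; keep documents judged relevant."""
    return {doc_id for doc_id in doc_ids if docs[doc_id][1]}


def precision_recall(retrieved, relevant):
    # Assumes both sets are non-empty, which holds for this example.
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)


relevant = {doc_id for doc_id, (_, flag) in documents.items() if flag}
retrieved = keyword_search(documents, ["widgets", "pcu"])
produced = manual_review(documents, retrieved)

print(precision_recall(retrieved, relevant))  # (0.6, 1.0)
print(precision_recall(produced, relevant))   # (1.0, 1.0)
```

The manual stage can only raise precision. Any relevant document the search criteria miss never reaches a reviewer, so recall is fixed by the first stage.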

A technology assisted review is generally a process where a statistically significant random sample of documents is selected from the entire set and submitted for manual review by attorneys who are considered authoritative experts on the subject. Various computer algorithms are then applied to propose review decisions for the documents that were not manually reviewed. In essence, the “computer” learns which documents are relevant based on the decisions the attorneys made on “similar” documents. Attorneys typically review these computer decisions until the attorneys and the computer agree a sufficiently high percentage of the time. Technology assisted review is a “precision before recall” approach: it requires a manual review of documents at the start, and recall subsequently improves as machine learning algorithms retrieve additional documents.
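The loop described above can be sketched as follows. This is a toy under stated assumptions, not how any particular product works: the word-counting "model" is a deliberately naive stand-in for real machine learning, the seed set is hard-coded where a real review would draw a random sample, and all documents and attorney decisions are invented.

```python
from collections import Counter

# Hypothetical collection: id -> text
documents = {
    1: "widgets contract for pcus signed",
    2: "lunch order for friday",
    3: "pcus delivery under the widgets contract",
    4: "friday parking passes",
    5: "contract amendment for pcus pricing",
    6: "lunch menu and parking update",
}

# Attorney decisions on the seed set (hard-coded here; normally a random sample).
seed_decisions = {1: True, 2: False}


def train(docs, decisions):
    """Weight each word +1 per relevant seed document and -1 per irrelevant one."""
    weights = Counter()
    for doc_id, is_relevant in decisions.items():
        for word in set(docs[doc_id].split()):
            weights[word] += 1 if is_relevant else -1
    return weights


def propose(docs, weights, reviewed):
    """Propose a decision for every unreviewed document from its word weights."""
    return {doc_id: sum(weights[word] for word in set(text.split())) > 0
            for doc_id, text in docs.items() if doc_id not in reviewed}


def agreement_rate(proposals, attorney_checks):
    """Share of spot-checked proposals where attorney and computer agree."""
    matches = sum(proposals[doc_id] == decision
                  for doc_id, decision in attorney_checks.items())
    return matches / len(attorney_checks)


weights = train(documents, seed_decisions)
proposals = propose(documents, weights, reviewed=seed_decisions)
print(proposals)  # {3: True, 4: False, 5: True, 6: False}

# Attorneys spot-check some proposals; low agreement would mean another round.
rate = agreement_rate(proposals, {3: True, 6: False})
print(rate)  # 1.0
```

In a real matter, the spot-checked decisions would be fed back into training and the cycle repeated until the agreement rate is high enough for the parties involved.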

There is a lot of emotion and debate in the marketplace about which technique is better. I won’t be addressing that question at all in this series.

In Part 3, I will take a look at constructing searches for our fictitious request for production in the context of a linear review, and at some of the many ways we can easily get the wrong results. Be sure to subscribe to the blog here so you won’t miss the rest of the series.