Morrison on Metrics: Analytical Discovery Software and Metrics

My point in this column is that metrics predominate in e-discovery generally and especially for one particular category of software. Analytical discovery software determines the relevance of documents in a collection based on the contents of the documents rather than on the presence in them of key words or phrases. Sometimes referred to as machine-aided review software, it gathers similar documents into clusters, makes binary relevance determinations or ranks the documents it finds according to an algorithm for relevance.

Metrics rule in e-discovery. Generally speaking, those who manage e-discovery scope their tasks with metrics. So many terabytes of data collected from so many custodians, such and such a percentage of relevant documents compared to a lower percentage of privileged documents, so many documents reviewed per hour per person. Keyword searches report how many documents had hits. At every step, therefore, averages and medians and percentages are the vocabulary for describing how much time the process takes, what's involved and what the likely expense will be.
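To make the scoping arithmetic concrete, here is a minimal sketch in Python. The function name and every figure in it (collection size, review speed, team size, hourly rate) are hypothetical and chosen only for illustration; real projects would substitute their own measured averages.

```python
def estimate_review(total_docs, docs_per_hour, reviewers, hourly_rate):
    """Rough scoping estimate for a linear, document-by-document review.

    All inputs are assumptions supplied by the person scoping the project.
    Returns (reviewer_hours, elapsed_hours, cost).
    """
    # Total effort: how many reviewer-hours the collection demands.
    reviewer_hours = total_docs / docs_per_hour
    # Elapsed time if the work is split evenly across the team.
    elapsed_hours = reviewer_hours / reviewers
    # Expense is driven by total effort, not by elapsed time.
    cost = reviewer_hours * hourly_rate
    return reviewer_hours, elapsed_hours, cost


# Hypothetical inputs: 100,000 documents, 50 documents per hour per person,
# 10 reviewers, 60 (in any currency) per reviewer-hour.
reviewer_hours, elapsed_hours, cost = estimate_review(
    total_docs=100_000, docs_per_hour=50, reviewers=10, hourly_rate=60.0
)
print(reviewer_hours, elapsed_hours, cost)
```

The sketch shows why adding reviewers shortens the calendar but does not, by itself, lower the bill.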

Once documents are collected, if analytical software contributes to the review, metrics continue to play key roles. Two examples are sampling and iterative learning, both of which this newer generation of tool relies on.

Sampling is a statistical notion. In this context it means that someone chooses a small portion of the total document collection and puts the analytic software through its paces on that sample. It is important that the sample be as representative of the entire set as possible, since statistical conclusions are stronger to the degree the set analyzed resembles the full set.
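As an illustration of the sampling idea, the following Python sketch draws a simple random sample and estimates what share of the whole collection is relevant. It rests on assumptions not stated in this column: a simple random sample is used as the way to get a representative set, the margin of error uses the common normal approximation at roughly 95 percent confidence, and the function names and figures are hypothetical.

```python
import math
import random


def draw_sample(doc_ids, sample_size, seed=0):
    """Pick sample_size documents at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(doc_ids, sample_size)


def estimate_prevalence(relevant_in_sample, sample_size, z=1.96):
    """Estimate the relevant share of the full set from a reviewed sample.

    Returns (estimated proportion, margin of error). z=1.96 corresponds to
    about 95 percent confidence under the normal approximation.
    """
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# Hypothetical collection of 10,000 documents, sample of 400.
sample = draw_sample(list(range(10_000)), 400)
# Suppose reviewers mark 80 of the 400 sampled documents as relevant.
proportion, margin = estimate_prevalence(relevant_in_sample=80, sample_size=400)
print(f"{proportion:.1%} relevant, plus or minus {margin:.1%}")
```

A larger sample narrows the margin, which is one way to put a number on how strongly the sample speaks for the full set.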

The premise of analytical software is that the probabilities of relevance developed at the start, on the sample, carry over to the entire collection. Thereafter, probabilities decide whether a document is relevant or not. Calculations of all sorts go on when the software analyzes a document for possible relevance. The software weighs and calculates many aspects of the document's content.

Iterative learning means the software becomes more accurate as human beings outline what the software should look for and then repeatedly assess what it finds and refine the filters, rankings, terms and other elements. The software gets better, and numbers announce the improvement. Once the sample has generated as much quality in its searching as is desired, the software is unleashed on the entire document set.

Iterative learning depends on metrics because the user tracks improvements at each stage. As terms and concepts are added or discarded, modified or re-ranked, the software becomes "smarter." In the jargon of discovery, "recall" increases - the software finds a larger share of the documents that are in fact relevant. It also gets better on "precision," the share of the documents it retrieves that are in fact relevant. As an example of the omnipresence of metrics in this domain, one expert has suggested that keyword searches have recall rates of 20-30 percent whereas analytic software can consistently exceed 80 percent.
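The two measures can be computed with a few lines of Python. This is a minimal sketch using the standard information-retrieval definitions; the function name and the document numbers are hypothetical, and it assumes a human reviewer has already identified which documents are truly relevant.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search or one round of review.

    retrieved: documents the software flagged as relevant
    relevant:  documents a human reviewer judged truly relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant  # flagged and actually relevant
    # Recall: share of the truly relevant documents that were found.
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    # Precision: share of the flagged documents that are truly relevant.
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical round: five documents are truly relevant; the software flags
# eight, four of which are among the five.
relevant_docs = {1, 2, 3, 4, 5}
flagged_docs = {1, 2, 3, 4, 9, 10, 11, 12}
recall, precision = recall_and_precision(flagged_docs, relevant_docs)
print(recall, precision)  # 0.8 0.5
```

In this made-up round the software found four of the five relevant documents (80 percent recall), but only half of what it flagged was relevant (50 percent precision). Tracking both figures from one iteration to the next is how a user sees the software getting "smarter."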

With e-discovery output, too, metrics are everywhere. Analytic software selects documents and then clusters them or ranks them according to where they stand relative to the documents in the original training set. That output involves massive calculations. Given the volume of some discovery efforts, and whatever search software is employed, metrics become the indispensable analytical and descriptive tool. Metrics govern electronic discovery.