Technology: Shedding light on the predictive coding black box

The guarded court approval in Da Silva Moore v. Publicis Groupe opened the door for expansion of the use of predictive coding in document productions. In the decision, Magistrate Judge Peck stated that he was more interested in validation of the process and results than in the “black box” of the vendor’s software, that is, how predictive coding actually categorizes documents. This is a valid observation in context. However, as with any new product, there are benefits to understanding the underlying technology, at least at a high level.

Predictive coding is a specific application of machine learning, a field of computer science; in particular, it draws on supervised learning and natural language processing. The technology is not new and is already used in everyday applications. Most spam filters use algorithms to determine whether emails are legitimate or unwanted solicitations, and when users reclassify emails, the filters learn to categorize better. Another common use is contextual web advertisement placement, in which ads are chosen based on the content displayed to the user. For example, viewing a sports-related website results in ads from sports-related companies.

In one example of predictive coding, a computer program is taught to categorize documents through analysis of human-coded samples. A subset of the documents is reviewed and coded by knowledgeable people. The computer then analyzes that subset for patterns and uses those patterns to classify additional documents. This process may be repeated several times: the knowledgeable persons may code additional subsets of documents selected from the computer-categorized sets. Once the computer’s predictions and the human reviewers’ coding sufficiently coincide, the program is run on the entire document set. The integrity of the process may then be checked through sampling.
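One way to make "sufficiently coincide" concrete is to compare the computer's calls with the human coding on the same sample and compute standard agreement measures. The following is a minimal sketch under stated assumptions, not a description of any vendor's process: the function names and the threshold values (`min_recall`, `min_precision`) are hypothetical, and real thresholds would be negotiated case by case.

```python
def compare_coding(human, machine):
    """Compare human and machine relevance calls (True = relevant)
    made on the same sample of documents."""
    if len(human) != len(machine):
        raise ValueError("both codings must cover the same documents")
    pairs = list(zip(human, machine))
    tp = sum(1 for h, m in pairs if h and m)          # both say relevant
    fp = sum(1 for h, m in pairs if not h and m)      # machine over-includes
    fn = sum(1 for h, m in pairs if h and not m)      # machine misses
    tn = sum(1 for h, m in pairs if not h and not m)  # both say irrelevant
    total = len(pairs)
    return {
        "agreement": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def ready_for_full_run(metrics, min_recall=0.75, min_precision=0.5):
    """Hypothetical stopping rule: thresholds are illustrative assumptions."""
    return (metrics["recall"] >= min_recall
            and metrics["precision"] >= min_precision)


# Ten sampled documents coded by a reviewer and by the program.
human = [True, True, True, True, False, False, False, False, False, False]
machine = [True, True, True, False, True, False, False, False, False, False]

metrics = compare_coding(human, machine)
print(metrics)
print(ready_for_full_run(metrics))
```

In this toy sample the program agrees with the reviewer on eight of ten documents, finds three of the four relevant ones (recall 0.75), and is right on three of its four "relevant" calls (precision 0.75). The same comparison, run on a random sample drawn after the full run, is one form the final integrity check could take.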

The software that was the subject of the Da Silva Moore opinion used a combination of two known algorithms: Support Vector Machines (SVM) and Probabilistic Latent Semantic Analysis (PLSA). In SVM, patterns are determined and categorized from positive examples (relevant documents) and negative examples (irrelevant documents), and new examples are classified in one category or the other based on whether these patterns appear in the new examples. SVM is used, for example, in many email spam filters.
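The idea of learning from positive and negative examples can be illustrated with a toy linear SVM. This is a minimal sketch, not the software discussed in the opinion: it assumes simple word counts as features, trains by sub-gradient descent on the regularized hinge loss, and omits the intercept term for brevity. The sample documents, function names, and parameter values (`lam`, `lr`, `epochs`) are all made up for illustration.

```python
def featurize(text, vocab):
    """Turn a document into word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularized hinge loss (no intercept).
    Labels are +1 (relevant) and -1 (irrelevant)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Regularization step: shrink every weight slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge-loss step: the example is misclassified or too close
                # to the boundary, so push the weights toward its label.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w


def classify(text, w, vocab):
    """Score a new document; words outside the vocabulary are ignored."""
    score = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab)))
    return "relevant" if score > 0 else "irrelevant"


# Human-coded examples (hypothetical): +1 relevant, -1 irrelevant.
coded = [
    ("merger pricing agreement draft", 1),
    ("pricing agreement with competitor", 1),
    ("merger timeline and pricing", 1),
    ("lunch menu for friday", -1),
    ("office party friday", -1),
    ("parking lot closed friday", -1),
]
vocab = sorted({word for text, _ in coded for word in text.split()})
X = [featurize(text, vocab) for text, _ in coded]
y = [label for _, label in coded]
w = train_linear_svm(X, y)

print(classify("revised pricing agreement", w, vocab))
print(classify("friday lunch plans", w, vocab))
```

The first new document is classified as relevant and the second as irrelevant, because "pricing" and "agreement" picked up positive weight from the relevant examples while "friday" and "lunch" picked up negative weight. Note that "revised" and "plans" contribute nothing: the program has never seen them in a coded document.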

In PLSA, documents are categorized by detecting concepts through a statistical analysis of word contexts. Documents are grouped based on the probabilities with which words occur together. Aside from these, there are a number of other potential algorithms that generate correlations and categorizations. In essence, these algorithms determine how often certain data (typically words, authors, recipients, names and places) occur together in pre-classified documents and use those frequencies to categorize other documents.
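The "in essence" description above can be sketched as a simple co-occurrence count. To be clear, this is not PLSA, which fits a probabilistic model of hidden concepts; it is a deliberately crude illustration of counting which words appear together in pre-classified documents and using those counts on a new document. The sample documents and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_pairs(text):
    """All unordered pairs of distinct words in a document."""
    terms = sorted(set(text.lower().split()))
    return list(combinations(terms, 2))


def cooccurrence_by_label(coded_docs):
    """Count, per category, how many coded documents contain each word pair."""
    counts = {}
    for text, label in coded_docs:
        counts.setdefault(label, Counter()).update(word_pairs(text))
    return counts


def pair_score(text, counts, label):
    """Sum the coded-document counts of every word pair in a new document."""
    seen = counts.get(label, Counter())
    return sum(seen[pair] for pair in word_pairs(text))


def categorize(text, counts):
    """Pick the category whose coded documents share the most word pairs."""
    return max(sorted(counts), key=lambda label: pair_score(text, counts, label))


coded_docs = [
    ("pricing agreement with competitor", "relevant"),
    ("draft pricing agreement", "relevant"),
    ("friday lunch menu", "irrelevant"),
    ("friday office lunch", "irrelevant"),
]
counts = cooccurrence_by_label(coded_docs)

new_doc = "revised pricing agreement"
print(pair_score(new_doc, counts, "relevant"))
print(pair_score(new_doc, counts, "irrelevant"))
print(categorize(new_doc, counts))
```

Here the pair ("agreement", "pricing") appears in both relevant examples and in no irrelevant one, so the new document scores 2 for "relevant" and 0 for "irrelevant". A document whose word pairs never appeared in the coded set would score 0 for every category, which is the toy version of the one-off document problem.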

The contextual analysis performed by the algorithms, whether implementing spam filters, ad placement or document coding, corresponds to a certain degree to the same factors considered by human coders. Documents authored or received by certain individuals are more important than those authored or received by others. The subject matter of the document is determined by which words are used in certain combinations. The difference is that a human reviewer can comprehend the meaning of the words, while a computer can only mathematically analyze correlations based on the pre-determined data, that is, human pre-categorized documents. Hence, for the computer, the quality of the analysis depends heavily on the quality of the sample set and human categorizations.

Knowing this, when is predictive coding a viable option? The cost should be less than the next best acceptable alternative. Also, the document collection should be natively electronic and contain consistent machine-readable information. Converted documents, such as those converted by optical character recognition, may lead to errors due to the conversion. Handwritten notes, charts, spreadsheets, drawings and similar documents lack consistency and may not be categorized by machines with the same degree of reliability. The algorithms are independent of human language and thus are well-suited to cases involving non-English language documents, especially where the human coders are likely high-level (bilingual and very knowledgeable about the issues). On the other hand, the one-off email or letter using unique words and word combinations will likely be missed. The question is whether the particular sets of words or information at issue in the case and within the document set occur together with enough regularity to make the correlations reliable. That can only be assessed on a case-by-case basis.

Initial studies show that predictive coding can lead to better results than manual review for large homogeneous document sets. The underlying technology bears this out. As the technology develops, it should become less expensive, more reliable and used more often.