In defense of keywords in e-discovery

Disclaimer: You are about to read an article championing the much-maligned process of filtering documents using keywords — not au courant in an era when excitement over algorithms rules the day. Nevertheless, it is my strong belief that the concept of using keywords as a tool to help locate the documents that matter is spot-on; it is the execution and the expectations that are spotty.

Undoubtedly, the typical method for developing keywords (a thesaurus, a yellow pad and a cup of coffee) has become outdated. Today’s document collections are just too big for this approach. When volumes are large, precision and completeness matter. It is too easy to sweep in a load of nonrelevant documents or, even worse, exclude a substantial number of relevant documents. But these shortcomings are related to the method of developing keywords, not to the efficacy of the underlying theory.

Conceptually, the theory of keywords is sound. During review, especially first-pass review, we are looking for documents that are about the issues of the case. For a document to be about a case issue, it must contain language (words) that communicates specific concepts. For example, for a document to be about meetings with competitors where price was discussed, the document must contain language that communicates the concept of a meeting, the attendance of a competitor and the concept of price. Given that conceptual precision, it is perfectly reasonable to use keywords to help identify documents about this topic.

The first step to successful keyword filtering is to unambiguously define what you are looking for. In our example we have stated that we are looking for documents that communicate three concepts: meeting + competitor + price. A relevant document must contain words sufficient to communicate all three of these concepts. We do not know specifically which words were used, but we know the concepts the words must communicate.

The next task is to identify the words that were actually used to communicate each concept. If you know the words, it is easy to find the documents that contain them. On the face of it, this is a big problem. College-graduate English speakers have, roughly, a 25,000 root-word vocabulary. That is a lot of possible choices. You would need to read a lot of documents to determine every word used in a large collection to communicate the concept of a meeting. Since reading reams of documents is what we are trying to avoid, we have to take a different approach.

For now, let’s broaden our target and just aim at finding documents that are possibly relevant. Possibly relevant documents contain words that could be used to communicate the component concepts of the issue. We recognize that those words may or may not have actually been used. And for now, that is okay. To identify the possibly relevant documents, we have to identify every word that might have been used to communicate a component concept. For example, there are a lot of words that could be used to discuss the idea that a meeting took place, and some of them (e.g., coffee, dinner) are not words that a thesaurus would list as synonyms for meeting.

Before we describe how to identify all of the possible alternatives for meeting, let’s take a moment to spotlight the benefit of this approach. Our experience is that 70-plus percent of the documents in a typical collection could not possibly be relevant to the case issues because they fail to contain any words that could potentially communicate the concepts required to be relevant. By identifying the potentially relevant documents, we eliminate more than 70 percent of the collection from the review.

That is a big deal.

True, we have not yet found the specific relevant documents, but if we can use keywords to filter out 70 percent of the collection, that is well worth the eight hours of effort it takes.

Now, back to the workflow. How do you find all of the potential synonyms for meeting?

Actually, it is pretty easy. First, extract the vocabulary from the collection. Second, organize the words by part of speech and frequency. Third, print out the organized word list. (For a tool that will do this, send me an email at andy.kraftsow@renewdata.com.) What you have now is a compendium of roughly 10,000 nouns, 6,000 verbs and 9,000 words you likely do not care about (adjectives, adverbs, prepositions, pronouns, etc.). The list contains all of the words in the collection, so there is no room for arguments about completeness.
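For readers who want to see the shape of those three steps, here is a minimal Python sketch. It is an illustration only, not the tool mentioned above. The function names, the simple tokenizer and the `pos_lookup` dictionary are all assumptions made for the example; in practice the part-of-speech labels would come from a tagger or a dictionary resource, and any word not in the lookup is simply filed under "unknown".

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of letters, optionally with one apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def extract_vocabulary(documents):
    """Count how often each distinct lowercase word appears in the collection."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts

def organize_by_pos(counts, pos_lookup):
    """Group words by part of speech, most frequent first within each group."""
    groups = defaultdict(list)
    for word, freq in counts.items():
        # pos_lookup is a hypothetical word -> part-of-speech mapping.
        groups[pos_lookup.get(word, "unknown")].append((word, freq))
    for entries in groups.values():
        entries.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(groups)

# Tiny made-up collection, for illustration only.
docs = [
    "Lunch with Acme on Friday to discuss pricing.",
    "Please send the pricing sheet before lunch.",
]
pos_lookup = {"lunch": "noun", "pricing": "noun", "discuss": "verb", "send": "verb"}
word_list = organize_by_pos(extract_vocabulary(docs), pos_lookup)

# "Print out" the organized list, one part of speech at a time.
for pos, entries in word_list.items():
    print(pos, entries)
```

Because the list is built from the collection itself, it contains only words that actually occur in the documents, which is what makes the completeness argument work.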

The task is to look through the nouns and verbs and check off every word you think could possibly communicate that a meeting took place. Your brain is exquisitely wired for this task — it is why you can read so fast. In 15 minutes you can look through the list and identify every word that could possibly have been used to communicate that a meeting took place. You know the instant you read a word if it is a candidate or not. The words you have checked off are possible synonyms for the concept of a meeting. Now do the same for competitors and price. String the components together with ANDs and the synonyms for each component with ORs, run the Boolean searches and you have now identified every document that is potentially relevant to the issue. It is not a bad use of keywords.
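The AND-of-ORs logic described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the word lists, the sample documents and the function names are invented for the example, matching is on exact lowercase words with no stemming or phrase handling, and a real review platform would use its own query syntax.

```python
import re

# Assumption: a "word" is a run of letters, optionally with one apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def build_query(concepts):
    """Render the concept map as a Boolean string: ORs inside, ANDs between."""
    groups = ("(" + " OR ".join(sorted(terms)) + ")" for terms in concepts.values())
    return " AND ".join(groups)

def is_possibly_relevant(text, concepts):
    """True when the text contains at least one checked-off word per concept."""
    words = set(TOKEN_RE.findall(text.lower()))
    return all(words & terms for terms in concepts.values())

# Hypothetical checked-off words for each component concept.
concepts = {
    "meeting": {"meeting", "lunch", "coffee", "dinner"},
    "competitor": {"acme", "competitor"},
    "price": {"price", "pricing", "quote"},
}

# Made-up documents, for illustration only.
docs = [
    "Had coffee with Acme; they floated a new price.",
    "Team lunch is moved to Friday.",
    "Acme sent a quote by email.",
]

print(build_query(concepts))
possibly_relevant = [d for d in docs if is_possibly_relevant(d, concepts)]
print(possibly_relevant)
```

Only the first sample document survives the filter, because it is the only one with a word from all three lists. Exact-word matching is a reasonable fit here because the words were checked off from the collection's own vocabulary, so every inflected form that occurs (meeting, meetings, met) was available to be selected.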

Now someone has to review the possibly relevant documents to find the ones that are actually responsive to the case issues, but the size of that job has been reduced by at least 70 percent.

Keywords may not take you to the exact documents you need, but they will quickly take you into the right neighborhood, and that is worth a lot.

Contributing Author

Andy Kraftsow

RenewData’s Chief Scientist Andy Kraftsow leads the company’s efforts to develop groundbreaking technologies. Trained as a mathematician (and a CPA), Kraftsow is one of the...