Document Review and Predictive Coding: Video Talks – Part Four

This is the fourth of seven informal video talks on document review and predictive coding. The first video explained why this is important to the future of the Law. The second talked about ESI Communications. The third about Multimodal Search Review. This video talks about the third step of the e-Discovery Team’s eight-step work flow, shown above, Random Baseline Sample.

Although this text intro is overly long, the video itself is short, under eight minutes, as there is really not that much to this step. You simply take a random sample at or near the beginning of the project. Again, this step can be used in any document review project, not just ones with predictive coding. You do this to get some sense of the prevalence of relevant documents in the data collection. That just means the sample will give you an idea as to the total number of relevant documents. You do not take the sample to set up a secret control set, a practice that has been thoroughly discredited by our Team and others. See Predictive Coding 3.0.

If you understand sampling statistics, you know that sampling like this produces a range, not an exact number. If your sample size is small, then the range will be very wide. If you want to cut your range, known in statistics as the confidence interval, in half, you have to quadruple your sample size. This is a general rule of thumb that I explained in tedious mathematical detail several years ago in Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022. Our Team likes to use a fairly large sample size of about 1,533 documents, which creates a confidence interval of plus or minus 2.5%, at a confidence level of 95% (meaning the true value will lie within that range 95 times out of 100). More information on sample size is summarized in the graph below. Id.
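The arithmetic behind those numbers can be sketched with Cochran's standard sample-size formula for a proportion. The function below is my own illustration, not the Team's calculator: the formula with a worst-case 50% prevalence gives about 1,537, and a finite-population correction for a 1,000,000 document collection brings that close to the 1,533 figure used above. It also demonstrates the rule of thumb that halving the margin quadruples the sample.

```python
import math

def sample_size(z=1.96, margin=0.025, p=0.5, population=None):
    """Cochran's sample-size formula for a proportion, with an
    optional finite-population correction. p=0.5 is the worst case."""
    n0 = z**2 * p * (1 - p) / margin**2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size())                      # 1537 for an effectively infinite population
print(sample_size(population=1_000_000))  # 1535 with the finite-population correction
print(sample_size(margin=0.0125))         # 6147: halving the margin quadruples the sample
```

Different online calculators round these inputs slightly differently, which is why published figures such as 1,533 vary by a few documents.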

The picture below this paragraph illustrates a data cloud where the yellow dots are the sampled documents from the grey dot total, and the hard to see red dots are the relevant documents found in that sample. Although this illustration is from a real project we had, it shows a dataset that is unusual in legal search because the prevalence here was high, between 22.5% and 27.5%. In most data collections searched in the law today, where the custodian data has not been filtered by keywords, the prevalence is far lower than that, typically less than 5%, maybe even less than 0.5%. Low prevalence widens the range, increases the uncertainty, and requires a binomial adjustment to calculate a statistically valid confidence interval, and thus the true document range.

For example, in a typical legal project with a prevalence of just a few percent, it would be common to see a range of between 20,000 and 60,000 relevant documents in a 1,000,000 document collection. Still, even with this very large range, we find it useful to at least have some idea of the number of documents we are looking for. That is what the Baseline Step can provide to you, nothing more and nothing less.
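The binomial adjustment mentioned above can be sketched with the Wilson score interval, one common binomial-based interval that behaves better than the simple normal approximation at low prevalence. The sample counts below are hypothetical, chosen only to illustrate a roughly 3% prevalence sample; this is an illustrative sketch, not the calculation from any particular project.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval: a binomial-based confidence interval
    that holds up better than the normal approximation when
    prevalence is low."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# hypothetical sample: 46 relevant documents found in a 1,533 document sample
lo, hi = wilson_interval(46, 1533)
collection = 1_000_000
print(f"prevalence: {lo:.2%} to {hi:.2%}")
print(f"relevant documents: {int(lo * collection):,} to {int(hi * collection):,}")
```

Projecting the interval's end points onto the full collection is what turns a sample percentage into a document-count range.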

If you are unsure of how to do sampling for prevalence estimates, your vendor can probably help. Just do not let them tell you that it is one exact number. That is simply a point projection near the middle of a range. The one-number point projection is just the top of the typical probability bell curve shown above, which illustrates a 95% confidence level distribution. The top is just one possibility, albeit slightly more likely than either end point. The true value could be anywhere in the blue range.

To repeat, the Step Three prevalence baseline number is always a range, never just one number. Going back to the relatively high prevalence example, the bell curve below shows a point projection of 25% prevalence, with a range of 22.5% to 27.5%, creating a range of between 225,000 and 275,000 relevant documents.
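A quick way to sanity-check those document counts is to translate the sampling margin into documents. The plain Python sketch below, with assumed values, uses the worst-case plus-or-minus 2.5% margin quoted earlier, and also shows the slightly narrower normal-approximation margin at the observed 25% prevalence; the 22.5% to 27.5% figures come from the worst-case margin.

```python
import math

n, z = 1533, 1.96
p_hat = 0.25            # point projection from the sample
collection = 1_000_000

worst = z * math.sqrt(0.5 * 0.5 / n)            # worst-case margin, about 2.5%
exact = z * math.sqrt(p_hat * (1 - p_hat) / n)  # margin at the observed prevalence

for label, m in [("worst-case margin", worst), ("observed-p margin", exact)]:
    lo, hi = p_hat - m, p_hat + m
    print(f"{label}: {round(lo * collection):,} to {round(hi * collection):,}")
```

The worst-case margin prints a range of roughly 225,000 to 275,000 documents, matching the example in the text.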

The important point that many vendors and other “experts” often forget to mention is that you can never know exactly where within that range the true value may lie. Plus, there is always a small possibility, 5% when using a sample size based on a 95% confidence level, that the true value may fall outside of that range. The collection may, for example, contain only 200,000 relevant documents. This means that even in a high prevalence project, with a dataset that approaches the 50% prevalence of a normal distribution (here meaning half of the documents are relevant), you can never know that there are exactly 250,000 relevant documents just because that is the mid-point, or point projection. You can only know that there are between 225,000 and 275,000 relevant documents, and even that range may be wrong 5% of the time. These uncertainties are inherent limitations of random sampling.
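One way to see that 5% failure rate in action is a toy simulation, which is my own illustration and not part of the post's workflow: draw many random samples from a population whose true prevalence is known to be 25%, build the 95% interval each time, and count how often the true value actually lands inside it. It should be close to, but never exactly, 95 times out of 100.

```python
import math
import random

random.seed(1)
n, z, true_p = 1533, 1.96, 0.25
trials = 2000
covered = 0

for _ in range(trials):
    # draw one random sample of 1,533 documents from a population
    # whose true prevalence is 25%
    hits = sum(random.random() < true_p for _ in range(n))
    p_hat = hits / n
    m = z * math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - m <= true_p <= p_hat + m:
        covered += 1

print(f"true value fell inside the interval in {covered / trials:.1%} of trials")
```

Roughly one trial in twenty misses the true value, which is exactly the residual risk a 95% confidence level leaves on the table.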

Shame on the vendors who still perpetuate that myth of certainty. Lawyers can handle the truth. We are used to dealing with uncertainties. All trial lawyers talk in terms of probable results at trial and risks of loss, and often calculate a case’s settlement value based on such risk estimates. Do not insult our intelligence with a simplification of statistics that is plain wrong. Reliance on such erroneous point projections alone can lead to incorrect estimates of the level of recall attained in a project. We do not need to know the math, but we do need to know the truth.

The short video that follows will briefly explain the Random Baseline step, but does not go into the technical details of the math or statistics, such as the use of the binomial calculator for low prevalence. I have previously written extensively on this subject. See for instance:


About the Blogger

Ralph Losey is a practicing attorney and shareholder in a national law firm with 50+ offices and over 800 lawyers, where he is in charge of Electronic Discovery. All opinions expressed here are his own, and not those of his firm or clients. No legal advice is provided on this website, and nothing here should be construed as legal advice.

Ralph has long been a leader of the world's tech lawyers. He has presented at hundreds of legal conferences and CLEs around the world. Ralph has written over two million words on e-discovery and tech-law subjects, including seven books. He is also the founder of Electronic Discovery Best Practices and e-Discovery Team Training, an online education program that arose out of his five years as an adjunct professor teaching e-Discovery and Evidence at the UF School of Law. Ralph is also publisher and principal author of this blog and many other instructional websites.

Ralph is a specialist who has limited his legal practice to electronic discovery and tech law since 2006. He has a special interest in software and the search and review of electronic evidence using artificial intelligence, and also in general AI Ethics issues. Ralph was the only private lawyer to participate in the 2015 and 2016 TREC Total Recall Tracks of the National Institute of Standards and Technology, and prior to that competed successfully in the EDI Oracle research.

Ralph has been involved with computers, software, legal hacking and the law since 1980. Ralph has the highest peer AV rating as a lawyer and was selected as a Best Lawyer in America in four categories: Commercial Litigation; E-Discovery and Information Management Law; Information Technology Law; and, Employment Law - Management. Ralph also received the "Most Trusted Legal Advisor" industry award for 2016-17 by the Masters Conference. His full biography may be found at RalphLosey.com.

Ralph is the proud father of two children, Eva Losey Grossman, and Adam Losey, a lawyer with cyber expertise (married to another cyber expert lawyer, Catherine Losey), and best of all, husband since 1973 to Molly Friedman Losey, a mental health counselor in Winter Park.

Sedona Principles 3rd Ed

1. Electronically stored information is generally subject to the same preservation and discovery requirements as other relevant information.

2. When balancing the cost, burden, and need for electronically stored information, courts and parties should apply the proportionality standard embodied in Fed. R. Civ. P. 26(b)(2)(C) and its state equivalents, which require consideration of importance of the issues at stake in the action, the amount in controversy, the parties’ relative access to relevant information, the parties’ resources, the importance of the discovery in resolving the issues, and whether the burden or expense of the proposed discovery outweighs its likely benefit.

3. As soon as practicable, parties should confer and seek to reach agreement regarding the preservation and production of electronically stored information.

4. Discovery requests for electronically stored information should be as specific as possible; responses and objections to discovery should disclose the scope and limits of the production.

5. The obligation to preserve electronically stored information requires reasonable and good faith efforts to retain information that is expected to be relevant to claims or defenses in reasonably anticipated or pending litigation. However, it is unreasonable to expect parties to take every conceivable step or disproportionate steps to preserve each instance of relevant electronically stored information.

6. Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.

7. The requesting party has the burden on a motion to compel to show that the responding party’s steps to preserve and produce relevant electronically stored information were inadequate.

8. The primary sources of electronically stored information to be preserved and produced should be those readily accessible in the ordinary course. Only when electronically stored information is not available through such primary sources should parties move down a continuum of less accessible sources until the information requested to be preserved or produced is no longer proportional.

9. Absent a showing of special need and relevance, a responding party should not be required to preserve, review, or produce deleted, shadowed, fragmented, or residual electronically stored information.

10. Parties should take reasonable steps to safeguard electronically stored information, the disclosure or dissemination of which is subject to privileges, work product protections, privacy obligations, or other legally enforceable restrictions.

11. A responding party may satisfy its good faith obligation to preserve and produce relevant electronically stored information by using technology and processes, such as data sampling, searching, or the use of selection criteria.

12. The production of electronically stored information should be made in the form or forms in which it is ordinarily maintained or in a form that is reasonably usable given the nature of the electronically stored information and the proportional needs of the case.

13. The costs of preserving and producing relevant and proportionate electronically stored information ordinarily should be borne by the responding party.

14. The breach of a duty to preserve electronically stored information may be addressed by remedial measures, sanctions, or both: remedial measures are appropriate to cure prejudice; sanctions are appropriate only if a party acted with intent to deprive another party of the use of relevant electronically stored information.