Now that my blog has changed from weekly to monthly I have more time for my hobbies, like trying to hack into NSA computers. I made a breakthrough with that recently, thanks primarily to exuberant disclosures by Snowden after the Oscars. I was able to get into one of the NSA’s top-secret systems. Not only that, my hack led to the discovery of a covert operation that will blow your mind. (Hey, if the NSA can brag about their exploits, then so can I.) And if that were not enough, I was able to get away with downloading two documents from their system. I will share what I borrowed with you here (and, of course, on Wikileaks). The documents are:

A previously unknown Plan to use sophisticated e-Discovery Teams with AI enhancements to find evidence for use in investigations and courtrooms around the world.

A slide show in movie and PDF form that tells you how these teams operate.

I can disclose my findings and stolen documents here without fear of becoming Citizen Five because what I found out is so incredible that the NSA will disavow all knowledge. They will be forced to claim that I made up the whole story. Besides, I am not going to explain how I hacked the NSA. Moreover, unlike some weasels, I will never knowingly give aid and comfort to foreign governments. This is something many Hollywood types and script kiddies fail to grasp. All I will say is that I discovered a critical zero-day type error in two lines of code, out of billions, in a software program used by the NSA. In accord with standard white hat protocol, if the NSA admits my story here is true, I will tell them the error. Otherwise, I am keeping this code mistake secret.

The hack allowed me to access a Top Secret project code-named Gibson. It is a Cyberspace Time Machine. This heretofore secret device allows you to travel in time, but, here’s the catch, only on the Internet. Since it is an Internet-based device, the NSA has to keep it plugged in. That is why I was not faced with the nearly insoluble air gap defense protecting the NSA’s other computer systems.

From what I have been able to figure out, the time travel takes place on a subatomic cyber-level and requires access to the Hadron Collider. The Gibson somehow uses entangled electrons, Higgs bosons, and quantum flux probability. The new technology is based on Hawking’s latest theories, quantum computers, and, can you believe it, imaginary numbers, you know, the square root of negative numbers. It all seems so obvious after you read the NSA executive summary, that other groups with Hadron Collider access and quantum computers are likely to come up with the same invention soon. But for now the NSA has a huge advantage and head start. Maybe someday they will even share some of that info with POTUS.

The NSA Internet Time Machine allows you to peer into the past content of the Internet, which, I know, is not all that new or exciting. But, here is the really cool part that makes this invention truly disruptive: you can also look into the future. With the Gibson and special web browsers you can travel to and capture future webpages and content that have not been created yet, at least not in our time. You can Google the future! Just think of the possibilities. No wonder the NSA never has any funding problems.

This kind of breakthrough invention is so huge, and so incredible, that NSA must deny all knowledge. If people discover this is even possible, other groups will race to catch up and build their own Internet Time Machines. That is probably why Apple is hoarding so much cash. Will there be a secret collider built off the books under their new headquarters? It kind of looks like it. Google is probably working on this too. The government cannot risk anyone else knowing about this discovery. That would encourage a dangerous time machine race that would make the nuclear arms race look like child’s play. Can you imagine what Iran would do with information from the future? The government simply cannot allow that to happen.

For that reason alone my hack and disclosures are untouchable. The NSA cannot admit this is true, or even might be true. Besides, having seen the future, I already know that I will not be prosecuted for these intrusions. In fact, no one but a few hard-core e-Discovery Team players will even believe this story. I can also share the information I have stolen from the future without fear of CFAA prosecution. Technically speaking, my unauthorized access of web pages in the future has not happened yet. Despite my PreCrime-like proposals at PreSuit.com, you cannot (yet) be prosecuted for future crimes. You can probably be fired for what you may do, but that is another story.

Still, the hack itself is not really what is important here, not even the existence of the NSA’s Time Machine, as great as that is. The two documents that I brought back from the future are what really matters. That is the real point of this blog, just in case you were wondering. I have been able to locate and download from the future Internet a detailed outline of a Plan for AI-Enhanced search and review.

The Plan is apparently in common use by future lawyers. I am not sure of the document’s exact date, but it looks like circa 2025. It is obviously from the future, as nobody has any plans like this now. I also found a video and PDF of a PowerPoint of some kind. It shows how lawyers and other investigators in the future use artificial intelligence to enhance all kinds of ESI search projects, including overt litigation and covert investigations. It appears to be a detailed presentation of how to use what is still called Predictive Coding. (Well, at least they do not call it TAR anymore.) Nobody in our time has seen this presentation yet. I am sure of that. You will have the first glimpse now.

The Plan for AI-Enhanced search and review is in the form of a detailed 1,500-word outline. It looks like this Plan is commonly used in the future to obtain client and insurer approval of e-discovery review projects. I think that this review Plan of the future is part of a standardized approval process that is eventually set up for client protection. Obviously we have nothing like that now. The plan might even be shared with opposing counsel and the courts, but I cannot be sure of that. I had to make a quick exit from the NSA system before my intrusion was detected.

I include a full copy of this Plan below, and the PowerPoint slides in video form. See if these documents are comprehensible to you. If my blog is brought down by denial of service attacks, you can also find it on Wikileaks servers around the world. The Plan can also be found here as a standalone document, and the PDF of the slides can be found here. I hope that this disclosure is not too disruptive to existing time lines, but, from what I have seen of the future of law, temporal paradox be damned, some disruption is needed!

Although I had to make a quick exit, I did leave a back door. I can seize root of the NSA Gibson Cyberspace Time Machine anytime I want. I may share more of what I find in upcoming monthly blogs. It is futuristic, but as part of the remaining elite who still follow this blog, I’m sure you will be able to understand. I may even start incorporating this information into my legal practice, consults, and training. You’ll read about it in the future. I know. I’ve been there.

If you have any suggestions on this hacking endeavor, or the below Plan, send me an encrypted email. But please only use this secure email address: HackerLaw@HushMail.com. Otherwise the NSA is likely to read it, and you may not enjoy the same level of journalistic sci-fi protection that I do.

d. Power Users of particular software and predictive coding features to be used

(1) Law Firm and Vendor

(2) List qualifications and experience

e. Outside Consultants or other experts

(1) Anticipated roles

(2) List qualifications and experience

f. Contract Lawyers

(1) Price list for reviewers and reviewer management

A. $500-$750 per hour is typical (Editor’s Note: Is this widespread inflation, or new respect?)

B. Competing bids requested? Why or why not.

(2) Conflict check procedures

(3) Licensed attorneys only or paralegals also

(4) Size of team planned

A. Rationale for more than 5 contract reviewers

B. “Less is More” plan

(5) Contract Reviewer Selection criteria

g. Plan to properly train and supervise contract lawyers

5. One or Two-Pass Review

a. Two-pass is standard: the first pass selects for relevance and privilege using Predictive Coding; the second pass is an eyes-on review by reviewers to confirm the relevance predictions, code for confidentiality, and create the privilege log.

b. If one pass proposed (aka Quick Peek), has client approved risks of inadvertent disclosures after written notice of these risks?

6. Clawback and Confidentiality agreements and orders

a. Rule 502(d) Order

b. Confidentiality Agreement: Confidential, AEO, Redactions

c. Privilege and Logging

(1) Contract lawyers

(2) Automated prep

7. Categories for Review Coding and Training

a. Irrelevant – this should be a training category

b. Relevant – this should be a training category

(1) Relevance Manual for contract lawyers (see form)

(2) Email family relevance rules

A. Parents automatically relevant if child (attachment) is relevant?

B. Attachments automatically relevant if email is?

C. All attachments automatically relevant if one attachment is?

c. Highly Relevant – this should be a training category

d. Undetermined – temporary until final adjudication

e. No or Very Few Sub-Issues of Relevant, usually just Highly Relevant

f. Privilege – this should be a training category

g. Confidential

(1) AEO

(2) Redaction Required

(3) Redaction Completed

h. Second Pass Completed

8. Search Methods to find documents for training and production

a. ID persons responsible and qualifications

b. Methods to cull out documents before Predictive Coding training begins, to avoid selection of inappropriate documents for training and to improve efficiency (a code sketch of this culling stage follows the outline)

(1) E.g., any non-text documents; overly long documents

(2) Plan to review by alternate methods

(3) ID general methods for this first stage culling; both legal and technical
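Before leaving the Plan, a note for the technically curious: the first-stage culling in section 8(b) is the easiest part to picture in code. Below is a minimal Python sketch of the idea. The document fields, thresholds, and criteria are all hypothetical; in a real project the legal and technical culling methods that section 8(b)(3) calls for would be set by counsel and documented in the Plan itself.

```python
# A minimal, hypothetical sketch of first-stage culling before predictive
# coding training begins. Field names and thresholds are illustrative only;
# real criteria are the legal and technical methods the Plan requires.

MAX_TEXT_CHARS = 500_000   # "overly long" cutoff; a made-up number

def first_stage_cull(documents):
    """Split a collection into (trainable, set_aside_for_alternate_review)."""
    trainable, set_aside = [], []
    for doc in documents:
        text = doc.get("extracted_text", "")
        if not text.strip():               # non-text file, e.g. image or audio
            set_aside.append(doc)
        elif len(text) > MAX_TEXT_CHARS:   # overly long, e.g. a database dump
            set_aside.append(doc)
        else:
            trainable.append(doc)
    return trainable, set_aside

docs = [
    {"id": 1, "extracted_text": "Re: lunch meeting moved to noon ..."},
    {"id": 2, "extracted_text": ""},             # scanned image, no text layer
    {"id": 3, "extracted_text": "x" * 600_000},  # giant log file
]
trainable, set_aside = first_stage_cull(docs)
print(len(trainable), "to training;", len(set_aside), "to alternate review")
```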

This is the e-Discovery Team‘s 500th blog. Since my first blog of November 10, 2006, entitled Basic idea of e-Discovery, I have written a blog every week. Now, more than eight years later, I am reflecting on the journey. The Grateful Dead’s lyrics come to mind: What a long, strange trip it’s been! I realize that it is time for a change, and this will be my last weekly blog. Going forward this blog will switch to monthly publication (perhaps more often for breaking news), and will also change writing style. The new slogan on the top banner is a good clue to the future format. As Jason Baron pointed out to me with a sly smile when considering this move last month, the change is reminiscent of the transformation of Life magazine from a weekly to a monthly in 1978. Unlike Life, however, my subscriptions are up, and e-Discovery Team has never been more popular. The Posse List ranked this blog as the Number One ‘go to’ blog on e-Discovery out of the 407 that they follow. A private survey late last year shows it is the second most popular blog in the country for corporate counsel on e-discovery, behind only The Sedona Conference. Unlike Life, I make this change purely for personal reasons, and not because of waning readership.

Part of My Life is an Open Blog

Molly, Ralph, Adam, Cat, Tor

Unlike Life magazine, I have no advertisers to please, and nothing to sell. This is a personal service, a payback to the legal profession for all that it has given to me and my family. My payback has been to share some of my adventures with law and technology. In 2006 I phased out of my civil litigation practice and focused solely on e-Discovery. I did this out of love of the challenge, and the subject, but also because I saw that we were living in dangerous times. Law was, and still is, in danger of being outpaced by the blindingly fast advances in Technology. Law and Technology seemed to be dangerously disconnected in 2006, and our system of justice imperiled. I saw that my fellow trial lawyers were lost and confused by the new forms of electronic evidence. I saw a strong need for lawyers like me to step up to the plate to try to narrow the gap.

Ralph and Eva, his daughter

As a person who loved both law and technology, I knew I was uniquely qualified to help the profession in that way. Plus my then law firm, Akerman Senterfitt, supported these efforts, as they needed these skills to handle the big cases. Then in the Fall of 2006 I was inspired, completely out of the blue, to address the professional gap problem in a new way. I decided to try using a then very new writing medium, a blog. There were very few law blogs then, although now there are hundreds, which in itself is another reason I do not feel compelled to continue at this intense pace. I think it is time for a break, to spend more time with my beautiful family.

Molly Losey

I have used this blog to share, in near real time, the insights that arose from my legal practice and my studies of these new challenges. There has been a personal price to pay for this effort: the countless lost weekends, the thousands of hours of writing. Ask Molly, my wife; she will tell you about it. I feel that I have paid my dues, and earned the right to step down completely, but I still see even more dangerous times to come. The gap has narrowed, but still remains. So I will keep speaking, but less often, and with a different tone. I am reminded of the famous line in the movie The Secret Life of Walter Mitty, where Mitty shares the motto of Life magazine:

To see the world, things dangerous to come to, to see behind walls, draw closer, to find each other, and to feel. That is the purpose of life.

Every Sunday Night

As most readers already well know, almost all of my 500 blogs were published on Sunday night. Soon I will enjoy a leisurely dinner instead, that is, at least three out of four weekends. My weekend blogs started in 2006 with just a few hundred words, but quickly morphed into several-thousand-word essays. In the last several years I have taken to writing multi-part blogs in the 10,000-plus-word category. For example, my blog last week, Information Governance v Search: The Battle Lines Are Redrawn, was 5,032 words.

Almost all of my weekly blogs pertained to my current life’s work and passion, the new legal speciality of electronic discovery law. None were easy to write, but they became a habit, a creative addiction. From the start I set a standard for myself to write for all levels of readers at once: beginners, intermediates, advanced, and the elite. Writing for all levels at once is a major challenge. Most law bloggers I see aim for the beginner and intermediate levels. That is where the numbers are, and newbies are easier to write for. Plus, it is easier to skate by with easy stuff when your target is not your peers. Still, my friends and colleagues, the advanced and the elite, have always been my favorite readers. Plus, these readers are the movers and shakers of the industry. If you want to put a dent in the e-discovery universe, they are who you need to reach.

From the beginning, I also wanted my writing to be web-based, hyperlinked, and image-filled. I thought a blog on e-discovery should be as cutting edge as its subject. Finally, I wanted it to be creative and fun to read. I have often failed, but those were my goals. If you have ever smiled or learned something from this blog, hopefully both, then it has been worth the effort.

The first several hundred of my blogs were made into five books, two by the ABA, two more by West Thompson, and a fifth I made myself for iBooks. A few years ago I got bored with the whole book exercise, and started making the blogs into an online training program, e-Discovery Team Training. It started off as a program for law students, and later morphed into a program for anyone interested in the subject. I also spun off parts of the e-Discovery Team blog into a bewildering (even to me) array of sister websites, including one of my favorite projects, a website that collects the best practices for legal services in e-Discovery, EDBP.com. Although this is not a complete list, the main ones I use now are:

Some of the 500 blogs, indeed some of the best, were not written by me. They were guest blogs written by others, but edited, proofed, imaged, and published by me. I count them as part of my 500 because, well, many took almost as much of my time as when I wrote them myself. Still, they helped make e-Discovery Team the dominant blog it is. My guest bloggers include many of the elite in the profession, including Judge Scheindlin, who also appeared in one of my books, Jason R. Baron, who contributed more blogs than anyone, Maura R. Grossman, Gordon Cormack, the late great Browning Marean, Thomas J. O’Connor, J. William (Bill) Speros, Shannon Capone Kirk (who contributed two of my rocking favorites), Kristin Ali, Mary Mack, Bill Hamilton (who contributed many), David Cowen, Sonya Sigler, Michael Simon, Judge Ralph Artigliere, and many others. A few as-yet relative unknowns have contributed as well, including Samir Mathur, Lawrence Chapin, Simon Attfield, Efeosasere Okoro, Jesse B. Freeman, and James Cook. And, of course, who can forget the open letters to the judiciary by Anonymous, a lawyer in a big firm whose identity I will never reveal. I repeat my thanks to one and all who have helped me make this a truly team effort. You are the elite, and you are welcome to stay and contribute again.

The Future of the e-Discovery Team Blog

The two hardest parts about writing this blog have been its frequency, every week, and its wave-length, by which I mean its multilevel readership target. I have always tried to write for a general audience and for specialists. I tried to include multiple levels in each blog. No more. Sorry, dear newbies. But you have plenty of other choices now, and you can still keep reading this blog, of course, but it may be much more difficult.

Writing for all levels at once is much harder than you would think. It also makes for much longer articles. I have had to spell things out and focus on clarity, so that a beginner could follow along with some effort (ok, maybe a lot of effort), and an intermediate level reader could too. But here is the real trick: I also wanted the same article to be of value and interest to advanced readers, to my friends and colleagues. They used to be called the Sedona bubble people, although, truth be known, that expression is, alas, already passé. Insiders know what I am talking about.

I am addressing the frequency challenge by changing to a monthly format. I am addressing the wavelength challenge by dropping the lower scales. After this blog, I will not worry so much about the beginners and intermediate level readers. I will instead focus on the elite. I will write to peers, not students. I will not even address simple topics. I will not take the time to spell everything out. Starting next month, this will be a blog for advanced readers only.

I am sure that such writing will be easier and shorter. I am not getting callous, nor uncaring. There will still be resources aplenty for the newbies. Moreover, I will continue to maintain, even expand and improve, the e-Discovery Team Training program. The online training will remain oriented to beginning and intermediate levels. Plus, there are another 406 blogs out there for beginners. I have done my part. Let others serve the newbies.

Advice to Fellow Bloggers

My advice to other bloggers: take the time to do it right, and please, keep a dimension to it that is light. Law is too serious a subject not to include some humor. Be creative too, and do not treat a blog like a paper article; use graphics, reader polls, comments, video, even music, and, of course, hyperlinks.

Please do not be afraid to write the truth. You have a First Amendment right in this great country, use it. Express your opinions. Our ancestors have given their lives for that right. Censors beware. I will fight you. I will counter the chill whenever I can. Like all attorneys in the U.S., I have sworn an oath to uphold our Constitution, and I take this seriously. We are all endowed by our Creator with certain unalienable Rights, among these are Life, Liberty and the pursuit of Happiness. These are precious freedoms, and so is our Bill of Rights, including especially the First Amendment:

Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.

My personality is to tell it like it is. I do not mince words. I am not afraid of a fight. I relish the opportunity to stand up for truth and justice. That is one reason why I went into litigation. Be bold, fellow bloggers, be honest. Do not be afraid to take on controversial issues. Only write about what you are passionate about, and care enough to do it right. Style matters. That is how to build your readership.

But also be smart about your writings. Look out for the reptiles and sharks out there who may sneak up on you. Assume that everything you write could end up on the front page of the NY Times. Do not say anything stupid or slanderous, and of course, never provide (nor solicit) legal advice. Keep it educational, write smart, and keep it a matter of your opinion, no one else’s.

Take courage from the fact that none of my 500 blogs has ever been used against me by opposing counsel. If that ever happens to me, I will thank them for recognizing my expertise, and then explain that was then, this is now. The facts and circumstances are different. Of course, I never lie nor distort in a blog. I say what I mean. I tell the truth. That is the real reason why my blogs are not used against me. I say essentially the same thing in court. Walk your talk, or shut up. Nobody wants to read a poser or empty opinions. Still, having said all of that, whenever I write a blog I remember the NY Times rule, and remember that everything I write could be used against me in a court of law. That is what I mean by writing smart.

In over eight years of blogging I have only taken down one of my blogs after a complaint (I do not count that in the 500). It happened about six years ago after a certain very powerful business person / celebrity who, unbeknownst to me, happened to be a client of my old firm, took offense to my telling the truth. (Hint, this guy has a funny hairdo and is famous for two words.) My partner asked me to do it in such a nice way, that I decided to oblige and took it down late Monday morning. Everything was fine, it was not such a great blog anyway.

I have, however, somewhat revised blogs from time to time, but that was always voluntary on my part. The ability to change things is one of the beauties of online writing. Aside from the one incident I mentioned, I have never been pressured to shut up. I have had negative reactions to be sure, but as a lifelong litigator I am used to that. I enjoy speaking truth to power. It is both a strength and a weakness, but it is who I am.

Your writing, like your actions, should be true to yourself. For me that means being direct. I will say what I think, and not just in writing, but in speaking too. That is how I operate. That is why people trust me. (Speaking of trust, dear bloggers, do not ever quote people without their permission. Keep secrets.) I am honest to a fault. It has taken me years to accept certain social lying conventions as a polite necessity, like “My, you look great.” But I will never lie about anything important. I will tell you what I really think. Again, that is who I am.

That may not be you, so do not try to write that way if you are not like that. If you are the kind of person who always pussyfoots around everything like a crab, then write that way too. You will get many politically correct, devious followers like yourself. Above all else, be true to yourself. Do not try to copy another.

I also suggest that my fellow bloggers check out my disclaimer and use something like it. Do not use your blog for marketing, or to try to get new clients. Do it as an educational service, a personal adventure, or do not do it at all. Marketing blogs are crap. Everybody knows that except the marketing departments.

Writing and teaching is a great way to get to know a subject better. There are many benefits to writing a blog. I have learned a lot about my speciality and myself. I urge all to give it a try. The tradition of writing is a long and honored one. Its inner rewards can be great, and if you happen to receive some public recognition for your efforts, all the better. But that should never be the primary goal. The writing should be an end in itself, a flow or zone of satisfaction and happiness. Otherwise, try something else. Tell your marketing department to go jump in a lake. We do not need any more boring blogs.

Conclusion and New Beginning

You may have noticed that I always end my blogs with a Conclusion. This is a habit from decades of writing legal memorandums supporting motions. This time it is a final conclusion, but also an introduction of what I hope will be a coming good. I will continue to try, and, as the Wherefore clauses of my motions seeking equity say (there is no law that you read my blog), pray that you will grant my request that you continue to be a reader. I invite one and all to at least try to remain on the e-Discovery Team.

The blog beginning next month will be different: shorter and more advanced. I have no idea at this time what I will write about in my first monthly blog, but after eight years of doing this weekly, I have complete trust in the muse. What I do know is that the next blog will have a new style. I am about to hit the gas pedal and accelerate the expression of my thoughts. Of course, beginners will not be barred from my blog (assuming they behave). It is just that I will no longer spend sentence after sentence so that they can more easily follow along. The Mahayana days of the e-Discovery Team blog acting as a big boat to carry all across the great water are over.

I am about to take off and see how fast this car will go. At this point I need a bigger challenge and need to write to a more select readership. I need to stop teaching undergraduate courses so that I can focus on research and advanced topics. I need to take my thinking to the next step, as well as enrich other aspects of my life. If some readers end up confused in the coming months, which is probably inevitable, they can drop this course. They need not feel too bad about that. Things are already going crazy fast. They will see our back-to-the-future traces in the other 406 blogs out there. Some may come back later and begin to be able to follow along. I have helped build this car and now yearn to find out how fast it will go, and more importantly, where it will take the e-Discovery Team.

The team of readers here may get smaller, but not necessarily so. The number of real experts in the field is growing, especially among the young. I have to think of the future generations. I have to see how far and fast the core team can go. That will make it easier for the next generations of lawyers to keep up with their technology peers. Life is not about playing it safe. The techs are moving fast and they need legal support, not to mention grown-up supervision. Society without the stabilizing force of law and justice is far too dangerous. Law must move just as fast as technology and science to remain relevant. Accept this responsibility and run with me. We are already close to catching up. They need us in the forefront to help guide the way.

To my friends and readers already among the 1337, the blog should soon become a more enjoyable experience than before. Shorter for sure, and I hope, also more profound, more thought provoking. You will be able to keep up, and hopefully, you will prod me along to go even faster. This is a team effort. That has always been the fundamental message of this blog and will remain. We are just going to increase our pace. I need to find out how fast we can go and where such speeds will take us. The future awaits and will never be evenly distributed. I am done trying. Pure tech gave up long ago, and now I must too, at least in this blog. Others will translate to the lawyers and other e-discovery professionals that we leave behind.

See you next month on the other side. There will be no bubble heads over there, but you will find most of the true elite. As many of you know, the Sedona bubble needs to be burst anyway. The team here will keep leading, including some guest blogs. Starting next month the latest truth will still be heard, even if the wave-length is different. The dangers in the world of litigation remain, and, rest assured, I will continue to speak out. As Walter Mitty found out after seeing a fin, those are no porpoises, despite what the crew may say.

There is a battle in the legal tech world between Information Governance and Search. It reflects a larger conflict in IT and all of society. Last year I came to believe that Information Governance’s preoccupation with classification, retention, and destruction of information was a futile pursuit. I challenged these activities as inefficient and doomed to failure in the age of information explosion. Instead of classify and kill, I embraced the googlesque approach of save and search.

I became wary of the whole approach of governing information as hostile to individual privacy rights and liberties. In my experience IG rules only seemed to serve the large entities who made them. For instance, IG rules typically state that employees have no reasonable expectation of privacy to any communications they may have at work, that all of their email accounts, even personal, can be searched at will. Their every keystroke can be monitored and recorded. Old school records policies seemed to encourage these draconian approaches. Under current U.S. law, these rules are usually enforceable.

Although I appeared to be a lonely searcher-voice in the legal technology world, which is, after all, not too surprising, since law itself is an attempt to govern, I had plenty of good company in the general technology world. There is not only Google, which you would expect, but also EMC, GE, and a host of others. The debate is part of the larger issues surrounding Big Data.

I took up arms against IG as I then knew it, which I understood to be an activity primarily designed to classify, control and delete records. I knew this conflict of approaches in how to treat information was important, and I felt compelled to speak out. Govern or Search is not just a legal issue. It is a cultural issue.

When I first spoke out with a contrarian voice, it created a controversy. Most in the legal establishment thought I was just plain wrong. Many wrote articles respectfully opposing my position. Many more were ready to argue, to fight even. Some did. I was even yelled at once at a CLE speakers dinner by a distinguished leader of IG who bristled at my challenges (some might say baiting). She insisted that everyone in her very large corporation could easily comply with her lengthy retention schedules. Oh brother.

The more thoughtful members of the IG leadership responded to the opposition with dialogue. This requires listening and trying to understand the points of the other side. I understand and favor dialogue, which is what attracted me to Sedona back in the day. I learned from this dialogue that IG, like Search, is not a monolith, that there are various factions and groups within IG.

After months of dialogue with the modern camp of IG, I have come to see that the contest between Search and IG need not be a fight to the death. I came to see a potential win/win outcome to this struggle. To those followers of IG who, like Jason R. Baron, have already transcended the old roles of traditional records keepers, there is no need to fight at all. My quarrel is, instead, with the old-liners, the Records Manager strata of IG who are obsessed with ESI classification and killing. To those who have let go of that traditional role, and already been reborn as multimodal, AI-enhanced Information experts, I have no quarrel. You could say that a partial settlement has been reached by a realignment of the parties.

My opposition continues only with the old-time record keepers with their long complex retention schedules and harsh top down rules. I will continue to oppose these caterpillars, no matter what smoke they may blow my way, unless and until they bow to the inevitable electronic metamorphosis. There has been no settlement with them. Trial in the world court of public opinion continues. I will oppose them for their own good. The librarians should relax, perhaps inhale a bit, cocoon, learn the new tech ways, and reemerge.

The battle against the new age Information Governors is, however, over; although I will remain watchful. Why? Because they in fact have already embraced the search and technology ways of “my side.” As Sun Tzu said: “The supreme art of war is to subdue the enemy without fighting.” Search and technology have won. Information has won. They are all one.

Underneath the superficial differences, and the annoying tendency of IG to claim every other field, including Search, as a subset of its own, both sides share almost all of the same values and concerns. Members of both sides are committed to cybersecurity and privacy, and do not see them as an either/or choice. That is critical. We must not sacrifice all of our privacy and individual rights in the name of security.

Where are the rights to both privacy and security in the challenge of too-much-information? I am a strong proponent of privacy, and so are many in the IG world. I am also a strong proponent of cybersecurity. I think it is possible to have both. In both the Search and IG camps there are people who agree with me on these points, and others who disagree. Many see it as one or the other, especially people in government. They take extreme views favoring either security or privacy. Many in both tech and government simply dismiss the importance of privacy, and say just get over it. Advocacy for individual privacy is a separate battle in both worlds, IG and Search. The same is true for cybersecurity. I favor a balanced approach, and so do many in the IG world.

The real battle is not between new IG and Search; it is between the extreme positions that can be found in both camps on the issues of privacy and security. I advocate for a middle ground, privacy and security, and so do many in the IG world. I am also apprehensive of the emergence of Big Brother from Big Data, but, as it turns out, so are many in the IG world. Our common ground is far greater than our differences. Thus the realignment of the parties against our common foes.

Death of a Caterpillar

The traditionalists in the IG world whom I continue to oppose, the ones who are glorified records managers, have another five years, at best, before complete obsolescence. The classify and control lock-down approach of records management is contrary to the times. It cannot withstand the continuing exponential growth of data, nor the basic entropy forces aligned against all attempts to govern by all-too-human rules and compliance. Records managers are caterpillars waiting to be reborn. They should withdraw into a cocoon and embrace the change.

My prediction is that within five years the traditional records management activities, specifically the classification, filing and obsessive deletion of data, will no longer be worth the effort. (I concede that some deletion is necessary and will continue.) It will be far more efficient to rely on advanced Search, than classify and kill. This five-year projection assumes continued exponential growth and complexity of ESI. Breakthroughs in search in the next five years would be nice too, but my prediction does not depend on that. It assumes instead a slow, steady improvement of search technologies. They are already awesome, when used properly. The caterpillar record managers will grow big and fly high with search if they will only allow themselves to have new eyes.

Alas, as of now the old-school IG’ers still see the world through paper glasses. They think that Information Governance is like paper records management, just with more zeros after the number of records involved. The file-everything librarian mentality lives on, or tries to. Yawn. There is a reason nobody in the C-Suite ever took records managers seriously. Dressing them up with new titles is not going to change anything. They have to really change and be reborn into the digital world. They need to learn to fly with search, instead of creeping along with filing rules. They need to embrace the new high-tech world of IG 2.0.

ESI Grows and Changes Too Fast for Traditional Governance

Electronic information is a totally new kind of force, something Mankind has never seen before. Digital Information is a Genie out of the bottle. It cannot be captured. It cannot be managed. It certainly cannot be governed. It cannot even be killed. Forget about trying to put it back in the bottle. It is breeding faster than even Star Trek’s Tribbles could imagine. As Baron and Paul discussed in their important 2007 law review article, ESI is like a new Universe, and we are living just moments after the Big Bang. George L. Paul and Jason R. Baron, Information Inflation: Can the Legal System Adapt? 13 RICH. J.L. & TECH. 10 (2007).

Ludwig Wittgenstein

What many outside of Google, Baron, and Paul fail to grasp is that Information has a life of its own. Id. at FN 30 (quoting Ludwig Wittgenstein (a 20th Century Austrian philosopher whom I was forced to study while in college in Vienna): “[T]o imagine a language is to imagine a form of life.”) Electronic information is a new and unique life form that defies all attempts at limitation, much less governance. As James Gleick observed in his book on information science, everything is a form of information. The Universe itself is a giant computer and we are all self-evolving algorithms. Gleick, The Information: A History, a Theory, a Flood.

Many claim that information wants to be free. It does not want to be governed, or charged for. Information is more useful when free and when it is not subject to transitory restraints. Still, it must also be respected and safeguarded. As Stewart Brand famously put it:

On the one hand information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.

Regardless of the economic aspects, and whether information really wants to be free, as a practical matter Information itself cannot be governed, even if some of it can be commoditized. Information is moving and growing far too fast for governance. But not too fast for search or security, at least I hope not. There are promising tech methods on the horizon that should guarantee privacy. See, e.g.: Entangled Photons on Silicon Chip: Secure Communications & Ultrafast Computers, The Hacker News, 1/27/15 (quantum entanglement encryption as the ultimate privacy solution).

Digitized information is like a nuclear reaction that has passed the point of no return. The chain reaction has been triggered. This is what exponential growth really means. In time such fission vision will be obvious. Even people without Google glasses will be able to see it. Just look at the amount of ESI the world generates in any given minute. And the volume of ESI stored doubles at least every two years.
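If you want to feel what exponential really means here, the arithmetic is simple. A quick sketch, assuming nothing more than the doubling period just quoted:

```python
# Growth factor after n years when volume doubles every d years: 2 ** (n / d)
doubling_period = 2  # years, the conservative figure quoted above
for years in (2, 4, 10, 20):
    factor = 2 ** (years / doubling_period)
    print(f"after {years:2d} years: {factor:,.0f}x the data")
# after 10 years: 32x; after 20 years: 1,024x. No retention schedule outruns that.
```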

In the meantime we have records managers running around like heroic bomb squads. Some know that it is just a noble quest, doomed to failure. Most do not. Some helicopter in and out of corporate worlds like wannabe Brian Williamses. They take flak (for real). They attempt to defuse ticking information bombs. They build walls around it. They confidently set policies and promulgate rules. They inventory it, map it, delete it. They talk sternly about enforcement of rules. (Of course, that never happens, which is one reason the whole effort is futile.) They automate deletion. They also try to automate filing. Some are even starting to make robot file clerks. But is it worth the effort? Might the time and money be better spent protecting our data from black hat hackers? Protecting our privacy and individual rights?

The old school IG’ers are all working diligently to try to solve today’s problems of information management. But, all the while, ever new problems encroach upon their walls. They cannot keep up with this growth, the new forms of information. The next generation of exponential growth builds faster than anyone can possibly govern. Do they not know that the nuclear bomb has already exploded? That the tipping point has already passed?

Information retention policies that are being created today are like sand castles built at low tide. Can you hear the next wave of data generated by the Internet of Things? It will surely wash away all of today’s efforts. There will always be more data, more unexpected new forms of information.

IG Through the Eyes of an AI-Enhanced Butterfly

I used to endorse the old ways myself. I used to be a caterpillar. ESI feared me. I was all about killing data as soon as you no longer had a business need for it. I was all in favor of short retention schedules. But, that was then. That was before I really mastered predictive coding, which in my version means active machine learning. That was before I understood, much better than I used to, that we are living in a whole new world of Big Data Analytics.

I now realize that it is possible to dramatically reduce the costs of document review. I now realize the incredible power of AI-enhanced search. I am starting to realize the potential value of large pools of seemingly worthless data. These realizations change everything. I have been reborn as a butterfly with digital wings of AI.

Old school IG, by which I mean e-dressed-up records management, is not the way to deal with today’s all digital world. We are all suffering from information overload. We are all looking for a solution. Will we cope by Search and advanced technology, or by vertical forces of governance and man-made laws? This is an important question for everyone.

My understanding and experiences with Big Data analytics over the last few years have led me to understand that more data can mean more intelligence, that it does not necessarily mean more trouble and expense. I understand that more and bigger data has its own unique values, so long as it can be analyzed and searched effectively.

This change of position was reinforced by my observing many litigated cases where companies no longer had the documents they needed to prove their case. The documents had short retention spans. They had all been destroyed in the normal course of business before litigation was ever anticipated. I have seen first hand that yesterday’s trash can be tomorrow’s treasure. I will not even go into the other kinds of problems that very short retention policies can create for a company that must immediately implement a litigation hold. The time pressures to get a hold in place can be enormous, and thus errors become more likely.

There is a definite dark side to data destruction that many do not like to face. No one knows for sure when data has lost its value. The meaningless email of yesterday about lunch at a certain restaurant could well have a surprise value in the future. For instance, a time-line of what happened when, and to whom, is sometimes an important issue in litigation. These stupid lunch emails could help prove where a witness was and when. They could show that a witness was at lunch, out of the office, and not at a meeting as someone else alleges.

Who knows what value such seemingly worthless data may someday have? Perhaps millions of emails of ten thousand employees about lunch could be used someday to prove or disprove certain class-action allegations. Outside of the little world of litigation, perhaps the information could help management make smarter business decisions. For instance, they could help a company to decide whether to open a company cafeteria, and if so, what kind of food its employees would really like to have served there. Information can prove what really happened in the past and can help you to make the right decisions. With smart search, there can be great hidden value in too much information. Businesses are starting to see this now where Big Data mining is all the buzz. We lawyers need to start doing the same.

The point is, with the never-ending uncertainties of tomorrow, you can never know for sure which information is valueless and should be destroyed, and which information has value and should be saved. There may be an unimaginably large haystack of information, and you may think it only has a few valuable needles. But, you never really know. Today’s irrelevant straw could be tomorrow’s relevant needle. With the AI-based search capacities we already have, capacities that are sure to improve, when you need to find a needle in these near infinite stacks, you will be able to. The cost of storage itself has become so low as to be a negligible factor for most large corporations. Why destroy data when you can effectively search it and mine it for value? That is the butterfly view.

Information Technology View on Records Management v. Search

The general IT world is also struggling between whether to go all-in with Search, or keep trying to solve the problem of too much information with records management. Unlike the legal world, where my vote for Search is still a new and small minority, in the IT world search is already a strong voice. Many in IT see attempts at information governance as a knee-jerk reaction from those still transitioning into the digital world. In the last year it seems to me that those favoring search over filing are gaining ground in the technology world. From what I see, the retain and search solution is surging ahead of the old-fashioned govern and destroy approach.

Consider, for instance, the policy of search stated by hot new companies like Pivotal, which is a joint venture between EMC, VMware, and GE. Pivotal’s public mantra is: Store Everything. Analyze Anything. Build the Right Thing.

Pivotal urges its customers to store everything, not just its organized databases, such as financial records. It provides the ability to store all types of data, including especially disorganized data, such as employee emails and texts, and do so in the same place. That is the new gold standard. Pivotal explains the value of store everything this way:

Store everything to create a rich data repository for business needs. With unlimited, supported Pivotal HD enterprises never have to worry about data growth constraints or runaway license costs.

Its suite of Big Data software is designed to allow a company to store all data types in the same place, which it, along with EMC, and others, have started calling a Data Lake. All types and formats of ESI become readable, searchable, in the Data Lake. They do not have to be stored separately, nor searched and analyzed separately. The Data Lakes are also infinitely expandable. Unlike real lakes, they cannot flood. They can instead grow unhindered in cyberspace. All they need are more servers.

These are major breakthroughs and mean the inevitable end of separate data silos by format type and size. This allows you to, in Pivotal’s words, leverage all your data, forever, and place it all in a centralized Business Data Lake. You can analyze multiple data sets and types that live in the Business Data Lake. This allows you to determine the integration value of multiple data sets and types. It also makes storage of Big Data much less expensive.
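To make the store everything, search anything idea concrete, here is a toy Python sketch. A real Data Lake runs on clusters of servers with software like Pivotal HD or Hadoop; this little example, with made-up records, only shows the core concept: heterogeneous data types sitting in one pool, searchable through one index.

```python
from collections import defaultdict

# A toy "data lake": heterogeneous record types in one pool, one index.
lake = [
    {"id": "email-001",  "type": "email",     "text": "lunch at the taco place moved to 1pm"},
    {"id": "fin-2014Q3", "type": "financial", "text": "Q3 revenue summary and forecast"},
    {"id": "txt-042",    "type": "text_msg",  "text": "running late, start lunch without me"},
]

# Minimal inverted index: word -> set of record ids.
index = defaultdict(set)
for rec in lake:
    for word in rec["text"].lower().split():
        index[word].add(rec["id"])

def search(word):
    """One query spans every data type in the lake."""
    return sorted(index.get(word.lower(), set()))

print(search("lunch"))   # ['email-001', 'txt-042']
```

The point of the design is that nothing has to be classified before it is stored; structure is imposed at search time, not at filing time.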

Bottom line, when all of your data is saved forever, and subject to advanced search analytics, you are empowered to build the right thing. In Pivotal’s words, building the right thing means to deliver a transformative solution to meet today’s demanding business needs. For business that means creation of new products, new advertising, new sales and business methods. For law it means building your case, finding evidence, and creating new legal methods. The promise of Big Data is changing everything in the tech world. Some in IG are also aware of these facts and are adapting ESI management accordingly.

The key problem all large organizations face is the challenge to find the information they need, when they need it, and do so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

This creates a basic problem for old-school IG because the whole system is based on a notion that the best way to find valuable information is to destroy worthless information. Much of old IG is devoted to trying to determine what information is a valuable needle, and what is worthless chaff. This is because everyone knows that the more information you have, the harder it is for you to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-Enhanced Search days, but are, IMO, no longer true today.

In order to meet the basic goal of finding information, old-school IG focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousands of federal, state and local laws. Information governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.
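To see why this became such a complicated legal analysis problem, consider a caricature of a retention schedule in code. Everything here is made up; a real schedule maps hundreds of record classes to periods drawn from thousands of laws, and every unclassified record becomes a human bottleneck:

```python
from datetime import date, timedelta

# Hypothetical retention schedule: record class -> years to keep. A real
# schedule runs to hundreds of classes drawn from thousands of laws.
RETENTION_YEARS = {"tax": 7, "payroll": 4, "contracts": 10, "email": 2}

def disposition(record_class, created, today=None):
    today = today or date.today()
    keep_for = RETENTION_YEARS.get(record_class)
    if keep_for is None:
        return "unclassified -- a human must decide"   # the perennial bottleneck
    expires = created + timedelta(days=365 * keep_for)
    return "destroy" if today >= expires else f"retain until {expires}"

print(disposition("email", date(2011, 3, 1), today=date(2015, 3, 1)))  # destroy
print(disposition("cad_drawing", date(2011, 3, 1)))                    # unclassified
```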

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

The records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers and digital storage needed were very expensive to buy and maintain. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective, or a business necessity perspective, and to save that information, and only that information. The old idea behind IG was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and, it also made possible the whole point of IG, to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm shift, and it seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of IG will naturally dissipate. The records life-cycle will transform into virtual immortality. There is no reason to classify and delete if you can save everything and find anything at low cost. The issues simplify; they change to how to save and search, although new collateral issues of security and privacy grow in importance.

The New York Times, in an opinion editorial in late 2014, discussed recent breakthroughs in Artificial Intelligence and speculated on the alternative futures this could create. Our Machine Masters, NY Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively from another article in Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This artificial intelligence breakthrough, Kevin Kelly argues, and David Brooks agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.”

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means that in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors who add AI to document review and project management will grow fast. The vendors without AI-enhanced software will go out of business. The law firms that do not use AI tools will shrink and die. The same goes for IG.

The three big new advances that are allowing better and better AI are nowhere near to threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and certainly will change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored in thousands of inexpensive cloud computers that operate together. This is where open-source software like Hadoop comes in; it makes the big clusters of computers possible. Better algorithms are where AI-enhanced software comes in. This makes it possible to use predictive coding effectively and inexpensively to find the information needed to resolve lawsuits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs working with small teams of reviewers, search experts, and software experts.
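For readers who want to see what predictive coding looks like at the better algorithms level, here is a minimal sketch using Python and scikit-learn. This is not any vendor's product, and the toy documents are invented; it is just the textbook core of the technique: learn from reviewer-coded examples, then rank the uncoded documents by predicted probability of relevance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Reviewer-coded training examples (toy data; real projects use thousands).
train_docs = [
    "board approved the merger terms",          # coded relevant
    "merger side letter and escrow schedule",   # coded relevant
    "office holiday party signup sheet",        # coded irrelevant
    "cafeteria menu for next week",             # coded irrelevant
]
train_labels = [1, 1, 0, 0]                     # 1 = relevant, 0 = irrelevant

uncoded = [
    "draft merger agreement, revised escrow terms",
    "parking garage closed on Friday",
]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
model = LogisticRegression().fit(X, train_labels)

# Rank the uncoded documents by predicted probability of relevance.
probs = model.predict_proba(vec.transform(uncoded))[:, 1]
for doc, p in sorted(zip(uncoded, probs), key=lambda z: -z[1]):
    print(f"{p:.2f}  {doc}")
```

The ranking is what changes the economics: reviewers read from the top of the list down, instead of wading through everything linearly.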

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of the old paradigm of Information Governance, but not the new. The new paradigm of IG gets it, and relies on AI technology.

The save everything forever, AI-search model of new IG will create a variety of new legal work for lawyers, but they will be the next generation of tech lawyers. The cybersecurity and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. Big Data breaches already mean Big Money for the law firms that offer curative services. That is happening now. In the future lawyers will play a larger role in preventative security issues. More legal issues are sure to arise with the expansion of Big Data, AI, and the development of the next generation of IG. From what I have seen, technology creates new jobs as fast as it eliminates old ones. The real challenge is keeping up with the changes.

Conclusion

Preservation is far less difficult when you are saving everything forever anyway. With this approach the challenging task remaining in e-discovery is really just search. That is why I say, only slightly tongue in cheek, that Information Governance is actually a sub-set of Search, not vice versa. Insofar as e-discovery is concerned, that is true; but IG is a concern that goes beyond e-discovery.

In the IG now emerging – IG 2.0 – Information Governance serves as a kind of umbrella organization for all things information. It is not just a hyped-up version of records management. It is the center of a high-tech wheel built around information. That image has traction for Search advocates such as myself, just so long as search is not considered to be just another spoke in the wheel. Search has a much more important position. It is the tire around the wheel, where the rubber meets the road. In today’s world you are likely to get lost without it.

The Second Filter begins where the First leaves off. The ESI has already been purged of unwanted custodians, date ranges, spam, and other obviously irrelevant files and file types. Think of the First Filter as a rough, coarse filter, and the Second Filter as fine-grained. The Second Filter requires a much deeper dive into file contents to cull out irrelevance. The most effective way to do that is to use predictive coding, by which I mean active machine learning, supplemented by a variety of methods to find good training documents. That is what I call a multimodal approach, one that places primary reliance on the Artificial Intelligence at the top of the search pyramid. If you do not have the active machine learning type of predictive coding, with its document-ranking abilities, you can still do fine-grained Second Filter culling, but it will be harder, and probably less effective and more expensive.

All kinds of Second Filter search methods should be used to find highly relevant and relevant documents for AI training. Stay away from any process that uses just one search method, even if the one method is predictive ranking. Stay far away if the one method is rolling dice. Relying on random chance alone has been proven to be an inefficient and ineffective way to select training documents. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One, Two, Three and Four. No one should be surprised by that.

The first round of training begins with the documents reviewed and coded relevant incidental to the First Filter coding. You may also want to defer the first round until you have done more active searches for relevant and highly relevant documents in the pool remaining after First Filter culling. In that case you also include irrelevant documents in the first training round, which is also important. Note that even though the first round of training is the only round of training that has a special name – seed set – there is nothing all that important or special about it. All rounds of training are important.

There is so much misunderstanding about that, and about seed sets, that I no longer like to even use the term. The only thing special in my mind about the first round of training is that it is often a very large training set. That happens when the First Filter turns up a large number of relevant files, or they are otherwise known and coded before the Second Filter training begins. The sheer volume of training documents in many first rounds is what makes them special, not the fact that they came first.

No good predictive coding software is going to give special significance to a training document just because it came first in time. The software I use has no trouble at all disregarding any early training if it later finds that it is inconsistent with the total training input. It is, admittedly, somewhat aggravating to have a machine tell you that your earlier coding was wrong. But I would rather have an emotionless machine tell me that, than another gloating attorney (or judge), especially when the computer is correct, which is often (not always) the case.

That is, after all, the whole point of using good software with artificial intelligence. You do that to enhance your own abilities. There is no way I could attain the level of recall I have been able to manage lately in large document review projects by reliance on my own, limited intelligence alone. That is another one of my search and review secrets. Get help from a higher intelligence, even if you have to create it yourself by following proper training protocols.

Maybe someday the AI will come prepackaged, and not require training, as I imagine in PreSuit. I know it can be done. I can do it with existing commercial software. But judging from the lack of demand I have seen in reaction to my offer of PreSuit as a legal service, the world is not ready to go there yet. I for one do not intend to push for PreSuit, at least not until the privacy aspects of information governance are worked out. Should Lawyers Be Big Data Cops?

Information governance in general is something that concerns me, and is another reason I hold back on PreSuit. Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One and Part Two. Also see: e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part Two. I do not want my information governed, even assuming that is possible. I want it secured, protected, and findable, but only by me, unless I give my express written assent (no contracts of adhesion permitted). By the way, even though I am cautious, I see no problem in requiring that consent as a condition of employment, so long as it is reasonable in scope and limited to only business communications.

I am wary of Big Brother emerging from Big Data. You should be too. I want AIs under our own individual control, where they each have a real big off switch. That is the way it is now with legal search and I want it to stay that way. I want the AIs to remain under my control, not vice versa. Not only that, like all Europeans, I want a right to be forgotten by AIs and humans alike.

At the same time that I want unentangled freedom and privacy, I want a government that can protect us from crooks, crazies, foreign governments, and black hats. I just do not want to give up my Constitutional rights to receive that protection. We should not have to trade privacy for security. Once we lay down our Constitutional rights in the name of security, the terrorists have already won. Why do we not have people in the Justice Department clear-headed enough to see that?

Getting back to legal search, and how to find what you need to know by using the latest AI-enhanced search methods: there are three kinds of probability-ranked search engines now in use for predictive coding.

Three Kinds of Second Filter Probability Based Search Engines

After the first round of training, you can begin to harness the AI features in your software. You can begin to use its probability ranking to find relevant documents. There are currently three kinds of ranking search and review strategies in use: uncertainty, high probability, and random. The uncertainty search, sometimes called SAL for Simple Active Learning, looks at middle-ranked documents, where the software is unsure of relevance, typically the 40%-60% range. The high probability search looks at documents where the AI thinks it knows whether they are relevant or irrelevant. You can also use some random searches, if you want, both simple and judgmental; just be careful not to rely too much on chance.
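To make these three strategies concrete, here is a minimal sketch in Python. It assumes only that your review platform has already assigned each document a probability-of-relevance score; the dict structure and function names are my own hypothetical illustration, not any vendor's actual API.

```python
import random

# Hypothetical sketch of the three ranking strategies. Assumes each
# document dict carries a model-assigned "p_relevant" score between
# 0.0 and 1.0; nothing here is any particular platform's API.

def uncertainty_batch(docs, low=0.40, high=0.60, size=50):
    """SAL-style selection: documents the software is least sure about."""
    band = [d for d in docs if low <= d["p_relevant"] <= high]
    return sorted(band, key=lambda d: abs(d["p_relevant"] - 0.5))[:size]

def high_probability_batch(docs, cutoff=0.90, size=50):
    """Top-strata selection: documents the software thinks are relevant."""
    top = [d for d in docs if d["p_relevant"] >= cutoff]
    return sorted(top, key=lambda d: d["p_relevant"], reverse=True)[:size]

def random_batch(docs, size=50):
    """Chance-based selection; use sparingly, per Cormack and Grossman."""
    return random.sample(docs, min(size, len(docs)))
```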

My own experience confirms the Cormack and Grossman experiments. High probability searches usually involve SME training and review of the upper strata, the documents with a 90% or higher probability of relevance. I will, however, also check out the low strata, but will not spend as much time on that end. I like to use both uncertainty and high probability searches, but typically with a strong emphasis on the high probability searches. And again, I supplement these ranking searches with other multimodal methods, especially when I encounter strong, new, or highly relevant document types.

Sometimes I will even use a little random sampling, but the Cormack and Grossman study just mentioned shows that it is not effective, especially on its own. They call such chance-based search Simple Passive Learning, or SPL. Ever since reading the Cormack and Grossman study I have cut back on my reliance on random searches. You should too. It was small before; it is even smaller now.

Irrelevant Training Documents Are Important Too

In the Second Filter you are on a search for the gold: the highly relevant, and, to a lesser extent, the strong and merely relevant. As part of this Second Filter search you will naturally come upon many irrelevant documents too. Some of these documents should also be added to the training. In fact, it is not uncommon to have more irrelevant documents in training than relevant ones, especially with low prevalence collections. If you judge a document, then go ahead and code it and let the computer know your judgment. That is how it learns. There are some documents that you judge but may not want to train on – such as the very large, or the very odd – but they are few and far between.

Of course, if you have culled out a document altogether in the First Filter, you do not need to code it, because it will not be among the documents included in the Second Filter. In other words, it will not be among the documents ranked in predictive coding. Such documents will either be excluded from possible production altogether as irrelevant, or will be diverted to a non-predictive coding track for final determinations. The latter is the case for non-text file types like graphics and audio, in cases where they might have relevant information.

How To Do Second Filter Culling Without Predictive Ranking

When you have software with active machine learning features that allow you to do predictive ranking, you find documents for training, and from that point forward you incorporate ranking searches into your review. If you do not have such features, you still sort out documents in the Second Filter for manual review; you just do not use ranking with SAL and CAL to do so. Instead, you rely on keyword selections, enhanced with concept searches and similarity searches.

When you find an effective parametric Boolean keyword combination, which is done by a process of party negotiation, testing, educated guessing, trial and error, and judgmental sampling, you then submit the documents containing proven hits to full manual review. Ranking by keywords can also be tried for document batching, but be careful of large files ranking high just because their size produces many keyword hits, not because they are relevant. Some software compensates for that, but most does not. So ranking by keywords can be a risky process.

I am not going to go into detail on the old-fashioned ways of batching out documents for manual review. Most e-discovery lawyers already have a good idea of how to do that. So too do most vendors. Just one word of advice: when you start the manual review based on keyword or other non-predictive coding processes, check in daily on the contract reviewers' work and calculate what kind of precision the various keyword and other assignment folders are creating. If it is terrible, which I would say is less than 50% precision, then I suggest you try to improve the selection matrix. Change the Boolean operators, or the keywords, or something. Do not just keep plodding ahead and wasting client money.

I once took over a review project that was using negotiated, then tested and modified, keywords. After two days of manual review we realized that only 2% of the documents selected for review by this method were relevant. After I came in and spent three days of training to add predictive ranking, we were able to increase that to 80% precision. If you use these multimodal methods, you can expect similar results.

Basic Idea of Two Filter Search and Review

Whether you use predictive ranking or not, the basic idea behind the two-filter method is to start with a very large pool of documents, reduce its size with a coarse First Filter, then reduce it again with a much finer Second Filter. The result should be a much, much smaller pool that is human reviewed, and an even smaller pool that is actually produced or logged. Of course, some of the documents subject to the final human review may be overturned, that is, found to be irrelevant: False Positives. That means they will not make it to the very bottom production pool shown in the diagram at right.

In multimodal projects where predictive coding is used, the precision rates can often be very high. Lately I have been seeing that the second pool of documents, the one subject to manual review, has precision rates of at least 80%, sometimes even as high as 95% near the end of a CAL project. That means the final pool of documents produced is almost as large as the pool after the Second Filter.

Please remember that almost every document that is manually reviewed and coded after the Second Filter gets recycled back into the machine training process. This is known as Continuous Active Learning or CAL, and in my version of it at least, it is multimodal and not limited to only high probability ranking searches. See: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two. In some projects you may just train for multiple iterations and then stop training and transition to pure manual review, but in most you will want to continue training as you do manual review. Thus you set up a CAL constant feedback loop until you are done, or nearly done, with manual review.
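A schematic of that feedback loop follows. The `train`, `rank`, and `human_review` callables stand in for whatever your platform and review team actually provide, so this is only a sketch of the control flow under those assumptions, not a working system.

```python
def cal_loop(docs, train, rank, human_review, batch_size=100):
    """Schematic CAL feedback loop: every document coded in manual
    review is recycled into training until review is (nearly) done.
    train/rank/human_review are placeholders for your platform."""
    labeled = []                                   # (doc, is_relevant) pairs
    while True:
        model = train(labeled)                     # retrain on all coding so far
        ranked = rank(model, docs)                 # re-rank the whole collection
        batch = [d for d in ranked if not d.get("reviewed")][:batch_size]
        if not batch:
            break                                  # nothing left to review
        judgments = human_review(batch)            # attorneys code the batch
        for doc, is_relevant in judgments:
            doc["reviewed"] = True
            labeled.append((doc, is_relevant))     # feed back into training
        if not any(rel for _, rel in judgments):
            break                                  # batches have gone dry
    return labeled
```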

As mentioned, active machine learning trains on both relevance and irrelevance. Although, in my opinion, the documents found that are Highly Relevant, the hot documents, are the most important of all for training purposes. The idea is to use predictive coding to segregate your data into two separate camps, relevant and irrelevant. You not only separate them, but you also rank them according to probable relevance. The software I use has a percentage system from 0.01% to 99.9% probable relevant, and vice versa. A near perfect segregation-ranking project should end up looking like an upside down champagne glass.

After you have segregated the document collection into two groups, and gone as far as you can, or as far as your budget allows, you then cull out the probable irrelevant. The most logical place for the Second Filter cut-off point in most projects is at 49.9% and less probable relevant. Those are the documents that are more likely than not to be irrelevant. But do not take the 50%-plus dividing line as an absolute rule in every case. There are no hard and fast rules to predictive culling. In some cases you may have to cut off at 90% probable relevant. Much depends on the overall distribution of the rankings and the proportionality constraints of the case. Like I said before, if you are looking for Gilbert’s black-letter law solutions to legal search, you are in the wrong type of law.
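In code terms the cut itself is trivial; the judgment lies entirely in picking the cutoff. A sketch, again assuming the hypothetical `p_relevant` scores from earlier:

```python
def second_filter_cut(docs, cutoff=0.50):
    """Split a ranked collection at a probable-relevance cutoff.
    0.50 is the more-likely-than-not line discussed above, but
    proportionality may justify a higher cutoff in a given case."""
    keep = [d for d in docs if d["p_relevant"] >= cutoff]
    cull = [d for d in docs if d["p_relevant"] < cutoff]
    return keep, cull
```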

Almost all of the documents in the production set (the red top half of the diagram) will be reviewed by a lawyer or paralegal. Of course, there are shortcuts to that too, like duplicate and near-duplicate syncing. Some of the low-ranked probable irrelevant documents will have been reviewed as well. That is all part of the CAL process, where both relevant and irrelevant documents are used in training. But only a very low percentage of the probable irrelevant documents need to be reviewed.

Limiting Final Manual Review

In some cases you can, with client permission (often insistence), dispense with attorney review of all or nearly all of the documents in the upper half. You might, for instance, stop after the manual review has attained a well-defined and stable ranking structure. You might only have reviewed 10% of the probable relevant documents (the top half of the diagram), but decide to produce the other 90% without attorney eyes ever looking at them. There are, of course, obvious problems with privilege and confidentiality in such a strategy. Still, in some cases, where appropriate clawback and other confidentiality orders are in place, the client may want to risk disclosure of secrets to save the costs of final manual review.

In such productions there are also dangers of imprecision, where a significant percentage of irrelevant documents are included. This in turn raises concerns that an adversary's view of those documents could engender other suits, even if there is some agreement for the return of irrelevant documents. Once the bell has been rung, privileged or hot, it cannot be un-rung.

Case Example of Production With No Final Manual Review

In spite of the dangers of the unringable bell, the allure of extreme cost savings can be strong to some clients in some cases. For instance, I did one experiment using multimodal CAL with no final review at all, where I still attained fairly high recall, and the cost per document was only seven cents. I did all of the review myself acting as the sole SME. The visualization of this project would look like the below figure.

Note that if the SME review pool were drawn to scale according to number of documents read, then, in most cases, it would be much smaller than shown. In the review where I brought the cost down to $0.07 per document I started with a document pool of about 1.7 Million, and ended with a production of about 400,000. The SME review pool in the middle was only 3,400 documents.

As far as legal search projects go, it had an unusually high prevalence, and thus the production of 400,000 documents was very large. Four hundred thousand was the number of documents ranked with a 50% or higher probable relevance when I stopped the training. I only personally reviewed about 3,400 documents during the SME review, plus another 1,745 in a quality assurance sample after I decided to stop training. To be clear, I worked alone, and no one other than me reviewed any documents. This was an Army of One type project.

Although I only personally reviewed 3,400 documents for training, I actually instructed the machine to train on many more documents than that. I just selected them for training without actually reviewing them first. I did so on the basis of ranking and judgmental sampling of the ranked categories. It was somewhat risky, but it did speed up the process considerably, and in the end worked out very well. I later found out that information scientists often use this technique as well.

My goal in this project was recall, not precision, nor even F1, and I was careful not to overtrain on irrelevance. The requesting party was much more concerned with recall than precision, especially since the relevancy standard here was so loose. (Precision was still important, and was attained too. Indeed, there were no complaints about that.) In situations like that the slight over-inclusion of relevant training documents is not terribly risky, especially if you check out your decisions with careful judgmental sampling, and quasi-random sampling.

I accomplished this review in two weeks, spending 65 hours on the project. Interestingly, my time broke down into 46 hours of actual document review time, plus another 19 hours of analysis. Yes, about one hour of thinking and measuring for every two and a half hours of review. If you want the secret of my success, that is it.

I stopped after 65 hours, and two weeks of calendar time, primarily because I ran out of time. I had a deadline to meet and I met it. I am not sure how much longer I would have had to continue the training before the training fully stabilized in the traditional sense. I doubt it would have been more than another two or three rounds; four or five more rounds at most.

Typically I have the luxury to keep training in a large project like this until I no longer find any significant new relevant document types, and do not see any significant changes in document rankings. I did not think at the time that my culling out of irrelevant documents had been ideal, but I was confident it was good, and certainly reasonable. (I had not yet uncovered my ideal upside down champagne glass shape visualization.) I saw a slow down in probability shifts, and thought I was close to the end.

I had completed a total of sixteen rounds of training by that time. I think I could have improved the recall somewhat had I done a few more rounds of training, and spent more time looking at the mid-ranked documents (40%-60% probable relevant). The precision would have improved somewhat too, but I did not have the time. I am also sure I could have improved the identification of privileged documents, as I had only trained for that in the last three rounds. (It would have been a partial waste of time to do that training from the beginning.)

The sampling I did after the decision to stop suggested that I had exceeded my recall goals, but still, the project was much more rushed than I would have liked. I was also comforted by the fact that the elusion sample test at the end passed my accept on zero error quality assurance test. I did not find any hot documents. For those reasons (plus great weariness with the whole project), I decided not to pull some all-nighters to run a few more rounds of training. Instead, I went ahead and completed my report, added graphics and more analysis, and made my production with a few hours to spare.

A scientist hired after the production did some post-hoc testing that confirmed, at an approximate 95% confidence level, a recall achievement of between 83% and 94%. My work also withstood all subsequent challenges. I am not at liberty to disclose further details.
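For readers wondering how a recall range like 83%-94% arises, here is a generic Wilson score interval in Python. To be clear, this is standard sampling statistics, not the scientist's actual method and not my ei-Recall formula (see the cited series for that); it only illustrates how a sample, rather than a full count, yields a confidence range instead of a point value.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a sampled proportion at ~95% confidence
    (z = 1.96). Generic statistics for illustration only; not ei-Recall."""
    if n == 0:
        raise ValueError("empty sample")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# e.g. wilson_interval(155, 175) -> roughly (0.83, 0.92)
```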

In post hoc analysis I found that the probability distribution was close to the ideal shape that I now know to look for. The below diagram represents an approximate depiction of the ranking distribution of the 1.7 Million documents at the end of the project. The 400,000 documents produced (obviously I am rounding off all these numbers) were ranked 50% plus, and the 1,300,000 not produced were ranked less than 50%. Of the 1,300,000 Negatives, 480,000 documents were ranked with only 1% or less probable relevance. On the other end, the high side, 245,000 documents had a probable relevance ranking of 99% or more. There were another 155,000 documents with a ranking between 50% and 99% probable relevant. Finally, there were 820,000 documents ranked between 1% and 49% probable relevant.

The file review speed realized here, about 35,000 files per hour, and the extremely low cost, about $0.07 per document, would not have been possible without the client’s agreement to forgo full document review of the 400,000 documents produced. A group of contract lawyers could have been brought in for second pass review, but that would have greatly increased the cost, even assuming a billing rate for them of only $50 per hour, which was 1/10th my rate at the time (it is now much higher).
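Since all of these numbers are rounded, a quick arithmetic check may reassure the skeptical reader that they hang together:

```python
# Sanity check of the rounded project numbers reported above.
produced = 245_000 + 155_000        # 99%+ stratum plus the 50-99% stratum
withheld = 480_000 + 820_000        # <=1% stratum plus the 1-49% stratum
assert produced == 400_000          # the production
assert withheld == 1_300_000        # the Negatives
assert produced + withheld == 1_700_000   # the whole collection

print(f"total cost ~ ${0.07 * 1_700_000:,.0f}")      # ~ $119,000
print(f"speed ~ {1_700_000 / 46:,.0f} files/hour")   # ~ 37,000 over the 46
                                                     # review hours, in line
                                                     # with the ~35,000 cited
```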

The client here was comfortable with reliance on confidentiality agreements for reasons that I cannot disclose. In most cases litigants are not, and insist on eyes-on review of every document produced. I well understand this, and in today’s harsh world of hardball litigation it is usually prudent to do so, clawback or no.

Another reason the review was so cheap and fast in this project is that there were very few opposing counsel transaction costs involved, and everyone was hands off. I just did my thing, on my own, and with no interference. I did not have to talk to anybody; I just read a few guidance memorandums. My task was to find the relevant documents, make the production, and prepare a detailed report – 41 pages, including diagrams – that described my review. Someone else prepared a privilege log for the 2,500 documents withheld on the basis of privilege.

I am proud of what I was able to accomplish with the two-filter multimodal methods, especially as it was subject to the mentioned post-review analysis and recall validation. But, as mentioned, I would not want to do it again. Working alone like that was very challenging and demanding. Further, it was only possible at all because I happened to be a subject matter expert on the type of legal dispute involved. There are only a few fields where I am competent to act alone as an SME. Moreover, virtually no legal SMEs are also experienced ESI searchers and software power users. In fact, most legal SMEs are technophobes. I have even had to print out key documents to paper to work with some of them.

Even when I have adequate SME abilities in a legal dispute, I now prefer a small team approach, rather than a solo approach. I now prefer to have one or two attorneys assisting me with the document reading, and a couple more assisting me as SMEs. In fact, I can act as the conductor of a predictive coding project where I have very little or no subject matter expertise at all. That is not uncommon. I just work as the software and methodology expert; the Experienced Searcher.

Right now I am working on a project where I do not even speak the language used in most of the documents. I could not read most of them, even if I tried. I just work on procedure and numbers alone, where others get their hands in the digital mud and report to me and the SMEs. I am confident this will work fine. I have good bilingual SMEs and contract reviewers doing most of the hands-on work.

Conclusion

There is much more to efficient, effective review than just using software with predictive coding features. The methodology of how you do the review is critical. The two filter method described here has been used for years to cull away irrelevant documents before manual review, but it has typically just been used with keywords. I have tried to show here how this method can be employed in a multimodal method that includes predictive coding in the Second Filter.

Keywords can be an effective method to both cull out presumptively irrelevant files, and cull in presumptively relevant, but keywords are only one method, among many. In most projects it is not even the most effective method. AI-enhanced review with predictive coding is usually a much more powerful method to cull out the irrelevant and cull in the relevant and highly relevant.

If you are using a one-filter method, where you just do a rough cut and filter out by keywords, date, and custodians, and then manually review the rest, you are reviewing too much. It is especially ineffective when you collect based on keywords. As shown in Biomet, that can doom you to low recall, no matter how good your later predictive coding may be.

If you are using a two-filter method, but are not using predictive coding in the Second Filter, you are still reviewing too much. The two-filter method is far more effective when you use relevance probability ranking to cull out documents from final manual review.

Try the two filter method described here in your next review. Drop me a line to let me know how it works out.

Large document review projects can maximize efficiency by employing a two-filter method to cull documents from costly manual review. This method helps reduce costs and maximize recall. I introduced this method, and the diagram shown here illustrating it, at the conclusion of my blog series, Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Three. I use the two-filter method in most large projects as part of my overall multimodal, bottom line driven, AI-enhanced (i.e., predictive coding) method of review. I have described this multimodal method many times here, and you will find summaries of it elsewhere, including my CAR page, Legal Search Science, and the work in progress, the EDBP, outlining best practices for lawyers doing e-discovery.

My two-filter method of course employs deduplication and deNISTing in the First Filter. (I always do full horizontal deduplication across all custodians.) Deduplication and deNISTing are, however, mere technical, non-legal filters. They are already well-established industry standards, and so I see no need to discuss them further in this article.

Some think those two technical methods are the end-all of ESI culling, but, as this two-part blog will explain, they are just the beginning. The other methods require legal judgment, and so you cannot just hire a vendor to do them, as you can with deduplication and deNISTing. This is why I am taking pains to explain two-filter document culling, so that it can be used by other legal teams to reduce wasted review expenses.

This blog is the first time I have gone into the two-filter culling component in any depth. This method has been proven effective in attaining high recall at low cost in at least one open scientific experiment, although I cannot go into that. You will just have to trust me on that. Insiders know anyway. For the rest, just look around and see I have no products to sell here, and accept no ads. This is all part of an old lawyer’s payback to a profession that has been very good to him over the years.

My thirty-five years of experience in law have shown me that the most reliable way for the magic of justice to happen is by finding the key documents. You find the truth, the whole truth, and nothing but the truth when you find the key documents and use them to keep the witnesses honest. Deciding cases on the basis of the facts is how our system of justice tries to decide all cases on the merits, in an impartial and fair manner. In today’s information-flooded world, that can only happen if we use technology to find relevant evidence quickly and inexpensively. The days of finding the truth by simple witness interviews are long gone. Thus I share my search and review methods as a kind of payback, and a way to pay it forward. For now, as I have for the past eight years, I will try to make the explanations accessible to beginners and eLeet alike.

We need cases to be decided on the merits, on the facts. Hopefully my writing and rants will help make that happen in some small way. Hopefully it will help stem the tide of over-settlement, where many cases are decided on the basis of settlement value, not merits. Too many frivolous cases are filed that drown out the few with great merit. Judges are overwhelmed and often do not have the time needed to get down to the truth and render judgments that advance the cause of justice.

Most of the time the judges, and the juries they assemble, are never even given the chance to do their job. The cases all settle out instead. As a result only one percent of federal civil cases actually go to trial. This is a big loss for society, and for the so-called “trial lawyers” in our profession, a group I once prided myself on being a part of. Now I just focus on getting the facts from computers, to help keep the witnesses honest, and cases decided on the true facts, the evidence. That is where all the real action is nowadays anyway.

By the way, I expect to get another chance to prove the value of the methods I share here in the 2015 TREC experiment on recall. We will see, again, how it stacks up to other approaches. This time I may even have one or two people assist me, instead of doing it alone as I did before. The Army of One approach, which I have also described here many times, although effective, is very hard and time-consuming. My preference now is a small team approach, kind of like a nerdy swat team, or Seal Team Six approach, but without guns and killing people and stuff. I swear! Really.

One word of warning: although this method is software agnostic, in order to emulate the two-filter method, your document review software must have certain basic capabilities. That includes effective, and easy, bulk coding features for the First Filter. This is the multimodal broad-based culling. Some of the multiple methods do not require software features, just attorney judgment, such as excluding custodians, but others do require software features, like domain searches or similarity searches. If your software does not have the features that will be discussed here for the First Filter, then you probably should switch right away, but, for most, that will not be a problem. The multimodal culling methods used in the First Filter are, for the most part, pretty basic.

Some of the software features needed to implement the Second Filter are, however, more advanced. The Second Filter works best when using predictive coding and probability ranking, where you review the various strata of the ranked documents. The Second Filter can still be used with other, less advanced multimodal methods, e.g., keywords. Moreover, even when you use bona fide active machine learning software features, you continue to use a smattering of other multimodal search methods in the Second Filter. But now you do so not to cull, but to help find relevant and highly relevant documents to improve training. I do not rely on probability searches alone, although sometimes in the Second Filter I rely almost entirely on predictive coding based searches to continue the training.

If you are using software without AI-enhanced active learning features, then you are forced to use only other multimodal methods in the Second Filter, such as keywords. Warning: true active learning features are not present in most review software, or are very weak. That is true even of software that claims to have predictive coding features, but really just has dressed-up passive learning, i.e., concept searches with latent semantic indexing. You handicap yourself, and your client, by continuing to use such less expensive programs. Good software, like everything else, does not come cheap, but should pay for itself many times over if used correctly. The same comment goes for lawyers too.

First Filter – Keyword Collection Culling

Some first-stage filtering takes place as part of the ESI collection process. The documents are preserved, but not collected or ingested into the review database. The most popular collection filter as of 2015 is still keyword filtering, even though this is very risky in some cases and inappropriate in many. Typically such keyword filtering is driven by a desire to avoid vendor processing and hosting charges.

Some types of collection filtering are appropriate and necessary, for instance, in the case of custodian filters, where you broadly preserve the ESI of many custodians, just in case, but only collect and review a few of them. It is, however, often inappropriate to use keywords to filter out the collection of ESI from admittedly key custodians. This is a situation where an attorney determines that a custodian’s data needs to be reviewed for relevant evidence, but does not want to incur the expense to have all of their ESI ingested into the review database. For that reason they decide to only review data that contains certain keywords.

I am not a fan of keyword filtered collections. The obvious danger of keyword filtering is that important documents may not have the keywords. Since they will not even be placed in the review platform, you will never know that the relevant ESI was missed. You have no chance of finding them.

See, e.g., William Webber’s analysis of the Biomet case, where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13). Webber shows that in Biomet this method First Filtered out over 40% of the relevant documents. This doomed the Second Filter predictive coding review to a maximum possible recall of 60%, even if it was otherwise perfect, meaning even if it would otherwise have attained 100% recall, which never happens. The Biomet case very clearly shows the dangers of over-reliance on keyword filtering.

Nevertheless, sometimes keyword collection may work, and may be appropriate. In some simple disputes, and with some data collections, obvious keywords may work just fine to unlock the truth. For instance, sometimes the use of names is an effective method to identify all, or almost all, documents that may be relevant. This is especially true in smaller and simpler cases. This method can, for instance, often work in employment cases, especially where unusual names are involved. It becomes an even more effective method when the keywords have been tested. I just love it, for instance, when the plaintiff’s name is something like the famous Mister Mxyzptlk.

In some cases keyword collections may be as risky as in the complex Biomet case, but may still be necessary because of the proportionality constraints of the case. The law does not require unreasonably excessive search and review, and what is reasonable in a particular case depends on the facts of the case, including its value. See my many writings on proportionality, including my law review article Predictive Coding and Proportionality: A Marriage Made In Heaven, 26 Regent U. Law Review 1 (2013-2014). Sometimes you have to try for rough justice with the facts that you can afford to find given the budgetary constraints of the case.

The danger of missing evidence is magnified when the keywords are selected on the basis of educated guesses or just limited research. This technique, if you can call it that, is, sadly, still the dominant method used by lawyers today to come up with keywords. I have long thought it is equivalent to a child’s game of Go Fish. If keywords are dreamed up like that, as mere educated guesses, then keyword filtering is a high risk method of culling out irrelevant data. There is a significant danger that it will exclude many important documents that do not happen to contain the selected keywords. No matter how good your predictive coding may be after that, you will never find these key documents.

If the keywords are not based on mere guessing, but are instead tested, then keyword filtering becomes a real technique that is less risky for culling. But how do you test possible keywords without first collecting and ingesting all of the documents to determine which are effective? It is the old cart before the horse problem.

Interviews do help, but there is nothing better than actual hands-on reading and testing of the documents. This is what I like to call getting your hands dirty in the digital mud of the actual ESI collected. Only then will you know for sure the best way to mass-filter out documents. For that reason my strong preference in all significant-size cases is to collect in bulk, and not filter out by keywords. Once you have documents in the database, you can then effectively screen them out by using parametric Boolean keyword techniques. See your particular vendor for various ways on how to do that.

By the way, parametric is just a reference to the various parameters of a computer file that all good software allows you to search. You could search the text and all metadata fields, the entire document. Or you could limit your search to various metadata fields, such as date, prepared-by, or the to and from fields of an email. Everyone knows what Boolean means, but you may not know all of the many variations that your particular software offers to create highly customized searches. While predictive coding is beyond the grasp of most vendors and case managers, the intricacies of keyword search are not. They can be a good source of information on keyword methods.
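As an illustration only, here is what a parametric Boolean search reduces to, modeled on plain Python dicts. Every real platform has its own query syntax for the same idea, so treat the field names and the example query below as hypothetical.

```python
from datetime import date

def parametric_search(docs, all_terms, not_terms, sent_after, from_domain):
    """Toy parametric Boolean search: Boolean logic over the text plus
    constraints on metadata parameters (date, sender). Field names are
    hypothetical; real platforms expose this via their own syntax."""
    hits = []
    for d in docs:
        text = d["text"].lower()
        if not all(t in text for t in all_terms):        # AND terms
            continue
        if any(t in text for t in not_terms):            # NOT terms
            continue
        if d["sent"] < sent_after:                       # date parameter
            continue
        if not d["from"].lower().endswith(from_domain):  # sender parameter
            continue
        hits.append(d)
    return hits

# e.g. parametric_search(docs, ["merger"], ["newsletter"],
#                        date(2012, 1, 1), "@acme.com")
```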

First Filter – Date Range and Custodian Culling

Even when you collect in bulk, and do not keyword filter before you put custodian ESI in the review database, in most cases you should filter for date range and custodian. It is often possible for an attorney to know, for instance, that no emails before or after a certain date could possibly be relevant. That is often not a highly speculative guessing game. It is reasonable to filter on this time-line basis before the ESI goes in the database. Whenever possible, try to get agreement on date range screening from the requesting party. You may have to widen it a little, but it is worth the effort to establish a line of communication and begin a cooperative dialogue.

The second thing to talk about is which custodians you are going to include in the database. You may put 50 custodians on hold, and actually collect the ESI of 25, but that does not mean you have to load all 25 into the database for review. Here your interviews and knowledge of the case should allow you to know who the key, key custodians are. You rank them by your evaluation of the likely importance of the data they hold to the facts disputed in the case. Maybe, for instance, in your evaluation you only need to review the mailboxes of 10 of the 25 collected.

Again, disclose and try to work that out. The requesting party can reserve rights to ask for more, that is fine. They rarely do after production has been made, especially if you were careful and picked the right 10 to start with, and if you were careful during review to drop and add custodians based on what you see. If you are using predictive coding in the second filter stage, the addition or deletion of data mid-course is still possible with most software. It should be robust enough to handle such mid-course corrections. It may just slow down the ranking for a few iterations, that’s all.

First Filter – Other MultiModal Culling

There are many other bulk coding techniques that can be used in the First Filter stage. This is not intended to be an exhaustive list. Like all complex tasks in the law, simple black letter rules are for amateurs. The law, which mirrors the real world, does not work like that. The same holds true for legal search. There may be many Gilbert's-style guides to search in books and articles, but they are just 1L-type guides. For true legal search professionals they are mere starting points. Use my culling advice here in the same manner. Use your own judgment to mix and match the right kind of culling tools for the particular case and data encountered. Every project is slightly different, even in the world of repeat litigation, like the employment law disputes where I currently spend much of my time.

Legal search is at core a heuristic activity, but one that should be informed by science and technology. The knowledge triangle is a key concept for today’s effective e-Discovery Team. Although e-Discovery Teams should be led by attorneys skilled in evidence discovery, they should include scientists and engineers in some way. Effective team leaders should be able to understand and communicate with technology experts and information scientists. That does not mean all e-discovery lawyers need to become engineers and scientists too. That effort would likely diminish your legal skills based on the time demands involved. It just means you should know enough to work with these experts. That includes the ability to see through the vendor sales propaganda, and to incorporate the knowledge of the bona fide experts into your legal work.

One culling method that many overlook is file size. Some collections have thousands of very small files, just a few bits, that are nothing but backgrounds, tiny images, or just plain empty space. They are too small to have any relevant information. Still, you need to be cautious and look out for very small emails, for instance, ones that just say “yes.” Depending on context, such an email could be relevant and important. But for most other types of very small files there is little risk. You can go ahead and bulk code them irrelevant and filter them out.

Even more subtle is filtering out files based on their being very large. Sort your files by size, and then look at both ends, small and big. They may reveal certain files and file types that could not possibly be relevant. There is one more characteristic of big files that you should consider. Many of them have millions of lines of text. Big files are confusing to machine learning when, as is typical, only a few lines of the text are relevant, and the rest are just noise. That is another reason to filter them out, perhaps not entirely, but for special treatment and review outside of predictive coding. In other projects where you have many large files like that, and you need the help of AI ranking, you may want to hold them in reserve. You may only want to throw them into the ranking mix after your AI algorithms have acquired a pretty good idea of what you are looking for. A maturely trained system is better able to handle big noisy files.
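A minimal sketch of that size triage, assuming you have file paths on disk; the thresholds are illustrative only, and you should still skim samples from both tails before bulk coding anything.

```python
import os

def size_outliers(paths, tiny_bytes=512, huge_bytes=25_000_000):
    """Flag both tails of the size distribution for judgmental sampling.
    Thresholds are illustrative; tune them to the collection at hand."""
    tiny = [p for p in paths if os.path.getsize(p) <= tiny_bytes]
    huge = [p for p in paths if os.path.getsize(p) >= huge_bytes]
    return tiny, huge    # skim samples of each before any bulk coding
```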

File type is a well-known and often highly effective method to exclude large numbers of files of a same type after only looking at a few of them. For instance, there may be database files automatically generated, all of the same type. You look at a few to verify these databases could not possibly be relevant to your case, and then you bulk code them all irrelevant. There are many types of files like that in some data sets. The First Filter is all about being a smart gatekeeper.

File type is also used to eliminate, or at least divert, non-text files, such as audio files or most graphics. Since most Second Filter culling is going to be based on text analytics of some kind, there is no point for anything other than files with text to go into that filter. In some cases, and some datasets, this may mean bulk coding them all irrelevant. This might happen, for instance, where you know that no music or other audio files, including voice messages, could possibly be relevant. We also see this commonly where we know that photographs and other images could not possibly be relevant. Exclude them from the review database.

You must, however, be careful with all such gatekeeper activities, and never do bulk coding without some judgmental sampling first. Large unknown data collections can always contain a few unexpected surprises, no matter how many document reviews you have done before. Be cautious. Look before you leap. Skim a few of the ESI file types you are about to bulk code as irrelevant.

This directive applies to all First Filter activities. Never do it blind on just logic or principle alone. Get your hands in the digital mud. Do not over-delegate all of the dirty work to others. Do not rely too much on your contract review lawyers and vendors, especially when it comes to search. Look at the documents yourself and do not just rely on high level summaries. Every real trial lawyer knows the importance of that. The devil is always in the details. This is especially true when you are doing judgmental search. The client wants your judgment, not that of a less qualified associate, paralegal, or minimum wage contract review lawyer. Good lawyers remain hands-on, to some extent. They know the details, but are also comfortable with appropriate delegation to trained team members.

There is a constant danger of too much delegation in big data review. The lawyer signing the Rule 26(g) statement has a legal and ethical duty to closely supervise document review done in response to a request for production. That means you cannot just hire a vendor to do that, although you can hire outside counsel with special expertise in the field.

Some non-text file types will need to be diverted for different treatment than the rest of your text-based dataset. For instance, some of the best review software allows you to keyword search audio files. It is based on phonetics and wave forms. At least one company I know has had that feature since 2007. In some cases you will have to carefully review the image files, or at least certain kinds of them. Sorting based on file size and custodian can often speed up that exercise.

Remember, the goal is always efficiency, and caution, but not over-caution. The more experienced you get, the better you become at evaluating risks and knowing where you can safely take chances to bulk code, and where you cannot. Another thing to remember is that many image files have text in them too, such as in the metadata, or in ASCII transmissions. They are usually not important and do not provide good training for second stage predictive coding.

Text can also be hidden in dead Tiff files, if they have not been OCRed. Scanned document Tiffs, for instance, may very well be relevant and deserve special treatment, including full manual review, but they may not show in your review tool as text, because they have never been OCR text recognized.

Concept searches have only rarely been of great value to me, but they should still be tried out. Some software has better capacities with concepts and latent semantic indexing than others. You may find it to be a helpful way to find groupings of obviously irrelevant, or relevant, documents. If nothing else, you can always learn something about your dataset from these kinds of searches.

Similarity searches of all kinds are among my favorites. If you find some file groups that cannot be relevant, find more like them. They are probably bulk irrelevant (or relevant) too. A similarity search, such as find every document that is 80% or more the same as this one, is often a good way to enlarge your carve-outs and thus safely improve your efficiency.
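Using nothing but the Python standard library, an "80% or more the same" expansion looks roughly like this. Real platforms use shingling or hashing to do this at scale, so treat this as a toy illustration of the idea, not production machinery.

```python
from difflib import SequenceMatcher

def similar_to(seed_text, docs, threshold=0.80):
    """Toy '80% or more the same' similarity search. quick_ratio() is a
    cheap upper bound used to skip the expensive exact ratio() call."""
    matches = []
    for d in docs:
        m = SequenceMatcher(None, seed_text, d["text"])
        if m.quick_ratio() >= threshold and m.ratio() >= threshold:
            matches.append(d)
    return matches
```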

Another favorite of mine is domain culling of email. It is kind of like a spam filter. It is a great way to catch the junk mail, newsletters, and other purveyors of general mail that cannot possibly be relevant to your case. I have never seen a mail collection that did not have dozens of domains that could be eliminated. You can sometimes cull out as much as 10% of your collection that way, sometimes more when you start diving down into particular senders within otherwise safe domains. A good example of this is the IT department, with its constant mass mailings, reminders, and warnings. Many departments are guilty of this, and after examining a few such messages, it is usually safe to bulk code them all irrelevant.
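The mechanics here are simple enough to sketch: tally the sender domains across the collection, then eyeball the biggest tallies for safe bulk exclusion. A minimal version, assuming email records with a "from" address:

```python
from collections import Counter

def domain_counts(emails):
    """Tally sender domains so the big bulk-mail offenders stand out.
    Review a sample from each candidate domain before bulk coding."""
    domains = Counter()
    for e in emails:
        addr = e.get("from", "")
        if "@" in addr:
            domains[addr.rsplit("@", 1)[1].lower()] += 1
    return domains.most_common()

# e.g. [('newsletter.example.com', 4200), ('alerts.acme.com', 1800), ...]
```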

To be continued. In the next part I will discuss the second filter and go deeper into AI-enhanced predictive culling. I will also discuss how tested keywords can also be used, if you do not yet have the skill-set or good software needed for predictive coding.


About the Blogger

Ralph Losey is a practicing attorney and shareholder in a national law firm with 50+ offices and over 800 lawyers, where he leads the firm’s Electronic Discovery practice group. All opinions expressed here are his own, and not those of his firm or clients. No legal advice is provided on this website, and nothing here should be construed as such.

Ralph has long been a leader among the world's tech lawyers. He has presented at hundreds of legal conferences in the US, Canada, and UK, written over 500 articles, and five books on electronic discovery. He is also the founder of Electronic Discovery Best Practices, founder and CEO of e-Discovery Team Training, an online education program, and, of course, publisher of this blog and many other related instructional websites.

Ralph has limited his legal practice to electronic discovery and tech law since 2006. He has a special interest in the search and review of electronic evidence using artificial intelligence, and in cybersecurity. Ralph has been involved with computers, software, legal hacking, and the law since 1980. Ralph has the highest peer AV rating as a lawyer and was selected as a Best Lawyer in America in Commercial Litigation, along with other awards. His full biography may be found at RalphLosey.com.

Ralph is also the proud father of two children, Eva Grossman and Adam Losey, an e-discovery lawyer (married to another e-discovery lawyer, Catherine Losey), and, best of all, husband since 1973 to Molly Friedman Losey, a mental health counselor in Winter Park.

