The Akron Legal News this week published an interesting editorial on information governance. The story, by Richard Weiner, discussed how law firms are dealing with the transition from rooms filled with hard-copy records to electronically stored information (ESI), which includes firm business records as well as huge amounts of client eDiscovery content. The story pointed out that ESI flows into the law firm so quickly, and in such huge quantities, that no one can track it, much less know what it contains. Law firms are now facing an inflection point: change the way all information is managed, or suffer client dissatisfaction and client loss.

The story pointed out that “in order to function as a business, somebody is going to have to, at least, track all of your data before it gets even more out of control – Enter information governance.”

There are many definitions of information governance (IG) floating around but the story presented one specifically targeted at law firms: IG is “the rules and framework for managing all of a law firm’s electronic data and documents, including material produced in discovery, as well as legal files and correspondence.” Richard went on to point out that there are four main tasks to accomplish through the IG process. They are:

Map where the data is stored;

Determine how the data is being managed;

Determine data preservation methodology;

Create forensically sound data collection methods.

I would add several more to this list:

Create a process to account for and classify inbound client data such as eDiscovery and regulatory collections.
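The first task on that list, mapping where the data is stored, can be bootstrapped with a simple inventory script before any IG tooling is purchased. A minimal sketch (the root path and the summary-by-extension approach are my assumptions, not anything from the story):

```python
import os
from collections import Counter

def map_file_share(root):
    """Walk a file share and summarize what is stored, by file extension."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            try:
                sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # unreadable or vanished file; skip it
    return counts, sizes

if __name__ == "__main__":
    counts, sizes = map_file_share(".")
    for ext, n in counts.most_common(10):
        print(f"{ext:10s} {n:6d} files {sizes[ext] / 1e6:10.1f} MB")
```

Even a crude report like this (top extensions by count and volume) gives the IG process a starting map of where the data lives and roughly what it is.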

As law firms transition to mostly ESI for both firm business and client data, they will need to adopt IG practices and processes to account for and manage these different requirements. Many believe this transition will eventually lead to the incorporation of machine learning techniques into IG, giving law firm IG processes a much more granular understanding of what the data actually means, not just whether it's a firm business record or part of a client eDiscovery response. This in turn will enable more granular categorization of all firm information.

For the clean-up of dark data (remediation), it has been suggested by many, including myself, that the remediation process should include determining what you really have, determining what can be immediately disposed of (obvious stuff like duplicates, expired content, etc.), categorizing the rest, and moving the remaining categorized content into information governance systems.

But many conservative-minded people (like many General Counsel) hesitate at the actual deletion of data, even after they have spent the resources and dollars to identify potentially disposable content. The reasoning usually centers on the fear of destroying information that could be potentially relevant in litigation. A prime example is seen in the Arthur Andersen case, where a partner famously sent an email message to employees working on the Enron account reminding them to “comply with the firm’s documentation and retention policy” – in other words, get rid of stuff. Many GCs don’t want to be put in the position of rightfully disposing of information per policy and then having to explain in court why potentially relevant information was destroyed.

For those who don’t want to take the final step of disposing of data, the question becomes: “So what do we do with it?” This reminds me of a customer I dealt with years ago. The GC for this 11,000-person company, a very distinguished-looking man, was asked during a meeting with the company’s senior staff what the company’s information retention policy was. He quickly responded that he had decided all information (electronic and hard copy) from their North American operations would be kept for 34 years. Quickly calculating the company’s storage requirements over 34 years with 11,000 employees, I asked him if he had any idea what his storage footprint would be at the end of those 34 years. He replied no and asked what it would be. I told him it would be in the petabyte range and asked whether he understood what storing that amount of data would cost, and how difficult it would be to find anything in it.

He smiled and replied, “I’m retiring in two years, I don’t care.”

The moral of that actual example is that if you have decided to keep large amounts of electronic data for long periods of time, you have to consider the cost of storage as well as how you will search it for specific content when you actually have to.
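The scale of a commitment like that is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming each employee generates 5 GB of retained data per year (the per-employee growth rate is my assumption; real figures vary widely and were higher than this for many companies even then):

```python
EMPLOYEES = 11_000
YEARS = 34
GB_PER_EMPLOYEE_PER_YEAR = 5  # assumed rate; adjust for your environment

total_gb = EMPLOYEES * YEARS * GB_PER_EMPLOYEE_PER_YEAR
total_pb = total_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB
print(f"{total_gb:,} GB ≈ {total_pb:.2f} PB")
```

Even at this conservative rate the company ends up near 2 PB, which is exactly the “petabyte range” problem: the storage bill is only half of it, because searching a multi-petabyte unindexed archive for specific content is the expensive part.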

In the example above, the GC was planning on storing it on spinning disk, which is costly. Others I have spoken to have decided that the most cost-effective way to store large amounts of data for long periods is to keep backup tapes. It’s true that backup tapes are relatively cheap (compared to spinning disk), but it is difficult to get anything off of them, they have a relatively high failure rate (again, compared to spinning disk), and they have to be rewritten every so many years because backup tapes slowly lose their data over time.

A potential solution is moving your dark data to long-term hosted archives. These hosted solutions can securely hold your electronically stored information (ESI) at extremely low cost per gigabyte. When needed, you can access your archive remotely, search it, and move or copy data back to your site.

An important factor to look for (for eDiscovery) is that moving, storing, indexing, and recovering data from the hosted archive must not alter the metadata in any way. This is especially important when responding to a discovery request.
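One way to verify that a round trip to an archive left content and metadata untouched is to fingerprint each file before it leaves and again after it comes back. A minimal sketch (the fields captured here – content hash, size, modification time – are my assumptions about what a verification baseline might include, not a substitute for a forensically validated tool):

```python
import hashlib
import os

def fingerprint(path):
    """Capture a content hash plus key filesystem metadata for one file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    st = os.stat(path)
    return {"sha256": h.hexdigest(), "size": st.st_size, "mtime": st.st_mtime}

def verify_roundtrip(before, after):
    """True only if content hash and recorded metadata are all unchanged."""
    return before == after
```

The baseline fingerprints would be recorded before data is moved to the archive; any mismatch on retrieval flags a file whose content or metadata changed in transit.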

For those of you considering starting a dark data remediation project, consider long-term hosted archives as a staging target for the data your GC just won’t allow to be disposed of.

Dark data, otherwise known as unstructured, unmanaged, and uncategorized information, is a major problem for many organizations. Many organizations don’t have the will or the systems in place to automatically index and categorize their rapidly growing unstructured dark data, especially in file shares, and instead rely on employees to manually manage their own information. This reliance on employees is a no-win situation because employees have neither the incentive nor the time to actively manage their information.

Organizations find themselves trying to figure out what to do with huge amounts of dark data, particularly when they’re purchasing TBs of new storage annually because they’ve run out.

Issues with dark data:

Consumes costly storage space and resources – Most medium to large organizations provide terabytes of file share storage for employees and departments to use. Employees drag and drop all kinds of work-related files (and personal files like photos, MP3 music files, and personal communications) as well as PSTs and workstation backup files. The vast majority of these files are unmanaged and are never looked at again by the employee or anyone else.

Consumes IT resources – IT personnel must perform nightly backups and disaster recovery planning, and spend time finding or restoring files that employees could not locate.

Masks security risks – File shares act as “catch-alls” for employees. Sensitive company information regularly finds its way to these repositories. These file shares are almost never secured, so sensitive information like personally identifiable information (PII), protected health information (PHI), and intellectual property can be inadvertently leaked.

Raises eDiscovery costs – Almost everything is discoverable in litigation if it pertains to the case. The fact that tens or hundreds of terabytes of unindexed content is being stored on file shares means that those terabytes of files may have to be reviewed to determine if they are relevant in a given legal case. That can add hundreds of thousands or millions of dollars of additional cost to a single eDiscovery request.
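The eDiscovery cost claim above can be made concrete with rough industry rules of thumb. A minimal sketch, where the volume, documents-per-GB conversion, culling rate, and per-document review cost are all assumptions for illustration:

```python
TB_ON_SHARES = 10        # assumed volume of unindexed file-share data
DOCS_PER_GB = 5_000      # rough rule of thumb; varies hugely by file mix
REVIEW_FRACTION = 0.01   # assume culling leaves 1% for attorney review
COST_PER_DOC = 1.0       # assumed first-pass review cost, in dollars

docs_total = TB_ON_SHARES * 1_000 * DOCS_PER_GB
docs_reviewed = int(docs_total * REVIEW_FRACTION)
cost = docs_reviewed * COST_PER_DOC
print(f"{docs_reviewed:,} documents ≈ ${cost:,.0f} to review")
```

Under these assumptions, 10 TB of file-share data yields 50 million documents; even after culling 99% of them, first-pass review of the remainder runs to half a million dollars, which is how a single request reaches the “hundreds of thousands or millions” range.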

To bring this dark data under control, IT must take positive steps to address the problem. The first step is to look at your file shares.

With the rapid advances in healthcare technology, the movement to electronic health records, and the relentless accumulation of regulatory requirements, the shift from records management to information governance is increasingly becoming a necessary reality.

In a 2012 CGOC (Compliance, Governance and Oversight Council) Summit survey, it was found that on average 1% of an organization’s data is subject to legal hold, 5% falls under regulatory retention requirements, and 25% has business value. This means that 69% of an organization’s ESI is not needed and could be disposed of without impact to the organization. I would argue that for the healthcare industry, especially for covered entities with medical-record stewardship, those retention percentages are somewhat higher, especially the regulatory retention requirements.

According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be composed of unstructured information; information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails & attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages. Despite a push for more organization and process in managing unstructured data, healthcare organizations continue to binge on unstructured data with little regard to the overall health of their enterprises.

So how does this info-gluttony (the unrestricted saving of unstructured data because data storage is cheap and saving everything is just easier) affect the health of the organization? Obviously you’ll look terrible in horizontal stripes, but also: finding specific information quickly (or at all) becomes impossible, you’ll spend more on storage, data breaches could occur more often, litigation/eDiscovery expenses will rise, and you won’t want to go to your 15th high school reunion…

To combat this unstructured info-gain, we need an intelligent information governance solution – STAT! And that solution must include a defensible process to systematically dispose of information that is no longer subject to regulatory requirements or litigation holds, or that no longer has business value.

To enable this information governance/defensible disposal cure for infobesity, healthcare information governance solutions must be able to extract meaning from all of this unstructured content, or in other words understand and differentiate content conceptually. Automated classification/categorization of unstructured content cannot accurately or consistently differentiate meaning by relying on simple rules or keywords alone. To accurately automate the categorization and management of unstructured content, a machine learning capability to “train by example” is a precondition. This ability to systematically derive meaning from unstructured content, combined with machine learning to accurately automate information governance, is something we call “Predictive Governance”.
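To make the “train by example” idea concrete, here is a toy illustration, not any vendor’s product: a nearest-centroid bag-of-words classifier that learns categories from a handful of labeled example documents and then assigns new content to the closest category. The categories and example texts are invented for the sketch:

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a simple bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(examples):
    """Build one centroid vector per category from labeled examples."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Assign text to the category whose centroid it most resembles."""
    vec = vectorize(text)
    return max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))

examples = [
    ("patient lab report blood glucose results", "clinical"),
    ("radiology scan notes for patient", "clinical"),
    ("invoice payment due accounts billing", "business"),
    ("quarterly billing statement invoice", "business"),
]
model = train(examples)
print(classify("new lab results for the patient", model))
```

Production systems replace the bag-of-words with richer conceptual representations and wrap the loop in human review, but the workflow is the same: humans label examples, the system generalizes, and corrections feed back into training.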

A side benefit of Predictive Governance (besides actually looking taller) is that previously lost organizational knowledge and business intelligence can be automatically compiled and made available throughout the organization.

In my early career, shred days – the scheduled annual activity where the company ordered all employees to go through their paper records to determine what should be disposed of – were commonplace. At the government contractor I worked for, we actually wheeled our boxes out to the parking lot to a very large truck that had huge industrial shredders in the back. Once the boxes of documents were shredded, we were told to walk them over to a second truck, a burn truck, where we, as the records custodians, would verify that all of our records were destroyed. These shred days were a way to collect, verify, and yes, physically shred all the paper records that had gone beyond their retention period over the preceding year.

The Magic 8-Ball says Shred Days aren’t Defensible

Nowadays, this type of activity carries some negative connotations and is much riskier. Take for example the recent case of Rambus v. SK Hynix. In this case, U.S. District Judge Ronald Whyte in San Jose reversed his own prior ruling from 2009, in which he had originally issued a judgment against SK Hynix, awarding Rambus Inc. $397 million in a patent infringement case. In his reversal this year, Judge Whyte ruled that Rambus Inc. had spoliated documents in bad faith when it hosted company-wide “shred days” in 1998, 1999, and 2000. Judge Whyte found that Rambus could have reasonably foreseen litigation against Hynix as early as 1998, and that Rambus therefore engaged in willful spoliation during the three “shred days” (a finding of spoliation can be based on inadvertent destruction of evidence as well). Because of this spoliation ruling, the judge reduced the prior Rambus award from $397 million to $215 million, a cost to Rambus of $182 million.

Another well-known example of sudden retention/disposition policy activity causing unintended consequences is Arthur Andersen and Enron. During the Enron case, it emerged that Enron’s accounting firm had sent an email to some of its employees reminding them to comply with the firm’s document retention policy – in effect, an instruction to shred.

This email was a key reason why Arthur Andersen ceased to exist shortly after the case concluded. Arthur Andersen was charged with and found guilty of obstruction of justice for shredding thousands of documents and deleting emails and company files that tied the firm to its audit of Enron. Less than a year after the email was sent, Arthur Andersen surrendered its CPA license on August 31, 2002, and 85,000 employees lost their jobs.

Learning from the Past – Defensible Disposal

These cases highlight the need for a true information governance process, including a truly defensible disposal capability. In these instances, an information governance process would have been capturing, indexing, and applying retention policies to content, protecting content on litigation hold, and disposing of content that was beyond the retention schedule and not on legal hold – automatically, based on documented and approved, legally defensible policies. A documented and approved process that is consistently followed and has proper safeguards goes a long way with the courts to show good-faith intent to manage content and protect content subject to anticipated litigation.

To successfully automate the disposal of unneeded information in a consistently defensible manner, auto-categorization applications must have the ability to conceptually understand the meaning in unstructured content so that only content meeting your retention policies, regardless of language, is classified as subject to retention.

Taking Defensible Disposal to the Next Level – Predictive Disposition

A defensible disposal solution that conceptually understands content meaning, and that incorporates an iterative “train by example” training process within a human-supervised workflow, provides accurate, predictive retention and disposition automation.

Moving away from manual, employee-based information governance to automated information retention and disposition with truly accurate (95 to 99%) and consistent meaning-based predictive information governance will provide the defensibility that organizations require today to keep their information repositories up to date.

Information growth is out of control. The compound average growth rate for digital information is estimated to be 61.7%. According to a 2011 IDC study, 90% of all data created in the next decade will be of the unstructured variety. These facts are making it almost impossible for organizations to actually capture, manage, store, share and dispose of this data in any meaningful way that will benefit the organization.

Successful organizations run on and are dependent on information. But information is valuable to an organization only if you know where it is, what’s in it, and what is shareable – in other words, only if it is managed. In the past, organizations have relied on end-users to decide what should be kept, where, and for how long. In fact, 75% of data today is generated and controlled by individuals. In most cases this practice is ineffective and causes what many refer to as “covert” or “underground” archiving: the practice of individuals keeping everything in their own unmanaged local archives. These underground archives effectively lock most of the organization’s information away, hidden from everyone else in the organization.

This growing mass of information has brought us to an inflection point; get control of your information to enable innovation, profit and growth, or continue down your current path of information anarchy and choke on your competitor’s dust.

Choosing the Right Path

How does an organization ensure this inflection point is navigated correctly? Information governance. You must get control of all your information by employing proven processes and technologies that allow you to create, store, find, share, and dispose of information in an automated and intelligent manner.

An effective information governance process optimizes overall information value by ensuring the right information is retained and quickly available for business, regulatory, and legal requirements. This process reduces regulatory and legal risk, ensures needed data can be found quickly and secured for litigation, reduces overall eDiscovery costs, and provides structure to unstructured information so that employees can be more productive.

Predicting the Future of Information Governance

Predictive Governance is the bridge across the inflection point. It combines machine-learning technology with human expertise and direction to automate your information governance tasks. Using this proven human-machine iterative training capability, Predictive Governance is able to accurately automate the concept-based categorization, data enrichment and management of all your enterprise data to reduce costs, reduce risks, enable information sharing and mitigate the strain of information overload.

Automating information governance so that all enterprise data is captured, granularly evaluated for legal requirements, regulatory compliance, or business value, and stored or disposed of in a defensible manner is the only way for organizations to move to the next level of information governance.

Healthcare information and records continue to grow with the introduction of new devices and expanding regulatory requirements such as the Affordable Care Act, the Health Insurance Portability and Accountability Act (HIPAA), and the Health Information Technology for Economic and Clinical Health (HITECH) Act. In the past, healthcare records consisted mostly of paper forms or structured billing data, which were relatively easy to categorize, store, and manage. That has been changing as new technologies enable faster and more convenient ways to share and consume medical data.

According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be composed of unstructured information; information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails & attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages.

Who or what is going to actually manage this growing mountain of unstructured information?

To ensure regulatory compliance and the confidentiality and security of this unstructured information, the healthcare industry will have to either 1) hire many more professionals to manually categorize and manage it, or 2) acquire technology to do it automatically.

Looking at the first option: the cost of having people manually categorize and manage unstructured information would be prohibitively expensive, not to mention slow. It would also expose private patient data to even more individuals. That leaves the second option: information governance technology. Because of the nature of unstructured information, a technology solution would have to:

Recognize and work with hundreds of data formats

Communicate with the most popular healthcare applications and data repositories

Draw conceptual understanding from “free-form” content so that categorization can be accomplished at an extremely high accuracy rate

Enable proper access security levels based on content

Accurately retain information based on regulatory requirements

Securely and permanently dispose of information when required
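The last two requirements – retaining by regulation and disposing when permitted – reduce to a rule that must always defer to legal holds. A minimal sketch of that disposition logic (the categories, retention periods, and function shape are illustrative assumptions; real schedules come from regulation and counsel):

```python
from datetime import date, timedelta

# Assumed retention periods in years, per content category.
RETENTION_YEARS = {"clinical_record": 10, "billing": 7, "general": 3}

def disposition(category, created, on_legal_hold, today=None):
    """Return 'retain' or 'dispose' for one record under simple rules."""
    today = today or date.today()
    if on_legal_hold:
        return "retain"  # a legal hold always overrides the schedule
    years = RETENTION_YEARS.get(category, 10)  # unknown: default to longest
    expires = created + timedelta(days=365 * years)
    return "dispose" if today >= expires else "retain"

print(disposition("billing", date(2000, 1, 1), on_legal_hold=False,
                  today=date(2013, 1, 1)))
```

The point of the sketch is the ordering: the hold check comes before any retention math, so nothing subject to litigation can ever be scheduled for disposal, which is the heart of defensibility.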

An exciting emerging information governance technology that can actually address the above requirements uses the same next-generation technology the legal industry has adopted: proactive information governance technology based on conceptual understanding of content, machine learning, and iterative “train by example” capabilities.