The Importance of Managing Unstructured Data

Unstructured data presents a hidden danger for many enterprises. At a minimum, it can damage business operations and lead to spiraling storage costs, but even worse, it is a regulatory compliance time bomb. Learn how data governance tools can help bring unstructured data under control.

Unstructured data can damage your organization's business operations and lead to spiraling storage costs. If that is not bad enough, it also represents a regulatory compliance time bomb that could explode at any time with serious consequences.

Unstructured data is best defined by looking at its opposite: Structured data that is stored in a repository like a database with rules governing data types, values and even where the data should be stored. Unstructured data, on the other hand, is everything else: The mass of spreadsheets, documents, presentations, images and other files that employees generate every day, store on corporate file servers and SharePoint servers, and share with people both within and outside the organization. Often, this data is simply ignored or forgotten about when it is no longer needed. Thus, it may continue to be stored indefinitely. As a result, most organizations have a great deal of unstructured data: Roughly 80 percent of all corporate data is unstructured, according to Gartner.

If you think about the type of information contained in all of these spreadsheets, documents and other files, it is highly likely that some of them contain confidential corporate information such as pricing, designs and legal documents, as well as personal details about staff and customers. Losing some of this information through accidental deletion could harm your business, while unauthorized access could leave your company in breach of Sarbanes-Oxley or other financial, data protection or health care compliance regulations.

To help deal with this challenge, companies including Siperian (now part of Informatica), IBM and Varonis Systems have developed data governance products that automate the work of controlling access to unstructured data. "The problem that many organizations face is how do you govern access to all this data efficiently? If you try and do it manually there's no way that you can figure out who is using it, whether it is being used properly, and how you can reduce access to data to just those who need it," David Gibson, Varonis' director of technical services, told ServerWatch.

To do this you must know if each individual file stored on your servers is still needed and, if so, who must access it, who can access it, and who is actually accessing it. Permissions are seldom revoked in practice, so files tend to be accessible by people long after they cease needing to access them, Gibson said. This can lead to serious compliance problems.

Varonis' software runs on Microsoft's SQL Server platform. From there, it connects to other servers within the organization, including Microsoft SharePoint and Microsoft Exchange; Windows file servers, including NAS devices like EMC Celerra and NetApp filers; UNIX file servers running Solaris, AIX and HP-UX; Linux servers running distributions including Red Hat, SUSE and Ubuntu; and Active Directory servers. Since the system is automated, it can work extremely fast: Within a few hours, the system can run a permissions audit and map files and folders to individuals and groups, and within a day or so the system can begin to build up a picture of which users are actually accessing these files, and what they are actually doing when they access them (e.g., editing or deleting them).

It takes slightly longer -- typically four to six weeks -- before the system is able to start generating useful permissions recommendations. These take the form of lists of users that Varonis' system believes really need to access a given file or folder, and lists of users it believes should have their permissions removed because they appear -- through inactivity -- to no longer require access.
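
The inactivity-based recommendation described above can be illustrated with a short Python sketch. This is not Varonis' algorithm, which the article does not detail; it is a minimal approximation that assumes you already have the set of users permitted on a folder and a log of (user, timestamp) access events. The function name `recommend_revocations` and the 42-day window (six weeks, mirroring the observation period mentioned above) are hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta


def recommend_revocations(permitted_users, access_log, now,
                          inactivity_window=timedelta(days=42)):
    """Split permitted users into 'keep' and 'review for removal' lists.

    permitted_users: set of user names that currently hold access.
    access_log: iterable of (user, timestamp) pairs for the same folder.
    inactivity_window: hypothetical threshold; six weeks by default.
    """
    cutoff = now - inactivity_window
    # Users with at least one access event inside the window.
    recently_active = {user for user, ts in access_log if ts >= cutoff}
    keep = sorted(permitted_users & recently_active)
    review = sorted(permitted_users - recently_active)
    return keep, review


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    permitted = {"alice", "bob", "carol"}
    log = [
        ("alice", now - timedelta(days=3)),    # active this week
        ("bob", now - timedelta(days=100)),    # last seen months ago
    ]
    keep, review = recommend_revocations(permitted, log, now)
    print("keep:", keep)
    print("review for removal:", review)
```

The output is deliberately a list for review rather than an automatic revocation, since the article's point is that a data owner, not the tool or the IT department, confirms each change.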

These must be forwarded to the business owners of each piece of information, but establishing who these people are can, in normal circumstances, be extremely difficult. According to the Ponemon Institute, 91 percent of organizations lack processes for determining the business owners of data. By analyzing the data it collates, Varonis' solution is able to build up a directory of the likely business owners of each piece of data, Gibson claims. It forwards the recommendations it comes up with to these likely data owners for confirmation that they are indeed the owners, and if so they can review the recommendations and action them themselves. "A data owner will regularly get an email, which basically says 'here is a list of people who don't access this data anymore -- so shall we revoke their permissions?'" said Gibson.
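
The article does not say how the likely business owner is inferred, only that it comes from analyzing collected activity data. As a rough illustration, one simple heuristic is to treat the most frequent user of a folder as the candidate owner and ask that person to confirm. The Python sketch below assumes that heuristic; the function name `likely_owners` and the event format are hypothetical and do not describe Varonis' method.

```python
from collections import Counter, defaultdict


def likely_owners(access_events):
    """Guess a candidate owner per folder from access frequency.

    access_events: iterable of (folder, user) pairs, one per access.
    Returns a dict mapping each folder to its most active user.
    Assumption: the heaviest user is a reasonable first guess, to be
    confirmed by that person before any recommendation is acted on.
    """
    counts = defaultdict(Counter)
    for folder, user in access_events:
        counts[folder][user] += 1
    return {folder: users.most_common(1)[0][0]
            for folder, users in counts.items()}


if __name__ == "__main__":
    events = [
        ("/finance/pricing", "dana"),
        ("/finance/pricing", "dana"),
        ("/finance/pricing", "eli"),
        ("/legal/contracts", "farah"),
    ]
    for folder, owner in likely_owners(events).items():
        print(f"{folder}: ask {owner} to confirm ownership")
```

A guess like this will sometimes be wrong (an assistant may open a folder more often than the manager who is accountable for it), which is why the confirmation step described above matters.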

The main benefit of the system is that it allows data owners to control their own data, rather than placing the burden on IT staff -- who in any case are not in a position to make judgements about who should be accessing it. It also makes it easier for data owners or compliance auditors to investigate who has been accessing data or who may have moved it -- and where to -- or deleted it.

An important side benefit is that a data governance solution like Varonis' also makes it easier for organizations to control unstructured data storage costs by identifying data that is being stored but never used. "Many organizations can't see the state of their non-accessed data," said Gibson. "We can identify data that has not been accessed for, say, 180 days, and recommend it for removal to other storage media." Ultimately much of this data can often be deleted, reducing storage costs and reducing compliance efforts.
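
The 180-day check Gibson describes can be approximated on a single file tree with Python's standard library. The sketch below is an illustration only and makes two assumptions: the function name `find_stale_files` is hypothetical, and the file system records last-access times reliably. Many systems mount volumes with options such as `noatime` or `relatime`, in which case `st_atime` may not reflect real usage and a dedicated audit log would be needed instead.

```python
import os
import time


def find_stale_files(root, max_idle_days=180, now=None):
    """Return (path, size_in_bytes) for files not accessed recently.

    Relies on st_atime, which is only meaningful if the file system
    actually updates access times (an assumption; see the note above).
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                stale.append((path, info.st_size))
    return stale


if __name__ == "__main__":
    candidates = find_stale_files(".")
    total_bytes = sum(size for _path, size in candidates)
    print(f"{len(candidates)} files idle for 180+ days, "
          f"{total_bytes} bytes that could move to cheaper storage")
```

Summing the sizes gives a first estimate of how much primary storage could be reclaimed by moving or deleting the idle files.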

Implementing a data governance suite such as Varonis' is certainly not without its drawbacks: Enabling employees rather than the IT department to make data governance decisions is a big cultural change, and the cost of the software is significant -- for medium and large organizations the annual licensing and support costs are likely to run into the hundreds of thousands of dollars. The real question is whether the benefits in terms of better data management, easier compliance and reduced storage requirements justify these costs.

Paul Rubens is a journalist based in Marlow on Thames, England. He has been programming, tinkering and generally sitting in front of computer screens since his first encounter with a DEC PDP-11 in 1979.
