The New World of Compliance

Even before the Millennium and the current economic crisis, regulatory compliance had entered the vocabulary of senior managers and directors in most U.S. companies. Rules and laws differ from one industry to the next, but the core requirements are generally the same. Certain business records, including those maintained as electronic data, need to be retained for specified periods of time and deleted per a formal schedule -- in some cases in a very specific manner. Other data needs to be preserved and protected from loss or corruption via a disaster recovery plan, or kept private and secure (usually via some sort of encryption) throughout its operational life. And virtually all important data needs to be "discoverable," usually within tight time frames, when requested by a legal subpoena or administrative writ or warrant.

In truth, the protection and preservation of important data assets has been a concern of corporate risk management and internal/external auditing for as long as businesses have used computers. However, it was the regulatory boom in the late 1990s that translated common-sense incentives into a real mandate for formalized information governance. Now, in the wake of the banking debacle widely perceived to be the result of ineffective regulation and poor enforcement of existing laws, dozens of new bills aimed at enhancing fiscal transparency for investors and securing privacy and confidentiality for consumers are winding their way through the legislature.

Compliance is second only to cost-containment as a front-of-mind issue for most companies today. Developing a comprehensive and sustainable strategy for compliance requires more than the hasty adoption of meta-strategies like International Organization for Standardization (ISO) standards, the Information Technology Infrastructure Library (ITIL), service-oriented architecture (SOA) or Six Sigma. It requires a real data-management approach on a level not practiced in most companies up to this point. That in turn requires an equally unprecedented measure of cooperation between the front office (where management lives) and the back office (where IT lives) -- a knotty undertaking given the rift between the two groups that has developed over the years in too many firms.

Assuming that the divide between business and IT can be bridged, defining a sustainable strategy for compliance is still challenging work. Essentially, organizations need to do four things. First, they must sort their data storage "junk drawer" and classify data assets according to regulatory requirements and other business criteria. Second, they must map their data assets to existing infrastructure resources and services to understand how data is currently being hosted and how it is preserved and protected. Third, they must develop business process-focused and data-centric policies for data movement across infrastructure, triggers for activating those policies, and controls for monitoring both data movements and accesses made to data over time. Finally, they need to instrument infrastructure to actually move data per policy and to deliver ongoing management and access monitoring.

Taken together, these four tasks are the essence of real information lifecycle management (ILM) -- not the ILM proffered a few years ago in the marketing literature of most technology vendors, but the real deal. Getting the strategy right is tricky.

What to Move

First and foremost, compliance requires a clear understanding of data assets within their business context, including the regulations that affect the data. Simply put, you need to know what the data is before you can provision the right resources and services to it.

For some data, the process is comparatively easy. The output of databases and content-management systems, for example, is structured in a manner that lends itself to classification and policy-based hosting and control.

This doesn't mean that this data is well managed today. Years of neglect have created a lot of data sprawl in many companies, even in structured data repositories. But database-archiving products like FileTek's StorHouse and database-extraction products like HP StorageWorks RIM for Databases (formerly OuterBay), IBM Softech Optim Archive (formerly Princeton Softech) and Grid-Tools' Data Archive all provide a way to get back to some level of database segregation as a precursor to intelligent management and archiving. Using these tools, companies can extract database data and store it in a manner that conforms to regulatory requirements for retention, deletion, preservation and protection.

Most Enterprise Content Management (ECM) systems also have tools that, if selectively applied, do a good job of returning a semblance of sanity and policy-based classification to the assets under their control. This point is key to the sales pitch of most ECM vendors when they argue that customers should replace productivity software -- Microsoft Office, for example -- with their ECM wares. They argue that the structure, meaning the indexing and workflow, imposed by ECM makes short work of user data classification and management, fixing the large and growing problem of corralling "unstructured or semi-structured" data (user files and e-mail) that predominates stored files today.

The characterization of productivity software data and e-mail as unstructured is a bit of a misnomer given the high degree of structure afforded by most e-mail and file systems today, but it has become part of the popular vernacular to describe user-controlled data. By conservative estimates, user files and e-mail collectively constitute the largest percentage of all data produced by companies today: more than 65 percent, or more than 70 percent, depending on the analyst you read. Applying some sort of policy-based classification to this data is a bit more daunting than doing so for database output or ECM workflow files, and segregating this data for management via policy has been likened to herding goats.

Improvements have been made by e-mail archive solution providers to supplement native Exchange mail archive techniques supported by Microsoft. Mimosa Systems Inc., CA Inc. and dozens of other vendors provide good tools for applying classification policies to individual e-mails and attachments, and to e-mail threads or conversations.

User files are another story -- perhaps the most challenging one. The core problem is that users control the naming of their own files, and the names may provide little guidance about the actual content of the document or its meaning within a business or regulatory context.

Typically, users aren't aware of the regulatory requirements associated with their spreadsheets, word processing files or other productivity file output. Moreover, users tend to resist any sort of manual procedure for classifying their own documents. Numerous examples can be culled from case studies to demonstrate that user cooperation with any sort of uniform file naming scheme -- usually perceived as adding annoying steps to otherwise simple file-save procedures -- is rarely forthcoming.

Placed on a spectrum of options, the alternatives to self-classification by those who create files range from global namespace schemes (creating file folders to serve as repositories for certain types of documents and asking users to place certain files into these folders on a routine basis) to "role based" classification (classifying all files created by a specific user based on his or her role at the company) to "deep blue math" techniques that involve the application of algorithms to user files that search, index and evaluate file contents to determine the appropriate classification to apply to each one.

The latter two options have the advantage of requiring no end user participation in the data-classification scheme. That said, the challenges to implementing automated role-based or deep blue math techniques have traditionally been the lack of granularity in the selection process and the creation of false positives in the resulting data set.

An outstanding role-based tool in the Novell world that, at present, is being ported to fully support Microsoft environments is the Novell Storage Manager (NSM). Using NSM, user roles can be leveraged to define policies for handling the data produced by the user, and some management granularity, mostly exclusions, can be achieved by using file extensions to omit certain files (MP3s, JPEGs or movie clips) from being included in classes meant to apply only to relevant work files. Trusted Edge, from FileTek Inc., also provides some similar capabilities.
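
To make the idea concrete, here is a minimal Python sketch of role-based classification with extension-based exclusions. It is not NSM's or Trusted Edge's interface; the role names, class names, extension list and the classify function are all hypothetical and chosen only for illustration.

```python
from pathlib import PurePath

# Hypothetical role-to-class mapping. Real mappings would come from the
# business, legal and audit stakeholders, not from IT alone.
ROLE_CLASSES = {
    "accountant": "financial-records",
    "hr-manager": "personnel-records",
}

# Hypothetical list of extensions kept out of work-file classes.
EXCLUDED_EXTENSIONS = {".mp3", ".jpg", ".jpeg", ".mov", ".avi"}


def classify(filename: str, owner_role: str) -> str:
    """Return a data class for a file based on the role of its owner."""
    extension = PurePath(filename).suffix.lower()
    if extension in EXCLUDED_EXTENSIONS:
        # Media files are omitted regardless of who owns them.
        return "excluded"
    # Roles without a mapping fall through to a catch-all class.
    return ROLE_CLASSES.get(owner_role, "unclassified")
```

The user never has to rename or tag anything, which is the appeal of the approach. The sketch also shows its weakness: every non-excluded file an accountant saves lands in the same class, whether or not it is actually a financial record.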

CA has been cultivating an excellent suite of offerings in this space, including CA Message Manager (based on e-mail-management technology obtained through its acquisition of Ilumin), CA File System Manager and CA Records Manager (based on records management wares obtained through its acquisition of MDY). Considerable work has been done to integrate these products into a comprehensive information-governance capability for larger enterprises.

Until recently, I would have dismissed most of the wares claiming to provide deep blue math classification capabilities as non-starters. Most of these tools are actually search engines that scan repositories of files before ranking the files based on simplistic keyword hits or juxtapositions of certain words within the file. The number of false-positive results -- files that didn't fit within a certain class or category -- is generally so large that a cadre of human editors is needed to achieve any real granularity in creating meaningful classes of data from the output of the search.

However, a recent visit to Digital Reef Inc., a Boxborough, Mass.-based start-up, renewed my hope for a transparent and automated approach to file classification based on algorithmic sorting. CEO Steve Akers explained that his company's success in an area where so many others had failed was due to the fact that he didn't pay much attention to how others had tried to achieve the goal. No one told Akers that it couldn't be done. Although it is positioned mainly as an e-discovery tool for litigation, Digital Reef's technology has much broader application for data classification and indexing that the developers have yet to fully explore.

Where It Lives

Data classification is a key element of compliance, but creating a management strategy also requires a careful examination of how data is hosted on infrastructure today. The idea of nice, neat storage may be communicated by shiny, rack-mounted storage rigs placed into uniform racks interconnected by a switch to servers in other racks -- what the industry calls a storage area network (SAN) -- but looking at infrastructure from a data perspective tells a very different story.

In many companies, there's little direct correlation between data and its placement on hardware. Storage arrays tend to be isolated islands of capacity with underlying disk drives stovepiped behind proprietary array controllers operating proprietary "value-add" software. This software is intended, more than anything else, to lock in the consumer and lock out the competition. Managing across different vendor-branded hardware is next to impossible, though some management tools from companies like Tek-Tools Software and data traffic-analysis wares from Virtual Instruments can help.

When SANs were first introduced in the late 1990s, vendors suggested that they were pools of storage capacity that could "bring a drink of storage to every server-hosted application regardless of brand." Companies proceeded to fill up these pools with undifferentiated data, creating a huge junk drawer. Today, just finding where the data belonging to a specific application or end user is located in the infrastructure is tough. Finding out if the data is receiving the right services for preservation, protection and privacy is an even more daunting task.

The "storage junk drawer" phenomenon is reflected in statistics presented by many industry analysts regarding storage capacity allocation and utilization inefficiency. For example, after normalizing more than 10,000 storage assessments performed by my company and Sun Microsystems Inc.'s Sun StorageTek storage-services group, we found that, on average, less than 30 percent of the data occupying space on disk spindles was useful data accessed regularly by organizations. Another 40 percent of space on the average disk drive contained data that was not re-referenced, but needed to be retained for compliance or business reasons. (This kind of data should be archived and moved from spinning rust to more energy-efficient tape or optical storage.) The remaining 30 percent of disk space was a combination of orphan data (data whose owner no longer exists at the company), capacity held in reserve by storage vendors and contraband data (bootleg videos, MP3s, etc.). This suggests that, with a bit of data hygiene, classification and archiving, companies can recover up to 70 percent of the capacity of every disk drive they own, deferring the need for and cost of deploying more capacity.
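
The arithmetic behind that estimate can be sketched in a few lines of Python. The 40 percent and 30 percent defaults are the averages quoted above; the function name and the 100 TB figure in the usage note are assumptions for illustration, and your own assessment numbers will differ.

```python
def reclaimable_capacity(total_tb: float,
                         archivable_pct: int = 40,
                         waste_pct: int = 30) -> dict:
    """Estimate how much disk capacity hygiene and archiving could free.

    archivable_pct: share of disk holding retained but unreferenced data
                    that could move to tape or optical storage.
    waste_pct:      share holding orphan data, vendor reserve and
                    contraband files.
    """
    archive_tb = total_tb * archivable_pct / 100
    waste_tb = total_tb * waste_pct / 100
    return {
        "archive_tb": archive_tb,
        "waste_tb": waste_tb,
        "reclaimed_tb": archive_tb + waste_tb,
        "active_tb": total_tb - archive_tb - waste_tb,
    }
```

Called with 100 TB of deployed disk and the default shares, the estimate is 40 TB to archive and 30 TB of waste, for 70 TB reclaimed in the best case. Treat that as an upper bound: the waste share includes capacity held in reserve by storage vendors, which may not be recoverable in practice.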

Storage reclamation is one huge side benefit of compliance-related analysis and classification of data assets, especially given the cost-containment challenge that most companies confront today. But success here, again, requires that we understand what the data is and what services it requires.

With stovepipe arrays in a SAN, the services used to protect and preserve data assets are embedded on array controllers. That makes developing coherent policies for managing data flows in and out of this infrastructure -- so that data is exposed to the right set of services as set by compliance requirements and other business factors -- extraordinarily difficult.

Implementing a SOA for storage simplifies matters considerably. This requires adopting a "deconstructionalist" approach to building storage: Using bare-bones arrays whose controllers provide only services that are required to ensure the reliability of the platform. At the same time, other value-add services (encryption, de-duplication, etc.) must be exposed as shared services hosted either on routers (Crossroads Systems is a known example), on servers as software (third-party backup software is an example) or in a storage virtualization software layer such as that from DataCore Software Corp. or FalconStor Software. Separating software services from hardware resources provides a more effective basis for developing policies that route classes of data through the sets of services they require on their way to the storage resources where they reside.

For such a deconstructionalist approach to building infrastructure to work, the services and resources need to be managed in a coherent and holistic way, reporting their status and capabilities on an ongoing basis to a broker that provides dynamic routing of data based on parameters specified in governance policies. The good news is that World Wide Web Consortium (W3C) Web Services standards already provide the essential ingredients for creating such an infrastructure. Xiotech Corp. is among the first storage-hardware vendors to adopt these standards on its gear, and to instrument its hardware platform for Web services-based management and integration with third-party software services.

Whether other hardware vendors will follow Xiotech's lead is an interesting question. The storage industry has demonstrated little inclination to enable its products to be managed in common. Initiatives such as the Storage Management Initiative Specification (SMI-S) within the Storage Networking Industry Association (SNIA) have met with little success. But three factors are encouraging: the adoption of Web services standards by most application and operating system vendors; the enormous success of Xiotech in selling its Intelligent Storage Element platform (especially in a difficult economy); and burgeoning regulatory requirements that will require data to be exposed to, and in some cases excluded from, certain storage services.

Case in point: One large financial company is classifying its data assets to ensure that those files required by the U.S. Securities and Exchange Commission (SEC) are excluded from de-duplication services. De-duplication -- essentially a set of techniques offered by numerous vendors for describing data with fewer bits in order to squeeze more data into the same amount of disk space -- is increasingly used by companies to economize on storage space. However, the SEC requires a "full and unaltered copy" of data to be provided by corporate management, and this financial firm is not convinced by vendor rhetoric that de-duplicated data complies with this requirement. Hence, a policy needs to be developed that will exclude certain data from a de-duplication service, and this is a harbinger of more data-management requirements to come.
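
A policy of this kind can be sketched as a simple exclusion table in Python. The service names, the "sec-regulated" class and the services_for function below are hypothetical; they illustrate the routing idea and do not represent any vendor's policy engine.

```python
from typing import List

# Hypothetical ordered chain of shared storage services that data
# passes through by default on its way to disk.
DEFAULT_SERVICES = ["deduplicate", "encrypt", "replicate"]

# Hypothetical governance policy: services each data class must bypass.
EXCLUDED_SERVICES = {
    "sec-regulated": {"deduplicate"},
}


def services_for(data_class: str) -> List[str]:
    """Return the ordered services a data class should be routed through."""
    excluded = EXCLUDED_SERVICES.get(data_class, set())
    return [name for name in DEFAULT_SERVICES if name not in excluded]
```

Under these assumptions, files in the "sec-regulated" class are still encrypted and replicated but never pass through de-duplication, while every other class receives the full chain. A table like this only works when services are separable from the array, which is the argument for the shared-services design described earlier.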

Defining Targets

Of course, redesigning storage infrastructure to accommodate policy-based compliance with regulatory requirements won't happen overnight -- especially not in an economy that doesn't support "rip and replace" strategies. However, defining a vision for a manageable, purpose-built SOA needs to be done in any case. With such a vision, plans can be drawn up for migrating to a SOA in phases and as part of the normal hardware refresh cycle over the next five to seven years. Compliance could become a powerful driver for such change.

For now, it's important to know what hardware and software assets are available in the infrastructure so appropriate targets can be defined for data. In other words, you need to determine where data classes will reside. This requires an inventory of technology assets and capabilities and the matching of these resources to the compliance requirements of the data itself.

It will also require the consideration of the way that data is currently used. In some cases, leaving data where it is and managing it via a "federated" approach is preferred. In other cases, companies seek to move compliance-related data into a shared, centralized repository. Most intelligent archive products enable you to mix the two: Data assets may reside on a local hard disk accessed by the user, but a copy may be placed into a centrally managed archive that's updated whenever the original file is changed.
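
The mixed approach can be illustrated with a small Python sketch that refreshes the archived copy only when the original's content changes. The in-memory dictionary stands in for a centrally managed archive, and the function names are hypothetical; commercial archive products use their own change-detection mechanisms.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the SHA-256 fingerprint of a file's content."""
    return hashlib.sha256(content).hexdigest()


def sync_to_archive(name: str, content: bytes, archive: dict) -> bool:
    """Copy content into the archive if it is new or has changed.

    Returns True when the archive copy was written or refreshed and
    False when the archived copy already matches the original.
    """
    new_digest = digest(content)
    entry = archive.get(name)
    if entry is not None and entry["sha256"] == new_digest:
        return False
    # Store the fingerprint with the copy so later runs can compare cheaply.
    archive[name] = {"sha256": new_digest, "content": content}
    return True
```

The original stays on the user's local disk, and only new or altered files generate archive traffic. Note that this sketch overwrites the previous archived copy; a retention policy that requires earlier versions to be kept would need to store each version instead.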

In the final analysis, the best products are the ones that fit the company's culture and budget. Nonetheless, there are a few pitfalls you want to avoid.

First, eschew stovepipe archives: Buying software and a server and storage as a "one-stop shop" has its appeal, but can drive up costs while creating headaches later on. There are a number of such one-size-fits-most archival products on the market that are the equivalent of data roach motels. Some vendors have open application-programming interfaces that enable data to be ingested from any source, but you need to buy additional software -- at a huge cost -- if you ever want to migrate data off their platform.

A better approach is to favor software-based archiving over a hardware-centric approach. Archive providers like QStar Technologies and KOM Networks support a broad range of implementation options and can leverage your existing storage for building a centralized approach. If you can, choose best-of-breed software for extracting and copying compliance-related data into separate archive images, then combine these images into a general archive that can then be backed up, encrypted, etc. This will allow you to fulfill most of your requirements without breaking the bank.

Second, keep admin labor costs to a minimum. To avoid having to dedicate a separate cadre of administrators to each archive product, consider buying a Manager of Managers (MOM) from a vendor such as BridgeHead Software Ltd. MOMs let you normalize the policy engines of many archive products into a single console so you can set policies and administer archives controlled by different vendor software through a common interface.

Third, keep it simple. Work with the business stakeholder, the legal, audit or risk-management department, and IT folks to define data assets, regulatory requirements and resource targets. It's best to conduct interviews separately at first to gather information and insights. But, ultimately, approving policies and managing change requires group participation. Periodic reviews and change management are must-haves.

Fourth, don't be distracted by hype. Storage hardware vendors are adding new bells and whistles to their products every six months or so. The latest buzz is on-array tiering, which is cast as ILM in a box. Non-granular data movements between different types of storage media may be useful from a capacity-management standpoint, but are of little value from a compliance standpoint -- and don't solve the problem of granular data management.

As for software, avoid "comprehensive" data management software products. They don't exist. ECM is not a valid replacement for productivity software any more than Hierarchical Storage Management or backup is a replacement for real ILM. Try the various software tools that are available to perform discrete management tasks, but don't commit to any technology that doesn't fit your general approach or seeks to bend your workflows into the vendor's template.

Finally, don't try to do it all at once. Associating data-asset classes with policies on data management is a tough job. Divide and conquer is the rule. Just about every vendor and consumer interviewed for this article gave the same advice: Start with one business process at a time, and, in some cases, with the data assets from a single business workflow.

Practicing compliance, in the final analysis, comes down to common sense. The benefits can be huge -- not only from a risk-reduction standpoint, but also from the standpoint of cost-containment and improved productivity. And because the best strategies don't require a lot of capital expense, compliance may be accomplished reasonably using hardware and labor you already have on hand.