Unstructured data: nail it - then mine it

Unstructured data is growing at an alarming rate - some say it now poses a serious threat to the efficacy of enterprise IT; but properly handled, can it also yield valuable information?

An IT professional from the 1980s propelled through a time warp into 2010 would be aghast at the state of enterprise data management, with all the hard-won order seemingly cast away in a climate of free-data user-power. The rise of the PC and departmental LAN alongside office productivity applications like word processing and spreadsheets liberated/unleashed (delete as preferred) the phenomenon of user-driven data generation. Now anyone could make data on demand - without need for permissions or privileges. And the first thing users did was to start making just-on-the-safe-side duplicates, local backups and interim draft copies, which they never went back to delete once they were no longer needed.

The effect took years to unfold. The arrival of inexpensive hard disc drives meant that one of the constraints on unstructured data - storage limitation - dissipated as a restraining force. These new data types did not interfere with the core enterprise systems but gradually expanded in importance to reach the current situation where, in many cases, they have become just as mission critical as the core transactional systems. Now some 40 per cent of sensitive data is unstructured, according to technology analyst the Aberdeen Group.

This development is creating a problem for the IT function that is broadly threefold. First, the volumes of unstructured data are growing at an uncontrolled and largely unpredictable rate: the data needs to be stored and, in the absence of a more effective solution, more and more storage capacity is being thrown at the problem; this issue also makes it tricky to budget for future storage needs.

Second, a lot of unstructured data is not needed - copies of copies of copies - and to alleviate this problem, resource-hungry data de-duplication processes have to be applied.
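The 'copies of copies' problem can be illustrated with a minimal sketch of content-based de-duplication, in which files are grouped by a hash of their contents so that identical copies surface regardless of file name. The file names and contents here are hypothetical, and commercial de-duplication tools typically work at block level and at far greater scale than this.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group file names by the SHA-256 digest of their content.

    `files` maps a (hypothetical) file name to its raw bytes.
    Only groups containing more than one file are returned.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: three identical copies and one distinct file.
sample = {
    "budget.xls": b"Q1 figures",
    "budget_copy.xls": b"Q1 figures",
    "budget_final.xls": b"Q1 figures",
    "minutes.doc": b"Meeting notes",
}

duplicate_groups = find_duplicates(sample)
print(duplicate_groups)
```

Hashing every byte of every file is one reason such processes are resource-hungry: the cost grows with the total volume of data scanned, not with the number of duplicates found.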

Third, unstructured data increasingly contains critical business information important to an organisation's livelihood; and tracking it, or locating archive data for legal purposes, is proving a daunting challenge for CIOs and other branches of IT governance. This transformation has been most dramatic in the case of email, which for most enterprise staff remained a fringe activity until a decade ago and yet is now the fulcrum of many professional tasks. Usage figures give half the story; it's not so much the amount of 'pure' email that's being generated - that, arguably, qualifies as structured data - it's all the stuff that's being attached to it or embedded into it.

'Almost 90 per cent of workers now use email at least weekly, and 57 per cent use it hourly,' says Matt Brown, research director at analyst group Forrester. Email files tend to be small but high value and heavily used. 'By contrast, videoconference files are used by only 8 per cent of workers even on a monthly basis and are not often accessed, but the file size is enormous,' says Brown, making the point that the various unstructured file types pose different issues.

Brown points out that 'structure' is a relative concept in the context of data arrangement: 'In reality there is a continuum. There is the structured data world of relational database systems, data "warehouses" and data "marts", comprising fielded, highly typed, often transactional data. At the other end there is highly unstructured data, such as video content for training, teleconferencing and audio files such as iTunes and MP3 - increasingly we see a lot of what's generated out of social media.'

Brown continues: 'Then there is a lot of data with some structure such as .doc files, .pdfs, and .ppts (Microsoft PowerPoint format), which have date stamp and author, but where much of the [native and imported] content [in terms of format and size] is unpredictable.' A more recent - and growing - file format that Brown describes as coming in the middle is XML: this comprises free text, but with an extra structural component in the tags and markers that define the size and nature of the content.
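To show what 'coming in the middle' means in practice, the following sketch parses a small XML memo with Python's standard library. The memo and its tag names are invented for illustration: the tags give a program reliable fields to work with, while the body remains free text.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo: the tags are structured, the body text is not.
doc = """<memo>
  <author>J. Smith</author>
  <date>2010-03-01</date>
  <body>Please review the attached draft before Friday.</body>
</memo>"""

root = ET.fromstring(doc)

# Each child element becomes a named field that can be queried directly.
fields = {child.tag: child.text for child in root}

print(fields["author"])  # structured: a known, labelled value
print(fields["body"])    # unstructured: free text of arbitrary content
```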

While the highly unstructured content such as video promises the most explosive growth in the future, many of the most pressing current issues revolve around the semi-structured data in the middle, with email attracting greatest attention. For one thing this data is still increasing in volume by 20-40 per cent a year according to Forrester, driving demand for de-duplication tools to mitigate this, as well as to restore some order and control. The data itself is only part of the problem, for there is also a lack of structure in the surrounding 'ecosystem' of software tools and components that provide additional functions omitted from the core platforms.
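The budgeting difficulty behind those growth figures can be made concrete with a short compound-growth calculation. This is a rough sketch only: it assumes a constant annual rate and a hypothetical starting volume of 100 terabytes, neither of which comes from Forrester.

```python
import math


def projected_volume(start, annual_rate, years):
    """Volume after `years` of compound growth at `annual_rate` (0.2 = 20%)."""
    return start * (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for the volume to double at a constant annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


start_tb = 100  # hypothetical starting volume in terabytes

for rate in (0.20, 0.40):
    after_five = projected_volume(start_tb, rate, 5)
    print(f"{rate:.0%} a year: {after_five:.0f} TB after 5 years, "
          f"doubling every {doubling_time(rate):.1f} years")
```

Under these assumptions the five-year requirement lies anywhere between roughly 2.5 and 5.4 times the starting volume, which is why the spread between 20 and 40 per cent matters so much to a storage budget.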

'On top of the core mail systems are a lot of outside technologies like spam filtering, archiving, authoring and discovery so that users can understand what is inside the systems,' says Brown. Some vendors of these external systems have themselves tried to unite the structured data world (see 'Taming The Hydra', pg 54), but inevitably any such product is itself proprietary.

Legal requirements

In the case of email, and also the associated attachments, enterprises have come under growing pressure from regulatory compliance requirements to ensure that documents and messages relating to specific cases are stored securely and are readily accessible on demand should the need arise. Voicemails and instant messages are becoming important in this respect. A number of recent high-profile cases - such as BSkyB versus HP-EDS - have indicated that major enterprises (at least until recently) lacked tools to do this, the problem often not being deletion, but the fact that critical data is not readily locatable within given (and affordable) time constraints.

Yet some fraud and other cases have highlighted the power inherent in unstructured data, which contains valuable information precisely because its lack of form yields crucial details of interactions and communications between protagonists. 'In the Enron case they found it was the content-free emails that were suspicious, with messages like "meet me in the usual place",' relates IDC's research VP for search and discovery technologies Susan Feldman. But finding such email trails without exhaustive investigation requires more intelligent text analytic tools than have been available to date, seeking, for example, patterns of communication that contain unusual word combinations or that cut across typical hierarchical or departmental lines within the organisation.
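One of those patterns, communication that cuts across departmental lines, can be sketched in a few lines. The names, departments and message log below are entirely hypothetical, and a real analytic tool would weigh many more signals (timing, wording, frequency baselines) than this simple count.

```python
from collections import Counter

# Hypothetical directory mapping each employee to a department.
department = {
    "alice": "finance",
    "bob": "finance",
    "carol": "trading",
    "dave": "legal",
}

# Hypothetical message log as (sender, recipient) pairs.
messages = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("carol", "dave"),
]


def cross_department_pairs(messages, department):
    """Count messages whose sender and recipient sit in different departments."""
    counts = Counter()
    for sender, recipient in messages:
        if department[sender] != department[recipient]:
            counts[(sender, recipient)] += 1
    return counts


unusual = cross_department_pairs(messages, department)
print(unusual)
```

Pairs flagged this way are only candidates for closer inspection; plenty of legitimate work crosses departmental boundaries.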

For these to be allowed free rein, the rigid data structuring imposed by applications may prove obstructive; this is not necessarily to suggest that interrogating unstructured data is more likely to yield useful results. Unstructured data represents a problem for IT administrators every day, whereas a need to search through it for specific content may only arise occasionally.

Data mining

Emails and other forms of communications also contain information that can be mined for a more positive business advantage. This is also driving development of advanced discovery tools beyond content search of individual messages or documents. The fast-moving nature of business deals often means that processes are not always formally captured and documented. Valuable trends can evolve unconsciously and unwittingly - but you have to know where (and how) to look.

'We're realising that the value of email is extremely high because it contains the future of the business,' says IDC's Susan Feldman. 'It contains the deals that have not been done and the contacts that you didn't put in your contacts list because you weren't sure whether they were valuable or not. Email trails of the contracts that were closed are also valuable because they tell you who decided what and why you decided it.'

There is also valuable information in logs of website activity and from incomplete online transactions, which can be used to improve e-commerce processes and make it more likely that sales are won. More generally, the unstructured data generated by transactions and communications of all types is a fount of information about attitude and sentiment as well as more tangible knowledge about business processes and customers, says Susan Feldman at IDC: 'It contains the information you don't know you want to know.'

This in itself is a challenge, of course. Apart from the need for more subtle value discovery tools, it raises the dilemma of just what data can safely be deleted and when, or which can be relegated to lower storage tiers for retention where it no longer participates actively in discovery applications. A snag here, according to Steve Legg, storage CTO at IBM, is the lack of adequate metadata describing the content accurately, enabling its value to be assessed, preferably automatically. Loosely defined, metadata is data about data, and can be stored and managed in a database, often called a registry or repository. The usefulness of metadata is becoming increasingly recognised; but unstructured data, by its nature, does not usually have its metadata fields filled in a formalised way.

'This is a serious problem for unstructured data since it limits the opportunity to deliver effective policy-based management, and leads to a "keep everything forever" approach, rather than a more selective view of what is really needed,' Legg observes. 'Keeping everything forever works for art, but not every [scrap of enterprise data] that has ever been created warrants preservation.'
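The link between missing metadata and 'keep everything forever' can be seen in a minimal sketch of policy-based management. The metadata fields, thresholds and action names are assumptions made for illustration, not a description of any IBM product: the point is simply that a policy can only archive or delete what it can describe, and must fall back to keeping the rest.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FileMetadata:
    """Hypothetical metadata record; None means the field was never filled in."""
    name: str
    last_accessed: Optional[date] = None
    legal_hold: Optional[bool] = None


def retention_action(meta, today, archive_after_days=365, delete_after_days=7 * 365):
    """Return 'keep', 'archive' or 'delete' for one file under an assumed policy."""
    # Without metadata the policy cannot judge value, so it must keep the file.
    if meta.last_accessed is None or meta.legal_hold is None:
        return "keep"
    # Anything needed for legal purposes is retained regardless of age.
    if meta.legal_hold:
        return "keep"
    idle_days = (today - meta.last_accessed).days
    if idle_days >= delete_after_days:
        return "delete"
    if idle_days >= archive_after_days:
        return "archive"  # relegate to a lower storage tier
    return "keep"


today = date(2010, 6, 1)
files = [
    FileMetadata("draft_v3.doc", date(2010, 1, 1), False),
    FileMetadata("old_report.pdf", date(2008, 1, 1), False),
    FileMetadata("ancient_copy.xls", date(2001, 1, 1), False),
    FileMetadata("contract.pdf", date(2001, 1, 1), True),
    FileMetadata("untitled.ppt"),  # no metadata at all
]

actions = {f.name: retention_action(f, today) for f in files}
print(actions)
```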

Vendors are on the case, meanwhile, fully seized of the issues and developing solutions for metadata, audit control, security and discovery, but it will take some time to bring unstructured data under control, especially as it continues to expand in scope.