Garbage In

Data. The new oil. In that you wouldn’t want to cover a puffin in the stuff.

The current obsession with data is predicated on one major assumption: that the data that organisations have amassed has some sort of integrity. If it doesn’t, then its value is dubious.

I have started to make a distinction between organic and synthetic data. Organic data is that which comes from naturally occurring observations: temperature recordings, weights, dimensions and maybe even things like bank balances, exchange rates or stock prices. Synthetic data, however, is the metadata that is associated with organic data in systems and requires some sort of subjective human judgement to be created. The classification of the type of spend contained in a purchase order, for example.

Why make this distinction? Well, because the synthetic data is often crucial to the exploitation of data for other things. If you can’t associate meaning with your data, then you can’t do much with it. Semantics are everything. I’m wondering if much of the synthetic metadata that exists in corporate systems is meaningless guff.

I recently did a piece of work looking at the classification of spending across constituent parts of a federated organisation. Each of the subsidiaries had its own Finance system and associated processes to manage the flow of cash in and out of the organisation. We were looking to see if we could extract that spending information from the multiple systems and then map it to a set of consistent spend categories so that we could get a picture across the whole of the group.

The theory was that if we could find the classifications that existed across each of the organisations, we could then map them, Rosetta Stone-like, to a standard schema. As we spoke to each of the organisations we started to realise that there may be a problem.

The classification systems that were in use weren’t being managed to achieve integrity of data, but instead to deliver short-term operational needs. In most cases the classification was a drop-down list in the Finance system. It hadn’t been modelled – it had just evolved over time, with new codes being added as necessary (and old ones not being removed because of previous use). Moreover, the classifications weren’t consistent. Within a single field, quite different kinds of information would be encoded side by side.

For example, if you wanted to code up a schema to classify your grocery products on a product-by-product basis by the type of foodstuff, it might look something like this:

Fruit

Vegetables

Raw meat

Cured or cooked meat

Dairy

Pasta & Rice

Now that schema looks a little bit like the aisles of a supermarket, but actually it isn’t (quite). Whilst the aisles certainly group some similar types of food together, the major grouping is the type of storage that is required for the products, something like:

Fresh produce in crates

Chilled products on shelves

Chilled products on trolleys

Ambient products on shelves

Frozen products

And then there could be other classifications required, such as whether a product was VAT-rated or not.
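To make the point concrete, here is a minimal Python sketch of what keeping those classifications apart might look like: one field per question being asked of the product. The class names, field names and the example product are all hypothetical, and the VAT value is purely illustrative rather than a statement about real tax rules.

```python
from dataclasses import dataclass
from enum import Enum


class FoodType(Enum):
    """What kind of foodstuff the product is."""
    FRUIT = "Fruit"
    VEGETABLES = "Vegetables"
    RAW_MEAT = "Raw meat"
    CURED_OR_COOKED_MEAT = "Cured or cooked meat"
    DAIRY = "Dairy"
    PASTA_AND_RICE = "Pasta & Rice"


class Storage(Enum):
    """How the product has to be stored in the shop."""
    FRESH_CRATES = "Fresh produce in crates"
    CHILLED_SHELVES = "Chilled products on shelves"
    CHILLED_TROLLEYS = "Chilled products on trolleys"
    AMBIENT_SHELVES = "Ambient products on shelves"
    FROZEN = "Frozen products"


@dataclass(frozen=True)
class Product:
    name: str
    food_type: FoodType   # answers "what is it?"
    storage: Storage      # answers "where does it live in the shop?"
    vat_rated: bool       # answers "is VAT charged?" (illustrative value only)


# Hypothetical product: each dimension is recorded separately,
# so none of them has to be squeezed into a single drop-down list.
yoghurt = Product(
    name="Plain yoghurt",
    food_type=FoodType.DAIRY,
    storage=Storage.CHILLED_SHELVES,
    vat_rated=False,
)

print(yoghurt.food_type.value, "|", yoghurt.storage.value)
```

Each field here has exactly one job, which is what makes the data mappable later on.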

What we found in the various finance systems were classification systems akin to:

Fruit

Citrus Fruit

Salad

Vegetables

VAT-Rated chilled goods

Meat

Raw Meat

Fresh products retailing at less than £10

Cheese

Other dairy

I’d like to say that I’m hamming this up (sorry, no pun intended until I realised it was one and then I went for it anyway). The reality was just as confused.

But why is this an issue? Well, because if I wanted to map things that have been classified as in that final example onto the schema that I showed first, it’s next to impossible: the codes aren’t all describing the same type of thing. Now you might at this point shout “Pareto!” and suggest that the majority of data was correctly coded and the errors would be acceptable.

But that’s assuming that the codes themselves have been used consistently. Without any clear definitions of the codes, though, that’s very hard to tell, because every individual classifying data will be doing so based on their own interpretation of what the specific codes actually mean. Did I mention that nobody had any clear, written definition of the codes?

Which is why I’m increasingly thinking that much of the synthetic metadata that exists in organisations today is probably deeply flawed. Inconsistent classification schemes, used subjectively by many different people in many different ways, and managed only for day-to-day operational needs.
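For anyone who wants to see the grocery mapping problem in miniature, here is a rough Python sketch. It tries to map each of the muddled codes onto the first food-type schema. The candidate sets are my own assumptions about what each code could plausibly cover, not definitions anybody actually wrote down (which is rather the point), and the names `CANDIDATES` and `map_code` are made up for the illustration.

```python
# The target schema: the first, food-type list from earlier.
STANDARD = [
    "Fruit",
    "Vegetables",
    "Raw meat",
    "Cured or cooked meat",
    "Dairy",
    "Pasta & Rice",
]

# For each muddled code, the standard categories it could plausibly mean.
# These sets are assumptions made for illustration only.
CANDIDATES = {
    "Fruit": {"Fruit"},
    "Citrus Fruit": {"Fruit"},
    "Salad": {"Vegetables"},
    "Vegetables": {"Vegetables"},
    # Describes tax and storage, so it says nothing about food type.
    "VAT-Rated chilled goods": set(STANDARD),
    # Too coarse: could be either kind of meat.
    "Meat": {"Raw meat", "Cured or cooked meat"},
    "Raw Meat": {"Raw meat"},
    # Describes freshness and price, so again nothing about food type.
    "Fresh products retailing at less than £10": set(STANDARD),
    "Cheese": {"Dairy"},
    "Other dairy": {"Dairy"},
}


def map_code(code):
    """Return the single standard category for a code, or None if ambiguous."""
    options = CANDIDATES[code]
    if len(options) == 1:
        return next(iter(options))
    return None


# Collect the codes that cannot be mapped to exactly one standard category.
ambiguous = [code for code in CANDIDATES if map_code(code) is None]

for code in ambiguous:
    print("Cannot map:", code)
```

Even with these generous assumptions, three of the ten codes have no single home in the standard schema, and that is before asking whether the other seven were applied consistently by the people doing the coding.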