AnalystWatch: Data - true or false?

June 2008

The manipulation of data in an attempt to look into the future is not a product of the information age; technology just makes it easier.

However, the issue that should really be addressed is whether 'easier' translates directly into 'more accurate'.

There appears to be an almost implicit acceptance that computing power lets us find correlations that would have been impossible to find in days gone by, and that the greater the power, or the easier the interface is to use, the more accurate the results will be.

All organisations recognise that forecasting, in its many guises, is essential to agility. The ability to use data and to form relationships between data elements is a way of keeping one step ahead of the game.

Traditionally, data warehouse technologies were seen as an effective way of creating result sets from data held within an organisation.

However, the limitations of such technology are also well documented. The greatest was that the expected correlations had to be pre-defined. As a result, data warehouses served almost as proof points for existing expectations rather than as a true way to discover underlying causal relationships.

Greater computing power now means that users can manipulate data in near real time and run multiple what-if scenarios across a large and complex data set. The end result, so the argument goes, is the ability to uncover causal relationships that might otherwise never have been considered.

Uncovering causal relationships is more than simply manipulating data until some perceived benefit is achieved. Manipulating data in that way is even less effective than using data warehousing technology, as it can create perceived causality where none really exists.
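How easily perceived causality can arise from nothing is worth a brief illustration. The sketch below is an assumption-laden toy, not a description of any particular product: it compares one random 'target' series (say, twelve months of a business measure) with a growing number of equally random, unrelated candidate series and reports the strongest correlation found. The function names pearson and best_chance_correlation, and all the parameters, are hypothetical and chosen only for this example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_candidates, n_points, seed=0):
    """Strongest absolute correlation between a random target series and
    n_candidates unrelated random series (hypothetical helper)."""
    rng = random.Random(seed)
    # The target is pure noise: nothing below can genuinely explain it.
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise, unrelated to the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # The more series we try, the stronger the best accidental match.
    for n_candidates in (1, 10, 100, 1000):
        strongest = best_chance_correlation(n_candidates, n_points=12)
        print(n_candidates, round(strongest, 2))
```

Because every series here is noise, any strong correlation the search turns up is an artefact of how many comparisons were tried, which is exactly the trap of manipulating data until something appears.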

There are other downsides to providing this enhanced data manipulation ability. Not everyone likely to use this type of solution will be trained in proper statistical techniques, which leads to the real possibility of simply back-fitting result sets: adjusting the analysis until it matches the data already in hand.
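Back-fitting can be sketched in the same spirit. In the toy example below, which rests entirely on assumed numbers, the outcomes are coin flips and each 'rule' is just a random sequence of guesses. The rule that best fits the historical periods is selected and then scored on later periods it has never seen. The name backfit_demo and its parameters are hypothetical.

```python
import random


def backfit_demo(n_rules=500, n_history=20, n_future=20, seed=1):
    """Pick the rule that best fits past outcomes, then score it on
    future outcomes. Returns (in_sample_accuracy, out_of_sample_accuracy)."""
    rng = random.Random(seed)
    total = n_history + n_future
    # Outcomes are coin flips, so no rule has real predictive power.
    outcomes = [rng.choice((0, 1)) for _ in range(total)]
    # Each rule is a fixed, random sequence of predictions.
    rules = [[rng.choice((0, 1)) for _ in range(total)]
             for _ in range(n_rules)]

    def accuracy(rule, start, stop):
        hits = sum(rule[i] == outcomes[i] for i in range(start, stop))
        return hits / (stop - start)

    # Back-fitting: keep whichever rule happened to match history best.
    best = max(rules, key=lambda rule: accuracy(rule, 0, n_history))
    return accuracy(best, 0, n_history), accuracy(best, n_history, total)


if __name__ == "__main__":
    in_sample, out_of_sample = backfit_demo()
    print("fit to history:", in_sample)
    print("on new periods:", out_of_sample)
```

The selected rule looks impressive against the history it was chosen on, yet on fresh periods it should do no better than chance on average, because the apparent skill came from the selection rather than from any real relationship.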

This type of technology can also incorporate external sources and unstructured data, which raises the real danger that data manipulation will become an end in itself rather than a means to an end.

There is no doubt that the new technologies that allow this level of data manipulation are of benefit, but they should be used within a frame of reference that defines the intended end point.

They should be used by experienced people trained in proper statistical techniques, and the generated result sets should also be examined for common sense. We then have the three 'Cs' of good data usage: correlation, causality, and common sense.