Big Data is in early maturity stages, and could learn greatly from Infosec :re: Google Flu Trend failure

The concept of analysing large data sets, crossing data sets, and seeking the emergence of new insights and better clarity is a constant pursuit of Big Data. Given the volumn of data being produced by people and computing systems, stored, and ultimately now available for analysis – there are many possible applications that have not been designed.

The challenge with any new 'science', is that the concept to application process can not always be a straight line, or a line that ends where you were hoping. The implications for business using this technology, like the use of Information Security, requires an understanding of it's possibilities and weaknesses. False positives and exaggerations were a problem of past information security times, and now the problem seems almost understated.

The quote picks up where the author is speaking about the problem of the model:

“The first sign of trouble emerged in 2009, shortly after GFT launched, when it completely missed the swine flu pandemic… it’s been wrong since August 2011. The Science article further points out that a simplistic forecasting model—a model as basic as one that predicts the temperature by looking at recent-past temperatures—would have forecasted flu better than GFT.

So in this analysis the model and the Big Data source was inaccurate. There are many cases where such events occur, and if you have ever followed the financial markets and their predictions – you see if more often wrong than right. In fact, it is a psychological (flaw) habit where we as humans do not zero in on those times that were predicted wrong, but those that were right. This is a risky proposition in anything, but it is important for us in business to focus on the causes of such weakness and not be distracted by false positives or convenient answers.

The article follows up the above conclusion with this statement relating to the result:

“In fact, GFT’s poor track record is hardly a secret to big data and GFT followers like me, and it points to a little bit of a big problem in the big data business that many of us have been discussing: Data validity is being consistently overstated. As the Harvard researchers warn: “The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

The quality of the data is challenged here for being at fault, and I would challenge that ..

The analogy is from information security where false positives and such trends were awful in the beginning and have become much better overtime. The key inputs of data and the analysis within information security is from sources that are commonly uncontrolled and certainly not the most reliable for scientific analysis. We live in a (data) dirty world, where systems are behaving as unique to the person interfacing them.

We must continue to develop tolerances in our analysis within big data and the systems we are using to seek benefit from them. This clearly must balance criticism to ensure that the source and results are true, and not an anomaly.

Of course, the counter argument .. could be: if the recommendation is to learn from information security as it has had to live in a dirty data world, should information security instead be focusing on creating “instruments designed to produce valid and reliable data amenable for scientific analysis”? Has this already occurred? At every system component?