The Big Data Fallacy

The latest issue of Foreign Affairs features the cover article “The Rise of Big Data” by Kenneth Cukier and Viktor Mayer-Schoenberger, which mostly details some of the incredible ways companies like UPS, Google and Apple have come to rely on vast arrays of numbers in order to run their businesses better. But data has always posed a problem: it offers an apparent certainty that tends to foster overconfidence in those relying on it. The article attempts to address this:

“[K]nowing the causes behind things is desirable. The problem is that causes are often extremely hard to figure out… Behavioural economics has shown that humans are conditioned to see causes even where none exist. So we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just have to let the data speak.”

The sentiment here is admirable, and the context perceptive. But the final part of the quotation, that “we just have to let the data speak”, assumes wrongly that data can speak objectively, that there is a fundamental ‘truth’ in a number. All too often, though, the wrong things are measured, or not all variables are measured. What data does not record, or worse, cannot record, is easily overlooked. While data is ostensibly there to help build models and predict future trends and movements, it sometimes leads to a very narrow view of one particular future, and fails to account for possibilities that, though unlikely, could be devastating. This is what Nassim Nicholas Taleb writes about in his seminal, if at times unreadable, work The Black Swan. The fictional, paranoid loner Fox Mulder of the hit series The X-Files had it right fifteen years ago when he lamented “in a universe of infinite possibilities, we may find ourselves at the mercy of anyone or anything that cannot be programmed, categorised or easily referenced”. The financial system before 2008 was a victim of such narrow thinking.

Hendrik Hertzberg, in his Talk of the Town column “Preventive Measures” in this week’s The New Yorker, drew an adroit analogy between the 2002 film Minority Report and our quest to categorise and predict acts of crime. Hertzberg points out that in reality predicting a crime “turns out to be a good deal more difficult than investigating such an act once it occurs”. Indeed, such prediction methods are being implemented, just with somewhat less efficacy than in the Tom Cruise movie. The stop-and-frisk procedure currently employed by the New York Police Department points to a sustained effort to engage in preventative measures to reduce crime, effectively what Cruise and his myrmidons were doing, albeit without the help of psychic imagery as in the film. While the psychic “Pre-Cogs” turned out to occasionally disagree, the success rate with stop-and-frisk is even less attractive. “In the final months of 2012”, writes the New York Times, only 4% of stops resulted in an arrest. But what is this low figure telling us…?

Hertzberg also alludes to the dilemma of mountains of data produced without concern for oversight or management: more is produced simply because it is possible to produce it, with little thought given to the implications:

“This fall, the National Security Agency, the largest and most opaque component of the counter-terrorism behemoth, will [open] a billion-dollar facility [analysing] intercepted telecommunications… each of the Utah Data Center’s two hundred (at most) professionals will be responsible for reviewing five hundred billion terabytes of information each year, the equivalent of twenty-three million years’ worth of Blu-ray DVDs… that’s a lot of overtime.”

The other problem this data poses – and increasingly this goes for many industries that are jumping on the Big Data bandwagon – is that intelligence departments and businesses alike are now technically able to attach quantifiable targets and figures to what they want to achieve, without considering whether such targets are actually applicable. Police claim the low stop-to-arrest ratio implies that they are preventing crimes by stopping someone before they act. There is nothing in the figure itself to argue otherwise. The New York Times article alludes to the debate over what ratio or percentage the Supreme Court would be comfortable with under the tenet of “reasonable suspicion”. This leads down a dangerous path where we treat data as an answer to a question, rather than as supporting evidence for an answer.