The Problem with Big Data: Lies, Damned Lies, and Statistics

I’ve used this subtitle in a previous post, and it fits the content of this one well enough to use again. I was reading a post from Tim Ferriss the other day, and it got me thinking about statistics. The post is about alternative medicine, but you don’t need to know much about that for the point I’m making. Here’s some context:

Imagine you catch a cold or get the flu. It’s going to get worse and worse, then better and better until you are back to normal. The severity of symptoms, as is true with many injuries, will probably look something like a bell curve.

The bottom flat line, representing normalcy, is the mean. When are you most likely to try the quackiest shit you can get your hands on? That miracle duck extract Aunt Susie swears by? The crystals your roommate uses to open his heart chakra? Naturally, when your symptoms are the worst and nothing seems to help. This is the very top of the bell curve, at the peak of the roller coaster before you head back down. Naturally heading back down is regression toward the mean.

If you are a fallible human, as we all are, you might misattribute getting better to the duck extract, but it was just coincidental timing.

The body had healed itself, as could be predicted from the bell curve–like timeline of symptoms. Mistaking correlation for causation is very common, even among smart people.

And the important part of the quote [Emphasis Added]:

In the world of “big data,” this mistake will become even more common, particularly if researchers seek to “let the data speak for themselves” rather than test hypotheses.

Spurious connections galore–that’s what the data will say, among other things. Caveat emptor.
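The regression-toward-the-mean pattern Ferriss describes can be sketched in a few lines of Python. The timeline and numbers here are hypothetical, just to illustrate the shape of the fallacy:

```python
import math

# Hypothetical symptom-severity timeline shaped like a bell curve:
# the illness worsens, peaks around day 7, then resolves on its own.
days = range(14)
severity = [10 * math.exp(-((d - 7) ** 2) / 8) for d in days]

# A sufferer typically reaches for the "miracle duck extract" at the peak.
peak_day = max(days, key=lambda d: severity[d])

# With or without the remedy, every day after the peak looks like improvement,
# so whatever was tried at the peak gets the credit.
after = [severity[d] for d in days if d > peak_day]
assert all(b > a for b, a in zip([severity[peak_day]] + after, after))
print(f"Remedy started on day {peak_day}; symptoms decline every day after.")
```

Any remedy started at the peak will appear to work, because the only direction left to go is back toward the mean.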

This analogy reminded me of the first time I learned about correlation and causation, in my first psychology class as an undergraduate. It had to do with ice cream, hot summer days, and swimming pools. In fact, here’s a quick summary from Wikipedia:

An example of a spurious relationship can be illuminated by examining a city’s ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice-versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Getting back to what Ferriss said near the end of the quote: as “Big Data” grows in popularity and use, errors in the form of spurious relationships may become more likely. One way to mitigate this is education. That is, if the people handling Big Data understand concepts like correlation vs. causation and spurious relationships, these errors may be less likely to occur.

I suppose it’s also possible that some people, knowing about these kinds of errors and how little the average person knows about statistics, could deliberately report misleading statistics. I’d like to think that people aren’t doing this and that it has more to do with confirmation bias.

Regardless, one way to guard against this kind of inaccurate reporting is to use hypotheses. That is, before you look at the data, make a prediction about what you expect to find. It certainly won’t solve every issue, but it will go a long way.