Sunday, May 18, 2014

Spurious Correlations

Here’s a site that plots spurious correlations* between variables. The description indicates that these correlations are found by a data mining program rather than a person.

The above is a true spurious correlation: even though the data is measured across time, it’s hard to see how time is intimately involved in the data generation. Per capita cheese consumption is just not something I can see trending off to infinity.

I do wish there were a section that isolated spurious correlations between trending time series, though. In this one, we could reasonably expect both variables to go to infinity if given enough time.

For my part, I think pairs like the second one are more critical to recognize, because the correlation is never going to go away no matter how much more data we get. In contrast, the correlation in the upper plot will probably go away if we keep plotting it year after year, as eventually per capita cheese consumption levels off or starts to decline.
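
To make the trending case concrete, here is a minimal sketch in Python. It is not anything the site itself does. The two series, their slopes, and the noise level are made up for illustration: each is just a straight-line trend plus independent noise, so neither has anything to do with the other. Differencing is one common way to take the trend out, and I use it here only to show that the correlation lives entirely in the shared trend.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def trending_series(n, slope, noise_sd, rng):
    """Hypothetical series: a linear trend in time plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]

def first_differences(xs):
    """Period-to-period changes, which strip out a linear trend."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two unrelated series that both happen to trend upward (made-up parameters).
series_a = trending_series(200, slope=1.0, noise_sd=5.0, rng=rng)
series_b = trending_series(200, slope=0.5, noise_sd=5.0, rng=rng)

levels_r = pearson(series_a, series_b)
diffs_r = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:      {levels_r:.2f}")
print(f"correlation of differences: {diffs_r:.2f}")
```

With these settings the correlation of the levels should come out close to 1, while the correlation of the differences should sit near zero. Making the series longer only pushes the first number closer to 1, which is the sense in which this kind of spurious correlation does not go away with more data.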

* A correlation is spurious if it occurs by chance, and has nothing to do with an identifiable cause and effect.