Spurious Correlations Everywhere: The Tragedy of Big Data!

Many of us who follow or engage in quantitative analysis have been tracking the rise of interest in “big data”. A major issue is how to tell real findings from spurious ones that arise merely because the datasets are so large. The statistician Nate Silver described this distinction as the “signal” and the “noise” in a recent best seller. Geoffrey Pullum, a professor of general linguistics at the University of Edinburgh, writing in the Lingua Franca blog of the Chronicle of Higher Education, cautions those engaged in linguistics research about the subtleties of big data. He specifically calls out recent work by Keith Chen, a professor of economics at Yale University:

“The results (see this blog post for an informal account) were jaw-dropping. He found that dozens of linguistic variables were better predictors of prudence than future marking: whether the language has uvular consonants; verbal agreement of particular types; relative clauses following nouns; double-accusative constructions; preposed interrogative phrases; and so on—a motley collection of factors that no one could plausibly connect to 401(k) contributions or junk-food consumption.

The implication is that Chen may have underestimated the myriads of meaningless correlations that can be found in large volumes of data about human affairs.

Roberts and a colleague recently published a paper on this topic (“Social Structure and Language Structure: the New Nomothetic Approach” by Sean Roberts and James Winters, Psychology of Language and Communication 16.2 [2012], 89-112). They noted several zany positive correlations of language with behavior; for example, people who speak a subject-object-verb language (like Japanese, Turkish, or Hindi) have more children on average than do people who speak a subject-verb-object language (like English, Indonesian, or Swahili).

Nassim Taleb’s Antifragile (2012, Page 417, quoted by James Winters in a blog comment) contains a relevant remark about why such things might be: “In large data sets, large deviations are vastly more attributable to noise (or variance) than to information (or signal). … The more variables, the more correlations that can show significance. … Falsity grows faster than information.”

We should expect correlations that are statistically significant but ultimately meaningless to pop up all over the place once large quantities of data are available—especially with regard to something like language, given the difficulty of controlling adequately for cultural diffusion, geographical proximity, shared origins, and intervariable linkage.

I suspect that Chen’s correlations mean nothing at all: There is no causal link, and we do not need an explanatory story. In the kind of world we live in, you wrestle every day with a swirling mass of inexplicable correlations, and then you die.”
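
The point that more variables produce more "significant" correlations can be illustrated with a small simulation. The sketch below is not Chen's analysis or the Roberts and Winters method; it generates pure random noise and counts how many unrelated predictors happen to correlate with a random outcome at roughly the 5% level. The function names (pearson_r, count_spurious), the sample size of 50, and the Fisher z-transform approximation for the cutoff are all assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_obs, n_vars, seed=0):
    """Count noise predictors that look 'significant' against a noise outcome.

    Every series is independent Gaussian noise, so any correlation that
    clears the cutoff is spurious by construction.
    """
    rng = random.Random(seed)
    # Approximate two-sided 5% cutoff for |r| using the Fisher z-transform
    # (an approximation, assumed adequate for illustration).
    cutoff = math.tanh(1.96 / math.sqrt(n_obs - 3))
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson_r(predictor, outcome)) > cutoff:
            hits += 1
    return hits


if __name__ == "__main__":
    # The number of spurious hits should grow roughly in proportion
    # to the number of variables tested.
    for n_vars in (10, 100, 1000):
        print(n_vars, count_spurious(n_obs=50, n_vars=n_vars))
```

Because about one in twenty noise variables should clear a 5% threshold by chance, testing a thousand of them would be expected to turn up dozens of "findings" with no signal behind any of them.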

Pullum’s analysis is a useful caution as we work out how big data can be used in a number of applications, including our own research.