The Dangers of Data

I have no doubt that this is, on the whole, change for the better. But I do worry sometimes that the social sciences are becoming an arena in which number crunching sometimes trumps sound analysis. Given a nice big dataset and a good computer, you can come up with any number of correlations that clear the bar of statistical significance at the 95 percent level, even though roughly 1 in 20 of the genuinely meaningless relationships you test will clear that bar by pure chance. But those spurious ones might be the most interesting findings in the batch, so you end up publishing them!
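The 1-in-20 figure is easy to see in a simulation. The sketch below (my own illustration, not from the article) tests many pairs of independent random series for correlation at the 5 percent significance level; even though no real relationship exists in any pair, about 5 percent of them come up "significant."

```python
# Hypothetical illustration: how often do *independent* series show a
# "statistically significant" correlation at the 5% level?
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_tests = 100, 1000

# Critical |r| for a two-sided test at alpha = 0.05
# (Fisher z approximation: z = atanh(r) * sqrt(n - 3)).
r_crit = np.tanh(1.96 / np.sqrt(n_obs - 3))

false_positives = 0
for _ in range(n_tests):
    x = rng.standard_normal(n_obs)
    y = rng.standard_normal(n_obs)   # independent of x by construction
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) > r_crit:
        false_positives += 1

# Roughly 5% of the tests flag a correlation, even though every
# relationship here is meaningless by construction.
print(f"spurious 'significant' correlations: {false_positives}/{n_tests}")
```

Run enough tests on a big enough dataset and the false positives pile up, which is exactly the mischief the paragraph above describes.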

A bit more subtly, there's also the problem of Milton Friedman's thermostat. Take a room with a furnace that's regulated by a really good thermostat. Your data is going to show that the amount of fuel burned by the furnace is uncorrelated with the temperature in the room. Thus you'll discover that burning fossil fuels doesn't cause heat. Oooops! In the natural sciences what you'd do with that finding is run some experiments, and you'd figure out what was really happening. But it's often difficult (or simply unethical or inhumane) to run proper social-science experiments. A country with a really sharp central bank ought to be like a room with a really good thermostat—variations in economic conditions will be statistically driven by real shocks, which can lead to the misleading conclusion that central bank policy doesn't matter. And the best way to really test what central banks can and can't do would be to run some experiments in which they deliberately do crazy stuff, but that's not going to happen.
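The thermostat point can also be made concrete with a toy simulation (my own stylized sketch, not the article's). A perfect thermostat burns exactly enough fuel to offset heat loss, so fuel use tracks the weather while the room temperature stays flat, and the measured fuel/temperature correlation comes out near zero even though the fuel is the only thing keeping the room warm.

```python
# Stylized model (assumptions: perfect thermostat control, fuel burned
# exactly offsets heat loss to the outside, tiny measurement noise).
import numpy as np

rng = np.random.default_rng(1)
days = 1000
target = 20.0                                      # set point (deg C)
outside = 5.0 + 5.0 * rng.standard_normal(days)    # weather shocks

# Perfect control: fuel burned is whatever offsets the day's heat loss.
fuel = np.maximum(target - outside, 0.0)
# Room temperature holds at the set point, plus measurement noise.
room = target + 0.01 * rng.standard_normal(days)

r = np.corrcoef(fuel, room)[0, 1]
print(f"corr(fuel burned, room temperature) = {r:.3f}")  # near zero
```

Naive number crunching on this data "discovers" that fuel doesn't heat the room; only a model of the control mechanism (or an experiment that switches the thermostat off) reveals what's really going on.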

All of which is to say that pure number crunching can be a dangerous business. When it's hard to run the numbers, empirical inquiry at least stays constrained by theoretically plausible hypotheses. When it's possible to run experiments, we can dive deep into our data and really pin an issue down. But a large dataset and a powerful computer, untempered by theory or experimentation, leave a lot of room for mischief.