03 July 2008

Long live the scientific method

Chris Anderson provokes with an article titled, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete."

There are some interesting ideas, but the argument rests on a false premise.

"This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear."

It's perhaps understandable that an outsider, a non-scientist, would mistakenly believe this premise to be true: that massive amounts of data are available for all scientific problems.

There are not.

There are only a few fields of science that generate large amounts of high-quality data. I'm thinking maybe some branches of physics (like nuclear physics, maybe astronomy), social sciences (demographic and census data, automatic tracking of web usage), and maybe genetic data for a select few organisms (humans, mice, fruit flies, Arabidopsis).

These are the exceptions.

In most cases, scientists have to eke out their data by hand, one experiment at a time. It's not automated, it's not massive, and it doesn't generate huge numbers. To take an example from my field, invertebrate neurobiology, there isn't really good agreement on how to describe neurons in such a way that they can be put into a searchable database (although the NeuronBank project is making an effort to at least think about that problem).

Anderson goes on to say:

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Scientific theories have three traditional virtues: they predict, control, and explain. Massive datasets may indeed give us pretty good predictive power -- correlations often do. They may not give us control. And they certainly don't explain. We really need causal mechanisms to explain.
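To make the gap between prediction and control concrete, here is a minimal sketch in Python. It is a toy simulation of my own, not anything from Anderson's article: the hidden variable `z`, the noise levels, and the function names are all assumptions chosen for illustration. Two quantities, `x` and `y`, are both driven by `z`, so they correlate strongly in observational data. But when we set `x` ourselves, independently of `z`, the correlation with `y` vanishes, because `x` never caused `y` in the first place.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, seed=0):
    """Return (observational correlation, correlation under intervention).

    Hypothetical setup: a hidden cause z drives both x and y;
    x has no causal effect on y at all.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]

    # Observation: we only record x and y, both of which depend on z.
    x_observed = [zi + rng.gauss(0, 0.5) for zi in z]
    y_observed = [zi + rng.gauss(0, 0.5) for zi in z]

    # Intervention: we set x ourselves, ignoring z. y still depends only on z.
    x_forced = [rng.gauss(0, 1) for _ in range(n)]
    y_after = [zi + rng.gauss(0, 0.5) for zi in z]

    return pearson(x_observed, y_observed), pearson(x_forced, y_after)


if __name__ == "__main__":
    r_observed, r_intervened = simulate()
    print(f"correlation when observing x:   {r_observed:.2f}")
    print(f"correlation when controlling x: {r_intervened:.2f}")
```

With these made-up parameters, the observational correlation should come out strong and the one under intervention close to zero. The pattern in the data predicts well, but it gives no lever to pull, and nothing in the two recorded columns tells you that `z` exists.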

For instance, let's take climate change. If massive data were all you need, there would seem to be no need for the ongoing debates about climate change. We have massive datasets there. And indeed, the scientific conclusions are supported by a broad consensus. But people don't care that there's a correlation between carbon output and temperature change; they want to know if one is caused by the other. The policy decisions are very different depending on what you think the causal mechanisms are. Cause is king.