The article is nicely balanced and a good read. However, there is really nothing new being stated: people working with data should know that even in completely random data, statistically significant correlations can be found. As such, stating that "we are making a big mistake" is somewhat far-fetched, I would say.

Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.

I think the issue is that computing power and an abundance of data have made it possible to explore many possible hypotheses very quickly, then pick and choose from the handful of positives, which are littered with false positives. It makes it much easier to throw out the scientific process (i.e. testing a single hypothesis).
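To make that concrete, here is a minimal sketch (my own illustration, not from the article, with arbitrary sample sizes): test enough hypotheses against pure noise and "significant" correlations turn up at roughly the nominal rate.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_samples, n_hypotheses = 100, 1000          # arbitrary, illustrative sizes

    outcome = rng.normal(size=n_samples)                    # a purely random "outcome"
    features = rng.normal(size=(n_hypotheses, n_samples))   # purely random "predictors"

    # Test each feature against the outcome and count the "positives".
    p_values = np.array([stats.pearsonr(f, outcome)[1] for f in features])
    hits = int((p_values < 0.05).sum())
    print(f"{hits} of {n_hypotheses} pure-noise correlations are 'significant' at p < 0.05")
    # Expect roughly 50 hits; picking and choosing among them is guaranteed to mislead.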

Rinse and repeat. I work in HIV research with behavioral survey data (not as big, but it suffers from similar problems). One variable is correlated in one study, but not in another. It all gets published and put on equal footing. In the end it is very hard to determine which associations are meaningful and which are noise, because it's all just being dumped into regression models.
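A small sketch of that "dump it all into a regression" problem (hypothetical sizes, pure noise data): with enough unrelated predictors, some coefficients come out significant by chance, and which ones differ from run to run.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, k = 200, 40                           # 200 respondents, 40 unrelated survey items
    X = rng.normal(size=(n, k))
    y = rng.normal(size=n)                   # outcome with no real relationship to X

    model = sm.OLS(y, sm.add_constant(X)).fit()
    sig = int((model.pvalues[1:] < 0.05).sum())   # skip the intercept
    print(f"{sig} of {k} pure-noise predictors are 'significant' at p < 0.05")
    # Rerun with a new seed (a "second study") and a different set of items shows up,
    # which is exactly why results fail to line up across publications.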

That's exactly what I mean. As a grad student, you show up once the grant has been written for XX amount of sequencing and get a heap of data with the instruction to "find something interesting". I basically hated it.

My dissertation was on high dimension, low sample size (HDLSS) statistics, so naturally I looked towards genetic data. I sat in on a few lectures with the local biostats folks and ran as fast as I could back toward low dimension, high sample size (LDHSS) applications.

I may never discover a mutation responsible for a deadly disease, but every result that comes off my desk is biologically sensible science. Still, the number of people who plop databases in front of me and say: "so, what looks good?" is awe-inspiring.

I am a psychometrician (delivering massive numbers of hypothetically unidimensional tests to large numbers of people), and even when we are trying to measure a single dimension with large samples there is a horde of issues.

This may well happen (and quite widely), but it's not always the case, as many researchers are aware of the problems involved.

Often a different procedure for "Step 3" is taken, which involves looking for corroborating evidence from other avenues.

For example, rather than simply looking for the existence of a mutation associated with disease status, the effect of that mutation on the encoded protein, and how it affected the protein's function/expression, was also checked. Whether the gene itself encoded a protein with a plausible biological role in the disease aetiology was also very important. In some cases mouse models with knockout genes are also derived to test whether this results in a similar phenotype. Confirmatory studies are also done with collaborators around the world to test and replicate the association seen, before publishing.

That's not to say that there aren't problems with the methodology that many who have jumped on the bandwagon have employed, but not all researchers clamour to publish every single "significant" association they discover.

You say that as if philosophy of science hasn't progressed beyond the hypothetico-deductive model. There have been many advances since, such as the scientific realist camp delving into just what status to give models of different types.

Many of these problems with drawing inferences from data with little hypothesis input also relate, tangentially, to the proof of the four color theorem, which was a pretty hot topic a couple of decades ago. Even though it was in the context of a mathematical proof, the proof was done almost entirely by computer, and the controversy involved the reliance upon the computer for the veracity of the proof.

To add to how busy philosophy of science can be, even work like Judea Pearl's falls partly into philosophy of science because of his work on the manipulability explanation of causation.

How aware is the "PhilSci establishment"--if there is such a thing--of more technical work like Judea Pearl's causal graphs, Ray Solomonoff's universal prior, and Marcus Hutter's compressive induction?

I wouldn't be able to speak for the "PhilSci establishment" as a whole beyond knowing what I come across every now and then in PhilSci journals and textbooks in my own time. I'm pretty sure I've seen Pearl and Solomonoff mentioned in Phil textbooks before, though I can't recall seeing Hutter's work in textbooks, if that helps. Then again, I tend to gravitate toward philosophical work on induction, causation, and probability more than broader stuff like the demarcation problem.

It depends. It wouldn't be an issue if all projects had a mix. And most do. But some don't, and it's annoying. For something like genome-wide association studies, it's understood that they find correlations and that a good deal of the follow-up work involves teasing out causal relationships, for example; the problem is how it's reported and who follows up. It's usually over-hyped in the media, and it's rarely the star lab that found the association that follows up. It creates some weird situations, but science overall doesn't suffer.

But some functional genomics projects consist almost entirely of exploratory data analysis, with validation experiments as an afterthought. There are papers claiming things they did not prove because of this.

On a more personal note, as a grad student starting out, it feels very unsettling to be thrown into big data with the instruction to "find something interesting".

We have a large data science and data mining team, but no mathematicians or statisticians to be found. I even tried to refer one, and they turned them down, saying they did not know what statisticians did.

The issue that I see is companies like IBM selling BigInsights as if it's going to do for advanced analytics what Excel did for simple analysis. But a powerful computer and sophisticated algorithms don't make somebody a statistician, any more than a scalpel makes somebody a neurosurgeon.

I think the concern isn't necessarily Type I error per se, but rather the long-term predictive validity of big data models. With no theory guiding WHY the model is specified the way it is, we have no underlying structure that holds it together.

This is why, according to the author, the Google Flu model eventually failed. There was no account of WHY the model was built the way it was, so nobody could understand what was driving the effects.

Just as ice cream sales and murder rates are correlated, a theoryless model would conclude that as more ice cream is sold, more people will die, when in fact the true effect is due to increased temperature (deaths occur in hotter months, and so do ice cream sales).

Theoryless models can't take these variables into account unless they're fed into the original models. If a key variable that drives the effect is omitted, the model (and the researchers) will have no idea why its predictions fail until it's too late.
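A toy sketch of that confounder story (all numbers made up): both series are driven by temperature, so a theoryless model sees a strong ice-cream/deaths correlation that largely vanishes once temperature is controlled for.

    import numpy as np

    rng = np.random.default_rng(2)
    temperature = rng.normal(20, 8, size=365)                           # hypothetical daily temps
    ice_cream = 50 + 3.0 * temperature + rng.normal(0, 10, size=365)    # sales driven by temp
    deaths    =  5 + 0.2 * temperature + rng.normal(0, 2, size=365)     # deaths driven by temp

    print("raw correlation:", np.corrcoef(ice_cream, deaths)[0, 1])

    # Regress temperature out of both series; the "effect" disappears.
    def residualize(y, x):
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)

    r_partial = np.corrcoef(residualize(ice_cream, temperature),
                            residualize(deaths, temperature))[0, 1]
    print("correlation after controlling for temperature:", r_partial)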

Speaking as someone with a decent stats background in a field with lots of non-statisticians: yes. It's not so much that they believe they don't have to do statistics; it's that they see significant relationships and then stop thinking statistically.

This is such a refreshing article. I work in the data analysis division at a consulting company, and all I hear is praise of "big data" and that if you aren't using Hadoop you might as well be throwing quarters into a wishing well. All of the data quality issues that motivated classical statistical inference still exist. It's like we are telling people it's so much more efficient to drink from a fire hose than from a glass.