the internet is an unstable research environment

I am a huge advocate for big data research and it pains me to see otherwise rational people get in a tizzy over what is simply very interesting and rich observational data. What I find puzzling is that there actually is a big drawback to “big data” that few people mention – the lack of reproducibility that results from the inherent instability of the Internet. Let me give you a few examples:

You are harvesting tweets and then the API changes and you no longer get the same data.

You are working with a data provider and they lose their license.

The start-up you were working with gets new management and you lose access to the data.

A start-up goes bankrupt, or nearly so, and all its data is removed or simply erased. For example, you can no longer get MySpace data.

Google changes what you can get from their Google search data.

These are all serious problems because they mean you can never go back and reproduce results. From the firms' perspective, it's fine; they probably changed things for good business reasons. For a researcher, it is a disaster. You can never fix problems, explore new models, or reproduce the research for later generations. Since most firms don't archive data – poof, it's gone!
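None of this fixes the upstream fragility, but a researcher can at least freeze their own copy of harvested data at collection time, with a timestamp and checksum, so the exact inputs to an analysis survive API changes. A minimal sketch in Python – the function names and file layout here are my own illustration, not any standard tool:

```python
import hashlib
import json
import time
from pathlib import Path

def archive_snapshot(records, archive_dir):
    """Save a raw copy of harvested records (e.g. tweets) with a timestamp
    and SHA-256 checksum, so the analysis can later be re-run against
    exactly what was collected."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Canonical serialization so the checksum is stable across runs.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    path = archive_dir / f"snapshot_{int(time.time())}_{digest[:12]}.json"
    path.write_bytes(payload)
    return path, digest

def verify_snapshot(path, expected_digest):
    """Check that an archived snapshot still matches its recorded checksum."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest() == expected_digest
```

A study that publishes the checksum alongside its results lets later readers confirm they are looking at the same raw data, even if the original API is long gone.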

I hope that one day we'll get to a "big data GitHub," where firms will archive data so that people generations from now can use it. Sadly, we are in the opposite situation now. We are lucky if we can access data that was used last week.


7 Responses

It sounds more like a legitimacy problem than an internet failure. You can’t get the raw data from climate change studies, you can’t get the raw data from other big data studies, so why advocate for it or fund it until you can?

August: Your comment doesn't sound right to me. Are you telling me that climate scientists don't have access to the original observational data? And that any PhD in meteorology can't get the data and re-run the models themselves?

I actually expected people to go to jail a few years ago, when the leak came out. The media just said it was emails, but there was more. Eric Raymond took a look at the code, and it looked a lot like cooking the books to him. Some of the data is "lost" too. In some cases it looks like purposeful destruction, to avoid whatever the UK equivalent of Freedom of Information Act requests is.

My worry is that the world of computer modelling/big data is rife with this stuff, with the computer model itself, and whatever it subsequently generates, presented more and more often as the "science," with little to no attempt at verifying that it coincides with reality. A while back I came across a paper from a woman who had found that a lot of DNA studies are using computer models rather than actually testing against DNA. I think she came back with a 30% error rate between the computer model and real DNA. This is scary stuff. I lost that paper (I think it was her thesis) and am on the lookout for it, because it is probably the most telling thing I can point to, since it is not imbued with political ideology like the climate is. Unfortunately, even for me, the internet is an unstable research environment, and I should have remembered to save a copy for myself.

There are two types of replication one can be concerned with. The first concerns making available the original data used in the paper. Journals should make it mandatory to provide the code and data used in all analyses. The second concerns being able to replicate the study on a new sample of data and get similar results.

This is hard to do even in lab settings with rigorous experimental designs. Imagine running an education experiment in the same school over a 20-year period, or the same experiment in a different school in perhaps a different country. You would very likely get different results with the exact same experimental design, because preferences and incentives change over time and place. So while the problem with internet data is extreme, it is not significantly worse than with non-internet data. This is where mathematical models and simulations might help, as they can explain why effects differ across settings. Science needs to get away from "reduced form" thinking when it comes to human behavior … For an explanation of this problem: http://www.econ.brown.edu/econ/events/sortexp.pdf

Fabio, I'm not quite sure I understand how this problem differs from existing observational studies — at least for replication by others. Many studies, notably in management, use data under a proprietary license that prevents the researcher from sharing the data. While the researcher can verify their own work, no mechanism exists to share that data. Data use agreements with other types of data also expire after some set period, and data might be lost forever due to new administrations, changes to FOIA rules, or a host of other reasons.

In summary, I agree that these are problems, I’m just not convinced that they are unique to “big data.”

Agree with mike3550. Fabio asked for a GitHub for big data; this: https://git-lfs.github.com/ isn't exactly that, but it's closely related and shows we're on our way. Moreover, many of the problems are already solvable if researchers make an effort, and the solutions are getting better. If you have the right to share data, storage is not that difficult or expensive.

“Fabio, I’m not quite sure I understand how this problem differs from existing observational studies — at least for replication by others. ”

For me, it's a problem of magnitude more than kind. For example, if I have data from a management study, I presumably could replicate it by looking at similar firms. But data sets are unique in the Big Data world – if I lose my Twitter access, I can't redo the analysis with Google search or Reddit. They are simply different environments. While I could replicate a customer survey or examine manager behavior in other firms, there is only one Twitter environment.