Scientific Data Is Disappearing All The Time

When a study gets published and its results enter our collective body of scientific knowledge, it feels like it's there to stay. But without the raw data behind the study, it's hard to revisit the research or build on it with new ideas. That is why it's such a problem that old data is disappearing.

A new study in Current Biology examined 516 biological studies published between 1991 and 2011 and found that the raw data was still available for only 23 per cent of them. For papers written more than 20 years ago, there was a 90 per cent chance that no data was available.

It may sound meta to do a study studying studies, but it's important since the scientific method is supposed to revolve around reproducibility. Timothy Vines, a zoologist at the University of British Columbia, who oversaw the research, told Smithsonian that:

Everybody kind of knows that if you ask a researcher for data from old studies, they'll hem and haw, because they don't know where it is. But there really hadn't ever been systematic estimates of how quickly the data held by authors actually disappears.

The group tracked anatomical plant and animal measurements recorded in 25-40 papers for every other year between 1991 and 2011. When they went searching for the data behind each paper, abandoned email addresses got in their way for 25 per cent of the investigated papers, and unresponsive researchers for 38 per cent.

Vines points out that data stored on outmoded technology like floppy disks is also an issue. And beyond its value to the scientific process, data should in many cases be more available anyway, because it was paid for with public funding that stipulated general availability.

Smithsonian adds that some journals, like Molecular Ecology, where Vines is managing editor, are now requiring that authors submit raw data with their papers. But journal archives, while perhaps more stable than those of individuals, can still disappear over time. Time for a digital pit where everyone can dump their data for long-term storage. [Smithsonian]

Another problem is that even when the raw data is still around, it can become impossible to understand. Even with good log books, after 5 years the variable names that once made so much sense that you didn't write down their meaning no longer do. The quirks in the data that were somehow 'corrected' are forgotten, so the correction is no longer done. Worse yet, the tool used to analyse the data goes through 50 iterations fixing a range of problems. Iteration 25, however, extracted certain data that was used in the analysis but is no longer extracted by iteration 50. No-one remembers that it was iteration 25 that did it, though.

It isn't so bad if people are clear and honest about how they analysed their results, because someone who commits the time can more or less follow it. However, after 5-6 years, if someone analysed the data completely inappropriately, it can be just about impossible to work out how the results came from the raw data. As a result, it can be hard to tell complex processing from bullshit processing. And the latter does occur.
