Science's data secrecy problem

In 2006, amid growing skepticism about the reliability of psychology studies, a group of researchers decided to figure out just how solidly grounded those studies were. They looked at 141 major psychology papers and emailed their authors to request the original data.

Four hundred emails and six months later, they’d received the data for only a quarter of those studies. The rest were unavailable. And so, instead of the question they’d set out to answer, they wrote a different article, pointedly titled “The poor availability of psychological research data for reanalysis.”

What went wrong? Given how important data is in scientific research, and how much of it is publicly funded, one might think research data is easily available for examination – for other researchers to kick the tires, so to speak. But actually, only a small minority of papers are published with the data available.

Those psych researchers in 2006 aren’t the only team to encounter such frustration. In 2009, a group looking at studies related to modeling in cancer, malaria, and other diseases found only 20 percent of datasets could be accessed. Other researchers who looked specifically at high-impact studies, those published in the most prestigious journals, found that only 10 percent of publications contained the raw data on which their findings were based.

This might come as a surprise. The entire scientific enterprise is, in theory, built on sharing data – it’s how researchers convince skeptics, how they pressure-test one another’s theories. Unlike the secretive world of private-sector invention, science is largely funded with federal or nonprofit money, adding a public-interest component to the basic scientific principle of transparency.

The reasons for the lack of data sharing sometimes are quite simple: Providing data can be a nuisance, taking time and money from running experiments. And sometimes published datasets vanish over time, a function of non-standard archival mechanisms and poor enforcement of data sharing. (This was documented by a research group in 2013; as one author described it, some data sets are simply being "lost to science.")

But secrecy is another problem. Data helps researchers publish, and publications are the currency of scientists, earning them grants and promotions. Thus, researchers often cling jealously to their most important data, treating it more like proprietary information than a public resource.

Troubled by this secrecy – especially given the public funding of most research – a movement for open data and overall open science has arisen, calling for data sharing and open-access publishing (that is, research published in non-paywalled forums). This movement builds upon the mandate by the Obama administration, implemented in 2013, that all federally funded research articles be made available to read for free within one year of publication.

Such a movement is supported by the scientific community in principle, but not often followed in practice. Over 16,000 researchers have signed a pledge to not publish in Elsevier, the world’s largest publisher and one that is known for expensive paywalls and other closed-door practices. But four years after the pledge started to circulate, more than one-third of signers who’ve published have already broken it.

(The movement has also triggered something of a data-sharing backlash: An op-ed in the New England Journal of Medicine last year coined the term “research parasite” to describe scientists who reuse and adapt others’ data without explicitly benefiting the original data collector.)

Today, mandates from research funders, federal and private, are starting to change this process—whether researchers like it or not. The Wellcome Trust and the Gates Foundation, two of the biggest independent sources of medical research funding, require any researcher receiving funding to post data openly.

For science to truly shift from a closed-door to an open-data mindset, however, it may be necessary to look deeper, and to create new kinds of incentives. One might be to turn data itself into a measurable product that can help advance scientists’ careers, bringing the same rewards as publishing results in a journal. Such routine publication of datasets might open the door to new kinds of research projects, with shared observations quickly stitched together into cohesive form by multiple groups, similar to how a computer program is written by collaborative teams using multiple open components today. This is happening already in classrooms utilizing open datasets, but this is not yet woven into mainstream academic science.

Software development also offers an interesting model for a new reward system that prioritizes data sharing over hoarding: in hiring, a software developer’s success can be judged by the number of times their code is reused, a process known as forking. The more forks your software has, and thus the more uses, the better rewarded you are as a software developer in terms of career prospects and salary.

This might not be a bad approach for research: Ultimately the point of science is to share and advance knowledge, and it makes sense to reward researchers for providing widely used data, rather than for publishing a bold result based on data they keep secret. This change will require not just a shift in the professional reward system, but also in communications and technology: it needs to be easy for researchers to share the data, and to track its forking.

This is one of my own personal missions: I left cancer research 3 years ago in part to help fix what I saw as flaws in the research enterprise overall, and started my own open publishing company, The Winnower, which later joined forces with Authorea, a startup I now run helping researchers write and publish data-driven research articles.

Arguably, in some cases open data can be detrimental, and there are ways in which closed data can be beneficial. For example, publications on viruses or bacteria that can be weaponized could cause real public harm. Proprietary data can also be useful for researchers to start companies without their ideas being co-opted by larger organizations.

Still, the benefits of open data are likely to far outweigh the current closed practices. And, as recent examples in physics show, large-scale collaborations can produce breakthrough discoveries far beyond what individual scientists, hoarding their data, could produce alone. When the Higgs boson was discovered, the article had thousands of authors, each of whom had worked on a small piece of the whole. And the data, generated at CERN, is open to the public – which has already led to new ideas and discoveries.