Opinion: Share Your Data

Our analysis of a collection of open-access datasets quantifies their benefit to the scientific community.

Oct 24, 2017

Cameron Craddock, Arno Klein, Michael P. Milham

Should you share your scientific data? If you pose this question to almost any scientist, the answer will usually be an emphatic “yes,” because everyone appears to benefit.

The most obvious beneficiaries of openly shared data are investigators who cannot gather their own; they may not have the opportunity, for example, to perform particle physics experiments, gather arctic ice core samples, or capture clinical brain images. Those who share their data benefit from the crowdsourced scrutiny, analyses, and interpretations of the data by many investigators across many disciplines; more eyes on the data can lead to better and broader insights. Funders of scientific research benefit because it is less expensive to share and combine data than it is to needlessly duplicate efforts to acquire it. And science itself benefits when combining shared datasets increases statistical power, and with it the reproducibility and trustworthiness of scientific results. This point is particularly pertinent these days, given concerns that as many as 30 to 50 percent of reported findings may not be replicable due to the statistical traps of underpowered samples.

Yet, despite the various benefits, a study in 2015 found that as few as 13 percent of published articles with original data actually made their data available. Why are so many scientists still reluctant to share their data? For many scientists, the near-term payoffs of opening up their datasets are not obvious enough to justify the effort involved or potential loss of competitive advantage—a reality that was documented in a 2002 survey and has remained a consistent theme in the literature. This is due in part to the equivocal positions on data sharing taken by funding agencies, journals, and institutions. Funding agencies have started mandating sharing, but do not enforce these mandates as a regular part of the grant review process, so those who don’t comply often face no consequences. Journals that mandate data sharing remain the minority, leaving plenty of alternative venues for publication. Many academic institutions embrace and actively support data sharing, but have yet to address how it will be rewarded in promotion and tenure processes.

Even if the near-term payoff to sharing data were more visible, many would-be data sharers are thwarted by a lack of the technical resources or know-how. There are major infrastructural challenges to sharing, ranging from a lack of standards and tools for easily curating data and addressing privacy concerns, to long-term data maintenance. Funding agencies and journals are partially addressing the lack of resources by providing free databasing and storage (for example, https://www.nitrc.org), but the amount of effort required to prepare and organize data for dissemination can still be daunting. Technical societies and organizations can make (and some are making) a significant contribution by hosting educational activities to teach the skills needed to more efficiently share data (for example, http://www.brainhack.org/).

Although data sharing has the potential to accelerate scientific discovery, its effects thus far are not obvious and are hard to ascertain. In a recent bibliometric analysis posted on bioRxiv, we tried to estimate the effect of the International Neuroimaging Data-sharing Initiative (INDI), a grassroots brain image data-sharing initiative, on the scientific literature. We found that INDI’s approximately 15,000 MRI datasets, aggregated across institutions from around the world, have been used in 900 publications (including 58 theses) in the past seven years. More than 90 percent of the publications were from investigators who did not generate the data, and many were from outside the field of brain imaging altogether. Those who contributed data appear to have benefitted from being able to use data from others to increase their sample sizes or ask more targeted questions.

While this information paints a very optimistic view of data sharing in the neuroimaging field, it was very difficult to compile. It is clear that we urgently need to adopt a system for tracking the use of shared data to reliably estimate use and further incentivize would-be sharers.

Given that the 15,000 datasets shared by INDI represent a tiny fraction of the data that exist, and yet sharing them had such a significant impact on the scientific literature, there is a clear opportunity to make science more reproducible by openly sharing as much data as possible. It is difficult to say what it will take to make data sharing a widespread reality. It may just be a matter of time, as the younger generation of researchers appears to be more receptive to the principles of open science, and funding agencies and journals are slowly becoming bolder with their mandates.

If each institution were to review its own policies with its investigators, including data sharing mandates and tenure review, this would inculcate a culture of data sharing that would help investigators align their own motives with those of the common good. Institutions such as the Allen Institute, Montreal Neurological Institute, and Child Mind Institute (with which we authors are affiliated) are leading the way by making open science a defining principle of their operation. Beyond our support of INDI, the Child Mind Institute has launched the Healthy Brain Network, a large-scale research initiative focused on the generation and open sharing of multimodal data (for example, imaging, electrophysiology, voice samples, fitness, genetics) from 10,000 children and adolescents in the New York City area for the purposes of advancing transdiagnostic child and adolescent mental health research. We hope others who have been reluctant to share will step up, follow this example, and usher in a new era of more collaborative and powerful research.

Michael P. Milham is the founding Director of the Center for the Developing Brain at Child Mind Institute in New York, where Arno Klein is the Director of Innovative Technologies and Cameron Craddock is a volunteer research scientist.