We hope publishing the argument in this high-visibility venue will inspire hallway conversations among scientists and influence how they view long-term data archive funding, particularly those scientists who also wear hats at funding agencies!

The letter is currently behind a paywall. As is permitted by Nature’s preprint policies, I include the text we initially submitted below. It is very similar to what appears in the final article (linked above).

The published letter is also very short. The original article-length draft is at the bottom of this post. Needless to say, it includes nuances lost in the shorter versions. The README associated with the data has additional information about methods.

While doing this research I wrote a few blog posts about my methods and early results. Here are the links:

Data archiving gives a high return on investment
Piwowar, Vision, Whitlock

As recognized by the recent NSF, NIH, and other research council requirements for data management and dissemination plans, data archiving is valuable to scientific progress. Unfortunately, funding agencies have been reluctant to make long-term commitments to data archives. Here, we compare the research productivity per dollar of data archives with existing benchmarks. We argue that ongoing investment in data archiving infrastructure provides a high scientific return on financial investment.

First, how much do archives cost? As an example, we use Dryad (datadryad.org), a data repository for the biosciences with which we are all associated. For Dryad, a relatively cost-effective archive, we estimate that data from over 10,000 publications can be curated and preserved each year for approximately $400,000.

Next, how much research is typically published per grant dollar? NSF core grants in Population and Community Ecology averaged about 3-4 papers per $100,000 from 2000-2005, according to research conducted by Savanna Reyes with Alan Tessier and Susan Mazer. Using the upper estimate, $400,000 in original research funding thus results in about 16 papers.

Finally, how productive are data archives in facilitating original research publications? It is too early to say for Dryad, but we can look to NCBI’s Gene Expression Omnibus (GEO) database for insight. To derive an estimate of data reuse, we searched the full-text of articles in PubMed Central (PMC) for mention of any of the 2,711 datasets deposited in GEO in 2007. We excluded articles whose author names overlapped those who had deposited the dataset. Extrapolating the 338 hits in PMC to all of PubMed, we estimate the GEO datasets from 2007 have made substantive third-party contributions to more than 1150 published articles in 2007-2010 alone, and reuses continue to accumulate rapidly. Details of this estimate are available at doi:10.5061/dryad.j1fd7.

Assuming that Dryad has a comparable rate of reuse and collects even 2,500 datasets per year, a $400,000 investment would contribute to more than 1,000 papers within 4 years, far more than the roughly 16 papers expected from spending the same amount on original research. Some papers based on data reuse may be partially funded by additional grant support; nonetheless, the modest amount of funding needed to maintain a repository like Dryad is almost certain to generate a large scientific return on investment. To maximize the impact of the support they provide to individual investigators, research funders should include the maintenance of data archives as an integral component of their investment portfolios.

Data archiving gives a high return on investment
Piwowar, Vision, Whitlock

In many fields of science, data are rarely available for scrutiny or reuse by the broader community. While data produced within large-scale research initiatives are increasingly freely and openly available, important data captured in the “long tail” of investigator-driven research are almost inevitably lost through a combination of poor data management, hardware failure, and retirement or death of their collectors.

In recent years, many funding agencies and journals have increased their expectations for the sharing of research data, and a variety of public data archives have been established to support these policies. Unfortunately, many of these archives face uncertain futures because the recurrent funds necessary for long-term preservation are difficult to obtain. Funding agencies must weigh investment in data infrastructure, such as data archives, against investment in research itself. Do data archives merit long-term investment?

We provide evidence that data archives promise an outstanding return on investment by facilitating a productive afterlife for data that would otherwise see very limited reuse. We use as our metric the number of papers written based on archived data, relative to the maintenance cost of the archive. While not captured by our numbers, it is important to also appreciate that data archiving has benefits beyond new publications, including transparency and broadened participation.

First, consider how much archives cost. This obviously varies depending on the archive. As a benchmark, we use Dryad (datadryad.org), a data repository with which we are all associated. Dryad was launched two years ago to house datasets associated with published articles in the biosciences. Dryad has been designed to operate efficiently: budget estimates for Dryad suggest that it can curate and preserve the data from over 10,000 publications on an annual budget of $400,000.

Second, how productive is research funding, in terms of journal publications? NSF core grants in the Population and Community Ecology cluster averaged about 3-4 papers per $100,000 from 2000-2005, according to research conducted by Savanna Reyes with Alan Tessier and Susan Mazer (1). Estimates from other studies in the literature are similar (2-6). If we use the upper estimate of productivity from the Population and Community Ecology cluster, $400,000 in original research funding would result in about 16 papers.

Finally, how often do data archives facilitate novel research? It is too early to say how many research papers Dryad will enable, but we can look to comparable data repositories for insight. Within the biosciences, Genbank and the Protein Data Bank are well-known success stories, but it is sometimes suggested that these datatypes are particularly conducive to reuse. NCBI’s Gene Expression Omnibus (GEO) database contains data more typical of individual investigator-driven research: gene expression microarray data are collected under a wide range of experimental conditions, on a variety of incompatible platforms, and undergo variable processing steps.

To derive an estimate of the reuse of data in GEO, we took advantage of the conventions for citing GEO datasets through accession numbers and GEO’s integration with PubMed and PubMed Central (PMC). Using PMC, we searched the full text of papers published between 2007 and 2010 for mention of one or more of the 2,711 accession numbers assigned to data series submitted to GEO in 2007. After excluding those papers that a) had author names in common with those who deposited the data (since the original authors would presumably have access to the data even in the absence of the archive) and b) mentioned an accession number without building upon the dataset, we identified 338 papers that appear to reuse the 2007 GEO datasets in a significant way.

Because PMC contains only a subset of papers recorded in PubMed, we extrapolated to the expected number of articles in PubMed based on the ratios of papers in PMC to PubMed in this domain (measured as the number of articles indexed with the MeSH term “gene expression profiling” in PMC relative to the number of articles with the same MeSH term in all of PubMed; 2007:23%, 2008:32%, 2009:36%, 2010:25%). We estimate that, as of the end of 2010, the whole of PubMed contains 1159 papers that mention GEO accession numbers in the context of novel reuse for datasets submitted in 2007 alone. Thus, for every ten datasets that it collects, we estimate that GEO contributes to at least four papers in the following three years.
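The extrapolation described above can be sketched in a few lines of Python. The MeSH-based PMC/PubMed coverage ratios are the ones quoted in the text, but the per-year split of the 338 PMC hits is a hypothetical placeholder (the letter reports only the total), so the printed estimate is illustrative rather than a reproduction of the published 1159 figure:

```python
# PMC's coverage of "gene expression profiling" articles relative to
# all of PubMed, per publication year (ratios quoted in the text).
coverage = {2007: 0.23, 2008: 0.32, 2009: 0.36, 2010: 0.25}

def extrapolate_to_pubmed(pmc_hits_by_year, coverage_by_year):
    """Scale each year's PMC hit count by the inverse of PMC's
    coverage of the relevant PubMed literature for that year."""
    return sum(hits / coverage_by_year[year]
               for year, hits in pmc_hits_by_year.items())

# Hypothetical split of the 338 PMC reuse papers across years:
pmc_hits = {2007: 30, 2008: 80, 2009: 120, 2010: 108}

estimate = extrapolate_to_pubmed(pmc_hits, coverage)
print(round(estimate))  # on the order of the ~1150 papers reported
```

With coverage ratios in the 23-36% range, any plausible split of the 338 hits lands the PubMed-wide estimate in the vicinity of 1,100-1,200 papers, which is why the headline number is robust to the exact per-year distribution.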

This is an underestimate of reuse for several reasons. Our screen only captures papers that attribute reuse through mention of a GEO accession number, which is common practice but not universal. Furthermore, this analysis only includes the first few years in the productive afterlife of the data. As illustrated in Figure 1, reuses of data from 2007 continue to accumulate rapidly.

Figure 1: Reuse of datasets deposited in GEO in 2007

Assuming that Dryad collects a low figure of 2,500 datasets per year, and that it has a rate of publishable reuse equivalent to that for GEO, a $400,000 investment in this data archive would contribute to more than 1,000 papers within 4 years. While papers based on data reuse may be partially funded by grant support for analysis effort or additional data collection, the modest amount of funding needed to maintain a repository like Dryad is almost certain to generate a large scientific return on investment.
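The comparison in the paragraph above reduces to back-of-the-envelope arithmetic. All figures come from the letter itself; the per-dataset reuse rate is GEO's (roughly 4 reuse papers per 10 datasets over the following three years), applied to Dryad as the letter assumes:

```python
annual_budget = 400_000          # Dryad's estimated annual budget ($)
datasets_per_year = 2_500        # conservative deposit rate assumed above
reuse_papers_per_dataset = 0.4   # GEO-derived rate: 1159 / 2711 ~ 0.43, rounded down

# Papers enabled by one year's worth of archived datasets:
papers_from_archive = datasets_per_year * reuse_papers_per_dataset

# Benchmark: direct research funding at the upper NSF estimate of
# 4 papers per $100,000.
papers_from_grants = annual_budget / 100_000 * 4

print(papers_from_archive)  # 1000.0
print(papers_from_grants)   # 16.0
```

The two orders of magnitude between 1,000 and 16 are what the letter means by an "outstanding return on investment", even after allowing that reuse papers draw on some additional grant support of their own.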

Public data archiving can generate important new results for a small fraction of the currently accepted cost of doing science. To maximize the impact of the support they provide to individual investigators, research funders should include the maintenance of data archives as an integral component of their investment portfolios.

Supporting data and detailed methods are available as supplementary information.

Acknowledgements
This analysis was conducted under the auspices of DataONE, funded by a Cooperative Agreement through the NSF DataNET program (OCI-0830944).

References
1. Personal communication
2. K. W. Boyack, K. Borner, Indicator-assisted evaluation and funding of research: Visualizing the influence of grants on the number and citation counts of research papers. Journal of the American Society for Information Science and Technology 54, 447 (2003).
3. B. G. Druss, S. C. Marcus, Tracking publication outcomes of National Institutes of Health grants. The American Journal of Medicine 118, 658 (2005).
4. M. Gaughan, B. Bozeman, Using curriculum vitae to compare some impacts of NSF research grants with research center funding. Research Evaluation 11, 17 (2002).
5. D. Hendrix, An analysis of bibliometric indicators, National Institutes of Health funding, and faculty size at Association of American Medical Colleges medical schools, 1997–2007. Journal of the Medical Library Association 96, 324 (2008).
6. V. Larivière, B. Macaluso, É. Archambault, Y. Gingras, Which scientific elites? On the concentration of research funds, publications and citations. Research Evaluation 19, 45 (2010).


10 Comments

I was with you until “budget estimates for Dryad suggest that it can curate and preserve the data from over 10,000 publications on an annual budget of $400,000”. It’s the “curate” bit that worries me. OK, it’s an overloaded word, but you’ve included it to mean more than “preserve”. That suggests both subject skill and involvement with the dataset in some way. It’s hard to see how this could be done with what must be around 2 FTE staff. Existing subject-oriented archives in the UK are MUCH more expensive, and claim that their expertise is essential if the data are to be properly curated. UKDA for instance (which, I agree, does a larger and slightly different job) has a list of around 70 people on its staff. I’m not sure how this translates into FTE or dollars for the equivalent job, but I suspect it’s up by a factor of 10.

As you say, Chris, I think it’s all about how ambitious the archive is in its curation step. Dryad demonstrates it is possible to do light curation with this workload.

Below is a link that summarizes what the Dryad curators do. Dryad is involved in each of the data curation steps outlined by the UK Data Archive, but at a “processing standard” that makes no attempt to validate the data, sanity check it, or create additional documentation (similar to Processing standard C for UK Data Archive but without the data checks). Maybe this light level of curation could be considered similar to that of other immutable objects?

Clearly the sanity checks have value, but they cost money. ROI calculations for these steps would be very interesting and useful!

A recent study by Jeffrey M. Litwin, presented at the Association for Institutional Research conference in May 2011, reports slightly lower estimates of article publication per funding dollar (0.5 to 4 publications/$100k funding, across a wide range of disciplines).

Sorry to be so late in seeing it, but thanks for your comment, Heather. Any chance of a blog post exploring this topic in more detail? Understanding the cost-benefit of curation actions seems pretty important to me, given there may be a 50-times price differential involved!

I agree, it is a really important topic, and I think we all have a lot to learn about both the costs and benefits side of the curation equation. Thanks for the post suggestion, I’ll see what I can do :)