Dryad updates and goings on

Dryad is a nonprofit organization fully committed to making scientific and medical research data permanently available to all researchers and educators, free of charge and without barriers to reuse. For the past four years, we have engaged experts and consulted with our many stakeholders in order to develop a sustainability plan that will ensure Dryad’s content remains free to users indefinitely. The resulting plan allows Dryad to recover its operating costs fairly and in a way that scales. It draws revenue from submission fees, membership dues, grants and contributions.

A one-time submission fee will offset the actual costs of preserving data in Dryad. The majority of costs are incurred at the time of submission, when curators process new files, and long-term storage costs scale with each submission, so this transparent one-time charge ensures that resources scale with demand. Dryad offers a variety of pricing plans for journals and other organizations, such as societies, funders and libraries, to purchase discounted submission fees on behalf of their researchers. For data packages not covered by a pricing plan, the researcher pays upon submission. Waivers are provided to researchers from developing economies. See Pricing Plans for a complete list of fees and payment options. Submission fees will apply to all new submissions starting September 2013.

Membership dues will supplement submission fees, allowing Dryad to maintain its strong ties to the research community through its volunteer Board of Directors, Annual Membership Meetings, and other outreach activities to researchers, educators and stakeholder organizations. See Membership Information.

Grants will fund research, development and innovation.

Donations will support all of the above efforts. In addition, Dryad will occasionally appeal to donors to fund special projects or specific needs, such as preservation of valuable legacy datasets and deposit waivers for researchers from developing economies.

We are grateful for all the input we have received into our sustainability plan, and look forward to your continued support in carrying out our nonprofit mission for many long years to come.

An Integrated Journals page that helps depositors see which journals coordinate the submission process with Dryad, determine at which stage in the publication process to submit data for their chosen journal, and more.

Prominent positioning of Dryad’s Terms of Service, which we view as a two-way compact with our users. We wrote it in plain language and sincerely want it to be read!

Improved accessibility for persons with visual disabilities (following the guidelines in Section 508 of the U.S. Code)

More intuitive search and browse of data packages and a revamped layout for the data package page

There are lots more improvements underway. Not all of these will be immediately obvious to website visitors, but you can expect to see more changes over the coming months. Thanks to all who have provided feedback and helped with usability testing, and please let us know what you think!

Dryad is pleased to announce that a diverse array of new partner journals completed submission integration during the first quarter of 2013. Authors submitting to these journals will benefit from streamlined data deposition, while the journals will benefit from enhancement of their articles through a tighter linkage to the underlying data.

Submission integration is completely free, and can be implemented with a wide variety of manuscript submission systems. We welcome inquiries from other journals that wish to integrate submission with Dryad, and encourage authors from non-integrated journals to let their editors know if it is a service that they would value.

eLife is a prestigious new open-access journal published by the Howard Hughes Medical Institute, the Max Planck Society, and the Wellcome Trust.

Journal of Open Public Health Data (JOPHD) is a new journal from Ubiquity Press that publishes peer-reviewed data papers describing public health datasets with high reuse potential. The data itself must be made freely available in a public repository.

Each journal that integrates with Dryad chooses whether to have authors archive their data prior to peer review or after manuscript acceptance. Of these six journals, GMS Medical Sciences, eLife, and the Journal of Open Public Health Data chose to have their authors submit data prior to peer review.

A study providing new insights into the citation boost from open data has been released in preprint form on PeerJ by Dryad researchers Heather Piwowar and Todd Vision. The researchers looked at thousands of papers reporting new microarray data and thousands of cited instances of data reuse. They found that the citation boost, while more modest than seen in earlier studies (overall, ~9%), was robust to confounding factors, distributed across many archived datasets, continued to grow for at least five years after publication, and was driven to a large extent by actual instances of data reuse. Furthermore, they found that the intensity of dataset reuse has been rising steadily since 2003.

Heather, a post-doc based in Vancouver, may be known to readers of this blog for her earlier work on data sharing, her blog, her role as cofounder of ImpactStory, or her work to promote access to the literature for text mining. Recently Tim Vines, managing editor of Molecular Ecology and a past member of Dryad’s Consortium Board, managed to pull Heather briefly away from her many projects to ask her about her background and latest passions:

TV: Your research focus over the last five years has been on data archiving and science publishing. How did your interest in this field develop?

HP: I wanted to reuse data. My background is electrical engineering and digital signal processing: I worked for tech companies for 10 years. The most recent was a biotech developing predictive chemotherapy assays. Working there whetted my appetite for doing research, so I went back to school for my PhD to study personalized cancer therapy.

My plan was to use data that had already been collected, because I’d seen first-hand the time and expense that goes into collecting clinical trials data. Before I began, though, I wanted to know whether the stuff in NCBI’s databases was good quality (since highly selective journals like Nature often require data archiving), or whether it was instead mostly the dregs of research, because that was all investigators were willing to part with. I soon realized that no one knew… and that it was important, and we should find out. Studying data archiving and reuse became my new PhD topic, and my research passion.

My first paper was rejected from a High Profile journal. Next I submitted it to PLOS Biology. It was rejected from there too, but they mentioned they were starting this new thing called PLOS ONE. I read up (it hadn’t published anything yet) and I liked the idea of reviewing only for scientific correctness.

I’ve become more and more of an advocate for all kinds of open science as I’ve run into barriers that prevented me from doing my best research. The barriers kept surprising me. Really, other fields don’t have a PubMed? Really, there is no way to do text mining across all scientific literature? Seriously, there is no way to query that citation data by DOI, or export it other than page by page in your webapp, and you won’t sell subscriptions to individuals? For real, you won’t let me cite a URL? In this day and age, you don’t value datasets as contributions in tenure decisions? I’m working for change.

TV: You’ve been involved with a few of the key papers relating data archiving to subsequent citation rate. Could you give us a quick summary of what you’ve found?

HP: Our 2007 PLOS ONE paper was a small analysis related to one specific data type: human cancer gene expression microarray data. About half of the 85 publications in my sample had made their data publicly available. The papers with publicly available data received about 70% more citations than similar studies without available data.

There was a significant limitation to these early studies: they didn’t control for several important confounders of citation rate (number of authors, for example). Thanks to Angus Whyte at the Digital Curation Centre (DCC) for conversations on this topic. Todd Vision and I have been working on a larger study of data citation and data reuse to address this, and to understand deeper patterns of data reuse. Our conclusions:

After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

TV: Awareness of data archiving and its importance for the progress of science has increased massively over the past five years, but very few organizations have actually introduced mandatory archiving policies. What do you see as the remaining obstacles?

HP: Great question. I don’t know. Someone should do a study! Several journals have told me it is simply not a high priority for them: it takes time to write and decide on a policy, and they don’t have time. Perhaps wider awareness of the Joint Data Archiving Policy will help.

Some journals are afraid authors will choose a competitor journal if they impose additional requirements. I’m conducting a study to monitor the attitudes, experiences, and practices of authors in journals that have adopted JDAP policy and similar authors who publish elsewhere. The study will run for 3 years, so although I have more than 2500 responses there is still another whole year of data collection to go. Stay tuned :)

Funders, though. Why aren’t more funders introducing mandatory public data archiving policies (with appropriate exceptions)? I don’t know. They should. Several are taking steps towards it, but golly it is slow. Is anyone thinking of the opportunity cost of moving this slowly? More specific thoughts in my National Science Foundation RFI response with coauthor Todd Vision.

TV: You’re a big advocate of ‘open notebook’ science. How did you first get interested in working in this way?

HP: I was a grad student, hungry for information. I wanted to know if everyone’s science looked like my science. Was it messy in the same ways? What processes did they have that I could learn from? What were they excited about *now* — findings and ideas that wouldn’t hit journal pages for months or years?

This was the same time that Jean-Claude Bradley was starting to talk about open notebook science in his chemistry lab. I was part of the blogosphere conversations, and had a fun time at ISMB 2007 going around to all the publisher booths asking about their policies on publishing results that had previously appeared on blogs and wikis (my blog posts from the time; for a current resource see the list of journal responses maintained by F1000 Posters).

TV: It’s clearly a good way to work for people whose work is mainly analysis of data, but how can the open notebook approach be adapted to researchers who work at the bench or in the field?

HP: Jean-Claude Bradley has shown it can work very well in a chemistry lab. I haven’t worked in the field, so I don’t want to presume to know what is possible or easy; I’m guessing that in many cases it wouldn’t be easy. That said, more often than not, where there is a will there is a way!

TV: Given the growing concerns over the validity of the results in scientific papers, do you think that external supervision of scientists (i.e. mandated open notebook science) would ever become a reality?

HP: I’m not sure. Such a policy may well have disadvantages that outweigh its advantages. It does sound like a good opportunity to do some research, doesn’t it? A few grant programs could have a precondition that the awardees be randomized to different reporting requirements, then we monitor and see what happens. Granting agencies ought to be doing A LOT MORE EXPERIMENTING to learn the implications of their policies, followed by quick and open dissemination of the results of the experiments, and refinements in policies to reflect this growing evidence-base.

TV: You’re involved in a lot of initiatives at the moment. Which ones are most exciting for you?

HP: ImpactStory. The previous generation of tools for discovering the impact of research are simply not good enough. We need ways to discover citations to datasets, in citation lists and elsewhere. Ways to find blog posts written about research papers — and whether those blog posts, in turn, inspire conversation and new thinking. We need ways to find out which research is being bookmarked, read, and thought about even if that background learning doesn’t lead to citations. Research impact isn’t the one dimensional winners-and-losers situation we have now with our single-minded reliance on citation counts: it is multi-dimensional — research has an impact flavour, not an impact number.

Metrics data locked behind subscription paywalls might have made sense years ago, when gathering citation data required a team of people typing in citation lists. That isn’t the world we live in any more: keeping our evaluation and discovery metrics locked behind subscription paywalls is simply neither necessary nor acceptable. Tools need to be open, provide provenance and context, and support a broad range of research products.

We’re realizing this future through ImpactStory: a nonprofit organization dedicated to telling the story of our research impact. Researchers can build a CV that includes citations and altmetrics for their papers, datasets, software, and slides: embedding altmetrics on a CV is a powerful agent of change for scholars and scholarship. ImpactStory is co-founded by me and Jason Priem, funded by the Alfred P. Sloan Foundation while we become self-sustaining, and is committed to building a future that is good for scholarship. Check it out! And contact us if you want to learn more: team@impactstory.org

We are pleased to announce that Biology Letters is the latest journal to integrate submission of manuscripts with data to Dryad. In this process, the journal and repository communicate behind the scenes in order to streamline data submission for authors and ensure that the article contains a permanent link to the data.

It is particularly apt because Biology Letters is published by the Royal Society, which invented the idea of sharing knowledge through a scientific journal back in 1665. Scientific communication has come a long way from those early letters among gentlemen natural philosophers to the current conception of Science as an Open Enterprise conducted in the public interest. Reflecting these changes in science and technology, the Royal Society recently strengthened its policy on the availability of research data:

To allow others to verify and build on the work published in Royal Society journals it is a condition of publication that authors make available the data and research materials supporting the results in the article.

Datasets should be deposited in an appropriate, recognized repository and the associated accession number, link or DOI to the datasets must be included in the methods section of the article. Reference(s) to datasets should also be included in the reference list of the article with DOIs (where available). Where no discipline-specific data repository exists authors should deposit their datasets in a general repository such as Dryad.

There are already a healthy number of articles in Biology Letters with associated data in Dryad, including one of last year’s hit data packages, Monsters are people too. The first to be published via integrated submission is:

PubMed and GenBank, from the National Center for Biotechnology Information (NCBI), are hugely popular resources for searching and retrieving article abstracts and nucleotide sequence data, respectively. PubMed indexes the vast majority of the biomedical literature, and deposition of nucleotide sequences in GenBank or one of the other INSDC databases is a near universal requirement for publication in a scientific journal.

Thanks to NCBI’s “LinkOut” feature, it is now easy to find associated data in Dryad from either PubMed or GenBank. For example, this Dryad data package is linked from:

the article’s abstract in PubMed. “LinkOut” is at the bottom of the page; expand “+” to see the links to Dryad and other resources.

nucleotide data associated with the same publication in GenBank. “LinkOut” is in the right-hand navigation bar.

LinkOut allows the data from an article to be distributed among repositories without compromising its discoverability.

At Dryad, we intend to expand on this feature in a couple of ways. First, we plan to make Dryad content searchable via the PubMed and GenBank identifiers, which because of their wide use will provide a convenient gateway for other biomedical databases to link out to Dryad. Second, we will be using open web standards to expose relationships between content in Dryad and other repositories, not just NCBI. For example, keen eyes may have noted the relationship of the Dryad data package in the example above to two records in TreeBASE.
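For the programmatically inclined, LinkOut records like those above can also be queried through NCBI’s public E-utilities service. The sketch below simply constructs an `elink` request that lists the LinkOut providers (Dryad among them, where a link exists) registered for a PubMed record; the PMID shown is a placeholder, not one of the articles mentioned above.

```python
from urllib.parse import urlencode

# Base URL for NCBI's E-utilities elink endpoint
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def linkout_url(pmid: str) -> str:
    """Build an elink request that lists the LinkOut providers
    registered for a given PubMed record."""
    params = {
        "dbfrom": "pubmed",  # start from a PubMed record
        "id": pmid,          # the article's PubMed ID
        "cmd": "llinks",     # ask for LinkOut provider links
    }
    return f"{EUTILS_ELINK}?{urlencode(params)}"

# Placeholder PMID for illustration; substitute a real one.
print(linkout_url("12345678"))
```

Fetching that URL returns XML in which each provider (including Dryad, when a data package is linked) appears as an `ObjUrl` entry.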

To learn more about how Dryad implements NCBI’s LinkOut feature, please see our wiki.

What does the directive actually require? All federal government agencies with at least $100M in R&D expenditures must develop, in the next six months, policies for digital data arising from non-classified research that address a host of objectives, including:

to “maximize access, by the general public and without charge, to digitally formatted scientific data created with federal funds” while recognizing that there are cases in which preservation and access may not be desirable or feasible.

to promote greater use of data management plans for both intramural and extramural grants and contracts, including review of such plans and mechanisms for ensuring compliance

to allow inclusion of appropriate costs for data management and access in grants

to promote the deposit of data in publicly accessible databases

to address issues of attribution to scientific data sets

to support training in data management and stewardship

to “outline options for developing and sustaining repositories for scientific data in digital formats, taking into account the efforts of public and private sector entities”

Interestingly, the directive is silent on the issue of embargo periods for research data, neither explicitly allowing nor disallowing them.

…the memorandum requires that agencies start to address the need to improve upon the management and sharing of scientific data produced with Federal funding. Strengthening these policies will promote entrepreneurship and jobs growth in addition to driving scientific progress. Access to pre-existing data sets can accelerate growth by allowing companies to focus resources and efforts on understanding and fully exploiting discoveries instead of repeating basic, pre-competitive work already documented elsewhere.

The breadth of research impacted by this directive is notable. Based on the White House’s proposed 2013 budget, the covered agencies would spend more than $60 billion on R&D. A partial list includes:

The National Institutes of Health (NIH)

The National Science Foundation (NSF)

The National Aeronautics and Space Administration (NASA)

The Department of Energy (DOE)

The Department of Agriculture (USDA)

The National Oceanic and Atmospheric Administration (NOAA)

The National Institute of Standards and Technology (NIST)

The Department of the Interior (which includes the Geological Survey)

The Environmental Protection Agency (EPA)

and even the Smithsonian Institution

We applaud OSTP for moving to dramatically improve the availability of research data collected in the public interest with federal funds.

You can read the full memo here: the data policies are covered in Section 4.