A balancing act: The ideal and the realistic in developing Dryad's preservation policy

Data preservation has gained momentum and visibility in connection with the growth in digital data and data sharing policies. The Dryad Repository, a curated general–purpose repository for preserving and sharing the data underlying scientific publications, has taken steps to develop a preservation policy to ensure the long–term persistence of this archived data. In 2013, a Preservation Working Group, consisting of Dryad staff and national and international experts in data management and preservation, was convened to guide the development of a preservation policy. This paper describes the policy development process, outcomes, and lessons learned in the process. To meet Dryad’s specific needs, Dryad’s preservation policy negotiates between the ideal and the realistic, including complying with broader governing policies, matching current practices, and working within system constraints.

The growth in digital data and data–intensive computing has contributed to the phenomenon often referred to as “the data deluge” (Hey and Trefethen, 2003). These data are increasingly being recognized as valuable, and there are numerous efforts pushing for open access and facilitating data reuse. The goal of these efforts is not only to maintain but also to ensure the enduring value of such data. Recent research has also highlighted the benefits of open data by showing that data sharing leads to increased scientific discourse and discovery (Pienta, et al., 2010), and increases citations and impact (Gleditsch, et al., 2003; Piwowar and Vision, 2013).

Data preservation has gained momentum and visibility in connection with the growth in digital data and the increasing prevalence of data–sharing policies. Researchers archiving data want to confirm that it will be preserved in order to support data sharing and data reuse (Beagrie, et al., 2010). National and international funding agencies have begun to require data management plans for grant applications (e.g., the National Institutes of Health (NIH) in 2003; National Science Foundation (NSF) in 2011; Natural Environment Research Council in the U.K. in 2011; Biotechnology and Biological Sciences Research Council in the U.K. in 2010). A February 2013 memo from the White House Office of Science and Technology Policy stated that data resulting from

“unclassified research that are published in peer–reviewed publications directly arising from federal funding should be stored for long–term preservation and publicly accessible to search, retrieve, and analyze in ways that maximize the impact and accountability of the Federal research investment” (Holdren, 2013).

All of these efforts, together with the rapid growth of data repositories, support the need for reliable repositories with effective preservation policies. The information and library science community has expert knowledge in these areas and has been engaged in advancing practices and policies specific to preserving research data. Sharing these developments is important. This paper serves to document the development of a preservation policy for the Dryad Repository, which stands out in the area of scholarly communication as a partner to scientific journals with established data archiving requirements (Whitlock, et al., 2010).

This article begins with a description of the Dryad Repository, then reviews background information that has shaped Dryad’s preservation policy. The central part of the paper presents the process of developing a preservation policy for Dryad, the challenges, and the lessons learned. The conclusion summarizes the paper’s highlights and discusses the next steps.

The Dryad Repository

Dryad is a “curated, general–purpose repository that makes the data underlying scientific and medical publications discoverable, freely reusable, and citable” (Dryad Repository, 2013). Dryad was officially launched in September 2008, funded by a National Science Foundation grant, as a joint project between the Metadata Research Center (MRC) at the University of North Carolina at Chapel Hill (UNC–CH) and the National Center for Evolutionary Synthesis (NESCent). NESCent is a collaboration involving Duke University (Duke), UNC–CH, and North Carolina State University (NC State). Dryad was envisioned as an easy–to–use, sustainable, community–governed data infrastructure, and preservation is a key element of Dryad’s mission to ensure access, facilitate data availability, support data sharing, and enhance scholarly communication. Dryad accepts data underlying peer–reviewed publications, and all data files in Dryad are available for download and reuse, except those that are under a temporary embargo period, as permitted by editors of the relevant journals. As of June 2014, Dryad held data from more than 300 different journals. Journals are encouraged to become “integrated,” a service that links the manuscript submission process with the data submission process for ease of deposit and richness of description. As of April 2014, Dryad had more than 70 integrated partner journals. Some of Dryad’s earliest partners were a group of leading journals and scientific societies in evolutionary biology and ecology, whose affiliation with Dryad supported a Joint Data Archiving Policy (Whitlock, et al., 2010). Since 2009, Dryad has been a membership organization, governed by a Board of Directors. In 2013, Dryad began transitioning away from its grant–funded status, incorporating as a not–for–profit and instituting a data publishing charge in order to support sustainability.

Literature review

Data preservation

Digital preservation is not a new concern in the library and archives fields, and considerable research has been done over the past 20 years regarding long–term preservation of digital information (Thibodeau, 2002; Burda and Teuteberg, 2013). The term digital preservation has been defined in various ways in the literature, but there is a general consensus in archives and preservation communities that digital preservation aims to ensure authenticity and access for digital objects over time. The Research Libraries Group (RLG) and Online Computer Library Center (OCLC) (RLG and OCLC, 2002) defined digital preservation as “the managed activities necessary for ensuring both the long–term maintenance of a bytestream and continued accessibility of its contents” [1]. The Joint Information Systems Committee’s (Jisc) briefing paper (Pennock, 2006) provides a more detailed definition, noting that digital preservation encompasses “not just technical activities, but also all of the strategic and organizational considerations that relate to the survival and management of digital material” [2]. Jisc’s definition supports the purpose of this paper, encouraging digital preservation activities not only in the course of digital object processing, but also on a broader policy level.

Digital preservation research has focused on a few key areas: intellectual property concerns (Charlesworth, 2012; Lor and Britz, 2012); technology obsolescence, including format/software and media/hardware (Caplan, 2007; Guttenbrunner and Rauber, 2012); physical threats such as bit rot, human error, and natural events (Weinstein, 1999; Martyniak, 2010); security (Qian, et al., 2011); and description/metadata (Lavoie, 2004; Groenewald and Breytenbach, 2011). Many data preservation issues align with this existing research. However, data preservation also has its own distinct challenges. First, and perhaps most challenging from a preservation perspective, research data come in a wide variety of formats (McGath, 2013). These formats may require rare or proprietary programs to interpret them, making emulation or migration difficult and expensive. Second, data are raw materials, rather than finished products; data creators need to be able to add new versions of datasets to the repository as they continue to conduct their research. Third, because data are the basis of scholarly output, they may not be immediately appropriate for public archiving; data may contain sensitive information, may not be formatted appropriately, or could potentially be used to scoop research findings (Laakso and Björk, 2013; Beagrie, et al., 2010). Lastly, data sets can be very large and consist of multiple files; so while metadata creation is important to facilitate discoverability, curating large data sets is also time consuming (Greenberg, et al., 2009).

Data repositories

Data curation and preservation in social science repositories date back to the 1960s [3], but the scientific research community has been slower to enter the public discourse on data preservation. The earliest reports addressing scientific data sharing were published in the 1980s (Fienberg, et al., 1985), and national organizations did not begin to publish about research data preservation until the 2000s (National Science Board, 2005; Friedlander and Adler, 2006). Since then, the growth of e–science and data–intensive computing has resulted in a “data deluge,” a term that has risen to popularity in both the academic and popular presses (Hey and Trefethen, 2003; Vardi, 2008; Economist, 2010). The massive and growing amounts of data have underscored the importance of data management and preservation.

GenBank (https://www.ncbi.nlm.nih.gov/genbank), one of the first data repositories on the Web, was established by the National Institutes of Health in 1982 to hold nucleic acid sequences. GenBank was used to archive Human Genome Project data in the 1990s–2000s, and it continues to accept sequencing data for more than 240,000 named organisms. Another wave of online scientific data repositories arose in the 2000s, including the Knowledge Network for Biocomplexity (https://knb.ecoinformatics.org). The number of repositories has continued to grow. The Registry of Research Data Repositories (http://www.re3data.org/) currently lists hundreds of scientific data repositories in the United States, and Marcial and Hemminger (2010) estimate the worldwide number of research data repositories to be in the thousands.

Preservation policies and trustworthy repositories

The Roman philosopher Lucius Annaeus Seneca wrote, “if one does not know to which port one is sailing, no wind is favorable” (Knowles, 2009). Written policy is vital to any organization in order to convey a clear mission, develop a strategic roadmap, and encourage a proactive approach to company practice. By drafting policy, an organization lays a foundation that facilitates future growth.

Policy development is essential to the success of project implementation, including digital preservation (Beagrie and Jones, 2008). Preservation policy is also required in order to be certified as a Trusted Digital Repository. The Consultative Committee for Space Data Systems (CCSDS) Audit and Certification of Trustworthy Digital Repositories (2011) explains, “documentation assures stakeholders (consumers, producers, and contributors of digital content) that the repository is meeting its requirements and fully performing its role as a trustworthy digital repository” [4]. Policy development also provides support for strategic planning, and it is critical that policy be continually reevaluated as repositories grow, holdings change, and preservation theory evolves (Bergmeyer, et al., 2009).

While preservation policies have become increasingly commonplace in digital repositories over the past few years, data repositories lag behind the curve. Social science data repositories tend to have more robust policies [5]. However, there are fewer preservation policies for scientific data repositories. This may be because many scientific data repositories are developed by smaller, discipline–specific communities in order to serve their own needs [6], and it is more difficult for preservation policy development to happen on such small scales. Many existing scientific data preservation policies are associated with larger, more established data repositories. Four policies that informed Dryad’s policy development are summarized in Table 1. Our selection of these four policies for review was based on the recommendation of the Dryad Preservation Working Group.

Like Dryad, ADS has internal policy documents and host institution policies. The preservation policy points out that it “does not exist in isolation” (p. 1), but as part of a suite of policy documents, and in partnership with several organizations with whom ADS has service level agreements and memoranda of understanding. The ADS preservation policy is the only one surveyed here that acknowledges the Open Archival Information System (OAIS) reference model (CCSDS, 2012); it adheres closely to the model and relies heavily on OAIS terminology. It includes a guidance and implementation section that specifies preservation activities, indicating who in the organization conducts each activity.

The CMS preservation policy identifies four levels of complexity in the CMS content, and outlines a distinct preservation policy for each level of data. The policy also emphasizes CERN’s commitment to open access, writing that “open access to the data will, in the long term, allow the maximum realization of their scientific potential” (p. 1). CERN plans to archive data via third parties, so this policy does not include repository–specific preservation planning.

NSIDC’s preservation policy acknowledges the importance of keeping policy up to date, explicitly stating that “to keep this document pertinent and effective, it should be reviewed frequently” (p. 3). However, reflecting the challenge of reconciling policy with practice, the policy has not been revised since its development in 2004, leading to outdated technical information. Apart from this issue, the policy is robust and detailed. Its sections include: Data Solicitation and Acceptance, Levels of Service, Metadata and Data Format Standards, Data Set Documentation, Archive Policies, Architecture and Security, Data Tools, Data Deletion and Retirement, and Data Access.

Of the policies surveyed, DataONE’s preservation policy comes closest to the kind of general scientific data policy that Dryad aimed to develop in order to fit its broad collecting policy. DataONE’s policy does not speak to specific implementation, but rather outlines a broad preservation practice using a three–tiered system: “keep the bits safe,” “protect the form, meaning, and behavior of the bits,” and “safeguard the guardians.”

Marcial and Hemminger (2010) used Google–based searches to determine that 62 percent of scientific data repositories had “a clear mention of a preservation policy or similar” [7]. However, the authors of this paper found that a mention of preservation policy did not necessarily equate to a published policy. We found few written policies available online. We conclude that there is a continuing need for general scientific data preservation policies that can be used to guide other data repositories with similar holdings.

Data preservation policy development process

Dryad’s preservation policy is the result of recent efforts that have built on basic, ongoing preservation activity. To give a full picture of this activity, this section first reviews Dryad’s basic, ongoing preservation practices, then describes the process of developing a more sophisticated preservation policy, and finally presents Dryad’s preservation policy to date.

Dryad’s basic preservation practices

A number of basic preservation activities were already being implemented in Dryad before an official policy was developed.

An MD5 checksum is created for every bitstream upon upload. Checksums are verified nightly, and a verification email is sent to repository administrators.
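The fixity check described above can be sketched as follows. This is a minimal illustration, not Dryad’s or DSpace’s actual implementation: the function names, and the assumption that recorded checksums are held in a simple mapping from relative file path to MD5 digest, are hypothetical. The e–mail notification step is omitted.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(recorded: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose MD5 has changed.

    `recorded` maps a relative path to the digest stored at upload time
    (hypothetical storage layout, assumed for illustration only).
    """
    failures = []
    for relative_path, expected in sorted(recorded.items()):
        target = root / relative_path
        if not target.is_file() or md5_of_file(target) != expected:
            failures.append(relative_path)
    return failures
```

A scheduled job could call `verify_fixity` once a night and report the returned list to administrators; an empty list means every recorded bitstream still matches its stored checksum.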

Provenance metadata is automatically created as submissions progress through the Dryad workflow. This metadata provides the name and e–mail address of each actor (usually the depositor and Dryad curators); the date that each action occurred; and a description of the action. Common actions are submission, rejection with documented reason, acceptance into publication blackout, and archiving into DSpace. The provenance metadata is private, and is only visible to logged–in administrators. One problem with the current creation of provenance metadata is that these fields are only automatically created when a submission progresses normally through the workflow. In rare cases, the Dryad Curator moves submissions in uncommon directions across workflow states, and these moves are not recorded automatically. In the future, we hope that provenance metadata can be automatically generated for all actions. Also, until February 2014, all assistant curators worked from the same Dryad account; since then, unique accounts for each assistant curator have provided more detailed provenance information.
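As a rough illustration of the provenance fields listed above, the following sketch models one provenance event and a helper that appends events to a log. The class, field, and function names are hypothetical and do not reflect Dryad’s or DSpace’s actual schema. The point of the sketch is that a single recording function can be called for every workflow action, including manual moves between states.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """One workflow action: who acted, when, and what was done."""
    actor_name: str
    actor_email: str
    occurred_at: datetime
    description: str


def record_event(
    log: List[ProvenanceEvent],
    actor_name: str,
    actor_email: str,
    description: str,
    occurred_at: Optional[datetime] = None,
) -> ProvenanceEvent:
    """Append a provenance event to the log and return it.

    If no timestamp is supplied, the current UTC time is used.
    """
    event = ProvenanceEvent(
        actor_name=actor_name,
        actor_email=actor_email,
        occurred_at=occurred_at or datetime.now(timezone.utc),
        description=description,
    )
    log.append(event)
    return event
```

Recording the actor per event is only informative when each curator has an individual account, which is why shared accounts limit the detail of the provenance trail.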

Historically, Dryad curators have informally encouraged preferred formats. This practice developed organically and is used sparingly. The most common reason for a curator to request an alternative file type from a depositor is if the submitted file is unreadable by free software, and the curator is therefore unable to review the file. When the opportunity arises, preferred file types are also encouraged over the course of normal correspondence with depositors.

Dryad’s servers are hosted by NC State, where regular maintenance is conducted. There is also a server mirror at Duke, which is updated every five minutes and has its own backup processes.

Dryad’s preservation policy builds on the basic practices reviewed above. The first preservation policy was drafted in 2012 by then–curator Elena Feinstein (version 1.0), based on a Digital Curation Centre template (http://www.dcc.ac.uk/sites/default/files/documents/Preservation%20policy%20template.pdf), and with the assistance of Dryad staff. To further pursue the preservation mission, Dryad convened a Preservation Working Group (http://wiki.datadryad.org/Preservation_working_group_2013), composed of preservation experts in the United States and United Kingdom. At its initial meeting, the Working Group reviewed Dryad’s current practices, identified preservation priorities, and outlined broad goals for policy development. Official policy development was initiated in February 2013, led by two authors of this paper (Sara Mannheimer and Ayoung Yoon). See Table 2 for the main phases of policy development.

Table 2: Three main phases of policy development.

Phase 1: Literature review of digital preservation literature and practice

Phase 2: Review of other data repositories’ preservation activities

Phase 3: Drafting the preservation policy document, informed by the Open Archival Information Systems (OAIS) reference model, but ultimately structured according to Dryad’s specific needs

The Working Group recommended resources and advised on several drafts of the policy. Version 2.0 was finalized and presented to the Dryad Board of Directors in May 2013. Taking into account the Board’s responses, the policy was revised in cooperation with Dryad staff (July 2013) and the Preservation Working Group (August 2013). In November 2013, the Board reviewed and approved the policy, dissolved the Preservation Working Group, and made plans to convene a new Preservation Task Force in 2014 to address specific implementation issues: versioning (covering pricing, DOI considerations, and corrections vs. concatenations), preferred file formats, file type migration strategies, and implementation budget.

Results: Dryad preservation policy

The final preservation policy is structured as follows:

1. Purpose

The policy states that its purpose is “to ensure authenticity, reliability, and integrity of research data over the long term so that data can be re–used for research, education, or any other purpose.”

2. Scope and content coverage

Content criteria (corresponds to 3.1 in Dryad’s Terms of Service)
— Must underlie a publication
— No sensitive or illegal material
— No technical problems or viruses

Versioning
— This section will be clarified and expanded by the 2014 Preservation Task Force

Withdrawal of content
— Withdrawal “should be rare and is not part of the normal life cycle for any archived file.”

6. Sustainability plans

Technical sustainability
— Dryad software: open source, periodically synchronized with DSpace
— Dryad data: in the event that Dryad is no longer able to maintain content at the primary location, responsibility will pass to a mirror location

Persistent data accessibility: DOIs

Institutional and financial sustainability
— Governance, business model, and sustainability plan for long term organizational stability and viability: Not–for–profit, overseen by a membership–elected board, partnership between UNC–CH, Duke, and NC State
— Dryad members have a strong interest in seeing repository persist
— Participation in DataONE network ensures future data availability

Discussion

Development of data preservation policy is still a relatively new activity, and we believe it is important to share what was learned from the policy development experience. The following discussion shares specific lessons learned, as well as some open questions raised from the experience.

1. Negotiating what is ideal and what is realistic

Policy development is a negotiation between what is ideal and what is realistic. International standards, models, and best practices exist for long–term preservation. While these standards provide useful insights and guidance for the big picture, implementation should be considered within the boundaries of organizational capability, and adopting these standards also requires embedding local context during implementation.

At the initial policy development stage, we identified standards and models relevant to the Dryad preservation policy, including the OAIS reference model (ISO 14721:2003) (CCSDS, 2012) and PREservation Metadata: Implementation Strategies (PREMIS) (PREMIS Editorial Committee, 2012). Because of the OAIS model’s fundamental significance for developing repositories for long–term preservation, Dryad adopted the model and its concepts at a high level. Dryad’s preservation policy uses the definition of “long term” as stated in the OAIS reference model: “a period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository.” [8] Because the model is conceptual and high level, however, we sometimes needed to clarify terms and localize concepts in the context of Dryad. For instance, while the initial Dryad preservation policy included statements about Submission Information Packages (SIPs) and Archival Information Packages (AIPs), Dryad staff asked how to distinguish SIPs from AIPs during the day–to–day curation activities already in place. PREMIS was also considered from the early stages of this project. However, PREMIS did not align with Dryad’s priorities at that time, and as discussed above, Dryad already had provenance metadata requirements for datasets. Thus, full implementation of PREMIS remains a long–term goal for Dryad.

Dryad is also aware of other standards and guidelines about audit and certification for building a trusted digital repository, such as Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) (Ambacher, et al., 2007) and Data Seal of Approval (DSA) (Sesink, et al., 2008), and has kept certification as a consideration when developing policies and procedures. However, Dryad’s current efforts will focus more on putting the planned preservation infrastructure in place, such as format migration, before going through a formal self or external auditing process.

2. Aligning internal and external policies

A second lesson learned was that preservation policy must align with other internal and institutional policies. In order to follow Dryad’s internal policies, we looked primarily to Dryad’s Terms of Service document (https://datadryad.org/pages/policies), which includes policies on submission, content, payment, usage, and privacy. We aimed to have no overlap between the preservation policy and the Terms of Service, and therefore chose to link to the Terms of Service when necessary.

We also aimed to comply with Dryad’s unofficial policies, and policies that have yet to be finalized. One example of a policy–in–progress is Dryad’s policy on versioning. An initial policy on versioning was developed in the first quarter of 2009 (http://wiki.datadryad.org/Track_Version_Changes). The versioning feature is currently deployed for curator use only, and in a few rare cases, updated files have been submitted. However, the tool is still in the process of being streamlined and debugged before being deployed publicly. The versioning policy is currently being revisited by the Dryad Board of Directors, and will be one of the specific issues addressed by the 2014 Preservation Task Force.

Dryad has strong operational activities at both the Metadata Research Center (UNC–CH) and Duke University, and receives server support from NC State. It follows that Dryad’s policies are influenced, in part, by the policies at each of these three institutions. Knowledge of the policies at NC State was especially important, since Dryad’s servers are maintained and housed there. Historically, Dryad has been relatively hands–off with the servers, but in order to ensure preservation, NC State’s policies and procedures surrounding server maintenance and security must agree with Dryad’s preservation policy.

3. Meeting Dryad’s specific needs

Preservation policy should ultimately be structured according to Dryad’s specific needs. Meeting specific organizational needs is fundamentally important and should be the first consideration in all work, as each organization has different goals, priorities, and capabilities.

This becomes apparent when preservation policy deals with data depositors’ requirements. Dryad has minimal requirements for depositors, which might seem to be a very different approach compared to some other data repositories that require more submission information. Making submission as easy as possible for data depositors has been important in supporting Dryad’s mission of encouraging data sharing through open and widespread availability. The minimal requirements may lead to questions, such as: is the minimal representation information sufficient for preservation and reuse? All self–deposit repositories must strike a balance between “minimum efforts” and having “enough” representation information. Dryad’s decision to require minimum efforts from depositors is compensated, on some level, by other factors related to Dryad’s submission process. Unlike some other data repositories, Dryad only accepts data that are linked to papers or publications, making it possible to acquire metadata mandated by journals. This link provides a richer context for understanding the data in Dryad, and helps to enhance the metadata experience from both curation and reuse perspectives.

4. DSpace constraints

Dryad is built on a DSpace platform, and some development–related constraints are unavoidable. Thus far, Dryad has been committed to customizing DSpace to function according to its specifications, developing whatever special features are required to create a database with wide functionality. However, many preservation activities (e.g., PREMIS implementation, versioning, and format migration) would require substantial additional customization of DSpace, and could therefore require extensive developer hours to develop, implement, and debug. Developer time is spread across many projects at Dryad, so prioritizing time for preservation is an issue that must be addressed by Dryad administrators.

Conclusion

Dryad’s initial steps in developing a preservation policy required compromise and balance. International standards, models, and best practices that exist for long–term preservation (e.g., the OAIS reference model and PREMIS) provide valuable guidelines for preservation activities. During our policy development process, we came to understand that some departure from these standards was necessary, and that policy development would need to unfold in increments in order for long–term preservation practices to be manageable into the future.

The development of Dryad’s preservation policy is just the first step in an ongoing process of preservation efforts. The preservation policy, as a living document, should evolve alongside the repository. Dryad will make an ongoing effort to adopt preservation standards, models, and best practices, and the policy will be continually revised and re–evaluated as preservation practices are refined. A key goal of the 2014 Preservation Task Force will be to develop a strategic plan to complement the preservation policy. After the strategic plan is in place to bridge policy and practice, TRAC or DSA may be implemented.

Preservation policy does not exist in a vacuum; it must interoperate with other organizational functions. To address this, we aimed to structure Dryad’s preservation policy to fit the real world of the Dryad Repository, including broad governing policies, current practices, and system constraints. We also designed the policy to be aspirational. We included short– and long–term goals in order to facilitate growth and begin to mold Dryad’s workflow to approach the ideal.

As documented in Table 1, Dryad’s policy development was aided significantly by a review of existing data preservation policies. However, at the time of policy development, Dryad staff were unable to identify a policy that could be adopted immediately. Just as existing policies informed our process, Dryad’s policy has the potential to inform preservation policy development at other repositories.

About the authors

Sara Mannheimer is the Data Management Librarian at Montana State University, supporting research data services and promoting open access. Her research focuses on digital preservation, digital curation, and data sharing. She was Dryad’s curator from June 2013–February 2014.
E–mail: sara [dot] mannheimer [at] montana [dot] edu

Ayoung Yoon is a doctoral candidate at the School of Information and Library Science, University of North Carolina at Chapel Hill. Her research interests include data reusers’ trust in data and data repositories, data curation, and personal digital archiving. She has an M.S.I. in both preservation and archives and record management from the University of Michigan School of Information, and B.A. in history from Ewha Womans University, South Korea.
E–mail: ayyoon [at] email [dot] unc [dot] edu

Jane Greenberg is a professor at the School of Information and Library Science and the director of the Metadata Research Center at the University of North Carolina at Chapel Hill. She is co–Principal Investigator of the Dryad grants. Her research focuses on metadata, knowledge organization/ontological engineering, data science, and linked data.
E–mail: janeg [at] email [dot] unc [dot] edu

Elena Feinstein was Dryad’s curator from 2009–2013. She is currently the Librarian for Chemistry and Biological Sciences at Duke University, where she continues to promote data sharing and open science.
E–mail: elena [dot] feinstein [at] duke [dot] edu

Ryan Scherle is the repository architect at Dryad Repository. He has a Ph.D. in computer science from Indiana University, Bloomington.
E–mail: ryan [at] datadryad [dot] org

Acknowledgements

This work was supported in part by National Science Foundation (NSF), Award number: 1147166/ABI Development: Dryad: scalable and sustainable infrastructure for the publication of data.

This work relied on the expertise of the Dryad Preservation Working Group (2009–2013):

Kenneth Thibodeau, 2002. “Overview of technological approaches to digital preservation and challenges in coming years,” The state of digital preservation: An international perspective: Conference proceedings, at http://www.clir.org/pubs/reports/pub107/thibodeau.html, accessed 24 April 2014.

To the extent possible under law, Sara Mannheimer, Ayoung Yoon, Jane Greenberg, Elena Feinstein, and Ryan Scherle have waived all copyright and related or neighboring rights to “A balancing act: The ideal and the realistic in developing Dryad’s preservation policy.”