Research Data Management
Canadian National Research Data Repository Service
Progress Report, June 2016

As their digital datasets grow, researchers across all fields of inquiry are struggling to manage them. Individual researchers have a strong motivation to be able to find, access, and analyze their own data once it is produced. They also need to share datasets with colleagues, to preserve datasets for later reuse, and to combine their datasets with others from within or beyond their disciplinary area. For the research community, the reproducibility of scientific results drives a need for open, managed, accessible datasets that allow results to be independently validated. For policy makers, the data output of government-funded research is seen as a valuable asset to be preserved and shared for the general good.

The recent Tri-Agency Statement of Principles on Digital Data Management (June 15, 2016) (http://www.science.gc.ca/default.asp?lang=en&n=83f7624e-1) sets the expectation: Research data resulting from agency funding should normally be preserved in a publicly accessible, secure and curated repository or other platform for discovery and reuse by others. The Statement also includes expectations for researchers to create data management plans and to provide the metadata necessary to facilitate understanding and reuse.

Using Research Data Management (RDM) techniques, data can be ingested, curated, preserved, discovered, shared and transported. Researchers need the technical tools, skills and support to enable effective RDM.

CC and CARL Partnership for a national platform for Research Data Management

Compute Canada (CC) and the Canadian Association of Research Libraries (CARL) have agreed to collaborate to build a scalable national platform for research data management and discovery.

The proposed pan-Canadian platform will provide tools and services to support the curation, access, discoverability, and preservation of research data, giving researchers across Canada in a range of disciplines improved access to and control of large amounts of data. This addresses a longstanding gap in Canada's infrastructure for digital research data management. The Portage Network of CARL will assist with the requirements for and design of the national platform service, providing metadata and data workflow solutions and testing the platform. Compute Canada will provide project management and software development expertise and the necessary computational power.

This RDM service is not intended to serve as a monolithic solution for all of Canada's research data needs. Rather, it is meant to provide a framework that allows existing and future data repositories to be federated within a coherent system. At the same time, it will provide a flexible repository and preservation system for Canadian researchers and institutions that do not already have a solution in place.

Proposed Core Features of the Service

Federated storage model: Individual institutions or organizations can deploy storage locally and can federate their local repository into the national system.
Federated support model: On-campus support helps the researchers who generate the data to manage those data.
Nationally integrated: While the storage and support are distributed, a coherent national service is provided to researchers regardless of their location or field.
Scalable model: The system can scale as adoption by researchers and the volume of stored data grow.
National data discovery: While different data collections can be hosted in different locations, with different access controls and different metadata, the various data collections are discoverable through a web-based, federated search tool.
Data preservation: Researchers and institutions can choose to preserve data in multiple locations in long-term preservation formats. A long-term institutional commitment is required for any preserved datasets.
Suitable for a broad range of data types: The favoured solution is suitable for managing diverse datasets from a broad spectrum of disciplines, typically referred to as the long tail of data.
Bulk data and metadata ingestion: The system is able to ingest and index existing data and metadata from Canadian researchers.
Access control mechanisms: The solution allows fine-grained control of who can discover and download each dataset, and supports embargo.

The technology platform as envisioned

Most data that is indexed and made available for national discovery is expected to be housed in curated collections. The role of the curator is to ensure the quality of both the data and the associated metadata. Curation could be performed at the project level (e.g. by a member of the research team), at an institutional level, or at a national level, by granting collection curation privileges to specific people. A self-serve option for small collections without curation is also envisioned, primarily for active research datasets.

Once data is ingested by the system, any researcher can discover and access the collection, if allowed by the data's access policies (which are highly granular). A researcher who can access the data can use Globus Connect to transfer it to a Compute Canada processing facility, to any other Globus-enabled facility in the world, or to their laptop for further analysis. Each dataset is assigned a unique identifier (such as a DOI), so that the dataset can be cited, the data owner can be credited, and the work can be reproduced by others.

It will also be possible to federate existing data repositories into a national service. The national service can harvest metadata from existing repositories, which allows search over all repositories from a single web interface. This basic level of federation requires only the support of agreed-upon data exchange standards and protocols. A deeper federation would allow Globus transfers of any discovered dataset from the federated repository to any Globus-enabled facility, as with data deposited directly into the national service. This deeper federation requires repository-by-repository software development.
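As an illustration of the fine-grained access control and embargo support described above, the following is a minimal sketch only. The `AccessPolicy` fields, the function names, and the rule that explicitly authorized users may download during an embargo are all assumptions made for this example; the report does not specify how the platform models access policies.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical per-dataset policy; all field names are illustrative."""
    public_discovery: bool = True           # may anyone find the dataset in search?
    public_download: bool = False           # may anyone download it?
    allowed_users: Set[str] = field(default_factory=set)  # explicitly authorized users
    embargo_until: Optional[date] = None    # public download is blocked before this date


def can_discover(policy: AccessPolicy, user: Optional[str]) -> bool:
    """Discovery is allowed if the dataset is publicly listed or the user is authorized."""
    return policy.public_discovery or user in policy.allowed_users


def can_download(policy: AccessPolicy, user: Optional[str], today: date) -> bool:
    """Assumed rule: authorized users always may download; others wait out the embargo."""
    if user in policy.allowed_users:
        return True
    if policy.embargo_until is not None and today < policy.embargo_until:
        return False
    return policy.public_download


# Example: a publicly listed dataset under embargo until 1 January 2017.
policy = AccessPolicy(
    public_download=True,
    allowed_users={"alice"},
    embargo_until=date(2017, 1, 1),
)
```

Keeping discovery and download as separate checks mirrors the distinction the feature list draws between who can discover and who can download a dataset.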

Building the Infrastructure

Leveraging the work of a national federated pilot project convened by Research Data Canada (RDC) in , CARL and Compute Canada launched a two-year joint project in January 2016 to build a national-scale research data repository tool and preservation suite, capable of providing the technical foundation for a national RDM service. This document provides only a high-level overview of this project.

The project technology leverages three existing products: Globus Data Publication (a data repository service), Globus Connect (a large file transfer service) and Archivematica (a Canadian open-source data preservation package). The technology also includes custom-built software to integrate these packages into a coherent solution. Compute Canada has also partnered with Globus (www.globus.org) to assist with the integrations and to accelerate vendor development of the features in Globus software that researchers need.

Curation, Training, Support, and Infrastructure

Technology development is only one piece of what would be necessary to establish and operate a national research repository platform and to ensure that it is well used and meets researcher needs. Other pieces would include:

the engagement of researchers and of librarians and curators working out of their institutions across the country;
training and support of researchers throughout the country in RDM practices and the technology used; and
the required IT infrastructure, and its operation, to underpin the national repository and federated discovery.

Project Governance

The Steering Committee for the development project comprises representation from Compute Canada and the Canadian Association of Research Libraries: Dugan O'Neil, Chuck Humphrey, Steve Marks, and Jason Hlady.

Stakeholder Group: A broad stakeholder group is being set up to keep interested parties informed about progress in the project and the service. Anyone can request to be added to an email list to become part of this Stakeholder Group, to receive updates, and to comment on the evolving National RDM Service. This list is run as a Google Group at To ask to be added to the Stakeholder Group, send to

Contact the Technology Project:
Project Sponsor
Lead Developer
Project Manager
Web site

Progress to June 2016

The technology development started in January 2016. Six months into development, there are a number of successes to report. All work so far is early development only; there are no production services and no versions ready for end-user testing.

Currently, datasets (including datasets with large files or large numbers of files) can be uploaded into the development data repository running on Compute Canada hardware. Data transfers, which could take hours to complete for very large datasets, are performed asynchronously. Datasets normally receive some degree of standards-based preservation processing automatically, to help ensure that they will be stored in formats that remain usable in the future.

Metadata (information that describes the data) is collected from the repository and indexed for discovery, together with metadata collected about the datasets held in other data repositories (institutional, regional, and domain-specific) in Canada. Indexed data can be searched, and results can be restricted by filtering on facets such as data type, date, subject and source repository. A dataset can be accessed or copied for reuse from whichever repository holds it, depending on the access restrictions that may exist for that dataset at the source.
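The faceted filtering described above can be illustrated with a small in-memory sketch. This is not the project's search implementation; the record fields (`data_type`, `year`, `subject`, `source`), the sample records, and both function names are made up for the example.

```python
from collections import Counter

# Made-up index entries; real harvested metadata would be far richer.
RECORDS = [
    {"title": "Arctic sea ice thickness", "data_type": "tabular",
     "year": 2014, "subject": "climate", "source": "Repository A"},
    {"title": "Ice core isotope series", "data_type": "tabular",
     "year": 2015, "subject": "climate", "source": "Repository B"},
    {"title": "Urban transit survey", "data_type": "survey",
     "year": 2015, "subject": "transport", "source": "Repository A"},
]


def search(records, query="", **facets):
    """Return records whose title contains `query` and that match every facet value."""
    q = query.lower()
    return [
        r for r in records
        if q in r["title"].lower()
        and all(r.get(name) == value for name, value in facets.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


# A keyword search, then the same search narrowed to one source repository.
hits = search(RECORDS, "ice")
narrowed = search(RECORDS, "ice", source="Repository A")
```

A discovery interface would typically show the output of something like `facet_counts(hits, "source")` beside the result list, so that users can see how many results each filter would leave.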

Specific technical achievements of the development project, January-June 2016:

Through a formalized two-year project partnership agreement between Globus and Compute Canada, the development team obtained access to the Globus Publication software code and created development instances of Globus Publication on Compute Canada hardware.

Developed reproducible methods for automated deployment of Globus Publication and Archivematica.

Deployed development instances of Archivematica on Compute Canada hardware.

Developed a robust integration between Globus Publication and Archivematica, such that datasets submitted to the repository can be automatically processed by Archivematica, with the resulting Dissemination Information Package (DIP) passed back to Globus Publication. Data submitters can monitor the process.

Designed and built a metadata harvester to collect metadata from selected Canadian research data repositories using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). All metadata exposed by repositories through OAI-PMH is harvested and made searchable through discovery interfaces.

Globus has developed a new Globus Search Platform, running on Amazon Web Services, with an API that will be used for developing a new discovery user interface for the repository. The discovery interface will enable users to find and access datasets in the national repository or in other repositories from which metadata has been harvested. A prototype interface has been integrated with the development repository. The API supports faceted search and searching on custom metadata in repository collections.

Improved the Globus Publication interface to support internationalization in both official languages of Canada, and completed project branding for the development stage of the proposed service.

Completed the first stage of performance profiling of Archivematica, in preparation for future development to improve performance and scalability and to implement new features.

Shared improvements and contributed bug fixes back to the Archivematica and DSpace open source projects.

Delivered presentations on the technology and project:
o Lightning talk at Globus World, Chicago, April 2016
o Presentation at 5th National Data Service Consortium Workshop, Chapel Hill, North Carolina, April 2016
o Presentation at CANHEIT/HPCS, Edmonton, June 2016
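To make the OAI-PMH harvesting mentioned among the achievements above more concrete, here is a minimal sketch of a `ListRecords` harvest loop using only the Python standard library. It is not the project's harvester: the function names, the `fetch` callable, the sample response, and the choice to extract only identifiers and Dublin Core titles are assumptions for illustration. The `verb`, `metadataPrefix` and `resumptionToken` arguments and the XML namespaces are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumptionToken replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    # A missing or empty token means this was the last page.
    return records, ((token or "").strip() or None)


def harvest(base_url, fetch):
    """Yield records from every page; `fetch(url)` must return the response body as text."""
    token = None
    while True:
        page = fetch(list_records_url(base_url, resumption_token=token))
        records, token = parse_list_records(page)
        yield from records
        if token is None:
            break


# Hypothetical single-record response used to exercise the parser offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

Passing the HTTP call in as `fetch` keeps the sketch testable without a network; in practice it could wrap `urllib.request.urlopen` and decode the response body.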
