Research Data

For the purpose of this briefing paper, research data are defined as the factual records used as primary sources for research, and that are commonly accepted in the research community as necessary to validate research findings.

As with open access to publications, there is a broad international trend towards open access to data. This trend, often referred to as 'data sharing', is reflected in numerous international reports published over the last decade that have called for greater sharing of research data within and across disciplines. These reports assert that improving access to research data would have significant benefits for research and society, such as accelerating scientific progress, avoiding the duplication of research, enabling replication and verification of research results, and increasing the visibility and impact of research.

Most recently, in October 2010, a High Level Expert Group submitted a report to the European Commission that described a vision for a pan-European data infrastructure, with links to the international community. The report “identifies the benefits and costs of accelerating the development of a fully functional e-infrastructure for scientific data – a system already emerging piecemeal and spontaneously across the globe, but now in need of a far-seeing, global framework. The outcome will be a vital scientific asset: flexible, reliable, efficient, cross-disciplinary and cross-border.”[1]

Several Canadian consultations over the past decade have discussed the potential benefits of data sharing in Canada:

In October 2002, SSHRC and the National Archivist of Canada established a Working Group that recommended the creation of a new national research data archival service.

In November 2004, NRC, in partnership with CFI, CIHR and NSERC, undertook a National Consultation on Access to Scientific Research Data (NCASRD) in the natural and medical sciences community. The final report provides a “road map” for the implementation of a national plan for open access to publicly funded scientific research data.

In March 2010, a report entitled Canadian Digital Information Strategy: Final Report of Consultations with Stakeholder Communities 2005–2008, published by Library and Archives Canada after extensive consultations with organizations across Canada, called for greater sharing and preservation of research data within governments and the research community.

In 2010, the Canadian government published a Digital Economy Consultation Paper. The paper says, “Governments can help by making publicly-funded research data more readily available to Canadian researchers and businesses. Open access is consistent with many national strategies and holds great economic potential for Canadians to add value to machine-readable data, while ensuring that privacy rights are protected. In many cases, data are already available but are difficult to locate. Consistent methods of access will be reinforced.”[2]

Canada has not yet put into practice the recommendations from these various reports.

In 2008, a multi-stakeholder group called the Research Data Strategy Working Group (RDSWG) was formed to address the challenges and issues surrounding access to and preservation of data arising from Canadian research. The Working Group includes representatives from universities, data centers, research libraries and CIOs, the granting agencies, government science departments and agencies, Compute Canada, and the research community. Its activities focus on the actions and leadership roles that organizations can take to ensure Canada's research data is accessible and usable for current and future generations of researchers. The Working Group is currently planning a Research Data Summit for senior policy makers and university administrators in order to develop a roadmap for more comprehensively managing research data in Canada.

The trend towards greater data sharing in the research environment is developing in parallel with a trend towards “open data”. Open data initiatives are aimed at expanding access and creative use of government-generated data into the non-governmental sphere by encouraging innovative ideas, tools and web applications. In March 2011, the Government of Canada launched the Open Data pilot project, an online data portal that provides access to a large number of government datasets through a single window. The data can be reused by application developers for commercial or research purposes[3].

Data sharing policies are most often developed and implemented by research funding agencies and in some cases research projects, and less commonly adopted by universities.[4] While data sharing policies differ across organizations, typical policy elements go beyond asking researchers to retain research data for a given period of time and include more comprehensive requirements to ensure that data are both retained and available to others.

Requirements for data sharing can range from full public open access, to sharing with specific researchers upon request, to access governed through restrictive licenses, depending on the sensitivity of the data, the size and complexity of the data set, their perceived reuse value, and the availability of a repository.

Typical policy elements of data sharing policies are described below:

Data management plans: Investigators are required to submit a data management plan with their funding proposals. These plans ensure that researchers consider ahead of time how they will manage and share their data.

Data quality and standards: Investigators are required to adhere to international standards that will ensure the data is accessible by others.

Data documentation: Data documentation and metadata must accompany data so that the data is understandable by others.

Method of data sharing: Investigators are required to (1) deposit data in relevant subject or institutional repositories; (2) where there are no repositories, hold the data locally and make it available through a web-based presence; or (3) retain data so that other researchers can have access to it upon request.

Timing of data sharing: Investigators must make data accessible within a given period of time after publication of research results.

Data retention: Data should be retained for a minimum number of years (on average 5 years).

Data preservation: Investigators must deposit their data in a long-term repository, where available, to ensure the preservation of their data.
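The policy elements above can be read as a checklist. As a minimal sketch only, the following Python shows how a data management plan might be checked against those elements. The `DataManagementPlan` record, its field names, and the sharing-method labels are hypothetical and do not come from any funder's actual form; the default of five years simply mirrors the average retention period mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataManagementPlan:
    """Hypothetical record of a plan; all field names are illustrative assumptions."""
    standards: List[str]               # standards the data will adhere to
    has_documentation: bool            # documentation and metadata accompany the data
    sharing_method: Optional[str]      # "repository", "local_web", or "on_request"
    months_to_release: Optional[int]   # months from publication until data is accessible
    retention_years: int               # how long the data will be retained
    preservation_repository: Optional[str] = None  # long-term repository, where available


# Assumed labels for the three sharing methods described in the text.
ALLOWED_SHARING_METHODS = {"repository", "local_web", "on_request"}


def missing_elements(plan: DataManagementPlan, min_retention_years: int = 5) -> List[str]:
    """Return the names of the policy elements the plan does not yet address."""
    gaps = []
    if not plan.standards:
        gaps.append("data quality and standards")
    if not plan.has_documentation:
        gaps.append("data documentation")
    if plan.sharing_method not in ALLOWED_SHARING_METHODS:
        gaps.append("method of data sharing")
    if plan.months_to_release is None:
        gaps.append("timing of data sharing")
    if plan.retention_years < min_retention_years:
        gaps.append("data retention")
    # Preservation applies only "where available", so it is not treated as a gap here.
    return gaps
```

A real policy would set its own thresholds and wording; the point of the sketch is only that each element can be stated as a concrete, checkable condition.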

There are also a number of common exceptions that are often included in data sharing policies:

Privacy and confidentiality of data: The privacy of individuals who participate in research and the confidentiality of the data must be protected at all times. Data intended for broader use must be free of identifiers that would permit linkages to individual research participants and variables that could lead to deductive disclosure of the identity of individual participants. In some cases where data cannot be stripped of identifiers, for example longitudinal studies that collect data over a period of time and must compare data points, data may be exempted from the data sharing requirements or data sharing may be qualified.

Intellectual property: Policies may permit delays in sharing research data for a period of time, in cases whereby institutions or researchers are applying for patents or developing new applications based on that data.

Traditional knowledge: Where local and traditional knowledge is concerned, rights of the knowledge holders shall not be compromised.

Sensitive data: Where data release may cause harm, specific aspects of the data may need to be kept protected (for example, locations of nests of endangered birds or locations of sacred sites, or data related to national security).
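To illustrate the privacy exception above, here is a minimal Python sketch of removing direct identifiers from tabular records before sharing. The column names in `DIRECT_IDENTIFIERS` are assumptions made for illustration. The sketch covers only the first half of the requirement: it does not address deductive disclosure, where a combination of remaining variables could still identify a participant.

```python
from typing import Dict, Iterable, List, Set

# Assumed column names, for illustration only; a real study would define its own list.
DIRECT_IDENTIFIERS: Set[str] = {"name", "email", "health_card_number"}


def strip_identifiers(
    rows: Iterable[Dict[str, str]],
    identifiers: Set[str] = DIRECT_IDENTIFIERS,
) -> List[Dict[str, str]]:
    """Return copies of the rows with the identifier columns removed.

    The input rows are left unchanged so the original data can be retained
    under whatever access controls apply to it.
    """
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in rows
    ]


records = [
    {"name": "Participant A", "email": "a@example.org", "age": "34", "region": "North"},
]
shareable = strip_identifiers(records)
```

Here `shareable` holds only the `age` and `region` fields. Whether those remaining fields are safe to release is a separate judgement that depends on the study.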

In Canada, as elsewhere, data sharing practices are very discipline specific. In some fields, such as genomics, proteomics, high-energy physics, and astronomy, data archiving and sharing is the norm. In other fields no such traditions exist. There is, however, a growing awareness across the scholarly community that there are significant potential benefits in making research data available for re-use.

In 2004, Canada along with 33 other countries (including the US, China, Japan and many European countries) adopted the OECD Declaration on Access to Research Data From Public Funding[5]. The underlying principles of this declaration are that publicly-funded research data are a public good, produced in the public interest, and that they should be openly available to the maximum extent possible.

These same principles have been reflected in other discipline-based initiatives. CIHR, for example, has recently signed a joint statement with the Wellcome Trust, the National Institutes of Health and other health funding agencies. It is a statement of intent to improve data sharing and reads, “we, as funders of health research, intend to work together to increase the availability to the scientific community of the research data we fund that is collected from populations for the purpose of health research, and to promote the efficient use of those data to accelerate improvements in public health.”[6]

Both SSHRC and CIHR have policies related to research data. SSHRC’s Research Data Archiving Policy has been in place since 1990. The policy states that “All research data collected with the use of SSHRC funds must be preserved and made available for use by others within a reasonable period of time. SSHRC considers “a reasonable period” to be within two years of the completion of the research project for which the data was collected.” There are few mechanisms in place, such as data repositories and data management expertise, to support researchers in preserving their data and there is no oversight regarding the implementation of this policy.

As part of its broader policy on access to research outputs, CIHR requires grant recipients to deposit certain data types (bioinformatics, atomic, and molecular coordinate data) into the appropriate public database immediately upon publication of research results. CIHR also requires researchers to retain original data sets arising from CIHR-funded research for a minimum of five years after the end of the grant. CIHR has indicated that it will review and update this policy on an annual basis or as needed.[7]

Projects funded by Genome Canada must comply with its Policy on Data Release and Resource Sharing, with expectations to share data and resources as rapidly as possible. At a minimum, data is expected to be released and shared “no later than the original publication date of the main findings from any datasets generated by that project.” [8] At the completion of a project, all data is to be shared without restriction. In addition, applicants must submit a Data and Resource Sharing Plan with each funding application. Genome Canada has an additional policy on Intellectual Property [9] to ensure the proper management of acquired data and resources.

There are also an increasing number of data sharing policies at the level of the research project. The NEPTUNE Project, an underwater ocean observatory at the University of Victoria, which makes huge volumes of data openly available to the public, has a Data Access Policy. The International Polar Year (IPY), a large scientific program focused on the Arctic and the Antarctic from March 2007 to March 2009, had a comprehensive data policy which “requires that IPY data, including operational data delivered in real time, are made available fully, freely, openly, and on the shortest feasible timescale.”[10] Dozens of Canadian research projects were selected for IPY funding from a variety of sources including the federal government, territorial governments, granting agencies and foundations.

There are also several other policies in Canada governing the management of research data:

The 2nd edition of the Tri-Council Policy Statement on the Ethical Conduct for Research Involving Humans (TCPS) sets out privacy and confidentiality requirements for researchers working with human participants, including for secondary use of research data. The policy emphasizes that respect for privacy in research is an internationally recognized norm and ethical standard.

All data in Canada collected, used or disclosed during the course of commercial activities are also subject to the Personal Information Protection and Electronic Documents Act.

Researchers that use federal government data may also be governed by data policies and licence agreements in terms of the reuse and accessibility of their data.

Policies that require data sharing cannot be implemented without corresponding infrastructures and other support mechanisms. Data cannot remain on the hard drives of researchers, but must be transferred to an environment where they are managed appropriately.

Ensuring the long term accessibility of research data is a complex and resource intensive process. Data must be created and maintained in a manner consistent with the goal of long-term preservation and involves active data management throughout the life-cycle of the data, beginning at the time they are first envisioned. The data must also be integrated into an enduring institutional environment supported by a stable digital repository.

There are some large-scale international data repositories in certain fields, such as PubChem, GenBank, Protein Data Bank, Digital Sky Survey, World Data Centers, Global Biodiversity Information Facility, International Virtual Observatory Alliance, and the Inter-university Consortium for Political and Social Research (ICPSR). These repositories collect data from around the world and provide broad access to the data in order to further research and knowledge creation. The vast majority of these archives are funded through government departments and/or funding agencies in the country in which they are housed.

In addition, governments around the world maintain repositories that house data in many areas deemed of national importance, including climate data, population statistics and health data. The data housed in these government repositories are typically generated by governments, but are often made accessible to academic researchers for their research (though often through a pay per use option).

Similarly, Canada has a number of large government repositories and discipline-based repositories managed by universities and research centres. However, according to a Gap Analysis conducted by the Research Data Strategy Working Group in 2008, there are large gaps in both coverage and capacity of data repositories in Canada. Repositories do not exist for all subject areas, and the vast majority of research data still rests on researchers’ hard drives or locked in cabinets. Only a few active data repositories in Canada allow researchers to deposit their data.[11]

Institutional repositories, based at universities, have until recently put emphasis on the deposit of textual research output (e.g., journal articles and theses). The scope of these repositories is gradually being extended to cover research data as well, but the overall number of stored datasets is still very low. While institutional data repositories hold promise for the future with the advantage of being close to researchers, they are short of expert know-how and resources. As well, the business case for supporting a data repository is not yet clear for many research institutions.

To address the current lack of infrastructure in Canada, the Canadian Association of Research Libraries (CARL) is proposing to develop a national network of repositories for collecting research data, in collaboration with other partners. The vision for the project is to develop data repositories at Canada's universities in which researchers could deposit their data and link them with discipline-based repositories so that data can be integrated and reused in new ways. The project is currently in its initial stages. However, once the conceptual model has been developed, CARL will be seeking CFI funding that would enable them to lay the foundations for this project.

Few, if any, countries currently have the infrastructure required to support widespread data sharing policies. However, several other jurisdictions are moving towards implementing the support mechanisms required to facilitate the widespread sharing and re-use of research data.

2.5.1 European Commission

In terms of data sharing policies, the EC, through the Seventh Framework Programme (FP7), requires that all research projects develop a preliminary data management plan as part of their proposals describing how data derived from the project will be managed.

The EC does not maintain data repositories, but through FP7 it has been funding e-infrastructure projects in EU member states, including the development of discipline-based data repositories. One example of these projects is DARIAH (Digital Research Infrastructure for the Arts and Humanities)[12], which aims to enhance and support digitally-enabled research across the humanities and arts. DARIAH is developing repository infrastructure that will support ICT-based research practices. Researchers will be able to go to DARIAH to find data and tools, archive their data, and exchange information and advice in the field of metadata and digitization. Construction of DARIAH will begin sometime in 2011.

2.5.2 Netherlands

In the Netherlands, the Research Data Forum[13] has recently been launched in order to improve how research data is managed and to enable better access to such data for scientists/scholars and the public. The forum brings together initiatives developed by a number of different organizations and focuses on the technical, infrastructural, legal, and organizational aspects of storing research data and making it accessible.

The Royal Netherlands Academy of Arts and Sciences and the Netherlands Organisation for Scientific Research maintain the Data Archiving and Networked Services (DANS). Since its establishment in 2005, DANS has been storing and making research data in the arts, humanities and social sciences permanently accessible. DANS maintains a permanent archiving service, stimulates others to follow suit, and works closely with data managers to ensure as much data as possible is made freely available for use in scientific research. DANS is open to all researchers in the arts, humanities and social sciences in the Netherlands, and enables them to both store their data and to search for data themselves.

2.5.3 United Kingdom

The UK has some of the most comprehensive data sharing policies of any government. Four of the seven Research Councils within RCUK have data policies in place that require their researchers to make their research data available “with as few restrictions as possible in a timely and responsible manner to the scientific community for subsequent research.”[14] The policies vary, but generally researchers are also expected to make use of existing standards for data collection and management and make data available through existing community resources or databases where possible.

The UK also has a very robust infrastructure of discipline-based data repositories for collecting research data, managed by several of the RCUK funding agencies, and they have been providing centralized funding to develop university based repositories that are capable of collecting research data. The UK also has set up the Digital Curation Centre (DCC)[15], a centre of expertise for curating digital research data. In addition to providing expert advice and training to researchers in the area of data management, they are a gateway to the technical solutions, curation tools and learning resources that can help data custodians build capacity for digital curation.

2.5.4 Australia

In Australia, the funding agencies have not implemented data sharing policies but are investing heavily in infrastructure. In 2008, Australia launched a comprehensive national program for data sharing called the Australian National Data Service (ANDS)[16] as part of its National Collaborative Research Infrastructure Strategy.

The aim of ANDS is to create the infrastructure to enable Australian researchers to easily publish, discover, access and re-use research data. Their approach has been to engage in partnerships with the research institutions to support better local data management that enables structured collections to be created and published. ANDS then connects those institutional collections so that they can be found and used through the Australian Research Data Commons. The Australian Research Data Commons represents a significant change in their perspective towards research data, considering data as a strategic national resource.

Through ANDS, the Australian government is investing over 10 million dollars per year to support the development of data repositories, metadata and support services, and centralised access services through the Data Commons.

2.5.5 United States

In the US, both the National Institutes of Health (NIH) and the National Science Foundation (NSF) have policies in regards to data sharing. NIH has had a data sharing policy since 2003. The policy applies only to projects submitting a research application requesting $500,000 or more of direct costs in any single year. The policy states that “Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data. NIH investigators are also expected to include a plan for sharing final research data for research purposes, or state why data sharing is not possible.”[17]

NSF has a policy on dissemination and sharing of research results that reads, “investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.”[18] NSF has also recently announced new requirements that all NSF proposals include a data management plan in the form of a two-page supplementary document describing how researchers will conform to the policy. According to the NSF, “This is the first step in what will be a more comprehensive approach to data policy.”[19]

In terms of infrastructure, the US is home to several large-scale discipline-based data repositories supported by the NIH, NSF and other government agencies. In 2005, the NSF instituted an Office for Cyberinfrastructure (OCI). The OCI’s “Cyberinfrastructure Vision for 21st Century Discovery”[20] sets out the vision the NSF is to pursue in making research data accessible. The NSF’s goals for the period of 2006-2010 are to catalyze the development of a system of science and engineering data collections that is open, extensible, and evolvable; and to support development of a new generation of tools and services for data discovery, integration, visualization, analysis and preservation. To realize this vision, NSF has provided $100 million in funding over five years for a program called “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”. The program is working with some of the large scale data repositories to develop “new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams by creating a set of exemplar national and global data research infrastructure organizations”.

National Infrastructure Support: Discipline and national data repositories; varying levels of implementation depending on country; FP7 funding development of data repositories and interoperability across national repositories

Other support programs: Some projects provide discipline based support services. Varying levels of support at the national level

Australia

Legislation: No

Policies: No

National Infrastructure Support: Large scale project to build discipline and institutional data repositories

Other support programs: ANDS provides expertise; support for metadata and standards

Canada

Legislation: No

Policies: CIHR requires data deposit for certain types of data. SSHRC has a policy, but it is not mandatory

National Infrastructure Support: Selective disciplinary data repositories, but not widespread; IPY Data Assembly Network; CFI support for the development of databases during the life of the research project

Other support programs: No

Netherlands

Legislation: No

Policies: No

National Infrastructure Support: Large national disciplinary repositories in Humanities and Arts, Social Sciences and

Other support programs: DANS repository provides central expertise; support for metadata and standards in Humanities

United Kingdom

Legislation: No

Policies: 4 of the 7 research councils have data sharing policies

National Infrastructure Support: Large national disciplinary data repositories attached to funding agencies; a few institutional data repositories

Other support programs: Digital Curation Centre provides expertise; support for metadata and standards

There is growing recognition of the merits of data sharing in principle and in practice in the broader research community.

2.6.1 Researchers

Researchers’ perspectives towards data sharing are very discipline specific. Surveys and interviews undertaken over the last decade have articulated a wide range of opinions on the topic which cannot be easily generalized into a single statement about researchers’ attitudes.

Some fields have a tradition of data sharing and researchers have become comfortable with the concept. In other fields, researchers are still very opposed to making their data available for a number of reasons. Typical objections to data sharing fall in the areas of data ownership, time and skills involved with managing data, and privacy issues involving data about human participants.

A UK study of 16 different disciplines describes a number of factors that explain the differences in disciplinary attitudes[21]:

the heritage and practices of niche research communities;

the type and quantity of data they produce;

the uniqueness of those data and their potential value in terms of reuse;

the propensity of each community to create, adapt or adopt common data formats, metadata schema and other relevant standards;

their willingness to share data in a world where competition for funding looms large;

the policies of funding bodies in relation to data management, sharing and preservation;

the provision of storage infrastructure including national data centres and effective discovery systems;

the size of research teams (larger teams can benefit from keeping their own data private).

A recent review of the literature across 15 international jurisdictions undertaken in the Netherlands found that “although there are major differences in the way disciplines conduct their research, they also have a number of factors in common when it comes to data storage and access. They all encounter both technical barriers, for example the use of obsolete software, and non-technical ones, such as fear of competition, lack of trust, lack of incentives, and lack of control.”[22]

One particularly important issue expressed by researchers is that they remain in control of what happens to their data. Researchers wish to control who has access to the data and under which conditions.

There is a growing awareness in the research community of the value of data sharing. This was reflected in a number of submissions to Canada's Digital Economy Consultation that called for greater government support to assist researchers in making their data available. The Partnership Group for Science and Engineering (PAGSE), for example, has called on the government to “make data generated from federally funded research freely available online and provide the capacity to ensure data stewardship and preservation in the long term.”[23]

2.6.2 Institutions/Universities

To date, Canadian universities have not been actively engaged in supporting researcher data sharing practices. In terms of data policies, universities enforce data privacy policies via their Research Ethics Boards (REBs) in conformity with the TCPS. They have not developed policies on data sharing and have not been enforcing compliance with the data sharing policies of funding agencies.

Regarding infrastructure requirements, some universities host and provide financial support for discipline-based databases and repositories, but this support extends to a small minority of research projects. University libraries, which currently have services that provide access to data housed elsewhere (e.g., Statistics Canada data) through research data centres, are becoming interested in collecting the research data created at their institution. However, for the most part, data management support through the university libraries is still in its infancy.

One project that may act as an important demonstrator for institutional support for data sharing policies is the IPY Data Assembly Centre Network. The network is being developed to archive and provide access to all observed data and information generated from IPY projects funded by the Government of Canada Program for IPY. The Network in its current form consists of partners from the research library community (University of Alberta, Ontario Council of University Libraries' Scholars Portal) and several government agencies. The startup funding for this project is being provided by the department of Aboriginal Affairs and Northern Development, but the ongoing expenses of managing and preserving the data into the future will eventually be taken on by individual institutions.

There are a number of issues Canadian funding agencies may want to consider when implementing data sharing policies.

2.7.1 Skills, Training and Qualified Personnel

An important requirement for data accessibility is that data are organized and described using standards and best practices. This requires a significant amount of skill in terms of data management. A Gap Analysis[24] published by the Canadian Research Data Strategy Working Group in 2008 concluded that researchers rarely have the skills required to appropriately manage their data. The situation is similar in other countries.

As noted above, both the UK and Australia have created national centres of expertise to provide support for the research community and to data repository managers. In the United States, some university libraries have been working with researchers to assist them with managing their data appropriately. Regardless of the model, researchers in many disciplines will need access to support services in order to comply with any data sharing policy.

2.7.2 Complex Policy Environment

The wide range of data policies that govern different jurisdictions and types of research data makes it very challenging for researchers to understand and adhere to data sharing policies. This is particularly so for researchers who are working with data related to human participants. The Tri-Council Policy Statement on the Ethical Conduct for Research Involving Humans (TCPS) requires that data be completely anonymized or de-identified before they are shared, unless the researcher can justify to the Research Ethics Board (REB) otherwise. The 2nd edition of the TCPS provides guidance on the collection, use, dissemination, retention, and disposal of data. A narrow interpretation of TCPS by REBs or researchers can result in the unnecessary destruction of data related to human subjects, in contravention of data sharing policies.

There are ways of ensuring that data sharing policies don't conflict with or compromise privacy and confidentiality requirements. The National Institutes of Health (NIH) policy on research data sharing, for example, states, “Prior to sharing, data should be redacted to strip all identifiers, and effective strategies should be adopted to minimize risks of unauthorized disclosure of personal identifiers.”[25] Similarly, the Wellcome Trust policy requires the anonymization of data to protect confidentiality and insists that data confidentiality should not “unduly inhibit responsible data sharing for legitimate research uses.”[26]

Other jurisdictions are developing clear instructions for researchers and REBs as to how to comply with funding agency data sharing policies in this complex environment. These could be provided in the form of “best practice” documents which offer clear guidance on how to comply with data sharing policies.

2.7.3 Infrastructure Support

For research data to be available after the lifespan of a specific research project, they must be integrated into an enduring institutional environment supported by a digital repository. The preservation of research data requires the active management of data over its entire lifecycle and involves activities such as “appraising, selecting, depositing or ingesting data into a repository, ensuring authenticity, managing the collection of data and metadata, refreshing digital media, and migrating data to new digital media.”[27]

Currently in Canada, most of the data collected through research are not deposited into data repositories and few if any repositories have full preservation capacity. Although data in certain disciplines are being collected by national agencies, this represents only a small minority of data sets created through research activities in Canada.

The lack of infrastructure is most acute when looking at the hundreds of smaller datasets produced by individual researchers and research groups. It is often suggested that institutional repositories are the natural locus for such datasets. However, existing institutional repository platforms do not yet have the functionality required for data to be tagged at the element level, something that is needed for interoperability and re-use of data. In addition, because research data are highly heterogeneous it is unlikely that any single repository could collect the range of data types created at any given single higher education institution.

2.7.4 Clarifying Roles and Responsibilities

There are currently large gaps in the roles and responsibilities for managing research data across its lifecycle. Researchers are responsible for managing their data during the lifespan of the project, but lack the means to maintain data once the project is over, and often lack the skills to prepare it for dissemination.

Again, institutions are an obvious candidate for taking on responsibility for curating the data produced by their own research community where those data have no natural home. This, however, would require that institutions become aware of their potential role in the management of research data. In addition, there are significant costs associated with collecting and preserving research data, and there are not yet sustainable funding models in Canada that support these activities.