Citation of Datasets

CONTEXT

Datasets are increasingly being recognized as valuable, legitimate, standalone products of research that contribute to scholarly discourse. Indeed, in a revised version of its Proposal & Award Policies & Procedures Guide, the NSF made the following change:

“Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.” (emphasis added)

Properly cited published manuscripts are clearly identified and easily located within their respective publications. In the same way, proper identification of datasets facilitates access, sharing, and reuse by making them unique and discoverable. While there is not yet consensus on a single method to cite or reference a dataset, discipline-agnostic standards and common practices are emerging. The information below is excerpted from the DataCite standard, the Earth Science Information Partners (ESIP) Interagency Data Stewardship Committee wiki, and “How to Cite Datasets and Link to Publications” from the UK’s Digital Curation Centre (Ball & Duke, 2012; see also the PDF version). Please see these resources for a more comprehensive understanding of the scope and complexity of issues surrounding data citation.

IN PRACTICE

A data citation should include, at the very least, the following elements:

Author(s): the creator(s) of the dataset, in priority order. May be an institution or person(s).

If one exists, include a "nameIdentifier" for each creator, such as an Open Researcher and Contributor ID (ORCID) or International Standard Name Identifier (ISNI).

Publication/Release date: the latest of the following: the date the dataset was made available, the date all quality assurance procedures were completed, and the date the embargo period (if applicable) expired.

Title: the formal title of the dataset.

Version: the precise version of the data used. Careful version tracking is critical to accurate citation.

Publisher/Archive/Distributor: the organization distributing or hosting the data, ideally over the long term.

Identifier: a unique string that identifies the resource; should be a persistent scheme such as a DOI (10.1234/8675309), handle, or ARK (www.example.org/ark:/12345/lucky777). If you deposit data in Academic Commons, we can assign your dataset a DOI and a persistent URL.

Access Date: because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when online data were accessed.
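The elements listed above can be assembled into a citation string programmatically. The following is a minimal Python sketch, not a definitive implementation: the class name, the field names, the helper release_date, and the sample values are all hypothetical, and the element order only loosely follows the ordering that DataCite suggests (creator, year, title, version, publisher, identifier). Adapt the output to whatever style your publisher requires.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Minimal set of citation elements; all field names are illustrative."""
    authors: List[str]            # creators in priority order (people or institutions)
    title: str                    # formal title of the dataset
    version: str                  # exact version of the data used
    publisher: str                # archive or distributor hosting the data
    identifier: str               # persistent identifier, e.g. a DOI URL
    released: date                # publication/release date
    accessed: Optional[date] = None  # when online data were accessed

    def format(self) -> str:
        """Return a one-line citation string in a DataCite-like order."""
        names = "; ".join(self.authors)
        text = (f"{names} ({self.released.year}). {self.title}. "
                f"Version {self.version}. {self.publisher}. {self.identifier}.")
        if self.accessed is not None:
            text += f" Accessed {self.accessed.isoformat()}."
        return text


def release_date(available: date, qa_completed: date,
                 embargo_expired: Optional[date] = None) -> date:
    """Pick the latest of the candidate dates, as the release-date rule describes."""
    candidates = [available, qa_completed]
    if embargo_expired is not None:
        candidates.append(embargo_expired)
    return max(candidates)


# Hypothetical example values, for illustration only.
citation = DatasetCitation(
    authors=["Doe, J.", "Example Research Lab"],
    title="Hypothetical Stream Temperature Records",
    version="1.2",
    publisher="Example Data Archive",
    identifier="https://doi.org/10.1234/8675309",
    released=release_date(date(2020, 3, 1), date(2020, 5, 15), date(2020, 4, 1)),
    accessed=date(2021, 1, 10),
)
print(citation.format())
```

Keeping the elements in a structured record rather than a free-text string makes it easier to re-render the same citation in a different style later.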

If you need to use a specific citation style (e.g. APA, Chicago, etc.), enter your DOI (e.g. '10.1234/1234567') at this site to format the citation for your dataset.
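A formatted citation can also be requested programmatically through DOI content negotiation, in which the DOI resolver is asked for a "text/x-bibliography" representation in a named style. The sketch below only builds the request and does not send it; it assumes the DOI's registration agency supports this media type (DataCite and Crossref do), and the function name citation_request is hypothetical. The DOI shown is the placeholder from the text and will not resolve.

```python
from urllib.request import Request


def citation_request(doi: str, style: str = "apa") -> Request:
    """Build (but do not send) a content-negotiation request for a formatted citation."""
    return Request(
        "https://doi.org/" + doi,
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )


# Placeholder DOI from the text; substitute a real DOI before sending.
req = citation_request("10.1234/1234567", style="apa")
print(req.full_url)
print(req.get_header("Accept"))

# To actually fetch the citation for a real DOI (requires network access):
#   from urllib.request import urlopen
#   with urlopen(req) as response:
#       print(response.read().decode("utf-8"))
```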

SUMMARY

There are many more facets to data citation than we can reasonably cover here. We encourage you to visit the resources linked above. By way of closing, here is a summary of considerations quoted from Ball and Duke (2012, DCC):

If you have generated/collected data to be used as evidence in an academic publication, you should deposit them with a suitable data archive or repository as soon as you are able. If they do not provide you with a persistent identifier or URL for your data, encourage them to do so.

When citing a dataset in a paper, use the citation style required by the editor/publisher. If no form is suggested for datasets, take a standard data citation style (e.g. DataCite’s) and adapt it to match the style for textual publications.

Give dataset identifiers in the form of a URL wherever possible, unless otherwise directed.

Include data citations alongside those for textual publications. Some reference management packages now include support for datasets, which should make this easier.

Cite datasets at the finest-grained level available that meets your need. If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.

If a dataset exists in several versions, be sure to cite the exact version you used.

When you publish a paper that cites a dataset, notify the repository that holds the dataset, so it can add a link from that dataset to your paper.
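The recommendation above to give dataset identifiers as URLs is simple to apply to DOIs, which resolve through https://doi.org/. The sketch below is illustrative only: the function name identifier_to_url is hypothetical, and it deliberately handles only bare DOIs and "doi:"-prefixed DOIs, leaving handles and ARKs to whatever resolver your repository specifies.

```python
def identifier_to_url(identifier: str) -> str:
    """Return a resolvable URL for a DOI; pass existing URLs through unchanged."""
    ident = identifier.strip()
    if ident.lower().startswith(("http://", "https://")):
        return ident  # already a URL
    if ident.lower().startswith("doi:"):
        ident = ident[4:].strip()  # drop the "doi:" prefix
    if ident.startswith("10."):
        return "https://doi.org/" + ident  # DOI names begin with "10."
    # Handles, ARKs, and other schemes are out of scope for this sketch.
    raise ValueError(f"Not recognized as a DOI: {identifier!r}")


print(identifier_to_url("doi:10.1234/8675309"))
```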