Book Review: Issues in Open Research Data

Bringing together contributed chapters on a wide variety of topics, Issues in Open Research Data is a highly informative volume of great current interest. It’s also an open access book, available to read or download online and released under a CC BY license. Three of the nine chapters have been previously published, but benefit from inclusion here. In the interest of full disclosure, I’m listed as a book supporter (through unglue.it) in the initial pages.

In his Editor’s Introduction, Samuel A. Moore presents the Panton Principles for data sharing, inspired by the idea that “sharing data is simply better for science.” Moore believes each principle builds on the previous one:

When publishing data, make an explicit and robust statement of your wishes.

Use a recognized waiver or license that is appropriate for data.

If you want your data to be effectively used and added to by others, it should be open as defined by the Open Knowledge/Data Definition; in particular, non-commercial and other restrictive clauses should not be used.

Explicit dedication of data underlying published science into the public domain via PDDL or CC0 is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

In “Open Content Mining” Peter Murray-Rust, Jennifer C. Molloy and Diane Cabell make a number of important points regarding text and data mining (TDM). Both publisher restrictions and law (recently liberalized in the UK) can block TDM. And publisher contracts with libraries, often made under non-disclosure agreements, can override copyright and database rights. This chapter also includes a useful table of the TDM restrictions of major journal publishers. (Those interested in exploring further may want to check out ContentMine.)

“Data sharing in a humanitarian organization: the experience of Médecins Sans Frontières” by Unni Karunakara covers the development of MSF’s data sharing policy, adopted in 2012 (its research repository was established in 2008). MSF’s overriding imperative was to ensure that sharing data would not expose patients to harm arising from political or ethnic strife.

Sarah Callaghan makes a number of interesting points in her chapter “Open Data in the Earth and Climate Sciences.” Because much earth science data is observational, it is not reproducible. “Climategate,” the exposure of researcher emails in 2009, has helped drive the field toward openness. However, several barriers remain. The highly competitive research environment causes researchers to hoard data, though funder policies on open data are changing this. Where data has commercial value, non-disclosure agreements can come into play. Callaghan notes the paradox that putting restrictions on collaborative spaces makes sharing more likely (the Open Science Framework is a good example). She also shares a case in which an article based on open data was published three years before the researchers who produced the data published their own work. Funders are likely to increasingly monitor data use and to require acknowledgement of data sources used in a publication. Data papers (short articles describing a dataset and the details of collection, processing, and software) may encourage open data, since researchers are more likely to deposit data if given credit through a data journal. However, data journals need to certify data hosts and provide guidance on how to peer review a dataset.

In “Open Minded Psychology” Wouter van den Bos, Mirjam A. Jenny, and Dirk U. Wulff share a discouraging statistic: 73% of corresponding authors failed to share data from published papers on request. A significant barrier is that providing data means substantial work. Usability can be enhanced by avoiding proprietary software and following standards for structuring data sets (an example of the latter is OpenfMRI). The authors discuss privacy issues as well, which in the case of fMRI include a 3D image of the participant’s face. The value of open data is that data sets can be combined, used to address new questions, analyzed with novel statistical methods, or used as an independent replication data set. The authors conclude:

Open science is simply more efficient science; it will speed up discovery and our understanding of the world.

Ross Mounce’s chapter “Open Data and Palaeontology” is interesting for its examination of specific data portals such as the Paleobiology Database, focusing in particular on the licensing of each. He advocates open licenses such as the CC0 license, and argues against author choice in licensing, pointing out that it creates complexity and results in data sharing compatibility problems. And even though articles with data are cited more often, Mounce points out that indexing has traditionally covered only the main paper, not the supplementary files where data usually resides.

Probably the most thought-provoking yet least data-focused chapter is “The Need to Humanize Open Science” by Eric Kansa of Open Context, an open data publishing venue for archaeology and related fields. The chapter starts with open data but is mostly about the interaction of neoliberal policies and openness. It deserves a more extensive analysis than I can give here, but those interested in the context against which openness struggles may want to read his blog post on the subject in addition to this chapter.

Other chapters cover the role of open data in health care, drug discovery, and economics. Common themes include:

encouraging the adoption of open data practices and the need for incentives

the importance of licensing data as openly as possible

the challenges of anonymization of personal data

an emphasis on the usability of open data

As someone without a strong background in data (open or not), I learned a great deal from this book, and highly recommend it as an introduction to a range of open data issues.