The folks at mloss.org (Machine Learning Open Source Software) invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org’s philosophy is stated as:

“Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for a wide range of applications. Inspired by similar efforts in bioinformatics (BOSC) or statistics (useR), our aim is to build a forum for open source software in machine learning.”

The site is excellent and worth a visit. The guest blog Chris Wiggins and I wrote starts:

“As pointed out by the authors of the mloss position paper [1] in 2007, “reproducibility of experimental results is a cornerstone of science.” Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.”

This is an excellent panel discussion regarding the leaked East Anglia docs as well as standards in science and the meaning of the scientific method. It was recorded on Dec 10, 2009, and here’s the description from the MIT World website: “The hacking of emails from the University of East Anglia’s Climate Research Unit in November rocked the world of climate change science, energized global warming skeptics, and threatened to derail policy negotiations at Copenhagen. These panelists, who differ on the scientific implications of the released emails, generally agree that the episode will have long-term consequences for the larger scientific community.”

Moderator: Henry D. Jacoby, Professor of Management, MIT Sloan School of Management, and Co-Director, Joint Program on the Science and Policy of Global Change, MIT.

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance (release of data and code as per the letter of the law) without the extra effort necessary to create usable data and code facilitating reproducibility (and extension) of the results. One solution would be to require grant applicants to formulate plans for releasing the code and data their proposed research would generate, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. It would also allow federal funding agencies to gather data on what release requires (repositories, further support, etc.); understand which research problem characteristics engender which particular solutions and which solutions are most appropriate in which settings; and uncover as-yet unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to the barriers researchers face and the demands of their particular research problems, and to implement strategies for enforcing these requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.

Examples:

One exemplary precedent by a UK funding agency is the January 2007 “Policy on data management and sharing” (http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm) adopted by The Wellcome Trust (http://www.wellcome.ac.uk/About-us/index.htm), according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly supported research and public access to the research products generated by this support.