Monday, February 27, 2012

How do we peer-review code?

The article addresses the problem of adequately peer-reviewing scientific research in an age when experiments depend on extensive computation.

The central difficulty is reproducibility, by which we mean the reproduction of a scientific paper’s central finding, rather than exact replication of each specific numerical result down to several decimal places.

The article debates some philosophy-of-science issues, but one of its core questions is this: when attempting to reproduce the results of another's experiment, reviewers may need to reproduce the computational aspects as well as the data-collection aspects. Is the reproduction of the computational aspects of the experiment best performed by:

taking the original experiment's literal program source code, possibly code-reviewing it, and then re-building and re-running it on the new data set, or

taking a verbal specification of the original experiment's computations, possibly design-reviewing that specification, and then re-implementing and re-running it on the new data set?

Hidden within the discussion is the challenge that, in order for the first approach to be possible, the original experiment must disclose and share its source code, which is currently not a common practice. The authors catalog a variety of current positions on the question, noting specifically that “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.”

The authors find pros and cons to both approaches. Regarding the question of trying to reproduce a computation from a verbal specification, they observe that:

Ambiguity in program descriptions leads to the possibility, if not the certainty, that a given natural language description can be converted into computer code in various ways, each of which may lead to different numerical outcomes. Innumerable potential issues exist, but might include mistaken order of operations, reference to different model versions, or unclear calculations of uncertainties. The problem of ambiguity has haunted software development from its earliest days.

which is certainly true. It is very, very hard to reproduce a computation given only a verbal description of it.
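To make the ambiguity concrete, here is a small sketch of my own (it is not from the article). Suppose a paper says only that it reports "the standard deviation of the measurements". That phrase has at least two reasonable readings, and they give different numbers. The data values below are invented purely for illustration.

```python
import statistics

# Hypothetical measurements, invented for illustration only.
measurements = [9.8, 10.1, 10.4, 9.9, 10.3]

# Reading 1: population standard deviation (divide by n).
population_sd = statistics.pstdev(measurements)

# Reading 2: sample standard deviation (divide by n - 1).
sample_sd = statistics.stdev(measurements)

# Both are faithful implementations of the same verbal description,
# yet they disagree, and the gap grows as the sample gets smaller.
print(f"population reading: {population_sd:.4f}")
print(f"sample reading:     {sample_sd:.4f}")
```

Neither reading is a bug. The description simply does not say which one the original authors used, and only their code could settle it.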

Meanwhile, they observe that computer programming is also very hard, and there may be errors in the original experiment's source code, which could be detected by code review:

First, there are programming errors. Over the years, researchers have quantified the occurrence rate of such defects to be approximately one to ten errors per thousand lines of source code.

Second, there are errors associated with the numerical properties of scientific software. The execution of a program that manipulates the floating point numbers used by scientists is dependent on many factors outside the consideration of a program as a mathematical object.

...

Third, there are well-known ambiguities in some of the internationally standardized versions of commonly used programming languages in scientific computation.

which is also certainly true.
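The second point, about floating point, is easy to demonstrate. The following is my own minimal sketch, not something from the article: it adds the same three numbers in two different orders and gets two different answers. The helper name `naive_sum` is mine, and I use an explicit loop rather than the built-in `sum` because newer Python versions compensate for some of this error internally.

```python
import math


def naive_sum(values):
    """Add the values strictly left to right, with no error compensation."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same three numbers, in two different orders.
small_term_first = [1e16, 1.0, -1e16]
large_terms_first = [1e16, -1e16, 1.0]

# Here the 1.0 is absorbed by 1e16 (it is below the rounding
# granularity at that magnitude), so the result is 0.0.
print(naive_sum(small_term_first))

# Here the large terms cancel first, so the 1.0 survives: 1.0.
print(naive_sum(large_terms_first))

# math.fsum tracks the lost low-order bits and returns 1.0 either way.
print(math.fsum(small_term_first))
```

Mathematically the two sums are identical, which is exactly the sense in which the program's behavior depends on factors outside "the program as a mathematical object".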

The authors conclude that high-quality science would be best served by encouraging, even requiring, published experimental science to disclose and share the code that the experimenters use for the computational aspects of their findings.

Seems like a pretty compelling argument to me.

One worry I have, which doesn't seem to be explicitly discussed in the article, is this: because programming is hard, if experimenters routinely disclose their source code, then others who are attempting to reproduce those results might simply take the existing source code and reuse it without thoroughly studying it. A worse outcome might then arise: an undetected bug in the original program would propagate into the reproduction, and the flawed result might gain further apparent validity. If the second team had instead rewritten the code from first principles, their independent implementation might well not have contained the same bug, and the likelihood of finding the problem might be greater.
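Here is a toy sketch of that worry. Everything in it is hypothetical: `original_mean` stands in for a published program with a deliberately planted off-by-one bug, and `independent_mean` stands in for a second team's rewrite from the verbal description "the average of the measurements".

```python
import math


def original_mean(values):
    """Hypothetical published code with a planted off-by-one bug:
    the loop skips the last element but still divides by len(values)."""
    total = 0.0
    for i in range(len(values) - 1):
        total += values[i]
    return total / len(values)


def independent_mean(values):
    """Hypothetical rewrite by a second team, from the description alone."""
    return sum(values) / len(values)


data = [2.0, 4.0, 6.0, 8.0]  # invented data

published = original_mean(data)      # the number in the paper
reused = original_mean(data)         # second team re-runs the shared code
rewritten = independent_mean(data)   # second team writes its own code

# Re-running the shared code "confirms" the result, bug included.
print("reuse matches paper:  ", math.isclose(reused, published))

# The independent rewrite disagrees, which is the signal to go looking.
print("rewrite matches paper:", math.isclose(rewritten, published))
```

Of course, a disagreement only says that at least one of the two programs is wrong, not which one. That is where having the original source available to review pays off, so perhaps the best answer is to do both.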