Nature Editorial: If you want reproducible science, the software needs to be open source

According to an editorial in Nature, all scientific code should be released …

Modern scientific and engineering research relies heavily on computer programs, which analyze experimental data and run simulations. In fact, you would be hard-pressed to find a scientific paper (outside of pure theory) that didn’t involve code in some way. Unfortunately, most code written for research remains closed, even if the code itself is the subject of a published scientific paper. According to an editorial in Nature, this hinders reproducibility, a fundamental principle of the scientific method.

Reproducibility refers to the ability to repeat some work and obtain similar results. It is especially important when the results are unexpected or appear to defy accepted theories (for example, the recent faster-than-light neutrinos). Scientific papers include detailed descriptions of experimental methods—sometimes down to the specific equipment used—so that others can independently verify results and build upon the work.

Reproducibility becomes more difficult when results rely on software. The authors of the editorial argue that, unless research code is open sourced, reproducing results on different software/hardware configurations is impossible. The lack of access to the code also keeps independent researchers from checking minor portions of programs (such as sets of equations) against their own work.

Reproduce THIS!

Some journals take this issue seriously. Science includes code on its list of things that should be supplied by an author when submitting a paper. Biostatistics actually created an “Associate Editor for Reproducibility” dedicated to reproducing the results of a paper based on the data and code it receives.

Nature, on the other hand, only asks for a written description of code with sufficient details to allow interested readers to create their own version. This is currently the common practice for most journals. Typically, when a computer program is written for a paper, the authors will supply an executable version upon request.

However, the authors describe two reasons why these common practices (written descriptions of code and executables) are not sufficient to reproduce results: ambiguity in the descriptions and errors in the code.

When it comes down to it, code is the only thing that can unambiguously describe code—that’s why we use programming languages instead of natural language. Even if the authors of a paper accurately describe a program, using precise mathematical equations when necessary, independent implementations (and their results) would still differ.

Releasing executable versions of programs instead of code may not be sufficient due to underlying errors. This doesn’t just mean actual mistakes in the code, although some studies estimate one to ten errors for every thousand lines of code. Rounding and floating point errors, as well as ambiguities in programming languages like the order-of-evaluation problem, can all affect results.
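To make the point concrete, here is a minimal Python sketch (ours, not from the editorial) of how rounding and evaluation order can change a result:

```python
import math

# Floating-point addition is not associative: the order in which terms
# are grouped changes the rounding, and therefore the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6

# Accumulation order matters at scale, too: naive left-to-right summation
# silently drops small terms added to a large one.
values = [1e16] + [1.0] * 1000
print(sum(values) - 1e16)          # 0.0 -- every 1.0 was rounded away
print(math.fsum(values) - 1e16)    # 1000.0 -- the correctly rounded sum
```

Two implementations that merely sum the same terms in a different order can report different numbers, which is exactly why a written description alone may not pin down a program's output.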

Without the ability to examine code, independent researchers won’t know whether uncertainties or errors (or even the results) described in a paper can be traced to ambiguity in the description or to mistakes in the numerical implementation.

Why not open source?

The authors acknowledge there are some barriers to the ubiquitous release of scientific code. Many researchers don’t recognize the importance of the issues described above, and others may see commercial potential in their code. Jeffrey Benner pointed out in 2002 that, even when researchers want to release their code openly, universities and national labs can block them in an effort to license and monetize software.

The editorial also mentions a shortage of central scientific repositories or indexes for research code, and suggests that funding agencies should investigate solutions similar to SourceForge. This might not be necessary, though, since many researchers (particularly in computer science) already post their code to SourceForge and Google Code.

Another justification for keeping code closed is selfish: to slow down the competition by keeping the results of hard work to yourself. Daniel Lemire, a computer scientist and professor, responded to this argument elsewhere by pointing out that open sourcing his code not only makes his work repeatable, but spreads the ideas faster and makes the code better in the long run, since other users can help debug it.

In the end, simple embarrassment over ugly code may also be a factor, according to Matt Might, another computer science professor. (As someone who writes code for my research, I can vouch for this. [Editor’s note: as can his editor.]) He also believes academics should release code openly, and created the Community Research and Academic Programming License (yes, that’s CRAPL) to help “absolve authors of shame, embarrassment, and ridicule for ugly code.”

What needs to change?

The authors of the editorial suggest a few steps that could help correct the problem. First, more journals should adopt standards for source-code accessibility (such as full source code, partial source code, executable only, or no code) and ensure researchers provide a sufficient description of the software used.

They also suggest that funding bodies could look into tools to integrate code with other elements of the paper. For example, the data and code used to generate a figure could be bundled with the figure itself.
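As a hedged sketch of what such bundling might look like in practice (the file names, data, and layout below are invented for illustration, not taken from the editorial):

```python
# Hypothetical sketch: save a figure together with the data and the very
# script that produced it, so the whole bundle can travel with the paper.
import zipfile

import numpy as np
import matplotlib.pyplot as plt

# Generate and save the data behind the figure.
x = np.linspace(0, 10, 200)
y = np.sin(x)
np.savetxt("figure1_data.csv", np.column_stack([x, y]),
           delimiter=",", header="x,y")

# Generate and save the figure itself.
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.savefig("figure1.png", dpi=150)

# Archive the figure, its data, and this script into a single bundle.
with zipfile.ZipFile("figure1_bundle.zip", "w") as bundle:
    for name in ("figure1.png", "figure1_data.csv", __file__):
        bundle.write(name)
```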

The most important step, and probably the easiest and cheapest to accomplish, is for science and engineering departments to emphasize the concept of reproducibility in courses on statistics and programming.


Kyle Niemeyer
Kyle is a science writer for Ars Technica. He is a postdoctoral scholar at Oregon State University and has a Ph.D. in mechanical engineering from Case Western Reserve University. Kyle's research focuses on combustion modeling. Email: kyleniemeyer.ars@gmail.com // Twitter: @kyle_niemeyer

Dryad (http://datadryad.org/) is a repository for scientific data funded and organized by a group of scientific journals. It also accepts scripts, source code, and other files. From what I read on their page, data submission is tightly integrated with publication of the corresponding manuscript; e.g., the data/code gets its own DOI that is linked to the article's DOI.

Even in cases where there is some idea of commercializing the code, the code should be put into some kind of escrow system so it could be accessed later if the results are challenged when they can't be reproduced. That would prevent a "dog ate my backups" defense by the researcher trying to hide flaws or misdeeds.

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

I agree entirely, but only for non-commercially available software. If I can download or buy the software, then I have no need for the authors to make it available and indeed such a requirement is superfluous.

If I ran my data through Maple, MatLab or even Excel, then the OSS requirement falls flat on its face. If, however, I ran it through HatCrunch, which I developed in-house, then I should publish the source code. I'm wrong to be publishing just the algorithms if, perhaps, my version of HatCrunch has a crippling bug which fucks up my data or the software otherwise deviates from the algorithms; this can be a phenomenal issue due to floating-point data formats on different systems.

Only by specifying the target architecture/OS and providing the source code can I ensure reproducibility. Science that is not reproducible is not science. That simple.

Seriously, I understand the intentions behind keeping work closed-source. My only hope is that the authors/institutions behind such projects are fully aware that sometimes keeping it to themselves can stop the evolution of ideas.

I sometimes wonder if it's ethical for organizations to monetize discoveries, e.g., the next cure, better food/resource production, etc.

There's a Google TechTalk from Richard Hipp (the author of SQLite) mentioning how people were used to the idea of paying & were confused when he went public domain. It's amazing how many products integrate his work -- not just computer programs. If Dr. Hipp had kept it closed, how far do you think his project would have reached?

For folks who say, "What if I have that one big idea?!" I'd say sure, keep it & do your best to bring it to market. But if it doesn't pan out, share your thoughts & let others continue your work. Otherwise, we'd all be using IE6 on the broken web.

I agree entirely, but only for non-commercially available software. If I can download or buy the software, then I have no need for the authors to make it available and indeed such a requirement is superfluous.

If I ran my data through Maple, MatLab or even Excel, then the OSS requirement falls flat on its face. If, however, I ran it through HatCrunch, which I developed in-house, then I should publish the source code. [...]

You run your data through Matlab, etc *with a Matlab script.* You include the *script.*
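For illustration, here is a minimal sketch, in Python rather than Matlab, of what "including the script" buys you (the data file and column name are invented for this example):

```python
# Hypothetical sketch: the entire analysis, from raw data file to the
# statistics reported in a paper, captured in one runnable script.
import csv
import statistics

# "measurements.csv" and its "signal" column are invented placeholders.
with open("measurements.csv", newline="") as f:
    values = [float(row["signal"]) for row in csv.DictReader(f)]

# Every number quoted in the paper is traceable to a line like these.
print("n    =", len(values))
print("mean =", statistics.mean(values))
print("sd   =", statistics.stdev(values))
```

Shipping a file like this with the paper removes any ambiguity about which buttons were pressed, whatever commercial package sits underneath.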

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

You seem to fail to distinguish between "Open Source" and "Free Software".

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work.

It would actually work the opposite way, I think: if bigLab uses smallLab's code, they'll have to put in a reference and give credit. Open source does not mean they can just take credit for writing code when they didn't. If code is closed-sourced, then bigLab and smallLab will have to duplicate efforts. That seems like a bigger hurdle for the small guys.

Quote:

If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser

Apparently you value profits over the overall quality of beer: if Budweiser suddenly started making award-worthy beer, that would be a win in my book (but I won't hold my breath).

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

I agree that open source is not a panacea. However, if you want to keep something secret, don't publish. It's that simple. It's just like the decision many companies make. File a patent, or keep a trade secret.

Even in cases where there is some idea of commercializing the code, the code should be put into some kind of escrow system so it could be accessed later if the results are challenged when they can't be reproduced. That would prevent a "dog ate my backups" defense by the researcher trying to hide flaws or misdeeds.

This seems like overkill; I think what's more reasonable is to provide access to the source under a license that is not something that the OSI would consider "open source": something that would disallow redistribution of source or binaries and perhaps even disallow use of the code for any purpose other than reproducing the original experiment. Assuming the licensing terms are obeyed, this would allow you to commercialize the code, while still letting other academics examine the code and use it to verify the validity of your results.

The ugly code is definitely an issue though - I've got one open-source, academic-written library in my thesis project that is so nasty I'm writing my own version from scratch (which my supervisor will likely open-source once I'm done with it). It works fine, but it's not thread-safe or well-documented, and has a really kludgy interface.

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

Ermm if your work isn't repeatable (competition issues aside) what you do may make you happy, but it isn't science....

Even in cases where there is some idea of commercializing the code, the code should be put into some kind of escrow system so it could be accessed later if the results are challenged when they can't be reproduced. That would prevent a "dog ate my backups" defense by the researcher trying to hide flaws or misdeeds.

This seems like overkill; I think what's more reasonable is to provide access to the source under a license that is not something that the OSI would consider "open source": something that would disallow redistribution of source or binaries and perhaps even disallow use of the code for any purpose other than reproducing the original experiment. Assuming the licensing terms are obeyed, this would allow you to commercialize the code, while still letting other academics examine the code and use it to verify the validity of your results.

Ohh, a sort of: you can test whether my science is repeatable, but only if we're all part of an elite. It's probably a fair point, as science has been that way for a long time, but as TimBL says, it's a shame that at the exact time we have the technology to open science to wider scrutiny, many seem engaged in entirely the opposite.

As a researcher, I think the real reason that most academic code isn't open-sourced is that it takes valuable time to A) navigate your way through the university/institutional bureaucracy to get the code legally open-sourced, and B) make crappy, hacked-together-at-the-last-minute code clean and documented enough that someone else could even run it without significant hand-holding. Without a significant incentive to take the time to do these things, especially documentation, it's hard to justify putting in the time it'd take to go through the process. And in reality, code that's not at least somewhat documented and cleaned up is pretty useless to anyone other than the programmer.

Because of that, I'm actually pretty glad that journals are starting to put pressure on submissions to include code. It's a pain and it takes time, but it'll make for better science in the long term. After all, you wouldn't publish a paper where the math wasn't clean, clear, and followable (ideally, anyway, although I've read some papers that make me despair), and the same would, in an ideal world, apply to code.

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

I agree that open source is not a panacea. However, if you want to keep something secret, don't publish. It's that simple. It's just like the decision many companies make. File a patent, or keep a trade secret.

This. The article is saying when you publish a paper, there should be enough information included to reproduce the results of the paper, so others can check your work. That includes the source code.

No one is saying the lab has to release the code while they are still running the experiment. Release it after, with the results. (And your analysis.)

This is science: the idea is to build upon each other's work to better understand how the world works. If I can't recreate your results because only you have the code, I haven't understood anything. You might as well not write the paper; you aren't telling me anything.

If you don't want to write the paper because your business depends on only you being able to make the latest version of unobtanium, fine. But then you aren't a research lab.

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Citation needed. Do you have examples of open sourcing giving competition an advantage? i.e. is it just something that sounds logical, or is there data to support it?

First, a description of the software should be included, covering all algorithms and formulas. This has to be in enough detail that anyone implementing it can do two things: (1) re-implement the software from scratch, and (2) easily verify that the algorithms and formulas are correct.

The original source code must also be available, so people who re-run the experiment and get a different result can try to see whether it's a problem in their implementation of the software, or a data error.

The reasoning for this is simple - researcher code sucks! They really cannot write a line of code, and it's really quite horrendous (think Daily WTF worthy). Untangling the mess is difficult, and bugs in it may be impossible to find. By having a "what it should do" version (the description for re-implementation) alongside a "what it actually does" version (the source), we can tell if there's a problem with the implementation, or with the analysis.

Absolutely agree with this. Science has become a lot poorer for the publishing of findings without the ability for anyone to reproduce them. And it's across the board, whether talking about environmental science, physics, astronomy (how you determine whether a planet is in the "green zone", for instance), biology...

As Mark Twain said, there are "lies, damned lies, and statistics programs" (or something like that).

Open sourcing can give your competition an easy advantage, why is this so hard for some to admit? People need to stop treating open source as an ultimate good and admit that it has its flaws and is not appropriate in every situation.

Smaller research labs sometimes need to keep their work a secret to ensure they receive full credit for the work. That's what open source ideologues never seem to get. It's the proprietary secrets that can keep a small organization from falling prey to a larger one. If you came up with an award winning beer recipe you'd be a freaking idiot to email it to Budweiser with the hope they improve it and don't simply clone it and sell it under a different label.

Actually, open sourcing gives the advantage to the small shop. There are easy ways to prove that your lab did the work if you open source the results. Being able to point to a verifiable source of publication, like a public blog or third-party code repo, is going to stop any third party from trying to steal your publication. All you would have to do is send a letter to the editor of the publication and the plagiarist is going to be publicly humiliated. You don't give enough credit to what your reputation means in academic research.

I agree entirely, but only for non-commercially available software. If I can download or buy the software, then I have no need for the authors to make it available and indeed such a requirement is superfluous.

If I ran my data through Maple, MatLab or even Excel, then the OSS requirement falls flat on its face. If, however, I ran it through HatCrunch, which I developed in-house, then I should publish the source code. I'm wrong to be publishing just the algorithms if, perhaps, my version of HatCrunch has a crippling bug which fucks up my data or the software otherwise deviates from the algorithms; this can be a phenomenal issue due to floating-point data formats on different systems.

Only by specifying the target architecture/OS and providing the source code can I ensure reproducibility. Science that is not reproducible is not science. That simple.

I agree to some degree. But a lot of software packages are so prohibitively expensive that they will be unobtainable for many researchers. Research money is usually stretched thin as it is. Researchers should keep this in mind when using software for core parts of their research. Software that has many users is also less likely to have undiscovered bugs in it.

This seems like overkill; I think what's more reasonable is to provide access to the source under a license that is not something that the OSI would consider "open source": something that would disallow redistribution of source or binaries and perhaps even disallow use of the code for any purpose other than reproducing the original experiment. Assuming the licensing terms are obeyed, this would allow you to commercialize the code, while still letting other academics examine the code and use it to verify the validity of your results.

This is what I was thinking as well.

The GNU or BSD licenses grant significantly more rights than are necessary to accomplish the stated goals of the op-ed. Perhaps that is desirable anyway, for other reasons, but that's a separate question.

I have encountered this problem first hand. I was trying to relate some of my own Ph.D. research (experiments) to a particular prominent computational model. I would not have been able to interpret the code anyway, but I was horrified to learn that many of the details of the model were known only to the lead researcher who authored it. Not only was the code unavailable, the program itself had never been released, only described in English and in equations. Many implementation details which did not relate to the author's core theoretical claims did not even merit that (and the devil IS in the details in such models). This left significant ambiguity regarding not only how the model worked internally, but also what its predictions were. Papers regarding the model were first published in the late 1990s. I was investigating it more than ten years later. The original lead researcher and colleagues were still publishing work relating to the model at this time. All this left me wondering: why did the author construct a computational model in the first place? Isn't the whole purpose of such models meant to be that they force explicitness and don't allow us to gloss over uncertainties and ambiguities like English descriptions do? If this is the point of such models (which is what my lecturers and supervisors repeatedly told me), then this particular model is just a waste of the scientific community's time.

All this ignores a related but equally important problem which can result from closed code, namely that when you are playing with a model which has lots of free parameters you can "predict" (mimic?) just about anything just by messing about with those parameters (this is what is behind all the models which seem to explain everything when first released, but mispredict any new findings until messed about with again). If only the original author can play around with the model in that way, then it can be hard to tell whether this has happened in a particular case. You have to just "trust" the good intentions and insight of the original researcher.
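A toy Python demonstration of that point (the data here are deliberately pure noise, invented for the example): a model with as many free parameters as data points can "retrodict" anything while predicting nothing:

```python
import numpy as np

# Ten observations of pure noise -- there is nothing real to explain.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = rng.normal(size=10)

# A 9th-degree polynomial has 10 free parameters, so it fits 10 points
# essentially exactly (NumPy may warn that the fit is ill-conditioned).
coeffs = np.polyfit(x, y, deg=9)
residual = np.max(np.abs(np.polyval(coeffs, x) - y))
print(residual)                 # ~0: a "perfect" fit to noise

# But the same model extrapolates wildly just outside the data range.
print(np.polyval(coeffs, 1.1))  # far from anything sensible
```

With enough knobs, a "perfect" fit to existing data says nothing about a model's validity; only out-of-sample predictions, which require access to the model, can test it.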

Sometimes I think computational models create more problems than they solve. We have to start remembering that they are only descriptions of real-world stuff (and code is just language), and are therefore only useful if other researchers are able to actually interpret and understand them.

I don't know if open source is needed, but at least source code should be available for review by interested parties, even if under a restrictive license. Too many people publish results that are difficult to verify, given the complexity and vagueness of their descriptions. Generally not in good journals, but still.