Reproducibility in Computer Systems Research

These results about reproducibility in computer systems research have been the subject of lively discussion on Facebook and G+ lately. The question is: for 613 papers, can the associated software be located, compiled, and run? In contrast with something I often worry about, producing good software, the bar here is low: even a system that compiles and runs might be without value, and might not even support the results in the paper.

It is great that these researchers are doing this work. Probably everyone who has worked in applied CS research for any length of time has wanted to compare their results against previous research and been thwarted by crappy or unavailable software; certainly this has happened to me many times. The most entertaining part of the paper is Section 4.3, called “So, What Were Their Excuses? (Or, The Dog Ate My Program)”. The recommendations in Section 5 are also good; I think all CS grad students and faculty should read over this material. Anecdote 2 (at the very end of the paper) is also great.

It’s not clear why installing these dependencies was difficult, but whatever. In a Facebook thread a number of other people reported learning that they have been doing irreproducible research for similar reasons. The G+ thread also contains some strong opinions about, for example, this build failure.

Another objection is that “reproducible research” usually means something entirely different from successfully compiling some code. Usually, we would say that research is reproducible if an independent team of researchers can repeat the experiment using information from the paper and end up with similar (or at least non-conflicting) results. For a CS paper, you would use the description in the paper to implement the system yourself, run the experiments, and then see if the results agree with those reported.

Although it’s not too hard to put some code on the web, producing code that will still compile in 10 or 20 years is not easy. A stable platform is needed. The most obvious one is x86, meaning that we should distribute VM images with our papers; this seems clunky and heavyweight, but perhaps it is the right solution. Another reasonable choice is the Linux system call API, meaning that we should distribute statically linked executables or the equivalent. Yet another choice (for research areas where this works) might be a particular version of Matlab or some similar language.

The biggest threat is that their code has bugs. I don’t worry much that the authors of a paper lied about what their program printed on the screen; I worry about whether they correctly implemented the algorithm they described. Downloading and running their binary code offers zero protection against this. Even examining their source code doesn’t offer real protection.

The only real way to get reproducibility is for me to take the pseudocode from their text, re-implement it from scratch, and get comparable results.

It seems to me that availability of the source should be considered the metric of success. Building any software is a pain, and ease of build isn’t really the point of research artifacts. If I really need to use some other system, I’ll happily put in the day to get it running. The real problem is when the implementation isn’t available at all, so you can’t compare against the previous work.
Our case was odd: although our system was available on the web, we forgot to link to it from the paper. The email we received looked like a form letter, so we guessed that something like this study was underway and sent back detailed build instructions. It was frustrating that we never heard anything from them.

Brian, I totally agree. I’m going to be putting many days into any evaluation, so a day here and there getting the competition to run is nothing. Also, even if I end up not running the earlier code, I can inspect it to confirm guesses about how it actually works, since papers almost always leave out crucial details.

Regarding detailed build instructions, our software did include them, and C-Reduce has plenty of external users! So basically the person doing the work dropped the ball, in my (admittedly biased) opinion.

In many respects, the big flaw here is that they’re going for quantity over quality. I once embarked on a project that involved building two-year-old versions of Firefox, and I had to fix many small problems to get them to work properly: it took me about a day, and that is coming from someone who already builds Firefox on a very regular basis. I imagine the authors of this paper couldn’t put in the per-system effort that they would have if they had had more time.

I’m not convinced that just having source is a sufficient metric for success; an author who responds to email is equally important.

Another issue is that exactly reproducing the original build environment may not be enough. For example, if a technique I want to compare against requires g++ 3.4, it’s useless on real-world C++ programs, because g++ 3.4 can no longer compile current software. For a comparative evaluation, the test suites you want to measure against are a moving target, and software that can move with the target is arguably more important than remeasuring the original one.

Joshua, I’ve had a similar experience trying to build old versions of GCC: it is extremely time-consuming, and the alternative of going back to Red Hat 3.0.3 or whatever in a VM doesn’t really solve my problem, because nothing else runs there.

Yes, a responsive author is super useful, but the reality is that the students are the technical experts, and their memory of their research fades rapidly once they take jobs. We professors don’t have great long-term memories either.

The fast-moving nature of the field means that digital archaeology efforts are to some extent doomed. But as the Arizona paper makes clear, we can do a lot better without really breaking a sweat.

As for the general point: researchers are typically very protective of their image. They are therefore reluctant to release the code they write, since they know it’s usually quite sloppy. I’ve had both good and bad experiences. Once, a researcher’s software had a bug that, by accident, implemented a good heuristic. The paper never mentioned it because the researcher was unaware of it. Pretty funny stuff.

Hi Mate, yeah, a significant number of these build failures are trivially solvable by anyone with a bit of experience building software on Linux. At the G+ thread linked in the post, people have posted a lot more examples of this. It’s a shame that (as Joshua observed) they went for quantity instead of quality.

Regarding research software that works by accident or otherwise works for a reason not stated in the paper: I believe this is extremely common. I had a PLDI paper (not the one that they could not reproduce) where exactly what you describe happened: I made a mistake implementing the algorithm described in the paper, but the results were indistinguishable from the results of the correct algorithm. We really scratched our heads over that puzzle — the success of the incorrect code ended up being explainable but not easy to explain, if you see what I mean. I was lucky my student discovered the error early enough that we fixed it for the accepted version of the paper (though as I said the fixed graphs looked just like the original ones).

A point that’s often overlooked in discussions of reproducibility is one of the root causes: today’s computational environments were absolutely not designed to facilitate distributing and installing software in a portable way. That’s particularly true for the Unix family, which treats all software as extensions to the operating system. It happens that I wrote a blog post on this topic a while ago: http://www.activepapers.org/2014/01/31/Installing-Software.html

The right way, in my opinion, is to create a well-defined software-execution platform. There have been a few promising attempts and I hope developments along these lines continue: the Java Virtual Machine, NixOS (http://nixos.org/) and its offshoot Guix (https://www.gnu.org/software/guix/), plus a couple of software installation management systems to which I don’t have references at hand right now.
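To make the Nix idea concrete, here is a hedged sketch of what a pinned build environment looks like in practice; `<rev>` is a placeholder for a specific nixpkgs commit hash, not a real value, and this assumes the Nix tools are installed:

```shell
# Sketch only: pinning the package set to one nixpkgs commit means anyone
# running this command, even years later, gets the same versions of gcc,
# make, and every transitive dependency, independent of the host system.
nix-shell --pure \
  -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/<rev>.tar.gz \
  -p gcc gnumake \
  --run "make"
```

The `--pure` flag drops the host environment, so the build can’t silently depend on whatever happens to be installed locally, which is exactly the failure mode the Arizona study kept hitting.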

I agree with some of the basic ideas of the paper, but the process seems slapdash. For one thing, I can’t easily lay my hands on a Linux machine that doesn’t already have at least flex and indent installed. Pruning out the “trivial build errors for anyone who should be trying this in the first place” would have been a good start, because there ARE tools with genuinely nasty build dependencies. “You have to install LLVM” vs. “you need OCaml, but not a recent OCaml, this specific version and this old version of this theorem prover and…”

Perhaps the ACM and/or IEEE could “publish” a small number of versions of a Linux VM each year. E.g., “The Official ACM 2014 Large VM”. There could be several different flavors: a lightweight version, one containing a bunch of libraries, etc. Authors would be encouraged to design their software to run on one of these VMs. The ACM or IEEE would distribute the VMs on their web sites and archive them, and authors would give instructions for building starting from one of those VMs. Then a 1-1 correspondence between VMs and papers would not be needed.

Hi Stephen, I like this, but in practice wouldn’t it be easier and better for ACM (or, more likely, USENIX) to simply bless and archive the current Ubuntu LTS? 90% of research software would just work.