Errors in scientific results due to software bugs are not limited to a few high-profile cases that lead to retractions and are widely reported. Here we estimate that in fact most scientific results are probably wrong if data have passed through a computer, and that these errors may remain largely undetected. The opportunities for both subtle and profound errors in software and data management are boundless, and yet bafflingly underappreciated.

Computational results are particularly prone to misplaced trust

Perhaps due to ingrained cultural beliefs about the infallibility of computation, people show a level of trust in computed outputs that is completely at odds with the reality that nearly zero provably error-free computer programs have ever been written.

It has been estimated that the industry average rate of programming errors is “about 15 - 50 errors per 1000 lines of delivered code” (McConnell, Code Complete). That estimate describes the work of professional software engineers—not of the graduate students who write most scientific data analysis programs, usually without the benefit of training in software engineering and testing. The most careful software engineering practices in industry may drive the error rate down to 1 per 1000 lines.

For these purposes, using a formula to compute a value in Excel counts as a “line of code”, and a spreadsheet as a whole counts as a “program”—so many scientists who may not consider themselves coders may still suffer from bugs.

Table 1: Number of lines of code in typical classes of computer programs.

How frequently are published results wrong due to software bugs?

Of course, not every error in a program affects the outcome of a specific analysis. For a simple single-purpose program, it is entirely possible that every line executes on every run. In general, however, a given run of a program executes only a subset of its lines, because there may be command-line options that enable or disable certain features, blocks of code that execute conditionally depending on the input data, etc. Furthermore, even if an erroneous line executes, it may not in fact manifest the error (i.e., it may give the correct output for some inputs but not others). Finally, many errors may cause a program simply to crash or to report an obviously implausible result; we are really only concerned with errors that propagate downstream and are reported.

In combination, then, we can estimate the number of errors that actually affect the result of a single run of a program, as follows:

# errors per program execution =
total lines of code
* proportion executed
* probability of error per line
* probability that the error meaningfully affects the result
* probability that an erroneous result is plausible.
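The estimate is just a product of the five factors, so it reads directly as a small function. A minimal sketch (the function and parameter names are mine, chosen to mirror the factors above):

```python
def expected_wrong_results(total_loc, prop_executed, errors_per_line,
                           p_meaningful, p_plausible):
    """Expected number of errors affecting the reported result of one run.

    errors_per_line is a rate, e.g. 10 errors per 1000 lines -> 0.01.
    """
    return (total_loc * prop_executed * errors_per_line
            * p_meaningful * p_plausible)
```

The result is an expected count of result-corrupting errors per run, not a probability; a value well above 1 means a wrong output is all but certain.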

Scenario 1: A typical medium-scale bioinformatics analysis

All of these values may vary widely depending on the field and the source of the software. For a typical analysis in bioinformatics, I'll speculate on some plausible values:

100,000 total LOC (neglecting trusted components such as the Linux kernel).

20% executed

10 errors per 1000 lines

0.1 chance the error meaningfully changes the outcome

0.1 chance that the result is plausible

So we expect that about two errors changed the output of this program run; the probability of a wrong output is effectively 1.0. All bets are off regarding scientific conclusions drawn from such an analysis.

Scenario 2: A small focused analysis, rigorously executed

Let's imagine a more optimistic scenario, in which we write a simple, short program, and we go to great lengths to test and debug it. In such a case, any output that is produced is in fact more likely to be plausible, because bugs producing implausible outputs are more likely to have been eliminated in testing.

1000 total LOC

100% executed

1 error per 1000 lines

0.1 chance that the error meaningfully changes the outcome

0.5 chance that the outcome is plausible

Here the probability of a wrong output is 0.05.
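Plugging the two scenarios' values into the product of the five factors confirms the arithmetic (plain Python; the numbers are copied from the lists above):

```python
# Scenario 1: a typical medium-scale bioinformatics analysis
scenario_1 = 100_000 * 0.20 * (10 / 1000) * 0.1 * 0.1   # expected errors per run

# Scenario 2: a small, rigorously tested program
scenario_2 = 1_000 * 1.00 * (1 / 1000) * 0.1 * 0.5      # expected errors per run
```

Scenario 1 yields an expectation of 2 result-corrupting errors per run; Scenario 2 yields 0.05.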

The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values. Measuring such values rigorously in different contexts would be valuable but also tremendously difficult. Regardless, it is sobering that some plausible values lead to total wrongness all the time, and that even conservative values lead to errors that occur just as often as false discoveries at the typical 0.05 p-value threshold.

Software is outrageously brittle

A response to these concerns that I have heard frequently—particularly from wet-lab biologists—is that errors may occur but have little impact on the outcome. This may be because only a few data points are affected, or because values are altered by a small amount (so the error is “in the noise”). The above estimates account for this by including terms for “meaningful changes to the result” and “the outcome is plausible”. Nonetheless, in the context of physical experiments, it's easy to have an intuition that error propagation is somewhat bounded, i.e. if the concentration of some reagent is a bit off then the results will also be just a bit off, but not completely unrelated to the correct result.

But software is different. We cannot apply our physical intuitions, because software is profoundly brittle: “small” bugs commonly have unbounded error propagation. A sign error, a missing semicolon, an off-by-one error in matching up two columns of data, etc. will render the results complete noise. It's rare that a software bug would alter a small proportion of the data by a small amount. More likely, it systematically alters every data point, or occurs in some downstream aggregate step with effectively global consequences.
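A toy illustration of this brittleness, using invented data: suppose two columns are perfectly related, but a bug misaligns them by one row when they are paired up (the off-by-one error mentioned above). The sketch below uses only the standard library; the data and variable names are hypothetical.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
y = [2.0 * v for v in x]              # y is a perfect linear function of x

r_correct = pearson(x, y)             # ~1.0: the true relationship
r_shifted = pearson(x[:-1], y[1:])    # rows misaligned by one: near 0
```

The single-row misalignment does not nudge the answer by a small amount; it replaces a perfect correlation with noise.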

Software errors and statistical significance are orthogonal issues

A software error may produce a spurious result that appears significant, or may mask a significant result.

If the error occurs early in an analysis pipeline, then it may be considered a form of measurement error (i.e., if it systematically or randomly alters the values of individual measurements), and so may be taken into account by common statistical methods.

Typically, however, the computed portion of a study comes after data collection, so its contribution to wrongness may easily be independent of sample size, replication of earlier steps, and other techniques for improving significance. For instance, a software error may occur near the end of the pipeline, e.g. in the computation of a significance value or of other statistics, or in the preparation of summary tables and plots.
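As a concrete (hypothetical) late-pipeline bug: a sign error in the final conversion of a z-statistic to a one-sided p-value. The sketch uses the standard normal CDF via the error function:

```python
from math import erf, sqrt

def p_one_sided(z):
    """One-sided p-value for a z-statistic: P(Z > z) under N(0, 1)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

z = 2.5                        # a genuinely strong effect
p_correct = p_one_sided(z)     # ~0.006: correctly reported as significant
p_buggy = p_one_sided(-z)      # sign flipped downstream: ~0.994, effect masked
```

No amount of additional sampling upstream would repair a result passed through the buggy final step.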

The diversity of the types and magnitudes of errors that may occur makes it difficult to make a blanket statement about their effects on apparent significance. However, it seems clear that a substantial proportion of the time (based on the above scenarios, anywhere from 5% to 100%), a result is simply wrong—rendering moot any claims about its significance.

What to do?

All hope is not lost; we must simply take the opportunity to use technology to bring about a new era of collaborative, reproducible science. Some ideas on how to go about this will appear in following posts. Briefly, the answer is to redouble our commitment to replicating results, and in particular to insist that a result can be trusted only when it has been observed on multiple occasions using completely different software packages and methods. This in turn requires a flexible and open system for describing and sharing computational workflows, such as WorldMake. Crucially, such a system must be widely used to be effective, and gaining adoption is more a sociological and economic problem than a technical one. The first step is for all scientists to recognize the urgent need.