Guidelines for Research on Finding Bugs

There’s a lot of research being done on finding bugs in software systems. I do some of it myself. Finding bugs is attractive because it is a deep problem, because it’s probably useful work, and because — at least for some people — finding bugs is highly entertaining. This piece contains a few thoughts about bug-finding research, formulated as guidelines for researchers. Some of these are based on my own experience, others are based on opinions I’ve formed by reviewing lots of papers in this area.

Target Software That’s Already Pretty Good

Not all software is good. In fact, some of it is downright crappy. Here crappy is a technical term: it means a code base containing enough serious known bugs that the developers don’t want to hear about any new ones. Finding bugs in crappy software is like shooting fish in a barrel. Crappy software does not need clever computer science: it needs more resources, new management, or simply to be thrown away. In contrast, software that is pretty good (another technical term) may have outstanding bugs, but the developers are actively interested in learning about new ones and will work to fix them. This is the only kind of software that should be targeted by bug-finding tools. Software that is extremely good is probably not a good target either — but this code is quite uncommon in the real world.

Report the Bugs

I’m always surprised when I read a bug-finding paper that claims to have found bugs, but does not mention how the people who developed the buggy software reacted to the resulting bug reports. The implication in many cases is that the bugs were identified but not reported. It is a mistake to operate like this. First, when handed a bug, software developers will often discuss it, and you can learn a lot by listening in. Some bugs found by bug-finding tools aren’t bugs at all but rather stem from a misunderstanding of the intended behavior. Other bugs are genuine but developers aren’t interested in fixing them; it is always interesting to learn how bugs are prioritized. If a bug-finding research tool is not finding bugs that are useful, then the tool itself is not useful. However, perhaps it can be adapted to filter out the bad ones or to find better ones in the first place. In all cases, feedback from software developers is valuable to the bug-finding researcher. Another reason it’s important to report bugs is that fixed software is better for everyone. Obviously it’s better for the software’s users, but it is also better for bug-finding researchers since bugs tend to hide other bugs. Paradoxically, bug-finding often becomes easier as bugs are fixed. Furthermore, it is important to raise the bar for future bug-finding research: if a dozen different papers all claim credit for finding the same bug, this is hardly a success story. A final reason to report bugs is that the very best outcome that you can report in a paper about a bug-finding tool is that it has found bugs that developers cared about enough to fix.

Prefer Open Source Software as a Bug-Finding Target

In reporting a large number of compiler bugs, I have found that bugs in open source compilers are more likely to be fixed than bugs in commercial tools. One reason, I think, is that people developing commercial compilers have a tendency to spend their precious time supporting paying customers as opposed to humoring obnoxious researchers. Another reason to prefer open source software is that it usually has a public bug reporting system, even if it’s just a mailing list, so you can listen in and perhaps participate when bugs are being discussed. Also, when a bug is fixed you can immediately try out the fixed version. In contrast, it is unusual to be granted access to unreleased versions of commercial software, so you may have to wait six months or two years for a new release, at which point your students have graduated, your grants have expired, and your bug-finding tool might not compile any longer (it definitely won’t compile after six months if your code interfaces with a fast-moving code base like LLVM).

Think Carefully when a Bug-Finding Tool Doesn’t

Certain codes are too hardened to be good bug-finding targets. For example, when hunting for integer overflow errors we found none in several popular libraries such as libjpeg. The problem is that this kind of code has been intensely scrutinized by security people who spent a lot of effort looking for similar kinds of problems. Other codes don’t contain good bugs because they are too small. Of course, a failure to find bugs doesn’t mean they don’t exist: it might mean that the bug-finding tool is not yet mature enough or it may simply be based on an incorrect premise. I’ve supervised a couple of casual student projects where they wrote fuzzers for the Linux system call API. Although this is clearly a hardened target, I have confidence that bugs would have been found if the students had persisted (they did not).
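For readers unfamiliar with the kind of error mentioned above, here is a minimal sketch of what an integer overflow check looks for. It is written in Python, which has unbounded integers, so the signed 32-bit limits are modeled explicitly. The function names and the image-buffer scenario are hypothetical illustrations; this is not code from libjpeg or from any particular overflow checker.

```python
# Signed 32-bit limits, modeled explicitly because Python ints never overflow.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def checked_mul_i32(a, b):
    """Return a * b if it fits in a signed 32-bit integer, else None."""
    result = a * b
    if INT32_MIN <= result <= INT32_MAX:
        return result
    return None


def buffer_size(width, height, channels):
    """Hypothetical size computation for an image buffer.

    Returns None as soon as an intermediate product would overflow,
    which is the condition an overflow checker would report.
    """
    row_bytes = checked_mul_i32(width, channels)
    if row_bytes is None:
        return None
    return checked_mul_i32(row_bytes, height)


small = buffer_size(640, 480, 3)       # fits comfortably
huge = buffer_size(65536, 65536, 3)    # would overflow a 32-bit int
```

Hardened libraries tend to already contain guards of this shape around size computations, which is one plausible reason a tool looking for this class of bug comes up empty on them.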

One trick I’ve seen people play when a bug-finding tool doesn’t find any bugs is to relabel the work as an exercise in formal verification where the desired result is evidence of absence of bugs rather than evidence of their presence. In some cases this may work, but generally speaking most bug-finding tools have many embedded assumptions and unsoundnesses that — if we are honest — cause their verification claims to be weak. Although there is certainly some overlap between bug-finding and formal verification, the latter sort of work is often conducted differently, using different tools, and attacking much smaller pieces of software.

Conclusion

I feel like this is all pretty obvious, but that it needs to be said anyway. Other than the superb Coverity article, there’s not a lot of good writing out there about how to do this kind of work. If you know of any, please leave a comment.

14 thoughts on “Guidelines for Research on Finding Bugs”

I think you forgot to mention the tried and true technique of running your tool on a version of the software with known bugs (because…you have access to the bug database) and seeing if the tool can re-find those bugs.

Of course it gets exciting when you can find new bugs, perhaps too exciting. I decided early on as an undergraduate not to play that game.
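The known-bug evaluation suggested in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: the bug identifiers are made up, and it assumes each tool finding has already been matched by hand to a bug database entry (or labeled as unmatched), which is usually the hard part.

```python
def recall_on_known_bugs(known_bug_ids, tool_findings):
    """Compare tool findings against bugs known to exist in an old version.

    Returns (recall, missed, unmatched):
      recall    - fraction of known bugs the tool re-found
      missed    - known bugs the tool did not report
      unmatched - findings with no known bug; these are either new bugs
                  or false positives and need manual triage
    """
    known = set(known_bug_ids)
    findings = set(tool_findings)
    refound = known & findings
    recall = len(refound) / len(known) if known else 0.0
    return recall, sorted(known - findings), sorted(findings - known)


# Hypothetical identifiers, for illustration only.
known = ["BUG-101", "BUG-102", "BUG-103", "BUG-104"]
findings = ["BUG-102", "BUG-104", "NEW-1"]

recall, missed, unmatched = recall_on_known_bugs(known, findings)
```

Here the tool re-finds half of the known bugs, and the single unmatched finding is the one that would need a closer look before anyone gets excited about it.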

John, I’m going to have to put in a caveat on points 1 and 2, while yelling “YES, YES, AMEN” on 3 and 4.

In general, if you want to do careful experiments on the power of different bug finding methods, hitting “pretty good” code and reporting the bugs may not be a great idea, depending on what is meant by “pretty good.”

If you’re bug-hunting in the latest version of good software, finding enough bugs to be statistically valid may be hard. Additionally, as we know, knowing whether you have found five bugs or fifty depends on getting some sense of which tests/reports are for different bugs, and that’s harder if you are finding things that require reporting. Too many papers (and model checking is particularly bad for this, and I’ve done it myself) use “evidence” of bug finding power like: “Method M1 found a previously unknown bug in Important Program P that method M2 did not find.” This is great news for improving program P, but it might not mean much about M1 vs. M2. Actually, I guess I wouldn’t argue with point 1: testing lousy code nobody cared about in the past or present or future is pointless. But there’s a decent argument for doing some research on things like old versions of Pretty Good and Important code that are (in some ways) lousy. When I want statistically large numbers of bugs to count for seeing which random testing methods work, old but not ancient JavaScript engines do seem reasonable. And of course, if any of the found bugs still trigger on a current release, reporting is important and good. But reporting bugs, at least on a smaller scale than you guys’ Csmith efforts, has a bit of a tendency toward “the plural of anecdote is data.”
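The counting problem raised here (five bugs or fifty?) is often approached by bucketing failing tests by a normalized failure signature. The sketch below is a crude heuristic, not a method taken from this discussion: the normalization rule and the sample failure messages are assumptions made up for illustration. A single bug can produce several signatures, and distinct bugs can share one, so the bucket count is only a rough estimate.

```python
import re
from collections import defaultdict


def signature(message):
    """Normalize a failure message so trivially different reports collide.

    Hex addresses become ADDR and remaining digit runs become N; the
    result is lowercased. This rule is an illustrative assumption.
    """
    sig = re.sub(r"0x[0-9a-fA-F]+", "ADDR", message)
    sig = re.sub(r"\d+", "N", sig)
    return sig.strip().lower()


def bucket_failures(failures):
    """Group (test_id, message) pairs by normalized signature."""
    buckets = defaultdict(list)
    for test_id, message in failures:
        buckets[signature(message)].append(test_id)
    return dict(buckets)


# Made-up failure messages, for illustration only.
failures = [
    ("t1", "Assertion failed: idx < len at gc.c:120"),
    ("t2", "Assertion failed: idx < len at gc.c:120"),
    ("t3", "Segfault at 0x7ffe12ab in parse_expr"),
    ("t4", "Segfault at 0x7ffe99cd in parse_expr"),
    ("t5", "Assertion failed: scope != NULL at emit.c:77"),
]

buckets = bucket_failures(failures)
distinct_estimate = len(buckets)
```

Five failing tests collapse into three buckets here. When comparing two testing methods, counting buckets rather than raw failing tests at least avoids rewarding a method for hitting the same easy bug over and over.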

Maybe it just depends what we mean by “pretty good.” Is Mozilla SpiderMonkey prior to 1.8.5 “pretty good?” Throwing jsfunfuzz with some modifications at it for a few hours will turn up a LOT of bugs, so maybe not. But it was clearly important software, with semi-serious test effort, so using it to compare techniques for testing JS engines seems reasonable. It also applies to the use case that most “pretty good” pieces of code are probably “pretty lousy” at some early point in development. Testing should help you get from “lousy” to “good” also.

The “historical” approach does have some problems, in that while everything on an old JS engine I find seems to eventually get fixed in the “future,” testing old C compilers will reveal bugs that aren’t “worth” fixing. But perhaps this can be used as a different measure of bug quality? If you test old program P and find bugs that were eventually fixed, by definition they were probably “good” bugs. Bugs not fixed are either (1) worth reporting or (2) uninteresting to developers. If you find that the bug is in the bug DB, you know it is probably 2. The time between the version you tested and the bug fix in future time is some mix of priority (unimportant stuff will take longer to find) and difficulty to find (they couldn’t fix it until they found it). Too bad the work of finding when a fixed bug was actually detected is often fairly hopelessly labor intensive.

You probably know about this (but then I’m not sure why you didn’t mention it), but Dave Jones (and other kernel developers) are developing and using a fuzzer for the Linux kernel system calls, Trinity ( http://codemonkey.org.uk/projects/trinity/ ). It finds bugs: dozens rather than hundreds, but still.

Hi Alex, of course I don’t disagree with your specific points but I do stand by the overall judgement that testing really bad software is a bad idea, not least because it is often the case that a handful of really horrible bugs distort the results by being hard *not* to trigger.

Hi gasche, yes, we looked at Trinity a while ago and I think it’s very cool. At one of the first conferences I attended, a USENIX in the late 90s, there was a great talk about CRASHME, a previous system call fuzzer. A few years ago when I was working on system call fuzzing I had some ideas about things to do differently than these previous projects but I don’t remember now what they were….

Xi Wang, et al., did all of the right things with their work on optimization-unstable code: they developed a tool (STACK), ran it against several large open source projects’ codebases, manually looked through the tool’s output to find “real” bugs, developed patches, filed bug reports w/ patches attached in the projects’ bug trackers, and wrote about the results of submitting the bugs to the projects in their paper.

All of the projects except one fixed all of the bugs reported, so “doing the right thing” truly does produce results, at least with large open source projects. And it made for an interesting paper.

John — right, I agree strongly there. Guess I just thought you were making a stronger (and nobler?) statement. There is a fundamental negative point to the historical method that Sean and I advocate: beating dead horses because they are the only statistically significant horses has an element of purposelessness and cruelty to it. More important, I suspect testing old versions is also less likely to win over actual developers than announcing new bugs, even if it is a sounder research method in some cases.

Alex, right, I hear you. At some level it comes down to the question of what we’re doing here. I’m here almost 100% just to make software better. You (I think) also have a slice of “I’m here to understand software and software testing.” Many software engineers go further down that road. Of course I’m oversimplifying; obviously I recognize the need to understand software if we’re going to test it effectively. But fundamentally, I don’t want to understand software, I just want to beat it up.

Fair enough. Yeah, I have more of a foot in the generality camp. Flipside: I think you want to understand the software you’re beating up more than I do; part of my focus on generalized testing techniques and comparisons is that I want to not have to be too familiar with the system I’m testing, because I’m afraid that doesn’t scale well, and I have a poor attention span. You understand C compilers better than I understand file systems, to pick some of our favorite targets. You can’t be completely ignorant, but part of my big plan is to focus on the side of things where deep knowledge of the system under test isn’t totally needed, partly because in my experience at NASA, the test team is smaller than the dev team in most situations, so you can’t be as expert as the devs. You need some dirty tricks, lacking time to be an expert. I see random testing as one particularly potent dirty trick.