Levels of Fuzzing

Differential Testing for Software is one of my favorite papers about software testing: it is one of the few pre-delta-debugging papers to describe automated test-case reduction and it contains many pieces of wisdom about software testing in general and random testing in particular. One of them outlines a hierarchy of test case generation strategies:

For a C compiler, we constructed sample C source files at several levels of increasing quality:

Sequence of ASCII characters

Sequence of words, separators, and white space

Syntactically correct C program

Type-correct C program

Statically conforming C program

Dynamically conforming C program

Model-conforming C program

The paper goes on to describe how test cases at different levels reveal different kinds of flaws in a compiler. McKeeman’s choice of the words “increasing quality” is interesting because the set of test cases that can be generated at each level is a strict subset of the set of test cases generated at the previous level. In other words, for example, every type-correct C program is syntactically correct but the converse is not true. Since the expressiveness of the test-case generator decreases with increasing level, so must the size of the set of bugs that can be triggered.

So where does “increasing quality” come from? The problem is that in the real world, testing budgets are finite. Therefore, while a random sequence of ASCII characters can in principle trigger every single compiler bug, in practice we would have to run a very large number of test cases before finding the kind of bug-triggering programs shown here. Actually, it is not necessarily the case that every compiler bug can be triggered by 7-bit ASCII codes. To be safe, we should probably define test level 0: a random bit stream.

Given the low probability of triggering difficult optimizer and backend bugs using random bits or characters, we might interpret “increasing quality” as meaning “increasing coverage of compiler code within a reasonable test budget.” So we have this interesting situation where decreasing the set of inputs that we are willing to generate increases the expected level of coverage of the system under test.

Although McKeeman reports finding bugs at levels 1-5, for the Csmith work we did not bother with any of levels 0-5. It’s not that we didn’t think test cases at these levels would trigger some bugs, but rather that we were mainly interested in sophisticated optimizer bugs that are very hard to spot until level 6. My guess, however, is that a high-quality modern compiler (GCC, Clang, Intel CC, MSVC, etc.) is probably not going to have a lot of bugs that you can find using levels 0-3.

Bart Miller’s original fuzzer operates only at level 1. The fact that it found a lot of bugs tells us that the programs being tested contained a lot of bugs in the code that can be reached using level 1 inputs within a very small (by today’s standards) test budget. Basically, the then-unfuzzed UNIX utilities worked in the common case but were very fragile with respect to non-standard inputs. Also, it is most likely the case that many of these programs lacked the kind of sophisticated constraints on what constitutes a valid input that C compilers have.

Miller’s fuzzer, McKeeman’s fuzzer, and Csmith are all generational fuzzers: they generate test cases from scratch. In the time since the original fuzzing paper, the term “fuzz” has come to often mean mutational fuzzing where a valid input to the system under test (such as a PDF document, an Excel spreadsheet, or whatever) is randomly modified. This kind of fuzzer does not fit neatly into McKeeman’s hierarchy, it can be seen as a hybrid of a level 0 fuzzer and a level 7 fuzzer.

I want to finish up by making a few points. First, if you are writing a generational fuzzer, it is worth thinking about which level or levels that fuzzer should operate at. In other words, McKeeman’s hierarchy is not just a way to look at compiler fuzzers, it’s a way to understand all generational fuzzers. I also think it is useful as a means of understanding mutational fuzzers even if they do not fit into the hierarchy so neatly. For example, one can imagine a mutational fuzzer that instead of modifying test cases at level 0, modifies them at level 3 or 4. My second point is that a comprehensive fuzzing campaign should operate at multiple levels of McKeeman’s hierarchy. In other words, it would be useful to create a family of fuzzers, for example one operating at each of levels 0-7, and to split up the testing budget among them. The details of how this split is performed are probably not that important, but in practice I would look at the bug yield of each level after a couple of weeks of fuzzing and then use that information to increase the budget allocated to the more productive levels. However, if I really cared about the quality of the system under test, under no circumstances would I allocate zero test budget to any of the fuzzers. As I said earlier, for the Csmith work we ignored all but level 6: this was because we had a specific research agenda, as opposed to wanting to maximize the quality of some compiler. Finally, it is not the case that all of McKeeman’s levels make sense for all systems under test. For example, if we are fuzzing an API instead of a standalone tool, it probably makes no sense to pass random bit streams to the API.

12 thoughts on “Levels of Fuzzing”

Not sure I understand this: “For example, if we are fuzzing an API instead of a standalone tool, it probably makes no sense to pass random bit streams to the API.” What, in your terms, constitutes testing an API? Isn’t an API implemented by a piece of software (“a standalone tool”)? To me this reads like a slightly tautological remark that if we’re testing on a higher level, we’re going to be testing on a higher level, but I don’t know what you were going for.

Hi Jeroen, by “testing an API” I meant for example unit testing of a Java class. So in this case I certainly don’t want to send a random bit stream, I need to perform valid Java method invocations containing references to valid objects, etc.

@regehr: Yes, but… if you are restricting your testing frame to that, you *can’t* send random bit streams because those don’t exist on that level. (If the method takes a byte array you could stuff random bits in there, of course, but I think that’s not what we’re talking about.) So no need to mention you don’t want to.

Whether to be random or not could be considered an independent aspect of the strategy, and the levels could also be seen as the languages under test. In the example of Java unit tests, all possible invocation sequences of all possible methods with all possible parameter values constitutes the language under test, and the equivalent of a random bit stream is a random sequence of parameter calls with random values. If you take it one level higher, you’d restrict the parameter values to legal ranges, but you could still do random sequences of calls with random values that are in range… and so on. The highest level in this case is calling the API exactly as it was intended to be called, with no violations on any level of usage (level 7 in the post), so “all” we’re testing is correctness of the implementation under expected circumstances.

I think that instead of the random inputs one can generate, it helps to think about the oracles that one can define, or not.

Random strings work for compilers because a quality compiler does not crash on any input string.

An API may be open to calls from an adversary who is allowed anything to put the system in an inconsistent state. Then random sequences correspond to things he/she could try to produce a recognizable result, and it is useful to have some of this strategy in a testing campaign (we did a case study like this, but with static analysis instead of random testing).

Jeroen, that’s right, this hierarchy is describing an increasingly restricted series of languages.

Pascal, McKeeman’s hierarchy (at least as I present it here) is largely independent of what oracles you do or do not have. The problem being dealt with here is finding interesting inputs to the SUT within a limited testing budget.

Both of you are discussing the case where we are not fuzzing across a trust boundary, in which case we might be willing to honor a contract with the fuzzed code which specifies that we will not test some subset of the inputs. In the post I was simply trying to acknowledge that this issue exists without getting into it since the post is about this orthogonal issue: the efficiency of the fuzzer.

I think that meditating at length on the statement “The _possibility_ of finding a bug is not the same as the _probability_ of finding a bug” is really key to figuring out where in McKeeman’s hierarchy you want to put your effort. (It’s also the key to our swarm testing work). Think about any generational strategy as inducing a function P(B_i) that gives the probability of a test finding Bug #i for all bugs in the fault set of the program under test. The “good” thing about ASCII testing is that every possible P(B_i) is greater than 0! This is appealing. The downside is that P(B_i) is epsilon-close to zero for almost all interesting bugs, on a modern compiler or file system. It’s so close to zero in most cases that even throwing Google-level computing resources at the problem you still lose.

For APIs the hierarchy is still there, just internal. For file systems my students always wonder if they should generate random ASCII paths for calls. This is a level-0 strategy, and it has that same flavor of reducing P(B_i) to near zero for most bugs.

I think a second point for moving up the hierarchy in many cases is that bugs you lose by moving higher are often low-consequence bugs. Wrong-code bugs in compilers are by definition sitting at level 6. For file systems, you have to ask why you would be getting totally garbage pathnames. One likely possibility (the only one on NASA missions) is memory corruption by someone else. In that case, though, the pathname is just garbage and highly unlikely to make it past a path canonizer and do anything at all. If it’s provided by an adversary, you might want to care, but the right approach in that case is probably to drop generalized API testing of the file system and focus on the path canonizer. It might even be reasonable to prove that for bounded lengths the path canonizer never lets paths that don’t refer to valid file system entries cause harm.

The reason that security fuzzing gets away with very low-level fuzzing (0-2 or whatever) is that the bugs that we regard as being boring frontend bugs end up being security bugs often. So there’s this synergy where the easiest fuzzing techniques can often lead to the bugs that you most want to find.

On possible way to look at the levels is what detectable events should or should not happen:

0-7 compiler must not crash
3-7 must not generate syntax errors
4-7 must not generate type errors
5-7 must not generate compile time errors (but might generate warnings?)
6-7 will not volatile guarantees offered by the C memory model (e.g. output binary must not “crash”).

Hi bcs, I’ve often wanted to de-tune Csmith so that it can operate at these other levels. It seems certain that there are bugs to be found there.

On the other hand, there’s a little trick that I’ve run across (independently discovered by several other groups that I know of) where a testcase reducer serves as a pretty effective lower-level fuzzer. In other words, while reducing one bug you often end up crashing the compiler in novel ways. Pretty cool! We’ve seen this happen in the wild and it’s pretty reliable, the next thing to do is watch it happen under controlled circumstances and try to figure out exactly what’s going on and if it can be exploited to build fuzzers more cheaply.

“My guess, however, is that a high-quality modern compiler (GCC, Clang, Intel CC, MSVC, etc.) is probably not going to have a lot of bugs that you can find using levels 0-3.”

Hah, I wish. I don’t know about the others, but Clang is full of crash-on-invalid bugs; every so often, we’ll find one that’s triggered by an easy-to-make mistake, and it will collect a few duplicate reports a week.

I’d love to seed a mutational fuzzer with a >1MLOC codebase and set it loose for a few tens of thousands of CPU-hours.

Hi Matt, that is good to know. Since I do not feel like writing a new mutational fuzzer for C, we plan to reuse C-Reduce in this capacity. We have noticed that while reducing one compiler bug, it is very common to discover more. So there is something interesting going on there. The effect is very much analogous to the one mentioned in my “fuzzer puzzle” post.

So far we have only noticed this phenomenon in passing, but I plan to investigate it more thoroughly. Also, I believe it is possible (and in fact easy) to de-tune C-Reduce so that it becomes a worse reducer but a much better fuzzer.