Burning in a Module with Random Unit Testing

Sometimes a class or subsystem makes us uneasy; when something goes wrong in our software, we’ll immediately suspect the shady module is somehow involved. Often this code needs to be scrapped or at least refactored, but other times it’s just immature and needs to be burned in. Randomized unit testing can help with this burn-in process, increasing our confidence in a module.

This post isn’t about creating a random unit tester. Sometimes you can reuse a tester such as QuickCheck or Randoop. If not, it’s easy to create one by hand; I’ll cover that in a separate post.

The point is to increase our confidence in a module. But what does that mean? I’ll try to clarify with an example. Let’s say I implement a new data structure, perhaps a B-tree. Even after doing some unit testing, I probably won’t be particularly confident in its correctness. The question is: Under what conditions can I use random testing to gain as much confidence in my new B-tree as I would have in a balanced search tree from the C++ STL? For that matter, how confident am I in an arbitrary data structure from the STL in the first place? I would say “moderately confident.” That is, in general I would be happy to just use STL code in my program without specifically unit-testing it first. On the other hand, I would not just reuse the STL if I were developing safety-critical software.

Almost every time I write a software module that has a clean API and that might be reused, I write a random tester for it. Over the years I’ve developed a sort of informal procedure that, if successful, results in a burned-in module that I’m fairly confident about. Here are the necessary conditions.

Understandable code, clean API — I have to be able to understand the entire module at once. If not, it needs to be broken into units that are tested separately. The module can’t contain spaghetti logic or make use of extraneous state. If it does, it needs to be refactored or rewritten before burning in.

Heavy use of assertions — Every precondition and postcondition that can be checked, is. A repOk() / checkRep() method exists that does strong invariant checking. During fuzzing it is invoked after every regular API call.

Mature fuzzer for the module’s API — The random unit tester for the module is strong: it has been iteratively improved to the point where its tests reach into all parts of the module’s logic.

Fault injection — APIs used (as opposed to provided) by the module, such as system calls, have been tested using mocks that inject all error conditions that might happen in practice.

Good coverage — The maturity of the fuzzer is demonstrated by 100% coverage of branches that I believe should be covered. This includes error checking branches, but of course does not include assertion failure branches. At the system testing level, 100% coverage is generally impossible, but at the unit level I consider it to be mandatory. Coverage failures indicate bad code, a bad API, or a bad random tester.

Separate validation of code that is sneaky with respect to coverage — Branch coverage is, in some cases, a very weak criterion. Examples include complex conditionals and code that uses lookup tables. These have to be separately validated.

Checkers are happy — Valgrind, IOC, gcc -Wall, pylint, or whatever tools apply to the code in question are happy, at least up to the point of diminishing returns.

Oracles are happy — If a strong oracle is available, such as an alternative implementation of the same API, then it has been used during random testing and no important differences in output have been found.
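As a rough illustration of how the assertion, fuzzer, and oracle conditions above fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: SortedSet is a hypothetical stand-in for the module being burned in (a sorted list rather than a real B-tree), check_rep() plays the role of repOk() / checkRep(), and Python's built-in set serves as the alternative implementation used as an oracle.

```python
import bisect
import random


class SortedSet:
    """Hypothetical module under test: a set of ints kept in a sorted list."""

    def __init__(self):
        self._items = []

    def check_rep(self):
        # Strong invariant: strictly increasing, hence sorted and duplicate-free.
        for a, b in zip(self._items, self._items[1:]):
            assert a < b, f"invariant violated: {a} !< {b}"

    def insert(self, x):
        i = bisect.bisect_left(self._items, x)
        if i == len(self._items) or self._items[i] != x:
            self._items.insert(i, x)
        assert x in self  # postcondition

    def remove(self, x):
        i = bisect.bisect_left(self._items, x)
        if i < len(self._items) and self._items[i] == x:
            del self._items[i]
        assert x not in self  # postcondition

    def __contains__(self, x):
        i = bisect.bisect_left(self._items, x)
        return i < len(self._items) and self._items[i] == x

    def __len__(self):
        return len(self._items)


def fuzz(seed, steps=2000, key_range=50):
    """Apply random API calls, checking invariants and an oracle after each."""
    rng = random.Random(seed)  # seeded so that any failure can be replayed
    sut, oracle = SortedSet(), set()
    for _ in range(steps):
        op = rng.choice(("insert", "remove", "contains"))
        x = rng.randrange(key_range)  # small key range forces collisions
        if op == "insert":
            sut.insert(x)
            oracle.add(x)
        elif op == "remove":
            sut.remove(x)
            oracle.discard(x)
        else:
            assert (x in sut) == (x in oracle)
        sut.check_rep()  # invariant check after every regular API call
        assert len(sut) == len(oracle)
    return sorted(oracle) == sut._items


if __name__ == "__main__":
    for seed in range(20):
        assert fuzz(seed)
    print("no failures")
```

The small key range is a deliberate choice: it makes duplicate inserts and removals of present keys common, which is what pushes the tester into the less obvious branches. A coverage tool would still be needed to confirm that every branch is actually reached.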

You could argue that this list has little to do with random testing, but I’d disagree. I seldom if ever trust a software module unless it has been subjected to a broad variety of inputs, and it can be very hard to get these diverse inputs without using random numbers somehow. A haphazard (or even highly systematic) collection of unit tests written by me or some other developer does not accomplish this for code of any complexity.
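Random numbers help with the fault injection condition too, since error returns from the APIs a module uses are just another kind of input. The following Python sketch shows one way this might look. All names are hypothetical: write_fully is an invented piece of code under test, and FaultyWriter is a mock of an os.write-like call that randomly produces short writes, a transient EINTR, or a hard ENOSPC failure.

```python
import errno
import random


def write_fully(write, data):
    """Hypothetical code under test: push all of data through write(bytes) -> int.

    Retries after short writes and InterruptedError; other OSErrors propagate.
    """
    sent = 0
    while sent < len(data):
        try:
            sent += write(data[sent:])
        except InterruptedError:
            continue  # transient failure: retry from the same offset
    return sent


class FaultyWriter:
    """Mock of an os.write-like call that injects faults at random."""

    def __init__(self, rng):
        self.rng = rng
        self.received = bytearray()

    def __call__(self, chunk):
        roll = self.rng.random()
        if roll < 0.20:
            raise InterruptedError(errno.EINTR, "injected EINTR")
        if roll < 0.25:
            raise OSError(errno.ENOSPC, "injected ENOSPC")
        n = self.rng.randint(1, len(chunk))  # possibly a short write
        self.received += chunk[:n]
        return n


def fuzz_write_fully(seed, trials=500):
    """Run write_fully against the faulty mock and check both outcomes."""
    rng = random.Random(seed)
    outcomes = {"ok": 0, "error": 0}
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randint(0, 64)))
        writer = FaultyWriter(rng)
        try:
            n = write_fully(writer, data)
        except OSError as e:
            # Only the injected hard error may escape, and whatever was
            # written before it must be an uncorrupted prefix of the data.
            assert e.errno == errno.ENOSPC
            assert data.startswith(bytes(writer.received))
            outcomes["error"] += 1
        else:
            assert n == len(data) and bytes(writer.received) == data
            outcomes["ok"] += 1
    return outcomes
```

Calling fuzz_write_fully(1) returns a count of successful and failed trials. Checking that both counts are nonzero is a cheap sanity test that the injected error paths are really being exercised.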

A lot of real-world software, particularly in web-land, seems to be burned in by deploying it and watching the error logs and mailing lists. This development style has its place, especially since there’s a lot of software that’s just not easy to unit test. Test via deployment is how most of my group’s open-source projects work, in fact. But that doesn’t mean that it’s not satisfying and useful to be able to produce a piece of high-quality software the first time.

Might it be possible to bypass the burn-in process, for example using formal verification? Absolutely not, though we would hope that verified software contains fewer errors. Also, the errors found will tend to have a different character. The relationship between software testing and verification is a tricky issue that will increase in importance over the next few decades.

This post would be incomplete without mentioning that people with a background in academic software engineering sometimes claim that random testing can improve our confidence in software in a totally different sense from what I’m talking about here. In that line of thinking, you create an operational profile for what real inputs look like, you generate random test cases that “look like” inputs in the profile, and then finally you devise a statistical argument that gives a lower bound for the reliability of the software. I don’t happen to believe that this kind of argument is useful very often.

9 Comments

“I don’t happen to believe that this kind of argument is useful very often.”

Does anyone much (semi-serious question)?

For something the size of a B-tree, IF you’re in C (not C++), I think there’s a modest chance bounded model checking (via CBMC) can give you burn-in that’s quite similar to random testing burn-in. In my experience, the errors found with bounded model checking are not dissimilar to those found by random testing, if you do some kind of minimization on both sides.

Hi Alex, can you recommend some solid model checkers that I should try? I really want to use these tools but in practice some problem always gets in the way. For example, it fails while parsing some random header file, crashes inexplicably, uses all my RAM, or something.

Regarding the statistical reliability arguments, I’ve seen them posed (or at least discussed) seriously. It could be that nobody really believes this stuff.

If you want to check actual code, CBMC _can_ handle some stuff, but you basically have to get lucky, and often it means stubbing some headers/system calls — esp. for your kind of code. Though for standard headers, the source release contains default stubs that sometimes work. I’m not sure there’s anything better than that, and as I said, it’s limited. But not so limited it’s never worth trying for smallish critical C code of the kind you’re discussing — self contained modules.

Hmm, thought I’d demonstrate by taking the first in-memory B-tree I found online and putting together a reasonable “burn in” harness in ten minutes. Alas, though it compiles with gcc just fine, CBMC is griping mysteriously:

John, you had worked with CIL not too long ago, is that right? It looked well-positioned to take over as a lingua franca of C intermediate languages, but somehow it fell off the curve. Do you know what happened?

Hey Ben, good question. I’m not totally sure but I’m guessing it’s a combination of: lack of OCaml programmers (I had to get a couple students to learn it just to use CIL), lack of a story for C++, and lack of a critical mass of contributors. George one time remarked somewhere that although CIL had lots of users, they weren’t contributing improvements. My personal impression (as an advisor, not as a CIL hacker) was that CIL stopped too close to the AST, for example making everyone write their own dataflow framework. A real compiler like LLVM, on the other hand, has a much richer collection of IRs, making it more likely that any individual project can hook in.

As a side note, KLEE handles the btree implementation ok, giving some credit to the LLVM plan, though it hits scaling trouble if you try to put as many as 8 items into the tree, which is troubling. If CBMC didn’t choke on some SSA-transformation problem, I think it’d handle scaling a little better.