Fuzzing is sort of a superpower for locating vulnerabilities and other software defects, but it is often used to find problems baked deeply into already-deployed code. Fuzzing should be done earlier, and moreover developers should spend some effort making their code more amenable to being fuzzed.

This post is a non-comprehensive, non-orthogonal list of ways that you can write code that fuzzes better. Throughout, I’ll use “fuzzer” to refer to basically any kind of randomized test-case generator, whether mutation-based (afl, libFuzzer, etc.) or generative (jsfunfuzz, Csmith, etc.). Not all advice will apply to every situation, but a lot of it is sound software engineering advice in general. I’ve bold-faced a few points that I think are particularly important.

Invest in Oracles

A test oracle decides whether a test case triggered a bug or not. By default, the only oracle available to a fuzzer like afl is provided by the OS’s page protection mechanism. In other words, it detects only crashes. We can do much better than this.

Assertions and their compiler-inserted friends — sanitizer checks — are another excellent kind of oracle. You should fuzz using as many of these checks as possible. Beyond these easy oracles, many more possibilities exist, such as:

function-inverse pairs: does a parse-print loop, compress-decompress loop, encrypt-decrypt loop, or similar, work as expected?

differential: do two different implementations, or modes of the same implementation, show the same behavior?

metamorphic: does the system show the same behavior when a test case is modified in a semantics-preserving way, such as adding a layer of parentheses to an expression?

resource: does the system consume a reasonable amount of time, memory, etc. when processing an input?

domain specific: for example, is a lossily-compressed image sufficiently visually similar to its uncompressed version?

Strong oracles are worth their weight in gold, since they tend to find application-level logic errors rather than the lower-level bugs that are typically caught by looking for things like array bounds violations.
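The function-inverse oracle amounts to a one-line property. Here is a minimal Python sketch of a compress-decompress round trip; only the harness shape is shown, and a real fuzzer would supply the inputs:

```python
import zlib

def roundtrip_oracle(data: bytes) -> None:
    # Function-inverse pair: decompress(compress(x)) must reproduce x exactly.
    # A failed assertion is a bug report, even though nothing crashed.
    assert zlib.decompress(zlib.compress(data)) == data

# In a real harness this would be called on fuzzer-generated inputs;
# here we just exercise it on a few fixed byte strings.
for sample in [b"", b"hello", b"\x00\xff" * 1000]:
    roundtrip_oracle(sample)
```

The same shape works for parse-print and encrypt-decrypt pairs: run the pair, assert equality, and let the fuzzer hunt for an input that breaks the property.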

I wrote a bit more about this topic a few years ago. Finally, a twitter user suggested “If you’re testing a parser, poke at the object it returns, don’t just check if it parses.” This is good advice.

Interpose on I/O and State

Stateless code is easier to fuzz. Beyond that, you will want APIs for taking control of state and for interposing on I/O. For example, if your program asks the OS for the number of cores, the current date, or the amount of disk space remaining, you should provide a documented method for setting these values. It’s not that we necessarily want to randomly change the number of cores, but rather that we might want to fuzz our code when set to single-core mode and then separately fuzz it in 128-core mode. Important special cases of taking control of state and I/O include making it easy to reset the state (to support persistent-mode fuzzing) and avoiding hidden inputs that lead to non-deterministic execution. We want as much determinism as possible while fuzzing our code.
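One way to provide such an API is plain dependency injection: route every environment query through a value the harness controls. A minimal Python sketch, where all names and the worker-count policy are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Env:
    # Hypothetical seam for values the program would otherwise query
    # from the OS; a fuzzing harness can pin these deterministically.
    cpu_count: int
    disk_free_bytes: int

def plan_workers(env: Env) -> int:
    # Invented policy: one worker per core, capped at one per GiB free.
    return max(1, min(env.cpu_count, env.disk_free_bytes >> 30))

# Fuzz the same logic under two pinned configurations, deterministically.
assert plan_workers(Env(cpu_count=1, disk_free_bytes=8 << 30)) == 1
assert plan_workers(Env(cpu_count=128, disk_free_bytes=8 << 30)) == 8
```

Because the harness constructs the Env, separate single-core and many-core fuzzing campaigns become trivial, and no hidden OS query can make runs non-deterministic.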

Avoid or Control Fuzzer Blockers

Fuzzer blockers are things that make fuzzing gratuitously difficult. The canonical fuzzer blocker is a checksum included somewhere in the input: the random changes made to the input by a mutation-based fuzzer will tend to cause the checksum to not validate, resulting in very poor code coverage. There are basically two solutions. First, turn off checksum validation in builds intended for fuzzing. Second, have the fuzzer generate inputs with valid checksums. A generation-based fuzzer will have this built in; with a mutation-based fuzzer we would write a little utility to patch up the test case with a valid checksum after it is generated and before it is passed to the program being fuzzed. afl has support for this.
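The patch-up utility can be tiny. Here is a Python sketch for a made-up input format whose last four bytes are a CRC-32 of the rest; the format is invented for illustration:

```python
import struct
import zlib

def fix_checksum(data: bytes) -> bytes:
    # Hypothetical format: payload followed by a little-endian CRC-32 trailer.
    # After the fuzzer mutates the payload, recompute the trailer so the
    # target's validation step passes and the interesting code is reached.
    payload = data[:-4]
    return payload + struct.pack("<I", zlib.crc32(payload))

mutated = b"GIFX\x01\x02corrupted-by-fuzzer" + b"\x00\x00\x00\x00"
fixed = fix_checksum(mutated)
assert fixed[-4:] == struct.pack("<I", zlib.crc32(fixed[:-4]))
```

In afl++, this kind of fix-up can be installed as the post-processing hook of a custom mutator, so every mutated test case is repaired before it reaches the target.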

Beyond checksums, hard-to-satisfy validity properties over the input can be a serious problem. For example, if you are fuzzing a compiler for a strongly typed programming language, blind mutation of compiler inputs will rarely produce valid compiler inputs. I like to think of validity constraints as being either soft (invalid inputs waste time, but are otherwise harmless) or hard (the system behaves arbitrarily when processing an invalid input, so invalid inputs must be avoided for fuzzing to work at all). When we fuzz a C++ compiler looking for wrong-code bugs, we face a hard validity constraint, because compiler inputs that have undefined behavior will look like wrong-code bugs.

There is no simple, general-purpose solution to this kind of problem, but rather a family of techniques for explicitly taking validity properties into account. The most obvious solution — but often not the right one — is to write a new generational fuzzer. The problem is that if you do this, you cannot take advantage of modern coverage-driven fuzzing techniques, which are amazing. To fit into a coverage-driven fuzzing framework you have a couple of options. First, write a custom mutator that respects your validity constraints. Second, use structure-aware fuzzing, which basically means taking the mutated data from the fuzzer and translating it into something like what the program being fuzzed expects to see. There is a lot of research left to be done in making coverage-driven fuzzers work well in the presence of validity constraints without requiring a lot of manual effort. There are some significant subtleties here; maybe I’ll go into them another time.
Putting something like a SAT solver into the fuzzer is not, generally speaking, the answer here, first because some validity constraints like checksums are specifically difficult for solvers, and second because some validity constraints (such as UB-freedom in a C++ program) are implicit and cannot be inferred, even in principle, by looking at the system being fuzzed.
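The structure-aware option, treating the fuzzer's byte string as a sequence of decisions that drives a little generator, guarantees validity by construction. A Python sketch for a toy arithmetic-expression language; the grammar and decoding scheme are invented for illustration:

```python
def decode_expr(data: bytes) -> str:
    """Translate raw fuzzer output into a well-formed arithmetic expression.

    Every byte string maps to some valid expression, so the target's parser
    never rejects the input and coverage feedback stays meaningful.
    """
    pos = 0

    def next_byte() -> int:
        nonlocal pos
        b = data[pos] if pos < len(data) else 0
        pos += 1
        return b

    def expr(depth: int) -> str:
        choice = next_byte()
        if depth >= 4 or choice < 128:      # leaf: a small integer literal
            return str(next_byte() % 100)
        op = "+-*"[choice % 3]              # internal node: binary operator
        return f"({expr(depth + 1)} {op} {expr(depth + 1)})"

    return expr(0)

# Any bytes at all, even adversarially mutated ones, decode to a parseable
# expression that the target can consume.
e = decode_expr(b"\xff\x80\x05\x00\x07\x99\x02")
eval(e)  # would raise only if the generator emitted an invalid expression
```

Because the mapping is total, coverage-driven mutation of the raw bytes explores the space of valid programs rather than the space of parse errors.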

A lot of code in a typical system cannot be fuzzed effectively by feeding input to public APIs because access is blocked by other code in the system. For example, if you use a custom memory allocator or hash table implementation, then fuzzing at the application level probably does not result in especially effective fuzzing of the allocator or hash table. These kinds of APIs should be exposed to direct fuzzing. There is a strong synergy between unit testing and fuzzing: if one of these is possible and desirable, then the other one probably is too. You typically want to do both.
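Direct fuzzing of an internal data structure pairs naturally with a differential oracle. A Python sketch using a deliberately simple toy hash table, with Python's dict as the reference implementation; the TinyTable class is invented for illustration:

```python
import random

class TinyTable:
    """Toy fixed-size chained hash table standing in for an internal API."""
    def __init__(self, nbuckets: int = 8):
        self.buckets = [[] for _ in range(nbuckets)]

    def put(self, k, v):
        bucket = self.buckets[hash(k) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == k:
                pair[1] = v
                return
        bucket.append([k, v])

    def get(self, k, default=None):
        for pk, pv in self.buckets[hash(k) % len(self.buckets)]:
            if pk == k:
                return pv
        return default

def fuzz_once(seed: int, nops: int = 200) -> None:
    # Drive the table with a random operation sequence and check every
    # observable result against Python's dict (the differential oracle).
    rng = random.Random(seed)
    table, model = TinyTable(), {}
    for _ in range(nops):
        k = rng.randrange(16)
        if rng.random() < 0.5:
            v = rng.randrange(1000)
            table.put(k, v)
            model[k] = v
        else:
            assert table.get(k) == model.get(k)

for seed in range(50):
    fuzz_once(seed)
```

The same harness doubles as a unit test when run on a fixed seed, which is one face of the unit-testing/fuzzing synergy mentioned above.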

Sanitizers and fuzzers often require tweaks or even significant changes to the build process. To make this easier, keep the build process as clean and simple as possible. Make it easy to switch out the compiler and modify the compiler options. Depend on specific tools (and versions of tools) as little as possible. Routinely build and test your code with multiple compilers. Document special build system requirements.

Finally, some fuzzer blockers are sort of silly and easy to avoid. If your code leaks memory or terminates its process with a deep call stack, it will be painful to test using a persistent-mode fuzzer, so don’t do these things. Avoid handling SIGSEGV or, if you really must do this, provide a way to disable the handler in fuzzer builds. If your code is not compatible with ASan or UBSan, then these extremely useful oracles are harder to use. In particular, if your code uses a custom memory allocator, you should consider turning it off for fuzzer builds or adapting it to work with ASan; otherwise you’ll miss important bugs.

Unblock Coverage-Driven Fuzzers

Because coverage-driven fuzzers refocus their effort to try to hit uncovered branches, they can be blocked in certain specific ways. For example, if a coverage-driven fuzzer is presented with too many uncoverable branches, it can spend so much time on them that it becomes less likely to hit more interesting branches elsewhere in the program. In one experiment, I compared afl’s coverage on a program compiled with and without UBSan, and found that (within the time limit I used) it covered quite a lot less of the sanitized program than of the unsanitized build. On the other hand, we definitely want our fuzzer to look for sanitizer failures. My advice is to fuzz both sanitized and unsanitized builds of your program. I don’t know how to budget fuzzing resources across these different activities, and I don’t know of any principled work on that problem. It may not matter much, since fuzzing is all about overkill.

Sometimes your program will call branchy, already-heavily-fuzzed code early in its execution. For example, you might decompress or decrypt input before processing it. This is likely to distract the coverage-driven fuzzer, causing it to spend a lot of time trying to fuzz the crypto or compression library. If you don’t want to do this, provide a way to disable crypto or compression during fuzzing.

Any interpreter in your program is likely to make life difficult for a coverage-driven fuzzer, since the relevant program paths are now encoded in the data being interpreted, which is generally opaque to the fuzzer. If you want maximum mileage out of coverage-driven fuzzers, you may want to try to avoid writing interpreters, or at least keep them extremely simple. An obvious way to deal with embedded interpreters — which someone must have thought of and tried, but I don’t have any pointers — would be to have an API for teaching the fuzzer how to see coverage of the language being interpreted.

Enable High Fuzzing Throughput

Fuzzing is most effective when throughput is very high; this seems particularly the case for feedback-driven fuzzers that may take a while to learn how to hit difficult coverage targets. An easy throughput hack is to make it possible to disable slow code (detailed logging, for example) when it is not germane to the fuzzing task. Similarly, interposing on I/O can help us avoid speed hacks such as running the fuzzer in a ramdisk.

“But I Want Fuzzing My Code to be Harder, Not Easier”

I don’t have a lot of sympathy for this point of view. Instead of aiming for security through obscurity, we would do better to:

fuzz early and thoroughly, eliminating fuzzable defects before releasing code into the wild

write code in a programming language with as strong a type system as we are willing to tolerate — this will statically eliminate classes of bad program behaviors, for example by stopping us from putting the wrong kind of thing into a hashmap

aggressively use assertions and sanitizers to get dynamic checking for properties that the type system can’t enforce statically

Anti-fuzzing techniques are a thing, but I don’t think they represent a useful kind of progress towards better software.

Conclusion

Randomized testing is incredibly powerful and there’s no point avoiding it: if you don’t fuzz your code, someone else will. This piece has described some ways for you, the software developer, to make fuzzing work better. Of course there are plenty of other aspects, such as choosing a good corpus and writing a good fuzzer driver, that are not covered here.

Acknowledgments: Pascal Cuoq and Alex Groce provided feedback on a draft of this piece, and it also benefited from suggestions I received on Twitter. You can read the conversation here; it contains some suggestions and nuances that I did not manage to capture.

Part 1 of this series introduced C-Reduce and showed how it combines a domain-independent core with a large collection of domain-specific passes in order to create a highly effective test-case reducer for C and C++ code. This part tells the rest of the story and concludes.

Parallel Test-Case Reduction

C-Reduce’s second research contribution is to perform test-case reduction in parallel using multiple cores. The parallelization strategy, based on the observation that most variants are uninteresting, is to speculate along the uninteresting branch of the search tree. Whenever C-Reduce discovers an interesting variant, all outstanding interestingness tests are killed and a new line of speculation is launched. This is the same approach that was subsequently rediscovered by the creator of halfempty.

C-Reduce has a policy choice between taking the first interesting variant returned by any CPU, which provides a bit of extra speedup but makes parallel reduction non-deterministic, or only taking an interesting variant when all interestingness tests earlier in the search tree have reported that their variants are uninteresting. We tested both alternatives and found that the speedup due to non-deterministic choice of variants was minor. Therefore, C-Reduce currently employs the second option, which always follows the same path through the search tree that non-parallel C-Reduce would take. The observed speedup due to parallel C-Reduce is variable, and is highest when the interestingness test is relatively slow. Speedups of 2-3x vs. sequential reduction are common in practice.
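The speculation strategy can be sketched in a few lines of Python: run a window of interestingness tests in parallel under the assumption that each variant is uninteresting, then commit to the earliest interesting result in sequential order. The variant generator and interestingness test here are stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_reduce_step(variants, is_interesting, workers=4):
    """Speculatively test several variants in parallel.

    Returns the first variant, in sequential order, that is interesting,
    matching what a non-parallel reducer would have chosen, or None.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(is_interesting, v) for v in variants]
        for v, f in zip(variants, futures):
            if f.result():   # earlier tests decide before later ones count
                # Outstanding speculative work is now stale; a real reducer
                # would kill those processes rather than let them finish.
                return v
    return None

# Stand-in: a "variant" is a string, interesting iff it still contains "bug".
variants = ["no", "nope", "still bug here", "bug", "nothing"]
assert parallel_reduce_step(variants, lambda v: "bug" in v) == "still bug here"
```

Waiting for results in submission order is what makes this deterministic: a later variant's early completion never changes the outcome.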

Applicability and Moving Towards Domain-Independence

Something that surprised us is how broadly applicable C-Reduce is to test cases in languages other than C and C++. Our users have found it effective in reducing Java, Rust, Julia, Haskell, Racket, Python, SMT-LIB, and a number of other languages. When used in this mode, the highly C/C++-specific C-Reduce passes provide no benefit, but since they fail quickly they don’t do much harm. (C-Reduce also has a --not-c command line option that avoids running these passes in the first place.)

One might ask why C-Reduce is able to produce good results for languages other than C and C++; the answer appears to be based on the common structural elements found across programming languages in the Algol and Lisp families. In contrast, in our limited experience, C-Reduce does a very poor job reducing test cases in binary formats such as PDF and JPEG. Other reducers, such as the afl-tmin tool that ships with afl-fuzz, work well for binary test cases.

A consequence of the modular structure of C-Reduce is that while its transformation passes are aimed at reducing C and C++ code, the C-Reduce core is completely domain-independent. C-Reduce has been forked to create a highly effective reducer for OpenCL programs. We believe it would be relatively straightforward to do something similar for almost any other programming language simply by tailoring the passes that C-Reduce uses.

Limitations

When used for its intended purpose, when does C-Reduce work badly? We have seen two main classes of failure. First, C-Reduce can be annoyingly slow. This typically happens when the passes early in the phase ordering, which are intended to remove a lot of code quickly, fail to do this. Second, highly templated C++ sometimes forces C-Reduce to terminate with an unacceptably large (say, >1 KB) final result. Of course this is better than nothing, and subsequent manual reduction is usually not too difficult, but it is frustrating to have written 69 different Clang-based passes and to still find effectively irreducible elements in test cases. The only solution — as far as we know — is to strengthen our existing transformation passes and to create more such passes.

A small minority of compiler bugs appears to be impossible to trigger using a small test case. These bugs are exceptions to the small scope hypothesis. They typically stem from bugs in the compiler’s handling of resource limits. For example, a bug in register spilling code requires the test case to use enough registers that spilling is triggered; a bug in logic for emitting long-offset jumps requires the test case to contain enough code that a long offset is required; etc. These test cases are simply difficult to work with, and it is not clear to us that there is anything we can do to make it easier to debug the issues that they trigger.

C-Reduce Design Principles

In summary, C-Reduce was designed and implemented according to the following principles:

Be aggressive: make the final reduced test case as small as possible.

Make the reducer fast, for example using parallelism, careful phase ordering of passes, and avoiding unnecessary I/O traffic, when this can be done without compromising the quality of the final output.

Make it as easy as possible to implement new passes, so that many domain-specific passes can be created.

Keep the C-Reduce core domain-independent.

Focus only on reduction, delegating all other criteria to the user-supplied interestingness test.

Directions for Future Test-Case Reduction Research

Although perhaps a few dozen papers have been written about test-case reduction since Hildebrandt and Zeller’s initial paper 19 years ago, I believe that this area is under-studied relative to its practical importance. I’ll wrap up this article with a collection of research questions suggested by our experience in over a decade of work on creating a highly aggressive reducer for C and C++.

What is the role of domain-specificity in test case reduction? Researchers who cite C-Reduce appear to enjoy pointing out that it is very domain-specific (nobody seems to notice that the C-Reduce core is domain-independent, and that the pass schedule is easy to modify). The implication is that domain-specific hacks are undesirable and, of course, an argument against such hacks would be forceful if backed up by a test-case reducer that produced smaller final C and C++ code than C-Reduce does, without using domain knowledge. So far, such an argument has not been made.

The open research question is whether domain knowledge is actually necessary, or if a domain-independent test-case reducer can beat C-Reduce at its own game. The most impressive such effort that we are aware of is David MacIver’s structureshrink, which uses relatively expensive search techniques to infer structural elements of test cases that can be used to create variants. Anecdotally, we have seen structureshrink produce reduced versions of C++ files that are smaller than we would have guessed was possible without using domain knowledge.

What is the role of non-greedy search in test-case reduction? In many cases, the order in which C-Reduce runs its passes has little or no effect on the final test case. In other words, the search is often diamond-shaped, terminating at the same point regardless of the path taken through the search space. On the other hand, this is not always the case, and when the search is not diamond-shaped, a greedy algorithm like C-Reduce’s risks getting stopped at a local minimum that is worse than some other, reachable minimum. The research question is how to get the benefit of non-greedy search algorithms without making test-case reduction too much slower.

What other parallelization methods are there? C-Reduce’s parallelization strategy is pretty simple, gives a modest speedup in practice, and always returns the same result as sequential reduction. There must be other parallel test-case reduction strategies that hit other useful points in the design space. This is, of course, related to the previous research question. That is, if certain branchings in the search tree can be identified as being worth exploring in both directions, this could be done on separate cores or machines.

What is the role of canonicalization in test-case reduction? A perfectly canonicalizing reducer — that reduces every program triggering a given bug to the same reduced test case — would be extremely useful. Although it seems impossible to achieve this in all cases, there are many relatively simple strategies that can be employed to increase the degree of canonicalization, such as putting arithmetic expressions into a canonical form, assigning canonical names to identifiers, etc. C-Reduce has a number of transformations that are aimed at canonicalization rather than reduction. For example, the reduced test case at the top of part 1 of this piece has four variables a, b, c, and d, which appear in that order. I believe that more work in this direction would be useful.
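One concrete canonicalization strategy is to rename identifiers in order of first occurrence. A Python sketch over raw source text; this regex-based version is far cruder than C-Reduce's Clang-based passes and is only an illustration:

```python
import re

def canonical_rename(src: str) -> str:
    # Rename each identifier to a, b, c, ... in order of first appearance,
    # so test cases triggering the same bug converge on the same spelling.
    # (Sketch only: keyword list is partial and >26 identifiers will clash.)
    keywords = {"int", "char", "return", "if", "else", "for", "while"}
    mapping = {}

    def rename(m):
        name = m.group(0)
        if name in keywords:
            return name
        if name not in mapping:
            mapping[name] = chr(ord("a") + len(mapping))
        return mapping[name]

    return re.sub(r"[A-Za-z_]\w*", rename, src)

assert (canonical_rename("int wombat = 1; int zebra = wombat + 2;")
        == "int a = 1; int b = a + 2;")
```

Two differently-written test cases for the same bug that differ only in naming now reduce to byte-identical output, which helps with de-duplicating bug reports.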

Can we avoid bug hijacking? Test reduction sometimes goes awry when the bug targeted by the reduction is “hijacked” by a different bug. In other words, the reduced test case triggers a different bug than the one triggered by the original. During a compiler fuzzing campaign this doesn’t matter much since one bug is as good as another, but hijacking can be a problem when the original bug is, for example, blocking compilation of an application of interest. Hijacking is particularly common when the interestingness test is looking for a non-specific behavior such as a null pointer dereference. C-Reduce pushes the problem of avoiding hijacking onto the user who can, for example, add code to the interestingness test looking for specific elements in a stack trace. The research question here is whether there are better, more automated ways to prevent bug hijacking.

Obtaining C-Reduce

Binary C-Reduce packages are available as part of software distributions including Ubuntu, Fedora, FreeBSD, OpenBSD, MacPorts, and Homebrew. Source code can be found here.

Acknowledgments

C-Reduce was initially my own project, but by lines of code the largest contributor by a factor of two is my former student Yang Chen, now a software engineer at Microsoft. Yang wrote effectively all of the Clang-based source-to-source transformation code, more than 50,000 lines in total. Eric Eide, a research professor in computer science at the University of Utah, is the other major C-Reduce contributor (and gave useful feedback on a draft of this piece). Our colleagues Pascal Cuoq, Chucky Ellison, and Xuejun Yang also contributed to the project, and we have gratefully received patches from a number of external contributors. Someone created a fun visualization of the part of C-Reduce’s history that happened on Github.

Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we’ll return to that later.

Why Reduce Test Cases?

Compiling this code at the -O2 optimization level causes LLVM to crash. The bug report doesn’t contain the original, unreduced test case, but most likely it was larger.

A reduced test case is preferable because:

it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging

the reduced file contains little code not directly related to triggering the bug, reducing the likelihood that compiler developers will be distracted by extraneous features of the test case

reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug

there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally

The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.

A Bit of Background

The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s, such as Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of “variants” (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.
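The chunk-removal loop at the heart of this style of reduction can be sketched as follows. This is a simplified line-granularity version, not the full ddmin algorithm with its complement splitting:

```python
def reduce_by_chunks(lines, is_interesting):
    """Greedy Delta-Debugging-style reduction at line granularity.

    Repeatedly tries to delete chunks of lines, halving the chunk size
    when no chunk can be removed, until even single lines cannot be
    deleted without losing the bug-triggering property.
    """
    chunk = max(1, len(lines) // 2)
    while chunk >= 1:
        i, progress = 0, False
        while i < len(lines):
            variant = lines[:i] + lines[i + chunk:]
            if variant and is_interesting(variant):
                lines, progress = variant, True   # keep the smaller test case
            else:
                i += chunk                        # this chunk is required
        if not progress:
            chunk //= 2
    return lines

# Stand-in bug: the test case is "interesting" iff it still contains "boom".
src = ["a", "b", "boom", "c", "d", "e"]
assert reduce_by_chunks(src, lambda ls: "boom" in ls) == ["boom"]
```

The interestingness predicate is the only domain-specific piece; everything else is generic search, which is exactly the separation the Delta Debugging work introduced.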

Towards C-Reduce

I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.

At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.

The limiting factor in this existing tool (“Delta” from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. At some point I implemented a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.

We ended up keeping two elements of Delta’s design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous chunks of code, is still one of C-Reduce’s most useful first lines of attack on a test case. Second, Delta’s mechanism for determining whether a given variant is “interesting.” An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program — typically a shell script — whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.
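Concretely, an interestingness test is just a program whose exit status classifies the current variant. In practice it is usually a shell script wrapping a compiler; this Python stand-in shows the shape, with the file name, compiler command, and crash signature all invented for illustration:

```python
import subprocess
import sys

def is_interesting(path="small.c") -> bool:
    # Cheap, most-likely-to-fail check first: the construct we care
    # about must still be present in the variant.
    src = open(path).read()
    if "foo" not in src:
        return False
    # Then the expensive check: does the compiler still die with the
    # crash signature we are tracking? (command and message invented)
    proc = subprocess.run(["cc", "-O2", "-c", path],
                          capture_output=True, text=True)
    return "internal compiler error" in proc.stderr

if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit code 0 means "interesting", matching the Delta convention.
    sys.exit(0 if is_interesting(sys.argv[1]) else 1)
```

Ordering the checks from cheapest to most expensive keeps the expected cost per variant low, which matters because the reducer runs this test thousands of times.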

It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:

is the variant well-defined by the C or C++ standard?

does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?

does the buggy compiler turn the variant into an executable?

does this executable terminate within a specified time?

does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?

does this executable also terminate within a specified time?

does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?

The variant is interesting if the answer to all of these questions is “yes.”

The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case doing one of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to employ suitable static and dynamic analysis tools, in order to avoid these ill-formed variants. Since no single tool detecting all undefined behaviors in C and C++ programs exists, a hybrid approach involving multiple tools is typically used in practice. This is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.

Writing good interestingness tests for miscompilations takes a bit of practice. More than one user has described C-Reduce as being something like the sorcerer’s apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example, it is easy to accidentally write a test which believes that the empty file is interesting. Additionally, a good interestingness test is written in such a way that its expected-case runtime is minimized, by asking the quickest and most-likely-to-fail questions first.

From the start, C-Reduce was designed to produce a very small final reduced test case, even when this would make C-Reduce run for longer than we liked. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is rarely on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper that evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.

A Modular, Domain-Independent Reducer Core

Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. This impedes extensibility, which we believe to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.

The API for C-Reduce passes is simple but — like many simple things — required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:

state new(filename, option) : Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor — a byte offset, token offset, line number, position in a tree traversal, etc. — in the state object.

The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of memory. Of course it would not be difficult to pass the contents of the file around as a memory object but this would be slow when these objects are large (C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files).

The “option” is used to select among different behaviors implemented by a composite pass.

state advance(filename, option, state) : Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.

result transform(filename, option, state) : Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:

OK : the transformation succeeded

STOP : no more transformation instances remain for this pass

ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.

(The API contains one additional method, which checks whether a pass’s external dependencies are satisfied, that doesn’t matter here.)

Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.
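As a concrete illustration, here is a sketch of a line-deletion pass behind this interface. Everything here is invented for exposition: it is C++ operating on an in-memory string rather than a file, whereas real C-Reduce passes are written in Perl or in C++ against Clang and work on files.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Result codes from the pass API described above.
enum Result { OK, STOP, ERROR };

// Pass state: an opaque object from the core's point of view. For this
// pass it is just a cursor identifying the next line to try deleting.
struct State {
  size_t line;
};

static std::vector<std::string> split_lines(const std::string &s) {
  std::vector<std::string> lines;
  std::istringstream in(s);
  std::string l;
  while (std::getline(in, l))
    lines.push_back(l);
  return lines;
}

// new(): a fresh state pointing at the first transformation opportunity.
State pass_new(const std::string &, int /*option*/) { return State{0}; }

// advance(): move to the next transformation opportunity.
State pass_advance(const std::string &, int /*option*/, State st) {
  return State{st.line + 1};
}

// transform(): delete the line the cursor points at, modifying the
// "file" (here, the string) in place, or report STOP when no more
// transformation instances remain.
Result pass_transform(std::string &text, int /*option*/, State st) {
  std::vector<std::string> lines = split_lines(text);
  if (st.line >= lines.size())
    return STOP;
  lines.erase(lines.begin() + st.line);
  std::string out;
  for (const std::string &l : lines)
    out += l + "\n";
  text = out;
  return OK;
}
```

The state is nothing but a cursor; transform either produces a variant or reports STOP, and advance moves the cursor past a transformation opportunity that didn't pan out.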

The C-Reduce core implements this algorithm:

current = original_test_case
do
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
    do
      variant = current                        // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
        report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant                    // also a file copy
        else
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start
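The loop above can be made runnable in miniature. This toy version (the simplifications are mine: in-memory strings instead of files, a single line-deletion pass, and the interestingness test passed in as a predicate) reduces a string to a small interesting form:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum Result { OK, STOP };

static std::vector<std::string> to_lines(const std::string &s) {
  std::vector<std::string> v;
  std::istringstream in(s);
  std::string l;
  while (std::getline(in, l))
    v.push_back(l);
  return v;
}

static std::string from_lines(const std::vector<std::string> &v) {
  std::string s;
  for (const std::string &l : v)
    s += l + "\n";
  return s;
}

// The single pass: delete the line at `cursor`, or report STOP when no
// transformation instances remain.
static Result transform(std::string &text, size_t cursor) {
  std::vector<std::string> lines = to_lines(text);
  if (cursor >= lines.size())
    return STOP;
  lines.erase(lines.begin() + cursor);
  text = from_lines(lines);
  return OK;
}

// The core fixpoint loop from the pseudocode, specialized to one pass.
template <class Interesting>
std::string reduce(std::string current, Interesting is_interesting) {
  size_t size_at_start;
  do {
    size_at_start = current.size();
    size_t state = 0; // pass state: a line cursor
    for (;;) {
      std::string variant = current; // the "file copy"
      if (transform(variant, state) == STOP)
        break;
      if (is_interesting(variant))
        current = variant; // keep it; the cursor now points at the next line
      else
        ++state; // advance past this transformation opportunity
    }
  } while (current.size() < size_at_start);
  return current;
}
```

With an interestingness test like "contains the substring bug", reducing "int x;\nbug here\nint y;\n" converges to just "bug here\n".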

The termination argument for C-Reduce is:

1. Since the outermost loop requires the size of the test case to decrease monotonically, it can only execute as many times as the size (in bytes) of the unreduced test case. In practice it executes many fewer times than this.

2. The loop over passes terminates because the pass list is immutable once C-Reduce is initialized.

3. Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of remaining transformation opportunities decreases by one.

4. The interestingness test is, at worst, terminated (using OS support for killing all processes in a process group) after a configurable timeout.

In practice the weak link in this argument is case 3, which is vulnerable to bugs in passes: a pass whose advance function fails to make progress can loop forever. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.

The C-Reduce core does not insist that transformations make the test case smaller, and in fact quite a few of its passes potentially increase its size; the goal of these passes is to eliminate sources of coupling within the test case, unblocking progress in other passes.

The sequence of transformation passes is carefully orchestrated so that passes likely to give the biggest wins, such as those that remove entire functions, run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.

C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.

A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:

35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values

6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at

69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.

The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.

In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.

Alexia Massalin’s 1992 PhD thesis has long been one of my favorites. It promotes the view that operating systems can be much more efficient than then-current operating systems via runtime code generation, lock-free synchronization, and fine-grained scheduling. In this piece we’ll only look at runtime code generation, which can be cleanly separated from the other aspects of this work.

Runtime Code Generation in Ring 0

The promise of kernel-mode runtime code generation is that we can have very fast, feature-rich operating systems by, for example, not including code implementing generic read() and write() system calls, but rather synthesizing code for these operations each time a file is opened. The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.
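In miniature, the open-time specialization idea looks something like the following sketch (all names are invented; a real Synthesis-style kernel would generate machine code at this point rather than select a precompiled function):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Per-open-file state. The specialized write path is chosen once, at
// open time, and cached in the open file object.
struct OpenFile {
  std::vector<char> pipe_buf;
  std::function<size_t(OpenFile &, const char *, size_t)> write_fn;
};

// A fast path specialized for pipes: no mode checks, no offset handling.
static size_t pipe_write(OpenFile &f, const char *buf, size_t n) {
  f.pipe_buf.insert(f.pipe_buf.end(), buf, buf + n);
  return n;
}

// At open time the kernel knows what kind of object this is, so it can
// commit to a specialized code path up front.
OpenFile open_pipe() {
  OpenFile f;
  f.write_fn = pipe_write; // Synthesis would *generate* this code here
  return f;
}

// The per-call path does no generic dispatch at all.
size_t do_write(OpenFile &f, const char *buf, size_t n) {
  return f.write_fn(f, buf, n);
}
```

The point of the design is that all the "what kind of file is this?" logic runs once, at open time, instead of on every system call.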

Runtime code generation was a well-known idea in 1992, but it wasn’t used nearly as widely as it is today. In 2019, of course, just-in-time compilers are ubiquitous. However, operating system kernels still do not use runtime code generation very much, with a few exceptions such as:

several OS kernels, including Linux, have a simple JIT compiler in their BPF implementation

VMware used to use dynamic code generation to performantly virtualize OS kernels on x86 chips that lacked hardware virtualization extensions; I doubt that this is commonly used any longer

pre-NT Windows kernels would dynamically generate bitblit code. I learned this in a talk by a VMware employee; this code generation was apparently a debugging issue for VMware since it would fight with their own runtime code generator. Some details can be found in this post. The old paper about the origins of this technique in the Xerox Alto is a classic.

TempleOS, as explained in this nice writeup, made heavy use of dynamic code generation

Anyway, back to Synthesis. The OS and code generators were all written, from scratch, in 68020 assembly language. How do we translate Massalin’s ideas to 2019? Most likely by reusing an existing code generator and OS. For most of this piece I’ll assume that that’s what we want to do, but we’ll also briefly touch on customized alternatives.

Code Generator Requirements

The particular technology that we use for runtime code generation isn’t that important, but for now let’s imagine using LLVM. This means that the pieces of the kernel that we wish to specialize will need to be shipped as bitcode, and then we’ll ask LLVM to turn it into object code as needed. LLVM has lots of great optimization passes, from which we could pick a useful subset, and it is not hard to use in JIT mode. On the other hand, LLVM isn’t as fast as we’d like and also it has a large footprint. In production we’d need to think carefully whether we wanted to include a big chunk of non-hardened code in the kernel.

What optimizations are we expecting the code generator to perform? Mostly just the basic ones: function inlining, constant propagation, and dead code elimination, followed by high-quality instruction selection and register allocation. The hard part, as we’re going to see, is convincing LLVM that it is OK to perform these optimizations as aggressively as we want. This is an issue that Massalin did not need to confront: her kernel was designed in such a way that she knew exactly what could be specialized and when. Linux, on the other hand, was obviously not created with staged compilation in mind, and we’re going to have to improvise somewhat if we want this to work well.

My guess is that while LLVM would be great for prototyping purposes, for deployment we'd probably end up either reusing a lighter-weight code generator or else creating a new one that is smaller, faster, and more suitable for inclusion in the OS. Performance of runtime code generation isn't just a throughput issue; there will also be latency problems if we're not careful. We need to think about the impact on security, too.

Example: Specializing write() in Linux

Let’s assume that we’ve created a version of Linux that is capable of generating a specialized version of the write() system call for a pipe. This OS needs — but we won’t discuss — a system call dispatch mechanism to rapidly call the specialized code when it is available. In Synthesis this was done by giving each process its own trap vector.

Before we dive into the code, let’s be clear about what we’re doing here: we are pretending to be the code generator that is invoked to create a specialized write() method. Probably this is done lazily at the time the system call is first invoked using the new file descriptor. The specialized code can be viewed as a cached computation, and as a bonus this cache is self-invalidating: it should be valid as long as the file descriptor itself is valid. (But later we’ll see that we can do a better job specializing the kernel if we support explicit invalidation of runtime-generated code.)

If you want to follow along at home, I’m running Linux 5.1.14 under QEMU, using these instructions to single-step through kernel code, and driving the pipe logic using this silly program.

Skipping over the trap handler and such, ksys_write() is where things start to happen for real:

At this point the “fd” parameter can be treated as a compile-time constant, but of course “buf” and “count” cannot. If we turn “fd” into a constant, will LLVM be able to propagate it through the remaining code? It will, as long as:

We inline all function calls.

Nobody takes the address of “fd”.

It’s not that calls and pointers will always block the optimizer, but they complicate things by bringing interprocedural analysis and pointer analysis into the picture.
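A tiny invented example of the second condition: constant propagation survives only as long as the variable's address does not escape to code the optimizer cannot see.

```cpp
#include <cassert>

// Declared but defined elsewhere, so the optimizer must treat it as opaque.
void unknown(int *p);

int demo(void) {
  int fd = 3; // stands in for a known-constant file descriptor
  // unknown(&fd); // uncommenting this would force the compiler to assume
  //               // fd might change behind its back, blocking the fold
  return fd * 2; // constant-folds to 6 while fd's address never escapes
}
```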

Our goal is going to be to see whether the code generator can infer the contents of the struct returned from fdget_pos(). (You might wonder why performance-sensitive code is returning a “struct fd” by value. Turns out this struct only has two members: a pointer and an integer.)

Keep in mind that up to here, we haven’t seen any optimization blockers. In __fdget_light(), we run into our first interesting challenge: “current” is a macro that returns a pointer to the running process’s PCB (in Linux the PCB, or process control block, is a “task_struct” but I’ll continue using the generic term). The current macro ends up being a tiny bit magical, but its end result can be treated as a constant within the context of a given process. There is no way a code generator like LLVM will be able to reach this conclusion, so we’ll need to give it some help, perhaps by annotating certain functions, macros, and struct fields as returning values that are constant over a given scope. This is displeasing but it isn’t clear there’s any easier or better way to achieve our goal here. The best we can hope for is that the annotation burden is close to proportional to the number of data types in the kernel; if it ends up being proportional to the total amount of code then our engineering effort goes way up.

Now, assuming that we can treat “current” as a compile-time constant, we’re immediately faced with a similar question: is the “files” field of the PCB constant? It is (once the process is initialized) but again there’s not going to be any easy way for our code generator to figure this out; we’ll need to rely on another annotation.

Continuing, the “count” field of files is definitely not a constant: this is a reference count on the process’s file descriptor table. A single-threaded Linux process will never see count > 1, but a multi-threaded process will. (Here we need to make the distinction between open file instances, which are shared following a fork, and the file descriptor table, which is not.) The fast path here is exploiting the insight that if our process is single-threaded we don’t need to worry about locking the file descriptor table, and moreover the process is not going to stop being single-threaded during the period where we rely on that invariant, because we trust the currently running code to not do the wrong thing.

Here our specializing compiler has a fun policy choice to make: should it specialize for the single threaded case? This will streamline the code a bit, but it requires the generated code to be invalidated later on if the process does end up becoming multithreaded — we’d need some collection of invalidation hooks to make that happen.

At this point we’re in deep “I know what I’m doing” RCU territory and I’m going to just assume we can figure out a way for the code generator to do what we want, which is to infer that this function returns a compile-time-constant value. I think this’ll work out in practice, since even if the open file instance is shared across processes, the file cannot be truly closed until its reference count goes to zero. Anyway, let’s move forward.

Next, we’re back in __fget_light() and then __fdget_pos(): our code generator should be able to easily fold away the remaining branches in these functions. Finally, we return to line 4 of ksys_write() and we know what the struct fd contains, making it possible to continue specializing aggressively. I don’t think making this example any longer will be helpful; hopefully the character of the problems we’re trying to solve is now apparent.

In summary, we saw four kinds of variables in this exercise:

1. Those, such as the “fd” parameter to write(), that the code generator can see are constant at code generation time.

2. Those, such as the “current” pointer, that are constant, but where the code generator cannot see this fact for one reason or another. To specialize these, we’ll have to give the compiler extra information, for example using annotations.

3. Those, such as the “count” field of the “files_struct”, that are not actually constant, but that seem likely enough to remain constant that we may want to create a specialized version treating them as constants, and then be ready to invalidate this code if the situation changes.

4. Those that are almost certainly not worth trying to specialize. For example, the “count” parameter to write() is not likely to remain constant over a number of calls.
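One possible shape for annotations covering the second and third kinds of variable, which is entirely hypothetical (the macro names are invented; Clang's generic annotate attribute is one existing hook for attaching this sort of metadata to declarations):

```cpp
#include <cassert>

// Hypothetical annotations a specializing compiler could consume.
#if defined(__clang__)
#define CONSTANT_PER_PROCESS __attribute__((annotate("constant-per-process")))
#define LIKELY_CONSTANT __attribute__((annotate("likely-constant")))
#else
#define CONSTANT_PER_PROCESS
#define LIKELY_CONSTANT
#endif

struct files_struct {
  // Not truly constant, but worth specializing on, provided we are
  // ready to invalidate the generated code if it changes.
  int count LIKELY_CONSTANT;
};

struct task_struct {
  // Constant for the life of the process, but the compiler can't prove
  // it; the annotation supplies the missing fact.
  struct files_struct *files CONSTANT_PER_PROCESS;
};
```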

Writing one byte to a pipe from a single-threaded process executes about 3900 instructions on Linux 5.1.14 (this is just in ksys_write(), I didn’t measure the trapping and untrapping code). The Synthesis thesis promises an order of magnitude performance improvement. Can specialization reduce the fast path on this system call to 390 instructions? It would be fun to find out.

I’ll finish up this example by observing that even though I chose to present code from the filesystem, I think it’s network stack code that will benefit from specialization the most.

Discussion

I have some experience with OS kernels other than Linux, and my belief is that attempting to dynamically specialize any mainstream, production-grade OS other than Linux would run into the same issues we just saw above. At the level the code generator cares about, there just isn’t much effective difference between these OSes: they’re all big giant blobs of C with plentiful indirection and domain-specific hacks.

If our goal is only to create a research-grade prototype, it would be better to start with something smaller than Linux/Windows/Darwin so that we can refactor specialization-unfriendly parts of the OS in a reasonable amount of time. xv6 is at the other extreme: it is super easy to hack on, but it is so incredibly over-simplified that it could not be used to test the hypothesis “a realistic OS can be made much faster using specialization.” Hilariously, an xv6+LLVM system would be about 0.15% OS code and 99.85% compiler code. Perhaps there’s a middle ground that would be a better choice, Minix or OpenBSD or whatever.

Given two developers, one who knows LLVM’s JIT interfaces and one who’s a good Linux kernel hacker, how long would it take to bring up a minimally ambitious dynamically specializing version of Linux? I would guess this could be done in a week or two; there’s not really anything too difficult about it (it’s easy to say this while blogging, of course). The problem is that this would not give good results: only the very easiest specialization opportunities will get spotted by the runtime code generator. But perhaps this would generate enough interest that people would keep building on it.

Do we want to do specialization work on C code? No, not really, it’s just that every one of our production-grade kernels is already written in it. A fun but engineering-intensive alternative would be to create a new, specialization-friendly kernel in whatever programming language looks most suitable. Functional languages should offer real advantages here, but of course there are issues in using these languages to create a performant OS kernel. Perhaps Mirage is a good starting point here, it is already all about specialization — but at system build time, not at runtime.

An ideal programming environment for a modern Synthesis kernel would provide tool and/or language support for engineering specialization-friendly kernel code. For example, we would identify a potential specialization point and then the tools would use all of our old friends — static analysis, dynamic analysis, symbolic execution, etc. — to show us what data items fall into each of the four categories listed in the last section, and provide us with help in refactoring the system so that specialization can work better. A tricky thing here is taking into account the different kinds of concurrency and synchronization that happen in a sophisticated OS.

Some useful questions to ask (and of course we’re always asking these same things when doing OS and compiler research) are: How are we supposed to think about a dynamically specializing OS kernel? What are the new abstractions, if any? Specialization could really benefit from some sort of first-class “code region over which these values are effectively constant” and then also “but the constant-ness is invalidated by this set of events.”

Why Now?

The literature on dynamic specialization of OS code is interesting: it looks like there was a flurry of interest inspired by Synthesis in the mid/late 90s. Many of these papers had Calton Pu, Massalin’s thesis supervisor, on the author list. Not a whole lot has happened in this area since then, as far as I know. The only paper I can think of about optimistic OS specialization is this one; it’s a nice paper, I recommend it. Static OS specialization, on the other hand, is what unikernels are all about, so there’s been quite a bit of work done on this.

It seems like time to revive interest in dynamic OS specialization because:

Most of the processor speed wins lately are application specific; the cores that execute OS code are not getting noticeably faster each year, nor do they seem likely to. In fact, way back in 1989 John Ousterhout argued that increases in processor speed weren’t benefiting OS code as much as other kinds of code.

OSes have slowed down recently to mitigate side channel attacks. Maybe we can get some of that speed back using dynamic specialization.

OSes are way bloatier than they were in the 90s, increasing the potential benefits due to specialization.

Compiler technology is far ahead of where it was in the 90s, with off-the-shelf toolkits like LLVM providing high-quality solutions to many of the problems we’d run into while prototyping this work.

I’d like to thank Perry Metzger who suggested this piece and also provided feedback on a draft of it. Perry worked with Alexia back in the day and hopefully he’ll also write about this topic.

Finally, I don’t want to give the impression that I’m summarizing a research proposal or an in-progress project. This is the kind of thing I love to think about, is all.

Packrafts are tough, light, individual-sized inflatable boats that people use to put together really amazing wilderness trips by combining rafting, hiking, and sometimes even biking. I’ve had sort of a low-grade obsession with packrafting since 2009 when I ran into some people in Alaska who were on their way to hike up and over the Brooks Range and paddle down to the Arctic Ocean. (I’m 99.9% certain it was these folks.) But anyway, I never managed to actually do any packrafting until spring break this year when — along with my friends Brian and Tim — I floated a section of the Dirty Devil River in southern Utah.

The Dirty Devil is a little tricky: it’s a seasonal desert river that, on average, only has enough flow for floating during February and March — and February can be really cold. Luckily the university’s spring break is in March, usually a superb time to be in the desert: the days are getting longer and warmer, the nights are still cold, and there’s not a single bug to be found. Usually you get snowed on for a couple hours, but the snow doesn’t stick around.

None of us had so much as touched a packraft, but all of us had canoed or kayaked before, so we figured a class 1-2 river should be doable. The Dirty Devil flows through a canyon 1500 feet (450 meters) deep and, once we started, there was no realistic possibility of escape except by finishing the trip 36 miles later.

We started out on a spectacular morning:

To set up a shuttle we drove down the rough but super fun Poison Springs Road; here Brian’s cramming a few last things into my vehicle with the DD running narrow and fast in the background:
There was quite a bit more water in the channel than we’d expected; it turned out that Happy Canyon, about eight miles upstream, was in the middle of a minor flash flood.

Next we returned to the paved road, drove a few miles, then down to the top of the western end of the Angel Trail (one of a number of trails with that name). Nobody else was there, and we didn’t see anyone until the last day of our trip. At this point the weather was deteriorating:

And soon we were hiking in the snow:

The snow didn’t last long, but it remained windy and not far above freezing; we were happy to have wetsuits and puffy jackets. We put in across the river from Robbers Roost Canyon, where Butch Cassidy and his gang used to hide out:
We paddled to Angel Cove, which has good camping and a nice spring. With the braided river and the wind it took forever to go just a couple of miles, and we were pretty happy to get there, get into dry clothes, and make some dinner. Also, since the remaining days of the trip lacked reliable water sources (water from the DD is nasty), we purified enough water to last the rest of the way.

The next morning it was still cold, but nice and sunny:

We saw both sides of a ridge called the Saw Tooth, which the river meanders around:

We also passed No Man’s Canyon and Larry Canyon, both of which would have been great to spend a few days hiking, but we hadn’t been able to budget time for that.

The opaque water was hard to read and we did an awful lot of pulling our boats around the river trying to find the channel, which moved from bank to bank every few hundred yards:
This was exhausting and there was plenty of quicksand too. Time flew by as we moved down the river at less than a walking pace and it was around 3pm before we noticed that we should stop for lunch.

The scenery continued to develop:

We pushed late into the day:

And finally stopped at a nice camp site on the river bank where we watched the last of the sun on the canyon walls:

There was copious driftwood so we had a fire:

Morning was good:

Since we hadn’t made good time the previous day I got us up early, but it was just too cold to get into the water and anyway our wetsuits were frozen stiff. We had another cup of coffee and waited for the sun:

It ended up being warmer than the previous day — just about perfect weather — and for whatever reason we made much better time than we had earlier. At this innocuous-looking rest stop I hit some ugly quicksand and told the others to land somewhere else:
Brian landed a little ways to one side and just walked up the bank, but Tim got mired much worse than I had; it probably took him 15 minutes to get out. We couldn’t even help.

Over the course of the day we passed Twin Corral Box Canyon and Sam’s Mesa Box Canyon, both huge and seldom-visited places I’d have been happy to spend a couple days exploring, but again we wanted to keep on schedule. Travel was generally easy and also the river gained some fun riffles as it started to drop more rapidly through the sandstone layers. It wasn’t possible to avoid hitting rocks since we couldn’t see through the water, but packrafts seem to be very tough. They’re also quite stable; I only worried about flipping once or twice, and never did.

By late afternoon we reached Happy Canyon — named, apparently, by old-time cowboys who were happy to find a way out of it — which was still running from its flash flood three days earlier:

No shortage of quicksand here either, ugh!
We didn’t take much time to explore, but rather found a place to camp nearby. By this point we were only eight miles from our takeout and I finally stopped worrying about getting trapped in the canyon if one of the boats popped or got lost. For whatever reason this night was extra cold; my water bottle was more than half frozen in the morning, meaning it had probably gotten below 25 °F (−4 °C).

Nice light on the walls:

Happy Canyon has some of the finest walk-through narrows anywhere on the Colorado Plateau:

After this excursion it was time to pack up camp and float some more. We immediately hit a series of rock gardens:

Progress was overall slower than the previous day, but we moved along and the scenery kept being good.

And suddenly we reached the takeout, where we had adult beverages and loaded a lot of extremely sandy gear into Brian’s truck:

And then back to pick up my truck, which we reached just in time for sunset:

Here in the middle-right you can see the sawtooth ridge that we had floated past a few days earlier:

And finally, as tradition dictates, we stopped for burgers at Ray’s in Green River. This was a super good trip, but fairly difficult physically and perhaps a bit much to bite off for a group of 40-something first-time packrafters.

Popcount is the function that returns the number of set bits in its argument. Showing that a popcount implementation does what it claims to do has become one of my favorite examples to use when I need to quickly show students how we can reason about programs mathematically. Something like a selection sort is probably the more standard example here, but sorting gets into some subtleties that I’d rather reserve for a second example. Popcount is so easy that this whole example can be done in a half hour.

The programming language and the particular algorithm don’t matter too much; here we’ll use what is perhaps the simplest popcount implementation in C or C++:
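The original listing isn't reproduced here, but the discussion that follows matches a loop along these lines (a reconstruction; note that x is unsigned, so >> is a logical shift):

```cpp
#include <cassert>
#include <cstdint>

// Count set bits by examining all 64 bit positions.
int popcount64(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1; // add the low bit to the counter
    x >>= 1;    // logical shift: x is unsigned
  }
  return c;
}
```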

This works, of course, because it scans the entire argument, incrementing the counter each time it finds a set bit. To show that it works in a more formal fashion, we can use methods more or less out of A Discipline of Programming, which let us combine the obvious effects of individual statements into (hopefully) the desired high-level property; we might as well annotate the program with that property right now:

Here we’ve used a couple of tricks that need to be fully explained when giving this example to students who aren’t used to this kind of reasoning. First, since x is modified as the program executes, we’ve used orig_x as a shorthand for the original value of x when it was passed as an argument. (This syntax is a trick I learned from ACSL.) Second, here we’ve used the POPCOUNT function, which is the mathematical specification of popcount, which is a completely different thing than the C code we are trying to verify. Consistently applying the correct rules to mathematical and programmatic elements isn’t as straightforward as we would hope and in fact quite a few results from the bad old days of non-mechanized program verification are wrong in subtle ways.

To make our proof go, we’ll need to find the right loop invariant: a statement that is true before the loop executes and also true after every loop iteration. The trick when explaining this task to students is helping them realize that (for a simple example like this) if they understand how the algorithm works, they effectively already know the loop invariant. They just need some help in teasing it out into a usable form. One of the stumbling blocks is that every loop has many invariants, but most of them (for example c ≥ 0) aren’t helpful. We can avoid a lot of silly loop invariants by observing that the right invariant will be strong enough to imply the correctness criterion for the program while also being weak enough to already be true before the first loop iteration, but I’ve found that negative constraints like these are not very helpful for students. The trick, of course, is to capture the abstract fact that is being held in balance as the loop modifies pieces of program state, and to make a lecture like this work, we need to avoid the appearance of pulling this fact out of thin air. It defeats the entire purpose of this discussion if students walk away feeling like there was a magical step in the middle.

So anyway, by hook or by crook we need to get the audience to realize that the key to this algorithm is the loop invariant c + POPCOUNT(x) == POPCOUNT(orig_x). At this point we’re over the hump and we should mark up the code some more:
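One way to render the marked-up program in executable form is to turn the annotations into assertions, with a recursive definition standing in for the mathematical POPCOUNT (this is a reconstruction; the original used inline annotations, with question marks for propositions still in doubt, rather than runtime checks):

```cpp
#include <cassert>
#include <cstdint>

// Mathematical specification of popcount: a completely different thing
// than the code under verification.
static int POPCOUNT(uint64_t x) {
  return x == 0 ? 0 : (int)(x & 1) + POPCOUNT(x >> 1);
}

int popcount64(uint64_t x) {
  const uint64_t orig_x = x; // shorthand for the original argument value
  int c = 0;
  for (int i = 0; i < 64; i++) {
    // loop invariant: true before the loop and after every iteration
    assert(c + POPCOUNT(x) == POPCOUNT(orig_x));
    c += x & 1;
    x >>= 1;
  }
  assert(x == 0);                // true after 64 logical shifts
  assert(c == POPCOUNT(orig_x)); // the overall correctness condition
  return c;
}
```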

Here I’ve used question marks to indicate a proposition that is still in doubt. Now we’ve broken up the problem into manageable pieces and can visit them in turn:

The loop invariant is true before the loop executes because c = 0 and x = orig_x.

To see that the loop invariant is preserved by the loop body, we analyze two cases: x & 1 is either zero or one. If it is zero, then c is unchanged and also, clearly, shifting x right by one place does not change POPCOUNT(x). If it is 1 then c is incremented by one and shifting x right by one place has decreased its popcount by one. So we can see that either way, the invariant is maintained.

To prove that the loop invariant implies the overall correctness condition, we must observe that x = 0 after being shifted right 64 times. This is the sort of situation where we’d better be clear about the rules for shifting signed vs. unsigned values in C or C++.

Of course we’ve skipped over a lot of smaller steps, but this argument captures the essence of why the algorithm works. When presenting this particular algorithm, we can safely skip the termination proof since i is not modified in the loop body.

Wikipedia has some faster but much trickier algorithms, which are the ones used in practice on processors that do not have a native popcount instruction; I didn’t use one of them in this post because I feel it’s useful to show the analysis by cases in the simpler algorithm.

Anyway, I presented this material in a software engineering class yesterday, and realized that I’d used this example a couple of times before, so it seemed worth writing up here.

People tend to be visual: we use pictures to understand problems. Mainstream programming languages, on the other hand, operate in an almost completely different kind of abstract space, leaving a big gap between programs and pictures. This piece is about pictures drawn using a text character set and then embedded in source code. I love these! The other day I asked around on Twitter for more examples and the responses far exceeded expectations (thanks everyone!). There are a ton of great examples in the thread; here I’ve categorized a few of them. Clicking on an image takes you to its repository.

Data Structures

One of the most common kinds of ASCII art in code is illustrating the shape of a data structure.

The example I started with comes from LLVM:

The layout of a data structure in the Jikes RVM:

A tree rotate in Musl:

Double-ended queue from the Rust library:

Swift compiler internals:

Malloc header layout:

State Machines

JavaScript profiling:

RPCs in Cloud Spanner:

I/O stream states:

Logical Structure in the Problem Domain

Control flow in an NWScript program being decompiled:

ECC internals:

Formatting numbers:

A quantum circuit:

Balancing memory management objectives in an OS kernel:

Subtyping relations (this is a very cool special case where the ASCII art is also code):

Since the National Science Foundation funds a large fraction of academic computer science research in the USA, we often end up traveling to Washington to visit the NSF. This post is just to say that if you are traveling light, if you need some exercise, if you have a bit of free time, and if the weather is good, it is feasible and pleasant to walk or bike from Reagan Washington National Airport (DCA) to the new NSF location in Alexandria.

Do follow the Mount Vernon Trail, which connects DCA and Old Town Alexandria in a pretty direct fashion.

I’ve walked this route in both directions; it’s roughly five miles and takes me 80 or 90 minutes (I’m a pretty fast walker so ymmv). I have not biked it, but Capital Bikeshare has several stations near DCA and quite a few stations scattered around Alexandria — so this seems like a great option, assuming that your luggage is in a backpack or something else hands-free.

Here’s an overview of the whole route:

To help orient yourself, the beltway runs across the bottom of the image, the Potomac River runs north to south, you can see DCA’s runways towards the top, and Old Town Alexandria is in the triangular area in between the Potomac, the beltway, and the red walking route.

The only potentially tricky bit is locating the Mt Vernon Trail. Starting anywhere in the airport, walk to Terminal A and go outside to the semicircle-shaped pickup/dropoff area, and then find a sidewalk going roughly southwest. The road you want to walk along is South Smith Boulevard. Don’t cross under the Metro tracks and also don’t go down a level to Thomas Avenue, which gives access to some of the buildings right next to the airport terminal. This may be helpful:

In less than half a mile, as you reach the end of the airport-related buildings, the Mount Vernon Trail, paved with a dotted yellow center line, is easy to see and you should move over to it here (open kmz files in Google Earth). This is all pretty obvious on the ground, and there are some helpful signs. Whatever you do, don’t cross the George Washington Parkway — the busy divided highway that runs nearby.

Follow the Mount Vernon Trail past Daingerfield Island to Alexandria, at which point the trail forks and the Mount Vernon Trail heads over towards the Alexandria waterfront. This is a nice walk if you have time, but it’s out of the way and you can get to NSF more quickly by moving to East Abingdon Drive here and then winding your way through residential Alexandria towards King Street and then NSF.

As long as you have some sort of map this whole route is very easy to follow, as is the reverse route back to the airport.

In this piece I want to discuss an aspect of program synthesis that sounds like it should be easy, but isn’t: synthesizing constant values. For example, consider trying to synthesize an optimized x86-64 implementation of this code:

We need to find a 64-bit constant and two 6-bit constants such that the assembly code returns the same result as the C code for every value of the function parameter. Of course there’s a well-known algorithm that GCC and LLVM use to compute these constants, but the synthesizer doesn’t have it: it must operate from first principles.

An even more difficult example is a 64-bit Hamming weight computation, where (assuming we don’t want to use vector instructions) we might hope to synthesize something like this:

Again assuming that we know that this sequence of instructions will work, we still need to come up with 280 bits worth of constant values. One can imagine even more challenging problems where we want to synthesize an entire lookup table.

The running example I’ll use in this post is much easier:

uint32_t bar1(uint32_t x) {
  return ((x << 8) >> 16) << 8;
}

The code we want to synthesize is:

uint32_t bar2(uint32_t x) {
  return x & 0xffff00;
}

That is, we want to find C in this skeleton:

uint32_t bar3(uint32_t x) {
  return x & C;
}

Basic Constant Synthesis

We need to solve an “exists-forall” problem: does there exist a C such that the LHS (left-hand side, the original code) is equivalent to the RHS (right-hand side, the optimized code) for all values of x? In other words:

∃C. ∀x. LHS(x) = RHS(x, C)
(Here we’ll play fast and loose with the fact that in the real world we check refinement instead of equivalence — this doesn’t matter for purposes of this post.)

The specific formula that we want an answer for is:

∃C. ∀x. ((x << 8) >> 16) << 8 = x & C
A SAT solver can natively solve either an exists query (this is the definition of SAT) or else a forall query (by seeing if there exists a solution to the negated proposition). But, by itself, a SAT solver cannot solve an exists-forall query in one go. An SMT solver, on the other hand, can natively attack an exists-forall query, but in practice we tend to get better results by doing our own quantifier elimination based only on SAT calls.

First, we ask the solver if there exists a C and x that make the LHS and RHS equivalent. If not, synthesis fails. If so, we issue a second query to see if C works for all values of x. If so, synthesis succeeds. If not, we add a constraint that this value of C doesn’t work, and we start the process again. The problem is that each pair of queries only rules out a single choice for C, making this process equivalent, in the worst case, to exhaustive guessing, which we cannot do when C is wide. We need to do better.

Reducing the Number of Counterexamples using Specialization

Souper uses a technique that appears to be common knowledge among people who do this kind of work; unfortunately I don’t know where it was first published (Section 4.2 of this paper is one place it has appeared). The trick is simple: we choose a few values of x and use them to specialize both the LHS and RHS of the equation; you can think of this as borrowing some constraints from the forall phase and moving them into the exists phase (which becomes an “exists-forsome” query, if you will). In many cases this adds enough extra constraints that the solver can arrive at a workable constant within a few iterations. If we specialize x with 0 and -1 then in the general case we get:

∃C. LHS(0) = RHS(0, C) ∧ LHS(-1) = RHS(-1, C)
After constant folding, our running example becomes:

0 = 0 & C ∧ 0x00FFFF00 = 0xFFFFFFFF & C
The first choice, 0, turns out to be an unlucky one: after further simplification it comes out to “0 = 0”, giving the solver no extra information. The second choice, on the other hand, is extremely lucky: it can be rewritten as “0x00FFFF00 = C” which simply hands us the answer. In most cases things won’t work out so nicely, but specializing the LHS and RHS with fixed inputs helps enormously in practice.

Some open questions remain: How many specific values should we try? How should these values be chosen? An obvious constraint is that the values chosen should be as different from each other as possible, though it isn’t clear what distance metric should be used. A related but less obvious criterion (this is Nuno Lopes’s idea, and we haven’t tried it out yet in Souper) is that the values should exercise as many different behaviors of each instruction as possible. For example, an addition instruction should have example inputs that overflow and also inputs that don’t. This requirement can be satisfied syntactically for instructions that are close to the inputs, but solver calls (or other symbolic methods) will be required to reach instructions that consume the outputs of other instructions. (In Souper’s case, solver calls are required anyway if there are any path conditions, which place arbitrary restrictions on the input space.) It also seems possible that we could tailor the input values to maximize how much of the RHS folds away (the LHS always folds completely down to a constant, of course).

Synthesis at Reduced Bitwidth

Instead of solving a difficult synthesis problem, we can instead solve a problem that is structurally identical, but uses narrower bitvectors (even two or three bits are often enough to capture the essence of a computation). This results in far better solver performance and, typically, a narrow synthesis result that does not contain constants will also work at the original, wider bitwidth (though obviously this needs to be verified by the solver). When the narrow result has constants, there’s a problem: we lack a principled method for deriving the wider constants from the narrow ones. One approach is to use heuristics. This paper suggests the following methods:
Here you should read BV(x,y) as “a bitvector of length y holding value x.” So, for example, rule 1 would extend an 8-bit variable containing 8 to a 16-bit variable containing 16. Rule 4 would extend an 8-bit variable containing 8 to a 16-bit variable containing 8. These seem reasonable, but notice that none of them would help with our running example.

In Souper we have the additional problem that the codes we’re trying to optimize usually contain constants, presenting us with a constant-narrowing problem that is analogous to the constant-widening problem that we just discussed, but worse, because now any dodgy heuristics that we come up with sit before the expensive synthesis process instead of after it. It isn’t clear to us that there’s any satisfactory solution to this problem.

Getting More out of Counterexamples

In standard constant synthesis, each failed guess only adds the constraint that that particular constant doesn’t work — this fails to make much of a dent in an exponential search space. This paper and this one propose extracting additional information from each counterexample, cutting out a larger part of the search space. This also seems like an excellent direction.

Symbolic Constants

An alternate approach that avoids both narrowing and widening problems would be to treat constants symbolically. This, however, creates two new problems: deriving preconditions for optimizations and deriving functions to compute RHS constants in terms of elements of the LHS (constants, widths, etc.). It seems clear that in some cases this will be very difficult, for example imagine trying to derive, from first principles, the algorithm for computing the output constants for an arbitrary value of the LHS constant C:

long foo(long x) {
  return x / C;
}

Alive-Infer helps derive preconditions but it does not yet try to come up with the functions for computing constants in the synthesized output.

Since in some use cases (including my “destroy InstSimplify and InstCombine” dream) we want the generalized optimization anyway, this seems like a great direction for future work.

Conclusion

All of the techniques described in this post could be implemented inside the solver or outside it. Solvers like CVC4 already have sophisticated internal support for synthesis. The advantage of pushing more work into the solver is that it can exploit internal levers to guide the search, reuse data structures across synthesis iterations, and do other clever things that aren’t exposed by the solver’s API. The disadvantage is that we might have high-level information about the problem domain that would help the solver do its job better, but that is difficult to communicate to it effectively.

In summary, constant synthesis is a hard problem that hasn’t received a lot of attention yet. My group is actively working on this and we have some ideas that aren’t mature enough to share yet. Please drop me a line if you know of any good techniques that I’ve missed here. Also, I’d be interested to hear from anyone working on benchmarks for constant synthesis.

Finally, tools such as Rosette and Sketch know how to synthesize constants.

C and C++ perform implicit conversions when, for example, you pass an integer-typed variable to a function that expects a different type. When the target type is wider, there’s no problem, but when the target type is narrower, or when it is the same width but the opposite signedness, integer values may silently change when the type changes. For example, this program:

prints 4294967293. Like unsigned integer wraparound (and unlike signed integer overflow) these changes of value are not undefined behavior, but they may be unintentional and may also be bugs. Clang recently gained support for dynamically detecting value changes and either providing a diagnostic or else terminating the program. Some fine-grained flags are available for controlling these diagnostics, but you can enable all of them (plus some others, such as signed overflow and unsigned wraparound checks) using -fsanitize=integer.

To suppress the diagnostic, we can make the conversion explicit using a cast:

int main() {
  int x = 300;
  unsigned char c = (unsigned char)x;
}

Different parts of this functionality landed in Clang before and after the 7.0.0 release — to get everything, you will want to build a Clang+LLVM from later than 1 Nov 2018 (or else wait for the 8.0.0 release in a few months).

How would you use these checks? Generally, they should be part of a testing campaign, perhaps in support of a code audit. If you enable them on a non-trivial code base, you will run into diagnostics that do not correspond to bugs, just because real C and C++ programs mix integer types so freely. I suggest starting with -fsanitize=implicit-integer-truncation. Let’s look at doing this to Clang+LLVM itself and then compiling a hello world program: here’s the output.

I’d be happy to hear about any interesting bugs located using these checks, if anyone wants to share.

Finally, see this tweet by Roman Lebedev (the author of these checks) showing that the runtime impact of -fsanitize-trap=implicit-conversion is very low.