Some Goals for High-impact Verified Compiler Research

I believe that translation validation, a branch of formal methods, is just about ready for widespread use. Translation validation means proving that a particular execution of a compiler did the right thing, as opposed to proving once and for all that every execution of a compiler will do the right thing. These are very different propositions. Consider some obstacles to once-and-for-all verification of a tool like GCC or LLVM, where:

Most of the compiler's execution time is spent processing pointer soup, where correctness depends on poorly documented and incredibly detailed properties of the soup.

Hundreds or thousands of analyses and transformations are performed, and due to performance constraints the compiler implementation entangles them in a way that is nearly impossible to disentangle.

The implementation language is usually unsafe, forcing any formal verification effort to spend an outrageous amount of work proving properties of the compiler code, such as memory safety, that are incidental to the task of interest. A convincing formal verifier for the subset of C++ that LLVM is written in doesn't even exist.

For some compiler algorithms like register allocation, it appears to be fundamentally easier to check the result than it is to prove that the right result is always computed. For example, CompCert uses this approach (or did the last time I looked).

The compiler is under active, rapid development. Any proof would have to be redone, likely incurring significant effort, for every release.
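The register-allocation point above can be illustrated with a tiny result checker (the data structures here are hypothetical, for illustration only; CompCert's actual checker operates over its own intermediate representation): given each variable's live range and the register the allocator assigned it, the checker only has to verify that no two overlapping live ranges share a register, which is far simpler than proving the allocator itself correct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A variable's live range [start, end) and its assigned register. */
struct alloc {
    int start, end; /* instruction indices */
    int reg;        /* register assigned by the allocator */
};

/* Check the allocator's output: no two variables that are live at
   the same time may occupy the same register. Quadratic and naive,
   but checking a result is much easier than proving once and for
   all that the allocator always produces a correct one. */
static bool check_allocation(const struct alloc *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (a[i].reg == a[j].reg &&
                a[i].start < a[j].end && a[j].start < a[i].end)
                return false; /* overlapping ranges, same register */
    return true;
}
```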

So it’s clear that once-and-for-all formal verification of LLVM or GCC is never going to happen: the costs ludicrously outweigh the benefits. Translation validation, on the other hand, is already to some extent practical; see, for example, this effort to prove that the seL4 object code refines its C source code. (Refinement just means that C gives a typical program many different meanings, and we need to prove that the compiler has picked one of them.)

This work is awesome but research tools don’t, by themselves, stop people from being burned by compiler bugs. One way to make things better is to combine translation validation with aggressive testing, like we did here, and then make sure any resulting bugs get fixed. Better yet, we can try to push translation validation out into the world so that anyone can use it. It’s time for this to happen. The rest of this piece is some thoughts about how that should work.

Goal 1: Ease of Use

The only thing an application developer should need to do is add a compiler flag like this:

clang++ -O -tv file.cpp

or:

rustc -O -tv file.rs

and then the compiler either validates, or fails to validate, its translation. It has to be this easy.

Goal 2: Near zero overhead for compiler developers

Translation validation can’t get in the way of normal development for a production compiler: it has to be almost entirely on the side. This doesn’t mean, however, that the compiler can’t help out the validator, but rather that this has to happen in non-invasive ways. For example, certain optimizations on nested loops that are hard to validate might need to emit a bit of extra debug info or optimization remarks or whatever, to help the validator piece together what happened.

Goal 3: Performance

Since translation validation will result in a lot of solver calls, it is going to be somewhat slow, probably well over an order of magnitude slower than regular compilation. A fairly easy way to speed it up would be to add a (persistent, networked) caching layer to exploit the fact that most parts of most code bases don’t change very often. We’ve had good luck using this kind of cache for Souper, which is also slow due to making many solver calls.
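A minimal sketch of such a cache, keyed on a hash of the function's textual IR plus the optimization level (an assumption for illustration; a deployed version would be persistent and networked as described above, and would also key on the compiler version):

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 1024

/* One entry: hash of (function IR, optimization level) and the
   cached validation verdict. Direct-mapped; collisions simply
   overwrite, which is safe because a miss just means re-validating. */
struct cache_entry {
    uint64_t key;
    bool valid;    /* slot in use */
    bool verdict;  /* did validation succeed? */
};

static struct cache_entry cache[CACHE_SLOTS];

/* FNV-1a hash over the textual IR; any stable hash would do. */
static uint64_t hash_ir(const char *ir, int opt_level) {
    uint64_t h = 1469598103934665603ULL;
    for (const char *p = ir; *p; p++) {
        h ^= (unsigned char)*p;
        h *= 1099511628211ULL;
    }
    h ^= (uint64_t)opt_level;
    h *= 1099511628211ULL;
    return h;
}

/* Returns true on a cache hit, storing the cached verdict. */
static bool cache_lookup(uint64_t key, bool *verdict) {
    struct cache_entry *e = &cache[key % CACHE_SLOTS];
    if (e->valid && e->key == key) {
        *verdict = e->verdict;
        return true;
    }
    return false;
}

static void cache_store(uint64_t key, bool verdict) {
    struct cache_entry *e = &cache[key % CACHE_SLOTS];
    e->key = key;
    e->valid = true;
    e->verdict = verdict;
}
```

On a hit, all solver queries for that function are skipped; since most functions in a large code base are unchanged between builds, the hit rate should be high.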

Goal 4: Multiple Validators

Research tends to move rapidly when there is a level playing field and a clearly-defined goal, allowing different groups to compete or cooperate, as they see fit. Competition can be particularly motivating; see, for example, SMT-COMP.

The primary metric for choosing a winner in a translation validation competition would be the number of functions validated when compiling a given benchmark with a particular LLVM version and optimization level. Verification time would be a good secondary metric.

To ensure a fair competition, it would be best for all validators to be using the same semantics for the source and target languages. This isn’t so straightforward: all too often these mathematical artifacts end up not being readable or reusable since they are deeply embedded in the implementation of a formal methods tool (this is unfortunately the case for Alive, for example). A canonical, readable, writable, and reusable semantics for each of C, C++, Rust, Swift, LLVM IR, x86-64, etc. is something we should be spending significant resources on. This sort of thing is what I’m talking about.

Conclusions

Just to be clear, beyond the Alive-based work referenced above, I’m not working on, nor do I have any plans to work on, translation validation. Rather, it is clearly the right way to gain confidence that a production-grade compiler has done its job. The technologies are in reach and we should be working to deploy them widely.

6 replies on “Some Goals for High-impact Verified Compiler Research”

How much of the hard-to-prove aspects of compilers would it be tractable to implement in DSLs specifically designed to facilitate proofs? For example, the high-level language Alive consumes can (at least in theory) be compiled into an InstCombine pass that satisfies all the performance and optimization requirements of a production compiler. Can the same sort of approach be applied to less local transformations?

Doing that wouldn’t remove the need for translation validation (the various compiler-compilers would themselves be hard to verify), but it would reduce the number of compiler bugs, which would itself improve the ability to get correct compilations, which is the end goal.

Hi bcs, this is a point I was thinking about describing in this post but then it didn’t really fit in!
I think it’s absolutely the case that if we disentangle compiler algorithms into small parts, perhaps written in a variety of DSLs, then we can prove the vast majority of it to be correct once-and-for-all. But this is a major re-engineering effort, of course.
I have some ideas about how future compilers should be built but it’s all somewhat half-baked.

Especially when it comes to “pointer soup”, the C Standard has evolved into an unworkable mess. It would be much easier to verify compiler correctness if the language had at least three recognized aliasing modes:

1. Precise aliasing, which would require that every access to an lvalue which isn’t qualified “register” or derived from a “restrict” pointer behave as though it read or wrote the storage occupied thereby.

2. Strict aliasing, which would require that no storage which has *ever* been written as a non-character type be read as a different non-character type (memory returned from “malloc” etc. would be considered “new storage” even if it re-uses RAM that has been previously passed to “free”). This is what gcc and clang seem to process [the idea that writing to storage should change the effective type thereof makes effective optimization much harder, and both clang and gcc sometimes make erroneous optimizations in cases where the Standard unambiguously defines behavior].

3. Evidence-based aliasing, which would allow a compiler to treat accesses to lvalues of type T as unsequenced relative to accesses to lvalues of other types except within narrowly-defined contexts where a pointer or lvalue of another type is derived from one of type T. If e.g. the address of a “uint32_t” is converted to “uint16_t*” and passed to a function, code within that function could treat accesses to uint32_t as unsequenced relative to accesses to uint16_t, but accesses made with pointers derived from the passed-in uint16_t* would be sequenced relative to code outside that uses the uint32_t whose address was taken.

Given something like “void test(int *ip, float *fp, int mode) { *ip = 1; *fp = 2.0f; if (mode) *ip = 1;}”, evidence-based aliasing as described above (unlike the Standard’s “effective-type-based” aliasing) would allow code that would be most efficient in the non-aliasing case to be verifiably correct without the compiler having to prove anything about the caller. Since some programs absolutely require the ability to use storage as different types at different times, a program that could only verify optimizations for programs that fit the #2 pattern above would be of limited use, compared with one that could accommodate programs that re-purpose storage within its lifetime.
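[Editorially, the transformation at issue can be sketched as follows; the optimized variant is what a compiler may emit under the assumption that the two pointers do not alias (a source-level sketch only; real compilers do this at the IR level):]

```c
#include <assert.h>

/* Original function from the comment above. */
void test(int *ip, float *fp, int mode) {
    *ip = 1;
    *fp = 2.0f;
    if (mode)
        *ip = 1;
}

/* Under the assumption that ip and fp do not alias, the conditional
   re-store of *ip is redundant: *ip already holds 1, and the store
   through fp cannot have disturbed it. The branch disappears. */
void test_optimized(int *ip, float *fp, int mode) {
    (void)mode; /* mode no longer affects the result */
    *ip = 1;
    *fp = 2.0f;
}
```

The two versions agree whenever ip and fp point to distinct objects and diverge exactly when the pointers alias, which is the case the proposed evidence-based rule is meant to delineate.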

How do you like the “Evidence-based aliasing” construct? I think it would simply, unambiguously, and correctly identify most cases where programs (including those which would be incompatible with gcc’s “-fstrict-aliasing” dialect) would or would not be reliant upon aliasing behavior. I’m not sure why some compiler writers insist that it’s impossible for compilers to support aliasing without disabling optimizations wholesale, since a rule like I propose would only require a compiler to look at the same code it would have to look at to perform the optimizations in the first place. I think some optimizers filter out pointer conversions at an early stage, thus denying later stages of information needed to properly recognize what should be obvious cases of aliasing, but that shouldn’t be an obstacle to recognizing a dialect that could be efficiently processed by better implementations without imposing the semantic limitations that gcc and clang do with “-fstrict-aliasing”.