Taming Undefined Behavior in LLVM

Earlier I wrote that Undefined Behavior != Unsafe Programming, a piece intended to convince you that there’s nothing inherently wrong with undefined behavior as long as it isn’t in developer-facing parts of the system.

Today I want to talk about a new paper about undefined behavior in LLVM that’s going to be presented in June at PLDI 2017. I’m an author of this paper, but not the main one. This work isn’t about debating the merits of undefined behavior; its goal is to describe and try to fix some unintended consequences of the design of undefined behavior at the level of LLVM IR.

Undefined behavior in C and C++ is sort of like a bomb: either it explodes or it doesn’t. We never try to reason about undefined programs because a program becomes meaningless once it executes UB. LLVM IR contains this same kind of UB, which we’ll call “immediate UB.” It is triggered by bad operations such as an out-of-bounds store (which is likely to corrupt RAM) or a division by zero (which may cause the processor to trap).
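For instance, in this minimal sketch (my example, not from the paper), executing the function triggers immediate UB:

```llvm
define i32 @div(i32 %x) {
  ; division by zero is immediate UB in LLVM IR: once this
  ; instruction executes, the program has no meaning at all
  %q = sdiv i32 %x, 0
  ret i32 %q
}
```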

Our problems start because LLVM also contains two kinds of “deferred UB” which don’t explode, but rather have a contained effect on the program. We need to reason about the meaning of these “slightly undefined” programs, which can be challenging. There have been long threads on the LLVM developers’ mailing list going back and forth about this.

The first kind of deferred UB in LLVM is the undef value that acts like an uninitialized register: an undef evaluates to an arbitrary value of its type. Undef is useful because sometimes we want to say that a value doesn’t matter, for example because we know a location is going to be over-written later. If we didn’t have something like undef, we’d be forced to initialize locations like this to specific values, which costs space and time. So undef is basically a note to the compiler that it can choose whatever value it likes. During code generation, undef usually gets turned into “whatever was already in the register.”
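As a minimal sketch (my example) of where a front-end might emit undef, consider compiling C code like “int v; if (cond) v = a; return v;”:

```llvm
define i32 @pick(i1 %cond, i32 %a) {
entry:
  br i1 %cond, label %then, label %done
then:
  br label %done
done:
  ; if control arrived directly from %entry, v was never assigned;
  ; rather than forcing a concrete initial value, the front-end
  ; hands the compiler an undef and lets it choose
  %v = phi i32 [ %a, %then ], [ undef, %entry ]
  ret i32 %v
}
```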

Unfortunately, the semantics of undef don’t justify all of the optimizations that we’d like to perform on LLVM code. For example, consider this LLVM function:
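A version of it, sketched here from the surrounding description (`nsw` marks the addition as having “no signed wrap”):

```llvm
define i1 @f(i32 %x) {
  %add = add nsw i32 %x, 1         ; x + 1, signed overflow disallowed
  %cmp = icmp sgt i32 %add, %x     ; signed comparison: x + 1 > x
  ret i1 %cmp
}
```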

This is equivalent to “return x+1 > x;” in C and we’d like to be able to optimize it to “return true;”. In both languages the undefinedness of signed overflow needs to be recognized to make the optimization go. Let’s try to do that using undef. In this case the semantics of “add nsw” are to return undef if signed overflow occurs and to return the mathematical answer otherwise. So this example has two cases:

The input is not INT_MAX, in which case the addition returns input + 1.

The input is INT_MAX, in which case the addition returns undef.

In case 1 the comparison returns true. Can we make the comparison true for case 2, giving us the overall result that we want? Recall that undef resolves as an arbitrary value of its type. The compiler is allowed to choose this value. Alas, there’s no value of type i32 that is larger than INT_MAX under a signed comparison. Thus, this optimization is not justified by the semantics of undef.

One choice we could make is to give up on performing this optimization (and others like it) at the LLVM level. The choice made by the LLVM developers, however, was to introduce a second, stronger, form of deferred UB called poison. Most instructions, given a poison value on any input, evaluate to poison. If poison propagates to a program’s output, the result is immediate UB. Returning to the “x + 1 > x” example above, making “add nsw INT_MAX, 1” evaluate to poison allows the desired optimization: the resulting poison value makes the icmp also return poison. To justify the desired optimization we can observe that returning 1 is a refinement of returning poison. Another way to say the same thing is that we’re always allowed to make code more defined than it was, though of course we’re never allowed to make it less defined.
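A sketch of this refinement in IR: since any concrete value refines poison, the “x + 1 > x” function can be replaced wholesale with one that is at least as defined on every input.

```llvm
; returns true on every input; on INT_MAX the original returned
; poison, and true is a legal refinement of poison
define i1 @f(i32 %x) {
  ret i1 true
}
```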

The most important optimizations enabled by deferred undefined behavior are those involving speculative execution such as hoisting loop-invariant code out of a loop. Since it is often difficult to prove that a loop executes at least once, loop-invariant code motion threatens to take a defined program where UB sits inside a loop that executes zero times and turn it into an undefined program. Deferred UB lets us go ahead and speculatively execute the code without triggering immediate UB. There’s no problem as long as the poisonous results don’t propagate somewhere that matters.
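As a sketch of why this is safe with poison (my example, not from the paper): hoisting an `nsw` add out of a loop that may run zero times only creates a poison value that nothing uses.

```llvm
define void @hoisted(i32 %n, i32 %x) {
entry:
  ; %inv was originally computed inside the loop body; hoisting it
  ; here may evaluate "add nsw" even when the loop runs zero times
  %inv = add nsw i32 %x, 1
  br label %header
header:
  %i = phi i32 [ 0, %entry ], [ %next, %body ]
  %cont = icmp slt i32 %i, %n
  br i1 %cont, label %body, label %exit
body:
  call void @use(i32 %inv)
  %next = add i32 %i, 1
  br label %header
exit:
  ; if %n <= 0, %inv may be poison, but it is never used: no UB
  ret void
}

declare void @use(i32)
```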

So far so good! Just to be clear: we can make the semantics of an IR anything we like. There will be no problem as long as:

The front-ends correctly refine C, C++, etc. into IR.

Every IR-level optimization implements a refinement.

The backends correctly refine IR into machine code.

The problem is that #2 is hard. Over the years some very subtle mistakes have crept into the LLVM optimizer, where different developers have made different assumptions about deferred UB, and these assumptions can work together to introduce bugs. Very few of these bugs can result in end-to-end miscompilation (where a well-formed source-level program is compiled to machine code that does the wrong thing) but even this can happen. We spent a lot of time trying to explain this clearly in the paper and I’m unlikely to do better here! The details are all in Section 3. The point is that so far these bugs have resisted fixing: nobody has come up with a way to make everything consistent without giving up optimizations that the LLVM community is unwilling to give up.

The next part of the paper (Sections 4, 5, 6) introduces and evaluates our proposed fix, which is to remove undef, leaving only poison. To get undef-like semantics we introduce a new freeze instruction to LLVM. Freezing a normal value is a nop and freezing a poison value evaluates to an arbitrary value of the type. Every use of a given freeze instruction will produce the same value, but different freezes may give different values. The key is to put freezes in the right places. My colleagues have implemented a fork of LLVM 4.0 that uses freeze; we found that it more or less doesn’t affect compile times or the quality of the generated code.
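A sketch of how freeze might look in IR (the instruction name and form follow the paper’s proposal; details could differ in any final version):

```llvm
define i32 @f(i32 %x) {
  %a = add nsw i32 %x, 1   ; may be poison if %x is INT_MAX
  %y = freeze i32 %a       ; if %a is poison, %y is some arbitrary
                           ; but fixed i32; otherwise %y equals %a
  %r = mul i32 %y, %y      ; both uses of %y observe the same value
  ret i32 %r
}
```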

We are in the process of trying to convince the LLVM community to adopt our proposed solution. The change is somewhat fundamental and so this is going to take some time. There are lots of details that need to be ironed out, and I think people are (rightfully) worried about subtle bugs being introduced during the transition. One secret weapon we have is Alive, where Nuno has implemented the new semantics in the newsema branch, and we can use this to test a large number of optimizations.

Finally, we noticed that there has been an interesting bit of convergent evolution in compiler IRs: basically all heavily optimizing AOT compilers (including GCC, MSVC, and Intel CC) have their own versions of deferred UB. The details differ from those described here, but the effect is the same: deferred UB gives the compiler freedom to perform useful transformations that would otherwise be illegal. The semantics of deferred UB in these compilers have not, as far as we know, been rigorously defined, and so it is possible that they have issues analogous to those described here.

The compiler is going to use the rule “x+1 > x” (when x is signed) to validate an optimization, namely replacing “x+1 < x” with FALSE. But the compiler does not actually guarantee that x+1 is greater than x, and the C standard doesn’t guarantee it either. In fact the C standard, by defining a maximum value for a signed int, clearly implies otherwise. It is hard to imagine how greater(x+1, x) and less(x, INT_MAX) can both be true theorems. Since assuming something false allows you to prove anything, the compiler can perform an unsound transformation. More practically, consider when we have code including, say, a data structure from an OS or library definition. If the upstream maintainers change member definitions from unsigned to signed, the program code is now silently asking for an unsound transformation that something like Coverity will flag, but the compiler does not. What it looks like from the outside is that considerable ingenuity and effort are being expended to defend an optimization that is of limited utility. What would be more useful is a warning of a type error, so that the programmer could choose to optimize by hand, which is simple, or to use type coercion to avoid the ambiguity. C is not a language that is designed for complex transformations, and I believe compiler developers would do better to consider programmers as colleagues who would benefit from better diagnostics and analysis.

Victor, I take your points about C, but the example I wanted to highlight here is at the LLVM level. Let’s take a language like Swift where integer overflow is defined to trap. There is no undefined overflow. When LLVM optimizes IR generated from Swift code, optimizations like this (perhaps not this exact one) are still useful and they cannot be observed from the level of the programming language.

The engineering tradeoffs being made inside the compiler enable some extra optimization power and have a cost in terms of complexity (UB is hard to think about) but this doesn’t necessarily have anything to do with UB in C, C++, or in any source language. Perhaps it was a mistake for me, in this post, to connect this LLVM example to a fragment of C code.

Ok, good point. But shouldn’t the compiler then generate LLVM code with an explicit overflow test – perhaps one that can be peephole-optimized away on an instruction set that traps overflows? The answer to this may be “read up on llvm design”, but I don’t get why an LLVM add instruction should have an undefined or nondeterministic operation. Perhaps the source language has a leaky specification, but why would you want to replicate that mistake inside the compiler?

Is there any design document or paper you can recommend on this topic?

Victor, I don’t know of a design document that covers this stuff, it sort of evolved over a period of time.

I think the design makes more sense if we change to a different example: shift past bitwidth. Assume a language that throws an exception when you do that. If the optimizer can prove that a particular shift doesn’t do the wrong thing, the checking code goes away and we’re left with a naked shift instruction in the IR. However, if this naked shift has no UB then it must commit to some particular semantics for the shift-past-bitwidth case. This could be ARM semantics, x86 semantics, or one of the others out there. The problem is that the information that the shift amount is in range gets lost, dropped by the compiler, and a naked shift with ARM semantics will require an extra instruction or two if we’re doing codegen for x86. A naked shift with UB for shift-past-bitwidth can be codegened to a single shift instruction on any architecture.
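Concretely, a sketch (my example, not from the thread): x86 masks a 32-bit shift count to its low 5 bits, while ARM uses the low 8 bits, so the two architectures disagree for counts of 32 or more. Leaving that case undefined lets each backend emit its single native shift:

```llvm
define i32 @shift(i32 %x, i32 %n) {
  ; the front-end proved %n < 32, so the range check was removed.
  ; because shl past the bitwidth is undefined, both x86 (which
  ; masks the count to 5 bits) and ARM (which uses the low 8 bits)
  ; can lower this to one shift instruction
  %r = shl i32 %x, %n
  ret i32 %r
}
```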

So UB in IR could be thought of as a code word for “we don’t care about this situation which has been proved to not occur.”

Yeah, it is hacky, and it makes compiler optimizations quite a bit harder to reason about, but as you say once you go the route of “heavily optimizing ahead of time compiler” I don’t know that there are good alternatives.

A program that traps is not meaningless. It’s something that programmers avoid most of the time, but trapping is fully defined behaviour; indeed, languages that avoid undefined behaviour often do that by defining that performing various operations produces an error, rather than undefining the behaviour. And if Swift is defined to trap on integer overflow, then optimizing x+1>x to true is miscompilation, not optimization, because it will not behave correctly when x=maxint.

And I certainly don’t want to have x+1>x “optimized” to true in C. I want to use it to test if x!=MAXLONG in shorter code than what gcc and clang generate when I write “x!=MAXLONG”. Unfortunately, this is not reliably possible with gcc, and therefore this “optimization” leads to bigger code.

Yes, the shift example is certainly a hard one for IR design (because there is no common hardware behaviour), but undefining the too-far shift is a bad solution, and has led to the undesirable result that clang “optimized” rol(x,0) into IIRC 0 for a common rol()-using-shifts idiom, while a straightforward translation to the native shift instruction of all architectures I looked at would have led to the intended behaviour.

So one solution would be to define the IR 32-bit shift operation to mean “x<<n for n>31 may result in either 0 or in x<<(n&31)” (i.e., what the various architectures do, and what leads to the intended behaviour), but not undefined behaviour; that operation would be appropriate for a C 32-bit << operation. In addition, if needed the IR could define a shift that produces 0 for the case above, and a shift that produces x<<(n&31), and one that traps. No need to have IR operations with undefined behaviour.

Hi Anton, responding to your points in order…
– Of course nobody said that a binary program that traps is meaningless.
– Indeed, optimizing x+1>x to true in Swift would be wrong, and the LLVM-based Swift toolchain will not do that (it’s easy to try this out).
– I don’t particularly want C/C++ compilers to optimize x+1>x to true either.