Academic Attention for Undefined Behavior

Undefined behaviors are like blind spots in a programming language: they are areas where the specification imposes no requirements. In other words, if you write code that executes an operation whose behavior is undefined, the language implementation can do anything it likes. In practice, a few specific undefined behaviors in C and C++ (mainly buffer overflows and integer overflows) have caused, and continue to cause, a large amount of economic damage in the form of exploitable vulnerabilities. On the other hand, undefined behaviors have advantages: they simplify compiler implementations and permit more efficient code to be generated.

Although the stakes are high, no solid understanding of the trade-offs exists because, for reasons I don’t understand, the academic programming languages community has basically ignored the issue. This may be starting to change: recently I’ve learned about two new papers about undefined behavior, one from UIUC and one from MIT (not yet publicly available, but hopefully soon); both will appear in the “Correctness” session at APSYS 2012 later this month. Just to be clear: plenty has been written about avoiding specific undefined behaviors, generally by enforcing memory safety or similar. But prior to these two papers, nothing had been written about undefined behavior in general.

UPDATE: I should have mentioned Michael Norrish’s work. Perhaps this paper is the best place to start. Michael’s thesis is also excellent.

Ignoring the obvious C++, what are the other languages that explicitly use undefined behavior? Even the languages that have no formal semantics and are defined by “whatever the official interpreter does” are arguably still well-defined by that (though not necessarily in a way that’s useful).

I’ve never seen other languages go beyond unspecified behavior: we must do something, the end result is well-defined, but we don’t document the algorithm by which we get there so we’re free to change it and you can’t complain if your program breaks because you were relying on the specifics. Undefined behavior of the “we can make demons fly out of your nose in the name of optimization” kind seems like a real C-ism to me.

I think academic studies of the pros and cons are long overdue. This won’t lead to UB being banished from the language, of course, but hopefully it will lead to a more rational approach to deciding when the trade-offs are worth it.

I don’t really understand section 2.6 of the UIUC paper. Are they making the (trivially correct) claim that it’s undecidable to detect UB statically? (It’s undecidable to detect *any* sort of behavior statically; that’s the Halting Problem.) Or are they making the interesting claim that it’s impossible to detect UB at runtime?

They say “this raises the question of whether one can *monitor* for undefined behaviors”, which makes it sound like they are making the interesting claim. But then in the flip() example, they say, “At iteration n of the loop above, r can be any one of 2^n values. Because undefinedness can depend on the particular value of a variable, all these possible states would need to be stored and checked at each step of computation” — i.e., “Because r can take on any one of 2^32 values, we need at least 2^32 bits of memory to evaluate the current state of this program”… which is obviously a false claim.

In fact I believe the author of this very blog has written a tool that monitors arbitrary C programs to detect *exactly* the kind of UB (signed left-shift) the detection of which they claim to be undecidable!

*And by “detect X” I mean “invariably decide whether X occurs or not”. Obviously you can statically detect that certain programs don’t use the << operator at all, and others initialize static variables to -1<<1, and so on.

Hi Arthur, the UIUC authors are talking about the difficulty of detecting undefined behavior dynamically. I also find this to be interesting! In fact, the other day I tried to write a blog post about it, but then I couldn’t quite make sense of their argument. I eventually gave up since I was not sure which part of the confusion came from the paper and which part came from my own head.

Arthur, the part about flip() in Chucky’s paper is also where I got stuck. My opinion is that this issue can only be cleared up by defining the problem more precisely. Maybe I’ll try to write that blog post again.

regehr, jeroen: Another common place for undefined behaviour is in concurrency. E.g. the JVM memory model makes a lot of language-level behaviour explicitly undefined when there’s no happens-before relationship between two interacting operations.

Jeroen: we give some examples of other languages in the (UIUC) paper John linked to. These include Scheme, Haskell, Perl, and Ruby.

Arthur: we are talking about dynamic checking for UB. We argue that it is equivalent to the halting problem, even dynamically. Given
int main(void) {
    guard();
    5 / 0;
}
The only way you can show this program has undefined behaviors is to show guard() terminates. This is obviously undecidable statically, but it is equally undecidable dynamically. Even knowing that you’ve successfully been executing for 30 days doesn’t help you decide whether guard() terminates or not.

The stuff about monitoring is a slightly different take on the idea. Since it’s clear that checking a program for undefinedness is undecidable both statically and dynamically, what about simply detecting undefined behavior as you run? It’s also clear you can do this for a single evaluation order (after all, this is what tools like John’s IOC do), but that only works for a single compiler and doesn’t account for things like C’s nondeterministic behavior. For example:
int choice;

int f(int x) {
    return choice = x;
}

int flip() {
    return (f(0) + f(1)), choice;
}
Calling flip() will return either 0 or 1 nondeterministically. The flip() function is not undefined, just nondeterministic. Because true detection of UB requires you to consider all the valid ways of evaluating, we explored this in the monitoring bit. This idea is sort of related to runtime predictive analysis.

In the paper we argue that to keep track of all the possible ways of evaluating a program, even while monitoring, is intractable for nondeterministic programs and again undecidable for multi-threaded programs. Again, of course you can keep track of a single evaluation, but that’s not all that interesting. I’m sorry we didn’t explain this better in the TR.

One more caveat: the flip() example given in the paper runs into UB itself quite quickly due to shifting problems etc., but we also explain that it’s for didactic purposes. We simply wanted to show that there might be 2^n possible behaviors for n times through the loop. A complete example would need to use allocated memory, etc., to avoid overflowing and so on. To be completely technical, since C has a fixed pointer size and all memory has addresses, C has a finite amount of memory and is not Turing complete. We figured this wasn’t really relevant.

@Chucky (13): By “undecidable” you also mean “uncomputable”, right? (Yeah, we can ignore the fact that C isn’t a Turing machine.)

That flip() is a bad example, because I happen to believe that it *does* exhibit undefined behavior. You modify “choice” twice without an intervening sequence point (e.g. in the case that the implementation spawns a new thread to compute f(0) and f(1) concurrently). That there happen to be two sequence points upon entry to f({0,1}) and two more sequence points upon return is completely irrelevant. But I recognize that experts disagree [with me ;)] on the subject.

For a little while this morning, I thought you might be saying that it’s difficult to detect the presence of UB in an expression like (a()+b()+c()+…), because you’d have to consider at least N! possible orders of evaluation. But then I remembered that you came up with the really neat idea of having your “C virtual machine” cache writes and flush the cache only at sequence points, which seems (handwave) to allow you to detect the multiple-writes-between-sequence-points kind of UB in basically linear time; it doesn’t *matter* what order the writes came in.

I believe Papaspyrou’s semantics discusses undefined behaviour correctly. I think he and I were the first to try to get it right in a formal setting (his thesis and mine came out at about the same time (late 90s)).

Arthur,
The definedness of f(0) + f(1) above comes from, I believe, “Every evaluation in the calling function (including other function calls) that is not otherwise specifically sequenced before or after the execution of the body of the called function is indeterminately sequenced with respect to the execution of the called function.” (n1570, 6.5.2.2:10) I think it used to be less clear in previous versions of C.

I do argue that a()+b()+c()+… means you have to try all the combinations of evaluation if you want to ensure that such a program is without undefined behaviors. Consider:
int ga = 0, …, gm = 0;

int a() {
    return ga = 1;
}
…
int m() {
    return gm = 1;
}

int n() {
    if (ga && … && gm) {
        5 / 0;
    }
    return 0;
}

int main() {
    a() + … + n();
}

Only those evaluations where n() is called last will exhibit undefined behavior. You really have to consider all possible evaluations to detect stuff since different paths can have different behaviors.

Michael,
Papaspyrou does cover different evaluation orders, and might even handle (x=5) + (x=6) kinds of stuff, but his semantics (and yours, and mine) miss all kinds of other undefined behavior. One nice one is (n1570, 6.5.16:3): “If the value being stored in an object is read from another object that overlaps in any way the storage of the first object, then the overlap shall be exact and the two objects shall have qualified or unqualified versions of a compatible type; otherwise, the behavior is undefined.”
I’m pretty sure we all three miss that one. There are hundreds more.

Not to reveal my age, but I cut my compiler teeth on Ada 83, and they really tackled, or at least addressed, a lot of important issues, including undefined behavior. As I recall, any program that relied on undefined behavior was erroneous. For example, the order of evaluation of subprogram arguments is undefined. If argument evaluations create side effects that in turn affect the program results, that program is incorrect: if you ported it to a compiler with a different evaluation order, it would compute a different result. The Ada 83 language manual codified the word “erroneous.” It’s definitely worth a read. I wish all languages had such a good language definition manual.