Stochastic Superoptimization

“Stochastic Superoptimization” is a fancy way to say “randomized search for fast machine code.” It is also the title of a nice paper that was presented recently at ASPLOS. Before getting into the details, let’s look at some background. At first glance the term “superoptimization” sounds like nonsense because the optimum point is already the best one. Compiler optimizers, however, don’t tend to create optimal code; rather, they make code better than it was. A superoptimizer, then, is either an optimizer that may be better than regular optimizers or else a code generator with optimality guarantees.

How can a superoptimizer (or a skilled human, for that matter) create assembly code that runs several times faster than the output of a state-of-the-art optimizing compiler? This is not accomplished by beating the compiler at its own game, which is applying a long sequence of simple transformations to the code, each of which improves it a little. Rather, a superoptimizer takes a specification of the desired functionality and then uses some kind of systematic search to find a good assembly-level implementation of that functionality. Importantly, the output of a superoptimizer may implement a different algorithm than the one used in the specification. (Of course, a regular optimizer could also change the algorithm, but it would tend to do so via canned special cases, not through systematic search.)

Over the last 25 years a thin stream of research papers on superoptimization has appeared:

The original superoptimizer paper from 1987 describes a brute-force search for short sequences of instructions. Since candidate solutions were verified by testing, this superoptimizer was perfectly capable of emitting incorrect code. Additionally, only very short instruction sequences could be generated.
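To make the flavor of that brute-force approach concrete, here is a minimal sketch in Python. The three-instruction accumulator ISA is invented for illustration (it is not the 1987 paper’s target machine): enumerate every instruction sequence up to a length bound and return the first one that passes all the tests. Since verification is by testing, correctness only holds with respect to the test set, which is exactly the weakness noted above.

```python
# Toy sketch of 1987-style brute-force superoptimization: enumerate all
# instruction sequences up to a length bound over a tiny invented ISA and
# return the first one whose behavior matches the specification on every
# test input.
from itertools import product

# Hypothetical single-accumulator ISA (illustrative, not a real machine).
ISA = {
    "neg":  lambda a: -a & 0xFF,        # two's-complement negate
    "dec":  lambda a: (a - 1) & 0xFF,   # decrement
    "shr7": lambda a: a >> 7,           # logical shift right by 7
}

def run(program, a):
    for op in program:
        a = ISA[op](a)
    return a

def spec(x):
    # Desired function: 1 if x, viewed as a signed byte, is negative.
    return 1 if x >= 128 else 0

def brute_force(max_len, tests):
    # Shorter sequences are tried first, so the result is minimal-length.
    for length in range(1, max_len + 1):
        for program in product(ISA, repeat=length):
            if all(run(program, t) == spec(t) for t in tests):
                return program
    return None

print(brute_force(3, tests=range(256)))  # finds a one-instruction answer
```

Even this toy version shows why only very short sequences are feasible: the search space grows exponentially in the sequence length.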

The Denali superoptimizer works on an entirely different principle: the specification of the desired functionality, and the specification of the available instructions, are encoded as a satisfiability problem. Frustratingly, both the original paper and the followup paper on Denali-2 are short on evidence that this approach actually works. As idea papers, these are great. As practical compiler papers, I think they need to be considered to be negative results.

A more practical superoptimizer was described in 2006. This one is modeled after the “peephole” passes found in all optimizing compilers, which perform low-level (and often architecture-specific) rewrites to improve the quality of short sequences of instructions or IR nodes. Although using exhaustive search makes it expensive to discover individual rewrites, the cost is amortized by storing rewrites in a fast lookup structure. Another cool feature of this work was to use testing to rapidly eliminate unsuitable code sequences, but then to verify the correctness of candidate solutions using a SAT solver. Verifying equivalence of already-discovered code is much more practical than asking the SAT solver to find a good code sequence on its own. The punchline of this paper was that the peephole superoptimizer could sometimes take unoptimized code emitted by GCC and turn it into code of about -O2 quality. Also, without any special help, it found interesting ways to use SSE instructions.
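The amortization idea can be sketched in a few lines: pay the search-and-verify cost once per rewrite, then store verified rewrites in a table keyed by the original instruction window so that applying them later is a cheap lookup. The mnemonics and rewrites below are invented for illustration; a real peephole superoptimizer canonicalizes registers and constants before lookup.

```python
# Toy sketch of the database side of a peephole superoptimizer: discovering
# a rewrite is expensive (exhaustive search plus SAT-based verification),
# but once verified it is stored in a table keyed by the instruction
# window, so applying it later is O(1).

REWRITES = {
    ("mov r1, 0",):        ("xor r1, r1",),  # shorter encoding of zeroing
    ("mul r1, 2",):        ("shl r1, 1",),   # strength reduction
    ("push r1", "pop r1"): (),               # cancelling pair disappears
}

def peephole(code, max_window=2):
    out, i = [], 0
    while i < len(code):
        # Try the longest window first so larger rewrites take precedence.
        for w in range(max_window, 0, -1):
            window = tuple(code[i:i + w])
            if window in REWRITES:
                out.extend(REWRITES[window])
                i += len(window)
                break
        else:
            out.append(code[i])  # no rewrite applies; keep the instruction
            i += 1
    return out

print(peephole(["mov r1, 0", "push r1", "pop r1", "mul r1, 2"]))
# -> ['xor r1, r1', 'shl r1, 1']
```

The expensive part of the 2006 work lives outside this loop: populating `REWRITES` with sequences that testing suggests and the SAT solver then proves equivalent.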

Ok, enough background. There’s more, but this is the core.

The new paper on stochastic superoptimization, which is from the same research group that produced the peephole superoptimizer, throws out the database of stored optimizations. However, that aspect could easily be added back in. The idea behind STOKE (the stochastic superoptimizer tool)—that it’s better to sparsely search a larger region of the search space than to densely search a smaller region—is the same one that has been responsible for a revolution in computer Go over the last 20 years. The punchline of the new paper is that STOKE is sometimes able to discover code sequences as good as, or a bit better than, the best known ones generated by human assembly language programmers. Failing that, it produces code as good as GCC or Intel CC at a high optimization level. These results were obtained on a collection of small bit manipulation functions. STOKE optimizes x86-64 code, and it is only able to deal with loop-free instruction sequences.

At a high level, using randomized search to find good code sequences is simple. However, the devil is in the details and I’d imagine that getting interesting results out of this required a lot of elbow grease and good taste. Solving this kind of optimization problem requires (1) choosing a mutation operator that isn’t too stupid, (2) choosing a fitness function that is pretty good, and (3) making the main search loop fast. Other aspects of randomized search, such as avoiding local maxima, can be handled in fairly standard ways.

STOKE’s mutation operator is the obvious one: it randomly adds, removes, modifies, or reorders instructions. Part of the fitness function is also obvious: it is based on an estimate of performance supplied by the x86-64 interpreter. The less obvious part of STOKE’s fitness function is that it tolerates incorrect output: a piece of wrong code is penalized by the number of bits by which its output differs from the expected output, and also for erroneous conditions such as segfaults. To make each iteration fast, STOKE does not try to ensure that the code sequence is totally correct, but rather (just like the original superoptimizer) runs some test cases through it. However, unlike the original superoptimizer, STOKE will never hand the user a piece of wrong code because a symbolic verification method is applied at the end to ensure that the optimized and original code are equivalent. This equivalence check is made much easier by STOKE’s insistence on loop-freedom. STOKE’s speed is aided by the absence of loops and also by a highly parallelized implementation of the randomized search.
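A minimal sketch of this kind of search loop, assuming a toy single-accumulator ISA with made-up latencies (real STOKE mutates x86-64 and uses a more refined cost and acceptance scheme): the cost charges a large penalty per wrong output bit across the tests, plus the estimated latency, and moves are accepted MCMC-style.

```python
# Toy sketch of STOKE-style stochastic search. Assumptions: an invented
# accumulator ISA with made-up latencies; a cost that charges 100 per
# wrong output bit over the test inputs, plus estimated latency; and a
# Metropolis-style acceptance rule. Real STOKE works on x86-64.
import random

random.seed(0)

ISA = {  # opcode -> (semantics on the accumulator, estimated latency)
    "inc":  (lambda a: (a + 1) & 0xFF, 1),
    "dec":  (lambda a: (a - 1) & 0xFF, 1),
    "dbl":  (lambda a: (a * 2) & 0xFF, 3),   # a slow way to double
    "shl1": (lambda a: (a << 1) & 0xFF, 1),  # a fast way to double
}

def run(prog, a):
    for op in prog:
        a = ISA[op][0](a)
    return a

def cost(prog, tests, target):
    # Correctness term: total Hamming distance between actual and expected
    # outputs. Performance term: summed instruction latencies.
    wrong_bits = sum(bin(run(prog, t) ^ target(t)).count("1") for t in tests)
    return wrong_bits * 100 + sum(ISA[op][1] for op in prog)

def mutate(prog):
    prog = list(prog)
    move = random.choice(["add", "remove", "modify"])
    if move == "add" or not prog:
        prog.insert(random.randrange(len(prog) + 1), random.choice(list(ISA)))
    elif move == "remove":
        prog.pop(random.randrange(len(prog)))
    else:
        prog[random.randrange(len(prog))] = random.choice(list(ISA))
    return prog

def search(target, tests, start, iters=20000, beta=1.0):
    cur, cur_cost = start, cost(start, tests, target)
    best, best_cost = cur, cur_cost
    for _ in range(iters):
        cand = mutate(cur)
        c = cost(cand, tests, target)
        # Metropolis rule: always accept improvements; accept a worse
        # candidate with probability shrinking in how much worse it is.
        if c <= cur_cost or random.random() < 2.0 ** (-beta * (c - cur_cost)):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

# Target: double the input. Seeded with a correct but slow implementation.
best, best_cost = search(lambda x: (2 * x) & 0xFF, range(16), ["dbl"])
print(best, best_cost)
```

In the real tool the cost is evaluated by running actual test cases, and whatever survives the search must still pass the symbolic equivalence check before it is handed to the user.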

The STOKE authors ran it in two modes. First, it was used to improve the quality of code generated by Clang at -O0. This worked, but the authors found that it was unable to find better, algorithmically distinct versions of the code that were sometimes known to exist. To fix this, they additionally seeded STOKE with completely random code, which they tried to evolve into correct code in the absence of a performance constraint. Once the code became correct, it was optimized as before. The fact that this worked—that is, it discovered faster, algorithmically distinct code—for 7 out of 28 benchmarks, is cool and surprising and I’d imagine this is why the paper got accepted to ASPLOS.

Are superoptimizers ever going to be a practical technology? Perhaps. One can imagine them being used in a few ways. First, a system such as STOKE (with modest improvements such as handling loops) could be used by library developers to avoid writing hand-tuned assembly for math kernels, media CODECs, compression algorithms, and other small, performance-critical codes that are amenable to tricky optimizations. The advantage is that it is easy to tell STOKE about a new instruction or a new chip (a new target chip only requires a new cost function; a new target architecture is much more difficult). This goal could be achieved in a short time frame, though I suspect that developers enjoy writing assembly language enough that it may be difficult to persuade them to give it up.

The second, and more radical way that a superoptimizer could be used is to reduce the size of a production-grade optimizing compiler by removing most of its optimization passes. What we would end up with is a very simple code generator (TCC would be a good starting point) and a database-backed superoptimizer. Actually, I’m guessing that two superoptimizers would be needed to achieve this goal: one for the high-level IR and one for the assembly language. So perhaps we would use Clang to create LLVM code, superoptimize that code, use an LLVM backend to create assembly, and then finally superoptimize that assembly. If the optimization database lived in the cloud, it seems likely that a cluster of modest size could handle the problem of superoptimizing the hottest collections of IR and assembly code that everyone submits. Drawbacks of this approach include:

Novel code will either be optimized very poorly or very slowly.

Compilation predictability suffers since the cloud-based peephole database will constantly be improving.

You can’t compile without a network connection.

It remains to be seen whether we want to do this sort of thing in practice. But if I were a compiler developer, I’d jump at a chance to substantially reduce the complexity, maintenance requirements, and porting costs of my toolchain.

Update from 3/9/13: A strength of this kind of superoptimizer that I forgot to mention is that it can be applied to code compiled from any language, not just C/C++. So let’s say that I’ve written a stupid little x64 generator for my domain-specific language. Now if I have a database-backed superoptimizer I should be able to get roughly -O2 quality binaries out of it. Whether this kind of approach is better or worse than just compiling to LLVM in the first place, I don’t know.

There is a peculiar back-and-forth “sloshing” in hardware/software codesign. For example, the first machine languages were made comprehensible because people intended to code in them; then compilers were written targeting those existing machine languages; and then some new machine languages were designed around known-to-exist compiler technology.

If you step back from the tradition and start instead from the premise that humans are NOT going to be writing in machine language, then you might instead target much simpler, less comprehensible hardware implementations – cellular automata, soliton collisions, et cetera.

Can we use stochastic superoptimization to attempt to compile a moderately-humane language (e.g. C) to these exotic, “probably-Turing-complete” architectures without human intervention? More generally, can we use these techniques to automatically sift through vast families of probably-Turing-complete architectures?

Hi Terry, I think you would *not* want to apply this tool to operating system code. It does not look like the authors have taught it how to preserve interesting side effects such as accesses to shared memory, device registers, etc. Thus, the tool as presented is going to break your OS. I don’t think this is very hard to fix, though.

Frank, a bytecode-to-bytecode superoptimizer hasn’t yet been implemented that I know of, but it probably would not be extremely difficult. However, I think this technology would probably be better used inside a JIT compiler.

Hi Johnnicholas, the direction you suggest has been explored somewhat. One of the original justifications for RISC, I believe, was to create chips that are a better match for smart compilers. VLIW is a more aggressive move in the same direction and it has only been moderately successful, I think partially due to the compiler side being more difficult to implement than people anticipated.

A few other issues:
– “Super closed source” projects (think NSA or Netflix) would see contributing to the cloud DB as an information leak.
– The evolving database would make repeatable compiles much harder.
– Non-repeatable compiles may result in Heisenbugs if a bug elsewhere in the program violates the assumptions of the symbolic verification method (e.g. some other bit of code has an off-by-one error that reads into the memory manipulated by the optimized code).

@Johnicholas: IIRC MIPS was very much designed to be programmed by a compiler, at the expense of hand-written ASM.

Another issue with the cloud optimization server:
clouds cost money to maintain, so you’d presumably need to pay (per compile? per KLOC? per year?) to use this system. People who currently sell compilers would probably like this; people who currently give away compilers free-as-in-beer wouldn’t.

Jonathan, I was thinking that the optimization database would be maintained, for example, by the Ubuntu organization. Ubuntu users could opt into a statistical profiling program that used the timer interrupt to see where the program counter spends its time, and then Ubuntu could add value by superoptimizing the code in their distro that is, on average, hottest across all users.

bcs, I think such a thing could be done in the case where optimizations are expensive to generate but cheap to verify. Of course, that is the expected case!

I suspect that there’s a reason that there’s no database there: a stochastic search evaluated by running tests is inherently optimizing a combination of the program code and the data it’s being run on. This is perfectly fine if you know what kind of data you’re running it on, but it also means that you can’t (in principle) just look up the same initial code sequence in a precompiled database and expect the best stored result to be the best in your case. As an example WAY beyond current superoptimizer abilities, suppose you have a sorting routine that you test on data that happens to be nearly sorted: then it may well turn out that bubble sort works best in that case (simplicity, cache, etc), so that’s what gets output. If I look up that result and run it on my “completely disarranged” data, it may perform worse than even simple optimizations.

Now it may be that for completely straight-line code the processor is so “regular” in even the deep implementation details that this doesn’t matter, but it’s important to bear in mind you’re always optimizing “the code on the data you evaluate it on”.

davetweed, good point. I guess one hopes that for the small code sequences that are within reach of today’s superoptimizers, randomized or systematic input generation isn’t so difficult. As you say, there may be no good way to deal with a sorting problem.

What might end up happening is that the database would accumulate many solutions to each problem, and individual superoptimizer runs would seed their process from all of them. The bad-in-my-case solution would get thrown out almost instantly.

John, do you know whether the STOKE source code is available somewhere? I would be curious to see how it works on ocamlopt’s result¹, but haven’t found a link to an implementation on the authors’ webpages.

¹: probably not that well on symbolic manipulation code, at least, given that it has a lot of memory accesses and relatively few tight computation loops; that would be similar to the linked-list traversal example of the article. But it may still be quite interesting on numerical programs, and help take advantage of advanced instruction sets.

If this could radically simplify the compiler back end, it might be a way to make better optimization available for certified code. Today they fear compiler optimization due to perceived risk and lack of traceability. A database-backed transformation in a simpler system could be appealing.