I was futzing around with some C code a few days ago and noticed that
executables generated by Clang would sometimes have an
extra 4 bytes on the stack. This was just for the main function. We can
verify this is Compiler Explorer. Try switching
to GCC and this doesn’t happen. This was interesting, so
I spent a few hours over the holiday digging into why and how this happens. For
the remainder of this post we will be using these 3 examples, which you can play
with in Compiler Explorer.

The terminology will become clear as we go along. This is my first time digging
into LLVM’s sources, and my first time reading about SSA form. I’m going to do
a lot of hand-waving in pursuit of the goal. To follow along in LLVM IR, you
can add -emit-llvm to the “Compiler options…” text field.

So where do we start from? Some preliminary Google searching only revealed this
mailing list
post.1
Of course, when I started down this rabbit-hole, I only had the dominating
multiple store example, since that is the canonical way to write main() in
modern C. So the mailing list reply didn’t make complete sense.

Yes. We allocate an implicit local variable to hold the return value; return
statements then just initialize the return slot and jump to the epilogue,
where the slot is loaded and returned.

Clearly the slot was being allocated (dword ptr [rbp - 4] is the slot,
dword ptr [rbp - 8] is int s), but the function epilogue was never loading
it into eax where the return value is expected.

If we go to dominating single store next, the slot is gone!

Finally, in the non-dominating store the slot is present and it is actually
used in the epilogue, with both return branches setting it to the requisite
value. This is fascinating!

Reserving some space

Why does the first example have stack space, but not use it, while the second
doesn’t use it at all, and the third uses it as we expect?

Let’s track down the place the slot is actually reserved. This is done while emitting the function
prolog,
If the function declaration’s HasImplicitReturnZero attribute is set,
a store is generated that puts Zero in ReturnValue.
This is responsible for the LLVM IR:

store i32 0, i32* %1, align 4

or the corresponding assembly

mov dword ptr [rbp - 4], 0

Enforcing the standards

The HasImplicitReturnZero attribute is set in Clang’s Sema component. It
seems to be in charge of enforcing various C/C++ spec semantics. One of the
rules is that the main function must have a return value of int, and if the
code falls through to the end, this return value must be zero. This is handled
here.

Pruning in the epilogue

The prologue generator unconditionally adds the return value slot to main().
Where does it go? Well, the epilogue generation performs some analysis and
removes it. This stage explains all the differences we are seeing in our three
samples. Essentially, Clang is doing some of its own optimizations before LLVM
kicks in. I’m not sure why. Here is code for the cases we care
about:

// If there is a dominating store to ReturnValue, we can elide
// the load, zap the store, and usually zap the alloca.
if (llvm::StoreInst *SI =
findDominatingStoreToReturnValue(*this)) {
// Reuse the debug location from the store unless there is
// cleanup code to be emitted between the store and return
// instruction.
if (EmitRetDbgLoc && !AutoreleaseResult)
RetDbgLoc = SI->getDebugLoc();
// Get the stored value and nuke the now-dead store.
RV = SI->getValueOperand();
SI->eraseFromParent();
// If that was the only use of the return value, nuke it as well now.
auto returnValueInst = ReturnValue.getPointer();
if (returnValueInst->use_empty()) {
if (auto alloca = dyn_cast<llvm::AllocaInst>(returnValueInst)) {
alloca->eraseFromParent();
ReturnValue = Address::invalid();
}
}
// Otherwise, we have to do a simple load.
} else {
RV = Builder.CreateLoad(ReturnValue);
}

Hmm… lots of zapping here. Let’s think through how this applies to our
3 examples. First, we attempt to find a “dominating store” for the return
value. The implementation here is very basic, and I go into detail further in
this post. We want to find a store that will definitely be executed on every
run of the function, and will be the last store to modify the return value slot
in this function. Clang won’t let us see the intermediate steps, but here is
how the LLVM IR would look2 before epilogue generation happens.

Let’s deal with example 3 which goes to the simpler else branch. Clang can’t
find a dominating store instruction because both branches modify the return
Clang slot. Instead we load the return value slot into register %8 and return
the register.

; epilog
%8 = load i32, i32* %1, align 4,
ret i32 %8, !dbg !21

This translates to assembly as

mov eax, dword ptr [rbp - 4]

In examples 1 and 2, because there are no branches, we find a dominating store
instruction; the one that is sequentially latest. We retrieve the value that
store was attempting to write to the slot, and set that as the return value
instead. In assembly, this will map to loading straight into eax. Then, in
both examples, the dominating store is erased from the IR. In example 1 this is
the store for the explicit return. In example 2 this is the implicit store from
the prolog.

The ReturnValue users are now down to zero in example 2. So we find the initial
alloca instruction and nuke that too. That is why our IR looks the way it
does. In example 1, the leftover stack allocation and store is completely
useless (since we already replaced the real return value with the dominating
store), but since the use count was non-zero, Clang didn’t nuke the allocation,
and LLVM faithfully preserved it.

That explains why we end up with these 3 variations, but it seems very
inelegant with the wasted stack space. Why not track the return statements and
directly use the one that matters? What is this “dominating store”? What is
this talk of phi in the second part of the cfe-dev reply?

We don’t use a phi because the control flow for getting to the epilogue is not necessarily as simple as a simple branch, due to cleanups in local scopes (like C++ destructors).

SSA form and Phi values

Static Single Assignment
form is a style
of intermediate representation where every variable can only be assigned once.
So compilers create new registers/variables for every time they want to update
an existing one. In sequential execution it is easy to see which one of these
is the latest value. Of course, as the Wikipedia example shows, this breaks
down at conditionals. Phi is a special instruction equivalent to delayed
evaluation. It captures both branches, and the generated assembly simply
generates labels and branches to capture that in reality.

This seems like what we want. Let the return value be captured by introducing
additional registers, with phi values capturing branches, and let the register
allocator deal with bringing the register count back to reality!

Making trade-offs

But we don’t have any phis! Apparently Clang front-ends should just
use stack memory (obtained by
using the alloca instruction) and let LLVM’s mem2reg pass promote stack
values to registers and insert phi nodes as necessary. So Clang avoids using
phi nodes (although they do get generated for
boolean conditionals). The mailing list reply is clarifying that just using
a stack memory location particularly simplifies epilogues, where the C++
standard may specify cleanup that could complicate things. I’m not sure exactly
how this would happen. If you do, get in touch!
Of course, in the default non-optimized builds, LLVM doesn’t try to eliminate
this allocation. We are stuck with the return value slot.

To minimize the “damage”, Clang tries to clean things up in the easier cases.
I won’t go into the details of LLVM’s blocks and SSA implementation; that is
coveredreasonably
in other places.
findDominatingStoreToReturnValue
first checks if the return value was referenced by multiple instructions. If it
was, Clang picks the last instruction in the current basic block (insertion
point)
and picks that if and only if it is a store. Otherwise, there is only one
instruction referencing the return value. If it is a store and it is possible
to reach unconditionally and sequentially from that point to the current
insertion point (which is the end of the function), then that store is
definitely the last access and is the dominating store. Clang is being quick
and dirty here. Instead of trying to do a full Phi reduction in cases where
there may be branches, it just gives up and eats the memory cost of the stack
store. This falls out from the semantics of getSinglePredecessor() which
will return NULL if more than one blocks lead to the current block.

All of this makes sense. Clang is not a full compiler and isn’t putting in
effort. Can LLVM do better? Try adding -O1 to the compiler options for the
first example. LLVM successfully eliminates the extra memory!3 😀

I love these kind of small technical details, because they are engineering
solutions that combine parts of theoretical CS, with engineering
restrictions like x86 calling conventions and C standards. This slot is out
there in thousands of executables produced by Clang, staying in the shadows
until duty calls!

cfe presumably stands for Clang Front-end. This is because Clang is a “front-end” to the LLVM compiler toolchain. It is capable of converting C/C++/Objective-C to LLVM Intermediate Representation, after which LLVM’s optimization passes and platform specific backends can translate this to x86 executables.
[return]