Missed optimizations in C compilers

This is a list of some missed optimizations in C compilers. To my knowledge,
these have not been reported by others (but it is difficult to search for
such things, and people may not have bothered to write these up before). I
have reported some of these in bug trackers and/or developed more or less
mature patches for some; see links below.

These are not correctness bugs, only cases where a compiler misses an
opportunity to generate simpler or otherwise (seemingly) more efficient
code. None of these are likely to affect the actual, measurable performance
of real applications. Also, I have not measured most of these: The seemingly
inefficient code may somehow turn out to be faster on real hardware.

I have tested GCC, Clang, and CompCert, mostly targeting ARM at -O3 and
with -fomit-frame-pointer. The exact versions tested are specified below.
For each example, at least one of these three compilers (typically GCC or
Clang) produces the "right" code.

This list was put together by Gergö Barany <gergo.barany@inria.fr>. I'm
interested in feedback.

I have described how I found these missed optimizations in a paper
(PDF) that I presented at CC'18, where
it won the best paper award. The software described in the paper is not
released yet. I also made a quick-and-dirty poster
(PDF) and some more informative
presentation slides (PDF).

I will omit the full code generated by Clang, but you can see it at
https://godbolt.org/g/8ZuYbz (generated by a slightly older version, but the
code is the same).

The interesting issue is the treatment of register d6, which is spilled in
the middle of the function and used for temporaries in some calculations,
before its value is reloaded into d1 towards the end of the function:

It is reasonable to spill d6 and to use it for temporaries during the
computation if no other registers are free. But that is not the case here:
The registers d13, d14, and d15 have short previous uses but are then
unused for the rest of the function. The live ranges allocated to d6 in
the above code could have been allocated to these registers instead, saving
the spill of d6 and its reload into d1.

In this particular case this spill causes an overhead of the store, the
load, and one instruction each to allocate and free the stack frame. GCC
generates fewer instructions overall, with lower register use and without
this inline spill.

Reported at https://bugs.llvm.org/show_bug.cgi?id=37073. It turns out that
this is due to bad scheduling decisions before register allocation that are
not visible in the example above because they are undone by smart scheduling
after register allocation. It can be fixed by selecting a different scheduler
to run before allocation instead of the (bad) default choice.

GCC's code starts with an add, a multiply, and a subtract (interspersed with
constant loads I omit here):

    vadd.f64 d8, d0, d2
    vmul.f64 d9, d8, d6
    vsub.f64 d10, d3, d3

The subtraction is independent of the other instructions; it corresponds to
the expression (p4 - p4) in the source code.

Clang starts with the same operations, but for unclear reasons it inserts
four (!) register copy instructions, at least two of which have the effect
of "setting up" the register d4 for use by the subtraction, although it
could just use the values from d3.

GCC

Two dead stores for type conversion

GCC predicates the entire function, nicely moving the conversion of p2 to
unsigned into a branch. But for some reason, in both branches it spills
the value for a to the stack, without ever reloading it:

Missed bitwise tricks

The least significant byte of 415615 is 0x7f, so the lowest 7 bits of a
are the same as the lowest 7 bits of p2, hence the lowest 8 bits of v
are the same as the lowest 8 bits of p2 << 1, and that is all that is
needed for the result. GCC generates all the operations present at the
source level, while Clang simplifies nicely:

    lsl r0, r1, #1
    uxtb r0, r0
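
The bit-level argument above can be checked with a small sketch. The function names are hypothetical, and the body is only an assumed reading of the description (a is p2 masked with 415615, v is a shifted left by one, and the result is narrowed to 8 bits); unsigned arithmetic is used to keep the shift well defined.

```c
/* Hypothetical sketch: names and exact shape are assumptions. */

/* All operations as described at the source level. */
unsigned char shift_masked(unsigned p2) {
    unsigned a = p2 & 415615u;  /* 415615 == 0x6577f, low byte is 0x7f */
    unsigned v = a << 1;
    return (unsigned char) v;   /* only the low 8 bits are kept */
}

/* The simplified form: shift, then keep the low 8 bits. */
unsigned char shift_simplified(unsigned p2) {
    return (unsigned char) (p2 << 1);
}
```

The two functions agree because bit 7 of a is always zero, so masking with 0x7f before the shift and truncating to 8 bits after it select the same bits of p2.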

Unnecessary spilling due to badly scheduled move-immediates

This is just a much smaller example of an issue with the same title below.

GCC

Useless initialization of struct passed by value

The struct is passed in registers, and the function's result is already in
r0, which is also the return register. The function could return
immediately, but GCC first stores all the struct fields to the stack and
reloads the first field:
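
A minimal sketch of source code that fits this description, assuming a struct of four int fields (small enough to travel in r0 to r3 under the ARM procedure call standard). The type and function names are hypothetical.

```c
/* Hypothetical sketch: struct layout and names are assumptions. */
struct S {
    int f0;
    int f1;
    int f2;
    int f3;
};

/* The first field arrives in r0, which is also the return register,
 * so ideally nothing needs to be stored or reloaded. */
int first_field(struct S p) {
    return p.f0;
}
```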

float to char type conversion goes through memory

i.e., the result of the conversion in s15 is stored to the stack and then
reloaded into r0 instead of just copying it between the registers (and
possibly truncating to char's range). Same for double instead of
float, but not for short instead of char.
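
For reference, the conversions being compared can be sketched as follows. The function names are hypothetical. Only in-range inputs are meaningful here, since converting a floating-point value whose truncated integer part does not fit the target type is undefined in C.

```c
/* Hypothetical sketch: names are assumptions. */

/* Reported above as going through the stack on ARM with GCC. */
char float_to_char(float p1) {
    return (char) p1;
}

/* Reported above as behaving the same way as the float case. */
char double_to_char(double p1) {
    return (char) p1;
}

/* Reported above as not affected. */
short float_to_short(float p1) {
    return (short) p1;
}
```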

GCC converts a to double and back as above, but the result must be the
same as simply multiplying by the integer 10. Clang realizes this and
generates an integer multiply, removing all floating-point operations.
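
A minimal sketch of the equivalence, assuming 32-bit int and IEEE double (which represents every such int, and its product with 10, exactly). The function names are hypothetical, and the comparison only makes sense for inputs where the product fits in an int.

```c
/* Hypothetical sketch: names are assumptions. */

/* Multiply by going through double and back. */
int times10_via_double(int a) {
    double d = a;            /* exact for 32-bit int */
    return (int) (d * 10.0); /* exact product, then converted back */
}

/* The plain integer multiply. */
int times10_int(int a) {
    return a * 10;
}
```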

If you are worried about the multiplication overflowing, you might prefer
this version:

Missed simplification of floating-point addition

It is not correct in general to replace a + 0 by a in floating-point
arithmetic due to NaNs and negative zeros and whatnot, but here a is
initialized from an int's value, so none of these cases can happen. GCC
generates all the floating-point operations, while Clang just compares p1
to zero.
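
The following sketch illustrates both halves of the argument, assuming IEEE arithmetic in the default rounding mode. The function names are hypothetical, and the second function is only an assumed shape for the kind of code involved.

```c
#include <math.h>

/* Why a + 0 is not a in general: adding +0.0 to -0.0 yields +0.0,
 * so the sign bit is lost. Returns 1 if that is observed. */
int sign_lost_by_adding_zero(void) {
    volatile double neg_zero = -0.0; /* volatile: force a runtime add */
    return signbit(neg_zero) && !signbit(neg_zero + 0.0);
}

/* Hypothetical shape: a comes from an int, so it is never NaN or -0.0. */
int int_plus_zero_is_nonzero(int p1) {
    double a = p1;
    return (a + 0) != 0;
}

/* What the floating-point version reduces to. */
int int_is_nonzero(int p1) {
    return p1 != 0;
}
```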

Heroic loop optimization instead of removing the loop

This function returns 0 if N is 0 or negative; otherwise, it returns
p1[N-1]. On x86-64, GCC generates a complex unrolled loop (and a simpler
loop on ARM). Clang removes the loop and replaces it by a simple branch.
GCC's code is too long and boring to show here, but it's similar enough to
https://godbolt.org/g/RYwgq4.
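
A loop with the behavior described above, next to the branch it can be replaced by, might look like the following sketch. The function names and the exact loop body are assumptions; only the global N and the parameter p1 are taken from the description.

```c
int N; /* loop bound, read from a global */

/* Hypothetical loop: each iteration overwrites a, so only the last
 * element read matters. */
int last_via_loop(const int *p1) {
    int a = 0;
    for (int i = 0; i < N; i++) {
        a = p1[i];
    }
    return a;
}

/* The same result with a single branch and no loop. */
int last_via_branch(const int *p1) {
    return N > 0 ? p1[N - 1] : 0;
}
```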

Note: An earlier version of this document claimed that GCC also removed the
loop on ARM. It was pointed out to me that this claim was false: GCC does
generate a loop on ARM too. My (stupid) mistake; I had misread the assembly
code.

The register d10 must be spilled to make room for the constants 5 and 9
being loaded into d11 and d13, respectively. But these loads are much
too early: Their values are only needed after d10 is restored. These move
instructions (at least one of them) should be closer to their uses, freeing
up registers and avoiding the spill.

Clang

Incomplete range analysis for certain types

In both cases, the divisor is outside the possible range of values for p,
so the function's result is always 0. Clang doesn't realize this and
generates code to compute the result. (It does optimize corresponding cases
for unsigned short and char.)
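
As an illustration of this kind of division, consider the sketch below. The function names, the parameter types, and the divisors are assumptions chosen for the example, and it assumes a 16-bit short and an 8-bit signed char.

```c
/* Hypothetical sketch: names, types, and divisors are assumptions. */

/* |p| is at most 32768, so dividing by 65536 truncates to 0. */
int div_short(short p) {
    return p / 65536;
}

/* |p| is at most 128, so dividing by 256 truncates to 0. */
int div_schar(signed char p) {
    return p / 256;
}
```

Because C integer division truncates toward zero, any quotient whose magnitude is below 1 becomes 0, for negative inputs as well.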

More incomplete range analysis

The condition is always true, and the function always returns 7. Clang
generates code to evaluate the condition. This is similar to the case above
but seems to be more complex: If the - !p5 is removed, Clang also realizes
that the condition is always true.

Failure to constant fold mla

The expression a * p1 + b evaluates to 0 * p1 + 0, i.e., 0. CompCert is
able to fold both 0 * x and x + 0 in isolation, but on ARM it generates
an mla (multiply-add) instruction, for which it has no constant folding
rules:

The value of v is never used. This assignment is dead code, and it is
compiled away. It should not affect register allocation.

Missed copy propagation

In the previous example, the copies for the arguments to the mul operation
(r0 to r1, then on to r2) are redundant. They could be removed, and
the multiplication written as just mul r1, r0, r0.

Note: This entry previously spoke of copy coalescing, but Sebastian
Hack pointed out that
it's actually copy propagation that is missed here.

Failure to propagate folded constants

Continuing the mla.c example further, but this time after applying the
mla constant folding patch, CompCert generates:

    mov r1, r0
    mul r2, r1, r0
    mov r1, #0
    orr r0, r2, r1

The x | 0 operation is redundant. CompCert is able to fold this
operation if it appears in the source code, but in this case the 0 comes
from the previously folded mla operation. Such constants are not
propagated by the "constant propagation" (in reality, only local folding)
pass. Values are only propagated by the value analysis that runs before, but
for technical reasons the value analysis cannot currently take advantage of
the algebraic properties of operators' neutral and zero elements.