Max's Work

TBAA

LLVM as of version 2.9 includes Type Based Alias Analysis (TBAA). This means that, using metadata, you can specify a type hierarchy (with aliasing properties between the types) and annotate your code with those types to improve the alias information. This should allow us to improve the alias analysis without any changes to LLVM itself, such as the ones Max made.
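As a sketch of what this looks like at the IR level (the node names "stack" and "heap" are illustrative, not what any backend actually emits), LLVM 2.9-era TBAA metadata might be written as:

```llvm
; Declare a small TBAA hierarchy: two sibling type nodes under a
; common root.  Siblings are assumed not to alias each other.
!0 = metadata !{ metadata !"ghc-memory" }          ; root node
!1 = metadata !{ metadata !"stack", metadata !0 }  ; stack accesses
!2 = metadata !{ metadata !"heap",  metadata !0 }  ; heap accesses

; Annotated accesses: TBAA lets LLVM conclude that the store cannot
; clobber the loaded value, since "stack" and "heap" do not alias.
%v = load i64* %sp_slot, !tbaa !1
store i64 %x, i64* %hp_slot, !tbaa !2
```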

Simon: As long as it propagates properly, such that every F(Sp) is a stack pointer, where F() is any expression context except a dereference. That is, we had better be sure that

I64[Sp + R1[n]]

is "stack", not "heap".

How to Track TBAA Information

Really, to be sound and support Cmm in full, we would need to track and propagate TBAA information (it is types, after all!). At the moment we don't. We simply rely on the fact that the Cmm code generated for loads and stores is nearly always of the form:

I64[ Sp ... ] = ...

That is to say, the values that the pointer derivation depends on are inlined in the load or store expression itself. It is very rarely of the form:

x = Sp + 8
I64[x] = ...

And when it is, it is (unconfirmed) always deriving a "heap" pointer; "stack" pointers are always of the inline variety. This assumption, if true, allows us to look at a load or store in isolation and give it the proper type.
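To illustrate the assumption (a hedged sketch, not actual GHC output; the offsets and local names are made up):

```
I64[Sp + 8]  = e1;   // base is literally Sp: classify as "stack"
I64[R1 + 16] = e2;   // base is a closure pointer: classify as "heap"

x = Sp + 8;
I64[x] = e3;         // base is a local: without propagating TBAA
                     // information through x, we can only classify
                     // this conservatively (the claim above is that
                     // such indirect bases always derive "heap"
                     // pointers anyway)
```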

Problems / Optimisations to Solve

LLVM Optimisations

'-O2 -std-compile-opts' does the trick, but it's obviously overkill because
it essentially executes the whole optimisation pipeline twice. The crucial
passes seem to be loop rotation and loop-invariant code motion. These are
already executed twice by -O2, but it seems that they don't have enough
information at that point; something interesting happens in later passes
which allows them to work much better the third time.

Safe Loads (speculative load)

We want to allow LLVM to speculatively hoist loads out of conditional blocks. Relevant LLVM source code is here:

Look at what indexDoubleArray# compiles to: F64[I32[Sp + 12] + ((R1 << 3) + 8)]. We would very much like LLVM to hoist the I32[Sp + 12] part (i.e., loading the pointer to the ByteArray data) out of the loop, because that might allow all sorts of wonderful optimisations, such as promoting it to a register. But alas, this doesn't happen: LLVM leaves the load in the loop. Why? Because it assumes that the load might fail (for instance, if Sp is NULL) and so can't move it past conditionals. We know, of course, that this particular load can't fail and so can be executed speculatively, but there doesn't seem to be a way of communicating this to LLVM.

As a quick experiment, I hacked LLVM to accept "safe" annotations on loads and then manually annotated the LLVM assembly generated by GHC; that helped quite a bit. I suppose that's the way to go: we'll have to get this into LLVM in some form, and then the backend will have to generate those annotations for loads which can't fail. I assume these are loads through the stack pointer, and perhaps the heap pointer unless we're loading newly allocated memory (those loads can't be moved past heap checks). In any case, the stack pointer is the most important thing. I can also imagine annotating pointers (such as Sp) rather than instructions, but that doesn't seem to be the LLVM way and it's also less flexible.
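In Cmm-ish pseudocode, what the "safe" annotation is meant to enable is roughly this (a hedged sketch; the block and variable names are made up):

```
// Before: the load sits after the loop's exit test, so LLVM will
// not hoist it, fearing it could fault on a path that never runs.
loop:
  if (done) goto exit;
  p = I32[Sp + 12];                  // reloaded every iteration
  x = F64[p + ((R1 << 3) + 8)];
  ...
  goto loop;

// After: with the load marked safe (cannot fault), LLVM could
// hoist it, making p loop-invariant and a register candidate.
  p = I32[Sp + 12];
loop:
  if (done) goto exit;
  x = F64[p + ((R1 << 3) + 8)];
  ...
  goto loop;
```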

GHC Heap Check (case merging)

I investigated heap checks a bit more, and it seems to me that it's largely
GHC's fault. LLVM does do loop unswitching, which correctly pulls out
loop-invariant heap checks, but that happens fairly late in its pipeline,
and heap checks interfere with optimisations before that.

However, we really shouldn't be generating those heap checks in the first
place. Here is a small example loop:
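A hedged Cmm-style sketch of the shape of such a loop (the exact original code is not reproduced here; the 12-byte figure comes from the discussion):

```
loop:
  Hp = Hp + 12;                      // reserve worst-case allocation
  if (Hp > HpLim) goto gc;           // heap check, every iteration
  Hp = Hp - 12;                      // ...immediately undone
  ...                                // loop body: no allocation here
  goto loop;
```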

Note how in each loop iteration, we add 12 to Hp, then do the heap check
and then subtract 12 from Hp again. I really don't think we should be
generating that and then relying on LLVM to optimise it away.

This happens because GHC commons up heap checks for case alternatives and
does just one check before evaluating the case. The relevant comment from
CgCase.lhs is this:

A more interesting situation is this:

!A!;
...A...
case x# of
  0#      -> !B!; ...B...
  default -> !C!; ...C...

where !x! indicates a possible heap-check point. The heap checks
in the alternatives can be omitted, in which case the topmost
heapcheck will take their worst case into account.

This certainly makes sense if A allocates. But with vector-based code at
least, a lot of the time neither A nor C will allocate, and C will
tail-call A again; so by pushing the heap check into !A!, we are now doing
it in the loop rather than at the end.
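Concretely, moving the check out of the loop path would look roughly like this (a hedged sketch of the control flow, not real GHC output):

```
// Current: the merged check runs on every trip around the loop.
A: if (Hp + worstCase > HpLim) goto gc;   // covers B and C too
   case x# of { 0# -> goto B; _ -> goto C }
C: ...C...; goto A;                       // tail-call back: a loop!

// Proposed: when A itself does not allocate, check only in the
// alternatives that do, leaving the A -> C -> A loop check-free.
A: case x# of { 0# -> goto B; _ -> goto C }
B: if (Hp + bytesB > HpLim) goto gc;      // B allocates; C does not
   ...B...
C: ...C...; goto A;
```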

It seems to me that we should only do this if A actually allocates, and
leave the heap checks in the alternatives if it doesn't (perhaps we could
also use a common heap check if all alternatives allocate). I tried to
hack this and see what happens, but found the code in CgCase and friends
largely incomprehensible. What would I have to change to implement this
(perhaps controlled by a command-line flag), and is it a good idea at all?