Sunday, 29 March 2009

Current shadow stack overheads in HLVM

Many people, including the developers of the Haskell (see here) and OCaml (see here) compilers, have been quick to dismiss the shadow stack algorithm for performance reasons.

Despite these concerns, we chose to use a shadow stack in HLVM for several reasons:

GC performance is low on the list of priorities for HLVM.

Shadow stacks are easy to implement entirely from within HLVM, whereas conventional strategies entail injecting stack maps into frames on the native stack, which requires low-level hacking with LLVM's experimental GC API in C++.

Shadow stacks are easy to debug because they are just another data structure in the heap.

HLVM provides typed nulls and the ability to read array lengths or type constructor tags without an indirection. This requires an unconventional struct-based reference type that is incompatible with LLVM's current GC API, which can only handle individual pointers.

Consequently, it is interesting to use HLVM to measure just how expensive its GC and shadow stack implementations are. This cannot be done directly because the shadow stack is essential for collections to work, so removing the shadow stack also disables the whole GC. However, we can benchmark three configurations: with the full GC enabled, with shadow stack updates but no collections, and with the GC completely disabled:

The difference between the last two measurements gives an estimate of the time spent updating the shadow stack, which we can present as a ratio of the original running time:

This is only an estimate because disabling the GC affects the performance of the rest of the system: most notably, it leaves more allocated blocks for malloc to handle and degrades cache locality by cold-starting all allocated values.

These results show that the time spent manipulating the shadow stack is:

Insignificant for the fib, ffib, mandel and mandel2 benchmarks, which is expected because their inner loops act on value types that the GC is oblivious to.

Between 10 and 25% for the sieve, Array.fold, List.init and gc benchmarks, which is expected because they use reference types in their inner loops.

Around 70% in the case of the 10-queens benchmark, which was entirely unexpected!

We can also see that the List.init benchmark is spending most of its time performing collections.

These results are very encouraging for two reasons:

HLVM's comparatively poor performance on the List.init benchmark is not due to the shadow stack but, rather, is due to inefficiencies in our collection algorithm. Specifically, the mark phase uses a hash table with a fixed number of buckets that scales poorly when the heap contains many allocated blocks. Increasing the size of the hash table provides a 36% performance improvement.

HLVM's comparatively poor performance on the list-based 10-queens benchmark is due to the presence of reference types in its inner loop, the List.filter function: specifically, the environment of the predicate closure and the list itself. Inlining and/or CSE would go a long way towards eliminating this overhead.

Suffice it to say, we are very happy with these results and intend to continue using the shadow stack algorithm.