ghc-gc-tune: Tuning Haskell GC settings for fun and profit

Inspired by a comment by Simon Marlow on Stack Overflow about the time and space tradeoffs we make with garbage collection, particularly with generational GCs, I wrote a small program, ghc-gc-tune, that traverses the garbage collector's variable space to show the relationship between settings and program performance. Given a program, it will show you an (optionally interactive) graph of how the -A and -H flags to the garbage collector affect performance.

Previously I’ve had good success exploring multi-variable optimization spaces with genetic algorithms (GAs) in Haskell, to find strictness flags and LLVM flag settings, so I was keen to see what the GC space looked like. In this initial GC search, however, I don’t use a GA; I simply measure running time as the two variables change over the entire space.

Here’s an example for the binary-trees language shootout benchmark, where the GHC default settings are known to be suboptimal (the benchmark disallows changes to the default runtime GC settings):

Running time of the binary-trees benchmark as -A and -H vary

The flags we use are:

-A, the size of the initial thread allocation area for the youngest generation.

-H, the suggested overall heap size.
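
For concreteness, both flags are passed to a compiled program between +RTS and -RTS on its command line. Here is a minimal sketch of building that argument list. The helper name rtsArgs is made up for illustration and is not part of ghc-gc-tune, and the size strings are passed through unchecked:

```haskell
-- | Build the runtime-system part of a command line for a given
-- allocation area (-A) and suggested heap size (-H), e.g. "64k" and "32M".
-- 'rtsArgs' is a hypothetical helper, not something ghc-gc-tune exports.
rtsArgs :: String -> String -> [String]
rtsArgs a h = ["+RTS", "-A" ++ a, "-H" ++ h, "-RTS"]
```

So rtsArgs "64k" "32M" corresponds to appending +RTS -A64k -H32M -RTS to the program's own arguments. On recent GHC versions the program generally has to be linked with -rtsopts for these flags to be accepted.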

ghc-gc-tune, in the style of ghc-core, wraps a compiled Haskell program and runs it with varying values of -A and -H, recording various statistics about the program. The output can be rendered interactively, or to png, pdf or svg. It complements heap profiling, ThreadScope and ghc-core for analyzing and improving Haskell program behavior.
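
The wrap-and-measure idea can be sketched in a few lines. This is not ghc-gc-tune's actual code, just an illustration under some assumptions: the names Sample, timeRun, sweep and best are invented here, only wall-clock time is recorded, and the child's exit code and output are ignored.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)
import GHC.Clock (getMonotonicTime)
import System.Process (readProcessWithExitCode)

-- | One measurement: the -A value, the -H value, and wall-clock seconds.
-- (Hypothetical type for this sketch.)
type Sample = (String, String, Double)

-- | Run the program once with the given -A and -H sizes and time it.
-- The exit code and captured output are discarded; a real tool would
-- check them and probably repeat each run to reduce noise.
timeRun :: FilePath -> [String] -> String -> String -> IO Sample
timeRun prog args a h = do
  start <- getMonotonicTime
  _ <- readProcessWithExitCode prog
         (args ++ ["+RTS", "-A" ++ a, "-H" ++ h, "-RTS"]) ""
  end <- getMonotonicTime
  return (a, h, end - start)

-- | Time every combination of the given -A and -H values.
sweep :: FilePath -> [String] -> [String] -> [String] -> IO [Sample]
sweep prog args as hs = sequence [timeRun prog args a h | a <- as, h <- hs]

-- | The fastest sample. Partial: fails on an empty list.
best :: [Sample] -> Sample
best = minimumBy (comparing (\(_, _, t) -> t))
```

Something like best <$> sweep "./binary-trees" ["16"] ["64k", "512k"] ["32M", "256M"] would then report the fastest of four configurations. The executable path and the candidate values there are placeholders.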

In this case, ghc-gc-tune recommends the somewhat surprising -A64k -H32M, and binary-trees runs in 1.12s at N=16, while with the default GC settings it completes in 1.56s. So ghc-gc-tune found settings that cut the running time by 28%. Nice.

I already knew that a large -A setting helped this program (corresponding to the broad plateau for large -A values in the above graph). However, I was surprised to see that the best result came from a very small -A setting and a medium-sized -H setting, resulting in only 5% of time spent in GC and 36M total allocated: the narrow valley on the far side of the graph. Very interesting! And is that my L2 cache in the square at x = 2M, y = 2M? Sure looks like it.

Here’s a video of the same graph in the tool’s interactive mode (without any -t flag):

Currently, the sampling is very simplistic, with a fixed set of logscale values taken. A cleverer sampling algorithm would measure the heap used in the default case and compute a range based on that, possibly with cutoffs for very pessimistic GC flags.
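
As an illustration of what a fixed logscale sample set can look like, here is a small sketch that doubles a size from a lower bound up to an upper bound and renders each value as a flag argument. The function names and the power-of-two spacing are assumptions made for this example, not the tool's actual grid. It also assumes the RTS reads the k, M and G suffixes as multiples of 1024.

```haskell
-- | Sizes in bytes, doubling from 'lo' while they do not exceed 'hi'.
-- Returns an empty list for a non-positive lower bound.
-- (Hypothetical helper; the spacing is an assumption of this sketch.)
logScale :: Integer -> Integer -> [Integer]
logScale lo hi
  | lo <= 0   = []
  | otherwise = takeWhile (<= hi) (iterate (* 2) lo)

-- | Render a byte count as a size argument, using the largest of the
-- G, M, k suffixes that divides it evenly (1024-based), else plain bytes.
showSize :: Integer -> String
showSize n
  | n `mod` g == 0 = show (n `div` g) ++ "G"
  | n `mod` m == 0 = show (n `div` m) ++ "M"
  | n `mod` k == 0 = show (n `div` k) ++ "k"
  | otherwise      = show n
  where
    k = 1024
    m = 1024 * k
    g = 1024 * m
```

For example, map showSize (logScale (16 * 1024) (1024 * 1024 * 1024)) gives candidate values from 16k up to 1G. The smarter scheme described above would replace those fixed bounds with ones derived from a baseline run.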

Another example: pidigits, with what I would consider far more typical behavior. Though again, a surprisingly small -A setting does well, and there’s an interesting pathological result with extremely large -H and very small -A settings.

PiDigits GC space

You can get ghc-gc-tune from Hackage via cabal; note that it requires gnuplot to be installed. Let me know if you find it useful, and I welcome patches!

Future work will be to graph the Z axis as space, instead of time (so we can find GC settings that minimize the footprint), as well as adding other variables (such as parallel GC settings, and varying the number of generations).

8 comments

Interesting that the shapes of these are roughly the same; my guess is that this is due to the following components:

a) A linear cost for heap size, which causes time to rise for huge -H values. malloc(1G) doesn’t seem to cost any measurable amount on my computer, so does ghc do something to the allocated memory, or what?

b) an inverse cost for allocation area size, making very small -A values quite expensive, and responsible for the “ski-jump” shape for low -H

c) -A cost becomes mostly irrelevant when -H is big enough – perhaps to fit the working set – eliminating the ski jump.

GHC defaults are -H 0 and -A 512K. These are small benchmarks, and in general, I don’t think we can expect -H to be big enough to avoid the ski jump.

It’d be interesting to see how the other benchmarks fare, and also how things work out on other CPUs with different cache sizes etc. Perhaps information from -s output (number of collections etc) could be incorporated? Anyway, great work, Don!

I think your theory regarding L2 is not quite plausible, because the program is actually faster when the nursery is larger than that. Increasing the nursery size usually gives objects more time to die. Only if you go much larger than that do you lose due to locality and make things worse.