2015-07-05 perf for low-level Haskell profiling

TL;DR: perf can be particularly helpful when profiling performance-critical sections of your programs. We show how through an example.

The perf_events framework in the Linux kernel, together with the perf userland tools, can be used to profile applications without any instrumentation – beyond compiling them with debug symbols, which are specified in the DWARF format. Recently, GHC gained support for emitting DWARF debug symbols, thanks to work by Peter Wortmann.

In this article we go through the process of understanding the performance of a low-level Haskell program – a function computing the CRC32 digest of a ByteString. We compare a pure Haskell version, provided by the digest-pure package, to a version backed by a C function from zlib, provided by the digest package. Once we understand why the Haskell version is slower, we modify it to make it almost as fast as the C one.

Note that this document is meant to be a showcase of what can be done with perf, rather than a tutorial on how to make a Haskell program as fast as a C one. In this case, the shortest route would have been to directly replicate the behaviour of the C program in Haskell: we use perf to show how it’s possible to understand precisely where and why time is being spent.

The setup

Building Haskell libraries with DWARF

Getting up to speed with perf and Haskell is not hard, but tedious and undocumented. First, GHC HEAD (7.11) is needed for DWARF support: GHC 7.10.1 has partial DWARF support, but some features are broken, so GHC HEAD is recommended. The issues with GHC 7.10.1 are resolved in GHC 7.10.2, which will be released shortly.

Moreover, the RTS and the libraries built as part of the GHC build process should be built with debug symbols. This can be done by adding

GhcRtsHcOpts += -g
GhcLibHcOpts += -g

to your mk/build.mk file in the ghc source tree.

Once you have GHC 7.11, special care is needed to install every library with the right options when using cabal. Specifically, we want to pass the -g option to GHC to generate debug symbols. Moreover, we must stop cabal from stripping the debug symbols from the object files. For example, to install the digest library we’ll use
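A minimal sketch of such an invocation (flag names may vary slightly between cabal versions):

```shell
# Compile with debug symbols and keep them in the installed artifacts.
# --disable-library-stripping may not exist in older cabal versions.
$ cabal install digest \
    --ghc-options=-g \
    --disable-executable-stripping \
    --disable-library-stripping
```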

Debug symbols for non-Haskell code

When debugging Haskell code involving foreign C code, we also need to make sure that we have the debug symbols for the C libraries. This can be accomplished in various ways – on Debian systems debug symbols are packaged separately in -dbg packages. So if we wanted to properly debug a program using the digest library, which uses zlib, we would need to issue

$ sudo apt-get install zlib1g-dbg

Keeping the source code for built libraries

If we want to see source code in the perf tools, as we’ll see in the following sections, we also need to keep the Haskell source files in the location they were built from. The easiest way to do this is using cabal get, for example:
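A sketch of this workflow (the exact directory name depends on the package version, and the install flags on the cabal version):

```shell
$ cabal get digest-pure       # unpacks the sources into ./digest-pure-<version>
$ cd digest-pure-*
# building from here keeps the source files where DWARF says they are
$ cabal install --ghc-options=-g --disable-executable-stripping --disable-library-stripping
```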

The same applies for C libraries that we want to debug with source code support.

Building the example program

As an example, we will be testing the digest-pure library, which implements the CRC32 algorithm in Haskell, against the digest library, which offers the same algorithm by binding to the zlib library.

First we’ll build the zlib library manually to keep the source code in place, taking care of adding -g to the CFLAGS in the Makefile:
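From the unpacked zlib source tree, the build might look something like this (a sketch; the exact CFLAGS line depends on the zlib version):

```shell
$ ./configure
# edit the generated Makefile so that CFLAGS includes -g, e.g.
#   CFLAGS=-g -O3
$ make
```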

The Haskell version is more than 3 times slower than the C one. Let’s try to find out why.

Finding out where time is spent

The first thing we’ll do is understand where time is spent. We expect some CRC32 function to account for most of the running time, but we’d like to verify this. Moreover, we’d like to know what exactly is taking time in the CRC32 function itself.
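The recording step itself is a single command per executable (the benchmark executable names here are hypothetical):

```shell
# Sample each version; -o picks the output file name.
$ perf record -o perf-hask.data ./bench-hask
$ perf record -o perf-c.data ./bench-c
```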

perf record will sample the execution with the aid of the perf_event_open Linux syscall. Using perf report -i perf-hask.data we can inspect the result of the Haskell version. The main screen shows a breakdown of where time is spent in the executable:

Symbols like c5gn_info are generated by the GHC code generator. As we can see, almost all the time is spent in a single symbol, s4Tw_info. We can drill down into s4Tw_info, and “annotate” it, to get a precise breakdown of where time is spent in s4Tw_info alone:

What this breakdown tells us is that a whopping 75% of the time is spent in xor and and instructions alone. Another 20% of the time is spent in looping code: a cmp at the beginning, and an inc at the end – which keep track of an index in the ByteString we’re digesting to find out when we’re done traversing it. Moreover, we get the Haskell source code interleaved with the assembly deriving from it, which lets us understand which Haskell code the various parts of s4Tw_info correspond to. In this case, the part that eats almost all of the time is the section of the code corresponding to the main section of the CRC32 algorithm, which is what we expected. What the code does is get a value out of a pre-computed table, and then use it together with the current CRC to get the new value.
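To make the structure of that inner loop concrete, here is a minimal, self-contained sketch of the classic table-driven, byte-at-a-time CRC32 in Haskell (not the actual digest-pure code): each step indexes a pre-computed table with the low byte of the current CRC xored with the input byte.

```haskell
import Data.Bits (shiftR, xor, (.&.))
import Data.Word (Word32, Word8)
import qualified Data.Array.Unboxed as A

-- One entry per possible byte value, derived from the CRC32 polynomial.
crcTable :: A.UArray Word8 Word32
crcTable = A.listArray (0, 255) [ step8 (fromIntegral n) | n <- [0 :: Int .. 255] ]
  where
    step8 c = iterate step1 c !! 8
    step1 c = if odd c then 0xedb88320 `xor` (c `shiftR` 1) else c `shiftR` 1

-- The loop body perf showed us: a table lookup combined with xor and and.
crc32Update :: Word32 -> Word8 -> Word32
crc32Update crc b =
  (crcTable A.! fromIntegral ((crc `xor` fromIntegral b) .&. 0xff))
    `xor` (crc `shiftR` 8)

crc32 :: [Word8] -> Word32
crc32 = xor 0xffffffff . foldl crc32Update 0xffffffff
```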

To learn more about the symbol, we can also use objdump on the binary to find out which “compile unit” s4Tw_info comes from:
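For example, to dump the debug information and search for the symbol (executable name hypothetical):

```shell
$ objdump --dwarf=info bench-hask | less    # then search for s4Tw_info
```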

This tells us the exact file the symbol comes from. The DWARF information also takes a stab at pairing the symbol with a name, although it is often difficult to do so: in this case it picked the inlined !!! function, which is misleading.

Check out this screencast for a taste of what interacting with perf on this data set looks like.

Now let’s do the same with the C version, using perf report -i perf-c.data:

In this case almost all the time is spent in the crc32 function of libz.so, which is again what we expect. However, when drilling into the crc32 function, we get a much bigger disassembly output. While confusing at first, it soon becomes clear that all the time is spent in the mysterious DOLIT32, a rather long section of the code:

In fact, summing up the small percentages inside DOLIT32 we find out that around 85% of the time is spent there. To find out what DOLIT32 does we need to consult the zlib source code. DOLIT32 is a macro repeating DOLIT4 8 times, and DOLIT4 is a macro which computes the CRC32 4 bytes at a time, assuming little-endianness. Equivalent macros for big-endian architectures are present.

Why the Haskell version is slower

Because zlib computes the CRC32 4 bytes at a time and unrolls the main loop, it needs far fewer instructions per byte of input. The loop (from cmp to jmp) with DOLIT32 as a body uses around 110 instructions to compute the CRC32 of 32 bytes (≈3.5 instructions per byte), while the loop generated from the pure Haskell version uses 25 instructions for just one byte. This should amount to roughly a seven-fold increase in the instructions executed by the Haskell program.

We can easily verify this fact using perf. In this case we’ll use perf stat, which gives us global statistics, since we know that one symbol is taking up all the time. Otherwise we could use perf record -e to record the events we’re interested in and get figures for specific symbols.
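The two invocations look like this (executable names hypothetical; the counters listed are the ones relevant here):

```shell
# Global counts for the whole run of each version.
$ perf stat -e cycles,instructions,stalled-cycles-frontend ./bench-hask
$ perf stat -e cycles,instructions,stalled-cycles-frontend ./bench-c
```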

The first thing to notice is the difference in the number of instructions: the Haskell program executes 201 billion, versus the 31 billion of the C program. A 6.5× difference, which is close to our rough estimate of 7.

Note that the 6.5× difference in instructions does not translate to a 6.5× speedup for the C program. This can be due to a variety of reasons, but the perf stat output already shows one factor: the C version spends more time stalled, as is evident from the stalled-cycles-frontend hardware counter. This is most likely because the C code spends less time in looping code, and proportionally more time fetching bytes from the input buffer and from the static tables used to compute the CRC32. In fact, the overall throughput is lower: 1 instruction per cycle in C versus more than 2 in Haskell. Thus, we only get a 3× speedup in C, as we have seen.

Making the Haskell version faster

Now that we know the main reason for the Haskell version being slower, we can make it more similar to the C one. The first measure we adopt is to implement the CRC32 algorithm so that it consumes 4 bytes at a time. We choose to do this before loop unrolling because it is more likely to impact performance: for instance, it makes the algorithm less sequential and increases the opportunities for pipelining, since we don’t have to wait for each byte before continuing.

To do so, we will split the input ByteString in three sections: an aligned vector of Word32s, and the remaining unaligned leading and trailing bytes. For example, for a ByteString of length 26 starting at address 9, we will have 3 leading bytes (up to address 12), 5 aligned words (up to address 32), and 3 trailing bytes. This will give us a memory-aligned section to perform the CRC32 word-by-word instead of byte-by-byte. The type signature for the function splitting the ByteString will be
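Here is one possible shape for it, with hypothetical names (a sketch, not the exact code): the continuation receives the unaligned leading bytes, a pointer to the aligned Word32 section together with its length in words, and the unaligned trailing bytes.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import Data.Word (Word32)
import Foreign.Ptr (Ptr, castPtr, minusPtr, nullPtr, plusPtr)

splitBS
  :: BS.ByteString
  -> (BS.ByteString -> Ptr Word32 -> Int -> BS.ByteString -> IO a)
  -> IO a
splitBS bs k =
  BS.unsafeUseAsCStringLen bs $ \(ptr, len) -> do
    let addr   = ptr `minusPtr` nullPtr                -- the pointer, as an Int
        lead   = min ((4 - addr `mod` 4) `mod` 4) len  -- bytes before alignment
        nwords = (len - lead) `div` 4                  -- whole aligned words
    k (BS.take lead bs)                                -- leading unaligned bytes
      (castPtr (ptr `plusPtr` lead))                   -- aligned Word32 section
      nwords
      (BS.drop (lead + 4 * nwords) bs)                 -- trailing unaligned bytes
```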

We use a continuation-passing style since we need to manipulate the pointer underlying the ByteString to know where the aligned section of the memory starts, and the functions that let us do this use CPS themselves.

Once we have this function, the ugly part is over: we can very easily port the algorithm present in zlib to Haskell, but in a more functional style:

The crc32Table_n are various pre-computed tables needed to perform the algorithm, and !!! is an unsafe indexing operator (we know we’re going to be within the bounds). The Little indicates that this is the algorithm for little-endian architectures; a matching crc32UpdateBig has to be defined for big-endian architectures.
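As a concrete, self-contained illustration of that step, here is a sketch with hypothetical table definitions (the real code would build the tables once and use genuinely unsafe indexing):

```haskell
import Data.Bits (shiftR, xor, (.&.))
import Data.Word (Word32)
import qualified Data.Array.Unboxed as A

type Table = A.UArray Word32 Word32

-- Stand-in for the unsafe indexing operator mentioned in the text.
(!!!) :: Table -> Word32 -> Word32
t !!! i = t A.! i
infixl 9 !!!

-- crc32Table_0 is the usual byte-at-a-time table; tables 1..3 are
-- "shifted" versions of it, as in zlib.
crc32Table_0, crc32Table_1, crc32Table_2, crc32Table_3 :: Table
crc32Table_0 = A.listArray (0, 255) [ step8 n | n <- [0 .. 255] ]
  where
    step8 c = iterate step1 c !! 8
    step1 c = if odd c then 0xedb88320 `xor` (c `shiftR` 1) else c `shiftR` 1
crc32Table_1 = extend crc32Table_0
crc32Table_2 = extend crc32Table_1
crc32Table_3 = extend crc32Table_2

extend :: Table -> Table
extend t = A.listArray (0, 255)
  [ (crc32Table_0 !!! (x .&. 0xff)) `xor` (x `shiftR` 8)
  | n <- [0 .. 255], let x = t !!! n ]

-- One word-at-a-time step: the Haskell analogue of zlib's DOLIT4.
crc32UpdateLittle :: Word32 -> Word32 -> Word32
crc32UpdateLittle crc w =
  let c = crc `xor` w
  in        (crc32Table_3 !!! ( c              .&. 0xff))
      `xor` (crc32Table_2 !!! ((c `shiftR`  8) .&. 0xff))
      `xor` (crc32Table_1 !!! ((c `shiftR` 16) .&. 0xff))
      `xor` (crc32Table_0 !!! ( c `shiftR` 24))
```

One lookup per input byte replaces a full iteration of the old loop, which is where the instruction savings come from.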

The number of instructions is reduced by almost a factor of three, while the throughput stays the same – which explains the three-fold increase in performance. Note that we’re still executing more than twice as many instructions as the C program, since we do not unroll the loop as zlib does. However, loop unrolling seems to push the program towards being memory-bound, and thus does not result in a great performance increase; we have not pursued this road further.

However, when analyzing the program through perf report, we get somewhat surprising results:

mov instructions from register to register account for more than 60% of the time spent in the critical section of the code, while we would expect most of the time to be spent xoring and anding. I have not investigated why this is the case – ideas welcome!

Acknowledgements

Thanks to Peter Wortmann and to Arash Rouhani for the very informative discussions on the status of DWARF in GHC. Also thanks to Niklas Hambüchen for reviewing a draft of this blogpost.