Debugging GHC-compiled code with gdb

If you're working on a GHC port, tweaking the runtime, or modifying
the code generator, at some point you will have to become familiar
with how to use ​gdb for debugging
crashes in GHC-compiled programs.

Most of the time you can't even get a backtrace inside GHC-compiled
code. The GHC calling conventions are entirely different from C's,
and GHC-compiled code doesn't use the C stack. There are a
couple of exceptions:

when the crash happens inside the runtime, you can use the usual
C debugging facilities of gdb with full backtrace and
source-level debugging. (make sure you use a -debug
version of the runtime to get the debug info, though, and if
the bug happens with the non-threaded runtime then that will also
simplify things).

when the crash happens inside a foreign call invoked by Haskell
code. In this case you'll be able to see which Haskell fragment
invoked the foreign call (it'll be something like
MyModule_zdwccall13_info, more about symbol names later).

In fact, there are no useful backtraces in Haskell code,
because of lazy evaluation and tail-calls. Finding out "how we got
here" is often a challenge.

The recipe

Usually you want to do something along these lines:

Compile with -dcore-lint to make sure the crash isn't caused by the compiler
generating incorrect code in a way that can be detected statically.

Compile with -debug! This links the program with a debugging
version of the runtime, that includes extra assertions, debugging
code, and debugging info for gdb.

Don't compile with -threaded (assuming the bug still happens
with the non-threaded runtime).

Run the program in gdb until it crashes, type where to find
out whether the crash is in the RTS or in Haskell code. If you see
an uninformative backtrace with a symbol like s32y_info at
the top, then you're in Haskell. If you're in the runtime, then
we don't have anything more to tell you here, you can use gdb's
full C debugging facilities to track down the problem.

Make the crash happen as early as possible. -debug already
turned on lots of assertions, one of which might trigger. Also,
try +RTS -DS which turns on a bunch of extra "sanity
checking", i.e. expensive assertions about the state of the runtime
at regular points. One thing this enables is a full sweep of the
heap after each garbage collection to make sure there are no
dangling references. Also, it fills all free memory with the value
0xaaaaaaaa (the sound of GHC disappearing down a hole).

Look at the fragment of code that crashed: disassemble. It
should be pretty small (code blocks in GHC-generated code usually
are). Look at the instruction that caused the crash - the address
is given in the where output earlier. If you can tell what
the problem is right there, then great.

Poke around in the heap and stack and see if you can figure out
what has gone wrong. See "poking around", below.

Make the crash happen repeatably, if possible. That means after
the exact same number of garbage collections and context switches
each time. Use +RTS -C0 to turn off timer-based context switches.
Setting +RTS -V0 disables the RTS timer entirely. Disabling the RTS
timer altogether can make the point at which the program crashes reproducible,
and is very helpful when using the "going back in time" method described below.
If you're debugging heap profiling, add -i0 too.

Similarly, if possible, turn off address space layout randomization (ASLR).
On Linux you can do this with "echo 0 > /proc/sys/kernel/randomize_va_space".

If you still haven't figured it out, you probably need to go back in
time a bit to find out what was happening just before the crash.
This is the tricky bit, because going back in time isn't directly
supported by gdb, we need to find a way to hit a breakpoint a short
time before the crash happened. See "going back in time", below.

Going back in time

Set a breakpoint on a code fragment that is regularly executed.
Some good examples are stg_upd_frame_info (the standard
update code), stg_upd_frame_1_info (the update code for a
1-tagged object, eg. a cons cell), stg_ap_p_info (the apply
code for a single pointer argument).

break stg_upd_frame_info

Ignore that breakpoint for ever

ignore 1 9999999

Run the program

run

Find out how many times the breakpoint was hit

info break

Suppose it was hit 4325 times, then next time we'll ignore it
for 4324 times (i.e. subtract one), which will stop on the 4325th
time, just before the crash.

ignore 1 4324

Run the program

run

Single step until the crash happens

si...

If the crash doesn't happen for a long time, try picking another
break point (e.g. something you stepped through this time). You
might have to repeat this a few times to get to a useful point.

Sometimes the crash is a long way into the program, and setting a
frequently accessed breakopint like stg_upd_frame_info means
the program just runs too slowly; in this case you might have to
use a 2-stage strategy instead. Set a breakpoint on a less
frequently accessed point (a good one I use for this is
GarbageCollect, which stops at each GC), do the ignore thing
on this, and then switch to a more frequently accessed breakpoint
when you are closer to the crash.

Decoding symbol names

Symbols in GHC are encoded using something called the Z-encoding (see
​Encoding.hs). Basically special symbols are replaced by sequences
beginning with z or Z. eg. state# becomes
statezh. The letter z itself is replaced by zz.

External symbols follow the pattern
module_identifier_kind. Where module is the module
name, identifier is the identifier in the STG code that this
symbol relates to, and kind is the kind of symbol:

Internal symbols are of the form identifier_kind, i.e. the
same as external symbols but without the module name.

Poking around

There are two things you usually want to inspect inside a code
fragment: the contents of heap objects, and the contents of the
stack. A full description of the layout of these things is here: Commentary/Rts/Storage; what follows is a basic introduction to
looking at heap & stack objects in a running program.

You can display memory in gdb with something like x/4a to
display 4 words of memory, or using our gdb macros​ you get slighty
nicer output:

p4 displays 4 words of memory backwards, i.e. lower addresses at
the bottom. In this case I'm displaying memory pointed to by the
register rbx, which corresponds to the STG register R1 on
a recent x86_64 build. Check
​MachRegs.h to
see which machine registers correspond to which STG registers on your
platform.

In the example above, I can see that $rbx points to a heap
closure for the Int value 5. Closures always consist of an info
pointer (GHCziBase_Izh_con_info in this case, the I#
constructor), followed by any number of payload words (just one word
containing the value 5, here). Full details on closure layouts are in
​Closures.h.

It looks like the next word contains garbage, probably because it is
the next free location in the heap.

Info tables tell the garbage collector and other parts of the system
about the layout of closures, they are rarely used during actual
execution. The info pointer of a closure actually points to the
entry code for the closure; the info table lives just before the
entry code (this is a trick used by GHC so that the common operation
of jumping to the entry code for a closure can be done with a single
indirection). The layout of info tables is defined in
​InfoTables.h.

To display the stack, you need to know what the Sp register is
mapped to on this platform. On x86_64 you'll find it in $rbp:

The type field tells us what kind of object this is, in this
case 36}, which means a RET_SMALL stack frame (see
​ClosureTypes.h
for a list of closure types, but make sure you are
looking at the right version of this file for the build you're using,
because the types do change).

Useful flags

See -ddump-stg, -ddump-simpl, -ddump-cmm, -dppr-debug.

Useful hints

Use the dpc macro if you want to avoid the pain of matching the pc address with instructions at every program step:

Now, I can use file libraries/base/GHC/TopHandler.o.output to make sense of what is going on in TopHandler.o.

Mapping back to the STG code

To figure out which bit of Haskell code corresponds to the assembly
fragment you're looking at, you need to look at the STG intermediate
code generated by GHC. Use the -ddump-stg flag. The reason we
have to look at the STG is becase this is the last phase before code
generation, after all the transformations have happened, and the
symbol names in STG correspond pretty directly to the symbols you see
in the object code.

For example, if you're stopped in a code fragment s28a_info,
then you should be able to search for s28a in the STG output and
find it - but you first have to find which module it comes from, and
the best way to do that is to grep for s28a in all the modules
of your program.

When you've found & fixed it

Tell someone, they'll probably be impressed :-)

Try to think about whether adding an assertion somewhere might have
helped you find the bug quicker, and if so, add one. Test it. Add
some commentary next to the assertion to help someone else
understand the problem later.