Monday, March 5, 2012

Debugging memory explosions with GDB

When writing large, complex, or multithreaded software, it's inevitable that bugs will happen. Valgrind is a great tool for finding memory leaks, but it's not very useful when a bug causes a malloc "explosion" and the OS responds by killing the process. Such a "malloc bomb" is especially difficult to find when it happens randomly after the code has been running for many hours (which would be days with valgrind's overhead).

Synopsis of a malloc() bomb:

Run the compiled program inside a debugger, e.g. `gdb yourprogramname`.

After the program has run normally for many hours, the bug occurs and memory usage spikes: the buggy program tries to allocate more memory from the heap than is available.

The OS starts paging memory to swap and the hard disk thrashes; the system starts to freeze and becomes unresponsive.

You can't get keyboard focus to the console to hit Ctrl+C, so you can't break into your code from the debugger.

Eventually the OS kills the program in a panic, and after more disk thrashing keyboard control returns to the debugger, which reports "Abnormal program exit".

The program is killed with no opportunity to produce a core dump or break into the debugger.

Even if you were sitting at the gdb console waiting for this to occur, you likely wouldn't catch it before the system stops responding to Ctrl+C. Unfortunately, gdb and other debuggers don't allow breakpoints on total memory usage.

Solution: One solution that works very nicely on Ubuntu/Linux is a simple bash script that loops, checking memory usage. First run your program at a console inside the debugger (`gdb yourprogramname`), then run or paste this script into another console:

#!/bin/bash
# Set the process name here (make sure there is only one instance).
NAME=yourprogramname
# Set the process memory threshold in KiB.
MEMLIMIT=2000000
# Look up the PID of the program being debugged.
PID=$(pgrep "$NAME")
while true
do
# Sum the Rss fields (reported in kB) of every mapping in the process.
MEM=$(awk '/^Rss:/ {sum += $2} END {print sum + 0}' "/proc/$PID/smaps")
echo "Memory usage: $MEM"
# Past the threshold, interrupt the process so gdb regains control.
if [ "$MEM" -gt "$MEMLIMIT" ]; then echo "***sending sigint***"; kill -INT "$PID"; fi
sleep 1
done
Make sure you are running only one instance of your program and that it is inside the debugger. The script first finds the PID of your running program and then loops forever, checking how much memory it is using. If total resident memory passes a predefined threshold (roughly 2 GiB in the script above), the script sends SIGINT to your process. This has the same effect as hitting Ctrl+C in the debugger, but the script is far more likely to deliver the SIGINT before the disk starts thrashing and the system becomes unresponsive. The repeated SIGINTs are ignored.
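
The two assumptions the script relies on (exactly one matching process, and a total obtained by summing the `Rss` lines of `/proc/$PID/smaps`) can be made explicit with a pair of small helpers. This is only a sketch: the function names `require_single_pid` and `rss_kib` are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical helpers (names invented for this sketch).

# Read PIDs from stdin, one per line. Print the PID and succeed only if
# exactly one non-empty line was given; otherwise fail with status 1.
require_single_pid() {
    awk 'NF { n++; pid = $1 } END { if (n == 1) { print pid } else { exit 1 } }'
}

# Print the total of all "Rss:" fields (in kB) found in a smaps-format
# file, for example /proc/<pid>/smaps. Prints 0 if there are none.
rss_kib() {
    awk '/^Rss:/ { sum += $2 } END { print sum + 0 }' "$1"
}
```

With these defined, the PID lookup could refuse to continue when the name is ambiguous, for example `PID=$(pgrep "$NAME" | require_single_pid) || { echo "need exactly one $NAME process" >&2; exit 1; }`, and the loop body could read the usage with `MEM=$(rss_kib "/proc/$PID/smaps")`.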