Posts Tagged ‘gdb’

I had a fun memory corruption to debug over the last couple of days. That’s probably something that not many people would be caught dead saying, but it happens that DB2’s internal memory allocation infrastructure has some awesome and powerful cross-platform capabilities. One of these is a memory debugging runtime option that enables unwritable guard pages on allocation (using mprotect and other similar operating system primitives if available).

By enabling this memory debugging code in my test scenario, I ended up with a nice friendly SIGSEGV when an attempt was made to write past the end of an allocation. I say this is friendly because, compared to the internal functions of free() barfing long after the corruption, with no idea what could have caused it, nor when, a SIGSEGV at the point of corruption is very nice!
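To make the guard page idea concrete, here is a minimal sketch of the general technique, assuming a POSIX system with mmap and mprotect. It is not DB2's allocator, and guarded_alloc is a hypothetical name: the allocation is placed so that its last byte sits immediately before a PROT_NONE page, and the first write past the end faults on the spot.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not DB2's code): returns `size` usable bytes whose
 * end abuts an inaccessible guard page, or NULL on failure.  The returned
 * pointer is only page aligned when `size` is a multiple of the page size,
 * and the mapping is deliberately never freed in this sketch. */
void *guarded_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;   /* round up */
    size_t total = (data_pages + 1) * page;         /* data + one guard page */

    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the final page inaccessible: any read or write there faults. */
    unsigned char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }

    /* Slide the allocation up so that byte [size] is the first guard byte. */
    return guard - size;
}
```

With a 1000 byte request, writing to offset 1000 lands on the guard page and raises SIGSEGV at the offending instruction, rather than quietly damaging a neighbouring allocation.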

However, it happened that the SIGSEGV in this case was actually a side effect of an earlier corruption. I see the following in the debugger when the exception occurs:

Now that address looks a bit fishy, doesn’t it? I happen to know this xiinfo->address was heap allocated, so my expectation was that it would be aligned nicely. What we’ve got here is a pair of 0xAA corruptions of the address, and that was enough to push a later dereference of the memory it pointed to into the guard page region past the allocation (the allocation size in this case was 1000 bytes, less than 0xAAAA). While I have a reproducible scenario, noticing that my pointer is being corrupted unfortunately reduces the problem to tracking down a corruption that’s no longer occurring in my guard page region of memory. I’d thought of instrumenting the code in question with a validation routine that checks this address against an earlier cached value. That worked, but only triggered after the fact, and it was hard to see exactly what was causing the corruption. As a third step in the debugging process I used, for the first time, a hardware watchpoint, something I’d wanted to try for a while:

(gdb) help watch
Set a watchpoint for an expression.
A watchpoint stops execution of your program whenever the value of an expression changes.
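For comparison with the watchpoint, the cached-value validation check mentioned earlier might look something like the following. This is only a sketch: the struct layout and the function names cache_address and address_is_intact are hypothetical stand-ins, with only the address field name taken from the session above.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real structure; only the `address`
 * member matters for this illustration. */
struct xi_info {
    void *address;
};

/* Last known-good value of the pointer, recorded right after allocation. */
static void *cached_address;

void cache_address(const struct xi_info *info)
{
    cached_address = info->address;
}

/* Returns 1 if the pointer still matches the cached value, 0 otherwise.
 * `where` names the call site so the log shows which check fired. */
int address_is_intact(const struct xi_info *info, const char *where)
{
    if (info->address == cached_address)
        return 1;

    fprintf(stderr, "%s: address changed from %p to %p\n",
            where, cached_address, info->address);
    return 0;
}
```

Sprinkling calls like this through the suspect code narrows down the window, but it can only report that the damage has already been done by the time a check runs. A watchpoint stops at the write itself.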

Here’s an example of a fragment of the debugging session that shows a hardware watchpoint in action (with some names changed to protect the guilty)

I had a three thread timing hole scenario that I wanted to confirm with the debugger. Adding blocks of code to selected points like this turned out to be really handy:

{
   volatile int loop = 1 ;
   while (loop)
   {
      loop = 1 ;
      sleep(1) ;
   }
}

Because the variable loop is local, I could have two different functions paused where I wanted them, and once I break on the sleep line, I can let each go at exactly the right point in time with a debugger command like so:

(gdb) p loop=0

(assigns a value of zero to the loop variable after switching to the thread of interest). The gdb ‘set scheduler-locking on/off’ and ‘info threads’ ‘thread N’ commands are also very handy for this sort of race condition debugging (this one was actually debugged by code inspection, but I wanted to see it in action to confirm that I had it right).

I suppose that I could have done this with a thread specific breakpoint. I wonder if that’s also possible (probably). I’ll have to try that next time, but hopefully I don’t have to look at race conditions like today’s for quite a while!

You can hit ‘c’ to continue at this point, but if it happens repeatedly in various threads (like when one thread is calling pthread_kill() to force each other thread in turn to dump its stack and stuff) this repeated ‘c’ing can be a bit of a pain.

For the same SIGUSR1 example above, you can query the gdb handler rules like so:

I found myself looking at some code and unsure how it would behave. Being a bit tired today I couldn’t remember if continue pops you to the beginning of the loop, or back to the predicate that allows you to break from it. Here’s an example:

Once the variable x is modified, sure enough we break from the loop (note the sneaky way you have to modify variables in gdb, using the print statement to implicitly assign). There’s no chance to go back to the beginning and reset rc = 1 to keep going.

The conclusion: continue means go to the loop exit predicate statement, not continue to the beginning of the loop to retry. In the code in question a goto will actually be clearer, since what was desired was a retry, not a retry-if.
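The same behaviour can be checked without a debugger. The following is a small self-contained sketch (not the code from the session above; passes_before_exit is a made-up name) in which continue is reached with the predicate variable already cleared:

```c
/* Counts how many times the body of a do/while loop runs when a
 * `continue` is executed on the third pass with rc set to 0. */
int passes_before_exit(void)
{
    int passes = 0;
    int rc;

    do {
        rc = 1;             /* top of the body: "keep going" */
        passes++;

        if (passes == 3) {
            rc = 0;         /* make the exit predicate false ... */
            continue;       /* ... and jump straight to it */
        }
    } while (rc);

    return passes;
}
```

If continue jumped to the top of the body, rc would be reset to 1 and the loop would keep going. Instead the predicate is evaluated with rc == 0 and the function returns after the third pass.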

Recently some of our code started misbehaving only when compiled with the GCC compiler. Our post mortem stacktrace and data collection tools didn’t deal with this trap very gracefully, and dealing with that (or even understanding it) is a different story.

Observe that there are two sets of signal handler frames. One from the original SIGILL, and another one that our “main” thread ends up sending to all the rest of the threads as part of our process for freezing things to be able to take a peek and see what’s up.

This has got the si_addr value 0x00002AB821393257, which also matches frame 9 in the stack for sqluInitLoadEDU. What was at that line of code doesn’t appear to be something that ought to generate a SIGILL:

Hmm. What is a ud2a instruction? Google is our friend and we find that the linux kernel uses this as a “guaranteed invalid instruction”. It is used to fault the processor and halt the kernel in case you did something really really bad.

Other similar references can be found, also explaining the use in the linux kernel. So what is this doing in userspace code? It seems like something too specific to get there by accident, and since the instruction stream itself contains this, stack corruption or any other sneaky nasty mechanism doesn’t seem likely. The instruction doesn’t immediately follow a callq, so a runtime loader malfunction or something else equally odd doesn’t seem likely either.

Perhaps the compiler put this instruction into the code for some reason. A compiler bug perhaps? A new google search for GCC ud2a instruction finds me

...generates this warning (using gcc 4.4.1 but I think it applies to most
gcc versions):
main.cpp:12: warning: cannot pass objects of non-POD type 'class A'
through '...'; call will abort at runtime
1. Why is this a "warning" rather than an "error"? When I run the program
it hits a "ud2a" instruction emitted by gcc and promptly hits SIGILL.

Oh my! It sounds like GCC has cowardly refused to generate an error, but also bravely refuses to generate bad code for whatever this code sequence is. Do I have such an error in my build log? In fact, I have three, all of which look like:

It turns out that agtRqstCB is a rather large structure, and certainly doesn’t match the %p that the developer used in this debug build special code. The debug code actually makes things worse, and certainly won’t help on any platform. It probably also won’t crash on any platform either (except when using the GCC compiler) since there are no subsequent %s format parameters that will get messed up by placing gob-loads of structure data in the varargs data area inappropriately.

This should resolve this issue and allow me to go back to avoiding the (much slower!) intel compiler that is used by our nightly build process.
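As a sketch of the kind of correction involved (in plain C, with a hypothetical structure and function name rather than the real agtRqstCB code), what %p expects is the address of the structure, cast to void *, and never the structure itself:

```c
#include <stdio.h>

/* Hypothetical stand-in for a large control block structure. */
struct request_cb {
    char payload[256];
    int  state;
};

/* Formats the address of a control block into `out`.  %p consumes exactly
 * one void * from the variadic arguments, so the pointer is passed, and the
 * structure contents stay out of the argument area.  Returns the snprintf
 * result. */
int format_cb_address(char *out, size_t outlen, struct request_cb *cb)
{
    return snprintf(out, outlen, "cb at %p", (void *)cb);
}
```

Passing the structure by value where the format string says %p is a mismatch on every compiler. GCC just happens to be the one that turns the C++ non-POD case into a hard stop.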

Note that the repeat count isn’t the total number of bytes to dump, but the total number of objects in the size specification:

(gdb) help x
Examine memory: x/FMT ADDRESS.
ADDRESS is an expression for the memory address to examine.
FMT is a repeat count followed by a format letter and a size letter.
Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
t(binary), f(float), a(address), i(instruction), c(char) and s(string).
Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
The specified number of objects of the specified size are printed
according to the format.
Defaults for format and size letters are those previously used.
Default count is 1. Default address is following last thing printed
with this command or "print".
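To spell out the arithmetic, here is a rough C analogue of what a hex x/FMT dump does (dump_units is a made-up helper, not anything from gdb's source): the count is a number of objects, so the number of bytes examined is count times the unit size.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats `count` objects of `unit` bytes each (1, 2, 4 or 8, like gdb's
 * b/h/w/g size letters) from `addr` into `out` as space separated hex.
 * Returns the number of bytes examined, i.e. count * unit, or 0 if the
 * unit size is unsupported or `out` is too small. */
size_t dump_units(const void *addr, size_t count, size_t unit,
                  char *out, size_t outlen)
{
    const unsigned char *p = addr;
    size_t used = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++, p += unit) {
        uint64_t v = 0;
        uint8_t  b;
        uint16_t h;
        uint32_t w;

        /* memcpy avoids unaligned access; values use host byte order. */
        switch (unit) {
        case 1: memcpy(&b, p, 1); v = b; break;
        case 2: memcpy(&h, p, 2); v = h; break;
        case 4: memcpy(&w, p, 4); v = w; break;
        case 8: memcpy(&v, p, 8);        break;
        default: return 0;
        }

        int n = snprintf(out + used, outlen - used, "%s0x%0*" PRIx64,
                         i ? " " : "", (int)(unit * 2), v);
        if (n < 0 || (size_t)n >= outlen - used)
            return 0;
        used += (size_t)n;
    }
    return count * unit;
}
```

So the equivalent of x/2xw covers 8 bytes, while x/2xb covers only 2.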

We have SLES10 linux machines, and the gdb version available on them is old (so old that it no longer works with the version of the intel compiler that we use to build our product). Here’s a quick cheatsheet on how to download and install a newer version of gdb for private use, without having to have root privileges or replace the default version on the machine:

Executing these leaves you with a private version of gdb in ~/gdb/bin/gdb that works with newer intel compiled code.

This version of gdb has some additional features (relative to 6.8 that we have on our machines) that also look interesting:

disassemble start,+length looks very handy (grab just the disassembly that is of interest, or when the whole thing is desired, no more hacking around with the pager depth to get it all).

save and restore breakpoints.

current thread number variable $_thread

trace state variables (7.1), and fast tracepoints (will have to try that).

detached tracing

multiple program debugging (although I’m not sure I’d want that, especially when just one multi-threaded program can be pretty hairy to debug). I recall many times when dbx would crash AIX with follow fork. I wonder if other operating systems deal with this better?

reverse debugging, so that you can undo changes! This is said to be target dependent. I wonder if amd64 is supported?

catch syscalls. I’ve seen some times when the glibc dynamic loader appeared to be able to exit the process, and breaking on exit, _exit, __exit did nothing. I wonder if the exit syscall would catch such an issue.

I’ve got stuff interrupted with the debugger, so I can’t invoke our external tool to collect stacks. Since gdb doesn’t have redirect for most commands, here’s how I was able to collect all my stacks, leaving my debugger attached:

(gdb) set height 0
(gdb) set logging on
(gdb) thread apply all where

Now I can go edit gdb.txt when it finishes (in the directory where I initially attached the debugger to my pid), and examine things. A small tip, but it took me 10 minutes to figure out how to do it (yet again), so it’s worth jotting down for future reference.

You may have to move up and down your stack frames to find the context required to make the call, or to get the parameters you need in scope. You have to think about (or exploit) the side effects of the functions you call.

Somewhat like modification of variables in the debugger, this capability allows you to shoot yourself in the foot fairly easily, but that’s part of the power.

I don’t recall if many other debuggers had this functionality. I have a vague recollection that the sun workshop’s dbx did too, but I could be wrong.

Unless you are running on a 128 way (and god help you if you have to actively debug with that kind of concurrency), most of your threads will be blocked all the time, stuck in a kernel or C runtime function, and only that shows at the top of the stack.

You can list the top frames of all your threads easily enough, doing something like:

then page through that output, and find what you are looking for, set breakpoints and start debugging, but that can be tedious.

A different way, which requires some preparation, is to dump the thread id to a log file. There’s still a gotcha with that though: you can see in the ‘info threads’ output that the thread ids (what you’d get if you call and log the value of pthread_self()) are big ass hexadecimal values that aren’t particularly easy to find in the ‘info threads’ output. Note that pthread_self() will return the base address of the stack itself (or something close to it) on a number of platforms since this can be used as a unique identifier, and linux currently appears to do this (AIX no longer does since around 4.3).

Also observe that gdb prints out (LWP ….) values in the ‘info threads’ output. These are the Linux kernel Task values, roughly equivalent to a thread’s pid as far as the linux kernel is concerned (linux threads and processes are all types of “tasks” … threads just happen to share more than processes, like virtual memory and signal handlers and file descriptors). At the time of this writing there isn’t a super easy way to dump this task id, but a helper function of the following form will do the trick:

You’ll probably have to put this code in a separate module from other stuff since kernel headers and C runtime headers don’t get along well. Having done that you can call this in your dumping code, like the output below tagged with the prefix KTID (i.e. what a DB2 developer will find in n-builds in the coral project db2diag.log).
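On reasonably current Linux systems, one way to get at the same kernel task id is the generic syscall() wrapper with SYS_gettid. The sketch below assumes glibc and uses only C runtime headers, so it is not necessarily the same form as the helper described above, and kernel_thread_id is a made-up name:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the Linux kernel task id of the calling thread, which is the
 * number gdb shows as "LWP" in its `info threads` output.  Linux specific. */
pid_t kernel_thread_id(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

Logging this value next to a tag such as KTID gives a small decimal number that can be matched directly against the LWP column, instead of hunting for a pthread_self() value. In the initial thread of a process it is the same as getpid().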