Today I had to debug a crash in my application that happened in a really weird place. I examined the core dump and found a segmentation fault in the time() function, which is part of the standard library.

Evidently, the chances that there was a bug in time() were pretty slim, so the problem must have been elsewhere, and merely manifested itself as a crash in time().

Opening the core in gdb, the GNU Debugger, and running the backtrace command showed that the stack was corrupted: instead of a nice backtrace leading all the way to main (or to the clone() call that created my thread), I had about 6 levels of proper stack and then over 900 levels of “??” below that. Another obvious hint was that the hexadecimal addresses shown in those invalid stack levels were completely different from the ones seen in the proper stack levels.

Stack corruption like this can only mean one thing: something wrote past its bounds on the stack and replaced the saved frame data, such as return addresses, with garbage.

I then proceeded to look at the stack contents, hoping to find the point where values started to look odd. In gdb, I ran the backtrace full command, which shows the local variables of every stack level as well.

From there on, it was easy to spot a char[] buffer at the lowermost valid stack level that was being updated by the functions it called (the frames listed above it in the backtrace). If that buffer had overflowed, it would certainly make everything from there on in the stack invalid.
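To make the overflow scenario concrete, here is a minimal sketch in C. It does not touch the real stack: the struct fake_frame and both helper functions are hypothetical names made up for illustration, and the struct only imitates a frame in which a pointer sits right after a buffer. A real compiler is free to lay out locals differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one stack level: a buffer followed by a pointer.
   This is an assumption for illustration, not the real frame layout. */
struct fake_frame {
    char buf[8];
    const char *name;   /* lives after buf (and after any padding) */
};

/* Writes n bytes starting at buf[0], the way an unchecked fill would.
   Staying within the struct keeps the demo well-defined. */
static void unchecked_fill(struct fake_frame *f, unsigned char value, size_t n)
{
    assert(n <= sizeof *f);
    memset(f, value, n);   /* buf is the first member, so this starts at buf[0] */
}

/* Returns nonzero if the bytes of the pointer member no longer match
   the expected pointer. Compares raw bytes, never dereferences. */
static int pointer_clobbered(const struct fake_frame *f, const char *expected)
{
    return memcmp(&f->name, &expected, sizeof expected) != 0;
}
```

Filling exactly sizeof buf bytes leaves the pointer alone. Filling any further starts replacing its bytes, and that is the kind of damage that turns neighbouring values into garbage.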

There were other pointers in that stack level right after the suspicious buffer. Using the up command, I moved up frame by frame until I reached that stack level, and then checked the pointers using the print command. Indeed, gdb reported that it could not access the memory at those addresses: the pointers were invalid.

With the down command I went down the stack, right to the function that was manipulating that buffer. A quick look at the code, combined with checking the values of local variables with print, confirmed my suspicions. An off-by-one error made my loop go beyond the end of the buffer, corrupting the stack and causing the crash.
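The actual loop is not shown here, so the following is only a sketch of one common shape of off-by-one bug, assuming the loop condition used <= where < was intended. The names struct frame, guard, fill_buggy and fill_fixed are hypothetical. The guard byte stands in for whatever follows the buffer, so the extra write can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

enum { BUF_SIZE = 8 };

/* Hypothetical layout: one guard byte directly after the buffer,
   standing in for the data that follows it on the real stack. */
struct frame {
    char buf[BUF_SIZE];
    char guard;
};

/* Buggy version: `<=` makes the loop run BUF_SIZE + 1 times. */
static void fill_buggy(struct frame *f)
{
    char *p = (char *)f;   /* byte view of the whole struct */
    assert(offsetof(struct frame, guard) == BUF_SIZE);
    for (size_t i = 0; i <= BUF_SIZE; i++)
        p[i] = 'x';        /* the last iteration lands on guard */
}

/* Fixed version: `<` stops at the last valid index, BUF_SIZE - 1. */
static void fill_fixed(struct frame *f)
{
    for (size_t i = 0; i < BUF_SIZE; i++)
        f->buf[i] = 'x';
}
```

After fill_fixed the guard byte is untouched. After fill_buggy it has been overwritten. On a real stack, that one extra write lands on whatever the compiler placed after the buffer.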

As for time()? It didn’t really crash there. Its return value was being assigned through an address that had been made invalid by the stack corruption, and gdb couldn’t tell whether the crash happened inside time() or while storing its return value, probably due to compiler optimizations.
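As a rough illustration of that last point, the sketch below splits the call and the store into two separate steps. The function store_time_checked and its known_good parameter are hypothetical, and the pointer comparison is only a stand-in for "this pointer was overwritten". Real code would not normally have a trusted copy to compare against.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: stores the current time through `out`, but only
   if `out` still matches a copy of the pointer taken earlier. */
static int store_time_checked(time_t *out, time_t *known_good)
{
    time_t now = time(NULL);   /* step 1: the call itself completes normally */

    if (out != known_good)     /* pointer changed since it was set up */
        return -1;             /* a plain `*out = now` would be the faulting step */

    *out = now;                /* step 2: the store through the pointer */
    return 0;
}
```

Written as a single statement, `*out = time(NULL);`, both steps share one source line, which may be part of why the debugger pointed at time().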

Posted by hisham on Tuesday, February 14, 2012 16:33:31 in en_US, Coding