Debugging a rare / unreproducible bug

I don't know where else to ask this because it's not quite a programming question. Like the title says, how do I debug a rare / unreproducible bug? My application is behaving strangely. Sometimes it crashes back to Windows and displays the usual error message with the Send / Don't Send buttons (you know what I'm talking about), yet other times it runs normally. The application has to run for at least a whole day at a time, so I need it to be robust. There's one more bug that shows up when I test it for a whole day: it makes the screen go black. I'm quite sure that it's not a screen saver / hardware safe mode issue, because when I press CTRL+ALT+DEL the Task Manager shows up. So, can anybody help me here? How do you usually track down a bug that happens randomly like this? How do you usually test an application for robustness? Thanks in advance.

Code review. It's probably a buffer overrun or a null pointer dereference somewhere. Some may suggest a debugger, but I don't think you're at that point yet. Instead, try looking through your code for "off-by-one" errors, bad loop conditions, etc. Go through all the major constructs and functions and ask yourself, "under what circumstances can this possibly go wrong?" And short of random bit-flipping caused by cosmic rays, try to code for every contingency.

I have done that but can't see anything weird. BTW, one more thing I don't get: the program is treated exactly the same when it crashes and when it doesn't. There's no input from any external source, not even the keyboard. The application just runs its own routines all day long, at least for now. Maybe the problem is a null pointer dereference like you said, but I don't know which one, because object creation / deletion is automated by the application itself with some sort of scheduler.

Install WinDbg and just run the program from within the debugger (no breakpoints or anything, just run it).
If it does crash, at least you'll find yourself inside the debugger, and not at a meaningless dialog going nowhere.

Make sure you are setting pointers to NULL once you free the memory they point to. If you do int *ptr = malloc(MAGIC_NUMBER); and later free(ptr); and then dereference ptr before setting it to NULL or re-assigning it to some other chunk of memory, you are likely to experience the exact problem you are describing. ptr is a dangling pointer from the moment you free it until you re-assign it.

Once you call free, anything can happen to that block of memory, from nothing to the operating system reclaiming it for another process. If you get "lucky" it will be left alone, even after additional calls to malloc or new. For instance, malloc's algorithm could be skipping this block for some reason, so every time you dereference that dangling pointer, you happen to be referring to the old value and everything seems to work.

Later (the next few cycles, an hour later, however long it takes for you to call malloc enough that it arrives back at that particular block of memory) the memory is finally overwritten with something else. Now when you dereference ptr, it might give weird results, or it might cause a seg fault, or it might continue operating with no obvious effect. Since you have no way to know what the memory block was overwritten with, you have no way to know how or why it is acting that way (unless you use a debugger and look at the value of the block ptr is pointing to).

So, while you don't see any difference between the first run and the second, the algorithm malloc is using might be taking different paths, or the operating system might be shuffling memory around, and the dangling pointer is left out of the loop because the memory it points to is technically marked as available.
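To make the advice above concrete, here is a minimal sketch in C. The FREE_AND_NULL macro, struct task, and task_id_or_default are hypothetical names made up for illustration; they are not from the original poster's code. The idea is only that a pointer which is NULL after free can be checked, while a dangling one cannot.

```c
#include <stdlib.h>

/* Hypothetical helper: free the block and clear the caller's pointer,
   so a later use can be detected instead of reading stale memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Hypothetical scheduler-owned object, for illustration only. */
struct task {
    int id;
};

/* Returns the task id, or -1 if the task has already been destroyed. */
int task_id_or_default(const struct task *t)
{
    return (t != NULL) ? t->id : -1;
}
```

Note that this only clears the one pointer variable passed to the macro. Any other copies of the same address elsewhere in the program are still dangling, so it narrows the problem rather than eliminating it.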

abachler: "A great programmer never stops optimizing a piece of code until it consists of nothing but preprocessor directives and comments "

Adding logging to the code would also help. A log shows what the program was doing, and even if it doesn't show where or why things actually go wrong, it helps you work out the steps needed to reproduce the problem. Recording each user action and/or data input is useful - perhaps you will then notice something that differs between the crashing and non-crashing runs of the same steps.
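A logging helper along these lines might look like the sketch below. log_event is a hypothetical name, and the timestamp format is just one reasonable choice. The important detail for crash hunting is the fflush after every line, so that the last thing the program logged has left the C library's buffer before the process dies.

```c
#include <stdio.h>
#include <stdarg.h>
#include <time.h>

/* Hypothetical helper: append one timestamped line to the log and
   flush immediately, so the last line is not lost in a crash. */
void log_event(FILE *log, const char *fmt, ...)
{
    time_t now = time(NULL);
    struct tm *tm_now = localtime(&now);
    char stamp[32] = "unknown-time";
    va_list args;

    if (log == NULL)
        return;
    if (tm_now != NULL)
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm_now);

    fprintf(log, "[%s] ", stamp);
    va_start(args, fmt);
    vfprintf(log, fmt, args);
    va_end(args);
    fputc('\n', log);
    fflush(log);
}
```

A call such as log_event(logfile, "created object %d", id) at each object creation and deletion would leave a trail showing what the scheduler did last before a crash.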

--
Mats

Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.

Remember that under Windows, you have to call CloseHandle() on a thread handle after the thread finishes, or you leak handles and eventually the OS will refuse to give you any more of them. A simple way to check whether this is the case is to look in Task Manager under the Performance tab and see if the handle count is gradually creeping up.

Lots of replies already. Thanks guys. I'll try to answer them all at once.

@Salem: It's the release build. Now that I think about it, the debug version won't run because it always shows an error. The weird thing is that the release build doesn't show this particular error at all. FYI, the code that gives the error is actually from another programmer. He said it's fine as long as the release build doesn't show this error, and I just took his word for it.

@jEssYcAt: Maybe that's the problem. I actually have an idea who the culprit is (see the reply to Salem above). I admit, I haven't checked this code at all. I just assumed it worked based on what that programmer said. He also uses this code in his own application, and it seems to work just fine.

@CornedBee: What's a code linter?

@matsp: Actually, that idea has occurred to me. But I haven't done it yet because I still need to do some other things.

@medievalelks: What's a Bounds Checker?

@abachler: Not that I know of. But I don't know if the other guy used a thread in his code.

I'm with Sang-drax and Salem here: It is highly likely that the debug build is "correctly" pointing out something that is wrong, whilst the release build is missing it, and most of the time it's not making a lot of difference, but sometimes causes a crash. Typically, this is "out of bounds" on memory allocations or arrays.
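One common way a debug build "catches" what a release build misses is an assertion that is compiled out when NDEBUG is defined. The sketch below is only an illustration of that mechanism: SLOT_COUNT, set_slot, and get_slot are hypothetical, and the actual error in the other programmer's code may come from a different check entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 8  /* hypothetical table size */

static int slots[SLOT_COUNT];

/* In a debug build the assert stops the program at the first bad index.
   With NDEBUG defined (typical for release builds) the assert compiles
   to nothing, so a bad index writes outside the array. That is undefined
   behaviour and may go unnoticed until it corrupts something important. */
void set_slot(size_t index, int value)
{
    assert(index < SLOT_COUNT);
    slots[index] = value;
}

/* Same check on the read side. */
int get_slot(size_t index)
{
    assert(index < SLOT_COUNT);
    return slots[index];
}
```

If the debug build always stops on a check like this, the release build is very likely running the same bad access without reporting it.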

--
Mats


A bounds checker is a program or a library that examines your program as it runs, detecting any buffer overruns or memory errors. The only one I've really used is Valgrind, which is fantastic, but unfortunately it doesn't run on Windows. I've heard of Purify, Electric Fence, and dmalloc, but never tried any of them.
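As a rough idea of how such tools can notice an overrun, here is a minimal sketch of the "guard bytes" technique in C. It is not how Valgrind or any of the other tools named above are actually implemented; guarded_alloc, guard_intact, GUARD_SIZE, and GUARD_BYTE are all hypothetical names for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 8     /* hypothetical number of guard bytes */
#define GUARD_BYTE 0xAB  /* hypothetical fill pattern */

/* Allocate `size` usable bytes followed by GUARD_SIZE bytes holding a
   known pattern. Release the block with free() as usual. */
void *guarded_alloc(size_t size)
{
    unsigned char *block;

    if (size > SIZE_MAX - GUARD_SIZE)
        return NULL;
    block = malloc(size + GUARD_SIZE);
    if (block == NULL)
        return NULL;
    memset(block + size, GUARD_BYTE, GUARD_SIZE);
    return block;
}

/* Returns 1 if the guard bytes are intact, 0 if something wrote past
   the end of the usable area. */
int guard_intact(const void *ptr, size_t size)
{
    const unsigned char *guard = (const unsigned char *)ptr + size;
    size_t i;

    for (i = 0; i < GUARD_SIZE; i++) {
        if (guard[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

Calling guard_intact on each buffer at regular intervals, or just before freeing it, would narrow down when an overrun happened. It only catches writes that land in the guard area directly after the buffer, so a real tool is still the better option where one is available.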

dwk

Seek and ye shall find. quaere et invenies.

"Simplicity does not precede complexity, but follows it." -- Alan Perlis
"Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
"The only real mistake is the one from which we learn nothing." -- John Powell