I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.

The prompt for crashing behaviour is to generate throughput in this system - socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw an exception (various places, various causes - including heap alloc failing, thus: heap corruption).

The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling hyper-threading, or one core of a dual-core CPU, reduces the rate of (but does not eliminate) corruption. This suggests a timing-related issue.

Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98, AKA MSVC6) the heap corruption is reasonably easy to reproduce - ten or fifteen minutes pass before something fails horrendously and excepts, like an alloc failing. When running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast as it can, box consuming 1.3 GB of 2 GB of RAM). So I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to identify the cause of a problem I can't reproduce.

My current best guesses as to where to go next are:

Get an insanely grunty box (to replace the current dev box: 2 GB RAM in an E6550 Core2 Duo); this should make it possible to reproduce the crash-causing misbehaviour when running under a powerful debug environment; or

Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.

And, no: Shipping with Purify instrumentation built in is not an option.

A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"

And now, the question: How do I locate the heap corruptor?

Update: Balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15 minutes, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.

Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.

A:

You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?

Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.

If you can find out what exactly CAN cause this problem, via google and documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.

You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?

I have reason to suspect this problem's been in the code base since the beginning of time. Further historic checking is unlikely to find a version that doesn't blow up, and there is a good chance a sufficiently old version just won't work against the host.

Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.

Logging isn't going to help; it would have to be everywhere and if enough is added then the program will go so slow the problem won't happen.

If you can find out what exactly CAN cause this problem, via google and documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.

The exceptions are caused by heap corruption.

Pageheap looks like what I'll be trying tomorrow, especially in this mode:

Full-page heap reveals corruptions in heap blocks by placing a non-accessible page at the end of the allocation. The advantage of this approach is that you achieve "sudden death," meaning that the process will access violate (AV) exactly at the point of failure. This behavior makes failures easy to debug. The disadvantage is that every allocation uses at least one page of committed memory. For a memory-intensive process, system resources can be quickly exhausted.

That sounds just the ticket. My only concern is that it will slow the system down so much that the problem stops happening. This seems to be a recurring issue - anything with any chance of finding the bug stops the bug from happening. But I'll try!

Update: Turns out Pageheap.exe is an alias for one of the behaviours of Microsoft Application Verifier. Too slow; no reproduction.

What does the failure look like exactly? You say "including heap alloc failing" - could that mean you're simply running out of memory? (I'm not up on Windows programming, but that could be a cause in the Linux world.)

What does the failure look like exactly? You say "including heap alloc failing" - could that mean you're simply running out of memory? (I'm not up on Windows programming, but that could be a cause in the Linux world.)

C++ says that out of memory results in std::bad_alloc being thrown. What I'm seeing is memory access exceptions ("hey, you can't read (or maybe write) there!").

Building with 2008 would have caught crazy crap like that... maybe even MSVC6, but I'm not sure.

MSVC6 won't catch that but Lint would. De-linting your code might be a good place to start. It's only $250 (nought compared to the amount of time saved in debugging).

Tip for first-time lint users: turn off everything and slowly turn stuff on. I started with unneeded headers and have worked my way up to about 20 items so far. When I ran it for the first time overnight on our product, it had more errors than lines of code!

Building with 2008 would have caught crazy crap like that... maybe even MSVC6, but I'm not sure.

MSVC6 won't catch that but Lint would. De-linting your code might be a good place to start.

You're not wrong.

Unfortunately, this code doesn't even compile without producing copious MSVC6 level 4 warnings, so I hate to think what PC-lint would have to say about it (and this is after I put in quite a bit of effort getting rid of the level 3 warnings!). Apparently the code base "periodically" gets PC-lint run over it, but I suspect the results are studiously ignored. It's a large code base, and tackling the issues in there would be non-trivial.

So from the limited information you have, this could be a combination of one or more things:

Bad heap usage, e.g., double frees, reads after free, writes after free, or setting the HEAP_NO_SERIALIZE flag while allocating and freeing from multiple threads on the same heap

Out of memory

Bad code (e.g., buffer overflows, buffer underflows, etc.)

"Timing" issues

If it's either of the first two but not the last, you should have caught it by now with pageheap.exe.

Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues. Things like not using acquire/release semantics for synchronizing access to shared memory with a flag, not using locks appropriately, etc.

At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.

Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if that either fixes the problem or results in more reproducible buggy behavior.

When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run with Debug->Start with Application Verifier, but you may already know that.)

You can also try debugging with Windbg and various uses of the !heap command.

We've had pretty good luck by writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug, they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions; then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free with the new malloc and free (don't forget realloc!), but in the long run it's very helpful.

In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:

Keep track of allocations to find leaks

Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there

memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of freed memory)

Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.

The apparent randomness of the memory corruption sounds very much like a thread synchronization issue - the bug reproduces depending on machine speed. If objects (chunks of memory) are shared among threads and the synchronization primitives (critical section, mutex, semaphore, other) are not on a per-object basis, then it is possible to get into a situation where a chunk of memory is deleted / freed while in use, or used after being deleted / freed.

As a test for that, you could add synchronization primitives to each class and method. This will make your code slower because many objects will have to wait for each other, but if it eliminates the heap corruption, your heap-corruption problem will become a code optimization one.

Graeme's suggestion of custom malloc/free is a good idea. See if you can characterize some pattern about the corruption to give you a handle to leverage.

For example, if it is always in a block of the same size (say 64 bytes) then change your malloc/free pair to always allocate 64-byte chunks in their own page. When you free a 64-byte chunk, set the memory protection bits on that page to prevent reads and writes (using VirtualProtect). Then anyone attempting to access this memory will generate an exception rather than corrupting the heap.

This does assume that the number of outstanding 64 byte chunks is only moderate or you have a lot of memory to burn in the box!

Is this happening in low memory conditions? If so, it might be that new is returning NULL rather than throwing std::bad_alloc. Older VC++ compilers didn't properly implement this. There is an article about legacy memory allocation failures crashing STL apps built with VC6.

Run the original application with ADPlus -crash -pn appname.exe.
When the memory issue pops up you will get a nice big dump.

You can analyze the dump to figure out what memory location was corrupted.
If you are lucky, the overwritten memory is a unique string and you can figure out where it came from. If you are not lucky, you will need to dig into the Win32 heap and figure out what the original memory characteristics were (!heap -x might help).

After you know what was messed up, you can narrow Application Verifier usage with special heap settings, i.e. you can specify which DLL you monitor, or which allocation size to monitor.

Hopefully this will speed up the monitoring enough to catch the culprit.

In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.

P.S:
You can use DebugDiag to analyze the dumps.
It can point out the DLL owning the corrupted heap, and give you other useful details.

This catches memory leaks and also inserts guard data before and after the memory block to capture heap corruption. You can just integrate with it by putting #include "debug.h" at the top of every CPP file, and defining DEBUG and DEBUG_MEM.

For static analysis, consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warnings, especially if your project still has unfixed L4 warnings.

PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.

In the little time I had, I solved a similar problem this way.
If the problem still exists, I suggest you do this:
Monitor all calls to new/delete and malloc/calloc/realloc/free.
I made a single DLL exporting a function for registering all calls. This function receives parameters identifying your source code location, a pointer to the allocated area, and the type of call, and saves this information in a table.
Each matched allocate/free pair is eliminated from the table. At the end (or whenever you need), you call another function to create a report of the remaining entries.
With this you can identify mismatched calls (new/free or malloc/delete) or missing frees.
If a buffer is overwritten somewhere in your code, the saved information can itself be corrupted, but each test run may detect or isolate a failure. Many runs will help identify the errors.
Good luck.

Do you think this is a race condition? Are multiple threads sharing one heap? Can you give each thread a private heap with HeapCreate, so they can run fast with HEAP_NO_SERIALIZE? Otherwise, a heap should be thread safe if you're using the multi-threaded version of the system libraries.


A:

A couple of suggestions. You mention the copious warnings at W4 - I would suggest taking the time to fix your code to compile cleanly at warning level 4 - this will go a long way toward preventing subtle, hard-to-find bugs.

Second - for the /analyze switch - it does indeed generate copious warnings. To use this switch in my own project, I created a new header file that used #pragma warning to turn off all the additional warnings generated by /analyze. Then, further down in the file, I turn on only those warnings I care about. Then use the /FI compiler switch to force this header file to be included first in all your compilation units. This should allow you to use the /analyze switch while controlling the output.