This is the analysis of a recent Dr. Watson (postmortem debugger) mini crash dump we received.

Debugging Steps using WinDBG

Open the .dmp file using WinDBG

Make sure your symbol server path is set correctly (File->Symbol File Path). Below is what I have set it to: srv*C:\Temp\Symbols*http://msdl.microsoft.com/download/symbols;

Unless you are using _NT_SYMBOL_PATH, you should save the workspace to keep your symbol path information; otherwise you will lose it the next time you start WinDBG. Remember to save the workspace before opening a dump file so that it applies to your next debugging session.
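As a small illustration of the srv*<local cache>*<server URL> layout used above, here is a minimal Python sketch that composes the same string and places it in _NT_SYMBOL_PATH. The helper name build_symbol_path is hypothetical, and setting the variable this way only affects the current process and anything it launches; it is not a permanent system setting.

```python
import os


def build_symbol_path(cache_dir: str, server_url: str) -> str:
    """Compose one symbol-server entry: srv*<local cache>*<server URL>."""
    return f"srv*{cache_dir}*{server_url}"


# Same cache folder and server URL as in the path shown above.
symbol_path = build_symbol_path(
    r"C:\Temp\Symbols",
    "http://msdl.microsoft.com/download/symbols",
)

# Visible only to this process and to child processes started from it.
os.environ["_NT_SYMBOL_PATH"] = symbol_path
print(symbol_path)
```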

In CLR 2.0, unhandled exceptions at the top of the stack on any thread will terminate the application

The exception occurred after a GC was triggered

The mark phase was triggered for generation 2, which means the process was under severe memory pressure

There are more than 1200 threads, which by itself amounts to about 1.2 GB of virtual memory for the user mode stacks alone (assuming the default reservation of 1 MB per thread). Each thread has a kernel mode stack in addition to its user mode stack. The kernel stack is small, but it adds another 3 virtual pages (12 KB) per thread, so 1200*12 KB of physical memory, because the kernel stack is resident in physical memory. There will also be around 100 MB or so in MEM_IMAGE.
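The arithmetic behind these figures can be checked with a short Python sketch. The 1 MB user mode stack reservation, the 4 KB page size, and the 2 GB user address space of a 32-bit process are assumptions made for illustration; the thread count and the MEM_IMAGE figure are the rough numbers quoted above.

```python
THREADS = 1200                      # "more than 1200 threads" in the dump
USER_STACK_BYTES = 1 * 1024**2      # assumption: default 1 MB reserved per thread
KERNEL_STACK_BYTES = 3 * 4096       # 3 pages of 4 KB each = 12 KB per thread
IMAGE_BYTES = 100 * 1024**2         # "around 100 MB or so" in MEM_IMAGE
USER_ADDRESS_SPACE = 2 * 1024**3    # assumption: 32-bit process, 2 GB user space


def mb(n_bytes: int) -> float:
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return n_bytes / 1024**2


user_stacks = THREADS * USER_STACK_BYTES      # virtual memory
kernel_stacks = THREADS * KERNEL_STACK_BYTES  # physical memory
remaining = USER_ADDRESS_SPACE - user_stacks - IMAGE_BYTES

print(f"user mode stacks  : {mb(user_stacks):.0f} MB virtual")
print(f"kernel mode stacks: {mb(kernel_stacks):.1f} MB physical")
print(f"left for heaps etc: {mb(remaining):.0f} MB of user address space")
```

Under these assumptions the stacks and images alone take well over half of the user address space, before any managed heap segment is counted.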

This exception occurred while starting a worker thread from the managed thread pool, and a GC was triggered at the same time, which suggests there was not enough memory.

There is no thread leak; this many threads is by design (one thread per request, I don’t know why)

Some of the interesting points

There are more than 200 threads with preemptive GC disabled. That means these threads can’t be suspended, and the GC threads will have to wait for them to return to preemptive GC mode before they can reclaim memory. Managed threads should not normally sit in a state with preemptive GC disabled; the function on the call stack, Thread::RareDisablePreemptiveGC, suggests by its name that this is a rare path.

Preemptive GC: also very important. In Rotor, this is the m_fPreemptiveGCDisabled field of the C++ Thread class. It indicates which GC mode the thread is in: “enabled” in the table means the thread is in preemptive mode, where the GC could preempt this thread at any time; “disabled” means the thread is in cooperative mode, where the GC has to wait for the thread to give up its current work (the work is related to GC objects, so it can’t allow the GC to move the objects around). When the thread is executing managed code (the current IP is in managed code), it is always in cooperative mode; when the thread is in the Execution Engine (unmanaged code), EE code could choose to stay in either mode and could switch mode at any time; when a thread is outside of the CLR (e.g., calling into native code using interop), it is always in preemptive mode.
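The three cases described above can be summarized in a tiny Python model. This is only a sketch of the rules as stated in the text, not CLR code: the Location enum, the function names, and the sample thread table are all hypothetical.

```python
from enum import Enum


class Location(Enum):
    MANAGED_CODE = "managed code"
    EXECUTION_ENGINE = "execution engine (unmanaged CLR code)"
    OUTSIDE_CLR = "outside the CLR (e.g. interop call)"


def preemptive_gc_disabled(location, ee_choice=None):
    """Return True for cooperative mode, False for preemptive mode."""
    if location is Location.MANAGED_CODE:
        return True    # managed code always runs in cooperative mode
    if location is Location.OUTSIDE_CLR:
        return False   # native code outside the CLR is always preemptive
    # Inside the EE the code picks its own mode and may switch at any time,
    # so the caller has to say which mode is currently in effect.
    if ee_choice is None:
        raise ValueError("EE code chooses its own mode; pass ee_choice")
    return ee_choice


def threads_gc_must_wait_for(threads):
    """Names of the threads whose preemptive GC is disabled."""
    return [name for name, disabled in threads.items() if disabled]


# Hypothetical snapshot of three threads.
snapshot = {
    "worker-1": preemptive_gc_disabled(Location.MANAGED_CODE),
    "worker-2": preemptive_gc_disabled(Location.OUTSIDE_CLR),
    "worker-3": preemptive_gc_disabled(Location.EXECUTION_ENGINE, ee_choice=True),
}
print(threads_gc_must_wait_for(snapshot))
```

In this model the GC has to wait for every thread in the returned list, which is why more than 200 such threads in one dump stands out.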

The CLR Rotor source code has the following comment before the call to Thread::RareDisablePreemptiveGC:

We must do the following in this order, because otherwise we would be constructing the exception for the abort without synchronizing with the GC. Also, we have no CLR SEH set up, despite the fact that we may throw a ThreadAbortException.

Most likely it crashed because there was not enough memory to start a worker thread, since the call stack at the time of the crash points to WorkerThreadStart

The real challenge is: how will you find out the exception detail for sure? How will you make sure that this is indeed an out-of-memory exception?

With this dump you cannot, because it is a mini dump with no information on CLR data structures. Effective managed debugging requires a mini dump with full virtual memory, because the managed heap is created with VirtualAlloc, but Dr. Watson as a postmortem debugger with default options creates a minidump that includes only the thread contexts and 10 instructions.