Sunday, June 03, 2007

Production Debugging for Hung ASP.Net 2 applications – a crash course

If your app is down right now skip the intro – I’ll understand.

Introduction

First up: debugging – even in the development environment – should never be a substitute to detailed diagnostic logging and appropriate – meaningful – error handling. Any time I fix an exception, if the problem is initially non-obvious, I first fix the logging code. Then and only then do I fix the exception.

Baking in a support mentality early in the development process is the only way to avoid an application that’s hard (and by that I mean expensive) to support. The customer might be paying for the support time, but if you can’t turnaround support issues quickly – and practically any issue in a production business application needs turning around quickly – they start questioning your competence.

But sometimes logs aren’t enough, and even with the best will in the world, sometimes they’re not sufficiently detailed (they won’t - for example – be much help tracking down that memory leak). Sometimes you need to know what’s thrashing right now; to see the smoking gun.

Enter the world of crash dump debugging.

I’d been following Tess’s blog for a while now, but still had a big hole in my understanding. Sure I now knew I could use the !clrstack command to start looking at managed thread stacks, but where the hell was I typing that? This then is my introduction to Production Debugging for .Net applications (or ‘how to understand what Tess is talking about’).

I’m going to be talking about debugging hung .Net 2 ASP.Net applications, but much of this is relevant to debugging any .Net application in a production or testing environment.

Getting the Dump

Assuming you don’t have time to read a more detailed article on this (and the Microsoft guide is 170+ pages!), just do this:

WARNING: Installing the debugging tools on your production server isn’t recommended. That’s not to say you have any choice in the matter, but you should be aware there are security and availability implications that – if your app is hung – you probably don’t care about right now.

If your app is thrashing CPU, use TaskManager to get the process ID (PID) of the W3WP process that’s using the most CPU (there may be more than one). NB: PIDs aren’t displayed by default: use View\Select Columns to turn them on.

At the command prompt CD to wherever you installed IISState (C:\IISState is the default). If there is existing content in the output folder, delete or rename the folder. Then execute the following:

IISState –p <PID> -d

(…replacing <PID> with the PID you determined above)

WARNING: At this point your app will become unresponsive until the dump completes

IISState will put a dump file (.dmp) and a text summary (.log) of the process’s threads etc… in the output directory.

Rename the output folder to something that reflects the date/time, so future dumps don’t overwrite it.

Since we’re doing a ‘hang’, ideally at this point you’d wait a few mins, and then take another dump. If you don’t, you may regret it later and / or have to wait for the issue to re-arise before you can definitively pin it down. However, if the world and his dog want the site back up right now then skip it – it’s your call.

Kill that process. IIS will recycle it, and things should go back to normal for now. Panic over.

Analyzing the Dump

Copy the renamed output directory down onto your machine. Forthe process ID you dumped there will also be an IISState summary file, eg IISState-5090.log. This log is a dump of the stack trace of all the threads in the process, and attempts to perform some additional diagnostics for you. Unfortunately, it doesn’t (at present) support .Net 2, so most likely it’s not going to tell you what the problem is. You should at least be able to see a few threads that look something like this:

What does this mean? Well IIState doesn’t understand what’s going on with the stack below that mscorwks dll, which screams .Net 2 stack to you (as does the comment against Thread Type). But the likelihood is there’s not much else we’re going to get out of that file[1]. So it’s time to look at the mini dump.

Open a Visual Studio 2005 Command Prompt window, and from there launch C:\Program Files\Debugging Tools for Windows\windbg.exe (this ensures the .Net 2 framework is in the PATH for the WinDBG process, and makes life simpler –if you don’t have VS2005 installed do it manually). Do File \ Open Crash Dump and brose to the dmp file you created with IISState. If it asks you about saving workspace information just hit no.

It’ll open a text window with some summary information about the dump. This window has a little command bar at the bottom. This is where you do stuff.

The first thing it probably says is “Symbol search path is: *** Invalid ***”. So we type .symfix into the command bar to set it up to dynamically load symbols from Microsoft as required. You can do something to your %path% so you don’t have to do this each time, but I’ve not got round to that yet.

…but this is exactly what it says it is – it’s the console debugger attaching (how IISState got the dump), and not a ‘real’ exception per-se.

You can type ~ to get a list of the threads, or ~ *kb to get them with a bit of stack trace too, but we’re still only looking at unmanaged threads here, so pretty meaningless to me.

Type .load SOS to load the managed code debugging extension SOS.dll, then type !threads to see all the managed threads. Now that’s more like it.

NB: If you’re debugging on a different platform to the original crash (eg server was Win2003, you’re on XP), then you’re going to get an error at this point “Failed to load data access DLL, 0x80004005”. The error message contains the fix: execute .cordll -ve -u -l and the correct version of mscordacwks.dll for the dump will automatically be loaded, eg:

Notice how the last column tells you what type of thread it is. You can get the managed stack trace for each one of these threads by selecting which thread you’re interested in from the Processes and Threads window (up on the toolbar) and typing !clrstack, but I find it more useful to just look at all of them straight off: ~*e !clrstack

You’ll see a lot of this:

OS Thread Id: 0x84 (0)Unable to walk the managed stack. The current thread is likely not a managed thread. You can run !threads to get a list of managed threads in the process

...because what you're doing is looking at the managed stack trace for all threads (and many of them won't be .Net threads remember), but eventually you’ll see some managed stack traces pop up.

In my case it was a bit of an open-and-shut case, because there was only one .Net threadactually doing anything in the process – kinda a giveaway:

Ok, so we’re dying in a 3rd party component. But we’re still going to need to replicate this somehow because (presumably) it doesn’t do this all the time. Fortunately in this case all we need to know is the arguments supplied to our GenerateDocumentById() method to attempt to reproduce in dev. They’re in the dump too, so let’s see what they were.

Select that thread (12) using ~12s (or the Processes and Threads window, and no, I don’t know what the ‘s’ suffix means) and use !clrstack –p to view the stack with parameters. It’s a big list, but in there somewhere is the bit I’m interested in:

The bit I care about on this is the documentId field itself, and being a value-type field (Int32) I can see the value right there: 8229991 (for a reference type like System.String I'd get another address to the instance itself, ie have to perform another !do).

And in this case, that was that. Well, it turned out to affect more than just this document ID, but at least I had enough to reproduce in the dev environment. And just in case anyone read the stack trace carefully, by the time we got to this stage Aspose already had a fix out for this issue, so I can’t really knock them. Finally whilst I was at it, I had a look at the objects in memory and began to wonder whether I’ve got a memory leak, so there’s probably a follow up to come.

Other SOS commands to look at:

!threadpool (also this is where the CPU utilization hides)!eeheap –gc (see memory allocations that are subject to the GC)!dumpheap –stat (see object allocations by type)!help (gives help on the last extension loaded, so SOS commands in our case)