Debugging ASPNET_WP in Production

One of our production web servers keeps deadlocking the ASPNET_WP process, like so:

aspnet_wp.exe (PID: 3588) was recycled because it was suspected to be in a deadlocked state. It did not send any responses for pending requests in the last 180 seconds.
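If you want to tally how often this happens, the message is regular enough to parse. Here is a minimal Python sketch, assuming you have already exported the event text to a string; the function name `parse_deadlock_event` is made up for illustration, and the pattern is based only on the single message quoted above, so other ASP.NET versions may word it differently.

```python
import re

# Pattern built from the one sample message quoted in the post.
# Assumption: the wording is stable across events on this server.
DEADLOCK_RE = re.compile(
    r"aspnet_wp\.exe \(PID: (?P<pid>\d+)\) was recycled because it was "
    r"suspected to be in a deadlocked state\..*?"
    r"in the last (?P<seconds>\d+) seconds",
    re.IGNORECASE | re.DOTALL,
)


def parse_deadlock_event(message):
    """Return (pid, seconds_without_response), or None if the text does not match."""
    match = DEADLOCK_RE.search(message)
    if match is None:
        return None
    return int(match.group("pid")), int(match.group("seconds"))


if __name__ == "__main__":
    sample = (
        "aspnet_wp.exe (PID: 3588) was recycled because it was suspected to be "
        "in a deadlocked state. It did not send any responses for pending "
        "requests in the last 180 seconds."
    )
    print(parse_deadlock_event(sample))
```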

This is painful. It means the server becomes unavailable for over three minutes, and any pending requests return errors after ASPNET_WP is recycled. The best part is, this happens completely randomly. We can't force it to happen or reproduce it; we just have to wait for it to happen. And it inevitably does, several times per day. We went through all the normal troubleshooting procedures and exhausted them all, which left... the tough stuff.

The article contains an excellent walkthrough, but here's the Reader's Digest version of what you need to do:

Install the above tools on the web server with the problem. Unzip the dbgnetfx.exe contents to the debugging tools folder.

Use the command line tool adplus.vbs -hang -pn aspnet_wp.exe to generate a memory dump of the ASPNET_WP process (-pn takes a process name; -p expects a numeric process ID). This will create a folder containing a fairly large file (mine was ~90 MB) inside the debugging tools folder. This can be kind of a pain, because in hang mode you have to trigger the dump yourself while the process is actually hung, as in my case. The adplus_aspnet.vbs file has some special functionality to "kick in" automatically during crash or hang scenarios.
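Since you have to catch the process while it is hung, it helps to have the exact command line ready to go. A small Python sketch of assembling it follows; the helper name `build_adplus_command` and the output folder are my own placeholders. It relies on ADPlus accepting -pn for a process name (as opposed to -p for a numeric process ID) and -o for an output directory, and on the script being run through cscript.exe from the debugging tools folder.

```python
def build_adplus_command(process_name="aspnet_wp.exe", output_dir=None):
    """Assemble an ADPlus hang-mode command line as an argument list.

    Assumptions: adplus.vbs is in the current directory (the debugging
    tools folder) and is launched through cscript.exe.
    """
    cmd = ["cscript.exe", "adplus.vbs", "-hang", "-pn", process_name]
    if output_dir:
        # -o sends the dump folder somewhere other than the default location.
        cmd += ["-o", output_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical output folder; print the command rather than running it.
    print(" ".join(build_adplus_command(output_dir=r"c:\dumps")))
```

To actually run it from a script, pass the list to subprocess.run on the affected server while the hang is in progress.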

Fire up the windbg.exe application, and open the crash dump file via the drop-down menus. You will need to set the symbol paths (most importantly, including Microsoft's public http:// symbol server URL) as listed in the document; scroll down to the section titled "To enter the symbol paths, do one of the following:". The windbg app has a command line entry area at the bottom, near the status bar, so that's where you want to enter those symbol path commands.

At this point, skip directly to the .NET-specific debugging information, which relies on the windbg add-in "sos.dll". That's contained in the dbgnetfx.exe archive. Scroll down to .load SOS\sos.dll (er, "son of strike"? I want some of what they're smoking at MS!) and proceed from there.
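For reference, here is roughly what ends up being typed into that windbg command line, written as a Python sketch that just prints the commands in order. The local symbol cache folder c:\symbols and the helper name `windbg_session_commands` are placeholders of mine; the SOS\sos.dll path assumes the dbgnetfx.exe contents were unzipped into the debugging tools folder as described earlier.

```python
def windbg_session_commands(symbol_cache=r"c:\symbols"):
    """Return the windbg commands for a basic SOS session, in order.

    symbol_cache is an assumed local folder where downloaded symbols
    are cached; any writable folder should do.
    """
    symbol_server = "http://msdl.microsoft.com/download/symbols"
    return [
        # Point the debugger at Microsoft's public symbol server.
        f".sympath SRV*{symbol_cache}*{symbol_server}",
        # Reload symbols using the new path.
        ".reload",
        # Load the SOS extension from the SOS subfolder.
        r".load SOS\sos.dll",
        # List the managed threads.
        "!threads",
        # Dump the managed call stack of every thread.
        "~*e !clrstack",
    ]


if __name__ == "__main__":
    for command in windbg_session_commands():
        print(command)
```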

Once you've gone through all that rigmarole, you actually get some useful, .NET-specific information, such as all the thread info:

I have changed the name of our application to "CrazyApp" to protect the guilty, and I have simplified the dump to only two of the 14 threads. Based on these thread call stacks, it is now very clear what is going on here: we're blocking while waiting for database resources in the System.Data.OracleClient.DBObjectPool.GetObject method, on every single thread!
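With 14 threads you can eyeball this, but the same check is easy to automate. Below is a rough Python sketch, assuming the stack output has been saved as text and that each thread's section begins with a header line such as "Thread 3". That header format is an assumption (adjust header_re to whatever your SOS version prints), and the sample text is made up purely to exercise the function.

```python
import re


def count_blocked_threads(dump_text, frame, header_re=r"^Thread \d+"):
    """Return (threads_containing_frame, total_threads).

    header_re is an assumed pattern for the line that starts each
    thread's stack in the saved debugger output.
    """
    header = re.compile(header_re)
    stacks = []  # one list of stack lines per thread
    for line in dump_text.splitlines():
        if header.match(line.strip()):
            stacks.append([])
        elif stacks:
            stacks[-1].append(line)
    blocked = sum(1 for stack in stacks if any(frame in entry for entry in stack))
    return blocked, len(stacks)


if __name__ == "__main__":
    # Made-up sample, not real debugger output.
    sample = (
        "Thread 1\n"
        "System.Data.OracleClient.DBObjectPool.GetObject\n"
        "Thread 2\n"
        "System.Data.OracleClient.DBObjectPool.GetObject\n"
    )
    frame = "System.Data.OracleClient.DBObjectPool.GetObject"
    blocked, total = count_blocked_threads(sample, frame)
    print(f"{blocked} of {total} threads are waiting in {frame}")
```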