My name is Brad and I’ve been on the PFE team here at Microsoft for many years. Suffice it to say, I’m overdue for contributing to the team blog. I’ve seen lots of interesting (and not-so-interesting) issues with customers all over the world in my time at PFE. What follows is an issue I worked earlier this year. For me, the most interesting part of this issue was not so much finding the root cause as it was the process of discovering who was behind the root cause.

The Problem

It all started with a customer who reported their ASP.NET application had an OutOfMemory issue. These kinds of issues are not at all uncommon in the .NET world, and the trick usually comes down to finding what object(s) are rooted so that the .NET Garbage Collector can’t reclaim the memory associated with said object(s).

Getting data from the problem

They sent me a dump of the problematic application pool, mentioning that they dumped it two hours after they received notice about the OutOfMemoryException (OOM). My initial thought on this was that they had recycled the process and then obtained a dump of the fresh new process. This would obviously be no good since the w3wp instance exhibiting the OOM was gone, and the dump instead represented a new process instance with no memory pressure. Unfortunately, this wouldn’t be the first time I’d had this problem.

However, when I received the dump, I was pleasantly surprised to see that they had dumped the correct process instance. It was over 1GB in size, representing a 32-bit process with significant memory pressure. And when I looked at the length of time the process had been alive, that verified this was the same instance that threw the OOM.

After searching through the dump I found the problematic object that led to the OOM. And after talking with the right folks at the customer, we pieced together how the problem arose and worked out a resolution.

So one question was answered, but another one remained: How did this process manage to stay alive for a full two hours after getting the OOM? I’ve been debugging issues like this since .NET 1.0 was released, and I can tell you that this wasn’t a normal set of circumstances.

The thread that threw the OOM turned out to be the same thread that answered the question.

As seen in the stack above, after the OOM was thrown in frame 0x14, a message box was thrown in frame 0x8. A message box will keep the process alive until someone clicks OK, Cancel, etc. and the message box goes away. In short, message boxes in server-side processes are never a good thing since they will hang your application!

Once again, we have answered one question, but another question remains: the process stayed alive for two hours after it threw the OOM because a message box was thrown, but who threw the message box, and why?

Analyzing the data, Part 2

As is common when finding a message box that’s popped up in a process, I wanted to see what the message box said. From MSDN, we learn that the second parameter to user32!MessageBoxExA is the text of the message, and the third parameter is the text of the caption in the message box. Using the kb command, we can retrieve these parameters, and then dump them to find the values:

I’ve seen this message before, and it can have different underlying causes. The bottom line – at this point in the troubleshooting phase – is that it isn’t a custom message box thrown carelessly by application code. The trick now is to determine how the message box was thrown.

I made a few unusual observations about the call stack in question.

- First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common.

- Second, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb that appears to call the 7.1 C runtime’s abort() function in frame 0xa. Finally, a message box results. What a wild ride!!

- Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!

Fortunately, I’d been fooled by such an appearance before, so I knew to take a look at the raw stack first, before I went reading through the .NET source to see if mscorwks!InternalUnhandledExceptionFilter really does call for a message box to be thrown. When I looked at the raw stack around the frames 0xb through 0x8, this is what I found in between all the frames displayed from a kb command:

Aha! So there is, in fact, someone else in between the .NET Framework and the call to throw the message box.

Now, let’s say you still want some kind of proof that the .NET Framework doesn’t make the call to msvcr71!abort(), which results in a call to show the message box. I had my doubts that mscorwks.dll had the 7.1 CRT as a dependency. On a tip from fellow PFE Zach Kramer, you can prove this by running dumpbin /imports on the binary (Dumpbin is an old – but very handy – utility that still ships with Visual Studio). I could always obtain the binary by asking my customer for their mscorwks.dll, but it’s much easier to just use psscor2!savemodule.

Searching Imports.txt for the string ‘msvcr7’ comes up empty. In fact, the build of mscorwks used by my customer’s application imports from msvcr80.dll, the 8.0 CRT. This makes sense when you look at frames 0x14-0x13 in our call stack.
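If you'd rather not eyeball the file, the same search is easy to script. What follows is only a minimal sketch: the function name is made up for illustration, and it assumes nothing about the dumpbin output beyond "one item per line", so it simply does a case-insensitive substring match.

```python
from typing import List


def find_import_lines(dumpbin_output: str, needle: str) -> List[str]:
    """Return every line of the dumpbin output that mentions `needle`.

    The match is case-insensitive because DLL names may be written in
    either case. Hypothetical helper, not part of any debugger tooling.
    """
    needle = needle.lower()
    return [
        line.strip()
        for line in dumpbin_output.splitlines()
        if needle in line.lower()
    ]


# Typical use against the saved output (file name taken from the post):
#
#     with open("Imports.txt", encoding="utf-8", errors="replace") as f:
#         hits = find_import_lines(f.read(), "msvcr7")
#     print(hits or "no 7.x CRT imports found")
```

An empty list for ‘msvcr7’ corresponds to the "comes up empty" result described above; running it again with ‘msvcr8’ should surface the 8.0 CRT import.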

What if we employ the same technique on this ThirdParty.dll – will it show the module depends on the 7.1 CRT? Scanning the output of dumpbin /imports on ThirdParty.dll shows the following:

So we know this ThirdParty.dll has the 7.1 CRT as a dependency, but that alone doesn’t prove this component throws the message box. And according to my customer, the vendor of ThirdParty.dll had been approached many times in the past regarding these “phantom message boxes” hanging production applications. But the vendor had denied any involvement without explicit proof. So the presence of their dll on the stack next to a call to msvcr71!abort() might not suffice when I confronted them with this issue. I felt I still had some work ahead of me.

Deeper dive

Before continuing, let’s do a quick review of what we know and don’t know:

1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned the root cause of the OOM.

2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application.

3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework.

4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit).

5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()?

6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?

First, let’s tackle the question about connecting the dots seen in the call stack frames. There’s a pretty clear and concise explanation for this in an MSDN Magazine article from 2008.

When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing

In other words, ThirdParty.dll had registered its UEF before the .NET Framework. And its UEF took the default road of calling abort() and throwing a message box. After ThirdParty.dll registered its UEF, the .NET Framework then registered its UEF callback. But it didn’t want to be rude and step over the UEF that ThirdParty.dll had registered first, so it chains back to it. Therefore, the call to msvcr71!abort() comes from the UEF registered by ThirdParty.dll, not from the .NET Framework itself.
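To make the chaining order concrete, here is a toy model of it. This is not the Windows API or the CLR's real code, just a sketch: the class, the function names, and the log strings are all invented for illustration, and the only behavior it borrows from the description above is that installing a filter hands back the previous one, and that the later filter calls the earlier one first.

```python
from typing import Callable, List, Optional

# Result codes loosely modelled on the Win32 filter results (illustrative only).
EXCEPTION_CONTINUE_SEARCH = 0
EXCEPTION_EXECUTE_HANDLER = 1

Filter = Callable[[str], int]


class Process:
    """Toy process that holds a single top-level unhandled exception filter."""

    def __init__(self) -> None:
        self.top_filter: Optional[Filter] = None
        self.log: List[str] = []

    def set_unhandled_exception_filter(self, new_filter: Filter) -> Optional[Filter]:
        # Installing a filter replaces the current one and returns it,
        # which is what makes chaining possible.
        previous = self.top_filter
        self.top_filter = new_filter
        return previous

    def raise_unhandled(self, exception_name: str) -> int:
        # The "OS" only ever calls the topmost filter.
        if self.top_filter is None:
            return EXCEPTION_CONTINUE_SEARCH
        return self.top_filter(exception_name)


def install_third_party_filter(proc: Process) -> None:
    def third_party_filter(exception_name: str) -> int:
        # Stand-in for the filter that ends up in abort() and a message box.
        proc.log.append("third-party filter: abort() -> message box")
        return EXCEPTION_EXECUTE_HANDLER

    proc.set_unhandled_exception_filter(third_party_filter)


def install_runtime_filter(proc: Process) -> None:
    previous: Optional[Filter] = None

    def runtime_filter(exception_name: str) -> int:
        # Be a good citizen: give the earlier filter the first shot.
        proc.log.append("runtime filter: chaining to earlier filter")
        if previous is not None and previous(exception_name) == EXCEPTION_EXECUTE_HANDLER:
            return EXCEPTION_EXECUTE_HANDLER
        proc.log.append("runtime filter: own unhandled-exception processing")
        return EXCEPTION_CONTINUE_SEARCH

    previous = proc.set_unhandled_exception_filter(runtime_filter)
```

Registering the third-party filter first and the runtime filter second, then raising an unhandled exception, logs the runtime filter followed by the third-party filter. That is the same order the frames show: the runtime's UEF sits on the stack, but the work that produces the message box belongs to whoever registered earlier.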

Next, let’s tackle the question about msvcr71!abort() throwing a message box. Since this ThirdParty.dll was using the 7.1 build of the CRT, let’s look in MSDN for the information on msvcr71!abort():

abort determines the destination of the message based on the type of application that called the routine. Console applications always receive the message through stderr. In a single or multithreaded Windows application, abort calls the Windows MessageBox API to create a message box to display the message with an OK button. When the user clicks OK, the program aborts immediately.

To influence the behavior of abort(), simply call _set_error_mode in the DllMain so that it doesn’t exercise the default behavior of throwing a message box. MSDN’s documentation on _set_error_mode states you can use _OUT_TO_STDERR for the lone parameter, and this will avoid the message box when abort() is called.

A colleague of mine, Senior Escalation Engineer Bret Bentzinger, offered to write some sample code that would load ThirdParty.dll and test this proposed resolution of passing _OUT_TO_STDERR to _set_error_mode. Doing this confirmed that no message box was thrown, and the thread exits without hanging the process.

In the end, this problem of throwing a message box from the CRT’s call to abort() isn’t a new one. This issue has been around for ages. But it was my first opportunity to drive such an issue and see it through to a resolution. My customer got a fix from the vendor, and we were able to help the vendor write better, more stable code. It was a win-win situation!

I was delivering a PowerShell class, and the question of how to create a remote printer mapping came up.

Turns out that enterprise administrators may have the need to help users with their printer connections, setting them up for them.

As I started thinking about the problem, it dawned on me that drive and printer mappings are user specific. They are stored as part of a user’s profile, and since the user profile does not get loaded until the user is actually logged on, it looks like little can be done.

Domain user accounts do have a couple of properties that allow administrators to establish a “Home” share, and to assign it a drive letter. Beyond that, they are on their own.

However, there is a solution!

For printers, in particular, it turns out that there is a feature that allows setting up a “global” printer mapping. This mapping will be created for the computer, and any user that logs on will be able to use it (provided (s)he has permissions).

The regular Add Printer wizard only allows creating a printer mapping for the user running the wizard:

So that will not help. However, the PrintUI.dll library exposes the functionality to create computer-specific printer mappings!

Simply run the PrintUIEntry (case sensitive) entry point for that library using rundll32 and pass the appropriate parameters… and voilà, the printer mapping is created.

What? How do you do that? Thanks for asking!

Open a PowerShell window (if you still are in the old days, go ahead, use CMD.exe instead) and type:

RunDLL32 PrintUI.dll PrintUIEntry /?

to get a help dialog with information on usage:

To create a global mapping, use the Global Add command with the /n parameter:

RunDLL32 PrintUI.dll PrintUIEntry /ga /n\\SERVER\PRINTER_NAME

To delete the mapping, use the Global Delete command:

RunDLL32 PrintUI.dll PrintUIEntry /gd /n\\SERVER\PRINTER_NAME

If you want to do this remotely on another computer, the usage indicates that you can specify the computer name with the /c parameter.
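If you need to do this for more than a couple of machines, it can help to build the command line in a script. The helper below is a rough sketch, not an official tool: the function name is made up, it writes the entry point in the comma-separated form (printui.dll,PrintUIEntry) rather than the space-separated form used above, and it assumes the computer name is passed as /c\\COMPUTER. The /ga, /gd, /n and /c switches themselves are the ones shown in the usage dialog.

```python
from typing import List, Optional


def build_printui_command(
    action: str,
    printer_unc: str,
    computer: Optional[str] = None,
) -> List[str]:
    """Build the rundll32 argument list for a global printer mapping.

    action      -- "add" (/ga) or "delete" (/gd)
    printer_unc -- printer path such as \\\\SERVER\\PRINTER_NAME
    computer    -- optional target computer name for the /c switch
    """
    switches = {"add": "/ga", "delete": "/gd"}
    if action not in switches:
        raise ValueError("action must be 'add' or 'delete'")

    args = ["rundll32", "printui.dll,PrintUIEntry", switches[action]]
    if computer:
        # Accept the name with or without leading backslashes.
        args.append("/c\\\\" + computer.lstrip("\\"))
    args.append("/n" + printer_unc)
    return args


# On a Windows machine the list can be handed to subprocess, for example:
#
#     import subprocess
#     subprocess.run(build_printui_command("add", r"\\SERVER\PRINTER_NAME"), check=True)
```

Keeping the arguments as a list rather than one long string sidesteps most quoting problems with the backslashes in the UNC path.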

The topic came up a while back on one of our aliases – How can I dump out a string?

There are a lot of ways to do this in the debugger and there are also debugger extensions that will help out with it. What most people in WinDBG do is use the “du” command for Unicode strings and “da” for ASCII strings. However, this has a problem in that the output looks like:

You cannot easily cut and paste this into other places. You have the addresses and such. Also, this only handles a few lines. If you check out the help for the du commands there is a /c switch which says only dump the characters instead of all the characters. You would end up with a command like “du /c”. To take it one step further you can actually create an alias: