In last week’s episode we discussed how 32-bit processes on 64-bit Windows might corrupt the exception state after a crash, and how any process on 64-bit Windows might actually continue running after a crash. Serious stuff.

This week’s installment of “Failing to Fail” is less dramatic, but still important for developers who want robust software, as we cover failure to terminate and failures to record a crash dump.

Update: a technique for handling abort() was added to the post and to the sample code, July 22, 2012.

As a special bonus I also mention how to record crash dumps from all crashing processes on your machine, to make debugging easier than ever before.

Crashes happen. Any program more complicated than “Hello world” probably has some bugs. One measure of professional software development is how you deal with these crashes. What should happen is that the program should save a crash dump and then commit suicide (TerminateProcess() or _exit(), not ExitProcess() or exit()).

What you don’t want is for the doomed process to put up a dialog saying “Hey, I’m a doomed process”. But unfortunately that is what the Visual C++ C Run Time (VC++ CRT) does in some cases, as we see to the right.

If you accidentally call a pure virtual function (see the sample code for one possible way this can happen) then the handler for this brings up a dialog. If you’re a developer then you can attach a debugger and get a call stack, but most of the world is not developers. They don’t know what a pure virtual function call is, and they don’t care. Displaying this dialog just slows down the crash recovery process, while confusing your users.

But it’s worse than that. If you have a bevy of exception handlers ready to catch Win32 exceptions (access violations, etc.) then you will be disappointed because they won’t catch pure-call errors, even after someone presses OK. So, your in-house crash-dump recording system is helpless against this bug, which means it takes longer to get it fixed.

Worse yet, if this error happens on a server (I’ve seen it happen) then your headless server now has a hung process that is waiting for someone to click OK. Unit tests will eventually time out, and servers may time out if you have a watchdog, but the whole process is delayed by this dialog.

I wouldn’t be writing about this unless I had a solution to offer. The dialog above is the default behavior, but changing the default is simple enough once you know that you should. All you have to do is call _set_purecall_handler() with a function that intentionally crashes. My preferred implementation does a __debugbreak() followed by TerminateProcess(). If I’m running under the debugger this drops me into it quite neatly, and if I’m not then my unhandled exception filter will catch the exception and write out a minidump. The TerminateProcess() is there to discourage people who catch the exception in the debugger from trying to continue.

See the sample code for a concrete example of setting this up. You can use the menu options to try triggering pure-call errors with and without installing the error handler.

Invalid parameters aren’t technically crashes

The VC++ CRT detects a few types of invalid parameters to CRT functions and it treats them as fatal errors. This includes buffer overflow detection if you use the safer CRT functions (and you haven’t requested truncation), but the simplest way to trigger these checks is with “printf(NULL);”.

No dialog pops up – at least not in release builds – and the process is terminated, but it isn’t terminated through calling your carefully crafted exception handlers. Windows Error Reporting (WER) will be notified of the problem, which is good, but I want these invalid parameters treated like a crash so that my exception handlers get invoked. Luckily there is an easy solution for this problem as well. If you call _set_invalid_parameter_handler() then you can give it the same code (just with a different signature) as for your pure-call handler so that your exception handlers will notice something has gone wrong. And now your programs will be crashier than ever before. Which is a good thing. This technique is also demonstrated in the sample code.

WER is your friend

Windows Error Reporting (WER) is a handy feature built into Windows. Most developers know that WER records crash dumps on millions of users’ machines and stores them, and most developers know that it is possible to get access to the crash dumps for your software. This is a fabulous way of finding out where your software is actually crashing on actual customers’ actual machines. There are a few hoops to jump through, but it’s worth getting it set up. However, I have no special knowledge of how to arrange such access, so I will say no more.

A lesser known feature of WER is that you can get it to record crashes on your own machines. All you have to do is set a few registry keys. I’m gonna go out on a limb here and say that every C++ developer on Windows should configure this. It’s trivially simple and WER will sometimes catch crashes that your other systems do not. WER is great at catching process startup and shutdown crashes, crashes in processes you forgot to add minidump handling to, and it even records minidumps for pure-virtual function calls and invalid CRT parameters.

The full documentation is available here. If you spend two minutes configuring this (I have the last 30 crashes saved as full dumps in c:\temp\crashdumps) then you will be better able to investigate crashes on your machine, regardless of what process is crashing.
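
For reference, the registry values in question live under the WER LocalDumps key, and can be set from an elevated command prompt. The folder and count below match my own setup (30 full dumps in c:\temp\crashdumps); DumpType 2 requests full dumps, 1 requests minidumps.

```bat
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d "c:\temp\crashdumps" /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpCount /t REG_DWORD /d 30 /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 2 /f
```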

Update – one more missed failure type

Stefan Reinalter pointed out that some libraries will handle errors by calling abort(), and this can be another way for a process to fail without your crash handler being called. He also supplied the fix, which is to call signal(SIGABRT, &AbortHandler); to install a handler that will be called if abort() is called. signal() can also be used to install handlers for other types of failures.

Homework

It’s not enough to read about this, you have to actually do a tiny bit of coding and registry work to get things crashing smoothly. Here are your tasks.

Be sure to call _set_purecall_handler, _set_invalid_parameter_handler, and signal. If you use the DLL version of the CRT then calling them once per process is fine. If you use the static-link version of the CRT then you need to call them once for each copy of the CRT – once for each DLL that statically links the CRT. The sample code available here should help.

Configure the registry to save crash dumps on all of your machines, by following the simple directions here.

If you haven’t already then be sure to follow the instructions in last week’s post, including configuring VS to halt on first-chance exceptions, calling EnableCrashingOnCrashes(), and using SetUnhandledExceptionFilter() to catch crashes.


About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs.
I also unicycle. And play (ice) hockey. And juggle.

8 Responses to More Adventures in Failing to Crash Properly

I don’t think terminating on crash is always the best policy. For applications that manipulate user data, it’s often the worst policy. In most GUI apps, actions are invoked from the message loop; an exception handler wrapped around the message loop can catch and log[1] quite a large fraction of issues and let the program resume, rather than die. That way, the user has a chance to save their data. And I’m assuming here a sane language that wraps Win32 exceptions, abstract method (i.e. pure virtual) calls, floating point exceptions, null pointer dereferences etc. as language exceptions.

Of course there are caveats. The logic that reads and writes the data needs to be resilient. A consideration may be made for backups. A better solution is something like Office, where the app saves the partial document, restarts itself, and tries to leave the user where they left off, little the worse for wear; but some apps with a lot of modality in their UI may be inconvenienced quite a bit here. For example, an IDE in the middle of a complex debugging session; if some menu item is broken due to an abstract method call bug, it would be great if that failure was simply reported as an “Oops” (and logged[1] etc.) but the user still left so they can continue what may be a very intricate debugging scenario. I’ve had debugging situations that have cost me many hours to set up owing to variances of timing, heap randomization, etc., and throwing all that state away would not be appreciated.

[1] And by logging I mean capturing stack trace with line numbers etc., and dispatching them (with permission etc.) off so the underlying bugs can be fixed. I’m not talking about brushing bugs under the carpet here with a catch-all message dialog that dumps it all on the user.

There’s certainly room to disagree on this, but I still think fail-fast (crash instead of recover) is the best policy.

By definition when you continue after an unexpected occurrence you are in an unknown state. This can lead to security breaches (continuing after a crash is an easy way to defeat ASLR), data corruption, and many other problems. The security problem is a huge issue — these days a crash may not be a benign accident, it may instead be a symptom of an attack. I think that recovering and continuing is an attempt at kindness that can go horribly wrong.

However I think the main reason for crashing at the first sign of trouble is that this is the only way to be sure that the problem is taken seriously. Far too many teams ignore warnings of all types, whether these be compiler warnings, linker warnings, asserts, or crashes that are handled. Whether during testing or during production all teams know that crashes must be taken seriously and must be investigated.

> And I’m assuming here a sane language that wraps Win32 exceptions

A sane language that wraps Win32 exceptions sounds like an oxymoron to me. Python uses exceptions for its error handling but if it silently handled an access violation and translated it into a Python exception I would be disappointed.

Also, just recently one of my colleagues found that when building Metro apps using WinRT, the generated code swallows exceptions and aborts using a new “__abi_FailFast” method, which appears to call “_invoke_watson” directly. We resorted to some linker hacks to allow this to abort in a way that Breakpad could catch it: https://bugzilla.mozilla.org/show_bug.cgi?id=775378

I dug into this after looking at an xperf trace which showed a game spending 66% of its CPU time in AcXtrnal.dll!NS_FaultTolerantHeap::FthDelayFreeQueueFlush. That’s a huge amount of CPU time to consume while trying to hide application instability. All developers should disable this on their machines.

Why does CrashHandler in your sample not call __debugbreak()? Is it intentional that you just call TerminateProcess? I’d like to hear what you would recommend putting in an unhandled exception filter, and how you recommend writing minidumps!

I said in the article that my preferred implementation was __debugbreak() followed by TerminateProcess(), to ensure that a crash occurs and either drops into the debugger or saves a minidump, followed by exiting.

How to write minidumps would have to be a separate topic, but it’s actually not hard, although uploading them to your servers can be a bit of work. You can let Windows record them (and go through WinQual to retrieve them), use SteamWorks, or use MiniDumpWriteDump.

Ah, I thought the __debugbreak() was just for purecall, invalid parameter and abort(), and that unhandled exceptions somehow deserved different treatment. Thanks for the reply, and thanks for the excellent set of articles so far!