Task Exception Handling in .NET 4.5

For the .NET Framework 4.5 Developer Preview, a lot of work has been done to improve the Task Parallel Library (TPL), in terms of functionality, in terms of performance, and in terms of integration with the rest of the .NET Framework. With all of this work, we’ve strived for a very high compatibility bar, which means your applications that use TPL in .NET 4 should “just work” when they upgrade to run against .NET 4.5 (if they don’t, please let us know, as it’s something we’ll want to fix). Even with such a high compatibility bar, however, there are a few interesting behaviors to be aware of when it comes to Tasks and exception handling.

Unobserved Exceptions

Those of you familiar with Tasks in .NET 4 will know that the TPL has the notion of “unobserved” exceptions. This is a compromise between two competing design goals in TPL: to support marshaling unhandled exceptions from the asynchronous operation to the code that consumes its completion/output, and to follow standard .NET exception escalation policies for exceptions not handled by the application’s code. Ever since .NET 2.0, exceptions that go unhandled on newly created threads, in ThreadPool work items, and the like all result in the default exception escalation behavior, which is for the process to crash. This is typically desirable, as exceptions indicate something has gone wrong, and crashing helps developers to immediately identify that the application has entered an unreliable state. Ideally, tasks would follow this same behavior. However, tasks are used to represent asynchronous operations with which code later joins, and if those asynchronous operations incur exceptions, those exceptions should be marshaled over to where the joining code is running and consuming the results of the asynchronous operation. That inherently means that TPL needs to backstop these exceptions and hold on to them until such time that they can be thrown again when the consuming code accesses the task. Since that prevents the default escalation policy, .NET 4 applied the notion of “unobserved” exceptions to complement the notion of “unhandled” exceptions. An “unobserved” exception is one that’s stored into the task but then never looked at in any way by the consuming code. There are many ways of observing the exception, including Wait()’ing on the Task, accessing a Task<TResult>’s Result, looking at the Task’s Exception property, and so on. If code never observes a Task’s exception, then when the Task goes away, the TaskScheduler.UnobservedTaskException gets raised, giving the application one more opportunity to “observe” the exception. 
And if the exception still remains unobserved, the exception escalation policy is then enabled by the exception going unhandled on the finalizer thread.
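To make the mechanics concrete, here is a minimal sketch of hooking TaskScheduler.UnobservedTaskException. The class and helper names are made up for the illustration, and the forced garbage collection is only there so the event fires promptly in a demo; a real application would not normally do that.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

static class UnobservedDemo
{
    // Creates a faulted task and lets every reference to it go out of scope,
    // so nothing ever observes the stored exception.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void CreateFaultedTaskAndDropIt()
    {
        var tcs = new TaskCompletionSource<int>();
        tcs.SetException(new InvalidOperationException("simulated failure"));
    }

    // Returns true if the event was raised for a dropped task.
    public static bool Run()
    {
        int raised = 0;
        EventHandler<UnobservedTaskExceptionEventArgs> handler = (sender, e) =>
        {
            Console.WriteLine("Unobserved: " + e.Exception.InnerException.Message);
            e.SetObserved(); // the last chance to mark the exception as observed
            Interlocked.Increment(ref raised);
        };

        TaskScheduler.UnobservedTaskException += handler;
        try
        {
            // The event is raised from the finalizer thread, so keep collecting
            // until a dropped task has actually been finalized.
            for (int attempt = 0; attempt < 10 && Volatile.Read(ref raised) == 0; attempt++)
            {
                CreateFaultedTaskAndDropIt();
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
            return Volatile.Read(ref raised) > 0;
        }
        finally
        {
            TaskScheduler.UnobservedTaskException -= handler;
        }
    }

    static void Main()
    {
        Console.WriteLine(Run() ? "event raised" : "event not raised (GC timing)");
    }
}
```

Because the event depends on when the task is finalized, the exact timing will vary from run to run and from runtime to runtime.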

In .NET 4.5, Tasks have significantly more prominence than they did in .NET 4, as they’re baked in to the C# and Visual Basic languages as part of the new async features supported by the languages. This in effect moves Tasks out of the domain of experienced developers into the realm of everyone. As a result, it also leads to a new set of tradeoffs about how strict to be around exception handling. An example of how this could affect developers is highlighted in the below code snippet:

    Task op1 = FooAsync();
    Task op2 = BarAsync();
    await op1;
    await op2;

In this code, the developer is launching two asynchronous operations to run in parallel, and is then asynchronously waiting for each using the new await language feature (for multiple reasons, it would be better if this code were written with a single “await Task.WhenAll(op1, op2)” statement, rather than individually awaiting each of op1 and op2, but that ideal doesn’t significantly decrease the likelihood that the above code will still be written). If while testing this method neither op1 nor op2 faults, there’s no problem as there are no exceptions. If just op1 faults but op2 completes successfully, there’s no problem: op1’s exception will propagate out of the await and everything will proceed as expected. And if just op2 faults but op1 completes successfully, again, no problem. However, consider what will happen if both op1 and op2 fault. Awaiting op1 will propagate op1’s exception, and therefore op2 will never be awaited. As a result, op2’s exception will not be observed, and the process would eventually crash.
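For comparison, here is a rough sketch of the Task.WhenAll form of the same code for the case where both operations fault. The bodies of FooAsync and BarAsync are hypothetical stand-ins that always fail; only their names come from the snippet above.

```csharp
using System;
using System.Threading.Tasks;

static class WhenAllDemo
{
    // Hypothetical stand-ins for FooAsync and BarAsync; both always fault.
    static async Task FooAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("foo failed");
    }

    static async Task BarAsync()
    {
        await Task.Yield();
        throw new TimeoutException("bar failed");
    }

    // Returns the number of exceptions recorded by the combined task.
    public static async Task<int> RunAsync()
    {
        Task op1 = FooAsync();
        Task op2 = BarAsync();
        Task all = Task.WhenAll(op1, op2);
        try
        {
            await all; // completes only after both operations have finished
        }
        catch (Exception ex)
        {
            Console.WriteLine("await threw: " + ex.GetType().Name);
        }

        // The combined task records the failures of both operations.
        foreach (Exception inner in all.Exception.InnerExceptions)
        {
            Console.WriteLine("recorded: " + inner.Message);
        }
        return all.Exception.InnerExceptions.Count;
    }

    static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
    }
}
```

With a single await on the combined task, neither failure is left behind on a task that nobody ever looks at.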

To make it easier for developers to write asynchronous code based on Tasks, .NET 4.5 changes the default exception behavior for unobserved exceptions. While unobserved exceptions will still cause the UnobservedTaskException event to be raised (not doing so would be a breaking change), the process will not crash by default. Rather, the exception will end up getting eaten after the event is raised, regardless of whether an event handler observes the exception. This behavior can be configured, though. A new CLR configuration flag, ThrowUnobservedTaskExceptions, may be used to revert to the crashing behavior of .NET 4.
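The flag is set in the runtime section of the application's configuration file. A minimal fragment, assuming an otherwise empty app.config, looks like this:

```xml
<configuration>
  <runtime>
    <ThrowUnobservedTaskExceptions enabled="true"/>
  </runtime>
</configuration>
```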

Note that this change doesn’t mean developers should be careless about ignoring unhandled exceptions… it just means the runtime is a bit more forgiving than it used to be. My recommendation is that any developer building library components should run their tests with this flag enabled, and should make sure that no exceptions are going unobserved in the components they build. That way, the application developer consuming these components can make the best decision for their app as to whether to set this flag or not: it would be unfortunate if the application developer wanted to be strict about enforcing all exceptions to be observed, but couldn’t because the library they consumed was written by developers that weren’t careful enough.

“Task.Result” vs “await task”

When you use Task.Wait() or Task.Result on a task that faults, the exception that caused the Task to fault is propagated, but it’s not thrown directly… rather, it’s wrapped in an AggregateException object, which is then thrown. There were two primary motivations for wrapping all exceptions like this. First, as of .NET 4, there was no good way in managed code to throw an exception that had previously been thrown without overwriting important information stored in that exception. Namely, throwing an exception with “throw e;” overwrites the exception’s stack trace and its “Watson bucket” information (the data collected and uploaded to help application developers after deployment find the root cause of the most common crashes in their applications) with details about the throw site, leading to very poor debuggability when dealing with exceptions marshaled across threads. For this reason, it’s been a .NET design guideline that in cases like this, where an exception’s propagation needs to be interrupted, it should be wrapped in another exception object. You can see that with reflection, for example, where exceptions are wrapped in a TargetInvocationException. For better or worse, this is a way to preserve the relevant information as part of an inner exception. And thus for TPL, where marshaling exceptions across threads is the name of the game, we wrap exceptions before propagating them.

We could have picked various wrapper exception types, such as TargetInvocationException, but we chose AggregateException because of the second motivation: tasks may fault with more than one exception, and thus need to store multiple. Whether because of child tasks that fault, or because of combinators like Task.WhenAll, a single task may represent multiple operations, and more than one of those may fault. In such a case, and with the goal of not losing exception information (which can be important for post-mortem debugging), we want to be able to represent multiple exceptions, and thus for the wrapper type we chose AggregateException.
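As an illustration of the multiple-exception case, the sketch below joins two faulting tasks with Task.WaitAll and walks the resulting AggregateException. FailA and FailB are hypothetical helpers invented for the example.

```csharp
using System;
using System.Threading.Tasks;

static class WaitAllDemo
{
    // Hypothetical operations that always fail.
    static void FailA() { throw new InvalidOperationException("A failed"); }
    static void FailB() { throw new TimeoutException("B failed"); }

    // Returns how many exceptions the join reported.
    public static int Run()
    {
        Task a = Task.Run(() => FailA());
        Task b = Task.Run(() => FailB());
        try
        {
            Task.WaitAll(a, b); // blocks until both tasks have completed
            return 0;
        }
        catch (AggregateException ae)
        {
            // Each original exception is kept intact as an inner exception.
            foreach (Exception inner in ae.InnerExceptions)
            {
                Console.WriteLine(inner.GetType().Name + ": " + inner.Message);
            }
            return ae.InnerExceptions.Count;
        }
    }

    static void Main()
    {
        Console.WriteLine(Run());
    }
}
```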

That explains why we chose the design we did for .NET 4. Now, let’s assume for a moment that we didn’t have to deal with the first issue above, that of overwriting an exception’s information. In that case, we have a choice to make, since whether the task actually stores multiple exceptions in an aggregate is separate from whether multiple exceptions are propagated in aggregate when you wait on a task. We have three primary options here: only propagate the first exception even if there are multiple, always propagate all exceptions in an aggregate even if there’s only one, and propagate a single exception if there’s one or an aggregate if there are multiple. We very quickly ruled out the last option, as in many scenarios it leads to needing duplicate catch blocks: for cases when multiple exceptions occur, you need a handler for an aggregate exception (and that handler needs to be able to special case each of the inner exceptions), and for cases when there’s only one exception, you need a separate catch block for each of the specialized exceptions… as a result, you end up having the same logic duplicated in two places. In our experience, it’s also just more difficult to reason about. That leaves the options of always propagating the first (by some definition of “first”) or always propagating an aggregate. When designing Task.Wait in .NET 4, we chose the latter. That decision was influenced by the need to not overwrite details, but also by the primary use case for tasks at the time, that of fork/join parallelism, where the potential for multiple exceptions is quite common.

While similar to Task.Wait at a high level (i.e. forward progress isn’t made until the task completes), “await task” represents a very different primary set of scenarios. Rather than being used for fork/join parallelism, the most common usage of “await task” is in taking a sequential, synchronous piece of code and turning it into a sequential, asynchronous piece of code. In places in your code where you perform a synchronous operation, you replace it with an asynchronous operation represented by a task and “await” it. As such, while you can certainly use await for fork/join operations (e.g. utilizing Task.WhenAll), it’s not the 80% case. Further, .NET 4.5 sees the introduction of System.Runtime.ExceptionServices.ExceptionDispatchInfo, which solves the problem of allowing you to marshal exceptions across threads without losing exception details like stack trace and Watson buckets. Given an exception object, you pass it to ExceptionDispatchInfo.Capture, which returns an ExceptionDispatchInfo object that contains a reference to the Exception object and a copy of its details. When it’s time to throw the exception, the ExceptionDispatchInfo’s Throw method is used to restore the contents of the exception and throw it without losing the original information (the current call stack information is appended to what’s already stored in the Exception).
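Here is a small sketch of that capture-and-rethrow pattern. It uses ExceptionDispatchInfo.Capture, which is the name of the static factory method in the released .NET Framework 4.5; the Fail helper and the class name are invented for the example.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.ExceptionServices;

static class DispatchInfoDemo
{
    // Hypothetical failing operation; kept out of line so its frame
    // shows up in the stack trace.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Fail()
    {
        throw new InvalidOperationException("simulated failure");
    }

    // Captures an exception, rethrows it later, and reports whether the
    // original throw site is still present in the stack trace.
    public static bool Run()
    {
        ExceptionDispatchInfo captured = null;
        try
        {
            Fail();
        }
        catch (Exception ex)
        {
            captured = ExceptionDispatchInfo.Capture(ex);
        }

        try
        {
            captured.Throw(); // rethrows the same exception object
            return false;     // not reached
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.StackTrace);
            return ex.StackTrace.Contains("Fail");
        }
    }

    static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Replacing captured.Throw() with a plain "throw captured.SourceException;" in the same sketch is a quick way to see the difference, since the frame for Fail then disappears from the trace.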

Given that, and again having the choice of always throwing the first or always throwing an aggregate, for “await” we opt to always throw the first. This doesn’t mean, though, that you don’t have access to the same details. In all cases, the Task’s Exception property still returns an AggregateException that contains all of the exceptions, so you can catch whichever is thrown and go back to consult Task.Exception when needed. Yes, this leads to a discrepancy between exception behavior when switching between “task.Wait()” and “await task”, but we’ve viewed that as significantly the lesser of two evils.
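The following sketch shows both halves of that behavior on a task that was faulted by hand with two exceptions. Using a TaskCompletionSource this way is just a convenient assumption for the demo, not how such tasks usually arise.

```csharp
using System;
using System.Threading.Tasks;

static class AwaitFirstDemo
{
    // Returns "<type thrown by await>/<number of exceptions stored in the task>".
    public static async Task<string> RunAsync()
    {
        // A task faulted with two exceptions, built by hand for the illustration.
        var tcs = new TaskCompletionSource<int>();
        tcs.SetException(new Exception[]
        {
            new InvalidOperationException("first"),
            new TimeoutException("second")
        });
        Task<int> task = tcs.Task;

        try
        {
            await task;
            return "no exception";
        }
        catch (Exception ex)
        {
            // ex is the first stored exception rather than an AggregateException;
            // task.Exception still holds everything.
            return ex.GetType().Name + "/" + task.Exception.InnerExceptions.Count;
        }
    }

    static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
        // prints "InvalidOperationException/2"
    }
}
```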

As I mentioned previously, we have a very high compatibility bar, and thus we’ve avoided breaking changes. As such, Task.Wait retains its original behavior of always wrapping. However, you may find yourself in some advanced situations where you want behavior similar to the synchronous blocking employed by Task.Wait, but where you want the original exception propagated unwrapped rather than it being encased in an AggregateException. To achieve that, you can target the Task’s awaiter directly. When you write “await task;”, the compiler translates that into usage of the Task.GetAwaiter() method, which returns an instance that has a GetResult() method. When used on a faulted Task, GetResult() will propagate the original exception (this is how “await task;” gets its behavior). You can thus use “task.GetAwaiter().GetResult()” if you want to directly invoke this propagation logic.
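A minimal side-by-side sketch of the two blocking calls on the same faulted task follows. The failing int.Parse call is only a convenient way to produce an exception.

```csharp
using System;
using System.Threading.Tasks;

static class GetResultDemo
{
    // Returns the exception type seen by each of the two blocking calls.
    public static string Run()
    {
        // The parse fails, so the task faults with a FormatException.
        Task<int> task = Task.Run(() => int.Parse("not a number"));

        string fromWait;
        try
        {
            task.Wait();
            fromWait = "none";
        }
        catch (Exception ex)
        {
            fromWait = ex.GetType().Name; // wrapped
        }

        string fromGetResult;
        try
        {
            task.GetAwaiter().GetResult();
            fromGetResult = "none";
        }
        catch (Exception ex)
        {
            fromGetResult = ex.GetType().Name; // original exception
        }

        return fromWait + " vs " + fromGetResult;
    }

    static void Main()
    {
        Console.WriteLine(Run()); // prints "AggregateException vs FormatException"
    }
}
```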

Does ContinueWhenAll(Task[], ...) observe the exceptions in Task[] or do we need to do some handling on Task[]?

More specifically what is "so on" in "There are many ways of observing the exception, including Wait()’ing on the Task, accessing a Task<TResult>’s Result, looking at the Task’s Exception property, and so on. "

The "and so on" also includes Task.GetAwaiter().GetResult(), Task.WaitAll, etc. Continuations do not observe the Task's exception, and that includes ContinueWhenAll. An easy way to force observation is to add a call to Task.WaitAll as the first line in your ContinueWhenAll's delegate; at that point the tasks are all completed, so you're just taking advantage of WaitAll's observation logic.
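Sketched out, that suggestion might look like the following. The Fail helper and the string results are made up for the example, and the try/catch is there because WaitAll will throw if any of the antecedents faulted.

```csharp
using System;
using System.Threading.Tasks;

static class ContinueWhenAllDemo
{
    // Hypothetical failing operation.
    static void Fail() { throw new InvalidOperationException("simulated failure"); }

    public static string Run()
    {
        Task[] tasks =
        {
            Task.Run(() => Fail()),
            Task.Run(() => { }) // completes successfully
        };

        Task<string> continuation = Task.Factory.ContinueWhenAll(tasks, completed =>
        {
            try
            {
                // Every antecedent has already finished, so this does not block;
                // it is called only for its exception-observing side effect.
                Task.WaitAll(completed);
                return "all succeeded";
            }
            catch (AggregateException ae)
            {
                return "observed " + ae.InnerExceptions.Count + " exception(s)";
            }
        });

        return continuation.Result;
    }

    static void Main()
    {
        Console.WriteLine(Run()); // prints "observed 1 exception(s)"
    }
}
```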

Roman

25 Apr 2013 3:33 AM

Thanks guys, just wasted half an hour debugging a trivial NullReferenceException - just because it was completely invisible.

This was a really, really bad idea. Also, changing such a major behaviour in a supposedly 100%-compatible, in-place upgrade? You've gotta be kidding me.

Ganesh

30 Apr 2013 1:18 PM

Hi Steve,

My understanding is that in an application that is targeting .NET 4 but is deployed on a machine with .NET 4.5, Tasks will NOT throw exceptions. For this, we have to use ThrowUnobservedTaskExceptions attribute. But will this have any side effects on a machine with only .NET 4 installed? Since ThrowUnobservedTaskExceptions is only understood by .NET 4.5?

Sorry, a bit late to this party. 4.5 async is excellent and a real paradigm shift towards simplicity. One question to see if I am missing something in my understanding. You said:

"My recommendation is that any developer building library components should run their tests with this flag enabled, and should make sure that no exceptions are going unhandled in the components they build."

Should that be "... are going UNOBSERVED in the components they build..."

I'm OK with unhandled exceptions being thrown from libraries - most of the Microsoft async methods throw exceptions that are not handled in their libraries (e.g. Stream.CopyToAsync throws ArgumentNullException). However, unobserved exceptions that slip through try...catch() blocks would be nasty.

I can see at least one benefit of using the code snippet with 2 awaits, instead of Task.WhenAll - that is, I can then resume processing on the result of the first await, while the second task is potentially still "in flight".

However, you mention that, "for multiple reasons", using Task.WhenAll would have been better. However, I do not know what these reasons are, apart from exception observation (mentioned here). I would be glad if you could elaborate on these reasons.

@Jean Hominal: Yes, if you want to process individual tasks before all of them are done, then Task.WhenAll is not appropriate. That's not what was being done in the example I was commenting on, however, as in that example, the two awaits were back to back. In that case, WhenAll would await them more efficiently, and it would ensure that exceptions were handled well.

We have a windows service using 4.0 framework that launches x number of Tasks (defined in configuration) which run continuously for the lifetime of the service. These tasks check and process items from a queue that generally always has items.

The way the service is written in the current incarnation, the Tasks are launched and the return instance is thrown away. The Tasks continuously run until the service is stopped. There is no wait or join performed on the Tasks during the onstop event (the obvious issues with this architecture will be fixed in the next incarnation). The tasks themselves run in an infinite loop, each iteration processing one item from a common queue.

I have recently discovered this design during a code review related to a service crash in production, and am working towards resolving the issues. I have already built an event handler to log unobserved exceptions using ContinueWith and the OnlyOnFaulted option. In the handler I am looking at the Exception property, flattening the aggregate exceptions and looping through the exception collection to log each error with Enterprise Library exception handling. This resolves the teardown and the missing information about the root cause of the error; however, I had a few questions regarding Task completion and garbage collection. Understanding this a bit better might help determine the best means to resolve the remaining issues with our implementation.

In the scenario above, we have a collection of Tasks that the original developers assumed would run without completion.

If one of these Tasks then throws an exception and faults, is it immediately picked up by GC (and in 4.0 cause a tear down of the service due to the exception being unobserved)?

Or will the Task (perhaps because there is still a reference to it in the associated TaskScheduler, and the underlying thread is never joined back to the calling thread) not be collected until the service itself terminates, at which point the service that might otherwise end gracefully instead ends with a crash report due to the tear down?

Part of my concern is this: if the Task instance is not GCed and in fact sticks around after completion - perhaps waiting for a join back to the calling thread - we could end up in a situation where every Task instance in the service could eventually crash, but without any exception handling in place the service would just run in a sort of "dead" state. From what I could tell (I have not yet written a test app to validate this, going to do so today if I have time) it appears the completed task will GC and cause the application to crash.

Understanding this behavior will help determine the priority in resolving this issue, as having services fail quietly while still showing they are running is a worse situation than knowing when a service crashed and restarting it.

@rshadman: The TaskSchedulers that ship in the .NET Framework do not hold onto Task instances after they've completed running. As long as you don't keep them alive by maintaining a reference to them in your own code, they will become available for collection as soon as they've completed. That doesn't mean they'll be collected immediately, though... the act of dropping a Task doesn't trigger a GC, it just means the Task becomes available for GC, so if you don't have a lot of other allocation happening in the system, it could take some time for a GC to occur.

That was what I thought - thanks for confirming this. My own testing (since posting) also confirmed as much - that the Task won't be GCed on any known schedule. In fact with my test application (in a windows form) the completed task was never GCed until I closed the app.