Exception HandlingConsidered Harmful

Recent programming languages such as Java, Python and Ruby have chosen to use exception handling as their primary method of error handling, replacing the traditional approach of error return codes. I believe continuing this trend for future programming languages would be a mistake, for two reasons...

Exception handling introduces a hidden, "out-of-band" control-flow possibility at essentially every line of code. Such a hidden control transfer possibility is all too easy for programmers to overlook – even experts. When such an oversight occurs, and an exception is then thrown, program state can quickly become corrupt, inconsistent and/or difficult to predict (think about an exception unexpectedly being thrown part way through modifying a large data structure, for example).

Exception handling does not fit well with most of the highly parallel programming models currently in use or being explored (fork/join, thread pools and task queues, the CSP/actor model etc), because exception handling essentially advocates a kind of single-threaded "rollback" approach to error handling, where the path of execution – implicitly a single path – is traversed in reverse by unwinding the call stack to find the appropriate error handling code.

Good Intentions

Exception handling was originally intended to solve several perceived problems with the traditional approach of error handling via return codes.

First, by separating the error handling code from the main body of normal code, it was hoped that the code would be less cluttered, and hence cleaner, with the normal, non-error case easier to follow because it was not obscured by necessary but tedious and unlikely error checking/handling.

Second, by allowing a separation between the point where an error occurs and the point where it is handled, even a potentially very large separation across many function calls, it was hoped to enable better handling of errors deep within libraries, allowing those errors to be propagated back to the application without requiring a whole chain of error checking and returning code to be written, and thus avoiding the tendency for libraries to swallow or generalize errors because it was too much hassle to feed them all the way back in full detail.

Finally, exceptions were seen as a solution to the "semi-predicate" problem, where for some operations every possible return value is valid and thus an error must be indicated through some other, more indirect means, such as a pass-by-reference error argument or an internal success/failure state indicator within an object.

To solve these problems, exception handling essentially advocates a kind of "rollback" approach to error handling. When an error occurs an exception is "thrown", which engages the runtime system to begin a rollback operation by unwinding the call stack, destroying local objects as it goes, until a suitable error handler "catch" block is reached, and execution continues from there.

The primary intended benefit of such an approach is that all of the code between the place where the error happens and is thrown, and the place where the exception is caught and handled, can simply remain blissfully unaware of the error, and not have to detect and handle it explicitly. Local objects just get destroyed automatically while unwinding the call stack, and all is well.

Sounds good, right?

Hidden Control Flow & Corrupt State

One immediately obvious problem with a "rollback" style approach to error handing is that many operations are not so trivially rolled back simply by destroying local objects (and perhaps letting heap objects be cleaned up by a garbage collector). The classic example is I/O – you cannot un-print something to the screen, un-ask for user input, un-overwrite a file's contents, or un-send a network packet. All true, and an excellent point.

But that's just the tip of the iceberg. I/O isn't even the real problem. It is just one of a number of possible non-local side effects that code might have. Far more common, yet often overlooked, is state in general – any code which simply makes changes to some part of a shared data structure, like a document model or a scene graph. Unwinding the stack and destroying local objects won't undo those changes. In fact, in an exception-rich environment where the act of making such changes can potentially cause an exception, it is impossible to write a strongly exception-safe function that has two or more unrelated side effects, of any kind, that cannot be performed atomically.

It is impossible to write a strongly exception-safe function
that has two or more unrelated side effects, of any kind,
that cannot be performed atomically.

Consider an exception unexpectedly being thrown part way through modifying a large data structure, for example. How likely is it that the programmer has written code to correctly catch that exception, undo or reverse the partial changes already made to the data structure, and re-throw the exception? Very unlikely! Far more likely is the case that the programmer simply never even considered the possibility of an exception happening in the first place, because exceptions are hidden, not indicated in the code at all. When an exception then occurs, it causes a completely unexpected control transfer to an earlier point in the program, where it is caught, handled, and execution proceeds – with a now corrupt, half-modified data structure!

Any non-trivial shared-data-modifying algorithm cannot, in general, be truly strongly exception-safe unless either the programming language itself provides some form of transactional capability (eg: SQL's commit approach), or the programmer simulates transactional behavior in code by making a copy of the data, modifying the copy, and doing some kind of pointer swap to make the new copy the "real thing" atomically – which is ridiculously tedious and clearly not practical for large objects or complex data structures.

So if you're in the middle of modifying data, and an exception occurs, you could easily end up leaving the data in a half-baked state. That is really, really dangerous, because it invites the possibility of silent data corruption. In most cases, any clearly visible error signal, even program termination, is by far preferable to the possibility of silent data corruption. And exception handling simply isn't a clearly visible error signal. Most of the calling code can, and does, simply ignore exceptions, assuming some code further back will catch and handle them.

Thus, coding styles relying on exception handling over anything more than trivial distance between throw and catch have a tendency to "take simple, reproducible and easy to diagnose failures and turn them into hard-to-debug subtle corruptions", to quote Larry Osterman.

Forcing the calling code to handle the error right away is the correct approach, because it forces the programmer to think about the possibility of an error occurring. That's a key point. The fact that this clutters the code with error checking is unfortunate, but it is a small price to pay for correctness of operation. Exceptions tend to allow, even encourage, programmers to ignore the possibility of an error, assuming it will be magically handled by some earlier exception handler.

Forcing the calling code to handle the error right away is
the correct approach, because it forces the programmer
to think about the possibility of an error occurring.

Exceptions tend to allow, even encourage, programmers
to ignore the possibility of an error, assuming it will be
magically handled by some earlier exception handler.

In order to write exception-safe code, at every significant line of code the programmer must take the possibility of an exception and rollback happening into account, to be sure the code cleans up properly and leaves things in a suitable, stable state if an exception occurs – that it doesn't leave a data structure half-modified, or a file or network connection open, for example. That is decidedly non-trivial. It takes a great deal of time and effort, it requires a very high degree of discipline to get right, and it is just far too easy to forget or overlook something – even experts frequently get it wrong.

Putting more general issues aside for just a moment, the C++ exception handling system in particular wasn't very well thought out IMHO, and is by far the weakest part of the language – so much so that I generally recommend people don't use C++ exceptions at all, and turn them off in their compiler if possible.

Exception handling is the only C++ language feature which requires significant support from a complex runtime system, and it's the only C++ feature that has a runtime cost even if you don't use it – sometimes as additional hidden code at every object construction, destruction, and try block entry/exit, and always by limiting what the compiler's optimizer can do, often quite significantly. Yet C++ exception specifications are not enforced at compile time anyway, so you don't even get to know that you didn't forget to handle some error case! And on a stylistic note, the exception style of error handling doesn't mesh very well with the C style of error return codes,
which causes a real schism in programming styles because a great deal of C++ code must invariably call down into underlying C libraries.

Furthermore, because C++ doesn't have garbage collection it is all too easy even for experts (see here, here, here and here) to accidentally write code which leaks memory if an exception is thrown by some function you call, even if you yourself don't use exceptions. This is further complicated by C++'s lack of a finally block to simplify cleanup. It is also particularly easy in C++ to leave objects in a half-baked state when an exception occurs, because even many "primitive" operations like assignment can potentially throw exceptions. In practice, it becomes essentially impossible not to leave objects in a half-baked state once the objects grow beyond trivial size/complexity. Even many of the STL containers are not strongly
exception-safe – they don't leak memory, but they might leave your data in a half-baked state where the operation was only "partially" done, which is not terribly useful or helpful.

The core problem is the hidden control-flow possibility. There's a famous joke about a mythical programming language construct called comefrom, which is a parody on the problematic goto statement found in many early programming languages. The idea is that the programmer can, at any point in the program, say "comefrom 20", and any time execution reaches line 20 it will immediately jump to the "comefrom" code. The point being made here is that nothing on line 20 itself indicates that control flow might be diverted like this. Exception handling introduces precisely this kind of hidden control flow possibility, at nearly every significant line of code: every function/method call, every new object construction, every overloaded operator etc.

Exception handling thus breaks the "principle of least astonishment", and breaks it HUGE.

Joel Spolsky expresses the issue in his concise and down-to-earth manner as follows: "They are invisible in the source code. Looking at a block of code, including functions which may or may not throw exceptions, there is no way to see which exceptions might be thrown and from where. This means that even careful code inspection doesn't reveal potential bugs. ... To write correct code, you really have to think about every possible code path through your function. Every time you call a function that can raise an exception and don't catch it on the spot, you create opportunities for surprise bugs caused by functions that terminated abruptly, leaving data in an inconsistent state, or other code paths that you didn't think about."

Mismatch With Parallel Programming

The very idea of rollback/unwinding which is so central to exception handling more-or-less inherently implies that there is a sequential call chain to unwind, or some other way to "go back" through the callers to find the nearest enclosing catch block. This is horribly at odds with any model of parallel programming, which makes exception handling very much less than ideal going forward into the many-core, parallel programming era which is the future of computing.

Even when considering the simplest possible parallel programming model of all – a straightforward parallel fork/join, such as processing all of the elements of an array in parallel – the problem is immediately obvious. What should you do if you fork 20 threads and just one of them throws an exception? Unwind back past the forking and kill the other 19 threads, risking data corruption? Unwind but leave the other 19 threads running never to be joined/reaped, and doing who knows what to objects you supposedly destroyed during the unwinding? Make the programmer put in a catch block at the point of forking, which still has to choose between those two basic possibilities anyway?

Moving to more interesting and useful models of parallelism, exception handling again seems completely mismatched. Today, for example, the most common practical model used for flexible parallelism is a pool of worker threads each executing small units of work, often called tasks or operations, which are stored in some kind of work queue and dispatched to the thread pool one after another as each thread finishes its current task. Applying exception handling to such a scheme seems impossible, since the units of work are essentially detached from any "caller". The whole concept of unwinding the call stack makes no sense at all in such a situation.

More sophisticated parallel programming models, such as asynchronous message passing between communicating sequential processes (CSP or the "actor" model), have similar properties to the thread pool and task queue approach, though these properties are hidden by proper language support. Again, since there is no obvious execution path to unwind, and since messages between objects/actors are frequently asynchronous, it is difficult to see how the general approach of exception handling can be applied.

Finally, because exceptions are an out-of-band control mechanism, existing outside the normal call/return mechanism, they don't fit very well when the CSP or actor model is taken to its logical next step, with objects/actors on different systems connected by a network. You can easily return an error code over a byte stream that happens to be a network connection, but you can't easily throw an exception back over a network connection, because the exception is "out of band" – it doesn't come back via the normal data channel. An elaborate runtime system could, of course, work around this, but is that really a sensible approach?

The simple fact is the concept of rollback/unwinding just doesn't work very well in a highly parallel situation, even a simple one like fork/join, let alone more sophisticated and useful models like thread pools or CSP/actors. Trying to retrofit exceptions and rollback/unwinding into a parallel environment seems like an exercise in complexity, frustration and ultimately futility.

Exceptional Exceptions

Many advocates of exception handling admit that it is best used only for extremely rare "exceptional" cases. In other words, you should use error return codes for anything that might actually happen in real life, but as long as you only use exceptions for things that will never actually happen they're fine. Maybe I'm exaggerating for effect here, but you get the point.

I personally take the view that most of the "exceptional" cases they're talking about should basically just be guaranteed by the system to never happen at all – memory allocation failures, runtime stack exhaustion, other kinds of resource exhaustion, memory access violations etc. We shouldn't be exposing those kinds of things to applications at all, because in nearly all cases there is precious little the application can sensibly do to recover from the error anyway. There's useful complexity and then there's useless complexity, and having to write application code to deal with things that will never really happen, or for which the only safe response is program termination anyway, is just adding useless complexity.

Instead, we should be presenting applications with the illusion of a machine with infinite resources, thereby making writing applications that much simpler and less error-prone. If physical resources actually do become exhausted, it should be the responsibility of the operating system, not the application, to take appropriate action. As a simple example, memory allocation should be guaranteed not to fail in general, with special options to return NULL on failure for those few rare cases where recovery from failure makes sense (such as allocating a very large image or handling the possibility of failure in some alternative way like working at a lower resolution).

For those of you who say "but what about small, embedded devices that have real resource limits?", the answer there is simply to go and look at what's actually being done in the embedded space today. We already have small embedded devices which function as wireless network hotspots, print servers, music servers and NAS servers, all at the same time, all in the size of a power brick. The notion of having "special" versions of programs which run in embedded space and which constantly have to handle resource limits is just as dead as the idea of "special" content for mobile devices (can anyone remember WAP or i-Mode?).

The future is essentially standard, general-purpose applications, maybe slightly cut down, running on top of slightly cut down but essentially standard, full-blown OSs, all on your phone, or your watch, or inside your soap dispenser. It's a world where even your toaster runs Linux. In such a world, exposing resource limits like the remote possibility of memory allocation failure to applications is just silly.

Finally

The cold, hard truth is that if you exclude trivial use of exceptions where the exception is caught and handled immediately, essentially mimicking the old error return code approach, then 90% of the other exception handling code out there in the wild isn't exception-safe. It works just fine, as long as an exception never actually happens, but if one does you're basically hosed. Or, to quote Michael Grier: "Exceptions only really work reliably when nobody catches them."

I believe this clearly tells you there is a problem with the language feature, and the very idea IMHO. I am certain 99% of C++ code isn't exception-safe, I'm equally sure 99% of Objective-C code isn't exception-safe, and I'd be willing to bet a good 90% of Java code isn't exception-safe either, even with garbage collection to clean up memory leaks. The problem isn't just memory leaks, or even unclosed files and network sockets, it's modifications to shared data structures (and related equivalents like database state, partially written files etc). Those don't get undone by unwinding the stack and destroying local objects, nor by a garbage collector, no matter how smart it is about trying to call finalize() methods in the right order.

I vote that all programming language designers should just say no to exceptions. I know I do.

Exception handling doesn't really work. It doesn't give the benefits it claims. Hardly any real-world code uses it correctly except in the trivial case, which is just a more verbose equivalent of error return codes. Nobody really uses exceptions to any genuine benefit. They just get in the way and make writing code more silently error-prone. And exception handling is a horrible, horrible mismatch to highly parallel programming.

Error return codes work. They are simple. They are effective. They have stood the test of time. More to the point, they are what everyone actually uses when they know an error might really actually happen! That tells you a LOT.

If you're a programming language designer, I encourage you to just say no to exceptions, and take your first step into a better, more reliable world.

Lighterra is the software company of Jason Robert Carey Patterson, a systems programmer with interests centered around performance and the hardware/software interface, such as the design of new programming languages and compilers, optimization algorithms to make code run faster, chip design and microarchitecture, and parallel programming across many processor cores, GPUs and network clusters.