DISCLAIMER: it is not something you can find in existing C++; moreover, it is not even a proposal. Rather, it is a proposal-for-proposal – thinking aloud on certain issues which MIGHT be a Good Thing(tm) to introduce into the hopefully-upcoming next incarnation of C++ exceptions proposed in [P0709R0].

Fail Fast

A fail-fast system is one which immediately reports at its interface any condition that is likely to indicate a failure. Fail-fast systems are usually designed to stop normal operation rather than attempt to continue a possibly flawed process.
— Wikipedia —

In particular, while there is agreement that robust systems and programs SHOULD fail fast, there are very significant differences in interpretation of HOW they should fail. In the earliest-I-know-about article on failing-fast programs, [Gray85], the argument – along the same lines as for hardware – is for failing fast and hard.1 In [Armstrong03] they go a bit further and argue for failing fast with failures controlled by higher-level modules – with modules becoming simpler and simpler all the way up, to the point where a module is so simple that it cannot possibly fail.2 On the other hand, in [Shore04], they say that “a crash is never appropriate”, and argue for continuing to work after an assertion has fired – while reporting the problem to developers (a failed assertion IS still a bug which should be fixed ASAP, otherwise we’re not speaking about failing fast). And in [Neumanns16], the argument goes along the lines that some (critical) systems should fail safe, avoiding crashes, while some other systems should fail fast-and-hard (and that, unless proven otherwise, we should default to failing fast-and-hard).

The Wikipedia definition of failing fast (quoted above) seems to say that “stopping normal operation”, while being “usual”, is not a strict requirement for failing fast.

To answer the question “what should we really mean by failing fast?”, let’s get back to the roots and take a closer look at what exactly we’re trying to achieve with the strategy of failing fast:

Reducing Debugging Time. Failing fast DOES shorten debugging times immensely – there is no doubt about it. However, this benefit only applies to failing fast while debugging and testing.

Producing More Reliable Software. Sweeping bugs under the carpet by gracefully handling them (so gracefully that we don’t even report them), is not a good way to produce as-bug-free-software-as-possible. Conversely, failing fast does help us to produce more reliable programs – however, once again, this benefit only applies to failing fast while debugging and testing.

Reducing Cost of Failure. This is the point where different schools of thought start to diverge. In [Hoang18], they say that failing fast (and hard) reduces the cost of the bug, but in [Shore04] they say that “a crash is never appropriate”, and [Neumanns16] argues that for critical systems a hard crash is not acceptable, and (going beyond critical systems) provides an example of a text editor which generally benefits from trying to save the document before crashing.3 I have my own opinion on this one – but I will articulate it a bit later.

Achieving Fault Tolerance. This one is really tricky; in particular, as noted above, I don’t buy an argument-coming-from-hardware-side that we should just retry and hope for the best (this DOES work for hardware, but not for software – and even less for a good testable software <oh-the-irony />). OTOH, a line of thought from [Armstrong03] might work – but then we still have the problem of the topmost-module (which MUST NOT fail), so the question “whether we should fail fast AND hard” still doesn’t have a valid-for-everybody answer.4

1 based on a hypothesis that software failures are irreproducible, so a simple retry will solve the problem – which IMNSHO is fundamentally flawed for software. Very briefly: (a) good software has to be testable, (b) to be testable you pretty much have to be deterministic, hence (c) relying on being irreproducible is a Bad Thing(tm) – which BTW is illustrated by tons of real-world failures, the Ariane 5 one discussed below included

2 I won’t argue on this one, at least it is MUCH more viable than simple retrying

3 ideally – to a separate file, so the user can choose between the two versions

4 and in practice, problems with saving-and-recovering consistent-with-the-rest-of-the-world state of failed modules will often stop us from following this nice model, so we will run into modules-which-MUST-NOT-fail much earlier than they become simple enough to be inherently fail-tolerant <sad-face />

Ariane 5 Disaster as a Result of Failing-Hard

Let’s take our own look at this cost-of-failure question – more specifically, let’s take a look at one of the worst (to date) software disasters, the completely software-induced crash of the Ariane 5 rocket back in 1996. A rather good overview of the problem can be found in [Wikipedia.Ariane5], but here are the points which are most interesting for us now:

the failure was due to an unhandled CPU exception caused by an overflow in a precision-losing conversion (a 64-bit float being converted into a 16-bit integer)

the failure happened in a part of the software which did nothing of use for Ariane 5 (it was reused from Ariane 4 and was allowed to keep operating “for commonality reasons”)

the engineers knew of the problems with failing fast AND hard, and did handle such exceptions for most of the variables – but not for this particular one.

In other words:

there EXIST real-world cases when failing hard is NOT a good option (which BTW is perfectly in line with [Neumanns16])

If Ariane 5 is not a good enough case for you (on the basis that in theory it could have fallen and killed somebody – which is arguable, but I don’t want to argue it here) – consider a life-support system, where we DO know that a crash will kill the patient/astronaut/any-other-person.

OTOH, opposite cases DO exist too; in particular, for not-a-rocket-but-a-military-missile I’d certainly argue for failing fast AND hard (ideally – leading to self-destruction in an as-failsafe-way-as-possible).

it IS a well-known problem for developers of critical systems (Ariane 5 devs DID handle some of the exceptions gracefully – unfortunately, it didn’t happen for ALL the cases)

Note, however, that I am NOT arguing that failing fast was a bad thing for Ariane 5. To the contrary:

in debugging mode this kind of conversion SHOULD have crashed the system

in production – it also SHOULD have failed fast (detecting the bug), but then it SHOULD have tried to ignore the problem and continue running (“failing soft”, which in this case is equivalent to “failing safe”). In other words, in production an attempt should have been made to ignore the overflow (“failing soft”), but at the same time “failing soft” does NOT prevent recording it (ideally – sending it via a telemetry channel) – and fixing the bug for the next run. It is STILL a bug which we should fix ASAP (i.e. we STILL want to benefit from following the fail-fast ideology), but at the same time, failing fast is NOT a good reason to crash (for Ariane 5, we do NOT want to fail hard).
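The fail-fast-but-soft handling described above can be sketched in today’s C++. This is purely my own illustration (nothing to do with real flight software, and all names are invented): detect the overflow, report it immediately (fail FAST), and continue with a clamped value instead of crashing (fail SOFT):

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

// Hypothetical illustration (NOT real flight code): convert a double to
// int16_t, but instead of risking an unhandled CPU exception on overflow,
// detect it, report it (fail FAST), and continue with a clamped value
// (fail SOFT).
int16_t to_int16_soft(double v) {
    if (v > std::numeric_limits<int16_t>::max()) {
        std::cerr << "BUG: conversion overflow, value=" << v << "\n";
        return std::numeric_limits<int16_t>::max();  // clamp and carry on
    }
    if (v < std::numeric_limits<int16_t>::min()) {
        std::cerr << "BUG: conversion underflow, value=" << v << "\n";
        return std::numeric_limits<int16_t>::min();
    }
    return static_cast<int16_t>(v);
}
```

In production the report would go to a telemetry channel rather than stderr; the point is merely that detecting the bug (fail-fast) and crashing on it (fail-hard) are separable decisions.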

Production Bug Handling Quadrant

These days, everybody and their rabbit are drawing magic quadrants, and I haven’t drawn my own magic quadrant yet. Here is my humble attempt to fill this glaring gap in my writings </self-irony>, presenting a Production Bug Handling Quadrant:

NB: Note that there is NO green color in the quadrant. This is intentional: we’re dealing with OUTRIGHT BUGS here, so we HAVE to admit that something has already gone wrong.

Usually, the worst possible case is to fail Hard+Slow; this is what usually happens when we do nothing to detect our bugs at runtime.

Ignoring bug-induced conditions leads us to failing Soft+Slow; it MIGHT help in some cases (such as that of Ariane 5); though IF we can report such conditions to developers, we SHOULD do it (it is STILL a bug even if we’re ignoring it!) – and such reporting would make our program fail Fast+Soft

Reporting the bug and trying to continue is Fast+Soft failure handling; it IS appropriate at least whenever we’re reasonably sure that a crash of our program represents the worst possible case (as for Ariane 5, life-support systems, etc.)

Calling terminate() on a bug-induced condition means failing Fast+Hard. It IS appropriate whenever the worst case of continuing operation is substantially worse than a simple crash (a military missile being one such example).
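Since the right quadrant cell differs from program to program, the Hard-vs-Soft choice can be made a deployment-time policy rather than hardcoded. A minimal sketch of such a switch (all names here are invented for illustration; this is plain current C++, not P0709 machinery):

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

// Which way to fail on a detected bug-induced condition; chosen at
// deployment time (e.g. read from a config file), NOT hardcoded app-wide.
enum class BugPolicy { FailHard, FailSoft };
BugPolicy g_bug_policy = BugPolicy::FailSoft;

// stand-in for the "unchecked" kind of exception discussed below
struct unchecked_bug : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void report_to_developers(const char* what) {
    std::fprintf(stderr, "BUG detected: %s\n", what);  // logging/telemetry stub
}

void on_bug_detected(const char* what) {
    report_to_developers(what);        // we fail FAST in BOTH cases
    if (g_bug_policy == BugPolicy::FailHard)
        std::terminate();              // missile-style: crash is the safest outcome
    throw unchecked_bug(what);         // Ariane-style: crash is the worst outcome
}
```

The key property is that switching between Fast+Hard and Fast+Soft changes one configuration value, not the code which detects the bugs.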

Summary on Fail-Fast vs Fail-Hard

My main point up to now was NOT to say which way of handling bugs in production runtime is The Right One; on the contrary, the point is that

Different programs require different ways of handling production bug-induced failures

Also, I am arguing that:

Fail-Fast is a Good Thing(tm)

In line with [Neumanns16], I DO agree that Fail-Fast-AND-Hard MUST be used during debugging (with as many meaningful asserts/contracts/… as possible scattered around our code).

In production, it is also VERY important to detect bugs and report them to developers. However, how to handle them after reporting is a separate topic which is NOT directly related to failing fast.

Failing-Fast does NOT mean we should necessarily Fail-Hard(!). In certain (production!) cases, Failing-Fast-AND-Soft IS a substantially better alternative.

In addition to the cases mentioned above, one very good real-world case for Fail-Fast-AND-Soft happens whenever we’re reasonably sure that the bug manifested during a read-only operation (so there is very little risk of damaging our in-memory state). This, in particular, is a very common scenario in the VALIDATE and CALCULATE stages of the VALIDATE-CALCULATE-MODIFY-SIMULATE (Re)Actor pattern (see, for example, [DDoMOGv2] – in particular, an option to have the this pointer as const during these two stages is discussed). As the VALIDATE and CALCULATE stages are inherently read-only, we can more or less assume that a failure in them did NOT change the state, so handling such failures5 is trivial – we can just ignore the incoming event and try to process the next one.
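A minimal sketch of such an event loop (class and function names here are mine, not taken from [DDoMOGv2], and the real pattern has more stages and infrastructure):

```cpp
#include <iostream>
#include <queue>
#include <stdexcept>
#include <string>

struct Event { std::string payload; };

// Toy (Re)Actor: VALIDATE and CALCULATE are read-only with respect to the
// state, so an exception thrown there is assumed not to have corrupted it.
class Reactor {
    int processed_ = 0;  // the (Re)Actor state
public:
    void react(const Event& ev) {
        if (ev.payload.empty())                        // VALIDATE (read-only)
            throw std::logic_error("BUG: empty event payload");
        // ... CALCULATE (read-only) would go here ...
        ++processed_;                                  // MODIFY
    }
    int processed() const { return processed_; }
};

// Fail-Fast-AND-Soft: report the bug, drop the offending event, keep going.
void event_loop(Reactor& r, std::queue<Event>& events) {
    while (!events.empty()) {
        Event ev = events.front();
        events.pop();
        try {
            r.react(ev);
        } catch (const std::exception& e) {
            std::cerr << "BUG while processing event: " << e.what() << "\n";
        }
    }
}
```

Note that the catch sits right outside react(): since the throwing stages are read-only, throwing the current event away leaves the state exactly as it was before the event arrived.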

I can say that under such circumstances, Fail-Fast-AND-Soft DID save my bacon quite a few times in projects where direct damages6 cost hundreds-of-thousands-per-hour. In other words, I am ALL for having this option of Failing-Fast-AND-Soft at least for some of the projects out there.

5 that is, assuming that worst-case is NOT worse than crash

6 i.e. NOT accounting for loss of customer loyalty etc.

Unchecked Exceptions on Top of P0709

With all this in mind, I am sure that it is a Really Good Idea(tm) to introduce a concept of such last-hope exceptions into C++; actually, the concept which-I-think-is-appropriate is more generic than that, and is similar to so-called “unchecked exceptions” from Java. In Java, all exceptions are divided into “checked exceptions” (derived from class Exception, but not from RuntimeException) and “unchecked exceptions” (derived from class RuntimeException). At runtime, both checked and unchecked exceptions behave exactly the same; the only difference is that IF a “checked exception” MAY arise in a function, it MUST be either caught, or declared (in the function’s throws clause) to be re-thrown out of the function.

Given the reasoning above, I clearly like this idea (well, with all the changes necessary to make it live in the C++ world), so I would try to integrate the concept of “unchecked exceptions” into C++ exceptions (more specifically, on top of P0709); it is IMO very straightforward:

we’re saying that there are two wide subtypes of std::error (distinguished by domain) – “checked” and “unchecked”

for “checked” std::errors – which include ALL exceptions which P0709 proposes to keep – everything stays as discussed in P0709 (including static checks of try-expressions within throw functions)

for “unchecked” std::errors – everything works exactly the same as for “checked” ones, excluding:

any requirements for static checks do NOT apply to “unchecked” exceptions

unchecked exceptions MAY be thrown out of nothrow functions without causing trouble

“unchecked” std::errors are OPTIONAL – and MAY be specified as one of implementation-defined options in any of the following places:

for heap allocations (which is perfectly in line with the special handling of heap exhaustion mentioned in P0709 – which BTW had a consensus in SG14)7

for assert()/contracts/nothrow violations (while each of them DOES indicate a bug, none of them is guaranteed to be fatal).

allowing the implementation to replace (again, OPTIONALLY and in an implementation-defined manner) dreaded UBs at least in the following cases:

dereferencing a null pointer (already handled on the vast majority of modern CPUs without any performance cost, but this behavior is non-standardized)

calling a function by a null pointer

integer divide by zero

signed integer overflows

probably also for LOTS of other cases which SHOULDN’T happen – but which DO happen.

NB: at the very same places it would be nice to allow implementation to define behavior as a programmer’s choice between:

UB

terminate()

“unchecked” std::error
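As an emulation of the intended semantics in today’s C++ (a real implementation would rely on the zero-cost CPU exception rather than an explicit check, and unchecked_error here is my stand-in name for the proposed “unchecked” std::error):

```cpp
#include <cstdint>
#include <stdexcept>

// Stand-in for the proposed "unchecked" std::error (name invented here).
struct unchecked_error : std::logic_error {
    using std::logic_error::logic_error;
};

// Both failure cases below are UB in standard C++; under the proposal an
// implementation could OPTIONALLY be told to raise an "unchecked" error
// instead of exhibiting UB or calling terminate().
int32_t checked_div(int32_t a, int32_t b) {
    if (b == 0)
        throw unchecked_error("integer divide by zero");
    if (a == INT32_MIN && b == -1)  // the one overflowing case of int32 division
        throw unchecked_error("signed integer overflow in division");
    return a / b;
}
```

Under the proposal the explicit checks would not be needed in the source at all: the implementation would map the corresponding CPU exceptions onto “unchecked” std::errors.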

“unchecked” std::errors are treated as “something which should never ever happen, but in practice MAY occur as a result of a potentially-recoverable bug”

as a result, most libraries (probably including std) don’t need to bother about being exception-safe in the presence of “unchecked” exceptions.

More formally, we can speak about “{strong|weak} exception safety given checked exceptions” (i.e. when exceptions can occur ONLY within try-expressions), or about the much-stronger-and-much-more-difficult-to-achieve “{strong|weak} exception safety given unchecked exceptions” (i.e. when exceptions can occur in pretty much each and every line of code; this includes contract violations and heap allocations, which are responsible for 90+% of all potential failures).
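For illustration, the classic copy-and-swap idiom is one well-known way to obtain the strong guarantee even when exceptions can arise in any line of the preparatory work (a generic sketch with invented names, not anything from P0709):

```cpp
#include <cstddef>
#include <vector>

// Strong exception safety via copy-and-swap: all potentially-throwing work
// (including heap allocations) happens on a copy; the commit is a nothrow
// swap, so an exception at ANY earlier point leaves the state unchanged.
class Portfolio {
    std::vector<int> positions_;
public:
    void add_position_strong(int p) {
        std::vector<int> tmp(positions_);  // may throw (e.g. heap exhaustion)
        tmp.push_back(p);                  // may throw
        positions_.swap(tmp);              // nothrow commit point
    }
    std::size_t size() const { return positions_.size(); }
};
```

The price, of course, is the extra copy – which is exactly why demanding this level of safety everywhere, for all code, is rarely justified.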

OTOH, LOTS of things will still work for unchecked exceptions even if they’re NOT specifically addressed

in particular, they MAY be caught (for example, right outside a react() function, or in a thread function)

RAII still works for them

and so on.

All these cases have the following common properties:

They SHOULD NOT happen, and they (except maybe for heap exhaustion) DO indicate a bug

However, the bug MAY be recoverable (and it is up to the app developer to specify whether she feels good about such an attempt at recovery, having performed that rocket-vs-missile kind of failure analysis).

As demonstrated above, the best answer to WHAT TO DO when such a bug is encountered heavily depends on the type of the app

Spending special effort to be exception-safe in the presence of such unchecked exceptions is very difficult, and most of the time it is NOT required.

However, as real-world use of similar practices has shown8, even in the absence of such special coding efforts, this kind of exception OFTEN allows reducing the cost of the bug (at least compared to crashing or calling terminate()).

Benefits of such “unchecked” exceptions include:

allowing us to avoid Ariane 5-like disasters in a standard and uniform manner (without resorting to system- and/or CPU-dependent trickery). IMNSHO, this alone is sufficient to push the concept of unchecked exceptions into the language.

even more uniform error handling model (heap exhaustion is no longer a special case, but just “one of those things which happen only in very abnormal situations”)

this comes at zero additional cost to developers (i.e. they do NOT need to handle all this stuff, so most app-level code will NOT differ from that in P0709)

“whether we may throw an unchecked exception or want to terminate” becomes a deployment-time decision (in particular, it won’t cause an avalanche of code changes when we start throwing an “unchecked” std::error). Among other things, it should simplify the life of library developers.

ability to standardize existing behaviors across different platforms – while leveraging zero-cost abilities and techniques already used in modern systems/CPUs (such as protected page @ zero address, CPU exception on divide-by-zero, etc.).

an ability to allow apps to avoid A LOT of dreadful UBs (and IMNSHO, 99% of industry devs for 99% of their code will prefer “as few UBs as humanly possible even if they might theoretically cause a minor performance loss”).

7 IMNSHO, 9-0-3-0-0 (SF-F-N-WA-SA) does qualify as a consensus

8 in particular, _set_se_translator() DID allow recovering from transient CPU faults during read-only processing, throwing the current event away, and working until the scheduled restart – with the bug fixed in the next release.

Conclusion

Of course, the above is just thinking-aloud ramblings and is light years away from being a solid proposal. However, any such proposal HAS to start with some thinking-aloud, so well – here it is, feel free to comment on the negatives of such an approach.

Comments

I think that when you propose “unchecked exceptions MAY be thrown out of nothrow functions without causing trouble,” that’s a contradiction in terms. If a nothrow/noexcept function throws, the caller WILL have trouble, period.
Now, I actually would have liked exceptional-unwinding-out-of-a-noexcept-function to have produced UB, in which case the vendor would technically have been free to make that UB look exactly as if the noexcept function had thrown an exception.
But the user’s calling code still wouldn’t be _expecting_ their call to unwind exceptionally!
The caller uses noexcept functions to eliminate unexpected control flow edges. If those control flow edges can show up anyway, then we risk leaving pointers un-freed, mutexes un-unlocked, files un-closed, and so on. Even when our intention is to fail fast, unreclaimed resources are trouble. Hypothetical worst case, an unexpected throw which leads to a mutex-not-being-unlocked could turn our hoped-for “fail fast” into a “deadlock and fail never.”

Correction: while it IS a bug, in the real world the caller MAY or MAY NOT have trouble. Just as with all these exceptions-which-should-never-happen: yes, they are bugs, but they MAY (or MAY NOT) cause Real Trouble. Sure, leaking resources IS a bug, but MUCH more often than not the leak is small enough to keep working until the planned system restart (or at least until an emergency-but-still-graceful restart); yes, a mutex left locked is most likely Big Trouble(tm), but (a) mutexes are to be avoided at app level anyway, and (b) in 99+% of cases RAII will unlock our mutex, so Big Trouble(tm) won’t materialize.
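The RAII point can be demonstrated directly (a generic illustration, not from any of the projects mentioned): even when an exception-which-should-never-happen propagates out of a region holding a mutex, an RAII lock releases it during stack unwinding, so the feared deadlock does not materialize as long as locking is RAII-based:

```cpp
#include <mutex>
#include <stdexcept>

std::mutex g_m;

// Throws while holding g_m; std::lock_guard unlocks it on ANY exit path,
// including exceptional unwinding.
void buggy_while_locked() {
    std::lock_guard<std::mutex> lk(g_m);
    throw std::logic_error("bug detected while holding the mutex");
}
```

After the exception is caught elsewhere, g_m is free to be locked again – the unwinding already released it.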

Sure, but “risk of trouble” != “trouble”. In other words, the problem is NOT guaranteed, and this is exactly the point – for apps where terminate() is THE worst-case scenario (as for life-support system or for Ariane 5 control system), then taking ANY risk is better than a guaranteed call to terminate(). And BTW, as RAII is still working for these exceptions – risks, while existing, are still not THAT bad (as I wrote, taking risks with such exceptions DID save from quite a few crashes in a real-world app with direct damages of downtime being $100K/hour).

Exactly – while the concept of fail-fast as such is known for ages, it is relatively new in software, that’s why I used wording of “using a concept of fail-fast for our programs” – which is IMO still precise enough.

As for Erlang – I wasn’t able to find a reference to fail-fast in Erlang (besides some quote on http://wiki.c2.com/?FailFast ), but if you give me a link to a publication – I will be happy to add it (TBH, it won’t change much in the overall analysis).

I definitely agree on not changing the overall analysis. I just find the old articles quite informative in many ways.

Gray’s report is mainly about hardware, but it also mentions software modules. See section “Fault containment through fail-fast software modules.”

Joe Armstrong’s PhD Thesis “Making reliable distributed systems in the presence of software errors” has a quite good section on Erlang philosophy. Although, it’s not that much older than Shore’s article. Link: http://erlang.org/download/armstrong_thesis_2003.pdf

Thanks, I added the refs and my very cursory analysis of them. Very briefly: (a) I _strongly_ disagree with Gray’s hypothesis of software-behaving-like-hardware-so-bugs-are-irreproducible – it DOES NOT stand, especially for good testable software. OTOH, (b) Armstrong’s idea of hierarchies-with-each-higher-level-being-simpler DOES fly to a certain extent – though from my experience, for real-world apps applicability limits will kick in, leaving those modules-which-MUST-NOT-fail much more complicated than they should be on paper :-(, so the question of “how to recover the module-which-MUST-NOT-fail?” is still very much open.

This is simply not true. If they had caught and ignored this exception, the Ariane 4 (for which this software was developed) *would not have flown*, because the CPU couldn’t handle the load. And if the software hadn’t been used on the Ariane 4, it wouldn’t have been used on the 5 either.

There are lots of interesting lessons to draw from the Ariane 5 crash, but simply “catching all exceptions” is NOT one. (I discussed the case in some detail in my 2018 Meeting C++ talk.)

Going into the history of Ariane 4 doesn’t affect a simple still-undisputed fact: IF this (or any similar) exception had been ignored on Ariane 5, THEN the half-a-billion-dollar crash wouldn’t have happened.

Analysis of “how it should have been avoided in the first place” is a different story (in fact, the lack of simulation is unforgivable to start with), but my key point – that there EXIST cases when ignoring all exceptions is preferable to crashing –
still stands.