Andrew Koenig

Dr. Dobb's Bloggers

Operating Systems Help Us Limit Harm From Failures

November 07, 2012

Last week, I suggested that programs should use assert statements to handle conditions that "can't happen," such as invariant failures. I want to continue that discussion by looking at how a user might deal with the failure of such a program.

Last week, I suggested that programs should use assert statements to handle conditions that "can't happen," such as invariant failures. I want to continue that discussion by looking at how a user might deal with the failure of such a program.

Suppose you call a function that reports failure. If you are to recover from such a situation at all, it is important to be able to limit the harm that the function's failure might have done. For example, suppose the function reads an input file. If it reports failure — even if that failure can be due only to a bug in the function itself — it is nice to be able to trust that the function did not change the contents of the input file.

The ability to limit harm in such circumstances is an important service of an operating system. For example, if you tell the operating system that a file is read-only, then it is much harder for even a broken program to change that file's contents than if you do not make the file read-only. Similarly, most modern operating systems make it somewhere between difficult and impossible for a bug in one process to change memory belonging to a different process.

Processes do something else important, too: They limit the scope of "program termination." In C++ terms, when we talk about a program terminating because of an assert, what we really mean is that a process is terminating. Moreover, the operating system usually prevents a process that terminates from corrupting memory belonging to other processes.

As a result, most operating systems make it possible for us to use processes to deal with assertion failure: We can put part of our computation in a separate process, and then, if that process fails, we can be reasonably confident that that failure did not sneak back across the process boundary and corrupt our own computation.

This strategy is important for building large, reliable systems. For example, Microsoft Windows has the notion of a service, which is a process that the operating system runs automatically. Associated with each service is a set of instructions for what should happen if that service terminates. Should the system restart the failed process? If so, should it do so immediately, or should it wait a while? If the process fails again, what should the system do? Is there a limit to the number of times the system should try to restart the process before giving up? Should the system take some other action in addition to (or instead of) restarting the process? And so on.

Of course the system's mechanism for starting and restarting services might itself be broken. For that matter, there might be bugs that create holes in the firewalls between processes. However, both of these mechanisms are fairly easy to understand, and therefore to implement and test. Moreover, they are apt to be very widely used, which makes them much less likely to fail than the individual services they control.

In short, assert statements, together with facilities such as operating-system-assisted firewalls, suggest a strategy of:

Arranging for parts of a system to fail as quickly as possible when they do fail, and

Wrapping those parts in a way that limits the scope of the failure.

Those wrappers might consist of putting different parts of the system in different processes, or running different parts at different times and using files to pass information between them, or even running them on different computers connected by a network. Either way, the underlying idea is the same: Arrange for each part of the system to be able to complete its work successfully or to report failure — and limit the scope of that failure — as aggressively as possible.

Of course, it's not always possible to limit the ill effects from a failure completely. Next week, I'll discuss how we can use data-structure audits to reduce the harm in such cases.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!