Cascade failures

In a typical system, the software fits into several natural layers.
The GUI is at the topmost level in the hierarchy, and might
interact with a database or a control program layer.
These layers then interact with other layers, until finally, the lowest layer
controls the hardware.

What happens when a process in the lowest layer fails?
The next layer often fails as well: it sees that its driver is no longer available and faults.
The layer above that notices a similar condition—the resource
that it depends on has gone away, so it faults.
This can propagate right up to the highest layer, which may report some kind
of diagnostic, such as "database not present."
The trouble is that this diagnostic masks the true cause:
the fault wasn't in the database at all, but in the
lowest-level driver.

We call this a cascade failure—lower levels causing higher
levels to fail, with the failure propagating higher and higher until
the highest level fails.
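
As a minimal sketch of the propagation described above, the following Python models three layers as plain functions. The layer names, the `LayerFault` exception, and the `root_cause` helper are all hypothetical and chosen only for illustration. Each layer faults when the layer below it does, and the top layer reports only the fault it can see directly. Exception chaining is one possible way to keep the original fault recoverable.

```python
class LayerFault(Exception):
    """Raised by a layer when the resource it depends on is unavailable."""


def driver_read():
    # Lowest layer (hypothetical): the hardware driver process has died.
    raise LayerFault("driver process terminated")


def control_read():
    # Middle layer: sees that its driver is gone and faults in turn.
    try:
        return driver_read()
    except LayerFault as exc:
        raise LayerFault("control program lost its driver") from exc


def database_query():
    # Next layer up: the resource it depends on has gone away, so it faults.
    try:
        return control_read()
    except LayerFault as exc:
        raise LayerFault("database not present") from exc


def root_cause(exc):
    # Follow the explicit exception chain down to the first fault.
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def gui_refresh():
    # Topmost layer: returns the diagnostic it would show, plus the
    # original fault that the diagnostic hides.
    try:
        database_query()
    except LayerFault as exc:
        return str(exc), str(root_cause(exc))
```

Calling `gui_refresh()` yields the misleading top-level diagnostic alongside the fault that actually started the cascade.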

Where cascade failures are possible, maximizing the MTBF means not only making the
lower-level drivers more stable, but also preventing the cascade failure in the first place.
Preventing the cascade also decreases the MTTR, because there are fewer things to repair.
When we talk about in-service upgrades, below, we'll see that preventing cascade
failures also has some unexpected benefits.

To prevent a cascade failure, you can:

- provide a backup mechanism for failing drivers, so that when a
  driver fails, the system cuts over to a standby almost immediately, and

- provide a fault-tolerance mechanism in each layer that can deal with a
  momentary outage of the layer below it.
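
The two mechanisms above can be sketched together as follows. This is an illustration under stated assumptions, not a real driver framework: `DriverPair`, `DriverUnavailable`, and `read_with_retry` are hypothetical names, the standby is modeled simply as a delay before service resumes, and the clock and sleep functions are injectable so the behavior can be exercised without real waiting.

```python
import time


class DriverUnavailable(Exception):
    """Raised while no driver in the pair is serving requests."""


class DriverPair:
    """Hypothetical primary/standby pair; cutover takes `cutover_s` seconds."""

    def __init__(self, cutover_s, clock=time.monotonic):
        self.cutover_s = cutover_s
        self.clock = clock
        self.ready_at = 0.0  # time from which a driver is serving requests

    def fail_primary(self):
        # The primary dies; the standby is serving after the cutover delay.
        self.ready_at = self.clock() + self.cutover_s

    def read(self):
        if self.clock() < self.ready_at:
            raise DriverUnavailable("driver cutover in progress")
        return "sample"


def read_with_retry(driver, attempts, delay_s, sleep=time.sleep):
    """Upper layer: ride out a momentary outage instead of faulting."""
    for attempt in range(attempts):
        try:
            return driver.read()
        except DriverUnavailable:
            if attempt == attempts - 1:
                raise  # outage outlasted the retry budget; fault upward
            sleep(delay_s)
```

With a cutover of 0.05 seconds, a caller using `read_with_retry(pair, attempts=5, delay_s=0.02)` keeps retrying for up to 0.08 seconds, which is long enough for the standby to take over before the caller gives up.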

What might not be immediately obvious is that these two points are interrelated.
It does little good for a higher layer to be prepared for an outage of
a lower layer if that lower layer takes a long time to recover.
It also doesn't help much if the low-level driver fails and its standby takes
over, but the higher layer isn't prepared to handle that momentary outage gracefully.
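
The interdependence can be made concrete with a small worked check. The sketch below assumes a simple retry model that is not part of the original text: the higher layer makes its first attempt at the moment of failure and then retries at a fixed interval. The function name and parameters are hypothetical.

```python
def outage_is_masked(recovery_s, attempts, delay_s):
    """True if the last retry happens at or after the lower layer recovers.

    Assumes the first attempt happens at the moment of failure and each
    later attempt follows the previous one by `delay_s` seconds.
    """
    last_attempt_at = (attempts - 1) * delay_s
    return last_attempt_at >= recovery_s


# Fast standby and a tolerant upper layer: the outage is absorbed.
fast_and_tolerant = outage_is_masked(recovery_s=0.05, attempts=5, delay_s=0.02)

# Tolerant upper layer, but the lower layer takes 30 seconds to recover.
slow_recovery = outage_is_masked(recovery_s=30.0, attempts=5, delay_s=0.02)

# Fast standby, but the upper layer gives up after a single attempt.
no_tolerance = outage_is_masked(recovery_s=0.05, attempts=1, delay_s=0.02)
```

Only the first combination avoids the cascade. In the other two, one mechanism is present but the outage still reaches the layer above.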