"Our bodies have great availability. I have soft errors all the time: my memory fails once in a while, but I don't 'crash.' My whole body doesn't shut down when I cut a finger."

--Robert Morris, director of IBM's Almaden Research Laboratory, on the inspiration of IBM's autonomic computing project

Three general techniques are used to deal with the errors, failures, and faults that occur so frequently in computer systems, and all three remain subjects of active research. These software techniques differ in how they handle potential faults in a program.

Fault avoidance consists of creating programs that are free of faults. Developing fault-free software requires a precise system specification, extensive use of reviews during development, and careful planning and implementation of system testing.

However, because constructing a completely fault-free program is well known to be generally unachievable, a second technique is fault removal. This method accepts that faults exist and removes them after the program has been written, but this too proves to be a very difficult goal to reach.

Finally, fault tolerance is the technique that has been the focus of much recent research. It consists of accepting the existence of faults, admitting that they cannot be removed, and using a combination of detection and recovery mechanisms to ensure that the faults do not cause programs to fail.
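As a rough illustration of combining detection with recovery, the sketch below runs a primary routine, checks its result with an acceptance test, and falls back to an independently written alternate when the check fails. All names here (primary_sqrt, backup_sqrt, acceptable, tolerant_sqrt) are hypothetical, and the fault in the primary routine is planted deliberately for the example; this is one possible arrangement, not a prescribed design.

```python
import math

def primary_sqrt(x):
    # Hypothetical primary routine with a planted fault (for illustration):
    # it returns a wrong answer for inputs above 100.
    if x > 100:
        return -1.0
    return math.sqrt(x)

def backup_sqrt(x):
    # Hypothetical alternate, written independently: Newton's method.
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def acceptable(x, result):
    # Detection: an acceptance test that checks the result against the input.
    return result >= 0 and math.isclose(result * result, x,
                                        rel_tol=1e-9, abs_tol=1e-12)

def tolerant_sqrt(x):
    # Recovery: try each variant in turn and return the first result
    # that passes the acceptance test.
    for variant in (primary_sqrt, backup_sqrt):
        try:
            result = variant(x)
        except Exception:
            continue  # a raised exception counts as a detected error
        if acceptable(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")
```

Calling tolerant_sqrt(144) triggers the planted fault in the primary routine; the acceptance test rejects the bad result and the alternate supplies the answer, so the caller never sees a failure.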

Some general methods to improve dependability in both hardware and software are listed below:

good design methods

extra-reliable (expensive) components

improved production techniques

improved power supply and cooling

adding redundancy:

in time: do computations several times, with the same hardware (HW) and software (SW), or with different HW and/or SW (design diversity)

in space: have multiple units perform the same function, and add majority voting equipment
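The two forms of redundancy can be sketched in a few lines, assuming the redundant units are plain functions and the voter is done in software. The names majority_vote, redundant_in_space, redundant_in_time, and make_flaky_square are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

def majority_vote(results):
    # Return the value reported by a strict majority of the results.
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise RuntimeError("no majority among redundant results")

def redundant_in_space(units, x):
    # Redundancy in space: several units compute the same function,
    # and the voter masks a minority of wrong answers.
    return majority_vote([unit(x) for unit in units])

def redundant_in_time(unit, x, runs=3):
    # Redundancy in time: the same unit repeats the computation,
    # which can mask a transient error in one of the runs.
    return majority_vote([unit(x) for _ in range(runs)])

def square(x):
    return x * x

def faulty_square(x):
    # Hypothetical unit with a permanent fault (always off by one).
    return x * x + 1

def make_flaky_square():
    # Hypothetical unit with a transient fault: only its first call is wrong.
    calls = {"n": 0}
    def unit(x):
        calls["n"] += 1
        return -1 if calls["n"] == 1 else x * x
    return unit

space_result = redundant_in_space([square, square, faulty_square], 5)
time_result = redundant_in_time(make_flaky_square(), 5)
```

Note that repeating a computation on the same unit helps only against transient errors; a permanent fault such as the one in faulty_square would give the same wrong answer on every run, which is why the list above also mentions different HW and/or SW.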

>> Vocabulary

failure: a component's inability to perform to its specifications (note: specifications can be wrong)

error: cause of a failure

fault: anomalous component condition:

internal: design or manufacturing fault, damage, aging

external: harsh conditions, radiation, electromagnetism, misuse

design faults: those faults that, after being repaired, result in a system with a different specification