Design for Safety

Unfortunately, everyone had forgotten why the
branch came off the top of the main and nobody realized that this was
important.

Trevor Kletz, What Went Wrong?

Before a wise man ventures into a pit, he lowers
a ladder -- so he can climb out.

Rabbi Samuel Ha-Levi Ben Joseph Ibn Nagrela

Software design must enforce safety constraints. Reviewers should be
able to trace from requirements to code and vice versa. In addition to
the specific safety constraints developed for the system being designed,
the design should incorporate basic safety design principles.

Hazard Elimination

Hazard elimination is the least expensive, and most effective, method
of handling system hazards. If addressed early in the system design
process, hazards can often be eliminated at almost no cost whatsoever.

Substitution may be applied to eliminate hazards in several ways.
Safe or safer materials can be used in place of hazardous ones. Simple
hardware devices are often safer than using a computer to enforce safety
constraints. There is no technological imperative that says we must use
computers to control dangerous devices. Introducing new technology
introduces unknowns and even unknown-unknowns.

Simplification may also eliminate hazards. A simple software design is
one that can be tested: the number of states is limited, and
determinism is preferred over nondeterminism. Multitasking designs are
much more complicated, so single-tasking should be used instead, and
polling should be used instead of interrupts wherever possible.
Software designs should also be easily understood and readable.
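
As a minimal sketch of the polling style (in Python, with hypothetical
sensor and shutdown functions), a single loop samples its inputs at a
fixed rate, so timing and sequencing are defined by the loop rather
than by asynchronous interrupts:

    import time

    POLL_PERIOD_S = 0.010  # fixed 10 ms cycle; timing is set by the loop, not by interrupts

    def read_temperature():
        # Hypothetical sensor read; a real system would sample hardware here.
        return 42.0

    def command_shutdown():
        # Hypothetical safe action.
        print("shutdown commanded")

    def main_loop():
        while True:
            start = time.monotonic()
            if read_temperature() > 100.0:   # poll the input and check the safety condition
                command_shutdown()           # single, well-defined response
            # Sleep out the remainder of the cycle so every iteration takes the same time.
            time.sleep(max(0.0, POLL_PERIOD_S - (time.monotonic() - start)))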

The interactions between software components should be limited and
straightforward. Reducing and simplifying interfaces can eliminate
errors and make designs more testable. Individual components should
include only the minimum feature set and capability required by the
system. It is easy to add functions to software, but hard to practice
restraint. Unnecessary or undocumented features add complexity.
Constructing a simple design requires discipline, creativity, restraint,
and time. The structural decomposition should match the functional
decomposition so that it is easy to map chunks of the program to their
intended purpose.

A tightly coupled system is one that is highly interdependent. Each
part is linked to many other parts. Failure or unplanned behavior in one
can rapidly affect the status of others. If processes are time-dependent
and cannot wait, there is little slack in the system for unusual
circumstances. Likewise, a tightly coupled system often has invariant
sequences and only one way to reach the program's goal. System accidents
are often caused by unplanned interactions. Coupling creates an
increased number of interfaces and potential interactions. Unless
carefully controlled, computers tend to increase the coupling in a
system.

There are several principles of decoupling that can be applied
to software designs. Software can be modularized, so that functionality
is divided into discrete units. Firewalls (not in the security sense)
can be used to prevent communication between parts of the system that
should not interact. Read-only or restricted write memories can prevent
coupling by controlling who can affect certain data values. Lastly,
decoupling software can eliminate the hazardous effects of common
hardware failures.
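
The restricted-write idea can be sketched as follows (a Python
illustration with hypothetical names; the language cannot strictly
enforce the restriction, but the structure expresses it): critical
values are owned by one component, other components get read-only
access, and every write passes through a single checked entry point:

    class CriticalStore:
        """Owns a safety-critical value; all writes go through one checked entry point."""

        def __init__(self):
            self._valve_commanded_open = False

        @property
        def valve_commanded_open(self):
            # Any component may read, but this property provides no way to write.
            return self._valve_commanded_open

        def command_valve(self, open_requested, authorized):
            # Only an authorized caller may modify the critical value.
            if not authorized:
                raise PermissionError("caller may not command the valve")
            self._valve_commanded_open = bool(open_requested)

    store = CriticalStore()
    print(store.valve_commanded_open)           # read-only access for monitoring components
    store.command_valve(True, authorized=True)  # write restricted to the controlling component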

Elimination of human errors requires reducing the opportunity for
human error by design. In general, humans are good at reacting to their
own mistakes. If a system makes the results of an error clear, the
operator may be able to correct it. There are many ways to increase the
safety of human-computer interaction. Make sure that the status of
components is clear to the operator at all times. Design software to be
error tolerant for the inevitable mistakes in operator entry of
commands.

It is also desirable to use a programming language that is not only
simple itself, but encourages the production of simple and
understandable programs. Some programming languages have been found to
be particularly error prone.

Lastly, to reduce hazardous conditions, software should only contain
code that is absolutely necessary to achieve the required functionality.
This has significant implications for COTS (commercial off-the-shelf)
software, which is designed for a more general marketplace. The extra
code in COTS may lead to hazards and make software analysis more
difficult. Another way to reduce hazardous conditions is to initialize
memory to a bit pattern that causes the system to revert to a safe
state if, for any reason, instructions start being read from unintended
locations.

The design of a turbine-generator control system provides a good example.

The safety requirements are:

Must always be able to close steam valves within a few hundred
milliseconds.

Under no circumstances can the steam valves open spuriously, whatever
the nature of the internal or external fault.

The functioning of the system is divided (decoupled) across separate
processors. The first processor controls all non-critical functions;
loss of this processor cannot endanger the turbine or cause it to shut
down. It handles the less important governing, supervisory,
coordination, and management functions. The second processor has only a
small number of critical functions, which can therefore be examined
with much greater scrutiny.

The turbine-generator uses polling. There are no interrupts except
for a fatal store fault, which is nonmaskable. Timing and sequencing
are thus fully defined, so more rigorous and exhaustive testing is
possible.

All messages are unidirectional. No recovery or contention
protocols are required, which leads to a higher level of
predictability.

Self-checks are performed on the plausibility of incoming signals and
on whether the processor is functioning correctly. Failure of a
self-check leads to reversion to a safe state through fail-safe
hardware.

A simple state table defines the scheduling of tasks and the
self-check criteria appropriate under particular conditions.
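
A minimal sketch of such a state table (in Python; the states, tasks,
and checks below are hypothetical, not the actual turbine design) maps
each operating condition to the tasks to schedule and the self-checks
to apply:

    # Hypothetical state table: operating condition -> (tasks to run, self-checks to apply).
    STATE_TABLE = {
        "run_up":        (["read_speed", "ramp_valves"], ["speed_in_range", "valve_feedback_ok"]),
        "on_load":       (["read_speed", "govern"],      ["speed_in_range", "load_within_limits"]),
        "shutting_down": (["close_valves"],              ["valves_confirmed_closed"]),
    }

    def cycle(state, run_task, run_check, go_safe):
        tasks, checks = STATE_TABLE[state]
        for task in tasks:
            run_task(task)                 # scheduling is fully determined by the table
        if not all(run_check(check) for check in checks):
            go_safe()                      # a failed self-check reverts to the safe state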

Hazard Reduction

Hazards may be reduced by passive safeguards, which maintain safety
merely by their presence, or by active safeguards, which require the
hazard or condition to be detected and corrected. Passive safeguards
cause the system to fail into a safe state, whereas active safeguards
must detect the hazard and act to bring the system to safety. Passive
systems rely
only on physical principles, while active mechanisms depend on less
reliable detection and recovery means. However, passive safeguards tend
to be more restrictive in terms of design freedom and are not always
feasible to implement.

Hazards can be reduced by designing the system for controllability.
The system can be made easier to control, both for humans and computers.
Try to use incremental control. Perform steps incrementally rather than
in one step, and provide feedback to test the validity of assumptions
and models upon which decisions are made. Providing feedback also allows
taking corrective action before significant damage is done. Feedback may
also be provided in terms of intermediate states and partial results.
Controllability is also enhanced by lowering time pressures, perhaps by
slowing the process rate.
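
A sketch of incremental control with feedback (hypothetical function
and parameter names): the command is applied in small steps, and the
feedback is checked after each step before the next one is taken:

    def move_valve_incrementally(command_step, read_position, target, step=5.0, tolerance=1.0):
        """Approach the target opening in small steps, checking feedback after each one."""
        position = read_position()
        while position < target:
            requested = min(step, target - position)
            command_step(requested)                 # small, reversible step
            new_position = read_position()          # feedback on the effect of that step
            if abs(new_position - (position + requested)) > tolerance:
                raise RuntimeError("feedback does not match the commanded step; stop and hold")
            position = new_position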

Decision aids can also help to control a system, as can monitoring. It
is difficult to make monitors independent, however.
Checks require access to information being monitored, but these checks
may corrupt that information. Monitoring also depends on assumptions
about the structure of the system and about errors that may or may not
occur. These assumptions may be incorrect under certain conditions.
Common incorrect assumptions may be reflected in both the design of the
monitor and the devices being monitored.

In general, the farther down in the hierarchy a check can be made,
the better. It means detecting the error closer to the time it occurred
and before erroneous data can propagate to other components. It is
easier to isolate and diagnose the problem at a lower level. And the
lower the level at which the failure is detected, the more likely the
system is to be able to fix the erroneous state rather than recover to a
safe state.

Writing effective self-checks is very hard, and the number that can
be included is usually limited by time and memory. It is best to limit
checks to safety-critical states. Use hazard analyses to determine
optimal check contents and locations. And be wary: added monitoring
and checks can themselves cause failures.
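
As an illustration (the valves and the constraint are hypothetical), a
self-check is most valuable when it tests a safety constraint
identified by the hazard analysis rather than incidental program state:

    def check_valve_constraint(inlet_open, drain_open, go_safe):
        # Hypothetical constraint from a hazard analysis: the inlet and drain
        # valves must never be open at the same time.
        if inlet_open and drain_open:
            go_safe("inlet and drain valves are open simultaneously")

    check_valve_constraint(inlet_open=True, drain_open=False, go_safe=print)  # no action taken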

In addition to designing for controllability, several types of
barriers can help in hazard reduction. Lockouts make access to dangerous
states difficult or impossible. For software, that means avoiding
electromagnetic interference (EMI),
limiting authority, and controlling access to and modification of
critical variables. Some techniques can be adapted from security for
this.

Inversely, lockins make it difficult or impossible to leave a safe
state. This addresses the need to protect the software against
environmental conditions, such as operator errors or data arriving in
the wrong order or at an unexpected speed. Completeness criteria can
help ensure that specified behavior is robust against mistaken
environmental conditions.

Interlocks can be used to enforce a sequence of actions or events.
For example:

Event A does not occur inadvertently

Event A does not occur while condition C exists

Event A occurs before event D

Examples of interlocks include batons, critical sections, and
synchronization mechanisms.
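
A minimal sketch of a software interlock (in Python; the event names
follow the list above, and the class itself is hypothetical) records
that the prerequisite event has occurred and that the blocking
condition is absent before the protected event is permitted:

    class Interlock:
        """Enforces the ordering rules listed above for two hypothetical events."""

        def __init__(self):
            self._a_done = False
            self.condition_c = False

        def do_event_a(self):
            if self.condition_c:
                raise RuntimeError("event A must not occur while condition C exists")
            self._a_done = True

        def do_event_d(self):
            if not self._a_done:
                raise RuntimeError("event D must not occur before event A")
            # ... perform event D ...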

Remember: the more complex the design, the more likely it is that
errors will be introduced by the protection facilities themselves.

Nuclear weapons provide a good example of hazard reduction. The safety
of a nuclear device depends on that device NOT working. Three basic
techniques (called "positive measures") are used to prevent unintended
detonation:

Isolation

Critical elements are kept separate by barriers.

Inoperability

The device is stored in an inoperable state. For example, an
ignition device or arming pin may be removed while in storage.

Incompatibility

Detonation requires that an unambiguous indication of human intent
be communicated to the weapon.

Protecting the entire communication system against all
credible abnormal environments, including sabotage, is not
practical.

Instead, a unique signal of sufficient information complexity
that it is unlikely to be generated by an abnormal environment
is used.

Nuclear systems feature:

Unique signal discriminators that must:

Accept proper unique signals while rejecting all spurious
inputs

Have rejection logic that is highly immune to abnormal
environments

Provide predictably safe response to abnormal environments

Be analyzable and testable

Barriers that protect unique signal sources

Removable barriers between these sources and communication
channels

A diagram of the safeguards against accidental nuclear detonation is
shown in the figure below.

The device may require unique signals from several different
individuals along various communication channels, using different types
of signals (energy and information) to ensure a proper intent.
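
A toy sketch of the unique-signal idea (the pattern and its length are
invented for illustration): the discriminator accepts only one long,
information-rich pattern, and any deviation latches a safe rejection
rather than retrying:

    # Hypothetical unique signal: long and information-rich, so a random or
    # stuck input is overwhelmingly unlikely to reproduce it.
    UNIQUE_SIGNAL = bytes.fromhex("5a3cc3a5690f96f05a3cc3a5690f96f0")

    class Discriminator:
        def __init__(self):
            self._latched_reject = False

        def evaluate(self, received):
            if self._latched_reject:
                return False                    # predictably safe response after any anomaly
            if received != UNIQUE_SIGNAL:
                self._latched_reject = True     # reject and stay rejected; no retry logic
                return False
            return True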

Another means of reducing hazards is failure minimization. Safety
factors and safety margins are used to cope with uncertainties in
engineering. These uncertainties arise from inaccurate calculations or
models, limitations in knowledge, and variation in the strength of a
specific material due to differences in composition, manufacturing,
assembly, handling, environment, or usage.

There are some ways to minimize problems when they cannot be
eliminated. Safety factors and margins are appropriate for continuous
and non-action systems. See the figure below.

Redundancy can increase reliability and reduce failures. However, it
assumes a model of random wearout. It is not so effective at
common-cause or common-mode failures, which may affect all redundant
parts equally. Redundancy can also add so much complexity to the system
(to coordinate the redundant components) that the complexity causes
failures. Certainly, redundant components are more likely to operate
spuriously. And redundant components may cause a false sense of
security. This was one of the contributing causes to the Challenger
accident. Certainly, redundancy has its place, and it can be useful in
reducing hardware failures, but what about software?

Claims are made that design redundancy and design diversity can
provide the benefits of redundancy to software. The bottom line is that
claims that multiple version software will achieve ultra-high
reliability levels are not supported by empirical data or theoretical
models.

Schemes have been proposed for standby spares and for concurrent use
of multiple devices with a voting scheme to resolve differences.
Identical designs may be used or intentionally diverse ones. But
diversity must be carefully planned to reduce dependencies. These
dependencies may be reintroduced in maintenance, testing, and repair. In
the end, redundancy is most effective against random failures, not
design errors.
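
As a minimal sketch of such a voting scheme (in Python; the "versions"
here are trivial placeholders), the results of independently produced
versions are compared and the majority value is adopted. Note that this
helps only when the versions do not fail together, which is exactly the
assumption examined below:

    from collections import Counter

    def majority_vote(results):
        """Return the value produced by a strict majority of versions, or None if there is none."""
        value, count = Counter(results).most_common(1)[0]
        return value if count > len(results) / 2 else None

    # Hypothetical, trivially different "versions" of the same computation;
    # the third contains a fault.
    versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
    print(majority_vote([v(3) for v in versions]))   # 9: the single faulty version is outvoted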

Software suffers from design errors, not random failures. Data
redundancy allows for detecting errors in data using schemes such as
parity bits, checksums, message sequence numbers, and duplicate pointers
or other structural information. Algorithmic redundancy involves
multiple versions voting on results. Of course, these versions must be
guaranteed to meet the same requirements using difficult to write
acceptance tests.
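
A small sketch of data redundancy on a message (the field layout is
hypothetical): a sequence number detects lost, repeated, or reordered
messages, and a checksum detects corruption:

    import zlib

    def encode(seq, payload):
        body = seq.to_bytes(4, "big") + payload
        return body + zlib.crc32(body).to_bytes(4, "big")    # append a CRC-32 checksum

    def decode(message, expected_seq):
        body, crc = message[:-4], int.from_bytes(message[-4:], "big")
        if zlib.crc32(body) != crc:
            raise ValueError("checksum mismatch: message corrupted")
        if int.from_bytes(body[:4], "big") != expected_seq:
            raise ValueError("sequence error: message lost, repeated, or reordered")
        return body[4:]

    print(decode(encode(7, b"valve closed"), expected_seq=7))   # b'valve closed'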

Multi (or N) version programming assumes that the probability of
correlated failures is very low for independently developed software. It
assumes that software errors occur at random and are unrelated. Even
small probabilities of correlated failures cause a substantial reduction
in expected reliability gains. Professor Nancy Leveson and John Knight
conducted a series of experiments to examine failure independence in
N-version programming, embedded assertions versus N-version programming,
and fault tolerance versus fault elimination.

The failure independence experiment collected 27 programs written
from one requirements specification. Graduate students and seniors from
two universities wrote the programs. The evaluation of these programs
simulated a production environment, using 1,000,000 input cases. Each of
the programs, taken individually, was of high quality. The results of
the experiment rejected the independence hypothesis. Analysis of
reliability gains must include the effect of dependent errors.
Statistically correlated failures result from the nature of the
application and the "hard" cases in the input space.

This should make intuitive sense. The unusual corner cases in input
that are hard for one designer are likely to be hard for another. For
example, imagine a program that takes the coordinates of three points
and finds the three angles of the triangle formed by those points. It
should seem likely that more designers will have errors in handling the
case where all three points lie on one line or are even at the same
coordinates. Harder input cases are harder for all designers, so errors
are not likely to be randomly distributed around the program.
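
A sketch of why the hard cases correlate (a deliberately naive Python
implementation, not anyone's actual program): the law-of-cosines
approach works for ordinary triangles but breaks down on degenerate
input, and every development team faces that same special case:

    import math

    def angles(p1, p2, p3):
        # Naive law-of-cosines implementation with no handling of degenerate input.
        a, b, c = math.dist(p2, p3), math.dist(p1, p3), math.dist(p1, p2)
        return (math.degrees(math.acos((b*b + c*c - a*a) / (2*b*c))),
                math.degrees(math.acos((a*a + c*c - b*b) / (2*a*c))),
                math.degrees(math.acos((a*a + b*b - c*c) / (2*a*b))))

    print(angles((0, 0), (3, 0), (0, 4)))        # an ordinary triangle works fine
    try:
        print(angles((0, 0), (0, 0), (2, 0)))    # coincident points divide by zero
    except ZeroDivisionError as err:
        print("degenerate input not handled:", err)
    # Collinear points can also fail: rounding can push the acos argument outside [-1, 1].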

Furthermore, the programs with correlated failures were structurally
and algorithmically very different. The conclusion is that the
correlations are due to the problem itself being the same, not to the
tools, languages, or even algorithms used.

Multi-version programming also suffers from the consistent comparison
problem. The consistent comparison problem arises from the use of
finite-precision real numbers (rounding errors). Correct versions may
arrive at completely different correct outputs and thus be unable to
reach a consensus even when none of the components "fail".
This may cause failures that would not have occurred with single
versions. In general, there is no practical solution to the problem.
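
The consistent comparison problem can be shown with a contrived
example: two versions compute the same total with different, equally
correct groupings of the arithmetic, and the rounding difference makes
them disagree on a threshold comparison even though neither has failed:

    values = [1e16, 1.0, -1e16]                      # contrived inputs that expose rounding

    total_a = (values[0] + values[1]) + values[2]    # version A groups left to right -> 0.0
    total_b = (values[0] + values[2]) + values[1]    # version B groups differently   -> 1.0

    THRESHOLD = 0.5
    print(total_a > THRESHOLD, total_b > THRESHOLD)  # False True: the versions cannot agree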

Another experiment was performed regarding self-checking software.
This experiment used the launch interceptor programs (LIP) from the
N-version programming study. 24 graduate students from UCI and UVA were
employed to instrument 8 programs (chosen randomly from the subset of 27
in which errors were found). The students were provided identical
training materials. In a first round, students wrote checks based solely
on the specification for the software, then the participants were given
a program to instrument. The students were allowed to make any number or
type of check. The students treated this as a competition among
themselves to see who could detect the most errors. The data collected
is shown below; the added self-checks introduced more new errors into
the programs than they detected.

Another hope for multi-version programming was that fault tolerance
could replace fault elimination. The hope was that if several versions
of a program are running and voting on results, one need not eliminate
defects from the software. For any given input sequence, the majority of
the versions of the software should still vote for the correct answer.
Thus, expensive testing and fault elimination processes can be removed
from the organization. Experimentation does not support this hypothesis.

Fault tolerance has been compared to fault elimination, including
techniques such as run-time assertions (self-checks), multi-version
voting, functional testing augmented with structural testing, code
reading by stepwise abstraction, and static data-flow analysis. The
problem used in the experiment was a combat simulation problem (from
TRW). The programmers employed in the experiment were separate from the
teams that detected faults in the software. Eight versions were produced
by two-person teams. The number of modules varied from 28 to 75, and the
number of lines of code from 1200 to 2400. The experimenters tried to
hold the resources constant for each technique.

The results showed that multi-version programming is not a substitute
for testing. The resultant system did not tolerate most of the faults
detected by fault-elimination techniques. The system was also fairly
unreliable in tolerating the faults that it was capable of tolerating.
The scaled-back testing done in conjunction with the multi-version
project was not able to detect errors that caused coincident failures
across multiple versions of the software. The results cast doubt on the
effectiveness of multi-version voting as a test oracle. Instrumenting
the code to examine internal states was much more effective. Lastly, the
intersection of the sets of faults found by each method was relatively
small.

In summary, these results don't necessarily mean that N-version
programming shouldn't be used, but it is important to have realistic
expectations of the benefits to be gained and the costs involved. The
costs are very high: more than N times the cost for one version of the
software. In practice, there will be a great deal of similarity in the
designs produced. If mid-algorithm cross checks are used between
versions, then even more similarity of designs will result as each
version must produce the same interim values. Because N-version
programming depends on design diversity, the safety of the system rests
on a quality that the development process itself tends to eliminate,
and there is no way to tell how different two software designs are in
their failure behavior. Lastly, requirements flaws are not
are in their failure behavior. Lastly, requirements flaws are not
handled by multiple implementation versions, and requirements
specifications are where most safety problems arise.

Recovery techniques are also sometimes applied to software. Recovery
comes in two forms. Backward recovery is a process of detecting an error
and returning to a known good state. This assumes that the error can be
detected before it does any damage, and it assumes that the
alternative (the state to which the system recovers) is better than
continuing from the failure state. Forward recovery, in contrast,
repairs the erroneous state and continues; it uses techniques such as
robust data structures, dynamically altered flow of control, and
ignoring single-cycle errors. The real problem in either case is
detecting erroneous states.
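
A minimal sketch of backward recovery (hypothetical names): the state
is checkpointed before each operation, an acceptance test looks for an
erroneous result, and on failure the system rolls back to the last
known-good state:

    import copy

    def run_step(state, step, acceptable):
        """Apply one operation with backward recovery to the last known-good state."""
        checkpoint = copy.deepcopy(state)   # save the known-good state before acting
        step(state)                         # attempt the operation in place
        if not acceptable(state):           # acceptance test detects an erroneous state
            return checkpoint               # roll back: discard the damaged state
        return state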

Hazard Control

The first technique of hazard control is limiting exposure. The
system should start out in a safe state and require deliberate change to
move to an unsafe state. Critical flags and conditions should be set
or checked as close as possible to the code they protect. And critical
conditions should not be complementary; for example, the absence of an
armed condition should not be used to indicate that the system is
unarmed.
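
A small sketch of the non-complementary principle (the state values are
hypothetical): the safe condition is represented explicitly, so a
corrupted or missing value is not mistaken for "unarmed":

    from enum import Enum

    class ArmState(Enum):
        SAFE = 0x2B5D    # hypothetical, distinct values; safe is a positive indication
        ARMED = 0x8CA6

    def is_safe(raw_value):
        # Only an explicit SAFE value counts as safe; anything corrupted, missing,
        # or merely "not ARMED" does not default to unarmed.
        try:
            return ArmState(raw_value) is ArmState.SAFE
        except ValueError:
            return False

    print(is_safe(0x2B5D), is_safe(0x0000))   # True False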

Isolation and containment are also used to control hazards. Examples
of these kinds of controls are physical barriers such as concrete
walls.

Protection systems and fail-safe design are also ways to control
hazards. These depend on the existence of a safe state and the
availability of adequate warning time. There may be multiple safe
states, depending upon process conditions, so a way to choose between
them is necessary. The general rule of thumb is that safe states should
be easy to get into and hazardous states should be hard to get into. A
good example is a chemical process that may take hours to start but can
be stopped nearly instantly if the operator presses a panic button.
Watchdog timers are similar; they monitor the software to see whether
it appears to have gone dead. If so, the watchdog timer signals a
problem. The
software the watchdog timer observes should not be responsible for
setting the timer, however. Sanity checks are also a good form of
fail-safe design, as are "I'm alive" signals. Protection
systems should provide information about their control actions and
status to operators.
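
A sketch of the watchdog idea (thread-based purely for illustration; a
real system would use an independent hardware timer): the monitored
software signals "I'm alive" each healthy cycle, and a separate
watchdog forces the safe action if the signals stop:

    import threading
    import time

    class Watchdog:
        """Forces the safe action if the monitored software stops signalling it is alive."""

        def __init__(self, timeout_s, go_safe):
            self._timeout = timeout_s
            self._go_safe = go_safe
            self._last_kick = time.monotonic()
            threading.Thread(target=self._watch, daemon=True).start()

        def kick(self):
            # Called by the monitored software each healthy cycle ("I'm alive").
            self._last_kick = time.monotonic()

        def _watch(self):
            while True:
                time.sleep(self._timeout / 4)
                if time.monotonic() - self._last_kick > self._timeout:
                    self._go_safe()          # the software appears dead: force the safe state
                    return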

It is important to consider the social engineering of protection
systems as well. Management and operators make changes to procedure and
devices once a system is in use. The easier and faster it is to return a
system to operational state, the less likely it is that the protection
systems will be purposely bypassed or turned off.

Damage Reduction

It may be necessary to determine a "point of no return"
beyond which recovery is not likely or even possible. Beyond that point,
the goal is simply to minimize the damage done.

Modification and Maintenance

Many accidents happen when systems are modified and maintained.
Systems evolve. Operators and management change procedures. New
equipment may be added to existing systems. Repairs and replacements are
carried out. Changes that affect the design of the system must be
reanalyzed for their impact on the safety of the system. When that
reanalysis is carried out, it is essential for the system documentation
to be updated with the design rationale that supports the changes. This
will help to preserve the system safety sought in the initial system
design.