Safety Problems, Hazard Analysis, and Hazard Control:

Copyright(c), 1990, 1995 Fred Cohen - All Rights Reserved

As the examples from List 1 begin to indicate, safety
problems have many contributing factors. In general, they tend to
be related to system complexity, in that more complex systems are
more likely to contain problems that cause hazards. Safety is not
just a matter of
increasing reliability, because under current technology, we are
unable to achieve ultra-high reliability in software, and research in
this area is still in its infancy.

Hardware fault tolerance techniques are primarily aimed at
preventing, detecting, and correcting errors due to random faults, while
software reliability is difficult to assess in terms of random behavior
because of its extreme complexity. N-modular redundancy is used in
hardware to allow detection and masking of faults, but experiments
in N-version programming have been relatively unsuccessful [Kelly83]
[Chen78], primarily because it is difficult to assure correctness of
design, the behaviors of independently generated implementations of
the same specification tend to diverge considerably, specifications
are generally too imprecise to admit a unique solution, and
specifications are no more reliable than software with respect to
safety properties.

Extensive reuse of certified software has not yet progressed
to the point of widespread practicality, although a great deal of
effort in this area is underway and there are many examples of low
level packages that are extensively reused. Exhaustive testing and
verification are impractical for most software because of the large
number of states and paths through a program. These problems are
significantly exacerbated by interrupts, which can create branches
at virtually any point in a program; by dynamic allocation, whose
behavior depends on the availability of resources at run time; and
by heavy loading, which is often difficult to reproduce under test
conditions.
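
As a rough illustration of why exhaustive path testing is
impractical, the short calculation below uses assumed, purely
illustrative numbers to count the paths created by a modest number
of independent two-way branches, and the further blow-up when a
single interrupt may arrive before any one of the instructions.

    # Back-of-the-envelope path counts; all numbers are assumptions.
    branches = 40                      # independent two-way branches
    paths = 2 ** branches              # ~1.1e12 paths from branching alone
    instructions = 1000                # possible interrupt arrival points
    paths_with_one_interrupt = paths * (instructions + 1)
    print(f"{paths:.2e} branch paths")
    print(f"{paths_with_one_interrupt:.2e} allowing one interrupt")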

There is no way to guarantee that simulations are accurate
because assumptions must be made about the controlling and controlled
processes and environments which may not be valid in every possible
application. The problem is amplified when writing software for
hardware that is new or does not yet exist, as is often done for
the most critical portions of operating systems, since once a
system is built, an operating system is typically used for most
further development. As we have seen from the study of operating
system protection, there are great difficulties in formulating
correct policies, designing appropriate models, transforming these
into specifications, and implementing them correctly.

Computers are often used in safety critical systems because
of their versatility, power, performance, and efficiency, while they
present safety risks because of their extreme complexity and our
inability to provide correct software for them. Software is just one
part of the system, and while many techniques are used to assure safe
operation of the hardware in critical systems, software is often
given a great burden. Hazards typically arise from hardware component
failures, interfacing problems (communication and timing), human
error, and environmental stress. Software is often used to replace
standard hardware safety devices such as interlocks, and this often
places a disproportionate burden on the software engineer. These
hazards cannot be treated in isolation, because problems are often
caused by complex interactions between components and by multiple
failures.

It is quite likely that the future of software safety will be
similar to that of secure operating system design. A few basic
principles will be formalized, and systems will be generated in such
a manner as to allow verification that the implementation meets the
safety policies. Eventually, automatic programming may offer hope
for assured implementation and for testing techniques, but the
problems of policy, modeling, and specification remain well beyond
the state of the art in software safety.

There are no mathematically based software safety policies in
the literature, and it is unlikely that any such policies will come
into being without a substantial advancement in the state of the art.
The closest thing to a safety policy comes from science fiction in
Isaac Asimov's "I, Robot", wherein the three laws of robotics are
built into the "positronic brains" of robots. These three laws are
(approximately):

A robot may not injure a human being, or through inaction, allow
a human being to come to harm.

A robot must obey the orders given to it by human beings except
where such orders would conflict with the 1st law.

A robot must protect its own existence as long as such
protection does not conflict with the 1st or 2nd laws.

In fiction, Asimov covers a number of scenarios in which the
interactions of these laws create problems that are invariably
solved by either the humans in charge of the robots or the robots
themselves. In reality, these policies are impossible to implement
because it is undecidable, in general, whether a given action will
cause harm to a human being or prevent it.

Because of the state of the art in policy making, the closest
thing to a policy that exists in software safety is the policy of
reducing risks to an 'acceptable' level. In risk analysis, the
acceptability of risks is often assessed by comparison to other risks
in everyday environments. For example, if the risk due to a
particular system is reduced to the level where the increased hazard
to each individual at risk is equivalent to that presented by the
individual crossing the street one additional time in a lifetime, it
might be acceptable. A fairly standard metric for measuring risk is
the average reduction in life expectancy, but any number of other
metrics may be used as well.
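
As a purely illustrative worked example of the life-expectancy
metric (the numbers below are assumptions chosen only to show the
arithmetic, not actuarial data), the expected reduction in life
expectancy is the probability of the fatal event multiplied by the
years of life lost when it occurs:

    # Illustrative risk arithmetic; every number here is an assumption.
    p_fatal_per_year = 1e-7       # chance of a fatal failure per person-year
    years_exposed = 40            # years each individual is exposed
    years_lost_if_fatal = 35      # remaining life expectancy at the event
    loss_days = p_fatal_per_year * years_exposed * years_lost_if_fatal * 365
    print(f"average life-expectancy reduction: {loss_days:.3f} days")
    # ~0.05 days, which can then be compared against everyday risks.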

In practice, safety is implemented by step-wise improvement.
We identify hazards posing unacceptable risks, determine if and how
the system can exercise those hazards, and design the system so as to
eliminate or minimize those hazards. The problem with this method is
that there may be hazards that are not identified because there is no
clear policy or model on which to base our analysis. Thus the state
of affairs in safety is similar to the problem of fixing leaky sieves
in operating systems.

In order to improve the situation to some degree, there are
published standards for safety which specify pre-defined hazards.
DoD nuclear safety requirements and NRC nuclear reactor safety
standards are typical. We can also improve the situation by using
hierarchical structure to reduce the complexity of design and
analysis [Newell] and by providing standardized tools for risk
analysis, but these techniques in no way preclude catastrophic
failures of kinds that such ad-hoc analyses fail to anticipate.

Hazard control is generally based on the elimination of
hazards or minimization of their occurrence or effects. Safety
analysis is generally done in a precedence order as follows:

Design for intrinsic safety, eliminating the hazard wherever
possible.

Minimize the occurrence of hazards that cannot be eliminated.

Provide automated safety devices to control potentially hazardous
conditions that do occur.

Provide warning devices, procedures, and training to limit the
damage when other measures fail.

The difference between intrinsic safety and fault tolerance
is relative in that the fault tolerance at any given implementation
level is normally treated as intrinsic safety at the next higher
level of implementation. As an example, the design of semiconductor
gates involves a great deal of redundancy in that many atomic
particles are involved in storing a bit. At the level of the computer
designer, gates are treated as having intrinsic reliability
properties, and fault tolerance is used to improve the system
reliability over the mission time where appropriate. At the OS
level, the hardware is generally assumed to provide an intrinsic
level of protection, and the OS is designed to add redundancy to
achieve desired system goals. At the level of designing tools under
an operating system, the OS is assumed to provide a level of
intrinsic protection, and any added protection is provided by
redundancy at that level. At the application level, the tools are
assumed to provide a given level of intrinsic protection, and
additional protection is added as required. At the user level,
intrinsic behavior is expected, while the user provides some
additional protection in the form of procedures for handling
exceptional cases. In many systems, multiple users are provided to
protect against failures in individuals, and in most large
organizations, further redundancy is used to assure that the
organization doesn't depend too heavily on any given group.
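
This layering can be pictured with a small sketch (Python, with
assumed names and an assumed failure model): the lower level uses
redundancy internally, here by reading a value three times and
voting, while the level above simply calls it and treats the result
as intrinsically reliable, adding only its own checks for its own
purposes.

    # Sketch of redundancy at one level appearing as intrinsic
    # reliability at the next; names and failure model are assumed.
    import random

    def raw_read(value=42, error_rate=0.1):
        """Unreliable low-level read that occasionally returns garbage."""
        return value if random.random() > error_rate else random.randint(0, 99)

    def reliable_read():
        """Lower level: triple read with majority vote (its fault tolerance)."""
        a, b, c = raw_read(), raw_read(), raw_read()
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("no majority; report the fault upward")

    def application_step():
        """Higher level: treats reliable_read() as intrinsically
        reliable and adds its own redundancy (a range check)."""
        v = reliable_read()
        if not 0 <= v <= 99:
            raise ValueError("value out of expected range")
        return v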

Design for intrinsic safety primarily involves the use of
high quality equipment at the next lower implementation level. This
usually involves fail safe mechanisms and reliability techniques.
Minimizing hazard occurrence generally involves active monitoring of
potentially hazardous conditions, automatic control of protection
mechanisms, lockouts of functions that cause hazards in particular
situations, lockins that force activities in particular situations,
and interlocks that force complex sequences of activities before
performing high risk functions or force active signaling to continue
performing hazardous activities. Automated safety devices to
control potentially hazardous conditions usually involve hazard
detection and warning, fail safe designs, and damage control or
containment.
Procedures and training help personnel react to hazards.
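
As a deliberately simplified sketch of the lockout, lockin, and
interlock ideas in software (the controller, its states, and its
method names are all hypothetical):

    # Hypothetical actuator controller illustrating three mechanisms:
    # lockout   - refuse the hazardous action while a hazard exists,
    # lockin    - force and hold the safe activity while it persists,
    # interlock - require an arming sequence and a live permission
    #             signal before the hazardous action may proceed.

    class ActuatorController:
        def __init__(self):
            self.armed = False          # interlock: must be armed first
            self.overtemp = False       # sensed hazardous condition

        def arm(self, operator_confirmed):
            # Interlock step 1: a deliberate, separate arming action.
            self.armed = bool(operator_confirmed)

        def fire(self, permission_signal_present):
            # Lockout: the function is unavailable in a hazardous state.
            if self.overtemp:
                raise RuntimeError("locked out: over-temperature")
            # Interlock step 2: arming plus an active "continue" signal.
            if not (self.armed and permission_signal_present):
                raise RuntimeError("interlock not satisfied")
            return "fired"

        def on_overtemp(self):
            # Lockin: force shutdown and hold it until the hazard clears.
            self.overtemp = True
            self.armed = False
            return "shutdown forced"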