In October 1992, the London Ambulance Service (LAS) suffered a disaster that
brought its operations to a virtual standstill for 36 hours and
cost approximately 20 people their lives. An
investigation found that the new computer-aided dispatch
(CAD) software was responsible for the crisis.

The problem lay in the design of the software, which was completely
inadequate for the needs of the London Ambulance Service. The system
performed so poorly that response times to emergency
calls were as high as 11 hours during the 36-hour crisis.

Some of the worst problems in the system included the following:

Inability of the software to distinguish between duplicate calls
from different people pertaining to the same incident.

Failure of the software to maintain and keep track of logged calls.
One particularly evident case was a woman who called the emergency
number after being trapped by the body of her collapsed husband. She
repeatedly called the ambulance service every half hour, only to
be told that there was no record of her earlier call. When an ambulance
eventually arrived nearly 3 hours later, her husband had already died.
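
The two problems above, telling duplicate reports apart from new incidents
and never losing a logged call, can be illustrated with a minimal sketch.
Everything in it is hypothetical: the names Call, Incident and log_call, the
local grid coordinates, and the 200-metre and 45-minute thresholds are
assumptions chosen for illustration, not details of the LAS system.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Call:
    caller: str
    x_m: float   # easting in metres on a hypothetical local grid
    y_m: float   # northing in metres
    minute: int  # minutes since the start of the shift


@dataclass
class Incident:
    calls: list = field(default_factory=list)


def log_call(incidents, call, radius_m=200.0, window_min=45):
    """Attach the call to a nearby, recent incident, or open a new one.

    Returns (incident, is_new). Every call is kept on record either way.
    """
    for incident in incidents:
        last = incident.calls[-1]
        close = hypot(call.x_m - last.x_m, call.y_m - last.y_m) <= radius_m
        recent = call.minute - last.minute <= window_min
        if close and recent:
            incident.calls.append(call)  # duplicate report: record it, no new dispatch
            return incident, False
    incident = Incident(calls=[call])
    incidents.append(incident)
    return incident, True
```

Because a repeat call is appended to the existing incident rather than
discarded, an operator taking a follow-up call can see the earlier ones.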

A more detailed analysis revealed the chain of
events that led to this catastrophic failure. The official report,
released in February of the following year, stated that "neither the
Computer Aided Dispatch system itself, nor its users, were ready for
full implementation on 26 October 1992. The CAD software was not complete,
not properly tuned, and not fully tested." Until then, the LAS
had used a manual pen-and-paper approach to dispatching, in which
urgent calls were written down by hand, transcribed, and passed on.

The LAS had also been criticized in public forums for
slow response times under its manual dispatch system, and its upper
management believed that a computer-based dispatch program was the quickest
way to cut response times and improve efficiency. LAS therefore
pressed the contractor to advance the timeline more quickly
than was prudent, resulting in a lower-quality product.

Perhaps because of the compressed timeline, the LAS allowed the system
to "go live" on October 26, 1992 with its staff having little
or no training in it. This one-phase deployment lacked testing
and critical oversight, and was a contributing factor in the subsequent
disaster. Even worse, despite the rushed deployment, LAS had no
backup plan and no failover mechanism to handle problems in the CAD system.

In addition, Systems Options, the company contracted by LAS to build
this software system, was completely out of its depth in writing such
software. Systems Options had no prior experience in writing large and
complex systems, much less mission-critical CAD systems on which people's
lives would depend. Moreover, allegations arose that LAS had settled for
the lowest bid from competing software vendors, rather than choosing
a contractor with the necessary experience and depth to provide a reliable
end product.

LAS made critical mistakes at every juncture in the design, development,
and deployment of its CAD system, and there were unfortunately no checks
in place to prevent the ambulance crisis from occurring.

China Airlines A300 Disaster

In April 1994, a China Airlines A300 crashed at Japan's Nagoya airport,
killing 264 of the 271 people on board. The most likely cause of the crash
was not software alone but the confused interaction
between software and human, in this case between the autopilot and the
26-year-old copilot who was attempting to land the plane.

Two minutes before the plane was due to land, the autopilot
went into take-off/go-around mode for reasons the investigation could
not determine. In effect, the autopilot was now trying to fly
the plane in a way directly opposed to what the human pilot
was attempting.

Despite warnings from the pilot, the copilot continued to attempt
to land the plane with the autopilot in go-around mode. The autopilot
was therefore attempting to gain altitude by increasing the pitch of the
plane, while the copilot tried to lose altitude using other control
surfaces. The crew then switched the autopilot out of go-around
mode, but could not undo some of the changes the autopilot had made to the
horizontal stabilizer, which kept the plane climbing.

The continued climb prevented the plane from landing, so the crew switched
the autopilot back into go-around mode, causing the plane to continue
climbing. Finally, the plane stalled after the angle of ascent increased
to 53 degrees, and it fell towards the ground, crashing tail-first.

This is not an overt case of software failure that cost lives, but rather
a case where the specifications and interface were less than optimal
for communication between the human pilot and the autopilot. The design
of the software was such that there were no audio cues signifying when
the autopilot was engaged or disengaged. Also, below a certain critical
altitude, the autopilot resisted de-activation, because the designers
originally feared that below this altitude there was insufficient time
for a human pilot to regain control of the aircraft.

In addition, the system lacked a way to resolve a conflict of control: the
question of whom to trust in this situation, the autopilot or the human
pilot, was never addressed. Had the system included safeguards that
resolved conflicting actions and handed control to a single pilot,
this disaster might have been averted.
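
One way to picture such a safeguard is an arbiter that lets exactly one
party command pitch at any moment and announces every handover. The sketch
below is a hypothetical illustration, not the A300's actual control logic:
the function name arbitrate, the normalised command range, the 0.2 override
threshold, and the policy of yielding to the human are all assumptions.

```python
from enum import Enum


class Authority(Enum):
    AUTOPILOT = "autopilot"
    PILOT = "pilot"


def arbitrate(authority, pilot_pitch_cmd, autopilot_pitch_cmd,
              override_threshold=0.2):
    """Return (new_authority, pitch_command, alert).

    Commands are normalised to [-1, 1]; positive means nose up.
    The threshold and the "human wins" policy are illustrative assumptions.
    """
    opposed = pilot_pitch_cmd * autopilot_pitch_cmd < 0
    forceful = abs(pilot_pitch_cmd) >= override_threshold
    if authority is Authority.AUTOPILOT and opposed and forceful:
        # The two parties are fighting: hand over control and say so.
        return Authority.PILOT, pilot_pitch_cmd, "AUTOPILOT DISENGAGED"
    if authority is Authority.PILOT:
        return Authority.PILOT, pilot_pitch_cmd, None
    return Authority.AUTOPILOT, autopilot_pitch_cmd, None
```

Whether the human or the automation should win is itself a design decision.
The point of the sketch is only that the decision is made explicitly and
signalled to the crew, rather than left to two parties working against each
other.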

Lauda Air B767 Accident

On May 26, 1991, a Lauda Air Boeing 767 suffered an in-flight failure
and broke apart over Thailand at an altitude of approximately 7000 meters
after departing from Bangkok. Because the flight data recorder was damaged,
the precise cause remains partly unresolved,
but the strongest possibility is that a thrust reverser deployed during
the flight. The thrust reverser then reduced lift by 25%,
causing the flight crew to lose control of the airplane.

The problem lay in design and testing. Initially, Boeing
claimed that software was in place that made accidental in-flight
deployment of the thrust reversers impossible. Later tests and simulations
showed, however, that the disintegration of certain physical
locks could lead to a scenario in which a thrust reverser deployed
despite the software supposedly in place to prevent it. Another
possibility that was investigated was a fault in the proximity switch
electronics unit (PSEU) and its accompanying operating software.

Simulations of scenarios similar to the crash showed that unless full wheel
and full rudder were applied within seconds of such a thrust reverser
deployment, the airplane would no longer be capable of controlled flight.
The same report concluded that "...recovery from the event was
uncontrollable [sic] for an unexpecting flight crew".

Because the investigating groups never positively identified
thrust reverser deployment as the cause, the
precise sequence of events that led to the crash remains speculative.

However, later testing made it clear that there were flaws in the
system that Boeing had designed. The question of where the responsibility
for such critical flaws lies remains unresolved. Although software failure
is just one of several possibilities in this case, why did Boeing fail
to isolate the multiple problems, both software and hardware related,
during the testing phase of the 767's subsystems?

The other interesting aspect is the tension between the amount of control
given to software automation, such as the autopilot, and the amount
given to human users. In this case, it is possible that over-reliance
on software automation may have decreased the readiness of the crew
to respond to emergencies by generating a sense of complacency where
the crew felt the computer would be able to handle most emergency situations
adequately.

Airbus A320 Crash in France:
User Interface in Critical Systems can be Critical

The overall reliability in an otherwise robust safety-critical system
can be compromised by a poor human-computer interface, as this case
study shows.

Synopsis: On January 20, 1992, an Airbus A320 jetliner crashed near Strasbourg,
France. A board of inquiry found the fault to lie in "pilot error."
However, others have criticized the design of the A320's "glass cockpit"
which, allegedly, was confusing and hampered the pilots' ability to
monitor flight conditions, such as the high rate of descent experienced
by the doomed plane. Further, the system did not warn pilots of danger
in time for corrective action.

The A320 jetliner was introduced in 1987 by Airbus. There had been
two crashes of A320s before the 1992 crash. The first occurred
in 1988, when a plane owned by Air France crashed during a demonstration
flight at an air show. The official cause was reckless piloting, though
the pilot insists the plane failed to warn him of loss of altitude.
The second crash involved an Indian Airlines A320 landing at Bangalore.
A pilot pushed an incorrect button, which idled the engines, causing
the plane to drop rapidly and crash-land on a golf course.

The 1992 crash of the Air Inter flight near Strasbourg killed 87 people.
The pilots were apparently unaware of the plane's excessively rapid descent
as it approached Strasbourg. There may have been a computer warning a
second before the crash that altitude was too low, which left no time
to do anything. This system measures the aircraft's altitude using a
radio beam and calls out altitudes at certain height intervals. The
pilots received an altitude message a second before the crash. A second
warning system, which warns pilots of an excessively rapid descent or low
altitude, was not installed on the Air Inter aircraft, as it was not
required and Air Inter felt it gave too many false warnings, leading pilots
to ignore those warnings.

How did the plane begin descending too fast? It is believed that the pilots
had confused the "vertical-speed" and "flight-path-angle" modes of descent,
and were in the wrong mode. The two modes had very similar display formats.
The pilots were very busy at the time making a last-minute change in
the flight plan, requested by the tower, and thus were probably concentrating
on the navigational display, so the altitude and vertical speed indicators
on the main display were overlooked.
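
A rough calculation shows how much such a mode confusion can matter. The
sketch below assumes, as has been widely reported for this accident but is
not stated above, that the same two dialled digits meant 3.3 degrees in
flight-path-angle mode and 3,300 feet per minute in vertical-speed mode. The
190-knot ground speed and the function names are purely illustrative
assumptions.

```python
from math import radians, tan

FEET_PER_NAUTICAL_MILE = 6076.12


def descent_rate_fpa(angle_deg, ground_speed_kt):
    """Descent rate in ft/min for a flight-path angle at a given ground speed."""
    feet_per_min = ground_speed_kt * FEET_PER_NAUTICAL_MILE / 60
    return feet_per_min * tan(radians(angle_deg))


def descent_rate_vs(dial_value):
    """Descent rate in ft/min when the dialled value means hundreds of ft/min."""
    return dial_value * 100


dial = 33  # the same two digits, read in two different modes (assumed)
intended = descent_rate_fpa(dial / 10, ground_speed_kt=190)
actual = descent_rate_vs(dial)
print(f"flight-path-angle mode: {intended:.0f} ft/min")
print(f"vertical-speed mode:    {actual} ft/min")
```

Under these assumptions the vertical-speed reading descends roughly three
times as fast as the intended flight-path angle, which is why two modes with
near-identical displays are a hazard.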

An inquiry board deemed the cause of the accident to be pilot error,
because the pilots should have noticed (or not made) the error in the
descent mode. Others feel some of the blame lies in the design of the
system interface. Flint Pellet, of Global Information Systems Technology,
wrote on the RISKS forum that it is "more of a user-interface design error,
if you ask me. If you overload a person with things to do and input
to consider to the point where they can no longer keep up, it is hardly
reasonable to simply brush it off as 'human error' when they fail to
keep up."

The A320 uses extensive computer control. The computer control system
has full authority to override a pilot's action; for example, it allows
a pilot to bank only so far as stresses on the plane stay within limits.
Some pilots have claimed this is restrictive in the event of an emergency.
Further, the plane is so automated that the pilot is for the most part
reduced to the role of system manager, programming in flight paths and
such, while the computer actually flies the plane. With little to do,
pilots can become complacent. The computers display flight information
on computer screens in the cockpit, hence the term "glass cockpit". Such
displays, as in the A320, make it somewhat more difficult to monitor
trends in flight data than traditional mechanical instrumentation does.

So, while the crash of the A320 in Strasbourg was due to a pilot error
in inputting an incorrect flight mode, at least part of the blame lies
in a user interface that made it hard for the error to be detected.
The human factor must be carefully considered in the design of a safety-critical
system, including such factors as complacency arising from little interaction
with the system.

Computer Failures in Two Traffic Systems

These case studies remind us to test "upgrades" before
installing them on critical systems and to think about what the safest
state for a system is in the event of an error.

Synopsis: The traffic system in Austin, Texas, fails after a software
modification is installed without being tested first. The traffic system
in Lakewood, Colorado, fails when the only disk drive on the only computer
fails. Lights in both cases defaulted to blinking red, causing massive
backups and a multitude of accidents.

On April 13, 1990, programmers for the city of Austin, Texas, modified
software controlling the city's stoplights. The software was loaded
into the main computer which sent the changes to the stoplights. The
software was not properly tested beforehand and about 360 of the city's
600 or so intersections controlled by the system received erroneous
data. Receiving the bad data, the lights defaulted to blinking red,
bringing traffic to a grinding halt and causing widespread accidents.

In order to fix the system, each intersection had to be individually
reset by city work crews. This could not be done remotely.

Some critics, such as King Ables of Micro Electronics and Computer
Technology, believe it would have been better for the lights to go into
a green-yellow-red cycle with a default timing and that this would have
lessened traffic problems and accidents.
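
The alternative the critics describe amounts to validating incoming data and
falling back to a known-good timing plan when validation fails. A minimal
sketch of that idea follows. The plan format, the 30/4/30-second default,
the 1 to 300 second bounds, and the names is_valid and select_plan are
assumptions made for illustration, not details of either city's system.

```python
DEFAULT_PLAN = {"green": 30, "yellow": 4, "red": 30}  # seconds; illustrative values


def is_valid(plan):
    """Accept only plans with exactly the three phases and plausible durations."""
    if not isinstance(plan, dict) or set(plan) != {"green", "yellow", "red"}:
        return False
    return all(
        isinstance(v, int) and not isinstance(v, bool) and 1 <= v <= 300
        for v in plan.values()
    )


def select_plan(received):
    """Use the received plan if it is valid; otherwise use the default cycle."""
    if is_valid(received):
        return received, "central"
    return dict(DEFAULT_PLAN), "fallback"
```

For example, select_plan({"green": -1, "yellow": 5, "red": 40}) returns the
default plan tagged "fallback", so an intersection that receives bad data
keeps cycling on default timing instead of dropping to blinking red.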

Information on whether fire and rescue vehicle response times were
hampered was not available.

In the Lakewood, Colorado, case, a hard drive running the city's traffic
management software failed on February 27, 1990. The city had no backup
computer or drive, though the hard drive was backed up on tape. Of course,
there was no drive to restore the tape contents onto.

As in the Austin case, lights defaulted to blinking red and had to
be individually reset at the intersection. Traffic was extremely slow
and the accident rate high.