Title: Getting it Right
1Getting it Right
- The Cost of Computer Failure
- 13 February 2008
2The Getting it Right
- Alan Clements
- School of Computing
- University of Teesside
- Middlesbrough
- England
3Overview of the Lecture
- Computer systems control all aspects of life.
- The failure of computer systems can have catastrophic effects.
- Major aviation disasters have already taken place because of computer errors.
- We must ensure that designers know how failures occur and how they can be prevented.
- We have to balance new disasters due to computers against old disasters due to humans.
4The Computer Aided Crash
- My academic interest is computer architecture.
- I propose to talk about the need for computer scientists to understand the consequences of the systems they design, by reference to computer-based accidents, particularly in the aviation industry.
5Ethics and Professionalism
- All professional organizations stress that students be taught ethics.
- The emphasis on ethics is not for idealistic reasons; it is because of the high cost of lapses in ethics.
- History is full of examples of the tremendous cost of neglecting ethics.
- Corporate manslaughter is a new offence in the UK.
- Corporate manslaughter is a crime that can be committed by a company in relation to a work-related death.
- The offence is intrinsically linked to whether a senior manager (a "controlling mind and will" of the company) is guilty of manslaughter.
- If the director or manager is found guilty, the company is guilty.
- King Hammurabi said that if a building collapses and kills people, the builder shall be put to death.
6Non-aviation Examples of Computer Error
- Let's look at two classic computer errors.
- One involves a therapeutic X-ray machine and the other a surface-to-air missile.
- Both errors were due to poor design rather than component failure.
7The Therac 25 Incidents
- The Therac-25 was a therapeutic X-ray machine designed to treat cancer sufferers.
- It operated in two modes: X-ray and electron beam.
- In the X-ray mode a powerful electron beam was aimed at a target to generate X-rays.
8The Therac 25 Incidents
- If the machine was set in the X-ray mode and the target was not engaged, the patient would receive a fatal dose of high-intensity radiation.
- Early Therac models had electro-mechanical interlocks that made it impossible to energize the electron beam if the target was not in place.
- The Therac-25 used a PDP-11 computer to perform all operations, including moving the target into place when in the X-ray mode.
9The Therac 25 Incidents
- On several occasions the target was not rotated into position.
- Patients received massive doses from the high-intensity electron beam, leading to both thermal and radiation damage.
- Six accidents involving massive overdoses of radiation occurred between 1985 and 1987 before the machines were recalled.
- This was a failure of design and of imagination. It was also a failure by the regulatory bodies to anticipate the problem and then to respond to it.
10Reasons for the Therac 25 Failures
- Software was re-used from older models that had hardware interlocks.
- The hardware provided no way for the software to verify that sensors were working correctly.
- The operator interface was not correctly synchronized with the system operation. If the operator corrected an error too quickly, a race condition occurred. This was missed during testing because operators weren't fast enough for the problem to occur.
- The software set a flag by incrementing it. If it was incremented too often, arithmetic overflow occurred and the software bypassed safety checks.
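The flag-overflow point can be illustrated with a minimal Python sketch. It assumes the flag is held in an 8-bit variable and that a value of zero means "skip the check"; the width, the function names and the number of passes are illustrative assumptions, not details taken from the Therac-25 software.

```python
from typing import List

# Hypothetical sketch of a flag that is "set" by incrementing it.
# Assumption: the flag lives in an 8-bit variable, so it wraps to 0
# after every 256 increments.
FLAG_BITS = 8  # assumed width, for illustration only


def bump_flag(flag: int) -> int:
    """Increment the flag, wrapping the way fixed-width unsigned arithmetic does."""
    return (flag + 1) % (1 << FLAG_BITS)


def safety_check_runs(flag: int) -> bool:
    """The safety check is performed only when the flag is non-zero."""
    return flag != 0


def passes_that_skip_check(total_passes: int) -> List[int]:
    """Return the pass numbers on which the wrapped flag reads 0."""
    skipped = []
    flag = 0
    for n in range(1, total_passes + 1):
        flag = bump_flag(flag)
        if not safety_check_runs(flag):
            skipped.append(n)  # overflow: the check is silently bypassed
    return skipped


def set_flag_safely(_flag: int) -> int:
    """Safer alternative: assign a constant instead of incrementing."""
    return 1


if __name__ == "__main__":
    print(passes_that_skip_check(600))
```

Under these assumptions the check is skipped on every 256th pass. Assigning a fixed non-zero value, as in `set_flag_safely`, removes the overflow entirely.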
11A Comment by the FDA on the Therac Manual
- The operator's manual does not explain nor even address the malfunction codes...
- The materials provided give no indication that these malfunctions could place a patient at risk.
- The program does not advise the operator if a situation exists wherein the ion chambers used to monitor the patient are saturated, thus are beyond the measurement limits of the instrument.
- This software package does not appear to contain a safety system to prevent parameters being entered and intermixed that would result in excessive radiation being delivered to the patient under treatment.
12The Patriot Missile Failure
- The Patriot missile was used in the first Iraq war to destroy incoming Scud missiles.
- The position of a Scud missile was calculated using a formula that involved time.
- Patriot software measured time in increments of 0.1 second.
- The decimal value 0.1 cannot be exactly represented in binary because it is a recurring fraction.
- The Patriot used 24-bit arithmetic to represent time.
13The Arithmetic Failure
- The longer a Patriot system is in operation (booted up), the greater the accumulated time error becomes.
- On February 25, 1991, a Patriot system had been operating for over 100 consecutive hours.
- The period of operation gave rise to an accumulated error of 0.34 s.
- A Scud flies at over 1,600 m/s and covers over 500 m in this time.
- The Patriot missed the Scud.
- The Scud struck a US Army barracks, killing 28 soldiers.
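The 0.34 s figure can be reproduced with a short calculation. The Python sketch below assumes that the 0.1 s tick was truncated (not rounded) to 23 bits after the binary point within the 24-bit register; that bit count is an assumption taken from commonly cited analyses, not from the slide, and the function names are illustrative.

```python
from fractions import Fraction

FRACTION_BITS = 23          # assumed number of bits after the binary point
TICK = Fraction(1, 10)      # the intended 0.1 s clock tick
SCUD_SPEED_M_PER_S = 1600   # lower bound quoted on the slide


def chopped_tick(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 truncated (not rounded) to `bits` binary fraction bits."""
    scale = 2 ** bits
    return Fraction(TICK.numerator * scale // TICK.denominator, scale)


def clock_drift_seconds(hours: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after running continuously for `hours`."""
    ticks = hours * 3600 * 10                   # ten ticks per second
    error_per_tick = TICK - chopped_tick(bits)  # exact, thanks to Fraction
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    drift = clock_drift_seconds(100)
    print(f"drift after 100 h: {drift:.4f} s")
    print(f"tracking error:    {drift * SCUD_SPEED_M_PER_S:.0f} m")
```

Each tick is short by less than a ten-millionth of a second, but there are 3.6 million ticks in 100 hours, so the drift grows in proportion to the time since the last restart.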
14Arithmetic Error
- In this example, the failure occurred directly as a consequence of the imprecision of the tracking algorithm.
- However, the designers should have anticipated that the Patriot might run for long periods without being restarted.
15Accident Rates by Aircraft Generation
Flight Safety Foundation October 2005
16Accident Rates and Fatalities 1956-2004
Flight Safety Foundation October 2005
17A Preview of Civil Aviation's Future
- Commercial aircraft of the future will have a cockpit with two seats.
- The left-hand seat will hold a dog and the right-hand seat will hold a pilot.
- The purpose of the pilot is to feed the dog.
- The purpose of the dog is to bite the pilot if he touches any of the controls.
18Life before the Computer
- The question is not "Do computers cause errors?" but "Do computers cause more or fewer errors than people alone?"
- Consider first a disaster that didn't require computer intervention.
19Air Florida Flight 90
- The crash of Air Florida flight 90 into the Potomac on takeoff from Washington's National Airport defies belief.
- The pilots failed to apply de-icing during a severe storm.
- This resulted in them applying too little power, because iced-up engine sensors over-read the thrust being produced.
- Even as they fell out of the sky they did not advance the throttles.
20To Err is Human
Cockpit voice recorder transcription of Air Florida flight 90. TWR = tower, CAM-1 = pilot 1, CAM-2 = pilot 2.
- 15:59:24 TWR Palm 90 cleared for takeoff.
- 15:59:32 CAM-1 Okay, your throttles.
- 15:59:35 SOUND OF ENGINE SPOOLUP
- 15:59:49 CAM-1 Holler if you need the wipers.
- 15:59:51 CAM-1 It's spooled. Real cold, real cold.
- 15:59:58 CAM-2 God, look at that thing. That don't seem right, does it? Uh, that's not right.
- 16:00:09 CAM-1 Yes it is, there's eighty.
- 16:00:10 CAM-2 Naw, I don't think that's right. Ah, maybe it is.
- 16:00:21 CAM-1 Hundred and twenty.
- 16:00:23 CAM-2 I don't know
- 16:00:31 CAM-1 Vee-one. Easy, vee-two.
- 16:00:39 SOUND OF STICKSHAKER STARTS AND CONTINUES UNTIL IMPACT
- 16:00:41 TWR Palm 90 contact departure control.
- 16:00:45 CAM-1 Forward, forward, easy. We only want five hundred.
- 16:00:48 CAM-1 Come on forward....forward, just barely climb.
- 16:00:59 CAM-1 Stalling, we're falling!
- 16:01:00 CAM-2 Larry, we're going down, Larry....
- 16:01:01 CAM-1 I know it.
- 16:01:01 SOUND OF IMPACT
21People Good but with Bad Bits
- The Air Florida incident indicates that people can be bad and make errors so fundamental that it is impossible to understand how the mistake could ever have been made.
- However, people can also be good: they use initiative and solve new problems in real time.
22Computers to the Rescue
- The computer apparently provides a solution to all our problems.
- The computer is accurate (error free) and reliable. Given the same data and the same program it always achieves the same result.
- The computer never gets distracted or tired.
23Why use Computers in Aviation?
- Computers can do things we cannot do ourselves, for example, navigation.
- Computers are more reliable than humans.
- Computers are economic: for example, fly-by-wire saves the cost of a lot of heavy hydraulics and mechanical linkages.
- Computers can make flying safer, for example, through the A320's envelope protection mechanism.
24The Impossibility of Testing
- Computer software and hardware cannot be fully tested.
- If you wanted to test a computer memory by looking for every possible fault, the test would take far longer than the expected life of the universe to complete.
- In practice, most defects can be found with a reasonably small number of tests. But it is impossible to guarantee that all defects will be found in a finite time.
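A rough order-of-magnitude sketch in Python shows why exhaustive memory testing is out of reach. It assumes that "every possible fault" means visiting every possible content of the memory, uses an assumed rate of one billion tests per second, and takes the current age of the universe (about 1.4e10 years) as a yardstick; all three are illustrative assumptions, as are the names.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.4e10   # rough yardstick, an assumption of this sketch


def log10_exhaustive_test_years(memory_bits: int, tests_per_second: float) -> float:
    """log10 of the years needed to visit all 2**memory_bits memory contents.

    Working in logarithms avoids overflow for realistic memory sizes.
    """
    log10_states = memory_bits * math.log10(2)
    return log10_states - math.log10(tests_per_second) - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    for bits in (64, 128, 8 * 1024):   # 8 bytes, 16 bytes, 1 KiB
        exponent = log10_exhaustive_test_years(bits, 1e9)
        print(f"{bits:5d} bits: about 10^{exponent:.0f} years")
    print(f"universe: about 10^{math.log10(AGE_OF_UNIVERSE_YEARS):.0f} years")
```

Under these assumptions, a memory of only 16 bytes already needs many orders of magnitude longer than the age of the universe.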
25The Impossibility of Testing Software
From Leveson: Consider a loop with several pathways through it, determined by data values. If the loop is executed 20 times, the number of possible pathways is 100 trillion.
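The figure follows from simple exponentiation. The sketch below assumes five pathways through the loop body; the slide says only "several", so the value 5 is an assumption chosen because it gives roughly 100 trillion.

```python
def count_paths(paths_per_iteration: int, iterations: int) -> int:
    """Number of distinct execution paths through a loop.

    Each iteration independently takes one of `paths_per_iteration` branches,
    so the counts multiply.
    """
    return paths_per_iteration ** iterations


if __name__ == "__main__":
    total = count_paths(5, 20)   # 5 is an assumed value for "several"
    print(f"{total:,} paths")
    # Even at an assumed one million paths tested per second:
    years = total / 1e6 / (365.25 * 24 * 3600)
    print(f"about {years:.1f} years to run them all")
```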
26New ideas are always being introduced
27Misuse Cases Looking for Hostile Content
Ian Alexander, IEEE Software, Feb 2003
Figure: misuse-case diagram. Actors: Driver, Car thief. Cases: Drive the car, Steal the car, Lock the car, Short the ignition, Lock the transmission. Relationships: threatens, includes, mitigates.
28Misuse Cases Looking for Hostile Content
Ian Alexander, IEEE Software, Feb 2003
Figure: misuse-case diagram. Actors: Driver, Weather. Cases: Control the car, Make the car skid, Control traction, ABS control. Relationships: threatens, has exception, mitigates.
29The Computer Controlled Accident
- The theme of this lecture is the danger of computers introducing errors into systems.
- These errors are often caused by a failure of the human-computer interface.
- Consequently, these errors can also be regarded as a failure of the designers to anticipate problems.
- Many of these problems are not new, unusual, or radical. They are the problems of everyday life but with more serious consequences.
30Four Incidents Involving Computers
- Let's look at some examples of situations in
which the use of a computer can be argued to have
caused a crash.
31A320
32The Flight Envelope
For a light aircraft, from http://www.auf.asn.au/groundschool
The envelope defines the aircraft's safe area of
operation. The boundaries of the flight envelope
are the aerodynamic stall and structural damage.
33The A320 Cockpit
The sidestick gives a very uncluttered layout.
The sidestick provides a demand input to the computer.
The computer controls the flying surfaces according to a set of algorithms.
One pilot can lock out the other pilot's sidestick.
Special problems of the A320 fly-by-wire system
34A320 Alpha Floor Protection
- Alpha Floor is a low-speed protection mechanism. When activated, it provides TOGA (take-off/go-around) thrust.
- As the aircraft decelerates into the alpha protection range, Alpha Floor is activated, even if the auto-thrust is disengaged.
- Alpha Floor is inhibited below 100 feet radio altitude.
- If a rapid avoidance maneuver is required to escape terrain or wind shear, it is safe to rapidly pull the sidestick fully aft.
- The aircraft will pitch up to maximum alpha, engage TOGA thrust and climb away.
- Conventional aircraft cannot perform such a maneuver safely while remaining within the flight envelope.
35Incident 1 - Habsheim
- The Habsheim crash is one of the most controversial crashes involving a computer-controlled civilian airliner.
- An A320 was to overfly Mulhouse-Habsheim airport at an airshow.
- The pass was to be at low speed, gear down, at 100 feet AGL.
- The first officer informed the captain that the aircraft was reaching 100 feet. The descent continued to 50 feet and further to 30-35 feet.
- Go-around power was added. The A320 continued and touched trees at the end of the runway at a 14º pitch attitude with the engine speed at 83% N1.
- The plane sank slowly into the forest and a fire broke out.
36Incident 1 - Habsheim
37Incident 1 - Habsheim
- This is one of the most controversial incidents in aviation history and has not been resolved. Some even believe that the data recorders were falsified.
- The controversy arises because of the dispute between the pilot and the aircraft manufacturer, not least because of the radical nature of the A320 (the first civil fly-by-wire airliner in which the computer had ultimate authority).
38Incident 1 Habsheim
- The crew, who survived the crash, maintain that the computer-controlled aircraft was responsible for the incident and that the aircraft did not respond to increased throttle input.
- The manufacturers point out:
- The crew were performing aerobatic maneuvers near the ground when they had not been trained to do this and the aircraft was not designed for aerobatics.
- They had disabled the aircraft's automatic go-around mechanism, designed to execute a go-around after an aborted approach.
- They had forgotten that the response time of all jet engines (when spooled down) is about 5 seconds.
39Incident 1 Habsheim - Comments
- If we accept that the aircraft was not at fault, the crash happened because the crew overestimated the capability of the computers.
- Indeed, they assumed that a computerized aircraft could not crash no matter what they did with it.
- Perhaps future aircraft should have a fear factor built in: they are designed to fail at random and plunge towards the ground after informing the crew that "you are on your own, have a nice day".
40Incident 2 - Strasbourg
- A particularly tragic incident occurred in 1992 near Strasbourg, France.
- The crew were descending to land and selected a glide angle of 3.3º.
- The autopilot operates in a dual-mode configuration where the demand value 3.3 refers either to a glide slope of 3.3º or a descent rate of 3,300 feet per minute.
- They may have selected the wrong mode and did not monitor the aircraft's progress.
- The aircraft crashed on the top of a mountain in winter with the loss of 87 souls.
41Incident 2 - Strasbourg
- It is easy to argue that the fault lies with the crew. They selected the wrong descent mode and failed to monitor the subsequent descent profile.
- However, the human interface designers failed to appreciate what it feels like in the cockpit when working under the stress of an approach in busy airspace.
- The interface designers made it very easy to miss the error by using 3.3 for both units. They provided no additional feedback.
- By building intelligence into the flight control system it might have been possible for the aircraft to detect that the action was unreasonable under the current circumstances and to have queried it.
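The kind of plausibility check suggested in the last point might look like the Python sketch below. It is not the A320's logic: the mode names, the 1,500 ft/min approach limit and the function names are all hypothetical, chosen only to show how one numeric entry maps to very different descent rates in the two modes.

```python
import math
from enum import Enum


class Mode(Enum):
    FLIGHT_PATH_ANGLE = "angle"   # entry is a descent angle in degrees
    VERTICAL_SPEED = "rate"       # entry is a descent rate in thousands of ft/min


KNOTS_TO_FT_PER_MIN = 6076.12 / 60   # 1 knot = 1 nautical mile per hour
MAX_APPROACH_DESCENT_FPM = 1500      # hypothetical limit for this sketch


def commanded_descent_fpm(entry: float, mode: Mode, ground_speed_kts: float) -> float:
    """Descent rate (ft/min) implied by the same numeric entry in each mode."""
    if mode is Mode.VERTICAL_SPEED:
        return entry * 1000
    horizontal_fpm = ground_speed_kts * KNOTS_TO_FT_PER_MIN
    return horizontal_fpm * math.tan(math.radians(entry))


def needs_confirmation(entry: float, mode: Mode,
                       ground_speed_kts: float, on_approach: bool) -> bool:
    """Query the crew when the commanded descent is implausible on approach."""
    rate = commanded_descent_fpm(entry, mode, ground_speed_kts)
    return on_approach and rate > MAX_APPROACH_DESCENT_FPM


if __name__ == "__main__":
    for mode in Mode:
        rate = commanded_descent_fpm(3.3, mode, ground_speed_kts=150)
        flag = needs_confirmation(3.3, mode, 150, on_approach=True)
        print(f"{mode.name:18s} {rate:6.0f} ft/min  query crew: {flag}")
```

With an assumed ground speed of 150 kts, the angle interpretation gives a descent of under 1,000 ft/min, while the rate interpretation gives 3,300 ft/min, which the hypothetical check would flag.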
42Incident 3 Warsaw When has a Plane Landed?
- When an aircraft lands, spoilers on the wings are deployed to destroy lift, buckets are placed behind the jets to reverse thrust, and the brakes are applied.
- The computerized A320 defined a landing as 12 tons weight on the left main gear, 12 tons weight on the right main gear and the wheels rotating at 72 kts.
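The landing definition can be written as a simple predicate. The Python sketch below encodes the rule exactly as stated on the slide (all three conditions required); the real aircraft logic is more detailed, and the sensor readings used in the example are invented purely for illustration.

```python
from dataclasses import dataclass

MIN_GEAR_LOAD_TONS = 12.0    # threshold as stated on the slide
MIN_WHEEL_SPEED_KTS = 72.0   # threshold as stated on the slide


@dataclass
class Sensors:
    left_gear_load_tons: float
    right_gear_load_tons: float
    wheel_speed_kts: float


def has_landed(s: Sensors) -> bool:
    """True only when both main gears are loaded AND the wheels have spun up."""
    return (s.left_gear_load_tons >= MIN_GEAR_LOAD_TONS
            and s.right_gear_load_tons >= MIN_GEAR_LOAD_TONS
            and s.wheel_speed_kts >= MIN_WHEEL_SPEED_KTS)


if __name__ == "__main__":
    # Illustrative readings only, not recorded flight data.
    banked_touchdown = Sensors(left_gear_load_tons=0.0,
                               right_gear_load_tons=20.0,
                               wheel_speed_kts=130.0)
    aquaplaning = Sensors(left_gear_load_tons=20.0,
                          right_gear_load_tons=20.0,
                          wheel_speed_kts=30.0)
    print(has_landed(banked_touchdown))  # one gear still unloaded
    print(has_landed(aquaplaning))       # wheels not yet spun up
```

In both illustrative cases the aircraft is physically on the runway, yet the predicate reports that it has not landed, so the systems that depend on it remain unavailable.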
43Incident 3 Crash at Warsaw
- On 14 September 1993 a Lufthansa A320 landed at Warsaw airport in stormy conditions with a cross-wind and heavy rain.
- The aircraft did not come to a halt and crashed with the loss of two lives.
- The strong cross-winds forced the aircraft to bank into wind, resulting in a touchdown on the right gear. The left gear did not contact the runway for another 9 seconds (1,525 m from the runway threshold).
- The brakes were not applied for a further 4 seconds because the wheels were aquaplaning.
44Incident 4 Nagoya Another Wrong Mode
Another mode failure: the crash of a China Airlines A300 at Nagoya Airport
- On 26 April 1994 an A300 was established on the glide slope to Nagoya airport in Japan.
- The A300 is not a fly-by-wire aircraft but has conventional hydraulic controls.
- The aircraft suddenly diverged from the glide slope and started to climb.
- The crew attempted to continue with the landing rather than initiating a go-around.
- The crew struggled with the autopilot and the aircraft adopted a nose-up attitude of 18º.
- The crew were pushing forward on the control yoke to continue with the landing.
- The aircraft adopted a 52º nose-up attitude and stalled at 1,800 feet with a speed of only 78 kts.
45Incident 4 Nagoya Another Wrong Mode
- The China Airlines crash is a remarkable, albeit tragic, incident. The cause was rapidly determined from the CVR and flight data recorder.
- On approach the autopilot was accidentally switched into the go-around mode by the first officer, who was flying.
- The autopilot attempted to execute a go-around by raising the nose.
- The alpha-floor protection was triggered because the aircraft was near the stall speed, and maximum thrust was applied.
- The pilot wrestled with the aircraft rather than continuing with the go-around (which was called for, since the approach was no longer stable).
- The crew were fighting the autopilot that was raising the nose. The pilot was controlling the elevators, whereas the autopilot was controlling the more powerful horizontal stabilizer.
- The computer won the struggle.
46Incident 4 Nagoya Another Wrong Mode
- The autopilot was applying climb power. The crew throttled back and made the crash inevitable.
- Normally, a pilot control input of about 40 lb push or 100 lb pull automatically disengages the autopilot. However, this feature is disabled below 1,500 feet in case the pilot accidentally nudges the control column and disengages the autopilot very late in a landing.
- It is not clear why the crew did not recognize the nature of the situation and take appropriate action.
- They could have continued with the go-around and made a second approach.
- They could have disengaged the autopilot and taken control.
- They chose to wrestle with the computer and lost.
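The force-override rule described above can be sketched as a small Python function. It uses only the thresholds quoted on the slide (40 lb push, 100 lb pull, inhibited below 1,500 feet); the sign convention and the function name are assumptions made for the sketch, and it is not the A300's actual control law.

```python
PUSH_DISCONNECT_LB = 40     # from the slide
PULL_DISCONNECT_LB = 100    # from the slide
INHIBIT_BELOW_FT = 1500     # from the slide


def autopilot_disengages(column_force_lb: float, altitude_ft: float) -> bool:
    """Decide whether a control-column force overrides the autopilot.

    Assumed sign convention: positive force = push, negative force = pull.
    """
    if altitude_ft < INHIBIT_BELOW_FT:
        return False                      # override inhibited near the ground
    if column_force_lb >= PUSH_DISCONNECT_LB:
        return True
    if -column_force_lb >= PULL_DISCONNECT_LB:
        return True
    return False


if __name__ == "__main__":
    print(autopilot_disengages(60, altitude_ft=3000))   # firm push, high up
    print(autopilot_disengages(60, altitude_ft=1000))   # same push, low down
```

The same push that would hand control back to the pilot at altitude has no effect below the inhibit height, which is the situation the crew was in.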
47Incident 4 Nagoya China Airlines
- Computers and humans come out of this incident badly.
- The stall-prevention system (increasing thrust) contributed to the stall. A safety mechanism should not make things worse!
- It is surprising that the go-around function could be engaged with no audible warning or indication of a major change in operating mode.
- It is surprising that the crew did not recognize what was happening and disengage the autopilot.
- The accident investigation made suggestions concerning the training of pilots in the operation of the autopilot.
48Cali 1995 Computers and People in Error
The captain asked "would you like to shoot the
one nine straight in?" The first officer
responded, "Yeah, we'll have to scramble to get
down. We can do it."
- A 757 crashed in Colombia as a result of a late change in the flight plan during the approach phase.
- By accepting a straight-in approach to runway 19, the crew needed to accomplish the following actions expeditiously:
- Locate, remove from its binder, and position the chart for the approach to runway 19
- Review the approach chart for radio frequencies, headings, altitudes, distances, and missed approach procedures
- Select and enter data into the flight management system (FMS) computers regarding the new approach
- Compare information on the approach chart with approach information displayed from FMS data
- Verify that selected radio frequencies, airplane headings, and FMS-entered data were correct
- Recalculate airspeeds, altitudes, and configurations
- Hasten the descent of the airplane because of the shorter distance available to the end of the new runway
- Monitor the course and descent of the airplane, while maintaining communications with ATC
49Cali Confusion
- The crew became confused and lost situational awareness.
- They needed to return to a waypoint but the flight management system had deleted it because they'd passed it.
- They wanted to enter a new waypoint, ROZO. They typed R and the computer came up with ROMEO, which they accepted.
- Flying to ROMEO took them across a mountain ridge into another valley where they descended.
- They used speed brakes to expedite their descent.
- When the ground proximity warning sounded, they executed an escape maneuver without retracting the speed brakes. This prevented them from climbing to safety.
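The waypoint mix-up is an instance of a first-match lookup on an ambiguous prefix. The Python sketch below contrasts that behaviour with a lookup that refuses to choose when more than one name matches. The database holds only the two names mentioned above, their ordering is assumed, and the function names are hypothetical; it does not reproduce the real flight management system.

```python
from typing import List, Optional

# Only the two names from the slide; the ordering is an assumption.
WAYPOINTS = ["ROMEO", "ROZO"]


def first_match(prefix: str, database: List[str] = WAYPOINTS) -> Optional[str]:
    """Return the first name starting with `prefix` (the error-prone behaviour)."""
    for name in database:
        if name.startswith(prefix):
            return name
    return None


def candidates(prefix: str, database: List[str] = WAYPOINTS) -> List[str]:
    """Return every name starting with `prefix`."""
    return [name for name in database if name.startswith(prefix)]


def select_waypoint(prefix: str, database: List[str] = WAYPOINTS) -> str:
    """Accept the entry only when it identifies exactly one waypoint."""
    matches = candidates(prefix, database)
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"{prefix!r} is ambiguous or unknown: {matches}")


if __name__ == "__main__":
    print(first_match("R"))        # silently picks one of several matches
    print(candidates("R"))         # shows the ambiguity
    print(select_waypoint("ROZ"))  # unambiguous entry is accepted
```

Forcing the crew to resolve the ambiguity costs a few keystrokes, but it turns a silent wrong choice into an explicit question.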
50Cali Whose Fault?
- The aircraft was flying rapidly in mountainous country with no ground radar observation.
- The crew became confused while trying to set up a new approach; all this activity took place in seconds.
- The data entry system permitted them to make a data-entry mistake and descend into terrain.
- The aircraft permitted them to execute an escape maneuver when they still had speed brakes deployed.
- Pilot training, flight management system design, ground-air communication, and flight controls all played a part in the disaster.
51The Biggest Killer of them All
- The worst aviation disaster of all time involved
an error of communications.
The catastrophe was caused by the misuse of the English expression "I am -ing".
You would expect critical communications in aviation to use a clear, unambiguous protocol. This was not true on 27 March 1977.
52The Tenerife Disaster
- In foggy weather a KLM 747 was waiting to depart at the end of the runway.
- The Dutch captain was very impatient and wanted to get away.
- The Dutch captain said "We are at take off". I assume that in his haste he used a Dutch grammatical construction which meant "We are taking off" when expressed in Dutch.
- The Dutch captain began to open the throttle, but his copilot stopped him.
- Shortly afterward, the Dutch captain began the takeoff roll in the heavy mist.
- While this was happening, a second 747 was crossing the runway. The Dutch captain was aware of this maneuver but did not check that it had been completed.
- The KLM 747 collided with the Pan Am 747 with the loss of 583 lives.
53G-KMAM Computers 0, Humans 1
- Sometimes humans can overcome computer errors because of training and intelligence, even when the advice given by the computer is incorrect.
- In 1993 an A320 departed London Gatwick and an uncommanded roll to the right took place. The crew could not turn left.
- The computer indicated a significant error and the EFCS reverted to its alternate law mode.
- The crew returned to land; the computer advised a FLAP 3 landing.
- They could not control the aircraft and selected a FLAP 1 landing after much hunting for paper operating manuals. The landing was safe.
- It was later found that, after maintenance, spoilers had been left in a maintenance mode.
- When the crew tested the controls they did not appreciate that an error message would appear only if the stick was held in position for 3.5 s.
54Summary
- Computers can improve the way in which we do almost anything.
- However, the correct implementation of some activities is important because a failure can lead to a loss of life.
- It is important that we who teach subjects like computer science make students aware of the possible consequences of their actions.