Getting it Right - PowerPoint PPT Presentation


Title: Getting it Right


1
Getting it Right
  • The Cost of Computer Failure
  • 13 February 2008

2
Getting it Right
  • Alan Clements
  • School of Computing
  • University of Teesside
  • Middlesbrough
  • England

3
Overview of the Lecture
  • Computer systems control all aspects of life.
  • The failure of computer systems can have
    catastrophic effects.
  • Already major aviation disasters have taken
    place because of computer errors.
  • We must ensure that designers know how failures
    occur and how they can be prevented.
  • We have to balance new disasters due to computers
    against old disasters due to humans.

4
The Computer Aided Crash
  • My academic interest is computer architecture.
  • I propose to talk about the need for computer
    scientists to understand the consequences of the
    systems they design, by reference to computer-based
    accidents, particularly in the aviation industry.

5
Ethics and Professionalism
  • All professional organizations stress that
    students be taught ethics.
  • The emphasis on ethics is not for idealistic
    reasons; it is because of the high cost of
    lapses in ethics.
  • History is full of examples of the tremendous
    cost of neglecting ethics.
  • Corporate manslaughter is a new offence in the
    UK.
  • Corporate manslaughter is a crime that can be
    committed by a company in relation to a
    work-related death.
  • The offence is intrinsically linked to whether a
    senior manager - a "controlling mind and will" of
    the company - is guilty of manslaughter.
  • If the director or manager is found guilty, the
    company is guilty.
  • King Hammurabi decreed: "If a building collapses
    and kills people, the builder shall be put to
    death."

6
Non-aviation Examples of Computer Error
  • Let's look at two classic computer errors.
  • One involves a therapeutic X-ray machine and the
    other a surface-to-air missile.
  • Both errors were due to poor design rather than
    component failure.

7
The Therac 25 Incidents
  • The Therac-25 was a therapeutic X-ray machine
    designed to treat cancer sufferers.
  • It operated in two modes: X-ray and electron
    beam.
  • In the X-ray mode a powerful electron beam was
    aimed at a target to generate X-rays.

8
The Therac 25 Incidents
  • If the machine was set in the X-ray mode and the
    target was not engaged, the patient would receive
    a fatal dose of high-intensity radiation.
  • Early Therac models had electro-mechanical
    interlocks that made it impossible to energize
    the electron beam if the target was not in
    place.
  • The Therac-25 used a PDP-11 computer to perform
    all operations including moving the target into
    place when in the X-ray mode.

9
The Therac 25 Incidents
  • On several occasions the target was not rotated
    into position.
  • Patients suffered massive doses of high intensity
    electron beams leading to both thermal and
    electromagnetic radiation damage.
  • Six accidents involving massive overdoses of
    radiation occurred between 1985 and 1987 before
    the machines were recalled.
  • This was a failure of design and of imagination.
    It was also a failure by the regulatory bodies to
    anticipate the problem and then to respond to it.

10
Reasons for the Therac 25 Failures
  • Software was re-used from older models that had
    hardware interlocks.
  • The hardware provided no way for the software to
    verify that sensors were working correctly.
  • The operator interface was not correctly
    synchronized with the system operation. If the
    operator corrected an error too quickly, a race
    condition occurred. This was missed during
    testing, because the test operators weren't fast
    enough for the problem to occur.
  • The software set a flag by incrementing it. If it
    was incremented too often, arithmetic overflow
    occurred and the software bypassed safety
    checks.
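
The overflow in the last bullet can be sketched in a few lines of Python. This is an illustration only, not the Therac-25 code: the one-byte flag width, the function name `passes_that_skip_check` and the pass count are assumptions chosen to show how a flag that is "set" by incrementing is occasionally cleared by wrap-around.

```python
FLAG_MASK = 0xFF  # assumed one-byte flag; the width is an illustrative assumption


def passes_that_skip_check(total_passes: int) -> list[int]:
    """Return the pass numbers on which the safety check is bypassed.

    The flag is meant to be non-zero whenever a check is required, but it is
    'set' by incrementing, so it wraps to zero after 256 increments.
    """
    flag = 0
    skipped = []
    for pass_number in range(1, total_passes + 1):
        flag = (flag + 1) & FLAG_MASK  # increment with one-byte wrap-around
        if flag == 0:
            # Zero is read as "no check needed", so the check is bypassed.
            skipped.append(pass_number)
    return skipped


print(passes_that_skip_check(600))  # [256, 512]
```

Assigning a constant (for example `flag = 1`) instead of incrementing would remove the wrap-around entirely.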

11
A Comment by the FDA on the Therac Manual
  • The operator's manual does not explain nor even
    address the malfunction codes...
  • The materials provided give no indication that
    these malfunctions could place a patient at risk.
  • The program does not advise the operator if a
    situation exists wherein the ion chambers used to
    monitor the patient are saturated, thus are
    beyond the measurement limits of the instrument.
  • This software package does not appear to contain
    a safety system to prevent parameters being
    entered and intermixed that would result in
    excessive radiation being delivered to the
    patient under treatment.

12
The Patriot Missile Failure
  • The Patriot missile was used in the first Iraq
    war to destroy incoming Scud missiles.
  • The position of a Scud missile was calculated
    using a formula that involved time.
  • Patriot software measured time in increments of
    0.1 s.
  • The decimal value 0.1 cannot be exactly
    represented in binary as it is a recurring
    fraction.
  • The Patriot used 24-bit arithmetic to represent
    time.

13
The Arithmetic Failure
  • The longer a Patriot missile is in operation
    (booted up) the greater the accumulated time
    error becomes.
  • On February 25, 1991, a Patriot battery had been
    operating for over 100 consecutive hours.
  • The period of operation gave rise to an
    accumulated error of 0.34s.
  • A Scud flies at over 1,600 m/s and covers over
    500 m in this time.
  • The Patriot missed the Scud.
  • The Scud struck a US army barracks killing 28
    soldiers.
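
The figures on this slide can be checked with a short calculation. The sketch below assumes that the 24-bit register leaves 23 bits after the binary point and that 0.1 is truncated (chopped) rather than rounded; both are assumptions not stated on the slides. Under those assumptions it reproduces a clock error of roughly 0.34 s after 100 hours.

```python
from fractions import Fraction

FRACTION_BITS = 23       # assumption: 23 bits after the binary point
TICK = Fraction(1, 10)   # the 0.1 s clock increment from the slide
HOURS_UP = 100           # uptime from the slide
SCUD_SPEED_M_S = 1600    # "over 1,600 m/s" from the slide

# Truncate 0.1 to the available fraction bits, as chopping hardware would.
scale = 2 ** FRACTION_BITS
stored_tick = Fraction(int(TICK * scale), scale)

error_per_tick = TICK - stored_tick   # exact, thanks to Fraction
ticks = HOURS_UP * 3600 * 10          # number of 0.1 s ticks in the uptime
clock_error_s = float(error_per_tick * ticks)
distance_m = clock_error_s * SCUD_SPEED_M_S

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"clock error after {HOURS_UP} h: {clock_error_s:.3f} s")
print(f"distance covered by the Scud in that time: {distance_m:.0f} m")
```

The error per tick is tiny (under a ten-millionth of a second); it is the 3.6 million accumulated ticks that turn it into a tracking miss.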

14
Arithmetic Error
  • In this example, the failure occurred directly as
    a consequence of the imprecision of the tracking
    algorithm.
  • However, the failure to re-start the Patriot
    should have been anticipated by the designers.

15
Accident Rates by Aircraft Generation
Flight Safety Foundation October 2005
16
Accident Rates and Fatalities 1956-2004
Flight Safety Foundation October 2005
17
A Preview of Civil Aviation's Future
  • Commercial aircraft of the future will have a
    cockpit with two seats.
  • The left-hand seat will hold a dog and the
    right-hand seat will hold a pilot.
  • The purpose of the pilot is to feed the dog.
  • The purpose of the dog is to bite the pilot if he
    touches any of the controls.

18
Life before the Computer
  • The question is not "Do computers cause errors?"
    but "Do computers cause more or fewer errors than
    people alone?"
  • Consider first a disaster that didn't require
    computer intervention.

19
Air Florida Flight 90
  • The crash of Air Florida flight 90 into the
    Potomac on takeoff from Washington's National
    Airport defies belief.
  • The pilots failed to switch on engine anti-icing
    during a severe snowstorm.
  • Iced-up engine sensors then over-read thrust, so
    the crew set too little power.
  • Even as they fell out of the sky they did not
    advance the throttles.

20
To Err is Human
Cockpit voice recorder transcript of Air Florida
flight 90 (TWR = tower, CAM-1 = pilot 1,
CAM-2 = pilot 2).
  • 15:59:24 TWR: Palm 90 cleared for takeoff.
  • 15:59:32 CAM-1: Okay, your throttles.
  • 15:59:35 [SOUND OF ENGINE SPOOLUP]
  • 15:59:49 CAM-1: Holler if you need the wipers.
  • 15:59:51 CAM-1: It's spooled. Real cold, real
    cold.
  • 15:59:58 CAM-2: God, look at that thing. That
    don't seem right, does it? Uh, that's not right.
  • 16:00:09 CAM-1: Yes it is, there's eighty.
  • 16:00:10 CAM-2: Naw, I don't think that's right.
    Ah, maybe it is.
  • 16:00:21 CAM-1: Hundred and twenty.
  • 16:00:23 CAM-2: I don't know.
  • 16:00:31 CAM-1: Vee-one. Easy, vee-two.
  • 16:00:39 [SOUND OF STICKSHAKER STARTS AND
    CONTINUES UNTIL IMPACT]
  • 16:00:41 TWR: Palm 90 contact departure control.
  • 16:00:45 CAM-1: Forward, forward, easy. We only
    want five hundred.
  • 16:00:48 CAM-1: Come on forward....forward, just
    barely climb.
  • 16:00:59 CAM-1: Stalling, we're falling!
  • 16:01:00 CAM-2: Larry, we're going down, Larry....
  • 16:01:01 CAM-1: I know it.
  • 16:01:01 [SOUND OF IMPACT]

21
People: Good but with Bad Bits
  • The Air Florida incident indicates that people
    can be bad and make errors so fundamental
    that it is impossible to understand how the
    mistake could ever have been made.
  • However, people can be good: they use initiative
    and solve new problems in real time.

22
Computers to the Rescue
  • The computer apparently provides a solution to
    all our problems.
  • The computer is accurate (error free) and
    reliable. Given the same data and same program it
    always achieves the same result.
  • The computer never gets distracted or gets tired.

23
Why use Computers in Aviation?
  • Computers can do things we cannot do ourselves;
    for example, navigation.
  • Computers are more reliable than humans.
  • Computers are economic; for example, fly-by-wire
    saves the cost of a lot of heavy hydraulics and
    mechanical linkages.
  • Computers can make flying safer; for example,
    the A320's envelope protection mechanism.

24
The Impossibility of Testing
  • Computer software and hardware cannot be fully
    tested.
  • If you wanted to test a computer memory by
    looking for every possible fault, the test would
    take far longer than the expected life of the
    universe to complete.
  • In practice, most defects can be found with a
    reasonably small number of tests. But it is
    impossible to guarantee that all defects will be
    found in a finite time.
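
A rough calculation shows the scale of the problem. The sketch below takes "every possible fault" to mean, at a minimum, visiting every bit pattern the memory can hold, and assumes a hypothetical tester that checks one pattern per nanosecond for about 13.8 billion years; both figures are assumptions, not from the lecture.

```python
SECONDS_PER_YEAR = 31_557_600            # Julian year
AGE_OF_UNIVERSE_YEARS = 13_800_000_000   # approximate; an assumption
TESTS_PER_SECOND = 1_000_000_000         # hypothetical: one pattern per nanosecond

# Total number of patterns the tester could ever visit.
budget = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR * TESTS_PER_SECOND

# An n-bit memory holds 2**n distinct patterns. The largest n whose patterns
# all fit in the budget is floor(log2(budget)).
max_bits = budget.bit_length() - 1

print(f"largest exhaustively testable memory: {max_bits} bits")
print(f"1 KiB memory exceeds the budget: {2 ** (8 * 1024) > budget}")
```

Under these assumptions the limit is a memory of well under a hundred bits, which is why practical memory tests target specific fault models rather than every state.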

25
The Impossibility of Testing Software
From Leveson: consider a loop with several
pathways through it, determined by data values. If
the loop is executed 20 times, the number of
possible pathways is 100 trillion.
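
The order of magnitude is easy to reproduce. The slide says only "several" pathways; the sketch below assumes five per iteration, which is an assumption that happens to give a figure close to 100 trillion. The test rate is also hypothetical.

```python
PATHS_PER_ITERATION = 5   # assumption: the slide says only "several"
ITERATIONS = 20           # from the slide

# Each iteration independently picks one path, so the counts multiply.
total_paths = PATHS_PER_ITERATION ** ITERATIONS
print(f"distinct paths through the loop: {total_paths:,}")

# Time to run every path once at a hypothetical one million tests per second.
TESTS_PER_SECOND = 1_000_000
years = total_paths / TESTS_PER_SECOND / (365 * 24 * 3600)
print(f"time to test them all: about {years:.1f} years")
```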
26
New ideas are always being introduced
  • For example

27
Misuse Cases: Looking for Hostile Content
Ian Alexander, IEEE Software, Feb 2003
[Diagram: misuse cases for a car. Actors: Driver,
Car thief. Cases: Drive the car, Steal the car,
Lock the car, Short the ignition, Lock the
transmission. Relationships shown: threatens,
includes, mitigates.]
28
Misuse Cases: Looking for Hostile Content
Ian Alexander, IEEE Software, Feb 2003
[Diagram: misuse cases for car control. Actors:
Driver, Weather. Cases: Control the car, Make the
car skid, Control traction, ABS control.
Relationships shown: threatens, has exception,
mitigates.]
29
The Computer Controlled Accident
  • The theme of this lecture is the danger of
    computers inducing errors into systems.
  • These errors are often caused by a failure of the
    human-computer interface.
  • Consequently, these errors can also be regarded
    as a failure of the designers to anticipate
    problems.
  • Many of these problems are not new, unusual, or
    radical. They are the problems of everyday life
    but with more serious consequences.

30
Four Incidents Involving Computers
  • Let's look at some examples of situations in
    which the use of a computer can be argued to have
    caused a crash.

31
A320
32
The Flight Envelope
For a light aircraft (from
http://www.auf.asn.au/groundschool).
The envelope defines the aircraft's safe area of
operation. The boundaries of the flight envelope
are the aerodynamic stall and structural damage.
33
The A320 Cockpit
The sidestick gives a very uncluttered
layout. The sidestick provides a demand input to
the computer. The computer controls the flying
surfaces according to a set of algorithms. One
pilot can lock out the other pilot's sidestick.
Special problems of the A320 fly-by-wire system
34
A320 Alpha Floor Protection
  • Alpha Floor is a low speed protection mechanism.
    When activated, it provides TOGA
    (take-off-go-around) thrust.
  • As the aircraft decelerates into the alpha
    protection range, the Alpha Floor is activated,
    even if the auto-thrust is disengaged.
  • Alpha Floor is inhibited below 100 feet radio
    altitude.
  • If a rapid avoidance maneuver is required to
    escape terrain or wind shear, it is safe to
    pull the sidestick rapidly and fully aft.
  • The aircraft will pitch up to maximum Alpha,
    engage TOGA thrust and climb away.
  • Conventional aircraft cannot perform such a
    maneuver safely while remaining within the flight
    envelope.
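
The activation rules in these bullets can be written down as a small decision function. This is a sketch of the logic as the slide describes it, not Airbus code: the angle-of-attack threshold `ALPHA_FLOOR_DEG`, the `FlightState` type and the function names are hypothetical.

```python
from dataclasses import dataclass

ALPHA_FLOOR_DEG = 15.0     # hypothetical threshold, not an Airbus figure
INHIBIT_BELOW_FT = 100.0   # from the slide: inhibited below 100 ft radio altitude


@dataclass
class FlightState:
    angle_of_attack_deg: float
    radio_altitude_ft: float
    autothrust_engaged: bool


def alpha_floor_active(state: FlightState) -> bool:
    """True when Alpha Floor would command TOGA thrust."""
    if state.radio_altitude_ft < INHIBIT_BELOW_FT:
        return False  # protection is inhibited close to the ground
    # Note: autothrust_engaged is deliberately ignored; the slide says the
    # protection acts even if auto-thrust is disengaged.
    return state.angle_of_attack_deg >= ALPHA_FLOOR_DEG


def commanded_thrust(state: FlightState, pilot_setting: str) -> str:
    return "TOGA" if alpha_floor_active(state) else pilot_setting


high = FlightState(angle_of_attack_deg=16.0, radio_altitude_ft=500.0, autothrust_engaged=False)
low = FlightState(angle_of_attack_deg=16.0, radio_altitude_ft=30.0, autothrust_engaged=False)
print(commanded_thrust(high, "IDLE"))  # TOGA
print(commanded_thrust(low, "IDLE"))   # IDLE
```

The second case matters for low passes: below the inhibit height the same high angle of attack produces no automatic thrust.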

35
Incident 1 - Habsheim
  • The Habsheim crash is one of the most
    controversial of crashes involving a
    computer-controlled civilian transport airliner.
  • An A320 was to overfly Mulhouse-Habsheim airport
    at an airshow.
  • The pass was to be at low speed, gear down, at
    100 feet agl.
  • The first officer informed the captain that the
    aircraft was reaching 100 feet. The descent
    continued to 50 feet and further to 30-35 feet.
  • Go-around power was added. The A320 continued and
    touched trees at the end of the runway at a 14°
    pitch attitude with engine speed at 83% N1.
  • The plane sank slowly into the forest and a fire
    broke out.

36
Incident 1 - Habsheim
37
Incident 1 - Habsheim
  • This is one of the most controversial incidents
    in aviation history and has not been resolved.
    Some even believe that the data recorders were
    falsified.
  • The controversy arises because of the dispute
    between the pilot and the aircraft manufacturer,
    not least because of the radical nature of the
    A320 (the first civil fly-by-wire airliner in
    which the computer had ultimate authority).

38
Incident 1 - Habsheim
  • The crew, who survived the crash, maintain that
    the computer-controlled aircraft was responsible
    for the incident and that the aircraft did not
    respond to increased throttle input.
  • The manufacturers point out that:
  • The crew were performing aerobatic maneuvers near
    the ground when they had not been trained to do
    this and the aircraft was not designed for
    aerobatics.
  • They had disabled the aircraft's automatic
    go-around mechanism designed to execute a
    go-around after an aborted approach.
  • They had forgotten that the response time of all
    jet engines (when spooled down) is about 5
    seconds.

39
Incident 1 - Habsheim: Comments
  • If we accept that the aircraft was not at fault,
    the crash happened because the crew overestimated
    the capability of the computers.
  • Indeed, they assumed that a computerized aircraft
    could not crash, no matter what they did with
    it.
  • Perhaps future aircraft should have a fear factor
    built in: they are designed to fail at random
    and plunge towards the ground after informing the
    crew that "you are on your own; have a nice day".

40
Incident 2 - Strasbourg
  • A particularly tragic incident occurred in 1992
    near Strasbourg, France.
  • The crew were descending to land and selected a
    glide angle of 3.3°.
  • The autopilot operates in a dual-mode
    configuration where the demand value 3.3 refers
    either to a glide slope of 3.3° or to a descent
    rate of 3,300 feet per minute.
  • They may have selected the wrong mode and did not
    monitor the aircraft's progress.
  • The aircraft crashed on the top of a mountain in
    winter with the loss of 87 souls.
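
The two readings of "3.3" differ by far more than the display suggests. The sketch below converts a 3.3° flight-path angle into a vertical speed and compares it with 3,300 ft/min. The 150 kt ground speed is a hypothetical approach speed chosen for illustration, not a figure from the accident.

```python
import math

FT_PER_MIN_PER_KNOT = 6076.12 / 60   # one knot expressed in feet per minute
GROUND_SPEED_KT = 150                 # hypothetical approach ground speed


def descent_rate_for_angle(angle_deg: float, ground_speed_kt: float) -> float:
    """Vertical speed in ft/min implied by a flight-path angle."""
    horizontal_fpm = ground_speed_kt * FT_PER_MIN_PER_KNOT
    return horizontal_fpm * math.tan(math.radians(angle_deg))


as_angle_fpm = descent_rate_for_angle(3.3, GROUND_SPEED_KT)  # dial read as degrees
as_rate_fpm = 3300                                           # dial read as ft/min

print(f"3.3 degrees at {GROUND_SPEED_KT} kt: {as_angle_fpm:.0f} ft/min")
print(f"3,300 ft/min is {as_rate_fpm / as_angle_fpm:.1f} times steeper")
```

At this assumed speed the wrong mode gives a descent nearly four times faster than intended, from the same two digits on the panel.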

41
Incident 2 - Strasbourg
  • It is easy to argue that the fault lies with the
    crew. They selected the wrong descent mode and
    failed to monitor the subsequent descent profile.
  • However, the human interface designers failed to
    appreciate what it feels like in the cockpit when
    working under the stress of an approach in busy
    airspace.
  • The interface designers made it very easy to miss
    the error by displaying 3.3 for both units. They
    provided no additional feedback.
  • By building intelligence into the flight control
    system it might have been possible for the
    aircraft to detect that the action was
    unreasonable under the current circumstances
    and to have queried it.
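
One way to picture the check suggested in the last bullet is a plausibility test on the selected descent. The sketch below is hypothetical throughout: the mode names, the 6° limit and the function names are assumptions for illustration, not a description of any real autopilot.

```python
import math

FT_PER_MIN_PER_KNOT = 6076.12 / 60   # one knot expressed in feet per minute
MAX_APPROACH_ANGLE_DEG = 6.0          # hypothetical limit for a "reasonable" approach


def implied_angle_deg(vertical_speed_fpm: float, ground_speed_kt: float) -> float:
    """Flight-path angle implied by a vertical speed at a given ground speed."""
    horizontal_fpm = ground_speed_kt * FT_PER_MIN_PER_KNOT
    return math.degrees(math.atan2(vertical_speed_fpm, horizontal_fpm))


def should_query_crew(mode: str, dial: float, ground_speed_kt: float,
                      on_approach: bool) -> bool:
    """Return True if the selection looks unreasonable for an approach."""
    if mode == "FPA":     # dial is a flight-path angle in degrees
        angle = dial
    elif mode == "VS":    # dial is a vertical speed in thousands of ft/min
        angle = implied_angle_deg(dial * 1000, ground_speed_kt)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return on_approach and angle > MAX_APPROACH_ANGLE_DEG


print(should_query_crew("FPA", 3.3, 150, on_approach=True))  # False
print(should_query_crew("VS", 3.3, 150, on_approach=True))   # True
```

The same dial value passes in one mode and is queried in the other, which is the extra feedback the crew did not get.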

42
Incident 3 - Warsaw: When has a Plane Landed?
  • When an aircraft lands, spoilers on the wings are
    deployed to destroy lift, buckets are placed
    behind the jets to reverse thrust, and the brakes
    are applied.
  • The computerized A320 defined a landing as 12
    tons weight on the left main gear, 12 tons weight
    on the right main gear and the wheels rotating at
    72 kts.
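
The definition above is a conjunction of three conditions, and a conjunction fails if any one input is missing. The sketch below encodes the rule exactly as the slide words it; the real aircraft logic may differ in detail, and the function name and the sample sensor readings are hypothetical.

```python
MIN_GEAR_LOAD_TONS = 12.0   # per main gear, as quoted on the slide
MIN_WHEEL_SPEED_KT = 72.0   # as quoted on the slide


def landing_detected(left_gear_tons: float, right_gear_tons: float,
                     wheel_speed_kt: float) -> bool:
    """True only when all three conditions from the slide hold at once."""
    return (left_gear_tons >= MIN_GEAR_LOAD_TONS
            and right_gear_tons >= MIN_GEAR_LOAD_TONS
            and wheel_speed_kt >= MIN_WHEEL_SPEED_KT)


# Hypothetical readings for a normal touchdown on both gears.
print(landing_detected(15.0, 15.0, 130.0))  # True

# Hypothetical readings for a banked touchdown on the right gear only,
# with the wheels aquaplaning instead of spinning up.
print(landing_detected(0.0, 14.0, 20.0))    # False
```

While the function returns False the aircraft is still "flying" as far as the computer is concerned, so spoilers and reverse thrust stay unavailable even though the wheels are on the runway.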

43
Incident 3 - Crash at Warsaw
  • On 14 September 1993 a Lufthansa A320 landed at
    Warsaw airport in stormy conditions with a
    cross-wind and heavy rain.
  • The aircraft did not come to a halt and crashed
    with the loss of two people.
  • The strong cross-winds forced the aircraft to
    bank into wind, resulting in a touchdown on the
    right gear. The left gear did not contact the
    runway for another 9 seconds (1,525 m from the
    runway threshold).
  • The brakes were not applied for a further 4
    seconds because the wheels were aquaplaning.

44
Incident 4 - Nagoya: Another Wrong Mode
Another mode failure: the crash of a China
Airlines A300 at Nagoya Airport.
On 26 April 1994 an A300 was established on the
glide slope to Nagoya airport in Japan. The A300
is not a fly-by-wire aircraft but has
conventional hydraulic controls. The aircraft
suddenly diverted from the glide slope and
started to climb. The crew attempted to continue
with the landing rather than initiating a
go-around. The crew struggled with the autopilot
and the aircraft adopted a nose-up attitude of
18°. The crew were pushing forward on the control
yoke to continue with the landing. The aircraft
adopted a 52° nose-up attitude and stalled at
1,800 feet with a speed of only 78 kts.
45
Incident 4 - Nagoya: Another Wrong Mode
The China Airlines crash is a remarkable, albeit
tragic, incident. The cause was rapidly
determined from the CVR and flight data
recorder. On approach the autopilot was
accidentally switched into the go-around mode by
the first officer, who was flying. The autopilot
attempted to execute a go-around by raising the
nose. The alpha-floor protection was triggered
because the aircraft was near the stall speed, and
maximum thrust was applied. The pilot wrestled
with the aircraft, rather than continuing with the
go-around (since the approach was now no longer
stable). The crew were fighting the autopilot
that was raising the nose. The pilot was
controlling the elevators, whereas the autopilot
was controlling the more powerful horizontal
stabilizer. The computer won the struggle.
46
Incident 4 - Nagoya: Another Wrong Mode
The autopilot was applying climb power. The crew
throttled back and made the crash
inevitable. Normally, the application of pilot
control input of about 40 lb push or 100 lb pull
automatically disengages the autopilot. However,
this feature is disabled below 1,500 feet in case
the pilot accidentally nudges the control column,
disengaging the autopilot very late in a
landing. It is not clear why the crew did not
recognize the nature of the situation and take
appropriate action. They could have continued
with the go-around and made a second approach.
They could have disengaged the autopilot and
taken control. They chose to wrestle with the
computer and lost.
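
The disengagement rule described above can be sketched as a single function. The force and altitude limits are the approximate figures quoted on the slide; the function name and the sign convention (positive for push, negative for pull) are assumptions made for this illustration.

```python
PUSH_LIMIT_LB = 40.0        # "about 40 lb push", from the slide
PULL_LIMIT_LB = 100.0       # "100 lb pull", from the slide
INHIBIT_BELOW_FT = 1500.0   # override disabled below this altitude, from the slide


def force_disengages_autopilot(column_force_lb: float, altitude_ft: float) -> bool:
    """Positive force is a push on the column, negative force is a pull."""
    if altitude_ft < INHIBIT_BELOW_FT:
        return False  # guards against an accidental nudge late in a landing
    if column_force_lb >= PUSH_LIMIT_LB:
        return True
    return -column_force_lb >= PULL_LIMIT_LB


print(force_disengages_autopilot(60.0, 3000.0))  # True
print(force_disengages_autopilot(60.0, 1000.0))  # False
```

The second call is the trap: below the inhibit altitude a hard push on the column no longer hands control back, so the pilot and the autopilot can end up opposing each other.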
47
Incident 4 - Nagoya: China Airlines
  • Neither computers nor humans come out of this
    incident well.
  • The stall-prevention system (increasing thrust)
    contributed to the stall; a safety mechanism
    should not make things worse!
  • It is surprising that the go-around function
    could be engaged with no audible warning or
    indication of a major change in operating mode.
  • It is surprising that the crew did not recognize
    what was happening and disengage the autopilot.
  • The accident investigation made suggestions
    concerning the training of pilots in the
    operation of the autopilot.

48
Cali 1995: Computers and People in Error
The captain asked, "Would you like to shoot the
one nine straight in?" The first officer
responded, "Yeah, we'll have to scramble to get
down. We can do it."
  • A 757 crashed in Colombia as a result of a late
    change in the flight plan during the approach
    phase.
  • By accepting a straight-in approach to runway
    19, the crew needed to accomplish the following
    actions expeditiously:
  • Locate, remove from its binder, and position the
    chart for the approach to runway 19
  • Review the approach chart for radio frequencies,
    headings, altitudes, distances, and missed
    approach procedures
  • Select and enter data into the flight management
    system (FMS) computers regarding the new approach
  • Compare information on the approach chart with
    approach information displayed from FMS data
  • Verify that selected radio frequencies, airplane
    headings, and FMS-entered data were correct
  • Recalculate airspeeds, altitudes, and
    configurations
  • Hasten the descent of the airplane because of the
    shorter distance available to the end of the new
    runway
  • Monitor the course and descent of the airplane,
    while maintaining communications with ATC

49
Cali: Confusion
  • The crew became confused and lost situational
    awareness.
  • They needed to return to a waypoint but the
    flight management system had deleted it because
    they'd passed it.
  • They wanted to enter a new waypoint, ROZO. They
    typed R and the computer came up with ROMEO,
    which they accepted.
  • Flying to ROMEO took them across a mountain ridge
    into another valley where they descended.
  • They used speed brakes to expedite their descent.
  • When the ground proximity warning sounded, they
    executed an escape maneuver without retracting
    the speed brakes. This prevented them from
    climbing to safety.
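
The ROZO/ROMEO mix-up shows how a completion list can hand the crew a plausible-looking but wrong answer. The sketch below is a hypothetical lookup that flags a suggestion for explicit confirmation when it is implausibly far away. The alphabetical ordering, the distances, the 50 nm limit and all names are assumptions for illustration; they do not describe the real flight management system or its database.

```python
from dataclasses import dataclass

MAX_PLAUSIBLE_DISTANCE_NM = 50.0   # hypothetical limit during an approach


@dataclass(frozen=True)
class Waypoint:
    name: str
    distance_nm: float   # distance from present position; values are hypothetical


def suggest(prefix: str, database: list[Waypoint]) -> tuple[Waypoint, bool]:
    """Return the first match (alphabetical order assumed) and whether the
    crew should be asked to confirm it explicitly."""
    matches = sorted((w for w in database if w.name.startswith(prefix)),
                     key=lambda w: w.name)
    if not matches:
        raise LookupError(f"no waypoint starts with {prefix!r}")
    first = matches[0]
    return first, first.distance_nm > MAX_PLAUSIBLE_DISTANCE_NM


database = [Waypoint("ROZO", 10.0), Waypoint("ROMEO", 130.0)]
choice, needs_confirmation = suggest("R", database)
print(choice.name, needs_confirmation)  # ROMEO True
```

A distance check is only one possible guard; the point is that the first suggestion is accepted far more readily under time pressure, so the system should make an unlikely choice hard to accept by accident.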

50
Cali: Whose Fault?
  • The aircraft was flying rapidly in mountainous
    country with no ground radar observation.
  • The crew became confused while trying to set up a
    new approach; all this activity took place in
    seconds.
  • The data entry system permitted them to make a
    data-entry mistake and descend into terrain.
  • The aircraft permitted them to execute an escape
    maneuver when they still had speed brakes
    deployed.
  • Pilot training, flight management system design,
    ground-air communication, and flight controls all
    played a part in the disaster.

51
The Biggest Killer of them All
  • The worst aviation disaster of all time involved
    an error of communications.

The catastrophe was caused by the misuse of the
English expression "I am -ing".
You would expect that critical communications in
aviation use a clear, unambiguous protocol. This
was not true on 27 March 1977.
52
The Tenerife Disaster
In foggy weather a KLM 747 was waiting to depart
at the end of the runway. The Dutch captain was
very impatient and wanted to get away. The Dutch
captain said "We are at take off". I assume that
in his haste he used a Dutch grammatical
construction which meant "We are taking off" when
expressed in Dutch. The Dutch captain began to
open the throttle, but his copilot stopped him.
Shortly afterward, the Dutch captain began the
takeoff roll in the heavy mist. While this was
happening, a second 747 was crossing the runway.
The Dutch captain was aware of this maneuver but
did not check that it had been completed. The KLM
747 collided with the Pan Am 747 with the loss of
583 lives.
53
G-KMAM: Computers 0, Humans 1
  • Sometimes humans can overcome computer errors
    because of training and intelligence, even when
    the advice given by the computer is incorrect.
  • In 1993 an A320 departed London Gatwick and an
    uncommanded roll to the right took place. The
    crew could not turn left.
  • The computer indicated a significant error and
    the EFCS reverted to its alternate law mode.
  • The crew returned to land; the computer advised
    a FLAP 3 landing.
  • They could not control the aircraft and selected
    a FLAP 1 landing after much hunting for paper
    operating manuals. The landing was safe.
  • It was later found that, after maintenance,
    spoilers had been left in a maintenance mode.
  • When the crew tested the controls they did not
    appreciate that an error message would appear
    only if the stick was held in position for 3.5 s.

54
Summary
  • Computers can improve the way in which we do
    almost anything.
  • However, the correct implementation of some
    activities is important because a failure can
    lead to a loss of life.
  • It is important that we who teach subjects like
    computer science make students aware of the
    possible consequences of their actions.