1. An Analysis of Causation in Aerospace Accidents
Kathryn Anne Weiss
weissk_at_mit.edu
http://www.mit.edu/weissk
Complex Systems Research Laboratory (CSRL)
Department of Aeronautics and Astronautics
Massachusetts Institute of Technology
Tuesday, September 7, 2004
This paper was presented at the Digital Avionics Systems Conference in 2001. This paper and similar papers on accidents, accident modeling and accident reports can be found at http://sunnyday.mit.edu/accidents/index.html
2. Recent Aerospace Losses
Ariane 5
Titan/Centaur/Milstar
SOlar Heliospheric Observatory
Mars Climate Orbiter
3. Ariane 5
- On June 4, 1996, 40 seconds after launch, the launcher veered off its nominal flight path and exploded
- The IRS software from Ariane 4 was reused on the Ariane 5
- The time sequence of the Ariane 5 lift-off is significantly different from that of the Ariane 4
- A function was left in the Ariane 5 software for commonality reasons, based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4
- An exception was raised, causing the nozzles of the solid rocket boosters to deflect, which subjected the launcher to high aerodynamic loads
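The kind of exception described on this slide can be illustrated with a short sketch. The slide does not say what raised the exception; the publicly released inquiry report attributes it to an unprotected conversion of a 64-bit floating-point value to a 16-bit signed integer. The function names and sample values below are hypothetical, and the code is an illustration in Python, not a reconstruction of the flight software.

```python
# Illustrative sketch only: names and values are assumptions, not flight code.
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unprotected(value: float) -> int:
    """Convert to a 16-bit signed integer; raise if the value does not fit.

    With no handler around the call, the error propagates to the caller,
    which is the failure pattern this sketch is meant to show.
    """
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """One possible protected alternative: clamp to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

A value that stays in range on one flight profile may exceed it on another, so the choice between raising and clamping is a system-level decision rather than a purely local coding one.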
4. Mars Climate Orbiter
- Relied heavily on previous designs of MGS and Pathfinder
- There was an error in the spacecraft's navigation measurements of nearly 100 km, which resulted in a much lower altitude than expected during MOI and led to the vehicle's break-up in the atmosphere
- The conversion factor from English to Metric units was erroneously left out of the AMD files
- The Interface Specification required that the impulse-bit calculations be done using Metric units
- The software supplied by a vendor used English units
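The unit mismatch above can be sketched in a few lines. This is a minimal illustration, assuming the impulse values were produced in pound-force seconds and expected in newton seconds; the `Impulse` type, the unit strings, and the function name are hypothetical and do not come from the mission software.

```python
from dataclasses import dataclass

# 1 lbf = 4.4482216152605 N (exact by definition of the pound-force).
LBF_TO_N = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that carries its unit instead of being a bare number."""
    value: float
    unit: str  # "N*s" or "lbf*s" (hypothetical unit labels)


def to_newton_seconds(impulse: Impulse) -> float:
    """Return the impulse in newton seconds, converting when needed."""
    if impulse.unit == "N*s":
        return impulse.value
    if impulse.unit == "lbf*s":
        return impulse.value * LBF_TO_N
    raise ValueError(f"unknown impulse unit: {impulse.unit!r}")
```

A bare number written in pound-force seconds and read as newton seconds understates the impulse by a factor of about 4.45, which is why carrying the unit with the value makes the interface requirement checkable.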
5. Titan/Centaur/Milstar
- Mission to place Milstar in a geosynchronous orbit
- Roll rate filter constant should have been entered as 1.992476, but was entered as 0.1992476
- Centaur/Milstar began experiencing instability about the roll axis during the first burn
- Instability greatly magnified during Centaur's second main engine burn, resulting in vehicle tumbling
- The Centaur attempted to compensate with its RCS, which ultimately depleted available propellant
- The third engine burn terminated early
- Milstar satellite placed in a low elliptical final orbit
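The mistyped constant above differs from the intended value by exactly one decimal place, a pattern that a simple load-time check can flag. The sketch below is a hypothetical illustration: the function names and tolerances are assumptions, and nothing here describes the checks actually used in the Titan/Centaur program.

```python
import math


def check_constant(name: str, entered: float, expected: float,
                   rel_tol: float = 1e-9) -> float:
    """Return the entered value, or raise if it differs from the reference."""
    if not math.isclose(entered, expected, rel_tol=rel_tol):
        raise ValueError(f"{name}: entered {entered}, expected {expected}")
    return entered


def looks_like_decimal_shift(entered: float, expected: float) -> bool:
    """Return True if the two values differ by a nonzero power of ten."""
    if entered == 0 or expected == 0 or (entered / expected) < 0:
        return False
    exponent = math.log10(entered / expected)
    nearest = round(exponent)
    return nearest != 0 and math.isclose(exponent, nearest, abs_tol=1e-9)
```

With the values from the slide, `looks_like_decimal_shift(0.1992476, 1.992476)` flags the entry as a misplaced decimal point, which gives a reviewer a more specific message than a plain mismatch.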
6. SOHO Background
- SOHO, or the SOlar Heliospheric Observatory, is a joint effort between NASA and ESA to perform helioseismology and monitor the solar atmosphere, corona and wind
- SOHO was launched on December 2, 1995, was declared fully operational in April of 1996, and completed a successful two-year primary mission in May of 1998
- It then entered into its extended mission phase
- After roughly two months of nominal activity, contact with SOHO was lost June 25, 1998
7. SOHO Loss (1/4)
- The loss was preceded by a routine calibration of the spacecraft's three roll gyroscopes (named A, B and C) and by a momentum management maneuver
- In order to increase the amount of science done during the mission and to increase the gyros' lifespans, a decision was made to compress the timeline of the operational procedures for momentum management, gyro calibration and science instrument calibration into one continuous sequence
- The previous process had included a day between completing gyro calibration and beginning the momentum management procedures
8. SOHO Loss (2/4)
- Because the gyro calibration in the new compressed timeline was immediately followed by a momentum management procedure, despinning the gyros at the end of the gyro calibration and re-enabling the on-board software gyro control function were not required
- However, after the gyro calibration, Gyro A was specifically despun in order to conserve its life, while Gyros B and C remained active
9. SOHO Loss (3/4)
- The modified predefined command sequence in the on-board control software had an error: it did not contain a necessary function to reactivate Gyro A, which was needed by the Emergency Sun Reacquisition (ESR)
- This omission resulted in the removal of the functionality of the spacecraft's normal safe mode, ESR, and ultimately caused the sequence of events that led to the loss of telemetry
- In addition, there was another error in the software that resulted in leaving Gyro B in its high gain setting following the momentum management maneuver
- This error originally triggered the ESR
10. SOHO Loss (4/4)
- The first error was contained within a software function called A_CONFIG_N
- ESR requires the use of Gyro A for roll control
- Any procedure that spins down Gyro A must set a flag in the computer to respin Gyro A whenever the safe mode is triggered
- When A_CONFIG_N was modified, the software enable command was omitted due to a lack of system knowledge on the part of the person who modified the procedure
- Because the change had not been properly communicated, the operator procedures did not indicate that Gyro A had been spun down
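The rule stated above (any procedure that spins down Gyro A must also set the respin flag) is the kind of invariant that can be checked mechanically before a command sequence is uplinked. The sketch below is a hypothetical illustration: the command names are invented for this example and are not the actual SOHO command mnemonics.

```python
from typing import Iterable

# Hypothetical command names, chosen only for this illustration.
SPIN_DOWN_A = "SPIN_DOWN_GYRO_A"
SPIN_UP_A = "SPIN_UP_GYRO_A"
ENABLE_RESPIN_A = "ENABLE_GYRO_A_RESPIN_ON_SAFE_MODE"


def respin_flag_missing(sequence: Iterable[str]) -> bool:
    """Return True if the sequence leaves Gyro A spun down with no respin flag."""
    gyro_a_spun_down = False
    respin_enabled = False
    for command in sequence:
        if command == SPIN_DOWN_A:
            gyro_a_spun_down = True
        elif command == SPIN_UP_A:
            gyro_a_spun_down = False
        elif command == ENABLE_RESPIN_A:
            respin_enabled = True
    return gyro_a_spun_down and not respin_enabled
```

A check like this would flag a sequence that spins down Gyro A without the enable command, independent of whether the person editing the procedure knew why the flag was needed.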
11. Lessons Learned
- We can learn lessons from these and other (all very different) aerospace accidents by examining the factors common among them
- These factors are systemic and indicative of many accidents involving aerospace software systems
- Systemic factors can be grouped into the following categories:
  - Flaws in the Safety Culture
  - Ineffective Organizational Structure
  - Ineffective Technical Activities
12. Flaws in the Safety Culture
- Overconfidence and Complacency
  - Success is ironically one of the progenitors of accidents
  - In SOHO, this led to inadequate testing and review of changes to ground-issued commands, a false sense of confidence in the team's ability to recover from an ESR, the use of challenging schedules, etc.
- Discounting or Not Understanding Software Risks
  - An engineering culture that has unrealistic expectations about software and the use of computers
  - Changing (SOHO) software without introducing errors or undesired behavior is much more difficult than building correct software initially
13. Flaws in the Safety Culture (Cont.)
- Assuming Risk Decreases over Time
  - In the Titan/Centaur/Milstar loss, the Titan Program Office decided that because the software was mature, stable, and had not experienced problems in the past, they could use the limited resources available after the initial development effort to address hardware issues
- Inadequate Emphasis on Risk Management
- Incorrect Prioritization of Changes
- Slow Understanding of the Problems Associated with Human-Automation Mismatch
14. Ineffective Organizational Structure
- Diffusion of Responsibility and Authority
  - In almost all of the spacecraft accidents, there appeared to be serious organizational and communication problems among the geographically dispersed partners
- Low-Level Status or Missing System Safety Program
  - In the SOHO report, no mention is made of any formal safety program
- Limited Communication Channels and Poor Information Flow
15. Ineffective Technical Activities
- Flawed or Inadequate Review Process
  - For SOHO, the changes to the ground-generated commands were subjected to very limited review
- Inadequate Specifications
  - Software-related accidents almost always are due to misunderstandings about what the software should do
- Inadequate System and Software Engineering
- Software Reuse Without Appropriate Analysis of its Safety
  - Two of the spacecraft accidents, Titan and Ariane, involved reused software originally developed for other systems
16. Ineffective Technical Activities (Cont.)
- Unnecessary Complexity and Software Functions
  - The Ariane 5 and Titan IVB-32 accidents clearly involved software that was not needed, but surprisingly the decision to put in or to keep these features (in the case of reuse) was not questioned in the accident reports
- Inadequate System Safety Engineering
- Test and Simulation Environments that do not Match the Operational Environment
  - A general principle in testing aerospace systems is to fly what you test and test what you fly
17. Ineffective Technical Activities (Cont.)
- Deficiencies in Safety-Related Information Collection and Use
- Operational Personnel Not Understanding the Automation
  - The SOHO report says that the software enable function had not been included as part of the modification to A_CONFIG_N due to a lack of system knowledge of the person who modified the procedure
- Inadequate and Ineffective Cognitive Engineering and Feedback
  - SOHO controllers did not have the information they needed about the state of the gyros and the spacecraft in general to make appropriate decisions
18. Conclusions
- By examining recent, software-related aerospace accidents, we notice similarities, or systemic factors, involved in the losses
- These similarities and parallels should help in focusing efforts to prevent future accidents