Title: Aerospace Mishaps and Lessons Learned
1. Aerospace Mishaps and Lessons Learned
- 2004 MAPLD International Conference
- Washington, D.C.
- September 7, 2004
2. "... most accidents are not the result of unknown
scientific principles but rather of a failure to
apply well-known, standard engineering practices."
- Nancy Leveson in Safeware, 1995.
3. Seminar Program
Time   Speaker              Affiliation                  Mishap Title
9:00   Richard Katz         NASA Office of Logic Design  Introduction
9:15   Faith Chandler       NASA HQ                      Using Root-Cause Analysis to Understand Failures
10:00  Jonathan F Binkley   Aerospace Corp.              The Space System Engineering Database (SSED)
10:45  BREAK
11:00  Owen Brown           DARPA                        Apollo 13 Mishap
12:00  Kathryn Anne Weiss   MIT                          An Analysis of Causation in Aerospace Accidents
12:45  LUNCH
1:30   Susan C. Lee         JHU/APL                      The Near Earth Asteroid Rendezvous (NEAR) Rendezvous Burn Anomaly
2:45   Rick Obenschain      NASA GSFC                    SEASAT Lessons Learned and Not Learned
3:30   BREAK
3:45   Keith E. Van Tassel  NASA JSC                     STS-86/SAFER
4:30   Paul Cheng           Aerospace Corp               Aerospace 100 Questions That Should Be Asked During Technical Reviews
5:15   Keith Avery          Mission Research Corp.       STRV-1c/1d Mishap
6:00   SESSION ENDS
4. Training vs. Education
- The NASA Office of Logic Design works to educate
design engineers, not train them.
- Training promotes rote responses.
- Education promotes thinking and the ability to
adapt to and cope with new situations.
- Hence, MAPLD hosts seminars and not training
sessions.
5. Design Seminars
- These case studies are real and are not contrived
examples. Many of the leaders have first-hand
knowledge of these mishaps.
- Contribute! Discuss the topics presented,
disagree with them, present interesting cases you
wish to share, additional lessons, or alternative
viewpoints.
- Do not sit there quietly and expect to be treated
like a cocker spaniel being trained and drilled
to emit Pavlovian responses to stimuli
(bell for dogs, donuts for engineers).
6. Material
- Material will be made available on
- CD-ROM
- Hardcopy
- klabs.org
- All public domain; you may use the material as
you wish.
7. I Was Reading AWST
Aviation Week & Space Technology, August 23/30,
2004, pp. 29-30
8. Barto's Law
Every circuit is considered guilty
until proven innocent.
9. A Recent Mishap (that gave me the idea for this
seminar)
10. Background
- Popular single board computer
- Everything was working fine
- Ran vibration test
- Unpowered and unmonitored
- Subsequently failed to boot intermittently
- Testing at the manufacturer's also showed
intermittent failures, although at a lower rate
than observed at the contractor.
11. Project's Corrective Action
- Unit (S/N 031) pulled from the flight instrument
- New unit (S/N 034) installed in the flight
instrument
- Repeated testing with the new unit was successful
- Signed off, ready for launch
12. Risk Reduction Effort
- Reviewed problem/failure report
- No root cause or failure mechanism identified
- Conclusion of the Verification and Analysis
Section stated:
  "Each time there was a failure to boot, the power
  was cycled and the computer subsequently
  rebooted. The result of the testing at XXXXXX
  was that the most probable cause of the boot
  failure was a workmanship issue specific to SN034
  and is not endemic to the XXXXXXXX computer and
  therefore does not affect SN031."
- No direct or indirect evidence given in the
Verification and Analysis section to support a
workmanship issue.
- No analysis given to show that the workmanship
problem was not systemic to all units. Since the
unit is clearly marginal and it is difficult to
make fail, it is not shown that other units have
sufficient margin to support operation in all
operating environments over the design life of
the unit.
13. Risk Reduction Effort
- Note the analyst consistently remarks that
after a failed boot the next power cycle results
in correct operation of the board. Yet the board
fails multiple times. This is evidence of the
PC mentality seen in many Projects where, when
there is a problem, the solution is to switch the
power off and back on to correct it.
- Contractor and Project claimed repeatedly that
the unit was troubleshot and nothing more could
be done.
14. Let's Take a Closer Look
- Examination of failures at manufacturer
- The failures reported were a result of test
equipment; there were zero failures detected at
the manufacturer
- Intermittent operation of the computer could not
be supported. Suspicion of the electrical
environment grows
- What-if analysis results in a large number of
possible failure mechanisms
15. Let's Take a Closer Look
- Examination of troubleshooting at contractor
- Previously claimed fully troubleshot
- Examination shows that no oscilloscope probe ever
touched the board
- Examined at interface points only
- Throughout the organization, failures to boot were
routine
- Many failure reports written over many units.
- Contractor did not use available diagnostic
signals and port to ascertain status of the CPU
and computer
16. Troubleshooting Again
- Contractor fought hard to prevent
- Stalled effort for many months
- Initial examination showed that the protection
signals for the EEPROM memories did not behave as
predicted by the analysis
- Contractor would not show the analysis
- Examination of diagnostic signals quickly showed
that the CPU had halted
17. Troubleshooting Results
- Cause of failure determined
- Known issue with pipeline timing
- Software service routines not installed to handle
all conditions
- Project previously had assured the independent
review that software was installed to handle all
conditions
- Did not fail at manufacturer since test software
installed properly handled the interrupt from the
pipelining issue
- No support for a workmanship issue specific to
SN034
- Flight software rewritten
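The failure mode on this slide, an interrupt arriving with no service routine installed, can be sketched with a toy model. This is only an illustration: the slides do not describe the CPU, its vector layout, or how the flight software was rewritten, so `VectorTable`, the four-vector layout, and both handler functions are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

Handler = Callable[[int], str]


class VectorTable:
    """Toy model of an interrupt vector table (all names hypothetical)."""

    def __init__(self, num_vectors: int) -> None:
        self.num_vectors = num_vectors
        self.handlers: Dict[int, Handler] = {}

    def install(self, vector: int, handler: Handler) -> None:
        if not 0 <= vector < self.num_vectors:
            raise ValueError(f"vector {vector} out of range")
        self.handlers[vector] = handler

    def install_default(self, handler: Handler) -> None:
        # Fill every vector that has no specific routine yet.
        for vector in range(self.num_vectors):
            self.handlers.setdefault(vector, handler)

    def missing(self) -> List[int]:
        # Vectors that would halt the modeled CPU if they ever fired.
        return [v for v in range(self.num_vectors) if v not in self.handlers]

    def raise_interrupt(self, vector: int) -> str:
        handler = self.handlers.get(vector)
        if handler is None:
            return "HALTED"  # modeled outcome of an unserviced interrupt
        return handler(vector)


def service(vector: int) -> str:
    return f"serviced {vector}"


def log_and_recover(vector: int) -> str:
    return f"unexpected {vector}: logged, execution resumed"


# Assumed scenario: only the "expected" interrupts were given routines.
table = VectorTable(num_vectors=4)
table.install(0, service)
table.install(1, service)
unhandled_before = table.missing()
result_before = table.raise_interrupt(3)

# Covering every remaining vector removes the silent-halt path.
table.install_default(log_and_recover)
unhandled_after = table.missing()
result_after = table.raise_interrupt(3)
```

A check like `missing()` is the kind of direct evidence a review could ask for in place of an assurance that software handles all conditions.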
18. Lessons and Suggestions
- Problem/Failure Reports
- Examine original documents.
- Request and examine all related P/FRs from all
units
- Provide direct evidence (at a minimum!) for
determination of the cause of failure
- Intermittents after vibration test led to the
conclusion of a workmanship error; the bad
solder joint was never identified
- Failures at the manufacturer reinforced the
false conclusion as those failures were not
examined in detail and were a result of a testing
error.
- Do not conduct reviews in a board room with
PowerPoint slides
- Pack up your oscilloscope and go into the lab
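As a small aid to the advice on examining P/FRs from all units, the sketch below tallies one symptom across units to see whether it is confined to a single unit. The record format, unit identifiers, and symptom strings are made-up placeholders, not data from the project described here, and a tally like this supplements but does not replace work in the lab.

```python
from collections import Counter
from typing import List, Tuple

Report = Tuple[str, str]  # (unit identifier, symptom), a hypothetical format


def failures_by_unit(reports: List[Report], symptom: str) -> Counter:
    """Count how many reports of the given symptom each unit has."""
    return Counter(unit for unit, seen in reports if seen == symptom)


def looks_unit_specific(reports: List[Report], symptom: str) -> bool:
    """True only if exactly one unit ever reported the symptom."""
    return len(failures_by_unit(reports, symptom)) == 1


# Placeholder records for illustration only.
reports = [
    ("UNIT-A", "failed to boot"),
    ("UNIT-B", "failed to boot"),
    ("UNIT-A", "failed to boot"),
    ("UNIT-C", "connector damage"),
]
counts = failures_by_unit(reports, "failed to boot")
unit_specific = looks_unit_specific(reports, "failed to boot")
```

If the same symptom shows up on more than one unit, a conclusion that the cause is specific to one serial number needs direct evidence before it is accepted.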
19. Enjoy your seminar!