Aerospace Mishaps and Lessons Learned

About This Presentation
Slides: 20
Provided by: wyuk
Learn more at: http://klabs.org

Transcript and Presenter's Notes


1
Aerospace Mishaps and Lessons Learned
  • 2004 MAPLD International Conference
  • Washington, D.C.
  • September 7, 2004

2
"... most accidents are not the result of unknown
scientific principles but rather of a failure to
apply well-known, standard engineering practices."
  • Nancy Leveson in Safeware, 1995.

3
Seminar Program
Time   Speaker              Affiliation                  Mishap Title
9:00   Richard Katz         NASA Office of Logic Design  Introduction
9:15   Faith Chandler       NASA HQ                      Using Root-Cause Analysis to Understand Failures
10:00  Jonathan F. Binkley  Aerospace Corp.              The Space System Engineering Database (SSED)
10:45  BREAK
11:00  Owen Brown           DARPA                        Apollo 13 Mishap
12:00  Kathryn Anne Weiss   MIT                          An Analysis of Causation in Aerospace Accidents
12:45  LUNCH
1:30   Susan C. Lee         JHU/APL                      The Near Earth Asteroid Rendezvous (NEAR) Rendezvous Burn Anomaly
2:45   Rick Obenschain      NASA GSFC                    SEASAT Lessons Learned and Not Learned
3:30   BREAK
3:45   Keith E. Van Tassel  NASA JSC                     STS-86/SAFER
4:30   Paul Cheng           Aerospace Corp.              Aerospace 100 Questions That Should Be Asked During Technical Reviews
5:15   Keith Avery          Mission Research Corp.       STRV-1c/1d Mishap
6:00   SESSION ENDS

4
Training vs. Education
  • The NASA Office of Logic Design works to educate
    design engineers, not train them.
  • Training promotes rote responses
  • Education promotes thinking and the ability to
    adapt to and cope with new situations.
  • Hence, MAPLD hosts seminars and not training
    sessions.

5
Design Seminars
  • These case studies are real, not contrived
    examples. Many of the leaders have firsthand
    knowledge of these mishaps.
  • Contribute: discuss the topics presented,
    disagree with them, present interesting cases you
    wish to share, additional lessons, or alternative
    viewpoints.
  • Do not sit there quietly and expect to be treated
    like a cocker spaniel being trained and drilled
    to emit Pavlovian responses in response to
    stimuli (bell for dogs, donuts for engineers).

6
Material
  • Material will be made available on:
      • CD-ROM
      • Hardcopy
      • klabs.org
  • All public domain; you may use the material as
    you wish.

7
I Was Reading AWST
Aviation Week & Space Technology, August 23/30,
2004, pp. 29-30
8
Barto's Law: Every circuit is considered guilty
until proven innocent.
9
A Recent Mishap (that gave me the idea for this
seminar)
10
Background
  • Popular single board computer
  • Everything was working fine
  • Ran vibration test
      • Unpowered and unmonitored
  • Subsequently failed to boot intermittently
  • Testing at the manufacturer also showed
    intermittent failures, although at a lower rate
    than observed at the contractor.

11
Project's Corrective Action
  • Unit (S/N 031) pulled from the flight instrument
  • New unit (S/N 034) installed in the flight
    instrument
  • Repeated testing with the new unit was successful
  • Signed off, ready for launch
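
A rough way to judge how much "repeated successful testing" is needed before an intermittent fault can be called absent is sketched below. It is not from the original slides. It assumes each boot attempt is independent with the same failure probability, which real hardware may not satisfy, and the function names and the 95% default confidence are illustrative choices.

```python
import math


def boots_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number n of consecutive clean boots such that
    (1 - max_failure_rate) ** n <= 1 - confidence.

    Assumption: boot attempts are independent and identically distributed.
    """
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # If the true per-boot failure rate is p, the chance of n clean boots
    # in a row is (1 - p) ** n. Solve (1 - p) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate))


def failure_rate_upper_bound(clean_boots: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-boot failure rate after observing
    clean_boots consecutive successes and zero failures."""
    if clean_boots < 1:
        raise ValueError("clean_boots must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_boots)


if __name__ == "__main__":
    print(boots_needed(0.01))
    print(failure_rate_upper_bound(20))
```

Under these assumptions, bounding a 1% per-boot failure rate at 95% confidence takes 299 clean boots in a row, so a handful of successful boots on a replacement unit says little about a low-rate intermittent.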

12
Risk Reduction Effort
  • Reviewed problem/failure report
  • No root cause or failure mechanism identified
  • Conclusion of the Verification and Analysis
    Section stated:
    "Each time there was a failure to boot, the power
    was cycled and the computer subsequently
    rebooted. The result of the testing at XXXXXX
    was that the most probable cause of the boot
    failure was a workmanship issue specific to SN034
    and is not endemic to the XXXXXXXX computer and
    therefore does not affect SN031."
  • No direct or indirect evidence given in the
    Verification and Analysis section to support a
    workmanship issue.
  • No analysis given to show that the workmanship
    problem was not systemic to all units. Since the
    unit is clearly marginal and it is difficult to
    make it fail, it is not shown that other units have
    sufficient margin to support operation in all
    operating environments over the design life of
    the unit.

13
Risk Reduction Effort
  • Note: the analyst consistently remarks that
    after a failed boot the next power cycle results
    in correct operation of the board. Yet the board
    fails multiple times. This is evidence of the
    "PC mentality" seen in many Projects where, when
    there is a problem, the solution is to switch the
    power off and back on to correct it.
  • Contractor and Project claimed repeatedly that
    the unit was troubleshot and nothing more could
    be done.
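
The alternative to the power-cycle habit described above is to treat every failed boot as data. The sketch below is illustrative only and not from the original slides: `power_cycle` and `read_diagnostics` are hypothetical callables standing in for whatever test equipment is available, and the `cpu_halted` key is an assumed diagnostic field. The point it shows is that diagnostic state is captured after each failure and before the next power cycle erases it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BootRecord:
    attempt: int
    booted: bool
    diagnostics: Dict[str, object] = field(default_factory=dict)


def run_boot_campaign(
    power_cycle: Callable[[], bool],
    read_diagnostics: Callable[[], Dict[str, object]],
    attempts: int,
) -> List[BootRecord]:
    """Power-cycle the unit `attempts` times and record every outcome.

    On a failed boot, diagnostics are read immediately, before the next
    power cycle can wipe the evidence.
    """
    records = []
    for attempt in range(1, attempts + 1):
        booted = power_cycle()  # hypothetical: returns True if the unit booted
        diagnostics = {} if booted else read_diagnostics()
        records.append(BootRecord(attempt, booted, diagnostics))
    return records


def summarize(records: List[BootRecord]) -> Dict[str, object]:
    """Count failures and how many of them showed a halted CPU."""
    failures = [r for r in records if not r.booted]
    return {
        "attempts": len(records),
        "failures": len(failures),
        "failure_rate": len(failures) / len(records) if records else 0.0,
        "cpu_halted": sum(1 for r in failures if r.diagnostics.get("cpu_halted")),
    }


if __name__ == "__main__":
    # Simulated unit: the second and fourth boots fail with a halted CPU.
    outcomes = iter([True, False, True, False])
    log = run_boot_campaign(lambda: next(outcomes), lambda: {"cpu_halted": True}, 4)
    print(summarize(log))
```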

14
Let's Take a Closer Look
  • Examination of failures at manufacturer
  • The failures reported were a result of test
    equipment; there were zero failures detected at
    the manufacturer
  • Intermittent operation of the computer could not
    be supported. Suspicion of the electrical
    environment grows
  • "What if" analysis results in a large number of
    possible failure mechanisms

15
Let's Take a Closer Look
  • Examination of troubleshooting at contractor
  • Previously claimed fully troubleshot
  • Examination shows that no oscilloscope probe ever
    touched the board
  • Examined at interface points only
  • Throughout the organization, failures to boot
    were routine
  • Many failure reports written across many units.
  • Contractor did not use available diagnostic
    signals and port to ascertain status of the CPU
    and computer

16
Troubleshooting Again
  • Contractor fought hard to prevent further
    troubleshooting
  • Stalled effort for many months
  • Initial examination showed that the protection
    signals for the EEPROM memories did not behave as
    predicted by the analysis
  • Contractor would not show the analysis
  • Examination of diagnostic signals quickly showed
    that the CPU had halted

17
Troubleshooting Results
  • Cause of failure determined
  • Known issue with pipeline timing
  • Software service routines not installed to handle
    all conditions
  • Project previously had assured the independent
    review that software was installed to handle all
    conditions
  • Did not fail at the manufacturer, since the test
    software installed there properly handled the
    interrupt from the pipelining issue
  • No support for a workmanship issue specific to
    SN034
  • Flight software rewritten
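
The root cause above, service routines missing for some conditions, is the kind of gap a simple coverage check can expose. The sketch below is a simulation, not the flight software: the slides do not name the processor or its traps, so the `Trap` members, the handler table, and the default handler are all hypothetical. It shows two ideas: listing every condition that has no installed routine, and filling the gaps with a default routine that records the event.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class Trap(Enum):
    # Hypothetical trap set; the real processor's conditions are not given.
    RESET = auto()
    ILLEGAL_INSTRUCTION = auto()
    PIPELINE_TIMING = auto()
    EXTERNAL_INTERRUPT = auto()


Handler = Callable[[], str]


def missing_handlers(installed: Dict[Trap, Handler]) -> List[Trap]:
    """Return every trap that has no service routine installed."""
    return [trap for trap in Trap if trap not in installed]


def build_vector_table(
    installed: Dict[Trap, Handler],
) -> Tuple[Dict[Trap, Handler], List[str]]:
    """Return a table with an entry for every trap, plus an event log.

    Traps without an installed routine get a default routine that logs
    the trap name, so an unexpected condition leaves evidence behind.
    """
    event_log: List[str] = []

    def make_default(trap: Trap) -> Handler:
        def default_handler() -> str:
            event_log.append(trap.name)
            return "handled-by-default"
        return default_handler

    table = {trap: installed.get(trap, make_default(trap)) for trap in Trap}
    return table, event_log


if __name__ == "__main__":
    installed = {Trap.RESET: lambda: "reset-serviced"}
    print([trap.name for trap in missing_handlers(installed)])
    table, event_log = build_vector_table(installed)
    print(table[Trap.PIPELINE_TIMING](), event_log)
```

Running a check like `missing_handlers` against the flight build would be one way to confirm, rather than take on assurance, that software is installed to handle all conditions.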

18
Lessons and Suggestions
  • Problem/Failure Reports
  • Examine original documents.
  • Request and examine all related P/FRs from all
    units
  • Provide direct evidence (at a minimum!) for
    determination of the cause of failure
  • Intermittents after vibration test led to the
    conclusion of a workmanship error; the bad
    solder joint was never identified
  • Failures at the manufacturer reinforced the
    false conclusion as those failures were not
    examined in detail and were a result of a testing
    error.
  • Do not conduct reviews in a board room with
    PowerPoint slides
  • Pack up your oscilloscope and go into the lab

19
Enjoy your seminar!