Title: Developing Medical Software: Pitfalls and Prophylactics
1Developing Medical Software: Pitfalls and Prophylactics
- Elliot Jaffe
- Seminar in Computer-Assisted Surgery, Medical Robots and Medical Imaging, Fall 2002
2Outline
- Why should you be worried?
- Case Study: Therac-25
- US Government Guidelines
3What? Me worry?
- Software is used in medical devices
- Monitoring
- Planning
- Surgery
- Visualization
- Software fails
4Case Study: Therac-25
- 1983-1987
- AECL: Atomic Energy of Canada Ltd.
- 6 reported accidents
- Changed the way software is developed and verified as part of a medical device
5Medical Linear Accelerators
Figure: Linac, North Oakland Medical Center
6Therac-25 genesis
- Therac-6: 6 MeV X-ray accelerator
- Therac-20: 20 MeV dual-mode (electron/X-ray) accelerator
- Upgraded with DEC PDP-11 minicomputer for ease of use
- Could be operated without computer
7Therac-25
- Dual-mode 25 MeV accelerator
- Electron/X-ray
- Can be operated ONLY through the computer
- Computer controls and monitors system
- Some hardware safety mechanisms and interlocks were replaced with software
- First working prototype 1976
- First commercial product 1982
8Treatment Goals
- Deliver high-energy radiation for the treatment of cancer
- Radiation needs to be focused and controlled
- Multiple energy levels
- X-Ray
- Electron
9Therac-25 Operation
- Turntable to select from three modes
- Visual
- Electron
- X-Ray
- Turntable is moved mechanically
- Software monitors position of turntable
10Turntable
11Operator Interface
Cursor should be here during operation
12Therac-25 Error States
- Treatment Suspend
- Requires complete machine restart
- Treatment Pause
- Operator types P to proceed
13Therac-25 Error Messages
- HTILT, VTILT, etc.
- MALFUNCTION <n>
- 1 < n < 64
- No documentation
- No indication of severity
- Occurred on average 40 times a day!
14Therac-25 Event 1
- June 1985: 10 MeV electron treatment
- Patient reported "tremendous force of heat ... this red-hot sensation"
- Technician replied that it was impossible
- AECL claimed it was impossible
- Never reported to FDA
15Therac-25 Event 1
- Patient received severe radiation burn
- Patient's breast was removed
- Shoulder and arm were paralyzed
- AECL refused to believe that it was caused by Therac-25
- Lawsuit settled out of court
16Therac-25 Event 2
- July 26, 1985
- HTILT message, Treatment Pause
- Operators resumed treatment
- Repeated 5 times until machine stopped
- Patient reported electric tingling shock
17Therac-25 Event 2
- Patient died of cancer
- Autopsy revealed that a total hip replacement would have been required due to radiation exposure
- Reported to AECL, FDA
- AECL believed it to be a hardware problem
18Therac-25 Event 2
- AECL could not reproduce the reported behavior
- AECL modified turntable
- Fixed potential error in 3-bit turntable
location identifier
19Turntable
20Therac-25 Event 2
- AECL claimed that "analysis of the hazard rate of the new solution indicates an improvement over the old system by at least 5 orders of magnitude"
21Therac-25 Event 3
- December 1985
- After upgrade from event 2
- Patient developed parallel striped pattern in treatment area
- AECL reported: "Could not have been produced by any malfunction of the Therac-25 or by any operator error."
- Not reported to FDA
- Patient required surgery to repair tissue damage
22Therac-25 Event 4
- March 21, 1986
- Operator entered x instead of e
- Moved cursor and corrected error
- Began treatment
- MALFUNCTION 54
- Continued Treatment
- MALFUNCTION 54
- Machine shutdown
23Therac-25 Event 4
- Patient-monitoring video and audio were broken
- Patient received electric shock, started to get up and was then shocked in the arm
- Patient pounded on treatment door
- Patient sent home
- Machine checked out ok
24Therac-25 Event 4
- Patient died of overdose 5 months later
- AECL suggested an electrical problem in the area
- Independent engineering firm checked and found no
problem
25Therac-25 Event 5
- April 11, 1986
- Same operator
- Same editing
- MALFUNCTION 54
- Audio monitor (now working) reported a loud sound from machine
- Patient died May 1, 1986 (three weeks later) of acute high-dose radiation to his brain
26Therac-25 Event 5
- Physicist took machine out of service
- Reported to AECL
- Operator and physicist were able to reproduce the failure
- AECL still could not reproduce the failure
- FDA declared system defective
27Therac-25 Event 5 - cause
- Operating system was a hand-coded real-time system developed by one programmer in the 1970s
- Problem was traced to a race condition in the main loop
- Result was that X-ray beam could be used through the electron magnet
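The kind of race described on this slide can be sketched deterministically. This is not the Therac-25 code: the names `Console`, `setup_unsafe` and `setup_checked` are hypothetical, and the operator's edit is injected by hand in the middle of setup (instead of using real threads) so that the outcome is reproducible.

```python
from dataclasses import dataclass


@dataclass
class Console:
    """Hypothetical shared state written by the operator's editing task."""
    mode: str = "x"   # 'x' = X-ray, 'e' = electron
    edits: int = 0    # bumped on every operator edit

    def edit_mode(self, new_mode):
        self.mode = new_mode
        self.edits += 1


def setup_unsafe(console, edit_during_setup=None):
    """Check-then-act: beam energy is chosen early, magnets are set late."""
    beam = "high" if console.mode == "x" else "low"      # reads the OLD mode
    if edit_during_setup:                                 # operator edits mid-setup
        console.edit_mode(edit_during_setup)
    magnets = "xray" if console.mode == "x" else "electron"  # reads the NEW mode
    return beam, magnets


def setup_checked(console, edit_during_setup=None):
    """Same steps, but redo setup if the input changed while it was running."""
    while True:
        seen = console.edits
        beam = "high" if console.mode == "x" else "low"
        if edit_during_setup:
            console.edit_mode(edit_during_setup)
            edit_during_setup = None          # the simulated edit happens once
        magnets = "xray" if console.mode == "x" else "electron"
        if console.edits == seen:             # no edit seen: settings agree
            return beam, magnets


unsafe = setup_unsafe(Console(mode="x"), edit_during_setup="e")
checked = setup_checked(Console(mode="x"), edit_during_setup="e")
```

Here `unsafe` pairs the high-energy beam with the electron-mode magnets, while `checked` notices the edit and repeats setup until both settings come from the same input. A real system would also need the edit counter itself to be updated atomically.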
28Therac-25 Event 6
- January 17, 1987
- Operator set turntable to field light position
- Gave command to system to set turntable to X-ray
- Ran treatment
- System reported no dose or dose rate
- Re-ran treatment
- Patient died in April 1987 of problems related to overdose
- AECL and FDA notified
29Therac-25 Event 6 - cause
- Software bug
- Register overflow
- 8-bit register used for multiple purposes
- Once or twice in each setup phase, the register
overflows, allowing the system to think that the
turntable was reset
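The overflow can be illustrated with a small simulation. It is a sketch under an assumption the slide does not spell out: that the shared register is incremented on every pass of the setup loop and that a value of zero is read as "turntable already verified". The function names are hypothetical.

```python
def passes_that_skip_check(num_passes):
    """Simulate an 8-bit register reused as both a counter and a flag.

    Assumption for illustration: the register is incremented on every pass
    of the setup loop, and zero is read as "skip the turntable check".
    """
    register = 0
    skipped = []
    for pass_no in range(1, num_passes + 1):
        register = (register + 1) & 0xFF   # 8-bit wrap-around: 255 + 1 -> 0
        if register == 0:                  # overflow looks like "all clear"
            skipped.append(pass_no)
    return skipped


def passes_that_skip_check_fixed(num_passes):
    """Safer variant: a dedicated boolean that is set, never incremented."""
    skipped = []
    for pass_no in range(1, num_passes + 1):
        check_needed = True                # cannot overflow to "all clear"
        if not check_needed:
            skipped.append(pass_no)
    return skipped


print(passes_that_skip_check(600))         # passes on which the check is skipped
print(passes_that_skip_check_fixed(600))
```

With 600 passes the first version skips the check on passes 256 and 512, which fits the slide's "once or twice in each setup phase"; the second never skips it.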
30Lessons Learned
- Studies reported 12 lessons learned
- We will cover six of them
31Overconfidence in Software
- First safety analysis did not include software, even though it was responsible for safety of the system
- When problems did occur, they were assumed to be hardware failures
32Reliability vs. Safety
- Therac-25 ran for three years in production without a problem
- Tens of thousands of patients were treated before the first known overdose
- Reliability leads to complacency
- Reliability ≠ Safety
33Lack of Defensive Design
- Software was designed for small memory footprint
- Self-checks, error detection, error handling and auditing were left out
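A minimal sketch of what a self-check with an audit record could look like is given below. Everything in it is hypothetical (`beam_on`, `InterlockError`, the position names and the in-memory `audit_log`); it only shows the pattern of comparing sensed hardware state with the request, recording the result, and refusing to proceed on a mismatch.

```python
class InterlockError(Exception):
    """Raised when a pre-treatment self-check fails."""


audit_log = []   # a real device would use persistent, tamper-evident storage

# Hypothetical mapping from treatment mode to required turntable position.
EXPECTED_POSITION = {"electron": "electron", "xray": "xray"}


def beam_on(requested_mode, sensed_turntable_position):
    """Refuse to fire unless the sensed hardware state matches the request."""
    expected = EXPECTED_POSITION.get(requested_mode)
    ok = expected is not None and sensed_turntable_position == expected
    # Record every attempt, successful or not, before acting on it.
    audit_log.append(
        {"mode": requested_mode, "position": sensed_turntable_position, "ok": ok}
    )
    if not ok:
        raise InterlockError(
            f"turntable at {sensed_turntable_position!r}, "
            f"mode {requested_mode!r} requires {expected!r}"
        )
    return "beam on"
```

Calling `beam_on("xray", "field light")` raises `InterlockError` and still leaves an entry in `audit_log`, so the refused attempt can be investigated later.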
34Unrealistic risk assessment
- First Risk Assessment did not include software
- AECL claimed 5 orders of magnitude improvement from changing one microswitch
- Software is harder to assess for failures than hardware
35Inadequate Software Engineering Practices
- Software specification was after-the-fact
- Dangerous design/coding practices could have been avoided
- Audit trails should be built into the production software
- Software should be tested at the unit, module and system level
- Regression testing on all changes
- GUI should be designed, not just implemented
36Software Reuse
- Therac-25 used software from the Therac-20
- Reliability ≠ Safety
- Assumptions and preconditions may have changed
- Sometimes it's better to rewrite from scratch
37US Government Guidelines
- Significantly reduce the risk of death or injury
- Impose standards and best practices to raise the overall level of the industry
- Define minimum requirements for
- New products
- Derivative products
38Level of Concern
- Major: device directly affects the patient or operator, and failure could result in death or serious injury
- Moderate: device directly affects the patient, and failure could result in non-serious injury
- Minor: failures will not result in injury
39Levels of Concern
- Does the software
- Control life support device?
- Control delivery of harmful energy?
- Control treatment delivery?
- Provide diagnosis as basis for treatment?
- Monitor vital signs?
- If no to all these questions, then concern is
minor
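The screening questions above can be written as a small function. This is only a sketch of the rule as stated on the slide: the function name and question keys are hypothetical, and it does not try to separate moderate from major concern, since that depends on the severity definitions rather than on these yes/no questions.

```python
SCREENING_QUESTIONS = (
    "controls life support device",
    "controls delivery of harmful energy",
    "controls treatment delivery",
    "provides diagnosis as basis for treatment",
    "monitors vital signs",
)


def screen_level_of_concern(answers):
    """`answers` maps each screening question to True (yes) or False (no)."""
    missing = [q for q in SCREENING_QUESTIONS if q not in answers]
    if missing:
        # An unanswered question must not silently count as "no".
        raise ValueError(f"unanswered questions: {missing}")
    if not any(answers[q] for q in SCREENING_QUESTIONS):
        return "minor"
    return "moderate or major"


all_no = {q: False for q in SCREENING_QUESTIONS}
linac = dict(all_no, **{"controls delivery of harmful energy": True})
```

`screen_level_of_concern(all_no)` gives "minor"; a device like the one in `linac`, which controls delivery of harmful energy, does not.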
40Requirements for minor concern
- Software Description
- Device Hazard analysis
- Software functional Requirements Specification
- Architecture Design chart
- Validation, Verification and Testing
- Release Version Number
41Requirements for Moderate/Major concern
- Full Software Requirements Spec.
- Design Specification
- Traceability analysis
- Development lifecycle documentation
- Configuration management
- Maintenance activities
- Revision Level History
- Unresolved Anomalies (bugs)
42Software Requirements Spec
- Hardware requirements
- Programming languages
- Interface requirements
- Software functional requirements
- Software performance requirements
43Software Requirements Spec
- Algorithms for therapy, diagnosis, monitoring, alarms, analysis, interpretation (with supporting clinical data)
- Device limitation due to software
- Internal software tests and checks
- Error and interrupt handling
44Software Requirements Spec
- Fault detection, tolerance and recovery characteristics
- Safety requirements
- Timing and memory requirements
- Use of off-the-shelf software
45Risk/Hazard Analysis Tools
- Fault Tree Analysis (FTA): used in initial design phase
- Failure Modes Effect and Criticality Analysis (FMECA): used in design and development phase
- Failure Reporting and Corrective Action System (FRACAS): used during product lifecycle
46Fault Tree Analysis
- Identify a failure or safety hazard, then attempt to identify all possible ways to create that hazard
- Answers the question: How can event X occur?
- Used in Military and Nuclear Industry since the
1970s
47Fault Tree Analysis Example
Simplified fault tree diagram for an infusion pump
48Fault Tree Analysis
- Demonstrates that the system will not reach an unsafe state
- Identifies areas for improvement
- Provides a systematic hazard review
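A fault tree can be evaluated mechanically once it is written down. The sketch below uses a made-up infusion-pump tree (not the one from the original diagram) with AND/OR gates, and lists the basic events that cause the top event on their own. The tuple representation and all event names are assumptions for illustration.

```python
def occurs(node, active):
    """Return True if the event described by `node` occurs.

    A node is either a basic-event name (str) or a tuple
    ("AND" | "OR", child, child, ...). `active` is the set of basic
    events assumed to have happened.
    """
    if isinstance(node, str):
        return node in active
    gate, *children = node
    results = [occurs(child, active) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate {gate!r}")


def basic_events(node):
    """Collect the names of all basic events below `node`."""
    if isinstance(node, str):
        return {node}
    return set().union(*(basic_events(child) for child in node[1:]))


# Hypothetical tree for the top event "overdose delivered".
OVERDOSE = (
    "OR",
    ("AND", "flow sensor fails", "rate-limit check fails"),
    ("AND", "wrong rate entered", "no confirmation prompt"),
    "valve stuck open",
)

# Basic events that trigger the top event with no other failure present.
single_points = sorted(e for e in basic_events(OVERDOSE) if occurs(OVERDOSE, {e}))
```

In this made-up tree only "valve stuck open" is a single point of failure; the other branches need two failures at once, which is the kind of finding that points to areas for improvement.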
49FMEA
- Assume a basic defect at the component level, assess the effect and identify potential solutions
- Answers the question: What happens if event X occurs?
- Used in Automobile manufacturing
50FMEA Example
FAILURE MODE AND EFFECTS ANALYSIS (FMEA)
- Subsystem/Name: DC motor
- Model Year/Vehicle(s): 2000/DC motor
- P: Probability (chance) of Occurrence
- S: Seriousness of Failure to the Vehicle
- D: Likelihood that the Defect will Reach the Customer
- R: Risk Priority Measure (P x S x D)
- Rating scale: 1 = very low or none, 2 = low or minor, 3 = moderate or significant, 4 = high, 5 = very high or catastrophic
- Columns: No., Part Name, Part No., Function, Failure Mode, Mechanism(s)/Cause(s) of Failure, Effect(s) of Failure, P, S, D, R, Recommended Corrective Action(s), Action(s) Taken
- No. 1, Part Name: Position Controller, Function: Receive a demand position
- Failure mode: Loose cable connection. Cause: Wear and tear. Effect: Motor fails to move. P = 2, S = 4, D = 1, R = 8. Corrective action: Replace faulty wire. Q.C checked.
- Failure mode: Incorrect demand signal. Cause: Operator error. Effect: Position controller breakdown in the long run. P = 4, S = 4, D = 3, R = 48. Corrective action: Intensive training for operators.
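The risk priority measure in the example is the product R = P x S x D. The sketch below recomputes it for the two failure modes of the position controller; pairing the ratings 2, 4, 1 and 4, 4, 3 with the two failure modes is an assumption read off the flattened worksheet, and the function and field names are hypothetical.

```python
def risk_priority(p, s, d):
    """R = P x S x D, each rated from 1 (very low) to 5 (very high)."""
    for name, value in (("P", p), ("S", s), ("D", d)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return p * s * d


failure_modes = [
    {"mode": "Loose cable connection", "P": 2, "S": 4, "D": 1},
    {"mode": "Incorrect demand signal", "P": 4, "S": 4, "D": 3},
]

# Highest risk first, so corrective actions can be prioritised.
ranked = sorted(
    failure_modes,
    key=lambda m: risk_priority(m["P"], m["S"], m["D"]),
    reverse=True,
)
```

The products come out as 8 and 48, matching the R column of the worksheet, so the incorrect demand signal ranks first.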
51FMEA
- Reveals unforeseen hazards
- Does not consider multiple failures
- Can be very time consuming
52FRACAS
- A process for tracking system reliability/safety
- A set of procedures, policies and software tools
- Used from beginning to end of the product
lifecycle
53FRACAS
- Record events
- Analyze failure modes
- Verify corrective actions
- Identify failure trends
- Determine the failure contribution of individual
parts
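The record-keeping side of FRACAS can be sketched with a simple event log. The event fields, part names and function names below are invented for illustration; a real system would add dates, severity and the corrective action itself.

```python
from collections import Counter


def failure_contribution(events):
    """Return each part's share of all recorded failures, largest first."""
    counts = Counter(event["part"] for event in events)
    total = sum(counts.values())
    return [(part, n / total) for part, n in counts.most_common()]


def open_actions(events):
    """Events whose corrective action has not been verified yet."""
    return [e for e in events if not e.get("verified", False)]


# Hypothetical failure log.
events = [
    {"part": "turntable switch", "mode": "stuck", "verified": True},
    {"part": "turntable switch", "mode": "stuck", "verified": False},
    {"part": "display", "mode": "blank", "verified": True},
    {"part": "turntable switch", "mode": "intermittent", "verified": True},
]

shares = failure_contribution(events)
pending = open_actions(events)
```

In this log the turntable switch accounts for three of the four recorded failures, and one corrective action is still waiting for verification.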
54Risk/Hazard Analysis Tools
- Fault Tree Analysis (FTA)
- Failure Modes Effect and Criticality Analysis (FMECA)
- Failure Reporting and Corrective Action System (FRACAS)
- Failures occur; the question is "Did we prepare for them?"
55Conclusion
- People depend on medical software with their lives!
- Safety is part of the design and development process
56Bibliography
- Nancy Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995
- ComputingCases.org, Therac-25 case study
- Guidance for the Content of Premarket Submissions for Software Contained in Medical Devices, http://www.fda.gov/cdrh/ode/software.pdf
- Laura M. Ippolito, Dolores R. Wallace, A Study on Hazard Analysis in High Integrity Software Standards and Guidelines, NISTIR 5589, http://hissa.nist.gov/HHRFdata/Artifacts/ITLdoc/5589/hazard.html
- Daniel Kamm, P.E., C.Q.A., An Introduction to Risk/Hazard Analysis for Medical Devices, http://www.fda-consultant.com/risk1.pdf