Title: 332:437 Lecture 2 Fault Tolerance Examples
1. 332:437 Lecture 2: Fault Tolerance Examples
- Active Redundancy Techniques
- Hardware for Active Redundancy Systems
- Fault Tolerance Applications
- Lucent Technologies 5 ESS
- NASA Space Shuttle
- Galileo Interplanetary Probe
- Hardware Design Methodology
- System Partitioning
- Summary
Material from Design and Analysis of Fault-Tolerant Digital Systems, by Barry W. Johnson, Addison-Wesley.
2. Redundancy Techniques
- Active: let the error happen and disrupt the system
  - Detect the error with test hardware
  - Reconfigure the system and restart
- Example: communications satellite
  - TMR is too expensive
- Duplication with Comparison
  - Problem: even in a fault-free system, digital words may not agree
  - Solution: ignore the K least significant bits
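The "ignore the K least significant bits" comparison can be sketched in Python as follows. The function name `words_agree` and the sample words are illustrative assumptions, not taken from the lecture. Dropping low bits only approximates a tolerance band: two values that straddle a 2^K boundary (such as 7 and 8 with K = 2) still compare as different.

```python
def words_agree(a: int, b: int, k: int) -> bool:
    """Compare two non-negative digital words, ignoring the k least significant bits."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Shifting right by k discards the k low-order bits of each word.
    return (a >> k) == (b >> k)


# Two readings that differ only in the two low-order bits are treated as equal.
low_bit_noise = words_agree(0b10110100, 0b10110111, k=2)    # True
# A difference in the upper bits is still reported as a miscompare.
real_mismatch = words_agree(0b10110100, 0b10111000, k=2)    # False
```

Choosing K is a trade-off: a larger K hides more benign disagreement but also hides small genuine errors.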
3. Active Fault Tolerance
4. Active Redundancy Techniques
- Standby Replacement/Sparing
  - Hot standby sparing: process control
  - Cold standby sparing: satellite
    - Needs time to power up and initialize the spare
- Pair-and-a-Spare
  - Duplication with comparison plus standby sparing
  - Uses comparator error information
  - Disconnects the broken module and inserts the spare
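A minimal sketch of pair-and-a-spare, assuming modules are plain Python callables. The comparator only reports that the pair disagrees, so this sketch locates the broken module with a known-answer self-test. That self-test, the class name `PairAndSpare`, and the toy modules are assumptions made for illustration, not details from the lecture.

```python
from typing import Callable, List

Module = Callable[[int], int]


class PairAndSpare:
    """Two active modules checked by a comparator, backed by standby spares."""

    def __init__(self, active: List[Module], spares: List[Module],
                 test_input: int, test_output: int) -> None:
        if len(active) != 2:
            raise ValueError("exactly two active modules are required")
        self.active = list(active)
        self.spares = list(spares)
        # Known-answer test used to locate the faulty module (an assumption).
        self.test_input = test_input
        self.test_output = test_output

    def _passes_self_test(self, module: Module) -> bool:
        return module(self.test_input) == self.test_output

    def step(self, x: int) -> int:
        """Return the pair's output, swapping in spares after a miscompare."""
        while True:
            out0, out1 = self.active[0](x), self.active[1](x)
            if out0 == out1:  # comparator sees no error
                return out0
            # Comparator flagged an error: disconnect each module that fails
            # its self-test and insert a spare in its place.
            replaced = False
            for i, module in enumerate(self.active):
                if not self._passes_self_test(module):
                    if not self.spares:
                        raise RuntimeError("no spares left")
                    self.active[i] = self.spares.pop(0)
                    replaced = True
            if not replaced:
                raise RuntimeError("miscompare, but no module failed its self-test")


def good(x: int) -> int:
    return x + 1


def stuck_at_zero(x: int) -> int:
    return 0  # models a module whose output is stuck at 0


system = PairAndSpare([good, stuck_at_zero], [good], test_input=1, test_output=2)
result = system.step(5)  # the stuck module is swapped out, then the pair agrees
```

With a cold spare, the swap inside `step` would also have to wait for the spare to power up and initialize, which this sketch does not model.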
5. Redundancy Techniques
- Hybrid: active plus passive
  - N-modular redundancy with spares
  - Triple-duplex
- Hardware cost increases from active to passive to hybrid
6. N-Modular Redundancy with Spares -- Hybrid
7. N-Modular Redundancy with Spares
[Figure: block diagram; surviving label: System Inputs]
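N-modular redundancy with spares can be sketched as a majority voter plus a replacement step. This is a simplified illustration, assuming modules are Python callables and that any module disagreeing with the majority is swapped for a spare straight away. The class name `NMRWithSpares` and the toy modules are hypothetical.

```python
from collections import Counter
from typing import Callable, List

Module = Callable[[int], int]


class NMRWithSpares:
    """Majority voting over the active modules, with spares replacing dissenters."""

    def __init__(self, active: List[Module], spares: List[Module]) -> None:
        if len(active) < 3:
            raise ValueError("at least three active modules are needed to vote")
        self.active = list(active)
        self.spares = list(spares)

    def step(self, x: int) -> int:
        outputs = [module(x) for module in self.active]
        value, count = Counter(outputs).most_common(1)[0]
        if 2 * count <= len(outputs):
            raise RuntimeError("no majority among the active modules")
        # Passive part: the vote masks the fault.
        # Active part: each dissenting module is replaced while spares remain.
        for i, out in enumerate(outputs):
            if out != value and self.spares:
                self.active[i] = self.spares.pop(0)
        return value


def doubler(x: int) -> int:
    return 2 * x


def broken(x: int) -> int:
    return -1  # models a faulty module


nmr = NMRWithSpares([doubler, broken, doubler], spares=[doubler])
voted = nmr.step(3)  # the fault is masked and the broken module is replaced
```

Once the spares run out, a dissenting module simply stays in place and the system degrades to plain N-modular voting.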
8. Software Implementation of Duplication with Comparison
9. Triple-Duplex Redundancy
10. Switch in Self-Purging System
11. Switch for Self-Purging Full Adder
12. Disagreement Detector Circuit
13. Sift-Out Modular Redundancy Unit
- Collector combines the module outputs to produce the system output
- Contributions from faulty modules are ignored
14. Hardware to Identify Faulty Modules
- Disagreement signals drive JK flip-flops
- Disagreements identify faulty modules
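The detector and collector behaviour described on the last two slides can be sketched in software. The latched flags stand in for the JK flip-flops: once set, a flag is never cleared. The decision rule used here (flag a module that disagrees with every other unflagged module) and the class name `SiftOutUnit` are assumptions for illustration, not a transcription of the circuit.

```python
from typing import List, Sequence


class SiftOutUnit:
    """Sift-out sketch: pairwise disagreement latches a per-module fault flag."""

    def __init__(self, n_modules: int) -> None:
        # One latched flag per module, playing the role of a JK flip-flop.
        self.faulty: List[bool] = [False] * n_modules

    def update(self, outputs: Sequence[int]) -> int:
        live = [i for i, bad in enumerate(self.faulty) if not bad]
        # Detector: with three or more live modules, a module that disagrees
        # with every other live module is identified as faulty.
        if len(live) >= 3:
            newly_faulty = [
                i for i in live
                if all(outputs[i] != outputs[j] for j in live if j != i)
            ]
            for i in newly_faulty:
                self.faulty[i] = True
        # Collector: only unflagged modules contribute to the system output.
        live = [i for i, bad in enumerate(self.faulty) if not bad]
        if not live:
            raise RuntimeError("all modules have been sifted out")
        if any(outputs[i] != outputs[live[0]] for i in live):
            raise RuntimeError("disagreement that cannot be attributed to one module")
        return outputs[live[0]]


unit = SiftOutUnit(4)
first = unit.update([5, 5, 9, 5])    # module 2 disagrees with everyone
second = unit.update([7, 7, 0, 7])   # module 2 is ignored from now on
third = unit.update([3, 8, 3, 3])    # module 1 is sifted out as well
```

When only two modules remain, a disagreement can be detected but not attributed, so the sketch raises an error instead of guessing.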
15. Triple-Duplex Architecture
- Triplication for fault masking
- Duplication with comparison detects faults
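A rough sketch of how the two mechanisms combine: each of the three channels is a duplex pair with its own comparator, and only pairs that pass their comparison take part in the vote. The function name `triple_duplex_output` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def triple_duplex_output(pairs: Sequence[Tuple[int, int]]) -> int:
    """Vote over three duplex pairs, dropping any pair whose halves disagree."""
    if len(pairs) != 3:
        raise ValueError("triple-duplex needs exactly three duplex pairs")
    # Duplication with comparison: a miscomparing pair is detected and removed.
    valid = [a for a, b in pairs if a == b]
    if not valid:
        raise RuntimeError("every duplex pair reported a miscompare")
    # Triplication: majority vote over the remaining pairs masks other faults.
    value, count = Counter(valid).most_common(1)[0]
    if 2 * count <= len(valid):
        raise RuntimeError("remaining pairs disagree and no majority exists")
    return value


detected = triple_duplex_output([(4, 4), (4, 7), (4, 4)])    # pair 1 is dropped
masked = triple_duplex_output([(4, 4), (9, 9), (4, 4)])      # pair 1 is outvoted
one_left = triple_duplex_output([(4, 4), (1, 2), (3, 5)])    # only pair 0 survives
```

Because faulty pairs remove themselves from the vote, the output stays usable even when a single pair is left, which plain TMR cannot offer.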
16. Sift-Out Modular Redundancy
- Collector sifts out defective modules
17. Case Studies
- Lucent Technologies 5 ESS
  - 100% duplication of hardware
- ESS System Design
  - Time-Space-Time (TST) switch
  - Signals translated to pulse-code-modulated (PCM) signals
  - Information routed over lines:
    - Interchange time slots (time)
    - Switch buses (space)
    - Interchange time slots (time)
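The three stages of a time-space-time switch can be illustrated with a toy model. This is a simplified sketch, not the 5 ESS design: each bus carries a frame of string-labelled samples, and the slot and bus maps are hypothetical connection tables.

```python
from typing import List, Sequence

Frame = List[str]  # one PCM sample per time slot, labelled for readability


def time_slot_interchange(frame: Sequence[str], slot_map: Sequence[int]) -> Frame:
    """Time stage: output slot t carries the sample from input slot slot_map[t]."""
    return [frame[src] for src in slot_map]


def space_switch(frames: Sequence[Sequence[str]],
                 bus_map: Sequence[Sequence[int]]) -> List[Frame]:
    """Space stage: during slot t, output bus b is fed by input bus bus_map[t][b]."""
    n_slots = len(frames[0])
    return [
        [frames[bus_map[t][b]][t] for t in range(n_slots)]
        for b in range(len(frames))
    ]


def tst_switch(frames: Sequence[Sequence[str]],
               in_maps: Sequence[Sequence[int]],
               bus_map: Sequence[Sequence[int]],
               out_maps: Sequence[Sequence[int]]) -> List[Frame]:
    """Chain the time, space, and time stages."""
    stage1 = [time_slot_interchange(f, m) for f, m in zip(frames, in_maps)]
    stage2 = space_switch(stage1, bus_map)
    return [time_slot_interchange(f, m) for f, m in zip(stage2, out_maps)]


# Route sample "A0" (bus 0, slot 0) to bus 1, slot 1.
frames = [["A0", "A1"], ["B0", "B1"]]
in_maps = [[0, 1], [0, 1]]       # first time stage leaves the slots alone
bus_map = [[1, 0], [0, 1]]       # buses are crossed during slot 0 only
out_maps = [[0, 1], [1, 0]]      # second time stage swaps the slots on bus 1
routed = tst_switch(frames, in_maps, bus_map, out_maps)
# routed == [["B0", "A1"], ["B1", "A0"]]
```

The space stage can only move a sample between buses within one slot, so the two time stages are what allow a call to enter and leave in different slots.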
18. Installed AT&T Electronic Switching Systems
19. Probability of Operational Outage Due to Various Causes
20. 5 ESS Switching Block Diagram
21. 5 ESS
- Uses Duplication-with-Comparison
- Assume perfect switchover to working computer
- MTTF (Mean Time to Failure) = $\frac{\mu}{2\lambda^2}$
- $\lambda$ = computer failure rate
- $\mu$ = repair rate
- Handles 65,000 metropolitan phone lines
- Also uses Information Redundancy
- Processor 1 too unreliable (1 ESS)
- Broken into 6 subsystems to improve reliability
- Multiplied MTTF by 6
- PU, CC, CS, PS, CSB, PSB subsystems
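The duplex MTTF relation on this slide, read as MTTF = μ / (2λ²), can be checked numerically. The sketch below compares it with (3λ + μ) / (2λ²), the result of a standard three-state Markov model of two repairable units with perfect switchover. That model and the sample rates are assumptions for illustration, not figures from the lecture.

```python
def duplex_mttf_approx(failure_rate: float, repair_rate: float) -> float:
    """Slide formula: MTTF = mu / (2 * lambda^2), valid when mu >> lambda."""
    return repair_rate / (2.0 * failure_rate ** 2)


def duplex_mttf_exact(failure_rate: float, repair_rate: float) -> float:
    """Markov-model result: MTTF = (3 * lambda + mu) / (2 * lambda^2)."""
    return (3.0 * failure_rate + repair_rate) / (2.0 * failure_rate ** 2)


# Hypothetical rates: one failure per 1000 hours, four-hour average repair.
lam = 1.0e-3   # failures per hour
mu = 0.25      # repairs per hour

approx_hours = duplex_mttf_approx(lam, mu)
exact_hours = duplex_mttf_exact(lam, mu)
simplex_hours = 1.0 / lam  # MTTF of a single unrepaired computer
```

With these rates the approximation is about 125,000 hours against 1,000 hours for a single computer, and it differs from the Markov result by roughly one percent.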
22. Duplex Configuration for Switch
[Figure: duplex switch configuration with duplicated units PU0/PU1, PSB0/PSB1, CSB0/CSB1]
23. NASA Space Shuttle
24. Space Shuttle Computer
- Tasks
  - Guidance
  - Navigation
  - Pre-flight checkout
- Software voting in a 5-computer complex
  - Uses 4 computers as a redundant set during critical mission phases
  - 5th computer does non-critical tasks and acts as a backup
25. Voting Method
- Vote on the control outputs of the 4 computers at the control actuators
- Each computer compares the outputs of the 3 others to its own
  - If there is a disagreement, signal the disagreeing computer
- Each computer votes on the disagreement signals
  - If defective, it removes itself from service
- Tolerates up to 2 computer failures
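The compare, signal, and self-removal steps above can be sketched as one voting round. This is an illustrative model only: the computer names, the rule that a computer removes itself when more than half of its peers signal a disagreement, and the function name `self_removals` are assumptions.

```python
from typing import Dict, Set


def self_removals(outputs: Dict[str, int]) -> Set[str]:
    """Return the computers that remove themselves after one voting round."""
    if len(outputs) < 3:
        # With two computers a disagreement cannot be attributed by voting.
        raise ValueError("voting needs at least three computers in the set")
    received = {name: 0 for name in outputs}
    # Each computer compares every peer's output with its own and sends a
    # disagreement signal to each peer whose output differs.
    for sender, own in outputs.items():
        for target, other in outputs.items():
            if target != sender and other != own:
                received[target] += 1
    peers = len(outputs) - 1
    # Each computer votes on the signals it received and removes itself
    # when a majority of its peers disagree with it.
    return {name for name, votes in received.items() if 2 * votes > peers}


first_failure = self_removals({"C1": 10, "C2": 10, "C3": 99, "C4": 10})
second_failure = self_removals({"C1": 10, "C2": 7, "C4": 10})
no_failure = self_removals({"C1": 10, "C2": 10, "C4": 10})
```

In the first round C3 receives three disagreement signals and each healthy computer receives one, so only C3 removes itself. The same rule still isolates a second failure among the remaining three computers.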
26. Reconfiguration After 2nd Failure
- Converts to a duplex computer system
- Can survive one more failure because of comparison and self-tests
- 2 vendors to minimize the chance of a common software error
  - Primary software: IBM
  - Backup software: Rockwell
27. Spacecraft Systems
- Sub-systems
- Propulsion
- Power
- Data Communications
- Attitude Control
- Command, Control, Payload
28. Fault Tolerance Maintenance Procedures
- When a failure is detected, enter safe/hold mode
  - Shed all non-essential power loads
  - Stop mission sequencing and solar array tracking
  - Orient for maximum solar power
- Ground personnel diagnose the failure from prior outputs of the 5 subsystems
- Select a spacecraft system reconfiguration
- Send workaround commands to the spacecraft
29. Fault Detection Mechanisms
- Self-tests of sub-systems
- Cross-checking between duplicated sensors
- Ground-initiated tests to diagnose/isolate failures
- Ground trend analysis to find degraded / worn-out units
30. NASA Long-Life Galileo Jupiter Fly-By Mission
- 19 8-bit microprocessors, 320 Kbyte ROM
- Uses block redundancy
- Command and Data Subsystem (CDS)
  - Active redundancy: each block can issue independent commands, or both blocks work in parallel on a critical activity
- All other systems: active/standby pair
- Few hardware fault detection mechanisms
- Harsh Jupiter environment
  - Radiation
  - Electrostatic discharge
31. Galileo Orbiter Block Diagram: Active/Standby Pair
32. Galileo Error Detection Mechanisms
- Test event durations (watchdog timer), data transfers, and parity/checksums on messages
- Unexpected command codes
- Check for loss of heartbeat between AACS and CDS (watchdog timer)
- Mixture of spinning and non-spinning scientific experiments: check for spin rate above/below set values
- Loss of Sun/star identification: no pulse from the acquisition sensor
- Too large an error between the control setting for a sub-module and its response
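Three of the checks listed above (heartbeat watchdog, message parity, and spin-rate limits) can be sketched in a few lines. This is a generic illustration, not Galileo flight logic: the class and function names, the timeout, and the limit values are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatWatchdog:
    """Declares a fault when no heartbeat arrives within timeout_s seconds."""

    def __init__(self, timeout_s: float, now_s: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_beat_s = now_s

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received at time now_s."""
        self.last_beat_s = now_s

    def expired(self, now_s: float) -> bool:
        """True when the time since the last heartbeat exceeds the timeout."""
        return now_s - self.last_beat_s > self.timeout_s


def parity_ok(word: int, parity_bit: int) -> bool:
    """Even-parity check: the data bits plus the parity bit hold an even number of 1s."""
    return (bin(word).count("1") + parity_bit) % 2 == 0


def spin_rate_ok(rate_rpm: float, low_rpm: float, high_rpm: float) -> bool:
    """Range check: the measured spin rate must lie within the set limits."""
    return low_rpm <= rate_rpm <= high_rpm


watchdog = HeartbeatWatchdog(timeout_s=2.0)
watchdog.beat(now_s=1.0)
alive = not watchdog.expired(now_s=2.5)   # 1.5 s since the last beat
lost = watchdog.expired(now_s=3.5)        # 2.5 s since the last beat
```

Each check is cheap to evaluate, which suits a design that has few hardware fault detection mechanisms and relies on software monitoring instead.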
33. Summary
- Active Redundancy Techniques
- Hardware for Active Redundancy Systems
- Fault Tolerance Applications
- Lucent Technologies 5 ESS
- NASA Space Shuttle
- Galileo Interplanetary Probe
- Hardware Design Methodology
- System Partitioning