Title: Fault-Tolerant Design for Long-Life Deep Space Missions
1 Fault-Tolerant Design for Long-Life Deep Space Missions
2 Contents
- Introduction
- Fault-Tolerant System Considerations and Techniques
- Historical Perspective
- Future Approach
- Conclusion
3 Introduction
- Mars has recently become a focus of attention because it will play a key role in humanity's expansion into deep space
- Future Mars transportation will require reliable operation over a lifespan of years, unlike
  - the Space Shuttle, which requires operation over months
  - the Space Station, which is close enough to Earth for maintenance logistics
4 Introduction
- The long operating periods of deep space missions demand
  - Innovative fault-tolerant technology development
  - Application of advanced redundancy techniques
- To enable Mars exploration, safety, reliability and autonomy must be improved
  - A new technology plan is needed to guide the development of next-generation fault-tolerant computing technology
5 Fault-Tolerant System Considerations
- Traditionally, avionic systems have achieved fault tolerance through redundancy management
- A redundancy management technique
  - Detects and isolates a failure
  - Performs hardware reconfiguration
- A combination of self-monitoring and cross-comparison strategies leads to comprehensive fault coverage at reduced risk and cost
6 Fault-Tolerant System Considerations
- Primary Flight Control System (PFCS) Baseline Requirements
  - Mission reliability: 0.95 success probability at 10 years with no repair
  - Throughput: 100 million instructions per second (MIPS)
  - Expandable I/O: 100 Mbit/s
  - Expandable memory: 1 GByte
  - Mass storage capacity: 1 Terabyte
  - Cycle rate: 100 Hz
  - Hardware N-fail operation
  - Low life-cycle cost
  - Low power and mass
  - Radiation tolerance
  - Building-block approach (look for existing solutions to parts of the problem and combine them)
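The reliability and timing requirements above imply a few derived numbers. The sketch below is illustrative only: it assumes a constant failure rate (exponential model, R(t) = exp(-rate * t)), which the slides do not state, and the function names are hypothetical.

```python
import math


def max_failure_rate(reliability: float, years: float) -> float:
    """Largest constant failure rate (per year) that still meets the
    reliability target, ASSUMING R(t) = exp(-rate * t)."""
    return -math.log(reliability) / years


def instructions_per_frame(instr_per_sec: float, cycle_rate_hz: float) -> float:
    """Instruction budget available within one control cycle."""
    return instr_per_sec / cycle_rate_hz


if __name__ == "__main__":
    rate = max_failure_rate(0.95, 10)
    print(f"max system failure rate: {rate:.5f} per year")
    frame = instructions_per_frame(100e6, 100)
    print(f"budget per 100 Hz frame: {frame:.0f} instructions")
```

Under the constant-rate assumption, the 0.95 target corresponds to roughly 0.0051 failures per year (a mean time to failure of about 195 years), and a 100 Hz cycle at 100 MIPS leaves one million instructions per frame.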
7 Fault-Tolerant Techniques for Mars Applications
- Ultra-reliable systems for long-life applications such as human Mars exploration are required to tolerate
  - Permanent faults
  - Transient (temporary) faults
  - Intermittent (not continuous) faults
  - Timing faults
  - Latent (hidden) faults
  - Worst-case fault scenarios with a lower probability of occurrence
8 Fault-Tolerant Techniques for Mars Applications
- Distributed architectures are better suited to long-life space applications, offering
  - Function integration
  - Parallel computation
  - Graceful performance growth
  - Selective technology upgrade
  - Appropriate levels of function reliability
  - Graceful degradation of system capabilities in the presence of faults
  - Efficient use of hardware resources
9 Historical Perspective
- Long-Life Unmanned Redundant Systems
  - Viking
  - Voyager
  - Galileo
10 Historical Perspective
- Safety-Critical High-Reliability Systems
  - Space Shuttle orbiters: Columbia, Challenger, Discovery, Atlantis, Endeavour
11 Long-Life Unmanned Redundant Systems: Viking
- Viking is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept
- It was the first spacecraft to use a computer as a fault manager that attempts to reconfigure the spacecraft and restore it to an operational configuration
- The fundamental strategy was to switch power on and off to various alternative subsystems until either the built-in fault monitoring indicated that operation was restored or, for faults in the communication chain, commands from Earth were detected
- There was no real-time masking of faults, so a fault occurring during a maneuver would cause an incorrect maneuver to be performed
Figure: Viking Fault-Tolerant Architecture (CCS: Command Computer Subsystem; FDS: Flight Data Subsystem)
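The power-switching strategy described above can be sketched as a simple search over alternative units. This is a minimal illustration, not Viking flight logic: the function, the callback parameters and the unit names are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def restore_operation(
    alternatives: Sequence[str],
    power_on: Callable[[str], None],
    power_off: Callable[[str], None],
    operation_restored: Callable[[], bool],
) -> Optional[str]:
    """Power each alternative unit in turn until the monitor reports success.

    Returns the name of the unit that restored operation, or None if every
    alternative was tried without success.
    """
    for unit in alternatives:
        power_on(unit)
        if operation_restored():
            return unit
        power_off(unit)  # this unit did not help; try the next one
    return None


if __name__ == "__main__":
    powered = set()
    healthy = {"receiver_B"}  # hypothetical: only the backup unit works
    chosen = restore_operation(
        ["receiver_A", "receiver_B"],
        power_on=powered.add,
        power_off=powered.discard,
        operation_restored=lambda: bool(powered & healthy),
    )
    print(chosen)  # receiver_B
```

The loop only restores operation after the fact; nothing masks the fault while the search is running, which is the limitation the last bullet points out.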
12 Long-Life Unmanned Redundant Systems: Voyager
- Like Viking, Voyager is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept
- It improves on Viking only in limited ways, such as the addition of a pair of separate computers for attitude and articulation control
- Both spacecraft used standby redundancy. The standby spares were cross-strapped so that either unit could be switched in to communicate with the other units
- Cross-strapping and switching allowed reconfiguration around failed components, either automatically or by ground command
Figure: Voyager Fault-Tolerant Architecture (CCS: Command Computer Subsystem; FDS: Flight Data Subsystem; AACS: Attitude and Articulation Control Subsystem)
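The benefit of cross-strapping can be illustrated with a small reliability comparison. The sketch assumes three dual-redundant subsystems (as in the figure legend), independent unit failures, perfect switching and a made-up per-unit reliability; none of these numbers come from the slides.

```python
def string_redundant(r: float, stages: int = 3) -> float:
    """Two independent strings; a string works only if all its units work."""
    string_ok = r ** stages
    return 1 - (1 - string_ok) ** 2


def cross_strapped(r: float, stages: int = 3) -> float:
    """Each stage has two units; any working unit of a stage can be used."""
    stage_ok = 1 - (1 - r) ** 2
    return stage_ok ** stages


if __name__ == "__main__":
    r = 0.9  # hypothetical per-unit reliability
    print(f"two separate strings: {string_redundant(r):.4f}")
    print(f"cross-strapped pairs: {cross_strapped(r):.4f}")
```

With the assumed r = 0.9 this gives about 0.927 for two separate strings and about 0.970 for cross-strapped pairs, because a failure in one subsystem no longer takes its whole string out of service.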
13 Long-Life Unmanned Redundant Systems: Galileo
- The Galileo mission is a follow-on to the Voyager Jupiter fly-by mission
- The Galileo design borrows heavily from the Voyager experience
- Block redundancy (each subsystem is duplicated as a whole block, so that a spare block can take over from a failed one) is used throughout the subsystems
- All subsystems except the CDS operate as active/standby pairs
- The CDS uses active redundancy: each block can issue independent commands, or the blocks can operate in parallel on the same critical activity
Figure: Galileo Fault-Tolerant Architecture (CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)
14 Long-Life Unmanned Redundant Systems: Galileo
- The major departure from the Voyager architecture is the extensive use of microprocessors and the consequent use of a bus-oriented architecture to facilitate communication among them
- Galileo's on-board fault detection software is designed to alleviate the effects and symptoms of faults rather than to pinpoint the exact faults
- Fault identification and isolation are performed through ground intervention
Figure: Galileo Fault-Tolerant Architecture (CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)
15 Safety-Critical High-Reliability Systems: Shuttles
- Operational differences from planetary probes
  - The goal is to be absolutely certain that no fault propagates to the effectors during a relatively short operation cycle
  - Rather than relying on fault monitors to interrupt processing and trigger a reconfiguration, several redundant strings are powered on and operated in parallel
16 Safety-Critical High-Reliability Systems: Shuttles
- Voting occurs both in the General Purpose Computers (GPCs) and at the final effectors
- Voting is much more brute-force than fault monitoring, requiring more hardware but also providing greater fault coverage
- It is much better suited to real-time safety-critical maneuver control than a reconfiguration-oriented strategy as in Viking, Voyager and Galileo
Figure: Conceptual Shuttle Orbiter Fault-Tolerant Architecture (GPC: General Purpose Computer)
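Voting across redundant channels can be sketched as an exact-match majority vote. This is a generic illustration rather than the Shuttle's actual voter: it assumes the channels produce bit-identical outputs when healthy, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(
    outputs: Sequence[Hashable],
) -> Tuple[Optional[Hashable], List[int]]:
    """Return (voted value, indices of channels that disagree with it).

    The voted value is None when no value is held by a strict majority;
    in that case every channel is reported as suspect.
    """
    if not outputs:
        return None, []
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


if __name__ == "__main__":
    print(majority_vote([42, 42, 17, 42]))  # (42, [2])
```

The faulty channel's output never reaches the consumer of the voted value, which is the real-time masking that a reconfiguration-only strategy lacks; the price is that every channel must run the computation.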
17 Mars Advanced Fault-Tolerant Computing Approach: Future Manned Mars Missions
- Parallel-hybrid redundancy will be the basis for future long-life deep space missions
- It combines the attractive features of parallel processing and redundant computation
- Computational elements can be arranged to provide high throughput, ultra-high reliability, or a combination of the two, depending on the mission phase
18 Mars Advanced Fault-Tolerant Computing Approach: Future Manned Mars Missions
- Parallel-hybrid redundancy was first used in 1979, when the Fault Tolerant Multi-Processor (FTMP) was designed and built
- FTMP used a conventional shared-memory multiprocessor architecture
- Each virtual processor consisted of three real processors working as a triad to provide real-time fault masking
- Upon detection of a fault in a processor, the faulty unit is replaced from a pool of spares
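The triad-plus-spares idea can be sketched as follows. This is a simplified model, not FTMP's design: processors are modelled as plain functions, agreement is exact-match, and the class and attribute names are hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical model: a processor is a function from an input to an output.
Processor = Callable[[int], int]


class Triad:
    """Three processors voted as one virtual processor, backed by spares."""

    def __init__(self, members: List[Processor], spares: List[Processor]):
        if len(members) != 3:
            raise ValueError("a triad needs exactly three members")
        self.members = list(members)
        self.spares = list(spares)

    def run(self, x: int) -> int:
        outputs = [p(x) for p in self.members]
        value, count = Counter(outputs).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: more than one member failed")
        # The vote masks the fault; now swap out any dissenting member.
        for i, out in enumerate(outputs):
            if out != value and self.spares:
                self.members[i] = self.spares.pop(0)
        return value


if __name__ == "__main__":
    good = lambda x: x * 2
    bad = lambda x: -1  # stands in for a failed processor
    triad = Triad([good, bad, good], spares=[good])
    print(triad.run(5))        # 10, the bad output is outvoted
    print(len(triad.spares))   # 0, the spare replaced the bad member
```

Voting gives the immediate masking, while the spare pool restores the triad to full strength so that a later second fault can also be masked.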
19 Mars Advanced Fault-Tolerant Computing Approach: Future Manned Mars Missions
- FTMP, the first parallel-hybrid redundant design, had certain drawbacks
  - It was not explicitly designed to meet the rigorous requirements of Byzantine resilience (the correctly functioning components of a Byzantine fault-tolerant system reach the same group decisions regardless of what the Byzantine-faulty components do), which is necessary to provide
    - Coverage of random hardware faults
    - Ultra-high reliability
    - Ease of validation
  - It lacked ease of expandability because of the redundant bus connections between processors and main memory
  - It did not support mixed redundancy, because processors are arranged to work in triads regardless of the criticality of the application
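One rigorous requirement behind Byzantine resilience can be stated numerically. The bound below is the classic result from the Byzantine agreement literature (Lamport, Shostak and Pease) for unsigned messages; it is not stated on the slides, and the function names are hypothetical.

```python
def min_participants(f: int) -> int:
    """Minimum number of participants needed to tolerate f Byzantine
    faults when messages are unsigned: 3f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_byzantine_faults(n: int) -> int:
    """Largest f such that n participants satisfy n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    print(min_participants(1))      # 4
    print(max_byzantine_faults(3))  # 0
```

Under this bound a plain triad cannot guarantee agreement in the presence of even one arbitrarily misbehaving member, which is one way to read the first drawback listed above.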
20 Mars Advanced Fault-Tolerant Computing Approach: Future Manned Mars Missions
- To remedy the deficiencies of FTMP, a new architecture called the Fault Tolerant Parallel Processor (FTPP) was conceived
- It meets all the requirements for covering random hardware faults
- FTPP will be the basis of fault tolerance for future manned Mars missions
Figure: FTPP Architecture
21 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Parallel Processing)
- Parallel processing is provided by
  - 40 Processing Elements (PEs) in 5 Fault Containment Regions (FCRs)
  - 2 Input/Output Controllers (IOCs) per FCR
Figure: FTPP Architecture
22 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Scalable Performance)
- Increasing the number of PEs in a single cluster creates a communication bottleneck in the Network Elements (NEs)
- FTPP relies on a hierarchical approach to scaling performance, assembling clusters via IOCs
Figure: FTPP Architecture
23 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Mixed Redundancy)
- Most fault-tolerant computers are designed to operate in a redundant mode only, which wastes resources on non-critical tasks
- FTPP allows the processing elements to be configured as
  - Simplex: non-critical tasks
  - Triplex: tasks that require real-time fault masking
  - Quadruplex or higher: when two or more sequential faults must be tolerated in a small time window without the benefit of reconfiguration
- In the figure
  - 4 quads
  - 3 triplexes
  - 15 simplexes
Figure: FTPP Architecture
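The configuration in the figure can be checked against the 40 PEs and 5 FCRs quoted earlier. The sketch below is simple bookkeeping with hypothetical names; the even spread of PEs across FCRs is an assumption.

```python
# Number of PEs consumed by one virtual group of each redundancy level.
GROUP_SIZES = {"simplex": 1, "triplex": 3, "quadruplex": 4}

TOTAL_PES = 40  # from the Parallel Processing slide
NUM_FCRS = 5    # from the Parallel Processing slide


def pes_used(groups: dict) -> int:
    """Total PEs consumed by a mix of virtual groups."""
    return sum(GROUP_SIZES[kind] * count for kind, count in groups.items())


if __name__ == "__main__":
    figure_config = {"quadruplex": 4, "triplex": 3, "simplex": 15}
    used = pes_used(figure_config)
    print(used, used == TOTAL_PES)   # 40 True
    print(TOTAL_PES // NUM_FCRS)     # 8 PEs per FCR if spread evenly
```

The mix of 4 quads, 3 triplexes and 15 simplexes uses 16 + 9 + 15 = 40 PEs, so the figure accounts for every processing element in the cluster.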
24 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Dynamic Reconfiguration)
- A mission consists of several phases, such as launch, ascent, cruise from Earth orbit to Mars, Mars orbit injection and Mars landing
- From phase to phase the throughput, latency, iteration rate and criticality requirements change over a wide range, so the architecture must be flexible
- Reconfiguration from high throughput to high reliability
  - 3 PEs operating as independent simplex elements can be synchronized to run the same task (S2, S3, S13)
- Replacing failed members
  - A simplex in the same FCR as the failed member is synchronized with the non-failed members of the virtual group (for example, if channel A of Q1 fails, S2, S7 or S12 can replace it)
Figure: FTPP Architecture
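The replacement rule can be sketched as a lookup of free simplexes by FCR. This is an illustrative model only: the data layout, the function name and the member names Q1a to Q1d are hypothetical, while S2, S7 and S12 sharing an FCR with channel A follows the slide's example.

```python
from typing import Dict, List, Optional


def replace_failed_member(
    group: Dict[str, str],                 # FCR name -> PE serving that channel
    failed_fcr: str,
    free_simplexes: Dict[str, List[str]],  # FCR name -> idle simplex PEs there
) -> Optional[str]:
    """Swap the failed channel's PE for a free simplex in the same FCR.

    Returns the replacement PE, or None if that FCR has no free simplex
    (the group then continues with its remaining members).
    """
    candidates = free_simplexes.get(failed_fcr, [])
    if not candidates:
        group.pop(failed_fcr, None)  # drop the failed member, degrade the group
        return None
    replacement = candidates.pop(0)
    group[failed_fcr] = replacement
    return replacement


if __name__ == "__main__":
    q1 = {"A": "Q1a", "B": "Q1b", "C": "Q1c", "D": "Q1d"}  # hypothetical names
    free = {"A": ["S2", "S7", "S12"]}
    print(replace_failed_member(q1, "A", free))  # S2
    print(q1["A"], free["A"])                    # S2 ['S7', 'S12']
```

Choosing the replacement from the same FCR keeps each member of the virtual group in a different fault containment region, so a single region failure still affects at most one member.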
25 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Low Fault Tolerance Overhead)
- Frequent fault-tolerance functions such as fault/error detection, error masking (voting) and synchronization are implemented in the Network Element
- Less frequent functions such as identification of faulty modules, reconfiguration and reintegration are implemented in software that executes on the PEs
- Each NE services 8 PEs
Figure: FTPP Architecture
26 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Open Architecture)
- FTPP provides an open architecture for both hardware and software, including
  - Processors
  - I/O modules
  - Fiber optic links
  - Operating systems
Figure: FTPP Architecture
27 Mars Advanced Fault-Tolerant Computing Approach: Features of FTPP (Small Physical Size)
- The key to meeting the weight, volume and power requirements is the packaging technology
- Multi-Chip Modules (MCMs) will be used
- An NE fits on a single MCM of less than 4 cm²
Figure: FTPP Architecture
28 Conclusion
- Future manned deep space missions will require reliable operation over years and real-time masking of critical faults
- Current approaches are insufficient, and a new fault-tolerant approach is needed
- FTPP is a strong candidate for the spacecraft that will take humans to Mars
29 References
- Benjamin, A.L. and Lala, J.H., "Advanced fault tolerant computing for future manned space missions," 16th AIAA/IEEE Digital Avionics Systems Conference (DASC), vol. 2, 26-30 Oct. 1997, pp. 8.5-26 to 8.5-32
- NASA Website, Computers in Spaceflight: The NASA Experience, http://www.hq.nasa.gov/office/pao/History/computers/Ch6-2.html
- NASA Jet Propulsion Laboratory Website, Voyager: The Interstellar Mission, http://voyager.jpl.nasa.gov/spacecraft/index.html