Title: SEU Tolerant Device, Circuit and Processor Design
1 SEU Tolerant Device, Circuit and Processor Design
- William Heidergott
- General Dynamics C4 Systems
- Scottsdale, Arizona, USA
2 Outline
- Energetic Particle Environments
- Space and Terrestrial Particles
- Single Particle Interaction With Devices
- Charge Generation and Collection
- Single Event Effects (SEE)
- Fault Tolerant Systems
- Fault Avoidance, Fault Masking and Detection, Containment and Recovery Techniques
- Validation
- Summary
3 Energetic Particle Environments
- Interplanetary Space Particles and Spectra
- Galactic and Solar Cosmic Rays
- Highly Ionized
- Influenced by Magnetic Fields
- Very Energetic
- Relativistic (GeV) Energy
4 Energetic Particle Environments
- Interplanetary Space Particles Interact with Earth's Atmosphere
- Nucleonic Component of Importance to Terrestrial Single Event Effects
- Flux of Atmospheric Neutrons Varies With Altitude
- Production vs. Removal Processes
- Maximum Flux at 60,000 ft
5 Energetic Particle Environments
- Long-Term Variation in Terrestrial Neutron Flux
- Variation Over Solar Cycle (30% Variation)
6 Energetic Particle Environments
- Short-Term Variation in Interplanetary Flux
- Transient Variation Due to Solar Particle Event
7 Energetic Particle Environments
- Short-Term Variation in the Environment
- Transient Variation Due to Solar Particle Event
8 Energetic Particle Environments
- Terrestrial Energetic Particle Environment
- High Energy Neutrons
- Thermal Neutrons
- Alpha Particles
9 Single Particle Interaction
- High Energy Neutron Interaction
10 Single Particle Interaction
- Low Energy (Thermal) Neutron Interaction
11 Single Particle Interaction
- Alpha Particle Interaction
12 Single Particle Interaction
13 Single Event Effects
- Single Event Effects (SEE)
- Destructive Single Event Effects
- Dielectric Rupture (SEDR): Thin Gate Oxides
- Gate Rupture (SEGR) and Burnout (SEB): Power MOSFETs
- Potentially Destructive
- Single Event Latchup (SEL): Bulk CMOS ICs
- Snapback (SES): SOI CMOS ICs
- Soft Errors
- Single Event Transient (SET)
- Single Event Upset (SEU)
- Single Event Functional Interrupt (SEFI)
14 Fault Tolerant Systems
- Process of Single Event Upset / Soft Error Generation and Effect
- External Energetic Particle Environment
- Transport of Energetic Particle Environment to Semiconductor Sensitive Volume
- Charge Generation and Collection
- Single Event Transient (SET) Generation and Propagation
- Single Event Upset (SEU)
- Undesired State of System
- Inability to Provide Service (Failure)
15 Fault Tolerant Systems
- Fault Avoidance, Fault Masking and Detection, Containment and Recovery Techniques
- External Energetic Particle Environment
- Transport of Energetic Particle Environment to Semiconductor Sensitive Volume
- Charge Generation and Collection
- Single Event Transient (SET) Generation and Propagation
- Single Event Upset (SEU)
- Undesired State of System
- Inability to Provide Service (Failure)
16 Fault Tolerant Systems
- Fault Avoidance
- Prevent Critical System Operations During Severe Environmental Conditions
- Avionics Applications
17 Fault Tolerant Systems
- Fault Avoidance
- Reduce Severity of the Energetic Particle Environment (? Shielding)
- C4 Solder Alpha Emission Keep-Out Areas for SEU-Susceptible Elements
18 Fault Tolerant Systems
- Fault Avoidance
- Attenuate the Transport of Energetic Particle Environment to Semiconductor Sensitive Volume (Shielding)
- Polyimide Layers Between Alpha Emitting Packaging Materials and Sensitive Device Structures
- Alpha Particle Range in Materials
- Si: 23.6 µm
- Polyimide: 28.0 µm
- Pb: 11.5 µm
- Au: 6.6 µm
- Al: 19.5 µm
- Resist: 24.0 µm
- Cu: 7.0 µm
- Air: 4.7 cm
19 Fault Tolerant Systems
- Fault Avoidance
- Attenuate the Transport of Energetic Particle Environment to Semiconductor Sensitive Volume (Shielding)
- Terrestrial and Avionics Systems
- Thermal Neutron Shielding Work to be Published by Full Circle Research, NASA, and Hybrid Plastics Inc. at IEEE NSREC in July 2005
- Metallized Polyhedral Oligomeric Silsesquioxanes (POSS) Board Coating Material
- Naturally Occurring Gadolinium
- Thermal Neutron Capture Cross Section of 48,890 Barns
20 Fault Tolerant Systems
- Fault Avoidance
- Reduce Charge Generation and Collection Processes
- Silicon-On-Insulator (SOI)
- Removes Reverse Biased Source / Drain Node Junction From Device Cross Section (Potentially Reduced Cross Section)
- Epi, Retrograde and Double/Triple Well Structures
- Reduces Carrier Lifetime in Region Below Device Structure
- Non-Ionizing Energy Deposition and Low Temperature Buffer Layer (LT GaAs)
- Reduces Carrier Lifetime in Region Below Device Structure
21 Fault Tolerant Systems
- Fault Avoidance
- Attenuate Single Event Transient (SET) Pulse Generation and Propagation
22 Fault Tolerant Systems
- Fault Avoidance
- Attenuate Single Event Transient (SET) Pulse Generation and Propagation
- Increase Memory Cell Node Capacitance (Increase Critical Charge)
- SRAM Metal-Insulator-Metal (MIM) Capacitor
- DRAM Capacitor on Top of Memory Cell
- Trench DRAM Cell
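The link between node capacitance and critical charge can be illustrated with a first-order sketch. It assumes the common approximation Q_crit ≈ C_node × V_dd, which ignores the restoring current of the cell feedback; the capacitance and voltage values are hypothetical and do not come from the slides.

```python
def critical_charge_fc(node_capacitance_ff: float, supply_voltage_v: float) -> float:
    """First-order critical charge in femtocoulombs: Q_crit ~= C_node * V_dd.

    Assumption: femtofarads times volts gives femtocoulombs; the cell's
    restoring current is ignored, so this is only a rough lower-order model.
    """
    return node_capacitance_ff * supply_voltage_v


# Hypothetical values: a 1 fF storage node with and without a 2 fF added capacitor.
baseline_fc = critical_charge_fc(1.0, 1.2)
with_added_cap_fc = critical_charge_fc(1.0 + 2.0, 1.2)
print(f"baseline: {baseline_fc:.2f} fC, with added capacitor: {with_added_cap_fc:.2f} fC")
```

Under this approximation, tripling the node capacitance triples the charge a particle strike must deposit to flip the cell.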
23 Fault Tolerant Systems
- Fault Avoidance
- Block Single Event Transient (SET) Pulse From Producing a Single Event Upset
24 Fault Tolerant Systems
- Fault Avoidance
- Block Single Event Transient (SET) Pulse From Producing a Single Event Upset
25 Fault Tolerant Systems
- Fault Avoidance
- Block Single Event Transient (SET) Pulse From Producing a Single Event Upset
26 Fault Tolerant Systems
- Fault Masking Techniques
- Prevent Single Event Upsets From Producing an Undesired State of the System
- Redundancy
- Informational Redundancy
- Error Detection and Correction Coding
- Significant Use of EDAC in Systems
- Byte Correction to Mitigate SEFI in SDRAM
- Arithmetic Codes
- No Efficient Techniques Identified
- Spatial Redundancy
- n-Modular Redundant (nMR) Structures
- Significant Use in Systems
- Temporal Redundancy
27 Fault Tolerant Systems
- Redundancy
- Informational Redundancy
- Error Detection and Correction Coding
28 Fault Tolerant Systems
- Redundancy
- Error Detection and Correction Coding
29 Fault Tolerant Systems
- Redundancy
- Error Detection and Correction Coding
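As a small illustration of error detection and correction coding, the sketch below implements a Hamming(7,4) single-error-correcting code. The code choice is an assumption made for brevity; the slides do not state which EDAC code is used, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits at positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, syndrome).

    A non-zero syndrome is the 1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulated single event upset in one stored bit
recovered, syndrome = hamming74_decode(stored)
print(recovered, syndrome)
```

A single upset anywhere in the 7-bit word is corrected on read; two upsets in the same word would be miscorrected by this code, which is why wider codes with double-error detection are usually preferred.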
30 Fault Tolerant Systems
- Redundancy
- Spatial Redundancy
- n-Modular Redundant (nMR) Structures
- Triple Modular Redundancy (TMR)
31 Fault Tolerant Systems
- Redundancy
- Spatial Redundancy
- n-Modular Redundant (nMR) Structures
- Triple Modular Redundancy (TMR)
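The voting step at the heart of TMR can be sketched as a bitwise majority function over three redundant copies of a word. This is a software illustration only; the function name and the example values are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across three redundant copies."""
    return (a & b) | (b & c) | (a & c)


golden = 0b1011_0010
upset = golden ^ 0b0000_1000  # one copy suffers a single-bit upset

# The two unaffected copies outvote the corrupted one.
print(bin(tmr_vote(golden, upset, golden)))
```

The voter masks an upset in any one copy. If the same bit is upset in two copies before they are refreshed, the vote returns the wrong value, so TMR is normally combined with periodic rewriting of the voted result.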
32 Fault Tolerant Systems
- Redundancy Techniques
- Spatial Redundancy
- n-Modular Redundant (nMR) Structures
33 Fault Tolerant Systems
- Recent Onset of Combinatorial Logic Single Event Transient Susceptibility
34 Fault Tolerant Systems
- Recent Onset of Combinatorial Logic Single Event Transient Susceptibility
- Current Technology Provides Bandwidth for Response
- Capability to Propagate Short Pulses
- Clock Speed Increasing Probability of Clock Occurrence With SET Within Set-Up/Hold Window
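The clock-speed dependence can be illustrated with a simple model. It assumes the transient arrives at a uniformly random time within a clock cycle and is captured only if it overlaps a fixed "vulnerable window"; that window is a hypothetical parameter lumping together the SET pulse width and the set-up/hold time, and the numbers are illustrative rather than measured.

```python
def set_capture_probability(vulnerable_window_ps: float, clock_frequency_ghz: float) -> float:
    """Fraction of the clock period during which an arriving SET would be latched.

    Assumptions: uniform arrival time within the cycle and a fixed vulnerable
    window; the result is capped at 1.0 when the window exceeds the period.
    """
    clock_period_ps = 1000.0 / clock_frequency_ghz
    return min(1.0, vulnerable_window_ps / clock_period_ps)


for frequency_ghz in (0.5, 1.0, 2.0):
    probability = set_capture_probability(100.0, frequency_ghz)
    print(f"{frequency_ghz} GHz: {probability:.2f}")
```

Under these assumptions the capture probability scales linearly with clock frequency for a fixed window, which is the trend the slide describes.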
35 Fault Tolerant Systems
- Redundancy
- Temporal Redundancy
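One way to picture temporal redundancy is to sample the same signal at three instants and take a majority vote, so that a transient shorter than the sampling spacing corrupts at most one sample. The sketch below models this in software with integer picosecond times; the helper names and the timing values are hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority of three single-bit samples."""
    return (a & b) | (b & c) | (a & c)


def make_signal(true_value: int, pulse_start_ps: int, pulse_width_ps: int):
    """Return a signal that holds true_value except during one inverting transient."""
    def signal(t_ps: int) -> int:
        in_pulse = pulse_start_ps <= t_ps < pulse_start_ps + pulse_width_ps
        return true_value ^ 1 if in_pulse else true_value
    return signal


def temporal_vote(signal, t_ps: int, delta_ps: int) -> int:
    """Sample at t, t + delta and t + 2*delta, then vote."""
    return majority(signal(t_ps), signal(t_ps + delta_ps), signal(t_ps + 2 * delta_ps))


# A 200 ps transient on a node whose correct value is 0.
signal = make_signal(true_value=0, pulse_start_ps=1000, pulse_width_ps=200)
print(temporal_vote(signal, 1000, 300))  # spacing longer than the pulse
print(temporal_vote(signal, 1000, 50))   # spacing shorter than the pulse
```

The protection holds only while the sample spacing exceeds the transient width, and the extra delay of two sample intervals is the performance cost of the technique.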
36 Fault Tolerant Systems
- Redundancy
- Temporal and Spatial Redundancy
37 Fault Tolerant Systems
- Detection, Containment and Recovery
- Prevent an Undesired State of System From Resulting in Failure
- Detection
- Detection is the Difficult Aspect of This Approach
- Application-Oriented Fault Tolerance
- Acceptance Testing
- Algorithm Based Fault Tolerance (ABFT)
- Containment
- Hierarchical Error Containment Boundaries
- Confine Errors to Module or Subsystem
- Subsystems Validate Inputs and Check Results
- Recovery
- Recovery Blocks
38 Fault Tolerant Systems
- Application-Oriented Fault Tolerance
- Constraint Predicates
- Identifies Specific Properties of Problems Which Enable or Constrain Application-Oriented Fault Tolerance
- Progress: Decompose Process Into Operations Blocks, Providing Testability at Intermediate Points
- Surfaces Notion That the Number of Process Steps is Known A Priori
- Feasibility: Constraints Which are Apparent From the Nature of the Problem
- Boundary Conditions
- Results Must Be Within the Solution Space of the Problem
- Consistency: Ability to Infer Validity of Intermediate or Final Results
- Input Variables
- Previous Intermediate or Final Results
39 Fault Tolerant Systems
- Application-Oriented Fault Tolerance
- Software Components
- Executable Assertions
- if not ASSERTION then ERROR
- Detection Capability is Determined by the Perceptiveness of the ASSERTION
- Containment and Recovery are Determined by the Response Embedded in ERROR
- N-Version Programming
- Parallel or Sequential Execution of Programs and Comparing the Results
- Design Diversity vs. Redundancy
- Developed to Protect Against Design Defects
- Redundant Execution Against Transient Faults
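The "if not ASSERTION then ERROR" pattern can be sketched as a small wrapper: a computation, a predicate that judges its result, and a response invoked on failure. Here the response is simple re-execution, on the assumption that the fault is transient; all names and the sensor values are hypothetical.

```python
def checked(compute, assertion, on_error):
    """Run compute(); if the assertion rejects the result, return on_error(result)."""
    result = compute()
    if not assertion(result):
        return on_error(result)
    return result


def make_upset_once_mean(readings):
    """Mean of readings, with a simulated upset corrupting the first call only."""
    calls = {"count": 0}

    def mean():
        calls["count"] += 1
        value = sum(readings) / len(readings)
        if calls["count"] == 1:
            value += 1.0e6  # simulated transient fault
        return value

    return mean


mean = make_upset_once_mean([40.0, 50.0, 60.0])
in_range = lambda value: 0.0 <= value <= 100.0  # ASSERTION: result within sensor range
result = checked(mean, in_range, on_error=lambda bad: mean())  # ERROR: re-execute
print(result)
```

The range predicate catches only gross corruption; an upset that leaves the result inside the allowed range passes undetected, which is the sense in which detection capability depends on the perceptiveness of the assertion.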
40 Fault Tolerant Systems
- Acceptance Testing
- Functionality and Data
- Assesses Reasonableness of Computation Results
- Allowable Range
- Consistency With Input Variables
- Consistency With Previous Results
- Mostly Ad-Hoc Developed Techniques
- Control Flow
- Validates Execution Flow Within Blocks and Paths Between Blocks
- Within Blocks: Set Block Tag to Key Value on Entry, Test for Validity on Completion
- Between Blocks: Set Path Tag to Key Value on Branch Decision, Verify Proper Path Tag on Destination Block Entry
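The block-tag and path-tag checks described above can be sketched as a small monitor object. The block names, key values and class name are hypothetical; a real implementation would more likely be inlined into the generated code than written as a class.

```python
class ControlFlowError(Exception):
    """Raised when a tag check reveals an unexpected execution path."""


BLOCK_KEYS = {"filter": 0xA5, "limit": 0x5A}  # hypothetical key values


class FlowMonitor:
    def __init__(self):
        self.block_tag = None
        self.path_tag = None

    def branch_to(self, block):
        """Between blocks: set the path tag when the branch decision is made."""
        self.path_tag = BLOCK_KEYS[block]

    def enter(self, block):
        """Verify the path tag on destination entry, then set the block tag."""
        if self.path_tag != BLOCK_KEYS[block]:
            raise ControlFlowError(f"unexpected entry into {block}")
        self.block_tag = BLOCK_KEYS[block]

    def leave(self, block):
        """Within blocks: test the block tag for validity on completion."""
        if self.block_tag != BLOCK_KEYS[block]:
            raise ControlFlowError(f"block tag corrupted in {block}")
        self.block_tag = None


monitor = FlowMonitor()
monitor.branch_to("filter")
monitor.enter("filter")
monitor.leave("filter")
print("filter block completed with valid tags")
```

An upset that redirects execution into the wrong block is caught at entry, because the path tag set by the branch decision does not match the key of the block actually reached.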
41 Fault Tolerant Systems
- Acceptance Testing
- Watchdog Coprocessor
- Extends Notion of Watchdog Timer to Include Checking of On-Line Processor Operation and Results
- Classes of Assertions
- Inverse: Uses Output Results to Infer Required Input Variables
- Transformation: Converts Problem to a Simpler One and Compares Approximated Results
- Range: Pre-Established Limits on Results
- State: Coprocessor Execution of Self-Checking Software
42 Fault Tolerant Systems
- Algorithm Based Fault Tolerance (ABFT)
- Most ABFT Techniques Address Computational Problems Which Exhibit Structure and Regularity
- Matrix Computation
- Fourier Transform
- Least Squares Minimization
- Sorting
- QR Factorization
- Singular Value Decomposition
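For the matrix computation case, a widely used ABFT scheme appends checksums to the operands so that the product carries its own consistency check. The sketch below uses integer matrices and plain lists to stay self-contained; it is an illustration of the checksum idea, not a method taken from the slides, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_checksum_row(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(column) for column in zip(*a)]]


def add_checksum_col(b):
    """Append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def find_error(c_full):
    """Return (row, col) of a single inconsistent data element, or None if consistent."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("mismatch cannot be attributed to a single data element")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(add_checksum_row(a), add_checksum_col(b))
c_full[0][1] ^= 4  # simulated upset in one element of the result
print(find_error(c_full))
```

Once the faulty element is located, it can be recomputed from its row checksum and the other elements in that row, so a single error is corrected without repeating the whole multiplication.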
43 Fault Tolerant Systems
- Recovery
- System Recovery
- Check-Pointing
- Backward Error Recovery
- Micro-Rollback
- Rollback
- Forward Operational Recovery
- Safe Point to Resume With Loss of Previous Results
- Redundant Modules to Take Over for Failed Subsystem Until It Can Be Reinitialized
- Hot, Warm, or Cold Spare
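Check-pointing with backward error recovery can be sketched as follows: state is saved after each accepted step, and a step whose result fails an acceptance test is rolled back and retried. The class, the retry limit and the example step are hypothetical, and the simulated upset is injected only to exercise the rollback path.

```python
import copy


class CheckpointedState:
    """Holds working state plus the last known-good copy."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def run_step(cs, step, acceptance_test, max_attempts=3):
    """Apply step to cs.state; roll back and retry while the acceptance test fails."""
    for _ in range(max_attempts):
        step(cs.state)
        if acceptance_test(cs.state):
            cs.checkpoint()
            return True
        cs.rollback()
    return False


attempts = {"count": 0}

def increment(state):
    attempts["count"] += 1
    state["count"] += 1
    if attempts["count"] == 1:
        state["count"] = 999  # simulated upset during the first attempt


cs = CheckpointedState({"count": 0})
ok = run_step(cs, increment, lambda state: 0 <= state["count"] <= 10)
print(ok, cs.state)
```

Rollback recovers from transient faults at the cost of repeated work; if every attempt fails, the caller still holds the last checkpointed state and can fall back to forward recovery or a spare module.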
44 Fault Tolerant Systems
- Validation and Verification
- Analytical Modeling
- Experimental Techniques
- Hardware Pin Faults
- Memory Corruption
- Ion Irradiation
- Simulation Modeling
- Op-Code Level Simulation
- Gate Level Simulation
- Register-Transfer Level
- Fault Emulation
- Memory
- Register Transfer Level
- Bus
45 Summary
- Space Systems Applications
- Most Severe Environment and Significant Consequence of Failure
- Heritage of Most Single Event Effects Understanding and Mitigation Techniques
- Fault Tolerance Provisions
- Fault Avoidance
- Fault Masking Techniques
- Detection, Containment and Recovery Strategies
- Validation and Verification
- Access to Background and Current Information
46 Information Sources
- Single Event Effects (December Issue of TNS), Short Course, Data Workshop
47 Information Sources
- Single Event Effects (IRPS Conference Proc.)