Title: Self Repair Technology for Logic Circuits
Slide 1: Self Repair Technology for Logic Circuits - Architecture, Overhead and Limitations
Heinrich T. Vierhaus, BTU Cottbus, Computer Engineering Group
Slide 2: Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies and Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions
Slide 3: 1. Introduction
A bunch of new problems from nano structures ...
Slide 4: Nanoelectronic Problems
Lithography
The wavelength used to map structural
information from masks to wafers is larger (4
times or more) than the minimum structural
features (193 nm versus 90 / 65 / 45 nm).
Layouts have to be adapted to correct mapping
faults.
Statistical Parameter Variations
The number of atoms in MOS-transistor channels
becomes so small that statistical variations of
doping densities have an impact on device
parameters such as threshold voltages.
Slide 5: New Problems with Nano-Technologies
[Figure: lithography setup. A light source (wavelength 193 nm) exposes the resist on the wafer through a mask (reticle); the exposed resist defines feature sizes down to 28 nm.]
Slide 6: Layout Correction
Modified layout for compensation of mapping faults
Compensation is critical and non-ideal
Faults are not random but correlated!
Requires fast fault diagnosis
Slide 7: Doping Fluctuations in MOS Transistors
Density and distribution of doping atoms cause
shifts in transistor threshold voltages!
Slide 8: Nanostructure Problems
Individual device characteristics such as Vth
depend more strongly on statistical variations of
underlying physical features such as doping
profiles.
Primary relevance: yield
A significant share of basic devices will be out
of spec and need to be replaced by backup
elements for yield improvement after production.
Primary relevance: yield
Smaller features mean higher stress (field
strength, current density) and also foster new
mechanisms of early wear-out.
Primary relevance: lifetime
Recognizing and compensating transient errors in
time is becoming a must, e.g. because charged
particles can discharge circuit nodes.
Primary relevance: dependability
Slide 9: Fault Tolerant Computing
[Diagram: a fault event can be handled at three levels.]
Software-based fault detection and compensation (specific): works only for transient faults!
HW logic / RT-level detection and compensation (universal): typically works for transient and permanent faults!
Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!
Slide 10: 2. Wear-Out Problems and Mechanisms
Structures on ICs used to live longer than either
their application or even their users. Not any
more ...
Slide 11: IC Structures May Get Tired
Wear-out effects in nano-electronic ICs are
likely to appear much earlier, causing a lot of
problems for dependable long-term applications!
Slide 12: Fault Effects on ICs
Slide 13: Wear-Out Mechanisms
Metal migration
Metal atoms (Al, Cu) tend to migrate under high
current density and high temperature.
Stress migration
Migration effects may be enhanced under
mechanical stress conditions.
Effect
Migration in metal lines and vias may actually
cause line interrupts. The effect is partly
reversible by changing current directions.
Slide 14: Metal Migration
Slide 15: Transistor Degradation
Negative Bias Temperature Instability (NBTI): reduced
switching speed for p-channel MOS transistors
that have operated under long-time constant
negative gate bias. The effect is partly
reversible.
Hot Carrier Injection (HCI): reduced switching
speed for n-channel MOS transistors, induced by
positive gate bias and frequent switching. Not
reversible.
Gate oxide deterioration: induced by high field
strength. Not reversible.
Dielectric breakdown: insulating layers between
metal lines may break, causing shorts between
signal lines.
Consequence: design technology has to include a
prospective lifetime budget!
Slide 16: Management of Wear-Out by Fault Tolerant Computing?
Built-in fault tolerance and error compensation
are needed in nano-technologies anyway, for
the management of transient faults.
Wear-out induced faults may show up as
intermittent faults first, which then become more
and more frequent.
Faults in synchronous circuits and systems are
detected clock cycle by clock cycle. Hence, for
many types of fault tolerant architecture, the
detection does not even recognize whether a fault
is permanent or not.
Slide 17: Triple Modular Redundancy
[Diagram: one input signal feeds Execution Units 1, 2 and 3; a comparator / voter produces the result output (majority) and an error-detect signal.]
Can detect and compensate almost any type of fault.
Overhead of about 200-300%, plus additional signal delays.
The voter itself is not covered and must be a
self-checking checker.
Standard (by law) in avionics applications!
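The voter behaviour described on this slide can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the three execution units deliver their results as integers; the function name `tmr_vote` is hypothetical and not taken from the talk.

```python
def tmr_vote(a: int, b: int, c: int) -> tuple[int, bool]:
    """Bitwise 2-out-of-3 majority vote over three unit outputs.

    Returns the majority result and an error-detect flag that is set
    whenever the three outputs are not all identical.
    """
    majority = (a & b) | (a & c) | (b & c)
    error_detected = not (a == b == c)
    return majority, error_detected


# Example: unit 2 delivers a result with one flipped bit.
good = 0b1011
faulty = good ^ 0b0100
result, error = tmr_vote(good, faulty, good)
```

The vote masks the fault whether it is transient or permanent, which is why the voter alone cannot tell the two apart.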
Slide 18: Error Detecting / Correcting Codes
[Diagram: data and its signature pass through transmission / storage; on the output side a signature comparison drives fault detection and error correction.]
Often applicable to 1- or 2-bit faults only.
Often limited to certain fault models
(uni-directional).
Becomes expensive if applied to computational
units.
Slide 19: Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works
on a clock-cycle basis. It does not know whether a
fault is transient or permanent.
A permanent fault is a fault event that occurs
repeatedly in several to many successive clock
cycles.
Error correction technology can detect and
compensate such permanent faults as well as
transient faults.
A critical condition occurs if transient faults
occur on top of permanent faults. Then the
superposition of fault effects is likely
to exceed the system's fault handling capacity.
System components that run actively in parallel
suffer from the same wear-out effects. Therefore
there is an increase in dependability
before wear-out limits are reached, but no
significant lifetime extension!
Slide 20: Redundancy and Wear-Out
During the normal lifetime of the system,
duplication or triplication can enhance
reliability significantly. But area and
power consumption roughly triple as well.
And by the end of the normal operating time (out of
fuel / steam) all three systems will fail shortly
one after the other!
Reliability enhancement is not equal to lifetime
extension!
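A small numeric sketch can illustrate why triplication helps early in life but not near wear-out. It uses the standard textbook expression for TMR reliability, R_TMR = 3R^2 - 2R^3, which assumes independent unit failures and a perfect voter; both are idealizations and are not claims from the talk. The function name is illustrative.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least 2 of 3 independent units work,
    each with reliability r, assuming a perfect voter."""
    return 3 * r**2 - 2 * r**3


# Early in life (high unit reliability) TMR is clearly better than one unit.
early = tmr_reliability(0.9)
# Near wear-out (unit reliability below 0.5) TMR is worse than one unit.
late = tmr_reliability(0.3)
```

Once the reliability of each unit drops below 0.5, the triplicated system is less reliable than a single unit, which matches the point that redundancy run in parallel does not extend lifetime.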
Slide 21: Self Repair?
[Diagram: a fault event can be handled at three levels.]
Software-based fault detection and compensation (specific): works only for transient faults!
HW logic / RT-level detection and compensation (universal): typically works for transient and permanent faults!
Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!
Self repair for permanent faults!
Slide 22: 3. Repair for Memory and FPGAs
Compensation of transient faults is not
enough. Some technologies for transient
compensation can handle permanent faults, too,
but not in the long run, and not with additional
transient faults on top!
Slide 23: Memory Test and Repair
[Figure: memory array with read / write lines, lines selected by a line address, columns and a spare column.]
Slide 24: Memory Test and Repair (2)
[Figure: the same memory array with an added memory BIST controller.]
... is already state-of-the-art!
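The spare-column idea from the two memory slides can be sketched as a simple remapping table. This is a minimal illustration under the assumption that a BIST run reports faulty column indices; the class and method names are hypothetical.

```python
class RepairableMemory:
    """Column remapping for a memory array with spare columns (sketch)."""

    def __init__(self, columns: int, spares: int) -> None:
        # Spare columns are placed physically behind the regular ones.
        self.free_spares = list(range(columns, columns + spares))
        self.remap: dict[int, int] = {}  # faulty column -> spare column

    def mark_faulty(self, column: int) -> bool:
        """Assign a spare to a faulty column; False if none is left."""
        if column in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[column] = self.free_spares.pop(0)
        return True

    def physical_column(self, column: int) -> int:
        """Translate a logical column address into the physical column."""
        return self.remap.get(column, column)


mem = RepairableMemory(columns=8, spares=1)
repaired = mem.mark_faulty(3)   # BIST reported column 3 as faulty
```

With a single spare, a second faulty column can no longer be repaired, which is the basic limit of this scheme.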
Slide 25: FPGA-based Self Repair
Slide 26: In-System FPGA Repair
Slide 27: Repair Mechanism Row/Line-Shift
Little overhead for the re-configuration
process.
Loss of many good CLBs for every fault.
Slide 28: Distributed Backup CLBs
Minimum loss of functional CLBs.
High effort for re-wiring; requires massive
embedded computing power (32-bit CPU, 500 MHz).
Slide 29: Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies are required
(memory, logic).
Logic blocks may use methods known from memory
BISR.
Additional repair strategies are necessary for
logic elements.
The basic overhead of FPGAs versus standard
logic (about 10) increases further.
Repair strategies for logic may reuse some features
already present in FPGAs (e.g. switched
interconnects).
Slide 30: Structure of a CLB Slice
Slide 31: FPGAs as a Solution?
The granularity of re-configurable logic blocks
(CLBs) in most FPGAs is on the order of several
thousand transistors.
Replacement strategies must work at a
granularity of blocks of about 100-500
transistors for fault densities between 0.01%
and 0.1%.
An efficient FPGA repair mechanism requires
detailed fault diagnosis plus specific repair
schemes, which cannot be stored as
pre-computed reconfiguration schemes. Computing
specific repair schemes requires
in-system EDA (re-placement and routing) with a
massive demand for computing power.
No such always-available source of computing
power exists.
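The link between block granularity and tolerable fault density can be made concrete with a rough probability sketch. It assumes independent transistor faults with probability p, reads the fault densities quoted above as per-transistor probabilities of 1e-4 to 1e-3, and ignores faults in switches and control logic; all three are simplifying assumptions, not statements from the talk.

```python
def block_ok(p: float, transistors: int) -> float:
    """Probability that a block with the given transistor count is fault-free."""
    return (1.0 - p) ** transistors


def repairable_group_ok(p: float, transistors: int, functional: int = 3) -> float:
    """Probability that a group of `functional` blocks plus one spare
    still works, i.e. at most one of its blocks is faulty."""
    q = block_ok(p, transistors)
    n = functional + 1
    return q**n + n * q ** (n - 1) * (1.0 - q)


# Compare small and large blocks at a fault probability of 1e-3.
small = repairable_group_ok(1e-3, 100)
large = repairable_group_ok(1e-3, 500)
unprotected = block_ok(1e-3, 100) ** 3
```

Under these assumptions the group with 100-transistor blocks survives far more often than the group with 500-transistor blocks, and both do better than three unprotected blocks of the same size.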
Slide 32: Self-Repairing FPGA?
[Figure: reconfigurable logic drawn as an array of CLB and WB cells, next to a memory holding the program, the configuration scheme and the new configuration; a virtual CPU is also labeled.]
Slide 33: Advanced FPGA Structures
... are only partly re-configurable for
performance reasons!
Slide 34: FPGA / CPLD Repair
Looks pretty easy at first glance because of
the regular architecture!
Requires lines / columns of switches for
configuration at inputs and between AND / OR
matrices.
Requires additional programmability of
cross-points by double-gate transistors as in
EEPROMs or Flash memory.
Not fully compatible with standard CMOS.
Limited number of (re-)configurations.
Floating gate (FAMOS) transistors are
fault-sensitive!
Slide 35: 4. Basic Logic Repair Strategies
Repair techniques that replace failing building
blocks by redundant elements from a silent
storage are not new.
IBM has been selling such computer systems
specifically for applications in banks for decades.
But always with few (2-10) backup elements (CPUs),
assuming a small number of failures (< 10) within
years.
Slide 36: Mainframes
... will often contain redundant CPUs for
eventual fault compensation. But one faulty
transistor then costs a whole CPU, limiting
the fault handling to a few (about 10) permanent
fault cases.
Slide 37: Granularity of Replacement
Slide 38: Repair Overhead versus Element Loss
[Plot: repair procedure overhead and functioning elements lost, plotted against the size of the replaced blocks (granularity, axis from 1 to 10M). Regions are labeled "Prohibitive overhead", "New Methods and Architectures" and "Prohibitive fault density".]
Slide 39: Built-in Self Repair (BISR)
BISR is well understood for highly regular
structures such as embedded memory blocks.
BISR essentially depends on built-in self
test (BIST) with high diagnostic resolution.
[Diagram: fault detection, fault diagnosis, fault isolation and redundancy allocation, together forming fault / redundancy management.]
Redundancy management must monitor faults,
replacements and available redundancy, and must also
re-establish a working system state after
power-down states.
Slide 40: Levels of Repair
Transistor / switch level
Replace transistors or transistor groups.
Losses by reconfiguration (switched-off good devices): potentially small (20-50%) for transistor faults.
Overhead for test and diagnosis: very high.
Repair overhead will dominate reliability!
Gate level
Replace gates or logic cells.
Losses by reconfiguration: medium (60 to 90%) for single transistor faults.
Overhead for test and diagnosis: high.
Macro / block level
Replace functional macros (ALU, FPU, CPU).
Losses by reconfiguration: high, 99% or more.
Overhead for test and diagnosis: maybe acceptable.
Slide 41: The Fault Isolation Problem
[Figure: a driver feeds Load 1 and Load 2; one load has a gate short.]
GND shorts at gate inputs affect the whole
fan-in network and make redundancy useless!
Slide 42: Block-Level Repair
[Figure: blocks, each containing elements labeled SE.]
Each block of logic / RT elements (gates and larger)
contains one redundant element that can
replace a faulty unit.
Slide 43: Switching Concept (1)
Slide 44: Switching Concept (2)
[Figure, shown in two configurations: Functional Blocks 1 to 3 and a replacement block sit between inputs and outputs, with separate test-in and test-out terminals.]
Slide 45: A Regular Switching Scheme
The scheme is regular and scalable by nature,
comprising always k functional blocks of the
same nature plus 1 additional block for backup.
Building blocks are separated by (pass-)
transistor switches at inputs and outputs,
providing a full isolation of a faulty block.
Always 2 additional pass-transistors between two
functional blocks.
The reconfiguration scheme is regular in shifting
functionality between blocks, which results in a
simple scheme of administration.
The functional access to the spare block can be
used for testing purposes. In any state of (re-)
configuration, the potentially faulty block is
connected to test input / output terminals.
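The shifting rule behind this scheme can be written down as a small mapping. The sketch below assumes k functional blocks plus one spare, numbered 0 to k as physical blocks, and a configuration state that names the one block currently isolated (and thus connected to the test terminals). The numbering and the function name are assumptions made for illustration.

```python
def block_assignment(k: int, isolated: int) -> list[int]:
    """Map each of the k logical functions to a physical block.

    Physical blocks are numbered 0..k; block `isolated` is switched out
    of functional use. Functions below the isolated block stay in place,
    all others shift by one position towards the spare.
    """
    if not 0 <= isolated <= k:
        raise ValueError("isolated block index out of range")
    return [i if i < isolated else i + 1 for i in range(k)]


normal = block_assignment(3, isolated=3)   # spare (block 3) unused
repair = block_assignment(3, isolated=1)   # block 1 faulty and isolated
```

Because the mapping depends only on one index, the administration needs to store just that index, which is what makes the scheme simple to manage.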
Slide 46: Overhead Depending on Block Size
Transistor counts:
Basic element | Functional | Backup | Switches (normal) | Switches (extended)
3/4 2-NAND | 12 | 4 | 18 | 24
3/4 2-AND | 18 | 6 | 18 | 24
3/4 2-XOR | 18 | 6 | 18 | 24
H-Adder | 36 | 12 | 24 | 30
F-Adder | 90 | 30 | 30 | 36
For small basic blocks, the switches cause most of
the overhead (about 200%)!
For larger basic blocks, the overhead can be
reduced to about 30-50%,
... not counting test and administration
overhead!
Extract larger basic units from seemingly
irregular logic netlists!
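The overhead figures for this table can be recomputed with a short sketch. It assumes that overhead is counted as backup plus switch transistors relative to the functional transistors, which the slide does not spell out, and it reads each table row as functional, backup, normal-switch and extended-switch transistor counts.

```python
# (functional, backup, normal switches, extended switches), read from the table
BLOCKS = {
    "2-NAND": (12, 4, 18, 24),
    "2-AND": (18, 6, 18, 24),
    "2-XOR": (18, 6, 18, 24),
    "H-Adder": (36, 12, 24, 30),
    "F-Adder": (90, 30, 30, 36),
}


def overhead_percent(functional: int, backup: int, switches: int) -> float:
    """Redundancy plus switching overhead relative to the functional logic."""
    return 100.0 * (backup + switches) / functional


report = {
    name: (round(overhead_percent(f, b, s_norm)), round(overhead_percent(f, b, s_ext)))
    for name, (f, b, s_norm, s_ext) in BLOCKS.items()
}
```

Under this assumption the 2-NAND block lands at roughly 200% and the full adder already drops well below 100%, consistent with the trend stated on the slide.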
Slide 47: Overhead
Slide 48: 5. Test and Repair Administration
[Figure: two organizations are compared. De-centralized test and control: each reconfigurable logic block (RLB) has its own BIST and configuration (Conf.) logic. Centralized control: a test generator, a test analyzer, system monitoring and a configurator with status memory serve all RLBs; this central part may be faulty!]
Slide 49: Blocks, Switching, Administration
[Figure: two variants, local and remote (re-)configuration. In each, functional units (F-Unit) and redundant units (Red.-Unit) sit between columns of switches and are driven by configuration units (Conf.-Unit) with decoders and a global control unit.]
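Slide 50: Combining Test and Re-Configuration
[Figure: a test input drives the logic under test and a reference; the test output is compared with the reference; the fault-detect signal controls the next state of the configuration memory / counter.]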
Slide 51: Test and Administration
Each of the elements in a block is testable via
specific test inputs.
Test is done by comparison with reference
outputs. The system is stepped through its states of
re-configuration with the same input test pattern
applied. During test, one functional unit at a time
is removed from normal operation and connected to
the test I/Os (test in, test out).
If a fault is detected, the system is fixed
in the current state (fix at fault).
Such a procedure of self-test and
self-reconfiguration can run at every system
start-up, avoiding a central fault memory.
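The start-up procedure described above can be sketched as a loop over configuration states. The sketch models each block as a Python callable and assumes at most one faulty block; the function and variable names are illustrative and not taken from the talk.

```python
from typing import Callable, Sequence

Pattern = tuple[int, int]


def self_test_and_configure(
    blocks: Sequence[Callable[[Pattern], int]],
    reference: Callable[[Pattern], int],
    patterns: Sequence[Pattern],
) -> int:
    """Step through the configuration states and return the index of the
    block that ends up isolated.

    In state i, block i is disconnected from normal operation and tested
    against the reference. On the first mismatch the state is frozen
    ("fix at fault"). Without a fault, the last block (the spare) stays
    isolated, which is the normal configuration.
    """
    for state, block in enumerate(blocks):
        if any(block(p) != reference(p) for p in patterns):
            return state
    return len(blocks) - 1


def nand(p: Pattern) -> int:
    return 1 - (p[0] & p[1])


def stuck_at_1(p: Pattern) -> int:
    return 1  # models a NAND whose output is stuck at logic 1


PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
isolated = self_test_and_configure([nand, stuck_at_1, nand, nand], nand, PATTERNS)
```

No fault memory is needed because the loop rediscovers the faulty block every time it runs.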
Slide 52: Controller for (Re-)Configuration
Controller minimum complexity: 80 transistors
(3 + 1 configuration).
A controller may drive one or several
re-configurable blocks in parallel, depending on
their size.
Slide 53: Local Interconnects
The block-based repair scheme described so far
cannot cover faults on wires between
re-configurable blocks.
For small basic blocks (such as logic gates) the
majority of wiring is between re-configurable
units and not covered.
For larger (RT-level) basic blocks the majority
of wiring is within basic blocks and covered.
Schemes that can also cover inter-block wiring
are possible, but require FPGA-like configurable
switching and complex switching schemes.
Slide 54: Essentials of the Repair Scheme
Logic self repair is feasible at a cost below
that of triple modular redundancy (TMR).
There is a trade-off between the size of the
reconfigurable logic blocks (RLBs) and the
maximum tolerable fault density.
Administration, not redundancy, causes the critical
overhead.
Effort can be saved by administrating several
RLBs in parallel.
Low-level interconnects between RLBs are the
essential single point of failure in the repair
scheme!
Slide 55: 6. De-Stressing
Slide 56: The Purpose of De-Stressing
Building blocks of equal type in digital systems
may be more or less heavily used.
Blocks running with the highest dynamic load and
at the highest temperature are candidates for
early failure.
Using otherwise silent resources to relieve
such units from stress periodically may extend the
overall lifetime of the system.
The re-configuration scheme developed for repair
may also serve this purpose with slight
modifications.
... and the scheme must be compatible with repair
architectures!
Slide 57: The Scheme of De-Stressing
Slide 58: Modified Control Scheme
For de-stressing, functions have to be shifted
while the system is in hot operation.
As long as all building blocks are fully
functional, running two functional blocks in
parallel, serving the same inputs and outputs, is
possible.
With a total of k building blocks (including the
spare one) there are k stable states of
re-configuration (for k = 4: 1 normal, 3 repairs) and
(k-1) intermediate states for handover in case
of de-stressing.
No extra switches are necessary, but there is
additional overhead in state management and state
decoding.
Slide 59: FSM including Transitional States
If an on-the-fly transition between repair states
becomes necessary, the control logic will have
seven states instead of four!
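The state count can be checked with a short enumeration. The sketch assumes that the stable states form a chain with one handover state between each pair of neighbouring stable states, as suggested by the k and (k-1) figures; the state names are made up for illustration.

```python
def controller_states(k: int) -> list[str]:
    """List the k stable states and the (k-1) handover states between them."""
    states = []
    for i in range(k):
        states.append(f"stable_{i}")
        if i < k - 1:
            states.append(f"handover_{i}_{i + 1}")
    return states


def state_bits(state_count: int) -> int:
    """Minimum number of bits for a binary-encoded state register."""
    return max(1, (state_count - 1).bit_length())


states = controller_states(4)          # 3 functional blocks + 1 spare
bits_repair_only = state_bits(4)       # four stable states
bits_with_handover = state_bits(len(states))
```

For k = 4 this gives seven states, so a binary-encoded state register grows from two to three bits; a one-hot encoding would grow from four to seven flip-flops.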
Slide 60: Control Logic Functionality
Slide 61: Extended Control Logic
Slide 62: 7. Overhead and Limitations
BISR requires additional overhead.
The inevitable extra circuitry used for fault
administration is not fault-free by definition.
But we can assume that such circuitry, if
fabricated correctly, is not in heavy use all the
time and will exhibit far fewer failures from
stress.
Memory cells used for repair state administration
are prone to transient fault effects from
particle radiation.
With suitable state encoding (1-out-of-n code),
a parity check can be applied.
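Why a parity check fits a 1-out-of-n code can be shown with a few lines. Every valid code word has exactly one bit set and therefore odd parity, so any single bit flip in the state register produces even parity. The helper names below are illustrative only.

```python
def one_hot_states(n: int) -> list[int]:
    """All valid 1-out-of-n code words for an n-bit state register."""
    return [1 << i for i in range(n)]


def parity(word: int) -> int:
    """Return 1 if the number of set bits is odd, else 0."""
    return bin(word).count("1") % 2


def passes_parity_check(word: int) -> bool:
    """Valid 1-out-of-n words always have odd parity."""
    return parity(word) == 1


# A particle hit flips bit 2 of the state 0b0001.
upset = 0b0001 ^ 0b0100
detected = not passes_parity_check(upset)
```

The check catches every single-bit upset but not every double flip, so it is a detection aid rather than a complete protection of the repair state.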
Slide 63: Overhead
Slide 64: Cost / Overhead
(3 functional blocks plus 1 backup per RLB),
with / without extensions for de-stressing;
controller design optimized for supervision by
parity control.
Slide 65: Sources of Overhead
Basic block | Complexity (transistors) | Overhead: redundancy (%) | switches (%) | control (%) | control with de-stressing (%)
2-NAND | 4 | 33 | 250 | 675 | 1666
H-Adder | 12 | 33 | 111 | 225 | 555
F-Adder | 30 | 33 | 55 | 90 | 222
2-bit ALU | 352 | 33 | 13 | 7.6 | 18.9
4-bit ALU | 699 | 33 | 8.5 | 3.8 | 9.5
8-bit ALU | 1367 | 33 | 6.2 | 2 | 4.8
Switches and control overhead dominate; a
reasonable lower bound for the complexity of basic
blocks is around 100-200 transistors.
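The control columns of this table can be approximated with a simple ratio. The sketch assumes that control overhead is the controller size divided by the transistors of three functional blocks, and that the controller has about 81 transistors without and about 200 transistors with de-stressing support. These two controller sizes are back-calculated from the table and are assumptions, not figures stated in the talk.

```python
def control_overhead_percent(
    block_transistors: int,
    controller_transistors: int,
    functional_blocks: int = 3,
) -> float:
    """Controller size relative to the functional logic it administrates."""
    return 100.0 * controller_transistors / (functional_blocks * block_transistors)


nand_plain = control_overhead_percent(4, 81)        # 2-NAND, repair only
alu8_plain = control_overhead_percent(1367, 81)     # 8-bit ALU, repair only
alu2_destress = control_overhead_percent(352, 200)  # 2-bit ALU, with de-stressing
```

The same controller that costs several hundred percent for a NAND gate costs only a few percent for an ALU, which is the reason for the lower bound on block size.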
Slide 66: Overhead and Block Size
Slide 67: The Switching Problem (1)
[Figure: three switch arrangements, captioned as follows.]
Compensates always-on.
Compensates always-off.
Compensates always-on and always-off.
... always in one single transistor.
Slide 68: Single Points of Failure: Transistor Switches
Slide 69: Pass Transistor Faults
Short:
A short between the signal input
(Usign) and the control input (Uctrl) may be
handled by designing the gate input line (Rbr) as
a fuse. One additional transistor is then needed
as a power sink.
Slide 70: Blowing Fuses
Slide 71: 8. Summary and Conclusions
Logic self-repair is not impossible, but not
cheap either.
The lower bound for logic blocks is about 100
transistors.
Experience shows that most logic designs offer
some potential for logic extraction.
Repair technologies work even (much) better for
regular processor architectures such as VLIW
processors.
In real-life designs, a large part of the system
(memory, 50-90%; functional units, 10-40%) is
regular. Only a small fraction is
truly irregular and needs higher overhead.
No such strategy exists yet for analog and
mixed-signal circuits!
Slide 72: Real Embedded Systems
[Figure: system floor plan with two CPUs (each with data path, control and cache), memory blocks, a DSP and a mixed-signal / RF part.]
... only a small fraction of the real system is
truly irregular and needs expensive logic
repair!
Slide 73: Regular Processor Architectures
[Figure: register file, multiple parallel processing units (Mult, Add) and control logic; the control logic is marked "needs logic BISR".]
Regular processor structures with multiple
parallel units need expensive logic (self-)repair
only for their control logic.
Reconfiguration of data-path elements can be
arranged by software, which is not subject to
wear-out!
Slide 74: Design for Repairability
[Flow chart with the steps: RT netlist; extract obvious regular blocks; compose RT-RLBs; random logic; find and extract regular entities; compose gate-level RLBs; random rest logic; compose RLB control scheme; RLB control circuitry; estimate reliability; done.]
Slide 75: This is the END!
Thank you for not falling asleep!
(I would have ...)