Title: FPGA Self-Repair using an Organic Embedded System Architecture
1. FPGA Self-Repair using an Organic Embedded System Architecture
Kening Zhang, Jaafar Alghazo and Ronald F. DeMara
University of Central Florida
06 December 2007
2. Organic Computing (OC): biologically-inspired computing with self-x properties
Technical Objective: support long-lifetime missions with multiple failure occurrences
Research Focus: OC approach addresses system controllability with increasing complexity
System Properties
- Communication networks among autonomous systems
- Composed of a large collection of autonomous systems
- Each autonomous system owns its sensors and actuators
Self-x Characteristics
- Self-organization
- Self-configuration
- Self-optimization
- Self-healing
- Self-protection
- Self-explaining
- Context-awareness
- Self-synchronization
Example Relevance
- How to achieve a sustainable presence under NASA's Moon, Mars & Beyond objective?
- Reconfigurable hardware with self-healing, based on an SRAM FPGA platform
Sponsors
- NASA: FPGA platform and Genetic Algorithm research
- DARPA: OC approach and SOAR Longevity Platform
3. Goal: Autonomous FPGA Refurbishment
Increase availability without carrying pre-configured spares.
Comparison of Redundancy and Refurbishment
- Overhead from unutilized spares (weight, size, power)
  - Redundancy: increases with the amount of spare capacity
  - Refurbishment: weakly related to recovery capacity
- Granularity of fault coverage (resolution at which a fault is handled)
  - Redundancy: restricted at design-time
  - Refurbishment: variable at recovery-time
- Fault-resolution latency (availability via downtime required to handle a fault)
  - Redundancy: based on time required to select a spare resource
  - Refurbishment: based on time required to find a suitable recovery
- Quality of repair (likelihood and completeness)
  - Redundancy: determined by adequacy of spares available (?)
  - Refurbishment: affected by multiple characteristics (+ or -)
- Autonomous operation (fix without outside intervention)
  - Redundancy: yes
  - Refurbishment: yes
4. Fault-Handling Techniques for SRAM-based FPGAs
Device failure characteristics
- Duration: transient (SEU) or permanent (SEL, oxide breakdown, electromigration, LPD)
- Target: device configuration or processing datapath
Approaches: BIST, Evolutionary, Scrubbing, TMR, STARS, CED, Vigander, OC
[Table: methods used by each approach at each fault-handling stage]
- Stages: Detection, Isolation, Diagnosis, Recovery
- Entries: Supplementary Testbench; Duplex Output Comparison; Duplex/Triplex Output Comparison; Cartesian Intersection; Bitwise Comparison; Majority Vote; Autonomous Element (AE); Fast Run-time Location; Worst-case Clock Period Dilation; Autonomous Supervisor (AS); Population-based GA using Extrinsic Fitness Evaluation; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Replicate in Spare Resource; Select Spare Resource; Reload Bitstream / Invert Bit Value; Ignore Discrepancy; (not addressed); unnecessary
5. Autonomous System-on-a-Chip (ASoC) Architecture
- Dual-layer ASoC proposed by Lipsa et al. [Lipsa 05]
  - Functional Layer
    - Functional Elements (FEs), e.g. CPU, RAM, network interface
  - Autonomic Layer
    - Autonomic Elements (AEs): monitor, actuator, communication interface
    - Autonomic Supervisor (AS)
- UCF approach for fault coverage
  - Functional Layer + Autonomic Layer: coverage achieved by assessing consensus among elements
  - first to realize failure detection
  - consensus provides an organic method for fitness evaluation of competing alternatives during evolution, providing a self-regulating approach to fault resolution
6. EHW Environments
- Evolvable Hardware (EHW) environments enable experimental methods to research soft computing intelligent search techniques
- EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process
Two modes of Evolvable Hardware
- Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); device design-time refinement; "Done? Build it"
- Intrinsic Evolution: Genetic Algorithm with hardware in the loop; device run-time refinement; new approach to autonomous repair of failed devices
Application
- Deep space satellite: >100 FPGAs onboard; hostile environment (radiation, thermal stress)
- How to achieve reliability to avoid mission failure?
7. Genetic Algorithms (GAs)
- Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)
[Flowchart: start -> population of candidate solutions -> evaluate fitness of individuals (fitness function) -> goal reached? -> selection of parents -> parents -> crossover -> mutation -> offspring -> replacement -> back to population]
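The flowchart steps can be sketched in a few lines of Python. This is a minimal, generic GA under stated assumptions, not the authors' implementation: the function name `run_ga`, tournament selection, single-point crossover, elitism and all default parameter values are illustrative choices that do not come from the slides.

```python
import random

def run_ga(fitness, genome_len, pop_size=20, crossover_rate=0.7,
           mutation_rate=0.02, goal=None, max_generations=500, seed=0):
    """Generic GA loop: evaluate, select parents, crossover, mutate, replace."""
    rng = random.Random(seed)
    # start: random population of candidate solutions (bit strings)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for generation in range(max_generations):
        # evaluate fitness of individuals
        scored = sorted(population, key=fitness, reverse=True)
        # goal reached?
        if goal is not None and fitness(scored[0]) >= goal:
            return scored[0], generation

        def pick_parent():
            # binary tournament selection (an assumed selection scheme)
            a, b = rng.sample(scored, 2)
            return a if fitness(a) >= fitness(b) else b

        offspring = [scored[0][:]]  # elitism: keep the best individual
        while len(offspring) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, genome_len)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # mutation: flip each bit with a small probability
            child = [b ^ 1 if rng.random() < mutation_rate else b
                     for b in child]
            offspring.append(child)
        population = offspring  # replacement
    return max(population, key=fitness), max_generations
```

For example, `run_ga(sum, 16, goal=16)` searches for the all-ones string of length 16, using the number of ones as a toy fitness function.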
8. Genetic Mechanisms
- Guided trial-and-error search techniques using principles of Darwinian evolution
  - iterative selection, survival of the fittest
  - genetic operators: mutation, crossover, ...
  - implementor must define the fitness function
- GAs frequently use strings of 1s and 0s to represent candidate solutions
  - Genotype: the chromosomes on which the GA operates; if 100101 is better than 010001, it will have more chance to breed and influence the future population
- Genotype changes during evolution must adhere to the Xilinx-defined bitstream format
- To prevent undesirable conditions that may damage the FPGA, such as a mutation that ties two logic outputs together, a logical genotype is used for evolution and mapped to a physical phenotype
  - Logic: functional logic index number for the LUT
  - Row/Column: physical location of the LUT in the FPGA
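One way to picture the logical-genotype-to-physical-phenotype mapping is sketched below. The record layout (`LutGene`), the 16-bit truth table for a 4-input LUT, and the validity checks are assumptions made for illustration; the slides only state that each LUT carries a logic index and a Row/Column location.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LutGene:
    logic_index: int   # functional (logical) index of the LUT
    row: int           # physical row of the LUT in the FPGA
    col: int           # physical column of the LUT in the FPGA
    truth_table: int   # assumed 16-bit contents of a 4-input LUT

def to_phenotype(genotype):
    """Map a logical genotype to a physical placement, rejecting unsafe ones."""
    placement = {}
    for gene in genotype:
        site = (gene.row, gene.col)
        if site in placement:
            # two logical LUTs on one physical site would be an illegal mapping
            raise ValueError(f"LUT site {site} assigned twice")
        if not 0 <= gene.truth_table < (1 << 16):
            raise ValueError("truth table must fit in 16 bits")
        placement[site] = gene
    return placement
```

Because the GA operators act on the logical records and every candidate passes through a check like `to_phenotype` before it is written to the device, an unsafe physical configuration is rejected rather than loaded.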
9. Loosely Coupled Solution on Xilinx Virtex-II Pro / Virtex-4
The Virtex-II Pro or Virtex-4 device is mounted on a development board, which is interfaced with a workstation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis.
10. Organic Embedded System (OES) Architecture
One-dimensional, column-oriented OES based on the Xilinx Virtex-II Pro FPGA platform
- FEs and AEs reside on two distinct layers with an interconnection structure between them
- AEs and FEs can be realized in hardware, software, or co-design
- The AE layer supervises the functionality of the FEs while requiring no application-specific algorithms on the AE layer
- The Observer/Controller architecture includes an AS element, which has no counterpart to evaluate whether the AS itself is fault-free; the proposed approach addresses this by minimizing the complexity of the AS
- Xilinx partial reconfiguration technology is used to manipulate relocatable bitstreams
11. OES AE Component Design
- AEs decentralize the Observer/Controller functionality
- A Concurrent Error Detection (CED) unit collects the outputs of 2 FEs for discrepancy identification
- A Checksum component for AE fault detection, whose values are checked against Stored Checksum values
- An Evaluator of the outputs from the 2 FEs against the checksum, and an Actuator that initiates the recovery phase
- An important architectural property is that all AE components are identical in structure even though they monitor different types of FEs
- These homogeneous characteristics deliver a uniform-behavior property that is leveraged by the consensus-based evaluation fault-handling methodology
- OC concept: although AE components add complexity to the design, they ease the fault-handling integration difficulties inherent in current commercial IP cores
12. Consensus-Based Evaluation (CBE)
- Uses a relative fitness measure
  - Pairwise discrepancy checking yields the relative fitness measure
  - Broad temporal consensus in the population is used to determine the fitness metric
  - Transitions between fitness states occur in the population
  - Provides graceful degradation in the presence of changing environments, applications and inputs, since this is a moving measure
- Test inputs = normal inputs for data throughput
  - CBE uses neither additional functional test vectors nor additional resource test vectors
  - Potential for higher availability, as regeneration is integrated with normal operation
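A small sketch of a relative fitness measure built from pairwise discrepancy checks follows. It is a simplification under stated assumptions: every pair of configurations is compared on every input (the deck describes pairwise checks accumulated over time on normal throughput inputs), and the names `consensus_fitness`, `full_adder` and `faulty_adder` are illustrative. It expects at least two configurations and a non-empty input list.

```python
from itertools import combinations

def consensus_fitness(configs, inputs):
    """Relative fitness = fraction of pairwise comparisons in which a
    configuration agreed with its partner; no reference outputs are used."""
    n = len(configs)
    agree = [0] * n
    total = [0] * n
    for x in inputs:
        outputs = [cfg(x) for cfg in configs]
        for i, j in combinations(range(n), 2):
            match = outputs[i] == outputs[j]
            for k in (i, j):
                total[k] += 1
                agree[k] += match
    return [a / t for a, t in zip(agree, total)]

def full_adder(x):
    a, b, c = x
    return (a ^ b ^ c, (a & b) | (a & c) | (b & c))

def faulty_adder(x):
    # same circuit with input `a` stuck at zero
    _, b, c = x
    return full_adder((0, b, c))

ALL_INPUTS = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
fitness = consensus_fitness([full_adder] * 3 + [faulty_adder], ALL_INPUTS)
```

The faulty configuration ends up with the lowest score because it disagrees with every partner whenever the fault is exercised, while healthy configurations are penalized only when paired with it.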
13. Genetic Operators: Mutation
Typical approach: bit inversion of LUT functionality
Selected approach: input interconnections of LUTs are mutated
Rearrange input interconnections to search for unused LUT resources which occlude the faulty resource
Mutation on genotype chromosomes
- original functionality is F = F1(F3F4), with input F2 left unassigned by the synthesis tool
- the mutation operator moves input F4 to the unused input, giving F = F1(F3F2)
- shading shows the changed input and LUT contents
- offers some opportunity to avoid an input stuck-at fault or a LUT content stuck-at fault
- functionality of the LUTs remains undistorted while the search space is explored
Mutation on phenotype chromosomes
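The input-interconnection mutation can be illustrated on a 4-input LUT stored as a 16-entry truth table. This is a sketch under assumptions: pins 0..3 carry F1..F4, and because the operators inside the slide's expression did not survive extraction, the example function F = F1 AND (F3 OR F4) is hypothetical. Moving a signal to an unused pin and permuting the LUT contents accordingly keeps the logic function unchanged while no longer depending on the old pin.

```python
def swap_bits(index, a, b):
    """Return index with bit positions a and b exchanged."""
    if ((index >> a) & 1) != ((index >> b) & 1):
        index ^= (1 << a) | (1 << b)
    return index

def reroute_input(truth_table, old_pin, new_pin):
    """LUT contents after moving a signal from old_pin to an unused new_pin."""
    return [truth_table[swap_bits(i, old_pin, new_pin)]
            for i in range(len(truth_table))]

def bit(i, pin):
    return (i >> pin) & 1

# Hypothetical original function: F = F1 AND (F3 OR F4), pin 1 (F2) unused
original = [bit(i, 0) & (bit(i, 2) | bit(i, 3)) for i in range(16)]
# Mutation: route the signal that used pin 3 through the unused pin 1
mutated = reroute_input(original, old_pin=3, new_pin=1)
```

After the mutation the LUT output is independent of pin 3, so a stuck-at fault on that input line no longer affects the result.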
14. Genetic Operators: Cell Swapping
Cell-Swap operation on phenotype chromosomes
Cell-Swap operation on genotype chromosomes
- interchanges two distinct LUT blocks while maintaining correct logic order and functionality in the genotype
- exchanges all LUT input interconnections, LUT contents and the physical 2-tuple (Col, Row), as well as the logic sequence
15. Genetic Operators: PMX Operator
Partial Match Crossover (PMX) maintains crossover information as well as order information
- two genotype configuration streams are aligned at a LUT boundary
- a crossover site is selected at random along the LUT boundary
- this crossover point defines a left/right partition used to effect crossover through LUT-by-LUT exchange
- suppose the crossover point is at position 4 of the LUT vector
- the first step is to map configuration B to configuration A by exchanging the following aligned LUTs: (4,7), (5,2), (6,1), (7,5)
- applying PMX results in two new configurations, A' and B'
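The exchange-based formulation of PMX can be sketched on plain permutations, where each entry stands for a LUT label. The function name `pmx_by_exchange`, the example parents and the segment bounds are hypothetical and do not reproduce the slide's example; the point is that a sequence of aligned swaps copies the other parent's segment while keeping every label exactly once.

```python
def pmx_by_exchange(parent_a, parent_b, start, end):
    """PMX child of parent_a: positions start..end-1 are taken from parent_b
    by swapping aligned entries, so the child stays a valid permutation."""
    child = list(parent_a)
    position = {value: i for i, value in enumerate(child)}
    for i in range(start, end):
        wanted, current = parent_b[i], child[i]
        j = position[wanted]            # where the wanted label currently sits
        child[i], child[j] = wanted, current
        position[wanted], position[current] = i, j
    return child

# Hypothetical parents; the crossover point at position 4 selects the
# right-hand partition (positions 4..7) for exchange.
A = [0, 1, 2, 3, 4, 5, 6, 7]
B = [3, 7, 5, 1, 6, 0, 2, 4]
child_a = pmx_by_exchange(A, B, 4, 8)
child_b = pmx_by_exchange(B, A, 4, 8)
```

Each child inherits the exchanged partition from the other parent and retains as much of its own ordering as the swaps allow, which is why PMX suits genotypes where every LUT must appear exactly once.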
16. Illustrative Example: Gate-Level Design of OES
- Experiment circuit: 1-bit full adder
  - Fault-free model: Duplex
  - Fault-impact model: TMR
  - Fault-detect model: CBE
  - Fault recovery strategy: GA operation
- Experimental setup
  - Hardware prototype implemented in a Xilinx Virtex-II Pro FPGA
  - VHDL implementation
  - Uses the GNAT library along with the MRRA framework and the JTAG reconfiguration interface
17. MCNC-91 Benchmark Case Studies
System Availability under Multiple Faults
Symbol definitions:
- Fc: number of correct behaviors of the FE observed during the evolutionary recovery phase
- Fe: number of errant or discrepant behaviors
- 1: exactly one output is required to detect the fault in the original CED configuration
- 2: number of reconfigurations required, i.e. one from CED to TMR, and one back from TMR to CED
- Fc1, Fe1: number of correct and faulty outputs of the FE during the AE repair period
- Fc2, Fe2: number of correct and faulty outputs during the FE repair period
- n: number of reconfigurations of the FE
- β: ratio of reconfiguration time to computation time
18. Experimental Results
- Fault-free arrangement: CED FEs with a cold standby FE
- Inject a stuck-at-zero or stuck-at-one fault at one of the FE's LUT input pins
- CED -> TMR to identify the faulty FE or AE
- CBE used to resolve a faulty AE
Redundancy for both FE (RFE) and AE (RAE): ratio of unused LUT inputs to total number of LUT inputs
- Fc: number of correct behaviors of the FE observed during the evolutionary recovery phase
- Fe: number of errant or discrepant behaviors
- n: number of reconfigurations of the FE
- β: ratio of reconfiguration time to computation time
21. Conclusion
- A self-adapting, self-healing OES architecture was developed for autonomic operation without human intervention.
- The OES architecture is capable of handling many single-fault scenarios and several multiple-fault scenarios for small digital logic designs.
- Experimental results support our design objectives: availability during the repair phase averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits, respectively, under the stated conditions.
- The reconfiguration time ratio (β) is the key factor limiting availability during AE repair.
- Future work: evaluate extensions of the OES architecture that address scalability in terms of pipelined stages.
22. Backup Slides
23. Isolation of a single faulty individual with 1-out-of-64 impact
[Figure: instantaneous DV (point values) for a sample individual in the population and population oracles (solid lines); sliding window]
- Outliers are identified after EW iterations have elapsed
- Expected DV = (1/64) × 600 = 9.375 for an individual impacted by the fault
- An isolated faulty individual's DV differs from the average DV by 3σ after 1 or more observation intervals of length EW
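The arithmetic on this slide, and the 3-sigma outlier rule, can be checked with a short sketch. The helper names and the toy population below (19 individuals with no discrepancies and one at the expected faulty value) are assumptions for illustration, not measured data.

```python
from statistics import mean, pstdev

def expected_discrepancies(fault_impact, window):
    """Expected discrepancy count when a fault is exercised by a fraction
    `fault_impact` of inputs over `window` evaluations."""
    return fault_impact * window

def find_outliers(dv, k=3.0):
    """Indices whose discrepancy value differs from the population mean
    by more than k standard deviations."""
    mu, sigma = mean(dv), pstdev(dv)
    return [i for i, v in enumerate(dv) if sigma > 0 and abs(v - mu) > k * sigma]

expected = expected_discrepancies(1 / 64, 600)   # 1-out-of-64 impact, EW = 600
population_dv = [0.0] * 19 + [expected]          # hypothetical population
suspects = find_outliers(population_dv)
```

With a single deviating individual in a population of n, the deviation exceeds 3 sigma only when n is larger than 10, which is one reason a population of a reasonable size is needed for this kind of isolation.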
24. Future Work: Development Board to Self-Contained FPGA
- Qualitative analysis of the CRR model
  - Number of iterations and completeness of regeneration repair
  - Percentage of time the device remains online despite a physical resource fault (availability)
- Hardware resource management
  - Optimization of the hardware profile for Xilinx Virtex-II Pro
- Field testing on an SRAM-based FPGA in a CubeSat mission
25. OES Integrated FE and AE Failure Detection Procedure
- System Initialization
  - FE initialization step
  - Compute checksum step
- FE Fault Detection/Recovery
  - AE-CED fault detection
  - FE fault recovery
- AE Fault Detection Phase
  - A fault may exist in the CED, Actuator, or Evaluator,
  - a fault may exist in the Checksum component, or
  - a fault may exist in the Stored Checksum LUT.
Runtime inputs are applied to both active FE instances under a CED strategy. After allowing for the propagation time of the FE inputs through the AE, the expected output is supplied to the AE-CED for fault detection. The outputs of the two FEs are compared in the AE-CED module, and any discrepancy between the two values indicates that a fault has occurred in either one of the FEs or the AE-CED itself. Further detection is required to distinguish which of the two is faulty. If the AE component is identified as fault-free, the fault must have occurred in an FE: the output is discarded and control branches to a fault identification phase, which wakes up the cold standby FE and constructs a temporary TMR system that can identify the faulty FE under the newly supplied external input. Furthermore, as described in Section 3.3, the actuator initiates a repair cycle, which may require automatic evolutionary repair of the identified faulty FE. That FE is set to standby-under-repair, and the AE-CED returns to monitoring the remaining two active FEs. This decision-making procedure incurs at least one throughput-delay penalty.
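The FE side of this procedure (duplex comparison, then a temporary TMR vote with the cold standby) can be sketched as follows. Several simplifications are assumed: the AE is taken to be fault-free, the TMR vote reuses the same input rather than the next external input, and the function names are illustrative.

```python
from collections import Counter

def full_adder(x):
    a, b, c = x
    return (a ^ b ^ c, (a & b) | (a & c) | (b & c))

def stuck_adder(x):
    # full adder with input `a` stuck at zero (the injected fault)
    _, b, c = x
    return full_adder((0, b, c))

def ced_step(active, standby, x):
    """Return (output, faulty_index). faulty_index is None when the two
    active FEs agree; otherwise the standby is woken for a majority vote."""
    out_a, out_b = active[0](x), active[1](x)
    if out_a == out_b:
        return out_a, None                    # CED sees no discrepancy
    votes = [out_a, out_b, standby(x)]        # temporary TMR arrangement
    majority, count = Counter(votes).most_common(1)[0]
    if count < 2:
        return None, "unresolved"             # all three outputs differ
    faulty = next(i for i, v in enumerate(votes) if v != majority)
    return majority, faulty

agree_case = ced_step([full_adder, stuck_adder], full_adder, (0, 1, 1))
fault_case = ced_step([full_adder, stuck_adder], full_adder, (1, 0, 0))
```

In the second call the stuck input is exercised, so the duplex check fails and the vote points at the FE at index 1, which would then be marked standby-under-repair.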
26. Previous Work
- Detection characteristics of FPGA fault-handling schemes
Strategy 1) Evolve redundancy into the design before the anticipated failure
or
27. Previous Work
- Fault recovery characteristics of selected approaches
Strategy 2) Evolve recovery from a specific failure after (and if) it occurs
28. CRR Arrangement in SRAM FPGA
- Configurations in population
  - C = CL ∪ CR
  - CL: subset of left-half configurations
  - CR: subset of right-half configurations
  - |CL| = |CR| = |C|/2
- Discrepancy operator
  - The baseline discrepancy operator is a dyadic operator with binary output
  - Z(Ci) is the FPGA data throughput output of configuration Ci
  - Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual
  - Any fault in the checker lowers that individual's fitness, so that individual is no longer preferred and eventually undergoes repair
[Figure labels: WTA (Equivalence)]
29. Terminology and Characteristics
- Pristine Pool CP: any Ci ∈ C is a member of CP at generation G if and only if [condition not recoverable]
- Suspect Pool CS: any Ci ∈ C is a member of CS at generation G if and only if at least one of [conditions not recoverable]
- Under Repair Pool CU: any Ci ∈ C is a member of CU at generation G if and only if [condition not recoverable]
- Refurbished Pool CR: after a genetic operator is applied, the newly generated individual is a member of CR at generation G if and only if [condition not recoverable]
- ED is the discrepancy count of Ci and EC is the correctness count of Ci
- Length of the fitness evaluation window: EW = ED + EC
- Fitness metric: f(Ci) = EC / EW
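The counters and the fitness metric defined here translate directly into a small bookkeeping class. The class name `FitnessWindow` and its methods are assumptions for illustration; only the relations EW = ED + EC and f(Ci) = EC / EW come from the slide.

```python
class FitnessWindow:
    """Accumulates correctness and discrepancy counts for one configuration."""

    def __init__(self, window):
        self.window = window   # evaluation window length, e.g. 600
        self.e_c = 0           # correctness count EC
        self.e_d = 0           # discrepancy count ED

    def record(self, discrepant):
        """Record the outcome of one pairwise comparison."""
        if discrepant:
            self.e_d += 1
        else:
            self.e_c += 1

    @property
    def full(self):
        """True once EW = ED + EC has reached the window length."""
        return self.e_c + self.e_d >= self.window

    def fitness(self):
        """f(Ci) = EC / EW, with EW = ED + EC."""
        e_w = self.e_c + self.e_d
        return self.e_c / e_w

w = FitnessWindow(600)
for k in range(600):
    w.record(discrepant=(k < 6))   # hypothetical run with 6 discrepancies
```

In this hypothetical run the fitness evaluates to 594/600, i.e. 0.99.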
30. Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
- Initialization
  - Population P of functionally-identical yet physically-distinct configurations
  - Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2, to designate physical FPGA left-half or right-half resource utilization
- Fitness Assessment
  - The discrepancy operator is some function of bitwise agreement between each half's output
  - Four fitness states are defined for configurations, CP, CS, CU, CR, with transitions: Pristine, Suspect, Under Repair, Refurbished, respectively
  - Fitness Evaluation Window W determines the comparison interval
- Regeneration
  - Genetic operators are used to recover from a fault, based on the Reintroduction Rate
  - Operators are applied only once, then the offspring is returned to service without concern for increasing fitness
Fitness assessment via pairwise discrepancy (temporal voting vs. spatial voting)
31. State Transitions during the Lifetime of the i-th Half-Configuration
[Figure: Configuration Health States]
32. Procedural Flow under Competitive Runtime Reconfiguration
- Integrates all fault-handling stages using the EC strategy
  - Detects faults by the occurrence of discrepancy
  - Isolates faults by accumulation of discrepancies
  - Failure-specific refurbishment using genetic operators: Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation
- Realizes online device refurbishment
  - Refurbished online without additional functional or resource test vectors
  - Repair during the normal data throughput process
33. Fitness Evaluation Window
- Fitness Evaluation Window W
  - denotes the number of iterations used to evaluate fitness before the state of an individual is determined
- Determination of W for the 3x3 multiplier
  - 6 input pins articulating 2^6 = 64 possible inputs
  - W should be selected so that all possible inputs appear
- More formally,
  - Let rand(X) return some xi ∈ X at random
  - Seek W such that W draws of rand(X) cover all of X with high probability
  - xK: number of distinct orderings of K inputs showing in D trials
  - if D is constant, PK for K > 1 can be calculated successively
  - the probability PK of K inputs showing after D trials is the ratio xK / K^D
34. W Determination
When K = 64
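The ratio xK / K^D can be evaluated exactly with inclusion-exclusion, where xK counts the length-D input sequences in which all K inputs appear. The sketch below assumes inputs are drawn independently and uniformly at random; the function names are illustrative, and no result for K = 64 is quoted here.

```python
from fractions import Fraction
from math import comb

def coverage_probability(k, d):
    """P_K = x_K / K^D: probability that all k inputs appear in d uniform
    random trials, with x_K counted by inclusion-exclusion."""
    x_k = sum((-1) ** j * comb(k, j) * (k - j) ** d for j in range(k + 1))
    return Fraction(x_k, k ** d)

def smallest_window(k, target):
    """Smallest d for which the coverage probability reaches `target`."""
    d = k                      # fewer than k trials can never cover k inputs
    while coverage_probability(k, d) < target:
        d += 1
    return d
```

Calling `smallest_window(64, Fraction(99, 100))` gives a candidate lower bound on W for the 3x3 multiplier under these assumptions.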
35. Integer Multiplier Case Study
- 3-bit x 3-bit unsigned multiplier, automated design
  - Building blocks
    - Half-Adder: 18 templates created
    - Full-Adder: 24 templates
    - Parallel-And: 1 template created
  - Randomly select templates for instantiation in modules
GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: population size 20 individuals; crossover rate 5%; mutation rate up to 80% per bit
Experimental evaluation: Xilinx Virtex-II Pro on Avnet PCI board
Experiments demonstrate
- Objective fitness function replaced by the Consensus-based Evaluation approach and relative fitness
- Elimination of additional test vectors
- Temporal assessment process
36. Template Fault Coverage
[Figures: Half-Adder Template A; Half-Adder Template B]
- Template A
  - Gate3 is an AND gate
  - Will lose correctness if a stuck-at-zero fault occurs on the second input line of Gate3
- Template B
  - Gate3 is a NOT gate and only uses the first input line
  - Will work correctly even if the second input line is stuck at zero or one
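The difference between the two templates comes down to whether Gate3 reads its second input line. A minimal sketch of that check is shown below; it models only a single two-input gate site, not the full half-adder templates, and the function names are illustrative.

```python
def gate_output(kind, in1, in2, stuck=None):
    """Output of a gate site; `stuck` forces the second input line to 0 or 1."""
    if stuck is not None:
        in2 = stuck
    if kind == "AND":
        return in1 & in2
    if kind == "NOT":
        return 1 - in1          # uses only the first input line
    raise ValueError(f"unknown gate kind: {kind}")

def tolerates(kind, stuck):
    """True if the stuck-at fault on input 2 never changes the gate output."""
    return all(gate_output(kind, a, b, stuck) == gate_output(kind, a, b)
               for a in (0, 1) for b in (0, 1))
```

Here `tolerates("AND", 0)` is false while `tolerates("NOT", 0)` and `tolerates("NOT", 1)` are true, matching the behavior described for Templates A and B.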
37. Regeneration Performance
Parameters
- Difference (vs. Hamming distance)
- Evaluation window: Ew = 600
- Suspect threshold: 1 - 6/600 = 99%
- Repair threshold: 1 - 4/600 = 99.3%
- Re-introduction rate: 0.1
Repairs evolved in-situ, in real time, without additional test vectors, while allowing the device to remain partially online.
38. Isolation of a single faulty individual with 1-out-of-64 impact
- Outliers are identified after W iterations have elapsed
- Expected value = (1/64) × 600 = 9.375 for a minimum-impact faulty individual
- An isolated individual's f differs from the average DV by 3σ after 1 or more observation intervals of length W