Title: Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment
1 Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment
Carthik Anand Sharma
University of Central Florida
7 July 2008
2 Motivation
- Mission-critical embedded systems require high reliability and availability
- Characteristics of the operating environment may induce hardware failures
- Aging, manufacturing defects, etc.
- System Reliability
- Fault Avoidance. Always possible? No
- Design Margin. Always adequate? No
- Modular Redundancy. Always recoverable? No
- Fault Refurbishment. Highly flexible? Yes, but technically challenging to achieve
3 Technical Objective: Autonomous FPGA Regeneration
NASA Moon, Mars, and Beyond: realize 10s of years of service life
Increased availability without pre-configured spares
Reconfiguration allows new fault-handling paradigm
Redundancy vs. Regeneration (everyday example: spare tire vs. can of fix-a-flat)
- Overhead from Unutilized Spares (weight, size, power): Redundancy increases with amount of spare capacity; Regeneration is weakly related to number / recovery capacity
- Granularity of Fault Coverage (resolution where fault handled): Redundancy is restricted at design-time; Regeneration is variable at recovery-time
- Fault-Resolution Latency (availability via downtime required to handle fault): Redundancy is based on time required to select spare resource; Regeneration is based on time required to find suitable recovery
- Quality of Repair (likelihood and completeness): Redundancy is determined by adequacy of spares available; Regeneration is affected by multiple characteristics (+ or -)
- Autonomous Operation (recover without outside intervention): Redundancy yes; Regeneration yes
4 Fault-Handling Techniques for SRAM-based FPGAs
Reprogrammable Device Failure Characteristics
Duration
Transient: SEU
Permanent: SEL, Oxide Breakdown, Electron Migration, LPD
Device Configuration
Processing Datapath
Device Configuration
Processing Datapath
Target
BIST
Evolutionary
Repetitive Readback Wells00
TMR (conventional spatial redundancy)
Approach
STARS Abramovici01
CED McCluskey04
Sussex Vigander01
CRR
Methods
Supplementary Testbench
Duplex Output Comparison
Duplex Output Comparison
Detection
(not addressed)
Cartesian Intersection
Isolation
(not addressed)
Bitwise Comparison
Majority Vote
unnecessary
Fast Run-time Location
Worst-case Clock Period Dilation
Diagnosis
unnecessary
unnecessary
Population-based GA using Extrinsic
Fitness Evaluation
Evolutionary Algorithm using Intrinsic
Fitness Evaluation
Recovery
Replicate in Spare Resource
Select Spare Resource
Invert Bit Value
Ignore Discrepancy
5 Contributions
- Strategy for integrating all phases of the fault-handling process
- detection, isolation, diagnosis and recovery work in synergy
- Elimination of Additional Test Vectors
- enables detection and isolation with minimal system downtime
- Autonomous Group Testing techniques for FPGA devices
- isolates faults in FPGA while maintaining system performance
- Competitive Runtime Reconfiguration
- leverages iterative pairwise comparison and functional regeneration to provide adaptive refurbishment with resource recycling
6 Previous Work
- Detection Characteristics of FPGA Fault-Handling Schemes
Strategies
1) Evolve redundancy into design before anticipated failure
2) Redesign after detection of failure
3) Combine desirable aspects of both strategies 1) and 2)
7 Group Testing Algorithms
- Origin: World War II blood testing
- Problem: test samples from millions of new recruits
- Solution: test blocks of samples before testing individual samples
- Problem Definition
- Identify subset Q of defectives from set P
- Minimize number of tests
- Test v-subsets of P
- Form suitable blocks
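The block-testing idea can be sketched as adaptive binary splitting: test a whole block, and subdivide only the blocks that test positive. This is a minimal illustration under stated assumptions, not the specific algorithm used later in the talk. The names `find_defectives` and `is_block_defective` are hypothetical, and the test oracle is assumed to be noise-free.

```python
def find_defectives(items, is_block_defective):
    """Return (defectives, number_of_tests) using adaptive binary splitting.

    `is_block_defective(block)` is an assumed oracle that returns True when
    at least one item in `block` is defective.
    """
    tests = 0
    defectives = []
    pending = [list(items)]          # blocks that still need a test
    while pending:
        block = pending.pop()
        if not block:
            continue
        tests += 1
        if not is_block_defective(block):
            continue                 # the whole block is cleared by one test
        if len(block) == 1:
            defectives.append(block[0])
            continue
        mid = len(block) // 2        # split a positive block and test both halves
        pending.append(block[mid:])
        pending.append(block[:mid])
    return sorted(defectives), tests


# Example: a set P of 64 samples with a hidden defective subset Q.
P = range(64)
Q = {5, 42}
found, tests = find_defectives(P, lambda block: any(x in Q for x in block))
```

With few defectives, the number of block tests stays well below the number of samples, which is the point of testing blocks before individuals.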
8 CRR Arrangement in SRAM FPGA
- Configurations in Population
- C = CL ∪ CR
- CL: subset of left-half configurations
- CR: subset of right-half configurations
- |CL| = |CR| = |C|/2
- Discrepancy Operator
- Baseline Discrepancy Operator is a dyadic operator with binary output
- Z(Ci) is the FPGA data throughput output of configuration Ci
- Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual
- Any fault in the checker lowers that individual's fitness, so that individual is no longer preferred and eventually undergoes repair
WTA
(Equivalence)
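A minimal sketch of the XNOR-based agreement check between a left-half and a right-half configuration, applied to a 3x3 multiplier with 6 output bits. The function names and the injected fault (least-significant product bit stuck at 1) are assumptions made for illustration; they are not taken from the slides.

```python
def outputs_agree(z_left, z_right, width):
    """Bitwise XNOR check: True when all `width` output bits match."""
    mask = (1 << width) - 1
    xnor = ~(z_left ^ z_right) & mask
    return xnor == mask


def count_discrepancies(left, right, inputs, width):
    """Count the inputs on which the two configurations disagree."""
    return sum(0 if outputs_agree(left(a, b), right(a, b), width) else 1
               for a, b in inputs)


def healthy_multiplier(a, b):
    return a * b


def faulty_multiplier(a, b):
    # Hypothetical fault: least-significant product bit stuck at 1.
    return (a * b) | 1


all_inputs = [(a, b) for a in range(8) for b in range(8)]   # 2^6 = 64 inputs
discrepant = count_discrepancies(healthy_multiplier, faulty_multiplier,
                                 all_inputs, width=6)
```

Under this hypothetical fault the two halves disagree on every input whose product is even, so the fault is articulated by most, but not all, runtime inputs.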
9 Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
- Initialization
- Population P of functionally-identical yet physically-distinct configurations
- Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
- Fitness Assessment
- Discrepancy Operator is some function of bitwise agreement between each half's output
- Four Fitness States defined for Configurations as CP, CS, CU, CR, with transitions: Pristine, Suspect, Under Repair, Refurbished, respectively
- Fitness Evaluation Window W determines comparison interval
- Regeneration
- Genetic Operators used to recover from fault based on Reintroduction Rate
- Operators only applied once, then offspring returned to service without concern for increasing fitness
fitness assessment via pairwise discrepancy (temporal voting vs. spatial voting)
10 FPGA Genetic Representations
- Chromosome Goals
- Allow all possible LUT configurations
- Allow all possible CLB interconnections given constraints of routing support
- Disallow illegal FPGA configurations and non-coding introns (junk DNA)
- Facilitate crossover operator
- Bitstring representation is natural choice, though may not scale well (investigating generative reps)
- Representation shown here is sample specific to Xilinx Virtex FPGA
11 Competitive Runtime Reconfiguration (CRR)
Evolutionary Computation strategies are effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow
diverse alternatives working a-priori
fault detection by robust consensus over time
no test vectors
device remains online during repair
fault isolation is model-free and
self-calibrating
completely-repaired criteria can be ignored
graceful degradation via ranking of alternatives
no reconfiguration when fault-free
performance readily adjustable
failures in population memory covered
checking logic part of individual hence also
competes for correctness
12 Fitness Evaluation Window
- Fitness Evaluation Window E
- denotes number of iterations used to evaluate fitness before the state of an individual is determined
- Determination of E for 3x3 multiplier
- 6 input pins articulating 2^6 = 64 possible inputs
- W should be selected so that all possible inputs appear
- More formally,
- Let rand(X) return some x_i ∈ X at random
- Seek W such that the union of W draws of rand(X) equals X with high probability
- x_K = number of distinct orderings of K inputs showing in D trials
- if D constant, can calculate P_K for K > 1 successively
- probability P_K of K inputs showing after D trials is the ratio x_K / K^D
13 E Determination
When K = 64
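The window-size question can be explored numerically. The sketch below assumes inputs are drawn independently and uniformly from the K possible inputs, and it uses an inclusion-exclusion count in place of the successive calculation mentioned on the slide. The result has the same form as the ratio x_K / K^D: the number of length-D input sequences that contain all K inputs, divided by K^D. The function names are hypothetical.

```python
from math import comb


def prob_all_inputs_seen(k, d):
    """Probability that d uniform random draws from k inputs include every input."""
    # Inclusion-exclusion count of length-d sequences that use all k inputs.
    covering = sum((-1) ** j * comb(k, j) * (k - j) ** d for j in range(k + 1))
    return covering / k ** d


def smallest_window(k, target):
    """Smallest number of draws whose coverage probability reaches `target`."""
    d = k
    while prob_all_inputs_seen(k, d) < target:
        d += 1
    return d


p_600 = prob_all_inputs_seen(64, 600)    # coverage for K = 64 and a window of 600
w_99 = smallest_window(64, 0.99)         # window needed for 99% coverage
```

Exact integer arithmetic is used for the count so that the alternating sum does not lose precision before the final division.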
14 Integer Multiplier Case Study
- 3-bit x 3-bit unsigned multiplier automated design
- Building blocks
- Half-Adder: 18 templates created
- Full-Adder: 24 templates
- Parallel-And: 1 template created
- Randomly select templates for instantiation in modules
GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: Population size 20 individuals; Crossover rate 5%; Mutation rate up to 80% per bit
Experimental Evaluation: Xilinx Virtex II Pro on Avnet PCI board
Experiments Demonstrate
- Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
- Elimination of additional test vectors
- Temporal Assessment process
15 Regeneration Performance
System Throughput during Regeneration for a 3x3 multiplier
Exp. Number | Fault Location | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Throughput (%)
1 | CLB3,LUT0,Input1 | Stuck-at-1 | 52 / 64 | 1.7 × 10^7 | 4.2 × 10^5 | 1194 | 64 / 64 | 97.7
2 | CLB6,LUT0,Input1 | Stuck-at-0 | 33 / 64 | 8.0 × 10^5 | 1.7 × 10^4 | 47 | 64 / 64 | 97.9
3 | CLB5,LUT2,Input0 | Stuck-at-1 | 22 / 64 | 3.1 × 10^6 | 6.8 × 10^4 | 193 | 64 / 64 | 97.8
4 | CLB7,LUT2,Input0 | Stuck-at-0 | 38 / 64 | 8.1 × 10^6 | 1.8 × 10^5 | 513 | 64 / 64 | 97.7
5 | CLB9,LUT0,Input1 | Stuck-at-0 | 40 / 64 | 2.3 × 10^6 | 7.1 × 10^4 | 219 | 64 / 64 | 96.9
Average | | | 32.6 / 64 | 6.4 × 10^6 | 1.5 × 10^5 | 433 | 64 / 64 | 97.6
Parameters
- Difference (vs. Hamming Distance)
- Evaluation Window, Ew = 600
- Suspect Threshold = 1 - 6/600 = 99%
- Repair Threshold = 1 - 4/600 = 99.3%
- Re-introduction rate = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
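The Throughput column appears consistent, to within rounding of the tabulated values, with the fraction of non-discrepant iterations, 1 - discrepant/total. That reading is an inference from the numbers rather than a definition stated on the slide. The short check below compares the two; the function name is hypothetical.

```python
# (total iterations, discrepant iterations, reported throughput in %) per experiment
experiments = {
    1: (1.7e7, 4.2e5, 97.7),
    2: (8.0e5, 1.7e4, 97.9),
    3: (3.1e6, 6.8e4, 97.8),
    4: (8.1e6, 1.8e5, 97.7),
    5: (2.3e6, 7.1e4, 96.9),
}


def estimated_throughput(total, discrepant):
    """Percentage of iterations that produced no discrepancy (assumed definition)."""
    return 100.0 * (1.0 - discrepant / total)


# Difference between the estimate and the reported value, in percentage points.
residuals = {
    number: round(estimated_throughput(total, discrepant) - reported, 2)
    for number, (total, discrepant, reported) in experiments.items()
}
```

Because the iteration counts are tabulated to two significant figures, small residuals are expected even if the assumed definition is the one the author used.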
16 Isolation Problem Outline
- Objectives
- Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
- Online Fault Isolation: device not entirely removed from service
- Features
- Runtime Reconfiguration: FPGA resources configured dynamically
- Utilize Runtime Inputs: avoid special test-vectors, improve availability
- Constraints
- Use pre-designed configurations defined by target application
- Subsets under test have constant resource utilization range for a given isolation problem
- Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults
- Do not use specialized block designs
- Runtime reconfiguration initially limited to column-swapping
- Non-reasonable algorithm: tests may be repeated without gaining new isolation information
17 Discrepancy Mirror
- Mechanism for Checking-the-Checker (golden element problem)
- Makes checker part of configuration that competes [DeMara PDPTA-05]
Fault Coverage
18 Influence of LUT utilization
Perpetually Articulating Inputs with Equiprobable Distribution
Intermittently Articulating Inputs with Equiprobable Distribution
- expected number of pairings grows sub-linearly in number of resources
- utilization below 20% or above 80% implicates (or exonerates) a smaller subset of resources
- at 50% utilization, the expected number of pairings for 1,000, 10,000, and 100,000 resources is 11.1, 14.9, and 17.6, respectively
- at 90% utilization, a mean of 258 pairings is required to isolate the faulty resource
19 Fault Location Using Dueling
- The set of all competing configurations is represented by S.
- Set Ck represents the resources utilized by configuration k.
- Each competing configuration k, 1 ≤ k ≤ |S|, has a unique binary Usage Matrix Uk, 1 ≤ k ≤ p.
- Elements Uk[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, where m and n represent the rows and columns in the device layout respectively.
- Elements Uk[i,j] = 1 denote the usage of resource (i, j) by Ck.
- The History Matrix H, with elements H[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, is an integer matrix used to represent the relative fitness of individual resources.
- H[i,j] provides instantaneous relative fitness values of resources.
20 Dueling Example
H[i,j] at t = 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
U2
0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
U1
0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 1 0 0 0
0 0 1 1 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
H[i,j] at t = 2
0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 2 1 0 0 1 0 0 0
0 0 1 0 1 1 0 1 0 0
0 0 1 1 0 1 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
- H[i,j] changes after C1 and C2 are loaded
- U1 and U2 are the corresponding Usage Matrices
- (3,3) is identified as the faulty resource
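The example can be replayed in a few lines. The sketch below stores each usage matrix sparsely as a set of 1-indexed (row, column) cells transcribed from the two usage matrices shown, and it assumes both configurations produced discrepant outputs, which is what the history matrix at t = 2 implies. The names `usage_a`, `usage_b`, `update_history` and `prime_suspects` are hypothetical.

```python
from collections import Counter

# Cells set to 1 in the two usage matrices of the example (1-indexed row, column).
usage_a = {(2, 5), (3, 3), (4, 6), (4, 8), (5, 4), (6, 3),
           (6, 6), (6, 7), (7, 5), (8, 3), (8, 8)}
usage_b = {(2, 4), (2, 6), (2, 7), (3, 3), (3, 4), (3, 7),
           (4, 3), (4, 5), (5, 3), (5, 6)}


def update_history(history, usage, discrepant):
    """Increment H[i,j] for every resource used by a discrepant configuration."""
    if discrepant:
        history.update(usage)


def prime_suspects(history):
    """Return the resources with the highest accumulated count."""
    if not history:
        return []
    worst = max(history.values())
    return sorted(cell for cell, count in history.items() if count == worst)


history = Counter()                       # history matrix H, all zero at t = 0
update_history(history, usage_a, discrepant=True)
update_history(history, usage_b, discrepant=True)
suspects = prime_suspects(history)        # the only shared cell is (3, 3)
```

A sparse counter is used here only to keep the example short; a dense m x n integer matrix behaves the same way.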
21 Isolation Progress without Halving
- Without Halving
- Initially S = 20,000
- Resource Utilization = 40%
- Number of suspected faulty elements constant at 36 after 23 iterations
- No subsequent improvement due to lack of differentiating information between competing configurations
22 Dueling with Modified Halving
- Dueling with Halving
- Halving works by swapping half the used columns with unused ones
- Halving progressively reduces the size of the set of suspected faulty elements
- Isolation proceeds until a single faulty element is isolated
- Fault isolated after 19 iterations
23 Enhancing Embedded Core BIST using Group Testing
BIST Structure Used for Embedded Core Testing
- XCVLX30 device: 32 DSP48E cores divided into n = 8 groups
- 8 x 6 2x1 multiplexers are needed
- 6 columns of comparators; each column has 8 comparators
- Comparators kn(i,j), 0 ≤ i,j ≤ 3, i ≠ j, complete the test for a group of 4
- Flip-flops FF0 through FF5 register comparison results for each group
- Fault diagnosis script processes the result of each set of 6 outputs
24 Embedded Core BIST using Group Testing
Resource Utilization
- Faults in up to 2 BUTs in each group of 4 can be isolated
- Isolation is achieved without device reconfiguration in a single stage
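One way to see why six pairwise comparators suffice for a group of four blocks under test (BUTs) is to enumerate the mismatch patterns. The sketch below is not the author's diagnosis script. It assumes a comparator reports a mismatch exactly when at least one of its two BUTs is faulty, so faulty blocks never agree with any other block. The names `predicted_mismatches` and `diagnose` are hypothetical.

```python
from itertools import combinations

GROUP_SIZE = 4
PAIRS = list(combinations(range(GROUP_SIZE), 2))   # the 6 comparators (i, j), i < j


def predicted_mismatches(faulty):
    """Comparators expected to flag a mismatch for a given set of faulty BUTs."""
    return frozenset(p for p in PAIRS if p[0] in faulty or p[1] in faulty)


def diagnose(mismatching_pairs, max_faults=2):
    """Return the fault set of size <= max_faults matching the observation, or None."""
    observed = frozenset(tuple(sorted(p)) for p in mismatching_pairs)
    for k in range(max_faults + 1):
        for candidate in combinations(range(GROUP_SIZE), k):
            if predicted_mismatches(set(candidate)) == observed:
                return set(candidate)
    return None


single = diagnose([(0, 1), (1, 2), (1, 3)])               # only BUT 1 is implicated
double = diagnose([p for p in PAIRS if p != (0, 3)])      # BUTs 1 and 2 are implicated
```

Under this assumption, zero, one and two faults produce zero, three and five mismatches respectively, and each pattern maps to exactly one fault set, which matches the claim that up to 2 faulty BUTs per group can be isolated in a single stage.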
25Logic Element Isolation Using Autonomous Group
Testing (AGT)
In each stage, suspect resources S are equally
shared among pstage individuals If S Smax then
mutually exclusive shares are possible,
else, nshare nreqd - R - S are shared
26Equal Share Strategy
27 Fault Isolation Using FIAT
- Fault Insertion and Analysis Toolkit (FIAT)
- provides methods to modify Xilinx FPGA configurations
- inserts stuck-at faults at LUT inputs
- precludes need to edit configuration bitstream
- works in conjunction with Xilinx ISE software (COTS design suite)
28 AGT Experiments
- Experimental Setup
- DES-56 encryption circuit
- Xilinx ISE design tools to place and route the design
- Virtex II Pro FPGA device
- Fault Injection and Analysis Toolkit (FIAT)
- Application Programmer Interfaces (APIs) to interact with the Xilinx ISE tools to inject and evaluate faults
- Editing the design file rather than the configuration bitstreams to introduce stuck-at faults
- Editing User Constraint Files (UCF) to control resource usage
29AGT Isolation Progress
30 AGT Maintaining Goodput
- With p_preset = 5, goodput is maintained at > 90%
- Since goodput remains high, the rate of fault isolation is slower, with better-performing individuals selected to maintain goodput
- Fault detection latency is minimal compared to STARS; isolation is achieved with manageable system performance degradation
31 Conclusion
- Graceful Performance Degradation
- elimination of additional test vectors
- temporal assessment using aging and outlier detection
- resource recycling to utilize residual functionality
- Population-Centric Assessment
- Provides adaptability and self-calibrating autonomy with a relative assessment method
- fitness assessment using population information and competition
- create a fully functional solution using partially-fit individuals
- Autonomous Group Testing
- Minimal latency fault detection
- Fault isolation without additional test vectors
- Efficient strategies for fast fault isolation with minimal reconfiguration
- Fast first-responder to faults via resource tracking
- Run-time Fault Management
- Can be realized using consensus-driven assessment methods, and using information contained in the population
- Integrate Detection, Isolation, Repair under a single Population-based technique
32 Future Work
- Evolvable Sequential Logic Circuits
- Fitness assessment is a major challenge for large circuits
- Logic and Interconnect fault handling
- Need to integrate fault handling methods for faults in logic and the interconnects
- Extend group testing principles to interconnect faults
- Challenges in partial reconfiguration
- Need well-tested and supported APIs for runtime reconfiguration of commercial FPGAs
- Open standards in partial reconfiguration will assist reliability studies
- Decreased dependence on vendor-provided design tools with an open bitstream structure is essential
- FIAT can be used to study fault isolation properties of different approaches, and for evaluating other group testing algorithms for fault isolation
- Extending AGT to other domains
- Group testing techniques presented here can be adapted for fault-tolerant nano-scale mechanisms, software, etc.
- Reliable, self-monitoring, self-adaptive organic systems are needed as design complexity and computational capabilities increase
33 Publications
- Michael Georgiopoulos, Ronald F. DeMara, Avelino J. Gonzalez, Annie S. Wu, Mansooreh Mollaghasemi, Erol Gelenbe, Marcella Kysilka, Jimmy Secretan, Carthik A. Sharma and Ayman J. Alnsour, "A Sustainable Model for Integrating Current Topics in Machine Learning Research into the Undergraduate Curriculum," accepted to the IEEE Transactions on Education, July 2008.
- A. Sarvi, C. A. Sharma and R. F. DeMara, "BIST-Based Group Testing for Diagnosis of Embedded FPGA Cores," accepted to The 2008 International Conference on Embedded Systems and Applications, Las Vegas, Nevada, USA, July 14-17, 2008.
- C. A. Sharma, R. F. DeMara and A. Sarvi, "Self-Healing Reconfigurable Logic using Autonomous Group Testing," submitted to ACM Transactions on Autonomous and Adaptive Systems (TAAS), Special Issue on Organic Computing, May 2007.
- R. F. DeMara, K. Zhang, C. A. Sharma, "Consensus-based Evolvable Hardware for Sustainable Fault Handling," submitted to the IEEE Transactions on Evolutionary Computation, Aug 2007.
- R. N. Al-Haddad, C. A. Sharma, R. F. DeMara, "Performance Evaluation of Two Allocation Schemes for Combinatorial Group Testing Fault Isolation," in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 07), Las Vegas, Nevada, U.S.A., June 25-28, 2007.
- R. S. Oreifej, C. A. Sharma, R. F. DeMara, "Expediting GA-Based Evolution Using Group Testing Techniques for Reconfigurable Hardware," in Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGAs (Reconfig06), San Luis Potosi, Mexico, September 20-22, 2006, pp. 106-113.
- C. A. Sharma, R. F. DeMara, "A Combinatorial Group Testing Method for FPGA Fault Location," in Proceedings of the International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 23-25, 2006.
- C. J. Milliord, C. A. Sharma, R. F. DeMara, "Dynamic Voting Schemes to Enhance Evolutionary Repair in Reconfigurable Logic Devices," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig05), pp. 8.1.1-8.1.6, Puebla City, Mexico, September 28-30, 2005.
- K. Zhang, R. F. DeMara, C. A. Sharma, "Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration," in Proceedings of the International Conference in Evolvable Systems (ICES05), pp. 12-24, Barcelona, Spain, September 12-14, 2005.
- R. F. DeMara and C. A. Sharma, "Self-Checking Fault Detection using Discrepancy Mirrors," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA05), pp. 311-317, Las Vegas, Nevada, U.S.A., June 27-30, 2005.
34Backup Slides
35 Isolation Block Duelling
- Algorithm based on group testing methods
- Successive intersection to assess health of resources
- Each configuration k has a binary Usage Matrix Uk[i,j], 1 ≤ i ≤ m and 1 ≤ j ≤ n
- m, n are the number of rows and columns of resources in the device
- Elements Uk[i,j] = 1 are resources used in k
- History Matrix H[i,j], 1 ≤ i ≤ m and 1 ≤ j ≤ n, initially all zero, exists in which
- entries represent the fitness of resources (i, j)
- information regarding the fitness of resources over time is stored
- A discrepant output will lead to an increase in the value of H[i,j], ∀ Uk[i,j] = 1, k ∈ S
- All elements of H corresponding to resources used by a discrepant configuration will be incremented by one
- At any point in time, H[i,j] will be a record of the outcomes of competitions
- Successive intersections are performed until |S| = 1
36 Isolation of a single faulty individual with 1-out-of-64 impact
- Outliers are identified after W iterations have elapsed
- E.V. = (1/64) × 600 = 9.375 discrepancies from minimum-impact faulty individual
- Isolated individual's f differs from the average DV by 3σ after 1 or more observation intervals of length W
37 Isolation of a single faulty L individual with 10-out-of-64 impact
- Compare with 1-out-of-64 fault impact
- E.V. of (10/64) × 600 = 93.75 discrepancies for faulty configuration
- One isolation will be complete approx. once in every 93.75/5 ≈ 19 Observation Intervals
- Fault isolation demonstrated in 100% of cases
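The expected-value figures and the three-sigma outlier rule can be sketched as follows. The discrepancy values in `population_dv` are made-up illustration data for a population of 20, not measurements from the experiments, and the function names are hypothetical. The population standard deviation is used as the spread estimate, which is an assumption.

```python
from statistics import mean, pstdev


def expected_discrepancies(impact, input_space, window):
    """Expected discrepancies in one window when `impact` of `input_space` inputs articulate the fault."""
    return impact / input_space * window


def outliers(discrepancy_values, n_sigma=3.0):
    """Indices of individuals whose discrepancy value exceeds the mean by n_sigma deviations."""
    mu = mean(discrepancy_values)
    sigma = pstdev(discrepancy_values)
    return [i for i, dv in enumerate(discrepancy_values)
            if dv > mu + n_sigma * sigma]


ev_min = expected_discrepancies(1, 64, 600)     # 1-out-of-64 impact
ev_high = expected_discrepancies(10, 64, 600)   # 10-out-of-64 impact

# Hypothetical discrepancy values: 19 clean individuals and one faulty one.
population_dv = [0] * 19 + [9]
flagged = outliers(population_dv)
```

When several individuals are faulty, the population mean and spread both rise, which is one way to read the slide's remark that outlier isolation becomes difficult with scattered discrepancies.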
38 Isolation of 8 faulty individuals (L4, R4) with 1-out-of-64 impact
- Expected isolations do not occur approximately 40% of the time
- Average discrepancy value of the population is higher
- Outlier isolation is difficult
- Multiple faulty individuals, discrepancies scattered
39 Online Dueling Evaluation
- Objective
- Isolate faults by successive intersection between sets of FPGA resources used by configurations
- Analyze complexity of Isolation process
- Variables
- Total resources available
- Measured in number of LUTs
- Number of Competing Configurations
- Number of initial Seed designs in CRR process
- Degree of Articulation
- Some inputs may not manifest faults, even if faulty resource used by individual
- Resource Utilization Factor
- Percentage of FPGA resources required by target application/design
- Number of Iterations for Isolation
- Measure of complexity and time involved in isolating fault
40 For further info: EH Website http://cal.ucf.edu
41 Fast Reconfiguration for Autonomously Reprogrammable Logic
- Motivation
- Dynamic reconfiguration required by application
- Exploit architectural performance improvements fully
- Reconfiguration delay a major performance barrier
- Previous Work
- Methodology
- Multilayer Runtime Reconfiguration Architecture (MRRA)
- Spatial Management
- Prototype Development
- Loosely-Coupled solution
- Timing Analysis
- System-On-Chip solution
42 Reconfiguration Demand during CRR
- For a complete repair
- Approximately 2,000 generations may be required
- For each generation, up to 100 evaluations may be required
- Yielding a Cumulative Number of Reconfigurations (CNR) of up to 2,000 × 100 = 200,000
- Each reconfiguration task incurs a reconfiguration delay
- Therefore, the total delay Ltot grows with CNR times the per-task delay
Even if reconfiguration delay alone is assumed to be on the order of tens or hundreds of milliseconds, Ltot > 5.5 hours
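The 5.5-hour figure can be reproduced with simple arithmetic if the total delay is taken as the cumulative number of reconfigurations multiplied by a fixed per-reconfiguration delay. That product form and the 100 ms delay used below are assumptions chosen to match the slide's numbers; the function name is hypothetical.

```python
def total_reconfiguration_delay(generations, evaluations_per_generation,
                                delay_per_reconfiguration_s):
    """Return (CNR, total delay in hours) assuming one reconfiguration per evaluation."""
    cnr = generations * evaluations_per_generation
    hours = cnr * delay_per_reconfiguration_s / 3600.0
    return cnr, hours


# 2,000 generations, up to 100 evaluations each, 100 ms per reconfiguration (assumed).
cnr, hours = total_reconfiguration_delay(2000, 100, 0.1)
```

With these inputs the cumulative count is 200,000 reconfigurations and the delay is a little over five and a half hours, which is why reconfiguration latency dominates repair time.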
43 Previous Work - Algorithm Level
Approach | Method | Partial Reconfig | Spatial Relocation | Temporal Parallelism | Area shape | Run-Time | Potential Limitations
Hauck, Li, Schwabe | Bit file compression | N/A | No | N/A | N/A | No | Full reconfiguration required
Shirazi, Luk, Cheung | Identifying common components | Yes | No | Yes | N/A | No | Design time work required
Mak, Young | Dynamic Partitioning | Yes | No | Yes | N/A | Yes | Only desirable for large designs
Ganesan, Vemuri | Pipelining | Yes | No | Yes | N/A | Yes | Limited pipeline depth
Compton, Li, Knol, Hauck | Relocation and Defragmentation with new FPGA architecture | Yes | Yes | No | Row-based | Yes | Special FPGA architecture required
Diessel, Middendorf, Schmeck, Schmidt | Task Remapped and Relocated | Yes | Yes | No | Rectangle | Yes | Overhead for remapping calculations
Herbert, Christoph, Macro | Partitioning and 2D Hashing | Yes | Yes | Yes | Rectangle | Yes | Rigid task modeling assumptions
compression method
temporal method
spatial method
44 Multilayer Runtime Reconfiguration Architecture (MRRA)
- Develop MRRA fast reconfiguration paradigm for the CRR approach
- Validate with real hardware platform along with detailed performance analysis
- First general-purpose framework for a wide variety of applications requiring dynamic reconfiguration
- Extend existing theories on reconfiguration
45Loosely Coupled Solution
The Virtex-II Pro is mounted on a development
board which can then be interfaced with a
WorkStation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis
46 Result Assessment
- Establish full functional framework of both prototypes
- Communication overhead, throughput and overall speed-up analysis
- Communication overhead for the SOC solution is decreased to microsecond or sub-microsecond order, vs. millisecond order for the Loosely Coupled solution
- Up to 5-fold speedup is expected compared to the Loosely Coupled solution
- Translation Complexity Analysis
- The quantity of information that needs to be translated to generate the reconfiguration bitstream
- Simplification from file level to bit level is expected
- Storage Complexity Analysis
- The memory space required for the run-time algorithms
- Decreased memory requirement is expected due to the translation complexity improvement
47 Publications
- Accepted Manuscripts
- R. F. DeMara and K. Zhang, "Autonomous FPGA Fault Handling through Competitive Runtime Reconfiguration," to appear in NASA/DoD Conference on Evolvable Hardware (EH05), Washington D.C., U.S.A., June 29 - July 1, 2005.
- H. Tan and R. F. DeMara, "A Device-Controlled Dynamic Configuration Framework Supporting Heterogeneous Resource Management," to appear in International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA05), Las Vegas, Nevada, U.S.A., June 27-30, 2005.
- R. F. DeMara and C. A. Sharma, "Self-Checking Fault Detection using Discrepancy Mirrors," to appear in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA05), Las Vegas, Nevada, U.S.A., June 27-30, 2005.
- Submitted Manuscripts
- R. F. DeMara and K. Zhang, "Populational Fault Tolerance Analysis Under CRR Approach," submitted to International Conference on Evolvable Systems (ICES05), Barcelona, Sept. 12-14, 2005.
- R. F. DeMara and C. A. Sharma, "FPGA Fault Isolation and Refurbishment using Iterative Pairing," submitted to IFIP VLSI-SOC Conference, Perth, W. Australia, October 17-19, 2005.
- Manuscripts In-preparation
- R. F. DeMara and K. Zhang, "Autonomous Fault Occlusion through Competitive Runtime Reconfiguration," submission planned to IEEE Transactions on Evolutionary Computation.
- R. F. DeMara and C. A. Sharma, "Multilayer Dynamic Reconfiguration Supporting Heterogeneous FPGA Resource Management," submission planned to IEEE Design and Test of Computers.
- Field Testing
- Implementation of CRR on-board SRAM-based FPGA in a Cubesat mission
48 EHW Environments
- Evolvable Hardware (EHW) Environments enable experimental methods to research soft computing intelligent search techniques
- EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process
Extrinsic Evolution
Intrinsic Evolution
Application
Two modes of Evolvable Hardware
or
Genetic Algorithm
Genetic Algorithm
Stardust Satellite: >100 FPGAs onboard; hostile environment: radiation, thermal stress. How to achieve reliability to avoid mission failure?
Simulation in the loop
Hardware in the loop
Done? Build it
software model
new approach to Autonomous Repair of failed
devices
device design-time refinement
device run-time refinement
49 Genetic Algorithms (GAs)
- Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)
start
replacement
offspring
population of candidate solutions
evaluate fitness of individuals
Fitness function
mutation
crossover
selection of parents
parents
Goal reached
50 Genetic Mechanisms
- Guided trial-and-error search techniques using principles of Darwinian evolution
- iterative selection, survival of the fittest
- genetic operators: mutation, crossover
- implementor must define fitness function
- GAs frequently use strings of 1s and 0s to represent candidate solutions
- if 100101 is better than 010001 it will have more chance to breed and influence future population
- GAs cast a net over entire solution space to find regions of high fitness
- Can invoke Elitism Operator (E1, E2)
- guarantees monotonically increasing fitness of best individual over all generations
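A minimal bitstring GA with a single-elite elitism operator can illustrate these mechanisms. Everything below is an illustrative sketch: the tournament selection, single-point crossover, parameter values and the `evolve` name are assumptions, and the fitness function simply counts 1 bits rather than evaluating any FPGA configuration.

```python
import random


def evolve(fitness, n_bits, pop_size=20, generations=50,
           mutation_rate=0.02, seed=0):
    """Evolve bitstrings and return (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_history.append(fitness(ranked[0]))
        next_population = [ranked[0][:]]            # elitism: best survives unchanged
        while len(next_population) < pop_size:
            # Tournament selection: the fitter of two random individuals breeds.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            cut = rng.randrange(1, n_bits)          # single-point crossover
            child = mother[:cut] + father[cut:]
            # Per-bit mutation flips each bit with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    best_history.append(fitness(best))
    return best, best_history


# Toy fitness: number of 1 bits in a 16-bit chromosome.
best, history = evolve(fitness=sum, n_bits=16)
```

Because the best individual is copied unchanged into each new generation, the recorded best fitness can never decrease, which is the guarantee the elitism bullet refers to.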
51 GA Success Stories
- Commercial Applications
- Nextel: frequency allocation for cellular phone networks, $15M predicted savings in NY market
- Pratt & Whitney: turbine engine design; engineer 8 weeks, GA 2 days w/ 3x improvement
- International Truck: production scheduling improved by 90% in 5 plants
- NASA: superior Jupiter trajectory optimization, antennas, FPGAs
- Koza: 25 instances showing human-competitive performance such as analog circuit design, amplifiers, filters
52Representing Candidate Solutions
- Representation of an individual can be using
discrete values (binary, integer, or any other
system with a discrete set of values) -
- Example of Binary DNA Encoding
Individual (Chromosome)
GENE
53Genetic Operators
t
t 1
selection
reproduction
54Crossover Operator
Population
offspring
55 Procedural Flow under Competitive Runtime Reconfiguration
- Integrates all fault handling stages using EC strategy
- Detects faults by the occurrence of discrepancy
- Isolates faults by accumulation of discrepancies
- Failure-specific refurbishment using Genetic Operators
- Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation
- Realize online device refurbishment
- Refurbished online without additional function or resource test vectors
- Repair during the normal data throughput process
56 Template Fault Coverage
Half-Adder Template A
Half-Adder Template B
- Template A
- Gate3 is an AND gate
- Will lose correctness if a Stuck-At-Zero fault occurs in the second input line of Gate3
- Template B
- Gate3 is a NOT gate and only uses the first input line
- Will work correctly even if the second input line is stuck at Zero or One
57 Evolvable Hardware
- Evolutionary Design
- Start with available CLBs and IOBs
- Implement a design using Genetic Operators etc.
Limited or no ability to re-design to account for suspected faulty resources
- Evolutionary Regeneration
- Start with an existing pool of designs
- Some existing configurations may use faulty resources
- Eliminate use of suspected faulty resources
- Genetic Operators can be applied to refurbish designs
58 Competitive Runtime Reconfiguration (CRR) Overview
- Uses a Relative Fitness Measure
- Pairwise discrepancy checking yields relative fitness measure
- Broad temporal consensus in the population used to determine fitness metric
- Transition between Fitness States occurs in the population
- Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure
- Test Inputs = Normal Inputs for Data Throughput
- CBE utilizes neither additional functional nor resource test vectors
- Potential for higher availability as regeneration is integrated with normal operation
59 Exploiting Population Information
- Population contains more robust information than individuals
- Utilize this information for robust fault detection, faster regeneration, increased diversity for adaptation
- Detect Failure and Isolate Faulty Resources
- Detect by inconsistencies among the population
- Isolate faults using outlier identification and aging
- Realize Regeneration
- Recovery Complexity << Design Complexity
- utilize diverse raw material during regeneration vs. isolated re-design
- Temporal consensus directs search
- Adaptable Performance based on Online Inputs
- The population evolves to changing physical environment, input vectors, and target application while increasing availability
60Selection Process
61Fitness Adjustment Procedure
62 Discrepancy Mirror Circuit
Fault Coverage
Component | Fault Scenarios | Fault Scenarios | Fault Scenarios | Fault Scenarios | Fault-Free
Function Output A | Fault | Correct | Correct | Correct | Correct
Function Output B | Correct | Fault | Correct | Correct | Correct
XNORA | Disagree (0) | Disagree (0) | Fault: Disagree (0) | Agree (1) | Agree (1)
XNORB | Disagree (0) | Disagree (0) | Agree (1) | Fault: Disagree (0) | Agree (1)
BufferA | 0 | 0 | High-Z | 0 | 1
BufferB | 0 | 0 | 0 | High-Z | 1
Match Output | 0 | 0 | 0 | 0 | 1
63CGT-Pruned GA Simulator
64Repair Progress