Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment

Description:

Title: A Genetic Representation for Evolutionary Fault Recovery in FPGAs Author: User Last modified by: Carthik Sharma Created Date: 8/17/2001 8:25:04 PM – PowerPoint PPT presentation

Slides: 65

Provided by: calUcfEd

Transcript and Presenter's Notes

1
Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment
Carthik Anand Sharma, University of Central Florida
7 July 2008
2
Motivation

Mission-critical embedded systems require high reliability and availability
Characteristics of the operating environment may induce hardware failures
Aging, manufacturing defects, etc.
System Reliability
Fault Avoidance. Always possible? No
Design Margin. Always adequate? No
Modular Redundancy. Always recoverable? No
Fault Refurbishment. Highly flexible? Yes, but technically challenging to achieve

3
Technical Objective: Autonomous FPGA Regeneration
NASA Moon, Mars, and Beyond: realize 10s of years of service life?
Increased availability without pre-configured spares
Reconfiguration allows a new fault-handling paradigm

Criterion | Redundancy | Regeneration
Everyday example | spare tire | can of fix-a-flat
Overhead from Unutilized Spares (weight, size, power) | increases with amount of spare capacity | weakly related to recovery capacity
Granularity of Fault Coverage (resolution where fault is handled) | restricted at design-time | variable at recovery-time
Fault-Resolution Latency (availability via downtime required to handle fault) | based on time required to select spare resource | based on time required to find suitable recovery
Quality of Repair (likelihood and completeness) | determined by adequacy of spares available | affected by multiple characteristics (+ or -)
Autonomous Operation (recover without outside intervention) | yes | yes
4
Fault-Handling Techniques for SRAM-based FPGAs
Reprogrammable Device Failure Characteristics
Duration: Transient (SEU); Permanent (SEL, oxide breakdown, electromigration, LPD)
Target: Device Configuration; Processing Datapath
Approaches: TMR (conventional spatial redundancy); BIST (STARS [Abramovici01]); CED [McCluskey04]; Repetitive Readback [Wells00]; Evolutionary (Sussex [Vigander01]); CRR
[Comparison chart: each approach is characterized by its Detection, Isolation, Diagnosis, and Recovery methods. Chart entries: Supplementary Testbench; Duplex Output Comparison; Cartesian Intersection; Bitwise Comparison; Majority Vote; Fast Run-time Location; Worst-case Clock Period Dilation; Population-based GA using Extrinsic Fitness Evaluation; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Replicate in Spare Resource; Select Spare Resource; Invert Bit Value; Ignore Discrepancy; (not addressed); unnecessary]
5
Contributions

Strategy for Integrating all phases of Fault
Handling process
detection, isolation, diagnosis and recovery work
in synergy
Elimination of Additional Test Vectors
enables detection and isolation with minimal
system downtime
Autonomous Group Testing techniques for FPGA
devices
isolates faults in FPGA while maintaining system
performance
Competitive Runtime Reconfiguration
leverages iterative pairwise comparison and
functional
regeneration to provide adaptive refurbishment
with resource recycling

6
Previous Work

Detection Characteristics of FPGA Fault-Handling Schemes

Strategies:
1) Evolve redundancy into design before anticipated failure
2) Redesign after detection of failure
3) Combine desirable aspects of both strategies 1) and 2)
7
Group Testing Algorithms

Origin: World War II blood testing
Problem: test samples from millions of new recruits
Solution: test blocks of samples before testing individual samples
Problem Definition
Identify subset Q of defectives from set P
Minimize number of tests
Test v-subsets of P
Form suitable blocks
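
The block-testing idea above can be illustrated with a minimal Python sketch. It assumes an adaptive halving scheme, which is one common group-testing strategy and not necessarily the one used in this work. The names `find_defectives` and `is_contaminated` are hypothetical; each call to `is_contaminated` stands for one pooled test.

```python
def find_defectives(items, is_contaminated):
    """Locate defective items by testing blocks and halving contaminated ones.

    is_contaminated(block) models a single pooled test: it returns True
    when the block contains at least one defective item.
    Returns (defectives_found, number_of_tests_used).
    """
    tests = 0

    def search(block):
        nonlocal tests
        tests += 1
        if not is_contaminated(block):
            return []            # whole block cleared by one test
        if len(block) == 1:
            return list(block)   # a single contaminated item is defective
        mid = len(block) // 2
        return search(block[:mid]) + search(block[mid:])

    found = search(list(items))
    return found, tests


# Illustrative data: 1,000 samples, two of which are defective.
defective = {17, 512}
found, tests = find_defectives(range(1000), lambda b: any(x in defective for x in b))
```

With two defectives among 1,000 samples, this sketch needs fewer than 100 pooled tests instead of 1,000 individual ones.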

8
CRR Arrangement in SRAM FPGA

Configurations in Population
$C = C_L \cup C_R$
$C_L$: subset of left-half configurations
$C_R$: subset of right-half configurations
$|C_L| = |C_R| = |C|/2$
Discrepancy Operator
Baseline discrepancy operator is a dyadic operator with binary output
$Z(C_i)$ is the FPGA data throughput output of configuration $C_i$
Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual
Any fault in the checker lowers that individual's fitness, so that individual is no longer preferred and eventually undergoes repair

[Figure labels: WTA, (Equivalence)]
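
The XNOR-based discrepancy check described on this slide can be sketched in Python as follows. This is an illustrative software model only: the function names and the 6-bit output width (the product width of a 3x3 multiplier) are assumptions, and the real checker is embedded in the FPGA configuration itself.

```python
def outputs_agree(z_left, z_right, width=6):
    """Return 1 when two outputs agree on every bit, else 0.

    Models a bitwise XNOR of Z(C_L) and Z(C_R) followed by an AND
    across all `width` bits.
    """
    mask = (1 << width) - 1
    xnor = ~(z_left ^ z_right) & mask
    return 1 if xnor == mask else 0


def count_discrepancies(output_pairs, width=6):
    """Count the iterations in which the two half-configurations disagree."""
    return sum(1 - outputs_agree(a, b, width) for a, b in output_pairs)
```

For example, `count_discrepancies([(15, 15), (49, 48), (0, 0)])` reports one discrepant iteration, because only the second pair differs.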
9
Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity

Initialization
Population P of functionally-identical yet physically-distinct configurations
Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
Fitness Assessment
Discrepancy operator is some function of bitwise agreement between each half's output
Four Fitness States defined for configurations as $C_P$, $C_S$, $C_U$, $C_R$ with transitions; respectively Pristine, Suspect, Under Repair, Refurbished
Fitness Evaluation Window W determines comparison interval
Regeneration
Genetic operators used to recover from fault based on the Reintroduction Rate
Operators are applied only once; offspring are then returned to service without concern about increasing fitness

fitness assessment via pairwise discrepancy
(temporal voting vs. spatial voting)
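
The four fitness states can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not the presented implementation: the transition rules are guessed from the state names, and the default limits reuse the thresholds quoted on the Regeneration Performance slide (Suspect Threshold 1 - 6/600, Repair Threshold 1 - 4/600, window of 600 iterations). `FitnessState` and `next_state` are hypothetical names.

```python
from enum import Enum


class FitnessState(Enum):
    PRISTINE = "C_P"
    SUSPECT = "C_S"
    UNDER_REPAIR = "C_U"
    REFURBISHED = "C_R"


def next_state(state, discrepancies, suspect_limit=6, repair_limit=4):
    """Assumed transition rule after one evaluation window.

    discrepancies: number of discrepant iterations seen in the window.
    suspect_limit: more than this many discrepancies sends a Suspect
                   individual to repair (fitness below 1 - 6/600).
    repair_limit:  at most this many discrepancies lets an individual
                   under repair return as Refurbished (fitness at least 1 - 4/600).
    """
    if state is FitnessState.PRISTINE:
        return FitnessState.SUSPECT if discrepancies > 0 else state
    if state is FitnessState.SUSPECT:
        return FitnessState.UNDER_REPAIR if discrepancies > suspect_limit else state
    if state is FitnessState.UNDER_REPAIR:
        return FitnessState.REFURBISHED if discrepancies <= repair_limit else state
    return state  # Refurbished individuals stay in service in this sketch
```

Integer discrepancy counts are compared instead of floating-point fitness values so that the boundary cases (exactly 6 or exactly 4 discrepancies in 600 iterations) behave predictably.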
10
FPGA Genetic Representations

Chromosome Goals
Allow all possible LUT configurations
Allow all possible CLB interconnections given
constraints of routing support
Disallow illegal FPGA configurations and
non-coding introns (junk DNA)
Facilitate crossover operator
Bitstring representation is natural choice,
though may not scale well (investigating
generative reps)
Representation shown here is sample specific to
Xilinx Virtex FPGA

11
Competitive Runtime Reconfiguration (CRR)
Evolutionary Computation strategies are effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow
diverse alternatives working a-priori
fault detection by robust consensus over time
no test vectors
device remains online during repair
fault isolation is model-free and
self-calibrating
completely-repaired criteria can be ignored
graceful degradation via ranking of alternatives
no reconfiguration when fault-free
performance readily adjustable
failures in population memory covered
checking logic part of individual hence also
competes for correctness
12
Fitness Evaluation Window

Fitness Evaluation Window E (also denoted W)
denotes the number of iterations used to evaluate fitness before the state of an individual is determined
Determination of E for 3x3 multiplier
6 input pins articulating $2^6 = 64$ possible inputs
W should be selected so that all possible inputs appear
More formally,
Let rand(X) return some $x_i \in X$ at random
Seek W such that $\bigcup_{i=1}^{W} \mathrm{rand}(X) = X$ with high probability

$x_K$: number of distinct orderings in which all K inputs show in D trials
if D is constant, $P_K$ for $K > 1$ can be calculated successively
probability $P_K$ of all K inputs showing after D trials is the ratio $x_K / K^D$
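
The ratio $x_K / K^D$ can be evaluated directly. The sketch below is an illustration that assumes equiprobable, independent inputs and computes $x_K$ in closed form by inclusion-exclusion rather than by the successive calculation mentioned on the slide; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def orderings_covering_all(k, d):
    """x_K: number of length-d input sequences in which all k inputs appear."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** d for j in range(k + 1))


def prob_all_inputs_seen(k, d):
    """P_K = x_K / K^D, returned as an exact fraction."""
    return Fraction(orderings_covering_all(k, d), k ** d)


# K = 64 inputs (3x3 multiplier) and a window of D = 600 iterations.
p_600 = float(prob_all_inputs_seen(64, 600))
```

Under these assumptions a window of 600 iterations exercises all 64 inputs with probability above 0.99, while any window shorter than 64 iterations gives probability exactly 0.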

13
E Determination
When K = 64
14
Integer Multiplier Case Study

3-bit x 3-bit unsigned multiplier automated design
Building blocks
Half-Adder: 18 templates created
Full-Adder: 24 templates
Parallel-And: 1 template created
Randomly select templates for instantiation in modules

GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: population size 20 individuals, crossover rate 5%, mutation rate up to 80% per bit
Experiments Demonstrate
Experimental Evaluation: Xilinx Virtex II Pro on Avnet PCI board

Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
Elimination of additional test vectors
Temporal Assessment process

15
Regeneration Performance
System Throughput during Regeneration for a 3x3 multiplier
Exp. Number | Fault Location | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Throughput (%)
1 | CLB3, LUT0, Input1 | Stuck-at-1 | 52 / 64 | 1.7 × 10^7 | 4.2 × 10^5 | 1194 | 64 / 64 | 97.7
2 | CLB6, LUT0, Input1 | Stuck-at-0 | 33 / 64 | 8.0 × 10^5 | 1.7 × 10^4 | 47 | 64 / 64 | 97.9
3 | CLB5, LUT2, Input0 | Stuck-at-1 | 22 / 64 | 3.1 × 10^6 | 6.8 × 10^4 | 193 | 64 / 64 | 97.8
4 | CLB7, LUT2, Input0 | Stuck-at-0 | 38 / 64 | 8.1 × 10^6 | 1.8 × 10^5 | 513 | 64 / 64 | 97.7
5 | CLB9, LUT0, Input1 | Stuck-at-0 | 40 / 64 | 2.3 × 10^6 | 7.1 × 10^4 | 219 | 64 / 64 | 96.9
Average | | | 32.6 / 64 | 6.4 × 10^6 | 1.5 × 10^5 | 433 | 64 / 64 | 97.6

Parameters
Difference (vs. Hamming Distance)
Evaluation Window $E_w$ = 600
Suspect Threshold = 1 - 6/600 = 99%
Repair Threshold = 1 - 4/600 = 99.3%
Re-introduction rate = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
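
As a worked check on the table, the Throughput column appears consistent with the fraction of non-discrepant iterations, 1 - (discrepant iterations / total iterations). That formula is an inference from the numbers and is not stated on the slide; the sketch below simply tests it against the rounded values in the table.

```python
# (total iterations, discrepant iterations, reported throughput in %)
experiments = [
    (1.7e7, 4.2e5, 97.7),
    (8.0e5, 1.7e4, 97.9),
    (3.1e6, 6.8e4, 97.8),
    (8.1e6, 1.8e5, 97.7),
    (2.3e6, 7.1e4, 96.9),
]


def throughput_percent(total, discrepant):
    """Assumed definition: share of iterations with no discrepancy."""
    return 100.0 * (1.0 - discrepant / total)


# Difference between the assumed formula and each reported value.
gaps = [abs(throughput_percent(t, d) - reported) for t, d, reported in experiments]
```

Every row agrees to within about 0.2 percentage points, which is within what the two-significant-figure iteration counts allow.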
16
Isolation Problem Outline

Objectives
Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
Online Fault Isolation: device not entirely removed from service
Features
Runtime Reconfiguration: FPGA resources configured dynamically
Utilize Runtime Inputs: avoid special test vectors, improve availability
Constraints
Use pre-designed configurations defined by target application
Subsets under test have constant resource utilization range for a given isolation problem
Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults
Do not use specialized block designs
Runtime reconfiguration initially limited to column-swapping
Non-reasonable algorithm: tests may be repeated without gaining new isolation information

17
Discrepancy Mirror

Mechanism for Checking-the-Checker (golden element problem)
Makes checker part of configuration that competes [DeMara PDPTA-05]

Fault Coverage
18

Influence of LUT utilization

Perpetually Articulating Inputs with Equiprobable Distribution
Intermittently Articulating Inputs with Equiprobable Distribution

expected number of pairings grows sub-linearly in number of resources
utilization below 20% or above 80% implicates (or exonerates) a smaller sub-set of resources
at 50% utilization, the expected number of pairings for 1,000, 10,000, and 100,000 resources are 11.1, 14.9, and 17.6

at 90% utilization, a mean value of 258 pairings is required to isolate the faulty resource.

19
Fault Location Using Dueling

The set of all competing configurations is represented by S.
Set $C_k$ represents the resources utilized by configuration k.
Each competing configuration k, $1 \le k \le |S|$, has a unique binary Usage Matrix $U_k$, $1 \le k \le p$.
Elements $U_k[i,j]$, $1 \le i \le m$, $1 \le j \le n$, where m and n represent the rows and columns in the device layout respectively.
Elements $U_k[i,j] = 1$ denote the usage of resource (i, j) by $C_k$.
The History Matrix H, with elements $H[i,j]$, $1 \le i \le m$, $1 \le j \le n$, is an integer matrix used to represent the relative fitness of individual resources.
$H[i,j]$ provides instantaneous relative fitness values of resources.

20
Dueling Example
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 1 0 0 0
0 0 1 1 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Matrices above, in order: H[i,j] at t = 0 (all zeros), followed by the two Usage Matrices (slide labels: U2, U1)
0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 2 1 0 0 1 0 0 0
0 0 1 0 1 1 0 1 0 0
0 0 1 1 0 1 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

H[i,j] changes after C1 and C2 are loaded
U1 and U2 are the corresponding Usage Matrices
(3,3) is identified as the faulty resource

H[i,j] at t = 2 (matrix above)
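
The example above can be reproduced with a short Python sketch. It assumes that both loaded configurations produced discrepant outputs, so each adds its Usage Matrix to H; the helper names are hypothetical, and the two usage matrices are copied from the slide in the order they appear in the transcript.

```python
def to_matrix(rows):
    """Convert strings such as '0010000000' into rows of integers."""
    return [[int(ch) for ch in row] for row in rows]


def update_history(history, usage):
    """Increment H[i][j] for every resource used by a discrepant configuration."""
    return [[h + u for h, u in zip(h_row, u_row)] for h_row, u_row in zip(history, usage)]


def most_suspect(history):
    """Return 1-indexed (row, column) positions holding the largest H value."""
    peak = max(max(row) for row in history)
    return [(i + 1, j + 1) for i, row in enumerate(history)
            for j, value in enumerate(row) if value == peak]


usage_first = to_matrix([
    "0000000000", "0000100000", "0010000000", "0000010100", "0001000000",
    "0010011000", "0000100000", "0010000100", "0000000000", "0000000000",
])
usage_second = to_matrix([
    "0000000000", "0001011000", "0011001000", "0010100000", "0010010000",
    "0000000000", "0000000000", "0000000000", "0000000000", "0000000000",
])

history = [[0] * 10 for _ in range(10)]          # H at t = 0
for usage in (usage_first, usage_second):
    history = update_history(history, usage)     # H at t = 2 after both updates
```

After both updates, `most_suspect(history)` returns `[(3, 3)]`: resource (3,3) is the only one used by both discrepant configurations, matching the slide.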
21
Isolation Progress without Halving

Without Halving
Initially |S| = 20,000
Resource utilization 40%
Number of suspected faulty elements constant at 36 after 23 iterations
No subsequent improvement due to lack of differentiating information between competing configurations

22
Dueling with Modified Halving

Dueling with Halving
Halving works by swapping half the used columns
with unused ones
Halving progressively reduces the size of the
set of suspected faulty elements
Isolation proceeds till a single faulty element
is isolated
Fault isolated after 19 iterations

23
Enhancing Embedded Core BIST using Group Testing
BIST Structure Used for Embedded Core Testing
XCVLX30 device: 32 DSP48E cores divided into n = 8 groups
8 x 6 2x1 multiplexers are needed
6 columns of comparators, each column has 8 comparators
Comparators $k_n(i,j)$, $0 \le i,j \le 3$, $i \ne j$, complete the test for a group of 4
Flip-flops FF0 through FF5 register comparison results for each group
Fault diagnosis script processes result of each set of 6 outputs
24
Embedded Core BIST using Group Testing
Resource Utilization
Faults in up to 2 BUTs in each group of 4 can be isolated
Isolation is achieved without device reconfiguration in a single stage
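
One way to see why six pairwise comparators suffice for a group of four blocks under test (BUTs) is sketched below. This is not the fault diagnosis script from the presentation: `diagnose_group` is a hypothetical helper, and it assumes that a faulty BUT never produces exactly the same output as any other BUT, so any pair that matches consists of two fault-free blocks.

```python
from itertools import combinations


def diagnose_group(mismatch):
    """Infer faulty BUTs in a group of 4 from the 6 pairwise comparator results.

    mismatch maps each pair (i, j) with i < j to True when that
    comparator flagged a disagreement (as registered in FF0..FF5).
    """
    units = range(4)
    cleared = set()
    for pair in combinations(units, 2):
        if not mismatch[pair]:
            cleared.update(pair)      # matching outputs clear both blocks
    return sorted(set(units) - cleared)


def comparator_outputs(faulty):
    """Build comparator results for a given set of faulty BUTs (same assumption)."""
    return {pair: bool(set(pair) & set(faulty)) for pair in combinations(range(4), 2)}
```

With one or two faulty BUTs, the remaining fault-free blocks still match each other, so the faulty ones are singled out. With three faulty BUTs no pair matches and the whole group is flagged, which is consistent with the limit of 2 stated on the slide.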
25
Logic Element Isolation Using Autonomous Group
Testing (AGT)
In each stage, suspect resources S are equally
shared among pstage individuals If S Smax then
mutually exclusive shares are possible,
else, nshare nreqd - R - S are shared
26
Equal Share Strategy
27
Fault Isolation Using FIAT

Fault Insertion and Analysis Toolkit (FIAT)
provides methods to modify Xilinx FPGA
configurations
inserts stuck-at faults at LUT inputs
precludes need to edit configuration bitstream
works in conjunction with Xilinx ISE software
(COTS design suite)

28
AGT Experiments

Experimental Setup
DES-56 encryption circuit
Xilinx ISE design tools to place and route the
design
Virtex II Pro FPGA device
Fault Injection and Analysis Toolkit (FIAT)
Application Programmer Interfaces (APIs) to
interact with the Xilinx ISE tools to inject and
evaluate faults
Editing the design file rather than the
configuration bitstreams to introduce
stuck-at-faults
Editing User Constraint Files (UCF) to control
resource usage

29
AGT Isolation Progress
30
AGT Maintaining Goodput
With $p_{preset}$ = 5, goodput is maintained at > 90%
Since goodput remains high, the rate of fault isolation is slower, with better-performing individuals selected to maintain goodput
Fault detection latency is minimal as compared to STARS; isolation is achieved with manageable system performance degradation
31
Conclusion

Graceful Performance Degradation
elimination of additional test vectors
temporal assessment using aging and outlier
detection
resource recycling to utilize residual
functionality
Population-Centric Assessment
Provides adaptability and self-calibrating
autonomy with a relative assessment method
fitness assessment using population information
and competition
create a fully functional solution using
partially-fit individuals
Autonomous Group Testing
Minimal latency fault detection
Fault isolation without additional test vectors
Efficient strategies for fast fault isolation
with minimal reconfiguration
Fast first-responder to faults via resource
tracking
Run-time Fault Management
Can be realized using consensus-driven assessment
methods, and using information contained in the
population
Integrate Detection, Isolation, Repair under a
single Population-based technique

32
Future Work

Evolvable Sequential Logic Circuits
Fitness assessment is a major challenge for large
circuits
Logic and Interconnect fault handling
Need to integrate fault handling methods for
faults in logic and the interconnects
Extend group testing principles to interconnect
faults
Challenges in partial reconfiguration
Need well-tested and supported APIs for runtime
reconfiguration of commercial FPGAs
Open standards in partial reconfiguration will
assist reliability studies
Decreased dependence on vendor-provided design
tools with an open bitstream structure is
essential
FIAT can be used to study fault isolation
properties of different approaches, and for
evaluating other group testing algorithms for
fault isolation
Extending AGT to other domains
Group testing techniques presented here can be adapted for fault-tolerant nano-scale mechanisms, software, etc.
Reliable, self-monitoring, self-adaptive organic
systems are a need, with increasing design
complexity and computational capabilities

33
Publications

Michael Georgiopoulos, Ronald F. DeMara, Avelino J. Gonzalez, Annie S. Wu, Mansooreh Mollaghasemi, Erol Gelenbe, Marcella Kysilka, Jimmy Secretan, Carthik A. Sharma and Ayman J. Alnsour, "A Sustainable Model for Integrating Current Topics in Machine Learning Research into the Undergraduate Curriculum," accepted to the IEEE Transactions in Education, July 2008.
A. Sarvi, C. A. Sharma and R. F. DeMara, "BIST-Based Group Testing for Diagnosis of Embedded FPGA Cores," accepted to The 2008 International Conference on Embedded Systems and Applications, Las Vegas, Nevada, USA, July 14-17, 2008.
C. A. Sharma, R. F. DeMara and A. Sarvi, "Self-Healing Reconfigurable Logic using Autonomous Group Testing," submitted to ACM Transactions on Autonomous and Adaptive Systems (TAAS), Special Issue on Organic Computing, May 2007.
R. F. DeMara, K. Zhang, C. A. Sharma, "Consensus-based Evolvable Hardware for Sustainable Fault Handling," submitted to The IEEE Transactions in Evolutionary Computation, Aug 2007.
R. N. Al-Haddad, C. A. Sharma, R. F. DeMara, "Performance Evaluation of Two Allocation Schemes for Combinatorial Group Testing Fault Isolation," in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 07), Las Vegas, Nevada, U.S.A, June 25-28, 2007.
R. S. Oreifej, C. A. Sharma, R. F. DeMara, "Expediting GA-Based Evolution Using Group Testing Techniques for Reconfigurable Hardware," in Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGAs (Reconfig06), San Luis Potosi, Mexico, September 20-22, 2006, pp. 106-113.
C. A. Sharma, R. F. DeMara, "A Combinatorial Group Testing Method for FPGA Fault Location," in Proceedings of the International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 23 - 35, 2006.
C. J. Milliord, C. A. Sharma, R. F. DeMara, "Dynamic Voting Schemes to Enhance Evolutionary Repair in Reconfigurable Logic Devices," in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig05), pp. 8.1.1 - 8.1.6, Puebla City, Mexico, September 28-30, 2005.
K. Zhang, R. F. DeMara, C. A. Sharma, "Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration," in Proceedings of the International Conference in Evolvable Systems (ICES05), pp. 12-24, Barcelona, Spain, September 12-14, 2005.
R. F. DeMara and C. A. Sharma, "Self-Checking Fault Detection using Discrepancy Mirrors," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA05), pp. 311-317, Las Vegas, Nevada, U.S.A, June 27-30, 2005.

34
Backup Slides

On following pages

35
Isolation Block Duelling

Algorithm based on group testing methods
Successive intersection to assess health of
resources
Each configuration k has a binary Usage Matrix $U_k[i,j]$, $1 \le i \le m$ and $1 \le j \le n$
m, n are the number of rows and columns of resources in the device
Elements $U_k[i,j] = 1$ are resources used in k
History Matrix $H[i,j]$, $1 \le i \le m$ and $1 \le j \le n$, initially all zero, exists in which entries represent the fitness of resources (i, j)
Information regarding the fitness of resources over time is stored
A discrepant output will lead to an increase in the value of $H[i,j]$, $\forall\, U_k[i,j] = 1$, $k \in S$
All elements of H corresponding to resources used by a discrepant configuration will be incremented by one.
At any point in time, $H[i,j]$ will be a record of the outcomes of competitions
m successive intersections are performed until $|S| = 1$

36
Isolation of a single faulty individual with
1-out-of-64 impact

Outliers are identified after W iterations have elapsed
E.V. = (1/64) × 600 = 9.375 discrepancies from a minimum-impact faulty individual
An isolated individual's f differs from the average DV by 3σ after 1 or more observation intervals of length W

37
Isolation of a single faulty L individual with
10-out-of-64 impact

Compare with 1-out-of-64 fault impact
E.V. of (10/64) × 600 = 93.75 discrepancies for faulty configuration
One isolation will be complete approx. once in every 93.75/5 ≈ 19 Observation Intervals
Fault isolation demonstrated in 100% of cases
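
The expected-value arithmetic and the 3σ outlier rule from these backup slides can be sketched as follows. The helper names and the sample population are illustrative assumptions; the population standard deviation is used here, which is one reasonable reading of "3σ" and not confirmed by the slides.

```python
from statistics import mean, pstdev


def expected_discrepancies(impact_out_of_64, window=600):
    """Expected discrepant iterations in one window for an n-out-of-64 fault."""
    return impact_out_of_64 * window / 64


def outliers(discrepancy_values, n_sigma=3.0):
    """Indices of individuals whose discrepancy value exceeds the mean by n_sigma * sigma."""
    mu = mean(discrepancy_values)
    sigma = pstdev(discrepancy_values)
    if sigma == 0:
        return []
    return [i for i, dv in enumerate(discrepancy_values) if dv - mu > n_sigma * sigma]


# Illustrative population of 20: one faulty individual with about 9 discrepancies.
population_dv = [0] * 19 + [9]
flagged = outliers(population_dv)
```

Here `expected_discrepancies(1)` gives 9.375 and `expected_discrepancies(10)` gives 93.75, matching the slides, and the single faulty individual in the sample population is the only one flagged.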

38
Isolation of 8 faulty individuals L4R4 with
1-out-of-64 impact

Expected isolations do not occur approximately 40% of the time
Average discrepancy value of the population is higher
Outlier isolation is difficult
Multiple faulty individuals, discrepancies scattered

39
Online Dueling Evaluation

Objective
Isolate faults by successive intersection between
sets of FPGA resources used by configurations
Analyze complexity of Isolation process
Variables
Total resources available
Measured in number of LUTs
Number of Competing Configurations
Number of initial Seed designs in CRR process
Degree of Articulation
Some inputs may not manifest faults, even if
faulty resource used by individual
Resource Utilization Factor
Percentage of FPGA resources required by target
application/design
Number of Iterations for Isolation
Measure of complexity and time involved in
isolating fault

40
For further info: EH Website http://cal.ucf.edu
41
Fast Reconfiguration for Autonomously
Reprogrammable Logic

Motivation
Dynamic reconfiguration required by application
Exploit architectural performance improvements
fully
Reconfiguration delay a major performance
barrier
Previous Work
Methodology
Multilayer Runtime Reconfiguration Architecture
(MRRA)
Spatial Management
Prototype Development
Loosely-Coupled solution
Timing Analysis
System-On-Chip solution

42
Reconfiguration Demand during CRR

For a complete repair
Approximately 2,000 generations may be required
For each generation, up to 100 evaluations may be required
Yielding a Cumulative Number of Reconfigurations (CNR) of up to 2,000 × 100 = 200,000
For each reconfiguration task

Therefore, the total delay

Even if reconfiguration delay alone is assumed to be on the order of tens or hundreds of milliseconds ⇒ $L_{tot}$ > 5.5 hours
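
The 5.5-hour figure follows from simple arithmetic, sketched below. The per-reconfiguration delay of 100 ms is an assumed value chosen from the "hundreds of milliseconds" end of the range quoted above; the function name is hypothetical.

```python
def total_reconfiguration_delay(generations=2000, evaluations_per_generation=100,
                                delay_seconds=0.1):
    """Return (cumulative number of reconfigurations, total delay in hours)."""
    cnr = generations * evaluations_per_generation
    hours = cnr * delay_seconds / 3600.0
    return cnr, hours


cnr, hours = total_reconfiguration_delay()
```

With these values the CNR is 200,000 and the total delay is roughly 5.56 hours, consistent with the bound on the slide.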
43
Previous Work - Algorithm Level
Approach | Method | Partial Reconfig | Spatial Relocation | Temporal Parallelism | Area shape | Run-Time | Potential Limitations
Hauck, Li, Schwabe | Bit file compression | N/A | No | N/A | N/A | No | Full reconfiguration required
Shirazi, Luk, Cheung | Identifying common components | Yes | No | Yes | N/A | No | Design time work required
Mak, Young | Dynamic Partitioning | Yes | No | Yes | N/A | Yes | Only desirable for large designs
Ganesan, Vemuri | Pipelining | Yes | No | Yes | N/A | Yes | Limited pipeline depth
Compton, Li, Knol, Hauck | Relocation and Defragmentation with new FPGA architecture | Yes | Yes | No | Row-based | Yes | Special FPGA architecture required
Diessel, Middendorf, Schmeck, Schmidt | Task Remapped and Relocated | Yes | Yes | No | Rectangle | Yes | Overhead for remapping calculations
Herbert, Christoph, Macro | Partitioning and 2D Hashing | Yes | Yes | Yes | Rectangle | Yes | Rigid task modeling assumptions
Legend: compression method, temporal method, spatial method
44
Multilayer Runtime Reconfiguration Architecture
(MRRA)

Develop MRRA fast reconfiguration paradigm for
the CRR approach
Validate with real hardware platform along with
detailed performance analysis
First general-purpose framework for a wide
variety of applications requiring dynamic
reconfiguration
Extend existing theories on reconfiguration

45
Loosely Coupled Solution
The Virtex-II Pro is mounted on a development
board which can then be interfaced with a
WorkStation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis
46
Result Assessment

Establish full functional framework of both
prototypes
Communication overhead, throughput and overall
speed-up analysis
Communication overhead for the SOC solution is decreased to microsecond or sub-microsecond order vs. millisecond order for the Loosely Coupled solution
Up to 5-fold speedup is expected compared to the
Loosely Coupled solution
Translation Complexity Analysis
The quantity of information that needs to be
translated to generate the reconfiguration
bitstream
Simplification from file level to bit level is
expected
Storage Complexity Analysis
The memory space required for the run-time
algorithms
Decreased memory requirement is expected due to
the translation complexity improvement

47
Publications

Accepted Manuscripts
R. F. DeMara and K. Zhang, "Autonomous FPGA Fault Handling through Competitive Runtime Reconfiguration," to appear in NASA/DoD Conference on Evolvable Hardware (EH05), Washington D.C., U.S.A., June 29 - July 1, 2005.
H. Tan and R. F. DeMara, "A Device-Controlled Dynamic Configuration Framework Supporting Heterogeneous Resource Management," to appear in International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA05), Las Vegas, Nevada, U.S.A, June 27-30, 2005.
R. F. DeMara and C. A. Sharma, "Self-Checking Fault Detection using Discrepancy Mirrors," to appear in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA05), Las Vegas, Nevada, U.S.A, June 27-30, 2005.
Submitted Manuscripts
R. F. DeMara and K. Zhang, "Populational Fault Tolerance Analysis Under CRR Approach," submitted to International Conference on Evolvable Systems (ICES05), Barcelona, Sept. 12-14, 2005.
R. F. DeMara and C. A. Sharma, "FPGA Fault Isolation and Refurbishment using Iterative Pairing," submitted to IFIP VLSI-SOC Conference, Perth, W. Australia, October 17-19, 2005.
Manuscripts In-preparation
R. F. DeMara and K. Zhang, "Autonomous Fault Occlusion through Competitive Runtime Reconfiguration," submission planned to IEEE Transactions on Evolutionary Computation.
R. F. DeMara and C. A. Sharma, "Multilayer Dynamic Reconfiguration Supporting Heterogeneous FPGA Resource Management," submission planned to IEEE Design and Test of Computers.
Field Testing
Implementation of CRR on-board SRAM-based FPGA in a Cubesat mission

48
EHW Environments

Evolvable Hardware (EHW) Environments enable experimental methods to research soft computing and intelligent search techniques
EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process

Two modes of Evolvable Hardware
Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); device design-time refinement; Done? Build it
Intrinsic Evolution: Genetic Algorithm with hardware in the loop; device run-time refinement; new approach to Autonomous Repair of failed devices
Application
Stardust Satellite: >100 FPGAs onboard; hostile environment (radiation, thermal stress)
How to achieve reliability to avoid mission failure?
49
Genetic Algorithms (GAs)

Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)

[Flowchart labels: start; population of candidate solutions; evaluate fitness of individuals (fitness function); goal reached; selection of parents; parents; crossover; mutation; offspring; replacement]
50
Genetic Mechanisms

Guided trial-and-error search techniques using
principles of Darwinian evolution
iterative selection, survival of the fittest
genetic operators -- mutation, crossover,
implementor must define fitness function
GAs frequently use strings of 1s and 0s to
represent candidate solutions
if 100101 is better than 010001 it will have more
chance to breed and influence future population
GAs cast a net over entire solution space to
find regions of high fitness
Can invoke Elitism Operator (E1, E2 )
guarantees monotonically increasing fitness of
best individual over all generations
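
The loop described on this slide and the previous one (selection, crossover, mutation, replacement, with an elitism operator) can be sketched in Python as below. This is a generic textbook-style GA over bitstrings, not the CRR algorithm: the tournament selection, single-point crossover, parameter values, and the bit-counting fitness function are all illustrative assumptions.

```python
import random


def evolve(fitness, length=16, pop_size=20, generations=50,
           mutation_rate=0.05, elite=1, seed=0):
    """Evolve bitstrings; returns (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        # Elitism: copy the best individuals unchanged into the next generation.
        next_population = [ind[:] for ind in population[:elite]]
        while len(next_population) < pop_size:
            # Tournament selection of two parents.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            # Single-point crossover followed by per-bit mutation.
            cut = rng.randrange(1, length)
            child = parent_a[:cut] + parent_b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_population.append(child)
        population = next_population
    population.sort(key=fitness, reverse=True)
    best_history.append(fitness(population[0]))
    return population[0], best_history


# Toy fitness: number of 1 bits in the chromosome.
best, history = evolve(sum)
```

Because the elite individual is carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next, which is the guarantee the Elitism Operator provides.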

51
GA Success Stories

Commercial Applications
Nextel: frequency allocation for cellular phone networks, 15M predicted savings in NY market
Pratt & Whitney: turbine engine design; engineer 8 weeks, GA 2 days w/ 3x improvement
International Truck: production scheduling improved by 90% in 5 plants
NASA: superior Jupiter trajectory optimization, antennas, FPGAs
Koza: 25 instances showing human-competitive performance such as analog circuit design, amplifiers, filters

52
Representing Candidate Solutions

Representation of an individual can be using
discrete values (binary, integer, or any other
system with a discrete set of values)
Example of Binary DNA Encoding

Individual (Chromosome)
GENE
53
Genetic Operators
[Figure labels: generation t, generation t + 1, selection, reproduction]
54
Crossover Operator
Population
offspring
55
Procedural Flow under Competitive Runtime
Reconfiguration

Integrates all fault handling stages using EC
strategy
Detects faults by the occurrence of discrepancy
Isolates faults by accumulation of discrepancies
Failure-specific refurbishment using Genetic
Operators
Intra-Module-Crossover, Inter-Module-Crossover,
Intra-Module-Mutation
Realize online device refurbishment
Refurbished online without additional function or
resource test vectors
Repair during the normal data throughput process

56
Template Fault Coverage

Half-Adder Template A
Half-Adder Template B

Template A
Gate3 is an AND gate
Will lose correctness if a Stuck-At-Zero fault occurs in the second input line of Gate3
Template B
Gate3 is a NOT gate and only uses the first input line
Will work correctly even if the second input line is stuck at Zero or One

57
Evolvable Hardware

Evolutionary Design
Start with available CLBs and IOBs
Implement a design using Genetic Operators etc
Limited or no ability to re-design to account for
suspected faulty resources

Evolutionary Regeneration
Start with an existing pool of designs
Some existing configurations may use faulty
resources
Eliminate use of suspected faulty resources
Genetic Operators can be applied to refurbish
designs

58
Competitive Runtime Reconfiguration (CRR) Overview

Uses a Relative Fitness Measure
Pairwise discrepancy checking yields relative
fitness measure
Broad temporal consensus in the population used
to determine fitness metric
Transition between Fitness States occurs in the
population
Provides graceful degradation in presence of
changing environments, applications and inputs,
since this is a moving measure
Test Inputs = Normal Inputs for Data Throughput
CBE does not utilize additional functional or resource test vectors
Potential for higher availability as regeneration
is integrated with normal operation

59
Exploiting Population Information

Population contains more robust information than
individuals
Utilize this information for robust fault
detection, faster regeneration, increased
diversity for adaptation
Detect Failure and Isolate Faulty Resources
Detect by inconsistencies among the population
Isolate faults using outlier identification and
aging
Realize Regeneration
Recovery Complexity << Design Complexity
utilize diverse raw material during regeneration
vs. isolated re-design
Temporal consensus directs search
Adaptable Performance based on Online Inputs
The population evolves to changing physical
environment, input vectors, and target
application while increasing availability

60
Selection Process

61
Fitness Adjustment Procedure
62
Discrepancy Mirror Circuit
Fault Coverage
Component | Fault Scenario 1 | Fault Scenario 2 | Fault Scenario 3 | Fault Scenario 4 | Fault-Free
Function Output A | Fault | Correct | Correct | Correct | Correct
Function Output B | Correct | Fault | Correct | Correct | Correct
XNOR A | Disagree (0) | Disagree (0) | Fault: Disagree (0) | Agree (1) | Agree (1)
XNOR B | Disagree (0) | Disagree (0) | Agree (1) | Fault: Disagree (0) | Agree (1)
Buffer A | 0 | 0 | High-Z | 0 | 1
Buffer B | 0 | 0 | 0 | High-Z | 1
Match Output | 0 | 0 | 0 | 0 | 1
63
CGT-Pruned GA Simulator
64
Repair Progress
