Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment

Title: A Genetic Representation for Evolutionary Fault Recovery in FPGAs Author: User Last modified by: Carthik Sharma Created Date: 8/17/2001 8:25:04 PM – PowerPoint PPT presentation

1
Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment
Carthik Anand Sharma, University of Central Florida
7 July 2008
2
Motivation
  • Mission-critical Embedded Systems require high
    reliability and availability
  • Characteristics of Operating Environment may
    induce hardware failures
  • Aging, Manufacturing Defects, etc.
  • System Reliability
  • Fault Avoidance. Always Possible? No
  • Design Margin. Always Adequate? No
  • Modular Redundancy. Always Recoverable? No
  • Fault Refurbishment. Highly Flexible? Yes
    but technically challenging to achieve

3
Technical Objective: Autonomous FPGA Regeneration
NASA Moon, Mars, and Beyond: realize 10s of years of service life?
Increased availability without pre-configured spares
Reconfiguration allows a new fault-handling paradigm
Comparison criteria
  • Overhead from Unutilized Spares: weight, size, power
  • Granularity of Fault Coverage: resolution where fault is handled
  • Fault-Resolution Latency: availability via downtime required to handle fault
  • Quality of Repair: likelihood and completeness
  • Autonomous Operation: recover without outside intervention
Redundancy (everyday example: spare tire)
  • overhead increases with amount of spare capacity
  • granularity restricted at design-time
  • latency based on time required to select spare resource
  • quality of repair determined by adequacy of spares available (?)
  • autonomous operation: yes
Regeneration (everyday example: can of fix-a-flat)
  • overhead weakly-related to recovery capacity
  • granularity variable at recovery-time
  • latency based on time required to find suitable recovery
  • quality of repair affected by multiple characteristics (+ or -)
  • autonomous operation: yes
4
Fault-Handling Techniques for SRAM-based FPGAs
[Chart residue: taxonomy table whose cell alignment did not survive extraction; recoverable labels follow]
Reprogrammable Device Failure Characteristics
  • Duration: Transient (SEU); Permanent (SEL, Oxide Breakdown, Electron Migration, LPD)
  • Target: Device Configuration; Processing Datapath
Approaches and methods: BIST; Evolutionary; Repetitive Readback [Wells00]; TMR (conventional spatial redundancy); STARS [Abramovici01]; CED [McCluskey04]; Sussex [Vigander01]; CRR
Fault-handling phases compared: Detection, Isolation, Diagnosis, Recovery
Cell entries, in extraction order: Supplementary Testbench; Duplex Output Comparison; Duplex Output Comparison; (not addressed); Cartesian Intersection; (not addressed); Bitwise Comparison; Majority Vote; unnecessary; Fast Run-time Location; Worst-case Clock Period Dilation; unnecessary; unnecessary; Population-based GA using Extrinsic Fitness Evaluation; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Replicate in Spare Resource; Select Spare Resource; Invert Bit Value; Ignore Discrepancy
5
Contributions
  • Strategy for integrating all phases of the fault-handling process: detection, isolation, diagnosis, and recovery work in synergy
  • Elimination of additional test vectors: enables detection and isolation with minimal system downtime
  • Autonomous Group Testing techniques for FPGA devices: isolates faults in the FPGA while maintaining system performance
  • Competitive Runtime Reconfiguration: leverages iterative pairwise comparison and functional regeneration to provide adaptive refurbishment with resource recycling

6
Previous Work
  • Detection Characteristics of FPGA Fault-Handling
    Schemes

Strategies
  • 1) Evolve redundancy into design before anticipated failure
  • 2) Redesign after detection of failure
  • 3) Combine desirable aspects of both strategies 1) and 2)
7
Group Testing Algorithms
  • Origin: World War II blood testing
  • Problem: test samples from millions of new recruits
  • Solution: test blocks of samples before testing individual samples
  • Problem Definition
  • Identify subset Q of defectives from set P
  • Minimize number of tests
  • Test v-subsets of P
  • Form suitable blocks
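
The block-then-individual idea above can be sketched in a few lines. This is only an illustration of classic two-stage pooling, not the allocation scheme used later in the talk; the function name, the block size of 10, and the sample data are all hypothetical.

```python
def two_stage_group_test(samples, block_size, is_defective):
    """Return (defective indices, number of tests) for block-then-individual testing."""
    defectives, tests = [], 0
    for start in range(0, len(samples), block_size):
        block = range(start, min(start + block_size, len(samples)))
        tests += 1  # one pooled test covers the whole block
        # The pooled test is modeled here as "any member is defective".
        if any(is_defective(samples[i]) for i in block):
            # Pool tested positive: fall back to testing each member.
            for i in block:
                tests += 1
                if is_defective(samples[i]):
                    defectives.append(i)
    return defectives, tests


samples = [0] * 100
samples[37] = 1  # a single hypothetical defective sample
found, tests = two_stage_group_test(samples, block_size=10, is_defective=bool)
print(found, tests)
```

With one defective among 100 samples, the sketch needs 10 pooled tests plus 10 individual tests, compared with 100 tests when every sample is tested on its own.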

8
CRR Arrangement in SRAM FPGA
  • Configurations in Population
  • C = C_L ∪ C_R
  • C_L: subset of left-half configurations
  • C_R: subset of right-half configurations
  • |C_L| = |C_R| = |C|/2
  • Discrepancy Operator
  • Baseline discrepancy operator is a dyadic operator with binary output
  • Z(C_i) is the FPGA data throughput output of configuration C_i
  • Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual
  • Any fault in the checker lowers that individual's fitness, so that individual is no longer preferred and eventually undergoes repair

WTA
(Equivalence)
9
Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
  • Initialization
  • Population P of functionally-identical yet physically-distinct configurations
  • Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
  • Fitness Assessment
  • Discrepancy operator is some function of bitwise agreement between each half's output
  • Four Fitness States defined for Configurations as CP, CS, CU, CR with transitions, respectively: Pristine → Suspect → Under Repair → Refurbished
  • Fitness Evaluation Window W determines comparison interval
  • Regeneration
  • Genetic Operators used to recover from fault based on the Reintroduction Rate
  • Operators only applied once, then offspring returned to service without concern for increasing fitness
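
A minimal sketch of pairwise discrepancy assessment over an evaluation window follows. It assumes the simplest possible operator (1 when the two halves' outputs differ, 0 otherwise) and a fitness equal to the fraction of agreeing iterations; the slides do not spell out either, and the faulty multiplier is a hypothetical fault affecting one of the 64 inputs.

```python
from itertools import product


def discrepancy(z_left, z_right):
    """Dyadic operator with binary output: 1 when the two outputs disagree."""
    return int(z_left != z_right)


def windowed_fitness(left, right, inputs):
    """Fraction of the evaluation window in which both halves agreed."""
    inputs = list(inputs)
    disagreements = sum(discrepancy(left(a, b), right(a, b)) for a, b in inputs)
    return 1 - disagreements / len(inputs)


def good_multiplier(a, b):
    return a * b


def faulty_multiplier(a, b):
    # Hypothetical fault that corrupts exactly one of the 64 input combinations.
    return (a * b) ^ 1 if (a, b) == (7, 7) else a * b


window = list(product(range(8), repeat=2))  # all 64 inputs of a 3x3 multiplier
fit_clean = windowed_fitness(good_multiplier, good_multiplier, window)
fit_faulty = windowed_fitness(good_multiplier, faulty_multiplier, window)
print(fit_clean, fit_faulty)
```

A single disagreement does not say which half is at fault, which is why the approach relies on repeated pairings over time rather than on one comparison.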

fitness assessment via pairwise discrepancy
(temporal voting vs. spatial voting)
10
FPGA Genetic Representations
  • Chromosome Goals
  • Allow all possible LUT configurations
  • Allow all possible CLB interconnections given
    constraints of routing support
  • Disallow illegal FPGA configurations and
    non-coding introns (junk DNA)
  • Facilitate crossover operator
  • Bitstring representation is natural choice,
    though may not scale well (investigating
    generative reps)
  • Representation shown here is sample specific to
    Xilinx Virtex FPGA

11
Competitive Runtime Reconfiguration (CRR)
Evolutionary Computation strategies are effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow
  • diverse alternatives working a-priori
  • fault detection by robust consensus over time
  • no test vectors
  • device remains online during repair
  • fault isolation is model-free and self-calibrating
  • completely-repaired criteria can be ignored
  • graceful degradation via ranking of alternatives
  • no reconfiguration when fault-free
  • performance readily adjustable
  • failures in population memory covered
  • checking logic is part of the individual, hence also competes for correctness
12
Fitness Evaluation Window
  • Fitness Evaluation Window E
  • denotes the number of iterations used to evaluate fitness before the state of an individual is determined
  • Determination of E for 3x3 multiplier
  • 6 input pins articulating 2^6 = 64 possible inputs
  • W should be selected so that all possible inputs appear
  • More formally,
  • Let rand(X) return some x_i ∈ X at random
  • Seek W such that the union of W draws of rand(X) equals X with high probability
  • x_K = number of distinct orderings of K inputs showing in D trials
  • if D is constant, can calculate P_K for K > 1 successively
  • probability P_K of K inputs showing after D trials is the ratio x_K / K^D
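
The ratio x_K / K^D can be evaluated exactly with a short script. The sketch below counts x_K by inclusion-exclusion rather than by the successive calculation mentioned above (a swapped-in but equivalent counting method) and assumes inputs are drawn uniformly and independently; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def all_inputs_count(k, d):
    """x_K: number of length-d input sequences in which all k inputs appear."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** d for j in range(k + 1))


def prob_all_inputs_seen(k, d):
    """P_K = x_K / K^D, kept as an exact fraction."""
    return Fraction(all_inputs_count(k, d), k ** d)


p_small = prob_all_inputs_seen(2, 2)             # sanity check on a tiny case
p_600 = float(prob_all_inputs_seen(64, 600))     # K = 64 inputs, window of 600
print(p_small, p_600)
```

For K = 64 and a window of 600 iterations, the probability that every input appears at least once comes out at roughly 0.995 under these assumptions.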

13
E Determination
When K = 64
14
Integer Multiplier Case Study
  • 3-bit x 3-bit unsigned multiplier automated design
  • Building blocks
  • Half-Adder: 18 templates created
  • Full-Adder: 24 templates
  • Parallel-And: 1 template created
  • Randomly select templates for instantiation in modules

GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: population size 20 individuals; crossover rate 5%; mutation rate up to 80% per bit
Experimental evaluation: Xilinx Virtex II Pro on Avnet PCI board
Experiments demonstrate
  • Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
  • Elimination of additional test vectors
  • Temporal Assessment process
15
Regeneration Performance
System Throughput during Regeneration for a 3x3 multiplier
Exp. Number | Fault Location | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Throughput (%)
1 | CLB3, LUT0, Input1 | Stuck-at-1 | 52 / 64 | 1.7 × 10^7 | 4.2 × 10^5 | 1194 | 64 / 64 | 97.7
2 | CLB6, LUT0, Input1 | Stuck-at-0 | 33 / 64 | 8.0 × 10^5 | 1.7 × 10^4 | 47 | 64 / 64 | 97.9
3 | CLB5, LUT2, Input0 | Stuck-at-1 | 22 / 64 | 3.1 × 10^6 | 6.8 × 10^4 | 193 | 64 / 64 | 97.8
4 | CLB7, LUT2, Input0 | Stuck-at-0 | 38 / 64 | 8.1 × 10^6 | 1.8 × 10^5 | 513 | 64 / 64 | 97.7
5 | CLB9, LUT0, Input1 | Stuck-at-0 | 40 / 64 | 2.3 × 10^6 | 7.1 × 10^4 | 219 | 64 / 64 | 96.9
Average | | | 32.6 / 64 | 6.4 × 10^6 | 1.5 × 10^5 | 433 | 64 / 64 | 97.6

Parameters
  • Difference (vs. Hamming Distance)
  • Evaluation Window E_W = 600
  • Suspect Threshold = 1 - 6/600 = 99%
  • Repair Threshold = 1 - 4/600 ≈ 99.3%
  • Re-introduction rate = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing the device to remain partially online.
16
Isolation Problem Outline
  • Objectives
  • Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
  • Online Fault Isolation: device not entirely removed from service
  • Features
  • Runtime Reconfiguration: FPGA resources configured dynamically
  • Utilize Runtime Inputs: avoid special test vectors, improve availability
  • Constraints
  • Use pre-designed configurations defined by target application
  • Subsets under test have constant resource utilization range for a given isolation problem
  • Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults
  • Do not use specialized block designs
  • Runtime reconfiguration initially limited to column-swapping
  • Non-reasonable algorithm: tests may be repeated without gaining new isolation information

17
Discrepancy Mirror
  • Mechanism for Checking-the-Checker (golden
    element problem)
  • Makes checker part of configuration that
    competes [DeMara PDPTA-05]

Fault Coverage
18
  • Influence of LUT utilization

Perpetually Articulating Inputs with Equiprobable
Distribution
Intermittently Articulating Inputs with
Equiprobable Distribution
  • expected number of pairings grows sub-linearly
    in number of resources
  • utilization below 20% or above 80% implicates (or exonerates) a smaller sub-set of resources
  • at 50% utilization, the expected number of pairings for 1,000, 10,000, and 100,000 resources is 11.1, 14.9, and 17.6, respectively
  • at 90% utilization, a mean of 258 pairings is required to isolate the faulty resource

19
Fault Location Using Dueling
  • The set of all competing configurations is
    represented by S.
  • Set C_k represents the resources utilized by configuration k.
  • Each competing configuration k, 1 ≤ k ≤ |S|, has a unique binary Usage Matrix U_k, 1 ≤ k ≤ p.
  • Elements U_k[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, where m and n represent the rows and columns in the device layout respectively.
  • Elements U_k[i,j] = 1 denote the usage of resource (i, j) by C_k.
  • The History Matrix H, with elements H[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, is an integer matrix used to represent the relative fitness of individual resources.
  • H[i,j] provides instantaneous relative fitness values of resources.

20
Dueling Example
H[i,j] at t = 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Usage Matrices (labeled U2, U1 in extraction order)
0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 1 0 0 0
0 0 1 1 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
H[i,j] at t = 2
0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 2 1 0 0 1 0 0 0
0 0 1 0 1 1 0 1 0 0
0 0 1 1 0 1 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
  • H[i,j] changes after C1 and C2 are loaded
  • U1 and U2 are the corresponding Usage Matrices
  • (3,3) is identified as the faulty resource
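
The example can be replayed with a few lines of code. This sketch assumes both configurations produced a discrepant output, so each Usage Matrix is added to the History Matrix once; `usage_a` and `usage_b` hold the two Usage Matrices from the slide (which one is U1 and which is U2 does not affect the sum), and the helper names are hypothetical.

```python
def parse(rows):
    """Turn strings of 0/1 characters into a matrix of integers."""
    return [[int(ch) for ch in row] for row in rows]


def update_history(history, usage):
    """Increment H[i][j] for every resource used by a discrepant configuration."""
    for i, row in enumerate(usage):
        for j, used in enumerate(row):
            history[i][j] += used


def most_suspect(history):
    """Return the 1-indexed (row, column) with the highest count, and the count."""
    cells = [(value, i + 1, j + 1)
             for i, row in enumerate(history) for j, value in enumerate(row)]
    value, i, j = max(cells, key=lambda cell: cell[0])
    return (i, j), value


usage_a = parse(["0000000000", "0000100000", "0010000000", "0000010100",
                 "0001000000", "0010011000", "0000100000", "0010000100",
                 "0000000000", "0000000000"])
usage_b = parse(["0000000000", "0001011000", "0011001000", "0010100000",
                 "0010010000", "0000000000", "0000000000", "0000000000",
                 "0000000000", "0000000000"])

history = [[0] * 10 for _ in range(10)]  # H at t = 0
for usage in (usage_a, usage_b):
    update_history(history, usage)

print(most_suspect(history))
```

Resource (3,3) is the only one used by both discrepant configurations, so it is the only entry that reaches a count of 2.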
21
Isolation Progress without Halving
  • Without Halving
  • Initially |S| = 20,000
  • Resource Utilization: 40%
  • Number of suspected faulty elements constant at
    36 after 23 iterations
  • No subsequent improvement due to lack of
    differentiating information between competing
    configurations

22
Dueling with Modified Halving
  • Dueling with Halving
  • Halving works by swapping half the used columns
    with unused ones
  • Halving progressively reduces the size of the
    set of suspected faulty elements
  • Isolation proceeds till a single faulty element
    is isolated
  • Fault isolated after 19 iterations

23
Enhancing Embedded Core BIST using Group Testing
BIST Structure Used for Embedded Core Testing
  • XCVLX30 device: 32 DSP48E cores divided into n = 8 groups
  • 8 x 6 2x1 multiplexers are needed
  • 6 columns of comparators; each column has 8 comparators
  • Comparators k_n(i,j), 0 ≤ i,j ≤ 3, i ≠ j: complete test for a group of 4
  • Flip-flops FF0 through FF5 register comparison results for each group
  • Fault diagnosis script processes the result of each set of 6 outputs
24
Embedded Core BIST using Group Testing
Resource Utilization
  • Faults in up to 2 BUTs in each group of 4 can be isolated
  • Isolation is achieved without device reconfiguration in a single stage
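
A rough sketch of how six pairwise comparison results could be turned into a diagnosis for one group of 4 is shown below. It is not the fault diagnosis script from the talk: it assumes that a faulty unit disagrees with every other unit in its group (so two faulty units never produce matching outputs), and all names and output values are hypothetical.

```python
from itertools import combinations


def compare_all(outputs):
    """Pairwise comparator results: True where the outputs of units i and j differ."""
    return {(i, j): outputs[i] != outputs[j]
            for i, j in combinations(range(len(outputs)), 2)}


def diagnose_group(mismatch, size=4):
    """Flag a unit as faulty when it disagrees with every other unit in its group."""
    faulty = []
    for unit in range(size):
        results = [differs for pair, differs in mismatch.items() if unit in pair]
        if all(results):
            faulty.append(unit)
    return faulty


comparisons = compare_all([12, 12, 99, 12])            # 6 results for a group of 4
one_fault = diagnose_group(comparisons)                 # unit 2 is faulty
two_faults = diagnose_group(compare_all([12, 40, 99, 12]))  # units 1 and 2
print(len(comparisons), one_fault, two_faults)
```

With two faulty units the two remaining units still agree with each other, which is what lets a single comparison stage separate them; with three faulty units that agreement is lost.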
25
Logic Element Isolation Using Autonomous Group
Testing (AGT)
In each stage, suspect resources S are equally
shared among pstage individuals If S Smax then
mutually exclusive shares are possible,
else, nshare nreqd - R - S are shared
26
Equal Share Strategy
27
Fault Isolation Using FIAT
  • Fault Insertion and Analysis Toolkit (FIAT)
  • provides methods to modify Xilinx FPGA
    configurations
  • inserts stuck-at faults at LUT inputs
  • precludes need to edit configuration bitstream
  • works in conjunction with Xilinx ISE software
    (COTS design suite)

28
AGT Experiments
  • Experimental Setup
  • DES-56 encryption circuit
  • Xilinx ISE design tools to place and route the
    design
  • Virtex II Pro FPGA device
  • Fault Injection and Analysis Toolkit (FIAT)
  • Application Programmer Interfaces (APIs) to
    interact with the Xilinx ISE tools to inject and
    evaluate faults
  • Editing the design file rather than the
    configuration bitstreams to introduce
    stuck-at-faults
  • Editing User Constraint Files (UCF) to control
    resource usage

29
AGT Isolation Progress
30
AGT Maintaining Goodput
  • With p_preset = 5, goodput is maintained at > 90%
  • Since goodput remains high, the rate of fault isolation is slower, with better-performing individuals selected to maintain goodput
  • Fault detection latency is minimal compared to STARS; isolation is achieved with manageable system performance degradation
31
Conclusion
  • Graceful Performance Degradation
  • elimination of additional test vectors
  • temporal assessment using aging and outlier
    detection
  • resource recycling to utilize residual
    functionality
  • Population-Centric Assessment
  • Provides adaptability and self-calibrating
    autonomy with a relative assessment method
  • fitness assessment using population information
    and competition
  • create a fully functional solution using
    partially-fit individuals
  • Autonomous Group Testing
  • Minimal latency fault detection
  • Fault isolation without additional test vectors
  • Efficient strategies for fast fault isolation
    with minimal reconfiguration
  • Fast first-responder to faults via resource
    tracking
  • Run-time Fault Management
  • Can be realized using consensus-driven assessment
    methods, and using information contained in the
    population
  • Integrate Detection, Isolation, Repair under a
    single Population-based technique

32
Future Work
  • Evolvable Sequential Logic Circuits
  • Fitness assessment is a major challenge for large
    circuits
  • Logic and Interconnect fault handling
  • Need to integrate fault handling methods for
    faults in logic and the interconnects
  • Extend group testing principles to interconnect
    faults
  • Challenges in partial reconfiguration
  • Need well-tested and supported APIs for runtime
    reconfiguration of commercial FPGAs
  • Open standards in partial reconfiguration will
    assist reliability studies
  • Decreased dependence on vendor-provided design
    tools with an open bitstream structure is
    essential
  • FIAT can be used to study fault isolation
    properties of different approaches, and for
    evaluating other group testing algorithms for
    fault isolation
  • Extending AGT to other domains
  • Group testing techniques presented here can be
    adapted for fault-tolerant nano-scale mechanisms,
    software, etc.
  • Reliable, self-monitoring, self-adaptive organic
    systems are a need, with increasing design
    complexity and computational capabilities

33
Publications
  • Michael Georgiopoulos , Ronald F. DeMara, Avelino
    J. Gonzalez, Annie S. Wu, Mansooreh Mollaghasemi,
    Erol Gelenbe, Marcella Kysilka, Jimmy Secretan,
    Carthik A. Sharma and Ayman J. Alnsour, A
    Sustainable Model for Integrating Current Topics
    in Machine Learning Research into the
    Undergraduate Curriculum, accepted to the IEEE
    Transactions in Education, July 2008.
  • A. Sarvi, C. A. Sharma and R. F. DeMara,
    BIST-Based Group Testing for Diagnosis of
    Embedded FPGA Cores, accepted to The 2008
    International Conference on Embedded Systems and
    Applications, Las Vegas, Nevada, USA (July 14-17,
    2008).
  • C. A. Sharma, R. F. DeMara and A. Sarvi,
    Self-Healing Reconfigurable Logic using
    Autonomous Group Testing, submitted to ACM
    Transactions on Autonomous and Adaptive Systems
    (TAAS) of Special Issue on Organic Computing May
    2007.
  • R. F. DeMara, K. Zhang, C. A. Sharma,
    Consensus-based Evolvable Hardware for
    Sustainable Fault Handling, submitted to The
    IEEE Transactions in Evolutionary Computation Aug
    2007.
  • R. N. Al-Haddad, C. A. Sharma, R. F. DeMara,
    Performance Evaluation of Two Allocation Schemes
    for Combinatorial Group Testing Fault Isolation,
    in Proceedings of the International Conference on
    Engineering of Reconfigurable Systems and
    Algorithms (ERSA 07), Las Vegas, Nevada, U.S.A,
    June 25 - 28, 2007.
  • R. S. Oreifej, C. A. Sharma, R. F. DeMara,
    Expediting GA-Based Evolution Using Group
    Testing Techniques for Reconfigurable Hardware,
    in Proceedings of the IEEE International
    Conference on Reconfigurable Computing and FPGAs
    (Reconfig06), San Luis Potosi, Mexico, September
    20-22, 2006, pp 106-113.
  • C. A. Sharma, R. F. DeMara, A Combinatorial
    Group Testing Method for FPGA Fault Location, in
    Proceedings of the International Conference on
    Advances in Computer Science and Technology (ACST
    2006), Puerto Vallarta, Mexico, January 23 - 35,
    2006.
  • C. J. Milliord, C. A. Sharma, R. F. DeMara,
    Dynamic Voting Schemes to Enhance Evolutionary
    Repair in Reconfigurable Logic Devices, in
    Proceedings of the International Conference on
    Reconfigurable Computing and FPGAs (ReConFig05),
    pp. 8.1.1 - 8.1.6, Puebla City, Mexico, September
    28 - 30, 2005.
  • K. Zhang, R. F. DeMara, C. A. Sharma,
    Consensus-based Evaluation for Fault Isolation
    and On-line Evolutionary Regeneration, in
    Proceedings of the International Conference in
    Evolvable Systems (ICES05), pp. 12 -24,
    Barcelona, Spain, September 12 - 14, 2005.
  • R. F. DeMara and C. A. Sharma, Self-Checking
    Fault Detection using Discrepancy Mirrors, in
    Proceedings of the International Conference on
    Parallel and Distributed Processing Techniques
    and Applications (PDPTA05), pp. 311-317, Las
    Vegas, Nevada, U.S.A, June 27 - 30, 2005.

34
Backup Slides
  • On following pages

35
Isolation Block Duelling
  • Algorithm based on group testing methods
  • Successive intersection to assess health of
    resources
  • Each configuration k has a binary Usage Matrix U_k[i,j], 1 ≤ i ≤ m and 1 ≤ j ≤ n
  • m, n are the number of rows and columns of resources in the device
  • Elements U_k[i,j] = 1 are resources used in k
  • History Matrix H[i,j], 1 ≤ i ≤ m and 1 ≤ j ≤ n, initially all zero, exists in which
  • entries represent the fitness of resources (i, j)
  • Information regarding the fitness of resources over time is stored
  • A discrepant output will lead to an increase in the value of H[i,j] for all (i, j) with U_k[i,j] = 1, k ∈ S
  • All elements of H corresponding to resources used by a discrepant configuration will be incremented by one.
  • At any point in time, H[i,j] will be a record of the outcomes of competitions
  • m successive intersections among the sets of resources used by configurations are performed until |S| = 1

36
Isolation of a single faulty individual with
1-out-of-64 impact
  • Outliers are identified after W iterations have elapsed
  • E.V. = (1/64) × 600 = 9.375 discrepancies from a minimum-impact faulty individual
  • An isolated individual's f differs from the average DV by 3σ after 1 or more observation intervals of length W

37
Isolation of a single faulty L individual with
10-out-of-64 impact
  • Compare with 1-out-of-64 fault impact
  • E.V. of (10/64) × 600 = 93.75 discrepancies for the faulty configuration
  • One isolation will be complete approx. once in every 93.75/5 ≈ 19 Observation Intervals
  • Fault Isolation demonstrated in 100% of cases
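
The expected values quoted for the 1-out-of-64 and 10-out-of-64 cases, and the 3σ outlier rule, can be checked with a short sketch. The discrepancy values in `dv` are a made-up population of 20 individuals, and the outlier test (population standard deviation, one-sided) is an assumption about how the rule might be applied rather than the procedure used in the experiments.

```python
from statistics import mean, pstdev


def expected_discrepancies(articulating_inputs, total_inputs, window):
    """Expected number of discrepancies seen in one window of uniform random inputs."""
    return articulating_inputs / total_inputs * window


def outliers(discrepancy_values, n_sigma=3):
    """Indices whose discrepancy value exceeds the mean by more than n_sigma deviations."""
    mu = mean(discrepancy_values)
    sigma = pstdev(discrepancy_values)
    return [i for i, dv in enumerate(discrepancy_values)
            if sigma > 0 and dv - mu > n_sigma * sigma]


ev_min = expected_discrepancies(1, 64, 600)    # 1-out-of-64 impact
ev_ten = expected_discrepancies(10, 64, 600)   # 10-out-of-64 impact
dv = [0] * 19 + [9]                            # hypothetical population of 20
flagged = outliers(dv)
print(ev_min, ev_ten, flagged)
```

When several individuals are faulty at once, the mean and spread of the discrepancy values both rise, which makes this kind of outlier test less decisive.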

38
Isolation of 8 faulty individuals L4R4 with
1-out-of-64 impact
  • Expected isolations do not occur approximately
    40% of the time
  • Average discrepancy value of the population is
    higher
  • Outlier isolation difficult
  • Multiple faulty individual, Discrepancies
    scattered

39
Online Dueling Evaluation
  • Objective
  • Isolate faults by successive intersection between
    sets of FPGA resources used by configurations
  • Analyze complexity of Isolation process
  • Variables
  • Total resources available
  • Measured in number of LUTs
  • Number of Competing Configurations
  • Number of initial Seed designs in CRR process
  • Degree of Articulation
  • Some inputs may not manifest faults, even if
    faulty resource used by individual
  • Resource Utilization Factor
  • Percentage of FPGA resources required by target
    application/design
  • Number of Iterations for Isolation
  • Measure of complexity and time involved in
    isolating fault

40
For further info, EH Website: http://cal.ucf.edu
41
Fast Reconfiguration for Autonomously
Reprogrammable Logic
  • Motivation
  • Dynamic reconfiguration required by application
  • Exploit architectural performance improvements
    fully
  • Reconfiguration delay a major performance
    barrier
  • Previous Work
  • Methodology
  • Multilayer Runtime Reconfiguration Architecture
    (MRRA)
  • Spatial Management
  • Prototype Development
  • Loosely-Coupled solution
  • Timing Analysis
  • System-On-Chip solution

42
Reconfiguration Demand during CRR
  • For a complete repair
  • Approximately 2,000 generations may be required
  • For each generation, up to 100 evaluations may be required
  • Yielding a Cumulative Number of Reconfigurations (CNR) of up to 2,000 × 100 = 200,000
  • For each reconfiguration task, a reconfiguration delay is incurred
  • Therefore, the total delay L_tot is the CNR multiplied by the per-task delay

Even if reconfiguration delay alone is assumed to be in the order of tens or hundreds of milliseconds → L_tot > 5.5 hours
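
The 5.5-hour figure can be reproduced with a back-of-the-envelope calculation. The sketch assumes a delay of 100 ms per reconfiguration, which is one value within the "tens or hundreds of milliseconds" range quoted above, and it ignores every other source of delay; the function name is hypothetical.

```python
def total_reconfiguration_delay_hours(generations, evaluations_per_generation, delay_s):
    """Total delay if every evaluation costs one reconfiguration of delay_s seconds."""
    cnr = generations * evaluations_per_generation  # cumulative number of reconfigurations
    return cnr, cnr * delay_s / 3600


# 2,000 generations, up to 100 evaluations each, assumed 100 ms per reconfiguration
cnr, hours = total_reconfiguration_delay_hours(2_000, 100, 0.1)
print(cnr, round(hours, 2))
```

At 10 ms per reconfiguration the same count gives a little over half an hour, so the result is dominated by the assumed per-task delay.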
43
Previous Work - Algorithm Level
Approach | Method | Partial Reconfig | Spatial Relocation | Temporal Parallelism | Area shape | Run-Time | Potential Limitations
Hauck, Li, Schwabe | Bit file compression | N/A | No | N/A | N/A | No | Full reconfiguration required
Shirazi, Luk, Cheung | Identifying common components | Yes | No | Yes | N/A | No | Design time work required
Mak, Young | Dynamic Partitioning | Yes | No | Yes | N/A | Yes | Only desirable for large designs
Ganesan, Vemuri | Pipelining | Yes | No | Yes | N/A | Yes | Limited pipeline depth
Compton, Li, Knol, Hauck | Relocation and Defragmentation with new FPGA architecture | Yes | Yes | No | Row-based | Yes | Special FPGA architecture required
Diessel, Middendorf, Schmeck, Schmidt | Task Remapped and Relocated | Yes | Yes | No | Rectangle | Yes | Overhead for remapping calculations
Herbert, Christoph, Macro | Partitioning and 2D Hashing | Yes | Yes | Yes | Rectangle | Yes | Rigid task modeling assumptions
Legend: compression method, temporal method, spatial method
44
Multilayer Runtime Reconfiguration Architecture
(MRRA)
  • Develop MRRA fast reconfiguration paradigm for
    the CRR approach
  • Validate with real hardware platform along with
    detailed performance analysis
  • First general-purpose framework for a wide
    variety of applications requiring dynamic
    reconfiguration
  • Extend existing theories on reconfiguration

45
Loosely Coupled Solution
The Virtex-II Pro is mounted on a development
board which can then be interfaced with a
WorkStation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis
46
Result Assessment
  • Establish full functional framework of both
    prototypes
  • Communication overhead, throughput and overall
    speed-up analysis
  • Communication overhead for the SOC solution is
    decreased to microsecond or sub-microsecond order vs.
    millisecond order for the Loosely Coupled solution
  • Up to 5-fold speedup is expected compared to the
    Loosely Coupled solution
  • Translation Complexity Analysis
  • The quantity of information that needs to be
    translated to generate the reconfiguration
    bitstream
  • Simplification from file level to bit level is
    expected
  • Storage Complexity Analysis
  • The memory space required for the run-time
    algorithms
  • Decreased memory requirement is expected due to
    the translation complexity improvement

47
Publications
  • Accepted Manuscripts
  • R. F. DeMara and K. Zhang, Autonomous FPGA Fault
    Handling through Competitive Runtime
    Reconfiguration, to appear in NASA/DoD
    Conference on Evolvable Hardware(EH05),
    Washington D.C., U.S.A., June 29 - July 1, 2005.
  • H. Tan and R. F. DeMara, A Device-Controlled
    Dynamic Configuration Framework Supporting
    Heterogeneous Resource Management, to appear in
    International Conference on Engineering of
    Reconfigurable Systems and Algorithms (ERSA05),
    Las Vegas, Nevada, U.S.A, June 27 - 30, 2005.
  • R. F. DeMara and C. A. Sharma, Self-Checking
    Fault Detection using Discrepancy Mirrors, to
    appear in International Conference on Parallel
    and Distributed Processing Techniques and
    Applications (PDPTA05), Las Vegas, Nevada,
    U.S.A, June 27 - 30, 2005.
  • Submitted Manuscripts
  • R. F. DeMara and K. Zhang, Populational Fault
    Tolerance Analysis Under CRR Approach, submitted
    to International Conference on Evolvable Systems
    (ICES05), Barcelona, Sept. 12 - 14, 2005.
  • R. F. DeMara and C. A. Sharma, FPGA Fault
    Isolation and Refurbishment using Iterative
    Pairing, submitted to IFIP VLSI-SOC Conference,
    Perth, W. Australia, October 17 - 19, 2005.
  • Manuscripts In-preparation
  • R. F. DeMara and K. Zhang, Autonomous Fault
    Occlusion through Competitive Runtime
    Reconfiguration, submission planned to IEEE
    Transactions on Evolutionary Computation.
  • R. F. DeMara and C. A. Sharma, Multilayer
    Dynamic Reconfiguration Supporting Heterogeneous
    FPGA Resource Management, submission planned to
    IEEE Design and Test of Computers.
  • Field Testing
  • Implementation of CRR on-board SRAM-based FPGA
    in a Cubesat mission

48
EHW Environments
  • Evolvable Hardware (EHW) Environments enable
    experimental methods to research soft
    computing intelligent search techniques
  • EHW operates by repetitive reprogramming of
    real-world physical devices using an iterative
    refinement process

Two modes of Evolvable Hardware
  • Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); device design-time refinement; Done? Build it
  • Intrinsic Evolution: Genetic Algorithm with hardware in the loop; device run-time refinement; new approach to Autonomous Repair of failed devices
Application
  • Stardust Satellite: >100 FPGAs onboard
  • hostile environment: radiation, thermal stress
  • How to achieve reliability to avoid mission failure?
49
Genetic Algorithms (GAs)
  • Mechanism coarsely modeled after neo-Darwinism
    (natural selection + genetics)

[Figure: GA cycle. start → population of candidate solutions → evaluate fitness of individuals (fitness function) → goal reached? → selection of parents → parents → crossover → mutation → offspring → replacement]
50
Genetic Mechanisms
  • Guided trial-and-error search techniques using
    principles of Darwinian evolution
  • iterative selection, survival of the fittest
  • genetic operators -- mutation, crossover,
  • implementor must define fitness function
  • GAs frequently use strings of 1s and 0s to
    represent candidate solutions
  • if 100101 is better than 010001 it will have more
    chance to breed and influence future population
  • GAs cast a net over entire solution space to
    find regions of high fitness
  • Can invoke Elitism Operator (E1, E2 )
  • guarantees monotonically increasing fitness of
    best individual over all generations
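
The mechanisms listed above fit in a short sketch. This is a generic bitstring GA with tournament selection, one-point crossover, bit-flip mutation, and a single elite, not the operators or parameters of the CRR case study; the rates, population size, and the choice of counting 1s as the fitness function are all hypothetical.

```python
import random


def evolve(fitness, length=16, pop_size=20, generations=40,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bitstrings; return the best individual and the best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the best individual survives unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: the fitter of two random individuals becomes a parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each bit.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_pop.append(child)
        pop = next_pop
    pop.sort(key=fitness, reverse=True)
    best_history.append(fitness(pop[0]))
    return pop[0], best_history


best, history = evolve(fitness=sum)  # fitness = number of 1s in the string
print(best, history[0], history[-1])
```

Because the elite is copied into every new generation, the best fitness recorded in `history` can never decrease, which is the monotonic guarantee mentioned for the Elitism Operator.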

51
GA Success Stories
  • Commercial Applications
  • Nextel: frequency allocation for cellular phone networks, $15M predicted savings in NY market
  • Pratt & Whitney: turbine engine design; engineer 8 weeks, GA 2 days w/3x improvement
  • International Truck: production scheduling improved by 90% in 5 plants
  • NASA: superior Jupiter trajectory optimization, antennas, FPGAs
  • Koza: 25 instances showing human-competitive performance, such as analog circuit design, amplifiers, filters

52
Representing Candidate Solutions
  • Representation of an individual can be using
    discrete values (binary, integer, or any other
    system with a discrete set of values)
  • Example of Binary DNA Encoding

Individual (Chromosome)
GENE
53
Genetic Operators
[Figure: population at generation t → selection → reproduction → population at generation t + 1]
54
Crossover Operator
Population
offspring
55
Procedural Flow under Competitive Runtime
Reconfiguration
  • Integrates all fault handling stages using EC
    strategy
  • Detects faults by the occurrence of discrepancy
  • Isolates faults by accumulation of discrepancies
  • Failure-specific refurbishment using Genetic
    Operators
  • Intra-Module-Crossover, Inter-Module-Crossover,
    Intra-Module-Mutation
  • Realize online device refurbishment
  • Refurbished online without additional function or
    resource test vectors
  • Repair during the normal data throughput process

56
Template Fault Coverage
[Figure: Half-Adder Template A and Half-Adder Template B]
  • Template A
  • Gate3 is an AND gate
  • Will lose correctness if a Stuck-At-Zero fault
    occurs in the second input line of Gate3
  • Template B
  • Gate3 is a NOT gate and only uses the first input
    line
  • Will work correctly even if second input line is
    stuck at Zero or One

57
Evolvable Hardware
  • Evolutionary Design
  • Start with available CLBs and IOBs
  • Implement a design using Genetic Operators etc
    Limited or no ability to re-design to account for
    suspected faulty resources
  • Evolutionary Regeneration
  • Start with an existing pool of designs
  • Some existing configurations may use faulty
    resources
  • Eliminate use of suspected faulty resources
  • Genetic Operators can be applied to refurbish
    designs

58
Competitive Runtime Reconfiguration (CRR) Overview
  • Uses a Relative Fitness Measure
  • Pairwise discrepancy checking yields relative
    fitness measure
  • Broad temporal consensus in the population used
    to determine fitness metric
  • Transition between Fitness States occurs in the
    population
  • Provides graceful degradation in presence of
    changing environments, applications and inputs,
    since this is a moving measure
  • Test Inputs = Normal Inputs for Data Throughput
  • CBE utilizes neither additional functional nor
    resource test vectors
  • Potential for higher availability as regeneration
    is integrated with normal operation

59
Exploiting Population Information
  • Population contains more robust information than
    individuals
  • Utilize this information for robust fault
    detection, faster regeneration, increased
    diversity for adaptation
  • Detect Failure and Isolate Faulty Resources
  • Detect by inconsistencies among the population
  • Isolate faults using outlier identification and
    aging
  • Realize Regeneration
  • Recovery Complexity << Design Complexity
  • utilize diverse raw material during regeneration
    vs. isolated re-design
  • Temporal consensus directs search
  • Adaptable Performance based on Online Inputs
  • The population evolves to changing physical
    environment, input vectors, and target
    application while increasing availability

60
Selection Process

61
Fitness Adjustment Procedure
62
Discrepancy Mirror Circuit
Fault Coverage
Component | Fault Scenario 1 | Fault Scenario 2 | Fault Scenario 3 | Fault Scenario 4 | Fault-Free
Function Output A | Fault | Correct | Correct | Correct | Correct
Function Output B | Correct | Fault | Correct | Correct | Correct
XNOR A | Disagree (0) | Disagree (0) | Fault: Disagree (0) | Agree (1) | Agree (1)
XNOR B | Disagree (0) | Disagree (0) | Agree (1) | Fault: Disagree (0) | Agree (1)
Buffer A | 0 | 0 | High-Z | 0 | 1
Buffer B | 0 | 0 | 0 | High-Z | 1
Match Output | 0 | 0 | 0 | 0 | 1
63
CGT-Pruned GA Simulator
64
Repair Progress