Learn more at: http://klabs.org


1
Using Software Rules To Enhance FPGA Reliability

Chandru Mirchandani
Lockheed-Martin
September 7-9, 2005

MIRCHANDANI
P226-W/MAPLD2005
2
FPGA Fault Tolerance

Historically realized through triple redundancy, error correcting codes and replicated elements
The fault tolerance process is only as good as the tests run to validate its performance, e.g. when invalid data is not ignored due to an inherent fault in the lookup and compare sequence
The testing was not rigorous enough
The testing was not complete
Lack of real estate and logic on the device precludes the ideal solution
Make educated judgment calls on how much is acceptable and for how long
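
Triple redundancy is usually paired with a majority voter. The following is a minimal sketch, assuming bitwise 2-of-3 voting over three integer words; the function names `tmr_vote` and `tmr_disagreement` are hypothetical and do not come from the presentation.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


def tmr_disagreement(a: int, b: int, c: int) -> int:
    """Bit mask of positions where at least one copy differs from the vote."""
    voted = tmr_vote(a, b, c)
    return (a ^ voted) | (b ^ voted) | (c ^ voted)


good = 0b10110010
upset = good ^ 0b00001000  # one copy with a single flipped bit (illustrative)

print(bin(tmr_vote(good, upset, good)))          # the voted word equals `good`
print(bin(tmr_disagreement(good, upset, good)))  # mask marks the flipped bit
```

The voter masks a single faulty copy, but as the slide notes, it is only as trustworthy as the tests that exercise the vote and compare path itself.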

3
Reconfiguring FPGAs

Replicated circuitry or triple redundancy can be achieved on different devices or on the same device
Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing
Replication could be used to one's advantage if only sub-sets of the circuit were replicated
Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design

4
Types of Errors

Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
Transient error: the system recovers from corrupt data and resumes normal operation
Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
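
The transient/permanent rule can be sketched in a few lines. This is an illustration only: keying detections by an error identifier, the configuration names and areas, and the "largest configuration that fits" selection rule are all assumptions, not details given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class CedPolicy:
    """First detection of an error is transient; any repeat is permanent."""
    seen: set = field(default_factory=set)

    def classify(self, error_id: str) -> str:
        if error_id in self.seen:
            return "permanent"
        self.seen.add(error_id)
        return "transient"


def choose_configuration(configs: dict, available_area: int):
    """Pick the largest (hypothetical) configuration that fits the available area."""
    fitting = {name: area for name, area in configs.items() if area <= available_area}
    return max(fitting, key=fitting.get) if fitting else None


policy = CedPolicy()
first = policy.classify("clb_12")   # first sighting: recover and resume
second = policy.classify("clb_12")  # repeat: diagnose and re-configure

# Hypothetical configurations with their area needs (arbitrary units).
configs = {"full": 100, "reduced": 70, "minimal": 40}
print(first, second, choose_configuration(configs, available_area=80))
```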

5
Software Reliability

Develop criteria for design objective acceptance
Prioritize tasks or functions in order of criticality
Develop metrics to measure performance of tasks with respect to constraints
Evaluate design options based on measured reliability metrics

6
Typical Software Options

Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure.

7
Redundant Instances of Software

Initially, detect, contain and recover from faults as soon as possible
In the event this is not possible, allow control to be passed to the redundant instance within the reliability and availability requirements levied on the system
Finally, include language-defined mechanisms to detect and prevent the propagation of errors
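
The hand-over to a redundant instance can be sketched as follows. This is a minimal illustration, assuming each instance is a callable and using Python exceptions as the language-defined mechanism that contains a fault; `run_with_failover`, `primary` and `backup` are hypothetical names.

```python
def run_with_failover(instances, payload):
    """Try each redundant instance in order; return (index, result) of the first success."""
    errors = []
    for index, instance in enumerate(instances):
        try:
            return index, instance(payload)
        except Exception as exc:  # contain the fault so it does not propagate
            errors.append((index, exc))
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")


def primary(value):
    raise ValueError("simulated fault in the primary instance")


def backup(value):
    return value * 2


used, result = run_with_failover([primary, backup], 21)
print(used, result)
```

A real system would also bound the time spent before passing control, since the slide ties the hand-over to reliability and availability requirements.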

8
Methodology

Estimate the reliability based on instruction set and operational usage
Re-design critical elements to decrease risk
Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
Re-evaluate the reliability based on failure rate
Factor in the uncertainty in evaluation

9
Task Times
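Task | Class | Steps | Step Time ($s_{task}$) | Task Time | Total Tasks Time | $t_{task}$
Reading | r | $\sum x_{ri}$ | $s_r$ | $s_r \cdot \sum x_{ri}$ | $(s_r \cdot \sum x_{ri}) \cdot n_r$ | $t_r$
Parsing | p | $\sum x_{pi}$ | $s_p$ | $s_p \cdot \sum x_{pi}$ | $(s_p \cdot \sum x_{pi}) \cdot n_p$ | $t_p$
Pre-processing | p1 | $\sum x_{p1i}$ | $s_{p1}$ | $s_{p1} \cdot \sum x_{p1i}$ | $(s_{p1} \cdot \sum x_{p1i}) \cdot n_{p1}$ | $t_{p1}$
Monitoring | M | $\sum x_{Mi}$ | $s_M$ | $s_M \cdot \sum x_{Mi}$ | $(s_M \cdot \sum x_{Mi}) \cdot n_M$ | $t_M$
Sorting | s | $\sum x_{si}$ | $s_s$ | $s_s \cdot \sum x_{si}$ | $(s_s \cdot \sum x_{si}) \cdot n_s$ | $t_s$
Processing | P | $\sum x_{Pi}$ | $s_P$ | $s_P \cdot \sum x_{Pi}$ | $(s_P \cdot \sum x_{Pi}) \cdot n_P$ | $t_P$
Post-processing | p2 | $\sum x_{p2i}$ | $s_{p2}$ | $s_{p2} \cdot \sum x_{p2i}$ | $(s_{p2} \cdot \sum x_{p2i}) \cdot n_{p2}$ | $t_{p2}$
Status-gathering | S | $\sum x_{Si}$ | $s_S$ | $s_S \cdot \sum x_{Si}$ | $(s_S \cdot \sum x_{Si}) \cdot n_S$ | $t_S$
Writing | w | $\sum x_{wi}$ | $s_w$ | $s_w \cdot \sum x_{wi}$ | $(s_w \cdot \sum x_{wi}) \cdot n_w$ | $t_w$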
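
The arithmetic behind each row can be sketched as below, assuming the Steps entry is a sum of per-step counts, that the task time is the step time multiplied by that sum, and that the total is the task time multiplied by the number of tasks. The function names and the sample numbers are hypothetical.

```python
def task_time(step_time: float, step_counts: list) -> float:
    """Task time: step time multiplied by the total number of steps."""
    return step_time * sum(step_counts)


def total_tasks_time(step_time: float, step_counts: list, n_tasks: int) -> float:
    """Total time for a task class: task time multiplied by the number of tasks."""
    return task_time(step_time, step_counts) * n_tasks


# Illustrative numbers only: 2.0 time units per step, three kinds of steps,
# and ten Reading tasks over the mission.
reading_task = task_time(2.0, [3, 4, 5])
reading_total = total_tasks_time(2.0, [3, 4, 5], n_tasks=10)
print(reading_task, reading_total)
```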
10
FPGA System - Conceptual

Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem.
11
Task Reliability Block Diagram
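OR (redundant instances): $1 - \left[1 - e^{-(1-\beta_h)\,\lambda_{shwi}\,\tau}\, e^{-(1-\beta_s)\,\lambda_{sswi}\,\tau}\right]^2$
AND (common cause): $e^{-\beta_h\, u_h\, \lambda_{hwi}\,\tau}\, e^{-\beta_s\, u_s\, \lambda_{swi}\,\tau}$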
12
Definitions
Calendar Time, $t$ | Mission time used to calculate the reliability
Execution, $e_i$ | Percentage of mission time used by the Task (or Subsystem)
Execution Time, $\tau$ | $\tau = e_i \cdot t$
Usage for SW, $u_s$ | Percentage of the total software used by the Task
Usage for HW, $u_h$ | Percentage of the area of the active portion of the device used by the Task
$\lambda_{shwi}$ | Failure intensity of Task i hardware with respect to execution time
$\lambda_{sswi}$ | Failure intensity of Task i software with respect to execution time
$\beta_{hi}$ | Fraction of Task i hardware failures that are common cause failures
$\beta_{si}$ | Fraction of Task i software failures that are common cause failures
13
Parameters Derivations

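Failure Intensity: $\lambda_{shwi} = \lambda_{hwi} \cdot u_h \cdot (1-\beta_h)$
Failure Intensity: $\lambda_{sswi} = \lambda_{swi} \cdot u_s \cdot (1-\beta_s)$
Common Cause: $\lambda_{hwi} \cdot u_h \cdot \beta_h$ and $\lambda_{swi} \cdot u_s \cdot \beta_s$
Execution Time: $\tau = e_i \cdot t$
$R_{SSi}$: Subsystem Reliability
System Reliability: $R_S = R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$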

Parameter | Reading | Parsing | Pre-Processing
Usage SW, $u_s$ | 0.3 | 0.3 | 0.4
Usage HW, $u_h$ | 0.3 | 0.4 | 0.3
$\lambda_{hwi}$ | 0.3 | 0.4 | 0.3
$\lambda_{swi}$ | 0.3 | 0.4 | 0.3
Execution, $e_i$ | 0.2 | 0.1 | 0.7
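
The derivations above can be turned into a small calculation. This is a sketch under stated assumptions, not the author's implementation: the mission time is not given in the slides, so `t = 1.0` is assumed; the subsystem reliability is taken as the duplex (parallel) block multiplied by the common cause block, which is one reading of the block diagram; and the function names are hypothetical. The numbers it produces should therefore not be expected to match the values reported in the presentation.

```python
import math

# (u_s, u_h, lambda_hw, lambda_sw, e_i) per task, copied from the parameter table.
TASKS = {
    "Reading":        (0.3, 0.3, 0.3, 0.3, 0.2),
    "Parsing":        (0.3, 0.4, 0.4, 0.4, 0.1),
    "Pre-Processing": (0.4, 0.3, 0.3, 0.3, 0.7),
}


def subsystem_reliability(u_s, u_h, lam_hw, lam_sw, e_i, beta_h, beta_s, t=1.0):
    tau = e_i * t                             # execution time
    lam_shw = lam_hw * u_h * (1.0 - beta_h)   # independent hardware intensity
    lam_ssw = lam_sw * u_s * (1.0 - beta_s)   # independent software intensity
    cc_hw = lam_hw * u_h * beta_h             # common cause hardware intensity
    cc_sw = lam_sw * u_s * beta_s             # common cause software intensity
    single = math.exp(-lam_shw * tau) * math.exp(-lam_ssw * tau)
    duplex = 1.0 - (1.0 - single) ** 2        # assumed: two instances in parallel
    common = math.exp(-cc_hw * tau) * math.exp(-cc_sw * tau)
    return duplex * common                    # assumed: series with common cause


def system_reliability(beta_h, beta_s, t=1.0):
    """Product of the three subsystem reliabilities."""
    r = 1.0
    for params in TASKS.values():
        r *= subsystem_reliability(*params, beta_h=beta_h, beta_s=beta_s, t=t)
    return r


# Same design on one device versus diverse designs on different devices.
print(system_reliability(beta_h=0.01, beta_s=1.0))
print(system_reliability(beta_h=0.0025, beta_s=0.1))
```

Under this model, lowering either common cause fraction moves failure intensity out of the series block and into the duplexed block, which is why the second call returns the higher value.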
14
Extending the Rules

The programmed design, be it the original duplex design (duplicated or diverse) or the option for re-configuration, will optimize whatever option is used to enhance fault tolerance
For example, in the Reading Task, the area usage and operational profile are shown to have an effect on the predicted overall reliability of the FPGA-based design
Yu and McCluskey state that the designs of CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm will perform, but the more efficiently (or optimally) the re-configured design will perform in the event of a permanent failure

15
Further Extension

Greater area usage carries a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, that design and its associated code have a greater propensity for failures
The common cause fractions used in the paper are relative numbers to illustrate the model
With a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.

16
Assertions

Implemented on different devices, the hardware common cause fraction drops to ¼% because the physical defects in common are now almost negligible, and the only common effects are environmental, i.e. temperature, power and external stresses.

17
More Assertions

The software common cause fraction is high in both same-code cases, since we assume nearly all software failures are common cause. There is very little change from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼% variation
In the diverse design paradigm, the hardware dependence remains in the same ratio, but the software fractions vary drastically: on the same device the common cause fraction is 50%, and it drops to 10% for diverse designs on different devices

18
System Configuration Options
Configuration | HW Common Cause Fraction, $\beta_h$ | SW Common Cause Fraction, $\beta_s$
Same Code, Same Device | 0.01 | 1
Same Code, Diff Devices | 0.0025 | 0.9975
Diff Code, Same Device | 0.01 | 0.5
Diff Code, Diff Devices | 0.0025 | 0.1
19
Results
Option | Configuration | FPGA-based System Reliability
1 | Same Code, Same Devices | 0.895726564
2 | Same Code, Diff Devices | 0.895973815
3 | Diff Code, Same Devices | 0.944752579
4 | Diff Code, Diff Devices | 0.98356125
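
One way to read these results is to compare unreliability, 1 - R, against option 1. The sketch below only re-arranges the reported values; the dictionary `RESULTS` copies them from the table, and the helper name `unreliability_ratio` is hypothetical.

```python
# Reported system reliabilities, copied from the results table.
RESULTS = {
    "Same Code, Same Devices": 0.895726564,
    "Same Code, Diff Devices": 0.895973815,
    "Diff Code, Same Devices": 0.944752579,
    "Diff Code, Diff Devices": 0.98356125,
}


def unreliability_ratio(baseline: str, option: str) -> float:
    """How many times lower the unreliability (1 - R) is than the baseline's."""
    return (1.0 - RESULTS[baseline]) / (1.0 - RESULTS[option])


for name in RESULTS:
    ratio = unreliability_ratio("Same Code, Same Devices", name)
    print(f"{name}: 1 - R = {1.0 - RESULTS[name]:.6f}, ratio = {ratio:.2f}")
```

On these figures, moving from option 1 to option 4 lowers the unreliability roughly sixfold, while changing only the devices (option 2) barely moves it.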
20
Conclusions