Title: Using Software Rules To Enhance FPGA Reliability
1. Using Software Rules To Enhance FPGA Reliability
- Chandru Mirchandani
- Lockheed-Martin
- September 7-9, 2005
MIRCHANDANI
P226-W/MAPLD2005
1
2. FPGA Fault Tolerance
- Historically realized through triple redundancy, error correcting codes and replicated elements
- The fault tolerance process is only as good as the tests run to validate its performance, e.g. when invalid data is not ignored due to an inherent fault in the lookup and compare sequence:
- The testing was not rigorous enough
- The testing was not complete
- Lack of real estate and logic on the device precludes the ideal solution
- Make educated judgment calls on how much is acceptable and for how long
3. Reconfiguring FPGAs
- Replicated circuitry or triple redundancy can be achieved on different devices or on the same device
- Using the same device to replicate a complete circuit will not meet the real-estate constraint and will decrease performance due to routing
- Replication on the same device could be used to one's advantage if sub-sets of the circuit were replicated
- Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design
4. Types of Errors
- Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
- Transient error: the system recovers from corrupt data and resumes normal operation
- Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
- For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
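The first-detection rule above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the CED mechanism itself: the class name, the `resource_id` key and the returned action strings are hypothetical, and tracking detections per resource is an assumption the slides do not state.

```python
class ErrorClassifier:
    """First detection is treated as transient, any repeat as permanent.

    Assumption: detections are tracked per resource (e.g. per CLB).
    """

    def __init__(self):
        self._seen = set()  # resources that have already reported an error

    def classify(self, resource_id):
        if resource_id in self._seen:
            return "permanent"
        self._seen.add(resource_id)
        return "transient"


def handle(kind):
    """Map an error class to the response described on the slide."""
    if kind == "transient":
        return "recover corrupt data and resume normal operation"
    return "run fault diagnosis, then load a configuration that avoids the damaged resource"
```

A second error reported by the same resource is what triggers diagnosis and re-configuration in this sketch.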
5. Software Reliability
- Develop Criteria for Design Objective Acceptance
- Prioritize tasks or functions in order of criticality
- Develop metrics to measure performance of tasks with respect to constraints
- Evaluate design options based on measured reliability metrics
6. Typical Software Options
- Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure.
7. Redundant Instances of Software
- Initially, detect, contain and recover from faults as soon as possible
- In the event this is not possible, allow control to be passed on to the redundant instance within the reliability and availability requirements levied on the system
- Finally, include language-defined mechanisms to detect and prevent the propagation of errors
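As a rough illustration of these three points, the Python sketch below uses exception handling as the language-defined mechanism: a fault in the primary instance is detected and contained, and control passes to a redundant instance. The function names and the frame-parsing example are hypothetical and are not taken from the presentation.

```python
def run_with_failover(primary, redundant, data):
    """Run the primary instance; on a detected fault, pass control to the redundant one."""
    try:
        return "primary", primary(data)
    except Exception:
        # Contain the fault: nothing from the failed call is propagated.
        return "redundant", redundant(data)


def parse_primary(frame):
    # Hypothetical critical function that detects invalid input.
    if not frame:
        raise ValueError("empty frame")
    return frame.upper()


def parse_redundant(frame):
    # Hypothetical redundant instance with a safe fallback result.
    return frame.upper() if frame else ""
```

A real system would also bound the hand-over time so that it stays within the availability requirement; that timing aspect is not modelled here.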
8. Methodology
- Estimate the reliability based on instruction set and operational usage
- Re-design critical elements to decrease risk
- Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
- Re-evaluate the reliability based on failure rate
- Factor in the uncertainty in evaluation
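9. Task Times
Task              Class  Steps            Step Time ($s_{task}$)  Task Time              Total Tasks                     Time ($t_{task}$)
Reading           r      $\sum x_{ri}$    $s_r$                   $s_r \sum x_{ri}$      $(s_r \sum x_{ri}) n_r$         $t_r$
Parsing           p      $\sum x_{pi}$    $s_p$                   $s_p \sum x_{pi}$      $(s_p \sum x_{pi}) n_p$         $t_p$
Pre-processing    p1     $\sum x_{p1i}$   $s_{p1}$                $s_{p1} \sum x_{p1i}$  $(s_{p1} \sum x_{p1i}) n_{p1}$  $t_{p1}$
Monitoring        M      $\sum x_{Mi}$    $s_M$                   $s_M \sum x_{Mi}$      $(s_M \sum x_{Mi}) n_M$         $t_M$
Sorting           s      $\sum x_{si}$    $s_s$                   $s_s \sum x_{si}$      $(s_s \sum x_{si}) n_s$         $t_s$
Processing        P      $\sum x_{Pi}$    $s_P$                   $s_P \sum x_{Pi}$      $(s_P \sum x_{Pi}) n_P$         $t_P$
Post-processing   p2     $\sum x_{p2i}$   $s_{p2}$                $s_{p2} \sum x_{p2i}$  $(s_{p2} \sum x_{p2i}) n_{p2}$  $t_{p2}$
Status-gathering  S      $\sum x_{Si}$    $s_S$                   $s_S \sum x_{Si}$      $(s_S \sum x_{Si}) n_S$         $t_S$
Writing           w      $\sum x_{wi}$    $s_w$                   $s_w \sum x_{wi}$      $(s_w \sum x_{wi}) n_w$         $t_w$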
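One way to read a row of this table, assuming the unreadable symbol in the Steps column is a sum over step counts, is: task time = step time × total steps, and total time for the class = task time × number of tasks. The Python sketch below shows that arithmetic with made-up numbers; the function name and the values are hypothetical.

```python
def total_task_time(step_time, step_counts, n_tasks):
    """Return (time for one task, time for all tasks of the class).

    step_time   -- time per step for this task class
    step_counts -- per-item step counts, summed to give the Steps column
    n_tasks     -- number of tasks of this class
    """
    task_time = step_time * sum(step_counts)
    return task_time, task_time * n_tasks


# Illustrative values only: 0.5 time units per step, 2 + 3 steps, 4 tasks.
one_task, all_tasks = total_task_time(0.5, [2, 3], 4)
```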
10. FPGA System - Conceptual
- Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem
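11. Task Reliability Block Diagram
[Figure: task reliability block diagram; the blocks carry the two expressions below and are combined through AND and OR gates. $\tau$ is the Task execution time.]

$$1 - \left[1 - e^{-(1-\beta_h)\,\lambda_{shwi}\,\tau}\; e^{-(1-\beta_s)\,\lambda_{sswi}\,\tau}\right]^{2}$$

$$e^{-\beta_h\, u_h\, \lambda_{hwi}\,\tau}\; e^{-\beta_s\, u_s\, \lambda_{swi}\,\tau}$$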
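A plausible reading of the two expressions is a pair of redundant copies in parallel (the squared term), placed in series with a common-cause block. The Python sketch below implements that reading only; the function and argument names are illustrative, and the intensities passed in are assumed to be already split into their independent and common-cause parts.

```python
import math


def task_reliability(lam_ind_hw, lam_ind_sw, lam_cc_hw, lam_cc_sw, tau, copies=2):
    """Reliability of one Task under the assumed series/parallel structure.

    lam_ind_* -- independent failure intensities of one copy (hardware, software)
    lam_cc_*  -- common-cause failure intensities shared by all copies
    tau       -- execution time of the Task
    copies    -- number of redundant copies (2 matches the squared term)
    """
    # Probability that a single copy survives its independent failures.
    one_copy = math.exp(-lam_ind_hw * tau) * math.exp(-lam_ind_sw * tau)
    # Parallel block: it fails only if every copy fails.
    redundant = 1.0 - (1.0 - one_copy) ** copies
    # Series block: a common-cause failure takes out all copies at once.
    common = math.exp(-lam_cc_hw * tau) * math.exp(-lam_cc_sw * tau)
    return redundant * common
```

With no independent failures the result reduces to the common-cause term alone, which is why a common-cause fraction close to 1 removes most of the benefit of redundancy.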
12. Definitions
Calendar Time $t$        Mission time used to calculate the reliability
Execution $e_i$          Percentage of mission time used by the Task (or Subsystem)
Execution Time $\tau$    $\tau = e_i \cdot t$
Usage for SW ($u_s$)     Percentage of the total software used by the Task
Usage for HW ($u_h$)     Percentage of the area of the active portion of the device used by the Task
$\lambda_{shwi}$         Failure intensity of Task i hardware with respect to execution time
$\lambda_{sswi}$         Failure intensity of Task i software with respect to execution time
$\beta_{hi}$             Fraction of Task i hardware failures that are common cause failures
$\beta_{si}$             Fraction of Task i software failures that are common cause failures
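13. Parameter Derivations
- Failure Intensity $\lambda_{shwi} = \lambda_{hwi} \cdot u_h \cdot (1-\beta_h)$
- Failure Intensity $\lambda_{sswi} = \lambda_{swi} \cdot u_s \cdot (1-\beta_s)$
- Common Cause: $\lambda_{hwi} \cdot u_h \cdot \beta_h$ and $\lambda_{swi} \cdot u_s \cdot \beta_s$
- Execution Time $\tau = e_i \cdot t$
- $R_{SSi}$ = Subsystem Reliability
- System Reliability $R_S = R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$

Parameter            Reading  Parsing  Pre-Processing
Usage SW, $u_s$      0.3      0.3      0.4
Usage HW, $u_h$      0.3      0.4      0.3
$\lambda_{hwi}$      0.3      0.4      0.3
$\lambda_{swi}$      0.3      0.4      0.3
Execution, $e_i$     0.2      0.1      0.7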
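As a worked example of these derivations, take the Reading column together with the common cause fractions of the "Same Code, Same Device" row of the System Configuration Options slide (hardware 0.01, software 1). The mission time $t$ is not stated on the slides, so it is left symbolic here.

```latex
\begin{align*}
\lambda_{shw,\mathrm{Reading}} &= 0.3 \times 0.3 \times (1 - 0.01) = 0.0891 \\
\lambda_{ssw,\mathrm{Reading}} &= 0.3 \times 0.3 \times (1 - 1) = 0 \\
\text{HW common cause} &= 0.3 \times 0.3 \times 0.01 = 0.0009 \\
\text{SW common cause} &= 0.3 \times 0.3 \times 1 = 0.09 \\
\tau_{\mathrm{Reading}} &= 0.2\, t
\end{align*}
```

With a software common cause fraction of 1, the independent software term vanishes, so duplicating identical code on one device adds no protection against software failures in this model.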
14. Extending the Rules
- The programmed design, be it the original duplex design, duplicated or diverse, or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance
- For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design
- Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm will perform, but the more efficiently or optimally the re-configured design will perform in the event of a permanent failure.
15. Further Extension
- Greater area usage brings a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, the design and its associated code have a greater propensity for failures
16. Assertions
- The common cause fractions used in the paper are relative numbers to illustrate the model
- For a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
- When implemented on different devices, this fraction drops to ¼% because the physical defects are now almost negligible, and the only common effects are environmental, i.e. temperature, power and external stresses.
17. More Assertions
- The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. There is very little change from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼% variation
- In the diverse design paradigm, the hardware dependence remains in the same ratio relatively, but the software fractions vary drastically. On the same device, the common cause fraction is 50%, and it drops to 10% in the case of diverse designs on different devices
18. System Configuration Options
Configuration             HW Common Cause Fraction ($\beta_h$)  SW Common Cause Fraction ($\beta_s$)
Same Code, Same Device    0.01                                  1
Same Code, Diff Devices   0.0025                                0.9975
Diff Code, Same Device    0.01                                  0.5
Diff Code, Diff Devices   0.0025                                0.1
19. Results
Option  Configuration             FPGA-based System Reliability
1       Same Code, Same Devices   0.895726564
2       Same Code, Diff Devices   0.895973815
3       Diff Code, Same Devices   0.944752579
4       Diff Code, Diff Devices   0.98356125
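The comparison can be sketched end to end in Python using the parameter table and the common cause fractions from the earlier slides. Several things here are assumptions rather than the author's method: the mission time is not given in the slides, so `t=1.0` is a placeholder; the (1 - fraction) factor is applied once to each independent intensity; and each Task is modelled as two redundant copies in series with a common-cause block. The printed values are therefore not expected to match the table above, only to show the same ordering of the four options.

```python
import math

# Per-task parameters from the Parameter Derivations slide.
TASKS = {
    "Reading":        {"u_s": 0.3, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.2},
    "Parsing":        {"u_s": 0.3, "u_h": 0.4, "lam_hw": 0.4, "lam_sw": 0.4, "e": 0.1},
    "Pre-Processing": {"u_s": 0.4, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.7},
}

# (hardware, software) common cause fractions from the configuration slide.
CONFIGS = {
    "Same Code, Same Device":  (0.01, 1.0),
    "Same Code, Diff Devices": (0.0025, 0.9975),
    "Diff Code, Same Device":  (0.01, 0.5),
    "Diff Code, Diff Devices": (0.0025, 0.1),
}


def task_reliability(p, beta_h, beta_s, t):
    """Two redundant copies in series with a common-cause block (assumed structure)."""
    tau = p["e"] * t                       # execution time of this Task
    hw = p["lam_hw"] * p["u_h"]            # usage-scaled hardware intensity
    sw = p["lam_sw"] * p["u_s"]            # usage-scaled software intensity
    independent = (1 - beta_h) * hw + (1 - beta_s) * sw
    common = beta_h * hw + beta_s * sw
    one_copy = math.exp(-independent * tau)
    return (1 - (1 - one_copy) ** 2) * math.exp(-common * tau)


def system_reliability(beta_h, beta_s, t=1.0):
    """Product of the Task (subsystem) reliabilities; t=1.0 is a placeholder."""
    r = 1.0
    for p in TASKS.values():
        r *= task_reliability(p, beta_h, beta_s, t)
    return r


results = {name: system_reliability(bh, bs) for name, (bh, bs) in CONFIGS.items()}
for name, value in results.items():
    print(f"{name}: {value:.6f}")
```

Under these assumptions, moving failure intensity from the common-cause block to the independent block always raises the computed reliability, which is why diverse code on different devices ranks highest.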
20. Conclusions
- Cost and Schedule Slips
- Development Delays and Costs
- Adaptive Model
- Optimization and Design Constraints
- Contact Address: chandru.j.mirchandani_at_lmco.com