Title: Methodology to Compute Architectural Vulnerability Factors
1Methodology to Compute Architectural Vulnerability Factors
- Chris Weaver1, 2
- Shubhendu S. Mukherjee1
- Joel Emer 1
- Steven K. Reinhardt1, 2
- Todd Austin2
- 1Fault Aware Computing Technology (FACT), VSSAD, Intel
- 2University of Michigan
2Overview
- Background
- Previous reliability estimation methodology
- Proposed methodology for early reliability estimates
- Sample analysis
- Conclusion
3Strike Changes State
[Figure: a particle strike flips the value of a stored bit (0 and 1 shown)]
4Failure Rate Definitions
- Interval-based
- MTBF = Mean Time Between Failures
- Rate-based
- FIT = Failure in Time = 1 failure in a billion hours
- 1 year MTBF = 10^9 / (24 × 365) FIT ≈ 114,155 FIT
- Additive
Cache: 0 FIT
IQ: 114K FIT
FU: 114K FIT
Total: 228K FIT
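The conversion between MTBF and FIT on this slide can be checked with a short sketch. The function names are illustrative, and a year is taken as 24 × 365 hours, as on the slide.

```python
HOURS_PER_YEAR = 24 * 365   # matches the slide's 24 x 365
FIT_HOURS = 10**9           # 1 FIT = 1 failure per billion hours


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert an interval-based MTBF (in years) to a rate-based FIT value."""
    return FIT_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a FIT value back to an MTBF in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


one_year = mtbf_years_to_fit(1.0)    # about 114,155 FIT
total = 0.0 + one_year + one_year    # FIT rates add: cache + IQ + FU
print(round(one_year), round(total), fit_to_mtbf_years(total))
```

Two structures with a one-year MTBF each combine to roughly a six-month MTBF. Because FIT rates simply add, they are more convenient than MTBF when summing over structures.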
5Motivation
6Results of precise early analysis
- If we meet goal
- we are done
- If we don't meet goal
- add error protection schemes
7Objectives
- Determine which bits matter
- Compute FIT rate
8Strike on state bit
- Bit read?
  - no: benign fault, no error
  - yes: Bit has error protection?
    - yes, error is only detected (e.g., parity, no recovery): detected, but unrecoverable error (DUE)
    - yes, error can be corrected (e.g., ECC): no error
    - no: Does bit matter?
      - yes: silent data corruption (SDC)
      - no: benign fault, no error
- We only focus on SDC FIT
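The decision flow on this slide can be written as a small classifier. This is a minimal sketch: the `Protection` enum, the function name, and the outcome strings are illustrative choices, and the SDC outcome for an unprotected bit that matters is inferred from the slide's closing remark.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"
    DETECT_ONLY = "detect only (e.g., parity, no recovery)"
    DETECT_AND_CORRECT = "detect and correct (e.g., ECC)"


def classify_strike(bit_is_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Walk the flowchart for a single particle strike on a state bit."""
    if not bit_is_read:
        return "benign fault, no error"
    if protection is Protection.DETECT_AND_CORRECT:
        return "no error"
    if protection is Protection.DETECT_ONLY:
        return "DUE"
    # Unprotected bit that was read: the outcome depends on whether it matters.
    return "SDC" if bit_matters else "benign fault, no error"


print(classify_strike(True, Protection.NONE, True))
```

Only the last branch contributes to the SDC FIT rate, which is the quantity the rest of the talk estimates.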
9Architectural Vulnerability Factor (AVF)
- AVF_bit = Probability bit matters
- AVF_bit = (# of visible errors) / (# of bit flips from particle strikes)
- FIT_bit = intrinsic FIT_bit × AVF_bit
10Previous AVF Methodology
- Statistical Fault Injection with RTL
[Figure: simulate a strike on a latch feeding logic; does the fault propagate to architectural state?]
11Characteristics of SFI with RTL
- Naturally characterizes all logical structures
- RTL is not available until late in the design cycle
- Numerous experiments to flip all bits
- Generally done at the chip level
- Limited structural insight
12Objectives
- Determine which bits matter
- Earlier in the design cycle
- With fewer experiments
- At the structural-level
- Compute FIT rate
- Intrinsic FIT per bit
- Architectural Vulnerability Factor
13Our Analysis: Which bits matter?
- Branch Predictor
- Doesn't matter at all (AVF = 0%)
- Program Counter
- Almost always matters (AVF ≈ 100%)
14Architecturally Correct Execution (ACE)
Program Input
Program Outputs
- ACE path requires only a subset of values to flow correctly through the program's data flow graph (and the machine)
- Anything else (un-ACE path) can be derated away
15Example of un-ACE instruction: Dynamically Dead Instruction
Most bits of an un-ACE instruction do not affect program output
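One way to find dynamically dead instructions is a backward pass over a dynamic trace, tracking which registers are still needed. The sketch below is an illustration, not the authors' tool. The `Inst` record, the `has_side_effect` flag, and the register-only view of state are all assumptions.

```python
from typing import List, NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    dest: Optional[str]            # register written, if any
    srcs: Tuple[str, ...]          # registers read
    has_side_effect: bool = False  # e.g., a store or I/O that reaches program output


def find_dynamically_dead(trace: List[Inst]) -> List[bool]:
    """Return one flag per instruction: True if its result is never needed."""
    live = set()                   # registers whose current value is still needed
    dead = [False] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        inst = trace[i]
        needed = inst.has_side_effect or (inst.dest is not None and inst.dest in live)
        if needed:
            if inst.dest is not None:
                live.discard(inst.dest)   # this write satisfies the later read
            live.update(inst.srcs)        # its inputs must now be correct too
        else:
            dead[i] = True                # overwritten or never read: un-ACE
    return dead


trace = [
    Inst("r1", ()),                                  # read only by a dead instruction
    Inst("r2", ("r1",)),                             # r2 overwritten before any read
    Inst("r1", ()),
    Inst("r2", ()),
    Inst(None, ("r1", "r2"), has_side_effect=True),  # store that reaches the output
]
print(find_dynamically_dead(trace))
```

Because the pass runs backward, an instruction whose result feeds only dead instructions is also marked dead, so the first two instructions in the example are both flagged.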
16Dynamic Instruction Breakdown
Average across all of Spec2K slices
17Mapping ACE and un-ACE Instructions to the Instruction Queue
ACE Inst
Architectural un-ACE
Micro-architectural un-ACE
18Vulnerability of a structure
- AVF = fraction of cycles a bit contains ACE state
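For a whole structure, this definition can be averaged over all bits and all simulated cycles. The sketch below assumes a hypothetical per-cycle count of ACE bits resident in the structure, such as a performance model might record. The function name and the input format are illustrative.

```python
from typing import Sequence


def structure_avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """Average fraction of the structure's bits that hold ACE state."""
    if total_bits <= 0 or len(ace_bits_per_cycle) == 0:
        raise ValueError("need a positive bit count and at least one cycle")
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# A 64-bit structure observed for four cycles (made-up occupancy numbers).
print(structure_avf([0, 32, 64, 32], total_bits=64))
```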
19Little's Law for ACEs
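Little's Law states that the average number of items resident in a system equals the arrival rate times the average time each item spends there. One plausible way to apply it to ACE state, stated here as an assumption rather than the slide's exact formula, is to estimate the average number of resident ACE bits and divide by the structure size. All parameter names below are hypothetical.

```python
def avf_from_littles_law(ace_bits_in_per_cycle: float,
                         avg_ace_residency_cycles: float,
                         total_bits: int) -> float:
    """Estimate AVF as (ACE arrival rate x ACE residency) / structure size."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    avg_ace_bits_resident = ace_bits_in_per_cycle * avg_ace_residency_cycles
    return avg_ace_bits_resident / total_bits


# Made-up numbers: 8 ACE bits enter per cycle and stay 4 cycles in a 64-bit structure.
print(avf_from_littles_law(8, 4, 64))
```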
20Computing AVF
- Our approach is conservative
- We assume every bit is ACE unless proven otherwise
- Data Analysis
- Try to prove that data held in a structure is un-ACE
- Timing Analysis
- Tracks the time this data spent in the structure
21Computing FIT rate of a Chip
- Total FIT = Σ_i (FIT per bit_i × # of bits_i × AVF_i)
Intrinsic FIT per bit from externally published
data
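The sum over structures can be sketched as follows. The AVF values for the instruction queue and functional units are the ones reported later in the talk. The bit counts and the intrinsic FIT per bit are hypothetical placeholders, not published data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Structure:
    name: str
    num_bits: int       # hypothetical size
    avf: float          # architectural vulnerability factor, 0.0 to 1.0
    fit_per_bit: float  # intrinsic FIT per bit (placeholder value)


def total_fit(structures: Iterable[Structure]) -> float:
    """Total FIT = sum over structures of FIT per bit x number of bits x AVF."""
    return sum(s.fit_per_bit * s.num_bits * s.avf for s in structures)


chip = [
    Structure("instruction queue", num_bits=10_000, avf=0.29, fit_per_bit=0.001),
    Structure("functional units", num_bits=5_000, avf=0.09, fit_per_bit=0.001),
]
print(total_fit(chip))
```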
22Results: Experimental Setup
- Used ASIM modeling infrastructure
- Model of an Itanium2-like processor
- Ran all Spec2K benchmarks
- Compiled with highest level of optimization with the Intel Electron compiler
- Simulated under a full OS
- Simulation points chosen using SimPoint (Sherwood et al.)
23Instruction Queue
ACE percentage = AVF = 29%
24Functional Units
ACE percentage = AVF = 9%
25Computing FIT rate of Chip
Intrinsic FIT per bit from externally published
data
26Summary
- Determine which bits matter
- ACE (Architecturally Correct Execution)
- Compute FIT rate
- Intrinsic FIT per bit
- AVF (Architectural Vulnerability Factor)
27Questions?
28Statistical Fault Injection (SFI)
- Algorithm
- Find a statistically significant set of bits
- Randomly select a bit
- Flip the bit
- Run two simulations: one with the bit flip and one without
- Run for a pre-defined number of cycles
- Compare architectural state of the two simulations (e.g., register file)
- If mismatch, declare an error
- Repeat algorithm with a different bit flip
- AVF = mismatches observed / total experiments
- Used widely
- Has provided useful AVF numbers to date
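The SFI loop above can be sketched against a toy machine. Everything here is hypothetical: `run_toy_machine` stands in for an RTL simulation, with the low 4 bits of its 8-bit state acting as architectural state and the high 4 bits as predictor-like state that never reaches the output.

```python
import random


def run_toy_machine(state: int, cycles: int) -> int:
    """Hypothetical 8-bit machine; returns the architectural state after `cycles`."""
    acc = state & 0x0F           # architectural accumulator
    hint = (state >> 4) & 0x0F   # micro-architectural state, never visible
    for _ in range(cycles):
        acc = (acc + 3) & 0x0F
        hint ^= 0x5
    return acc


def sfi_avf(num_bits: int, experiments: int, cycles: int, seed: int = 0) -> float:
    """AVF = mismatches observed / total experiments."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(experiments):
        state = rng.getrandbits(num_bits)
        bit = rng.randrange(num_bits)                         # randomly select a bit
        golden = run_toy_machine(state, cycles)               # run without the flip
        faulty = run_toy_machine(state ^ (1 << bit), cycles)  # run with the flip
        if golden != faulty:                                  # compare architectural state
            mismatches += 1
    return mismatches / experiments


print(sfi_avf(num_bits=8, experiments=2000, cycles=10))
```

In this toy model half of the bits are architectural, so the estimate should land near 0.5. A real SFI campaign differs mainly in the cost of each simulation run, which is why the number of experiments matters.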
29SFI vs. ACE analysis