Title: Automated Derivation of Application-Aware Error Detectors
1. Automated Derivation of Application-Aware Error Detectors
- Karthik Pattabiraman
- Joint work with G.P. Saggese, N. Nakka, D. Chen, W. Healey, W. Gu, Z. Kalbarczyk, and Ravi K. Iyer
2. Fault Tolerance: Myths and Realities
- Myth: error detection is easy or unnecessary
  - Reality: error detection is a hard problem; even the best recovery methods are useless without efficient, low-latency error detection
- Myth: programs follow crash-failure semantics
  - Reality: error propagation causes hard-to-recover failures
- Myth: duplication is easy and cost-efficient
  - Reality: high hardware and performance cost, and correlated failures are hard to avoid
3. Importance of Error Detection

              C=0.99, n=2   C=0.99, n=4   C=0.99, n=inf   C=0.8, n=2   C=0.8, n=4   C=0.8, n=inf
  Rm = 0.9       0.989         0.999         0.999          0.972        0.978        0.978
  Rm = 0.7       0.908         0.988         0.996          0.868        0.918        0.921
  Rm = 0.5       0.748         0.931         0.990          0.700        0.812        0.833

- With low error-detection coverage, reliability saturates
4. Crash Latency Distributions (Linux on Pentium P4 and PowerPC G4)
- A measurement study of the Linux kernel shows significant crash latency
- A billion CPU cycles can elapse between the time a corrupted instruction is executed (or bad data is accessed) and the system crash
- Application-aware checking can reduce this latency
5. Drawbacks of Duplication
- IBM G5 approach
  - Replicated pipelines in lock-step (covering only 30% of the processor)
  - Correlated errors are possible in shared state
- Tandem NonStop Himalaya
  - Voting on every clock cycle at the pins of the processor
  - Modern processors are too complex to support this
  - Vote on I/O or memory operations instead (crash?)
- Many faults detected by duplication do not manifest as application-visible errors
  - Application knowledge is needed to detect the errors that matter
  - Detecting every error can degrade overall availability
6. Application-Aware Error Detection
7. Goals
- Embed error detectors in code based on application-specific properties
- Preemptively detect errors at runtime and prevent error propagation that results in corrupted state
- Automatically derive detectors from application code and execution
- Provide efficient hardware/software support for implementing the error detectors
- Extend error detectors to security checking
8. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points to form a minimum symbolic expression
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
9. Fault Models
- Errors in application data
  - A data value is corrupted at the time of its definition (when it is written to or computed)
- Hardware errors represented
  - Incorrect computation (not detected by ECC!)
  - Soft errors in memory, registers, and caches
  - Errors in instruction issue/decode
- Software errors represented
  - Uninitialized or incorrectly initialized values
  - Memory corruption, dangling pointers
  - Integer overflows, out-of-bounds values
  - Timing errors and race conditions
10. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points to form a minimum symbolic expression
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
11. Where to Place the Detectors?
- Must identify the variable(s) to check and the location at which to place the detectors
- Starting point: the program's Dynamic Dependence Graph (DDG)
- Metrics to choose candidate placement points, e.g., fanout, lifetime, execution (a small sketch of ranking by fanout follows this slide)
- Fault-injection experiments were used to assess the coverage of the selected points
- Experiments verify that it is sufficient to check a single variable at a single detection point
- A single detector in the code provides 60% coverage for a large application like GCC!
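The C++ fragment below is a minimal sketch, not the authors' implementation, of how candidate detector points might be ranked by the fanout metric over a dynamic dependence graph; the DDGNode structure and the rankByFanout helper are assumptions made purely for illustration.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical node of a dynamic dependence graph: one dynamic value
    // definition plus the number of later uses observed for it at runtime.
    struct DDGNode {
        std::string variable;   // source-level variable the value belongs to
        std::size_t location;   // program location (e.g., instruction id) of the definition
        std::size_t numUses;    // fanout: how many later instructions consumed this value
    };

    // Rank candidate detector points by fanout and keep the top `budget` of them.
    std::vector<DDGNode> rankByFanout(std::vector<DDGNode> nodes, std::size_t budget) {
        std::sort(nodes.begin(), nodes.end(),
                  [](const DDGNode& a, const DDGNode& b) { return a.numUses > b.numUses; });
        if (nodes.size() > budget)
            nodes.resize(budget);
        return nodes;
    }

The intuition behind the metric is the one stated above: a value consumed by many later instructions spreads corruption widely, so checking it yields the most coverage per detector.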
12. Dynamic Dependence Graph
- Example: the value of R1 defined by ADDI R1, R1, 1 (assume the loop executes 5 times)
  - Maps onto nodes 11, 16, 21, 26, 28 of the dynamic dependence graph
  - Used in 3 instructions:
    - BNE R1, R2, LOOP (same iteration)
    - LW R3, A(R1) (next iteration)
    - ADDI R1, R1, 1 (next iteration)
13. Coverage Results (Multiple Detectors)
- Fanout: 80% coverage with 10 ideal detectors
- Lifetime: 90% coverage with 25 ideal detectors
- Placing detectors randomly on hot paths: 100 ideal detectors are needed to achieve 90% coverage
14. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points to form a minimum symbolic expression
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
15. Deriving Detectors: Static Code Analysis
- Identify and store the instructions that compute the target variable at the detector location
- Encode the instructions as a symbolic expression
- Reduce the expression to shorten the instruction sequence (to avoid simple duplication)
  - Only encode the variables that affect the value of the chosen variable at the detector location (program slicing)
  - Create specialized versions of the computation slice depending on the path followed at runtime (partial evaluation)
- Instrument the code to track the paths followed at runtime
- Choose the check depending on the path followed at runtime
16. Static Analysis Example

[Figure: control-flow graph of an example with two paths (path1, path2) through assignments to b, c, d, e, and f. The slice that recomputes f is specialized per path: f2 is recomputed from {a, e} when (a != 0) and from {2, c, e} when (a == 0). At the detector point, if (f2 != f) an error in f along that path is declared and the program exits; otherwise execution continues with the rest of the code. A concrete sketch of such a check appears below.]
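To make the example concrete, here is a hedged C++ sketch of the kind of checking code the approach could generate: the original computation of f is kept, a slice specialized to the path actually taken recomputes it as f2, and the two values are compared at the detector point. The variables, expressions, and path condition are illustrative placeholders rather than the actual example in the figure.

    #include <cstdio>
    #include <cstdlib>

    // Illustrative original computation of f; the concrete expressions in the
    // slide's figure are placeholders here.
    double computeF(double a, double c, double e, int& pathTaken) {
        if (a != 0.0) { pathTaken = 1; return a * e; }   // path1
        pathTaken = 2; return 2.0 * c * e;               // path2
    }

    int main() {
        double a = 3.0, c = 4.0, e = 5.0;
        int pathTaken = 0;
        double f = computeF(a, c, e, pathTaken);

        // Detector: recompute f through the slice specialized to the path that
        // was actually taken, and compare against the original value.
        double f2 = (pathTaken == 1) ? a * e : 2.0 * c * e;
        if (f2 != f) {
            std::fprintf(stderr, "Error detected in f along path %d\n", pathTaken);
            std::exit(1);
        }
        std::printf("f = %f checked OK\n", f);           // rest of the code
        return 0;
    }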
17. Path Slicing
- Perform a backward traversal through the Static Dependence Graph (SDG), starting from the detector location
- Upon a fork in the Control Flow Graph (CFG), create two paths and continue expansion
- Stop expanding a path upon encountering:
  - An instruction that has been visited before
  - The beginning or end of the function
  - Function calls, returns, free instructions, or system calls
  (a simplified sketch of this traversal follows)
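A simplified sketch of the backward traversal and its stop conditions, under assumed data structures; the real analysis operates on LLVM IR and also forks the slice at CFG branches, which is omitted here.

    #include <set>
    #include <vector>

    // Simplified instruction node of a static dependence graph (SDG), assumed
    // purely for illustration.
    struct Instr {
        bool isBarrier;             // function call, return, free, or system call
        std::vector<int> dataDeps;  // ids of instructions whose results this one reads
    };

    // Backward traversal from the detector location, stopping at instructions
    // that were already visited, at the function boundary (id < 0), and at
    // barrier instructions, as described on the slide.
    void backwardSlice(const std::vector<Instr>& sdg, int id, std::set<int>& slice) {
        if (id < 0 || slice.count(id)) return;   // function boundary / visited before
        if (sdg[id].isBarrier) return;           // stop at calls, returns, frees, syscalls
        slice.insert(id);
        for (int dep : sdg[id].dataDeps)
            backwardSlice(sdg, dep, slice);
    }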
18. Path-Tracking
- Paths through the example CFG:
  - Entry, A, C, D, F
  - Entry, B, C, E, F
  - Entry, A, C, E, F
  - Entry, B, C, D, F

[Figure: example CFG with basic blocks Entry, A, B, C, D, E, F and the per-edge increments used to assign each of the four paths a distinct path value at runtime; a software sketch of such path tracking follows.]
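Below is a software sketch of how each of the four paths could be given a distinct runtime value by incrementing a counter on branch edges, in the spirit of path numbering; the slide does not show the exact encoding used, so the increments are assumptions.

    #include <cstdio>

    // Path tracking for the CFG in the slide (Entry -> {A,B} -> C -> {D,E} -> F):
    // a single counter is updated on the branch edges so that each of the four
    // acyclic paths reaches F with a distinct path value.
    int trackedPath(bool takeA, bool takeD) {
        int pathValue = 0;
        if (takeA) { /* block A */ } else { pathValue += 1; /* block B */ }
        /* block C */
        if (takeD) { /* block D */ } else { pathValue += 2; /* block E */ }
        /* block F: pathValue is now 0..3, one value per Entry->F path */
        return pathValue;
    }

    int main() {
        std::printf("Entry,A,C,D,F -> %d\n", trackedPath(true, true));
        std::printf("Entry,B,C,E,F -> %d\n", trackedPath(false, false));
        return 0;
    }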
19. Performance Results

[Figure: performance overheads of the software-only and software + hardware implementations.]

- By using programmable hardware and the cache to track paths at runtime, the instrumentation overhead is reduced significantly (from 50% to 5%) and hence the overall performance overhead drops from 65% to 20%
20. Implementation Details
- Implemented using an optimizing compiler
  - LLVM (developed in Vikram Adve's group at Illinois)
  - Technique implemented as an LLVM pass
- Handles recursive calls, dynamic memory allocation, system calls, etc.
- Fanout metric used to choose detector points (using static analysis and profiling)
- Tested on sample C programs
  - e.g., Fibonacci, Bubble sort, Fast Fourier Transform
21. Example: Matrix Multiplication

    void rInnerproduct(float *result, float a[rowsize+1][rowsize+1],
                       float b[rowsize+1][rowsize+1], int row, int column)
        /* computes the inner product of A[row,*] and B[*,column] */
    {
        int i;
        *result = 0.0f;
        for (i = 1; i <= rowsize; i++)
            *result = *result + a[row][i] * b[i][column];
    }

    void Mm(int run)
    {
        int i, j;
        Initrand();
        rInitmatrix(rma);
        rInitmatrix(rmb);
        for (i = 1; i <= rowsize; i++)
            for (j = 1; j <= rowsize; j++)
                rInnerproduct(&rmr[i][j], rma, rmb, i, j);
        printf("%f\n", rmr[run + 1][run + 1]);
    }
22. Example: LLVM Intermediate Code

    void %rInnerproduct(double* %result, [41 x double]* %a, [41 x double]* %b, int %row, int %column)
    loopentry:
        ...
        br bool %tmp.2, label %no_exit, label %loopexit
    no_exit:
        ...
        %tmp.7 = load [41 x double]** %a_addr
        %tmp.8 = load int* %row_addr
        %tmp.9 = getelementptr [41 x double]* %tmp.7, int %tmp.8
        %tmp.10 = load int* %i
        %tmp.11 = getelementptr [41 x double]* %tmp.9, int 0, int %tmp.10
        %tmp.12 = load double* %tmp.11
        %tmp.13 = load [41 x double]** %b_addr
        %tmp.14 = load int* %i
        %tmp.15 = getelementptr [41 x double]* %tmp.13, int %tmp.14
        %tmp.16 = load int* %column_addr
        %tmp.17 = getelementptr [41 x double]* %tmp.15, int 0, int %tmp.16
        %tmp.18 = load double* %tmp.17
        %tmp.19 = mul double %tmp.12, %tmp.18
23. Checking Code Added to the Example

        %tmp.20.i = add double %tmp.12.tmp.2, %tmp.19.i
        switch uint %pathValue-8114, label %rest-8 [
            uint 2, label %path2-8
            uint 3, label %path3-8
            uint 4, label %path4-8 ]
        ...
    path2-8:        ; preds = %rest-9
        %new.2.tmp.19.i = mul double %tmp.12.i, %tmp.18.i
        %new.2.tmp.20.i = add double 0.000000e+00, %new.2.tmp.19.i
        br label %Check-8
    path3-8:        ; preds = %rest-9
        %new.3.tmp.19.i = mul double %tmp.12.i, %tmp.18.i
        %new.3.tmp.20.i = add double %tmp.20.i.copy, %new.3.tmp.19.i
        br label %Check-8
    path4-8:        ; preds = %rest-9
        %new.4.tmp.19.i = mul double %tmp.12.i, %tmp.18.i
24. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points and form a symbolic expression to encode runtime paths
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
25. Approach
26. What is a Detector?
- A check based on the value of a program variable or memory location at a program point
- Only detectors based on the value of a single variable/location are considered: Single-Valued Detectors
  - They involve only the current and previous values of the variable
- A detector consists of:
  - A generic rule (belonging to a template class)
  - An exception condition (a logical expression) for values of the variable that do not satisfy the rule
27. Dynamic Detector Example

    void foo( int N )
    {
        for (int k = 0; k < N; k++)
        {
            ...
        }
    }

- "Either the current value of k is zero, or it is greater than the previous value of k by 1"
- (k_i == k_{i-1} + 1) or (k_i == 0)
  - Rule: (k_i == k_{i-1} + 1); Exception: (k_i == 0) (rendered in code below)
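A minimal C++ rendering of this detector as it might appear in instrumented code; the rule and the exception come directly from the slide, while the checkK helper and its placement are assumptions.

    #include <cstdio>
    #include <cstdlib>

    // Detector state for k: the previously observed value at this detector point.
    static int prevK;
    static bool havePrevK = false;

    // Rule: the current k exceeds the previous k by exactly 1.
    // Exception: k == 0 (the value k takes on loop entry).
    void checkK(int k) {
        if (havePrevK && k != prevK + 1 && k != 0) {
            std::fprintf(stderr, "detector fired: k=%d, previous k=%d\n", k, prevK);
            std::exit(1);
        }
        prevK = k;
        havePrevK = true;
    }

    void foo(int N) {
        for (int k = 0; k < N; ++k) {
            checkK(k);   // detector inserted at the chosen point
            /* ... loop body ... */
        }
    }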
28. Detector Classes
29. Deriving Detectors: Dynamic Analysis
- Detector tightness
  - Probability that a detector detects an erroneous value of the variable it checks
  - Conceptually different from coverage
- Execution cost
  - Amortized additional computation involved in invoking the detector, over the multiple values observed at the detector point
- Choose the detector with the highest (tightness / cost) ratio for each detector point
  - First, choose a rule from the template classes for the data stream
  - Next, form the exception condition to account for values that do not satisfy the rule
  - If no exception can be found, discard the rule and try again (a sketch of this selection loop follows)
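A hedged sketch of the selection loop described above: each rule template is tried against the values observed during training, values that violate it become exception candidates, and a rule is discarded if its exception set grows too large. The data structures and the tightness/cost ranking details are assumptions made for illustration.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical rule template: a predicate over consecutive observed values.
    struct RuleTemplate {
        std::string name;
        std::function<bool(long prev, long cur)> holds;
    };

    // Learn a detector for one value stream recorded during training: take the
    // first template whose violations can be covered by a small exception set,
    // otherwise discard the rule and try the next one. (The real tool then ranks
    // surviving rules by their tightness/cost ratio; that step is omitted here.)
    std::string learnRule(const std::vector<long>& stream,
                          const std::vector<RuleTemplate>& templates,
                          std::size_t maxExceptions) {
        for (const RuleTemplate& t : templates) {
            std::vector<long> exceptions;
            for (std::size_t i = 1; i < stream.size(); ++i)
                if (!t.holds(stream[i - 1], stream[i]))
                    exceptions.push_back(stream[i]);   // value that would need an exception
            if (exceptions.size() <= maxExceptions)
                return t.name;                         // rule accepted with this exception set
        }
        return "no detector for this stream";
    }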
30. Experimental Setup
- Steps in the evaluation:
  - Analysis: detector placement and code instrumentation
  - Training: learning detectors using representative inputs
  - Testing: fault-injection experiments that flip random bits in application data (illustrated after this slide)
- Tool used for evaluation: modified version of the SimpleScalar simulator (functional simulation)
  - Emulates real-world behavior under faults
- Application workload: Siemens suite
  - C programs with 100-1000 lines of code
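For illustration only (the actual experiments inject faults inside the modified SimpleScalar simulator, not at the source level), the fault model of flipping a random bit in an application data value can be sketched as follows; the injectBitFlip helper is an assumption.

    #include <cstdint>
    #include <cstring>
    #include <random>

    // Flip one random bit in an application data value at the moment of its
    // definition, mirroring the single-bit-flip fault model described above.
    template <typename T>
    T injectBitFlip(T value, std::mt19937& rng) {
        static_assert(sizeof(T) <= sizeof(std::uint64_t), "illustration handles small types only");
        std::uint64_t bits = 0;
        std::memcpy(&bits, &value, sizeof(T));
        std::uniform_int_distribution<int> pick(0, 8 * static_cast<int>(sizeof(T)) - 1);
        bits ^= (std::uint64_t{1} << pick(rng));   // corrupt a single random bit
        std::memcpy(&value, &bits, sizeof(T));
        return value;
    }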
31. Coverage versus Number of Detectors
32. Coverage versus Detector Type
33. False Positives
- An error is detected even when no fault is injected
- Less than 6% for all applications except tot_info
34. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points and form a symbolic expression to encode runtime paths
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
35. Hardware Implementation
- RSE (Reliability and Security Engine) is a reconfigurable processor-level framework for reliability and security
- Detectors are implemented as an RSE module consisting of:
  - Shadow Register File: holds the state of the checked location
  - Assertion Table: stores the assertion parameters
  - Data path: checks the assertions independently of the processor
36. Hardware Synthesis Results
- Area overhead of the EDMs alone: 30%
- Area overhead of the EDMs plus RSE interface: 45%
- Performance overhead: 5.6%
37. Approach
- Placement: determine where (program location and variable) to place detectors for best coverage
- Dynamic Analysis: instrument the application to observe values at the detector points and form assertions based on these values
- Static Analysis: perform backward slicing on the application code from the detector points and form a symbolic expression to encode runtime paths
- Runtime: check the assertions using a combination of software and hardware (reliability and security)
38. Information-Flow Signatures
- Use detection of program data-flow violations as an indicator of malicious tampering with the system
- Prevent an attacker from exploiting the disconnect between the source-level semantics and the execution semantics of a program
- Employ compile-time static program analysis to extract the instructions allowed (at runtime) to write to a given memory location
  - Sign each identified location with the PC(s) of the instruction(s) allowed to write to it
  - Typically, only a few static instructions write to a given program location
- Employ special hardware to perform the runtime check
39. What and How Do We Check?
- Security-critical data (incomplete list):
  - System call arguments
  - Function call and return addresses
  - Control-flow data
  - Pointers on the stack and the heap
- Special hardware maintains a tag for each memory word
  - Write to a location: create the runtime signature corresponding to the location: (PC) XOR (tag)
  - Reference to a location: check the tag against the set of allowed signatures (derived at compile time)
  - If there are no matches, the operation is disallowed (a software sketch of this check follows)
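Below is a software model of the check that the special hardware performs; the (PC) XOR (tag) signature composition follows the slide, while the maps standing in for the tag memory and the compile-time signature sets are assumptions.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <map>
    #include <set>

    // Every security-critical memory word carries a tag. Stores fold the writing
    // instruction's PC into the tag, and references verify the tag against the
    // signatures the compile-time analysis allows for that location.
    static std::map<std::uintptr_t, std::uint64_t> tagOf;                  // address -> runtime tag
    static std::map<std::uintptr_t, std::set<std::uint64_t>> allowedSigs;  // address -> allowed signatures

    void onWrite(std::uintptr_t addr, std::uint64_t pc) {
        tagOf[addr] = pc ^ tagOf[addr];        // runtime signature: (PC) XOR (current tag)
    }

    void onRead(std::uintptr_t addr) {
        const std::set<std::uint64_t>& allowed = allowedSigs[addr];
        if (allowed.find(tagOf[addr]) == allowed.end()) {
            std::fprintf(stderr, "information-flow violation at address %#llx\n",
                         static_cast<unsigned long long>(addr));
            std::exit(1);                      // operation is disallowed
        }
    }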
40. Summary
- Application-aware error detection and recovery to ensure low-latency error detection
- Technique to place detectors in code achieving up to 80% coverage with 10 detectors
- Dynamic analysis to derive value-based error detectors and implement them in hardware
- Static analysis to derive checking expressions based on backward program slicing
- Efficient implementation in hardware, with significant benefits over full duplication
41. Ongoing and Future Work
- Dynamic analysis: extension to larger programs and multi-valued detectors
- Static analysis: concise representation of checking expressions and compiling them to hardware
- Extension to security: signatures based on information flow in a program
- Formal verification of the derived detectors: model checking / theorem proving