Title: A Theory of Fault-Tolerance
1- A Theory of Fault-Tolerance
2Unifying Fault-Tolerance Approaches
- Several disciplines, each focusing on different faults and specific architectures:
- Crash recovery
- Atomic transactions
- Fault-tolerance of digital systems
- Fault-tolerance in message-passing systems
- Verification of fault-tolerance
- Application-specific
- Verify recovery and safe termination (masking the faults)
- Less attention given to non-maskable faults
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.
3A Foundation of Fault-Tolerant Computing
- Provide a uniform definition of fault-tolerance
- Provide verification methods independent of
technology, architecture, or application
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.
4Program and Fault
- Program model: synchronization skeleton of finite-state programs
- Finite number of variables with finite domains
- Finite number of processes
- State: a valuation of program variables
- Finite state space $S_p$
- Program $p$ and fault $f$ are sets of transitions: $p, f \subseteq S_p \times S_p$
- Use Dijkstra's guarded commands (actions) as a shorthand to represent program and fault transitions
- Guard $\rightarrow$ Statement
[Figure: state space $S_p$ with program and fault transitions]
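The guarded-command shorthand can be expanded mechanically into a set of transitions over the state space. The following is a minimal sketch under stated assumptions: the two-variable state space and both actions are hypothetical and chosen only for illustration, and `transitions` is a helper name introduced here.

```python
from itertools import product

def transitions(states, actions):
    """Expand guarded commands into a set of (s0, s1) transitions."""
    result = set()
    for s in states:
        for guard, statement in actions:
            if guard(s):
                result.add((s, statement(s)))
    return result

# Hypothetical state space: states are pairs (x, y), both in {0, 1, 2}.
STATES = list(product(range(3), repeat=2))

# Hypothetical program action: x < 2 -> x := x + 1
program_actions = [(lambda s: s[0] < 2, lambda s: (s[0] + 1, s[1]))]
# Hypothetical fault action: true -> y := 0
fault_actions = [(lambda s: True, lambda s: (s[0], 0))]

p = transitions(STATES, program_actions)  # program as a subset of Sp x Sp
f = transitions(STATES, fault_actions)    # fault as a subset of Sp x Sp
```

Both `p` and `f` end up as plain sets of state pairs, which is the form the later definitions (closure, convergence, safety) operate on.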
5Examples of Intermittent Faults
- Intermittent faults
- Sudden acceleration in cruise control systems
- E.g., cruise control that only works in wet weather
- Malfunction in a component of an electronic circuit when the voltage goes beyond a threshold
- x and y are two points of contact in a circuit that have independent voltages. However, when the voltage level of x exceeds 3.5 V, y gets the same voltage as x. We model this class of faults by the following guarded command:
- $x > 3.5 \rightarrow y := x$
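As a small illustration, the guarded command for the circuit fault can be written as a function on the pair of voltages. This is only a sketch: the function name and the `threshold` parameter are assumptions introduced here, not part of the original model.

```python
def intermittent_fault(x, y, threshold=3.5):
    """Guarded command x > 3.5 -> y := x, applied to voltages (x, y).

    Returns the voltages after the fault action; the state is unchanged
    when the guard does not hold.
    """
    if x > threshold:
        return x, x
    return x, y

above = intermittent_fault(4.0, 1.2)  # guard holds, y takes the value of x
below = intermittent_fault(3.0, 1.2)  # guard does not hold, no change
```

The fault is reproducible in the sense that it occurs exactly when the guard condition is met.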
6Examples of Transient Faults
- Transient faults
- A hardware interrupt routine gets called without any interrupt being raised by hardware devices
- Solar radiation corrupts the communication and navigation systems
- The variables of the controlling software of space shuttles may be corrupted by transient solar radiation
- $true \rightarrow x := ?$
- The above guarded command means that in any state of the system, the variable x may be corrupted by a transient fault
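Because the guard is always true and the assigned value is arbitrary, a transient fault is nondeterministic: from any state it can lead to every state that differs only in the corrupted variable. A minimal sketch, assuming states are dictionaries and the variable ranges over a finite domain (both assumptions made for illustration):

```python
def transient_fault_successors(state, var, domain):
    """All states reachable by one occurrence of: true -> var := ?"""
    return [dict(state, **{var: value}) for value in domain]

# Hypothetical state with two variables over the domain {0, 1, 2}.
state = {"x": 1, "y": 0}
successors = transient_fault_successors(state, "x", (0, 1, 2))
```

Only `x` changes in the successor states; every other variable keeps its value.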
7Transient vs. Intermittent Faults
- Transient faults are difficult (if not impossible) to reproduce
- Can we reproduce solar radiation?
- Intermittent faults may be reproduced under certain conditions
- E.g., pressing the Ctrl key causes the system to reset
8State Predicate
- State predicate $X$: $X \subseteq S_p$
- Closure: $X$ is closed in $p$
- Projection: $p|X$
- $p|X = \{(s_0, s_1) : (s_0, s_1) \in p \wedge s_0 \in X \wedge s_1 \in X\}$
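The projection of a program on a state predicate can be computed directly from its definition when the program is given as a set of transitions. The sketch below uses a hypothetical four-state program; the names `project`, `X`, and `Y` are introduced here for illustration.

```python
def project(p, X):
    """Projection p|X: the transitions of p that start and end in X."""
    return {(s0, s1) for (s0, s1) in p if s0 in X and s1 in X}

# Hypothetical program over states 0..3, given as a set of transitions.
p = {(0, 1), (1, 0), (2, 3), (3, 3)}

X = {0, 1}  # no transition of p leaves X
Y = {1, 2}  # (1, 0) and (2, 3) both leave Y

p_on_X = project(p, X)
p_on_Y = project(p, Y)
```

For `X`, the projection keeps every transition that starts in `X`; for `Y`, nothing survives because each transition starting in `Y` ends outside it.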
9Program Computations
- Program computations
- Infinite sequences of program transitions
10Specification, Invariant, and Fault-Span
- Safety specification: something bad never happens
- Formal representation: a subset of $S_p \times S_p$ (the set of bad transitions)
- E.g., transitions that change the value of a counter from a non-zero value to zero
- Liveness specification: something good will eventually happen
- In the absence of faults, the fault-tolerant program $p'$ satisfies the liveness specification of the fault-intolerant program $p$
- Invariant $S$, fault-span $T \subseteq S_p$
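With safety represented as a set of bad transitions, checking a program or a fault against it reduces to a set intersection. The sketch below encodes the counter example; the counter range, the program, and the fault are hypothetical and serve only to illustrate the check.

```python
COUNTER_VALUES = range(4)

# Bad transitions: the counter changes from a non-zero value to zero.
bad = {(a, 0) for a in COUNTER_VALUES if a != 0}

# Hypothetical program: increment the counter, saturating at 3.
p = {(c, min(c + 1, 3)) for c in COUNTER_VALUES}
# Hypothetical fault: reset the counter to zero from any value.
f = {(c, 0) for c in COUNTER_VALUES}

def violations(transitions, bad_transitions):
    """Transitions that violate the safety specification."""
    return transitions & bad_transitions

program_violations = violations(p, bad)
fault_violations = violations(f, bad)
```

Here the program alone never executes a bad transition, while the fault does so from every non-zero counter value.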
11Token Ring Example
- Processes P0, P1, P2, P3
- Variables x0, x1, x2, x3 (domain $\{0, 1, \bot\}$)
- Dijkstra's guarded commands (actions)
- Guard $\rightarrow$ Statement
- Fault-intolerant program
- Process P0
- $TR_0$: $(x_0 = 1) \wedge (x_3 = 1) \rightarrow x_0 := 0$
- $TR_0$: $(x_0 = 0) \wedge (x_3 = 0) \rightarrow x_0 := 1$
12Token Ring Example Continued
- Processes P1, P2, P3
- $TR_i$: $(x_i = 0) \wedge (x_{i-1} = 1) \rightarrow x_i := 1$
- $TR_i$: $(x_i = 1) \wedge (x_{i-1} = 0) \rightarrow x_i := 0$
- Fault transitions: process restart
- $true \rightarrow x_j := \bot$
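The token ring actions can be simulated directly. The sketch below is one possible encoding, not the authors' implementation: states are tuples (x0, x1, x2, x3), the corrupted value is represented by `None` (an assumption made here), and the function names are hypothetical.

```python
BOT = None  # assumed stand-in for the corrupted value

def program_successors(s):
    """States reachable from s by one program action of the token ring."""
    succ = []
    # Process P0 compares x0 with x3.
    if s[0] == 1 and s[3] == 1:
        succ.append((0,) + s[1:])
    if s[0] == 0 and s[3] == 0:
        succ.append((1,) + s[1:])
    # Processes P1..P3 compare xi with x(i-1).
    for i in (1, 2, 3):
        if s[i] == 0 and s[i - 1] == 1:
            succ.append(s[:i] + (1,) + s[i + 1:])
        if s[i] == 1 and s[i - 1] == 0:
            succ.append(s[:i] + (0,) + s[i + 1:])
    return succ

def fault_successors(s):
    """States reachable by one fault action: true -> xj := BOT."""
    return [s[:j] + (BOT,) + s[j + 1:] for j in range(4)]

def run(start, steps):
    """Follow the first enabled program action for a number of steps."""
    trace = [start]
    for _ in range(steps):
        succ = program_successors(trace[-1])
        if not succ:
            break
        trace.append(succ[0])
    return trace

trace = run((0, 0, 0, 0), 8)
```

Starting from (0, 0, 0, 0), exactly one action is enabled in each state, and after eight steps the ring returns to its starting state.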
13Token Ring Example Continued
- Invariant
- (a state is represented as a tuple $\langle x_0, x_1, x_2, x_3 \rangle$)
- $\langle 0, 0, 0, 0 \rangle$, $\langle 1, 0, 0, 0 \rangle$, $\langle 1, 1, 0, 0 \rangle$, $\langle 1, 1, 1, 0 \rangle$,
- $\langle 1, 1, 1, 1 \rangle$, $\langle 0, 1, 1, 1 \rangle$, $\langle 0, 0, 1, 1 \rangle$, $\langle 0, 0, 0, 1 \rangle$
- Safety Specification
- Corrupted value does not affect a non-corrupted process
- There is only one token in the ring
- Liveness of the fault-intolerant program
- Token should be circulated infinitely often
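The eight invariant states can be checked against the "only one token" requirement. In the sketch below, a process is taken to hold the token exactly when one of its guards is enabled; that reading of "token" is an assumption made here, as are the helper names.

```python
from itertools import product

INVARIANT = [
    (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]

def token_holders(s):
    """Indices of processes whose guard is enabled in state s (assumed token rule)."""
    holders = []
    # P0 is enabled when x0 and x3 are equal and uncorrupted.
    if s[0] in (0, 1) and s[0] == s[3]:
        holders.append(0)
    # Pi (i = 1..3) is enabled when xi and x(i-1) differ and are uncorrupted.
    for i in (1, 2, 3):
        if s[i] in (0, 1) and s[i - 1] in (0, 1) and s[i] != s[i - 1]:
            holders.append(i)
    return holders

# All uncorrupted states that contain exactly one token.
single_token_states = {
    s for s in product((0, 1), repeat=4) if len(token_holders(s)) == 1
}
```

Under this reading, the uncorrupted single-token states coincide with the invariant listed above.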
14Defining Fault-Tolerance Closure
- Let $S$ be a state predicate of a program $p$
- $S$ is closed in $p$ iff for every action $G \rightarrow st$ of $p$,
- executing $st$ in a state where $S \wedge G$ holds results in a state in $S$
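The closure definition translates almost literally into a check over guarded actions. The following is a minimal sketch; the single-action program over the integers 0..5 is hypothetical, and actions are modeled as (guard, statement) pairs of Python functions.

```python
def is_closed(S, actions):
    """S is closed iff every action, executed where S and its guard hold, stays in S."""
    for guard, statement in actions:
        for s in S:
            if guard(s) and statement(s) not in S:
                return False
    return True

# Hypothetical program with one action: s < 5 -> s := s + 1
actions = [(lambda s: s < 5, lambda s: s + 1)]

upper_closed = is_closed({3, 4, 5}, actions)  # execution never leaves {3, 4, 5}
lower_closed = is_closed({0, 1, 2}, actions)  # the step from 2 to 3 leaves the set
```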
15Defining Fault-Tolerance Convergence
- Let S and T be state predicates of program p
- T converges to S in p iff
- S is closed in p
- T is closed in p
- Starting in T, each computation of p reaches a
state in S
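For a finite-state program given as a set of transitions, convergence can be checked by repeatedly marking states from which every step leads to an already marked state. The sketch below makes two assumptions that the slides do not state: no fairness is assumed, and a state outside S with no outgoing transition counts as a failure to reach S. The example programs are hypothetical.

```python
def is_closed(p, X):
    """X is closed in p iff no transition of p leaves X."""
    return all(s1 in X for (s0, s1) in p if s0 in X)

def converges(p, T, S):
    """Check that T converges to S in p (p is a set of transitions)."""
    if not (is_closed(p, S) and is_closed(p, T)):
        return False
    good = set(S)              # states known to reach S on every computation
    pending = set(T) - good
    changed = True
    while changed and pending:
        changed = False
        for s in list(pending):
            succ = {s1 for (s0, s1) in p if s0 == s}
            if succ and succ <= good:
                good.add(s)
                pending.discard(s)
                changed = True
    return not pending

S = {0}
T = {0, 1, 2, 3, 4}
p_down = {(4, 3), (3, 2), (2, 1), (1, 0), (0, 0)}  # always steps toward 0
p_loop = p_down | {(3, 4)}                          # may cycle between 3 and 4

down_converges = converges(p_down, T, S)
loop_converges = converges(p_loop, T, S)
```

The second program fails the check because a computation may alternate between states 3 and 4 forever without reaching S.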
16Levels of Fault-Tolerance
- Failsafe (program p is failsafe f-tolerant for spec from S)
- Guarantee safety in the presence of faults
- Nonmasking (program p is nonmasking f-tolerant for spec from S)
- Guarantee recovery in the presence of faults
- Masking (program p is masking f-tolerant for spec from S)
- Guarantee safety and recovery in the presence of faults
[Figure: state space $S_p$ with safety-violating transitions]
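The three levels can be told apart on small finite examples by computing the states reachable from the invariant under program and fault transitions and then checking safety and recovery on that set. The sketch below is an illustrative simplification, not the formal definition: it takes the fault-span to be the reachable set, treats recovery as convergence of the program alone once faults stop, and uses hypothetical example programs and helper names.

```python
def reachable(transitions, start):
    """States reachable from the set `start` via the given transitions."""
    seen, frontier = set(start), list(start)
    while frontier:
        s = frontier.pop()
        for (s0, s1) in transitions:
            if s0 == s and s1 not in seen:
                seen.add(s1)
                frontier.append(s1)
    return seen

def recovers(p, T, S):
    """Every computation of p that starts in T reaches S (no fairness assumed)."""
    good, pending = set(S), set(T) - set(S)
    changed = True
    while changed and pending:
        changed = False
        for s in list(pending):
            succ = {s1 for (s0, s1) in p if s0 == s}
            if succ and succ <= good:
                good.add(s)
                pending.discard(s)
                changed = True
    return not pending

def classify(p, f, S, bad):
    """Level of f-tolerance of p from S for the safety spec `bad` (sketch)."""
    T = reachable(p | f, S)  # assumed fault-span
    safe = not any(t in bad for t in (p | f) if t[0] in T)
    rec = recovers(p, T, S)
    if safe and rec:
        return "masking"
    if safe:
        return "failsafe"
    if rec:
        return "nonmasking"
    return "intolerant"

S = {0}
f = {(0, 1), (0, 2)}
p_recovering = {(0, 0), (1, 0), (2, 1)}
p_stuck = {(0, 0), (1, 1), (2, 2)}

level_a = classify(p_recovering, f, S, bad={(2, 3)})
level_b = classify(p_recovering, f, S, bad={(2, 1)})
level_c = classify(p_stuck, f, S, bad={(2, 3)})
```

The first case is safe and recovers, the second recovers but executes a bad transition on the way, and the third stays safe but never returns to the invariant.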
17Component-Based Design of Fault-Tolerance
- A fault-tolerant program is composed of:
- A fault-intolerant program
- Fault-tolerance components
- Two types of fault-tolerance components are necessary and sufficient for the design of fault-tolerance:
- detectors and correctors
[Kulkarni 1999] Component-Based Design of Fault-Tolerance, PhD thesis, The Ohio State University, 1999.
18Synthesis of Fault-Tolerance
- It is difficult to anticipate all classes of faults at design time
- New classes of faults require the addition of a corresponding level of fault-tolerance
- Can we do it automatically?
[Figure: the synthesis algorithm takes a fault-intolerant program $p$ and faults $f$ as input and produces a fault-tolerant program $p'$]
[Ebnenasir 2004] Automatic Synthesis of Fault-Tolerance, PhD thesis, Michigan State University, 2004.
19Conclusion
- Fault-tolerance is an important factor in the survivability of software systems
- A well-defined need for
- the design of correct fault-tolerant programs
- the design of programs that tolerate multiple classes of faults (multitolerance)
- development methodologies that provide correctness guarantees
- Automatic addition of fault-tolerance generates a program that is correct by construction
- Future work
- Developing tools for automation