Title: Fault Tolerant Computing
1Fault Tolerant Computing
2Acknowledgements
- The following lectures are based on materials from the following sources:
- S. Kulkarni
- J. Rushby
- J. Knight
3Objectives
- Exposure to the area of critical systems
- What it means to have a fault-tolerant system
- Specification techniques for representing critical properties
- How to design fault tolerance into a system
4Reliability and Recovery
- Reliability
- Probability that a system operates without failure up to time t, given that it was operating properly at time 0.
- Recovery
- Process of restoring consistency after a failure
5Dependability
- Dependability
- How much one may rely on the quality of the services delivered
- Quality of service depends on
- Correctness
- Continuity of service
6Terms
- Failure: a malfunction
- Fault: a condition that might lead to failure
- Error: an incorrect response, which indicates that a fault is present
- Faults may be
- permanent
- intermittent
- transient
7Terms (contd)
- Graceful Degradation
- system is operational, but degraded, after faults
- Fail-safe
- system execution is safe after the fault
- Stabilizing
- system recovers to a consistent state after the fault
- Masking
- the user of the system does not see any unintended behavior due to faults
8Terms (contd)
- Mean Time to Failure (MTTF)
- expected value of system failure time
- Mean Time to Repair (MTTR)
- expected value of system repair time
- Mean Time Between Failures (MTBF)
- expected time between successive failures: MTBF = MTTF + MTTR
- Fault Tolerance
- ability to continue operation after the occurrence of faults
- A system is faulty once its behavior is no longer consistent with its specification.
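The relation between MTTF, MTTR and MTBF can be checked with a few lines of arithmetic. The sketch below is only illustrative: the figures are hypothetical, and the availability ratio MTTF / (MTTF + MTTR) is a standard derived quantity that these slides do not define.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures: one full run-then-repair cycle."""
    return mttf + mttr


# Hypothetical figures, in hours, chosen only for illustration.
mttf_hours = 990.0
mttr_hours = 10.0

cycle = mtbf(mttf_hours, mttr_hours)

# Fraction of each cycle during which the system is operational
# (steady-state availability; not defined in the slides).
availability = mttf_hours / cycle

print(cycle, availability)
```

A short repair time matters as much as a long time to failure: halving MTTR in this example raises the fraction of time in service without touching MTTF.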
9Design Decisions
- Fault detection
- Fault confinement
- Fault diagnosis
- Repair and/or reconfigure
- Redundancy
- Hardware: extra hardware
- Information: redundancy bits
- Software: diagnosis software, extra software
- Temporal: re-execute software to recover from intermittent faults
10Safety vs Reliability
- Reliability
- concerns the occurrence of failures
- System failures are defined in terms of system services
- Safety
- concerns the occurrence of accidents
- Accidents are unplanned events that result in death, injury, illness, damage, loss of property, or environmental harm
- Defined in terms of external consequences
11Types of Faults
- Omission failure
- server omits to respond to an input (fail-silent failure)
- Timing failure
- response is functionally correct, but untimely
- can be an early timing failure or a late timing failure (performance failure)
- Response failure
- incorrect response
- output value is incorrect (value failure)
- state transition is incorrect (state transition failure)
12Types of Faults (contd)
- Crash failure
- after a first omission, a server omits to produce output until it restarts
- Amnesia crash
- server restarts in a predefined initial state that does not depend on the inputs seen before the crash
- Partial amnesia crash
- some part of the state is the same as before the crash; the rest is in a predefined initial state
- Pause crash
- server restarts in the state it had before the crash
- Halting crash
- crashed server never restarts
13Types of Faults (contd)
- Byzantine failure
- Component exhibits arbitrary and malicious behavior, perhaps in cooperation with other faulty components.
- Fail-stop failure
- In response to a failure, the component changes to a state that permits other components to detect that a failure has occurred, and then stops.
14Examples
- OS crash followed by a reboot in the initial state (amnesia crash)
- Database server crash followed by recovery of a database state that reflects all transactions before the crash (pause crash)
- Communication server occasionally loses messages but does not delay messages (omission failure)
- Excessive message transmission or message processing delay (communication performance failure)
- Alteration of a message due to random noise during transmission (response failure)
15Hierarchical Failure Masking
- A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.
- A value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer
16Group Failure Masking
- To ensure a service remains available to clients despite server failure, one can implement a group of redundant, physically independent servers.
- The group masks the failure of a member.
- Hierarchical masking requires
- users to implement resource failure-masking attempts as exception handling code.
- In group masking,
- individual member failures are entirely hidden from users by group management mechanisms.
17Group Failure Masking (contd)
- Group output is a function of the outputs of individual group members, for example the output of the
- fastest member
- distinguished member
- result of majority vote
- A server group able to mask any k-1 concurrent member failures will be termed k-fault tolerant
- e.g., a primary/standby group of k servers with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures.
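Two of the group-output functions listed above can be sketched in a few lines. This is a minimal illustration, not a group management protocol: the names `majority_vote`, `primary_standby`, `healthy` and `crashed` are hypothetical, and a member that gives no output is modelled as returning `None`.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(outputs: Sequence[Optional[int]]) -> Optional[int]:
    """Return the value given by a strict majority of all members, else None."""
    answers = [o for o in outputs if o is not None]
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def primary_standby(members: Sequence[Callable[[int], Optional[int]]],
                    request: int) -> Optional[int]:
    """Ask members in rank order (primary, 1st backup, ...); first reply wins."""
    for member in members:
        reply = member(request)
        if reply is not None:
            return reply
    return None


def healthy(x: int) -> Optional[int]:
    return x * 2


def crashed(x: int) -> Optional[int]:
    return None  # an omission: the member never answers


print(majority_vote([4, 4, 7]))
print(primary_standby([crashed, crashed, healthy], 21))
```

With three ranked members, the primary/standby group still answers after two members have crashed, which matches the "k servers mask k-1 failures" example. Voting, by contrast, needs a majority of members to agree, so it tolerates fewer failures but also masks wrong values, not just omissions.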
18Some Formalism
- Programs
- A program consists of
- a finite set of variables
- a finite set of actions, each of the form guard → statement
- where
- guard is a boolean expression over program variables, and
- statement updates program variables
- Modifications
- guards may contain receives from channels
- statements may contain sends/receives
19Computation
- A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed
- In one step, multiple guards may be true.
- If the guard of some action is true continuously, then that action is eventually chosen for execution.
- Notes
- A program computation is a sequence of states
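The notion of a fair computation can be made concrete with a small interpreter for guarded actions. The sketch below assumes a round-robin scheduler, which is only one of many ways to obtain fairness; the names `run`, `inc_x` and `inc_y` and the two-counter program are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Action = Tuple[Callable[[State], bool], Callable[[State], None]]


def run(state: State, actions: List[Action], max_steps: int = 100) -> List[State]:
    """Execute enabled actions round-robin and return the sequence of states."""
    trace = [dict(state)]
    next_index = 0
    for _ in range(max_steps):
        # Scan the actions starting just after the one executed last, so an
        # action whose guard stays true is chosen within len(actions) steps.
        for offset in range(len(actions)):
            i = (next_index + offset) % len(actions)
            guard, statement = actions[i]
            if guard(state):
                statement(state)
                trace.append(dict(state))
                next_index = i + 1
                break
        else:
            break  # no guard is true: the computation has terminated
    return trace


def inc_x(s: State) -> None:
    s["x"] += 1


def inc_y(s: State) -> None:
    s["y"] += 1


actions: List[Action] = [
    (lambda s: s["x"] < 2, inc_x),
    (lambda s: s["y"] < 2, inc_y),
]
trace = run({"x": 0, "y": 0}, actions)
print(trace)
```

Both guards are true in the initial state; the scheduler alternates between them, so neither action is starved. The returned list is the computation viewed as a sequence of states.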
20Specification
- A specification is a set of sequences of states.
- What does it mean for a program p to satisfy a specification sp from a set of states S?
- every computation of p that starts from a state in S is in sp.
21Examples of specifications
- Let S be a predicate.
- Invariant
- invariant(S) = { seq : S is true in each state of seq }
- A sequence seq is in invariant(S) iff S is true in each state of seq.
- Closure
- closed(S) = { seq : (∀i : i > 0 : (S is true in the ith state of seq) ⇒ (S is true in the (i+1)th state of seq)) }
- If S ever becomes true, it continues to be true.
22Examples of specifications (contd)
- Let R and S be predicates.
- Leads-to
- R leads-to S = { seq : (∀i : i > 0 : (R is true in the ith state of seq) ⇒ (∃k : k ≥ i : S is true in the kth state of seq)) }
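The three example specifications (invariant, closed, leads-to) can be checked mechanically on a finite sequence of states. The sketch below is an approximation: the specifications are defined over whole computations, so a finite check can only refute leads-to on the portion observed. It assumes the witness state for leads-to may be the same state in which R holds (k ≥ i); the function names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Pred = Callable[[T], bool]


def invariant(seq: Sequence[T], s: Pred) -> bool:
    """S is true in each state of seq."""
    return all(s(state) for state in seq)


def closed(seq: Sequence[T], s: Pred) -> bool:
    """Whenever S is true in a state, it is true in the next state as well."""
    return all(s(seq[i + 1]) for i in range(len(seq) - 1) if s(seq[i]))


def leads_to(seq: Sequence[T], r: Pred, s: Pred) -> bool:
    """Whenever R is true in state i, S is true in some state k with k >= i."""
    return all(
        any(s(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if r(seq[i])
    )


trace = [0, 1, 2, 3, 3]
print(invariant(trace, lambda v: v >= 0))
print(closed(trace, lambda v: v >= 2))
print(leads_to(trace, lambda v: v == 0, lambda v: v == 3))
```

Note the difference between the first two checks: an invariant must hold from the first state, whereas a closed predicate may start out false and only has to stay true once it becomes true.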
23Examples of specifications (contd)
- Mutual Exclusion
- invariant( (j ≠ k) ⇒ ¬(cs.j ∧ cs.k) )
- (∀j :: (req.j leads-to cs.j)) // req.j: request for cs
- Leader Election
- invariant( (j ≠ k) ⇒ ¬(leader.j ∧ leader.k) )
- true leads-to (∃j :: leader.j)
- Load Balancing
- true leads-to
- (∀j,k :: |load.j - load.k| ≤ bound)
24Safety Specification
- Safety specification
- A sequence does "nothing bad"
- No sequence has a bad prefix
- Let sp be a specification.
- sp is a safety specification
- iff
- (∀s :: s ∉ sp ⇒ (∃a : a is a prefix of s : (∀b :: ab ∉ sp)))
25Liveness Specification
- Liveness specification
- A sequence does "something good"
- Every finite prefix has a good extension
- Let sp be a specification
- sp is a liveness specification
- iff
- (∀a :: (∃b :: ab ∈ sp)) // a could be a bad prefix
26Faults
- A fault is an action that can change the program state
- All faults (be they crash, fail-stop, omission, corruption, timing, Byzantine, intruders, or ...) can thus be viewed as perturbations on the system
27Faults (contd)
- A program computation in the presence of faults is a sequence of steps where
- in each step either a program action executes or a fault action executes
- the program actions are fairly executed
- the fault occurrences are finite
28Representation of Faults
- Communication faults
- Let c denote the sequence of messages on a channel.
- Let m1 and m2 be messages, and let seqm be a sequence of messages.
- Message loss: c = <seqm, m1> → c := <seqm>
- Message duplication: c = <seqm, m1> → c := <seqm, m1, m1>
- Message reorder: c = <seqm, m1, m2> → c := <seqm, m2, m1>
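The three communication faults above are ordinary actions on the channel contents, so they can be written as small functions on a list. The sketch below is a hedged illustration: the function names are hypothetical, the channel is modelled as a Python list whose last elements play the role of m1 and m2, and a fault whose guard does not hold leaves the channel unchanged.

```python
from typing import List, TypeVar

M = TypeVar("M")


def lose(c: List[M]) -> List[M]:
    """Message loss: <seqm, m1> becomes <seqm>."""
    if len(c) < 1:
        return list(c)  # guard false: nothing to lose
    return c[:-1]


def duplicate(c: List[M]) -> List[M]:
    """Message duplication: <seqm, m1> becomes <seqm, m1, m1>."""
    if len(c) < 1:
        return list(c)  # guard false: nothing to duplicate
    return c + [c[-1]]


def reorder(c: List[M]) -> List[M]:
    """Message reorder: <seqm, m1, m2> becomes <seqm, m2, m1>."""
    if len(c) < 2:
        return list(c)  # guard false: fewer than two messages
    return c[:-2] + [c[-1], c[-2]]


channel = ["a", "b", "c"]
print(lose(channel), duplicate(channel), reorder(channel))
```

Treating faults as state-changing actions in this way means a fault-affected computation is just an interleaving of program steps and calls such as these.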
29Representation of Faults (contd)
- Amnesia/Transient faults.
- Let c denote all the variables of a process.
- true → c := ?? // ?? denotes an arbitrary value
30Representation of Permanent Faults
- Fail-stop fault
- Upon fail-stop, a process does nothing
- it does not execute any action and
- it does not send any messages.
- Introduce an auxiliary variable up.j at process j
- Add up.j to the guard of each action of j
- If processes can detect the failure of other processes, then they can do so using the variable up.
31Representation of Permanent Faults
- Byzantine Faults
- Introduce an auxiliary variable b.j at process j
- Add these actions as faults:
- ¬b.j → b.j := true
- b.j → state.j := ?? // ?? denotes an arbitrary value
32Goal of Fault-tolerance Design
- Starting from some initial states, S,
- if the program executes alone, then the original specification, sp, is satisfied
- if the program executes in the presence of faults, then the fault-tolerance specification, sp', is satisfied.
- The fault-tolerance specification depends upon the type of the desired fault-tolerance, e.g.,
- for masking: sp' = sp
- for fail-safe: sp' = safety specification of sp
33Representation of Permanent Faults
- Fault-tolerant systems are rarely designed from scratch!
- One needs to modify a fault-intolerant system to add fault-tolerance
- Need for reuse of the fault-intolerant program.
- Fault-tolerant systems need to be modified to deal with new faults.
- Need for incremental design
- Several activities must be performed while developing fault-tolerant systems.
- manual or automated design, testing, verification, synthesis, ...
- It is desirable to have a unified framework in which to perform these activities.
34Overall Design
35Overall Design (contd)
- Should separate the concerns of functionality and fault-tolerance.
- Should use components that are responsible for fault-tolerance alone.
- Should provide structural continuity while performing these tasks.
- Should be able to use the same components while performing the above tasks.
36A Specific Approach
- We explore the following thesis (Kulkarni)
- fault-tolerant system = fault-intolerant system in composition with fault-tolerance components
37Validation
- Two components, detectors and correctors, form a basis of fault-tolerance design
- Detectors and correctors are necessary and sufficient for designing fault-tolerant systems that satisfy the reuse criterion
- Reuse criterion
- In the absence of faults, the fault-tolerant system behaves like the fault-intolerant system
- In the presence of faults, the fault-tolerant system recovers to the computations of the fault-intolerant system
38Validation (contd)
- Existing methods satisfy the reuse criterion
- Replication
- Schneider's state machine approach
- Checkpointing and recovery
- Programs designed with these methods can (alternatively) be designed by using detectors and correctors
- The use of detectors and correctors offers the potential for improved design
39Outline of Approach
- Identifying the components
- Their applications in design
- Their applications in verification
40Components for Fail-safe Tolerance
- How to preserve the safety specification?
- Existence of a safe predicate
- follows from the definition of safety
- Hence, we need to detect whether execution of an action in the given state is safe
- The added component is called a detector
[Figure labels: "Assume that safety is not violated here"; "Check whether safety would be violated"]
41Detectors
- Specification of a detector (detection predicate X, witness predicate Z)
- Z ⇒ X
- X leads-to (Z ∨ ¬X)
- Z next (Z ∨ ¬X)
- Examples: error detection codes, acceptance tests, comparators, snapshot procedures, exception conditions
42Designing Fail-safe Fault-Tolerance
- For each program action g → st
- Add a detector d such that
- its detection predicate equals a safe predicate of g → st
- its witness predicate equals Z
- New action is
- Z ∧ g → st
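The transformation above, which strengthens each guard with the witness predicate of a detector, can be sketched as a small wrapper. Everything concrete here is a hypothetical illustration: the helper `add_detector`, the withdrawal action, and the safety requirement "the balance never becomes negative" are not from the slides.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]
Guard = Callable[[State], bool]
Statement = Callable[[State], None]
Action = Tuple[Guard, Statement]


def add_detector(witness: Guard, action: Action) -> Action:
    """Turn the action (g, st) into (Z and g, st) for witness predicate Z."""
    guard, statement = action
    return (lambda s: witness(s) and guard(s), statement)


def withdraw(s: State) -> None:
    s["balance"] -= 30


# Original action: whenever a request is pending, withdraw 30.
original: Action = (lambda s: s["request"] == 1, withdraw)


# Witness predicate: true only where executing the action is safe,
# that is, where the balance would stay non-negative.
def witness(s: State) -> bool:
    return s["balance"] >= 30


safe_guard, safe_statement = add_detector(witness, original)

state = {"balance": 50, "request": 1}
while safe_guard(state):
    safe_statement(state)
print(state)
```

The guarded program stops with a balance of 20 even though the original guard is still true. That is the fail-safe trade: safety is preserved, possibly at the cost of progress.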
43 Hierarchical Construction of Detectors
44Components for Nonmasking Fault-Tolerance
- How to eventually satisfy the specification?
- Restore the program to a state from where its safety and liveness specifications are satisfied
- The added component is called a corrector
45Correctors
- Specification of a corrector (correction predicate X, witness predicate Z)
- Z ⇒ X
- true leads-to (X ∧ Z)
- X next X
- Z next Z
- "Large" correctors in distributed programs are built out of "parallel" or "sequential" composition of "smaller" ones
- Examples: error correction codes, reset procedures, voters, rollback recovery, constraint satisfaction
46Components for Masking Fault-Tolerance
- Ensure that in the presence of faults the safety specification is always satisfied
- use detectors
- Ensure that eventually the program reaches a state from where the specification is satisfied
- use correctors
47An example Input-Output Problem
- in : constant // either 0 or 1
- out : {0, 1, ⊥} // either 0 or 1 or ⊥, some specific value (currently unknown)
- Safety specification
- always (□)
- (out = ⊥) ∨ (out = in)
- ∧ ((out ≠ ⊥) next (out ≠ ⊥))
- Liveness specification: eventually (◇) (out = in)
48Example contd
- in : constant // either 0 or 1
- x : {0, 1} // initialized to in
- out : {0, 1, ⊥}
- out = ⊥ → out := x
- Faults
- true → x := ?
49Example contd
- y, z : {0, 1} // initialized to in
- (x = y ∨ x = z) ∧ out = ⊥ → out := x // (x = y ∨ x = z) is the detector
- More Faults
- true → y := ?
- true → z := ?
50Triple Modular Redundancy
- (y = x ∨ y = z) ∧ out = ⊥ → out := y
- (z = x ∨ z = y) ∧ out = ⊥ → out := z
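The input-output example can be simulated directly. In the sketch below, each of the three actions copies one of x, y, z to out only if that variable agrees with at least one of the other two. Several assumptions are mine, not the slides': `None` stands for the "currently unknown" value of out, at most one of the three variables is corrupted, the fault flips the bit (the worst case of an arbitrary value), and the names `enabled_actions` and `run_tmr` are hypothetical.

```python
import random
from typing import Dict, List, Optional

NAMES = ("x", "y", "z")


def enabled_actions(s: Dict[str, Optional[int]]) -> List[str]:
    """Variables whose action is enabled: out is unset and a peer agrees."""
    if s["out"] is not None:
        return []
    return [v for v in NAMES if any(s[v] == s[w] for w in NAMES if w != v)]


def run_tmr(in_value: int, corrupted: Optional[str] = None,
            rng: Optional[random.Random] = None) -> Optional[int]:
    """Initialise x, y, z to in_value, apply at most one fault, set out."""
    rng = rng or random.Random(0)
    s: Dict[str, Optional[int]] = {v: in_value for v in NAMES}
    s["out"] = None
    if corrupted is not None:
        s[corrupted] = 1 - in_value  # fault action on one variable
    choices = enabled_actions(s)
    if choices:
        # Any enabled action may fire; pick one nondeterministically.
        s["out"] = s[rng.choice(choices)]
    return s["out"]


results = {(i, c): run_tmr(i, c) for i in (0, 1) for c in (None, "x", "y", "z")}
print(results)
```

Whichever enabled action fires, out receives the majority value. The guard conjunct acts as the detector (it keeps a corrupted copy from reaching out), and the two redundant copies act as the corrector (some action always remains enabled), which together give masking tolerance to a single corruption.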
51Distributed Reset An Example in Design
- The problem: reset the state of a distributed system to a given global state
- Applicable in the design of various fault-tolerant systems
- Need for a fault-tolerant, bounded-memory protocol (Lamport and Lynch, in Handbook of TCS, 1990)
- Previous solutions are merely stabilizing tolerant
- They allow resets to be incorrect during recovery
- Our solution is the first to provide masking tolerance in addition to stabilizing tolerance
52Specification of Distributed Reset
[Figure: an intolerant program becomes a fail-safe tolerant program by adding detectors, a nonmasking tolerant program by adding correctors, and a masking tolerant program by adding detectors and correctors]
53Specification of Distributed Reset
- A process initiates a reset operation to reset the system to a given global state.
- For each reset operation initiated, the following two conditions should be satisfied
- non-prematurity
- when the initiating process completes the reset operation, the program state is reachable from the given global state
- eventual completion
- the initiating process eventually completes the reset operation
54Faults and Fault-tolerance Requirements
- CJTSS'98
- Fault-classes considered in our solution
- Network faults
- Failure and repair of processes and communication channels
- Memory faults
- Transient faults, undetectable message corruption
- Fault-tolerance requirements
- Masking tolerance to network faults
- Stabilizing tolerance to network faults and memory faults
- Other requirements
- Bounded memory at each process
55Use a diffusing computation
56Fault-intolerant Distributed Reset
- Embed a tree
- Use a diffusing computation
- Root of the tree initiates a diffusing computation
- Each process propagates the diffusing computation to its children
- A process completes the diffusing computation only after its descendants have completed the diffusing computation
- Each process resets its state when it propagates the diffusing computation
- Two processes communicate only if either both have reset their states or neither has reset its state in the current reset computation
- When the root of the tree completes the diffusing computation, the state of the system is reachable from the given global state
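The fault-intolerant reset described above can be sketched as a recursive walk over the embedded tree. This is a sequential simplification under stated assumptions: real processes run concurrently and exchange messages, no faults occur, and the names `diffuse`, `tree`, `target` and `log` are hypothetical.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]


def diffuse(tree: Tree, node: str, state: Dict[str, int],
            target: Dict[str, int], log: List[Tuple[str, str]]) -> None:
    """Propagate the reset wave from node down the tree, then complete."""
    # A process resets its state when it propagates the computation.
    state[node] = target[node]
    log.append(("propagate", node))
    for child in tree.get(node, []):
        diffuse(tree, child, state, target, log)
    # Completion happens only after all descendants have completed.
    log.append(("complete", node))


tree: Tree = {"root": ["a", "b"], "a": ["c"]}
target = {p: 0 for p in ("root", "a", "b", "c")}  # the given global state
state = {p: 9 for p in target}                     # arbitrary current state
log: List[Tuple[str, str]] = []

diffuse(tree, "root", state, target, log)
print(state)
print(log)
```

The log shows the two phases: the wave travels outward from the root, and completions travel back inward, so the root is the first to propagate and the last to complete.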
57 Designing components for masking tolerance
- Add a detector that
- lets the root detect whether all processes participated in the current diffusing computation
- Add a corrector that
- reconstructs the tree
- corrects the variables used in a diffusing computation
- ensures that the diffusing computation never blocks
- performs another diffusing computation if, when the diffusing computation completes, the check performed by the detector fails
- These components must be multitolerant!
58Designing multitolerant detector
- Problem: detect whether all processes participated in the diffusing computation
- Subproblem: let each process detect whether all its neighbors participated in that diffusing computation
- Easy if each diffusing computation is associated with a distinct sequence number
- requires that the sequence numbers are unbounded
- Difficult if the sequence numbers are bounded
- sequence numbers from old diffusing computations may confuse the detection
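A two-line experiment shows why bounded sequence numbers make detection hard. The sketch assumes the simplest possible check, comparing a neighbor's sequence number with one's own, and a hypothetical bound of 4 values; it illustrates the confusion only and is not the detector designed in these slides.

```python
def participated(own_sn: int, neighbor_sn: int) -> bool:
    """Naive check: equal sequence numbers mean 'same diffusing computation'."""
    return own_sn == neighbor_sn


def bounded(sn: int, k: int) -> int:
    """Store a sequence number in bounded memory by keeping it modulo k."""
    return sn % k


K = 4                  # hypothetical number of distinct sequence values
root_computation = 5   # the root is running its 5th diffusing computation
stale_computation = 1  # a neighbor last took part in the 1st one

unbounded_verdict = participated(root_computation, stale_computation)
bounded_verdict = participated(bounded(root_computation, K),
                               bounded(stale_computation, K))
print(unbounded_verdict, bounded_verdict)
```

With unbounded numbers the stale neighbor is correctly rejected. After wrap-around its old number collides with the current one and the naive check wrongly accepts it.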
59Problem with Bounded Sequence Numbers
60Problem with Bounded Sequence Numbers (contd)
61Problem with Bounded Sequence Numbers (contd)
- Theorem. Let j and l be neighboring processes and let ROOT be an ancestor of j.
- If j and l have completed at least two diffusing computations since they last changed the tree or observed a network fault, and the sequence numbers of j and l are identical,
- then l has propagated the same diffusing computation as j
62Multitolerant Detector (continued)
- Our detector guarantees that
- In the presence of network faults only, the root can always detect whether all processes participated in the current diffusing computation
- In the presence of network faults and memory faults, the root can eventually detect whether all processes participated in the current diffusing computation
63Multitolerant Distributed Reset
- Properties of our program
- Masking tolerance to network faults
- Stabilizing tolerance to memory faults and network faults
- Bounded memory
- Contains a multitolerant detector for non-prematurity
- Useful in various other applications
- termination detection
- network management