Fault Tolerant Computing - PowerPoint PPT Presentation

Slides: 64
Provided by: DrBetty3
Learn more at: http://www.cse.msu.edu
1
Fault Tolerant Computing
2
Acknowledgements
  • The following lectures are based on materials
    from the following sources
  • S. Kulkarni
  • J. Rushby
  • J. Knight

3
Objectives
  • Exposure to area of Critical Systems
  • What it means to have a fault-tolerant system
  • Specification techniques for representing
    critical properties
  • How to Design Fault tolerance into a system

4
Reliability and Recovery
  • Reliability
  • Probability that a system has not failed by time t,
    given that it was operating properly at time 0.
  • Recovery
  • Process of restoring consistency after a failure

5
Dependability
  • Dependability
  • How much one may rely on the quality of services
    delivered
  • Quality of service depends on
  • Correctness
  • Continuity of service

6
Terms
  • Failure: malfunction
  • Fault: condition that might lead to failure
  • Error: an incorrect response; indicates that a fault
    is present
  • Faults may be
  • permanent
  • intermittent
  • transient

7
Terms (contd)
  • Graceful Degradation
  • system is operational, but degraded, after faults
  • Fail-safe
  • system execution is safe after the fault
  • Stabilizing
  • system recovers to a consistent state after the
    fault
  • Masking
  • the user of the system does not see any
    unintended behavior due to faults

8
Terms (contd)
  • Mean Time to Failure (MTTF)
  • expected value of system failure time
  • Mean Time to Repair (MTTR)
  • expected value of system repair time
  • Mean Time Between Failures (MTBF)
  • expected time between successive failures:
    MTBF = MTTF + MTTR
  • Fault Tolerance
  • ability to continue operation after occurrence of
    faults
  • A system is faulty, once its behavior is no
    longer consistent with its specification.
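
As a small worked example of the relation MTBF = MTTF + MTTR, the sketch below estimates the three quantities from a log of observed intervals. The function name and the interval values are hypothetical, chosen only for illustration.

```python
from statistics import mean

def mtbf_summary(uptimes, repairs):
    """Estimate MTTF, MTTR, and MTBF from observed intervals (in hours)."""
    mttf = mean(uptimes)   # average time in operation before each failure
    mttr = mean(repairs)   # average time spent repairing after each failure
    return mttf, mttr, mttf + mttr  # MTBF = MTTF + MTTR

# Hypothetical observations: three failure/repair cycles.
mttf, mttr, mtbf = mtbf_summary([100, 140, 120], [2, 4, 3])
print(mttf, mttr, mtbf)
```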

9
Design Decisions
  • Fault detection
  • Fault confinement
  • Fault diagnosis
  • Repair and/or reconfigure
  • Redundancy
  • Hardware: extra hardware
  • Information: redundancy bits
  • Software: diagnosis software, extra software
  • Temporal: re-execute software to recover from
    intermittent faults

10
Safety vs Reliability
  • Reliability
  • concerns occurrence of failures
  • System failures defined in terms of system
    services
  • Safety concerns occurrence of accidents
  • Unplanned events that result in death, injury,
    illness, damage, loss of property, or environmental
    harm
  • Defined in terms of external consequences

11
Types of Faults
  • Omission failure
  • server omits to respond to an input (fail-silent
    failure)
  • Timing failure
  • response is functionally correct, but untimely
  • can be early timing failure or late timing
    failure
  • (performance failure)
  • Response failure
  • incorrect response
  • if output value incorrect (value failure)
  • state transition incorrect (state transition
    failure)

12
Types of Faults (contd)
  • Crash failure
  • if after a first omission, a server omits to
    produce output until it restarts
  • Amnesia crash
  • server restarts in a predefined initial state
    that does not depend on the inputs seen before
    crash
  • Partial amnesia crash
  • some part of the state is the same as before the
    crash; the rest is in a predefined initial state
  • Pause crash
  • server restarts in the state it had before the
    crash
  • Halting crash
  • crashed server never restarts

13
Types of Faults (contd)
  • Byzantine failure
  • Component exhibits arbitrary and malicious
    behavior,
  • Perhaps in cooperation with other faulty
    components.
  • Fail-stop failure
  • In response to a failure,
  • Component changes to a state that permits other
    components to detect that a failure has occurred
    and then stops.

14
Examples
  • OS crash followed by a reboot in the initial state
    (amnesia failure)
  • Database server crash followed by recovery of a
    database state that reflects all transactions
    before the crash (pause failure)
  • Communication server occasionally loses messages
    but does not delay messages (omission failure)
  • Excessive message transmission or message
    processing delay (communication performance
    failure)
  • Alteration of a message due to random noise
    during transmission (response failure)

15
Hierarchical Failure Masking
  • A failure of a certain type at a lower level can
    propagate as a different kind of failure at a
    higher level abstraction.
  • Value Error at the physical layer (e.g., 2 bits
    corrupted) propagates as omission error at data
    link layer

16
Group Failure Masking
  • To ensure a service remains available to clients
    despite server failure,
  • one can implement a group of redundant,
    physically independent servers.
  • The group masks the failure of a member.
  • Hierarchical masking requires
  • users to implement resource failure-masking
    attempts as exception handling code.
  • In group masking,
  • individual member failures are entirely hidden
    from users by group management mechanisms.

17
Group Failure Masking (contd)
  • Group output is a function of outputs of
    individual group members.
  • fastest member
  • distinguished member
  • result of majority vote
  • A server able to mask any k-1 concurrent member
    failures will be termed k-fault tolerant
  • e.g., a primary/standby group of k servers with
    members ranked as primary, 1st backup, 2nd
    backup, ..., can mask k-1 failures.
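
The group-output functions listed above can be sketched as follows. This is a minimal illustration rather than a group management protocol: the function names are hypothetical, and a crashed member is modeled, as an assumption of this sketch, by None. Majority voting masks incorrect values, while the primary/standby ranking masks crashes.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of members, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def primary_standby(members, request):
    """Return the output of the highest-ranked member that has not crashed.

    `members` is ordered primary, 1st backup, 2nd backup, ...
    A crashed member is represented by None (assumption of this sketch).
    """
    for server in members:
        if server is not None:
            return server(request)
    raise RuntimeError("all k members have failed")

def double(x):
    return 2 * x

# A group of k = 3 servers in which the primary and 1st backup have crashed.
print(primary_standby([None, None, double], 21))
print(majority_vote([4, 4, 7]))
```

With k = 3 members, the ranked group still answers after k-1 = 2 crashes, which matches the k-fault-tolerance definition on this slide.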

18
Some Formalism
  • Programs
  • A Program consists of
  • a finite set of variables
  • a finite set of actions, each of the form
    guard → statement
  • where
  • guard is a boolean expression over program
    variables, and
  • statement updates program variables
  • Modifications
  • guards may contain receive from channels
  • statements may contain sends/receive

19
Computation
  • A program computation is a "fair" sequence of
    steps, where in each step an action whose guard
    is true has its statement executed
  • In one step, multiple guards may be true.
  • If guard of some action is true continuously,
    then that action would eventually be chosen for
    execution.
  • Notes
  • A program computation is a sequence of states
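
A minimal sketch of how such a computation could be generated, assuming actions are represented as (guard, statement) pairs over a state dictionary. Round-robin scanning is used here as one simple scheduler that meets the fairness requirement; the two-action program is hypothetical.

```python
def run(state, actions, max_steps=100):
    """Execute guarded actions until no guard is true; return the state sequence.

    `actions` is a list of (guard, statement) pairs, where guard(state)
    returns a bool and statement(state) updates the state in place.
    Scanning resumes after the last executed action, so an action whose
    guard stays true is chosen within len(actions) steps (fairness).
    """
    trace = [dict(state)]
    start = 0
    n = len(actions)
    for _ in range(max_steps):
        for offset in range(n):
            guard, statement = actions[(start + offset) % n]
            if guard(state):
                statement(state)
                trace.append(dict(state))
                start = (start + offset + 1) % n
                break
        else:
            break  # no guard is true: the computation has terminated
    return trace

# Hypothetical program: move x toward y, one unit per step.
actions = [
    (lambda s: s["x"] < s["y"], lambda s: s.update(x=s["x"] + 1)),
    (lambda s: s["x"] > s["y"], lambda s: s.update(x=s["x"] - 1)),
]
trace = run({"x": 0, "y": 3}, actions)
print(trace)
```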

20
Specification
  • A specification is a set of sequences of states.
  • What does it mean for a program, p to satisfy a
    specification sp from a set of states S?
  • every computation of p that starts from a state
    in S is in sp .

21
Examples of specifications
  • Let S be a predicate.
  • invariant
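  • $\mathrm{Invariant}(S) = \{\, seq : S \text{ is true in each state of } seq \,\}$
  • A sequence seq is in invariant(S) iff S is true
    in each state in seq.
  • Closure
  • $\mathrm{Closed}(S) = \{\, seq : (\forall i : i \geq 0 : (S \text{ is true in the } i\text{th state of } seq) \Rightarrow (S \text{ is true in the } (i+1)\text{th state of } seq)) \,\}$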
  • If S ever becomes true, it continues to be true.
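
The two specifications can be checked on a finite prefix of a state sequence as sketched below. This is an illustration under the assumption that states are plain Python values and predicates are functions; a specification in the slides ranges over whole (possibly infinite) sequences, so a check like this can only refute, not establish, membership in general.

```python
def invariant(S, seq):
    """True iff predicate S holds in every state of seq."""
    return all(S(state) for state in seq)

def closed(S, seq):
    """True iff whenever S holds in state i, it also holds in state i+1."""
    return all(S(nxt) for cur, nxt in zip(seq, seq[1:]) if S(cur))

def non_negative(x):
    return x >= 0

print(invariant(non_negative, [0, 1, 2]))   # S holds everywhere
print(closed(non_negative, [-2, -1, 0, 1])) # S becomes true and stays true
```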

22
Examples of specifications (contd)
  • Let R and S be predicates.
  • leads-to
  • $R \text{ leads-to } S = \{\, seq : (\forall i : i \geq 0 : (R \text{ is true in the } i\text{th state of } seq) \Rightarrow (\exists k : k \geq i : S \text{ is true in the } k\text{th state of } seq)) \,\}$
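
A small sketch of the leads-to check on a finite trace, assuming states are Python values and R, S are predicate functions. On a finite prefix a False result only means that S has not been reached yet, so this is an illustration of the definition rather than a verifier.

```python
def leads_to(R, S, seq):
    """True iff every state satisfying R is followed, at or after it, by a state satisfying S."""
    return all(
        any(S(later) for later in seq[i:])
        for i, state in enumerate(seq)
        if R(state)
    )

def requesting(s):
    return s == "req"

def in_cs(s):
    return s == "cs"

# Hypothetical trace of one process in a mutual exclusion program.
print(leads_to(requesting, in_cs, ["idle", "req", "req", "cs", "idle"]))
```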

23
Examples of specifications (contd)
  • Mutual Exclusion
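  • $\mathrm{invariant}(\, (j \neq k) \Rightarrow \neg(cs.j \land cs.k) \,)$
  • $(\forall j :: req.j \text{ leads-to } cs.j)$   // request for
    cs
  • Leader Election
  • $\mathrm{invariant}(\, (j \neq k) \Rightarrow \neg(leader.j \land leader.k) \,)$
  • $\mathit{true} \text{ leads-to } (\exists j :: leader.j)$
  • Load Balancing
  • true leads-to
  • $(\forall j, k :: |load.j - load.k| \leq bound)$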

24
Safety Specification
  • Safety specification
  • A sequence does "nothing bad"
  • No sequence has a bad prefix
  • Let sp be a specification.
  • sp is a safety specification
  • iff
  • $(\forall \sigma : \sigma \notin sp : (\exists \alpha : \alpha \text{ is a prefix of } \sigma : (\forall \beta :: \alpha\beta \notin sp)))$

25
Liveness Specification
  • Liveness specification
  • A sequence does something good
  • Every finite prefix has a good extension
  • Let sp be a specification
  • sp is a liveness specification
  • iff
  • $(\forall \alpha :: (\exists \beta :: \alpha\beta \in sp))$   // $\alpha$ could be a
    bad prefix

26
Faults
  • A fault is an action that can change the program
    state
  • All faults
  • (be they crash, fail-stop, omission,
    corruption, timing, Byzantine, intruders, or
    ...)
  • can be thus viewed as perturbations on the
    system

27
Faults (contd)
  • A program computation in the presence of faults
    is a sequence of steps where
  • in each step either program action executes or
    fault action executes
  • the program actions are fairly executed
  • the fault occurrences are finite

28
Representation of Faults
  • Communication faults
  • Let c denote the sequence of messages on a
    channel.
  • Let m1 and m2 be messages, and let seqm be a
    sequence of messages.
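  • Message Loss: $c = \langle seqm, m1 \rangle \;\rightarrow\; c := \langle seqm \rangle$
  • Message Duplication: $c = \langle seqm, m1 \rangle \;\rightarrow\; c := \langle seqm, m1, m1 \rangle$
  • Message Reorder: $c = \langle seqm, m1, m2 \rangle \;\rightarrow\; c := \langle seqm, m2, m1 \rangle$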
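
The three communication faults can be sketched as functions on a channel modeled as a Python list, with the most recent messages at the end. The function names are hypothetical; each function applies the fault only when its guard (the required channel contents) holds and otherwise leaves the channel unchanged.

```python
def message_loss(c):
    """<seqm, m1> becomes <seqm>: the last message is lost."""
    return c[:-1] if c else c

def message_duplication(c):
    """<seqm, m1> becomes <seqm, m1, m1>: the last message is duplicated."""
    return c + c[-1:]

def message_reorder(c):
    """<seqm, m1, m2> becomes <seqm, m2, m1>: the last two messages swap."""
    return c[:-2] + [c[-1], c[-2]] if len(c) >= 2 else c

channel = ["a", "b", "c"]
print(message_loss(channel))
print(message_duplication(channel))
print(message_reorder(channel))
```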

29
Representation of Faults (contd)
  • Amnesia/Transient faults.
  • Let c denote all the variables of a process.
  • $\mathit{true} \;\rightarrow\; c := ??$   // ?? denotes an arbitrary value

30
Representation of Permanent Faults
  • Fail-stop fault
  • Upon fail-stop, a process does nothing
  • it does not execute any action and
  • it does not send any messages.
  • Introduce an auxiliary variable up.j at process j
  • Add up.j to the guard of each action of j
  • If processes can detect failure of other
    processes, then they can do so using variable up.

31
Representation of Permanent Faults
  • Byzantine Faults
  • Introduce an auxiliary variable b.j at process j
  • Add these actions as faults:
  • $\neg b.j \;\rightarrow\; b.j := \mathit{true}$
  • $b.j \;\rightarrow\; state.j := ??$

32
Goal of Fault-tolerance Design
  • Starting from some initial states, S,
  • If the program executes alone then the original
    specification, sp, is satisfied
  • If the program executes in the presence of faults
    then the fault-tolerant specification, sp', is
    satisfied.
  • The fault-tolerance specification depends upon
    the type of the desired fault-tolerance, e.g.,
  • for masking: sp' = sp
  • for fail-safe: sp' = safety specification of sp

33
Representation of Permanent Faults
  • Fault-tolerant systems are rarely designed from
    scratch!!!
  • One needs to modify a fault-intolerant system to
    add fault-tolerance
  • Need for reuse of the fault-intolerant program.
  • Fault-tolerant systems need to be modified to
    deal with new faults.
  • Need for incremental design
  • Need to perform several activities while
    developing fault-tolerant systems.
  • manual or automated design, testing,
    verification, synthesis, ...
  • desirable to have a unified framework that supports
    all of these activities.

34
Overall Design
35
Overall Design (contd)
  • Should separate concerns of functionality and
    fault-tolerance.
  • Should use components that are responsible for
    fault-tolerance alone.
  • Should provide structural continuity while
    performing these tasks.
  • Should be able to use the same components while
    performing the above tasks.

36
A Specific Approach
  • We explore the following thesis (Kulkarni):
  • fault-tolerant system = fault-intolerant system
    in composition with fault-tolerance components

37
Validation
  • Two components, detectors and correctors form a
    basis of fault-tolerance design
  • Detectors and correctors are necessary and
    sufficient for designing fault-tolerant systems
    that satisfy the reuse criterion
  • Reuse criterion
  • In the absence of faults, the fault-tolerant
    system behaves like the fault-intolerant system
  • In the presence of faults, the fault-tolerant
    system recovers to the computations of the
    fault-intolerant system

38
Validation (contd)
  • Existing methods satisfy the reuse criterion
  • Replication
  • Schneider's state machine approach
  • Checkpointing and recovery
  • Programs designed with these methods can be
    (alternatively) designed by using detectors and
    correctors
  • The use of detectors and correctors offers the
    potential for improved design

39
Outline of Approach
  • Identifying the components
  • Their applications in design
  • Their applications in verification

40
Components for Fail-safe Tolerance
  • How to preserve the safety specification ?
  • Existence of safe predicate
  • follows from the definition of safety
  • Hence, we need to detect whether execution of an
    action in the given state is safe
  • The added component is called a detector

[Figure annotations: "Assume that safety is not violated here"; "Check whether safety would be violated"]
41
Detectors
  • Specification of a detector (detection
    predicate X, witness predicate Z)
  • $Z \Rightarrow X$
  • $X \text{ leads-to } (Z \lor \neg X)$
  • $Z \text{ next } (Z \lor \neg X)$
  • Examples: error detection codes, acceptance
    tests, comparators, snapshot procedures,
    exception conditions

42
Designing Fail-safe Fault-Tolerance
  • For each program action
  • Add a detector d such that
  • detection predicate equals a safe predicate of
    $g \rightarrow st$
  • witness predicate equals Z
  • New action is
  • $Z \land g \;\rightarrow\; st$
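
The construction above, which strengthens the guard of each action with the witness predicate of a detector, can be sketched as follows. The withdrawal action and its safe predicate are hypothetical examples chosen for this sketch; here the witness is taken to be the safe predicate itself.

```python
def add_detector(witness, guard, statement):
    """Build the fail-safe action (Z and g) -> st from the action g -> st."""
    return (lambda s: witness(s) and guard(s), statement)

# Hypothetical intolerant action: pay out whenever a withdrawal is pending.
def guard(s):
    return s["pending"]

def statement(s):
    s["balance"] -= s["amount"]
    s["pending"] = False

# Assumed safe predicate for this action: the balance must not go negative.
def witness(s):
    return s["balance"] >= s["amount"]

safe_guard, safe_statement = add_detector(witness, guard, statement)

state = {"pending": True, "balance": 5, "amount": 10}
if safe_guard(state):       # the detector blocks the unsafe step
    safe_statement(state)
```

The fail-safe action never executes in a state where the step would violate safety, although it may block forever, which is why liveness is not guaranteed by detectors alone.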

43
Hierarchical Construction of Detectors
44
Components for Nonmasking Fault-Tolerance
  • How to eventually satisfy the specification ?
  • Restore the program to a state from where its
    safety and liveness specification are satisfied
  • The added component is called a corrector

45
Correctors
  • Specification of a corrector (correction
    predicate X, witness predicate Z)
  • $Z \Rightarrow X$
  • $\mathit{true} \text{ leads-to } (X \land Z)$
  • $X \text{ next } X$
  • $Z \text{ next } Z$
  • 'Large' correctors in distributed programs are
    built out of 'parallel' or 'sequential'
    composition of 'smaller' ones
  • Examples: error correction codes, reset
    procedures, voters, rollback recovery,
    constraint satisfaction

46
Components for Masking Fault-Tolerance
  • Ensure that in the presence of faults the safety
    specification is always satisfied
  • use detectors
  • Ensure that eventually the program reaches a
    state from where the specification is satisfied
  • use correctors

47
An example Input-Output Problem
  • in : constant   // either 0 or 1
  • out : $\{0, 1, \bot\}$   // either 0 or 1 or $\bot$,
    some specific value (currently unknown)
  • Safety specification
  • always ($\Box$)
  • $(out = \bot) \lor (out = in)$
  • $\land\; (out \neq \bot) \text{ next } (out \neq \bot)$
  • Liveness specification: eventually ($\Diamond$) $(out = in)$

48
Example contd
  • in : constant   // either 0 or 1
  • x : $\{0, 1\}$   // initialized to in
  • out : $\{0, 1, \bot\}$
  • $out = \bot \;\rightarrow\; out := x$
  • Faults
  • $\mathit{true} \;\rightarrow\; x := ?$

49
Example contd
  • y, z : $\{0, 1\}$   // initialized to in
  • $(x = y \lor x = z) \;\land$   // detector
  • More Faults
  • $\mathit{true} \;\rightarrow\; y := ?$
  • $\mathit{true} \;\rightarrow\; z := ?$

50
Triple Modular Redundancy
  • $(y = x \lor y = z) \land out = \bot \;\rightarrow\; out := y$
  • $(z = x \lor z = y) \land out = \bot \;\rightarrow\; out := z$
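
A minimal simulation of the triple modular redundancy idea, assuming three replicas x, y, z initialized to the input and at most one of them corrupted to an arbitrary value. The output is copied only from a replica that agrees with another one. BOTTOM, tmr_step, and run_tmr are names introduced for this sketch.

```python
import random

BOTTOM = None  # stands for the "unknown" output value

def tmr_step(s):
    """Fire one enabled action: copy a replica that agrees with another replica."""
    if s["out"] is not BOTTOM:
        return
    x, y, z = s["x"], s["y"], s["z"]
    if x == y or x == z:
        s["out"] = x
    elif y == x or y == z:
        s["out"] = y
    elif z == x or z == y:  # mirrors the third action; covered by the cases above
        s["out"] = z

def run_tmr(inp, corrupted=None, rng=random):
    """Run the program on input `inp`, optionally corrupting one replica."""
    s = {"x": inp, "y": inp, "z": inp, "out": BOTTOM}
    if corrupted is not None:
        s[corrupted] = rng.choice([0, 1])  # fault: replica := arbitrary value
    tmr_step(s)
    return s["out"]

print(run_tmr(1, corrupted="x"))
```

With a single corrupted replica, the other two still agree on the input, so the guard of at least one action holds and the output equals the input.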

51
Distributed Reset An Example in Design
  • The problem Reset the state of a distributed
    system to a given global state
  • Applicable in the design of various
    fault-tolerant systems
  • Need for a fault-tolerant, bounded memory
    protocol (Lamport and Lynch, in Handbook of TCS
    1990)
  • Previous solutions are merely stabilizing
    tolerant
  • They allow resets to be incorrect during recovery
  • Our solution is the first to provide masking
    tolerance in addition to stabilizing tolerance

52
Specification of Distributed Reset
[Figure: the intolerant program becomes a fail-safe tolerant program by adding detectors, a nonmasking tolerant program by adding correctors, and a masking tolerant program by adding detectors and correctors]
53
Specification of Distributed Reset
  • A process initiates a reset operation to reset
    the system to a given global state.  
  • For each reset operation initiated, the following
    two conditions should be satisfied
  •  non-prematurity
  • when the initiating process completes the reset
    operation, the program state is reachable from
    the given global state
  • eventual completion
  •   the initiating process eventually completes the
    reset operation

54
Faults and Fault-tolerance Requirements
  • CJTSS'98
  • Fault-classes considered in our solution
  • Network faults
  • Failure and repair of processes and
    communication channels
  • Memory faults
  •  Transient faults, undetectable message
    corruption
  • Fault-tolerance requirements
  • Masking tolerance to network faults
  • Stabilizing tolerance to network faults and
    memory faults
  • Other requirements
  • Bounded memory at each process

55
Use a diffusing computation
56
Fault-intolerant Distributed Reset
  • Embed a tree
  • Use a diffusing computation
  • Root of the tree initiates a diffusing
    computation
  • Each process propagates the diffusing computation
    to its children
  • A process completes the diffusing computation
    only after its descendants have completed the
    diffusing computation
  • Each process resets its state when it propagates
    the diffusing computation
  • Two processes communicate only if either both
    or neither of them have reset their states in
    the current reset computation
  • When the root of the tree completes the diffusing
    computation, the state of the system is reachable
    from the given global state
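
The wave structure of the fault-intolerant reset can be sketched with a sequential recursion over the embedded tree. This is a simplification under stated assumptions: a real protocol runs concurrently with message passing and enforces the communication restriction, which this sketch omits. The tree, node names, and function name are hypothetical.

```python
def diffusing_reset(tree, node, reset_state, state, completed):
    """Propagate a reset wave from `node` down the tree, completing bottom-up."""
    state[node] = reset_state            # reset when propagating the wave
    for child in tree.get(node, []):
        diffusing_reset(tree, child, reset_state, state, completed)
    completed.append(node)               # complete only after all descendants

tree = {"root": ["a", "b"], "a": ["c"]}  # hypothetical embedded tree
state = {n: "stale" for n in ("root", "a", "b", "c")}
completed = []
diffusing_reset(tree, "root", "initial", state, completed)
print(completed)
```

The root is the last process to complete, and by then every process has reset its state, which is the property the last bullet of this slide relies on.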

57
 Designing components for masking tolerance
  • Add a detector that
  • lets the root detect if all processes
    participated in the current diffusing computation
  • Add a corrector that
  • reconstructs the tree
  • corrects the variables used in a diffusing
    computation
  • ensures that the diffusing computation never
    blocks
  • when the diffusing computation completes,
    performs another diffusing computation if the
    check performed by the detector fails
  • These components must be multitolerant !!

58
Designing multitolerant detector
  • Problem: Detect whether all processes
    participated in the diffusing computation
  • Subproblem: Let each process detect if all its
    neighbors participated in that diffusing
    computation
  • Easy if each diffusing computation is associated
    with a distinct sequence number
  • requires that the sequence numbers are unbounded
  • Difficult if the sequence numbers are bounded
  • sequence numbers from old diffusing computations
    may confuse the detection
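
A tiny illustration of why bounded sequence numbers are harder, assuming sequence numbers are reused modulo a bound K. The bound, the wave numbers, and the function name are hypothetical; this is not the detector from the slides, only the confusion it has to avoid.

```python
def same_computation_guess(seq_j, seq_l):
    """Local test at a process: equal sequence numbers suggest the same wave."""
    return seq_j == seq_l

K = 4  # hypothetical bound on sequence numbers

# Process l still holds the number of wave 1; process j is already in wave 5.
wave_l, wave_j = 1, 5
unbounded_match = same_computation_guess(wave_j, wave_l)
bounded_match = same_computation_guess(wave_j % K, wave_l % K)
print(unbounded_match, bounded_match)
```

With unbounded numbers the two waves are told apart, whereas with the bound the stale number of l is indistinguishable from the current number of j.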

59
Problem with Bounded Sequence Numbers
60
Problem with Bounded Sequence Numbers (contd)
61
Problem with Bounded Sequence Numbers (contd)
  • Theorem. Let j and l be neighboring processes and
    let ROOT be an ancestor of j.
  • If j and l have completed at least two diffusing
    computations since they changed tree or they
    observed a network fault, and the sequence
    numbers of j and l are identical,
  • Then l has propagated the same diffusing
    computation as j

62
Multitolerant Detector (continued)
  • Our detector guarantees that
  • In the presence of network faults only, the root
    can always detect whether all processes
    participated in the current diffusing
    computation
  • In the presence of network faults and memory
    faults, the root can eventually detect whether
    all processes participated in the current
    diffusing computation

63
Multitolerant Distributed Reset
  • Properties of our program
  • Masking tolerance to network faults
  • Stabilizing tolerance to memory faults and
    network faults
  • Bounded memory
  • Contains a multitolerant detector for
    non-prematurity
  • Useful in various other applications
  • termination detection
  • network management