Fault Tolerant Computing - PowerPoint PPT Presentation

Provided by: DrBetty3
Learn more at: http://www.cse.msu.edu

1
Fault Tolerant Computing
2
Acknowledgements
  • The following lectures are based on materials
    from the following sources
  • S. Kulkarni
  • J. Rushby
  • J. Knight

3
Objectives
  • Exposure to area of Critical Systems
  • What it means to have a fault-tolerant system
  • Specification techniques for representing
    critical properties
  • How to design fault tolerance into a system

4
Reliability and Recovery
  • Reliability
  • Probability that a system operates without
    failure up to time t, given that it was
    operating properly at time 0.
  • Recovery
  • Process of restoring consistency after a failure
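
The reliability definition above can be made concrete under one common modelling assumption that is not stated on the slide: a constant failure rate, which gives R(t) = exp(-failure_rate * t). The function name and the parameter failure_rate are illustrative only.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """Probability of surviving the interval [0, t] without failure.

    Assumes a constant failure rate (exponential failure law); this
    assumption is ours, not the slide's.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


if __name__ == "__main__":
    # With 0.01 failures per hour, survival probability over 100 hours.
    print(reliability(100.0, 0.01))
```

Under this model the system is certain to be working at t = 0, and the probability of surviving decreases as t grows.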

5
Dependability
  • Dependability
  • How much one may rely on the quality of services
    delivered
  • Quality of service depends on
  • Correctness
  • Continuity of service

6
Terms
  • Failure: malfunction
  • Fault: condition that might lead to failure
  • Error: an incorrect response; indicates a fault
    is present
  • Faults may be
  • permanent
  • intermittent
  • transient

7
Terms (contd)
  • Graceful Degradation
  • system is operational, but degraded, after faults
  • Fail-safe
  • system execution is safe after the fault
  • Stabilizing
  • system recovers to a consistent state after the
    fault
  • Masking
  • the user of the system does not see any
    unintended behavior due to faults

8
Terms (contd)
  • Mean Time to Failure (MTTF)
  • expected value of system failure time
  • Mean Time to Repair (MTTR)
  • expected value of system repair time
  • Mean Time Between Failures (MTBF)
  • expected time between successive failures:
    MTBF = MTTF + MTTR
  • Fault Tolerance
  • ability to continue operation after occurrence of
    faults
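
A small worked example of the relation between the three mean times. The steady-state availability formula MTTF / (MTTF + MTTR) is a standard companion measure added here for illustration; it does not appear on the slide, and the function names are hypothetical.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures: one failure-free period plus one repair."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up (not from the slides)."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # A system that runs 990 hours on average and takes 10 hours to repair.
    print(mtbf(990.0, 10.0))
    print(availability(990.0, 10.0))
```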

9
Design Decisions
  • Fault detection
  • Fault confinement
  • Fault diagnosis
  • Repair and/or reconfigure
  • Redundancy
  • Hardware: extra hardware
  • Information: redundancy bits
  • Software: diagnosis software, extra software
  • Temporal: re-execute software to recover from
    intermittent faults
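
Two of the redundancy kinds listed above can be sketched in a few lines: information redundancy as a single even-parity bit, and temporal redundancy as bounded re-execution. This is a minimal illustration; the function names and the choice of RuntimeError to signal a transient problem are assumptions, not part of the lecture.

```python
def add_parity(bits):
    """Information redundancy: append one even-parity bit."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word still has even parity (detects any single bit flip)."""
    return sum(word) % 2 == 0


def retry(operation, attempts=3):
    """Temporal redundancy: re-execute an operation that may fail transiently."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as exc:  # assumed signal of an intermittent fault
            last_error = exc
    raise last_error
```

A parity bit only detects an odd number of flipped bits, and retrying only helps when the fault does not persist, which is why the slide pairs temporal redundancy with intermittent faults.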

10
Safety vs Reliability
  • Reliability
  • concerns occurrence of failures
  • System failures defined in terms of system
    services
  • Safety concerns occurrence of accidents
  • Unplanned events that result in death, injury,
    illness, damage, loss of property or environmental
    harm
  • Defined in terms of external consequences

11
Types of Faults
  • Omission failure
  • server omits to respond to an input (fail-silent
    failure)
  • Timing failure
  • response is functionally correct but untimely:
    either an early timing failure or a late timing
    failure (performance failure)
  • Response failure
  • incorrect response
  • output value is incorrect (value failure)
  • state transition is incorrect (state transition
    failure)

12
Types of Faults (contd)
  • Crash failure
  • if after a first omission, a server omits to
    produce output until it restarts
  • Amnesia crash
  • server restarts in a predefined initial state
    that does not depend on the inputs seen before
    crash
  • Partial amnesia crash
  • some part of the state is the same as before the
    crash; the rest is in a predefined initial state
  • Pause crash
  • server restarts in the state it had before the
    crash
  • Halting crash
  • crashed server never restarts

13
Examples
  • OS crash followed by reboot in an initial state
  • Database server crash followed by recovery of a
    database state that reflects all transactions
    before the crash
  • Communication server occasionally loses messages
    but does not delay messages (omission failure)
  • Excessive message transmission or message
    processing delay (communication performance
    failure)
  • Alteration of a message due to random noise
    during transmission (response failure)

14
Hierarchical Failure Masking
  • A failure of a certain type at a lower level can
    propagate as a different kind of failure at a
    higher level abstraction.
  • Value Error at the physical layer (e.g., 2 bits
    corrupted) propagates as omission error at data
    link layer
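
The physical-layer example above can be sketched with a checksum: a frame whose bits were corrupted fails the check and is dropped, so the layer above sees a missing frame rather than a wrong value. This sketch assumes a CRC-32 trailer computed with Python's zlib.crc32; the frame layout and function names are hypothetical.

```python
import zlib
from typing import Optional


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def deliver(received: bytes) -> Optional[bytes]:
    """Return the payload, or None if the frame is dropped.

    A value error in the received bits becomes an omission (None)
    for the layer above.
    """
    if len(received) < 4:
        return None
    payload, trailer = received[:-4], received[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload
```

The caller never receives corrupted data; it only has to handle the simpler case of a frame that did not arrive.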

15
Group Failure Masking
  • To ensure a service remains available to clients
    despite server failure,
  • one can implement a group of redundant,
    physically independent servers.
  • The group masks the failure of a member.
  • Hierarchical masking requires
  • users to implement resource failure-masking
    attempts as exception handling code.
  • In group masking,
  • individual member failures are entirely hidden
    from users by group management mechanisms.

16
Group Failure Masking (contd)
  • Group output is a function of outputs of
    individual group members.
  • fastest member
  • distinguished member
  • result of majority vote
  • A group able to mask any k concurrent member
    failures will be termed k-fault tolerant
  • e.g., a primary/standby group of k servers with
    members ranked as primary, 1st backup, 2nd
    backup, ..., can mask k-1 failures.
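
Two of the group-output functions listed above, sketched under simplifying assumptions: members are plain Python callables, a crashed member raises RuntimeError, and a vote needs a strict majority. These modelling choices and the function names are illustrative, not taken from the lecture.

```python
from collections import Counter


def majority_vote(outputs):
    """Group output = value reported by a strict majority of members."""
    if not outputs:
        raise ValueError("no member outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no strict majority")
    return value


def primary_standby(members, request):
    """Try the primary, then each backup in rank order.

    With k members this masks up to k-1 crashed members.
    """
    for member in members:
        try:
            return member(request)
        except RuntimeError:  # assumed signal that this member has crashed
            continue
    raise RuntimeError("all group members failed")
```

Majority voting can also mask members that return wrong values, as long as fewer than half do; the primary/standby scheme only masks members that fail to answer.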

17
Some Formalism
  • Programs
  • A Program consists of
  • a finite set of variables
  • a finite set of actions, each of the form
    guard --> statement
  • where
  • guard is a boolean expression over program
    variables, and
  • statement updates program variables
  • Modifications
  • guards may contain receive from channels
  • statements may contain sends/receives

18
Computation
  • A program computation is a "fair" sequence of
    steps, where in each step an action whose guard
    is true has its statement executed
  • In one step, multiple guards may be true.
  • If guard of some action is true continuously,
    then that action would eventually be chosen for
    execution.
  • Notes
  • A program computation is a sequence of states
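
The program model and the notion of a fair computation can be sketched as a small interpreter. Actions are (guard, statement) pairs over a dictionary of variables, and a round-robin scheduler is used as one simple way to obtain fairness: an action whose guard stays true is chosen within a bounded number of steps. The scheduler and all names here are assumptions for illustration.

```python
def run(state, actions, max_steps=100):
    """Execute guarded actions fairly; return the sequence of states."""
    trace = [dict(state)]
    start = 0  # index from which the next enabled action is searched
    for _ in range(max_steps):
        n = len(actions)
        for offset in range(n):
            idx = (start + offset) % n
            guard, statement = actions[idx]
            if guard(state):
                state = statement(dict(state))
                trace.append(dict(state))
                start = idx + 1  # round-robin: continue after this action
                break
        else:
            break  # no guard is true: nothing left to execute
    return trace


# Example program: two actions, each incrementing its own variable up to 2.
ACTIONS = [
    (lambda s: s["x"] < 2, lambda s: {**s, "x": s["x"] + 1}),
    (lambda s: s["y"] < 2, lambda s: {**s, "y": s["y"] + 1}),
]

if __name__ == "__main__":
    for st in run({"x": 0, "y": 0}, ACTIONS):
        print(st)
```

When both guards are true the scheduler alternates between the two actions, so neither is starved.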

19
Specification
  • A specification is a set of sequences of states.
  • What does it mean for a program p to satisfy a
    specification sp from a set of states S?
  • every computation of p that starts from a state
    in S is in sp .

20
Examples of specifications
  • Let S be a predicate.
  • invariant
  • Invariant(S) = { seq : S is true in each state of
    seq }
  • A sequence seq is in invariant(S) iff S is true
    in each state in seq.
  • Closure
  • Closed(S) = { seq : (A i : i >= 0 :
    S is true in the ith state of seq
    => S is true in the (i+1)th state of seq) }
  • If S ever becomes true, it continues to be true.
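
Both specifications can be checked directly on a finite sequence of states. This is a sketch for finite traces only (the definitions above range over whole computations), with a predicate modelled as a Python function from state to bool; the function names are illustrative.

```python
def invariant(pred, seq):
    """True iff pred holds in every state of seq."""
    return all(pred(state) for state in seq)


def closed(pred, seq):
    """True iff, whenever pred holds in a state, it holds in the next one."""
    return all(pred(nxt) for cur, nxt in zip(seq, seq[1:]) if pred(cur))


if __name__ == "__main__":
    counter = [0, 1, 2, 3]
    print(invariant(lambda s: s >= 0, counter))
    print(closed(lambda s: s >= 2, counter))
```

A predicate can be closed without being invariant: in the counter trace, s >= 2 is false at first but never becomes false again once it holds.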

21
Examples of specifications (contd)
  • Let R and S be predicates.
  • leads-to
  • R leads-to S = { seq : (A i : i >= 0 :
    R is true in the ith state of seq
    => (E k : k >= i :
    S is true in the kth state of seq)) }
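
A finite-trace sketch of leads-to. It treats the given list as the entire computation, which is a simplifying assumption: on a mere prefix of a longer run, a pending R could still be answered later. The function name and the example states are hypothetical.

```python
def leads_to(r, s, seq):
    """True iff every state satisfying r is followed (at the same or a
    later position) by a state satisfying s."""
    return all(
        any(s(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if r(seq[i])
    )


if __name__ == "__main__":
    trace = ["idle", "req", "cs", "idle"]
    print(leads_to(lambda st: st == "req", lambda st: st == "cs", trace))
```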

22
Examples of specifications (contd)
  • Mutual Exclusion
  • invariant( (j <> k) => ~(cs.j /\ cs.k) )
  • (A j :: (req.j leads-to cs.j))
  • Leader Election
  • invariant( (j <> k) => ~(leader.j /\ leader.k) )
  • true leads-to (E j :: leader.j)
  • Load Balancing
  • true leads-to
    (A j,k :: |load.j - load.k| < bound)

23
Safety and Liveness
  • Safety specification
  • A sequence "does nothing bad"
  • No sequence has a bad prefix
  • Let sp be a specification.
  • sp is a safety specification
  • iff
  • (A s :: s not_element_of sp
    => (E a :: a is a prefix of s /\
    (A b :: ab not_element_of sp)))
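
The "bad prefix" idea can be illustrated with a monitor for the simplest kind of safety specification, one that forbids certain bad states. As soon as a bad state appears, the prefix ending there cannot be extended into an allowed sequence. The restriction to state-based properties and the function name are assumptions made for this sketch.

```python
from typing import Callable, Optional, Sequence, TypeVar

S = TypeVar("S")


def shortest_bad_prefix(bad: Callable[[S], bool],
                        seq: Sequence[S]) -> Optional[int]:
    """Length of the shortest prefix containing a bad state, else None."""
    for i, state in enumerate(seq):
        if bad(state):
            return i + 1
    return None


if __name__ == "__main__":
    # Each state is the set of processes currently in the critical section.
    run = [set(), {1}, {1, 2}, {2}]
    print(shortest_bad_prefix(lambda in_cs: len(in_cs) > 1, run))
```

Nothing that happens after the violating state can repair the sequence, which is what distinguishes safety from liveness.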

24
Liveness Specification
  • Liveness specification
  • A sequence does something good
  • Every finite prefix has a good extension
  • Let sp be a specification
  • sp is a liveness specification
  • iff
  • (A a :: (E b :: ab element_of sp))

25
Faults
  • A fault is an action that can change the program
    state
  • All faults
  • (be they crash, failstop, omission, corruption,
    timing, Byzantine, intruders, or ...)
  • can be thus viewed as perturbations on the
    system

26
Faults (contd)
  • A program computation in the presence of faults
    is a sequence of steps where
  • in each step either program action executes or
    fault action executes
  • the program actions are fairly executed
  • the fault occurrences are finite

27
Representation of Faults
  • Communication faults
  • Let c denote the sequence of messages on a
    channel.
  • Let m1 and m2 be messages, and let seqm be a
    sequence of messages.
  • Message Loss: c = <seqm, m1> --> c := <seqm>
  • Message Duplication: c = <seqm, m1> --> c :=
    <seqm, m1, m1>
  • Message Reorder: c = <seqm, m1, m2> --> c :=
    <seqm, m2, m1>
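
The three communication faults can be written as functions on a channel, modelled here as a Python list with the most recent message last. Each function leaves the channel unchanged when its guard does not hold (too few messages). The list representation and function names are assumptions for illustration.

```python
def lose(channel):
    """Message loss: <seqm, m1> becomes <seqm>."""
    if len(channel) >= 1:
        return channel[:-1]
    return list(channel)


def duplicate(channel):
    """Message duplication: <seqm, m1> becomes <seqm, m1, m1>."""
    if len(channel) >= 1:
        return channel + [channel[-1]]
    return list(channel)


def reorder(channel):
    """Message reorder: <seqm, m1, m2> becomes <seqm, m2, m1>."""
    if len(channel) >= 2:
        return channel[:-2] + [channel[-1], channel[-2]]
    return list(channel)
```

Writing faults as ordinary state-changing functions mirrors the point of the formalism: a fault is just another action on the program state.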

28
Representation of Faults (contd)
  • Amnesia/Transient faults.
  • Let c denote all the variables of a process.
  • true --> c := ??

29
Representation of Permanent Faults
  • Fail-stop fault
  • Upon fail-stop, a process does nothing
  • it does not execute any action and
  • it does not send any messages.
  • Introduce an auxiliary variable up.j at process j
  • Add up.j to the guard of each action of j
  • If processes can detect failure of other
    processes, then they can do so using variable up.

30
Representation of Permanent Faults
  • Byzantine Faults
  • Introduce an auxiliary variable b.j at process j
  • Add these actions as faults:
  • ~b.j --> b.j := true
  • b.j --> state.j := ??
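
One reading of the two Byzantine fault actions: the first marks a process as Byzantine, and the second lets a Byzantine process overwrite its state with an arbitrary value. The sketch below follows that interpretation; the single field named value, the finite domain, and the use of a random generator to pick the arbitrary value are all assumptions.

```python
import random


def become_byzantine(state):
    """Fault action: a non-Byzantine process becomes Byzantine."""
    if not state["b"]:
        return {**state, "b": True}
    return state


def corrupt(state, domain, rng):
    """Fault action: a Byzantine process sets its state arbitrarily."""
    if state["b"]:
        return {**state, "value": rng.choice(domain)}
    return state


if __name__ == "__main__":
    rng = random.Random(0)
    process = {"b": False, "value": 0}
    process = become_byzantine(process)
    print(corrupt(process, [1, 2, 3], rng))
```

A process whose b flag is false is untouched by the corrupting action, so the flag records exactly which processes may misbehave.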

31
Goal of Fault-tolerance Design
  • Starting from some initial states, S,
  • If the program executes alone then the original
    specification, sp, is satisfied
  • If the program executes in the presence of faults
    then the fault-tolerant specification, sp', is
    satisfied.
  • The fault-tolerance specification depends upon
    the type of the desired fault-tolerance, e.g.,
  • for masking: sp' = sp
  • for fail-safe: sp' = safety specification of sp

32
Representation of Permanent Faults
  • Fault-tolerant systems are rarely designed from
    scratch!!!
  • One needs to modify a fault-intolerant system to
    add fault-tolerance
  • Need to reuse the fault-intolerant program.
  • Fault-tolerant systems need to be modified to
    deal with new faults.
  • Need for incremental design
  • Need to perform several activities while
    developing fault-tolerant systems.
  • manual or automated design, testing,
    verification, synthesis, ...
  • desirable to have a unified framework that
    supports all of these activities.

33
Overall Design
34
Overall Design (contd)
  • Should separate concerns of functionality and
    fault-tolerance.
  • Should use components that are responsible for
    fault-tolerance alone.
  • Should provide structural continuity while
    performing these tasks.
  • Should be able to use the same components while
    performing the above tasks.