Fault Tolerant Computing

Slides: 35

Provided by: DrBetty3

Learn more at: http://www.cse.msu.edu

Transcript and Presenter's Notes

Title: Fault Tolerant Computing

1
Fault Tolerant Computing
2
Acknowledgements

The following lectures are based on materials
from the following sources
S. Kulkarni
J. Rushby
J. Knight

3
Objectives

Exposure to the area of critical systems
What it means to have a fault-tolerant system
Specification techniques for representing
critical properties
How to design fault tolerance into a system

4
Reliability and Recovery

Reliability
Probability that a system has not failed by time t,
given that it was operating properly at time 0.
Recovery
Process of restoring consistency after a failure

5
Dependability

Dependability
How much one may rely on the quality of services
delivered
Quality of service depends on
Correctness
Continuity of service

6
Terms

Failure: malfunction
Fault: condition that might lead to failure
Error: an incorrect response; indicates that a fault
is present
Faults may be
permanent
intermittent
transient

7
Terms (contd)

Graceful Degradation
system is operational, but degraded, after faults
Fail-safe
system execution is safe after the fault
Stabilizing
system recovers to a consistent state after the
fault
Masking
the user of the system does not see any
unintended behavior due to faults

8
Terms (contd)

Mean Time to Failure (MTTF)
expected value of system failure time
Mean Time to Repair (MTTR)
expected value of system repair time
Mean Time Between Failures (MTBF)
expected time between successive failures
MTBF = MTTF + MTTR
Fault Tolerance
ability to continue operation after occurrence of
faults
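
As a small numeric illustration of the relation MTBF = MTTF + MTTR, the sketch below estimates the three quantities from a log of observed durations. The function name and the sample numbers are hypothetical and chosen only for illustration; they are not from the lecture.

```python
from statistics import mean

def mtbf_from_log(uptimes, repair_times):
    """Estimate MTTF, MTTR and MTBF (all in hours) from observed durations."""
    mttf = mean(uptimes)        # average time the system ran before failing
    mttr = mean(repair_times)   # average time needed to repair it
    return mttf, mttr, mttf + mttr

# Hypothetical observations, in hours (assumed values, not from the slides).
uptimes = [100.0, 140.0, 120.0]
repair_times = [2.0, 4.0, 3.0]

mttf, mttr, mtbf = mtbf_from_log(uptimes, repair_times)
print(f"MTTF={mttf} h, MTTR={mttr} h, MTBF={mtbf} h")
```

With these sample values the estimate of MTBF is simply the sum of the two averages.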

9
Design Decisions

Fault detection
Fault confinement
Fault diagnosis
Repair and/or reconfigure
Redundancy
Hardware: extra hardware
Information: redundancy bits
Software: diagnosis software, extra software
Temporal: re-execute software to recover from
intermittent faults
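
Information redundancy can be illustrated with a single even-parity bit, one of the simplest forms of redundancy bits. This is a minimal sketch under the assumption of single-bit faults; the helper names add_parity and parity_ok are made up for this example.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Return True if the word still has an even number of 1s."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1                 # a single-bit fault flips one bit

print(parity_ok(word))       # True
print(parity_ok(corrupted))  # False
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them, and an even number of flips goes unnoticed.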

10
Safety vs Reliability

Reliability
concerns occurrence of failures
System failures defined in terms of system
services
Safety concerns occurrence of accidents
Unplanned events that result in death, injury,
illness, damage, loss of property or environmental
harm
Defined in terms of external consequences

11
Types of Faults

Omission failure
server omits to respond to an input (fail-silent
failure)
Timing failure
response is functionally correct, but untimely -
can be early timing failure or late timing
failure
(performance failure)
Response failure
incorrect response
if output value incorrect (value failure)
state transition incorrect (state transition
failure)

12
Types of Faults (contd)

Crash failure
if after a first omission, a server omits to
produce output until it restarts
Amnesia crash
server restarts in a predefined initial state
that does not depend on the inputs seen before
crash
Partial amnesia crash
some part of the state is the same as before the
crash; the rest is in a predefined initial state
Pause crash
server restarts in the state it had before the
crash
Halting crash
crashed server never restarts
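
The four crash types differ only in the state a server has when it restarts. The sketch below makes that concrete; the state layout (a counter and a log), the choice of which keys survive a partial amnesia crash, and all names are assumptions made for illustration.

```python
from copy import deepcopy

INITIAL = {"count": 0, "log": []}   # hypothetical predefined initial state

def restart(state_before_crash, mode, durable_keys=("log",)):
    """Return the state a server restarts in, for each crash type."""
    if mode == "amnesia":
        # Restart in the predefined initial state, ignoring past inputs.
        return deepcopy(INITIAL)
    if mode == "partial_amnesia":
        # Keep some part of the old state; reset the rest.
        state = deepcopy(INITIAL)
        for key in durable_keys:
            state[key] = deepcopy(state_before_crash[key])
        return state
    if mode == "pause":
        # Restart in exactly the state held before the crash.
        return deepcopy(state_before_crash)
    if mode == "halting":
        # A halted server never restarts, so there is no state to return.
        return None
    raise ValueError(f"unknown crash mode: {mode}")

before = {"count": 3, "log": ["a", "b", "c"]}
for mode in ("amnesia", "partial_amnesia", "pause", "halting"):
    print(mode, restart(before, mode))
```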

13
Examples

OS crash followed by reboot in initial state
Database server crash followed by recovery of a
database state that reflects all transactions
before the crash
Communication server occasionally loses messages
but does not delay messages (omission failure)
Excessive message transmission or message
processing delay (communication performance
failure)
Alteration of a message due to random noise
during transmission (response failure)

14
Hierarchical Failure Masking

A failure of a certain type at a lower level can
propagate as a different kind of failure at a
higher level abstraction.
Value Error at the physical layer (e.g., 2 bits
corrupted) propagates as omission error at data
link layer

15
Group Failure Masking

To ensure a service remains available to clients
despite server failure,
one can implement a group of redundant,
physically independent servers.
The group masks the failure of a member.
Hierarchical masking requires
users to implement resource failure-masking
attempts as exception handling code.
In group masking,
individual member failures are entirely hidden
from users by group management mechanisms.

16
Group Failure Masking (contd)

Group output is a function of outputs of
individual group members.
fastest member
distinguished member
result of majority vote
A group able to mask any k concurrent member
failures will be termed k-fault tolerant
e.g., a primary/standby group of k servers with
members ranked as primary, 1st backup, 2nd
backup, ..., can mask k-1 failures.
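
Two of the group output functions can be sketched as follows: a majority vote, and a primary/standby ranking in which the output of the highest-ranked live member is used. This is an illustrative sketch only; representing a crashed member by None and the function names are assumptions, not part of the lecture.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of members, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def primary_standby(outputs):
    """Return the output of the highest-ranked member that is still up.

    outputs is ordered primary, 1st backup, 2nd backup, ...;
    None marks a crashed member (an assumption of this sketch).
    """
    for out in outputs:
        if out is not None:
            return out
    return None

print(majority_vote([42, 42, 7]))            # one wrong value is outvoted
print(majority_vote([1, 2, 3]))              # no majority
print(primary_standby([None, None, "ok"]))   # 3 servers mask 2 crashes
```

In the last call a primary/standby group of 3 servers still delivers an output after 2 member crashes, matching the k-1 bound stated above.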

17
Some Formalism

Programs
A program consists of
a finite set of variables
a finite set of actions, each of the form
guard -> statement
where
guard is a boolean expression over program
variables, and
statement updates program variables
Modifications
guards may contain receives from channels
statements may contain sends/receives

18
Computation

A program computation is a "fair" sequence of
steps, where in each step an action whose guard
is true has its statement executed
In one step, multiple guards may be true.
If the guard of some action is continuously true,
then that action is eventually chosen for
execution.
Notes
A program computation is a sequence of states
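
A program of guarded actions and one of its fair computations can be simulated in a few lines. The sketch below is one possible encoding, not the lecture's: actions are (guard, statement) pairs of Python functions, and fairness is obtained by the assumed choice of visiting actions in a fixed round-robin order. The example program with variables x and y is hypothetical.

```python
def run(state, actions, max_rounds=100):
    """Execute guarded actions round-robin and return the sequence of states.

    Visiting the actions in a fixed cyclic order is one simple way to get
    fairness: an action whose guard stays true is reached within one round.
    """
    computation = [dict(state)]
    for _ in range(max_rounds):
        progressed = False
        for guard, statement in actions:
            if guard(state):
                statement(state)                 # one step: update variables
                computation.append(dict(state))  # record the new state
                progressed = True
        if not progressed:                       # no guard is true: stop
            break
    return computation

def inc_x(s):
    s["x"] += 1

def inc_y(s):
    s["y"] += 1

# Hypothetical program: x < 3 -> x := x + 1   and   y < 2 -> y := y + 1
actions = [
    (lambda s: s["x"] < 3, inc_x),
    (lambda s: s["y"] < 2, inc_y),
]
computation = run({"x": 0, "y": 0}, actions)
print(computation[-1])
```

The returned list is the computation viewed as a sequence of states, starting with the initial state.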

19
Specification

A specification is a set of sequences of states.
What does it mean for a program p to satisfy a
specification sp from a set of states S?
Every computation of p that starts from a state
in S is in sp.

20
Examples of specifications

Let S be a predicate.
Invariant
Invariant(S) = {seq : S is true in each state of seq}
A sequence seq is in Invariant(S) iff S is true
in each state in seq.
Closure
Closed(S) =
{seq : (A i : i >= 0 :
S is true in the ith state of seq
=>
S is true in the (i+1)th state of seq)}
If S ever becomes true, it continues to be true.

21
Examples of specifications (contd)

Let R and S be predicates.
leads-to
R leads-to S =
{seq : (A i : i >= 0 :
R is true in the ith state of seq
=>
(E k : k >= i :
S is true in the kth state of seq)
)}
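
The invariant, closure and leads-to specifications can be checked mechanically on a recorded sequence of states. The sketch below does so for finite sequences only, which is an approximation: it treats the recorded sequence as the complete computation, and it assumes that leads-to allows S to hold in the same state as R. Predicates are plain Python functions, and the sample sequence is hypothetical.

```python
def invariant(S, seq):
    """S is true in each state of seq."""
    return all(S(state) for state in seq)

def closed(S, seq):
    """Whenever S is true in state i, it is true in state i+1."""
    return all(not S(seq[i]) or S(seq[i + 1]) for i in range(len(seq) - 1))

def leads_to(R, S, seq):
    """Whenever R is true in state i, S is true in some state k >= i."""
    return all(
        any(S(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if R(seq[i])
    )

seq = [0, 1, 2, 3, 3]   # hypothetical sequence of states (one integer each)
print(invariant(lambda x: x >= 0, seq))                    # True
print(closed(lambda x: x >= 2, seq))                       # True
print(closed(lambda x: x == 1, seq))                       # False
print(leads_to(lambda x: x == 1, lambda x: x == 3, seq))   # True
```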

22
Examples of specifications (contd)

Mutual Exclusion
invariant( (j <> k) => ¬(cs.j /\ cs.k) )
(A j :: req.j leads-to cs.j)
Leader Election
invariant( (j <> k) => ¬(leader.j /\ leader.k) )
true leads-to (E j :: leader.j)
Load Balancing
true leads-to
(A j,k :: |load.j - load.k| < bound)

23
Safety and Liveness

Safety specification
"A sequence does nothing bad"
No sequence has a bad prefix
Let sp be a specification.
sp is a safety specification
iff
(A s :: s not_element_of sp
=>
(E a : a is a prefix of s :
(A b :: ab not_element_of sp)))

24
Liveness Specification

Liveness specification
A sequence does something good
Every finite prefix has a good extension
Let sp be a specification
sp is a liveness specification
iff
(A a : a is a finite sequence of states :
(E b :: ab element_of sp))

25
Faults

A fault is an action that can change the program
state
All faults
(be they crash, failstop, omission, corruption,
timing, Byzantine, intruders, or ...)
can be thus viewed as perturbations on the
system

26
Faults (contd)

A program computation in the presence of faults
is a sequence of steps where
in each step either program action executes or
fault action executes
the program actions are fairly executed
the fault occurrences are finite

27
Representation of Faults

Communication faults
Let c denote the sequence of messages on a
channel.
Let m1 and m2 be messages, and let seqm be a
sequence of messages.
Message Loss: c = <seqm, m1> -> c := <seqm>
Message Duplication: c = <seqm, m1> -> c := <seqm, m1, m1>
Message Reorder: c = <seqm, m1, m2> -> c := <seqm, m2, m1>
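
The three communication faults can be written as functions that perturb the channel contents. In this sketch the channel c is a Python list of messages, and each function leaves the channel unchanged when the guard of the fault action is false (too few messages); the function names are assumptions made for illustration.

```python
def message_loss(c):
    """c = <seqm, m1>  ->  c := <seqm>"""
    return c[:-1] if len(c) >= 1 else c

def message_duplication(c):
    """c = <seqm, m1>  ->  c := <seqm, m1, m1>"""
    return c + [c[-1]] if len(c) >= 1 else c

def message_reorder(c):
    """c = <seqm, m1, m2>  ->  c := <seqm, m2, m1>"""
    return c[:-2] + [c[-1], c[-2]] if len(c) >= 2 else c

c = ["a", "b", "c"]
print(message_loss(c))          # ['a', 'b']
print(message_duplication(c))   # ['a', 'b', 'c', 'c']
print(message_reorder(c))       # ['a', 'c', 'b']
```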

28
Representation of Faults (contd)

Amnesia/Transient faults
Let c denote all the variables of a process.
true -> c := ??

29
Representation of Permanent Faults

Fail-stop fault
Upon fail-stop, a process does nothing
it does not execute any action and
it does not send any messages.
Introduce an auxiliary variable up.j at process j
Add up.j to the guard of each action of j
If processes can detect failure of other
processes, then they can do so using variable up.
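
The fail-stop construction can be sketched with a small process class. The auxiliary variable up.j becomes the attribute up, and it is added to the guard of the only program action. The form of the fault action (setting up to false) is not spelled out on the slide and is an assumption here, as are the class, method and attribute names.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.up = True        # auxiliary variable up.j
        self.counter = 0
        self.outbox = []

    def step(self):
        """Program action with up.j added to its guard."""
        if self.up:
            self.counter += 1
            self.outbox.append((self.name, self.counter))
            return True
        return False          # guard is false: no action, no message

    def fail_stop(self):
        """Fault action (assumed form): up.j -> up.j := false."""
        if self.up:
            self.up = False

def detect_failed(processes):
    """Other processes may detect failures by reading the up variables."""
    return [p.name for p in processes if not p.up]

p, q = Process("p"), Process("q")
p.step()
p.fail_stop()
executed = p.step()           # after fail-stop, p does nothing
print(executed, p.counter, detect_failed([p, q]))
```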

30
Representation of Permanent Faults

Byzantine Faults
Introduce an auxiliary variable b.j at process j
Add these actions as faults:
¬b.j -> b.j := true
b.j -> state.j := ??

31
Goal of Fault-tolerance Design

Starting from some initial states, S,
If the program executes alone then the original
specification, sp, is satisfied
If the program executes in the presence of faults
then the fault-tolerant specification, sp', is
satisfied.
The fault-tolerance specification depends upon
the type of the desired fault-tolerance, e.g.,
for masking: sp' = sp
for fail-safe: sp' = safety specification of sp

32
Representation of Permanent Faults