Title: Fault Tolerance
1Fault Tolerance
2Fault tolerance terminology
- dependability - extent to which reliance can justifiably be placed on service.
- General concept
- reliability - continuity of service
- metric: mean time between failures (MTBF)
- availability - readiness for usage
- safety - avoidance of catastrophic effects on environment
- security - resistance to unauthorized access.
3Faults, errors, failures
- fault - component malfunction
- error - system state is wrong
- failure - system departs from specification
[Diagram relating fault, error, and failure]
4System
[Diagram: a System built from components, set in its Environment; labels: fault, failure]
5Coping with faults
- Reduce/eliminate faults in components.
- Fault tolerance
- Prevent faults from becoming failures
- usually through redundancy.
6Types of faults (fault models)
- Fault tolerance algorithms are dependent on fault models.
- Crash fault or stop fault - faulty component stops responding. No incorrect state changes in component.
- Timing fault - response is too early or late.
- Byzantine fault - arbitrary behavior. Can be considered adversarial (imagine worst case).
7The agreement problem
- Processors may fail
- so, use multiple processors
- but then, processors may disagree, causing failures.
- Need a principled approach to distributed agreement
8Example AFTI 16 (from J. Rushby)
- Advanced Fighter Technology Integration F16
- Triple-redundant digital flight-control system (DFCS) with analog backup
- DFCS design was asynchronous
- processors ran independently
- sample sensor, evaluate control law, send command to actuator
- actuator averages or selects from commands
- General Dynamics felt synchronization would introduce a single point of failure.
9AFTI 16 problems
- Processors can get widely varying sensor readings because of timing differences
- Reconfiguration can cause sudden changes in control (thumps).
- Need to allow wide range of plausible values before declaring a processor bad
- Bad sensor reading drags average down
- Sensor finally crosses threshold and is called bad
- average suddenly snaps back when sensor is excluded.
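The drag-then-snap effect can be sketched numerically. This is only an illustration, not the AFTI 16 voting logic: the three-channel mean, the exclusion threshold of 10 units, and the sample readings are all assumptions chosen for the example.

```python
from statistics import mean, median

THRESHOLD = 10.0  # assumed: largest deviation from the median still accepted

def select(readings):
    """Average the readings that lie within THRESHOLD of the median."""
    mid = median(readings)
    good = [r for r in readings if abs(r - mid) <= THRESHOLD]
    return mean(good)

# Hypothetical data: channel 3 drifts downward, channels 1 and 2 stay at 100.
drift = [100.0, 97.0, 94.0, 91.0, 88.0]
outputs = [select([100.0, 100.0, bad]) for bad in drift]
print(outputs)  # [100.0, 99.0, 98.0, 97.0, 100.0]
```

The output sinks slowly while the drifting channel is still accepted, then jumps back to 100 in a single step once that channel is excluded. A wider threshold delays the exclusion and makes the jump larger.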
10AFTI 16 problems (cont)
- Processor states can diverge rapidly
- especially when different processors go into different control modes.
- Design complexity
- 70% of application code was for redundancy management
- Control laws had to be modified to ramp changes in and out smoothly
11AFTI 16 flight test, Flight 36
- Departure from control laws for 3 seconds
- acceleration exceeded -4g, then 7g
- Angle of attack went to -10 degrees, then 20 degrees
- Aircraft rolled 360 degrees
- Cause: side air probe cut out at high angle of attack
- Analysis showed this would cause complete failure of DFCS for several areas of flight envelope
12AFTI 16 flight 44
- Each channel declared the others failed
- asynchronous operation, timing skew, sensor noise
- analog backup not selected
- simultaneous failure of two channels not anticipated
- Aircraft flown home on a single digital channel (not designed for this)
- There were no hardware failures.
13AFTI 16 Analysis (NASA)
- Nearly all failure indications were design oversights related to asynchronous operation
- Failures due to lack of understanding of interactions among
- Air data system
- redundancy management software
- flight control laws (decision points, thumps, ramp-in/out)
- Moral of the story: Reliability through redundancy is a lot harder than it looks.
14Distributed consensus
- Goal: multiple processors agree on something in the presence of various kinds of faults and errors
- Intellectually difficult
- Algorithms are tricky
- Proofs are subtle
- Sensitive to assumptions
- Synchronous vs. asynchronous
- Communication mechanism
- Fault models
- Many papers written
15Synchronous vs. asynchronous
- Synchronous: Processors run in lock-step
- Hard to implement - model may be unrealistic
- Requires clock synchronization.
- Consensus is easier
- Asynchronous: Processors run at arbitrary speed
- Easier to implement - model is conservative
- In most models, consensus problem is provably unsolvable.
16Synchronous vs. asynchronous
- Semi-synchronous
- Bounds on how far out-of-sync processors can get
- Model is fairly realistic
- Consensus is almost as easy as synchronous
17Fault models
- Goal: Make claims such as "the system will continue to function if any single processor stops."
- More conservative fault models
- Fault tolerance is harder
- But, if successful, stronger claims can be made
- Fewer assumptions: simpler FMEA, easier certification
- A lot of models have been proposed.
18Process fault models
- Stopping fault - process stops sending messages
- does not restart
- does not send wrong messages
- liberal (easy) model
- Byzantine fault - process behaves arbitrarily
- Name comes from cute Byzantine generals metaphor
- May send arbitrary messages, enter arbitrary states
- Equivalent to evil behavior, for our purposes
19Synchronous agreement with stopping faults
- Multiple processes want to agree on a value
- Applications
- sensor readings among redundant processors
- decide what time it is
- decide which of a group of processors are broken
and should be removed from system.
20Synchronous agreement - properties
- Each process starts with an initial value, processes end with a decision value.
- Agreement: all good processes decide on the same value.
- Validity: if all processes start with the same value, that value is the final decision value.
- Termination: All good processes eventually decide.
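The agreement and validity properties can be phrased as checks over a finished run. The sketch below is illustrative only: the function names and the dictionary representation (process name to value) are assumptions, and `decisions` is taken to contain only the good processes. Termination is not checked by code here; on a finished run it amounts to every good process appearing in `decisions`.

```python
def agreement(decisions):
    """True if all good processes decided on the same value."""
    return len(set(decisions.values())) <= 1

def validity(initial, decisions):
    """True if a unanimous initial value is also the decision value."""
    starts = set(initial.values())
    if len(starts) != 1:
        return True  # the property only constrains unanimous starts
    (v,) = starts
    return all(d == v for d in decisions.values())

# Hypothetical run: p3 stopped, so it has no decision.
initial = {"p1": 1, "p2": 1, "p3": 1}
decisions = {"p1": 1, "p2": 1}
print(agreement(decisions), validity(initial, decisions))  # True True
```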
21Flood set algorithm
- Assumption: There is a dedicated link between each pair of processes
- No more than f processes can stop
- Each process has an initial value v
- Each process accumulates a set W of all the values it has ever seen.
- On each round, every process sends its W set to every other process
- Every process sets W to the union of the old value and all the new values coming in from others.
22Flood set
- After f+1 rounds, every process looks at W.
- If W has only one value, choose that value.
- Else, choose 0 (a predetermined default).
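A minimal simulation of the flood set rule, assuming lock-step rounds and running f+1 rounds (the count used in the correctness argument). The `crashes` argument is a hypothetical way to script stopping faults: it maps a process to the round in which it stops and the recipients it still reaches in that round.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial: {process: initial value}
    crashes: {process: (round, recipients reached before stopping)}
    Returns {process: decision} for the processes that never stopped.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    stopped = set()
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in initial}
        for sender in initial:
            if sender in stopped:
                continue
            targets = set(initial) - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                targets &= set(crashes[sender][1])  # partial send, then stop
                stopped.add(sender)
            for t in targets:
                inbox[t] |= W[sender]
        for p in initial:
            if p not in stopped:
                W[p] |= inbox[p]  # union of old W and everything received
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default)
            for p in initial if p not in stopped}

# Hypothetical scenario: p1 stops in round 1 after reaching only p2.
initial = {"p1": 0, "p2": 1, "p3": 1}
crashes = {"p1": (1, ["p2"])}
print(flood_set(initial, 1, crashes))  # {'p2': 0, 'p3': 0}
print(flood_set(initial, 0, crashes))  # {'p2': 0, 'p3': 1}
```

The second call runs only one round even though a process stops, which breaks the assumption that at most f processes stop. p2 has seen both values and takes the default, while p3 has seen only 1, so the two disagree. The extra round lets p2 pass the missing value on to p3.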
23Flood set correctness
- In f+1 rounds, there must be at least one round in which no processes stop
- At most f processes can stop, and processes cannot stop more than once.
- If no process stops in round r, W will be the same in all good processes in subsequent rounds.
- All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.
- After this, nothing can get added to any W sets, so it doesn't matter whether more processes stop.
24Flood set correctness
- So, after f+1 rounds, all non-stopped processes have same W sets
- If W has only one value, all processes pick this value.
- Else all processes pick the default, 0.
25Flood set example
- 3 processes, 1 fault, default value 0
[Table: for each process, W in round 0, W in round 1, W in round 2, and the final decision]
26Flood set efficiency
- O((f+1) n^2) messages
- f+1 rounds
- n processes send n messages per round
- O((f+1) n^3) values are sent (each message may have a set of up to n values)
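The bounds can be made concrete with a small calculation. The function below is a hypothetical helper, and it counts n-1 recipients per sender instead of the rounded n used in the O-notation.

```python
def flood_set_cost(n, f):
    """Rounds, messages, and an upper bound on values sent by flood set."""
    rounds = f + 1
    messages = rounds * n * (n - 1)  # each process sends to the n-1 others
    values_upper = messages * n      # each message carries at most n values
    return rounds, messages, values_upper

print(flood_set_cost(4, 1))  # (2, 24, 96)
```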
27Optimized flood set
- Note: If W has more than one element, process doesn't need to know what is in it.
- Idea: Every process sends only first two distinct values.
- Every process sends its initial value on first round
- If process receives a different value, it sends it out on next round
- Correctness proof: run Flood and OptFlood in parallel
- same initial values, stopping pattern
- W sets have more than one value iff OptFlood process gets two values.
28OptFlood efficiency
- 2 n^2 messages
- n processes send at most two messages to n other processes.
- O(n^2) values are sent
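A sketch of the optimized variant under the same assumptions as a plain round-based simulation: each process broadcasts its initial value once and, at most once more, the first different value it learns. The `crashes` format (process mapped to stopping round and recipients still reached) and the returned message counter are illustrative additions.

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Simulate OptFlood for f+1 rounds; returns (decisions, messages sent)."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}    # at most two distinct values
    outbox = {p: [v] for p, v in initial.items()}  # what to broadcast next round
    stopped, messages = set(), 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in initial}
        for sender in initial:
            if sender in stopped:
                continue
            targets = set(initial) - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                targets &= set(crashes[sender][1])  # partial send, then stop
                stopped.add(sender)
            if outbox[sender]:
                for t in targets:
                    inbox[t].extend(outbox[sender])
                    messages += 1
        for p in initial:
            outbox[p] = []
            if p in stopped:
                continue
            for v in inbox[p]:
                if v not in seen[p] and len(seen[p]) < 2:
                    seen[p].append(v)
                    outbox[p] = [v]  # forward the second value once
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default)
                 for p in initial if p not in stopped}
    return decisions, messages

# Hypothetical scenario: p1 stops in round 1 after reaching only p2.
decisions, messages = opt_flood({"p1": 0, "p2": 1, "p3": 1}, 1,
                                {"p1": (1, ["p2"])})
print(decisions, messages)  # {'p2': 0, 'p3': 0} 7
```

In this scenario seven messages are sent, each carrying one value, which is within the 2 n^2 = 18 bound for n = 3.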
29Byzantine agreement
- Goal: non-faulty processes should agree on a value.
- e.g., message received
- e.g., sensor value
- Faults may cause arbitrary behavior
- arbitrary values communicated
- different values communicated to different receivers
- Advantage: reduces fault analysis
- Disadvantage: hard or impossible to do.
30Byzantine agreement properties
- Agreement: All good processes agree on a value
- Validity: If source of value was non-faulty, the agreed-upon value is the value the source sent.
31Asynchronous agreement
- Asynchronous model
- Message transmission takes arbitrary time.
- Processes run at arbitrary speeds.
- Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
- Fine print: Details of conditions, communication
- This is one of the most important results about
distributed systems.
32Synchronous agreement
- Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.
- The agreement problem is solvable in this model.
- Theorem: Tolerating k Byzantine faults requires > 3k processes.
- So: Triple modular redundancy can't handle Byzantine faults.
- Practical case: 1 Byzantine fault, 4 processes.
- Assumes full connectivity (connections between each pair of processors).
33Synchronous agreement with one fault
- Single transmitter communicates value to all processes.
- Round 0: Transmitter sends value to n-1 receivers.
- Values are sent correctly if transmitter is not faulty.
- Round 1: Each receiver sends value to n-2 other receivers.
- Receivers record all values separately.
- Intuition: receivers compare notes on what transmitter told them.
- Each receiver chooses majority value of all values it received.
- If no majority, use pre-arranged default value.
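The two rounds can be sketched as follows for one transmitter and three receivers. The `relay` callback is a hypothetical device for scripting a faulty receiver: a good receiver passes on what it was told unchanged, while a faulty one may return anything. The process names and values are made up for the example.

```python
from collections import Counter

def majority(values, default):
    """Strict majority of values, or default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(receivers, round0, relay, default=0):
    """round0[r]: value the transmitter sent to receiver r in round 0.
    relay(sender, target, value): what `sender` tells `target` in round 1.
    Returns each receiver's decision."""
    decisions = {}
    for r in receivers:
        heard = [round0[r]]  # directly from the transmitter
        heard += [relay(s, r, round0[s]) for s in receivers if s != r]
        decisions[r] = majority(heard, default)
    return decisions

receivers = ["p1", "p2", "p3"]
honest = lambda sender, target, value: value

# Faulty transmitter: a different value to each receiver, relays are honest.
print(one_fault_agreement(receivers, {"p1": 1, "p2": 2, "p3": 3}, honest))
# {'p1': 0, 'p2': 0, 'p3': 0}

# Faulty receiver p1: the transmitter is good, p1 relays a bogus value.
liar = lambda sender, target, value: 9 if sender == "p1" else value
print(one_fault_agreement(receivers, {"p1": 1, "p2": 1, "p3": 1}, liar))
# {'p1': 1, 'p2': 1, 'p3': 1}
```

In the first run every receiver ends up with the same three values, finds no majority, and falls back to the default, so the good receivers agree even though the transmitter lied. In the second run the two good receivers outvote the bogus relay; p1's own entry is not required to be correct.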
34Example 1- faulty transmitter
Round 1: receivers exchange values (reliably)
35Example 2- faulty transmitter
Round 1: receivers exchange values (reliably)
36Example 3- faulty receiver
Process 1 is broken, so result is not required
to be correct
Process 1 sends bogus values
37General case
- Previous algorithm can be generalized to handle more Byzantine faults.
- General results: k faults require k+1 (k?) rounds, 3k+1 processors
- Number of messages grows exponentially with number of rounds
- Intuition: "pn said that pn-1 said that ... p1 said that p0 said that the value was x"
- There are exponentially many chains pn ... p0.
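To get a feel for the growth, the sketch below counts messages under an assumed relay scheme: in every round after the first transmission, each process that heard a chain forwards it to all processes not yet in that chain. The slides do not give an exact count, so the formula and the function name are illustrative.

```python
def relay_messages(n, k):
    """Messages sent when the transmitter's value is relayed for k further
    rounds among n processes, with no process repeated in a chain."""
    total, chains = 0, 1
    for depth in range(k + 1):
        chains *= n - 1 - depth  # extend every chain by one new process
        total += chains
    return total

print(relay_messages(4, 1))   # 9
print(relay_messages(7, 2))   # 156
print(relay_messages(10, 3))  # 3609
```

For n = 4 and one relay round this gives 3 messages in round 0 plus 3 x 2 in round 1. Each additional fault tolerated at n = 3k+1 adds both processes and a round, and the count rises steeply.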
38Hybrid Byzantine agreement
- Idea: Free bonus reliability with the purchase of Byzantine agreement.
- Handles Byzantine faults, plus some simpler faults
- Symmetric fault: process sends same wrong value to everyone.
- Nonmalicious fault: process sends a recognizable error value.
- Advantages
- If processors have these faults, we can tolerate more faulty processors
- These faults are more probable than true Byzantine faults
- so this increases reliability
39Hybrid Byzantine agreement
- Modify previous algorithm by adding special error value E.
- Nonmalicious faults send E value (other faults may send E, also).
- Majority algorithm first removes E values.
- Theorem: Algorithm reaches agreement if
- n > 2a + 2s + b + r
- a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
- Previous case: a=1, s=0, b=0, r=1, so n > 3
- With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults.
- or 1 Byzantine and 1 symmetric
- ... but just 1 Byzantine in previous algorithm
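The inequality is easy to check mechanically. The helper below is a hypothetical convenience that evaluates only the condition as stated on the slide, n > 2a + 2s + b + r, and reproduces the slide's own cases.

```python
def tolerates(n, a=0, s=0, b=0, r=1):
    """Evaluate n > 2a + 2s + b + r.
    a: Byzantine, s: symmetric, b: nonmalicious faults; r: rounds
    (excluding the first transmission)."""
    return n > 2 * a + 2 * s + b + r

print(tolerates(4, a=1))       # True: previous case, n > 3
print(tolerates(3, a=1))       # False: three processes are not enough
print(tolerates(6, a=1, b=2))  # True: 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1))  # True: 1 Byzantine and 1 symmetric
```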
40Variations
- Synchronous communication is difficult
- Compromise between synchronous and asynchronous: real-time constraints.
- Authentication - agreement can be made less costly by using digital signatures
- transmitter digitally signs messages
- processes can't lie about who said what.
- can handle any number of faults (in synchronous model).
- May assume different network connectivity
- Some links in network missing
41Summary
- Fault tolerance is tricky. Redundancy does not necessarily buy reliability.
- Byzantine models can account for unforeseen fault types.
- Byzantine agreement is impossible in some models.
- There exist practical algorithms for Byzantine agreement if synchronous communication is available.
- There are deep theoretical results in this area.