Title: Failure Detectors
1 Failure Detectors
- CS 717
- Ashish Motivala
- Dec 6th 2001
2 Some Relevant Papers
- Unreliable Failure Detectors for Reliable Distributed Systems. Tushar Deepak Chandra and Sam Toueg. Journal of the ACM.
- A Gossip-Style Failure Detection Service. R. van Renesse, Y. Minsky, and M. Hayden. Middleware '98.
- Scalable Weakly-consistent Infection-style Process Group Membership Protocol. Ashish Motivala, Abhinandan Das, Indranil Gupta. To be submitted to DSN 2002 tomorrow. http://www.cs.cornell.edu/gupta/swim
- On the Quality of Service of Failure Detectors. Wei Chen, Sam Toueg, and Marcos Aguilera (Cornell University). DSN 2000.
- Fail-Aware Failure Detectors. C. Fetzer and F. Cristian. Proceedings of the 15th Symposium on Reliable Distributed Systems.
3 Asynchronous vs Synchronous Model
- No value to assumptions about process speed
- Network can arbitrarily delay a message
- But we assume that messages are sequenced and retransmitted (an arbitrary number of times), so they eventually get through
- Failures in the asynchronous model?
- Usually limited to process crash faults
- If detectable, we call this fail-stop; but how to detect?
4 Asynchronous vs Synchronous Model
- The synchronous model, by contrast, adds timing assumptions:
- Assume that every process will run within bounded delay
- Assume that every link has bounded delay
- Usually described as synchronous rounds
5 Failures in Asynchronous and Synchronous Systems
- Asynchronous systems:
- Usually limited to process crash faults
- If detectable, we call this fail-stop; but how to detect?
- Synchronous systems:
- Can talk about message omission failures; failure to send is the usual approach
- But the network is assumed reliable (loss is charged to the sender)
- Process crash failures, as in the asynchronous setting
- Byzantine failures: arbitrary misbehavior by processes
6 Realistic???
- Asynchronous model is too weak, since processes have no clocks (real systems have clocks, and most timing meets expectations, but with heavy tails)
- Synchronous model is too strong (real systems lack a way to implement synchronized rounds)
- Partially Synchronous Model: an asynchronous network with a reliable channel
- Timed Asynchronous Model: time bounds on clock drift rates and message delays [Fetzer and Cristian]
7 Impossibility Results
- Consensus: all processes need to agree on a value
- FLP impossibility of consensus:
- A single faulty process can prevent consensus
- Realistic, because a slow process is indistinguishable from a crashed one
- Chandra/Toueg showed that the FLP impossibility applies to many problems, not just consensus
- In particular, they show that FLP applies to group membership and reliable multicast
- So these practical problems are impossible in asynchronous systems
- They also look at the weakest condition under which consensus can be solved
8 Byzantine Consensus
- Example: 3 processes (A, B, C), 1 is faulty
- Non-faulty processes A and B start with inputs 0 and 1, respectively
- They exchange messages; each now has a set of inputs {0, 1, x}, where x comes from C
- C sends 0 to A and 1 to B
- A has {0, 1, 0} and wants to pick 0. B has {0, 1, 1} and wants to pick 1.
- By definition, impossibility in this model means consensus can't always be reached
9 Chandra/Toueg Idea
- Theoretical idea: separate the problem into
- The consensus algorithm itself
- A failure detector: a form of oracle that announces suspected failures
- But the oracle can change its mind about a suspicion
- Question: what is the weakest oracle for which consensus is always solvable? (interface sketch below)
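To make the oracle idea concrete, here is a minimal sketch in Python of the interface such a detector might expose to a consensus layer. All names (FailureDetector, suspects, suspect, restore) are hypothetical; Chandra and Toueg define the oracle abstractly, not as an API.

  import threading

  class FailureDetector:
      """Hypothetical oracle in the Chandra/Toueg style: it outputs a set
      of currently suspected processes, and it is allowed to change its
      mind, i.e. a suspicion may later be retracted."""

      def __init__(self):
          self._suspects = set()
          self._lock = threading.Lock()

      def suspects(self):
          # Queried by the consensus layer; the contents may be wrong
          # and may differ between calls.
          with self._lock:
              return set(self._suspects)

      def suspect(self, pid):
          # Called by whatever local mechanism (e.g. timeouts)
          # implements the detector.
          with self._lock:
              self._suspects.add(pid)

      def restore(self, pid):
          # Retract a suspicion: the oracle changed its mind.
          with self._lock:
              self._suspects.discard(pid)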
10 Sample properties
- Completeness: detection of every crash
- Strong completeness: eventually, every process that crashes is permanently suspected by every correct process
- Weak completeness: eventually, every process that crashes is permanently suspected by some correct process
11 Sample properties
- Accuracy: does it make mistakes?
- Strong accuracy: no process is suspected before it crashes
- Weak accuracy: some correct process is never suspected
- Eventual strong/weak accuracy: there is a time after which strong/weak accuracy is satisfied (formal statements below)
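These properties can be stated precisely. A sketch in LaTeX; the notation suspected_q(t) for q's suspect list at time t is mine, not from the slides:

  \textbf{Strong completeness:}\quad \forall p \in \mathit{crashed}\ \forall q \in \mathit{correct}\ \exists t\ \forall t' \ge t:\ p \in \mathit{suspected}_q(t')

  \textbf{Weak completeness:}\quad \forall p \in \mathit{crashed}\ \exists q \in \mathit{correct}\ \exists t\ \forall t' \ge t:\ p \in \mathit{suspected}_q(t')

  \textbf{Strong accuracy:}\quad \forall p\ \forall q\ \forall t:\ p \in \mathit{suspected}_q(t) \Rightarrow p \text{ has crashed by time } t

  \textbf{Weak accuracy:}\quad \exists p \in \mathit{correct}\ \forall q \in \mathit{correct}\ \forall t:\ p \notin \mathit{suspected}_q(t)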
12 A sampling of failure detectors
- Combining the completeness and accuracy properties gives the standard Chandra/Toueg classes:

                   Strong accuracy   Weak accuracy   Eventual strong   Eventual weak
  Strong compl.    Perfect (P)       Strong (S)      ◇P                ◇S
  Weak compl.      Q                 Weak (W)        ◇Q                ◇W
13 Perfect Detector?
- Named Perfect, written P
- Strong completeness and strong accuracy
- Eventually detects all failures
- Never makes mistakes
14 Example of a failure detector
- The detector they call W: eventually weak
- More commonly written ◇W ("diamond W")
- Defined by two properties:
- There is a time after which every process that crashes is suspected by some correct process (weak completeness)
- There is a time after which some correct process is never suspected by any correct process (eventual weak accuracy)
- E.g., we can eventually agree upon a leader; if it crashes, we eventually and accurately detect the crash
15 ◇W: Weakest failure detector
- They show that ◇W is the weakest failure detector for which consensus is guaranteed to be achievable
- The algorithm is pretty simple:
- Rotate a token around a ring of processes
- A decision can occur once the token makes it around once without a change in failure-suspicion status at any process
- Subsequently, as the token is passed, each recipient learns the decision outcome (see the sketch below)
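A minimal single-threaded sketch of that token rule. This simplifies away real messages and crashes; Process, suspected(), and ring_decide are my names, and this is not Chandra and Toueg's full algorithm:

  import random

  class Process:
      def __init__(self, pid, value):
          self.pid = pid
          self.value = value

      def suspected(self):
          # Stand-in for this process's local (eventually weak) oracle
          # output; fixed here so the simulated run terminates.
          return frozenset()

  def ring_decide(processes):
      # The token carries a proposed value, the last suspicion snapshot,
      # and how many consecutive hops saw that snapshot unchanged.
      token = {"value": processes[0].value, "snapshot": None, "stable_hops": 0}
      i = 0
      while True:
          p = processes[i % len(processes)]
          view = p.suspected()
          if view == token["snapshot"]:
              token["stable_hops"] += 1
              if token["stable_hops"] >= len(processes):
                  return token["value"]    # full circuit with no status change
          else:
              token["snapshot"] = view     # status changed: start a new circuit
              token["stable_hops"] = 0
          i += 1

  procs = [Process(pid, random.choice([0, 1])) for pid in range(5)]
  print("decided:", ring_decide(procs))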
16 Building systems with ◇W
- Unfortunately, this failure detector is not implementable
- Yet it is the weakest failure detector that solves consensus
- Using timeouts, we can make mistakes at arbitrary times (see the timeout sketch below)
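What we can implement is a timeout-based detector that may suspect wrongly but can retract. A minimal sketch, assuming each process timestamps the last message heard from each peer; the names last_heard and TIMEOUT are mine:

  import time

  TIMEOUT = 2.0      # seconds; any fixed choice can be wrong at times

  last_heard = {}    # peer id -> time we last heard from it

  def on_message(peer):
      # Any traffic from a peer refreshes its liveness timestamp
      # (and implicitly retracts an earlier suspicion).
      last_heard[peer] = time.time()

  def suspects():
      # Peers silent for longer than TIMEOUT are suspected. A slow peer
      # or a slow network produces a false suspicion, which is exactly
      # the "mistake at an arbitrary time".
      now = time.time()
      return {p for p, t in last_heard.items() if now - t > TIMEOUT}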
17 Group Membership Service
[Figure: a process group (pi, pj, ...) over an asynchronous lossy network, with Join, Leave, and Failure events; X marks a failed process]
18 Data Dissemination using Epidemic Protocols
- Want efficiency, robustness, speed, and scale
- Tree distribution is efficient, but fragile and hard to configure
- Gossip is efficient and robust but has higher latency; network load is almost linear, and detection time scales as O(n log n) with the number of processes
19 State Monotonic Property
- A gossip message contains the state of the sender of the gossip
- The receiver uses a merge function to combine the received state with its own state
- Need some kind of monotonicity in the state and in the gossip (see the merge example below)
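For heartbeat state, a merge that takes the element-wise maximum is monotonic: merging never moves a heartbeat backwards. A minimal sketch; the dict-of-heartbeats representation is an assumption, anticipating the protocol a few slides down:

  def merge(mine, received):
      """Element-wise max of two {member: heartbeat} maps. Monotonic:
      the merged heartbeat is >= both inputs for every member, so
      repeated gossip can only move state forward."""
      merged = dict(mine)
      for member, beat in received.items():
          merged[member] = max(merged.get(member, beat), beat)
      return merged

  # Merging is order-insensitive and never regresses:
  a = {"p1": 7, "p2": 3}
  b = {"p1": 5, "p2": 9, "p3": 1}
  assert merge(a, b) == merge(b, a) == {"p1": 7, "p2": 9, "p3": 1}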
20 Simple Epidemic
- Assume a fixed population of size n
- For simplicity, assume homogeneous spreading
- Simple epidemic: anyone can infect anyone with equal probability
- Assume that k members are already infected
- And that the infection occurs in rounds
21 Probability of Infection
- What is the probability Pinfect(k,n) that a particular uninfected member is infected in a round, if k are already infected?
- Pinfect(k,n) = 1 - P(nobody infects that member)
- = 1 - (1 - 1/n)^k
- E(newly infected members) = (n - k) x Pinfect(k,n)
- Basically, it's a binomial distribution
22 2 Phases
- Intuition: 2 phases
- First half: 1 -> n/2 (Phase 1)
- Second half: n/2 -> n (Phase 2)
- For large n, Pinfect(n/2, n) ≈ 1 - (1/e)^0.5 ≈ 0.4 (verified numerically below)
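A quick numeric check of the formulas on the last two slides, in plain Python with no assumptions beyond the model above:

  import math

  def p_infect(k, n):
      # Probability an uninfected member is infected this round, given
      # k infected members each gossiping to one random target.
      return 1 - (1 - 1 / n) ** k

  n = 10_000
  print(p_infect(n // 2, n))         # ~0.3935, matching the slide's ~0.4
  print(1 - math.exp(-0.5))          # large-n limit: 1 - (1/e)^0.5

  # Expected-value iteration of rounds until full infection: O(log n).
  k, rounds = 1.0, 0
  while k < n - 1:
      k += (n - k) * p_infect(k, n)  # E[newly infected] = (n-k) * Pinfect
      rounds += 1
  print(rounds, "rounds for n =", n) # grows logarithmically in n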
23 Infection and Uninfection
- Infection
- Initial growth factor is very high: about 2
- At the halfway mark, it's about 1.4
- Exponential growth
- Uninfection
- Slow decline of the uninfected population to start
- At the halfway mark, it's about 0.4
- Exponential decline
24 Rounds
- The number of rounds necessary to infect the entire population is O(log n)
- Robbert [van Renesse] uses a base of 1.585 for his experiments
25 How the Protocol Works
- Each member maintains a list of (address, heartbeat) pairs
- Periodically, each member gossips:
- Increments its own heartbeat
- Sends (part of) its list to a randomly chosen member
- On receipt of gossip, merges the lists
- Each member maintains the last heartbeat of each list member (sketch below)
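A minimal sketch of one gossiping member in this style. Van Renesse et al.'s service has more machinery (e.g. declaring failure after a heartbeat goes stale); GossipMember and the send callback are assumptions:

  import random

  class GossipMember:
      def __init__(self, addr, peers):
          self.addr = addr
          self.peers = peers              # addresses of the other members
          self.heartbeats = {addr: 0}     # member -> highest heartbeat seen

      def gossip_once(self, send):
          # One gossip step: bump our own heartbeat, then push our whole
          # list to one randomly chosen member. `send` is an assumed
          # transport callback: send(dest_addr, heartbeat_map).
          self.heartbeats[self.addr] += 1
          send(random.choice(self.peers), dict(self.heartbeats))

      def on_gossip(self, received):
          # Merge with element-wise max (the monotonic merge above).
          for member, beat in received.items():
              self.heartbeats[member] = max(self.heartbeats.get(member, beat), beat)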
30 SWIM: Group Membership Service
[Figure: as on slide 17, a process group (pi, pj, ...) over an asynchronous lossy network, with Join, Leave, and Failure events]
31 System Design
- Join, Leave, Failure: broadcast to all processes
- Need to detect a process failure at some process quickly (to be able to broadcast it)
- Failure detector protocol specifications:
- Detection time and accuracy: specified by the application designer to SWIM
- Load: optimized by SWIM
32 SWIM Failure Detector Protocol
[Figure: one protocol period, T time units (protocol sketch below)]
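The diagram did not survive transcription. In the SWIM protocol, within each period of T time units, pi pings one random member pj; if no ack arrives, pi asks K other members to ping pj on its behalf before suspecting it. A minimal sketch, collapsing the ping-req relay round-trip into a direct call; `ping` and the member list are assumed plumbing:

  import random

  K = 3   # number of indirect-ping relays (SWIM's tunable parameter)

  def protocol_period(me, members, ping):
      # One failure-detector period at process `me`. `ping(src, dst)` is
      # an assumed transport call returning True iff an ack arrives
      # within the round-trip timeout; the real protocol sends a
      # ping-req message rather than calling the relay directly.
      target = random.choice([m for m in members if m != me])
      if ping(me, target):
          return None                    # alive: direct ack
      # Direct ping failed; try K indirect paths (assumes >= K+2 members)
      # to rule out a lossy me<->target path or a briefly slow target.
      relays = random.sample([m for m in members if m not in (me, target)], K)
      for relay in relays:
          if ping(relay, target):
              return None                # alive via some relay
      return target                      # no ack on any path: suspect it

Returning the suspected member (rather than None) is where the broadcast of slide 36 would hook in.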
33 Properties
- Expected detection time: e/(e-1) protocol periods
- Load: O(K) per process
- Inaccuracy probability: exponential in K
- Process failures detected:
- in O(log N) protocol periods w.h.p.
- in O(N) protocol periods deterministically (derivation sketch below)
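A sketch in LaTeX of where e/(e-1) comes from; this is my reconstruction of the slide's figure, not quoted from the paper. Each of the n-1 live members independently picks the failed member as its ping target with probability 1/(n-1) per period, so:

  \Pr[\text{some member pings the failed process in a period}]
    \;=\; 1 - \Bigl(1 - \tfrac{1}{n-1}\Bigr)^{n-1}
    \;\longrightarrow\; 1 - e^{-1} \quad (n \to \infty)

  % The first period in which this happens is geometric, hence
  \mathbb{E}[\text{detection time}] \;=\; \frac{1}{1 - e^{-1}} \;=\; \frac{e}{e-1} \;\approx\; 1.58 \ \text{protocol periods}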
34 Why not Heartbeating?
- Centralized: a single point of failure
- All-to-all: O(N) load per process
- Logical ring: unpredictable under multiple failures
35 LAN Scalability
[Figure: detection-time measurements on a Win2000, 100Base-T Ethernet LAN; protocol period = 3 x RTT, RTT = 10 ms, K = 1]
36 Deployment
- Broadcast a suspicion before declaring a process failure
- Piggyback broadcasts on ping messages (sketch below)
- Epidemic-style broadcast
- WAN issues:
- Load on core routers
- No representatives per subnet/domain
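A minimal sketch of the piggybacking idea, assuming each update is retransmitted on outgoing pings a bounded number of times; the buffer structure and the constants are my choices, in the spirit of SWIM's infection-style dissemination:

  import math

  class PiggybackBuffer:
      # Membership updates ride on outgoing pings instead of having
      # their own messages; resending each update O(log n) times is
      # what makes the dissemination epidemic-style.
      def __init__(self, n_members):
          self.max_sends = 3 * max(1, math.ceil(math.log2(max(2, n_members))))
          self.updates = {}     # update, e.g. ("failed", pid) -> times sent

      def add(self, update):
          self.updates[update] = 0

      def attach_to_ping(self, limit=4):
          # Prefer the least-disseminated updates; retire ones that
          # have been sent enough times.
          fresh = sorted(self.updates, key=self.updates.get)[:limit]
          for u in fresh:
              self.updates[u] += 1
              if self.updates[u] >= self.max_sends:
                  del self.updates[u]
          return fresh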