Title: Membership
1 Membership
CS525 Presentation
2 A Gossip-Style Failure Detection Service
- R. van Renesse
- Y. Minsky
- M. Hayden
3 Outline
- Motivation
- System Model
- Mechanism of the Algorithm
- Parameters to tune the Algorithm
- Analysis
- Multi-level Gossiping
- Discussion
- Conclusion
4 Motivation
- Why do we need failure detection?
- System management
- Replication
- Load balancing
- When we detect a failure in the system, we can:
- Stop using the failed component
- Move its responsibility elsewhere
- Make another copy
5 Motivation
(Figure: a group of nodes, e.g. 2, 6, 8, 10, 16, 17, and 22, sharing the file reviews.doc.)
6 System Model
- Fail-stop model (not Byzantine faults)
- Minimal assumptions about the network
- Some messages may be lost
- Most messages are delivered within a predetermined, reasonable time
- Goal: detect failures at each host
- Completeness
- Accuracy
- Speed
7 Characterization Approach
- Strong/weak COMPLETENESS
- A failure of any member is detected by all/some non-faulty members
- STRONG ACCURACY
- No mistakes
- Cannot achieve both
- This service guarantees completeness always, and accuracy with high probability
8 Requirements
- COMPLETENESS
- SPEED
- Every failure is detected by some non-faulty member within T time units
- ACCURACY
- The probability that a non-faulty member (not yet detected as failed) is detected as faulty by any other non-faulty member should be less than Pmistake
9 Outline
- Motivation
- System Model
- Mechanism of the Algorithm
- Parameters to tune the Algorithm
- Analysis
- Multi-level Gossiping
- Discussion
- Conclusion
10 Mechanism of the Algorithm
Each member maintains a list with one entry per known member. Example at node 1 of a four-node group:
Address  Heartbeat Counter  Time
1        10120              66
2        10103              62
3        10098              63
4        10111              65
Each member periodically gossips this list to the others; when a node receives a gossiped list, it merges it with its own list.
11 Mechanism of the Algorithm
Example of a merge: node 1 gossips its list to node 2, whose local clock reads 70.
Node 2's list before the merge (Address, Heartbeat Counter, Time):
1  10118  64
2  10110  64
3  10090  58
4  10111  65
The list gossiped by node 1:
1  10120  66
2  10103  62
3  10098  63
4  10111  65
Node 2's list after the merge: each entry keeps the higher heartbeat counter, and entries whose heartbeat increased are stamped with the current time 70 (a code sketch of this rule follows below):
1  10120  70
2  10110  64
3  10098  70
4  10111  65
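A minimal sketch of this merge rule in Python (the dict layout, function name, and tuple fields are illustrative, not taken from the paper):

def merge_gossip(local_list, received_list, now):
    """Merge a received gossip list into the local membership list.

    For each member, keep the higher heartbeat counter; whenever an entry's
    heartbeat increases, refresh its local timestamp to the current time.
    """
    for addr, (heartbeat, _) in received_list.items():
        local_hb, _ = local_list.get(addr, (-1, now))
        if heartbeat > local_hb:
            local_list[addr] = (heartbeat, now)   # heartbeat advanced: reset the timer
    return local_list

# The example from the slide: node 2 merges node 1's list at local time 70.
node2 = {1: (10118, 64), 2: (10110, 64), 3: (10090, 58), 4: (10111, 65)}
node1 = {1: (10120, 66), 2: (10103, 62), 3: (10098, 63), 4: (10111, 65)}
print(merge_gossip(node2, node1, now=70))
# {1: (10120, 70), 2: (10110, 64), 3: (10098, 70), 4: (10111, 65)}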
12 Mechanism of the Algorithm
- If a member's heartbeat counter has not increased for more than Tfail seconds, the member is considered failed
- After Tcleanup seconds, the member's entry is deleted from the list
13 Mechanism of the Algorithm
- Why not delete a host's entry right after Tfail seconds?
Example, with the current time at node 2 being 75 (a sketch of the corresponding cleanup logic follows below):
- Node 2's list: 1 10120 66, 2 10110 64, 3 10098 50, 4 10111 65 (member 3's heartbeat is stale)
- Node 1's list: 1 10120 66, 2 10103 62, 3 10098 55, 4 10111 65 (it also still carries member 3's stale entry)
- If node 2 deleted member 3 as soon as Tfail expired, its list would become: 1 10120 66, 2 10110 64, 4 10111 65
- A later gossip from node 1 would then re-insert the stale entry with a fresh timestamp: 1 10120 66, 2 10110 64, 3 10098 75, 4 10111 65
- Keeping the failed entry until Tcleanup = 2 Tfail gives the other members time to detect the failure themselves, so the stale entry is not resurrected
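A minimal sketch of the periodic check implied by slides 12 and 13, assuming the same (heartbeat, last-update-time) entries as in the merge sketch above:

def sweep_membership(local_list, now, t_fail, t_cleanup):
    """Periodic sweep of the membership list (illustrative sketch).

    A member whose heartbeat has not increased for t_fail seconds is
    considered failed, but its entry is only deleted after t_cleanup
    (= 2 * t_fail) seconds, so that stale copies gossiped by other
    nodes cannot re-insert an already-removed, failed member.
    """
    failed = set()
    for addr, (heartbeat, last_update) in list(local_list.items()):
        age = now - last_update
        if age >= t_cleanup:
            del local_list[addr]      # safe to forget the member entirely
        elif age >= t_fail:
            failed.add(addr)          # treat as failed, but keep the entry
    return failed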
14 Outline
- Motivation
- System Model
- Mechanism of the Algorithm
- Parameters to tune the Algorithm
- Analysis
- Multi-level Gossiping
- Discussion
- Conclusion
15 Parameters to tune the Algorithm
- Given Pmistake, we tune:
- Tfail
- Tcleanup
- The rate of gossiping, Tgossip
- Choose Tfail so that the probability of an erroneous failure detection does not exceed Pmistake
- Choose Tcleanup to be twice Tfail
- Choose the rate of gossiping according to the available network bandwidth
16 Parameters to tune the Algorithm
(Timeline: a member fails at time t, is detected as failed at t + Tfail, and is removed from the list at t + Tcleanup = t + 2 Tfail.)
17 Analysis
- Simplified analysis
- In a round, only one member can gossip to another member, with success probability Parrival
- Using an iterative (dynamic) method, we obtain the number of rounds required to achieve a given quality of detection
- As the number of rounds r increases, Pmistake(r) decreases
- Detection time: multiply the first round r at which Pmistake(r) ≤ Pmistake by Tgossip
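In symbols (following the last bullet above):

\[
  T_{\text{detect}} \;=\; T_{\text{gossip}} \cdot
  \min\{\, r : P_{\text{mistake}}(r) \le P_{\text{mistake}} \,\}
\]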
18 Analysis
- As the number of members increases, the detection time increases
19 Analysis
- As the requirement is loosened, the detection time decreases
20 Analysis
- As the number of failed members increases, the detection time increases significantly
21 Analysis
- The algorithm is resilient to message loss
22 Outline
- Motivation
- System Model
- Mechanism of the Algorithm
- Parameters to tune the Algorithm
- Analysis
- Multi-level Gossiping
- Discussion
- Conclusion
23 Multi-level Gossiping
- In subnet i, a gossip crosses the subnet boundary with probability 1/ni (ni being the number of members in subnet i)
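A hedged sketch of how a gossip target might be chosen under this rule; it assumes a simple two-level topology in which a member otherwise gossips within its own subnet (the paper's full multi-level scheme also spans domains), and the data layout is mine:

import random

def pick_gossip_target(my_subnet, me, subnets):
    """Pick a gossip target (illustrative two-level sketch).

    subnets: dict {subnet_id: [member addresses]}.
    With probability 1/n_i (n_i = size of my subnet), gossip to a member
    of another subnet; otherwise gossip within the local subnet.
    """
    peers = [m for m in subnets[my_subnet] if m != me]
    n_i = len(subnets[my_subnet])
    go_outside = len(subnets) > 1 and (not peers or random.random() < 1.0 / n_i)
    if go_outside:
        other = random.choice([s for s in subnets if s != my_subnet])
        return random.choice(subnets[other])
    return random.choice(peers)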
24 Discussion
- We might be able to use other gossip methods to improve performance
- A hybrid of pull and push?
- Sending only new content?
- The protocol consumes a lot of network resources
25 Conclusion
- A failure detection service based on a gossip protocol
- Accurate with a known probability
- Resilient against message loss
- A simple analysis of the algorithm
- Multi-level gossiping
26 On Scalable and Efficient Distributed Failure Detectors
- I. Gupta
- T. D. Chandra
- G. S. Goldszmidt
27 Outline
- Motivation & Previous Work
- Problem Statement
- Worst-case Network Load
- Randomized Distributed Failure Detector Protocol
- Analysis and Experimental Result
- Discussion
- Conclusion
28 Motivation & Previous Work
- Most distributed applications rely on failure detector algorithms to circumvent the impossibility result for consensus in asynchronous systems
- Heartbeating algorithms are not as efficient and scalable as claimed
- Previous analysis models did not consider scalability
29 Characterization Approach
- Strong/weak COMPLETENESS
- A failure of any member is detected by all/some non-faulty members
- STRONG ACCURACY
- No mistakes
- Cannot achieve both
- Guarantee completeness always, and accuracy with high probability
30 Requirements
- COMPLETENESS
- SPEED
- Every failure is detected by some non-faulty member within T time units
- ACCURACY
- The probability that a non-faulty member (not yet detected as failed) is detected as faulty by any other non-faulty member should be less than PM(T)
31 System Model
- A large group of n members
- Each member knows every other member
- Crash (non-Byzantine) failures
- Message loss probability pml
- Member failure probability pf
- qml = 1 - pml
- qf = 1 - pf
32 Outline
- Motivation & Previous Work
- Problem Statement
- Worst-case Network Load
- Randomized Distributed Failure Detector Protocol
- Analysis and Experimental Result
- Discussion
- Conclusion
33 Worst-case Network Load
Definition: the worst-case network load L of a failure detector protocol is the maximum number of messages transmitted by any run of the protocol within any time interval of length T, divided by T (formalized below).
- SCALE
- The worst-case network load L should be as close to the optimal worst-case network load as possible
- Equal expected load per member
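Written as a formula (the notation is mine; msgs(R, t, t+T) counts the messages that run R transmits in the interval [t, t+T]):

\[
  L \;=\; \max_{R}\,\max_{t}\; \frac{\mathrm{msgs}(R,\,t,\,t+T)}{T}
\]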
34 Optimal Worst-case Network Load
- Any distributed failure detector algorithm imposes a minimal worst-case network load L*, for
- A group of size n
- Satisfying COMPLETENESS, SPEED, and ACCURACY
- Given values of T and PM(T)
35 Optimal Worst-case Network Load
- Consider a group member that, at a random point in time t, is not yet detected as failed and stays non-faulty until at least time t + T
- Suppose it sends m messages during that interval of length T
- If all m messages are dropped, the SPEED requirement forces the member to be (falsely) detected as failed
- ACCURACY requires the probability of this event to be less than PM(T), which bounds m from below (see the derivation sketch after this list)
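Writing out the slide's argument, with pml the message loss probability:

\[
  p_{ml}^{\,m} \;\le\; P_M(T)
  \quad\Longrightarrow\quad
  m \;\ge\; \frac{\log P_M(T)}{\log p_{ml}}
\]

(both logarithms are negative, so the inequality flips when dividing); this per-member message count over each interval of length T is what yields the minimal worst-case network load L*.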
36 Worst-case Network Load
- Distributed heartbeating
- Gossip-style failure detection
-> They are not scalable
37 Outline
- Motivation & Previous Work
- Problem Statement
- Worst-case Network Load
- Randomized Distributed Failure Detector Protocol
- Analysis and Experimental Result
- Discussion
- Conclusion
38 Randomized Distributed Failure Detector Protocol
- Relaxes the SPEED condition: a failure is detected within an expected time bound of T time units
- COMPLETENESS with probability 1
- ACCURACY with probability (1 - PM(T))
- The ratio of the worst-case network load to the optimal is independent of group size
39 Randomized Distributed Failure Detector Protocol
(Protocol period at member Mi, probing member Mj; the figure shows k = 3 other members.)
- In every protocol period, Mi selects a random member Mj and sends it a ping(Mi, Mj, pr)
- Mj replies with an ack(Mi, Mj, pr)
- If no ack arrives in time, Mi selects k members at random, sends each of them a ping-req(Mi, Mj, pr), and waits for an ack(Mi, Mj, pr)
- Each selected member Mk then sends a ping(Mi, Mj, Mk, pr) to Mj and relays Mj's ack back to Mi
40 Randomized Distributed Failure Detector Protocol
- If no ack has been received by the end of the protocol period T, Mi declares Mj as failed (a sketch of one period follows below)
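A minimal sketch of one protocol period at member Mi, assuming hypothetical ping(target) and ping_req(via, target) helpers that return True when the corresponding ack arrives in time (these stand in for the ping(Mi, Mj, pr) and ping-req(Mi, Mj, pr) messages on the slides):

import random

def protocol_period(members, me, k, ping, ping_req):
    """One period of the randomized failure-detector protocol (sketch).

    Returns the member declared failed in this period, or None.
    """
    candidates = [m for m in members if m != me]
    mj = random.choice(candidates)                 # pick a random member Mj
    if ping(mj):                                   # direct ping acked in time
        return None
    helpers = random.sample([m for m in candidates if m != mj],
                            min(k, len(candidates) - 1))
    if any(ping_req(mk, mj) for mk in helpers):    # indirect probe via k members
        return None
    return mj                                      # no ack by end of period: declare Mj failed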
41 Outline
- Motivation & Previous Work
- Problem Statement
- Worst-case Network Load
- Randomized Distributed Failure Detector Protocol
- Analysis and Experimental Result
- Discussion
- Conclusion
42 Analysis and Parameters
- We need to set the protocol period T and the number of ping-req targets k
43 Experimental Results
- Independent of the number of members n
- Resilient to the number of failures and to message loss
44 Discussion
- Does the worst-case network load capture all aspects of the algorithm's performance?
- Small packet sizes impose a large overhead
- Aggregation of packets? Or another approach?
45 Conclusion
- We characterized failure detection algorithms
- The worst-case network load
- And its optimal value
- A randomized distributed failure detector
- It is better in terms of the worst-case network load
46 SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
- Abhinandan Das
- Indranil Gupta
- Ashish Motivala
47 Questions
- Why would we need another membership protocol?
- How are process failures detected?
- How are membership updates disseminated?
- How can false failure detection frequency be
reduced?
48 Group Membership Service
(Architecture: the application issues queries, updates, etc. to the Membership Protocol, which combines a Failure Detector with a Dissemination component to maintain the Group Membership List; joins, leaves, and failures of members drive the updates, and all messages travel over an unreliable network.)
49 Large-Scale Process Groups require Scalability
(Figure: a single process versus 1000s of processes.)
50 SWIM Actions
- Mj fails
- Step 1: the Failure Detector component at member Mi detects the failure of Mj
- Step 2: the Dissemination component spreads the failure information to the rest of the group
51 Scalable Failure Detectors
- Strong completeness
- Always guaranteed
- Speed of failure detection
- Time to detect a failure
- Accuracy
- Rate of false failure detections
- Network message load
- In bytes per second (Bps) generated by the protocol
52 Minimal Network Load L*
- To satisfy an application-specified detection time T and false detection rate PM(T), the minimal total network load is
- L* = n * log(PM(T)) / (log(pml) * T)
- n: group size, pml: independent message loss probability
(A worked example with assumed numbers follows below.)
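As a worked example with assumed, purely illustrative values pml = 0.1 and PM(T) = 10^-6:

\[
  \frac{\log P_M(T)}{\log p_{ml}}
  \;=\; \frac{\log 10^{-6}}{\log 10^{-1}}
  \;=\; 6
  \qquad\Longrightarrow\qquad
  L^{*} \;=\; \frac{6\,n}{T}
\]

i.e. each member must send at least 6 messages per detection interval of length T under these assumed values.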
53 Heartbeating
- Centralized: single point of failure
- Logical ring: unpredictable under multiple failures
- All-to-all: O(n/T) load per process
- Problems
- They attempt simultaneous detection at all processes
- They do not separate failure detection from dissemination
- Solutions
- Separate failure detection from dissemination
- Do not use heartbeat-based failure detection
54 SWIM Failure Detector
55 SWIM vs. Heartbeating
For a fixed false positive rate and message loss rate:
Protocol       First Detection Time   Process Load
Heartbeating   constant               O(n)
Heartbeating   O(n)                   constant
SWIM           constant               constant
56 SWIM Failure Detector
Parameter             SWIM
First Detection Time  Expected e/(e-1) protocol periods; constant (independent of group size)
Process Load          Constant per period; < 8L* even at 15% message loss
False Positive Rate   Tunable
Completeness          Deterministic, time-bounded; within O(log(N)) periods w.h.p.
57 Dissemination
- Mj fails
- Step 1: the Failure Detector detects the failure of Mj at Mi
- Step 2: the Dissemination component spreads the failure information
58 Dissemination Options
- Multicast (hardware/IP)
- Costly (multiple simultaneous multicasts)
- Unreliable
- Point-to-point TCP/UDP
- Expensive
- Piggybacking updates on the failure detector's messages: zero extra messages
- Infection-style dissemination
59 Infection-style Dissemination
60 Infection-style Dissemination
- An epidemic process
- After λ log(n) protocol periods, n - n^-(2λ-2) members have heard about an update
- Each member maintains a buffer of recently joined/evicted processes
- Updates are piggybacked on pings, ping-reqs, and acks
- Recent updates are preferred
- Buffer elements are garbage-collected after a while
- The λ log(n)-period horizon is what defines weak consistency (a sketch of such a buffer follows below)
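A sketch of the piggyback buffer, assuming (as in the experiments on slide 65) that an update is dropped after being piggybacked about 3 * ceil(log(n + 1)) times; the class and field names are illustrative:

import math

class PiggybackBuffer:
    """Buffer of recent membership updates, piggybacked on protocol messages
    (illustrative sketch of infection-style dissemination)."""

    def __init__(self, lam=3):
        self.lam = lam
        self.updates = {}          # member -> [update payload, times gossiped]

    def add(self, member, payload):
        self.updates[member] = [payload, 0]

    def select(self, n_members, max_piggyback):
        """Pick updates to piggyback on the next ping/ping-req/ack,
        preferring the least-gossiped (i.e. most recent) updates."""
        limit = self.lam * math.ceil(math.log(n_members + 1))
        chosen = sorted(self.updates.items(), key=lambda kv: kv[1][1])[:max_piggyback]
        for member, entry in chosen:
            entry[1] += 1
            if entry[1] >= limit:          # garbage-collect over-gossiped updates
                del self.updates[member]
        return [(member, entry[0]) for member, entry in chosen]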
61 Suspicion Mechanism
- False detections are caused by
- Perturbed processes
- Packet losses
- Indirect pinging may not solve the problem
- e.g., correlated message losses near the pinged host
- Solution: suspect a process before declaring it failed within the group
62 Suspicion Mechanism
State machine for process Mj's state as viewed at process Mi (FD = Failure Detector, D = Dissemination):
- Alive -> Suspected: FD: Mi's ping of Mj fails, or D: (Suspect Mj) received; disseminate (Suspect Mj)
- Suspected -> Alive: FD: Mi's ping of Mj succeeds, or D: (Alive Mj) received; disseminate (Alive Mj)
- Suspected -> Failed: timeout, or FD: Mi confirms Mj failed, or D: (Confirm Mj failed) received; disseminate (Confirm Mj failed)
63 Suspicion Mechanism
- The timeout can be adjusted to trade off the false detection rate against the failure declaration time
- Multiple suspicions of a process are distinguished by a per-process incarnation number
- The incarnation number is incremented only by its associated process
- An Alive message with a higher incarnation number overrides Alive/Suspect messages with lower incarnation numbers, and vice versa for Suspect messages
- Confirm messages override Alive and Suspect messages (a sketch of these override rules follows below)
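A sketch of the override rules in Python, following the slide's wording; the tie rule at equal incarnation numbers (Suspect over Alive) is an assumption on my part:

# Precedence at equal incarnation numbers; Confirm (failed) always wins.
RANK = {"alive": 0, "suspect": 1, "confirm": 2}

def should_override(current, incoming):
    """Decide whether an incoming (state, incarnation) message about Mj
    overrides the one currently held at Mi."""
    cur_state, cur_inc = current
    new_state, new_inc = incoming
    if new_state == "confirm":                 # confirm overrides everything
        return True
    if cur_state == "confirm":
        return False
    if new_inc != cur_inc:                     # higher incarnation number wins
        return new_inc > cur_inc
    return RANK[new_state] > RANK[cur_state]   # same incarnation: suspect > alive

# A stale Alive cannot clear a Suspect with the same incarnation number;
# Mj must increment its own incarnation number to refute the suspicion.
print(should_override(("suspect", 5), ("alive", 5)))   # False
print(should_override(("suspect", 5), ("alive", 6)))   # True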
64 Time-bounded Completeness
- Round-robin pinging (a sketch follows below)
- Each entry in Mi's membership list is selected as a ping target once during each traversal of the list
- After each traversal, the membership list is randomly reordered
- Worst-case delay between successive selections of the same target: 2ni - 1 protocol periods
- Preserves the average failure detection time of the original failure detector
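A minimal sketch of the round-robin target selection (the class name is mine; insertion of newly joined members at random positions is omitted):

import random

class RoundRobinTargets:
    """Round-robin ping-target selection with a random reorder after each
    full traversal of the membership list."""

    def __init__(self, members):
        self.members = list(members)
        random.shuffle(self.members)
        self.index = 0

    def next_target(self):
        if self.index >= len(self.members):    # traversal finished:
            random.shuffle(self.members)       # reshuffle for the next pass
            self.index = 0
        target = self.members[self.index]
        self.index += 1
        return target

A target chosen first in one traversal and last in the next is selected 2n - 1 periods apart, which is the worst case quoted on the slide.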
65 Experimental Set-Up
- SWIM prototype
- Win2000, Winsock 2
- Uses only UDP messages
- Experimental platform
- Heterogeneous cluster of commodity PCs (> 32)
- 100 Mbps Ethernet
- Protocol settings
- Number of processes for ping-reqs k = 1; protocol period = 2 s; number of times an update is piggybacked and the suspicion timeout both set to 3 * ceil(log(N + 1))
- No perpetual partial membership lists were observed in the experiments
66 Per-process message load is independent of group size
67 Fig. 3a: how the average detection time varies with group size
68 Fig. 3b: the median dissemination latency is uncorrelated with group size
69 Fig. 3c: the suspicion timeout used in the Suspicion Mechanism
70 Benefit of the Suspicion Mechanism
71 Answers
- Why would we need another membership protocol?
- Heartbeating does not scale well
- We need to separate detection from dissemination
- How are process failures detected?
- Randomized pinging
- How are membership updates disseminated?
- Infection-style dissemination
- How can the false failure detection frequency be reduced?
- The suspicion mechanism
72 Critiques/Questions
- In Figure 3a, why are the intervals between the vertical lines different?
- In Figure 3b, why are the latencies high at group sizes 18 and 26?
- Figure 3c does not include information about its experimental setup.
- How does SWIM compare to other membership protocols?
- How does SWIM adapt to application requirements and network dynamism, e.g., by changing k or the suspicion timeout?