Title: A sqrt(N) Algorithm for Mutual Exclusion in Decentralized Systems
1. A sqrt(N) Algorithm for Mutual Exclusion in Decentralized Systems
2. Setting & Previous Work
- Distributed mutual exclusion: a distributed lock among nodes without shared memory. We assume reliable FIFO channels.
- The algorithm must ensure safety, liveness, and deadlock freedom.
- Centralized algorithm: linear message overhead and a single point of failure.
- Ricart & Agrawala, Thomas: total or majority consensus, hence linear overhead.
3. Maekawa's Algorithm
- Idea: each node i has a different but static member set Si responsible for granting access to the lock.
- Member sets are also called quorums in the literature.
- Example: Si = {1} for a centralized algorithm; Si = {1, ..., N} for the classical algorithms.
- Can we minimize the size of Si?
4. Requirements
- a. Mutual exclusion: any two member sets intersect.
- b. Reduced message overhead: member sets are small.
- c. Symmetry (1): all member sets have the same size K.
- d. Symmetry (2): every node belongs to the same number of member sets.
5. Derivations
- Maximum number of subsets: with |Si| = K and each node serving in K sets, at most N = K(K-1) + 1 pairwise-intersecting member sets exist.
- Approximately, K ~ sqrt(N).
- Therefore member sets of size O(sqrt(N)) suffice.
- Exact solutions (finite projective planes) exist when (K-1) is a power of a prime, but overall K = O(sqrt(N)).
6. Example (N = 13, K = 4)
- S1 = {1, 2, 3, 4}
- S2 = {2, 5, 8, 11}
- S3 = {3, 6, 8, 13}
- S4 = {4, 6, 10, 11}
- S5 = {1, 5, 6, 7}
- S6 = {2, 6, 9, 12}
- S7 = {2, 7, 10, 13}
- S8 = {1, 8, 9, 10}
- S9 = {3, 7, 9, 11}
- S10 = {3, 5, 10, 12}
- S11 = {1, 11, 12, 13}
- S12 = {4, 7, 8, 12}
- S13 = {4, 5, 9, 13}
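As a sanity check, a few lines of Python (not from the paper) can verify the defining properties of these member sets: every pair intersects, every set has size K = 4, and every node serves in exactly K sets.

```python
# Verify the quorum properties of the 13 member sets listed above.
from itertools import combinations

quorums = {
    1: {1, 2, 3, 4},    2: {2, 5, 8, 11},    3: {3, 6, 8, 13},
    4: {4, 6, 10, 11},  5: {1, 5, 6, 7},     6: {2, 6, 9, 12},
    7: {2, 7, 10, 13},  8: {1, 8, 9, 10},    9: {3, 7, 9, 11},
    10: {3, 5, 10, 12}, 11: {1, 11, 12, 13}, 12: {4, 7, 8, 12},
    13: {4, 5, 9, 13},
}

K = 4
# Mutual exclusion: every pair of member sets shares at least one node.
assert all(quorums[a] & quorums[b] for a, b in combinations(quorums, 2))
# Symmetry (1): every member set has size K.
assert all(len(s) == K for s in quorums.values())
# Symmetry (2): every node belongs to exactly K member sets.
assert all(sum(n in s for s in quorums.values()) == K for n in quorums)
print("all", len(quorums), "member sets satisfy the quorum properties")
```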
7. Protocol overview
- To acquire the lock, a node A tries to lock all the nodes in its member set Si with REQUEST messages. Messages are totally ordered by (timestamp, node).
- Member nodes reply with either
- LOCKED, if they aren't locked,
- FAILED, if they're locked by a prior REQUEST.
- If a member node is locked by a later REQUEST, it INQUIREs the locker. If the locker won't get the lock anyway (because it has received a FAILED), it RELINQUISHes the lock, and the member node LOCKs A.
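The member-node side of this overview can be sketched in Python. This is an illustrative simplification, not Maekawa's pseudocode: requests are tuples (timestamp, node_id), networking is elided, and RELINQUISH handling is reduced to releasing the lock.

```python
# Sketch of a member node's decision when a REQUEST arrives.
import heapq

class MemberNode:
    def __init__(self):
        self.locked_by = None   # (timestamp, node_id) currently holding the lock
        self.waiting = []       # pending requests, kept as a min-heap

    def on_request(self, req):
        """Return the reply this member sends for an incoming REQUEST."""
        if self.locked_by is None:
            self.locked_by = req
            return "LOCKED"
        heapq.heappush(self.waiting, req)
        if req < self.locked_by:
            # A request preceding the current holder: probe whether the
            # holder is willing to back off.
            return "INQUIRE"    # sent to the current lock holder
        return "FAILED"

    def on_release(self):
        """Current holder is done: grant the lock to the earliest waiter."""
        self.locked_by = heapq.heappop(self.waiting) if self.waiting else None
        return self.locked_by

m = MemberNode()
print(m.on_request((3, "A")))   # lock free, so LOCKED
print(m.on_request((5, "B")))   # later request, so FAILED
print(m.on_request((1, "C")))   # earlier request, so INQUIRE the holder
print(m.on_release())           # lock passes to (1, 'C')
```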
8. Example
[Figure: example run with competing requests among several nodes (7, 8, 11, ...); some requests receive FAILED replies while the nodes enter the critical section one at a time.]
9. The Protocol - REQUEST
[Figure: a node sends REQUESTs to the members of its set; members reply LOCKED if free, FAILED if already locked by an earlier request.]
Important: REQUESTs are totally ordered based on Lamport time restricted to requests. A REQUEST message is given a sequence number greater than that of any REQUEST sent, received, or observed at this node, with node IDs as tie-breakers.
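The numbering rule can be sketched as follows (class and method names are illustrative, not from the paper):

```python
# A node stamps each new REQUEST with a number greater than any it has
# sent, received, or observed; ties are broken by node ID.
class Requester:
    def __init__(self, node_id):
        self.node_id = node_id
        self.max_seen = 0

    def observe(self, seq):
        """Called for any REQUEST sent, received, or observed."""
        self.max_seen = max(self.max_seen, seq)

    def new_request(self):
        self.max_seen += 1
        return (self.max_seen, self.node_id)   # total order: (seq, id)

a, b = Requester("A"), Requester("B")
r1 = a.new_request()        # (1, 'A')
b.observe(r1[0])
r2 = b.new_request()        # (2, 'B'), ordered after r1
assert r1 < r2              # lexicographic order on (seq, id)
print(r1, r2)
```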
10. The Protocol - INQUIRE/RELINQUISH
[Figure: a member node INQUIREs the current locker; having already received a FAILED, the locker RELINQUISHes and the earlier request is granted.]
11. The Protocol - RELEASE
[Figure: after leaving the critical section, the holder sends RELEASE to its member nodes, which then grant the lock to the next pending request.]
12. Mutual Exclusion
- At most one node is in the critical section.
- If two nodes were in the critical section simultaneously, each must have received LOCKED messages from all the nodes in its member set.
- Since the intersection of the two member sets is non-empty, some node must have been locked for two nodes at once, which is impossible by design.
13. Liveness
- No closed waiting cycle can exist.
- In any cycle, the node blocking the least preceding request (in the total order) is forced to RELINQUISH.
- A node requesting the lock eventually gets it.
- There are at most K-1 preceding requests at each member node.
14. Complexity
- State: O(K) = O(sqrt(N)) per node.
- Messages per mutual exclusion: between 3(K-1) and 5(K-1)
- (K-1) REQUESTs
- at most (K-1) INQUIREs
- at most (K-1) RELINQUISHes
- (K-1) LOCKEDs
- (K-1) RELEASEs
15. Fault Tolerance
- Model: nodes may fail-stop.
- If a node fails, only its own state is lost, and the author claims that another node could take over. It is not clear how this would impact the mutual exclusion guarantees.
16. Building Member Sets (1/2)
- Build an approximate finite projective plane with K(K-1) + 1 points, where K is the smallest value with (K-1) a power of a prime such that K(K-1) + 1 >= N.
- Replace entries larger than N with entries smaller than N.
- Computational issues?
17. Building Member Sets (2/2)
- Place the nodes in a grid as square as possible.
- Member sets are the nodes in the same column or the same row.
[Figure: a grid of nodes; the member set of a node consists of its row i and its column j.]
- Guaranteed intersections: the row of one node always crosses the column of the other.
- Member sets of size about 2*sqrt(N) - 1.
- If N is not a square, we complete the last row with existing values.
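The grid construction can be sketched in a few lines of Python. For simplicity the sketch assumes N is a perfect square, skipping the last-row completion mentioned above.

```python
# Grid quorums: a node's member set is its whole row plus its whole column.
import math

def grid_quorum(i, n):
    """Member set of node i (0-based) among n = d*d nodes."""
    d = math.isqrt(n)
    assert d * d == n, "sketch assumes N is a perfect square"
    row, col = divmod(i, d)
    same_row = {row * d + c for c in range(d)}
    same_col = {r * d + col for r in range(d)}
    return same_row | same_col

n = 25
quorums = [grid_quorum(i, n) for i in range(n)]
# Each set has 2*sqrt(N) - 1 members (row and column share one node).
assert all(len(q) == 2 * math.isqrt(n) - 1 for q in quorums)
# Any two sets intersect: one node's row crosses the other's column.
assert all(quorums[a] & quorums[b] for a in range(n) for b in range(n))
print("all grid member sets intersect; size =", 2 * math.isqrt(n) - 1)
```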
18. Discussion
- Optimality? There exist logarithmic algorithms for mutual exclusion (e.g. Naimi and Trehel).
- Resilience to unexpected node loss? To churn?
- Is sqrt(N) low enough?
19. Replication Strategies in Unstructured Peer-to-Peer Networks
20. Introduction
- Goal: optimal replication of data in a decentralized unstructured P2P system (not, e.g., Napster or Chord).
- Question: how many times do we replicate each file?
- Two basic strategies:
- Uniform: files replicated a fixed number of times.
- Proportional: files replicated in proportion to demand.
- A continuum of strategies with an optimum: square-root replication, proportional to the square root of the query rate.
21. Simple Setting
- n nodes with capacity rho, m files with normalized query rates q_i and r_i copies. There are R = sum_i r_i file copies in total and p_i = r_i / R is the fraction of copies of a given file. Files have the same size.
- Assuming a query distribution q, we want to minimize the expected search size (ESS), which is proportional to sum_i q_i / p_i.
- Under the constraints sum_i p_i = 1 and every file keeping at least one copy.
22. Reductions to the simple setting
- In practice, queries stop after L probes. Assuming L is large enough w.r.t. the number of files, the error in the ESS is less than one.
- If the nodes' capacities are not uniform, we take the average capacity weighted by the visitation rates and obtain the same results. Larger files can be seen as multiple copies of a single-sized one.
23. Three basic allocations
- Uniform: p_i = 1/m.
- Proportional: p_i = q_i.
- Square-root: p_i = sqrt(q_i) / sum_j sqrt(q_j).
- An allocation lies in-between uniform and proportional when p_i grows with q_i, but more slowly than q_i itself.
- Under the constraints sum_i p_i = 1, p_i > 0.
24. Uniform v. proportional
- Uniform allocation:
- Minimizes resources spent on unsolvable queries.
- Proportional allocation:
- Minimizes the maximum utilization rate; the utilization rate is the average rate of requests a copy serves. Minimizing it prevents hotspots.
- Surprisingly, both have the same expected search size m/r!
25. The space of allocations
- An allocation strictly in-between the uniform and proportional allocations has a strictly lower expected search size.
- An allocation that lies strictly outside this range performs strictly worse.
- Assuming the constraints on the number of copies are satisfied, the square-root allocation minimizes the ESS.
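These claims are easy to check numerically. The sketch below assumes the simple-setting model above, where the expected search size is proportional to sum_i q_i / p_i; the query rates are made-up example values.

```python
# Compare the ESS of the uniform, proportional, and square-root allocations.
import math

q = [0.5, 0.25, 0.15, 0.07, 0.03]     # normalized query rates, m = 5 files
m = len(q)

def ess(p):
    """Expected search size, up to a constant factor."""
    return sum(qi / pi for qi, pi in zip(q, p))

uniform      = [1 / m] * m
proportional = q[:]
s            = sum(math.sqrt(qi) for qi in q)
square_root  = [math.sqrt(qi) / s for qi in q]

# Uniform and proportional both give exactly m; square-root gives
# (sum_i sqrt(q_i))^2, which is <= m with equality only for uniform q.
assert math.isclose(ess(uniform), m)
assert math.isclose(ess(proportional), m)
assert ess(square_root) < m
print(ess(uniform), ess(proportional), ess(square_root))
```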
26. How much can we gain? Theory
27. How much can we gain? Practice
28. Problem feasibility
- Proportional or square-root allocations may violate the constraints.
- We define square-root as the (unique) allocation proportional to the square root of the frequencies when possible, and clamped to the feasible minimum when not.
- This is optimal among the feasible allocations.
- Proportional is defined in a similar way (but is not very interesting).
29. How do we implement this in a distributed system?
- Assumptions: copies can be made after a query, and copies are deleted independently of each other and of their age (excluding e.g. LRU).
- Uniform: OK. Nodes don't need to know file query rates.
- Square-root: more complicated. How do we determine the number of copies Ci?
30. Square-root replication in practice
- Path replication:
- C = number of nodes searched. Works, but may undershoot/overshoot the target frequency if queries arrive in bursts.
- Replication with sibling memory:
- Each file tracks the number of times it has been directly copied. If we know the average frequency at which files disappear, we can compute an unbiased estimator of the file query rates.
- Replication with probe memory:
- Nodes keep track of the probes they receive and use those frequencies as file query rates, possibly aggregating along the search path.
31. Discussion
- Reactivity to new data items?
- What about structured distributed systems? Can we port some of these ideas?
- Strong assumptions about the life cycle of files. Too strong, perhaps?
- Which replication strategy is better for distributed caches, P2P television, Kazaa-like applications?
32. Reliable Communication in the Presence of Failures
- Kenneth Birman and Thomas Joseph, Cornell University
- ACM Transactions on Computer Systems, 1987
33. Agenda
- Overview
- Motivation and Background
- Assumptions and Definition of Terms
- System Components
- Communication primitives
- Summary
34. Overview
- A communication facility for distributed systems in which failures can occur
- Presents a set of communication primitives
- Applicable to local and wide-area networks
- Fault-tolerant process groups
- Consistent orderings of events
- Events delivered despite failures
35. Motivation & Background
- Design of a communication facility for distributed systems.
- Consistent event orderings.
- Optimized concurrency.
- This system is in use at the NYSE, the Swiss Exchange, and the French air traffic control system.
36. Assumptions and Definition of Terms
- Failure: a process stops without taking incorrect actions (fail-stop)
- Event orderings are controlled by the communication layer
- Fault tolerance: continued operation
- Failures are detected, and the other processes are notified
- Remote failures are detected using timeouts
- A logical approach, rather than a physical one:
- Pretend events took place either before or after the failure
- No communication among inconsistent processes
37. Definitions
- Fault-tolerant process group:
- A collection of processes that cooperate to perform a distributed computation and use the communication primitives described in this paper.
- Broadcast:
- Refers to the transmission of a message from a process to the members of a process group, not to all processes in the system.
38. Fault Tolerant Process Groups
- Processes cooperating to perform a distributed computation
- No shared memory or synchronized clocks
- Timers are used for timeouts
- Changes in membership are ordered with respect to events
- Members monitor each other
39. Managing Group Membership
- View (process group view):
- A snapshot of the membership and global properties of a process group
- A local copy of the view list is kept at all sites
- View Manager:
- The oldest site; calculates view extensions
- View Extension:
- Current view plus one extension; other changes can get added on
- Site Manager
[Figure: sites S1-S4; site S2 hosts processes S2-1, S2-2, and S2-3.]
40. Communication Primitives
- Messages are sent only to members of the group
- Members can be at the same site or remote
- GBCAST: group broadcast
- ABCAST: atomic broadcast
- CBCAST: causal broadcast
- All are atomic
41. GBCAST (action, G)
- Broadcasts membership changes
- Issued by a coordinator using two-phase commit:
- The coordinator calculates the change
- If the change is received consistently: acknowledge, then commit
- If the change doesn't match: NACK with the missing events
- Delivered after all messages from a failed member
- A failed process will never be heard from again
- If declared dead, a process must go through recovery
- Ordered consistently with ABCAST and CBCAST, using a priority queue
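A toy sketch of the two-phase view change, under strong simplifying assumptions (no failures during the protocol, no missing-event NACK recovery; all names are illustrative, not the paper's API):

```python
# The coordinator proposes the new view, collects ACKs, then commits;
# sites install the view only on commit (stable storage elided).
class Site:
    def __init__(self, view):
        self.view = view            # committed view
        self.proposed = None

    def on_gbcast(self, new_view):
        """Phase 1: accept the proposal if it shrinks the current view."""
        if set(new_view) < set(self.view):   # only departures in this toy
            self.proposed = new_view
            return "ACK"
        return "NACK"

    def on_commit(self):
        """Phase 2: install the proposed view."""
        self.view = self.proposed

def gbcast_view_change(coordinator_view, failed, sites):
    new_view = [p for p in coordinator_view if p != failed]
    if all(s.on_gbcast(new_view) == "ACK" for s in sites):
        for s in sites:
            s.on_commit()
    return new_view

sites = [Site(["C", "P1", "P2", "P3", "P4"]) for _ in range(3)]
new = gbcast_view_change(["C", "P1", "P2", "P3", "P4"], "P1", sites)
assert all(s.view == ["C", "P2", "P3", "P4"] for s in sites)
print("committed view:", new)
```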
42. Normal GBCAST
[Figure: the coordinator GBCASTs "P1 down" to P2, P3, and P4. Each recipient compares its current view (C, P1, P2, P3, P4) with the new view, saves it to stable storage, and ACKs; the coordinator then commits. New view: (C, P2, P3, P4), P1 down.]
43. GBCAST: Coordinator Fails (1)
[Figure: the same GBCAST of "P1 down", but the coordinator fails partway: some sites have already ACKed and committed the new view (C, P2, P3, P4), P1 down, while others have not.]
44. GBCAST: Coordinator Fails (2)
[Figure: a new coordinator takes over and GBCASTs "C down" together with the pending "P1 down" change. Each site compares its current view (C, P1, P2, P3, P4) with the new one, notes the 2 changes, and saves it to stable storage. New view: (P2, P3, P4), P1 and C down.]
45. ABCAST (msg, label, dests)
- Assures messages are received in the same order everywhere
- Issued by the sender of the message
- Each recipient queues the message, assigns it a priority, tags it undeliverable, and replies
- The sender collects the responses, computes the maximum, and sends that value back
- Recipients adopt the final priority, tag the message deliverable, re-sort the queue, and transfer messages to the delivery queue in order
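The steps above can be sketched as a toy simulation: a single label, no failures, and synchronous phases. The class and function names are illustrative, not the paper's interface.

```python
# Two-phase ABCAST priority scheme: recipients propose priorities,
# the sender picks the maximum, and everyone delivers in final order.
class Recipient:
    def __init__(self):
        self.next_priority = 0
        self.pending = {}            # msg -> (priority, deliverable?)
        self.delivered = []

    def propose(self, msg):
        """Phase 1: queue msg as undeliverable, propose a priority."""
        self.next_priority += 1
        self.pending[msg] = (self.next_priority, False)
        return self.next_priority

    def commit(self, msg, final_priority):
        """Phase 2: adopt the sender's maximum and mark deliverable."""
        self.pending[msg] = (final_priority, True)
        self.next_priority = max(self.next_priority, final_priority)
        # Deliver the prefix of the priority queue that is deliverable.
        for m, (p, ok) in sorted(self.pending.items(), key=lambda kv: kv[1][0]):
            if not ok:
                break
            self.delivered.append(m)
            del self.pending[m]

def abcast(msg, recipients):
    final = max(r.propose(msg) for r in recipients)  # sender computes max
    for r in recipients:
        r.commit(msg, final)

group = [Recipient(), Recipient(), Recipient()]
abcast("a", group)
abcast("b", group)
# Every recipient delivers the messages in the same order.
assert all(r.delivered == ["a", "b"] for r in group)
print("identical delivery order at all recipients:", group[0].delivered)
```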
46. ABCAST (msg, label, dests)
[Figure: the sender transmits the message to its destinations. Each recipient adds the message to the priority queue associated with the label, tags it as undeliverable, assigns a priority, and informs the sender. The sender computes the maximum priority and sends it back to the recipients. Recipients change the priority, tag the message as deliverable, and transfer messages to the delivery queue in increasing order of priority.]
47. CBCAST (msg, clabel, dests)
- Ensures relative ordering when necessary
- clabels are either comparable or incomparable
- Messages with no common destinations need no comparison
- Causally previous messages are included in the transmission
- Optimizations are possible:
- Intersite packets
- A common message pool with pointers into it
- Flags track where each message has been sent
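The piggybacking idea can be sketched as follows, heavily simplified: one flat per-node buffer, no garbage collection, no sent-to flags, and every message addressed to everyone.

```python
# A CBCAST message carries its undelivered causal predecessors, so a
# recipient can deliver them first and never sees an effect before its cause.
class CbcastNode:
    def __init__(self):
        self.buffer = []             # messages sent/seen, in causal order
        self.delivered = []

    def cbcast(self, msg):
        packet = self.buffer + [msg] # piggyback causal predecessors
        self.buffer.append(msg)
        return packet

    def receive(self, packet):
        for msg in packet:           # predecessors come first by construction
            if msg not in self.delivered:
                self.delivered.append(msg)
                self.buffer.append(msg)

p, q = CbcastNode(), CbcastNode()
pkt1 = p.cbcast("m1")                # m1 causally precedes m2 at p
pkt2 = p.cbcast("m2")
q.receive(pkt2)                      # pkt2 arrives first, but carries m1
assert q.delivered == ["m1", "m2"]   # causal order preserved anyway
print(q.delivered)
```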
48. CBCAST (msg, clabel, dests)
[Figure: transmission of B from BUF_P to BUF_Q. A transfer packet (B1, B2, ...) is sent to q and includes every B' in BUF_P that causally precedes B and has not yet been delivered to all its destinations; for i < j, Bi precedes Bj. At q, the Bi are placed in BUF_Q; if Bi was destined for q, it is also placed in the delivery queue.]
49. View Management using GBCAST
[Figure: P5 fails. The view manager P1 sends the new view (P1, P2, P3, P4) to all sites; each site records the view to stable storage, ceases to accept messages from P5, and replies with a positive or negative ACK. After receiving the positive ACKs, the view manager sends a commit message to all sites.]
50. Summary
- Communication protocols for distributed systems
- Defined process groups and broadcast primitives
- Failures can be tolerated
- Members have a consistent view
- Used in the ISIS project at Cornell University
51. Discussion
- How well does the protocol perform with frequent failures/recoveries within a cluster?
- Can we use gossip techniques/spanning trees to avoid placing all the transmission burden on the broadcasting node?
- Compare the replication strategies of DHTs (Chord, Pastry, etc.) with the replication strategies this paper presents.
- Could we implement consensus? Why?
52. Thank you!