A √N Algorithm for Mutual Exclusion in Decentralized Systems

1
A √N Algorithm for Mutual Exclusion in
Decentralized Systems
  • M. Maekawa

2
Setting & Previous Work
  • Distributed mutual exclusion: a distributed lock
    among nodes without shared memory. We assume
    reliable FIFO channels.
  • The algorithm must ensure safety, liveness, and
    deadlock freedom.
  • Centralized algorithm: linear message overhead
    and a single point of failure.
  • Ricart & Agrawala, Thomas: total or majority
    consensus, hence linear overhead.

3
Maekawa Algorithm
  • Idea: each node i has a different but static
    member set Si responsible for granting it access
    to the lock.
  • Member sets are also called quorums in the
    literature.
  • Example: Si = {c}, a single coordinator, gives
    the centralized algorithm.
  • Si = {1, ..., N} gives the classical algorithms.
  • Can we minimize the size of Si?

4
Requirements
  • a. Mutual exclusion: Si ∩ Sj ≠ ∅ for all pairs
    i, j.
  • b. Reduced message overhead: i ∈ Si, and
    |Si| = K as small as possible.
  • c. Symmetry (1): all member sets have the same
    size K (equal effort).
  • d. Symmetry (2): every node appears in the same
    number D of member sets (equal responsibility).

5
Derivations
  • Maximum number of subsets: with |Si| = K and
    D = K, at most N = K(K-1) + 1 member sets can
    pairwise intersect.
  • Approximately, N ≈ K².
  • Therefore K ≈ √N.
  • Exact solutions exist when (K-1) is a power of a
    prime; overall, K = O(√N).

6
Example
  • S1  = {1, 2, 3, 4}
  • S2  = {2, 5, 8, 11}
  • S3  = {3, 6, 8, 13}
  • S4  = {4, 6, 10, 11}
  • S5  = {1, 5, 6, 7}
  • S6  = {2, 6, 9, 12}
  • S7  = {2, 7, 10, 13}
  • S8  = {1, 8, 9, 10}
  • S9  = {3, 7, 9, 11}
  • S10 = {3, 5, 10, 12}
  • S11 = {1, 11, 12, 13}
  • S12 = {4, 7, 8, 12}
  • S13 = {4, 5, 9, 13}
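Here N = 13 and K = 4: the member sets form a finite
projective plane of order 3, so every pair of sets
intersects in exactly one node. A minimal Python check (my
own, not from the paper):

    from itertools import combinations

    # Member sets from the example above (node i -> Si); N = 13, K = 4.
    S = {
        1: {1, 2, 3, 4},    2: {2, 5, 8, 11},    3: {3, 6, 8, 13},
        4: {4, 6, 10, 11},  5: {1, 5, 6, 7},     6: {2, 6, 9, 12},
        7: {2, 7, 10, 13},  8: {1, 8, 9, 10},    9: {3, 7, 9, 11},
        10: {3, 5, 10, 12}, 11: {1, 11, 12, 13}, 12: {4, 7, 8, 12},
        13: {4, 5, 9, 13},
    }

    assert all(i in s for i, s in S.items())   # each node is in its own set
    assert all(len(S[i] & S[j]) == 1 for i, j in combinations(S, 2))
    print("all", len(S), "member sets pairwise intersect in exactly one node")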

7
Protocol overview
  • To acquire the lock, a node A tries to lock all
    the nodes in its member set Si with REQUEST
    messages. Messages are totally ordered by
    (timestamp, node).
  • Member nodes reply with either
  • LOCKED, if they aren't already locked, or
  • FAILED, if they're locked by a prior REQUEST.
  • If a member node is locked by a later REQUEST,
    it INQUIREs the locker. If the locker cannot get
    the lock anyway (because it has received a
    FAILED), it RELINQUISHes the lock, and the member
    node LOCKs A.

8
[Figure: worked run in which nodes 7, 8, and 11 compete for
the lock; later requests receive FAILED while earlier ones
enter the critical section.]
9
The Protocol - REQUEST
[Figure: a node sends REQUESTs to the nodes of its member
set (3, 4, 5, 6); members reply LOCKED or FAILED.]
Important: REQUESTs are totally ordered based on
Lamport time restricted to requests. A REQUEST
message is given a sequence number greater than
that of any REQUEST message sent, received, or
observed at this node. Ties are broken using node
IDs, which makes the order total.
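A minimal sketch of this numbering (my own illustration;
names like RequestClock are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class RequestClock:
        node_id: int
        seen: int = 0    # highest sequence number sent, received, or observed

        def next_request(self):
            """Stamp a new REQUEST strictly above everything seen so far."""
            self.seen += 1
            return (self.seen, self.node_id)    # node ID breaks ties

        def observe(self, stamp):
            """Update on any REQUEST sent, received, or observed."""
            self.seen = max(self.seen, stamp[0])

    # Tuples compare lexicographically, so (seq, node_id) is a total order:
    a, b = RequestClock(node_id=7), RequestClock(node_id=11)
    ra = a.next_request()   # (1, 7)
    b.observe(ra)
    rb = b.next_request()   # (2, 11): ordered after ra
    assert ra < rb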
10
The Protocol - INQUIRE/RELINQUISH
[Figure: a member node locked by a later REQUEST sends
INQUIRE to the current locker; having received a FAILED
elsewhere, the locker answers RELINQUISH.]
11
The Protocol - RELEASE
[Figure: on leaving the critical section, the holder sends
RELEASE to the nodes of its member set.]
12
Mutual Exclusion
  • At most one node is in the critical section.
  • If two nodes were in the critical section
    simultaneously, each would have received LOCKED
    messages from all the nodes in its member set.
  • Since the intersection of the two member sets is
    non-empty, some node would have been locked by
    two requesters at once, which is impossible by
    design.

13
Liveness
  • No closed waiting cycle can exist:
  • a member node locked for a request that is not
    the earliest in the total order is forced to
    RELINQUISH, breaking the cycle.
  • A node requesting the lock eventually gets it:
  • there are at most K-1 preceding requests at each
    member node.

14
Complexity
  • State per node: its member set, i.e. K ≈ √N
    entries.
  • Messages per critical-section entry: between
    3(K-1) and 5(K-1), i.e. O(√N):
  • (K-1) REQUESTs
  • at most (K-1) INQUIREs
  • at most (K-1) RELINQUISHes
  • (K-1) LOCKEDs
  • (K-1) RELEASEs

15
Fault Tolerance
  • Model: nodes may fail-stop.
  • If a node fails, only its local state is lost,
    and the author claims that another node could
    take over. It is not clear how this would impact
    the mutual exclusion guarantees.

16
Building Member Sets (1/2)
  • Build a finite projective plane with
    N' = K(K-1) + 1 points, where (K-1) is the
    smallest prime power such that N' ≥ N.
  • Replace entries larger than N with entries
    smaller than N.
  • Computational issues?

17
Building Member Sets (2/2)
  • Place the nodes in a grid that is as square as
    possible.
  • The member set of a node consists of the nodes in
    its row and in its column (sketched in code
    below).

[Figure: the nodes laid out in a grid; Si is the row plus
the column of node i.]

  • Guaranteed intersections: the row of one node
    always meets the column of another, so any two
    member sets intersect.
  • Member sets of size about 2√N - 1.
  • If N is not a square, we complete the last row
    with existing values.
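A small sketch of the grid construction (my own
illustration, not code from the paper; folding the overflow
cells back onto existing nodes is one way of completing the
grid when N is not a square):

    import math

    def grid_quorum(i, n):
        """Member set of node i (0-based): its row plus its column in a
        roughly sqrt(n) x sqrt(n) grid."""
        side = math.isqrt(n - 1) + 1              # grid as square as possible
        rows = (n + side - 1) // side
        row, col = divmod(i, side)
        quorum = {r * side + col for r in range(rows)}   # i's column
        quorum |= {row * side + c for c in range(side)}  # i's row
        return {x % n for x in quorum}            # fold overflow cells back

    n = 13
    quorums = [grid_quorum(i, n) for i in range(n)]
    assert all(qa & qb for qa in quorums for qb in quorums)  # pairwise overlap
    print("quorum sizes:", sorted(len(q) for q in quorums))  # about 2*sqrt(n)-1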

18
Discussion
  • Optimality? There exist logarithmic algorithms
    for mutual exclusion (e.g. Naimi and Trehel).
  • Resilience to unexpected node loss? To churn?
  • Is √N low enough?

19
Replication Strategies in Unstructured
Peer-to-Peer Networks
  • E. Cohen, S. Shenker

20
Introduction
  • Goal: optimally replicate data in a
    decentralized, unstructured P2P system (not,
    e.g., Napster or Chord).
  • Question: how many times do we replicate each
    file?
  • Two basic strategies:
  • Uniform: files are replicated a fixed number of
    times.
  • Proportional: files are replicated in proportion
    to demand.
  • A continuum of strategies, with an optimum:
    square-root replication, proportional to the
    square root of the query rate.

21
Simple Setting
  • n nodes with capacity ρ each; m files with
    normalized query rates q_i (Σ_i q_i = 1) and r_i
    copies. There are R = nρ file copies in total,
    and p_i = r_i / R is the fraction of copies of
    file i. Files have the same size.
  • Assuming a query distribution q, we want to
    minimize the expected search size (ESS), the
    average number of nodes probed per query:
    ESS = Σ_i q_i / (ρ p_i).
  • Under the constraints Σ_i p_i = 1 and
    1/R ≤ p_i ≤ n/R.

22
Reductions to the simple setting
  • In practice, queries stop after L probes.
    Assuming L is large enough w.r.t. the number of
    files, the resulting error in the ESS is less
    than one.
  • If the node capacities are not uniform, we take
    the average capacity weighted by the visitation
    rates and obtain the same results. Larger files
    can be seen as multiple files of unit size.

23
Three basic allocations
  • Uniform: p_i = 1/m.
  • Proportional: p_i = q_i.
  • Square-root: p_i = √q_i / Σ_j √q_j.
  • An allocation lies in-between uniform and
    proportional if q_i ≥ q_j implies p_i ≥ p_j and
    p_i / p_j ≤ q_i / q_j.
  • All under the constraints Σ_i p_i = 1 and
    1/R ≤ p_i ≤ n/R.

24
Uniform v. proportional
  • Uniform allocation:
  • minimizes the resources spent on unsolvable
    queries.
  • Proportional allocation:
  • minimizes the maximum utilization rate; the
    utilization rate is the average rate of requests
    a copy serves. Minimizing it prevents hotspots.
  • Surprisingly, both have the same expected search
    size, m/ρ (see the numeric check below)!
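A quick numeric check (my own illustration; the Zipf-like
query distribution and the capacity value are arbitrary)
that uniform and proportional give the same ESS while
square-root does strictly better:

    # ESS = sum_i q_i / (rho * p_i); rho = copies of capacity per node.
    m, rho = 100, 10                          # assumed: 100 files, capacity 10
    q = [1 / (i + 1) for i in range(m)]       # Zipf-like query rates
    total = sum(q)
    q = [x / total for x in q]                # normalize so sum(q) == 1

    def ess(p):
        return sum(qi / (rho * pi) for qi, pi in zip(q, p))

    uniform      = [1 / m] * m
    proportional = q
    s            = sum(x ** 0.5 for x in q)
    square_root  = [x ** 0.5 / s for x in q]

    print(ess(uniform), ess(proportional))    # both equal m/rho = 10.0
    print(ess(square_root))                   # (sum_i sqrt(q_i))**2 / rho < 10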

25
The space of allocations
  • An allocation strictly in-between the uniform and
    proportional allocations has a strictly lower
    expected search size than either.
  • An allocation that lies strictly outside this
    range performs strictly worse.
  • Assuming the constraints on the number of copies
    are satisfied, the square-root allocation
    minimizes the ESS (derivation sketched below).
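The optimality claim can be recovered with a short
Lagrange-multiplier computation (my reconstruction; it
ignores the boundary constraints 1/R ≤ p_i ≤ n/R):

    \min_{p}\ \mathrm{ESS}(p) = \frac{1}{\rho}\sum_i \frac{q_i}{p_i}
    \quad\text{s.t.}\quad \sum_i p_i = 1

    \frac{\partial}{\partial p_i}\Big(\frac{1}{\rho}\sum_j \frac{q_j}{p_j}
      + \lambda \sum_j p_j\Big)
      = -\frac{q_i}{\rho\, p_i^{2}} + \lambda = 0
    \;\Longrightarrow\; p_i \propto \sqrt{q_i},
    \qquad
    \mathrm{ESS}_{\min} = \frac{\big(\sum_i \sqrt{q_i}\big)^{2}}{\rho}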

26
How much can we gain? Theory
27
How much can we gain? Practice
28
Problem feasibility
  • Proportional or square-root allocations may
    violate the constraint p_i ≥ 1/R for rare files.
  • We define square-root as the (unique) allocation
    proportional to the square root of the
    frequencies where possible, and pinned at 1/R
    where not.
  • This is optimal among the feasible allocations.
  • Proportional is defined in a similar way (but is
    not very interesting).

29
How do we implement this in a distributed system?
  • Assumptions: copies can be made after a query,
    and copies are deleted independently of each
    other and of their age (excluding e.g. LRU).
  • Uniform: OK; nodes don't need to know file query
    rates.
  • Square-root: more complicated. How do we
    determine the number of copies C_i?

30
Square-root replication in practice
  • Path replication:
  • C = number of nodes searched. Works, but may
    undershoot/overshoot the target frequency if
    queries arrive in bursts. (See the sketch after
    this list.)
  • Replication with sibling memory:
  • each file tracks the number of times it has been
    directly copied. If we know the average frequency
    at which files disappear, we can compute an
    unbiased estimator of the file query rates.
  • Replication with probe memory:
  • nodes keep track of the probes they receive and
    use those frequencies as file query rates,
    possibly aggregating along the search path.
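Why path replication tends toward the square-root
allocation: file i receives queries at rate q_i, each query
creates a number of new copies proportional to its search
size (∝ 1/p_i), and copies die at a rate proportional to
their number; balancing q_i/p_i against p_i gives
p_i ∝ √q_i. A toy simulation (my own, not from the paper):

    import random

    m = 5
    q = [0.5, 0.25, 0.12, 0.08, 0.05]      # assumed query rates, sum to 1
    r = [100.0] * m                        # start from a uniform allocation
    R = sum(r)                             # total capacity, held constant

    for _ in range(100_000):
        i = random.choices(range(m), weights=q)[0]  # a query for file i
        r[i] += R / r[i]                   # copies created ~ search size ~ 1/p_i
        total = sum(r)
        r = [x * R / total for x in r]     # age-independent deletions keep R fixed

    s = sum(x ** 0.5 for x in q)
    for qi, ri in zip(q, r):
        print(f"q={qi:.2f}  p={ri / R:.3f}  sqrt-target={qi ** 0.5 / s:.3f}")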

31
Discussion
  • Reactivity to new data items?
  • What about structured distributed systems? Can we
    port some of these ideas?
  • Strong assumptions about the life cycle of files.
    Too strong, perhaps?
  • Which replication strategy is better for
    distributed caches, P2P television, Kazaa-like
    applications?

32
Reliable Communication in the Presence of
Failures
  • Kenneth Birman and Thomas Joseph, Cornell
    University
    ACM Transactions on Computer Systems, 1987

33
Agenda
  • Overview
  • Motivation and Background
  • Assumptions and Definition of Terms
  • System Components
  • Communication primitives
  • Summary

34
Overview
  • A communication facility for distributed systems
    in which failures can occur
  • Presents a set of communication primitives
  • Applicable to local- and wide-area networks
  • Fault-tolerant process groups
  • Consistent orderings of events
  • Events are delivered despite failures

35
Motivation & Background
  • Design of a communication facility for
    distributed systems.
  • Consistent event orderings.
  • Optimize concurrency.
  • This system is in use at the NYSE, the Swiss
    Exchange, and the French air traffic control
    system.

36
Assumptions and Definition of Terms
  • Failure: a process stops without taking incorrect
    actions
  • Event orderings are controlled by the
    communication layer
  • Fault tolerance: continued operation
  • Failures are detected and the others are notified
  • Remote failures are detected using timeouts
  • The approach is logical, rather than physical:
  • pretend events took place either before or after
    the failure
  • no communication among inconsistent processes

37
Definitions
  • Fault-tolerant process group:
  • a collection of processes that cooperate to
    perform a distributed computation, using the
    communication primitives described in this paper.
  • Broadcast:
  • the transmission of a message from a process to
    the members of a process group, not to all
    processes in the system.

38
Fault-Tolerant Process Groups
  • Processes cooperating to perform a distributed
    computation
  • No shared memory or synchronized clocks
  • Timers for timeouts
  • Changes in membership are ordered with respect to
    events
  • Members monitor each other

39
Managing Group Membership
  • View (process group view):
  • a snapshot of the membership and global
    properties of a process group
  • a local copy of the view list is kept at all
    sites
  • View manager:
  • the oldest site; it calculates view extensions
  • View extension:
  • the current view + 1 extension; other changes can
    get added on
  • Site manager

[Figure: sites S1-S4; site S2 hosts processes S2-1, S2-2,
and S2-3.]
40
Communication Primitives
  • Messages are sent only to members of the group
  • Members can be at the same site or remote
  • GBCAST: group broadcast
  • ABCAST: atomic broadcast
  • CBCAST: causal broadcast
  • All are atomic

41
GBCAST (action, G)
  • Broadcasts membership changes
  • Issued by a coordinator using two-phase commit:
  • the coordinator calculates the change
  • change received: acknowledge, then commit
  • change doesn't match: NACK with the missing
    events
  • Delivered after all messages from the failed
    member:
  • a failed process will never be heard from again
  • if declared dead, it must go through recovery
  • Ordered consistently with ABCAST and CBCAST using
    a priority queue

42
Normal GBCAST
[Figure: the coordinator sends GBCAST("P1 down") to C, P2,
P3, P4; each site compares its current view
(C, P1, P2, P3, P4) with the new view, saves it to stable
storage, and ACKs; the coordinator then commits the new
view (C, P2, P3, P4) with P1 down.]
43
GBCAST Coordinator Fails (1)
[Figure: the same scenario, but the coordinator fails
mid-protocol: some sites have already committed the new
view (C, P2, P3, P4) with P1 down, while others have only
ACKed it.]
44
GBCAST Coordinator Fails (2)
[Figure: a new coordinator (P2) takes over and sends
GBCAST("C down") together with the pending "P1 down"
change. Sites note the two changes, save them to stable
storage, ACK, and commit the new view (P2, P3, P4) with P1
and C down.]
45
ABCAST (msg, label, dests)
  • Assures that messages are received in the same
    order everywhere
  • Issued by the sender of the message
  • Recipient queues the message, assigns it a
    proposed priority, tags it undeliverable, and
    replies
  • Sender collects the responses, computes the
    maximum, and sends that value back
  • Recipients update the priority, tag the message
    deliverable, re-sort the queue, and transfer
    messages to the delivery queue in order (see the
    sketch below)
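A single-process sketch of the two-phase priority scheme
(my own illustration, not ISIS code; network, labels, and
failures are omitted):

    import heapq
    from dataclasses import dataclass, field

    @dataclass
    class Recipient:
        """One ABCAST destination, simulated in-process."""
        clock: int = 0                             # highest priority proposed/seen
        queue: list = field(default_factory=list)  # heap of [priority, id, deliverable]
        delivered: list = field(default_factory=list)

        def propose(self, msg_id):
            # Phase 1: queue the message as undeliverable, propose a priority.
            self.clock += 1
            heapq.heappush(self.queue, [self.clock, msg_id, False])
            return self.clock

        def commit(self, msg_id, final):
            # Phase 2: adopt the agreed (maximum) priority, mark deliverable.
            self.clock = max(self.clock, final)
            for entry in self.queue:
                if entry[1] == msg_id:
                    entry[0], entry[2] = final, True
            heapq.heapify(self.queue)
            # Transfer from the head while it is deliverable: nothing with a
            # smaller final priority can still arrive.
            while self.queue and self.queue[0][2]:
                _, mid, _ = heapq.heappop(self.queue)
                self.delivered.append(mid)

    def abcast(msg_id, dests):
        final = max(d.propose(msg_id) for d in dests)  # sender computes the max
        for d in dests:
            d.commit(msg_id, final)

    group = [Recipient() for _ in range(3)]
    abcast("m1", group)
    abcast("m2", group)
    assert all(r.delivered == group[0].delivered for r in group)
    print(group[0].delivered)      # ['m1', 'm2'] at every recipient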

46
ABCAST (msg, label, dests)
[Figure: an ABCAST arrives at process P; it passes through
the per-label ABCAST priority queue before moving to the
delivery queue.]
The sender transmits the message to its destinations. A
recipient adds the message to the priority queue associated
with its label, tags it as undeliverable, assigns it a
priority, and informs the sender. The sender computes the
maximum priority and sends it back to the recipients. The
recipients update the priority, tag the message as
deliverable, and transfer messages to the delivery queue in
increasing order of priority.
47
CBCAST (msg, clabel, dests)
  • Ensures relative ordering only where necessary
  • clabels are either comparable or incomparable
  • messages with no common destinations need no
    comparison
  • messages preceding this one are included in the
    transmission
  • Optimizations are possible:
  • intersite packets
  • a common message pool with pointers into it
  • flags track where each message has been sent

48
CBCAST (msg, clabel, dests)
[Figure: a CBCAST arrives at process P and is buffered in
BUF_P; buffered messages are transferred between sites in
packets.]
Transmission of B from BUF_p to BUF_q: a transfer packet
(B1, B2, ...) is sent to q; it includes every B' in BUF_p
such that B' precedes B and B' has not yet been delivered
to all of its destinations. For i < j, Bi precedes Bj. At
q, the Bi are placed in BUF_q; if Bi was destined for q, it
is also placed in the delivery queue.
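A highly simplified sketch of this rule (my own
illustration; the transfer packet piggybacks the
undelivered predecessors themselves, which is what preds
models here):

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Msg:
        mid: str
        dests: frozenset    # destination site names
        preds: tuple = ()   # undelivered causal predecessors, in causal order

    @dataclass
    class Site:
        name: str
        delivered: list = field(default_factory=list)

        def receive(self, packet_msg):
            # The transfer packet carries the predecessors first (for i < j,
            # Bi precedes Bj), so delivering front-to-back respects causality.
            for m in (*packet_msg.preds, packet_msg):
                if self.name in m.dests and m.mid not in self.delivered:
                    self.delivered.append(m.mid)

    q = Site("q")
    a = Msg("A", frozenset({"p", "q"}))
    b = Msg("B", frozenset({"q"}), preds=(a,))
    q.receive(b)            # B's packet piggybacks A, so A is delivered first
    print(q.delivered)      # ['A', 'B']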
49
View Management using GBCAST
[Figure: P5 fails. The view manager (P1) broadcasts "P5
down, view = (P1, P2, P3, P4)"; the sites record the site
view to stable storage, cease to accept messages from the
failed site, and answer with positive or negative ACKs.]
After receiving the positive ACKs, the view manager sends a
commit message to all sites.
50
Summary
  • Communication protocols for distributed systems
  • Defined process groups and broadcast protocols
  • Failures can be tolerated
  • Members have a consistent view
  • Implemented in the ISIS project at Cornell
    University

51
Discussion
  • How well does the protocol perform with frequent
    failures/recoveries within a cluster?
  • Can we use gossip techniques/spanning trees to
    avoid placing all the transmission burden on the
    broadcasting node?
  • Compare the replication strategies of the DHTs
    seen previously (Chord, Pastry, etc.) with the
    strategies this paper presents.
  • Could we implement consensus? Why?

52
Thank you!