Title: A sqrt(N) Algorithm for Mutual Exclusion in Decentralized Systems
1. A sqrt(N) Algorithm for Mutual Exclusion in Decentralized Systems
2. Setting & Previous Work
- Distributed mutual exclusion: a distributed lock among nodes without shared memory. We assume reliable FIFO channels.
- The algorithm must ensure safety, liveness, and deadlock freedom.
- Centralized algorithm: linear message overhead and a single point of failure.
- Ricart & Agrawala, Thomas: total or majority consensus, hence linear overhead.
3. Maekawa's Algorithm
- Idea: each node i has a different but static member set Si responsible for granting access to the lock.
- Member sets are also called quorums in the literature.
- Example: Si = {1} for a centralized algorithm; Si = {1, ..., N} for the classical algorithms.
- Can we minimize the size of Si?
4. Requirements
- a. Mutual exclusion: any two member sets intersect.
- b. Reduced message overhead: member sets are small.
- c. Symmetry (1): all member sets have the same size K.
- d. Symmetry (2): every node belongs to the same number of member sets.
5. Derivations
- Maximum number of subsets: with |Si| = K and each node serving in K sets, at most N = K(K-1) + 1 pairwise-intersecting member sets exist.
- Approximately, K ~ sqrt(N).
- Therefore member sets of size O(sqrt(N)) suffice.
- Exact solutions (finite projective planes) exist when (K-1) is a power of a prime, but overall K = O(sqrt(N)).
6. Example (N = 13, K = 4)
- S1 = {1, 2, 3, 4}
- S2 = {2, 5, 8, 11}
- S3 = {3, 6, 8, 13}
- S4 = {4, 6, 10, 11}
- S5 = {1, 5, 6, 7}
- S6 = {2, 6, 9, 12}
- S7 = {2, 7, 10, 13}
- S8 = {1, 8, 9, 10}
- S9 = {3, 7, 9, 11}
- S10 = {3, 5, 10, 12}
- S11 = {1, 11, 12, 13}
- S12 = {4, 7, 8, 12}
- S13 = {4, 5, 9, 13}
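As a sanity check, a few lines of Python (not from the paper) can verify the defining properties of these member sets: every pair intersects, every set has size K = 4, and every node serves in exactly K sets.

```python
# Verify the quorum properties of the 13 member sets listed above.
from itertools import combinations

quorums = {
    1: {1, 2, 3, 4},    2: {2, 5, 8, 11},    3: {3, 6, 8, 13},
    4: {4, 6, 10, 11},  5: {1, 5, 6, 7},     6: {2, 6, 9, 12},
    7: {2, 7, 10, 13},  8: {1, 8, 9, 10},    9: {3, 7, 9, 11},
    10: {3, 5, 10, 12}, 11: {1, 11, 12, 13}, 12: {4, 7, 8, 12},
    13: {4, 5, 9, 13},
}

K = 4
# Mutual exclusion: every pair of member sets shares at least one node.
assert all(quorums[a] & quorums[b] for a, b in combinations(quorums, 2))
# Symmetry (1): every member set has size K.
assert all(len(s) == K for s in quorums.values())
# Symmetry (2): every node belongs to exactly K member sets.
assert all(sum(n in s for s in quorums.values()) == K for n in quorums)
print("all", len(quorums), "member sets satisfy the quorum properties")
```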
7. Protocol overview
- To acquire the lock, a node A tries to lock all the nodes in its member set Si with REQUEST messages. Messages are totally ordered by (timestamp, node).
- Member nodes reply with either
- LOCKED, if they aren't locked,
- FAILED, if they're locked by a prior REQUEST.
- If a member node is locked by a later REQUEST, it INQUIREs the locker. If the locker won't get the lock anyway (because it has received a FAILED), it RELINQUISHes the lock, and the member node LOCKs A.
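The member-node side of this overview can be sketched in Python. This is an illustrative simplification, not Maekawa's pseudocode: requests are tuples (timestamp, node_id), networking is elided, and RELINQUISH handling is reduced to releasing the lock.

```python
# Sketch of a member node's decision when a REQUEST arrives.
import heapq

class MemberNode:
    def __init__(self):
        self.locked_by = None   # (timestamp, node_id) currently holding the lock
        self.waiting = []       # pending requests, kept as a min-heap

    def on_request(self, req):
        """Return the reply this member sends for an incoming REQUEST."""
        if self.locked_by is None:
            self.locked_by = req
            return "LOCKED"
        heapq.heappush(self.waiting, req)
        if req < self.locked_by:
            # A request preceding the current holder: probe whether the
            # holder is willing to back off.
            return "INQUIRE"    # sent to the current lock holder
        return "FAILED"

    def on_release(self):
        """Current holder is done: grant the lock to the earliest waiter."""
        self.locked_by = heapq.heappop(self.waiting) if self.waiting else None
        return self.locked_by

m = MemberNode()
print(m.on_request((3, "A")))   # lock free, so LOCKED
print(m.on_request((5, "B")))   # later request, so FAILED
print(m.on_request((1, "C")))   # earlier request, so INQUIRE the holder
print(m.on_release())           # lock passes to (1, 'C')
```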
8. Example
[Figure: example run with competing requests among several nodes (7, 8, 11, ...); some requests receive FAILED replies while the nodes enter the critical section one at a time.]
9. The Protocol - REQUEST
[Figure: a node sends REQUESTs to the members of its set; members reply LOCKED if free, FAILED if already locked by an earlier request.]
Important: REQUESTs are totally ordered based on Lamport time restricted to requests. A REQUEST message is given a sequence number greater than that of any REQUEST sent, received, or observed at this node, with node IDs as tie-breakers.
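The numbering rule can be sketched as follows (class and method names are illustrative, not from the paper):

```python
# A node stamps each new REQUEST with a number greater than any it has
# sent, received, or observed; ties are broken by node ID.
class Requester:
    def __init__(self, node_id):
        self.node_id = node_id
        self.max_seen = 0

    def observe(self, seq):
        """Called for any REQUEST sent, received, or observed."""
        self.max_seen = max(self.max_seen, seq)

    def new_request(self):
        self.max_seen += 1
        return (self.max_seen, self.node_id)   # total order: (seq, id)

a, b = Requester("A"), Requester("B")
r1 = a.new_request()        # (1, 'A')
b.observe(r1[0])
r2 = b.new_request()        # (2, 'B'), ordered after r1
assert r1 < r2              # lexicographic order on (seq, id)
print(r1, r2)
```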
10. The Protocol - INQUIRE/RELINQUISH
[Figure: a member node INQUIREs the current locker; having already received a FAILED, the locker RELINQUISHes and the earlier request is granted.]
11. The Protocol - RELEASE
[Figure: after leaving the critical section, the holder sends RELEASE to its member nodes, which then grant the lock to the next pending request.]
12. Mutual Exclusion
- At most one node is in the critical section.
- If two nodes were in the critical section simultaneously, each must have received LOCKED messages from all the nodes in its member set.
- Since the intersection of the two member sets is non-empty, some node must have been locked for two nodes at once, which is impossible by design.
13. Liveness
- No closed waiting cycle can exist.
- In any cycle, the node blocking the least preceding request (in the total order) is forced to RELINQUISH.
- A node requesting the lock eventually gets it.
- There are at most K-1 preceding requests at each member node.
14. Complexity
- State: O(K) = O(sqrt(N)) per node.
- Messages per mutual exclusion: between 3(K-1) and 5(K-1)
- (K-1) REQUESTs
- at most (K-1) INQUIREs
- at most (K-1) RELINQUISHes
- (K-1) LOCKEDs
- (K-1) RELEASEs
15. Fault Tolerance
- Model: nodes may fail-stop.
- If a node fails, only its own state is lost, and the author claims that another node could take over. It is not clear how this would impact the mutual exclusion guarantees.
16. Building Member Sets (1/2)
- Build an approximate finite projective plane with K(K-1) + 1 points, where K is the smallest value with (K-1) a power of a prime such that K(K-1) + 1 >= N.
- Replace entries larger than N with entries smaller than N.
- Computational issues?
17. Building Member Sets (2/2)
- Place the nodes in a grid as square as possible.
- Member sets are the nodes in the same column or the same row.
[Figure: a grid of nodes; the member set of a node consists of its row i and its column j.]
- Guaranteed intersections: the row of one node always crosses the column of the other.
- Member sets of size about 2*sqrt(N) - 1.
- If N is not a square, we complete the last row with existing values.
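The grid construction can be sketched in a few lines of Python. For simplicity the sketch assumes N is a perfect square, skipping the last-row completion mentioned above.

```python
# Grid quorums: a node's member set is its whole row plus its whole column.
import math

def grid_quorum(i, n):
    """Member set of node i (0-based) among n = d*d nodes."""
    d = math.isqrt(n)
    assert d * d == n, "sketch assumes N is a perfect square"
    row, col = divmod(i, d)
    same_row = {row * d + c for c in range(d)}
    same_col = {r * d + col for r in range(d)}
    return same_row | same_col

n = 25
quorums = [grid_quorum(i, n) for i in range(n)]
# Each set has 2*sqrt(N) - 1 members (row and column share one node).
assert all(len(q) == 2 * math.isqrt(n) - 1 for q in quorums)
# Any two sets intersect: one node's row crosses the other's column.
assert all(quorums[a] & quorums[b] for a in range(n) for b in range(n))
print("all grid member sets intersect; size =", 2 * math.isqrt(n) - 1)
```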
18. Discussion
- Optimality? There exist logarithmic algorithms for mutual exclusion (e.g. Naimi and Trehel).
- Resilience to unexpected node loss? To churn?
- Is sqrt(N) low enough?
19. Replication Strategies in Unstructured Peer-to-Peer Networks
20. Introduction
- Goal: optimal replication of data in a decentralized unstructured P2P system (not, e.g., Napster or Chord).
- Question: how many times do we replicate each file?
- Two basic strategies:
- Uniform: files replicated a fixed number of times.
- Proportional: files replicated in proportion to demand.
- A continuum of strategies with an optimum: square-root replication, proportional to the square root of the query rate.
21. Simple Setting
- n nodes with capacity rho, m files with normalized query rates q_i and r_i copies. There are R = sum_i r_i file copies in total and p_i = r_i / R is the fraction of copies of a given file. Files have the same size.
- Assuming a query distribution q, we want to minimize the expected search size (ESS), which is proportional to sum_i q_i / p_i.
- Under the constraints sum_i p_i = 1 and every file keeping at least one copy.
22. Reductions to the simple setting
- In practice, queries stop after L probes. Assuming L is large enough w.r.t. the number of files, the error in the ESS is less than one.
- If the nodes' capacities are not uniform, we take the average capacity weighted by the visitation rates and obtain the same results. Larger files can be seen as multiple copies of a single-sized one.
23. Three basic allocations
- Uniform: p_i = 1/m.
- Proportional: p_i = q_i.
- Square-root: p_i = sqrt(q_i) / sum_j sqrt(q_j).
- An allocation lies in-between uniform and proportional when p_i grows with q_i, but more slowly than q_i itself.
- Under the constraints sum_i p_i = 1, p_i > 0.
24. Uniform v. proportional
- Uniform allocation:
- Minimizes resources spent on unsolvable queries.
- Proportional allocation:
- Minimizes the maximum utilization rate; the utilization rate is the average rate of requests a copy serves. Minimizing it prevents hotspots.
- Surprisingly, both have the same expected search size m/r!
25. The space of allocations
- An allocation strictly in-between the uniform and proportional allocations has a strictly lower expected search size.
- An allocation that lies strictly outside this range performs strictly worse.
- Assuming the constraints on the number of copies are satisfied, the square-root allocation minimizes the ESS.
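These claims are easy to check numerically. The sketch below assumes the simple-setting model above, where the expected search size is proportional to sum_i q_i / p_i; the query rates are made-up example values.

```python
# Compare the ESS of the uniform, proportional, and square-root allocations.
import math

q = [0.5, 0.25, 0.15, 0.07, 0.03]     # normalized query rates, m = 5 files
m = len(q)

def ess(p):
    """Expected search size, up to a constant factor."""
    return sum(qi / pi for qi, pi in zip(q, p))

uniform      = [1 / m] * m
proportional = q[:]
s            = sum(math.sqrt(qi) for qi in q)
square_root  = [math.sqrt(qi) / s for qi in q]

# Uniform and proportional both give exactly m; square-root gives
# (sum_i sqrt(q_i))^2, which is <= m with equality only for uniform q.
assert math.isclose(ess(uniform), m)
assert math.isclose(ess(proportional), m)
assert ess(square_root) < m
print(ess(uniform), ess(proportional), ess(square_root))
```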
26. How much can we gain? Theory
27. How much can we gain? Practice
28. Problem feasibility
- Proportional or square-root allocations may violate the constraints.
- We define square-root as the (unique) allocation proportional to the square root of the frequencies when possible, and clamped to the feasible minimum when not.
- This is optimal among the feasible allocations.
- Proportional is defined in a similar way (but is not very interesting).
29. How do we implement this in a distributed system?
- Assumptions: copies can be made after a query, and copies are deleted independently of each other and of their age (excluding e.g. LRU).
- Uniform: OK. Nodes don't need to know file query rates.
- Square-root: more complicated. How do we determine the number of copies Ci?
30. Square-root replication in practice
- Path replication:
- C = number of nodes searched. Works, but may undershoot/overshoot the target frequency if queries arrive in bursts.
- Replication with sibling memory:
- Each file tracks the number of times it has been directly copied. If we know the average frequency at which files disappear, we can compute an unbiased estimator of the file query rates.
- Replication with probe memory:
- Nodes keep track of the probes they receive and use those frequencies as file query rates, possibly aggregating along the search path.
31. Discussion
- Reactivity to new data items?
- What about structured distributed systems? Can we port some of these ideas?
- Strong assumptions about the life cycle of files. Too strong, perhaps?
- Which replication strategy is better for distributed caches, P2P television, Kazaa-like applications?
32. Reliable Communication in the Presence of Failures
- Kenneth Birman and Thomas Joseph, Cornell University
- ACM Transactions on Computer Systems, 1987
33. Agenda
- Overview
- Motivation and Background
- Assumptions and Definition of Terms
- System Components
- Communication primitives
- Summary
34. Overview
- A communication facility for distributed systems in which failures can occur
- Presents a set of communication primitives
- Applicable to local and wide-area networks
- Fault-tolerant process groups
- Consistent orderings of events
- Events delivered despite failures
35. Motivation & Background
- Design of a communication facility for distributed systems.
- Consistent event orderings.
- Optimized concurrency.
- This system is in use at the NYSE, the Swiss Exchange, and the French air traffic control system.
36. Assumptions and Definition of Terms
- Failure: a process stops without taking incorrect actions (fail-stop)
- Event orderings are controlled by the communication layer
- Fault tolerance: continued operation
- Failures are detected, and the other processes are notified
- Remote failures are detected using timeouts
- A logical approach, rather than a physical one:
- Pretend events took place either before or after the failure
- No communication among inconsistent processes
37. Definitions
- Fault-tolerant process group:
- A collection of processes that cooperate to perform a distributed computation and use the communication primitives described in this paper.
- Broadcast:
- Refers to the transmission of a message from a process to the members of a process group, not to all processes in the system.
38. Fault Tolerant Process Groups
- Processes cooperating to perform a distributed computation
- No shared memory or synchronized clocks
- Timers are used for timeouts
- Changes in membership are ordered with respect to events
- Members monitor each other
39. Managing Group Membership
- View (process group view):
- A snapshot of the membership and global properties of a process group
- A local copy of the view list is kept at all sites
- View Manager:
- The oldest site; calculates view extensions
- View Extension:
- Current view plus one extension; other changes can get added on
- Site Manager
[Figure: sites S1-S4; site S2 hosts processes S2-1, S2-2, and S2-3.]
40. Communication Primitives
- Messages are sent only to members of the group
- Members can be at the same site or remote
- GBCAST: group broadcast
- ABCAST: atomic broadcast
- CBCAST: causal broadcast
- All are atomic
41. GBCAST (action, G)
- Broadcasts membership changes
- Issued by a coordinator using two-phase commit:
- The coordinator calculates the change
- If the change is received consistently: acknowledge, then commit
- If the change doesn't match: NACK with the missing events
- Delivered after all messages from a failed member
- A failed process will never be heard from again
- If declared dead, a process must go through recovery
- Ordered consistently with ABCAST and CBCAST, using a priority queue
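A toy sketch of the two-phase view change, under strong simplifying assumptions (no failures during the protocol, no missing-event NACK recovery; all names are illustrative, not the paper's API):

```python
# The coordinator proposes the new view, collects ACKs, then commits;
# sites install the view only on commit (stable storage elided).
class Site:
    def __init__(self, view):
        self.view = view            # committed view
        self.proposed = None

    def on_gbcast(self, new_view):
        """Phase 1: accept the proposal if it shrinks the current view."""
        if set(new_view) < set(self.view):   # only departures in this toy
            self.proposed = new_view
            return "ACK"
        return "NACK"

    def on_commit(self):
        """Phase 2: install the proposed view."""
        self.view = self.proposed

def gbcast_view_change(coordinator_view, failed, sites):
    new_view = [p for p in coordinator_view if p != failed]
    if all(s.on_gbcast(new_view) == "ACK" for s in sites):
        for s in sites:
            s.on_commit()
    return new_view

sites = [Site(["C", "P1", "P2", "P3", "P4"]) for _ in range(3)]
new = gbcast_view_change(["C", "P1", "P2", "P3", "P4"], "P1", sites)
assert all(s.view == ["C", "P2", "P3", "P4"] for s in sites)
print("committed view:", new)
```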
42. Normal GBCAST
[Figure: the coordinator GBCASTs "P1 down" to P2, P3, and P4. Each recipient compares its current view (C, P1, P2, P3, P4) with the new view, saves it to stable storage, and ACKs; the coordinator then commits. New view: (C, P2, P3, P4), P1 down.]
43. GBCAST: Coordinator Fails (1)
[Figure: the same GBCAST of "P1 down", but the coordinator fails partway: some sites have already ACKed and committed the new view (C, P2, P3, P4), P1 down, while others have not.]
44. GBCAST: Coordinator Fails (2)
[Figure: a new coordinator takes over and GBCASTs "C down" together with the pending "P1 down" change. Each site compares its current view (C, P1, P2, P3, P4) with the new one, notes the 2 changes, and saves it to stable storage. New view: (P2, P3, P4), P1 and C down.]
45. ABCAST (msg, label, dests)
- Assures messages are received in the same order everywhere
- Issued by the sender of the message
- Each recipient queues the message, assigns it a priority, tags it undeliverable, and replies
- The sender collects the responses, computes the maximum, and sends that value back
- Recipients adopt the final priority, tag the message deliverable, re-sort the queue, and transfer messages to the delivery queue in order
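The steps above can be sketched as a toy simulation: a single label, no failures, and synchronous phases. The class and function names are illustrative, not the paper's interface.

```python
# Two-phase ABCAST priority scheme: recipients propose priorities,
# the sender picks the maximum, and everyone delivers in final order.
class Recipient:
    def __init__(self):
        self.next_priority = 0
        self.pending = {}            # msg -> (priority, deliverable?)
        self.delivered = []

    def propose(self, msg):
        """Phase 1: queue msg as undeliverable, propose a priority."""
        self.next_priority += 1
        self.pending[msg] = (self.next_priority, False)
        return self.next_priority

    def commit(self, msg, final_priority):
        """Phase 2: adopt the sender's maximum and mark deliverable."""
        self.pending[msg] = (final_priority, True)
        self.next_priority = max(self.next_priority, final_priority)
        # Deliver the prefix of the priority queue that is deliverable.
        for m, (p, ok) in sorted(self.pending.items(), key=lambda kv: kv[1][0]):
            if not ok:
                break
            self.delivered.append(m)
            del self.pending[m]

def abcast(msg, recipients):
    final = max(r.propose(msg) for r in recipients)  # sender computes max
    for r in recipients:
        r.commit(msg, final)

group = [Recipient(), Recipient(), Recipient()]
abcast("a", group)
abcast("b", group)
# Every recipient delivers the messages in the same order.
assert all(r.delivered == ["a", "b"] for r in group)
print("identical delivery order at all recipients:", group[0].delivered)
```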
46. ABCAST (msg, label, dests)
[Figure: the sender transmits the message to its destinations. Each recipient adds the message to the priority queue associated with the label, tags it as undeliverable, assigns a priority, and informs the sender. The sender computes the maximum priority and sends it back to the recipients. Recipients change the priority, tag the message as deliverable, and transfer messages to the delivery queue in increasing order of priority.]
47. CBCAST (msg, clabel, dests)
- Ensures relative ordering when necessary
- clabels are either comparable or incomparable
- Messages with no common destinations need no comparison
- Causally previous messages are included in the transmission
- Optimizations are possible:
- Intersite packets
- A common message pool with pointers into it
- Flags track where each message has been sent
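The piggybacking idea can be sketched as follows, heavily simplified: one flat per-node buffer, no garbage collection, no sent-to flags, and every message addressed to everyone.

```python
# A CBCAST message carries its undelivered causal predecessors, so a
# recipient can deliver them first and never sees an effect before its cause.
class CbcastNode:
    def __init__(self):
        self.buffer = []             # messages sent/seen, in causal order
        self.delivered = []

    def cbcast(self, msg):
        packet = self.buffer + [msg] # piggyback causal predecessors
        self.buffer.append(msg)
        return packet

    def receive(self, packet):
        for msg in packet:           # predecessors come first by construction
            if msg not in self.delivered:
                self.delivered.append(msg)
                self.buffer.append(msg)

p, q = CbcastNode(), CbcastNode()
pkt1 = p.cbcast("m1")                # m1 causally precedes m2 at p
pkt2 = p.cbcast("m2")
q.receive(pkt2)                      # pkt2 arrives first, but carries m1
assert q.delivered == ["m1", "m2"]   # causal order preserved anyway
print(q.delivered)
```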
48. CBCAST (msg, clabel, dests)
[Figure: transmission of B from BUF_P to BUF_Q. A transfer packet (B1, B2, ...) is sent to q and includes every B' in BUF_P that causally precedes B and has not yet been delivered to all its destinations; for i < j, Bi precedes Bj. At q, the Bi are placed in BUF_Q; if Bi was destined for q, it is also placed in the delivery queue.]
49. View Management using GBCAST
[Figure: P5 fails. The view manager P1 sends the new view (P1, P2, P3, P4) to all sites; each site records the view to stable storage, ceases to accept messages from P5, and replies with a positive or negative ACK. After receiving the positive ACKs, the view manager sends a commit message to all sites.]
50. Summary
- Communication protocols for distributed systems
- Defined process groups and broadcast primitives
- Failures can be tolerated
- Members have a consistent view
- Used in the ISIS project at Cornell University
51. Discussion
- How well does the protocol perform with frequent failures/recoveries within a cluster?
- Can we use gossip techniques/spanning trees to avoid placing all the transmission burden on the broadcasting node?
- Compare the replication strategies of DHTs (Chord, Pastry, etc.) with the replication strategies this paper presents.
- Could we implement consensus? Why?
52. Thank you!