Title: UBI529 Distributed Algorithms
1. UBI529 Distributed Algorithms
Global State of Distributed Systems
2. Motivation
- Goal: take a snapshot of the global computation
- A snapshot of the local states of n processes, taken at exactly the same time
- Two terms: global state and global snapshot
- Useful for debugging
- Useful for backup/checkpointing
- Useful for evaluating a global predicate
  - E.g., exactly how much currency do we have in the country (notice that money flows among people constantly)?
- Deadlock detection
- Rollback recovery
- Termination detection
3. Global state
- Global state: a set of local states that are concurrent with each other
- Concurrent states: no two states are related by the happened-before relation
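4. The mystery of the missing dollars
[Figure: accounts A (balance 400) and B (balance 300); A sends 100 to B.]
- Picture taken at A: 400
- A sends 100 to B
- Picture taken at B: 400
- The recorded total is 800, although the two accounts hold only 700 between them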
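The arithmetic of this scenario can be replayed in a few lines. This is only an illustrative sketch: the class `MissingDollars` and its method are hypothetical names, and the starting balances (A = 400, B = 300) are assumed from the figure labels.

```java
final class MissingDollars {
    /** Returns {total seen by the two uncoordinated pictures, true total}. */
    static int[] uncoordinatedPictures() {
        int a = 400, b = 300;   // starting balances, assumed from the figure
        int pictureA = a;       // picture taken at A before the transfer
        a -= 100;               // A sends 100 ...
        b += 100;               // ... and B receives it
        int pictureB = b;       // picture taken at B after the transfer
        return new int[] { pictureA + pictureB, a + b };
    }
}
```

The first element counts the transferred 100 twice, once at A and once at B, which is exactly the inconsistency a coordinated snapshot has to rule out.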
5. Global Snapshot Problem
- Determine the global system state (e.g., the total money)
- Each process records its own state
- No shared clock or shared memory
- Analogy: a group of photographers each take snaps of a different portion of a scene and try to combine them to get the overall picture.
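6. Consistent cut
- Given a computation $(E, \rightarrow)$, a subset $F \subseteq E$ is a cut iff it is closed under each process's local order: if $f \in F$ and $e$ precedes $f$ on the same process, then $e \in F$.
- $F$ is a consistent cut (global snapshot) iff it is closed under happened-before: $\forall e, f \in E: (f \in F) \land (e \rightarrow f) \Rightarrow e \in F$.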
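7. Consistent and inconsistent cuts
8. Consistent cut
A cut is a set of events.
- $(a \in \text{consistent cut } C) \land (b \text{ happened before } a) \Rightarrow b \in C$
[Figure: space-time diagram of processes P1, P2 and P3 with events a, b, c, d, e, f, g, h, i, j, k, m. Two cuts, Cut 1 and Cut 2, are drawn; the labels read "(Not consistent)" and "(Consistent)".]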
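The closure condition above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the class name `ConsistentCut` and the encoding of happened-before as a map of direct predecessors are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

final class ConsistentCut {
    /**
     * directlyBefore.get(a) holds the events that immediately precede a:
     * the previous event on the same process and, for a receive, the matching send.
     * A cut that contains all direct predecessors of its events also contains
     * all of their transitive predecessors.
     */
    static boolean isConsistent(Set<String> cut, Map<String, Set<String>> directlyBefore) {
        for (String a : cut) {
            for (String b : directlyBefore.getOrDefault(a, Collections.<String>emptySet())) {
                if (!cut.contains(b)) {
                    return false;   // b happened before a, yet b is missing from the cut
                }
            }
        }
        return true;
    }
}
```

For example, a cut that contains the receipt of a message but not its send is rejected, while a cut that contains the send alone is accepted.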
9. Consistent snapshot
- The set of states immediately following a consistent cut forms a consistent snapshot of a distributed system.
- The snapshot of practical interest is the most recent one. Let C1 and C2 be two consistent cuts with $C_1 \subset C_2$. Then C2 is more recent than C1.
- Analyze why certain cuts in the one-dollar bank are inconsistent.
10. Consistent snapshot
- How do we record a consistent snapshot? Note that:
  - 1. The recording must be non-invasive.
  - 2. The recording must be done on the fly; you cannot stop the system.
11. Chandy-Lamport Algorithm
- Assumes:
  - FIFO channels
  - Unidirectional channels; a bidirectional channel is modelled as two unidirectional channels
- Each process has an associated color. All processes are initially white.
- A process records its local state just before turning red.
- On turning red, the process sends out a marker on all outgoing channels.
- On receiving a marker, a white process turns red.
12. Chandy-Lamport Algorithm
- Works on
  - (1) a strongly connected graph in which
  - (2) each channel is FIFO.
- An initiator initiates the algorithm by sending out a marker.
13. White and red processes
- Initially every process is white. When a process receives a marker, it turns red if it has not already done so.
- Every action by a process, and every message sent by a process, gets the color of that process.
14. Two steps
- Step 1. In one atomic action, the initiator (a) turns red, (b) records its own state, and (c) sends a marker along all outgoing channels.
- Step 2. Every other process, upon receiving a marker for the first time (and before doing anything else), (a) turns red, (b) records its own state, and (c) sends markers along all outgoing channels.
- The algorithm terminates when (1) every process has turned red, and (2) every process has received a marker through each incoming channel.
15. Why does it work?
- Lemma 1. No red message is received in a white action.
16. Why does it work?
[Figure: an easy conceptualization of the snapshot state SSS, with all white actions before it and all red actions after it.]
- Theorem. The global state recorded by the Chandy-Lamport algorithm is equivalent to the ideal snapshot state SSS.
- Hint. A pair of actions (a, b) can be scheduled in any order if there is no causal order between them, so (a; b) is equivalent to (b; a).
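17. Why does it work?
Let an observer observe the following actions:
$w_i\; w_k\; r_k\; w_j\; r_i\; w_l\; r_j\; r_l$
$\equiv w_i\; w_k\; w_j\; r_k\; r_i\; w_l\; r_j\; r_l$ (Lemma 1)
$\equiv w_i\; w_k\; w_j\; r_k\; w_l\; r_i\; r_j\; r_l$ (Lemma 1)
$\equiv w_i\; w_k\; w_j\; w_l\; r_k\; r_i\; r_j\; r_l$ (done!)
The recorded state is the one that lies between the last white action and the first red action.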
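The rewriting steps above amount to repeatedly swapping an adjacent (red, white) pair. The sketch below replays that on action names; the class `SnapshotReordering` is hypothetical, and the code simply assumes every such swap is legal (which is what Lemma 1 provides) rather than checking causality itself.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotReordering {
    /** Actions are named like "wi" (white action of process i) or "rk" (red action of process k). */
    static List<String> whitesFirst(List<String> observed) {
        List<String> seq = new ArrayList<>(observed);
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int p = 0; p + 1 < seq.size(); p++) {
                // A red action directly followed by a white action: swap the pair.
                if (seq.get(p).startsWith("r") && seq.get(p + 1).startsWith("w")) {
                    String red = seq.get(p);
                    seq.set(p, seq.get(p + 1));
                    seq.set(p + 1, red);
                    swapped = true;
                }
            }
        }
        return seq;
    }
}
```

Because only adjacent red/white pairs are exchanged, the relative order of the white actions and the relative order of the red actions are both preserved.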
18. Example 1. Count the tokens
- Let us verify that the Chandy-Lamport snapshot algorithm correctly counts the tokens circulating in the system.
[Figure: four processes A, B, C and D.]
How to account for the channel states? Use sent and received variables for each process.
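One common way to use such counters is sketched below. It rests on an assumption the slide does not spell out: the recorded state of channel (i, j) is taken to be the number of tokens i had sent on it when i recorded, minus the number j had received from it when j recorded. The class `TokenCount` and its parameter names are hypothetical.

```java
final class TokenCount {
    /**
     * held[i]        tokens held by process i when it recorded its state
     * sent[i][j]     tokens i had sent on channel (i, j) when i recorded
     * received[i][j] tokens j had received from channel (i, j) when j recorded
     */
    static int total(int[] held, int[][] sent, int[][] received) {
        int total = 0;
        for (int h : held) {
            total += h;
        }
        for (int i = 0; i < sent.length; i++) {
            for (int j = 0; j < sent[i].length; j++) {
                total += sent[i][j] - received[i][j];   // tokens still in transit on (i, j)
            }
        }
        return total;
    }
}
```

With a single token, the count comes out as 1 whether the receiver recorded before the token arrived (the token is attributed to the channel) or after (the token is attributed to the receiver).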
19. Chandy-Lamport Algorithm
20. Algorithm
    public class RecvCamera extends Process implements Camera {
        // . . .
        public RecvCamera(Linker initComm, CamUser app) {
            // . . .
            for (int i = 0; i < N; i++) {
                if (isNeighbor(i)) {
                    closed[i] = false;            // no marker received on this channel yet
                    chan[i] = new LinkedList();
                } else {
                    closed[i] = true;
                }
            }
        }

        public synchronized void globalState() {
            myColor = red;
            app.localState();                     // record local state
            sendToNeighbors("marker", myId);      // send markers
        }

        public synchronized void handleMsg(Msg m, int src, String tag) {
            if (tag.equals("marker")) {
                if (myColor == white) globalState();
                closed[src] = true;
                if (isDone()) {
                    // ----- Display channel state (transit messages) chan ----
                }
            } else {                              // application message
                if ((myColor == red) && (!closed[src])) {
                    chan[src].add(m);             // received while red, before the marker on this channel
                }
                app.handleMsg(m, src, tag);       // give it to app
            }
        }

        boolean isDone() {
            if (myColor == white) return false;
            for (int i = 0; i < N; i++) {
                if (!closed[i]) return false;
            }
            return true;
        }
    }
21. Lai-Yang Algorithm
- LY1. The initiator records its own state. When it needs to send a message m to another process, it sends a message (m, red).
- LY2. When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
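Rules LY1 and LY2 can be sketched as follows. This is an illustrative toy, not the full algorithm: the class `LaiYangProcess` is hypothetical, the application state is a single integer balance, a message is an `int[]` of {payload, color flag}, and it is assumed that every process that has already recorded (not only the initiator) tags its outgoing messages red. Recording of channel states is not shown.

```java
final class LaiYangProcess {
    boolean red = false;            // every process starts white
    int state;                      // application state (a single balance here)
    Integer recordedState = null;   // local snapshot, once taken

    LaiYangProcess(int initialState) {
        state = initialState;
    }

    /** LY1: the initiator records its own state; later messages carry the red tag. */
    void initiate() {
        record();
    }

    /** Returns {payload, color flag}; the flag is 1 for red and 0 for white. */
    int[] send(int amount) {
        state -= amount;
        return new int[] { amount, red ? 1 : 0 };
    }

    /** LY2: a red message makes the receiver record before it accepts the payload. */
    void receive(int[] message) {
        if (message[1] == 1) {
            record();
        }
        state += message[0];
    }

    private void record() {
        if (!red) {
            red = true;
            recordedState = state;
        }
    }
}
```

Replaying the transfer of 100 from an account holding 400 to one holding 300, with the sender as initiator, records 400 and 300: the receiver takes its snapshot before the red message is applied, so the transfer is not counted twice.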
22. Another example of distributed snapshot
Communicating State Machines
23. Something unusual
- Let machine i start the Chandy-Lamport snapshot before it has sent M along ch1. Also, let machine j receive the marker after it sends out M along ch2. Observe that the snapshot state is (down, $\emptyset$, up, M).
- Doesn't this appear strange? This state was never reached during the computation!
24. Understanding snapshot
25. Understanding snapshot
The observed state is a feasible state that is
reachable from the initial configuration. It may
not actually be visited during a specific
execution. The final state of the original
computation is always reachable from the
observed state.
26. Discussions
- What good is a snapshot if that state has never been visited by the system?
  - It is relevant for the detection of stable predicates.
  - It is useful for checkpointing.
27. Discussions
- What if the channels are not FIFO?
- Study how the Lai-Yang algorithm works. It does not use any marker.
  - LY1. The initiator records its own state. When it needs to send a message m to another process, it sends a message (m, red).
  - LY2. When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
- Question 1. Why will it work?
- Question 2. Are there any limitations of this approach?
28. Global state collection
- Some applications:
  - computing network topology
  - termination detection
  - deadlock detection
- The Chandy-Lamport algorithm does a partial job. Each process collects a fragment of the global state, but these pieces have to be stitched together to form a global state.
29. A simple exercise
- Once the pieces of a consistent global state become available, consider collecting the global state via all-to-all broadcast.
- At the end, each process will compute a set V, where $V = \{\, s(i) : 0 \le i \le N-1 \,\}$.
[Figure: four processes i, j, k and l holding the local states s(i), s(j), s(k) and s(l).]
30. All-to-all broadcast
Assume that the topology is a strongly connected graph.
- program broadcast (for process i)
- define V.i, W.i: set of values
- initially $V.i = \{s(i)\}$, $W.i = \emptyset$, and every channel is empty
- do $V.i \ne W.i \rightarrow$ send $(V.i \setminus W.i)$ to every outgoing channel; $W.i := V.i$
- [] $\neg\,\mathit{empty}(k, i) \rightarrow$ receive X from channel (k, i); $V.i := V.i \cup X$
- od
[Figure: process i (V.i, W.i) connected by channel (i,k) to process k (V.k, W.k). Acts like a pump.]
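A sequential simulation can make the two guarded actions concrete. The sketch below is an assumption-laden toy rather than a distributed implementation: the class `AllToAllBroadcast` is hypothetical, s(i) is modelled as the integer i, the incoming channels of a process are merged into one inbox (order does not matter for set union), and processes are scheduled round-robin.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AllToAllBroadcast {
    /** out[i] lists the processes at the far end of i's outgoing channels. */
    static List<Set<Integer>> run(int[][] out) {
        int n = out.length;
        List<Set<Integer>> v = new ArrayList<>();
        List<Set<Integer>> w = new ArrayList<>();
        List<Deque<Set<Integer>>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Set<Integer> init = new HashSet<>();
            init.add(i);                          // V.i = {s(i)}
            v.add(init);
            w.add(new HashSet<>());               // W.i = empty set
            inbox.add(new ArrayDeque<>());        // every channel is empty
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < n; i++) {
                if (!v.get(i).equals(w.get(i))) { // guard: V.i differs from W.i
                    Set<Integer> fresh = new HashSet<>(v.get(i));
                    fresh.removeAll(w.get(i));    // V.i \ W.i
                    for (int k : out[i]) {
                        inbox.get(k).add(new HashSet<>(fresh));
                    }
                    w.get(i).addAll(fresh);       // W.i := V.i
                    progress = true;
                }
                while (!inbox.get(i).isEmpty()) { // guard: some incoming channel is not empty
                    v.get(i).addAll(inbox.get(i).poll());
                    progress = true;
                }
            }
        }
        return v;
    }
}
```

On a three-node directed ring, `run(new int[][] { {1}, {2}, {0} })` ends with every V.i equal to {0, 1, 2}.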
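31. Proof
- Lemma. $\mathit{empty}(i, k) \Rightarrow W.i \subseteq V.k$.
- Upon termination, $\forall i: V.i = W.i$, and all channels are empty.
- So, $V.i \subseteq V.k$ for every channel (i, k).
- On a cyclic path, $V.i = V.k$ must be true. Since $s(i) \in V.i$, it follows that $s(i) \in V.k$.
[Figure: process i (V.i, W.i) connected by channel (i,k) to process k (V.k, W.k).]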
32. Acknowledgements
- This part is heavily dependent on Dr. Sukumar Ghosh's Distributed Systems course 22C166 at the University of Iowa.
34. Termination Detection and Deadlocks
35. Termination detection
- During the progress of a distributed computation, processes may periodically turn active or passive.
- A distributed computation terminates when
  - (a) every process is passive,
  - (b) all channels are empty, and
  - (c) the global state satisfies the desired postcondition.
36. Visualizing diffusing computation
[Figure: a diffusing computation started by an initiator; nodes are marked active or passive.]
Notice how one process engages another process. Eventually all processes turn white, and no message is in transit; this signals termination. How can we develop a signaling mechanism to detect termination?
37. Dijkstra-Scholten algorithm
The basic scheme
- An initiator initiates termination detection by sending signals (messages) down the edges via which it engages other nodes.
- At a suitable time, the recipient sends an ack back.
- When the initiator receives an ack from every node that it engaged, it detects termination.
[Figure: node j sends a signal to node k; later, node k returns an ack to node j.]
38. Dijkstra-Scholten algorithm
- $\mathit{Deficit}(e) = (\text{number of signals on edge } e) - (\text{number of acks on edge } e)$
- For any node, C = total deficit along incoming edges and D = total deficit along outgoing edges.
- For the initiator, by definition, $C = 0$.
- Dijkstra and Scholten used the following two invariants to develop their algorithm:
  - Invariant 1. $(C \ge 0) \land (D \ge 0)$
  - Invariant 2. $(C > 0) \lor (D = 0)$
[Figure: example graph with nodes 0 to 5.]
39. Dijkstra-Scholten algorithm
- The invariants must hold when an interim node sends an ack.
- So, acks will be sent when
  - $(C - 1 \ge 0) \land (C - 1 > 0 \lor D = 0)$ (follows from INV1 and INV2)
  - $= (C > 1) \lor (C \ge 1 \land D = 0)$
  - $= (C > 1) \lor (C = 1 \land D = 0)$
[Figure: example graph with nodes 0 to 5.]
40. Dijkstra-Scholten algorithm
- program detect (for an internal node i)
- initially $C = 0$, $D = 0$, parent = i
- do
  - $m = \text{signal} \land (C = 0) \rightarrow$ $C := 1$; state := active; parent := sender
    (this node can send out messages to engage other nodes, or turn passive)
  - [] $m = \text{ack} \rightarrow$ $D := D - 1$
  - [] $(C = 1) \land (D = 0) \land \text{state} = \text{passive} \rightarrow$ send ack to parent; $C := 0$; parent := i
  - [] $m = \text{signal} \land (C = 1) \rightarrow$ send ack to the sender
- od
[Figure: example graph with nodes 0 to 5.]
Note that the engaged nodes induce a spanning tree.
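The guarded actions of an internal node can be sketched as a small state holder. Everything here is illustrative: the class `DsNode` and its method names are hypothetical, message transport is left out, and `engage()` encodes the assumption that sending a signal to another node raises the outgoing deficit D by one.

```java
final class DsNode {
    final int id;
    int c = 0;                // deficit on incoming edges
    int d = 0;                // deficit on outgoing edges
    int parent;
    boolean active = false;

    DsNode(int id) {
        this.id = id;
        this.parent = id;     // initially parent = i
    }

    /** Returns true when the signal must be acknowledged at once (node already engaged). */
    boolean onSignal(int sender) {
        if (c == 0) {
            c = 1;
            active = true;
            parent = sender;
            return false;
        }
        return true;
    }

    /** Engaging another node sends it a signal (assumed to increment D). */
    void engage() {
        d++;
    }

    void onAck() {
        d--;
    }

    void turnPassive() {
        active = false;
    }

    /** Returns the parent to acknowledge when C = 1, D = 0 and the node is passive; otherwise -1. */
    int tryDetach() {
        if (c == 1 && d == 0 && !active) {
            int p = parent;
            c = 0;
            parent = id;
            return p;
        }
        return -1;
    }
}
```

A node that has engaged a child cannot detach from its parent until the child's ack arrives, which is how acks flow back up the tree of engaged nodes towards the initiator.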
41. Distributed deadlock
- Assume each process owns a few resources, and review how resources are allocated.
- Why do deadlocks occur?
  - Exclusive (i.e., not shared) resources
  - Non-preemptive scheduling
  - Circular waiting by all or a subset of processes
42. Distributed deadlock
- Three aspects of deadlock:
  - deadlock detection
  - deadlock prevention
  - deadlock recovery
43. Distributed deadlock
- May occur due to bad designs or bad strategy
- Sometimes prevention is more expensive than detection and recovery, so designs may not care about deadlocks, particularly if they are rare.
- May be caused by failures or perturbations in the system
44. Wait-for Graph (WFG)
- Represents who waits for whom.
- No single process can see the WFG.
- Review how the WFG is formed.
45. Another classification
- Resource deadlock
  - R1 AND R2 AND R3
  - also known as AND deadlock
- Communication deadlock
  - R1 OR R2 OR R3
  - also known as OR deadlock
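46. Detection of resource deadlock
- Notations
  - $w[j] = \text{true} \equiv$ (j is waiting)
  - $\mathit{depend}[j, i] = \text{true} \equiv j \in \mathit{succ}^n(i)$ for some $n > 0$
  - P(i,s,k) is a probe (i = initiator, s = sender, k = receiver)
[Figure: wait-for graph on nodes 1 to 4; node 4, the initiator, sends the probe P(4,4,3).]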
47. Detection of resource deadlock
- Program for process k
- do
  - P(i,s,k) received $\land\; w[k] \land (k \ne i) \land \neg\,\mathit{depend}[k, i] \rightarrow$ send P(i,k,j) to each successor j; $\mathit{depend}[k, i] := \text{true}$
  - [] P(i,s,k) received $\land\; w[k] \land (k = i) \rightarrow$ process k is deadlocked
- od
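The probe rules can be exercised on a small wait-for graph with a sequential simulation. This is a sketch under stated assumptions, not a distributed implementation: the class `EdgeChasing` is hypothetical, probes travel through a single queue, a process is taken to be waiting exactly when it has at least one successor, and only one initiator is simulated at a time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class EdgeChasing {
    /** succ[k] lists the processes that k is waiting for. Returns true if the initiator detects deadlock. */
    static boolean detects(int[][] succ, int initiator) {
        boolean[] depend = new boolean[succ.length];  // depend[k] stands for depend[k, initiator]
        Deque<Integer> probes = new ArrayDeque<>();   // receivers of in-flight probes
        for (int j : succ[initiator]) {
            probes.add(j);                            // initiator sends P(i, i, j) to each successor
        }
        while (!probes.isEmpty()) {
            int k = probes.poll();
            boolean waiting = succ[k].length > 0;     // w[k]
            if (!waiting) {
                continue;                             // a process that is not waiting drops the probe
            }
            if (k == initiator) {
                return true;                          // the probe came back: k is deadlocked
            }
            if (!depend[k]) {
                depend[k] = true;
                for (int j : succ[k]) {
                    probes.add(j);                    // forward P(i, k, j)
                }
            }
        }
        return false;
    }
}
```

For the graph 0 → 1 → 2 → 0 with an extra edge 3 → 0, initiator 0 detects the deadlock while initiator 3 does not, since 3 waits on the cycle without being part of it.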
48. Observations
- To detect deadlock, the initiator must be in a cycle.
- Message complexity: $O(|E|)$, where E = set of edges (edge-chasing algorithm)
- Should the links be FIFO?
49. Communication deadlock
[Figure: example wait-for graph.]
The graph in the figure has a resource deadlock but no communication deadlock.
50. Detection of communication deadlock
- A process ignores a probe if it is not waiting for any process. Otherwise:
  - First probe: mark the sender as parent and forward the probe to the successors.
  - Not the first probe: send an ack to that sender.
  - Ack received from every successor: send an ack to the parent.
- Communication deadlock is detected if the initiator receives an ack.
Has many similarities with the Dijkstra-Scholten termination detection algorithm.
51. Distributed deadlock
- May occur due to faulty design or resource sharing problems
- Sometimes prevention is more expensive than detection and recovery, so certain designs deliberately do not care about deadlocks, particularly if they are rare.
- Sometimes failures or perturbations can modify the system state and cause deadlock.
Major issues: detection, prevention, recovery
55. Detection of resource deadlock
Chandy-Misra-Haas algorithm
- Program for process k
- do
  - P(i,s,k) received $\land\; w[k] \land (k \ne i) \land \neg\,\mathit{depend}[k, i] \rightarrow$ send P(i,k,j) to each successor j; $\mathit{depend}[k, i] := \text{true}$
  - [] P(i,s,k) received $\land\; w[k] \land (k = i) \rightarrow$ process k is deadlocked
- od
57. Communication deadlock
[Figure: a wait-for graph drawn with black nodes and black edges, plus node 5 and a red edge (4,5).]
The subgraph of the WFG consisting of black nodes and black edges has a resource deadlock as well as a communication deadlock. However, if we add node 5 and the red edge (4,5), then the communication deadlock will disappear.
59. Acknowledgements
- This part of the slides is almost entirely dependent on Dr. Sukumar Ghosh's Distributed Systems course 22C166 at the University of Iowa.