Title: Distributed Systems: Atomicity, decision making, snapshots
1. Distributed Systems: Atomicity, decision making, snapshots
2. Announcements
- Please complete course evaluations
  - http://www.engineering.cornell.edu/CourseEval/
- Prelim II coming up this week
  - Thursday, April 26th, 7:30-9:00pm, 1½ hour exam
  - 101 Phillips
  - Closed book, no calculators/PDAs
  - Bring ID
- Topics
  - Since last Prelim, up to (and including) Monday, April 23rd - Lectures 19-34, chapters 10-18 (7th ed)
- Review Session Tuesday, April 24th
  - During second half of 415 Section
- Homework 6 (and solutions) available via CMS
  - Do it without looking at the solutions; however, it will not be graded
3. Review: What time is it?
- In a distributed system we need practical ways to deal with time
  - E.g. we may need to agree that update A occurred before update B
  - Or offer a lease on a resource that expires at time 10:10.0150
  - Or guarantee that a time-critical event will reach all interested parties within 100ms
4. Review: Event Ordering
- Problem: distributed systems do not share a clock
  - Many coordination problems would be simplified if they did (first one wins)
- Distributed systems do have some sense of time
  - Events in a single process happen in order
  - Messages between processes must be sent before they can be received
- How helpful is this?
5. Review: Happens-before
- Define a happens-before relation (denoted by →):
  - 1) If A and B are events in the same process, and A was executed before B, then A → B.
  - 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
  - 3) If A → B and B → C, then A → C.
6. Review: Total ordering?
- Happens-before gives a partial ordering of events
- We still do not have a total ordering of events
  - We are not able to order events that happen concurrently
  - Concurrent if (not A → B) and (not B → A)
7Review Partial Ordering
Pi -gtPi1 Qi -gt Qi1 Ri -gt Ri1
R0-gtQ4 Q3-gtR4 Q1-gtP4 P1-gtQ2
8. Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
9. Review: Timestamps
- Assume each process has a local logical clock that ticks once per event, and that the processes are numbered
  - Clocks tick once per event (including message send)
  - When you send a message, send your clock value
  - When you receive a message, set your clock to MAX(your clock, timestamp of message + 1)
    - Thus sending comes before receiving
    - The only visibility into actions at other nodes happens during communication; communication synchronizes the clocks
- If the timestamps of two events A and B are the same, then use the process identity numbers to break ties.
- This gives a total ordering!
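The rules above can be sketched in a few lines of Python. This is a toy illustration; the class and event names are not from the lecture:

```python
# A minimal sketch of Lamport logical clocks with process-ID tie-breaking.

class LamportProcess:
    def __init__(self, pid):
        self.pid = pid      # process identity number, used to break ties
        self.clock = 0      # local logical clock

    def local_event(self):
        self.clock += 1
        return (self.clock, self.pid)

    def send(self):
        # Sending is an event; ship the clock value with the message.
        self.clock += 1
        return self.clock

    def receive(self, msg_timestamp):
        # Set your clock to MAX(your clock, timestamp of message + 1).
        self.clock = max(self.clock, msg_timestamp + 1)
        return (self.clock, self.pid)

# Timestamps (clock, pid) compare lexicographically, giving a total order.
p, q = LamportProcess(0), LamportProcess(1)
a = p.local_event()          # (1, 0)
ts = p.send()                # p's clock becomes 2
b = q.receive(ts)            # q's clock jumps to max(0, 2 + 1) = 3, so (3, 1)
assert a < b                 # sending comes before receiving
```

Because ties in clock value are broken by the process number, any two distinct events compare unequal, which is exactly what makes the ordering total.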
10. Review: Distributed Mutual Exclusion
- Want mutual exclusion in a distributed setting
- The system consists of n processes; each process Pi resides at a different processor
- Each process has a critical section that requires mutual exclusion
- Problem: cannot use an atomic testAndSet primitive, since memory is not shared and processes may be on physically separated nodes
- Requirement
  - If Pi is executing in its critical section, then no other process Pj is executing in its critical section
- Compare three solutions
  - Centralized Distributed Mutual Exclusion (CDME)
  - Fully Distributed Mutual Exclusion (DDME)
  - Token passing
11. Today
- Atomicity and Distributed Decision Making
- What time is it now?
- Synchronized clocks
- What does the entire system look like at this moment?
12. Atomicity
- Recall
  - Atomicity: either all the operations associated with a program unit are executed to completion, or none are performed.
- In a distributed system we may have multiple copies of the data
  - (e.g. replicas are good for reliability/availability)
- PROBLEM: How do we atomically update all of the copies?
  - That is, either all replicas reflect a change or none do
13. Generals Paradox
- Generals paradox
  - Constraints of the problem
    - Two generals, on separate mountains
    - Can only communicate via messengers
    - Messengers can be captured
  - Problem: need to coordinate attack
    - If they attack at different times, they all die
    - If they attack at the same time, they win
  - Named after Custer, who died at Little Bighorn because he arrived a couple of days too early!
- Can messages over an unreliable network be used to guarantee that two entities do something simultaneously?
  - Remarkably, no, even if all messages get through
  - No way to be sure the last message gets through!
14. Replica Consistency Problem: Concurrent and conflicting updates
- Imagine we have multiple bank servers and a client desiring to update their bank account
- How can we do this?
- Allow a client to update any server, then have the server propagate the update to the other servers?
  - Simple and wrong!
  - Simultaneous and conflicting updates can occur at different servers
- Have the client send the update to all servers?
  - Same problem - a race condition determines which of the conflicting updates reaches each server first
15. Two-phase commit
- Since we can't solve the Generals Paradox (i.e. simultaneous action), and concurrent and conflicting updates may be sent by clients, let's solve a related problem
- Distributed transaction: two machines agree to do something, or not do it, atomically
- An algorithm for providing atomic updates in a distributed system
- Give the servers (or replicas) a chance to say no, and if any server says no, the client aborts the operation
16. Framework
- Goal: update all replicas atomically
  - Either everyone commits or everyone aborts
  - No inconsistencies, even in the face of failures
- Caveat: assume only crash or fail-stop failures
  - Crash: servers stop when they fail; they do not continue and generate bad data
  - Fail-stop: in addition to crash, the failure is detectable
- Definitions
  - Coordinator: software entity that shepherds the process (in our example, could be one of the servers)
  - Ready to commit: side effects of the update are safely stored on non-volatile storage
    - Even if I crash, once I say I am ready to commit, a recovery procedure will find the evidence and continue with the commit protocol
17. Two Phase Commit: Phase 1
- Coordinator sends a PREPARE message to each replica
- Coordinator waits for all replicas to reply with a vote
- Each participant replies with a vote
  - Votes PREPARED if ready to commit, and locks the data items being updated
  - Votes NO if unable to get a lock or unable to ensure it is ready to commit
18. Two Phase Commit: Phase 2
- If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort
- Coordinator sends its decision to all participants
- If a participant receives the COMMIT decision, it commits the changes resulting from the update
- If a participant receives the ABORT decision, it discards the changes resulting from the update
- Participant replies DONE
- When the coordinator has received DONE from all participants, it can delete its record of the outcome
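The two phases can be sketched as a toy, single-process Python simulation. There is no real network, logging, or recovery here; the class names and the can_commit flag are illustrative assumptions:

```python
# A toy sketch of two-phase commit: one coordinator function, in-memory participants.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: lock data and persist "ready to commit", or vote NO.
        if self.can_commit:
            self.state = "prepared"
            return "PREPARED"
        self.state = "aborted"
        return "NO"

    def decide(self, decision):
        # Phase 2: apply or discard the update, then acknowledge.
        self.state = "committed" if decision == "COMMIT" else "aborted"
        return "DONE"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes from every replica.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    # Phase 2: broadcast the decision and wait for DONE acks.
    acks = [p.decide(decision) for p in participants]
    assert all(a == "DONE" for a in acks)
    return decision

replicas = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(replicas))   # prints ABORT: one replica voted NO
```

Note what the sketch leaves out: it is exactly the crash cases on the next slides (a participant prepared but the coordinator unreachable) that make the real protocol interesting.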
19. Performance
- In the absence of failure, 2PC (two-phase commit) makes a total of 2 (1.5?) round trips of messages before a decision is made
  - Prepare
  - Vote NO or PREPARED
  - Commit/abort
  - Done (but Done is just for bookkeeping; it does not affect response time)
20. Failure Handling in 2PC: Replica Failure
- The log contains a <commit T> record
  - In this case, the site executes redo(T).
- The log contains an <abort T> record
  - In this case, the site executes undo(T).
- The log contains a <ready T> record
  - In this case, consult the coordinator Ci.
  - If Ci is down, the site sends a query-status T message to the other sites.
- The log contains no control records concerning T
  - In this case, the site executes undo(T).
21. Failure Handling in 2PC: Coordinator Ci Failure
- If an active site contains a <commit T> record in its log, then T must be committed.
- If an active site contains an <abort T> record in its log, then T must be aborted.
- If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.
- All active sites have a <ready T> record in their logs, but no additional control records. In this case we must wait for the coordinator to recover.
  - Blocking problem: T is blocked pending the recovery of site Si.
22. Failure Handling
- Failures are detected with timeouts
- If a participant times out before getting a PREPARE, it can abort
- If the coordinator times out waiting for a vote, it can abort
- If a participant times out waiting for a decision, it is blocked!
  - Wait for the coordinator to recover?
  - Punt to some other resolution protocol
- If the coordinator times out waiting for DONE, it keeps its record of the outcome
  - Other sites may have a replica.
23. Failures in distributed systems
- We may want to avoid relying on a single server/coordinator/boss to make progress
- Thus we want the decision making to be distributed among the participants (all nodes created equal) ⇒ the consensus problem in distributed systems
- However, depending on what we can assume about the network, it may be impossible to reach a decision in some cases!
24. Impossibility of Consensus
- Network characteristics
  - Synchronous - some upper bound on network/processing delay
  - Asynchronous - no upper bound on network/processing delay
- Fischer, Lynch, and Paterson showed
  - With even just one failure possible, you cannot guarantee consensus
  - Cannot guarantee that the consensus process will terminate
  - Assumes an asynchronous network
  - Essence of proof: just before a decision is reached, we can delay a node slightly too long for it to reach a decision
- But we still want to do it... right?
25. Distributed Decision Making: Discussion
- Why is distributed decision making desirable?
  - Fault tolerance!
  - A group of machines can come to a decision even if one or more of them fail during the process
    - Simple failure mode called fail-stop (different modes later)
  - After the decision is made, the result is recorded in multiple places
- Undesirable feature of Two-Phase Commit: blocking
  - One machine can be stalled until another site recovers
    - Site B writes a "prepared to commit" record to its log, sends a "yes" vote to the coordinator (site A), and crashes
    - Site A crashes
    - Site B wakes up, checks its log, and realizes that it has voted "yes" on the update. It sends a message to site A asking what happened. At this point, B cannot decide to abort, because the update may have committed
    - B is blocked until A comes back
  - A blocked site holds resources (locks on updated items, pages pinned in memory, etc.) until it learns the fate of the update
  - Alternatives: there are alternatives, such as Three-Phase Commit, which don't have this blocking problem
- What happens if one or more of the nodes is malicious?
  - Malicious: attempting to compromise the decision making
  - Known as Byzantine fault tolerance. More on this next time
26. Introducing wall clock time
- Back to the notion of time
- Distributed systems sometimes need a more precise notion of time than happens-before
- There are several options
  - Instead of network/process identity to break ties
  - Extend a logical clock with the wall clock time and use it to break ties
  - Makes meaningful statements like "B and D were concurrent, although B occurred first"
  - But unless clocks are closely synchronized, such statements could be erroneous!
- We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network
27. Synchronizing clocks
- Without help, clocks will often differ by many milliseconds
- The problem is that when a machine downloads the time from a network clock, it can't be sure what the delay was
  - This is because the uplink and downlink delays are often very different in a network
- Outright failures of clocks are rare
28. Synchronizing clocks
- Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running - what time is it now?

[Diagram: p asks time.windows.com "What time is it?" and receives the reading 09:23.02921; total round-trip delay 123ms]
29. Synchronizing clocks
- Options?
  - p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
  - p could ignore the delay
  - p could factor in only the certain delay, e.g. if we know that the link takes at least 5ms in each direction. Works best with GPS time sources!
- In general, we can't do better than the uncertainty in the link delay from the time source down to p
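The options above can be sketched numerically. The 123 ms round trip matches the slide's example; the server reading and the 5 ms minimum one-way delay are illustrative assumptions:

```python
# Sketch of the round-trip estimate: guess the one-way delay as half the round
# trip, and bound the error using a known minimum link delay (if any).

def estimate_time(server_time_ms, round_trip_ms, min_one_way_ms=0):
    # Assume the reply travelled for roughly half the round trip
    # (rarely true on asymmetric WAN links, as the slide notes).
    midpoint = server_time_ms + round_trip_ms / 2
    # The true time lies in [server + min_one_way, server + rtt - min_one_way],
    # so the midpoint guess is off by at most rtt/2 - min_one_way.
    uncertainty = round_trip_ms / 2 - min_one_way_ms
    return midpoint, uncertainty

est, err = estimate_time(server_time_ms=1_000_000, round_trip_ms=123,
                         min_one_way_ms=5)
print(f"now ~ {est} ms, +/- {err} ms")   # now ~ 1000061.5 ms, +/- 56.5 ms
```

Knowing a larger minimum one-way delay shrinks the uncertainty, which is why the slide says this works best with GPS-class time sources.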
30. Consequences?
- In a network of processes, we must assume that clocks are not perfectly synchronized.
  - We say that clocks are inaccurate
  - Even GPS has uncertainty, although it is small
- And clocks can drift during the periods between synchronizations
  - Relative drift between clocks is their precision
31. Temporal distortions
- Things can be complicated because we can't predict
  - Message delays (they vary constantly)
  - Execution speeds (often a process shares a machine with many other tasks)
  - Timing of external events
- Lamport looked at this question too
32. Temporal distortions
[Diagram: timelines for processes p0-p3 with events a-f and messages between them]
33. Temporal distortions
- What does "now" mean?
[Diagram: the same p0-p3 timelines with events a-f]
34. Temporal distortions
- Timelines can stretch - caused by scheduling effects, message delays, message loss
[Diagram: the p0-p3 timelines with stretched segments]
35. Temporal distortions
- Timelines can shrink - e.g. something lets a machine speed up
[Diagram: the p0-p3 timelines with a shrunken segment]
36. Temporal distortions
- Cuts represent instants of time. But not every cut makes sense: black cuts could occur, but not gray ones.
[Diagram: the p0-p3 timelines crossed by black (consistent) and gray (inconsistent) cuts]
37. Consistent cuts and snapshots
- The idea is to identify system states that might have occurred in real life
- Need to avoid capturing states in which a message is received but nobody is shown as having sent it
  - This is the problem with the gray cuts
38. Temporal distortions
- Red messages cross gray cuts "backwards"
[Diagram: the p0-p3 timelines; messages crossing a gray cut from its future side]
39. Temporal distortions
- Red messages cross gray cuts "backwards"
  - In a nutshell: the cut includes a message that was never sent
[Diagram: the p0-p3 timelines with such a message highlighted]
40. Who cares?
- Suppose, for example, that we want to do distributed deadlock detection
- The system lets processes wait for actions by other processes
- A process can only do one thing at a time
- A deadlock occurs if there is a circular wait
41. Deadlock detection algorithm
- p worries: perhaps we have a deadlock
- p is waiting for q, so it sends "what's your state?"
- q, on receipt, is waiting for r, so it sends the same question; r does the same for s. And s is waiting on p.
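The probe idea above can be sketched as a walk over a wait-for table. The table is a hypothetical in-memory structure; a real system would chase the probe across the network, one "what's your state?" message per hop:

```python
# Sketch: follow "waiting for" edges from a starting process and see whether
# the chain of probes comes back to where it started.

def finds_cycle(waiting_for, start):
    # waiting_for maps each process to the single process it waits on
    # (a process can only do one thing at a time).
    seen, cur = set(), start
    while cur in waiting_for:
        cur = waiting_for[cur]
        if cur == start:
            return True            # the probe came back: a cycle exists
        if cur in seen:
            return False           # looped elsewhere without reaching start
        seen.add(cur)
    return False                   # chain ended at a process that isn't waiting

wf = {"p": "q", "q": "r", "r": "s", "s": "p"}
print(finds_cycle(wf, "p"))        # prints True - but see the next slides:
                                   # it may be a phantom deadlock!
```

This is exactly where consistent cuts matter: the table entries are gathered at different moments, so the "cycle" may never have existed at any single instant.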
42. Suppose we detect this state
- We see a cycle...
- ... but is it a deadlock?
[Diagram: p waiting for q, q waiting for r, r waiting for s, s waiting for p]
43. Phantom deadlocks!
- Suppose the system has a very high rate of locking.
- Then perhaps a lock release message passed a query message
  - i.e. we see "q waiting for r" and "r waiting for s", but in fact, by the time we checked r, q was no longer waiting!
- In effect: we checked for deadlock on a gray cut - an inconsistent cut.
44. Consistent cuts and snapshots
- The goal is to draw a line across the system state such that
  - Every message received by a process is shown as having been sent by some other process
  - Some pending messages might still be in communication channels
- A cut is the frontier of a snapshot
45. Chandy/Lamport Algorithm
- Assume that if pi can talk to pj, they do so using a lossless, FIFO connection
- Now think about logical clocks
  - Suppose someone sets his clock way ahead and triggers a flood of messages
  - As these reach each process, it advances its own time; eventually all do so.
  - The point where time jumps forward is a consistent cut across the system
46. Using logical clocks to make cuts
- A message sets the time forward by a lot
- The algorithm requires FIFO channels: we must delay e until b has been delivered!
[Diagram: the p0-p3 timelines; a high-timestamp message propagates through the system]
47. Using logical clocks to make cuts
- The cut occurs at the point where time advanced
[Diagram: the p0-p3 timelines with the cut drawn where each clock jumped forward]
48. Turn idea into an algorithm
- To start a new snapshot, pi
  - Builds a message "pi is initiating snapshot k"
    - The tuple (pi, k) uniquely identifies the snapshot
- In general, on first learning about snapshot (pi, k), px
  - Writes down its state: px's contribution to the snapshot
  - Starts "tape recorders" for all communication channels
  - Forwards the message on all outgoing channels
  - Stops the tape recorder for a channel when a snapshot message for (pi, k) is received on it
- The snapshot consists of all the local state contributions and all the tape recordings for the channels
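A minimal sketch of these per-process rules, under the simplifying assumption that marker delivery is synchronous and no application messages are in flight (so the channel "tape recordings" come back empty); the node names and topology are illustrative:

```python
# Sketch of the Chandy/Lamport per-process rules on a tiny directed network.

class SnapshotNode:
    def __init__(self, name, state, outgoing):
        self.name, self.state, self.outgoing = name, state, outgoing
        self.recorded = None          # local-state contribution
        self.recording = {}           # "tape recorders", one per incoming channel

    def incoming(self, network):
        return [n for n, node in network.items() if self.name in node.outgoing]

    def on_marker(self, channel, network):
        if self.recorded is None:
            # First marker for this snapshot: write down local state, start
            # recorders on all incoming channels, forward on all outgoing ones.
            self.recorded = self.state
            self.recording = {ch: [] for ch in self.incoming(network)}
            for dest in self.outgoing:
                network[dest].on_marker(self.name, network)
        # A marker arriving on a channel stops that channel's recorder.
        self.recording.pop(channel, None)

# Tiny three-node ring: p -> q -> r -> p
net = {"p": SnapshotNode("p", "Sp", ["q"]),
       "q": SnapshotNode("q", "Sq", ["r"]),
       "r": SnapshotNode("r", "Sr", ["p"])}
net["p"].on_marker(None, net)         # p initiates the snapshot
print({n: node.recorded for n, node in net.items()})
```

A real implementation would deliver markers over the FIFO channels the slide assumes, and the recorders would capture any application messages that arrive on a channel after the local state was recorded but before that channel's marker.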
49. Chandy/Lamport
- This algorithm, but implemented with an outgoing flood, followed by an incoming wave of snapshot contributions
- The snapshot ends up accumulating at the initiator, pi
- The algorithm doesn't tolerate process failures or message failures.
50-63. Chandy/Lamport: walkthrough
[Diagrams: a network of processes p, q, r, s, t, u, v, w, x, y, z]
- 50: A network
- 51: p announces "I want to start a snapshot"
- 52: p records its local state
- 53: p starts monitoring its incoming channels
- 54: recording the contents of channel p-y
- 55-56: p floods the snapshot message on its outgoing channels
- 57: q is done
- 58-62: the remaining processes record their states and forward the message; contributions from q, s, z, u, v, x, w, r, t, and y flow back toward p
- 63: Done! The contributions accumulate at the initiator - a snapshot of the network
64. What's in the state?
- In practice we only record things important to the application running the algorithm, not the whole state
  - E.g. locks currently held, lock release messages
- The idea is that the snapshot will be
  - Easy to analyze, letting us build a picture of the system state
  - And will have everything that matters for our real purpose, like deadlock detection