Title: Distributed Systems: Atomicity, Decision Making, Faults, Snapshots
1. Distributed Systems: Atomicity, Decision Making, Faults, Snapshots
2. Announcements
- Prelim II coming up next week
- In class, Thursday, November 20th, 10:10-11:25pm
- 203 Thurston
- Closed book, no calculators/PDAs
- Bring ID
- Topics
- Everything after the first prelim
- Lectures 14-22, chapters 10-15 (8th ed)
- Review session tonight, November 18th, 6:30pm-7:30pm, 315 Upson Hall
3. Review: What time is it?
- In a distributed system we need practical ways to deal with time
- E.g. we may need to agree that update A occurred before update B
- Or offer a lease on a resource that expires at time 10:10.0150
- Or guarantee that a time-critical event will reach all interested parties within 100ms
4. Review: Event Ordering
- Problem: distributed systems do not share a clock
- Many coordination problems would be simplified if they did ("first one wins")
- Distributed systems do have some sense of time
- Events in a single process happen in order
- Messages between processes must be sent before they can be received
- How helpful is this?
5. Review: Happens-before
- Define a happens-before relation (denoted by →).
- 1) If A and B are events in the same process, and A was executed before B, then A → B.
- 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
- 3) If A → B and B → C, then A → C.
6. Review: Total ordering?
- Happens-before gives a partial ordering of events
- We still do not have a total ordering of events
- We are not able to order events that happen concurrently
- Concurrent if (not A → B) and (not B → A)
7. Review: Partial Ordering
- Within each process: Pi → Pi+1, Qi → Qi+1, Ri → Ri+1
- Across processes: R0 → Q4, Q3 → R4, Q1 → P4, P1 → Q2
8. Review: Total Ordering?
- Several total orderings are consistent with the partial order, e.g.:
- P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
- P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
- P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
9. Review: Timestamps
- Assume each process has a local logical clock that ticks once per event, and that the processes are numbered
- Clocks tick once per event (including message sends)
- When you send a message, send your clock value
- When you receive a message, set your clock to max(your clock, message timestamp + 1)
- Thus sending comes before receiving
- The only visibility into actions at other nodes happens during communication; communication synchronizes the clocks
- If the timestamps of two events A and B are the same, use the process identity numbers to break ties
- This gives a total ordering! (a short sketch follows)
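The rules above translate almost directly into code. Here is a minimal Python sketch; the class and method names are mine, for illustration only:

```python
# Minimal sketch of a Lamport logical clock with process-id tie-breaking.

class LamportClock:
    def __init__(self, pid):
        self.pid = pid        # process identity number, used to break ties
        self.time = 0         # local logical clock

    def tick(self):
        """Every local event, including a send, advances the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Tick, then attach the clock value to the outgoing message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """On receipt, jump to max(own clock, message timestamp + 1)."""
        self.time = max(self.time + 1, msg_timestamp + 1)
        return self.time

    def stamp(self):
        """(time, pid) pairs compare lexicographically: a total order."""
        return (self.time, self.pid)
```

Comparing stamp() tuples orders any two distinct events: equal times fall through to the process ids, which is exactly the tie-breaking rule above.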
10. Review: Distributed Mutual Exclusion
- We want mutual exclusion in a distributed setting
- The system consists of n processes; each process Pi resides at a different processor
- Each process has a critical section that requires mutual exclusion
- Problem: we cannot use an atomic testAndSet primitive, since memory is not shared and processes may be on physically separated nodes
- Requirement:
- If Pi is executing in its critical section, then no other process Pj is executing in its critical section
- Compare three solutions:
- Centralized Distributed Mutual Exclusion (CDME)
- Fully Distributed Mutual Exclusion (DDME)
- Token passing
11. Today
- Atomicity and distributed decision making
- Faults in distributed systems
- What time is it now?
- Synchronized clocks
- What does the entire system look like at this moment?
12. Atomicity
- Recall:
- Atomicity: either all the operations associated with a program unit are executed to completion, or none are performed
- In a distributed system we may have multiple copies of the data (e.g. replicas are good for reliability/availability)
- PROBLEM: how do we atomically update all of the copies?
- That is, either all replicas reflect a change or none do
13. E.g. Two-Phase Commit
- Goal: update all replicas atomically
- Either everyone commits or everyone aborts
- No inconsistencies, even in the face of failures
- Caveat: assume only crash or fail-stop failures
- Crash: servers stop when they fail; they do not continue and generate bad data
- Fail-stop: in addition to crash, the failure is detectable
- Definitions:
- Coordinator: the software entity that shepherds the process (in our example it could be one of the servers)
- Ready to commit: side effects of the update are safely stored on non-volatile storage
- Even if I crash, once I say I am ready to commit, a recovery procedure will find the evidence and continue with the commit protocol
14. Two-Phase Commit: Phase 1
- The coordinator sends a PREPARE message to each replica
- The coordinator waits for all replicas to reply with a vote
- Each participant replies with a vote:
- Votes PREPARED if it is ready to commit, and locks the data items being updated
- Votes NO if it is unable to get a lock or unable to ensure it is ready to commit
15. Two-Phase Commit: Phase 2
- If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort
- The coordinator sends its decision to all participants
- If a participant receives a COMMIT decision, it commits the changes resulting from the update
- If a participant receives an ABORT decision, it discards the changes resulting from the update
- Each participant replies DONE
- When the coordinator has received DONE from all participants, it can delete its record of the outcome (a toy walk-through of the exchange follows)
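To make the two phases concrete, here is a toy, single-process Python simulation. The class and message names are assumptions for illustration; real 2PC runs over a network with persistent logs on each site.

```python
# Toy in-memory walk-through of the 2PC exchange; not a real protocol stack.

class Participant:
    def __init__(self, name, can_lock=True):
        self.name, self.can_lock, self.staged = name, can_lock, None

    def prepare(self, update):
        """Phase 1: vote PREPARED (locking/staging the update) or NO."""
        if self.can_lock:
            self.staged = update          # side effects safely staged
            return "PREPARED"
        return "NO"

    def decide(self, decision):
        """Phase 2: apply or discard the staged update, then ack DONE."""
        if decision == "COMMIT":
            print(f"{self.name}: committed {self.staged!r}")
        else:
            self.staged = None            # discard staged changes
        return "DONE"

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]               # phase 1
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    acks = [p.decide(decision) for p in participants]               # phase 2
    assert all(a == "DONE" for a in acks)   # coordinator may now forget T
    return decision

print(two_phase_commit([Participant("r1"), Participant("r2")], "x = 5"))
print(two_phase_commit([Participant("r1"), Participant("r2", can_lock=False)],
                       "x = 6"))            # one NO vote forces ABORT
```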
16. Performance
- In the absence of failures, 2PC (two-phase commit) makes a total of 2 (1.5?) round trips of messages before the decision is made:
- Prepare
- Vote: NO or PREPARED
- Commit/abort
- Done (but DONE is just for bookkeeping; it does not affect response time)
17. Failure Handling in 2PC: Replica Failure
- The log contains a <commit T> record
- In this case, the site executes redo(T)
- The log contains an <abort T> record
- In this case, the site executes undo(T)
- The log contains a <ready T> record
- In this case, consult the coordinator Ci
- If Ci is down, the site sends a query-status T message to the other sites
- The log contains no control records concerning T
- In this case, the site executes undo(T)
(A sketch of this rule table follows.)
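The recovery rule is essentially a four-way case statement over the log. A minimal sketch, assuming the log is represented as a list of (record, transaction) pairs; all names here are illustrative:

```python
# Recovery decision for a replica restarting after a crash, per the rules above.

def recover_action(log, T):
    if ("commit", T) in log:
        return "redo(T)"        # outcome was commit: reapply the update
    if ("abort", T) in log:
        return "undo(T)"        # outcome was abort: roll back
    if ("ready", T) in log:
        # Voted PREPARED but never learned the outcome: ask the
        # coordinator Ci, or query the other sites if Ci is down.
        return "consult Ci (or send query-status T to other sites)"
    return "undo(T)"            # never got ready: safe to abort unilaterally

print(recover_action([("ready", "T1")], "T1"))
```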
18. Failure Handling in 2PC: Coordinator Ci Failure
- If an active site contains a <commit T> record in its log, then T must be committed.
- If an active site contains an <abort T> record in its log, then T must be aborted.
- If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.
- If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover.
- Blocking problem: T is blocked pending the recovery of site Si.
19. Failure Handling
- Failures are detected with timeouts
- If a participant times out before getting a PREPARE, it can abort
- If the coordinator times out waiting for a vote, it can abort
- If a participant times out waiting for a decision, it is blocked!
- Wait for the coordinator to recover?
- Punt to some other resolution protocol?
- If the coordinator times out waiting for DONE, it keeps its record of the outcome; other sites may have a replica
20. Failures in distributed systems
- We may want to avoid relying on a single server/coordinator/boss to make progress
- Thus we want the decision making to be distributed among the participants (all nodes created equal) ⇒ the consensus problem in distributed systems
- However, depending on what we can assume about the network, it may be impossible to reach a decision in some cases!
21. Impossibility of Consensus
- Network characteristics:
- Synchronous: some upper bound on network/processing delay
- Asynchronous: no upper bound on network/processing delay
- Fischer, Lynch, and Paterson showed:
- With even just one failure possible, you cannot guarantee consensus
- More precisely, you cannot guarantee that the consensus process will terminate
- Assumes an asynchronous network
- Essence of the proof: just before a decision is reached, we can delay a node slightly too long for the decision to be reached
- But we still want to do it... right?
22. Distributed Decision Making: Discussion
- Why is distributed decision making desirable?
- Fault tolerance! Also, atomicity in a distributed system
- A group of machines can come to a decision even if one or more of them fail during the process
- After the decision is made, the result is recorded in multiple places
- Undesirable if the algorithm is blocking (e.g. two-phase commit)
- One machine can be stalled until another site recovers
- A blocked site holds resources (locks on updated items, pages pinned in memory, etc.) until it learns the fate of the update
- To reduce blocking:
- Add more rounds (e.g. three-phase commit)
- Add more replicas than needed (e.g. quorums)
- What happens if one or more of the nodes is malicious?
- Malicious: attempting to compromise the decision making
- Known as Byzantine fault tolerance. More on this next time.
23. Faults
24. Categories of failures
- Crash faults, message loss
- These are common in real systems
- Crash failure: the process simply stops, and does nothing wrong that would be externally visible before it stops
- These faults can't be directly detected
25. Categories of failures
- Fail-stop failures
- These require system support
- The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
- With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
- Easy to work with, but rarely supported
26. Categories of failures
- Non-malicious Byzantine failures
- This is the best way to understand many kinds of corruption and buggy behaviors
- The program can do pretty much anything, including sending corrupted messages
- But it doesn't do so with the intention of screwing up our protocols
- Unfortunately, a pretty common mode of failure
27. Categories of failure
- Malicious, true Byzantine, failures
- The model is of an attacker who has studied the system and wants to break it
- She can corrupt or replay messages, intercept them at will, compromise programs and substitute hacked versions
- This is a worst-case-scenario mindset
- In practice, this doesn't actually happen often
- Very costly to defend against; typically used in very limited ways (e.g. a key management server)
28. Models of failure
- The question here concerns how failures appear in the formal models used when proving things about protocols
- Think back to Lamport's happens-before relationship, →
- The model already has processes, messages, temporal ordering
- It assumes messages are reliably delivered
29. Two kinds of models
- We tend to work within two models
- Asynchronous model: makes no assumptions about time
- Lamport's model is a good fit
- Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast/slow
- Distributed computing at an eons timescale
- Synchronous model: assumes a lock-step execution in which processes share a clock
30. Adding failures in Lamport's model
- Also called the asynchronous model
- Normally we just assume that a failed process crashes: it stops doing anything
- Notice that in this model, a failed process is indistinguishable from a delayed process
- In fact, the decision that something has failed takes on an arbitrary flavor
- Suppose that at point e in its execution, process p decides to treat q as faulty...
31. What about the synchronous model?
- Here, we also have processes and messages
- But communication is usually assumed to be reliable: any message sent at time t is delivered by time t + δ
- Algorithms are often structured into rounds, each lasting some fixed amount of time Δ, giving time for each process to communicate with every other process
- In this model, a crash failure is easily detected
- When people have considered malicious failures, they often used this model
32. Neither model is realistic
- The value of the asynchronous model is that it is so stripped down and simple
- If we can do something well in this model, we can do at least as well in the real world
- So we'll want best solutions
- The value of the synchronous model is that it adds a lot of unrealistic mechanism
- If we can't solve a problem with all this help, we probably can't solve it in a more realistic setting!
- So we seek impossibility results
33. Fischer, Lynch and Paterson
- Impossibility of consensus: a surprising result
- "Impossibility of Distributed Consensus with One Faulty Process"
- They prove that no asynchronous algorithm for agreeing on a one-bit value can guarantee that it will terminate in the presence of crash faults
- And this is true even if no crash actually occurs!
- The proof constructs infinite non-terminating runs
- Essence of the proof: just before a decision is reached, we can delay a node slightly too long for the decision to be reached
34. Tougher failure models
- We've focused on crash failures
- In the synchronous model these look like a "farewell, cruel world" message
- Some call it the fail-stop model: a faulty process is viewed as first saying goodbye, then crashing
- What about tougher kinds of failures?
- Corrupted messages
- Processes that don't follow the algorithm
- Malicious processes out to cause havoc
35. Here the situation is much harder
- Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures
- For example, to tolerate 1 failure we need 4 or more processes
- We also need f+1 rounds
- Let's see why this happens
36. Byzantine Generals scenario
- Generals (N of them) surround a city
- They communicate by courier
- Each has an opinion: attack or wait
- In fact, an attack would succeed: the city will fall
- Waiting would succeed too: the city will surrender
- But if some attack and some wait, disaster ensues
- Some generals (f of them) are traitors; it doesn't matter whether they attack or wait, but we must prevent them from disrupting the battle
- A traitor can't forge messages from other generals
37. Byzantine Generals scenario
[diagram: generals around a city shouting "Attack!", "No, wait!", "Surrender!", with opinions Attack/Wait]
38. A timeline perspective
- Suppose that p and q favor attack, r is a traitor, and s and t favor waiting; assume that in a tie vote, we attack
[timeline diagram: processes p, q, r, s, t]
39. A timeline perspective
- After the first round, the collected votes are: attack, attack, wait, wait, traitor's-vote
[timeline diagram: processes p, q, r, s, t]
40. What can the traitor do?
- Add a legitimate vote of attack
- Anyone with 3 votes to attack knows the outcome
- Add a legitimate vote of wait
- The vote now favors wait
- Or send different votes to different folks
- Or don't send a vote at all to some
41. Outcomes?
- The traitor simply votes:
- Either all see {a,a,a,w,w}
- Or all see {a,a,w,w,w}
- The traitor double-votes:
- Some see {a,a,a,w,w} and some see {a,a,w,w,w}
- The traitor withholds some vote(s):
- Some see {a,a,w,w}, perhaps others see {a,a,a,w,w}, and still others see {a,a,w,w,w}
- Notice that the traitor can't manipulate the votes of loyal generals! (a tiny vote-tally sketch follows)
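These outcomes are easy to check mechanically. A tiny Python sketch with five generals and the tie-goes-to-attack rule from the earlier slide; the Counter-based tally is my own framing, not the lecture's:

```python
# Round-1 outcomes for 4 loyal generals plus 1 traitor; ties favor attack.

from collections import Counter

def decide(votes):
    tally = Counter(votes)
    return "attack" if tally["attack"] >= tally["wait"] else "wait"

loyal = ["attack", "attack", "wait", "wait"]          # p, q, s, t
print(decide(loyal + ["attack"]))   # all see {a,a,a,w,w} -> attack
print(decide(loyal + ["wait"]))     # all see {a,a,w,w,w} -> wait
print(decide(loyal))                # vote withheld: {a,a,w,w} tie -> attack
```

The danger is the double-voting case: if the traitor tells some generals "attack" and others "wait", different loyal generals run decide() on different inputs and disagree. That is what the round-2 witness messages repair.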
42. What can we do?
- Clearly we can't decide yet: some loyal generals might have contradictory data
- Anyone with 4 votes can decide
- But with 3 votes to wait or attack, a general isn't sure (one could be a traitor)
- So in round 2, each general sends out witness messages: "here's what I saw in round 1"
- "General Smith sent me attack" (signed, Smith)
43. Digital signatures
- These require a cryptographic system
- For example, RSA
- Each player has a secret (private) key K⁻¹ and a public key K
- She can publish her public key
- RSA gives us a single encrypt function:
- Encrypt(Encrypt(M, K), K⁻¹) = Encrypt(Encrypt(M, K⁻¹), K) = M
- Encrypt a hash of the message to sign it
44. With such a system
- A can send a message to B that only A could have sent
- A just encrypts the body with her private key
- ...or one that only B can read
- A encrypts it with B's public key
- Or she can sign it as proof she sent it
- B can recompute the hash and decrypt A's signed hash to see if they match
- These capabilities limit what our traitor can do: he can't forge or modify a message (a sign/verify sketch follows)
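A sketch of sign-and-verify using the pyca/cryptography package (pip install cryptography). This is one standard way to realize the "encrypt a hash with the private key" idea; the key size and padding choices are illustrative defaults, not anything mandated by the lecture:

```python
# Signing a message hash with RSA and verifying it with the public key.

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()    # this half can be published

message = b"General Smith: attack"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Only the holder of the private key can produce this signature...
signature = private_key.sign(message, pss, hashes.SHA256())

# ...but anyone with the public key can check it; a forged or modified
# message makes verify() raise InvalidSignature.
public_key.verify(signature, message, pss, hashes.SHA256())
print("signature verified")
```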
45. A timeline perspective
- In the second round, if the traitor didn't behave identically toward all generals, we can weed out his faulty votes
[timeline diagram: processes p, q, r, s, t]
46. A timeline perspective
[timeline diagram: p, q, s, t each conclude "Attack!!"; the traitor r: "Damn! They're on to me"]
47. Traitor is stymied
- Our loyal generals can deduce that the decision was to attack
- The traitor can't disrupt this:
- He is either forced to vote legitimately, or is caught
- But the costs were steep!
- (f+1)·n² messages!
- Rounds can also be slow
- Early-stopping protocols: min(t+2, f+1) rounds, where t is the true number of faults
48. Distributed Snapshots
49. Introducing wall clock time
- Back to the notion of time
- Distributed systems sometimes need a more precise notion of time than happens-before
- There are several options
- Instead of using network/process identity to break ties,
- extend a logical clock with the wall clock time and use that to break ties
- Makes meaningful statements like "B and D were concurrent, although B occurred first"
- But unless clocks are closely synchronized, such statements could be erroneous!
- We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network
50. Synchronizing clocks
- Without help, clocks will often differ by many milliseconds
- The problem is that when a machine downloads the time from a network clock, it can't be sure what the delay was
- This is because the uplink and downlink delays are often very different in a network
- Outright failures of clocks are rare
51. Synchronizing clocks
- Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running: what time is it now?
[diagram: p asks "What time is it?"; time.windows.com replies "09:23.02921"; round-trip delay 123ms]
52. Synchronizing clocks
- Options?
- p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
- p could ignore the delay
- p could factor in only the certain delay, e.g. if we know that the link takes at least 5ms in each direction. Works best with GPS time sources!
- In general we can't do better than the uncertainty in the link delay from the time source down to p (a sketch of the estimate follows)
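A minimal sketch of the split-the-round-trip option (essentially Cristian's algorithm), treating any known minimum link delay as certain. The numeric values are made up for illustration:

```python
# Estimate the current time from one reading of a remote clock.

def estimate_time(server_time, round_trip, min_one_way=0.0):
    """server_time: remote reading (seconds); round_trip: measured RTT."""
    estimate = server_time + round_trip / 2     # assume a symmetric split
    # The reply could have spent anywhere from min_one_way to
    # (round_trip - min_one_way) seconds on the downlink, so:
    uncertainty = round_trip / 2 - min_one_way
    return estimate, uncertainty

est, err = estimate_time(server_time=1000.000, round_trip=0.123,
                         min_one_way=0.005)
print(f"now ~ {est:.4f} s, +/- {err * 1000:.1f} ms")   # +/- 56.5 ms
```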
53. Consequences?
- In a network of processes, we must assume that clocks are not perfectly synchronized
- We say that clocks are inaccurate
- Even GPS has uncertainty, although it is small
- And clocks can drift during the periods between synchronizations
- The relative drift between clocks is their precision
54. Temporal distortions
- Things can be complicated because we can't predict:
- Message delays (they vary constantly)
- Execution speeds (often a process shares a machine with many other tasks)
- Timing of external events
- Lamport looked at this question too
55. Temporal distortions
[timeline diagram: processes p0-p3 with events a-f]
56. Temporal distortions
What does "now" mean?
[timeline diagram: processes p0-p3 with events a-f]
57. Temporal distortions
Timelines can stretch, caused by scheduling effects, message delays, message loss
[timeline diagram: processes p0-p3 with events a-f]
58. Temporal distortions
Timelines can shrink, e.g. something lets a machine speed up
[timeline diagram: processes p0-p3 with events a-f]
59. Temporal distortions
Cuts represent instants of time. But not every cut makes sense: black cuts could occur, but not gray ones.
[timeline diagram: processes p0-p3 with black (consistent) and gray (inconsistent) cuts]
60. Consistent cuts and snapshots
- The idea is to identify system states that might have occurred in real life
- We need to avoid capturing states in which a message is received but nobody is shown as having sent it
- This is the problem with the gray cuts
61. Temporal distortions
Red messages cross gray cuts "backwards"
[timeline diagram: processes p0-p3; red messages crossing gray cuts]
62. Temporal distortions
Red messages cross gray cuts "backwards". In a nutshell: the cut includes a message that was never sent.
[timeline diagram: processes p0-p3]
63. Who cares?
- Suppose, for example, that we want to do distributed deadlock detection
- The system lets processes wait for actions by other processes
- A process can only do one thing at a time
- A deadlock occurs if there is a circular wait
64. Deadlock detection algorithm
- p worries: perhaps we have a deadlock
- p is waiting for q, so it sends "what's your state?"
- q, on receipt, is waiting for r, so it sends the same question; and r does the same for s. And s is waiting on p. (a cycle-check sketch follows)
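Since each process waits for at most one other process, the probe just follows single wait-for edges. A minimal sketch, with the wait-for relation given as a dict rather than discovered by messages:

```python
# Follow "waiting for" edges from p; a probe that returns to p is a cycle.

def circular_wait(waits_for, start):
    seen, node = set(), start
    while node in waits_for:        # node is blocked on someone
        node = waits_for[node]      # forward the "what's your state?" probe
        if node == start:
            return True             # probe came home: circular wait
        if node in seen:
            return False            # merged into some other wait chain
        seen.add(node)
    return False                    # reached a process that is running

edges = {"p": "q", "q": "r", "r": "s", "s": "p"}
print(circular_wait(edges, "p"))    # True: p -> q -> r -> s -> p
```

As the next slides show, a True answer is only trustworthy if the edges were collected on a consistent cut.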
65. Suppose we detect this state
- We see a cycle...
- ...but is it a deadlock?
[diagram: p waiting for q, q waiting for r, r waiting for s, s waiting for p]
66. Phantom deadlocks!
- Suppose the system has a very high rate of locking
- Then perhaps a lock release message passed a query message
- I.e. we see "q waiting for r" and "r waiting for s", but in fact, by the time we checked r, q was no longer waiting!
- In effect we checked for deadlock on a gray cut, an inconsistent cut
67. Consistent cuts and snapshots
- The goal is to draw a line across the system state such that:
- Every message received by a process is shown as having been sent by some other process
- Some pending messages might still be in communication channels
- A cut is the frontier of a snapshot
68. Chandy/Lamport Algorithm
- Assume that if pi can talk to pj, they do so using a lossless, FIFO connection
- Now think about logical clocks
- Suppose someone sets his clock way ahead and triggers a flood of messages
- As these reach each process, it advances its own time; eventually all do so
- The point where time jumps forward is a consistent cut across the system
69. Using logical clocks to make cuts
A message sets the time forward by a lot
[timeline diagram: processes p0-p3 with events a-f]
The algorithm requires FIFO channels: we must delay e until b has been delivered!
70. Using logical clocks to make cuts
The cut occurs at the point where time advanced
[timeline diagram: processes p0-p3]
71. Turning the idea into an algorithm
- To start a new snapshot, pi:
- Builds a message "pi is initiating snapshot k"
- The tuple (pi, k) uniquely identifies the snapshot
- In general, on first learning about snapshot (pi, k), px:
- Writes down its state: px's contribution to the snapshot
- Starts "tape recorders" for all communication channels
- Forwards the message on all outgoing channels
- Stops the tape recorder for a channel when a snapshot message for (pi, k) is received on it
- The snapshot consists of all the local state contributions and all the tape recordings for the channels (a sketch of the per-process rule follows)
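A condensed Python sketch of the per-process rule, with delivery and channel plumbing stubbed out. The send callback and all names are assumptions for illustration; real code would sit on FIFO sockets:

```python
# What one (non-initiator) process px does for snapshot (pi, k).

class SnapshotProcess:
    def __init__(self, name, incoming, outgoing):
        self.name, self.incoming, self.outgoing = name, incoming, outgoing
        self.saved_state = None
        self.recording = {}               # channel -> recorded messages

    def on_marker(self, channel, snap_id, send):
        if self.saved_state is None:      # first word of snapshot (pi, k)
            self.saved_state = self.local_state()
            # start tape recorders on every other incoming channel; the
            # channel the marker arrived on is recorded as empty
            self.recording = {c: [] for c in self.incoming if c != channel}
            for c in self.outgoing:       # forward the marker everywhere
                send(c, ("MARKER", snap_id))
        else:                             # marker arrived on this channel:
            self.recording.pop(channel, None)   # stop its tape recorder

    def on_message(self, channel, msg):
        if channel in self.recording:     # message was in flight at the cut
            self.recording[channel].append(msg)

    def local_state(self):
        return f"local state of {self.name}"    # application-specific
```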
72. Chandy/Lamport
- This is that algorithm, implemented with an outgoing flood followed by an incoming wave of snapshot contributions
- The snapshot ends up accumulating at the initiator, pi
- The algorithm doesn't tolerate process failures or message failures
73. Chandy/Lamport
[diagram: a network of processes p, q, r, s, t, u, v, w, x, y, z]
74. Chandy/Lamport
[diagram: p announces "I want to start a snapshot"]
75. Chandy/Lamport
[diagram: p records its local state]
76. Chandy/Lamport
[diagram: p starts monitoring its incoming channels]
77. Chandy/Lamport
[diagram: p records the contents of channel p-y]
78. Chandy/Lamport
[diagram: p floods the snapshot message on its outgoing channels]
79. Chandy/Lamport
[diagram: the snapshot message propagates through the network]
80. Chandy/Lamport
[diagram: q is done]
81. Chandy/Lamport
[diagram: q's contribution travels back toward p]
82. Chandy/Lamport
[diagram: q's contribution continues toward the initiator]
83. Chandy/Lamport
[diagram: contributions from s and z join in]
84. Chandy/Lamport
[diagram: contributions from x, u, and v join in]
85. Chandy/Lamport
[diagram: contributions from the remaining processes accumulate at p]
86. Chandy/Lamport
[diagram: p has collected every contribution. Done! A snapshot of the network]
87. What's in the state?
- In practice we only record things important to the application running the algorithm, not the whole state
- E.g. locks currently held, lock release messages
- The idea is that the snapshot will be:
- Easy to analyze, letting us build a picture of the system state
- And will have everything that matters for our real purpose, like deadlock detection
88. Summary
- Types of faults:
- Crash, fail-stop, non-malicious Byzantine, Byzantine
- Two-phase commit: distributed decision making
- First, make sure everyone guarantees that they will commit if asked (prepare)
- Next, ask everyone to commit
- Assumes crash or fail-stop faults
- Byzantine Generals Problem: distributed decision making with malicious failures
- n generals; some number of them may be malicious (up to f of them)
- All non-malicious generals must come to the same decision
- Only solvable if n ≥ 3f+1, and it costs (f+1)·n² messages