Title: Distributed Systems: Atomicity, Decision Making, Faults, Snapshots
1. Distributed Systems: Atomicity, Decision Making, Faults, Snapshots
2. Announcements
- Prelim II coming up next week
- In class, Thursday, November 20th, 10:10-11:25pm
- 203 Thurston
- Closed book, no calculators/PDAs
- Bring ID
- Topics
- Everything after the first prelim
- Lectures 14-22, chapters 10-15 (8th ed)
- Review session tonight, November 18th, 6:30pm-7:30pm, 315 Upson Hall
3. Review: What time is it?
- In a distributed system we need practical ways to deal with time
- E.g. we may need to agree that update A occurred before update B
- Or offer a lease on a resource that expires at time 10:10.0150
- Or guarantee that a time-critical event will reach all interested parties within 100ms
4. Review: Event Ordering
- Problem: distributed systems do not share a clock
- Many coordination problems would be simplified if they did ("first one wins")
- Distributed systems do have some sense of time
- Events in a single process happen in order
- Messages between processes must be sent before they can be received
- How helpful is this?
5. Review: Happens-before
- Define a happens-before relation (denoted by →).
- 1) If A and B are events in the same process, and A was executed before B, then A → B.
- 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
- 3) If A → B and B → C, then A → C.
6. Review: Total ordering?
- Happens-before gives a partial ordering of events
- We still do not have a total ordering of events
- We are not able to order events that happen concurrently
- Concurrent if (not A → B) and (not B → A)
7. Review: Partial Ordering
- Within each process: Pi → Pi+1, Qi → Qi+1, Ri → Ri+1
- Across processes: R0 → Q4, Q3 → R4, Q1 → P4, P1 → Q2
8. Review: Total Ordering?
- Several total orderings are consistent with the partial order, e.g.:
- P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
- P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
- P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
9. Review: Timestamps
- Assume each process has a local logical clock that ticks once per event, and that the processes are numbered
- Clocks tick once per event (including message sends)
- When you send a message, send your clock value
- When you receive a message, set your clock to max(your clock, message timestamp + 1)
- Thus sending comes before receiving
- The only visibility into actions at other nodes happens during communication; communication synchronizes the clocks
- If the timestamps of two events A and B are the same, use the process identity numbers to break ties
- This gives a total ordering! (a short sketch follows)
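The rules above translate almost directly into code. Here is a minimal Python sketch; the class and method names are mine, for illustration only:

```python
# Minimal sketch of a Lamport logical clock with process-id tie-breaking.

class LamportClock:
    def __init__(self, pid):
        self.pid = pid        # process identity number, used to break ties
        self.time = 0         # local logical clock

    def tick(self):
        """Every local event, including a send, advances the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Tick, then attach the clock value to the outgoing message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """On receipt, jump to max(own clock, message timestamp + 1)."""
        self.time = max(self.time + 1, msg_timestamp + 1)
        return self.time

    def stamp(self):
        """(time, pid) pairs compare lexicographically: a total order."""
        return (self.time, self.pid)
```

Comparing stamp() tuples orders any two distinct events: equal times fall through to the process ids, which is exactly the tie-breaking rule above.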
10. Review: Distributed Mutual Exclusion
- We want mutual exclusion in a distributed setting
- The system consists of n processes; each process Pi resides at a different processor
- Each process has a critical section that requires mutual exclusion
- Problem: we cannot use an atomic testAndSet primitive, since memory is not shared and processes may be on physically separated nodes
- Requirement:
- If Pi is executing in its critical section, then no other process Pj is executing in its critical section
- Compare three solutions:
- Centralized Distributed Mutual Exclusion (CDME)
- Fully Distributed Mutual Exclusion (DDME)
- Token passing
11. Today
- Atomicity and distributed decision making
- Faults in distributed systems
- What time is it now?
- Synchronized clocks
- What does the entire system look like at this moment?
12. Atomicity
- Recall:
- Atomicity: either all the operations associated with a program unit are executed to completion, or none are performed
- In a distributed system we may have multiple copies of the data (e.g. replicas are good for reliability/availability)
- PROBLEM: how do we atomically update all of the copies?
- That is, either all replicas reflect a change or none do
13. E.g. Two-Phase Commit
- Goal: update all replicas atomically
- Either everyone commits or everyone aborts
- No inconsistencies, even in the face of failures
- Caveat: assume only crash or fail-stop failures
- Crash: servers stop when they fail; they do not continue and generate bad data
- Fail-stop: in addition to crash, the failure is detectable
- Definitions:
- Coordinator: the software entity that shepherds the process (in our example it could be one of the servers)
- Ready to commit: side effects of the update are safely stored on non-volatile storage
- Even if I crash, once I say I am ready to commit, a recovery procedure will find the evidence and continue with the commit protocol
14. Two-Phase Commit: Phase 1
- The coordinator sends a PREPARE message to each replica
- The coordinator waits for all replicas to reply with a vote
- Each participant replies with a vote:
- Votes PREPARED if it is ready to commit, and locks the data items being updated
- Votes NO if it is unable to get a lock or unable to ensure it is ready to commit
15. Two-Phase Commit: Phase 2
- If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort
- The coordinator sends its decision to all participants
- If a participant receives a COMMIT decision, it commits the changes resulting from the update
- If a participant receives an ABORT decision, it discards the changes resulting from the update
- Each participant replies DONE
- When the coordinator has received DONE from all participants, it can delete its record of the outcome (a toy walk-through of the exchange follows)
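To make the two phases concrete, here is a toy, single-process Python simulation. The class and message names are assumptions for illustration; real 2PC runs over a network with persistent logs on each site.

```python
# Toy in-memory walk-through of the 2PC exchange; not a real protocol stack.

class Participant:
    def __init__(self, name, can_lock=True):
        self.name, self.can_lock, self.staged = name, can_lock, None

    def prepare(self, update):
        """Phase 1: vote PREPARED (locking/staging the update) or NO."""
        if self.can_lock:
            self.staged = update          # side effects safely staged
            return "PREPARED"
        return "NO"

    def decide(self, decision):
        """Phase 2: apply or discard the staged update, then ack DONE."""
        if decision == "COMMIT":
            print(f"{self.name}: committed {self.staged!r}")
        else:
            self.staged = None            # discard staged changes
        return "DONE"

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]               # phase 1
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    acks = [p.decide(decision) for p in participants]               # phase 2
    assert all(a == "DONE" for a in acks)   # coordinator may now forget T
    return decision

print(two_phase_commit([Participant("r1"), Participant("r2")], "x = 5"))
print(two_phase_commit([Participant("r1"), Participant("r2", can_lock=False)],
                       "x = 6"))            # one NO vote forces ABORT
```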
16. Performance
- In the absence of failures, 2PC (two-phase commit) makes a total of 2 (1.5?) round trips of messages before the decision is made:
- Prepare
- Vote: NO or PREPARED
- Commit/abort
- Done (but DONE is just for bookkeeping; it does not affect response time)
17. Failure Handling in 2PC: Replica Failure
- The log contains a <commit T> record
- In this case, the site executes redo(T)
- The log contains an <abort T> record
- In this case, the site executes undo(T)
- The log contains a <ready T> record
- In this case, consult the coordinator Ci
- If Ci is down, the site sends a query-status T message to the other sites
- The log contains no control records concerning T
- In this case, the site executes undo(T)
(A sketch of this rule table follows.)
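The recovery rule is essentially a four-way case statement over the log. A minimal sketch, assuming the log is represented as a list of (record, transaction) pairs; all names here are illustrative:

```python
# Recovery decision for a replica restarting after a crash, per the rules above.

def recover_action(log, T):
    if ("commit", T) in log:
        return "redo(T)"        # outcome was commit: reapply the update
    if ("abort", T) in log:
        return "undo(T)"        # outcome was abort: roll back
    if ("ready", T) in log:
        # Voted PREPARED but never learned the outcome: ask the
        # coordinator Ci, or query the other sites if Ci is down.
        return "consult Ci (or send query-status T to other sites)"
    return "undo(T)"            # never got ready: safe to abort unilaterally

print(recover_action([("ready", "T1")], "T1"))
```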
18. Failure Handling in 2PC: Coordinator Ci Failure
- If an active site contains a <commit T> record in its log, then T must be committed.
- If an active site contains an <abort T> record in its log, then T must be aborted.
- If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.
- If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover.
- Blocking problem: T is blocked pending the recovery of site Si.
19. Failure Handling
- Failures are detected with timeouts
- If a participant times out before getting a PREPARE, it can abort
- If the coordinator times out waiting for a vote, it can abort
- If a participant times out waiting for a decision, it is blocked!
- Wait for the coordinator to recover?
- Punt to some other resolution protocol?
- If the coordinator times out waiting for DONE, it keeps its record of the outcome; other sites may have a replica
20. Failures in distributed systems
- We may want to avoid relying on a single server/coordinator/boss to make progress
- Thus we want the decision making to be distributed among the participants (all nodes created equal) ⇒ the consensus problem in distributed systems
- However, depending on what we can assume about the network, it may be impossible to reach a decision in some cases!
21. Impossibility of Consensus
- Network characteristics:
- Synchronous: some upper bound on network/processing delay
- Asynchronous: no upper bound on network/processing delay
- Fischer, Lynch, and Paterson showed:
- With even just one failure possible, you cannot guarantee consensus
- More precisely, you cannot guarantee that the consensus process will terminate
- Assumes an asynchronous network
- Essence of the proof: just before a decision is reached, we can delay a node slightly too long for the decision to be reached
- But we still want to do it... right?
22. Distributed Decision Making: Discussion
- Why is distributed decision making desirable?
- Fault tolerance! Also, atomicity in a distributed system
- A group of machines can come to a decision even if one or more of them fail during the process
- After the decision is made, the result is recorded in multiple places
- Undesirable if the algorithm is blocking (e.g. two-phase commit)
- One machine can be stalled until another site recovers
- A blocked site holds resources (locks on updated items, pages pinned in memory, etc.) until it learns the fate of the update
- To reduce blocking:
- Add more rounds (e.g. three-phase commit)
- Add more replicas than needed (e.g. quorums)
- What happens if one or more of the nodes is malicious?
- Malicious: attempting to compromise the decision making
- Known as Byzantine fault tolerance. More on this next time.
23. Faults
24. Categories of failures
- Crash faults, message loss
- These are common in real systems
- Crash failure: the process simply stops, and does nothing wrong that would be externally visible before it stops
- These faults can't be directly detected
25. Categories of failures
- Fail-stop failures
- These require system support
- The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
- With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
- Easy to work with, but rarely supported
26. Categories of failures
- Non-malicious Byzantine failures
- This is the best way to understand many kinds of corruption and buggy behaviors
- The program can do pretty much anything, including sending corrupted messages
- But it doesn't do so with the intention of screwing up our protocols
- Unfortunately, a pretty common mode of failure
27. Categories of failure
- Malicious, true Byzantine, failures
- The model is of an attacker who has studied the system and wants to break it
- She can corrupt or replay messages, intercept them at will, compromise programs and substitute hacked versions
- This is a worst-case-scenario mindset
- In practice, this doesn't actually happen often
- Very costly to defend against; typically used in very limited ways (e.g. a key management server)
28. Models of failure
- The question here concerns how failures appear in the formal models used when proving things about protocols
- Think back to Lamport's happens-before relationship, →
- The model already has processes, messages, temporal ordering
- It assumes messages are reliably delivered
29. Two kinds of models
- We tend to work within two models
- Asynchronous model: makes no assumptions about time
- Lamport's model is a good fit
- Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast/slow
- Distributed computing at an eons timescale
- Synchronous model: assumes a lock-step execution in which processes share a clock
30. Adding failures in Lamport's model
- Also called the asynchronous model
- Normally we just assume that a failed process crashes: it stops doing anything
- Notice that in this model, a failed process is indistinguishable from a delayed process
- In fact, the decision that something has failed takes on an arbitrary flavor
- Suppose that at point e in its execution, process p decides to treat q as faulty...
31. What about the synchronous model?
- Here, we also have processes and messages
- But communication is usually assumed to be reliable: any message sent at time t is delivered by time t + δ
- Algorithms are often structured into rounds, each lasting some fixed amount of time Δ, giving time for each process to communicate with every other process
- In this model, a crash failure is easily detected
- When people have considered malicious failures, they often used this model
32. Neither model is realistic
- The value of the asynchronous model is that it is so stripped down and simple
- If we can do something well in this model, we can do at least as well in the real world
- So we'll want best solutions
- The value of the synchronous model is that it adds a lot of unrealistic mechanism
- If we can't solve a problem with all this help, we probably can't solve it in a more realistic setting!
- So we seek impossibility results
33. Fischer, Lynch and Paterson
- Impossibility of consensus: a surprising result
- "Impossibility of Distributed Consensus with One Faulty Process"
- They prove that no asynchronous algorithm for agreeing on a one-bit value can guarantee that it will terminate in the presence of crash faults
- And this is true even if no crash actually occurs!
- The proof constructs infinite non-terminating runs
- Essence of the proof: just before a decision is reached, we can delay a node slightly too long for the decision to be reached
34. Tougher failure models
- We've focused on crash failures
- In the synchronous model these look like a "farewell, cruel world" message
- Some call it the fail-stop model: a faulty process is viewed as first saying goodbye, then crashing
- What about tougher kinds of failures?
- Corrupted messages
- Processes that don't follow the algorithm
- Malicious processes out to cause havoc
35. Here the situation is much harder
- Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures
- For example, to tolerate 1 failure we need 4 or more processes
- We also need f+1 rounds
- Let's see why this happens
36. Byzantine Generals scenario
- Generals (N of them) surround a city
- They communicate by courier
- Each has an opinion: attack or wait
- In fact, an attack would succeed: the city will fall
- Waiting would succeed too: the city will surrender
- But if some attack and some wait, disaster ensues
- Some generals (f of them) are traitors; it doesn't matter whether they attack or wait, but we must prevent them from disrupting the battle
- A traitor can't forge messages from other generals
37. Byzantine Generals scenario
[diagram: generals around a city shouting "Attack!", "No, wait!", "Surrender!", with opinions Attack/Wait]
38. A timeline perspective
- Suppose that p and q favor attack, r is a traitor, and s and t favor waiting; assume that in a tie vote, we attack
[timeline diagram: processes p, q, r, s, t]
39. A timeline perspective
- After the first round, the collected votes are: attack, attack, wait, wait, traitor's-vote
[timeline diagram: processes p, q, r, s, t]
40. What can the traitor do?
- Add a legitimate vote of attack
- Anyone with 3 votes to attack knows the outcome
- Add a legitimate vote of wait
- The vote now favors wait
- Or send different votes to different folks
- Or don't send a vote at all to some
41. Outcomes?
- The traitor simply votes:
- Either all see {a,a,a,w,w}
- Or all see {a,a,w,w,w}
- The traitor double-votes:
- Some see {a,a,a,w,w} and some see {a,a,w,w,w}
- The traitor withholds some vote(s):
- Some see {a,a,w,w}, perhaps others see {a,a,a,w,w}, and still others see {a,a,w,w,w}
- Notice that the traitor can't manipulate the votes of loyal generals! (a tiny vote-tally sketch follows)
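These outcomes are easy to check mechanically. A tiny Python sketch with five generals and the tie-goes-to-attack rule from the earlier slide; the Counter-based tally is my own framing, not the lecture's:

```python
# Round-1 outcomes for 4 loyal generals plus 1 traitor; ties favor attack.

from collections import Counter

def decide(votes):
    tally = Counter(votes)
    return "attack" if tally["attack"] >= tally["wait"] else "wait"

loyal = ["attack", "attack", "wait", "wait"]          # p, q, s, t
print(decide(loyal + ["attack"]))   # all see {a,a,a,w,w} -> attack
print(decide(loyal + ["wait"]))     # all see {a,a,w,w,w} -> wait
print(decide(loyal))                # vote withheld: {a,a,w,w} tie -> attack
```

The danger is the double-voting case: if the traitor tells some generals "attack" and others "wait", different loyal generals run decide() on different inputs and disagree. That is what the round-2 witness messages repair.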
42. What can we do?
- Clearly we can't decide yet: some loyal generals might have contradictory data
- Anyone with 4 votes can decide
- But with 3 votes to wait or attack, a general isn't sure (one could be a traitor)
- So in round 2, each general sends out witness messages: "here's what I saw in round 1"
- "General Smith sent me attack" (signed, Smith)
43. Digital signatures
- These require a cryptographic system
- For example, RSA
- Each player has a secret (private) key K⁻¹ and a public key K
- She can publish her public key
- RSA gives us a single encrypt function:
- Encrypt(Encrypt(M, K), K⁻¹) = Encrypt(Encrypt(M, K⁻¹), K) = M
- Encrypt a hash of the message to sign it
44. With such a system
- A can send a message to B that only A could have sent
- A just encrypts the body with her private key
- ...or one that only B can read
- A encrypts it with B's public key
- Or she can sign it as proof she sent it
- B can recompute the hash and decrypt A's signed hash to see if they match
- These capabilities limit what our traitor can do: he can't forge or modify a message (a sign/verify sketch follows)
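A sketch of sign-and-verify using the pyca/cryptography package (pip install cryptography). This is one standard way to realize the "encrypt a hash with the private key" idea; the key size and padding choices are illustrative defaults, not anything mandated by the lecture:

```python
# Signing a message hash with RSA and verifying it with the public key.

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()    # this half can be published

message = b"General Smith: attack"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Only the holder of the private key can produce this signature...
signature = private_key.sign(message, pss, hashes.SHA256())

# ...but anyone with the public key can check it; a forged or modified
# message makes verify() raise InvalidSignature.
public_key.verify(signature, message, pss, hashes.SHA256())
print("signature verified")
```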
45. A timeline perspective
- In the second round, if the traitor didn't behave identically toward all generals, we can weed out his faulty votes
[timeline diagram: processes p, q, r, s, t]
46. A timeline perspective
[timeline diagram: p, q, s, t each conclude "Attack!!"; the traitor r: "Damn! They're on to me"]
47. Traitor is stymied
- Our loyal generals can deduce that the decision was to attack
- The traitor can't disrupt this:
- He is either forced to vote legitimately, or is caught
- But the costs were steep!
- (f+1)·n² messages!
- Rounds can also be slow
- Early-stopping protocols: min(t+2, f+1) rounds, where t is the true number of faults
48. Distributed Snapshots
49. Introducing wall clock time
- Back to the notion of time
- Distributed systems sometimes need a more precise notion of time than happens-before
- There are several options
- Instead of using network/process identity to break ties,
- extend a logical clock with the wall clock time and use that to break ties
- Makes meaningful statements like "B and D were concurrent, although B occurred first"
- But unless clocks are closely synchronized, such statements could be erroneous!
- We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network
50. Synchronizing clocks
- Without help, clocks will often differ by many milliseconds
- The problem is that when a machine downloads the time from a network clock, it can't be sure what the delay was
- This is because the uplink and downlink delays are often very different in a network
- Outright failures of clocks are rare
51. Synchronizing clocks
- Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running: what time is it now?
[diagram: p asks "What time is it?"; time.windows.com replies "09:23.02921"; round-trip delay 123ms]
52. Synchronizing clocks
- Options?
- p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
- p could ignore the delay
- p could factor in only the certain delay, e.g. if we know that the link takes at least 5ms in each direction. Works best with GPS time sources!
- In general we can't do better than the uncertainty in the link delay from the time source down to p (a sketch of the estimate follows)
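A minimal sketch of the split-the-round-trip option (essentially Cristian's algorithm), treating any known minimum link delay as certain. The numeric values are made up for illustration:

```python
# Estimate the current time from one reading of a remote clock.

def estimate_time(server_time, round_trip, min_one_way=0.0):
    """server_time: remote reading (seconds); round_trip: measured RTT."""
    estimate = server_time + round_trip / 2     # assume a symmetric split
    # The reply could have spent anywhere from min_one_way to
    # (round_trip - min_one_way) seconds on the downlink, so:
    uncertainty = round_trip / 2 - min_one_way
    return estimate, uncertainty

est, err = estimate_time(server_time=1000.000, round_trip=0.123,
                         min_one_way=0.005)
print(f"now ~ {est:.4f} s, +/- {err * 1000:.1f} ms")   # +/- 56.5 ms
```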
53. Consequences?
- In a network of processes, we must assume that clocks are not perfectly synchronized
- We say that clocks are inaccurate
- Even GPS has uncertainty, although it is small
- And clocks can drift during the periods between synchronizations
- The relative drift between clocks is their precision
54. Temporal distortions
- Things can be complicated because we can't predict:
- Message delays (they vary constantly)
- Execution speeds (often a process shares a machine with many other tasks)
- Timing of external events
- Lamport looked at this question too
55. Temporal distortions
[timeline diagram: processes p0-p3 with events a-f]
56. Temporal distortions
What does "now" mean?
[timeline diagram: processes p0-p3 with events a-f]
57. Temporal distortions
Timelines can stretch, caused by scheduling effects, message delays, message loss
[timeline diagram: processes p0-p3 with events a-f]
58. Temporal distortions
Timelines can shrink, e.g. something lets a machine speed up
[timeline diagram: processes p0-p3 with events a-f]
59. Temporal distortions
Cuts represent instants of time. But not every cut makes sense: black cuts could occur, but not gray ones.
[timeline diagram: processes p0-p3 with black (consistent) and gray (inconsistent) cuts]
60. Consistent cuts and snapshots
- The idea is to identify system states that might have occurred in real life
- We need to avoid capturing states in which a message is received but nobody is shown as having sent it
- This is the problem with the gray cuts
61. Temporal distortions
Red messages cross gray cuts "backwards"
[timeline diagram: processes p0-p3; red messages crossing gray cuts]
62. Temporal distortions
Red messages cross gray cuts "backwards". In a nutshell: the cut includes a message that was never sent.
[timeline diagram: processes p0-p3]
63. Who cares?
- Suppose, for example, that we want to do distributed deadlock detection
- The system lets processes wait for actions by other processes
- A process can only do one thing at a time
- A deadlock occurs if there is a circular wait
64. Deadlock detection algorithm
- p worries: perhaps we have a deadlock
- p is waiting for q, so it sends "what's your state?"
- q, on receipt, is waiting for r, so it sends the same question; and r does the same for s. And s is waiting on p. (a cycle-check sketch follows)
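Since each process waits for at most one other process, the probe just follows single wait-for edges. A minimal sketch, with the wait-for relation given as a dict rather than discovered by messages:

```python
# Follow "waiting for" edges from p; a probe that returns to p is a cycle.

def circular_wait(waits_for, start):
    seen, node = set(), start
    while node in waits_for:        # node is blocked on someone
        node = waits_for[node]      # forward the "what's your state?" probe
        if node == start:
            return True             # probe came home: circular wait
        if node in seen:
            return False            # merged into some other wait chain
        seen.add(node)
    return False                    # reached a process that is running

edges = {"p": "q", "q": "r", "r": "s", "s": "p"}
print(circular_wait(edges, "p"))    # True: p -> q -> r -> s -> p
```

As the next slides show, a True answer is only trustworthy if the edges were collected on a consistent cut.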
65. Suppose we detect this state
- We see a cycle...
- ...but is it a deadlock?
[diagram: p waiting for q, q waiting for r, r waiting for s, s waiting for p]
66. Phantom deadlocks!
- Suppose the system has a very high rate of locking
- Then perhaps a lock release message passed a query message
- I.e. we see "q waiting for r" and "r waiting for s", but in fact, by the time we checked r, q was no longer waiting!
- In effect we checked for deadlock on a gray cut, an inconsistent cut
67. Consistent cuts and snapshots
- The goal is to draw a line across the system state such that:
- Every message received by a process is shown as having been sent by some other process
- Some pending messages might still be in communication channels
- A cut is the frontier of a snapshot
68. Chandy/Lamport Algorithm
- Assume that if pi can talk to pj, they do so using a lossless, FIFO connection
- Now think about logical clocks
- Suppose someone sets his clock way ahead and triggers a flood of messages
- As these reach each process, it advances its own time; eventually all do so
- The point where time jumps forward is a consistent cut across the system
69. Using logical clocks to make cuts
A message sets the time forward by a lot
[timeline diagram: processes p0-p3 with events a-f]
The algorithm requires FIFO channels: we must delay e until b has been delivered!
70. Using logical clocks to make cuts
The cut occurs at the point where time advanced
[timeline diagram: processes p0-p3]
71. Turning the idea into an algorithm
- To start a new snapshot, pi:
- Builds a message "pi is initiating snapshot k"
- The tuple (pi, k) uniquely identifies the snapshot
- In general, on first learning about snapshot (pi, k), px:
- Writes down its state: px's contribution to the snapshot
- Starts "tape recorders" for all communication channels
- Forwards the message on all outgoing channels
- Stops the tape recorder for a channel when a snapshot message for (pi, k) is received on it
- The snapshot consists of all the local state contributions and all the tape recordings for the channels (a sketch of the per-process rule follows)
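A condensed Python sketch of the per-process rule, with delivery and channel plumbing stubbed out. The send callback and all names are assumptions for illustration; real code would sit on FIFO sockets:

```python
# What one (non-initiator) process px does for snapshot (pi, k).

class SnapshotProcess:
    def __init__(self, name, incoming, outgoing):
        self.name, self.incoming, self.outgoing = name, incoming, outgoing
        self.saved_state = None
        self.recording = {}               # channel -> recorded messages

    def on_marker(self, channel, snap_id, send):
        if self.saved_state is None:      # first word of snapshot (pi, k)
            self.saved_state = self.local_state()
            # start tape recorders on every other incoming channel; the
            # channel the marker arrived on is recorded as empty
            self.recording = {c: [] for c in self.incoming if c != channel}
            for c in self.outgoing:       # forward the marker everywhere
                send(c, ("MARKER", snap_id))
        else:                             # marker arrived on this channel:
            self.recording.pop(channel, None)   # stop its tape recorder

    def on_message(self, channel, msg):
        if channel in self.recording:     # message was in flight at the cut
            self.recording[channel].append(msg)

    def local_state(self):
        return f"local state of {self.name}"    # application-specific
```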
72. Chandy/Lamport
- This is that algorithm, implemented with an outgoing flood followed by an incoming wave of snapshot contributions
- The snapshot ends up accumulating at the initiator, pi
- The algorithm doesn't tolerate process failures or message failures
73. Chandy/Lamport
[diagram: a network of processes p, q, r, s, t, u, v, w, x, y, z]
74. Chandy/Lamport
[diagram: p announces "I want to start a snapshot"]
75. Chandy/Lamport
[diagram: p records its local state]
76. Chandy/Lamport
[diagram: p starts monitoring its incoming channels]
77. Chandy/Lamport
[diagram: p records the contents of channel p-y]
78. Chandy/Lamport
[diagram: p floods the snapshot message on its outgoing channels]
79. Chandy/Lamport
[diagram: the snapshot message propagates through the network]
80. Chandy/Lamport
[diagram: q is done]
81. Chandy/Lamport
[diagram: q's contribution travels back toward p]
82. Chandy/Lamport
[diagram: q's contribution continues toward the initiator]
83. Chandy/Lamport
[diagram: contributions from s and z join in]
84. Chandy/Lamport
[diagram: contributions from x, u, and v join in]
85. Chandy/Lamport
[diagram: contributions from the remaining processes accumulate at p]
86. Chandy/Lamport
[diagram: p has collected every contribution. Done! A snapshot of the network]
87. What's in the state?
- In practice we only record things important to the application running the algorithm, not the whole state
- E.g. locks currently held, lock release messages
- The idea is that the snapshot will be:
- Easy to analyze, letting us build a picture of the system state
- And will have everything that matters for our real purpose, like deadlock detection
88. Summary
- Types of faults:
- Crash, fail-stop, non-malicious Byzantine, Byzantine
- Two-phase commit: distributed decision making
- First, make sure everyone guarantees that they will commit if asked (prepare)
- Next, ask everyone to commit
- Assumes crash or fail-stop faults
- Byzantine Generals Problem: distributed decision making with malicious failures
- n generals; some number of them may be malicious (up to f of them)
- All non-malicious generals must come to the same decision
- Only solvable if n ≥ 3f+1, and it costs (f+1)·n² messages