Title: CPSC 668 Distributed Algorithms and Systems
CPSC 668: Distributed Algorithms and Systems
- Fall 2009
- Prof. Jennifer Welch
Distributed Shared Memory
- A model for inter-process communication
- Provides illusion of shared variables on top of message passing
- Shared memory is often considered a more convenient programming platform than message passing
- Formally, give a simulation of the shared memory model on top of the message passing model
- We'll consider the special case of
  - no failures
  - only read/write variables to be simulated
The Simulation
(Figure: layered architecture. Users of read/write shared memory issue read/write invocations and get return/ack responses from the Shared Memory layer; that layer is implemented by algorithms alg0 ... algn-1, which communicate through send/recv events on the underlying Message Passing System.)
Shared Memory Issues
- A process invokes a shared memory operation (read or write) at some time
- The simulation algorithm running on the same node executes some code, possibly involving exchanges of messages
- Eventually the simulation algorithm informs the process of the result of the shared memory operation.
- So shared memory operations are not instantaneous!
- Operations (invoked by different processes) can overlap
- What values should be returned by operations that overlap other operations?
  - defined by a memory consistency condition
Sequential Specifications
- Each shared object has a sequential specification: specifies behavior of object in the absence of concurrency.
- Object supports operations
  - invocations
  - matching responses
- Set of sequences of operations that are legal
Sequential Spec for R/W Registers
- Each operation has two parts, invocation and response
- Read operation has invocation read_i(X) and response return_i(X,v)
- Write operation has invocation write_i(X,v) and response ack_i(X)
- A sequence of operations is legal iff each read returns the value of the latest preceding write (a small check is sketched below).
- Ex: write_0(X,3) ack_0(X) read_1(X) return_1(X,3)
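A minimal sketch of this legality rule, assuming a toy encoding in which each completed operation is a tuple (kind, proc, var, value) and each read must return the latest write to its variable (or the initial value if none); the encoding and function name are illustrative, not from the lecture:

def is_legal(ops, initial=0):
    """ops: completed operations in order, e.g. ("write", 0, "X", 3) or ("read", 1, "X", 3)."""
    current = {}  # latest value written to each variable
    for kind, _proc, var, value in ops:
        if kind == "write":
            current[var] = value
        elif value != current.get(var, initial):  # a read returning a stale value
            return False
    return True

# The slide's example: write_0(X,3) ack_0(X) read_1(X) return_1(X,3)
print(is_legal([("write", 0, "X", 3), ("read", 1, "X", 3)]))  # True
print(is_legal([("write", 0, "X", 3), ("read", 1, "X", 0)]))  # False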
Memory Consistency Conditions
- Consistency conditions tie together the sequential specification with what happens in the presence of concurrency.
- We will study two well-known conditions
  - linearizability
  - sequential consistency
- We will only consider read/write registers, in the absence of failures.
Definition of Linearizability
- Suppose σ is a sequence of invocations and responses.
  - an invocation is not necessarily immediately followed by its matching response
- σ is linearizable if there exists a permutation π of all the operations in σ (now each invocation is immediately followed by its matching response) s.t.
  - π|X is legal (satisfies sequential spec) for all X, and
  - if response of operation O1 occurs in σ before invocation of operation O2, then O1 occurs in π before O2 (π respects real-time order of non-overlapping operations in σ).
- A checker for these two conditions is sketched below.
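A hedged sketch of checking whether a given permutation pi witnesses linearizability of a history sigma; the operation records (dicts with kind, var, value and real-time stamps inv/resp) are illustrative assumptions, and pi must contain the same operation objects as sigma:

def legal_per_variable(pi, initial=0):
    """pi|X must satisfy the sequential spec for every variable X."""
    current = {}
    for op in pi:
        if op["kind"] == "write":
            current[op["var"]] = op["value"]
        elif op["value"] != current.get(op["var"], initial):
            return False
    return True

def respects_real_time(sigma, pi):
    """If O1's response precedes O2's invocation in sigma, O1 must precede O2 in pi."""
    pos = {id(op): i for i, op in enumerate(pi)}
    return all(not (o1["resp"] < o2["inv"] and pos[id(o1)] > pos[id(o2)])
               for o1 in sigma for o2 in sigma)

def is_linearization(sigma, pi):
    return legal_per_variable(pi) and respects_real_time(sigma, pi)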
Linearizability Examples
Suppose there are two shared variables, X and Y, both initially 0.
(Figure: timelines for p_0 and p_1 showing overlapping read/write operations on X and Y, with candidate ordering points marked in green.)
Is this sequence linearizable?
Yes - green triangles.
What if p_1's read returns 0?
No - see arrow.
Definition of Sequential Consistency
- Suppose σ is a sequence of invocations and responses.
- σ is sequentially consistent if there exists a permutation π of all the operations in σ s.t.
  - π|X is legal (satisfies sequential spec) for all X, and
  - if response of operation O1 occurs in σ before invocation of operation O2 at the same process, then O1 occurs in π before O2 (π respects real-time order of operations by the same process in σ).
- The only change to the checker sketched above is shown below.
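Relative to the linearizability checker above, only the ordering condition changes: π need only respect the order of operations issued by the same process. A hedged variant (operation dicts now also carry a "proc" field; reuses legal_per_variable from the previous sketch):

def respects_process_order(sigma, pi):
    pos = {id(op): i for i, op in enumerate(pi)}
    return all(not (o1["proc"] == o2["proc"] and o1["resp"] < o2["inv"]
                    and pos[id(o1)] > pos[id(o2)])
               for o1 in sigma for o2 in sigma)

def is_sc_witness(sigma, pi):
    return legal_per_variable(pi) and respects_process_order(sigma, pi)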
Sequential Consistency Examples
Suppose there are two shared variables, X and Y, both initially 0.
(Figure: timelines for p_0 and p_1 showing overlapping read/write operations on X and Y, with a valid ordering marked by green numbers.)
Is this sequence sequentially consistent?
Yes - green numbers.
What if p_0's read returns 0?
No - see arrows.
Specification of Linearizable Shared Memory Comm. System
- Inputs are invocations on the shared objects
- Outputs are responses from the shared objects
- A sequence σ is in the allowable set iff
  - Correct Interaction: each proc. alternates invocations and matching responses
  - Liveness: each invocation has a matching response
  - Linearizability: σ is linearizable
Specification of Sequentially Consistent Shared Memory
- Inputs are invocations on the shared objects
- Outputs are responses from the shared objects
- A sequence σ is in the allowable set iff
  - Correct Interaction: each proc. alternates invocations and matching responses
  - Liveness: each invocation has a matching response
  - Sequential Consistency: σ is sequentially consistent
Algorithm to Implement Linearizable Shared Memory
- Uses totally ordered broadcast as the underlying communication system (a toy sketch in code follows below).
- Each proc keeps a replica for each shared variable
- When read request arrives
  - send bcast msg containing request
  - when own bcast msg arrives, return value in local replica
- When write request arrives
  - send bcast msg containing request
  - upon receipt, each proc updates its replica's value
  - when own bcast msg arrives, respond with ack
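A single-machine sketch of this algorithm, in which totally ordered broadcast is simulated by one shared log that every replica consumes in the same order; the class and method names (TOBroadcast, LinearizableReplica) are illustrative, and delivery is applied lazily here rather than on message receipt, which does not change the total order:

class TOBroadcast:
    """Toy totally ordered broadcast: a single global log; send returns the slot index."""
    def __init__(self):
        self.log = []
    def send(self, msg):
        self.log.append(msg)
        return len(self.log) - 1

class LinearizableReplica:
    def __init__(self, pid, bc):
        self.pid = pid
        self.bc = bc
        self.copy = {}       # local replica of each shared variable
        self.delivered = 0   # number of broadcast messages applied so far

    def _deliver_through(self, idx):
        # Apply broadcast messages in the global order up to and including slot idx.
        # (In the real algorithm every proc applies each write as it is received;
        # this toy applies them lazily, but in the same total order.)
        while self.delivered <= idx:
            kind, _sender, var, val = self.bc.log[self.delivered]
            if kind == "write":
                self.copy[var] = val
            self.delivered += 1

    def read(self, var):
        # read request: bcast it, wait for own bcast to arrive, return local replica value
        idx = self.bc.send(("read", self.pid, var, None))
        self._deliver_through(idx)
        return self.copy.get(var, 0)

    def write(self, var, val):
        # write request: bcast it; replicas update on delivery; ack when own bcast arrives
        idx = self.bc.send(("write", self.pid, var, val))
        self._deliver_through(idx)

# Example: p1's completed write is visible to p0's later read.
bc = TOBroadcast()
p0, p1 = LinearizableReplica(0, bc), LinearizableReplica(1, bc)
p1.write("X", 1)
print(p0.read("X"))  # 1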
The Simulation
(Figure: same layered architecture as before, except that the bottom layer is now Totally Ordered Broadcast and alg0 ... algn-1 use to-bc-send and to-bc-recv instead of send and recv.)
Correctness of Linearizability Algorithm
- Consider any admissible execution α of the algorithm
  - underlying totally ordered broadcast behaves properly
  - users interact properly
- Show that σ, the restriction of α to the events of the top interface, satisfies Liveness and Linearizability.
Correctness of Linearizability Algorithm
- Liveness (every invocation has a response): by the Liveness property of the underlying totally ordered broadcast.
- Linearizability: define the permutation π of the operations to be the order in which the corresponding broadcasts are received.
  - π is legal because all the operations are consistently ordered by the TO bcast.
  - π respects real-time order of operations: if O1 finishes before O2 begins, O1's bcast is ordered before O2's bcast.
Why is Read Bcast Needed?
- The bcast done for a read causes no changes to any replicas, just delays the response to the read.
- Why is it needed?
- Let's see what happens if we remove it.
Why Read Bcast is Needed
(Figure: p_1 invokes write(1) and issues its to-bc-send; the broadcast reaches p_0 before p_0's read, so p_0 reads and returns 1, but p_2's read, which starts after p_0's read has finished, still returns 0 because the broadcast has not yet reached p_2.)
Not linearizable!
Algorithm for Sequential Consistency
- The linearizability algorithm, without doing a bcast for reads (sketched below as a small change to the previous sketch)
- Uses totally ordered broadcast as the underlying communication system.
- Each proc keeps a replica for each shared variable
- When read request arrives
  - immediately return the value stored in the local replica
- When write request arrives
  - send bcast msg containing request
  - upon receipt, each proc updates its replica's value
  - when own bcast msg arrives, respond with ack
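Relative to the linearizable sketch above, reads now answer immediately from the local replica; the poll method stands in for applying write broadcasts on receipt (again illustrative names, reusing TOBroadcast and LinearizableReplica):

class SCReplica(LinearizableReplica):
    def read(self, var):
        # no bcast for reads: immediately return the local replica's value
        return self.copy.get(var, 0)

    def poll(self):
        # stands in for "upon receipt of a write bcast, update the replica";
        # in a real system this is driven by message delivery, not called on demand
        self._deliver_through(len(self.bc.log) - 1)

# A read may lag a completed write: allowed by sequential consistency,
# but not by linearizability (the scenario of the "Why Read Bcast is Needed" slide).
bc = TOBroadcast()
q0, q1 = SCReplica(0, bc), SCReplica(1, bc)
q1.write("X", 1)
print(q0.read("X"))  # may print 0
q0.poll()
print(q0.read("X"))  # 1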
Correctness of SC Algorithm
- Lemma (9.3): The local copies at each proc. take on all the values appearing in write operations, in the same order, which preserves the order of non-overlapping writes
  - implies per-process order of writes
- Lemma (9.4): If p_i writes Y and later reads X, then p_i's update of its local copy of Y (on behalf of that write) precedes its read of its local copy of X (on behalf of that read).
Correctness of the SC Algorithm
- (Theorem 9.5) Why does SC hold?
- Given any admissible execution α, must come up with a permutation π of the shared memory operations that is
  - legal, and
  - respects per-proc. ordering of operations
The Permutation π
- Insert all writes into π in their to-bcast order.
- Consider each read R in σ in the order of invocation (the construction is sketched in code below)
  - suppose R is a read by p_i of X
  - place R in π immediately after the later of
    - the operation by p_i that immediately precedes R in σ, and
    - the write that R "read from" (caused the latest update of p_i's local copy of X preceding the response for R)
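A hedged sketch of this construction; the record layout (each read carries prev, its process-predecessor in σ, and read_from, the write whose update it returned, with None meaning the initial value) is an illustrative assumption:

def build_permutation(writes_in_tbc_order, reads_in_invocation_order):
    pi = list(writes_in_tbc_order)            # all writes, in to-bcast order

    def pos(op):
        # position by identity, since distinct operations may look equal
        return next(i for i, o in enumerate(pi) if o is op)

    for r in reads_in_invocation_order:
        # R must come after its process-predecessor and after the write it read from
        anchors = [op for op in (r["prev"], r["read_from"]) if op is not None]
        spot = max((pos(a) for a in anchors), default=-1)
        pi.insert(spot + 1, r)                # immediately after the later anchor
    return pi

# tiny check: W = write_1(X,1); R = a read of X by p_2 that read from W
W = {"kind": "write", "proc": 1, "var": "X", "value": 1}
R = {"kind": "read", "proc": 2, "var": "X", "value": 1, "prev": None, "read_from": W}
print(build_permutation([W], [R]) == [W, R])  # True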
Permutation Example
(Figure: p_2 performs write(1) and then a read returning 1; p_1 performs write(2); p_0 performs a read returning 2; both writes go through to-bc-send. The permutation is given by the green numbers: 1 write(1) by p_2, 2 read(1) by p_2, 3 write(2) by p_1, 4 read(2) by p_0.)
Permutation π Respects Per-Proc. Ordering
- For a specific proc:
  - Relative ordering of two writes is preserved by Lemma 9.3
  - Relative ordering of two reads is preserved by the construction of π
  - If write W precedes read R in exec. σ, then W precedes R in π by construction
  - Suppose read R precedes write W in σ. Show same is true in π.
Permutation π Respects Ordering
- Suppose in contradiction R and W are swapped in π
  - There is a read R' by p_i that equals or precedes R in σ
  - There is a write W' that equals W or follows W in the to-bcast order
  - And R' "reads from" W'.
(Figure: in σ|p_i the order is R', then R, then W; in π the order is W, W', R', R.)
- But
  - R' finishes before W starts in σ, and
  - updates are done to local replicas in to-bcast order (Lemma 9.3), so update for W' does not precede update for W
  - so R' cannot read from W'.
Permutation π is Legal
- Consider some read R of X by p_i and some write W s.t. R reads from W in σ.
- Suppose in contradiction, some other write W' to X falls between W and R in π.
- Why does R follow W' in π?
Permutation π is Legal
- Case 1: W' is also by p_i. Then R follows W' in π because R follows W' in σ.
  - Update for W at p_i precedes update for W' at p_i in α (Lemma 9.3).
  - Thus R does not read from W, contradiction.
Permutation π is Legal
- Case 2: W' is not by p_i. Then R follows W' in π due to some operation O, also by p_i, s.t.
  - O precedes R in σ, and
  - O is placed between W' and R in π
- Consider the earliest such O.
- Case 2.1: O is a write (not necessarily to X).
  - update for W' at p_i precedes update for O at p_i in α (Lemma 9.3)
  - update for O at p_i precedes p_i's local read for R in α (Lemma 9.4)
  - So R does not read from W, contradiction.
Permutation π is Legal
(Figure: in π, the order is W, W', O, R.)
- Case 2.2: O is a read.
  - By construction of π, O must read X and in fact read from W' (otherwise O would not be placed after W')
  - Update for W at p_i precedes update for W' at p_i in α (Lemma 9.3).
  - Update for W' at p_i precedes local read for O at p_i in α (otherwise O would not read from W').
  - Thus R cannot read from W, contradiction.
Performance of SC Algorithm
- Read operations are implemented "locally", without requiring any inter-process communication.
- Thus reads can be viewed as "fast": time between invocation and response is only that needed for some local computation.
- Time for writes is time for delivery of one totally ordered broadcast (depends on how to-bcast is implemented).
Alternative SC Algorithm
- It is possible to have an algorithm that implements sequentially consistent shared memory on top of totally ordered broadcast with the reverse performance:
  - writes are local/fast (even though bcasts are sent, don't wait for them to be received)
  - reads can require waiting for some bcasts to be received
- Like the previous SC algorithm, this one does not implement linearizable shared memory.
Time Complexity for DSM Algorithms
- One complexity measure of interest for DSM algorithms is how long it takes for operations to complete.
- The linearizability algorithm required D time for both reads and writes, where D is the maximum time for a totally ordered broadcast message to be received.
- The sequential consistency algorithm required D time for writes and C time for reads, where C is the time for doing some local computation.
- Can we do better? To answer this question, we need some kind of timing model.
Timing Model
- Assume the underlying communication system is the point-to-point message passing system (not totally ordered broadcast).
- Assume that every message has delay in the range [d-u, d].
- Claim: Totally ordered broadcast can be implemented in this model so that D, the maximum time for delivery, is O(d). (One possible construction is sketched below.)
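One standard way to realize the claim in a failure-free system (my own illustration, not necessarily the construction the slide has in mind) is a fixed sequencer: the sender forwards its message to the sequencer (at most d), the sequencer stamps it with the next sequence number and sends it to everyone (at most d), and receivers deliver in sequence-number order, so D is at most 2d = O(d). A sketch of the ordering logic:

class Sequencer:
    def __init__(self):
        self.next_seq = 0
    def order(self, msg):
        # assign the next global sequence number; the stamped message
        # would then be sent to every process
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return (seq, msg)

class Receiver:
    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []
    def receive(self, stamped):
        seq, msg = stamped
        self.pending[seq] = msg
        # deliver strictly in sequence-number order, buffering gaps
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

# messages may arrive out of order, but every receiver delivers them in the same order
s, r = Sequencer(), Receiver()
a, b = s.order("write X:=1"), s.order("write Y:=2")
r.receive(b); r.receive(a)
print(r.delivered)  # ['write X:=1', 'write Y:=2']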
Time and Clocks in Layered Model
- Timed execution: associate an occurrence time with each node input event.
- Times of other events are "inherited" from time of triggering node input
  - recall assumption that local processing time is negligible.
- Model hardware clocks as before: run at same rate as real time, but not synchronized
- Notions of view, timed view, shifting are same
- Shifting Lemma still holds (relates h/w clocks and msg delays between original and shifted execs); it is restated below.
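Since the lower bound proofs below lean on it, here is a hedged restatement of the Shifting Lemma in this notation: if each event of p_i is shifted later by x_i to obtain shift(α, x), then p_i's hardware clock becomes

$$HC_i' = HC_i - x_i$$

(so clock readings at p_i's own events are unchanged), and a message from p_i to p_j whose delay was δ in α has delay

$$\delta' = \delta - x_i + x_j$$

in shift(α, x).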
Lower Bound for SC
- Let T_read = worst-case time for a read to complete
- Let T_write = worst-case time for a write to complete
- Theorem (9.7): In any simulation of sequentially consistent shared memory on top of point-to-point message passing, T_read + T_write ≥ d.
SC Lower Bound Proof
- Consider any SC simulation with T_read + T_write < d.
- Let X and Y be two shared variables, both initially 0.
- Let α_0 be an admissible execution whose top layer behavior is
  - write_0(X,1) ack_0(X) read_0(Y) return_0(Y,0)
  - write begins at time 0, read ends before time d
  - every msg has delay d
- Why does α_0 exist?
  - The alg. must respond correctly to any sequence of invocations.
  - Suppose user at p_0 wants to do a write, immediately followed by a read.
  - By SC, read must return 0.
  - By assumption, total elapsed time is less than d.
SC Lower Bound Proof
- Similarly, let α_1 be an admissible execution whose top layer behavior is
  - write_1(Y,1) ack_1(Y) read_1(X) return_1(X,0)
  - write begins at time 0, read ends before time d
  - every msg has delay d
- α_1 exists for similar reason.
- Now merge p_0's timed view in α_0 with p_1's timed view in α_1 to create admissible execution α'. (Every message has delay d and both reads finish before time d, so neither process receives any message before its read completes and cannot distinguish α' from its own original execution.)
- But α' is not SC, contradiction! (Whichever read is ordered last in a candidate permutation must follow the other process's completed write, so it cannot return 0.)
SC Lower Bound Proof
(Figure: timeline from 0 to d of the merged execution α', showing both writes starting at time 0 and both reads returning 0 before time d: not SC - contradiction!)
Linearizability Write Lower Bound
- Theorem (9.8): In any simulation of linearizable shared memory on top of point-to-point message passing, T_write ≥ u/2.
- Proof: Consider any linearizable simulation with T_write < u/2.
- Let α be an admissible exec. whose top layer behavior is
  - p_1 writes 1 to X, p_2 writes 2 to X, p_0 reads 2 from X
- Shift α to create an admissible exec. in which p_1's and p_2's writes are swapped, causing p_0's read to violate linearizability.
Linearizability Write Lower Bound
(Figure: timeline of the original execution α, which is linearizable and admissible.)
Linearizability Write Lower Bound
(Figure: timeline from 0 to u. Shifting p_1 later by u/2 and p_2 earlier by u/2 yields an admissible execution in which write 2 now precedes write 1, yet p_0 still reads 2: not linearizable - contradiction!)
Linearizability Read Lower Bound
- Approach is similar to the write lower bound.
- Assume in contradiction there is an algorithm with T_read < u/4.
- Identify a particular execution:
  - fix a pattern of read and write invocations, occurring at particular times
  - fix the pattern of message delays
- Shift this execution to get one that is
  - still admissible
  - but not linearizable
Linearizability Read Lower Bound
- Original execution:
  - p_1 reads X and gets 0 (old value).
  - Then p_0 starts writing 1 to X.
  - When write is done, p_0 reads X and gets 1 (new value).
  - Also, during the write, p_0 and p_1 alternate reading X.
  - At some point, the reads stop getting the old value (0) and start getting the new value (1).
Linearizability Read Lower Bound
- Set all delays in this execution to be d - u/2.
- Now shift p_2 earlier by u/2.
- Verify that result is still admissible: every delay either stays the same or becomes d or d - u (the delay arithmetic is shown below).
- But in the shifted execution, the sequence of values read is 0, 0, ..., 0, 1, 0, 1, 1, ..., 1: not linearizable!
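This admissibility check is just the Shifting Lemma arithmetic with the shifted process moved earlier by u/2 and all original delays equal to d - u/2:

$$\delta'_{\text{from shifted proc}} = \left(d - \tfrac{u}{2}\right) + \tfrac{u}{2} = d, \qquad \delta'_{\text{to shifted proc}} = \left(d - \tfrac{u}{2}\right) - \tfrac{u}{2} = d - u,$$

while delays between unshifted processes stay d - u/2; so every delay remains in the admissible range [d-u, d].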