Title: Computer Science 328 Distributed Systems
1. Computer Science 328: Distributed Systems
- Lecture 16
- Distributed Shared Memory
2. Multiprocessors and Multicomputers
- In a multiprocessor, two or more CPUs share a common main memory. Any process on any processor can read or write any word in the shared memory.
- In a multicomputer, each CPU has its own private memory.
  - Easier to build: one can take a large number of single-board computers, each containing a CPU, memory, and a network interface, and connect them together.
  - Harder to program: communication has to use message passing. In contrast, in a multiprocessor system, one process just writes data to memory to be read by all the others.
3. Bus-Based Multiprocessors with Shared Memory
- When any of the CPUs wants to read a word from the memory, it puts the address of the requested word on the bus and asserts a bus control (read) line.
- To prevent two CPUs from accessing the memory at the same time, a bus arbitration mechanism is used, e.g., a CPU may have to assert a request line first.
- To improve performance, each CPU can be equipped with a snooping cache.
[Figure: two bus-based multiprocessors. In one, the CPUs and the memory attach directly to the bus; in the other, a cache sits between each CPU and the bus.]
4. Cache Consistency: Write-Through
All the other caches see the write (because they are snooping on the bus) and check to see if they are also holding the word being modified. If so, they invalidate their cache entries.
5. Cache Consistency: Write-Once
[Figure: snapshots of caches A, B, and C and the memory as word W is read and written.]
- Initially, both the memory and B hold an up-to-date copy of word W, with value W1.
- A reads word W and gets W1. B does not respond, but the memory does.
- A writes a value W2. B snoops on the bus and invalidates its entry. A's copy is marked as dirty.
- A writes W again. This and subsequent writes by A are done locally, without any bus traffic.
6. Cache Consistency: Write-Once
[Figure: further snapshots of caches A, B, and C and the memory.]
- A writes a value W3. No bus traffic is incurred.
- C writes W. A sees the request by snooping on the bus, asserts a signal that inhibits memory from responding, provides the value, and invalidates its own entry. C now has the only valid copy.

The cache consistency protocol is built upon the notion of snooping and is built into the memory management unit. The mechanisms are implemented in hardware.
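The write-once steps on the two slides above can be traced in software. The following is a minimal sketch, not the hardware mechanism itself: the class names (`Bus`, `Cache`, `State`) are hypothetical, a single word W stands in for the whole memory, and memory is assumed not to be updated by cache writes. The `traffic` counter counts bus transactions, so the local writes can be seen to add none.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    CLEAN = "clean"
    DIRTY = "dirty"


class Bus:
    """Shared bus with one memory word W and a count of bus transactions."""

    def __init__(self, initial_value):
        self.memory = initial_value
        self.caches = []
        self.traffic = 0


class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.state = State.INVALID
        self.value = None
        bus.caches.append(self)

    def _take_from_dirty_holder(self):
        """Return the value of a DIRTY cache (which then gives it up), else None."""
        for other in self.bus.caches:
            if other is not self and other.state is State.DIRTY:
                value = other.value
                other.state, other.value = State.INVALID, None
                return value
        return None

    def read(self):
        if self.state is State.INVALID:
            self.bus.traffic += 1  # a read miss goes over the bus
            value = self._take_from_dirty_holder()
            if value is None:
                # no dirty holder: the memory responds
                self.value, self.state = self.bus.memory, State.CLEAN
            else:
                # the dirty holder inhibited memory and handed over the only valid copy
                self.value, self.state = value, State.DIRTY
        return self.value

    def write(self, value):
        if self.state is not State.DIRTY:
            self.bus.traffic += 1  # the first write is visible on the bus
            for other in self.bus.caches:
                if other is not self:  # snooping caches invalidate their entries
                    other.state, other.value = State.INVALID, None
        self.value, self.state = value, State.DIRTY  # later writes stay local


bus = Bus("W1")
a, b, c = Cache("A", bus), Cache("B", bus), Cache("C", bus)
b.read()       # B loads W1 from memory
a.read()       # A reads W1; the memory responds
a.write("W2")  # B invalidates its entry; A's copy becomes dirty
a.write("W3")  # done locally, no bus traffic
c.write("W4")  # A gives up its copy; C holds the only valid one
```

After this run the bus has carried four transactions: two read misses and the first write by A and by C.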
7. Ring-Based Multiprocessors with Shared Memory
- On each machine, a single address space is divided into a private part and a shared part. Shared memory is divided into 32-byte blocks (the unit of transfer).
- Each 32-byte block in the shared memory space has a home machine on which physical memory (home memory) is always reserved for it.
- All the machines are connected in a token-passing ring. The ring wire consists of 16 data bits and 4 control bits.
- The block table (indexed by block number) keeps track of where each block is located.
  - Valid bit: set if the block is present in the cache and up to date.
  - Exclusive bit: set if the local copy (if any) is the only one.
  - Home bit: set if this is the block's home machine.
  - Location field: where the block is located in the cache if it is present and valid.
8. Ring-Based Multiprocessors with Shared Memory
[Figure: a single machine on the ring, showing the CPU, MMU, cache, home memory, private memory, and the block table with its Valid, Exclusive, Home, Interrupt, and Location fields.]
9. Protocol for Ring-Based Multiprocessors
- To read a word from shared memory:
  - The memory address is passed to the device, which checks the block table to see if the block is present.
  - If yes, the request is satisfied.
  - If not,
    - the device waits until it captures the circulating token and puts a request packet onto the ring.
    - As the packet passes around the ring, each device checks if it has the requested block. If it has the block, it provides it and clears the exclusive bit (if set).
    - When the packet returns to the sender, it always contains the requested block.
- To write a word to shared memory:
  - If the block is present and is the only copy (exclusive bit is set), the word is written locally.
  - If the block is present but not the only copy, an invalidation packet is first sent around the ring to invalidate all the other copies. When the invalidation packet returns, the exclusive bit is set and the write proceeds.
  - If the block is not present,
    - a packet is sent out that combines a read request and an invalidation request.
    - The first machine that has the block copies it onto the packet and discards (invalidates) its own copy. All subsequent machines just discard the block from their caches.
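The read and write rules above can be sketched as a small simulation. This is an illustration under stated assumptions, not the actual hardware: `Device`, `Ring`, and `WORDS_PER_BLOCK` are hypothetical names, a 32-byte block is modeled as eight 4-byte words, token capture is not modeled, and a packet is modeled as a simple loop over the other devices in ring order.

```python
WORDS_PER_BLOCK = 8  # assumption: a 32-byte block holds eight 4-byte words


class Device:
    def __init__(self, name):
        self.name = name
        self.table = {}  # block number -> {"words": [...], "exclusive": bool}


class Ring:
    def __init__(self, devices):
        self.devices = list(devices)

    def _downstream(self, sender):
        """The other devices, in the order a packet from `sender` passes them."""
        i = self.devices.index(sender)
        return self.devices[i + 1:] + self.devices[:i]

    def read(self, dev, block, offset):
        entry = dev.table.get(block)
        if entry is None:  # block absent: a request packet circulates
            words = None
            for other in self._downstream(dev):
                held = other.table.get(block)
                if held is not None:
                    held["exclusive"] = False  # the copy is no longer the only one
                    if words is None:
                        words = list(held["words"])  # holder provides the block
            entry = dev.table[block] = {"words": words, "exclusive": False}
        return entry["words"][offset]

    def write(self, dev, block, offset, value):
        entry = dev.table.get(block)
        if entry is None or not entry["exclusive"]:
            words = entry["words"] if entry else None
            for other in self._downstream(dev):
                held = other.table.pop(block, None)  # every other holder discards
                if words is None and held is not None:
                    words = held["words"]  # first holder supplies the block
            entry = dev.table[block] = {"words": list(words), "exclusive": True}
        entry["words"][offset] = value  # the write itself is always local


a, b, c = Device("A"), Device("B"), Device("C")
a.table[0] = {"words": [0] * WORDS_PER_BLOCK, "exclusive": True}
ring = Ring([a, b, c])

first = ring.read(b, 0, 3)  # B fetches block 0; A clears its exclusive bit
ring.write(c, 0, 3, 42)     # combined read and invalidation: A and B discard
second = ring.read(a, 0, 3)  # A fetches the block back from C
```

The sketch relies on at least one device holding every block at all times, which the home-memory rule is assumed to guarantee.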
10. The Basic Model of DSM
[Figure: pages 0 to 9 of a shared address space distributed over the memories of processors P1, P2, and P3. Further panels show a page transfer between processors and a read-only replicated page.]
11. Distributed Shared Memory
- In a DSM system, the address space is divided up into chunks (e.g., pages), with the chunks being spread over all the processors in the system.
- When a processor references an address that is not local, a trap occurs, and the DSM software fetches the chunk containing the address and restarts the faulting instruction.
- The major difference between a multiprocessor system with global shared memory (e.g., DASH) and a DSM system is that processors in the latter can reference only their own local memory.
12. Granularity of Chunks
- When a process references a word that is absent, it causes a page fault.
- On a page fault,
  - the missing page is brought in from another machine instead of from the disk, or
  - a region of 2, 4, or 8 pages including the missing page is brought in.
- Locality of reference: if a program has referenced one word on a page, it is likely to reference other neighboring words in the near future.
- False sharing is likely to occur when the chunk size is large.
13. False Sharing
[Figure: a page containing two unrelated shared variables, A and B, appears on both Processor 1 and Processor 2. The code on one processor uses A; the code on the other uses B.]
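Whether two unrelated variables end up falsely shared depends only on their addresses and the chunk size. A minimal sketch follows; the addresses, the page size, and the function names are assumptions chosen for illustration.

```python
def chunk_of(address, chunk_size):
    """Number of the chunk (page or group of pages) that contains `address`."""
    return address // chunk_size


def falsely_shared(addr_a, addr_b, chunk_size):
    """True if two distinct variables land in the same chunk."""
    return addr_a != addr_b and chunk_of(addr_a, chunk_size) == chunk_of(addr_b, chunk_size)


ADDR_A, ADDR_B = 1000, 5000  # hypothetical byte addresses of variables A and B
PAGE = 4096                  # hypothetical page size in bytes

for pages_per_chunk in (1, 2, 4, 8):
    size = PAGE * pages_per_chunk
    print(pages_per_chunk, falsely_shared(ADDR_A, ADDR_B, size))
```

With these addresses, A and B sit on different pages when the chunk is a single page, but they share a chunk as soon as two or more pages are moved together.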
14. Achieving Sequential Consistency
- Achieving consistency is not an issue if
  - pages are not replicated, or
  - only read-only pages are replicated.
- Two approaches are taken in DSM:
  - Update: the write is allowed to take place locally, but the address of the modified word and its new value are broadcast on the bus simultaneously to all the other caches. Each cache holding the word copies the new value from the bus to its cache.
  - Invalidation: the address of the modified word is broadcast on the bus, but the new value is not.
- Page-based DSM systems typically use an invalidation protocol instead of an update protocol.
15. Invalidation Protocol to Achieve Consistency
- Each page is in either the R or the W state.
- When a page is in W state, only one copy exists, mapped into the owner's address space in read-write mode.
- When a page is in R state, the owner has a copy (mapped read-only), but other processes may have copies too.

Case considered: processor 1 attempts a read.
[Figure: read cases (a) and (b).]
16. Invalidation Protocol (Read)
[Figure: read cases (c) to (f), each showing the page on Processor 1 and Processor 2.]
- In the first 4 cases, (a) to (d), the page is mapped into processor 1's address space, and no trap occurs.
- Case (e), steps in order:
  - Ask for a copy
  - Mark page as R
  - Do read
- Case (f), steps in order:
  - Ask for degradation
  - Ask for a copy
  - Mark page as R
  - Do read
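17. Invalidation Protocol (Write)
[Figure: write cases, each showing the page, its state (R or W), and its owner on Processor 1 and Processor 2.]
- Processor 1 holds a copy of the page, but processor 2 is the owner. Steps in order:
  - Ask for invalidation
  - Ask for ownership
  - Mark page as W
  - Do write
- Processor 1 is the owner, and other copies exist. Steps in order:
  - Invalidate copies
  - Mark page as W
  - Do write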
18. Invalidation Protocol (Write)
- Processor 1 has no copy of the page. Both remaining cases use the same steps, in order:
  - Ask for invalidation
  - Ask for ownership
  - Ask for a page
  - Mark page as W
  - Do write
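The read and write cases of the invalidation protocol can be collected into two small functions that return the steps taken. This is a sketch under assumptions: the `Page` record and its fields are hypothetical, messages are modeled as direct state changes, and the step strings simply echo the slide text.

```python
class Page:
    """Bookkeeping for one DSM page (hypothetical layout)."""

    def __init__(self, owner, data=0):
        self.owner = owner     # processor that owns the page
        self.state = "W"       # "W": one writable copy, "R": read-only copies
        self.copies = {owner}  # processors holding a copy
        self.data = data


def read(page, proc):
    """Return (value, steps) for a read by `proc`."""
    steps = []
    if proc not in page.copies:  # page not mapped: trap
        if page.state == "W":
            steps.append("ask for degradation")
            page.state = "R"
        steps.append("ask for a copy")
        page.copies.add(proc)
        steps.append("mark page as R")
    steps.append("do read")
    return page.data, steps


def write(page, proc, value):
    """Return the steps for a write by `proc`."""
    steps = []
    if not (page.owner == proc and page.state == "W"):  # otherwise no trap
        if page.owner != proc:
            steps += ["ask for invalidation", "ask for ownership"]
            if proc not in page.copies:
                steps.append("ask for a page")
            page.owner = proc
        elif page.copies - {proc}:
            steps.append("invalidate copies")
        page.copies = {proc}  # all other copies are gone
        page.state = "W"
        steps.append("mark page as W")
    page.data = value
    steps.append("do write")
    return steps


page = Page(owner=2)
value, read_steps = read(page, 1)  # processor 2 owns the page in W state
write_steps = write(page, 1, 5)    # processor 1 holds a copy, 2 is the owner
```

Here `read_steps` lists the four-step read case and `write_steps` the four-step write case from the slides.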
19. Finding the Owner
- Do a broadcast, asking for the owner to respond.
  - An optimization is to include in the message whether the sender wants to read or write and whether it needs a copy.
  - Broadcast interrupts each processor, forcing it to inspect the request packet.
- Designate a page manager to keep track of who owns which page.
  - A page manager uses incoming requests not only to provide replies but also to keep track of changes in ownership.
  - A single page manager is a potential performance bottleneck; the remedy is to use multiple page managers.
  - The low-order bits of a page number are used as an index into a table of page managers.
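[Figure: two ways for processor P to reach the owner through the page manager.
Four-message protocol: 1. Request (P to page manager), 2. Reply (page manager to P), 3. Request (P to owner), 4. Reply (owner to P).
Three-message protocol: 1. Request (P to page manager), 2. Request forwarded (page manager to owner), 3. Reply (owner to P).]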
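Selecting a page manager from the low-order bits of the page number can be sketched as follows. The number of managers, the `PageManager` class, and the function names are assumptions for illustration; the manager count is taken to be a power of two so that a bit mask picks out the low-order bits.

```python
NUM_MANAGERS = 4  # assumption: a power of two, so a mask selects the low-order bits


def manager_index(page_number, num_managers=NUM_MANAGERS):
    """Index into the table of page managers."""
    return page_number & (num_managers - 1)


class PageManager:
    def __init__(self):
        self.owner_of = {}  # page number -> current owner

    def locate(self, page, requester, wants_write):
        """Reply with the current owner and record an ownership change on writes."""
        owner = self.owner_of[page]
        if wants_write:
            self.owner_of[page] = requester
        return owner


managers = [PageManager() for _ in range(NUM_MANAGERS)]


def find_owner(page, requester, wants_write):
    return managers[manager_index(page)].locate(page, requester, wants_write)


managers[manager_index(13)].owner_of[13] = "P2"  # page 13 starts out owned by P2
first_owner = find_owner(13, "P1", wants_write=True)    # P1 takes ownership
second_owner = find_owner(13, "P3", wants_write=False)  # P3 only reads
```

Because requests pass through the manager, it learns of the ownership change without any extra message, which is the bookkeeping the slide describes.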
20. How Does the Owner Find the Copies to Invalidate?
- Broadcast a message giving the page number and asking processors holding the page to invalidate it.
  - This works only if broadcast messages are reliable and can never be lost.
- Each owner or page manager maintains a copyset, a list telling which processors hold which pages.
  - When a page must be invalidated, the owner or page manager sends a message to each processor holding the page and waits for an acknowledgement.
[Figure: processors connected by a network, each holding a set of pages, with the copyset recorded for each page.]
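A copyset-driven invalidation can be sketched in a few lines. The names (`Processor`, `invalidate_copies`) and the data layout are hypothetical, and message passing is modeled as a direct method call whose return value plays the role of the acknowledgement.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.pages = set()  # page numbers currently held

    def on_invalidate(self, page):
        """Drop the local copy and acknowledge with our processor id."""
        self.pages.discard(page)
        return self.pid


def invalidate_copies(page, owner, copyset, processors):
    """Invalidate every copy of `page` except the owner's; return who was told."""
    targets = copyset[page] - {owner}
    acks = {processors[pid].on_invalidate(page) for pid in targets}
    if acks != targets:  # the owner waits until every holder has acknowledged
        raise RuntimeError("missing acknowledgement")
    copyset[page] = {owner}
    return targets


processors = {pid: Processor(pid) for pid in (1, 2, 3, 4)}
copyset = {5: {1, 3, 4}}  # hypothetical: page 5 is held by processors 1, 3, and 4
for pid in copyset[5]:
    processors[pid].pages.add(5)

invalidated = invalidate_copies(5, owner=1, copyset=copyset, processors=processors)
```

Unlike the broadcast approach, only the processors named in the copyset are interrupted.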
21. Strict and Sequential Consistency
- Consistency models trade off accuracy against performance.
- Strict consistency (one-copy semantics)
  - Any read to a memory location x returns the value stored by the most recent write operation to x.
  - When memory is strictly consistent, all writes are instantaneously visible to all processes and an absolute global time order is maintained.
- Sequential consistency
  - The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
  - All processes must see the same sequence of memory references.
  - It can be realized in a system with a totally ordered reliable broadcast mechanism as follows: all operations are broadcast. The exact order does not matter as long as all processes agree on the order of all operations on the shared memory.
22. Sequential Consistency in Textbook
- Sequential consistency: for any execution,
  - the interleaved sequence of operations is such that if R(x)a occurs in the sequence, then either the last write operation to x that occurs before it in the interleaved sequence is W(x)a, or no write operation to x occurs before it and a is the initial value of x;
  - the order of operations of any program in the sequence is consistent with its program order.
- In this model, writes must occur in the same order on all copies; reads, however, can be interleaved on each system as convenient. Stale reads can occur.
23. How to Determine the Sequential Order?
- Example: given H1 = W(x)1 and H2 = R(x)0 R(x)1, how do we come up with a single string S that gives the order in which the operations would have been carried out, subject to the following constraints?
  - Program order must be maintained.
  - Memory coherence must be respected: a read of some location x must always return the value most recently written to x.
- Answer: S = R(x)0 W(x)1 R(x)1
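For histories this small, the string S can be found by brute force: generate every interleaving that preserves program order and keep those that respect memory coherence. The sketch below assumes an initial value of 0 for every variable (implied by R(x)0 in the example) and encodes each operation as a hypothetical `(kind, variable, value)` tuple.

```python
def interleavings(h1, h2):
    """Yield every merge of h1 and h2 that keeps each history's own order."""
    if not h1:
        yield list(h2)
        return
    if not h2:
        yield list(h1)
        return
    for rest in interleavings(h1[1:], h2):
        yield [h1[0]] + rest
    for rest in interleavings(h1, h2[1:]):
        yield [h2[0]] + rest


def coherent(sequence, initial=0):
    """True if every read returns the value most recently written."""
    memory = {}
    for kind, var, value in sequence:
        if kind == "W":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True


H1 = [("W", "x", 1)]
H2 = [("R", "x", 0), ("R", "x", 1)]

legal = [s for s in interleavings(H1, H2) if coherent(s)]
for s in legal:
    print(" ".join(f"{kind}({var}){value}" for kind, var, value in s))
```

Of the three possible interleavings, only R(x)0 W(x)1 R(x)1 passes the coherence check, which matches the answer on the slide. The search is exponential in the history length, so it is only a way to check small examples.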
24. Causal Consistency
- Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.
- Example 1 (time runs left to right; W(x)2 and W(x)3 are concurrent writes):

P1: W(x)1                 W(x)3
P2:        R(x)1  W(x)2
P3:        R(x)1                 R(x)3  R(x)2
P4:        R(x)1                 R(x)2  R(x)3

This sequence is allowed with causally consistent memory.
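25. Causal Consistency
W(x)1 and W(x)2 are causally related, because P2 reads the value 1 before it writes 2:

P1: W(x)1
P2:        R(x)1  W(x)2
P3:                      R(x)2  R(x)1
P4:                      R(x)1  R(x)2

This sequence is not allowed with causally consistent memory.

Without the read by P2, the two writes are concurrent:

P1: W(x)1
P2: W(x)2
P3:        R(x)2  R(x)1
P4:        R(x)1  R(x)2

This sequence is allowed with causally consistent memory.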
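The two traces above can be checked mechanically once the causal relation between writes is known. The sketch below is a simplification under stated assumptions: there is a single variable x, each value is written at most once (so a value identifies its write), and the caller supplies the causally ordered pairs of writes. The function name and argument layout are hypothetical.

```python
def causal_violation(causal_pairs, reads):
    """Return the first process that sees causally related writes out of order.

    causal_pairs: set of (earlier_value, later_value) for causally ordered writes.
    reads: process name -> sequence of values it read from x, in program order.
    Returns None if no process violates the causal order.
    """
    for proc, seq in reads.items():
        for i, seen_first in enumerate(seq):
            for seen_later in seq[i + 1:]:
                # violation: the effect was observed before its cause
                if (seen_later, seen_first) in causal_pairs:
                    return proc
    return None


reads = {"P3": [2, 1], "P4": [1, 2]}

# First trace: P2 read 1 before writing 2, so W(x)1 causally precedes W(x)2.
related = causal_violation({(1, 2)}, reads)

# Second trace: the writes are concurrent, so no ordering is imposed.
concurrent = causal_violation(set(), reads)
```

In the first case the function names P3, which read 2 before 1; in the second it finds nothing to object to.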
26. Pipelined RAM and Processor Consistency
- Writes done by a single process are received by all other processes in the order in which they are issued, but writes from different processes may be seen in a different order by different processes.

P1: W(x)1
P2:        R(x)1  W(x)2
P3:                      R(x)2  R(x)1
P4:                      R(x)1  R(x)2

This sequence is allowed with PRAM consistent memory.
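The PRAM rule only constrains the order of writes issued by the same process, which makes it easy to check on a trace. A minimal sketch, assuming a single variable and distinct written values; `pram_ok` and its argument layout are hypothetical.

```python
def pram_ok(writes, reads):
    """Check that every reader sees each writer's values in issue order.

    writes: writer name -> list of values in the order they were issued.
    reads: reader name -> list of values observed, in program order.
    """
    for observed in reads.values():
        for issued in writes.values():
            position = {value: i for i, value in enumerate(issued)}
            # positions (in issue order) of this writer's values, as observed
            seen = [position[v] for v in observed if v in position]
            if seen != sorted(seen):
                return False
    return True


# The trace from the slide: the two writes come from different processes.
allowed = pram_ok(writes={"P1": [1], "P2": [2]},
                  reads={"P3": [2, 1], "P4": [1, 2]})

# A single process issuing both writes must be seen in issue order.
rejected = pram_ok(writes={"P1": [1, 2]},
                   reads={"P3": [2, 1]})
```

The first call accepts the slide's trace because P3 and P4 disagree only about writes from different processes; the second rejects a reader that sees one process's writes reversed.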
27. Processor Consistency in Textbook
- Processor consistency
  - Writes from a single processor must be seen by all processors in the same order.
  - Writes from different processors can be interleaved differently.
- This is less strict than sequential consistency (where all writes must be ordered).
- This is useful in applications in which each processor mainly depends on its own actions.
28. Weak Consistency
- Not all applications require seeing all the writes, let alone seeing them in order.
  - E.g., a process is inside a critical section reading and writing some variables in a tight loop. Other processes are not supposed to touch the variables until the first process has left the critical section.
- A synchronization variable is introduced. When a synchronization completes, all writes done on that machine are propagated outward and all writes done on other machines are brought in.
- Accesses to synchronization variables are sequentially consistent.
- No access to a synchronization variable is allowed to be performed until all previous writes have completed elsewhere.
  - Accessing a synchronization variable flushes the pipeline.
- No data access (read or write) is allowed until all previous accesses to synchronization variables have been performed.
29. Weak Consistency
P1: W(x)1  W(x)2  S
P2: (no operations)
P3:                   R(x)2  R(x)1  S
P4:                   R(x)1  R(x)2  S

This sequence is allowed with weakly consistent memory.

P1: W(x)1  W(x)2  S
P2: (no operations)
P3:                   S  R(x)2  R(x)2
P4:                   S  R(x)2  R(x)2

The memory in P3 and P4 has been brought up to date.
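The effect of the synchronization variable S can be illustrated with a toy model in which each process works on a local copy and only `sync` exchanges updates. This is a sketch under assumptions: `WeakMemory` and its methods are hypothetical names, a shared dictionary stands in for the other machines, and a location that has never been read in returns `None`.

```python
class WeakMemory:
    """Per-process local copies; only sync() moves writes between them."""

    def __init__(self, processes):
        self.shared = {}                              # state agreed on at sync points
        self.local = {p: {} for p in processes}       # what each process sees
        self.pending = {p: {} for p in processes}     # writes not yet propagated

    def write(self, proc, var, value):
        self.local[proc][var] = value
        self.pending[proc][var] = value

    def read(self, proc, var):
        return self.local[proc].get(var)

    def sync(self, proc):
        self.shared.update(self.pending[proc])  # propagate local writes outward
        self.pending[proc].clear()
        self.local[proc].update(self.shared)    # bring in writes done elsewhere


mem = WeakMemory(["P1", "P3"])
mem.write("P1", "x", 1)
mem.write("P1", "x", 2)
before_any_sync = mem.read("P3", "x")  # P3 has no guarantee yet
mem.sync("P1")                         # P1's writes are propagated
before_own_sync = mem.read("P3", "x")  # P3 has not synchronized itself
mem.sync("P3")
after_sync = mem.read("P3", "x")       # now up to date
```

A reader is guaranteed the latest value only after it has synchronized itself, as in the second trace on the slide; in this model it simply keeps its old local view until then.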
30. Release Consistency
- Two synchronization operations are defined:
  - Acquire: gathers in all writes from other machines.
  - Release: ensures that all locally initiated writes have been completed (propagated to all other machines).
- Acquire and release do not have to apply to all memory; instead, they can guard specific shared variables.

P1: Acq(L)  W(x)1  W(x)2  Rel(L)
P2: (no operations)
P3:                               Acq(L)  R(x)2  Rel(L)
P4:                                                      R(x)1

This sequence is allowed with release consistent memory.
31. Mechanism for Realizing Release Consistency
- To do an acquire, a process sends a message to a synchronization manager requesting an acquire on a lock.
- After the lock is acquired, an arbitrary sequence of reads and writes to the shared data can be performed without being propagated to other machines.
- When the release is done, the modified data are sent to the other machines holding copies.
- After each machine acknowledges receipt of the data, the synchronization manager is informed of the release.
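The four steps above can be sketched with a synchronization manager and nodes that buffer their writes until release. All names here (`SyncManager`, `Node`, `receive`) are hypothetical, messages are modeled as direct method calls, and a contended acquire raises an error where a real system would queue the request.

```python
class SyncManager:
    def __init__(self):
        self.holder = {}  # lock name -> node holding it, or None

    def acquire(self, lock, node):
        if self.holder.get(lock) is not None:
            raise RuntimeError("lock is held; a real system would queue the request")
        self.holder[lock] = node

    def release(self, lock, node):
        if self.holder.get(lock) is not node:
            raise RuntimeError("release by a node that does not hold the lock")
        self.holder[lock] = None


class Node:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.peers = []     # other machines holding copies
        self.memory = {}    # local copy of the shared data
        self.modified = {}  # writes made since the last release

    def acquire(self, lock):
        self.manager.acquire(lock, self)

    def write(self, var, value):
        self.memory[var] = value  # local only; nothing is propagated yet
        self.modified[var] = value

    def read(self, var):
        return self.memory.get(var)

    def receive(self, updates):
        self.memory.update(updates)
        return True  # acknowledgement

    def release(self, lock):
        # send the modified data to every machine holding a copy
        acks = [peer.receive(dict(self.modified)) for peer in self.peers]
        if not all(acks):
            raise RuntimeError("missing acknowledgement")
        self.modified.clear()
        self.manager.release(lock, self)  # inform the manager only after the acks


manager = SyncManager()
p1, p3 = Node("P1", manager), Node("P3", manager)
p1.peers, p3.peers = [p3], [p1]

p1.acquire("L")
p1.write("x", 1)
p1.write("x", 2)
seen_before_release = p3.read("x")  # nothing has been propagated yet
p1.release("L")
p3.acquire("L")
seen_after_acquire = p3.read("x")
p3.release("L")
```

Only the final value of x crosses the network, once, at release time; the intermediate write of 1 is never sent.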