Title: CSS434: Parallel
1. CSS434 Distributed Shared Memory (Textbook Ch. 18)
Professor Munehiro Fukuda
2. Basic Concept
- Distributed shared memory (DSM) exists only virtually: computers connected by a communication network access one shared address space.
- Operations: write(address, data) and data = read(address).
- A cache line or a page is transferred to and cached in the requesting computer.
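The read/write interface above can be sketched as a single node that fetches a whole page on its first access to any address in that page. This is a minimal illustration, not any real DSM run-time: the names `Node`, `PAGE_SIZE`, and `backing` are hypothetical, and coherence between nodes is deliberately ignored here.

```python
PAGE_SIZE = 4  # words per page (hypothetical value for illustration)


class Node:
    """One computer with a local cache of pages fetched on demand."""

    def __init__(self, backing):
        self.backing = backing   # dict: page number -> list of words (rest of the system)
        self.cache = {}          # pages currently cached on this node
        self.transfers = 0       # pages moved over the "network"

    def _page(self, address):
        page_no = address // PAGE_SIZE
        if page_no not in self.cache:                 # miss: transfer the whole page
            self.cache[page_no] = list(self.backing[page_no])
            self.transfers += 1
        return self.cache[page_no]

    def read(self, address):
        return self._page(address)[address % PAGE_SIZE]

    def write(self, address, data):
        # Only the local copy changes; keeping copies coherent is a separate issue.
        self._page(address)[address % PAGE_SIZE] = data
```

Two reads that fall in the same page cost one transfer, which is the data-locality benefit of moving a page rather than a single word.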
3. Writer Process on DSM
/* Program Writer */
#include "world.h"

struct shared { int a, b; };

int main(void)
{
    int x;
    struct shared *p;

    methersetup();                     /* initialize the Mether run-time */
    p = (struct shared *)METHERBASE;   /* overlay structure on METHER segment */
    p->a = p->b = 0;                   /* initialize fields to zero */
    while (TRUE) {                     /* continuously update structure fields */
        p->a = p->a + 1;
        p->b = p->b - 1;
    }
}
4. Reader Process on DSM
/* Program Reader (uses the same "world.h" and struct shared as the writer) */
int main(void)
{
    struct shared *p;

    methersetup();
    p = (struct shared *)METHERBASE;
    while (TRUE) {                     /* read the fields once every second */
        printf("a = %d, b = %d\n", p->a, p->b);
        sleep(1);
    }
}
5. Why DSM?
- Simpler abstraction
  - The tedious underlying communication primitives are all hidden behind memory accesses.
- Better portability of distributed application programs
  - Natural transition from sequential to distributed applications
- Better performance of some applications
  - Data locality, on-demand data movement, and a large memory space reduce network traffic and paging/swapping activities.
- Flexible communication environment
  - Sender and receiver need not know each other. They need not even coexist.
- Ease of process migration
  - Migration is completed simply by transferring the corresponding PCB to the destination.
6. Main Issues
- Granularity
  - From fine (less false sharing but more network traffic) to coarse (more false sharing but less network traffic): cache line (e.g. Dash and Alewife), object (e.g. Orca and Linda), page (e.g. Ivy)
- Memory coherence and access synchronization
  - Strict, sequential, causal, weak, and release consistency models
- Data location and access
  - Broadcasting, centralized data locator, fixed distributed data locator, and dynamic distributed data locator
- Replacement strategy
  - LRU or FIFO (the same issue as OS virtual memory)
- Thrashing
  - How to prevent a block from being exchanged back and forth between two nodes
- Heterogeneity
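The granularity trade-off can be made concrete with a small check of whether two unrelated variables land in the same block. This is only a sketch: the function names and the block sizes of 64 and 4096 bytes are illustrative assumptions, not values taken from Dash, Alewife, or Ivy.

```python
def block_of(address, block_size):
    """Index of the block that contains the given byte address."""
    return address // block_size


def falsely_shared(addr_p, addr_q, block_size):
    """True when two different addresses fall in the same block."""
    return addr_p != addr_q and block_of(addr_p, block_size) == block_of(addr_q, block_size)


# Two unrelated variables, one used by each of two processes (hypothetical addresses).
print(falsely_shared(100, 160, 64))     # cache-line-sized blocks
print(falsely_shared(100, 160, 4096))   # page-sized blocks
```

With the small block size the two variables sit in different blocks; with the page-sized block they share one, so a write to either would disturb the other process's copy.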
7. Consistency Models: Two Processes Accessing Shared Variables
At the beginning, a = b = 0.
DSM needs a consistency model.
8. Consistency Models: Strict Consistency
- Wi(x, a): Processor i writes a to variable x (i.e., x = a).
- b←Ri(x): Processor i reads b from variable x (i.e., y = x gives y = b).
- Any read on x must return the value of the most recent write on x.
[Figure: two timeline diagrams over P1, P2, and P3, labeled "Strict Consistency" and "Not Strict Consistency". P2 performs W2(x, a); the reads shown are nil←R1(x), a←R1(x), and a←R3(x).]
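The strict consistency rule can be checked mechanically if every operation carries a global timestamp. The sketch below assumes such timestamps exist (which is precisely what makes strict consistency impractical in a real distributed system); the function name and the tuple layout are hypothetical, and the two histories are illustrative rather than a transcription of the figure.

```python
def is_strictly_consistent(history, initial=None):
    """history: iterable of (time, process, op, variable, value) tuples,
    where op is "W" or "R" and time is a global (absolute) timestamp."""
    latest = {}                                   # variable -> most recent write
    for _time, _process, op, var, value in sorted(history, key=lambda e: e[0]):
        if op == "W":
            latest[var] = value
        elif latest.get(var, initial) != value:   # a read must see the latest write
            return False
    return True


# W2(x, a) followed by reads that all return a.
strict = [(1, 2, "W", "x", "a"), (2, 1, "R", "x", "a"), (3, 3, "R", "x", "a")]
# W2(x, a) followed by a read that still returns nil (None).
not_strict = [(1, 2, "W", "x", "a"), (2, 1, "R", "x", None), (3, 1, "R", "x", "a")]
```

Calling `is_strictly_consistent(strict)` returns `True`, while `is_strictly_consistent(not_strict)` returns `False` because the read at time 2 misses the write at time 1.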
9. Consistency Models: Linearizability and Sequential Consistency
- Linearizability: Operations of each individual process appear to all processes in the same order as they happen.
- Sequential Consistency: Operations of each individual process appear in the same order to all processes.
[Figure: two timeline diagrams over P1, P2, P3, and P4, labeled "Linearizability" and "Sequential Consistency". P2 performs W2(x, a) and P3 performs W3(x, b); the reads by P1 and P4 return nil, a, or b.]
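Whether a small history is sequentially consistent can be tested by brute force: look for one interleaving that respects each process's program order and in which every read returns the latest write. The following is a sketch under that definition; the function name and the `("W"|"R", variable, value)` encoding are assumptions made for illustration, and the search is exponential, so it only suits classroom-sized histories.

```python
def sequentially_consistent(programs, initial=None):
    """programs: one list of operations per process, in program order.
    Each operation is ("W", var, value) or ("R", var, value_returned)."""

    def search(positions, memory):
        if all(pos == len(prog) for pos, prog in zip(positions, programs)):
            return True                              # every operation was scheduled
        for i, prog in enumerate(programs):
            pos = positions[i]
            if pos == len(prog):
                continue
            kind, var, value = prog[pos]
            if kind == "R" and memory.get(var, initial) != value:
                continue                             # this read cannot come next
            new_memory = dict(memory)
            if kind == "W":
                new_memory[var] = value
            new_positions = positions[:i] + (pos + 1,) + positions[i + 1:]
            if search(new_positions, new_memory):
                return True
        return False

    return search(tuple(0 for _ in programs), {})


writers = [[("W", "x", "a")], [("W", "x", "b")]]
same_order = writers + [[("R", "x", "a"), ("R", "x", "b")]] * 2
opposite_order = writers + [[("R", "x", "a"), ("R", "x", "b")],
                            [("R", "x", "b"), ("R", "x", "a")]]
```

Here `same_order` is accepted because both readers see a and then b, whereas `opposite_order` is rejected because no single interleaving lets one reader see a then b and the other b then a.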
10. Consistency Models: FIFO and Processor Consistency
- FIFO Consistency: Writes by a single process are visible to all other processes in the order in which they were issued.
- Processor Consistency: FIFO consistency, plus all writes to the same memory location must be visible in the same order.
[Figure: two timeline diagrams over P1, P2, P3, and P4, labeled "FIFO Consistency" and "Processor Consistency". P2 performs W2(x, a), W2(x, b), and W2(y, a); P3 performs writes to x, y, and z; P1 reads x, y, and z.]
11. Consistency Models: Causal Consistency
- Causally related writes must be visible to all processes in the same order. Concurrent writes may be propagated in different orders.
[Figure: two timeline diagrams over P1, P2, P3, and P4, labeled "Causal Consistency" and "Not Causal Consistency". P2 performs W2(x, a) and W2(x, c); P3 reads a←R3(x) and then performs W3(x, b); P1 and P4 read x.]
12. Consistency Models: Weak Consistency
- Accesses to synchronization variables must obey sequential consistency.
- All previous writes must be completed before an access to a synchronization variable.
- All previous accesses to synchronization variables must be completed before an access to a non-synchronization variable.
[Figure: two timeline diagrams labeled "Weak Consistency" and "Not Weak Consistency". P2 performs W2(x, a), W2(x, b), and W2(y, c); synchronization accesses S1, S2, and S3 separate the writes from the reads of x and y.]
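The three weak consistency rules can be sketched with a node that buffers its writes and exchanges data only when it accesses a synchronization variable. This is a single-threaded simulation under simplifying assumptions: the class `WeakNode` and its `sync` method are hypothetical names, and calls to `sync` are serialized by construction, which stands in for sequentially consistent synchronization accesses.

```python
class WeakNode:
    """A process that propagates and picks up writes only at synchronization."""

    def __init__(self, shared):
        self.shared = shared     # dict standing in for globally visible memory
        self.local = {}          # this process's current view
        self.pending = {}        # writes not yet made visible to others

    def write(self, var, value):
        self.local[var] = value
        self.pending[var] = value

    def read(self, var):
        return self.local.get(var)           # None plays the role of nil

    def sync(self):
        self.shared.update(self.pending)     # all previous writes complete first
        self.pending.clear()
        self.local = dict(self.shared)       # later reads see completed writes


shared = {}
writer, reader = WeakNode(shared), WeakNode(shared)
writer.write("x", "a")
writer.write("x", "b")
before = reader.read("x")    # no synchronization yet, so a stale value is allowed
writer.sync()
reader.sync()
after = reader.read("x")     # after both synchronize, the last write must be seen
```

In this run `before` is `None` and `after` is `"b"`; a read of `"a"` after both synchronizations would violate the model.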
13. Consistency Models: Release Consistency
- Accesses to acquire and release variables obey processor consistency.
- Previous acquires requested by a process must be completed before the process performs a data access.
- All previous data accesses performed by a process must be completed before the process performs a release.
[Figure: timelines for P1, P2, and P3.
P1: Acq1(L), W1(x, a), W1(x, b), Rel1(L)
P2: Acq2(L), b←R2(x), b←R2(x), Rel2(L)
P3: a←R3(x)]
14. Consistency Models: Release Consistency (Example)
Process 1:
    acquireLock();     // enter critical section
    a = a + 1;
    b = b + 1;
    releaseLock();     // leave critical section
Process 2:
    acquireLock();     // enter critical section
    print("The values of a and b are: ", a, b);
    releaseLock();     // leave critical section
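The two-process example can be simulated to show where release consistency requires updates to move. In the sketch below, writes are buffered until release and a process refreshes its view at acquire; the class name, the dictionary-based lock, and the single-threaded driver are all assumptions made for illustration, not how any particular DSM implements the model.

```python
class ReleaseConsistentNode:
    """A process whose writes become visible to others only at release."""

    def __init__(self, shared):
        self.shared = shared             # dict standing in for globally visible memory
        self.local = dict(shared)        # this process's possibly stale view
        self.pending = {}                # writes made since the last release

    def acquire(self, lock):
        if lock["holder"] is not None:
            raise RuntimeError("lock is already held")
        lock["holder"] = self
        self.local = dict(self.shared)   # acquire completes before any data access

    def read(self, var):
        return self.local[var]

    def write(self, var, value):
        self.local[var] = value
        self.pending[var] = value

    def release(self, lock):
        self.shared.update(self.pending) # earlier data accesses complete before release
        self.pending.clear()
        lock["holder"] = None


shared = {"a": 0, "b": 0}
lock = {"holder": None}
p1, p2 = ReleaseConsistentNode(shared), ReleaseConsistentNode(shared)

p1.acquire(lock)                         # Process 1 enters its critical section
p1.write("a", p1.read("a") + 1)
p1.write("b", p1.read("b") + 1)
p1.release(lock)

p2.acquire(lock)                         # Process 2 enters its critical section
values = (p2.read("a"), p2.read("b"))
p2.release(lock)
```

Process 2 obtains `(1, 1)` because it acquired the lock after Process 1 released it; a process that reads without acquiring gets no such guarantee.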
15. Implementing Sequential Consistency: Replicated and Migrating Data Blocks
[Figure: data blocks, including x, replicated on and migrating between nodes such as Node 1 and Node 3.]
Then what if Node 2 updates x?
16. Implementing Sequential Consistency: Write Invalidation
[Figure: a client wants to write a block; the other nodes' copies of the block are invalidated and the client holds the new copy.]
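A write-invalidation protocol for a single block can be sketched as follows. This is a simplified model under stated assumptions: the class `InvalidateBlock`, the `INVALID` marker, and the message counter are hypothetical, every node is tracked in one dictionary rather than by real messages, and the first node is arbitrarily chosen as the initial owner.

```python
INVALID = object()   # marker for a node that holds no valid copy


class InvalidateBlock:
    """One shared block kept coherent by invalidating copies on every write."""

    def __init__(self, nodes, initial):
        self.copies = {node: INVALID for node in nodes}
        self.owner = nodes[0]
        self.copies[self.owner] = initial
        self.invalidations = 0           # invalidation messages sent so far

    def read(self, node):
        if self.copies[node] is INVALID:              # read fault: copy from the owner
            self.copies[node] = self.copies[self.owner]
        return self.copies[node]

    def write(self, node, value):
        for other in self.copies:
            if other != node and self.copies[other] is not INVALID:
                self.copies[other] = INVALID          # invalidate every other copy
                self.invalidations += 1
        self.owner = node                             # the writer becomes the owner
        self.copies[node] = value
```

After a write, any other node's next read faults and fetches the new value from the new owner, so readers never observe the stale copy.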
17. Implementing Sequential Consistency: Write Update
[Figure: a client wants to write a block; the new value is propagated to every node that holds a copy of the block.]
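A write-update protocol for a single block can be sketched by pushing each new value to every node that holds a copy, instead of discarding those copies. The class `UpdateBlock` and its message counter are hypothetical names for illustration, and the ordering of concurrent updates, which a real implementation must handle, is ignored in this single-threaded model.

```python
class UpdateBlock:
    """One shared block kept coherent by updating all copies on every write."""

    def __init__(self, initial):
        self.master = initial        # value handed to nodes that have no copy yet
        self.copies = {}             # node -> value, for nodes holding a copy
        self.update_messages = 0     # update messages sent so far

    def read(self, node):
        if node not in self.copies:              # first access: obtain a copy
            self.copies[node] = self.master
        return self.copies[node]

    def write(self, node, value):
        self.master = value
        self.copies[node] = value
        for other in self.copies:
            if other != node:
                self.copies[other] = value       # push the new value to each holder
                self.update_messages += 1
```

Reads after a write are always local hits, at the cost of one update message per copy holder on every write.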
18. Implementing Sequential Consistency: Read/Write Request
[Figure: block state-transition diagram. States: Unused, Nil, Read only, Read-owned, Writable. Transition labels: Read (read a copy from the owner); Read (read from memory and get ownership); Write (invalidate others if they have a copy and get ownership); Write (invalidate others if they have a copy); Write invalidate; Replacement.]
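One way to read the block states is as a table-driven state machine. The transition table below is an assumption: the states and event names come from the slide, but which state each labeled arrow starts and ends at is a guess, so treat it as one plausible reading rather than the slide's exact diagram.

```python
# Assumed transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("unused", "read"): "read-owned",             # read from memory and get ownership
    ("nil", "read"): "read-only",                 # read a copy from the owner
    ("nil", "write"): "writable",                 # invalidate others, get ownership
    ("read-only", "write"): "writable",           # invalidate others, get ownership
    ("read-owned", "write"): "writable",          # invalidate others (already the owner)
    ("read-only", "write-invalidate"): "nil",     # another node wrote the block
    ("read-owned", "write-invalidate"): "nil",
    ("writable", "write-invalidate"): "nil",
    ("nil", "replacement"): "unused",             # the block frame is reclaimed
    ("read-only", "replacement"): "unused",
    ("read-owned", "replacement"): "unused",
    ("writable", "replacement"): "unused",
}


def next_state(state, event):
    """Return the next state; pairs outside the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "nil"
for event in ["read", "write", "write-invalidate", "replacement"]:
    state = next_state(state, event)
```

The loop walks one block through read-only, writable, nil, and finally unused.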
19. Implementing Sequential Consistency: Locating Data, Fixed Distributed-Server Algorithms
[Figure: Processors 0, 1, and 2 each hold a table of block states (entries shown: Addr0 writable, Addr1 read owned, Addr2 read owned, Addr3 read owned, Addr4 read owned, Addr5 writable, Addr6 writable, Addr7 writable, Addr8 read owned). A "Read addr2" request produces an "Addr2 read only" copy at the requesting processor.]
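In a fixed distributed-server scheme, every address has a fixed manager that records the current owner, so a requester needs one lookup to find the data. The sketch below assumes the mapping `address mod number_of_processors`; that mapping, the class name `FixedDirectory`, and its methods are illustrative choices, not something the slide specifies.

```python
class FixedDirectory:
    """Ownership tables spread over processors by a fixed mapping function."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        # One table per manager processor: address -> current owner.
        self.tables = [dict() for _ in range(num_processors)]

    def manager_of(self, address):
        return address % self.num_processors      # assumed mapping function

    def set_owner(self, address, owner):
        self.tables[self.manager_of(address)][address] = owner

    def locate(self, address):
        """Return (manager, owner): ask the fixed manager, then go to the owner."""
        manager = self.manager_of(address)
        return manager, self.tables[manager][address]


directory = FixedDirectory(3)
directory.set_owner(2, owner=0)        # hypothetical: Processor 0 owns address 2
manager, owner = directory.locate(2)   # a reader asks manager 2, then contacts owner 0
```

The manager never changes when ownership migrates; only its table entry is updated, which keeps lookups to a constant number of messages.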
20. Implementing Sequential Consistency: Locating Data, Dynamic Distributed-Server Algorithms
- Breaking the chain of nodes
  - When the node receives an invalidation
  - When the node relinquishes ownership
  - When the node forwards a fault request
  - The node points to a new owner
[Figure: Processors 0, 1, and 2 each hold a table of block states. A "Read addr2" request is forwarded along the chain of processors to the one holding "Addr2 read owned", and an "Addr2 read only" copy is created at the requester.]
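The chain-breaking rule can be sketched with a "probable owner" pointer per node: a fault request follows the pointers to the real owner, and every node that forwards the request, as well as the old owner, is redirected to the new owner. The sketch covers a write fault for one block only; the class name and the hop counter are assumptions made for illustration.

```python
class ProbableOwnerChain:
    """Per-node hints about who owns one block; the owner points to itself."""

    def __init__(self, num_nodes, owner):
        self.prob_owner = [owner] * num_nodes

    def write_fault(self, requester):
        """Transfer ownership to requester; return the number of forwarding hops."""
        hops = 0
        node = requester
        while self.prob_owner[node] != node:        # not the owner: follow the hint
            next_node = self.prob_owner[node]
            self.prob_owner[node] = requester       # forwarder now points to new owner
            node = next_node
            hops += 1
        self.prob_owner[node] = requester           # old owner relinquishes ownership
        self.prob_owner[requester] = requester
        return hops


chain = ProbableOwnerChain(num_nodes=4, owner=0)
first = chain.write_fault(1)     # node 1's hint is correct: one hop to node 0
second = chain.write_fault(2)    # node 2's hint is stale: node 0 forwards to node 1
```

After these two faults the hints are `[2, 2, 2, 0]`: every node that handled the second request points straight at node 2, while node 3, which was not involved, still holds a stale hint.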
21. Replacement Strategy
- Which block to replace
  - Non-usage based (e.g. FIFO)
  - Usage based (e.g. LRU)
  - A mix of those (e.g. Ivy)
    - Unused/Nil: replaced with the highest priority
    - Read-only: the second priority
    - Read-owned: the third priority
    - Writable: the lowest priority, with LRU used
- Where to place a replaced block
  - Invalidating a block if other nodes have a copy
  - Using secondary store
  - Using the memory space of other nodes
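The mixed priority scheme can be sketched as a victim-selection function that ranks blocks by state first and by last-use time second. Applying LRU as the tie-breaker inside every state class is a simplifying assumption here (the slide mentions LRU only for writable blocks), and the function and dictionary names are hypothetical.

```python
# Lower number = replaced first, following the priority order on the slide.
PRIORITY = {"unused": 0, "nil": 0, "read-only": 1, "read-owned": 2, "writable": 3}


def choose_victim(blocks):
    """blocks: dict of block id -> (state, last_used_time).
    Returns the id of the block to replace."""
    return min(blocks, key=lambda b: (PRIORITY[blocks[b][0]], blocks[b][1]))


resident = {
    "A": ("writable", 5),
    "B": ("read-owned", 1),
    "C": ("read-only", 9),
    "D": ("read-only", 3),
}
victim = choose_victim(resident)
```

Here the victim is block D: both read-only blocks outrank the owned ones for replacement, and D was used less recently than C.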
22. Thrashing
- Thrashing
  - Two or more processes try to write the same shared block.
  - An owner keeps writing its block shared by two or more reader processes.
  - The larger a block, the greater the chance of false sharing, which causes thrashing.
- Solutions
  - Allow a process to prevent a block from being accessed by the others, using a lock.
  - Allow a process to hold a block for a certain amount of time.
  - Apply a different coherence algorithm to each block.
- What do those solutions require users to do?
- Are there any perfect solutions?
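The "hold a block for a certain amount of time" solution can be sketched with a logical clock: a block may not migrate until its current holder has kept it for a minimum period. The class `HeldBlock`, the `hold_time` parameter, and the integer timestamps are assumptions made for this sketch.

```python
class HeldBlock:
    """A block that cannot be taken from its holder before hold_time has elapsed."""

    def __init__(self, owner, hold_time):
        self.owner = owner
        self.hold_time = hold_time
        self.acquired_at = 0        # logical time at which the owner got the block
        self.migrations = 0

    def request(self, node, now):
        """Try to move the block to node at logical time now; True on success."""
        if node == self.owner:
            return True
        if now - self.acquired_at < self.hold_time:
            return False            # too early: the requester must retry later
        self.owner = node
        self.acquired_at = now
        self.migrations += 1
        return True


block = HeldBlock(owner="A", hold_time=5)
early = block.request("B", now=1)     # refused: A has held the block for only 1 tick
later = block.request("B", now=5)     # granted: the hold time has elapsed
```

The hold time bounds how often the block can ping-pong between two writers, at the price of delaying the second writer.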
23. Paper Review by Students
- IVY
- Dash
- Munin
- Linda/Jini/JavaSpace
- Discussions
  - Classify which system is based on sequential consistency, release consistency, and lazy release consistency.
  - Classify the shared data granularity of these systems: cache-line based, page-based, and object-based.
  - Classify the implementation of these systems: hardware implementation, OS implementation, and user-level implementation.
24. Non-Turn-In Exercises
- Is the memory underlying the following execution of two processes sequentially consistent (assuming that, initially, all variables are set to zero)? (Textbook p780 Q18.6)
  - P1: R(x)1; R(x)2; W(y)1
  - P2: W(x)1; R(y)1; W(x)2
- Show that the following history is not causally consistent. (Textbook p781 Q18.18)
  - P1: W(a)0; W(a)1
  - P2: R(a)1; W(b)2
  - P3: R(b)2; R(a)0
- Explain the relationship between false sharing and data granularity in DSM.
25. Non-Turn-In Exercises
- There is a DSM system that is based on the write-invalidation protocol, uses a fixed distributed-server algorithm for locating a given data item, and consists of three processors: 1, 2, and 3. Each processor has the following data items and an ownership/sharing-processor table.
[Figure: for each of Processors 1, 2, and 3, an ownership table with columns addr, owner, and shared, plus the data items held by that processor. Table entries shown (addr: owner): 0: P0; 1: P0 (shared: P3); 2: P3; 3: P2; 4: P3; 5: P0; 6: P3; 7: P2; 8: P2. Data items shown: addr 0, 1, 2, 3, 4, 6, 7, 8, and a copy of addr 1.]
26. Non-Turn-In Exercises
Given the following sequence of memory accesses, draw additional arrows and circles in the above figure as instructed. To distinguish which arrow corresponds to which operation, add the operation number (1 to 8) to each arrow. Also, update the corresponding ownership table entries.
(1) Memory access 1: Processor 2 reads data from address 2.
- Add arrows in the above figure to indicate the operations required for memory access 1:
  1. Send a query to search for address 2.
  2. Send a request to read from address 2.
  3. Read data from address 2 to Processor 2.
- Update the corresponding ownership table entry. (Just add P2 in the shared field.)
- Draw a circle to indicate that a copy of address 2 was created on Processor 2.
(2) Memory access 2: Processor 1 reads data from address 2.
- Add arrows in the above figure to indicate the operations required for memory access 2:
  4. Send a query to search for address 2.
  5. Send a request to read from address 2.
  6. Read data from address 2 to Processor 1.
- Update the corresponding ownership table entry. (Just add P1 in the shared field.)
- Draw a circle to indicate that a copy of address 2 was created on Processor 1.
(3) Memory access 3: Processor 2 writes data to address 2.
- Add arrows in the above figure to indicate the operations required for memory access 3:
  7. Send a request to update the ownership information on address 2.
  8. Send a write invalidation to all non-owner processors sharing address 2.
- Update the corresponding ownership table entry. (Make Processor 2 the new owner of address 2 and cross out all other processor IDs in the entry.)
- Cross out all circles to indicate that the old copies of address 2 were all invalidated.