Title: ECE 1747: Parallel Programming
1. ECE 1747: Parallel Programming
- Basics of Parallel Architectures
- Shared-Memory Machines
2. Two Parallel Architectures
- Shared memory machines.
- Distributed memory machines.
3. Shared Memory Logical View
Figure: processors proc1 … procN all attached to a single shared memory space.
4. Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
5. SMPs
- 2- or 4-processor PCs are now commodity.
- Good price/performance ratio.
- Memory is sometimes a bottleneck (see later).
- Typical price (8-node): 20-40k.
6. Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory.
7. Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
8. CC-NUMA Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN) and a local memory module (mem1 … memN), connected by an interconnect.
9. Caches in Multiprocessors
- Suffer from the coherence problem:
  - the same line appears in two or more caches
  - one processor writes a word in the line
  - other processors can now read stale data
- Leads to the need for a coherence protocol
  - avoids coherence problems
- Many exist; we will just look at a simple one.
10. What is coherence?
- What does it mean for memory to be shared?
- Intuitively, a read should return the last value written.
- This notion is not well-defined in a system without a global clock.
11. The Notion of "Last Written" in a Multiprocessor System
Figure: timelines for processors P0-P3; P1 and P2 each write x, while P0 and P3 each read x. Without a global clock there is no obvious "last" write.
12. The Notion of "Last Written" in a Single-Machine System
Figure: a single timeline on which the two writes of x and the two reads of x are totally ordered, so "last written" is well defined.
13. Coherence: a Clean Definition
- A clean definition is achieved by referring back to the single-machine case.
- Called sequential consistency.
14. Sequential Consistency (SC)
- Memory is sequentially consistent if and only if
it behaves as if the processors were executing
in a time-shared fashion on a single machine.
15. Returning to our Example
Figure: the four-processor example from slide 11 (P1 and P2 write x, P0 and P3 read x), revisited under the SC definition.
16. Another Way of Defining SC
- All memory references of a single process execute in program order.
- All writes are globally ordered.
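To make the two conditions concrete, here is a minimal sketch (not from the slides) of the classic store-buffering test: under SC some write must come first in the single global order, so the outcome r1 = 0 and r2 = 0 is impossible. The deliberately racy plain int accesses mirror the slides' w/r notation; real hardware with a weaker memory model may produce the forbidden result.

```c
/* Hypothetical SC litmus test: under sequential consistency, the final
 * outcome r1 == 0 && r2 == 0 cannot occur. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;     /* shared locations, initially 0 */
int r1, r2;           /* values observed by the reads  */

void *thread0(void *arg) { x = 1; r1 = y; return NULL; }   /* w(x,1); r(y) */
void *thread1(void *arg) { y = 1; r2 = x; return NULL; }   /* w(y,1); r(x) */

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("r1=%d r2=%d\n", r1, r2);   /* under SC, never "r1=0 r2=0" */
    return 0;
}
```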
17. SC Example 1
Initial values of x and y are 0.
Figure: the operations w(x,1), w(y,1), r(x), r(y), issued by different processors.
What are the possible final values?
18. SC Example 2
Figure: the operations w(x,1), w(y,1), r(y), r(x), issued by different processors.
19. SC Example 3
Figure: the operations w(x,1), w(y,1), r(y), r(x), issued by different processors (arranged differently than in Example 2 in the original figure).
20. SC Example 4
Figure: the operations r(x), w(x,1), w(x,2), r(x), issued by different processors.
21. Implementation
- There are many ways of implementing SC.
- In fact, implementations sometimes enforce stronger conditions.
- We will look at a simple one: the MSI protocol.
22. Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory (same organization as slide 6).
23. Fundamental Assumption
- The bus is a reliable, ordered broadcast bus.
- Every message sent by a processor is received by all other processors in the same order.
- Also called a snooping bus:
  - processors (or caches) snoop on the bus.
24. States of a Cache Line
- Invalid
- Shared
  - read-only, one of many cached copies
- Modified
  - read-write, sole valid copy
25. Processor Transactions
- processor read(x)
- processor write(x)
26. Bus Transactions
- bus read(x)
  - asks for a copy with no intent to modify
- bus read-exclusive(x)
  - asks for a copy with intent to modify
27.-36. State Diagram, Steps 0-9
Figures: these slides build up the MSI state diagram one transition at a time, starting from the three states I, S, M (Step 0). The completed diagram (Step 9) has the following transitions (notation: observed event / bus transaction generated, with "-" meaning none):
- I: PrRd / BuRd → S
- I: PrWr / BuRdX → M
- S: PrRd / - → S
- S: PrWr / BuRdX → M
- S: BuRd / - → S
- S: BuRdX / - → I
- M: PrRd / - → M
- M: PrWr / - → M
- M: BuRd / Flush → S
- M: BuRdX / Flush → I
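As an illustration (not part of the slides), the transitions above can be written as a small state machine for one cache line; the C names below are invented, and bus transactions appear only as return values rather than real data movement.

```c
/* A minimal sketch of the per-line MSI state machine described above. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } event_t;
typedef enum { NONE, BU_RD_REQ, BU_RDX_REQ, FLUSH } action_t;

action_t msi_step(line_state *s, event_t e) {
    switch (*s) {
    case INVALID:
        if (e == PR_RD)  { *s = SHARED;   return BU_RD_REQ;  }
        if (e == PR_WR)  { *s = MODIFIED; return BU_RDX_REQ; }
        return NONE;                       /* bus events ignored in I */
    case SHARED:
        if (e == PR_WR)  { *s = MODIFIED; return BU_RDX_REQ; }
        if (e == BU_RDX) { *s = INVALID;  return NONE; }
        return NONE;                       /* PrRd and BuRd stay in S */
    case MODIFIED:
        if (e == BU_RD)  { *s = SHARED;   return FLUSH; }
        if (e == BU_RDX) { *s = INVALID;  return FLUSH; }
        return NONE;                       /* PrRd/PrWr hit, no bus traffic */
    }
    return NONE;
}

int main(void) {
    line_state s = INVALID;
    msi_step(&s, PR_RD);   /* I -> S, issues BuRd  */
    msi_step(&s, PR_WR);   /* S -> M, issues BuRdX */
    msi_step(&s, BU_RD);   /* M -> S, flushes line */
    printf("final state: %d\n", s);   /* prints 1 (SHARED) */
    return 0;
}
```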
37. In Reality
- Most machines use a slightly more complicated protocol (4 states instead of 3).
- See architecture books (MESI protocol).
38. Problem: False Sharing
- Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
- Leads to a ping-pong effect.
39. False Sharing Example (1 of 3)
#pragma omp parallel for schedule(static,1)   /* cyclic distribution of iterations */
for (i = 0; i < n; i++)
    a[i] = b[i];
- Let's assume:
  - p = 2
  - an element of a takes 4 words
  - a cache line has 32 words
40. False Sharing Example (2 of 3)
Figure: one cache line holding a[0] … a[7]; with the cyclic schedule, the even-indexed elements are written by processor 0 and the odd-indexed elements by processor 1.
41. False Sharing Example (3 of 3)
Figure: timeline of P0 writing a[0], a[2], a[4] and P1 writing a[1], a[3], a[5]; each write invalidates the other processor's copy, so the line ping-pongs between the two caches (inv / data traffic).
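As a hedged follow-up (not from the slides), one common remedy is to pad per-thread data so that concurrently written elements fall in different cache lines; CACHE_LINE_BYTES and the counter layout below are assumptions for illustration.

```c
/* A sketch: avoid false sharing by giving each thread its own
 * cache-line-aligned slot instead of adjacent array elements.
 * CACHE_LINE_BYTES is an assumed line size, not taken from the slides. */
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE_BYTES 64
#define NTHREADS 2

typedef struct {
    long value;
    char pad[CACHE_LINE_BYTES - sizeof(long)];  /* keep neighbours in other lines */
} padded_counter;

int main(void) {
    padded_counter sum[NTHREADS] = {0};
    long total = 0;
    int n = 1000000;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum[me].value += i;        /* each thread writes only its own line */
    }
    for (int t = 0; t < NTHREADS; t++)
        total += sum[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```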
42. Summary
- Sequential consistency.
- Bus-based coherence protocols.
- False sharing.
43. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
- J.M. Mellor-Crummey, M.L. Scott
- (MCS Locks)
44. Introduction
- Busy-waiting techniques are heavily used for synchronization on shared-memory MPs
- Two general categories: locks and barriers
  - Locks ensure mutual exclusion
  - Barriers provide phase separation in an application
45. Problem
- Busy-waiting synchronization constructs tend to have a significant impact on network traffic due to cache invalidations
- The resulting contention leads to poor scalability
- Main cause: spinning on remote variables
46. The Proposed Solution
- Minimize accesses to remote variables
- Instead, spin on local variables
- Claims:
  - It can be done entirely in software (no need for fancy and costly hardware support)
  - Spinning on local variables minimizes contention and allows for good scalability and good performance
47. Spin Lock 1: Test-and-Set Lock
- Repeatedly test-and-set a boolean flag indicating whether the lock is held
- Problem: contention for the flag (read-modify-write instructions are expensive)
- Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations)
- Variation: test-and-test-and-set generates less traffic
48. Test-and-Set with Backoff Lock
- Pause between successive test-and-set attempts (backoff)
- TS with backoff idea:
    while (test_and_set(L) == locked) {
        pause(delay);
        delay = delay * 2;
    }
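A minimal sketch (not the paper's code) of a test-and-set lock with exponential backoff, using C11 atomics; the delay handling (sched_yield as the pause, the cap of 1024) is an arbitrary assumption.

```c
/* Test-and-set spin lock with exponential backoff. */
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_flag held; } tas_lock;   /* initialize with ATOMIC_FLAG_INIT */

static void tas_acquire(tas_lock *l) {
    unsigned delay = 1;
    /* atomic_flag_test_and_set returns the previous value: true means the
     * lock was already held, so back off and retry. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            sched_yield();              /* stand-in for "pause(delay)" */
        if (delay < 1024)
            delay *= 2;                 /* exponential backoff, capped */
    }
}

static void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```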
49. Spin Lock 2: The Ticket Lock
- Two counters (nr_requests and nr_releases)
- Lock acquire: fetch-and-increment on the nr_requests counter, then wait until the returned ticket equals the value of the nr_releases counter
- Lock release: increment the nr_releases counter
50. Spin Lock 2: The Ticket Lock
- Advantage over TS: polls with read operations only
- Still generates lots of traffic and contention
- Can be further improved by using backoff
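A minimal sketch (not the paper's pseudocode) of the ticket lock just described, using C11 atomics; waiters poll nr_releases with plain loads only.

```c
/* Ticket lock: FIFO ordering, read-only polling while waiting. */
#include <stdatomic.h>

typedef struct {
    atomic_uint nr_requests;   /* next ticket to hand out */
    atomic_uint nr_releases;   /* ticket currently being served */
} ticket_lock;

static void ticket_acquire(ticket_lock *l) {
    unsigned my_ticket =
        atomic_fetch_add_explicit(&l->nr_requests, 1, memory_order_relaxed);
    /* spin with reads only: no read-modify-write traffic while waiting */
    while (atomic_load_explicit(&l->nr_releases, memory_order_acquire) != my_ticket)
        ;
}

static void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->nr_releases, 1, memory_order_release);
}
```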
51. Array-Based Queueing Locks
- Each CPU spins on a different location, in a distinct cache line
- Each CPU clears the lock for its successor (sets it from must_wait to has_lock)
- Lock acquire:
    while (slots[my_place] == must_wait) ;   // spin
- Lock release:
    slots[(my_place + 1) mod P] = has_lock;
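A hedged sketch of an array-based queueing lock along the lines described above; NCPUS, the 64-byte padding, and the point at which a slot is re-armed are assumptions rather than the paper's exact pseudocode.

```c
/* Array-based queueing lock: one padded flag per CPU, FIFO hand-off. */
#include <stdatomic.h>

#define NCPUS 8
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    /* one flag per CPU, padded so each lives in its own cache line
     * (assumes 4-byte atomic_int and 64-byte lines) */
    struct { atomic_int flag; char pad[60]; } slots[NCPUS];
    atomic_uint next_slot;                      /* ticket counter */
} array_lock;

static void array_lock_init(array_lock *l) {
    for (int i = 0; i < NCPUS; i++)
        atomic_init(&l->slots[i].flag, i == 0 ? HAS_LOCK : MUST_WAIT);
    atomic_init(&l->next_slot, 0);
}

static unsigned array_acquire(array_lock *l) {  /* returns my_place */
    unsigned my_place =
        atomic_fetch_add_explicit(&l->next_slot, 1, memory_order_relaxed) % NCPUS;
    while (atomic_load_explicit(&l->slots[my_place].flag,
                                memory_order_acquire) == MUST_WAIT)
        ;                                       /* spin on my own line only */
    atomic_store_explicit(&l->slots[my_place].flag, MUST_WAIT,
                          memory_order_relaxed); /* re-arm my slot for reuse */
    return my_place;
}

static void array_release(array_lock *l, unsigned my_place) {
    atomic_store_explicit(&l->slots[(my_place + 1) % NCPUS].flag, HAS_LOCK,
                          memory_order_release); /* hand lock to successor */
}
```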
52. List-Based Queueing Locks (MCS Locks)
- Spins on local flag variables only
- Requires a small constant amount of space per lock
53. List-Based Queueing Locks (MCS Locks)
- Waiting CPUs form a linked list; upon release by the current CPU, the lock is acquired by its successor
- Spinning is on a local flag
- The lock variable points at the tail of the queue (null if the lock is not held)
- Compare-and-swap lets a processor detect whether it is the only one in the queue and atomically remove itself from the queue
54. List-Based Queueing Locks (MCS Locks)
- The spin in acquire_lock waits for the lock to become free
- The spin in release_lock compensates for the time window between the fetch-and-store and the assignment to predecessor->next in acquire_lock
- Without compare-and-swap, the protocol becomes cumbersome
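As a hedged illustration of the algorithm just described (not the paper's pseudocode verbatim), here is an MCS acquire/release sketch in C11 atomics, where atomic_exchange plays the role of fetch_and_store and atomic_compare_exchange that of compare_and_swap; all names are invented.

```c
/* MCS list-based queueing lock: each waiter spins only on its own node. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct {
    mcs_node *_Atomic tail;      /* points at tail of queue, NULL if free */
} mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {                        /* queue was non-empty: wait */
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                                  /* spin on my own flag only */
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* no known successor: if we are still the tail, free the lock */
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is between fetch_and_store and linking in: wait for it */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```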
55. The MCS Tree-Based Barrier
- Uses a pair of P-node trees (P = number of CPUs): an arrival tree and a wakeup tree
- Arrival tree: each node has 4 children
- Wakeup tree: binary tree
- Fastest way to wake up all P processors
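The tree barrier itself is too long to sketch here; for contrast, the following is a minimal centralized sense-reversing barrier (explicitly not the MCS tree barrier). It shows the local-sense spinning idea but concentrates all arrivals on one counter, which is exactly the contention the tree structure distributes away.

```c
/* Centralized sense-reversing barrier (baseline, not the MCS tree barrier). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;        /* processors that have arrived this episode */
    atomic_bool sense;       /* global sense, flipped once per episode */
    int nprocs;
} central_barrier;

static void barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;                       /* my sense this episode */
    if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
        atomic_store(&b->count, 0);                     /* last arrival resets... */
        atomic_store(&b->sense, *local_sense);          /* ...and releases everyone */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                           /* spin until released */
    }
}
```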
56. Hardware Description
- BBN Butterfly 1: DSM multiprocessor
  - Supports up to 256 CPUs; 80 used in the experiments
  - Atomic primitives provide fetch_and_add, fetch_and_store (swap), test_and_set
- Sequent Symmetry Model B: cache-coherent, shared-bus multiprocessor
  - Supports up to 30 CPUs; 18 used in the experiments
  - Snooping cache-coherence protocol
- Neither machine supports compare-and-swap
57. Measurement Technique
- Results are averaged over 10k (Butterfly) or 100k (Symmetry) lock acquisitions
- For 1 CPU, the reported time is the latency between acquire and release of the lock
- Otherwise, it is the time elapsed between successive lock acquisitions
58. Spin Locks on Butterfly (performance graph)
59. Spin Locks on Butterfly (performance graph)
60. Spin Locks on Butterfly
- Anderson's lock fares worse because the Butterfly lacks coherent caches, and CPUs may spin on statically unpredictable locations which may not be local
- TS with exponential backoff, the Ticket lock with proportional backoff, and MCS all scale very well, with slopes of 0.0025, 0.0021 and 0.00025 µs respectively
61. Spin Locks on Symmetry (performance graph)
62. Spin Locks on Symmetry (performance graph)
63. Latency and Impact of Spin Locks (performance graph)
64. Latency and Impact of Spin Locks
- Latency results are poor on the Butterfly because:
  - atomic operations are inordinately expensive compared to non-atomic ones
  - the 16-bit atomic primitives on the Butterfly cannot manipulate 24-bit pointers
65. Barriers on Butterfly (performance graph)
66. Barriers on Butterfly (performance graph)
67. Barriers on Symmetry (performance graph)
68. Barriers on Symmetry
- Results differ from the Butterfly because:
  - more CPUs can spin on the same location (each with its own copy in its local cache)
  - distributing writes across different memory modules yields no benefit, because the bus serializes all communication
69. Conclusions
- Criteria for evaluating spin locks:
  - Scalability and induced network load
  - Single-processor latency
  - Space requirements
  - Fairness
  - Implementability with available atomic operations
70. Conclusions
- The MCS lock algorithm scales best, together with array-based queueing locks on cache-coherent machines
- TS and Ticket locks with proper backoff also scale well, but incur more network load
- Anderson and GT (Graunke & Thakkar) locks have prohibitive space requirements for large numbers of CPUs
71. Conclusions
- MCS, array-based, and Ticket locks guarantee fairness (FIFO ordering)
- MCS benefits significantly from the availability of compare-and-swap
- MCS is best when contention is expected: excellent scaling, FIFO ordering, least interconnect contention, low space requirements