Title: ECE 1747: Parallel Programming
1. ECE 1747: Parallel Programming
- Basics of Parallel Architectures
- Shared-Memory Machines
2. Two Parallel Architectures
- Shared memory machines.
- Distributed memory machines.
3. Shared Memory Logical View
Figure: processors proc1 … procN all attached to a single shared memory space.
4. Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
5. SMPs
- 2- or 4-processor PCs are now commodity.
- Good price/performance ratio.
- Memory is sometimes a bottleneck (see later).
- Typical price (8-node): 20-40k.
6. Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory.
7. Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
8. CC-NUMA Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN) and a local memory module (mem1 … memN), connected by an interconnect.
9. Caches in Multiprocessors
- Suffer from the coherence problem:
  - the same line appears in two or more caches
  - one processor writes a word in the line
  - other processors can now read stale data
- Leads to the need for a coherence protocol
  - avoids coherence problems
- Many exist; we will just look at a simple one.
10. What is coherence?
- What does it mean for memory to be shared?
- Intuitively, a read should return the last value written.
- This notion is not well-defined in a system without a global clock.
11. The Notion of "Last Written" in a Multiprocessor System
Figure: timelines for processors P0-P3; P1 and P2 each write x, while P0 and P3 each read x. Without a global clock there is no obvious "last" write.
12. The Notion of "Last Written" in a Single-Machine System
Figure: a single timeline on which the two writes of x and the two reads of x are totally ordered, so "last written" is well defined.
13. Coherence: a Clean Definition
- A clean definition is achieved by referring back to the single-machine case.
- Called sequential consistency.
14. Sequential Consistency (SC)
- Memory is sequentially consistent if and only if
it behaves as if the processors were executing
in a time-shared fashion on a single machine.
15. Returning to our Example
Figure: the four-processor example from slide 11 (P1 and P2 write x, P0 and P3 read x), revisited under the SC definition.
16. Another Way of Defining SC
- All memory references of a single process execute in program order.
- All writes are globally ordered.
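To make the two conditions concrete, here is a minimal sketch (not from the slides) of the classic store-buffering test: under SC some write must come first in the single global order, so the outcome r1 = 0 and r2 = 0 is impossible. The deliberately racy plain int accesses mirror the slides' w/r notation; real hardware with a weaker memory model may produce the forbidden result.

```c
/* Hypothetical SC litmus test: under sequential consistency, the final
 * outcome r1 == 0 && r2 == 0 cannot occur. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;     /* shared locations, initially 0 */
int r1, r2;           /* values observed by the reads  */

void *thread0(void *arg) { x = 1; r1 = y; return NULL; }   /* w(x,1); r(y) */
void *thread1(void *arg) { y = 1; r2 = x; return NULL; }   /* w(y,1); r(x) */

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("r1=%d r2=%d\n", r1, r2);   /* under SC, never "r1=0 r2=0" */
    return 0;
}
```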
17. SC Example 1
Initial values of x and y are 0.
Figure: the operations w(x,1), w(y,1), r(x), r(y), issued by different processors.
What are the possible final values?
18. SC Example 2
Figure: the operations w(x,1), w(y,1), r(y), r(x), issued by different processors.
19. SC Example 3
Figure: the operations w(x,1), w(y,1), r(y), r(x), issued by different processors (arranged differently than in Example 2 in the original figure).
20. SC Example 4
Figure: the operations r(x), w(x,1), w(x,2), r(x), issued by different processors.
21. Implementation
- There are many ways of implementing SC.
- In fact, implementations sometimes enforce stronger conditions.
- We will look at a simple one: the MSI protocol.
22. Physical Implementation
Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory (same organization as slide 6).
23. Fundamental Assumption
- The bus is a reliable, ordered broadcast bus.
- Every message sent by a processor is received by all other processors in the same order.
- Also called a snooping bus:
  - processors (or caches) snoop on the bus.
24. States of a Cache Line
- Invalid
- Shared
  - read-only, one of many cached copies
- Modified
  - read-write, sole valid copy
25. Processor Transactions
- processor read(x)
- processor write(x)
26. Bus Transactions
- bus read(x)
  - asks for a copy with no intent to modify
- bus read-exclusive(x)
  - asks for a copy with intent to modify
27.-36. State Diagram, Steps 0-9
Figures: these slides build up the MSI state diagram one transition at a time, starting from the three states I, S, M (Step 0). The completed diagram (Step 9) has the following transitions (notation: observed event / bus transaction generated, with "-" meaning none):
- I: PrRd / BuRd → S
- I: PrWr / BuRdX → M
- S: PrRd / - → S
- S: PrWr / BuRdX → M
- S: BuRd / - → S
- S: BuRdX / - → I
- M: PrRd / - → M
- M: PrWr / - → M
- M: BuRd / Flush → S
- M: BuRdX / Flush → I
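As an illustration (not part of the slides), the transitions above can be written as a small state machine for one cache line; the C names below are invented, and bus transactions appear only as return values rather than real data movement.

```c
/* A minimal sketch of the per-line MSI state machine described above. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } event_t;
typedef enum { NONE, BU_RD_REQ, BU_RDX_REQ, FLUSH } action_t;

action_t msi_step(line_state *s, event_t e) {
    switch (*s) {
    case INVALID:
        if (e == PR_RD)  { *s = SHARED;   return BU_RD_REQ;  }
        if (e == PR_WR)  { *s = MODIFIED; return BU_RDX_REQ; }
        return NONE;                       /* bus events ignored in I */
    case SHARED:
        if (e == PR_WR)  { *s = MODIFIED; return BU_RDX_REQ; }
        if (e == BU_RDX) { *s = INVALID;  return NONE; }
        return NONE;                       /* PrRd and BuRd stay in S */
    case MODIFIED:
        if (e == BU_RD)  { *s = SHARED;   return FLUSH; }
        if (e == BU_RDX) { *s = INVALID;  return FLUSH; }
        return NONE;                       /* PrRd/PrWr hit, no bus traffic */
    }
    return NONE;
}

int main(void) {
    line_state s = INVALID;
    msi_step(&s, PR_RD);   /* I -> S, issues BuRd  */
    msi_step(&s, PR_WR);   /* S -> M, issues BuRdX */
    msi_step(&s, BU_RD);   /* M -> S, flushes line */
    printf("final state: %d\n", s);   /* prints 1 (SHARED) */
    return 0;
}
```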
37. In Reality
- Most machines use a slightly more complicated protocol (4 states instead of 3).
- See architecture books (MESI protocol).
38. Problem: False Sharing
- Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
- Leads to a ping-pong effect.
39. False Sharing Example (1 of 3)
#pragma omp parallel for schedule(static,1)   /* cyclic distribution of iterations */
for (i = 0; i < n; i++)
    a[i] = b[i];
- Let's assume:
  - p = 2
  - an element of a takes 4 words
  - a cache line has 32 words
40. False Sharing Example (2 of 3)
Figure: one cache line holding a[0] … a[7]; with the cyclic schedule, the even-indexed elements are written by processor 0 and the odd-indexed elements by processor 1.
41. False Sharing Example (3 of 3)
Figure: timeline of P0 writing a[0], a[2], a[4] and P1 writing a[1], a[3], a[5]; each write invalidates the other processor's copy, so the line ping-pongs between the two caches (inv / data traffic).
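As a hedged follow-up (not from the slides), one common remedy is to pad per-thread data so that concurrently written elements fall in different cache lines; CACHE_LINE_BYTES and the counter layout below are assumptions for illustration.

```c
/* A sketch: avoid false sharing by giving each thread its own
 * cache-line-aligned slot instead of adjacent array elements.
 * CACHE_LINE_BYTES is an assumed line size, not taken from the slides. */
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE_BYTES 64
#define NTHREADS 2

typedef struct {
    long value;
    char pad[CACHE_LINE_BYTES - sizeof(long)];  /* keep neighbours in other lines */
} padded_counter;

int main(void) {
    padded_counter sum[NTHREADS] = {0};
    long total = 0;
    int n = 1000000;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum[me].value += i;        /* each thread writes only its own line */
    }
    for (int t = 0; t < NTHREADS; t++)
        total += sum[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```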
42. Summary
- Sequential consistency.
- Bus-based coherence protocols.
- False sharing.
43. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
- J.M. Mellor-Crummey, M.L. Scott
- (MCS Locks)
44. Introduction
- Busy-waiting techniques are heavily used for synchronization on shared-memory MPs
- Two general categories: locks and barriers
  - Locks ensure mutual exclusion
  - Barriers provide phase separation in an application
45. Problem
- Busy-waiting synchronization constructs tend to have a significant impact on network traffic due to cache invalidations
- The resulting contention leads to poor scalability
- Main cause: spinning on remote variables
46. The Proposed Solution
- Minimize accesses to remote variables
- Instead, spin on local variables
- Claims:
  - It can be done entirely in software (no need for fancy and costly hardware support)
  - Spinning on local variables minimizes contention and allows for good scalability and good performance
47. Spin Lock 1: Test-and-Set Lock
- Repeatedly test-and-set a boolean flag indicating whether the lock is held
- Problem: contention for the flag (read-modify-write instructions are expensive)
- Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations)
- Variation: test-and-test-and-set generates less traffic
48. Test-and-Set with Backoff Lock
- Pause between successive test-and-set attempts (backoff)
- TS with backoff idea:
    while (test_and_set(L) == locked) {
        pause(delay);
        delay = delay * 2;
    }
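A minimal sketch (not the paper's code) of a test-and-set lock with exponential backoff, using C11 atomics; the delay handling (sched_yield as the pause, the cap of 1024) is an arbitrary assumption.

```c
/* Test-and-set spin lock with exponential backoff. */
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_flag held; } tas_lock;   /* initialize with ATOMIC_FLAG_INIT */

static void tas_acquire(tas_lock *l) {
    unsigned delay = 1;
    /* atomic_flag_test_and_set returns the previous value: true means the
     * lock was already held, so back off and retry. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            sched_yield();              /* stand-in for "pause(delay)" */
        if (delay < 1024)
            delay *= 2;                 /* exponential backoff, capped */
    }
}

static void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```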
49. Spin Lock 2: The Ticket Lock
- Two counters (nr_requests and nr_releases)
- Lock acquire: fetch-and-increment on the nr_requests counter, then wait until the returned ticket equals the value of the nr_releases counter
- Lock release: increment the nr_releases counter
50. Spin Lock 2: The Ticket Lock
- Advantage over TS: polls with read operations only
- Still generates lots of traffic and contention
- Can be further improved by using backoff
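A minimal sketch (not the paper's pseudocode) of the ticket lock just described, using C11 atomics; waiters poll nr_releases with plain loads only.

```c
/* Ticket lock: FIFO ordering, read-only polling while waiting. */
#include <stdatomic.h>

typedef struct {
    atomic_uint nr_requests;   /* next ticket to hand out */
    atomic_uint nr_releases;   /* ticket currently being served */
} ticket_lock;

static void ticket_acquire(ticket_lock *l) {
    unsigned my_ticket =
        atomic_fetch_add_explicit(&l->nr_requests, 1, memory_order_relaxed);
    /* spin with reads only: no read-modify-write traffic while waiting */
    while (atomic_load_explicit(&l->nr_releases, memory_order_acquire) != my_ticket)
        ;
}

static void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->nr_releases, 1, memory_order_release);
}
```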
51. Array-Based Queueing Locks
- Each CPU spins on a different location, in a distinct cache line
- Each CPU clears the lock for its successor (sets it from must_wait to has_lock)
- Lock acquire:
    while (slots[my_place] == must_wait) ;   // spin
- Lock release:
    slots[(my_place + 1) mod P] = has_lock;
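A hedged sketch of an array-based queueing lock along the lines described above; NCPUS, the 64-byte padding, and the point at which a slot is re-armed are assumptions rather than the paper's exact pseudocode.

```c
/* Array-based queueing lock: one padded flag per CPU, FIFO hand-off. */
#include <stdatomic.h>

#define NCPUS 8
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    /* one flag per CPU, padded so each lives in its own cache line
     * (assumes 4-byte atomic_int and 64-byte lines) */
    struct { atomic_int flag; char pad[60]; } slots[NCPUS];
    atomic_uint next_slot;                      /* ticket counter */
} array_lock;

static void array_lock_init(array_lock *l) {
    for (int i = 0; i < NCPUS; i++)
        atomic_init(&l->slots[i].flag, i == 0 ? HAS_LOCK : MUST_WAIT);
    atomic_init(&l->next_slot, 0);
}

static unsigned array_acquire(array_lock *l) {  /* returns my_place */
    unsigned my_place =
        atomic_fetch_add_explicit(&l->next_slot, 1, memory_order_relaxed) % NCPUS;
    while (atomic_load_explicit(&l->slots[my_place].flag,
                                memory_order_acquire) == MUST_WAIT)
        ;                                       /* spin on my own line only */
    atomic_store_explicit(&l->slots[my_place].flag, MUST_WAIT,
                          memory_order_relaxed); /* re-arm my slot for reuse */
    return my_place;
}

static void array_release(array_lock *l, unsigned my_place) {
    atomic_store_explicit(&l->slots[(my_place + 1) % NCPUS].flag, HAS_LOCK,
                          memory_order_release); /* hand lock to successor */
}
```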
52. List-Based Queueing Locks (MCS Locks)
- Spins on local flag variables only
- Requires a small constant amount of space per lock
53. List-Based Queueing Locks (MCS Locks)
- Waiting CPUs form a linked list; upon release by the current CPU, the lock is acquired by its successor
- Spinning is on a local flag
- The lock variable points at the tail of the queue (null if the lock is not held)
- Compare-and-swap lets a processor detect whether it is the only one in the queue and atomically remove itself from the queue
54. List-Based Queueing Locks (MCS Locks)
- The spin in acquire_lock waits for the lock to become free
- The spin in release_lock compensates for the time window between the fetch-and-store and the assignment to predecessor->next in acquire_lock
- Without compare-and-swap, the protocol becomes cumbersome
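As a hedged illustration of the algorithm just described (not the paper's pseudocode verbatim), here is an MCS acquire/release sketch in C11 atomics, where atomic_exchange plays the role of fetch_and_store and atomic_compare_exchange that of compare_and_swap; all names are invented.

```c
/* MCS list-based queueing lock: each waiter spins only on its own node. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct {
    mcs_node *_Atomic tail;      /* points at tail of queue, NULL if free */
} mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {                        /* queue was non-empty: wait */
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                                  /* spin on my own flag only */
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* no known successor: if we are still the tail, free the lock */
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is between fetch_and_store and linking in: wait for it */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```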
55. The MCS Tree-Based Barrier
- Uses a pair of P-node trees (P = number of CPUs): an arrival tree and a wakeup tree
- Arrival tree: each node has 4 children
- Wakeup tree: binary tree
- Fastest way to wake up all P processors
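The tree barrier itself is too long to sketch here; for contrast, the following is a minimal centralized sense-reversing barrier (explicitly not the MCS tree barrier). It shows the local-sense spinning idea but concentrates all arrivals on one counter, which is exactly the contention the tree structure distributes away.

```c
/* Centralized sense-reversing barrier (baseline, not the MCS tree barrier). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;        /* processors that have arrived this episode */
    atomic_bool sense;       /* global sense, flipped once per episode */
    int nprocs;
} central_barrier;

static void barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;                       /* my sense this episode */
    if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
        atomic_store(&b->count, 0);                     /* last arrival resets... */
        atomic_store(&b->sense, *local_sense);          /* ...and releases everyone */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                           /* spin until released */
    }
}
```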
56. Hardware Description
- BBN Butterfly 1: DSM multiprocessor
  - Supports up to 256 CPUs; 80 used in the experiments
  - Atomic primitives provide fetch_and_add, fetch_and_store (swap), test_and_set
- Sequent Symmetry Model B: cache-coherent, shared-bus multiprocessor
  - Supports up to 30 CPUs; 18 used in the experiments
  - Snooping cache-coherence protocol
- Neither machine supports compare-and-swap
57. Measurement Technique
- Results are averaged over 10k (Butterfly) or 100k (Symmetry) lock acquisitions
- For 1 CPU, the reported time is the latency between acquire and release of the lock
- Otherwise, it is the time elapsed between successive lock acquisitions
58. Spin Locks on Butterfly (performance graph)
59. Spin Locks on Butterfly (performance graph)
60. Spin Locks on Butterfly
- Anderson's lock fares worse because the Butterfly lacks coherent caches, and CPUs may spin on statically unpredictable locations which may not be local
- TS with exponential backoff, the Ticket lock with proportional backoff, and MCS all scale very well, with slopes of 0.0025, 0.0021 and 0.00025 µs respectively
61. Spin Locks on Symmetry (performance graph)
62. Spin Locks on Symmetry (performance graph)
63. Latency and Impact of Spin Locks (performance graph)
64. Latency and Impact of Spin Locks
- Latency results are poor on the Butterfly because:
  - atomic operations are inordinately expensive compared to non-atomic ones
  - the 16-bit atomic primitives on the Butterfly cannot manipulate 24-bit pointers
65. Barriers on Butterfly (performance graph)
66. Barriers on Butterfly (performance graph)
67. Barriers on Symmetry (performance graph)
68. Barriers on Symmetry
- Results differ from the Butterfly because:
  - more CPUs can spin on the same location (each with its own copy in its local cache)
  - distributing writes across different memory modules yields no benefit, because the bus serializes all communication
69. Conclusions
- Criteria for evaluating spin locks:
  - Scalability and induced network load
  - Single-processor latency
  - Space requirements
  - Fairness
  - Implementability with available atomic operations
70. Conclusions
- The MCS lock algorithm scales best, together with array-based queueing locks on cache-coherent machines
- TS and Ticket locks with proper backoff also scale well, but incur more network load
- Anderson and GT (Graunke & Thakkar) locks have prohibitive space requirements for large numbers of CPUs
71. Conclusions
- MCS, array-based, and Ticket locks guarantee fairness (FIFO ordering)
- MCS benefits significantly from the availability of compare-and-swap
- MCS is best when contention is expected: excellent scaling, FIFO ordering, least interconnect contention, low space requirements