Title: Computer Architecture II
1. Computer Architecture II
2. Today
- Synchronization for shared-memory multiprocessors
- Test&set, LL/SC, and array-based locks
- Barriers
- Scalable multiprocessors
- What is a scalable machine?
3. Synchronization
- Types of synchronization
- Mutual exclusion
- Event synchronization
- point-to-point
- group
- global (barriers)
- All solutions rely on hardware support for an atomic read-modify-write operation
- Today we look at synchronization for cache-coherent, bus-based multiprocessors
4. Components of a Synchronization Event
- Acquire method
- acquire the right to the synchronization (e.g. enter the critical section)
- Waiting algorithm
- wait for the synchronization to become available when it isn't
- busy-waiting, blocking, or hybrid
- Release method
- enable other processors to acquire
5. Performance Criteria for Synchronization Operations
- Latency (time per op)
- especially under light contention
- Bandwidth (ops per sec)
- especially under high contention
- Traffic
- load on critical resources
- especially on failures under contention
- Storage
- Fairness
6. Strawman Lock (Busy-Waiting)

    lock:   ld  register, location  /* copy location to register */
            cmp register, 0         /* compare with 0 */
            bnz lock                /* if not 0, try again */
            st  location, 1         /* store 1 to mark it locked */
            ret                     /* return control to caller */

    unlock: st  location, 0         /* write 0 to location */
            ret                     /* return control to caller */

Location is initially 0. Why doesn't the acquire method work? (The load, compare, and store are separate operations: two processors can both read 0 and both proceed to store 1.)
7. Atomic Instructions
- Specify a location, a register, and an atomic operation
- Value in location is read into the register
- Another value (possibly a function of the value read) is stored into the location
- Many variants
- varying degrees of flexibility in the second part
- Simple example: test&set
- value in location is read into a specified register
- constant 1 is stored into the location
- successful if the value loaded into the register is 0
- other constants could be used instead of 1 and 0
8. Simple Test&Set Lock

    lock:   t&s register, location  /* atomically read location and set it to 1 */
            bnz lock                /* if not 0, try again */
            ret                     /* return control to caller */

    unlock: st  location, 0         /* write 0 to location */
            ret                     /* return control to caller */

- The same lock code in pseudocode:

    while (!acquired)         /* lock is held by someone else */
        test&set(location);   /* try to acquire the lock */

- Condition: the architecture supports an atomic test&set
- copy location to a register and set location to 1
- Problem
- t&s modifies the variable location in its cache each time it tries to acquire the lock => cache block invalidations => bus traffic (especially under high contention); see the C sketch below
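A minimal C11 rendering of this lock (my illustration, not from the slides): atomic_flag_test_and_set is essentially the t&s instruction above, returning the old value and writing 1 in one atomic read-modify-write.

    #include <stdatomic.h>

    static atomic_flag lock_loc = ATOMIC_FLAG_INIT;  /* initially 0 (free) */

    void acquire(void) {
        while (atomic_flag_test_and_set(&lock_loc))
            ;  /* old value was 1: lock held, try again */
    }

    void release(void) {
        atomic_flag_clear(&lock_loc);  /* write 0 to location */
    }

Note that each failed test_and_set still performs a write, which is exactly the invalidation traffic the slide describes.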
9. TS Lock Microbenchmark (SGI Challenge)

[Figure: time per lock-unlock pair (μs, 0-20) vs. number of processors (3-15) for the microbenchmark loop "lock; delay(c); unlock;". Curves: test&set (c = 0), test&set with exponential backoff (c = 3.64 μs), test&set with exponential backoff (c = 0), and ideal.]
- Why does performance degrade?
- Bus transactions on test&set
10. Other Read-Modify-Write Primitives
- Fetch&op
- atomically read a memory location, modify it (using the op operation), and write it back
- e.g. fetch&add, fetch&incr
- Compare&swap
- three operands: a location, a register to compare with, and a register to swap with (C sketch below)
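Both primitives exist directly in C11's <stdatomic.h>; a minimal sketch (function names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int counter;  /* hypothetical shared location */

    /* fetch&add: atomically read the old value and add n */
    int fetch_add_example(int n) {
        return atomic_fetch_add(&counter, n);  /* returns the value before the add */
    }

    /* compare&swap: store newval only if *loc still holds expected */
    bool cas_example(atomic_int *loc, int expected, int newval) {
        return atomic_compare_exchange_strong(loc, &expected, newval);
    }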
11. Enhancements to the Simple Lock
- Problem of t&s: lots of invalidations if the lock cannot be taken
- Reduce the frequency of issuing test&sets while waiting
- Test&set lock with exponential backoff:

    i = 0;
    while (!acquired) {       /* lock is held by someone else */
        test&set(location);
        if (!acquired) {      /* test&set didn't succeed */
            wait(t_i);        /* sleep some time, e.g. t_i = t0 * k^i */
            i++;
        }
    }

- Fewer invalidations
- But a processor may wait longer than necessary (C sketch below)
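A runnable C11 sketch of the same idea; the base delay and the cap are illustrative constants, not from the slides:

    #include <stdatomic.h>

    void acquire_backoff(atomic_flag *lock) {
        unsigned delay = 1;
        while (atomic_flag_test_and_set(lock)) {  /* t&s failed */
            /* wait(t_i): idle spin; volatile keeps the loop from being optimized away */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay *= 2;  /* exponential growth, capped */
        }
    }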
12. TS Lock Microbenchmark (SGI Challenge)

[Figure: the same microbenchmark plot as on slide 9.]
- Why does performance degrade?
- Bus transactions on test&set
13. Enhancements to the Simple Lock
- Reduce the frequency of issuing test&sets while waiting
- Test-and-test&set lock:

    while (!acquired) {        /* lock is held by someone else */
        if (location == 1)     /* test with an ordinary load */
            continue;
        else {
            test&set(location);
            if (acquired)      /* succeeded */
                break;
        }
    }

- Keep testing with an ordinary load
- just a hint: the cached lock variable will be invalidated when the release occurs
- If location becomes 0, use test&set to modify the variable atomically
- if that fails, start over
- Further reduces bus transactions (C sketch below)
- the load produces bus traffic only when the lock is released
- test&set produces bus traffic each time it is executed
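In C11 the same structure looks as follows (a sketch, my names): spin on an ordinary atomic load, which hits in the local cache, and attempt the expensive read-modify-write only when the lock looks free.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock;

    void ttas_acquire(ttas_lock *l) {
        for (;;) {
            /* read-only spin: no bus traffic while the cached line stays valid */
            while (atomic_load(&l->locked))
                ;
            /* lock looks free: try the atomic exchange (a test&set) */
            if (!atomic_exchange(&l->locked, true))
                return;  /* acquired */
        }
    }

    void ttas_release(ttas_lock *l) {
        atomic_store(&l->locked, false);
    }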
14. Lock Performance
15. Improved Hardware Primitives: LL-SC
- Goals
- problem of test&set: generates lots of bus traffic
- failed read-modify-write attempts shouldn't generate invalidations
- nice if a single primitive can implement a range of read-modify-write operations
- Load-Locked (or Load-Linked), Store-Conditional
- LL reads the variable into a register
- work on the value in the register
- SC tries to store it back to the location
- succeeds if and only if no other write to the variable occurred since this processor's LL
- indicated by a condition flag
- If SC succeeds, all three steps happened atomically
- If it fails, it doesn't write or generate invalidations
- must retry the acquire
16. Simple Lock with LL-SC

    lock:   ll   reg1, location  /* LL location into reg1 */
            bnz  reg1, lock      /* if not 0, lock is held: try again */
            sc   location, reg2  /* SC reg2 (holding 1) into location */
            beqz reg2, lock      /* if SC failed, start again */
            ret

    unlock: st   location, 0     /* write 0 to location */
            ret

- Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what is between LL and SC (exercise; see the fetch&add sketch below)
- Only a couple of instructions, so SC is likely to succeed
- Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting a transaction on the bus) if
- it detects an intervening write even before trying to get the bus
- it tries to get the bus but another processor's SC gets the bus first
- LL and SC are not lock and unlock, respectively
- they only guarantee that no conflicting write to the lock variable occurred between them
- but they can be used directly to implement simple operations on shared variables
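As an illustration (my addition, not from the slides): on machines with LL/SC, C11's weak compare-exchange typically compiles to exactly such an LL/SC pair, so a fetch&add built on it mirrors the retry loop above.

    #include <stdatomic.h>

    int fetch_and_add(atomic_int *loc, int n) {
        int old = atomic_load(loc);  /* like LL */
        /* a failed compare-exchange is like a failed SC: no write, just retry;
           on failure, `old` is reloaded with the current value */
        while (!atomic_compare_exchange_weak(loc, &old, old + n))
            ;
        return old;
    }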
17. Advanced Lock Algorithms
- Problems with the approaches presented so far
- unfair: the order of arrival does not count
- all processors try to acquire the lock when it is released
- many processes may incur a read miss when the lock is released
- desirable: only one miss
18. Ticket Lock
- Draw a ticket with a number, wait until that number is shown
- Two counters per lock (next_ticket, now_serving)
- Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket
- atomic op only when arriving at the lock, not when it is freed (so less contention)
- Release: increment now_serving
- Performance (C sketch below)
- low latency under low contention
- O(p) read misses at release, since all spin on the same variable
- FIFO order
- like the simple LL-SC lock, but with no invalidation when SC succeeds, and fair
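A minimal C11 sketch of the ticket lock (my names):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;  /* next ticket to hand out */
        atomic_uint now_serving;  /* ticket currently allowed in */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        /* fetch&inc on arrival: the only atomic op in the lock */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;  /* spin: all waiters read the same variable */
    }

    void ticket_release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);  /* serve the next ticket */
    }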
19. Array-Based Queuing Locks
- Waiting processes poll on different locations in an array of size p
- Acquire
- fetch&inc to obtain the address on which to spin (the next array element)
- ensure that these addresses are in different cache lines or memories
- Release
- set the next location in the array, waking up the process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in the ticket lock, but O(p) space per lock (C sketch below)
- Not so great for non-cache-coherent machines with distributed memory
- the array location I spin on is not necessarily in my local memory
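A C11 sketch of the array lock (my names and constants; it assumes at most P simultaneous waiters, and the padding keeps each slot in its own cache line):

    #include <stdatomic.h>

    #define P 64    /* max contending processors (assumed) */
    #define LINE 64 /* assumed cache-line size in bytes */

    typedef struct {
        struct { atomic_int free; char pad[LINE - sizeof(atomic_int)]; } slot[P];
        atomic_uint next;  /* index of the next slot to hand out */
    } array_lock;

    void array_init(array_lock *l) {
        for (unsigned i = 0; i < P; i++)
            atomic_store(&l->slot[i].free, 0);
        atomic_store(&l->slot[0].free, 1);  /* first arrival may proceed */
        atomic_store(&l->next, 0);
    }

    unsigned array_acquire(array_lock *l) {
        unsigned me = atomic_fetch_add(&l->next, 1) % P;
        while (!atomic_load(&l->slot[me].free))
            ;                                /* spin on my own cache line */
        atomic_store(&l->slot[me].free, 0);  /* consume my turn */
        return me;                           /* caller passes this to release */
    }

    void array_release(array_lock *l, unsigned me) {
        atomic_store(&l->slot[(me + 1) % P].free, 1);  /* wake my successor */
    }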
20. Lock Performance
21. Point-to-Point Event Synchronization
- Software methods
- busy-waiting: use ordinary variables as flags (C sketch below)
- blocking: semaphores
- interrupts
- Full hardware support: a full-empty bit with each word in memory
- set when the word is full with newly produced data (i.e. when written)
- unset when the word is empty due to being consumed (i.e. when read)
- Natural for word-level producer-consumer synchronization
- producer: write if empty, set to full
- consumer: read if full, set to empty
- Hardware preserves read/write atomicity
- Problem: flexibility
- multiple consumers
- multiple updates by a producer
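The busy-waiting flag method in C11 (a sketch; the release/acquire orderings make the produced data visible before the flag):

    #include <stdatomic.h>

    int data;          /* the produced value */
    atomic_int ready;  /* 0 = empty, 1 = full */

    void producer(void) {
        data = 42;  /* produce */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* busy-wait on the flag */
        /* data can now be read safely */
    }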
22. Barriers
- Hardware barriers
- wired-AND line separate from the address/data bus
- set input to 1 on arrival, wait for the output to become 1 before leaving
- useful when barriers are global and very frequent
- Difficult to support an arbitrary subset of processors
- even harder with multiple processes per processor
- Difficult to dynamically change the number and identity of participants
- e.g. the latter due to process migration
- Not common today on bus-based machines
- Software algorithms implemented using locks, flags, counters
23. A Simple Centralized Barrier
- A shared counter maintains the number of processes that have arrived
- increment on arrival (under a lock), check until it reaches numprocs
- Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        }
        else
            while (bar_name.flag == 0) {}  /* busy-wait for release */
    }
24. A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work
- must prevent a process from entering until all have left the previous instance
- could use another counter, but that increases latency and contention
- Sense reversal: wait for the flag to take a different value in consecutive instances
- toggle this value only when all processes have reached the barrier (runnable C sketch below)

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;      /* mycount is private */
        if (bar_name.counter == p) {
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;
            bar_name.flag = local_sense;   /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}
        }
    }
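For completeness, a runnable C11 version (my rendering: a pthread mutex stands in for LOCK/UNLOCK, and local_sense is thread-local):

    #include <stdatomic.h>
    #include <pthread.h>

    typedef struct {
        int counter;
        pthread_mutex_t lock;  /* assume PTHREAD_MUTEX_INITIALIZER */
        atomic_int flag;
    } barrier_t;

    static _Thread_local int local_sense = 0;

    void barrier_wait(barrier_t *b, int p) {
        local_sense = !local_sense;  /* toggle private sense */
        pthread_mutex_lock(&b->lock);
        int mycount = ++b->counter;  /* mycount is private */
        if (mycount == p) {          /* last to arrive */
            b->counter = 0;          /* reset for the next instance */
            pthread_mutex_unlock(&b->lock);
            atomic_store(&b->flag, local_sense);  /* release waiters */
        } else {
            pthread_mutex_unlock(&b->lock);
            while (atomic_load(&b->flag) != local_sense)
                ;  /* busy-wait on the flag */
        }
    }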
25. Centralized Barrier Performance
- Latency
- critical path length at least proportional to p (the accesses to the critical region are serialized by the lock)
- Traffic
- p bus transactions to obtain the lock
- p bus transactions to modify the counter
- 2 bus transactions for the last processor to reset the counter and release the waiting processes
- p-1 bus transactions for the first p-1 processors to read the flag
- Storage cost
- very low: a centralized counter and flag
- Fairness
- the same processor should not always be last to exit the barrier
- Key problems for the centralized barrier are latency and traffic
- especially with distributed memory, where all traffic goes to the same node
26. Improved Barrier Algorithms for a Bus
- Software combining tree
- only k processors access the same location, where k is the degree of the tree (k = 2 in the example below)
- separate arrival and exit trees, and use sense reversal
- valuable on a distributed network: communication proceeds along different paths
- on a bus all traffic goes over the same bus, and there is no less total traffic
- higher latency (log p steps of work, and O(p) serialized bus transactions)
- the advantage on a bus is the use of ordinary reads/writes instead of locks
27. Scalable Multiprocessors

28. Scalable Machines
- Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
- 4 important aspects of scalability
- bandwidth increases with the number of processors
- latency does not increase, or increases only slowly
- cost increases slowly with the number of processors
- physical placement of resources
29. Limited Scaling of a Bus

    Characteristic             Bus
    Physical length            ~1 ft
    Number of connections      fixed
    Maximum bandwidth          fixed
    Interface to comm. medium  memory interface (extended)
    Global order               arbitration
    Protection                 virtual -> physical
    Trust                      total
    OS                         single
    Comm. abstraction          HW

- Small configurations are cost-effective
30. Workstations in a LAN?

    Characteristic             Bus                  LAN
    Physical length            ~1 ft                km
    Number of connections      fixed                many
    Maximum bandwidth          fixed                ???
    Interface to comm. medium  memory interface     peripheral
    Global order               arbitration          ???
    Protection                 virtual -> physical  OS
    Trust                      total                none
    OS                         single               independent
    Comm. abstraction          HW                   SW

- No clear limit to physical scaling, little trust, no global order
- Independent failure and restart
31. Bandwidth Scalability
- Bandwidth limitation: a single set of wires
- Must have many independent wires (remember bisection width?) => switches
32. Dancehall MP Organization
- Network bandwidth demand scales linearly with the number of processors
- Latency increases with the number of switch stages (remember the butterfly?)
- Adding local memory would offer fixed latency
33. Generic Distributed-Memory Multiprocessor
34. Bandwidth Scaling Requirements
- Large number of independent communication paths between nodes => many concurrent transactions using different wires
- Independent transactions
- no global arbitration
- the effect of a transaction is visible only to the nodes involved
- broadcast is difficult (it was easy on a bus): additional transactions are needed
35. Latency Scaling
- T(n) = Overhead + Channel Time(n) + Routing Delay + Contention Time
- Overhead: processing time in initiating and completing a transfer
- Channel Time(n) = n/B (channel occupancy: n bytes at bandwidth B)
- Routing Delay = f(h, n): a function of the number of hops h (worked example below)
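A hypothetical worked example (the numbers are mine, purely illustrative): for an n = 1000-byte transfer over B = 100 MB/s channels, Channel Time = n/B = 10 μs; with 5 μs of overhead, 1 μs of routing delay, and no contention, T(n) = 5 + 10 + 1 = 16 μs.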
36. Cost Scaling
- Cost(p, m) = fixed cost + incremental cost(p, m)
- Bus-based SMP
- add more processors and memory
- Scalable machines
- processors, memory, and network
- Parallel efficiency(p) = Speedup(p) / p
- Costup(p) = Cost(p) / Cost(1)
- Cost-effective: Speedup(p) > Costup(p)
37. Cost Effective?
- 2048 processors: 475-fold speedup at 206x the cost
- Speedup(2048) = 475 > Costup(2048) = 206, so the machine is cost-effective even though parallel efficiency is only 475/2048, about 23%
38. Physical Scaling
- Chip-level integration
- Board-level integration
- System-level integration
39. Chip-Level Integration: nCUBE/2 (1024 nodes)
- Network integrated onto the chip: 14 bidirectional links => up to 8192 nodes
- Entire machine synchronous at 40 MHz
40. Board-Level Integration: CM-5
- Uses standard microprocessor components
- Scalable network interconnect
41. System-Level Integration
- Loose packaging
- IBM SP2
- Cluster blades