Title: Computer Architecture II
1. Computer Architecture II
2. Today
- Synchronization for shared-memory multiprocessors
- Test&set, LL/SC, and array-based locks
- Barriers
- Scalable multiprocessors
- What is a scalable machine?
3. Synchronization
- Types of synchronization
- Mutual exclusion
- Event synchronization
- point-to-point
- group
- global (barriers)
- All solutions rely on hardware support for an atomic read-modify-write operation
- Today we look at synchronization for cache-coherent, bus-based multiprocessors
4. Components of a Synchronization Event
- Acquire method
- acquire the right to the synchronization (e.g. enter the critical section)
- Waiting algorithm
- wait for the synchronization to become available when it isn't
- busy-waiting, blocking, or hybrid
- Release method
- enable other processors to acquire
5. Performance Criteria for Synchronization Operations
- Latency (time per op)
- especially under light contention
- Bandwidth (ops per sec)
- especially under high contention
- Traffic
- load on critical resources
- especially on failures under contention
- Storage
- Fairness
6. Strawman Lock (Busy-Waiting)

    lock:   ld  register, location  /* copy location to register */
            cmp register, 0         /* compare with 0 */
            bnz lock                /* if not 0, try again */
            st  location, 1         /* store 1 to mark it locked */
            ret                     /* return control to caller */

    unlock: st  location, 0         /* write 0 to location */
            ret                     /* return control to caller */

Location is initially 0. Why doesn't the acquire method work? (The load, compare, and store are separate operations: two processors can both read 0 and both proceed to store 1.)
7. Atomic Instructions
- Specify a location, a register, and an atomic operation
- Value in location is read into the register
- Another value (possibly a function of the value read) is stored into the location
- Many variants
- varying degrees of flexibility in the second part
- Simple example: test&set
- value in location is read into a specified register
- constant 1 is stored into the location
- successful if the value loaded into the register is 0
- other constants could be used instead of 1 and 0
8. Simple Test&Set Lock

    lock:   t&s register, location  /* atomically read location and set it to 1 */
            bnz lock                /* if not 0, try again */
            ret                     /* return control to caller */

    unlock: st  location, 0         /* write 0 to location */
            ret                     /* return control to caller */

- The same lock code in pseudocode:

    while (!acquired)         /* lock is held by someone else */
        test&set(location);   /* try to acquire the lock */

- Condition: the architecture supports an atomic test&set
- copy location to a register and set location to 1
- Problem
- t&s modifies the variable location in its cache each time it tries to acquire the lock => cache block invalidations => bus traffic (especially under high contention); see the C sketch below
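A minimal C11 rendering of this lock (my illustration, not from the slides): atomic_flag_test_and_set is essentially the t&s instruction above, returning the old value and writing 1 in one atomic read-modify-write.

    #include <stdatomic.h>

    static atomic_flag lock_loc = ATOMIC_FLAG_INIT;  /* initially 0 (free) */

    void acquire(void) {
        while (atomic_flag_test_and_set(&lock_loc))
            ;  /* old value was 1: lock held, try again */
    }

    void release(void) {
        atomic_flag_clear(&lock_loc);  /* write 0 to location */
    }

Note that each failed test_and_set still performs a write, which is exactly the invalidation traffic the slide describes.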
9. TS Lock Microbenchmark (SGI Challenge)

[Figure: time per lock-unlock pair (μs, 0-20) vs. number of processors (3-15) for the microbenchmark loop "lock; delay(c); unlock;". Curves: test&set (c = 0), test&set with exponential backoff (c = 3.64 μs), test&set with exponential backoff (c = 0), and ideal.]
- Why does performance degrade?
- Bus transactions on test&set
10. Other Read-Modify-Write Primitives
- Fetch&op
- atomically read a memory location, modify it (using the op operation), and write it back
- e.g. fetch&add, fetch&incr
- Compare&swap
- three operands: a location, a register to compare with, and a register to swap with (C sketch below)
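Both primitives exist directly in C11's <stdatomic.h>; a minimal sketch (function names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int counter;  /* hypothetical shared location */

    /* fetch&add: atomically read the old value and add n */
    int fetch_add_example(int n) {
        return atomic_fetch_add(&counter, n);  /* returns the value before the add */
    }

    /* compare&swap: store newval only if *loc still holds expected */
    bool cas_example(atomic_int *loc, int expected, int newval) {
        return atomic_compare_exchange_strong(loc, &expected, newval);
    }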
11. Enhancements to the Simple Lock
- Problem of t&s: lots of invalidations if the lock cannot be taken
- Reduce the frequency of issuing test&sets while waiting
- Test&set lock with exponential backoff:

    i = 0;
    while (!acquired) {       /* lock is held by someone else */
        test&set(location);
        if (!acquired) {      /* test&set didn't succeed */
            wait(t_i);        /* sleep some time, e.g. t_i = t0 * k^i */
            i++;
        }
    }

- Fewer invalidations
- But a processor may wait longer than necessary (C sketch below)
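A runnable C11 sketch of the same idea; the base delay and the cap are illustrative constants, not from the slides:

    #include <stdatomic.h>

    void acquire_backoff(atomic_flag *lock) {
        unsigned delay = 1;
        while (atomic_flag_test_and_set(lock)) {  /* t&s failed */
            /* wait(t_i): idle spin; volatile keeps the loop from being optimized away */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay *= 2;  /* exponential growth, capped */
        }
    }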
12. TS Lock Microbenchmark (SGI Challenge)

[Figure: the same microbenchmark plot as on slide 9.]
- Why does performance degrade?
- Bus transactions on test&set
13. Enhancements to the Simple Lock
- Reduce the frequency of issuing test&sets while waiting
- Test-and-test&set lock:

    while (!acquired) {        /* lock is held by someone else */
        if (location == 1)     /* test with an ordinary load */
            continue;
        else {
            test&set(location);
            if (acquired)      /* succeeded */
                break;
        }
    }

- Keep testing with an ordinary load
- just a hint: the cached lock variable will be invalidated when the release occurs
- If location becomes 0, use test&set to modify the variable atomically
- if that fails, start over
- Further reduces bus transactions (C sketch below)
- the load produces bus traffic only when the lock is released
- test&set produces bus traffic each time it is executed
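In C11 the same structure looks as follows (a sketch, my names): spin on an ordinary atomic load, which hits in the local cache, and attempt the expensive read-modify-write only when the lock looks free.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock;

    void ttas_acquire(ttas_lock *l) {
        for (;;) {
            /* read-only spin: no bus traffic while the cached line stays valid */
            while (atomic_load(&l->locked))
                ;
            /* lock looks free: try the atomic exchange (a test&set) */
            if (!atomic_exchange(&l->locked, true))
                return;  /* acquired */
        }
    }

    void ttas_release(ttas_lock *l) {
        atomic_store(&l->locked, false);
    }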
14. Lock Performance
15. Improved Hardware Primitives: LL-SC
- Goals
- problem of test&set: generates lots of bus traffic
- failed read-modify-write attempts shouldn't generate invalidations
- nice if a single primitive can implement a range of read-modify-write operations
- Load-Locked (or Load-Linked), Store-Conditional
- LL reads the variable into a register
- work on the value in the register
- SC tries to store it back to the location
- succeeds if and only if no other write to the variable occurred since this processor's LL
- indicated by a condition flag
- If SC succeeds, all three steps happened atomically
- If it fails, it doesn't write or generate invalidations
- must retry the acquire
16. Simple Lock with LL-SC

    lock:   ll   reg1, location  /* LL location into reg1 */
            bnz  reg1, lock      /* if not 0, lock is held: try again */
            sc   location, reg2  /* SC reg2 (holding 1) into location */
            beqz reg2, lock      /* if SC failed, start again */
            ret

    unlock: st   location, 0     /* write 0 to location */
            ret

- Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what is between LL and SC (exercise; see the fetch&add sketch below)
- Only a couple of instructions, so SC is likely to succeed
- Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting a transaction on the bus) if
- it detects an intervening write even before trying to get the bus
- it tries to get the bus but another processor's SC gets the bus first
- LL and SC are not lock and unlock, respectively
- they only guarantee that no conflicting write to the lock variable occurred between them
- but they can be used directly to implement simple operations on shared variables
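As an illustration (my addition, not from the slides): on machines with LL/SC, C11's weak compare-exchange typically compiles to exactly such an LL/SC pair, so a fetch&add built on it mirrors the retry loop above.

    #include <stdatomic.h>

    int fetch_and_add(atomic_int *loc, int n) {
        int old = atomic_load(loc);  /* like LL */
        /* a failed compare-exchange is like a failed SC: no write, just retry;
           on failure, `old` is reloaded with the current value */
        while (!atomic_compare_exchange_weak(loc, &old, old + n))
            ;
        return old;
    }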
17. Advanced Lock Algorithms
- Problems with the approaches presented so far
- unfair: the order of arrival does not count
- all processors try to acquire the lock when it is released
- many processes may incur a read miss when the lock is released
- desirable: only one miss
18. Ticket Lock
- Draw a ticket with a number, wait until that number is shown
- Two counters per lock (next_ticket, now_serving)
- Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket
- atomic op only when arriving at the lock, not when it is freed (so less contention)
- Release: increment now_serving
- Performance (C sketch below)
- low latency under low contention
- O(p) read misses at release, since all spin on the same variable
- FIFO order
- like the simple LL-SC lock, but with no invalidation when SC succeeds, and fair
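A minimal C11 sketch of the ticket lock (my names):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;  /* next ticket to hand out */
        atomic_uint now_serving;  /* ticket currently allowed in */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        /* fetch&inc on arrival: the only atomic op in the lock */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;  /* spin: all waiters read the same variable */
    }

    void ticket_release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);  /* serve the next ticket */
    }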
19. Array-Based Queuing Locks
- Waiting processes poll on different locations in an array of size p
- Acquire
- fetch&inc to obtain the address on which to spin (the next array element)
- ensure that these addresses are in different cache lines or memories
- Release
- set the next location in the array, waking up the process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in the ticket lock, but O(p) space per lock (C sketch below)
- Not so great for non-cache-coherent machines with distributed memory
- the array location I spin on is not necessarily in my local memory
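A C11 sketch of the array lock (my names and constants; it assumes at most P simultaneous waiters, and the padding keeps each slot in its own cache line):

    #include <stdatomic.h>

    #define P 64    /* max contending processors (assumed) */
    #define LINE 64 /* assumed cache-line size in bytes */

    typedef struct {
        struct { atomic_int free; char pad[LINE - sizeof(atomic_int)]; } slot[P];
        atomic_uint next;  /* index of the next slot to hand out */
    } array_lock;

    void array_init(array_lock *l) {
        for (unsigned i = 0; i < P; i++)
            atomic_store(&l->slot[i].free, 0);
        atomic_store(&l->slot[0].free, 1);  /* first arrival may proceed */
        atomic_store(&l->next, 0);
    }

    unsigned array_acquire(array_lock *l) {
        unsigned me = atomic_fetch_add(&l->next, 1) % P;
        while (!atomic_load(&l->slot[me].free))
            ;                                /* spin on my own cache line */
        atomic_store(&l->slot[me].free, 0);  /* consume my turn */
        return me;                           /* caller passes this to release */
    }

    void array_release(array_lock *l, unsigned me) {
        atomic_store(&l->slot[(me + 1) % P].free, 1);  /* wake my successor */
    }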
20. Lock Performance
21. Point-to-Point Event Synchronization
- Software methods
- busy-waiting: use ordinary variables as flags (C sketch below)
- blocking: semaphores
- interrupts
- Full hardware support: a full-empty bit with each word in memory
- set when the word is full with newly produced data (i.e. when written)
- unset when the word is empty due to being consumed (i.e. when read)
- Natural for word-level producer-consumer synchronization
- producer: write if empty, set to full
- consumer: read if full, set to empty
- Hardware preserves read/write atomicity
- Problem: flexibility
- multiple consumers
- multiple updates by a producer
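The busy-waiting flag method in C11 (a sketch; the release/acquire orderings make the produced data visible before the flag):

    #include <stdatomic.h>

    int data;          /* the produced value */
    atomic_int ready;  /* 0 = empty, 1 = full */

    void producer(void) {
        data = 42;  /* produce */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* busy-wait on the flag */
        /* data can now be read safely */
    }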
22. Barriers
- Hardware barriers
- wired-AND line separate from the address/data bus
- set input to 1 on arrival, wait for the output to become 1 before leaving
- useful when barriers are global and very frequent
- Difficult to support an arbitrary subset of processors
- even harder with multiple processes per processor
- Difficult to dynamically change the number and identity of participants
- e.g. the latter due to process migration
- Not common today on bus-based machines
- Software algorithms implemented using locks, flags, counters
23. A Simple Centralized Barrier
- A shared counter maintains the number of processes that have arrived
- increment on arrival (under a lock), check until it reaches numprocs
- Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        }
        else
            while (bar_name.flag == 0) {}  /* busy-wait for release */
    }
24. A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work
- must prevent a process from entering until all have left the previous instance
- could use another counter, but that increases latency and contention
- Sense reversal: wait for the flag to take a different value in consecutive instances
- toggle this value only when all processes have reached the barrier (runnable C sketch below)

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;      /* mycount is private */
        if (bar_name.counter == p) {
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;
            bar_name.flag = local_sense;   /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}
        }
    }
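For completeness, a runnable C11 version (my rendering: a pthread mutex stands in for LOCK/UNLOCK, and local_sense is thread-local):

    #include <stdatomic.h>
    #include <pthread.h>

    typedef struct {
        int counter;
        pthread_mutex_t lock;  /* assume PTHREAD_MUTEX_INITIALIZER */
        atomic_int flag;
    } barrier_t;

    static _Thread_local int local_sense = 0;

    void barrier_wait(barrier_t *b, int p) {
        local_sense = !local_sense;  /* toggle private sense */
        pthread_mutex_lock(&b->lock);
        int mycount = ++b->counter;  /* mycount is private */
        if (mycount == p) {          /* last to arrive */
            b->counter = 0;          /* reset for the next instance */
            pthread_mutex_unlock(&b->lock);
            atomic_store(&b->flag, local_sense);  /* release waiters */
        } else {
            pthread_mutex_unlock(&b->lock);
            while (atomic_load(&b->flag) != local_sense)
                ;  /* busy-wait on the flag */
        }
    }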
25. Centralized Barrier Performance
- Latency
- critical path length at least proportional to p (the accesses to the critical region are serialized by the lock)
- Traffic
- p bus transactions to obtain the lock
- p bus transactions to modify the counter
- 2 bus transactions for the last processor to reset the counter and release the waiting processes
- p-1 bus transactions for the first p-1 processors to read the flag
- Storage cost
- very low: a centralized counter and flag
- Fairness
- the same processor should not always be last to exit the barrier
- Key problems for the centralized barrier are latency and traffic
- especially with distributed memory, where all traffic goes to the same node
26. Improved Barrier Algorithms for a Bus
- Software combining tree
- only k processors access the same location, where k is the degree of the tree (k = 2 in the example below)
- separate arrival and exit trees, and use sense reversal
- valuable on a distributed network: communication proceeds along different paths
- on a bus all traffic goes over the same bus, and there is no less total traffic
- higher latency (log p steps of work, and O(p) serialized bus transactions)
- the advantage on a bus is the use of ordinary reads/writes instead of locks
27. Scalable Multiprocessors

28. Scalable Machines
- Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
- 4 important aspects of scalability
- bandwidth increases with the number of processors
- latency does not increase, or increases only slowly
- cost increases slowly with the number of processors
- physical placement of resources
29. Limited Scaling of a Bus

    Characteristic             Bus
    Physical length            ~1 ft
    Number of connections      fixed
    Maximum bandwidth          fixed
    Interface to comm. medium  memory interface (extended)
    Global order               arbitration
    Protection                 virtual -> physical
    Trust                      total
    OS                         single
    Comm. abstraction          HW

- Small configurations are cost-effective
30. Workstations in a LAN?

    Characteristic             Bus                  LAN
    Physical length            ~1 ft                km
    Number of connections      fixed                many
    Maximum bandwidth          fixed                ???
    Interface to comm. medium  memory interface     peripheral
    Global order               arbitration          ???
    Protection                 virtual -> physical  OS
    Trust                      total                none
    OS                         single               independent
    Comm. abstraction          HW                   SW

- No clear limit to physical scaling, little trust, no global order
- Independent failure and restart
31. Bandwidth Scalability
- Bandwidth limitation: a single set of wires
- Must have many independent wires (remember bisection width?) => switches
32. Dancehall MP Organization
- Network bandwidth demand scales linearly with the number of processors
- Latency increases with the number of switch stages (remember the butterfly?)
- Adding local memory would offer fixed latency
33. Generic Distributed-Memory Multiprocessor
34. Bandwidth Scaling Requirements
- Large number of independent communication paths between nodes => many concurrent transactions using different wires
- Independent transactions
- no global arbitration
- the effect of a transaction is visible only to the nodes involved
- broadcast is difficult (it was easy on a bus): additional transactions are needed
35. Latency Scaling
- T(n) = Overhead + Channel Time(n) + Routing Delay + Contention Time
- Overhead: processing time in initiating and completing a transfer
- Channel Time(n) = n/B (channel occupancy: n bytes at bandwidth B)
- Routing Delay = f(h, n): a function of the number of hops h (worked example below)
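A hypothetical worked example (the numbers are mine, purely illustrative): for an n = 1000-byte transfer over B = 100 MB/s channels, Channel Time = n/B = 10 μs; with 5 μs of overhead, 1 μs of routing delay, and no contention, T(n) = 5 + 10 + 1 = 16 μs.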
36. Cost Scaling
- Cost(p, m) = fixed cost + incremental cost(p, m)
- Bus-based SMP
- add more processors and memory
- Scalable machines
- processors, memory, and network
- Parallel efficiency(p) = Speedup(p) / p
- Costup(p) = Cost(p) / Cost(1)
- Cost-effective: Speedup(p) > Costup(p)
37. Cost Effective?
- 2048 processors: 475-fold speedup at 206x the cost
- Speedup(2048) = 475 > Costup(2048) = 206, so the machine is cost-effective even though parallel efficiency is only 475/2048, about 23%
38. Physical Scaling
- Chip-level integration
- Board-level integration
- System-level integration
39. Chip-Level Integration: nCUBE/2 (1024 nodes)
- Network integrated onto the chip: 14 bidirectional links => up to 8192 nodes
- Entire machine synchronous at 40 MHz
40. Board-Level Integration: CM-5
- Uses standard microprocessor components
- Scalable network interconnect
41. System-Level Integration
- Loose packaging
- IBM SP2
- Cluster blades