1
Spin Lock in Shared Memory Multiprocessors
  • Ref: T. E. Anderson, "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors," IEEE TPDS, Jan. 1990

2
Outline
  • Background
    • Multiprocessor architectures
    • Caches and cache coherence
    • Synchronization
  • Simple spin locks and their performance
    • Test-and-set spin
    • Spin-on-read
  • More complex spin locks
    • Delays
    • Queueing

3
Multiprocessor Architectures
  • Basically, three flavors
  • Message passing machines (multicomputers)
    • Each processor has its own private memory (and address space)
    • Communication via message passing
    • E.g., a cluster of (uniprocessor) workstations
  • Shared memory multiprocessors
    • Processors can access common memory locations
    • Symmetric multiprocessor (SMP): a shared memory multiprocessor where average memory access time is the same for every processor, independent of which memory location is accessed
    • By contrast, asymmetric machines distinguish between local (fast) and remote (slow) memory; harder to program, less common today
    • Current trend: SMPs becoming ubiquitous (aka multi-core architecture)
  • Combination of the two
    • E.g., a collection of SMPs connected via a fast switch or LAN
    • Shared memory within an SMP, message passing between SMPs (a message passing library can be built over shared memory to hide this)
    • Common organization for many supercomputers

4
SMP Hardware
  • Processor
  • Memory
    • Cache memory
      • Typically, two or three levels of cache
      • First level fast and small (e.g., 1 clock cycle (cc), 256 KB); second level slower and larger (e.g., 10 cc, 4 MB)
    • Main memory (e.g., hundreds of cc, 16 GB)
    • Store buffers, so the processor need not wait for a memory write to complete
  • Switch
    • General switch (e.g., crossbar)
    • Single shared bus
      • Broadcast is easy on a shared bus - important!
  • Here, the focus is on bus-based SMPs
    • Many of the ideas apply to machines with general interconnects

5
Cache Coherence
  • Suppose a single variable is stored in the cache memory of several different processors, and one of the processors modifies the variable
    • This is the cache coherence problem
  • Solutions
    • Invalidate (delete) the stale cached copies
    • Update the cached copies
  • Invalidation/update is easier with a shared bus
    • Snoopy cache coherence protocol: processors listen to memory accesses on the bus and update/invalidate their cached copies as needed

6
Synchronization Operations
  • Atomic read-modify-write operation
    • E.g., the SPARC ldstub (load-store unsigned byte) test-and-set instruction
  • The bus can be exploited here too
    • A processor raises a bus line while performing a read-modify-write operation
    • Other processors cannot start a read-modify-write operation while the bus line is held high
    • Other (non-synchronization) memory accesses can still use the bus while the line is held high
  • Since a write is performed, invalidation/update operations are required

7
Spin Locks
  • Init: lock = CLEAR
  • Lock: while (TestAndSet(lock) == BUSY) ;
  • Unlock: lock = CLEAR
  • (A runnable C sketch of this lock appears after this slide)
  • Advantages
    • Efficient for short waits (short critical sections), since the thread need not give up the processor and incur scheduling and context switch overheads
    • The processor may not have anything else to do anyway
  • Disadvantages
    • Inefficient for longer waits
    • System overheads (e.g., bus contention, cache effects)?
      • The main focus of the Anderson paper
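A minimal runnable version of the Init/Lock/Unlock pseudocode above, continuing the C11 sketch from slide 6 (CLEAR, BUSY, and TestAndSet as defined there; an illustration, not the original SPARC code):

    typedef atomic_int spinlock_t;

    void lock_init(spinlock_t *lock)    { atomic_store(lock, CLEAR); }

    void lock_acquire(spinlock_t *lock)
    {
        /* Every iteration issues a bus read-modify-write, which is the
         * bandwidth problem analyzed on the next slides. */
        while (TestAndSet(lock) == BUSY)
            ;
    }

    void lock_release(spinlock_t *lock) { atomic_store(lock, CLEAR); }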

8
Performance Metrics
  • Small bandwidth consumption
    • The bus is shared with other processors
  • Small delay: time from when the lock becomes free until a waiting processor can acquire it
  • Short latency: time to acquire the lock when there is no contention (lock is available)
  • Spin lock tradeoff: spin locks are polling mechanisms; if you poll too often, you consume too much bandwidth, and if you poll rarely, you increase delay
  • Observation: one can poll a local cached copy without using bus bandwidth!
    • Ideally, polling is done on the local cached copy

9
TestAndSet Spin Lock
Lock: while (TestAndSet(lock) == BUSY) ;
Unlock: lock = CLEAR
  • Latency (time to gain the lock if no contention)?
    • Good: low latency to acquire a free lock
  • Bandwidth (bus cycles used)?
    • Each TestAndSet (TS) operation requires a bus cycle; this saturates the bus as the number of processors increases, slowing other processors down, even those not accessing the lock
  • Delay (lock release to acquire time)?
    • The processor releasing the lock must contend for bus cycles with the processors trying to acquire it; SMPs usually do not give priority to the processor releasing a lock
  • Overall, performance is poor with many processors due to contention; the lock does not exploit the cache for polling, because TS is used to poll

10
Test-and-Test-and-Set (Spin-on-Read)
Poll by reading the lock; only use TestAndSet if the lock is not busy.
Lock: while (lock == BUSY || TestAndSet(lock) == BUSY) ;
  • Latency?
    • Again good: little time to obtain the lock if there is no contention
    • Two memory operations (why? the read, then the TestAndSet)
  • Bandwidth?
    • If the lock is BUSY, spin by reading the cache (don't consume bus cycles), because a read, not TS, is used to poll
  • Delay (release-to-acquire time)?
    • A bit more complex (next slide; a C sketch of spin-on-read follows it)

11
Spin-on-Read (cont.)
  • Lock
  • while (lock == BUSY || TestAndSet(lock) == BUSY) ;
  • Delay (release to acquire time)?
  • When the lock is released, the cached copy is invalidated or updated, causing the TestAndSet operation to be executed
  • One processor acquires the lock; the others continue to spin
  • If there are many processors, there is a flurry of activity when the lock is released (many unneeded bus ops); consider an invalidation protocol:
    • The lock is released (set to CLEAR), invalidating the copy in all other caches
    • Some processors read the lock (cache miss) and load it (value = CLEAR) into their cache (pending TS request)
    • The remaining processors have not yet read the lock (pending read request)
    • One processor executes TS and acquires the lock
    • The TS operation invalidates the caches of processors with pending TS requests!
    • Some processors with pending reads now load the (busy) lock into their cache
    • Other processors with pending TS now do TS, fail to get the lock, but invalidate all the caches, again! etc., etc.
    • Finally, the last processor with a pending TS invalidates everyone's cache
    • Each processor again reloads the lock into its cache
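In the same hypothetical C11 style as the earlier sketches, spin-on-read looks roughly like this (the plain atomic load plays the role of the cached read; only the exchange costs a bus cycle):

    void lock_acquire_ttas(spinlock_t *lock)
    {
        for (;;) {
            /* Ordinary read: served from the local cache while the lock
             * stays BUSY, so spinning consumes no bus cycles. */
            while (atomic_load(lock) == BUSY)
                ;
            /* The lock looked free; attempt the expensive TestAndSet. */
            if (TestAndSet(lock) == CLEAR)
                return;  /* acquired; otherwise another processor won, re-spin */
        }
    }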

12
Inefficiencies with Spin-on-Read
  • During the time from when a processor loads the lock into its cache until it does a TS to acquire the lock, other processors may also load the CLEAR lock into their caches and then try TS; this degrades performance
  • Each subsequent TS uses a bus cycle and invalidates cached copies of the lock, even though the TS write does not change the memory location's value
  • For P processors, after an invalidation occurs, P
    bus cycles are needed to reload the value into
    the caches, even though the same value is being
    read by each processor
  • Empirical measurements
  • Original TS spin performs poorly for many
    processors
  • Spin-on-Read better, but still disappointing
    performance for many processors

13
Update Protocols
  • Update protocols typically use a copy-back write policy
  • They typically distinguish between
    • Exclusive data (this is the only cache with a copy)
    • Shared data (stored in multiple caches)
  • Bus operations are needed when shared data is written
  • The lock variable is shared in the scenarios described earlier, so a bus transaction is needed for each TS operation to update the other caches

14
Other Spin Locks: Adding Delay
  • Delay after noticing the lock was released (a C sketch follows this slide)
  • Lock:
    while (lock == BUSY || TS(lock) == BUSY) {
        while (lock == BUSY) ;   // spin here
        Delay();
    }
  • Rationale
    • Recall: the problem was having many pending TS ops after the lock becomes available
    • Solution: after noticing the lock is available, delay, then check whether it is busy again before trying TS
    • The hope is to reduce the number of unsuccessful TS ops, reducing the number of invalidations
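A sketch of this variant, continuing the C11 examples (spin_delay is an assumed helper; any calibrated pause would do):

    static void spin_delay(unsigned iters)
    {
        /* Assumed busy-wait helper; a real implementation might use a
         * pause instruction or a calibrated loop instead. */
        for (volatile unsigned i = 0; i < iters; i++)
            ;
    }

    void lock_acquire_delayed(spinlock_t *lock, unsigned delay_iters)
    {
        while (atomic_load(lock) == BUSY || TestAndSet(lock) == BUSY) {
            while (atomic_load(lock) == BUSY)
                ;  /* spin on the cached copy */
            /* Noticed the release: delay before the outer loop retries
             * TestAndSet, to thin out unsuccessful TS operations. */
            spin_delay(delay_iters);
        }
    }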

15
Delay Period
  • The length of the delay could be set statically or dynamically
    • If too short, the delay doesn't help much
    • If too long, processors spend time waiting needlessly
  • Ideally, different processors delay for different amounts of time
  • Dynamic delay: each processor chooses a random delay (and adapts)
    • Like Ethernet (CSMA networks)
    • But collisions in the spin-lock case cost more when there are more processors waiting (it takes longer to start spinning on the cache again)
  • Good heuristics for backoff (sketched below):
    • Bound the maximum mean delay by the number of processors, to avoid long delays when only one processor is still waiting
    • The initial delay should be a function of the delay last time, before the lock was acquired (costly to learn from mistakes!), e.g., half of the previous delay
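Coded roughly, these heuristics might look as follows (the cap and the halving rule come from the bullets above; the doubling rule, the jitter range, and all names are assumptions):

    #include <stdlib.h>

    unsigned initial_delay(unsigned last_successful_delay)
    {
        unsigned half = last_successful_delay / 2;   /* half the previous delay */
        return half ? half : 1;
    }

    unsigned next_delay(unsigned current, unsigned nprocs)
    {
        unsigned doubled = current * 2;              /* grow exponentially ... */
        return doubled < nprocs ? doubled : nprocs;  /* ... capped near P */
    }

    unsigned randomized(unsigned mean_delay)
    {
        /* Random jitter around the mean (assumed >= 1), as in
         * Ethernet-style backoff. */
        return 1 + (unsigned)rand() % (2 * mean_delay);
    }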

16
Other Approaches
  • Delay between each reference
  • Lock:
    while (lock == BUSY || TS(lock) == BUSY)
        Delay();
  • Reduces bus traffic for machines without cache coherence, or for SMPs with invalidation protocols
  • Queue waiting processors (a minimal sketch follows)
  • See the paper for deeper discussion and performance analyses
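For flavor, the queueing idea can be sketched as an array-based lock in which each waiter spins on its own slot, so a release touches only one location (a simplified illustration of the approach, continuing the C11 examples; a real implementation would pad each slot to its own cache line):

    #define MAXPROCS 64   /* assumed: at least the number of processors */

    typedef struct {
        atomic_int  has_lock[MAXPROCS];  /* each waiter spins on its own slot */
        atomic_uint next_slot;
    } queue_lock_t;

    void qlock_init(queue_lock_t *q)
    {
        for (int i = 0; i < MAXPROCS; i++)
            atomic_store(&q->has_lock[i], 0);
        atomic_store(&q->has_lock[0], 1);   /* first comer gets the lock */
        atomic_store(&q->next_slot, 0);
    }

    unsigned qlock_acquire(queue_lock_t *q)
    {
        unsigned me = atomic_fetch_add(&q->next_slot, 1) % MAXPROCS;
        while (!atomic_load(&q->has_lock[me]))
            ;                                /* spin on my slot only */
        atomic_store(&q->has_lock[me], 0);   /* reset slot for later reuse */
        return me;                           /* pass to qlock_release */
    }

    void qlock_release(queue_lock_t *q, unsigned me)
    {
        /* Hand the lock directly to the next waiter in line; only that
         * processor's cached copy is invalidated. */
        atomic_store(&q->has_lock[(me + 1) % MAXPROCS], 1);
    }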

17
Summary
  • Implementations of spin locks can have a significant impact on performance, especially as the number of spinning processors increases
  • Lock design must be tailored to the coherence protocol for best performance
  • Optimized protocols attempt to minimize the number of unfruitful test-and-set operations, in order to minimize the number of bus cycles and avoid excessive invalidation/update operations