Multiprocessors - PowerPoint PPT Presentation

About This Presentation
Title:

Multiprocessors

Description:

Using every possible technique to speedup single-processor systems... Processors run the same program, but don't have to stay in lockstep. 8/19/09. 6 ... – PowerPoint PPT presentation

Slides: 67
Provided by: constantin56
Transcript and Presenter's Notes

Title: Multiprocessors


1
  • Multiprocessors (multicores)

ECE 411 Spring 2009
2
Outline
  • Multiprocessing intro
  • Classifying Multiprocessors
  • Synchronization

3
Multiprocessors
4
Motivation
  • Using every possible technique to speed up
    single-processor systems
  • If I have N processors, shouldn't I be able to
    get N times as much work done?
  • Anyone who's ever done a group project (MP3!)
    will tell you why it's not this simple
  • If I have many programs to run, I can use
    multiple processors to run them at the same time
  • Multiprocessing for throughput
  • But can I use N processors to run one program in
    1/Nth the time?
  • Making this work is the holy grail of parallel
    processing

5
Processor Taxonomy (Flynn's)
  • Single Instruction, Single Data (SISD)
  • Basic 1-wide CPU
  • Single Instruction, Multiple Data (SIMD)
  • Vector processors, some multimedia, array
    processors
  • Multiple Instruction, Single Data (MISD)
  • Non-practical
  • Multiple Instruction, Multiple Data (MIMD)
  • Most multiprocessors, superscalar CPUs
  • More recent addition
  • Single Program, Multiple Data (SPMD)
  • Processors run the same program, but don't have
    to stay in lockstep

6
Multiprocessor Performance
7
Multiprocessor Performance
  • Ideal speedup of N on N processors
  • Work evenly divided among the processors with no
    overhead
  • More typical speedups are sub-linear (< N)
  • Communication overhead
  • Synchronization overhead
  • Load balancing
  • Occasionally, see superlinear (> N) speedup
  • Indicates that parallel program or computer
    system is more efficient than the base program or
    system
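The speedup terminology above can be made concrete with a small sketch. The function and parameter names (`speedup`, `efficiency`, `t_uni`, `t_multi`, `n_procs`) are illustrative assumptions, not part of the slides.

```c
#include <assert.h>

/* Speedup: uniprocessor run time divided by multiprocessor run time.
 * Both times are assumed to be measured on the same input data. */
double speedup(double t_uni, double t_multi) {
    assert(t_uni > 0.0 && t_multi > 0.0);
    return t_uni / t_multi;
}

/* Efficiency: fraction of the ideal N-fold speedup actually achieved.
 * 1.0 is ideal, below 1.0 is sub-linear, above 1.0 is superlinear. */
double efficiency(double t_uni, double t_multi, int n_procs) {
    assert(n_procs >= 1);
    return speedup(t_uni, t_multi) / n_procs;
}
```

For example, a program that takes 100 s on one processor and 25 s on 8 processors has a speedup of 4 and an efficiency of 0.5.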

8
How Not to Lie With Multiprocessor Performance
  • Basic rule: be fair to the uniprocessor
  • Compare multiprocessor against the best possible
    version of the uniprocessor program
  • Do a good job on the uniprocessor program, use
    optimizing compiler, etc.
  • Make sure uniprocessor is running an efficient
    algorithm for uniprocessors (may require two
    versions of the program)
  • Use the same input data for all versions
  • Much easier to get speedup of N if you increase
    work by factor of N
  • Some performance measures explicitly cover rate
    of work, in which case increasing data size is ok

9
Classifying Multiprocessors
  • Can think about categorizing multiprocessors
    based on three questions
  • How do the processors exchange data?
  • How are the processors connected to each other?
  • How is memory organized?

10
How do Processors Exchange Data?
  • Two major alternatives
  • Message Passing
  • Programs execute explicit operations (sends,
    receives) to transfer data from one processor to
    another
  • Can be more efficient, because can send data
    before it is needed
  • Often viewed as harder to program, particularly
    for irregular applications
  • Shared Memory
  • System maintains one view of memory, programs on
    one processor see the results of writes by
    programs on other processors
  • Data generally not sent until program tries to
    access it
  • Often viewed as easier to program
  • Generally accepted that this is true for just
    getting code working. Less clear if you consider
    effort for high performance

11
Message-Passing Example
  • Processor 1
  • send(a, processor2);
  • c = 0;
  • for (i = 0; i < 100; i++)
  •     c += a[i];
  • send(c, processor2);
  • Processor 2
  • receive(a);
  • d = 0;
  • for (i = 100; i < 200; i++)
  •     d += a[i];
  • receive(c);
  • d += c;
  • printf("Sum is %d\n", d);

12
Shared-Memory Example
  • Processor 1
  • c = 0;
  • for (i = 0; i < 100; i++)
  •     c += a[i];
  • Processor 2
  • d = 0;
  • for (j = 100; j < 200; j++)
  •     d += a[j];
  • /* wait for first processor to be done */
  • d += c;
  • printf("Sum is %d\n", d);

13
How are the Processors Connected?
  • Shared Bus
  • Simple
  • All processors see every communication
  • Problem: bandwidth doesn't scale as number of
    processors increases
  • May actually go down because of wire length,
    loading
  • Network
  • Many different types of network
  • Generally consists of a set of switches that
    connect subsets of the processors
  • Can increase bandwidth as number of processors
    increases
  • Lots of complexity/wire/latency tradeoffs in
    network design

14
How is Memory Organized?
15
Centralized vs. Distributed Memory
  • Centralized Memory
  • Conceptually Simple
  • Integrates well with shared memory
  • Bandwidth doesn't scale with number of processors
  • Note: can still have caches on each processor
    with centralized memory
  • Creates cache coherence problem that we'll talk
    about next time
  • Distributed Memory
  • Integrates well with message-passing model
  • Bandwidth scales with number of processors

16
Multiprocessor Parallelism
  • Origin: time-sharing
  • Threads, tasks, multithreading, multitasking
  • Instruction-level parallelism: independent
    instructions within a single thread executing
    simultaneously.
  • Thread-level parallelism: independent parts of
    an application executing simultaneously.
  • Terminology: processes, tasks, threads, and their
    OS counterparts.

17
Synchronization
  • Sometimes, we need to ensure that events on
    different processors happen in a particular order
    to get the correct result from programs.
  • Example: weather simulation
  • Synchronization refers to both the process of
    ensuring ordering and the technique used to
    enforce order

18
Synchronization in Message-Passing Systems
  • Send-Receive paradigm enforces ordering because
    receive() operation doesn't complete until the
    data from the matching send() arrives.
  • In many cases, getting the set of sends and
    receives right to transfer the data required by
    the program on each processor provides all of the
    synchronization you need
  • In some cases, need to add extra send-receive
    pairs to enforce ordering
  • Example: keeping a producer thread from getting
    too far ahead of a consumer

19
Synchronization in Shared-Memory Systems
  • Synchronization is much more of an issue on
    shared-memory systems than on message-passing
    systems
  • Example: if processor P1 does a store to
    address 37, and P2 does a load from address 37,
    does P2 see the value that P1 wrote?
  • Depends on whether the store or the load happens
    first
  • Need synchronization to enforce the ordering the
    programmer wants
  • Shared-Memory Model: we will assume strong
    consistency, which means that
  • On any processor, memory operations happen in
    program order (or at least return the same result
    as if they did)
  • Across all processors, the set of memory
    operations gives the same result as if they
    executed in some sequential order

20
Synchronization
  • Two basic primitives
  • Mutual exclusion (lock)
  • Only one processor can acquire a lock at any time
  • Example: shared counter
  • lock(lockvar);
  • a = a + 1;
  • release(lockvar);
  • Barrier
  • When processor hits a barrier, stops until all
    processors reach the barrier
  • Example: weather simulation typically divides
    time into discrete steps, executes a barrier at
    the end of each step so that no processor
    simulates the next step until all are done with
    the current one.
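As a rough sketch of the barrier primitive described above, the following uses C11 atomics to build a simple centralized spinning barrier. This is one possible construction under stated assumptions, not the implementation the slides have in mind; the names `step_barrier` and `barrier_wait` are hypothetical, and a production barrier would normally block or back off rather than spin.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    int n_procs;            /* number of participating processors */
    atomic_int arrived;     /* arrivals so far in the current step */
    atomic_int step;        /* incremented each time the barrier opens */
} step_barrier;             /* hypothetical name */

/* Stop until all n_procs participants have reached the barrier. */
void barrier_wait(step_barrier *b) {
    assert(b->n_procs >= 1);
    int my_step = atomic_load(&b->step);
    if (atomic_fetch_add(&b->arrived, 1) + 1 == b->n_procs) {
        /* Last arrival: reset the count first, then open the barrier. */
        atomic_store(&b->arrived, 0);
        atomic_fetch_add(&b->step, 1);
    } else {
        /* Everyone else spins until the last arrival advances the step. */
        while (atomic_load(&b->step) == my_step) { /* spin */ }
    }
}
```

In a time-stepped simulation, each processor would call `barrier_wait` once at the end of every step, so none starts step k+1 before all have finished step k.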

21
Implementing Locks
  • It is possible to implement a lock with just load
    and store operations but very inefficient
  • Locks become much more efficient if the processor
    provides some sort of atomic read-modify-write
    operation
  • Example: test-and-set
  • Single instruction that reads the value of a
    memory location, writes a new value to that
    location, and returns the old value to a
    destination register
  • Key feature is that no other operation can read
    the memory location between the time the
    test-and-set reads it and the time that the new
    value is written

22
Implementing Locks With Test-and-Set
  • void lock(lockvar) {
  •     while (test_and_set(lockvar, 1) != 0)
  •         ;  /* spin until the old value was 0 */
  • }
  • void unlock(lockvar) {
  •     lockvar = 0;
  • }
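The pseudocode above maps closely onto C11's `atomic_flag`, whose `atomic_flag_test_and_set` operation is an atomic read-modify-write of the kind described. The sketch below is one way to write it; the names `tas_lock`, `tas_try_lock`, `tas_lock_acquire`, `tas_lock_release`, and `locked_increment` are hypothetical, and the acquire/release memory orders are an assumption about what a typical lock needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } tas_lock;   /* hypothetical name */
#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One test-and-set: atomically sets the flag and returns its old value.
 * The lock was free, and is now ours, exactly when the old value was clear. */
bool tas_try_lock(tas_lock *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until a test-and-set observes the lock as free. */
void tas_lock_acquire(tas_lock *l) {
    while (!tas_try_lock(l)) { /* spin */ }
}

/* Unlocking is a plain atomic store of "clear". */
void tas_lock_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* The shared-counter example: a = a + 1 performed under the lock. */
void locked_increment(tas_lock *l, int *a) {
    assert(a != 0);
    tas_lock_acquire(l);
    *a = *a + 1;
    tas_lock_release(l);
}
```

A lock would be declared as `tas_lock l = TAS_LOCK_INIT;` and shared by all threads that update the counter.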

23
Executing Multiple Threads
  • What is a thread?
  • Loop iterations, function calls, basic blocks,
    external functions,
  • How do we implement threads?
  • Thread-level parallelism
  • Synchronization
  • Multiprocessors
  • Explicit multithreading
  • Implicit multithreading
  • Redundant multithreading
  • Summary

24
Thread-level Parallelism
  • Reduces effectiveness of temporal and spatial
    locality

25
Thread-level Parallelism
  • Parallelism limited by sharing
  • Amdahl's law
  • Access to shared state must be serialized
  • Serial portion limits parallel speedup
  • Many important applications share (lots of) state
  • Relational databases (transaction processing):
    GBs of shared state
  • Even completely independent processes share
    virtualized hardware, hence must synchronize
    access
  • Access to shared state/shared variables
  • Must occur in a predictable, repeatable manner
  • Otherwise, chaos results
  • Architecture must provide primitives for
    serializing access to shared state
  • Multithreading may reduce the effectiveness of
    both spatial and temporal locality
  • Temporal: same piece of code takes longer to
    execute (context switching)
  • Spatial: multiple active threads make working set
    much larger
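The Amdahl's law point above ("serial portion limits parallel speedup") can be written as the usual formula, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N the number of processors. A minimal sketch, with `amdahl_speedup` and its parameter names chosen here for illustration:

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup when a fraction `serial_fraction`
 * of the work cannot be parallelized and the rest scales perfectly. */
double amdahl_speedup(double serial_fraction, int n_procs) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(n_procs >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs);
}
```

With a 10% serial portion, the speedup stays below 10 no matter how many processors are added, since the result can never exceed 1 / s.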

26
Synchronization: Memory Consistency (A = 0)
[Figure: panels labeled A = 3, A = 4, A = 4, A = 1]
27
Some Other Synchronization Primitives
  • Only one is necessary
  • Intel's Itanium supports all three.

28
Synchronization Examples
  • All three guarantee the same semantics
  • Initial value of A = 0
  • Final value of A = 4 in ALL cases
  • (b) uses additional lock variable AL to protect
    critical section with a spin lock
  • This is the most common synchronization method in
    modern multithreaded applications

29
Multiprocessor Systems: The Four Key Abstractions
  • Fully shared memory
  • All processors have equivalent view of all of
    memory
  • Uniform (Unit) latency
  • All memory requests satisfied in a single cycle
  • Lack of contention
  • A processor's memory references are never slowed
    down by other processors' memory references
  • Instantaneous propagation of writes
  • Any write operation (by any processor) is
    instantaneously visible by all processors.

30
Cache Coherence
  • Simple to build, but long memory latencies and
    lots of contention for the memory bus

31
Cache Coherence
  • Reduced average memory latency
  • Less contention for the bus
  • Problem: what happens when both caches have
    copies of the same address?

32
Snooping Caches
33
Implementing Cache Coherence
  • Snooping implementation
  • Origins in shared-bus systems
  • All CPUs could observe all other CPUs' requests on
    the bus, hence "snooping"
  • Bus Read, Bus Write, Bus Upgrade
  • React appropriately to snooped commands
  • Invalidate shared copies
  • Provide up-to-date copies of dirty lines
  • Flush (writeback) to memory, or
  • Direct intervention (modified intervention or
    dirty miss)
  • Snooping suffers from
  • Scalability: shared buses are not practical
  • Ordering of requests without a shared bus
  • Lots of recent and on-going work on scaling
    snoop-based systems

34
Snooping Caches
  • Each cache watches all of the transactions on the
    memory bus
  • If another processor requests a copy of a line in
    your cache, then you handle the request by
    sending the line over the bus and adjusting the
    state in your cache appropriately
  • Main memory provides data if no cache does
  • If your copy is shared, it stays shared if the
    other processor wanted to read the data, becomes
    invalid if the other processor wanted to write
    the data
  • If your copy is exclusive or modified, becomes
    shared if another processor wants to read the
    data, invalid if another processor wants to write

35
Cache Coherence
  • Basic problem: if we have multiple copies of a
    memory address, need to keep those copies
    coherent (the same)
  • Writes on one processor must become visible to
    all
  • One solution would be to broadcast all writes to
    every processor
  • Lots of wasted bus bandwidth -- other processors
    may not have copies of a given address
  • Need to know when to send values to other
    processors

36
Invalidation-Based Cache Coherence
  • Basic idea: it's OK if multiple processors have
    copies of a memory address so long as none of
    them are writing to it
  • Allow multiple processors to have read-only
    (shared) copies of a cache line
  • If a processor wants to write a cache line, it
    must acquire a writable (exclusive) copy of the
    line
  • If a processor has a writable copy of a line, no
    other processor may have a copy of the line
  • Requesting an exclusive copy of a line requires
    that all other processors invalidate their copy
    of the line
  • Alternative approach: update-based cache
    coherence
  • Many processors can write to a line, have to send
    written values to all processors with copies of
    the line

37
Update vs. Invalidation Protocols
  • Coherent Shared Memory
  • All processors see the effects of others' writes
  • How/when writes are propagated
  • Determined by coherence protocol

38
MESI: An Invalidate Protocol
39
Illinois (MESI) Protocol
  • In a processor's cache, each line can be in one
    of four states
  • Modified (this processor has the only copy, line
    is writable and readable, line has been written
    since fetched)
  • Exclusive (this processor has the only copy, line
    is writable and readable, line has not been
    written since fetched)
  • Shared (this processor and others have copies,
    line can be read but not written)
  • Invalid (this processor has no copy of the line,
    line cannot be written or read)
  • Each tag in the cache records the state of its
    line, similar to how uniprocessor caches track
    valid/invalid and dirty/clean
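The four states, together with the snooping rules described earlier (a copy stays or becomes shared when another processor reads, and becomes invalid when another processor writes), can be sketched as a pair of transition functions. This is a simplified illustration, not the full Illinois protocol: the names `mesi_state`, `bus_op`, `snoop`, and `local_write` are hypothetical, Bus Upgrade is folded into the write case, and which of several sharers actually responds is left out.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;
typedef enum { BUS_READ, BUS_WRITE } bus_op;  /* request from another processor */

/* New state of our copy after snooping another processor's request.
 * *supply_data is set when our cache holds the line and so can send it
 * over the bus; otherwise main memory provides the data. */
mesi_state snoop(mesi_state s, bus_op op, int *supply_data) {
    assert(supply_data != 0);
    *supply_data = (s != INVALID);
    if (s == INVALID)
        return INVALID;
    return (op == BUS_READ) ? SHARED : INVALID;
}

/* New state after this processor writes the line. *needs_bus is set when
 * other copies may exist, so a writable copy must be requested first. */
mesi_state local_write(mesi_state s, int *needs_bus) {
    assert(needs_bus != 0);
    *needs_bus = (s == INVALID || s == SHARED);
    return MODIFIED;
}
```

Note that a write to an Exclusive line needs no bus transaction, which is the main benefit of distinguishing Exclusive from Shared.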

40
An Invalidate Protocol: The Illinois (MESI)
Protocol
41
Illinois Protocol on Snooping Cache System
42
Limitations of Snooping Cache
  • Bus bandwidth and bandwidth to the main memory
    don't increase as number of processors goes up
  • As number of processors goes up, one of these two
    factors eventually becomes the bottleneck
  • Worse than that, bus bandwidth will actually go
    down as number of processors increases due to
    electrical effects
  • Network made up of point-to-point connections can
    be much faster for large machines

43
Distributed Shared Memory
  • Each processor has a cache and a main memory
    attached to it, and communicates with the other
    processors over a network
  • The shared address space is divided up among the
    processors, and each processor becomes the home
    node for the data stored in its main memory
  • Home nodes keep a directory of the state of each
    line they are responsible for and who has copies
    of the line
  • When processors try to access a line they don't
    have a copy of, or try to write a line they have
    a shared copy of, they send a message to the home
    node of the line requesting it
  • Home node sends messages to all sharing nodes
    telling them how the state of their copies needs
    to change
  • Processor with the most up-to-date copy of the
    line sends it back to the requesting processor

44
UMA vs. NUMA
[Figure: Centralized Memory (UMA): processors connected through a network to a single memory. Distributed Memory (NUMA): processors, each with its own local memory, connected by a network.]
45
Distributed Shared-Memory Machine
46
Consistency vs. Coherence
  • Protocol vs Implementation
  • The memory consistency model tells you when the
    results of a memory operation on one processor
    will be visible on another processor
  • The cache-coherence protocol tells you how the
    memory system implements the memory consistency
    model

47
Strong (Sequential) Consistency
  • A multiprocessor system is sequentially
    consistent if
  • The result of any execution is the same as if all
    of the operations on all the processors were
    executed in some (unspecified) sequential order
  • Atomic execution of memory operations
  • On any processor, the result of any execution is
    the same as if all of the operations on that
    processor were executed in program order
  • Intuitively, this model gives the same result as
    if a set of in-order processors shared a single
    memory system
  • Relaxed consistency models generally require the
    programmer to specify when writes by one
    processor become visible on other processors
  • Reduces communication traffic, but increases
    programming effort

48
Consistency Example -- Dekker's Algorithm
  • Processor 1               Processor 2
  • while (1) {               while (1) {
  •     Flag1 = 1;                Flag2 = 1;
  •     if (Flag2 == 0) {         if (Flag1 == 0) {
  •         /* Have lock */           /* Have lock */
  •         return;                   return;
  •     }                         }
  •     Flag1 = 0;                Flag2 = 0;
  • }                         }
  • On a system with strong consistency, this
    implements locks without a read-modify-write
    operation
  • Very inefficient, though, particularly as the
    number of processors grows

49
Implementing Cache Coherence
  • Directory implementation
  • Extra bits stored in memory (directory) record
    state of line
  • Memory controller maintains coherence based on
    the current state
  • Other CPUs' commands are not snooped, instead
  • Directory forwards relevant commands
  • Powerful filtering effect: only observe commands
    that you need to observe
  • Meanwhile, bandwidth at directory scales by
    adding memory controllers as you increase size of
    the system
  • Leads to very scalable designs (100s to 1000s of
    CPUs)
  • Directory shortcomings
  • Indirection through directory has latency penalty
  • If shared line is dirty in another CPU's cache,
    directory must forward request, adding latency
  • This can severely impact performance of
    applications with heavy sharing (e.g. relational
    databases)

50
Memory Consistency: Dijkstra's 2-way Mutual
Exclusion
  • How are memory references from different
    processors interleaved?
  • If this is not well-specified, synchronization
    becomes difficult or even impossible
  • ISA must specify consistency model
  • Common example using Dekker's algorithm for
    synchronization
  • If load reordered ahead of store (as we assume
    for a baseline OOO CPU)
  • Both Proc0 and Proc1 enter critical section,
    since both observe that the other's lock variable
    (A/B) is not set
  • If consistency model allows loads to execute
    ahead of stores, Dekker's algorithm no longer
    works
  • Common ISAs allow this: IA-32, PowerPC, SPARC,
    Alpha
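In C11, one way to get the ordering this flag protocol depends on is to use atomic operations with their default `memory_order_seq_cst` ordering, which stops the load of the other flag from being reordered ahead of the store to one's own flag. The sketch below shows a single attempt of the flag exchange; the names `flag`, `try_enter_critical`, and `exit_critical` are hypothetical, and the retry loop from the slide is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flag[2];   /* flag[i] == 1: processor i wants the lock */

/* One attempt by processor `me` (0 or 1): raise own flag, then check the
 * other processor's flag. Returns true if the critical section is ours. */
bool try_enter_critical(int me) {
    assert(me == 0 || me == 1);
    atomic_store(&flag[me], 1);            /* seq_cst store */
    if (atomic_load(&flag[1 - me]) == 0)   /* seq_cst load, ordered after the store */
        return true;
    atomic_store(&flag[me], 0);            /* back off so the other side can proceed */
    return false;
}

void exit_critical(int me) {
    assert(me == 0 || me == 1);
    atomic_store(&flag[me], 0);
}
```

If the two accesses were plain (non-atomic) or relaxed, the compiler or hardware could perform the load first, and both processors could see the other's flag as 0.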

51
Sequential Consistency [Lamport 1979]
  • Processors treated as if they are interleaved
    processes on a single time-shared CPU
  • All references must fit into a total global order
    or interleaving that does not violate any CPU's
    program order
  • Otherwise sequential consistency not maintained
  • Now Dekker's algorithm will work
  • Appears to preclude any OOO memory references
  • Hence precludes any real benefit from OOO CPUs

52
High-Performance Sequential Consistency
  • Coherent caches isolate CPUs if no sharing is
    occurring
  • Absence of coherence activity means CPU is free
    to reorder references
  • Still have to order references with respect to
    misses and other coherence activity (snoops)
  • Key: use speculation
  • Reorder references speculatively
  • Track which addresses were touched speculatively
  • Force replay (in order execution) of such
    references that collide with coherence activity
    (snoops)

53
High-Performance Sequential Consistency
  • Load queue records all speculative loads
  • Bus writes/upgrades are checked against LQ
  • Any matching load gets marked for replay
  • At commit, loads are checked and replayed if
    necessary
  • Results in machine flush, since load-dependent
    ops must also replay
  • Practically, conflicts are rare, so expensive
    flush is OK

54
Relaxed Consistency Models
  • Key insight: only synchronization references need
    to be ordered
  • Hence, relax memory ordering for all other
    references
  • Enable high-performance out-of-order
    implementation
  • Require programmer to label synchronization
    references
  • Hardware must carefully order these labeled
    references
  • All other references can be performed out of
    order
  • Labeling schemes
  • Explicit synchronization ops (acquire/release)
  • Memory fence or memory barrier ops
  • All preceding ops must finish before following
    ones begin
  • Often barrier ops cause pipeline drain in modern
    out-of-order machine

55
Coherent Memory Interface Example
56
Split Transaction Buses
  • Packet switched vs. circuit switched
  • Release bus after request issued
  • Allow multiple concurrent requests to overlap
    memory latency
  • More complicated control, arbitration for bus
  • Much better throughput

57
Explicitly Multithreaded Processors
  • Many approaches for executing multiple threads on
    a single die
  • Mix-and-match: IBM Power5 CMP+SMT

58
IBM Power4: Example CMP
59
Coarse-grained Multithreading
  • Low-overhead approach for improving processor
    throughput
  • Also known as switch-on-event
  • Long history: Denelcor HEP
  • Commercialized in IBM Northstar, Pulsar
  • Rumored in Sun Rock, Niagara

60
SMT Resource Sharing
61
Implicitly Multithreaded Processors
  • Goal: speed up execution of a single thread
  • Implicitly break program up into multiple smaller
    threads, execute them in parallel
  • Parallelize loop iterations across multiple
    processing units
  • Usually, exploit control independence in some
    fashion
  • Many challenges
  • Maintain data dependences (RAW, WAR, WAW) for
    registers
  • Maintain precise state for exception handling
  • Maintain memory dependences (RAW/WAR/WAW)
  • Maintain memory consistency model
  • Not really addressed in any of the literature
  • Active area of research
  • Only a subset is covered here, in a superficial
    manner

62
Sources of Control Independence
63
Implicit Multithreading Approaches
64
Executing The Same Thread
  • Why execute the same thread twice?
  • Detect faults
  • Better performance
  • Prefetch, resolve branches

65
Fault Detection
  • AR/SMT [Rotenberg 1999]
  • Use second SMT thread to execute program twice
  • Compare results to check for hard and soft errors
    (faults)
  • DIVA [Austin 1999]
  • Use simple check-processor at commit
  • Re-execute all ops in order
  • Possibly relax main processor's correctness
    constraints and safety margins to improve
    performance
  • Lower voltage, higher frequency, etc.
  • Lots of other variations proposed in more recent
    work

66
Speculative Pre-execution
  • Idea: create a runahead or future thread that
    helps the main trailing thread
  • Advantage: speculative future thread has no
    correctness requirement
  • Slipstream processors [Rotenberg 2000]
  • Construct speculative, stripped-down version
    (future thread)
  • Let it run ahead and prefetch
  • Speculative precomputation [Roth 2001, Zilles
    2002, Collins et al. 2001]
  • Construct backward dataflow slice for problematic
    instructions (mispredicted branches, cache
    misses)
  • Pre-execute this slice of the program
  • Resolve branches, prefetch data
  • Implemented in Intel production compiler,
    reflected in Intel Pentium 4 SPEC results