Future Processors to use Coarse-Grain Parallelism
1
Chapter 6
  • Future Processors to use Coarse-Grain Parallelism

2
Future processors to use coarse-grain parallelism
  • Chip multiprocessors (CMPs) or multiprocessor
    chips
  • integrate two or more complete processors on a
    single chip,
  • every functional unit of a processor is
    duplicated
  • Simultaneous multithreaded processors (SMTs)
  • store multiple contexts in different register
    sets on the chip,
  • the functional units are multiplexed between the
    threads,
  • instructions of different contexts are
    simultaneously executed

3
Principal chip multiprocessor alternatives
  • Symmetric multiprocessor (SMP)
  • Distributed shared memory multiprocessor (DSM)
  • Message-passing shared-nothing multiprocessor

4
Organizational principles of multiprocessors
5
Typical SMP
6
Shared memory candidates for CMPs
Shared-main memory and
shared-secondary cache
7
Shared memory candidates for CMPs
Shared-primary cache
8
Grain-levels for CMPs
  • multiple processes in parallel
  • multiple threads from a single application →
    implies a common address space for all threads
  • extracting threads of control dynamically from a
    single instruction stream
  • → see last chapter: multiscalar, trace
    processors, ...

9
Texas Instruments TMS320C80 Multimedia Video
Processor
10
Hydra: A single-chip multiprocessor
11
Conclusions on CMP
  • Usually, a CMP will feature
  • separate L1 I-cache and D-cache per on-chip CPU
  • and an optional unified L2 cache.
  • If the CPUs always execute threads of the same
    process, the L2 cache organization will be
    simplified, because different processes do not
    have to be distinguished.
  • Recently announced commercial processors with CMP
    hardware
  • IBM Power4 processor with 2 processors on a
    single die
  • Sun MAJC5200: two processors on a die (each
    processor a 4-threaded block-interleaving VLIW)

12
Multithreaded processors
  • Aim: latency tolerance
  • What is the problem?
  • Load access latencies measured on an AlphaServer
    4100 SMP with four 300 MHz Alpha 21164 processors
    are
  • 7 cycles for a primary cache miss that hits in
    the on-chip L2 cache of the 21164 processor,
  • 21 cycles for an L2 cache miss that hits in the
    L3 (board-level) cache,
  • 80 cycles for a miss that is served by the
    memory, and
  • 125 cycles for a dirty miss, i.e., a miss that
    has to be served from another processor's cache
    memory.
  • Multithreaded processors are able to bridge such
    latencies by switching to another thread of
    control - in contrast to chip multiprocessors
    (see the thread-count sketch below).
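
As a rough illustration (not from the slides), the following C
sketch estimates how many ready threads a processor needs to hide
each of the latencies above, under the simplifying assumption that
every thread runs 10 cycles between long-latency events.

    /* Hedged back-of-envelope model: while one thread waits out a
     * miss, the remaining threads must supply that many cycles of
     * work, run_len cycles each. */
    #include <stdio.h>

    static int threads_needed(int miss_latency, int run_len) {
        return 1 + (miss_latency + run_len - 1) / run_len;
    }

    int main(void) {
        int latencies[] = { 7, 21, 80, 125 };  /* cycles, from the slide */
        for (int i = 0; i < 4; i++)
            printf("latency %3d -> %d threads\n",
                   latencies[i], threads_needed(latencies[i], 10));
        return 0;
    }

For the 125-cycle dirty miss this gives 14 ready threads, which
motivates the large thread counts of machines like the Tera MTA.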

13
Multithreaded processors
  • Multithreading
  • Provide several program counter registers (and
    usually several register sets) on chip
  • Fast context switching by switching to another
    thread of control

14
Approaches of multithreaded processors
  • Cycle-by-cycle interleaving
  • An instruction of another thread is fetched and
    fed into the execution pipeline at each processor
    cycle.
  • Block-interleaving
  • The instructions of a thread are executed
    successively until an event occurs that may cause
    latency. This event induces a context switch.
  • Simultaneous multithreading
  • Instructions are simultaneously issued from
    multiple threads to the FUs of a superscalar
    processor.
  • combines wide superscalar instruction issue with
    multithreading.

15
Comparison of multithreading with
non-multithreading approaches
  • (a) single-threaded scalar
  • (b) cycle-by-cycle interleaving multithreaded
    scalar
  • (c) block interleaving multithreaded scalar

16
Comparison of multithreading with
non-multithreading approaches
  • (a) superscalar
  • (b) VLIW
  • (c) cycle-by-cycle interleaving
  • (d) cycle-by-cycle interleaving VLIW

17
Comparison of multithreading with
non-multithreading approaches
  • simultaneous multithreading (SMT)
  • chip multiprocessor (CMP)

18
Cycle-by-cycle interleaving
  • the processor switches to a different thread
    after each instruction fetch
  • pipeline hazards cannot arise and the processor
    pipeline can be easily built without the
    necessity of complex forwarding paths
  • context-switching overhead is zero cycles
  • memory latency is tolerated by not scheduling a
    thread until the memory transaction has completed
  • requires at least as many threads as pipeline
    stages in the processor,
  • degrading single-thread performance if not
    enough threads are present (see the round-robin
    sketch below)
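
A minimal C sketch of this selection logic, with assumed names and
an assumed 8-thread configuration: one thread is selected per cycle
in round-robin order, and threads waiting on memory are skipped, so
they are not scheduled until their transaction completes.

    #include <stdbool.h>

    #define NTHREADS 8

    struct thread_state {
        bool waiting_on_memory;   /* excluded from scheduling */
    };

    /* Pick the thread to fetch from this cycle: strict round-robin
     * over the ready threads, starting after the last selection. */
    static int select_thread(const struct thread_state t[NTHREADS],
                             int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int cand = (last + i) % NTHREADS;
            if (!t[cand].waiting_on_memory)
                return cand;
        }
        return -1;   /* too few ready threads: a pipeline bubble */
    }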

19
Cycle-by-cycle interleaving: improving
single-thread performance
  • The dependence look-ahead technique adds several
    bits to each instruction format in the ISA.
  • The scheduler feeds instructions of the same
    thread that are neither data- nor
    control-dependent successively into the pipeline
    (a look-ahead sketch follows after this list).
  • The interleaving technique proposed by Laudon et
    al. adds caching and full pipeline interlocks to
    the cycle-by-cycle interleaving approach.
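
A hedged C sketch of the look-ahead idea; the 3-bit field width and
all names are assumptions for illustration, not the actual ISA
encoding.

    /* Each instruction carries a look-ahead field giving the number
     * of immediately following instructions of the same thread that
     * are independent of it. */
    struct insn {
        unsigned opcode;
        unsigned lookahead : 3;   /* assumed 3-bit ISA field */
    };

    /* Length of the run the scheduler may feed back to back into
     * the pipeline from this thread. */
    static unsigned issuable_run(const struct insn *i) {
        return 1u + i->lookahead;
    }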

20
Tera MTA
  • cycle-by-cycle interleaving technique
  • employs the dependence-look-ahead technique
  • VLIW ISA (3-issue)
  • The processor switches context every cycle (3 ns
    cycle period) among as many as 128 distinct
    threads, thereby hiding up to 128 cycles (384 ns)
    of memory latency → 128 register sets

21
Tera processing element
22
Tera MTA
23
Block interleaving
  • Executes a single thread until it reaches a
    situation that triggers a context switch.
  • Typical switching event: instruction execution
    reaches a long-latency operation or a situation
    where a latency may arise (a switch-on-event
    sketch follows after this list).
  • Compared to the cycle-by-cycle interleaving
    technique, a smaller number of threads is needed
  • A single thread can execute at full speed until
    the next context switch.
  • Single thread performance is similar to the
    performance of a comparable processor without
    multithreading.
  • → IBM NorthStar processors are two-threaded
    64-bit PowerPCs with switch-on-cache-miss,
    implemented in departmental computers (eServers)
    of IBM since 10/98! (revealed at MTEAC-4, Dec.
    2000)
  • Recent announcement (Oct. 1999): Sun MAJC5200,
    two processors on a die, each processor a
    4-threaded block-interleaving VLIW
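
A minimal C sketch of switch-on-event block interleaving, assuming a
two-threaded core as in the NorthStar example; the event set and
data structures are illustrative only.

    enum sw_event { EV_NONE, EV_CACHE_MISS, EV_LONG_LATENCY_OP };

    struct hw_context { unsigned pc; int regs[32]; };

    static struct hw_context ctx[2];   /* two on-chip register sets */
    static int current;                /* the running context */

    /* Called after each executed instruction: an event that may
     * cause latency triggers the context switch. */
    static void after_execute(enum sw_event ev) {
        if (ev != EV_NONE)
            current ^= 1;   /* ~0-cycle switch: the other thread's
                               registers are already on chip */
    }

    /* Fetch continues from the active context. */
    static unsigned next_fetch_pc(void) { return ctx[current].pc; }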

24
Interleaving techniques
25
Rhamma
26
Komodo microcontroller
  • Develop a multithreaded embedded real-time
    Java microcontroller
  • Java processor core → bytecode as machine
    language, portability across all platforms;
    → dense machine code, important for embedded
    applications; → fast bytecode execution in
    hardware, microcode, and traps
  • Interrupts activate interrupt service threads
    (ISTs) instead of interrupt service routines
    (ISRs) → extremely fast context switch → no
    blocking of interrupt services
  • Switch-on-signal technique enhanced to very
    fine-grain switching due to hardware-implemented
    real-time scheduling algorithms (FPP, EDF, LLF,
    guaranteed percentage); an EDF sketch follows
    below
  • → hard real-time requirements fulfilled
  • For more information see
    http://goethe.ira.uka.de/jkreuzin/komodo/komodoEng.html
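
To make the real-time scheduling concrete, here is a hedged C sketch
of earliest-deadline-first (EDF) selection among interrupt service
threads; the slot count and structures are assumptions, not the
actual Komodo hardware.

    #define NSLOTS 4   /* assumed hardware thread slots */

    struct ist {       /* interrupt service thread */
        int  active;
        long deadline; /* absolute deadline, in cycles */
    };

    /* EDF: of all active threads, pick the one whose deadline is
     * nearest; returns -1 if no thread is active. */
    static int pick_edf(const struct ist t[NSLOTS]) {
        int best = -1;
        for (int i = 0; i < NSLOTS; i++)
            if (t[i].active &&
                (best < 0 || t[i].deadline < t[best].deadline))
                best = i;
        return best;
    }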

27
Komodo microcontroller
28
Nanothreading and microthreading: multithreading
in the same register set
  • Nanothreading (DanSoft processor) dismisses full
    multithreading for a nanothread that executes in
    the same register set as the main thread.
  • The nanothread requires only a 9-bit PC and some
    simple control logic, and it resides in the same
    page as the main thread.
  • Whenever the processor stalls on the main thread,
    it automatically begins fetching instructions
    from the nanothread.
  • The microthreading technique (Bolychevsky et al.
    1996) is similar to nanothreading.
  • All threads share the same register set and the
    same run-time stack. However, the number of
    threads is not restricted to two.

29
Simultaneous multithreading (SMT)
  • The SMT approach combines a wide superscalar
    instruction issue with the multithreading
    approach
  • by providing several register sets on the
    processor
  • and issuing instructions from several instruction
    queues simultaneously.
  • The issue slots of a wide-issue processor can be
    filled with operations from several threads.
  • Latencies occurring in the execution of single
    threads are bridged by issuing operations of the
    remaining threads loaded on the processor (see
    the issue sketch below).
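
A hedged C sketch of issue-slot filling, with assumed structures and
an assumed 8-threaded, 8-issue configuration: each cycle the issue
stage drains ready instructions from the per-thread queues, so a
stalled thread simply leaves its slots to the others.

    #define NTHREADS    8
    #define ISSUE_WIDTH 8

    struct issue_queue {
        int ready;   /* instructions with all operands available */
    };

    /* One issue cycle: fill up to ISSUE_WIDTH slots from the
     * per-thread instruction queues. */
    static int issue_cycle(struct issue_queue q[NTHREADS]) {
        int issued = 0;
        for (int t = 0; t < NTHREADS && issued < ISSUE_WIDTH; t++)
            while (q[t].ready > 0 && issued < ISSUE_WIDTH) {
                q[t].ready--;   /* issue one instruction of thread t */
                issued++;
            }
        return issued;   /* slots filled this cycle */
    }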

30
Simultaneous multithreading (SMT) - Hardware
organization (1)
  • SMT processors can be organized in two ways.
  • First: instructions of different threads share
    all buffer resources in an extended superscalar
    pipeline.
  • Thus SMT adds minimal hardware complexity to
    conventional superscalars;
  • hardware designers can focus on building a fast
    single-threaded superscalar and add multithread
    capability on top.
  • The complexity added to superscalars by
    multithreading comprises a thread tag for each
    internal instruction representation, multiple
    register sets, and the ability of the fetch and
    retire units to fetch and retire, respectively,
    instructions of different threads.

31
Simultaneous multithreading (SMT) - Hardware
organization (2)
  • Second: replicate all internal buffers of a
    superscalar such that each buffer is bound to a
    specific thread.
  • The issue unit is able to issue instructions of
    different instruction windows simultaneously to
    the FUs.
  • This adds more changes to the superscalar
    processor organization,
  • but leads to a natural partitioning of the
    instruction window (similar to a CMP)
  • and simplifies the issue and retire stages.

32
Simultaneous multithreading (SMT)
  • The SMT fetch unit can take advantage of the
    interthread competition for instruction bandwidth
    in two ways:
  • First, it can partition the fetch bandwidth among
    the threads and fetch from several threads each
    cycle. Goal: increasing the probability of
    fetching only non-speculative instructions.
  • Second, the fetch unit can be selective about
    which threads it fetches.
  • The main drawback of simultaneous multithreading
    may be that it complicates the instruction issue
    stage, which is always central to the multiple
    threads.
  • A functional partitioning as demanded for
    processors of the 10⁹-transistor era is therefore
    not easily reached.
  • No simultaneous multithreaded processors exist to
    date; only simulations.
  • General opinion: SMT will be in next-generation
    microprocessors.
  • Announcement (Oct. 1999): the Compaq Alpha 21464
    (EV8) will be a four-threaded SMT.

33
SMT at the Universities of Washington and San
Diego
  • Hypothetical out-of-order issue superscalar
    microprocessor that resembles the MIPS R10000 and
    HP PA-8000.
  • 8 threads and an 8-issue superscalar organization
    are assumed.
  • Eight instructions are decoded, renamed, and fed
    to either the integer or the floating-point
    instruction window.
  • Unified buffers are used.
  • When operands become available, up to 8
    instructions are issued out of order per cycle,
    executed, and retired.
  • Each thread can address 32 architectural integer
    (and floating-point) registers. These registers
    are renamed to a large physical register file of
    356 physical registers.

34
SMT at the Universities of Washington and San
Diego
35
SMT at the Universities of Washington and San
Diego - Instruction fetching schemes
  • Basic: round-robin RR.2.8 fetching scheme, i.e.,
    in each cycle, two times 8 instructions are
    fetched with a round-robin policy from two
    different threads;
  • superior to various other schemes like RR.1.8,
    RR.4.2, and RR.2.4.
  • Other fetch policies
  • BRCOUNT scheme gives highest priority to those
    threads that are least likely to be on a wrong
    path,
  • MISSCOUNT scheme gives priority to the threads
    that have the fewest outstanding D-cache misses
  • IQPOSN policy gives lowest priority to the oldest
    instructions by penalizing those threads with
    instructions closest to the head of either the
    integer or the floating-point queue
  • ICOUNT feedback technique gives highest fetch
    priority to the threads with the fewest
    instructions in the decode, renaming, and queue
    pipeline stages (a sketch follows below)
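
A hedged C sketch of the ICOUNT heuristic as described above; the
counters and the two-thread selection mirror the ICOUNT.2.8 scheme,
but all names and structures are assumptions for illustration.

    #define NTHREADS 8

    /* front_end[t] = number of thread t's instructions currently in
     * the decode, renaming, and queue stages. Selects the two
     * threads with the fewest such instructions to fetch from. */
    static void icount_pick2(const int front_end[NTHREADS], int pick[2]) {
        pick[0] = pick[1] = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (pick[0] < 0 || front_end[t] < front_end[pick[0]]) {
                pick[1] = pick[0];
                pick[0] = t;
            } else if (pick[1] < 0 || front_end[t] < front_end[pick[1]]) {
                pick[1] = t;
            }
        }
    }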

36
SMT at the Universities of Washington and San
Diego - Instruction fetching schemes
  • The ICOUNT policy proved superior!
  • The ICOUNT.2.8 fetching strategy reached an IPC
    of about 5.4 (RR.2.8 reached only about 4.2).
  • Most interesting is that neither mispredicted
    branches nor blocking due to cache misses alone,
    but a mix of both and perhaps some other effects,
    determines which fetching strategy is best.
  • Recently, simultaneous multithreading has been
    evaluated with
  • SPEC95,
  • database workloads,
  • and multimedia workloads,
  • all achieving roughly a 3-fold IPC increase
    with an eight-threaded SMT over a single-threaded
    superscalar with similar resources.

37
SMT processor with multimedia enhancement:
combining SMT and multimedia
  • Start with a wide-issue superscalar
    general-purpose processor
  • Enhance by simultaneous multithreading
  • Enhance by multimedia unit(s)
  • Utilization of subword parallelism
    (data-parallel instructions, SIMD)
  • Saturation arithmetic
  • Additional arithmetic, masking and selection,
    reordering and conversion instructions
  • Enhance by additional features useful for
    multimedia processing, e.g. on-chip RAM memory,
    special cache techniques

For more information see
http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.html
38
The SMT multimedia processor model
39
Maximum processor configuration: IPCs of
8-threaded 8-issue cases
  • Initial maximum configuration: 2.28
  • 16-entry reservation stations for thread, global,
    and local load/store units (instead of 256):
    2.96
  • one common 256-entry reservation station unit for
    all integer/multimedia units (instead of
    256-entry reservation stations each): 3.27
  • loads and stores may pass blocked loads/stores of
    other threads: 4.1
  • highest-priority-first, non-speculative-
    instruction-first, and non-saturated-first
    strategies for issue, dispatch, and retire
    stages: 4.34
  • 32-entry reorder buffer (instead of 256): 4.69
  • second local load/store unit (because of 20.1%
    local loads/stores): 6.07 (6.32 with dynamic
    branch prediction)

40
IPC of maximum processor
On-chip RAM and two local load/store units, 4 MB
I-cache, D-cache fill burst rate of 6-2-2-2
41
More realistic processor
D-cache fill burst rate of 32-4-4-4, issue
bandwidth 8
42
Speedup
Realistic vs. maximum processor: a threefold
speedup
43
IPC-Performance of SMT and CMP (1)
  • SPEC92 simulations: Tullsen et al. vs.
    Sigmund and Ungerer.

44
IPC-Performance of SMT and CMP (2)
  • SPEC95 simulations: Eggers et al.
  • CMP2: 2 processors, 4-issue superscalar, 2(1,4)
  • CMP4: 4 processors, 2-issue superscalar, 4(1,2)
  • SMT: 8-threaded, 8-issue superscalar, 1(8,8)

45
IPC-Performance of SMT and CMP
SPEC95 simulations. Performance is given
relative to a single 2-issue superscalar
processor as baseline (Hammond et al.).
46
Comments on the simulation results (Hammond et
al.)
  • A CMP (eight 2-issue processors) outperforms a
    12-issue superscalar and a 12-issue, 8-threaded
    SMT processor on four SPEC95 benchmark programs
    (hand-parallelized for CMP and SMT).
  • The CMP achieved higher performance than SMT due
    to a total of 16 issue slots instead of 12 issue
    slots for SMT.
  • Hammond et al. argue that the design complexity
    of 16-issue CMPs is similar to that of 12-issue
    superscalars or 12-issue SMT processors.

47
SMT vs. multiprocessor chip (Eggers et al.)
  • SMT obtained better speedups than the (CMP) chip
    multiprocessors - in contrast to the results of
    Hammond et al.!
  • Eggers et al. compared 8-issue, 8-threaded SMTs
    with four 2-issue CMPs; Hammond et al. compared
    12-issue, 8-threaded SMTs with eight 2-issue
    CMPs.
  • Eggers et al.:
  • Speedups on the CMP were hindered by the fixed
    partitioning of its hardware resources across
    the processors.
  • In the CMP, processors were idle when
    thread-level parallelism was insufficient.
  • Exploiting large amounts of instruction-level
    parallelism in the unrolled loops of individual
    threads was not possible due to the CMP
    processors' smaller issue bandwidth.
  • An SMT processor dynamically partitions its
    resources among threads, and can therefore
    respond well to variations in both types of
    parallelism, exploiting them interchangeably.

48
Conclusions
  • The performance race between SMT and CMP is not
    yet decided.
  • CMP is easier to implement, but only SMT has the
    ability to hide latencies.
  • A functional partitioning is not easily reached
    within an SMT processor due to the centralized
    instruction issue.
  • A separation of the thread queues is a possible
    solution, although it does not remove the central
    instruction issue.
  • A combination of simultaneous multithreading with
    CMP may be superior.
  • We favor a CMP consisting of moderately equipped
    (e.g., 4-threaded 4-issue superscalar) SMTs.
  • Future research: combine the SMT or CMP
    organization with the ability to create threads
    with compiler support or fully dynamically out of
    a single thread
  • thread-level speculation
  • close to multiscalar