Title: Future Processors to Use Coarse-Grain Parallelism
1. Chapter 6 - Future Processors to Use Coarse-Grain Parallelism
2. Future processors to use coarse-grain parallelism
- Chip multiprocessors (CMPs) or multiprocessor chips
  - integrate two or more complete processors on a single chip,
  - every functional unit of a processor is duplicated.
- Simultaneous multithreaded (SMT) processors
  - store multiple contexts in different register sets on the chip,
  - the functional units are multiplexed between the threads,
  - instructions of different contexts are executed simultaneously.
3. Principal chip multiprocessor alternatives
- Symmetric multiprocessor (SMP)
- Distributed shared memory multiprocessor (DSM)
- Message-passing shared-nothing multiprocessor
4. Organizational principles of multiprocessors
5. Typical SMP
6. Shared memory candidates for CMPs
- shared main memory and shared secondary cache
7. Shared memory candidates for CMPs
- shared primary cache
8. Grain levels for CMPs
- multiple processes in parallel
- multiple threads from a single application → implies a common address space for all threads
- extracting threads of control dynamically from a single instruction stream → see last chapter: multiscalar, trace processors, ...
9. Texas Instruments TMS320C80 Multimedia Video Processor
10. Hydra: a single-chip multiprocessor
11. Conclusions on CMP
- Usually, a CMP will feature
  - separate L1 I-cache and D-cache per on-chip CPU
  - and an optional unified L2 cache.
- If the CPUs always execute threads of the same process, the L2 cache organization is simplified, because different processes do not have to be distinguished.
- Recently announced commercial processors with CMP hardware:
  - IBM Power4 processor with two processors on a single die,
  - Sun MAJC5200: two processors on a die (each processor a 4-threaded block-interleaving VLIW).
12. Multithreaded processors
- Aim: latency tolerance
- What is the problem?
  - Load access latencies measured on an AlphaServer 4100 SMP with four 300 MHz Alpha 21164 processors are:
  - 7 cycles for a primary-cache miss that hits in the on-chip L2 cache of the 21164 processor,
  - 21 cycles for an L2-cache miss that hits in the L3 (board-level) cache,
  - 80 cycles for a miss that is served by the memory, and
  - 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
- Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.
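The latency numbers above invite a back-of-the-envelope model of why switching threads helps. The sketch below (all parameters and the round-robin round model are invented for illustration) computes what fraction of cycles a processor does useful work when each thread alternates between a compute burst and a memory wait:

```python
# Hypothetical model: each thread computes for `run_cycles`, then waits
# `miss_latency` cycles for memory. With several threads, one thread's
# wait can be overlapped with the run bursts of the others.

def utilization(run_cycles, miss_latency, threads=1, switch_overhead=0):
    """Fraction of cycles spent on useful work per scheduling round."""
    busy = run_cycles * threads  # useful work available per round
    # A round must at least cover one thread's run burst plus its wait.
    round_len = max(busy + switch_overhead * threads,
                    run_cycles + miss_latency)
    return busy / round_len

# A single thread hitting memory (80-cycle miss from the Alpha numbers):
print(utilization(10, 80, threads=1))   # low: ~0.11
# Nine threads cover the same 80-cycle latency completely:
print(utilization(10, 80, threads=9))   # 1.0
```

With one thread the pipeline idles for almost the whole miss; with enough threads the latency is fully hidden, which is the point the slide makes.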
13. Multithreaded processors
- Multithreading
  - provides several program counter registers (and usually several register sets) on chip,
  - fast context switching by switching to another thread of control.
14. Approaches to multithreaded processors
- Cycle-by-cycle interleaving
  - An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
- Block interleaving
  - The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
- Simultaneous multithreading
  - Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  - Combines wide superscalar instruction issue with multithreading.
15. Comparison of multithreading with non-multithreading approaches
- (a) single-threaded scalar
- (b) cycle-by-cycle interleaving multithreaded scalar
- (c) block interleaving multithreaded scalar
16. Comparison of multithreading with non-multithreading approaches
- (a) superscalar
- (b) VLIW
- (c) cycle-by-cycle interleaving
- (d) cycle-by-cycle interleaving VLIW
17. Comparison of multithreading with non-multithreading approaches
- simultaneous multithreading (SMT)
- chip multiprocessor (CMP)
18. Cycle-by-cycle interleaving
- The processor switches to a different thread after each instruction fetch.
- Pipeline hazards cannot arise, so the processor pipeline can be built easily, without the complex forwarding paths otherwise needed.
- Context-switching overhead is zero cycles.
- Memory latency is tolerated by not scheduling a thread until the memory transaction has completed.
- Requires at least as many threads as pipeline stages in the processor.
- Single-thread performance degrades if not enough threads are present.
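The fetch policy above can be sketched as a simple round-robin scheduler (the thread traces and instruction names are hypothetical). With enough ready threads, adjacent pipeline slots never hold instructions of the same thread, which is why the forwarding paths can be omitted:

```python
# Minimal sketch of cycle-by-cycle interleaving: each cycle, fetch one
# instruction from the next ready thread in round-robin order.

from collections import deque

def interleave(threads):
    """Return the fetch schedule as (thread_id, instruction) pairs."""
    queues = [deque(t) for t in threads]
    schedule = []
    while any(queues):
        for tid, q in enumerate(queues):
            if q:
                schedule.append((tid, q.popleft()))
    return schedule

print(interleave([["A0", "A1"], ["B0", "B1"], ["C0"]]))
# adjacent slots come from different threads while all threads are ready
```

When a thread runs out (here thread C), the remaining threads simply share the slots, mirroring the slide's point that too few threads degrade per-slot independence.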
19. Cycle-by-cycle interleaving - improving single-thread performance
- The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  - The scheduler feeds instructions of the same thread that are neither data- nor control-dependent successively into the pipeline.
- The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.
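One plausible reading of dependence look-ahead can be sketched as follows. The `(opcode, lookahead)` encoding is an assumption for illustration (not the actual Tera bit layout): `lookahead` states how many of the directly following instructions are independent of this one, so the scheduler may keep issuing from the same thread instead of switching after every instruction:

```python
# Hypothetical sketch of dependence look-ahead scheduling.

def schedule(thread):
    """Group a thread's instructions into runs that may issue back to back.

    `thread` is a list of (opcode, lookahead) pairs; `lookahead` is the
    number of following instructions independent of the current one.
    """
    issued, i = [], 0
    while i < len(thread):
        _, la = thread[i]
        run = [op for op, _ in thread[i : i + 1 + la]]
        issued.append(run)  # one run per turn of this thread
        i += 1 + la
    return issued

print(schedule([("ld r1", 2), ("add r2", 0), ("add r3", 0), ("mul r4", 0)]))
# the load and the two independent adds form one run; the dependent mul
# starts a new run, i.e. the thread would yield the pipeline in between
```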
20. Tera MTA
- cycle-by-cycle interleaving technique
- employs the dependence look-ahead technique
- VLIW ISA (3-issue)
- The processor switches context every cycle (3 ns cycle period) among as many as 128 distinct threads, thereby hiding up to 128 cycles (384 ns) of memory latency → 128 register sets.
21. Tera processing element
22. Tera MTA
23. Block interleaving
- Executes a single thread until it reaches a situation that triggers a context switch.
- Typical switching event: the instruction execution reaches a long-latency operation or a situation where a latency may arise.
- Compared to the cycle-by-cycle interleaving technique, a smaller number of threads is needed.
- A single thread can execute at full speed until the next context switch.
- Single-thread performance is similar to the performance of a comparable processor without multithreading.
- → IBM NorthStar processors are two-threaded 64-bit PowerPCs with switch-on-cache-miss, implemented in departmental computers (eServers) of IBM since 10/98! (revealed at MTEAC-4, Dec. 2000)
- Recent announcement (Oct. 1999): Sun MAJC5200, two processors on a die, each processor a 4-threaded block-interleaving VLIW.
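A switch-on-event policy of this kind can be sketched as follows (the instruction traces and the `MISS` marker are invented, and the sketch deliberately ignores how long the miss actually takes; it only shows the switching decision):

```python
# Sketch of block interleaving with a switch-on-cache-miss policy:
# a thread runs at full speed until a long-latency event ("MISS"),
# then the processor switches to the next thread with work left.

def run_block_interleaved(threads):
    """Return the execution trace as (thread_id, instruction) pairs."""
    trace, active = [], 0
    remaining = [list(t) for t in threads]
    while any(remaining):
        if not remaining[active]:
            active = (active + 1) % len(threads)  # skip finished threads
            continue
        instr = remaining[active].pop(0)
        trace.append((active, instr))
        if instr == "MISS":  # latency-causing event -> context switch
            active = (active + 1) % len(threads)
    return trace

print(run_block_interleaved([["a1", "MISS", "a2"], ["b1", "b2"]]))
```

Unlike the cycle-by-cycle sketch, thread 0 keeps the pipeline to itself until its miss, which is why single-thread performance stays close to a non-multithreaded processor.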
24. Interleaving techniques
25. Rhamma
26. Komodo microcontroller
- Goal: develop a multithreaded embedded real-time Java microcontroller.
- Java processor core
  - bytecode as machine language → portability across all platforms,
  - dense machine code, important for embedded applications,
  - fast bytecode execution in hardware, microcode, and traps.
- Interrupts activate interrupt service threads (ISTs) instead of interrupt service routines (ISRs)
  - → extremely fast context switch,
  - → no blocking of interrupt services.
- Switch-on-signal technique enhanced to very fine-grain switching due to hardware-implemented real-time scheduling algorithms (FPP, EDF, LLF, guaranteed percentage)
  - → hard real-time requirements fulfilled.
- For more information see
  - http://goethe.ira.uka.de/jkreuzin/komodo/komodoEng.html
27. Komodo microcontroller
28. Nanothreading and microthreading - multithreading in the same register set
- Nanothreading (DanSoft processor) dismisses full multithreading in favor of a nanothread that executes in the same register set as the main thread.
  - The nanothread needs only a 9-bit PC and some simple control logic, and it resides in the same page as the main thread.
  - Whenever the processor stalls on the main thread, it automatically begins fetching instructions from the nanothread.
- The microthreading technique (Bolychevsky et al. 1996) is similar to nanothreading.
  - All threads share the same register set and the same run-time stack. However, the number of threads is not restricted to two.
29. Simultaneous multithreading (SMT)
- The SMT approach combines wide superscalar instruction issue with multithreading
  - by providing several register sets on the processor
  - and issuing instructions from several instruction queues simultaneously.
- The issue slots of a wide-issue processor can be filled by operations of several threads.
- Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor.
30. Simultaneous multithreading (SMT) - hardware organization (1)
- SMT processors can be organized in two ways.
- First: instructions of different threads share all buffer resources in an extended superscalar pipeline.
  - Thus SMT adds minimal hardware complexity to conventional superscalars;
  - hardware designers can focus on building a fast single-threaded superscalar and add multithreading capability on top.
  - The complexity added to a superscalar by multithreading comprises a thread tag for each internal instruction representation, multiple register sets, and the ability of the fetch and retire units to fetch and retire, respectively, instructions of different threads.
31. Simultaneous multithreading (SMT) - hardware organization (2)
- Second: replicate all internal buffers of a superscalar such that each buffer is bound to a specific thread.
  - The issue unit is able to issue instructions of different instruction windows simultaneously to the FUs.
  - This requires more changes to the superscalar processor organization,
  - but leads to a natural partitioning of the instruction window (similar to CMP)
  - and simplifies the issue and retire stages.
32. Simultaneous multithreading (SMT)
- The SMT fetch unit can take advantage of the interthread competition for instruction bandwidth in two ways.
  - First, it can partition fetch bandwidth among the threads and fetch from several threads each cycle. Goal: increasing the probability of fetching only non-speculative instructions.
  - Second, the fetch unit can be selective about which threads it fetches.
- The main drawback of simultaneous multithreading may be that it complicates the instruction issue stage, which remains central to all threads.
- A functional partitioning as demanded for processors of the 10^9-transistor era is therefore not easily reached.
- No simultaneous multithreaded processors exist to date; only simulations.
- General opinion: SMT will be in next-generation microprocessors.
- Announcement (Oct. 1999): the Compaq Alpha 21464 (EV8) will be a four-threaded SMT.
33. SMT at the Universities of Washington and San Diego
- Hypothetical out-of-order issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000.
- 8 threads and an 8-issue superscalar organization are assumed.
- Eight instructions are decoded, renamed, and fed to either the integer or the floating-point instruction window.
- Unified buffers are used.
- When operands become available, up to 8 instructions are issued out of order per cycle, executed, and retired.
- Each thread can address 32 architectural integer (and floating-point) registers. These registers are renamed to a large physical register file of 356 physical registers.
34. SMT at the Universities of Washington and San Diego
35. SMT at the Universities of Washington and San Diego - instruction fetching schemes
- Basic: round-robin RR.2.8 fetching scheme, i.e., in each cycle, two times 8 instructions are fetched, round-robin, from two different threads;
  - superior to other schemes like RR.1.8, RR.4.2, and RR.2.4.
- Other fetch policies:
  - The BRCOUNT scheme gives highest priority to those threads that are least likely to be on a wrong path.
  - The MISSCOUNT scheme gives priority to the threads that have the fewest outstanding D-cache misses.
  - The IQPOSN policy gives lowest priority to the oldest instructions, by penalizing those threads with instructions closest to the head of either the integer or the floating-point queue.
  - The ICOUNT feedback technique gives highest fetch priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.
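The ICOUNT heuristic can be sketched in a few lines. The counter values below are invented for illustration; a real implementation would maintain per-thread counters updated at decode and issue:

```python
# Sketch of ICOUNT thread selection: each cycle, fetch from the threads
# with the fewest instructions in the decode/rename/queue stages.

def icount_pick(front_end_counts, n=2):
    """Pick the ids of the n least-clogged threads.

    `n=2` mirrors the ".2." in ICOUNT.2.8: fetch from 2 threads per
    cycle, up to 8 instructions each.
    """
    order = sorted(range(len(front_end_counts)),
                   key=lambda tid: front_end_counts[tid])
    return order[:n]

print(icount_pick([12, 3, 7, 9]))  # threads 1 and 2 are least clogged
```

The intuition: a thread with few instructions in the front end is moving instructions through quickly, so giving it fetch priority keeps the issue queues filled with useful work.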
36. SMT at the Universities of Washington and San Diego - instruction fetching schemes
- The ICOUNT policy proved superior!
- The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (RR.2.8 reached only about 4.2).
- Most interesting: neither mispredicted branches nor blocking due to cache misses alone, but a mix of both and perhaps some other effects, determined which fetching strategy was best.
- Recently, simultaneous multithreading has been evaluated with
  - SPEC95,
  - database workloads,
  - and multimedia workloads,
- all achieving roughly a 3-fold IPC increase with an eight-threaded SMT over a single-threaded superscalar with similar resources.
37. SMT processor with multimedia enhancement - combining SMT and multimedia
- Start with a wide-issue superscalar general-purpose processor.
- Enhance it by simultaneous multithreading.
- Enhance it by multimedia unit(s):
  - utilization of subword parallelism (data-parallel instructions, SIMD),
  - saturation arithmetic,
  - additional arithmetic, masking and selection, reordering and conversion instructions.
- Enhance it by additional features useful for multimedia processing, e.g. on-chip RAM memory, special cache techniques.
- For more information see http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.html
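Subword parallelism with saturation arithmetic, as used by such multimedia units, can be illustrated with a scalar sketch (a real SIMD instruction would process all lanes at once in hardware; the function name here is made up):

```python
# Illustration of a packed saturating add: a 32-bit register is treated
# as four unsigned 8-bit lanes, added lane by lane, with each result
# clamped at 255 instead of wrapping around.

def padd_usat8(a, b):
    """Packed unsigned saturating byte add of two 32-bit values."""
    out = 0
    for lane in range(4):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        s = min(x + y, 0xFF)        # saturate instead of wrapping
        out |= s << (8 * lane)
    return out

# The two high lanes would overflow, so they saturate to 0xFF:
print(hex(padd_usat8(0xF0F01010, 0x20202020)))  # 0xffff3030
```

Saturation matters for media data: a brightened pixel should clamp to white (255), not wrap around to a dark value as modular arithmetic would.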
38. The SMT multimedia processor model
39. Maximum processor configuration - IPCs of 8-threaded 8-issue cases
- Initial maximum configuration: 2.28
- 16-entry reservation stations for the thread, global, and local load/store units (instead of 256): 2.96
- One common 256-entry reservation station unit for all integer/multimedia units (instead of 256-entry reservation stations each): 3.27
- Loads and stores may pass blocked loads/stores of other threads: 4.1
- Highest-priority-first, non-speculative-instruction-first, non-saturated-first strategies for issue, dispatch, and retire stages: 4.34
- 32-entry reorder buffer (instead of 256): 4.69
- Second local load/store unit (because of 20.1% local load/stores): 6.07 (6.32 with dynamic branch prediction)
40. IPC of maximum processor
- on-chip RAM and two local load/store units, 4 MB I-cache, D-cache fill burst rate of 6:2:2:2
41. More realistic processor
- D-cache fill burst rate of 32:4:4:4, issue bandwidth 8
42. Speedup
- realistic processor
- maximum processor
- a threefold speedup
43. IPC performance of SMT and CMP (1)
- SPEC92 simulations (Tullsen et al. vs. Sigmund and Ungerer).
44. IPC performance of SMT and CMP (2)
- SPEC95 simulations (Eggers et al.):
  - CMP2: 2 processors, 4-issue superscalar, 2(1,4)
  - CMP4: 4 processors, 2-issue superscalar, 4(1,2)
  - SMT: 8-threaded, 8-issue superscalar, 1(8,8)
45. IPC performance of SMT and CMP
- SPEC95 simulations. Performance is given relative to a single 2-issue superscalar processor as the baseline (Hammond et al.).
46. Comments on the simulation results (Hammond et al.)
- CMP (eight 2-issue processors) outperforms a 12-issue superscalar and a 12-issue, 8-threaded SMT processor on four SPEC95 benchmark programs (parallelized by hand for CMP and SMT).
- The CMP achieved higher performance than SMT due to a total of 16 issue slots instead of the 12 issue slots of the SMT.
- Hammond et al. argue that the design complexity of 16-issue CMPs is similar to that of 12-issue superscalars or 12-issue SMT processors.
47. SMT vs. multiprocessor chip (Eggers et al.)
- SMT obtained better speedups than the chip multiprocessors (CMPs) - in contrast to the results of Hammond et al.!
  - Eggers et al. compared 8-issue, 8-threaded SMTs with four 2-issue CMPs.
  - Hammond et al. compared 12-issue, 8-threaded SMTs with eight 2-issue CMPs.
- Eggers et al.:
  - Speedups on the CMP were hindered by the fixed partitioning of its hardware resources across the processors.
  - In the CMP, processors were idle when thread-level parallelism was insufficient.
  - Exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads was not possible due to the smaller issue bandwidth of the CMP's processors.
  - An SMT processor dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably.
48. Conclusions
- The performance race between SMT and CMP is not yet decided.
- CMP is easier to implement, but only SMT has the ability to hide latencies.
- A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
  - A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
- A combination of simultaneous multithreading with the CMP may be superior.
  - We favor a CMP consisting of moderately equipped (e.g., 4-threaded 4-issue superscalar) SMTs.
- Future research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread
  - thread-level speculation,
  - close to multiscalar.