Title: Future Processors to Use Coarse-Grain Parallelism
1. Chapter 6 - Future Processors to Use Coarse-Grain Parallelism
2. Future processors to use coarse-grain parallelism
- Chip multiprocessors (CMPs) or multiprocessor chips
  - integrate two or more complete processors on a single chip,
  - every functional unit of a processor is duplicated.
- Simultaneous multithreaded (SMT) processors
  - store multiple contexts in different register sets on the chip,
  - the functional units are multiplexed between the threads,
  - instructions of different contexts are executed simultaneously.
3. Principal chip multiprocessor alternatives
- Symmetric multiprocessor (SMP)
- Distributed shared memory multiprocessor (DSM)
- Message-passing shared-nothing multiprocessor
4. Organizational principles of multiprocessors
5. Typical SMP
6. Shared memory candidates for CMPs
- shared main memory and shared secondary cache
7. Shared memory candidates for CMPs
- shared primary cache
8. Grain levels for CMPs
- multiple processes in parallel
- multiple threads from a single application → implies a common address space for all threads
- extracting threads of control dynamically from a single instruction stream → see last chapter: multiscalar, trace processors, ...
9. Texas Instruments TMS320C80 Multimedia Video Processor
10. Hydra: a single-chip multiprocessor
11. Conclusions on CMP
- Usually, a CMP will feature
  - separate L1 I-cache and D-cache per on-chip CPU
  - and an optional unified L2 cache.
- If the CPUs always execute threads of the same process, the L2 cache organization is simplified, because different processes do not have to be distinguished.
- Recently announced commercial processors with CMP hardware:
  - IBM Power4 processor with two processors on a single die,
  - Sun MAJC5200: two processors on a die (each processor a 4-threaded block-interleaving VLIW).
12. Multithreaded processors
- Aim: latency tolerance
- What is the problem?
  - Load access latencies measured on an AlphaServer 4100 SMP with four 300 MHz Alpha 21164 processors are:
  - 7 cycles for a primary-cache miss that hits in the on-chip L2 cache of the 21164 processor,
  - 21 cycles for an L2-cache miss that hits in the L3 (board-level) cache,
  - 80 cycles for a miss that is served by the memory, and
  - 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
- Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.
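The latency numbers above invite a back-of-the-envelope model of why switching threads helps. The sketch below (all parameters and the round-robin round model are invented for illustration) computes what fraction of cycles a processor does useful work when each thread alternates between a compute burst and a memory wait:

```python
# Hypothetical model: each thread computes for `run_cycles`, then waits
# `miss_latency` cycles for memory. With several threads, one thread's
# wait can be overlapped with the run bursts of the others.

def utilization(run_cycles, miss_latency, threads=1, switch_overhead=0):
    """Fraction of cycles spent on useful work per scheduling round."""
    busy = run_cycles * threads  # useful work available per round
    # A round must at least cover one thread's run burst plus its wait.
    round_len = max(busy + switch_overhead * threads,
                    run_cycles + miss_latency)
    return busy / round_len

# A single thread hitting memory (80-cycle miss from the Alpha numbers):
print(utilization(10, 80, threads=1))   # low: ~0.11
# Nine threads cover the same 80-cycle latency completely:
print(utilization(10, 80, threads=9))   # 1.0
```

With one thread the pipeline idles for almost the whole miss; with enough threads the latency is fully hidden, which is the point the slide makes.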
13. Multithreaded processors
- Multithreading
  - provides several program counter registers (and usually several register sets) on chip,
  - fast context switching by switching to another thread of control.
14. Approaches to multithreaded processors
- Cycle-by-cycle interleaving
  - An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
- Block interleaving
  - The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
- Simultaneous multithreading
  - Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  - Combines wide superscalar instruction issue with multithreading.
15. Comparison of multithreading with non-multithreading approaches
- (a) single-threaded scalar
- (b) cycle-by-cycle interleaving multithreaded scalar
- (c) block interleaving multithreaded scalar
16. Comparison of multithreading with non-multithreading approaches
- (a) superscalar
- (b) VLIW
- (c) cycle-by-cycle interleaving
- (d) cycle-by-cycle interleaving VLIW
17. Comparison of multithreading with non-multithreading approaches
- simultaneous multithreading (SMT)
- chip multiprocessor (CMP)
18. Cycle-by-cycle interleaving
- The processor switches to a different thread after each instruction fetch.
- Pipeline hazards cannot arise, so the processor pipeline can be built easily, without the complex forwarding paths otherwise needed.
- Context-switching overhead is zero cycles.
- Memory latency is tolerated by not scheduling a thread until the memory transaction has completed.
- Requires at least as many threads as pipeline stages in the processor.
- Single-thread performance degrades if not enough threads are present.
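The fetch policy above can be sketched as a simple round-robin scheduler (the thread traces and instruction names are hypothetical). With enough ready threads, adjacent pipeline slots never hold instructions of the same thread, which is why the forwarding paths can be omitted:

```python
# Minimal sketch of cycle-by-cycle interleaving: each cycle, fetch one
# instruction from the next ready thread in round-robin order.

from collections import deque

def interleave(threads):
    """Return the fetch schedule as (thread_id, instruction) pairs."""
    queues = [deque(t) for t in threads]
    schedule = []
    while any(queues):
        for tid, q in enumerate(queues):
            if q:
                schedule.append((tid, q.popleft()))
    return schedule

print(interleave([["A0", "A1"], ["B0", "B1"], ["C0"]]))
# adjacent slots come from different threads while all threads are ready
```

When a thread runs out (here thread C), the remaining threads simply share the slots, mirroring the slide's point that too few threads degrade per-slot independence.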
19. Cycle-by-cycle interleaving - improving single-thread performance
- The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  - The scheduler feeds instructions of the same thread that are neither data- nor control-dependent successively into the pipeline.
- The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.
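One plausible reading of dependence look-ahead can be sketched as follows. The `(opcode, lookahead)` encoding is an assumption for illustration (not the actual Tera bit layout): `lookahead` states how many of the directly following instructions are independent of this one, so the scheduler may keep issuing from the same thread instead of switching after every instruction:

```python
# Hypothetical sketch of dependence look-ahead scheduling.

def schedule(thread):
    """Group a thread's instructions into runs that may issue back to back.

    `thread` is a list of (opcode, lookahead) pairs; `lookahead` is the
    number of following instructions independent of the current one.
    """
    issued, i = [], 0
    while i < len(thread):
        _, la = thread[i]
        run = [op for op, _ in thread[i : i + 1 + la]]
        issued.append(run)  # one run per turn of this thread
        i += 1 + la
    return issued

print(schedule([("ld r1", 2), ("add r2", 0), ("add r3", 0), ("mul r4", 0)]))
# the load and the two independent adds form one run; the dependent mul
# starts a new run, i.e. the thread would yield the pipeline in between
```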
20. Tera MTA
- cycle-by-cycle interleaving technique
- employs the dependence look-ahead technique
- VLIW ISA (3-issue)
- The processor switches context every cycle (3 ns cycle period) among as many as 128 distinct threads, thereby hiding up to 128 cycles (384 ns) of memory latency → 128 register sets.
21. Tera processing element
22. Tera MTA
23. Block interleaving
- Executes a single thread until it reaches a situation that triggers a context switch.
- Typical switching event: the instruction execution reaches a long-latency operation or a situation where a latency may arise.
- Compared to the cycle-by-cycle interleaving technique, a smaller number of threads is needed.
- A single thread can execute at full speed until the next context switch.
- Single-thread performance is similar to the performance of a comparable processor without multithreading.
- → IBM NorthStar processors are two-threaded 64-bit PowerPCs with switch-on-cache-miss, implemented in departmental computers (eServers) of IBM since 10/98! (revealed at MTEAC-4, Dec. 2000)
- Recent announcement (Oct. 1999): Sun MAJC5200, two processors on a die, each processor a 4-threaded block-interleaving VLIW.
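A switch-on-event policy of this kind can be sketched as follows (the instruction traces and the `MISS` marker are invented, and the sketch deliberately ignores how long the miss actually takes; it only shows the switching decision):

```python
# Sketch of block interleaving with a switch-on-cache-miss policy:
# a thread runs at full speed until a long-latency event ("MISS"),
# then the processor switches to the next thread with work left.

def run_block_interleaved(threads):
    """Return the execution trace as (thread_id, instruction) pairs."""
    trace, active = [], 0
    remaining = [list(t) for t in threads]
    while any(remaining):
        if not remaining[active]:
            active = (active + 1) % len(threads)  # skip finished threads
            continue
        instr = remaining[active].pop(0)
        trace.append((active, instr))
        if instr == "MISS":  # latency-causing event -> context switch
            active = (active + 1) % len(threads)
    return trace

print(run_block_interleaved([["a1", "MISS", "a2"], ["b1", "b2"]]))
```

Unlike the cycle-by-cycle sketch, thread 0 keeps the pipeline to itself until its miss, which is why single-thread performance stays close to a non-multithreaded processor.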
24. Interleaving techniques
25. Rhamma
26. Komodo microcontroller
- Goal: develop a multithreaded embedded real-time Java microcontroller.
- Java processor core
  - bytecode as machine language → portability across all platforms,
  - dense machine code, important for embedded applications,
  - fast bytecode execution in hardware, microcode, and traps.
- Interrupts activate interrupt service threads (ISTs) instead of interrupt service routines (ISRs)
  - → extremely fast context switch,
  - → no blocking of interrupt services.
- Switch-on-signal technique enhanced to very fine-grain switching due to hardware-implemented real-time scheduling algorithms (FPP, EDF, LLF, guaranteed percentage)
  - → hard real-time requirements fulfilled.
- For more information see
  - http://goethe.ira.uka.de/jkreuzin/komodo/komodoEng.html
27. Komodo microcontroller
28. Nanothreading and microthreading - multithreading in the same register set
- Nanothreading (DanSoft processor) dismisses full multithreading in favor of a nanothread that executes in the same register set as the main thread.
  - The nanothread needs only a 9-bit PC and some simple control logic, and it resides in the same page as the main thread.
  - Whenever the processor stalls on the main thread, it automatically begins fetching instructions from the nanothread.
- The microthreading technique (Bolychevsky et al. 1996) is similar to nanothreading.
  - All threads share the same register set and the same run-time stack. However, the number of threads is not restricted to two.
29. Simultaneous multithreading (SMT)
- The SMT approach combines wide superscalar instruction issue with multithreading
  - by providing several register sets on the processor
  - and issuing instructions from several instruction queues simultaneously.
- The issue slots of a wide-issue processor can be filled by operations of several threads.
- Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor.
30. Simultaneous multithreading (SMT) - hardware organization (1)
- SMT processors can be organized in two ways.
- First: instructions of different threads share all buffer resources in an extended superscalar pipeline.
  - Thus SMT adds minimal hardware complexity to conventional superscalars;
  - hardware designers can focus on building a fast single-threaded superscalar and add multithreading capability on top.
  - The complexity added to a superscalar by multithreading comprises a thread tag for each internal instruction representation, multiple register sets, and the ability of the fetch and retire units to fetch and retire, respectively, instructions of different threads.
31. Simultaneous multithreading (SMT) - hardware organization (2)
- Second: replicate all internal buffers of a superscalar such that each buffer is bound to a specific thread.
  - The issue unit is able to issue instructions of different instruction windows simultaneously to the FUs.
  - This requires more changes to the superscalar processor organization,
  - but leads to a natural partitioning of the instruction window (similar to CMP)
  - and simplifies the issue and retire stages.
32. Simultaneous multithreading (SMT)
- The SMT fetch unit can take advantage of the interthread competition for instruction bandwidth in two ways.
  - First, it can partition fetch bandwidth among the threads and fetch from several threads each cycle. Goal: increasing the probability of fetching only non-speculative instructions.
  - Second, the fetch unit can be selective about which threads it fetches.
- The main drawback of simultaneous multithreading may be that it complicates the instruction issue stage, which remains central to all threads.
- A functional partitioning as demanded for processors of the 10^9-transistor era is therefore not easily reached.
- No simultaneous multithreaded processors exist to date; only simulations.
- General opinion: SMT will be in next-generation microprocessors.
- Announcement (Oct. 1999): the Compaq Alpha 21464 (EV8) will be a four-threaded SMT.
33. SMT at the Universities of Washington and San Diego
- Hypothetical out-of-order issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000.
- 8 threads and an 8-issue superscalar organization are assumed.
- Eight instructions are decoded, renamed, and fed to either the integer or the floating-point instruction window.
- Unified buffers are used.
- When operands become available, up to 8 instructions are issued out of order per cycle, executed, and retired.
- Each thread can address 32 architectural integer (and floating-point) registers. These registers are renamed to a large physical register file of 356 physical registers.
34. SMT at the Universities of Washington and San Diego
35. SMT at the Universities of Washington and San Diego - instruction fetching schemes
- Basic: round-robin RR.2.8 fetching scheme, i.e., in each cycle, two times 8 instructions are fetched, round-robin, from two different threads;
  - superior to other schemes like RR.1.8, RR.4.2, and RR.2.4.
- Other fetch policies:
  - The BRCOUNT scheme gives highest priority to those threads that are least likely to be on a wrong path.
  - The MISSCOUNT scheme gives priority to the threads that have the fewest outstanding D-cache misses.
  - The IQPOSN policy gives lowest priority to the oldest instructions, by penalizing those threads with instructions closest to the head of either the integer or the floating-point queue.
  - The ICOUNT feedback technique gives highest fetch priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.
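The ICOUNT heuristic can be sketched in a few lines. The counter values below are invented for illustration; a real implementation would maintain per-thread counters updated at decode and issue:

```python
# Sketch of ICOUNT thread selection: each cycle, fetch from the threads
# with the fewest instructions in the decode/rename/queue stages.

def icount_pick(front_end_counts, n=2):
    """Pick the ids of the n least-clogged threads.

    `n=2` mirrors the ".2." in ICOUNT.2.8: fetch from 2 threads per
    cycle, up to 8 instructions each.
    """
    order = sorted(range(len(front_end_counts)),
                   key=lambda tid: front_end_counts[tid])
    return order[:n]

print(icount_pick([12, 3, 7, 9]))  # threads 1 and 2 are least clogged
```

The intuition: a thread with few instructions in the front end is moving instructions through quickly, so giving it fetch priority keeps the issue queues filled with useful work.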
36. SMT at the Universities of Washington and San Diego - instruction fetching schemes
- The ICOUNT policy proved superior!
- The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (RR.2.8 reached only about 4.2).
- Most interesting: neither mispredicted branches nor blocking due to cache misses alone, but a mix of both and perhaps some other effects, determined which fetching strategy was best.
- Recently, simultaneous multithreading has been evaluated with
  - SPEC95,
  - database workloads,
  - and multimedia workloads,
- all achieving roughly a 3-fold IPC increase with an eight-threaded SMT over a single-threaded superscalar with similar resources.
37. SMT processor with multimedia enhancement - combining SMT and multimedia
- Start with a wide-issue superscalar general-purpose processor.
- Enhance it by simultaneous multithreading.
- Enhance it by multimedia unit(s):
  - utilization of subword parallelism (data-parallel instructions, SIMD),
  - saturation arithmetic,
  - additional arithmetic, masking and selection, reordering and conversion instructions.
- Enhance it by additional features useful for multimedia processing, e.g. on-chip RAM memory, special cache techniques.
- For more information see http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.html
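Subword parallelism with saturation arithmetic, as used by such multimedia units, can be illustrated with a scalar sketch (a real SIMD instruction would process all lanes at once in hardware; the function name here is made up):

```python
# Illustration of a packed saturating add: a 32-bit register is treated
# as four unsigned 8-bit lanes, added lane by lane, with each result
# clamped at 255 instead of wrapping around.

def padd_usat8(a, b):
    """Packed unsigned saturating byte add of two 32-bit values."""
    out = 0
    for lane in range(4):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        s = min(x + y, 0xFF)        # saturate instead of wrapping
        out |= s << (8 * lane)
    return out

# The two high lanes would overflow, so they saturate to 0xFF:
print(hex(padd_usat8(0xF0F01010, 0x20202020)))  # 0xffff3030
```

Saturation matters for media data: a brightened pixel should clamp to white (255), not wrap around to a dark value as modular arithmetic would.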
38. The SMT multimedia processor model
39. Maximum processor configuration - IPCs of 8-threaded 8-issue cases
- Initial maximum configuration: 2.28
- 16-entry reservation stations for the thread, global, and local load/store units (instead of 256): 2.96
- One common 256-entry reservation station unit for all integer/multimedia units (instead of 256-entry reservation stations each): 3.27
- Loads and stores may pass blocked loads/stores of other threads: 4.1
- Highest-priority-first, non-speculative-instruction-first, non-saturated-first strategies for issue, dispatch, and retire stages: 4.34
- 32-entry reorder buffer (instead of 256): 4.69
- Second local load/store unit (because of 20.1% local load/stores): 6.07 (6.32 with dynamic branch prediction)
40. IPC of maximum processor
- on-chip RAM and two local load/store units, 4 MB I-cache, D-cache fill burst rate of 6:2:2:2
41. More realistic processor
- D-cache fill burst rate of 32:4:4:4, issue bandwidth 8
42. Speedup
- realistic processor
- maximum processor
- a threefold speedup
43. IPC performance of SMT and CMP (1)
- SPEC92 simulations (Tullsen et al. vs. Sigmund and Ungerer).
44. IPC performance of SMT and CMP (2)
- SPEC95 simulations (Eggers et al.):
  - CMP2: 2 processors, 4-issue superscalar, 2(1,4)
  - CMP4: 4 processors, 2-issue superscalar, 4(1,2)
  - SMT: 8-threaded, 8-issue superscalar, 1(8,8)
45. IPC performance of SMT and CMP
- SPEC95 simulations. Performance is given relative to a single 2-issue superscalar processor as the baseline (Hammond et al.).
46. Comments on the simulation results (Hammond et al.)
- CMP (eight 2-issue processors) outperforms a 12-issue superscalar and a 12-issue, 8-threaded SMT processor on four SPEC95 benchmark programs (parallelized by hand for CMP and SMT).
- The CMP achieved higher performance than SMT due to a total of 16 issue slots instead of the 12 issue slots of the SMT.
- Hammond et al. argue that the design complexity of 16-issue CMPs is similar to that of 12-issue superscalars or 12-issue SMT processors.
47. SMT vs. multiprocessor chip (Eggers et al.)
- SMT obtained better speedups than the chip multiprocessors (CMPs) - in contrast to the results of Hammond et al.!
  - Eggers et al. compared 8-issue, 8-threaded SMTs with four 2-issue CMPs.
  - Hammond et al. compared 12-issue, 8-threaded SMTs with eight 2-issue CMPs.
- Eggers et al.:
  - Speedups on the CMP were hindered by the fixed partitioning of its hardware resources across the processors.
  - In the CMP, processors were idle when thread-level parallelism was insufficient.
  - Exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads was not possible due to the smaller issue bandwidth of the CMP's processors.
  - An SMT processor dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably.
48. Conclusions
- The performance race between SMT and CMP is not yet decided.
- CMP is easier to implement, but only SMT has the ability to hide latencies.
- A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
  - A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
- A combination of simultaneous multithreading with the CMP may be superior.
  - We favor a CMP consisting of moderately equipped (e.g., 4-threaded 4-issue superscalar) SMTs.
- Future research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread
  - thread-level speculation,
  - close to multiscalar.