1
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
  • Electrical and Computer Engineering, University of Alabama in Huntsville
  • Aleksandar Milenkovic, milenka@ece.uah.edu
  • http://www.ece.uah.edu/milenka

2
Outline
  • Trends in microarchitecture
  • Exploiting thread-level parallelism
  • Exploiting TLP within a processor
  • Resource sharing
  • Performance implications
  • Design challenges
  • Intel's Hyper-Threading (HT) technology

3
Trends in microarchitecture
  • Higher clock speeds
  • To achieve a high clock frequency, make the pipeline deeper (superpipelining)
  • Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
  • ILP: Instruction-Level Parallelism
  • Extract parallelism within a single program
  • Superscalar processors have multiple execution units working in parallel
  • Challenge: finding enough instructions that can be executed concurrently
  • Out-of-order execution → instructions are sent to execution units based on instruction dependencies rather than program order (see the sketch after this list)
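
To make the out-of-order idea concrete, here is a minimal C++ sketch (my illustration, not from the slides) of dependence-driven issue: an instruction issues as soon as its inputs are ready, regardless of program order.

    #include <cstdio>
    #include <vector>

    // Toy dataflow scheduler: issue any instruction whose inputs are ready,
    // regardless of program order. i2 is independent, so it issues with i0.
    struct Instr { const char* text; std::vector<int> deps; bool done = false; };

    int main() {
        std::vector<Instr> prog = {
            {"i0: a = x + y", {}},
            {"i1: b = a * 2", {0}},  // depends on i0
            {"i2: c = p + q", {}},   // independent of i0 and i1
        };
        for (int cycle = 0; cycle < 2; ++cycle) {
            std::vector<int> ready;                      // collect first, then
            for (int n = 0; n < (int)prog.size(); ++n) { // issue, so results are
                if (prog[n].done) continue;              // visible next cycle
                bool ok = true;
                for (int d : prog[n].deps) ok = ok && prog[d].done;
                if (ok) ready.push_back(n);
            }
            std::printf("cycle %d:", cycle);
            for (int n : ready) { std::printf("  [%s]", prog[n].text); prog[n].done = true; }
            std::printf("\n");  // cycle 0: i0, i2;  cycle 1: i1
        }
    }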

4
Trends in microarchitecture
  • Cache hierarchies
  • Processor-memory speed gap
  • Use caches to reduce memory latency
  • Multiple levels of caches: smaller and faster closer to the processor core
  • Thread-Level Parallelism
  • Multiple programs execute concurrently (see the sketch after this list)
  • Web servers have an abundance of software threads
  • Users surfing the web, listening to music, encoding/decoding video streams, etc.
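
As a concrete example of software TLP (my sketch; the thread count and workload are made up), several independent threads like these give an SMT or CMP processor ready-made parallelism:

    #include <iostream>
    #include <thread>
    #include <vector>

    // Each thread is an independent stream of work (think: one request
    // handler or one media stream); the OS schedules them on (logical)
    // processors, and they need not synchronize with each other.
    void worker(int id) {
        long sum = 0;
        for (int i = 0; i < 1000000; ++i) sum += i % 7;  // stand-in for real work
        std::cout << "thread " << id << " done (" << sum << ")\n";
    }

    int main() {
        std::vector<std::thread> pool;
        for (int id = 0; id < 4; ++id)
            pool.emplace_back(worker, id);  // four concurrent software threads
        for (auto& t : pool) t.join();
    }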

5
Exploiting thread-level parallelism
  • CMP: Chip Multiprocessing
  • Multiple processors, each with a full set of architectural resources, reside on the same die
  • Processors may share an on-chip cache, or each can have its own cache
  • Examples: HP Mako, IBM Power4
  • Challenges: power, die area (cost)
  • Time-slice multithreading
  • Processor switches between software threads after a predefined time slice
  • Can minimize the effects of long-lasting events
  • Still, some execution slots are wasted

6
Multithreading Within a Processor
  • Until now, we have executed multiple threads of an application on different processors; can multiple threads execute concurrently on the same processor?
  • Why is this desirable?
  • Inexpensive: one CPU, no external interconnects
  • No remote or coherence misses (but more capacity misses)
  • Why does this make sense?
  • Most processors can't find enough work: peak IPC is 6, average IPC is 1.5!
  • Threads can share resources → we can increase the number of threads without a corresponding linear increase in area

7
What Resources Are Shared?
  • Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
  • For correctness, each thread needs its own PC and its own logical registers (and its own mapping from logical to physical registers)
  • For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing → better utilization of resources
  • Each additional thread costs only a PC, a rename table, and ROB space: cheap!

8
Approaches to Multithreading Within a Processor
  • Fine-grained multithreading: switches threads on every clock cycle
  • Pro: hides the latency of both short and long stalls
  • Con: slows down execution of individual threads that are ready to go
  • Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 cache misses)
  • Pros: no switching every clock cycle, no slowdown for ready-to-go threads
  • Con: limited ability to hide shorter stalls
  • Simultaneous multithreading (SMT): exploits TLP at the same time it exploits ILP

9
How Are Resources Shared?
[Figure: issue-slot diagrams for a superscalar processor, coarse-grained multithreading, fine-grained multithreading, and simultaneous multithreading; colors mark Threads 1-4 and idle slots. Each box represents an issue slot for a functional unit; peak throughput is 4 IPC.]
  • A superscalar processor has high under-utilization: there is not enough work every cycle, especially when there is a cache miss
  • Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot find maximum work every cycle, but cache misses can be tolerated
  • Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot (see the toy model after this list)
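
A toy model (my own sketch; the ready-instruction distribution is made up) of why SMT fills issue slots best: superscalar can only draw on one thread, fine-grained multithreading rotates threads, and SMT draws on all of them each cycle. The model omits stalls, so it understates fine-grained multithreading's real advantage of tolerating cache misses.

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>

    const int THREADS = 4, SLOTS = 4, CYCLES = 100000;

    // Instructions a thread has ready this cycle (0..SLOTS, made-up model).
    int ready() { return std::rand() % (SLOTS + 1); }

    int main() {
        std::srand(42);
        long ss = 0, fg = 0, smt = 0;
        for (int c = 0; c < CYCLES; ++c) {
            int r[THREADS];
            for (int t = 0; t < THREADS; ++t) r[t] = ready();
            ss += std::min(r[0], SLOTS);            // superscalar: one thread only
            fg += std::min(r[c % THREADS], SLOTS);  // fine-grained: one thread/cycle
            int free_slots = SLOTS;                 // SMT: fill slots from any thread
            for (int t = 0; t < THREADS && free_slots > 0; ++t) {
                int take = std::min(r[t], free_slots);
                smt += take;
                free_slots -= take;
            }
        }
        std::printf("avg issue/cycle: superscalar %.2f, fine-grained %.2f, SMT %.2f\n",
                    ss / (double)CYCLES, fg / (double)CYCLES, smt / (double)CYCLES);
    }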

10
Resource Sharing
[Figure: two threads flow through separate instruction fetch and instruction rename stages into a shared issue queue, register file, and functional units (FUs); the renaming is reconstructed in the sketch below.]
Thread-1: R1 ← R1 + R2; R3 ← R1 + R4; R5 ← R1 + R3
  renamed: P73 ← P1 + P2; P74 ← P73 + P4; P75 ← P73 + P74
Thread-2: R2 ← R1 + R2; R5 ← R1 + R2; R3 ← R5 + R3
  renamed: P76 ← P33 + P34; P77 ← P33 + P76; P78 ← P77 + P35
Both renamed streams share the issue queue: P73 ← P1 + P2; P74 ← P73 + P4; P75 ← P73 + P74; P76 ← P33 + P34; P77 ← P33 + P76; P78 ← P77 + P35
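
A minimal sketch of the renaming above (my reconstruction; register numbers follow the slide's example): each thread keeps its own map from architectural to physical registers, while physical registers come from one shared pool.

    #include <cstdio>

    const int ARCH_REGS = 32;

    // Per-thread map from architectural to physical registers.
    struct RenameTable { int map[ARCH_REGS]; };

    int next_phys = 0;  // shared free list, simplified to a counter

    // Rename "dst <- src1 + src2" and print the physical-register form.
    void rename(RenameTable& rt, int dst, int s1, int s2) {
        int p1 = rt.map[s1], p2 = rt.map[s2];
        rt.map[dst] = next_phys++;               // allocate a fresh phys reg
        std::printf("P%d <- P%d + P%d\n", rt.map[dst], p1, p2);
    }

    int main() {
        RenameTable t1, t2;
        for (int r = 0; r < ARCH_REGS; ++r) { t1.map[r] = r; t2.map[r] = 32 + r; }
        next_phys = 73;
        // Thread-1: R1<-R1+R2, R3<-R1+R4, R5<-R1+R3 (yields P73, P74, P75)
        rename(t1, 1, 1, 2); rename(t1, 3, 1, 4); rename(t1, 5, 1, 3);
        // Thread-2: R2<-R1+R2, R5<-R1+R2, R3<-R5+R3 (yields P76, P77, P78)
        rename(t2, 2, 1, 2); rename(t2, 5, 1, 2); rename(t2, 3, 5, 3);
    }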
11
Performance Implications of SMT
  • Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread
  • While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources (see the sketch after this list)
  • With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
  • The Alpha 21464 and Intel Pentium 4 are examples of SMT designs
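
A sketch of the ICOUNT fetch heuristic (due to Tullsen et al.; the occupancy numbers here are made up): each cycle, fetch from the thread with the fewest instructions in flight, which tends to equalize the threads' shares of pipeline resources.

    #include <cstdio>

    const int THREADS = 4;
    int in_flight[THREADS] = {12, 3, 7, 9};  // made-up occupancy counts

    // Pick the thread with the fewest in-flight instructions.
    int pick_fetch_thread() {
        int best = 0;
        for (int t = 1; t < THREADS; ++t)
            if (in_flight[t] < in_flight[best]) best = t;
        return best;  // lowest ICOUNT wins the fetch slot
    }

    int main() {
        std::printf("fetch from thread %d\n", pick_fetch_thread());  // thread 1
    }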

12
Design Challenges
  • How many threads?
  • Many, to find enough parallelism
  • However, mixing many threads will compromise the execution of individual threads
  • Processor front-end (instruction fetch)
  • Fetch as far as possible in a single thread (to maximize thread performance)
  • However, this limits the number of instructions available for scheduling from other threads
  • Larger register files (multiple contexts)
  • Minimizing clock cycle time
  • Cache conflicts

13
Pentium 4: Hyper-Threading architecture
  • One physical processor appears as multiple logical processors
  • The HT implementation on the NetBurst microarchitecture has 2 logical processors
[Figure: two copies of the architectural state sit on top of one set of shared processor execution resources.]
  • Architectural state:
  • general-purpose registers
  • control registers
  • APIC: advanced programmable interrupt controller
14
Pentium 4: Hyper-Threading architecture
  • Main processor resources are shared:
  • caches, branch predictors, execution units, buses, control logic
  • Duplicated resources (a simplified sketch follows this list):
  • register alias tables (map the architectural registers to physical rename registers)
  • next-instruction pointer and associated control logic
  • return stack pointer
  • instruction streaming buffers and trace-cache fill buffers
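
A rough sketch (my illustration, heavily simplified) of the duplicated-vs-shared split: each logical processor owns a private copy of the architectural state, while one set of execution resources serves both.

    // Duplicated per logical processor: the architectural state.
    struct ArchState {
        unsigned long long gpr[8];   // general-purpose registers (8 in IA-32)
        unsigned long long control;  // control registers, collapsed to one word
        unsigned apic_id;            // local APIC state, simplified
        unsigned long long next_ip;  // next-instruction pointer
    };

    // One per die: two architectural states over shared execution resources.
    struct PhysicalProcessor {
        ArchState logical[2];        // NetBurst HT exposes two logical processors
        // caches, branch predictors, execution units, buses, control logic:
        // shared between the logical processors (omitted from this sketch)
    };

    int main() { PhysicalProcessor p4{}; return p4.logical[0].apic_id; }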

15
Pentium 4: Die Size and Complexity
16
Pentium 4: Resource sharing schemes
  • Partition: dedicate equal resources to each logical processor
  • Good when high utilization is expected and demand is somewhat unpredictable
  • Threshold: flexible resource sharing with a limit on maximum resource usage
  • Good for small resources with bursty utilization, where micro-ops stay in the structure for short, predictable periods
  • Full sharing: flexible sharing with no limits
  • Good for large structures with variable working-set sizes (a toy model of the three schemes follows this list)
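
A toy model of the three schemes (my sketch; the 3N/4 threshold value is an assumption, not Intel's actual limit) for a 64-entry structure shared by two logical processors:

    #include <cstdio>

    const int N = 64;
    int used[2] = {0, 0};  // entries currently held by each logical processor

    enum Scheme { PARTITION, THRESHOLD, FULL };

    // Can logical processor `lp` allocate one more entry under scheme `s`?
    bool may_allocate(Scheme s, int lp) {
        int total = used[0] + used[1];
        switch (s) {
            case PARTITION: return used[lp] < N / 2;  // hard 50/50 split
            case THRESHOLD:                           // assumed cap at 3N/4
                return total < N && used[lp] < (3 * N) / 4;
            case FULL:      return total < N;         // first come, first served
        }
        return false;
    }

    int main() {
        used[0] = 40;  // lp0 is bursty and already holds 40 entries
        std::printf("partition: %d  threshold: %d  full: %d\n",
                    may_allocate(PARTITION, 0), may_allocate(THRESHOLD, 0),
                    may_allocate(FULL, 0));  // prints: partition: 0  threshold: 1  full: 1
    }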

17
Pentium 4: Shared vs. partitioned queues
[Figure: queue organization in shared mode vs. partitioned mode.]
18
NetBurst Pipeline
[Figure: the NetBurst pipeline, with threshold-shared and partitioned structures marked.]
19
Pentium 4: Shared vs. partitioned resources
  • Partitioned
  • E.g., major pipeline queues
  • Threshold
  • Puts a limit on the number of resource entries a logical processor can use
  • E.g., the scheduler
  • Fully shared resources
  • E.g., caches
  • Modest interference in practice
  • Benefit if threads share code and/or data

20
Pentium 4: Scheduler occupancy
21
Pentium 4: Shared vs. partitioned cache
22
Pentium 4: Performance Improvements
23
Multi-Programmed Speedup