Title: CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
1. CPE 631 Multithreading: Thread-Level Parallelism Within a Processor
- Electrical and Computer Engineering, University of Alabama in Huntsville
- Aleksandar Milenkovic, milenka@ece.uah.edu
- http://www.ece.uah.edu/milenka
2. Outline
- Trends in microarchitecture
- Exploiting thread-level parallelism
- Exploiting TLP within a processor
- Resource sharing
- Performance implications
- Design challenges
- Intel's HT technology
3. Trends in microarchitecture
- Higher clock speeds
- To achieve high clock frequency, make the pipeline deeper (superpipelining)
- Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles (a worked example follows the list)
- ILP: Instruction-Level Parallelism
- Extract parallelism from a single program
- Superscalar processors have multiple execution units working in parallel
- Challenge: finding enough instructions that can be executed concurrently
- Out-of-order execution ⇒ instructions are sent to execution units based on instruction dependencies rather than program order
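To see why disruptions are so costly in a deep pipeline, consider an illustrative calculation (the numbers are assumptions, not measurements): with a branch every 5 instructions, a 5% misprediction rate, and a 20-cycle flush penalty, mispredictions add 0.2 × 0.05 × 20 = 0.2 cycles per instruction, so a core with a base CPI of 1.0 loses roughly 17% of its performance to mispredictions alone.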
4. Trends in microarchitecture
- Cache hierarchies
- Processor-memory speed gap
- Use caches to reduce memory latency
- Multiple levels of caches: smaller and faster closer to the processor core
- Thread-Level Parallelism
- Multiple programs execute concurrently
- Web servers have an abundance of software threads
- Users surfing the web, listening to music, encoding/decoding video streams, etc.
5. Exploiting thread-level parallelism
- CMP: Chip Multiprocessing
- Multiple processors, each with a full set of architectural resources, reside on the same die
- Processors may share an on-chip cache, or each can have its own cache
- Examples: HP Mako, IBM POWER4
- Challenges: power, die area (cost)
- Time-slice multithreading
- Processor switches between software threads after a predefined time slice
- Can minimize the effects of long-lasting events
- Still, some execution slots are wasted
6. Multithreading Within a Processor
- Until now, we have executed multiple threads of an application on different processors: can multiple threads execute concurrently on the same processor?
- Why is this desirable?
- inexpensive: one CPU, no external interconnects
- no remote or coherence misses (though more capacity misses)
- Why does this make sense?
- most processors can't find enough work: peak IPC is 6, average IPC is 1.5!
- threads can share resources ⇒ we can increase the number of threads without a corresponding linear increase in area
7. What Resources Are Shared?
- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC and its own logical regs (and its own mapping from logical to phys regs)
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing ⇒ better utilization of resources
- Each additional thread costs only a PC, a rename table, and a ROB: cheap! (see the sketch below)
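To make "cheap" concrete, here is a minimal C sketch of what an SMT core replicates per thread versus what it shares; all names and sizes are illustrative assumptions, not a description of any real design:

    #include <stdint.h>

    #define NUM_ARCH_REGS 32   /* illustrative: logical registers per thread    */

    /* Replicated per hardware thread: small relative to the shared core. */
    typedef struct {
        uint64_t pc;                          /* thread's own program counter    */
        uint8_t  rename_table[NUM_ARCH_REGS]; /* logical -> physical reg mapping */
        uint16_t rob_head, rob_tail;          /* private ROB slice, so one
                                                 thread's stall does not block
                                                 commit in the others           */
    } ThreadContext;

    /* Shared by all threads: caches, branch predictor, physical registers,
       functional units. More sharing means better utilization but also
       more interference between threads.                                  */
    typedef struct {
        uint64_t phys_regs[128];              /* one shared physical reg file  */
        ThreadContext thread[4];              /* e.g., a 4-way SMT core        */
    } SmtCore;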
8. Approaches to Multithreading Within a Processor
- Fine-grained multithreading: switches threads on every clock cycle
- Pro: hides the latency of both short and long stalls
- Con: slows down execution of the individual threads that are ready to go
- Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 misses)
- Pros: no switching each clock cycle, no slowdown for ready-to-go threads
- Con: limited ability to hide shorter stalls
- Simultaneous multithreading: exploits TLP at the same time it exploits ILP (the two switching policies are sketched below)
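A minimal C sketch contrasting the two switching policies; the ready/stall flags are hypothetical inputs from the pipeline, so this is an illustration of the policies, not a real scheduler:

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Fine-grained: rotate to the next ready thread every cycle. */
    int select_fine_grained(int last, const bool ready[NUM_THREADS]) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (ready[t]) return t;
        }
        return -1; /* no thread ready: this cycle's issue slots go idle */
    }

    /* Coarse-grained: stay on the current thread until it takes a costly
       stall (e.g., an L2 miss), then switch to another thread.           */
    int select_coarse_grained(int current, const bool l2_miss_pending[NUM_THREADS]) {
        if (!l2_miss_pending[current]) return current; /* no switch: full speed */
        for (int i = 1; i < NUM_THREADS; i++) {
            int t = (current + i) % NUM_THREADS;
            if (!l2_miss_pending[t]) return t;
        }
        return current; /* every thread is stalled; wait in place */
    }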
9. How Resources Are Shared
Each box represents an issue slot for a functional unit. Peak throughput is 4 IPC.
Figure: cycle-by-cycle issue-slot diagrams for Superscalar, Coarse-grained Multithreading, Fine-Grained Multithreading, and Simultaneous Multithreading; rows are cycles, and shading distinguishes Threads 1-4 from idle slots.
- A superscalar processor has high under-utilization: it cannot find enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can only issue instructions from a single thread in a given cycle: it cannot fill every issue slot, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot
10. Resource Sharing
Figure: both threads flow through instruction fetch and instruction rename into a shared issue queue, a shared register file, and shared FUs.
Thread-1: R1 ← R1 + R2;  R3 ← R1 + R4;  R5 ← R1 + R3
after renaming: P73 ← P1 + P2;  P74 ← P73 + P4;  P75 ← P73 + P74
Thread-2: R2 ← R1 + R2;  R5 ← R1 + R2;  R3 ← R5 + R3
after renaming: P76 ← P33 + P34;  P77 ← P33 + P76;  P78 ← P77 + P35
Because renaming maps each thread's architectural registers to disjoint physical registers, instructions from both threads can mingle in the shared issue queue and functional units without false dependences (a code sketch follows).
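The renaming step can be sketched in a few lines of C; the free list is simplified to a counter seeded at P73, and the initial per-thread mappings are assumptions chosen so the output reproduces the example above:

    #include <stdio.h>

    #define NUM_ARCH 32

    static int rename_table[2][NUM_ARCH]; /* per-thread map: logical -> physical */
    static int next_free = 73;            /* simplified free list: a counter
                                             seeded to match the slide's P73+   */

    /* Rename "Rd <- Rs1 + Rs2" for one thread: read the current source
       mappings, then point Rd at a fresh physical register.             */
    static void rename_add(int tid, int rd, int rs1, int rs2) {
        int p1 = rename_table[tid][rs1];
        int p2 = rename_table[tid][rs2];
        int pd = next_free++;
        rename_table[tid][rd] = pd;
        printf("T%d: P%d <- P%d + P%d\n", tid, pd, p1, p2);
    }

    int main(void) {
        /* Assumed initial mappings: thread 0 holds P0..P31, thread 1 holds
           P32..P63, so the two threads' live values can never collide.    */
        for (int r = 0; r < NUM_ARCH; r++) {
            rename_table[0][r] = r;
            rename_table[1][r] = 32 + r;
        }
        rename_add(0, 1, 1, 2);  /* R1 <- R1 + R2  becomes  P73 <- P1 + P2   */
        rename_add(0, 3, 1, 4);  /* R3 <- R1 + R4  becomes  P74 <- P73 + P4  */
        rename_add(0, 5, 1, 3);  /* R5 <- R1 + R3  becomes  P75 <- P73 + P74 */
        rename_add(1, 2, 1, 2);  /* R2 <- R1 + R2  becomes  P76 <- P33 + P34 */
        rename_add(1, 5, 1, 2);  /* R5 <- R1 + R2  becomes  P77 <- P33 + P76 */
        rename_add(1, 3, 5, 3);  /* R3 <- R5 + R3  becomes  P78 <- P77 + P35 */
        return 0;
    }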
11. Performance Implications of SMT
- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread
- While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT) is to fetch such that each thread has an equal share of processor resources (sketched below)
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4×
- The Alpha 21464 and Intel Pentium 4 are examples of SMT
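A minimal C sketch of an ICOUNT-style fetch pick: each cycle, fetch from the thread with the fewest instructions in the front end, which naturally equalizes resource shares; the counters and stall flags are illustrative inputs:

    #include <limits.h>
    #include <stdbool.h>

    #define NUM_THREADS 8

    /* in_flight[t] counts thread t's instructions currently in the decode,
       rename, and issue-queue stages.                                      */
    int icount_pick(const int in_flight[NUM_THREADS],
                    const bool fetch_stalled[NUM_THREADS]) {
        int best = -1, best_count = INT_MAX;
        for (int t = 0; t < NUM_THREADS; t++) {
            if (fetch_stalled[t]) continue;   /* e.g., pending I-cache miss */
            if (in_flight[t] < best_count) {  /* fewest in-flight wins: that
                                                 thread is draining its
                                                 instructions the fastest   */
                best_count = in_flight[t];
                best = t;
            }
        }
        return best; /* -1 if no thread can fetch this cycle */
    }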
12. Design Challenges
- How many threads?
- Many, to find enough parallelism
- However, mixing many threads will compromise execution of individual threads
- Processor front-end (instruction fetch)
- Fetch as far as possible in a single thread (to maximize thread performance)
- However, this limits the number of instructions available for scheduling from other threads
- Larger register files (multiple contexts)
- Minimize clock cycle time
- Cache conflicts
13. Pentium 4 Hyper-Threading architecture
- One physical processor appears as multiple logical processors
- The HT implementation on the NetBurst microarchitecture has 2 logical processors (a CPUID-based detection sketch follows the list)
- Architectural state (replicated per logical processor)
- general-purpose registers
- control registers
- APIC: advanced programmable interrupt controller
Figure: two Architectural State blocks sit on top of one set of shared processor execution resources.
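For reference, software can detect this with the x86 CPUID instruction: leaf 1 sets the HTT flag in EDX bit 28 and reports the number of addressable logical processors per package in EBX[23:16]. A minimal sketch using GCC/Clang's <cpuid.h>; robust topology enumeration on later CPUs needs more than this:

    #include <stdio.h>
    #include <cpuid.h>  /* GCC/Clang wrapper for the x86 CPUID instruction */

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 1 not supported");
            return 1;
        }
        int htt     = (edx >> 28) & 1;    /* HTT: >1 logical CPU per package  */
        int logical = (ebx >> 16) & 0xff; /* addressable logical CPUs/package */
        printf("HTT=%d, logical processors per package=%d\n", htt, logical);
        return 0;
    }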
14. Pentium 4 Hyper-Threading architecture
- Main processor resources are shared
- caches, branch predictors, execution units, buses, control logic
- Duplicated resources
- register alias tables (map the architectural registers to physical rename registers)
- next-instruction pointer and associated control logic
- return stack pointer
- instruction streaming buffers and trace-cache fill buffers
15. Pentium 4: Die Size and Complexity
16. Pentium 4: Resource sharing schemes
- Partition: dedicate equal resources to each logical processor
- Good when high utilization is expected and demand is somewhat unpredictable
- Threshold: flexible resource sharing with a limit on maximum resource usage
- Good for small resources with bursty utilization, where micro-ops stay in the structure for short, predictable periods
- Full sharing: flexible sharing with no limits
- Good for large structures with variable working-set sizes
(the three schemes are sketched after this list)
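A minimal C sketch of the three allocation policies for one queue shared by two logical processors; the queue size and threshold value are illustrative assumptions:

    #include <stdbool.h>

    #define QUEUE_SIZE 32
    #define THRESHOLD  24   /* illustrative cap for the threshold policy */

    static int occupancy[2]; /* entries currently held by each logical CPU */

    /* Partitioned: each logical processor owns exactly half the entries. */
    bool can_alloc_partitioned(int cpu) {
        return occupancy[cpu] < QUEUE_SIZE / 2;
    }

    /* Threshold: share freely, but cap any one logical processor so a
       bursty thread cannot starve the other.                            */
    bool can_alloc_threshold(int cpu) {
        return occupancy[cpu] < THRESHOLD &&
               occupancy[0] + occupancy[1] < QUEUE_SIZE;
    }

    /* Full sharing: only the total capacity limits allocation. */
    bool can_alloc_shared(int cpu) {
        (void)cpu;
        return occupancy[0] + occupancy[1] < QUEUE_SIZE;
    }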
17. Pentium 4: Shared vs. partitioned queues
Figure: a fully shared queue vs. a partitioned queue (entries divided evenly between the two logical processors).
18. NetBurst Pipeline
Figure: the NetBurst pipeline, with structures marked as threshold-shared or partitioned.
19. Pentium 4: Shared vs. partitioned resources
- Partitioned
- E.g., major pipeline queues
- Threshold
- Puts a threshold on the number of resource entries a logical processor can have
- E.g., scheduler
- Fully shared resources
- E.g., caches
- Modest interference
- Benefit if we have shared code and/or data
20. Pentium 4: Scheduler occupancy
21. Pentium 4: Shared vs. partitioned cache
22. Pentium 4: Performance Improvements
23. Multi-Programmed Speedup