Title: Introduction to Simultaneous Multithreading
1. Introduction to Simultaneous Multithreading
2. Law of Microprocessor Performance

Performance = 1 / (Time / Program)

Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
               = (instr. count) × (CPI) × (cycle time)

⇒ Performance = (IPC × Hz) / instr. count
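To make the arithmetic concrete, here is a small C sketch that evaluates both forms of the formula; the instruction count, CPI, and clock frequency below are made-up illustrative values, not measurements of any real processor:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical workload/processor parameters (illustrative only). */
        double instr_count = 2e9;   /* instructions executed by the program */
        double cpi         = 0.5;   /* cycles per instruction (IPC = 2)     */
        double freq_hz     = 3e9;   /* clock frequency: 3 GHz               */

        /* Iron law: Time/Program = instr. count x CPI x cycle time */
        double time = instr_count * cpi * (1.0 / freq_hz);
        printf("time = %.3f s, performance = %.3f runs/s\n", time, 1.0 / time);

        /* Equivalent form: Performance = (IPC x Hz) / instr. count */
        printf("check: %.3f runs/s\n", ((1.0 / cpi) * freq_hz) / instr_count);
        return 0;
    }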
3. Optimization areas (so far)
- clock speed (↑Hz)
- architectural optimizations (↑IPC)
  - pipelining, superscalar execution, branch prediction, out-of-order execution
- cache (↑IPC)
4. Main problems
- clock speed: physical issues when increasing frequency
  - too much heat, too hard to dissipate
  - power consumption too high
  - current leakage problems
- architectural optimizations: insufficient inherent ILP in many applications ⇒ superscalar processors cannot exploit all of the available issue bandwidth
- Solution
  - must find ways other than ILP to boost processor performance ⇒ increase TLP
5. Current directions
- CMP: Chip Multiprocessors (multicores)
- SMT: Simultaneous Multithreading
- CMP and SMT both target increased per-chip TLP
- SMT processors
  - issue and execute instructions from multiple threads simultaneously (in the same cycle)
  - can be built easily and cheaply on top of modern superscalar processors
  - threads share almost all execution resources
6. Superscalar execution
7. Superscalar execution (2)
8. Chip Multiprocessors
9. Time Slicing Multithreading (Fine-grained)
10. Simultaneous Multithreading
Maximum utilization of functional units by independent operations
11. Hyperthreading technology
- Executes two tasks simultaneously
  - multiprogrammed workloads / multithreaded applications
- Physical CPU maintains architectural state for two logical processors
- First implemented on the Intel Xeon processor family
  - the logical processors come at a cost of <5% additional die area
12Replicated vs. Shared resources
Multiprocessor
Hyperthreading
- SMPs replicate execution resources
- HT shares execution resources
13. Resource management in HT technology
- Replicated resources
  - Architectural state: GPRs, control registers, APICs
  - Instruction Pointers, renaming logic
  - smaller structures (ITLBs, return stack buffers, branch history buffers)
- Partitioned resources
  - Re-Order buffers, Load/Store buffers, instruction queues
  - buffering queues guarantee independent forward progress
- Dynamically shared resources
  - Out-of-Order execution engine, global history array
  - Caches
14. Basic Microarchitecture
15. Front End: Trace Cache hit
- Trace Cache access is arbitrated each cycle
  - when both LPs request at the same time, access is granted in alternating cycles
  - when only one LP requests, the full TC bandwidth is exploited
16. Front End: Trace Cache miss
- L2 Cache: access is arbitrated on a FCFS basis, with one slot always reserved for each LP
- Decode: when used by both LPs at the same time, access is alternated, but in a more coarse-grained fashion
17. OoO Execution Engine
- Allocator
  - allocates buffer entries to each LP: 63/126 ROB entries, 24/48 Load buffer entries, 12/24 Store buffer entries, 128/128 Integer phys. regs, 128/128 FP phys. regs
  - on simultaneous requests, alternates between LPs at each cycle
  - stalls an LP when it tries to use more than half of the partitioned buffers (even if there aren't any uops from the peer LP in the queue)
18. OoO Execution Engine (2)
- Register Rename
  - dynamically maps architectural registers onto physical registers
  - per-LP Register Alias Table
19. OoO Execution Engine (3)
- Schedulers / Execution Units
  - oblivious of logical processors
  - general/memory queues send uops, alternating between LPs every cycle
  - 6 µIPC dispatch bandwidth (⇒ 3 µIPC effective per-LP bandwidth when both LPs are active)
20. Retirement
- architectural state for each thread is committed in program order, alternating between LPs at each cycle
21. Software development implications
- ILP vs. TLP
  - factors: parallelization cost, register/cache reuse, contention on execution units
  - TLP for non-optimized codes, ILP for highly-tuned ones
- Shared caches
  - should threads work on disjoint small, or on shared large cache portions?
- Threads' execution profile
  - pairing threads of different profiles (int-bound w/ fp-bound, cpu-bound w/ mem-bound) alleviates resource contention; a sketch of such pairing follows below
  - can we apply alternative, non-symmetric parallelization methods?
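As an illustration of the pairing idea, here is a minimal Linux sketch that pins an integer-bound and a memory-bound thread onto the two logical processors of one physical core using pthread_setaffinity_np. The worker bodies are placeholders, and the assumption that logical CPUs 0 and 1 are HT siblings is ours (verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on a real machine):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Hypothetical worker bodies with complementary profiles. */
    static void *int_bound(void *arg) { /* ... integer-heavy kernel ... */ return 0; }
    static void *mem_bound(void *arg) { /* ... pointer-chasing kernel ... */ return 0; }

    /* Pin a thread to one logical CPU. */
    static void pin(pthread_t t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, int_bound, 0);
        pthread_create(&t2, 0, mem_bound, 0);
        pin(t1, 0);   /* assumed HT sibling pair: logical CPUs 0 and 1 */
        pin(t2, 1);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }

Because the paired threads stress different execution units, they contend less for the dynamically shared resources described on slide 13.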
22. SMTs
- Targeting
  - throughput of multiprogrammed workloads
  - latency of multithreaded applications
- Challenge
  - latency of single-threaded applications
- How?
  - speculative execution: predict and/or precompute future accesses, branches, or even operation results, and integrate them into the main thread's execution
  - hardware of current SMT implementations can support only memory precomputation (prefetching) schemes
23. Example: prefetching helper threads
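The original slide shows this example as a figure. Below is a minimal C sketch of the idea: a helper thread runs ahead of the main thread and touches upcoming data so it is already cache-resident when the main thread reaches it. The array, the look-ahead distance, and the use of GCC's __builtin_prefetch are our illustrative assumptions, not the scheme from any specific implementation:

    #include <pthread.h>
    #include <stdio.h>
    #include <stddef.h>

    #define N      (1 << 24)
    #define AHEAD  4096        /* how far ahead the helper prefetches (tunable) */

    static double data[N];
    static volatile size_t pos;    /* main thread's current index */
    static volatile int done;

    /* Helper thread: touch data the main thread will need soon. */
    static void *prefetcher(void *arg) {
        while (!done) {
            size_t p = pos;
            for (size_t i = p; i < p + AHEAD && i < N; i += 8)
                __builtin_prefetch(&data[i], 0, 1);   /* read, low locality */
        }
        return 0;
    }

    int main(void) {
        pthread_t helper;
        pthread_create(&helper, 0, prefetcher, 0);

        double sum = 0.0;              /* main thread: the real work */
        for (size_t i = 0; i < N; i++) {
            sum += data[i];
            pos = i;
        }
        done = 1;
        pthread_join(helper, 0);
        printf("%f\n", sum);
        return 0;
    }

On an HT processor the helper shares the caches with the main thread (slide 13), which is exactly what makes the precomputation visible to it.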
24. Synchronization issues and HT
- spin-wait loops: the core of many synch. primitives (spin-locks, semaphores, barriers, cond. vars)
- Typical spin-wait loop:

    wait_loop:  mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop

- Two issues when executed on HT
  - upon condition satisfaction, all out-of-order pre-executed load-cmp-jne iterations waiting to be committed must be discarded ⇒ pipeline flush penalty
  - it spins too fast: it checks for condition satisfaction a lot faster than the memory bus can actually update sync_var ⇒ valuable resources are consumed
25. Synchronization issues and HT (2)
- Use of the pause instruction
  - introduces a slight delay into the loop
  - causes condition tests to be issued at approximately the speed of the memory bus
  - during pause, execution of the spinning thread is de-pipelined ⇒ dynamically shared resources are effectively allocated to the working thread

    wait_loop:  pause
                mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop
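In C, the same loop is typically written with the _mm_pause intrinsic, which compiles to the pause instruction; a minimal sketch, where sync_var and its type are our illustrative assumptions:

    #include <emmintrin.h>   /* _mm_pause() */

    static volatile int sync_var;   /* assumed flag, set by the working thread */

    /* Spin until sync_var becomes non-zero, yielding dynamically
     * shared pipeline resources to the sibling logical processor. */
    static void spin_wait(void) {
        while (sync_var == 0)
            _mm_pause();
    }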
26. Synchronization issues and HT (3)
- However, statically partitioned resources are not released (they simply remain unused)
- performance bottleneck, especially for long-duration wait loops (e.g. in prefetching schemes)
- 15% slowdown of the working thread on average
27. Synchronization issues and HT (4)
- Further optimization: use of the halt instruction

    wait_loop:  hlt
                mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop
28. Synchronization issues and HT (5)
- the spinning thread halts, and partitioned resources are recombined for full use by the working thread (ST-mode)
- upon condition satisfaction, the worker sends IPIs to wake up the sleeping thread, and resources are repartitioned (MT-mode)
- wait-loop repetitions and resource consumption are minimized
- tradeoffs: OS intervention, cost of ST/MT-mode transitions (a user-level analogue is sketched below)
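Since hlt is a privileged instruction, user code obtains the equivalent behavior by asking the OS to block the waiter outright. The following is a rough Linux analogue of the halt/IPI scheme built on the futex syscall; the futex-based design is our illustrative choice, not something the slides prescribe:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>

    static volatile int sync_var;   /* assumed flag, as in the spin-wait loops above */

    /* Waiter: sleep in the kernel while sync_var is still 0
     * (analogous to the halted logical processor in ST-mode). */
    static void wait_for_flag(void) {
        while (sync_var == 0)
            syscall(SYS_futex, &sync_var, FUTEX_WAIT, 0, NULL, NULL, 0);
    }

    /* Worker: set the flag, then wake the sleeper
     * (analogous to the IPI that brings the core back to MT-mode). */
    static void signal_flag(void) {
        sync_var = 1;
        syscall(SYS_futex, &sync_var, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }

As on the slide, the tradeoff is a kernel transition on each wait/wake in exchange for freeing the waiter's resources for the working thread.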