Title: CS136, Advanced Architecture
1CS136, Advanced Architecture
- Limits to ILP
- Simultaneous Multithreading
2Outline
- Limits to ILP (another perspective)
- Thread Level Parallelism
- Multithreading
- Simultaneous Multithreading
- Power 4 vs. Power 5
- Head to Head VLIW vs. Superscalar vs. SMT
- Commentary
- Conclusion
3Limits to ILP
- Conflicting studies of amount
- Benchmarks (vectorized Fortran FP vs. integer C programs)
- Hardware sophistication
- Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
- Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
- Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
- Motorola AltiVec: 128-bit ints and FPs
- SuperSPARC multimedia ops, etc.
4Overcoming Limits
- Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome limitations assumed in studies
- However, unlikely that such advances, when coupled with realistic hardware, will overcome these limits in near future
5Limits to ILP
- Initial HW model here; MIPS compilers.
- Assumptions for ideal/perfect machine to start:
- 1. Register renaming: infinite virtual registers, so all register WAW & WAR hazards are avoided
- 2. Branch prediction: perfect; no mispredictions
- 3. Jump prediction: all jumps perfectly predicted (returns, case statements); 2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
- 4. Memory-address alias analysis: addresses known & a load can be moved before a store provided addresses not equal; 1 & 4 eliminates all but RAW
- Also: perfect caches; 1 cycle latency for all instructions (FP *, /); unlimited instructions issued/clock cycle
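The ideal machine above leaves only true (RAW) dependences, so its instructions per clock is the trace length divided by the length of the longest dependence chain. The following is a minimal sketch of that calculation, not the study's actual simulator. The trace format, the register names, and the function name `ideal_ilp` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: (destination register, list of source registers).
Instr = Tuple[str, List[str]]

def ideal_ilp(trace: List[Instr]) -> float:
    """Instructions per clock when only RAW dependences constrain issue.

    Models the slide's assumptions: infinite renaming (no WAW/WAR),
    perfect prediction, 1-cycle latency, unlimited issue width.
    """
    ready: Dict[str, int] = {}  # register -> cycle in which its newest value is produced
    last_cycle = 0
    for dest, srcs in trace:
        # Issue one cycle after the latest producer of any source.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0

trace = [
    ("r1", []),            # load
    ("r2", []),            # load
    ("r3", ["r1", "r2"]),  # depends on both loads
    ("r4", ["r1"]),        # depends on first load
    ("r1", ["r4"]),        # reuses r1: a WAR/WAW hazard that renaming removes
    ("r5", ["r3", "r1"]),  # reads the new r1
]
print(ideal_ilp(trace))
```

The longest chain here is four instructions long (load, r4, new r1, r5), so six instructions finish in four cycles.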
6Limits to ILP HW Model comparison
7Upper Limit to ILP: Ideal Machine (Figure 3.1)
- Instructions per clock: FP 75 - 150, integer 18 - 60
8Limits to ILP HW Model comparison
9More Realistic HW: Window Impact (Figure 3.2)
- Change from infinite window to 2048, 512, 128, 32
- IPC: FP 9 - 150, integer 8 - 63
10Limits to ILP HW Model comparison
11More Realistic HW: Branch Impact (Figure 3.3)
- Change from infinite window to 2048, and maximum issue of 64 instructions per clock cycle
- IPC: FP 15 - 45, integer 6 - 12
- Predictors compared: perfect, tournament, BHT (512), profile, no prediction
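To make the "BHT (512)" entry concrete, here is a minimal sketch of a branch history table, assuming 512 entries of 2-bit saturating counters indexed by the low bits of the branch PC. The function name, the initial counter value, and the loop-branch workload are hypothetical; the study's predictors are more elaborate.

```python
from typing import Iterable, Tuple

def mispredict_rate(branches: Iterable[Tuple[int, bool]], entries: int = 512) -> float:
    """Misprediction rate of a table of 2-bit saturating counters.

    Each element of `branches` is (pc, taken). Counter values 0-1 predict
    not taken, 2-3 predict taken; all counters start at 1 (an assumption).
    """
    table = [1] * entries
    total = wrong = 0
    for pc, taken in branches:
        i = pc % entries
        predicted_taken = table[i] >= 2
        if predicted_taken != taken:
            wrong += 1
        total += 1
        # Saturating update toward the actual outcome.
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return wrong / total if total else 0.0

# Hypothetical loop branch at PC 0x40: taken 9 times, then not taken, 10 times over.
loop = [(0x40, i % 10 != 9) for i in range(100)]
print(mispredict_rate(loop))
```

A loop branch like this one mispredicts on each loop exit but not on re-entry, which is the usual argument for two bits over one.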
12Misprediction Rates
13Limits to ILP HW Model comparison
14More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5)
- Change to 2048 instr window, 64 instr issue, 8K 2-level prediction
- IPC: FP 11 - 45, integer 5 - 15
- Renaming registers compared: infinite, 256, 128, 64, 32, none
15Limits to ILP HW Model comparison
16More Realistic HW: Memory Address Alias Impact (Figure 3.6)
- Change to 2048 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers
- IPC: FP 4 - 45 (Fortran, no heap), integer 4 - 9
- Alias analysis compared: perfect; global/stack perfect, heap conflicts; compiler inspection; none
17Limits to ILP HW Model comparison
18Realistic HW: Window Impact (Figure 3.7)
- Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window
- IPC: FP 8 - 45, integer 6 - 12
- Window sizes compared: infinite, 256, 128, 64, 32, 16, 8, 4
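The effect of window size can be sketched with a toy scheduler that only looks at the oldest `window` unissued instructions each cycle and issues every one whose producers issued in an earlier cycle. This is a simplified model under stated assumptions (perfect renaming, 1-cycle latency, issued instructions leave the window immediately), not the simulator behind the figure. The trace format and the name `windowed_ipc` are hypothetical.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, List[str]]  # hypothetical format: (destination, sources)

def windowed_ipc(trace: List[Instr], window: int) -> float:
    """IPC when only the oldest `window` unissued instructions are candidates."""
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(trace)
    if n == 0:
        return 0.0
    # With perfect renaming, each source refers to its most recent earlier writer.
    producers: List[List[int]] = []
    last_writer: Dict[str, int] = {}
    for idx, (dest, srcs) in enumerate(trace):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = idx
    issued_at = [0] * n  # 0 means not yet issued
    pending = list(range(n))
    cycle = 0
    while pending:
        cycle += 1
        candidates = pending[:window]
        ready = [i for i in candidates
                 if all(0 < issued_at[p] < cycle for p in producers[i])]
        for i in ready:
            issued_at[i] = cycle
        pending = [i for i in pending if issued_at[i] == 0]
    return n / cycle

# Four independent (produce, consume) pairs.
pairs: List[Instr] = []
for k in range(4):
    pairs.append((f"a{k}", []))
    pairs.append((f"b{k}", [f"a{k}"]))

for w in (1, 2, 8):
    print(w, windowed_ipc(pairs, w))
```

On this trace a wider window finds the independent pairs sooner, so IPC rises with window size until the dependence chains themselves become the limit.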
19Outline
- Limits to ILP (another perspective)
- Thread Level Parallelism
- Multithreading
- Simultaneous Multithreading
- Power 4 vs. Power 5
- Head to Head VLIW vs. Superscalar vs. SMT
- Commentary
- Conclusion
20How to Exceed ILP Limits of This Study?
- These are not laws of physics
- Just practical limits for today
- Could be overcome via research
- Compiler and ISA advances could change results
- WAR and WAW hazards through memory: study eliminated WAW and WAR hazards through register renaming, but not in memory usage
- Can get conflicts via allocation of stack frames
- Because called procedure reuses memory addresses of previous stack frames
21HW v. SW to increase ILP
- Memory disambiguation: HW best
- Speculation
- HW best when dynamic branch prediction better than compile-time prediction
- Exceptions easier for HW
- HW doesn't need bookkeeping code or compensation code
- Very complicated to get right in SW
- Scheduling: SW can look ahead to schedule better
- Compiler independence: HW does not require new compiler to run well
22Performance Beyond Single-Thread ILP
- Much higher natural parallelism in some applications
- Database or scientific codes
- Explicit thread-level or data-level parallelism
- Thread has own instructions and data
- May be part of parallel program or independent program
- Each thread has all state (instructions, data, PC, register state, and so on) needed to execute
- Data-level parallelism: perform identical operations on lots of data
23Thread Level Parallelism (TLP)
- ILP exploits implicit parallel operations within loop or straight-line code segment
- TLP explicitly represented by multiple threads of execution that are inherently parallel
- Goal: use multiple instruction streams to improve
- Throughput of computers that run many programs
- Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP
24Do Both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented to ILP still exploit TLP?
- Functional units are often idle in data path designed for ILP because of either stalls or dependencies in the code
- Could TLP be used as source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ functional units that would otherwise lie idle when insufficient ILP exists?
25New Approach: Multithreaded Execution
- Multithreading: multiple threads share functional units of 1 processor via overlapping
- Processor must duplicate independent state of each thread
- Separate copy of register file, PC
- Separate page table if different process
- Memory sharing via virtual memory mechanisms
- Already supports multiple processes
- HW for fast thread switch
- Must be much faster than full process switch (which is 100s to 1000s of clocks)
- When to switch?
- Alternate instruction per thread (fine grain), round robin
- When thread is stalled (coarse grain)
- E.g., cache miss
26Fine-Grained Multithreading
- Switches between threads on each instruction, interleaving execution of multiple threads
- Usually done round-robin, skipping stalled threads
- CPU must be able to switch threads every clock
- Advantage: can hide both short and long stalls
- Instructions from other threads always available to execute
- Easy to insert on short stalls
- Disadvantage: slows individual threads
- Thread ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun's Niagara (will see later)
27Coarse-Grained Multithreading
- Switches threads only on costly stalls
- E.g., L2 cache misses
- Advantages
- Relieves need to have very fast thread switching
- Doesn't slow thread
- Other threads only issue instructions when main one would stall (for long time) anyway
- Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls
- Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread
- New thread must fill pipe before instructions can complete
- Thus, better for reducing penalty of high-cost stalls where pipeline refill time is much less than stall time
- Used in IBM AS/400
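The trade-off between the two policies can be sketched with a toy single-issue model. Each thread is a list of stall lengths, where entry k is how many cycles the thread must wait after issuing its k-th instruction. The switch threshold, the refill penalty, and both function names are assumptions for illustration, and the model ignores everything else about a real pipeline.

```python
from typing import List

def fine_grained(threads: List[List[int]]) -> int:
    """Cycles until the last instruction issues, switching every cycle
    round-robin and skipping stalled threads."""
    n = len(threads)
    pos = [0] * n        # next instruction index per thread
    ready_at = [0] * n   # first cycle in which each thread may issue again
    cycle, nxt = 0, 0
    while any(pos[t] < len(threads[t]) for t in range(n)):
        for k in range(n):
            t = (nxt + k) % n
            if pos[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pos[t]]
                pos[t] += 1
                nxt = (t + 1) % n
                break
        cycle += 1
    return cycle

def coarse_grained(threads: List[List[int]], threshold: int = 10, refill: int = 3) -> int:
    """Cycles until the last instruction issues, switching only when the
    current thread is finished or faces a stall of at least `threshold`
    cycles; every switch costs `refill` cycles of pipeline refill."""
    n = len(threads)
    pos = [0] * n
    ready_at = [0] * n
    cycle, cur = 0, 0

    def unfinished(t: int) -> bool:
        return pos[t] < len(threads[t])

    while any(unfinished(t) for t in range(n)):
        blocked = (not unfinished(cur)) or ready_at[cur] - cycle >= threshold
        if blocked:
            others = [t for t in range(n)
                      if t != cur and unfinished(t) and ready_at[t] <= cycle]
            if others:
                cur = others[0]
                cycle += refill   # pay the pipeline refill cost
                continue
        if unfinished(cur) and ready_at[cur] <= cycle:
            ready_at[cur] = cycle + 1 + threads[cur][pos[cur]]
            pos[cur] += 1
        cycle += 1
    return cycle

short_stalls = [[2, 2, 2], [2, 2, 2]]
long_stalls = [[20, 20], [20, 20]]
for name, work in (("short", short_stalls), ("long", long_stalls)):
    print(name, fine_grained(work), coarse_grained(work))
```

With short stalls the coarse-grained model never switches until a thread finishes, so the stalls go unhidden. With long stalls it overlaps the two threads, though each switch still pays the refill cost.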
28Simultaneous Multithreading (SMT)
- Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading
- Large set of virtual registers that can be used to hold register sets for independent threads
- Register renaming provides unique register identifiers
- Instructions from multiple threads can be mixed in data path
- Without confusing sources and destinations across threads!
- Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
- Just add per-thread renaming table and keep separate PCs
- Independent commitment can be supported via separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"
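The "per-thread renaming table" idea can be sketched as follows: one shared pool of physical registers, plus one map per thread from architectural to physical register. The class and method names are hypothetical, and the sketch never frees registers, which a real design must do at commit.

```python
from typing import Dict, List, Tuple

class Renamer:
    """Shared physical register pool with one rename table per thread."""

    def __init__(self, n_threads: int, n_physical: int) -> None:
        self.free: List[int] = list(range(n_physical))
        self.tables: List[Dict[str, int]] = [dict() for _ in range(n_threads)]

    def rename(self, thread: int, dest: str, srcs: List[str]) -> Tuple[int, List[int]]:
        """Return (physical destination, physical sources) for one instruction."""
        table = self.tables[thread]
        # Sources use the thread's current mappings (KeyError if never written).
        phys_srcs = [table[s] for s in srcs]
        # Destination gets a fresh physical register (IndexError if pool is empty).
        phys_dest = self.free.pop(0)
        table[dest] = phys_dest
        return phys_dest, phys_srcs

r = Renamer(n_threads=2, n_physical=8)
t0_r1, _ = r.rename(0, "r1", [])            # thread 0 writes r1
t1_r1, _ = r.rename(1, "r1", [])            # thread 1 writes its own r1
t0_r2, t0_srcs = r.rename(0, "r2", ["r1"])  # thread 0 reads its r1
print(t0_r1, t1_r1, t0_srcs)
```

Both threads name the same architectural register `r1`, yet after renaming the instructions carry distinct physical identifiers, so they can share the data path without confusing sources and destinations.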
29Simultaneous Multithreading ...
[Figure: issue slots per cycle across 8 execution units (M, M, FX, FX, FP, FP, BR, CC), shown for one thread and for two threads]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
30Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and SMT; each slot is marked as thread 1 to 5 or idle]
31Design Challenges in SMT
- What is impact on single-thread performance?
- Preferred-thread approach
- Sacrifices neither throughput nor single-thread performance?
- Nope: processor will sacrifice some throughput when preferred thread stalls
- Larger register file needed to hold multiple contexts
- Must not affect clock cycle, especially in
- Instruction issue: more candidate instructions to consider
- Instruction completion: hard to choose which to commit
- Must ensure that cache and TLB conflicts caused by SMT don't degrade performance
32Digression: Covert Channels
- Imagine you're spy with account on Knuth
- Want to communicate a secret to Geoff
- Secret is reasonably small
- FBI is watching your account and your e-mail
- Solution: process spawning
- Once a second, Geoff spawns process
- Records own PID, waits 10 ms, forks, records child PID
- Once a second, you send one bit of information
- If bit is zero, you do nothing
- If bit is one, you spawn processes as fast as possible
- If Geoff sees big PID gap, he records 1, else 0
- Many variations on this basic idea
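The PID-gap scheme can be sketched as a pure simulation, with no real processes spawned. It assumes the system hands out PIDs sequentially, that nobody else is creating processes, and that PIDs do not wrap around. The burst size, the decoding threshold, and the function names are made up for illustration.

```python
from typing import List, Tuple

def simulate_pids(bits: List[int], burst: int = 50, start_pid: int = 1000) -> List[Tuple[int, int]]:
    """Return the (own PID, child PID) pair the receiver records for each bit."""
    next_pid = start_pid
    observations = []
    for bit in bits:
        own = next_pid          # receiver's once-a-second process
        next_pid += 1
        if bit:
            next_pid += burst   # sender spawns processes during the receiver's wait
        child = next_pid        # receiver forks and records the child's PID
        next_pid += 1
        observations.append((own, child))
    return observations

def decode(observations: List[Tuple[int, int]], threshold: int = 10) -> List[int]:
    """A big gap between own PID and child PID means the sender sent a 1."""
    return [1 if child - own > threshold else 0 for own, child in observations]

secret = [1, 0, 1, 1, 0]
print(decode(simulate_pids(secret)))
```

On a real, busy system the gap is noisy, so the threshold and the one-bit-per-second rate would need tuning.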
33Covert-Channel Attacks on Crypto
- Most (not all) crypto code behaves differently on 1 bit in key vs. 0 bit
- Runs longer or shorter
- Uses more or less power
- Accesses different memory
- Etc.
- Usually called "information leakage"
- Has been successfully used in lab to crack strong crypto
- Even recovering some bits from key makes brute-force attack practical for getting remainder
- Some modern implementations try to fight by doing wasted work on shorter path of "if", etc.
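The "wasted work on the shorter path" defense can be illustrated with square-and-multiply modular exponentiation, where the work done per key bit normally depends on the bit's value. This is a sketch only: it counts multiplications as a stand-in for time or power, and Python code like this is not actually constant-time. Both function names are hypothetical.

```python
from typing import List, Tuple

def modexp_leaky(base: int, key_bits: List[int], mod: int) -> Tuple[int, int]:
    """Left-to-right square-and-multiply; returns (result, multiplications done).
    A 1 bit costs two multiplications, a 0 bit only one."""
    result, mults = 1, 0
    for bit in key_bits:
        result = (result * result) % mod
        mults += 1
        if bit:
            result = (result * base) % mod
            mults += 1
    return result, mults

def modexp_balanced(base: int, key_bits: List[int], mod: int) -> Tuple[int, int]:
    """Same result, but the multiply is always performed and simply
    discarded for 0 bits, so the operation count no longer depends on the key."""
    result, mults = 1, 0
    for bit in key_bits:
        result = (result * result) % mod
        mults += 1
        product = (result * base) % mod   # wasted work when bit is 0
        mults += 1
        if bit:
            result = product
    return result, mults

for key in ([1, 1, 1, 1], [1, 0, 0, 0]):
    print(key, modexp_leaky(3, key, 1000)[1], modexp_balanced(3, key, 1000)[1])
```

The leaky version's multiplication count varies with the number of 1 bits in the key, while the balanced version's count depends only on the key length.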
34SMT Attack on SSH
- On SMT machine, lower-priority thread's execution rate depends on higher-priority one's instructions
- More stalls in top thread mean more speed in bottom one
- Stalls vary depending on what crypto code is doing
- Operates at very low level
- Thus much harder to defend against
- Successful attack on ssh keys has been demonstrated in lab
- Best known defense: don't do SMT
- Careful coding of crypto could probably also work
- Note that this also applies to things like cache and TLB
- Lots of ways to leak information unintentionally!
35Power 4
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine; each can issue an instruction each cycle.
36Power 4 vs. Power 5
[Figure: Power 4 and Power 5 pipeline diagrams; labels: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets)]
37Power 5 data flow ...
Why only 2 threads? With 4, one of the shared
resources (physical registers, cache, memory
bandwidth) would be prone to bottleneck
38Power 5 thread performance ...
Relative priority of each thread controllable in
hardware.
For balanced operation, both threads run slower
than if they owned the machine.
39Changes in Power 5 to support SMT
- Increased associativity of L1 instruction cache and instruction address translation buffers
- Added per-thread load and store queues
- Increased size of L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased virtual registers from 152 to 240
- Increased size of several issue queues
- Power5 core is about 24% larger than Power4 because of SMT support
40Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark; 1.07 for SPECfp_rate
- Pentium 4 is dual-threaded SMT
- SPECRate requires each benchmark to be run against vendor-selected number of copies of same benchmark
- Pairing each of 26 SPEC benchmarks with every other on Pentium 4 (26^2 runs) gives speedups from 0.90 to 1.58; average was 1.20
- 8-processor Power 5 server 1.23 faster for SPECint_rate w/ SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 copies of each app had speedup between 0.89 and 1.41
- Most gained some
- Floating-point apps had most cache conflicts and least gains
41Head-to-Head ILP Competition
42Performance on SPECint2000
43Performance on SPECfp2000
44Normalized Performance Efficiency
45No Silver Bullet for ILP
- No obvious overall leader in performance
- AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
- Itanium 2 and Power5 clearly dominate Athlon and Pentium 4 on SPECFP
- Itanium 2 is most inefficient processor both for floating-point and integer code for all but one efficiency measure (SPECFP/Watt)
- Athlon and Pentium 4 both use transistors and area efficiently
- IBM Power5 is most effective user of energy on SPECFP, essentially tied on SPECINT
46Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock probably requires processor to
- Issue 3-4 data-memory accesses per cycle,
- Resolve 2-3 branches per cycle,
- Rename and access over 20 registers per cycle, and
- Fetch 12-24 instructions per cycle.
- Complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate
- E.g., widest-issue processor is Itanium 2
- It also has slowest clock rate
- Despite consuming the most power!
47Limits to ILP (cont'd)
- Most ways to increase performance also boost power consumption
- Key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
- Multiple-issue techniques are energy inefficient
- Incurs logic overhead that grows faster than issue rate
- Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate); performance = f(sustained rate); growing gap between peak and sustained performance => increasing energy per unit of performance
48Commentary
- Itanium is not significant breakthrough in scaling ILP or in avoiding problems of complexity and power consumption
- Instead of pursuing more ILP, architects turning to TLP using single-chip multiprocessors
- In 2000, IBM announced Power4, 1st commercial single-chip, general-purpose multiprocessor; has two Power3 processors and integrated L2 cache
- Sun Microsystems, AMD, and Intel have also switched focus from aggressive uniprocessors to single-chip multiprocessors
- Right balance of ILP and TLP is unclear today
- Maybe desktops (mostly single-threaded?) need different design than servers (can do lots of TLP)
49And in conclusion
- Limits to ILP (power efficiency, compilers, dependencies) seem to limit to 3 to 6 issue for practical options
- Explicitly parallel (data-level parallelism or thread-level parallelism) is next step to performance
- Coarse-grained vs. fine-grained multithreading
- Only on big stall vs. every clock cycle
- Simultaneous multithreading is fine-grained multithreading based on OOO superscalar microarchitecture
- Instead of replicating registers, reuse rename registers
- Itanium/EPIC/VLIW is not a breakthrough in ILP
- Balance of ILP and TLP decided in marketplace