CS136, Advanced Architecture - PowerPoint PPT Presentation

About This Presentation

Title:

CS136, Advanced Architecture

Description:

Assumptions for ideal/perfect machine to start: ... 2 & 3 no control dependencies; perfect speculation & an unbounded buffer of ... – PowerPoint PPT presentation

Slides: 50

Provided by: csH2

Learn more at: https://www.cs.hmc.edu

Transcript and Presenter's Notes

Title: CS136, Advanced Architecture

1
CS136, Advanced Architecture

Limits to ILP
Simultaneous Multithreading

2
Outline

Limits to ILP (another perspective)
Thread Level Parallelism
Multithreading
Simultaneous Multithreading
Power 4 vs. Power 5
Head to Head VLIW vs. Superscalar vs. SMT
Commentary
Conclusion

3
Limits to ILP

Conflicting studies of amount
Benchmarks (vectorized Fortran FP vs. integer C
programs)
Hardware sophistication
Compiler sophistication
How much ILP is available using existing
mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to keep
on processor performance curve?
Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
Motorola AltiVec: 128-bit ints and FPs
SuperSPARC multimedia ops, etc.

4
Overcoming Limits

Advances in compiler technology plus significantly
new and different hardware techniques may be able
to overcome limitations assumed in studies
However, unlikely such advances when coupled with
realistic hardware will overcome these limits in
near future

5
Limits to ILP

Initial HW model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all register WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted (returns, case statements)
2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses known & a load can be moved before a store provided addresses not equal
1 & 4 eliminates all but RAW
Also: perfect caches; 1 cycle latency for all instructions (FP *, /); unlimited instructions issued/clock cycle
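
The ideal-machine assumptions above can be sketched as a dataflow-limit calculation. This is a minimal illustration, not the model used in the study: the trace format (`(dest, sources)` register names) and the function name `ideal_ilp` are assumptions made for this example. With perfect renaming, prediction, and caches, an instruction waits only for the producers of its source values (RAW), every instruction takes 1 cycle, and any number may issue per cycle.

```python
def ideal_ilp(trace):
    """Return (cycles, ipc) for a trace of (dest, sources) tuples.

    Hypothetical trace format: dest is a register name (or None),
    sources is a list of register names. Only RAW dependences delay
    an instruction; WAW and WAR are ignored, as with perfect renaming.
    """
    if not trace:
        return 0, 0.0
    ready = {}        # register -> cycle at which its newest value is available
    last_cycle = 0
    for dest, sources in trace:   # walk the trace in program order
        # Issue as soon as every source value has been produced.
        issue = max((ready.get(s, 0) for s in sources), default=0)
        done = issue + 1          # unit latency for all instructions
        if dest is not None:
            ready[dest] = done    # later readers see this newest version
        last_cycle = max(last_cycle, done)
    return last_cycle, len(trace) / last_cycle


trace = [
    ("r1", ["r0"]),
    ("r2", ["r0"]),
    ("r3", ["r1", "r2"]),   # RAW on r1 and r2
    ("r1", ["r0"]),         # WAW/WAR on r1: no delay with renaming
    ("r4", ["r1"]),         # RAW on the new r1
]
print(ideal_ilp(trace))
```

Processing the trace in program order means each read picks up the most recent earlier write, which is what renaming guarantees in hardware.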

6
Limits to ILP: HW Model Comparison

7
Upper Limit to ILP: Ideal Machine (Figure 3.1)

[Figure: instructions per clock by benchmark. FP 75 - 150, integer 18 - 60.]

8
Limits to ILP: HW Model Comparison

9
More Realistic HW: Window Impact (Figure 3.2)

Change from infinite window to 2048, 512, 128, 32
[Figure: IPC by benchmark and window size. FP 9 - 150, integer 8 - 63.]
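
The effect of a finite window can be sketched with a small cycle-by-cycle simulation. This is a simplified model under stated assumptions, not the study's simulator: the window is taken to be the `window` oldest unissued instructions in program order, latency is 1 cycle, renaming is perfect, and the function name and trace format are hypothetical.

```python
def windowed_cycles(trace, window):
    """Cycles needed to execute a trace of (dest, sources) tuples
    when only the `window` oldest unissued instructions are candidates."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # producers[i]: indices of the instructions that write i's source values.
    last_writer = {}
    producers = []
    for i, (dest, sources) in enumerate(trace):
        producers.append([last_writer[s] for s in sources if s in last_writer])
        if dest is not None:
            last_writer[dest] = i
    finish = [None] * len(trace)        # cycle at which each result is available
    pending = list(range(len(trace)))   # unissued instructions, program order
    cycle = 0
    while pending:
        candidates = pending[:window]
        issued = [i for i in candidates
                  if all(finish[p] is not None and finish[p] <= cycle
                         for p in producers[i])]
        for i in issued:
            finish[i] = cycle + 1       # unit latency
        issued_set = set(issued)
        pending = [i for i in pending if i not in issued_set]
        cycle += 1
    return cycle


# A 4-instruction dependent chain followed by 4 independent instructions.
trace = [("a", ["a"])] * 4 + [("b%d" % k, []) for k in range(4)]
for w in (8, 2, 1):
    print(w, windowed_cycles(trace, w))
```

A small window cannot see the independent instructions behind the dependent chain, so cycles that a larger window would fill go unused.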

10
Limits to ILP: HW Model Comparison

11
More Realistic HW: Branch Impact (Figure 3.3)

Change from infinite window to 2048, and maximum issue of 64 instructions per clock cycle
[Figure: IPC by benchmark for each predictor: Perfect, Tournament, BHT (512), Profile, No prediction. FP 15 - 45, integer 6 - 12.]

12
Misprediction Rates
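
As an illustration of how a misprediction rate arises, here is a minimal sketch of a branch history table, assuming 512 entries of 2-bit saturating counters indexed by the low bits of the branch address. The counter width, the initial counter value, and the `(pc, taken)` input format are assumptions for this example, not details taken from the slides.

```python
def bht_misprediction_rate(branches, entries=512):
    """branches: iterable of (pc, taken) pairs. Returns the fraction mispredicted."""
    counters = [1] * entries      # 2-bit counters, start at "weakly not taken"
    mispredicted = 0
    total = 0
    for pc, taken in branches:
        index = pc % entries      # direct-mapped, no tags
        predict_taken = counters[index] >= 2
        if predict_taken != taken:
            mispredicted += 1
        # Saturating update toward the actual outcome.
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
        total += 1
    return mispredicted / total if total else 0.0


# One loop branch: taken 9 times, then falls through; the loop runs 10 times.
loop_branch = [(0x400, True)] * 9 + [(0x400, False)]
rate = bht_misprediction_rate(loop_branch * 10)
print(f"misprediction rate: {rate:.2f}")
```

The 2-bit counter mispredicts the loop exit once per loop execution, plus one extra miss while warming up.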

13
Limits to ILP: HW Model Comparison

14
More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5)

Change to 2048-instr window, 64-instr issue, 8K 2-level prediction
[Figure: IPC by benchmark and number of renaming registers: Infinite, 256, 128, 64, 32, None. FP 11 - 45, integer 5 - 15.]

15
Limits to ILP: HW Model Comparison

16
More Realistic HW: Memory Address Alias Impact (Figure 3.6)

Change: 2048-instr window, 64-instr issue, 8K 2-level prediction, 256 renaming registers
[Figure: IPC by benchmark and alias analysis: Perfect; Global/stack perfect, heap conflicts; Compiler inspection; None. FP 4 - 45 (Fortran, no heap), integer 4 - 9.]

17
Limits to ILP: HW Model Comparison

18
Realistic HW: Window Impact (Figure 3.7)

Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window
[Figure: IPC by benchmark and window size: Infinite, 256, 128, 64, 32, 16, 8, 4. FP 8 - 45, integer 6 - 12.]

19
Outline

Limits to ILP (another perspective)
Thread Level Parallelism
Multithreading
Simultaneous Multithreading
Power 4 vs. Power 5
Head to Head VLIW vs. Superscalar vs. SMT
Commentary
Conclusion

20
How to Exceed ILP Limits of This Study?

These are not laws of physics
Just practical limits for today
Could be overcome via research
Compiler and ISA advances could change results
WAR and WAW hazards through memory: study eliminated
WAW and WAR hazards through register renaming,
but not in memory usage
Can get conflicts via allocation of stack frames
Because called procedure reuses memory addresses
of previous stack frames

21
HW v. SW to increase ILP

Memory disambiguation: HW best
Speculation:
HW best when dynamic branch prediction better
than compile-time prediction
Exceptions easier for HW
HW doesn't need bookkeeping code or compensation
code
Very complicated to get right in SW
Scheduling: SW can look ahead to schedule better
Compiler independence: HW does not require new
compiler to run well

22
Performance Beyond Single-Thread ILP

Much higher natural parallelism in some
applications
Database or scientific codes
Explicit thread-level or data-level parallelism
Thread has own instructions and data
May be part of parallel program or independent
program
Each thread has all state (instructions, data,
PC, register state, and so on) needed to execute
Data-level parallelism: perform identical
operations on lots of data

23
Thread Level Parallelism (TLP)

ILP exploits implicit parallel operations within
loop or straight-line code segment
TLP explicitly represented by multiple threads of
execution that are inherently parallel
Goal: use multiple instruction streams to improve
Throughput of computers that run many programs
Execution time of multi-threaded programs
TLP could be more cost-effective to exploit than
ILP

24
Do Both ILP and TLP?

TLP and ILP exploit two different kinds of
parallel structure in a program
Could a processor oriented to ILP still exploit
TLP?
Functional units are often idle in data path
designed for ILP because of either stalls or
dependencies in the code
Could TLP be used as source of independent
instructions that might keep the processor busy
during stalls?
Could TLP be used to employ functional units that
would otherwise lie idle when insufficient ILP
exists?

25
New Approach: Multithreaded Execution

Multithreading: multiple threads share functional
units of 1 processor via overlapping
Processor must duplicate independent state of
each thread
Separate copy of register file, PC
Separate page table if different process
Memory sharing via virtual memory mechanisms
Already supports multiple processes
HW for fast thread switch
Must be much faster than full process switch
(which is 100s to 1000s of clocks)
When to switch?
Alternate instruction per thread (fine
grain): round robin
When thread is stalled (coarse grain)
E.g., cache miss

26
Fine-Grained Multithreading

Switches between threads on each instruction,
interleaving execution of multiple threads
Usually done round-robin, skipping stalled
threads
CPU must be able to switch threads every clock
Advantage: can hide both short and long stalls
Instructions from other threads always available
to execute
Easy to insert on short stalls
Disadvantage: slows individual threads
Thread ready to execute without stalls will be
delayed by instructions from other threads
Used on Sun's Niagara (will see later)
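
The round-robin policy described on this slide can be sketched as follows. It is a toy model under stated assumptions: one instruction issues per cycle, stalls are given up front as sets of cycle numbers per thread, and the function and parameter names are hypothetical.

```python
def fine_grained_schedule(num_threads, stalled, cycles):
    """Return, per cycle, the thread that issues (or None if all are stalled).

    stalled: dict mapping thread id -> set of cycles in which it cannot issue.
    """
    schedule = []
    next_thread = 0
    for cycle in range(cycles):
        chosen = None
        # Round robin starting after the last issuing thread, skipping stalled ones.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if cycle not in stalled.get(t, set()):
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % num_threads
    return schedule


# Thread 1 is stalled during cycles 1 to 3 (e.g., on a cache miss).
schedule = fine_grained_schedule(3, {1: {1, 2, 3}}, 6)
print(schedule)
```

The stalled thread's slots are taken by the other threads, so the stall is hidden, but a ready thread still issues only on its turn, which is the single-thread slowdown the slide mentions.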

27
Coarse-Grained Multithreading

Switches threads only on costly stalls
E.g., L2 cache misses
Advantages
Relieves need to have very fast thread switching
Doesn't slow thread
Other threads only issue instructions when main
one would stall (for long time) anyway
Disadvantage: pipeline startup costs make it hard
to hide throughput losses from shorter stalls
Pipeline must be emptied or frozen on stall,
since CPU issues instructions from only one
thread
New thread must fill pipe before instructions can
complete
Thus, better for reducing penalty of high-cost
stalls where pipeline refill time is much less than stall time
Used in IBM AS/400
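
The trade-off between stall length and pipeline refill can be put into a few lines of arithmetic. The refill cost and the stall lengths below are hypothetical numbers chosen for illustration, and the model assumes the switched-in thread does no useful work until the pipeline has refilled.

```python
def cycles_hidden(stall_cycles, refill_cycles):
    """Stall cycles another thread can fill after paying the pipeline refill."""
    return max(0, stall_cycles - refill_cycles)


REFILL = 10  # hypothetical pipeline refill cost, in cycles
for name, stall in [("L2 miss", 200), ("medium stall", 12), ("short stall", 5)]:
    hidden = cycles_hidden(stall, REFILL)
    print(f"{name}: {hidden} of {stall} stall cycles hidden ({hidden / stall:.0%})")
```

Switching pays off only when the stall is much longer than the refill, which is why coarse-grained designs switch on events such as L2 misses.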

28
Simultaneous Multithreading (SMT)

Simultaneous multithreading (SMT): insight that
dynamically scheduled processor already has many
HW mechanisms to support multithreading
Large set of virtual registers that can be used
to hold register sets for independent threads
Register renaming provides unique register
identifiers
Instructions from multiple threads can be mixed
in data path
Without confusing sources and destinations across
threads!
Out-of-order completion allows the threads to
execute out of order, and get better utilization
of the HW
Just add per-thread renaming table and keep
separate PCs
Independent commitment can be supported via
separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999,
"Compaq Chooses SMT for Alpha"
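
The idea of a per-thread renaming table over a shared pool of physical registers can be sketched as below. This is an illustrative model only: the class and method names are hypothetical, registers are never freed (a real design reclaims them at commit), and a first read of an architectural register simply binds it to a fresh physical register.

```python
class SmtRenamer:
    """Per-thread rename tables over one shared physical register pool."""

    def __init__(self, num_threads, num_physical):
        self.free = list(range(num_physical))                 # shared free list
        self.tables = [dict() for _ in range(num_threads)]    # one table per thread

    def _allocate(self):
        if not self.free:
            raise RuntimeError("out of physical registers")
        return self.free.pop(0)

    def rename(self, thread, dest, sources):
        """Return (physical dest, physical sources) for one instruction."""
        table = self.tables[thread]
        phys_sources = []
        for s in sources:
            if s not in table:            # first use: bind the initial value
                table[s] = self._allocate()
            phys_sources.append(table[s])
        phys_dest = self._allocate()      # every write gets a fresh register
        table[dest] = phys_dest
        return phys_dest, phys_sources


renamer = SmtRenamer(num_threads=2, num_physical=8)
a = renamer.rename(0, "r1", ["r2"])   # thread 0: r1 <- f(r2)
b = renamer.rename(1, "r1", ["r2"])   # thread 1 uses the same names
c = renamer.rename(0, "r3", ["r1"])   # thread 0 reads its own r1
print(a, b, c)
```

Both threads use the architectural names `r1` and `r2`, yet they map to disjoint physical registers, so their instructions can be mixed in the data path without confusing sources and destinations.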

29
Simultaneous Multithreading ...

[Figure: functional-unit usage per cycle for "One thread, 8 units" vs. "Two threads, 8 units"; columns M, M, FX, FX, FP, FP, BR, CC; rows are cycles.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

30
Multithreaded Categories

[Figure: issue-slot usage over time (processor cycle) for Superscalar, Fine-Gr., Coarse-Gr., Multiprocessing, and SMT; legend: Thread 1 to Thread 5, Idle slot.]

31
Design Challenges in SMT

What is impact on single-thread performance?
Preferred-thread approach
Sacrifices neither throughput nor single-thread
performance?
Nope: processor will sacrifice some throughput
when preferred thread stalls
Larger register file needed to hold multiple
contexts
Must not affect clock cycle, especially in
Instruction issue: more candidate instructions to
consider
Instruction completion: hard to choose which to
commit
Must ensure that cache and TLB conflicts caused
by SMT don't degrade performance

32
Digression: Covert Channels

Imagine you're a spy with account on Knuth
Want to communicate a secret to Geoff
Secret is reasonably small
FBI is watching your account and your e-mail
Solution: process spawning
Once a second, Geoff spawns process
Records own PID, waits 10 ms, forks, records
child PID
Once a second, you send one bit of information
If bit is zero, you do nothing
If bit is one, you spawn processes as fast as
possible
If Geoff sees big PID gap, he records 1, else
0
Many variations on this basic idea
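
The PID-gap scheme can be illustrated with a toy simulation that does not spawn any real processes. It assumes PIDs are handed out sequentially and that nobody else on the machine creates processes; the starting PID, burst size, and threshold are hypothetical values chosen for this sketch.

```python
def simulate_pid_channel(bits, burst=50, threshold=10):
    """Simulate sending `bits` through PID gaps; return the bits the receiver decodes."""
    next_pid = 1000                      # hypothetical next PID to be assigned
    received = []
    for bit in bits:                     # one round per second
        own_pid = next_pid               # receiver's process records its own PID
        next_pid += 1
        if bit:
            next_pid += burst            # sender spawns processes during the 10 ms wait
        child_pid = next_pid             # receiver forks and records the child PID
        next_pid += 1
        gap = child_pid - own_pid
        received.append(1 if gap > threshold else 0)
    return received


message = [1, 0, 1, 1, 0]
print(simulate_pid_channel(message))
```

On a real system other activity adds noise to the gap, so the threshold and burst size would need tuning; the sketch only shows why a shared counter leaks information.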

33
Covert-Channel Attacks on Crypto

Most (not all) crypto code behaves differently on
1 bit in key vs. 0 bit
Runs longer or shorter
Uses more or less power
Accesses different memory
Etc.
Usually called information leakage
Has been successfully used in lab to crack strong
crypto
Even recovering some bits from key makes
brute-force attack practical for getting
remainder
Some modern implementations try to fight by doing
wasted work on shorter path of if, etc.
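
A common textbook example of key-dependent behavior is square-and-multiply modular exponentiation, sketched here together with a variant that does the "wasted work" on the shorter path. The function names are hypothetical, multiplication counts stand in for running time, and the balanced variant is only an illustration: it still branches on the key bit, so it should not be taken as a hardened implementation.

```python
def modexp_leaky(base, exponent, modulus):
    """Left-to-right square-and-multiply. Returns (result, multiplications)."""
    result, mults = 1, 0
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        mults += 1
        if bit == "1":                     # extra work only for 1 bits
            result = (result * base) % modulus
            mults += 1
    return result, mults


def modexp_balanced(base, exponent, modulus):
    """Same result, but the multiply is always performed."""
    result, mults = 1, 0
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        mults += 1
        product = (result * base) % modulus   # done for 0 bits too (wasted work)
        mults += 1
        if bit == "1":
            result = product
    return result, mults


M = 1000003
for key in (0b10001, 0b11111):             # same length, different number of 1 bits
    print(bin(key), modexp_leaky(7, key, M)[1], modexp_balanced(7, key, M)[1])
```

In the leaky version the operation count depends on how many key bits are 1; in the balanced version it depends only on the key length.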

34
SMT Attack on SSH

On SMT machine, lower-priority thread's execution
rate depends on higher-priority one's
instructions
More stalls in top thread mean more speed in
bottom one
Stalls vary depending on what crypto code is
doing
Operates at very low level
Thus much harder to defend against
Successful attack on ssh keys has been
demonstrated in lab
Best known defense: don't do SMT
Careful coding of crypto could probably also work
Note that this also applies to things like cache
and TLB
Lots of ways to leak information unintentionally!

35
Power 4
Single-threaded predecessor to Power 5. 8
execution units in out-of-order engine; each can
issue an instruction each cycle.
36
Power 4

[Figure: Power 4 and Power 5 pipelines, annotated "2 fetch (PC), 2 initial decodes" and "2 commits (architected register sets)".]
37
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared
resources (physical registers, cache, memory
bandwidth) would be prone to bottleneck
38
Power 5 thread performance ...
Relative priority of each thread controllable in
hardware.
For balanced operation, both threads run slower
than if they owned the machine.
39
Changes in Power 5 to support SMT

Increased associativity of L1 instruction cache
and instruction address translation buffers
Added per-thread load and store queues
Increased size of L2 (1.92 vs. 1.44 MB) and L3
caches
Added separate instruction prefetch and buffering
per thread
Increased virtual registers from 152 to 240
Increased size of several issue queues
Power5 core is about 24% larger than Power4
because of SMT support

40
Initial Performance of SMT

Pentium 4 Extreme SMT yields 1.01 speedup for
SPECint_rate benchmark and 1.07 for SPECfp_rate
Pentium 4 is dual-threaded SMT
SPECRate requires each benchmark to be run
against vendor-selected number of copies of same
benchmark
Pairing each of 26 SPEC benchmarks with every
other on Pentium 4 (26^2 runs) gives speedups from
0.90 to 1.58; average was 1.20
8-processor Power 5 server: 1.23 faster for
SPECint_rate w/ SMT, 1.16 faster for SPECfp_rate
Power 5 running 2 copies of each app had speedup
between 0.89 and 1.41
Most gained some
Floating-point apps had most cache conflicts and
least gains

41
Head-to-Head ILP Competition
42
Performance on SPECint2000
43
Performance on SPECfp2000
44
Normalized Performance Efficiency
45
No Silver Bullet for ILP

No obvious overall leader in performance
AMD Athlon leads on SPECInt performance, followed
by the Pentium 4, Itanium 2, and Power5
Itanium 2 and Power5 clearly dominate Athlon and
Pentium 4 on SPECFP
Itanium 2 is most inefficient processor both for
floating-point and integer code for all but one
efficiency measure (SPECFP/Watt)
Athlon and Pentium 4 both use transistors and
area efficiently
IBM Power5 is most effective user of energy on
SPECFP, essentially tied on SPECINT

46
Limits to ILP

Doubling issue rates above today's 3-6
instructions per clock probably requires
processor to
Issue 3-4 data-memory accesses per cycle,
Resolve 2-3 branches per cycle,
Rename and access over 20 registers per cycle,
and
Fetch 12-24 instructions per cycle.
Complexity of implementing these capabilities is
likely to mean sacrifices in maximum clock rate
E.g., widest-issue processor is Itanium 2
It also has slowest clock rate
Despite consuming the most power!

47
Limits to ILP (cont'd)

Most ways to increase performance also boost
power consumption
Key question is energy efficiency: does a method
increase power consumption faster than it boosts
performance?
Multiple-issue techniques are energy inefficient
Incurs logic overhead that grows faster than
issue rate
Growing gap between peak issue rates and
sustained performance
Number of transistors switching = f(peak issue
rate), performance = f(sustained rate); growing
gap between peak and sustained performance =>
increasing energy per unit of performance
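
The relation between peak issue rate, sustained rate, and energy per unit of performance can be made concrete with a rough proportional model. Everything numeric here is hypothetical: the sketch assumes switching activity per cycle scales as the peak issue rate raised to an `overhead_exponent` (1.0 means linear; the slide argues the real growth is faster), and the issue rates and IPC values are invented for illustration.

```python
def relative_energy_per_instruction(peak_issue, sustained_ipc, overhead_exponent=1.0):
    """Relative energy per completed instruction (arbitrary units).

    Assumed model: switching per cycle ~ peak_issue ** overhead_exponent,
    useful work per cycle = sustained_ipc.
    """
    switching_per_cycle = peak_issue ** overhead_exponent
    return switching_per_cycle / sustained_ipc


# Hypothetical designs: doubling peak issue raises sustained IPC only modestly.
narrow = relative_energy_per_instruction(peak_issue=4, sustained_ipc=2.0)
wide = relative_energy_per_instruction(peak_issue=8, sustained_ipc=2.5)
print(narrow, wide)
```

Under these assumed numbers the wider design spends more energy per instruction even with linear overhead; an exponent above 1.0 widens the gap further.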

48
Commentary

Itanium is not significant breakthrough in
scaling ILP or in avoiding problems of complexity
and power consumption
Instead of pursuing more ILP, architects turning
to TLP using single-chip multiprocessors
In 2000, IBM announced Power4, 1st commercial
single-chip, general-purpose multiprocessor; has
two Power3 processors and integrated L2 cache
Sun Microsystems, AMD, and Intel have also
switched focus from aggressive uniprocessors to
single-chip multiprocessors
Right balance of ILP and TLP is unclear today
Maybe desktops (mostly single-threaded?) need
different design than servers (can do lots of TLP)

49
And in conclusion

Limits to ILP (power efficiency, compilers,
dependencies, ...) seem to limit to 3 to 6 issue for
practical options
Explicitly parallel (data-level parallelism or
thread-level parallelism) is next step to
performance
Coarse-grained vs. fine-grained multithreading
Only on big stall vs. every clock cycle
Simultaneous multithreading is fine-grained
multithreading based on OOO superscalar
microarchitecture
Instead of replicating registers, reuse rename
registers
Itanium/EPIC/VLIW is not a breakthrough in ILP
Balance of ILP and TLP decided in marketplace