Title: Lecture 24: Instruction Level Parallelism
1 Lecture 24: Instruction Level Parallelism
- Computer Engineering 585
- Fall 2001
2 Limits to Multi-Issue Machines
- Inherent limitations of ILP
- 1 branch in 5: How to keep a 5-way VLIW busy?
- Latencies of units: many operations must be scheduled.
- Need as many independent operations as Pipeline Depth x No. Function Units to keep machines busy, e.g. 5 x 4 = 15-20 independent instructions?
- Difficulties in building HW
- Easy: More instruction bandwidth.
- Easy: Duplicate FUs to get parallel execution.
- Hard: Increase ports to Register File (bandwidth).
- VLIW example needs 7 read and 3 write ports for the Int. Reg. file, and 5 read and 3 write ports for the FP reg file.
- Harder: Increase ports to memory (bandwidth).
- Decoding in Superscalar and its impact on clock rate, pipeline depth?
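The rule of thumb on this slide is plain arithmetic. A minimal sketch follows; the function and parameter names are chosen here for illustration and are not from the lecture.

```python
def independent_ops_needed(pipeline_depth: int, functional_units: int) -> int:
    """Operations that must be in flight to fill every stage of every unit."""
    return pipeline_depth * functional_units


def instructions_per_branch(branch_fraction: float) -> float:
    """Average run of instructions ending in a branch, given the branch frequency."""
    return 1.0 / branch_fraction
```

With a depth of 5 and 4 function units, `independent_ops_needed(5, 4)` gives 20, the upper end of the slide's 15-20 estimate. With 1 branch in 5, `instructions_per_branch(0.2)` gives 5, so a 5-way VLIW would meet roughly one branch per issue packet.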
3 Limits to Multi-Issue Machines
- Limitations specific to either Superscalar or VLIW implementation
- Decode issue in Superscalar: how wide is practical?
- VLIW code size: unrolled loops + wasted fields in VLIW.
- IA-64 compresses dependent instructions, but still larger.
- VLIW lock step => 1 hazard and all instructions stall.
- IA-64 not lock step? Dynamic pipeline?
- VLIW binary compatibility is a practical weakness, as the number of FUs and their latencies vary over time.
- IA-64 provides binary compatibility.
4 Limits to ILP
- Conflicting studies of amount of parallelism available in late 1980s and early 1990s. Different assumptions about:
- Benchmarks (vectorized Fortran FP vs. integer C programs).
- Hardware sophistication.
- Compiler sophistication.
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
5 Limits to ILP
- Initial HW Model here; MIPS compilers.
- Assumptions for ideal/perfect machine to start:
- 1. Register renaming: infinite virtual registers, and all WAW & WAR hazards are avoided.
- 2. Branch prediction: perfect; no mispredictions.
- 3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available.
- 4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal.
- 1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle.
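Under these assumptions, only true (RAW) dependences limit the ideal machine, so its IPC is the instruction count divided by the length of the longest dependence chain. The sketch below is an illustration under stated assumptions, not the study's simulator: the trace format `(dest, sources)` and the name `ideal_ipc` are hypothetical, and memory dependences are ignored.

```python
from typing import List, Sequence, Tuple

# One instruction: (destination register, source registers).
Instr = Tuple[str, Sequence[str]]


def ideal_ipc(trace: List[Instr]) -> float:
    """IPC of a trace limited only by RAW dependences.

    Assumes perfect renaming (no WAW/WAR stalls), perfect prediction,
    1-cycle latency and unlimited issue width.
    """
    ready = {}        # register -> cycle at which its newest value is available
    last_cycle = 0
    for dest, srcs in trace:
        # An instruction can issue as soon as all its sources are produced.
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        done = issue + 1              # 1-cycle latency
        ready[dest] = done            # later readers see the newest value
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle if last_cycle else 0.0
```

For example, two independent two-instruction chains, `[("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r0"]), ("r4", ["r3"])]`, finish in 2 cycles, giving an IPC of 2.0. A single chain of four dependent instructions gives 1.0.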
6 Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
[Figure: IPC per benchmark on the ideal machine. FP: 75 - 150; Integer: 18 - 60.]
7 More Realistic HW: Branch Impact (Figure 4.40, page 323)
- Change from an infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle.
[Figure: IPC per benchmark for each branch prediction scheme: Perfect; Pick Cor. or BHT; BHT (512); Profile; No prediction. FP: 15 - 45; Integer: 6 - 12.]
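8 Selective History Predictor
[Diagram: The branch address indexes a non-correlating predictor (8096 x 2 bits) and an 8K x 2-bit selector. The branch address plus 2 bits of global history (00, 01, 10, 11) index a correlating predictor (2048 x 4 x 2 bits). In each 2-bit predictor entry, 11 and 10 mean Taken; 01 and 00 mean Not Taken. The selector's 2-bit state (11, 10, 01, 00) chooses the correlator or the non-correlator, whose prediction becomes the final Taken/Not Taken output.]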
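The structure in this diagram can be sketched in software as two component predictors plus a selector. This is an illustration under assumptions rather than the hardware described in the lecture: the class name, the modulo indexing, the initial counter values and the selector encoding (2 or 3 means "use the correlator") are all chosen here, and the table sizes are parameters, with 8192 used as a default in place of the slide's 8096.

```python
class SelectivePredictor:
    """A per-address 2-bit predictor and a correlating predictor with
    2 bits of global history, arbitrated by a 2-bit selector."""

    def __init__(self, local_entries=8192, corr_entries=2048, sel_entries=8192):
        self.local = [1] * local_entries                    # 2-bit counters, 0..3
        self.corr = [[1] * 4 for _ in range(corr_entries)]  # one counter per 2-bit history
        self.selector = [1] * sel_entries                   # >= 2: use correlator (assumed encoding)
        self.history = 0                                    # last two branch outcomes

    def _indices(self, pc):
        return pc % len(self.local), pc % len(self.corr), pc % len(self.selector)

    def predict(self, pc):
        li, ci, si = self._indices(pc)
        local_pred = self.local[li] >= 2
        corr_pred = self.corr[ci][self.history] >= 2
        return corr_pred if self.selector[si] >= 2 else local_pred

    def update(self, pc, taken):
        li, ci, si = self._indices(pc)
        local_ok = (self.local[li] >= 2) == taken
        corr_ok = (self.corr[ci][self.history] >= 2) == taken
        # Train the selector only when exactly one component was right.
        if corr_ok and not local_ok:
            self.selector[si] = min(3, self.selector[si] + 1)
        elif local_ok and not corr_ok:
            self.selector[si] = max(0, self.selector[si] - 1)
        step = 1 if taken else -1
        self.local[li] = min(3, max(0, self.local[li] + step))
        self.corr[ci][self.history] = min(3, max(0, self.corr[ci][self.history] + step))
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

A branch that strictly alternates taken and not taken defeats the per-address counter, but the correlating component learns it from the global history, and the selector then settles on the correlator for that branch.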
9 More Realistic HW: Register Impact (Figure 4.44, page 328)
- Change: 2000 instr window, 64 instr issue, 8K 2-level prediction.
[Figure: IPC per benchmark vs. number of renaming registers: Infinite, 256, 128, 64, 32, None. FP: 11 - 45; Integer: 5 - 15.]
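Renaming removes WAW and WAR hazards by giving every result its own physical register. The sketch below assumes an unbounded supply of physical registers, which corresponds to the "Infinite" case only. The trace format and the names `rename` and `p0, p1, ...` are hypothetical.

```python
from itertools import count
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[str, Sequence[str]]   # (destination, sources), architectural names


def rename(trace: List[Instr]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Rewrite a trace so that each result gets a fresh physical register."""
    fresh = (f"p{i}" for i in count())   # unbounded register supply (assumption)
    mapping: Dict[str, str] = {}         # architectural -> current physical register
    renamed = []
    for dest, srcs in trace:
        phys_srcs = []
        for s in srcs:
            if s not in mapping:         # live-in value: allocate a name on first use
                mapping[s] = next(fresh)
            phys_srcs.append(mapping[s])
        mapping[dest] = next(fresh)      # new name, so no WAW/WAR conflict on dest
        renamed.append((mapping[dest], tuple(phys_srcs)))
    return renamed
```

After renaming, the only remaining links between instructions are RAW dependences: a source names exactly the physical register written by its producer. A finite register pool would add stalls whenever no free register is available, which is the effect the figure measures.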
10 More Realistic HW: Alias Impact (Figure 4.46, page 330)
- Change: 2000 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers.
[Figure: IPC per benchmark for each alias analysis model: Perfect; Global/Stack perfect, heap conflicts; Inspec. Assem.; None. FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.]
11 Realistic HW for '9X: Window Impact (Figure 4.48, page 332)
- Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window.
[Figure: IPC per benchmark vs. window size: Infinite, 256, 128, 64, 32, 16, 8, 4. FP: 8 - 45; Integer: 6 - 12.]
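The effect of a finite window can be illustrated with a small scheduling model: each cycle, only the `window` oldest unissued instructions are examined, and every one whose sources are ready issues. This is a sketch under assumptions, not the model behind the figure. The trace is assumed to be in renamed (single-assignment) form, latency is 1 cycle, prediction is perfect, and the name `windowed_ipc` is hypothetical.

```python
from typing import List, Sequence, Tuple

Instr = Tuple[str, Sequence[str]]   # (destination, sources)


def windowed_ipc(trace: List[Instr], window: int) -> float:
    """IPC when only the `window` oldest unissued instructions may issue."""
    if window < 1:
        raise ValueError("window must be at least 1")
    producers = {dest for dest, _ in trace}   # registers written inside the trace
    done = {}                                 # dest -> first cycle its value is usable
    pending = list(range(len(trace)))
    cycle = 0
    while pending:
        issued = []
        for i in pending[:window]:
            srcs = trace[i][1]
            # A source is ready if it is a live-in or was produced in an earlier cycle.
            if all(s not in producers or done.get(s, cycle + 1) <= cycle for s in srcs):
                issued.append(i)
        if not issued:
            raise ValueError("trace is not in renamed (single-assignment) form")
        for i in issued:
            done[trace[i][0]] = cycle + 1     # 1-cycle latency
        issued_set = set(issued)
        pending = [i for i in pending if i not in issued_set]
        cycle += 1
    return len(trace) / cycle if cycle else 0.0
```

With eight mutually independent instructions, a window of 4 yields an IPC of 4.0 and a window of 8 yields 8.0, while a chain of dependent instructions stays at 1.0 regardless of window size.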
12 Three 1996-Era Machines
               Alpha 21164     PPro             HP PA-8000
Year           1995            1995             1996
Clock          400 MHz         200 MHz          180 MHz
Cache          8K/8K/96K/2M    8K/8K/0.5M       0/0/2M
Issue rate     2 int + 2 FP    3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename regs    none            40               56
13 SPECint95base Performance (July 1996)
14 SPECfp95base Performance (July 1996)
15 Three 1997-Era Machines
               Alpha 21164      Pentium II        HP PA-8000
Year           1995             1996              1996
Clock          600 MHz ('97)    300 MHz ('97)     236 MHz ('97)
Cache          8K/8K/96K/2M     16K/16K/0.5M      0/0/4M
Issue          2 int + 2 FP     3 instr (x86)     4 instr
Pipe stages    7-9              12-14             7-9
Out-of-Order   6 loads          40 instr (µop)    56 instr
Rename         none             40                56
16 Three 2000-1 Era Machines
               Alpha 21364      Power4              Pentium 4
Year           2000             2001                2000-1
Clock          1 GHz ('01?)     >1 GHz (2001)       2 GHz (2001)
Cache          64K/64K/1.75M    32K/64K/1.5M/32M    12K microops trace cache/8K(D)/256K
Issue          2 int + 2 FP     8 inst.             6 inst.
Pipe stages    7-9              15-20               20
Out-of-Order   6 loads          200 inst.           126 inst.
Rename         none             >200                128
17 SPECint95base Performance (Oct. 1997)
18 SPECfp95base Performance (Oct. 1997)
19 Summary
- Branch Prediction
- Branch History Table: 2 bits for loop accuracy.
- Recently executed branches correlated with next branch?
- Branch Target Buffer: include branch address & prediction.
- Predicated Execution can reduce number of branches, number of mispredicted branches.
- Speculation: Out-of-order execution, In-order commit (reorder buffer).
- SW Pipelining
- Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead.
- Superscalar and VLIW: CPI < 1 (IPC > 1)
- Dynamic issue vs. Static issue.
- More instructions issue at same time => larger hazard penalty.
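The "2 bits for loop accuracy" point can be checked with a small experiment comparing a 1-bit and a 2-bit saturating counter on a loop-closing branch. This is a minimal sketch; the function name, the initial counter state and the loop shape (9 taken, 1 not taken, repeated 10 times) are assumptions chosen for illustration.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with `bits` bits."""
    top = (1 << bits) - 1
    threshold = (top + 1) // 2        # predict taken when counter >= threshold
    counter = top                     # start in the strongest "taken" state
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# Loop branch: taken 9 times, then not taken at loop exit; the loop runs 10 times.
loop = ([True] * 9 + [False]) * 10

one_bit = mispredictions(loop, bits=1)   # 19: wrong at every exit and every re-entry
two_bit = mispredictions(loop, bits=2)   # 10: wrong only at each exit
```

The 1-bit scheme mispredicts twice per loop execution after the first, because the single not-taken exit flips its state. The 2-bit counter needs two consecutive misses to change its prediction, so it misses only the exit.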
20 Review: Who Cares About the Memory Hierarchy?
- Processor Only Thus Far in Course
- CPU cost/performance, ISA, Pipelined Execution.
- CPU-DRAM Gap
- 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: Performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap grows 50% / year.]
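Reading the chart's rates as 60% per year for the processor and 7% per year for DRAM, the 50% per year gap growth follows from dividing the two growth factors. A short sketch of that arithmetic, with hypothetical function names:

```python
def gap_growth(cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Yearly growth factor of the processor/DRAM performance ratio."""
    return (1 + cpu_rate) / (1 + dram_rate)


def gap_after(years: int, cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Performance ratio after `years`, starting from a ratio of 1."""
    return gap_growth(cpu_rate, dram_rate) ** years
```

`gap_growth()` is 1.60 / 1.07, about 1.495, i.e. roughly 50% per year. Compounded over a decade, `gap_after(10)` comes to a ratio in the mid-50s.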
21 Processor-Memory Performance Gap "Tax"
Processor          % Area (cost)    % Transistors (power)
Alpha 21164        37%              77%
StrongArm SA110    61%              94%
Pentium Pro        64%              88%
- Pentium Pro: 2 dies per package
- Caches have no inherent value, only try to close performance gap.