Lecture 24: Instruction Level Parallelism

Slides: 22

Provided by: Akhiles5

1
Lecture 24: Instruction Level Parallelism

Computer Engineering 585
Fall 2001

2
Limits to Multi-Issue Machines

Inherent limitations of ILP:
  1 branch in 5: how to keep a 5-way VLIW busy?
  Latencies of units: many operations must be scheduled.
  Need as many independent operations as Pipeline Depth x No. Function Units to keep machines busy, e.g. 5 x 4, i.e. about 15-20 independent instructions?
Difficulties in building HW:
  Easy: more instruction bandwidth.
  Easy: duplicate FUs to get parallel execution.
  Hard: increase ports to the register file (bandwidth).
    VLIW example needs 7 read and 3 write ports for the integer register file, and 5 read and 3 write ports for the FP register file.
  Harder: increase ports to memory (bandwidth).
  Decoding in superscalar and its impact on clock rate, pipeline depth?

3
Limits to Multi-Issue Machines

Limitations specific to either Superscalar or VLIW implementation:
  Decode issue in superscalar: how wide is practical?
  VLIW code size: unrolled loops + wasted fields in VLIW.
    IA-64 compresses dependent instructions, but code is still larger.
  VLIW lock step => 1 hazard & all instructions stall.
    IA-64 not lock step? Dynamic pipeline?
  VLIW binary compatibility is a practical weakness, as the number of FUs and their latencies vary over time.
    IA-64 provides binary compatibility.

4
Limits to ILP

Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
  Benchmarks (vectorized Fortran FP vs. integer C programs).
  Hardware sophistication.
  Compiler sophistication.
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?

5
Limits to ILP

Initial HW model here; MIPS compilers.
Assumptions for the ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal.
1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle.
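
On such an ideal machine only true (RAW) data dependences limit scheduling, so IPC is the instruction count divided by the length of the longest dependence chain. The following is a minimal sketch of that calculation, not the methodology of the study behind the figures; the (dest, sources) trace format and the name ideal_ipc are assumptions made for illustration.

```python
def ideal_ipc(instrs):
    """IPC of a trace on the ideal machine described above.

    instrs: list of (dest, [sources]) register names in program order
    (a hypothetical trace format). Assumes perfect renaming, perfect
    prediction, 1-cycle latency and unlimited issue, so only RAW
    dependences constrain the schedule.
    """
    ready = {}   # register -> cycle in which its latest value is produced
    depth = 0    # length of the longest dependence chain seen so far
    for dest, srcs in instrs:
        # Issue in the first cycle after every producer has executed.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle   # renaming: a later write just supersedes this one
        depth = max(depth, cycle)
    return len(instrs) / depth if instrs else 0.0


# Hypothetical trace: two independent dependence chains of length 3.
trace = [
    ("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]),
    ("r4", ["r0"]), ("r5", ["r4"]), ("r6", ["r5"]),
]
print(ideal_ipc(trace))  # 6 instructions / 3 cycles = 2.0
```

Reusing a register name does not lengthen the schedule here, which is the effect of infinite renaming registers: WAW and WAR hazards never add a cycle.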

6
Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)

[Figure: IPC per benchmark on the ideal machine. FP: 75 - 150; Integer: 18 - 60.]

7
More Realistic HW: Branch Impact (Figure 4.40, page 323)

Change from an infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle.

[Figure: IPC per benchmark for each branch predictor: Perfect, Pick Cor. or BHT, BHT (512), Profile, No prediction. FP: 15 - 45; Integer: 6 - 12.]

8
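Selective History Predictor

[Diagram: selective history predictor.
Branch Addr indexes a non-correlating table of 2-bit counters (labeled "8096 x 2 bits") and an 8K x 2 bit selector.
Branch Addr plus 2 bits of global history (00, 01, 10, 11) index a correlating table of 2048 x 4 x 2 bits.
Selector states 11, 10, 01, 00 choose between the non-correlator and the correlator.
Counter states: 11, 10 = Taken; 01, 00 = Not Taken. The output is 1/0 (Taken/Not Taken).]
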
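
The selection scheme in this diagram can be sketched in a few lines. This is an illustrative model under stated assumptions, not the exact hardware of the slide: the class names, the modulo indexing, the initial counter values, and the selector polarity (states 10 and 11 choose the non-correlator) are all assumptions, and the non-correlating table is sized at 8K = 8192 entries.

```python
class CounterTable:
    """Table of 2-bit saturating counters; states 2 and 3 (10, 11) mean 'taken'."""

    def __init__(self, entries, init=1):
        self.counters = [init] * entries

    def predict(self, index):
        return self.counters[index] >= 2

    def update(self, index, increment):
        c = self.counters[index]
        self.counters[index] = min(3, c + 1) if increment else max(0, c - 1)


class SelectivePredictor:
    def __init__(self, entries=8192, corr_sets=2048, history_bits=2):
        self.history_bits = history_bits
        self.corr_sets = corr_sets
        self.history = 0                                     # global history register
        self.local = CounterTable(entries)                   # non-correlating table
        self.corr = CounterTable(corr_sets << history_bits)  # 2048 x 4 counters
        # Assumed polarity: selector state >= 2 picks the non-correlator.
        self.selector = CounterTable(entries, init=2)

    def _indices(self, pc):
        local_index = pc % len(self.local.counters)
        corr_index = ((pc % self.corr_sets) << self.history_bits) | self.history
        return local_index, corr_index

    def predict(self, pc):
        li, ci = self._indices(pc)
        if self.selector.predict(li):
            return self.local.predict(li)
        return self.corr.predict(ci)

    def update(self, pc, taken):
        li, ci = self._indices(pc)
        local_ok = self.local.predict(li) == taken
        corr_ok = self.corr.predict(ci) == taken
        if local_ok != corr_ok:
            # Train the selector toward whichever component was right.
            self.selector.update(li, local_ok)
        self.local.update(li, taken)
        self.corr.update(ci, taken)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_correct(predictor, pc, outcomes):
    """Predict, then train, on each outcome; return the number of correct predictions."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct


# Hypothetical branch at address 0x40 that alternates taken / not taken.
alternating = [k % 2 == 0 for k in range(200)]
print(count_correct(SelectivePredictor(), 0x40, alternating), "of", len(alternating))
```

In this sketch the non-correlating counter alone mispredicts the alternating pattern, while two bits of global history separate the two cases, so the selector moves to the correlator after a few warm-up mispredictions.
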
9
More Realistic HW: Register Impact (Figure 4.44, page 328)

Change: 2000 instr window, 64 instr issue, 8K 2-level prediction.

[Figure: IPC per benchmark by number of renaming registers: Infinite, 256, 128, 64, 32, None. FP: 11 - 45; Integer: 5 - 15.]

10
More Realistic HW: Alias Impact (Figure 4.46, page 330)

Change: 2000 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers.

[Figure: IPC per benchmark by alias analysis model: Perfect; Global/Stack perfect, heap conflicts; Inspection of assembly; None. FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.]

11
Realistic HW for '9X: Window Impact (Figure 4.48, page 332)

Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window.

[Figure: IPC per benchmark by window size: Infinite, 256, 128, 64, 32, 16, 8, 4. FP: 8 - 45; Integer: 6 - 12.]

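
Why the window size matters can be seen with a toy scheduler. The sketch below is a simplification under stated assumptions, not the simulator behind the figure: the window is modeled as the oldest `window` un-issued instructions in program order, latency is 1 cycle, only RAW dependences count, and the trace format and the name windowed_ipc are hypothetical.

```python
def windowed_ipc(instrs, window, issue_width):
    """IPC when only the oldest `window` un-issued instructions can be examined."""
    # Resolve each source register to the index of its most recent producer.
    last_writer, producers = {}, []
    for i, (dest, srcs) in enumerate(instrs):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    issued_at = {}                       # instruction index -> issue cycle
    pending = list(range(len(instrs)))   # un-issued instructions, program order
    cycle = 0
    while pending:
        cycle += 1
        # Ready = every producer issued in an earlier cycle (1-cycle latency).
        ready = [i for i in pending[:window]
                 if all(issued_at.get(p, cycle) < cycle for p in producers[i])]
        for i in ready[:issue_width]:
            issued_at[i] = cycle
        pending = [i for i in pending if i not in issued_at]
    return len(instrs) / cycle if cycle else 0.0


# Hypothetical trace: two independent chains of length 3, one after the other.
trace = [
    ("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]),
    ("r4", ["r0"]), ("r5", ["r4"]), ("r6", ["r5"]),
]
for w in (1, 2, 6):
    print(w, windowed_ipc(trace, window=w, issue_width=64))
```

With a window of 6 both chains overlap fully (3 cycles, IPC 2.0); with a window of 2 the second chain is not visible until the first has mostly drained (5 cycles, IPC 1.2); a window of 1 degenerates to in-order single issue (IPC 1.0).
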
12
3 1996 Era Machines

               Alpha 21164     PPro             HP PA-8000
Year           1995            1995             1996
Clock          400 MHz         200 MHz          180 MHz
Cache          8K/8K/96K/2M    8K/8K/0.5M       0/0/2M
Issue rate     2 int + 2 FP    3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename regs    none            40               56

13
SPECint95base Performance (July 1996)
14
SPECfp95base Performance (July 1996)
15
3 1997 Era Machines

               Alpha 21164     Pentium II       HP PA-8000
Year           1995            1996             1996
Clock          600 MHz ('97)   300 MHz ('97)    236 MHz ('97)
Cache          8K/8K/96K/2M    16K/16K/0.5M     0/0/4M
Issue          2 int + 2 FP    3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename         none            40               56

16
3 2000-1 Era Machines

               Alpha 21364     Power4             Pentium 4
Year           2000            2001               2000-1
Clock          1 GHz ('01?)    >1 GHz (2001)      2 GHz (2001)
Cache          64K/64K/1.75M   32K/64K/1.5M/32M   12K micro-ops trace cache/8K (D)/256K
Issue          2 int + 2 FP    8 inst.            6 inst.
Pipe stages    7-9             15-20              20
Out-of-Order   6 loads         200 inst.          126 inst.
Rename         none            >200               128

17
SPECint95base Performance (Oct. 1997)
18
SPECfp95base Performance (Oct. 1997)
19
Summary

Branch Prediction
  Branch History Table: 2 bits for loop accuracy.
  Recently executed branches correlated with next branch?
  Branch Target Buffer: include branch address & prediction.
  Predicated Execution can reduce number of branches, number of mispredicted branches.
Speculation: Out-of-order execution, In-order commit (reorder buffer).
SW Pipelining
  Symbolic loop unrolling to get most from pipeline with little code expansion, little overhead.
Superscalar and VLIW: CPI < 1 (IPC > 1)
  Dynamic issue vs. static issue.
  More instructions issue at same time => larger hazard penalty.
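
The "2 bits for loop accuracy" point can be checked with a small experiment. This is a minimal sketch assuming a single saturating counter per branch that starts in its strongest "taken" state; the function name and the loop trip count are made up for illustration.

```python
def mispredictions(bits, outcomes):
    """Count mispredictions of one saturating counter that is `bits` wide."""
    top = (1 << bits) - 1          # strongest 'taken' state
    threshold = 1 << (bits - 1)    # states >= threshold predict 'taken'
    state = top                    # assumed initial state
    wrong = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            wrong += 1
        state = min(top, state + 1) if taken else max(0, state - 1)
    return wrong


# Loop-closing branch: taken 9 times, then not taken; the loop is entered 10 times.
loop = ([True] * 9 + [False]) * 10
print(mispredictions(1, loop), mispredictions(2, loop))  # prints: 19 10
```

A 1-bit predictor mispredicts twice per loop execution (the exit, then the first iteration of the next execution), while the 2-bit counter mispredicts only the exit, because one not-taken outcome is not enough to flip its prediction.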

20
Review: Who Cares About the Memory Hierarchy?

Processor Only Thus Far in Course
  CPU cost/performance, ISA, Pipelined Execution.
CPU-DRAM Gap
  1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip).

[Figure: performance (log scale, 1 to 1000) vs. year, 1980-2000. CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap grows 50% / year.]

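
The quoted gap growth follows from the two rates on the figure: if processor performance improves 60% per year and DRAM 7% per year, their ratio grows by 1.60 / 1.07 each year. A short check, assuming both rates compound annually and stay constant (the constant and function names are illustrative):

```python
CPU_GROWTH = 1.60    # from the slide: µProc performance +60% per year
DRAM_GROWTH = 1.07   # from the slide: DRAM performance +7% per year


def gap_after(years):
    """Ratio of processor to DRAM performance after `years`, starting from 1."""
    return (CPU_GROWTH / DRAM_GROWTH) ** years


annual = gap_after(1) - 1
print(f"gap grows {annual:.1%} per year")
print(f"gap after 20 years (1980-2000): about {gap_after(20):.0f}x")
```

The annual factor comes out just under 1.5, which matches the roughly 50% per year stated on the slide.
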
21
Processor-Memory Performance Gap Tax