Lecture 24: Instruction Level Parallelism


Provided by: Akhiles5

1
Lecture 24: Instruction Level Parallelism
  • Computer Engineering 585
  • Fall 2001

2
Limits to Multi-Issue Machines
  • Inherent limitations of ILP:
  • 1 branch in 5: How to keep a 5-way VLIW busy?
  • Latencies of units: many operations must be
    scheduled.
  • Need as many independent operations as Pipeline
    Depth x No. Function Units to keep machines
    busy, e.g. 5 x 4 = 15-20 independent
    instructions?
  • Difficulties in building HW:
  • Easy: More instruction bandwidth.
  • Easy: Duplicate FUs to get parallel execution.
  • Hard: Increase ports to Register File
    (bandwidth).
  • VLIW example needs 7 read and 3 write ports for
    Int. Reg. & 5 read and 3 write ports for FP reg.
    files.
  • Harder: Increase ports to memory (bandwidth).
  • Decoding: Superscalar and impact on clock rate,
    pipeline depth?
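
The two rules of thumb on this slide are simple products, and a minimal sketch makes the arithmetic explicit. The function names are hypothetical and not from the lecture. The inputs are the slide's own numbers (depth 5, 4 function units, 1 branch in 5 instructions, 5-way issue); the slide quotes a range of 15-20 for the first product, while the sketch computes only the upper bound.

```python
def independent_ops_needed(pipeline_depth: int, function_units: int) -> int:
    """Operations that must be in flight to fill every stage of every unit."""
    return pipeline_depth * function_units


def branches_per_issue(issue_width: int, branch_fraction: float) -> float:
    """Average number of branches in one issue packet of the given width."""
    return issue_width * branch_fraction


# Slide's example: depth 5, 4 function units.
ops = independent_ops_needed(5, 4)

# 1 branch in 5 instructions on a 5-way VLIW: about one branch per packet.
branches = branches_per_issue(5, 1 / 5)

print(ops, branches)
```

With roughly one branch in every 5-wide packet, the compiler has little straight-line code to fill the remaining slots, which is the point of the "how to keep a 5-way VLIW busy?" question.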

3
Limits to Multi-Issue Machines
  • Limitations specific to either Superscalar or
    VLIW implementation:
  • Decode issue in Superscalar: how wide is
    practical?
  • VLIW code size: unrolled loops + wasted fields in
    VLIW.
  • IA-64 compresses dependent instructions, but
    still larger.
  • VLIW lock step => 1 hazard & all instructions
    stall.
  • IA-64 not lock step? Dynamic pipeline?
  • VLIW & binary compatibility is a practical
    weakness, as the number of FUs and their
    latencies vary over time.
  • IA-64 provides binary compatibility.

4
Limits to ILP
  • Conflicting studies of amount of parallelism
    available in late 1980s and early 1990s.
    Different assumptions about:
  • Benchmarks (vectorized Fortran FP vs. integer C
    programs).
  • Hardware sophistication.
  • Compiler sophistication.
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep
    on processor performance curve?

5
Limits to ILP
  • Initial HW model here; MIPS compilers.
  • Assumptions for ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual registers
    and all WAW & WAR hazards are avoided.
  • 2. Branch prediction: perfect; no mispredictions.
  • 3. Jump prediction: all jumps perfectly predicted
    => machine with perfect speculation & an
    unbounded buffer of instructions available.
  • 4. Memory-address alias analysis: addresses are
    known & a store can be moved before a load
    provided the addresses are not equal.
  • 1 cycle latency for all instructions; unlimited
    number of instructions issued per clock cycle.
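
Under these assumptions only true (RAW) data dependences limit issue, so the IPC of a trace is its length divided by the depth of its dependence graph. The following is a minimal sketch of that calculation, not the methodology behind the lecture's figures. The trace format (destination register, source registers) and the function name `ideal_ipc` are hypothetical, and memory dependences are left out for brevity.

```python
from typing import Dict, List, Tuple

# (destination register, source registers); a hypothetical trace format.
Instr = Tuple[str, Tuple[str, ...]]


def ideal_ipc(trace: List[Instr]) -> float:
    """IPC when only RAW dependences constrain issue (1-cycle latency)."""
    ready: Dict[str, int] = {}  # register -> cycle its newest value is ready
    last_cycle = 0
    for dest, srcs in trace:
        # Issue as soon as every source value has been produced.
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        done = issue + 1
        # Perfect renaming: a new write never waits for older readers or
        # writers of the same register, so WAR and WAW are ignored.
        ready[dest] = done
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle if last_cycle else 0.0


trace = [
    ("r1", ()),
    ("r2", ()),
    ("r3", ("r1", "r2")),
    ("r1", ()),            # reuses r1; renaming removes the WAR on r3's read
    ("r4", ("r1", "r3")),
]
print(ideal_ipc(trace))
```

The five instructions form a dependence chain three deep, so the sketch reports 5/3 instructions per cycle for this toy trace.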

6
Upper Limit to ILP: Ideal Machine (Figure 4.38,
page 319)
[Figure: IPC per benchmark on the ideal machine. FP: 75 - 150; Integer: 18 - 60.]
7
More Realistic HW: Branch Impact (Figure 4.40,
Page 323)
  • Change from an infinite window to a window of
    2000 instructions and a maximum issue of 64
    instructions per clock cycle.

[Figure: IPC per benchmark for each branch predictor (Perfect, Pick Cor. or BHT, BHT (512), Profile, No prediction). FP: 15 - 45; Integer: 6 - 12.]
8
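Selective History Predictor
[Figure: block diagram. Branch Addr indexes a non-correlating predictor (8192 x 2 bits) and an 8K x 2 bit Selector; Branch Addr together with 2 bits of Global History (00, 01, 10, 11) indexes a correlating predictor (2048 x 4 x 2 bits). Predictor states: 11, 10 = Taken; 01, 00 = Not Taken. Selector states (11, 10, 01, 00) range between Choose Non-correlator and Choose Correlator; the chosen predictor supplies the Taken/Not Taken output.]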
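
A selective predictor of this shape can be sketched in a few lines. This is a minimal illustration, not the hardware design from the slide: the table sizes follow the slide (8K non-correlating counters, 2048 x 4 correlating counters, 8K selector counters, 2 bits of global history), but the class name, the modulo indexing, the initial counter values, and the convention that selector values 2 and 3 mean "choose correlator" are all assumptions.

```python
def _bump(counter: int, up: bool) -> int:
    """Saturating 2-bit counter update, range 0..3."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class SelectivePredictor:
    def __init__(self, entries: int = 8192, history_bits: int = 2):
        self.entries = entries
        self.patterns = 1 << history_bits            # 4 history patterns
        self.corr_rows = entries // self.patterns    # 2048 rows by default
        self.noncorr = [1] * entries                 # weakly not taken
        self.corr = [[1] * self.patterns for _ in range(self.corr_rows)]
        self.selector = [2] * entries                # assumed: >= 2 picks correlator
        self.history = 0                             # global taken/not-taken bits

    def _components(self, addr: int):
        nc = self.noncorr[addr % self.entries] >= 2
        c = self.corr[addr % self.corr_rows][self.history] >= 2
        return nc, c

    def predict(self, addr: int) -> bool:
        nc, c = self._components(addr)
        return c if self.selector[addr % self.entries] >= 2 else nc

    def update(self, addr: int, taken: bool) -> None:
        nc, c = self._components(addr)
        i = addr % self.entries
        if nc != c:
            # Train the selector only when the two predictors disagree.
            self.selector[i] = _bump(self.selector[i], c == taken)
        self.noncorr[i] = _bump(self.noncorr[i], taken)
        row = self.corr[addr % self.corr_rows]
        row[self.history] = _bump(row[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & (self.patterns - 1)


# A branch that alternates taken / not taken defeats a plain 2-bit counter,
# but the correlating component learns it from the global history.
p = SelectivePredictor()
hits = 0
for k in range(200):
    taken = (k % 2 == 0)
    hits += p.predict(0x40) == taken
    p.update(0x40, taken)
print(hits, "of 200 predicted correctly")
```

The selector is updated only on disagreement, so a branch that both components predict equally well leaves it unchanged.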
9
More Realistic HW: Register Impact (Figure 4.44,
Page 328)
  • Change: 2000 instr window, 64 instr issue, 8K
    2-level prediction.

[Figure: IPC per benchmark vs. number of renaming registers (Infinite, 256, 128, 64, 32, None). FP: 11 - 45; Integer: 5 - 15.]
10
More Realistic HW: Alias Impact (Figure 4.46, Page
330)
  • Change: 2000 instr window, 64 instr issue, 8K
    2-level prediction, 256 renaming registers.

[Figure: IPC per benchmark for each alias-analysis model (Perfect; Global/Stack perfect, heap conflicts; Inspec. Assem.; None). FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.]
11
Realistic HW for '9X: Window Impact (Figure 4.48,
Page 332)
  • Perfect disambiguation (HW), 1K Selective
    Prediction, 16 entry return, 64 registers, issue
    as many as window.

[Figure: IPC per benchmark vs. window size (Infinite, 256, 128, 64, 32, 16, 8, 4). FP: 8 - 45; Integer: 6 - 12.]
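
The effect of window size can be approximated with a toy scheduler: instructions enter a fixed-size window in program order, and each cycle every instruction in the window whose producers have finished is issued. This is a sketch under stated assumptions, not the simulator behind Figure 4.48. It assumes 1-cycle latency, perfect renaming and branch prediction, no memory dependences, and that a window slot is freed at issue. The trace format and the name `windowed_ipc` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination, sources); hypothetical format


def windowed_ipc(trace: List[Instr], window: int) -> float:
    """IPC when only `window` instructions can be examined at once."""
    n = len(trace)
    if n == 0:
        return 0.0
    # Resolve each source to the index of its producer (RAW dependences only).
    last_writer: Dict[str, int] = {}
    deps: List[List[int]] = []
    for i, (dest, srcs) in enumerate(trace):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    done: List[Optional[int]] = [None] * n  # cycle in which each result is ready
    in_window: List[int] = []
    next_fetch = 0
    cycle = 0
    issued = 0
    while issued < n:
        # Refill the window in program order.
        while len(in_window) < window and next_fetch < n:
            in_window.append(next_fetch)
            next_fetch += 1
        # Issue everything whose producers finished in an earlier cycle.
        ready = [
            i for i in in_window
            if all(done[d] is not None and done[d] <= cycle for d in deps[i])
        ]
        for i in ready:
            done[i] = cycle + 1
        in_window = [i for i in in_window if done[i] is None]
        issued += len(ready)
        cycle += 1
    return n / cycle


independent = [(f"r{i}", ()) for i in range(8)]
print(windowed_ipc(independent, 4), windowed_ipc(independent, 8))
```

For eight independent instructions the sketch needs two cycles with a window of 4 and one cycle with a window of 8, which mirrors the way a smaller window hides parallelism that exists further ahead in the instruction stream.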
12
3 1996 Era Machines
                 Alpha 21164     PPro              HP PA-8000
  Year           1995            1995              1996
  Clock          400 MHz         200 MHz           180 MHz
  Cache          8K/8K/96K/2M    8K/8K/0.5M        0/0/2M
  Issue rate     2 int + 2 FP    3 instr (x86)     4 instr
  Pipe stages    7-9             12-14             7-9
  Out-of-Order   6 loads         40 instr (µop)    56 instr
  Rename regs    none            40                56

13
SPECint95base Performance (July 1996)
14
SPECfp95base Performance (July 1996)
15
3 1997 Era Machines
                 Alpha 21164     Pentium II        HP PA-8000
  Year           1995            1996              1996
  Clock          600 MHz ('97)   300 MHz ('97)     236 MHz ('97)
  Cache          8K/8K/96K/2M    16K/16K/0.5M      0/0/4M
  Issue          2 int + 2 FP    3 instr (x86)     4 instr
  Pipe stages    7-9             12-14             7-9
  Out-of-Order   6 loads         40 instr (µop)    56 instr
  Rename         none            40                56

16
3 2000-1 Era Machines
                 Alpha 21364      Power4              Pentium 4
  Year           2000             2001                2000-1
  Clock          1 GHz ('01?)     >1 GHz (2001)       2 GHz (2001)
  Cache          64K/64K/1.75M    32K/64K/1.5M/32M    12K micro-ops trace cache/8K(D)/256K
  Issue          2 int + 2 FP     8 inst.             6 inst.
  Pipe stages    7-9              15-20               20
  Out-of-Order   6 loads          200 inst.           126 inst.
  Rename         none             >200                128

17
SPECint95base Performance (Oct. 1997)
18
SPECfp95base Performance (Oct. 1997)
19
Summary
  • Branch Prediction:
  • Branch History Table: 2 bits for loop accuracy.
  • Recently executed branches correlated with next
    branch?
  • Branch Target Buffer: include branch address &
    prediction.
  • Predicated Execution can reduce number of
    branches, number of mispredicted branches.
  • Speculation: Out-of-order execution, In-order
    commit (reorder buffer).
  • SW Pipelining:
  • Symbolic Loop Unrolling to get most from pipeline
    with little code expansion, little overhead.
  • Superscalar and VLIW: CPI < 1 (IPC > 1)
  • Dynamic issue vs. Static issue.
  • More instructions issue at same time => larger
    hazard penalty.
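
The "2 bits for loop accuracy" point can be checked with a small experiment: on a loop-closing branch, a 1-bit predictor mispredicts both the loop exit and the first iteration of the next run, while a 2-bit saturating counter mispredicts only the exit. The sketch below is illustrative; the function name, the starting state, and the 10-iteration loop are assumptions chosen for the example.

```python
from typing import List


def mispredictions(outcomes: List[bool], bits: int) -> int:
    """Count misses of an n-bit saturating counter on one branch."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # states at or above this predict taken
    state = max_state             # assumed start: strongly taken
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        state = min(max_state, state + 1) if taken else max(0, state - 1)
    return misses


# Loop branch: taken 9 times, then falls through; the loop is run 10 times.
loop = ([True] * 9 + [False]) * 10

print("1-bit misses:", mispredictions(loop, 1))
print("2-bit misses:", mispredictions(loop, 2))
```

On this pattern the 2-bit counter misses once per loop run (the exit), whereas the 1-bit predictor also misses on re-entry for every run after the first.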

20
Review: Who Cares About the Memory Hierarchy?
  • Processor Only Thus Far in Course:
  • CPU cost/performance, ISA, Pipelined Execution.
  • CPU-DRAM Gap:
  • 1980: no cache in µproc; 1995: 2-level cache on
    chip (1989: first Intel µproc with a cache on
    chip).

[Figure: Performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU (µProc, "Moore's Law"): 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap grows 50% / year.]
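
The gap growth rate follows from the two curves. Reading the chart labels as 60% per year for the processor and 7% per year for DRAM, the ratio of the two growth factors is about 1.5, i.e. the gap widens by roughly 50% each year. A minimal sketch of the arithmetic (variable and function names are illustrative):

```python
cpu_growth = 1.60    # processor performance: +60% per year
dram_growth = 1.07   # DRAM performance: +7% per year

# Yearly growth factor of the processor-memory gap.
gap_growth = cpu_growth / dram_growth


def gap_after(years: int) -> float:
    """Relative gap after the given number of years, starting from 1.0."""
    return gap_growth ** years


print(f"gap grows {100 * (gap_growth - 1):.1f}% per year")
print(f"gap after 10 years: {gap_after(10):.0f}x")
```

Because the growth compounds, even a modest yearly difference turns into a gap of well over an order of magnitude within a decade.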
21
Processor-Memory Performance Gap Tax
  Processor          % Area (cost)    % Transistors (power)
  Alpha 21164        37%              77%
  StrongArm SA110    61%              94%
  Pentium Pro        64%              88%    (2 dies per package)
  • Caches have no inherent value, only try to close
    performance gap.