Microelectronic devices: processing architectures - PowerPoint PPT Presentation

About This Presentation
Title:

Microelectronic devices: processing architectures

Description:

Original title: "Vermijding van afbeeldingsconflicten in microprocessors" (Avoiding mapping conflicts in microprocessors). Author: hvdieren. Last modified by: Koen De Bosschere. Created: 10/24/2005, 2:55 PM.


Transcript and Presenter's Notes

Title: Microelectronic devices: processing architectures


1
Microelectronic devices: processing architectures
  • Koen De Bosschere, Ghent University

2
Moore's Law (1965)
[Figure: Moore's original 1965 plot, log2 of the number of components per integrated function (y-axis, 0-16) versus year (1959-1975). Inset: Gordon Moore (1929-), Fairchild Semiconductor.]
Electronics, April 19, 1965
3
Moore's Law
Source: Intel
4
Itanium (Montecito): 1720 M transistors
5
In-order execution
[Diagram: Fetch/Decode → Execute → Finish, one instruction after another, in program order.]
6
Superscalar execution
[Diagram: Fetch/Decode (fetch width) fills an instruction window; instructions issue (issue width) to parallel execution units E, E, E, E; finished instructions retire (commit width) at the Retire (commit) stage.]
7
Superscalar out-of-order processor
[Diagram: a branch predictor steers fetch from the L1 I-cache (backed by the L2 cache); a front-end pipeline (decoding, register renaming, etc.) fills the instruction window; two ALUs and a ld/st unit execute; loads and stores access the L1 D-cache.]

Example loop (instructions labeled a-g; c-f repeat per iteration as c1, c2, ...):
  a: mov 0 → r1
  b: mov 0x0fe0 → r3
  c: L: ld MEM[r3] → r2
  d: add r2,r1 → r1
  e: add r3,32 → r3
  f: brl r3,0x10a0 → L
  g: st r3 → MEM[A]

Out-of-order execution, in-order commit.
8
Cycle 1
[Pipeline snapshot: instructions a and b enter the front-end.]
Per clock cycle, B instructions are fetched, with B = the processor width.
9
Cycle 2
[Pipeline snapshot: a and b advance through the front-end; c1 and d1 (first loop iteration) are fetched.]
10
Cycle 4
[Pipeline snapshot: e1 and f1 advance; c2 and d2 are fetched behind them.]
Branch predicted taken.
11
Cycle 5
[Pipeline snapshot: a and b reach the instruction window; e2 and f2 enter the front-end.]
12
Cycle 6
[Pipeline snapshot: a and b issue to the execution units; c3 and d3 are fetched.]
13
Cycle 7
[Pipeline snapshot: c1 issues to the ld/st unit; e3 and f3 are fetched.]
Instructions whose operands are available can issue; the others wait (operands not yet available).
14
Cycle 8
[Pipeline snapshot: c1 and e1 execute while older instructions still wait for their operands: out-of-order execution.]
15
Cycle 9
[Pipeline snapshot: several independent instructions execute in the same cycle: instruction-level parallelism (ILP).]
16
Cycle 10
[Pipeline snapshot: finished instructions retire in program order: in-order commit.]
17
Cycle 11
[Pipeline snapshot: steady state, with several loop iterations in flight at once.]
18
Instruction Level Parallelism
[Bar chart: available ILP per SPECint 2000 benchmark (bzip2, crafty, eon, gcc, gzip, parser, perlbmk, twolf, vortex, vpr), y-axis 0-200. Measured starting from the full trace.]
19
Performance
IPC (8 execution units)
[Bar chart: achieved IPC per SPECint 2000 benchmark, y-axis 0-6.]
20
IPC ≈ √W
[Log-log plot: IPC (2-32) versus instruction window size W (4-64); IPC grows roughly as the square root of the window size.]
21
Branch prediction
22
[Diagram: Fetch/Decode (fetch width) → instruction pool → execution units E, E, E, E (issue width) → Retire (commit) (commit width), with a branch predictor steering fetch.]
Branch predictor: 90-95% correct.
23
Cycle 11
[Pipeline snapshot: branch f2 was mispredicted; the instructions fetched beyond it are executed speculatively: speculative fetch and execution.]
24
Cycle 13
[Pipeline snapshot: fetch and execution continue along the mispredicted path.]
25
Cycle 15
[Pipeline snapshot: still on the mispredicted path; the window keeps filling with wrong-path instructions.]
26
Cycle 16
[Pipeline snapshot: the misprediction is detected.]
Instructions on the mispredicted path must be nullified.
27
Cycle 17
[Pipeline snapshot: the window has been flushed; instructions g and h on the correct path are fetched.]
28
Interval analysis
[Interval plot: IPC over time t. IPC stays at IPCmax while instructions along the predicted path are executed; when the mispredicted branch enters the instruction window and gets executed, IPC collapses; instructions from the correct path are then fetched, correct instructions enter the instruction window, and performance recovers.]
29
Branch prediction
[Bar chart: IPC per SPECint 2000 benchmark (y-axis 0-6), "everything perfect" versus a real branch predictor.]
30
Memory Wall Problem
Pentium 4 EE:
  Processor ↔ L1 I (12Ki) / L1 D (8KiB): 2 cycles
  L2 cache (512 KiB): 19 cycles
  L3 cache (2 MiB): 43 cycles
  Memory: 206 cycles
31
Cycle 10
[Pipeline snapshot of the example loop, now with e: add r3,4 → r3.]
Suppose f4 is correctly predicted, but g causes an I-cache miss. The I-cache miss latency is 10 cycles.
32
Cycle 13
[Pipeline snapshot: no new instructions enter the pipeline while the I-cache miss is outstanding; the window starts to drain.]
33
Cycle 14
[Pipeline snapshot: the instruction window drains further.]
34
Cycle 15
[Pipeline snapshot: only the last few instructions remain in flight.]
35
Cycle 16
[Pipeline snapshot: the window is almost empty.]
36
Cycle 17 - 19
[Pipeline snapshot: the instruction window is empty; the processor stalls while the I-cache miss is serviced.]
37
Cycle 20
[Pipeline snapshot: the miss returns; g and h are fetched and the front-end refills.]
38
Interval analysis
[Interval plot: IPC over time t for an I-cache miss. The instruction window empties during the I-cache miss latency; once the miss returns, the front-end pipeline refills, instructions from the front-end pipeline arrive in the instruction window, and performance recovers to IPCmax.]
39
Cycle 10
[Pipeline snapshot: load c2 issues to the ld/st unit; suppose this is an L1 D-cache miss.]
40
Cycle 11
[Pipeline snapshot: c2 accesses the L2 cache; the L2 cache access time is 3 cycles.]
41
Cycle 12
[Pipeline snapshot: independent instructions keep executing while c2 waits for the L2 cache.]
42
Cycle 13
[Pipeline snapshot: the L2 access is about to complete.]
43
Cycle 14
[Pipeline snapshot: the L2 data arrives; c2 completes and its dependents can issue.]
44
L1 D-cache miss ≈ instruction with a long execution latency
[Bar chart: IPC (0-4) per benchmark (gap, gzip, eon, bzip2, crafty, vortex, perlbmk), with and without L1 D-cache misses.]
45
Cycle 13
[Pipeline snapshot: suppose c2 misses in the L2 cache as well; the request goes to main memory and c2 is parked in the MSHRs.]
46
Cycle 14
[Pipeline snapshot: the instruction waits in the MSHRs (Miss Status Handling Registers) for data from main memory. Assume the access time to main memory is 250 cycles.]
47
Cycle 15
[Pipeline snapshot: independent instructions continue to execute while c2 waits in the MSHRs.]
48
Cycle 16
[Pipeline snapshot: the instruction window fills up; instruction c2 blocks commit, since commit is in order.]
49
Cycle 264
[Pipeline snapshot: roughly 250 cycles later the data returns from main memory; c2 completes and the stalled window can finally drain.]
50
Performance impact of non-ideal memory
[Bar chart: IPC per SPECint 2000 benchmark (y-axis 0-6) for three configurations: everything perfect; real branch predictor; real branch predictor plus real memory hierarchy.]
51
Interval analysis
[Interval plot: IPC over time t for an L2 D-cache miss. Instructions that do not depend on the cache miss are executed until the instruction window fills up; IPC then drops for the rest of the L2 D-cache miss latency, after which performance recovers to IPCmax.]
52
Multiple L2 D-cache misses
[Pipeline snapshot: several loads (c2, c3, c4, c5, c6) are outstanding in the MSHRs at the same time, so their memory latencies overlap: Memory-Level Parallelism (MLP).]
53
Overview
  • Superscalar pipeline
  • Speculative execution
  • branch prediction
  • value prediction
  • Predicated execution
  • Multithreaded execution
  • VLIW

54
Pipeline
[Diagram: instructions i1, i2, i3 flowing through a 4-stage pipeline (F, D, E, W) versus an 8-stage pipeline; a deeper pipeline overlaps more instructions, but each instruction spends more stages in flight.]
55
Examples
Processor       Stages   Frequency
Itanium 2       8        1.5 GHz
Alpha 21364     7-9      1.15 GHz
AMD Opteron     9-11     2.2 GHz
Power4          12-17    1.7 GHz
Pentium 4       20       3.2 GHz
IA32 Prescott   31       3.4 GHz

Source: Microprocessor Report
56
Performance/MHz
[Scatter plot: SPECint_peak (0-1400) versus frequency (0-3500 MHz) for Alpha, Athlon, Opteron, PA-RISC, Pentium III/4, POWER, MIPS, SPARC64, Sun SPARC, Xeon and Itanium; the regions are labeled "brainiacs" (high performance per MHz) and "speed demons" (high clock frequency).]
57
Pentium 4 pipeline stages
1  Trace cache next IP
2  Trace cache next IP
3  Trace cache fetch
4  Trace cache fetch
5  Drive
6  Allocate and rename
7  Allocate and rename
8  Allocate and rename
9  Queue
10 Issue
11 Issue
12 Issue
13 Dispatch
14 Dispatch
15 Operand 1
16 Operand 2
17 Execute
18 Flags
19 Branch check
20 Drive
58
[Diagram: the same pipeline grouped into sections: 5 stages (1-5) at 3 ops/cycle, 7 stages (6-12) at 6 ops/cycle, 5 stages (13-17) at 3 ops/cycle, and 3 stages (18-20).]
59
Tomasulo Speculation
[Diagram: Tomasulo with speculation. The fetch unit and I-cache feed an instruction queue; operations issue over the operation bus to reservation stations in front of functional units FU1-FU3; loads and stores go through the address unit and load buffer to the D-cache; results broadcast on the common data bus (CDB); a reorder buffer holds results until they commit in order to the registers.]
60
Overview
  • Superscalar pipeline
  • Speculative execution
  • branch prediction
  • value prediction
  • Predicated execution
  • Multithreaded execution
  • VLIW

61
Control flow instructions
   i1
   i2
   bc lab
   i4
   i5
   i6
   jmp end
lab:
   i8
   i9
   i10
end:
   i11
62
Branch Predictors
  • Static predictors
  • Not taken
  • Backward taken/forward not taken
  • Dynamic predictors
  • Simple dynamic predictor
  • 2-bit predictor
  • Local
  • Global
  • Hybrid
  • Branch Target Buffer

63
Predict not taken
loop:
   cmp r1,r2
   je end
   ...
   jmp loop
end:
64
Branch backward taken/forward not taken
   cmp r1,r2
   je end
loop:
   ...
   cmp r1,r2
   jne loop
end:
65
Simple dynamic predictor
[Diagram: the lowest bits of the PC index a table of 1-bit entries (1 = taken, 0 = not taken); the entry gives the prediction (here: predict taken), and the table is updated with the correct outcome afterwards.]
66
2-bit dynamic predictor
[Diagram: the lowest bits of the PC index a table of 2-bit saturating counters; states 11 and 10 predict taken, 01 and 00 predict not taken. A taken outcome moves the counter up, a not-taken outcome moves it down.]
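A minimal C sketch of such a 2-bit saturating-counter predictor; the table size and the modulo indexing are illustrative assumptions, not from the slides:

  #include <stdint.h>
  #include <stdbool.h>

  #define PRED_ENTRIES 4096                /* assumed table size (power of two) */

  static uint8_t counters[PRED_ENTRIES];   /* 2-bit counters: 0..3 */

  /* States 3 ("11") and 2 ("10") predict taken; 1 and 0 predict not taken. */
  bool predict(uint32_t pc) {
      return counters[pc % PRED_ENTRIES] >= 2;
  }

  /* Update with the actual outcome, saturating at 0 and 3. */
  void update(uint32_t pc, bool taken) {
      uint8_t *c = &counters[pc % PRED_ENTRIES];
      if (taken) { if (*c < 3) (*c)++; }
      else       { if (*c > 0) (*c)--; }
  }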
67
Local dynamic predictor
[Diagram: the lowest bits of the PC index a history table of per-branch outcome histories (e.g. 11010101, 01111111, ...); the selected history then indexes a predictor table of 2-bit counters (1 = taken, 0 = not taken), which yields the prediction (here: predict taken).]
68
Global dynamic predictor
[Diagram: a global history register (e.g. 1011111111) holds the outcomes of the most recent branches; combined with the PC it indexes a prediction table of 2-bit counters (1 = taken, 0 = not taken), which yields the prediction (here: predict taken).]
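A C sketch of the global scheme, using the common gshare variant in which the history is XORed with the PC; the XOR indexing, table size and history length are assumptions for illustration:

  #include <stdint.h>
  #include <stdbool.h>

  #define GPRED_ENTRIES 4096                /* assumed table size */
  #define HISTORY_BITS  12                  /* assumed history length */

  static uint8_t  gcounters[GPRED_ENTRIES]; /* 2-bit counters: 0..3 */
  static uint32_t ghist;                    /* global history register */

  bool gpredict(uint32_t pc) {
      uint32_t idx = (pc ^ ghist) % GPRED_ENTRIES;   /* gshare: PC xor history */
      return gcounters[idx] >= 2;
  }

  void gupdate(uint32_t pc, bool taken) {
      uint32_t idx = (pc ^ ghist) % GPRED_ENTRIES;
      uint8_t *c = &gcounters[idx];
      if (taken) { if (*c < 3) (*c)++; }
      else       { if (*c > 0) (*c)--; }
      /* shift the new outcome into the global history register */
      ghist = ((ghist << 1) | (taken ? 1u : 0u)) & ((1u << HISTORY_BITS) - 1);
  }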
69
Hybrid predictor
[Diagram: the PC drives predictor A and predictor B in parallel; a meta-predictor chooses which of the two predictions to use.]
70
Branch Target Buffer
[Diagram: the PC looks up an entry holding prediction information and the branch target address, yielding both the taken/not-taken prediction and the predicted target.]
71
Branch prediction
[Bar chart, repeated from slide 29: IPC per SPECint 2000 benchmark (y-axis 0-6), "everything perfect" versus a real branch predictor.]
72
Overview
  • Superscalar pipeline
  • Speculative execution
  • branch prediction
  • value prediction
  • Predicated execution
  • Multithreaded execution
  • VLIW

73
Cycle 11
[Pipeline snapshot, repeated from slide 40: load c2 missed in the L1 D-cache and accesses the L2 cache; the L2 cache access time is 3 cycles.]
74
Dependency Graph
[Dependency graph: instructions A-M with their true dependencies, scheduled over time t.]
ILP = 13/8 = 1.62
75
Dependency Graph
[The same dependency graph (animation step), highlighting the long dependency chain.]
ILP = 13/8 = 1.62
76
Dependency Graph
[Dependency graph: with a correctly predicted value the chain is broken and the critical path shortens.]
ILP = 13/4 = 3.25 >> 1.62
True dependencies limit the achievable IPC
77
Prediction Schemes
  • Last value prediction (LVP)
  • Stride prediction (SP)
  • Finite Context Method (FCM)

Cases: the value sequence one instruction generates, e.g.
  1 1 1 1 1 1 1 1 1 1  (constant)
  1 2 3 4 5 1 2 3 4 5  (stride)
  1 3 2 5 4 1 3 2 5 4  (repetition)
78
Last Value Prediction (Lipasti, 1996)
[Diagram: the PC indexes a table holding the last value the instruction produced; that value is predicted again.]
49% accuracy for an infinite table
79
Stride Prediction (Gabbay, 1997)
[Diagram: the PC indexes a table holding the last value and the stride; the prediction is last value + stride.]
60% accuracy for an infinite table
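A combined last-value/stride predictor sketch in C; the table size and indexing are illustrative assumptions (a real predictor would also track confidence):

  #include <stdint.h>

  #define VP_ENTRIES 1024                  /* assumed table size */

  struct vp_entry {
      uint64_t last;                       /* last value produced by this PC */
      int64_t  stride;                     /* last observed difference */
  };

  static struct vp_entry vp[VP_ENTRIES];

  /* Stride prediction: last value plus the last observed stride.
   * With stride forced to 0 this degenerates to last-value prediction. */
  uint64_t vp_predict(uint32_t pc) {
      struct vp_entry *e = &vp[pc % VP_ENTRIES];
      return e->last + (uint64_t)e->stride;
  }

  /* Once the instruction really executes, learn the new stride. */
  void vp_update(uint32_t pc, uint64_t actual) {
      struct vp_entry *e = &vp[pc % VP_ENTRIES];
      e->stride = (int64_t)(actual - e->last);
      e->last   = actual;
  }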
80
Improved Stride Prediction
[Diagram: the stride is only used once the same delta has been observed twice in a row (2-delta stride), or once a saturating confidence counter is high enough.]
81
Finite Context Method (Sazeides, 1997)
With order 3, the last three values form the context. In the sequence 1 3 2 5 4 1 3 2 5 4 ..., context (1 3 2) is always followed by 5, (3 2 5) by 4, (2 5 4) by 1, (5 4 1) by 3, and (4 1 3) by 2.
82
Finite Context Method (Sazeides, 1997)
[Diagram: the PC selects a history of the most recent (hashed) values; this context indexes a second-level table that holds the value which followed the same context before.]
78% accuracy for infinite tables and order 3
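A rough order-3 FCM sketch in C; the hash and table size are assumptions, and for brevity it tracks a single instruction's history instead of a per-PC first-level table:

  #include <stdint.h>

  #define FCM_ENTRIES 4096                 /* assumed second-level table size */

  static uint64_t hist[3];                 /* last three values (order 3) */
  static uint64_t values[FCM_ENTRIES];     /* value that followed each context */

  /* Fold the three context values into one table index (illustrative hash). */
  static uint32_t ctx_hash(void) {
      return (uint32_t)((hist[0] * 31 + hist[1]) * 31 + hist[2]) % FCM_ENTRIES;
  }

  uint64_t fcm_predict(void) {
      return values[ctx_hash()];
  }

  void fcm_update(uint64_t actual) {
      values[ctx_hash()] = actual;         /* remember what followed this context */
      hist[0] = hist[1];                   /* slide the context window */
      hist[1] = hist[2];
      hist[2] = actual;
  }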
83
Accuracy vs. size
[Plot: prediction accuracy (0-80%) versus predictor size (1 to 100,000 Kbit, log scale) for the three schemes.]
84
Overview
  • Superscalar pipeline
  • Speculative execution
  • branch prediction
  • value prediction
  • Predicated execution
  • Multithreaded execution
  • VLIW

85
Predicated Execution
if (r1 == 0 || r2 == 0) {
    if (r3 == 0) r5 = r4 - 1;
    else         r5 = r7 + 3;
} else {
    r5 = r3 + 1;
}
r5 = r5 * 2;
86
Predicated Execution
  • beq r1,0,L1
  • beq r2,0,L1
  • add r5,r3,1
  • jump L3
  • L1: beq r3,0,L2
  • add r5,r7,3
  • jump L3
  • L2: sub r5,r4,1
  • L3: mul r5,r5,2

[Diagram: the same code drawn as a control-flow graph of its basic blocks.]
87
Predicated Execution
[Diagram: the control-flow graph again, now with predicates attached to the paths (e.g. -p1 guards add r5,r3,1).]
88
Predicated code
Original code:
   beq r1,0,L1
   beq r2,0,L1
   add r5,r3,1
   jump L3
L1: beq r3,0,L2
   add r5,r7,3
   jump L3
L2: sub r5,r4,1
L3: mul r5,r5,2

Predicated code:
  <T>      p1 = (r1 == 0)
  <-p1>    p1 = (r2 == 0)
  <-p1>    add r5,r3,1
  <p1>     p2 = (r3 == 0)
  <p1,-p2> add r5,r7,3
  <p1,p2>  sub r5,r4,1
  <T>      mul r5,r5,2
Predicates control retirement
89
Advantages of predication
  • Fewer control transfers
  • Easier scheduling of instructions

  <T>      p1 = (r1 == 0)
  <-p1>    p1 = (r2 == 0)
  <-p1>    add r5,r3,1
  <p1>     p2 = (r3 == 0)
  <p1,-p2> add r5,r7,3
  <p1,p2>  sub r5,r4,1
  <T>      mul r5,r5,2
90
Overview
  • Superscalar pipeline
  • speculative execution
  • branch prediction
  • value prediction
  • predicated execution
  • multithreaded execution
  • VLIW

91
Simultaneous multithreading


[Diagram: issue slots over time t; instructions from several threads share the slots within the same cycle.]
(Also called hyperthreading.)
92
Overview
  • Superscalar pipeline
  • Speculative execution
  • branch prediction
  • value prediction
  • Predicated execution
  • Multithreaded execution
  • VLIW

93
VLIW: Very Long Instruction Word processors
94
VLIW execution
Static scheduling, simple processor.
[Diagram: Fetch/Decode reads one long instruction word; its operations go to the execution units E, E, E, E in lockstep, then retire (commit).]
95
Branch penalty
[Diagram: two 4-wide VLIW words in an F-D-E-W pipeline; a taken branch flushes both words: 8 instructions lost!]
Solution: execute them anyway (fill the slots). Problem: how to find 8 independent instructions?
96
VLIW
97
EPIC: Explicitly Parallel Instruction Computing (Itanium)

Instruction bundle (128 bits):
  Operation 1 (41 bits) | Operation 2 (41 bits) | Operation 3 (41 bits) | Template (5 bits)

The template determines which operations can be executed in parallel.
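A small C sketch that unpacks such a 128-bit bundle, here represented as two 64-bit halves with the template in the low 5 bits followed by three 41-bit slots; treat the exact bit positions as illustrative:

  #include <stdint.h>

  struct bundle { uint64_t lo, hi; };      /* 128-bit bundle as two 64-bit words */

  #define SLOT_MASK ((1ULL << 41) - 1)

  uint32_t template_of(struct bundle b) {  /* bits 0..4 */
      return (uint32_t)(b.lo & 0x1F);
  }
  uint64_t slot0(struct bundle b) {        /* bits 5..45 */
      return (b.lo >> 5) & SLOT_MASK;
  }
  uint64_t slot1(struct bundle b) {        /* bits 46..86, straddles both words */
      return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
  }
  uint64_t slot2(struct bundle b) {        /* bits 87..127 */
      return (b.hi >> 23) & SLOT_MASK;
  }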
98
TMS320C62xx VLIW Processor
[Diagram: a fetch packet of 32-bit instructions A-G, each carrying a p-bit; a p-bit of 1 chains the next instruction into the same parallel execute packet, a 0 ends the packet.]

Cycle   Instructions
1       A
2       B C D
3       E F G
99
Questions?
100
Software optimisation for embedded systems (Optimisation de logiciels pour les systèmes enfouis)
  • Prof. Koen De Bosschere
  • Ghent University

101
Memory hierarchy
Second Lecture
102
Preliminaria
1 kibibyte (KiB) = 2^10 = 1 024 bytes
1 mebibyte (MiB) = 2^20 = 1 048 576 bytes
1 gibibyte (GiB) = 2^30 = 1 073 741 824 bytes
1 tebibyte (TiB) = 2^40 = 1 099 511 627 776 bytes

1 kilobyte (kB) = 10^3 = 1 000 bytes
1 megabyte (MB) = 10^6 = 1 000 000 bytes
1 gigabyte (GB) = 10^9 = 1 000 000 000 bytes
1 terabyte (TB) = 10^12 = 1 000 000 000 000 bytes

http://physics.nist.gov/cuu/Units/binary.html
International Standard IEC 60027-2
103
Memory Hierarchy
[Pyramid diagram: smaller, faster and costlier (per byte) storage at the top; larger, slower and cheaper (per byte) at the bottom.]
L0: CPU registers (hold words retrieved from the L1 cache)
L1: on-chip L1 cache (SRAM)
L2: on/off-chip L2/L3 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (distributed file systems, Web servers)
104
Storage Evolution
SRAM
metric             1980    1985   1990  1995   2000   2000:1980
$/MiB              19,200  2,900  320   256    100    192
access (ns)        300     150    35    15     2      150

DRAM
metric             1980    1985   1990  1995   2000   2000:1980
$/MiB              8,000   880    100   30     1      8,000
access (ns)        375     200    100   70     60     6
typical size (MiB) 0.064   0.256  4     16     64     1,000

Disk
metric             1980    1985   1990  1995   2000   2000:1980
$/MiB              500     100    8     0.30   0.05   10,000
access (ms)        87      75     28    10     8      11
typical size (MiB) 1       10     160   1,000  9,000  9,000

Source: Byte and PC Magazine
105
Magnetic storage
[Micrograph: magnetic storage medium at 20 MB/mm², 8.5 nm particles, 100 nm scale bar. Assumed maximum density: 50 Tb per square inch.]
106
$/MiB
[Log-scale plot: price per MiB (0.01 to 100,000) of SRAM, DRAM and disk, 1980-2000; all fall steadily, with disk the cheapest per MiB.]
107
Storage Capacity Evolution
[Log-scale plot: typical capacity (0.01 to 100,000 MiB) of DRAM and disk, 1980-2005, plus the DISK/DRAM ratio.]
Machrone's law: RAM $500, hard disk $500.
108
Access time evolution
[Log-scale plot: access time (ns) of SRAM, DRAM and disk, 1980-2000; the access-time gap between disk and DRAM/SRAM remains huge.]
109
Memory Wall
[Plot: relative performance (log scale, 1-1000), 1980-2000. CPU performance ("Moore's Law") grows 60%/yr (2x per 1.5 years); DRAM performance grows 9%/yr (2x per 10 years). The processor-memory performance gap grows about 50% per year.]
110
Overview
  • Caches: basic operation
  • Miss classification
  • Cache improvements

111
Caches
Cache keeps intruders away from backcountry
supplies
112
Pentium 4 EE: cache hierarchy
  Processor ↔ L1 I (12Ki) / L1 D (8KiB): 2 cycles
  L2 cache (512 KiB): 19 cycles
  L3 cache (2 MiB): 43 cycles
  Memory: 206 cycles
113
Basic cache operation
[Diagram: CPU ↔ cache ↔ memory; memory blocks at addresses 00, 08, 10, 18, ..., 50 map onto cache lines.]
114
Locality
[Diagram: the memory access pattern of Quicksort, showing temporal locality (recently used locations are reused) and spatial locality (neighbouring locations are used together).]
115
Working set
Set of memory locations used during Δt.
[Plot: working set size as a function of time t.]
116
Performance impact of non-ideal memory
[Bar chart, repeated from slide 50: IPC per SPECint 2000 benchmark (y-axis 0-6) for everything perfect; real branch predictor; real branch predictor plus real memory hierarchy.]
117
Basic Cache Types
  • Direct-mapped caches
  • Set-associative caches
  • Fully associative caches

118
Direct mapped cache
[Diagram: each memory block maps to exactly one cache line.]
119
Direct mapped cache
[Diagram: the address splits into tag | index | offset; the index selects one of e.g. 512 blocks of 32 bytes; the stored tag is compared (=) with the address tag and, together with the valid bit, produces the hit signal; valid and dirty bits accompany each block's data.]
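A C sketch of that lookup for the example geometry (512 blocks of 32 bytes, direct mapped; 32-bit addresses assumed):

  #include <stdint.h>
  #include <stdbool.h>

  #define BLOCKS      512
  #define BLOCK_BYTES 32
  #define OFFSET_BITS 5                    /* log2(32)  */
  #define INDEX_BITS  9                    /* log2(512) */

  struct line {
      bool     valid, dirty;
      uint32_t tag;
      uint8_t  data[BLOCK_BYTES];
  };

  static struct line cache[BLOCKS];

  /* Split the address into offset, index and tag, then compare the tag. */
  bool lookup(uint32_t addr, uint8_t *out) {
      uint32_t offset = addr & (BLOCK_BYTES - 1);
      uint32_t index  = (addr >> OFFSET_BITS) & (BLOCKS - 1);
      uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

      struct line *l = &cache[index];
      if (l->valid && l->tag == tag) {     /* hit */
          *out = l->data[offset];
          return true;
      }
      return false;                        /* miss: fetch block from next level */
  }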
120
4-way set associative
[Diagram: the index selects a set; four tag comparators (=) check the four ways in parallel, and a multiplexer forwards the data of the hitting way.]
121
Two-way set associative cache
122
Fully associative cache
123
Associativity
Size = sets × associativity × block size
Direct mapped: associativity = 1. Fully associative: sets = 1.
[Diagram: the same blocks (tag + data) organised as direct mapped, 2-way SA (4 sets), 4-way SA (2 sets), and fully associative.]
124
Exploiting spatial locality
[Diagram: with larger blocks, one tag compare selects the block and a multiplexer picks the requested word out of it.]
125
Average Memory Access Time
AMAT = Hit Time + (Miss Rate × Miss Penalty)
     = (Hit Rate × Hit Time) + (Miss Rate × Miss Time)

Example: 3c + 0.02 × 100c = 5c, or equivalently 0.98 × 3c + 0.02 × 103c = 5c

Miss rate ↓, miss penalty ↓, hit time ↓ ⇒ AMAT ↓
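The same arithmetic as a tiny C helper, using the slide's example numbers:

  #include <stdio.h>

  /* Average memory access time in cycles: hit_time + miss_rate * miss_penalty. */
  static double amat(double hit_time, double miss_rate, double miss_penalty) {
      return hit_time + miss_rate * miss_penalty;
  }

  int main(void) {
      /* Slide example: 3-cycle hit, 2% miss rate, 100-cycle penalty -> 5 cycles. */
      printf("AMAT = %.2f cycles\n", amat(3.0, 0.02, 100.0));
      return 0;
  }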
126
Overview
  • Caches: basic operation
  • Miss classification
  • Cache improvements

127
Miss classification: the 3Cs model
  • Compulsory misses: first-time misses
  • INF = infinitely large cache
  • compulsory misses = misses(INF)
  • Capacity misses: due to limited cache size
  • FA = fully associative cache with LRU replacement
  • capacity misses = misses(FA) - misses(INF)

128
Miss classification: the 3Cs model
  • Conflict misses: due to the set index function
  • C = the investigated cache with the investigated replacement policy
  • conflict misses = misses(C) - misses(FA)
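Sketched as code, the classification is just three simulations of the same trace; sim_misses() and its numbers are hypothetical stand-ins for a real cache simulator:

  #include <stdio.h>

  enum model { INF_CACHE, FULLY_ASSOC_LRU, INVESTIGATED };

  /* Stand-in for a cache simulator run over one address trace. */
  static long sim_misses(enum model m) {
      switch (m) {                    /* placeholder miss counts */
      case INF_CACHE:       return 100;
      case FULLY_ASSOC_LRU: return 250;
      default:              return 400;
      }
  }

  int main(void) {
      long m_inf = sim_misses(INF_CACHE);        /* compulsory only */
      long m_fa  = sim_misses(FULLY_ASSOC_LRU);  /* + capacity      */
      long m_c   = sim_misses(INVESTIGATED);     /* + conflict      */

      printf("compulsory = %ld\n", m_inf);
      printf("capacity   = %ld\n", m_fa - m_inf);
      printf("conflict   = %ld\n", m_c - m_fa);
      return 0;
  }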

129
Cache size ↑ ⇒ miss rate ↓. Associativity ↑ ⇒ miss rate ↓.
[Plot: miss rate (0-0.14) versus cache size (1-128 KiB) for 1-, 2-, 4- and 8-way associativity, SPEC92 benchmarks, with the compulsory and capacity components marked. Rule of thumb (the 2:1 rule): a direct-mapped cache of size N misses about as often as a 2-way set-associative cache of size N/2.]
Source: Patterson & Hennessy
130
3Cs Relative Miss Rate
[Plot: miss rate per type, as a share of the total (0-100%), versus cache size (1-128 KiB) for 1-, 2-, 4- and 8-way associativity; conflict misses shrink with associativity, capacity misses dominate at small sizes, and compulsory misses are a small constant share.]
131
Replacement strategies
  • Least recently used
  • OPT (will not be used for the longest time)
  • Random (choose one)

Miss rates, instruction cache (%):

Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KiB   5.18       5.69          4.67       5.29          4.39       4.96
64 KiB   1.88       2.01          1.54       1.66          1.39       1.53
256 KiB  1.15       1.17          1.13       1.13          1.12       1.12
132
Overview
  • Caches: basic operation
  • Miss classification
  • Cache improvements
  • Related to block size
  • Related to cache size
  • Related to indexing

133
Block size ↑ ⇒ miss rate first ↓, then ↑
[Plot: miss rate (0-25%) of a direct-mapped cache versus block size (16-256 bytes) for cache sizes 1, 4, 16, 64 and 256 KiB; very large blocks in small caches drive the miss rate back up.]
134
AMAT
AMAT (cycles) per block size and cache size:

Block Size  Miss Penalty (to mem)  4 KiB   16 KiB  64 KiB  256 KiB
16          82                     8.027   4.231   2.673   1.894
32          84                     7.082   3.411   2.134   1.588
64          88                     7.160   3.323   1.933   1.449
128         96                     8.469   3.659   1.979   1.470
256         112                    11.651  4.685   2.288   1.549
135
Critical word first / early restart: hit time ↓
  • Critical word first first load the requested
    word from memory and forward it to the CPU, then
    complete the rest of the cache block.
  • Early restart load a complete cache block, but
    forward the requested word to the CPU as soon as
    it arrives.

Good for large cache blocks. Early restart gives a varying hit time.
136
Stream buffer: miss rate ↓
Instruction cache: the Alpha 21064 fetches 2 blocks on a miss and places the extra block in a stream buffer; on a miss, the stream buffer is checked first. A single data-stream buffer eliminated 25% of the misses of a 4 KiB cache; 4 streams got 43%. [Jouppi, 1990]
[Diagram: L1 and stream buffer both sit in front of L2.]
137
Stream buffer
Data cache: for scientific programs, 8 stream buffers captured 50% to 70% of the misses of 64 KiB, 4-way set-associative caches. [Palacharla & Kessler, 1994]
[Diagram: L1 and stream buffer both sit in front of L2.]
Stream buffers only make sense when there is enough bandwidth to the next level in the memory hierarchy.
Reduces compulsory and capacity misses.
138
Stream buffer improvements
  • Multi-way streams
  • Multiple parallel stream buffers, one per
    instruction or data stream
  • Stride detection
  • For non-unit stride access to memory

139
Overview
  • Caches: basic operation
  • Miss classification
  • Cache improvements
  • Related to block size
  • Related to cache size
  • Related to indexing

140
Cache size ↑ ⇒ hit time ↑. Associativity ↑ ⇒ hit time ↑.
[Plot: access time (0-14 ns) versus cache size (4-256 KiB) for associativity 1, 2, 4 and fully associative (FA).]
The L1 data cache shrank from 2-way 16 KiB in the Pentium III to 4-way 8 KiB in the Pentium 4.
141
Cache size/assoc vs. AMAT
AMAT (cycles):

Cache Size (KiB)  1-way  2-way (10)  4-way (12)  8-way (14)
1                 7.65   6.60        6.22        5.44
2                 5.90   4.90        4.62        4.09
4                 4.60   3.95        3.57        3.19
8                 3.30   3.00        2.87        2.59
16                2.45   2.20        2.12        2.04
32                2.00   1.80        1.77        1.79
64                1.70   1.60        1.57        1.59
128               1.50   1.45        1.42        1.44

AMAT = Hit Time + Miss Rate × Miss Penalty
142
Split L1 caches
Pentium 4 EE:
  Processor ↔ L1 I (12Ki) / L1 D (8KiB): 2 cycles
  L2 cache (512 KiB): 19 cycles
  L3 cache (2 MiB): 43 cycles
  Memory: 206 cycles
143
Split vs. Unified Cache
Miss rates (%):

Size     Instruction Cache  Data Cache  Unified Cache
1 KiB    3.06               24.61       13.34
2 KiB    2.26               20.57       9.78
4 KiB    1.78               15.94       7.24
8 KiB    1.10               10.19       4.57
16 KiB   0.64               6.47        2.87
32 KiB   0.39               4.82        1.99
64 KiB   0.15               3.77        1.35
128 KiB  0.02               2.88        0.95
Harvard architecture
144
Example
Make the common case fast.
Assume 20% data accesses, 80% instruction accesses, 16 KiB caches, a miss penalty of 50 cycles and a hit time of 1 cycle.
Split cache:   AMAT = 80% × (1 + 0.64% × 50) + 20% × (1 + 6.47% × 50) = 1.903
Unified cache: AMAT = 80% × (1 + 1.99% × 50) + 20% × (2 + 1.99% × 50) = 2.195
The extra cycle for data accesses models the single-ported unified cache.
145
Filter cache
Processor → filter cache (L0) → L1 cache
  • Small L0 direct-mapped cache (e.g. 256 B) in front of the standard cache
  • Performance penalty of 21% due to the high miss rate [Kin, 1997]
  • Consumes less power
146
Dynamically Loaded Loop Cache
  • Small loop cache
  • Alternative location to fetch instructions
  • Dynamically fills the loop cache
  • Triggered by short backwards branch (sbb)
    instruction

... add r1,2 ... sbb -5
147
Preloaded loop cache
  • Small loop cache
  • Loop cache filled at compile time and remains
    fixed
  • Fetch triggered by
  • short backwards branch
  • start address of the loop

... add r1,2 ... sbb -5
148
Victim Buffer
[Diagram: CPU → L1 (HIT); on an L1 miss the victim buffer is checked (HIT); only if both miss (MISS) does the request go to memory.]
A 4-entry victim cache removes 20% to 95% of the conflict misses of a 4 KiB direct-mapped data cache. [Jouppi, 1990]
149
Overview
  • Caches: basic operation
  • Miss classification
  • Cache improvements
  • Related to block size
  • Related to cache size
  • Related to indexing

150
Randomizing cache index functions
[Diagram: 6-bit addresses a5a4a3a2a1a0 (0000xx ... 1111xx); the standard index function H takes bits a3a2, so the addresses map to sets 00, 01, 10, 11 in a regular repeating pattern.]
Direct mapped cache
151
Randomizing cache index functions
[Diagram: the same addresses with the randomizing index function H = (a5⊕a3)(a4⊕a2); addresses that used to collide now spread over different sets.]
Direct mapped cache
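The slide's XOR index function in C, for 6-bit addresses and 4 sets (bit numbering as in the figure):

  #include <stdint.h>

  /* Standard index: bits a3a2. */
  unsigned index_standard(uint8_t addr) {
      return (addr >> 2) & 0x3;
  }

  /* Randomizing index: (a5 xor a3, a4 xor a2). */
  unsigned index_xor(uint8_t addr) {
      unsigned a5 = (addr >> 5) & 1, a4 = (addr >> 4) & 1;
      unsigned a3 = (addr >> 3) & 1, a2 = (addr >> 2) & 1;
      return ((a5 ^ a3) << 1) | (a4 ^ a2);
  }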
152
Effect of randomized address bits
[Plot: miss rate (0-7%) versus number of randomized address bits (8-16), shown for fp, int and overall. Vandierendonck, 2004]
153
Skewed-associative cache: mapping conflicts ↓
  • 2-way skewing
  • 2 banks, with different set index functions
  • Randomization!
  • Inter-bank dispersion
  • Blocks may conflict in one bank, but probably not in the other
  • Set-associative: H1 = H2; skewed: H1 ≠ H2

[Diagram: the block address is hashed by H1 for bank 1 and by H2 for bank 2; each bank stores tag + data.]
154
Inter bank dispersion in action
  • Set-associative: blocks that conflict in bank 1 also conflict in bank 2
  • Skewed-associative: blocks that conflict in bank 1 map to different sets in bank 2

[Diagram: the same block addresses placed in bank 1 / bank 2 (tag + data) under set-associative versus skewed-associative mapping.]
155
Limited Inter Bank Dispersion
Goal: choose H1 and H2 such that the inter-bank dispersion (IBD) is maximal.
[Diagram: matrix of H1 sets (00, 11, 01, 10) against H2 sets, showing how the blocks of one H1 set disperse over the H2 sets.]
156
Trace cache
[Diagram: a traditional cache stores instructions in static program order (i1, call f ... i7 i8 i9 ret ... i2 i3); a trace cache stores them in the dynamic execution order, so the path across the call becomes one contiguous trace.]
157
Example
                Pentium 4        UltraSPARC III
Clock (2001)    2000 MHz         900 MHz
L1 I cache      96 KiB TC        32 KiB, 4WSA
  latency       4                2
L1 D cache      8 KiB, 4WSA      64 KiB, 4WSA
  latency       2                2
TLB             128              128
L2 cache        256 KiB, 8WSA    8 MiB, DM (off chip)
  latency       6                15
Block size      64 bytes         32 bytes
Bus width       64 bits          128 bits
Bus clock       400 MHz          150 MHz
158
Other examples of caching
159
Memory Access
[Flowchart: load/store → in cache? (hit, 90-99%: 1 cycle) → if not, in RAM? (load cache line: 60 cycles) → if not, load page from disk (probability 0.001-0.00001: 8,000,000 cycles).
On a human scale (1 cycle = 1 s): 60 cycles ≈ 1 min, 8,000,000 cycles ≈ 92 days.]
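A back-of-the-envelope expected-latency calculation in C with the slide's numbers; the exact probabilities are assumptions picked inside the slide's ranges:

  #include <stdio.h>

  int main(void) {
      /* Slide numbers: cache hit 1 cycle, RAM 60 cycles, page fault 8,000,000 cycles. */
      double p_hit   = 0.95;      /* assumed; slide says 90-99%        */
      double p_fault = 0.0001;    /* assumed; slide says 0.001-0.00001 */
      double p_ram   = 1.0 - p_hit - p_fault;

      double cycles = p_hit * 1.0 + p_ram * 60.0 + p_fault * 8e6;
      printf("expected access time = %.1f cycles\n", cycles);   /* about 804 */
      return 0;
  }

Even a tiny page-fault probability dominates the average, which is why the slow cases matter so much.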
160
Questions?