Title: Software Optimisation for Embedded Systems
1Software Optimisation for Embedded Systems
- Prof. Koen De Bosschere
- Université de Gand
2Memory hierarchy
Second Lecture
3Preliminaria
1 kibibyte (KiB) = 2^10 = 1 024 bytes
1 mebibyte (MiB) = 2^20 = 1 048 576 bytes
1 gibibyte (GiB) = 2^30 = 1 073 741 824 bytes
1 tebibyte (TiB) = 2^40 bytes
1 kilobyte (kB) = 10^3 = 1 000 bytes
1 megabyte (MB) = 10^6 = 1 000 000 bytes
1 gigabyte (GB) = 10^9 = 1 000 000 000 bytes
1 terabyte (TB) = 10^12 bytes
Source: http://physics.nist.gov/cuu/Units/binary.html
International Standard IEC 60027-2
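The gap between binary and decimal prefixes grows with each step, which matters when comparing cache sizes (binary) with advertised disk sizes (decimal). A minimal sketch of that arithmetic; the dictionary and function names are illustrative only:

```python
# Binary (IEC) prefixes are powers of 2; decimal (SI) prefixes are powers of 10.
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def relative_gap(iec_unit, si_unit):
    """Fraction by which the binary unit exceeds the decimal unit."""
    return IEC[iec_unit] / SI[si_unit] - 1


# Pair each binary prefix with its decimal counterpart (KiB/kB, MiB/MB, ...).
for iec_unit, si_unit in zip(IEC, SI):
    print(f"1 {iec_unit} = {IEC[iec_unit]} bytes, "
          f"{relative_gap(iec_unit, si_unit):.1%} more than 1 {si_unit}")
```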
4Memory Hierarchy
Toward the top: smaller, faster, and costlier (per byte) storage devices.
Toward the bottom: larger, slower, and cheaper (per byte) storage devices.
L0: registers (CPU registers hold words retrieved from L1 cache)
L1: on-chip L1 cache (SRAM)
L2: on/off-chip L2/L3 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (distributed file systems, Web servers)
5Storage Evolution
SRAM
metric              1980    1985   1990  1995   2000   2000:1980
cost/MiB            19,200  2,900  320   256    100    192
access (ns)         300     150    35    15     2      150

DRAM
metric              1980    1985   1990  1995   2000   2000:1980
cost/MiB            8,000   880    100   30     1      8,000
access (ns)         375     200    100   70     60     6
typical size (MiB)  0.064   0.256  4     16     64     1,000

Disk
metric              1980    1985   1990  1995   2000   2000:1980
cost/MiB            500     100    8     0.30   0.05   10,000
access (ms)         87      75     28    10     8      11
typical size (MiB)  1       10     160   1,000  9,000  9,000

Source: Byte and PC Magazine
6Magnetic storage
20 MB/mm²
8.5 nm particles
Assumed max density: 50 Tbpsi
100 nm
7Cost/MiB
[Figure: cost per MiB (log scale, 0.01 to 100,000) versus year, 1980 to 2000, for SRAM, DRAM, and disk.]
8Storage Capacity Evolution
[Figure: typical capacity in MiB (log scale, 0.01 to 100,000) versus year, 1980 to 2005, for disk and DRAM, together with the disk/DRAM ratio.]
Machrone's law: RAM 500, hard disk 500
9Access time evolution
[Figure: access time in ns (log scale, 1 to 100,000,000) versus year, 1980 to 2000, for disk, DRAM, and SRAM, with the access time gap between disk and DRAM marked.]
10Memory Wall
[Figure: relative performance (log scale, 1 to 1000) versus year, 1980 to 2000.]
CPU (Moore's Law): µProc 60%/yr (2X/1.5 yr)
DRAM: 9%/yr (2X/10 yrs)
Processor-memory performance gap: grows 50%/year
11Overview
- Caches basic operation
- Miss classification
- Cache improvements
12Caches
Cache keeps intruders away from backcountry
supplies
13Pentium 4 EE: cache hierarchy
Processor
L1 I (12Ki), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
14Performance impact of non-ideal memory
[Figure: IPC (0 to 6) for the benchmarks bzip2, crafty, eon, gcc, gzip, parser, perlbmk, twolf, vortex, and vpr under three configurations: everything perfect; real branch predictor; real branch predictor and real memory hierarchy.]
15Locality
temporal
spatial
Quicksort
16Working set
Working set: the set of memory locations used during Δt
[Figure: working set size versus time t.]
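As a rough illustration of the definition, the sketch below measures the working set size over a sliding window of an address trace. It approximates the interval Δt by a fixed number of accesses, which is an assumption made here for simplicity; the function name and the trace are hypothetical.

```python
def working_set_sizes(trace, window):
    """Number of distinct locations touched in each window of `window` accesses."""
    sizes = []
    for start in range(len(trace) - window + 1):
        sizes.append(len(set(trace[start:start + window])))
    return sizes


# A program that moves from locations {0, 1} to locations {2, 3}:
# the working set grows during the transition and shrinks again afterwards.
trace = [0, 1, 0, 1, 2, 3, 2, 3]
print(working_set_sizes(trace, window=4))
```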
17Basic cache operation
[Figure: CPU, cache, and memory; memory blocks shown at addresses 00, 08, 10, 18, 20, 28, 30, 38, 40, 48, 50.]
18Basic Cache Types
- Direct-mapped caches
- Set-associative caches
- Fully associative caches
19Direct mapped cache
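[Figure: the address is split into tag, index, and offset fields. The index selects one cache block, which stores a valid bit, a dirty bit, a tag, and data; a hit is signalled when the block is valid and its stored tag matches the address tag.]
E.g. 512 blocks of 32 bytes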
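The address split can be sketched for the example of 512 blocks of 32 bytes: 5 offset bits, 9 index bits, and the remaining high-order bits as tag. The function name and the sample addresses are illustrative assumptions.

```python
BLOCK_SIZE = 32    # bytes per block (from the example)
NUM_BLOCKS = 512   # blocks in the cache (from the example)

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1  # log2(32) = 5
INDEX_BITS = NUM_BLOCKS.bit_length() - 1   # log2(512) = 9


def split_address(address):
    """Return (tag, index, offset) for a direct-mapped cache."""
    offset = address & (BLOCK_SIZE - 1)
    index = (address >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Two addresses exactly one cache size (16 KiB) apart share index and offset
# but differ in tag, so they compete for the same cache block.
print(split_address(0x12345))
print(split_address(0x12345 + NUM_BLOCKS * BLOCK_SIZE))
```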
20Direct mapped cache
memory
cache
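21 4-way set associative
[Figure: the address indexes four ways in parallel; tag comparison produces the hit signal and a multiplexer selects the data.]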
22Two-way set associative cache
23Fully associative cache
24Associativity
Size = sets × associativity × block size
Direct mapped: associativity = 1
Fully associative: sets = 1
[Figure: tag and data arrays of the same cache organised as direct mapped; 2-way SA, 4 sets; 4-way SA, 2 sets; fully associative.]
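The size relation can be turned around to find the number of sets for a given organisation. The sketch below assumes an 8-block cache (consistent with "2-way SA, 4 sets" and "4-way SA, 2 sets") and an arbitrary block size of 32 bytes; the function name is hypothetical.

```python
def num_sets(cache_size, associativity, block_size):
    """Solve size = sets x associativity x block size for the number of sets."""
    sets, remainder = divmod(cache_size, associativity * block_size)
    if remainder:
        raise ValueError("cache size is not a multiple of associativity x block size")
    return sets


BLOCK_SIZE = 32            # assumed block size in bytes
CACHE_SIZE = 8 * BLOCK_SIZE  # 8 blocks in total

for ways in (1, 2, 4, 8):  # 1 = direct mapped, 8 = fully associative here
    print(f"{ways}-way: {num_sets(CACHE_SIZE, ways, BLOCK_SIZE)} sets")
```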
25Exploiting spatial locality
address
multiplexer
26Average Memory Access Time
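$$\text{AMAT} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})$$
$$\text{AMAT} = (\text{Hit Rate} \times \text{Hit Time}) + (\text{Miss Rate} \times \text{Miss Time})$$
Example (c = cycles): $3\,c + 0.02 \times 100\,c = 5\,c$, or equivalently $0.98 \times 3\,c + 0.02 \times 103\,c = 5\,c$
Miss rate ↓, miss penalty ↓, hit time ↓ ⇒ AMAT ↓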
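Both forms of the AMAT formula give the same result once the miss time is taken as hit time plus miss penalty (103 c = 3 c + 100 c). A small sketch using the numbers of the example; the function names are illustrative.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def amat_weighted(hit_time, miss_rate, miss_penalty):
    """AMAT = hit rate x hit time + miss rate x miss time."""
    miss_time = hit_time + miss_penalty
    return (1 - miss_rate) * hit_time + miss_rate * miss_time


# Hit time 3 cycles, miss rate 2%, miss penalty 100 cycles.
print(amat(3, 0.02, 100))
print(amat_weighted(3, 0.02, 100))
```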
27Overview
- Caches basic operation
- Miss classification
- Cache improvements
28Miss classification: 3Cs model
- Compulsory misses: first-time misses
- INF: infinitely large cache
- compulsory misses = misses(INF)
- Capacity misses: cache size
- FA: fully associative cache, LRU replacement
- capacity misses = misses(FA) - misses(INF)
29Miss classification: 3Cs model
- Conflict misses: set index functions
- C: investigated cache with investigated replacement policy
- Conflict misses = misses(C) - misses(FA)
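The three definitions can be applied directly by simulating the same block trace on three caches. The sketch below is a simplified model under stated assumptions: the trace holds block numbers rather than byte addresses, the investigated cache C uses LRU replacement, and its set index is the block number modulo the number of sets. All names are hypothetical.

```python
from collections import OrderedDict


def count_misses(blocks, num_blocks, associativity):
    """Misses of an LRU cache with `num_blocks` blocks and modulo set indexing."""
    num_sets = num_blocks // associativity
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) == associativity:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses


def classify_misses(blocks, num_blocks, associativity):
    """Split the misses of cache C into the three classes of the 3Cs model."""
    inf = len(set(blocks))                               # misses(INF)
    fa = count_misses(blocks, num_blocks, num_blocks)    # misses(FA)
    c = count_misses(blocks, num_blocks, associativity)  # misses(C)
    return {"compulsory": inf, "capacity": fa - inf, "conflict": c - fa}


# 4-block direct-mapped cache; blocks 0 and 4 map to the same set.
print(classify_misses([0, 4, 0, 4, 1, 2, 3, 5, 0], num_blocks=4, associativity=1))
```

Because the conflict count is a difference between two simulations, it can come out negative for unusual traces in which LRU on the fully associative cache happens to perform worse than the investigated cache.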
30Cache size ↑ ⇒ Miss rate ↓; Associativity ↑ ⇒ Miss rate ↓
[Figure: miss rate (0 to 0.14) versus cache size (1 to 128 KiB) for 1-way, 2-way, 4-way, and 8-way associativity on the SPEC 92 benchmarks, with the capacity and compulsory miss components marked and the 2:1 rule annotated. Source: Patterson & Hennessy.]
31 3Cs Relative Miss Rate
[Figure: share of the miss rate per type (0 to 100) versus cache size (1 to 128 KiB), split into conflict misses (1-way, 2-way, 4-way, 8-way), capacity misses, and compulsory misses.]
32Replacement strategies
- Least recently used
- OPT (will not be used for the longest time)
- Random (choose one)
Miss rates (%), instruction cache
Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KiB   5.18       5.69          4.67       5.29          4.39       4.96
64 KiB   1.88       2.01          1.54       1.66          1.39       1.53
256 KiB  1.15       1.17          1.13       1.13          1.12       1.12
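To make the difference between LRU and OPT concrete, the sketch below counts misses for both on a small fully associative cache. It is a toy model with hypothetical names, working on block numbers; OPT needs the future of the trace, so it serves only as a lower bound and not as an implementable policy.

```python
def lru_misses(trace, capacity):
    """Fully associative cache, least recently used replacement."""
    cache = []  # ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)  # evict the least recently used block
        cache.append(block)
    return misses


def opt_misses(trace, capacity):
    """Fully associative cache, evicting the block not used for the longest time."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]

            def next_use(b):
                # Distance to the next use; blocks never used again sort last.
                return future.index(b) if b in future else len(future)

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


# Cyclic access to 4 blocks with room for 3: the worst case for LRU.
trace = [0, 1, 2, 3, 0, 1, 2, 3]
print(lru_misses(trace, 3), opt_misses(trace, 3))
```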
33Overview
- Caches basic operation
- Miss classification
- Cache improvements
- Related to block size
- Related to cache size
- Related to indexing
34Block size ↑ ⇒ Miss rate ↓↑
[Figure: miss rate (0 to 25) versus block size (16 to 256 bytes) for cache sizes of 1 KiB, 4 KiB, 16 KiB, 64 KiB, and 256 KiB.]
35AMAT
                                   Cache size
Block size  Miss penalty (to mem)  4K      16K    64K    256K
16          82                     8.027   4.231  2.673  1.894
32          84                     7.082   3.411  2.134  1.588
64          88                     7.160   3.323  1.933  1.449
128         96                     8.469   3.659  1.979  1.470
256         112                    11.651  4.685  2.288  1.549
36Critical word first / Early restart: hit time ↓
- Critical word first: first load the requested word from memory and forward it to the CPU, then complete the rest of the cache block.
- Early restart: load a complete cache block, but forward the requested word to the CPU as soon as it arrives.
Good for large cache blocks.
Early restart: varying hit time.
37Stream buffer: miss rate ↓
Instruction cache: the Alpha 21064 fetches 2 blocks on a miss. The extra block is placed in a stream buffer. On a miss, the stream buffer is checked.
- 1 data stream buffer got 25% of misses from a 4 KiB cache
- 4 streams got 43% [Jouppi, 1990]
[Figure: stream buffer placed between L1 and L2.]
38Stream buffer
Data cache for scientific programs: 8 streams got 50% to 70% of misses from 64 KiB, 4-way set associative caches [Palacharla & Kessler, 1994].
[Figure: stream buffer placed between L1 and L2.]
Stream buffers only make sense when there is enough bandwidth to the next level in the memory hierarchy.
Reduces compulsory and capacity misses
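A much simplified model of a single sequential stream buffer can show why it helps streaming accesses but not conflicts. The sketch assumes a direct-mapped cache of block numbers, a FIFO buffer that holds the next `depth` sequential blocks, and a lookup that checks only the head of the buffer; it ignores timing and bandwidth entirely. All names are hypothetical.

```python
from collections import deque


def simulate(blocks, num_lines, depth):
    """Return (cache_misses, misses_served_by_the_stream_buffer)."""
    cache = [None] * num_lines
    stream = deque()
    misses = served = 0
    for block in blocks:
        line = block % num_lines
        if cache[line] == block:
            continue  # cache hit
        misses += 1
        if stream and stream[0] == block:
            # The head of the stream buffer holds the block: no access to the
            # next level is needed, and one more sequential block is prefetched.
            served += 1
            stream.popleft()
            stream.append(stream[-1] + 1 if stream else block + 1)
        else:
            # Miss in both: restart the stream just after the missing block.
            stream = deque(block + i for i in range(1, depth + 1))
        cache[line] = block
    return misses, served


print(simulate(list(range(16)), num_lines=4, depth=4))  # sequential scan
print(simulate([0, 8, 0, 8], num_lines=4, depth=4))     # ping-pong conflict
```

In this model the sequential scan is served almost entirely from the stream buffer, while the two conflicting blocks gain nothing from it.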
39Stream buffer improvements
- Multi-way streams
- Multiple parallel stream buffers, one per instruction or data stream
- Stride detection
- For non-unit stride access to memory
40Overview
- Caches basic operation
- Miss classification
- Cache improvements
- Related to block size
- Related to cache size
- Related to indexing
41Cache size ↑ ⇒ hit time ↑; Associativity ↑ ⇒ hit time ↑
[Figure: access time in ns (0 to 14) versus cache size (4 to 256 KiB) for associativity 1, 2, 4, and fully associative (FA).]
L1 data cache reduced from 2-way 16 KiB in Pentium III to 4-way 8 KiB in Pentium 4.
42Cache size/assoc vs. AMAT
                  AMAT (c)
Cache size (KiB)  1-way  2-way (10)  4-way (12)  8-way (14)
1                 7.65   6.60        6.22        5.44
2                 5.90   4.90        4.62        4.09
4                 4.60   3.95        3.57        3.19
8                 3.30   3.00        2.87        2.59
16                2.45   2.20        2.12        2.04
32                2.00   1.80        1.77        1.79
64                1.70   1.60        1.57        1.59
128               1.50   1.45        1.42        1.44
$$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$$
43Split L1 caches
Pentium 4 EE
Processor
L1 I (12Ki), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
44Split vs. Unified Cache
Miss rate (%)
Size     Instruction Cache  Data Cache  Unified Cache
1 KiB    3.06               24.61       13.34
2 KiB    2.26               20.57       9.78
4 KiB    1.78               15.94       7.24
8 KiB    1.10               10.19       4.57
16 KiB   0.64               6.47        2.87
32 KiB   0.39               4.82        1.99
64 KiB   0.15               3.77        1.35
128 KiB  0.02               2.88        0.95
Harvard architecture
45Example
Make the common case fast
20% of the accesses go to the data cache and 80% to the instruction cache; miss penalty 50 cycles; hit time 1 cycle.
Split cache: $\text{AMAT} = 80\% \times (1 + 0.64\% \times 50) + 20\% \times (1 + 6.47\% \times 50) = 1.903$
Unified cache: $\text{AMAT} = 80\% \times (1 + 1.99\% \times 50) + 20\% \times (2 + 1.99\% \times 50) = 2.195$
Extra cycle: single-ported cache
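The comparison can be checked numerically. The sketch below uses the miss rates of 16 KiB split caches (0.64% and 6.47%) and of a 32 KiB unified cache (1.99%), and charges data accesses to the unified cache one extra cycle of hit time for the single port. Variable and function names are illustrative.

```python
MISS_PENALTY = 50      # cycles
INSTR_FRACTION = 0.80  # share of accesses that are instruction fetches
DATA_FRACTION = 0.20   # share of accesses that are data accesses


def amat(hit_time, miss_rate):
    """AMAT = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * MISS_PENALTY


# Split: 16 KiB instruction cache and 16 KiB data cache, hit time 1 cycle each.
split = INSTR_FRACTION * amat(1, 0.0064) + DATA_FRACTION * amat(1, 0.0647)

# Unified: 32 KiB, data accesses wait one extra cycle for the single port.
unified = INSTR_FRACTION * amat(1, 0.0199) + DATA_FRACTION * amat(2, 0.0199)

print(f"split: {split:.3f} cycles, unified: {unified:.3f} cycles")
```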
46Filter cache
- Small L0 direct mapped cache
- Standard cache
- Performance penalty of 21% due to high miss rate (Kin 1997)
- Consumes less power
[Figure: processor, filter cache (L0), L1 cache.]
47Dynamically Loaded Loop Cache
- Small loop cache
- Alternative location to fetch instructions
- Dynamically fills the loop cache
- Triggered by short backwards branch (sbb) instruction
... add r1,2 ... sbb -5
48Preloaded loop cache
- Small loop cache
- Loop cache filled at compile time and remains fixed
- Fetch triggered by
- short backwards branch
- start address of the loop
... add r1,2 ... sbb -5
49Victim Buffer
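A 4-entry victim cache removes 20% to 95% of conflicts for a 4 KiB direct mapped data cache [Jouppi90].
[Figure: CPU, cache, victim buffer, and memory; a request hits in the cache, hits in the victim buffer, or misses and goes to memory.]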
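The effect of a victim buffer on conflict misses can be sketched with a toy simulation. The model below assumes a direct-mapped cache of block numbers and a small fully associative buffer that receives every evicted block and discards its oldest entry when full; names and the trace are hypothetical, and timing is ignored.

```python
from collections import deque


def simulate(blocks, num_lines, victim_entries):
    """Return (misses_to_memory, victim_buffer_hits)."""
    cache = [None] * num_lines
    victims = deque(maxlen=victim_entries)  # oldest entry dropped when full
    memory_misses = victim_hits = 0
    for block in blocks:
        line = block % num_lines
        if cache[line] == block:
            continue  # hit in the main cache
        if block in victims:
            victims.remove(block)  # swap the block back into the cache
            victim_hits += 1
        else:
            memory_misses += 1
        if cache[line] is not None:
            victims.append(cache[line])  # the evicted block becomes a victim
        cache[line] = block
    return memory_misses, victim_hits


# Blocks 0 and 4 conflict in a 4-line direct-mapped cache.
trace = [0, 4, 0, 4, 0, 4]
print(simulate(trace, num_lines=4, victim_entries=0))  # no victim buffer
print(simulate(trace, num_lines=4, victim_entries=4))  # 4-entry victim buffer
```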
50Overview
- Caches basic operation
- Miss classification
- Cache improvements
- Related to block size
- Related to cache size
- Related to indexing
51Randomizing cache index functions
Block addresses (a5a4a3a2a1a0): 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx
Set index: 00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
H = a3a2
Direct mapped cache
52Randomizing cache index functions
Block addresses (a5a4a3a2a1a0): 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx
Set index: 00 01 10 11 01 00 11 10 10 11 00 01 11 10 01 00
H = (a5⊕a3)(a4⊕a2)
Direct mapped cache
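The two index functions can be compared on the 16 block addresses a5a4a3a2 of the example. The sketch below treats the block address as a 4-bit integer; the function names are illustrative.

```python
def index_conventional(block):
    """H = a3 a2: the two low-order bits of the block address a5 a4 a3 a2."""
    return block & 0b11


def index_xor(block):
    """H = (a5 xor a3)(a4 xor a2): XOR the high bit pair onto the low bit pair."""
    return (block >> 2) ^ (block & 0b11)


# Index of every block address from 0000 to 1111 under both functions.
print([format(index_conventional(b), "02b") for b in range(16)])
print([format(index_xor(b), "02b") for b in range(16)])

# A stride of 4 blocks maps to a single set with the conventional index,
# but is spread over all four sets with the XOR-based index.
stride = [0, 4, 8, 12]
print([index_conventional(b) for b in stride], [index_xor(b) for b in stride])
```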
53Effect of randomized address bits
[Figure: miss rate (0 to 7) versus number of randomized address bits (8 to 16) for fp, int, and overall.]
54Skewed-Associative Cache: mapping conflicts ↓
- 2-way skewing
- 2 banks, different set index functions
- Randomization!
- Inter-bank dispersion
- Blocks may conflict in one bank, but probably not in the other
- Set-associative
- H1 = H2
[Figure: the block address is mapped by H1 into bank 1 and by H2 into bank 2; each bank holds tag and data.]
55Inter bank dispersion in action
[Figure: two snapshots of bank 1 and bank 2 (tag and data arrays) illustrating inter-bank dispersion.]
56Limited Inter Bank Dispersion
Goal: choose H1 and H2 such that the IBD is maximal
[Figure: grid with the H1 values (00, 11, 01, 10) on one axis and the H2 values (00, 11, 01, 10) on the other.]
57Trace cache
traditional cache
i1 call f
i7 i8 i9 ret
i2 i3
trace cache
58Example
Processor     Pentium 4      Ultrasparc III
Clock (2001)  2000 MHz       900 MHz
L1 I cache    96 KiB TC      32 KiB, 4WSA
Latency       4              2
L1 D cache    8 KiB 4WSA     64 KiB, 4WSA
Latency       2              2
TLB           128            128
L2 cache      256 KiB 8WSA   8 MiB DM (off chip)
Latency       6              15
Block size    64 bytes       32 bytes
Bus width     64 bits        128 bits
Bus clock     400 MHz        150 MHz
59Other examples of caching
60Questions?