1
Software optimisation for embedded systems

Prof. Koen De Bosschere
Université de Gand

2
Memory hierarchy
Second Lecture
3
Preliminaries
1 kibibyte (KiB) = 2^10 = 1 024 bytes
1 mebibyte (MiB) = 2^20 = 1 048 576 bytes
1 gibibyte (GiB) = 2^30 = 1 073 741 824 bytes
1 tebibyte (TiB) = 2^40 bytes
1 kilobyte (kB) = 10^3 = 1 000 bytes
1 megabyte (MB) = 10^6 = 1 000 000 bytes
1 gigabyte (GB) = 10^9 = 1 000 000 000 bytes
1 terabyte (TB) = 10^12 bytes
http://physics.nist.gov/cuu/Units/binary.html
International Standard IEC 60027-2
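The gap between binary and decimal prefixes grows with each step. A minimal Python sketch of the definitions on this slide (the names `BINARY`, `DECIMAL` and `relative_gap` are chosen here for illustration only):

```python
# Binary (IEC) and decimal (SI) prefixes, in bytes.
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
DECIMAL = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def relative_gap(binary_unit: str, decimal_unit: str) -> float:
    """Fraction by which the binary unit exceeds the decimal unit."""
    return BINARY[binary_unit] / DECIMAL[decimal_unit] - 1.0

# Pair the units up in order: KiB/kB, MiB/MB, GiB/GB, TiB/TB.
for b, d in zip(BINARY, DECIMAL):
    print(f"1 {b} = {BINARY[b]:,} bytes, {relative_gap(b, d):.1%} more than 1 {d}")
```

The difference is 2.4% at the kilo level and close to 10% at the tera level, which is why the slides that follow use KiB and MiB throughout.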
4
Memory Hierarchy
L0: registers (CPU registers hold words retrieved from L1 cache)
L1: on-chip L1 cache (SRAM)
L2: on/off-chip L2/L3 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (distributed file systems, Web servers)
Towards L0: smaller, faster, and costlier (per byte) storage devices
Towards L5: larger, slower, and cheaper (per byte) storage devices
5
Storage Evolution
SRAM
metric              1980    1985   1990  1995   2000   2000:1980
$/MiB               19,200  2,900  320   256    100    192
access (ns)         300     150    35    15     2      150
DRAM
metric              1980    1985   1990  1995   2000   2000:1980
$/MiB               8,000   880    100   30     1      8,000
access (ns)         375     200    100   70     60     6
typical size (MiB)  0.064   0.256  4     16     64     1,000
Disk
metric              1980    1985   1990  1995   2000   2000:1980
$/MiB               500     100    8     0.30   0.05   10,000
access (ms)         87      75     28    10     8      11
typical size (MiB)  1       10     160   1,000  9,000  9,000
Source: Byte and PC Magazine
6
Magnetic storage
20 MB/mm^2
8.5 nm particles
Assumed max density 50 Tbpsi
100 nm
7
[Figure: cost per MiB (log scale, 0.01 to 100,000) of SRAM, DRAM and disk, 1980 to 2000]
8
Storage Capacity Evolution
[Figure: typical capacity in MiB (log scale, 0.01 to 100,000) of DRAM and disk, and the disk/DRAM ratio, 1980 to 2005]
Machrone's law: RAM 500, hard disk 500
9
Access time evolution
[Figure: access time in ns (log scale, 1 to 100,000,000) of SRAM, DRAM and disk, 1980 to 2000, with the access time gap marked]
10
Memory Wall
CPU (Moore's Law): µProc 60%/yr (2x per 1.5 years)
DRAM: 9%/yr (2x per 10 years)
Processor-memory performance gap: grows 50%/year
[Figure: relative performance (log scale, 1 to 1000) of CPU and DRAM, 1980 to 2000]
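To see how the two growth rates compound into the gap, here is a rough Python sketch. It takes the slide's 60%/yr and 9%/yr at face value; the function names are illustrative.

```python
def annual_rate(factor: float, years: float) -> float:
    """Annual growth rate equivalent to improving by `factor` every `years` years."""
    return factor ** (1.0 / years) - 1.0

CPU_RATE = 0.60    # slide figure for processors
DRAM_RATE = 0.09   # slide figure for DRAM

# Yearly growth of the ratio CPU performance / DRAM performance.
gap_per_year = (1 + CPU_RATE) / (1 + DRAM_RATE) - 1

def gap_after(years: int) -> float:
    """Factor by which the processor-memory gap has grown after `years` years."""
    return (1 + gap_per_year) ** years

print(f"2x every 1.5 years = {annual_rate(2, 1.5):.1%} per year")
print(f"gap grows {gap_per_year:.1%} per year, x{gap_after(10):.0f} in 10 years")
```

The computed gap growth is a little under the rounded 50%/year quoted on the slide, but the compounding effect is the same: over a decade the gap grows by well over an order of magnitude.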
11
Overview

Caches basic operation
Miss classification
Cache improvements

12
Caches
Cache keeps intruders away from backcountry
supplies
13
Pentium 4 EE: cache hierarchy
Processor
L1 I (12Ki), L1 D (8KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
14
Performance impact of non-ideal memory
[Figure: IPC (0 to 6) for bzip2, crafty, eon, gcc, gzip, parser, perlbmk, twolf, vortex and vpr in three configurations: everything perfect; real branch predictor; real branch predictor + real memory hierarchy]
15
Locality
temporal
spatial
Quicksort
16
Working set
Set of memory locations used during Δt
[Figure: working set size as a function of time t]
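The working set definition can be made concrete on a small address trace. The sketch below assumes time is measured in number of references and uses a made-up trace; `working_set` is an illustrative helper, not part of the lecture.

```python
def working_set(trace, t, delta_t):
    """Distinct locations referenced in the last `delta_t` references up to time t."""
    start = max(0, t - delta_t + 1)
    return set(trace[start:t + 1])

# Hypothetical trace: a loop over locations 0..2, then a loop over 8..9.
trace = [0, 1, 2, 0, 1, 2, 8, 9, 8, 9, 8, 9]
sizes = [len(working_set(trace, t, 4)) for t in range(len(trace))]
print(sizes)  # [1, 2, 3, 3, 3, 3, 4, 4, 3, 2, 2, 2]
```

The size peaks briefly at the phase change and then settles at the size of the new loop, which is the behaviour a cache has to accommodate.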
17
Basic cache operation
[Figure: CPU, cache and memory; memory blocks at addresses 00, 08, 10, 18, 20, 28, 30, 38, 40, 48, 50]
18
Basic Cache Types

Direct-mapped caches
Set-associative caches
Fully associative caches

19
Direct mapped cache
address = tag | index | offset
each cache line: valid bit, dirty bit, tag, data
e.g. 512 blocks of 32 bytes
outputs: data, hit
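For the example geometry on this slide (512 blocks of 32 bytes), the address split can be sketched as follows. The function name and the sample address are illustrative assumptions.

```python
BLOCK_SIZE = 32     # bytes per block (from the slide)
NUM_BLOCKS = 512    # blocks in the cache (from the slide)
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1   # 5 bits select a byte in the block
INDEX_BITS = NUM_BLOCKS.bit_length() - 1    # 9 bits select the cache line

def split_address(addr: int):
    """Return (tag, index, offset) for a direct-mapped cache."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x12345678))  # (18641, 179, 24)
```

Two addresses that differ by a multiple of the cache size (16 KiB here) get the same index and differ only in their tag, so they compete for the same cache line.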
20
Direct mapped cache
memory
cache
21
4-way set associative
address

multiplexer
data
hit
22
Two-way set associative cache
23
Fully associative cache
24
Associativity
Size = #sets x associativity x block size
Direct mapped: associativity = 1
Fully associative: #sets = 1
[Figure: the same cache (tag and data per block) organised as direct mapped, 2-way SA with 4 sets, 4-way SA with 2 sets, and fully associative]
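The size relation can be turned around to get the number of sets for a given organisation. A small sketch, assuming a hypothetical 8-block cache with 32-byte blocks so that the results line up with the organisations named on the slide:

```python
def num_sets(size_bytes: int, associativity: int, block_size: int) -> int:
    """Number of sets, from size = sets x associativity x block size."""
    sets, remainder = divmod(size_bytes, associativity * block_size)
    if remainder:
        raise ValueError("size must be a multiple of associativity x block size")
    return sets

BLOCK = 32          # assumed block size in bytes
SIZE = 8 * BLOCK    # assumed cache of 8 blocks
for ways in (1, 2, 4, 8):
    print(f"{ways}-way: {num_sets(SIZE, ways, BLOCK)} sets")
```

With 8 blocks, 1-way (direct mapped) gives 8 sets, 2-way gives 4 sets, 4-way gives 2 sets, and 8-way is fully associative with a single set.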
25
Exploiting spatial locality
address

multiplexer
26
Average Memory Access Time
AMAT = Hit Time + (Miss Rate x Miss Penalty)
     = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
3 c + 0.02 x 100 c = 5 c
0.98 x 3 c + 0.02 x 103 c = 5 c
Miss rate ↓, miss penalty ↓, hit time ↓ ⇒ AMAT ↓
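Both forms of the AMAT formula give the same result when the miss time is taken as hit time plus miss penalty. A short Python check with the slide's numbers (function names are illustrative):

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """AMAT = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

def amat_weighted(hit_time: float, miss_rate: float, miss_time: float) -> float:
    """AMAT = hit rate x hit time + miss rate x miss time."""
    return (1 - miss_rate) * hit_time + miss_rate * miss_time

# Slide example: 3-cycle hit, 2% miss rate, 100-cycle penalty (miss time 103).
print(amat(3, 0.02, 100))           # about 5 cycles
print(amat_weighted(3, 0.02, 103))  # about 5 cycles
```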
27
Overview

Caches basic operation
Miss classification
Cache improvements

28
Miss classification: 3Cs model

Compulsory misses: first-time misses
INF = infinitely large cache
compulsory misses = misses(INF)
Capacity misses: cache size
FA = fully associative cache, LRU replacement
capacity misses = misses(FA) - misses(INF)

29
Miss classification: 3Cs model

Conflict misses: set index functions
C = investigated cache with investigated replacement policy
conflict misses = misses(C) - misses(FA)
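The three definitions translate directly into three simulations of the same block trace. A minimal sketch, assuming the investigated cache C is an LRU set-associative cache indexed by block number modulo the number of sets; all function names and the traces are illustrative.

```python
from collections import OrderedDict

def misses_inf(blocks):
    """Infinitely large cache: only the first touch of each block misses."""
    return len(set(blocks))

def misses_set_assoc(blocks, num_sets, ways):
    """LRU set-associative cache; num_sets=1 makes it fully associative."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for b in blocks:
        s = sets[b % num_sets]
        if b in s:
            s.move_to_end(b)               # now the most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)      # evict the least recently used
            s[b] = True
    return misses

def classify(blocks, num_sets, ways):
    inf = misses_inf(blocks)
    fa = misses_set_assoc(blocks, 1, num_sets * ways)   # same total capacity
    c = misses_set_assoc(blocks, num_sets, ways)
    return {"compulsory": inf, "capacity": fa - inf, "conflict": c - fa}

# Direct-mapped cache with 4 blocks; blocks 0 and 4 share a set.
print(classify([0, 4, 0, 4, 0, 4], num_sets=4, ways=1))
# {'compulsory': 2, 'capacity': 0, 'conflict': 4}
print(classify(list(range(8)) * 2, num_sets=4, ways=1))
# {'compulsory': 8, 'capacity': 8, 'conflict': 0}
```

Because the classification is a difference of miss counts, the conflict term can come out negative for some traces, when LRU in the fully associative cache happens to do worse than the investigated cache.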

30
Cache size ↑ ⇒ miss rate ↓; associativity ↑ ⇒ miss rate ↓
[Figure: miss rate (0 to 0.14) versus cache size (1 to 128 KiB) for 1-way, 2-way, 4-way and 8-way caches on the SPEC 92 benchmarks, with the compulsory and capacity miss components and the 2:1 rule marked]
Source: Patterson & Hennessy
31
3Cs Relative Miss Rate
[Figure: relative miss rate per type (0 to 100), split into compulsory, capacity and conflict misses, versus cache size (1 to 128 KiB) for 1-way, 2-way, 4-way and 8-way caches]
32
Replacement strategies

Least recently used
OPT (will not be used for the longest time)
Random (choose one)

Miss rates (%), instruction cache
Size      2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KiB    5.18       5.69          4.67       5.29          4.39       4.96
64 KiB    1.88       2.01          1.54       1.66          1.39       1.53
256 KiB   1.15       1.17          1.13       1.13          1.12       1.12
33
Overview

Caches basic operation
Miss classification
Cache improvements
Related to block size
Related to cache size
Related to indexing

34
Block size ↑ ⇒ miss rate first ↓, then ↑
[Figure: miss rate (0 to 25) versus block size (16 to 256 bytes) for cache sizes of 1 KiB, 4 KiB, 16 KiB, 64 KiB and 256 KiB]
35
AMAT by block size and cache size
Block size  Miss penalty (to mem)  4K      16K    64K    256K
16          82                     8.027   4.231  2.673  1.894
32          84                     7.082   3.411  2.134  1.588
64          88                     7.160   3.323  1.933  1.449
128         96                     8.469   3.659  1.979  1.470
256         112                    11.651  4.685  2.288  1.549
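Reading the table column by column shows which block size minimises AMAT for each cache size. A small sketch that does this lookup on the table's own numbers (the dictionary layout and names are chosen here for illustration):

```python
CACHE_SIZES = ["4K", "16K", "64K", "256K"]
# AMAT per block size (bytes), one value per cache size, copied from the table.
AMAT_BY_BLOCK_SIZE = {
    16:  [8.027, 4.231, 2.673, 1.894],
    32:  [7.082, 3.411, 2.134, 1.588],
    64:  [7.160, 3.323, 1.933, 1.449],
    128: [8.469, 3.659, 1.979, 1.470],
    256: [11.651, 4.685, 2.288, 1.549],
}

def best_block_size(column: int) -> int:
    """Block size with the lowest AMAT in the given cache-size column."""
    return min(AMAT_BY_BLOCK_SIZE, key=lambda block: AMAT_BY_BLOCK_SIZE[block][column])

best = {size: best_block_size(i) for i, size in enumerate(CACHE_SIZES)}
print(best)  # {'4K': 32, '16K': 64, '64K': 64, '256K': 64}
```

The optimum sits at 32 bytes for the smallest cache and at 64 bytes for the larger ones; beyond that the growing miss penalty outweighs any gain in miss rate.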
36
Critical word first / early restart: hit time ↓

Critical word first: first load the requested
word from memory and forward it to the CPU, then
complete the rest of the cache block.
Early restart: load a complete cache block, but
forward the requested word to the CPU as soon as
it arrives.

Good for large cache blocks
Early restart: varying hit time
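The difference between the two schemes is the order in which words arrive. The sketch below assumes that critical word first fills the block in wrap-around order starting at the requested word, which the slide does not specify; the function names are illustrative.

```python
def arrival_order(block_words: int, requested: int, critical_word_first: bool):
    """Order in which the words of a block arrive from memory."""
    if critical_word_first:
        # Assumption: wrap-around fill starting at the requested word.
        return [(requested + i) % block_words for i in range(block_words)]
    return list(range(block_words))   # early restart: plain in-order fill

def words_until_restart(block_words: int, requested: int, critical_word_first: bool) -> int:
    """Number of word transfers before the CPU receives the requested word."""
    order = arrival_order(block_words, requested, critical_word_first)
    return order.index(requested) + 1

print(arrival_order(8, 5, True))         # [5, 6, 7, 0, 1, 2, 3, 4]
print(words_until_restart(8, 5, True))   # 1
print(words_until_restart(8, 5, False))  # 6
```

With early restart the wait depends on where the requested word sits in the block, which is the varying hit time mentioned on the slide.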
37
Stream buffer: miss rate ↓
Instruction cache: Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On a miss, check the stream buffer
- 1 data stream buffer got 25% of misses from a 4 KiB cache
- 4 streams got 43%
[Jouppi, 1990]
[Figure: stream buffer between L1 and L2]
38
Stream buffer
Data cache: for scientific programs, 8 streams got 50% to 70% of misses from 64 KiB, 4-way set associative caches [Palacharla & Kessler, 1994]
Stream buffers only make sense when there is enough bandwidth to the next level in the memory hierarchy.
Reduces compulsory and capacity misses
[Figure: stream buffer between L1 and L2]
39
Stream buffer improvements

Multi-way streams
Multiple parallel stream buffers, one per
instruction or data stream
Stride detection
For non-unit stride access to memory

40
Overview

Caches basic operation
Miss classification
Cache improvements
Related to block size
Related to cache size
Related to indexing

41
Cache size ↑ ⇒ hit time ↑; associativity ↑ ⇒ hit time ↑
[Figure: access time in ns (0 to 14) versus cache size (4 to 256 KiB) for associativity 1, 2, 4 and fully associative (FA)]
L1 data cache reduced from 2-way 16 KiB in Pentium III to 4-way 8 KiB in Pentium 4
42
Cache size/assoc vs. AMAT
AMAT (c)
Cache size (KiB)  1-way  2-way (10)  4-way (12)  8-way (14)
1                 7.65   6.60        6.22        5.44
2                 5.90   4.90        4.62        4.09
4                 4.60   3.95        3.57        3.19
8                 3.30   3.00        2.87        2.59
16                2.45   2.20        2.12        2.04
32                2.00   1.80        1.77        1.79
64                1.70   1.60        1.57        1.59
128               1.50   1.45        1.42        1.44
AMAT = Hit Time + Miss Rate x Miss Penalty
43
Split L1 caches
Pentium 4 EE
Processor
L1 I (12Ki), L1 D (8KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
44
Split vs. Unified Cache
Miss rates (%)
Size      Instruction cache  Data cache  Unified cache
1 KiB     3.06               24.61       13.34
2 KiB     2.26               20.57       9.78
4 KiB     1.78               15.94       7.24
8 KiB     1.10               10.19       4.57
16 KiB    0.64               6.47        2.87
32 KiB    0.39               4.82        1.99
64 KiB    0.15               3.77        1.35
128 KiB   0.02               2.88        0.95
Harvard architecture
45
Example
Make the common case fast
20% data cache accesses, 80% instruction cache accesses
Miss penalty = 50 cycles, hit time = 1 cycle
Split cache: AMAT = 80% x (1 + 0.64% x 50) + 20% x (1 + 6.47% x 50) = 1.903
Unified cache: AMAT = 80% x (1 + 1.99% x 50) + 20% x (2 + 1.99% x 50) = 2.195
Extra cycle: single ported cache
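The arithmetic of this example can be reproduced in a few lines. The sketch uses the miss rates of two 16 KiB split caches and one 32 KiB unified cache from the split vs. unified table; variable names are illustrative.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """AMAT = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

INSTR_FRACTION, DATA_FRACTION = 0.80, 0.20
MISS_PENALTY = 50   # cycles

# Split: 16 KiB instruction cache (0.64%) and 16 KiB data cache (6.47%).
split = (INSTR_FRACTION * amat(1, 0.0064, MISS_PENALTY)
         + DATA_FRACTION * amat(1, 0.0647, MISS_PENALTY))

# Unified: 32 KiB cache (1.99%); data accesses pay an extra cycle
# because the single port is shared with instruction fetch.
unified = (INSTR_FRACTION * amat(1, 0.0199, MISS_PENALTY)
           + DATA_FRACTION * amat(2, 0.0199, MISS_PENALTY))

print(f"split: {split:.3f} cycles, unified: {unified:.3f} cycles")
# split: 1.903 cycles, unified: 2.195 cycles
```

The split organisation wins here even though the data accesses see a worse miss rate, because the 80% of accesses that are instruction fetches see a lower miss rate and the data accesses avoid the extra cycle.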
46
Filter cache
Processor

Small L0 direct mapped cache
Standard cache
Performance penalty of 21% due to high miss rate
(Kin 1997)
Consumes less power

Filter cache (L0)
L1 Cache
47
Dynamically Loaded Loop Cache

Small loop cache
Alternative location to fetch instructions
Dynamically fills the loop cache
Triggered by short backwards branch (sbb)
instruction

... add r1,2 ... sbb -5
48
Preloaded loop cache

Small loop cache
Loop cache filled at compile time and remains
fixed
Fetch triggered by
short backwards branch
start address of the loop

... add r1,2 ... sbb -5
49
Victim Buffer
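4-entry victim cache removes 20% to 95% of conflicts for a 4 KiB direct mapped data cache [Jouppi, 1990]
[Figure: CPU, cache, victim buffer and memory; an access hits in the cache, hits in the victim buffer, or misses and goes to memory]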
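The effect of a victim buffer on conflict misses can be sketched with a toy simulation. The code below assumes a direct-mapped cache indexed by block number modulo the number of sets and a FIFO victim buffer; the FIFO policy and all names are assumptions made for illustration.

```python
from collections import deque

def simulate(blocks, num_sets, victim_entries):
    """Direct-mapped cache with an optional FIFO victim buffer.
    Returns (cache_hits, victim_hits, misses)."""
    cache = [None] * num_sets
    victims = deque(maxlen=victim_entries) if victim_entries else None
    cache_hits = victim_hits = misses = 0
    for b in blocks:
        s = b % num_sets
        if cache[s] == b:
            cache_hits += 1
            continue
        evicted = cache[s]
        if victims is not None and b in victims:
            victim_hits += 1
            victims.remove(b)        # the block moves back into the cache
        else:
            misses += 1              # has to be fetched from memory
        cache[s] = b
        if victims is not None and evicted is not None:
            victims.append(evicted)  # oldest victim drops out when full
    return cache_hits, victim_hits, misses

trace = [0, 4, 0, 4, 0, 4]          # blocks 0 and 4 conflict in a 4-set cache
print(simulate(trace, 4, 0))        # (0, 0, 6)
print(simulate(trace, 4, 4))        # (0, 4, 2)
```

In this trace every access after the two compulsory misses is served by the victim buffer instead of memory.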
50
Overview

Caches basic operation
Miss classification
Cache improvements
Related to block size
Related to cache size
Related to indexing

51
Randomizing cache index functions
address: 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx
index:   00     01     10     11     00     01     10     11     00     01     10     11     00     01     10     11
Address bits: a5 a4 a3 a2 a1 a0
H = a3 a2
Direct mapped cache
52
Randomizing cache index functions
address: 0000xx 0001xx 0010xx 0011xx 0100xx 0101xx 0110xx 0111xx 1000xx 1001xx 1010xx 1011xx 1100xx 1101xx 1110xx 1111xx
index:   00     01     10     11     01     00     11     10     10     11     00     01     11     10     01     00
Address bits: a5 a4 a3 a2 a1 a0
H = (a5 ⊕ a3)(a4 ⊕ a2)
Direct mapped cache
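Both index functions are easy to check in code. The sketch below reproduces the two index rows for the sixteen 4-byte blocks of the 6-bit address space and shows what happens to a stride that defeats the conventional index (function names are illustrative):

```python
def index_plain(addr: int) -> int:
    """H = a3 a2: the two address bits just above the block offset."""
    return (addr >> 2) & 0b11

def index_xor(addr: int) -> int:
    """H = (a5 xor a3)(a4 xor a2)."""
    return ((addr >> 4) ^ (addr >> 2)) & 0b11

addresses = [block << 2 for block in range(16)]   # 0000xx ... 1111xx
print([format(index_plain(a), "02b") for a in addresses])
print([format(index_xor(a), "02b") for a in addresses])

# Blocks 0000xx, 0100xx, 1000xx and 1100xx: a stride of 16 bytes.
strided = [0b000000, 0b010000, 0b100000, 0b110000]
print([index_plain(a) for a in strided])  # [0, 0, 0, 0]
print([index_xor(a) for a in strided])    # [0, 1, 2, 3]
```

With the conventional index all four strided blocks fight over one cache line; with the XOR-based index they spread over all four lines.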
53
Effect of randomized address bits
[Figure: miss rate (0 to 7) versus number of randomized address bits (8 to 16) for fp, int and overall]
54
Skewed-associative cache: mapping conflicts ↓

2-way skewing
2 banks, different set index functions
Randomization!
Inter-bank dispersion
Blocks may conflict in one bank, but probably not
in the other
[Figure: set-associative organisation with two banks, each holding tag and data; the block address is hashed by H1 for bank 1 and by H2 for bank 2]
Inter bank dispersion in action

[Figure: the same blocks placed in a 2-bank set-associative cache and in a 2-bank skewed-associative cache, each bank holding tag and data]
56
Limited Inter Bank Dispersion
Goal: choose H1 and H2 such that the inter-bank dispersion (IBD) is maximal
[Figure: grid of H1 values (00, 11, 01, 10) against H2 values (00, 11, 01, 10)]
57
Trace cache
traditional cache
i1 call f
i7 i8 i9 ret
i2 i3
trace cache
58
Example
Processor      Pentium 4       Ultrasparc III
Clock (2001)   2000 MHz        900 MHz
L1 I cache     96 KiB TC       32 KiB, 4WSA
Latency        4               2
L1 D cache     8 KiB 4WSA      64 KiB, 4WSA
Latency        2               2
TLB            128             128
L2 cache       256 KiB 8WSA    8 MiB DM (off chip)
Latency        6               15
Block size     64 bytes        32 bytes
Bus width      64 bits         128 bits
Bus clock      400 MHz         150 MHz
59
Other examples of caching
60
Questions?
