Title: Advanced Computer Architecture 5MD00 / 5Z033 Memory Hierarchy
Advanced Computer Architecture 5MD00 / 5Z033
Memory Hierarchy: Caches
- Henk Corporaal
- www.ics.ele.tue.nl/heco/courses/aca
- h.corporaal@tue.nl
- TU Eindhoven
- 2007
Topics
- Processor-memory gap
- Recap of cache basics
- Basic cache optimizations
- Advanced cache optimizations
  - reduce miss penalty
  - reduce miss rate
  - reduce hit time
Review: Who Cares About the Memory Hierarchy?
[Figure: processor-memory performance gap — µProc performance grows ~60%/yr while DRAM latency improves only ~7%/yr]
Cache operation
[Figure: block/line transfer between cache (higher level, with tags and data) and memory (lower level)]
Direct Mapped Cache
- Taking advantage of spatial locality
[Figure: address bit positions]
A 4-Way Set-Associative Cache
6 basic cache optimizations (App. C)
- Reduce miss rate
  - Larger block size
  - Bigger cache
  - Higher associativity (reduces conflict misses)
- Reduce miss penalty
  - Multi-level caches
  - Give priority to read misses over write misses
- Reduce hit time
  - Avoid address translation during indexing of the cache
11 Advanced Cache Optimizations (5.2)
- Reducing hit time
  - Small and simple caches
  - Way prediction
  - Trace caches
- Increasing cache bandwidth
  - Pipelined caches
  - Multibanked caches
  - Nonblocking caches
- Reducing miss penalty
  - Critical word first
  - Merging write buffers
- Reducing miss rate
  - Compiler optimizations
- Reducing miss penalty or miss rate via parallelism
  - Hardware prefetching
  - Compiler prefetching
1. Fast Hit via Small and Simple Caches
- Indexing the tag memory and then comparing takes time
- → a small cache is faster
- An L2 cache small enough to fit on chip with the processor also avoids the time penalty of going off chip
- Simple → direct mapped: the tag check can overlap with data transmission, since there is no way to choose from
- Access time estimates for 90 nm use the CACTI 4.0 model
2. Fast Hit via Way Prediction
- Make set-associative caches faster
- Keep extra bits in the cache to predict the way (block within the set) of the next cache access
- The multiplexor is set early to select the desired block; only 1 tag comparison is performed
- Miss → check the other blocks for matches in the next clock cycle
- Accuracy ≈ 85%
- Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles
- Used for instruction caches rather than L1 data caches
- Also used on the MIPS R10K for its off-chip unified L2 cache, with the way-prediction table on-chip
A 4-Way Set-Associative Cache
Way Predicting Caches
- Use the processor address to index into the way-prediction table
- Look in the predicted way at the given index, then:
  - HIT: return copy of data from cache
  - MISS: look in the other way(s)
    - SLOW HIT: change entry in the prediction table
    - MISS: read block of data from the next level of cache
(this flow is sketched in C below)
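The flow above can be sketched in C. This is a minimal sketch; the types and field names (cache_set_t, pred_way, NWAYS) are illustrative, not taken from any real design:

    /* Way-predicted lookup: try the predicted way first (fast hit),
     * fall back to checking the other ways (slow hit), and retrain
     * the prediction entry on a mispredict. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NWAYS 4

    typedef struct {
        bool     valid[NWAYS];
        uint32_t tag[NWAYS];
        uint8_t  pred_way;            /* per-set way-prediction entry */
    } cache_set_t;

    /* Returns the matching way, or -1 on a cache miss. */
    int lookup(cache_set_t *set, uint32_t tag, bool *fast_hit)
    {
        int w = set->pred_way;
        if (set->valid[w] && set->tag[w] == tag) {
            *fast_hit = true;         /* one tag compare, mux set early */
            return w;
        }
        *fast_hit = false;            /* slow hit costs extra cycle(s) */
        for (int i = 0; i < NWAYS; i++) {
            if (i != w && set->valid[i] && set->tag[i] == tag) {
                set->pred_way = (uint8_t)i;   /* retrain predictor */
                return i;
            }
        }
        return -1;                    /* real miss: go to next level */
    }

    int main(void)
    {
        cache_set_t set = { .valid = { true }, .tag = { 0xABC }, .pred_way = 0 };
        bool fast;
        return lookup(&set, 0xABC, &fast) == 0 && fast ? 0 : 1;  /* fast hit */
    }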
Way Predicting Instruction Cache (Alpha 21264-like)
[Figure: PC (via +4, jump target, and jump control) indexes the primary instruction cache; a way bit selects between the sequential way and the branch-target way]
3. Fast (Inst. Cache) Hit via Trace Cache
- Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: basic blocks ending in branches (BR) packed into a single trace line]
- A single fetch brings in multiple basic blocks
- The trace cache is indexed by start address and the next n branch predictions
3. Fast Hit times via Trace Cache
- Trace cache in the Pentium 4:
  - Dynamic instruction traces are cached (in the level 1 cache)
  - Caches micro-ops rather than x86 instructions; decode/translate from x86 to micro-ops on a trace cache miss
- + better utilizes long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- - more complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- - instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
4. Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, at the cost of higher latency
- Number of instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- - greater penalty on mispredicted branches
- - more clock cycles between the issue of a load and the use of its data
5. Increasing Cache Bandwidth: Non-Blocking Caches
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
  - requires an out-of-order execution CPU
- "Hit under miss" reduces the effective miss penalty by continuing to work during a miss
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - requires a memory system that can service multiple misses
  - significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the sketch below)
  - requires multiple memory banks (otherwise multiple misses cannot be supported)
  - the Pentium Pro allows 4 outstanding memory misses
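A minimal sketch of the bookkeeping such a controller needs. Miss Status Holding Registers (MSHRs) are the standard structure for tracking outstanding misses; the entry layout and the 4-entry limit (echoing the Pentium Pro figure above) are illustrative:

    /* A new miss either merges with an MSHR already fetching the same
     * block (miss under miss) or allocates a free one; if all MSHRs
     * are busy the cache must stall, which bounds the overlap. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NMSHR 4

    typedef struct {
        bool     valid;
        uint64_t block_addr;      /* block being fetched */
        int      waiting_loads;   /* requests merged onto this miss */
    } mshr_t;

    static mshr_t mshr[NMSHR];

    /* Returns true if the miss was accepted (allocated or merged),
     * false if the controller must stall the request. */
    bool handle_miss(uint64_t block_addr)
    {
        int free_slot = -1;
        for (int i = 0; i < NMSHR; i++) {
            if (mshr[i].valid && mshr[i].block_addr == block_addr) {
                mshr[i].waiting_loads++;        /* merge with same block */
                return true;
            }
            if (!mshr[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;                       /* all MSHRs busy: stall */
        mshr[free_slot] = (mshr_t){ .valid = true,
                                    .block_addr = block_addr,
                                    .waiting_loads = 1 };
        return true;
    }

    int main(void)
    {
        handle_miss(100);                 /* allocates an MSHR */
        return handle_miss(100) ? 0 : 1;  /* second miss to same block merges */
    }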
Value of Hit Under Miss for SPEC
[Figure: average memory access time for hit under n misses (base, 0→1, 1→2, 2→64), integer vs. floating point programs]
- FP programs on average: AMAT 0.68 → 0.52 → 0.34 → 0.26
- Int programs on average: AMAT 0.24 → 0.20 → 0.19 → 0.19
- 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
6. Increase Cache Bandwidth via Multiple Banks
- Divide the cache into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks → the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving: spread block addresses sequentially across the banks
  - E.g., with 4 banks, bank 0 has all blocks whose address mod 4 = 0, bank 1 has all blocks whose address mod 4 = 1, and so on (sketched below)
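A quick sketch of that mapping; the modulo rule follows directly from the bullet above:

    /* Sequential interleaving: consecutive block addresses land in
     * consecutive banks, so streaming accesses hit all banks evenly. */
    #include <stdio.h>

    #define NBANKS 4

    int main(void)
    {
        for (unsigned block_addr = 0; block_addr < 8; block_addr++)
            printf("block %u -> bank %u\n", block_addr, block_addr % NBANKS);
        return 0;
    }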
7. Early Restart and Critical Word First to Reduce Miss Penalty
- Don't wait for the full block to be loaded before restarting the CPU
  - Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue
  - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block
- Generally useful only when blocks are large
8. Merging Write Buffer to Reduce Miss Penalty
- A write buffer allows the processor to continue while waiting for the write to memory
- E.g., four writes to the same block are merged into one buffer entry rather than put in separate entries (sketched below)
- Result: less frequent write-backs
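A sketch of the merging logic, assuming a hypothetical entry layout (one 4-word block per entry, with a per-word valid mask):

    /* A write whose address falls inside an already-buffered block is
     * merged into that entry's valid mask instead of taking a new entry. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NENTRIES        4
    #define WORDS_PER_ENTRY 4      /* one 4-word block per entry */

    typedef struct {
        bool     valid;
        uint64_t block_addr;               /* aligned block address */
        uint8_t  word_valid;               /* bitmask of words present */
        uint32_t data[WORDS_PER_ENTRY];
    } wbuf_entry_t;

    static wbuf_entry_t wbuf[NENTRIES];

    /* Returns true if the write was buffered (merged or allocated). */
    bool buffer_write(uint64_t addr, uint32_t value)
    {
        uint64_t block = addr / (WORDS_PER_ENTRY * 4);   /* 16-byte blocks */
        unsigned word  = (addr / 4) % WORDS_PER_ENTRY;

        for (int i = 0; i < NENTRIES; i++) {             /* try to merge */
            if (wbuf[i].valid && wbuf[i].block_addr == block) {
                wbuf[i].data[word]  = value;
                wbuf[i].word_valid |= 1u << word;
                return true;
            }
        }
        for (int i = 0; i < NENTRIES; i++) {             /* else allocate */
            if (!wbuf[i].valid) {
                wbuf[i] = (wbuf_entry_t){ .valid = true, .block_addr = block };
                wbuf[i].data[word]  = value;
                wbuf[i].word_valid  = 1u << word;
                return true;
            }
        }
        return false;    /* buffer full: CPU stalls until an entry drains */
    }

    int main(void)
    {
        buffer_write(0x100, 1);                  /* allocates an entry */
        return buffer_write(0x104, 2) ? 0 : 1;   /* same block: merges */
    }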
9. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% for an 8KB direct-mapped cache with 4-byte blocks, in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Use profiling to look at conflicts (using tools developed for this)
- Data
  - Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 separate arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping structure and overlap some variables
  - Blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows
Merging Arrays

    /* before: two separate arrays */
    int val[SIZE];
    int key[SIZE];
    for (i = 0; i < SIZE; i++) {
      key[i] = newkey;
      val[i]++;
    }

    /* after: one array of records */
    struct record {
      int val;
      int key;
    };
    struct record records[SIZE];
    for (i = 0; i < SIZE; i++) {
      records[i].key = newkey;
      records[i].val++;
    }

- Reduces conflicts between val and key, and improves spatial locality
Loop Interchange

    /* before: inner loop strides through memory */
    for (col = 0; col < 100; col++)
      for (row = 0; row < 5000; row++)
        X[row][col] = X[row][col] + 1;

    /* after: interchange loops to walk each row sequentially */
    for (row = 0; row < 5000; row++)
      for (col = 0; col < 100; col++)
        X[row][col] = X[row][col] + 1;

- Sequential accesses instead of striding through memory every 100 words
- Improves spatial locality
[Figure: array X laid out in rows and columns]
Loop Fusion

    /* before: two separate loops over the same index space */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

    /* after: fused into one loop nest */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
      }

- Split loops: every access to a and c misses. Fused loop: only the first access misses. Improves temporal locality
- The second reference to a[i][j] can go directly to a register
Blocking, applied to array multiplication

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }

- The two inner loops:
  - read all NxN elements of b
  - read all N elements of one row of a repeatedly
  - write all N elements of one row of c
- If a whole matrix does not fit in the cache: many cache misses
- Idea: compute on a BxB submatrix that fits in the cache
[Figure: c = a × b]
Blocking Example

    for (ii = 0; ii < N; ii += B)
      for (jj = 0; jj < N; jj += B)
        for (i = ii; i < min(ii+B, N); i++)
          for (j = jj; j < min(jj+B, N); j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
              c[i][j] = c[i][j] + a[i][k] * b[k][j];
          }

- B is called the blocking factor
- Can reduce capacity misses from 2N^3 + N^2 to 2N^3/B + N^2
[Figure: c = a × b, computed per BxB submatrix]
Reducing Conflict Misses by Blocking
- Conflict misses in caches vs. blocking size
- Lam et al. [1991]: a blocking factor of 24 had a fifth the misses of a factor of 48, despite both fitting in the cache
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
10. Reducing Misses by HW Prefetching
- Uses extra memory bandwidth (if available)
- Instruction prefetching
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns; the prefetched block is placed into the instruction stream buffer
- Data prefetching
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked on 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes
Performance impact of prefetching
Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not too late and not too early
- Cache and bandwidth pollution
[Figure: CPU with register file (RF), L1 instruction and L1 data caches, and a unified L2 cache holding prefetched data]
Hardware Instruction Prefetching
- Instruction prefetch in the Alpha AXP 21064:
  - Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
  - The requested block is placed in the cache, the next block in the instruction stream buffer
  - On a miss in the cache that hits in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
Hardware Data Prefetching
- Prefetch-on-miss
  - Prefetch block b+1 upon a miss on block b
- One Block Lookahead (OBL) scheme
  - Initiate a prefetch for block b+1 when block b is accessed
  - Why is this different from simply doubling the block size?
  - Can be extended to N-block lookahead
- Strided prefetch
  - If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc. (sketched below)
- Example: the IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
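A minimal sketch of the strided scheme; the single-stream table and the simple confidence rule are illustrative simplifications of real stream prefetchers:

    /* Track the last block address and stride; once the same stride is
     * seen twice in a row (b, b+N, b+2N), prefetch the next block. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t last_addr;
        int64_t  stride;
        int      confidence;     /* bumped when the stride repeats */
    } stream_t;

    static stream_t stream;      /* one stream; real HW tracks several */

    void observe_access(uint64_t block_addr)
    {
        int64_t stride = (int64_t)(block_addr - stream.last_addr);
        if (stride != 0 && stride == stream.stride)
            stream.confidence++;                  /* stride repeated */
        else
            stream.confidence = 0;
        stream.stride    = stride;
        stream.last_addr = block_addr;

        if (stream.confidence >= 1)               /* b, b+N, b+2N seen */
            printf("prefetch block %llu\n",
                   (unsigned long long)(block_addr + (uint64_t)stride));
    }

    int main(void)
    {
        uint64_t b = 100, N = 8;
        observe_access(b);
        observe_access(b + N);
        observe_access(b + 2*N);   /* triggers a prefetch of b+3N */
        return 0;
    }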
11. Reducing Misses by Software (Compiler-Controlled) Prefetching of Data
- Data prefetch
  - Load data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  - Special prefetching instructions cannot cause faults: a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings in reduced misses?
  - A wider superscalar reduces the difficulty of issue bandwidth
(a compiler-style example follows)
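Compiler-inserted prefetching can be sketched with GCC/Clang's __builtin_prefetch, a non-faulting cache-prefetch hint; the 16-iteration prefetch distance here is an illustrative guess that would normally be tuned to the miss latency:

    #include <stdio.h>

    #define N 1024
    #define PREFETCH_AHEAD 16

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) {
            /* hint: bring the data needed ~16 iterations from now
             * into the cache; the hint cannot fault */
            if (i + PREFETCH_AHEAD < N) {
                __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
                __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 1);
            }
            sum += a[i] * b[i];
        }
        printf("%f\n", sum);
        return 0;
    }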
Summary of the 11 optimizations (+ improves the factor, - hurts it):

Technique | Hit time | Bandwidth | Miss penalty | Miss rate | HW cost/complexity | Comment
Small and simple caches | + | | | - | 0 | Trivial; widely used
Way-predicting caches | + | | | | 1 | Used in Pentium 4
Trace caches | + | | | | 3 | Used in Pentium 4
Pipelined cache access | - | + | | | 1 | Widely used
Nonblocking caches | | + | + | | 3 | Widely used
Banked caches | | + | | | 1 | Used in L2 of Opteron and Niagara
Critical word first and early restart | | | + | | 2 | Widely used
Merging write buffer | | | + | | 1 | Widely used with write through
Compiler techniques to reduce cache misses | | | | + | 0 | Software is a challenge; some computers have compiler option
Hardware prefetching of instructions and data | | | + | + | 2 instr., 3 data | Many prefetch instructions; AMD Opteron prefetches data
Compiler-controlled prefetching | | | + | + | 3 | Needs nonblocking cache; in many CPUs
Recap of Cache Basics
Cache operation
[Figure: block/line transfer between cache (higher level, with tags and data) and memory (lower level)]
Direct Mapped Cache
- Mapping: index = block address modulo the number of blocks in the cache
Review: Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
  - Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, FIFO, LRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with write buffer)
Direct Mapped Cache
[Figure: 32-bit address (bit positions 31 ... 0) split into a tag (bits 31-12), an index (bits 11-2, selecting one of 1024 entries), and a byte offset (bits 1-0); the indexed entry's valid bit and stored tag are compared with the address tag to produce Hit, and the entry's data is returned]
- Q: What kind of locality are we taking advantage of?
Direct Mapped Cache
- Taking advantage of spatial locality
[Figure: address bit positions, with a block offset selecting a word within a multi-word block]
A 4-Way Set-Associative Cache
Cache Basics
- cache_size = N_sets × associativity × block_size
- block_address = byte_address DIV block_size_in_bytes
- index = block_address MOD N_sets
- Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently with shifts and masks (see the sketch below)

Address layout (bits 31 ... 2 1 0): [ tag | index | block offset ], where tag + index = block address
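A sketch of the shift/mask decomposition, with illustrative parameters (64-byte blocks, 256 sets):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64u     /* bytes, power of two */
    #define NSETS      256u    /* power of two */

    int main(void)
    {
        uint32_t addr = 0x12345678;

        uint32_t block_addr = addr / BLOCK_SIZE;   /* = addr >> 6 */
        uint32_t offset     = addr % BLOCK_SIZE;   /* = addr & 63 */
        uint32_t index      = block_addr % NSETS;  /* = block_addr & 255 */
        uint32_t tag        = block_addr / NSETS;  /* = block_addr >> 8 */

        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }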
Example 1
- Assume:
  - a cache of 4K blocks
  - 4-word block size
  - 32-bit address
- Direct mapped (associativity = 1):
  - 16 bytes per block = 2^4 → 4 offset bits
  - 32-bit address → 32 − 4 = 28 bits for index and tag
  - sets = blocks/associativity → log2(4K) = 12 bits for index
  - total number of tag bits: (28 − 12) × 4K = 64 Kbits
- 2-way set associative:
  - sets = blocks/associativity → 2K sets
  - 1 bit less for indexing, 1 bit more for the tag
  - tag bits: (28 − 11) × 2 × 2K = 68 Kbits
- 4-way set associative:
  - sets = blocks/associativity → 1K sets
  - again 1 bit less for indexing, 1 bit more for the tag
  - tag bits: (28 − 10) × 4 × 1K = 72 Kbits
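The tag-bit arithmetic of this example, reproduced as a small program:

    #include <stdio.h>

    int main(void)
    {
        int addr_bits   = 32;
        int blocks      = 4 * 1024;   /* 4K blocks */
        int offset_bits = 4;          /* 16-byte blocks = 2^4 */

        for (int assoc = 1; assoc <= 4; assoc *= 2) {
            int sets = blocks / assoc;
            int index_bits = 0;
            while ((1 << index_bits) < sets)      /* log2(sets) */
                index_bits++;
            int tag_bits = addr_bits - offset_bits - index_bits;
            printf("%d-way: %5d sets, %d tag bits, total %d Kbits\n",
                   assoc, sets, tag_bits, tag_bits * assoc * sets / 1024);
        }
        return 0;   /* prints 64, 68, and 72 Kbits, as derived above */
    }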
Example 2
- 3 caches, each consisting of 4 one-word blocks:
  - Cache 1: fully associative
  - Cache 2: two-way set associative
  - Cache 3: direct mapped
- Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Example 2: Direct Mapped

Block address | Cache block
0 | 0 mod 4 = 0
6 | 6 mod 4 = 2
8 | 8 mod 4 = 0

Address of memory block | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3
0 | miss | Mem[0] | | |
8 | miss | Mem[8] | | |
0 | miss | Mem[0] | | |
6 | miss | Mem[0] | | Mem[6] |
8 | miss | Mem[8] | | Mem[6] |

(all 5 accesses miss: blocks 0 and 8 keep evicting each other from location 0)
Example 2: 2-Way Set Associative (2 sets)

Block address | Cache set
0 | 0 mod 2 = 0
6 | 6 mod 2 = 0
8 | 8 mod 2 = 0

(so all blocks map to set 0)

Address of memory block | Hit or miss | Set 0, entry 0 | Set 0, entry 1 | Set 1, entry 0 | Set 1, entry 1
0 | miss | Mem[0] | | |
8 | miss | Mem[0] | Mem[8] | |
0 | hit | Mem[0] | Mem[8] | |
6 | miss | Mem[0] | Mem[6] | |
8 | miss | Mem[8] | Mem[6] | |

(on a miss, the least recently used block of the set is replaced; 4 misses, 1 hit)
Example 2: Fully Associative (4-way associative, 1 set)

Address of memory block | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Mem[0] | | |
8 | miss | Mem[0] | Mem[8] | |
0 | hit | Mem[0] | Mem[8] | |
6 | miss | Mem[0] | Mem[8] | Mem[6] |
8 | hit | Mem[0] | Mem[8] | Mem[6] |

(3 misses, 2 hits: full associativity removes the conflict misses)
6 basic cache optimizations (App. C)
- Reduce miss rate
  - Larger block size
  - Bigger cache
  - Higher associativity (reduces conflict misses)
- Reduce miss penalty
  - Multi-level caches
  - Give priority to read misses over write misses
- Reduce hit time
  - Avoid address translation during indexing of the cache
Improving Cache Performance
- T = N_instr × CPI × T_cycle
- CPI (with cache) = CPI_base + CPI_cache_penalty
- CPI_cache_penalty = misses per instruction × miss penalty (in cycles)
- Three angles of attack:
  - Reduce the miss penalty
  - Reduce the miss rate
  - Reduce the time to hit in the cache
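Plugging illustrative numbers into these formulas (the same values as the L2 example later in this deck):

    #include <stdio.h>

    int main(void)
    {
        double n_instr        = 1e9;     /* instructions executed */
        double cpi_base       = 1.0;
        double tcycle_ns      = 2.0;     /* 500 MHz clock */
        double miss_per_instr = 0.05;    /* misses per instruction */
        double miss_penalty   = 100.0;   /* cycles */

        double cpi = cpi_base + miss_per_instr * miss_penalty;
        double t   = n_instr * cpi * tcycle_ns * 1e-9;

        printf("CPI = %.2f, T = %.2f s\n", cpi, t);   /* CPI = 6.00 */
        return 0;
    }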
1. Increase Block Size
2. Larger Caches
- Increase the capacity of the cache
- Disadvantages:
  - longer hit time (may determine the processor cycle time!)
  - higher cost
3. Increase Associativity
- 2:1 cache rule: the miss rate of a direct-mapped cache of size N ≈ the miss rate of a 2-way set-associative cache of size N/2
- Beware: execution time is the only true measure of performance!
  - the access time of set-associative caches is larger than that of direct-mapped caches
  - L1 caches are often direct mapped (the access must fit in one clock cycle)
  - L2 caches are often set associative (cannot afford to go to main memory)
Classifying Misses: the 3 Cs
- Compulsory: the first access to a block is always a miss; also called cold-start misses
  - = the misses in an infinite cache
- Capacity: misses resulting from the finite capacity of the cache
  - = the misses in a fully associative cache of the same size with an optimal replacement strategy
- Conflict: misses occurring because several blocks map to the same set; also called collision misses
  - = the remaining misses
3 Cs: Compulsory, Capacity, Conflict
- In all cases, assume the total cache size is not changed
- What happens if we:
  1) Increase the block size? Which of the 3 Cs is obviously affected? → compulsory misses
  2) Increase the cache size? Which of the 3 Cs is obviously affected? → capacity misses
  3) Introduce higher associativity? Which of the 3 Cs is obviously affected? → conflict misses
3 Cs: Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size]
3 Cs: Relative Miss Rate
[Figure: miss rate per type, normalized to the total miss rate]
Improving Cache Performance
- Reduce the miss penalty
- Reduce the miss rate / number of misses
- Reduce the time to hit in the cache
4. Second Level Cache (L2)
- Most CPUs:
  - have an L1 cache small enough to match the cycle time (reduce the time to hit the cache)
  - have an L2 cache large enough, and with sufficient associativity, to capture most memory accesses (reduce the miss rate)
- L2 equations:
  - AMAT = Hit_time_L1 + Miss_rate_L1 × Miss_penalty_L1
  - Miss_penalty_L1 = Hit_time_L2 + Miss_rate_L2 × Miss_penalty_L2
  - AMAT = Hit_time_L1 + Miss_rate_L1 × (Hit_time_L2 + Miss_rate_L2 × Miss_penalty_L2)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss_rate_L2 above)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss_rate_L1 × Miss_rate_L2)
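The two-level AMAT equation as a direct computation; the rates and times are illustrative, chosen to line up with the worked example on the next slide (local L2 miss rate 40% × L1 miss rate 5% = 2% global):

    #include <stdio.h>

    int main(void)
    {
        double hit_time_l1     = 1.0;     /* cycles */
        double miss_rate_l1    = 0.05;    /* local = global for L1 */
        double hit_time_l2     = 10.0;
        double miss_rate_l2    = 0.40;    /* local miss rate of L2 */
        double miss_penalty_l2 = 100.0;

        double miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2;
        double amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n",
               amat, miss_rate_l1 * miss_rate_l2);   /* 3.50 cycles, 0.020 */
        return 0;
    }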
4. Second Level Cache (L2)
- Suppose a processor with a base CPI of 1.0
  - clock rate of 500 MHz (2 ns cycle time)
  - main memory access time: 200 ns
  - miss rate per instruction at the primary cache: 5%
- What improvement comes from a second-level cache with 20 ns access time that reduces the miss rate to memory to 2%?
- Miss penalty = 200 ns / 2 ns per cycle = 100 clock cycles
- Effective CPI = base CPI + memory stalls per instruction
  - 1-level cache: total CPI = 1 + 5% × 100 = 6
  - 2-level cache: a miss in the first-level cache is satisfied by the second-level cache or by memory
    - access to the second-level cache: 20 ns / 2 ns per cycle = 10 clock cycles
    - a miss in the second-level cache (2% of the cases) accesses memory
    - total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
    - total CPI = 1 + 5% × 10 + 2% × 100 = 3.5
- The machine with the L2 cache is 6 / 3.5 = 1.7 times faster
4. Second Level Cache
- The global miss rate is similar to the miss rate of a single cache of the second level's size, provided the L2 cache is much bigger than L1
- The local miss rate is NOT a good measure for secondary caches, as it is a function of the L1 cache
- → the global cache miss rate should be used
5. Read Priority over Write on Miss
- Write-through with write buffers can cause RAW data hazards through memory:

    SW 512(R0), R3    ; Mem[512] <- R3
    LW R1, 1024(R0)   ; R1 <- Mem[1024]
    LW R2, 512(R0)    ; R2 <- Mem[512]

- All three map to the same cache block; if a write buffer is used and the write has not yet drained, the final LW may read a stale value from memory!
- Solution 1: simply wait for the write buffer to empty
  - increases the read miss penalty (by about 50% on the old MIPS M/1000)
- Solution 2: check the write buffer contents before the read; if there are no conflicts, let the read continue (sketched below)
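Solution 2 sketched in C: scan the write buffer before the read goes to memory, and on an address match forward the buffered value instead of reading stale memory. The structures and the read_memory stub are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define WBUF_SIZE 4

    typedef struct {
        bool     valid;
        uint64_t addr;
        uint32_t data;
    } wb_entry_t;

    static wb_entry_t write_buffer[WBUF_SIZE];

    /* stub standing in for the next level of the hierarchy */
    static uint32_t read_memory(uint64_t addr) { (void)addr; return 0; }

    uint32_t read_with_wbuf_check(uint64_t addr)
    {
        for (int i = 0; i < WBUF_SIZE; i++)
            if (write_buffer[i].valid && write_buffer[i].addr == addr)
                return write_buffer[i].data;   /* forward pending write */
        return read_memory(addr);              /* no conflict: proceed */
    }

    int main(void)
    {
        /* a pending SW 512(R0),R3 sitting in the write buffer */
        write_buffer[0] = (wb_entry_t){ .valid = true, .addr = 512, .data = 7 };
        return read_with_wbuf_check(512) == 7 ? 0 : 1;   /* not stale */
    }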
5. Read Priority over Write on Miss
- What about write-back caches?
- Dirty bit: whenever a write is cached, this bit is set (made 1) to tell the cache controller: "when you decide to re-use this cache line for a different address, you need to write the current contents back to memory"
- What happens on a read miss to a dirty line?
  - Normal: write the dirty block to memory, then do the read
  - Instead: copy the dirty block to a write buffer, then do the read, then the write
  - → fewer CPU stalls, since the CPU restarts as soon as the read is done
6. Avoiding Address Translation during Cache Access