Title: Advanced Computer Architecture 5MD00 / 5Z033 Memory Hierarchy
Slide 1: Advanced Computer Architecture 5MD00 / 5Z033
Memory Hierarchy: Caches
- Henk Corporaal
- www.ics.ele.tue.nl/heco/courses/aca
- h.corporaal_at_tue.nl
- TU Eindhoven
- 2011
Slide 2: Topics
- Processor-memory gap
- Cache basics
- Basic cache optimizations
- Advanced cache optimizations
  - reduce miss penalty
  - reduce miss rate
  - reduce hit time
- Extra from appendix C
  - extensive recap of caches and several optimizations (starts at slide 39)
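Slide 3: Review: Who Cares About the Memory Hierarchy?
[Figure: processor-memory performance gap. CPU (µProc) performance grows about 60%/yr; DRAM performance grows about 7%/yr.]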
Slide 4: Cache operation (direct mapped cache)
[Figure: the cache (higher level) holds tags and data; memory is the lower level; the unit of transfer is a block (line).]
Slide 5: Why does a cache work?
- Principle of Locality
  - Temporal locality: an accessed item has a high probability of being accessed again in the near future
  - Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next
- Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses
  - Regular programs have high instruction and data locality
Slide 6: Direct Mapped Cache
- Taking advantage of spatial locality
[Figure: direct-mapped cache with multi-word blocks; address (bit positions).]
Slide 7: A 4-Way Set-Associative Cache
Slide 8: 6 basic cache optimizations (App. C)
- Reduce miss rate
  - Larger block size
  - Bigger cache
  - Higher associativity (reduces conflict rate)
- Reduce miss penalty
  - Multi-level caches
  - Give priority to read misses over write misses
- Reduce hit time
  - Avoid address translation (from virtual to physical address) during indexing of the cache
Slide 9: 11 Advanced Cache Optimizations (5.2)
- Reducing hit time
  - Small and simple caches
  - Way prediction
  - Trace caches
- Increasing cache bandwidth
  - Pipelined caches
  - Multibanked caches
  - Nonblocking caches
- Reducing miss penalty
  - Critical word first
  - Merging write buffers
- Reducing miss rate
  - Compiler optimizations
- Reducing miss penalty or miss rate via parallelism
  - Hardware prefetching
  - Compiler prefetching
Slide 10: 1. Fast Hit via Small and Simple Caches
- Indexing the tag memory and then comparing takes time
  - => a small cache is faster
  - Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
- Simple => direct mapping
  - Can overlap the tag check with data transmission, since there is no choice of block
- Access time estimate for 90 nm using the CACTI 4.0 model
Slide 11: 2. Fast Hit via Way Prediction
- Make set-associative caches faster
- Keep extra bits in the cache to predict the way, or block within the set, of the next cache access
  - Multiplexor is set early to select the desired block; only 1 tag comparison is performed
  - Miss => first check the other blocks for matches in the next clock cycle
- Accuracy: about 85%
- Also saves energy
- Drawback: pipelining the CPU is hard if a hit takes either 1 or 2 cycles
Slide 12: A 4-Way Set-Associative Cache
Slide 13: Way Predicting Caches
- Use the processor address to index into the way prediction table
- Look in the predicted way at the given index, then:
  - HIT: return copy of data from cache
  - MISS: look in the other way
    - SLOW HIT: change entry in prediction table
    - MISS: read block of data from next level of cache
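Slide 14: Way Predicting Instruction Cache (Alpha 21264-like)
[Figure: the PC supplies the address (addr) to the primary instruction cache, which returns the instruction (inst) and a predicted way. Jump control selects the next PC between PC + 0x4 (sequential way) and the jump target (branch target way).]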
Slide 15: 3. Fast (Inst. Cache) Hit via Trace Cache
- Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: an instruction trace with three branches (BR) packed into a single trace cache line.]
- Single fetch brings in multiple basic blocks
- Trace cache indexed by start address and next n branch predictions
Slide 16: 3. Fast Hit times via Trace Cache
- Trace cache in Pentium 4
  - Dynamic instruction traces are cached (in the level 1 cache)
  - Caches the micro-ops rather than x86 instructions
  - Decode/translate from x86 to micro-ops on a trace cache miss
- Advantage: better utilization of long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- Drawback: complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- Drawback: instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
Slide 17: 4. Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, at the cost of higher latency
- Number of instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- Drawback: greater penalty on mispredicted branches
- Drawback: more clock cycles between the issue of the load and the use of the data
Slide 18: 5. Increasing Cache Bandwidth: Non-Blocking Caches
- Non-blocking cache or lockup-free cache
  - allows the data cache to continue to supply cache hits during a miss
  - requires an out-of-order execution CPU
- "Hit under miss" reduces the effective miss penalty by continuing to work during a miss
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Requires that the memory system can service multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise cannot support it)
  - Pentium Pro allows 4 outstanding memory misses
Slide 19: 5. Increasing Cache Bandwidth: Non-Blocking Caches
Slide 20: Value of Hit Under Miss for SPEC
[Figure: average memory access time for integer and floating-point SPEC benchmarks with hit under n misses; series 0->1, 1->2, 2->64 and Base.]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32 B block, 16 cycle miss
Slide 21: 6. Increase Cache Bandwidth via Multiple Banks
- Divide the cache into independent banks that can support simultaneous accesses
  - E.g., T1 ("Niagara") L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks => the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving
  - Spread block addresses sequentially across banks
  - E.g., with 4 banks, bank 0 has all blocks whose address mod 4 = 0, bank 1 has all blocks whose address mod 4 = 1, and so on
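The sequential interleaving rule above can be sketched in a few lines of C. This is only an illustration: the function names `bank_of` and `can_access_in_parallel` are hypothetical, and the sketch assumes the caller already works with block addresses rather than byte addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential interleaving: consecutive block addresses map to
   consecutive banks (bank = block address mod number of banks). */
static unsigned bank_of(uint32_t block_address, unsigned num_banks)
{
    assert(num_banks > 0);
    return block_address % num_banks;
}

/* Two accesses can be serviced simultaneously only if they fall
   into different banks (an assumption of this simple model). */
static int can_access_in_parallel(uint32_t block_a, uint32_t block_b,
                                  unsigned num_banks)
{
    return bank_of(block_a, num_banks) != bank_of(block_b, num_banks);
}
```

With 4 banks, blocks 4 and 5 land in banks 0 and 1 and can proceed together, while blocks 0 and 8 both map to bank 0 and conflict.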
Slide 22: 7. Early Restart and Critical Word First to reduce miss penalty
- Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block
- Generally useful only when blocks are large
- Costly
Slide 23: 8. Merging Write Buffer to Reduce Miss Penalty
- A write buffer allows the processor to continue while waiting for a write to memory
- E.g., four writes are merged into one buffer entry rather than being put in separate entries
- Less frequent write backs
Slide 24: 9. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% for an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using developed tools)
- Data
  - Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
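Slide 25: Merging Arrays

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    for (i = 0; i < SIZE; i++) {
        key[i] = newkey;
        /* ... val[i] ... */
    }

    /* After: 1 array of structures */
    struct record {
        int val;
        int key;
    };
    struct record records[SIZE];

    for (i = 0; i < SIZE; i++) {
        records[i].key = newkey;
        /* ... records[i].val ... */
    }

- Reduces conflicts between val and key and improves spatial locality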
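Slide 26: Loop Interchange

    /* Before: inner loop walks down a column */
    for (col = 0; col < 100; col++)
        for (row = 0; row < 5000; row++)
            X[row][col] = X[row][col+1];

    /* After: inner loop walks along a row */
    for (row = 0; row < 5000; row++)
        for (col = 0; col < 100; col++)
            X[row][col] = X[row][col+1];

- Sequential accesses instead of striding through memory every 100 words
- Improves spatial locality
[Figure: array X laid out in rows and columns.]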
Slide 27: Loop Fusion

    /* Before: two separate loop nests */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];

    /* After: one fused loop nest */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];  /* reference can be directly to register */
        }

- Split loops: every access to a and c misses. Fused loops: only the 1st access misses. Improves temporal locality
Slide 28: Blocking (Tiling) applied to array multiplication

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

[Figure: c = a x b.]
- The two inner loops:
  - Read all NxN elements of b
  - Read all N elements of one row of a repeatedly
  - Write all N elements of one row of c
- If a whole matrix does not fit in the cache, there are many cache misses
- Idea: compute on a BxB submatrix that fits in the cache
Slide 29: Blocking Example

    for (ii = 0; ii < N; ii += B)
        for (jj = 0; jj < N; jj += B)
            for (i = ii; i < min(ii+B, N); i++)
                for (j = jj; j < min(jj+B, N); j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++)
                        c[i][j] += a[i][k] * b[k][j];
                }

- B is called the blocking factor
- Can reduce capacity misses from 2N^3 + N^2 to 2N^3/B + N^2
[Figure: c = a x b with blocked accesses.]
Slide 30: Reducing Conflict Misses by Blocking
- Conflict misses in caches vs. blocking size
  - Lam et al. [1991]: a blocking factor of 24 had a fifth of the misses compared to 48, despite both fitting in the cache
Slide 31: Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
Slide 32: 10. Reducing Misses by HW Prefetching
- Use extra memory bandwidth (if available)
- Instruction prefetching
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The prefetched block is placed into a separate buffer
- Data prefetching
  - Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked after 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes
Slide 33: Performance impact of prefetching
Slide 34: Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not too late and not too early
- Cache and bandwidth pollution
[Figure: CPU with register file (RF), L1 instruction cache, L1 data cache and unified L2 cache; prefetched data is brought into the cache hierarchy.]
Slide 35: Hardware Instruction Prefetching
- Instruction prefetch in Alpha AXP 21064
  - Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
  - The requested block is placed in the cache, and the next block in the instruction stream buffer
  - If miss in cache but hit in stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
Slide 36: Hardware Data Prefetching
- Prefetch-on-miss
  - Prefetch b+1 upon miss on b
- One Block Lookahead (OBL) scheme
  - Initiate prefetch for block b+1 when block b is accessed
  - Why is this different from doubling the block size?
  - Can extend to N block lookahead
- Strided prefetch
  - If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N etc.
- Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
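The strided prefetch rule (after seeing b, b+N, b+2N, prefetch b+3N) can be illustrated with a small software model. This is a sketch under assumptions, not a description of any real prefetcher: the `stride_stream` structure and `stride_observe` function are hypothetical names, and a single stream with a one-block lookahead is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stream state, for illustration only. */
typedef struct {
    int64_t last_block;  /* most recent block address seen */
    int64_t stride;      /* difference between the last two accesses */
    int     valid;       /* 0 until the first access has been recorded */
} stride_stream;

/* Records one access. Returns 1 and sets *prefetch_block when the same
   non-zero stride has been observed twice in a row, 0 otherwise. */
static int stride_observe(stride_stream *s, int64_t block,
                          int64_t *prefetch_block)
{
    assert(s != NULL && prefetch_block != NULL);
    if (!s->valid) {
        s->last_block = block;
        s->stride = 0;
        s->valid = 1;
        return 0;
    }
    int64_t new_stride = block - s->last_block;
    int confirmed = (new_stride != 0 && new_stride == s->stride);
    s->stride = new_stride;
    s->last_block = block;
    if (confirmed) {
        *prefetch_block = block + new_stride;  /* b+3N after b, b+N, b+2N */
        return 1;
    }
    return 0;
}
```

Feeding the blocks 10, 14, 18 to a zero-initialised stream triggers a prefetch of block 22 on the third access; a following access to block 19 breaks the stride and issues nothing.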
Slide 37: 11. Reducing Misses by Software (Compiler controlled) Prefetching Data
- Data prefetch
  - Load data into register
  - Cache prefetch: load into cache
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions (to prefetch data) takes time
  - Is the cost of prefetch issues < savings in reduced misses?
  - Wider superscalar reduces the difficulty of issue bandwidth
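A cache prefetch of the kind described above can be written by hand with the `__builtin_prefetch` builtin provided by GCC and Clang. The sketch below is illustrative only: the function name `sum_with_prefetch` and the prefetch distance of 16 elements are assumptions, and a good distance depends on the miss latency and the work per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; would need tuning per machine. */
#define PREFETCH_DISTANCE 16

/* Sums an array while prefetching elements that will be needed
   PREFETCH_DISTANCE iterations later. The prefetch is only a hint:
   it loads into the cache, not into a register, and cannot fault. */
static long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* arguments: address, 0 = read access, 3 = high temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

Each prefetch costs an issue slot, which is the trade-off the slide asks about: the extra instructions only pay off if they remove enough misses.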
Slide 38: Summary of advanced cache optimizations
Technique | Hit time | Bandwidth | Miss penalty | Miss rate | HW cost/complexity | Comment
Small and simple caches | | | | | 0 | Trivial; widely used
Way-predicting caches | | | | | 1 | Used in Pentium 4
Trace caches | | | | | 3 | Used in Pentium 4
Pipelined cache access | | | | | 1 | Widely used
Nonblocking caches | | | | | 3 | Widely used
Banked caches | | | | | 1 | Used in L2 of Opteron and Niagara
Critical word first and early restart | | | | | 2 | Widely used
Merging write buffer | | | | | 1 | Widely used with write through
Compiler techniques to reduce cache misses | | | | | 0 | Software is a challenge; some computers have compiler option
Hardware prefetching of instructions and data | | | | | 2 instr., 3 data | Many prefetch instructions; AMD Opteron prefetches data
Compiler-controlled prefetching | | | | | 3 | Needs nonblocking cache; in many CPUs
Slide 39: Recap of Cache basics
See appendix C
Slide 40: Cache operation
[Figure: the cache (higher level) holds tags and data; memory is the lower level; the unit of transfer is a block (line).]
Slide 41: Direct Mapped Cache
- Mapping: the address is taken modulo the number of blocks in the cache
Slide 42: Review: Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
  - Tag/Block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, FIFO, LRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with write buffer)
Slide 43: Direct Mapped Cache
[Figure: direct-mapped cache with 1024 entries (index 0 to 1023). Address (bit positions 31 to 0): 20-bit tag (bits 31-12), 10-bit index (bits 11-2), byte offset (bits 1-0). Each entry holds a valid bit, a 20-bit tag and 32 bits of data; the outputs are Hit and Data.]
- Q: What kind of locality are we taking advantage of?
Slide 44: Direct Mapped Cache
- Taking advantage of spatial locality
[Figure: direct-mapped cache with multi-word blocks; address (bit positions).]
Slide 45: A 4-Way Set-Associative Cache
Slide 46: Cache Basics
- cache_size = Nsets x Assoc x Block_size
- block_address = Byte_address DIV Block_size (in bytes)
- index = Block_address MOD Nsets
- Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
[Figure: 32-bit address (bits 31 ... 2 1 0) split into the block address (tag and index) and the block offset.]
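The DIV and MOD steps above reduce to shifts and masks when the block size and the number of sets are powers of two. The following sketch shows one way to split a 32-bit byte address; the type `cache_addr` and the function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the three fields of a cache address. */
typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} cache_addr;

static int is_pow2(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

/* Integer log2 for powers of two. */
static unsigned log2_u32(uint32_t x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* block_address = byte_address DIV block_size  (a right shift)
   index         = block_address MOD nsets      (a bit mask)   */
static cache_addr split_address(uint32_t byte_address,
                                uint32_t block_size, uint32_t nsets)
{
    assert(is_pow2(block_size) && is_pow2(nsets));
    uint32_t block_address = byte_address >> log2_u32(block_size);
    cache_addr a;
    a.offset = byte_address & (block_size - 1);
    a.index  = block_address & (nsets - 1);
    a.tag    = block_address >> log2_u32(nsets);
    return a;
}
```

For example, with 16-byte blocks and 4096 sets, the address 0x12345678 splits into offset 0x8, index 0x567 and tag 0x1234.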
Slide 47: Example 1
- Assume
  - Cache of 4K blocks
  - 4 word block size
  - 32 bit address
- Direct mapped (associativity = 1)
  - 16 bytes per block = 2^4
  - 32 bit address: 32 - 4 = 28 bits for index and tag
  - #sets = #blocks / associativity; log2 of 4K = 12, so 12 bits for index
  - Total number of tag bits: (28 - 12) x 4K = 64 Kbits
- 2-way associative
  - #sets = #blocks / associativity: 2K sets
  - 1 bit less for indexing, 1 bit more for tag
  - Tag bits: (28 - 11) x 2 x 2K = 68 Kbits
- 4-way associative
  - #sets = #blocks / associativity: 1K sets
  - 1 bit less for indexing, 1 bit more for tag
  - Tag bits: (28 - 10) x 4 x 1K = 72 Kbits
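The tag-storage arithmetic of this example can be checked with a short function. It is a sketch with hypothetical names (`total_tag_bits`, `ilog2`) and assumes that the number of blocks, the block size and the associativity are all powers of two.

```c
#include <assert.h>

/* Integer log2 for powers of two. */
static unsigned ilog2(unsigned long x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Total tag storage in bits:
   tag bits per block = address bits - offset bits - index bits,
   and one tag is stored for every block in the cache. */
static unsigned long total_tag_bits(unsigned address_bits,
                                    unsigned long num_blocks,
                                    unsigned long block_bytes,
                                    unsigned long assoc)
{
    assert(assoc > 0 && num_blocks % assoc == 0);
    unsigned long nsets = num_blocks / assoc;
    unsigned offset_bits = ilog2(block_bytes);
    unsigned index_bits = ilog2(nsets);
    assert(address_bits >= offset_bits + index_bits);
    unsigned tag_bits = address_bits - offset_bits - index_bits;
    return (unsigned long)tag_bits * num_blocks;
}
```

Calling it with 32 address bits, 4096 blocks and 16-byte blocks gives 64, 68 and 72 Kbits for associativity 1, 2 and 4, matching the figures on the slide.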
Slide 48: Example 2
- 3 caches, each consisting of 4 one-word blocks
  - Cache 1: fully associative
  - Cache 2: two-way set associative
  - Cache 3: direct mapped
- Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Slide 49: Example 2: Direct Mapped
Block address | Cache block
0 | 0 mod 4 = 0
6 | 6 mod 4 = 2
8 | 8 mod 4 = 0

Address of memory block | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3
0 | miss | Mem[0] | | |
8 | miss | Mem[8] | | |
0 | miss | Mem[0] | | |
6 | miss | Mem[0] | | Mem[6] |
8 | miss | Mem[8] | | Mem[6] |
Slide 50: Example 2: 2-way Set Associative, 2 sets
Block address | Cache set
0 | 0 mod 2 = 0
6 | 6 mod 2 = 0
8 | 8 mod 2 = 0
(so all in set 0)

Address of memory block | Hit or miss | Set 0 entry 0 | Set 0 entry 1 | Set 1 entry 0 | Set 1 entry 1
0 | miss | Mem[0] | | |
8 | miss | Mem[0] | Mem[8] | |
0 | hit | Mem[0] | Mem[8] | |
6 | miss | Mem[0] | Mem[6] | |
8 | miss | Mem[8] | Mem[6] | |
- On each miss in a full set, the least recently used block is replaced
Slide 51: Example 2: Fully associative (4-way associative, 1 set)
Address of memory block | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Mem[0] | | |
8 | miss | Mem[0] | Mem[8] | |
0 | hit | Mem[0] | Mem[8] | |
6 | miss | Mem[0] | Mem[8] | Mem[6] |
8 | hit | Mem[0] | Mem[8] | Mem[6] |
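The three variants of Example 2 can be reproduced with a tiny cache model. The sketch below is an illustration under assumptions: `count_misses` is a hypothetical helper, block addresses are assumed non-negative, and LRU replacement is used in every set (for a direct-mapped cache this has no effect).

```c
#include <assert.h>

#define MAX_BLOCKS 64  /* upper bound on cache size for this sketch */

/* Counts the misses for a sequence of block addresses in a cache of
   num_blocks one-block entries with the given associativity, using LRU. */
static int count_misses(const int *seq, int len, int num_blocks, int assoc)
{
    assert(num_blocks > 0 && num_blocks <= MAX_BLOCKS);
    assert(assoc > 0 && num_blocks % assoc == 0);
    int tag[MAX_BLOCKS];       /* block address held by each entry */
    int last_use[MAX_BLOCKS];  /* time of last access, -1 if empty */
    int nsets = num_blocks / assoc;
    int misses = 0;
    for (int e = 0; e < num_blocks; e++)
        last_use[e] = -1;

    for (int t = 0; t < len; t++) {
        int base = (seq[t] % nsets) * assoc;  /* first entry of the set */
        int victim = base;
        int hit = 0;
        for (int w = base; w < base + assoc; w++) {
            if (last_use[w] >= 0 && tag[w] == seq[t]) {
                victim = w;
                hit = 1;
                break;
            }
            if (last_use[w] < last_use[victim])
                victim = w;  /* empty or less recently used entry */
        }
        if (!hit) {
            misses++;
            tag[victim] = seq[t];
        }
        last_use[victim] = t;
    }
    return misses;
}
```

For the sequence 0, 8, 0, 6, 8 and a 4-block cache, the model reports 5 misses when direct mapped, 4 misses when 2-way set associative and 3 misses when fully associative, in line with the tables above.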
Slide 52: 6 basic cache optimizations (App. C)
- Reduce miss rate
  - Larger block size
  - Bigger cache
  - Higher associativity (reduces conflict rate)
- Reduce miss penalty
  - Multi-level caches
  - Give priority to read misses over write misses
- Reduce hit time
  - Avoid address translation during indexing of the cache
Slide 53: Improving Cache Performance
- T = Ninstr x CPI x Tcycle
- CPI (with cache) = CPI_base + CPI_cachepenalty
- CPI_cachepenalty = ...
- Reduce the miss penalty
- Reduce the miss rate
- Reduce the time to hit in the cache
Slide 54: 1. Increase Block Size
Slide 55: 2. Larger Caches
- Increase capacity of cache
- Disadvantages
  - longer hit time (may determine processor cycle time!)
  - higher cost
Slide 56: 3. Increase Associativity
- 2:1 Cache Rule
  - The miss rate of a direct-mapped cache of size N is about equal to the miss rate of a 2-way set-associative cache of size N/2
- Beware: execution time is the only true measure of performance!
  - The access time of set-associative caches is larger than the access time of direct-mapped caches
  - L1 cache is often direct-mapped (access must fit in one clock cycle)
  - L2 cache is often set-associative (cannot afford to go to main memory)
Slide 57: Classifying Misses: the 3 Cs
- The 3 Cs
  - Compulsory: the first access to a block is always a miss. Also called cold start misses
    - misses in an infinite cache
  - Capacity: misses resulting from the finite capacity of the cache
    - misses in a fully associative cache with optimal replacement strategy
  - Conflict: misses occurring because several blocks map to the same set. Also called collision misses
    - remaining misses
Slide 58: 3 Cs: Compulsory, Capacity, Conflict
- In all cases, assume total cache size not changed
- What happens if we
  - 1) Change block size: which of the 3 Cs is obviously affected? Compulsory misses
  - 2) Change cache size: which of the 3 Cs is obviously affected? Capacity misses
  - 3) Introduce higher associativity: which of the 3 Cs is obviously affected? Conflict misses
Slide 59: 3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type; the conflict component is labelled.]
Slide 60: 3Cs Relative Miss Rate
[Figure: miss rate per type; the conflict component is labelled.]
Slide 61: Improving Cache Performance
- Reduce the miss penalty
- Reduce the miss rate / number of misses
- Reduce the time to hit in the cache
Slide 62: 4. Second Level Cache (L2)
- Most CPUs
  - have an L1 cache small enough to match the cycle time (reduce the time to hit the cache)
  - have an L2 cache large enough and with sufficient associativity to capture most memory accesses (reduce miss rate)
- L2 equations
  - AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  - AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
Slide 63: 4. Second Level Cache (L2)
- Suppose a processor with base CPI of 1.0
- Clock rate of 500 MHz
- Main memory access time: 200 ns
- Miss rate per instruction, primary cache: 5%
- What improvement with a second cache having 20 ns access time, reducing the miss rate to memory to 2%?
- Miss penalty: 200 ns / 2 ns per cycle = 100 clock cycles
- Effective CPI = base CPI + memory stalls per instruction
  - 1 level cache: total CPI = 1 + 5% x 100 = 6
  - 2 level cache: a miss in the first level cache is satisfied by the second cache or memory
    - Access second level cache: 20 ns / 2 ns per cycle = 10 clock cycles
    - If miss in second cache, then access memory: in 2% of the cases
    - Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
    - Total CPI = 1 + 5% x 10 + 2% x 100 = 3.5
- Machine with L2 cache: 6/3.5 = 1.7 times faster
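The CPI arithmetic of this example can be replayed in code. The helpers below are a sketch with hypothetical names; the 2% figure is treated as a global miss rate per instruction, as the example does.

```c
#include <assert.h>

/* Converts a latency in ns to clock cycles for a clock rate in MHz. */
static double ns_to_cycles(double ns, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return ns * clock_mhz / 1000.0;
}

/* CPI with a single cache level: every L1 miss goes to main memory. */
static double cpi_one_level(double base_cpi, double l1_miss_rate,
                            double mem_cycles)
{
    return base_cpi + l1_miss_rate * mem_cycles;
}

/* CPI with two cache levels: every L1 miss pays the L2 access time,
   and the fraction that also misses in L2 pays the memory access time. */
static double cpi_two_level(double base_cpi, double l1_miss_rate,
                            double l2_cycles, double global_miss_rate,
                            double mem_cycles)
{
    return base_cpi + l1_miss_rate * l2_cycles
                    + global_miss_rate * mem_cycles;
}
```

With a 500 MHz clock, `ns_to_cycles` gives 100 cycles for memory and 10 cycles for L2; the two CPI functions then return 6 and 3.5, so the ratio is about 1.7.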
Slide 64: 4. Second Level Cache
- The global cache miss rate is similar to the single-cache miss rate of the second level cache, provided the L2 cache is much bigger than L1
- The local cache miss rate is NOT a good measure of secondary caches, as it is a function of the L1 cache
  - The global cache miss rate should be used
Slide 65: 4. Second Level Cache
Slide 66: 5. Read Priority over Write on Miss
- Write-through with write buffers can cause RAW data hazards (addresses 512 and 1024 map to the same cache block):

    SW 512(R0),R3    ; Mem[512] = R3
    LW R1,1024(R0)   ; R1 = Mem[1024]
    LW R2,512(R0)    ; R2 = Mem[512]

- Problem: if a write buffer is used, the final LW may read the wrong value from memory!
- Solution 1: simply wait for the write buffer to empty
  - increases the read miss penalty (old MIPS 1000 by 50%)
- Solution 2: check the write buffer contents before the read; if there are no conflicts, let the read continue
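The check in Solution 2 amounts to an associative lookup in the write buffer before a read miss is sent to memory. The following sketch models it in software; the structure layout, the buffer depth of 4 entries and the function name `read_may_bypass` are all assumptions made for illustration, and addresses are compared at word granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES 4  /* assumed write buffer depth */

/* Hypothetical write buffer entry: one pending store to memory. */
typedef struct {
    uint32_t addr;
    uint32_t data;
    int      valid;
} wb_entry;

typedef struct {
    wb_entry e[WB_ENTRIES];
} write_buffer;

/* Returns 1 if a read to addr conflicts with no pending write and may
   therefore go to memory ahead of the buffered writes; returns 0 if a
   pending write to the same address must be handled first. */
static int read_may_bypass(const write_buffer *wb, uint32_t addr)
{
    assert(wb != NULL);
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb->e[i].valid && wb->e[i].addr == addr)
            return 0;
    }
    return 1;
}
```

In the instruction sequence above, the store to address 512 sits in the buffer: the load from 1024 may bypass it, while the load from 512 may not.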
Slide 67: 5. Read Priority over Write on Miss
- What about write-back?
  - Dirty bit: whenever a write is cached, this bit is set (made a 1) to tell the cache controller "when you decide to re-use this cache line for a different address, you need to write the current contents back to memory"
- What if there is a read miss:
  - Normal: write the dirty block to memory, then do the read
  - Instead: copy the dirty block to a write buffer, then do the read, then the write
  - Fewer CPU stalls, since the CPU restarts as soon as the read is done
Slide 68: 6. Avoiding address translation during cache access