Title: Lecture 13: Caches
Slide 1: Lecture 13: Caches
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
Slide 2: Review
- Page tables
  - data structures!
- Virtual memory
  - what happens when pages > frames
  - policy questions
    - fetch policy
    - replacement policy
- Performance of virtual memory
  - phenomenon of locality, working sets
  - average memory access time (AMAT)
Slide 3: Page Fault
[Figure: the page-fault path. The CPU's access misses in the page table; the operating system fetches the page from disk into a frame of physical memory and updates the page table entry.]
Slide 4: Today: Caches and the full memory hierarchy
Slide 5: Problem
- 1. We want a big memory.
- 2. Big memory is slow.
[Figure: processor connected directly to memory]
Slide 6: Memory Background
[Figure: generic RAM organization. The address drives a row decoder that asserts a wordline; the storage cells on that row drive their bitlines into sense amplifiers; a column mux selects the bits for data in/out.]
Slide 7: Itanium 2 (McKinley) Die Photo
[Die photo, from Microprocessor Report]
Slide 8: Pick Your Storage Cells
- DRAM
  - dynamic: must be refreshed
  - densest technology; cost/bit is paramount
- SRAM
  - static: value is stored in a latch
  - fastest technology: 8-16x faster than DRAM
  - larger cell: 4-8x larger
  - more expensive: 8-16x more per bit
- others
  - EEPROM/Flash: high density, non-volatile
  - core...
Slide 9: Main Memory Deep Background
- "Out-of-core," "in-core," "core dump"?
- Core: stores a bit as magnetic state (ca. 1955-75)
- Non-volatile; also radiation resistant
- Replaced by 4 Kbit DRAM (current is 256 Mbit)
- Access time 750 ns, cycle time 1500-3000 ns
Slide 10: Pre-core Memory Technology: Mercury Delay Lines!
- A shift register implemented via acoustic waves in a tube of mercury.
- Maurice Wilkes, Computing Perspectives
Slide 11: Problem (again)
- 1. We want a big memory.
- 2. Big memory is slow.
[Figure: processor connected directly to memory]
Slide 12: How big is the problem?
[Figure: the Processor-DRAM memory gap (latency), performance on a log scale vs. time, 1980-2000. CPU performance ("Moore's Law") improves about 60%/yr (2x per 1.5 years); DRAM latency improves only about 9%/yr (2x per 10 years), so the gap widens every year.]
Slide 13:
"Ideally one would desire an indefinitely large capacity memory such that any particular...word would be immediately available.... We are...forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
- A. W. Burks, H. H. Goldstine, and J. von Neumann, "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," 1946
Slide 14: Solution: Small memory unit closer to processor
[Figure: processor, then a small fast memory, then the BIG SLOW MEMORY]
Slide 15: Terminology
[Figure: the small, fast memory near the processor is the upper level (the cache); the big slow memory below it is the lower level (sometimes called the backing store).]
Slide 16: Terminology
- A hit: block found in the upper level.
- Hit rate: fraction of accesses resulting in hits.
Slide 17: Terminology
- A miss: block not found in the upper level; must look in the lower level.
- Miss rate = (1 - hit rate).
Slide 18: Terminology Summary
- Hit: data appears in some block in the upper level (example: Block X in cache)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (example: Block Y in memory)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on the 21264)
Slide 19: Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: basic time of every access.
- Hit rate (h): fraction of accesses that hit.
- Miss penalty: extra time to fetch a block from the lower level, including the time to replace it in the CPU.
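The formula turns directly into a one-line helper; a minimal sketch (the function and parameter names are mine, not from the slides):

```python
def amat(hit_time, hit_rate, miss_penalty):
    # Every access pays the hit time; the fraction (1 - h) that
    # misses additionally pays the miss penalty.
    return hit_time + (1 - hit_rate) * miss_penalty

# Plugging in the hardware-cache numbers used later in the lecture
# (1 ns hit time, 98% hit rate, 100 ns miss penalty):
print(amat(1, 0.98, 100))  # about 3 ns
```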
Slide 20: The Full Memory Hierarchy: always reuse a good idea
From the upper level (smaller, faster, costlier) to the lower level (larger, slower, cheaper); each staging/transfer unit is managed by a different agent:
- Registers: 100s of bytes, <10s ns; program/compiler stages 1-8 byte instruction operands.
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; cache controller stages 8-128 byte blocks.
- Main memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit; OS stages 4K-16K byte pages.
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; user/operator stages MByte files.
- Tape: infinite capacity, sec-min access time, 10^-8 cents/bit.
Slide 21: Virtual Memory
- Virtual memory is a kind of cache: DRAM is used as a cache for disk.
- Why does it work?
- How did it work?
Slide 22: Virtual Memory
- Virtual memory is a kind of cache: DRAM is used as a cache for disk.
- Why does it work?
  - locality! The phenomenon of locality means that you tend to reuse the same locations.
- How did it work?
  - 1. find the block in the upper level (DRAM) via the page table (a map)
  - 2. replace the least-recently-used (LRU) page on a miss
Slide 23: Virtual Memory
- Timing was tough with virtual memory:
  - AMAT = Tmem + (1-h) x Tdisk
  - = 100 ns + (1-h) x 25,000,000 ns
- h (the hit rate) had to be incredibly (almost unattainably) close to perfect to work
- so VM is a cache, but an odd one.
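To see just how close to perfect h must be, here is a quick check using the slide's numbers (the 200 ns target, i.e. 2x raw DRAM time, is my choice for illustration):

```python
T_MEM_NS = 100            # DRAM access time (from the slide)
T_DISK_NS = 25_000_000    # disk access time (from the slide)

def vm_amat(h):
    # AMAT = Tmem + (1 - h) * Tdisk
    return T_MEM_NS + (1 - h) * T_DISK_NS

# To hold AMAT to 200 ns, the miss rate may be at most
# (200 - 100) / 25,000,000 = 4e-6 -- a 99.9996% hit rate.
print(vm_amat(1 - 4e-6))  # about 200 ns
```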
Slide 24: Hardware Cache: Timing is much more feasible
- AMAT = Thit + (1-h) x Tmem = 1 ns + (1-h) x 100 ns
- A hit rate of 98% would yield an AMAT of 3 ns ... pretty good!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 25: Hardware Cache: How do you find things in the upper level?
- You don't have much time!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 26: One way
- Have a scheme that allows the contents of a main memory address to be found in exactly one place in the cache.
- Remember, the cache is smaller than the level below it; thus multiple locations could map to the same place.
- Severe restriction! But let's see what we can do with it...
Slide 27: One way
Example: Looking for location 10011 (19)? Look in 011 (3): 3 = 19 MOD 8.
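In code, the mapping is a modulo for the index and a quotient for the tag; a sketch of the slide's 8-entry example (the function names are mine):

```python
CACHE_LINES = 8  # the slide's example cache

def cache_index(addr):
    # Each address maps to exactly one line in a direct-mapped cache.
    return addr % CACHE_LINES

def cache_tag(addr):
    # The remaining high-order bits distinguish the addresses
    # that share an index.
    return addr // CACHE_LINES

# Location 10011 (19) maps to index 011 (3) with tag 10 (2).
print(cache_index(0b10011), cache_tag(0b10011))  # 3 2
```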
Slide 28: One way
If there are four possible locations in memory which map into the same location in our cache...
Slide 29: One way
We can add tags which tell us if we have a match.
[Table: cache lines 000-111; the tag at index 011 is 10, all others are 00.]
Slide 30: One way
But there is still a problem! What if we haven't put anything into the cache? The 00 tag (for example) will confuse us.
[Table: cache lines 000-111, all tags 00.]
Slide 31: One way
Solution: add a valid bit.
[Table: cache lines 000-111, all valid bits 0.]
Slide 32: One way
Now if the valid bit is set, our match is good.
[Table: cache lines 000-111; the valid bit at index 011 is 1, all others 0.]
Slide 33: Basic Algorithm
- Assume we want the contents of location M
- Calculate CacheAddr = M % CacheSize
- Calculate TargetTag = M / CacheSize
- if (Valid[CacheAddr] == SET && Tag[CacheAddr] == TargetTag)   // hit
  - return Data[CacheAddr]
- else   // miss
  - Fetch contents of location M from backup memory
  - Put in Data[CacheAddr]
  - Update Tag[CacheAddr] and Valid[CacheAddr]
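The algorithm above, minus the data array, can be sketched as a small simulator (the class and method names are mine); running it on the example reference sequence from the following slides reproduces their hit/miss results:

```python
class DirectMappedCache:
    # Tags and valid bits for an 8-entry direct-mapped cache of
    # one-byte blocks, per the slide's basic algorithm.
    def __init__(self, size=8):
        self.size = size
        self.valid = [False] * size
        self.tag = [None] * size

    def access(self, addr):
        index = addr % self.size     # CacheAddr = M % CacheSize
        target = addr // self.size   # TargetTag = M / CacheSize
        if self.valid[index] and self.tag[index] == target:
            return "hit"
        # Miss: fetch from backup memory, then update tag and valid.
        self.valid[index] = True
        self.tag[index] = target
        return "miss"

cache = DirectMappedCache()
for addr in [0b10110, 0b11010, 0b10110, 0b11010,
             0b10000, 0b00011, 0b10000, 0b10010]:
    print(f"{addr:05b}: {cache.access(addr)}")
# miss, miss, hit, hit, miss, miss, hit, miss
```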
Slide 34: Questions?
Slide 35: Example
- Cache is initially empty
- We get the following sequence of memory references:
  - 10110, 11010, 10110, 11010, 10000, 00011, 10000, 10010
Slides 36-52: Example (worked trace)
Each 5-bit address splits into a 2-bit tag and a 3-bit index, so addresses 00000-11111 map four-to-one onto the eight cache lines. Slide 36 shows the initial condition: all valid bits 0, all tags 00.

| Slides | Access | Index | Tag | Result | Cache change                                   |
|--------|--------|-------|-----|--------|------------------------------------------------|
| 37-38  | 10110  | 110   | 10  | Miss   | line 110: V=1, tag=10                          |
| 39-40  | 11010  | 010   | 11  | Miss   | line 010: V=1, tag=11                          |
| 41-42  | 10110  | 110   | 10  | Hit    | none                                           |
| 43-44  | 11010  | 010   | 11  | Hit    | none                                           |
| 45-46  | 10000  | 000   | 10  | Miss   | line 000: V=1, tag=10                          |
| 47-48  | 00011  | 011   | 00  | Miss   | line 011: V=1, tag=00                          |
| 49-50  | 10000  | 000   | 10  | Hit    | none                                           |
| 51-52  | 10010  | 010   | 10  | Miss   | line 010: tag 11 becomes 10 (replaces 11010)   |
Slide 53: Hardware Cache Variations
- 1. Block Size
- 2. Associativity
- 3. Write policy
- 4. Multiple caches?
Slide 54: 1. Block Size
- Wouldn't make much sense to have a different entry for every byte!
- Block = number of bytes sharing the same tag.
Slide 55: 1 KB Direct Mapped Cache, 32B blocks
- For a 2^N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: a 32-bit address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). The cache has 32 rows (0-31); each row holds a valid bit, a cache tag (stored as part of the cache state), and a 32-byte data block (bytes 0-31, 32-63, ..., 992-1023).]
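The field extraction can be sketched as bit arithmetic for this 1 KB / 32 B-block configuration (the constant and function names are mine):

```python
BLOCK_BITS = 5  # 32-byte blocks: byte select = address bits 4-0
INDEX_BITS = 5  # 32 lines: cache index = address bits 9-5

def split_address(addr):
    # Peel off the byte select, then the index; the rest is the tag.
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

# The slide's example fields: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```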
Slide 56: Block Size
- How big should the block size be?
- (i.e. what happens as you change the block size?)
Slide 57: Misses via Block Size
[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Miss rate falls as blocks grow, then climbs again for the smaller caches at large block sizes.]
- Why does it get better at first?
- Why does it get worse later?
Slide 58: 2. Associativity
- Requiring that every memory location be cachable in exactly one place (direct-mapped) was simple but incredibly limiting.
- How can we relax this constraint?
Slide 59: Associativity
- Block 12 placed in an 8-block cache
- Fully associative, direct mapped, 2-way set associative
- Set-associative mapping = Block Number modulo Number of Sets
  - Direct mapped: (12 mod 8) = 4
  - 2-way set associative: (12 mod 4) = 0
  - Fully mapped: any block
[Figure: the candidate cache positions for memory block 12 under each scheme.]
Slide 60: Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index
- N direct-mapped caches operate in parallel (N typically 2 to 4)
- Example: two-way set associative cache
  - Cache Index selects a set from the cache
  - The two tags in the set are compared in parallel
[Figure: two banks of valid/tag/data entries indexed by Cache Index; the address tag is compared against both stored tags, the compare outputs are ORed into Hit, and Sel0/Sel1 drive a mux that selects the matching cache block.]
- Advantage: typically exhibits a hit rate equal to a 2x-sized direct-mapped cache
Slide 61: Disadvantage of Set Associative Cache
- N-way set associative cache vs. direct mapped cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss
Slide 62: Associativity
- If you have associativity > 1 you have to have a replacement policy (like VM!)
  - FIFO
  - LRU
  - random
- "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
  - virtual memory is effectively fully associative
Slide 63: 3. Write Policy
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - need to remember whether the block is clean or dirty (a dirty bit with each tag, in addition to the valid bit)
- Pros of each:
  - WB: no repeated writes to the same location
  - WT: read misses cannot result in writes
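The write-back advantage (no repeated writes to the same location) can be sketched by counting the writes that actually reach main memory (an illustrative model; the names are mine):

```python
class WriteBackCache:
    def __init__(self, size=8):
        self.size = size
        self.tag = [None] * size
        self.dirty = [False] * size
        self.memory_writes = 0  # writes that reach the lower level

    def write(self, addr):
        i, t = addr % self.size, addr // self.size
        if self.tag[i] != t:
            if self.dirty[i]:
                self.memory_writes += 1  # write back the dirty victim
                self.dirty[i] = False
            self.tag[i] = t  # install the new block
        self.dirty[i] = True  # the write updates only the cache copy

wb = WriteBackCache()
for _ in range(10):
    wb.write(42)  # ten writes to one location...
print(wb.memory_writes)  # 0 -- write-through would have done 10
```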
Slide 64: 4. Multiple Caches
- Caches for different purposes
- instructions vs. data
- Multiple levels of caches
- L1, L2, etc.
Slide 65: Separate Instruction and Data
- Not two memories, just two caches!
[Figure: the five-stage pipeline (IF, ID, EX, MEM, WB) with the instruction cache feeding the fetch stage and the data cache in the memory stage; PC, muxes, register file (DPRF), sign-extend, ALU, and BEQ logic as usual.]
Slide 66: Multilevel caches: Recall 1-level cache numbers
- AMAT = Thit + (1-h) x Tmem = 1 ns + (1-h) x 100 ns
- A hit rate of 98% would yield an AMAT of 3 ns ... pretty good!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 67: Multilevel Cache: Add a medium-size, medium-speed L2
- AMAT = Thit_L1 + ((1-h_L1) x Thit_L2) + ((1-h_L1) x (1-h_L2) x Tmem)
- Hit rates of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
[Figure: processor, L1 cache (1 ns), L2 cache (10 ns), BIG SLOW MEMORY (100 ns)]
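The two-level formula as code, checked against the slide's numbers (the function name is mine):

```python
def amat_2level(t_l1, h_l1, t_l2, h_l2, t_mem):
    # Every access pays L1's hit time; L1 misses also pay L2's hit
    # time; accesses that miss both levels also pay memory.
    return (t_l1
            + (1 - h_l1) * t_l2
            + (1 - h_l1) * (1 - h_l2) * t_mem)

# 1 ns L1 at 98%, 10 ns L2 at 95%, 100 ns memory:
print(amat_2level(1, 0.98, 10, 0.95, 100))  # about 1.3 ns
```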
Slide 68: Cache Mechanics Summary
- Basic action
- look up block
- check tag
- select byte from block
- Block size
- Associativity
- Write Policy
Slide 69: Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
Slide 70: Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
  - E.g. 64KB, direct-mapped, 16B blocks, write-back:
    - 64K x 8 bits for data
    - 4K x (16 + 1 + 1) bits for tag, valid and dirty bits
[Figure: the address split into tag | index | offset fields.]
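The bit-count arithmetic for that organization, spelled out (the variable names are mine; 32-bit addresses assumed, as on slide 55):

```python
CACHE_BYTES = 64 * 1024   # 64 KB of data
BLOCK_BYTES = 16          # 16 B blocks
ADDR_BITS = 32

lines = CACHE_BYTES // BLOCK_BYTES               # 4K lines
offset_bits = (BLOCK_BYTES - 1).bit_length()     # 4
index_bits = (lines - 1).bit_length()            # 12
tag_bits = ADDR_BITS - index_bits - offset_bits  # 16

data_bits = CACHE_BYTES * 8              # 64K x 8
state_bits = lines * (tag_bits + 1 + 1)  # tag + valid + dirty per line
print(data_bits, state_bits)  # 524288 73728
```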
Slide 71: More Great Cache Questions
- Suppose you have a loop like this:
    char a[1024][1024];
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
        a[i][j]++;
- What's the hit rate in a 64KB/direct/16B-block cache?
Slide 72: More Great Cache Questions
- Suppose instead the loop is like this:
    char a[1024][1024];
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
        a[j][i]++;
- What's the hit rate in a 64KB/direct/16B-block cache?
Slide 73: Bonus Slides
Slide 74: Intel P6 Core
- Core of PPro, PII, PIII
- Split I + D L1 (8K + 8K)
- Split I + D TLBs (32 + 64)
- L2 originally on MCM
- PIII has 2x L2 cache
- Xeon has big L2
- Aside: note 5 FUs, a 3-wide fetch unit, and a 40-entry ROB
Slide 75: Itanium 2 (McKinley) Die Photo
- L2 cache: 256KB
- L3 cache: 3MB
[Die photo, from Microprocessor Report]
Slide 76: Program Miss Rate Characteristics
- What can you tell about a program from its miss rates?
- Run your favorite cache simulator and see...
- Three sample programs:
  - SOR: Jacobi relaxation -- averages points on a 2D grid
  - IJPEG: integer version of JPEG compression
  - CC1: guts of the gcc compiler
- Warning: these benchmarks and simulator are slightly different from those used in Project 1
Slide 77: Cache Misses in SOR: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)
- key data structure: 100x100 grid of doubles (80K bytes)
Slide 78: Cache Misses in IJPEG: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)
- data input: 938x636 array of 24-bit pixels = 1.8 Mbytes
Slide 79: Cache Misses in CC1: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)