Title: Lecture 13: Caches
Slide 1: Lecture 13: Caches
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
Slide 2: Review
- Page tables
  - data structures!
- Virtual memory
  - what happens when pages > frames
  - policy questions
    - fetch policy
    - replacement policy
- Performance of virtual memory
  - phenomenon of locality, working sets
  - average memory access time (AMAT)
Slide 3: Page Fault
[Figure: the page-fault path. The CPU's access misses in the page table; the operating system fetches the page from disk into a frame of physical memory and updates the page table entry.]
Slide 4: Today: Caches and the full memory hierarchy
Slide 5: Problem
- 1. We want a big memory.
- 2. Big memory is slow.
[Figure: processor connected directly to memory]
Slide 6: Memory Background
[Figure: generic RAM organization. The address drives a row decoder that asserts a wordline; the storage cells on that row drive their bitlines into sense amplifiers; a column mux selects the bits for data in/out.]
Slide 7: Itanium 2 (McKinley) Die Photo
[Die photo, from Microprocessor Report]
Slide 8: Pick Your Storage Cells
- DRAM
  - dynamic: must be refreshed
  - densest technology; cost/bit is paramount
- SRAM
  - static: value is stored in a latch
  - fastest technology: 8-16x faster than DRAM
  - larger cell: 4-8x larger
  - more expensive: 8-16x more per bit
- others
  - EEPROM/Flash: high density, non-volatile
  - core...
Slide 9: Main Memory Deep Background
- "Out-of-core," "in-core," "core dump"?
- Core: stores a bit as magnetic state (ca. 1955-75)
- Non-volatile; also radiation resistant
- Replaced by 4 Kbit DRAM (current is 256 Mbit)
- Access time 750 ns, cycle time 1500-3000 ns
Slide 10: Pre-core Memory Technology: Mercury Delay Lines!
- A shift register implemented via acoustic waves in a tube of mercury.
- Maurice Wilkes, Computing Perspectives
Slide 11: Problem (again)
- 1. We want a big memory.
- 2. Big memory is slow.
[Figure: processor connected directly to memory]
Slide 12: How big is the problem?
[Figure: the Processor-DRAM memory gap (latency), performance on a log scale vs. time, 1980-2000. CPU performance ("Moore's Law") improves about 60%/yr (2x per 1.5 years); DRAM latency improves only about 9%/yr (2x per 10 years), so the gap widens every year.]
Slide 13:
"Ideally one would desire an indefinitely large capacity memory such that any particular...word would be immediately available.... We are...forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
- A. W. Burks, H. H. Goldstine, and J. von Neumann, "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," 1946
Slide 14: Solution: Small memory unit closer to processor
[Figure: processor, then a small fast memory, then the BIG SLOW MEMORY]
Slide 15: Terminology
[Figure: the small, fast memory near the processor is the upper level (the cache); the big slow memory below it is the lower level (sometimes called the backing store).]
Slide 16: Terminology
- A hit: block found in the upper level.
- Hit rate: fraction of accesses resulting in hits.
Slide 17: Terminology
- A miss: block not found in the upper level; must look in the lower level.
- Miss rate = (1 - hit rate).
Slide 18: Terminology Summary
- Hit: data appears in some block in the upper level (example: Block X in cache)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (example: Block Y in memory)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on the 21264)
Slide 19: Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: basic time of every access.
- Hit rate (h): fraction of accesses that hit.
- Miss penalty: extra time to fetch a block from the lower level, including the time to replace it in the CPU.
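The formula turns directly into a one-line helper; a minimal sketch (the function and parameter names are mine, not from the slides):

```python
def amat(hit_time, hit_rate, miss_penalty):
    # Every access pays the hit time; the fraction (1 - h) that
    # misses additionally pays the miss penalty.
    return hit_time + (1 - hit_rate) * miss_penalty

# Plugging in the hardware-cache numbers used later in the lecture
# (1 ns hit time, 98% hit rate, 100 ns miss penalty):
print(amat(1, 0.98, 100))  # about 3 ns
```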
Slide 20: The Full Memory Hierarchy: always reuse a good idea
From the upper level (smaller, faster, costlier) to the lower level (larger, slower, cheaper); each staging/transfer unit is managed by a different agent:
- Registers: 100s of bytes, <10s ns; program/compiler stages 1-8 byte instruction operands.
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; cache controller stages 8-128 byte blocks.
- Main memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit; OS stages 4K-16K byte pages.
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; user/operator stages MByte files.
- Tape: infinite capacity, sec-min access time, 10^-8 cents/bit.
Slide 21: Virtual Memory
- Virtual memory is a kind of cache: DRAM is used as a cache for disk.
- Why does it work?
- How did it work?
Slide 22: Virtual Memory
- Virtual memory is a kind of cache: DRAM is used as a cache for disk.
- Why does it work?
  - locality! The phenomenon of locality means that you tend to reuse the same locations.
- How did it work?
  - 1. find the block in the upper level (DRAM) via the page table (a map)
  - 2. replace the least-recently-used (LRU) page on a miss
Slide 23: Virtual Memory
- Timing was tough with virtual memory:
  - AMAT = Tmem + (1-h) x Tdisk
  - = 100 ns + (1-h) x 25,000,000 ns
- h (the hit rate) had to be incredibly (almost unattainably) close to perfect to work
- so VM is a cache, but an odd one.
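To see just how close to perfect h must be, here is a quick check using the slide's numbers (the 200 ns target, i.e. 2x raw DRAM time, is my choice for illustration):

```python
T_MEM_NS = 100            # DRAM access time (from the slide)
T_DISK_NS = 25_000_000    # disk access time (from the slide)

def vm_amat(h):
    # AMAT = Tmem + (1 - h) * Tdisk
    return T_MEM_NS + (1 - h) * T_DISK_NS

# To hold AMAT to 200 ns, the miss rate may be at most
# (200 - 100) / 25,000,000 = 4e-6 -- a 99.9996% hit rate.
print(vm_amat(1 - 4e-6))  # about 200 ns
```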
Slide 24: Hardware Cache: Timing is much more feasible
- AMAT = Thit + (1-h) x Tmem = 1 ns + (1-h) x 100 ns
- A hit rate of 98% would yield an AMAT of 3 ns ... pretty good!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 25: Hardware Cache: How do you find things in the upper level?
- You don't have much time!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 26: One way
- Have a scheme that allows the contents of a main memory address to be found in exactly one place in the cache.
- Remember, the cache is smaller than the level below it; thus multiple locations could map to the same place.
- Severe restriction! But let's see what we can do with it...
Slide 27: One way
Example: Looking for location 10011 (19)? Look in 011 (3): 3 = 19 MOD 8.
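In code, the mapping is a modulo for the index and a quotient for the tag; a sketch of the slide's 8-entry example (the function names are mine):

```python
CACHE_LINES = 8  # the slide's example cache

def cache_index(addr):
    # Each address maps to exactly one line in a direct-mapped cache.
    return addr % CACHE_LINES

def cache_tag(addr):
    # The remaining high-order bits distinguish the addresses
    # that share an index.
    return addr // CACHE_LINES

# Location 10011 (19) maps to index 011 (3) with tag 10 (2).
print(cache_index(0b10011), cache_tag(0b10011))  # 3 2
```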
Slide 28: One way
If there are four possible locations in memory which map into the same location in our cache...
Slide 29: One way
We can add tags which tell us if we have a match.
[Table: cache lines 000-111; the tag at index 011 is 10, all others are 00.]
Slide 30: One way
But there is still a problem! What if we haven't put anything into the cache? The 00 tag (for example) will confuse us.
[Table: cache lines 000-111, all tags 00.]
Slide 31: One way
Solution: add a valid bit.
[Table: cache lines 000-111, all valid bits 0.]
Slide 32: One way
Now if the valid bit is set, our match is good.
[Table: cache lines 000-111; the valid bit at index 011 is 1, all others 0.]
Slide 33: Basic Algorithm
- Assume we want the contents of location M
- Calculate CacheAddr = M % CacheSize
- Calculate TargetTag = M / CacheSize
- if (Valid[CacheAddr] == SET && Tag[CacheAddr] == TargetTag)   // hit
  - return Data[CacheAddr]
- else   // miss
  - Fetch contents of location M from backup memory
  - Put in Data[CacheAddr]
  - Update Tag[CacheAddr] and Valid[CacheAddr]
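The algorithm above, minus the data array, can be sketched as a small simulator (the class and method names are mine); running it on the example reference sequence from the following slides reproduces their hit/miss results:

```python
class DirectMappedCache:
    # Tags and valid bits for an 8-entry direct-mapped cache of
    # one-byte blocks, per the slide's basic algorithm.
    def __init__(self, size=8):
        self.size = size
        self.valid = [False] * size
        self.tag = [None] * size

    def access(self, addr):
        index = addr % self.size     # CacheAddr = M % CacheSize
        target = addr // self.size   # TargetTag = M / CacheSize
        if self.valid[index] and self.tag[index] == target:
            return "hit"
        # Miss: fetch from backup memory, then update tag and valid.
        self.valid[index] = True
        self.tag[index] = target
        return "miss"

cache = DirectMappedCache()
for addr in [0b10110, 0b11010, 0b10110, 0b11010,
             0b10000, 0b00011, 0b10000, 0b10010]:
    print(f"{addr:05b}: {cache.access(addr)}")
# miss, miss, hit, hit, miss, miss, hit, miss
```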
Slide 34: Questions?
Slide 35: Example
- Cache is initially empty
- We get the following sequence of memory references:
  - 10110, 11010, 10110, 11010, 10000, 00011, 10000, 10010
Slides 36-52: Example (worked trace)
Each 5-bit address splits into a 2-bit tag and a 3-bit index, so addresses 00000-11111 map four-to-one onto the eight cache lines. Slide 36 shows the initial condition: all valid bits 0, all tags 00.

| Slides | Access | Index | Tag | Result | Cache change                                   |
|--------|--------|-------|-----|--------|------------------------------------------------|
| 37-38  | 10110  | 110   | 10  | Miss   | line 110: V=1, tag=10                          |
| 39-40  | 11010  | 010   | 11  | Miss   | line 010: V=1, tag=11                          |
| 41-42  | 10110  | 110   | 10  | Hit    | none                                           |
| 43-44  | 11010  | 010   | 11  | Hit    | none                                           |
| 45-46  | 10000  | 000   | 10  | Miss   | line 000: V=1, tag=10                          |
| 47-48  | 00011  | 011   | 00  | Miss   | line 011: V=1, tag=00                          |
| 49-50  | 10000  | 000   | 10  | Hit    | none                                           |
| 51-52  | 10010  | 010   | 10  | Miss   | line 010: tag 11 becomes 10 (replaces 11010)   |
Slide 53: Hardware Cache Variations
- 1. Block Size
- 2. Associativity
- 3. Write policy
- 4. Multiple caches?
Slide 54: 1. Block Size
- Wouldn't make much sense to have a different entry for every byte!
- Block = number of bytes sharing the same tag.
Slide 55: 1 KB Direct Mapped Cache, 32B blocks
- For a 2^N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: a 32-bit address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). The cache has 32 rows (0-31); each row holds a valid bit, a cache tag (stored as part of the cache state), and a 32-byte data block (bytes 0-31, 32-63, ..., 992-1023).]
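The field extraction can be sketched as bit arithmetic for this 1 KB / 32 B-block configuration (the constant and function names are mine):

```python
BLOCK_BITS = 5  # 32-byte blocks: byte select = address bits 4-0
INDEX_BITS = 5  # 32 lines: cache index = address bits 9-5

def split_address(addr):
    # Peel off the byte select, then the index; the rest is the tag.
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

# The slide's example fields: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```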
Slide 56: Block Size
- How big should the block size be?
- (i.e. what happens as you change the block size?)
Slide 57: Misses via Block Size
[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Miss rate falls as blocks grow, then climbs again for the smaller caches at large block sizes.]
- Why does it get better at first?
- Why does it get worse later?
Slide 58: 2. Associativity
- Requiring that every memory location be cachable in exactly one place (direct-mapped) was simple but incredibly limiting.
- How can we relax this constraint?
Slide 59: Associativity
- Block 12 placed in an 8-block cache
- Fully associative, direct mapped, 2-way set associative
- Set-associative mapping = Block Number modulo Number of Sets
  - Direct mapped: (12 mod 8) = 4
  - 2-way set associative: (12 mod 4) = 0
  - Fully mapped: any block
[Figure: the candidate cache positions for memory block 12 under each scheme.]
Slide 60: Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index
- N direct-mapped caches operate in parallel (N typically 2 to 4)
- Example: two-way set associative cache
  - Cache Index selects a set from the cache
  - The two tags in the set are compared in parallel
[Figure: two banks of valid/tag/data entries indexed by Cache Index; the address tag is compared against both stored tags, the compare outputs are ORed into Hit, and Sel0/Sel1 drive a mux that selects the matching cache block.]
- Advantage: typically exhibits a hit rate equal to a 2x-sized direct-mapped cache
Slide 61: Disadvantage of Set Associative Cache
- N-way set associative cache vs. direct mapped cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss
Slide 62: Associativity
- If you have associativity > 1 you have to have a replacement policy (like VM!)
  - FIFO
  - LRU
  - random
- "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
  - virtual memory is effectively fully associative
Slide 63: 3. Write Policy
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - need to remember whether the block is clean or dirty (a dirty bit with each tag, in addition to the valid bit)
- Pros of each:
  - WB: no repeated writes to the same location
  - WT: read misses cannot result in writes
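The write-back advantage (no repeated writes to the same location) can be sketched by counting the writes that actually reach main memory (an illustrative model; the names are mine):

```python
class WriteBackCache:
    def __init__(self, size=8):
        self.size = size
        self.tag = [None] * size
        self.dirty = [False] * size
        self.memory_writes = 0  # writes that reach the lower level

    def write(self, addr):
        i, t = addr % self.size, addr // self.size
        if self.tag[i] != t:
            if self.dirty[i]:
                self.memory_writes += 1  # write back the dirty victim
                self.dirty[i] = False
            self.tag[i] = t  # install the new block
        self.dirty[i] = True  # the write updates only the cache copy

wb = WriteBackCache()
for _ in range(10):
    wb.write(42)  # ten writes to one location...
print(wb.memory_writes)  # 0 -- write-through would have done 10
```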
Slide 64: 4. Multiple Caches
- Caches for different purposes
- instructions vs. data
- Multiple levels of caches
- L1, L2, etc.
Slide 65: Separate Instruction and Data
- Not two memories, just two caches!
[Figure: the five-stage pipeline (IF, ID, EX, MEM, WB) with the instruction cache feeding the fetch stage and the data cache in the memory stage; PC, muxes, register file (DPRF), sign-extend, ALU, and BEQ logic as usual.]
Slide 66: Multilevel caches: Recall 1-level cache numbers
- AMAT = Thit + (1-h) x Tmem = 1 ns + (1-h) x 100 ns
- A hit rate of 98% would yield an AMAT of 3 ns ... pretty good!
[Figure: processor, cache (1 ns), BIG SLOW MEMORY (100 ns)]
Slide 67: Multilevel Cache: Add a medium-size, medium-speed L2
- AMAT = Thit_L1 + ((1-h_L1) x Thit_L2) + ((1-h_L1) x (1-h_L2) x Tmem)
- Hit rates of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding!
[Figure: processor, L1 cache (1 ns), L2 cache (10 ns), BIG SLOW MEMORY (100 ns)]
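The two-level formula as code, checked against the slide's numbers (the function name is mine):

```python
def amat_2level(t_l1, h_l1, t_l2, h_l2, t_mem):
    # Every access pays L1's hit time; L1 misses also pay L2's hit
    # time; accesses that miss both levels also pay memory.
    return (t_l1
            + (1 - h_l1) * t_l2
            + (1 - h_l1) * (1 - h_l2) * t_mem)

# 1 ns L1 at 98%, 10 ns L2 at 95%, 100 ns memory:
print(amat_2level(1, 0.98, 10, 0.95, 100))  # about 1.3 ns
```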
Slide 68: Cache Mechanics Summary
- Basic action
- look up block
- check tag
- select byte from block
- Block size
- Associativity
- Write Policy
Slide 69: Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
Slide 70: Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
  - E.g. 64KB, direct-mapped, 16B blocks, write-back:
    - 64K x 8 bits for data
    - 4K x (16 + 1 + 1) bits for tag, valid and dirty bits
[Figure: the address split into tag | index | offset fields.]
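The bit-count arithmetic for that organization, spelled out (the variable names are mine; 32-bit addresses assumed, as on slide 55):

```python
CACHE_BYTES = 64 * 1024   # 64 KB of data
BLOCK_BYTES = 16          # 16 B blocks
ADDR_BITS = 32

lines = CACHE_BYTES // BLOCK_BYTES               # 4K lines
offset_bits = (BLOCK_BYTES - 1).bit_length()     # 4
index_bits = (lines - 1).bit_length()            # 12
tag_bits = ADDR_BITS - index_bits - offset_bits  # 16

data_bits = CACHE_BYTES * 8              # 64K x 8
state_bits = lines * (tag_bits + 1 + 1)  # tag + valid + dirty per line
print(data_bits, state_bits)  # 524288 73728
```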
Slide 71: More Great Cache Questions
- Suppose you have a loop like this:
    char a[1024][1024];
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
        a[i][j]++;
- What's the hit rate in a 64KB/direct/16B-block cache?
Slide 72: More Great Cache Questions
- Suppose instead the loop is like this:
    char a[1024][1024];
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
        a[j][i]++;
- What's the hit rate in a 64KB/direct/16B-block cache?
Slide 73: Bonus Slides
Slide 74: Intel P6 Core
- Core of PPro, PII, PIII
- Split I + D L1 (8K + 8K)
- Split I + D TLBs (32 + 64)
- L2 originally on MCM
- PIII has 2x L2 cache
- Xeon has big L2
- Aside: note 5 FUs, a 3-wide fetch unit, and a 40-entry ROB
Slide 75: Itanium 2 (McKinley) Die Photo
- L2 cache: 256KB
- L3 cache: 3MB
[Die photo, from Microprocessor Report]
Slide 76: Program Miss Rate Characteristics
- What can you tell about a program from its miss rates?
- Run your favorite cache simulator and see...
- Three sample programs:
  - SOR: Jacobi relaxation -- averages points on a 2D grid
  - IJPEG: integer version of JPEG compression
  - CC1: guts of the gcc compiler
- Warning: these benchmarks and simulator are slightly different from those used in Project 1
Slide 77: Cache Misses in SOR: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)
- key data structure: 100x100 grid of doubles (80K bytes)
Slide 78: Cache Misses in IJPEG: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)
- data input: 938x636 array of 24-bit pixels = 1.8 Mbytes
Slide 79: Cache Misses in CC1: percent misses vs. cache size (split I + D, direct-mapped, 16B blocks)