Title: CS 2200 Lecture 14 Memory Management
1 CS 2200 Lecture 14: Memory Management
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
2 Memory and Pipelining
- In our 5-stage pipe, we've constantly been assuming that we can access our operand from memory in 1 clock cycle
- This is possible, but it's complicated
- We'll discuss how this happens in the next several lectures
- (see board for discussion)
3 The processor/memory bottleneck
4 The processor/memory bottleneck
[Figure: single-chip DRAM memory capacity (Kb) versus year]
5 How big is the problem?
[Figure: Processor-DRAM memory gap (latency), performance vs. time, 1980-2000.
 µProc: 60%/yr. (2X/1.5 yr), "Moore's Law"; DRAM: 9%/yr. (2X/10 yrs)]
6 Pick Your Storage Cells
- DRAM
  - dynamic: must be refreshed
  - densest technology; cost/bit is paramount
- SRAM
  - static: value is stored in a latch
  - fastest technology: 8-16x faster than DRAM
  - larger cell: 4-8x larger
  - more expensive: 8-16x more per bit
- others
  - EEPROM/Flash: high density, non-volatile
  - core...
7 The principle of locality
- says that most programs don't access all code or data uniformly
  - i.e. in a loop, a small subset of instructions might be executed over and over again
  - a block of memory addresses might be accessed sequentially
- This has led to memory hierarchies
- Some important things to note
  - Fast memory is expensive
  - Each level of memory is usually smaller/faster than the previous one
  - Levels of memory usually subset one another
    - All the stuff in a higher level is in some level below it
8 Solution (to the processor/memory gap): a small memory unit closer to the processor
[Diagram: Processor -- small, fast memory -- BIG SLOW MEMORY]
9 Terminology
[Diagram: Processor -- small, fast memory (the upper level, i.e. the cache) -- BIG SLOW MEMORY (the lower level)]
10 Terminology
- A hit: the block is found in the upper level
- hit rate: fraction of accesses resulting in hits
11 Terminology
- A miss: the block is not found in the upper level, so we must look in the lower level
- miss rate = (1 - hit rate)
12 Terminology Summary
- Hit: data appears in a block in the upper level (i.e. block X in the cache)
  - Hit Rate: fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on the 21264)
13 Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access
- Hit rate (h): the fraction of accesses that hit
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace it in the cache and deliver it to the CPU
- (A quick numeric check in C follows below.)
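As that check, here is a minimal C sketch of the formula; the 1 ns hit time, 98% hit rate, and 100 ns miss penalty match the 1-level cache numbers used near the end of this lecture (slide 86), and the function name is mine.

    #include <stdio.h>

    /* AMAT = HitTime + (1 - h) x MissPenalty */
    static double amat(double hit_time, double hit_rate, double miss_penalty) {
        return hit_time + (1.0 - hit_rate) * miss_penalty;
    }

    int main(void) {
        /* 1 ns hit, 98% hit rate, 100 ns to go to memory -> 3 ns */
        printf("AMAT = %.1f ns\n", amat(1.0, 0.98, 100.0));
        return 0;
    }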
14 The Full Memory Hierarchy: "always reuse a good idea"
(Upper levels are smaller and faster; lower levels are larger and slower. Each level stages transfers to the one above in a characteristic unit.)

  Level        Capacity     Access time            Cost                   Managed by        Xfer unit (what moves)
  Registers    100s bytes   <10s ns                --                     prog./compiler    1-8 bytes (instr. operands)
  Cache        K bytes      10-100 ns              1 - 0.1 cents/bit      cache controller  8-128 bytes (blocks)
  Main memory  M bytes      200-500 ns             1e-4 - 1e-5 cents/bit  OS                4K-16K bytes (pages)
  Disk         G bytes      10 ms (10,000,000 ns)  1e-5 - 1e-6 cents/bit  user/operator     Mbytes (files)
  Tape         infinite     sec-min                1e-8 cents/bit         user/operator     --
15 A brief description of a cache
- Cache: the next level of the memory hierarchy up from the register file
  - All values in the register file should be in the cache
- Cache entries are usually referred to as blocks
  - A block is the minimum amount of information that can be in the cache
- If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss
- Cache miss rate: fraction of accesses not in the cache
- Miss penalty: # of clock cycles required b/c of the miss

Mem. stall cycles = Inst. count x Mem. refs/inst. x Miss rate x Miss penalty
16 Some initial questions to consider
- Where can a block be placed in an upper level of the memory hierarchy (i.e. a cache)?
- How is a block found in an upper level of the memory hierarchy?
- Which cache block should be replaced on a cache miss if the entire cache is full and we want to bring in new data?
- What happens if you want to write back to a memory location?
  - Do you just write to the cache?
  - Do you write somewhere else?
- (See board for discussion)
17 Where can a block be placed in a cache?
- 3 schemes for block placement in a cache
- Direct mapped cache
  - Block (or data to be stored) can go to only 1 place in the cache
  - Usually (Block address) MOD (# of blocks in the cache)
- Fully associative cache
  - Block can be placed anywhere in the cache
- Set associative cache
  - Set = a group of blocks in the cache
  - Block is mapped onto a set; then the block can be placed anywhere within that set
  - Usually (Block address) MOD (# of sets in the cache)
  - If there are n blocks per set, we call it n-way set associative (a minimal sketch of these mappings follows below)
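As promised, a minimal sketch of the two mapping calculations in C; the 8-block cache, the 2-blocks-per-set (4-set) organization, and memory block 12 mirror the figure on the next slide, and the macro names are mine.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BLOCKS     8                       /* total blocks in the cache */
    #define BLOCKS_PER_SET 2                       /* 2-way set associative */
    #define NUM_SETS       (NUM_BLOCKS / BLOCKS_PER_SET)

    int main(void) {
        uint32_t block_addr = 12;                  /* memory block 12, as in the figure */

        uint32_t direct_slot = block_addr % NUM_BLOCKS;  /* direct mapped: one legal slot */
        uint32_t set_index   = block_addr % NUM_SETS;    /* set associative: one legal set */
        /* fully associative: any slot is legal, so there is nothing to compute */

        printf("block %u -> direct-mapped slot %u, set %u (any of %u ways)\n",
               (unsigned)block_addr, (unsigned)direct_slot,
               (unsigned)set_index, (unsigned)BLOCKS_PER_SET);
        return 0;
    }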
18 Where can a block be placed in a cache?
[Diagram: an 8-block cache shown three ways -- fully associative, direct mapped, and 2-way set associative (sets 0-3) -- with memory block 12 being placed]
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into block 4 (12 mod 8)
- Set associative: block 12 can go anywhere in set 0 (12 mod 4)
19 Associativity
- If you have associativity > 1 you have to have a replacement policy
  - FIFO
  - LRU (a minimal bookkeeping sketch follows after this slide)
  - Random
- "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
- Virtual memory is effectively fully associative
  - (But don't worry about virtual memory yet)
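Here is that LRU sketch in C, for one set of an n-way set-associative cache; the structure, the 4-way assumption, and the logical-clock approach are mine, not from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4                                /* assumed associativity */

    /* Per-set LRU state: remember when each way was last touched and
     * evict the way with the oldest timestamp. */
    struct lru_set {
        uint64_t last_used[WAYS];
        uint64_t now;                             /* logical clock, bumped on every access */
    };

    static void lru_touch(struct lru_set *s, int way) {
        s->last_used[way] = ++s->now;             /* mark this way most recently used */
    }

    static int lru_victim(const struct lru_set *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)            /* pick the least recently used way */
            if (s->last_used[w] < s->last_used[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        struct lru_set s = {{0}, 0};
        for (int w = 0; w < WAYS; w++)
            lru_touch(&s, w);
        lru_touch(&s, 0);                         /* way 0 reused, so way 1 is now LRU */
        printf("victim = way %d\n", lru_victim(&s));   /* prints 1 */
        return 0;
    }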
20 How is a block found in the cache?
- Caches have an address tag on each block frame that provides the block address
- The tag of every cache block that might hold the entry is examined against the CPU address (in parallel! why?)
- Each entry usually has a valid bit
  - Tells us if the cache data is useful/not garbage
  - If the bit is not set, there can't be a match
- How does the address provided by the CPU relate to the entry in the cache?
  - The address is divided between the block address & the block offset
  - and further divided between the tag field & the index field
- (See board for explanation)
21 How is a block found in the cache?
Address = [ Tag | Index | Block Offset ], where Block Address = Tag + Index
- The block offset field selects the data from the block
  - (i.e. the address of the desired data within the block)
- The index field selects a specific set
- The tag field is compared against the stored tag for a hit
- Could we compare on more of the address than the tag?
  - Not necessary: checking the index would be redundant
    - It was already used to select the set to be checked
    - Ex.: an address stored in set 0 must have 0 in its index field
  - The offset is not necessary in the comparison: the entire block is present or not, so all block offsets would match
- (A sketch of extracting these fields from an address follows below.)
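That sketch, in C, uses the field widths from the Alpha 21064 example a few slides later (5 offset bits for 32-byte blocks, 8 index bits for 256 blocks, 21 tag bits); the example address is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5                         /* 32-byte blocks */
    #define INDEX_BITS  8                         /* 256 blocks, direct mapped */

    int main(void) {
        uint64_t addr = 0x2A137;                  /* arbitrary example address */

        uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("tag=0x%llx index=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }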
22 Which block should be replaced on a cache miss?
- If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
  - B/c the principle of locality says we'll probably use it again
- Direct mapped caches have 1 choice of which block to replace
- Fully associative or set associative caches offer more choices
- Usually 2 strategies
  - Random: pick any possible block and replace it
  - LRU: stands for Least Recently Used
    - Why not throw out the block not used for the longest time?
    - Usually approximated, and not much better than random: i.e. 5.18% vs. 5.69% for a 16KB 2-way set associative cache
- (add to picture on board)
23 What happens on a write?
- FYI: most accesses to a cache are reads
  - Used to fetch instructions (reads)
  - Most instructions don't write to memory
    - For DLX only about 7% of memory traffic involves writes
    - Translates to about 25% of cache data traffic
- Make the common case fast! Optimize the cache for reads!
  - Actually pretty easy to do
  - Can read the block while comparing/reading the tag
  - The block read begins as soon as the address is available
  - If it's a hit, the data is just passed right on to the CPU
- Writes take longer. Any idea why?
24 What happens on a write?
- Generically, there are 2 kinds of write policies
- Write through (or store through)
  - With write through, information is written to the block in the cache and to the block in lower-level memory
- Write back (or copy back)
  - With write back, information is written only to the cache. It will be written back to lower-level memory when the cache block is replaced
- The dirty bit
  - Each cache entry usually has a bit that specifies whether a write has occurred in that block or not
  - Helps reduce the frequency of writes to lower-level memory upon block replacement
- (add to picture on board)
25 What happens on a write?
- Write back versus write through
  - Write back advantageous because
    - Writes occur at the speed of the cache and don't incur the delay of lower-level memory
    - Multiple writes to a cache block result in only 1 lower-level memory access
  - Write through advantageous because
    - Lower levels of memory have the most recent copy of the data
- If the CPU has to wait for a write, we have a write stall
  - 1 way around this is a write buffer
  - Ideally, the CPU shouldn't have to stall during a write
  - Instead, data is written to the buffer, which sends it to the lower levels of the memory hierarchy
- (add to picture on board)
26 What happens on a write?
- What if we want to write and the block we want to write to isn't in the cache?
- There are 2 common policies (a rough sketch combining them with the write-hit policies follows below)
- Write allocate (or fetch on write)
  - The block is loaded on a write miss
  - The idea behind this is that subsequent writes will be captured by the cache (ideal for a write-back cache)
- No-write allocate (or write around)
  - The block is modified in the lower level and not loaded into the cache
  - Usually used for write-through caches
    - (subsequent writes still have to go to memory)
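To make the combinations concrete, here is a minimal sketch in C of a toy one-block "cache" handling writes under (a) write through + no-write allocate and (b) write back + write allocate; all names and the single-block simplification are mine, and small addresses double as tags.

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid;
        bool     dirty;                       /* only meaningful for write-back */
        uint32_t tag;                         /* small addresses: tag == full address */
        uint32_t data;
    };

    static uint32_t memory[1024];             /* stand-in for lower-level memory */

    /* (a) Write through + no-write allocate: update the cache only on a hit,
     * and always update lower-level memory (write misses "write around"). */
    static void write_through(struct line *l, uint32_t addr, uint32_t value) {
        if (l->valid && l->tag == addr)
            l->data = value;                  /* write hit: update the block */
        memory[addr] = value;                 /* always keep memory up to date */
    }

    /* (b) Write back + write allocate: on a miss, write back the old block if
     * dirty, fetch the new one, then write only the cache and set dirty. */
    static void write_back(struct line *l, uint32_t addr, uint32_t value) {
        if (!(l->valid && l->tag == addr)) {  /* write miss */
            if (l->valid && l->dirty)
                memory[l->tag] = l->data;     /* evict: copy the dirty block back */
            l->tag   = addr;                  /* allocate: fetch the missing block */
            l->data  = memory[addr];
            l->valid = true;
        }
        l->data  = value;                     /* write hits only the cache */
        l->dirty = true;                      /* lower-level copy is now stale */
    }

    int main(void) {
        struct line wt = {0}, wb = {0};
        write_through(&wt, 7, 42);            /* miss: writes around, only memory[7] changes */
        write_back(&wb, 7, 42);               /* miss: allocates block 7, marks it dirty */
        return 0;
    }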
27 An example: the Alpha 21064 data and instruction cache
[Diagram: the CPU address is split into a block address (tag <21 bits> + index <8 bits>) and a block offset <5 bits>.
 The cache has 256 blocks, each holding a valid bit <1>, a tag <21>, and data <256 bits>.
 Steps: (1) the CPU presents the address, (2) the index selects one entry whose valid bit, tag, and data are read,
 (3) the stored tag is compared (=?) against the address tag, (4) on a hit the data, selected through a 4:1 mux,
 is returned to the CPU; writes also pass through a write buffer to lower-level memory.]
28 Board example (1st)
29 Ex.: Alpha cache trace, step 1
- 1st, the address coming into the cache is divided into 2 fields
  - a 29-bit block address and a 5-bit block offset
- The block address is further divided into
  - an address tag and a cache index
- The cache index selects the tag to be tested to see if the desired block is in the cache
- The size of the index depends on cache size, block size, and set associativity
- The index is 8 bits wide; the tag is 29 - 8 = 21 bits
30 Ex.: Alpha cache trace, step 2
- Next, we need to do an index selection: the 8-bit index picks one of the 256 cache entries, whose valid bit (1 bit), tag (21 bits), and data (256 bits) are read out
- (With direct mapping, the data is read and the tag checked in parallel)
31 Ex.: Alpha cache trace, steps 3, 4
- After the tag is read from the cache, it's compared against the tag portion of the block address from the CPU
- If the tags do match, the data is still not necessarily valid; the valid bit must be set as well
  - If the valid bit is not set, the results are ignored by the CPU
- If the tags do match (and the entry is valid), it's OK for the CPU to load the data
- Note
  - The 21064 allows 2 clock cycles for these 4 steps
  - An instruction in the next 2 clock cycles after a load would stall if it tried to use the result of the load
32 What happens on a write in the Alpha?
- If something (i.e. a data word) is supposed to be written to the cache, the 1st 3 steps will be the same
- After the tag comparison hits, the write takes place
- B/c the Alpha uses a write-through cache, the write also must go back to main memory
  - It goes to the write buffer next (4 blocks in the Alpha)
  - If the buffer is not full, the data is copied there and, as far as the CPU is concerned, the write is done
    - Writes may have to be merged, however
  - If the buffer is full, the CPU must wait until the buffer has an empty entry
33 What happens on a read miss (with the Alpha cache)?
- (Just here so you can get a practical idea of what's going on with a real cache)
- Say we try to read something in the Alpha cache and it's not there
  - We must get it from the next level of the memory hierarchy
- So, what happens?
  - The cache tells the CPU to stall, to wait for new data
  - We need to get 32 bytes of data, but only have 16 bytes of available bandwidth
    - Each transfer takes 5 clock cycles
    - So we'll need 10 clock cycles to get all 32 bytes
  - The Alpha cache is direct mapped, so there's only one place for the new block to go
34 One way
- Have a scheme that allows the contents of a main memory address to be found in exactly one place in the cache
- Remember the cache is smaller than the level below it, thus multiple locations could map to the same place
- Severe restriction! But let's see what we can do with it...
35 A simple example
36 One way
Slot: 000 001 010 011 100 101 110 111
- Example
  - Looking for location 10011 (19)
  - Look in slot 011 (3)
  - 3 = 19 MOD 8
- What happens if this (the number of slots) is a power of 2? (The MOD is then just the low-order address bits.)
37 One way
- If there are four possible memory locations which map into the same location in our cache...
38 One way
- ...we can add tags which tell us if we have a match.
Slot: 000 001 010 011 100 101 110 111
TAG:  00  00  00  10  00  00  00  00
39 One way
- But there is still a problem! What if we haven't put anything into the cache? The "00" tag (for example) will confuse us.
Slot: 000 001 010 011 100 101 110 111
TAG:  00  00  00  00  00  00  00  00
40 One way
- Solution: add a valid bit.
Slot: 000 001 010 011 100 101 110 111
V:    0   0   0   0   0   0   0   0
41 One way
- Now if the valid bit is set, our match is good.
Slot: 000 001 010 011 100 101 110 111
V:    0   0   0   1   0   0   0   0
42 Basic Algorithm
- Assume we want the contents of location M
- Calculate CacheAddr = M mod CacheSize
- Calculate TargetTag = M / CacheSize
- if (Valid[CacheAddr] == SET and Tag[CacheAddr] == TargetTag)
  - return Data[CacheAddr]                          (hit)
- else                                              (miss)
  - Fetch the contents of location M from backup memory
  - Put it in Data[CacheAddr]
  - Update Tag[CacheAddr] and Valid[CacheAddr]
43 A bigger example with multiple accesses
44 Example
- Cache is initially empty
- We get the following sequence of memory references:
- 10110
- 11010
- 10110
- 11010
- 10000
- 00011
- 10000
- 10010
45 Example: initial condition
Slot: 000 001 010 011 100 101 110 111
TAG:  00  00  00  00  00  00  00  00
V:    0   0   0   0   0   0   0   0
Memory locations: 00000-11111 (the low 3 bits of an address select the slot; the high 2 bits are the tag)

46-47 Example: reference 10110 -> slot 110, tag 10: Miss (slot was invalid); the block is brought in
TAG:  00  00  00  00  00  00  10  00
V:    0   0   0   0   0   0   1   0

48-49 Example: reference 11010 -> slot 010, tag 11: Miss; the block is brought in
TAG:  00  00  11  00  00  00  10  00
V:    0   0   1   0   0   0   1   0

50-51 Example: reference 10110 -> slot 110, tag 10 matches and V = 1: Hit

52-53 Example: reference 11010 -> slot 010, tag 11 matches and V = 1: Hit

54-55 Example: reference 10000 -> slot 000, tag 10: Miss; the block is brought in
TAG:  10  00  11  00  00  00  10  00
V:    1   0   1   0   0   0   1   0

56-57 Example: reference 00011 -> slot 011, tag 00: Miss (slot was invalid); the block is brought in
TAG:  10  00  11  00  00  00  10  00
V:    1   0   1   1   0   0   1   0

58-59 Example: reference 10000 -> slot 000, tag 10 matches and V = 1: Hit

60-61 Example: reference 10010 -> slot 010, tag 10, but the stored tag is 11: Miss; block 11010 is replaced
TAG:  10  00  10  00  00  00  10  00
V:    1   0   1   1   0   0   1   0

(A small C simulation that replays this trace follows below.)
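Here is that simulation: a minimal direct-mapped cache model in C that replays the same reference sequence; the structure and names are mine.

    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_SIZE 8                      /* 8 slots, indexed by the low 3 address bits */

    int main(void) {
        unsigned tag[CACHE_SIZE] = {0};
        bool valid[CACHE_SIZE] = {false};

        /* The reference sequence from the example:
         * 10110, 11010, 10110, 11010, 10000, 00011, 10000, 10010 (binary). */
        unsigned refs[] = {22, 26, 22, 26, 16, 3, 16, 18};
        int nrefs = sizeof refs / sizeof refs[0];

        for (int i = 0; i < nrefs; i++) {
            unsigned addr = refs[i];
            unsigned slot = addr % CACHE_SIZE;         /* index = low 3 bits */
            unsigned t    = addr / CACHE_SIZE;         /* tag   = high 2 bits */
            bool hit = valid[slot] && tag[slot] == t;
            if (!hit) {                                /* miss: bring the block in */
                tag[slot]   = t;
                valid[slot] = true;
            }
            printf("ref %2u (slot %u): %s\n", addr, slot, hit ? "Hit" : "Miss");
        }
        return 0;
    }

Running it prints Miss, Miss, Hit, Hit, Miss, Miss, Hit, Miss, matching the trace above.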
62 Instruction & data caches
- Most processors have separate caches for data & instructions
- Why?
  - What if a load/store instruction is executed?
  - The processor should request data and fetch another instruction at the same time
  - If both were in the same cache, this could be a structural hazard
- The Alpha actually uses an 8-KB instruction cache that is almost identical to its data cache
- Note: you may see the term "unified" or "mixed" cache
  - These contain both instructions & data
63 Cache performance
- When evaluating cache performance, a fallacy is to focus only on miss rate
  - The temptation arises b/c miss rate is actually independent of the HW implementation
  - May think it gives an apples-to-apples comparison
- A better way is to use
  - Average memory access time = Hit time + Miss Rate X Miss Penalty
- Average memory access time is kinda like CPI
  - a good measure of performance, but still not perfect
- Again, the best end-to-end comparison is execution time
64 See board for another example
65 A cache example
- We want to compare the following
  - A 16-KB data cache & a 16-KB instruction cache versus a 32-KB unified cache
- Assume a hit takes 1 clock cycle to process
- Miss penalty = 50 clock cycles
- In the unified cache, a load or store hit takes 1 extra clock cycle, b/c having only 1 cache port is a structural hazard
- 75% of accesses are instruction references
- What's the avg. memory access time in each case?
- Miss rates (from the table on the slide): 0.64% for the 16-KB instruction cache, 6.47% for the 16-KB data cache, 1.99% for the 32-KB unified cache
66 A cache example continued
- 1st, we need to determine the overall miss rate for the split caches
  - (75% x 0.64%) + (25% x 6.47%) = 2.10%
- This compares to the unified cache miss rate of 1.99%
- We'll use the average memory access time formula from a few slides ago, but break it up into instruction & data references
- Average memory access time, split cache
  - = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
  - = (75% x 1.32) + (25% x 4.235) = 2.05 cycles
- Average memory access time, unified cache
  - = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
  - = (75% x 1.995) + (25% x 2.995) = 2.24 cycles
- Despite the higher miss rate, the access time is faster for the split cache! (A small C check of this arithmetic follows below.)
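That check, using the miss rates and penalties straight from the example:

    #include <stdio.h>

    int main(void) {
        const double instr_frac = 0.75, data_frac = 0.25;
        const double miss_penalty = 50.0;

        /* Split 16-KB instruction + 16-KB data caches */
        double split = instr_frac * (1.0 + 0.0064 * miss_penalty)
                     + data_frac  * (1.0 + 0.0647 * miss_penalty);

        /* 32-KB unified cache: a load/store hit pays 1 extra cycle (structural hazard) */
        double unified = instr_frac * (1.0 + 0.0199 * miss_penalty)
                       + data_frac  * (1.0 + 1.0 + 0.0199 * miss_penalty);

        printf("split   AMAT = %.2f cycles\n", split);      /* ~2.05 */
        printf("unified AMAT = %.2f cycles\n", unified);     /* ~2.24 */
        return 0;
    }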
67 The Big Picture
- A very generic equation for total CPU time is
  - (CPU execution clock cycles + memory stall cycles) x clock cycle time
- This raises the question of whether the clock cycles for a cache hit should be included in
  - the "CPU execution cycles" part of the equation
  - or the "memory stall cycles" part of the equation
- Convention puts them in the CPU execution cycles part
  - With the 5-stage pipeline, cache hit time is included as part of the memory stage
- This allows memory stall cycles to be defined in terms of
  - # of accesses per program, miss penalty (in clock cycles), and miss rate for writes and reads
68 Memory access equations
- Using what we defined on the previous slide, we can say
  - Memory stall clock cycles =
    - Reads x Read miss rate x Read miss penalty +
    - Writes x Write miss rate x Write miss penalty
- Often, reads and writes are combined/averaged
  - Memory stall cycles =
    - Memory accesses x Miss rate x Miss penalty (approximation)
- It is also possible to factor in instruction count to get a complete formula (a small numeric sketch follows below)
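A minimal numeric sketch of the combined equation; all of the counts below are made-up, illustrative values rather than numbers from the lecture.

    #include <stdio.h>

    int main(void) {
        /* Illustrative values only */
        double inst_count        = 1e6;
        double mem_refs_per_inst = 1.3;           /* instruction fetch plus some data refs */
        double miss_rate         = 0.02;
        double miss_penalty      = 50.0;          /* clock cycles */

        double mem_accesses = inst_count * mem_refs_per_inst;
        double stall_cycles = mem_accesses * miss_rate * miss_penalty;

        printf("memory stall cycles = %.0f\n", stall_cycles);
        return 0;
    }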
69 Reducing cache misses
- Obviously, we want data accesses to result in cache hits, not misses; this will optimize performance
- Start by looking at ways to increase the # of hits...
- ...but first look at the 3 kinds of misses!
- Compulsory misses
  - The very 1st access to a cache block will not be a hit; the data's not there yet!
- Capacity misses
  - The cache is only so big. It won't be able to store every block accessed in a program; some must be swapped out!
- Conflict misses
  - Result from set-associative or direct mapped caches
  - Blocks are discarded/re-fetched if too many map to one location
70 Cache misses and the architect
- What can we do about the 3 kinds of cache misses?
  - Compulsory, capacity, and conflict
- Can avoid conflict misses w/ a fully associative cache
  - But fully associative caches mean expensive HW, possibly slower clock rates, and other bad stuff
- Can avoid capacity misses by making the cache bigger; small caches can lead to thrashing
  - W/ thrashing, data moves between 2 levels of the memory hierarchy very frequently; this can really slow down perf.
- Larger blocks can mean fewer compulsory misses
  - But can turn a capacity miss into a conflict miss!
71 (1) Larger cache block size
- The easiest way to reduce miss rate is to increase cache block size
  - This will help eliminate what kind of misses?
- Helps improve miss rate b/c of the principle of locality
  - Temporal locality says that if something is accessed once, it'll probably be accessed again soon
  - Spatial locality says that if something is accessed, something nearby will probably be accessed
  - Larger block sizes help with spatial locality
- Be careful though!
  - Larger block sizes can increase the miss penalty!
  - Generally, larger blocks reduce the # of total blocks in the cache
72 Larger cache block size (graph comparison)
[Graph: miss rate vs. block size, one curve per total cache size]
- Why this trend? (Assuming total cache size stays constant for each curve)
73 (1) Larger cache block size (example)
- Assume that to access the lower level of the memory hierarchy you
  - Incur a 40 clock cycle overhead
  - Get 16 bytes of data every 2 clock cycles
  - I.e. get 16 bytes in 42 clock cycles, 32 in 44, etc.
- Using the data in the table on the slide (miss rates for each block size / cache size combination), which block size has the minimum average memory access time?
74 Larger cache block size (ex. continued)
- Recall that Average memory access time = Hit time + Miss rate X Miss penalty
- Assume a cache hit otherwise takes 1 clock cycle, independent of block size
- So, for a 16-byte block in a 1-KB cache
  - Average memory access time = 1 + (15.05% X 42) = 7.321 clock cycles
- And for a 256-byte block in a 256-KB cache
  - Average memory access time = 1 + (0.49% X 72) = 1.353 clock cycles
- The rest of the data is included on the next slide (a C sketch of the same calculation follows below)
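That sketch uses the miss-penalty model from the example (40-cycle overhead plus 2 cycles per 16 bytes) and the two miss rates quoted above:

    #include <stdio.h>

    static double miss_penalty(int block_bytes) {
        return 40.0 + 2.0 * (block_bytes / 16);      /* 40-cycle overhead + 2 cycles per 16 B */
    }

    static double amat(double miss_rate, int block_bytes) {
        return 1.0 + miss_rate * miss_penalty(block_bytes);   /* 1-cycle hit time */
    }

    int main(void) {
        printf("1-KB cache,   16-byte block:  %.3f cycles\n", amat(0.1505, 16));   /* 7.321 */
        printf("256-KB cache, 256-byte block: %.3f cycles\n", amat(0.0049, 256));  /* 1.353 */
        return 0;
    }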
75 Larger cache block size (ex. continued)
[Table: average memory access time, in clock cycles, for each block size / cache size combination; the red entries are the lowest average time for a particular cache size]
- Note: all of these block sizes are common in processors today
76 (1) Larger cache block sizes (wrap-up)
- We want to minimize cache miss rate & cache miss penalty at the same time!
- Selection of block size depends on the latency and bandwidth of the lower-level memory
- High latency, high bandwidth encourage large block sizes
  - The cache gets many more bytes per miss for a small increase in miss penalty
- Low latency, low bandwidth encourage small block sizes
  - Twice the miss penalty of a small block may be close to the penalty of a block twice the size
  - A larger # of small blocks may reduce conflict misses
77 (2) Higher associativity
- Higher associativity can improve cache miss rates
- Note that an 8-way set associative cache is
  - essentially a fully associative cache
- This helps lead to the 2:1 cache rule of thumb
  - It says: a direct mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
- But, diminishing returns set in sooner or later
  - Greater associativity can cause increased hit time
78 Cache miss penalties
- Recall the equation for average memory access time
  - = Hit time + Miss Rate X Miss Penalty
- We talked about lots of ways to improve the miss rates of caches in the previous slides
- But, just by looking at the formula we can see
  - Improving the miss penalty will work just as well!
- Remember that technology trends have made processor speeds much faster than memory/DRAM speeds
  - The relative cost of miss penalties has increased over time!
79 2nd-level caches
- The 1st 4 techniques discussed all impact the CPU
- This technique focuses on the cache/main memory interface
- The processor/memory performance gap makes architects consider
  - whether they should make caches faster to keep pace with CPUs
  - or make caches larger to overcome the widening gap between CPU and main memory
- One solution is to do both
  - Add another level of cache (L2) between the 1st-level cache (L1) and main memory
  - Ideally L1 will be fast enough to match the speed of the CPU, while L2 will be large enough to reduce the penalty of going to main memory
80 Second-level caches
- This will of course introduce a new definition for average memory access time
  - = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
  - where Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
  - So the 2nd-level miss rate is measured on the 1st-level cache's misses
- A few definitions to avoid confusion
  - Local miss rate
    - # of misses in the cache divided by the total # of memory accesses to that cache; specifically, Miss RateL2
  - Global miss rate
    - # of misses in the cache divided by the total # of memory accesses generated by the CPU; specifically, Miss RateL1 x Miss RateL2
81 (3) Second-level caches
- Example
  - In 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates?
  - Miss Rate L1 (local or global) = 40/1000 = 4%
  - Miss Rate L2 (local) = 20/40 = 50%
  - Miss Rate L2 (global) = 20/1000 = 2%
- Note that the global miss rate is very similar to the single-cache miss rate of the L2 cache
  - (if the L2 size >> L1 size)
- The local miss rate is not a good measure for secondary caches; it's a function of the L1 miss rate
  - Which can vary by changing the L1 cache
- Use the global cache miss rate when evaluating 2nd-level caches! (A small C check follows below.)
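That check, using the numbers from the example:

    #include <stdio.h>

    int main(void) {
        double refs = 1000, l1_misses = 40, l2_misses = 20;

        double l1_miss_rate        = l1_misses / refs;       /* local == global for L1 */
        double l2_local_miss_rate  = l2_misses / l1_misses;  /* per access that reaches L2 */
        double l2_global_miss_rate = l2_misses / refs;       /* per CPU memory access */

        printf("L1 miss rate:        %.1f%%\n", 100 * l1_miss_rate);         /* 4.0%  */
        printf("L2 local miss rate:  %.1f%%\n", 100 * l2_local_miss_rate);   /* 50.0% */
        printf("L2 global miss rate: %.1f%%\n", 100 * l2_global_miss_rate);  /* 2.0%  */
        return 0;
    }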
82 Second-level caches (some odds-and-ends comments)
- The speed of the L1 cache will affect the clock rate of the CPU, while the speed of the L2 cache will affect only the miss penalty of the L1 cache
  - Which of course could affect the CPU in various ways
- 2 big things to consider when designing the L2 cache are
  - Will the L2 cache lower the average memory access time portion of the CPI?
  - If so, how much will it cost?
    - In terms of HW, etc.
- 2nd-level caches are usually BIG!
  - Usually L1 is a subset of L2
  - We should have few capacity misses in the L2 cache
    - Only worry about compulsory and conflict misses for optimizations
83 (3) Second-level caches (example)
- Given the following data
  - 2-way set associativity increases hit time by 10% of a CPU clock cycle (or 1% of the overall time it takes for an L2 hit)
  - Hit time for the L2 direct mapped cache is 10 clock cycles
  - Local miss rate for the L2 direct mapped cache is 25%
  - Local miss rate for the L2 2-way set associative cache is 20%
  - Miss penalty for the L2 cache is 50 clock cycles
- What is the impact of using a 2-way set associative L2 cache on our miss penalty?
84 (3) Second-level caches (example)
- Miss penalty (direct mapped L2)
  - = 10 + 25% x 50 = 22.5 clock cycles
- Adding the cost of associativity increases the hit cost by only 0.1 clock cycles
- Thus, Miss penalty (2-way set associative L2)
  - = 10.1 + 20% x 50 = 20.1 clock cycles
- However, we can't have a fraction for a number of clock cycles (i.e. 10.1 ain't possible!)
- We'll either need to round up to 11 or optimize some more to get it down to 10. So:
  - 10 + 20% x 50 = 20.0 clock cycles, or
  - 11 + 20% x 50 = 21.0 clock cycles (both better than 22.5)
85 (3) Second-level caches (some final random comments)
- We can reduce the miss penalty by reducing the miss rate of the 2nd-level cache using techniques previously discussed
  - I.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd-level hit time
  - And much of the average access time is due to misses in the L2 cache
- Could also reduce misses by increasing L2 block size
- Need to think about something called the "multilevel inclusion property"
  - In other words, all data in the L1 cache is always in L2
  - Gets complex for writes, and whatnot
86 Multilevel caches: recall the 1-level cache numbers
[Diagram: Processor -- cache (1 ns) -- BIG SLOW MEMORY (100 ns)]
AMAT = Thit + (1 - h) x Tmem = 1 ns + (1 - h) x 100 ns
- A hit rate of 98% would yield an AMAT of 3 ns ... pretty good!
87 Multilevel cache: add a medium-size, medium-speed L2
[Diagram: Processor -- L1 cache (1 ns) -- L2 cache (10 ns) -- BIG SLOW MEMORY (100 ns)]
AMAT = Thit_L1 + (1 - h_L1) x Thit_L2 + (1 - h_L1) x (1 - h_L2) x Tmem
- A hit rate of 98% in L1 and 95% in L2 would yield an AMAT of 1 + 0.2 + 0.1 = 1.3 ns -- outstanding! (A small C check follows below.)
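That check of the one-level and two-level AMAT numbers:

    #include <stdio.h>

    int main(void) {
        double t_l1 = 1.0, t_l2 = 10.0, t_mem = 100.0;    /* ns, from the slides */
        double h_l1 = 0.98, h_l2 = 0.95;

        double amat1 = t_l1 + (1 - h_l1) * t_mem;         /* one-level cache (slide 86) */

        double amat2 = t_l1 + (1 - h_l1) * t_l2           /* two-level cache (slide 87) */
                     + (1 - h_l1) * (1 - h_l2) * t_mem;

        printf("1-level AMAT = %.1f ns\n", amat1);        /* 3.0 */
        printf("2-level AMAT = %.1f ns\n", amat2);        /* 1.3 */
        return 0;
    }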
88 Reducing the hit time
- Again, recall our average memory access time equation
  - = Hit time + Miss Rate x Miss Penalty
- We've talked about reducing the Miss Rate and the Miss Penalty; Hit time can also be a big component
- On many machines cache access can affect the clock cycle time, so making this small is a good thing!
- We'll talk about a few ways next
89 Small and simple caches
- Why is this good?
- Generally, smaller hardware is faster, so a small cache should help the hit time
- If an L1 cache is small enough, it should fit on the same chip as the actual processing logic
  - The processor avoids time going off chip!
  - Some designs compromise and keep the tags on chip and the data off chip; this allows a fast tag check and a much larger (>>) memory capacity
- Direct mapping also falls under the category of "simple"
  - Relates to the point above as well: you can check the tag and read the data at the same time!
90 Cache Mechanics Summary
- Basic action
  - look up block
  - check tag
  - select byte from block
- Block size
- Associativity
- Write Policy
91 Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
92 Great Cache Questions
- How do you use the processor's address bits to look up a value in a cache?
- How many bits of storage are required in a cache with a given organization?
  - E.g. 64KB, direct mapped, 16B blocks, write-back
  - 64K x 8 bits for data
  - 4K x (16 + 1 + 1) bits for tag, valid and dirty bits
[Address split: tag | index | offset]
(A sketch of this calculation in C follows below.)
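That sketch, for the 64KB / direct-mapped / 16B-block / write-back organization above, assumes 32-bit addresses (as in the exercise on the next slide); the variable names are mine.

    #include <stdio.h>

    int main(void) {
        long cache_bytes = 64 * 1024;
        long block_bytes = 16;
        int  addr_bits   = 32;
        long num_blocks  = cache_bytes / block_bytes;            /* 4K blocks */

        int offset_bits = 4;                                     /* log2(16)  */
        int index_bits  = 12;                                    /* log2(4K)  */
        int tag_bits    = addr_bits - index_bits - offset_bits;  /* 16 bits   */

        long data_bits  = cache_bytes * 8;                       /* 64K x 8   */
        long other_bits = num_blocks * (tag_bits + 1 + 1);       /* tag + valid + dirty */

        printf("tag=%d index=%d offset=%d bits\n", tag_bits, index_bits, offset_bits);
        printf("data=%ld bits, tag/valid/dirty=%ld bits, total=%ld bits\n",
               data_bits, other_bits, data_bits + other_bits);
        return 0;
    }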
93 More Great Cache Questions
- Suppose you have a loop like this (the slide only shows an access to a[i][j]; the increment below just makes it concrete):

    char a[1024][1024];
    for (i = 0; i < 1024; i++)
        for (j = 0; j < 1024; j++)
            a[i][j] += 1;

- What's the hit rate in a 64KB/direct/16B-block cache?
94 A. Terminology
- Take out a piece of paper and draw the following cache
  - total data size: 256KB
  - associativity: 4-way
  - block size: 16 bytes
  - address: 32 bits
  - write policy: write-back
  - replacement policy: random
- How do you partition the 32-bit address?
- How many total bits of storage are required?
95 C. Measuring Caches
96 Measuring Processor Caches
- Generate a test program that, when timed, reveals the cache size, block size, associativity, etc.
- How to do this?
  - How do you cause cache misses in a cache of size X?
97 Detecting Cache Size

    for (size = 1; size < MAXSIZE; size *= 2)
        for (dummy = 0; dummy < ZILLION; dummy++)
            for (i = 0; i < size; i++)       /* <-- time this part */
                array[i] += 1;               /* (the slide just shows a touch of array[i]) */

- what happens when size < cache size?
- what happens when size > cache size?
- how can you figure out the block size?
98 Cache and Block Size

    for (stride = 1; stride < MAXSTRIDE; stride *= 2)
        for (size = 1; size < MAXSIZE; size *= 2)
            for (dummy = 0; dummy < ZILLION; dummy++)
                for (i = 0; i < size; i += stride)   /* <-- time this part */
                    array[i] += 1;

- what happens for stride = 1?
- what happens for stride = blocksize?
- (A self-contained, runnable version of this kind of measurement follows below.)
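As mentioned, here is a self-contained, runnable version of the cache-size sweep; the array size, repeat count, element type, and use of clock() are my choices for illustration, and real measurements need care with compiler optimization and timer resolution.

    #include <stdio.h>
    #include <time.h>

    #define MAXSIZE (1 << 24)        /* 16M ints: bigger than any cache level */
    #define ZILLION 64               /* repeats, so each sweep is long enough to time */

    static int array[MAXSIZE];

    int main(void) {
        for (long size = 1024; size <= MAXSIZE; size *= 2) {
            volatile long sink = 0;
            clock_t start = clock();
            for (int dummy = 0; dummy < ZILLION; dummy++)
                for (long i = 0; i < size; i++)
                    sink += array[i];                 /* sweep a working set of 'size' ints */
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
            /* time per access jumps once size * sizeof(int) exceeds a cache level */
            printf("size = %8ld ints: %.2f ns/access\n",
                   size, 1e9 * secs / ((double)ZILLION * size));
        }
        return 0;
    }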
99 Cache as part of a system
[Pipeline diagram: the 5-stage datapath (IF, ID, EX, MEM, WB) with the Instr Cache feeding IF (PC and next-PC mux) and the Data Cache in MEM; the register file (DPRF), sign-extend, branch (BEQ) logic, ALU, and writeback mux appear as in earlier lectures]