Title: CS 4290/6290 Lecture 11: Memory Hierarchies
1. CS 4290/6290 Lecture 11: Memory Hierarchies
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, Michael Niemier, and Milos Prvulovic)
2. Memory and Pipelining
- In our 5-stage pipe, we've constantly been assuming that we can access our operand from memory in 1 clock cycle
- This is possible, but it's complicated
- We'll discuss how this happens in the next several lectures
- (see board for discussion)
- We'll talk about
- Memory Technology
- Memory Hierarchy
- Caches
- Memory
- Virtual Memory
3. Memory Technology
- Memory comes in many flavors
- SRAM (Static Random Access Memory)
- DRAM (Dynamic Random Access Memory)
- ROM, EPROM, EEPROM, Flash, etc.
- Disks, tapes, etc.
- These differ in speed, price and size
- Fast is small and/or expensive
- Large is slow and/or cheap
4. Is there a problem with DRAM?
[Figure: Processor-DRAM memory gap (latency). Performance (log scale) vs. time, 1980-2000. CPU performance ("Moore's Law") grows about 60%/yr (2X/1.5 yr); DRAM performance grows about 9%/yr (2X/10 yrs), so the gap widens over time.]
5. Why Not Only DRAM?
- Not large enough for some things
- Backed up by storage (disk)
- Virtual memory, paging, etc.
- Will get back to this
- Not fast enough for processor accesses
- Takes hundreds of cycles to return data
- OK in very regular applications
- Can use SW pipelining, vectors
- Not OK in most other applications
6. The principle of locality
- ...says that most programs don't access all code or data uniformly
- i.e. in a loop, a small subset of instructions might be executed over and over again
- a block of memory addresses might be accessed sequentially
- This has led to memory hierarchies
- Some important things to note
- Fast memory is expensive
- Levels of memory are usually smaller/faster than the previous level
- Levels of memory usually subset one another
- All the stuff in a higher level is in some level below it
7. Terminology Summary
- Hit: data appears in a block in the upper level (i.e. block X in the cache)
- Hit Rate: fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of
- RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: extra time to replace a block in the upper level
- + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on the 21264)
8. Average Memory Access Time
AMAT = Hit Time + (1 - h) x Miss Penalty
- Hit time: basic time of every access
- Hit rate (h): fraction of accesses that hit
- Miss penalty: extra time to fetch a block from a lower level, including time to replace it in the CPU
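To make the formula concrete, here is a minimal C sketch of the AMAT calculation; the numbers passed in (1-cycle hit, 95% hit rate, 50-cycle miss penalty) are assumptions for illustration, not values from the slides:

#include <stdio.h>

/* AMAT = hit_time + (1 - hit_rate) * miss_penalty, with all times in cycles */
static double amat(double hit_time, double hit_rate, double miss_penalty)
{
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main(void)
{
    /* assumed example numbers: 1-cycle hit, 95% hit rate, 50-cycle miss penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.95, 50.0));
    return 0;
}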
9. The Full Memory Hierarchy: "always reuse a good idea"
Each level is listed with its capacity, access time, and cost, plus who manages the staging/transfer and the transfer unit; upper levels are smaller and faster, lower levels are larger and slower.
- CPU Registers: 100s of bytes, <10s of ns; managed by the program/compiler; transfer unit 1-8 bytes (instruction operands)
- Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; managed by the cache controller; transfer unit 8-128 bytes (blocks)
- Main Memory: M bytes, 200-500 ns, 0.0001-0.00001 cents/bit; managed by the OS; transfer unit 4K-16K bytes (pages)
- Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; managed by the user/operator; transfer unit Mbytes (files)
- Tape: "infinite" capacity, sec-min access time, 10^-8 cents/bit (the lowest level)
10. A brief description of a cache
- Cache: the next level of the memory hierarchy up from the register file
- All values in the register file should be in the cache
- Cache entries are usually referred to as "blocks"
- A block is the minimum amount of information that can be in the cache
- If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss
- Cache miss rate: fraction of accesses not in the cache
- Miss penalty: # of clock cycles required b/c of the miss

Mem. stall cycles = Inst. count x Mem. refs/inst. x Miss rate x Miss penalty
11. Cache Basics
- Fast (but small) memory close to the processor
- When data is referenced
- If in the cache, use the cache instead of memory
- If not in the cache, bring it into the cache (actually, bring the entire block of data, too)
- Maybe have to kick something else out to do it!
- Important decisions
- Placement: where in the cache can a block go?
- Identification: how do we find a block in the cache?
- Replacement: what to kick out to make room in the cache?
- Write policy: what do we do about writes?
12. Cache Basics
- Cache consists of block-sized lines
- Line size is typically a power of two
- Typically 16 to 128 bytes in size
- Example
- Suppose the block size is 128 bytes
- Lowest seven bits determine the offset within the block
- Read data at address A = 0x7fffa3f4
- The address belongs to the block with base address 0x7fffa380
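A minimal C sketch of this address arithmetic, assuming the 128-byte block from the example (so a 7-bit offset):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr   = 0x7fffa3f4;               /* address from the example above              */
    uint32_t offset = addr & 0x7f;              /* low 7 bits: offset within the 128-byte block */
    uint32_t base   = addr & ~(uint32_t)0x7f;   /* clear the offset bits to get the block base  */
    printf("offset = 0x%02x, block base = 0x%08x\n", (unsigned)offset, (unsigned)base); /* 0x74, 0x7fffa380 */
    return 0;
}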
13. Some initial questions to consider
- Where can a block be placed in an upper level of the memory hierarchy (i.e. a cache)?
- How is a block found in an upper level of the memory hierarchy?
- Which cache block should be replaced on a cache miss if the entire cache is full and we want to bring in new data?
- What happens if you want to write back to a memory location?
- Do you just write to the cache?
- Do you write somewhere else?
(See board for discussion)
14. Where can a block be placed in a cache?
- 3 schemes for block placement in a cache
- Direct mapped cache
- A block (or data to be stored) can go to only 1 place in the cache
- Usually (Block address) MOD (# of blocks in the cache)
- Fully associative cache
- A block can be placed anywhere in the cache
- Set associative cache
- Set: a group of blocks in the cache
- A block is mapped onto a set; then the block can be placed anywhere within that set
- Usually (Block address) MOD (# of sets in the cache)
- If there are n blocks per set, we call it n-way set associative
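A small C sketch (illustrative, not from the slides) of the two MOD mappings, using the block-12 numbers from the figure on the next slide:

#include <stdio.h>

#define NUM_BLOCKS 8   /* direct mapped: 8 cache blocks, as in the next slide's figure */
#define NUM_SETS   4   /* set associative: 4 sets (2-way in an 8-block cache)          */

int main(void)
{
    unsigned block_addr = 12;  /* memory block 12, as in the next slide's example */
    printf("direct mapped   -> cache block %u\n", block_addr % NUM_BLOCKS); /* 12 mod 8 = 4 */
    printf("set associative -> set %u\n",         block_addr % NUM_SETS);   /* 12 mod 4 = 0 */
    return 0;
}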
15. Where can a block be placed in a cache?
[Figure: an 8-block cache shown three ways. Fully associative: memory block 12 can go anywhere. Direct mapped: block 12 can go only into cache block 4 (12 mod 8). Set associative (4 sets, Set 0 - Set 3): block 12 can go anywhere in set 0 (12 mod 4).]
16. Associativity
- If you have associativity > 1 you have to have a replacement policy
- FIFO
- LRU
- Random
- "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
- Virtual memory is effectively fully associative
- (But don't worry about virtual memory yet)
17. How is a block found in the cache?
- Caches have an address tag on each block frame that provides the block address
- The tag of every cache block that might have the entry is examined against the CPU address (in parallel! why?)
- Each entry usually has a valid bit
- Tells us if the cache data is useful/not garbage
- If the bit is not set, there can't be a match
- How does the address provided by the CPU relate to the entry in the cache?
- The entry is divided between block address & block offset
- ...and further divided between tag field & index field
(See board for explanation)
18. How is a block found in the cache?
[Address layout: Block Address (Tag | Index) | Block Offset]
- The block offset field selects data from the block
- (i.e. the address of the desired data within the block)
- The index field selects a specific set
- The tag field is compared against it for a hit
- Could we compare on more of the address than the tag?
- Not necessary: checking the index is redundant
- It is already used to select the set to be checked
- Ex.: an address stored in set 0 must have 0 in the index field
- The offset is not necessary in the comparison: the entire block is present or not, so all block offsets must match
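A minimal C sketch of this tag/index/offset decomposition. The field widths here (32-bit address, 128-byte blocks giving 7 offset bits, 64 sets giving 6 index bits) are assumptions chosen for illustration, not numbers from the slides:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 7   /* assumed 128-byte block */
#define INDEX_BITS  6   /* assumed 64 sets        */

int main(void)
{
    uint32_t addr   = 0x7fffa3f4;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag = 0x%x, index = %u, offset = %u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}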
19. Which block should be replaced on a cache miss?
- If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
- B/c the principle of locality says we'll probably use it again
- Direct mapped caches have 1 choice of what block to replace
- Fully associative or set associative caches offer more choices
- Usually 2 strategies
- Random: pick any possible block and replace it
- LRU: stands for Least Recently Used
- Why not throw out the block not used for the longest time?
- Usually approximated, and not much better than random, e.g. 5.18 vs. 5.69 for a 16KB 2-way set associative cache
(add to picture on board)
20. What happens on a write?
- FYI: most accesses to a cache are reads
- Used to fetch instructions (reads)
- Most instructions don't write to memory
- For DLX, only about 7% of memory traffic involves writes
- Translates to about 25% of cache data traffic
- Make the common case fast! Optimize the cache for reads!
- Actually pretty easy to do
- Can read the block while comparing/reading the tag
- The block read begins as soon as the address is available
- If it's a hit, the data is just passed right on to the CPU
- Writes take longer. Any idea why?
21. What happens on a write?
- Generically, there are 2 kinds of write policies
- Write through (or store through)
- With write through, information is written to the block in the cache and to the block in lower-level memory
- Write back (or copy back)
- With write back, information is written only to the cache. It will be written back to lower-level memory when the cache block is replaced
- The dirty bit
- Each cache entry usually has a bit that specifies whether a write has occurred in that block or not
- Helps reduce the frequency of writes to lower-level memory upon block replacement
(add to picture on board)
22. What happens on a write?
- Write back versus write through
- Write back advantageous because
- Writes occur at the speed of the cache and don't incur the delay of lower-level memory
- Multiple writes to a cache block result in only 1 lower-level memory access
- Write through advantageous because
- Lower levels of memory have the most recent copy of the data
- If the CPU has to wait for a write, we have a write stall
- 1 way around this is a write buffer
- Ideally, the CPU shouldn't have to stall during a write
- Instead, data is written to the buffer, which sends it to lower levels of the memory hierarchy
(add to picture on board)
23. LRU Example
- 4-way set associative
- Need 4 values (2 bits) for the counter
Initial state, shown as (LRU counter, block address) for each of the 4 ways:
(0, 0x00004000) (1, 0x00003800) (2, 0xffff8000) (3, 0x00cd0800)
Access 0xffff8004 (hit in 0xffff8000, which becomes most recently used):
(0, 0x00004000) (1, 0x00003800) (3, 0xffff8000) (2, 0x00cd0800)
Access 0x00003840 (hit in 0x00003800):
(0, 0x00004000) (3, 0x00003800) (2, 0xffff8000) (1, 0x00cd0800)
Access 0x00d00008 (miss): replace the entry with counter 0, then update the counters:
(3, 0x00d00000) (2, 0x00003800) (1, 0xffff8000) (0, 0x00cd0800)
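A rough C sketch of this counter-based true-LRU update for one 4-way set (illustrative code, not from the slides; names like lru_touch are hypothetical):

#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; unsigned counter; int valid; };

/* On a hit to way w: ways whose counter is above w's old value move down one,
   and w becomes the most recently used way (counter = WAYS-1). */
void lru_touch(struct way set[WAYS], int w)
{
    unsigned old = set[w].counter;
    for (int i = 0; i < WAYS; i++)
        if (set[i].counter > old)
            set[i].counter--;
    set[w].counter = WAYS - 1;
}

/* On a miss: the victim is the way with counter 0; refill it, then mark it MRU. */
int lru_replace(struct way set[WAYS], uint32_t new_tag)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++)
        if (set[i].counter == 0)
            victim = i;
    set[victim].tag = new_tag;
    set[victim].valid = 1;
    lru_touch(set, victim);
    return victim;
}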
24. Approximating LRU
- LRU is too complicated
- Access and possibly update all counters in a set on every access (not just on replacement)
- Need something simpler and faster
- But still close to LRU
- NMRU: Not Most Recently Used
- The entire set has one MRU pointer
- Points to the last-accessed line in the set
- Replacement: randomly select a non-MRU line
25. What happens on a write?
- What if we want to write and the block we want to write to isn't in the cache?
- There are 2 common policies
- Write allocate (or fetch on write)
- The block is loaded on a write miss
- The idea behind this is that subsequent writes will be captured by the cache (ideal for a write back cache)
- No-write allocate (or write around)
- The block is modified in lower-level memory and not loaded into the cache
- Usually used for write-through caches
- (subsequent writes still have to go to memory)
26. Memory access equations
- Using what we defined on the previous slides, we can say
- Memory stall clock cycles =
- Reads x Read miss rate x Read miss penalty
- + Writes x Write miss rate x Write miss penalty
- Often, reads and writes are combined/averaged
- Memory stall cycles =
- Memory accesses x Miss rate x Miss penalty (approximation)
- Also possible to factor in instruction count to get a complete formula
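A small worked sketch in C of the combined formula with the instruction count factored in; all of the numbers below are made up for illustration:

#include <stdio.h>

int main(void)
{
    double inst_count        = 1e6;   /* assumed instruction count                 */
    double accesses_per_inst = 1.5;   /* assumed memory references per instruction */
    double miss_rate         = 0.02;  /* assumed 2% miss rate                      */
    double miss_penalty      = 50.0;  /* assumed 50-cycle miss penalty             */

    /* Memory stall cycles = IC x (mem refs / inst) x miss rate x miss penalty */
    double stall_cycles = inst_count * accesses_per_inst * miss_rate * miss_penalty;
    printf("memory stall cycles = %.0f\n", stall_cycles);
    return 0;
}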
27. Reducing cache misses
- Obviously, we want data accesses to result in cache hits, not misses; this will optimize performance
- Start by looking at ways to increase the # of hits...
- ...but first look at the 3 kinds of misses!
- Compulsory misses
- The very 1st access to a cache block will not be a hit; the data's not there yet!
- Capacity misses
- The cache is only so big. It won't be able to store every block accessed in a program; some must be swapped out!
- Conflict misses
- Result from set-associative or direct mapped caches
- Blocks are discarded/retrieved if too many map to a location
28. Cache Examples
29. Cache Example
Physical Address (10 bits): Tag (6 bits) | Index (2 bits) | Offset (2 bits)
A 4-entry direct mapped cache with 4 data words/block.
Assume we want to read the following data words:
Tag    Index  Offset  Address holds data (decimal)
101010 10     00      35
101010 10     01      24
101010 10     10      17
101010 10     11      25
(1) If we read 101010 10 01 we want to bring data word 24 into the cache. Where would this data go? Well, the index is 10. Therefore, the data word will go somewhere into the 3rd block of the cache. (Make sure you understand the terminology.)
(2) More specifically, the data word would go into the 2nd position within the block, because the offset is 01.
(3) The principle of spatial locality says that if we use one data word, we'll probably use some data words that are close to it; that's why our block size is bigger than one data word. So we fill in the data word entries surrounding 101010 10 01 as well.
All of these physical addresses would have the same tag.
All of these physical addresses map to the same cache entry.
30. Cache Example (continued)
[Figure: the 4-entry direct mapped cache with 4 data words/block. Each entry (index 00, 01, 10, 11) has V and D bits, a tag, and 4 data words (offsets 00, 01, 10, 11). After the reads above, the entry at index 10 holds tag 101010 and data words 35, 24, 17, 25. The 10-bit physical address is still split as Tag (6 bits) | Index (2 bits) | Offset (2 bits).]
Therefore, if we get this pattern of accesses when we start a new program:
1.) 101010 10 00
2.) 101010 10 01
3.) 101010 10 10
4.) 101010 10 11
After we do the read for 101010 10 00 (word 1), we will automatically get the data for words 2, 3 and 4. What does this mean? Accesses (2), (3), and (4) ARE NOT COMPULSORY MISSES.
- What happens if we get an access to location 100011 10 11 (holding data 12)?
- The index bits tell us we need to look at cache block 10.
- So, we need to compare the tag of this address (100011) to the tag associated with the current entry in that cache block (101010).
- These DO NOT match. Therefore, the data associated with address 100011 10 11 IS NOT VALID. What we have here could be
- A compulsory miss
- (if this is the 1st time the data was accessed)
- A conflict miss
- (if the data for address 100011 10 11 was present, but kicked out by 101010 10 00, for example)
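A compact C sketch of the lookup logic this example walks through: a direct mapped cache with 4 entries and 4 words per block, and a 10-bit address split 6/2/2. This is hypothetical illustration code, not from the slides:

#include <stdint.h>

#define ENTRIES 4
#define WORDS_PER_BLOCK 4

struct line { int valid; uint32_t tag; uint32_t data[WORDS_PER_BLOCK]; };

static struct line cache[ENTRIES];

/* 10-bit address: tag (6 bits) | index (2 bits) | offset (2 bits) */
int lookup(uint32_t addr, uint32_t *word_out)
{
    uint32_t offset = addr & 0x3;
    uint32_t index  = (addr >> 2) & 0x3;
    uint32_t tag    = addr >> 4;

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data[offset];
        return 1;   /* hit */
    }
    return 0;       /* miss: compulsory if never loaded, conflict if it was evicted */
}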
31. Cache Example (continued)
[Figure: the same 4-entry direct mapped cache with 4 data words/block; this cache can hold 16 data words.]
What if we change the way our cache is laid out, but so that it still has 16 data words? One way we could do this would be as follows:
[Figure: a 2-entry direct mapped cache with 8 data words/block (offsets 000-111); each cache block entry still has V and D bits and a tag.]
- All of the following are true
- This cache still holds 16 words
- Our block size is bigger; therefore this should help with compulsory misses
- Our physical address will now be divided as follows: Tag (6 bits) | Index (1 bit) | Offset (3 bits)
- The number of cache blocks has DECREASED
- This will INCREASE the # of conflict misses
32. Cache Example (continued)
What if we get the same pattern of accesses we had before?
Pattern of accesses (note the different # of bits for offset and index now):
1.) 101010 1 000
2.) 101010 1 001
3.) 101010 1 010
4.) 101010 1 011
Note that there is now more data associated with a given cache block. However, now we have only 1 bit of index. Therefore, any address that comes along that has a tag that is different than 101010 and has 1 in the index position is going to result in a conflict miss.
33. Cache Example (continued)
But, we could also make our cache look like this:
[Figure: an 8-entry direct mapped cache with 2 data words/block, shown in its final state after the 4 accesses below.]
Again, let's assume we want to read the following data words:
Tag    Index  Offset  Address holds data (decimal)
101010 100    0       35
101010 100    1       24
101010 101    0       17
101010 101    1       25
Assuming that all of these accesses were occurring for the 1st time (and would occur sequentially), accesses (1) and (3) would result in compulsory misses, and accesses (2) and (4) would result in hits because of spatial locality.
There are now just 2 words associated with each cache block.
Note that by organizing a cache in this way, conflict misses will be reduced. There are now more entries in the cache that the 10-bit physical address can map to.
34. Cache Example (continued)
All of these caches hold exactly the same amount of data: 16 different word entries.
As a general rule of thumb, long and skinny caches help to reduce conflict misses, short and fat caches help to reduce compulsory misses, but a cross between the two is probably what will give you the best (i.e. lowest) overall miss rate.
But what about capacity misses?
35. Cache Example (continued)
- What's a capacity miss?
- The cache is only so big. We won't be able to store every block accessed in a program; we must swap them out!
- Can avoid capacity misses by making the cache bigger
Thus, to avoid capacity misses, we'd need to make our cache physically bigger, i.e. there are now 32 word entries in it instead of 16. FYI, this will change the way the physical address is divided. Given our original pattern of accesses, we'd have:
Pattern of accesses (note the smaller tag, bigger index):
1.) 10101 010 00  (35)
2.) 10101 010 01  (24)
3.) 10101 010 10  (17)
4.) 10101 010 11  (25)
[Figure: an 8-entry direct mapped cache with 4 data words/block (32 words total); the entry at index 010 holds tag 10101 and data words 35, 24, 17, 25.]
36. Examples Ended
37. Cache misses and the architect
- What can we do about the 3 kinds of cache misses?
- Compulsory, capacity, and conflict
- Can avoid conflict misses w/ a fully associative cache
- But fully associative caches mean expensive HW, possibly slower clock rates, and other bad stuff
- Can avoid capacity misses by making the cache bigger; small caches can lead to thrashing
- W/ thrashing, data moves between 2 levels of the memory hierarchy very frequently, which can really slow down perf.
- Larger blocks can mean fewer compulsory misses
- But can turn a capacity miss into a conflict miss!
38. Addressing Miss Rates
39. (1) Larger cache block size
- The easiest way to reduce miss rate is to increase cache block size
- This will help eliminate what kind of misses?
- Helps improve miss rate b/c of the principle of locality
- Temporal locality says that if something is accessed once, it'll probably be accessed again soon
- Spatial locality says that if something is accessed, something nearby it will probably be accessed
- Larger block sizes help with spatial locality
- Be careful though!
- Larger block sizes can increase miss penalty!
- Generally, larger blocks reduce the # of total blocks in the cache
40. Larger cache block size (graph comparison)
[Figure: miss rate vs. block size, with one curve per total cache size.] Why this trend? (Assuming total cache size stays constant for each curve.)
41. (1) Larger cache block size (example)
- Assume that to access the lower level of the memory hierarchy you
- Incur a 40 clock cycle overhead
- Get 16 bytes of data every 2 clock cycles
- i.e. get 16 bytes in 42 clock cycles, 32 in 44, etc.
- Using the data below, which block size has the minimum average memory access time?
[Table: miss rates for various block sizes and cache sizes.]
42. Larger cache block size (example continued)
- Recall that Average memory access time =
- Hit time + Miss rate x Miss penalty
- Assume a cache hit otherwise takes 1 clock cycle, independent of block size
- So, for a 16-byte block in a 1-KB cache
- Average memory access time
- = 1 + (15.05% x 42) = 7.321 clock cycles
- And for a 256-byte block in a 256-KB cache
- Average memory access time
- = 1 + (0.49% x 72) = 1.353 clock cycles
- The rest of the data is included on the next slide
43. Larger cache block size (example continued)
[Table: average memory access time, in clock cycles, for each block size / cache size combination; red entries are the lowest average time for a particular cache size.]
Note: all of these block sizes are common in processors today.
44. (1) Larger cache block sizes (wrap-up)
- We want to minimize cache miss rate & cache miss penalty at the same time!
- Selection of block size depends on the latency and bandwidth of lower-level memory
- High latency, high bandwidth encourage large block size
- The cache gets many more bytes per miss for a small increase in miss penalty
- Low latency, low bandwidth encourage small block size
- Twice the miss penalty of a small block may be close to the penalty of a block twice the size
- A larger # of small blocks may reduce conflict misses
45. (2) Higher associativity
- Higher associativity can improve cache miss rates
- Note that an 8-way set associative cache is essentially a fully associative cache
- Helps lead to the 2:1 cache rule of thumb
- It says
- A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
- But, diminishing returns set in sooner or later
- Greater associativity can cause increased hit time
46. (3) Victim caches
- 1st of all, what is a victim cache?
- A victim cache temporarily stores blocks that have been discarded from the main cache (it's usually not that big)
- 2nd of all, how does it help us?
- If there's a cache miss, instead of immediately going down to the next level of the memory hierarchy, we check the victim cache first
- If the entry is there, we swap the victim cache block with the actual cache block
- Research shows
- Victim caches with 1-5 entries help reduce conflict misses
- For a 4KB direct mapped cache, victim caches
- Removed 20% - 95% of conflict misses!
47. (3) Victim caches
[Figure: victim cache organization. The CPU address is compared (=?) against the main cache tags and, on a miss, against the victim cache tags; a victim-cache hit swaps the block back into the main cache. Data in/out flow between the CPU and the cache, and a write buffer sits between the cache and lower-level memory.]
48. (4) Pseudo-associative caches
- This technique should help provide
- The miss rate of set-associative caches
- The hit speed of direct mapped caches
- Also called a column associative cache
- Access proceeds normally as for a direct mapped cache
- But, on a miss, we look at another entry before going to a lower level of the memory hierarchy
- Usually done by
- Inverting the most significant bit of the index field to find the other block in the "pseudo-set"
- Pseudo-associative caches usually have 1 fast and 1 slow hit time (regular and pseudo hit, respectively)
- In addition to the miss penalty, that is
49. (5) Hardware prefetching
- This one should intuitively be pretty obvious
- Try and fetch blocks before they're even requested
- This could work with both instructions and data
- Usually, prefetched blocks are placed either
- Directly in the cache (what's a down side to this?)
- Or in some external buffer that's usually a small, fast cache
- Let's look at an example (the Alpha AXP 21064)
- On a cache miss, it fetches 2 blocks
- One is the new cache entry that's needed
- The other is the next consecutive block; it goes in a buffer
- How well does this buffer perform?
- A single-entry buffer catches 15-25% of misses
- With a 4-entry buffer, the hit rate improves to about 50%
50. (5) Hardware prefetching example
- What is the effective miss rate for the Alpha using instruction prefetching?
- How much larger an instruction cache would the Alpha need to match the average access time if prefetching were removed?
- Assume
- It takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer
- The prefetch hit rate is 25%
- The miss rate for an 8-KB instruction cache is 1.10%
- Hit time is 2 clock cycles
- Miss penalty is 50 clock cycles
51. (5) Hardware prefetching example
- We need a revised memory access time formula
- Say: Average memory access time (prefetch) =
- Hit time + miss rate x prefetch hit rate x 1 + miss rate x (1 - prefetch hit rate) x miss penalty
- Plugging in numbers to the above, we get
- 2 + (1.10% x 25% x 1) + (1.10% x (1 - 25%) x 50) = 2.415
- To find the miss rate with equivalent performance, we start with the original formula and solve for miss rate
- Average memory access time (no prefetching) =
- Hit time + miss rate x miss penalty
- Results in (2.415 - 2) / 50 = 0.83%
- The calculation suggests the effective miss rate of prefetching with an 8KB cache is 0.83%
- Actual miss rates: 16KB = 0.64% and 8KB = 1.10%
52. (6) Compiler-controlled prefetching
- It's also possible for the compiler to tell the hardware that it should prefetch instructions or data
- It (the compiler) could have values loaded into registers, called register prefetching
- Or, the compiler could just have data loaded into the cache, called cache prefetching
- As you'll see, getting things from lower levels of memory can cause faults if the data is not there
- Ideally, we want prefetching to be invisible to the program, so often nonbinding/nonfaulting prefetching is used
- With a nonfaulting scheme, faulting instructions are turned into no-ops
- With a "faulting" scheme, data would be fetched (as normal)
53. (7) Compiler optimizations: merging arrays
- This works by improving spatial locality
- For example, some programs may reference multiple arrays of the same size at the same time
- Could be bad
- Accesses may interfere with one another in the cache
- A solution: generate a single, compound array

/* Before */
int tag[SIZE];
int byte1[SIZE];
int byte2[SIZE];
int dirty[SIZE];

/* After */
struct merge {
    int tag;
    int byte1;
    int byte2;
    int dirty;
};
struct merge cache_block_entry[SIZE];
54. (7) Compiler optimizations: loop interchange
- Some programs have nested loops that access memory in non-sequential order
- Simply changing the order of the loops may make them access the data in sequential order
- What's an example of this?

/* Before */
for (j = 0; j < 100; j = j + 1)
    for (k = 0; k < 5000; k = k + 1)
        x[k][j] = 2 * x[k][j];

But who really writes loops like this???

/* After */
for (k = 0; k < 5000; k = k + 1)
    for (j = 0; j < 100; j = j + 1)
        x[k][j] = 2 * x[k][j];
55. (7) Compiler optimizations: loop fusion
- This one's pretty obvious once you hear what it is
- Seeks to take advantage of
- Programs that have separate sections of code that access the same arrays in different loops
- Especially when the loops use common data
- The idea is to "fuse" the loops into one common loop
- What's the target of this optimization?
- Example

/* Before */
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1)
        a[j][k] = 1/b[j][k] * c[j][k];
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1)
        d[j][k] = a[j][k] + c[j][k];

/* After */
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1) {
        a[j][k] = 1/b[j][k] * c[j][k];
        d[j][k] = a[j][k] + c[j][k];
    }
56. (7) Compiler optimizations: blocking
- This is probably the most famous of the compiler optimizations to improve cache performance
- Tries to reduce misses by improving temporal locality
- Before we go through a blocking example, we're first going to introduce some terms
- (And I'm going to be perfectly honest here, I never got this concept completely until I worked through an example)
- (And not just in class either... you actually have to look at some code somewhat painstakingly on your own!)
- Also, keep in mind that this is used mainly with arrays!
- So... bear with me, and now onto some definitions!
57. (7) Compiler optimizations: blocking (definitions)
- 1st of all, we need to realize that arrays can be accessed/indexed differently
- Some arrays are accessed by rows, others by columns
- Storing array data row-by-row is called row major order
- Storing array data column-by-column is called column major order
- In some code this won't help b/c array data is going to be accessed both by rows and by columns!
- Things like loop interchange don't help
- Blocking tries to create submatrices or "blocks" to maximize accesses to data loaded in the cache before it's replaced.
58. (7) Compiler optimizations: blocking (example preview)

/* Some matrix multiply code */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

The 2 inner loops read all N x N elements of z, access the same N elements in a row of y repeatedly, and write one row of N elements of x.
[Figure: access patterns for x (indexed by i, j), y (indexed by i, k), and z (indexed by k, j) on 6x6 matrices. White block: not accessed; light block: older access; dark block: newer access.]
59. (7) Compiler optimizations: blocking (some comments)
- In the matrix multiply code, the # of capacity misses is going to depend upon
- The factor N (i.e. the sizes of the matrices)
- The size of the cache
- Some possible outcomes
- The cache can hold all N x N matrices (great!)
- Provided there are no conflict misses
- The cache can hold 1 N x N matrix and one row of size N
- Maybe the ith row of y and the matrix z may stay in the cache
- The cache can't hold even this much
- Misses will occur for both x and z
- In the worst case there will be 2N^3 + N^2 memory reads for N^3 memory operations!
60. (7) Compiler optimizations: blocking (example continued)

/* Blocked matrix multiply code */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

To ensure that the elements accessed will all fit/stay in the cache, the code is changed to operate on submatrices of size B x B. The 2 inner loops compute in steps of size B instead of going from the beginning to the end of x and z. B is called the blocking factor.
[Figure: access patterns for x, y, and z with blocking. A smaller # of elements is accessed, but they're all in the cache!]
61. (7) Compiler optimizations: blocking (example conclusion)
- What might happen with regard to capacity misses?
- The total # of memory words accessed is 2N^3/B + N^2
- This is an improvement by a factor of B
- Blocking thus exploits a combination of spatial and temporal locality
- The y matrix benefits from spatial locality and z benefits from temporal locality
- Usually, blocking is aimed at reducing capacity misses
- It assumes that conflict misses are not significant, or
- can be eliminated by more associative caches
- Blocking reduces the # of words active in the cache at 1 point in time; therefore a small block size helps with conflicts
62. Addressing Miss Penalties
63. Cache miss penalties
- Recall the equation for average memory access time
- Hit time + Miss Rate x Miss Penalty
- We talked about lots of ways to improve the miss rates of caches in the previous slides
- But, just by looking at the formula we can see
- Improving miss penalty will work just as well!
- Remember that technology trends have made processor speeds much faster than memory/DRAM speeds
- The relative cost of miss penalties has increased over time!
64. (1) Give priority to read misses over writes
- Reads are the common case; make them fast!
- Write buffers helped us with cache writes, but...
- They complicate memory accesses b/c they might hold the updated value of a location on a read miss
- Example

SW 512(R0), R3   ; M[512] <- R3   (cache index 0)
LW R1, 1024(R0)  ; R1 <- M[1024]  (cache index 0)
LW R2, 512(R0)   ; R2 <- M[512]   (cache index 0)

- Assume a direct mapped, write through cache
- (512 and 1024 map to the same location)
- Assume a 4-word write buffer
- Will the value in R2 always be equal to the value in R3?
65. (1) Giving priority to read misses over writes
- Example continued
- This code generates a RAW hazard in memory
- A cache access might work as follows
- The data in R3 is placed into the write buffer after the store
- The next load uses the same cache index, so we get a miss
- (i.e. b/c the store's block is there)
- The next load tries to put the value in location 512 into R2
- This also results in a cache miss
- (i.e. b/c 512 has been updated)
- If the write buffer hasn't finished writing to location 512, reading location 512 will put the wrong, old value into the cache block and then into R2
- R3 would not be equal to R2, which is a bad thing!
66. (1) Giving priority to read misses over writes
- 1 solution to this problem is to handle read misses only if the write buffer is empty
- (Causes quite a performance hit however!)
- An alternative is to check the contents of the write buffer on a read miss
- If there are no conflicts & the memory system is available, let the read miss continue
- Can also reduce the cost of writes within a processor with a write-back cache
- What if a read miss should replace a dirty memory block?
- Could write to memory, then read memory
- Or copy the dirty block to a buffer, read memory, then write memory; this lets the CPU not wait
67. (2) Sub-block placement for reduced miss penalty
- Instead of replacing a whole complete block of a cache, we only replace one of its sub-blocks
- Note: we'll have to make a hardware change to do this. What is it???
- Sub-blocks should have a smaller miss penalty than full blocks
68. (3) Early restart and critical word 1st
- With this strategy we're going to be impatient
- As soon as some of the block is loaded, see if the data is there and send it to the CPU
- (i.e. we don't wait for the whole block to be loaded)
- Recall that the Alpha's cache took 2 cycles to transfer all of the data needed
- but the data word needed could come in the first cycle
- There are 2 general strategies
- Early restart
- As soon as the word gets to the cache, send it to the CPU
- Critical word first
- Specifically ask for the needed word 1st, make sure it gets to the CPU, then get the rest of the cache block data
69. (4) Nonblocking caches to reduce stalls on cache misses
- These might be most useful with a Tomasulo or scoreboard implementation. Why?
- The CPU could still fetch instructions and start them on a cache data miss
- A nonblocking cache allows a cache (especially a data cache) to supply cache hits during a miss
- This scheme is often called "hit under miss"
- Other variants of this are
- "hit under multiple miss"
- "miss under miss"
- Which are only useful if the memory system can handle multiple misses
- These will all greatly complicate your cache hardware!
70. (5) 2nd-level caches
- The 1st 4 techniques discussed all impact the CPU
- This technique focuses on the cache/main memory interface
- The processor/memory performance gap makes us consider
- Whether we should make caches faster to keep pace with CPUs
- Whether we should make caches larger to overcome the widening gap between CPU and main memory
- One solution is to do both
- Add another level of cache (L2) between the 1st level cache (L1) and main memory
- Ideally L1 will be fast enough to match the speed of the CPU, while L2 will be large enough to reduce the penalty of going to main memory
71. (5) Second-level caches
- This will of course introduce a new definition for average memory access time
- Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
- Where Miss Penalty(L1) =
- Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
- So the 2nd-level miss rate is measured on the 1st-level cache's misses
- A few definitions to avoid confusion
- Local miss rate
- # of misses in the cache divided by the total # of memory accesses to this cache; specifically Miss Rate(L2)
- Global miss rate
- # of misses in the cache divided by the total # of memory accesses generated by the CPU; specifically Miss Rate(L1) x Miss Rate(L2)
72. (5) Second-level caches
- Example
- In 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates?
- Miss Rate L1 (local or global) = 40/1000 = 4%
- Miss Rate L2 (local) = 20/40 = 50%
- Miss Rate L2 (global) = 20/1000 = 2%
- Note that the global miss rate is very similar to the single-cache miss rate of the L2 cache
- (if the L2 size >> L1 size)
- The local miss rate is not a good measure of secondary caches; it's a function of the L1 miss rate
- Which can vary by changing the L1 cache
- Use the global cache miss rate when evaluating 2nd-level caches!
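A small C sketch combining this example with the two-level AMAT formula from the previous slide. The miss counts come from the example; the hit times and L2 miss penalty below are assumed for illustration only:

#include <stdio.h>

int main(void)
{
    /* numbers from the example */
    double refs = 1000, l1_misses = 40, l2_misses = 20;
    double l1_miss_rate        = l1_misses / refs;       /* 4%  (local = global for L1) */
    double l2_local_miss_rate  = l2_misses / l1_misses;  /* 50% */
    double l2_global_miss_rate = l2_misses / refs;       /* 2%  */

    /* assumed timing parameters */
    double l1_hit = 1, l2_hit = 10, l2_miss_penalty = 100;

    /* AMAT = HitTime(L1) + MissRate(L1) x (HitTime(L2) + MissRate(L2,local) x MissPenalty(L2)) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty);

    printf("L2 local = %.0f%%, L2 global = %.0f%%, AMAT = %.1f cycles\n",
           l2_local_miss_rate * 100, l2_global_miss_rate * 100, amat);
    return 0;
}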
73. (5) Second-level caches (some odds and ends comments)
- The speed of the L1 cache will affect the clock rate of the CPU, while the speed of the L2 cache will affect only the miss penalty of the L1 cache
- Which of course could affect the CPU in various ways
- 2 big things to consider when designing the L2 cache are
- Will the L2 cache lower the average memory access time portion of the CPI?
- If so, how much will it cost?
- In terms of HW, etc.
- 2nd-level caches are usually BIG!
- Usually L1 is a subset of L2
- Should have few capacity misses in the L2 cache
- Only worry about compulsory and conflict misses for optimizations
74. (5) Second-level caches (example)
- Given the following data
- 2-way set associativity increases hit time by 10% of a CPU clock cycle
- Hit time for the L2 direct mapped cache is 10 clock cycles
- Local miss rate for the L2 direct mapped cache is 25%
- Local miss rate for the L2 2-way set associative cache is 20%
- Miss penalty for the L2 cache is 50 clock cycles
- What is the impact of using a 2-way set associative cache on our miss penalty?
75. (5) Second-level caches (example)
- Miss penalty (direct mapped L2)
- = 10 + 25% x 50 = 22.5 clock cycles
- Adding the cost of associativity increases the hit cost by only 0.1 clock cycles
- Thus, Miss penalty (2-way set associative L2)
- = 10.1 + 20% x 50 = 20.1 clock cycles
- However, we can't have a fraction for a number of clock cycles (i.e. 10.1 ain't possible!)
- We'll either need to round up to 11 or optimize some more to get it down to 10. So:
- 10 + 20% x 50 = 20.0 clock cycles, or
- 11 + 20% x 50 = 21.0 clock cycles (both better than 22.5)
76. (5) Second-level caches (some final random comments)
- We can reduce the miss penalty by reducing the miss rate of the 2nd-level caches using techniques previously discussed
- i.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd-level hit time
- And much of the average access time is due to misses in the L2 cache
- Could also reduce misses by increasing the L2 block size
- Need to think about something called the "multilevel inclusion property"
- In other words, all data in the L1 cache is always in L2
- Gets complex for writes, and whatnot
77. Addressing Hit Time
78. Reducing the hit time
- Again, recall our average memory access time equation
- Hit time + Miss Rate x Miss Penalty
- We've talked about reducing the Miss Rate and the Miss Penalty; Hit time can also be a big component
- On many machines cache accesses can affect the clock cycle time, so making this small is a good thing!
- We'll talk about a few ways next
79. (1) Small and simple caches
- Why is this good?
- Generally, smaller hardware is faster, so a small cache should help the hit time
- If an L1 cache is small enough, it should fit on the same chip as the actual processing logic
- The processor avoids the time of going off chip!
- Some designs compromise and keep tags on chip and data off chip; this allows a fast tag check and >> memory capacity
- Direct mapping also falls under the category of "simple"
- It relates to the point above as well: you can check the tag and read the data at the same time!
80. (2) Avoid address translation during cache indexing
- This problem centers around virtual addresses. Should we send the virtual address to the cache?
- In other words, we have Virtual caches vs. Physical caches
- Why is this a problem anyhow?
- Well, recall from OS that a processor usually deals with processes
- What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
- The data in the cache would be totally different! This is called aliasing
- Every time a process is switched, logically we'd have to flush the cache or we'd get false hits
- Cost: time to flush + "compulsory" misses from an empty cache
- I/O must interact with caches, so we need virtual addresses
81. (2) Avoiding address translation during cache indexing
- Solutions to aliases
- HW that guarantees that every cache block has a unique physical address
- SW guarantee: the lower n bits must have the same address
- As long as they cover the index field and the cache is direct mapped, they must be unique; this is called page coloring
- Solution to cache flush
- Add a PID (process identifier) tag
- The PID identifies the process as well as an address within the process
- So, we can't get a hit if we get the wrong process!
82. Specific Example 1
83. A cache example
- We want to compare the following
- A 16-KB data cache & a 16-KB instruction cache versus a 32-KB unified cache
- Assume a hit takes 1 clock cycle to process
- Miss penalty = 50 clock cycles
- In the unified cache, a load or store hit takes 1 extra clock cycle, b/c having only 1 cache port is a structural hazard
- 75% of accesses are instruction references
- What's the avg. memory access time in each case?
[Table: miss rates for the split instruction/data caches and the unified cache.]
84. A cache example continued
- 1st, we need to determine the overall miss rate for the split caches
- (75% x 0.64%) + (25% x 6.47%) = 2.10%
- This compares to the unified cache miss rate of 1.99%
- We'll use the average memory access time formula from a few slides ago, but break it up into instruction & data references
- Average memory access time, split cache
- = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
- = (75% x 1.32) + (25% x 4.235) = 2.05 cycles
- Average memory access time, unified cache
- = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
- = (75% x 1.995) + (25% x 2.995) = 2.24 cycles
- Despite the higher miss rate, access time is faster for the split cache!
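A small C check of this comparison, with the miss rates and penalties copied from the example:

#include <stdio.h>

int main(void)
{
    double miss_penalty = 50, inst_frac = 0.75, data_frac = 0.25;
    double inst_mr = 0.0064, data_mr = 0.0647, unified_mr = 0.0199;

    /* split caches: each reference type sees its own cache's miss rate */
    double amat_split = inst_frac * (1 + inst_mr * miss_penalty)
                      + data_frac * (1 + data_mr * miss_penalty);

    /* unified cache: data references pay 1 extra cycle for the structural hazard */
    double amat_unified = inst_frac * (1 + unified_mr * miss_penalty)
                        + data_frac * (1 + 1 + unified_mr * miss_penalty);

    printf("split = %.2f cycles, unified = %.2f cycles\n", amat_split, amat_unified);
    return 0;
}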
85. Virtual Memory
86. The Full Memory Hierarchy: "always reuse a good idea"
[Same memory hierarchy figure as slide 9: registers, cache, main memory, disk, and tape, with their capacities, access times, costs, managing agents, and transfer units, from the fast/small upper level down to the slow/large lower level.]
87. Virtual Memory
- Some facts of computer life
- Computers run lots of processes simultaneously
- There isn't a full address space of memory for each process
- Must share smaller amounts of physical memory among many processes
- Virtual memory is the answer!
- Divides physical memory into blocks, assigns them to different processes
The compiler assigns data to a virtual address. The VA is translated to a real/physical address somewhere in memory (this allows any program to run anywhere; where is determined by the particular machine and OS)
88. What's the right picture?
[Figure: a process's logical (virtual) address space mapped onto the physical address space.]
89. The gist of virtual memory
- Relieves the problem of making a program that was too large to fit in physical memory... well... fit!
- Allows a program to run in any location in physical memory
- (called relocation)
- Really useful, as you might want to run the same program on lots of machines
The logical program is in contiguous VA space; here, it consists of 4 pages: A, B, C, D. The physical locations of the pages: 3 are in main memory and 1 is located on the disk
90. Some definitions and cache comparisons
- The bad news
- In order to understand exactly how virtual memory works, we need to define some terms
- The good news
- Virtual memory is very similar to a cache structure
- So, some definitions/analogies
- A "page" or "segment" of memory is analogous to a "block" in a cache
- A "page fault" or "address fault" is analogous to a cache miss
- so, if we go to main (real/physical) memory and our data isn't there, we need to get it from disk
91. More definitions and cache comparisons
- These are more definitions than analogies
- With VM, the CPU produces "virtual addresses" that are translated by a combination of HW/SW to "physical addresses"
- The physical addresses access main memory
- The process described above is called "memory mapping" or "address translation"
92. More definitions and cache comparisons
- Back to cache comparisons
93. Even more definitions and comparisons
- Replacement policy
- Replacement on cache misses is primarily controlled by hardware
- Replacement with VM (i.e. which page do I replace?) is usually controlled by the OS
- B/c of the bigger miss penalty, we want to make the right choice
- Sizes
- The size of the processor address determines the size of VM
- Cache size is independent of the processor address size
94. Virtual Memory
- Timing's tough with virtual memory
- AMAT = Tmem + (1-h) x Tdisk
- = 100 nS + (1-h) x 25,000,000 nS
- h (hit rate) has to be incredibly (almost unattainably) close to perfect to work
- so VM is a "cache", but an odd one.
95. Pages
96. Paging Hardware
[Figure: the CPU issues a 32-bit virtual address split into (page, offset); the page number indexes the page table, which supplies a frame number; (frame, offset) then forms the 32-bit address into physical memory.]
97. Paging Hardware
[Figure: same paging hardware as the previous slide.]
How big is a page? How big is the page table?
98. Address Translation in a Paging System
99. How big is a page table?
- Suppose
- 32-bit architecture
- Page size = 4 kilobytes
- Therefore
- Offset = 12 bits (2^12-byte pages)
- Page Number = 20 bits (2^20 page table entries)
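A minimal C sketch of the translation these slides describe, assuming 4KB pages on a 32-bit address (20-bit virtual page number, 12-bit offset); the page_table array here is just an illustrative stand-in for the real structure:

#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)           /* 4 KB pages               */
#define NUM_PAGES (1u << (32 - PAGE_BITS))    /* 2^20 page table entries  */

static uint32_t page_table[NUM_PAGES];        /* VPN -> frame number      */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;         /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* offset within page  */
    uint32_t frame  = page_table[vpn];            /* page table lookup   */
    return (frame << PAGE_BITS) | offset;         /* physical address    */
}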
100. Test Yourself
- A processor asks for the contents of virtual memory address 0x10020. The paging scheme in use breaks this into a VPN of 0x10 and an offset of 0x020.
- PTR (a CPU register that holds the address of the page table) has a value of 0x100, indicating that this process's page table starts at location 0x100.
- The machine uses word addressing and the page table entries are each one word long.
101. Test Yourself
- ADDR CONTENTS
- 0x00000 0x00000
- 0x00100 0x00010
- 0x00110 0x00022
- 0x00120 0x00045
- 0x00130 0x00078
- 0x00145 0x00010
- 0x10000 0x03333
- 0x10020 0x04444
- 0x22000 0x01111
- 0x22020 0x02222
- 0x45000 0x05555
- 0x45020 0x06666
- What is the physical address calculated?
- 10020
- 22020
- 45000
- 45020
- none of the above
102. Test Yourself
- ADDR CONTENTS
- 0x00000 0x00000
- 0x00100 0x00010
- 0x00110 0x00022
- 0x00120 0x00045
- 0x00130 0x00078
- 0x00145 0x00010
- 0x10000 0x03333
- 0x10020 0x04444
- 0x22000 0x01111
- 0x22020 0x02222
- 0x45000 0x05555
- 0x45020 0x06666
- What is the physical address calculated?
- What is the contents of this address returned to
the processor? - How many memory accesses in total were required
to obtain the contents of the desired address?
103. Another Example
Logical memory (16 words, 4 words per page): 0:a 1:b 2:c 3:d | 4:e 5:f 6:g 7:h | 8:i 9:j 10:k 11:l | 12:m 13:n 14:o 15:p
Page Table: page 0 -> frame 5, page 1 -> frame 6, page 2 -> frame 1, page 3 -> frame 2
Physical memory (32 words, 4 words per frame): words 4-7 hold i j k l (frame 1), words 8-11 hold m n o p (frame 2), words 20-23 hold a b c d (frame 5), words 24-27 hold e f g h (frame 6); the remaining words are empty
104. Replacement policies
105. Block replacement
- Which block should be replaced on a virtual memory miss?
- Again, we'll stick with the strategy that it's a good thing to eliminate page faults
- Therefore, we want to replace the LRU block
- Many machines use a "use" or "reference" bit
- Periodically reset
- Gives the OS an estimate of which pages are being referenced
106. Writing a block
- What happens on a write?
- We don't even want to think about a write through policy!
- The time with accesses, VM, hard disk, etc. is so great that this is not practical
- Instead, a write back policy is used, with a dirty bit to tell if a block has been written