Title: CS 4290/6290 Lecture 11: Memory Hierarchies
1. CS 4290/6290 Lecture 11: Memory Hierarchies
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, Michael Niemier, and Milos Prvulovic)
2. Memory and Pipelining
- In our 5-stage pipe, we've constantly been assuming that we can access our operand from memory in 1 clock cycle
- This is possible, but it's complicated
- We'll discuss how this happens in the next several lectures
- (see board for discussion)
- We'll talk about
- Memory Technology
- Memory Hierarchy
- Caches
- Memory
- Virtual Memory
3. Memory Technology
- Memory comes in many flavors
- SRAM (Static Random Access Memory)
- DRAM (Dynamic Random Access Memory)
- ROM, EPROM, EEPROM, Flash, etc.
- Disks, tapes, etc.
- These differ in speed, price and size
- Fast is small and/or expensive
- Large is slow and/or cheap
4. Is there a problem with DRAM?
[Figure: Processor-DRAM memory gap (latency). Performance (log scale) vs. time, 1980-2000. CPU performance ("Moore's Law") grows about 60%/yr (2X/1.5 yr); DRAM performance grows about 9%/yr (2X/10 yrs), so the gap widens over time.]
5. Why Not Only DRAM?
- Not large enough for some things
- Backed up by storage (disk)
- Virtual memory, paging, etc.
- Will get back to this
- Not fast enough for processor accesses
- Takes hundreds of cycles to return data
- OK in very regular applications
- Can use SW pipelining, vectors
- Not OK in most other applications
6. The principle of locality
- ...says that most programs don't access all code or data uniformly
- i.e. in a loop, a small subset of instructions might be executed over and over again
- a block of memory addresses might be accessed sequentially
- This has led to memory hierarchies
- Some important things to note
- Fast memory is expensive
- Levels of memory are usually smaller/faster than the previous level
- Levels of memory usually subset one another
- All the stuff in a higher level is in some level below it
7. Terminology Summary
- Hit: data appears in a block in the upper level (i.e. block X in the cache)
- Hit Rate: fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of
- RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: extra time to replace a block in the upper level
- + time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on the 21264)
8. Average Memory Access Time
AMAT = Hit Time + (1 - h) x Miss Penalty
- Hit time: basic time of every access
- Hit rate (h): fraction of accesses that hit
- Miss penalty: extra time to fetch a block from a lower level, including time to replace it in the CPU
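To make the formula concrete, here is a minimal C sketch of the AMAT calculation; the numbers passed in (1-cycle hit, 95% hit rate, 50-cycle miss penalty) are assumptions for illustration, not values from the slides:

#include <stdio.h>

/* AMAT = hit_time + (1 - hit_rate) * miss_penalty, with all times in cycles */
static double amat(double hit_time, double hit_rate, double miss_penalty)
{
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main(void)
{
    /* assumed example numbers: 1-cycle hit, 95% hit rate, 50-cycle miss penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.95, 50.0));
    return 0;
}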
9. The Full Memory Hierarchy: "always reuse a good idea"
Each level is listed with its capacity, access time, and cost, plus who manages the staging/transfer and the transfer unit; upper levels are smaller and faster, lower levels are larger and slower.
- CPU Registers: 100s of bytes, <10s of ns; managed by the program/compiler; transfer unit 1-8 bytes (instruction operands)
- Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; managed by the cache controller; transfer unit 8-128 bytes (blocks)
- Main Memory: M bytes, 200-500 ns, 0.0001-0.00001 cents/bit; managed by the OS; transfer unit 4K-16K bytes (pages)
- Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; managed by the user/operator; transfer unit Mbytes (files)
- Tape: "infinite" capacity, sec-min access time, 10^-8 cents/bit (the lowest level)
10. A brief description of a cache
- Cache: the next level of the memory hierarchy up from the register file
- All values in the register file should be in the cache
- Cache entries are usually referred to as "blocks"
- A block is the minimum amount of information that can be in the cache
- If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss
- Cache miss rate: fraction of accesses not in the cache
- Miss penalty: # of clock cycles required b/c of the miss

Mem. stall cycles = Inst. count x Mem. refs/inst. x Miss rate x Miss penalty
11. Cache Basics
- Fast (but small) memory close to the processor
- When data is referenced
- If in the cache, use the cache instead of memory
- If not in the cache, bring it into the cache (actually, bring the entire block of data, too)
- Maybe have to kick something else out to do it!
- Important decisions
- Placement: where in the cache can a block go?
- Identification: how do we find a block in the cache?
- Replacement: what to kick out to make room in the cache?
- Write policy: what do we do about writes?
12. Cache Basics
- Cache consists of block-sized lines
- Line size is typically a power of two
- Typically 16 to 128 bytes in size
- Example
- Suppose the block size is 128 bytes
- Lowest seven bits determine the offset within the block
- Read data at address A = 0x7fffa3f4
- The address belongs to the block with base address 0x7fffa380
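A minimal C sketch of this address arithmetic, assuming the 128-byte block from the example (so a 7-bit offset):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr   = 0x7fffa3f4;               /* address from the example above              */
    uint32_t offset = addr & 0x7f;              /* low 7 bits: offset within the 128-byte block */
    uint32_t base   = addr & ~(uint32_t)0x7f;   /* clear the offset bits to get the block base  */
    printf("offset = 0x%02x, block base = 0x%08x\n", (unsigned)offset, (unsigned)base); /* 0x74, 0x7fffa380 */
    return 0;
}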
13. Some initial questions to consider
- Where can a block be placed in an upper level of the memory hierarchy (i.e. a cache)?
- How is a block found in an upper level of the memory hierarchy?
- Which cache block should be replaced on a cache miss if the entire cache is full and we want to bring in new data?
- What happens if you want to write back to a memory location?
- Do you just write to the cache?
- Do you write somewhere else?
(See board for discussion)
14. Where can a block be placed in a cache?
- 3 schemes for block placement in a cache
- Direct mapped cache
- A block (or data to be stored) can go to only 1 place in the cache
- Usually (Block address) MOD (# of blocks in the cache)
- Fully associative cache
- A block can be placed anywhere in the cache
- Set associative cache
- Set: a group of blocks in the cache
- A block is mapped onto a set; then the block can be placed anywhere within that set
- Usually (Block address) MOD (# of sets in the cache)
- If there are n blocks per set, we call it n-way set associative
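A small C sketch (illustrative, not from the slides) of the two MOD mappings, using the block-12 numbers from the figure on the next slide:

#include <stdio.h>

#define NUM_BLOCKS 8   /* direct mapped: 8 cache blocks, as in the next slide's figure */
#define NUM_SETS   4   /* set associative: 4 sets (2-way in an 8-block cache)          */

int main(void)
{
    unsigned block_addr = 12;  /* memory block 12, as in the next slide's example */
    printf("direct mapped   -> cache block %u\n", block_addr % NUM_BLOCKS); /* 12 mod 8 = 4 */
    printf("set associative -> set %u\n",         block_addr % NUM_SETS);   /* 12 mod 4 = 0 */
    return 0;
}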
15. Where can a block be placed in a cache?
[Figure: an 8-block cache shown three ways. Fully associative: memory block 12 can go anywhere. Direct mapped: block 12 can go only into cache block 4 (12 mod 8). Set associative (4 sets, Set 0 - Set 3): block 12 can go anywhere in set 0 (12 mod 4).]
16. Associativity
- If you have associativity > 1 you have to have a replacement policy
- FIFO
- LRU
- Random
- "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
- Virtual memory is effectively fully associative
- (But don't worry about virtual memory yet)
17. How is a block found in the cache?
- Caches have an address tag on each block frame that provides the block address
- The tag of every cache block that might have the entry is examined against the CPU address (in parallel! why?)
- Each entry usually has a valid bit
- Tells us if the cache data is useful/not garbage
- If the bit is not set, there can't be a match
- How does the address provided by the CPU relate to the entry in the cache?
- The entry is divided between block address & block offset
- ...and further divided between tag field & index field
(See board for explanation)
18. How is a block found in the cache?
[Address layout: Block Address (Tag | Index) | Block Offset]
- The block offset field selects data from the block
- (i.e. the address of the desired data within the block)
- The index field selects a specific set
- The tag field is compared against it for a hit
- Could we compare on more of the address than the tag?
- Not necessary: checking the index is redundant
- It is already used to select the set to be checked
- Ex.: an address stored in set 0 must have 0 in the index field
- The offset is not necessary in the comparison: the entire block is present or not, so all block offsets must match
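A minimal C sketch of this tag/index/offset decomposition. The field widths here (32-bit address, 128-byte blocks giving 7 offset bits, 64 sets giving 6 index bits) are assumptions chosen for illustration, not numbers from the slides:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 7   /* assumed 128-byte block */
#define INDEX_BITS  6   /* assumed 64 sets        */

int main(void)
{
    uint32_t addr   = 0x7fffa3f4;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag = 0x%x, index = %u, offset = %u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}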
19. Which block should be replaced on a cache miss?
- If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
- B/c the principle of locality says we'll probably use it again
- Direct mapped caches have 1 choice of what block to replace
- Fully associative or set associative caches offer more choices
- Usually 2 strategies
- Random: pick any possible block and replace it
- LRU: stands for Least Recently Used
- Why not throw out the block not used for the longest time?
- Usually approximated, and not much better than random, e.g. 5.18 vs. 5.69 for a 16KB 2-way set associative cache
(add to picture on board)
20. What happens on a write?
- FYI: most accesses to a cache are reads
- Used to fetch instructions (reads)
- Most instructions don't write to memory
- For DLX, only about 7% of memory traffic involves writes
- Translates to about 25% of cache data traffic
- Make the common case fast! Optimize the cache for reads!
- Actually pretty easy to do
- Can read the block while comparing/reading the tag
- The block read begins as soon as the address is available
- If it's a hit, the data is just passed right on to the CPU
- Writes take longer. Any idea why?
21. What happens on a write?
- Generically, there are 2 kinds of write policies
- Write through (or store through)
- With write through, information is written to the block in the cache and to the block in lower-level memory
- Write back (or copy back)
- With write back, information is written only to the cache. It will be written back to lower-level memory when the cache block is replaced
- The dirty bit
- Each cache entry usually has a bit that specifies whether a write has occurred in that block or not
- Helps reduce the frequency of writes to lower-level memory upon block replacement
(add to picture on board)
22. What happens on a write?
- Write back versus write through
- Write back advantageous because
- Writes occur at the speed of the cache and don't incur the delay of lower-level memory
- Multiple writes to a cache block result in only 1 lower-level memory access
- Write through advantageous because
- Lower levels of memory have the most recent copy of the data
- If the CPU has to wait for a write, we have a write stall
- 1 way around this is a write buffer
- Ideally, the CPU shouldn't have to stall during a write
- Instead, data is written to the buffer, which sends it to lower levels of the memory hierarchy
(add to picture on board)
23. LRU Example
- 4-way set associative
- Need 4 values (2 bits) for the counter
Initial state, shown as (LRU counter, block address) for each of the 4 ways:
(0, 0x00004000) (1, 0x00003800) (2, 0xffff8000) (3, 0x00cd0800)
Access 0xffff8004 (hit in 0xffff8000, which becomes most recently used):
(0, 0x00004000) (1, 0x00003800) (3, 0xffff8000) (2, 0x00cd0800)
Access 0x00003840 (hit in 0x00003800):
(0, 0x00004000) (3, 0x00003800) (2, 0xffff8000) (1, 0x00cd0800)
Access 0x00d00008 (miss): replace the entry with counter 0, then update the counters:
(3, 0x00d00000) (2, 0x00003800) (1, 0xffff8000) (0, 0x00cd0800)
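A rough C sketch of this counter-based true-LRU update for one 4-way set (illustrative code, not from the slides; names like lru_touch are hypothetical):

#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; unsigned counter; int valid; };

/* On a hit to way w: ways whose counter is above w's old value move down one,
   and w becomes the most recently used way (counter = WAYS-1). */
void lru_touch(struct way set[WAYS], int w)
{
    unsigned old = set[w].counter;
    for (int i = 0; i < WAYS; i++)
        if (set[i].counter > old)
            set[i].counter--;
    set[w].counter = WAYS - 1;
}

/* On a miss: the victim is the way with counter 0; refill it, then mark it MRU. */
int lru_replace(struct way set[WAYS], uint32_t new_tag)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++)
        if (set[i].counter == 0)
            victim = i;
    set[victim].tag = new_tag;
    set[victim].valid = 1;
    lru_touch(set, victim);
    return victim;
}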
24. Approximating LRU
- LRU is too complicated
- Access and possibly update all counters in a set on every access (not just on replacement)
- Need something simpler and faster
- But still close to LRU
- NMRU: Not Most Recently Used
- The entire set has one MRU pointer
- Points to the last-accessed line in the set
- Replacement: randomly select a non-MRU line
25. What happens on a write?
- What if we want to write and the block we want to write to isn't in the cache?
- There are 2 common policies
- Write allocate (or fetch on write)
- The block is loaded on a write miss
- The idea behind this is that subsequent writes will be captured by the cache (ideal for a write back cache)
- No-write allocate (or write around)
- The block is modified in lower-level memory and not loaded into the cache
- Usually used for write-through caches
- (subsequent writes still have to go to memory)
26. Memory access equations
- Using what we defined on the previous slides, we can say
- Memory stall clock cycles =
- Reads x Read miss rate x Read miss penalty
- + Writes x Write miss rate x Write miss penalty
- Often, reads and writes are combined/averaged
- Memory stall cycles =
- Memory accesses x Miss rate x Miss penalty (approximation)
- Also possible to factor in instruction count to get a complete formula
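A small worked sketch in C of the combined formula with the instruction count factored in; all of the numbers below are made up for illustration:

#include <stdio.h>

int main(void)
{
    double inst_count        = 1e6;   /* assumed instruction count                 */
    double accesses_per_inst = 1.5;   /* assumed memory references per instruction */
    double miss_rate         = 0.02;  /* assumed 2% miss rate                      */
    double miss_penalty      = 50.0;  /* assumed 50-cycle miss penalty             */

    /* Memory stall cycles = IC x (mem refs / inst) x miss rate x miss penalty */
    double stall_cycles = inst_count * accesses_per_inst * miss_rate * miss_penalty;
    printf("memory stall cycles = %.0f\n", stall_cycles);
    return 0;
}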
27. Reducing cache misses
- Obviously, we want data accesses to result in cache hits, not misses; this will optimize performance
- Start by looking at ways to increase the # of hits...
- ...but first look at the 3 kinds of misses!
- Compulsory misses
- The very 1st access to a cache block will not be a hit; the data's not there yet!
- Capacity misses
- The cache is only so big. It won't be able to store every block accessed in a program; some must be swapped out!
- Conflict misses
- Result from set-associative or direct mapped caches
- Blocks are discarded/retrieved if too many map to a location
28. Cache Examples
29. Cache Example
Physical Address (10 bits): Tag (6 bits) | Index (2 bits) | Offset (2 bits)
A 4-entry direct mapped cache with 4 data words/block.
Assume we want to read the following data words:
Tag    Index  Offset  Address holds data (decimal)
101010 10     00      35
101010 10     01      24
101010 10     10      17
101010 10     11      25
(1) If we read 101010 10 01 we want to bring data word 24 into the cache. Where would this data go? Well, the index is 10. Therefore, the data word will go somewhere into the 3rd block of the cache. (Make sure you understand the terminology.)
(2) More specifically, the data word would go into the 2nd position within the block, because the offset is 01.
(3) The principle of spatial locality says that if we use one data word, we'll probably use some data words that are close to it; that's why our block size is bigger than one data word. So we fill in the data word entries surrounding 101010 10 01 as well.
All of these physical addresses would have the same tag.
All of these physical addresses map to the same cache entry.
30. Cache Example (continued)
[Figure: the 4-entry direct mapped cache with 4 data words/block. Each entry (index 00, 01, 10, 11) has V and D bits, a tag, and 4 data words (offsets 00, 01, 10, 11). After the reads above, the entry at index 10 holds tag 101010 and data words 35, 24, 17, 25. The 10-bit physical address is still split as Tag (6 bits) | Index (2 bits) | Offset (2 bits).]
Therefore, if we get this pattern of accesses when we start a new program:
1.) 101010 10 00
2.) 101010 10 01
3.) 101010 10 10
4.) 101010 10 11
After we do the read for 101010 10 00 (word 1), we will automatically get the data for words 2, 3 and 4. What does this mean? Accesses (2), (3), and (4) ARE NOT COMPULSORY MISSES.
- What happens if we get an access to location 100011 10 11 (holding data 12)?
- The index bits tell us we need to look at cache block 10.
- So, we need to compare the tag of this address (100011) to the tag associated with the current entry in that cache block (101010).
- These DO NOT match. Therefore, the data associated with address 100011 10 11 IS NOT VALID. What we have here could be
- A compulsory miss
- (if this is the 1st time the data was accessed)
- A conflict miss
- (if the data for address 100011 10 11 was present, but kicked out by 101010 10 00, for example)
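A compact C sketch of the lookup logic this example walks through: a direct mapped cache with 4 entries and 4 words per block, and a 10-bit address split 6/2/2. This is hypothetical illustration code, not from the slides:

#include <stdint.h>

#define ENTRIES 4
#define WORDS_PER_BLOCK 4

struct line { int valid; uint32_t tag; uint32_t data[WORDS_PER_BLOCK]; };

static struct line cache[ENTRIES];

/* 10-bit address: tag (6 bits) | index (2 bits) | offset (2 bits) */
int lookup(uint32_t addr, uint32_t *word_out)
{
    uint32_t offset = addr & 0x3;
    uint32_t index  = (addr >> 2) & 0x3;
    uint32_t tag    = addr >> 4;

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data[offset];
        return 1;   /* hit */
    }
    return 0;       /* miss: compulsory if never loaded, conflict if it was evicted */
}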
31. Cache Example (continued)
[Figure: the same 4-entry direct mapped cache with 4 data words/block; this cache can hold 16 data words.]
What if we change the way our cache is laid out, but so that it still has 16 data words? One way we could do this would be as follows:
[Figure: a 2-entry direct mapped cache with 8 data words/block (offsets 000-111); each cache block entry still has V and D bits and a tag.]
- All of the following are true
- This cache still holds 16 words
- Our block size is bigger; therefore this should help with compulsory misses
- Our physical address will now be divided as follows: Tag (6 bits) | Index (1 bit) | Offset (3 bits)
- The number of cache blocks has DECREASED
- This will INCREASE the # of conflict misses
32. Cache Example (continued)
What if we get the same pattern of accesses we had before?
Pattern of accesses (note the different # of bits for offset and index now):
1.) 101010 1 000
2.) 101010 1 001
3.) 101010 1 010
4.) 101010 1 011
Note that there is now more data associated with a given cache block. However, now we have only 1 bit of index. Therefore, any address that comes along that has a tag that is different than 101010 and has 1 in the index position is going to result in a conflict miss.
33. Cache Example (continued)
But, we could also make our cache look like this:
[Figure: an 8-entry direct mapped cache with 2 data words/block, shown in its final state after the 4 accesses below.]
Again, let's assume we want to read the following data words:
Tag    Index  Offset  Address holds data (decimal)
101010 100    0       35
101010 100    1       24
101010 101    0       17
101010 101    1       25
Assuming that all of these accesses were occurring for the 1st time (and would occur sequentially), accesses (1) and (3) would result in compulsory misses, and accesses (2) and (4) would result in hits because of spatial locality.
There are now just 2 words associated with each cache block.
Note that by organizing a cache in this way, conflict misses will be reduced. There are now more entries in the cache that the 10-bit physical address can map to.
34. Cache Example (continued)
All of these caches hold exactly the same amount of data: 16 different word entries.
As a general rule of thumb, long and skinny caches help to reduce conflict misses, short and fat caches help to reduce compulsory misses, but a cross between the two is probably what will give you the best (i.e. lowest) overall miss rate.
But what about capacity misses?
35. Cache Example (continued)
- What's a capacity miss?
- The cache is only so big. We won't be able to store every block accessed in a program; we must swap them out!
- Can avoid capacity misses by making the cache bigger
Thus, to avoid capacity misses, we'd need to make our cache physically bigger, i.e. there are now 32 word entries in it instead of 16. FYI, this will change the way the physical address is divided. Given our original pattern of accesses, we'd have:
Pattern of accesses (note the smaller tag, bigger index):
1.) 10101 010 00  (35)
2.) 10101 010 01  (24)
3.) 10101 010 10  (17)
4.) 10101 010 11  (25)
[Figure: an 8-entry direct mapped cache with 4 data words/block (32 words total); the entry at index 010 holds tag 10101 and data words 35, 24, 17, 25.]
36. Examples Ended
37. Cache misses and the architect
- What can we do about the 3 kinds of cache misses?
- Compulsory, capacity, and conflict
- Can avoid conflict misses w/ a fully associative cache
- But fully associative caches mean expensive HW, possibly slower clock rates, and other bad stuff
- Can avoid capacity misses by making the cache bigger; small caches can lead to thrashing
- W/ thrashing, data moves between 2 levels of the memory hierarchy very frequently, which can really slow down perf.
- Larger blocks can mean fewer compulsory misses
- But can turn a capacity miss into a conflict miss!
38. Addressing Miss Rates
39. (1) Larger cache block size
- The easiest way to reduce miss rate is to increase cache block size
- This will help eliminate what kind of misses?
- Helps improve miss rate b/c of the principle of locality
- Temporal locality says that if something is accessed once, it'll probably be accessed again soon
- Spatial locality says that if something is accessed, something nearby it will probably be accessed
- Larger block sizes help with spatial locality
- Be careful though!
- Larger block sizes can increase miss penalty!
- Generally, larger blocks reduce the # of total blocks in the cache
40. Larger cache block size (graph comparison)
[Figure: miss rate vs. block size, with one curve per total cache size.] Why this trend? (Assuming total cache size stays constant for each curve.)
41. (1) Larger cache block size (example)
- Assume that to access the lower level of the memory hierarchy you
- Incur a 40 clock cycle overhead
- Get 16 bytes of data every 2 clock cycles
- i.e. get 16 bytes in 42 clock cycles, 32 in 44, etc.
- Using the data below, which block size has the minimum average memory access time?
[Table: miss rates for various block sizes and cache sizes.]
42. Larger cache block size (example continued)
- Recall that Average memory access time =
- Hit time + Miss rate x Miss penalty
- Assume a cache hit otherwise takes 1 clock cycle, independent of block size
- So, for a 16-byte block in a 1-KB cache
- Average memory access time
- = 1 + (15.05% x 42) = 7.321 clock cycles
- And for a 256-byte block in a 256-KB cache
- Average memory access time
- = 1 + (0.49% x 72) = 1.353 clock cycles
- The rest of the data is included on the next slide
43. Larger cache block size (example continued)
[Table: average memory access time, in clock cycles, for each block size / cache size combination; red entries are the lowest average time for a particular cache size.]
Note: all of these block sizes are common in processors today.
44. (1) Larger cache block sizes (wrap-up)
- We want to minimize cache miss rate & cache miss penalty at the same time!
- Selection of block size depends on the latency and bandwidth of lower-level memory
- High latency, high bandwidth encourage large block size
- The cache gets many more bytes per miss for a small increase in miss penalty
- Low latency, low bandwidth encourage small block size
- Twice the miss penalty of a small block may be close to the penalty of a block twice the size
- A larger # of small blocks may reduce conflict misses
45. (2) Higher associativity
- Higher associativity can improve cache miss rates
- Note that an 8-way set associative cache is essentially a fully associative cache
- Helps lead to the 2:1 cache rule of thumb
- It says
- A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
- But, diminishing returns set in sooner or later
- Greater associativity can cause increased hit time
46. (3) Victim caches
- 1st of all, what is a victim cache?
- A victim cache temporarily stores blocks that have been discarded from the main cache (it's usually not that big)
- 2nd of all, how does it help us?
- If there's a cache miss, instead of immediately going down to the next level of the memory hierarchy, we check the victim cache first
- If the entry is there, we swap the victim cache block with the actual cache block
- Research shows
- Victim caches with 1-5 entries help reduce conflict misses
- For a 4KB direct mapped cache, victim caches
- Removed 20% - 95% of conflict misses!
47. (3) Victim caches
[Figure: victim cache organization. The CPU address is compared (=?) against the main cache tags and, on a miss, against the victim cache tags; a victim-cache hit swaps the block back into the main cache. Data in/out flow between the CPU and the cache, and a write buffer sits between the cache and lower-level memory.]
48. (4) Pseudo-associative caches
- This technique should help provide
- The miss rate of set-associative caches
- The hit speed of direct mapped caches
- Also called a column associative cache
- Access proceeds normally as for a direct mapped cache
- But, on a miss, we look at another entry before going to a lower level of the memory hierarchy
- Usually done by
- Inverting the most significant bit of the index field to find the other block in the "pseudo-set"
- Pseudo-associative caches usually have 1 fast and 1 slow hit time (regular and pseudo hit, respectively)
- In addition to the miss penalty, that is
49. (5) Hardware prefetching
- This one should intuitively be pretty obvious
- Try and fetch blocks before they're even requested
- This could work with both instructions and data
- Usually, prefetched blocks are placed either
- Directly in the cache (what's a down side to this?)
- Or in some external buffer that's usually a small, fast cache
- Let's look at an example (the Alpha AXP 21064)
- On a cache miss, it fetches 2 blocks
- One is the new cache entry that's needed
- The other is the next consecutive block; it goes in a buffer
- How well does this buffer perform?
- A single-entry buffer catches 15-25% of misses
- With a 4-entry buffer, the hit rate improves to about 50%
50. (5) Hardware prefetching example
- What is the effective miss rate for the Alpha using instruction prefetching?
- How much larger an instruction cache would the Alpha need to match the average access time if prefetching were removed?
- Assume
- It takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer
- The prefetch hit rate is 25%
- The miss rate for an 8-KB instruction cache is 1.10%
- Hit time is 2 clock cycles
- Miss penalty is 50 clock cycles
51. (5) Hardware prefetching example
- We need a revised memory access time formula
- Say: Average memory access time (prefetch) =
- Hit time + miss rate x prefetch hit rate x 1 + miss rate x (1 - prefetch hit rate) x miss penalty
- Plugging in numbers to the above, we get
- 2 + (1.10% x 25% x 1) + (1.10% x (1 - 25%) x 50) = 2.415
- To find the miss rate with equivalent performance, we start with the original formula and solve for miss rate
- Average memory access time (no prefetching) =
- Hit time + miss rate x miss penalty
- Results in (2.415 - 2) / 50 = 0.83%
- The calculation suggests the effective miss rate of prefetching with an 8KB cache is 0.83%
- Actual miss rates: 16KB = 0.64% and 8KB = 1.10%
52. (6) Compiler-controlled prefetching
- It's also possible for the compiler to tell the hardware that it should prefetch instructions or data
- It (the compiler) could have values loaded into registers, called register prefetching
- Or, the compiler could just have data loaded into the cache, called cache prefetching
- As you'll see, getting things from lower levels of memory can cause faults if the data is not there
- Ideally, we want prefetching to be invisible to the program, so often nonbinding/nonfaulting prefetching is used
- With a nonfaulting scheme, faulting instructions are turned into no-ops
- With a "faulting" scheme, data would be fetched (as normal)
53. (7) Compiler optimizations: merging arrays
- This works by improving spatial locality
- For example, some programs may reference multiple arrays of the same size at the same time
- Could be bad
- Accesses may interfere with one another in the cache
- A solution: generate a single, compound array

/* Before */
int tag[SIZE];
int byte1[SIZE];
int byte2[SIZE];
int dirty[SIZE];

/* After */
struct merge {
    int tag;
    int byte1;
    int byte2;
    int dirty;
};
struct merge cache_block_entry[SIZE];
54. (7) Compiler optimizations: loop interchange
- Some programs have nested loops that access memory in non-sequential order
- Simply changing the order of the loops may make them access the data in sequential order
- What's an example of this?

/* Before */
for (j = 0; j < 100; j = j + 1)
    for (k = 0; k < 5000; k = k + 1)
        x[k][j] = 2 * x[k][j];

But who really writes loops like this???

/* After */
for (k = 0; k < 5000; k = k + 1)
    for (j = 0; j < 100; j = j + 1)
        x[k][j] = 2 * x[k][j];
55. (7) Compiler optimizations: loop fusion
- This one's pretty obvious once you hear what it is
- Seeks to take advantage of
- Programs that have separate sections of code that access the same arrays in different loops
- Especially when the loops use common data
- The idea is to "fuse" the loops into one common loop
- What's the target of this optimization?
- Example

/* Before */
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1)
        a[j][k] = 1/b[j][k] * c[j][k];
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1)
        d[j][k] = a[j][k] + c[j][k];

/* After */
for (j = 0; j < N; j = j + 1)
    for (k = 0; k < N; k = k + 1) {
        a[j][k] = 1/b[j][k] * c[j][k];
        d[j][k] = a[j][k] + c[j][k];
    }
56. (7) Compiler optimizations: blocking
- This is probably the most famous of the compiler optimizations to improve cache performance
- Tries to reduce misses by improving temporal locality
- Before we go through a blocking example, we're first going to introduce some terms
- (And I'm going to be perfectly honest here, I never got this concept completely until I worked through an example)
- (And not just in class either... you actually have to look at some code somewhat painstakingly on your own!)
- Also, keep in mind that this is used mainly with arrays!
- So... bear with me, and now onto some definitions!
57. (7) Compiler optimizations: blocking (definitions)
- 1st of all, we need to realize that arrays can be accessed/indexed differently
- Some arrays are accessed by rows, others by columns
- Storing array data row-by-row is called row major order
- Storing array data column-by-column is called column major order
- In some code this won't help b/c array data is going to be accessed both by rows and by columns!
- Things like loop interchange don't help
- Blocking tries to create submatrices or "blocks" to maximize accesses to data loaded in the cache before it's replaced.
58. (7) Compiler optimizations: blocking (example preview)

/* Some matrix multiply code */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

The 2 inner loops read all N x N elements of z, access the same N elements in a row of y repeatedly, and write one row of N elements of x.
[Figure: access patterns for x (indexed by i, j), y (indexed by i, k), and z (indexed by k, j) on 6x6 matrices. White block: not accessed; light block: older access; dark block: newer access.]
59. (7) Compiler optimizations: blocking (some comments)
- In the matrix multiply code, the # of capacity misses is going to depend upon
- The factor N (i.e. the sizes of the matrices)
- The size of the cache
- Some possible outcomes
- The cache can hold all N x N matrices (great!)
- Provided there are no conflict misses
- The cache can hold 1 N x N matrix and one row of size N
- Maybe the ith row of y and the matrix z may stay in the cache
- The cache can't hold even this much
- Misses will occur for both x and z
- In the worst case there will be 2N^3 + N^2 memory reads for N^3 memory operations!
60. (7) Compiler optimizations: blocking (example continued)

/* Blocked matrix multiply code */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

To ensure that the elements accessed will all fit/stay in the cache, the code is changed to operate on submatrices of size B x B. The 2 inner loops compute in steps of size B instead of going from the beginning to the end of x and z. B is called the blocking factor.
[Figure: access patterns for x, y, and z with blocking. A smaller # of elements is accessed, but they're all in the cache!]
61. (7) Compiler optimizations: blocking (example conclusion)
- What might happen with regard to capacity misses?
- The total # of memory words accessed is 2N^3/B + N^2
- This is an improvement by a factor of B
- Blocking thus exploits a combination of spatial and temporal locality
- The y matrix benefits from spatial locality and z benefits from temporal locality
- Usually, blocking is aimed at reducing capacity misses
- It assumes that conflict misses are not significant, or
- can be eliminated by more associative caches
- Blocking reduces the # of words active in the cache at 1 point in time; therefore a small block size helps with conflicts
62. Addressing Miss Penalties
63. Cache miss penalties
- Recall the equation for average memory access time
- Hit time + Miss Rate x Miss Penalty
- We talked about lots of ways to improve the miss rates of caches in the previous slides
- But, just by looking at the formula we can see
- Improving miss penalty will work just as well!
- Remember that technology trends have made processor speeds much faster than memory/DRAM speeds
- The relative cost of miss penalties has increased over time!
64. (1) Give priority to read misses over writes
- Reads are the common case; make them fast!
- Write buffers helped us with cache writes, but...
- They complicate memory accesses b/c they might hold the updated value of a location on a read miss
- Example

SW 512(R0), R3   ; M[512] <- R3   (cache index 0)
LW R1, 1024(R0)  ; R1 <- M[1024]  (cache index 0)
LW R2, 512(R0)   ; R2 <- M[512]   (cache index 0)

- Assume a direct mapped, write through cache
- (512 and 1024 map to the same location)
- Assume a 4-word write buffer
- Will the value in R2 always be equal to the value in R3?
65. (1) Giving priority to read misses over writes
- Example continued
- This code generates a RAW hazard in memory
- A cache access might work as follows
- The data in R3 is placed into the write buffer after the store
- The next load uses the same cache index, so we get a miss
- (i.e. b/c the store's block is there)
- The next load tries to put the value in location 512 into R2
- This also results in a cache miss
- (i.e. b/c 512 has been updated)
- If the write buffer hasn't finished writing to location 512, reading location 512 will put the wrong, old value into the cache block and then into R2
- R3 would not be equal to R2, which is a bad thing!
66. (1) Giving priority to read misses over writes
- 1 solution to this problem is to handle read misses only if the write buffer is empty
- (Causes quite a performance hit however!)
- An alternative is to check the contents of the write buffer on a read miss
- If there are no conflicts & the memory system is available, let the read miss continue
- Can also reduce the cost of writes within a processor with a write-back cache
- What if a read miss should replace a dirty memory block?
- Could write to memory, then read memory
- Or copy the dirty block to a buffer, read memory, then write memory; this lets the CPU not wait
67. (2) Sub-block placement for reduced miss penalty
- Instead of replacing a whole complete block of a cache, we only replace one of its sub-blocks
- Note: we'll have to make a hardware change to do this. What is it???
- Sub-blocks should have a smaller miss penalty than full blocks
68. (3) Early restart and critical word 1st
- With this strategy we're going to be impatient
- As soon as some of the block is loaded, see if the data is there and send it to the CPU
- (i.e. we don't wait for the whole block to be loaded)
- Recall that the Alpha's cache took 2 cycles to transfer all of the data needed
- but the data word needed could come in the first cycle
- There are 2 general strategies
- Early restart
- As soon as the word gets to the cache, send it to the CPU
- Critical word first
- Specifically ask for the needed word 1st, make sure it gets to the CPU, then get the rest of the cache block data
69. (4) Nonblocking caches to reduce stalls on cache misses
- These might be most useful with a Tomasulo or scoreboard implementation. Why?
- The CPU could still fetch instructions and start them on a cache data miss
- A nonblocking cache allows a cache (especially a data cache) to supply cache hits during a miss
- This scheme is often called "hit under miss"
- Other variants of this are
- "hit under multiple miss"
- "miss under miss"
- Which are only useful if the memory system can handle multiple misses
- These will all greatly complicate your cache hardware!
70. (5) 2nd-level caches
- The 1st 4 techniques discussed all impact the CPU
- This technique focuses on the cache/main memory interface
- The processor/memory performance gap makes us consider
- Whether we should make caches faster to keep pace with CPUs
- Whether we should make caches larger to overcome the widening gap between CPU and main memory
- One solution is to do both
- Add another level of cache (L2) between the 1st level cache (L1) and main memory
- Ideally L1 will be fast enough to match the speed of the CPU, while L2 will be large enough to reduce the penalty of going to main memory
71. (5) Second-level caches
- This will of course introduce a new definition for average memory access time
- Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
- Where Miss Penalty(L1) =
- Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
- So the 2nd-level miss rate is measured on the 1st-level cache's misses
- A few definitions to avoid confusion
- Local miss rate
- # of misses in the cache divided by the total # of memory accesses to this cache; specifically Miss Rate(L2)
- Global miss rate
- # of misses in the cache divided by the total # of memory accesses generated by the CPU; specifically Miss Rate(L1) x Miss Rate(L2)
72. (5) Second-level caches
- Example
- In 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates?
- Miss Rate L1 (local or global) = 40/1000 = 4%
- Miss Rate L2 (local) = 20/40 = 50%
- Miss Rate L2 (global) = 20/1000 = 2%
- Note that the global miss rate is very similar to the single-cache miss rate of the L2 cache
- (if the L2 size >> L1 size)
- The local miss rate is not a good measure of secondary caches; it's a function of the L1 miss rate
- Which can vary by changing the L1 cache
- Use the global cache miss rate when evaluating 2nd-level caches!
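A small C sketch combining this example with the two-level AMAT formula from the previous slide. The miss counts come from the example; the hit times and L2 miss penalty below are assumed for illustration only:

#include <stdio.h>

int main(void)
{
    /* numbers from the example */
    double refs = 1000, l1_misses = 40, l2_misses = 20;
    double l1_miss_rate        = l1_misses / refs;       /* 4%  (local = global for L1) */
    double l2_local_miss_rate  = l2_misses / l1_misses;  /* 50% */
    double l2_global_miss_rate = l2_misses / refs;       /* 2%  */

    /* assumed timing parameters */
    double l1_hit = 1, l2_hit = 10, l2_miss_penalty = 100;

    /* AMAT = HitTime(L1) + MissRate(L1) x (HitTime(L2) + MissRate(L2,local) x MissPenalty(L2)) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty);

    printf("L2 local = %.0f%%, L2 global = %.0f%%, AMAT = %.1f cycles\n",
           l2_local_miss_rate * 100, l2_global_miss_rate * 100, amat);
    return 0;
}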
73. (5) Second-level caches (some odds and ends comments)
- The speed of the L1 cache will affect the clock rate of the CPU, while the speed of the L2 cache will affect only the miss penalty of the L1 cache
- Which of course could affect the CPU in various ways
- 2 big things to consider when designing the L2 cache are
- Will the L2 cache lower the average memory access time portion of the CPI?
- If so, how much will it cost?
- In terms of HW, etc.
- 2nd-level caches are usually BIG!
- Usually L1 is a subset of L2
- Should have few capacity misses in the L2 cache
- Only worry about compulsory and conflict misses for optimizations
74. (5) Second-level caches (example)
- Given the following data
- 2-way set associativity increases hit time by 10% of a CPU clock cycle
- Hit time for the L2 direct mapped cache is 10 clock cycles
- Local miss rate for the L2 direct mapped cache is 25%
- Local miss rate for the L2 2-way set associative cache is 20%
- Miss penalty for the L2 cache is 50 clock cycles
- What is the impact of using a 2-way set associative cache on our miss penalty?
75. (5) Second-level caches (example)
- Miss penalty (direct mapped L2)
- = 10 + 25% x 50 = 22.5 clock cycles
- Adding the cost of associativity increases the hit cost by only 0.1 clock cycles
- Thus, Miss penalty (2-way set associative L2)
- = 10.1 + 20% x 50 = 20.1 clock cycles
- However, we can't have a fraction for a number of clock cycles (i.e. 10.1 ain't possible!)
- We'll either need to round up to 11 or optimize some more to get it down to 10. So:
- 10 + 20% x 50 = 20.0 clock cycles, or
- 11 + 20% x 50 = 21.0 clock cycles (both better than 22.5)
76. (5) Second-level caches (some final random comments)
- We can reduce the miss penalty by reducing the miss rate of the 2nd-level caches using techniques previously discussed
- i.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd-level hit time
- And much of the average access time is due to misses in the L2 cache
- Could also reduce misses by increasing the L2 block size
- Need to think about something called the "multilevel inclusion property"
- In other words, all data in the L1 cache is always in L2
- Gets complex for writes, and whatnot
77. Addressing Hit Time
78. Reducing the hit time
- Again, recall our average memory access time equation
- Hit time + Miss Rate x Miss Penalty
- We've talked about reducing the Miss Rate and the Miss Penalty; Hit time can also be a big component
- On many machines cache accesses can affect the clock cycle time, so making this small is a good thing!
- We'll talk about a few ways next
79. (1) Small and simple caches
- Why is this good?
- Generally, smaller hardware is faster, so a small cache should help the hit time
- If an L1 cache is small enough, it should fit on the same chip as the actual processing logic
- The processor avoids the time of going off chip!
- Some designs compromise and keep tags on chip and data off chip; this allows a fast tag check and >> memory capacity
- Direct mapping also falls under the category of "simple"
- It relates to the point above as well: you can check the tag and read the data at the same time!
80. (2) Avoid address translation during cache indexing
- This problem centers around virtual addresses. Should we send the virtual address to the cache?
- In other words, we have Virtual caches vs. Physical caches
- Why is this a problem anyhow?
- Well, recall from OS that a processor usually deals with processes
- What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
- The data in the cache would be totally different! This is called aliasing
- Every time a process is switched, logically we'd have to flush the cache or we'd get false hits
- Cost: time to flush + "compulsory" misses from an empty cache
- I/O must interact with caches, so we need virtual addresses
81. (2) Avoiding address translation during cache indexing
- Solutions to aliases
- HW that guarantees that every cache block has a unique physical address
- SW guarantee: the lower n bits must have the same address
- As long as they cover the index field and the cache is direct mapped, they must be unique; this is called page coloring
- Solution to cache flush
- Add a PID (process identifier) tag
- The PID identifies the process as well as an address within the process
- So, we can't get a hit if we get the wrong process!
82. Specific Example 1
83. A cache example
- We want to compare the following
- A 16-KB data cache & a 16-KB instruction cache versus a 32-KB unified cache
- Assume a hit takes 1 clock cycle to process
- Miss penalty = 50 clock cycles
- In the unified cache, a load or store hit takes 1 extra clock cycle, b/c having only 1 cache port is a structural hazard
- 75% of accesses are instruction references
- What's the avg. memory access time in each case?
[Table: miss rates for the split instruction/data caches and the unified cache.]
84. A cache example continued
- 1st, we need to determine the overall miss rate for the split caches
- (75% x 0.64%) + (25% x 6.47%) = 2.10%
- This compares to the unified cache miss rate of 1.99%
- We'll use the average memory access time formula from a few slides ago, but break it up into instruction & data references
- Average memory access time, split cache
- = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
- = (75% x 1.32) + (25% x 4.235) = 2.05 cycles
- Average memory access time, unified cache
- = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
- = (75% x 1.995) + (25% x 2.995) = 2.24 cycles
- Despite the higher miss rate, access time is faster for the split cache!
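A small C check of this comparison, with the miss rates and penalties copied from the example:

#include <stdio.h>

int main(void)
{
    double miss_penalty = 50, inst_frac = 0.75, data_frac = 0.25;
    double inst_mr = 0.0064, data_mr = 0.0647, unified_mr = 0.0199;

    /* split caches: each reference type sees its own cache's miss rate */
    double amat_split = inst_frac * (1 + inst_mr * miss_penalty)
                      + data_frac * (1 + data_mr * miss_penalty);

    /* unified cache: data references pay 1 extra cycle for the structural hazard */
    double amat_unified = inst_frac * (1 + unified_mr * miss_penalty)
                        + data_frac * (1 + 1 + unified_mr * miss_penalty);

    printf("split = %.2f cycles, unified = %.2f cycles\n", amat_split, amat_unified);
    return 0;
}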
85. Virtual Memory
86. The Full Memory Hierarchy: "always reuse a good idea"
[Same memory hierarchy figure as slide 9: registers, cache, main memory, disk, and tape, with their capacities, access times, costs, managing agents, and transfer units, from the fast/small upper level down to the slow/large lower level.]
87. Virtual Memory
- Some facts of computer life
- Computers run lots of processes simultaneously
- There isn't a full address space of memory for each process
- Must share smaller amounts of physical memory among many processes
- Virtual memory is the answer!
- Divides physical memory into blocks, assigns them to different processes
The compiler assigns data to a virtual address. The VA is translated to a real/physical address somewhere in memory (this allows any program to run anywhere; where is determined by the particular machine and OS)
88. What's the right picture?
[Figure: a process's logical (virtual) address space mapped onto the physical address space.]
89. The gist of virtual memory
- Relieves the problem of making a program that was too large to fit in physical memory... well... fit!
- Allows a program to run in any location in physical memory
- (called relocation)
- Really useful, as you might want to run the same program on lots of machines
The logical program is in contiguous VA space; here, it consists of 4 pages: A, B, C, D. The physical locations of the pages: 3 are in main memory and 1 is located on the disk
90. Some definitions and cache comparisons
- The bad news
- In order to understand exactly how virtual memory works, we need to define some terms
- The good news
- Virtual memory is very similar to a cache structure
- So, some definitions/analogies
- A "page" or "segment" of memory is analogous to a "block" in a cache
- A "page fault" or "address fault" is analogous to a cache miss
- so, if we go to main (real/physical) memory and our data isn't there, we need to get it from disk
91. More definitions and cache comparisons
- These are more definitions than analogies
- With VM, the CPU produces "virtual addresses" that are translated by a combination of HW/SW to "physical addresses"
- The physical addresses access main memory
- The process described above is called "memory mapping" or "address translation"
92. More definitions and cache comparisons
- Back to cache comparisons
93. Even more definitions and comparisons
- Replacement policy
- Replacement on cache misses is primarily controlled by hardware
- Replacement with VM (i.e. which page do I replace?) is usually controlled by the OS
- B/c of the bigger miss penalty, we want to make the right choice
- Sizes
- The size of the processor address determines the size of VM
- Cache size is independent of the processor address size
94. Virtual Memory
- Timing's tough with virtual memory
- AMAT = Tmem + (1-h) x Tdisk
- = 100 nS + (1-h) x 25,000,000 nS
- h (hit rate) has to be incredibly (almost unattainably) close to perfect to work
- so VM is a "cache", but an odd one.
95. Pages
96. Paging Hardware
[Figure: the CPU issues a 32-bit virtual address split into (page, offset); the page number indexes the page table, which supplies a frame number; (frame, offset) then forms the 32-bit address into physical memory.]
97. Paging Hardware
[Figure: same paging hardware as the previous slide.]
How big is a page? How big is the page table?
98. Address Translation in a Paging System
99. How big is a page table?
- Suppose
- 32-bit architecture
- Page size = 4 kilobytes
- Therefore
- Offset = 12 bits (2^12-byte pages)
- Page Number = 20 bits (2^20 page table entries)
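A minimal C sketch of the translation these slides describe, assuming 4KB pages on a 32-bit address (20-bit virtual page number, 12-bit offset); the page_table array here is just an illustrative stand-in for the real structure:

#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)           /* 4 KB pages               */
#define NUM_PAGES (1u << (32 - PAGE_BITS))    /* 2^20 page table entries  */

static uint32_t page_table[NUM_PAGES];        /* VPN -> frame number      */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;         /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* offset within page  */
    uint32_t frame  = page_table[vpn];            /* page table lookup   */
    return (frame << PAGE_BITS) | offset;         /* physical address    */
}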
100. Test Yourself
- A processor asks for the contents of virtual memory address 0x10020. The paging scheme in use breaks this into a VPN of 0x10 and an offset of 0x020.
- PTR (a CPU register that holds the address of the page table) has a value of 0x100, indicating that this process's page table starts at location 0x100.
- The machine uses word addressing and the page table entries are each one word long.
101. Test Yourself
- ADDR CONTENTS
- 0x00000 0x00000
- 0x00100 0x00010
- 0x00110 0x00022
- 0x00120 0x00045
- 0x00130 0x00078
- 0x00145 0x00010
- 0x10000 0x03333
- 0x10020 0x04444
- 0x22000 0x01111
- 0x22020 0x02222
- 0x45000 0x05555
- 0x45020 0x06666
- What is the physical address calculated?
- 10020
- 22020
- 45000
- 45020
- none of the above
102. Test Yourself
- ADDR CONTENTS
- 0x00000 0x00000
- 0x00100 0x00010
- 0x00110 0x00022
- 0x00120 0x00045
- 0x00130 0x00078
- 0x00145 0x00010
- 0x10000 0x03333
- 0x10020 0x04444
- 0x22000 0x01111
- 0x22020 0x02222
- 0x45000 0x05555
- 0x45020 0x06666
- What is the physical address calculated?
- What is the contents of this address returned to
the processor? - How many memory accesses in total were required
to obtain the contents of the desired address?
103. Another Example
Logical memory (16 words, 4 words per page): 0:a 1:b 2:c 3:d | 4:e 5:f 6:g 7:h | 8:i 9:j 10:k 11:l | 12:m 13:n 14:o 15:p
Page Table: page 0 -> frame 5, page 1 -> frame 6, page 2 -> frame 1, page 3 -> frame 2
Physical memory (32 words, 4 words per frame): words 4-7 hold i j k l (frame 1), words 8-11 hold m n o p (frame 2), words 20-23 hold a b c d (frame 5), words 24-27 hold e f g h (frame 6); the remaining words are empty
104. Replacement policies
105. Block replacement
- Which block should be replaced on a virtual memory miss?
- Again, we'll stick with the strategy that it's a good thing to eliminate page faults
- Therefore, we want to replace the LRU block
- Many machines use a "use" or "reference" bit
- Periodically reset
- Gives the OS an estimate of which pages are being referenced
106. Writing a block
- What happens on a write?
- We don't even want to think about a write through policy!
- The time with accesses, VM, hard disk, etc. is so great that this is not practical
- Instead, a write back policy is used, with a dirty bit to tell if a block has been written