Title: Cache
1. Cache
- Memory hierarchy is our solution for providing seemingly unlimited fast accesses - at least 1 instruction fetch and maybe 1 data access per cycle, more for a superscalar
- Each level of the hierarchy is based (at least in part) on the Principle of Locality of Reference
- as we move higher in the hierarchy, each level gets faster but also more expensive, and therefore it is more restricted in size
- some issues are generic across the hierarchy, but each level has its unique characteristics, technology and solutions
- we have already looked at registers; here we will study cache
The higher levels of the hierarchy (which are faster) are more expensive, so we have less storage there. But we want to avoid accesses at the lower levels because of their longer access times.
2. Effect on Performance
- Memory access time has a direct effect on CPU performance
- CPU execution time = (CPU clock cycles + memory stall cycles) × clock cycle time
- memory stall cycles = IC × memory references per instruction × miss rate × miss penalty
- memory references per instruction is at least 1, since there will be the instruction fetch itself and possibly 1 or more data accesses
- whenever an instruction or datum is not in registers, we must fetch it from cache, but if it is not in cache we accrue a miss penalty by having to access the much slower main memory
- A large enough miss penalty will cause a substantial increase in CPU execution time; consider for example
- CPI = 1.0 when all memory accesses are hits
- data accesses occur only during loads and stores (50% of all instructions are loads or stores)
- miss penalty is 25 clock cycles, miss rate is 2%
- what is the impact on CPI?
- CPI = 1.0 + 100% × 2% × 25 + 50% × 2% × 25 = 1.75
- so we are 75% slower than we might be because of cache misses (the arithmetic is worked through in the sketch below)
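A minimal sketch in C (my own illustration, not from the slides) that plugs this example's numbers into CPI = base CPI + accesses per instruction × miss rate × miss penalty:

    #include <stdio.h>

    int main(void) {
        double base_cpi       = 1.0;   /* CPI when every memory access hits      */
        double instr_accesses = 1.0;   /* one instruction fetch per instruction  */
        double data_accesses  = 0.5;   /* 50% of instructions are loads/stores   */
        double miss_rate      = 0.02;  /* 2% miss rate                           */
        double miss_penalty   = 25.0;  /* 25 clock cycles                        */

        double stall_cpi = (instr_accesses + data_accesses) * miss_rate * miss_penalty;
        double cpi = base_cpi + stall_cpi;     /* 1.0 + 1.5 * 0.02 * 25 = 1.75 */

        printf("CPI with cache misses = %.2f\n", cpi);
        printf("slowdown = %.0f%%\n", (cpi / base_cpi - 1.0) * 100.0);   /* 75%% slower */
        return 0;
    }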
3. Four questions
- The general piece of memory will be called a block
- blocks differ in size depending on the level of the memory hierarchy - cache block, memory block, disk block
- We ask the following questions
- Q1: where can a block be placed?
- Q2: how is a block found?
- Q3: which block should be replaced on a miss?
- Q4: what happens on a write?
- The answers to questions 1-3 differ depending on the type of cache - direct mapped, associative, set-associative
- The answer to the last question is based on the write policy - write through or write back
4. Q1: Where can a block be placed?
- Cache type determines placement
- Associative cache
- any available block
- Direct mapped cache
- a given memory block has only one location where it can be placed in cache, determined by the equation
- (block address) mod (number of blocks in cache)
- Set associative cache
- a given memory block has one set of blocks in the cache where it can be placed, determined by
- (block address) mod (number of sets), where number of sets = (number of blocks in cache) / associativity
Here we have a cache of 8 blocks and a memory of 32 blocks; to place memory block 12, we can put it in any block of the associative cache, in block 4 of the direct mapped cache, and in block 0 or 1 (set 0) of a 2-way set associative cache (the mapping is computed in the sketch below)
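A minimal sketch in C (my own illustration, using the slide's 8-block cache and memory block 12) of the placement rules for the three cache types:

    #include <stdio.h>

    int main(void) {
        int cache_blocks  = 8;     /* cache holds 8 blocks           */
        int associativity = 2;     /* for the set associative case   */
        int block_addr    = 12;    /* memory block we want to place  */

        /* direct mapped: exactly one possible line */
        int dm_line = block_addr % cache_blocks;                 /* 12 mod 8 = 4 */

        /* set associative: one possible set, any way within it */
        int num_sets = cache_blocks / associativity;             /* 8 / 2 = 4    */
        int sa_set   = block_addr % num_sets;                    /* 12 mod 4 = 0 */

        printf("direct mapped: line %d\n", dm_line);
        printf("2-way set associative: set %d (cache blocks %d and %d)\n",
               sa_set, sa_set * associativity, sa_set * associativity + 1);
        printf("fully associative: any of the %d blocks\n", cache_blocks);
        return 0;
    }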
5. Q2: How is a block found in cache?
- All memory addresses consist of a tag, a line number (or index), and a block offset
- in a direct mapped cache, the line number dictates the line where a block must be placed or where it will be found
- the tag is used to make sure that the line we have found holds the block we want
- in a set associative cache, the line number references a set of lines
- the block must be placed in one of those lines, but there is some variability: which line should we put it in, and which line will we find it in?
- we compare the tag of the requested address against the tag of each line in the set selected by the line number, and a multiplexor passes along the matching line's data
- in a fully associative cache, a line can go anywhere
- since the fully associative cache is a single set containing all the lines, we have to compare all blocks' tags at once; this is done through associative memory (a large number of comparators)
(a sketch of splitting an address into these fields follows below)
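A minimal sketch in C (my own field sizes, not a specific processor) showing how a cache splits an address into block offset, line number (index) and tag:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* assumed geometry: 64-byte blocks, 512 lines, direct mapped */
        const int offset_bits = 6;                  /* log2(64)  */
        const int index_bits  = 9;                  /* log2(512) */

        uint32_t addr   = 0x0001A2C4u;              /* arbitrary example address */
        uint32_t offset = addr & ((1u << offset_bits) - 1);
        uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        /* a lookup goes to line 'index' and compares the stored tag (plus a valid bit)
           against 'tag'; a match is a hit, otherwise a miss */
        printf("addr=0x%08X -> tag=0x%X index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }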
6. Q3: Which block should be replaced?
- For the direct-mapped cache
- there is no choice of which line a new block must be placed in, so there is no need for a replacement strategy
- For set-associative and fully associative caches
- we have a choice: for instance, in a 2-way set-associative cache we can place the new block in either line of the selected set
- for the fully associative cache, our choice is to place the new block into any line at all
- There are three common replacement strategies
- random
- FIFO
- least recently used (LRU)
- this best models locality of reference as it predicts which block will not be used again in the near future by looking at how long ago each block was last accessed; however, it is difficult to implement, especially in hardware
- instead we might use a variation, such as an LRU approximation or least frequently used
- figure C.4 on page C-10 compares the data miss rates when using FIFO, Random, and LRU replacement strategies
- what we see is that LRU is always the best, but the difference between the three is not great, and random is often as good as FIFO
7. Q4: What happens on a write?
- Writes occur only to data
- when the datum is loaded into cache, there are (at least) two copies of it, one in the cache and one in main memory (other copies may exist if we have multiple levels of cache)
- on a cache write, the datum in cache is modified, but what happens to the old (now stale) value in memory?
- Write Through cache
- write the datum to both cache and memory at the same time
- this is inefficient because the data access is a word while typical data movement between cache and memory is a block, so this write uses only part of the bus for a transfer
- notice that other words in the same block may also soon be updated, so waiting could pay off
- Write Back cache
- write to cache only, and wait on writing to memory until the entire block is being removed from cache
- add a dirty bit to the cache to indicate that the cache value is current and memory is out of date
- Write Through is easier to implement since memory will always be up-to-date and we don't need dirty bit mechanisms
- Write Back is preferred to reduce memory traffic (a write stall occurs in Write Through if the CPU must wait for the write to take place)
8. Write Miss Example
- What happens on a write miss?
- write allocate
- the block is fetched into the cache on a miss, and then the write proceeds as on a write hit
- no-write allocate
- the block is modified in memory without being brought into the cache
- Consider a write-back cache which starts empty and has the sequence of operations shown below
- how many hits and how many misses occur with no-write allocate versus write allocate?
- Solution
- for no-write allocate
- the first two operations cause misses (since after the first one, block 100 is still not loaded into cache), the third operation causes a miss, the fourth operation is a hit (since 200 is now in cache), but the fifth is also a miss, so 4 misses, 1 hit
- for write allocate
- the first access to a memory location is always a miss, but from there the block is in cache and the rest are hits, so we have 2 misses (one for each of 100 and 200) and 3 hits (the sketch below replays this sequence)
    Write Mem[100]
    Write Mem[100]
    Read  Mem[200]
    Write Mem[200]
    Write Mem[100]
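A minimal sketch in C (my own simulation harness) that replays the five operations above against a tiny block-presence set to count hits and misses under the two policies:

    #include <stdio.h>

    #define MAX_BLOCKS 16

    static int cached[MAX_BLOCKS];   /* addresses currently resident in the cache */
    static int ncached;

    static int in_cache(int addr) {
        for (int i = 0; i < ncached; i++)
            if (cached[i] == addr) return 1;
        return 0;
    }

    static void load(int addr) { cached[ncached++] = addr; }

    int main(void) {
        /* 1 = write, 0 = read, paired with the block address accessed */
        int is_write[] = {1, 1, 0, 1, 1};
        int addr[]     = {100, 100, 200, 200, 100};

        for (int write_allocate = 0; write_allocate <= 1; write_allocate++) {
            ncached = 0;
            int hits = 0, misses = 0;
            for (int i = 0; i < 5; i++) {
                if (in_cache(addr[i])) { hits++; continue; }
                misses++;
                /* reads always allocate; writes allocate only under write allocate */
                if (!is_write[i] || write_allocate) load(addr[i]);
            }
            printf("%-17s: %d misses, %d hits\n",
                   write_allocate ? "write allocate" : "no-write allocate", misses, hits);
        }
        return 0;      /* expect 4 misses/1 hit and 2 misses/3 hits, as in the slide */
    }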
9. Using a Write Buffer
- To alleviate the inefficiency of Write Through, we may add a write buffer
- writes go to cache and to the buffer, and the CPU continues without stalling
- writes to memory occur when the buffer is full or when a line is filled
- thus, a write to memory is done in parallel with the processor continuing execution
- although stalls can still arise when using a write buffer, they are less frequent
- question: on a read or write miss, should we examine the write buffer?
- the buffer may still be storing the needed value, so checking it lets us avoid the time consuming memory access
- Example
- given the three memory operations below, where locations 512 and 1024 map to the same direct-mapped cache line, if the cache is a write-through cache with a 4-word write buffer that is not checked on a cache miss, will the value in R2 equal the value in R3 at the end? not necessarily
    SW R3, 512(R0)   // value written to write buffer, but not yet to memory
    LW R1, 1024(R0)  // cache miss, new block brought in, replacing the block holding 512
    LW R2, 512(R0)   // cache miss, fetch Mem[512], which might still be the old value
                     // since the write buffer may not have written to memory yet!
10. Example: Opteron Data Cache
- Found in AMD Opteron processors
- 2-way set associative, 512 sets (512 blocks per way)
- 64 Kbytes in 64-byte blocks
- write-back, write-allocate
- CPU address: 40-bit address
- comprised of a 25-bit tag, 9-bit line (set) number and 6-bit byte offset
- the tags of both blocks in the selected set are compared to the tag of the address (a 2:1 MUX is used to select which (if either) datum to return to the CPU)
- a dirty bit is used to denote that a block has been modified
- so that it can be written back to memory prior to being removed from the cache
- on a miss, the cache notifies the processor of the miss so that the processor can stall, and the request continues down to the next level of the hierarchy
- which takes 7 further cycles to get the first 8 bytes of the needed block
- the replacement strategy is LRU, using a single bit per set to denote which block has been accessed most recently and replacing the other one
(the address breakdown is checked in the sketch below)
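A minimal sketch in C (my own arithmetic check, not Opteron RTL) verifying the address breakdown given above for a 64 KB, 2-way, 64-byte-block cache with 40-bit addresses:

    #include <stdio.h>

    static int log2_int(int x) {        /* number of bits needed to index x items */
        int b = 0;
        while ((1 << b) < x) b++;
        return b;
    }

    int main(void) {
        int cache_bytes = 64 * 1024, block_bytes = 64, associativity = 2, addr_bits = 40;

        int blocks      = cache_bytes / block_bytes;              /* 1024 blocks */
        int sets        = blocks / associativity;                 /* 512 sets    */
        int offset_bits = log2_int(block_bytes);                  /* 6 bits      */
        int index_bits  = log2_int(sets);                         /* 9 bits      */
        int tag_bits    = addr_bits - index_bits - offset_bits;   /* 25 bits     */

        printf("sets=%d, offset=%d bits, index=%d bits, tag=%d bits\n",
               sets, offset_bits, index_bits, tag_bits);
        return 0;
    }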
11. Split Versus Unified Caches
- Notice that the previous example was of a data cache only
- Why have separate data and instruction caches?
- below are statistics indicating the performance of two split caches versus a single unified cache
- the unified cache would be twice the size of each of the split caches, so compare, for instance, two 8KB caches to a 16KB unified cache
Note: this table does not show miss rate - we are seeing misses per 1000 instructions, not per 1000 accesses. Divide by 10 to get misses per 100 instructions (e.g., 6.3 for the 8KB unified cache).
Why do you suppose the instruction cache performs so much better than the data cache of equal size?
12. Example
- Compare 16 KB instruction and 16 KB data caches vs. a 32 KB unified cache, assuming
- 1 clock cycle hit time
- 100 clock cycle miss penalty for the individual caches
- add 1 clock cycle to the hit time for a load/store in the unified cache (36% of instructions are loads/stores)
- since a single cache cannot handle both an instruction fetch and a data access in the same cycle
- write-through caches with a write buffer, no stalls on writes
- What is the average memory access time for both organizations?
- To determine each cache's performance, we compute the average memory access time
- average memory access time = hit time + miss rate × miss penalty
- as noted above, for the unified cache we have a higher hit time because the unified cache will need two accesses at a time if an instruction is a load or store
13. Solution
- We get misses per instruction from the table on the earlier slide
- Converting to miss rate
- instruction cache: (3.82 / 1000) / 1.0 accesses per instruction = .00382
- data cache: (40.9 / 1000) / .36 accesses per instruction = .1136
- unified cache: (43.3 / 1000) / 1.36 accesses per instruction = .0318
- Of the 136 accesses per 100 instructions, the percentage of instruction accesses = 100 / 136 = 74%
- Percentage of data accesses = 36 / 136 = 26%
- memory access time for the two split caches = 74% × (1 + .00382 × 100) + 26% × (1 + .1136 × 100) = 4.236
- memory access time for the unified cache = 74% × (1 + .0318 × 100) + 26% × (2 + .0318 × 100) = 4.44
- Separate caches perform better (the sketch below reproduces this calculation)
- note a typo in the book: it states a 100 clock cycle miss penalty but uses 200 in its solution
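A minimal sketch in C (my own code) reproducing the split vs. unified calculation, using the 100-cycle miss penalty as the slide does:

    #include <stdio.h>

    int main(void) {
        double miss_penalty = 100.0;

        /* miss rates converted from misses per 1000 instructions */
        double mr_instr   = (3.82 / 1000.0) / 1.0;    /* 16 KB instruction cache */
        double mr_data    = (40.9 / 1000.0) / 0.36;   /* 16 KB data cache        */
        double mr_unified = (43.3 / 1000.0) / 1.36;   /* 32 KB unified cache     */

        double frac_instr = 1.00 / 1.36;              /* ~74% of accesses */
        double frac_data  = 0.36 / 1.36;              /* ~26% of accesses */

        double amat_split   = frac_instr * (1.0 + mr_instr * miss_penalty)
                            + frac_data  * (1.0 + mr_data  * miss_penalty);
        double amat_unified = frac_instr * (1.0 + mr_unified * miss_penalty)
                            + frac_data  * (2.0 + mr_unified * miss_penalty); /* extra hit cycle */

        printf("split caches  : %.3f cycles\n", amat_split);     /* ~4.24 */
        printf("unified cache : %.3f cycles\n", amat_unified);   /* ~4.44 */
        return 0;
    }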
14. Revised CPU Performance
- Recall our previous CPU formula
- CPU time = (CPU clock cycles + memory stall cycles) × clock cycle time
- assume memory stalls are caused by cache misses, not by problems like bus contention, I/O, etc.
- Memory stall cycles
- = memory accesses × miss rate × miss penalty
- = reads × read miss rate × read miss penalty + writes × write miss rate × write miss penalty
- CPU time
- = IC × (CPI + memory accesses per instruction × miss rate × miss penalty) × clock cycle time
- = IC × (CPI × clock cycle time + memory accesses per instruction × miss rate × memory access time)
- The memory access time is independent of the clock speed
- so an interesting tradeoff arises: the faster the clock, the better the CPU performance, but the higher the miss penalty (measured in clock cycles)!
15. Associativity Example
- What impact does cache organization (direct-mapped vs. 2-way set associative) have on a CPU?
- assume CPI with a perfect cache is 1.6, clock cycle time is .35 ns, 1.4 memory references per instruction, and 128 KB data and instruction caches with 64 byte blocks each
- cache 1 is direct-mapped with a miss rate of 2.1%, cache 2 is 2-way set associative with a miss rate of 1.9%
- the direct-mapped cache allows a faster clock; assume the clock cycle time is lengthened by 35% for the 2-way set associative cache
- cache miss penalty = 65 ns for either cache
- CPU Time Cache 1
- = IC × (1.6 × .35 + 1.4 × .021 × 65) = 2.47 × IC
- CPU Time Cache 2
- = IC × (1.6 × .35 × 1.35 + 1.4 × .019 × 65) = 2.49 × IC
- CPU with Cache 1 is 2.49 / 2.47 = 1.01 times faster (see the sketch below)
- Note: the cache miss penalty should actually be higher for cache 2 - why?
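A minimal sketch in C (my own code) for this example, computing per-instruction time as CPI × cycle time + references per instruction × miss rate × miss penalty (in ns):

    #include <stdio.h>

    int main(void) {
        double cpi_exec     = 1.6;
        double cycle_dm     = 0.35;          /* ns, direct mapped            */
        double cycle_2way   = 0.35 * 1.35;   /* ns, clock lengthened by 35%  */
        double refs_per_ins = 1.4;
        double penalty_ns   = 65.0;

        double t_dm   = cpi_exec * cycle_dm   + refs_per_ins * 0.021 * penalty_ns;
        double t_2way = cpi_exec * cycle_2way + refs_per_ins * 0.019 * penalty_ns;

        printf("direct mapped : %.2f ns per instruction\n", t_dm);    /* ~2.47 */
        printf("2-way assoc.  : %.2f ns per instruction\n", t_2way);  /* ~2.49 */
        printf("speedup of direct mapped = %.2f\n", t_2way / t_dm);
        return 0;
    }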
16. Out of Order and Miss Penalty
- In our prior examples, cache misses caused the pipeline to stall, thus impacting CPI
- in a multiple-issue out-of-order execution architecture, like Tomasulo's, a data miss means that a particular instruction stalls, possibly stalling others because it ties up a reservation station or reorder buffer slot, but it may well not impact overall CPI
- How then do we determine the impact of cache misses on such architectures?
- we might define memory stall cycles / instruction = misses / instruction × (total miss latency - overlapped miss latency)
- total miss latency is the total memory latency accrued by an instruction's miss
- overlapped miss latency is the amount of that time during which the miss does not impact performance because other instructions continue executing
- these two terms are difficult to analyze, so we won't cover this in any more detail
- typically a multiple-issue out-of-order architecture can hide some of the miss penalty, up to 30% as shown in an example on pages C-20 - C-21
17. Improving Cache Performance
- Having reviewed 1000s of research papers on caches, the authors provide a number of approaches to improve cache performance
- recall: average memory access time = hit time + miss rate × miss penalty
- these approaches can be categorized by which aspect of the above formula they are trying to reduce
- reduce hit time
- reduce miss rate
- reduce miss penalty (or increase cache bandwidth so that part of the miss penalty can be reduced)
- use parallelism in order to reduce miss rate and/or penalty
- Comments
- miss penalty is the biggest value in the equation, so it would seem the obvious target to reduce, but in fact little can be done to increase memory speed
- reducing miss rate has a number of different approaches, however miss rates today are often less than 2% - can we continue to improve?
- reducing hit time has the benefit of allowing us to lower the clock cycle time as well
18. The 3 Cs
- Cache misses can be categorized as
- compulsory misses
- the very first access to a block cannot be a hit because the process has just begun and there has not been a chance to load that block into the cache
- capacity misses
- the cache cannot contain all of the blocks needed by the process
- conflict misses
- the block placement strategy only allows a block to be placed in certain locations in the cache, bringing about contention with other blocks for that same location
- some of the approaches we will see attempt to reduce one particular type of miss
- figure C.8 on page C-23 demonstrates the miss rates for various cache sizes and associativities
- conflict misses are most common in direct-mapped caches and are 0 in fully associative caches, whereas capacity misses are the most common cause of a cache miss
19. Solution 1: Larger Block Sizes
- Larger block sizes will reduce compulsory misses
- larger blocks take more advantage of spatial locality
- but, larger blocks can increase the miss penalty because it physically takes longer to transfer the block from main memory to cache
- also, larger blocks mean fewer blocks in the cache, which itself can increase the miss rate
- this depends on program layout and the size of the cache vs. the block size
A block size of 64 to 128 bytes provides the lowest miss rates
20. Example: Impact of Block Size
- Assume memory takes 80 cycles to respond and then delivers 16 bytes every 2 clock cycles thereafter
- Which block size has the minimum average memory access time for each cache size?
- average memory access time = hit time + miss rate × miss penalty
- hit time = 1, miss rates come from the previous slide
- miss penalty depends on the size of the block (82 cycles for 16 bytes, 84 cycles for 32 bytes, etc.; for k byte blocks, miss penalty = 80 + (k / 16) × 2)
- Solution
- 16 byte block, 4 KB cache: 1 + (8.57% × 82) = 8.027 clock cycles
- 256 byte block, 256 KB cache: 1 + (.49% × 112) = 1.549 clock cycles
We must compromise: a bigger block size reduces the miss rate to some extent, but also increases the miss penalty (computed in the sketch below)
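A minimal sketch in C (my own code) for this example: the miss penalty is 80 cycles plus 2 cycles per 16 bytes transferred, and the two miss rates are those quoted on the slide:

    #include <stdio.h>

    static double miss_penalty(int block_bytes) {
        return 80.0 + (block_bytes / 16.0) * 2.0;
    }

    int main(void) {
        /* 16-byte blocks in a 4 KB cache: 8.57% miss rate */
        double amat_small = 1.0 + 0.0857 * miss_penalty(16);
        /* 256-byte blocks in a 256 KB cache: 0.49% miss rate */
        double amat_large = 1.0 + 0.0049 * miss_penalty(256);

        printf("16B block, 4KB cache    : %.3f cycles\n", amat_small);  /* ~8.027 */
        printf("256B block, 256KB cache : %.3f cycles\n", amat_large);  /* ~1.549 */
        return 0;
    }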
21. Solution 2: Larger Caches
- A larger cache will reduce capacity miss rates since the cache has a larger capacity, and also conflict miss rates because the larger cache holds more refill lines and so has fewer conflicts
- This is an obvious solution and has no apparent performance drawbacks, but
- you must be careful where you put this larger cache
- a larger on-chip cache might take space away from other hardware that could provide performance increases (registers, more functional units, logic for multiple issue of instructions, etc.)
- and more cache means a greater expense for the machine
- Larger caches are also often slower as they need more multiplexors
- the best place for a large cache is off the chip, where we have space and where we don't mind a slightly slower cache, given that the miss penalty will already be larger due to having to go off the chip
- the authors note that the second-level caches of 2001-era computers are equal in size to the main memories of 10 years earlier!
22. Solution 3: Higher Associativity
- Cache research points out the 2:1 cache rule of thumb
- a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2, so larger associativity yields smaller miss rates
- However, a large 8-way set associative cache will have close to a 0% conflict miss rate
- meaning that an 8-way set associative cache is about as good as a fully associative cache
- so we don't need associativity beyond 8-way
- Why use direct-mapped?
- associativity always raises the hit time
- How big is the difference?
- as we saw in an earlier example, a 2-way set associative cache had a clock cycle 35% slower than the direct-mapped cache
- remember that the clock cycle time is usually tied to the cache hit time, so we wind up slowing down the entire computer when using an associative cache of some kind
So, with this in mind, should we use direct-mapped or set associative caches?
23. Example: Impact of Associativity
- Average memory access time = hit time + miss rate × miss penalty
- 4 KB direct mapped: 1 + .098 × 25 = 3.45
- 512 KB 8-way: 1.52 + .006 × 25 = 1.67
- Assume higher associativity increases clock cycle time as follows
- clock 2-way = 1.36 × clock direct mapped
- clock 4-way = 1.44 × clock direct mapped
- clock 8-way = 1.52 × clock direct mapped
- assume the L1 cache is direct-mapped with a 1 cycle hit time
- Determine the best L2 organization
- given that the miss penalty for the direct-mapped L2 is 25 cycles and L2 never misses
In most cases, higher associativity means a higher access time, so direct mapped is often the best
24. Solution 4: Multilevel Caches
- To improve performance, we find that we would like
- a faster cache to keep pace with the processor
- a larger cache to lower the miss rate
- Which should we pick? Both
- offer a small but fast first level cache (L1)
- offer a larger but slower second level cache (L2) on the chip
- since this second cache is larger, it will be somewhat slower; we make it even slower by using some degree of associativity to improve its hit rate
- this gives us a new formula for average memory access time
- average memory access time = hit time L1 + miss rate L1 × miss penalty L1
- miss penalty L1 = hit time L2 + miss rate L2 × miss penalty L2
- average memory access time = hit time L1 + miss rate L1 × (hit time L2 + miss rate L2 × miss penalty L2)
- We redefine miss rate for the second cache
- local miss rate = number of misses in this cache / number of memory accesses reaching this cache
- global miss rate = number of misses in this cache / total number of memory accesses
- these two values are the same for the first level cache
25. Impact of L2 Cache
- The local miss rate for the second cache will be larger than the local miss rate for the first cache, since the first cache skims the cream of the crop
- the second level cache is only accessed when the first level misses entirely
- the global miss rate is more useful than the local miss rate for the second cache
- the global miss rate tells us how many misses there are out of all memory accesses
- For example
- assume that in 1000 references, L1 has 40 misses and L2 has 20
- local (and global) miss rate of cache 1 = 40 / 1000 = 4%
- local miss rate of cache 2 = 20 / 40 = 50%, global miss rate of cache 2 = 20 / 1000 = 2%
- the local miss rate of cache 2 is misleading; the global miss rate gives us an indication of how both caches perform overall
- L1 hit time is 1 cycle, L2 hit time is 10 cycles, memory access time is 200 cycles
- what is the average memory access time?
- average memory access time = 1 + 4% × (10 + 50% × 200) = 5.4 cycles
- without L2, we have average memory access time = 1 + 4% × 200 = 9, so the L2 cache gives us a 9 / 5.4 = 1.67, or 67%, speedup! (the sketch below reproduces these numbers)
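A minimal sketch in C (my own code) of the two-level example: local vs. global miss rates and the resulting average memory access time:

    #include <stdio.h>

    int main(void) {
        double refs      = 1000.0;
        double l1_misses = 40.0;
        double l2_misses = 20.0;

        double l1_miss_rate        = l1_misses / refs;        /* 4%, local == global */
        double l2_local_miss_rate  = l2_misses / l1_misses;   /* 50% */
        double l2_global_miss_rate = l2_misses / refs;        /* 2%  */

        double l1_hit = 1.0, l2_hit = 10.0, mem = 200.0;
        double amat_l2   = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * mem);
        double amat_nol2 = l1_hit + l1_miss_rate * mem;

        printf("L2 local miss rate = %.0f%%, global = %.0f%%\n",
               l2_local_miss_rate * 100, l2_global_miss_rate * 100);
        printf("AMAT with L2    = %.1f cycles\n", amat_l2);    /* 5.4 */
        printf("AMAT without L2 = %.1f cycles (speedup %.2f)\n",
               amat_nol2, amat_nol2 / amat_l2);
        return 0;
    }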
26. Associativity for the L2 Cache
- Here we see the benefit of an associative second-level cache instead of a direct-mapped one
- compare a direct-mapped vs. a 2-way set associative L2, assuming the 2-way set associative cache is 10% slower than the direct-mapped cache
- direct-mapped L2 has a hit time of 10 cycles
- direct-mapped L2 has a local miss rate of 25%
- 2-way set associative L2 has a hit time of 10.1 cycles
- 2-way set associative L2 has a local miss rate of 20%
- miss penalty of L2 = 200 cycles
- direct-mapped L2: L1 miss penalty = 10 + .25 × 200 = 60 cycles
- 2-way set associative L2: L1 miss penalty = 10.1 + .20 × 200 = 50.1 cycles
- we round 10.1 and 50.1 up to 11 and 51 respectively, since the L2 cache is still governed by the system clock
27. Solution 5: Priority of Reads over Writes
- Reads occur with a much greater frequency than writes
- we want to make the common case fast, so we would like to make sure that reads are faster than writes
- writes are slower because of the need to write to both cache and main memory as well as to perform a tag check prior to starting the write (a tag check on a read can be done in parallel with the read; if the tag is incorrect, we just cancel the remainder of the read)
- we might use a write buffer with either write policy
- a write-through cache writes to the write buffer first, and any read misses are given priority over writing the write buffer to memory
- a write-back cache writes to the write buffer, and the write buffer is only written to memory when we are assured of no conflict with a read miss
- in either case, a read miss will first examine the write buffer before going on to memory, in order to potentially save time
- so, read misses have priority over writes; since read misses are more common, we make the common case fast
28. Solution 6: Avoid Address Translation
- The CPU generates an address and sends it to cache
- but the address generated by the CPU is a logical (virtual) address, not the physical address in memory
- the addresses differ because of paging and virtual memory
- caches normally store items based on their physical address (for instance, the line number is derived from the physical address, not the virtual address)
- to obtain the physical address, the virtual address must first be translated using the TLB (or the page table if the entry isn't in the TLB)
- thus, any memory access will actually take at least 2 cache-like accesses, and we want to prevent this
- If we store virtual addresses in the cache, we can skip this translation, but there are problems with this approach
- the address translation process includes protection mechanisms to make sure that a generated address is not trying to access some other process's portion of memory (such as a user process accessing the OS)
- a solution is to copy the protection information into the cache and check it on every cache access
- if a process is switched out of memory, then the cache must be flushed
- unless we add process id information to the tags, which, like the previous solution, means using more cache space for non-data
29. Another Solution
- A very simple solution to the problem of needing to translate virtual to physical addresses is this
- make sure that the line number bits are identical in both the virtual and physical addresses
- Consider for instance a memory system with
- 4G words (32 bit addresses, assuming word-level addressing, not byte-level)
- a block size of 16 words
- a 16K-word on-chip cache, which stores 1024 blocks and so requires 10 bits of the address for the line number
- so the address format is 18 bits for the tag, 10 bits for the line number, 4 bits for the word offset
- assume a page size of 16K words, so that the page offset is 14 bits and the page number is 18 bits
- the paging process swaps the page number for the frame number, both of which are the upper 18 bits of the address, so the page offset (the lower 14 bits) is unchanged by translation; since the line number and word offset fit entirely within those 14 bits, no address translation is necessary prior to cache access
- For this to work, we must ensure that the line number and word offset fit within the page offset, i.e., log2(page size) >= line number bits + block offset bits (the sketch below checks this condition)
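A minimal sketch in C (my own check) of the condition that lets the cache be indexed without waiting for translation: the line number and block offset must lie entirely within the page offset bits, which translation leaves unchanged:

    #include <stdio.h>

    int main(void) {
        int index_bits       = 10;   /* 1024 cache blocks   */
        int offset_bits      = 4;    /* 16 words per block  */
        int page_offset_bits = 14;   /* 16K-word pages      */

        if (index_bits + offset_bits <= page_offset_bits)
            printf("OK: cache index+offset (%d bits) fit in the page offset (%d bits);\n"
                   "the cache can be indexed with untranslated bits.\n",
                   index_bits + offset_bits, page_offset_bits);
        else
            printf("Not OK: the cache index would need translated (page number) bits.\n");
        return 0;
    }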
30. Solution 7: Small and Simple Caches
- Cache access (for any but a fully associative cache) requires
- using the index part of the address to find the appropriate line in the cache
- then comparing tags to see if the entry is the right one
- the tag comparison can be time consuming
- especially with associative caches that have large tags, or set associative caches where the comparisons need more hardware to be done in parallel
- Two solutions for keeping a cache fast are
- use direct-mapped caches
- keep the tags on the chip for a quick tag check but move the data off the chip
- this latter approach is sometimes used for L2 caches, but not L1
- Another idea is to use an L2 cache small enough to fit on the processor chip, to keep L1 miss penalties down
- L1 cache sizes have not increased lately, but small enough L2 caches can now fit on the chip
31. Solution 8: Way Prediction
- The direct-mapped cache offers a faster access time than a 2-way set associative cache
- because we can avoid the additional multiplexor needed to select between the two ways
- Let's assume that an access to one way will be followed by an access to the same way
- this prediction is called way prediction and can be used to speed up access to a 2-way set associative cache by maintaining a prediction bit in the cache
- the bit is toggled when we have a misprediction and need to switch to the other way
- So we get the lower miss rate of the 2-way set associative cache and the lower hit time whenever we predict correctly
- simulations indicate a prediction accuracy of 85% or more
- the Pentium 4 uses way prediction
32. Variation: Pseudo-Associative Cache
- We can alter a direct-mapped cache to have some associativity as follows
- consult the direct-mapped cache as normal
- this provides a fast hit time
- if there is a miss, invert part of the address and try the new address
- the inversion might flip a bit of the line number
- the second access comes at the cost of a longer hit time for the second attempt (it may also cause other accesses to stall while the second access is being performed!)
- thus, the same address might be stored in one of two locations, giving some associativity
- the pseudo-associative cache will reduce the number of conflict misses
- a miss on the first check may still become a cache hit
- the first check is fast (the hit time of a direct-mapped cache)
- the second check might take 1-2 cycles further, which is still faster than going to a second-level cache
33. Example
- Which provides the faster average memory access time for 4 KB and 256 KB caches: direct-mapped, 2-way set associative, or pseudo-associative (PAC)?
- Assume a hit time of 1 cycle for direct mapped, 1.36 cycles for 2-way set associative, a miss penalty of 50 cycles, and that the PAC takes 3 cycles for a second (alternative) access
- we alter our formula for the PAC because a miss does not necessarily accrue the 50 cycle penalty, but instead only a 3 cycle penalty if the item is in the alternative position in the cache
- we need two hit rates: the normal hit rate and the hit rate of finding the item in the second position (we will call this the alternative hit rate)
- alternative hit rate = hit rate 2-way - hit rate 1-way = (1 - miss rate 2-way) - (1 - miss rate 1-way) = miss rate 1-way - miss rate 2-way
- Average memory access time PAC
- = 1 + (miss rate 1-way - miss rate 2-way) × 3 + miss rate 2-way × miss penalty
34. Solution
- PAC 4 KB: 1 + (.098 - .076) × 3 + .076 × 50 = 4.866
- PAC 256 KB: 1 + (.013 - .012) × 3 + .012 × 50 = 1.603
- Direct-mapped 4 KB: 1 + .098 × 50 = 5.9
- Direct-mapped 256 KB: 1 + .013 × 50 = 1.65
- 2-way 4 KB: 1.36 + .076 × 50 = 5.16
- 2-way 256 KB: 1.36 + .012 × 50 = 1.96
- So, the pseudo-associative cache outperforms both! (reproduced in the sketch below)
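A minimal sketch in C (my own code) comparing the direct-mapped, 2-way and pseudo-associative (PAC) average memory access times for the two cache sizes:

    #include <stdio.h>

    static void compare(const char *size, double mr_1way, double mr_2way) {
        double penalty = 50.0, alt_hit_time = 3.0;
        double dm  = 1.0  + mr_1way * penalty;
        double two = 1.36 + mr_2way * penalty;
        /* hits found in the alternative location cost 3 cycles; true misses cost 50 */
        double pac = 1.0 + (mr_1way - mr_2way) * alt_hit_time + mr_2way * penalty;
        printf("%s: direct=%.3f  2-way=%.3f  PAC=%.3f\n", size, dm, two, pac);
    }

    int main(void) {
        compare("4 KB  ", 0.098, 0.076);   /* PAC ~4.866 */
        compare("256 KB", 0.013, 0.012);   /* PAC ~1.603 */
        return 0;
    }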
35. Solution 9: Trace Caches
- This type of cache is an instruction cache which supports multiple issue of instructions by providing 4 or more independent instructions per cycle
- cache blocks are dynamic, unlike normal caches where blocks are static based on what is stored in memory
- here, the block is formed around branch prediction, branch folding, and trace scheduling
- note that because of branch folding and trace scheduling, some instructions might appear multiple times in the cache, so it is somewhat more wasteful of cache space
- This type of cache offers the advantage of directly supporting a multiple issue architecture
- the Pentium 4 uses this approach, but most RISC processors do not, because the repetition of instructions and the high frequency of branches cause this approach to waste too much cache space
36. Solution 10: Pipelined Cache Accesses
- We can pipeline writes to the cache, since writes take longer than reads
- the longer duration is because we need to perform a tag check before the write can begin
- a read can commence immediately, and if the tag is wrong, the item read is simply discarded
- The write takes two steps, the tag comparison first, followed by the write (a third step might be included in a write-back cache by combining items in a buffer)
- by pipelining writes
- we can partially speed up the process
- this works by overlapping the tag check of one write with the data-writing portion of the previous write
- assuming the tags are correct
- in this way, the second write takes the same time as a read would
- although this only works with more than 1 consecutive write where all the writes are cache hits
37. Example
- Here we see the impact of pipelining cache accesses
- the advantage is that it allows us to reduce the clock cycle time; the disadvantage is that with a shorter clock, cache misses have a larger impact (in cycles)
- compare the MIPS 5-stage pipeline vs. the MIPS R4000 8-stage pipeline
- assume clock rates of 1 GHz for the MIPS and 1.8 GHz for the MIPS R4000
- a main memory access time of 50 ns (we will assume no second level cache) and a cache miss rate of 5%
- Assuming no other sources of stalls, which machine is faster?
- MIPS miss penalty = 50 ns / (1 / 1) ns = 50 cycles
- MIPS R4000 miss penalty = 50 ns / (1 / 1.8) ns = 90 cycles
- CPU time MIPS = (1 + .05 × 50) × 1 ns = 3.5 ns per instruction
- CPU time MIPS R4000 = (1 + .05 × 90) × (1 / 1.8) ns = 3.06 ns per instruction
- So the gain in clock speed from pipelining cache accesses more than offsets the increased miss penalty (the sketch below reproduces the calculation)
- to truly see if this is advantageous, we would also have to factor in the impact of structural hazards and branch penalties
- the longer the pipeline, the greater that impact is
38. Solution 11: Non-Blocking Caches
- When an ordinary cache has a miss and must retrieve the needed block from memory, the cache is unable to respond to other requests
- a more expensive cache is a non-blocking cache, which can continue to respond to CPU requests while waiting for memory, as long as the new requests are to blocks that are present
- this is sometimes referred to as hit under miss
- this capability becomes essential for out-of-order execution architectures like Tomasulo's, so that the cache does not cause stalls
- a variation is a cache that permits hits to continue in spite of multiple outstanding misses (although for this to make any sense, memory must also be able to respond to multiple requests)
- see figure 5.5 on page 297, which illustrates how various benchmarks perform with hit under 1 miss, hit under 2 misses and hit under 64 misses
- the non-blocking cache is important for implementing some of the other optimizations which we will see next
39. Solution 12: Multi-banked Caches
- A common memory optimization is to have multiple memory banks
- each of which can be accessed independently
- this allows for parallel access to memory either by
- different devices accessing memory simultaneously
- or by accessing multiple words of the same block simultaneously
- We can use this idea on our caches as well, by interleaving cache addresses across banks
- the L2 of the AMD Opteron uses 2 banks, the Sun Niagara uses 4 banks
- To make this easy to implement, we interleave the blocks across banks
- for instance, for a 4-bank cache, block addresses with (address mod 4) = 0 are placed in bank 0, those with (address mod 4) = 1 in bank 1, etc. (see the sketch below)
- When would this approach be advantageous?
- we could use this to implement a form of non-blocking cache where we can access another bank on a miss (thus, we have hit under miss in 3 out of 4 instances), and this would be cheaper than a truly non-blocking cache
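A minimal sketch in C (my own illustration) of sequential interleaving of block addresses across a 4-bank cache, as described above:

    #include <stdio.h>

    int main(void) {
        int banks = 4;
        for (int block_addr = 0; block_addr < 8; block_addr++)
            printf("block %d -> bank %d\n", block_addr, block_addr % banks);
        return 0;
    }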
40. Solution 13: Early Restart
- On a cache miss, the memory system moves a block into cache
- moving a full block requires many memory accesses and bus transfers
- Rather than having the cache (and CPU) wait until the entire block is available
- deliver the requested word of the block first, to allow the cache access to complete as soon as that item is available
- transfer the rest of the block in parallel with that access
- This involves two ideas
- early restart: the cache forwards the requested word to the CPU as soon as it arrives from memory, rather than waiting for the whole block
- critical word first: have memory return the requested word first and the remainder of the block afterward (this is also known as wrapped fetch)
- note we could implement early restart without critical word first, but the improvement then depends on which word was needed; for instance, if the last word of the block is requested, early restart does nothing for us
- This optimization requires a non-blocking cache, since the cache needs to start responding to the request after the first word is returned
41. Solution 14: Merging Write Buffer
- We can make our write buffer more efficient as follows
- organize the write buffer in rows, where one row represents one refill line
- multiple writes to the same line are saved in the same buffer row
- a write to memory then moves the entire row from the buffer at once, reducing the number of writes
- Recall that a write-through cache may use a write buffer to make memory accesses more efficient
- the write buffer contains multiple items waiting to be written to memory
- with the write buffer, writes to memory are postponed until either the buffer is full or a modified refill line is being discarded
42. Solution 15: Compiler Optimizations
- Specific techniques include
- array merging: merge parallel arrays into an array of records so that the fields of a single element sit in consecutive memory locations and thus (hopefully) the same refill line
- loop interchange: exchange loops in a nested loop situation so that array elements are accessed in the order in which they are laid out in memory, not the programmer-prescribed order
- loop fusion: combine loops that access the same array locations so that all accesses to an element are made within one iteration
- blocking: execute code on one part (block) of the array before moving on to another part, so that array elements do not need to be reloaded into the cache
- this is common for applications like image processing where several different passes through a matrix are made
- We have already seen that compiler optimizations can be used to improve hardware performance
- There are many ways that compiler optimizations can improve cache performance by making sure that array accesses are done in a way that exploits the refill lines currently in cache; a loop interchange sketch appears below
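A minimal sketch in C (my own example, not from the slides) of loop interchange: C stores arrays in row-major order, so iterating j in the inner loop walks consecutive addresses and reuses each cache block, while the "bad" version strides through memory and touches a new block on almost every access:

    #include <stdio.h>

    #define N 1024
    static double x[N][N];

    static void bad_order(void) {          /* column-major traversal: poor locality */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] = 2.0 * x[i][j];
    }

    static void interchanged(void) {       /* row-major traversal: good locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2.0 * x[i][j];
    }

    int main(void) {
        bad_order();
        interchanged();
        printf("both versions compute the same result; the second misses far less\n");
        return 0;
    }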
43. Solution 16: Hardware Prefetching
- Prefetching can be controlled by hardware or by the compiler (see the next slide)
- prefetching can operate on instructions, data, or both
- one simple idea for instruction prefetching: when there is an instruction miss to block i, fetch block i and then also fetch block i + 1
- this might be controlled by hardware outside of the cache
- the second block is left in an instruction stream buffer and not moved into the cache
- if a requested instruction is in the instruction stream buffer, the cache access is cancelled (and thus a potential miss is avoided)
- if prefetching is to place multiple blocks into the cache, then the cache must be non-blocking
- See figure 5.10 on page 306, which shows the speedup of many SPEC 2000 benchmarks (mostly FP) when hardware prefetching is turned on (speedups range from 1.16 to 1.97)
44. Solution 17: Compiler Prefetching
- Another idea is to have the compiler insert prefetching instructions into the code (this is for data prefetching only)
- Consider the loop
    for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];
- if we have an 8KB direct-mapped data cache with 16 byte blocks, and each element of a and b is 8 bytes long (double precision floats), we will have 150 misses for array a and 101 misses for array b
- by scheduling the code with prefetch instructions, we can reduce the misses
- The new loop becomes
    for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);    /* prefetch 7 iterations later */
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }
- This new code has only 19 misses, improving performance to 6.2 times faster
- See pages 307-308 for the rest of the analysis of this problem
45. Solution 18: Victim Caches
- Misses might arise when refill lines conflict with each other
- one line is discarded for another, only to find that the discarded line is needed again in the near future
- The victim cache is a small, fully associative cache placed between the cache and memory
- this cache might store 1-5 blocks
- The victim cache only stores blocks that are discarded from the cache when a miss occurs
- the victim cache is checked on a miss before going on to main memory, and if the block is found there, the block in the cache and the block in the victim cache are swapped
The victim cache is most useful if it backs up a fast direct-mapped cache, reducing the direct-mapped cache's conflict miss rate by adding some associativity. A 4-entry victim cache might remove ¼ of the misses from a 4KB direct-mapped data cache. The AMD Athlon uses an 8-entry victim cache.
46. Cache Optimization Summary
Hardware complexity ranges from 0 (cheapest/easiest) to 3 (most expensive/hardest)
47. Continued
48. Sample Problem 1
- A computer uses a fully associative write-back data cache
- Block size is 64 bytes
- Given the code below, assume x[1024], b[1024] and c[1024] are all double precision floating point arrays, with arrays b and c already in cache but not x
- moving elements of x into the cache will not cause elements of b or c to be moved out of cache
- x[0] is stored at the beginning of a block
- Questions
- a. how many misses arise with respect to accessing x if the cache uses a no-write allocate policy?
- b. how many misses arise with respect to accessing x if the cache uses a write allocate policy?
- c. redo part a assuming that statement S2 comes before statement S1
- d. redo part b assuming that statement S2 comes before statement S1
    for (i = 1; i < 1024; i++) {
        x[i] = b[i] + y;    // S1
        c[i] = x[i] + z;    // S2
    }
49. Solution
- Each array stores 1024 doubles
- A refill line stores 64 bytes, so we can fit exactly 8 array elements into each refill line
- Our misses arise only because of accesses to array x
- since b and c are already in the cache
- The no-write allocate policy means that the write in S1 will not load x into the cache, therefore a write miss in S1 is followed by a read miss in S2
- Any read miss will bring in a refill line along with the next 7 array elements
- so a read miss occurs for 1 in 8 array elements
- we will have 1024 / 8 = 128 read misses
- each of those read misses will have been preceded by a write miss, giving us a total of 256 misses
- The write allocate policy means that the write miss in S1 loads the line into cache, so we won't have a corresponding read miss in S2, and therefore we have a total of 128 misses
- Reversing the order of S1 and S2 means the read miss happens first, so no matter which write policy is used, we will have 128 read misses and no additional write misses (the sketch below simulates all four cases)
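A minimal sketch in C (my own simulation of this problem) that counts the cache misses caused by array x under both write policies and both statement orders, assuming 64-byte blocks (8 doubles) and that b and c never evict x:

    #include <stdio.h>
    #include <string.h>

    #define N 1024
    #define ELEMS_PER_BLOCK 8               /* 64-byte block / 8-byte double */

    static int resident[N / ELEMS_PER_BLOCK];   /* 1 if x's block is in the cache */

    static void access_x(int i, int is_write, int write_allocate, int *misses) {
        int blk = i / ELEMS_PER_BLOCK;
        if (resident[blk]) return;              /* hit */
        (*misses)++;
        if (!is_write || write_allocate)        /* reads always allocate */
            resident[blk] = 1;
    }

    static int run(int write_allocate, int s2_first) {
        int misses = 0;
        memset(resident, 0, sizeof(resident));
        for (int i = 1; i < N; i++) {
            if (s2_first) {                     /* S2: read x[i], then S1: write x[i] */
                access_x(i, 0, write_allocate, &misses);
                access_x(i, 1, write_allocate, &misses);
            } else {                            /* S1: write x[i], then S2: read x[i] */
                access_x(i, 1, write_allocate, &misses);
                access_x(i, 0, write_allocate, &misses);
            }
        }
        return misses;
    }

    int main(void) {
        printf("a. no-write allocate, S1 then S2: %d misses\n", run(0, 0));  /* 256 */
        printf("b. write allocate,    S1 then S2: %d misses\n", run(1, 0));  /* 128 */
        printf("c. no-write allocate, S2 then S1: %d misses\n", run(0, 1));  /* 128 */
        printf("d. write allocate,    S2 then S1: %d misses\n", run(1, 1));  /* 128 */
        return 0;
    }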
50. Sample Problem 2
- Memory is organized as follows
- two on-chip caches (one data, one instruction)
- an off-chip (second level) cache
- main memory
- disk cache
- disk (swap space)
- Assume miss rates and access times of
- data cache: 5%, 1 clock cycle
- instruction cache: 1%, 1 clock cycle
- off-chip cache: 10%, 10 clock cycles
- main memory: 0.2%, 100 clock cycles
- disk cache: 20%, 1000 clock cycles
- swap space: 0%, 250,000 clock cycles
- If 40% of all instructions are loads or stores, what is the effective memory access time for this machine?
51. Solution
- Average memory access time =
- % instruction accesses × (hit time instruction cache + miss rate instruction cache × (hit time second level cache + miss rate second level cache × (hit time main memory + miss rate main memory × (hit time disk cache + miss rate disk cache × hit time disk))))
- + % data accesses × (hit time data cache + miss rate data cache × (hit time second level cache + miss rate second level cache × (hit time main memory + miss rate main memory × (hit time disk cache + miss rate disk cache × hit time disk))))
- With 1.4 memory accesses per instruction, the % of instruction accesses = 1.0 / 1.4 = 71.4%, and the % of data accesses = 0.4 / 1.4 = 28.6%
- Average memory access time
- = 71.4% × (1 + .01 × (10 + .10 × (100 + .002 × (1000 + .20 × 250000)))) + 28.6% × (1 + .05 × (10 + .10 × (100 + .002 × (1000 + .20 × 250000)))) = 1.647 clock cycles (the sketch below evaluates this expression)
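A minimal sketch in C (my own code) evaluating the nested hierarchy formula above for both the instruction and data streams:

    #include <stdio.h>

    int main(void) {
        /* the expected access time beyond the L1 caches is the same for both streams */
        double beyond_l1 = 10.0 + 0.10 * (100.0 + 0.002 * (1000.0 + 0.20 * 250000.0));

        double t_instr = 1.0 + 0.01 * beyond_l1;   /* instruction cache: 1% miss rate */
        double t_data  = 1.0 + 0.05 * beyond_l1;   /* data cache: 5% miss rate        */

        double frac_instr = 1.0 / 1.4;             /* ~71.4% of accesses */
        double frac_data  = 0.4 / 1.4;             /* ~28.6% of accesses */

        printf("effective memory access time = %.3f clock cycles\n",
               frac_instr * t_instr + frac_data * t_data);   /* ~1.647 */
        return 0;
    }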