Title: Cache
1. Cache
- Memory hierarchy is our solution for providing seemingly unlimited fast accesses - at least 1 instruction fetch and maybe 1 data access per cycle, more for a superscalar
- Each level of the hierarchy is based (at least in part) on the Principle of Locality of Reference
- as we move higher in the hierarchy, each level gets faster but also more expensive, and therefore it is more restricted in size
- some issues are generic across the hierarchy, but each level has its unique characteristics, technology and solutions
- we have already looked at registers; here we will study cache
The higher levels of the hierarchy (which are faster) are more expensive, so we have less storage there. But we want to avoid accesses at the lower levels because of their longer access times.
2. Effect on Performance
- Memory access time has a direct effect on CPU performance
- CPU execution time = (CPU clock cycles + memory stall cycles) × clock cycle time
- memory stall cycles = IC × memory references per instruction × miss rate × miss penalty
- memory references per instruction is at least 1, since there will be the instruction fetch itself and possibly 1 or more data accesses
- whenever an instruction or datum is not in registers, we must fetch it from cache, but if it is not in cache we accrue a miss penalty by having to access the much slower main memory
- A large enough miss penalty will cause a substantial increase in CPU execution time; consider for example
- CPI = 1.0 when all memory accesses are hits
- data accesses occur only during loads and stores (50% of all instructions are loads or stores)
- miss penalty is 25 clock cycles, miss rate is 2%
- what is the impact on CPI?
- CPI = 1.0 + 100% × 2% × 25 + 50% × 2% × 25 = 1.75
- so we are 75% slower than we might be because of cache misses (the arithmetic is worked through in the sketch below)
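A minimal sketch in C (my own illustration, not from the slides) that plugs this example's numbers into CPI = base CPI + accesses per instruction × miss rate × miss penalty:

    #include <stdio.h>

    int main(void) {
        double base_cpi       = 1.0;   /* CPI when every memory access hits      */
        double instr_accesses = 1.0;   /* one instruction fetch per instruction  */
        double data_accesses  = 0.5;   /* 50% of instructions are loads/stores   */
        double miss_rate      = 0.02;  /* 2% miss rate                           */
        double miss_penalty   = 25.0;  /* 25 clock cycles                        */

        double stall_cpi = (instr_accesses + data_accesses) * miss_rate * miss_penalty;
        double cpi = base_cpi + stall_cpi;     /* 1.0 + 1.5 * 0.02 * 25 = 1.75 */

        printf("CPI with cache misses = %.2f\n", cpi);
        printf("slowdown = %.0f%%\n", (cpi / base_cpi - 1.0) * 100.0);   /* 75%% slower */
        return 0;
    }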
3. Four questions
- The general piece of memory will be called a block
- blocks differ in size depending on the level of the memory hierarchy - cache block, memory block, disk block
- We ask the following questions
- Q1: where can a block be placed?
- Q2: how is a block found?
- Q3: which block should be replaced on a miss?
- Q4: what happens on a write?
- The answers to questions 1-3 differ depending on the type of cache - direct mapped, associative, set-associative
- The answer to the last question is based on the write policy - write through or write back
4. Q1: Where can a block be placed?
- Cache type determines placement
- Associative cache
- any available block
- Direct mapped cache
- a given memory block has only one location where it can be placed in cache, determined by the equation
- (block address) mod (number of blocks in cache)
- Set associative cache
- a given memory block has one set of blocks in the cache where it can be placed, determined by
- (block address) mod (number of sets), where number of sets = (number of blocks in cache) / associativity
Here we have a cache of 8 blocks and a memory of 32 blocks; to place memory block 12, we can put it in any block of the associative cache, in block 4 of the direct mapped cache, and in block 0 or 1 (set 0) of a 2-way set associative cache (the mapping is computed in the sketch below)
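A minimal sketch in C (my own illustration, using the slide's 8-block cache and memory block 12) of the placement rules for the three cache types:

    #include <stdio.h>

    int main(void) {
        int cache_blocks  = 8;     /* cache holds 8 blocks           */
        int associativity = 2;     /* for the set associative case   */
        int block_addr    = 12;    /* memory block we want to place  */

        /* direct mapped: exactly one possible line */
        int dm_line = block_addr % cache_blocks;                 /* 12 mod 8 = 4 */

        /* set associative: one possible set, any way within it */
        int num_sets = cache_blocks / associativity;             /* 8 / 2 = 4    */
        int sa_set   = block_addr % num_sets;                    /* 12 mod 4 = 0 */

        printf("direct mapped: line %d\n", dm_line);
        printf("2-way set associative: set %d (cache blocks %d and %d)\n",
               sa_set, sa_set * associativity, sa_set * associativity + 1);
        printf("fully associative: any of the %d blocks\n", cache_blocks);
        return 0;
    }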
5. Q2: How is a block found in cache?
- All memory addresses consist of a tag, a line number (or index), and a block offset
- in a direct mapped cache, the line number dictates the line where a block must be placed or where it will be found
- the tag is used to make sure that the line we have found holds the block we want
- in a set associative cache, the line number references a set of lines
- the block must be placed in one of those lines, but there is some variability: which line should we put it in, and which line will we find it in?
- we compare the tag of the requested address against the tag of each line in the set selected by the line number, and a multiplexor passes along the matching line's data
- in a fully associative cache, a line can go anywhere
- since the fully associative cache is a single set containing all the lines, we have to compare all blocks' tags at once; this is done through associative memory (a large number of comparators)
(a sketch of splitting an address into these fields follows below)
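A minimal sketch in C (my own field sizes, not a specific processor) showing how a cache splits an address into block offset, line number (index) and tag:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* assumed geometry: 64-byte blocks, 512 lines, direct mapped */
        const int offset_bits = 6;                  /* log2(64)  */
        const int index_bits  = 9;                  /* log2(512) */

        uint32_t addr   = 0x0001A2C4u;              /* arbitrary example address */
        uint32_t offset = addr & ((1u << offset_bits) - 1);
        uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        /* a lookup goes to line 'index' and compares the stored tag (plus a valid bit)
           against 'tag'; a match is a hit, otherwise a miss */
        printf("addr=0x%08X -> tag=0x%X index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }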
6. Q3: Which block should be replaced?
- For the direct-mapped cache
- there is no choice of which line a new block must be placed in, so there is no need for a replacement strategy
- For set-associative and fully associative caches
- we have a choice: for instance, in a 2-way set-associative cache we can place the new block in either line of the selected set
- for the fully associative cache, our choice is to place the new block into any line at all
- There are three common replacement strategies
- random
- FIFO
- least recently used (LRU)
- this best models locality of reference as it predicts which block will not be used again in the near future by looking at how long ago each block was last accessed; however, it is difficult to implement, especially in hardware
- instead we might use a variation, such as an LRU approximation or least frequently used
- figure C.4 on page C-10 compares the data miss rates when using FIFO, Random, and LRU replacement strategies
- what we see is that LRU is always the best, but the difference between the three is not great, and random is often as good as FIFO
7. Q4: What happens on a write?
- Writes occur only to data
- when the datum is loaded into cache, there are (at least) two copies of it, one in the cache and one in main memory (other copies may exist if we have multiple levels of cache)
- on a cache write, the datum in cache is modified, but what happens to the old (now stale) value in memory?
- Write Through cache
- write the datum to both cache and memory at the same time
- this is inefficient because the data access is a word while typical data movement between cache and memory is a block, so this write uses only part of the bus for a transfer
- notice that other words in the same block may also soon be updated, so waiting could pay off
- Write Back cache
- write to cache only, and wait on writing to memory until the entire block is being removed from cache
- add a dirty bit to the cache to indicate that the cache value is current and memory is out of date
- Write Through is easier to implement since memory will always be up-to-date and we don't need dirty bit mechanisms
- Write Back is preferred to reduce memory traffic (a write stall occurs in Write Through if the CPU must wait for the write to take place)
8. Write Miss Example
- What happens on a write miss?
- write allocate
- the block is fetched into the cache on a miss, and then the write proceeds as on a write hit
- no-write allocate
- the block is modified in memory without being brought into the cache
- Consider a write-back cache which starts empty and has the sequence of operations shown below
- how many hits and how many misses occur with no-write allocate versus write allocate?
- Solution
- for no-write allocate
- the first two operations cause misses (since after the first one, block 100 is still not loaded into cache), the third operation causes a miss, the fourth operation is a hit (since 200 is now in cache), but the fifth is also a miss, so 4 misses, 1 hit
- for write allocate
- the first access to a memory location is always a miss, but from there the block is in cache and the rest are hits, so we have 2 misses (one for each of 100 and 200) and 3 hits (the sketch below replays this sequence)
    Write Mem[100]
    Write Mem[100]
    Read  Mem[200]
    Write Mem[200]
    Write Mem[100]
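A minimal sketch in C (my own simulation harness) that replays the five operations above against a tiny block-presence set to count hits and misses under the two policies:

    #include <stdio.h>

    #define MAX_BLOCKS 16

    static int cached[MAX_BLOCKS];   /* addresses currently resident in the cache */
    static int ncached;

    static int in_cache(int addr) {
        for (int i = 0; i < ncached; i++)
            if (cached[i] == addr) return 1;
        return 0;
    }

    static void load(int addr) { cached[ncached++] = addr; }

    int main(void) {
        /* 1 = write, 0 = read, paired with the block address accessed */
        int is_write[] = {1, 1, 0, 1, 1};
        int addr[]     = {100, 100, 200, 200, 100};

        for (int write_allocate = 0; write_allocate <= 1; write_allocate++) {
            ncached = 0;
            int hits = 0, misses = 0;
            for (int i = 0; i < 5; i++) {
                if (in_cache(addr[i])) { hits++; continue; }
                misses++;
                /* reads always allocate; writes allocate only under write allocate */
                if (!is_write[i] || write_allocate) load(addr[i]);
            }
            printf("%-17s: %d misses, %d hits\n",
                   write_allocate ? "write allocate" : "no-write allocate", misses, hits);
        }
        return 0;      /* expect 4 misses/1 hit and 2 misses/3 hits, as in the slide */
    }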
9. Using a Write Buffer
- To alleviate the inefficiency of Write Through, we may add a write buffer
- writes go to cache and to the buffer, and the CPU continues without stalling
- writes to memory occur when the buffer is full or when a line is filled
- thus, a write to memory is done in parallel with the processor continuing execution
- although stalls can still arise when using a write buffer, they are less frequent
- question: on a read or write miss, should we examine the write buffer?
- the buffer may still be storing the needed value, so checking it lets us avoid the time consuming memory access
- Example
- given the three memory operations below, where locations 512 and 1024 map to the same direct-mapped cache line, if the cache is a write-through cache with a 4-word write buffer that is not checked on a cache miss, will the value in R2 equal the value in R3 at the end? not necessarily
    SW R3, 512(R0)   // value written to write buffer, but not yet to memory
    LW R1, 1024(R0)  // cache miss, new block brought in, replacing the block holding 512
    LW R2, 512(R0)   // cache miss, fetch Mem[512], which might still be the old value
                     // since the write buffer may not have written to memory yet!
10. Example: Opteron Data Cache
- Found in AMD Opteron processors
- 2-way set associative, 512 sets (512 blocks per way)
- 64 Kbytes in 64-byte blocks
- write-back, write-allocate
- CPU address: 40-bit address
- comprised of a 25-bit tag, 9-bit line (set) number and 6-bit byte offset
- the tags of both blocks in the selected set are compared to the tag of the address (a 2:1 MUX is used to select which (if either) datum to return to the CPU)
- a dirty bit is used to denote that a block has been modified
- so that it can be written back to memory prior to being removed from the cache
- on a miss, the cache notifies the processor of the miss so that the processor can stall, and the request continues down to the next level of the hierarchy
- which takes 7 further cycles to get the first 8 bytes of the needed block
- the replacement strategy is LRU, using a single bit per set to denote which block has been accessed most recently and replacing the other one
(the address breakdown is checked in the sketch below)
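A minimal sketch in C (my own arithmetic check, not Opteron RTL) verifying the address breakdown given above for a 64 KB, 2-way, 64-byte-block cache with 40-bit addresses:

    #include <stdio.h>

    static int log2_int(int x) {        /* number of bits needed to index x items */
        int b = 0;
        while ((1 << b) < x) b++;
        return b;
    }

    int main(void) {
        int cache_bytes = 64 * 1024, block_bytes = 64, associativity = 2, addr_bits = 40;

        int blocks      = cache_bytes / block_bytes;              /* 1024 blocks */
        int sets        = blocks / associativity;                 /* 512 sets    */
        int offset_bits = log2_int(block_bytes);                  /* 6 bits      */
        int index_bits  = log2_int(sets);                         /* 9 bits      */
        int tag_bits    = addr_bits - index_bits - offset_bits;   /* 25 bits     */

        printf("sets=%d, offset=%d bits, index=%d bits, tag=%d bits\n",
               sets, offset_bits, index_bits, tag_bits);
        return 0;
    }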
11. Split Versus Unified Caches
- Notice that the previous example was of a data cache only
- Why have separate data and instruction caches?
- below are statistics indicating the performance of two split caches versus a single unified cache
- the unified cache would be twice the size of each of the split caches, so compare, for instance, two 8KB caches to a 16KB unified cache
Note: this table does not show miss rate - we are seeing misses per 1000 instructions, not per 1000 accesses. Divide by 10 to get misses per 100 instructions (e.g., 6.3 for the 8KB unified cache).
Why do you suppose the instruction cache performs so much better than the data cache of equal size?
12. Example
- Compare 16 KB instruction and 16 KB data caches vs. a 32 KB unified cache, assuming
- 1 clock cycle hit time
- 100 clock cycle miss penalty for the individual caches
- add 1 clock cycle to the hit time for a load/store in the unified cache (36% of instructions are loads/stores)
- since a single cache cannot handle both an instruction fetch and a data access in the same cycle
- write-through caches with a write buffer, no stalls on writes
- What is the average memory access time for both organizations?
- To determine each cache's performance, we compute the average memory access time
- average memory access time = hit time + miss rate × miss penalty
- as noted above, for the unified cache we have a higher hit time because the unified cache will need two accesses at a time if an instruction is a load or store
13. Solution
- We get misses per instruction from the table on the earlier slide
- Converting to miss rate
- instruction cache: (3.82 / 1000) / 1.0 accesses per instruction = .00382
- data cache: (40.9 / 1000) / .36 accesses per instruction = .1136
- unified cache: (43.3 / 1000) / 1.36 accesses per instruction = .0318
- Of the 136 accesses per 100 instructions, the percentage of instruction accesses = 100 / 136 = 74%
- Percentage of data accesses = 36 / 136 = 26%
- memory access time for the two split caches = 74% × (1 + .00382 × 100) + 26% × (1 + .1136 × 100) = 4.236
- memory access time for the unified cache = 74% × (1 + .0318 × 100) + 26% × (2 + .0318 × 100) = 4.44
- Separate caches perform better (the sketch below reproduces this calculation)
- note a typo in the book: it states a 100 clock cycle miss penalty but uses 200 in its solution
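A minimal sketch in C (my own code) reproducing the split vs. unified calculation, using the 100-cycle miss penalty as the slide does:

    #include <stdio.h>

    int main(void) {
        double miss_penalty = 100.0;

        /* miss rates converted from misses per 1000 instructions */
        double mr_instr   = (3.82 / 1000.0) / 1.0;    /* 16 KB instruction cache */
        double mr_data    = (40.9 / 1000.0) / 0.36;   /* 16 KB data cache        */
        double mr_unified = (43.3 / 1000.0) / 1.36;   /* 32 KB unified cache     */

        double frac_instr = 1.00 / 1.36;              /* ~74% of accesses */
        double frac_data  = 0.36 / 1.36;              /* ~26% of accesses */

        double amat_split   = frac_instr * (1.0 + mr_instr * miss_penalty)
                            + frac_data  * (1.0 + mr_data  * miss_penalty);
        double amat_unified = frac_instr * (1.0 + mr_unified * miss_penalty)
                            + frac_data  * (2.0 + mr_unified * miss_penalty); /* extra hit cycle */

        printf("split caches  : %.3f cycles\n", amat_split);     /* ~4.24 */
        printf("unified cache : %.3f cycles\n", amat_unified);   /* ~4.44 */
        return 0;
    }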
14. Revised CPU Performance
- Recall our previous CPU formula
- CPU time = (CPU clock cycles + memory stall cycles) × clock cycle time
- assume memory stalls are caused by cache misses, not by problems like bus contention, I/O, etc.
- Memory stall cycles
- = memory accesses × miss rate × miss penalty
- = reads × read miss rate × read miss penalty + writes × write miss rate × write miss penalty
- CPU time
- = IC × (CPI + memory accesses per instruction × miss rate × miss penalty) × clock cycle time
- = IC × (CPI × clock cycle time + memory accesses per instruction × miss rate × memory access time)
- The memory access time is independent of the clock speed
- so an interesting tradeoff arises: the faster the clock, the better the CPU performance, but the higher the miss penalty (measured in clock cycles)!
15. Associativity Example
- What impact does cache organization (direct-mapped vs. 2-way set associative) have on a CPU?
- assume CPI with a perfect cache is 1.6, clock cycle time is .35 ns, 1.4 memory references per instruction, and 128 KB data and instruction caches with 64 byte blocks each
- cache 1 is direct-mapped with a miss rate of 2.1%, cache 2 is 2-way set associative with a miss rate of 1.9%
- the direct-mapped cache allows a faster clock; assume the clock cycle time is lengthened by 35% for the 2-way set associative cache
- cache miss penalty = 65 ns for either cache
- CPU Time Cache 1
- = IC × (1.6 × .35 + 1.4 × .021 × 65) = 2.47 × IC
- CPU Time Cache 2
- = IC × (1.6 × .35 × 1.35 + 1.4 × .019 × 65) = 2.49 × IC
- CPU with Cache 1 is 2.49 / 2.47 = 1.01 times faster (see the sketch below)
- Note: the cache miss penalty should actually be higher for cache 2 - why?
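A minimal sketch in C (my own code) for this example, computing per-instruction time as CPI × cycle time + references per instruction × miss rate × miss penalty (in ns):

    #include <stdio.h>

    int main(void) {
        double cpi_exec     = 1.6;
        double cycle_dm     = 0.35;          /* ns, direct mapped            */
        double cycle_2way   = 0.35 * 1.35;   /* ns, clock lengthened by 35%  */
        double refs_per_ins = 1.4;
        double penalty_ns   = 65.0;

        double t_dm   = cpi_exec * cycle_dm   + refs_per_ins * 0.021 * penalty_ns;
        double t_2way = cpi_exec * cycle_2way + refs_per_ins * 0.019 * penalty_ns;

        printf("direct mapped : %.2f ns per instruction\n", t_dm);    /* ~2.47 */
        printf("2-way assoc.  : %.2f ns per instruction\n", t_2way);  /* ~2.49 */
        printf("speedup of direct mapped = %.2f\n", t_2way / t_dm);
        return 0;
    }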
16. Out of Order and Miss Penalty
- In our prior examples, cache misses caused the pipeline to stall, thus impacting CPI
- in a multiple-issue out-of-order execution architecture, like Tomasulo's, a data miss means that a particular instruction stalls, possibly stalling others because it ties up a reservation station or reorder buffer slot, but it may well not impact overall CPI
- How then do we determine the impact of cache misses on such architectures?
- we might define memory stall cycles / instruction = misses / instruction × (total miss latency - overlapped miss latency)
- total miss latency is the total memory latency accrued by an instruction's miss
- overlapped miss latency is the amount of that time during which the miss does not impact performance because other instructions continue executing
- these two terms are difficult to analyze, so we won't cover this in any more detail
- typically a multiple-issue out-of-order architecture can hide some of the miss penalty, up to 30% as shown in an example on pages C-20 - C-21
17. Improving Cache Performance
- Having reviewed 1000s of research papers on caches, the authors provide a number of approaches to improve cache performance
- recall: average memory access time = hit time + miss rate × miss penalty
- these approaches can be categorized by which aspect of the above formula they are trying to reduce
- reduce hit time
- reduce miss rate
- reduce miss penalty (or increase cache bandwidth so that part of the miss penalty can be reduced)
- use parallelism in order to reduce miss rate and/or penalty
- Comments
- miss penalty is the biggest value in the equation, so it would seem the obvious target to reduce, but in fact little can be done to increase memory speed
- reducing miss rate has a number of different approaches, however miss rates today are often less than 2% - can we continue to improve?
- reducing hit time has the benefit of allowing us to lower the clock cycle time as well
18. The 3 Cs
- Cache misses can be categorized as
- compulsory misses
- the very first access to a block cannot be a hit because the process has just begun and there has not been a chance to load that block into the cache
- capacity misses
- the cache cannot contain all of the blocks needed by the process
- conflict misses
- the block placement strategy only allows a block to be placed in certain locations in the cache, bringing about contention with other blocks for that same location
- some of the approaches we will see attempt to reduce one particular type of miss
- figure C.8 on page C-23 demonstrates the miss rates for various cache sizes and associativities
- conflict misses are most common in direct-mapped caches and are 0 in fully associative caches, whereas capacity misses are the most common cause of a cache miss
19. Solution 1: Larger Block Sizes
- Larger block sizes will reduce compulsory misses
- larger blocks take more advantage of spatial locality
- but, larger blocks can increase the miss penalty because it physically takes longer to transfer the block from main memory to cache
- also, larger blocks mean fewer blocks in the cache, which itself can increase the miss rate
- this depends on program layout and the size of the cache vs. the block size
A block size of 64 to 128 bytes provides the lowest miss rates
20. Example: Impact of Block Size
- Assume memory takes 80 cycles to respond and then delivers 16 bytes every 2 clock cycles thereafter
- Which block size has the minimum average memory access time for each cache size?
- average memory access time = hit time + miss rate × miss penalty
- hit time = 1, miss rates come from the previous slide
- miss penalty depends on the size of the block (82 cycles for 16 bytes, 84 cycles for 32 bytes, etc.; for k byte blocks, miss penalty = 80 + (k / 16) × 2)
- Solution
- 16 byte block, 4 KB cache: 1 + (8.57% × 82) = 8.027 clock cycles
- 256 byte block, 256 KB cache: 1 + (.49% × 112) = 1.549 clock cycles
We must compromise: a bigger block size reduces the miss rate to some extent, but also increases the miss penalty (computed in the sketch below)
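A minimal sketch in C (my own code) for this example: the miss penalty is 80 cycles plus 2 cycles per 16 bytes transferred, and the two miss rates are those quoted on the slide:

    #include <stdio.h>

    static double miss_penalty(int block_bytes) {
        return 80.0 + (block_bytes / 16.0) * 2.0;
    }

    int main(void) {
        /* 16-byte blocks in a 4 KB cache: 8.57% miss rate */
        double amat_small = 1.0 + 0.0857 * miss_penalty(16);
        /* 256-byte blocks in a 256 KB cache: 0.49% miss rate */
        double amat_large = 1.0 + 0.0049 * miss_penalty(256);

        printf("16B block, 4KB cache    : %.3f cycles\n", amat_small);  /* ~8.027 */
        printf("256B block, 256KB cache : %.3f cycles\n", amat_large);  /* ~1.549 */
        return 0;
    }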
21. Solution 2: Larger Caches
- A larger cache will reduce capacity miss rates since the cache has a larger capacity, and also conflict miss rates because the larger cache holds more refill lines and so has fewer conflicts
- This is an obvious solution and has no apparent performance drawbacks, but
- you must be careful where you put this larger cache
- a larger on-chip cache might take space away from other hardware that could provide performance increases (registers, more functional units, logic for multiple issue of instructions, etc.)
- and more cache means a greater expense for the machine
- Larger caches are also often slower as they need more multiplexors
- the best place for a large cache is off the chip, where we have space and where we don't mind a slightly slower cache, given that the miss penalty will already be larger due to having to go off the chip
- the authors note that the second-level caches of 2001-era computers are equal in size to the main memories of 10 years earlier!
22. Solution 3: Higher Associativity
- Cache research points out the 2:1 cache rule of thumb
- a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2, so larger associativity yields smaller miss rates
- However, a large 8-way set associative cache will have close to a 0% conflict miss rate
- meaning that an 8-way set associative cache is about as good as a fully associative cache
- so we don't need associativity beyond 8-way
- Why use direct-mapped?
- associativity always raises the hit time
- How big is the difference?
- as we saw in an earlier example, a 2-way set associative cache had a clock cycle 35% slower than the direct-mapped cache
- remember that the clock cycle time is usually tied to the cache hit time, so we wind up slowing down the entire computer when using an associative cache of some kind
So, with this in mind, should we use direct-mapped or set associative caches?
23. Example: Impact of Associativity
- Average memory access time = hit time + miss rate × miss penalty
- 4 KB direct mapped: 1 + .098 × 25 = 3.45
- 512 KB 8-way: 1.52 + .006 × 25 = 1.67
- Assume higher associativity increases clock cycle time as follows
- clock 2-way = 1.36 × clock direct mapped
- clock 4-way = 1.44 × clock direct mapped
- clock 8-way = 1.52 × clock direct mapped
- assume the L1 cache is direct-mapped with a 1 cycle hit time
- Determine the best L2 organization
- given that the miss penalty for the direct-mapped L2 is 25 cycles and L2 never misses
In most cases, higher associativity means a higher access time, so direct mapped is often the best
24. Solution 4: Multilevel Caches
- To improve performance, we find that we would like
- a faster cache to keep pace with the processor
- a larger cache to lower the miss rate
- Which should we pick? Both
- offer a small but fast first level cache (L1)
- offer a larger but slower second level cache (L2) on the chip
- since this second cache is larger, it will be somewhat slower; we make it even slower by using some degree of associativity to improve its hit rate
- this gives us a new formula for average memory access time
- average memory access time = hit time L1 + miss rate L1 × miss penalty L1
- miss penalty L1 = hit time L2 + miss rate L2 × miss penalty L2
- average memory access time = hit time L1 + miss rate L1 × (hit time L2 + miss rate L2 × miss penalty L2)
- We redefine miss rate for the second cache
- local miss rate = number of misses in this cache / number of memory accesses reaching this cache
- global miss rate = number of misses in this cache / total number of memory accesses
- these two values are the same for the first level cache
25. Impact of L2 Cache
- The local miss rate for the second cache will be larger than the local miss rate for the first cache, since the first cache skims the cream of the crop
- the second level cache is only accessed when the first level misses entirely
- the global miss rate is more useful than the local miss rate for the second cache
- the global miss rate tells us how many misses there are out of all memory accesses
- For example
- assume that in 1000 references, L1 has 40 misses and L2 has 20
- local (and global) miss rate of cache 1 = 40 / 1000 = 4%
- local miss rate of cache 2 = 20 / 40 = 50%, global miss rate of cache 2 = 20 / 1000 = 2%
- the local miss rate of cache 2 is misleading; the global miss rate gives us an indication of how both caches perform overall
- L1 hit time is 1 cycle, L2 hit time is 10 cycles, memory access time is 200 cycles
- what is the average memory access time?
- average memory access time = 1 + 4% × (10 + 50% × 200) = 5.4 cycles
- without L2, we have average memory access time = 1 + 4% × 200 = 9, so the L2 cache gives us a 9 / 5.4 = 1.67, or 67%, speedup! (the sketch below reproduces these numbers)
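A minimal sketch in C (my own code) of the two-level example: local vs. global miss rates and the resulting average memory access time:

    #include <stdio.h>

    int main(void) {
        double refs      = 1000.0;
        double l1_misses = 40.0;
        double l2_misses = 20.0;

        double l1_miss_rate        = l1_misses / refs;        /* 4%, local == global */
        double l2_local_miss_rate  = l2_misses / l1_misses;   /* 50% */
        double l2_global_miss_rate = l2_misses / refs;        /* 2%  */

        double l1_hit = 1.0, l2_hit = 10.0, mem = 200.0;
        double amat_l2   = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * mem);
        double amat_nol2 = l1_hit + l1_miss_rate * mem;

        printf("L2 local miss rate = %.0f%%, global = %.0f%%\n",
               l2_local_miss_rate * 100, l2_global_miss_rate * 100);
        printf("AMAT with L2    = %.1f cycles\n", amat_l2);    /* 5.4 */
        printf("AMAT without L2 = %.1f cycles (speedup %.2f)\n",
               amat_nol2, amat_nol2 / amat_l2);
        return 0;
    }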
26. Associativity for the L2 Cache
- Here we see the benefit of an associative second-level cache instead of a direct-mapped one
- compare a direct-mapped vs. a 2-way set associative L2, assuming the 2-way set associative cache is 10% slower than the direct-mapped cache
- direct-mapped L2 has a hit time of 10 cycles
- direct-mapped L2 has a local miss rate of 25%
- 2-way set associative L2 has a hit time of 10.1 cycles
- 2-way set associative L2 has a local miss rate of 20%
- miss penalty of L2 = 200 cycles
- direct-mapped L2: L1 miss penalty = 10 + .25 × 200 = 60 cycles
- 2-way set associative L2: L1 miss penalty = 10.1 + .20 × 200 = 50.1 cycles
- we round 10.1 and 50.1 up to 11 and 51 respectively, since the L2 cache is still governed by the system clock
27. Solution 5: Priority of Reads over Writes
- Reads occur with a much greater frequency than writes
- we want to make the common case fast, so we would like to make sure that reads are faster than writes
- writes are slower because of the need to write to both cache and main memory as well as to perform a tag check prior to starting the write (a tag check on a read can be done in parallel with the read; if the tag is incorrect, we just cancel the remainder of the read)
- we might use a write buffer with either write policy
- a write-through cache writes to the write buffer first, and any read misses are given priority over writing the write buffer to memory
- a write-back cache writes to the write buffer, and the write buffer is only written to memory when we are assured of no conflict with a read miss
- in either case, a read miss will first examine the write buffer before going on to memory, in order to potentially save time
- so, read misses have priority over writes; since read misses are more common, we make the common case fast
28. Solution 6: Avoid Address Translation
- The CPU generates an address and sends it to cache
- but the address generated by the CPU is a logical (virtual) address, not the physical address in memory
- the addresses differ because of paging and virtual memory
- caches normally store items based on their physical address (for instance, the line number is derived from the physical address, not the virtual address)
- to obtain the physical address, the virtual address must first be translated using the TLB (or the page table if the entry isn't in the TLB)
- thus, any memory access will actually take at least 2 cache-like accesses, and we want to prevent this
- If we store virtual addresses in the cache, we can skip this translation, but there are problems with this approach
- the address translation process includes protection mechanisms to make sure that a generated address is not trying to access some other process's portion of memory (such as a user process accessing the OS)
- a solution is to copy the protection information into the cache and check it on every cache access
- if a process is switched out of memory, then the cache must be flushed
- unless we add process id information to the tags, which, like the previous solution, means using more cache space for non-data
29. Another Solution
- A very simple solution to the problem of needing to translate virtual to physical addresses is this
- make sure that the line number bits are identical in both the virtual and physical addresses
- Consider for instance a memory system with
- 4G words (32 bit addresses, assuming word-level addressing, not byte-level)
- a block size of 16 words
- a 16K-word on-chip cache, which stores 1024 blocks and so requires 10 bits of the address for the line number
- so the address format is 18 bits for the tag, 10 bits for the line number, 4 bits for the word offset
- assume a page size of 16K words, so that the page offset is 14 bits and the page number is 18 bits
- the paging process swaps the page number for the frame number, both of which are the upper 18 bits of the address, so the page offset (the lower 14 bits) is unchanged by translation; since the line number and word offset fit entirely within those 14 bits, no address translation is necessary prior to cache access
- For this to work, we must ensure that the line number and word offset fit within the page offset, i.e., log2(page size) >= line number bits + block offset bits (the sketch below checks this condition)
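A minimal sketch in C (my own check) of the condition that lets the cache be indexed without waiting for translation: the line number and block offset must lie entirely within the page offset bits, which translation leaves unchanged:

    #include <stdio.h>

    int main(void) {
        int index_bits       = 10;   /* 1024 cache blocks   */
        int offset_bits      = 4;    /* 16 words per block  */
        int page_offset_bits = 14;   /* 16K-word pages      */

        if (index_bits + offset_bits <= page_offset_bits)
            printf("OK: cache index+offset (%d bits) fit in the page offset (%d bits);\n"
                   "the cache can be indexed with untranslated bits.\n",
                   index_bits + offset_bits, page_offset_bits);
        else
            printf("Not OK: the cache index would need translated (page number) bits.\n");
        return 0;
    }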
30. Solution 7: Small and Simple Caches
- Cache access (for any but a fully associative cache) requires
- using the index part of the address to find the appropriate line in the cache
- then comparing tags to see if the entry is the right one
- the tag comparison can be time consuming
- especially with associative caches that have large tags, or set associative caches where the comparisons need more hardware to be done in parallel
- Two solutions for keeping a cache fast are
- use direct-mapped caches
- keep the tags on the chip for a quick tag check but move the data off the chip
- this latter approach is sometimes used for L2 caches, but not L1
- Another idea is to use an L2 cache small enough to fit on the processor chip, to keep L1 miss penalties down
- L1 cache sizes have not increased lately, but small enough L2 caches can now fit on the chip
31. Solution 8: Way Prediction
- The direct-mapped cache offers a faster access time than a 2-way set associative cache
- because we can avoid the additional multiplexor needed to select between the two ways
- Let's assume that an access to one way will be followed by an access to the same way
- this prediction is called way prediction and can be used to speed up access to a 2-way set associative cache by maintaining a prediction bit in the cache
- the bit is toggled when we have a misprediction and need to switch to the other way
- So we get the lower miss rate of the 2-way set associative cache and the lower hit time whenever we predict correctly
- simulations indicate a prediction accuracy of 85% or more
- the Pentium 4 uses way prediction
32. Variation: Pseudo-Associative Cache
- We can alter a direct-mapped cache to have some associativity as follows
- consult the direct-mapped cache as normal
- this provides a fast hit time
- if there is a miss, invert part of the address and try the new address
- the inversion might flip a bit of the line number
- the second access comes at the cost of a longer hit time for the second attempt (it may also cause other accesses to stall while the second access is being performed!)
- thus, the same address might be stored in one of two locations, giving some associativity
- the pseudo-associative cache will reduce the number of conflict misses
- a miss on the first check may still become a cache hit
- the first check is fast (the hit time of a direct-mapped cache)
- the second check might take 1-2 cycles further, which is still faster than going to a second-level cache
33. Example
- Which provides the faster average memory access time for 4 KB and 256 KB caches: direct-mapped, 2-way set associative, or pseudo-associative (PAC)?
- Assume a hit time of 1 cycle for direct mapped, 1.36 cycles for 2-way set associative, a miss penalty of 50 cycles, and that the PAC takes 3 cycles for a second (alternative) access
- we alter our formula for the PAC because a miss does not necessarily accrue the 50 cycle penalty, but instead only a 3 cycle penalty if the item is in the alternative position in the cache
- we need two hit rates: the normal hit rate and the hit rate of finding the item in the second position (we will call this the alternative hit rate)
- alternative hit rate = hit rate 2-way - hit rate 1-way = (1 - miss rate 2-way) - (1 - miss rate 1-way) = miss rate 1-way - miss rate 2-way
- Average memory access time PAC
- = 1 + (miss rate 1-way - miss rate 2-way) × 3 + miss rate 2-way × miss penalty
34. Solution
- PAC 4 KB: 1 + (.098 - .076) × 3 + .076 × 50 = 4.866
- PAC 256 KB: 1 + (.013 - .012) × 3 + .012 × 50 = 1.603
- Direct-mapped 4 KB: 1 + .098 × 50 = 5.9
- Direct-mapped 256 KB: 1 + .013 × 50 = 1.65
- 2-way 4 KB: 1.36 + .076 × 50 = 5.16
- 2-way 256 KB: 1.36 + .012 × 50 = 1.96
- So, the pseudo-associative cache outperforms both! (reproduced in the sketch below)
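A minimal sketch in C (my own code) comparing the direct-mapped, 2-way and pseudo-associative (PAC) average memory access times for the two cache sizes:

    #include <stdio.h>

    static void compare(const char *size, double mr_1way, double mr_2way) {
        double penalty = 50.0, alt_hit_time = 3.0;
        double dm  = 1.0  + mr_1way * penalty;
        double two = 1.36 + mr_2way * penalty;
        /* hits found in the alternative location cost 3 cycles; true misses cost 50 */
        double pac = 1.0 + (mr_1way - mr_2way) * alt_hit_time + mr_2way * penalty;
        printf("%s: direct=%.3f  2-way=%.3f  PAC=%.3f\n", size, dm, two, pac);
    }

    int main(void) {
        compare("4 KB  ", 0.098, 0.076);   /* PAC ~4.866 */
        compare("256 KB", 0.013, 0.012);   /* PAC ~1.603 */
        return 0;
    }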
35. Solution 9: Trace Caches
- This type of cache is an instruction cache which supports multiple issue of instructions by providing 4 or more independent instructions per cycle
- cache blocks are dynamic, unlike normal caches where blocks are static based on what is stored in memory
- here, the block is formed around branch prediction, branch folding, and trace scheduling
- note that because of branch folding and trace scheduling, some instructions might appear multiple times in the cache, so it is somewhat more wasteful of cache space
- This type of cache offers the advantage of directly supporting a multiple issue architecture
- the Pentium 4 uses this approach, but most RISC processors do not, because the repetition of instructions and the high frequency of branches cause this approach to waste too much cache space
36. Solution 10: Pipelined Cache Accesses
- We can pipeline writes to the cache, since writes take longer than reads
- the longer duration is because we need to perform a tag check before the write can begin
- a read can commence immediately, and if the tag is wrong, the item read is simply discarded
- The write takes two steps, the tag comparison first, followed by the write (a third step might be included in a write-back cache by combining items in a buffer)
- by pipelining writes
- we can partially speed up the process
- this works by overlapping the tag check of one write with the data-writing portion of the previous write
- assuming the tags are correct
- in this way, the second write takes the same time as a read would
- although this only works with more than 1 consecutive write where all the writes are cache hits
37. Example
- Here we see the impact of pipelining cache accesses
- the advantage is that it allows us to reduce the clock cycle time; the disadvantage is that with a shorter clock, cache misses have a larger impact (in cycles)
- compare the MIPS 5-stage pipeline vs. the MIPS R4000 8-stage pipeline
- assume clock rates of 1 GHz for the MIPS and 1.8 GHz for the MIPS R4000
- a main memory access time of 50 ns (we will assume no second level cache) and a cache miss rate of 5%
- Assuming no other sources of stalls, which machine is faster?
- MIPS miss penalty = 50 ns / (1 / 1) ns = 50 cycles
- MIPS R4000 miss penalty = 50 ns / (1 / 1.8) ns = 90 cycles
- CPU time MIPS = (1 + .05 × 50) × 1 ns = 3.5 ns per instruction
- CPU time MIPS R4000 = (1 + .05 × 90) × (1 / 1.8) ns = 3.06 ns per instruction
- So the gain in clock speed from pipelining cache accesses more than offsets the increased miss penalty (the sketch below reproduces the calculation)
- to truly see if this is advantageous, we would also have to factor in the impact of structural hazards and branch penalties
- the longer the pipeline, the greater that impact is
38. Solution 11: Non-Blocking Caches
- When an ordinary cache has a miss and must retrieve the needed block from memory, the cache is unable to respond to other requests
- a more expensive cache is a non-blocking cache, which can continue to respond to CPU requests while waiting for memory, as long as the new requests are to blocks that are present
- this is sometimes referred to as hit under miss
- this capability becomes essential for out-of-order execution architectures like Tomasulo's, so that the cache does not cause stalls
- a variation is a cache that permits hits to continue in spite of multiple outstanding misses (although for this to make any sense, memory must also be able to respond to multiple requests)
- see figure 5.5 on page 297, which illustrates how various benchmarks perform with hit under 1 miss, hit under 2 misses and hit under 64 misses
- the non-blocking cache is important for implementing some of the other optimizations which we will see next
39. Solution 12: Multi-banked Caches
- A common memory optimization is to have multiple memory banks
- each of which can be accessed independently
- this allows for parallel access to memory either by
- different devices accessing memory simultaneously
- or by accessing multiple words of the same block simultaneously
- We can use this idea on our caches as well, by interleaving cache addresses across banks
- the L2 of the AMD Opteron uses 2 banks, the Sun Niagara uses 4 banks
- To make this easy to implement, we interleave the blocks across banks
- for instance, for a 4-bank cache, block addresses with (address mod 4) = 0 are placed in bank 0, those with (address mod 4) = 1 in bank 1, etc. (see the sketch below)
- When would this approach be advantageous?
- we could use this to implement a form of non-blocking cache where we can access another bank on a miss (thus, we have hit under miss in 3 out of 4 instances), and this would be cheaper than a truly non-blocking cache
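A minimal sketch in C (my own illustration) of sequential interleaving of block addresses across a 4-bank cache, as described above:

    #include <stdio.h>

    int main(void) {
        int banks = 4;
        for (int block_addr = 0; block_addr < 8; block_addr++)
            printf("block %d -> bank %d\n", block_addr, block_addr % banks);
        return 0;
    }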
40. Solution 13: Early Restart
- On a cache miss, the memory system moves a block into cache
- moving a full block requires many memory accesses and bus transfers
- Rather than having the cache (and CPU) wait until the entire block is available
- deliver the requested word of the block first, to allow the cache access to complete as soon as that item is available
- transfer the rest of the block in parallel with that access
- This involves two ideas
- early restart: the cache forwards the requested word to the CPU as soon as it arrives from memory, rather than waiting for the whole block
- critical word first: have memory return the requested word first and the remainder of the block afterward (this is also known as wrapped fetch)
- note we could implement early restart without critical word first, but the improvement then depends on which word was needed; for instance, if the last word of the block is requested, early restart does nothing for us
- This optimization requires a non-blocking cache, since the cache needs to start responding to the request after the first word is returned
41. Solution 14: Merging Write Buffer
- We can make our write buffer more efficient as follows
- organize the write buffer in rows, where one row represents one refill line
- multiple writes to the same line are saved in the same buffer row
- a write to memory then moves the entire row from the buffer at once, reducing the number of writes
- Recall that a write-through cache may use a write buffer to make memory accesses more efficient
- the write buffer contains multiple items waiting to be written to memory
- with the write buffer, writes to memory are postponed until either the buffer is full or a modified refill line is being discarded
42. Solution 15: Compiler Optimizations
- Specific techniques include
- array merging: merge parallel arrays into an array of records so that the fields of a single element sit in consecutive memory locations and thus (hopefully) the same refill line
- loop interchange: exchange loops in a nested loop situation so that array elements are accessed in the order in which they are laid out in memory, not the programmer-prescribed order
- loop fusion: combine loops that access the same array locations so that all accesses to an element are made within one iteration
- blocking: execute code on one part (block) of the array before moving on to another part, so that array elements do not need to be reloaded into the cache
- this is common for applications like image processing where several different passes through a matrix are made
- We have already seen that compiler optimizations can be used to improve hardware performance
- There are many ways that compiler optimizations can improve cache performance by making sure that array accesses are done in a way that exploits the refill lines currently in cache; a loop interchange sketch appears below
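A minimal sketch in C (my own example, not from the slides) of loop interchange: C stores arrays in row-major order, so iterating j in the inner loop walks consecutive addresses and reuses each cache block, while the "bad" version strides through memory and touches a new block on almost every access:

    #include <stdio.h>

    #define N 1024
    static double x[N][N];

    static void bad_order(void) {          /* column-major traversal: poor locality */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] = 2.0 * x[i][j];
    }

    static void interchanged(void) {       /* row-major traversal: good locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2.0 * x[i][j];
    }

    int main(void) {
        bad_order();
        interchanged();
        printf("both versions compute the same result; the second misses far less\n");
        return 0;
    }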
43. Solution 16: Hardware Prefetching
- Prefetching can be controlled by hardware or by the compiler (see the next slide)
- prefetching can operate on instructions, data, or both
- one simple idea for instruction prefetching: when there is an instruction miss to block i, fetch block i and then also fetch block i + 1
- this might be controlled by hardware outside of the cache
- the second block is left in an instruction stream buffer and not moved into the cache
- if a requested instruction is in the instruction stream buffer, the cache access is cancelled (and thus a potential miss is avoided)
- if prefetching is to place multiple blocks into the cache, then the cache must be non-blocking
- See figure 5.10 on page 306, which shows the speedup of many SPEC 2000 benchmarks (mostly FP) when hardware prefetching is turned on (speedups range from 1.16 to 1.97)
44. Solution 17: Compiler Prefetching
- Another idea is to have the compiler insert prefetching instructions into the code (this is for data prefetching only)
- Consider the loop
    for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];
- if we have an 8KB direct-mapped data cache with 16 byte blocks, and each element of a and b is 8 bytes long (double precision floats), we will have 150 misses for array a and 101 misses for array b
- by scheduling the code with prefetch instructions, we can reduce the misses
- The new loop becomes
    for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);    /* prefetch 7 iterations later */
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }
- This new code has only 19 misses, improving performance to 6.2 times faster
- See pages 307-308 for the rest of the analysis of this problem
45. Solution 18: Victim Caches
- Misses might arise when refill lines conflict with each other
- one line is discarded for another, only to find that the discarded line is needed again in the near future
- The victim cache is a small, fully associative cache placed between the cache and memory
- this cache might store 1-5 blocks
- The victim cache only stores blocks that are discarded from the cache when a miss occurs
- the victim cache is checked on a miss before going on to main memory, and if the block is found there, the block in the cache and the block in the victim cache are swapped
The victim cache is most useful if it backs up a fast direct-mapped cache, reducing the direct-mapped cache's conflict miss rate by adding some associativity. A 4-entry victim cache might remove ¼ of the misses from a 4KB direct-mapped data cache. The AMD Athlon uses an 8-entry victim cache.
46. Cache Optimization Summary
Hardware complexity ranges from 0 (cheapest/easiest) to 3 (most expensive/hardest)
47. Continued
48. Sample Problem 1
- A computer uses a fully associative write-back data cache
- Block size is 64 bytes
- Given the code below, assume x[1024], b[1024] and c[1024] are all double precision floating point arrays, with arrays b and c already in cache but not x
- moving elements of x into the cache will not cause elements of b or c to be moved out of cache
- x[0] is stored at the beginning of a block
- Questions
- a. how many misses arise with respect to accessing x if the cache uses a no-write allocate policy?
- b. how many misses arise with respect to accessing x if the cache uses a write allocate policy?
- c. redo part a assuming that statement S2 comes before statement S1
- d. redo part b assuming that statement S2 comes before statement S1
    for (i = 1; i < 1024; i++) {
        x[i] = b[i] + y;    // S1
        c[i] = x[i] + z;    // S2
    }
49. Solution
- Each array stores 1024 doubles
- A refill line stores 64 bytes, so we can fit exactly 8 array elements into each refill line
- Our misses arise only because of accesses to array x
- since b and c are already in the cache
- The no-write allocate policy means that the write in S1 will not load x into the cache, therefore a write miss in S1 is followed by a read miss in S2
- Any read miss will bring in a refill line along with the next 7 array elements
- so a read miss occurs for 1 in 8 array elements
- we will have 1024 / 8 = 128 read misses
- each of those read misses will have been preceded by a write miss, giving us a total of 256 misses
- The write allocate policy means that the write miss in S1 loads the line into cache, so we won't have a corresponding read miss in S2, and therefore we have a total of 128 misses
- Reversing the order of S1 and S2 means the read miss happens first, so no matter which write policy is used, we will have 128 read misses and no additional write misses (the sketch below simulates all four cases)
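A minimal sketch in C (my own simulation of this problem) that counts the cache misses caused by array x under both write policies and both statement orders, assuming 64-byte blocks (8 doubles) and that b and c never evict x:

    #include <stdio.h>
    #include <string.h>

    #define N 1024
    #define ELEMS_PER_BLOCK 8               /* 64-byte block / 8-byte double */

    static int resident[N / ELEMS_PER_BLOCK];   /* 1 if x's block is in the cache */

    static void access_x(int i, int is_write, int write_allocate, int *misses) {
        int blk = i / ELEMS_PER_BLOCK;
        if (resident[blk]) return;              /* hit */
        (*misses)++;
        if (!is_write || write_allocate)        /* reads always allocate */
            resident[blk] = 1;
    }

    static int run(int write_allocate, int s2_first) {
        int misses = 0;
        memset(resident, 0, sizeof(resident));
        for (int i = 1; i < N; i++) {
            if (s2_first) {                     /* S2: read x[i], then S1: write x[i] */
                access_x(i, 0, write_allocate, &misses);
                access_x(i, 1, write_allocate, &misses);
            } else {                            /* S1: write x[i], then S2: read x[i] */
                access_x(i, 1, write_allocate, &misses);
                access_x(i, 0, write_allocate, &misses);
            }
        }
        return misses;
    }

    int main(void) {
        printf("a. no-write allocate, S1 then S2: %d misses\n", run(0, 0));  /* 256 */
        printf("b. write allocate,    S1 then S2: %d misses\n", run(1, 0));  /* 128 */
        printf("c. no-write allocate, S2 then S1: %d misses\n", run(0, 1));  /* 128 */
        printf("d. write allocate,    S2 then S1: %d misses\n", run(1, 1));  /* 128 */
        return 0;
    }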
50. Sample Problem 2
- Memory is organized as follows
- two on-chip caches (one data, one instruction)
- an off-chip (second level) cache
- main memory
- disk cache
- disk (swap space)
- Assume miss rates and access times of
- data cache: 5%, 1 clock cycle
- instruction cache: 1%, 1 clock cycle
- off-chip cache: 10%, 10 clock cycles
- main memory: 0.2%, 100 clock cycles
- disk cache: 20%, 1000 clock cycles
- swap space: 0%, 250,000 clock cycles
- If 40% of all instructions are loads or stores, what is the effective memory access time for this machine?
51. Solution
- Average memory access time =
- % instruction accesses × (hit time instruction cache + miss rate instruction cache × (hit time second level cache + miss rate second level cache × (hit time main memory + miss rate main memory × (hit time disk cache + miss rate disk cache × hit time disk))))
- + % data accesses × (hit time data cache + miss rate data cache × (hit time second level cache + miss rate second level cache × (hit time main memory + miss rate main memory × (hit time disk cache + miss rate disk cache × hit time disk))))
- With 1.4 memory accesses per instruction, the % of instruction accesses = 1.0 / 1.4 = 71.4%, and the % of data accesses = 0.4 / 1.4 = 28.6%
- Average memory access time
- = 71.4% × (1 + .01 × (10 + .10 × (100 + .002 × (1000 + .20 × 250000)))) + 28.6% × (1 + .05 × (10 + .10 × (100 + .002 × (1000 + .20 × 250000)))) = 1.647 clock cycles (the sketch below evaluates this expression)
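A minimal sketch in C (my own code) evaluating the nested hierarchy formula above for both the instruction and data streams:

    #include <stdio.h>

    int main(void) {
        /* the expected access time beyond the L1 caches is the same for both streams */
        double beyond_l1 = 10.0 + 0.10 * (100.0 + 0.002 * (1000.0 + 0.20 * 250000.0));

        double t_instr = 1.0 + 0.01 * beyond_l1;   /* instruction cache: 1% miss rate */
        double t_data  = 1.0 + 0.05 * beyond_l1;   /* data cache: 5% miss rate        */

        double frac_instr = 1.0 / 1.4;             /* ~71.4% of accesses */
        double frac_data  = 0.4 / 1.4;             /* ~28.6% of accesses */

        printf("effective memory access time = %.3f clock cycles\n",
               frac_instr * t_instr + frac_data * t_data);   /* ~1.647 */
        return 0;
    }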