Transcript and Presenter's Notes

Title: Cache


1
Cache
  • The memory hierarchy is our solution for providing
    seemingly unlimited fast memory accesses
  • at least 1 instruction fetch and possibly 1 data
    access per cycle
  • more for a superscalar processor
  • Each level of the hierarchy is based (at least in
    part) on Principle of Locality of Reference
  • as we move higher in the hierarchy, each level
    gets faster but also more expensive, therefore it
    is more restricted in size
  • some issues are generic across the hierarchy
  • but, each level has its unique characteristics,
    technology and solutions
  • we have already looked at registers, here we will
    study cache

The higher levels of the hierarchy (which are
faster) are more expensive, so we have less
storage there. But we want to avoid accesses at
the lower levels because of their longer access
times.
2
Effect on Performance
  • Memory access time has a direct effect on CPU
    performance
  • CPU execution time = (CPU clock cycles + memory
    stall cycles) × clock cycle time
  • memory stall cycles = IC × memory references per
    instruction × miss rate × miss penalty
  • memory references per instruction > 1 since there
    is the instruction fetch itself plus possibly 1 or
    more data accesses
  • whenever an instruction or datum is not in
    registers, we must fetch it from cache, but if it
    is not in cache, we accrue a miss penalty by
    having to access the much slower main memory
  • A large enough miss penalty causes a substantial
    decrease in CPU performance, consider for example
  • CPI = 1.0 when all memory accesses are hits
  • only data accesses occur during loads and stores
    (50% of all instructions are loads or stores)
  • miss penalty is 25 clock cycles, miss rate is 2%
  • what is the impact on CPI?
  • CPI = 1.0 + (100% × 2% × 25) + (50% × 2% × 25) =
    1.75
  • so we are 75% slower than we might be because of
    cache misses
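The arithmetic above can be checked with a short calculation.
Here is a minimal C sketch; the instruction mix, miss rate, and
miss penalty are simply the values assumed in this example:

    #include <stdio.h>

    int main(void) {
        double base_cpi     = 1.0;   /* CPI if every access hits */
        double miss_rate    = 0.02;  /* 2% miss rate */
        double miss_penalty = 25.0;  /* 25 clock cycles */
        double instr_refs   = 1.0;   /* every instruction is fetched */
        double data_refs    = 0.5;   /* 50% of instructions are loads/stores */

        /* stall cycles per instruction = refs/instr * miss rate * penalty */
        double stalls = (instr_refs + data_refs) * miss_rate * miss_penalty;
        double cpi = base_cpi + stalls;

        printf("CPI with cache misses = %.2f\n", cpi);                    /* 1.75 */
        printf("slowdown = %.0f%%\n", (cpi - base_cpi) / base_cpi * 100); /* 75%  */
        return 0;
    }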

3
Four questions
  • The general piece of memory will be called a
    block
  • blocks differ in size depending on the level of
    the memory hierarchy
  • cache block, memory block, disk block
  • We ask the following questions
  • Q1 where can a block be placed?
  • Q2 how is a block found?
  • Q3 which block should be replaced on a miss?
  • Q4 what happens on a write?
  • The answers to questions 1-3 differ depending on
    the type of cache
  • direct mapped, associative, set-associative
  • The answer to the last question is based on the
    write policy
  • write through or write back

4
Q1 Where can a block be placed?
  • Type determines placement
  • Associative cache
  • any available block
  • Direct mapped cache
  • a given memory block has only one location where
    it can be placed in the cache, determined by the
    equation
  • (block address) mod (number of blocks in cache)
  • Set associative cache
  • a given memory block has one set of blocks in the
    cache where it can be placed, determined by
  • (block address) mod (number of sets), where the
    number of sets = cache size in blocks /
    associativity

Here we have a cache of 8 blocks and a memory of
32 blocks. To place memory block 12, we can put
it in any block of an associative cache, only in
block 4 (12 mod 8) of a direct mapped cache, and in
block 0 or 1 (set 0 = 12 mod 4) of a 2-way set
associative cache
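A minimal sketch of these mapping rules in C; the 8-block cache
and memory block 12 are the values from the figure above:

    #include <stdio.h>

    int main(void) {
        int cache_blocks = 8;       /* cache holds 8 blocks   */
        int block_addr   = 12;      /* memory block to place  */

        /* direct mapped: exactly one possible cache block */
        int direct = block_addr % cache_blocks;           /* 12 mod 8 = 4 */

        /* 2-way set associative: one possible set, either way within it */
        int ways = 2;
        int sets = cache_blocks / ways;                   /* 4 sets       */
        int set  = block_addr % sets;                     /* 12 mod 4 = 0 */

        printf("direct mapped: block %d\n", direct);
        printf("2-way set associative: set %d (cache blocks %d..%d)\n",
               set, set * ways, set * ways + ways - 1);
        /* fully associative: any of the 8 cache blocks */
        return 0;
    }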
5
Q2 How is a block found in cache?
  • All memory addresses consist of a tag, a line
    number (or index), and a block offset
  • in a direct mapped cache, the line number
  • dictates the line where a block must be placed or
    where it will be found
  • the tag is used to make sure that the line we
    have found is the line we want
  • in a set associative cache, the line number
    references a set of lines
  • the block must be placed in one of those lines,
    but there is some variability which line should
    we put it in, which line will we find it in?
  • we use comparators to check the tags of the lines
    in the set selected by the line number, and a
    multiplexor to choose the matching line
  • in a fully associative cache, a line can go
    anywhere
  • since the fully associative cache is one set
    containing every line, we have to compare all
    tags at once; this is done through associative
    memory (a large number of comparators)

6
Q3 Which block should be replaced?
  • For the direct-mapped cache
  • there is no choice of which line a new block is
    placed in, so there is no need for a replacement
    strategy
  • For set-associative and fully associative caches
  • we have a choice, for instance in a 2-way
    set-associative cache we can place the new block
    in either of the two lines of the selected set
  • for the fully associative cache, our choice is to
    place the new block into any line at all
  • There are three common replacement strategies
  • random
  • FIFO
  • least recently used (LRU)
  • this best models locality of reference as it
    predicts which block will not be used again in
    the near future by looking at how long ago it was
    last accessed; however, true LRU is difficult to
    implement, especially in hardware
  • instead we might use a variation, an LRU
    approximation, or least frequently used
  • figure C.4 on page C-10 compares the data miss
    rate when using FIFO, Random, and LRU replacement
    strategies
  • what we see is that LRU is always the best, but
    the difference between the three is not great,
    and random is often as good as FIFO

7
Q4 What happens on a write?
  • Writes will occur only to data
  • when the datum is loaded into cache, there are
    (at least) two copies in memory now, in the cache
    and in main memory (other copies may exist if we
    have multiple levels of cache)
  • on a cache write, the datum in cache is modified,
    but what happens to the now-stale copy in main
    memory?
  • Write Through cache
  • write the datum to both cache and memory at the
    same time
  • this is inefficient because the data access is a
    word, typical data movement between cache and
    memory is a block, so this write uses only part
    of the bus for a transfer
  • notice other words in the same block may also
    soon be updated, so waiting could pay off
  • Write Back cache
  • write to cache, wait on writing to memory until
    the entire block is being removed from cache
  • add a dirty bit to each cache block to indicate
    that the cache value is current and memory is out
    of date
  • Write Through is easier to implement since memory
    is always up-to-date and we don't need dirty bit
    mechanisms
  • Write Back is preferred to reduce memory traffic
    (a write stall occurs in Write Through if the CPU
    must wait for the write to take place)

8
Write Miss Example
  • What happens on a write miss?
  • write allocate
  • the block is fetched into the cache on a write
    miss, and the write then proceeds
  • no-write allocate
  • the block is modified in memory without being
    brought into the cache
  • Consider write-back cache which starts empty and
    has the sequence of operations as shown below to
    the left
  • how many hits and how many misses occur with
    no-write allocate versus write allocate?
  • Solution
  • for no-write allocate
  • the first two operations cause misses (since
    after the first one, location 100 is still not in
    the cache), the third operation also causes a
    miss, the fourth is a hit (since 200 is now in
    cache), but the fifth is again a miss, so 4
    misses and 1 hit
  • for write allocate
  • the first access to a memory location is always a
    miss, but from there the location is in cache and
    the rest are hits, so we have 2 misses (one for
    each of 100 and 200) and 3 hits

Write Mem[100]; Write Mem[100]; Read Mem[200];
Write Mem[200]; Write Mem[100]
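The hit/miss counts above can be reproduced with a tiny simulation.
This is only a sketch: it tracks block presence for the two
locations and assumes, as the solution does, that 100 and 200 fall
in different blocks of an initially empty write-back cache:

    #include <stdio.h>

    /* count hits for the 5-operation sequence under one write-miss policy */
    static int run(int write_allocate) {
        int is_write[5] = {1, 1, 0, 1, 1};          /* W, W, R, W, W      */
        int addr[5]     = {100, 100, 200, 200, 100};
        int cached[2]   = {0, 0};                   /* [0]=100, [1]=200   */
        int hits = 0;

        for (int i = 0; i < 5; i++) {
            int slot = (addr[i] == 100) ? 0 : 1;
            if (cached[slot]) {
                hits++;                     /* block already present       */
            } else if (!is_write[i] || write_allocate) {
                cached[slot] = 1;           /* read miss, or write miss with
                                               write allocate, loads block */
            }                               /* no-write allocate: a write
                                               miss leaves cache unchanged */
        }
        return hits;
    }

    int main(void) {
        printf("no-write allocate: %d hits, %d misses\n", run(0), 5 - run(0)); /* 1, 4 */
        printf("write allocate:    %d hits, %d misses\n", run(1), 5 - run(1)); /* 3, 2 */
        return 0;
    }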
9
Using a Write Buffer
  • To alleviate the inefficiency of Write Through,
    we may add a write buffer
  • writes go to cache and the buffer, the CPU
    continues without stalling
  • writes to memory occur when the buffer is full or
    when a line is filled
  • thus, a write to memory is done in parallel with
    the processor continuing execution
  • although stalls can still arise when using a
    write buffer, they are less frequent
  • question: on a read or write miss, should we
    examine the write buffer?
  • the buffer may still be storing the needed value,
    so checking it can avoid a time consuming memory
    access
  • Example
  • given the three memory operations below where
    locations 512 and 1024 map to the same
    direct-mapped cache line, if the cache is a
    write-through cache with a 4-word write buffer
    that is not checked on a cache miss, will the
    value in R2 equal the value in R3 at the end?
    not necessarily

SW R3, 512(R0)   // value written to write buffer, but not yet to memory
LW R1, 1024(R0)  // cache miss, new block brought in (evicting 512's block)
LW R2, 512(R0)   // cache miss, fetch Mem[512], which might still be the
                 // old value since the write buffer may not have written yet!
10
Example Opteron Data Cache
  • Found in AMD Opteron processors
  • 2-way set associative, 512 blocks per way (512
    sets)
  • 64 Kbytes in 64-byte blocks
  • write-back, write-allocate
  • CPU address 40-bit address
  • comprised of 25-bit tag, 9-bit line (set) number
    and 6-bit byte offset
  • the tags of both blocks in the selected set are
    compared against the tag portion of the address
    (a 2:1 MUX is used to select which (if either)
    datum to return to the CPU)
  • a dirty bit is used to denote that a block has
    been modified
  • so that it can be written back to memory prior to
    being removed from the cache
  • on a miss, the cache notifies the processor of
    the miss so that the processor can stall, and the
    request continues down to the next level of the
    hierarchy
  • which takes 7 further cycles to get the first 8
    bytes of the needed block
  • replacement strategy is LRU, using a single bit to
    denote which way was accessed most recently and
    replacing the other (older) block

11
Split Versus Unified Caches
  • Notice that the previous example was of only a
    data cache
  • Why have separate data and instruction caches?
  • below are statistics indicating the performance
    of two split caches versus a single unified cache
  • the unified cache should be twice the size of
    each split cache, so compare for instance two 8KB
    split caches to a 16KB unified cache

Note: this table does not show miss rate; we
are seeing misses per 1000 instructions, not per
1000 accesses. Divide by 10 to get a percentage
(e.g., 6.3% for the 8KB unified cache).
Why do you suppose the instruction cache performs
so much better than the data cache of equal size?
12
Example
  • Compare 16 KB instr and 16 KB data caches vs. 32
    KB unified cache assuming
  • 1 clock cycle hit time
  • 100 clock cycle miss penalty for the individual
    caches
  • add 1 clock cycle to the hit time for loads/stores
    in the unified cache (36% of instructions are
    loads/stores)
  • since a single cache cannot handle both an
    instruction fetch and data access in the same
    cycle
  • write-through caches with write buffer, no stalls
    on writes
  • What is the average memory access time in each
    case?
  • To determine each cache's performance, we compute
    the average memory access time
  • avg memory access time = hit time + miss rate ×
    miss penalty
  • as noted above, for the unified cache we have a
    higher hit time because the unified cache needs
    two accesses in the same cycle when an instruction
    is a load or store

13
Solution
  • We get misses per instruction from the table on
    the earlier slide
  • Converting to miss rate (misses per access)
  • instruction: (3.82 / 1000) / 1 access per instr =
    .00382
  • data: (40.9 / 1000) / .36 accesses per instr =
    .1136
  • unified: (43.3 / 1000) / 1.36 accesses per instr =
    .0318
  • Of 136 accesses per 100 instructions, the
    percentage of instruction accesses
  • = 100 / 136 = 74%
  • Percentage of data accesses = 36 / 136 = 26%
  • memory access time for the 2 split caches = 74% ×
    (1 + .00382 × 100) + 26% × (1 + .1136 × 100) =
    4.236
  • memory access time for the unified cache = 74% ×
    (1 + .0318 × 100) + 26% × (2 + .0318 × 100) = 4.44
  • Separate caches perform better
  • note the typo in the book: they state a 100 clock
    cycle miss penalty but use 200 in their solution
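The same arithmetic as a small C sketch; the miss counts per 1000
instructions are the figures quoted above, and the 100-cycle miss
penalty is the value stated in the problem:

    #include <stdio.h>

    int main(void) {
        double penalty = 100.0;              /* miss penalty in cycles           */
        double f_instr = 100.0 / 136.0;      /* fraction of instruction accesses */
        double f_data  = 36.0 / 136.0;       /* fraction of data accesses        */

        /* miss rates derived from misses per 1000 instructions */
        double mr_i = (3.82 / 1000.0) / 1.0;     /* instruction cache */
        double mr_d = (40.9 / 1000.0) / 0.36;    /* data cache        */
        double mr_u = (43.3 / 1000.0) / 1.36;    /* unified cache     */

        double amat_split   = f_instr * (1 + mr_i * penalty)
                            + f_data  * (1 + mr_d * penalty);
        double amat_unified = f_instr * (1 + mr_u * penalty)
                            + f_data  * (2 + mr_u * penalty); /* extra cycle for data */

        printf("split:   %.3f cycles\n", amat_split);    /* about 4.24 */
        printf("unified: %.3f cycles\n", amat_unified);  /* about 4.44 */
        return 0;
    }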

14
Revised CPU Performance
  • Recall our previous CPU formula
  • CPU time = (CPU cycles + memory stall cycles) ×
    clock cycle time
  • assume memory stalls are caused by cache misses,
    not problems like bus contention, I/O, etc.
  • Memory stall cycles
  • = memory accesses × miss rate × miss penalty
  • = reads × read miss rate × read miss penalty +
    writes × write miss rate × write miss penalty
  • CPU time
  • = IC × (CPI + memory accesses per instr × miss
    rate × miss penalty) × clock cycle time
  • = IC × (CPI × CCT + memory accesses per instr ×
    miss rate × memory access time)
    (a small C sketch of this form follows below)
  • The memory access time is independent of the
    clock speed
  • so an interesting tradeoff arises the faster
    the clock, the better the CPU performance, but
    the higher the miss penalty!
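A minimal sketch of the second form of the CPU time formula in C;
all of the parameter values in main are illustrative placeholders,
not numbers from the text:

    #include <stdio.h>

    /* CPU time = IC * (CPI * CCT + mem accesses/instr * miss rate * mem access time) */
    static double cpu_time(double ic, double cpi, double cct_ns,
                           double accesses_per_instr, double miss_rate,
                           double mem_access_ns) {
        return ic * (cpi * cct_ns + accesses_per_instr * miss_rate * mem_access_ns);
    }

    int main(void) {
        /* illustrative values only: 1e9 instructions, CPI 1.0, 1 ns clock,
           1.4 accesses/instr, 2% miss rate, 50 ns memory access time      */
        double t_ns = cpu_time(1e9, 1.0, 1.0, 1.4, 0.02, 50.0);
        printf("CPU time = %.3f s\n", t_ns / 1e9);
        return 0;
    }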

15
Associativity Example
  • What impact does cache organization
    (direct-mapped vs. 2-way set associative) have on
    a CPU?
  • assume CPI with a perfect cache is 1.6, clock
    cycle time .35 ns, 1.4 memory references per
    instruction, 128 KB data and instruction caches
    with 64 byte blocks each
  • cache 1 is direct-mapped with a miss rate of
    2.1%, cache 2 is 2-way set associative with a
    miss rate of 1.9%
  • the direct-mapped cache is faster, so its clock
    is faster; assume the clock cycle time is
    lengthened by 35% for the 2-way set associative
    cache
  • cache miss penalty is 65 ns
  • CPU Time Cache 1
  • = IC × (1.6 × .35 + 1.4 × .021 × 65) = 2.47 × IC
  • CPU Time Cache 2
  • = IC × (1.6 × .35 × 1.35 + 1.4 × .019 × 65) = 2.49
    × IC
  • CPU with Cache 1 is 2.49 / 2.47 = 1.01 times
    faster
  • Note: the cache miss penalty should actually be
    higher for cache 2. Why?
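Here is the calculation as a small C sketch using the numbers
stated above:

    #include <stdio.h>

    int main(void) {
        double cpi = 1.6, cct = 0.35;        /* base CPI and clock cycle (ns) */
        double refs = 1.4, penalty_ns = 65.0;

        /* CPU time per instruction in ns: CPI*CCT + refs * miss rate * penalty */
        double t1 = cpi * cct        + refs * 0.021 * penalty_ns;  /* direct mapped */
        double t2 = cpi * cct * 1.35 + refs * 0.019 * penalty_ns;  /* 2-way, 35% longer clock */

        printf("cache 1: %.2f ns/instr\n", t1);          /* about 2.47 */
        printf("cache 2: %.2f ns/instr\n", t2);          /* about 2.49 */
        printf("cache 1 is %.2fx faster\n", t2 / t1);    /* about 1.01 */
        return 0;
    }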

16
Out of Order and Miss Penalty
  • In our prior examples, cache misses caused the
    pipeline to stall thus impacting CPI
  • in a multiple-issue out-of-order execution
    architecture, like Tomasulo, a data miss means
    that a particular instruction stalls, possibly
    stalling others because it ties up a reservation
    station or reorder buffer slot, but it is more
    likely that it will not impact overall CPI
  • How then do we determine the impact of cache
    misses on such architectures?
  • we might define memory stall cycles / instruction
    = misses / instruction × (total miss latency -
    overlapped miss latency)
  • total miss latency: the sum of the memory
    latencies of the misses incurred by an instruction
  • overlapped miss latency: the portion of that time
    during which the miss does not impact performance
    because other instructions continue executing
  • these two terms are difficult to analyze, so we
    won't cover this in any more detail
  • typically a multi-issue out-of-order architecture
    can hide some of the miss penalty, up to 30% as
    shown in an example on pages C-20 and C-21

17
Improving Cache Performance
  • Having reviewed 1000s of research papers on
    caches, the authors provide a number of
    approaches to improve cache performance
  • recall: average memory access time = hit time +
    miss rate × miss penalty
  • these approaches can be categorized by what
    aspect of the above formula they are trying to
    reduce
  • reduce hit time
  • reduce miss rate
  • reduce miss penalty (or increase cache bandwidth
    so that part of the miss penalty can be reduced)
  • using parallelism in order to reduce miss rate
    and/or penalty
  • Comments
  • miss penalty is the biggest value in the
    equation, so this should be the obvious target to
    reduce, but in fact little can be done to
    increase memory speed
  • reducing miss rate has a number of different
    approaches, however miss rates today are often
    less than 2%; can we continue to improve?
  • reducing hit time has the benefit of allowing us
    to lower clock cycle time as well

18
The 3 Cs
  • Cache misses can be categorized as
  • compulsory misses
  • very first access to a block cannot be in the
    cache because the process has just begun and
    there has not been a chance to load anything into
    the cache
  • capacity misses
  • the cache cannot contain all of the blocks needed
    for the process
  • conflict misses
  • the block placement strategy only allows a block
    to be placed in a certain location in the cache
    bringing about contention with other blocks for
    that same location
  • some of the approaches we will see attempt to
    reduce one particular type of miss
  • figure C.8 on page C-23 demonstrates the miss
    rates for various cache sizes and associativities
  • conflict misses are more common in direct-mapped
    caches and are nonexistent in fully associative
    caches, whereas capacity misses are the most
    common cause of a cache miss

19
Solution 1 Larger Block Sizes
  • Larger block sizes will reduce compulsory misses
  • larger blocks take more advantage of spatial
    locality
  • but, larger blocks can increase the miss penalty
    because it physically takes longer to transfer
    the block from main memory to cache
  • also, larger blocks mean fewer blocks in the
    cache, which itself can increase the miss rate
  • this depends on program layout and the size of
    the cache vs. block size

A block size of 64 to 128 bytes provides
the lowest miss rates
20
Example Impact of Block Size
  • Assume memory takes 80 cycles to respond and
    delivers 16 bytes every 2 clock cycles thereafter
  • Which block size has the minimum average memory
    access time for each cache size?
  • average memory access time = hit time + miss rate
    × miss penalty
  • hit time = 1, miss rates from the previous slide
  • miss penalty depends on block size (82 cycles for
    16 bytes, 84 cycles for 32 bytes, etc.; for k byte
    blocks, miss penalty = (k / 16) × 2 + 80)
  • Solution
  • 16 byte block, 4 KB cache: 1 + (8.57% × 82) =
    8.027 clock cycles
  • 256 byte block, 256 KB cache: 1 + (.49% × 112) =
    1.549 clock cycles

We must compromise because a bigger block size
reduces the miss rate to some extent, but also
increases the miss penalty
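A small C sketch of this calculation; the two miss rates below are
the ones quoted in the solution, and the penalty formula is the one
given above:

    #include <stdio.h>

    /* miss penalty in cycles for a block of 'bytes' bytes:
       80 cycles to respond, then 16 bytes every 2 cycles   */
    static double miss_penalty(int bytes) {
        return (bytes / 16.0) * 2.0 + 80.0;
    }

    int main(void) {
        double hit_time = 1.0;

        /* 16-byte blocks, 4 KB cache: 8.57% miss rate (from the miss-rate data) */
        double amat_small = hit_time + 0.0857 * miss_penalty(16);
        /* 256-byte blocks, 256 KB cache: 0.49% miss rate */
        double amat_large = hit_time + 0.0049 * miss_penalty(256);

        printf("16B block, 4KB cache:    %.3f cycles\n", amat_small);  /* 8.027 */
        printf("256B block, 256KB cache: %.3f cycles\n", amat_large);  /* 1.549 */
        return 0;
    }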
21
Solution 2 Larger Caches
  • A larger cache will reduce capacity miss rates
    since the cache has a larger capacity, but also
    conflict miss rates because the larger cache
    allows more refill lines and so fewer conflicts
  • This is an obvious solution and has no seeming
    performance drawbacks
  • you must be careful where you put this larger
    cache
  • a larger on-chip cache might take space away from
    other hardware that could provide performance
    increases (registers, more functional units,
    logic for multiple-issue of instructions, etc)
  • and more cache means a greater expense for the
    machine
  • Larger caches however are often slower as they
    need more multiplexors
  • the best place for a large cache is off the chip,
    where we have space and where we don't mind a
    slightly slower cache, given that the miss penalty
    will already be larger due to having to go off the
    chip
  • the authors note that second-level caches of 2001
    computers are equal in size to main memories from
    10 years earlier!

22
Solution 3 Higher Associativity
  • Cache research points out the 2:1 cache rule of
    thumb
  • a direct-mapped cache of size N has about the
    same miss rate as a 2-way set associative cache
    of size N/2, so larger associativity yields
    smaller miss rates
  • However, a large 8-way set associative cache will
    have about a 0% conflict miss rate
  • meaning that an 8-way set associative cache is
    about as good as a fully associative cache
  • so we don't need associativity beyond 8-way
  • Why use direct-mapped?
  • an associative cache will always have a higher
    hit time
  • How big is the difference?
  • as we saw in an earlier example, a 2-way set
    associative cache's clock was 35% slower than the
    direct-mapped cache's
  • remember that clock cycle time is usually tied to
    cache hit time, so we wind up slowing down the
    entire computer when using an associative cache
    of some kind

So, with this in mind, should we use
direct-mapped or set associative?
23
Example Impact of Associativity
  • Average memory access time = hit time + miss rate
    × miss penalty
  • 4 KB direct-mapped: 1 + .098 × 25 = 3.45
  • 512 KB 8-way: 1.52 + .006 × 25 = 1.67
  • Assume higher associativity increases clock cycle
    time as follows
  • clock 2-way = 1.36 × clock direct-mapped
  • clock 4-way = 1.44 × clock direct-mapped
  • clock 8-way = 1.52 × clock direct-mapped
  • assume the L1 cache is direct-mapped with a 1
    cycle hit time
  • Determine the best L2 type
  • given that the miss penalty for direct-mapped is
    25 cycles and L2 never misses

In most cases, higher associativity means higher
access time, so direct-mapped is often the best
24
Solution 4 Multilevel Caches
  • To improve performance, we find that we would
    like
  • a faster cache to keep pace with memory
  • a larger cache to lower miss rate
  • Which should we pick? Both
  • offer a small but fast first level cache (L1)
  • offer a larger but slower second level cache (L2)
    on the chip
  • since this second cache is larger, it will be
    somewhat slower; we make it even slower by using
    some degree of associativity to improve its hit
    rate
  • this gives us a new formula for average memory
    access time
  • avg mem access time = hit time L1 + miss rate L1 ×
    miss penalty L1
  • miss penalty L1 = hit time L2 + miss rate L2 ×
    miss penalty L2
  • avg mem access time = hit time L1 + miss rate L1 ×
    (hit time L2 + miss rate L2 × miss penalty L2)
  • We redefine miss rate for the second cache
  • local miss rate = number of misses in this cache /
    number of memory accesses to this cache
  • global miss rate = number of misses in this cache /
    number of memory accesses overall
  • these two values are the same for the 1st level
    cache

25
Impact of L2 Cache
  • The local miss rate for the second cache will be
    larger than the local miss rate for the first
    cache since the first cache skims the cream of
    the crop
  • the second level cache is only accessed when the
    first level misses entirely
  • global miss rate is more useful than local miss
    rate for the second cache
  • the global miss rate tells us how many misses
    there are over all accesses
  • For example
  • assume in 1000 references, L1 has 40 misses and
    L2 has 20
  • local (and global) miss rate of cache 1 = 40/1000
    = 4%
  • local miss rate of cache 2 = 20/40 = 50%, global
    miss rate of cache 2 = 20/1000 = 2%
  • the local miss rate of cache 2 is misleading; the
    global miss rate gives us an indication of how
    both caches perform overall
  • L1 hit time is 1, L2 hit time is 10, memory
    access time is 200 cycles
  • what is the average memory access time?
  • avg mem access time = 1 + 4% × (10 + 50% × 200) =
    5.4 cycles
  • without L2, we have avg mem access time = 1 + 4% ×
    200 = 9, so the L2 cache gives us a 9 / 5.4 =
    1.67, or 67%, speedup!
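The two-level formula in a short C sketch, using the miss counts
from this example:

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0, hit_l2 = 10.0, mem = 200.0;  /* cycles */
        double mr_l1_local = 40.0 / 1000.0;  /* = global miss rate of L1       */
        double mr_l2_local = 20.0 / 40.0;    /* misses / accesses reaching L2  */

        /* AMAT = hit L1 + miss rate L1 * (hit L2 + miss rate L2 * miss penalty L2) */
        double amat_l2    = hit_l1 + mr_l1_local * (hit_l2 + mr_l2_local * mem);
        double amat_no_l2 = hit_l1 + mr_l1_local * mem;

        printf("with L2:    %.1f cycles\n", amat_l2);       /* 5.4        */
        printf("without L2: %.1f cycles\n", amat_no_l2);    /* 9.0        */
        printf("speedup:    %.2f\n", amat_no_l2 / amat_l2); /* about 1.67 */
        return 0;
    }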

26
Associativity for L2 Cache
  • Here we see the benefit of an associative cache
    for a second-level cache instead of direct-mapped
  • compare direct-mapped vs. 2-way set associative
    caches for the second level, assuming the 2-way
    set associative cache is 10% slower than the
    direct-mapped cache
  • direct-mapped L2 has a hit time of 10 cycles
  • direct-mapped L2 has a local miss rate of 25%
  • 2-way set-associative L2 has a hit time of 10.1
    cycles
  • 2-way set-associative L2 has a local miss rate of
    20%
  • miss penalty of L2 = 200 cycles
  • direct-mapped L2: miss penalty L1 = 10 + .25 × 200
    = 60 cycles
  • 2-way set-associative L2: miss penalty L1 = 10.1 +
    .20 × 200 = 50.1 cycles
  • we round 10.1 and 50.1 up to 11 and 51
    respectively since the L2 cache is still governed
    by the system clock

27
Solution 5 Priority of Reads over Writes
  • Reads occur with a much greater frequency than
    writes
  • we want to make the more common case fast so we
    would like to make sure that reads are faster
    than writes
  • writes are slower because of the need to write to
    both cache and main memory as well as to perform
    a tag check prior to starting the write (a tag
    check on a read can be done in parallel; if the
    tag is incorrect, we just cancel the remainder of
    the read)
  • we might use a write buffer for both types of
    write policy
  • write-through cache writes to write buffer first,
    and any read misses are given priority over
    writing the write buffer to memory
  • write-back cache writes to write buffer and the
    write buffer is only written to memory when we
    are assured of no conflict with a read miss
  • in either case, a read miss will first examine
    the write buffer before going on to memory in
    order to potentially save time
  • so, read misses are given priority over writes;
    since reads are more common, this makes the
    common case fast

28
Solution 6 Avoid Address Translation
  • CPU generates an address and sends it to cache
  • but the address generated by the CPU is a logical
    (virtual) address, not the physical address in
    memory
  • the addresses differ because of paging and
    virtual memory
  • caches store items based on their physical
    address (for instance, the line number is derived
    from the physical address, not the virtual
    address)
  • to obtain the physical address, the virtual
    address must first be translated using the TLB
    (or the page table if the entry isn't in the TLB)
  • thus, any memory access actually takes at least 2
    accesses (translation, then cache) and we want to
    avoid this
  • If we store virtual addresses in the cache, we
    can skip this translation but there are problems
    with this approach
  • the address translation process includes security
    mechanisms to make sure that a generated address
    is not trying to access some other process's
    portion of memory (such as a user process
    accessing the OS)
  • a solution is to copy protection information into
    the cache and check it on every cache access
  • if a process is switched out of memory, then the
    cache must be flushed
  • unless we add process id information to the tags,
    which, like the previous solution, means using
    more cache space for non-data

29
Another Solution
  • A very simple solution to the problem of needing
    to translate virtual to physical addresses is
    this
  • make sure that the line number is identical in
    both the virtual and physical addresses
  • Consider for instance a memory system of
  • 4G of words (32 bit addresses, assume word-level
    addressing, not byte-level)
  • block size of 16 words
  • a 16KB on-chip cache would store 1024 blocks
    requiring 10 bits of the address
  • so the address format is 18 bits for the tag, 10
    bits for the block, 4 bits for the word
  • assume a page size of 16K words so that the page
    offset is 14 bits and the page number size is 18
    bits
  • the paging process swaps the page number for the
    frame number, both of which are the upper 18 bits
    of the address, so the virtual address page
    offset is the same 14 bits as the physical
    address line number and word offset, and no
    additional address translation is necessary prior
    to cache access
  • For this to work, we must ensure that the line
    number plus offset bits fit within the page
    offset, i.e., log2(page size) must be at least as
    large as the number of non-tag address bits
    (equivalently, the tag must be at least as large
    as the page number)
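A sketch in C of how the address fields from this example break
down; the field widths are the ones given on the slide, word
addressing is assumed, and the address value itself is arbitrary:

    #include <stdio.h>

    int main(void) {
        /* field widths from the example: 18-bit tag/page number,
           10-bit line number, 4-bit word offset, 14-bit page offset */
        unsigned addr = 0x12345678u;          /* an arbitrary 32-bit word address */

        unsigned word_offset = addr & 0xF;            /* low 4 bits  */
        unsigned line_number = (addr >> 4) & 0x3FF;   /* next 10 bits */
        unsigned tag         = addr >> 14;            /* top 18 bits  */
        unsigned page_offset = addr & 0x3FFF;         /* low 14 bits  */

        /* because line number + word offset (14 bits) equal the page offset
           (14 bits), the cache index is unchanged by address translation   */
        printf("tag=%u line=%u word=%u page_offset=%u\n",
               tag, line_number, word_offset, page_offset);
        return 0;
    }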

30
Solution 7 Small and Simple Caches
  • Cache access (for any but an associative cache)
    requires
  • using the index part of the address to find the
    appropriate line in the cache
  • then comparing tags to see if the entry is the
    right one
  • the tag comparison can be time consuming
  • especially with associative caches that have
    large tags or set associative caches where
    comparisons use more hardware to be done in
    parallel
  • Two solutions to keeping a fast cache are
  • use direct-mapped caches
  • keep tags on the chip for quick tag check but
    move the data off the chip
  • this latter approach is often the case for L2
    caches, but not L1
  • Another idea is to use a small enough L2 cache to
    fit on the processor to keep L1 miss penalties
    down
  • L1 cache sizes have not increased lately, but
    small enough L2 caches can now fit on the chip

31
Solution 8 Way Prediction
  • The direct-mapped cache offers a faster access
    time than a 2-way set associative cache
  • because we can avoid the additional multiplexor
    to select between the two sets
  • Let's assume that an access to one way of a set
    will be followed by an access to the same way
  • this prediction is called way prediction and can
    be used to speed up access to a 2-way set
    associative cache by maintaining a prediction bit
    per set
  • the bit is toggled when we have a misprediction
    and need to switch to the other way
  • So we get the lower miss rate of the 2-way set
    associative cache and the lower hit time whenever
    we predict correctly
  • simulations indicate a prediction accuracy of 85%
    or more
  • the Pentium 4 uses way prediction

32
Variation Pseudo-Associative Cache
  • We can alter a direct-mapped cache to have some
    associativity as follows
  • consult the direct-mapped cache as normal
  • provides fast hit time
  • if there is a miss, invert part of the address
    and try the new location
  • the inversion might flip the last bit of the line
    number
  • the second access comes at the cost of a higher
    hit time for the second attempt (it may also cause
    other accesses to stall while the second access
    is being performed!)
  • thus, the same block might be stored in one of
    two locations, giving some associativity
  • the pseudo-associative cache will reduce the
    number of conflict misses
  • what would be a cache miss may still become a
    cache hit
  • the first check is fast (hit time of direct-mapped)
  • the second check might take 1-2 cycles more, so it
    is still faster than going to a second-level cache

33
Example
  • Which provides faster avg memory access time for
    4KB and 256 KB caches
  • direct-mapped, 2-way associative or
    pseudo-associative (PAC)?
  • Assume a hit time of 1 cycle for direct-mapped,
    1.36 for 2-way set associative, and a miss
    penalty of 50 cycles, where the PAC has a hit
    time of 3 cycles for a second access
  • we alter our formula for the PAC because a miss
    does not necessarily accrue a 50 cycle penalty
    but instead a 3 cycle penalty if the item is in
    the other (inverted) position in the cache
  • we need two hit rates: the normal hit rate and
    the hit rate of finding the item in the second
    position (we will call this the alternative hit
    rate)
  • alternative hit rate = hit rate 2-way - hit rate
    1-way = (1 - miss rate 2-way) - (1 - miss rate
    1-way) = miss rate 1-way - miss rate 2-way
  • Avg mem access time PAC
  • = 1 + (miss rate 1-way - miss rate 2-way) × 3 +
    miss rate 2-way × miss penalty

34
Solution
  • PAC 4 KB: 1 + (.098 - .076) × 3 + (.076 × 50) =
    4.866
  • PAC 256 KB: 1 + (.013 - .012) × 3 + (.012 × 50) =
    1.603
  • Direct-mapped 4 KB: 1 + .098 × 50 = 5.9
  • Direct-mapped 256 KB: 1 + .013 × 50 = 1.65
  • 2-way 4 KB: 1.36 + .076 × 50 = 5.16
  • 2-way 256 KB: 1.36 + .012 × 50 = 1.96
  • So, the pseudo-associative cache outperforms both!
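These numbers can be checked with a short C sketch implementing the
PAC formula from the previous slide:

    #include <stdio.h>

    /* average memory access time for a pseudo-associative cache:
       hit time 1, 3 extra cycles if found in the alternate slot   */
    static double amat_pac(double mr_1way, double mr_2way, double penalty) {
        return 1.0 + (mr_1way - mr_2way) * 3.0 + mr_2way * penalty;
    }

    int main(void) {
        double penalty = 50.0;

        printf("PAC 4 KB:    %.3f\n", amat_pac(0.098, 0.076, penalty));  /* 4.866 */
        printf("PAC 256 KB:  %.3f\n", amat_pac(0.013, 0.012, penalty));  /* 1.603 */

        printf("DM 4 KB:     %.3f\n", 1.0  + 0.098 * penalty);           /* 5.90  */
        printf("DM 256 KB:   %.3f\n", 1.0  + 0.013 * penalty);           /* 1.65  */
        printf("2-way 4 KB:  %.3f\n", 1.36 + 0.076 * penalty);           /* 5.16  */
        printf("2-way 256KB: %.3f\n", 1.36 + 0.012 * penalty);           /* 1.96  */
        return 0;
    }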

35
Solution 9 Trace Caches
  • This type of a cache is an instruction cache
    which supports multiple issue of instructions by
    providing 4 or more independent instructions per
    cycle
  • cache blocks are dynamic, unlike normal caches
    where blocks are static based on what is stored
    in memory
  • here, the block is formed around branch
    prediction, branch folding, and trace scheduling
  • note that because of branch folding and trace
    scheduling, some instructions might appear
    multiple times in the cache, so it is somewhat
    more wasteful of cache space
  • This type of cache offers the advantage of
    directly supporting a multiple-issue architecture
  • the Pentium 4 uses this approach, but most RISC
    processors do not, because the repetition of
    instructions and the high frequency of branches
    cause this approach to waste too much cache space

36
Solution 10 Pipelined Cache Accesses
  • We can pipeline writes to the cache since writes
    take longer than reads
  • the longer duration is because we need to perform
    a tag check before the write can begin
  • a read can commence and if the tag is wrong, the
    item read can be discarded
  • The write takes two steps: a tag comparison first,
    followed by the write (a third step might be added
    in a write-back cache by combining items in a
    buffer)
  • by pipelining writes
  • we can partially speed up the process
  • this works by overlapping the tag check of one
    write with the data write of the previous one
  • assuming the tags are correct
  • in this way, the second write takes the same time
    as a read would
  • although this only works with more than 1
    consecutive write where all writes are cache hits

37
Example
  • Here we see the impact of pipelining cache
    accesses
  • the advantage is that it allows us to reduce
    clock cycle time, the disadvantage is that with a
    shorter clock, cache misses have a larger impact
  • compare the MIPS 5-stage pipeline vs. the MIPS
    R4000 8-stage pipeline
  • assume clock rates of 1 GHz for the MIPS and 1.8
    GHz for the MIPS R4000
  • a main memory access time of 50 ns (we will
    assume no second level cache) and a cache miss
    rate of 5%
  • Assuming no other source of stalls, which machine
    is faster?
  • MIPS miss penalty = 50 ns / (1 ns per cycle) = 50
    cycles
  • MIPS R4000 miss penalty = 50 ns / (1 / 1.8 ns per
    cycle) = 90 cycles
  • CPU time MIPS = (1 + .05 × 50) × 1 ns = 3.5 ns per
    instruction
  • CPU time MIPS R4000 = (1 + .05 × 90) × (1 / 1.8)
    ns = 3.06 ns per instruction
  • So the gain in clock speed from pipelining cache
    accesses more than offsets the increased miss
    penalty
  • to truly see if this is advantageous, we would
    also have to factor in the impact of structural
    hazards and branch penalties
  • the longer the pipeline, the greater the impact is

38
Solution 11 Non-Blocking Caches
  • When an ordinary cache has a miss and must
    retrieve the needed block from memory, the cache
    is unable to respond to other requests
  • a more expensive cache is a non-blocking cache
    which can continue to respond to CPU requests
    while waiting for memory as long as the new
    requests are to blocks that are present
  • this is sometimes referred to as hit under miss
  • this capability becomes essential for
    out-of-order execution architectures like
    Tomasulo's so that the cache does not cause
    stalls
  • a variation is if the cache can permit hits to
    continue in spite of multiple outstanding misses
    (although for this to make any sense, memory must
    also be able to respond to multiple requests)
  • see figure 5.5 on page 297 which illustrates how
    various benchmarks perform with hit under 1 miss,
    hit under 2 misses and hit under 64 misses
  • the non-blocking cache becomes important for
    implementing some other optimizations which we
    will see next

39
Solution 12 Multi-banked Caches
  • A common memory optimization is to have multiple
    memory banks
  • each of which can be accessed independently
  • this allows for parallel access to memory either
    by
  • different devices accessing memory simultaneously
  • accessing multiple words of the same block
    simultaneously
  • We can use this idea on our caches as well by
    interleaving cache addresses across banks
  • L2 of AMD Opteron uses 2 banks, Sun Niagara uses
    4 banks
  • To make this easy to implement, we interleave the
    blocks across banks
  • for instance, for a 4-bank cache, blocks whose
    address mod 4 = 0 are placed in bank 0, blocks
    whose address mod 4 = 1 in bank 1, etc.
  • When would this approach be advantageous?
  • we could use this to implement a form of
    non-blocking cache where we can access another
    bank on a miss (thus, we have hit under miss in 3
    out of 4 instances), and this would be cheaper
    than a truly non-blocking cache

40
Solution 13 Early Restart
  • On a cache miss, memory system moves a block into
    cache
  • moving a full block requires many memory accesses
    and bus transfers
  • Rather than having the cache (and CPU) wait until
    the entire block is available
  • move requested word from the block first to allow
    cache access as soon as the item is available
  • transfer rest of block in parallel with that
    access
  • This requires two ideas
  • early restart the cache transmits the requested
    word as soon as it arrives from memory
  • critical word first have memory return the
    requested word first and the remainder of the
    block afterward (this is also known as wrapped
    fetch)
  • note: we could implement early restart without
    critical word first, but the improvement then
    depends on which word was needed; for instance, if
    the last word of the block is requested, then
    early restart does nothing for us
  • This optimization requires a non-blocking cache
    since the cache needs to start responding to the
    request after the first word is returned

41
Solution 14 Merging Write Buffer
  • We can make our write buffer more efficient as
    follows
  • organize the write buffer in rows, one row
    represents one refill line
  • multiple writes to the same line are saved in the
    same buffer row
  • a write to memory moves the entire block from the
    buffer, reducing the number of writes
  • Recall that a write-through cache may use a write
    buffer to make memory accesses more efficient
  • the write buffer contains multiple items waiting
    to be written to memory
  • with the write buffer, writes to memory are
    postponed until either the buffer is full or a
    modified refill line is being discarded

42
Solution 15 Compiler Optimizations
  • Specific techniques include
  • merging parallel arrays into an array of records
    so that the accesses to a single array element are
    made to consecutive memory locations and thus
    (hopefully) the same refill line
  • loop interchange: exchange loops in a nested loop
    so that array elements are accessed in the order
    in which they appear in the cache rather than in
    the programmer-prescribed order (see the sketch
    below)
  • loop fusion: combine loops that access the same
    array locations so that all accesses are made
    within one iteration
  • blocking: execute code on one part of the array
    before moving on to another part of the array so
    that array elements do not need to be reloaded
    into the cache
  • this is common for applications like image
    processing where several different passes through
    a matrix are made
  • We have already seen that compiler optimizations
    can be used to improve hardware performance
  • There are many ways that compiler optimizations
    can improve cache performance by making sure that
    array accesses are done in a way that supports
    those refill lines currently in cache
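As a concrete illustration, here is a minimal loop interchange
sketch in C; the array name and sizes are made up for illustration.
C stores arrays in row-major order, so the second version walks
memory sequentially and reuses each cache block:

    #define ROWS 1024
    #define COLS 1024

    double m[ROWS][COLS];

    /* column-major traversal: consecutive accesses are COLS*8 bytes apart,
       so nearly every access touches a different cache block               */
    void sum_bad(double *total) {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                *total += m[i][j];
    }

    /* after loop interchange: row-major traversal matches the memory layout,
       so all 8 doubles of a 64-byte block are used before it is evicted     */
    void sum_good(double *total) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                *total += m[i][j];
    }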

43
Solution 16 Hardware Prefetching
  • Prefetching can be controlled by hardware or
    compiler (see the next slide)
  • prefetching can operate on either instructions or
    data or both
  • one simple idea for instruction prefetching is
    that when there is an instruction miss to block
    i, fetch it and then fetch block i 1
  • this might be controlled by hardware outside of
    the cache
  • the second block is left in the instruction
    stream buffer and not moved into the cache
  • if an instruction is part of the instruction
    stream buffer, then the cache access is cancelled
    (and thus a potential miss is cancelled)
  • if prefetching is to place multiple blocks into
    the cache, then the cache must be non-blocking
  • See figure 5.10 on page 306 which shows the
    speedup of many SPEC 2000 benchmarks (mostly FP)
    when hardware prefetching is turned on (speedup
    ranges from 1.16 to 1.97)

44
Solution 17 Compiler Prefetching
  • Another idea is to have the compiler insert
    prefetch instructions into the code (this is for
    data prefetching only)
  • Consider the loop
  • for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];
  • if we have an 8KB direct-mapped data cache with
    16 byte blocks and each element of a and b is 8
    bytes long (double precision floats), we will
    have 150 misses for array a and 101 misses for
    array b
  • by scheduling the code with prefetch
    instructions, we can reduce the misses
  • The new loop becomes
  • for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);  /* prefetch 7 iterations later */
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }
  • This new code has only 19 misses, improving
    performance to 6.2 times faster
  • See pages 307-308 for the rest of the analysis of
    this problem

45
Solution 18 Victim Caches
  • Misses might arise when refill lines conflict
    with each other
  • one line is discarded for another only to find
    the discarded line is needed in the future
  • The victim cache is a small, fully associative
    cache, placed between the cache and memory
  • this cache might store 1-5 blocks
  • The victim cache only stores blocks that are
    discarded from the cache when a miss occurs
  • the victim cache is checked on a miss before going
    on to main memory; if the block is found there,
    the block in the cache and the block in the victim
    cache are swapped

The victim cache is most useful if it backs up
a fast direct-mapped cache, reducing the
direct-mapped cache's conflict miss rate by
adding some associativity. A 4-entry victim cache
might remove ¼ of the misses from a 4KB
direct-mapped data cache. The AMD Athlon uses an
8-entry victim cache.
46
Cache Optimization Summary
Hardware complexity ranges from 0
(cheapest/easiest) to 3 (most expensive/hardest)
47
Continued
48
Sample Problem 1
  • Computer uses a fully associative write-back data
    cache
  • Block size is 64 bytes
  • Given the code below, assume x[1024], b[1024], and
    c[1024] are all double precision floating point
    arrays, with arrays b and c already in cache but
    not x
  • moving elements of x into the cache will not
    cause elements of b or c to be moved out of cache
  • x[0] is stored at the beginning of a block
  • Questions
  • how many misses arise with respect to accessing x
    if the cache uses a no-write allocate policy?
  • how many misses arise with respect to accessing x
    if the cache uses a write allocate policy?
  • redo part a assuming that statement S2 comes
    before statement S1
  • redo part b assuming that statement S2 comes
    before statement S1

for (i = 1; i < 1024; i++) {
    x[i] = b[i] + y;    // S1
    c[i] = x[i] + z;    // S2
}
49
Solution
  • Each array stores 1024 doubles
  • A refill line stores 64 bytes so we can get
    exactly 8 array elements into each refill line
  • Our misses arise only because of accesses to
    array x
  • since b and c are already in the cache
  • The no-write allocate policy means that the write
    in S1 will not load x into the cache, therefore a
    miss in S1 is followed by a read miss in S2
  • Any read miss will bring in a refill line along
    with the next 7 array elements
  • so a read miss occurs for 1 in 8 array elements
  • we will have 1024 / 8 = 128 read misses
  • each read miss will have been preceded by a write
    miss on the same block, giving us a total of 256
    misses
  • The write allocate policy means that the write
    miss in S1 loads the line into cache, so we won't
    have a corresponding read miss in S2 and therefore
    will have a total of 128 misses
  • Reversing the order of S1 and S2 makes the read
    miss happen first, so no matter which write policy
    is used, we will have 128 read misses and no
    write misses
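A small C sketch that simulates just the presence of x's blocks and
counts the misses under both policies; it assumes the loop from the
problem, 8-byte elements, 64-byte blocks, and that b and c always
hit:

    #include <stdio.h>
    #include <string.h>

    #define N 1024
    #define ELEMS_PER_BLOCK 8          /* 64-byte block / 8-byte double */

    /* count misses on array x for one write-miss policy and statement order */
    static int count_misses(int write_allocate, int read_first) {
        char present[N / ELEMS_PER_BLOCK];        /* is x's block in the cache? */
        memset(present, 0, sizeof present);
        int misses = 0;

        for (int i = 1; i < N; i++) {
            int blk = i / ELEMS_PER_BLOCK;
            for (int op = 0; op < 2; op++) {
                int is_write = read_first ? (op == 1) : (op == 0); /* S1/S2 order */
                if (!present[blk]) {
                    misses++;
                    if (!is_write || write_allocate)   /* reads always allocate */
                        present[blk] = 1;
                }
            }
        }
        return misses;
    }

    int main(void) {
        printf("a) no-write allocate, S1;S2: %d misses\n", count_misses(0, 0)); /* 256 */
        printf("b) write allocate,    S1;S2: %d misses\n", count_misses(1, 0)); /* 128 */
        printf("c) no-write allocate, S2;S1: %d misses\n", count_misses(0, 1)); /* 128 */
        printf("d) write allocate,    S2;S1: %d misses\n", count_misses(1, 1)); /* 128 */
        return 0;
    }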

50
Sample Problem 2
  • Memory organized as follows
  • two on-chip caches (one data, one instruction)
  • off-chip cache
  • main memory
  • disk cache
  • disk (swap space)
  • Assume miss rates and access (hit) times of
  • data cache: 5%, 1 clock cycle
  • instruction cache: 1%, 1 clock cycle
  • off-chip cache: 10%, 10 clock cycles
  • main memory: 0.2%, 100 clock cycles
  • disk cache: 20%, 1000 clock cycles
  • swap space: 0%, 250,000 clock cycles
  • If 40% of all instructions are loads or stores,
    what is the effective memory access time for this
    machine?

51
Solution
  • Average memory access time =
  • % instruction accesses × (hit time instr cache +
    miss rate instr cache × (hit time second level
    cache + miss rate second level cache × (hit time
    main memory + miss rate main memory × (hit time
    disk cache + miss rate disk cache × hit time
    disk))))
  • + % data accesses × (hit time data cache + miss
    rate data cache × (hit time second level cache +
    miss rate second level cache × (hit time main
    memory + miss rate main memory × (hit time disk
    cache + miss rate disk cache × hit time disk))))
  • With 1.4 memory accesses per instruction, the %
    of instruction accesses
  • = 1.0 / 1.4 = 71.4%, and the % of data accesses =
    0.4 / 1.4 = 28.6%
  • Average memory access time
  • = 71.4% × (1 + .01 × (10 + .10 × (100 + .002 ×
    (1000 + .20 × 250000)))) + 28.6% × (1 + .05 × (10
    + .10 × (100 + .002 × (1000 + .20 × 250000)))) =
    1.647
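The nested calculation can be double-checked with a short C sketch
using the miss rates and hit times listed in the problem:

    #include <stdio.h>

    int main(void) {
        /* hit times (cycles) and miss rates for the levels below L1 */
        double l2 = 10, mem = 100, dcache = 1000, disk = 250000;
        double mr_l2 = 0.10, mr_mem = 0.002, mr_dcache = 0.20;

        /* effective time of everything below the on-chip caches */
        double below = l2 + mr_l2 * (mem + mr_mem * (dcache + mr_dcache * disk));

        double t_instr = 1 + 0.01 * below;   /* instruction cache: 1% miss rate */
        double t_data  = 1 + 0.05 * below;   /* data cache: 5% miss rate        */

        double f_instr = 1.0 / 1.4;          /* 71.4% of accesses */
        double f_data  = 0.4 / 1.4;          /* 28.6% of accesses */

        printf("effective access time = %.3f cycles\n",
               f_instr * t_instr + f_data * t_data);   /* about 1.647 */
        return 0;
    }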