Chapter 5: Cache

Slides: 40
Provided by: rfox7
Transcript and Presenter's Notes

1
Chapter 5 Cache
  • The memory hierarchy is our solution to a need
    for unlimited fast memory
  • At least 1 instruction fetch and maybe 1 data
    access per cycle (more for a superscalar)
  • Each level of the hierarchy is based (at least in
    part) on Principle of Locality of Reference
  • As we move higher in the hierarchy, each level
    gets faster but also more expensive, therefore it
    is more restricted in size
  • some issues are generic across the hierarchy
  • but, each level has its unique characteristics,
    technology and solutions
  • we have already looked at registers, here we will
    study cache
  • the book also covers main memory and virtual
    memory in this chapter, but we will have to skip
    those for lack of time
  • The main problem we face is that the lower levels
    of the hierarchy are much slower than
    CPU/register/cache speeds, but we have a limited
    number of registers and limited space in our cache

We also find that CPU speed is improving at a
much faster rate than memory access time is
improving
2
Effects on memory speed
  • Memory speed has a direct effect on CPU
    performance as indicated by
  • CPU execution time = (CPU clock cycles + memory
    stall cycles) × clock cycle time
  • mem stall cycles = IC × mem references per instr ×
    miss rate × miss penalty
  • mem references per instr > 1 since there will be
    the instruction fetch itself, and possibly 1 or
    more data fetches
  • whenever an instruction or data is not in
    registers, we must fetch it from cache, but if it
    is not in cache, we accrue a miss penalty by
    having to access the much slower main memory
  • A large enough miss penalty will cause a
    substantial increase in CPU execution time
  • Consider the following example
  • CPI = 1.0 when all memory accesses are hits
  • Only data accesses are during loads and stores
    (50% of all instructions are loads or stores)
  • Miss penalty is 25 clock cycles, miss rate is 2%
  • How much faster would the computer be if all
    cache accesses were hits?
  • CPI = 1.0 without misses
  • CPI = 1.0 + 100% × 2% × 25 + 50% × 2% × 25 = 1.75
    with misses
  • The ideal machine is 1.75 times as fast as (75%
    faster than) our realistic machine
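
The arithmetic in this example can be checked with a short sketch. The function name and parameter names below are illustrative only; the numbers are the ones given on the slide.

```python
def cpi_with_misses(base_cpi, data_refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stalls.

    Every instruction makes one access for its own fetch, plus
    data_refs_per_instr data accesses on average.
    """
    accesses_per_instr = 1.0 + data_refs_per_instr
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty


ideal_cpi = 1.0
real_cpi = cpi_with_misses(base_cpi=1.0, data_refs_per_instr=0.5,
                           miss_rate=0.02, miss_penalty=25)
speedup_of_ideal = real_cpi / ideal_cpi
print(f"CPI with misses: {real_cpi:.2f}, ideal machine speedup: {speedup_of_ideal:.2f}")
```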

3
Four questions
  • The general piece of memory will be called a
    block
  • Blocks differ in size depending on the level of
    the memory hierarchy
  • cache block, memory block, disk block
  • We ask the following questions pertaining to
    cache, main memory and disk
  • Q1: where can a block be placed?
  • Q2: how is a block found?
  • Q3: which block should be replaced on a miss?
  • Q4: what happens on a write?
  • Cache is made from SRAMs whereas main memory is
    made from DRAM
  • SRAM is faster but much more expensive
  • SRAM is also used to make registers, the
    technology is based on flip-flop circuits
  • Cache acts as an intermediate between registers
    and main memory in the memory hierarchy
  • Three types of caches: direct mapped,
    associative, set associative
  • Today, we usually have two caches, one for
    instructions and one for data
  • connected to the CPU by two separate ports

4
Q1 Where can a block be placed?
  • Type determines placement
  • Associative cache
  • any available block
  • Direct mapped cache
  • given memory block has only one location where it
    can be placed in cache, determined by the
    equation
  • (block address) mod (number of blocks in cache)
  • Set associative cache
  • given memory block has a set of blocks in the
    cache where it can be placed, determined by
  • (block addr) mod (number of sets), where number of
    sets = (number of blocks in cache) / associativity

Here we have a cache of 8 blocks and a memory of
32 blocks. To place memory block 12, we can put
it in any block in an associative cache, in block 4
(12 mod 8) in a direct mapped cache, and in set 0
(12 mod 4), that is block 0 or 1, in a 2-way set
associative cache
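
The three placement rules can be expressed as one function, since direct mapped is the 1-way case and fully associative is the case where the number of ways equals the number of blocks. This is a minimal sketch; the name candidate_blocks and the assumption that the blocks of a set are numbered consecutively are illustrative choices, not something the slides specify.

```python
def candidate_blocks(block_addr, cache_blocks, ways):
    """Return the cache block numbers where a memory block may be placed.

    ways == 1 models a direct-mapped cache,
    ways == cache_blocks models a fully associative cache.
    Assumes the blocks of set s are numbered s*ways .. s*ways + ways - 1.
    """
    num_sets = cache_blocks // ways
    set_index = block_addr % num_sets
    first = set_index * ways
    return list(range(first, first + ways))


direct = candidate_blocks(12, cache_blocks=8, ways=1)
two_way = candidate_blocks(12, cache_blocks=8, ways=2)
fully = candidate_blocks(12, cache_blocks=8, ways=8)
print(direct, two_way, fully)
```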
5
Q2 How is a block found in cache? Q3 Which
block should be replaced?
  • All memory addresses consist of a tag, a line
    number (or index), and a block offset
  • In a direct mapped cache, the line number
    dictates the line where a block must be placed or
    where it will be found
  • the tag is used to make sure that the line we
    have found is the line we want
  • In a set associative cache, the line number
    references a set of lines
  • the block must be placed in one of those lines,
    but there is some variability: which line should
    we put it in, and which line will we find it in?
  • In a fully associative cache, a line can go
    anywhere
  • For the last two types of cache
  • we do an associative search of all relevant tags
  • we use a replacement strategy to determine which
    line we will discard to use for the new item
  • Replacement strategies
  • Random
  • FIFO
  • Least Recently Used
  • most efficient as it better models the principle
    of locality of reference but hard to implement
  • Others include LRU approximation and LFU (least
    frequently used)
  • Figure 5.6 page 400 compares the performance
    between FIFO, Random, and LRU
  • notice their performances are similar but LRU is
    usually better

6
Q4 What happens on a write?
  • On a cache write, what happens to the old (now
    stale) value in memory? Two approaches
  • Write Through cache
  • write the datum to both cache and memory at the
    same time
  • this is inefficient because the data access is a
    word, typical data movement between cache and
    memory is a block, so this write uses only part
    of the bus for a transfer
  • notice other words in the same block may also
    soon be updated, so waiting could pay off
  • Write Back cache
  • write to cache, wait on writing to memory until
    the entire block is being removed from cache
  • add a dirty bit to the cache to indicate that the
    cache value is right, memory is wrong
  • Write Through is easier to implement since memory
    will always be up-to-date and we don't need dirty
    bit mechanisms
  • Write Back is preferred to reduce memory traffic
    (a write stall occurs in Write Through if the CPU
    must wait for the write to take place)
  • To alleviate the inefficiency of Write Through,
    we may add a write buffer
  • writes go to cache and the buffer, the CPU
    continues without stalling
  • writes to memory occur when the buffer is full or
    when a line is filled
  • What happens on a write miss? Two options
  • Write allocate: block fetched on a miss, the
    write takes place at both the cache and memory
  • No-write allocate: block modified in memory
    without being brought into the cache

7
Write Miss Example
  • Consider a write-back cache which starts empty and
    the sequence of operations listed below
  • How many hits and how many misses occur with
    no-write allocate versus write allocate?
  • Solution
  • For no-write allocate
  • the first two operations cause misses (since
    after the first one, 100 is still not loaded into
    cache), the third instruction causes a miss, the
    fourth instruction is a hit (since 200 is now in
    cache) but the fifth is also a miss, so 4 misses,
    1 hit
  • For write allocate
  • the first access to a memory location is always a
    miss, but from there, it is in cache and the rest
    are hits, so we have 2 misses (one for each of
    100 and 200) and 3 hits

Sequence of operations: Write 100, Write 100,
Read 200, Write 200, Write 100
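
The hit and miss counts in this example can be reproduced with a small simulation. This is a sketch under simplifying assumptions: the cache is modeled as an unbounded set of addresses (so nothing is ever evicted), and the function name is made up for illustration.

```python
def count_hits_misses(ops, write_allocate):
    """Count (hits, misses) for a list of (kind, address) operations.

    A read miss always loads the block. A write miss loads the block
    only under write allocate.
    """
    cached = set()
    hits = misses = 0
    for kind, addr in ops:
        if addr in cached:
            hits += 1
        else:
            misses += 1
            if kind == "read" or write_allocate:
                cached.add(addr)
    return hits, misses


ops = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]
no_write_allocate = count_hits_misses(ops, write_allocate=False)
with_write_allocate = count_hits_misses(ops, write_allocate=True)
print("no-write allocate (hits, misses):", no_write_allocate)
print("write allocate (hits, misses):", with_write_allocate)
```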
8
Example: Alpha 21264
  • Found in Alpha-Server ES40 workstations
  • 64 Kbytes in 64-byte blocks (1024 blocks)
  • 2-way set associative, write-back, write-allocate
  • CPU address consists of a 29-bit tag, a 9-bit
    index and a 6-bit offset
  • The index selects one of 512 sets, and the two
    tags in that set are compared in parallel
  • A valid bit marks whether a line holds usable
    data, and because it is a write-back cache, a
    dirty bit marks blocks that are newer than the
    copy in memory

The victim buffer will be explained later in the
chapter. This cache uses a FIFO replacement
strategy and transfers 16 bytes per cycle for 4
cycles on a miss
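
To make the address layout concrete, here is a minimal sketch that splits an address into the tag, index and offset fields using the bit widths from this slide. The function name and the sample address are illustrative.

```python
OFFSET_BITS = 6    # 64-byte blocks
INDEX_BITS = 9     # 512 sets
TAG_BITS = 29      # remaining upper bits of the address


def split_address(addr):
    """Return (tag, index, offset) for an address with the layout above."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Sanity check of the geometry: sets x ways x block size = 64 KB.
cache_bytes = (1 << INDEX_BITS) * 2 * (1 << OFFSET_BITS)

sample = (5 << (OFFSET_BITS + INDEX_BITS)) | (3 << OFFSET_BITS) | 7
fields = split_address(sample)
print(cache_bytes, fields)
```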
9
Cache Size and Performance
  • To determine a cache's performance, we compute
    memory access time
  • Average memory access time = hit time + miss rate
    × miss penalty
  • Hit time - time to fetch from cache (usually 1-2
    clock cycles)
  • Miss rate - percentage of accesses not found in
    cache
  • Miss penalty - time it takes to access and
    retrieve missed item from main memory (might be
    20-120 clock cycles or more)
  • The larger the cache, the better its performance
  • As cache size increases, miss rate decreases
  • Another issue is whether the cache is used for
    both data and instructions or just one
  • Notice that instruction caches perform much
    better than data caches why?

Note: this table does not show miss rate. We
are seeing misses per instruction, not per access.
Given the number of misses per 1000 instructions, divide
by 10 to get a percentage (e.g., 6.3% for the 8KB
unified cache)
10
Example
  • Let's compare using 16 KB instruction and 16 KB
    data caches vs. one 32 KB unified cache
  • Assume
  • 1 clock cycle hit time
  • 100 clock cycle miss penalty for the individual
    caches
  • add 1 clock cycle hit time for load/store in the
    unified cache (36% of instructions are
    load/stores)
  • write-through caches with write buffer, no stalls
    on writes
  • What is the average memory access time for both
    caches?
  • We get misses per instruction from the table on
    the previous slide
  • Converting to miss rate
  • Instruction cache: (3.82 / 1000) / 1.0 = .00382
  • Data cache: (40.9 / 1000) / .36 = .1136
  • Unified cache: (43.3 / 1000) / 1.36 = .0318
  • Of 136 accesses per 100 instr., percentage of
    instr. accesses = 100 / 136 = 74% and percentage
    of data accesses = 36 / 136 = 26%
  • Memory access time for 2 caches = 74% × (1 +
    .00382 × 100) + 26% × (1 + .1136 × 100) = 4.236
  • Memory access time for unified cache = 74% × (1 +
    .0318 × 100) + 26% × (2 + .0318 × 100) = 4.44
  • Separate caches perform better!
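
The comparison can be worked through in a few lines. This sketch uses the rounded values from the slide (74%, 26%, .1136, .0318) so that the results match the slide's figures; carrying full precision through would shift both results slightly. The variable names are illustrative.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


MISS_PENALTY = 100
instr_frac, data_frac = 0.74, 0.26       # 100/136 and 36/136, rounded
instr_miss_rate = 0.00382                # (3.82 / 1000) / 1.0
data_miss_rate = 0.1136                  # (40.9 / 1000) / 0.36, rounded
unified_miss_rate = 0.0318               # (43.3 / 1000) / 1.36, rounded

split_amat = (instr_frac * amat(1, instr_miss_rate, MISS_PENALTY)
              + data_frac * amat(1, data_miss_rate, MISS_PENALTY))

# Loads and stores pay one extra cycle of hit time in the unified cache.
unified_amat = (instr_frac * amat(1, unified_miss_rate, MISS_PENALTY)
                + data_frac * amat(2, unified_miss_rate, MISS_PENALTY))

print(f"split: {split_amat:.3f} cycles, unified: {unified_amat:.3f} cycles")
```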

11
Revised CPU Performance and Example
  • Recall our previous CPU formula
  • CPU time = (CPU cycles + memory stall cycles) ×
    clock cycle time
  • assume memory stalls are caused by cache misses,
    not problems like bus contention, I/O, etc
  • Memory stall cycles
  • = memory accesses × miss rate × miss penalty
    = reads × read miss rate × read miss penalty +
    writes × write miss rate × write miss penalty
  • CPU time
  • = IC × (CPI + mem accesses per instr × miss rate ×
    miss penalty) × clock cycle time
  • = IC × (CPI × CCT + mem accesses per instr × miss
    rate × mem access time)
  • Sun UltraSPARC III, assume
  • miss penalty = 100 cycles
  • instructions normally take 1.0 cycles (CPI = 1.0)
  • cache miss rate of 2%
  • 1.5 memory references per instruction (1 fetch,
    50% loads/stores)
  • average number of cache misses is 30 per 1000
    instructions
  • NOTE: this is the same as a 2% miss rate (1.5
    memory accesses per instruction with a 2% miss
    rate yields 30 misses per 1000 instructions)
  • Impact of imperfect cache
  • CPU time = IC × (CPI + memory stalls / instr) ×
    clock cycle time
  • = IC × (1 + .02 × 1.5 × 100) × CCT = IC × 4.0 ×
    CCT, or
  • = IC × (1 + 30 / 1000 × 100) × CCT = IC × 4.0 ×
    CCT
  • With a perfect cache, we would have CPU time = IC
    × 1 × CCT, so the imperfect cache runs at 1/4 the
    speed, a 4 times slowdown!

12
Another Example
  • What impact does cache organization
    (direct-mapped vs. 2-way set associative) have on
    a CPU?
  • Cache 1: direct-mapped, 64 KB, 64 byte blocks,
    1.4% miss rate
  • Cache 2: 2-way set associative, 64 KB, 64 byte
    blocks, 1.0% miss rate
  • CPU has a CPI = 2.0, clock cycle time = 1 ns,
    memory access time is 75 ns, 1.5 memory
    references per instruction, cache access is 1
    cycle
  • the direct-mapped cache is faster, so its clock
    speed is faster; we will assume the CPU clock
    cycle time for the set associative cache is 1.25
    times that of the direct-mapped cache
  • CPU time with Cache 1
  • = IC × (2.0 × CCT + 1.5 × .014 × 75) = 3.575 × IC
    × CCT
  • CPU time with Cache 2
  • = IC × (2.0 × 1.25 × CCT + 1.5 × .01 × 75) = 3.625
    × IC × CCT
  • CPU with Cache 1 is 3.625 / 3.575 = 1.014 times
    faster
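
As a quick check of this comparison, the sketch below computes the time per instruction in nanoseconds for both organizations, using the base clock cycle time of 1 ns from the slide. The function name is illustrative.

```python
def time_per_instr_ns(cpi, cct_ns, refs_per_instr, miss_rate, mem_time_ns):
    """Execution time per instruction: CPI * CCT plus memory stall time."""
    return cpi * cct_ns + refs_per_instr * miss_rate * mem_time_ns


direct_mapped = time_per_instr_ns(2.0, 1.0, 1.5, 0.014, 75)
# The 2-way cache stretches the clock cycle by a factor of 1.25.
two_way = time_per_instr_ns(2.0, 1.25, 1.5, 0.010, 75)
ratio = two_way / direct_mapped

print(f"direct-mapped: {direct_mapped:.3f} ns, 2-way: {two_way:.3f} ns, "
      f"ratio: {ratio:.3f}")
```

The lower miss rate of the 2-way cache does not make up for its longer clock cycle in this configuration, which is why the direct-mapped design comes out slightly ahead.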

13
Out of Order and Miss Penalty
  • In our prior examples, cache misses caused the
    pipeline to stall thus impacting CPI
  • In a multiple-issue out-of-order execution
    architecture, like Tomasulo, a miss means that a
    particular instruction stalls, possibly stalling
    others because it ties up a reservation station
    or reorder buffer slot, but it is more likely
    that it will not impact overall CPI
  • How then do we determine the impact of cache
    misses on such architectures?
  • We might define memory stall cycles / instruction
    = misses / instruction × (total miss latency -
    overlapped miss latency)
  • Total miss latency: the total of all memory
    latencies, where the memory latency is that of a
    single instruction
  • Overlapped miss latency: the amount of time that
    the miss is not impacting performance because
    other instructions remain executing
  • these two terms are difficult to analyze, so we
    won't cover this in any more detail
  • Typically a multi-issue out-of-order architecture
    can hide some of the miss penalty, up to 30% as
    shown in an example on pages 411-412

14
Improving Cache Performance
  • After reading some 5000 research papers on
    caches, the authors offer four distinct
    approaches to improving cache performance based
    on the formula
  • average memory access time = hit time + miss rate
    × miss penalty
  • Reduce miss rate
  • Reduce miss penalty
  • Reduce miss rate or miss penalty through
    parallelism
  • Reduce hit time
  • For each of these, there are numerous possible
    approaches, many of them hardware or technology
    based, but a few can also be implemented by the
    compiler
  • Comments
  • miss penalty is the biggest value in the
    equation, so this should be the obvious target to
    reduce, but in fact little can be done to
    increase memory speed
  • reducing miss rate has a number of different
    approaches; however, miss rates today are often
    less than 2%, so can we continue to improve?
  • reducing hit time has the benefit of allowing us
    to lower clock cycle time as well
  • We will look at each of these in sections 5.4-5.7

15
Reducing Cache Miss Penalties
  • Traditionally, the focus on cache improvements is
    on miss rate
  • Since miss penalty is a large value, reducing it
    will have a large impact on cache performance
  • Recall
  • average memory access time = hit time + miss rate
    × miss penalty
  • miss penalty is the time to retrieve from main
    memory
  • A smaller miss penalty means that the miss rate
    has less of an impact
  • The problem with reducing miss penalty is that
  • DRAM speeds stay roughly the same over time while
    processor and SRAM speeds increase
    dramatically
  • The net result is that the miss penalty has been
    increasing over time rather than decreasing!

16
Solution 1 Multilevel Caches
  • To improve performance, we find that we would
    like
  • a faster cache to keep pace with the CPU
  • a larger cache to lower miss rate
  • Which should we pick? Both
  • Offer a small but fast cache on the CPU chip
  • Offer a larger but slower cache on the
    motherboard
  • the slower cache is still much faster than
    main memory
  • This gives us a new formula for average memory
    access time
  • Avg mem access time = hit time L1 + miss rate L1 ×
    miss penalty L1
  • L1 is the first cache (called the first-level
    cache)
  • Miss penalty L1 = hit time L2 + miss rate L2 ×
    miss penalty L2
  • L2 is the second cache (called the second-level
    cache)
  • Avg mem access time = hit time L1 + miss rate L1 ×
    (hit time L2 + miss rate L2 × miss penalty L2)

17
Redefining Miss Rate and Example
  • We must redefine miss rate for second cache
  • Local miss rate = number of cache misses / number
    of mem accesses to this cache
  • Global miss rate = number of cache misses /
    number of mem accesses overall
  • Values are the same for 1st level cache, but
    differ for 2nd level cache
  • Local miss rate for second cache will be larger
    than local miss rate for first cache
  • the first cache skims the cream of the crop
  • second level cache is only accessed when the
    first level misses entirely
  • Global miss rate is more useful than local miss
    rate for the second cache
  • global miss rate tells us how many misses there
    are in all accesses
  • Assume
  • in 1000 references, level 1 has 40 misses and
    level 2 has 20; determine local/global miss rates
  • Local (and global) miss rate cache 1 = 40/1000 =
    4%
  • Local miss rate cache 2 = 20/40 = 50%
  • Global miss rate cache 2 = 20/1000 = 2%
  • Local miss rate cache 2 is misleading, global miss
    rate gives us an indication of how both caches
    perform overall
  • L1 hit time is 1 cycle, L2 hit time is 10 cycles,
    memory access time is 100 cycles
  • what is the average memory access time?
  • Avg. mem access time = 1 + 4% × (10 + 50% × 100) =
    3.4 cycles
  • Without L2, we have avg. mem access time = 1 + 4%
    × 100 = 5 cycles, so the L2 cache gives us a 5 /
    3.4 = 1.47 or 47% speedup!
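
The local and global miss rates and the two-level access time from this example can be computed as follows. This is a sketch with illustrative names; note that the two-level formula uses the local miss rate of L2.

```python
def two_level_amat(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2,
                   mem_penalty):
    """Average memory access time with an L2 behind the L1 cache."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * mem_penalty
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


references, l1_misses, l2_misses = 1000, 40, 20
l1_miss_rate = l1_misses / references        # local and global are equal
l2_local_miss_rate = l2_misses / l1_misses   # only L1 misses reach L2
l2_global_miss_rate = l2_misses / references

with_l2 = two_level_amat(1, l1_miss_rate, 10, l2_local_miss_rate, 100)
without_l2 = 1 + l1_miss_rate * 100
speedup = without_l2 / with_l2

print(f"with L2: {with_l2:.2f}, without L2: {without_l2:.2f}, "
      f"speedup: {speedup:.2f}")
```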

18
Another Example
  • Here we see the benefit of an associative cache
    for a second-level cache instead of direct-mapped
  • Compare direct-mapped vs. 2-way set associative
    caches for second level
  • Direct-mapped L2 has hit time = 10 cycles
  • Direct-mapped L2 has local miss rate = 25%
  • 2-way set-associative L2 has hit time = 10.1
    cycles
  • 2-way set-associative L2 has local miss rate =
    20%
  • Miss penalty L2 = 100 cycles
  • Direct-mapped L2, miss penalty L1 = 10 + .25 × 100
    = 35 cycles
  • 2-way set-associative L2, miss penalty L1 = 10.1 +
    .20 × 100 = 30.1 cycles
  • NOTE: we will almost always synchronize L2 with
    the clock, so in this case, we would just raise
    the hit time for the set-associative cache to
    11 cycles, resulting in a miss penalty = 11 + .20
    × 100 = 31 cycles, still an improvement over
    direct-mapped

19
Solution 2 Early Restart
  • On a cache miss, memory system moves a block into
    cache
  • moving a full block will require many bus
    transfers
  • Rather than having the cache (and CPU) wait until
    the entire block is available
  • move requested word from the block first to allow
    cache access as soon as the item is available
  • transfer rest of block in parallel with that
    access
  • this requires two ideas
  • early restart: the cache transmits the requested
    word as soon as it arrives from memory
  • critical word first: have memory return the
    requested word first and the remainder of the
    block afterward (this is also known as wrapped
    fetch)
  • Example: calculate average memory access time
    for the critical word and for the remainder of the
    block and compare against a cache that fetches
    the entire block without critical word first
  • 64-byte cache blocks
  • L2 takes 11 cycles to get first 8 bytes
  • 2 clock cycles per 8 bytes for the remainder of
    the transfer
  • Avg. miss penalty = 11 cycles for first word
  • Average miss penalty for entire block = 11 + 2 ×
    (64 - 8) / 8 = 25 cycles
  • To implement early restart/critical word first,
    we need a non-blocking cache. This is expensive, so
    this approach only pays off if we have large
    block sizes (e.g., block size > bus bandwidth)
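
The two penalties in this example differ only in how many 8-byte transfers the CPU has to wait for, which a short sketch makes explicit. The function name and parameter names are illustrative.

```python
def block_miss_penalty(first_chunk_cycles, cycles_per_chunk,
                       block_bytes, chunk_bytes):
    """Cycles until the whole block has arrived from L2."""
    remaining_chunks = (block_bytes - chunk_bytes) // chunk_bytes
    return first_chunk_cycles + cycles_per_chunk * remaining_chunks


# With critical word first, the CPU waits only for the first chunk.
critical_word_penalty = 11
whole_block_penalty = block_miss_penalty(11, 2, block_bytes=64, chunk_bytes=8)
print(critical_word_penalty, whole_block_penalty)
```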

20
Solution 3 Priority of Reads over Writes
  • Make the more common case fast
  • Reads occur with a much greater frequency than
    writes
  • instructions are read only, many operands are
    read but not written back
  • So, let's make sure that reads are faster than
    writes
  • Writes are slower anyway because of the need to
    write to both cache and main memory
  • If we use a write buffer for both types of write
    policy
  • Write-through cache writes to write buffer first,
    and any read misses are given priority over
    writing the write buffer to memory
  • Write-back cache writes to write buffer and the
    write buffer is only written to memory when we
    are assured of no conflict with a read miss
  • So, read misses have priority over write misses
    since read misses are more common, so we make the
    common case fast
  • See the example on pages 419-420

21
Solution 4 Merging Write Buffer
  • Here, we will organize the write buffer in rows,
    one row represents one refill line
  • Multiple writes to the same line can be saved in
    the same buffer row
  • a write to memory moves the entire block from the
    buffer, reducing the number of writes
  • We follow up the previous idea with a more
    efficient write buffer
  • the write buffer contains multiple items to be
    written to memory
  • in write-through, writes to memory are postponed
    until either the buffer is full or a refill line
    is discarded and has been modified

22
Solution 5 Victim Caches
  • Misses might arise when refill lines conflict
    with each other
  • one line is discarded for another only to find
    the discarded line is needed in the future
  • The victim cache is a small, fully associative
    cache, placed between the cache and memory
  • this cache might store 1-5 blocks
  • Victim cache only stores blocks that are
    discarded from the cache when a miss occurs
  • victim cache is checked on a miss before going on
    to main memory and if found, the block in the
    cache and the block in the victim cache are
    switched

The victim cache is most useful if it backs up
a fast direct-mapped cache, reducing the
direct-mapped cache's conflict miss rate by
adding some associativity. A 4-item victim cache
might remove ¼ of the misses from a 4KB
direct-mapped data cache. The AMD Athlon uses an
8-entry victim cache
23
Reducing Cache Misses
  • Compulsory miss rates are usually small
  • there is little we can do about these misses
    other than prefetching
  • We can eliminate all conflict misses if we
  • use a fully associative cache
  • but fully associative caches are expensive in
    terms of hardware and slower which lengthens the
    clock cycle, reducing overall performance
  • Little can be done for capacity misses
  • other than having larger caches but we will find
    other things we can adjust to improve on capacity
    misses
  • Misses can be categorized as
  • Compulsory
  • very first access to a block cannot be in the
    cache because the process has just begun and
    there has not been a chance to load anything into
    the cache
  • Capacity
  • the cache cannot contain all of the blocks needed
    for the process
  • Conflict
  • the block placement strategy only allows a block
    to be placed in a certain location in the cache
    bringing about contention with other blocks for
    that same location
  • See figure 5.14 page 424

24
Solution 1 Larger Block Sizes
  • Larger block sizes will reduce compulsory misses
  • Larger blocks can take more advantage of spatial
    locality of reference
  • But, larger blocks can increase miss penalty
    because it physically takes longer to transfer
    the block from main memory to cache
  • Also, larger blocks mean fewer blocks in cache
    which itself can increase the miss rate
  • this depends on program layout and the size of
    the cache vs. block size

A block size of 64 to 128 bytes provides
the lowest miss rates
25
Example Impact of Block Size
  • Assume memory system takes 80 clock cycles and
    then delivers 16 bytes every 2 clock cycles.
  • Which block size has the minimum average memory
    access time for each cache size?
  • Average memory access time = hit time + miss rate
    × miss penalty
  • Hit time = 1
  • Use data in fig 5.17 for miss rate
  • Miss penalty depends on size of block
  • 82 cycles for 16 bytes, 84 cycles for 32 bytes,
    etc
  • For k byte blocks
  • miss penalty = (k / 16) × 2 + 80
  • Solution
  • Average memory access time for 16 byte block in a
    4 KB cache = 1 + (8.57% × 82) = 8.027 cycles
  • For 256 byte block in a 256KB cache = 1 + (.49% ×
    112) = 1.549 clock cycles
  • The complete results of this exercise are in fig
    5.18
  • Note: lowest avg memory access time comes with
  • 32 byte blocks (for 4K) and
  • 64 byte blocks (for 16K, 64K and 256K cache)
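
The miss penalty formula and the two worked data points above can be checked with a short sketch. Only the two miss rates quoted on this slide are used; the rest of the figure's data is not reproduced here. Function names are illustrative.

```python
def miss_penalty(block_bytes, latency=80, bytes_per_transfer=16,
                 cycles_per_transfer=2):
    """Cycles to fetch a block: fixed latency plus 2 cycles per 16 bytes."""
    return latency + (block_bytes // bytes_per_transfer) * cycles_per_transfer


def amat(miss_rate, block_bytes, hit_time=1):
    """Average memory access time for a given miss rate and block size."""
    return hit_time + miss_rate * miss_penalty(block_bytes)


small = amat(0.0857, 16)    # 16-byte blocks in a 4 KB cache
large = amat(0.0049, 256)   # 256-byte blocks in a 256 KB cache
print(f"{small:.3f} cycles, {large:.3f} cycles")
```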

We must compromise because a bigger block size
reduces miss rate to some extent, but also
increases miss penalty
26
Solution 2 Larger Caches
  • A larger cache will reduce capacity miss rates
    since the cache has a larger capacity, but also
    conflict miss rates because the larger cache
    allows more refill lines and so fewer conflicts
  • This is an obvious solution and has no seeming
    performance drawbacks
  • However, you must be careful where you put this
    larger cache
  • A larger on-chip cache might take space away from
    other hardware that could provide performance
    increases (registers, more functional units,
    logic for multiple-issue of instructions, etc)
  • And more cache means a greater expense for the
    machine
  • The authors note that second-level caches from
    2001 computers are equal in size to main memories
    from 10 years ago!

27
Solution 3 Higher Associativity
  • So why use direct-mapped?
  • associativity will always have a higher hit time
  • How big is the difference?
  • As we saw in an earlier example, a 2-way set
    associative cache was about 10% slower than the
    direct-mapped
  • This doesn't seem like a big deal
  • BUT
  • Clock speed is usually equal to cache hit time so
    we wind up slowing down the entire computer when
    using associative caches of some kind
  • So, with this in mind, should we use
    direct-mapped or set associative?
  • A large 8-way associative cache will have about a
    0% conflict miss rate, meaning that it is about
    as good at reducing miss rate as a fully
    associative cache
  • Cache research also points out the 2:1 cache
    rule of thumb
  • a direct-mapped cache of size N has about the
    same miss rate as a 2-way set associative cache
    of size N/2, so that larger associativity yields
    smaller miss rates

28
Example Impact of Associativity
  • Average memory access time = hit time + miss rate
    × miss penalty
  • Assume higher associativity increases clock cycle
    time as follows
  • Clock 2-way = 1.36 × clock direct-mapped
  • Clock 4-way = 1.44 × clock direct-mapped
  • Clock 8-way = 1.52 × clock direct-mapped
  • Assume L1 cache is direct-mapped with 1 cycle hit
    time and determine best L2 type given that miss
    penalty for direct-mapped is 25 cycles and L2
    never misses (further, we will not round off
    clock cycles)
  • Using a 4 KB cache we get
  • 1 + .098 × 25 = 3.45 (direct)
  • 1.36 + .076 × 25 = 3.26 (2-way)
  • 1.44 + .071 × 25 = 3.22 (4-way)
  • 1.52 + .071 × 25 = 3.30 (8-way)
  • Using a 512 KB cache we get
  • 1 + .008 × 25 = 1.2 (direct)
  • 1.36 + .007 × 25 = 1.535 (2-way)
  • 1.44 + .006 × 25 = 1.59 (4-way)
  • 1.52 + .006 × 25 = 1.67 (8-way)
  • See figure 5.19; although their answers are off
    a little, you can see that direct-mapped is often
    the best in spite of worse miss rate (4-way is
    best for 4 KB and 8 KB caches)
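
The table of results for this example can be regenerated from the hit times and miss rates listed on the slide. This is a sketch; the dictionary layout and function name are illustrative choices.

```python
MISS_PENALTY = 25
# Hit time in cycles by associativity (number of ways).
HIT_TIME = {1: 1.00, 2: 1.36, 4: 1.44, 8: 1.52}
# Miss rates by cache size and associativity, as listed on the slide.
MISS_RATE = {
    "4 KB": {1: 0.098, 2: 0.076, 4: 0.071, 8: 0.071},
    "512 KB": {1: 0.008, 2: 0.007, 4: 0.006, 8: 0.006},
}


def amat_by_ways(size):
    """Average memory access time for each associativity at one size."""
    return {ways: HIT_TIME[ways] + rate * MISS_PENALTY
            for ways, rate in MISS_RATE[size].items()}


results = {size: amat_by_ways(size) for size in MISS_RATE}
best = {size: min(table, key=table.get) for size, table in results.items()}
for size, table in results.items():
    print(size, {w: round(t, 3) for w, t in table.items()}, "best:", best[size])
```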

29
Solution 4 Pseudo-Associative Cache
  • We can alter a direct-mapped cache to have some
    associativity as follows
  • Consult the direct-mapped cache as normal
  • provides fast hit time
  • If there is a miss, invert the address and try
    the new address
  • inversion might flip the last bit in the line
    number
  • the second access comes at a cost of a higher hit
    time for the second attempt (it may also cause
    other accesses to stall while the second access
    is being performed!)
  • Thus, the same address might be stored in one of
    two locations, thus giving some associativity
  • The pseudo-associative cache will reduce the
    amount of conflict misses
  • any cache miss may still become a cache hit
  • First check is fast (hit time of direct-mapped)
  • Second check might take 1-2 cycles further, so is
    still faster than a second-level cache

30
Example
  • Assume hit time = 1 cycle for 1st access, 3
    cycles for 2nd access and a miss penalty of 50
    cycles
  • Which provides a faster average memory access
    time for 4 KB and 256 KB caches: direct-mapped,
    2-way associative or pseudo-associative (PAC)?
  • avg mem acc time = hit time + miss rate × miss
    penalty
  • For PAC, an entry will either be in its
    direct-mapped location or the location found by
    inverting 1 bit
  • since each entry in the PAC has 2 possible
    locations, this makes the PAC similar to a 2-way
    associative cache, but the PAC has a faster first
    hit time than 2-way associative, followed by a
    second access (in this case, 3 cycles)
  • avg mem access time = hit time + alternative hit
    rate × 3 + miss rate 2-way × miss penalty 1-way
  • Alternative hit rate is the hit rate for the
    second access
  • with 2 possible places for the item, this second
    hit rate will be hit rate 2-way - hit rate 1-way,
    that is, the hit rate of a 2-way set associative
    cache (because there are 2 places the item could
    be placed) minus the hit rate of a direct-mapped
    cache
  • Alternative hit rate = hit rate 2-way - hit rate
    1-way = (1 - miss rate 2-way) - (1 - miss rate
    1-way) = miss rate 1-way - miss rate 2-way
  • For PAC
  • 4 KB: 1 + (.098 - .076) × 3 + (.076 × 50) =
    4.866
  • 256 KB: 1 + (.013 - .012) × 3 + (.012 × 50) =
    1.603
  • For direct-mapped cache
  • 4 KB: 1 + .098 × 50 = 5.9
  • 256 KB: 1 + .013 × 50 = 1.65
  • For 2-way set associative (recall, longer clock
    cycle)
  • 4 KB: 1.36 + .076 × 50 = 5.16
  • 256 KB: 1.36 + .012 × 50 = 1.96
  • So, pseudo-associative cache outperforms both!
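
The pseudo-associative formula can be sketched in code and compared against the direct-mapped and 2-way figures, using the miss rates and cycle counts from this example. Function and variable names are illustrative.

```python
MISS_PENALTY = 50


def pac_amat(miss_rate_1way, miss_rate_2way, hit_time=1, second_hit_cycles=3):
    """Average access time of a pseudo-associative cache.

    Hits found on the second probe happen at rate
    (miss rate 1-way - miss rate 2-way) and cost extra cycles.
    """
    alternative_hit_rate = miss_rate_1way - miss_rate_2way
    return (hit_time + alternative_hit_rate * second_hit_cycles
            + miss_rate_2way * MISS_PENALTY)


def simple_amat(hit_time, miss_rate):
    return hit_time + miss_rate * MISS_PENALTY


# (1-way miss rate, 2-way miss rate) per cache size, from the slide.
rates = {"4 KB": (0.098, 0.076), "256 KB": (0.013, 0.012)}
summary = {}
for size, (one_way, two_way) in rates.items():
    summary[size] = {
        "pac": pac_amat(one_way, two_way),
        "direct": simple_amat(1.0, one_way),
        "2-way": simple_amat(1.36, two_way),
    }
    print(size, {k: round(v, 3) for k, v in summary[size].items()})
```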

31
Solution 5 Compiler Optimizations
  • Specific techniques include
  • merging parallel arrays into an array of records
    so that access to a single array element is made
    to consecutive memory locations and thus the same
    (hopefully) refill line
  • Loop interchange: exchange loops in a nested loop
    situation so that array elements are accessed
    based on order that they will appear in the cache
    and not programmer-prescribed order
  • Loop fusion: combines loops together that access
    the same array locations so that all accesses are
    made within one iteration
  • Blocking: executes code on a part of the array
    before moving on to another part of the array so
    that array elements do not need to be reloaded
    into the cache
  • This is common for applications like image
    processing where several different passes through
    a matrix are made
  • We have already seen that compiler optimizations
    can be used to improve hardware performance
  • What about using compiler optimizations to
    improve cache performance?
  • It turns out that there are numerous things we
    can do
  • For specific examples, see pages 432-434
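
To illustrate why loop interchange helps, the sketch below counts misses in a simulated direct-mapped cache for two traversal orders of the same array. Everything about the setup is an assumption chosen for illustration and is not taken from the slides: a 4 KB cache with 64-byte blocks, and a 128 by 128 array of 8-byte elements stored in row-major order starting at address 0.

```python
def count_misses(addresses, num_lines=64, block_bytes=64):
    """Count misses of a direct-mapped cache over a byte-address trace."""
    lines = [None] * num_lines          # block number held by each line
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_lines
        if lines[index] != block:
            lines[index] = block        # replace whatever was there
            misses += 1
    return misses


ROWS, COLS, ELEM_BYTES = 128, 128, 8    # assumed array shape and element size


def address(i, j):
    """Byte address of element [i][j] in a row-major array at address 0."""
    return (i * COLS + j) * ELEM_BYTES


# Inner loop over rows: consecutive accesses are 1024 bytes apart.
column_order = [address(i, j) for j in range(COLS) for i in range(ROWS)]
# After interchange, the inner loop walks along a row.
row_order = [address(i, j) for i in range(ROWS) for j in range(COLS)]

column_misses = count_misses(column_order)
row_misses = count_misses(row_order)
print(column_misses, row_misses)
```

In this setup the row-order traversal misses once per 64-byte block, while the column-order traversal misses on every access, because blocks are evicted before their neighboring elements are used.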

32
Using Parallelism for Reduction
  • Other techniques to reduce miss penalty and/or
    rate utilize parallelism
  • A non-blocking cache allows a cache to continue
    to handle accesses even after a cache miss
    results in a memory request
  • Non-blocking caches are needed for out-of-order
    execution architectures and for allowing critical
    word first to work (if the cache was blocked, the
    first word received would not be available until
    the entire block was received)
  • Non-blocking caches are expensive even though
    they can be very useful
  • Two additional ideas that use non-blocking caches
    are
  • Hardware prefetching to fetch multiple blocks
    when a miss is made (that is, hardware predicts
    what else should be retrieved from memory)
  • See pages 438-439 for an example
  • Compiler-controlled prefetching whereby the
    compiler places prefetching commands in the
    program so that data are loaded into the cache
    before they are needed (reducing compulsory miss
    rate)

33
Compiler-Controlled Example
  • Consider the loop
  • for (i0ilt3ii1) for (j0jlt100jj1)
    aijbj0bj10
  • If we have an 8KB direct-mapped data cache with
    16 byte blocks and each element of a and b is 8
    bytes long (double precision floats) we will have
    150 misses for array a and 101 misses for array b
  • By scheduling the code with prefetch
    instructions, we can reduce the misses
  • New loop becomes
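  • for (j = 0; j < 100; j = j + 1) {
      prefetch(b[j+7][0]);  /* prefetch 7 iterations later */
      prefetch(a[0][j+7]);
      a[0][j] = b[j][0] * b[j+1][0];
    }
  • for (i = 1; i < 3; i = i + 1)
      for (j = 0; j < 100; j = j + 1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
      }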
  • This new code has only 19 misses, making it about
    4.2 times faster
  • See page 441 for the rest of the analysis for
    this problem
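The miss counts for the original (non-prefetching) loop can be checked with a small simulation. This is a sketch under assumptions that the slide does not state: array a is 3 x 100 and starts at address 0, array b is 101 x 3 and follows it immediately, the cache starts empty, and a write miss allocates the block. The names access_misses and count_misses are made up for this example.

```c
#include <stdbool.h>

enum {
    BLOCK_BYTES = 16,
    CACHE_BYTES = 8192,
    NUM_LINES   = CACHE_BYTES / BLOCK_BYTES,  /* 512 lines */
    ELEM_BYTES  = 8                           /* double precision float */
};

static long lines[NUM_LINES];  /* block number held by each line, -1 = empty */

/* Returns true on a miss and installs the block (write-allocate assumed). */
static bool access_misses(long addr) {
    long block = addr / BLOCK_BYTES;
    long idx = block % NUM_LINES;
    if (lines[idx] == block) return false;
    lines[idx] = block;
    return true;
}

/* Replays the addresses touched by the original loop nest and counts the
   misses caused by each array. */
void count_misses(int *a_misses, int *b_misses) {
    const long a_base = 0;                                /* assumed layout */
    const long b_base = a_base + 3L * 100 * ELEM_BYTES;   /* b follows a */
    *a_misses = 0;
    *b_misses = 0;
    for (int k = 0; k < NUM_LINES; k++) lines[k] = -1;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            *b_misses += access_misses(b_base + ((long)j * 3) * ELEM_BYTES);
            *b_misses += access_misses(b_base + ((long)(j + 1) * 3) * ELEM_BYTES);
            *a_misses += access_misses(a_base + ((long)i * 100 + j) * ELEM_BYTES);
        }
}
```

With this layout the two arrays occupy different cache lines, so the simulation reports 150 misses for a (two elements per block, 300 elements) and 101 misses for b (each b[j][0] lies in its own block and is reused on later iterations of i).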

34
Reducing Hit Time
  • Again, recall our average memory access time
    formula
  • Avg. mem. access time = hit time + miss rate ×
    miss penalty
  • Miss penalty has an impact only on a miss, but
    hit time has an impact for every memory access
  • Reducing hit time might improve performance
    beyond reducing miss rate and miss penalty
  • Hit time also has an impact on the clock speed
  • it doesn't make much sense to have a faster clock
    than cache because the CPU would have to
    constantly stall for any memory fetch (whether
    instruction or data fetch)
  • However, as miss penalty was dictated primarily
    by the speed of DRAM, hit time is dictated
    primarily by the speed of SRAM
  • What can we do?
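The average memory access time formula is simple enough to evaluate directly. The helper below is a minimal sketch; the function name amat and the numbers in the note that follows are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Average memory access time, in the same unit as hit_time and miss_penalty
   (for example, clock cycles). miss_rate is a fraction between 0 and 1. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

For example, with a hypothetical 5% miss rate and 100 cycle miss penalty, cutting the hit time from 2 cycles to 1 cycle lowers the average from about 7 cycles to about 6, and that saving applies to every access, hit or miss.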

35
Solution 1 Small and Simple Caches
  • Cache access (for any but a fully associative
    cache) requires using the index part of the
    address to find the appropriate line in the cache
  • Then comparing tags to see if the entry is the
    right one
  • The tag comparison can be time consuming,
    especially with fully associative caches that
    have large tags or set-associative caches where
    extra hardware is needed to do the comparisons
    in parallel
  • It is also critical to keep the cache small so
    that it fits on the chip
  • One solution is to keep tags on the chip and data
    off the chip
  • This permits a fast tag comparison, followed by a
    somewhat slower access to the data portion
  • In the end, this approach is not appealing for
    reducing hit time
  • A better approach is to use direct-mapped caches
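The index-then-compare sequence for a direct-mapped cache can be sketched in C as follows. The sizes reuse the 8KB cache with 16 byte blocks from the compiler-controlled prefetching example; the 32-bit address width, the struct, and the function names are assumptions made for illustration.

```c
#include <stdint.h>

#define OFFSET_BITS 4   /* 16 byte blocks -> 4 offset bits */
#define INDEX_BITS  9   /* 8KB / 16 bytes = 512 lines -> 9 index bits */

typedef struct { uint32_t tag, index, offset; } cache_addr;

/* Split an address into the three fields the cache uses. */
cache_addr split_address(uint32_t addr) {
    cache_addr p;
    p.offset = addr & ((1u << OFFSET_BITS) - 1);
    p.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    p.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return p;
}

/* Direct-mapped lookup: the index selects exactly one line, so a single
   tag comparison decides hit or miss. */
int is_hit(const uint32_t tags[], const unsigned char valid[], uint32_t addr) {
    cache_addr p = split_address(addr);
    return valid[p.index] && tags[p.index] == p.tag;
}
```

A set-associative cache would need one such comparison per way, performed in parallel, which is where the extra hardware and time come from.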

36
Solution 2 Avoid Address Translation
  • CPU generates an address and sends it to cache
  • But the address generated is a logical (virtual)
    address, not the physical address in memory
  • To obtain the physical address, the virtual
    address must first be translated
  • Translation requires accessing information stored
    in registers, TLB or main memory page table,
    followed by a concatenation
  • If we store virtual addresses in the cache, we
    can skip this translation
  • There are problems with this approach though
  • if a process is switched out of memory then the
    cache must be flushed
  • the OS and a user program may use two different
    virtual addresses for the same physical address
  • this can leave two inconsistent copies of the
    same data in a cache that uses virtual addresses
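The translation step that a virtually addressed cache avoids looks roughly like this. It is a sketch only: the 4KB page size, the 16-entry single-level page table, and the function name translate are hypothetical, and a real system would consult the TLB first.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12   /* hypothetical 4KB pages */
#define NUM_PAGES 16   /* hypothetical tiny virtual address space */

/* page_table[vpn] holds the physical frame number for virtual page vpn. */
uint32_t translate(const uint32_t page_table[NUM_PAGES], uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;               /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);  /* unchanged by translation */
    assert(vpn < NUM_PAGES);
    /* Concatenate the frame number with the page offset. */
    return (page_table[vpn] << PAGE_BITS) | offset;
}
```

If virtual page 2 maps to physical frame 7, virtual address 0x2ABC becomes physical address 0x7ABC. The lookup happens before the cache can be indexed with a physical address, which is the delay this solution tries to remove.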

37
Solution 3 Pipelining Writes
  • Writes will take longer than reads because the
    tag must be checked before the write can begin
  • A read can commence and if the tag is wrong, the
    item read can be discarded
  • The write takes two steps, tag comparison first,
    followed by the write (a third step might be
    included in a write-back cache by combining items
    in a buffer)
  • By pipelining writes, we can partially speed up
    the process
  • This works by overlapping the tag check of one
    write with the writing portion of the previous
    write, assuming the tag is correct
  • In this way, the second write takes the same time
    as a read would
  • This only helps when there is more than 1
    consecutive write and all of the writes are
    cache hits
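A simple cycle count shows the effect. This sketch assumes the tag check and the data write each take one cycle and that every write is a hit; the function names are made up for the example.

```c
/* n back-to-back write hits with no overlap: each write does a tag check
   and then the data write, 2 cycles apiece. */
unsigned write_cycles_unpipelined(unsigned n) {
    return 2 * n;
}

/* Pipelined: the tag check of write k+1 overlaps the data write of write k,
   so only the first write pays for both stages. */
unsigned write_cycles_pipelined(unsigned n) {
    return n == 0 ? 0 : n + 1;
}
```

Four consecutive write hits take 8 cycles unpipelined and 5 cycles pipelined under these assumptions. A single isolated write still takes 2 cycles, which is why the benefit is only partial.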

38
Solution 4 Trace Caches
  • This type of cache is an instruction cache that
    supports multiple issue of instructions by
    providing 4 or more independent instructions per
    cycle
  • Cache blocks are dynamic, unlike normal caches
    where blocks are static based on what is stored
    in memory
  • Here, the block is formed around branch
    prediction, branch folding, and trace scheduling
    (from chapter 4)
  • Note that because of branch folding and trace
    scheduling, some instructions might appear
    multiple times in the cache, so it is somewhat
    more wasteful of cache space
  • This type of cache then offers the advantage of
    directly supporting a multiple issue architecture
  • The Pentium 4 uses this approach, but most RISC
    computers do not because repetition of
    instructions and high frequency of branches cause
    this approach to waste too much cache space

39
Cache Optimization Summary
(+ means the technique improves that factor, - means it hurts it)
Technique                         | Miss Penalty | Miss Rate | Hit Time | Hardware Complexity | Comments (* means widely used)
Multilevel caches                 | +            |           |          | 2                   | Costly
Critical word first/early restart | +            |           |          | 2                   |
Read miss over write priority     | +            |           |          | 1                   | *, trivial for uni-proc.
Merging write buffer              | +            |           |          | 1                   | *, used w/ write-through
Victim caches                     | +            | +         |          | 2                   | AMD Athlon
Larger block sizes                | -            | +         |          | 0                   | Trivial
Larger caches                     |              | +         | -        | 1                   | Especially L2 caches
Higher associativity              |              | +         | -        | 1                   |
Pseudoassociative cache           |              | +         |          | 2                   | Found in RISC
Compiler techniques               |              | +         |          | 0                   | Software is challenging
Nonblocking caches                | +            |           |          | 3                   | Used with all OOC
Hardware prefetching              | +            | +         |          | 3                   |
Compiler prefetching              | +            | +         |          | 3                   |
Small/simple caches               |              | -         | +        | 0                   | *, trivial
No address translation            |              |           | +        | 2                   | Trivial if small cache
Pipelining writes                 |              |           | +        | 1                   |
Trace cache                       |              |           | +        | 3                   | Used in P4
Hardware complexity ranges from 0
(cheapest/easiest) to 3 (most expensive/hardest)