Title: Reducing Cache Misses
Slide 1: Reducing Cache Misses
Classifying Misses: the 3 Cs
- 5.1 Introduction
- 5.2 The ABCs of Caches
- 5.3 Reducing Cache Misses
- 5.4 Reducing Cache Miss Penalty
- 5.5 Reducing Hit Time
- 5.6 Main Memory
- 5.7 Virtual Memory
- 5.8 Protection and Examples of Virtual Memory
- Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These are misses even in an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set-associative cache of size X.)
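These categories can be measured by replaying one address trace against reference caches: a miss that even an infinite cache would take is compulsory; a miss that a fully associative LRU cache of the same total size would also take is capacity; any remaining miss is conflict. A minimal Python sketch (the trace format, block size, and LRU replacement policy are assumptions for illustration):

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks, num_ways, block_size=32):
    """Classify each miss in `trace` (byte addresses) per the 3 Cs:
    compulsory = block never referenced before (misses even in an
    infinite cache); capacity = also misses in a fully associative LRU
    cache of the same size; conflict = misses only in the actual
    set-associative cache."""
    seen = set()                                  # infinite cache
    full = OrderedDict()                          # fully associative LRU
    sets = num_blocks // num_ways
    cache = [OrderedDict() for _ in range(sets)]  # per-set LRU stacks
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0, "hit": 0}

    for addr in trace:
        blk = addr // block_size
        # actual set-associative cache lookup
        s = cache[blk % sets]
        hit = blk in s
        if hit:
            s.move_to_end(blk)
        else:
            s[blk] = True
            if len(s) > num_ways:
                s.popitem(last=False)             # evict LRU way
        # fully associative reference cache of the same total size
        fa_hit = blk in full
        if fa_hit:
            full.move_to_end(blk)
        else:
            full[blk] = True
            if len(full) > num_blocks:
                full.popitem(last=False)
        # classification
        if hit:
            counts["hit"] += 1
        elif blk not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(blk)
    return counts
```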
Slide 2: 3Cs Absolute Miss Rate (SPEC92)
Reducing Cache Misses
Classifying Misses: the 3 Cs
[Chart: absolute miss rate vs. cache size, split into conflict, capacity, and compulsory components; the compulsory component is vanishingly small.]
Slide 3: 2:1 Cache Rule
Reducing Cache Misses
Classifying Misses: the 3 Cs
miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2
[Chart: miss rate vs. cache size, highlighting the conflict component.]
Slide 4: 3Cs Relative Miss Rate
Reducing Cache Misses
Classifying Misses: the 3 Cs
[Chart: miss rates normalized to the total, highlighting the relative size of the conflict component.]
Slide 5: Reducing Cache Misses
1. Larger Block Size
This exploits the principle of locality: the larger the block, the greater the chance that other parts of it will be used again. Beyond a point, however, larger blocks raise the miss penalty and can increase conflict and capacity misses in small caches.
[Chart: miss rate vs. block size for several cache sizes.]
Slide 6: 2. Higher Associativity
Reducing Cache Misses
- 2:1 Cache Rule:
  - miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
- But beware: execution time is the only final measure we can believe!
- Clock cycle time increases as a result of having a more complicated cache.
- Hill [1988] suggested the hit time for 2-way vs. 1-way is about 10% longer for an external cache and 2% for an internal cache.
Slide 7: Avg. Memory Access Time vs. Miss Rate
Reducing Cache Misses
2. Higher Associativity
The time to access memory has several components. The equation is:
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Assume the miss penalty is 50 clock cycles; see the data on the next page.

Associativity   Clock Cycle Time
1-way           1.00
2-way           1.10
4-way           1.12
8-way           1.14
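As a worked example of the trade-off, the sketch below computes AMAT for each associativity, charging the longer clock cycle against the lower miss rate. The miss rates here are illustrative placeholders, not the data from the slide's next page:

```python
# Worked AMAT example: AMAT = hit_time + miss_rate * miss_penalty.
# Hit time scales with the clock-cycle-time factors above; the miss
# penalty is 50 cycles. The miss rates are assumed for illustration.
MISS_PENALTY = 50  # cycles

clock_cycle_time = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}
miss_rate = {1: 0.050, 2: 0.042, 4: 0.040, 8: 0.039}  # assumed values

for ways, cct in clock_cycle_time.items():
    amat = cct + miss_rate[ways] * MISS_PENALTY
    print(f"{ways}-way: AMAT = {cct:.2f} + {miss_rate[ways]:.3f} * 50 "
          f"= {amat:.2f} clock cycles")
```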
Slide 8: Example: Avg. Memory Access Time vs. Miss Rate
Reducing Cache Misses
2. Higher Associativity
Slide 9: Reducing Cache Misses
3. Victim Caches
- How can we combine the fast hit time of direct mapped and still avoid conflict misses?
- Add a small buffer (the victim cache) that holds data discarded from the cache, and check it on a miss; see the sketch after this list.
- A 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache.
- Used in Alpha and HP machines.
- In effect, this gives the same behavior as associativity, but only on those cache lines that really need it.
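A minimal sketch of the mechanism, assuming a direct-mapped cache of block tags backed by a small fully associative LRU victim buffer (the entry count and replacement policy are illustrative):

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully associative
    victim cache (sketch; LRU replacement in the victim buffer)."""

    def __init__(self, num_lines, victim_entries=4):
        self.lines = [None] * num_lines          # direct-mapped tags
        self.victim = OrderedDict()              # victim tags, LRU order
        self.victim_entries = victim_entries

    def access(self, block):
        idx = block % len(self.lines)
        if self.lines[idx] == block:
            return "hit"
        if block in self.victim:
            # Victim hit: swap the victim line with the conflicting
            # line, so repeated conflicts ping-pong cheaply instead of
            # going to memory.
            del self.victim[block]
            evicted, self.lines[idx] = self.lines[idx], block
            if evicted is not None:
                self._put_victim(evicted)
            return "victim hit"
        # Miss: fetch from memory; the displaced line becomes a victim.
        evicted, self.lines[idx] = self.lines[idx], block
        if evicted is not None:
            self._put_victim(evicted)
        return "miss"

    def _put_victim(self, block):
        self.victim[block] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)      # evict LRU victim
```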
Slide 10: Reducing Cache Miss Penalty
- 5.1 Introduction
- 5.2 The ABCs of Caches
- 5.3 Reducing Cache Misses
- 5.4 Reducing Cache Miss Penalty
- 5.5 Reducing Hit Time
- 5.6 Main Memory
- 5.7 Virtual Memory
- 5.8 Protection and Examples of Virtual Memory
The time to handle a miss is becoming more and more the controlling factor, because processor speed has improved far faster than memory speed.
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Slide 11: Reducing Cache Miss Penalty
Prioritization of Read Misses over Writes
- Write through with write buffers creates RAW conflicts between buffered writes and main-memory reads on cache misses.
- If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000).
- Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (see the sketch below).
- Write back?
  - On a read miss replacing a dirty block:
  - Normal: write the dirty block to memory, and then do the read.
  - Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
  - The CPU stalls less since it restarts as soon as the read is done.
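A minimal sketch of the write-buffer check on a read miss, assuming the buffer is a simple list of pending (address, value) stores, newest last:

```python
def read_with_write_buffer_check(addr, write_buffer, memory):
    """On a read miss, scan the write buffer for a pending store to
    the same address. If found, forward its value; otherwise the read
    can bypass the buffered writes and go straight to memory."""
    for buf_addr, value in reversed(write_buffer):
        if buf_addr == addr:
            return value          # RAW conflict: forward from the buffer
    return memory[addr]           # no conflict: read bypasses the writes
```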
Slide 12: Reducing Cache Miss Penalty
Sub-Block Placement for Reduced Miss Penalty
- Don't have to load the full block on a miss.
- Keep a valid bit per sub-block to indicate which sub-blocks hold valid data.
[Figure: a cache block divided into sub-blocks, each with its own valid bit.]
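A minimal sketch of one sub-blocked cache line, where each sub-block carries its own valid bit and a miss fetches only the missing sub-block (the `fetch` callback is a stand-in for the memory system):

```python
class SubBlockedLine:
    """One cache line with per-sub-block valid bits: a miss loads only
    the needed sub-block instead of the whole line (sketch)."""

    def __init__(self, tag, num_subblocks):
        self.tag = tag
        self.valid = [False] * num_subblocks
        self.data = [None] * num_subblocks

    def access(self, sub_index, fetch):
        if not self.valid[sub_index]:                # sub-block miss:
            self.data[sub_index] = fetch(sub_index)  # fetch just this piece
            self.valid[sub_index] = True
        return self.data[sub_index]
```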
Slide 13: Reducing Cache Miss Penalty
Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
- Generally useful only with large blocks.
- Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear that early restart helps.
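A small sketch of the wrapped-fetch order itself, starting at the missed word and wrapping around the block:

```python
def critical_word_first_order(block_words, requested_index):
    """Return the order in which words of a block are fetched under
    critical-word-first (wrapped fetch): start at the missed word and
    wrap around the block."""
    return [(requested_index + i) % block_words for i in range(block_words)]

# e.g. an 8-word block where word 5 missed:
print(critical_word_first_order(8, 5))   # [5, 6, 7, 0, 1, 2, 3, 4]
```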
Slide 14: Reducing Cache Miss Penalty
Second-Level Caches
- L2 Equations:
  - Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  - Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
  - The global miss rate is what matters.
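Plugging illustrative numbers (not from the slide) into the equations above shows how the local L2 miss rate and the global miss rate differ:

```python
# Two-level cache AMAT, using the L2 equations above. The numbers are
# illustrative assumptions, not data from the slide.
hit_time_L1     = 1      # cycles
miss_rate_L1    = 0.04   # local = global for L1
hit_time_L2     = 10     # cycles
miss_rate_L2    = 0.25   # LOCAL miss rate of L2
miss_penalty_L2 = 100    # cycles to main memory

miss_penalty_L1 = hit_time_L2 + miss_rate_L2 * miss_penalty_L2
amat = hit_time_L1 + miss_rate_L1 * miss_penalty_L1
global_miss_rate_L2 = miss_rate_L1 * miss_rate_L2

print(f"Miss Penalty_L1 = {miss_penalty_L1:.1f} cycles")    # 35.0
print(f"AMAT = {amat:.2f} cycles")                          # 2.40
print(f"Global L2 miss rate = {global_miss_rate_L2:.3f}")   # 0.010
```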
Slide 15: Reducing Hit Time
This is about how to reduce the time to access data that IS in the cache: which techniques are useful for quickly and efficiently finding out whether data is in the cache and, if it is, getting that data out of the cache.
- 5.1 Introduction
- 5.2 The ABCs of Caches
- 5.3 Reducing Cache Misses
- 5.4 Reducing Cache Miss Penalty
- 5.5 Reducing Hit Time
- 5.6 Main Memory
- 5.7 Virtual Memory
- 5.8 Protection and Examples of Virtual Memory
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Slide 16: Reducing Hit Time
Small and Simple Caches
- Why does the Alpha 21164 have 8 KB instruction and 8 KB data caches plus a 96 KB second-level cache?
- A small data cache helps keep the clock rate high.
- Direct mapped, on chip.
Slide 17: Reducing Hit Time
Pipelining Writes for Fast Write Hits
- Pipeline the tag check and the cache update as separate stages: the tag check of the current write overlaps the cache update of the previous write.
- Only STORES are in the pipeline; it sits empty during a miss.

  Store r2, (r1)    Check r1    M[r1] <- r2
  Add               --
  Sub               --
  Store r4, (r3)    Check r3    M[r3] <- r4

- The shaded stage is the delayed write buffer; it must be checked on reads: either complete the write first or read from the buffer (see the sketch below).
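A minimal sketch of the idea, with a one-entry delayed write buffer that holds the previous store's data while the current store's tag is checked; loads must consult the buffer:

```python
class PipelinedWriteCache:
    """Write hits in two pipeline stages: the tag check of the current
    store overlaps the data-array update of the previous store, whose
    (addr, value) sits in a one-entry delayed write buffer (sketch)."""

    def __init__(self):
        self.data = {}                 # addr -> value (cache data array)
        self.delayed = None            # pending (addr, value) write

    def store(self, addr, value):
        if self.delayed is not None:   # stage 2: apply the previous write
            prev_addr, prev_val = self.delayed
            self.data[prev_addr] = prev_val
        # Stage 1 for this store: the tag check happens now; the data
        # write is deferred until the next store's cycle.
        self.delayed = (addr, value)

    def load(self, addr):
        if self.delayed is not None and self.delayed[0] == addr:
            return self.delayed[1]     # forward from delayed write buffer
        return self.data.get(addr)
```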
Slide 19: Way Prediction to Reduce Hit Time
- Keeps the lower conflict-miss rate of an associative cache while approaching direct-mapped hit time.
- Predict which block within the set contains the requested data; the multiplexor is preset to this predicted way, so the multiplexor's selection delay is avoided on a correct prediction.
- On a misprediction, the correct block is chosen and the prediction is updated.
- A one-bit history per set can be used for prediction (see the sketch below).
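A minimal sketch for a 2-way cache, with one prediction bit per set (the structure and update policy are illustrative):

```python
class WayPredictedCache:
    """2-way set-associative lookup with a one-bit way predictor per
    set: try the predicted way first (fast hit); on a wrong guess,
    check the other way and update the prediction (sketch)."""

    def __init__(self, num_sets):
        self.tags = [[None, None] for _ in range(num_sets)]
        self.pred = [0] * num_sets            # one-bit history per set

    def lookup(self, set_index, tag):
        guess = self.pred[set_index]
        if self.tags[set_index][guess] == tag:
            return "fast hit"                 # mux was preset correctly
        other = 1 - guess
        if self.tags[set_index][other] == tag:
            self.pred[set_index] = other      # update the prediction
            return "slow hit"                 # extra time to re-steer mux
        return "miss"
```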
Slide 20: Trace Caches to Reduce Hit Time
- Used in the Pentium 4.
- The idea is to use the dynamic trace of execution to fetch a sequence of instructions, rather than static cache blocks.
- Complex to implement.
- High overhead.
Slide 21: Nonblocking Caches
- Most caches can handle only one outstanding request at a time. If a request misses, the cache must wait for memory to supply the value, and until then it is "blocked".
- A non-blocking cache can work on other requests while waiting for memory to supply any misses (see the sketch below).
- The Intel Pentium Pro and Pentium II processors use this technique for their level-2 caches, which can manage up to four simultaneous requests.
- This is done using a transaction-based architecture and a dedicated "backside" bus for the cache that is independent of the main memory bus. Intel calls this the "dual independent bus" (DIB) architecture.
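One standard way to build this is with miss status holding registers (MSHRs) that track in-flight misses; the slide does not name the mechanism, so the sketch below illustrates the general technique, not Intel's implementation:

```python
class NonblockingCache:
    """Sketch of miss handling with MSHRs (miss status holding
    registers): up to `max_outstanding` misses can be in flight, so
    the cache keeps servicing hits instead of blocking."""

    def __init__(self, max_outstanding=4):
        self.data = {}                   # block -> value
        self.mshrs = {}                  # block -> list of waiting requests
        self.max_outstanding = max_outstanding

    def access(self, block, request_id):
        if block in self.data:
            return "hit"
        if block in self.mshrs:          # merge into the in-flight miss
            self.mshrs[block].append(request_id)
            return "merged with outstanding miss"
        if len(self.mshrs) >= self.max_outstanding:
            return "blocked"             # all MSHRs busy: must stall
        self.mshrs[block] = [request_id] # allocate MSHR, issue to memory
        return "miss issued"

    def memory_response(self, block, value):
        self.data[block] = value
        return self.mshrs.pop(block)     # waiting requests to wake up
```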