1
Measuring and Improving Cache Performance
  • Chapter 7.3
  • Austin Orgah

2
In this section
  • We explore two different techniques for improving cache performance:
  • 1. Reducing the miss rate, by reducing the probability that two different memory blocks will contend for the same cache location.
  • 2. Reducing the miss penalty, by adding an additional level to the memory hierarchy (multilevel caching).

3
  • CPU time can be divided into:
  • 1. Clock cycles the CPU spends executing the program.
  • 2. Clock cycles the CPU spends waiting for the memory system.
  • Normally we assume that the costs of cache accesses that are hits are part of the normal CPU execution cycles. Thus:
  • CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
  • Memory-stall clock cycles come primarily from cache misses.

4
  • Memory-stall clock cycles are the sum of the stall cycles coming from reads and those coming from writes:
  • Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
  • Read-stall cycles are the number of reads per program, multiplied by the read miss rate, multiplied by the miss penalty in clock cycles for a read:
  • Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

5
  • Write stalls
  • With writes, the write-through scheme has two sources of stalls:
  • 1. Write misses, which require that the block be fetched before the write can continue.
  • 2. Write buffer stalls, which occur when the write buffer is full at the time of a write.
  • Therefore, the cycles stalled for writes equal the sum of these two:
  • Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls

6
  • Write stalls, continued
  • Write buffer stalls are difficult to compute precisely. However, in systems with a reasonable write buffer depth (say, four or more words) and a memory capable of accepting writes at a rate that significantly exceeds the average write frequency in programs (e.g., by a factor of 2), the write buffer stalls are minimal and can safely be ignored. (A system that did not meet these criteria would be poorly designed.)
  • In most write-through caches, the read and write miss penalties are the same, so:

7
  • Write stalls, continued
  • The formula
  • Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
  • becomes
  • Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
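These formulas are easy to check numerically. Below is a minimal sketch in Python; all numeric inputs are assumed example values, not numbers from the slides.

```python
# Minimal sketch of the cache-performance formulas above.

def memory_stall_cycles(accesses_per_program, miss_rate, miss_penalty):
    """Memory-stall clock cycles = Memory accesses x Miss rate x Miss penalty."""
    return accesses_per_program * miss_rate * miss_penalty

def cpu_time(execution_cycles, stall_cycles, clock_cycle_time):
    """CPU time = (Execution cycles + Memory-stall cycles) x Clock cycle time."""
    return (execution_cycles + stall_cycles) * clock_cycle_time

# Assumed example: 1e9 memory accesses, 3% miss rate, 100-cycle penalty,
# 2e9 execution cycles, 0.5 ns clock cycle.
stalls = memory_stall_cycles(1_000_000_000, 0.03, 100)
print(cpu_time(2_000_000_000, stalls, 0.5e-9), "seconds")
```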

8
  • Calculating cache performance
  • Example: assume the instruction cache miss rate for a program is 2% and the data cache miss rate is 4%. If the processor has a CPI of 2 without any memory stalls and a miss penalty of 100 cycles for all misses, determine how much faster the processor would run with a perfect cache that never missed. (Take the frequency of all loads and stores in SPECint2000 to be 36%.)
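One way to work this example, using the formulas above (the numbers are taken from the example statement):

```python
# Per-instruction stall cycles, then the speedup of a perfect cache.
base_cpi     = 2      # CPI with no memory stalls
miss_penalty = 100    # cycles, for all misses
i_miss_rate  = 0.02   # instruction cache miss rate
d_miss_rate  = 0.04   # data cache miss rate
mem_ref_freq = 0.36   # loads and stores per instruction (SPECint2000)

i_stalls = 1.0 * i_miss_rate * miss_penalty           # 2.00 cycles/instruction
d_stalls = mem_ref_freq * d_miss_rate * miss_penalty  # 1.44 cycles/instruction
cpi_with_stalls = base_cpi + i_stalls + d_stalls      # 5.44

print(cpi_with_stalls / base_cpi)  # 2.72: the perfect cache is 2.72x faster
```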

9
  • Cache performance with increased clock rate
  • Suppose we increase the performance of the computer in the previous example by doubling its clock rate. Since main memory speed is unlikely to change, assume that the absolute time to handle a cache miss doesn't change. How much faster will the computer be with the faster clock, assuming the same miss rate as in the previous example?
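The same calculation, sketched for the doubled clock. The key observation: the absolute miss time is unchanged but each cycle is half as long, so the miss penalty doubles to 200 cycles.

```python
# Stall cycles per instruction with the doubled clock rate.
miss_penalty_fast = 200   # unchanged absolute miss time, half-length cycles
stalls_fast = (1.0 * 0.02 * miss_penalty_fast
               + 0.36 * 0.04 * miss_penalty_fast)   # 4.00 + 2.88 = 6.88
cpi_fast = 2 + stalls_fast                          # 8.88

# Execution-time ratio: (CPI_slow x cycle) / (CPI_fast x cycle / 2)
speedup = 5.44 / (cpi_fast * 0.5)
print(speedup)  # ~1.23: doubling the clock makes the machine only ~1.2x faster
```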

10
  • As the examples show:
  • Relative cache penalties increase as the processor becomes faster. A processor that improves both clock rate and CPI suffers a double hit:
  • The lower the CPI, the more pronounced the impact of stall cycles.
  • The main memory system is unlikely to improve as fast as processor cycle time, because DRAM is not getting faster at the same rate. Given two processors whose main memories have the same absolute access time, the one with the higher clock rate sees a larger miss penalty (in cycles).
  • If hit time increases, the total access time for a word from memory will increase, possibly causing an increase in the processor cycle time.
  • A larger cache also results in a longer access time.

11
  • Reducing Cache Misses by More Flexible Placement of Blocks
  • Note: when we place a block in a cache, a simple placement scheme is used.
  • Direct-mapped cache: a direct mapping from any block address in memory to a single location in the upper level of the hierarchy.
  • Fully associative cache: a structure in which a block can be placed in any location in the cache.
  • Finding a block requires searching all entries in the cache, since the block could be placed anywhere.
  • A comparator associated with each cache entry makes the search practical, but is not cost-effective; comparators are practical only for caches with a small number of blocks.

12
  • Set-associative cache: a middle ground between direct-mapped and fully associative. This structure has a fixed number of locations (at least 2) where each block can be placed. Each block in memory maps to a unique set in the cache, given by the index field, and the block can be placed in any element of that set. In other words, a block is directly mapped to a set, and then all the blocks in that set are searched for a match. (A short illustration of the two mapping rules follows below.)
  • Recall, for a direct-mapped cache:
  • Cache index = (Block number) mod (Number of cache blocks)
  • For a set-associative cache:
  • Set index = (Block number) mod (Number of sets in the cache)
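A minimal sketch of the two mapping rules; the cache geometry (four blocks, two sets) is an assumed example.

```python
# Index computation for the two placement rules above.
NUM_BLOCKS = 4   # direct mapped: 4 one-word blocks (assumed geometry)
NUM_SETS   = 2   # two-way set associative: 2 sets of 2 blocks

def direct_mapped_index(block_number):
    return block_number % NUM_BLOCKS   # block maps to exactly one slot

def set_index(block_number):
    return block_number % NUM_SETS     # block maps to a set of candidates

for b in (0, 6, 8):
    print(b, "->", direct_mapped_index(b), set_index(b))
```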

13
(No Transcript)
14
  • A direct-mapped cache is a one-way set-associative cache.
  • A fully associative cache with m entries is an m-way set-associative cache.
  • Advantage: increasing associativity usually decreases the miss rate.
  • Disadvantage: it increases the hit time.

15
(No Transcript)
16
  • Misses and Associativity in Caches
  • Example: take three small caches, each consisting of four one-word blocks. One cache is fully associative, a second is two-way set-associative, and the third is direct-mapped. Find the number of misses for each, given the following sequence of block addresses: 0, 8, 0, 6, 8. (A small simulation of all three caches follows below.)
  • Hint:
  • For the direct-mapped cache:
  • Cache index = (Block number) mod (Number of cache blocks)
  • For the set-associative cache:
  • Set index = (Block number) mod (Number of sets in the cache)
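Before stepping through the tables on the next slides, here is a minimal simulation sketch of all three caches, using LRU replacement within each set (as the slides do):

```python
# Count misses for a cache with num_sets sets of `ways` blocks each,
# replacing the least recently used block within a set.
def count_misses(num_sets, ways, addresses):
    sets = [[] for _ in range(num_sets)]   # each list is ordered LRU-first
    misses = 0
    for block in addresses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: refresh its recency below
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used
        s.append(block)                    # most recently used at the end
    return misses

seq = [0, 8, 0, 6, 8]
print(count_misses(4, 1, seq))   # direct mapped:      5 misses
print(count_misses(2, 2, seq))   # two-way set-assoc:  4 misses
print(count_misses(1, 4, seq))   # fully associative:  3 misses
```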

17
Direct Mapped
18
2-way set Associative
19
2-way set Associative
  • Note: this cache has two sets (with indices 0 and 1), each with two elements.
  • This cache replaces the least recently used block within a set.

20
Fully Associative
21
Fully Associative
  • This cache has four blocks in a single set.
  • It has the best performance, with only three misses.

22
(No Transcript)
23
Locating a Block in the Cache
  • Each entry in a set-associative cache includes an address tag that gives the block address. The tag of every cache block within the appropriate set is checked to see whether it matches the block address from the processor. The index value is used to select the set containing the address of interest.
  • A sequential search, as in a fully associative cache, would make the hit time of a set-associative cache too slow.
  • In a fully associative cache there is effectively only one set, and all the blocks must be checked in parallel. There is no index, and hence the entire address, excluding the block offset, is compared against the tag of every block.
  • In a direct-mapped cache the entry can be in only one block, so access is simply by indexing.
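A minimal sketch of how an address splits into tag, index, and block offset for such a lookup; the field widths below are assumed for illustration.

```python
# Splitting a 32-bit address into tag / index / block offset.
# Assumed widths: 16-byte blocks (4 offset bits), 256 sets (8 index bits).
OFFSET_BITS = 4
INDEX_BITS  = 8

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# The index selects the set; the tag is compared against every block in it.
tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x12345 0x67 0x8
```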

24
  • 4-way set-associative cache implementation.

25
  • For the 4-way set-associative cache, four comparators are needed, together with a 4-to-1 multiplexor to choose among the four members of the selected set.
  • The choice among direct-mapped, set-associative, and fully associative mapping in any memory hierarchy depends on the cost of a miss versus the cost of implementing associativity, both in time and in extra hardware.

26
  • Size of Tags vs. Set Associativity
  • Example: increasing associativity requires more comparators and more tag bits per cache block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit address, find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative.
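One way to work this example. Since the blocks are four words (16 bytes), 4 bits of every address form the block offset, leaving 28 bits to divide between index and tag:

```python
# Tag-size example: 4K blocks, 16-byte blocks, 32-bit addresses.
NUM_BLOCKS  = 4 * 1024
OFFSET_BITS = 4            # 16 bytes per block
ADDR_BITS   = 32

for ways in (1, 2, 4, NUM_BLOCKS):   # direct, 2-way, 4-way, fully assoc
    num_sets   = NUM_BLOCKS // ways
    index_bits = num_sets.bit_length() - 1           # log2 of a power of two
    tag_bits   = ADDR_BITS - OFFSET_BITS - index_bits
    total_bits = tag_bits * NUM_BLOCKS               # every block stores a tag
    print(f"{ways}-way: {num_sets} sets, {tag_bits}-bit tags, "
          f"{total_bits // 1024}K total tag bits")
# 1-way:    4096 sets, 16-bit tags,  64K total tag bits
# 2-way:    2048 sets, 17-bit tags,  68K total tag bits
# 4-way:    1024 sets, 18-bit tags,  72K total tag bits
# 4096-way:    1 set,  28-bit tags, 112K total tag bits
```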

27
  • Choosing Which Block to Replace
  • Direct-mapped cache: the requested block can go in exactly one position, and the block occupying that position must be replaced.
  • Fully associative cache: all blocks are candidates for replacement.
  • Set-associative cache: we must choose among the blocks in the selected set.
  • An associative cache thus has a choice of where to place the requested block, and hence a choice of which block to replace.
  • Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time.
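A compact sketch of LRU bookkeeping for a single set, kept in an ordered map. This is a software illustration only; hardware tracks recency differently (e.g., with use bits):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set whose blocks are ordered from least to most recent."""
    def __init__(self, ways=2):
        self.ways = ways
        self.blocks = OrderedDict()          # oldest entry = LRU candidate

    def access(self, block):
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)   # refresh recency on a hit
        elif len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = True
        return hit

s = LRUSet(ways=2)   # set 0 of the earlier two-way example
print([s.access(b) for b in (0, 8, 0, 6, 8)])  # [False, False, True, False, False]
```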

28
  • Reducing the Miss Penalty Using Multilevel Caches
  • Multilevel cache: a memory hierarchy with multiple levels of caches, rather than just a cache and main memory.
  • To close the gap between the fast clock rates of modern processors and the relatively long time required to access DRAM, many microprocessors support an additional level of caching.
  • In particular, a two-level cache structure allows the primary cache to focus on minimizing hit time to yield a shorter clock cycle, while allowing the secondary cache to focus on miss rate to reduce the penalty of long memory access times.
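A minimal sketch of why the second level helps: the effective miss penalty seen by the primary cache shrinks because most of its misses are serviced by the secondary cache. All latencies and miss rates below are assumed illustrative values:

```python
# Effective L1 miss penalty with and without a secondary cache.
L2_HIT_TIME    = 10     # cycles to reach the secondary cache (assumed)
MEMORY_PENALTY = 100    # cycles to reach DRAM (assumed)
L2_MISS_RATE   = 0.25   # fraction of L1 misses that also miss in L2 (assumed)

without_l2 = MEMORY_PENALTY
with_l2    = L2_HIT_TIME + L2_MISS_RATE * MEMORY_PENALTY

print(without_l2, "->", with_l2)   # 100 -> 35.0 cycles per L1 miss
```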

29
  • The miss penalty of the primary cache is significantly reduced by the presence of the secondary cache, allowing the primary cache to be smaller and to have a higher miss rate.
  • The secondary cache's access time becomes less important once the primary cache is present, since the access time of the secondary cache affects the miss penalty of the primary cache rather than directly affecting the primary cache hit time or the processor cycle time.
  • The primary cache uses a smaller block size, to match its smaller cache size and reduced miss penalty.
  • The secondary cache uses a larger total size and larger block size, since its access time is less critical.