Title: Chapter 5 Memory III
1. Chapter 5: Memory III
2. Miss Rate Reduction (cont'd)
3. Larger Block Size
- Reduces compulsory misses through spatial locality
- But,
- miss penalty increases (higher bandwidth helps)
- miss rate can increase: with a fixed cache size, larger blocks mean fewer blocks in the cache
4. Notice the U shape: some is good, too much is bad.
5. Larger Caches
- Reduces capacity misses
- But
- Increased hit time
- Increased cost ($)
- Over time, L2 and higher-level cache sizes have increased
6. Higher Associativity
- Reduces miss rates through fewer conflicts
- But
- Increased hit time (tag check)
- Note
- An 8-way set-associative cache has close to the same miss rate as a fully associative one
7. Way Prediction
- Predict which way of an L1 cache will be accessed next
- Alpha 21264: a correct prediction takes 1 cycle, an incorrect prediction takes 3 cycles
- SPEC95: prediction is 85% correct
8. Compiler Techniques
- Reduce conflicts in the I-cache: a 1989 study showed misses reduced by 50% for a 2 KB cache and by 75% for an 8 KB cache
- The D-cache performs differently
9. Compiler data optimizations
- Loop Interchange
- Before:
- for (j ...)
- for (i ...)
- x[i][j] = 2 * x[i][j];
- After:
- for (i ...)
- for (j ...)
- x[i][j] = 2 * x[i][j];
- Improved Spatial Locality (see the sketch below)
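A minimal C sketch of the interchange above, assuming a row-major array; the bounds M and N are illustrative, since the slide's actual loop bounds are not shown.

#define M 1000          /* rows (illustrative) */
#define N 1000          /* columns (illustrative) */

static double x[M][N];  /* C stores rows contiguously (row-major) */

/* Before: the inner loop walks down a column, so consecutive accesses
   are N doubles apart and each may touch a different cache block. */
void before(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: interchanging the loops makes the inner loop walk along a row,
   so consecutive accesses fall in the same cache block (spatial locality). */
void after(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}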
10. Blocking: Improve Spatial Locality (before/after figures; see the sketch below)
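The before/after figures are not reproduced here; as an illustration, a sketch of the classic blocked (tiled) matrix multiply, with illustrative sizes N and B chosen so that the tiles being reused fit in the cache.

#define N 512    /* matrix dimension (illustrative) */
#define B 32     /* tile size (illustrative); B must divide N here */

static double x[N][N], y[N][N], z[N][N];

/* Blocked x += y * z: the jj/kk loops process one BxB tile of z (and a
   strip of y) at a time, so those elements are reused from the cache
   many times before being evicted, instead of streaming whole rows and
   columns through the cache on every iteration. */
void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}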
11. Miss Rate and Miss Penalty Reduction via Parallelism
12. Nonblocking Caches
- Reduces stalls on a cache miss
- A blocking cache refuses all requests while waiting for data
- A nonblocking cache continues to handle other requests while waiting for data on another request
- Increases cache controller complexity
13. Nonblocking Cache (8 KB direct-mapped L1, 32-byte blocks)
14. Hardware Prefetch
- Fetch two blocks: the one desired and the next one
- The next block goes into a stream buffer; on a fetch, check the stream buffer first
- Performance
- A single-instruction stream buffer caught 15% to 25% of L1 misses
- A 4-instruction stream buffer caught 50%
- A 16-instruction stream buffer caught 72%
15. Hardware Prefetch
- Data prefetch
- A single-data stream buffer caught 25% of L1 misses
- A 4-data stream buffer caught 43%
- 8 data stream buffers caught 50% to 70%
- Prefetch from multiple addresses
- The UltraSPARC III handles 8 prefetches and calculates the stride for the next prediction
16. Software Prefetch
- Many processors, such as the Itanium, have prefetch instructions (see the sketch below)
- Remember that they are nonfaulting
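A minimal sketch, using the GCC/Clang __builtin_prefetch intrinsic rather than an Itanium-specific instruction; the function and the prefetch distance of 8 elements are illustrative assumptions, not from the slide.

/* Summing an array with an explicit software prefetch ahead of the loop.
   The prefetch is nonfaulting, so requesting an address past the end of
   the array does not trap; it is simply dropped. */
double sum_with_prefetch(const double *a, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
           The distance of 8 elements ahead is an assumption, not tuned. */
        __builtin_prefetch(&a[i + 8], 0, 1);
        sum += a[i];
    }
    return sum;
}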
17. Hit Time Reduction
18. Small, Simple Caches
- Time:
- Indexing
- Comparing the tag
- Small ⇒ indexing is fast
- Simple ⇒ direct mapping allows tag comparison in parallel with the data load
- ⇒ L2 with tags on chip and data off chip
19. Time vs. cache size and organization (graph)
20. Perspective on the previous graph
- Same:
- a 1 ns clock period is 10^-9 sec/cycle
- 1 GHz is 10^9 cycles/sec
- Therefore,
- a 2 ns clock period is 500 MHz
- a 4 ns clock period is 250 MHz
- Conclude that small differences in ns represent large differences in MHz
21. Virtual vs. Physical Address in L1
- Translating from a virtual address to a physical address as part of cache access takes time on the critical path
- Translation is needed for both the index and the tag
- Making the common case fast suggests avoiding translation for hits (misses must be translated)
22. Why are (almost all) L1 caches physically addressed?
- Security (protection): page-level protection must be checked on access (protection data can be copied into the cache)
- A process switch can change the virtual mapping, requiring a cache flush (or a process ID); see next slide
- Synonyms: two virtual addresses for the same (shared) physical address
23. Virtually-addressed cache: context-switch cost
24. Hybrid: virtually indexed, physically tagged
- Index with the part of the page offset that is identical in the virtual and physical addresses, i.e., the index bits are a subset of the page-offset bits
- In parallel with indexing, translate the virtual address to check the physical tag
- Limitation: a direct-mapped cache can be no larger than the page size (determined by the address bits); set-associative caches can be bigger since fewer bits are needed for the index
25. Example
- Pentium III
- 8 KB pages with a 16 KB 2-way set-associative cache
- IBM 3033
- 4 KB pages with a 64 KB 16-way set-associative cache (note that 8-way would be sufficient, but 16-way is needed to keep the index bits sufficiently small; see the check below)
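A worked check of the limitation from the previous slide: with virtual indexing and physical tagging, the size of one way (cache size / associativity) must not exceed the page size, so that the index and block-offset bits fit within the page offset.
- Pentium III: 16 KB / 2 ways = 8 KB per way = page size, so the index fits in the page offset
- IBM 3033: 64 KB / 16 ways = 4 KB per way = page size; at 8-way it would be 8 KB per way, larger than the 4 KB page, which is why 16-way is used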
26. Trace Cache
- Pentium 4 NetBurst architecture
- I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
- Advantage over regular large cache blocks, which contain branches and, hence, many unused instructions; e.g., AMD Athlon 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch
- Disadvantage: complex addressing
27. Trace Cache
- The P4 trace cache (I-cache) is placed after decode and branch prediction, so it contains
- µops
- only desired instructions
- The trace cache holds 12K µops
- The branch-predict BTB is 4K (a 33% improvement over the PIII)
29. Summary (so far)
- Figure 5.26 summarizes all of these techniques
30. Main Memory
- Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster
- A wider path to memory brings in more words at a time, e.g., one address request brings in 4 words (reducing overhead)
- Interleaved memory can allow memory to respond faster (see the illustrative calculation below)
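An illustrative miss-penalty comparison for a 4-word block, with assumed timings (1 cycle to send the address, 15 cycles per memory access, 1 cycle to transfer one word; these numbers are for illustration only, not from the slide):
- One-word-wide memory: 1 + 4 x (15 + 1) = 65 cycles
- Four-word-wide memory: 1 + 15 + 1 = 17 cycles
- Four-way interleaved memory (one word wide, bank accesses overlapped): 1 + 15 + 4 x 1 = 20 cycles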