Title: Chapter 5 Memory III
1. Chapter 5: Memory III
2. Miss Rate Reduction (cont'd)
3. Larger Block Size
- Reduces compulsory misses through spatial locality
- But,
- miss penalty increases (higher bandwidth helps)
- miss rate can increase: with a fixed cache size, larger blocks mean fewer blocks in the cache
4. Notice the U shape: some is good, too much is bad.
5. Larger Caches
- Reduces capacity misses
- But
- Increased hit time
- Increased cost ($)
- Over time, L2 and higher-level cache sizes have increased
6. Higher Associativity
- Reduces miss rates through fewer conflicts
- But
- Increased hit time (tag check)
- Note
- An 8-way set-associative cache has close to the same miss rate as a fully associative one
7. Way Prediction
- Predict which way of an L1 cache will be accessed next
- Alpha 21264: a correct prediction takes 1 cycle, an incorrect prediction takes 3 cycles
- SPEC95: prediction is 85% correct
8. Compiler Techniques
- Reduce conflicts in the I-cache: a 1989 study showed misses reduced by 50% for a 2 KB cache and by 75% for an 8 KB cache
- The D-cache performs differently
9. Compiler data optimizations
- Loop Interchange
- Before:
- for (j ...)
- for (i ...)
- x[i][j] = 2 * x[i][j];
- After:
- for (i ...)
- for (j ...)
- x[i][j] = 2 * x[i][j];
- Improved Spatial Locality (see the sketch below)
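A minimal C sketch of the interchange above, assuming a row-major array; the bounds M and N are illustrative, since the slide's actual loop bounds are not shown.

#define M 1000          /* rows (illustrative) */
#define N 1000          /* columns (illustrative) */

static double x[M][N];  /* C stores rows contiguously (row-major) */

/* Before: the inner loop walks down a column, so consecutive accesses
   are N doubles apart and each may touch a different cache block. */
void before(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: interchanging the loops makes the inner loop walk along a row,
   so consecutive accesses fall in the same cache block (spatial locality). */
void after(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}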
10. Blocking: Improve Spatial Locality (before/after figures; see the sketch below)
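The before/after figures are not reproduced here; as an illustration, a sketch of the classic blocked (tiled) matrix multiply, with illustrative sizes N and B chosen so that the tiles being reused fit in the cache.

#define N 512    /* matrix dimension (illustrative) */
#define B 32     /* tile size (illustrative); B must divide N here */

static double x[N][N], y[N][N], z[N][N];

/* Blocked x += y * z: the jj/kk loops process one BxB tile of z (and a
   strip of y) at a time, so those elements are reused from the cache
   many times before being evicted, instead of streaming whole rows and
   columns through the cache on every iteration. */
void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}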
11. Miss Rate and Miss Penalty Reduction via Parallelism
12. Nonblocking Caches
- Reduces stalls on a cache miss
- A blocking cache refuses all requests while waiting for data
- A nonblocking cache continues to handle other requests while waiting for data on another request
- Increases cache controller complexity
13. Nonblocking Cache (8 KB direct-mapped L1, 32-byte blocks)
14. Hardware Prefetch
- Fetch two blocks: the one desired and the next one
- The next block goes into a stream buffer; on a fetch, check the stream buffer first
- Performance
- A single-instruction stream buffer caught 15% to 25% of L1 misses
- A 4-instruction stream buffer caught 50%
- A 16-instruction stream buffer caught 72%
15. Hardware Prefetch
- Data prefetch
- A single-data stream buffer caught 25% of L1 misses
- A 4-data stream buffer caught 43%
- 8 data stream buffers caught 50% to 70%
- Prefetch from multiple addresses
- The UltraSPARC III handles 8 prefetches and calculates the stride for the next prediction
16. Software Prefetch
- Many processors, such as the Itanium, have prefetch instructions (see the sketch below)
- Remember that they are nonfaulting
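A minimal sketch, using the GCC/Clang __builtin_prefetch intrinsic rather than an Itanium-specific instruction; the function and the prefetch distance of 8 elements are illustrative assumptions, not from the slide.

/* Summing an array with an explicit software prefetch ahead of the loop.
   The prefetch is nonfaulting, so requesting an address past the end of
   the array does not trap; it is simply dropped. */
double sum_with_prefetch(const double *a, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
           The distance of 8 elements ahead is an assumption, not tuned. */
        __builtin_prefetch(&a[i + 8], 0, 1);
        sum += a[i];
    }
    return sum;
}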
17. Hit Time Reduction
18. Small, Simple Caches
- Time:
- Indexing
- Comparing the tag
- Small ⇒ indexing is fast
- Simple ⇒ direct mapping allows tag comparison in parallel with the data load
- ⇒ L2 with tags on chip and data off chip
19. Time vs. cache size and organization (graph)
20. Perspective on the previous graph
- Same:
- a 1 ns clock period is 10^-9 sec/cycle
- 1 GHz is 10^9 cycles/sec
- Therefore,
- a 2 ns clock period is 500 MHz
- a 4 ns clock period is 250 MHz
- Conclude that small differences in ns represent large differences in MHz
21. Virtual vs. Physical Address in L1
- Translating from a virtual address to a physical address as part of cache access takes time on the critical path
- Translation is needed for both the index and the tag
- Making the common case fast suggests avoiding translation for hits (misses must be translated)
22. Why are (almost all) L1 caches physically addressed?
- Security (protection): page-level protection must be checked on access (protection data can be copied into the cache)
- A process switch can change the virtual mapping, requiring a cache flush (or a process ID); see next slide
- Synonyms: two virtual addresses for the same (shared) physical address
23. Virtually-addressed cache: context-switch cost
24. Hybrid: virtually indexed, physically tagged
- Index with the part of the page offset that is identical in the virtual and physical addresses, i.e., the index bits are a subset of the page-offset bits
- In parallel with indexing, translate the virtual address to check the physical tag
- Limitation: a direct-mapped cache can be no larger than the page size (determined by the address bits); set-associative caches can be bigger since fewer bits are needed for the index
25. Example
- Pentium III
- 8 KB pages with a 16 KB 2-way set-associative cache
- IBM 3033
- 4 KB pages with a 64 KB 16-way set-associative cache (note that 8-way would be sufficient, but 16-way is needed to keep the index bits sufficiently small; see the check below)
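A worked check of the limitation from the previous slide: with virtual indexing and physical tagging, the size of one way (cache size / associativity) must not exceed the page size, so that the index and block-offset bits fit within the page offset.
- Pentium III: 16 KB / 2 ways = 8 KB per way = page size, so the index fits in the page offset
- IBM 3033: 64 KB / 16 ways = 4 KB per way = page size; at 8-way it would be 8 KB per way, larger than the 4 KB page, which is why 16-way is used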
26. Trace Cache
- Pentium 4 NetBurst architecture
- I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
- Advantage over regular large cache blocks, which contain branches and, hence, many unused instructions; e.g., AMD Athlon 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch
- Disadvantage: complex addressing
27. Trace Cache
- The P4 trace cache (I-cache) is placed after decode and branch prediction, so it contains
- µops
- only desired instructions
- The trace cache holds 12K µops
- The branch-predict BTB is 4K (a 33% improvement over the PIII)
29. Summary (so far)
- Figure 5.26 summarizes all of these techniques
30. Main Memory
- Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster
- A wider path to memory brings in more words at a time, e.g., one address request brings in 4 words (reducing overhead)
- Interleaved memory can allow memory to respond faster (see the illustrative calculation below)
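An illustrative miss-penalty comparison for a 4-word block, with assumed timings (1 cycle to send the address, 15 cycles per memory access, 1 cycle to transfer one word; these numbers are for illustration only, not from the slide):
- One-word-wide memory: 1 + 4 x (15 + 1) = 65 cycles
- Four-word-wide memory: 1 + 15 + 1 = 17 cycles
- Four-way interleaved memory (one word wide, bank accesses overlapped): 1 + 15 + 4 x 1 = 20 cycles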