1
Chapter 5: Memory III
  • CSE 820

2
Miss Rate Reduction (cont'd)
3
Larger Block Size
  • Reduces compulsory misses through spatial
    locality
  • But,
  • miss penalty increases; higher bandwidth helps
  • miss rate can increase: with a fixed cache size,
    larger blocks mean fewer blocks in the cache

4
Notice the 'U' shape: some is good, too much is
bad.
5
Larger Caches
  • Reduces capacity misses
  • But,
  • Increased hit time
  • Increased cost ($)
  • Over time, L2 and higher-level cache sizes have increased

6
Higher Associativity
  • Reduces miss rate via fewer conflict misses
  • But,
  • Increased hit time (tag check)
  • Note:
  • An 8-way set-associative cache has close to the same
    miss rate as a fully associative one

7
Way Prediction
  • Predict which way of a set-associative L1 cache will be
    accessed next (see the sketch below)
  • Alpha 21264: a correct prediction costs 1 cycle;
    an incorrect prediction costs 3 cycles
  • SPEC95: prediction is 85% correct
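
A minimal C sketch of the idea, assuming a 2-way set-associative
cache with 256 sets and 32-byte blocks; the layout and the
1-cycle/3-cycle latencies mirror the Alpha figures above but are
illustrative assumptions, not the 21264's actual design:

    #include <stdint.h>

    #define SETS 256

    typedef struct {
        uint32_t tag[2];    /* tags for the two ways (valid bits omitted) */
        int      predicted; /* which way to probe first                   */
    } set_t;

    static set_t cache[SETS];

    /* Returns the simulated hit latency in cycles, or -1 on a miss. */
    int access_cache(uint32_t addr)
    {
        uint32_t set = (addr >> 5) & (SETS - 1); /* 32-byte blocks        */
        uint32_t tag = addr >> 13;               /* 5 offset + 8 index bits */
        set_t *s = &cache[set];

        if (s->tag[s->predicted] == tag)
            return 1;                        /* predicted way hits: fast */
        if (s->tag[1 - s->predicted] == tag) {
            s->predicted = 1 - s->predicted; /* retrain the predictor    */
            return 3;                        /* wrong way first: slower  */
        }
        return -1;                           /* miss: go to next level   */
    }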

8
Compiler Techniques
  • Reduce conflicts in the I-cache: a 1989 study showed misses
    reduced by 50% for a 2KB cache and by 75% for an 8KB cache
  • The D-cache behaves differently

9
Compiler data optimizations
  • Loop Interchange (exchange the nesting order of the loops)
  • Before (the loop bounds were lost in transcription; N and M are
    placeholders):

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                x[i][j] = 2 * x[i][j];

  • After:

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                x[i][j] = 2 * x[i][j];

  • Improved spatial locality: the inner loop now walks each row in
    order, matching C's row-major array layout

10
Blocking: Improve Spatial Locality
(Figure: "Before" and "After" data-access patterns; diagrams not
recoverable.)
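
A minimal sketch of blocking in C, using the classic blocked matrix
multiply (this specific kernel is an assumption; the slide's figure
is not recoverable). The blocking factor B is chosen so the tiles
touched in the inner loops fit in the cache:

    #define N 512
    #define B  32  /* blocking factor: tune so the tiles fit in cache */

    /* Computes x += y * z on N x N matrices; x must start zeroed.
       The jj/kk loops walk B x B tiles so elements of y and z are
       reused while they are still resident in the cache. */
    void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
    {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0;
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;
                    }
    }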
11
Miss Rate and Miss Penalty Reduction via
Parallelism
12
Nonblocking Caches
  • Reduces stalls on a cache miss
  • A blocking cache refuses all requests while
    waiting for data
  • A nonblocking cache continues to handle other
    requests while waiting for data on another
    request (see the sketch below)
  • Increases cache controller complexity
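
As a rough illustration, here is a minimal C sketch of the
bookkeeping that makes a cache nonblocking: outstanding misses are
tracked in a small table (MSHR-style), so later requests can still
proceed. The names and sizes are assumptions, not a real controller
design:

    #include <stdbool.h>
    #include <stdint.h>

    #define MSHRS 4  /* max outstanding misses */

    typedef struct {
        bool     valid;
        uint32_t block_addr;  /* block with a miss in flight */
    } mshr_t;

    static mshr_t mshr[MSHRS];

    /* Returns true if the request can proceed (a hit, or a miss that
       is now tracked); false only when all MSHRs are busy, the one
       case in which even a nonblocking cache must stall. */
    bool handle_request(uint32_t block_addr, bool hit)
    {
        if (hit)
            return true;            /* hits proceed regardless        */
        for (int i = 0; i < MSHRS; i++)
            if (mshr[i].valid && mshr[i].block_addr == block_addr)
                return true;        /* merge with a miss in flight    */
        for (int i = 0; i < MSHRS; i++)
            if (!mshr[i].valid) {   /* allocate an MSHR, go to memory */
                mshr[i].valid = true;
                mshr[i].block_addr = block_addr;
                return true;
            }
        return false;               /* structural stall: MSHRs full   */
    }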

13
Nonblocking Cache (8KB direct-mapped L1, 32-byte blocks)
14
Hardware Prefetch
  • Fetch two blocks: the one desired and the next one
  • The next block goes into a stream buffer; on a fetch, check the
    stream buffer first
  • Performance:
  • A single instruction stream buffer caught 15% to 25% of L1 misses
  • 4 instruction stream buffers caught 50%
  • 16 instruction stream buffers caught 72%

15
Hardware Prefetch
  • Data prefetch:
  • A single data stream buffer caught 25% of L1 misses
  • 4 data stream buffers caught 43%
  • 8 data stream buffers caught 50% to 70%
  • Prefetch from multiple addresses:
  • The UltraSPARC III handles 8 prefetches and calculates the stride
    to predict the next address (see the sketch below)
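
A minimal C sketch of stride calculation for prefetch prediction,
loosely modeled on the idea above; the table layout and sizes are
assumptions, not the UltraSPARC III's actual mechanism:

    #include <stdint.h>

    #define ENTRIES 64

    typedef struct {
        uintptr_t last_addr;  /* previous address from this load */
        intptr_t  stride;     /* last observed stride            */
    } stride_entry_t;

    static stride_entry_t table[ENTRIES];

    /* Called on each load, indexed by the load's PC. Returns the
       address to prefetch, or 0 if there is no confident prediction. */
    uintptr_t predict_next(uintptr_t pc, uintptr_t addr)
    {
        stride_entry_t *e = &table[(pc >> 2) % ENTRIES];
        intptr_t new_stride = (intptr_t)(addr - e->last_addr);
        uintptr_t prefetch = 0;

        if (new_stride != 0 && new_stride == e->stride)
            prefetch = addr + new_stride; /* same stride seen twice */
        e->stride = new_stride;
        e->last_addr = addr;
        return prefetch;
    }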

16
Software Prefetch
  • Many processors, such as the Itanium, have prefetch instructions
  • Remember: they are nonfaulting (a prefetch that would fault is
    simply dropped rather than raising an exception); see the sketch
    below
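
A minimal sketch using GCC/Clang's __builtin_prefetch (a compiler
builtin, not the Itanium instruction itself); the prefetch distance
of 16 elements is an illustrative guess that would need tuning:

    /* Prefetch a[i+16] while working on a[i]; the builtin compiles
       to the target's nonfaulting prefetch instruction where one
       exists, and to nothing otherwise. */
    void scale(const double *a, double *b, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1); /* read, low reuse */
            b[i] = 2.0 * a[i];
        }
    }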

17
Hit Time Reduction
18
Small, Simple Caches
  • Hit time has two components:
  • Indexing
  • Comparing the tag
  • Small → indexing is fast
  • Simple → direct mapping allows tag comparison in parallel
    with the data load
  • → e.g. an L2 with tags on chip and data off chip

19
Access time vs. cache size and organization (graph)
20
Perspective on previous graph
  • These are the same statement:
  • a 1 ns clock is 10^-9 sec/clock cycle
  • 1 GHz is 10^9 clock cycles/sec
  • Therefore,
  • a 2 ns clock is 500 MHz
  • a 4 ns clock is 250 MHz
  • Conclude that small differences in ns represent
    large differences in MHz

21
Virtual vs Physical Address in L1
  • Translating from a virtual address to a physical
    address as part of a cache access takes time on the
    critical path
  • Translation is needed for both the index and the tag
  • Making the common case fast suggests avoiding
    translation on hits (misses must be translated anyway)

22
Why are (almost all) L1 caches physically addressed?
  • Security (protection): page-level protection must
    be checked on access (protection data can be
    copied into the cache)
  • A process switch can change the virtual mapping,
    requiring a cache flush (or a process ID in the tag);
    see next slide
  • Synonyms: two virtual addresses for the same (shared)
    physical address

23
Virtually addressed cache: context-switch cost (graph)
24
Hybrid: virtually indexed, physically tagged
  • Index with the part of the page offset that is
    identical in the virtual and physical addresses, i.e.
    the index bits are a subset of the page-offset bits
  • In parallel with indexing, translate the virtual
    address to check the physical tag
  • Limitation: a direct-mapped cache can be no larger
    than the page size (determined by the address bits);
    set-associative caches can be bigger, since fewer
    bits are needed for the index

25
Example
  • Pentium III: 8KB pages with a 16KB 2-way set-associative cache
  • IBM 3033: 4KB pages with a 64KB 16-way set-associative
    cache (note that 8-way would be sufficient, but 16-way
    is needed to keep the index bits sufficiently small);
    see the check below
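
Both examples satisfy the underlying constraint: one way of the
cache (cache size / associativity) must fit within a page, so that
the index and block-offset bits lie entirely within the page offset.
A quick check in C:

    #include <assert.h>

    int main(void)
    {
        /* way size = cache size / associativity must be <= page size */
        assert(16 * 1024 / 2  <= 8 * 1024);  /* Pentium III */
        assert(64 * 1024 / 16 <= 4 * 1024);  /* IBM 3033    */
        return 0;
    }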

26
Trace Cache
  • Pentium 4 NetBurst architecture
  • I-cache blocks are organized to contain instruction
    traces, including predicted-taken branches, instead of
    being organized around memory addresses
  • Advantage over regular large cache blocks, which
    contain branches and, hence, many unused instructions;
    e.g. AMD Athlon 64-byte blocks contain 16-24 x86
    instructions, with 1 in 5 being a branch
  • Disadvantage: complex addressing

27
Trace Cache
  • The P4 trace cache (I-cache) is placed after decode
    and branch prediction, so it contains:
  • µops
  • only the desired instructions
  • The trace cache holds 12K µops
  • The branch-predictor BTB has 4K entries
    (a 33% improvement over the PIII)

29
Summary (so far)
  • Figure 5.26 summarizes all of these techniques

30
Main Memory
  • Main-memory modifications can reduce the cache miss
    penalty by bringing words from memory faster
  • A wider path to memory brings in more words at a
    time, e.g. one address request brings in 4 words
    (reducing per-word overhead)
  • Interleaved memory can allow memory to respond
    faster; see the sketch below
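
A minimal C sketch of the interleaving idea: consecutive words map
to different banks, so sequential accesses can overlap. The 4-bank,
8-byte-word layout is an assumption for illustration:

    #include <stdio.h>
    #include <stdint.h>

    #define BANKS 4
    #define WORD  8  /* bytes per word */

    int main(void)
    {
        /* Print where the first 8 sequential words land: the low
           word-address bits select the bank, round-robin. */
        for (uintptr_t addr = 0; addr < 8 * WORD; addr += WORD) {
            unsigned  bank  = (addr / WORD) % BANKS;
            uintptr_t index = (addr / WORD) / BANKS; /* word within bank */
            printf("addr %2lu -> bank %u, word %lu\n",
                   (unsigned long)addr, bank, (unsigned long)index);
        }
        return 0;
    }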