1
Advanced Computer Architecture 5MD00 / 5Z033
Memory Hierarchy & Caches
  • Henk Corporaal
  • www.ics.ele.tue.nl/~heco/courses/aca
  • h.corporaal@tue.nl
  • TU Eindhoven
  • 2007

2
Topics
  • Processor Memory gap
  • Recap of cache basics
  • Basic cache optimizations
  • Advanced cache optimizations
  • reduce miss penalty
  • reduce miss rate
  • reduce hit time

3
Review: Who Cares About the Memory Hierarchy?
[Figure: processor-memory performance gap; CPU (µProc) performance grows ~60%/yr., DRAM performance ~7%/yr.]
4
Cache operation
[Figure: cache (higher level) and memory (lower level) exchanging blocks/lines; each cache entry holds a tag and data]
5
Direct Mapped Cache
  • Taking advantage of spatial locality

[Figure: direct-mapped cache with multiword blocks; address shown by bit positions]
6
A 4-Way Set-Associative Cache
7
6 basic cache optimizations (App. C)
  • Reduce miss rate
  • Larger block size
  • Bigger cache
  • Higher associativity
  • reduces conflict misses
  • Reduce miss penalty
  • Multi-level caches
  • Give priority to read misses over write misses
  • Reduce hit time
  • Avoid address translation during indexing of the
    cache

8
11 Advanced Cache Optimizations (5.2)
  • Reducing hit time
  • Small and simple caches
  • Way prediction
  • Trace caches
  • Increasing cache bandwidth
  • Pipelined caches
  • Multibanked caches
  • Nonblocking caches
  • Reducing Miss Penalty
  • Critical word first
  • Merging write buffers
  • Reducing Miss Rate
  • Compiler optimizations
  • Reducing miss penalty or miss rate via
    parallelism
  • Hardware prefetching
  • Compiler prefetching

9
1. Fast Hit via Small and Simple Caches
  • Indexing the tag memory and then comparing takes time
  • ⇒ a small cache is faster
  • Also, an L2 cache small enough to fit on the chip with
    the processor avoids the time penalty of going off
    chip
  • Simple ⇒ direct mapped
  • Can overlap the tag check with data transmission,
    since there is no choice of way
  • Access time estimate for 90 nm using CACTI model
    4.0

10
2. Fast Hit via Way Prediction
  • Make set-associative caches faster
  • Keep extra bits in cache to predict the way, or
    block within the set, of next cache access.
  • Multiplexor is set early to select desired block,
    only 1 tag comparison performed
  • Miss ⇒ first check the other blocks for matches in the next
    clock cycle
  • Accuracy ≈ 85%
  • Drawback: CPU pipelining is harder if a hit can take 1 or
    2 cycles
  • Used for instruction caches vs. L1 data caches
  • Also used on the MIPS R10K for the off-chip L2 unified
    cache, with the way-prediction table on-chip

11
A 4-Way Set-Associative Cache
12
Way Predicting Caches
  • Use processor address to index into way
    prediction table
  • Look in predicted way at given index, then

  • HIT in the predicted way: return copy of data from cache
  • MISS in the predicted way: look in the other way
  • HIT there: slow hit (change entry in prediction table)
  • MISS there too: read block of data from the next level of cache
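
A minimal C sketch of this lookup flow, assuming a hypothetical 2-way set-associative cache with one prediction entry per set (the structure and names such as pred and way_predicted_lookup are illustrative, not taken from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define NSETS 256
    #define NWAYS 2

    typedef struct {
        uint32_t tags[NSETS][NWAYS];
        bool     valid[NSETS][NWAYS];
        uint8_t  pred[NSETS];      /* predicted way for each set */
        /* data arrays omitted in this sketch */
    } cache_t;

    /* Returns 0 on a fast hit (predicted way), 1 on a slow hit
       (other way, prediction retrained), 2 on a miss. */
    int way_predicted_lookup(cache_t *c, uint32_t tag, uint32_t set)
    {
        int w = c->pred[set];
        if (c->valid[set][w] && c->tags[set][w] == tag)
            return 0;                          /* HIT in predicted way */

        for (int other = 0; other < NWAYS; other++) {
            if (other == w) continue;          /* next cycle: check the other way */
            if (c->valid[set][other] && c->tags[set][other] == tag) {
                c->pred[set] = (uint8_t)other; /* SLOW HIT: update prediction table */
                return 1;
            }
        }
        return 2;                              /* MISS: fetch from next level */
    }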
13
Way Predicting Instruction Cache (Alpha
21264-like)

[Figure: PC (with a +4 adder, jump target, and jump control) addresses the primary instruction cache; each fetch returns the instruction plus a way prediction, with separate Sequential Way and Branch Target Way fields used to select the next access]
14
3. Fast (Inst. Cache) Hit via Trace Cache
  • Key idea: pack multiple non-contiguous basic
    blocks into one contiguous trace cache line

[Figure: several basic blocks, each ending in a branch (BR), packed into a single trace cache line]
  • Single fetch brings in multiple basic blocks
  • Trace cache indexed by start address and next n
    branch predictions

15
3. Fast Hit times via Trace Cache
  • Trace cache in Pentium 4
  • Dynamic instr. traces cached (in level 1 cache)
  • Cache the micro-ops vs. x86 instructions
  • Decode/translate from x86 to micro-ops on trace
    cache miss
  • + better utilizes long blocks (don't exit in the
    middle of a block, don't enter at a label in the middle
    of a block)
  • - complicated address mapping, since addresses are no
    longer aligned to power-of-2 multiples of the word
    size
  • - instructions may appear multiple times in
    multiple dynamic traces due to different branch
    outcomes

16
4. Increasing Cache Bandwidth by Pipelining
  • Pipeline cache access to maintain bandwidth, at
    the cost of higher latency
  • Number of instruction cache access pipeline stages:
  • 1 for the Pentium
  • 2 for the Pentium Pro through Pentium III
  • 4 for the Pentium 4
  • - greater penalty on mispredicted branches
  • - more clock cycles between the issue of the load
    and the use of the data

17
5. Increasing Cache Bandwidth: Non-Blocking Caches
  • Non-blocking cache or lockup-free cache
  • allow data cache to continue to supply cache hits
    during a miss
  • requires out-of-order execution CPU
  • "hit under miss" reduces the effective miss
    penalty by continuing to serve hits during a miss
  • "hit under multiple miss" or "miss under miss"
    may further lower the effective miss penalty by
    overlapping multiple misses
  • Requires that memory system can service multiple
    misses
  • Significantly increases the complexity of the
    cache controller as there can be multiple
    outstanding memory accesses
  • Requires multiple memory banks (otherwise multiple
    outstanding misses cannot be serviced)
  • Pentium Pro allows 4 outstanding memory misses

18
5. Increasing Cache Bandwidth: Non-Blocking Caches
19
Value of Hit Under Miss for SPEC
[Figure: average memory access time for "hit under n misses" (0→1, 1→2, 2→64) vs. the base blocking cache, for integer and floating-point SPEC programs]
  • FP programs on average: AMAT 0.68 → 0.52 →
    0.34 → 0.26
  • Int programs on average: AMAT 0.24 → 0.20 →
    0.19 → 0.19
  • 8 KB Data Cache, Direct Mapped, 32B block, 16
    cycle miss

20
6. Increase Cache Bandwidth via Multiple Banks
  • Divide cache into independent banks that can
    support simultaneous accesses
  • E.g., T1 (Niagara) L2 has 4 banks
  • Banking works best when accesses naturally spread
    themselves across banks ⇒ mapping of addresses to
    banks affects the behavior of the memory system
  • Simple mapping that works well is sequential
    interleaving
  • Spread block addresses sequentially across banks
  • E.g., with 4 banks, bank 0 has all blocks whose
    address MOD 4 = 0, bank 1 has all blocks whose
    address MOD 4 = 1, etc. (see the sketch below)
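
A small C illustration of sequential interleaving, assuming the bank is selected by the block address (not the byte address); names such as bank_of are made up for this sketch:

    #include <stdint.h>
    #include <stdio.h>

    #define NBANKS 4

    /* Sequential interleaving: consecutive block addresses map to
       consecutive banks, i.e. bank = block address MOD 4. */
    static unsigned bank_of(uint64_t block_addr) {
        return (unsigned)(block_addr % NBANKS);
    }

    int main(void) {
        for (uint64_t b = 0; b < 8; b++)
            printf("block %llu -> bank %u\n", (unsigned long long)b, bank_of(b));
        return 0;
    }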

21
7. Early Restart and Critical Word First to
reduce miss penalty
  • Don't wait for the full block to be loaded before
    restarting the CPU
  • Early restart: as soon as the requested word of
    the block arrives, send it to the CPU and
    continue
  • Critical word first: request the missed word first
    from memory and send it to the CPU as soon as it
    arrives; let the CPU continue while filling the
    rest of the words in the block (the wrap-around
    transfer order is sketched below)
  • Generally useful only when blocks are large
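
A sketch of the wrap-around transfer order used by critical word first, assuming an 8-word block; fill_order and the word numbering are illustrative only:

    #include <stdio.h>

    #define WORDS_PER_BLOCK 8

    /* Order in which the memory system returns the words of a block
       when the word at index 'critical' was requested: the critical
       word first, then wrap around through the rest of the block. */
    static void fill_order(int critical, int order[WORDS_PER_BLOCK]) {
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            order[i] = (critical + i) % WORDS_PER_BLOCK;
    }

    int main(void) {
        int order[WORDS_PER_BLOCK];
        fill_order(5, order);            /* CPU asked for word 5 of the block */
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            printf("%d ", order[i]);     /* prints: 5 6 7 0 1 2 3 4 */
        printf("\n");
        return 0;
    }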

22
8. Merging Write Buffer to Reduce Miss Penalty
  • Write buffer to allow processor to continue while
    waiting to write to memory
  • E.g., four writes are merged into one buffer
    entry rather than putting them in separate
    buffers
  • Less frequent write backs

23
9. Reducing Misses by Compiler Optimizations
  • McFarling [1989] reduced cache misses by 75%
    for an 8KB direct-mapped cache with 4-byte blocks,
    in software
  • Instructions
  • Reorder procedures in memory so as to reduce
    conflict misses
  • Profiling to look at conflicts (using developed
    tools)
  • Data
  • Merging arrays: improve spatial locality by using a
    single array of compound elements vs. 2 arrays
  • Loop interchange: change nesting of loops to
    access data in the order stored in memory
  • Loop fusion: combine 2 independent loops that
    have the same looping and some variables overlap
  • Blocking: improve temporal locality by accessing
    blocks of data repeatedly vs. going down whole
    columns or rows

24
Merging Arrays
    /* Before: two sequential arrays */
    int val[SIZE];
    int key[SIZE];
    for (i = 0; i < SIZE; i++) {
        key[i] = newkey;
        val[i]++;
    }

    /* After: one array of records */
    struct record {
        int val;
        int key;
    };
    struct record records[SIZE];
    for (i = 0; i < SIZE; i++) {
        records[i].key = newkey;
        records[i].val++;
    }

  • Reduces conflicts between val and key and improves
    spatial locality

25
Loop Interchange
    /* Before */
    for (col = 0; col < 100; col++)
        for (row = 0; row < 5000; row++)
            X[row][col] = X[row][col] + 1;

    /* After */
    for (row = 0; row < 5000; row++)
        for (col = 0; col < 100; col++)
            X[row][col] = X[row][col] + 1;
  • Sequential accesses instead of striding through
    memory every 100 words
  • Improves spatial locality

[Figure: array X laid out in memory row by row (rows x columns)]
26
Loop Fusion
    /* Before: two separate loops */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1 / b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];

    /* After: fused loop */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
  • Split loops: every access to a and c misses.
    Fused loops: only the 1st access misses. Improves
    temporal locality

In the fused loop the reference to a[i][j] can be directly to a register
27
Blocking applied to array multiplication
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
  • The two inner loops
  • Read all NxN elements of b
  • Read all N elements of one row of a repeatedly
  • Write all N elements of one row of c
  • If a whole matrix does not fit in the cache, many
    cache misses.
  • Idea: compute on a BxB submatrix that fits in the
    cache

[Figure: c = a x b]
28
Blocking Example
    for (ii = 0; ii < N; ii += B)
        for (jj = 0; jj < N; jj += B)
            for (i = ii; i < min(ii+B-1, N); i++)
                for (j = jj; j < min(jj+B-1, N); j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++)
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }

  • B is called the Blocking Factor
  • Can reduce capacity misses from 2N³ + N² to
    2N³/B + N²

[Figure: a BxB block of c computed from a block row of a and a block column of b]
29
Reducing Conflict Misses by Blocking
  • Conflict misses in caches vs. Blocking size
  • Lam et al. [1991]: a blocking factor of 24 had a
    fifth of the misses of a factor of 48, despite both
    fitting in the cache

30
Summary of Compiler Optimizations to Reduce Cache
Misses (by hand)
31
10. Reducing Misses by HW Prefetching
  • Use extra memory bandwidth (if available)
  • Instruction Prefetching
  • Typically, the CPU fetches 2 blocks on a miss: the
    requested block and the next consecutive block.
  • Requested block is placed in instruction cache
    when it returns, and prefetched block is placed
    into instruction stream buffer
  • Data Prefetching
  • Pentium 4 can prefetch data into L2 cache from up
    to 8 streams from 8 different 4 KB pages
  • Prefetching invoked on 2 successive L2 cache
    misses to a page, if the distance between those
    cache blocks is < 256 bytes

32
Performance impact of prefetching
33
Issues in Prefetching
  • Usefulness: should produce hits
  • Timeliness: not too late and not too early
  • Cache and bandwidth pollution

[Figure: CPU with register file (RF), L1 instruction cache, L1 data cache, unified L2 cache, and a buffer holding prefetched data]
34
Hardware Instruction Prefetching
  • Instruction prefetch in Alpha AXP 21064
  • Fetch two blocks on a miss: the requested block
    (i) and the next consecutive block (i+1)
  • Requested block placed in the cache, and the next block
    in the instruction stream buffer
  • If miss in the cache but hit in the stream buffer, move
    the stream buffer block into the cache and prefetch the
    next block (i+2)

35
Hardware Data Prefetching
  • Prefetch-on-miss:
  • Prefetch b+1 upon a miss on b
  • One Block Lookahead (OBL) scheme:
  • Initiate prefetch for block b+1 when block b is
    accessed
  • Why is this different from doubling the block size?
  • Can extend to N-block lookahead
  • Strided prefetch:
  • If a sequence of accesses to blocks b, b+N,
    b+2N is observed, then prefetch b+3N, etc.
  • Example: IBM Power 5 [2003] supports eight
    independent streams of strided prefetch per
    processor, prefetching 12 lines ahead of the current
    access (a simplified stride-detection sketch follows)
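
A simplified C sketch of stride detection for one tracked load, not the actual Power 5 mechanism; the entry layout (last_addr, stride, confidence) and the thresholds are assumptions for illustration:

    #include <stdint.h>

    typedef struct {
        uint64_t last_addr;    /* address of the previous access */
        int64_t  stride;       /* last observed stride           */
        int      confidence;   /* small saturating counter       */
    } stride_entry_t;

    /* Called on every access of the tracked load.  Returns the
       address to prefetch, or 0 when no prefetch should be issued. */
    uint64_t stride_prefetch(stride_entry_t *e, uint64_t addr)
    {
        int64_t s = (int64_t)(addr - e->last_addr);
        if (s == e->stride && s != 0) {
            if (e->confidence < 3) e->confidence++;   /* same stride seen again */
        } else {
            e->stride = s;                            /* new stride: start over */
            e->confidence = 0;
        }
        e->last_addr = addr;

        /* After the same stride has been seen twice (b, b+N, b+2N),
           prefetch the next block in the stream (b+3N, ...). */
        if (e->confidence >= 1)
            return addr + (uint64_t)e->stride;
        return 0;
    }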

36
11. Reducing Misses by Software (Compiler
controlled) Prefetching Data
  • Data prefetch:
  • Register prefetch: load data into a register (HP PA-RISC
    loads)
  • Cache prefetch: load into the cache (MIPS IV,
    PowerPC, SPARC v. 9)
  • Special prefetching instructions cannot cause
    faults; a form of speculative execution
  • Issuing prefetch instructions takes time
  • Is the cost of prefetch issues < the savings in reduced
    misses?
  • A wider superscalar reduces the difficulty of issue
    bandwidth
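
The slides refer to ISA-level prefetch instructions; as one concrete C illustration (an assumption, using the GCC/Clang builtin __builtin_prefetch, which compiles to a non-faulting cache-prefetch instruction where the target supports one), a loop can prefetch data a fixed distance ahead:

    #include <stddef.h>

    /* Prefetch distance in elements; must be tuned so data arrives
       neither too late nor so early that it is evicted again. */
    #define PREFETCH_DIST 16

    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);  /* read, high locality */
            s += a[i];
        }
        return s;
    }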

37
[Table: for each technique, its impact on hit time, bandwidth, miss penalty, and miss rate, its HW cost/complexity, and a comment]
  • Small and simple caches: cost 0; trivial, widely used
  • Way-predicting caches: cost 1; used in Pentium 4
  • Trace caches: cost 3; used in Pentium 4
  • Pipelined cache access: cost 1; widely used
  • Nonblocking caches: cost 3; widely used
  • Banked caches: cost 1; used in L2 of Opteron and Niagara
  • Critical word first and early restart: cost 2; widely used
  • Merging write buffer: cost 1; widely used with write-through
  • Compiler techniques to reduce cache misses: cost 0; software is a challenge, some computers have a compiler option
  • Hardware prefetching of instructions and data: cost 2 (instr.) / 3 (data); many prefetch instructions, AMD Opteron prefetches data
  • Compiler-controlled prefetching: cost 3; needs a nonblocking cache, in many CPUs
38
Recap of Cache basics
39
Cache operation
[Figure: cache (higher level) and memory (lower level) exchanging blocks/lines; each cache entry holds a tag and data]
40
Direct Mapped Cache
  • Mapping: the address is modulo the number of blocks
    in the cache

41
Review: Four Questions for Memory Hierarchy
Designers
  • Q1: Where can a block be placed in the upper
    level? (Block placement)
  • Fully Associative, Set Associative, Direct Mapped
  • Q2: How is a block found if it is in the upper
    level? (Block identification)
  • Tag/Block
  • Q3: Which block should be replaced on a miss?
    (Block replacement)
  • Random, FIFO, LRU
  • Q4: What happens on a write? (Write strategy)
  • Write Back or Write Through (with Write Buffer)

42
Direct Mapped Cache
[Figure: direct-mapped cache with 1024 one-word entries; the 32-bit address (bits 31..0) is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the index selects an entry whose valid bit and stored tag are compared with the address tag to produce Hit, and the entry's data is returned]
  • Q: What kind of locality are we taking advantage
    of?
43
Direct Mapped Cache
  • Taking advantage of spatial locality

[Figure: direct-mapped cache with multiword blocks; address shown by bit positions]
44
A 4-Way Set-Associative Cache
45
Cache Basics
  • cache_size = Nsets x Assoc x Block_size
  • block_address = Byte_address DIV Block_size_in_bytes
  • index = block_address MOD Nsets
  • Because the block size and the number of sets are
    (usually) powers of two, DIV and MOD can be
    performed efficiently

[Figure: 32-bit address (bits 31..0) split into the block address (tag + index) and the block offset]
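
A small C sketch of this address split, assuming a direct-mapped cache with a power-of-two block size and number of sets so that DIV and MOD reduce to shifts and masks; the sizes below are example values only:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 16u       /* bytes per block  (2^4)  */
    #define NSETS      4096u     /* number of sets   (2^12) */

    int main(void) {
        uint32_t addr = 0x12345678u;

        uint32_t block_addr = addr / BLOCK_SIZE;   /* = addr >> 4          */
        uint32_t offset     = addr % BLOCK_SIZE;   /* = addr & 0xF         */
        uint32_t index      = block_addr % NSETS;  /* = block_addr & 0xFFF */
        uint32_t tag        = block_addr / NSETS;  /* = block_addr >> 12   */

        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;
    }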
46
Example 1
  • Assume:
  • Cache of 4K blocks
  • 4-word block size
  • 32-bit address
  • Direct mapped (associativity = 1):
  • 16 bytes per block = 2^4
  • 32-bit address: 32 - 4 = 28 bits for index and tag
  • sets = blocks / associativity; log2(4K) = 12, so 12
    bits for index
  • Total number of tag bits: (28-12) x 4K = 64 Kbits
  • 2-way associative:
  • sets = blocks / associativity = 2K sets
  • 1 bit less for indexing, 1 bit more for tag
  • Tag bits: (28-11) x 2 x 2K = 68 Kbits
  • 4-way associative:
  • sets = blocks / associativity = 1K sets
  • 1 bit less for indexing, 1 bit more for tag
  • Tag bits: (28-10) x 4 x 1K = 72 Kbits
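
A short program that just replays the arithmetic of this example for associativities 1, 2, and 4 (a sketch; the constants mirror the numbers above):

    #include <stdio.h>

    int main(void) {
        const int addr_bits   = 32;
        const int blocks      = 4 * 1024;   /* 4K blocks      */
        const int offset_bits = 4;          /* 16-byte blocks */

        for (int assoc = 1; assoc <= 4; assoc *= 2) {
            int sets       = blocks / assoc;
            int index_bits = 0;
            while ((1 << index_bits) < sets) index_bits++;
            int tag_bits   = addr_bits - offset_bits - index_bits;
            long total     = (long)tag_bits * assoc * sets;   /* = tag_bits * blocks */
            printf("%d-way: %2d index bits, %2d tag bits, %ld Kbit of tag storage\n",
                   assoc, index_bits, tag_bits, total / 1024);
        }
        return 0;
    }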

47
Example 2
  • 3 caches, each consisting of 4 one-word blocks:
  • Cache 1: fully associative
  • Cache 2: two-way set associative
  • Cache 3: direct mapped
  • Suppose the following sequence of block addresses:
    0, 8, 0, 6, 8

48
Example 2 Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Address of memory block   Hit or miss   Location 0   Location 1   Location 2   Location 3
0                         miss          Mem[0]
8                         miss          Mem[8]
0                         miss          Mem[0]
6                         miss          Mem[0]                    Mem[6]
8                         miss          Mem[8]                    Mem[6]

Coloured entries indicate a newly loaded block (miss)
49
Example 2 2-way Set Associative 2 sets
Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0
8               8 mod 2 = 0
(so all map to set 0)

Address of memory block   Hit or miss   Set 0, entry 0   Set 0, entry 1   Set 1, entry 0   Set 1, entry 1
0                         miss          Mem[0]
8                         miss          Mem[0]           Mem[8]
0                         hit           Mem[0]           Mem[8]
6                         miss          Mem[0]           Mem[6]
8                         miss          Mem[8]           Mem[6]

Replacement: least recently used block
50
Example 2 Fully associative (4 way assoc., 1
set)
Address of memory block   Hit or miss   Block 0   Block 1   Block 2   Block 3
0                         miss          Mem[0]
8                         miss          Mem[0]    Mem[8]
0                         hit           Mem[0]    Mem[8]
6                         miss          Mem[0]    Mem[8]    Mem[6]
8                         hit           Mem[0]    Mem[8]    Mem[6]
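
A tiny simulation that reproduces the direct-mapped hit/miss pattern of this example (a sketch; the 2-way and fully associative cases would additionally need LRU bookkeeping):

    #include <stdio.h>
    #include <stdbool.h>

    #define NBLOCKS 4

    int main(void) {
        int  tags[NBLOCKS];
        bool valid[NBLOCKS] = { false };
        int  seq[] = { 0, 8, 0, 6, 8 };

        for (int i = 0; i < 5; i++) {
            int block = seq[i];
            int idx   = block % NBLOCKS;             /* direct mapped */
            bool hit  = valid[idx] && tags[idx] == block;
            printf("access %d: %s\n", block, hit ? "hit" : "miss");
            valid[idx] = true;                       /* allocate on a miss */
            tags[idx]  = block;
        }
        return 0;
    }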
51
6 basic cache optimizations (App. C)
  • Reduce miss rate
  • Larger block size
  • Bigger cache
  • Higher associativity
  • reduces conflict misses
  • Reduce miss penalty
  • Multi-level caches
  • Give priority to read misses over write misses
  • Reduce hit time
  • Avoid address translation during indexing of the
    cache

52
Improving Cache Performance
  • T = Ninstr x CPI x Tcycle
  • CPI (with cache) = CPI_base + CPI_cache_penalty
  • CPI_cache_penalty = ...............................
    ..............
  • Reduce the miss penalty
  • Reduce the miss rate
  • Reduce the time to hit in the cache

53
1. Increase Block Size
54
2. Larger Caches
  • Increase capacity of cache
  • Disadvantages
  • longer hit time (may determine processor cycle
    time!!)
  • higher cost

55
3. Increase Associativity
  • 2:1 Cache Rule
  • Miss rate of a direct-mapped cache of size N ≈ miss
    rate of a 2-way set-associative cache of size N/2
  • Beware: execution time is the only true measure of
    performance!
  • Access time of set-associative caches is larger than
    the access time of direct-mapped caches
  • L1 cache often direct-mapped (access must fit in
    one clock cycle)
  • L2 cache often set-associative (cannot afford to
    go to main memory)

56
Classifying Misses the 3 Cs
  • The 3 Cs:
  • Compulsory: first access to a block is always a
    miss. Also called cold-start misses
  • = misses in an infinite cache
  • Capacity: misses resulting from the finite
    capacity of the cache
  • = misses in a fully associative cache with optimal
    replacement strategy
  • Conflict: misses occurring because several blocks
    map to the same set. Also called collision misses
  • = remaining misses

57
3 Cs Compulsory, Capacity, Conflict
  • In all cases, assume total cache size not changed
  • What happens if we:
  • 1) Change the block size: which of the 3Cs is obviously
    affected? compulsory misses
  • 2) Change the cache size: which of the 3Cs is obviously
    affected? capacity misses
  • 3) Introduce higher associativity: which of the 3Cs
    is obviously affected? conflict misses

58
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate per type (compulsory, capacity, conflict) as a function of cache size]
59
3Cs Relative Miss Rate
[Figure: relative miss rate per type as a function of cache size]
60
Improving Cache Performance
  1. Reduce the miss penalty
  2. Reduce the miss rate / number of misses
  3. Reduce the time to hit in the cache

61
4. Second Level Cache (L2)
  • Most CPUs
  • have an L1 cache small enough to match the cycle
    time (reduce the time to hit the cache)
  • have an L2 cache large enough and with sufficient
    associativity to capture most memory accesses
    (reduce miss rate)
  • L2 Equations:
  • AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  • Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss
    Penalty_L2
  • AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 +
    Miss Rate_L2 x Miss Penalty_L2)
  • Definitions:
  • Local miss rate: misses in this cache divided by
    the total number of memory accesses to this cache
    (Miss Rate_L2)
  • Global miss rate: misses in this cache divided by
    the total number of memory accesses generated by
    the CPU (Miss Rate_L1 x Miss Rate_L2)
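
A tiny helper that evaluates the two-level AMAT equation and the global L2 miss rate above (a sketch; the numbers are made-up example values, not from the slides):

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;  /* cycles */
        double miss_l1 = 0.05, miss_l2 = 0.40;                   /* local miss rates */

        double amat   = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * penalty_l2);
        double global = miss_l1 * miss_l2;     /* global L2 miss rate */

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
               amat, global * 100.0);
        return 0;
    }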

62
4. Second Level Cache (L2)
  • Suppose a processor with a base CPI of 1.0
  • Clock rate of 500 MHz (cycle time 2 ns)
  • Main memory access time 200 ns
  • Miss rate per instruction of the primary cache: 5%
  • What improvement with a second cache having 20 ns
    access time, reducing the miss rate to memory to 2%?
  • Miss penalty = 200 ns / 2 ns per cycle = 100 clock
    cycles
  • Effective CPI = base CPI + memory stalls per
    instruction
  • 1-level cache: total CPI = 1 + 5% x 100 = 6
  • 2-level cache: a miss in the first-level cache is
    satisfied by the second-level cache or by memory
  • Access to the second-level cache: 20 ns / 2 ns per
    cycle = 10 clock cycles
  • A miss in the second-level cache ⇒ access memory, in
    2% of the cases
  • Total CPI = 1 + primary stalls per instruction +
    secondary stalls per instruction
  • Total CPI = 1 + 5% x 10 + 2% x 100 = 3.5
  • Machine with L2 cache is 6/3.5 = 1.7 times faster
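
The same arithmetic replayed in C as a check (a sketch that only restates the numbers on this slide):

    #include <stdio.h>

    int main(void) {
        double base_cpi = 1.0;
        double cycle_ns = 2.0;                 /* 500 MHz clock            */
        double mem_ns   = 200.0, l2_ns = 20.0;
        double l1_miss  = 0.05;                /* misses per instruction   */
        double l2_miss  = 0.02;                /* fraction also missing L2 */

        double mem_penalty = mem_ns / cycle_ns;   /* 100 cycles */
        double l2_penalty  = l2_ns  / cycle_ns;   /*  10 cycles */

        double cpi_l1only = base_cpi + l1_miss * mem_penalty;                        /* 6.0 */
        double cpi_l1l2   = base_cpi + l1_miss * l2_penalty + l2_miss * mem_penalty; /* 3.5 */

        printf("CPI without L2 = %.1f, with L2 = %.1f, speedup = %.2f\n",
               cpi_l1only, cpi_l1l2, cpi_l1only / cpi_l1l2);
        return 0;
    }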

63
4. Second Level Cache
  • Global cache miss rate is similar to the single-cache
    miss rate of the second-level cache, provided the L2
    cache is much bigger than the L1.
  • The local miss rate is NOT a good measure of
    secondary caches, as it is a function of the L1 cache.
  • The global cache miss rate should be used.

64
4. Second Level Cache
65
5. Read Priority over Write on Miss
  • Write-through with write buffers can cause RAW
    data hazards:
      SW 512(R0), R3    ; Mem[512] <- R3
      LW R1, 1024(R0)   ; R1 <- Mem[1024]
      LW R2, 512(R0)    ; R2 <- Mem[512]
  • Problem: if a write buffer is used, the final LW may
    read the wrong (old) value from memory!
  • Solution 1: simply wait for the write buffer to
    empty
  • increases the read miss penalty (old MIPS 1000: by 50%)
  • Solution 2: check the write buffer contents before the
    read; if no conflicts, let the read continue

(addresses 512 and 1024 map to the same cache block)
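
A C sketch of Solution 2: before a read miss goes to memory, scan the write buffer for a matching address and forward the buffered data if found. The buffer layout and names are assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    typedef struct {
        bool     valid;
        uint32_t addr;    /* byte address of the pending store */
        uint32_t data;    /* word waiting to be written        */
    } wb_entry_t;

    /* On a read miss, check the write buffer first.  If the address is
       pending there, forward its data instead of reading stale memory.
       Returns true if the value was forwarded from the buffer. */
    bool read_with_wb_check(const wb_entry_t wb[WB_ENTRIES], uint32_t addr,
                            const uint32_t *memory, uint32_t *out)
    {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (wb[i].valid && wb[i].addr == addr) {
                *out = wb[i].data;        /* RAW hazard resolved by forwarding */
                return true;
            }
        }
        *out = memory[addr / 4];          /* no conflict: read memory as usual */
        return false;
    }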
66
5. Read Priority over Write on Miss
  • What about write-back?
  • Dirty bit: whenever a write is cached, this bit
    is set (made a 1) to tell the cache controller
    "when you decide to re-use this cache line for a
    different address, you need to write the current
    contents back to memory"
  • What if a read miss occurs?
  • Normal: write the dirty block to memory, then do the
    read
  • Instead: copy the dirty block to a write buffer, then
    do the read, then the write
  • Fewer CPU stalls, since the CPU restarts as soon as the
    read is done

67
6. Avoiding address translation during cache
access