Lecture 13: Memory Hierarchy
(Transcript; source: http://www.cs.ucr.edu)


1
Lecture 13: Memory Hierarchy - Ways to Reduce
Misses
2
Review: Who Cares About the Memory Hierarchy?
  • Processor-only focus thus far in course:
  • CPU cost/performance, ISA, pipelined execution
  • CPU-DRAM gap
  • 1980: no cache in µproc; 1995: 2-level cache on
    chip (1989: first Intel µproc with a cache on chip)

[Figure: processor vs. DRAM performance, 1980-2000, log
scale. µProc performance ("Moore's Law") grows ~60%/yr
while DRAM performance grows ~7%/yr, so the
processor-memory performance gap grows ~50%/yr.]
3
The Goal: Illusion of Large, Fast, Cheap Memory
  • Fact: Large memories are slow; fast memories are
    small
  • How do we create a memory that is large, cheap,
    and fast (most of the time)?
  • Hierarchy of levels
  • Uses smaller and faster memory technologies close
    to the processor
  • Fast access time in the highest level of the
    hierarchy
  • Cheap, slow memory furthest from the processor
  • The aim of memory hierarchy design is to have
    access time close to that of the highest level and
    size equal to that of the lowest level

4
Recap: Memory Hierarchy Pyramid

[Figure: pyramid with the processor (CPU) at the top,
connected by the transfer datapath (bus) down to Level n.
Moving up: decreasing distance from the CPU and decreasing
access time (memory latency). Moving down: increasing
distance from the CPU, decreasing cost/MB, and increasing
size of memory at each level.]
5
Memory Hierarchy Terminology
  • Hit: data appears in the upper level (Block X).
    Hit Rate: the fraction of memory accesses found in
    the upper level
  • Miss: data needs to be retrieved from a block in
    the lower level (Block Y). Miss Rate = 1 - (Hit
    Rate)
  • Hit Time: time to access the upper level, which
    consists of time to determine hit/miss + memory
    access time
  • Miss Penalty: time to replace a block in the
    upper level + time to deliver the block to the
    processor
  • Note: Hit Time << Miss Penalty

6
Current Memory Hierarchy
[Figure: processor (control, datapath, registers, L1
cache), backed by an L2 cache, main memory, and secondary
memory.]

             Regs     L1 cache  L2 cache  Main memory  Secondary
Speed (ns)   0.5      2         6         100          10,000,000
Size (MB)    0.0005   0.05      1-4       100-1000     100,000
Cost ($/MB)  --       100       30        1            0.05
Technology   Regs     SRAM      SRAM      DRAM         Disk
7
Memory Hierarchy: Why Does It Work? Locality!
  • Temporal Locality (Locality in Time)
  • => Keep most recently accessed data items closer
    to the processor
  • Spatial Locality (Locality in Space)
  • => Move blocks consisting of contiguous words to
    the upper levels

8
Memory Hierarchy Technology
  • Random Access
  • "Random" is good: access time is the same for all
    locations
  • DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be refreshed regularly
  • SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: content lasts "forever" (until power is
    lost)
  • Not-so-random Access Technology
  • Access time varies from location to location and
    from time to time
  • Examples: disk, CDROM
  • Sequential Access Technology: access time linear
    in location (e.g., tape)
  • We will concentrate on random access technology:
    main memory (DRAM) and caches (SRAM)

9
Introduction to Caches
  • A cache:
  • is a small, very fast memory (SRAM, expensive)
  • contains copies of the most recently accessed
    memory locations (data and instructions):
    temporal locality
  • is fully managed by hardware (unlike virtual
    memory)
  • has storage organized in blocks of contiguous
    memory locations: spatial locality
  • the unit of transfer to/from main memory (or L2)
    is the cache block
  • General structure:
  • n blocks per cache, organized in s sets
  • b bytes per block
  • total cache size = n x b bytes

10
Cache Organization
  • (1) How do you know if something is in the cache?
  • (2) If it is in the cache, how do you find it?
  • The answers to (1) and (2) depend on the type or
    organization of the cache
  • In a direct mapped cache, each memory address is
    associated with one possible block within the
    cache
  • Therefore, we only need to look in a single
    location in the cache to see if the data exists
    in the cache

11
Simplest Cache: Direct Mapped
[Figure: 4-block direct-mapped cache next to a 16-block
main memory. The memory block address splits into tag and
index; memory blocks 2 (0010), 6 (0110), 10 (1010), and
14 (1110) all map to cache index 2.]
  • index determines the block in the cache
  • index = (block address) mod (# of cache blocks)
  • If the number of cache blocks is a power of 2,
    then the cache index is just the lower n bits of
    the memory address, n = log2(# of blocks)

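The equivalence in the last bullet is easy to check; here is a
minimal Python sketch (mine, not from the slides) for the
4-block cache of the figure:

    NUM_BLOCKS = 4                          # 4-block direct-mapped cache
    N = NUM_BLOCKS.bit_length() - 1         # n = log2(# of blocks) = 2

    for block_addr in (2, 6, 10, 14):       # memory blocks from the figure
        index_mod = block_addr % NUM_BLOCKS      # (address) mod (# blocks)
        index_low = block_addr & ((1 << N) - 1)  # lower n bits of the address
        assert index_mod == index_low == 2       # all four collide on index 2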
12
Issues with Direct-Mapped
  • If block size > 1, the rightmost bits of the index
    are really the offset within the indexed block
13
64KB Cache with 4-word (16-byte) blocks
[Figure: 64KB direct-mapped cache with 4-word (16-byte)
blocks. The 32-bit address (bits 31...0) splits into a
16-bit tag (bits 31-16), a 12-bit index (bits 15-4)
selecting one of 4K entries, a 2-bit block offset
(bits 3-2), and a 2-bit byte offset (bits 1-0). Each entry
holds a valid bit V, a 16-bit tag, and 128 bits of data;
on a hit, a mux selects the requested 32-bit word.]
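To make the field widths concrete, here is a small Python
sketch (my illustration; the address 0x12345678 is hypothetical,
not from the slides) that decomposes an address for this cache:

    def split_address(addr):
        """Split a 32-bit address for a 64KB direct-mapped cache
        with 16-byte blocks: 16-bit tag, 12-bit index, 2+2-bit offset."""
        byte_offset  = addr & 0x3            # bits 1-0: byte within word
        block_offset = (addr >> 2) & 0x3     # bits 3-2: word within block
        index        = (addr >> 4) & 0xFFF   # bits 15-4: one of 4K entries
        tag          = addr >> 16            # bits 31-16
        return tag, index, block_offset, byte_offset

    # 0x12345678 -> tag 0x1234, index 0x567, word 2, byte 0
    print([hex(f) for f in split_address(0x12345678)])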
14
Direct-Mapped Cache Cont'd.
  • The direct mapped cache is simple to design and
    its access time is fast (Why?)
  • Good for L1 (on-chip) cache
  • Problem: conflict misses, so low hit ratio
  • Conflict misses are misses caused by accessing
    different memory locations that are mapped to the
    same cache index
  • In a direct mapped cache there is no flexibility
    in where a memory block can be placed in the
    cache, contributing to conflict misses

15
Another Extreme: Fully Associative
  • Fully Associative Cache (8-word blocks)
  • Omit the cache index: place an item in any block!
  • Compare all cache tags in parallel

[Figure: fully associative cache. The address splits into
a 27-bit cache tag (bits 31-5) and a byte offset
(bits 4-0). Every entry's valid bit and cache tag are
compared against the address tag in parallel; each block
holds data bytes B0, B1, ..., B31.]
  • By definition, Conflict Misses = 0 for a fully
    associative cache

16
Fully Associative Cache
  • Must search all tags in the cache, as an item can
    be in any cache block
  • The tag search must be done by hardware in
    parallel (other searches are too slow)
  • But the necessary parallel comparator hardware
    is very expensive
  • Therefore, fully associative placement is
    practical only for a very small cache

17
Compromise: N-way Set Associative Cache
  • N-way set associative: N cache blocks for each
    cache index
  • Like having N direct mapped caches operating in
    parallel
  • Select the one that gets a hit
  • Example: 2-way set associative cache
  • The cache index selects a set of 2 blocks from the
    cache
  • The 2 tags in the set are compared in parallel
  • Data is selected based on the tag comparison
    (which matched the address)

18
Example: 2-way Set Associative Cache

[Figure: 2-way set associative cache. The address splits
into tag, index, and offset; the index selects one set in
each of the two ways (each entry holds a valid bit, cache
tag, and cache data, starting at Block 0). Both tags are
compared in parallel, a mux selects the matching way's
cache block, and Hit is asserted.]
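A minimal Python sketch of the 2-way lookup (my model; the
set count and offset width are illustrative, not the slides'):

    NUM_SETS, OFFSET_BITS, INDEX_BITS = 4, 4, 2   # 2^INDEX_BITS = NUM_SETS

    def lookup(ways, addr):
        """The index selects one set; the two ways' tags are
        compared in parallel and a mux picks the matching block."""
        index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        for way in ways:                      # hardware compares both at once
            valid, way_tag, data = way[index]
            if valid and way_tag == tag:
                return True, data             # hit: this way supplies the block
        return False, None                    # miss in both ways

    # two ways, each a list of NUM_SETS entries: (valid, tag, data)
    ways = [[(False, 0, None)] * NUM_SETS for _ in range(2)]
    print(lookup(ways, 0x1230))               # (False, None): cold-cache miss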
19
Set Associative Cache Cont'd.
  • Direct mapped and fully associative caches can be
    seen as just variations of the set associative
    block placement strategy
  • Direct mapped = 1-way set associative cache
  • Fully associative = n-way set associativity for
    a cache with exactly n blocks

20
Alpha 21264 Cache Organization
21
Block Replacement Policy
  • N-way set associative or fully associative caches
    have a choice of where to place a block (which
    block to replace)
  • Of course, if there is an invalid block, use it
  • Whenever there is a cache hit, record the cache
    block that was touched
  • When a cache block must be evicted, choose one
    which hasn't been touched recently: Least
    Recently Used (LRU)
  • Past is prologue: history suggests it is the least
    likely of the choices to be used soon
  • Flip side of temporal locality
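A minimal sketch of LRU bookkeeping for one 4-way set (my
illustration; the slides give no implementation), using a
timestamp per block:

    import itertools

    clock = itertools.count(1)       # global access counter
    last_used = [0, 0, 0, 0]         # per-block timestamps in a 4-way set

    def touch(block):
        """On a hit, record that this block was just used."""
        last_used[block] = next(clock)

    def lru_victim():
        """Evict the block untouched for the longest time."""
        return min(range(len(last_used)), key=last_used.__getitem__)

    for b in (0, 2, 1, 3):
        touch(b)
    print(lru_victim())              # 0: least recently used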

22
Review: Four Questions for Memory Hierarchy
Designers
  • Q1: Where can a block be placed in the upper
    level? (Block placement)
  • Fully associative, set associative, direct mapped
  • Q2: How is a block found if it is in the upper
    level? (Block identification)
  • Tag/Block
  • Q3: Which block should be replaced on a miss?
    (Block replacement)
  • Random, LRU
  • Q4: What happens on a write? (Write strategy)
  • Write back or write through (with write buffer)

23
Write Policy: Write-Through vs Write-Back
  • Write-through: all writes update the cache and the
    underlying memory/cache
  • Can always discard cached data; the most
    up-to-date copy is in memory
  • Cache control bit: only a valid bit
  • Write-back: all writes simply update the cache
  • Can't just discard cached data; may have to
    write it back to memory
  • Flagged write-back
  • Cache control bits: both valid and dirty bits
  • Other advantages:
  • Write-through:
  • memory (or other processors) always has the latest
    data
  • simpler management of the cache
  • Write-back:
  • needs much lower bus bandwidth due to infrequent
    accesses
  • better tolerance to long-latency memory?
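As a rough model of the two policies (a sketch under assumed
data structures, not the slides' design): write-through
updates memory on every store, while write-back defers the
update to eviction via the dirty bit:

    def write(line, memory, addr, value, write_back):
        """Store into a cached line under either policy."""
        line["data"], line["valid"] = value, True
        if write_back:
            line["dirty"] = True         # defer the memory update
        else:
            memory[addr] = value         # write-through: memory updated now

    def evict(line, memory, addr):
        """Only a write-back cache must flush dirty data on eviction."""
        if line.get("dirty"):
            memory[addr] = line["data"]  # the deferred write happens here
        line["valid"] = line["dirty"] = False

    memory = {}
    line = {"valid": False, "dirty": False, "data": None}
    write(line, memory, 0x40, 7, write_back=True)
    evict(line, memory, 0x40)
    print(memory[0x40])                  # 7: written back at eviction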

24
Write Through: Write Allocate vs Non-Allocate
  • Write allocate: allocate a new cache line in the
    cache
  • Usually means that you have to do a "read miss"
    to fill in the rest of the cache line!
  • Alternative: per-word valid bits
  • Write non-allocate (or "write-around"):
  • Simply send the write data through to the
    underlying memory/cache; don't allocate a new
    cache line!

25
Write Buffers
  • Write buffers (for write-through)
  • buffer words to be written to L2 cache/memory,
    along with their addresses
  • 2 to 4 entries deep
  • all read misses are checked against pending
    writes for dependencies (associatively)
  • allows reads to proceed ahead of writes
  • can coalesce writes to the same address
  • Write-back buffers
  • sit between a write-back cache and L2 or main
    memory (MM)
  • algorithm (see the sketch below):
  • move the dirty block to the write-back buffer
  • read the new block
  • write the dirty block to L2 or MM
  • can be associated with a victim cache (later)

[Figure: the write buffer sits between the L1 cache and
L2; the CPU accesses L1.]
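A minimal sketch of the write-back buffer algorithm above
(assumed dict-based model): the new block is read before the
dirty block drains, so the CPU's read is not delayed by the
writeback:

    def handle_miss(line, new_addr, memory):
        """(1) move the dirty victim to the buffer, (2) read the
        new block, (3) only then write the dirty block to L2/MM."""
        wb_buffer = (line["addr"], line["data"]) if line["dirty"] else None
        line.update(addr=new_addr, data=memory[new_addr], dirty=False)
        if wb_buffer is not None:
            victim_addr, victim_data = wb_buffer
            memory[victim_addr] = victim_data    # deferred writeback
        return line["data"]

    memory = {0x10: "stale", 0x20: "wanted"}
    line = {"addr": 0x10, "data": "dirty!", "dirty": True}
    print(handle_miss(line, 0x20, memory))       # wanted
    print(memory[0x10])                          # dirty!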
26
Write Merge
27
Review: Cache Performance
28
Impact on Performance
  • Suppose a processor executes at
  • Clock Rate = 200 MHz (5 ns per cycle), ideal (no
    misses) CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
  • Suppose that 10% of memory operations (data) get a
    50-cycle miss penalty
  • Suppose that 1% of instructions get the same miss
    penalty
  • CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)
    + 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
  • 58% (1.5/2.6) of the time the processor is stalled
    waiting for data memory!
  • Total no. of memory accesses = one per instruction
    + 0.3 for data = 1.3. Thus, AMAT = (1/1.3) x (1 + 0.01 x 50)
    + (0.3/1.3) x (1 + 0.1 x 50) = 2.54 cycles, instead
    of one cycle.
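The arithmetic checks out in a short Python sketch (numbers
from the slide):

    ideal_cpi   = 1.1
    data_ops    = 0.30       # loads/stores per instruction
    miss_rate_d = 0.10       # data-access miss rate
    miss_rate_i = 0.01       # instruction-fetch miss rate
    penalty     = 50         # cycles per miss

    cpi = ideal_cpi + data_ops * miss_rate_d * penalty \
                    + 1.0 * miss_rate_i * penalty
    print(round(cpi, 2))             # 3.1 cycles/instruction

    accesses = 1.0 + data_ops        # 1.3 memory accesses per instruction
    amat = (1.0 / accesses) * (1 + miss_rate_i * penalty) \
         + (data_ops / accesses) * (1 + miss_rate_d * penalty)
    print(round(amat, 2))            # 2.54 cycles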

29
Impact of Change in Clock Cycle (cc)
  • Suppose a processor has the following parameters:
  • CPI = 2 (w/o memory stalls)
  • memory accesses per instruction = 1.5
  • Compare AMAT and CPU time for a direct mapped
    cache and a 2-way set associative cache, assuming
    the parameters in the table below:
  • AMAT_d = hit time + miss rate x miss penalty
    = 1 x 1 + 0.014 x 75 = 2.05 ns
  • AMAT_2 = 1 x 1.25 + 0.01 x 75 = 2 ns < 2.05 ns
  • CPU time_d = (CPI x cc + mem stall time) x IC
    = (2 x 1 + 1.5 x 0.014 x 75) x IC = 3.575 x IC
  • CPU time_2 = (2 x 1.25 + 1.5 x 0.01 x 75) x IC
    = 3.625 x IC > CPU time_d!
  • A change in cc affects all instructions, while a
    reduction in miss rate benefits only memory
    instructions.

                    cc               Hit cycles  Miss penalty  Miss rate
Direct mapped       1 ns             1           75 ns         1.4%
2-way associative   1.25 ns (why?)   1           75 ns         1.0%
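Checking with Python (a sketch using the table's parameters):

    def amat(cc_ns, hit_cycles, miss_rate, penalty_ns):
        return hit_cycles * cc_ns + miss_rate * penalty_ns

    def cpu_time_per_inst(cpi, cc_ns, accesses, miss_rate, penalty_ns):
        return cpi * cc_ns + accesses * miss_rate * penalty_ns

    print(round(amat(1.00, 1, 0.014, 75), 4))   # 2.05 ns (direct mapped)
    print(round(amat(1.25, 1, 0.010, 75), 4))   # 2.0 ns  (2-way)
    print(round(cpu_time_per_inst(2, 1.00, 1.5, 0.014, 75), 4))  # 3.575 x IC
    print(round(cpu_time_per_inst(2, 1.25, 1.5, 0.010, 75), 4))  # 3.625 x IC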
30
Miss Penalty for an Out-of-Order (OOO) Execution
Processor
  • In OOO processors, memory stall cycles are
    overlapped with the execution of other
    instructions. The miss penalty should not include
    this overlapped part:
  • mem stall cycles per instruction = mem misses per
    instruction x (total miss penalty - overlapped
    miss penalty)
  • For the previous example: suppose 30% of the 75 ns
    miss penalty can be overlapped. What are the AMAT
    and CPU time?
  • Assume a direct mapped cache with cc = 1.25 ns to
    handle out-of-order execution.
  • AMAT_d = 1 x 1.25 + 0.014 x (75 x 0.7) = 1.985 ns
  • With 1.5 memory accesses per instruction,
  • CPU time = (2 x 1.25 + 1.5 x 0.014 x (75 x 0.7)) x IC
    = 3.6025 x IC < CPU time_2
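Reusing the amat and cpu_time_per_inst helpers from the
previous sketch, the overlapped-penalty numbers also check
out (exposed penalty = 75 x 0.7 ns):

    exposed = 75 * 0.7               # 30% of the 75 ns penalty is hidden
    print(round(amat(1.25, 1, 0.014, exposed), 4))               # 1.985 ns
    print(round(cpu_time_per_inst(2, 1.25, 1.5, 0.014, exposed), 4))  # 3.6025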

31
Lock-Up Free Cache Using MSHR (Miss Status
Holding Register)
32
Avg. Memory Access Time vs. Miss Rate
  • Associativity reduces miss rate, but increases hit
    time due to the increase in hardware complexity!
  • Example: for an on-chip cache, assume CCT = 1.10
    for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the
    CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(Red in the original marks entries where AMAT is not
improved by more associativity.)

33
Unified vs Split Caches
  • Unified vs separate I&D (instruction and data)
    caches
  • Example:
  • 16KB I&D: inst miss rate = 0.64%, data miss
    rate = 6.47%
  • 32KB unified: aggregate miss rate = 1.99%
  • Which is better (ignoring the L2 cache)?
  • Assume 33% data ops => 75% of accesses are
    instruction fetches (1.0/1.33)
  • hit time = 1, miss time = 50
  • Note that a data hit incurs 1 extra stall for the
    unified cache (only one port)
  • AMAT_Harvard = 75% x (1 + 0.64% x 50)
    + 25% x (1 + 6.47% x 50) = 2.05
  • AMAT_Unified = 75% x (1 + 1.99% x 50)
    + 25% x (1 + 1 + 1.99% x 50) = 2.24
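One last Python check (percentages from the slide; the
unified cache's data access pays one extra stall cycle for
the single port):

    inst_frac, data_frac = 0.75, 0.25
    hit, penalty = 1, 50

    amat_split = inst_frac * (hit + 0.0064 * penalty) \
               + data_frac * (hit + 0.0647 * penalty)
    amat_unified = inst_frac * (hit + 0.0199 * penalty) \
                 + data_frac * (hit + 1 + 0.0199 * penalty)
    # exact values 2.04875 and 2.245; the slide rounds to 2.05 and 2.24
    print(amat_split, amat_unified)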