Title: Lecture 13: Memory Hierarchy
1. Lecture 13: Memory Hierarchy - Ways to Reduce Misses
2. Review: Who Cares About the Memory Hierarchy?
- Processor only thus far in course
  - CPU cost/performance, ISA, pipelined execution
- CPU-DRAM gap
  - 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: processor vs. DRAM performance, 1980-2000, log scale. µProc performance improves ~60%/year ("Moore's Law"), DRAM ~7%/year; the processor-memory performance gap grows ~50%/year.]
3. The Goal: Illusion of Large, Fast, Cheap Memory
- Fact: large memories are slow; fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)? A hierarchy of levels
  - Uses smaller and faster memory technologies close to the processor
  - Fast access time in the highest level of the hierarchy
  - Cheap, slow memory furthest from the processor
- The aim of memory hierarchy design is to have access time close to the highest level and size equal to the lowest level
4. Recap: Memory Hierarchy Pyramid
[Figure: pyramid with the processor (CPU) at the top and level n at the bottom, connected by the transfer datapath/bus. Levels closer to the CPU have smaller access time (memory latency); levels farther from the CPU have lower cost/MB and larger size of memory at each level.]
5. Memory Hierarchy Terminology
- Hit: data appears in level X. Hit Rate: the fraction of memory accesses found in the upper level
- Miss: data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate)
- Hit Time: time to access the upper level, which consists of the time to determine hit/miss + the memory access time
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Note: Hit Time << Miss Penalty
6. Current Memory Hierarchy
[Figure: processor (control + datapath with registers and L1 cache) connected to the L2 cache, main memory, and secondary memory.]

| Level       | Regs   | L1 cache | L2 cache | Main memory | Secondary memory |
|-------------|--------|----------|----------|-------------|------------------|
| Speed (ns)  | 0.5    | 2        | 6        | 100         | 10,000,000       |
| Size (MB)   | 0.0005 | 0.05     | 1-4      | 100-1000    | 100,000          |
| Cost ($/MB) | --     | 100      | 30       | 1           | 0.05             |
| Technology  | Regs   | SRAM     | SRAM     | DRAM        | Disk             |
7. Memory Hierarchy: Why Does it Work? Locality!
- Temporal Locality (Locality in Time)
  - => Keep the most recently accessed data items closer to the processor
- Spatial Locality (Locality in Space)
  - => Move blocks consisting of contiguous words to the upper levels
8. Memory Hierarchy Technology
- Random Access
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be refreshed regularly
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (until power is lost)
- "Not-so-random" Access Technology
  - Access time varies from location to location and from time to time
  - Examples: disk, CD-ROM
- Sequential Access Technology: access time linear in location (e.g., tape)
- We will concentrate on random access technology
  - Main memory: DRAM; Caches: SRAM
9. Introduction to Caches
- Cache
  - is a small, very fast memory (SRAM, expensive)
  - contains copies of the most recently accessed memory locations (data and instructions): temporal locality
  - is fully managed by hardware (unlike virtual memory)
  - storage is organized in blocks of contiguous memory locations: spatial locality
  - unit of transfer to/from main memory (or L2) is the cache block
- General structure
  - n blocks per cache organized in s sets
  - b bytes per block
  - total cache size = n×b bytes (see the sketch below)
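A minimal C sketch of this bookkeeping; the struct and the example geometry are illustrative, not from the slides:

```c
#include <stdio.h>

/* Hypothetical cache geometry: n blocks, s sets, b bytes per block. */
struct cache_geometry {
    unsigned n;   /* total number of blocks in the cache */
    unsigned s;   /* number of sets                      */
    unsigned b;   /* bytes per block                     */
};

int main(void) {
    /* Example: 128 blocks, 64 sets (2-way), 32-byte blocks. */
    struct cache_geometry g = { 128, 64, 32 };

    unsigned assoc      = g.n / g.s;   /* blocks per set (associativity) */
    unsigned total_size = g.n * g.b;   /* total cache size = n * b bytes */

    printf("associativity = %u-way, total size = %u bytes\n", assoc, total_size);
    return 0;
}
```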
10. Cache Organization
- (1) How do you know if something is in the cache?
- (2) If it is in the cache, how do you find it?
- The answers to (1) and (2) depend on the type or organization of the cache
- In a direct mapped cache, each memory address is associated with one possible block within the cache
- Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
11. Simplest Cache: Direct Mapped
[Figure: 4-block direct mapped cache next to a 16-block main memory. Memory block addresses 0-15 map to cache indices 0-3; the highlighted block addresses 0010, 0110, 1010, and 1110 all map to cache index 10 (2), with the remaining upper address bits kept as the tag.]
- index determines block in cache
- index = (address) mod (# blocks)
- If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address, n = log2(# blocks); see the sketch below
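A minimal C sketch of this mapping, using the four highlighted block addresses from the figure (the constants are just this slide's 4-block example):

```c
#include <stdio.h>

#define NUM_BLOCKS 4u                  /* must be a power of 2 */
#define INDEX_BITS 2u                  /* log2(NUM_BLOCKS)     */

int main(void) {
    /* Memory block addresses highlighted in the figure. */
    unsigned block_addrs[] = { 2u, 6u, 10u, 14u };   /* 0010, 0110, 1010, 1110 */

    for (int i = 0; i < 4; i++) {
        unsigned addr  = block_addrs[i];
        unsigned index = addr % NUM_BLOCKS;    /* same as addr & (NUM_BLOCKS-1) */
        unsigned tag   = addr >> INDEX_BITS;   /* remaining upper bits          */
        printf("block addr %2u -> index %u, tag %u\n", addr, index, tag);
    }
    return 0;                                  /* all four map to index 2 */
}
```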
12. Issues with Direct-Mapped
- If block size > 1, the rightmost bits of the index are really the offset within the indexed block
13. 64KB Cache with 4-word (16-byte) Blocks
[Figure: address (showing bit positions 31...16, 15...4, 3-2, 1-0) split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a 2-bit byte offset. The index selects one of 4K entries, each holding a valid bit (V), a 16-bit tag, and 128 bits of data (four 32-bit words). The tag comparison produces Hit, and a mux driven by the block offset selects the 32-bit data word.]
14. Direct-Mapped Cache, Contd.
- The direct mapped cache is simple to design and its access time is fast (Why?)
- Good for L1 (on-chip) cache
- Problem: conflict misses, so low hit ratio
  - Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
- In a direct mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses
15. Another Extreme: Fully Associative
- Fully Associative Cache (8-word block)
  - Omit the cache index: place an item in any block!
  - Compare all cache tags in parallel
[Figure: the 32-bit address is split into a 27-bit cache tag (bits 31-5) and a byte offset (bits 4-0). The address tag is compared against the cache tag of every valid block in parallel; each block holds bytes B 0 through B 31 of cache data.]
- By definition, conflict misses = 0 for a fully associative cache
16. Fully Associative Cache
- Must search all tags in the cache, as an item can be in any cache block
- The search for the tag must be done by hardware in parallel (other searches are too slow)
- But the necessary parallel comparator hardware is very expensive
- Therefore, fully associative placement is practical only for a very small cache
17. Compromise: N-way Set Associative Cache
- N-way set associative: N cache blocks for each cache index
  - Like having N direct mapped caches operating in parallel
  - Select the one that gets a hit
- Example: 2-way set associative cache (a lookup sketch follows)
  - The cache index selects a set of 2 blocks from the cache
  - The 2 tags in the set are compared in parallel
  - Data is selected based on the tag result (which matched the address)
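A minimal C sketch of this lookup; the geometry, struct layout, and function names are illustrative assumptions, and the parallel tag compare of the hardware is modeled as a small loop:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256u        /* illustrative geometry */
#define WAYS       2u

struct line { bool valid; uint32_t tag; uint8_t data[16]; };
static struct line cache[NUM_SETS][WAYS];

/* Look up a block: the index selects a set, then the tags of all ways
 * in that set are compared (in hardware this happens in parallel). */
static bool lookup(uint32_t block_addr, unsigned *hit_way) {
    uint32_t index = block_addr % NUM_SETS;
    uint32_t tag   = block_addr / NUM_SETS;
    for (unsigned w = 0; w < WAYS; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *hit_way = w;    /* the matching way drives the data mux */
            return true;
        }
    }
    return false;            /* miss: no tag in the set matched */
}

int main(void) {
    uint32_t blk = 0x1234;   /* arbitrary block address */
    cache[blk % NUM_SETS][1] = (struct line){ true, blk / NUM_SETS, {0} };
    unsigned way;
    return lookup(blk, &way) ? 0 : 1;   /* hits in way 1 */
}
```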
18. Example: 2-way Set Associative Cache
[Figure: the address is split into tag, index, and offset fields. The index selects one set, i.e. Block 0 of each of the two ways, each with a valid bit, cache tag, and cache data. Both cache tags are compared with the address tag in parallel; a mux selects the cache block of the matching way and Hit is asserted.]
19. Set Associative Cache, Contd.
- Direct mapped and fully associative caches can be seen as just variations of the set associative block placement strategy
- Direct mapped = 1-way set associative cache
- Fully associative = n-way set associativity for a cache with exactly n blocks
20. Alpha 21264 Cache Organization
21. Block Replacement Policy
- N-way set associative or fully associative caches have a choice of where to place a block (and which block to replace)
- Of course, if there is an invalid block, use it
- Whenever you get a cache hit, record the cache block that was touched
- When you need to evict a cache block, choose the one which hasn't been touched recently: Least Recently Used (LRU), sketched below
  - Past is prologue: history suggests it is the least likely of the choices to be used soon
  - Flip side of temporal locality
22. Review: Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
  - Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, LRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with write buffer)
23. Write Policy: Write-Through vs Write-Back
- Write-through: all writes update the cache and the underlying memory/cache
  - Can always discard cached data; the most up-to-date data is in memory
  - Cache control bit: only a valid bit
- Write-back: all writes simply update the cache
  - Can't just discard cached data; may have to write it back to memory
  - "Flagged" write-back
  - Cache control bits: both valid and dirty bits
- Other advantages
  - Write-through
    - Memory (or other processors) always has the latest data
    - Simpler management of the cache
  - Write-back
    - Needs much lower bus bandwidth due to infrequent accesses
    - Better tolerance to long-latency memory?
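A minimal sketch contrasting the two policies on a write hit; the line layout and the memory_write stub are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

struct line { bool valid; bool dirty; uint32_t tag; uint8_t data[16]; };

/* Stand-in for the level below (L2 or main memory). */
static uint8_t backing_store[1 << 16];
static void memory_write(uint32_t addr, uint8_t byte) { backing_store[addr & 0xFFFF] = byte; }

/* Write hit, write-through: update the cache AND the level below. */
static void write_hit_through(struct line *l, uint32_t addr, unsigned off, uint8_t byte) {
    l->data[off] = byte;
    memory_write(addr, byte);        /* memory always has the latest data */
}

/* Write hit, write-back: update only the cache and mark the line dirty;
 * the block is written back later, when it is evicted. */
static void write_hit_back(struct line *l, unsigned off, uint8_t byte) {
    l->data[off] = byte;
    l->dirty = true;                 /* must be flushed before discarding */
}

int main(void) {
    struct line l = { true, false, 0, {0} };
    write_hit_through(&l, 0x100, 0, 0xAB);   /* cache and memory both updated   */
    write_hit_back(&l, 1, 0xCD);             /* only the cache; line now dirty  */
    return l.dirty ? 0 : 1;
}
```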
24. Write Through: Write Allocate vs Non-Allocate
- Write allocate: allocate a new cache line in the cache
  - Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  - Alternative: per-word valid bits
- Write non-allocate (or "write-around")
  - Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!
25. Write Buffers
- Write buffers (for write-through); a small sketch follows
  - Buffer words to be written to the L2 cache/memory, along with their addresses
  - 2 to 4 entries deep
  - All read misses are checked against pending writes for dependencies (associatively)
  - Allows reads to proceed ahead of writes
  - Can coalesce writes to the same address
- Write-back buffers
  - Sit between a write-back cache and L2 or MM
  - Algorithm: move the dirty block to the write-back buffer, read the new block, then write the dirty block to L2 or MM
  - Can be associated with a victim cache (later)
[Diagram: the L1 cache sits between the CPU and L2, with the write buffer between L1 and L2.]
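A minimal sketch of a write-through write buffer with the behaviors listed above, i.e. coalescing writes to the same address and checking read misses against pending entries (the entry format and depth are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                     /* typical depth: 2 to 4 entries */

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry wb[WB_ENTRIES];

/* Buffer a write; coalesce if an entry for the same address is pending. */
static bool wb_insert(uint32_t addr, uint32_t data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) { wb[i].data = data; return true; }
    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wb[i].valid) { wb[i] = (struct wb_entry){ true, addr, data }; return true; }
    return false;                        /* buffer full: the write must stall */
}

/* On a read miss, check the pending writes so the read sees fresh data. */
static bool wb_lookup(uint32_t addr, uint32_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) { *data = wb[i].data; return true; }
    return false;
}

int main(void) {
    uint32_t v;
    wb_insert(0x40, 1); wb_insert(0x40, 2);           /* second write coalesces */
    return (wb_lookup(0x40, &v) && v == 2) ? 0 : 1;
}
```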
26. Write Merge
27. Review: Cache Performance
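For reference, these are the standard formulas assumed by the worked examples on the following slides:

```latex
\mathrm{AMAT} = \mathrm{Hit\ time} + \mathrm{Miss\ rate} \times \mathrm{Miss\ penalty}
```

```latex
\mathrm{CPU\ time} = \mathrm{IC} \times \left( \mathrm{CPI}_{\mathrm{exec}}
  + \frac{\mathrm{Memory\ accesses}}{\mathrm{Instruction}} \times \mathrm{Miss\ rate}
    \times \mathrm{Miss\ penalty} \right) \times \mathrm{Clock\ cycle\ time}
```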
28. Impact on Performance
- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations (data) get a 50-cycle miss penalty
- Suppose that 1% of instructions get the same miss penalty
- CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + 0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)
        + 1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- 58% (1.5/2.6) of the time the processor is stalled waiting for data memory!
- Total number of memory accesses = 1 per instruction + 0.3 for data = 1.3. Thus
  AMAT = (1/1.3)×(1 + 0.01×50) + (0.3/1.3)×(1 + 0.1×50) = 2.54 cycles, instead of one cycle.
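The arithmetic above can be checked with a short C sketch (the parameters are taken from this slide):

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_ops     = 0.30;    /* loads/stores per instruction */
    double data_mr      = 0.10;    /* data miss rate               */
    double inst_mr      = 0.01;    /* instruction miss rate        */
    double miss_penalty = 50.0;    /* cycles                       */

    double cpi = ideal_cpi
               + data_ops * data_mr * miss_penalty     /* 1.5 data stall cycles */
               + 1.0      * inst_mr * miss_penalty;    /* 0.5 inst stall cycles */

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data. */
    double amat = (1.0 / 1.3) * (1.0 + inst_mr * miss_penalty)
                + (0.3 / 1.3) * (1.0 + data_mr * miss_penalty);

    printf("CPI = %.1f, AMAT = %.2f cycles\n", cpi, amat);   /* 3.1 and 2.54 */
    return 0;
}
```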
29. Impact of Change in cc (Clock Cycle Time)
- Suppose a processor has the following parameters:
  - CPI = 2 (w/o memory stalls)
  - memory accesses per instruction = 1.5
- Compare AMAT and CPU time for a direct mapped cache and a 2-way set associative cache, assuming:

| Cache             | cc             | Hit cycles | Miss penalty | Miss rate |
|-------------------|----------------|------------|--------------|-----------|
| Direct mapped     | 1 ns           | 1          | 75 ns        | 1.4%      |
| 2-way associative | 1.25 ns (why?) | 1          | 75 ns        | 1.0%      |

- AMAT_d = hit time + miss rate × miss penalty = 1×1 + 0.014×75 = 2.05 ns
- AMAT_2 = 1×1.25 + 0.010×75 = 2.0 ns < 2.05 ns
- CPU time_d = (CPI×cc + memory stall time per instruction)×IC = (2×1 + 1.5×0.014×75)×IC = 3.575×IC
- CPU time_2 = (2×1.25 + 1.5×0.010×75)×IC = 3.625×IC > CPU time_d!
- A change in cc affects all instructions, while a reduction in miss rate benefits only memory instructions.
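A short C sketch of the comparison, using the parameters in the table above (the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    double accesses = 1.5;                 /* memory accesses per instruction */
    double cpi_exec = 2.0;                 /* CPI without memory stalls       */
    double penalty  = 75.0;                /* ns                              */

    /* Direct mapped: cc = 1 ns, miss rate = 1.4%. */
    double amat_dm = 1.0 * 1.0  + 0.014 * penalty;                 /* 2.05 ns             */
    double cpu_dm  = cpi_exec * 1.0  + accesses * 0.014 * penalty; /* 3.575 ns per instr  */

    /* 2-way: cc = 1.25 ns (slower hit), miss rate = 1.0%. */
    double amat_2w = 1.0 * 1.25 + 0.010 * penalty;                 /* 2.0 ns              */
    double cpu_2w  = cpi_exec * 1.25 + accesses * 0.010 * penalty; /* 3.625 ns per instr  */

    printf("AMAT: DM %.2f ns vs 2-way %.2f ns\n", amat_dm, amat_2w);
    printf("CPU time per instr: DM %.3f ns vs 2-way %.3f ns\n", cpu_dm, cpu_2w);
    return 0;
}
```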
30. Miss Penalty for Out-of-Order (OOO) Execution Processors
- In OOO processors, memory stall cycles are overlapped with the execution of other instructions. The miss penalty should not include this overlapped part.
- memory stall cycles per instruction = memory misses per instruction × (total miss penalty - overlapped miss penalty)
- For the previous example: suppose 30% of the 75 ns miss penalty can be overlapped; what are the AMAT and CPU time?
  - Assume a direct mapped cache and cc = 1.25 ns to handle out-of-order execution.
- AMAT_d = 1×1.25 + 0.014×(75×0.7) = 1.985 ns
- With 1.5 memory accesses per instruction,
  CPU time = (2×1.25 + 1.5×0.014×(75×0.7))×IC = 3.6025×IC < CPU time_2
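A similar sketch with the overlapped portion of the miss penalty removed (30% of 75 ns), using the direct mapped parameters above:

```c
#include <stdio.h>

int main(void) {
    double accesses  = 1.5;      /* memory accesses per instruction            */
    double cpi_exec  = 2.0;
    double cc        = 1.25;     /* ns; clock cycle lengthened to handle OOO   */
    double miss_rate = 0.014;    /* direct mapped cache                        */
    double penalty   = 75.0;     /* ns, total miss penalty                     */
    double exposed   = penalty * (1.0 - 0.30);   /* 30% is overlapped          */

    double amat = 1.0 * cc + miss_rate * exposed;                  /* 1.985 ns  */
    double cpu  = cpi_exec * cc + accesses * miss_rate * exposed;  /* 3.6025 ns */

    printf("AMAT = %.3f ns, CPU time per instruction = %.4f ns\n", amat, cpu);
    return 0;
}
```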
31. Lock-Up Free Cache Using MSHR (Miss Status Holding Register)
32. Avg. Memory Access Time vs. Miss Rate
- Associativity reduces miss rate, but increases hit time due to the increase in hardware complexity!
- Example: for an on-chip cache, assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct mapped cache.

| Cache size (KB) | 1-way | 2-way | 4-way | 8-way |
|-----------------|-------|-------|-------|-------|
| 1               | 2.33  | 2.15  | 2.07  | 2.01  |
| 2               | 1.98  | 1.86  | 1.76  | 1.68  |
| 4               | 1.72  | 1.67  | 1.61  | 1.53  |
| 8               | 1.46  | 1.48  | 1.47  | 1.43  |
| 16              | 1.29  | 1.32  | 1.32  | 1.32  |
| 32              | 1.20  | 1.24  | 1.25  | 1.27  |
| 64              | 1.14  | 1.20  | 1.21  | 1.23  |
| 128             | 1.10  | 1.17  | 1.18  | 1.20  |

- (In the original slide, red entries mark the cases where A.M.A.T. is not improved by more associativity.)
33. Unified vs Split Caches
- Unified vs separate I&D (Harvard)
- Example:
  - 16 KB I + 16 KB D: inst miss rate = 0.64%, data miss rate = 6.47%
  - 32 KB unified: aggregate miss rate = 1.99%
- Which is better (ignore the L2 cache)?
  - Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
  - hit time = 1, miss time = 50
  - Note that a data hit incurs 1 extra stall for the unified cache (only one port)
- AMAT_Harvard = 75%×(1 + 0.64%×50) + 25%×(1 + 6.47%×50) = 2.05
- AMAT_Unified = 75%×(1 + 1.99%×50) + 25%×(1 + 1 + 1.99%×50) = 2.24
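A short C sketch of the weighted AMAT calculation above (the fractions and miss rates come from this slide):

```c
#include <stdio.h>

int main(void) {
    double f_inst = 0.75, f_data = 0.25;   /* fraction of accesses: 75% / 25% */
    double penalty = 50.0;                 /* miss penalty, cycles            */

    /* Split (Harvard) 16 KB + 16 KB caches: separate I and D miss rates. */
    double amat_split = f_inst * (1.0 + 0.0064 * penalty)
                      + f_data * (1.0 + 0.0647 * penalty);          /* about 2.05 */

    /* Unified 32 KB cache: single miss rate, plus one extra stall cycle on
     * data accesses because the single port is busy with the fetch. */
    double amat_unified = f_inst * (1.0 + 0.0199 * penalty)
                        + f_data * (1.0 + 1.0 + 0.0199 * penalty);  /* about 2.24 */

    printf("AMAT split = %.3f, unified = %.3f\n", amat_split, amat_unified);
    return 0;
}
```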