Title: Lecture 13: Memory Hierarchy
1. Lecture 13: Memory Hierarchy - Ways to Reduce Misses
2. Review: Who Cares About the Memory Hierarchy?
- Processor only thus far in course
  - CPU cost/performance, ISA, pipelined execution
- CPU-DRAM gap
  - 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
[Figure: processor vs. DRAM performance, 1980-2000, log scale. µProc performance improves ~60%/year ("Moore's Law"), DRAM ~7%/year; the processor-memory performance gap grows ~50%/year.]
3. The Goal: Illusion of Large, Fast, Cheap Memory
- Fact: large memories are slow; fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)? A hierarchy of levels
  - Uses smaller and faster memory technologies close to the processor
  - Fast access time in the highest level of the hierarchy
  - Cheap, slow memory furthest from the processor
- The aim of memory hierarchy design is to have access time close to the highest level and size equal to the lowest level
4. Recap: Memory Hierarchy Pyramid
[Figure: pyramid with the processor (CPU) at the top and level n at the bottom, connected by the transfer datapath/bus. Levels closer to the CPU have smaller access time (memory latency); levels farther from the CPU have lower cost/MB and larger size of memory at each level.]
5. Memory Hierarchy Terminology
- Hit: data appears in level X. Hit Rate: the fraction of memory accesses found in the upper level
- Miss: data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate)
- Hit Time: time to access the upper level, which consists of the time to determine hit/miss + the memory access time
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Note: Hit Time << Miss Penalty
6. Current Memory Hierarchy
[Figure: processor (control + datapath with registers and L1 cache) connected to the L2 cache, main memory, and secondary memory.]

| Level       | Regs   | L1 cache | L2 cache | Main memory | Secondary memory |
|-------------|--------|----------|----------|-------------|------------------|
| Speed (ns)  | 0.5    | 2        | 6        | 100         | 10,000,000       |
| Size (MB)   | 0.0005 | 0.05     | 1-4      | 100-1000    | 100,000          |
| Cost ($/MB) | --     | 100      | 30       | 1           | 0.05             |
| Technology  | Regs   | SRAM     | SRAM     | DRAM        | Disk             |
7. Memory Hierarchy: Why Does it Work? Locality!
- Temporal Locality (Locality in Time)
  - => Keep the most recently accessed data items closer to the processor
- Spatial Locality (Locality in Space)
  - => Move blocks consisting of contiguous words to the upper levels
8. Memory Hierarchy Technology
- Random Access
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be refreshed regularly
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (until power is lost)
- "Not-so-random" Access Technology
  - Access time varies from location to location and from time to time
  - Examples: disk, CD-ROM
- Sequential Access Technology: access time linear in location (e.g., tape)
- We will concentrate on random access technology
  - Main memory: DRAM; Caches: SRAM
9. Introduction to Caches
- Cache
  - is a small, very fast memory (SRAM, expensive)
  - contains copies of the most recently accessed memory locations (data and instructions): temporal locality
  - is fully managed by hardware (unlike virtual memory)
  - storage is organized in blocks of contiguous memory locations: spatial locality
  - unit of transfer to/from main memory (or L2) is the cache block
- General structure
  - n blocks per cache organized in s sets
  - b bytes per block
  - total cache size = n×b bytes (see the sketch below)
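A minimal C sketch of this bookkeeping; the struct and the example geometry are illustrative, not from the slides:

```c
#include <stdio.h>

/* Hypothetical cache geometry: n blocks, s sets, b bytes per block. */
struct cache_geometry {
    unsigned n;   /* total number of blocks in the cache */
    unsigned s;   /* number of sets                      */
    unsigned b;   /* bytes per block                     */
};

int main(void) {
    /* Example: 128 blocks, 64 sets (2-way), 32-byte blocks. */
    struct cache_geometry g = { 128, 64, 32 };

    unsigned assoc      = g.n / g.s;   /* blocks per set (associativity) */
    unsigned total_size = g.n * g.b;   /* total cache size = n * b bytes */

    printf("associativity = %u-way, total size = %u bytes\n", assoc, total_size);
    return 0;
}
```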
10. Cache Organization
- (1) How do you know if something is in the cache?
- (2) If it is in the cache, how do you find it?
- The answers to (1) and (2) depend on the type or organization of the cache
- In a direct mapped cache, each memory address is associated with one possible block within the cache
- Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
11. Simplest Cache: Direct Mapped
[Figure: 4-block direct mapped cache next to a 16-block main memory. Memory block addresses 0-15 map to cache indices 0-3; the highlighted block addresses 0010, 0110, 1010, and 1110 all map to cache index 10 (2), with the remaining upper address bits kept as the tag.]
- index determines block in cache
- index = (address) mod (# blocks)
- If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address, n = log2(# blocks); see the sketch below
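A minimal C sketch of this mapping, using the four highlighted block addresses from the figure (the constants are just this slide's 4-block example):

```c
#include <stdio.h>

#define NUM_BLOCKS 4u                  /* must be a power of 2 */
#define INDEX_BITS 2u                  /* log2(NUM_BLOCKS)     */

int main(void) {
    /* Memory block addresses highlighted in the figure. */
    unsigned block_addrs[] = { 2u, 6u, 10u, 14u };   /* 0010, 0110, 1010, 1110 */

    for (int i = 0; i < 4; i++) {
        unsigned addr  = block_addrs[i];
        unsigned index = addr % NUM_BLOCKS;    /* same as addr & (NUM_BLOCKS-1) */
        unsigned tag   = addr >> INDEX_BITS;   /* remaining upper bits          */
        printf("block addr %2u -> index %u, tag %u\n", addr, index, tag);
    }
    return 0;                                  /* all four map to index 2 */
}
```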
12. Issues with Direct-Mapped
- If block size > 1, the rightmost bits of the index are really the offset within the indexed block
13. 64KB Cache with 4-word (16-byte) Blocks
[Figure: address (showing bit positions 31...16, 15...4, 3-2, 1-0) split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a 2-bit byte offset. The index selects one of 4K entries, each holding a valid bit (V), a 16-bit tag, and 128 bits of data (four 32-bit words). The tag comparison produces Hit, and a mux driven by the block offset selects the 32-bit data word.]
14. Direct-Mapped Cache, Contd.
- The direct mapped cache is simple to design and its access time is fast (Why?)
- Good for L1 (on-chip) cache
- Problem: conflict misses, so low hit ratio
  - Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
- In a direct mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses
15. Another Extreme: Fully Associative
- Fully Associative Cache (8-word block)
  - Omit the cache index: place an item in any block!
  - Compare all cache tags in parallel
[Figure: the 32-bit address is split into a 27-bit cache tag (bits 31-5) and a byte offset (bits 4-0). The address tag is compared against the cache tag of every valid block in parallel; each block holds bytes B 0 through B 31 of cache data.]
- By definition, conflict misses = 0 for a fully associative cache
16. Fully Associative Cache
- Must search all tags in the cache, as an item can be in any cache block
- The search for the tag must be done by hardware in parallel (other searches are too slow)
- But the necessary parallel comparator hardware is very expensive
- Therefore, fully associative placement is practical only for a very small cache
17. Compromise: N-way Set Associative Cache
- N-way set associative: N cache blocks for each cache index
  - Like having N direct mapped caches operating in parallel
  - Select the one that gets a hit
- Example: 2-way set associative cache (a lookup sketch follows)
  - The cache index selects a set of 2 blocks from the cache
  - The 2 tags in the set are compared in parallel
  - Data is selected based on the tag result (which matched the address)
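A minimal C sketch of this lookup; the geometry, struct layout, and function names are illustrative assumptions, and the parallel tag compare of the hardware is modeled as a small loop:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256u        /* illustrative geometry */
#define WAYS       2u

struct line { bool valid; uint32_t tag; uint8_t data[16]; };
static struct line cache[NUM_SETS][WAYS];

/* Look up a block: the index selects a set, then the tags of all ways
 * in that set are compared (in hardware this happens in parallel). */
static bool lookup(uint32_t block_addr, unsigned *hit_way) {
    uint32_t index = block_addr % NUM_SETS;
    uint32_t tag   = block_addr / NUM_SETS;
    for (unsigned w = 0; w < WAYS; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *hit_way = w;    /* the matching way drives the data mux */
            return true;
        }
    }
    return false;            /* miss: no tag in the set matched */
}

int main(void) {
    uint32_t blk = 0x1234;   /* arbitrary block address */
    cache[blk % NUM_SETS][1] = (struct line){ true, blk / NUM_SETS, {0} };
    unsigned way;
    return lookup(blk, &way) ? 0 : 1;   /* hits in way 1 */
}
```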
18. Example: 2-way Set Associative Cache
[Figure: the address is split into tag, index, and offset fields. The index selects one set, i.e. Block 0 of each of the two ways, each with a valid bit, cache tag, and cache data. Both cache tags are compared with the address tag in parallel; a mux selects the cache block of the matching way and Hit is asserted.]
19. Set Associative Cache, Contd.
- Direct mapped and fully associative caches can be seen as just variations of the set associative block placement strategy
- Direct mapped = 1-way set associative cache
- Fully associative = n-way set associativity for a cache with exactly n blocks
20. Alpha 21264 Cache Organization
21. Block Replacement Policy
- N-way set associative or fully associative caches have a choice of where to place a block (and which block to replace)
- Of course, if there is an invalid block, use it
- Whenever you get a cache hit, record the cache block that was touched
- When you need to evict a cache block, choose the one which hasn't been touched recently: Least Recently Used (LRU), sketched below
  - Past is prologue: history suggests it is the least likely of the choices to be used soon
  - Flip side of temporal locality
22. Review: Four Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (Block placement)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification)
  - Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, LRU
- Q4: What happens on a write? (Write strategy)
  - Write back or write through (with write buffer)
23. Write Policy: Write-Through vs Write-Back
- Write-through: all writes update the cache and the underlying memory/cache
  - Can always discard cached data; the most up-to-date data is in memory
  - Cache control bit: only a valid bit
- Write-back: all writes simply update the cache
  - Can't just discard cached data; may have to write it back to memory
  - "Flagged" write-back
  - Cache control bits: both valid and dirty bits
- Other advantages
  - Write-through
    - Memory (or other processors) always has the latest data
    - Simpler management of the cache
  - Write-back
    - Needs much lower bus bandwidth due to infrequent accesses
    - Better tolerance to long-latency memory?
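A minimal sketch contrasting the two policies on a write hit; the line layout and the memory_write stub are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

struct line { bool valid; bool dirty; uint32_t tag; uint8_t data[16]; };

/* Stand-in for the level below (L2 or main memory). */
static uint8_t backing_store[1 << 16];
static void memory_write(uint32_t addr, uint8_t byte) { backing_store[addr & 0xFFFF] = byte; }

/* Write hit, write-through: update the cache AND the level below. */
static void write_hit_through(struct line *l, uint32_t addr, unsigned off, uint8_t byte) {
    l->data[off] = byte;
    memory_write(addr, byte);        /* memory always has the latest data */
}

/* Write hit, write-back: update only the cache and mark the line dirty;
 * the block is written back later, when it is evicted. */
static void write_hit_back(struct line *l, unsigned off, uint8_t byte) {
    l->data[off] = byte;
    l->dirty = true;                 /* must be flushed before discarding */
}

int main(void) {
    struct line l = { true, false, 0, {0} };
    write_hit_through(&l, 0x100, 0, 0xAB);   /* cache and memory both updated   */
    write_hit_back(&l, 1, 0xCD);             /* only the cache; line now dirty  */
    return l.dirty ? 0 : 1;
}
```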
24. Write Through: Write Allocate vs Non-Allocate
- Write allocate: allocate a new cache line in the cache
  - Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  - Alternative: per-word valid bits
- Write non-allocate (or "write-around")
  - Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!
25. Write Buffers
- Write buffers (for write-through); a small sketch follows
  - Buffer words to be written to the L2 cache/memory, along with their addresses
  - 2 to 4 entries deep
  - All read misses are checked against pending writes for dependencies (associatively)
  - Allows reads to proceed ahead of writes
  - Can coalesce writes to the same address
- Write-back buffers
  - Sit between a write-back cache and L2 or MM
  - Algorithm: move the dirty block to the write-back buffer, read the new block, then write the dirty block to L2 or MM
  - Can be associated with a victim cache (later)
[Diagram: the L1 cache sits between the CPU and L2, with the write buffer between L1 and L2.]
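A minimal sketch of a write-through write buffer with the behaviors listed above, i.e. coalescing writes to the same address and checking read misses against pending entries (the entry format and depth are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                     /* typical depth: 2 to 4 entries */

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry wb[WB_ENTRIES];

/* Buffer a write; coalesce if an entry for the same address is pending. */
static bool wb_insert(uint32_t addr, uint32_t data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) { wb[i].data = data; return true; }
    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wb[i].valid) { wb[i] = (struct wb_entry){ true, addr, data }; return true; }
    return false;                        /* buffer full: the write must stall */
}

/* On a read miss, check the pending writes so the read sees fresh data. */
static bool wb_lookup(uint32_t addr, uint32_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) { *data = wb[i].data; return true; }
    return false;
}

int main(void) {
    uint32_t v;
    wb_insert(0x40, 1); wb_insert(0x40, 2);           /* second write coalesces */
    return (wb_lookup(0x40, &v) && v == 2) ? 0 : 1;
}
```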
26. Write Merge
27. Review: Cache Performance
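For reference, these are the standard formulas assumed by the worked examples on the following slides:

```latex
\mathrm{AMAT} = \mathrm{Hit\ time} + \mathrm{Miss\ rate} \times \mathrm{Miss\ penalty}
```

```latex
\mathrm{CPU\ time} = \mathrm{IC} \times \left( \mathrm{CPI}_{\mathrm{exec}}
  + \frac{\mathrm{Memory\ accesses}}{\mathrm{Instruction}} \times \mathrm{Miss\ rate}
    \times \mathrm{Miss\ penalty} \right) \times \mathrm{Clock\ cycle\ time}
```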
28. Impact on Performance
- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations (data) get a 50-cycle miss penalty
- Suppose that 1% of instructions get the same miss penalty
- CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + 0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)
        + 1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- 58% (1.5/2.6) of the time the processor is stalled waiting for data memory!
- Total number of memory accesses = 1 per instruction + 0.3 for data = 1.3. Thus
  AMAT = (1/1.3)×(1 + 0.01×50) + (0.3/1.3)×(1 + 0.1×50) = 2.54 cycles, instead of one cycle.
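The arithmetic above can be checked with a short C sketch (the parameters are taken from this slide):

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_ops     = 0.30;    /* loads/stores per instruction */
    double data_mr      = 0.10;    /* data miss rate               */
    double inst_mr      = 0.01;    /* instruction miss rate        */
    double miss_penalty = 50.0;    /* cycles                       */

    double cpi = ideal_cpi
               + data_ops * data_mr * miss_penalty     /* 1.5 data stall cycles */
               + 1.0      * inst_mr * miss_penalty;    /* 0.5 inst stall cycles */

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data. */
    double amat = (1.0 / 1.3) * (1.0 + inst_mr * miss_penalty)
                + (0.3 / 1.3) * (1.0 + data_mr * miss_penalty);

    printf("CPI = %.1f, AMAT = %.2f cycles\n", cpi, amat);   /* 3.1 and 2.54 */
    return 0;
}
```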
29. Impact of Change in cc (Clock Cycle Time)
- Suppose a processor has the following parameters:
  - CPI = 2 (w/o memory stalls)
  - memory accesses per instruction = 1.5
- Compare AMAT and CPU time for a direct mapped cache and a 2-way set associative cache, assuming:

| Cache             | cc             | Hit cycles | Miss penalty | Miss rate |
|-------------------|----------------|------------|--------------|-----------|
| Direct mapped     | 1 ns           | 1          | 75 ns        | 1.4%      |
| 2-way associative | 1.25 ns (why?) | 1          | 75 ns        | 1.0%      |

- AMAT_d = hit time + miss rate × miss penalty = 1×1 + 0.014×75 = 2.05 ns
- AMAT_2 = 1×1.25 + 0.010×75 = 2.0 ns < 2.05 ns
- CPU time_d = (CPI×cc + memory stall time per instruction)×IC = (2×1 + 1.5×0.014×75)×IC = 3.575×IC
- CPU time_2 = (2×1.25 + 1.5×0.010×75)×IC = 3.625×IC > CPU time_d!
- A change in cc affects all instructions, while a reduction in miss rate benefits only memory instructions.
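A short C sketch of the comparison, using the parameters in the table above (the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    double accesses = 1.5;                 /* memory accesses per instruction */
    double cpi_exec = 2.0;                 /* CPI without memory stalls       */
    double penalty  = 75.0;                /* ns                              */

    /* Direct mapped: cc = 1 ns, miss rate = 1.4%. */
    double amat_dm = 1.0 * 1.0  + 0.014 * penalty;                 /* 2.05 ns             */
    double cpu_dm  = cpi_exec * 1.0  + accesses * 0.014 * penalty; /* 3.575 ns per instr  */

    /* 2-way: cc = 1.25 ns (slower hit), miss rate = 1.0%. */
    double amat_2w = 1.0 * 1.25 + 0.010 * penalty;                 /* 2.0 ns              */
    double cpu_2w  = cpi_exec * 1.25 + accesses * 0.010 * penalty; /* 3.625 ns per instr  */

    printf("AMAT: DM %.2f ns vs 2-way %.2f ns\n", amat_dm, amat_2w);
    printf("CPU time per instr: DM %.3f ns vs 2-way %.3f ns\n", cpu_dm, cpu_2w);
    return 0;
}
```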
30. Miss Penalty for Out-of-Order (OOO) Execution Processors
- In OOO processors, memory stall cycles are overlapped with the execution of other instructions. The miss penalty should not include this overlapped part.
- memory stall cycles per instruction = memory misses per instruction × (total miss penalty - overlapped miss penalty)
- For the previous example: suppose 30% of the 75 ns miss penalty can be overlapped; what are the AMAT and CPU time?
  - Assume a direct mapped cache and cc = 1.25 ns to handle out-of-order execution.
- AMAT_d = 1×1.25 + 0.014×(75×0.7) = 1.985 ns
- With 1.5 memory accesses per instruction,
  CPU time = (2×1.25 + 1.5×0.014×(75×0.7))×IC = 3.6025×IC < CPU time_2
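A similar sketch with the overlapped portion of the miss penalty removed (30% of 75 ns), using the direct mapped parameters above:

```c
#include <stdio.h>

int main(void) {
    double accesses  = 1.5;      /* memory accesses per instruction            */
    double cpi_exec  = 2.0;
    double cc        = 1.25;     /* ns; clock cycle lengthened to handle OOO   */
    double miss_rate = 0.014;    /* direct mapped cache                        */
    double penalty   = 75.0;     /* ns, total miss penalty                     */
    double exposed   = penalty * (1.0 - 0.30);   /* 30% is overlapped          */

    double amat = 1.0 * cc + miss_rate * exposed;                  /* 1.985 ns  */
    double cpu  = cpi_exec * cc + accesses * miss_rate * exposed;  /* 3.6025 ns */

    printf("AMAT = %.3f ns, CPU time per instruction = %.4f ns\n", amat, cpu);
    return 0;
}
```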
31. Lock-Up Free Cache Using MSHR (Miss Status Holding Register)
32. Avg. Memory Access Time vs. Miss Rate
- Associativity reduces miss rate, but increases hit time due to the increase in hardware complexity!
- Example: for an on-chip cache, assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct mapped cache.

| Cache size (KB) | 1-way | 2-way | 4-way | 8-way |
|-----------------|-------|-------|-------|-------|
| 1               | 2.33  | 2.15  | 2.07  | 2.01  |
| 2               | 1.98  | 1.86  | 1.76  | 1.68  |
| 4               | 1.72  | 1.67  | 1.61  | 1.53  |
| 8               | 1.46  | 1.48  | 1.47  | 1.43  |
| 16              | 1.29  | 1.32  | 1.32  | 1.32  |
| 32              | 1.20  | 1.24  | 1.25  | 1.27  |
| 64              | 1.14  | 1.20  | 1.21  | 1.23  |
| 128             | 1.10  | 1.17  | 1.18  | 1.20  |

- (In the original slide, red entries mark the cases where A.M.A.T. is not improved by more associativity.)
33. Unified vs Split Caches
- Unified vs separate I&D (Harvard)
- Example:
  - 16 KB I + 16 KB D: inst miss rate = 0.64%, data miss rate = 6.47%
  - 32 KB unified: aggregate miss rate = 1.99%
- Which is better (ignore the L2 cache)?
  - Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
  - hit time = 1, miss time = 50
  - Note that a data hit incurs 1 extra stall for the unified cache (only one port)
- AMAT_Harvard = 75%×(1 + 0.64%×50) + 25%×(1 + 6.47%×50) = 2.05
- AMAT_Unified = 75%×(1 + 1.99%×50) + 25%×(1 + 1 + 1.99%×50) = 2.24
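A short C sketch of the weighted AMAT calculation above (the fractions and miss rates come from this slide):

```c
#include <stdio.h>

int main(void) {
    double f_inst = 0.75, f_data = 0.25;   /* fraction of accesses: 75% / 25% */
    double penalty = 50.0;                 /* miss penalty, cycles            */

    /* Split (Harvard) 16 KB + 16 KB caches: separate I and D miss rates. */
    double amat_split = f_inst * (1.0 + 0.0064 * penalty)
                      + f_data * (1.0 + 0.0647 * penalty);          /* about 2.05 */

    /* Unified 32 KB cache: single miss rate, plus one extra stall cycle on
     * data accesses because the single port is busy with the fetch. */
    double amat_unified = f_inst * (1.0 + 0.0199 * penalty)
                        + f_data * (1.0 + 1.0 + 0.0199 * penalty);  /* about 2.24 */

    printf("AMAT split = %.3f, unified = %.3f\n", amat_split, amat_unified);
    return 0;
}
```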