Title: CS 161 Ch 7: Memory Hierarchy, Lecture 14

1. CS 161 Ch 7: Memory Hierarchy, Lecture 14
- Instructor: L.N. Bhuyan
- www.cs.ucr.edu/bhuyan
2. Recap: Machine Organization
- The 5 classic components of any computer:
  - Processor (CPU) (active)
    - Control (brain)
    - Datapath (brawn)
  - Memory (passive): where programs and data live when running
  - Devices
    - Input
    - Output
- Every component of a computer belongs to one of these five categories
3. Memory Trends
- Users want large and fast memories! (2004 figures)
- SRAM access times are 0.5-5 ns, at a cost of $4,000 to $10,000 per GB
- DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
- Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
4. Memory Latency Problem
- Processor-DRAM memory performance gap: the motivation for memory hierarchy
- µProc: 60%/yr. (2X/1.5 yr)
- DRAM: 5%/yr. (2X/15 yrs)
- Processor-Memory performance gap grows 50%/year
[Chart: performance (log scale, 1 to 1000) vs. time, 1980-2000; the CPU and DRAM curves steadily diverge]
5. The Goal: Illusion of Large, Fast, Cheap Memory
- Fact: large memories are slow; fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)?
- Answer: a hierarchy of levels
  - Use smaller, faster memory technologies close to the processor
  - Fast access time in the highest level of the hierarchy
  - Cheap, slow memory furthest from the processor
- The aim of memory hierarchy design is an access time close to that of the highest level and a size equal to that of the lowest level
6. Recap: Memory Hierarchy Pyramid
[Pyramid diagram: Processor (CPU) at the top, connected by the transfer datapath (bus) to Level 1 down through Level n. Decreasing distance from the CPU means decreasing access time (memory latency); increasing distance from the CPU means decreasing cost/MB. The width of each level shows the size of memory at that level.]
7. Why Hierarchy Works: Natural Locality
- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time
[Plot: probability of reference (0 to 1) vs. memory address (0 to 2^n - 1), with references clustered in a small region of the address space]
- Temporal Locality (locality in time): recently accessed data tend to be referenced again soon
- Spatial Locality (locality in space): items near a referenced item tend to be referenced soon
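To make both kinds of locality concrete, here is a minimal C sketch (not from the slides; the array and loop are invented for illustration): the scalar sum is touched on every iteration (temporal locality), while a[0], a[1], ... are adjacent in memory and accessed in order (spatial locality).

```c
/* Minimal locality demo (illustrative; not from the slides). */
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    int sum = 0;                    /* reused every iteration: temporal locality  */

    for (int i = 0; i < N; i++)     /* sequential walk over a[]: spatial locality */
        a[i] = i;
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```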
8. Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level. Hit Rate: the fraction of memory accesses found in the upper level
- Miss: data must be retrieved from a block in the lower level. Miss Rate = 1 - (Hit Rate)
- Hit Time: time to access the upper level, which consists of the time to determine hit/miss plus the memory access time
- Miss Penalty: time to replace a block in the upper level plus the time to deliver the block to the processor
- Note: Hit Time << Miss Penalty
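These terms combine into the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty (the formula is standard, though not stated on this slide). A minimal sketch with invented numbers:

```c
/* AMAT = hit time + miss rate * miss penalty; all numbers invented. */
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles to access the upper level        */
    double miss_rate    = 0.02;   /* fraction of accesses missing that level */
    double miss_penalty = 50.0;   /* cycles to fetch from the lower level    */

    printf("AMAT = %.2f cycles\n", hit_time + miss_rate * miss_penalty);
    return 0;
}
```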
9. Current Memory Hierarchy
[Diagram: processor (control, datapath, registers) -> L1 cache -> L2 cache -> main memory -> secondary memory]

Level        Regs     L1 cache  L2 cache  Main memory  Secondary memory
Speed (ns)   1        2         6         100          10,000,000
Size (MB)    0.0005   0.1       1-4       100-1000     100,000
Cost ($/MB)  --       100       30        1            0.05
Technology   Regs     SRAM      SRAM      DRAM         Disk
10. Memory Hierarchy Technology
- Random Access: access time is the same for all locations (a hardware decoder is used)
- Sequential Access: very slow; data are accessed sequentially, access time is location dependent; considered I/O (examples: disks and tapes)
- DRAM: Dynamic Random Access Memory
  - High density, low power, cheap, slow
  - Dynamic: needs to be refreshed regularly
- SRAM: Static Random Access Memory
  - Low density, high power, expensive, fast
  - Static: content lasts forever (until power is lost)
11. Memories: Review
- SRAM
  - Value is stored on a pair of inverting gates
  - Very fast, but takes up more space than DRAM (4 to 6 transistors per bit)
- DRAM
  - Value is stored as a charge on a capacitor (must be refreshed)
  - Very small, but slower than SRAM (by a factor of 5 to 10)
12. How Is the Hierarchy Managed?
- Registers <-> Memory
  - By the compiler (or assembly language programmer)
- Cache <-> Main Memory
  - By hardware
- Main Memory <-> Disks
  - By a combination of hardware and the operating system (virtual memory, covered next)
  - By the programmer (files)
13. Measuring Cache Performance
CPU time = Execution cycles x clock cycle time
With cache misses:
CPU time = (Execution cycles + Memory stall cycles) x clock cycle time
Read-stall cycles = reads x read miss rate x read miss penalty
Write-stall cycles = writes x write miss rate x write miss penalty
Memory-stall cycles = read stalls + write stalls
                    = memory accesses x miss rate x miss penalty
                    = instructions x misses/instruction x miss penalty
14. Example
- Q: Cache miss penalty is 50 cycles, and all instructions take 2.0 cycles without memory stalls. Assume a cache miss rate of 2% and 1.33 memory references per instruction (why 1.33? one instruction fetch plus about 0.33 data references). What is the impact of the cache?
- Ans: CPU time = IC x (CPI + memory stall cycles per instruction) x cycle time t
- Performance including cache misses:
  CPU time = IC x (2.0 + (1.33 x 0.02 x 50)) x cycle time = IC x 3.33 x t
- For a perfect cache that never misses: CPU time = IC x 2.0 x t
- Hence, the memory hierarchy stretches CPU time by a factor of 1.67
- But without a memory hierarchy, the CPI would increase to 2.0 + 50 x 1.33 = 68.5, a factor of over 30 times longer (the arithmetic is reproduced in code below)
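The slide's arithmetic as a small C program (values are the ones from the example):

```c
/* The example's arithmetic in C (values taken from the slide). */
#include <stdio.h>

int main(void) {
    double base_cpi      = 2.0;    /* CPI without memory stalls         */
    double refs_per_inst = 1.33;   /* memory references per instruction */
    double miss_rate     = 0.02;
    double miss_penalty  = 50.0;   /* cycles */

    double cpi_cache   = base_cpi + refs_per_inst * miss_rate * miss_penalty;
    double cpi_no_hier = base_cpi + refs_per_inst * miss_penalty; /* every reference stalls */

    printf("CPI with cache        = %.2f (stretch %.2fx)\n",
           cpi_cache, cpi_cache / base_cpi);
    printf("CPI without hierarchy = %.1f\n", cpi_no_hier);
    return 0;
}
```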
15. Cache Organization
- (1) How do you know if something is in the cache?
- (2) If it is in the cache, how do you find it?
- The answers to (1) and (2) depend on the type, or organization, of the cache
- In a direct-mapped cache, each memory address is associated with exactly one possible block within the cache
- Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
16. Simplest Cache: Direct Mapped
[Diagram: a 16-block memory (block addresses 0-15; e.g. 0000two, 0100two, 1000two, 1100two) mapped onto a 4-block direct-mapped cache (cache indices 0-3)]
- Cache block 0 can be occupied by data from memory blocks 0, 4, 8, 12
- Cache block 1 can be occupied by data from memory blocks 1, 5, 9, 13
17. Simplest Cache: Direct Mapped
[Diagram: the same 4-block direct-mapped cache; each memory block address (e.g. 0010, 0110, 1010, 1110) splits into a tag and an index]
- The index determines the block in the cache
- index = (block address) mod (# blocks)
- If the number of cache blocks is a power of 2, the cache index is just the lower n bits of the memory address, where n = log2(# blocks), as the sketch below shows
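A small C sketch of the index computation (the block address 13 is an arbitrary choice): for a power-of-2 block count, the mod and the bit mask give the same answer.

```c
/* Index computation for a power-of-2 number of blocks. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t n_blocks   = 4;                           /* the slide's 4-block cache       */
    uint32_t block_addr = 13;                          /* memory block address            */

    uint32_t index_mod  = block_addr % n_blocks;       /* general formula                 */
    uint32_t index_mask = block_addr & (n_blocks - 1); /* lower n bits, n = log2(#blocks) */

    printf("index = %u (mod) = %u (mask)\n", index_mod, index_mask);
    return 0;
}
```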
18. Simplest Cache: Direct Mapped with Tag
[Diagram: the same cache, with each cache entry now holding a tag field and a data field; e.g. the entry at cache index 1 holds tag 11]
- The tag determines which memory block occupies the cache block
- tag = left-hand (upper) bits of the address
- Hit: the cache block's tag field equals the tag bits of the address
- Miss: the tag field differs from the tag bits of the address (see the sketch below)
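Continuing the index sketch, the tag is just the bits left over after the index is removed; a hit means the stored tag matches. The stored tag value here is assumed for illustration.

```c
/* Tag extraction and hit test for the 4-block example. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t block_addr = 13;               /* binary 1101: tag = 11, index = 01 */
    uint32_t index      = block_addr & 0x3; /* lower 2 bits                      */
    uint32_t tag        = block_addr >> 2;  /* remaining upper bits              */

    uint32_t stored_tag = 0x3;              /* suppose the entry at this index holds tag 11 */

    printf("index = %u, tag = %u -> %s\n", index, tag,
           tag == stored_tag ? "hit" : "miss");
    return 0;
}
```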
19. Finding the Item within a Block
- In reality, a cache block consists of a number of bytes/words (e.g., 32 or 64 bytes) to (1) increase cache hits due to the locality property and (2) reduce the cache miss time
- Mapping: memory block i maps to cache block frame i mod x, where x is the number of blocks in the cache
  - Called congruent mapping
- Given the address of an item, the index tells which block of the cache to look in
- Then how do we find the requested item within the cache block?
- Or, equivalently: what is the byte offset of the item within the cache block?
20. Issues with Direct-Mapped
- If the block size is > 1, the rightmost bits of the index are really the offset within the indexed block, as the sketch below shows
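A C sketch of the full address split into tag, index, and byte offset; the block size and block count here are illustrative choices, not tied to any particular machine.

```c
/* Splitting a 32-bit byte address into tag / index / byte-offset fields. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 16u    /* 4 words per block -> 4 offset bits */
#define N_BLOCKS    4096u  /* 4K blocks         -> 12 index bits */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & (BLOCK_BYTES - 1);        /* byte offset within the block */
    uint32_t index  = (addr / BLOCK_BYTES) % N_BLOCKS; /* which cache block            */
    uint32_t tag    = addr / (BLOCK_BYTES * N_BLOCKS); /* remaining upper bits         */

    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}
```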
21. Accessing Data in a Direct-Mapped Cache
- Three types of events:
  - Cache miss: nothing is in the cache at the appropriate block, so fetch from memory
  - Cache hit: the cache block is valid and contains the proper address, so read the desired word
  - Cache miss, block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory
- Cache access procedure (sketched below): (1) use the index bits to select the cache block; (2) if the valid bit is 1, compare the tag bits of the address with the cache block's tag bits; (3) if they match, use the offset to read out the word/byte
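The three-step procedure as an illustrative C sketch; the cache geometry and the structure layout are assumptions, not the DecStation's.

```c
/* Direct-mapped read following the three-step procedure above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N_BLOCKS    1024u
#define BLOCK_WORDS 4u

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint32_t data[BLOCK_WORDS];
};

static struct cache_line cache[N_BLOCKS];

/* Returns true on a hit and stores the word; a real miss would fetch the
   block from the next level and retry. */
bool cache_read(uint32_t addr, uint32_t *word) {
    uint32_t word_addr = addr >> 2;                            /* strip byte offset */
    uint32_t offset    = word_addr % BLOCK_WORDS;              /* word within block */
    uint32_t index     = (word_addr / BLOCK_WORDS) % N_BLOCKS; /* (1) select block  */
    uint32_t tag       = word_addr / (BLOCK_WORDS * N_BLOCKS);

    if (cache[index].valid && cache[index].tag == tag) {       /* (2) valid + tag match */
        *word = cache[index].data[offset];                     /* (3) read out the word */
        return true;
    }
    return false;
}

int main(void) {
    uint32_t w;
    printf("hit? %d\n", cache_read(0x1000, &w)); /* cold cache: prints 0 (miss) */
    return 0;
}
```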
22. Data Valid, Tag OK: Read Offset and Return Word d
- Example address: 000000000000000000 0000000001 1100 (tag = 0...0, index = 0000000001 = 1, byte offset = 1100)
[Diagram: the valid cache entry at index 1 holds words a, b, c, d; byte offset 1100 (word offset 3) selects word d]
23. An Example Cache: DecStation 3100
- Commercial workstation, 1985
- MIPS R2000 processor (similar to the pipelined machine of Chapter 6)
- Separate instruction and data caches
  - Direct mapped
  - 64K bytes (16K words) each
  - Block size: 1 word (low spatial locality)
- Solution: increase the block size (see the 2nd example, slide 25)
24. DecStation 3100 Cache
[Diagram: 32-bit address (bits 31-16: 16-bit tag; bits 15-2: 14-bit index; bits 1-0: byte offset). The cache holds 16K entries, each with a valid bit, a 16-bit tag, and 32 bits of data; a tag comparison produces the hit signal and the 32-bit data output.]
If there is a miss, the cache controller stalls the processor and loads the data from main memory.
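A quick check of the field widths (a derivation, not on the slide): 64 KB with 4-byte blocks gives 16K blocks, so the index needs log2(16K) = 14 bits; the byte offset takes 2 bits; the tag gets the remaining 32 - 14 - 2 = 16 bits, matching the figure.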
25. 64KB Cache with 4-Word (16-Byte) Blocks
[Diagram: 32-bit address (bits 31-16: 16-bit tag; bits 15-4: 12-bit index; bits 3-2: block offset; bits 1-0: byte offset). The cache holds 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data (four 32-bit words); a tag comparison produces the hit signal, and a mux uses the block offset to select one 32-bit word.]
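Again checking the widths (a derivation, not on the slide): 64 KB with 16-byte blocks gives 4K blocks, so the index needs log2(4K) = 12 bits; the block offset is log2(4 words) = 2 bits and the byte offset 2 bits; the tag is the remaining 32 - 12 - 2 - 2 = 16 bits, as in the figure.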
26. Miss Rates: 1-Word vs. 4-Word Blocks (cache similar to DecStation 3100)

Block size  Program  I-cache miss rate  D-cache miss rate  Combined miss rate
1-word      gcc      6.1%               2.1%               5.4%
1-word      spice    1.2%               1.3%               1.2%
4-word      gcc      2.0%               1.7%               1.9%
4-word      spice    0.3%               0.6%               0.4%
27. Miss Rate versus Block Size
[Figure 7.12: miss rate (0-40%) vs. block size (4, 16, 64, 256 bytes) for direct-mapped caches of total size 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]
28. Extreme Example: 1-Block Cache
- Suppose we choose block size = cache size. Then there is only one block in the cache
- Temporal locality says that if an item is accessed, it is likely to be accessed again soon
- But it is unlikely to be accessed again immediately!
- The next access is then likely to be a miss
  - We continually load data into the cache but are forced to discard it before it is used again
- The worst nightmare of a cache designer: the Ping-Pong Effect
29. Block Size and Miss Penalty
- As block size increases, the cost of a miss also increases
- Miss penalty: the time to fetch the block from the next lower level of the hierarchy and load it into the cache
- With very large blocks, the increase in miss penalty overwhelms the decrease in miss rate
- Average access time can be minimized if the memory system is designed right, as the sketch below explores
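A toy C sketch of this tradeoff (every number is invented for illustration): the miss penalty grows with the block size while the miss rate first falls and then rises, so the average access time has a minimum at an intermediate block size.

```c
/* Toy model of the block-size tradeoff; all numbers are invented. */
#include <stdio.h>

int main(void) {
    int    block_bytes[] = {4, 16, 64, 256};
    double miss_rate[]   = {0.12, 0.05, 0.04, 0.06}; /* falls, then rises as blocks
                                                        get too big for the cache  */
    double hit_time      = 1.0;                      /* cycles */

    for (int i = 0; i < 4; i++) {
        /* Larger blocks cost more to transfer from the next level. */
        double miss_penalty = 10.0 + block_bytes[i] / 4.0; /* latency + transfer cycles */
        double amat = hit_time + miss_rate[i] * miss_penalty;
        printf("block %3d B: AMAT = %.2f cycles\n", block_bytes[i], amat);
    }
    return 0;
}
```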
30. Block Size Tradeoff
[Two plots vs. block size: (1) miss rate first falls, exploiting spatial locality, then rises as fewer blocks compromise temporal locality; (2) average access time has a minimum, since the increased miss penalty and rising miss rate eventually outweigh the initial gains]