Title: Memory Hierarchy
1. Memory Hierarchy
2. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
3. Memory
- Memory is required for storing
- Programs
- Data
- Characteristics
- Access mode
- Sequential vs. random access (RAM)
- Alterability
- Read-only memory vs. read-write memory
- Access time
- Price
- Dollars / byte
4. Memories Review
- SRAM
- Value is stored on a pair of inverting gates
- The value can be kept indefinitely as long as power is applied
- Very fast, but takes up more space than DRAM (4 to 6 transistors)
- DRAM
- Value is stored as a charge on a capacitor (must be refreshed)
- Very small, but slower than SRAM (by a factor of 5 to 10)
5. SRAM vs. DRAM
- SRAM
- Fast switching due to the low impedance of the transistors
- Fast access: 0.5 - 5 ns
- High power
- Large area: 6 transistors / cell
- 2 lines: bit line and negated bit line
- High cost: $4000 to $10000 per GB (2004)
- DRAM
- Slow reading because of high resistance and high capacitance
- Slow access: 50 - 70 ns
- Low power
- Small area: 1 transistor + 1 capacitor
- Vertically built cells
- Low cost: $100 to $200 per GB
6. Processor-DRAM Memory Gap (Latency)
[Figure: performance (log scale, 1 to 1000) vs. time, 1980 - 2000. Processor ("Moore's Law"): 60%/yr. (2x / 1.5 yrs.); DRAM ("Less' Law?"): 9%/yr. (2x / 10 yrs.). The processor-memory performance gap grows at about 50% per year.]
7. Memory Hierarchy
- Users want large and fast memories!
- Conflicting goals
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
8. Principle of Locality
- Memory locations are not accessed with uniform frequency. Certain locations are accessed much more often than others.
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: nearby items will tend to be referenced soon.
- Why does code have locality?
9. Hierarchical Access to the Data
- Our initial focus: two levels (upper, lower)
- Cache: fast storage that takes advantage of locality of access
- Block: minimum unit of data
- Hit: the data requested is in the cache
- Miss: the data requested is not in the cache
- When an item is referenced:
- The main memory block number is extracted from the address
- The block number is looked up in the cache directory
- If we have a hit, the data is extracted from the cache
- If we have a miss, a copy of the block is transferred from main memory to the cache
10. Hit / Miss
- Hit: the data is directly available in the cache - no penalty
- Hit time: time to retrieve a data item on a hit
- Miss: the data is on a lower level - penalty from 1 to 30 cycles!
- Miss time: time to retrieve a data item on a miss
- Miss time = hit time + miss penalty
- Hit ratio: the ratio of hits to the total number of memory accesses at a particular memory level
- Miss ratio = 1 - hit ratio
11. Access
- Questions
- How do we know if a data item is in the cache?
- If it is, how do we find it?
- Simple approach: direct mapping
- For now, we assume the block size is a single word
- For each item of data at the lower level, there is exactly one location in the cache where it might be
- i.e., lots of items at the lower level share locations in the upper level
- Mapping: cache index = (block address) mod (number of blocks in the cache), as sketched below
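A minimal sketch of this mapping in Python (the function name and the 8-block example are mine, not from the slides):

```python
def direct_map(block_address: int, num_cache_blocks: int) -> int:
    """Direct mapping: each memory block has exactly one cache slot."""
    return block_address % num_cache_blocks

# Example: an 8-block cache; memory blocks 5, 13, 21 all map to index 5.
for block in (5, 13, 21):
    print(block, "->", direct_map(block, 8))
```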
12. Direct Mapped Cache Example
- Mapping: the address is taken modulo the number of blocks in the cache
13. Access Example
- Memory with 32 locations
- Cache with 8 locations
- The cache address is identical to the lower 3 bits of the memory address
- A tag at the cache memory location holds the high-order bits of the memory address
- Complete address = tag + cache address
- Valid bit: is the content still valid?
14. Example
- Initial state
- Miss at 10110
15. Example
- After miss at 11010
- After miss at 10000
16. Example
- After miss at 00011
- After miss at 10010 (the whole sequence is simulated in the sketch below)
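The example sequence can be replayed with a few lines of Python; an illustrative sketch assuming one-word blocks and the 8-entry direct-mapped cache of the slides:

```python
def split(addr: int):
    """5-bit address -> (tag = high 2 bits, index = low 3 bits)."""
    return addr >> 3, addr & 0b111

cache = [None] * 8  # each entry holds a tag, or None (valid bit clear)
for addr in (0b10110, 0b11010, 0b10000, 0b00011, 0b10010):
    tag, index = split(addr)
    hit = cache[index] == tag
    print(f"{addr:05b}: index {index}, tag {tag:02b} ->", "hit" if hit else "miss")
    cache[index] = tag  # on a miss, the block is loaded
```

All five references miss; the last one (10010) replaces the block 11010 that previously occupied index 2.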
17. A More Realistic Cache Example
- 32-bit data width
- 32-bit byte address (30-bit word address)
- Cache size: 1k blocks, block size: 1 word
- 10-bit cache index
- 20-bit tag
- 2-bit byte offset (need not be addressed if we use word alignment to 32-bit boundaries)
- Valid bit
18. Cache Access
- Compare the cache tag with address bits 31..12
- Check the valid bit
- Signal a hit to the CPU
- Transfer the data
- What kind of locality are we taking advantage of?
19. Cache Size
- Cache memory size
- 1024 x 32 bit = 32 kbit
- Tag memory size
- 1024 x 20 bit = 20 kbit
- Valid information
- 1024 x 1 bit = 1 kbit
- Efficiency
- 32 / 53 = 60.4% only!
- General size of a one-word direct-mapped cache with 32-bit data and 2^n blocks (checked in the sketch below):
- 2^n x (32 + (32 - n - 2) + 1) bits
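The size formula can be checked directly; a quick sketch for the 1k-block case (the helper name is mine):

```python
def cache_bits(n: int, word_bits: int = 32, addr_bits: int = 32) -> int:
    """Total bits of a one-word-block direct-mapped cache with 2^n blocks:
    data + tag + valid bit per block."""
    tag_bits = addr_bits - n - 2          # 2 bits of byte offset
    return 2**n * (word_bits + tag_bits + 1)

n = 10                                    # 1k blocks, as on the slide
total = cache_bits(n)
print(total // 1024, "kbit total")        # 53 kbit = 32 + 20 + 1 kbit
print(f"efficiency: {32 / 53:.1%}")       # data bits / total bits per block
```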
20. Spatial Locality
- So far we didn't take advantage of spatial locality
- Basic idea
- Whenever we have a miss, load a group of adjacent memory cells into the cache
- Having a larger block
- Direct-mapped block mapping: cache index = (block address) mod (number of cache blocks)
- Address components for a 64 KB cache with a 4-word block size (decoded in the sketch below):
- Tag: bits 31 - 16
- Index: bits 15 - 4
- Block offset: bits 3 - 2
- Byte offset: bits 1 - 0
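A sketch of how a 32-bit byte address splits into these fields for the 64 KB, 4-word-block configuration (the function is illustrative):

```python
def decode(addr: int):
    """Split a 32-bit byte address for a 64 KB direct-mapped cache
    with 4-word (16-byte) blocks."""
    byte_offset  = addr & 0b11           # bits 1..0
    block_offset = (addr >> 2) & 0b11    # bits 3..2: word within the block
    index        = (addr >> 4) & 0xFFF   # bits 15..4: 4096 blocks
    tag          = addr >> 16            # bits 31..16
    return tag, index, block_offset, byte_offset

print(decode(0x12345678))  # -> (0x1234, 0x567, 2, 0)
```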
21. 64 KB Cache with 4-Word Blocks
22. Example
- Consider a direct-mapped cache with 64 blocks and a block size of 16 bytes (4 words). What is the cache index of byte address 1200?
- Block address of byte address 1200:
- Word address: 1200 / 4 = 300
- Block address: 300 / 4 = 75
- Cache index:
- 75 mod 64 = 11 (see the sketch below)
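The same computation in code, as a short sketch:

```python
byte_address = 1200
block_bytes, num_blocks = 16, 64
block_address = byte_address // block_bytes   # 1200 // 16 = 75
print(block_address % num_blocks)             # cache index: 75 % 64 = 11
```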
23. Example
- Consider a series of address references given as word addresses: 22, 24, 25, 20
- Assuming a 16-word direct-mapped cache with one-word blocks, compute the cache index of each reference and label it as a hit or a miss
- What are the results assuming a 16-word direct-mapped cache with 4 four-word blocks? (Both configurations are simulated in the sketch below.)
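A small simulation answering both questions (the helper is an illustrative sketch, not a reference implementation):

```python
def simulate(word_addrs, num_blocks, block_words):
    """Direct-mapped cache simulation over a trace of word addresses."""
    cache = [None] * num_blocks                 # stored block tags
    for addr in word_addrs:
        block = addr // block_words
        index = block % num_blocks
        hit = cache[index] == block
        print(f"addr {addr}: index {index} ->", "hit" if hit else "miss")
        cache[index] = block

refs = [22, 24, 25, 20]
simulate(refs, num_blocks=16, block_words=1)   # 16 one-word blocks: 4 misses
simulate(refs, num_blocks=4, block_words=4)    # 4 four-word blocks: 2 misses, 2 hits
```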
24. Optimal Block Size
- Small block size
- High miss rate
- Short block loading time
- Ignores spatial locality
- Large block size
- Low miss rate
- Long time for reloading a block
- 1 miss requires n words to be loaded, n = block size
- Optimization strategies
- Early restart
- Requested word first
25. Miss Rate vs. Block Size
26. Harvard vs. von Neumann
- Split caches
- Higher miss rate due to their size
- No conflicts when accessing data and instructions at the same time
- Higher bandwidth due to separate data paths
- Combined caches
- Lower miss rate due to their size
- Possible stalls due to simultaneous accesses to data and instructions
- Lower bandwidth due to sharing of resources
27. Cache Reads
- Cache hit - just continue
- Access data from data memory (data cache)
- Access instructions from instruction memory (instruction cache)
- Miss
- Stall the complete processor
- Activate the memory controller
- Get the information from the next lower level of cache or from main memory
- Load the information into the cache
- Resume as before
28. Cache Writes
- Write hit: must maintain consistency between cache and memory
- Replace the data in cache and memory (write-through)
- Write the data only into the cache (write-back: update memory later)
- Only in the data cache
- Write miss
- Read the entire block into the cache, then write the word
- No need to read the block if the block size is one word. Why?
29. Write Through and Write Back
- Write through
- Update cache and memory at the same time
- Requires a buffer, because memory cannot accept data as fast as the processor can generate writes
- Write back
- Keep the data in the cache and write it back when the cache contents are being replaced
- Requires more effort in the cache replacement unit
30. Example Cache
- DECstation 3100, based on the MIPS R2000
- Cache
- Separate instruction and data caches
- Each 64 KB
- 16k words
- Block size: 1 word
31. Example
- Write through
- Use bits 15 - 2 as the cache index
- Write bits 31 - 16 into the tag
- Write the word into the cache memory
- Set the valid bit
- Write to memory
- Performance problem of writing to memory
- Cannot wait for the word to be written to memory
- Solution
- Introduce a write buffer (4 words) between processor and memory
32. Memory System Design
- Hypothetical access times for a DRAM
- 1 clock cycle to send the address
- 15 clock cycles for each DRAM access initiated
- 1 clock cycle to send a word
- Memory organization
- Block of four words
- Memory access: 1 word
- Miss penalty
- 1 + 4 x 15 + 4 x 1 = 65 cycles
- Bytes / cycle: 4 x 4 / 65 = 0.25
33. Memory Organization
34. Memory Organization
- Option 1
- Access path 1 word wide
- High penalty: 1 + 4 x 15 + 4 x 1 = 65 cycles
- Option 2
- Bus and memory as wide as the block
- Penalty drops to 1 + 15 + 1 = 17 cycles
- Option 3
- Bus width 1 word, but memory organized in banks
- Penalty drops to 1 + 15 + 4 x 1 = 20 cycles
- (All three options are compared in the sketch below.)
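A quick sketch comparing the three organizations under the timing assumptions of slide 32 (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word transferred):

```python
ADDR, ACCESS, XFER = 1, 15, 1    # cycles: send address, DRAM access, send word
WORDS = 4                        # words per block

one_word_wide = ADDR + WORDS * ACCESS + WORDS * XFER   # 65 cycles
block_wide    = ADDR + ACCESS + XFER                   # 17 cycles
interleaved   = ADDR + ACCESS + WORDS * XFER           # 20 cycles (banks overlap accesses)
print(one_word_wide, block_wide, interleaved)
print(f"bandwidth (1-word-wide): {4 * WORDS / one_word_wide:.2f} bytes/cycle")
```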
35. Summary
- Memory hierarchy
- Cache
- Direct-mapped
- Hit / miss
- Miss rate vs. block size
- Memory organization
36. Cache Performance
- The performance of a cache depends on many parameters
- Memory stall clock cycles
- All stall cycles caused by memory accesses
- Read stall clock cycles and write stall clock cycles
- Instruction cache stalls and data cache stalls
- CPU time = (CPU execution cycles + memory stall cycles) x cycle time
- Memory stall cycles = memory accesses x miss ratio x miss penalty
- (assuming the same miss penalty for read and write stalls)
- Two ways of improving performance
- Decreasing the miss ratio
- Decreasing the miss penalty
- What happens if we increase the block size?
37. Example
- A machine has a CPI of 2 without memory stalls
- Instruction cache miss rate: 2%
- Data cache miss rate: 4%; 36% of all instructions are memory accesses
- Miss penalty: 100 cycles
- How much faster would the machine run with a perfect cache that never missed?
38. Example
- Stall cycles
- Instruction miss cycles: I x 2% x 100 = 2.00 I
- Data miss cycles: I x 36% x 4% x 100 = 1.44 I
- CPI with memory stalls: 2 + 2.00 + 1.44 = 5.44
- Ratio of CPU execution times: 5.44 / 2 = 2.72 (verified in the sketch below)
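The arithmetic, spelled out as a sketch:

```python
cpi_base = 2.0
i_stall = 0.02 * 100             # instruction stall cycles per instruction = 2.00
d_stall = 0.36 * 0.04 * 100      # data stall cycles per instruction = 1.44
cpi = cpi_base + i_stall + d_stall
print(cpi, cpi / cpi_base)       # 5.44; 2.72x slower than with a perfect cache
```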
39. Acceleration of the CPU
- Assumption
- Currently CPI = 2, stall cycles / instruction = 3.44
- Improvement
- Clock rate constant
- CPI improved from 2 to 1
- System behaviour
- A system with a perfect cache would be 4.44 / 1 = 4.44 times faster
- Time spent on memory stalls
- Originally: 3.44 / 5.44 = 63%
- Now: 3.44 / 4.44 = 77%
40. Acceleration of the CPU
- If we double the clock rate without changing the memory system, how much faster will the new machine be?
- Measured in the faster clock cycles, the miss penalty will be twice as long: 200 cycles
- Total miss cycles per instruction:
- 2% x 200 + 36% x 4% x 200 = 6.88
- CPI: 2 + 6.88 = 8.88
- Speedup: 5.44 x 2 / 8.88 = 1.23 (see the sketch below)
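Again as a sketch; the speedup compares time per instruction, old CPI times the old cycle against new CPI times half that cycle:

```python
miss_cycles = 0.02 * 200 + 0.36 * 0.04 * 200   # 6.88 stall cycles per instruction
cpi_fast = 2 + miss_cycles                     # 8.88
# Time ratio: (5.44 * t) / (8.88 * t / 2)
print(5.44 * 2 / cpi_fast)                     # ~1.23x faster, not 2x
```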
41. Three Cs of Cache Misses
- Compulsory misses: caused by the first access to a block that has never been in the cache
- How to reduce compulsory misses?
- Capacity misses: caused when the cache cannot contain all blocks needed during execution of a program
- How to reduce capacity misses?
- Conflict misses: caused when multiple blocks compete for the same location in the cache, which can be very bad in a direct-mapped cache
- How to reduce conflict misses?
42. Decreasing the Miss Ratio with Associativity
- Direct-mapped cache
- Every memory block goes to exactly one block in the cache
- Easy to find:
- (block no.) mod (no. of cache blocks)
- Use it as the index to the referenced word
- Fully associative cache
- A memory block can go into any block of the cache
- Lower miss rate
- Difficult to find
- Longer hit time
- Search all tags to see if the word is the requested one
43. Cache Organizations
- Set-associative cache
- A memory block goes to a set of blocks
- The minimum set size is 2
- Finding the set:
- cache index = (block no.) mod (no. of sets in the cache)
- It is then necessary to check which element of the set contains the data
44. Mapping of an Eight-Block Cache
45. Cache Types
- Direct mapped, set associative, fully associative
- What is the cache index for the block with address 12?
46. Cache Miss Example
- Three small caches, each consisting of 4 one-word blocks
- Fully associative, two-way set associative, and direct mapped
- Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8
47. Example
- Direct mapped
- 0 -> 0 mod 4 = 0
- 6 -> 6 mod 4 = 2
- 8 -> 8 mod 4 = 0
- 5 misses
- Set associative (two-way)
- 0 -> 0 mod 2 = 0
- 6 -> 6 mod 2 = 0
- 8 -> 8 mod 2 = 0
- 4 or 3 misses (depending on the replacement policy)
48. Example (cont.)
- Fully associative
- 3 misses (all three organizations are replayed in the sketch below)
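A replay of all three organizations with LRU replacement, reproducing the 5 / 4 / 3 miss counts (the helper is illustrative):

```python
from collections import OrderedDict

def misses(refs, num_blocks, assoc):
    """Count misses for a small cache at a given associativity, LRU policy."""
    num_sets = num_blocks // assoc
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    count = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)          # hit: refresh LRU position
        else:
            count += 1                    # miss: load, evicting LRU if full
            if len(s) == assoc:
                s.popitem(last=False)
            s[block] = True
    return count

refs = [0, 8, 0, 6, 8]
for assoc in (1, 2, 4):                   # direct mapped, 2-way, fully associative
    print(f"{assoc}-way: {misses(refs, 4, assoc)} misses")
```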
49. Performance Improvement
- Associativity influences performance
50. Miss Rate
51. Replacement Strategy
- Direct mapping: no choice
- Fully associative: any position is allowed - which block should be evicted?
- Evict a block that won't be used again
- If all blocks will be used again, then evict the block that will not be used for the longest period of time
- Guarantees the lowest possible miss rate
- Can't be done unless we can tell the future
- Most often used scheme: LRU (Least Recently Used)
- Mark which element has not been used for the longest time
- Random
- Easy to implement
- Only about 1.1% worse than LRU
52. Locating a Block in a Set-Associative Cache
- Address portions
- Tag - must be compared to all elements in the set
- Index - selects the set
- Block offset
- The hardware effort for the comparison increases linearly with the number of elements in the set
- More time for comparison means a longer hit time
- Example (worked out in the sketch below)
- Cache size 4 KB, 4-way set associative, 1-word blocks
- Which bits in a 32-bit address are used for the cache index?
- Which bits are the tag?
- How about 4-word blocks?
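A sketch answering the example's questions (the function name is mine):

```python
import math

def fields(cache_bytes, assoc, block_words, addr_bits=32):
    """Return (tag, index, block-offset, byte-offset) widths in bits."""
    block_bytes = 4 * block_words
    num_sets = cache_bytes // (block_bytes * assoc)
    index = int(math.log2(num_sets))
    block_off = int(math.log2(block_words))
    tag = addr_bits - index - block_off - 2
    return tag, index, block_off, 2

print(fields(4096, 4, 1))  # (22, 8, 0, 2): index = bits 9..2, tag = bits 31..10
print(fields(4096, 4, 4))  # (22, 6, 2, 2): index = bits 9..4, tag = bits 31..10
```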
53. Example: Four-Way Set-Associative Cache
- 4 KB cache with 1-word blocks, 4-way set associative
54. Decreasing the Miss Penalty with Multilevel Caches
- Different technology costs for the cache levels
- First-level cache on the same die as the processor
- Use SRAMs to add another cache level above primary memory (DRAM)
- The miss penalty goes down if the data is in the second-level cache
- Different optimisation strategies
- Primary level: minimal hit time
- Frequency as close to the CPU clock as possible
- Secondary level: minimal miss rate
- Larger size
- Larger block size
- Fewer accesses to main memory
55. Performance of a Multilevel Cache
- 5 GHz processor
- CPI = 1.0 without misses
- Main memory access time: 100 ns
- Miss rate per instruction at the primary cache: 2%
- How much faster is the machine if we add a secondary cache with 5 ns access time that reduces the miss rate to main memory to 0.5%?
56. Example
- Without a secondary cache
- Miss penalty to main memory:
- 100 ns / 0.2 ns = 500 cycles
- Effective CPI:
- 1 + 500 x 2% = 11
- With a secondary cache
- Miss penalty to the secondary cache:
- 5 ns / 0.2 ns = 25 cycles
- Effective CPI:
- 1 + 25 x 2% + 500 x 0.5% = 4.0
- The machine with the secondary cache is
- 11 / 4 = 2.8 times faster (see the sketch below)
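The same two CPI calculations as a sketch:

```python
cycle_ns = 1 / 5.0                 # 5 GHz -> 0.2 ns per cycle
to_mem = 100 / cycle_ns            # 500-cycle miss penalty to main memory
to_l2  = 5 / cycle_ns              # 25-cycle miss penalty to the L2 cache

cpi_l1_only = 1 + 0.02 * to_mem                    # 11.0
cpi_with_l2 = 1 + 0.02 * to_l2 + 0.005 * to_mem    # 4.0
print(cpi_l1_only / cpi_with_l2)                   # ~2.8x faster
```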
57. Cache Complexities
[Figures: theoretical behavior of radix sort vs. quicksort; observed behavior of radix sort vs. quicksort]
- It is not always easy to understand the implications of caches
58. Cache Complexities
- Here is why
- Memory system performance is often the critical factor
- Multilevel caches and pipelined processors make it harder to predict outcomes
- Compiler optimizations that increase locality sometimes hurt ILP
- The best algorithm is difficult to predict; experimental data is needed
59. Memory Hierarchies: Summary
- Where can a block be placed in the cache? How is a block found?
- Direct mapped
- Set associative
- Fully associative
- What is the block size?
- One-word blocks
- Multiple-word blocks
- Which block should be replaced on a cache miss?
- Least recently used (LRU)
- Random
- What happens on a write?
- Write through
- Write back
60. Virtual Memory
- Motivation
- Allow multiple programs to share the same memory
- Allow a single program to exceed the size of primary memory
- Virtual memory
- A hardware-software interface that gives the user the illusion of a memory system that is much larger than the physical memory
- The illusion of a larger memory is accomplished by using secondary storage to back up primary memory
- We will focus on page-based virtual memory
- Page: a virtual memory block
61. Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages
- Illusion of having more physical memory
- Program relocation
- Protection
62. How Does VM Work?
- Two memory spaces
- Virtual memory space - what the program sees
- Physical memory space - what the program runs in (the size of RAM)
- On program startup
- The OS copies the program into RAM
- If there is not enough RAM, the OS stops copying and starts running the program with some portion of it loaded in RAM
- When the program touches a part of the program not in physical memory (RAM), the OS copies that part of the program from disk into RAM
- In order to copy some of the program from disk to RAM, the OS must evict parts of the program already in RAM
- The OS copies the evicted parts of the program back to disk if the pages are dirty (i.e., if they have been written to and changed)
63. Pages: Virtual Memory Blocks
- Page fault: the data is not in memory; retrieve it from disk
- Huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
- Reducing page faults is important (LRU is worth the price)
- The faults can be handled in software instead of hardware
- Write-through is too expensive, so we use write back
64. Page Tables
- Use a fully associative scheme because of the high cost of a page fault
- Use a page table to map virtual memory addresses to physical memory addresses
65. Page Tables
- Page size: 2^12 = 4 KB
- Page table size: 2^20 x 4 = 2^22 bytes = 4 MB (see the sketch below)
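A toy sketch of page-table translation together with the size calculation above (the table contents are made up for illustration):

```python
PAGE_BITS = 12                      # 4 KB pages
page_table = {0x00000: 0x2F0, 0x00001: 0x1A3}   # virtual page -> physical frame

def translate(vaddr: int) -> int:
    """Look up the physical address of a virtual address; KeyError = page fault."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x00001ABC)))   # frame 0x1A3 -> 0x1a3abc
# Table size for a full 32-bit space with 4-byte entries:
print((2 ** (32 - PAGE_BITS) * 4) >> 20, "MB per process")  # 4 MB
```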
66. What Happens if a Page Is Not in RAM?
- How do we know it's not in RAM?
- The page table entry's valid bit is set to INVALID (on disk)
- What do we do?
- Ask the OS to fetch the page from disk - we call this a page fault
- Before the page is read from disk, the OS must evict a page from RAM (if RAM is full)
- The page to be evicted is called the victim page
- If the page to be evicted is dirty, write it back to disk
- Only data pages can be dirty
- The OS then reads the requested page from disk
- The OS changes the page table to reflect the new mapping
- The hardware restarts at the faulting virtual address
67. Which Page Should We Evict?
- Optimal solution
- Evict a page that won't be referenced (used) again
- If all pages will be used again, then evict the page that will not be used for the longest period of time
- Guarantees the lowest possible page fault rate (number of faults per second)
- Can't be done unless we can tell the future
- Other page replacement algorithms
- First-in, first-out (FIFO)
- Least recently used (LRU)
68. Mapping to Physical Memory
69. Protection
- Each process has its own virtual memory, but the physical memory is shared
- A multiprogrammed machine must provide protection for its users
- The operating system maps the individual virtual memories to disjoint physical pages
- Requires two modes of execution: user mode and supervisor mode
- Only supervisor mode can modify the page table
- Information is shared among processes by using protection bits in the page table
- Each page table entry contains protection bits (read, write, execute)
- Each memory access is checked against the protection bits
- A violation generates an interrupt (segmentation fault)
70. Performance of Virtual Memory
- If every program in a multiprogramming environment fits into RAM, then virtual memory never pages (goes to disk)
- If any program doesn't fit into RAM, then the VM system must page between RAM and disk
- Paging is very costly
- A disk access (4 KB) can take 10 ms; in 10 ms, a processor can execute 20 million instructions
- Basically, you really don't want to page very often, if you can avoid it
- Thrashing
71. Making Address Translation Fast
- A cache for address translations: the translation look-aside buffer (TLB)
72. TLB
- Translation look-aside buffer
- 32 - 4096 entries
- Fully associative or set associative
- Hit time: 0.5 - 1 cycle
- Miss penalty: 10 - 30 cycles
- Hit rate > 99%
- For a TLB hit, the physical memory address is obtained in one cycle
- For a TLB miss, the regular translation mechanism is used
- The TLB is then updated with the new page-number / page-table-entry pair (an average-cost sketch follows)
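With the figures above, the expected translation cost can be estimated; a back-of-the-envelope sketch using assumed midpoint values:

```python
hit_rate, hit_time, miss_penalty = 0.99, 1, 20   # cycles; midpoints of the ranges above
avg = hit_rate * hit_time + (1 - hit_rate) * miss_penalty
print(f"{avg:.2f} cycles per translation on average")  # ~1.19
```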
73. TLBs and Caches
74. TLBs and Caches
75. Modern Memory Systems
76. Modern Systems
77. Modern Systems
- Things are getting complicated!