Title: Computer Architecture and Organization
1. Computer Architecture and Organization
- Module 8: The Memory Hierarchy
- Ben Juurlink
- Delft University of Technology
- April-May 2001
- Additional information: http://ce.et.tudelft.nl/benj/Courses/CAO
2. Objectives
- After this lecture, you should be able to:
- describe the basics of caches
- describe temporal and spatial locality
- describe direct-mapped caches and how data items are found in such a cache
- describe set-associative caches and how data items are found in them
- given a cache description, compute the number of sets and how large the tag is
- given a sequence of addresses, compute the miss rate
- use several cache replacement strategies
- describe virtual memory
- translate virtual addresses to physical addresses
- given a sequence of page references, compute the number of page faults
3. Memory Hierarchy, why?
- Users want large and fast memories!

  Type   Access time       Cost/MB (in 1997)
  SRAM   2-25 ns           $100-$250
  DRAM   60-120 ns         $5-$10
  Disk   10-20 million ns  $0.10-$0.20

- Try and give it to them anyway: build a memory hierarchy
4. Locality
- A principle that makes having a memory hierarchy a good idea
- If an item is referenced:
- temporal locality: it will tend to be referenced again soon
- spatial locality: nearby items will tend to be referenced soon
- Why does code have locality?
- Our initial focus: two levels (upper, lower)
- block: minimum unit of data
- hit: data requested is in the upper level
- miss: data requested is not in the upper level
5. Cache
- Two issues:
- How do we know if a data item is in the cache?
- If it is, how do we find it?
- Our first example:
- block size is one word of data
- "direct mapped"
- For each item of data at the lower level, there is exactly one location in the cache where it might be; e.g., lots of items at the lower level share locations in the upper level
6. Direct Mapped Cache
- Mapping (see the sketch below):
- block address = (byte address) div (block size in bytes)
- cache address = (block address) mod (cache size in blocks)
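As an illustration (not on the original slide), a minimal sketch of this mapping in Python; the function names and example parameters are assumptions:

```python
def block_address(byte_address, block_size_bytes):
    # block address = byte address div block size (integer division)
    return byte_address // block_size_bytes

def cache_address(byte_address, block_size_bytes, cache_size_blocks):
    # cache address = block address mod number of blocks in the cache
    return block_address(byte_address, block_size_bytes) % cache_size_blocks

# Example: one-word (4-byte) blocks, 8-block cache:
# byte address 44 -> block address 11 -> cache address 3
print(cache_address(44, 4, 8))
```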
7. Direct Mapped Cache Organization
- [Figure: address bit positions 31-0 for a direct-mapped cache with 1024 one-word entries. The 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). Each entry (index 0 through 1023) holds a valid bit, a tag, and 32 bits of data; comparing the stored tag with the address tag produces the hit signal.]
- For MIPS
- What kind of locality are we taking advantage of?
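A hedged sketch (not from the slides) of decoding an address under the 20/10/2 split shown above; the cache is modeled as a hypothetical list of (valid, tag, data) entries:

```python
def decode(address):
    byte_offset = address & 0x3        # bits 1-0
    index = (address >> 2) & 0x3FF     # bits 11-2: 10 bits -> 1024 entries
    tag = address >> 12                # bits 31-12: 20 bits
    return tag, index, byte_offset

def lookup(cache, address):
    # cache: list of 1024 (valid, tag, data) tuples
    tag, index, _ = decode(address)
    valid, stored_tag, data = cache[index]
    return data if (valid and stored_tag == tag) else None  # None = miss
```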
8. Hits vs. Misses
- Read hits
- this is what we want!
- Read misses
- stall the CPU, fetch the block from memory, deliver it to the cache, restart the load instruction
- Write hits
- can replace data in cache and memory (write-through)
- write the data only into the cache, and write it back to memory later (write-back)
- Write misses
- read the entire block into the cache, then write the word (allocate on write miss)
- do not read the cache line, just write to memory (no allocate on write miss)
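A minimal sketch (mine, not the slides') contrasting the two write-hit policies; the cache entry layout and the dirty-bit array are assumed structures:

```python
WRITE_THROUGH = True  # set to False for write-back

def handle_write_hit(cache, dirty, memory, index, tag, address, value):
    cache[index] = [True, tag, value]  # entry layout: [valid, tag, data]
    if WRITE_THROUGH:
        memory[address] = value        # memory is updated on every write
    else:
        dirty[index] = True            # block is written back only on eviction
```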
9. Direct Mapped Cache
- Taking advantage of spatial locality
- [Figure: address bit positions for a direct-mapped cache with multi-word blocks; a block-offset field in the address selects a word within the block]
10. Hardware Issues
- Make reading multiple words faster by using multiple banks of memory
11. Performance
- Increasing the block size tends to decrease the miss rate but increases the miss penalty
12. Split caches
- Split cache: separate caches for instructions (code) and data
- Useful because there is more spatial locality in code
13. Impact of Cache Performance on Execution Time
- T_exec = N_inst × CPI × T_cycle
- where
- CPI = CPI_ideal + CPI_stall
- CPI_stall = (reads per instruction) × miss-rate_read × miss-penalty_read + (writes per instruction) × miss-rate_write × miss-penalty_write
- or
- T_exec = (N_normal-cycles + N_stall-cycles) × T_cycle
- where
- N_stall-cycles = N_reads × miss-rate_read × miss-penalty_read + N_writes × miss-rate_write × miss-penalty_write + (write-buffer stalls)
14. Impact of Cache Performance on Execution Time
T_exec = (N_normal-cycles + N_stall-cycles) × T_cycle, where
N_stall-cycles = N_access × miss-rate × miss-penalty
15. Performance example
- Assume the GCC application (page 311)
- I-cache miss rate: 2%
- D-cache miss rate: 4%
- CPI_ideal = 2.0
- Miss penalty: 40 cycles
- Calculate the CPI:
- CPI = 2.0 + CPI_stall
- N_stall-cycles = (instruction miss cycles) + (data miss cycles)
- Instruction miss cycles = N_instr × 0.02 × 40 = 0.80 × N_instr
- loads and stores are 36% of instructions
- Data miss cycles = N_instr × 0.36 × 0.04 × 40 = 0.576 × N_instr
- Total cycles = 3.376 × N_instr, so CPI = 3.376
- Slowdown: 1.688 !!
16. Performance example (continued)
- What if the ideal processor had CPI = 1.0 (instead of 2.0)?
- Slowdown would be 2.38 !
- What if the processor is clocked twice as fast?
- -> miss penalty becomes 80 cycles
- CPI = 4.752
- Speedup = (N × CPI_a × T_clock) / (N × CPI_b × T_clock/2) = 3.376 / (4.752/2)
- Speedup is not 2, but only 1.42 !! (see the worked script below)
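A short script (not part of the slides) that reproduces the numbers from both example slides; the 2%, 4%, 36%, and 40-cycle figures come from the slides above:

```python
def cpi_with_stalls(cpi_ideal, miss_penalty,
                    i_miss=0.02, d_miss=0.04, ld_st_frac=0.36):
    # stalls per instruction = instruction-fetch misses + data-access misses
    stall = i_miss * miss_penalty + ld_st_frac * d_miss * miss_penalty
    return cpi_ideal + stall

base = cpi_with_stalls(2.0, 40)   # 3.376
print(base / 2.0)                 # slowdown vs. CPI_ideal = 2.0 -> 1.688
print(cpi_with_stalls(1.0, 40))   # 2.376 -> slowdown ~2.38 if CPI_ideal = 1.0
fast = cpi_with_stalls(2.0, 80)   # 4.752 when the clock rate doubles
print(base / (fast / 2))          # speedup ~1.42, not 2
```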
17. Improving performance
- Two ways of improving performance:
- decrease the miss ratio: associativity
- decrease the miss penalty: multilevel caches
- Active Learning: What happens if we increase the block size?
18. Decrease miss ratio using associative caches
- [Figure: cache organizations with 2, 4, and 8 blocks per set]
19. Implementation: 4-way associative
20. Active Learning
- Useful formula:
- (cache size) = (number of sets) × associativity × (block size)
- Active learning: given the following
- cache size: 4 KB
- associativity: 4
- block size: 4 words
- word: 4 bytes
- how many sets are there in the cache? (worked out in the sketch below)
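A quick worked check (the answer is not on the slide; the arithmetic below is mine):

```python
cache_size = 4 * 1024              # 4 KB in bytes
associativity = 4
block_size = 4 * 4                 # 4 words x 4 bytes = 16 bytes
num_sets = cache_size // (associativity * block_size)
print(num_sets)                    # 64 sets
```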
21. Replacement
- Which block is replaced on a cache miss?
- Cache replacement strategies (sketched in code below):
- Random: pick a block at random
- First-In-First-Out (FIFO): replace the block that has been in the cache longest
- Least Recently Used (LRU): replace the block that has not been used for the longest time
- Optimal algorithm (MIN): replace the block that will not be used for the longest time
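A hedged sketch of how the first three policies choose a victim within one set; the per-block load_time and last_use timestamps are hypothetical bookkeeping, and MIN is omitted because it needs knowledge of future references:

```python
import random

def choose_victim(blocks, policy):
    # blocks: list of dicts like {"load_time": t0, "last_use": t1}
    if policy == "random":
        return random.randrange(len(blocks))
    if policy == "fifo":    # evict the block loaded earliest
        return min(range(len(blocks)), key=lambda i: blocks[i]["load_time"])
    if policy == "lru":     # evict the block used least recently
        return min(range(len(blocks)), key=lambda i: blocks[i]["last_use"])
    raise ValueError("unknown policy: " + policy)
```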
22. Performance
- [Figure: miss rates for 1 KB, 2 KB, and 8 KB caches]
23. Multilevel Caches
- Add a second-level cache:
- the primary cache is often on the same chip as the processor
- use SRAMs to add another cache above primary memory (DRAM)
- the miss penalty goes down if the data is in the 2nd-level cache
- Example (worked out below):
- CPI of 1.0 on a 500 MHz machine with a 5% miss rate and 200 ns DRAM access
- adding a 2nd-level cache with 20 ns access time decreases the miss rate to 2%
- Using multilevel caches:
- try and optimize the hit time on the 1st-level cache
- try and optimize the miss rate on the 2nd-level cache
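A sketch of the usual way this example is worked (the arithmetic is mine; I assume the 2% is the global miss rate that still goes to DRAM): at 500 MHz a cycle is 2 ns, so the 200 ns DRAM access costs 100 cycles and the 20 ns L2 access costs 10 cycles:

```python
cycle_ns = 2.0                                    # 1 / 500 MHz
dram_penalty = 200 / cycle_ns                     # 100 cycles
l2_penalty = 20 / cycle_ns                        # 10 cycles

cpi_no_l2 = 1.0 + 0.05 * dram_penalty             # 6.0
cpi_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * dram_penalty  # 3.5
print(cpi_no_l2 / cpi_l2)                         # ~1.7x faster with L2
```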
24. Virtual Memory
- Main memory can act as a cache for secondary storage (disk)
- Advantages:
- illusion of having more physical memory
- program relocation
- protection
- [Figure: virtual memory pages mapped onto physical memory]
25. Pages and Page Table
- Pages: virtual memory blocks
- Page table: mapping of virtual page numbers to physical page numbers
- [Figure: a page table with a valid bit per entry maps virtual pages 0-63 to physical pages 0-3; pages with valid = 0 are not resident in physical memory]
26. Page Faults
- Page fault: the data is not in memory, retrieve it from disk
- huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
- reducing page faults is important (LRU is worth the price)
- faults can be handled in software (OS) instead of hardware
- write-through is too expensive, use write-back
27. Page Tables
28. Making Address Translation Fast
- A cache for address translations: the translation lookaside buffer (TLB)
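An illustrative sketch (structures and sizes are my assumptions, not from the slides) of translating a virtual address through a TLB, falling back to the page table on a TLB miss, with 4 KB pages:

```python
PAGE_SIZE = 4096  # 4 KB pages: the low 12 bits are the page offset

def translate(vaddr, tlb, page_table):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                        # TLB hit: no page-table access
        ppn = tlb[vpn]
    else:                                 # TLB miss: consult the page table
        valid, ppn = page_table[vpn]
        if not valid:
            raise RuntimeError("page fault: OS fetches the page from disk")
        tlb[vpn] = ppn                    # cache the translation for next time
    return ppn * PAGE_SIZE + offset
```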
29. Active Learning
- Suppose there is room for 3 pages in memory and the processor references the following pages:
- 7 0 1 2 0 3 0 4 2 3 0 3 2 1
- How many page faults occur (assuming LRU replacement)? (A simulator sketch follows.)
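A minimal LRU simulator (mine, not from the slides) that counts the faults for this reference string:

```python
def lru_page_faults(refs, num_frames):
    resident = []                     # least recently used page first
    faults = 0
    for page in refs:
        if page in resident:
            resident.remove(page)     # hit: just refresh recency
        else:
            faults += 1               # miss: page fault
            if len(resident) == num_frames:
                resident.pop(0)       # evict the least recently used page
        resident.append(page)         # page is now most recently used
    return faults

print(lru_page_faults([7,0,1,2,0,3,0,4,2,3,0,3,2,1], 3))  # -> 10
```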
30. Modern Systems
- First-level cache organization
- [Figure: Pentium Pro dual-chip module]
31. Modern Systems
- Very complicated memory systems
- Virtual memory
32. Research Issues
- Processor speeds continue to increase very fast, much faster than either DRAM or disk access times
- Design challenge: dealing with this growing disparity
- Trends:
- synchronous SRAMs (provide a burst of data)
- redesign DRAM chips to provide higher bandwidth or processing
- restructure code to increase locality
- use prefetching (make the cache visible to the ISA)
33. Active Learning
- Suggested exercises from Chapter 7:
- 7.2, 7.3
- 7.7-7.10
- 7.20, 7.21
- 7.27, 7.32