Title: Memory Hierarchy
1. Memory Hierarchy
2. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
3. Memory
- Memory is required for storing
- Programs
- Data
- Characteristics
- Access mode
- Sequential vs. random access (RAM)
- Alterability
- Read-only memory vs. read-write memory
- Access time
- Price
- Dollars / byte
4. Memories Review
- SRAM
- Value is stored on a pair of inverting gates
- The value can be kept indefinitely as long as power is applied
- Very fast, but takes up more space than DRAM (4 to 6 transistors)
- DRAM
- Value is stored as a charge on a capacitor (must be refreshed)
- Very small, but slower than SRAM (by a factor of 5 to 10)
5. SRAM vs. DRAM
- SRAM
- Fast switching due to the low impedance of the transistors
- Fast access: 0.5 - 5 ns
- High power
- Large area: 6 transistors / cell
- 2 lines: bit line and negated bit line
- High cost: $4000 to $10000 per GB (2004)
- DRAM
- Slow reading because of high resistance and high capacitance
- Slow access: 50 - 70 ns
- Low power
- Small area: 1 transistor + 1 capacitor
- Vertically built cells
- Low cost: $100 to $200 per GB
6. Processor-DRAM Memory Gap (Latency)
[Figure: performance (log scale, 1 to 1000) vs. time, 1980 - 2000. Processor ("Moore's Law"): 60%/yr. (2x / 1.5 yrs.); DRAM ("Less' Law?"): 9%/yr. (2x / 10 yrs.). The processor-memory performance gap grows at about 50% per year.]
7. Memory Hierarchy
- Users want large and fast memories!
- Conflicting goals
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
8. Principle of Locality
- Memory locations are not accessed with uniform frequency. Certain locations are accessed much more often than others.
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: nearby items will tend to be referenced soon.
- Why does code have locality?
9. Hierarchical Access to the Data
- Our initial focus: two levels (upper, lower)
- Cache: fast storage that takes advantage of locality of access
- Block: minimum unit of data
- Hit: the data requested is in the cache
- Miss: the data requested is not in the cache
- When an item is referenced:
- The main memory block number is extracted from the address
- The block number is looked up in the cache directory
- If we have a hit, the data is extracted from the cache
- If we have a miss, a copy of the block is transferred from main memory to the cache
10. Hit / Miss
- Hit: the data is directly available in the cache - no penalty
- Hit time: time to retrieve a data item on a hit
- Miss: the data is on a lower level - penalty from 1 to 30 cycles!
- Miss time: time to retrieve a data item on a miss
- Miss time = hit time + miss penalty
- Hit ratio: the ratio of hits to the total number of memory accesses at a particular memory level
- Miss ratio = 1 - hit ratio
11. Access
- Questions
- How do we know if a data item is in the cache?
- If it is, how do we find it?
- Simple approach: direct mapping
- For now, we assume the block size is a single word
- For each item of data at the lower level, there is exactly one location in the cache where it might be
- i.e., lots of items at the lower level share locations in the upper level
- Mapping: cache index = (block address) mod (number of blocks in the cache), as sketched below
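A minimal sketch of this mapping in Python (the function name and the 8-block example are mine, not from the slides):

```python
def direct_map(block_address: int, num_cache_blocks: int) -> int:
    """Direct mapping: each memory block has exactly one cache slot."""
    return block_address % num_cache_blocks

# Example: an 8-block cache; memory blocks 5, 13, 21 all map to index 5.
for block in (5, 13, 21):
    print(block, "->", direct_map(block, 8))
```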
12. Direct Mapped Cache Example
- Mapping: the address is taken modulo the number of blocks in the cache
13. Access Example
- Memory with 32 locations
- Cache with 8 locations
- The cache address is identical to the lower 3 bits of the memory address
- A tag at the cache memory location holds the high-order bits of the memory address
- Complete address = tag + cache address
- Valid bit: is the content still valid?
14. Example
- Initial state
- Miss at 10110
15. Example
- After miss at 11010
- After miss at 10000
16. Example
- After miss at 00011
- After miss at 10010 (the whole sequence is simulated in the sketch below)
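The example sequence can be replayed with a few lines of Python; an illustrative sketch assuming one-word blocks and the 8-entry direct-mapped cache of the slides:

```python
def split(addr: int):
    """5-bit address -> (tag = high 2 bits, index = low 3 bits)."""
    return addr >> 3, addr & 0b111

cache = [None] * 8  # each entry holds a tag, or None (valid bit clear)
for addr in (0b10110, 0b11010, 0b10000, 0b00011, 0b10010):
    tag, index = split(addr)
    hit = cache[index] == tag
    print(f"{addr:05b}: index {index}, tag {tag:02b} ->", "hit" if hit else "miss")
    cache[index] = tag  # on a miss, the block is loaded
```

All five references miss; the last one (10010) replaces the block 11010 that previously occupied index 2.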
17. A More Realistic Cache Example
- 32-bit data width
- 32-bit byte address (30-bit word address)
- Cache size: 1k blocks, block size: 1 word
- 10-bit cache index
- 20-bit tag
- 2-bit byte offset (need not be addressed if we use word alignment to 32-bit boundaries)
- Valid bit
18. Cache Access
- Compare the cache tag with address bits 31..12
- Check the valid bit
- Signal a hit to the CPU
- Transfer the data
- What kind of locality are we taking advantage of?
19. Cache Size
- Cache memory size
- 1024 x 32 bit = 32 kbit
- Tag memory size
- 1024 x 20 bit = 20 kbit
- Valid information
- 1024 x 1 bit = 1 kbit
- Efficiency
- 32 / 53 = 60.4% only!
- General size of a one-word direct-mapped cache with 32-bit data and 2^n blocks (checked in the sketch below):
- 2^n x (32 + (32 - n - 2) + 1) bits
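The size formula can be checked directly; a quick sketch for the 1k-block case (the helper name is mine):

```python
def cache_bits(n: int, word_bits: int = 32, addr_bits: int = 32) -> int:
    """Total bits of a one-word-block direct-mapped cache with 2^n blocks:
    data + tag + valid bit per block."""
    tag_bits = addr_bits - n - 2          # 2 bits of byte offset
    return 2**n * (word_bits + tag_bits + 1)

n = 10                                    # 1k blocks, as on the slide
total = cache_bits(n)
print(total // 1024, "kbit total")        # 53 kbit = 32 + 20 + 1 kbit
print(f"efficiency: {32 / 53:.1%}")       # data bits / total bits per block
```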
20. Spatial Locality
- So far we didn't take advantage of spatial locality
- Basic idea
- Whenever we have a miss, load a group of adjacent memory cells into the cache
- Having a larger block
- Direct-mapped block mapping: cache index = (block address) mod (number of cache blocks)
- Address components for a 64 KB cache with a 4-word block size (decoded in the sketch below):
- Tag: bits 31 - 16
- Index: bits 15 - 4
- Block offset: bits 3 - 2
- Byte offset: bits 1 - 0
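A sketch of how a 32-bit byte address splits into these fields for the 64 KB, 4-word-block configuration (the function is illustrative):

```python
def decode(addr: int):
    """Split a 32-bit byte address for a 64 KB direct-mapped cache
    with 4-word (16-byte) blocks."""
    byte_offset  = addr & 0b11           # bits 1..0
    block_offset = (addr >> 2) & 0b11    # bits 3..2: word within the block
    index        = (addr >> 4) & 0xFFF   # bits 15..4: 4096 blocks
    tag          = addr >> 16            # bits 31..16
    return tag, index, block_offset, byte_offset

print(decode(0x12345678))  # -> (0x1234, 0x567, 2, 0)
```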
21. 64 KB Cache with 4-Word Blocks
22. Example
- Consider a direct-mapped cache with 64 blocks and a block size of 16 bytes (4 words). What is the cache index of byte address 1200?
- Block address of byte address 1200:
- Word address: 1200 / 4 = 300
- Block address: 300 / 4 = 75
- Cache index:
- 75 mod 64 = 11 (see the sketch below)
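The same computation in code, as a short sketch:

```python
byte_address = 1200
block_bytes, num_blocks = 16, 64
block_address = byte_address // block_bytes   # 1200 // 16 = 75
print(block_address % num_blocks)             # cache index: 75 % 64 = 11
```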
23. Example
- Consider a series of address references given as word addresses: 22, 24, 25, 20
- Assuming a 16-word direct-mapped cache with one-word blocks, compute the cache index of each reference and label it as a hit or a miss
- What are the results assuming a 16-word direct-mapped cache with 4 four-word blocks? (Both configurations are simulated in the sketch below.)
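A small simulation answering both questions (the helper is an illustrative sketch, not a reference implementation):

```python
def simulate(word_addrs, num_blocks, block_words):
    """Direct-mapped cache simulation over a trace of word addresses."""
    cache = [None] * num_blocks                 # stored block tags
    for addr in word_addrs:
        block = addr // block_words
        index = block % num_blocks
        hit = cache[index] == block
        print(f"addr {addr}: index {index} ->", "hit" if hit else "miss")
        cache[index] = block

refs = [22, 24, 25, 20]
simulate(refs, num_blocks=16, block_words=1)   # 16 one-word blocks: 4 misses
simulate(refs, num_blocks=4, block_words=4)    # 4 four-word blocks: 2 misses, 2 hits
```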
24. Optimal Block Size
- Small block size
- High miss rate
- Short block loading time
- Ignores spatial locality
- Large block size
- Low miss rate
- Long time for reloading a block
- 1 miss requires n words to be loaded, n = block size
- Optimization strategies
- Early restart
- Requested word first
25. Miss Rate vs. Block Size
26. Harvard vs. von Neumann
- Split caches
- Higher miss rate due to their size
- No conflicts when accessing data and instructions at the same time
- Higher bandwidth due to separate data paths
- Combined caches
- Lower miss rate due to their size
- Possible stalls due to simultaneous accesses to data and instructions
- Lower bandwidth due to sharing of resources
27. Cache Reads
- Cache hit - just continue
- Access data from data memory (data cache)
- Access instructions from instruction memory (instruction cache)
- Miss
- Stall the complete processor
- Activate the memory controller
- Get the information from the next lower level of cache or from main memory
- Load the information into the cache
- Resume as before
28. Cache Writes
- Write hit: must maintain consistency between cache and memory
- Replace the data in cache and memory (write-through)
- Write the data only into the cache (write-back: update memory later)
- Only in the data cache
- Write miss
- Read the entire block into the cache, then write the word
- No need to read the block if the block size is one word. Why?
29. Write Through and Write Back
- Write through
- Update cache and memory at the same time
- Requires a buffer, because memory cannot accept data as fast as the processor can generate writes
- Write back
- Keep the data in the cache and write it back when the cache contents are being replaced
- Requires more effort in the cache replacement unit
30. Example Cache
- DECstation 3100, based on the MIPS R2000
- Cache
- Separate instruction and data caches
- Each 64 KB
- 16k words
- Block size: 1 word
31. Example
- Write through
- Use bits 15 - 2 as the cache index
- Write bits 31 - 16 into the tag
- Write the word into the cache memory
- Set the valid bit
- Write to memory
- Performance problem of writing to memory
- Cannot wait for the word to be written to memory
- Solution
- Introduce a write buffer (4 words) between processor and memory
32. Memory System Design
- Hypothetical access times for a DRAM
- 1 clock cycle to send the address
- 15 clock cycles for each DRAM access initiated
- 1 clock cycle to send a word
- Memory organization
- Block of four words
- Memory access: 1 word
- Miss penalty
- 1 + 4 x 15 + 4 x 1 = 65 cycles
- Bytes / cycle: 4 x 4 / 65 = 0.25
33. Memory Organization
34. Memory Organization
- Option 1
- Access path 1 word wide
- High penalty: 1 + 4 x 15 + 4 x 1 = 65 cycles
- Option 2
- Bus and memory as wide as the block
- Penalty drops to 1 + 15 + 1 = 17 cycles
- Option 3
- Bus width 1 word, but memory organized in banks
- Penalty drops to 1 + 15 + 4 x 1 = 20 cycles
- (All three options are compared in the sketch below.)
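A quick sketch comparing the three organizations under the timing assumptions of slide 32 (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word transferred):

```python
ADDR, ACCESS, XFER = 1, 15, 1    # cycles: send address, DRAM access, send word
WORDS = 4                        # words per block

one_word_wide = ADDR + WORDS * ACCESS + WORDS * XFER   # 65 cycles
block_wide    = ADDR + ACCESS + XFER                   # 17 cycles
interleaved   = ADDR + ACCESS + WORDS * XFER           # 20 cycles (banks overlap accesses)
print(one_word_wide, block_wide, interleaved)
print(f"bandwidth (1-word-wide): {4 * WORDS / one_word_wide:.2f} bytes/cycle")
```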
35. Summary
- Memory hierarchy
- Cache
- Direct-mapped
- Hit / miss
- Miss rate vs. block size
- Memory organization
36. Cache Performance
- The performance of a cache depends on many parameters
- Memory stall clock cycles
- All stall cycles caused by memory accesses
- Read stall clock cycles and write stall clock cycles
- Instruction cache stalls and data cache stalls
- CPU time = (CPU execution cycles + memory stall cycles) x cycle time
- Memory stall cycles = memory accesses x miss ratio x miss penalty
- (assuming the same miss penalty for read and write stalls)
- Two ways of improving performance
- Decreasing the miss ratio
- Decreasing the miss penalty
- What happens if we increase the block size?
37. Example
- A machine has a CPI of 2 without memory stalls
- Instruction cache miss rate: 2%
- Data cache miss rate: 4%; 36% of all instructions are memory accesses
- Miss penalty: 100 cycles
- How much faster would the machine run with a perfect cache that never missed?
38. Example
- Stall cycles
- Instruction miss cycles: I x 2% x 100 = 2.00 I
- Data miss cycles: I x 36% x 4% x 100 = 1.44 I
- CPI with memory stalls: 2 + 2.00 + 1.44 = 5.44
- Ratio of CPU execution times: 5.44 / 2 = 2.72 (verified in the sketch below)
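The arithmetic, spelled out as a sketch:

```python
cpi_base = 2.0
i_stall = 0.02 * 100             # instruction stall cycles per instruction = 2.00
d_stall = 0.36 * 0.04 * 100      # data stall cycles per instruction = 1.44
cpi = cpi_base + i_stall + d_stall
print(cpi, cpi / cpi_base)       # 5.44; 2.72x slower than with a perfect cache
```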
39. Acceleration of the CPU
- Assumption
- Currently CPI = 2, stall cycles / instruction = 3.44
- Improvement
- Clock rate constant
- CPI improved from 2 to 1
- System behaviour
- A system with a perfect cache would be 4.44 / 1 = 4.44 times faster
- Time spent on memory stalls
- Originally: 3.44 / 5.44 = 63%
- Now: 3.44 / 4.44 = 77%
40. Acceleration of the CPU
- If we double the clock rate without changing the memory system, how much faster will the new machine be?
- Measured in the faster clock cycles, the miss penalty will be twice as long: 200 cycles
- Total miss cycles per instruction:
- 2% x 200 + 36% x 4% x 200 = 6.88
- CPI: 2 + 6.88 = 8.88
- Speedup: 5.44 x 2 / 8.88 = 1.23 (see the sketch below)
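Again as a sketch; the speedup compares time per instruction, old CPI times the old cycle against new CPI times half that cycle:

```python
miss_cycles = 0.02 * 200 + 0.36 * 0.04 * 200   # 6.88 stall cycles per instruction
cpi_fast = 2 + miss_cycles                     # 8.88
# Time ratio: (5.44 * t) / (8.88 * t / 2)
print(5.44 * 2 / cpi_fast)                     # ~1.23x faster, not 2x
```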
41. Three Cs of Cache Misses
- Compulsory misses: caused by the first access to a block that has never been in the cache
- How to reduce compulsory misses?
- Capacity misses: caused when the cache cannot contain all blocks needed during execution of a program
- How to reduce capacity misses?
- Conflict misses: caused when multiple blocks compete for the same location in the cache, which can be very bad in a direct-mapped cache
- How to reduce conflict misses?
42. Decreasing the Miss Ratio with Associativity
- Direct-mapped cache
- Every memory block goes to exactly one block in the cache
- Easy to find:
- (block no.) mod (no. of cache blocks)
- Use it as the index to the referenced word
- Fully associative cache
- A memory block can go into any block of the cache
- Lower miss rate
- Difficult to find
- Longer hit time
- Search all tags to see if the word is the requested one
43. Cache Organizations
- Set-associative cache
- A memory block goes to a set of blocks
- The minimum set size is 2
- Finding the set:
- cache index = (block no.) mod (no. of sets in the cache)
- It is then necessary to check which element of the set contains the data
44. Mapping of an Eight-Block Cache
45. Cache Types
- Direct mapped, set associative, fully associative
- What is the cache index for the block with address 12?
46. Cache Miss Example
- Three small caches, each consisting of 4 one-word blocks
- Fully associative, two-way set associative, and direct mapped
- Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8
47. Example
- Direct mapped
- 0 -> 0 mod 4 = 0
- 6 -> 6 mod 4 = 2
- 8 -> 8 mod 4 = 0
- 5 misses
- Set associative (two-way)
- 0 -> 0 mod 2 = 0
- 6 -> 6 mod 2 = 0
- 8 -> 8 mod 2 = 0
- 4 or 3 misses (depending on the replacement policy)
48. Example (cont.)
- Fully associative
- 3 misses (all three organizations are replayed in the sketch below)
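A replay of all three organizations with LRU replacement, reproducing the 5 / 4 / 3 miss counts (the helper is illustrative):

```python
from collections import OrderedDict

def misses(refs, num_blocks, assoc):
    """Count misses for a small cache at a given associativity, LRU policy."""
    num_sets = num_blocks // assoc
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    count = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)          # hit: refresh LRU position
        else:
            count += 1                    # miss: load, evicting LRU if full
            if len(s) == assoc:
                s.popitem(last=False)
            s[block] = True
    return count

refs = [0, 8, 0, 6, 8]
for assoc in (1, 2, 4):                   # direct mapped, 2-way, fully associative
    print(f"{assoc}-way: {misses(refs, 4, assoc)} misses")
```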
49. Performance Improvement
- Associativity influences performance
50. Miss Rate
51. Replacement Strategy
- Direct mapping: no choice
- Fully associative: any position is allowed - which block should be evicted?
- Evict a block that won't be used again
- If all blocks will be used again, then evict the block that will not be used for the longest period of time
- Guarantees the lowest possible miss rate
- Can't be done unless we can tell the future
- Most often used scheme: LRU (Least Recently Used)
- Mark which element has not been used for the longest time
- Random
- Easy to implement
- Only about 1.1% worse than LRU
52. Locating a Block in a Set-Associative Cache
- Address portions
- Tag - must be compared to all elements in the set
- Index - selects the set
- Block offset
- The hardware effort for the comparison increases linearly with the number of elements in the set
- More time for comparison means a longer hit time
- Example (worked out in the sketch below)
- Cache size 4 KB, 4-way set associative, 1-word blocks
- Which bits in a 32-bit address are used for the cache index?
- Which bits are the tag?
- How about 4-word blocks?
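A sketch answering the example's questions (the function name is mine):

```python
import math

def fields(cache_bytes, assoc, block_words, addr_bits=32):
    """Return (tag, index, block-offset, byte-offset) widths in bits."""
    block_bytes = 4 * block_words
    num_sets = cache_bytes // (block_bytes * assoc)
    index = int(math.log2(num_sets))
    block_off = int(math.log2(block_words))
    tag = addr_bits - index - block_off - 2
    return tag, index, block_off, 2

print(fields(4096, 4, 1))  # (22, 8, 0, 2): index = bits 9..2, tag = bits 31..10
print(fields(4096, 4, 4))  # (22, 6, 2, 2): index = bits 9..4, tag = bits 31..10
```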
53. Example: Four-Way Set-Associative Cache
- 4 KB cache with 1-word blocks, 4-way set associative
54. Decreasing the Miss Penalty with Multilevel Caches
- Different technology costs for the cache levels
- First-level cache on the same die as the processor
- Use SRAMs to add another cache level above primary memory (DRAM)
- The miss penalty goes down if the data is in the second-level cache
- Different optimisation strategies
- Primary level: minimal hit time
- Frequency as close to the CPU clock as possible
- Secondary level: minimal miss rate
- Larger size
- Larger block size
- Fewer accesses to main memory
55. Performance of a Multilevel Cache
- 5 GHz processor
- CPI = 1.0 without misses
- Main memory access time: 100 ns
- Miss rate per instruction at the primary cache: 2%
- How much faster is the machine if we add a secondary cache with 5 ns access time that reduces the miss rate to main memory to 0.5%?
56. Example
- Without a secondary cache
- Miss penalty to main memory:
- 100 ns / 0.2 ns = 500 cycles
- Effective CPI:
- 1 + 500 x 2% = 11
- With a secondary cache
- Miss penalty to the secondary cache:
- 5 ns / 0.2 ns = 25 cycles
- Effective CPI:
- 1 + 25 x 2% + 500 x 0.5% = 4.0
- The machine with the secondary cache is
- 11 / 4 = 2.8 times faster (see the sketch below)
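The same two CPI calculations as a sketch:

```python
cycle_ns = 1 / 5.0                 # 5 GHz -> 0.2 ns per cycle
to_mem = 100 / cycle_ns            # 500-cycle miss penalty to main memory
to_l2  = 5 / cycle_ns              # 25-cycle miss penalty to the L2 cache

cpi_l1_only = 1 + 0.02 * to_mem                    # 11.0
cpi_with_l2 = 1 + 0.02 * to_l2 + 0.005 * to_mem    # 4.0
print(cpi_l1_only / cpi_with_l2)                   # ~2.8x faster
```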
57. Cache Complexities
[Figures: theoretical behavior of radix sort vs. quicksort; observed behavior of radix sort vs. quicksort]
- It is not always easy to understand the implications of caches
58. Cache Complexities
- Here is why
- Memory system performance is often the critical factor
- Multilevel caches and pipelined processors make it harder to predict outcomes
- Compiler optimizations that increase locality sometimes hurt ILP
- The best algorithm is difficult to predict; experimental data is needed
59. Memory Hierarchies: Summary
- Where can a block be placed in the cache? How is a block found?
- Direct mapped
- Set associative
- Fully associative
- What is the block size?
- One-word blocks
- Multiple-word blocks
- Which block should be replaced on a cache miss?
- Least recently used (LRU)
- Random
- What happens on a write?
- Write through
- Write back
60. Virtual Memory
- Motivation
- Allow multiple programs to share the same memory
- Allow a single program to exceed the size of primary memory
- Virtual memory
- A hardware-software interface that gives the user the illusion of a memory system that is much larger than the physical memory
- The illusion of a larger memory is accomplished by using secondary storage to back up primary memory
- We will focus on page-based virtual memory
- Page: a virtual memory block
61. Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages
- Illusion of having more physical memory
- Program relocation
- Protection
62. How Does VM Work?
- Two memory spaces
- Virtual memory space - what the program sees
- Physical memory space - what the program runs in (the size of RAM)
- On program startup
- The OS copies the program into RAM
- If there is not enough RAM, the OS stops copying and starts running the program with some portion of it loaded in RAM
- When the program touches a part of the program not in physical memory (RAM), the OS copies that part of the program from disk into RAM
- In order to copy some of the program from disk to RAM, the OS must evict parts of the program already in RAM
- The OS copies the evicted parts of the program back to disk if the pages are dirty (i.e., if they have been written to and changed)
63. Pages: Virtual Memory Blocks
- Page fault: the data is not in memory; retrieve it from disk
- Huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
- Reducing page faults is important (LRU is worth the price)
- The faults can be handled in software instead of hardware
- Write-through is too expensive, so we use write back
64. Page Tables
- Use a fully associative scheme because of the high cost of a page fault
- Use a page table to map virtual memory addresses to physical memory addresses
65. Page Tables
- Page size: 2^12 = 4 KB
- Page table size: 2^20 x 4 = 2^22 bytes = 4 MB (see the sketch below)
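A toy sketch of page-table translation together with the size calculation above (the table contents are made up for illustration):

```python
PAGE_BITS = 12                      # 4 KB pages
page_table = {0x00000: 0x2F0, 0x00001: 0x1A3}   # virtual page -> physical frame

def translate(vaddr: int) -> int:
    """Look up the physical address of a virtual address; KeyError = page fault."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x00001ABC)))   # frame 0x1A3 -> 0x1a3abc
# Table size for a full 32-bit space with 4-byte entries:
print((2 ** (32 - PAGE_BITS) * 4) >> 20, "MB per process")  # 4 MB
```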
66. What Happens if a Page Is Not in RAM?
- How do we know it's not in RAM?
- The page table entry's valid bit is set to INVALID (on disk)
- What do we do?
- Ask the OS to fetch the page from disk - we call this a page fault
- Before the page is read from disk, the OS must evict a page from RAM (if RAM is full)
- The page to be evicted is called the victim page
- If the page to be evicted is dirty, write it back to disk
- Only data pages can be dirty
- The OS then reads the requested page from disk
- The OS changes the page table to reflect the new mapping
- The hardware restarts at the faulting virtual address
67. Which Page Should We Evict?
- Optimal solution
- Evict a page that won't be referenced (used) again
- If all pages will be used again, then evict the page that will not be used for the longest period of time
- Guarantees the lowest possible page fault rate (number of faults per second)
- Can't be done unless we can tell the future
- Other page replacement algorithms
- First-in, first-out (FIFO)
- Least recently used (LRU)
68. Mapping to Physical Memory
69. Protection
- Each process has its own virtual memory, but the physical memory is shared
- A multiprogrammed machine must provide protection for its users
- The operating system maps the individual virtual memories to disjoint physical pages
- Requires two modes of execution: user mode and supervisor mode
- Only supervisor mode can modify the page table
- Information is shared among processes by using protection bits in the page table
- Each page table entry contains protection bits (read, write, execute)
- Each memory access is checked against the protection bits
- A violation generates an interrupt (segmentation fault)
70. Performance of Virtual Memory
- If every program in a multiprogramming environment fits into RAM, then virtual memory never pages (goes to disk)
- If any program doesn't fit into RAM, then the VM system must page between RAM and disk
- Paging is very costly
- A disk access (4 KB) can take 10 ms; in 10 ms, a processor can execute 20 million instructions
- Basically, you really don't want to page very often, if you can avoid it
- Thrashing
71. Making Address Translation Fast
- A cache for address translations: the translation look-aside buffer (TLB)
72. TLB
- Translation look-aside buffer
- 32 - 4096 entries
- Fully associative or set associative
- Hit time: 0.5 - 1 cycle
- Miss penalty: 10 - 30 cycles
- Hit rate > 99%
- For a TLB hit, the physical memory address is obtained in one cycle
- For a TLB miss, the regular translation mechanism is used
- The TLB is then updated with the new page-number / page-table-entry pair (an average-cost sketch follows)
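With the figures above, the expected translation cost can be estimated; a back-of-the-envelope sketch using assumed midpoint values:

```python
hit_rate, hit_time, miss_penalty = 0.99, 1, 20   # cycles; midpoints of the ranges above
avg = hit_rate * hit_time + (1 - hit_rate) * miss_penalty
print(f"{avg:.2f} cycles per translation on average")  # ~1.19
```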
73. TLBs and Caches
74. TLBs and Caches
75. Modern Memory Systems
76. Modern Systems
77. Modern Systems
- Things are getting complicated!