Title: Chap 7: Memory Hierarchies
1 Chap 7: Memory Hierarchies
- Shivkumar Kalyanaraman
- Rensselaer Polytechnic Institute
- shivkuma@ecse.rpi.edu
- http://www.ecse.rpi.edu/Homepages/shivkuma
2 Overview
- Memory and memory hierarchies overview
- Caching: direct-mapped, set-associative and fully-associative
- Virtual memory: paging, translation lookaside buffers (TLBs), protection issues
- A common framework for caching and virtual memory
- Real stuff: Pentium Pro and PowerPC caches and VM support
3 Why memory hierarchies?
- Processor-DRAM memory gap (latency)
- [Figure: performance (log scale) vs. time, 1980-2000. CPU performance ("Moore's Law") improves about 60%/yr (2X every 1.5 years), while DRAM latency improves only about 9%/yr (2X every 10 years), so the processor-memory performance gap grows roughly 50% per year.]
4 Impact on Performance
- Suppose a processor executes at
- Clock rate: 200 MHz (5 ns per cycle)
- Ideal CPI: 1.1
- 50% arith/logic, 30% load/store, 20% control
- Suppose 10% of memory operations incur a 50-cycle miss penalty
- CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + 0.30 (memops/instr) x 0.10 (misses/memop) x 50 (cycles/miss) = 1.1 cycles + 1.5 cycles = 2.6 (a small sketch of this arithmetic follows below)
- 58% of the time the processor is stalled waiting for memory!
- Every 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
- Need caches to bridge this growing performance gap!
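To make the arithmetic above concrete, here is a minimal C sketch of the same stall-cycle calculation. The numbers come from this slide; the variable names are mine, for illustration only.

```c
#include <stdio.h>

/* Sketch of the stall arithmetic on this slide. */
int main(void) {
    double ideal_cpi    = 1.1;   /* base CPI with a perfect memory system */
    double memop_frac   = 0.30;  /* 30% of instructions are loads/stores  */
    double miss_rate    = 0.10;  /* 10% of memory operations miss         */
    double miss_penalty = 50.0;  /* cycles per miss                       */

    double stall_cpi = memop_frac * miss_rate * miss_penalty;  /* = 1.5 */
    double cpi       = ideal_cpi + stall_cpi;                  /* = 2.6 */

    printf("CPI = %.1f, fraction of time stalled = %.0f%%\n",
           cpi, 100.0 * stall_cpi / cpi);   /* prints about 58% */
    return 0;
}
```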
5 Memories Review
- SRAM
- Value is stored on a pair of inverting gates
- Very fast but takes up more space than DRAM (4 to 6 transistors). Access times: 5-25 ns
- DRAM
- Value is stored as a charge on a capacitor (must be refreshed)
- Very small but slower than SRAM (by a factor of 5 to 10)
- [Figure: DRAM and SRAM cell circuits]
6 Classical DRAM Organization (square)
- [Figure: square RAM cell array. A row decoder driven by the row address asserts the word (row) select line; a column selector and I/O circuits driven by the column address pick from the bit (data) lines. Each intersection represents a 1-transistor DRAM cell.]
- Row and column address together select 1 bit at a time
7 Improvements to DRAM access time
- Page mode and EDO (Extended Data Out) DRAM
- Access a single row and store it in a buffer which acts as an SRAM
- Now, with different column addresses, different bits can be randomly accessed and sent out
- Improvement: 120 ns -> 60 ns (page mode) -> 25 ns (EDO)
- SDRAMs (S = synchronous) clock out bits from the buffer based upon a clock (100 MHz)
- Avoids the need to pass multiple column addresses. Useful for burst access. Time between successive bits after the row access completes: 8-10 ns!
- All these improvements use DRAM circuitry, adding little cost to the system, but improve performance dramatically
8 Exploiting Memory Hierarchy
- Users want large and fast memories!
- SRAM: access times 2-25 ns, cost $100 to $250/MByte
- DRAM: access times 60-120 ns, cost $5 to $10/MByte
- Disk: access times 10 to 20 million ns, cost $0.10 to $0.20/MByte
- But we know that smaller, not larger, is faster!
- Need to create the illusion of large, fast memories
- Can we use a memory hierarchy?
9 Locality enables hierarchies ...
- Although memory allows random access, programs access relatively small portions of it at any time
- Specifically, if an item is referenced:
- It will tend to be referenced again soon (temporal locality)
- Nearby items may be referenced soon (spatial locality)
- Trick: let us capture these small portions and place them in a small, fast memory (which we can build)! (See the sketch below.)
- Programmers and compilers must increasingly cooperate to enhance locality in programs, else the cache won't deliver and performance will degrade!
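As an illustrative example of both kinds of locality (this function is mine, not from the slides), consider a simple C loop that sums an array:

```c
/* Illustrative only: summing an array.
 * - Spatial locality: a[i], a[i+1], ... are adjacent in memory, so one
 *   cache block fetch services several consecutive iterations.
 * - Temporal locality: 'sum' and 'i' are reused every iteration and stay
 *   in registers or the cache. */
double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```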
10 Modern memory hierarchy
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology
- [Figure: memory hierarchy levels, with speed (ns) ranging from about 1 up to 10,000,000,000 and size (bytes) from 100s through KB, MB, GB, up to TB.]
11 Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
- Hit rate: the fraction of memory accesses found in the upper level
- Hit time: time to access the upper level, which consists of SRAM access time + time to determine hit/miss
- Miss: data must be retrieved from a block in the lower level (Block Y)
- Miss rate: 1 - (hit rate)
- Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
12 Cache
- Two issues:
- How do we know if a data item is in the cache? (hit check)
- If it is, how do we find it?
- Simple design:
- Block size is one word of data
- Direct mapped: for each item of data at the lower level, there is exactly one location in the cache where it might be
- e.g., lots of items at the lower level share locations in the upper level
13 Direct Mapped Cache
- Mapping: address is modulo the number of blocks in the cache
- Cache index = (block address) mod (cache size in blocks)
- Cache index can be derived as a bit field in the address
- E.g.: cache = 256 bytes, block size = 32 bytes, byte address = 300 (see the sketch below)
- Cache index = (300/32) mod (256/32) = 9 mod 8 = 1
- Note: 300 = 1 001 01100 in binary; ignore the 5 LSBs (since block size = 32), and the next 3 bits (001) give the index
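A minimal C sketch of this index calculation, using the slide's numbers (256-byte cache, 32-byte blocks, byte address 300); the variable names are assumptions for illustration.

```c
#include <stdio.h>

/* Sketch of the direct-mapped index calculation on this slide. */
int main(void) {
    unsigned cache_bytes = 256, block_bytes = 32;
    unsigned num_blocks  = cache_bytes / block_bytes;  /* 8 blocks */

    unsigned addr  = 300;
    unsigned block = addr / block_bytes;    /* block address = 9      */
    unsigned index = block % num_blocks;    /* cache index   = 1      */
    unsigned tag   = block / num_blocks;    /* remaining upper bits   */

    printf("block=%u index=%u tag=%u\n", block, index, tag);
    return 0;
}
```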
14 Direct Mapped Cache
- For MIPS
- Question: what kind of locality are we exploiting here?
15 Direct Mapped Cache: larger blocks
- Taking advantage of spatial locality
16 Hits vs. Misses
- Read hits: this is what we want!
- Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart
- Write hits:
- Can replace data in both cache and memory (write-through)
- Use a buffer and write controller to cut stalls (write buffer)
- Write the data only into the cache (write-back); write data to memory when the block is being tossed out of the cache
- Write misses:
- Read the entire block into the cache, then write the word
- Note: the miss penalty has increased (larger blocks transferred per write) when we try to attack the miss rate (through use of larger blocks for spatial locality)
17 Reducing the miss penalty: hardware
- Memory read time = address send time + DRAM access time + data transfer time. Access time dominates.
- Speed up:
- 1. Through parallel access of multiple words
- 2. Through a wider bus for transfers
- 1 + 2 => wide memory; 1 only => interleaved memory
18 Performance
- Increasing the block size tends to decrease the miss rate, unless the total number of blocks is very small
- Split caches (one for instructions and one for data) are preferred because you double bandwidth by accessing both simultaneously
19 Performance
- Simplified model: execution time = (execution cycles + stall cycles) x cycle time
- Stall cycles = # of instructions x miss ratio x miss penalty (see the sketch below)
- Where do we head from here?
- Two ways of improving performance:
- Decreasing the miss ratio: increase associativity
- Decreasing the miss penalty: multi-level caches
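The simplified model above can be written as a small C helper. This is a sketch under the slide's assumptions (miss ratio counted per instruction), with hypothetical parameter names.

```c
/* Sketch of the simplified performance model on this slide.
 * miss_ratio is misses per instruction, as in the slide's formula. */
double exec_time_seconds(double instr_count, double base_cpi,
                         double cycle_time_s,
                         double miss_ratio, double miss_penalty_cycles) {
    double exec_cycles  = instr_count * base_cpi;
    double stall_cycles = instr_count * miss_ratio * miss_penalty_cycles;
    return (exec_cycles + stall_cycles) * cycle_time_s;
}
```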
20 Decreasing miss ratio with associativity
- Smart placement of blocks
- Fully associative: a block can be placed anywhere in the cache
- Set associative: a block can be placed anywhere within one set of blocks in the cache
21 Implementation: 4-way set associative cache
22 Performance
- Huge improvements up to 4-way associativity. Then we see diminishing returns, combined with a higher cost of implementation.
23 Decreasing miss penalty with multilevel caches
- Add a second-level cache:
- Often the primary cache is on the same chip as the processor
- Use SRAMs to add another cache above primary memory (DRAM)
- Miss penalty goes down if data is in the 2nd-level cache
- Example:
- CPI of 1.0 on a 500 MHz machine with a 5% miss rate and 200 ns DRAM access => new CPI = 6.0
- Adding a 2nd-level cache with 20 ns access time and a miss rate of 2% => new CPI = 3.5 (see the sketch below)
- Using multilevel caches:
- Try to optimize the hit time on the 1st-level cache
- Try to optimize the miss rate on the 2nd-level cache
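A small C sketch of the arithmetic behind this example (500 MHz => 2 ns cycle, so 200 ns DRAM = 100 cycles and 20 ns L2 = 10 cycles); the variable names are mine.

```c
#include <stdio.h>

/* Sketch of the multilevel-cache CPI example on this slide. */
int main(void) {
    double base_cpi     = 1.0;
    double dram_penalty = 200.0 / 2.0;  /* 200 ns DRAM = 100 cycles at 2 ns/cycle */
    double l2_penalty   = 20.0  / 2.0;  /* 20 ns L2    =  10 cycles               */
    double l1_miss_rate = 0.05;         /* 5% of accesses miss in the primary cache */
    double l2_miss_rate = 0.02;         /* 2% still miss after the 2nd-level cache  */

    double cpi_no_l2   = base_cpi + l1_miss_rate * dram_penalty;        /* = 6.0 */
    double cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty
                                  + l2_miss_rate * dram_penalty;        /* = 3.5 */

    printf("CPI without L2 = %.1f, with L2 = %.1f\n", cpi_no_l2, cpi_with_l2);
    return 0;
}
```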
24 Virtual Memory
- Main memory can act as a cache for the secondary storage (disk); used for multiprogramming, not for direct disk access
- Goals:
- Illusion of having more physical memory
- Program relocation support (relieves programmer burden)
- Protection: one program does not read/write the data of another
25 Virtual memory terminology
- Page: equivalent of a block. Fixed size.
- Page fault: equivalent of a miss
- Virtual address: equivalent of a tag
- No cache-index equivalent: fully associative. A table index appears because VM uses a different (page-table) implementation of full associativity.
- Physical address: translated value of the virtual address. Can be smaller than the virtual address. No equivalent in caches.
- Memory mapping (address translation): converting virtual to physical addresses. No equivalent in caches.
- Valid bit: same as in caches
- Referenced bit: used to approximate the LRU replacement algorithm
- Dirty bit: used to optimize write-back
26 Virtual memory design issues
- Miss penalty is huge: disk access time is millions of cycles!
- Highest priority: minimize misses (page faults)
- Use a fully-associative cache, large block sizes, and complex replacement strategies (implemented in software); they are worth the price!
- Use write-back instead of write-through. This is called copy-back in VM. Write back to disk only if the page has actually been written (dirty bit).
- Page fault => the operating system schedules another process!
- Protection support:
- Break up a program's code and data into pages. Add a process id to the cache index; use separate tables for different programs.
- The OS is called via an exception; it handles page faults and protection
27 Page Tables implement full associativity
- Place a virtual page anywhere in physical memory
- Index the page table by the virtual page number to find the physical page address (or disk address) (see the sketch below)
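A hypothetical C sketch of the single-level page-table lookup described here; the 4 KB page size, types, and names are assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                 /* assume 4 KB pages */

typedef struct {
    bool     valid;                   /* page resident in physical memory?            */
    uint32_t physical_page;           /* physical page number (or disk addr if !valid) */
} pte_t;

/* Translate a virtual address using a single-level page table.
 * Returns true on success; false means a page fault (OS fetches from disk). */
bool translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;              /* virtual page number: table index */
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1); /* offset within the page           */

    if (!page_table[vpn].valid)
        return false;                                   /* page fault */

    *paddr = (page_table[vpn].physical_page << PAGE_SHIFT) | offset;
    return true;
}
```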
28 Using page tables to access a word
- Separate page tables per program
- The page table, page table register, and register values form the state of a program, which is called a process
29 Making Address Translation Fast
- Why? A page-table (memory) access is required for every memory access (100% overhead)!
- A hardware cache for address translations: the translation lookaside buffer (TLB). It caches entries from the page table. (See the sketch below.)
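A hypothetical C sketch of a fully associative TLB lookup; the 64-entry size, structure fields, and names are assumptions, and real hardware compares all entries in parallel rather than in a loop.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t vpn;        /* virtual page number (tag)   */
    uint32_t ppn;        /* cached physical page number */
} tlb_entry_t;

/* A TLB hit avoids the extra page-table memory access. */
bool tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;     /* TLB hit */
        }
    }
    return false;            /* TLB miss: walk the page table, then refill */
}
```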
30 TLBs and caches: hierarchy in action
31 A common framework for virtual memory and caches
- Where can a block be placed? (Placement)
- Associativity: direct-mapped, set- or fully-associative
- More associativity is great for small cache sizes
- For larger cache sizes, the relative improvement is at best slight (diminishing returns)
- How is a block found? (Search)
- Direct-mapped, set-associative, fully associative, table/TLB
- Cost of a miss vs. cost of implementation
- Cost of comparators is high; miss rate improvements are small
- For VM, fully associative is a must since misses are very expensive; the extra table is also worth it. Software replacement schemes and large page sizes make it even more worthwhile.
32 Common Framework (contd)
- Set-associative placement is used for caches and TLBs in general
- Which block should be replaced on a miss? (Replacement)
- LRU is costly; only approximations are implemented, for VM
- Random for TLBs and caches, with some hardware assistance. Penalty is only about 1.1 times that of LRU.
- What happens on a write? (Write policies)
- Write-through: caches, usually with write buffers
- Misses are simpler to handle and cheaper, since no write to the lower level is required; easy to implement
- Write-back (copy-back): used in VM
- Write to the cache only => at the speed of the cache
33 Common framework (contd)
- Multiple writes within a block require only one write to the lower level
- Can use high-bandwidth transfer effectively to write back the entire block
- Three C's model:
- Compulsory misses: cold-start misses
- Capacity misses: when the cache can't contain all the blocks needed
- Conflict misses: conflicts for positions within a set. Also called collision misses.
- Challenge: techniques that improve miss rate also affect some other aspect of performance or cost negatively (hit time, miss penalty)
34 Modern Systems
- Very complicated memory systems
35 Current issues in memory hierarchies
- Processor speeds continue to increase very fast
- Much faster than either DRAM or disk access times
- Design challenge: dealing with this growing disparity
- Trends:
- Synchronous SRAMs (provide a burst of data)
- Redesign DRAM chips to provide higher bandwidth or processing: SDRAMs, page mode, EDO
- Restructure code to increase locality (compiler)
- Use pre-fetching in multi-level caches
- Out-of-order execution: find other instructions to execute in superscalar architectures during a cache miss
- Intelligent RAMs (IRAM): integrate memory and processor on one chip!
36 Summary
- Memory hierarchy concepts
- Caches, Virtual memories, techniques, performance
- Similarities, differences.
- Real stuff and current issues
37 Extras: More VM issues
- The OS does not save the page table while swapping programs; it just loads the page table register to point to the new table
- The OS is responsible for allocating physical memory and updating page tables. It also creates space on disk to store all pages of a process when it creates the process, and the data structure to point to this disk location.
- LRU approximation: hardware sets the referenced bit when a page is touched. The OS periodically reads and clears these bits; this information is used to track usage.
- Approaches to minimize the size of the page table: e.g., page the page tables themselves!
- Process switching => the TLB needs to be flushed and reloaded => loses locality. Hence the process id is concatenated with the tag portion of the TLB to avoid the need to clear the TLB on every context switch.
38 VM issues (contd)
- A user process cannot change page tables => they are under the control of the OS
- Hardware requirements for protection:
- Provide at least two modes: user and supervisor mode
- Mode-switching mechanisms in MIPS: the syscall instruction (system call exception) and the return-from-exception (RFE) instruction
- Space which a user process can read but not write
- Some support for shared memory
- A TLB miss need not imply a page fault. Both are handled by the OS.
- Several subtle issues here, including the handling of exceptions during the exception-handling phase itself. Need the capability to disable and enable exceptions. Protection violations must also be reported to the OS through exceptions.