Transcript and Presenter's Notes

Title: Memory Hierarchy


1
Memory Hierarchy
  • Chapter 7

2
The Big Picture Where are We Now?
  • The Five Classic Components of a Computer

3
Memory
  • Memory is required for storing
  • Program
  • Data
  • Characteristics
  • Access mode
  • Sequential vs. random access (RAM)
  • Alterability
  • read-only memory vs. read-write memory
  • Access time
  • Price
  • dollars / byte

4
Memories Review
  • SRAM
  • value is stored on a pair of inverting gates
  • The value can be kept indefinitely as long as
    power is applied
  • very fast but takes up more space than DRAM (4 to
    6 transistors)
  • DRAM
  • value is stored as a charge on capacitor (must be
    refreshed)
  • very small but slower than SRAM (factor of 5 to
    10)

5
SRAM vs. DRAM
  • SRAM
  • Fast switching due to low impedance of
    transistors
  • Fast access: 0.5 - 5 ns
  • High power
  • Large area
  • 6 transistors / cell
  • 2 lines: bit and negated bit line
  • High cost: $4000 to $10000 per GB (2004)
  • DRAM
  • Slow reading because of high resistance and high
    capacitance
  • Slow access: 50 - 70 ns
  • Low power
  • Small area
  • 1 transistor + 1 capacitor
  • Vertically built cells
  • Low cost: $100 to $200 per GB

6
Processor-DRAM Memory Gap (latency)
[Chart: processor performance grows ~60%/yr (2X/1.5 yr, "Moore's Law") while DRAM latency improves only ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year. Log-scale performance axis (1 - 1000), plotted over 1980 - 2000.]
7
Memory Hierarchy
  • User wants large and fast memories!
  • Conflicting goals
  • By taking advantage of the principle of
    locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

8
Principle of Locality
  • Memory locations are not accessed with uniform
    frequency; certain locations are accessed much
    more often than others.
  • Temporal locality: if an item is referenced, it
    will tend to be referenced again soon.
  • Spatial locality: nearby items will tend to be
    referenced soon.
  • Why does code have locality?

9
Hierarchical Access to Data
  • Our initial focus: two levels (upper, lower)
  • Cache: the fast storage that takes advantage of
    locality of access
  • Block: minimum unit of data
  • Hit: data requested is in the cache
  • Miss: data requested is not in the cache
  • When an item is referenced
  • The main memory block number is extracted from
    the address
  • The block number is looked up in the cache
    directory
  • If we have a hit, data is extracted from the
    cache
  • If we have a miss, a copy of the block is
    transferred from main memory to the cache

10
Hit / Miss
  • Hit: data is directly available in the cache - no
    penalty
  • Hit time: time to retrieve a data item on a hit
  • Miss: data is on a lower level - penalty of 1 -
    30 cycles!
  • Miss time: time to retrieve a data item on a miss
  • Miss time = hit time + miss penalty
  • Hit ratio: the ratio of hits to the total number
    of memory accesses at a particular memory level
  • Miss ratio = 1 - hit ratio (see the sketch below)
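A minimal numeric sketch of these formulas; the hit time, penalty, and hit ratio below are illustrative assumptions, not figures from the slides:

```python
# Sketch: miss time and average memory access time from the definitions above.
hit_time = 1          # cycles (assumed)
miss_penalty = 30     # cycles (assumed, upper end of the 1 - 30 range)
hit_ratio = 0.95      # assumed
miss_ratio = 1 - hit_ratio            # miss ratio = 1 - hit ratio
miss_time = hit_time + miss_penalty   # miss time = hit time + miss penalty

average_access = hit_ratio * hit_time + miss_ratio * miss_time
print(f"miss time = {miss_time} cycles, average access = {average_access:.2f} cycles")
```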

11
Access
  • Questions
  • How do we know if a data item is in the cache?
  • If it is, how do we find it?
  • Simple approach: direct mapping
  • For now, we assume the block size is a single
    word
  • For each item of data at the lower level, there
    is exactly one location in the cache where it
    might be.
  • e.g., lots of items at the lower level share
    locations in the upper level
  • Mapping: cache index = (block address) mod
    (number of blocks in the cache) - see the sketch
    below
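A minimal sketch of the direct-mapping rule; the function name is my own, the slide only gives the modulo formula:

```python
# Sketch: direct mapping sends each memory block to exactly one cache block.
def cache_index(block_address: int, num_cache_blocks: int) -> int:
    return block_address % num_cache_blocks

# With 8 cache blocks, memory blocks 5, 13, and 21 all share index 5.
print([cache_index(b, 8) for b in (5, 13, 21)])  # [5, 5, 5]
```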

12
Direct Mapped Cache Example
  • Mapping: the address is taken modulo the number
    of blocks in the cache

13
Access Example
  • Memory with 32 locations
  • Cache with 8 locations
  • Cache addresses are identical to the lower 3
    bits of the memory address
  • A tag at the cache memory location holds the
    high-order bits of the memory address
  • Complete address = tag + cache address
  • Valid information: is the content still valid?

14
Example
  • Initial state
  • Miss at 10110

15
Example
  • After miss at 11010
  • After miss at 10000

16
Example
  • After miss at 00011
  • After miss at 10010

17
A more realistic cache example
  • 32-bit data width
  • 32-bit byte address (30-bit word address)
  • Cache size: 1k blocks; block size: 1 word
  • 10-bit cache index
  • 20-bit tag
  • 2-bit byte offset (need not be addressed if we
    use word alignment to 32-bit boundaries)
  • Valid bit

18
Cache access
  • Compare cache tag with address bits 31..12
  • Check valid bit
  • Signal a hit to the CPU
  • Transfer data
  • What kind of locality are we taking advantage of?

19
Cache size
  • Cache memory size
  • 1024 × 32 bit = 32 kbit
  • Tag memory size
  • 1024 × 20 bit = 20 kbit
  • Valid information
  • 1024 × 1 bit = 1 kbit
  • Efficiency
  • 32 / 53 = 60.4% only!
  • General size of a one-word direct-mapped cache:
    2^n × (32 + (32 - n - 2) + 1) bits, for data
    width 32 bits and 2^n blocks (see the sketch
    below)
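A small check of this bit-count formula, as a sketch; the function name is my own:

```python
# Sketch: total bits in a one-word direct-mapped cache with 2**n blocks.
def cache_bits(n: int) -> int:
    data = 32           # one 32-bit word per block
    tag = 32 - n - 2    # address bits minus n index bits and 2 byte-offset bits
    valid = 1
    return 2**n * (data + tag + valid)

n = 10                              # 1024 blocks, as in the slide
total = cache_bits(n)
print(total)                        # 54272 bits = 53 kbit
print(f"{1024 * 32 / total:.1%}")   # data efficiency: 60.4%
```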

20
Spatial locality
  • So far we didn't take advantage of spatial
    locality
  • Basic idea
  • Whenever we have a miss, load a group of adjacent
    memory cells into the cache, i.e., have a larger
    block
  • Direct-mapped block mapping: cache index =
    (block address) mod (number of cache blocks)
  • Address components (64 KB cache, 4-word block
    size)
  • Tag: bits 31 - 16
  • Index: bits 15 - 4
  • Block offset: bits 3 - 2
  • Byte offset: bits 1 - 0

21
64 KB cache with 4 word blocks
22
Example
  • Consider a direct-mapped cache with 64 blocks and
    a block size of 16 bytes (4 words). What is the
    cache index of byte address 1200?
  • Block address of byte address 1200
  • Word address = 1200 / 4 = 300
  • Block address = 300 / 4 = 75
  • Cache index
  • 75 mod 64 = 11 (see the sketch below)
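The same computation as a sketch, straight from the byte address:

```python
# Sketch: byte address -> cache index for a direct-mapped cache
# with 64 blocks of 16 bytes (4 words) each.
byte_address = 1200
block_size = 16      # bytes
num_blocks = 64

block_address = byte_address // block_size  # 1200 // 16 = 75
index = block_address % num_blocks          # 75 % 64 = 11
print(index)                                # 11
```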

23
Example
  • Consider a series of address references given as
    word addresses 22, 24, 25, 20
  • Assuming a 16-word direct-mapped cache with
    one-word blocks, compute the cache index of each
    reference and label it as a hit or miss
  • What are the results assuming a 16-word
    direct-mapped cache with 4 four-word blocks? (A
    checking sketch follows.)
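A short simulation, as a sketch, to check the exercise; the helper name is my own:

```python
# Sketch: simulate a direct-mapped cache on the word-address trace 22, 24, 25, 20.
def simulate(word_addresses, num_blocks, words_per_block):
    cache = {}  # index -> block address currently stored
    for addr in word_addresses:
        block = addr // words_per_block
        index = block % num_blocks
        hit = cache.get(index) == block
        cache[index] = block
        print(f"addr {addr}: index {index}, {'hit' if hit else 'miss'}")

refs = [22, 24, 25, 20]
simulate(refs, num_blocks=16, words_per_block=1)  # all four miss
simulate(refs, num_blocks=4, words_per_block=4)   # miss, miss, hit, hit
```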

24
Optimal block size
  • Small block size
  • High miss rate
  • Short block loading time
  • Ignoring spatial locality
  • Large block size
  • Low miss rate
  • Long block reloading time
  • 1 miss requires n words to be loaded, n = block
    size
  • Optimization strategies
  • Early restart
  • Requested word first

25
Miss rate / block size
26
Harvard vs. von Neumann
  • Split caches
  • Higher miss rate due to their size
  • No conflicts when accessing data and instruction
    at the same time
  • Higher bandwidth due to separate data paths
  • Combined caches
  • Lower miss rate due to their size
  • Possibly stalls due to the simultaneous access to
    data and instructions
  • Lower bandwidth due to sharing of resources

27
Cache Reads
  • Cache hit - just continue
  • Access data from data memory (data cache)
  • Access instructions from instruction memory
    (instruction cache)
  • Miss
  • Stall the complete processor
  • Activate the memory controller
  • Get the information from the next lower level of
    cache or the main memory
  • Load the information into the cache
  • Resume as before

28
Cache Write
  • Write hit: must maintain consistency between
    cache and memory
  • can update the data in cache and memory
    (write-through)
  • write the data only into the cache (write back
    to memory later)
  • Only in data cache
  • Write Miss
  • read the entire block into the cache, then write
    the word
  • No need to read the block if the block size is
    one word. Why?

29
Write through and write back
  • Write through
  • Update cache and memory at the same time
  • Requires a buffer because the memory cannot
    accept data as fast as the processor can generate
    writes
  • Write back
  • Keep data in cache and write it back when the
    cache contents are replaced
  • Requires more effort for the cache contents
    replacement unit

30
Example cache
  • DECstation 3100, based on the MIPS R2000
  • Cache
  • Separate instruction and data caches
  • Each 64 KB
  • 16 k words
  • Block size: 1 word

31
Example
  • Write through
  • Use bits 15 - 2 as the cache index
  • Write bits 31 - 16 into the tag
  • Write word into cache memory
  • Set valid bit
  • Write to memory
  • Performance problem of writing to memory
  • Cannot wait for the word to be written in memory
  • Solution
  • Introduce a write buffer (4 words) between
    processor and memory

32
Memory system design
  • Hypothetical access times for a DRAM
  • 1 clock cycle to send the address
  • 15 clock cycles for each DRAM access initiated
  • 1 clock cycle to send a word
  • Memory organization
  • Block of four words
  • Memory access: 1 word at a time
  • Miss penalty
  • 1 + 4 × 15 + 4 × 1 = 65 cycles
  • Bytes / cycle: 4 × 4 / 65 ≈ 0.25

33
Memory organisation
34
Memory organization
  • Option 1
  • Access path 1 word wide
  • High penalty
  • Option 2
  • Bus and memory wide, e.g., equal to the block
    size
  • Penalty drops to
  • 1 + 1 × 15 + 1 × 1 = 17 cycles
  • Option 3
  • Bus width 1 word, but memory organized in banks
  • Penalty drops to 1 + 1 × 15 + 4 × 1 = 20 cycles
    (see the sketch below)
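The three penalties as a sketch, using the timing assumptions from the previous slide:

```python
# Sketch: miss penalty for the three organizations (4-word block;
# 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per transfer).
SEND, ACCESS, XFER = 1, 15, 1
WORDS = 4

one_word_wide = SEND + WORDS * ACCESS + WORDS * XFER  # sequential accesses
block_wide    = SEND + 1 * ACCESS + 1 * XFER          # one wide access and transfer
interleaved   = SEND + 1 * ACCESS + WORDS * XFER      # banks overlap the accesses

print(one_word_wide, block_wide, interleaved)  # 65 17 20
```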

35
Summary
  • Memory hierarchy
  • Cache
  • Direct-mapped
  • Hit / miss
  • Miss rate vs. block size
  • Memory organization

36
Cache performance
  • The performance of a cache depends on many
    parameters
  • Memory stall clock cycles
  • all stall cycles caused by memory accesses
  • Read stall clock cycles and write stall clock
    cycles
  • Instruction cache stalls and data cache stalls
  • CPU time = (CPU execution cycles + memory stall
    cycles) × cycle time
  • Memory stall cycles = memory accesses × miss
    ratio × miss penalty
  • (assuming the same miss penalty for read and
    write stalls)
  • Two ways of improving performance
  • decreasing the miss ratio
  • decreasing the miss penalty
  • What happens if we increase block size?

37
Example
  • A machine has a CPI of 2 without memory stalls
  • Instruction cache miss rate: 2%
  • Data cache miss rate: 4%; 36% of all instructions
    are memory accesses
  • Miss penalty: 100 cycles
  • How much faster would the machine run with a
    perfect cache that never missed?

38
Example
  • Stall cycles
  • Instruction miss cycles = I × 2% × 100 = 2.00 I
  • Data miss cycles = I × 36% × 4% × 100 = 1.44 I
  • CPI with memory stalls = 2 + 2.00 + 1.44 = 5.44
  • Ratio of CPU execution times = 5.44 / 2 = 2.72
    (see the sketch below)
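The same arithmetic as a sketch, per instruction (I = 1):

```python
# Sketch: CPI with memory stalls for the example above.
base_cpi = 2.0
i_stalls = 0.02 * 100          # instruction cache: 2% miss rate × 100-cycle penalty
d_stalls = 0.36 * 0.04 * 100   # 36% memory accesses × 4% miss rate × 100 cycles

cpi = base_cpi + i_stalls + d_stalls
print(cpi)             # 5.44
print(cpi / base_cpi)  # 2.72x faster with a perfect cache
```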

39
Acceleration of the CPU
  • Assumption
  • Currently CPI = 2, stall cycles / instruction =
    3.44
  • Improvement
  • Clock rate constant
  • CPI improved from 2 to 1
  • System behaviour
  • A system with a perfect cache would be 4.44 / 1 =
    4.44 times faster
  • Time spent on memory stalls
  • Originally 3.44 / 5.44 = 63%
  • Now 3.44 / 4.44 = 77%

40
Acceleration of CPU
  • If we double the clock rate without changing the
    memory system, how much faster will the new
    machine be?
  • Measured in the faster clock cycles, the miss
    penalty will be twice as long: 200 cycles
  • Total miss cycles per instruction:
  • 2% × 200 + 36% × 4% × 200 = 6.88
  • CPI = 2 + 6.88 = 8.88
  • Speedup = 5.44 × 2 / 8.88 = 1.23

41
Three Cs of cache misses
  • Compulsory misses: caused by the first access to
    a block that has never been in the cache
  • How to reduce compulsory misses?
  • Capacity misses: caused when the cache cannot
    contain all the blocks needed during execution of
    a program
  • How to reduce capacity misses?
  • Conflict misses: caused when multiple blocks
    compete for the same location in the cache, which
    can be very bad in a direct-mapped cache
  • How to reduce conflict misses?

42
Decreasing miss ratio with associativity
  • Direct mapped cache
  • Every memory block goes to exactly one block in
    the cache
  • Easy to find
  • (Block no.) mod (no. of cache blocks)
  • Use it as the index to the referenced word
  • Fully associative cache
  • A memory block can go in any block of the cache
  • Lower miss rate
  • Difficult to find
  • Longer hit time
  • Search all tags to see if the word is the
    requested one

43
Cache Organizations
  • Set associative cache
  • A memory block goes to a set of blocks
  • The minimum set size is 2
  • Finding the set
  • cache index = (block no.) mod (no. of sets in
    the cache)
  • It is required to check which element of the set
    contains the data

44
Mapping of an eight block cache
45
Cache types
  • Direct mapped / Set associative / Fully
    associative

What is the cache index for the block with address 12?
46
Cache Miss Example
  • Three small caches, each with four one-word
    blocks
  • Fully associative, two-way set associative, and
    direct mapped
  • Find the number of misses for each cache
    organization given the following sequence of
    block addresses 0, 8, 0, 6, 8

47
Example
  • Direct mapped
  • 0 -> 0 mod 4 = 0
  • 6 -> 6 mod 4 = 2
  • 8 -> 8 mod 4 = 0
  • 5 misses
  • Set associative (two-way)
  • 0 -> 0 mod 2 = 0
  • 6 -> 6 mod 2 = 0
  • 8 -> 8 mod 2 = 0
  • 4 or 3 misses (depending on replacement)

48
Example (cont.)
  • Fully associative
  • 3 misses (see the simulation sketch below)
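A small simulation, as a sketch, of all three organizations on the block-address trace 0, 8, 0, 6, 8; LRU replacement is assumed for the associative cases, and the helper name is my own:

```python
from collections import OrderedDict

# Sketch: miss count for a 4-block cache at three associativities (LRU).
def misses(trace, num_blocks, ways):
    sets = [OrderedDict() for _ in range(num_blocks // ways)]
    count = 0
    for block in trace:
        s = sets[block % len(sets)]
        if block in s:
            s.move_to_end(block)       # refresh LRU order on a hit
        else:
            count += 1
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used block
            s[block] = True
    return count

trace = [0, 8, 0, 6, 8]
print(misses(trace, 4, 1))  # direct mapped: 5
print(misses(trace, 4, 2))  # two-way set associative: 4
print(misses(trace, 4, 4))  # fully associative: 3
```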

49
Performance improvement
  • Associativity influences the performance

50
Miss Rate
51
Replacement strategy
  • Direct mapping - no choice
  • Fully associative: any position is allowed -
    which one to choose?
  • Evict a block that won't be used again
  • If all blocks will be used again, then evict the
    block that will not be used for the longest
    period of time
  • Guarantees the lowest possible miss rate
  • Can't be done unless we can tell the future
  • Most often used scheme: LRU (Least Recently Used)
  • Mark which element has not been used for the
    longest time
  • Random
  • Easy to implement
  • Only about 1.1 times worse than LRU

52
Locating a block in Set Associative Cache
  • Address portions
  • Tag - must be compared to all elements in the set
  • Index - selects the set
  • Block offset
  • Hardware effort for comparison increases linearly
    with the number of elements in the set
  • More time for comparison means a longer hit time
  • Example
  • Cache size: 4 KB, 4-way set associative, 1-word
    blocks
  • Which bits in a 32-bit address are used for cache
    index?
  • Which bits are the tag?
  • How about 4-word blocks?

53
Example four-way associative cache
4 KB cache with 1-word blocks, 4-way set
associative
54
Decreasing miss penalty with multilevel caches
  • Different technology costs for the cache levels
  • First-level cache on the same die as the
    processor
  • Use SRAMs to add another cache above primary
    memory (DRAM)
  • Miss penalty goes down if the data is in the
    2nd-level cache
  • Different optimisation strategies
  • Primary level: minimal hit time
  • Frequency as close to the CPU clock as possible
  • Secondary level: minimal miss rate
  • Larger size
  • Larger block size
  • Fewer accesses to main memory

55
Performance of Multi-level Cache
  • 5 GHz processor
  • CPI = 1.0 without misses
  • Main memory access time: 100 ns
  • Miss rate per instruction at the primary cache:
    2%
  • How much faster is the machine if we add a
    secondary cache with 5 ns access time that
    reduces the miss rate to main memory to 0.5%?

56
Example
  • Without secondary cache
  • Miss penalty to main memory is
  • 100 ns / 0.2 ns = 500 cycles
  • Effective CPI
  • 1 + 500 × 2% = 11
  • With secondary cache
  • Miss penalty to the secondary cache is
  • 5 ns / 0.2 ns = 25 cycles
  • Effective CPI
  • 1 + 25 × 2% + 500 × 0.5% = 4
  • The machine with the secondary cache is
  • 11 / 4 ≈ 2.8 times faster (see the sketch below)
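The same computation as a sketch, using the slide's numbers:

```python
# Sketch: effective CPI with and without a second-level cache.
cycle_time = 1 / 5e9                  # 5 GHz -> 0.2 ns per cycle
main_penalty = 100e-9 / cycle_time    # 500 cycles
l2_penalty = 5e-9 / cycle_time        # 25 cycles

cpi_l1_only = 1 + 0.02 * main_penalty                       # 11.0
cpi_with_l2 = 1 + 0.02 * l2_penalty + 0.005 * main_penalty  # 4.0
print(cpi_l1_only / cpi_with_l2)                            # 2.75, i.e. ~2.8x faster
```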

57
Cache Complexities
[Plots: theoretical vs. observed behavior of Radix sort and Quicksort]
  • Not always easy to understand implications of
    caches

58
Cache Complexities
  • Here is why
  • Memory system performance is often the critical
    factor
  • Multilevel caches and pipelined processors make
    it harder to predict outcomes
  • Compiler optimizations to increase locality
    sometimes hurt ILP
  • Difficult to predict the best algorithm: we need
    experimental data

59
Memory Hierarchies Summary
  • Where can a block be placed in the cache? How is
    a block found?
  • Direct mapped
  • Set associative
  • Fully associative
  • What is the block size?
  • One-word block
  • Multiple word block
  • Which block should be replaced on a cache miss?
  • Least recently used (LRU)
  • Random
  • What happens on a write?
  • Write through
  • Write back

60
Virtual Memory
  • Motivation
  • Allow multiple programs to share the same memory
  • Allow a single program to exceed the size of
    primary memory
  • Virtual memory
  • A hardware-software interface that gives the user
    the illusion of a memory system that is much
    larger than the physical memory
  • The illusion of a larger memory is accomplished
    by using secondary storage to back up the
    primary memory
  • We will focus on page-based virtual memory
  • Page: a virtual memory block

61
Virtual Memory
  • Main memory can act as a cache for the secondary
    storage (disk)
  • Advantages
  • illusion of having more physical memory
  • program relocation
  • protection

62
How Does VM Work
  • Two memory spaces
  • Virtual memory space - what the program sees
  • Physical memory space - what the program runs in
    (size of RAM)
  • On program startup
  • The OS copies the program into RAM
  • If there is not enough RAM, the OS stops copying
    and starts running the program with some portion
    of it loaded in RAM
  • When the program touches a part of itself that is
    not in physical memory (RAM), the OS copies that
    part from disk into RAM
  • To copy part of the program from disk to RAM, the
    OS may have to evict parts of the program already
    in RAM
  • The OS copies the evicted parts back to disk if
    the pages are dirty (i.e., if they have been
    written to and changed)

63
Pages: virtual memory blocks
  • Page fault: the data is not in memory; retrieve
    it from disk
  • Huge miss penalty, thus pages should be fairly
    large (e.g., 4 KB)
  • Reducing page faults is important (LRU is worth
    the price)
  • Faults can be handled in software instead of
    hardware
  • Write-through is too expensive, so we use
    write-back

64
Page Tables
  • Use fully associative placement because of the
    high overhead of a page fault
  • Use a page table to map virtual memory addresses
    to physical memory addresses

65
Page Tables

Page size = 2^12 = 4 KB. Page table size = 2^20
entries × 4 bytes = 2^22 bytes = 4 MB (see the
sketch below).
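A minimal sketch of flat page-table translation under these parameters (32-bit virtual addresses, 4 KB pages, 4-byte entries); the mapping and names are hypothetical, for illustration only:

```python
# Sketch: virtual -> physical translation with a flat page table.
PAGE_BITS = 12                      # 4 KB pages
NUM_PAGES = 2 ** (32 - PAGE_BITS)   # 2**20 virtual pages

page_table = [None] * NUM_PAGES     # entry: physical page number, or None (on disk)
page_table[0x12345] = 0x00042       # hypothetical mapping for illustration

def translate(vaddr: int) -> int:
    vpn = vaddr >> PAGE_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # offset within the page
    ppn = page_table[vpn]
    if ppn is None:
        raise RuntimeError("page fault: the OS must fetch the page from disk")
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x12345678)))               # 0x42678
print(NUM_PAGES * 4 // 2**20, "MB page table")  # 4 MB
```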
66
What Happens if a Page is not in RAM?
  • How do we know it's not in RAM?
  • The page table entry's valid bit is set to
    INVALID (DISK)
  • What do we do?
  • Ask the OS to fetch the page from disk - we call
    this a page fault
  • Before the page is read from disk, the OS must
    evict a page from RAM (if RAM is full)
  • The page to be evicted is called the victim page
  • If the page to be evicted is dirty, write it
    back to disk
  • Only data pages can be dirty
  • The OS then reads the requested page from disk
  • The OS changes the page table to reflect the new
    mapping
  • Hardware restarts at the faulting virtual address

67
Which Page Should We Evict?
  • Optimal solution
  • Evict a page that won't be referenced (used)
    again
  • If all pages will be used again, then evict the
    page that will not be used for the longest period
    of time
  • Guarantees the lowest possible page fault rate
    (number of faults per second)
  • Can't be done unless we can tell the future
  • Other page replacement algorithms
  • First-in, first-out (FIFO)
  • Least recently used (LRU)

68
Mapping to Physical Memory
69
Protection
  • Each process has its own virtual memory, but the
    physical memory is shared.
  • A multi-program machine must provide protection
    to its users.
  • The operating system maps individual virtual
    memories to disjoint physical pages.
  • Requires two modes of execution: user mode and
    supervisor mode.
  • Only supervisor mode can modify the page table
  • Share information among processes by using
    protection bits in the page table
  • Each page table entry contains protection bits
    (read, write, execute)
  • Each memory access is checked against the
    protection bits
  • A violation generates an interrupt (segmentation
    fault)

70
Performance of Virtual Memory
  • If every program in a multiprogramming
    environment fits into RAM, then virtual memory
    never pages (goes to disk)
  • If any program doesn't fit into RAM, then the VM
    system must page between RAM and disk
  • Paging is very costly
  • A disk access (4 KB) can take 10 ms; in 10 ms, a
    processor can execute 20 million instructions
  • Basically, you really don't want to page very
    often if you don't have to
  • Thrashing

71
Making Address Translation Fast
  • A cache for address translations: the
    translation look-aside buffer (TLB)

72
TLB
  • Translation look-aside buffer
  • 32 - 4096 entries
  • Fully associative or set associative
  • Hit time: 0.5 - 1 cycle
  • Miss penalty: 10 - 30 cycles
  • Hit rate > 99%
  • For a TLB hit, the physical memory address is
    obtained in one cycle
  • For a TLB miss, the regular translation mechanism
    is used.
  • The TLB is updated with the new page-number /
    page-table-entry pair.

73
TLBs and caches
74
TLBs and Caches
75
Modern Memory Systems
76
Modern Systems

77
Modern Systems
  • Things are getting complicated!