Memory Hierarchy - PowerPoint PPT Presentation

Slides: 51
Provided by: Systemtek9
Transcript and Presenter's Notes

1
Memory Hierarchy
  • Memory Hierarchy
  • Reasons
  • Virtual Memory
  • Cache Memory
  • Translation Lookaside Buffer
  • Address translation
  • Demand paging

2
Why Care About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: Performance (log scale, 1 to 1000) versus Time.
 CPU ("Moore's Law"): µProc 60%/yr. (2X/1.5 yr)
 DRAM: 9%/yr. (2X/10 yrs)
 Processor-Memory Performance Gap: grows 50% / year]
3
DRAMs over Time
DRAM Generation
1st Gen. Sample          '84     '87     '90     '93     '96     '99
Memory Size              1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
Die Size (mm²)           55      85      130     200     300     450
Memory Area (mm²)        30      47      72      110     165     250
Memory Cell Area (µm²)   28.84   11.1    4.26    1.64    0.61    0.23
(from Kazuhiro Sakashita, Mitsubishi)
4
Recap
  • Two Different Types of Locality:
  • Temporal Locality (Locality in Time): If an item
    is referenced, it will tend to be referenced
    again soon.
  • Spatial Locality (Locality in Space): If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon.
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.
  • DRAM is slow but cheap and dense:
  • Good choice for presenting the user with a BIG
    memory system.
  • SRAM is fast but expensive and not very dense:
  • Good choice for providing the user FAST access
    time.

5
Memory Hierarchy of a Modern Computer
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

[Figure: levels of the hierarchy, starting at the Processor (Control, Datapath)]
Level                           Speed     Size (bytes)
Registers                       1 ns      100
On-Chip Cache                   X ns      64K
Second Level Cache (SRAM)       10 ns     K..M
Main Memory (DRAM)              100 ns    M
Secondary Storage (Disk)        10 ms     G
Tertiary Storage (Disk/Tape)    10 sec    T
6
Levels of the Memory Hierarchy
Staging / Xfer Unit    Managed by       Size
Instr. Operands        prog             1-8 bytes
Blocks                 cache cntl       8-128 bytes
Pages                  OS               512-4K bytes
Files                  user/operator    Mbytes
(Upper levels are faster, lower levels are larger.)
7
The Art of Memory System Design
Optimize the memory system organization to
minimize the average memory access time for
typical workloads
Reference stream: <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .
op: i-fetch, read, write
8
Virtual Memory System Design
  • Size of information blocks that are transferred
    from secondary to main storage (M)
  • A block of information is brought into M, and M is
    full: some region of M must be released to make
    room for the new block --> replacement policy
  • Which region of M is to hold the new block -->
    placement policy
  • Missing item fetched from secondary memory only on
    the occurrence of a fault --> demand load policy
Paging Organization: virtual and physical address
space partitioned into blocks of equal size
(pages)
9
Address Map
V = {0, 1, . . . , n - 1}   virtual address space
M = {0, 1, . . . , m - 1}   physical address space
MAP: V --> M U {0}          address mapping function
n > m
MAP(a) = a'  if data at virtual address a is present in
             physical address a' and a' in M
       = 0   if data at virtual address a is not present in M
[Figure: the Processor issues virtual address a from Name Space V to the
 Addr Trans Mechanism, which delivers physical address a' to Main Memory.
 On a missing item fault, the fault handler is invoked and the OS performs
 the transfer from Secondary Memory to Main Memory.]
10
Paging Organization
actually, concatenation is more likely
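
A minimal sketch of translation by concatenation, assuming the parameters that appear on later slides (32-bit virtual address, 24-bit physical address, 1024-byte pages, so a 10-bit offset). The names `translate`, `PageFault` and the example `page_table` contents are hypothetical and only serve as illustration:

```python
PAGE_BITS = 10                      # 1024-byte pages (assumed from later slides)
PAGE_SIZE = 1 << PAGE_BITS


class PageFault(Exception):
    """Raised when the virtual page is not resident in main memory."""


def translate(va, page_table):
    """Map a virtual address to a physical address, or raise PageFault."""
    vpn = va >> PAGE_BITS           # upper 22 bits: virtual page number
    offset = va & (PAGE_SIZE - 1)   # lower 10 bits pass through unchanged
    frame = page_table.get(vpn)     # physical page number PA[23:10], or None
    if frame is None:
        raise PageFault(vpn)
    # Concatenate frame number and offset; no adder is needed.
    return (frame << PAGE_BITS) | offset


page_table = {0: 0x2A, 3: 0x01}     # hypothetical mapping: page -> frame
print(hex(translate(0x0C04, page_table)))
```

Because the page size is a power of two, the frame number and the offset occupy disjoint bit fields, which is why concatenation can replace an addition.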
11
Address Mapping
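[Figure: the MIPS PIPELINE issues 32-bit virtual addresses for Instr and
 Data accesses; CP0 maps each 32-bit Virtual Address to a 24-bit Physical
 Address. User Memory holds the user processes; Kernel Memory holds
 Page Table 1, Page Table 2, ..., Page Table n.
 User process 2 is running: here we need page table 2 for address mapping.]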
12
Translation Lookaside Buffer (TLB)
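On TLB hit, the 32-bit virtual address is
translated into a 24-bit physical address by
hardware. We never call the Kernel!
[Figure: the MIPS PIPELINE sends 32-bit virtual addresses to the TLB in CP0,
 which returns a 24-bit physical address to User Memory.
 TLB entry fields: D, R, Physical Addr [23:10], Virtual Address.
 Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]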
13
So Far, NO GOOD
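  • MIPS pipe is clocked at 50 MHz, critical path 20 ns
  • TLB: 5 ns
  • RAM: 60 ns
  • But RAM needs 3 cycles to read/write: STALLS the
    pipe
[Figure: pipeline stages IM, DE, EX, DM with 32-bit instruction and data
 addresses; CP0 with the TLB produces the 24-bit Physical Address; Kernel
 Memory holds Page Table 1, Page Table 2, ..., Page Table n.]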
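
The stall count follows directly from the numbers on this slide. A short worked calculation, using only the 50 MHz clock and the 60 ns RAM access time given above (variable names are illustrative):

```python
clock_hz = 50_000_000                      # pipeline clock from the slide
cycle_ns = 1_000_000_000 // clock_hz       # one cycle in nanoseconds
ram_ns = 60                                # RAM access time from the slide

# Number of whole cycles a RAM access occupies (rounded up).
ram_cycles = -(-ram_ns // cycle_ns)
print(cycle_ns, ram_cycles)
```

A 20 ns cycle matches the 20 ns critical path, so any access slower than one cycle forces the pipe to wait.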
14
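Let's put in a Cache
  • MIPS pipe is clocked at 50 MHz, critical path 20 ns
  • TLB: 5 ns, Cache: 15 ns
  • RAM: 60 ns
  • A cache Hit never STALLS the pipe
[Figure: pipeline stages IM, DE, EX, DM with 32-bit instruction and data
 addresses; CP0 with the TLB, followed by the Cache in front of RAM; Kernel
 Memory holds Page Table 1, Page Table 2, ..., Page Table n.]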
15
Fully Associative Cache
  • 24-bit PA: tag in bits 23..2, bits 1..0 within the data word
  • Check all 2^16 Cache lines
  • Cache Hit if PA[23:2] = TAG
  • Each line: Tag PA[23:2], Data Word PA[1:0]
  • Size: 2^16 * 4 = 256 kB
16
Fully Associative Cache
  • Very good hit ratio (nr hits/nr accesses)
  • But!
  • Too expensive: checking all 2^16 Cache lines
    concurrently
  • A comparator for each line! A lot of hardware

17
Direct Mapped Cache
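  • 24-bit PA: tag in bits 23..18, index in bits 17..2,
    bits 1..0 within the data word
  • The index selects ONE cache line (1 line checked)
  • Cache Hit if PA[23:18] = TAG
  • Each line: Tag PA[23:18], Data Word PA[1:0]
  • Size: 2^16 * 4 = 256 kB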
18
Direct Mapped Cache
  • Not so good hit ratio
  • Each line can hold only certain addresses, less
    freedom
  • But!
  • Much cheaper to implement, only one line checked
  • Only one comparator

19
Set Associative Cache
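  • 24-bit PA: tag in bits 23..18-z, index in bits 17-z..2,
    bits 1..0 within the data word
  • The index selects ONE set of lines, size 2^z
  • Cache Hit if PA[23:18-z] = TAG in the set
  • Each line: Tag PA[23:18-z], Data Word PA[1:0]
  • 2^z lines per set: 2^z-way set associative
  • Size: 2^16 * 4 = 256 kB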
20
Set Associative Cache
  • Quite good hit ratio
  • The number (set) of different addresses for each
    line is greater than that of a directly mapped
    cache
  • The larger z, the better the hit ratio, but more
    expensive
  • 2^z comparators
  • Cost-performance tradeoff
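
The three organizations differ only in how the 24-bit physical address is split. A minimal sketch, assuming the slide parameters (2^16 lines of 4 bytes) and a hypothetical helper `split_pa`; z = 0 gives the direct mapped cache and z = 16 the fully associative one:

```python
PA_BITS = 24        # physical address width from the slides
LINE_BITS = 16      # 2^16 cache lines in total
WORD_BITS = 2       # PA[1:0] selects the byte within a 4-byte data word


def split_pa(pa, z):
    """Return (tag, set_index, byte) for a 2^z-way set associative cache."""
    index_bits = LINE_BITS - z                    # fewer sets as z grows
    byte = pa & ((1 << WORD_BITS) - 1)            # PA[1:0]
    set_index = (pa >> WORD_BITS) & ((1 << index_bits) - 1)
    tag = pa >> (WORD_BITS + index_bits)          # PA[23 : 18 - z]
    return tag, set_index, byte


pa = 0xABCDEF
for z in (0, 2, 16):
    print(z, [hex(field) for field in split_pa(pa, z)])
```

Each step in z moves one bit from the index to the tag: the set doubles in size and one more comparator pair is needed per lookup.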

21
Cache Miss
  • A Cache Miss should be handled by the hardware
  • If handled by the OS it would be very slow (>> 60
    ns)
  • On a Cache Miss:
  • Stall the pipe
  • Read in new data to cache
  • Release the pipe, now we get a Cache Hit

22
A Summary on Sources of Cache Misses
  • Compulsory (cold start or process migration,
    first reference): first access to a block
  • Cold fact of life: not a whole lot you can do
    about it
  • Note: If you are going to run billions of
    instructions, Compulsory Misses are insignificant
  • Conflict (collision):
  • Multiple memory locations mapped to the same
    cache location
  • Solution 1: increase cache size
  • Solution 2: increase associativity
  • Capacity:
  • Cache cannot contain all blocks accessed by the
    program
  • Solution: increase cache size
  • Invalidation: other process (e.g., I/O) updates
    memory

23
Example: 1 KB Direct Mapped Cache with 32 Byte
Blocks
  • For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)
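
For this example N = 10 and M = 5. A small sketch of the resulting split of a 32-bit address; the function name `fields` is hypothetical, and the middle N - M bits are assumed to form the cache index, as in any direct mapped cache:

```python
N = 10   # cache size 2^N = 1 KB
M = 5    # block size 2^M = 32 bytes


def fields(addr):
    """Split a 32-bit address into (cache_tag, cache_index, byte_select)."""
    byte_select = addr & ((1 << M) - 1)               # lowest M bits
    cache_index = (addr >> M) & ((1 << (N - M)) - 1)  # assumed: middle N - M bits
    cache_tag = addr >> N                             # uppermost 32 - N bits
    return cache_tag, cache_index, byte_select


print(fields(0x1234))
```

With these values the tag is 22 bits wide and the cache holds 2^(N - M) = 32 blocks.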

24
Block Size Tradeoff
  • In general, a larger block size takes advantage of
    spatial locality, BUT:
  • Larger block size means larger miss penalty
  • Takes longer time to fill up the block
  • If block size is too big relative to cache size,
    miss rate will go up
  • Too few cache blocks
  • In general, Average Access Time:
  • TimeAv = Hit Time x (1 - Miss Rate) + Miss
    Penalty x Miss Rate

[Figure: three plots against Block Size.
 Miss Penalty: grows with block size.
 Miss Rate: first falls (exploits spatial locality), then rises
 (fewer blocks compromises temporal locality).
 Average Access Time: rises at large block sizes (increased miss penalty
 and miss rate).]
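
A worked instance of the average access time formula from this slide. The input numbers below are hypothetical and chosen only to make the arithmetic easy to follow:

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time as defined on this slide."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate


# Hypothetical values: 20 ns hit time, 5% miss rate, 80 ns miss penalty.
print(average_access_time(hit_time=20, miss_rate=0.05, miss_penalty=80))
```

Doubling the miss penalty while halving the miss rate leaves the second term unchanged, which is the tradeoff a larger block size has to win.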
25
Extreme Example: single big line
  • Cache Size = 4 bytes, Block Size = 4 bytes
  • Only ONE entry in the cache
  • If an item is accessed, it is likely that it will
    be accessed again soon
  • But it is unlikely that it will be accessed again
    immediately!!!
  • The next access will likely be a miss again
  • Continually loading data into the cache but
    discarding (forcing out) them before they are used
    again
  • Worst nightmare of a cache designer: Ping Pong
    Effect
  • Conflict Misses are misses caused by:
  • Different memory locations mapped to the same
    cache index
  • Solution 1: make the cache size bigger
  • Solution 2: Multiple entries for the same Cache
    Index
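
The ping pong effect can be reproduced with a few lines of simulation. This is a sketch under stated assumptions: `count_misses` is a hypothetical helper, the cache geometry and the address trace are invented for the illustration, and LRU is assumed within a set:

```python
def count_misses(addresses, num_sets, ways, block_size=4):
    """Count misses in a small set associative cache with LRU replacement."""
    sets = [[] for _ in range(num_sets)]   # per set: block numbers, most recent last
    misses = 0
    for addr in addresses:
        block = addr // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)            # hit: refresh its position
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used block
        lines.append(block)
    return misses


# Two addresses that map to the same cache index, accessed alternately.
trace = [0, 64] * 8
print(count_misses(trace, num_sets=16, ways=1))   # direct mapped
print(count_misses(trace, num_sets=8, ways=2))    # same capacity, 2-way
```

In the direct mapped case every access evicts the block the next access needs; with two entries per index only the two cold misses remain.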

26
Hierarchy
  • Small, fast and expensive VS Slow big and
    inexpensive

The Cache contains copies. What if copies are
changed? INCONSISTENCY!
[Figure: HD 2 GB, RAM 16 MB, Cache 256 KB (I and D).]
27
Cache Miss, Write Through/Back
  • To avoid INCONSISTENCY we can:
  • Write Through
  • Always write data to RAM
  • Not so good performance (write 60 ns)
  • Therefore, WT is always combined with write buffers
    so that we don't wait for lower level memory
  • Write Back
  • Write data to memory only when cache line is
    replaced
  • We need a Dirty bit (D) for each cache line
  • D-bit set by hardware on write operation
  • Much better performance, but more complex
    hardware
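
A minimal sketch comparing the number of RAM writes under the two policies. It assumes a tiny direct mapped cache with one word per line and no fetch on a write miss; `memory_writes` and the store sequence are hypothetical:

```python
def memory_writes(stores, policy, num_lines=4):
    """Count RAM writes caused by a sequence of stores.

    stores: list of word addresses being written.
    policy: "write-through" or "write-back".
    """
    tags = [None] * num_lines      # address currently held by each line
    dirty = [False] * num_lines    # D bit per line (used by write back only)
    writes = 0
    for addr in stores:
        line = addr % num_lines
        if tags[line] != addr:                 # the line is replaced
            if policy == "write-back" and dirty[line]:
                writes += 1                    # flush the old dirty line
            tags[line] = addr
            dirty[line] = False
        if policy == "write-through":
            writes += 1                        # every store goes to RAM
        else:
            dirty[line] = True                 # hardware sets the D bit
    return writes


stores = [0, 0, 0, 1, 1, 4]    # address 4 replaces address 0 in line 0
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

Lines that are still dirty at the end are not counted, since write back defers them until replacement.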

28
Write Buffer for Write Through
  • A Write Buffer is needed between the Cache and
    Memory
  • Processor writes data into the cache and the
    write buffer
  • Memory controller writes contents of the buffer
    to memory
  • Write buffer is just a FIFO
  • Typical number of entries: 4
  • Works fine if Store frequency (w.r.t. time) <<
    1 / DRAM write cycle
  • Memory system designer's nightmare:
  • Store frequency (w.r.t. time) -> 1 / DRAM
    write cycle
  • Write buffer saturation

29
Write Buffer Saturation
  • Store frequency (w.r.t. time) -> 1 / DRAM
    write cycle
  • If this condition exists for a long period of time
    (CPU cycle time too quick and/or too many store
    instructions in a row):
  • Store buffer will overflow no matter how big you
    make it
  • The CPU Cycle Time < DRAM Write Cycle Time
  • Solution for write buffer saturation:
  • Use a write back cache
  • Install a second level (L2) cache
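
Saturation can be illustrated with a small FIFO model. This is a sketch, not a hardware description: `first_overflow` is a hypothetical helper, one store is assumed every `store_interval` time units, and the memory controller is assumed to retire one entry per `dram_write_cycle`:

```python
from collections import deque


def first_overflow(store_interval, dram_write_cycle, entries, num_stores):
    """Return the index of the first store that finds the FIFO full, or None."""
    fifo = deque()         # completion times of the buffered writes
    dram_free_at = 0
    for i in range(num_stores):
        now = i * store_interval
        while fifo and fifo[0] <= now:
            fifo.popleft()                     # writes that already finished
        if len(fifo) == entries:
            return i                           # buffer saturated: CPU must stall
        start = max(now, dram_free_at)
        dram_free_at = start + dram_write_cycle
        fifo.append(dram_free_at)
    return None


# A store every 20 ns against a 60 ns DRAM write cycle.
print(first_overflow(20, 60, entries=4, num_stores=1000))
print(first_overflow(20, 60, entries=64, num_stores=1000))
# A store every 100 ns never fills the buffer.
print(first_overflow(100, 60, entries=4, num_stores=1000))
```

A larger buffer only postpones the overflow when stores arrive faster than DRAM can retire them, which is the point made in the bullets above.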

30
Replacement Strategy in Hardware
  • A Direct mapped cache selects ONE cache line
  • No replacement strategy
  • Set/Fully Associative Cache selects a set of
    lines. Strategy to select one Cache line:
  • Random, Round Robin
  • Not so good, spoils the idea with Associative
    Cache
  • Least Recently Used, (move to top strategy)
  • Good, but complex and costly for large Z
  • We could use an approximation (heuristic)
  • Not Recently Used, (replace if not used for a
    certain time)
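
The move to top strategy can be sketched with an ordinary list per set. The helper `access` and the tag names are hypothetical; real hardware would track the order with a few status bits rather than a list:

```python
def access(lru_set, tag, ways):
    """Access one set kept in move-to-top order; return True on a hit."""
    hit = tag in lru_set
    if hit:
        lru_set.remove(tag)
    elif len(lru_set) == ways:
        lru_set.pop()              # bottom of the list = least recently used
    lru_set.insert(0, tag)         # most recently used goes to the top
    return hit


lines = []
results = [access(lines, t, ways=2) for t in ["A", "B", "A", "C", "B"]]
print(results, lines)
```

The cost of keeping a full ordering grows quickly with the number of ways, which is why the slide suggests approximations such as Not Recently Used.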

31
Sequential RAM Access
  • Accessing sequential words from RAM is faster
    than accessing RAM randomly
  • Only lower address bits will change
  • How could we exploit this?
  • Let each Cache Line hold an Array of Data words
  • Give the Base address and array size
  • Burst Read the array from RAM to Cache
  • Burst Write the array from Cache to RAM

32
System Startup, RESET
  • Random Cache Contents
  • We might read incorrect values from the Cache
  • We need to know if the contents are Valid: a V-bit
    for each cache line
  • Let the hardware clear all V-bits on RESET
  • Set the V-bit and clear the D-bit for the line
    copied from RAM to Cache

33
Final Cache Model
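  • 24-bit PA: tag in bits 23..18-z, index in bits 17-z..2+j,
    word select in bits 1+j..0
  • The index selects ONE set of lines, size 2^z
  • Cache Hit if (PA[23:18-z] = TAG) and V in set
  • Set D bit if Write
  • Each line: D, V, Tag PA[23:18-z], Data Word PA[1+j:0]
  • 2^z lines per set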
34
Translation Lookaside Buffer (TLB)
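On TLB hit, the 32-bit virtual address is
translated into a 24-bit physical address by
hardware. We never call the Kernel!
[Figure: the MIPS PIPELINE sends 32-bit virtual addresses to the TLB in CP0,
 which returns a 24-bit physical address to User Memory.
 TLB entry fields: D, R, Physical Addr [23:10], Virtual Address.
 Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]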
35
Virtual Address and a Cache
It takes an extra memory access to translate VA
to PA. This makes cache access very expensive,
and this is the "innermost loop" that you want to
go as fast as possible.
ASIDE: Why access cache with PA at all? VA caches
have a problem!
  • Synonym / alias problem: two different virtual
    addresses map to the same physical address => two
    different cache entries holding data for the same
    physical address!
  • For update: must update all cache entries with the
    same physical address or memory becomes
    inconsistent
  • Determining this requires significant hardware,
    essentially an associative lookup on the physical
    address tags to see if you have multiple hits
  • Or software enforced alias boundary: same lsb of
    VA & PA > cache size
[Figure: CPU --VA--> Translation --PA--> Cache; on a hit, data returns to
 the CPU; on a miss, the access goes to Main Memory.]
36
Translation Look-Aside Buffers
Just like any other cache, the TLB can be
organized as fully associative, set
associative, or direct mapped.
TLBs are usually small, typically not more than
128 - 256 entries even on high end machines. This
permits fully associative lookup on these
machines. Most mid-range machines use small
n-way set associative organizations.
[Figure: Translation with a TLB. CPU --VA--> TLB Lookup; on a TLB hit the
 PA goes to the Cache; on a TLB miss the translation comes from the OS
 Page table. On a cache hit, data returns to the CPU; on a cache miss the
 access goes to Main Memory.]
37
Reducing Translation Time
  • Machines with TLBs go one step further to reduce
    cycles/cache access
  • They overlap the cache access with the TLB access
  • Works because high order bits of the VA are used
    to look in the TLB while low order bits are used
    as index into cache

38
Overlapped Cache & TLB Access
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation
39
Problems With Overlapped TLB Access
Overlapped access only works as long as the
address bits used to index into the cache
do not change as the result of VA translation.
This usually limits things to small caches, large
page sizes, or high n-way set associative caches
if you want a large cache.
Example: suppose everything the same except that
the cache is increased to 8 K bytes instead of 4 K.
One index bit is then changed by VA translation,
but is needed for cache lookup.
Solutions:
  • go to 8K byte page sizes
  • go to 2 way set associative cache, or
  • SW guarantee VA[13] = PA[13]
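
The constraint can be stated as a one-line check: the bytes covered by one way of the cache must not exceed the page size. A sketch with a hypothetical helper `overlap_ok`, assuming the 4 KB page size implied by the example:

```python
def overlap_ok(cache_bytes, ways, page_bytes):
    """True if cache index and block offset fit inside the page offset."""
    return cache_bytes // ways <= page_bytes


PAGE = 4096   # assumed page size of the baseline example
print(overlap_ok(4096, 1, PAGE))      # original 4 K cache
print(overlap_ok(8192, 1, PAGE))      # 8 K direct mapped: one index bit is translated
print(overlap_ok(8192, 2, PAGE))      # solution: 2 way set associative
print(overlap_ok(8192, 1, 8192))      # solution: 8 K byte pages
```

The third solution on the slide (software guaranteeing the bit is equal in VA and PA) is not captured by this check, since it relies on how the OS allocates pages.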
40
Startup a User process
  • Allocate Stack pages, Make a Page Table
  • Set Instruction (I), Global Data (D) and Stack
    pages (S)
  • Clear Resident (R) and Dirty (D) bits
  • Clear V-bits in TLB

[Figure: the Page Table in Kernel Memory has one entry per page (I, I, ...,
 I, D, S). Each entry holds the page's place on Hard Disk, and its bits are
 all 0. All V-bits in the TLB are 0.]
41
Demand Paging
  • IM Stage: We get a TLB Miss and Page Fault (page
    0 not resident)
  • Page Table (Kernel memory) holds HD address for
    page 0 (P0)
  • Read page to RAM page X, Update PA[23:10] in Page
    Table
  • Update TLB: set V, clear D, Page #, PA[23:10]
  • Restart failing instruction: TLB hit!

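[Figure: RAM holds Page 0 (I) at physical address XXX00..0.
 TLB entry (V, D, 22-bit Page #, Physical Addr PA[23:10]):
 V = 1, D = 0, Page # = 00....0, PA[23:10] = XX..X.
 The Page Table entry for page 0 refers to P0 on the hard disk.]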
42
Demand Paging
  • DM Stage: We get a TLB Miss and Page Fault (page
    3 not resident)
  • Page Table (Kernel memory) holds HD address for
    page 3 (P3)
  • Read page to RAM page Y, Update PA[23:10] in Page
    Table
  • Update TLB: set V, clear D, Page #, PA[23:10]
  • Restart failing instruction: TLB hit!

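[Figure: RAM holds Page 0 (I) and Page 3 (D), the latter at physical
 address YYY00..0.
 TLB entries (V, D, 22-bit Page #, Physical Addr PA[23:10]):
 I: V = 1, D = 0, Page # = 00....0, PA[23:10] = XX..X
 D: V = 1, D = 0, Page # = 00...11, PA[23:10] = YY..Y
 The hard disk holds P0, P1, P2, P3, ...]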
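
The demand paging steps from the two slides above can be sketched as follows. All names (`access_page`, the dictionary layout, the frame numbers) are hypothetical, and reading the page from the hard disk is represented only by taking a free frame:

```python
def access_page(page, tlb, page_table, free_frames):
    """Return (frame, event) for one access to a virtual page number."""
    if page in tlb:
        return tlb[page]["frame"], "TLB hit"
    entry = page_table[page]
    event = "TLB miss"
    if not entry["resident"]:                  # R bit clear: page fault
        entry["frame"] = free_frames.pop(0)    # stands in for reading the page from HD
        entry["resident"] = True               # update the Page Table
        event = "TLB miss + page fault"
    tlb[page] = {"frame": entry["frame"], "dirty": False}   # set V, clear D
    return entry["frame"], event


page_table = {0: {"resident": False, "frame": None},
              3: {"resident": False, "frame": None}}
tlb = {}
free_frames = [5, 9]                           # hypothetical RAM pages X and Y

print(access_page(0, tlb, page_table, free_frames))   # IM stage, first fetch
print(access_page(0, tlb, page_table, free_frames))   # restarted instruction
print(access_page(3, tlb, page_table, free_frames))   # DM stage, first data access
```

Calling the function a second time for the same page plays the role of restarting the failing instruction.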
43
Spatial and Temporal Locality
  • Spatial Locality:
  • Now the TLB holds a page translation: 1024 bytes,
    256 instructions
  • The next instruction (PC+4) will cause a TLB Hit
  • Access a data array, e.g., 0(t0), 4(t0) etc.
  • Temporal Locality:
  • TLB holds translation
  • Branch within the same page, access the same
    instruction address
  • Access the array again, e.g., 0(t0), 4(t0) etc.

THIS IS THE ONLY REASON A SMALL TLB WORKS
44
Replacement Strategy
  • If TLB is full the OS selects the TLB line to
    replace
  • Any line will do, they are the same and
    concurrently checked
  • Strategy to select one:
  • Random
  • Not so good
  • Round Robin
  • Not so good, about the same as random
  • Least Recently Used (move to top strategy)
  • Much better (the best we can do without knowing
    or predicting page access). Based on temporal
    locality

45
Hierarchy
  • Small, fast and expensive VS Slow big and
    inexpensive

TLB/RAM contains copies. What if copies are
changed? INCONSISTENCY!
[Figure: TLB 64 Lines; Page Table in Kernel Memory (>> 64 entries);
 RAM 256 MB; HD 32 GB.]
46
Inconsistency
  • Replace a TLB entry, caused by TLB Miss:
  • If old TLB entry is dirty (D-bit) we update Page
    Table (Kernel memory)
  • Replace a page in RAM (swapping), caused by Page
    Fault:
  • If old Page is in TLB:
  • Check old page TLB D-bit, if Dirty write page to
    HD
  • Clear TLB V-bit and Page Table R-bit (now not
    resident)
  • If old Page is not in TLB:
  • Check old page Page Table D-bit, if Dirty write
    page to HD
  • Clear Page Table R-bit (page not resident any
    more)

47
Current Working Set
  • If RAM is full the OS selects a page to replace,
    Page Fault
  • OBS! The RAM is shared by many User processes
  • Least Recently Used, (move to top strategy)
  • Much better, (the best we can do without knowing
    or predicting page access)
  • Swapping is VERY expensive (maybe > 100 ms)
  • Why not try harder to keep the pages needed (the
    working set) in RAM using Advanced memory paging
    algorithms

Current working set of process P:
{p0, p3, ...} = set of pages used during the interval Δt before t_now
[Figure: time axis t with the window Δt ending at t_now.]
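
The working set definition translates directly into a set comprehension. A minimal sketch, assuming a reference trace of (time, page) pairs; the function name, the trace and the window length `delta` are hypothetical:

```python
def working_set(references, now, delta):
    """Pages referenced in the window (now - delta, now].

    references: list of (time, page) pairs.
    """
    return {page for time, page in references if now - delta < time <= now}


refs = [(1, "p0"), (2, "p3"), (3, "p0"), (7, "p1"), (8, "p3")]
print(sorted(working_set(refs, now=3, delta=3)))
print(sorted(working_set(refs, now=8, delta=3)))
```

A paging algorithm that tries to keep this set resident for every runnable process avoids evicting pages that are about to be used again.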
48
Thrashing
[Figure: Probability of Page Fault (0 to 1) versus Fraction of working set
 not resident (0 to 1).]
Thrashing: No useful work done!
This we want to avoid.
49
Summary: Cache, TLB, Virtual Memory
  • Caches, TLBs, Virtual Memory are all understood by
    examining how they deal with 4 questions:
  • Where can a page be placed?
  • How is a page found?
  • What page is replaced on miss?
  • How are writes handled?
  • Page tables map virtual address to physical
    address
  • TLBs are important for fast translation
  • TLB misses are significant in processor
    performance
  • (some systems can't access all of 2nd level cache
    without TLB misses!)

50
Summary: Memory Hierarchy
  • Virtual memory was controversial at the time:
    can SW automatically manage 64KB across many
    programs?
  • 1000X DRAM growth removed the controversy
  • Today VM allows many processes to share single
    memory without having to swap all processes to
    disk
  • VM protection is more important than memory space
    increase
  • Today CPU time is a function of (ops, cache
    misses) vs. just of (ops)
  • What does this mean to Compilers, Data
    structures, Algorithms?
  • VTune performance analyzer: cache misses.