Title: THE MEMORY HIERARCHY
1THE MEMORY HIERARCHY
- Jehan-François Pâris
- jfparis_at_uh.edu
2Chapter Organization
- Technology overview
- Caches
- Cache associativity, write through and write back
- Virtual memory
- Page table organization, the translation lookaside buffer (TLB), page fault handling, memory protection
- Virtual machines
- Cache consistency
3TECHNOLOGY OVERVIEW
4Dynamic RAM
- Standard solution for main memory since 70's
- Replaced magnetic core memory
- Bits are stored as charges on capacitors
- Charged state represents a one
- Capacitors discharge
- Must be dynamically refreshed
- Achieved by accessing each cell several thousand
times each second
5Dynamic RAM
(Figure: a DRAM cell; a row select line gates an nMOS transistor that connects the column select line to a capacitor tied to ground.)
6The role of the nMOS transistor
Not on the exam
- Normally, no current can go from the source to
the drain
- When the gate is positive with respect to the ground, electrons are attracted to the gate (the "field effect") and current can go through
7Magnetic disks
(Figure: a disk drive showing a platter, the servo, the arm, and the R/W head.)
8Magnetic disk (I)
- Data are stored on circular tracks
- Tracks are partitioned into a variable number of fixed-size sectors
- If the disk drive has more than one platter, all tracks corresponding to the same position of the R/W head form a cylinder
9Magnetic disk (II)
- Disk spins at a speed varying between
- 5,400 rpm (laptops) and
- 15,000 rpm (Seagate Cheetah X15)
- Accessing data requires
- Positioning the head on the right track
- Seek time
- Waiting for the data to reach the R/W head
- On the average half a rotation
10Disk access times
- Dominated by seek time and rotational delay
- We try to reduce seek times by placing all data that are likely to be accessed together on nearby tracks or in the same cylinder
- Cannot do as much for rotational delay
- On the average, half a rotation
11Average rotational delay
RPM Delay (ms)
5400 5.6
7200 4.2
10,000 3.0
15,000 2.0
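The table values follow from half a revolution at the given speed; a minimal Python sketch of that conversion (the function name is illustrative):

    def avg_rotational_delay_ms(rpm):
        # half a revolution, converted from minutes to milliseconds
        return 0.5 * 60_000 / rpm

    for rpm in (5400, 7200, 10_000, 15_000):
        print(rpm, round(avg_rotational_delay_ms(rpm), 1), "ms")
    # prints 5.6, 4.2, 3.0 and 2.0 ms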
12Overall performance
- Disk access times are still dominated by rotational latency
- Were 8-10 ms in the late 70's when rotational speeds were 3,000 to 3,600 RPM
- Disk capacities and maximum transfer rates have done much better
- Pack many more tracks per platter
- Pack many more bits per track
13The internal disk controller
- Printed circuit board attached to disk drive
- As powerful as the CPU of a personal computer of the early 80's
- Functions include
- Speed buffering
- Disk scheduling
- ...
14Reliability issues
- Disk drives have more reliability issues than most other computer components
- Moving parts eventually wear out
- Infant mortality
- Would be too costly to produce perfect magnetic surfaces
- Disks have bad blocks
15Disk failure rates
- Failure rates follow a bathtub curve
- High infantile mortality
- Low failure rate during useful life
- Higher failure rates as disks wear out
16Disk failure rates (II)
(Figure: the bathtub curve; failure rate vs. time, with an infant mortality phase, a low-failure-rate useful life, and a wear-out phase.)
17Disk failure rates (III)
- The infant mortality effect can last for months for disk drives
- Cheap ATA disk drives seem to age less gracefully than SCSI drives
18MTTF
- Disk manufacturers advertise very high Mean Times To Fail (MTTF) for their products
- 500,000 to 1,000,000 hours, that is, 57 to 114 years
- Does not mean that a disk will last that long!
- Means that disks will fail at an average rate of one failure per 500,000 to 1,000,000 hours during their useful life
19More MTTF Issues (I)
- Manufacturers' claims are not supported by solid experimental evidence
- Obtained by submitting disks to a stress test at high temperature and extrapolating results to ideal conditions
- Procedure raises many issues
20More MTTF Issues (II)
- Failure rates observed in the field are much higher
- Can go up to 8 to 9 percent per year
- Corresponding MTTFs are 11 to 12.5 years
- If we have 100 disks and an MTTF of 12.5 years, we can expect an average of 8 disk failures per year (see the sketch below)
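A minimal sketch of that back-of-the-envelope estimate (illustrative values only):

    def expected_failures_per_year(n_disks, mttf_years):
        # during useful life, each disk fails at an average rate of 1/MTTF per year
        return n_disks / mttf_years

    print(expected_failures_per_year(100, 12.5))   # 8.0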
21Bad blocks (I)
- Also known as
- Irrecoverable read errors
- Latent sector errors
- Can be caused by
- Defects in magnetic substrate
- Problems during last write
22Bad blocks (II)
- The disk controller uses a redundant encoding that can detect and correct many errors
- When the internal disk controller detects a bad block
- Marks it as unusable
- Remaps the logical block address of the bad block to spare sectors
- Each disk is extensively tested during a burn-in period before being released
23The memory hierarchy (I)
Level Device Access Time
1 Fastest registers (2 GHz CPU) 0.5 ns
2 Main memory 10-60 ns
3 Secondary storage (disk) 7 ms
4 Mass storage (CD-ROM library) a few seconds
24The memory hierarchy (II)
- To make sense of these numbers, let us consider
an analogy
25Writing a paper (I)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk
3 Book in library
4 Book far away
26Writing a paper (II)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-120 s
3 Book in library
4 Book far away
27Writing a paper (III)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away
28Writing a paper (IV)
Level Resource Access Time
1 Open book on desk 1 s
2 Book on desk 20-140 s
3 Book in library 162 days
4 Book far away 63 years
29Major issues
- Huge gaps between
- CPU speeds and SDRAM access times
- SDRAM access times and disk access times
- The two problems have very different solutions
- Gap between CPU speeds and SDRAM access times handled by hardware
- Gap between SDRAM access times and disk access times handled by a combination of software and hardware
30Why?
- Having hardware handle an issue
- Complicates hardware design
- Offers a very fast solution
- Standard approach for very frequent actions
- Letting software handle an issue
- Cheaper
- Has a much higher overhead
- Standard approach for less frequent actions
31Will the problem go away?
- It will become worse
- RAM access times are not improving as fast as CPU power
- Disk access times are limited by the rotational speed of the disk drive
32What are the solutions?
- To bridge the CPU/DRAM gap
- Interpose between the CPU and the DRAM smaller, faster memories that cache the data that the CPU currently needs
- Cache memories
- Managed by the hardware and invisible to the software (OS included)
33What are the solutions?
- To bridge the DRAM/disk drive gap
- Store in main memory the data blocks that are currently accessed (I/O buffer)
- Manage memory space and disk space as a single resource (virtual memory)
- The I/O buffer and virtual memory are managed by the OS and invisible to the user processes
34Why do these solutions work?
- Locality principle
- Spatial locality: at any time a process only accesses a small portion of its address space
- Temporal locality: this subset does not change too frequently
35Can we think of examples?
- The way we write programs
- The way we act in everyday life
-
36CACHING
37The technology
- Caches use faster static RAM (SRAM)
- Similar organization as that of D flipflops
- Can have
- Separate caches for instructions and data
- Great for pipelining
- A unified cache
38A little story (I)
- Consider a closed-stack library
- Customers bring book requests to circulation desk
- Librarians go to stack to fetch requested book
- Solution is used in national libraries
- Costlier than open-stack approach
- Much better control of assets
39A little story (II)
- Librarians have noted that some books get asked for again and again
- Want to put them closer to the circulation desk
- Would result in much faster service
- The problem is how to locate these books
- They will not be at the right location!
40A little story (III)
- Librarians come up with a great solution
- They put behind the circulation desk shelves with 100 book slots numbered from 00 to 99
- Each slot is a home for the most recently requested book that has a call number whose last two digits match the slot number
- 3141593 can only go in slot 93
- 1234567 can only go in slot 67
41A little story (IV)
Let me see if it's in bin 93
The call number of the book I need is 3141593
42A little story (V)
- To let the librarian do her job, each slot must contain either
- Nothing, or
- A book and its reference number
- There are many books whose reference number ends in 93 (or any two given digits)
43A little story (VI)
Sure
Could I get this time the book whose call number
4444493?
44A little story (VII)
- This time the librarian will
- Go to bin 93
- Find that it contains a book with a different call number
- She will
- Bring back that book to the stacks
- Fetch the new book
45Basic principles
- Assume we want to store in a faster memory the 2^n words that are currently accessed by the CPU
- Can be instructions or data or even both
- When the CPU needs to fetch an instruction or load a word into a register
- It will look first into the cache
- Can have a hit or a miss
46Cache hits
- Occur when the requested word is found in the cache
- The cache avoided a memory access
- CPU can proceed
47Cache misses
- Occur when the requested word is not found in the cache
- Will need to access the main memory
- Will bring the new word into the cache
- Must make space for it by expelling one of the cache entries
- Need to decide which one
48Handling writes (I)
- When the CPU has to store the contents of a register into main memory
- The write will update the cache
- If the modified word is already in the cache
- Everything is fine
- Otherwise
- Must make space for it by expelling one of the
cache entries
49Handling writes (II)
- Two ways to handle writes
- Write through
- Each write updates both the cache and the main memory
- Write back
- Writes are not propagated to the main memory
until the updated word is expelled from the cache
50Handling writes (II)
(Figure: write back; the CPU writes into the cache, and the cache updates RAM later.)
51Pros and cons
- Write through
- Ensures that memory is always up to date
- Expelled cache entries can be overwritten
- Write back
- Faster writes
- Complicates cache expulsion procedure
- Must write back cache entries that have been
modified in the cache
52Picking the right solution
- Caches use write through
- Provides simpler cache expulsions
- Can minimize the write-through overhead with additional circuitry
- I/O buffers and virtual memory use write back
- Write-through overhead would be too high
53A better write through (I)
- Add a small buffer to speed up the write performance of write-through caches
- At least four words
- Holds modified data until they are written into main memory
- The cache can proceed as soon as data are written into the write buffer
54A better write through (II)
(Figure: write through with a write buffer; the CPU writes into the cache, modified data go into the write buffer, and the buffer updates RAM.)
55A very basic cache
- Has 2^n entries
- Each entry contains
- A word (4 bytes)
- Its RAM address
- Sole way to identify the word
- A bit indicating whether the cache entry contains
something useful
56A very basic cache (I)
Actual caches are much bigger
57A very basic cache (II)
58Comments (I)
- The cache organization we have presented is nothing but the hardware implementation of a hash table
- Each entry has
- a key: the word address
- a value: the word contents plus a valid bit
- ...
59Comments (II)
- The hash function is
- h(k) = (k/4) mod N
- where k is the key and N is the cache size
- Can be computed very fast
- Unlike conventional hash tables, this organization has no provision for handling collisions
- Expulsion is used to resolve collisions
60Managing the cache
- Each word fetched into the cache can occupy a single cache location
- Specified by bits n+1 to 2 of its address
- Two words with the same bits n+1 to 2 cannot be in the cache at the same time
- Happens whenever the addresses of the two words differ by K × 2^(n+2) for some integer K
61Example
- Assume the cache can contain 8 words
- If word 48 is in the cache it will be stored at cache index (48/4) mod 8 = 12 mod 8 = 4
- In our case 2^(n+2) = 2^(3+2) = 32
- The only possible cache index for word 80 would be (80/4) mod 8 = 20 mod 8 = 4
- Same for words 112, 144, 176, ...
62Managing the cache
- Each word fetched into the cache can occupy a single cache location
- Specified by bits n+1 to 2 of its address
- Two words with the same bits n+1 to 2 cannot be in the cache at the same time
- Happens whenever the addresses of the two words differ by K × 2^(n+2) for some integer K
63Saving cache space
- We do not need to store the whole address of each word in the cache
- Bits 1 and 0 will always be zero
- Bits n+1 to 2 can be inferred from the cache index
- If the cache has 8 entries, bits 4 to 2
- Will only store in the tag the remaining bits of the address
64A very basic cache (III)
Cache uses bits 4 to 2 of word address
65Storing a new word in the cache
- The location of the new word entry will be obtained from the LSBs of the word address
- Discard the 2 LSBs
- Always zero for a well-aligned word
- The n next LSBs give the cache index for a cache of size 2^n
(Address layout: MSBs of the word address (tag) | n next LSBs (cache index) | 00)
66Accessing a word in the cache (I)
- Start with word address
- Remove the two least significant bits
- Always zero
Word address
67Accessing a word in the cache (II)
- Split the remainder of the address into
- The n least significant bits
- They give the word's location (index) in the cache
- The remaining bits form the cache tag
(Figure: the word address minus its two LSBs is split into the cache tag and the n LSBs; a sketch of the split follows.)
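A minimal Python sketch of this split for a direct-mapped cache with 2^n one-word entries (function and variable names are illustrative):

    def split_address(addr, n):
        # byte offset: bits 1-0, always zero for an aligned word
        byte_offset = addr & 0b11
        # cache index: the next n bits
        index = (addr >> 2) & ((1 << n) - 1)
        # tag: the remaining most significant bits
        tag = addr >> (2 + n)
        return tag, index, byte_offset

    # Words 48 and 80 collide in an 8-entry cache: same index 4, different tags
    print(split_address(48, 3))   # (1, 4, 0)
    print(split_address(80, 3))   # (2, 4, 0)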
68Towards a better cache
- Our cache takes into account the temporal locality of accesses
- Repeated accesses to the same location
- But not their spatial locality
- Accesses to neighboring locations
- Cache space is poorly used
- Need 26 + 1 bits of overhead to store 32 bits of data
69Multiword cache (I)
- Each cache entry will contain a block of 2, 4, 8, ... words with consecutive addresses
- Will require words to be well aligned
- A pair of words should start at an address that is a multiple of 2×4 = 8
- A group of four words should start at an address that is a multiple of 4×4 = 16
70Multiword cache (II)
Tag
Contents
71Multiword cache (III)
- Has 2^n entries each containing 2^m words
- Each entry contains
- 2^m words
- A tag
- A bit indicating whether the cache entry contains
useful data
72Storing a new word in the cache
- Location of new word entry will be obtained from
LSB of word address - Discard 2 m LSB
- Always zero for a well-aligned group of words
- Take n next LSB for a cache of size 2n
MSB of address
2 m LSB
n next LSB
73Example
- Assume
- Cache can contain 8 entries
- Each block contains 2 words
- Words 48 and 52 belong to the same block
- If word 48 is in the cache it will be stored at
cache index (48 /8) mod 8 6 mod 8 6 - If word 48 is in the cache it will be stored at
cache index (49 /8) mod 8 6 mod 8 6
74Selecting the right block size
- Larger block sizes improve the performance of the cache
- Allow us to exploit spatial locality
- Three limitations
- The spatial locality effect is less pronounced if the block size exceeds 128 bytes
- Too many collisions in very small caches
- Large blocks take more time to be fetched into the cache
76Collision effect in small cache
- Consider a 4 KB cache
- If the block size is 16 B, that is, 4 words, the cache will have 256 blocks
- If the block size is 128 B, that is, 32 words, the cache will have 32 blocks
- Too many collisions
77Problem
- Consider a very small cache with 8 entries and a block size of 8 bytes (2 words)
- Which words will be fetched into the cache when the CPU accesses the words at addresses 32, 48, 60 and 80?
- How will these words be stored in the cache?
78Solution (I)
- Since the block size is 8 bytes
- The 3 LSBs of the address select one of the 8 bytes in a block
- Since the cache holds 8 blocks,
- The next 3 LSBs of the address form the cache index
- As a result, the tag has 32 - 3 - 3 = 26 bits
79Solution (I)
- Consider the word at address 32
- Cache index is (32/2^3) mod 2^3 = (32/8) mod 8 = 4
- Block tag is 32/2^6 = 32/64 = 0
Row 4 | Tag 0 | bytes 32 33 34 35 36 37 38 39
80Solution (II)
- Consider the word at address 48
- Cache index is (48/8) mod 8 = 6
- Block tag is 48/64 = 0
Row 6 | Tag 0 | bytes 48 49 50 51 52 53 54 55
81Solution (III)
- Consider the word at address 60
- Cache index is (60/8) mod 8 = 7
- Block tag is 60/64 = 0
Row 7 | Tag 0 | bytes 56 57 58 59 60 61 62 63
82Solution (IV)
- Consider the word at address 80
- Cache index is (80/8) mod 8 = 10 mod 8 = 2
- Block tag is 80/64 = 1
Row 2 | Tag 1 | bytes 80 81 82 83 84 85 86 87
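The whole problem can be checked with a minimal sketch of this direct-mapped placement (a simplified model that ignores valid bits; names are illustrative):

    BLOCK_BITS, INDEX_BITS = 3, 3    # 8-byte blocks, 8 cache entries

    def place(addr):
        block_number = addr >> BLOCK_BITS            # drop the 3-bit byte offset
        index = block_number % (1 << INDEX_BITS)     # cache row
        tag = block_number >> INDEX_BITS             # remaining bits
        first_byte = block_number << BLOCK_BITS      # first byte of the fetched block
        return index, tag, list(range(first_byte, first_byte + 8))

    for addr in (32, 48, 60, 80):
        print(addr, place(addr))
    # 32 -> row 4, tag 0, bytes 32..39;  48 -> row 6, tag 0, bytes 48..55
    # 60 -> row 7, tag 0, bytes 56..63;  80 -> row 2, tag 1, bytes 80..87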
83Set-associative caches (I)
- Can be seen as 2, 4, 8 caches attached together
- Reduces collisions
84Set-associative caches (II)
85Set-associative caches (III)
- Advantage
- We take care of more collisions
- Like a hash table with a fixed bucket size
- Results in lower miss rates than direct-mapped caches
- Disadvantage
- Slower access
- Best solution if miss penalty is very big
86Fully associative caches
- The dream!
- A block can occupy any index position in the cache
- Requires an associative memory
- Content-addressable
- Like our brain!
- Remains a dream
87Designing RAM to support caches
- RAM connected to CPU through a "bus"
- Clock rate much slower than CPU clock rate
- Assume that a RAM access takes
- 1 bus clock cycle to send the address
- 15 bus clock cycles to initiate a read
- 1 bus clock cycle to send a word of data
88Designing RAM to support caches
- Assume
- Cache block size is 4 words
- One-word bank of DRAM
- Fetching a cache block would take
- 1 + 4×15 + 4×1 = 65 bus clock cycles
- Transfer rate is 0.25 byte/bus cycle
- Awful!
89Designing RAM to support caches
- Could
- Double bus width (from 32 to 64 bits)
- Have a two-word bank of DRAM
- Fetching a cache block would take
- 1 + 2×15 + 2×1 = 33 bus clock cycles
- Transfer rate is 0.48 byte/bus cycle
- Much better
- Costly solution
90Designing RAM to support caches
- Could
- Have an interleaved memory organization
- Four one-word banks of DRAM
- A 32-bit bus
(Figure: four one-word RAM banks, banks 0 to 3, connected to a 32-bit bus.)
91Designing RAM to support caches
- Can do the 4 accesses in parallel
- Must still transmit the block 32 bits by 32 bits
- Fetching a cache block would take
- 1 + 15 + 4×1 = 20 bus clock cycles
- Transfer rate is 0.80 byte/bus cycle
- Even better
- Much cheaper than having a 64-bit bus
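A minimal sketch comparing the three organizations under the stated timing assumptions (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer):

    def fetch_cycles(dram_accesses, bus_transfers):
        # address cycle + DRAM access cycles + word transfer cycles
        return 1 + 15 * dram_accesses + 1 * bus_transfers

    block_bytes = 16   # 4-word cache block
    for accesses, transfers in ((4, 4), (2, 2), (1, 4)):
        cycles = fetch_cycles(accesses, transfers)
        print(cycles, "cycles,", round(block_bytes / cycles, 2), "bytes/cycle")
    # 65 cycles (0.25 B/cycle), 33 cycles (0.48 B/cycle), 20 cycles (0.8 B/cycle)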
92ANALYZING CACHE PERFORMANCE
93Memory stalls
- Can divide CPU time into
- N_EXEC clock cycles spent executing instructions
- N_MEM_STALLS cycles spent waiting for memory accesses
- We have
- CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
94Memory stalls
- We assume that
- cache access times can be neglected
- most CPU cycles spent waiting for memory accesses are caused by cache misses
- Distinguishing between read stalls and write stalls
- N_MEM_STALLS = N_RD_STALLS + N_WR_STALLS
95Read stalls
- Fairly simple
- N_RD_STALLS = N_MEM_RD × Read miss rate × Read miss penalty
96Write stalls (I)
- Two causes of delays
- Must fetch missing blocks before updating them
- We update at most 8 bytes of the block!
- Must take into account the cost of write through
- The buffering delay depends on the proximity of writes, not on the number of cache misses
- Writes too close to each other
97Write stalls (II)
- We have
- N_WR_STALLS = N_WRITES × Write miss rate × Write miss penalty + N_WR_BUFFER_STALLS
- In practice, very few buffer stalls if the buffer contains at least four words
98Global impact
- We have
- N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
- and also
- N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty
99Example
- Miss rate of the instruction cache is 2 percent
- Miss rate of the data cache is 4 percent
- In the absence of memory stalls, each instruction would take 2 cycles
- Miss penalty is 100 cycles
- 36 percent of instructions access the main memory
- How many cycles are lost due to cache misses?
100Solution (I)
- Impact of instruction cache misses
- 0.02 × 100 = 2 cycles/instruction
- Impact of data cache misses
- 0.36 × 0.04 × 100 = 1.44 cycles/instruction
- Total impact of cache misses
- 2 + 1.44 = 3.44 cycles/instruction
101Solution (II)
- Average number of cycles per instruction
- 2 + 3.44 = 5.44 cycles/instruction
- Fraction of time wasted
- 3.44/5.44 ≈ 63 percent (see the sketch below)
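A minimal sketch of the whole calculation (values from the example above; variable names are illustrative):

    i_miss_rate, d_miss_rate = 0.02, 0.04
    mem_fraction, miss_penalty, base_cpi = 0.36, 100, 2

    stalls = i_miss_rate * miss_penalty + mem_fraction * d_miss_rate * miss_penalty
    cpi = base_cpi + stalls
    print(stalls, cpi, stalls / cpi)   # 3.44 cycles, 5.44 CPI, about 0.63 wasted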
102Problem
- Redo the example with the following data
- Miss rate of the instruction cache is 3 percent
- Miss rate of the data cache is 5 percent
- In the absence of memory stalls, each instruction would take 2 cycles
- Miss penalty is 100 cycles
- 40 percent of instructions access the main memory
103Solution
- The fraction of time wasted to memory stalls is
71 percent
104Average memory access time
- Some authors call it AMAT
- T_AVERAGE = T_CACHE + f × T_MISS
- where f is the cache miss rate
- Times can be expressed
- In nanoseconds
- In number of cycles
105Example
- A cache has a hit rate of 96 percent
- Accessing data
- In the cache requires one cycle
- In the memory requires 100 cycles
- What is the average memory access time?
106Solution
- Miss rate = 1 - Hit rate = 0.04
- Applying the formula
- T_AVERAGE = 1 + 0.04 × 100 = 5 cycles
107Impact of a better hit rate
- What would be the impact of improving the hit
rate of the cache from 96 to 98 percent?
108Solution
- New miss rate = 1 - New hit rate = 0.02
- Applying the formula
- T_AVERAGE = 1 + 0.02 × 100 = 3 cycles (see the sketch below)
When the hit rate is above 80 percent, small improvements in the hit rate result in a much lower miss rate
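A one-line check of the AMAT formula for both hit rates (times in cycles):

    def amat(t_cache, miss_rate, miss_penalty):
        return t_cache + miss_rate * miss_penalty

    print(amat(1, 0.04, 100))   # 5 cycles with a 96 percent hit rate
    print(amat(1, 0.02, 100))   # 3 cycles with a 98 percent hit rate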
109Examples
- Old hit rate 80 percent, new hit rate 90 percent
- Miss rate goes from 20 to 10 percent!
- Old hit rate 94 percent, new hit rate 98 percent
- Miss rate goes from 6 to 2 percent!
110In other words
It's the miss rate, stupid!
111Improving cache hit rate
- Two complementary techniques
- Using set-associative caches
- Must check the tags of all blocks with the same index value
- Slower
- Have fewer collisions
- Fewer misses
- Use a cache hierarchy
112A cache hierarchy (I)
CPU
L1
L1 misses
L2
L2 misses
L3
L3 misses
RAM
113A cache hierarchy
- Topmost cache
- Optimized for speed, not miss rate
- Rather small
- Uses a small block size
- As we go down the hierarchy
- Cache sizes increase
- Block sizes increase
- Cache associativity level increases
114Example
- Cache miss rate per instruction is 2 percent
- In the absence of memory stalls, each instruction would take one cycle
- Cache miss penalty is 100 ns
- Clock rate is 4 GHz
- How many cycles are lost due to cache misses?
115Solution (I)
- Duration of a clock cycle
- 1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
- Cache miss penalty
- 100 ns = 400 cycles
- Total impact of cache misses
- 0.02 × 400 = 8 cycles/instruction
116Solution (II)
- Average number of cycles per instruction
- 1 + 8 = 9 cycles/instruction
- Fraction of time wasted
- 8/9 ≈ 89 percent
117Example (cont'd)
- How much faster would the processor be if we added an L2 cache that
- Has a 5 ns access time
- Would reduce the miss rate to main memory to 0.5 percent?
- We will see later how to get that
118Solution (I)
- L2 cache access time
- 5 ns = 20 cycles
- Impact of cache misses per instruction
- L1 cache misses + L2 cache misses = 0.02×20 + 0.005×400 = 0.4 + 2.0 = 2.4 cycles/instruction
- Average number of cycles per instruction
- 1 + 2.4 = 3.4 cycles/instruction
119Solution (II)
- Fraction of time wasted
- 2.4/3.4 ≈ 71 percent
- CPU speedup
- 9/3.4 ≈ 2.6 (see the sketch below)
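A minimal sketch of the two-level calculation under the stated assumptions (4 GHz clock, so one cycle is 0.25 ns; names are illustrative):

    cycle_ns = 0.25
    l1_miss_rate = 0.02                 # misses per instruction
    l2_penalty = 5 / cycle_ns           # 20 cycles to reach the L2 cache
    mem_penalty = 100 / cycle_ns        # 400 cycles to reach main memory
    global_miss_rate = 0.005            # misses that go all the way to memory

    stalls = l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty
    cpi = 1 + stalls
    print(stalls, cpi)                                   # 2.4 and 3.4
    print((1 + l1_miss_rate * mem_penalty) / cpi)        # speedup of about 2.6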
120How to get the 0.005 miss rate
- The wanted miss rate corresponds to a combined cache hit rate of 99.5 percent
- Let H1 be the hit rate of the L1 cache and H2 be the hit rate of the second cache
- The combined hit rate of the cache hierarchy is H = H1 + (1 - H1) × H2
121How to get the 0.005 miss rate
- We have 0.995 = 0.98 + 0.02 × H2
- H2 = (0.995 - 0.98)/0.02 = 0.75
- Quite feasible!
122Can we do better? (I)
- Keep the 98 percent hit rate for the L1 cache
- Raise the hit rate of the L2 cache to 85 percent
- The L2 cache is now slower: 6 ns (24 cycles)
- Impact of cache misses per instruction
- L1 cache misses + L2 cache misses = 0.02×24 + 0.02×0.15×400 = 0.48 + 1.2 = 1.68 cycles/instruction
123The verdict
- Fraction of time wasted
- 1.68/2.68 ≈ 63 percent
- CPU speedup
- 9/2.68 ≈ 3.36
124Would a faster L2 cache help?
- Redo the example assuming
- Hit rate of the L1 cache is still 98 percent
- New faster L2 cache
- Access time reduced to 3 ns
- Hit rate only 50 percent
125The verdict
- Fraction of time wasted
- About 81 percent
- CPU speedup
- 1.72
A new L2 cache with a lower access time but a higher miss rate performs much worse than the original L2 cache
126Cache replacement policy
- Not an issue in direct mapped caches
- We have no choice!
- An issue in set-associative caches
- Best policy is least recently used (LRU)
- Expels from the cache a block in the same set as the incoming block
- Picks the block that has not been accessed for the longest period of time
127Implementing LRU policy
- Easy when each set contains two blocks
- We attach to each block a use bit that is
- Set to 1 when the block is accessed
- Reset to 0 when the other block is accessed
- We expel the block whose use bit is 0 (see the sketch below)
- Much more complicated for higher associativity levels
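A minimal sketch of the use-bit trick for one two-way set (a simplified software model; class and method names are illustrative):

    class TwoWaySet:
        def __init__(self):
            self.tags = [None, None]   # the two blocks of the set
            self.use = [0, 0]          # one use bit per block

        def access(self, tag):
            if tag in self.tags:             # hit
                way = self.tags.index(tag)
            else:                            # miss: expel the block whose use bit is 0
                way = self.use.index(0)
                self.tags[way] = tag
            self.use[way] = 1                # accessed block becomes most recently used
            self.use[1 - way] = 0            # the other block becomes the LRU candidate
            return way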
128REALIZATIONS
129Caching in a multicore organization
- Multicore organizations often involve multiple chips
- Say four chips with four cores per chip
- Have a cache hierarchy on each chip
- L1, L2, L3
- Some caches are private, others are shared
- Accessing a cache on a chip is much faster than accessing a cache on another chip
130AMD 16-core system (I)
- AMD 16-core system
- Sixteen cores on four chips
- Each core has a 64-KB L1 and a 512-KB L2 cache
- Each chip has a 2-MB shared L3 cache
131X/Y, where X is the latency in cycles and Y is the bandwidth in bytes/cycle
132AMD 16-core system (II)
- Observe that access times are non-uniform
- It takes more time to access the L1 or L2 cache of another core than to access the shared L3 cache
- It takes more time to access caches on another chip than local caches
- Access times and bandwidths depend on the chip interconnect topology
133VIRTUAL MEMORY
134Main objective (I)
- To allow programmers to write programs that reside
- partially in main memory
- partially on disk
135Main objective (II)
(Figure: two address spaces, each only partially resident in main memory.)
136Motivation
- Most programs do not access their whole address space at the same time
- Compilers go through several phases
- Lexical analysis
- Preprocessing (C, C++)
- Syntactic analysis
- Semantic analysis
137Advantages (I)
- VM allows programmers to write programs that would not otherwise fit in main memory
- They will run, although much more slowly
- Very important in the 70's and 80's
- VM allows the OS to allocate main memory much more efficiently
- Does not waste precious memory space
- Still important today
138Advantages
- VM lets programmers use
- Sparsely populated
- Very large address spaces
(Figure: a sparsely populated virtual address space.)
139Sparsely populated address spaces
- Let programmers put different items far apart from each other
- Code segment
- Data segment
- Stack
- Shared library
- Mapped files
Wait until you take 4330 to study this
140Big difference with caching
- Miss penalty is much bigger
- Around 5 ms
- Assuming a memory access time of 50 ns, 5 ms equals 100,000 memory accesses
- For caches, the miss penalty was around 100 cycles
141Consequences
- Will use much larger block sizes
- Blocks, here called pages, measure 4 KB, 8 KB, ..., with 4 KB an unofficial standard
- Will use fully associative mapping to reduce misses, here called page faults
- Will use write back to reduce disk accesses
- Must keep track of modified (dirty) pages in memory
142Virtual memory
- Combines two big ideas
- Non-contiguous memory allocation: processes are allocated page frames scattered all over the main memory
- On-demand fetch: process pages are brought into main memory when they are accessed for the first time
- The MMU takes care of almost everything
143Main memory
- Divided into fixed-size page frames
- Allocation units
- Sizes are powers of 2 (512 B, ..., 4 KB, ...)
- Properly aligned
- Numbered 0, 1, 2, ...
(Figure: main memory as a sequence of page frames numbered 0 to 8.)
144Program address space
- Divided into fixed-size pages
- Same sizes as page frames
- Properly aligned
- Also numbered 0, 1, 2, ...
(Figure: a program address space as a sequence of pages numbered 0 to 7.)
145The mapping
- Will allocate non contiguous page frames to the
pages of a process
146The mapping
Page Number Frame number
0 0
1 4
2 2
147The mapping
- Assuming 1KB pages and page frames
Virtual Addresses Physical Addresses
0 to 1,023 0 to 1,023
1,024 to 2,047 4,096 to 5,119
2,048 to 3,071 2,048 to 3,071
148The mapping
- Observing that 2^10 is written in binary as 1 followed by ten zeroes
- We will write 0-0 for ten zeroes and 1-1 for ten ones
Virtual Addresses Physical Addresses
0000-0 to 0001-1 0000-0 to 0001-1
0010-0 to 0011-1 1000-0 to 1001-1
0100-0 to 0101-1 0100-0 to 0101-1
149The mapping
- The ten least significant bits of the address do
not change
Virtual Addresses Physical Addresses
000 0-0 to 000 1-1 000 0-0 to 000 1-1
001 0-0 to 001 1-1 100 0-0 to 100 1-1
010 0-0 to 010 1-1 010 0-0 to 010 1-1
150The mapping
- Must only map page numbers into page frame numbers
Page number Page frame number
000 000
001 100
010 010
151The mapping
Page number Page frame number
0 0
1 4
2 2
152The mapping
- Since page numbers are always in sequence, they are redundant
Page number Page frame number
0 0
1 4
2 2
153The algorithm
- Assume the page size is 2^p
- Remove the p least significant bits from the virtual address to obtain the page number
- Use the page number to find the corresponding page frame number in the page table
- Append the p least significant bits of the virtual address to the page frame number to get the physical address (see the sketch below)
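A minimal sketch of that algorithm, with the page table modeled as a Python dictionary mapping page numbers to page frame numbers (fault handling omitted):

    def translate(virtual_addr, page_table, p):
        page_number = virtual_addr >> p           # drop the p offset bits
        offset = virtual_addr & ((1 << p) - 1)    # the offset is copied unchanged
        frame_number = page_table[page_number]
        return (frame_number << p) | offset

    # The earlier 1 KB example: page 1 maps to frame 4
    print(translate(1024, {0: 0, 1: 4, 2: 2}, 10))   # 4096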
154Realization
155The offset
- The offset contains all bits that remain unchanged through the address translation process
- It is a function of the page size
Page size Offset
1 KB 10 bits
2 KB 11 bits
4 KB 12 bits
156The page number
- Contains other bits of virtual address
- Assuming 32-bit addresses
Page size Offset Page number
1 KB 10 bits 22 bits
2 KB 11 bits 21 bits
4KB 12 bits 20 bits
157Internal fragmentation
- Each process now occupies an integer number of pages
- Actual process space is rarely a round number of pages
- Last page of a process is rarely full
- On the average, half a page is wasted
- Not a big issue
- Internal fragmentation
158On-demand fetch (I)
- Most processes terminate without having accessed their whole address space
- Code handling rare error conditions, ...
- Other processes go through multiple phases during which they access different parts of their address space
- Compilers
159On-demand fetch (II)
- VM systems do not fetch the whole address space of a process when it is brought into memory
- They fetch individual pages on demand when they get accessed the first time
- Page miss or page fault
- When memory is full, they expel from memory pages that are not currently in use
160On-demand fetch (III)
- The pages of a process that are not in main memory reside on disk
- In the executable file for the program being run, for the pages in the code segment
- In a special swap area, for the data pages that were expelled from main memory
161On-demand fetch (IV)
(Figure: code pages come from the executable file on disk; expelled data pages go to the swap area.)
162On-demand fetch (V)
- When a process tries to access data that are not present in main memory
- The MMU hardware detects that the page is missing and causes an interrupt
- The interrupt wakes up the page fault handler
- The page fault handler puts the process in the waiting state and brings the missing page into main memory
163Advantages
- VM systems use main memory more efficiently than other memory management schemes
- Give to each process more or less what it needs
- Process sizes are not limited by the size of main memory
- Greatly simplifies program organization
164Sole disadvantage
- Bringing pages from disk is a relatively slow operation
- Takes milliseconds while memory accesses take nanoseconds
- Ten thousand to a hundred thousand times slower
165The cost of a page fault
- Let
- Tm be the main memory access time
- Td the disk access time
- f the page fault rate
- Ta the average access time of the VM
- Ta = (1 - f) Tm + f (Tm + Td) = Tm + f Td
166Example
- Assume Tm = 50 ns and Td = 5 ms
f Mean memory access time
10^-3 50 ns + 5 ms/10^3 = 5,050 ns
10^-4 50 ns + 5 ms/10^4 = 550 ns
10^-5 50 ns + 5 ms/10^5 = 100 ns
10^-6 50 ns + 5 ms/10^6 = 55 ns
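The table follows directly from the formula; a minimal sketch:

    def avg_access_time_ns(t_mem_ns, t_disk_ns, fault_rate):
        # Ta = Tm + f * Td
        return t_mem_ns + fault_rate * t_disk_ns

    for f in (1e-3, 1e-4, 1e-5, 1e-6):
        print(f, avg_access_time_ns(50, 5_000_000, f), "ns")
    # 5050, 550, 100 and 55 ns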
167Conclusion
- Virtual memory works best when page fault rate
is less than a page fault per 100,000
instructions
168Locality principle (I)
- A process that would access its pages in a
totally unpredictable fashion would perform very
poorly in a VM system unless all its pages are in
main memory
169Locality principle (II)
- Process P randomly accesses a very large array consisting of n pages
- If m of these n pages are in main memory, the page fault frequency of the process will be (n - m)/n
- Must switch to another algorithm
170Tuning considerations
- In order to achieve an acceptable performance, a VM system must ensure that each process has in main memory all the pages it is currently referencing
- When this is not the case, the system performance will quickly collapse
171 First problem
- A virtual memory system has
- 32 bit addresses
- 8 KB pages
- What are the sizes of the
- Page number field?
- Offset field?
172Solution (I)
- Step 1: Convert the page size to a power of 2: 8 KB = 2^---- B
- Step 2: The exponent is the length of the offset field
173Solution (II)
- Step 3: Size of the page number field = Address size - Offset size. Here 32 - ____ = _____ bits
- Highlight the text in the box to see the answers: 13 bits for the offset and 19 bits for the page number
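A minimal sketch of those three steps (assumes a power-of-two page size; names are illustrative):

    def page_fields(address_bits, page_size_bytes):
        offset_bits = page_size_bytes.bit_length() - 1     # exponent of the power of two
        return address_bits - offset_bits, offset_bits     # (page number bits, offset bits)

    print(page_fields(32, 8 * 1024))   # (19, 13) for 32-bit addresses and 8 KB pages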
174PAGE TABLE REPRESENTATION
175Page table entries
- A page table entry (PTE) contains
- A page frame number
- Several special bits
- Assuming 32-bit addresses, all fit into four bytes
176The special bits (I)
- Valid bit: 1 if the page is in main memory, 0 otherwise
- Missing bit: 1 if the page is not in main memory, 0 otherwise
- They serve the same function but use different conventions
177The special bits (II)
- Dirty bit: 1 if the page has been modified since it was brought into main memory, 0 otherwise
- A dirty page must be saved in the process swap area on disk before being expelled from main memory
- A clean page can be immediately expelled
178The special bits (III)
- Page-referenced bit: 1 if the page has been recently accessed, 0 otherwise
- Often simulated in software
179Where to store page tables
- Use a three-level approach
- Store parts of the page table
- In high-speed registers located in the MMU: the translation lookaside buffer (TLB) (the good solution)
- In main memory (the bad solution)
- On disk (the ugly solution)
180The translation look aside buffer
- Small high-speed memory
- Contains fixed number of PTEs
- Content-addressable memory
- Entries include page frame number and page number
181Realizations (I)
- TLB of the Intrinsity FastMATH
- 32-bit addresses
- 4 KB pages
- Fully associative TLB with 16 entries
- Each entry occupies 64 bits
- 20 bits for page number
- 20 bits for page frame number
- Valid bit, dirty bit,
182Realizations (II)
- TLB of ULTRA SPARC III
- 64-bit addresses
- Maximum program size is 2^44 bytes, that is, 16 TB
- Supported page sizes are 4 KB, 16 KB, 64 KB and 4 MB ("superpages")
183Realizations (III)
- TLB of ULTRA SPARC III
- Dual direct-mapping (?) TLB
- 64 entries for code pages
- 64 entries for data pages
- Each entry occupies 64 bits
- Page number and page frame number
- Context
- Valid bit, dirty bit,
184The context (I)
- Conventional TLBs contain the PTEs for a specific address space
- Must be flushed each time the OS switches from the current process to a new process
- A frequent action in any modern OS
- Introduces a significant time penalty
185The context (II)
- The UltraSPARC III architecture adds to TLB entries a context identifying a specific address space
- Page mappings from different address spaces can coexist in the TLB
- A TLB hit now requires a match on both the page number and the context
- Eliminates the need to flush the TLB
186TLB misses
- When a PTE cannot be found in the TLB, a TLB miss is said to occur
- TLB misses can be handled
- By the computer firmware
- Cost of miss is one extra memory access
- By the OS kernel
- Cost of miss is two context switches
187Letting SW handle TLB misses
- As with other exceptions, must save the current value of the PC in the EPC register
- Must also assert the exception by the end of the clock cycle during which the memory access occurs
- In MIPS, must prevent the WB cycle from occurring after the MEM cycle that generated the exception
188Example
- Consider the instruction
- lw $1, 0($2)
- If the page containing the word at address 0($2) is not in the TLB, we must prevent any update of $1
189Performance implications
- When TLB misses are handled by the firmware, they are very cheap
- A TLB hit rate of 99 percent is very good; the average access cost will be
- Ta = 0.99×Tm + 0.01×2Tm = 1.01×Tm
- Less true if TLB misses are handled by the kernel
190Storing the rest of the page table
- PTs are too large to be stored in main memory
- Will store active part of the PT in main memory
- Other entries on disk
- Three solutions
- Linear page tables
- Multilevel page tables
- Hashed page tables
191Storing the rest of the page table
- We will review these solutions even though page
table organizations are an operating system topic
192Linear page tables (I)
- Store PT in virtual memory (VMS solution)
- Very large page tables need more than 2 levels (3
levels on MIPS R3000)
193Linear page tables (II)
(Figure: the page table resides in virtual memory along with other page tables; only parts of it are mapped into physical memory.)
194Linear page tables (III)
- Assuming a page size of 4 KB,
- Each page of virtual memory requires a 4-byte PTE
- Each PT maps 4 GB of virtual addresses
- A PT will occupy 4 MB
- Storing these 4 MB in virtual memory will require 4 KB of physical memory for the PT's own page table
195Multi-level page tables (I)
- PT is divided into
- A master index that always remains in main memory
- Sub indexes that can be expelled
196Multi-level page tables (II)
(Figure: two-level translation. The virtual address is split into a primary index, a secondary index, and an offset; the primary index selects an entry in the master index, which points to a subindex; the secondary index selects the frame number there; the offset is copied unchanged into the physical address.)
197Multi-level page tables (III)
- Especially suited for a page size of 4 KB and 32-bit virtual addresses
- Will allocate
- 10 bits of the address for the first level,
- 10 bits for the second level, and
- 12 bits for the offset (see the sketch below)
- The master index and the subindexes will all have 2^10 entries and occupy 4 KB each
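A minimal sketch of the two-level lookup with that 10/10/12 split (page tables modeled as nested dictionaries; fault handling omitted, mapping values are hypothetical):

    def translate_two_level(vaddr, master_index):
        primary = (vaddr >> 22) & 0x3FF      # top 10 bits select the subindex
        secondary = (vaddr >> 12) & 0x3FF    # next 10 bits select the PTE
        offset = vaddr & 0xFFF               # 12-bit offset, copied unchanged
        frame = master_index[primary][secondary]
        return (frame << 12) | offset

    # Hypothetical mapping: virtual page 1 -> page frame 4
    print(hex(translate_two_level(0x1ABC, {0: {1: 4}})))   # 0x4abc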
198Hashed page tables (I)
- Only contain pages that are in main memory
- PTs are much smaller
- Also known as inverted page tables
199Hashed page table (II)
PN page number PFN page frame number
200Selecting the right page size
- Increasing the page size
- Increases the length of the offset
- Decreases the length of the page number
- Reduces the size of page tables
- Less entries
- Increases internal fragmentation
- 4KB seems to be a good choice
201MEMORY PROTECTION
202Objective
- Unless we have an isolated single-user system, we must prevent users from
- Accessing
- Deleting
- Modifying
- the address spaces of other processes, including
the kernel
203Historical considerations
- Earlier operating systems for personal computers did not have any protection
- They were single-user machines
- They typically ran one program at a time
- Windows 2000, Windows XP, Vista and Mac OS X are protected
204Memory protection (I)
- VM ensures that processes cannot access page frames that are not referenced in their page table
- Can refine control by distinguishing among
- Read access
- Write access
- Execute access
- Must also prevent processes from modifying their
own page tables
205Dual-mode CPU
- Requires a dual-mode CPU
- Two CPU modes
- Privileged mode (or executive mode) that allows the CPU to execute all instructions
- User mode that allows the CPU to execute only safe unprivileged instructions
- The state of the CPU is determined by a special bit
206Switching between states
- User mode will be the default mode for all programs
- Only the kernel can run in supervisor mode
- Switching from user mode to supervisor mode is done through an interrupt
- Safe because the jump address is at a well-defined location in main memory
207Memory protection (II)
- Has additional advantages
- Prevents programs from corrupting the address spaces of other programs
- Prevents programs from crashing the kernel
- Not true for device drivers, which are inside the kernel
- A required part of any multiprogramming system
208INTEGRATING CACHES AND VM
209The problem
- In a VM system, each byte of memory has two addresses
- A virtual address
- A physical address
- Should cache tags contain virtual addresses or
physical addresses?
210Discussion
- Using virtual addresses
- Directly available
- Bypass TLB
- Cache entries specific to a given address space
- Must flush caches when the OS selects another
process
- Using physical addresses
- Must access the TLB first
- Cache entries are not specific to a given address space
- Do not have to flush the caches when the OS selects another process
211The best solution
- Let the cache use physical addresses
- No need to flush the cache at each context switch
- TLB access delay is tolerable
212Processing a memory access (I)
if virtual address in TLB:
    get physical address
else:
    create TLB miss exception
    break
I use Python-style pseudocode because it is very compact (hetland.org/writing/instant-python.html)
213Processing a memory access (II)
if read_access:
    while data not in cache:
        stall
    deliver data to CPU
else:
    # write access (continues on the next slide)
214Processing a memory access (III)
    if write_access_OK:
        while data not in cache:
            stall
        write data into cache
        update dirty bit
        put data and address in write buffer
    else:
        # illegal access
        create TLB miss exception
215 More Problems (I)
- A virtual memory system has a virtual address
space of 4 Gigabytes and a page size of 4
Kilobytes. Each page table entry occupies 4
bytes.
216More Problems (II)
- How many bits are used for the byte offset?
- Since 4K = 2^___, the byte offset will use __ bits.
- Highlight text in box to see the answer
Since 4 KB = 2^12 bytes, the byte offset uses 12 bits
217More Problems (III)
- How many bits are used for the page number?
- Since 4G = 2^__, we will have __-bit virtual addresses. Since the byte offset occupies ___ of these __ bits, __ bits are left for the page number.
The page number uses 20 bits of the address
218More Problems (IV)
- What is the maximum number of page table entries in a page table?
- Address space / Page size = 2^__ / 2^__ = 2^___ PTEs.
2^20 page table entries
219More problems (VI)
- A computer has 32-bit addresses and a page size of one kilobyte.
- How many bits are used to represent the page number?
- ___ bits
- What is the maximum number of entries in a process page table?
- 2^___ entries
220Answer
- As 1 KB = 2^10 bytes, the byte offset occupies 10 bits
- The page number uses the remaining 22 bits of the address
221Some review questions
- Why are TLB entries 64 bits wide while page table entries only require 32 bits?
- What would be the main disadvantage of a virtual memory system lacking a dirty bit?
- What is the big limitation of VM systems that cannot prevent processes from executing the contents of any arbitrary page in their address space?
222Answers
- We need extra space for storing the page number
- It would have to write back to disk all pages that it expels, even when they were not modified
- It would make the system less secure
223VIRTUAL MACHINES
224Key idea
- Let different operating systems run at the same time on a single computer
- Windows, Linux and Mac OS
- A real-time OS and a conventional OS
- A production OS and a new OS being tested
225How it is done
- A hypervisor (VM monitor) defines two or more virtual machines
- Each virtual machine has
- Its own virtual CPU
- Its own virtual physical memory
- Its own virtual disk(s)
226The virtualization process
Hypervisor
227Reminder
- In a conventional OS,
- Kernel executes in privileged/supervisor mode
- Can do virtually everything
- User processes execute in user mode
- Cannot modify their page tables
- Cannot execute privileged instructions
228(Figure: user processes run in user mode; a system call switches the CPU to privileged mode and enters the kernel.)
229Two virtual machines
(Figure: the kernels of the two virtual machines run in user mode; only the hypervisor runs in privileged mode.)
230Explanations (II)
- Whenever the kernel of a VM issues a privileged instruction, an interrupt occurs
- The hypervisor takes control and does the physical equivalent of what the VM attempted to do
- Must convert virtual RAM addresses into physical RAM addresses
- Must convert virtual disk block addresses into physical block addresses
231Translating a block address
(Figure: the VM kernel asks to access block x, y of its virtual disk; the hypervisor translates this into block v, w of the actual disk and issues that access.)
232Handling I/Os
- Difficult task because
- Wide variety of devices
- Some devices may be shared among several VMs
- Printers
- Shared disk partition
- Want to let Linux and Windows access the same files
233Virtual Memory Issues
- Each VM kernel manages its own memory