Title: Virtual Memory and Paging
1 Virtual Memory and Paging
2 Large Data Sets
- Size of address space
  - 32-bit machines: 2^32 bytes = 4 GB
  - 64-bit machines: 2^64 bytes, a huge number
- Size of main memory
  - approaching 4 GB
- How to handle
  - Applications whose data set is larger than the main memory size?
  - Sets of applications that together need more space than the memory size?
Baer, p. 60
3 Multiprogramming
- More than one program resides in memory at the same time
- I/O is slow
  - If the running program needs I/O, it relinquishes the CPU
Baer, p. 60
4 Multiprogramming Challenges
- How and where to load a program into memory?
- How does a program ask for more memory?
- How to protect one program from another?
Baer, p. 60
5 Virtual Memory
- Solution
  - Give each program the illusion that it can address the whole address space
  - The CPU works with virtual addresses
  - Memory works with real, or physical, addresses
Baer, p. 60
6 Virtual -> Physical Address Translation
- Paging System
  - Divide both the virtual and the physical address spaces into pages of the same size
    - Virtual space: page
    - Physical space: frame
  - Fully associative mapping between pages and frames
    - any page can be stored in any frame
Baer, p. 60
7 Paging System
- Virtual space is much larger than physical memory
- Memory can be shared with little fragmentation
- Pages can be shared among programs
- Memory does not need to store the whole program and its data at the same time
Baer, p. 61
8 Address Translation
Baer, p. 62
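The translation step can be illustrated with a short Python sketch. It assumes 4 KB pages and a toy single-level page table held in a dict; the names `PAGE_SIZE`, `page_table`, `PageFault`, and `translate` are illustrative and do not come from Baer.

```python
PAGE_SIZE = 4096                           # assumed page size: 4 KB
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 offset bits for 4 KB pages

# Hypothetical page table: virtual page number -> (valid bit, frame number)
page_table = {
    0: (1, 5),
    1: (1, 2),
    2: (0, None),   # valid bit 0: the page is not in physical memory
}

class PageFault(Exception):
    """Raised when the PTE of a virtual page is missing or has valid bit 0."""

def translate(vaddr):
    """Translate a virtual address into a physical address."""
    vpn = vaddr >> OFFSET_BITS            # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)      # the offset is unchanged by translation
    valid, frame = page_table.get(vpn, (0, None))
    if not valid:
        raise PageFault(f"page fault on virtual page {vpn}")
    return (frame << OFFSET_BITS) | offset

print(hex(translate(0x1234)))   # virtual page 1, offset 0x234 -> 0x2234
```

Only the page number is translated; the offset within the page is copied unchanged into the physical address.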
9 Page Fault
- Exception generated in program P1 because valid bit = 0 in the Page Table Entry (PTE)
- Page fault handler initiates an I/O read for P1
  - I/O read takes several milliseconds to complete
- Context switch occurs
  - O.S. saves the processor state and starts the I/O operation
  - Hands CPU control to another program, P2
  - Restores P2's state into the CPU
Baer, p. 62
10 Address Translation
Virtual and physical addresses can be of different sizes. Example:
- Virtual address: 64 bits
- Physical address: 40 or 48 bits
Baer, p. 62
11 Translation Look-Aside Buffer (TLB)
- Problem
  - Storing page table entries (PTEs) in memory would require a load for each address translation
  - Caching PTEs interferes with the flow of instructions or data into the cache
- Solution: TLB, a small, highly associative cache dedicated to caching PTEs
Baer, p. 62
12 TLB organization
- Each TLB entry consists of
  - tag
  - data (a PTE)
  - valid bit
  - dirty bit
  - bits to encode memory protection
  - bits to encode recency of access
- A set of TLB entries may be reserved for the Operating System
Baer, p. 62
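The fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming the tag is the virtual page number and the data is a frame number; the `protection` string and the `last_used` counter are hypothetical stand-ins for the protection and recency bits, not an encoding used by any real processor.

```python
from dataclasses import dataclass

@dataclass
class TLBEntry:
    tag: int                 # virtual page number
    frame: int               # data: the frame number taken from the PTE
    valid: bool = True
    dirty: bool = False
    protection: str = "rw"   # assumed encoding of the protection bits
    last_used: int = 0       # stands in for the recency-of-access bits

def tlb_lookup(entries, vpn, now):
    """Fully associative lookup: compare the tag of every valid entry."""
    for entry in entries:
        if entry.valid and entry.tag == vpn:
            entry.last_used = now     # record recency on a hit
            return entry.frame
    return None                       # TLB miss

tlb = [TLBEntry(tag=1, frame=2), TLBEntry(tag=7, frame=9, valid=False)]
print(tlb_lookup(tlb, 1, now=1))   # 2: hit
print(tlb_lookup(tlb, 7, now=2))   # None: the matching entry is invalid
```

Hardware compares all tags in parallel; the loop here only models the result of that comparison.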
13 TLB Characteristics
Architecture   Page Size (KB)   I-TLB Entries   D-TLB Entries
Alpha 21064    8                8 (FA)          32 (FA)
Alpha 21164    8                48 (FA)         64 (FA)
Alpha 21264    8                64 (FA)         128 (FA)
Pentium        4                32 (4-way)      64 (4-way)
Pentium II     4                32 (4-way)      64 (4-way)
Pentium III    4                32 (4-way)      64 (4-way)
Pentium 4      4                64 (4-way)      128 (4-way)
Core Duo       4                64 (FA)         64 (FA)
(FA = fully associative)
Baer, p. 63
14 Large Pages
- Recent processors implement a large page size (typically 4 MB pages)
  - reduces page faults in applications with lots of data (scientific and graph)
  - requires that TLB entries be reserved for large pages
Baer, p. 63
15 Referencing Memory
Baer, p. 63
16 Memory Reference Process
[Figure: flowchart of the memory reference process, beginning with the test "TLB hit?"]
Baer, p. 63
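The decision sequence that starts with "TLB hit?" can be sketched as follows. This is an assumed reconstruction, not a copy of the figure: it uses 4 KB pages, an unbounded dict as a toy TLB, and a page table that lists only resident pages. The names `tlb`, `page_table`, and `reference` are illustrative.

```python
OFFSET_BITS = 12                       # assumed 4 KB pages

tlb = {}                               # vpn -> frame (toy TLB, unbounded)
page_table = {0: 3, 1: 8}              # vpn -> frame, resident pages only

def reference(vaddr):
    """Return (physical address, event) for one memory reference."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    if vpn in tlb:                     # TLB hit: translation is immediate
        event = "TLB hit"
    elif vpn in page_table:            # TLB miss: consult the page table
        tlb[vpn] = page_table[vpn]     # and refill the TLB
        event = "TLB miss"
    else:                              # page is not resident in memory
        return None, "page fault"
    return (tlb[vpn] << OFFSET_BITS) | offset, event

for addr in (0x1004, 0x1008, 0x5000):
    print(hex(addr), reference(addr))
```

The first reference to page 1 misses in the TLB and refills it, the second hits, and the reference to page 5 faults because the page table has no resident entry for it.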
17 Handling TLB Misses
- Must access page table in memory
  - entirely in hardware
  - entirely in software
  - combination of both
- Replacement Algorithms
  - LRU for 4-way associativity (Intel)
  - Not Most Recently Used for full associativity (Alpha)
Baer, p. 64
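The two replacement policies differ in how much recency information they need. The sketch below assumes one common reading of Not Most Recently Used, namely a random choice among all entries except the most recent one; the function names and the timestamp list are hypothetical.

```python
import random

def lru_victim(last_used):
    """LRU: evict the entry whose last access is the oldest."""
    return min(range(len(last_used)), key=lambda i: last_used[i])

def nmru_victim(num_entries, mru_index, rng=random):
    """Not-MRU: evict any entry except the most recently used one."""
    candidates = [i for i in range(num_entries) if i != mru_index]
    return rng.choice(candidates)

# A 4-entry set; last_used[i] is the time of the last access to entry i.
last_used = [12, 3, 9, 7]
print(lru_victim(last_used))               # 1: entry 1 has the oldest access
print(nmru_victim(4, mru_index=0) != 0)    # True: the MRU entry is never chosen
```

LRU must order all entries of a set, which is cheap for 4 ways; Not-MRU only has to remember a single index, which scales better to a fully associative TLB.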
18 Handling TLB Misses (cont.)
- Serving a TLB miss takes 100-1000 cycles
  - Too short to justify a context switch
  - Long enough to have a significant impact on performance
    - even a small TLB miss rate affects CPI
Baer, p. 64
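A back-of-the-envelope calculation shows how a small miss rate still moves CPI. The workload numbers below are hypothetical, chosen only to fall inside the 100-1000 cycle range quoted above; `effective_cpi` is an illustrative helper, not a formula taken from Baer.

```python
def effective_cpi(base_cpi, refs_per_instr, tlb_miss_rate, miss_penalty):
    """Base CPI plus the average TLB-miss stall cycles per instruction."""
    return base_cpi + refs_per_instr * tlb_miss_rate * miss_penalty

# Hypothetical workload: base CPI of 1.0, 1.3 memory references per
# instruction, 3 TLB misses per 10,000 references, 500-cycle miss penalty.
cpi = effective_cpi(1.0, 1.3, 3 / 10_000, 500)
print(round(cpi, 3))
```

Under these assumptions the TLB adds about 0.195 cycles per instruction, close to a 20% slowdown from a miss rate of only 0.03%.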
19 OS handling of page fault
- Reserve a frame from a free list
- Find a page to replace if there is no free frame
- Find if the faulting page is on disk
- Invalidate cache lines mapping to the replaced page
- Invalidate portions of the TLB (maybe cache)
- Write dirty replaced pages to the disk
- Initiate a read for the faulting page
Baer, p. 64
20 When page arrives in memory
- An I/O interrupt is raised
- OS updates the PTE of the page
- OS schedules the requesting process for execution
Baer, p. 64
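The handler steps and the completion steps can be put together in a toy model. This is a sketch under loud assumptions: replacement is FIFO, disk I/O is replaced by log messages, and cache-line invalidation and the disk lookup are omitted. The class `PagingSim` and its method names are hypothetical.

```python
from collections import OrderedDict

class PagingSim:
    """Toy model of OS page-fault handling (FIFO replacement assumed)."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.resident = OrderedDict()   # vpn -> {"frame": int, "dirty": bool}
        self.tlb = {}                   # vpn -> frame
        self.log = []

    def handle_page_fault(self, vpn):
        if self.free_frames:                    # reserve a frame from the free list
            frame = self.free_frames.pop(0)
        else:                                   # no free frame: pick a victim
            victim, pte = self.resident.popitem(last=False)
            self.tlb.pop(victim, None)          # invalidate its TLB entry
            if pte["dirty"]:
                self.log.append(f"write page {victim} to disk")
            frame = pte["frame"]
        self.log.append(f"read page {vpn} into frame {frame}")
        return frame                            # the I/O read is now pending

    def page_arrived(self, vpn, frame):
        """Called when the I/O interrupt signals that the read completed."""
        self.resident[vpn] = {"frame": frame, "dirty": False}   # update the PTE
        self.log.append(f"process waiting on page {vpn} is ready")

sim = PagingSim(num_frames=1)
sim.page_arrived(0, sim.handle_page_fault(0))
sim.resident[0]["dirty"] = True                 # pretend page 0 was written
sim.page_arrived(1, sim.handle_page_fault(1))
print("\n".join(sim.log))
```

With a single frame, the second fault has to evict page 0, and because that page is dirty it is written back before page 1 is read in.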
21 Invalidating TLB Entries on Context Switch
- Page Fault -> Exception -> Context Switch
- Let
  - PR: relinquishing process
  - PI: incoming process
- Problem: TLB entries are for PR, not PI
  - Invalidating the entire TLB on a context switch leads to many TLB misses when PI is restored
- Solution: use a process ID number (PID)
Baer, p. 64
22 Process ID (PID) Number
- O.S. sets a PID for each program
- The PID is added to the tag in the TLB entries
- A PID register stores the PID of the active process
- Match the PID register with the PID in the TLB entry
- No need to invalidate TLB entries on a context switch
- PIDs are recycled by the OS
Baer, p. 64
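The effect of extending the tag with a PID can be seen in a few lines. This is a minimal sketch, assuming a dict keyed by the pair (PID, virtual page number); the class `PidTaggedTLB` and its methods are illustrative names, and PID recycling is not modelled.

```python
class PidTaggedTLB:
    """Toy TLB whose tag is the pair (PID, virtual page number)."""

    def __init__(self):
        self.entries = {}        # (pid, vpn) -> frame
        self.pid_register = 0    # PID of the active process

    def context_switch(self, new_pid):
        self.pid_register = new_pid     # no entries are invalidated

    def insert(self, vpn, frame):
        self.entries[(self.pid_register, vpn)] = frame

    def lookup(self, vpn):
        # A hit requires both the PID and the page number to match.
        return self.entries.get((self.pid_register, vpn))

tlb = PidTaggedTLB()
tlb.context_switch(1)
tlb.insert(vpn=4, frame=10)
tlb.context_switch(2)
print(tlb.lookup(4))     # None: process 2 cannot use process 1's entry
tlb.context_switch(1)
print(tlb.lookup(4))     # 10: the entry survived both context switches
```

When PIDs are recycled, entries still tagged with the reused PID would have to be invalidated first; the sketch leaves that step out.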
23 Page Size vs. Read/Write Time
- Amortizing I/O Time
  - Large page size
  - Read/write consecutive pages
Baer, p. 65
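Amortization follows from a simple cost model: each transfer pays a fixed access latency plus a time proportional to its size. The disk parameters below (10 ms latency, 100 MB/s) are hypothetical, and `transfer_time_ms` is an illustrative helper.

```python
def transfer_time_ms(num_bytes, latency_ms, bandwidth_mb_per_s):
    """Time for one disk transfer: fixed latency plus size / bandwidth."""
    return latency_ms + num_bytes / (bandwidth_mb_per_s * 1e6) * 1e3

# Hypothetical disk: 10 ms access latency, 100 MB/s transfer rate.
for size_kb in (4, 64, 1024):
    t = transfer_time_ms(size_kb * 1024, latency_ms=10, bandwidth_mb_per_s=100)
    print(f"{size_kb:5d} KB page: {t:6.2f} ms total, {t / size_kb:.4f} ms per KB")
```

The fixed latency dominates small transfers, so the cost per kilobyte falls sharply as the page grows, while the total time per transfer keeps rising. That rising total is the transfer-time limit on page size.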
24 Large Pages
- Amortize I/O time to transfer pages
- Smaller page tables
  - More PTEs are in main memory
  - lower probability of a double page fault for a single memory reference
- Fewer TLB misses
  - Single TLB entry translates more locations
- Pages cannot be too large
  - Transfer time and fragmentation
Baer, p. 65
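Both the page-table and the TLB arguments reduce to simple arithmetic. The sketch assumes a flat, single-level page table covering a 4 GB address space and a 64-entry TLB; `num_ptes` and `tlb_reach` are illustrative helper names.

```python
KB, MB, GB = 2**10, 2**20, 2**30

def num_ptes(address_space, page_size):
    """PTEs needed by a flat page table covering the address space."""
    return address_space // page_size

def tlb_reach(entries, page_size):
    """Bytes that a TLB with `entries` entries can translate without a miss."""
    return entries * page_size

for page_size in (4 * KB, 4 * MB):
    print(page_size // KB, "KB pages:",
          num_ptes(4 * GB, page_size), "PTEs,",
          tlb_reach(64, page_size) // KB, "KB TLB reach")
```

Moving from 4 KB to 4 MB pages cuts the table from 1,048,576 to 1,024 entries and raises the reach of the same 64-entry TLB from 256 KB to 256 MB.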
25 Performance of Memory Hierarchy
Baer, p. 66
26 When to bring a missing item (to cache, TLB, or memory)?
Level        Miss Frequency                          Miss Resolution
Cache        a few times per 100 references          5-100 cycles, entirely in hardware
TLB          a few times per 10,000 references       100-1000 cycles, in hardware or software
Page fault   a few times per 10,000,000 references   millions of cycles, requires a context switch
Baer, p. 66
27 Where to put the missing item?
- Cache: restrictive mapping (direct or low associativity)
- TLB: fully associative or high set associativity
- Paging system: general mapping
Baer, p. 66
28 How do we know it is there?
- Cache: compare tags and check valid bits
- TLB: compare tags and PID, check valid bits
- Memory: check page tables
Baer, p. 67
29 What happens on a replacement?
- Caches and TLBs: (approximation to) LRU
- Paging systems
  - Sophisticated algorithms to keep the page fault rate very low
  - O.S. policies allocate a number of pages to each program according to its working set
Baer, p. 67
30 Simulating Memory Hierarchy
- Memory hierarchy simulation is faster than simulation to assess IPC or execution time
- Stack property of some replacement algorithms
  - for a given sequence of memory references at a given level of the hierarchy, the number of misses is monotonically non-increasing with the size of the memory
  - can simulate a range of sizes in a single simulation pass
Baer, p. 67
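LRU is one replacement algorithm with the stack property, which makes the single-pass idea easy to demonstrate. The following is a minimal sketch of a stack-distance simulation, assuming a fully associative memory and a short made-up trace; `lru_miss_counts` is an illustrative name.

```python
def lru_miss_counts(trace, max_size):
    """Misses for LRU memories of size 1..max_size from a single pass."""
    stack = []                          # most recently used item is at index 0
    misses = [0] * (max_size + 1)       # misses[s] = misses with s entries
    for item in trace:
        if item in stack:
            depth = stack.index(item) + 1     # stack distance of this reference
            stack.remove(item)
        else:
            depth = max_size + 1              # first reference: miss at every size
        for size in range(1, min(depth, max_size + 1)):
            misses[size] += 1                 # a miss whenever size < depth
        stack.insert(0, item)
    return misses[1:]

trace = ["a", "b", "c", "a", "b", "d", "a"]
print(lru_miss_counts(trace, max_size=4))     # [7, 7, 4, 4]
```

A reference found at depth d in the stack hits in every memory with at least d entries, so one pass yields the miss count for all sizes, and the counts never increase as the size grows.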
31 Belady's Algorithm
- Belady's algorithm: replace the entry that will be accessed the furthest in the future
  - It is the optimal algorithm
  - It needs to know the future
    - not realizable in practice
    - useful in simulation to compare with practical algorithms
Baer, p. 67
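In a simulator the whole trace is known in advance, so the rule can be applied directly. The sketch below assumes a fully associative memory and a short made-up trace; `belady_misses` is an illustrative name, and the linear scan for the next use is chosen for clarity rather than speed.

```python
def belady_misses(trace, size):
    """Count misses when evicting the entry reused furthest in the future."""
    memory, misses = set(), 0
    for i, item in enumerate(trace):
        if item in memory:
            continue
        misses += 1
        if len(memory) == size:
            future = trace[i + 1:]

            # Distance to the next use; entries never used again sort last.
            def next_use(entry):
                return future.index(entry) if entry in future else len(future)

            memory.remove(max(memory, key=next_use))
        memory.add(item)
    return misses

trace = ["a", "b", "c", "a", "b", "d", "a"]
print(belady_misses(trace, size=2))   # 5
```

For comparison, LRU with two entries incurs 7 misses on the same trace, so the optimal count gives a lower bound against which a practical policy can be judged.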