Title: Chapter 3 Memory Management
1Chapter 3 Memory Management
Page Management
- Li Wensheng
- wenshli_at_bupt.edu.cn
2Outline
- Data Structure
- Page Scanner Operation
- Page-out Algorithm
- Hardware Address Translation Layer
3Pages: The Basic Unit of Solaris Memory
- Physical memory is divided into pages.
- A page's identity is its vnode/offset pair.
- The hardware address translation (HAT) and
address space layers manage the mapping between
a physical page and its virtual address space.
4The Page Structure
5The Page Hash List
- global hash list: an array of pointers to
linked lists of pages
- VM system hashes pages with identity onto a
global hash list so that they can be located by
vnode/offset.
- Three page functions search the global page hash
list
- page_find()
- page_lookup()
- page_lookup_nowait()
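A minimal sketch of locating a page by its vnode/offset identity through a chained hash table. The hash function, the table size, and the `Page`/`PageHash` names are hypothetical stand-ins chosen for illustration; they are not the kernel's actual definitions, and page locking is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_HASH_SIZE = 8   # hypothetical table size; the real table is sized by the kernel
PAGE_SHIFT = 13      # 8-Kbyte pages


@dataclass
class Page:
    vnode: int                      # stand-in for a vnode pointer
    offset: int                     # byte offset within the vnode
    next: Optional["Page"] = None   # link to the next page on the same hash chain


def page_hash_index(vnode: int, offset: int) -> int:
    # Hypothetical hash: mix the vnode id with the page number.
    return (vnode + (offset >> PAGE_SHIFT)) % PAGE_HASH_SIZE


class PageHash:
    def __init__(self) -> None:
        # An array of pointers to linked lists of pages.
        self.buckets = [None] * PAGE_HASH_SIZE

    def insert(self, page: Page) -> None:
        i = page_hash_index(page.vnode, page.offset)
        page.next = self.buckets[i]   # push onto the front of the chain
        self.buckets[i] = page

    def find(self, vnode: int, offset: int) -> Optional[Page]:
        # Walk the chain until the vnode/offset identity matches.
        p = self.buckets[page_hash_index(vnode, offset)]
        while p is not None:
            if p.vnode == vnode and p.offset == offset:
                return p
            p = p.next
        return None
```

Two pages whose identities hash to the same bucket simply share a chain, and the lookup walks that chain comparing the full vnode/offset pair.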
6Locating Pages by Vnode/Offset Identity
7MMU-Specific Page Structures
- need to keep machine-specific data about every
page, e.g. the HAT information that describes
how the page is mapped by the MMU.
- struct machpage
- The contents of the machine-specific page
structure are hidden from the generic kernel.
- only the HAT machine-specific layer can see or
manipulate its contents
8Machine-Specific Page Structures: sun4u Example
9Physical Page Lists
- a segmented global physical page list, consisting
of segments of contiguous physical memory.
- Contiguous physical memory segments are added
during system boot.
- Segments can also be added and deleted dynamically
when physical memory is added and removed while the
system is running.
10Arrangement of the Physical Page Lists
11Free List and Cache List
- hold pages that are not mapped into any address
space and that have been freed by page_free().
- free list
- Does not have a vnode/offset associated
- Pages are put on the free list at process exit
- is generally very small
- cache list
- still have a vnode/offset
- seg_map free-behind and seg_vn executables and
libraries (for reuse)
12The Page-Level Interfaces
Method Description
page_create() Creates pages. Page coloring is based on a hash of the vnode offset. page_create() is provided for backward compatibility only. Don't use it if you don't have to. Instead, use the page_create_va() function so that pages are correctly colored.
page_create_va() Creates pages, taking into account the virtual address they will be mapped to. The address is used to calculate page coloring.
page_exists() Tests that a page for vnode/offset exists.
page_find() Searches the hash list for a page with the specified vnode and offset that is known to exist and is already locked
page_first() Finds the first page on the global page hash list
page_free() Frees a page. Pages with vnode/offset go onto the cache list; other pages go onto the free list
page_isfree() Checks whether a page is on the free list
page_ismod() Checks whether a page is modified. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_ismod().
13The Page-Level Interfaces (Cont.)
Method Description
page_isref() Checks whether a page has been referenced. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_isref().
page_isshared() Checks whether a page is shared across more than one address space.
page_lookup() Finds a page representing the specified vnode/offset. If the page is found on a free list, then it will be removed from the free list
page_lookup_nowait() Finds a page representing the specified vnode/offset that is not locked or on the free list
page_needfree() Informs the VM system we need some pages freed up. Calls to page_needfree() must be symmetric, that is they must be followed by another page_needfree() with the same amount of memory multiplied by -1, after the task is complete.
page_next() Finds the next page on the global page hash list.
14The Page Throttle
- implemented in the page_create() and
page_create_va() functions
- causes page creates to block when the PG_WAIT
flag is specified, that is, when available memory is
less than the system global, throttlefree.
- throttlefree is set to the same value as minfree.
- memory allocated through the kernel memory
allocator specifies PG_WAIT and is subject to the
page-create throttle.
15Page Sizes
System | System Type | MMU Page Size Capability | Solaris 2.x Page Size
Early SPARC systems | sun4c | 4K | 4K
microSPARC-I, -II | sun4m | 4K | 4K
SuperSPARC-I, -II | sun4m | 4K, 4M | 4K, 4M
UltraSPARC-I, -II | sun4u | 8K, 64K, 512K, 4M | 8K, 4M
Intel x86 architecture | i86pc | 4K, 4M | 4K, 4M
16Page Coloring
- page placement policy affects processor
performance
- The optimal placement of pages often depends on
the memory access patterns of the application.
- in a random order
- in some sort of strided order
- How can page placement affect performance?
- The UltraSPARC-I and -II implementations
- The L1 cache is 16 Kbytes
- The L2 (external) cache can vary between 512
Kbytes and 8 Mbytes
- The L2 cache is arranged in lines of 64 bytes,
and transfers are done to and from physical
memory in 64-byte units.
17Page Coloring (Cont.)
- Assume
- we have a 32-Kbyte L2 cache
- page size of 8 Kbytes
- four page-sized slots on the L2 cache
- The cache does not necessarily read and write
8-Kbyte units from memory; it does that in
64-byte chunks, so a 32-Kbyte cache has 512
addressable slots (32768 / 64).
18Page Coloring (Cont.)
Offsets 0 and 32768 map to the same cache
line. If we access these two addresses
alternately, a cache ping-pong effect occurs.
We program to virtual memory rather than physical
memory. The OS must provide a sensible mapping
between virtual memory and physical memory.
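The ping-pong effect in the example can be checked with a few lines of arithmetic. This sketch assumes a direct-mapped, physically indexed cache with the example's sizes (32-Kbyte cache, 64-byte lines, 8-Kbyte pages); the function names are illustrative only.

```python
CACHE_SIZE = 32 * 1024   # 32-Kbyte L2 cache from the example
LINE_SIZE = 64           # bytes per cache line
PAGE_SIZE = 8 * 1024     # 8-Kbyte pages

NUM_LINES = CACHE_SIZE // LINE_SIZE    # addressable 64-byte slots
NUM_COLORS = CACHE_SIZE // PAGE_SIZE   # page-sized slots in the cache


def cache_line(paddr: int) -> int:
    # Index of the cache line a physical address falls into.
    return (paddr % CACHE_SIZE) // LINE_SIZE


def page_color(paddr: int) -> int:
    # Index of the page-sized cache slot ("color") for a physical address.
    return (paddr % CACHE_SIZE) // PAGE_SIZE
```

Addresses that differ by a multiple of the cache size land on the same line, so two hot addresses 32768 bytes apart keep evicting each other.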
19Page Coloring (Cont.)
- physical pages are assigned to an address space
in the order they appear in the free list.
- page coloring algorithm
- the free list of physical pages is organized into
specifically colored bins, one color bin for each
slot in the physical cache.
- When a page is put on the free list, the
page_free() algorithms assign it to a color bin.
- When a page is consumed from the free list
(page_create_va() function), the
virtual-to-physical algorithm takes the page from
a physical color bin.
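The color-bin idea can be sketched as a free list split into one queue per color. This toy version picks the bin whose color matches the virtual address, which corresponds to the simple "physical address maps directly to virtual address" policy rather than the hashed default. The class and method names are hypothetical and do not mirror the kernel's interfaces.

```python
from collections import deque


class ColoredFreeList:
    def __init__(self, num_colors: int = 4, page_size: int = 8192) -> None:
        self.num_colors = num_colors
        self.page_size = page_size
        # One bin of free physical pages per cache color.
        self.bins = [deque() for _ in range(num_colors)]

    def color_of(self, addr: int) -> int:
        # Color = page number modulo the number of page-sized cache slots.
        return (addr // self.page_size) % self.num_colors

    def free_page(self, paddr: int) -> None:
        # On free, file the page under the color of its physical address.
        self.bins[self.color_of(paddr)].append(paddr)

    def alloc_for_va(self, vaddr: int):
        # On allocation, prefer the bin matching the virtual address color,
        # then fall back to the next non-empty bin.
        want = self.color_of(vaddr)
        for step in range(self.num_colors):
            b = self.bins[(want + step) % self.num_colors]
            if b:
                return b.popleft()
        return None   # no free pages at all
```

With four colors, consecutive virtual pages draw from four different bins, so a strided walk over a small region does not pile onto one cache slot.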
20Page Coloring (Cont.)
- The kernel supports a default algorithm and two
optional algorithms.
- The default algorithm was chosen according to the
following criteria
- Fairly consistent, repeatable results
- Good overall performance for the majority of
applications
- Acceptable performance across a wide range of
applications
21Solaris Page Coloring Algorithms
No. | Name | Description | Solaris 2.5.1 | Solaris 2.6 | Solaris 7
0 | Hashed VA | The physical page color bin is chosen on a hashed algorithm to ensure even distribution of virtual addresses across the cache. | Default | Default | Default
1 | P. Addr = V. Addr | The physical page color is chosen so that physical addresses map directly to the virtual addresses (as in the example). | Yes | Yes | Yes
2 | Bin Hopping | Physical pages are allocated with a round-robin method. | Yes | Yes | Yes
6 | Kessler's Best Bin | Kessler best bin algorithm. Keeps a per-process history of used colors and chooses the least used color; if there are several, uses the largest bin. | E10000 only (default) | E10000 only (default) | Not Available
22Outline
- Data Structure
- Page Scanner Operation
- Page-out Algorithm
- Hardware Address Translation Layer
23Page Scanner
- Is the memory management daemon that manages
system-wide physical memory
- When there is a memory shortage, the page scanner
runs to steal memory from address spaces, by
- taking pages that haven't been used recently
- syncing them up with their backing store
- freeing them
- If paged-out virtual memory is required again, a
memory page fault occurs.
24Page Scanner (Cont.)
- The balancing of page stealing and page faults
determines which parts of virtual memory will be
backed by physical memory and which will be moved
out to swap.
- global page replacement / local page replacement
- The subtleties of which pages are stolen govern
the memory allocation policies and can affect
different workloads in different ways.
- Enhancements to minimize page stealing from
extensively shared libraries and executables
- Priority paging to prevent application, shared
library, and executable paging on systems with
ample memory.
25Page Scanner Operation
- tracks page usage by reading a per-page hardware
bit from the MMU for each page
- Two bits for each page: reference bit and modify
bit
- awakened when the amount of memory on the
free-page list falls below a system threshold
- typically 1/64th of total physical memory.
- scans through pages in physical page order
- looking for pages that haven't been used recently
to page out to the swap device and free
26Two-handed Clock Algorithm
- front hand clears the referenced and modified
bits for each page
- back hand inspects the referenced and modified
bits some time later
- Pages that haven't been referenced or modified are
swapped out and freed
- scan rate is controlled by the amount of free
memory on the system
- The gap between the front and back hand is fixed
by a boot-time parameter, handspreadpages.
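A toy simulation of one lap of the two-handed clock may help. It models only the reference bit (not the modify bit or the dirty-page queue), assumes every page starts out referenced, and ignores wrap-around. The `accesses` argument is a hypothetical way of saying which pages the workload touches at each step.

```python
def clock_lap(num_pages, handspread, accesses):
    """Return the page indices the back hand would free during one lap.

    accesses maps a step number to the pages touched during that step.
    """
    referenced = [True] * num_pages   # assumption: all pages start referenced
    freed = []
    for step in range(num_pages + handspread):
        front = step
        if front < num_pages:
            referenced[front] = False          # front hand clears the bit
        for page in accesses.get(step, ()):
            referenced[page] = True            # an access sets the bit again
        back = step - handspread               # back hand trails by the spread
        if 0 <= back < num_pages and not referenced[back]:
            freed.append(back)                 # untouched since cleared: steal it
    return freed
```

A page survives only if it is touched in the window between the front hand clearing its bit and the back hand arriving, which is why the hand spread and the scan rate together define "recently used".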
27Outline
- Data Structure
- Page Scanner Operation
- Page-out Algorithm
- Hardware Address Translation Layer
28Introduction to page-out algorithm
- Steals pages when memory is lower than lotsfree
- Scanner runs
- Starts scanning at slowscan (pages/sec)
- Four times/second when memory is short
- Awoken by page allocator if very low
- Puts memory out to backing store
- Uses a Least Recently Used process
- Kernel threads do the scanning
29Page Scanner Parameters
Parameter | Description | Min | Default
lotsfree | starts stealing anonymous memory pages | 512K | 1/64th of memory
desfree | scanner is started at 100 times/second | minfree | ½ of lotsfree
minfree | start scanning every time a new page is created | | ½ of desfree
throttlefree | page_create routine makes the caller wait until free pages are available | | minfree
fastscan | scan rate (pages per second) when free memory = minfree | slowscan | minimum of 64 MB/s or ½ memory size
slowscan | scan rate (pages per second) when free memory = lotsfree | | 100
maxpgio | max number of pages per second that the swap device can handle | 60 | 60 or 90 pages per spindle
handspreadpages | number of pages between the front hand (clearing) and back hand (checking) | 1 | fastscan
min_percent_cpu | CPU usage when free memory is at lotsfree | | 4% (about 1 clock tick) of a single CPU
30Scan Rate Parameters (Assuming No Priority Paging)
Starts scanning at slowscan
Scans faster as the amount of free memory
approaches 0
31Scan Rate Parameters calculation
- lotsfree is calculated at startup as 1/64th of
memory
- slowscan parameter is 100 by default on Solaris
systems
- fastscan is set to total physical memory / 2,
capped at 64 Mbytes/sec
- If total physical memory is 1G (8-Kbyte pages), then
- lotsfree = 2048 pages, fastscan = 8192 pages/sec
- If free memory falls to 12 Mbytes (1536 pages),
the scan rate lies between slowscan and fastscan
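The 1-Gbyte example can be worked through numerically. This sketch assumes the scan rate is a straight-line interpolation from slowscan (free memory at lotsfree) to fastscan (free memory at zero), which matches the shape described on the scan-rate slide; the function name and the integer rounding are illustrative choices.

```python
PAGE_SIZE = 8 * 1024                      # 8-Kbyte pages
MEMORY_BYTES = 1024 ** 3                  # the 1-Gbyte example

total_pages = MEMORY_BYTES // PAGE_SIZE   # 131072 pages
lotsfree = total_pages // 64              # 1/64th of memory, in pages
slowscan = 100                            # pages per second
# Half of memory, capped at 64 Mbytes/sec expressed in pages.
fastscan = min(total_pages // 2, (64 * 1024 ** 2) // PAGE_SIZE)


def scan_rate(freemem, lotsfree, slowscan, fastscan):
    # Assumed linear interpolation: slowscan at lotsfree, fastscan at zero.
    freemem = max(0, min(freemem, lotsfree))
    return (slowscan * freemem + fastscan * (lotsfree - freemem)) // lotsfree


rate_at_12_mbytes = scan_rate(1536, lotsfree, slowscan, fastscan)
```

Under that assumption, 1536 free pages is three quarters of lotsfree, so the rate sits a quarter of the way from slowscan toward fastscan: 2123 pages per second.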
32Not Recently Used Time
- The time between the front hand and back hand
- short time: only the most active pages remain intact
- long time: only the largely unused pages are
stolen
- varies from just a few seconds to several
hours, according to
- the number of pages between front and back hand
- the scan rate
- Example
- Scan rate: 2000 pages/sec
- hand spread: 8192 pages
- Clear/check time: about 4 seconds (8192 / 2000)
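The clear/check time is just the hand spread divided by the scan rate. A short sketch, with a hypothetical function name, shows how widely it swings between a busy scanner and one idling at the default slowscan of 100 pages per second:

```python
def clear_check_seconds(handspread_pages, scan_rate_pages_per_sec):
    # Time the back hand takes to reach a page the front hand just cleared.
    return handspread_pages / scan_rate_pages_per_sec


busy = clear_check_seconds(8192, 2000)   # the example: roughly 4 seconds
idle = clear_check_seconds(8192, 100)    # same spread at slowscan: about 82 seconds
```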
33Shared Library Optimizations
- prevents scanner from stealing pages from
extensively shared libraries
- looks at the share reference count for each page
- if the page is shared more than a certain amount,
then it is skipped during the page scan
operation.
- threshold parameter po_share
- ranges from 8 to 134217728; by default, starts at 8
- A page shared by more than po_share processes
will be skipped
- Each time around, it is decremented
34The Priority Paging Algorithm
- Purpose: overcome adverse behavior that results
from the memory pressure caused by the file
system.
- puts a higher priority on a process's pages
- its heap, stack, shared libraries, and
executables.
- permits scanner to
- pick only file system cache pages when ample
memory is available
- steal application pages only when there is a true
memory shortage.
35The Priority Paging Algorithm
- a new paging parameter, cachefree
- When the amount of free memory lies between
cachefree and lotsfree, the page scanner steals
only file system cache pages
- scanner wakes up when memory falls below
cachefree rather than below lotsfree
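The two thresholds split free memory into three regions. A minimal sketch of that decision follows; the function name, the return labels, and the sample threshold values in the usage note are made up for illustration.

```python
def scanner_targets(freemem, lotsfree, cachefree):
    # Priority paging assumes cachefree is above lotsfree.
    if freemem >= cachefree:
        return "idle"               # enough memory: scanner does not run
    if freemem >= lotsfree:
        return "file-cache-only"    # between the thresholds: spare applications
    return "all-pages"              # true shortage: application pages are eligible
```

For instance, with hypothetical thresholds lotsfree = 2048 and cachefree = 4096 pages, 3000 free pages puts the scanner in the file-cache-only region.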
36Scan Rate Interpolation with the Priority Paging
Algorithm
37Page Scanner CPU Utilization Clamp
- Purpose: to prevent the page-out daemon from
using too much processor time
- Two parameters
- min_percent_cpu, default 4% of a single CPU
- max_percent_cpu, default 80% of a single CPU
- CPU time that can be used
- From min_percent_cpu to max_percent_cpu
- min_percent_cpu when free memory is at lotsfree
(cachefree with priority paging enabled)
- max_percent_cpu if free memory were to fall to
zero
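The clamp can be pictured as a budget that grows as free memory shrinks. The sketch below assumes the budget rises linearly between the two endpoints given above; the slides state only the endpoints, so the linear shape and the function name are assumptions.

```python
def pageout_cpu_percent(freemem, threshold, min_percent_cpu=4, max_percent_cpu=80):
    """CPU budget for the scanner, as a percentage of a single CPU.

    threshold is lotsfree (or cachefree when priority paging is enabled).
    """
    freemem = max(0, min(freemem, threshold))
    span = max_percent_cpu - min_percent_cpu
    # Assumed linear growth from the minimum (at threshold) to the maximum (at zero).
    return min_percent_cpu + span * (threshold - freemem) / threshold
```

Halfway between the threshold and zero, this gives a budget halfway between 4% and 80%, that is 42%.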
38Parameters That Limit Pages Paged Out
- maxpgio
- limits the rate at which I/O is queued to the
swap devices
- defaults to 40 or 60 I/Os per second
- Often set to 100 times the number of swap
spindles
- maxpgio can also indirectly affect file system
throughput
39Page Scanner Implementation
- implemented as two kernel threads
- Page scanner thread: scans pages
- Page-out thread: pushes the dirty pages queued
for I/O
40Page Scanner Architecture
41Scanner: schedpaging()
- woken up
- called four times per second by a callout,
- triggered by the clock() thread if memory falls
below minfree
- triggered by the page allocator if memory falls
below throttlefree
- calculates two setup parameters for the page
scanner thread
- the number of pages to scan
- the number of CPU ticks that the scanner thread
can consume
- triggers the scanner through a condition variable
42Page scanner thread
- cycles through the physical page list
- The front and back hand each have a page pointer
- front hand is incremented first to clear the
referenced and modified bits for the page it points to
- back hand is then incremented to check the status
of the page it points to (using the check_page() function)
- If modified, placed in the dirty page queue
- If not referenced, freed
43Page-out thread
- uses a preinitialized list of async buffer
headers as the queue for I/O requests
- The number of entries is controlled by parameter
async_request_size, initialized with 256
- Requests to queue more I/Os will be blocked
- if the entire queue is full
- if the rate of pages queued has exceeded
maxpgio
- removes I/O entries from the queue
- initiates I/O by calling the vnode putpage()
44The Memory Scheduler
- swaps out entire processes to conserve memory
- removing all of a process's thread structures and
private pages
- setting flags in the process table to indicate
that this process has been swapped out
- Not expensive but affects the process's performance
- launched at boot time
- does nothing unless memory is less than desfree
- looks for processes that it can completely swap
out
- soft-swap out / hard-swap out
45Soft Swapping
- takes place when the 30-second average for free
memory is below desfree
- memory scheduler looks for processes that have
been inactive for at least maxslp seconds
- If found
- swaps out the thread structures for each thread
- pages out all of the private pages of memory for
that process
46Hard Swapping
- takes place when all of the following are true
- At least two processes are on the run queue,
waiting for CPU.
- The average free memory over 30 seconds is
consistently less than desfree.
- Excessive paging is going on
- determined to be true if page-out + page-in >
maxpgio
- Uses a much more aggressive approach to find
memory
- First, the kernel is requested to unload all
modules and cache memory that are not currently
active
- Then, processes are sequentially swapped out
until the desired amount of free memory is
returned
47Memory Scheduler Parameters
Parameter Effect on Memory Scheduler
desfree If the average amount of free memory falls below desfree for 30 seconds, then the memory scheduler is invoked.
maxslp When soft-swapping, the memory scheduler starts swapping processes that have slept for at least maxslp seconds. The default for maxslp is 20 seconds and is tunable
maxpgio When the run queue is greater than 2, free memory is below desfree, and the paging rate is greater than maxpgio, then hard swapping occurs, unloading kernel modules and process memory.
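The soft and hard swapping conditions can be collected into one predicate. This is a sketch of the rules as stated on the slides, using "at least two" runnable processes as on the Hard Swapping slide; checking the hard-swap condition first is an assumption, and all names are hypothetical.

```python
def swap_action(avg_free_30s, desfree, runq_len, pageout_rate, pagein_rate,
                maxpgio, longest_sleep_secs, maxslp=20):
    # The memory scheduler does nothing unless free memory is below desfree.
    if avg_free_30s >= desfree:
        return "none"
    # Hard swap: runnable processes are waiting and paging is excessive.
    if runq_len >= 2 and pageout_rate + pagein_rate > maxpgio:
        return "hard"
    # Soft swap: some process has been asleep for at least maxslp seconds.
    if longest_sleep_secs >= maxslp:
        return "soft"
    return "none"
```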
48Outline
- Data Structure
- Page Scanner Operation
- Page-out Algorithm
- Hardware Address Translation Layer
49Introduction to HAT
- Hardware Address Translation (HAT)
- controls the hardware that manages mapping of
virtual to physical memory
- provides interfaces that implement the creation
and destruction of mappings between virtual and
physical memory
- provides a set of interfaces to probe and control
the MMU
- implements all of the low-level trap handlers to
manage page faults and memory exceptions
50Solaris Virtual Memory Layers
51Solaris Memory Model
52Address Space
- Process Address Space
- Process Text and Data
- Stack (anon memory) and Libraries
- Heap (anon memory)
- Kernel Address Space
- Kernel Text and Data
- Kernel map Space (data structures, caches)
- 32-bit kernel map (64-bit kernels only)
- Trap table
- Critical virtual memory data structures
- Mapping File System Cache (segmap)
53The Address Space
54Role of the HAT layer in virtual-to-physical
translation
- hides the platform-specific implementation
- used by the segment drivers to implement the
segment driver's view of virtual-to-physical
translation
- uses the hat structure to hold top-level
translation information
- hat structure is platform specific
- hat is referenced by the address space structure
- HAT-specific data structures existing in every
page represent the translation information at a
page level
- HAT layer is called when the segment drivers want
to manipulate the hardware MMU
55Summary of HAT Functions
Function Description
hat_chgattr() Changes the protections for the supplied virtual address range.
hat_clrattr() Clears the protections for the supplied virtual address range.
hat_free_end() Informs the HAT layer that a process has exited.
hat_free_start() Informs the HAT layer that a process is exiting.
hat_get_mapped_size() Returns the number of bytes that have valid mappings.
hat_getattr() Gets the protections for the supplied virtual address range.
hat_memload() Creates a mapping for the supplied page at the supplied virtual address. Used to create mappings.
hat_setattr() Sets the protections for the supplied virtual address range.
hat_stats_disable() Finishes collecting stats on an address space.
hat_stats_enable() Starts collecting page reference and modification stats on an address space.
hat_swapin() Allocates resources for a process that is about to be swapped in.
hat_swapout() Frees resources for a process that is about to be swapped out.
hat_sync() Synchronizes the struct page software referenced and modified bits with the hardware MMU.
hat_unload() Unloads a mapping for the given page at the given address.
56Virtual Memory Contexts and Address Spaces
- A virtual memory context is a set of
virtual-to-physical translations that maps an
address space
- contexts change when
- scheduler wants to switch execution from one
process to another
- a trap or interrupt from user mode to kernel
occurs
- virtual memory context zero refers to kernel
context
- HAT layer implements functions to create, delete,
and switch virtual memory contexts
- Different hardware MMUs support different numbers
of concurrent virtual memory contexts
57Hardware Translation Acceleration
- translation lookaside buffer (TLB)
- a hardware cache of recent translations
- The number of entries in the TLB is typically 64
on SPARC systems
- TLB fill
- hardware
- such as Intel and older SPARC implementations
- software algorithms
- like the UltraSPARC architecture
58The UltraSPARC-I and -II HAT
- The UltraSPARC-I and -II MMUs do the following
- Implement mapping between a 44-bit virtual
address and a 41-bit physical address
- Support page sizes of 8 Kbytes, 64 Kbytes, 512
Kbytes, and 4 Mbytes
59Virtual-to-Physical Translation
60Translation Table Entry (TTE)
- TTE is a translation map entry, one for each page
- TTE contains a virtual address tag and the high
bits of the physical address
- TTEs must be loaded into the TLB
- When the MMU finds the TTE entry that matches the
virtual page number and current context, it
retrieves the physical page information
61Relationship of TLBs, TSBs, and TTEs
Translation Software Buffer (TSB)
- software cache of TTEs
- a direct-mapped cache of the TLB
- an array of TTEs in regular physical memory
62TSB Size
Memory Size | Kernel TSB Entries | Kernel TSB Size | User TSB Entries | User TSB Size
< 32 Mbytes | 2048 | 128 Kbytes | |
32 Mbytes to 64 Mbytes | 4096 | 256 Kbytes | 8192 to 16383 | 512 Kbytes to 1 Mbyte
32 Mbytes to 2 Gbytes | 4096 to 262,144 | 512 Kbytes to 16 Mbytes | 16384 to 524,287 | 1 Mbyte to 32 Mbytes
2 Gbytes to 8 Gbytes | 262,144 | 16 Mbytes | 524,288 to 2,097,511 | 32 Mbytes to 128 Mbytes
> 8 Gbytes | 262,144 | 16 Mbytes | 2,097,512 | 128 Mbytes
63Address Space Identifiers
- describe the MMU mode and hardware used to access
pages
- derived from the instruction being executed and
the current trap level
- grouped into three different modes of physical
memory access
- The MMU translation context used to index TLB
entries is derived from the ASI
ASI | Description | Derived Context
Primary | The default address translation used for regular SPARC instructions | The address space translation is done through TLB entries that match the context number in the MMU primary context register
Secondary | A secondary address space context used for accessing another address space context without requiring a context switch | The address space translation is done through TLB entries that match the context number in the MMU secondary context register
Nucleus | The address translation used for TLB miss handlers, system calls, and interrupts | The nucleus context is always zero (the kernel's context).
64UltraSPARC-I and -II Watchpoint Implementation
- watchpoint registers describe the address of
watchpoints for the address space
- Virtual address / physical address
- Watchpoint traps are generated when
- watchpoints are enabled, and
- the data MMU detects a load or store to the
virtual or physical address specified by the
virtual address data watchpoint register or the
physical data watchpoint register
65UltraSPARC-I and -II Protection Modes
TTE in D-MMU | TTE in I-MMU | Writable Attribute Bit | Resultant Protection Mode
Yes | No | 0 | Read-only
No | Yes | Don't Care | Execute-only
Yes | No | 1 | Read/Write
Yes | Yes | 0 | Read-only/Execute
Yes | Yes | 1 | Read/Write/Execute
66UltraSPARC-I and -II MMU-Generated Traps
Trap Description
Instruction_access_miss A TTE for the virtual address of an instruction was not found in the instruction TLB
Instruction_access_exception An instruction privilege violation or invalid instruction address occurred
Data_access_MMU_miss A TTE for the virtual address of a load was not found in the data TLB
Data_access_exception A data access privilege violation or invalid data address occurred
Data_access_protection A data write was attempted to a read-only page
Privileged_action An attempt was made to access a privileged address space
Watchpoint Watchpoints were enabled and the CPU attempted to load or store at the address equivalent to that stored in the watchpoint register
Mem_address_not_aligned An attempt was made to load or store from an address that is not correctly word aligned
67TLB Performance and Large Pages
- large pages
- typically 4 Mbytes in size
- optimize the effectiveness of the hardware TLB
- memory performance is largely influenced by the
effectiveness of the TLB
- because of the time spent servicing TLB misses
- TLBs are limited in size
- only 64 entries in UltraSPARC-I and -II
68TLB reach
- TLB reach: the amount of memory that the TLB can
address concurrently
- TLB reach = TLB entries × Page size
- 64 × 8 Kbytes, or 512 Kbytes
- increase TLB reach
- Increase the number of entries in the TLB
- Increase the page size that each entry reflects
- A trade-off method: use two or more different
page sizes at the same time
- 8-Kbyte, 64-Kbyte, 512-Kbyte, or 4-Mbyte pages
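The TLB reach formula is easy to evaluate for the page sizes listed above. A small sketch, with an illustrative function name and a 64-entry TLB as on UltraSPARC-I and -II:

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    # Memory that can be mapped at once without taking a TLB miss.
    return entries * page_size


reach_8k = tlb_reach(64, 8 * KB)   # 512 Kbytes
reach_4m = tlb_reach(64, 4 * MB)   # 256 Mbytes
```

Moving from 8-Kbyte to 4-Mbyte pages multiplies the reach of the same 64 entries by 512.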
69Solaris Support for Large Pages
- 8 Kbytes
- a good mix of performance across the range of
smaller machines to larger machines
- hurts large-memory scientific applications and
large-memory databases
- hurts kernel performance
- 4 Mbytes
- speeds up the kernel code path
- frees up valuable TLB slots for hungry
applications
- accelerates graphics performance
- Large-Page Database Performance Improvements
Database | Performance Improvement
Oracle TPC-C | 12%
Informix TPC-C | 1%
Informix TPC-D | 6%
70End