Title: Superpages
1. Superpages
- Kenneth Chiu
- (Some slides adapted from Juan Navarro)
2. Introduction
3. Overview
- Increasing cost of TLB miss overhead
  - main memory sizes growing exponentially
  - growing working sets
  - TLB size does not grow at the same pace (why?)
- Some caches are now larger than the TLB coverage
  - fully associative, usually 128 or fewer entries
- Processors now provide superpages
  - one TLB entry can map a large region
- OSs have been slow to harness them
  - no transparent superpage support for applications
- Increasing the base page size across the board does not work well
  - wastes memory
  - increases I/O
4. Translation look-aside buffer
- The TLB caches virtual-to-physical address translations.
- TLB coverage
  - the amount of memory mapped by the TLB
  - the amount of memory that can be accessed without TLB misses
5. TLB coverage trend
- (Figure: TLB coverage as a percentage of main memory)
- Factor of 1000 decrease in 15 years
7. How to increase TLB coverage
- Typical TLB coverage is about 1 MB (e.g., 128 entries x 8 KB base pages)
- Use superpages!
  - a mix of large and small pages
- Increases TLB coverage
  - no increase in TLB size
  - no internal fragmentation
- If only large pages were used: larger working sets, more I/O.
8. What are these superpages anyway?
- Memory pages of larger sizes
- Supported by most modern CPUs
  - Alpha: 8 KB, 64 KB, 512 KB, 4 MB
  - IA-32: 4 KB, 4 MB
  - IA-64: 4 KB to 256 MB
  - Opteron? POWER? SPARC?
- Otherwise, same as normal pages
  - use only one TLB entry
  - power-of-2 size, contiguous, aligned (physically and virtually)
  - Digression question: how do you round a number up to a given power of 2? How do you round down? (See the sketch below.)
  - one reference bit, one dirty bit, one set of protection attributes. Implications?
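Since superpage sizes are powers of 2, the digression question has a standard bit-trick answer. A minimal C sketch, assuming align is a power of 2 (so align - 1 is a mask of the low-order bits):

    #include <stdint.h>

    /* Round an address down or up to a power-of-two alignment, e.g. a
       superpage size. Valid only when align is a power of 2. */
    static inline uintptr_t round_down(uintptr_t addr, uintptr_t align)
    {
        return addr & ~(align - 1);
    }

    static inline uintptr_t round_up(uintptr_t addr, uintptr_t align)
    {
        return (addr + align - 1) & ~(align - 1);
    }

    /* Example: round_down(0x12345, 0x10000) == 0x10000 (64 KB boundary),
       round_up(0x12345, 0x10000)   == 0x20000. */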
9. A superpage TLB
- Supported sizes: Alpha 8, 64, 512 KB and 4 MB; Itanium 4, 8, 16, 64, 256 KB and 1, 4, 16, 64, 256 MB
- (Figure: a TLB holding a base-page entry (size 1) and a superpage entry (size 4), both translating virtual addresses in virtual memory to physical addresses in physical memory)
10. The Superpage Problem
11. Issues and Tradeoffs
- Assume the virtual address space of each process is a set of virtual memory objects
  - mmap()ed files, stack, heap, text, etc.
12. Issues and Tradeoffs
- Allocation
  - Allocate anywhere? Might require relocation.
  - Instead, reservation-based: allocate aligned.
  - Requires an a priori choice of size. Why?
  - Must trade off large superpages against future, possibly more critical uses.
- Fragmentation control
  - Memory becomes fragmented due to
    - use of multiple page sizes
    - persistence of file cache pages
    - scattered wired (non-pageable) pages
  - Contiguity becomes a contended resource
  - The OS must
    - use contiguity restoration techniques
    - trade off the impact of contiguity restoration against superpage benefits
13. Issues and Tradeoffs (continued)
- Promotion
  - The OS must decide when to promote.
  - Can also be done incrementally.
  - Must trade off the benefits of superpages against wasted memory if not all parts of the superpage are used.
- Demotion
  - The process of breaking a superpage back down into smaller pages.
  - Hardware maintains only a single reference bit per superpage.
  - Multiple hardware bits should not be too expensive.
- Eviction
  - Superpage eviction is similar to normal eviction.
  - A superpage must be flushed out in its entirety, since there is only a single dirty bit.
14. Issue 1: superpage allocation
- (Figure: a page B in virtual memory must map to a frame at the matching offset within an aligned, superpage-sized extent of physical memory)
- How / when / what size to allocate?
15. Issue 2: promotion
- Promotion: create a superpage out of a set of smaller pages
  - mark the page table entry of each base page
- When to promote?
- Forcibly populate pages? May cause internal fragmentation.
16. Related Approaches
17. Reservations
- Talluri and Hill propose a reservation-based scheme, with preemption under pressure.
  - The main goal is the use of partial-subblock TLBs.
  - Superpage entries with holes.
- HP-UX and IRIX create superpages eagerly.
  - The superpage size is specified by the user.
  - The main drawback: not transparent and not dynamic.
18. Relocation
- Copy page frames to make things contiguous and aligned.
- Romer proposes an algorithm using cost-benefit analysis.
  - More TLB misses, since relocation is performed in response to TLB misses.
  - TLB misses become more expensive, since handling is more complex.
  - But more robust to fragmentation.
- Both approaches can be combined in a hybrid.
19. Superpage allocation
- (Figure: page B's frame is copied so that physical memory becomes contiguous and aligned)
- Copying costs.
20. Hardware Support
- Talluri advocates partial-subblock TLBs
  - superpage TLB entries with holes
- Fang proposes another level of indirection
  - eliminates the contiguity requirement
- None of these are in commercial hardware.
21. Design
22. Key observation
- Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object.
  - Example: array initialization.
- Opportunistic policies
  - make superpages as large, and as soon, as possible
  - as long as there is no penalty for a wrong decision
23. Reservation-Based Allocation
- The buddy allocator is used for reservations
  - does coalescing
  - power-of-2 sizes
- On a fault (sketched below):
  - pick a superpage size
  - get a set of contiguous page frames
  - the frame corresponding to the faulting page is used
  - the others are reserved and added to a reservation list
- If a fault hits a page for which a frame has already been reserved, just use that frame.
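A hypothetical sketch of this fault path; the stubbed helpers (buddy_alloc, pick_superpage_size, lookup_reserved, reserve_rest, map_base_page) only stand in for the real allocator, population map, and page-table code and are not FreeBSD interfaces:

    #include <stddef.h>
    #include <stdint.h>

    #define BASE_PAGE 8192u                 /* Alpha base page size */

    typedef uintptr_t pfn_t;                /* physical frame number */

    /* Illustrative stubs, not real kernel APIs. */
    static pfn_t  buddy_alloc(size_t bytes)                { (void)bytes; return 0; }
    static size_t pick_superpage_size(size_t o, size_t n)  { (void)o; (void)n; return 64u << 10; }
    static int    lookup_reserved(uintptr_t va, pfn_t *f)  { (void)va; (void)f; return 0; }
    static void   reserve_rest(pfn_t b, size_t n, pfn_t u) { (void)b; (void)n; (void)u; }
    static void   map_base_page(uintptr_t va, pfn_t f)     { (void)va; (void)f; }

    /* On a fault: reuse an already-reserved frame if there is one; otherwise
       pick a superpage size, grab a contiguous aligned extent, map only the
       faulting base page, and reserve the remaining frames for later faults. */
    static void handle_fault(uintptr_t va, size_t off, size_t objsize)
    {
        pfn_t frame;
        if (lookup_reserved(va, &frame)) {              /* reservation hit */
            map_base_page(va, frame);
            return;
        }
        size_t sz   = pick_superpage_size(off, objsize);
        pfn_t  base = buddy_alloc(sz);                  /* contiguous extent */
        frame = base + (off & (sz - 1)) / BASE_PAGE;    /* faulting page's frame */
        map_base_page(va, frame);
        reserve_rest(base, sz, frame);                  /* rest stays reserved */
    }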
24. Superpage allocation: preemptible reservations
- (Figure: on the first fault at B, the remaining frames up to the superpage boundary are reserved rather than allocated)
- How much do we reserve? Goal: good TLB coverage, without internal fragmentation.
25. Superpage Size
- Hard to predict the best size
  - if chosen too large, it can later be overridden (preempted)
  - if chosen too small, it cannot be enlarged
- For fixed-size objects, use the largest size that (sketched below)
  - contains the faulting page
  - does not overlap existing reservations or allocations
  - does not extend beyond the end of the object
- For dynamically sized objects, the reservation may extend beyond the current end.
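A sketch of this size-selection rule for a fixed-size object, using Alpha's page sizes; overlaps_existing() is only a placeholder for the population-map check:

    #include <stddef.h>

    static const size_t sp_size[] = { 4u << 20, 512u << 10, 64u << 10, 8u << 10 };

    /* Placeholder for "does this region overlap an existing reservation or
       allocation?" -- in the real system the population map answers this. */
    static int overlaps_existing(size_t start, size_t len)
    {
        (void)start; (void)len;
        return 0;
    }

    /* Pick the largest supported size whose aligned region contains the
       faulting offset, stays within the object, and overlaps nothing. */
    static size_t pick_superpage_size(size_t fault_off, size_t object_size)
    {
        for (int i = 0; i < 4; i++) {
            size_t sz    = sp_size[i];
            size_t start = fault_off & ~(sz - 1);     /* align down: region
                                                         contains fault_off */
            if (start + sz > object_size)             /* would extend past end */
                continue;
            if (overlaps_existing(start, sz))         /* would overlap */
                continue;
            return sz;
        }
        return sp_size[3];                            /* fall back to base pages */
    }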
26. Preempting Reservations
- When no free extent of the desired size is available, two choices:
  - refuse the allocation and reserve a smaller extent
  - preempt an existing reservation
- The policy chosen is to preempt an existing reservation.
- If there is more than one candidate, preempt the least recently allocated (LRA) one.
27. Fragmentation Control
- The buddy allocator coalesces free memory.
- The page daemon is modified to perform contiguity-aware placement.
28. Incremental Promotions
- Promote as soon as a superpage-sized region is fully populated (see the sketch below).
  - It does not need to be the preferred size.
  - Only promote when fully populated.
- Based on the observation that most applications populate densely and early.
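A sketch of the incremental-promotion loop run after a fault is serviced; region_fully_populated() and promote_region() are hypothetical stand-ins for the population-map query and the page-table/TLB update:

    #include <stddef.h>
    #include <stdint.h>

    static const size_t psize[] = { 8u << 10, 64u << 10, 512u << 10, 4u << 20 };

    /* Hypothetical stubs. */
    static int  region_fully_populated(uintptr_t base, size_t len)
    { (void)base; (void)len; return 0; }
    static void promote_region(uintptr_t base, size_t len)
    { (void)base; (void)len; }

    /* Promote step by step: whenever the next larger aligned region around
       the faulting address is fully populated, remap it as one superpage;
       stop at the first region that is not yet full. */
    static void try_incremental_promotion(uintptr_t fault_va, int cur_level)
    {
        for (int lvl = cur_level + 1; lvl < 4; lvl++) {
            uintptr_t base = fault_va & ~((uintptr_t)psize[lvl] - 1);
            if (!region_fully_populated(base, psize[lvl]))
                break;
            promote_region(base, psize[lvl]);
        }
    }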
29. Incremental promotions
- Promotion policy: opportunistic
- (Figure: as base pages are populated, the region is promoted step by step through progressively larger superpages)
30. Speculative Demotions
- Demotion occurs during page replacement.
  - Apparently the policy is page-daemon aware.
  - Demotion is also incremental.
- Speculative demotions check whether the superpage is still being used in its entirety.
  - There is only one reference bit per superpage.
  - How do we detect portions of a superpage that are no longer referenced?
  - How expensive would additional hardware bits for this be?
- On memory pressure, demote superpages when clearing the reference bit.
- Re-promote (incrementally) as pages are referenced again.
31. Paging Out Dirty Superpages
- The whole superpage may not be dirty.
- Heuristic: demote when a clean superpage is written, and re-promote later if it all becomes dirty.
- Hash digests could be used to infer which pages are really dirty
  - but that is expensive
  - do it when idle
32. Demotions: dirty superpages
- One dirty bit per superpage
  - what's dirty and what's not?
  - would have to page out the entire superpage
- Demote on the first write to a clean superpage (sketched below)
- Re-promote (incrementally) as other pages are dirtied
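A sketch of the demote-on-first-write rule; the helpers are hypothetical. The point is that after demotion the hardware tracks dirtiness per base page, so a later page-out only has to write the pages that really are dirty:

    #include <stdint.h>

    /* Hypothetical page-table helpers. */
    static int  mapped_as_clean_superpage(uintptr_t va) { (void)va; return 1; }
    static void demote_to_base_pages(uintptr_t va)      { (void)va; }
    static void mark_base_page_dirty(uintptr_t va)      { (void)va; }

    /* Write fault on a clean superpage: demote first, then dirty only the
       base page actually written. If its siblings are dirtied later, the
       superpage can be re-promoted incrementally. */
    static void handle_write_fault(uintptr_t va)
    {
        if (mapped_as_clean_superpage(va))
            demote_to_base_pages(va);
        mark_base_page_dirty(va);
    }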
33. Multi-List Reservation Scheme
- A set of lists, one per superpage size (sketched below).
- Each reservation is kept on the list corresponding to the largest free extent that could be obtained by preempting it.
- Lists are kept sorted by the time of each reservation's most recent allocation.
- When the system needs an extent, it can get one from the buddy allocator, or it can preempt a reservation.
- Example: page fault, preferred size is 64 KB.
  - First ask the buddy allocator for 64 KB.
  - Then try preemption: the 64 KB list, then the 512 KB list.
  - Else, fall back to base pages (or smaller superpages).
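A sketch of the bookkeeping this implies; the struct layout and the BSD TAILQ macros are illustrative choices, not the paper's actual data structures:

    #include <sys/queue.h>
    #include <stddef.h>

    #define NSIZES 4            /* 8 KB, 64 KB, 512 KB, 4 MB */

    struct reservation {
        TAILQ_ENTRY(reservation) link;
        unsigned long last_alloc_time;  /* when its most recent frame was taken */
        /* extent bounds, population-map backpointer, ... */
    };

    /* One list per size. A reservation sits on the list for the largest free
       extent preempting it would yield, and each list is kept ordered so that
       the reservation whose most recent allocation is oldest sits at the
       front (the best preemption candidate). */
    TAILQ_HEAD(reslist, reservation);
    static struct reslist reservations[NSIZES];

    /* Allocation for a preferred size k then proceeds roughly as:
       1. ask the buddy allocator for a free extent of size k;
       2. otherwise preempt from reservations[k], then reservations[k+1], ...;
       3. otherwise fall back to smaller superpages or base pages. */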
34. Allocation: managing reservations
- (Figure: reservation lists keyed by the largest unused and aligned chunk: 4, 2, 1)
- The best candidate for preemption is at the front:
  - the reservation whose most recently populated frame was populated the least recently
35. Population Map
- Keeps track of allocated pages within each memory object.
- On each page fault, enables lookup of the reserved page frame.
- When allocating contiguous regions, enables the OS to detect and avoid overlap. (Isn't the buddy allocator for that?)
- Assists in promotion decisions.
- When preempting, helps to identify unallocated regions.
- Implemented as a radix tree.
36. Population Map (continued)
- One population map per largest-superpage-sized region.
- somepop holds the number of children that have at least one populated base page.
- fullpop holds the number of fully populated children.
- Backpointers to and from reservations (sketched below).
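A sketch of what one radix-tree node might look like; the radix of 8 matches the 8x ratio between consecutive Alpha page sizes, and the field names follow the slide, but the layout is only an illustration:

    #include <stddef.h>

    #define RADIX 8   /* ratio between consecutive page sizes on Alpha */

    struct reservation;                      /* declared elsewhere */

    struct popmap_node {
        int somepop;                         /* children with at least one
                                                populated base page */
        int fullpop;                         /* children that are fully populated */
        struct popmap_node *child[RADIX];    /* NULL where nothing is populated */
        struct reservation *resv;            /* backpointer to the reservation
                                                covering this region, if any */
    };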
37. Population Map (continued)
- Reserved frame lookup
  - the virtual address is rounded down to a multiple of the largest page size, yielding a (memory_object, page_index) tuple
  - a hash table maps that tuple to the population map (why not include it in the radix tree?)
  - traverse down to find the reserved page frame, if any
38. Population Map (continued)
- Overlap avoidance
  - traverse down to the first node whose somepop count is zero
- Note that the buddy allocator manages physical address space, while the population map manages virtual address space.
39. Population Map (continued)
- Promotion decisions
  - after the fault is serviced, promotion is attempted
  - promote at the first node on the way down that is fully populated
40. Population Map (continued)
- Preemption assistance
  - when a reservation is preempted, its allocation status is needed to decide whether the extent is freed or reinserted into a reservation list
  - this is looked up via the pointer from the reservation to its node
41. Implementation
42. Contiguity-Aware Page Daemon
- FreeBSD normally keeps three page lists:
  - Active: accessed recently, but do not necessarily have the reference bit set
  - Inactive: mapped, but not referenced for a long time
  - Cache: not mapped, clean
- Under memory pressure:
  - clean inactive pages go to the cache
  - dirty inactive pages get paged out (and become clean)
  - some active pages become inactive
43. Changes to Page Daemon
- Cache pages are considered available (managed by the buddy allocator).
  - What happens if they are referenced?
- The page daemon is activated when contiguity is low.
  - The criterion is failure to allocate a region of the preferred size.
- The daemon traverses the inactive list and moves to the cache the pages needed to satisfy recent requests.
  - Does this make sense?
- Clean pages are moved to the inactive list as soon as their file is closed.
  - What about mmap()ed files?
44. Wired Page Clustering
- Wired pages cannot be evicted.
- Normally, they end up scattered throughout memory.
- A special allocator groups wired pages so they don't fragment memory too much.
45. Multiple Mappings
- For multiple mappings to use superpages, their virtual addresses must be equally aligned.
- Use the same virtual address alignment across mappings.
- Align to the largest superpage that is no larger than the mapping (sketched below).
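A sketch of the alignment rule, assuming Alpha's page sizes; choosing alignment this way means every mapping of the object can line up on the same superpage boundaries:

    #include <stddef.h>

    static const size_t sp_desc[] = { 4u << 20, 512u << 10, 64u << 10, 8u << 10 };

    /* Virtual alignment for a new mapping: the largest superpage size that
       does not exceed the mapping's length. */
    static size_t mapping_alignment(size_t mapping_len)
    {
        for (int i = 0; i < 4; i++)
            if (sp_desc[i] <= mapping_len)
                return sp_desc[i];
        return sp_desc[3];                   /* tiny mapping: base-page alignment */
    }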
46. Evaluation
47. Platform
- Alpha 21264 at 500 MHz
- Four page sizes: 8 KB, 64 KB, 512 KB, and 4 MB
- Software page tables, firmware TLB loader
- 512 MB RAM
- 64 KB data and instruction L1 caches, virtually indexed and 2-way associative
- 4 MB unified, direct-mapped external L2 cache
48. Workloads
- CINT2000: SPEC integer benchmarks
- CFP2000: SPEC floating-point benchmarks
- Web: web server; working set was 238 MB, data set is 3.6 GB
- Image: 90-degree rotation using ImageMagick
- POV-Ray: ray tracer
- Linker: link of the FreeBSD kernel
- C4: alpha-beta search solver for the Connect-4 game
- Tree: many tree operations, designed for poor locality
- SP: solver
- FFTW: FFT
- Matrix: non-blocked matrix transpose
49. Results
- Nothing really surprising.
50. Results (continued)
- Mesa slows down, due to not favoring zeroed-out pages.
51. Results (continued)
- Web does poorly, since it uses many small files.
- Matrix incurs one TLB miss for every two memory accesses.
52. Page Coloring
- A side effect is that page coloring becomes less useful, since physical pages are contiguous anyway.
53. Multiple Page Sizes
- The best size depends on the application: 64 KB for SP, 512 KB for vpr, 4 MB for FFTW.
- Some applications are too small to fully populate a large superpage.
  - The OS could be allowed to promote superpages that are not fully populated.
- Some applications really need a variety of superpage sizes (mcf).
- Sometimes large reductions in TLB misses bring little gain.
54. Sustained Performance
- Fragment memory with a web server.
- Then run two schemes:
  - cache: just treat all cache pages as available
  - daemon: contiguity-aware page daemon plus wired-page clustering
- Remember how the daemon worked?
55. Concurrent Execution
- Run the web server concurrently with a contiguity-seeking application.
- Exercises the daemon.
- Only minor degradation of the server (3%).
- Increase from 3% to 30% of requests satisfied.
56. Adversarial Applications
- Incremental promotion overhead: touch one byte in each page (see the sketch below).
  - 8.9% slowdown, 7.2% of which is due to a hardware-specific reason.
  - The rest is due to maintenance of the population maps.
- Sequential access overhead: the cmp program.
  - No slowdown.
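A userspace sketch of that first adversarial pattern (the region size and base-page size below are placeholders): every base page is written exactly once, so the cost of reserving and incrementally promoting superpages is never amortized.

    #include <sys/mman.h>
    #include <stddef.h>

    #define REGION   (64u << 20)    /* placeholder: 64 MB anonymous region */
    #define BASEPAGE (8u << 10)     /* Alpha base page size */

    int main(void)
    {
        volatile char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        /* One write per base page, nothing else. */
        for (size_t off = 0; off < REGION; off += BASEPAGE)
            p[off] = 1;
        return 0;
    }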
57. Superpage Demotion
- Map a 100 MB file, read every page to trigger promotion, then write into every 512th page, flush, then exit.
- As expected, a huge difference: about 20x.
- It would have been more interesting to write an adversarial program
  - one that makes the demotion useless.
58. Summary
- Superpages: about 30% improvement
  - transparently realized, low overhead
- Contiguity restoration is necessary
  - sustains the benefits, low impact
- Multiple page sizes are important
  - scales to very large superpages