Title: Superpages
1. Superpages
- Kenneth Chiu
- (Some slides adapted from Juan Navarro)
2. Introduction
3. Overview
- Increasing cost of TLB miss overhead
  - main memory sizes growing exponentially
  - growing working sets
  - TLB size does not grow at the same pace (why?)
- Some caches are now larger than the TLB coverage
  - fully associative, usually 128 or fewer entries
- Processors now provide superpages
  - one TLB entry can map a large region
- OSs have been slow to harness them
  - no transparent superpage support for applications
- Increasing the base page size across the board does not work well
  - wastes memory
  - increases I/O
4. Translation look-aside buffer
- The TLB caches virtual-to-physical address translations.
- TLB coverage
  - the amount of memory mapped by the TLB
  - the amount of memory that can be accessed without TLB misses
5. TLB coverage trend
- (Figure: TLB coverage as a percentage of main memory)
- Factor of 1000 decrease in 15 years
7. How to increase TLB coverage
- Typical TLB coverage is about 1 MB (e.g., 128 entries x 8 KB base pages)
- Use superpages!
  - a mix of large and small pages
- Increases TLB coverage
  - no increase in TLB size
  - no internal fragmentation
- If only large pages were used: larger working sets, more I/O.
8. What are these superpages anyway?
- Memory pages of larger sizes
- Supported by most modern CPUs
  - Alpha: 8 KB, 64 KB, 512 KB, 4 MB
  - IA-32: 4 KB, 4 MB
  - IA-64: 4 KB to 256 MB
  - Opteron? POWER? SPARC?
- Otherwise, same as normal pages
  - use only one TLB entry
  - power-of-2 size, contiguous, aligned (physically and virtually)
  - Digression question: how do you round a number up to a given power of 2? How do you round down? (See the sketch below.)
  - one reference bit, one dirty bit, one set of protection attributes. Implications?
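Since superpage sizes are powers of 2, the digression question has a standard bit-trick answer. A minimal C sketch, assuming align is a power of 2 (so align - 1 is a mask of the low-order bits):

    #include <stdint.h>

    /* Round an address down or up to a power-of-two alignment, e.g. a
       superpage size. Valid only when align is a power of 2. */
    static inline uintptr_t round_down(uintptr_t addr, uintptr_t align)
    {
        return addr & ~(align - 1);
    }

    static inline uintptr_t round_up(uintptr_t addr, uintptr_t align)
    {
        return (addr + align - 1) & ~(align - 1);
    }

    /* Example: round_down(0x12345, 0x10000) == 0x10000 (64 KB boundary),
       round_up(0x12345, 0x10000)   == 0x20000. */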
9. A superpage TLB
- Supported sizes: Alpha 8, 64, 512 KB and 4 MB; Itanium 4, 8, 16, 64, 256 KB and 1, 4, 16, 64, 256 MB
- (Figure: a TLB holding a base-page entry (size 1) and a superpage entry (size 4), both translating virtual addresses in virtual memory to physical addresses in physical memory)
10. The Superpage Problem
11. Issues and Tradeoffs
- Assume the virtual address space of each process is a set of virtual memory objects
  - mmap()ed files, stack, heap, text, etc.
12. Issues and Tradeoffs
- Allocation
  - Allocate anywhere? Might require relocation.
  - Instead, reservation-based: allocate aligned.
  - Requires an a priori choice of size. Why?
  - Must trade off large superpages against future, possibly more critical uses.
- Fragmentation control
  - Memory becomes fragmented due to
    - use of multiple page sizes
    - persistence of file cache pages
    - scattered wired (non-pageable) pages
  - Contiguity becomes a contended resource
  - The OS must
    - use contiguity restoration techniques
    - trade off the impact of contiguity restoration against superpage benefits
13. Issues and Tradeoffs (continued)
- Promotion
  - The OS must decide when to promote.
  - Can also be done incrementally.
  - Must trade off the benefits of superpages against wasted memory if not all parts of the superpage are used.
- Demotion
  - The process of breaking a superpage back down into smaller pages.
  - Hardware maintains only a single reference bit per superpage.
  - Multiple hardware bits should not be too expensive.
- Eviction
  - Superpage eviction is similar to normal eviction.
  - A superpage must be flushed out in its entirety, since there is only a single dirty bit.
14. Issue 1: superpage allocation
- (Figure: a page B in virtual memory must map to a frame at the matching offset within an aligned, superpage-sized extent of physical memory)
- How / when / what size to allocate?
15. Issue 2: promotion
- Promotion: create a superpage out of a set of smaller pages
  - mark the page table entry of each base page
- When to promote?
- Forcibly populate pages? May cause internal fragmentation.
16. Related Approaches
17. Reservations
- Talluri and Hill propose a reservation-based scheme, with preemption under pressure.
  - The main goal is the use of partial-subblock TLBs.
  - Superpage entries with holes.
- HP-UX and IRIX create superpages eagerly.
  - The superpage size is specified by the user.
  - The main drawback: not transparent and not dynamic.
18. Relocation
- Copy page frames to make things contiguous and aligned.
- Romer proposes an algorithm using cost-benefit analysis.
  - More TLB misses, since relocation is performed in response to TLB misses.
  - TLB misses become more expensive, since handling is more complex.
  - But more robust to fragmentation.
- Both approaches can be combined in a hybrid.
19. Superpage allocation
- (Figure: page B's frame is copied so that physical memory becomes contiguous and aligned)
- Copying costs.
20. Hardware Support
- Talluri advocates partial-subblock TLBs
  - superpage TLB entries with holes
- Fang proposes another level of indirection
  - eliminates the contiguity requirement
- None of these are in commercial hardware.
21. Design
22. Key observation
- Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object.
  - Example: array initialization.
- Opportunistic policies
  - make superpages as large, and as soon, as possible
  - as long as there is no penalty for a wrong decision
23. Reservation-Based Allocation
- The buddy allocator is used for reservations
  - does coalescing
  - power-of-2 sizes
- On a fault (sketched below):
  - pick a superpage size
  - get a set of contiguous page frames
  - the frame corresponding to the faulting page is used
  - the others are reserved and added to a reservation list
- If a fault hits a page for which a frame has already been reserved, just use that frame.
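A hypothetical sketch of this fault path; the stubbed helpers (buddy_alloc, pick_superpage_size, lookup_reserved, reserve_rest, map_base_page) only stand in for the real allocator, population map, and page-table code and are not FreeBSD interfaces:

    #include <stddef.h>
    #include <stdint.h>

    #define BASE_PAGE 8192u                 /* Alpha base page size */

    typedef uintptr_t pfn_t;                /* physical frame number */

    /* Illustrative stubs, not real kernel APIs. */
    static pfn_t  buddy_alloc(size_t bytes)                { (void)bytes; return 0; }
    static size_t pick_superpage_size(size_t o, size_t n)  { (void)o; (void)n; return 64u << 10; }
    static int    lookup_reserved(uintptr_t va, pfn_t *f)  { (void)va; (void)f; return 0; }
    static void   reserve_rest(pfn_t b, size_t n, pfn_t u) { (void)b; (void)n; (void)u; }
    static void   map_base_page(uintptr_t va, pfn_t f)     { (void)va; (void)f; }

    /* On a fault: reuse an already-reserved frame if there is one; otherwise
       pick a superpage size, grab a contiguous aligned extent, map only the
       faulting base page, and reserve the remaining frames for later faults. */
    static void handle_fault(uintptr_t va, size_t off, size_t objsize)
    {
        pfn_t frame;
        if (lookup_reserved(va, &frame)) {              /* reservation hit */
            map_base_page(va, frame);
            return;
        }
        size_t sz   = pick_superpage_size(off, objsize);
        pfn_t  base = buddy_alloc(sz);                  /* contiguous extent */
        frame = base + (off & (sz - 1)) / BASE_PAGE;    /* faulting page's frame */
        map_base_page(va, frame);
        reserve_rest(base, sz, frame);                  /* rest stays reserved */
    }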
24. Superpage allocation: preemptible reservations
- (Figure: on the first fault at B, the remaining frames up to the superpage boundary are reserved rather than allocated)
- How much do we reserve? Goal: good TLB coverage, without internal fragmentation.
25. Superpage Size
- Hard to predict the best size
  - if chosen too large, it can later be overridden (preempted)
  - if chosen too small, it cannot be enlarged
- For fixed-size objects, use the largest size that (sketched below)
  - contains the faulting page
  - does not overlap existing reservations or allocations
  - does not extend beyond the end of the object
- For dynamically sized objects, the reservation may extend beyond the current end.
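A sketch of this size-selection rule for a fixed-size object, using Alpha's page sizes; overlaps_existing() is only a placeholder for the population-map check:

    #include <stddef.h>

    static const size_t sp_size[] = { 4u << 20, 512u << 10, 64u << 10, 8u << 10 };

    /* Placeholder for "does this region overlap an existing reservation or
       allocation?" -- in the real system the population map answers this. */
    static int overlaps_existing(size_t start, size_t len)
    {
        (void)start; (void)len;
        return 0;
    }

    /* Pick the largest supported size whose aligned region contains the
       faulting offset, stays within the object, and overlaps nothing. */
    static size_t pick_superpage_size(size_t fault_off, size_t object_size)
    {
        for (int i = 0; i < 4; i++) {
            size_t sz    = sp_size[i];
            size_t start = fault_off & ~(sz - 1);     /* align down: region
                                                         contains fault_off */
            if (start + sz > object_size)             /* would extend past end */
                continue;
            if (overlaps_existing(start, sz))         /* would overlap */
                continue;
            return sz;
        }
        return sp_size[3];                            /* fall back to base pages */
    }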
26. Preempting Reservations
- When no free extent of the desired size is available, two choices:
  - refuse the allocation and reserve a smaller extent
  - preempt an existing reservation
- The policy chosen is to preempt an existing reservation.
- If there is more than one candidate, preempt the least recently allocated (LRA) one.
27. Fragmentation Control
- The buddy allocator coalesces free memory.
- The page daemon is modified to perform contiguity-aware placement.
28. Incremental Promotions
- Promote as soon as a superpage-sized region is fully populated (see the sketch below).
  - It does not need to be the preferred size.
  - Only promote when fully populated.
- Based on the observation that most applications populate densely and early.
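A sketch of the incremental-promotion loop run after a fault is serviced; region_fully_populated() and promote_region() are hypothetical stand-ins for the population-map query and the page-table/TLB update:

    #include <stddef.h>
    #include <stdint.h>

    static const size_t psize[] = { 8u << 10, 64u << 10, 512u << 10, 4u << 20 };

    /* Hypothetical stubs. */
    static int  region_fully_populated(uintptr_t base, size_t len)
    { (void)base; (void)len; return 0; }
    static void promote_region(uintptr_t base, size_t len)
    { (void)base; (void)len; }

    /* Promote step by step: whenever the next larger aligned region around
       the faulting address is fully populated, remap it as one superpage;
       stop at the first region that is not yet full. */
    static void try_incremental_promotion(uintptr_t fault_va, int cur_level)
    {
        for (int lvl = cur_level + 1; lvl < 4; lvl++) {
            uintptr_t base = fault_va & ~((uintptr_t)psize[lvl] - 1);
            if (!region_fully_populated(base, psize[lvl]))
                break;
            promote_region(base, psize[lvl]);
        }
    }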
29. Incremental promotions
- Promotion policy: opportunistic
- (Figure: as base pages are populated, the region is promoted step by step through progressively larger superpages)
30. Speculative Demotions
- Demotion occurs during page replacement.
  - Apparently the policy is page-daemon aware.
  - Demotion is also incremental.
- Speculative demotions check whether the superpage is still being used in its entirety.
  - There is only one reference bit per superpage.
  - How do we detect portions of a superpage that are no longer referenced?
  - How expensive would additional hardware bits for this be?
- On memory pressure, demote superpages when clearing the reference bit.
- Re-promote (incrementally) as pages are referenced again.
31. Paging Out Dirty Superpages
- The whole superpage may not be dirty.
- Heuristic: demote when a clean superpage is written, and re-promote later if it all becomes dirty.
- Hash digests could be used to infer which pages are really dirty
  - but that is expensive
  - do it when idle
32. Demotions: dirty superpages
- One dirty bit per superpage
  - what's dirty and what's not?
  - would have to page out the entire superpage
- Demote on the first write to a clean superpage (sketched below)
- Re-promote (incrementally) as other pages are dirtied
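A sketch of the demote-on-first-write rule; the helpers are hypothetical. The point is that after demotion the hardware tracks dirtiness per base page, so a later page-out only has to write the pages that really are dirty:

    #include <stdint.h>

    /* Hypothetical page-table helpers. */
    static int  mapped_as_clean_superpage(uintptr_t va) { (void)va; return 1; }
    static void demote_to_base_pages(uintptr_t va)      { (void)va; }
    static void mark_base_page_dirty(uintptr_t va)      { (void)va; }

    /* Write fault on a clean superpage: demote first, then dirty only the
       base page actually written. If its siblings are dirtied later, the
       superpage can be re-promoted incrementally. */
    static void handle_write_fault(uintptr_t va)
    {
        if (mapped_as_clean_superpage(va))
            demote_to_base_pages(va);
        mark_base_page_dirty(va);
    }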
33. Multi-List Reservation Scheme
- A set of lists, one per superpage size (sketched below).
- Each reservation is kept on the list corresponding to the largest free extent that could be obtained by preempting it.
- Lists are kept sorted by the time of each reservation's most recent allocation.
- When the system needs an extent, it can get one from the buddy allocator, or it can preempt a reservation.
- Example: page fault, preferred size is 64 KB.
  - First ask the buddy allocator for 64 KB.
  - Then try preemption: the 64 KB list, then the 512 KB list.
  - Else, fall back to base pages (or smaller superpages).
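A sketch of the bookkeeping this implies; the struct layout and the BSD TAILQ macros are illustrative choices, not the paper's actual data structures:

    #include <sys/queue.h>
    #include <stddef.h>

    #define NSIZES 4            /* 8 KB, 64 KB, 512 KB, 4 MB */

    struct reservation {
        TAILQ_ENTRY(reservation) link;
        unsigned long last_alloc_time;  /* when its most recent frame was taken */
        /* extent bounds, population-map backpointer, ... */
    };

    /* One list per size. A reservation sits on the list for the largest free
       extent preempting it would yield, and each list is kept ordered so that
       the reservation whose most recent allocation is oldest sits at the
       front (the best preemption candidate). */
    TAILQ_HEAD(reslist, reservation);
    static struct reslist reservations[NSIZES];

    /* Allocation for a preferred size k then proceeds roughly as:
       1. ask the buddy allocator for a free extent of size k;
       2. otherwise preempt from reservations[k], then reservations[k+1], ...;
       3. otherwise fall back to smaller superpages or base pages. */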
34. Allocation: managing reservations
- (Figure: reservation lists keyed by the largest unused and aligned chunk: 4, 2, 1)
- The best candidate for preemption is at the front:
  - the reservation whose most recently populated frame was populated the least recently
35. Population Map
- Keeps track of allocated pages within each memory object.
- On each page fault, enables lookup of the reserved page frame.
- When allocating contiguous regions, enables the OS to detect and avoid overlap. (Isn't the buddy allocator for that?)
- Assists in promotion decisions.
- When preempting, helps to identify unallocated regions.
- Implemented as a radix tree.
36. Population Map (continued)
- One population map per largest-superpage-sized region.
- somepop holds the number of children that have at least one populated base page.
- fullpop holds the number of fully populated children.
- Backpointers to and from reservations (sketched below).
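A sketch of what one radix-tree node might look like; the radix of 8 matches the 8x ratio between consecutive Alpha page sizes, and the field names follow the slide, but the layout is only an illustration:

    #include <stddef.h>

    #define RADIX 8   /* ratio between consecutive page sizes on Alpha */

    struct reservation;                      /* declared elsewhere */

    struct popmap_node {
        int somepop;                         /* children with at least one
                                                populated base page */
        int fullpop;                         /* children that are fully populated */
        struct popmap_node *child[RADIX];    /* NULL where nothing is populated */
        struct reservation *resv;            /* backpointer to the reservation
                                                covering this region, if any */
    };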
37. Population Map (continued)
- Reserved frame lookup
  - the virtual address is rounded down to a multiple of the largest page size, yielding a (memory_object, page_index) tuple
  - a hash table maps that tuple to the population map (why not include it in the radix tree?)
  - traverse down to find the reserved page frame, if any
38. Population Map (continued)
- Overlap avoidance
  - traverse down to the first node whose somepop count is zero
- Note that the buddy allocator manages physical address space, while the population map manages virtual address space.
39. Population Map (continued)
- Promotion decisions
  - after the fault is serviced, promotion is attempted
  - promote at the first node on the way down that is fully populated
40. Population Map (continued)
- Preemption assistance
  - when a reservation is preempted, its allocation status is needed to decide whether the extent is freed or reinserted into a reservation list
  - this is looked up via the pointer from the reservation to its node
41. Implementation
42. Contiguity-Aware Page Daemon
- FreeBSD normally keeps three page lists:
  - Active: accessed recently, but do not necessarily have the reference bit set
  - Inactive: mapped, but not referenced for a long time
  - Cache: not mapped, clean
- Under memory pressure:
  - clean inactive pages go to the cache
  - dirty inactive pages get paged out (and become clean)
  - some active pages become inactive
43. Changes to Page Daemon
- Cache pages are considered available (managed by the buddy allocator).
  - What happens if they are referenced?
- The page daemon is activated when contiguity is low.
  - The criterion is failure to allocate a region of the preferred size.
- The daemon traverses the inactive list and moves to the cache the pages needed to satisfy recent requests.
  - Does this make sense?
- Clean pages are moved to the inactive list as soon as their file is closed.
  - What about mmap()ed files?
44. Wired Page Clustering
- Wired pages cannot be evicted.
- Normally, they end up scattered throughout memory.
- A special allocator groups wired pages so they don't fragment memory too much.
45. Multiple Mappings
- For multiple mappings to use superpages, their virtual addresses must be equally aligned.
- Use the same virtual address alignment across mappings.
- Align to the largest superpage that is no larger than the mapping (sketched below).
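A sketch of the alignment rule, assuming Alpha's page sizes; choosing alignment this way means every mapping of the object can line up on the same superpage boundaries:

    #include <stddef.h>

    static const size_t sp_desc[] = { 4u << 20, 512u << 10, 64u << 10, 8u << 10 };

    /* Virtual alignment for a new mapping: the largest superpage size that
       does not exceed the mapping's length. */
    static size_t mapping_alignment(size_t mapping_len)
    {
        for (int i = 0; i < 4; i++)
            if (sp_desc[i] <= mapping_len)
                return sp_desc[i];
        return sp_desc[3];                   /* tiny mapping: base-page alignment */
    }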
46. Evaluation
47. Platform
- Alpha 21264 at 500 MHz
- Four page sizes: 8 KB, 64 KB, 512 KB, and 4 MB
- Software page tables, firmware TLB loader
- 512 MB RAM
- 64 KB data and instruction L1 caches, virtually indexed and 2-way associative
- 4 MB unified, direct-mapped external L2 cache
48. Workloads
- CINT2000: SPEC integer benchmarks
- CFP2000: SPEC floating-point benchmarks
- Web: web server; working set was 238 MB, data set is 3.6 GB
- Image: 90-degree rotation using ImageMagick
- POV-Ray: ray tracer
- Linker: link of the FreeBSD kernel
- C4: alpha-beta search solver for the Connect-4 game
- Tree: many tree operations, designed for poor locality
- SP: solver
- FFTW: FFT
- Matrix: non-blocked matrix transpose
49. Results
- Nothing really surprising.
50. Results (continued)
- Mesa slows down, due to not favoring zeroed-out pages.
51. Results (continued)
- Web does poorly, since it uses many small files.
- Matrix incurs one TLB miss for every two memory accesses.
52. Page Coloring
- A side effect is that page coloring becomes less useful, since physical pages are contiguous anyway.
53. Multiple Page Sizes
- The best size depends on the application: 64 KB for SP, 512 KB for vpr, 4 MB for FFTW.
- Some applications are too small to fully populate a large superpage.
  - The OS could be allowed to promote superpages that are not fully populated.
- Some applications really need a variety of superpage sizes (mcf).
- Sometimes large reductions in TLB misses bring little gain.
54. Sustained Performance
- Fragment memory with a web server.
- Then run two schemes:
  - cache: just treat all cache pages as available
  - daemon: contiguity-aware page daemon plus wired-page clustering
- Remember how the daemon worked?
55. Concurrent Execution
- Run the web server concurrently with a contiguity-seeking application.
- Exercises the daemon.
- Only minor degradation of the server (3%).
- Increase from 3% to 30% of requests satisfied.
56. Adversarial Applications
- Incremental promotion overhead: touch one byte in each page (see the sketch below).
  - 8.9% slowdown, 7.2% of which is due to a hardware-specific reason.
  - The rest is due to maintenance of the population maps.
- Sequential access overhead: the cmp program.
  - No slowdown.
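A userspace sketch of that first adversarial pattern (the region size and base-page size below are placeholders): every base page is written exactly once, so the cost of reserving and incrementally promoting superpages is never amortized.

    #include <sys/mman.h>
    #include <stddef.h>

    #define REGION   (64u << 20)    /* placeholder: 64 MB anonymous region */
    #define BASEPAGE (8u << 10)     /* Alpha base page size */

    int main(void)
    {
        volatile char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        /* One write per base page, nothing else. */
        for (size_t off = 0; off < REGION; off += BASEPAGE)
            p[off] = 1;
        return 0;
    }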
57. Superpage Demotion
- Map a 100 MB file, read every page to trigger promotion, then write into every 512th page, flush, then exit.
- As expected, a huge difference: about 20x.
- It would have been more interesting to write an adversarial program
  - one that makes the demotion useless.
58. Summary
- Superpages: about 30% improvement
  - transparently realized, low overhead
- Contiguity restoration is necessary
  - sustains the benefits, low impact
- Multiple page sizes are important
  - scales to very large superpages