1
Superpages
  • Kenneth Chiu
  • (Some slides adapted from Juan Navarro)

2
Introduction
3
Overview
  • Increasing TLB miss overhead
  • main memory sizes are growing exponentially
  • working sets are growing
  • TLB size does not grow at the same pace. (Why?)
  • Some caches are now larger than TLB coverage
  • fully associative, usually 128 or fewer entries
  • Processors now provide superpages
  • one TLB entry can map a large region
  • OSs have been slow to harness them
  • no transparent superpage support for apps
  • Increasing the base page size across the board
    does not work well
  • wastes memory
  • increases I/O

4
Translation look-aside buffer
  • TLB caches virtual-to-physical address
    translations
  • TLB coverage
  • amount of memory mapped by TLB
  • amount of memory that can be accessed without TLB
    misses
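  • Example (using figures from later slides): 128
    fully associative entries × 8 KB base pages =
    1 MB of coverage.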

5
TLB coverage trend
(Figure: TLB coverage as a percentage of main memory; a factor-of-1000
decrease in 15 years.)
6
(No Transcript)
7
How to increase TLB coverage
  • Typical TLB coverage ≈ 1 MB
  • Use superpages!
  • large and small pages
  • Increase TLB coverage
  • no increase in TLB size
  • no internal fragmentation

If only large pages were used: larger working sets,
more I/O.
8
What are these superpages anyway?
  • Memory pages of larger sizes
  • supported by most modern CPUs
  • Alpha: 8 KB, 64 KB, 512 KB, 4 MB
  • IA-32: 4 KB, 4 MB
  • IA-64: 4 KB to 256 MB
  • Opteron? POWER? SPARC?
  • Otherwise, same as normal pages
  • use only one TLB entry
  • Power of 2 size, contiguous, aligned (physically
    and virtually)
  • Digression question: how do you round a number up
    to a multiple of a given power of 2? Round it
    down? (See the sketch after this list.)
  • one reference bit, one dirty bit, one set of
    protection attributes. Implications?
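
A minimal sketch answering the digression above, assuming the rounding
target is itself a power of 2 (as all superpage sizes are):

    #include <stdint.h>

    /* Round addr down / up to a multiple of 'size', where 'size' is a
     * power of 2.  The mask ~(size - 1) clears the low-order bits. */
    static inline uint64_t round_down(uint64_t addr, uint64_t size)
    {
        return addr & ~(size - 1);
    }

    static inline uint64_t round_up(uint64_t addr, uint64_t size)
    {
        return (addr + size - 1) & ~(size - 1);
    }

For example, round_up(0x3001, 0x2000) gives 0x4000, and round_down
gives 0x2000.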

9
A superpage TLB
(Figure: a superpage TLB. A base-page entry maps a single page, while a
superpage entry of size 4 maps four contiguous, aligned physical pages
with one TLB entry. Supported sizes: Alpha 8/64/512 KB and 4 MB;
Itanium 4/8/16/64/256 KB and 1/4/16/64/256 MB.)
10
The Superpage Problem
11
Issues and Tradeoffs
  • Assume that the virtual address space of each
    process is a set of virtual memory objects.
  • mmap()ed files, stack, heap, text, etc.

12
Issues and Tradeoffs
  • Allocation
  • Allocate anywhere, might require relocation.
  • Instead, reservation-based: allocate aligned
    extents.
  • Requires a priori choice of size. Why?
  • Must trade off large superpages against future,
    possibly more critical uses.
  • Fragmentation control
  • Memory becomes fragmented due to
  • use of multiple page sizes
  • persistence of file cache pages
  • scattered wired (non-pageable) pages
  • Contiguity is a contended resource
  • OS must
  • use contiguity restoration techniques
  • trade off impact of contiguity restoration
    against superpage benefits

13
  • Promotion
  • OS must decide when to promote.
  • Can also be done incrementally.
  • Must trade off the benefits of superpages against
    the memory wasted if not all parts of the
    superpage are used.
  • Demotion
  • Process of breaking a superpage back down into
    smaller subpages.
  • Hardware maintains only a single reference bit.
  • Multiple hardware reference bits should not be too
    expensive.
  • Eviction
  • Superpage eviction is similar to normal eviction.
  • The superpage must be flushed out in its entirety,
    since there is only a single dirty bit.

14
Issue 1: superpage allocation
(Figure: a virtual memory object B and its candidate superpage-aligned
placements in physical memory.)
  • How / when / what size to allocate?

15
Issue 2: promotion
  • Promotion: create a superpage out of a set of
    smaller pages
  • mark page table entry of each base page
  • When to promote?

Forcibly populate pages? May cause internal
fragmentation.
16
Related Approaches
17
Reservations
  • Talluri and Hill propose a reservation-based
    scheme, with preemption under pressure.
  • Their main goal is support for partial-subblock
    TLBs.
  • Superpage entries with holes.
  • HP-UX and IRIX create superpages eagerly.
  • The superpage size is specified by the user.
  • Main drawback: not transparent and not dynamic.

18
Relocation
  • Copy page frames to make things contiguous and
    aligned.
  • Romer proposes an algorithm using cost-benefit
    analysis.
  • More TLB misses, since relocation is performed in
    response to TLB misses.
  • TLB misses are more expensive, since handling is
    more complex.
  • But more robust to fragmentation.
  • Can use both in a hybrid approach.

19
Superpage allocation
  • Relocation approach

(Figure: the relocation approach copies the page frames of object B
into a contiguous, aligned extent of physical memory; copying costs
apply.)
20
Hardware Support
  • Talluri advocates partial subblock TLBs
  • Superpage TLBs with holes
  • Fang proposes another level of indirection
  • Eliminates contiguity requirements.
  • None of these are in commercial hardware.

21
Design
22
Key observation
Once an application touches the first page of a
memory object, it is likely to quickly touch every
page of that object.
  • Example array initialization
  • Opportunistic policies
  • superpages as large and as soon as possible
  • as long as there is no penalty for a wrong
    decision

23
Reservation-Based Allocation
  • A buddy allocator is used for reservations.
  • Coalescing is done by the buddy allocator.
  • Sizes are powers of 2.
  • On a fault
  • Pick a superpage size.
  • Get a set of contiguous page frames.
  • The frame corresponding to the faulting page is
    used.
  • The others are reserved and added to a list.
  • On a fault on a page for which a frame has already
    been reserved, just use that frame.

24
Superpage allocation
Preemptible reservations
(Figure: a preemptible reservation; object B's frames are placed within
a reserved, aligned extent of physical memory, with the untouched
frames held as reserved.)
How much do we reserve? Goal: good TLB coverage,
without internal fragmentation.
25
Superpage Size
  • Hard to predict best size
  • If too large, the reservation can be overridden
    (preempted) later
  • If too small, it cannot be enlarged
  • For fixed-size objects, choose the largest size
    that (sketched below)
  • Contains the faulting page
  • Does not overlap with existing reservations or
    allocations
  • Does not extend beyond the end of the object.
  • For dynamically sized objects, the reservation can
    extend beyond the current end of the object.
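
A minimal sketch of the fixed-size-object policy above, assuming the
Alpha page sizes listed earlier; the overlap check against existing
reservations is omitted for brevity, and the names are illustrative:

    #include <stdint.h>
    #include <stddef.h>

    /* Candidate superpage sizes on the Alpha test platform, ascending. */
    static const uint64_t sp_sizes[] = { 8 << 10, 64 << 10, 512 << 10, 4 << 20 };

    /* Pick the largest size whose aligned extent contains the faulting
     * address and lies entirely within [obj_start, obj_end); obj_start
     * and obj_end are assumed base-page aligned. */
    static uint64_t pick_superpage_size(uint64_t fault_va,
                                        uint64_t obj_start, uint64_t obj_end)
    {
        uint64_t best = sp_sizes[0];
        for (size_t i = 0; i < sizeof sp_sizes / sizeof sp_sizes[0]; i++) {
            uint64_t size  = sp_sizes[i];
            uint64_t start = fault_va & ~(size - 1);   /* aligned extent */
            if (start >= obj_start && start + size <= obj_end)
                best = size;
        }
        return best;
    }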

26
Preempting Reservations
  • When no free extent of the desired size is
    available, there are two choices
  • Refuse the preferred size and reserve a smaller
    extent.
  • Preempt an existing reservation.
  • The chosen policy is to preempt an existing
    reservation.
  • If there is more than one candidate, preempt the
    least recently allocated (LRA) one.

27
Fragmentation Control
  • Buddy allocator coalesces.
  • The page daemon is modified to perform
    contiguity-aware page replacement.

28
Incremental Promotions
  • Promote as soon as a superpage-sized region is
    fully populated.
  • It does not need to be the preferred size.
  • Only promote if fully populated.
  • Based on the observation that most applications
    populate densely and early.

29
Incremental promotions
  • Promotion policy: opportunistic
(Figure: a region is promoted incrementally as it fills, to
progressively larger superpages of 2, 4, and then 8 base pages.)
30
Speculative Demotions
  • Demotion occurs during page replacement.
  • Apparently the replacement policy is page-aware.
  • Demotion is also incremental.
  • Speculative demotions determine whether the
    superpage is still being used in its entirety.
  • One reference bit per superpage
  • How do we detect portions of a superpage not
    referenced anymore?
  • How expensive would additional hardware bits for
    this be?
  • Under memory pressure, demote superpages when
    clearing the reference bit.
  • Re-promote (incrementally) as pages are
    referenced

31
Paging Out Dirty Superpages
  • The whole superpage may not be dirty.
  • The heuristic is to demote a clean superpage on
    the first write to it, and re-promote later if all
    of it is dirtied.
  • Hash digests can be used to infer which base pages
    were actually modified.
  • But this is expensive
  • Do it when idle
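
A hedged sketch of the digest idea above: record a digest per base page
while the superpage is known clean, and at page-out compare digests to
skip base pages that were not actually modified.  The hash (FNV-1a) and
the names are illustrative, not the paper's:

    #include <stdint.h>
    #include <stddef.h>

    #define BASE_PAGE 8192          /* 8 KB base pages on the Alpha platform */

    /* FNV-1a digest of one base page's contents. */
    static uint64_t page_digest(const unsigned char *page)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < BASE_PAGE; i++) {
            h ^= page[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* At page-out time, write back only base pages whose digest no
     * longer matches the one recorded while the superpage was clean. */
    static int page_needs_writeback(const unsigned char *page, uint64_t recorded)
    {
        return page_digest(page) != recorded;
    }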

32
Demotions: dirty superpages
  • One dirty bit per superpage
  • what's dirty and what's not?
  • page out entire superpage
  • Demote on first write to clean superpage

  • Re-promote (incrementally) as other pages are
    dirtied

33
Multi-List Reservation Scheme
  • A set of lists, corresponding to the superpage
    sizes.
  • Each reservation is put on a list corresponding
    to the largest extent that can be obtained if
    preempted.
  • Kept sorted by time of most recent allocation.
  • When the system needs an extent, it can get one
    from the buddy allocator, or it can preempt a
    reservation.
  • Example: page fault where the preferred size is
    64 KB.
  • First ask the buddy allocator for 64 KB.
  • Then try preemption: the 64 KB list, then the
    512 KB list.
  • Else, fall back to base pages (or smaller
    superpages).

34
Allocation: managing reservations
(Figure: reservation lists keyed by the largest unused (and aligned)
chunk, here 4, 2, and 1 base pages.)
  • best candidate for preemption at front
  • reservation whose most recently populated frame
    was populated the least recently
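
A toy in-memory model of the preemption lookup described on these two
slides; the structures and names are illustrative, not kernel code.
Each list holds the reservations whose preemption would yield a free
chunk of that class, sorted so the best victim sits at the front:

    #include <stddef.h>

    #define NCLASSES 4     /* size classes, e.g. 8 KB, 64 KB, 512 KB, 4 MB */
    #define MAXRES   64

    /* A reservation; victims are ordered by the time their most
     * recently populated frame was populated (older = better victim). */
    struct reservation {
        int  id;
        long last_populate_time;
    };

    /* One list per size class; lists[c][0] is the best candidate. */
    static struct reservation *lists[NCLASSES][MAXRES];
    static int counts[NCLASSES];

    /* Find a victim whose preemption yields at least 'wanted_class' of
     * contiguity: scan the wanted class first, then larger classes,
     * taking the front of the first non-empty list.  NULL means fall
     * back to smaller superpages or base pages, as on slide 33. */
    static struct reservation *pick_victim(int wanted_class)
    {
        for (int c = wanted_class; c < NCLASSES; c++)
            if (counts[c] > 0)
                return lists[c][0];
        return NULL;
    }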

35
Population Map
  • Keeps track of allocated pages within each memory
    object.
  • On each page fault, enables lookup of the reserved
    page frame.
  • When allocating contiguous regions, enables the OS
    to detect and avoid overlap. (Isn't the buddy
    allocator for that?)
  • Assists in promotion decisions.
  • When preempting, helps to identify unallocated
    regions.
  • Implemented as a radix tree.

36
  • One population map per region the size of the
    largest superpage.
  • somepop holds the number of children that have at
    least one populated page.
  • fullpop holds the number of fully populated
    children.
  • Backpointers to and from reservations.
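
A minimal sketch of such a radix tree, with somepop and fullpop
maintained on insertion, plus a helper that finds the promotion size
(the first fully populated node on the way down, as slide 39
describes). The fanout and the names are illustrative; the paper's
actual layout may differ:

    #include <stdlib.h>

    #define FANOUT 8    /* illustrative; a node's span is a power of FANOUT */

    struct popmap {
        unsigned span;                 /* base pages covered by this node */
        unsigned somepop;              /* children with >= 1 populated page */
        unsigned fullpop;              /* fully populated children */
        struct popmap *child[FANOUT];
    };

    static struct popmap *popmap_new(unsigned span)
    {
        struct popmap *n = calloc(1, sizeof *n);
        n->span = span;
        return n;
    }

    /* Mark base page 'idx' (relative to this node) as populated.
     * Returns 1 iff this node just became fully populated. */
    static int popmap_insert(struct popmap *n, unsigned idx)
    {
        if (n->span == 1) {                    /* leaf = one base page */
            if (n->somepop)
                return 0;                      /* already populated */
            n->somepop = n->fullpop = 1;
            return 1;
        }
        unsigned child_span = n->span / FANOUT;
        unsigned slot = idx / child_span;
        if (!n->child[slot]) {
            n->child[slot] = popmap_new(child_span);
            n->somepop++;
        }
        if (popmap_insert(n->child[slot], idx % child_span))
            if (++n->fullpop == FANOUT)
                return 1;
        return 0;
    }

    /* Promotion decision (slide 39): size, in base pages, of the
     * largest fully populated region containing 'idx'; a result of 1
     * means only the base page itself, i.e. no promotion yet. */
    static unsigned promotion_size(const struct popmap *n, unsigned idx)
    {
        while (n) {
            unsigned full = (n->span == 1) ? 1u : FANOUT;
            if (n->fullpop == full)
                return n->span;
            if (n->span == 1)
                return 0;
            unsigned child_span = n->span / FANOUT;
            n = n->child[idx / child_span];
            idx %= child_span;
        }
        return 0;
    }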

37
  • Reserved frame lookup
  • The virtual address is rounded down to a multiple
    of the largest page size, yielding a
    (memory_object, page_index) tuple.
  • A hash table is used to find the population map.
    (Why not include it in the radix tree?)
  • Traverse down to find the reserved page frame, if
    any.

38
  • Overlap avoidance
  • Traverse down; a node with somepop equal to zero
    means the candidate region has no overlap.
  • Note that the buddy allocator manages physical
    space, while the population map manages virtual
    space.

39
  • Promotion decisions
  • After the fault is serviced, promotion is
    attempted.
  • The first node on the way down that is fully
    populated gives the promotion size.

40
  • Preemption assistance
  • When a reservation is preempted, its population
    status is needed to decide whether the remainder
    is freed or reinserted into a reservation list.
  • This is looked up via the pointer from the
    reservation to its population-map node.

41
Implementation
42
Contiguity-Aware Page Daemon
  • FreeBSD normally keeps three page lists
  • Active: accessed recently, but the reference bit
    is not necessarily set.
  • Inactive: mapped, but not referenced for a long
    time.
  • Cache: not mapped, clean.
  • Under pressure
  • Clean inactive pages go to the cache
  • Dirty inactive pages get paged out (and become
    clean)
  • Some active pages become inactive

43
Changes to Page Daemon
  • Cache pages considered available (managed by
    buddy allocator).
  • What happens if they are referenced?
  • Page daemon activated when contiguity is low.
  • The criterion is failure to allocate a region of
    the preferred size.
  • The daemon traverses the inactive list and moves
    to the cache the pages needed to satisfy recent
    requests.
  • Does this make sense?
  • Clean pages are moved to the inactive list as soon
    as the file is closed.
  • What about mmap()ed files?

44
Wired Page Clustering
  • Wired pages cannot be evicted.
  • Normally, they will get scattered throughout
    memory.
  • A special allocator groups wired pages so that
    they don't fragment memory too much.

45
Multiple Mappings
  • For multiple mappings to use superpages, their
    virtual addresses must be equally aligned relative
    to superpage boundaries.
  • Use the same virtual-address alignment across
    mappings.
  • Align to the largest superpage that is smaller
    than the size of the mapping.
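
A minimal sketch of the alignment choice above, assuming the Alpha
superpage sizes from earlier slides and, per the slide, a size strictly
smaller than the mapping (names are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    #define BASE_PAGE (8u << 10)   /* 8 KB base page */

    /* Superpage sizes on the Alpha test platform, ascending. */
    static const uint64_t sp_sizes[] = { 8 << 10, 64 << 10, 512 << 10, 4 << 20 };

    /* Virtual-address alignment for a new mapping: the largest
     * superpage size that is smaller than the mapping length, falling
     * back to the base page size for small mappings. */
    static uint64_t mapping_alignment(uint64_t length)
    {
        uint64_t align = BASE_PAGE;
        for (size_t i = 0; i < sizeof sp_sizes / sizeof sp_sizes[0]; i++)
            if (sp_sizes[i] < length)
                align = sp_sizes[i];
        return align;
    }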

46
Evaluation
47
Platform
  • Alpha 21264 at 500 MHz
  • Four page sizes: 8 KB, 64 KB, 512 KB, and 4 MB.
  • Software page tables, firmware TLB loader.
  • 512 MB RAM
  • 64 KB D and I L1 caches, virtually indexed and
    2-way associative.
  • 4 MB unified, direct-mapped external L2 cache.

48
Workloads
  • CINT2000: SPEC integer benchmarks
  • CFP2000: SPEC floating-point benchmarks
  • Web: working set of 238 MB, data set of 3.6 GB.
  • Image: 90-degree rotation using ImageMagick.
  • POV-Ray
  • Linker: link of the FreeBSD kernel.
  • C4: alpha-beta search solver for the Connect-4
    game.
  • Tree: many tree operations, designed for poor
    locality.
  • SP: solver
  • FFTW: FFT
  • Matrix: non-blocked transpose.

49
  • Nothing really surprising.

50
  • Mesa slows down because the allocator no longer
    favors already zeroed-out pages.

51
  • Web does poorly, since it accesses many small
    files.
  • Matrix incurs one TLB miss for every two memory
    accesses.

52
Page Coloring
  • A side effect is that page coloring is less
    useful, since physical pages within a superpage
    are contiguous.

53
Multiple Page Sizes
  • The best size depends on the app: 64 KB for SP,
    512 KB for vpr, 4 MB for FFTW.
  • Some apps are too small to fully populate a large
    superpage.
  • The OS could be allowed to promote superpages that
    are not fully populated.
  • Some apps really need a variety of superpage sizes
    (mcf).
  • Some show large reductions in TLB misses but
    little gain.

54
Sustained Performance
  • Memory was fragmented by running a web server.
  • Then two schemes were run
  • cache: just treat all cache pages as available.
  • daemon: contiguity-aware page daemon plus
    wired-page clustering.
  • Remember how their daemon worked?

55
Concurrent Execution
  • Run the web server concurrently with a
    contiguity-seeking application.
  • This exercises the daemon.
  • Only minor degradation of the server (3%).
  • Increase from 3% to 30% of requests satisfied.

56
Adversarial Applications
  • Incremental promotion: touch one byte in each
    page.
  • 8.9% slowdown, 7.2% of it due to a
    hardware-specific reason.
  • The rest is due to maintenance of population maps.
  • Sequential access overhead: the cmp program.
  • No slowdown.

57
Superpage Demotion
  • Map a 100 MB file, read every page to trigger
    promotion, then write into every 512th page,
    flush, then exit.
  • As expected, a huge difference: 20x.
  • It would have been more interesting to write an
    adversarial program
  • one that makes the demotion useless.

58
Summary
  • Superpages: 30% improvement
  • transparently realized, low overhead
  • Contiguity restoration is necessary
  • sustains the benefits, low impact
  • Multiple page sizes are important
  • scales to very large superpages