Reducing Garbage Collector Cache Misses - PowerPoint PPT Presentation

1
Reducing Garbage Collector Cache Misses
  • Shachar Rubinstein
  • Garbage Collection Seminar

2
The End!
3
The general problem
  • CPUs are getting faster and faster
  • Main memory speed lags behind
  • Result: the cost to access main memory is
    increasing

4
Solutions
  • Hardware and software techniques
  • Memory hierarchy
  • Prefetching
  • Multithreading
  • Non-blocking caches
  • Dynamic instruction scheduling
  • Speculative execution

5
Great Solutions?
Not exactly
  • Complex hardware and compilers
  • Ineffective for many programs
  • Attack the manifestation (memory latency) and
    not the source (poor reference locality)

6
Previous work
  • Improving cache locality in dense matrices using
    loop transformation
  • Other profile-driven, compiler-directed approaches

7
The GC problem
  • Little temporal locality.
  • Each live object is usually read only once during
    mark phase.
  • Most reads are likely to miss.
  • The new contents are unlikely to be used more
    than once.

8
The GC problem cont.
  • The sweep phase, like the mark phase, also
    touches each object once
  • That's because the free-list pointers are
    maintained in the objects themselves
  • Unlike the mark phase, the sweep phase is more
    sequential

9
The GC problem cont.
  • The sweep is less likely to use cache contents
    left by the marker
  • The allocator is likely to miss again, when the
    object is allocated

10
The GC problem - previous work
  • Older work concentrated on paging performance.
  • Memory size increases led to abandoning this
    goal.
  • But growing memory sizes also led to huge cache
    miss penalties.
  • The largest cache size < heap size
  • This problem is unavoidable.

11
Previous work
  • Reducing sweep time for a nearly empty heap
  • Compiler-based prefetching for recursive data
    structures

12
How am I going to improve the situation?
  • Do some magic!
  • Well... no
  • Use real-time information to improve program
    cache locality.
  • The mark and sweep phases offer invaluable
    opportunities for improvement
  • Bring objects into the cache earlier
  • Reuse freed objects for reallocation

13
Some numbers
  • Relative to a copying GC:
  • Cache miss rates reduced by 21-42%
  • Program performance improved by 14-37%
  • Relative to a page-level GC:
  • Cache miss rates reduced by 20-41%
  • Program performance improved by 18-31%

14
Road map
  • Cache conscious data placement using generational
    GC
  • Overview
  • Short generational GC reminder
  • Real-time data profiling
  • Object affinity graph
  • Combining the affinity graph with GC
  • Experimental evaluation
  • Other methods and their experimental results

15
Overview
  • A program is instrumented to profile its access
    patterns
  • The data is used in the same execution and not
    the next one.
  • The data -> affinity graph
  • A new copy algorithm uses the graph to layout the
    data while copying.

16
Generational GC A reminder
  • The heap is divided into generations
  • GC activity concentrates on young objects, which
    typically die faster.
  • Objects that survive one or more scavenges are
    moved to the next generation

17
Implementation notes
  • The authors used the UM GC toolkit
  • The toolkit has several steps per generation
  • The authors used a single step for each
    generation for simplicity.
  • Each step consists of fixed size blocks
  • The blocks are not necessarily contiguous in
    memory

18
Implementation notes - steps
19
Implementation notes - steps
  • The steps are used to encode the object's age
  • An object which survives a scavenge is moved to
    the next step

20
Implementation notes - moving between generations
  • The scavenger collects a generation g and all its
    younger generations
  • It starts from objects that are
  • In g
  • Reachable from the roots.
  • Moving an object is copying it into a TO space.
  • The FROM space can be reused

21
Copying algorithm - a reminder
  • Cheney's algorithm (sketch below)
  • TO and FROM spaces are switched
  • Starts from the root set
  • Objects are traversed breadth-first using a queue
  • Objects are copied to TO space
  • Terminates when the queue is empty

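A minimal C sketch of Cheney's breadth-first copy, assuming hypothetical
helpers (is_forwarded, forwarding_address, set_forwarding_address,
object_size, num_pointer_fields, pointer_field, to_space_start) that are
not from the paper; the TO space itself serves as the queue.

    #include <stddef.h>
    #include <string.h>

    typedef struct object object_t;

    /* assumed helpers, not from the paper */
    extern int        is_forwarded(object_t *o);
    extern object_t  *forwarding_address(object_t *o);
    extern void       set_forwarding_address(object_t *o, object_t *copy);
    extern size_t     object_size(object_t *o);
    extern size_t     num_pointer_fields(object_t *o);
    extern object_t **pointer_field(object_t *o, size_t i);
    extern char      *to_space_start(void);

    static char *free_ptr;   /* next free byte in TO space        */
    static char *scan_ptr;   /* next unscanned object in TO space */

    static object_t *forward(object_t *o) {
        if (o == NULL) return NULL;
        if (is_forwarded(o)) return forwarding_address(o);
        object_t *copy = (object_t *)free_ptr;      /* bump-allocate in TO  */
        free_ptr += object_size(o);
        memcpy(copy, o, object_size(o));
        set_forwarding_address(o, copy);            /* left in FROM space   */
        return copy;
    }

    void cheney_collect(object_t **roots, size_t nroots) {
        scan_ptr = free_ptr = to_space_start();     /* spaces already flipped */
        for (size_t i = 0; i < nroots; i++)
            roots[i] = forward(roots[i]);           /* copy the root set      */
        while (scan_ptr < free_ptr) {               /* queue not yet empty    */
            object_t *o = (object_t *)scan_ptr;
            for (size_t i = 0; i < num_pointer_fields(o); i++) {
                object_t **slot = pointer_field(o, i);
                *slot = forward(*slot);             /* copy child, fix slot   */
            }
            scan_ptr += object_size(o);             /* object fully processed */
        }
    }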
22
Copying algorithm - the queue trick
23
The algorithm
24
Did you get it?
25
Real-time data profiling
  • A profile from an earlier program run is not good
    enough
  • Real-time data eliminates:
  • The profile execution run
  • Finding inputs

Great!
But the overhead must be low!
26
Profiling data access patterns
  • Trace every load and store to heap
  • Huge overhead (factor of 10!)

27
Reducing overhead
  • Use object-oriented programs' properties
  • Most objects are small, often less than 32 bytes
  • No need to distinguish between fields, since
    cache blocks are bigger

28
Reducing overhead cont.
  • Most object accesses are not lightweight
  • Profiling instrumentation will not incur large
    overhead
  • Don't believe it? Stay awake

29
Collecting profiling data
  • Loads of base object addresses
  • Uses a modified compiler
  • The compiler retains object type information for
    selective loads

30
Code instrumentation
31
Collecting profiling data - cont
  • The base object address is entered into an object
    access buffer

32
Implementation note
  • Uses a page trap for buffer overflow
  • A trap causes a graph to be built
  • Recommended buffer size: 15,000 entries (60KB);
    see the sketch below

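A rough sketch of what the compiler-inserted instrumentation might look
like; the names and the explicit overflow check (the paper uses a page
trap) are assumptions, not the authors' code.

    #include <stddef.h>

    #define ACCESS_BUFFER_ENTRIES 15000            /* ~60KB of addresses */

    static void  *access_buffer[ACCESS_BUFFER_ENTRIES];
    static size_t access_count;

    extern void build_affinity_graph(void **buffer, size_t n);  /* assumed hook */

    /* Called by compiler-inserted instrumentation after each selected
       load of a base object address. */
    static inline void record_object_access(void *base_addr) {
        access_buffer[access_count++] = base_addr;
        if (access_count == ACCESS_BUFFER_ENTRIES) {
            /* buffer full: build/update the affinity graph and reset */
            build_affinity_graph(access_buffer, access_count);
            access_count = 0;
        }
    }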
33
Affinity?
  • Main Entry: affinity. Pronunciation: ə-'fi-nə-tē.
    Function: noun. Inflected Form(s): plural -ties.
    Etymology: Middle English affinite, from Middle
    French or Latin; Middle French afinité, from Latin
    affinitas, from affinis bordering on, related by
    marriage, from ad- + finis end, border. Date: 14th
    century.
    1 : relationship by marriage
    2 a : sympathy marked by community of interest :
    KINSHIP  b (1) : an attraction to or liking for
    something ("people with an affinity to darkness" --
    Mark Twain) ("pork and fennel have a natural
    affinity for each other" -- Abby Mandel)  (2) : an
    attractive force between substances or particles
    that causes them to enter into and remain in
    chemical combination  c : a person especially of
    the opposite sex having a particular attraction for
    one
    3 a : a likeness based on relationship or causal
    connection ("found an affinity between the teller
    of a tale and the craftsman" -- Mary McCarthy)
    ("this investigation, with affinities to a case
    history, a psychoanalysis, a detective story" --
    Oliver Sacks)  b : a relation between biological
    groups involving resemblance in structural plan and
    indicating a common origin

34
The object affinity graph
35
The object affinity graph
  • Nodes: objects
  • Edges: temporal affinity between objects
  • An undirected graph

36
Building the graph
37
Inserting an object to the queue
38
Incrementing edge weights
39
All clear?
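A simplified C sketch of the graph construction shown on the preceding
slides and demonstrated next; the helper names (add_node,
increment_edge) and the exact FIFO queue policy are assumptions, one
plausible variant rather than the paper's exact code.

    #include <stddef.h>

    #define QUEUE_SIZE 3                        /* recommended queue size */

    static void *locality_queue[QUEUE_SIZE];
    static int   queue_len;

    /* assumed graph primitives */
    extern void add_node(void *obj);
    extern void increment_edge(void *a, void *b);   /* adds edge if missing */

    void build_affinity_graph(void **buffer, size_t n) {
        queue_len = 0;
        for (size_t i = 0; i < n; i++) {
            void *obj = buffer[i];
            add_node(obj);
            /* temporal affinity: strengthen edges to recently accessed objects */
            for (int j = 0; j < queue_len; j++)
                if (locality_queue[j] != obj)
                    increment_edge(obj, locality_queue[j]);
            /* push obj onto the FIFO locality queue, evicting the oldest */
            if (queue_len < QUEUE_SIZE) {
                locality_queue[queue_len++] = obj;
            } else {
                for (int j = 0; j < QUEUE_SIZE - 1; j++)
                    locality_queue[j] = locality_queue[j + 1];
                locality_queue[QUEUE_SIZE - 1] = obj;
            }
        }
    }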
40
Demonstration
  [Slides 40-49: a diagram sequence showing the locality queue (tail
  marked), the object access buffer (accesses to objects A, B, C and D),
  and the object affinity graph; as each buffer entry is processed,
  edges between the accessed object and the objects currently in the
  queue are added or their weights incremented.]
50
Implementation notes
  • A separate affinity graph is built for each
    generation, except the first.
  • It uses the fact that an object's generation is
    encoded in its address.
  • This method prevents placing objects from
    different generations in the same cache block.
    (Explanations later on)

51
Implementation notes - queue size
  • The locality queue size is important
  • Too small -> miss temporal relationships
  • Too big -> huge graph, long processing time
  • Recommended: 3

52
Implementation notes
  • Re-create or update the graph?
  • Depends on the application
  • Applications with access phases should re-create
  • Uniform behavior should update
  • In this article: re-create before each scavenge

53
Stop!
  • Our goal: produce a cache-conscious data layout,
    so that objects are likely to reside in the same
    cache block
  • In English: place objects with high temporal
    affinity next to each other.
  • The method: use the profiling information we've
    collected in the copying process.

54
GC + real-time profiling
  • Use the object affinity graph in the Copying
    algorithm.

55
Example object affinity graph
56
Example before step 1
57
Step 1 - using the graph
  • Flip roles (TO and FROM)
  • Initialize free and unprocessed to the beginning
    of the TO space.
  • Pick a node that is both in the root set and in
    the affinity graph, and has the highest edge
    weight
  • Perform a greedy DFS on the graph (sketch after
    the next slide)

58
Step 1 cont.
  • Copy each visited object to the TO space
  • Increment the free pointer
  • Store a forwarding address in the FROM space

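A C sketch of step 1, assuming the hypothetical helpers declared below:
starting from the root with the heaviest affinity edge, a greedy DFS
over the affinity graph copies each visited object to the TO space, so
neighbours in the graph end up adjacent in memory.

    #include <stddef.h>

    /* assumed helpers, not from the paper */
    extern void  *heaviest_graph_root(void);   /* root-set node in the graph
                                                  with the highest edge weight */
    extern size_t neighbor_count(void *obj);
    extern void  *neighbor_by_weight(void *obj, size_t rank); /* heaviest first */
    extern void  *copy_to_to_space(void *obj); /* bumps free, leaves a
                                                  forwarding address in FROM   */
    extern int    already_copied(void *obj);

    static void greedy_dfs(void *obj) {
        if (obj == NULL || already_copied(obj))
            return;
        copy_to_to_space(obj);                 /* placed next to its predecessor */
        for (size_t r = 0; r < neighbor_count(obj); r++)
            greedy_dfs(neighbor_by_weight(obj, r));   /* heaviest edge first */
    }

    void copy_step1(void) {
        void *start = heaviest_graph_root();
        if (start != NULL)
            greedy_dfs(start);
        /* step 2: Cheney scan between unprocessed and free (next slides) */
    }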
59
Example After step 1
60
Step 2 - continue Cheney's way
  • Process all objects between the unprocessed and
    the free pointers, as in Cheney's algorithm

61
Example After step 2
62
Step 3 - cleanup
  • Ensure all roots are in the TO space
  • If not, process them using Cheney's algorithm

63
Example After step 3
64
Implementation notes
  • The object access buffer can be used as a stack
    for the DFS

65
Inaccurate results(?)
  • The object affinity graph may retain objects that
    are not reachable (garbage)
  • They will be incorrectly promoted at most once
  • Efforts are focused on longer-lived objects and
    not on the youngest generation

66
Experimental evaluation
  • Methodology (if we have the time)
  • Object-oriented programs manipulate small objects
  • Real-time data profiling overhead
  • The algorithm's impact on performance

67
Size of heap objects
68
But that's not the point!
  • Small objects often die fast

69
Surviving heap objects
70
Real-time data profiling overhead
71
Overall execution time
72
Overall execution time - notes
  • No impact on L1 cache because its blocks are 16B

73
Compared to WLM algorithm
74
Comparison notes
  • WLM (Wilson-Lam-Moher) improves programs' virtual
    memory locality.
  • It performed worse than or close to Cheney's
    because of the 2GB memory

75
What else?
76
Other methods
  • Two methods that can be used with the previous
    one
  • Prefetch on grey
  • Lazy sweeping

77
Assumptions
  • Non moving mark-sweep collector
  • For simplicity, the collector segregates objects
    by size. Each block contains objects of a single
    size
  • The collector's data structures are outside the
    user-visible heap
  • A mark bit is reserved for each word in the block

78
Advantages of outside-the-heap data
  • The mark phase does not need to examine (bring
    to the cache) pointer-free objects
  • Sequences of small unreachable objects can be
    reclaimed as a group
  • A single instruction is needed to examine their
    sequence of mark bits
  • It is used when a heap block turns out to be empty

79
The mark phase - a reminder (sketch below)
  • Ensure that all objects are white.
  • Grey all objects pointed to by a root.
  • while there is a grey object g
  • blacken g
  • For each pointer p in g
  • if p points to a white object
  • grey that object.

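The bullet pseudocode above, rendered as a C sketch with a mark stack;
the helper names (push, pop, test_and_set_mark, num_pointers,
get_pointer) are assumptions, not from the paper.

    #include <stddef.h>

    /* assumed mark-stack and object-layout helpers */
    extern void   push(void *obj);
    extern void  *pop(void);
    extern int    stack_empty(void);
    extern int    test_and_set_mark(void *obj);   /* returns previous mark bit */
    extern size_t num_pointers(void *obj);
    extern void  *get_pointer(void *obj, size_t i);

    void mark_phase(void **roots, size_t nroots) {
        for (size_t i = 0; i < nroots; i++)       /* grey all roots           */
            if (roots[i] != NULL && !test_and_set_mark(roots[i]))
                push(roots[i]);
        while (!stack_empty()) {                  /* while a grey object exists */
            void *g = pop();                      /* blacken g                 */
            for (size_t i = 0; i < num_pointers(g); i++) {
                void *p = get_pointer(g, i);
                if (p != NULL && !test_and_set_mark(p))   /* white -> grey     */
                    push(p);
            }
        }
    }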
80
The mark phase - colors
  • 1 mark bit
  • 0 is white
  • 1 is grey/black
  • Stack
  • In the stack - grey
  • Removed from the stack - black

81
The mark GC problem
  • A significant fraction of time is spent to
    retrieve the first pointer p from each grey
    object
  • About a third of the marker's execution time is
    spent this way
  • This time is expected to increase on future
    machines

82
Prefetching
  • A modern CPU instruction
  • A program can prefetch data into the cache for
    future use

83
Prefetching cont.
  • But the object reference must be predicted early
    enough
  • For example, if the object is in main memory, it
    must be prefetched hundreds of cycles before its
    use
  • Prefetching instructions are mostly inserted by
    compiler optimizations

84
Prefetch on grey
  • When? Prefetch as soon as p is found to be a
    likely pointer
  • What? Prefetch the first cache line of the object
    (sketch below)

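A sketch of where the prefetch goes, assuming GCC/Clang's
__builtin_prefetch and reusing the helpers from the mark-phase sketch
above: the only change to the mark loop is issuing a prefetch for the
object's first cache line the moment it is greyed, so the line is
likely resident when the object is later popped and scanned.

    extern void push(void *obj);
    extern int  test_and_set_mark(void *obj);

    static void grey(void *p) {
        if (p != NULL && !test_and_set_mark(p)) {
            __builtin_prefetch(p);   /* prefetch the object's first cache line
                                        as soon as p looks like a pointer     */
            push(p);
        }
    }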
85
To improve performance
  • The last pointer to be pushed on the mark stack
    is prefetched first
  • It minimizes the cases in which a just-greyed
    object is immediately examined

86
And to improve more
  • Prefetch a few cache lines ahead when scanning an
    object
  • It helps with large objects
  • It prefetches more objects if it isn't that large

87
The sweep GC problem
  • If reclaimed memory > cache size
  • Objects are likely to be evicted from the cache
    by the allocator or mutator
  • Thus, the allocator will miss again when reusing
    the reclaimed memory

88
Lazy sweeping
  • Originally used to reduce page faults
  • Delay the sweeping for the allocator
  • Pages will be reused instead of evicted from the
    cache

89
A reminder
  • A mark bit is saved for each word in a cache
    block.
  • A mark bit is used only if its word is the
    beginning of an object

90
Cache lazy sweeping - the collector
  • Scans each block's mark bits
  • If all bits are unmarked, the block is added to
    the free blocks pool without touching it
  • If some bits are marked, it's added to a queue of
    blocks waiting to be swept
  • There are several queues, one or more for each
    object size

91
Cache lazy sweeping - the allocator (sketch below)
  • Maps the request to the appropriate object free
    list
  • Returns the first object from the list
  • If the list is empty
  • It sweeps the queue of the right size for a block
    with some available objects

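A rough C sketch of the allocator side of lazy sweeping; the block
layout, queue, and helper names are assumptions. Allocation pops from
the size-class free list, and only when that list is empty is a queued
block swept to refill it, so freshly swept memory is reused while it is
still warm in the cache.

    #include <stddef.h>

    struct free_obj { struct free_obj *next; };

    /* assumed structures and helpers */
    extern struct free_obj *free_list[];                        /* per size class  */
    extern void            *dequeue_unswept_block(size_t size_class); /* NULL if none */
    extern void              sweep_block_into_free_list(void *block, size_t size_class);
    extern void             *allocate_fresh_block(size_t size_class);

    void *gc_alloc(size_t size_class) {
        for (;;) {
            struct free_obj *obj = free_list[size_class];
            if (obj != NULL) {                        /* fast path            */
                free_list[size_class] = obj->next;
                return obj;
            }
            void *block = dequeue_unswept_block(size_class);
            if (block != NULL) {                      /* sweep on demand      */
                sweep_block_into_free_list(block, size_class);
                continue;                             /* retry the fast path  */
            }
            return allocate_fresh_block(size_class);  /* or trigger a GC      */
        }
    }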
92
Experimental results
  • Measured on two platforms
  • Second platform is to get some calibration on
    architecture variation

93
Pentium III/500 results
94
HP PA-8000/180 based results
95
Results - conclusions
  • Prefetch on grey eliminates a third to almost all
    cache miss overhead in the marker.
  • But it is dependent on data structures used in
    the program

96
Results - conclusions cont.
  • Collector performance is determined by the marker
  • The sweep performance is architecture dependent

97
Conclusions
  • Be concerned about cache locality or
  • Have a method that does it for you

98
Conclusions cont.
  • Real-time data profiling is feasible
  • Produce cache conscious data layout using that
    information
  • May help reduce the performance gap between
    high-level and low-level languages

99
Conclusions cont.
  • Prefetch on grey and lazy sweeping are cheap to
    implement and should be in future garbage
    collectors

100
Bibliography
  • Using Generational Garbage Collection To
    Implement Cache-Conscious Data Placement -
    Trishul M. Chilimbi and James R. Larus
  • Reducing Garbage Collector Cache Misses - Hans-J.
    Boehm

101
Further reading
  • Look at the articles
  • Garbage Collection: Algorithms for Automatic
    Dynamic Memory Management - Richard Jones and
    Rafael Lins

102
Further reading cont.
  • Cecil
  • Craig Chambers. Object-oriented multi-methods in
    Cecil. In Proceedings ECOOP '92, LNCS 615,
    Springer-Verlag, pages 33-56, June 1992.
  • Craig Chambers. The Cecil Language: Specification
    and Rationale. University of Washington, Seattle,
    Technical Report TR-93-03-05, Mar. 1993.
  • Hyperion by Dan Simmons

103
Time fillers
104
Items
  • Large objects
  • Inter-generational objects placement
  • Why explicitly build free lists?
  • Experimental methodology
  • Second experimental methodology

105
Large objects
  • Ungar and Jackson
  • There's an advantage in not copying large
    objects (> 256 bytes) with the same age
  • A large object is never copied
  • Each step has an associated set of large objects

106
Large objects cont.
  • A large object is linked in a doubly linked list.
  • If it survives a collection, it's removed from
    its list and inserted into the TO space list.
  • No compaction is done on large objects.

107
Large objects cont.
  • Read more in David Ungar and Frank Jackson. An
    adaptive tenuring policy for generation
    scavengers. ACM Transactions on Programming
    Languages and Systems, 14(1):1-27, January 1992

108
Two generations, one cache block
  • How important is co-location of inter-generation
    objects?
  • The way to achieve this is to demote or promote.

109
Two generations, one cache block cont.
  • Intra-generation pointers are not tracked.
  • In order to demote safely, it is necessary to
    collect the object's original generation
  • Result: long collection time
110
Two generations, one cache block cont.
  • Promotion can be done safely
  • The young generation is being collected and its
    pointers updated
  • Pointers from old to young are tracked
  • The locality benefit will start only when the old
    generation is collected
  • Premature promotion

111
Why explicitly build free lists?
  • Allocation is fast
  • Heap scanning for unmarked objects can be fast
    using mark bits
  • Little additional space overhead is required

112
Experimental methodology
  • Vortex compiler infrastructure
  • Vortex supports GGC only for Cecil
  • Cecil: a dynamically typed, purely
    object-oriented language.
  • Used Cecil benchmarks
  • Repeated each experiment 5 times and reported the
    average

113
Cecil benchmarks
114
Cecil benchmarks cont.
  • Compiled at the highest (O2) optimization level

115
The platform
  • Sun Ultraserver E5000
  • 12 167MHz UltraSPARC processors
  • 2GB memory (to prevent page faults)
  • Solaris 2.5.1

116
The platform - memory
  • L1: 16KB direct-mapped, 16B blocks
  • L2: 1MB unified, direct-mapped, 64B blocks
  • 64-entry iTLB and 64-entry dTLB, fully associative

117
The platform - memory costs
  • L1 data cache hit: 1 cycle
  • L1 miss, L2 hit: 6 cycles
  • L2 miss: an additional 64 cycles

118
Second experimental methodology
  • Two platforms
  • All benchmarks except one are C programs

119
Pentium measurements
  • Dual-processor 500MHz Pentium III (but only one
    used)
  • 100MHz bus
  • 512KB L2 cache
  • Physical memory > 300MB (why keep it a secret?),
    which prevented paging and kept the whole
    executable in memory
  • RedHat 6.1
  • Benchmarks compiled using gcc with -O2

120
RISC measurements
  • A single PA-8000/180 MHz processor
  • Running HP/UX 11
  • Single level I and D caches, 1MB each

121
Benchmarks
  • Execution time measurements are an average of
    five runs
  • The division between sweep and mark times is
    arbitrary
  • The Pentium III prefetcht0 introduced a new
    overhead, so prefetchnta was used. It was less
    effective at eliminating cache misses, though

122
?
123
Thank you for listening! (and staying awake)
The end. Lectured by Shachar Rubinstein
(shachar1@post.tau.ac.il). GC seminar: Mooly Sagiv.
Audience: you. Thanks: for your patience, the
PowerPoint XP effects, my parents. No animals were
harmed during this production (except for annoying
mosquitoes)