Title: Reducing Garbage Collector Cache Misses
1Reducing Garbage Collector Cache Misses
- Shachar Rubinstein
- Garbage Collection Seminar
2The End!
3The general problem
- CPUs are getting faster and faster
- Main memory speed lags behind
- Result: the cost of accessing main memory is increasing
4Solutions
- Hardware and software techniques
- Memory hierarchy
- Prefetching
- Multithreading
- Non-blocking caches
- Dynamic instruction scheduling
- Speculative execution
5Great Solutions?
Not exactly
- Complex hardware and compilers
- Ineffective for many programs
- Attack the manifestation (memory latency) and not the source (poor reference locality)
6Previous work
- Improving cache locality in dense matrices using loop transformations
- Other profile-driven, compiler-directed approaches
7The GC problem
- Little temporal locality.
- Each live object is usually read only once during the mark phase.
- Most reads are likely to miss.
- The new contents are unlikely to be used more
than once.
8The GC problem cont.
- The sweep phase, like the mark phase, also touches each object once
- That's because the free-list pointers are maintained in the objects themselves
- Unlike the mark phase, the sweep phase is more sequential
9The GC problem cont.
- The sweep is less likely to use cache contents left by the marker
- The allocator is likely to miss again when the object is allocated
10The GC problem - previous work
- Older work concentrated on paging performance.
- The memory size increase led to abandoning this goal.
- But the memory size increase also led to huge cache miss penalties.
- The largest cache size < heap size
- This problem is unavoidable.
11Previous work
- Reducing sweep time for a nearly empty heap
- Compiler-based prefetching for recursive data
structures
12How am I going to improve the situation?
- Do some magic!
- Well, no
- Use real-time information to improve the program's cache locality.
- The mark and sweep phases offer invaluable opportunities for improvement
- Bring objects into the cache earlier
- Reuse freed objects for reallocation
13Some numbers
- Relative to a copying GC
- Cache miss rates reduced by 21-42%
- Program performance improved by 14-37%
- Relative to a page-level GC
- Cache miss rates reduced by 20-41%
- Program performance improved by 18-31%
14Road map
- Cache-conscious data placement using generational GC
- Overview
- Short generational GC reminder
- Real-time data profiling
- Object affinity graph
- Combining the affinity graph with GC
- Experimental evaluation
- Other methods and their experimental results
15Overview
- A program is instrumented to profile its access patterns
- The data is used in the same execution, not the next one.
- The data is turned into an affinity graph
- A new copy algorithm uses the graph to lay out the data while copying.
16Generational GC: a reminder
- The heap is divided into generations
- GC activity concentrates on young objects, which typically die faster.
- Objects that survive one or more scavenges are moved to the next generation
17Implementation notes
- The authors used the UM GC toolkit
- The toolkit has several steps per generation
- The authors used a single step for each generation, for simplicity.
- Each step consists of fixed-size blocks
- The blocks are not necessarily contiguous in memory (see the sketch below)
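To make the generation/step/block hierarchy concrete, here is a minimal layout sketch; the type names, fields and the block size are illustrative assumptions, not the UM GC toolkit's actual definitions.

/* Hypothetical layout: a generation holds a list of steps, and each step
 * holds a chain of fixed-size blocks that need not be contiguous. */
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)     /* assumed block size, for illustration only */

typedef struct block {
    struct block *next;            /* blocks of a step are chained, not contiguous */
    char         *free;            /* next free byte inside this block */
    char          data[];          /* object storage */
} block_t;

typedef struct step {
    struct step *next_step;        /* objects that survive a scavenge move here */
    block_t     *blocks;
} step_t;

typedef struct generation {
    int     number;                /* 0 = youngest */
    step_t *steps;                 /* this presentation assumes a single step */
} generation_t;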
18Implementation notes - steps
19Implementation notes - steps
- The steps are used to encode the object's age
- An object which survives a scavenge is moved to
the next step
20Implementation notes: moving between generations
- The scavenger collects a generation g and all its younger generations
- It starts from objects that are
- In g
- Reachable from the roots.
- Moving an object means copying it into the TO space.
- The FROM space can be reused
21Copying algorithm: a reminder
- Cheney's algorithm
- TO and FROM spaces are switched
- Starts from the root set
- Objects are traversed breadth-first using a queue
- Objects are copied to TO space
- Terminates when the queue is empty (a minimal sketch follows)
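For reference, a minimal sketch of Cheney-style copying. It assumes the runtime supplies the object-layout helpers declared below (they are illustrative, not the presented system's API); the scan/free pointer pair is the "queue trick" of the next slide, where the TO space itself acts as the breadth-first queue.

#include <stddef.h>
#include <string.h>

/* Assumed runtime helpers (illustrative only) */
extern int     is_forwarded(void *obj);
extern void   *forwarding_address(void *obj);
extern void    set_forwarding_address(void *obj, void *to);
extern size_t  obj_size(void *obj);
extern size_t  obj_num_pointers(void *obj);
extern void  **obj_pointer_field(void *obj, size_t i);
extern void    flip_spaces(void);
extern char   *to_space;

char *free_ptr, *scan_ptr;

void *copy(void *obj)                             /* move one object to TO space */
{
    if (is_forwarded(obj))
        return forwarding_address(obj);           /* already copied: follow forwarding pointer */
    void *new_obj = free_ptr;
    size_t n = obj_size(obj);
    memcpy(new_obj, obj, n);                      /* bump-allocate in TO space */
    free_ptr += n;
    set_forwarding_address(obj, new_obj);         /* leave a forwarding address in FROM space */
    return new_obj;
}

void cheney_collect(void **roots, size_t nroots)
{
    flip_spaces();                                /* switch TO and FROM */
    free_ptr = scan_ptr = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = copy(roots[i]);                /* start from the root set */
    while (scan_ptr < free_ptr) {                 /* TO space between scan and free is the queue */
        void *obj = scan_ptr;
        for (size_t i = 0; i < obj_num_pointers(obj); i++) {
            void **field = obj_pointer_field(obj, i);
            if (*field)
                *field = copy(*field);            /* copying a child enqueues it */
        }
        scan_ptr += obj_size(obj);
    }
}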
22Copying algorithm: the queue trick
23The algorithm
24Did you get it?
25Real-time data profiling
- A profile from an earlier program run is not good enough
- Real-time data eliminates
- A separate profiling run
- Finding representative inputs
Great!
But the overhead must be low!
26Profiling data access patterns
- Trace every load and store to the heap
- Huge overhead (factor of 10!)
27Reducing overhead
- Use properties of object-oriented programs
- Most objects are small, often less than 32 bytes
- No need to distinguish between fields, since
cache blocks are bigger
28Reducing overhead cont.
- Most object accesses are not lightweight operations to begin with
- So the profiling instrumentation will not add a large relative overhead
- Don't believe it? Stay awake
29Collecting profiling data
- Only loads of base object addresses are recorded
- Uses a modified compiler
- The compiler retains object type information to instrument loads selectively
30Code instrumentation
31Collecting profiling data - cont
- The base object address is entered into an object access buffer (see the sketch below)
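A rough sketch of what the inserted instrumentation amounts to: each selected load of a base object address also appends that address to the object access buffer. The buffer name, the inline helper and its handling are assumptions for illustration.

/* Illustrative instrumentation: one extra store per profiled load. */
#define ACCESS_BUFFER_ENTRIES 15000              /* 60 KB with 32-bit pointers, as recommended */

static void  *object_access_buffer[ACCESS_BUFFER_ENTRIES];
static void **access_buffer_ptr = object_access_buffer;

static inline void record_object_access(void *base_addr)
{
    *access_buffer_ptr++ = base_addr;            /* no bounds check here: overflow is caught by
                                                    a write-protected guard page (next slide) */
}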
32Implementation note
- Uses a page trap for buffer overflow
- A trap causes a graph to be built
- Recommended buffer size: 15,000 entries (60 KB)
33Affinity?
- Main Entry: affinity. Pronunciation: \ə-ˈfi-nə-tē\. Function: noun. Inflected Form(s): plural -ties. Etymology: Middle English affinite, from Middle French or Latin; Middle French afinité, from Latin affinitas, from affinis bordering on, related by marriage, from ad- + finis end, border. Date: 14th century.
- 1: relationship by marriage
- 2 a: a sympathy marked by community of interest: KINSHIP; b (1): an attraction to or liking for something ("people with an affinity to darkness" -- Mark Twain; "pork and fennel have a natural affinity for each other" -- Abby Mandel); (2): an attractive force between substances or particles that causes them to enter into and remain in chemical combination; c: a person, especially of the opposite sex, having a particular attraction for one
- 3 a: a likeness based on relationship or causal connection ("found an affinity between the teller of a tale and the craftsman" -- Mary McCarthy; "this investigation, with affinities to a case history, a psychoanalysis, a detective story" -- Oliver Sacks); b: a relation between biological groups involving resemblance in structural plan and indicating a common origin
34The object affinity graph
35The object affinity graph
- Nodes: objects
- Edges: temporal affinity between objects
- An undirected graph
36Building the graph
37Inserting an object into the queue
38Incrementing edge weights
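Based on the description above and the demonstration that follows, graph construction might look roughly like this: for every address taken from the object access buffer, the edge weight to each object currently sitting in a small FIFO locality queue is incremented, and the object is then placed at the queue tail. The add_edge_weight helper and the exact queue handling are assumptions, not the authors' code.

#include <stddef.h>

#define LOCALITY_QUEUE_SIZE 3                 /* the recommended queue size (slide 51) */

extern void add_edge_weight(void *a, void *b, int delta);   /* assumed graph helper:
                                                               creates the undirected edge on first use */

static void *locality_queue[LOCALITY_QUEUE_SIZE];
static int   queue_head = 0, queue_len = 0;

static void process_access(void *obj)
{
    /* temporal affinity: obj was accessed close in time to every object
       still in the locality queue */
    for (int i = 0; i < queue_len; i++) {
        void *other = locality_queue[(queue_head + i) % LOCALITY_QUEUE_SIZE];
        if (other != obj)
            add_edge_weight(obj, other, 1);
    }
    /* enqueue obj at the tail, evicting the oldest entry when full */
    if (queue_len < LOCALITY_QUEUE_SIZE) {
        locality_queue[(queue_head + queue_len) % LOCALITY_QUEUE_SIZE] = obj;
        queue_len++;
    } else {
        locality_queue[queue_head] = obj;
        queue_head = (queue_head + 1) % LOCALITY_QUEUE_SIZE;
    }
}

void build_affinity_graph(void **buffer, size_t n)
{
    for (size_t i = 0; i < n; i++)
        process_access(buffer[i]);            /* one pass over the object access buffer */
}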
39All clear?
40Demonstration (slides 40-49)
- Diagram slides stepping through construction of the affinity graph: object addresses (A, B, C, D) are consumed from the object access buffer, edge weights in the graph are created or incremented for the objects currently in the locality queue, and each processed object is placed at the queue tail.
50Implementation notes
- A separate affinity graph is built for each generation, except the first.
- It uses the fact that an object's generation is encoded in its address.
- This method prevents placing objects from different generations in the same cache block. (Explanation later on)
51Implementation notes: queue size
- The locality queue size is important
- Too small: temporal relationships are missed
- Too big: huge graph, long processing time
- Recommended size: 3
52Implementation notes
- Re-create or update the graph?
- Depends on the application
- Programs with distinct access phases should re-create it
- Programs with uniform behavior should update it
- In this article, the graph is re-created before each scavenge
53Stop!
- Our goal: produce a cache-conscious data layout, so that related objects are likely to reside in the same cache block
- In English: place objects with high temporal affinity next to each other.
- The method: use the profiling information we've collected in the copying process.
54GC Real-time profiling
- Use the object affinity graph in the copying algorithm.
55Example: object affinity graph
56Example: before step 1
57Step 1: using the graph
- Flip roles (TO and FROM)
- Initialize the free and unprocessed pointers to the beginning of the TO space.
- Pick a node that is in
- The root set
- and
- the affinity graph, and has the highest edge weight
- Perform a greedy DFS on the graph
58Step 1 cont.
- Copy each visited object to the TO space
- Increment the free pointer
- Store a forwarding address in the FROM space (sketched below)
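A minimal sketch of step 1, reusing the copy() routine from the Cheney sketch above. The graph_* helpers (neighbors ranked by edge weight, a visited mark) are assumptions; pushing neighbors in reverse weight order makes the greedy DFS follow the heaviest edge first. Per slide 64, the object access buffer can be passed in as the DFS stack.

#include <stddef.h>

extern void  *copy(void *obj);                          /* from the Cheney sketch above */
extern size_t graph_num_neighbors(void *obj);
extern void  *graph_neighbor_by_weight(void *obj, size_t rank);  /* rank 0 = heaviest edge */
extern int    graph_visited(void *obj);
extern void   graph_mark_visited(void *obj);

void copy_by_affinity(void *start, void **stack)        /* stack: e.g. the object access buffer */
{
    size_t top = 0;
    stack[top++] = start;
    while (top > 0) {
        void *obj = stack[--top];
        if (graph_visited(obj))
            continue;
        graph_mark_visited(obj);
        copy(obj);                                      /* lays obj out at the current free pointer
                                                           and stores its forwarding address */
        for (size_t i = graph_num_neighbors(obj); i-- > 0; ) {
            void *nbr = graph_neighbor_by_weight(obj, i);
            if (!graph_visited(nbr))
                stack[top++] = nbr;                     /* heaviest neighbor ends up on top */
        }
    }
}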
59Example: after step 1
60Step 2: continue Cheney's way
- Process all objects between the unprocessed and the free pointers, as in Cheney's algorithm
61Example: after step 2
62Step 3 - cleanup
- Ensure all roots are in the TO space
- If not, process them using Cheney's algorithm (a combined sketch of the three steps follows)
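Putting the three steps together; cheney_scan() stands for the scan/free loop of the earlier Cheney sketch, and the remaining helpers are illustrative assumptions rather than the authors' actual interface.

#include <stddef.h>

extern void  flip_spaces(void);
extern char *to_space, *free_ptr, *scan_ptr;
extern void *copy(void *obj);                       /* from the Cheney sketch */
extern void  cheney_scan(void);                     /* the scan_ptr/free_ptr loop shown earlier */
extern void  copy_by_affinity(void *start, void **stack);
extern void *heaviest_root_in_graph(void **roots, size_t nroots);  /* assumed helper */
extern int   in_from_space(void *obj);                             /* assumed helper */
extern void *object_access_buffer[];

void cache_conscious_collect(void **roots, size_t nroots)
{
    flip_spaces();
    free_ptr = scan_ptr = to_space;

    /* Step 1: greedy DFS over the affinity graph, starting from the root-set
       node with the highest edge weight */
    void *start = heaviest_root_in_graph(roots, nroots);
    if (start)
        copy_by_affinity(start, object_access_buffer);

    /* Step 2: continue breadth-first, exactly as Cheney does */
    cheney_scan();

    /* Step 3: any root still pointing into FROM space is handled the usual way */
    for (size_t i = 0; i < nroots; i++)
        if (in_from_space(roots[i]))
            roots[i] = copy(roots[i]);
    cheney_scan();                                   /* scan whatever step 3 copied */
}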
63Example: after step 3
64Implementation notes
- The object access buffer can be used as a stack
for the DFS
65Inaccurate results(?)
- The object affinity graph may retain objects that are no longer reachable (garbage)
- They will be incorrectly promoted at most once
- Efforts are focused on longer-lived objects and not on the youngest generation
66Experimental evaluation
- Methodology: if we have the time
- Object-oriented programs manipulate small objects
- Real-time data profiling overhead
- The algorithm's impact on performance
67Size of heap objects
68But that's not the point!
- Small objects often die fast
69Surviving heap objects
70Real-time data profiling overhead
71Overall execution time
72Overall execution time - notes
- No impact on L1 cache because its blocks are 16B
73Compared to WLM algorithm
74Comparison notes
- WLM (Wilson-Lam-Moher) improves a program's virtual memory locality.
- It performed worse than or close to Cheney's because of the 2GB of memory
75What else?
76Other methods
- Two methods that can be used with the previous one
- Prefetch on grey
- Lazy sweeping
77Assumptions
- Non-moving mark-sweep collector
- For simplicity, the collector segregates objects by size. Each block contains objects of a single size
- The collector's data structures are outside the user-visible heap
- A mark bit is reserved for each word in the block
78Advantages of keeping data outside the heap
- The mark phase does not need to examine (bring into the cache) pointer-free objects
- Sequences of small unreachable objects can be reclaimed as a group
- A single instruction is needed to examine their sequence of mark bits
- This is used when a heap block turns out to be empty
79The mark phase: a reminder
- Ensure that all objects are white.
- Grey all objects pointed to by a root.
- while there is a grey object g
- blacken g
- For each pointer p in g
- if p points to a white object
- grey that object.
80The mark phase: colors
- 1 mark bit
- 0 is white
- 1 is grey/black
- Stack
- In the stack: grey
- Removed from the stack: black (a minimal mark-loop sketch follows)
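A minimal sketch of this mark loop: the mark bit distinguishes white from grey/black, and being on the mark stack is what makes an object grey. The helpers and the fixed-size stack are assumptions, not the collector's real interface.

#include <stddef.h>

extern int    test_and_set_mark_bit(void *obj);     /* returns the previous bit value */
extern size_t obj_num_pointers(void *obj);
extern void **obj_pointer_field(void *obj, size_t i);

static void  *mark_stack[1 << 16];                  /* assumed fixed-size stack, no overflow handling */
static size_t mark_top;

static void mark_grey(void *obj)
{
    if (obj != NULL && !test_and_set_mark_bit(obj))
        mark_stack[mark_top++] = obj;               /* white -> grey: set the bit and push */
}

void mark_phase(void **roots, size_t nroots)
{
    for (size_t i = 0; i < nroots; i++)
        mark_grey(roots[i]);
    while (mark_top > 0) {
        void *g = mark_stack[--mark_top];           /* popping g blackens it */
        for (size_t i = 0; i < obj_num_pointers(g); i++)
            mark_grey(*obj_pointer_field(g, i));    /* grey each white child */
    }
}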
81The mark GC problem
- A significant fraction of time is spent retrieving the first pointer p from each grey object
- About a third of the marker's execution time is spent on this
- This time is expected to increase on future machines
82Prefetching
- A modern CPU instruction
- A program can prefetch data into the cache for
future use
83Prefetching cont.
- But the object reference must be predicted early enough
- For example, if the object is in main memory, it must be prefetched hundreds of cycles before its use
- Prefetching instructions are mostly inserted by compiler optimizations (a tiny example follows)
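As a concrete example of the instruction itself: on GCC-compatible compilers a program can request a cache line with __builtin_prefetch. The prefetch distance (64 elements here) is an arbitrary illustration; it must be large enough to cover the memory latency.

#include <stddef.h>

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 64]);   /* ask for data well before it is needed */
        s += a[i];
    }
    return s;
}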
84Prefetch on grey
- When? Prefetch as soon as p is found likely to be a pointer
- What? Prefetch the first cache line of the object
85To improve performance
- The last pointer to be pushed on the mark stack is prefetched first
- This minimizes the cases in which a just-greyed object is immediately examined
86And to improve more
- Prefetch a few cache lines ahead when scanning an object (sketched below)
- It helps with large objects
- If the object isn't that large, it prefetches the objects that follow it instead
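Layered onto the mark-loop sketch above, prefetch on grey might look roughly like this: a prefetch is issued for an object's first cache line as soon as the value looks like a pointer to a white object, and while scanning a grey object the collector stays a few cache lines ahead. The cache line size, the prefetch distance and the helpers are assumptions, not the paper's exact code.

#define CACHE_LINE 64                               /* assumed line size */

static void mark_grey_prefetching(void *obj)
{
    if (obj != NULL && !test_and_set_mark_bit(obj)) {
        __builtin_prefetch(obj);                    /* prefetch on grey: fetch the first line now,
                                                       it will be scanned when popped later */
        mark_stack[mark_top++] = obj;
    }
}

static void scan_grey(void *g)
{
    size_t n = obj_num_pointers(g);
    for (size_t i = 0; i < n; i++) {
        void **field = obj_pointer_field(g, i);
        /* while walking a (possibly large) object, stay a few lines ahead of the scan;
           for small objects this simply prefetches whatever follows them */
        __builtin_prefetch((char *)field + 4 * CACHE_LINE);
        mark_grey_prefetching(*field);
    }
}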
87The sweep GC problem
- If (reclaimed memory > cache size)
- Objects are likely to be evicted from the cache by the allocator or mutator
- Thus, the allocator will miss again when reusing the reclaimed memory
88Lazy sweeping
- Originally used to reduce page faults
- Delay the sweeping until the allocator needs the memory
- Pages will be reused instead of evicted from the
cache
89A reminder
- A mark bit is saved for each word in a cache block.
- A mark bit is used only if its word is the beginning of an object
90Cache lazy sweeping: the collector
- Scans each block's mark bits
- If all bits are unmarked, the block is added to the free-blocks pool without touching it
- If some bits are marked, it's added to a queue of blocks waiting to be swept
- There are several queues, one or more for each object size
91Cache lazy sweeping: the allocator
- Maps the request to the appropriate object free list
- Returns the first object from the list
- If the list is empty
- It sweeps the queue of the right size for a block with some available objects (sketched below)
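A sketch of how the two halves of cache-lazy sweeping fit together: at collection time the collector only inspects each block's mark bits, sending empty blocks straight to the free-block pool and queuing the rest per object size; the allocator sweeps a queued block only when its free list runs dry. All names are illustrative assumptions.

#include <stddef.h>

typedef struct heap_block heap_block_t;
extern int           block_all_unmarked(heap_block_t *b);   /* a word compare per mark-bit word */
extern void          push_free_block(heap_block_t *b);
extern void          enqueue_unswept(size_t size_class, heap_block_t *b);
extern heap_block_t *dequeue_unswept(size_t size_class);
extern void         *pop_free_object(size_t size_class);
extern void          sweep_block_to_free_list(heap_block_t *b, size_t size_class);

void sweep_phase_block(heap_block_t *b, size_t size_class)
{
    if (block_all_unmarked(b))
        push_free_block(b);                  /* reclaimed without touching its contents */
    else
        enqueue_unswept(size_class, b);      /* swept later, just before reallocation */
}

void *allocate(size_t size_class)
{
    void *obj = pop_free_object(size_class);
    while (obj == NULL) {                    /* free list empty: sweep one queued block */
        heap_block_t *b = dequeue_unswept(size_class);
        if (b == NULL)
            return NULL;                     /* caller would trigger a collection here */
        sweep_block_to_free_list(b, size_class);
        obj = pop_free_object(size_class);
    }
    return obj;
}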
92Experimental results
- Measured on two platforms
- The second platform gives some calibration on architectural variation
93Pentium III/500 results
94HP PA-8000/180 based results
95Results: conclusions
- Prefetch on grey eliminates a third to almost all of the cache miss overhead in the marker.
- But it depends on the data structures used in the program
96Results: conclusions cont.
- Collector performance is determined by the marker
- The sweep performance is architecture-dependent
97Conclusions
- Be concerned about cache locality or
- Have a method that does it for you
98Conclusions cont.
- Real-time data profiling is feasible
- A cache-conscious data layout can be produced using that information
- It may help reduce the performance gap between high-level and low-level languages
99Conclusions cont.
- Prefetch on grey and lazy sweeping are cheap to
implement and should be in future garbage
collectors
100Bibliography
- Trishul M. Chilimbi and James R. Larus. Using Generational Garbage Collection to Implement Cache-Conscious Data Placement.
- Hans-J. Boehm. Reducing Garbage Collector Cache Misses.
101Further reading
- Look at the articles
- Garbage Collection: Algorithms for Automatic Dynamic Memory Management - Richard Jones and Rafael Lins
102Further reading cont.
- Cecil
- Craig Chambers. Object-Oriented Multi-Methods in Cecil. In Proceedings ECOOP '92, LNCS 615, Springer-Verlag, pages 33-56, June 1992.
- Craig Chambers. The Cecil Language: Specification and Rationale. University of Washington, Seattle, Technical Report TR-93-03-05, Mar. 1993.
- Hyperion by Dan Simmons
103Time fillers
104Items
- Large objects
- Inter-generational object placement
- Why explicitly build free lists?
- Experimental methodology
- Second experimental methodology
105Large objects
- Ungar and Jackson
- There's an advantage to not copying large objects (> 256 bytes) of the same age
- A large object is never copied
- Each step has an associated set of large objects
106Large objects cont.
- A large object is linked in a doubly linked list.
- If it survives a collection, it's removed from its list and inserted into the TO space list.
- No compaction is done on large objects.
107Large objects cont.
- Read more in: David Ungar and Frank Jackson. An adaptive tenuring policy for generation scavengers. ACM Transactions on Programming Languages and Systems, 14(1):1-27, January 1992
108Two generations, one cache block
- How important is co-location of inter-generation objects?
- The way to achieve this is to demote or promote.
109Two generations, one cache block cont.
- Intra-generation pointers are not tracked.
- In order to demote safely, it is necessary to collect the object's original generation
- Result: long collection time
110Two generations, one cache block cont.
- Promotion can be done safely
- The young generation is being collected and its pointers updated
- Pointers from old to young are tracked
- The locality benefit will start only when the old generation is collected
- Premature promotion
111Why explicitly build free lists?
- Allocation is fast
- Heap scanning for unmarked objects can be fast using mark bits
- Little additional space overhead is required
112Experimental methodology
- Vortex compiler infrastructure
- Vortex supports GGC only for Cecil
- Cecil: a dynamically typed, purely object-oriented language.
- Used Cecil benchmarks
- Repeated each experiment 5 times and reported the average
113Cecil benchmarks
114Cecil benchmarks cont.
- Compiled at highest (o2) optimization level
115The platform
- Sun Ultraserver E5000
- 12 x 167 MHz UltraSPARC processors
- 2 GB memory, to prevent page faults
- Solaris 2.5.1
116The platform - memory
- L1: 16 KB, direct-mapped, 16 B blocks
- L2: 1 MB, unified, direct-mapped, 64 B blocks
- 64-entry iTLB and 64-entry dTLB, fully associative
117The platform: memory costs
- L1 data cache hit: 1 cycle
- L1 miss, L2 hit: 6 cycles
- L2 miss: an additional 64 cycles
118Second experimental methodology
- Two platforms
- All benchmarks except one are C programs
119Pentium measurements
- Dual-processor 500 MHz Pentium III (but only one processor used)
- 100 MHz bus
- 512 KB L2 cache
- Physical memory > 300 MB (why keep it a secret?), which prevented paging and kept the whole executable in memory
- RedHat 6.1
- Benchmarks compiled using gcc with -O2
120RISC measurements
- A single PA-8000/180 MHz processor
- Running HP/UX 11
- Single-level I and D caches, 1 MB each
121Benchmarks
- Execution time measurements are averages of five runs
- The division between sweep and mark times is arbitrary
- The Pentium III's prefetcht0 introduced a new overhead, so prefetchnta was used instead. It was less effective at eliminating cache misses, though
122?
123Thank you for listening! (and staying awake)
The end
Lectured by: Shachar Rubinstein (shachar1@post.tau.ac.il)
GC seminar: Mooly Sagiv
Audience: you
Thanks: for your patience, the PowerPoint XP effects, my parents
No animals were harmed during this production (except for annoying mosquitoes)