Title: Reducing Garbage Collector Cache Misses
1Reducing Garbage Collector Cache Misses
- Shachar Rubinstein
- Garbage Collection Seminar
2The End!
3The general problem
- CPUs are getting faster and faster
- Main memory speed lags behind
- Result: the cost of accessing main memory is increasing
4Solutions
- Hardware and software techniques
- Memory hierarchy
- Prefetching
- Multithreading
- Non-blocking caches
- Dynamic instruction scheduling
- Speculative execution
5Great Solutions?
Not exactly
- Complex hardware and compilers
- Ineffective for many programs
- Attack the manifestation (memory latency) and not the source (poor reference locality)
6Previous work
- Improving cache locality in dense matrices using loop transformations
- Other profile-driven, compiler-directed approaches
7The GC problem
- Little temporal locality.
- Each live object is usually read only once during the mark phase.
- Most reads are likely to miss.
- The new contents are unlikely to be used more
than once.
8The GC problem cont.
- The sweep phase, like the mark phase, also touches each object once
- That's because the free-list pointers are maintained in the objects themselves
- Unlike the mark phase, the sweep phase is more sequential
9The GC problem cont.
- The sweep is less likely to use cache contents left by the marker
- The allocator is likely to miss again when the object is allocated
10The GC problem - previous work
- Older work concentrated on paging performance.
- The memory size increase led to abandoning this goal.
- But the memory size increase also led to huge cache miss penalties.
- The largest cache size < heap size
- This problem is unavoidable.
11Previous work
- Reducing sweep time for a nearly empty heap
- Compiler-based prefetching for recursive data
structures
12How am I going to improve the situation?
- Do some magic!
- Well, no
- Use real-time information to improve the program's cache locality.
- The mark and sweep phases offer invaluable opportunities for improvement
- Bring objects into the cache earlier
- Reuse freed objects for reallocation
13Some numbers
- Relative to a copying GC
- Cache miss rates reduced by 21-42%
- Program performance improved by 14-37%
- Relative to a page-level GC
- Cache miss rates reduced by 20-41%
- Program performance improved by 18-31%
14Road map
- Cache-conscious data placement using generational GC
- Overview
- Short generational GC reminder
- Real-time data profiling
- Object affinity graph
- Combining the affinity graph with GC
- Experimental evaluation
- Other methods and their experimental results
15Overview
- A program is instrumented to profile its access patterns
- The data is used in the same execution, not the next one.
- The data is turned into an affinity graph
- A new copy algorithm uses the graph to lay out the data while copying.
16Generational GC: a reminder
- The heap is divided into generations
- GC activity concentrates on young objects, which typically die faster.
- Objects that survive one or more scavenges are moved to the next generation
17Implementation notes
- The authors used the UM GC toolkit
- The toolkit has several steps per generation
- The authors used a single step for each generation, for simplicity.
- Each step consists of fixed-size blocks
- The blocks are not necessarily contiguous in memory (see the sketch below)
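To make the generation/step/block hierarchy concrete, here is a minimal layout sketch; the type names, fields and the block size are illustrative assumptions, not the UM GC toolkit's actual definitions.

/* Hypothetical layout: a generation holds a list of steps, and each step
 * holds a chain of fixed-size blocks that need not be contiguous. */
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)     /* assumed block size, for illustration only */

typedef struct block {
    struct block *next;            /* blocks of a step are chained, not contiguous */
    char         *free;            /* next free byte inside this block */
    char          data[];          /* object storage */
} block_t;

typedef struct step {
    struct step *next_step;        /* objects that survive a scavenge move here */
    block_t     *blocks;
} step_t;

typedef struct generation {
    int     number;                /* 0 = youngest */
    step_t *steps;                 /* this presentation assumes a single step */
} generation_t;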
18Implementation notes - steps
19Implementation notes - steps
- The steps are used to encode the object's age
- An object which survives a scavenge is moved to
the next step
20Implementation notes: moving between generations
- The scavenger collects a generation g and all its younger generations
- It starts from objects that are
- In g
- Reachable from the roots.
- Moving an object means copying it into the TO space.
- The FROM space can be reused
21Copying algorithm: a reminder
- Cheney's algorithm
- TO and FROM spaces are switched
- Starts from the root set
- Objects are traversed breadth-first using a queue
- Objects are copied to TO space
- Terminates when the queue is empty (a minimal sketch follows)
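For reference, a minimal sketch of Cheney-style copying. It assumes the runtime supplies the object-layout helpers declared below (they are illustrative, not the presented system's API); the scan/free pointer pair is the "queue trick" of the next slide, where the TO space itself acts as the breadth-first queue.

#include <stddef.h>
#include <string.h>

/* Assumed runtime helpers (illustrative only) */
extern int     is_forwarded(void *obj);
extern void   *forwarding_address(void *obj);
extern void    set_forwarding_address(void *obj, void *to);
extern size_t  obj_size(void *obj);
extern size_t  obj_num_pointers(void *obj);
extern void  **obj_pointer_field(void *obj, size_t i);
extern void    flip_spaces(void);
extern char   *to_space;

char *free_ptr, *scan_ptr;

void *copy(void *obj)                             /* move one object to TO space */
{
    if (is_forwarded(obj))
        return forwarding_address(obj);           /* already copied: follow forwarding pointer */
    void *new_obj = free_ptr;
    size_t n = obj_size(obj);
    memcpy(new_obj, obj, n);                      /* bump-allocate in TO space */
    free_ptr += n;
    set_forwarding_address(obj, new_obj);         /* leave a forwarding address in FROM space */
    return new_obj;
}

void cheney_collect(void **roots, size_t nroots)
{
    flip_spaces();                                /* switch TO and FROM */
    free_ptr = scan_ptr = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = copy(roots[i]);                /* start from the root set */
    while (scan_ptr < free_ptr) {                 /* TO space between scan and free is the queue */
        void *obj = scan_ptr;
        for (size_t i = 0; i < obj_num_pointers(obj); i++) {
            void **field = obj_pointer_field(obj, i);
            if (*field)
                *field = copy(*field);            /* copying a child enqueues it */
        }
        scan_ptr += obj_size(obj);
    }
}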
22Copying algorithm: the queue trick
23The algorithm
24Did you get it?
25Real-time data profiling
- A profile from an earlier program run is not good enough
- Real-time data eliminates
- A separate profiling run
- Finding representative inputs
Great!
But the overhead must be low!
26Profiling data access patterns
- Trace every load and store to the heap
- Huge overhead (factor of 10!)
27Reducing overhead
- Use properties of object-oriented programs
- Most objects are small, often less than 32 bytes
- No need to distinguish between fields, since
cache blocks are bigger
28Reducing overhead cont.
- Most object accesses are not lightweight operations to begin with
- So the profiling instrumentation will not add a large relative overhead
- Don't believe it? Stay awake
29Collecting profiling data
- Only loads of base object addresses are recorded
- Uses a modified compiler
- The compiler retains object type information to instrument loads selectively
30Code instrumentation
31Collecting profiling data - cont
- The base object address is entered into an object access buffer (see the sketch below)
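A rough sketch of what the inserted instrumentation amounts to: each selected load of a base object address also appends that address to the object access buffer. The buffer name, the inline helper and its handling are assumptions for illustration.

/* Illustrative instrumentation: one extra store per profiled load. */
#define ACCESS_BUFFER_ENTRIES 15000              /* 60 KB with 32-bit pointers, as recommended */

static void  *object_access_buffer[ACCESS_BUFFER_ENTRIES];
static void **access_buffer_ptr = object_access_buffer;

static inline void record_object_access(void *base_addr)
{
    *access_buffer_ptr++ = base_addr;            /* no bounds check here: overflow is caught by
                                                    a write-protected guard page (next slide) */
}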
32Implementation note
- Uses a page trap for buffer overflow
- A trap causes a graph to be built
- Recommended buffer size: 15,000 entries (60 KB)
33Affinity?
- Main Entry: affinity. Pronunciation: \ə-ˈfi-nə-tē\. Function: noun. Inflected Form(s): plural -ties. Etymology: Middle English affinite, from Middle French or Latin; Middle French afinité, from Latin affinitas, from affinis bordering on, related by marriage, from ad- + finis end, border. Date: 14th century.
- 1: relationship by marriage
- 2 a: a sympathy marked by community of interest: KINSHIP; b (1): an attraction to or liking for something ("people with an affinity to darkness" -- Mark Twain; "pork and fennel have a natural affinity for each other" -- Abby Mandel); (2): an attractive force between substances or particles that causes them to enter into and remain in chemical combination; c: a person, especially of the opposite sex, having a particular attraction for one
- 3 a: a likeness based on relationship or causal connection ("found an affinity between the teller of a tale and the craftsman" -- Mary McCarthy; "this investigation, with affinities to a case history, a psychoanalysis, a detective story" -- Oliver Sacks); b: a relation between biological groups involving resemblance in structural plan and indicating a common origin
34The object affinity graph
35The object affinity graph
- Nodes: objects
- Edges: temporal affinity between objects
- An undirected graph
36Building the graph
37Inserting an object into the queue
38Incrementing edge weights
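Based on the description above and the demonstration that follows, graph construction might look roughly like this: for every address taken from the object access buffer, the edge weight to each object currently sitting in a small FIFO locality queue is incremented, and the object is then placed at the queue tail. The add_edge_weight helper and the exact queue handling are assumptions, not the authors' code.

#include <stddef.h>

#define LOCALITY_QUEUE_SIZE 3                 /* the recommended queue size (slide 51) */

extern void add_edge_weight(void *a, void *b, int delta);   /* assumed graph helper:
                                                               creates the undirected edge on first use */

static void *locality_queue[LOCALITY_QUEUE_SIZE];
static int   queue_head = 0, queue_len = 0;

static void process_access(void *obj)
{
    /* temporal affinity: obj was accessed close in time to every object
       still in the locality queue */
    for (int i = 0; i < queue_len; i++) {
        void *other = locality_queue[(queue_head + i) % LOCALITY_QUEUE_SIZE];
        if (other != obj)
            add_edge_weight(obj, other, 1);
    }
    /* enqueue obj at the tail, evicting the oldest entry when full */
    if (queue_len < LOCALITY_QUEUE_SIZE) {
        locality_queue[(queue_head + queue_len) % LOCALITY_QUEUE_SIZE] = obj;
        queue_len++;
    } else {
        locality_queue[queue_head] = obj;
        queue_head = (queue_head + 1) % LOCALITY_QUEUE_SIZE;
    }
}

void build_affinity_graph(void **buffer, size_t n)
{
    for (size_t i = 0; i < n; i++)
        process_access(buffer[i]);            /* one pass over the object access buffer */
}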
39All clear?
40Demonstration (slides 40-49)
- Diagram slides stepping through construction of the affinity graph: object addresses (A, B, C, D) are consumed from the object access buffer, edge weights in the graph are created or incremented for the objects currently in the locality queue, and each processed object is placed at the queue tail.
50Implementation notes
- A separate affinity graph is built for each generation, except the first.
- It uses the fact that an object's generation is encoded in its address.
- This method prevents placing objects from different generations in the same cache block. (Explanation later on)
51Implementation notes: queue size
- The locality queue size is important
- Too small: temporal relationships are missed
- Too big: huge graph, long processing time
- Recommended size: 3
52Implementation notes
- Re-create or update the graph?
- Depends on the application
- Programs with distinct access phases should re-create it
- Programs with uniform behavior should update it
- In this article, the graph is re-created before each scavenge
53Stop!
- Our goal: produce a cache-conscious data layout, so that related objects are likely to reside in the same cache block
- In English: place objects with high temporal affinity next to each other.
- The method: use the profiling information we've collected in the copying process.
54GC Real-time profiling
- Use the object affinity graph in the copying algorithm.
55Example: object affinity graph
56Example: before step 1
57Step 1: using the graph
- Flip roles (TO and FROM)
- Initialize the free and unprocessed pointers to the beginning of the TO space.
- Pick a node that is in
- The root set
- and
- the affinity graph, and has the highest edge weight
- Perform a greedy DFS on the graph
58Step 1 cont.
- Copy each visited object to the TO space
- Increment the free pointer
- Store a forwarding address in the FROM space (sketched below)
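A minimal sketch of step 1, reusing the copy() routine from the Cheney sketch above. The graph_* helpers (neighbors ranked by edge weight, a visited mark) are assumptions; pushing neighbors in reverse weight order makes the greedy DFS follow the heaviest edge first. Per slide 64, the object access buffer can be passed in as the DFS stack.

#include <stddef.h>

extern void  *copy(void *obj);                          /* from the Cheney sketch above */
extern size_t graph_num_neighbors(void *obj);
extern void  *graph_neighbor_by_weight(void *obj, size_t rank);  /* rank 0 = heaviest edge */
extern int    graph_visited(void *obj);
extern void   graph_mark_visited(void *obj);

void copy_by_affinity(void *start, void **stack)        /* stack: e.g. the object access buffer */
{
    size_t top = 0;
    stack[top++] = start;
    while (top > 0) {
        void *obj = stack[--top];
        if (graph_visited(obj))
            continue;
        graph_mark_visited(obj);
        copy(obj);                                      /* lays obj out at the current free pointer
                                                           and stores its forwarding address */
        for (size_t i = graph_num_neighbors(obj); i-- > 0; ) {
            void *nbr = graph_neighbor_by_weight(obj, i);
            if (!graph_visited(nbr))
                stack[top++] = nbr;                     /* heaviest neighbor ends up on top */
        }
    }
}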
59Example: after step 1
60Step 2: continue Cheney's way
- Process all objects between the unprocessed and the free pointers, as in Cheney's algorithm
61Example: after step 2
62Step 3 - cleanup
- Ensure all roots are in the TO space
- If not, process them using Cheney's algorithm (a combined sketch of the three steps follows)
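Putting the three steps together; cheney_scan() stands for the scan/free loop of the earlier Cheney sketch, and the remaining helpers are illustrative assumptions rather than the authors' actual interface.

#include <stddef.h>

extern void  flip_spaces(void);
extern char *to_space, *free_ptr, *scan_ptr;
extern void *copy(void *obj);                       /* from the Cheney sketch */
extern void  cheney_scan(void);                     /* the scan_ptr/free_ptr loop shown earlier */
extern void  copy_by_affinity(void *start, void **stack);
extern void *heaviest_root_in_graph(void **roots, size_t nroots);  /* assumed helper */
extern int   in_from_space(void *obj);                             /* assumed helper */
extern void *object_access_buffer[];

void cache_conscious_collect(void **roots, size_t nroots)
{
    flip_spaces();
    free_ptr = scan_ptr = to_space;

    /* Step 1: greedy DFS over the affinity graph, starting from the root-set
       node with the highest edge weight */
    void *start = heaviest_root_in_graph(roots, nroots);
    if (start)
        copy_by_affinity(start, object_access_buffer);

    /* Step 2: continue breadth-first, exactly as Cheney does */
    cheney_scan();

    /* Step 3: any root still pointing into FROM space is handled the usual way */
    for (size_t i = 0; i < nroots; i++)
        if (in_from_space(roots[i]))
            roots[i] = copy(roots[i]);
    cheney_scan();                                   /* scan whatever step 3 copied */
}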
63Example: after step 3
64Implementation notes
- The object access buffer can be used as a stack
for the DFS
65Inaccurate results(?)
- The object affinity graph may retain objects that are no longer reachable (garbage)
- They will be incorrectly promoted at most once
- Efforts are focused on longer-lived objects and not on the youngest generation
66Experimental evaluation
- Methodology: if we have the time
- Object-oriented programs manipulate small objects
- Real-time data profiling overhead
- The algorithm's impact on performance
67Size of heap objects
68But that's not the point!
- Small objects often die fast
69Surviving heap objects
70Real-time data profiling overhead
71Overall execution time
72Overall execution time - notes
- No impact on L1 cache because its blocks are 16B
73Compared to WLM algorithm
74Comparison notes
- WLM (Wilson-Lam-Moher) improves a program's virtual memory locality.
- It performed worse than or close to Cheney's because of the 2GB of memory
75What else?
76Other methods
- Two methods that can be used with the previous one
- Prefetch on grey
- Lazy sweeping
77Assumptions
- Non-moving mark-sweep collector
- For simplicity, the collector segregates objects by size. Each block contains objects of a single size
- The collector's data structures are outside the user-visible heap
- A mark bit is reserved for each word in the block
78Advantages of keeping data outside the heap
- The mark phase does not need to examine (bring into the cache) pointer-free objects
- Sequences of small unreachable objects can be reclaimed as a group
- A single instruction is needed to examine their sequence of mark bits
- This is used when a heap block turns out to be empty
79The mark phase: a reminder
- Ensure that all objects are white.
- Grey all objects pointed to by a root.
- while there is a grey object g
- blacken g
- For each pointer p in g
- if p points to a white object
- grey that object.
80The mark phase: colors
- 1 mark bit
- 0 is white
- 1 is grey/black
- Stack
- In the stack: grey
- Removed from the stack: black (a minimal mark-loop sketch follows)
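A minimal sketch of this mark loop: the mark bit distinguishes white from grey/black, and being on the mark stack is what makes an object grey. The helpers and the fixed-size stack are assumptions, not the collector's real interface.

#include <stddef.h>

extern int    test_and_set_mark_bit(void *obj);     /* returns the previous bit value */
extern size_t obj_num_pointers(void *obj);
extern void **obj_pointer_field(void *obj, size_t i);

static void  *mark_stack[1 << 16];                  /* assumed fixed-size stack, no overflow handling */
static size_t mark_top;

static void mark_grey(void *obj)
{
    if (obj != NULL && !test_and_set_mark_bit(obj))
        mark_stack[mark_top++] = obj;               /* white -> grey: set the bit and push */
}

void mark_phase(void **roots, size_t nroots)
{
    for (size_t i = 0; i < nroots; i++)
        mark_grey(roots[i]);
    while (mark_top > 0) {
        void *g = mark_stack[--mark_top];           /* popping g blackens it */
        for (size_t i = 0; i < obj_num_pointers(g); i++)
            mark_grey(*obj_pointer_field(g, i));    /* grey each white child */
    }
}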
81The mark GC problem
- A significant fraction of time is spent retrieving the first pointer p from each grey object
- About a third of the marker's execution time is spent on this
- This time is expected to increase on future machines
82Prefetching
- A modern CPU instruction
- A program can prefetch data into the cache for
future use
83Prefetching cont.
- But the object reference must be predicted early enough
- For example, if the object is in main memory, it must be prefetched hundreds of cycles before its use
- Prefetching instructions are mostly inserted by compiler optimizations (a tiny example follows)
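As a concrete example of the instruction itself: on GCC-compatible compilers a program can request a cache line with __builtin_prefetch. The prefetch distance (64 elements here) is an arbitrary illustration; it must be large enough to cover the memory latency.

#include <stddef.h>

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 64]);   /* ask for data well before it is needed */
        s += a[i];
    }
    return s;
}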
84Prefetch on grey
- When? Prefetch as soon as p is found likely to be a pointer
- What? Prefetch the first cache line of the object
85To improve performance
- The last pointer to be pushed on the mark stack is prefetched first
- This minimizes the cases in which a just-greyed object is immediately examined
86And to improve more
- Prefetch a few cache lines ahead when scanning an object (sketched below)
- It helps with large objects
- If the object isn't that large, it prefetches the objects that follow it instead
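Layered onto the mark-loop sketch above, prefetch on grey might look roughly like this: a prefetch is issued for an object's first cache line as soon as the value looks like a pointer to a white object, and while scanning a grey object the collector stays a few cache lines ahead. The cache line size, the prefetch distance and the helpers are assumptions, not the paper's exact code.

#define CACHE_LINE 64                               /* assumed line size */

static void mark_grey_prefetching(void *obj)
{
    if (obj != NULL && !test_and_set_mark_bit(obj)) {
        __builtin_prefetch(obj);                    /* prefetch on grey: fetch the first line now,
                                                       it will be scanned when popped later */
        mark_stack[mark_top++] = obj;
    }
}

static void scan_grey(void *g)
{
    size_t n = obj_num_pointers(g);
    for (size_t i = 0; i < n; i++) {
        void **field = obj_pointer_field(g, i);
        /* while walking a (possibly large) object, stay a few lines ahead of the scan;
           for small objects this simply prefetches whatever follows them */
        __builtin_prefetch((char *)field + 4 * CACHE_LINE);
        mark_grey_prefetching(*field);
    }
}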
87The sweep GC problem
- If (reclaimed memory > cache size)
- Objects are likely to be evicted from the cache by the allocator or mutator
- Thus, the allocator will miss again when reusing the reclaimed memory
88Lazy sweeping
- Originally used to reduce page faults
- Delay the sweeping until the allocator needs the memory
- Pages will be reused instead of evicted from the
cache
89A reminder
- A mark bit is saved for each word in a cache block.
- A mark bit is used only if its word is the beginning of an object
90Cache lazy sweeping: the collector
- Scans each block's mark bits
- If all bits are unmarked, the block is added to the free-blocks pool without touching it
- If some bits are marked, it's added to a queue of blocks waiting to be swept
- There are several queues, one or more for each object size
91Cache lazy sweeping: the allocator
- Maps the request to the appropriate object free list
- Returns the first object from the list
- If the list is empty
- It sweeps the queue of the right size for a block with some available objects (sketched below)
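A sketch of how the two halves of cache-lazy sweeping fit together: at collection time the collector only inspects each block's mark bits, sending empty blocks straight to the free-block pool and queuing the rest per object size; the allocator sweeps a queued block only when its free list runs dry. All names are illustrative assumptions.

#include <stddef.h>

typedef struct heap_block heap_block_t;
extern int           block_all_unmarked(heap_block_t *b);   /* a word compare per mark-bit word */
extern void          push_free_block(heap_block_t *b);
extern void          enqueue_unswept(size_t size_class, heap_block_t *b);
extern heap_block_t *dequeue_unswept(size_t size_class);
extern void         *pop_free_object(size_t size_class);
extern void          sweep_block_to_free_list(heap_block_t *b, size_t size_class);

void sweep_phase_block(heap_block_t *b, size_t size_class)
{
    if (block_all_unmarked(b))
        push_free_block(b);                  /* reclaimed without touching its contents */
    else
        enqueue_unswept(size_class, b);      /* swept later, just before reallocation */
}

void *allocate(size_t size_class)
{
    void *obj = pop_free_object(size_class);
    while (obj == NULL) {                    /* free list empty: sweep one queued block */
        heap_block_t *b = dequeue_unswept(size_class);
        if (b == NULL)
            return NULL;                     /* caller would trigger a collection here */
        sweep_block_to_free_list(b, size_class);
        obj = pop_free_object(size_class);
    }
    return obj;
}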
92Experimental results
- Measured on two platforms
- The second platform gives some calibration on architectural variation
93Pentium III/500 results
94HP PA-8000/180 based results
95Results: conclusions
- Prefetch on grey eliminates a third to almost all of the cache miss overhead in the marker.
- But it depends on the data structures used in the program
96Results: conclusions cont.
- Collector performance is determined by the marker
- The sweep performance is architecture-dependent
97Conclusions
- Be concerned about cache locality or
- Have a method that does it for you
98Conclusions cont.
- Real-time data profiling is feasible
- A cache-conscious data layout can be produced using that information
- It may help reduce the performance gap between high-level and low-level languages
99Conclusions cont.
- Prefetch on grey and lazy sweeping are cheap to
implement and should be in future garbage
collectors
100Bibliography
- Trishul M. Chilimbi and James R. Larus. Using Generational Garbage Collection to Implement Cache-Conscious Data Placement.
- Hans-J. Boehm. Reducing Garbage Collector Cache Misses.
101Further reading
- Look at the articles
- Garbage Collection: Algorithms for Automatic Dynamic Memory Management - Richard Jones and Rafael Lins
102Further reading cont.
- Cecil
- Craig Chambers. Object-Oriented Multi-Methods in Cecil. In Proceedings ECOOP '92, LNCS 615, Springer-Verlag, pages 33-56, June 1992.
- Craig Chambers. The Cecil Language: Specification and Rationale. University of Washington, Seattle, Technical Report TR-93-03-05, Mar. 1993.
- Hyperion by Dan Simmons
103Time fillers
104Items
- Large objects
- Inter-generational object placement
- Why explicitly build free lists?
- Experimental methodology
- Second experimental methodology
105Large objects
- Ungar and Jackson
- There's an advantage to not copying large objects (> 256 bytes) of the same age
- A large object is never copied
- Each step has an associated set of large objects
106Large objects cont.
- A large object is linked in a doubly linked list.
- If it survives a collection, it's removed from its list and inserted into the TO space list.
- No compaction is done on large objects.
107Large objects cont.
- Read more in: David Ungar and Frank Jackson. An adaptive tenuring policy for generation scavengers. ACM Transactions on Programming Languages and Systems, 14(1):1-27, January 1992
108Two generations, one cache block
- How important is co-location of inter-generation objects?
- The way to achieve this is to demote or promote.
109Two generations, one cache block cont.
- Intra-generation pointers are not tracked.
- In order to demote safely, it is necessary to collect the object's original generation
- Result: long collection time
110Two generations, one cache block cont.
- Promotion can be done safely
- The young generation is being collected and its pointers updated
- Pointers from old to young are tracked
- The locality benefit will start only when the old generation is collected
- Premature promotion
111Why explicitly build free lists?
- Allocation is fast
- Heap scanning for unmarked objects can be fast using mark bits
- Little additional space overhead is required
112Experimental methodology
- Vortex compiler infrastructure
- Vortex supports GGC only for Cecil
- Cecil: a dynamically typed, purely object-oriented language.
- Used Cecil benchmarks
- Repeated each experiment 5 times and reported the average
113Cecil benchmarks
114Cecil benchmarks cont.
- Compiled at highest (o2) optimization level
115The platform
- Sun Ultraserver E5000
- 12 x 167 MHz UltraSPARC processors
- 2 GB memory, to prevent page faults
- Solaris 2.5.1
116The platform - memory
- L1: 16 KB, direct-mapped, 16 B blocks
- L2: 1 MB, unified, direct-mapped, 64 B blocks
- 64-entry iTLB and 64-entry dTLB, fully associative
117The platform: memory costs
- L1 data cache hit: 1 cycle
- L1 miss, L2 hit: 6 cycles
- L2 miss: an additional 64 cycles
118Second experimental methodology
- Two platforms
- All benchmarks except one are C programs
119Pentium measurements
- Dual-processor 500 MHz Pentium III (but only one processor used)
- 100 MHz bus
- 512 KB L2 cache
- Physical memory > 300 MB (why keep it a secret?), which prevented paging and kept the whole executable in memory
- RedHat 6.1
- Benchmarks compiled using gcc with -O2
120RISC measurements
- A single PA-8000/180 MHz processor
- Running HP/UX 11
- Single-level I and D caches, 1 MB each
121Benchmarks
- Execution time measurements are averages of five runs
- The division between sweep and mark times is arbitrary
- The Pentium III's prefetcht0 introduced a new overhead, so prefetchnta was used instead. It was less effective at eliminating cache misses, though
122?
123Thank you for listening! (and staying awake)
The end
Lectured by: Shachar Rubinstein (shachar1@post.tau.ac.il)
GC seminar: Mooly Sagiv
Audience: you
Thanks: for your patience, the PowerPoint XP effects, my parents
No animals were harmed during this production (except for annoying mosquitoes)