Title: Managing Bounded Code Caches in Dynamic Binary Optimizers
1. Managing Bounded Code Caches in Dynamic Binary Optimizers
- Kim Hazelwood
- Intel Corporation
2. Dynamic Binary Optimizers
- Dynamic optimizers: DynamoRIO, Pin, ...
  - x86 → instrumented x86
  - x86 → optimized x86
  - x86 → secure x86
- Dynamic translators: DAISY, CMS, ...
  - x86 → Itanium
- Just-in-time compilers: Jikes RVM, ...
  - Java → x86
- All create a modified code image at run time
- Good performance is crucial
3. Dynamic Binary Optimizers
- Run-Time Overheads
  - Observing execution
  - Transforming code
  - Caching code
[Diagram: EXE → Profile → Transform → Code Cache → Execute loop]
For good performance, the vast majority of code should execute in the code cache. Goal: maintain the working set in the cache.
4. Talk Outline
- The code cache management problem
  - Motivation
  - Complications
- Two solutions
  - Eviction granularities
  - Generational caches
- Collaborative HW/SW systems
5. Code Cache Design Space
- Inter-execution: What should be stored between executions of an application?
- Intra-execution: What should reside in the cache at a given time?
  - Unified cache vs. partitioned cache
  - Local policies vs. local + global policies
- Eviction granularity [CGO'04]
- Generational caches [MICRO'03]
6. Code Cache Management
- Capacity
  - Remove start-up, temporal code
- Consistency
  - Self-modifying code
  - Unmapped memory
  - Ephemeral optimizations
7. Bounding Code Caches
- For SPEC2000: not necessary
8. Interactive Windows Applications
- Unbounded caches become impractical
9. As a General Rule (in DynamoRIO)
Code Expansion = Final Code Cache Size / Application Code Executed
10. Haven't We Solved This Problem?
- Several unique challenges for code caches
  - Code caches store superblocks
  - Variable size → fragmentation
  - Tail duplication → cache expansion
  - Linked superblocks → consistency issues
  - No backing store → high miss penalty
11. Systems Typically Cache Superblocks
[Diagram: control-flow graph → superblock with exit stubs → linked superblocks]
- Superblocks: single-entry, multiple-exit code regions
12. Superblocks Vary in Size
- Cannot use pure page-based techniques
13. Fragmentation Problem
- Evictions must free contiguous memory
- Defragmentation is too expensive at run time
[Diagram: blocks A–I in the cache; the LRU, second-LRU, and third-LRU blocks are scattered, so evicting them frees non-contiguous holes]
14. Superblocks Link to Other Superblocks
- Cache evictions require link removal
- Back-pointer tables are necessary
[Diagram: superblocks 1–7 linked to one another and to the interpreter]
15. Cache Misses Are Expensive
- No backing store!
- On a code cache miss, the system must:
  - Save processor state
  - Re-optimize/re-translate the code region
  - Insert (and evict) code blocks
  - Update the hash table
  - Update links
  - Restore the processor state
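The miss path above can be modeled with a small sketch (all names and structures here are illustrative, not DynamoRIO's actual API; state save/restore and code emission are elided):

```python
# Minimal model of the code-cache miss path. A miss must make room,
# insert the new block, and keep the lookup table coherent.

class CodeCache:
    def __init__(self, capacity):
        self.capacity = capacity      # bytes available
        self.blocks = []              # (entry_pc, size) in FIFO order
        self.used = 0

    def has_room(self, size):
        return self.used + size <= self.capacity

    def evict_oldest(self):
        pc, size = self.blocks.pop(0)
        self.used -= size
        return pc

    def insert(self, pc, size):
        self.blocks.append((pc, size))
        self.used += size

def handle_miss(pc, size, cache, hash_table):
    """On a miss: (re)translate, make room, insert, update the table."""
    # 1. save processor state; 2. re-translate `pc` (elided in this model)
    while not cache.has_room(size):
        evicted_pc = cache.evict_oldest()   # 3. evict to make room
        del hash_table[evicted_pc]          # 4. keep hash table coherent
        # 5. incoming links to evicted_pc would be unlinked here
    cache.insert(pc, size)
    hash_table[pc] = pc                     # maps app PC -> cache address
    # 6. restore processor state and resume inside the cache

cache, table = CodeCache(100), {}
for pc, size in [(0x10, 60), (0x20, 30), (0x30, 40)]:
    handle_miss(pc, size, cache, table)
```

In this run, inserting the third block forces the first one out, illustrating why every miss can trigger the full evict/unlink/update sequence.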
16. FIFO Replacement
- Simple implementation using a circular buffer
- All evictions are contiguous
- Solves the fragmentation problem
- Only slightly higher miss rate than LRU
[Diagram: circular buffer holding blocks A–H; insertion and eviction both advance around the buffer]
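A circular-buffer FIFO can be sketched as follows (an illustrative model, not DynamoRIO's layout): the insertion pointer sweeps around the buffer, and whatever it overruns is evicted, so freed space is always contiguous and never needs compaction.

```python
# Circular-buffer FIFO code cache: eviction frees exactly the contiguous
# region the new block is about to occupy, so fragmentation never arises.

class CircularCodeCache:
    def __init__(self, size):
        self.size = size
        self.head = 0                  # next free offset
        self.blocks = []               # (offset, length), insertion order

    def insert(self, length):
        assert length <= self.size
        if self.head + length > self.size:
            self.head = 0              # wrap; no block straddles the end
        start, end = self.head, self.head + length
        # Evict every block overlapping the contiguous region [start, end).
        evicted = [b for b in self.blocks
                   if b[0] < end and b[0] + b[1] > start]
        self.blocks = [b for b in self.blocks if b not in evicted]
        self.blocks.append((start, length))
        self.head = end
        return start, evicted

cache = CircularCodeCache(10)
first = cache.insert(4)    # placed at offset 0, nothing evicted
second = cache.insert(4)   # placed at offset 4, nothing evicted
third = cache.insert(4)    # wraps to offset 0, evicting the oldest block
```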
17. Code Cache Eviction Granularities
[Spectrum: coarse-grained evictions (Flush) ↔ fine-grained evictions (FIFO)]
18. Medium-Grained Evictions
- Cache units are evicted in FIFO order
- All superblocks in a unit are flushed
- Balances miss rate and eviction overheads
- Reduces link maintenance
[Diagram: code cache divided into units 1…N, each holding superblocks]
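The unit-level flush can be sketched as below (hypothetical names; in a real system the link set would hold code-cache branch addresses). Intra-unit links die with the unit, so only cross-unit edges need patching:

```python
# Medium-grained eviction: flush the oldest cache unit wholesale and
# remove only the links that cross unit boundaries.

from collections import deque

def flush_oldest_unit(units, inter_unit_links):
    """units: deque of lists of superblock ids, oldest unit first.
    inter_unit_links: set of (src_sb, dst_sb) pairs that cross units."""
    victim = units.popleft()
    dead = set(victim)
    # Patch only edges entering or leaving the flushed unit.
    removed = {(s, d) for (s, d) in inter_unit_links
               if s in dead or d in dead}
    inter_unit_links -= removed
    return victim, removed

units = deque([["A", "B"], ["C", "D"]])
links = {("C", "A"), ("B", "D")}       # cross-unit edges only
victim, removed = flush_oldest_unit(units, links)
```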
19. Link Maintenance
- Only inter-unit links require removal
[Diagram: intra-unit links die with their unit; inter-unit links must be patched on eviction]
20. Inter-Unit Links
Link Removal Cost = 296 × (NumLinks) + 96
[Chart: link-removal overhead, coarse vs. fine granularity]
21. Eviction Interruptions
Eviction Cost = 2.8 × (SBSize) + 3055
[Chart: eviction overhead, coarse vs. fine granularity]
22. Miss Rate Comparison
Miss Cost = 75.4 × (SBSize) + 1922
[Chart: miss rates, coarse vs. fine granularity]
23. Relative Overhead of Granularities
Includes miss costs + eviction costs + link costs
[Chart: total relative overhead, coarse vs. fine granularity]
24. Roadmap
- Local cache management: eviction policy for a single code cache
  - LRU, FIFO, Flush, ...
  - Eviction granularity
- → Global cache management: policy of interaction between multiple code caches
  - Generational code caches
25. Superblock Lifetimes
Lifetime = (LastExecutionTime − FirstExecutionTime) / TotalExecutionTime
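The metric above normalizes a superblock's active span to the whole run: values near 0 indicate code that died young (nursery material), values near 1 indicate code live throughout (persistent-cache material). A direct transcription (example timestamps are made up):

```python
def lifetime(first_exec, last_exec, total_time):
    """Normalized superblock lifetime in [0, 1]."""
    return (last_exec - first_exec) / total_time

startup = lifetime(0, 5, 100)    # start-up code: short-lived
hot = lifetime(10, 95, 100)      # hot code: long-lived
```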
26. Generational Code Caches
[Diagram: a new superblock enters the nursery (a circular buffer); on FIFO eviction, a live superblock is PROMOTED to the persistent cache (also a circular buffer), and a dead one is DELETED]
27. Generational Hypothesis
- Generational hypothesis from garbage collection: objects tend to die young
- Unfortunately, garbage collectors know when an object is dead
- A superblock is dead when it will never be executed again (too difficult to determine before the program ends)
- Incorrect guesses don't impact correctness
28. The Probation Cache
[Diagram: a new trace enters the nursery; on FIFO eviction it moves to the probation cache; if the promotion threshold is met it is PROMOTED to the persistent cache, otherwise DELETED. All three caches are circular buffers.]
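The three-level pipeline can be sketched as follows (capacities, the threshold, and the use of execution counts as the liveness guess are illustrative assumptions, not the measured configuration):

```python
# Generational code caches: nursery -> probation -> persistent.
# A superblock leaving probation is promoted only if it was executed
# at least `threshold` times; otherwise it is guessed dead and deleted.

from collections import deque

class GenerationalCache:
    def __init__(self, nursery_cap, probation_cap, threshold):
        self.nursery = deque()       # FIFO (modeled as a deque)
        self.probation = deque()
        self.persistent = []
        self.nursery_cap = nursery_cap
        self.probation_cap = probation_cap
        self.threshold = threshold
        self.exec_count = {}

    def execute(self, sb):
        self.exec_count[sb] = self.exec_count.get(sb, 0) + 1

    def insert(self, sb):
        self.exec_count.setdefault(sb, 0)
        if len(self.nursery) == self.nursery_cap:
            self._demote(self.nursery.popleft())
        self.nursery.append(sb)

    def _demote(self, sb):
        if len(self.probation) == self.probation_cap:
            victim = self.probation.popleft()
            if self.exec_count[victim] >= self.threshold:
                self.persistent.append(victim)   # guessed live: promote
            # else: guessed dead -> deleted (a wrong guess only costs
            # a later re-translation, never correctness)
        self.probation.append(sb)

cache = GenerationalCache(nursery_cap=2, probation_cap=2, threshold=3)
for sb in "ABCD":
    cache.insert(sb)        # A and B age out of the nursery into probation
for _ in range(3):
    cache.execute("A")      # A crosses the promotion threshold
cache.insert("E")           # A leaves probation -> promoted
cache.insert("F")           # B leaves probation unexecuted -> deleted
```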
29. Experimental Results
- To ensure pressure: cacheSize = maxCache / 3
- Local policy fixed at FIFO for all caches
- Base case: one unified FIFO cache
- Generational case: nursery + probation + persistent = unified size
30. Windows Application Miss Rates
[Chart: miss rates while varying the promotion threshold and the nursery, probation, and persistent cache sizes]
31. SPEC2000 Miss Rates
32. Reduction in Runtime Overhead
33. Performance Impact
- Results varied and were highly dependent on the number of misses eliminated
  - Gzip: 2,288 misses eliminated, resulting in a 0.07% reduction in execution time
  - Crafty: 292,486 misses eliminated, resulting in an 8.09% reduction in execution time
34. Implementation Challenges
- Modified the DynamoRIO framework
- Design decisions and challenges
  - Supporting relocatable code
  - Efficient link repair
  - Profiling insertion and removal
  - Hash table lookup
35. What's Next for Code Caching?
- Study truly persistent caches
- Study the interconnectivity of cached superblocks: is there a better way to place superblocks into cache units?
- Heuristics for identifying hot superblocks
  - Why are we caching superblocks again?
- Shared vs. private code caches
- The potential of these systems is not limited to performance!
36. Dynamic Optimization for Low Power
- Completed initial work on a collaborative HW/SW approach for managing the di/dt problem
[Diagram: the executable feeds a dynamic optimizer (SW), which cooperates with voltage-control hardware (HW) to run a scaled program on the microprocessor]
- Protection without performance penalties
37. Why Is a HW/SW Solution Best?
- Compiler-based techniques: difficult to predict power problems before execution
- Hardware-based techniques: cure the symptoms, not the problem
- Combined approach: potential to quickly detect and permanently correct power problems
38. A Code Cache Client API for Pin
- Clean, robust interface for accessing and altering code cache behavior
- Features
  - Low overhead
  - One API → four ISAs
  - Seamless integration with the instrumentation API
- Applications
  - Cache replacement investigations
  - Architectural comparisons
  - Graphical user interfaces
39. Question &amp; Answer
40. Prior Approaches
- Dynamo: Flush (pre-emptive)
- DELI: Flush (manual)
- Strata: Flush (when full)
- DynamoRIO: Unbounded cache
- Pin: Unbounded cache
- ADORE: Unbounded cache
- Mojo: Coarse-grained FIFO
41. DynamoRIO
[Flowchart: Start → branch target address → hash table lookup. Hit → execute in the code cache. Miss → interpret and increment a counter; once the code is hot → region formation and optimization; if there is no room in the code cache, the code cache manager deletes blocks; then insert, update the hash table, and continue.]
42. Experimental Approach
[Diagram: benchmarks run under DynamoRIO, producing a superblock trace (block details, insertions, accesses) that drives a code cache simulator to produce the results]
43. Basic Block &amp; Superblock Caches
- DynamoRIO interprets by copying all basic blocks into a code cache
- Once the basic blocks become hot, superblocks are formed and copied into the superblock cache
- One weakness of a single FIFO cache is that all superblocks are treated equally
[Diagram: basic block cache → (after 50 executions) superblock formation → superblock cache]
44. Generating Overhead Estimates
Each overhead estimate was generated using least-squares linear regression over 30,000 samples.
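The cost models on the earlier slides all have the form y = a·x + b, so the fit reduces to closed-form least squares. A minimal sketch (the sample data below is synthetic, generated from the miss-cost model's coefficients purely to exercise the fit):

```python
# Closed-form least-squares fit of y = a*x + b, the form used for the
# link-removal, eviction, and miss cost models.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Synthetic samples following y = 75.4*x + 1922; a real run would use
# thousands of measured (size, overhead) pairs instead.
xs = [10, 20, 40, 80, 160]
ys = [75.4 * x + 1922 for x in xs]
a, b = fit_linear(xs, ys)
```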
45. Finding a General Configuration
- Previously: a different cache size per benchmark
- Now: a fixed cache size across all benchmarks
- Focusing on SPEC2000 for repeatability
[Diagram: benchmarks running under DynamoRIO]
46. Relocation and Links
- Supporting relocation
  - PC-relative branches → to exit stubs
  - Non-PC-relative branches → to the interpreter
- Link repair
  - Repair incoming links
  - Repair outgoing links
  - Used a data structure
[Diagram: superblock SB1 with conditional branches to exit stubs Exit 1–3; the stubs target application code or another superblock (SB2)]
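A back-pointer table for link repair could be structured as below (an illustrative sketch; real entries would record code-cache branch addresses, and the patching would rewrite machine code). Tracking incoming links per target lets eviction repair every affected branch without scanning the whole cache:

```python
# Back-pointer table: for each superblock, record which branch sites in
# other superblocks jump to it. Evicting a block then patches all its
# incoming links back to the interpreter and drops its outgoing records.

from collections import defaultdict

INTERPRETER = "interpreter"

class LinkTable:
    def __init__(self):
        # target sb -> {(src_sb, branch_site)}
        self.incoming = defaultdict(set)
        # (src_sb, branch_site) -> current branch target
        self.branch_target = {}

    def link(self, src, site, dst):
        self.branch_target[(src, site)] = dst
        self.incoming[dst].add((src, site))

    def evict(self, sb):
        # Repair incoming links: retarget every branch that pointed at sb.
        for src, site in self.incoming.pop(sb, set()):
            self.branch_target[(src, site)] = INTERPRETER
        # Repair outgoing links: drop this block's own branch records.
        for key in [k for k in self.branch_target if k[0] == sb]:
            dst = self.branch_target.pop(key)
            self.incoming[dst].discard(key)

table = LinkTable()
table.link("SB1", 0, "SB2")
table.link("SB3", 1, "SB2")
table.link("SB2", 0, "SB1")
table.evict("SB2")   # SB1 and SB3 now fall back to the interpreter
```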
47. Inserting Profile Counters
- High overhead to insert profiling code into existing superblocks
[Diagram: a profile counter inserted into superblocks in the nursery cache (SB1, SB2, SB6) and probation cache (SB3–SB5)]
48. Wall-Clock Performance
49. Validating Lifetimes
50. The Granularity Trade-Off
- Fine granularity
  - Lower miss rate
- versus
- Coarse granularity
  - Less link maintenance
  - Less aggregate eviction overhead
51. Superblock Promotion Overhead
52. Run-Time Factors
- Difficult to simulate
  - Relocating code impacts instruction-fetch performance
  - Changing cache size affects maximum superblock size
- Also interesting
  - Relocation must be balanced by a lower miss rate
53. GCC's Hit Breakdown
54. As Cache Pressure Increases
Cache Pressure = Unbounded Cache Size / Max Cache Size
55. Combining Miss Rate &amp; Eviction Costs
56. Outbound Links per Superblock
57. Are Links Highly Beneficial?
58. SPEC2000 Lifetimes
59. Code Cache Visualization
60. Code Cache Visualization
61. Code Cache GUI
62. Importance of Cache Pressure
- Replacement policies only affect performance when there is cache pressure
- Helps evaluate how well policies scale
Cache Pressure = Unbounded Cache Size / Max Cache Size
63. How Does Performance Scale?
64. Impact of Understanding Granularity
- In high cache pressure situations, code cache management overhead can dominate
- At a cache pressure of 10, crafty and twolf see a 20% execution-time reduction by changing from Flush to an 8-unit FIFO
- Medium-grained evictions are the most scalable under pressure
65. Traditional Replacement Policies
66. Contributions
- The code cache management problem
- Local cache management
  - Evaluation of traditional replacement algorithms [Interact'02]
  - Superblock eviction granularity [CGO'04]
- Global cache management
  - Generational code caches [MICRO'03]
  - Persistent caches [Traces'04]
- Implementation issues and challenges
- Future research ideas
67. Conclusions
- Effective code caching is crucial to a robust dynamic binary optimizer
- Medium-grained evictions provide a robust solution across application sizes
- Replacing a single cache with multiple, generational code caches provides notable run-time benefits for large applications