Title: Managing Bounded Code Caches in Dynamic Binary Optimizers
1. Managing Bounded Code Caches in Dynamic Binary Optimizers
- Kim Hazelwood
- Intel Corporation
2. Dynamic Binary Optimizers
- Dynamic optimizers: DynamoRIO, Pin, ...
  - x86 → instrumented x86
  - x86 → optimized x86
  - x86 → secure x86
- Dynamic translators: DAISY, CMS, ...
  - x86 → Itanium
- Just-in-time compilers: Jikes RVM, ...
  - Java → x86
- All create a modified code image at run time
- Good performance is crucial
3. Dynamic Binary Optimizers
- Run-Time Overheads
  - Observing execution
  - Transforming code
  - Caching code
[Diagram: EXE → Profile → Transform → Code Cache → Execute loop]
For good performance, the vast majority of code should execute in the code cache. Goal: maintain the working set in the cache.
4. Talk Outline
- The code cache management problem
  - Motivation
  - Complications
- Two solutions
  - Eviction granularities
  - Generational caches
- Collaborative HW/SW systems
5. Code Cache Design Space
- Inter-execution: What should be stored between executions of an application?
- Intra-execution: What should reside in the cache at a given time?
  - Unified cache vs. partitioned cache
  - Local policies vs. local + global policies
- Eviction granularity [CGO'04]
- Generational caches [MICRO'03]
6. Code Cache Management
- Capacity
  - Remove start-up, temporal code
- Consistency
  - Self-modifying code
  - Unmapped memory
  - Ephemeral optimizations
7. Bounding Code Caches
- For SPEC2000: not necessary
8. Interactive Windows Applications
- Unbounded caches become impractical
9. As a General Rule (in DynamoRIO)
Code Expansion = Final Code Cache Size / Application Code Executed
10. Haven't We Solved This Problem?
- Several unique challenges for code caches
  - Code caches store superblocks
  - Variable size → fragmentation
  - Tail duplication → cache expansion
  - Linked superblocks → consistency issues
  - No backing store → high miss penalty
11. Systems Typically Cache Superblocks
[Diagram: control-flow graph → superblock with exit stubs → linked superblocks]
- Superblocks: single-entry, multiple-exit code regions
12. Superblocks Vary in Size
- Cannot use pure page-based techniques
13. Fragmentation Problem
- Evictions must free contiguous memory
- Defragmentation is too expensive at run time
[Diagram: blocks A–I in the cache; the LRU, second-LRU, and third-LRU blocks are scattered, so evicting them frees non-contiguous holes]
14. Superblocks Link to Other Superblocks
- Cache evictions require link removal
- Back-pointer tables are necessary
[Diagram: superblocks 1–7 linked to one another and to the interpreter]
15. Cache Misses Are Expensive
- No backing store!
- On a code cache miss, the system must:
  - Save processor state
  - Re-optimize/re-translate the code region
  - Insert (and evict) code blocks
  - Update the hash table
  - Update links
  - Restore the processor state
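The miss path above can be modeled with a small sketch (all names and structures here are illustrative, not DynamoRIO's actual API; state save/restore and code emission are elided):

```python
# Minimal model of the code-cache miss path. A miss must make room,
# insert the new block, and keep the lookup table coherent.

class CodeCache:
    def __init__(self, capacity):
        self.capacity = capacity      # bytes available
        self.blocks = []              # (entry_pc, size) in FIFO order
        self.used = 0

    def has_room(self, size):
        return self.used + size <= self.capacity

    def evict_oldest(self):
        pc, size = self.blocks.pop(0)
        self.used -= size
        return pc

    def insert(self, pc, size):
        self.blocks.append((pc, size))
        self.used += size

def handle_miss(pc, size, cache, hash_table):
    """On a miss: (re)translate, make room, insert, update the table."""
    # 1. save processor state; 2. re-translate `pc` (elided in this model)
    while not cache.has_room(size):
        evicted_pc = cache.evict_oldest()   # 3. evict to make room
        del hash_table[evicted_pc]          # 4. keep hash table coherent
        # 5. incoming links to evicted_pc would be unlinked here
    cache.insert(pc, size)
    hash_table[pc] = pc                     # maps app PC -> cache address
    # 6. restore processor state and resume inside the cache

cache, table = CodeCache(100), {}
for pc, size in [(0x10, 60), (0x20, 30), (0x30, 40)]:
    handle_miss(pc, size, cache, table)
```

In this run, inserting the third block forces the first one out, illustrating why every miss can trigger the full evict/unlink/update sequence.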
16. FIFO Replacement
- Simple implementation using a circular buffer
- All evictions are contiguous
- Solves the fragmentation problem
- Only slightly higher miss rate than LRU
[Diagram: circular buffer holding blocks A–H; insertion and eviction both advance around the buffer]
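A circular-buffer FIFO can be sketched as follows (an illustrative model, not DynamoRIO's layout): the insertion pointer sweeps around the buffer, and whatever it overruns is evicted, so freed space is always contiguous and never needs compaction.

```python
# Circular-buffer FIFO code cache: eviction frees exactly the contiguous
# region the new block is about to occupy, so fragmentation never arises.

class CircularCodeCache:
    def __init__(self, size):
        self.size = size
        self.head = 0                  # next free offset
        self.blocks = []               # (offset, length), insertion order

    def insert(self, length):
        assert length <= self.size
        if self.head + length > self.size:
            self.head = 0              # wrap; no block straddles the end
        start, end = self.head, self.head + length
        # Evict every block overlapping the contiguous region [start, end).
        evicted = [b for b in self.blocks
                   if b[0] < end and b[0] + b[1] > start]
        self.blocks = [b for b in self.blocks if b not in evicted]
        self.blocks.append((start, length))
        self.head = end
        return start, evicted

cache = CircularCodeCache(10)
first = cache.insert(4)    # placed at offset 0, nothing evicted
second = cache.insert(4)   # placed at offset 4, nothing evicted
third = cache.insert(4)    # wraps to offset 0, evicting the oldest block
```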
17. Code Cache Eviction Granularities
[Spectrum: coarse-grained evictions (Flush) ↔ fine-grained evictions (FIFO)]
18. Medium-Grained Evictions
- Cache units are evicted in FIFO order
- All superblocks in a unit are flushed
- Balances miss rate and eviction overheads
- Reduces link maintenance
[Diagram: code cache divided into units 1…N, each holding superblocks]
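The unit-level flush can be sketched as below (hypothetical names; in a real system the link set would hold code-cache branch addresses). Intra-unit links die with the unit, so only cross-unit edges need patching:

```python
# Medium-grained eviction: flush the oldest cache unit wholesale and
# remove only the links that cross unit boundaries.

from collections import deque

def flush_oldest_unit(units, inter_unit_links):
    """units: deque of lists of superblock ids, oldest unit first.
    inter_unit_links: set of (src_sb, dst_sb) pairs that cross units."""
    victim = units.popleft()
    dead = set(victim)
    # Patch only edges entering or leaving the flushed unit.
    removed = {(s, d) for (s, d) in inter_unit_links
               if s in dead or d in dead}
    inter_unit_links -= removed
    return victim, removed

units = deque([["A", "B"], ["C", "D"]])
links = {("C", "A"), ("B", "D")}       # cross-unit edges only
victim, removed = flush_oldest_unit(units, links)
```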
19. Link Maintenance
- Only inter-unit links require removal
[Diagram: intra-unit links die with their unit; inter-unit links must be patched on eviction]
20. Inter-Unit Links
Link Removal Cost = 296 × (NumLinks) + 96
[Chart: link-removal overhead, coarse vs. fine granularity]
21. Eviction Interruptions
Eviction Cost = 2.8 × (SBSize) + 3055
[Chart: eviction overhead, coarse vs. fine granularity]
22. Miss Rate Comparison
Miss Cost = 75.4 × (SBSize) + 1922
[Chart: miss rates, coarse vs. fine granularity]
23. Relative Overhead of Granularities
Includes miss costs + eviction costs + link costs
[Chart: total relative overhead, coarse vs. fine granularity]
24. Roadmap
- Local cache management: eviction policy for a single code cache
  - LRU, FIFO, Flush, ...
  - Eviction granularity
- → Global cache management: policy of interaction between multiple code caches
  - Generational code caches
25. Superblock Lifetimes
Lifetime = (LastExecutionTime − FirstExecutionTime) / TotalExecutionTime
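The metric above normalizes a superblock's active span to the whole run: values near 0 indicate code that died young (nursery material), values near 1 indicate code live throughout (persistent-cache material). A direct transcription (example timestamps are made up):

```python
def lifetime(first_exec, last_exec, total_time):
    """Normalized superblock lifetime in [0, 1]."""
    return (last_exec - first_exec) / total_time

startup = lifetime(0, 5, 100)    # start-up code: short-lived
hot = lifetime(10, 95, 100)      # hot code: long-lived
```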
26. Generational Code Caches
[Diagram: a new superblock enters the nursery (a circular buffer); on FIFO eviction, a live superblock is PROMOTED to the persistent cache (also a circular buffer), and a dead one is DELETED]
27. Generational Hypothesis
- Generational hypothesis from garbage collection: objects tend to die young
- Unfortunately, garbage collectors know when an object is dead
- A superblock is dead when it will never be executed again (too difficult to determine before the program ends)
- Incorrect guesses don't impact correctness
28. The Probation Cache
[Diagram: a new trace enters the nursery; on FIFO eviction it moves to the probation cache; if the promotion threshold is met it is PROMOTED to the persistent cache, otherwise DELETED. All three caches are circular buffers.]
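The three-level pipeline can be sketched as follows (capacities, the threshold, and the use of execution counts as the liveness guess are illustrative assumptions, not the measured configuration):

```python
# Generational code caches: nursery -> probation -> persistent.
# A superblock leaving probation is promoted only if it was executed
# at least `threshold` times; otherwise it is guessed dead and deleted.

from collections import deque

class GenerationalCache:
    def __init__(self, nursery_cap, probation_cap, threshold):
        self.nursery = deque()       # FIFO (modeled as a deque)
        self.probation = deque()
        self.persistent = []
        self.nursery_cap = nursery_cap
        self.probation_cap = probation_cap
        self.threshold = threshold
        self.exec_count = {}

    def execute(self, sb):
        self.exec_count[sb] = self.exec_count.get(sb, 0) + 1

    def insert(self, sb):
        self.exec_count.setdefault(sb, 0)
        if len(self.nursery) == self.nursery_cap:
            self._demote(self.nursery.popleft())
        self.nursery.append(sb)

    def _demote(self, sb):
        if len(self.probation) == self.probation_cap:
            victim = self.probation.popleft()
            if self.exec_count[victim] >= self.threshold:
                self.persistent.append(victim)   # guessed live: promote
            # else: guessed dead -> deleted (a wrong guess only costs
            # a later re-translation, never correctness)
        self.probation.append(sb)

cache = GenerationalCache(nursery_cap=2, probation_cap=2, threshold=3)
for sb in "ABCD":
    cache.insert(sb)        # A and B age out of the nursery into probation
for _ in range(3):
    cache.execute("A")      # A crosses the promotion threshold
cache.insert("E")           # A leaves probation -> promoted
cache.insert("F")           # B leaves probation unexecuted -> deleted
```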
29. Experimental Results
- To ensure pressure: cacheSize = maxCache / 3
- Local policy fixed at FIFO for all caches
- Base case: one unified FIFO cache
- Generational case: nursery + probation + persistent = unified size
30. Windows Application Miss Rates
[Chart: miss rates while varying the promotion threshold and the nursery, probation, and persistent cache sizes]
31. SPEC2000 Miss Rates
32. Reduction in Runtime Overhead
33. Performance Impact
- Results varied and were highly dependent on the number of misses eliminated
  - Gzip: 2,288 misses eliminated, resulting in a 0.07% reduction in execution time
  - Crafty: 292,486 misses eliminated, resulting in an 8.09% reduction in execution time
34. Implementation Challenges
- Modified the DynamoRIO framework
- Design decisions and challenges
  - Supporting relocatable code
  - Efficient link repair
  - Profiling insertion and removal
  - Hash table lookup
35. What's Next for Code Caching?
- Study truly persistent caches
- Study the interconnectivity of cached superblocks: is there a better way to place superblocks into cache units?
- Heuristics for identifying hot superblocks
  - Why are we caching superblocks again?
- Shared vs. private code caches
- The potential of these systems is not limited to performance!
36. Dynamic Optimization for Low Power
- Completed initial work on a collaborative HW/SW approach for managing the di/dt problem
[Diagram: the executable feeds a dynamic optimizer (SW), which cooperates with voltage-control hardware (HW) to run a scaled program on the microprocessor]
- Protection without performance penalties
37. Why Is a HW/SW Solution Best?
- Compiler-based techniques: difficult to predict power problems before execution
- Hardware-based techniques: cure the symptoms, not the problem
- Combined approach: potential to quickly detect and permanently correct power problems
38. A Code Cache Client API for Pin
- Clean, robust interface for accessing and altering code cache behavior
- Features
  - Low overhead
  - One API → four ISAs
  - Seamless integration with the instrumentation API
- Applications
  - Cache replacement investigations
  - Architectural comparisons
  - Graphical user interfaces
39. Question &amp; Answer
40. Prior Approaches
- Dynamo: Flush (pre-emptive)
- DELI: Flush (manual)
- Strata: Flush (when full)
- DynamoRIO: Unbounded cache
- Pin: Unbounded cache
- ADORE: Unbounded cache
- Mojo: Coarse-grained FIFO
41. DynamoRIO
[Flowchart: Start → branch target address → hash table lookup. Hit → execute in the code cache. Miss → interpret and increment a counter; once the code is hot → region formation and optimization; if there is no room in the code cache, the code cache manager deletes blocks; then insert, update the hash table, and continue.]
42. Experimental Approach
[Diagram: benchmarks run under DynamoRIO, producing a superblock trace (block details, insertions, accesses) that drives a code cache simulator to produce the results]
43. Basic Block &amp; Superblock Caches
- DynamoRIO interprets by copying all basic blocks into a code cache
- Once the basic blocks become hot, superblocks are formed and copied into the superblock cache
- One weakness of a single FIFO cache is that all superblocks are treated equally
[Diagram: basic block cache → (after 50 executions) superblock formation → superblock cache]
44. Generating Overhead Estimates
Each overhead estimate was generated using least-squares linear regression over 30,000 samples.
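The cost models on the earlier slides all have the form y = a·x + b, so the fit reduces to closed-form least squares. A minimal sketch (the sample data below is synthetic, generated from the miss-cost model's coefficients purely to exercise the fit):

```python
# Closed-form least-squares fit of y = a*x + b, the form used for the
# link-removal, eviction, and miss cost models.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Synthetic samples following y = 75.4*x + 1922; a real run would use
# thousands of measured (size, overhead) pairs instead.
xs = [10, 20, 40, 80, 160]
ys = [75.4 * x + 1922 for x in xs]
a, b = fit_linear(xs, ys)
```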
45. Finding a General Configuration
- Previously: a different cache size per benchmark
- Now: a fixed cache size across all benchmarks
- Focusing on SPEC2000 for repeatability
[Diagram: benchmarks running under DynamoRIO]
46. Relocation and Links
- Supporting relocation
  - PC-relative branches → to exit stubs
  - Non-PC-relative branches → to the interpreter
- Link repair
  - Repair incoming links
  - Repair outgoing links
  - Used a data structure
[Diagram: superblock SB1 with conditional branches to exit stubs Exit 1–3; the stubs target application code or another superblock (SB2)]
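A back-pointer table for link repair could be structured as below (an illustrative sketch; real entries would record code-cache branch addresses, and the patching would rewrite machine code). Tracking incoming links per target lets eviction repair every affected branch without scanning the whole cache:

```python
# Back-pointer table: for each superblock, record which branch sites in
# other superblocks jump to it. Evicting a block then patches all its
# incoming links back to the interpreter and drops its outgoing records.

from collections import defaultdict

INTERPRETER = "interpreter"

class LinkTable:
    def __init__(self):
        # target sb -> {(src_sb, branch_site)}
        self.incoming = defaultdict(set)
        # (src_sb, branch_site) -> current branch target
        self.branch_target = {}

    def link(self, src, site, dst):
        self.branch_target[(src, site)] = dst
        self.incoming[dst].add((src, site))

    def evict(self, sb):
        # Repair incoming links: retarget every branch that pointed at sb.
        for src, site in self.incoming.pop(sb, set()):
            self.branch_target[(src, site)] = INTERPRETER
        # Repair outgoing links: drop this block's own branch records.
        for key in [k for k in self.branch_target if k[0] == sb]:
            dst = self.branch_target.pop(key)
            self.incoming[dst].discard(key)

table = LinkTable()
table.link("SB1", 0, "SB2")
table.link("SB3", 1, "SB2")
table.link("SB2", 0, "SB1")
table.evict("SB2")   # SB1 and SB3 now fall back to the interpreter
```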
47. Inserting Profile Counters
- High overhead to insert profiling code into existing superblocks
[Diagram: a profile counter inserted into superblocks in the nursery cache (SB1, SB2, SB6) and probation cache (SB3–SB5)]
48. Wall-Clock Performance
49. Validating Lifetimes
50. The Granularity Trade-Off
- Fine granularity
  - Lower miss rate
- versus
- Coarse granularity
  - Less link maintenance
  - Less aggregate eviction overhead
51. Superblock Promotion Overhead
52. Run-Time Factors
- Difficult to simulate
  - Relocating code impacts instruction-fetch performance
  - Changing cache size affects maximum superblock size
- Also interesting
  - Relocation must be balanced by a lower miss rate
53. GCC's Hit Breakdown
54. As Cache Pressure Increases
Cache Pressure = Unbounded Cache Size / Max Cache Size
55. Combining Miss Rate &amp; Eviction Costs
56. Outbound Links per Superblock
57. Are Links Highly Beneficial?
58. SPEC2000 Lifetimes
59. Code Cache Visualization
60. Code Cache Visualization
61. Code Cache GUI
62. Importance of Cache Pressure
- Replacement policies only affect performance when there is cache pressure
- Helps evaluate how well policies scale
Cache Pressure = Unbounded Cache Size / Max Cache Size
63. How Does Performance Scale?
64. Impact of Understanding Granularity
- In high cache pressure situations, code cache management overhead can dominate
- At a cache pressure of 10, crafty and twolf see a 20% execution-time reduction by changing from Flush to an 8-unit FIFO
- Medium-grained evictions are the most scalable under pressure
65. Traditional Replacement Policies
66. Contributions
- The code cache management problem
- Local cache management
  - Evaluation of traditional replacement algorithms [Interact'02]
  - Superblock eviction granularity [CGO'04]
- Global cache management
  - Generational code caches [MICRO'03]
  - Persistent caches [Traces'04]
- Implementation issues and challenges
- Future research ideas
67. Conclusions
- Effective code caching is crucial to a robust dynamic binary optimizer
- Medium-grained evictions provide a robust solution across application sizes
- Replacing a single cache with multiple, generational code caches provides notable run-time benefits for large applications