Managing Bounded Code Caches in Dynamic Binary Optimizers - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Managing Bounded Code Caches in Dynamic
Binary Optimizers
  • Kim Hazelwood
  • Intel Corporation

2
Dynamic Binary Optimizers
  • Dynamic optimizers: DynamoRIO, Pin, ...
  • x86 → instrumented x86
  • x86 → optimized x86
  • x86 → secure x86
  • Dynamic translators: DAISY, CMS, ...
  • x86 → Itanium
  • Just-in-time compilers: Jikes RVM, ...
  • Java → x86
  • All create a modified code image at run time
  • Good performance is crucial

3
Dynamic Binary Optimizers
  • Run-Time Overheads
  • Observing execution
  • Transforming code
  • Caching code

[Diagram: runtime loop over the EXE: Profile, Transform, Code Cache, Execute]
For good performance, the vast majority of code should execute in the code cache. Goal: maintain the working set in the cache.
4
Talk Outline
  • The code cache management problem
  • Motivation
  • Complications
  • Two solutions
  • Eviction granularities
  • Generational caches
  • Collaborative HW/SW systems

5
Code Cache Design Space
  • Inter-execution: What should be stored between executions of an application?
  • Intra-execution: What should reside in the cache at a given time?
  • Unified cache vs. partitioned cache
  • Local policies vs. local + global policies

[CGO'04] Eviction granularity
[MICRO'03] Generational caches
6
Code Cache Management
  • Capacity
  • Remove start-up, temporal code
  • Consistency
  • Self-modifying code
  • Unmapped memory
  • Ephemeral optimizations

7
Bounding Code Caches
  • For SPEC2000: Not necessary

8
Interactive Windows Applications
  • Unbounded caches become impractical

9
As a General Rule (in DynamoRIO)
Code Expansion = Final Code Cache Size / Application Code Executed
10
Haven't We Solved This Problem?
  • Several unique challenges for code caches
  • Code caches store superblocks
  • Variable size → fragmentation
  • Tail duplication → cache expansion
  • Linked superblocks → consistency issues
  • No backing store → high miss penalty

11
Systems Typically Cache Superblocks
[Diagram: control-flow graph, superblock with exit stubs, linked superblocks]
  • Superblocks: single-entry, multiple-exit code regions (a descriptor sketch follows)
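In code-cache systems, each cached superblock is typically tracked by a small descriptor. A minimal sketch of such a descriptor in C, with hypothetical field names (the actual DynamoRIO structures differ):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical superblock descriptor: one per cached superblock. */
    typedef struct superblock {
        uintptr_t src_addr;            /* application address of the entry point  */
        uint8_t  *cache_addr;          /* location of the translated code         */
        size_t    size;                /* bytes occupied in the code cache        */
        struct superblock **in_links;  /* back-pointers: superblocks that jump in */
        size_t    num_in_links;        /* number of incoming direct links         */
    } superblock_t;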

12
Superblocks Vary in Size
Cannot use pure page-based techniques
13
Fragmentation Problem
  • Evictions must free contiguous memory
  • Defragmentation too expensive at run time

[Diagram: code cache holding blocks A through I; the least-recently-used blocks (LRU, LRU+1, LRU+2) are scattered rather than contiguous]
14
Superblocks Link to Other Superblocks
  • Cache evictions require link removal
  • Back-pointer tables are necessary (an unlinking sketch follows the diagram)

[Diagram: superblocks 1 through 7 linked to one another and to the interpreter]
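Before a superblock can be evicted, every incoming direct branch recorded in its back-pointer table must be rewritten so it no longer targets the freed memory. A minimal sketch, reusing the hypothetical descriptor above; redirect_branch() stands in for the system's actual link-repair routine:

    /* Hypothetical link-repair routine: rewrites the branch in 'src' that
     * currently targets 'old_target' so it returns to the dispatcher. */
    void redirect_branch(superblock_t *src, superblock_t *old_target);

    /* Unlink all incoming branches before evicting 'victim'. */
    void unlink_incoming(superblock_t *victim) {
        for (size_t i = 0; i < victim->num_in_links; i++)
            redirect_branch(victim->in_links[i], victim);
        victim->num_in_links = 0;
    }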
15
Cache Misses are Expensive
  • No backing store!
  • On a code cache miss, the system must (sketched after this list):
  • Save processor state
  • Re-optimize/re-translate code region
  • Insert (and evict) code blocks
  • Update hash table
  • Update links
  • Restore the processor state
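A sketch of that miss path as a single routine, with hypothetical helper names; the real dispatcher is considerably more involved:

    /* Hypothetical helpers provided by the rest of the system. */
    void  save_processor_state(void);
    void  restore_processor_state(void);
    void *translate_region(void *app_target);           /* re-optimize/re-translate */
    void *insert_into_cache(void *region);              /* may evict older code     */
    void  hash_table_add(void *app_target, void *slot);
    void  patch_links(void *app_target, void *slot);    /* link callers directly    */

    /* Sketch of the code-cache miss path. */
    void *handle_cache_miss(void *app_target) {
        save_processor_state();
        void *region = translate_region(app_target);
        void *slot   = insert_into_cache(region);
        hash_table_add(app_target, slot);
        patch_links(app_target, slot);
        restore_processor_state();
        return slot;                                     /* resume in the code cache */
    }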

16
FIFO Replacement
  • Simple implementation using a circular buffer (sketched below)
  • All evictions are contiguous
  • Solves the fragmentation problem
  • Only slightly higher miss rate than LRU

[Diagram: circular buffer containing blocks A through H in insertion order]
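A minimal sketch of FIFO insertion into a fixed-size circular buffer, with hypothetical names and sizes; the point is that every eviction frees a contiguous region, so fragmentation cannot occur (unlinking of the overwritten blocks is omitted):

    #include <stddef.h>
    #include <string.h>

    #define CACHE_SIZE (4u * 1024 * 1024)     /* hypothetical cache bound */

    static unsigned char cache[CACHE_SIZE];
    static size_t        head;                /* next insertion offset, wraps to 0 */

    /* Copy a translated superblock into the cache in FIFO order. */
    unsigned char *fifo_insert(const unsigned char *code, size_t size) {
        if (head + size > CACHE_SIZE)
            head = 0;                         /* wrap: evict contiguous old code */
        unsigned char *slot = &cache[head];
        memcpy(slot, code, size);
        head += size;
        return slot;
    }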
17
Code Cache Eviction Granularities
  • Can range from coarse-grained evictions (e.g., Flush) to fine-grained evictions (e.g., FIFO)

18
Medium-Grained Evictions
  • Cache units are evicted in FIFO order (sketched after the diagram)
  • All superblocks in a unit are flushed
  • Balances miss rate and eviction overheads
  • Reduces link maintenance

[Diagram: code cache divided into units 1 through N, each unit holding multiple superblocks]
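A sketch of the unit-level bookkeeping under hypothetical names: the cache is split into fixed-size units that fill one at a time, and when space runs out the oldest unit is flushed wholesale.

    #include <stddef.h>

    #define NUM_UNITS 8                 /* hypothetical configuration */
    #define UNIT_SIZE (512u * 1024)

    typedef struct {
        unsigned char code[UNIT_SIZE];
        size_t        used;
    } cache_unit_t;

    static cache_unit_t units[NUM_UNITS];
    static int          current;        /* unit receiving new superblocks */

    /* Reserve space for a superblock; flush the oldest unit when needed. */
    unsigned char *unit_alloc(size_t size) {
        if (units[current].used + size > UNIT_SIZE) {
            current = (current + 1) % NUM_UNITS;  /* advance in FIFO order         */
            units[current].used = 0;              /* flush that unit's superblocks */
            /* inter-unit links into the flushed unit must be removed here */
        }
        unsigned char *slot = &units[current].code[units[current].used];
        units[current].used += size;
        return slot;
    }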
19
Link Maintenance
  • Only inter-unit links require removal

[Diagram: inter-unit links vs. intra-unit links between cached superblocks]
20
Inter-Unit Links
Link Removal Cost = 296 × (NumLinks) + 96
[Chart: inter-unit links per configuration, from coarse-grained to fine-grained]
21
Eviction Interruptions
Eviction Cost = 2.8 × (SBSize) + 3055
[Chart: eviction interruptions per configuration, from coarse-grained to fine-grained]
22
Miss Rate Comparison
Miss Cost = 75.4 × (SBSize) + 1922
[Chart: miss rates per configuration, from coarse-grained to fine-grained]
23
Relative Overhead of Granularities
Includes miss costs + eviction costs + link costs
[Chart: relative overhead per configuration, from coarse-grained to fine-grained]
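Given the three regression fits above, the relative overhead of a configuration can be estimated by weighting each fit by the event counts reported by the cache simulator. A minimal sketch; the parameter names are assumptions, the coefficients come from the preceding slides:

    /* Cost models from the least-squares fits on the preceding slides. */
    static double link_removal_cost(double num_links) { return 296.0 * num_links + 96.0;   }
    static double eviction_cost(double sb_size)       { return 2.8   * sb_size   + 3055.0; }
    static double miss_cost(double sb_size)           { return 75.4  * sb_size   + 1922.0; }

    /* Total estimated overhead = miss costs + eviction costs + link costs. */
    double total_overhead(long misses, long evictions, long link_removals,
                          double avg_sb_size, double avg_links_per_removal) {
        return misses        * miss_cost(avg_sb_size)
             + evictions     * eviction_cost(avg_sb_size)
             + link_removals * link_removal_cost(avg_links_per_removal);
    }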
24
Roadmap
  • Local cache management: eviction policy for a single code cache
  • LRU, FIFO, Flush, ...
  • Eviction granularity
  • → Global cache management: policy of interaction between multiple code caches
  • Generational code caches

25
Superblock Lifetimes
Lifetime = (LastExecutionTime - FirstExecutionTime) / TotalExecutionTime
26
Generational Code Caches
[Diagram: new superblocks enter the nursery (a circular buffer with FIFO eviction); on eviction, live superblocks are promoted to the persistent cache (also a circular buffer) and dead ones are deleted]
27
Generational Hypothesis
  • Generational hypothesis from garbage collection: objects tend to die young
  • Unfortunately, garbage collectors know when an object is dead
  • A superblock is dead when it will never be executed again (too difficult to determine before the program ends)
  • Incorrect guesses don't impact correctness

28
The Probation Cache
[Diagram: new traces enter the nursery; on FIFO eviction, superblocks that meet the threshold are promoted (nursery → probation → persistent, each a circular buffer) and those that do not are deleted]
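A sketch of the promotion decision under hypothetical names: each cached superblock carries an execution counter, and when the FIFO pointer of its current cache reaches it, the counter decides between promotion to the next generation and deletion.

    typedef enum { NURSERY, PROBATION, PERSISTENT } generation_t;

    typedef struct {
        generation_t gen;          /* cache the superblock currently lives in       */
        unsigned     exec_count;   /* executions observed since entering that cache */
    } sb_profile_t;

    #define PROMOTION_THRESHOLD 1  /* hypothetical: "executed at least once"        */

    /* Called when the FIFO pointer of the nursery/probation cache reaches 'sb'.
     * Returns the cache the superblock should live in next. */
    generation_t on_fifo_reached(sb_profile_t *sb) {
        if (sb->exec_count >= PROMOTION_THRESHOLD && sb->gen != PERSISTENT) {
            sb->gen = (sb->gen == NURSERY) ? PROBATION : PERSISTENT;  /* promote */
            sb->exec_count = 0;    /* restart profiling in the new generation    */
        }
        /* otherwise: presumed dead, and simply overwritten by the FIFO pointer  */
        return sb->gen;
    }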
29
Experimental Results
  • Ensure cache pressure: cacheSize = maxCache / 3
  • Local policy fixed at FIFO for all caches
  • Base case: one unified FIFO cache
  • Generational case: nursery + probation + persistent = unified size

30
Windows Application Miss Rates
[Chart: miss rates as a function of the promotion threshold and the nursery, probation, and persistent cache sizes]
31
SPEC2000 Miss Rates
32
Reduction in Runtime Overhead
33
Performance Impact
  • Results varied and were highly dependent on the number of misses eliminated
  • Gzip: 2,288 misses eliminated, resulting in a 0.07% reduction in execution time
  • Crafty: 292,486 misses eliminated, resulting in an 8.09% reduction in execution time

34
Implementation Challenges
  • Modified the DynamoRIO framework
  • Design Decisions and Challenges
  • Supporting relocatable code
  • Efficient link repair
  • Profiling insertion and removal
  • Hash table lookup

35
What's Next for Code Caching?
  • Study truly persistent caches
  • Study the interconnectivity of cached superblocks: is there a better way to place superblocks into cache units?
  • Heuristics for identifying hot superblocks
  • Why are we caching superblocks again?
  • Shared vs. private code caches
  • The potential of these systems is not limited to
    performance!

36
Dynamic Optimization for Low Power
  • Completed initial work on a collaborative HW/SW approach for managing the di/dt problem

[Diagram: the executable feeds the dynamic optimizer (SW), which produces a scaled program running on the microprocessor with voltage control HW]
Protection without performance penalties
37
Why is a HW/SW Solution Best?
  • Compiler-based techniques: difficult to predict power problems before execution
  • Hardware-based techniques: cure the symptoms, not the problem
  • Combined approach: potential to quickly detect and permanently correct power problems

38
A Code Cache Client API for Pin
  • Clean, robust interface for accessing and
    altering code cache behavior
  • Features
  • Low overhead
  • One API → four ISAs
  • Seamless integration with instrumentation API
  • Applications
  • Cache replacement investigations
  • Architectural comparisons
  • Graphical user interfaces

39
Question & Answer
40
Prior Approaches
  • Dynamo: Flush (pre-emptive)
  • DELI: Flush (manual)
  • Strata: Flush (when full)
  • DynamoRIO: Unbounded cache
  • Pin: Unbounded cache
  • ADORE: Unbounded cache
  • Mojo: Coarse-grained FIFO

41
DynamoRIO
[Flowchart: starting from a branch target address, a hash table lookup either hits (jump into the code cache) or misses (interpret and increment a counter); when the code becomes hot, region formation/optimization runs and the code cache manager inserts the block, deleting code if there is no room, then updates the hash table]
42
Experimental Approach
[Diagram: benchmarks run under DynamoRIO to produce a superblock trace (block details, insertions, accesses), which drives a code cache simulator to produce the results]
43
Basic Block Superblock Caches
  • DynamoRIO interprets by copying all basic
    blocks into a code cache
  • Once the basic blocks become hot, superblocks are formed and copied into the superblock cache (sketched after the diagram)
  • One weakness of a single FIFO cache is that all
    superblocks are treated equally

[Diagram: superblock formation moves hot code from the basic block cache into the superblock cache after 50 executions]
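A sketch of the hot-code test that triggers superblock formation; the 50-execution threshold is from the slide, the helper names are hypothetical:

    #define HOT_THRESHOLD 50   /* executions before superblock formation */

    /* Hypothetical hook into the region formation / optimization machinery. */
    void form_superblock(void *bb_entry);

    /* Counter stub run each time a cached basic block is entered. */
    void on_basic_block_entry(void *bb_entry, unsigned *exec_counter) {
        if (++(*exec_counter) == HOT_THRESHOLD)
            form_superblock(bb_entry);   /* build and cache the superblock */
    }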
44
Generating Overhead Estimates
Each overhead estimate was generated using least-squares linear regression over 30,000 samples.
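For reference, the fits above are ordinary least squares on (x, y) samples, where x is NumLinks or SBSize and y is the measured cost. A minimal closed-form sketch:

    /* Ordinary least-squares fit of y = a*x + b over n samples. */
    void linear_fit(const double *x, const double *y, int n, double *a, double *b) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];        sy  += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *b = (sy - *a * sx) / n;
    }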
45
Finding a General Configuration
Previously: a different cache size per benchmark. Now: a fixed cache size across all benchmarks, focusing on SPEC 2000 for repeatability.
[Diagram: benchmarks running under DynamoRIO]
46
Relocation and Links
  • Supporting relocation (a patching sketch follows the diagram)
  • PC-rel branches to exit stubs
  • Non-PC-rel branches to interpreter
  • Link repair
  • Repair incoming links
  • Repair outgoing links
  • Used a data structure

[Diagram: superblock SB1 whose conditional branches target either another superblock (SB2) or its exit stubs (Exit 1, Exit 2, Exit 3), alongside the original application code]
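Relocating a superblock invalidates any PC-relative displacements it contains, so each one must be recomputed against the instruction's new address. A minimal sketch for an x86 near branch whose 32-bit displacement occupies the last four bytes of the instruction; names are hypothetical:

    #include <stdint.h>
    #include <string.h>

    /* Recompute a rel32 displacement after moving a branch instruction.
     * 'branch' points at the relocated instruction, 'insn_len' is its length in
     * bytes, and 'target' is the absolute address the branch must still reach. */
    void patch_rel32(uint8_t *branch, size_t insn_len, uint8_t *target) {
        int32_t disp = (int32_t)(target - (branch + insn_len));
        memcpy(branch + insn_len - 4, &disp, sizeof disp);   /* rel32 field */
    }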
47
Inserting Profile Counters
  • High overhead to insert profiling code into
    existing superblocks

[Diagram: superblocks SB1 through SB6 split between the nursery cache and the probation cache, with an inserted profile counter i]
48
Wall-Clock Performance
49
Validating Lifetimes
50
The Granularity Trade-Off
  • Fine granularity
  • lower miss rate
  • versus
  • Coarse granularity
  • less link maintenance
  • less aggregate eviction overhead

51
Superblock Promotion Overhead
52
Run-Time Factors
  • Difficult to simulate
  • Relocating code impacts instruction fetch
    performance
  • Changing cache size affects maximum superblock
    size
  • Also interesting
  • Relocation must be balanced by a lower miss rate

53
GCC's Hit Breakdown
54
As Cache Pressure Increases
Cache Size = UnboundedMax / Cache Pressure
55
Combining Miss Rate Eviction Costs
56
Outbound Links per Superblock
57
Are Links Highly Beneficial?
58
SPEC2000 Lifetimes
59
Code Cache Visualization
60
Code Cache Visualization
61
Code Cache GUI
62
Importance of Cache Pressure
  • Replacement policies only affect performance when
    there is cache pressure
  • Helps evaluate how well policies scale

Cache Pressure = UnboundedMax / Cache Size
63
How Does Performance Scale?
64
Impact of Understanding Granularity
  • In high cache pressure situations, code cache
    management overhead can dominate
  • At pressure = 10, crafty and twolf see a 20% reduction in execution time by changing from FLUSH to an 8-unit FIFO
  • Medium-grained evictions are most scalable under
    pressure

65
Traditional Replacement Policies
66
Contributions
  • The code cache management problem
  • Local cache management
  • Evaluation of traditional replacement algorithms [Interact'02]
  • Superblock eviction granularity [CGO'04]
  • Global cache management
  • Generational code caches [MICRO'03]
  • Persistent caches [Traces'04]
  • Implementation issues and challenges
  • Future research ideas

67
Conclusions
  • Effective code caching is crucial to a robust
    dynamic binary optimizer
  • Medium-grained evictions provide a robust
    solution for various application sizes
  • Replacing a single cache with multiple,
    generational code caches provides notable
    run-time benefits for large applications