Title: Dynamic Binary Optimization
1Dynamic Binary Optimization
- After compatibility, performance is the next
consideration.
- In many VMs, simple optimizations are performed
to smooth out rough edges
- In some VMs, aggressive optimizations can close the gap between a guest's emulated performance and native platform performance
- Profiles serve as a guide for making optimization decisions. Profiles collected at run time are more accurate than static estimates.
2Optimization Example
3Types of Profiles
- Block or node profiles
- Identify hot code blocks
- Fewer nodes than edges (so fewer counters are needed)
- Edge profiles
- Give a more precise idea of program flow
- A block profile can be derived from an edge profile (not vice versa)
4Path Profiles
- A Path profile subsumes an edge profile by
counting paths containing multiple edges.
- For superblock formation, the path profile is the most appropriate type of profile.
- Simply following the most frequent edges from an edge profile does not always yield the most frequent paths.
- Path profiles can be collected efficiently, but require up-front program analysis to determine where to place profile probes (see the sketch after this list).
- Hardware support can give information on paths
efficiently.
- branch trace information
- global branch history taken/not-taken bit mask
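As a concrete illustration of probe-based path profiling, here is a minimal C sketch in the spirit of Ball-Larus profiling; the region, its two branches, and the per-edge increments are invented for illustration, with the increments assumed to come from the up-front analysis mentioned above so that each acyclic path gets a distinct id.

#include <stdio.h>

#define NUM_PATHS 4
static unsigned long path_count[NUM_PATHS];

/* Hypothetical region with two conditional branches. The per-edge
 * increments (+1 and +2 on the else edges) are assumed to have been
 * assigned by up-front analysis, so each of the four acyclic paths
 * through the region produces a unique id in [0, 3]. */
static int region(int a, int b) {
    int path_id = 0;                 /* path register, reset at region entry */
    int r = 0;

    if (a > 0) { r += a; }           /* taken edge: increment 0 */
    else       { r -= a; path_id += 1; }

    if (b > 0) { r += b; }           /* taken edge: increment 0 */
    else       { r -= b; path_id += 2; }

    path_count[path_id]++;           /* single probe at region exit */
    return r;
}

int main(void) {
    region(1, 1); region(-1, 1); region(1, -1); region(-1, -1); region(1, 1);
    for (int i = 0; i < NUM_PATHS; i++)
        printf("path %d executed %lu times\n", i, path_count[i]);
    return 0;
}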
5Other Profiles
- Cache miss profiles
- L1/L2/L3 miss profiles
- coherence miss profiles
- miss traffic profiles
- useless prefetch profiles
- Value prediction profiles
- Speculative execution profiles
- data speculation check profiles
- Data dependence profiles
- Exception profiles
- Indirect branch profiles
- and more
6Collecting Profiles
- Instrumentation-based
- Software probes
- Slows down program more
- Can be intrusive
- Can be bursty
- Hardware probes
- Less overhead than software
- Less well-supported in processors
- Typically event counters
- Sampling based
- Interrupt at random intervals and take a sample (see the sketch below)
- Slows down program less
- Requires longer time to get the same amount of data
- Not useful during interpretation
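A minimal sketch of the sampling approach above, assuming a POSIX system and Linux/x86-64 for reading the interrupted program counter (REG_RIP); other platforms expose the PC differently, and a real VM would map sampled PCs back to guest code rather than into a raw histogram.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

#define BUCKETS 4096
static volatile uint64_t hist[BUCKETS];   /* coarse PC histogram */

static void on_sample(int sig, siginfo_t *si, void *uc_void) {
    (void)sig; (void)si;
    ucontext_t *uc = (ucontext_t *)uc_void;
    /* REG_RIP is Linux/x86-64 specific. */
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    hist[(pc >> 4) % BUCKETS]++;
}

int main(void) {
    struct sigaction sa;
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { {0, 1000}, {0, 1000} };   /* ~1 ms sampling period */
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;                 /* stand-in workload being profiled */
    for (long i = 0; i < 50000000; i++) x += i * 0.5;

    uint64_t total = 0;
    for (int i = 0; i < BUCKETS; i++) total += hist[i];
    printf("collected %llu samples\n", (unsigned long long)total);
    return 0;
}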
7Profiling During Interpretation
- Profiling code is added to interpreter routines
- can be applied to specific instruction types
- can be applied to certain classes of
instructions (e.g. backward branches)
- Profile table can be merged with the translation
lookup table.
- Profile counter decaying
- saturating counters
- automatic decay: the profile manager periodically divides all profile counts by 2 (shift right by 1)
- Profiling jump instructions
- May need to maintain multiple target addresses
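A small sketch of the saturating, periodically decayed profile counters described above; the table size, hash, and hot threshold are illustrative choices, not values from the slides.

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE  1024
#define COUNTER_MAX 255              /* saturating 8-bit counters */

static uint8_t profile[TABLE_SIZE];

/* Called by the interpreter when it dispatches the instruction at 'pc'
   (e.g., only for backward branches). Returns nonzero when the counter
   reaches the hot threshold and translation/optimization should start. */
int profile_hit(uintptr_t pc, uint8_t hot_threshold) {
    size_t idx = (pc >> 2) % TABLE_SIZE;     /* could share the translation table's hash */
    if (profile[idx] < COUNTER_MAX)          /* saturate instead of wrapping */
        profile[idx]++;
    return profile[idx] >= hot_threshold;
}

/* Invoked periodically by the profile manager: halve every counter
   (shift right by 1) so stale hot spots decay over time. */
void profile_decay(void) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        profile[i] >>= 1;
}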
8Profiling Translated Code
Translated basic block
Fall-thru stub:
    increment edge counter (i)
    if (counter(i) >= trigger) then invoke optimizer
    else branch to fall-thru basic block
Branch target stub:
    increment edge counter (j)
    if (counter(j) >= trigger) then invoke optimizer
    else branch to target basic block
(A C-level sketch of this stub logic follows below.)
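In C-like terms, the stub logic above amounts to the following sketch; the counter storage, trigger value, and the invoke_optimizer / branch_to_block actions are hypothetical stand-ins for a real translator's mechanisms.

#include <stdint.h>
#include <stdio.h>

#define TRIGGER 50

/* Stand-ins for the VM's real actions. */
static void invoke_optimizer(int edge_id) { printf("optimize along edge %d\n", edge_id); }
static void branch_to_block(void *block)  { (void)block; /* resume emulation */ }

struct edge_stub {
    uint32_t counter;   /* edge profile counter                */
    int      edge_id;   /* which edge this stub instruments    */
    void    *target;    /* fall-through or branch-target block */
};

/* Each exit of a translated basic block branches to a stub like this
   instead of directly to its successor block. */
static void edge_stub_hit(struct edge_stub *s) {
    if (++s->counter >= TRIGGER)
        invoke_optimizer(s->edge_id);   /* hot edge: trigger optimization */
    else
        branch_to_block(s->target);     /* cold edge: continue execution  */
}

int main(void) {
    struct edge_stub fallthru = {0, 1, NULL};
    for (int i = 0; i < 60; i++)
        edge_stub_hit(&fallthru);       /* the optimizer fires once the count reaches 50 */
    return 0;
}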
9Optimizing Translated Blocks
- Use dominant control flow for enhancing memory
locality
- Enlarged basic blocks (traces, superblocks, tree
groups) are optimized.
- Performance metric guided optimizations
- cache related optimizations such as prefetching
- value related optimizations (e.g.
specialization)
- failure related optimizations such as
speculative optimization, alignment optimization,
and so on.
10Memory Locality Enhancement
[Figure: control-flow graph of basic blocks A through G with edge profile counts; block A ends in "Br condT", and its edge to D (about 70) is much hotter than its edge to B (about 30); the counts along A, D, F, G mark that chain as the dominant path. Caption: Initially arranged BBs]
11Memory Locality Enhancement
[Figure: the same control-flow graph and edge counts as the previous slide, still in the initial block arrangement. Caption: Initially arranged BBs]
12Memory Locality Enhancement
[Figure: the rearranged layout; the branch in A is inverted to "Br condF" so that the hot blocks A, D, F, G fall through to one another, while the cold blocks B, C, E are placed out of line. Caption: Rearrangement to improve spatial locality]
(A code sketch of this layout idea follows below.)
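A rough C sketch of the layout idea in the figures above: starting from the entry block, repeatedly place the hottest not-yet-placed successor so the dominant path becomes contiguous in memory. The Block structure and the exact edge counts are illustrative approximations of the figure, not data from the slides.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 7                        /* blocks A..G from the figure */

typedef struct {
    const char *name;
    int succ[2];                         /* successor indices, -1 if none */
    int count[2];                        /* edge profile counts           */
} Block;

/* Greedy chain layout: follow the hottest unplaced successor from the entry.
   (A full implementation would then start new chains for unplaced blocks.) */
static void layout(const Block *b, int entry, int order[], int *n) {
    bool placed[NBLOCKS] = {false};
    int cur = entry;
    *n = 0;
    while (cur >= 0 && !placed[cur]) {
        placed[cur] = true;
        order[(*n)++] = cur;
        int s0 = b[cur].succ[0], s1 = b[cur].succ[1];
        int next = -1;
        if (s0 >= 0 && !placed[s0]) next = s0;
        if (s1 >= 0 && !placed[s1] &&
            (next < 0 || b[cur].count[1] > b[cur].count[0])) next = s1;
        cur = next;
    }
}

int main(void) {
    /* Edge counts roughly as in the figure: A->B 30 vs. A->D 70, etc. */
    Block b[NBLOCKS] = {
        {"A", {1, 3}, {30, 70}}, {"B", {2, -1}, {29, 0}},
        {"C", {6, -1}, {29, 0}}, {"D", {4, 5}, {2, 68}},
        {"E", {6, -1}, {1, 0}},  {"F", {6, -1}, {68, 0}},
        {"G", {-1, -1}, {0, 0}},
    };
    int order[NBLOCKS], n;
    layout(b, 0, order, &n);
    for (int i = 0; i < n; i++) printf("%s ", b[order[i]].name);
    printf("\n");                        /* prints the hot chain: A D F G */
    return 0;
}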
13Dynamic Call Inlining
Follow dominant flow of control
14Pros and Cons of Partial Inlining
- Increased spatial locality
- Enlarged extended basic blocks (larger scope for optimization)
- Reduced parameter-passing overhead
- May reduce call/return overhead
- May increase dynamic code size (due to excessive code duplication) and increase I-cache and ITLB miss rates
- May increase register pressure
- May increase translation time
15Traces, Superblocks, Treegions
- Three common ways to rearrange basic blocks
according to control flow
- Trace formation
- follows control flow naturally
- Superblock formation
- more widely used in VM implementations
- enforces a single-entry, multiple-exit trace (often using tail duplication)
- more amenable to inter-basic-block optimizations
- Tree group formation
- provides a wider scope than superblocks
16Trace Scheduling and VLIW Processors
- VLIW Processors
- Each long instruction contains multiple operations (branches, loads and stores, integer/FP operations, reg-reg transfers) that are executed in parallel.
- No need to track data dependences because the compiler packs independent operations into instructions after analyzing dependences (not entirely true, e.g. memory operations).
- The complexity of HW dependence checking is replaced by complexity in the compiler. (BTW, what is RISC? Relegating Important Stuff to the Compiler.)
17Example Multiflow 500
- The Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits wide).
[Figure: pipeline diagram showing the stages Fetch, Decode, Execute, WB]
18Historical Background
- 1970s
- FPS-164 and MARS-432
- 1980s
- Multiflow (Fisher), Cydrome (Rau)
- Competing with Convex, Ardent, VAX, Cray, and others
- Lost the battle to superscalar processors (RS/6000, PA-RISC, MIPS, the so-called killer micros)
- Late 90s to 2000s
- DSP processors (Philips TriMedia)
- EPIC (IA-64 processors)
- Sun's MAJC, Transmeta, DAISY
- Was the trend for a while, until multi-core became the new wave
19Control Dependences - Instruction Window
Superscalar: Hardware branch prediction guides fetching of instructions to fill up the instruction window. Instructions are issued from the window as they become ready, that is, out-of-order execution is possible. Speculative execution is also possible with out-of-order execution.
VLIW/EPIC: Programs are first profiled. The compiler uses the profiles to trace out likely paths. A trace is a software instruction window. Instruction reordering is performed by the compiler within the trace.
20Data Dependences - Exploiting ILP
Superscalar: Memory dependences are handled with HW load-store disambiguation techniques, enabling out-of-order execution. False register dependences are avoided using register renaming. True data dependences must be honored; value prediction allows out-of-order execution of dependent instructions.
VLIW/EPIC: Memory dependences are detected by the compiler using dependence analysis, with HW support for advanced loads. False data dependences are avoided by the compiler through renaming (memory) and register allocation. True data dependences are strictly followed.
21General Comparisons
Superscalar: Smaller code size (no nops, no compensation code, no code duplication, ...). Binary compatible!! Hides unpredictable latencies. More adaptable!
VLIW/EPIC: Simpler hardware; may be easier to implement, easier to verify, less expensive, smaller die, lower power consumption, higher clock rate. Early VLIWs don't have interlocks. Early VLIWs have clustered register files. EPIC addresses some binary compatibility issues (not completely).
22Why VLIW?
- Superscalar out-of-order implementation is not
scalable for exploiting ILP
- Runtime data dependency check complexity not
scalable
- Register renaming complexity
- Large instruction reordering window is expensive
to implement
- HW complexity may limit the clock rate and make verification more difficult
23Trace Scheduling is vital for VLIW
- Typical basic blocks contain about 5 instructions, and these instructions usually have data dependences. So how can the compiler exploit ILP and make use of the long instruction word?
- Trace Scheduling
- It exploits ILP across basic block boundaries
24Trace Scheduling
- Trace Selection
- Find a likely sequence of basic blocks (a trace): a long stretch of straight-line code, predicted statically or from profiles
- Trace Compaction
- Squeeze the trace into as few VLIW instructions as possible
- Bookkeeping code is needed in case the prediction is wrong
25Trace Selection
- Trace selection is based on either static branch prediction or edge profiles.
- Examples of common traces:
- Simple loop unrolling: the body a[I] = b[I] is replicated, each copy followed by a rarely taken exit test (if (I >= n) exit), so the trace runs straight through the copies.
- Error-checking code: calls such as r = funct(a) followed by checks on r that rarely fail, so the trace follows the no-error path.
26Code Motions in Trace Scheduling
- Case 1: If a trace operation moves below a conditional branch, a copy of it must be placed on the off-trace edge.
[Figure: before/after diagrams; Inst 2 is moved from above "Branch x" to below it on the trace, and a copy of Inst 2 is placed on the off-trace edge leading to Inst x. A source-level sketch follows below.]
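As a source-level illustration of Case 1 (an invented example, not taken from the slides): when the operation t = a + b is moved below the branch on the trace, a compensation copy must appear on the off-trace path so that path still sees t.

#include <stdio.h>

/* Before: the operation executes above the branch on every path. */
static int before(int a, int b, int cond) {
    int t = a + b;          /* the trace operation (Inst 2)      */
    if (cond)               /* Branch x                          */
        return t * 10;      /* off-trace path uses t             */
    return t + 1;           /* on-trace path (the frequent case) */
}

/* After: the operation is moved below the branch on the trace, with a
   compensation copy placed on the off-trace edge. */
static int after(int a, int b, int cond) {
    if (cond) {
        int t = a + b;      /* compensation copy on the off-trace edge  */
        return t * 10;
    }
    int t = a + b;          /* original operation, now below the branch */
    return t + 1;
}

int main(void) {
    printf("%d %d\n", before(2, 3, 0), after(2, 3, 0));   /* 6 6   */
    printf("%d %d\n", before(2, 3, 1), after(2, 3, 1));   /* 50 50 */
    return 0;
}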
27Code Motions in Trace Scheduling
- Why would we want to move an operation downwards? To hide latency, to schedule it for free, ...
[Figure: r2 = r1 + 1, which depends on "Load r1", is moved below "Branch x" on the trace, with a copy placed on the off-trace edge; the load latency is thereby hidden]
28Code Motions in Trace Scheduling
- Case 2: If a trace operation moves above a rejoin point (side entrance), a copy of it must be placed on the off-trace rejoin edge.
[Figure: Inst 3 is moved above the point where the off-trace path through Inst 4 rejoins the trace, so a copy of Inst 3 is placed on that rejoin edge]
29Code Motions in Trace Scheduling
- Case 3: If a trace operation writes to a variable and the variable is live on the off-trace edge, the operation cannot be moved above the branch. However, register renaming can be used.
[Figure: the assignment to r3 below the branch cannot be hoisted above it because r3 is live on the off-trace edge; renaming the destination register makes the motion legal]
30Code Motions in Trace Scheduling
- Case 4: Speculative code motion. How can the load latency be hidden?
[Figure: on the trace, "Load r1" and the dependent r3 = r1 + 1 sit below "Branch x"; the load cannot start early unless it is moved above the branch]
31Code Motions in Trace Scheduling
- Case 4: Speculative code motion. Move the load above the branch. But the load may trap!!!
- Some architectures introduce non-trapping loads for speculative code motion.
- But what if the moved load really needs to trap?!
[Figure: "Load r1" hoisted above "Branch x"; the dependent r3 = r1 + 1 remains below the branch]
32Code Motions in Trace Scheduling
- Case 4: Speculative code motion. IA-64 introduces speculative loads and a check for faults.
- If the speculative load (load.s) causes an exception, the exception should be serviced only if execution actually reaches the use of the loaded value.
- The check (chk.s) verifies whether an exception has occurred and, if so, branches to recovery code.
[Figure: "Load.s r1" hoisted above "Branch x"; "Chk.s r1" is placed just before the use r3 = r1 + 1 below the branch. A conceptual C sketch follows below.]
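A conceptual C sketch of the load.s / chk.s idea; this is an assumed mapping for illustration, not actual IPF code. The deferred-fault flag stands in for the NaT bit a speculative load would set, and the recovery path is simplified to an error return.

#include <stdbool.h>
#include <stddef.h>

typedef struct { long value; bool deferred_fault; } spec_load_t;

/* load.s analogue: never traps, it just records that a fault is pending. */
static spec_load_t speculative_load(const long *p) {
    spec_load_t r = {0, false};
    if (p == NULL) r.deferred_fault = true;   /* stand-in for setting a NaT bit */
    else r.value = *p;
    return r;
}

long hot_path(const long *p, bool branch_taken) {
    spec_load_t r1 = speculative_load(p);     /* load hoisted above the branch            */
    if (branch_taken)
        return 0;                             /* off-trace: the load result is never used */
    if (r1.deferred_fault)                    /* chk.s analogue on the needed path        */
        return -1;                            /* branch to recovery code                  */
    return r1.value + 1;                      /* r3 = r1 + 1                              */
}

int main(void) {
    long x = 41;
    return (int)(hot_path(&x, false) - 42);   /* 0 on success */
}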
33Code Motions in Trace Scheduling
- Case 5: Moving an instruction below a side entrance requires adjusting the entrance address and duplicating code.
[Figure: before/after diagrams; Inst 3 is moved below the side entrance, the entrance is redirected to the adjusted address, and Inst 3 is duplicated on the entering path]
34Code Motions in Trace Scheduling
1. Speculative execution
2. Needs code duplication
3. Needs code duplication and branch target adjustment
4. Needs code duplication
[Figure: a trace with a side exit and a side entrance; the labels 1-4 mark where each kind of code motion occurs]
35The Superblock: An Effective Technique for VLIW and Superscalar Compilation
- What is a Superblock?
- Why Superblocks?
- The implementation of Superblocks in the Impact
Compiler
36What is wrong with Trace Scheduling?
Code motions 3 and 4 add considerable bookkeeping complexity. Side entrances also make optimizations such as copy propagation more complex. The superblock eliminates side entrances.
[Figure: the same trace diagram as the previous slide, with the side exit and side entrance marked]
37Example of Copy Propagation in traces
[Figure: in a trace with a side entrance, r1 ← 1 cannot safely be propagated into r4 ← r1 + r2 because the side entrance (carrying r1 ← r3) may enter between the definition and the use; once the side entrance is removed, copy propagation and constant folding (r1 ← 1, r2 ← 3) reduce the computation to r4 ← 4]
38Superblock
- A Superblock is a trace that has NO side entrances.
- Control may enter only at the top but may leave at one or more exit points
- A Superblock is formed in two steps
- Traces are identified (using profile or static
branch prediction)
- Tail duplication to remove all side entrances
39Superblock formation example
[Figure: superblock formation on a CFG with blocks A through F; the selected trace is shown before and after tail duplication, with block F duplicated so the trace no longer has a side entrance]
40Tail Duplication in Superblocks
41Superblock ILP Optimization
- Superblock enlarging transformations
- Branch target expansion
- Loop peeling
- Loop unrolling
- Dependence removing transformations
- Register renaming
- Operation migration
- Induction variable expansion
- Accumulator variable expansion
- Operation combining
42Operation Migration
- Move an instruction from a superblock where its
result is not used to a less frequently executed
block.
[Figure: r2 ← r1 + r3, whose result is overwritten by r2 ← r4 + r1 before any use on the superblock, is migrated from before "Branch x" to the less frequently executed off-trace target; the hot path keeps only r1 ← A, Branch x, r4 ← B, r2 ← r4 + r1]
43Induction variable expansion
For (I=0; I<n; I++) A[I] = B[I];
After unrolling, each copy of the body gets its own induction variable so the copies do not serialize on a single increment of I:
    I1 = I + 1; I2 = I + 2;
    A[I] = B[I]; A[I1] = B[I1]; A[I2] = B[I2];
    I += 3; I1 += 3; I2 += 3;
In some cases, the increments of induction variables can be folded into the offsets of load and store instructions.
44Accumulator variable expansion
For (I=0; I<n; I++) A = A + B[I];
After unrolling, the single accumulator A serializes the additions; accumulator expansion gives each copy its own partial sum:
    S1 += B[I]; S2 += B[I+1]; S3 += B[I+2];
and the partial sums are combined after the loop: A = S1 + S2 + S3.
Not applicable to FP operations (FP addition is not associative) unless an option is provided by the user. (A C sketch follows below.)
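A minimal C sketch of accumulator variable expansion for an integer reduction (the note above is why floating point requires an explicit user option: FP addition is not associative).

#include <stdio.h>

#define N 12

/* Original loop: every addition depends on the previous one through A. */
static long sum_original(const long *B, int n) {
    long A = 0;
    for (int i = 0; i < n; i++)
        A += B[i];
    return A;
}

/* Unrolled by 3 with independent accumulators S1, S2, S3, so the
   additions within one unrolled iteration can execute in parallel. */
static long sum_expanded(const long *B, int n) {
    long S1 = 0, S2 = 0, S3 = 0;
    int i = 0;
    for (; i + 2 < n; i += 3) {
        S1 += B[i];
        S2 += B[i + 1];
        S3 += B[i + 2];
    }
    long A = S1 + S2 + S3;       /* combine the partial sums */
    for (; i < n; i++)           /* leftover iterations      */
        A += B[i];
    return A;
}

int main(void) {
    long B[N];
    for (int i = 0; i < N; i++) B[i] = i;
    printf("%ld %ld\n", sum_original(B, N), sum_expanded(B, N));   /* 66 66 */
    return 0;
}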
45Superblock Scheduling
- Typical scheduling
- Dependence graph construction (include both
control and data dependence)
- List scheduling (ready list)
- Speculative execution support
- Restricted model: don't move instructions that may cause traps (loads, stores, integer divide, FP operations) above branches
- Using non-trapping versions of those instructions
- (not a fully correct solution; a speculative check is needed)
46Implementation in the Impact Compiler
47Base Code Calibration
- The effect of superblock optimization is reported as speedup over the baseline code, so the question is: how good is the base code? On a DECstation 3100 (MIPS R2000 processor), the base code (compiled with -O) is faster than the MIPS compiler's code (by about 4%) and faster than the GNU compiler's code (by about 15%).
48Compile Time Cost
- Profiling
- Results are shown for profiling with only one input set
- Superblock formation
- About 2% to 23% of base compilation time
- Superblock optimization
- Average 101% of base compile time
- Worst case: 522% of base compile time (the cmp benchmark)
49Conclusion from IMPACT-I
- The IMPACT-I compiler has shown the performance potential of superblock formation and optimization techniques.
- Superblock formation/scheduling adds about 14% more code and roughly 100% more compile time.
- Speculative execution (with non-trapping instructions) improves performance by 13% to 143%.
- With 64KB I- and D-caches, the performance gain from superblock techniques remains.
- Superscalar and VLIW processors need superblock techniques.
50Tree Groups
- Traces and superblocks are based on the principle that conditional branches are biased, but some branches are 50-50 or 30-70. When that happens, there is overhead at the exits and optimization opportunities are missed.
- Tree groups (or tree regions) are a generalization of superblocks. They still have one entrance and multiple exits, but they include multiple flows of control rather than a single flow of control.
51Tree groups vs. Superblocks
[Figure: CFG with blocks A through G used in the following comparison of superblocks and tree groups]
52Superblocks
[Figure: the superblock formed from the single dominant path A, D, E, G]
53Tree groups
Single entry, multiple exits, with multiple control flows
[Figure: the tree group rooted at A; it includes both directions of the branches (B, C, D, E, F), and block G is duplicated on several leaves]
54Why Superblock in VM
[Figure: a trace shown with compensation code blocks attached at its side entrances and exits, contrasted with a superblock, which needs none of that compensation code; this is why VMs favor superblocks]
55Superblock Formation in VM
- Start Points (a code sketch of the whole formation procedure follows this list)
- When a block's use count reaches a threshold
- Profile all blocks (UQDBT)
- Profile selected blocks (Dynamo)
- Profile only targets of backward branches (which close loops)
- Profile exits from existing superblocks
- Continuation
- Use hottest edges above a threshold (UQDBT)
- Follow current control path (most recent edge)
(Dynamo)
- End Points
- Start point of this superblock
- Start point of some other superblock
- When a maximum size is reached
- When no edge above threshold can be found
(UQDBT)
- When an indirect jump is reached
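A rough C sketch of threshold-based superblock growth following the start/continuation/end rules above (roughly in the UQDBT style); the Block structure, threshold, and size limit are hypothetical stand-ins for a real translator's data.

#include <stdbool.h>
#include <stddef.h>

#define MAX_SB_BLOCKS  32
#define EDGE_THRESHOLD 50

typedef struct Block Block;
struct Block {
    Block   *succ[2];          /* up to two direct successors          */
    unsigned edge_count[2];    /* edge profile counters                */
    bool     is_sb_start;      /* already the start of some superblock */
    bool     ends_in_indirect; /* indirect jump terminates growth      */
};

/* Grow a superblock starting at 'start' by repeatedly following the
   hottest outgoing edge whose count is above the threshold. Returns the
   number of blocks collected into 'out' (to be translated, with tail
   duplication removing any side entrances). */
size_t form_superblock(Block *start, Block *out[]) {
    size_t n = 0;
    Block *b = start;
    while (n < MAX_SB_BLOCKS) {                  /* end: maximum size reached */
        out[n++] = b;
        if (b->ends_in_indirect)                 /* end: indirect jump */
            break;
        int hot = (b->edge_count[1] > b->edge_count[0]) ? 1 : 0;
        Block *next = b->succ[hot];
        if (next == NULL || b->edge_count[hot] < EDGE_THRESHOLD)
            break;                               /* end: no edge above threshold */
        if (next == start || next->is_sb_start)
            break;                               /* end: start point of a superblock */
        b = next;
    }
    start->is_sb_start = true;
    return n;
}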
56Dynamic Optimization Overview
57Optimization and Compatibility
- Trap compatible
- A process VM implementation is trap compatible if any trap that would occur during the native execution of a source instruction is also observed during the emulation.
- At the time of a trap, the memory and register state becomes visible, so their compatibility must be maintained.
- Implications
- Instructions may not be deleted
- Registers must be updated in their original order
- Stores must preserve their original order
- Accesses to volatile locations should be treated as stores
58Register State Compatibility
Source:             Target (reordered):          Target (with scratch register):
r1 ← r2 + r3        r1 ← r2 + r3                 r1 ← r2 + r3
r9 ← r1 + r5        r6 ← r1 + r7                 s1 ← r1 + r7
r6 ← r1 + r7        r9 ← r1 + r5   (trap?)       r9 ← r1 + r5   (trap?)
r3 ← r6 + r1        r3 ← r6 + r1                 r6 ← s1
                                                 r3 ← r6 + r1
If the reordered target traps at r9 ← r1 + r5, register r6 has already been updated out of source order. Writing the early result to scratch register s1 and committing r6 only after the potentially trapping instruction preserves the source register state.
59Example Intel IA32 EL
- Software method for running IA-32 binaries on IPF
- Previous approach was in hardware (iVE)
- Runs with both Windows and Linux
- OS independent section (BTgeneric)
- OS dependent section (BTlib)
- System services
- Two stages
- Fast binary translation (cold code)
- Optimized binary translation (hot code)
- Precise traps are an important consideration
60Cold Code Translation
61Cold Code Translation
- Cold code is generated at basic-block
granularity.
- Simple analysis is done on neighboring blocks
(1-20 basic blocks) for better code generation
- decoding
- building a flow graph
- computing the liveness of IA-32 EFlags bits
- tracking floating point (FP) stack changes
- Translation uses prepared (hand tuned)
translation templates for each IA-32 instruction
- Instrumentation code added
- Block counter, edge counter, misalignment
detection, indirect branch targets
- Backpatching is used to link translated blocks
62Hot Code Translation
63Hot Code Translation
- Trace selection (hyper-block selection)
- Decode and analysis
- IL generation
- Adds misalignment avoidance code
- Tracks IA-32 addresses and their values for CSE.
- Tracks register values for simplifying the
translation.
- Eliminates EFlags generation
- Analyzes FP stack flow and SSE format
conversions.
- Performs other FP optimizations, such as register
allocation and FXCHG elimination
- Build dependence graph
- Dead IL elimination
- Mark sideway ILs
- Annotates weights
- Register renaming
- Control and data speculation
- Scheduling, recovery information
- Cross linking, place generated code in code
cache
Hot code translation is about 20 times slower than cold code translation.
64Precise State Support
- Exception handling
- Once an exception is raised in the translated
code, the OS calls IA-32 EL first.
- IA-32 EL filters out IA-32 unrelated exceptions
- Reconstruct IA-32 state
- Simulating IA-32 application exception handler
- Some exceptions should not happen, and must be
avoided (e.g. masked FP exception)
65Precise State Support (cont.)
- IA-32 state reconstruction for cold code
- Ensure IA-32 state change happens only after
executing the last Itanium instruction that can
fault (i.e. memory and FP instructions).
- IA-32 IP is stored in a dedicated register (with
some additional information) and updated during
execution.
66Precise State Support (cont.)
- IA-32 state reconstruction for hot code
- Challenges
- In hot code, Itanium instructions originating
from different IA-32 instructions are usually
inter-mixed, and the IA-32 state (registers) is
often represented by other registers (for
example, in case of register renaming). As a
result, exceptions in hot code may appear in an
incorrect order and redundant exceptions may
occur.
- Hot blocks are composed of several IA-32 basic blocks and may contain branches, loops, and predicated if-then-else code sequences.
67Precise State Support (cont.)
- IA-32 state reconstruction for hot code
- Commit points
- A commit point is a barrier enabling the translator to generate a consistent IA-32 state.
- The translator associates several faulting points in the code with a single commit point.
- The commit point is the reordering boundary.
- Within a single IA-32 instruction translation, the state update occurs after the last faulting IPF instruction.
- The first commit point is usually set at the beginning of a block.
- IA-32 state is copied to backup registers at a commit point.
68IA-32 Specific Optimizations
- Floating point/MMX
- IPF uses large flat register file
- IA-32 uses stack register file
- IA-32 TAG indicates valid entries
- IA-32 aliases MMX regs to FP regs
- Speculate common case usage and put guard code at
beginning of block, for example
- TOS (Top of Stack) same for all block executions
- No invalid accesses (indicated by TAG)
- 99-100% accurate
- Data Misalignment
- Similar to FX!32
69IA-32 EL Performance
70IA-32 EL Performance
CPU2000: mostly in hot code, very little overhead.
Sysmark: only 45% hot code, 22% in OS (IPF native code).
71Same ISA Optimization
- Objective: optimize binaries on the fly
- Many binaries are un-optimized or compiled at a low optimization level
- Initial emulation can be done very efficiently
- Translation at basic block level is identity
translation
- Initial sample-based profiling is attractive
- Original code can be used, running at native
speeds
- Code patching can be used (e.g. ADORE)
- Patch code cache regions into original code
- Replace original code with branches into the code cache (saves some code duplication; a sketch follows below)
- Can avoid hash table lookup on indirect jumps
- Can bail-out if performance is lost
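A rough sketch of the code-patching idea, assuming x86-64 and that the page holding the original code can be made writable; real systems such as ADORE do this far more carefully (instruction boundaries, atomicity with respect to other threads, and so on).

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the start of an original basic block with a 5-byte
   'jmp rel32' to its optimized translation in the code cache, so future
   executions enter the code cache directly, with no hash-table lookup. */
int patch_branch(void *orig_block, void *code_cache_entry) {
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)orig_block & ~(uintptr_t)(pagesz - 1));

    /* Make the original code writable (it normally is not); cover two
       pages in case the patch site straddles a page boundary. */
    if (mprotect(page, (size_t)pagesz * 2, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;

    /* rel32 is relative to the instruction following the 5-byte jmp. */
    int64_t rel = (int64_t)((uintptr_t)code_cache_entry - ((uintptr_t)orig_block + 5));
    if (rel > INT32_MAX || rel < INT32_MIN)
        return -1;                       /* target too far for a jmp rel32 */

    uint8_t patch[5];
    int32_t rel32 = (int32_t)rel;
    patch[0] = 0xE9;                     /* opcode for jmp rel32 */
    memcpy(&patch[1], &rel32, 4);
    memcpy(orig_block, patch, 5);        /* not atomic; a real DBO must take care */
    return 0;
}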
72Additional readings for Chapter 4
- IA-32 EL
- Dynamo
- DynamoRIO
- ADORE