Title: Dynamic Binary Optimization
1Dynamic Binary Optimization
- After compatibility, performance is the next
consideration.
- In many VMs, simple optimizations are performed
to smooth out rough edges
- In some VMs, aggressive optimizations can close the gap between a guest's emulated performance and native platform performance
- Profiles serve as a guide for making optimization decisions. Profiles collected at run time are more accurate than static estimates.
2Optimization Example
3Types of Profiles
- Block or node profiles
- Identify hot code blocks
- Fewer nodes than edges (so fewer counters are needed)
- Edge profiles
- Give a more precise idea of program flow
- A block profile can be derived from an edge profile (not vice versa)
4Path Profiles
- A Path profile subsumes an edge profile by
counting paths containing multiple edges.
- For superblock formation, the path profile is the most appropriate type of profile.
- Simply following the most frequent edges from an edge profile does not always yield the most frequent paths.
- Path profiles can be collected efficiently, but require up-front program analysis to determine where to place profile probes (see the sketch after this list).
- Hardware support can give information on paths
efficiently.
- branch trace information
- global branch history taken/not-taken bit mask
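As a concrete illustration of probe-based path profiling, here is a minimal C sketch in the spirit of Ball-Larus profiling; the region, its two branches, and the per-edge increments are invented for illustration, with the increments assumed to come from the up-front analysis mentioned above so that each acyclic path gets a distinct id.

#include <stdio.h>

#define NUM_PATHS 4
static unsigned long path_count[NUM_PATHS];

/* Hypothetical region with two conditional branches. The per-edge
 * increments (+1 and +2 on the else edges) are assumed to have been
 * assigned by up-front analysis, so each of the four acyclic paths
 * through the region produces a unique id in [0, 3]. */
static int region(int a, int b) {
    int path_id = 0;                 /* path register, reset at region entry */
    int r = 0;

    if (a > 0) { r += a; }           /* taken edge: increment 0 */
    else       { r -= a; path_id += 1; }

    if (b > 0) { r += b; }           /* taken edge: increment 0 */
    else       { r -= b; path_id += 2; }

    path_count[path_id]++;           /* single probe at region exit */
    return r;
}

int main(void) {
    region(1, 1); region(-1, 1); region(1, -1); region(-1, -1); region(1, 1);
    for (int i = 0; i < NUM_PATHS; i++)
        printf("path %d executed %lu times\n", i, path_count[i]);
    return 0;
}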
5Other Profiles
- Cache miss profiles
- L1/L2/L3 miss profiles
- coherence miss profiles
- miss traffic profiles
- useless prefetch profiles
- Value prediction profiles
- Speculative execution profiles
- data speculation check profiles
- Data dependence profiles
- Exception profiles
- Indirect branch profiles
- and more
6Collecting Profiles
- Instrumentation-based
- Software probes
- Slows down program more
- Can be intrusive
- Can be bursty
- Hardware probes
- Less overhead than software
- Less well-supported in processors
- Typically event counters
- Sampling based
- Interrupt at random intervals and take a sample (see the sketch below)
- Slows down program less
- Requires longer time to get the same amount of data
- Not useful during interpretation
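A minimal sketch of the sampling approach above, assuming a POSIX system and Linux/x86-64 for reading the interrupted program counter (REG_RIP); other platforms expose the PC differently, and a real VM would map sampled PCs back to guest code rather than into a raw histogram.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

#define BUCKETS 4096
static volatile uint64_t hist[BUCKETS];   /* coarse PC histogram */

static void on_sample(int sig, siginfo_t *si, void *uc_void) {
    (void)sig; (void)si;
    ucontext_t *uc = (ucontext_t *)uc_void;
    /* REG_RIP is Linux/x86-64 specific. */
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    hist[(pc >> 4) % BUCKETS]++;
}

int main(void) {
    struct sigaction sa;
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { {0, 1000}, {0, 1000} };   /* ~1 ms sampling period */
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;                 /* stand-in workload being profiled */
    for (long i = 0; i < 50000000; i++) x += i * 0.5;

    uint64_t total = 0;
    for (int i = 0; i < BUCKETS; i++) total += hist[i];
    printf("collected %llu samples\n", (unsigned long long)total);
    return 0;
}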
7Profiling During Interpretation
- Profiling code is added to interpreter routines
- can be applied to specific instruction types
- can be applied to certain classes of
instructions (e.g. backward branches)
- Profile table can be merged with the translation
lookup table.
- Profile counter decaying
- saturating counters
- automatic decay: the profile manager periodically divides all profile counts by 2 (shift right by 1)
- Profiling jump instructions
- May need to maintain multiple target addresses
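A small sketch of the saturating, periodically decayed profile counters described above; the table size, hash, and hot threshold are illustrative choices, not values from the slides.

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE  1024
#define COUNTER_MAX 255              /* saturating 8-bit counters */

static uint8_t profile[TABLE_SIZE];

/* Called by the interpreter when it dispatches the instruction at 'pc'
   (e.g., only for backward branches). Returns nonzero when the counter
   reaches the hot threshold and translation/optimization should start. */
int profile_hit(uintptr_t pc, uint8_t hot_threshold) {
    size_t idx = (pc >> 2) % TABLE_SIZE;     /* could share the translation table's hash */
    if (profile[idx] < COUNTER_MAX)          /* saturate instead of wrapping */
        profile[idx]++;
    return profile[idx] >= hot_threshold;
}

/* Invoked periodically by the profile manager: halve every counter
   (shift right by 1) so stale hot spots decay over time. */
void profile_decay(void) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        profile[i] >>= 1;
}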
8Profiling Translated Code
Translated basic block
Fall-thru stub:
    increment edge counter (i)
    if (counter(i) >= trigger) then invoke optimizer
    else branch to fall-thru basic block
Branch target stub:
    increment edge counter (j)
    if (counter(j) >= trigger) then invoke optimizer
    else branch to target basic block
(A C-level sketch of this stub logic follows below.)
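In C-like terms, the stub logic above amounts to the following sketch; the counter storage, trigger value, and the invoke_optimizer / branch_to_block actions are hypothetical stand-ins for a real translator's mechanisms.

#include <stdint.h>
#include <stdio.h>

#define TRIGGER 50

/* Stand-ins for the VM's real actions. */
static void invoke_optimizer(int edge_id) { printf("optimize along edge %d\n", edge_id); }
static void branch_to_block(void *block)  { (void)block; /* resume emulation */ }

struct edge_stub {
    uint32_t counter;   /* edge profile counter                */
    int      edge_id;   /* which edge this stub instruments    */
    void    *target;    /* fall-through or branch-target block */
};

/* Each exit of a translated basic block branches to a stub like this
   instead of directly to its successor block. */
static void edge_stub_hit(struct edge_stub *s) {
    if (++s->counter >= TRIGGER)
        invoke_optimizer(s->edge_id);   /* hot edge: trigger optimization */
    else
        branch_to_block(s->target);     /* cold edge: continue execution  */
}

int main(void) {
    struct edge_stub fallthru = {0, 1, NULL};
    for (int i = 0; i < 60; i++)
        edge_stub_hit(&fallthru);       /* the optimizer fires once the count reaches 50 */
    return 0;
}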
9Optimizing Translated Blocks
- Use dominant control flow for enhancing memory
locality
- Enlarged basic blocks (traces, superblocks, tree
groups) are optimized.
- Performance metric guided optimizations
- cache related optimizations such as prefetching
- value related optimizations (e.g.
specialization)
- failure related optimizations such as
speculative optimization, alignment optimization,
and so on.
10Memory Locality Enhancement
[Figure: control-flow graph of basic blocks A through G with edge profile counts; block A ends in "Br condT", and its edge to D (about 70) is much hotter than its edge to B (about 30); the counts along A, D, F, G mark that chain as the dominant path. Caption: Initially arranged BBs]
11Memory Locality Enhancement
[Figure: the same control-flow graph and edge counts as the previous slide, still in the initial block arrangement. Caption: Initially arranged BBs]
12Memory Locality Enhancement
[Figure: the rearranged layout; the branch in A is inverted to "Br condF" so that the hot blocks A, D, F, G fall through to one another, while the cold blocks B, C, E are placed out of line. Caption: Rearrangement to improve spatial locality]
(A code sketch of this layout idea follows below.)
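A rough C sketch of the layout idea in the figures above: starting from the entry block, repeatedly place the hottest not-yet-placed successor so the dominant path becomes contiguous in memory. The Block structure and the exact edge counts are illustrative approximations of the figure, not data from the slides.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 7                        /* blocks A..G from the figure */

typedef struct {
    const char *name;
    int succ[2];                         /* successor indices, -1 if none */
    int count[2];                        /* edge profile counts           */
} Block;

/* Greedy chain layout: follow the hottest unplaced successor from the entry.
   (A full implementation would then start new chains for unplaced blocks.) */
static void layout(const Block *b, int entry, int order[], int *n) {
    bool placed[NBLOCKS] = {false};
    int cur = entry;
    *n = 0;
    while (cur >= 0 && !placed[cur]) {
        placed[cur] = true;
        order[(*n)++] = cur;
        int s0 = b[cur].succ[0], s1 = b[cur].succ[1];
        int next = -1;
        if (s0 >= 0 && !placed[s0]) next = s0;
        if (s1 >= 0 && !placed[s1] &&
            (next < 0 || b[cur].count[1] > b[cur].count[0])) next = s1;
        cur = next;
    }
}

int main(void) {
    /* Edge counts roughly as in the figure: A->B 30 vs. A->D 70, etc. */
    Block b[NBLOCKS] = {
        {"A", {1, 3}, {30, 70}}, {"B", {2, -1}, {29, 0}},
        {"C", {6, -1}, {29, 0}}, {"D", {4, 5}, {2, 68}},
        {"E", {6, -1}, {1, 0}},  {"F", {6, -1}, {68, 0}},
        {"G", {-1, -1}, {0, 0}},
    };
    int order[NBLOCKS], n;
    layout(b, 0, order, &n);
    for (int i = 0; i < n; i++) printf("%s ", b[order[i]].name);
    printf("\n");                        /* prints the hot chain: A D F G */
    return 0;
}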
13Dynamic Call Inlining
Follow dominant flow of control
14Pros and Cons of Partial Inlining
- Increased spatial locality
- Enlarged extended basic blocks (larger scope for optimization)
- Reduced parameter-passing overhead
- May reduce call/return overhead
- May increase dynamic code size (due to excessive code duplication) and increase I-cache and ITLB miss rates
- May increase register pressure
- May increase translation time
15Traces, Superblocks, Treegions
- Three common ways to rearrange basic blocks
according to control flow
- Trace formation
- follows control flow naturally
- Superblock formation
- more widely used in VM implementations
- enforces a single-entry, multiple-exit trace (often using tail duplication)
- more amenable to inter-basic-block optimizations
- Tree group formation
- provides a wider scope than superblocks
16Trace Scheduling and VLIW Processors
- VLIW Processors
- Each long instruction contains multiple operations (branches, loads and stores, integer/FP operations, reg-reg transfers) that are executed in parallel.
- No need to track data dependences because the compiler packs independent operations into instructions after analyzing dependences (not entirely true, e.g. memory operations).
- The complexity of HW dependence checking is replaced by complexity in the compiler. (BTW, what is RISC? Relegating Important Stuff to the Compiler.)
17Example Multiflow 500
- The Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits wide).
[Figure: pipeline diagram showing the stages Fetch, Decode, Execute, WB]
18Historical Background
- 1970s
- FPS-164 and MARS-432
- 1980s
- Multiflow (Fisher), Cydrome (Rau)
- Competing with Convex, Ardent, VAX, Cray, and others
- Lost the battle to superscalar processors (RS/6000, PA-RISC, MIPS, the so-called killer micros)
- Late 90s to 2000s
- DSP processors (Philips TriMedia)
- EPIC (IA-64 processors)
- Sun's MAJC, Transmeta, DAISY
- Was the trend for a while, until multi-core became the new wave
19Control Dependences - Instruction Window
Superscalar: Hardware branch prediction guides fetching of instructions to fill up the instruction window. Instructions are issued from the window as they become ready, that is, out-of-order execution is possible. Speculative execution is also possible with out-of-order execution.
VLIW/EPIC: Programs are first profiled. The compiler uses the profiles to trace out likely paths. A trace is a software instruction window. Instruction reordering is performed by the compiler within the trace.
20Data Dependences - Exploiting ILP
Superscalar: Memory dependences are handled with HW load-store disambiguation techniques, enabling out-of-order execution. False register dependences are avoided using register renaming. True data dependences must be honored; value prediction allows out-of-order execution of dependent instructions.
VLIW/EPIC: Memory dependences are detected by the compiler using dependence analysis, with HW support for advanced loads. False data dependences are avoided by the compiler through renaming (memory) and register allocation. True data dependences are strictly followed.
21General Comparisons
Superscalar: Smaller code size (no nops, no compensation code, no code duplication, ...). Binary compatible!! Hides unpredictable latencies. More adaptable!
VLIW/EPIC: Simpler hardware; may be easier to implement, easier to verify, less expensive, smaller die, lower power consumption, higher clock rate. Early VLIWs don't have interlocks. Early VLIWs have clustered register files. EPIC addresses some binary compatibility issues (not completely).
22Why VLIW?
- Superscalar out-of-order implementation is not
scalable for exploiting ILP
- Runtime data dependency check complexity not
scalable
- Register renaming complexity
- Large instruction reordering window is expensive
to implement
- HW complexity may limit the clock rate and make verification more difficult
23Trace Scheduling is vital for VLIW
- Typical basic blocks contain about 5 instructions, and these instructions usually have data dependences. So how can the compiler exploit ILP and make use of the long instruction word?
- Trace Scheduling
- It exploits ILP across basic block boundaries
24Trace Scheduling
- Trace Selection
- Find a likely sequence of basic blocks (a trace): a long stretch of straight-line code, predicted statically or from profiles
- Trace Compaction
- Squeeze the trace into as few VLIW instructions as possible
- Bookkeeping code is needed in case the prediction is wrong
25Trace Selection
- Trace selection is based on either static branch prediction or edge profiles.
- Examples of common traces:
- Simple loop unrolling: the body a[I] = b[I] is replicated, each copy followed by a rarely taken exit test (if (I >= n) exit), so the trace runs straight through the copies.
- Error-checking code: calls such as r = funct(a) followed by checks on r that rarely fail, so the trace follows the no-error path.
26Code Motions in Trace Scheduling
- Case 1: If a trace operation moves below a conditional branch, a copy of it must be placed on the off-trace edge.
[Figure: before/after diagrams; Inst 2 is moved from above "Branch x" to below it on the trace, and a copy of Inst 2 is placed on the off-trace edge leading to Inst x. A source-level sketch follows below.]
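As a source-level illustration of Case 1 (an invented example, not taken from the slides): when the operation t = a + b is moved below the branch on the trace, a compensation copy must appear on the off-trace path so that path still sees t.

#include <stdio.h>

/* Before: the operation executes above the branch on every path. */
static int before(int a, int b, int cond) {
    int t = a + b;          /* the trace operation (Inst 2)      */
    if (cond)               /* Branch x                          */
        return t * 10;      /* off-trace path uses t             */
    return t + 1;           /* on-trace path (the frequent case) */
}

/* After: the operation is moved below the branch on the trace, with a
   compensation copy placed on the off-trace edge. */
static int after(int a, int b, int cond) {
    if (cond) {
        int t = a + b;      /* compensation copy on the off-trace edge  */
        return t * 10;
    }
    int t = a + b;          /* original operation, now below the branch */
    return t + 1;
}

int main(void) {
    printf("%d %d\n", before(2, 3, 0), after(2, 3, 0));   /* 6 6   */
    printf("%d %d\n", before(2, 3, 1), after(2, 3, 1));   /* 50 50 */
    return 0;
}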
27Code Motions in Trace Scheduling
- Why would we want to move an operation downwards? To hide latency, to schedule it for free, ...
[Figure: r2 = r1 + 1, which depends on "Load r1", is moved below "Branch x" on the trace, with a copy placed on the off-trace edge; the load latency is thereby hidden]
28Code Motions in Trace Scheduling
- Case 2: If a trace operation moves above a rejoin point (side entrance), a copy of it must be placed on the off-trace rejoin edge.
[Figure: Inst 3 is moved above the point where the off-trace path through Inst 4 rejoins the trace, so a copy of Inst 3 is placed on that rejoin edge]
29Code Motions in Trace Scheduling
- Case 3: If a trace operation writes to a variable and the variable is live on the off-trace edge, the operation cannot be moved above the branch. However, register renaming can be used.
[Figure: the assignment to r3 below the branch cannot be hoisted above it because r3 is live on the off-trace edge; renaming the destination register makes the motion legal]
30Code Motions in Trace Scheduling
- Case 4: Speculative code motion. How can the load latency be hidden?
[Figure: on the trace, "Load r1" and the dependent r3 = r1 + 1 sit below "Branch x"; the load cannot start early unless it is moved above the branch]
31Code Motions in Trace Scheduling
- Case 4: Speculative code motion. Move the load above the branch. But the load may trap!!!
- Some architectures introduce non-trapping loads for speculative code motion.
- But what if the moved load really needs to trap?!
[Figure: "Load r1" hoisted above "Branch x"; the dependent r3 = r1 + 1 remains below the branch]
32Code Motions in Trace Scheduling
- Case 4: Speculative code motion. IA-64 introduces speculative loads and a check for faults.
- If the speculative load (load.s) causes an exception, the exception should be serviced only if execution actually reaches the use of the loaded value.
- The check (chk.s) verifies whether an exception has occurred and, if so, branches to recovery code.
[Figure: "Load.s r1" hoisted above "Branch x"; "Chk.s r1" is placed just before the use r3 = r1 + 1 below the branch. A conceptual C sketch follows below.]
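A conceptual C sketch of the load.s / chk.s idea; this is an assumed mapping for illustration, not actual IPF code. The deferred-fault flag stands in for the NaT bit a speculative load would set, and the recovery path is simplified to an error return.

#include <stdbool.h>
#include <stddef.h>

typedef struct { long value; bool deferred_fault; } spec_load_t;

/* load.s analogue: never traps, it just records that a fault is pending. */
static spec_load_t speculative_load(const long *p) {
    spec_load_t r = {0, false};
    if (p == NULL) r.deferred_fault = true;   /* stand-in for setting a NaT bit */
    else r.value = *p;
    return r;
}

long hot_path(const long *p, bool branch_taken) {
    spec_load_t r1 = speculative_load(p);     /* load hoisted above the branch            */
    if (branch_taken)
        return 0;                             /* off-trace: the load result is never used */
    if (r1.deferred_fault)                    /* chk.s analogue on the needed path        */
        return -1;                            /* branch to recovery code                  */
    return r1.value + 1;                      /* r3 = r1 + 1                              */
}

int main(void) {
    long x = 41;
    return (int)(hot_path(&x, false) - 42);   /* 0 on success */
}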
33Code Motions in Trace Scheduling
- Case 5: Moving an instruction below a side entrance requires adjusting the entrance address and duplicating code.
[Figure: before/after diagrams; Inst 3 is moved below the side entrance, the entrance is redirected to the adjusted address, and Inst 3 is duplicated on the entering path]
34Code Motions in Trace Scheduling
1. Speculative execution
2. Needs code duplication
3. Needs code duplication and branch target adjustment
4. Needs code duplication
[Figure: a trace with a side exit and a side entrance; the labels 1-4 mark where each kind of code motion occurs]
35The Superblock: An Effective Technique for VLIW and Superscalar Compilation
- What is a Superblock?
- Why Superblocks?
- The implementation of Superblocks in the Impact
Compiler
36What is wrong with Trace Scheduling?
Code motions 3 and 4 add considerable bookkeeping complexity. Side entrances also make optimizations such as copy propagation more complex. The superblock eliminates side entrances.
[Figure: the same trace diagram as the previous slide, with the side exit and side entrance marked]
37Example of Copy Propagation in traces
[Figure: in a trace with a side entrance, r1 ← 1 cannot safely be propagated into r4 ← r1 + r2 because the side entrance (carrying r1 ← r3) may enter between the definition and the use; once the side entrance is removed, copy propagation and constant folding (r1 ← 1, r2 ← 3) reduce the computation to r4 ← 4]
38Superblock
- A Superblock is a trace that has NO side entrances.
- Control may enter only at the top but may leave at one or more exit points
- A Superblock is formed in two steps
- Traces are identified (using profile or static
branch prediction)
- Tail duplication to remove all side entrances
39Superblock formation example
[Figure: superblock formation on a CFG with blocks A through F; the selected trace is shown before and after tail duplication, with block F duplicated so the trace no longer has a side entrance]
40Tail Duplication in Superblocks
41Superblock ILP Optimization
- Superblock enlarging transformations
- Branch target expansion
- Loop peeling
- Loop unrolling
- Dependence removing transformations
- Register renaming
- Operation migration
- Induction variable expansion
- Accumulator variable expansion
- Operation combining
42Operation Migration
- Move an instruction from a superblock where its
result is not used to a less frequently executed
block.
[Figure: r2 ← r1 + r3, whose result is overwritten by r2 ← r4 + r1 before any use on the superblock, is migrated from before "Branch x" to the less frequently executed off-trace target; the hot path keeps only r1 ← A, Branch x, r4 ← B, r2 ← r4 + r1]
43Induction variable expansion
For (I=0; I<n; I++) A[I] = B[I];
After unrolling, each copy of the body gets its own induction variable so the copies do not serialize on a single increment of I:
    I1 = I + 1; I2 = I + 2;
    A[I] = B[I]; A[I1] = B[I1]; A[I2] = B[I2];
    I += 3; I1 += 3; I2 += 3;
In some cases, the increments of induction variables can be folded into the offsets of load and store instructions.
44Accumulator variable expansion
For (I=0; I<n; I++) A = A + B[I];
After unrolling, the single accumulator A serializes the additions; accumulator expansion gives each copy its own partial sum:
    S1 += B[I]; S2 += B[I+1]; S3 += B[I+2];
and the partial sums are combined after the loop: A = S1 + S2 + S3.
Not applicable to FP operations (FP addition is not associative) unless an option is provided by the user. (A C sketch follows below.)
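A minimal C sketch of accumulator variable expansion for an integer reduction (the note above is why floating point requires an explicit user option: FP addition is not associative).

#include <stdio.h>

#define N 12

/* Original loop: every addition depends on the previous one through A. */
static long sum_original(const long *B, int n) {
    long A = 0;
    for (int i = 0; i < n; i++)
        A += B[i];
    return A;
}

/* Unrolled by 3 with independent accumulators S1, S2, S3, so the
   additions within one unrolled iteration can execute in parallel. */
static long sum_expanded(const long *B, int n) {
    long S1 = 0, S2 = 0, S3 = 0;
    int i = 0;
    for (; i + 2 < n; i += 3) {
        S1 += B[i];
        S2 += B[i + 1];
        S3 += B[i + 2];
    }
    long A = S1 + S2 + S3;       /* combine the partial sums */
    for (; i < n; i++)           /* leftover iterations      */
        A += B[i];
    return A;
}

int main(void) {
    long B[N];
    for (int i = 0; i < N; i++) B[i] = i;
    printf("%ld %ld\n", sum_original(B, N), sum_expanded(B, N));   /* 66 66 */
    return 0;
}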
45Superblock Scheduling
- Typical scheduling
- Dependence graph construction (include both
control and data dependence)
- List scheduling (ready list)
- Speculative execution support
- Restricted model: don't move instructions that may cause traps (loads, stores, integer divide, FP operations) above branches
- Using non-trapping versions of those instructions
- (not a fully correct solution; a speculative check is needed)
46Implementation in the Impact Compiler
47Base Code Calibration
- The effect of superblock optimization is reported as speedup over the baseline code, so the question is: how good is the base code? On a DECstation 3100 (MIPS R2000 processor), the base code (compiled with -O) is faster than the MIPS compiler's code (by about 4%) and faster than the GNU compiler's code (by about 15%).
48Compile Time Cost
- Profiling
- Results are shown for profiling with only one input set
- Superblock formation
- About 2% to 23% of base compilation time
- Superblock optimization
- Average 101% of base compile time
- Worst case: 522% of base compile time (the cmp benchmark)
49Conclusion from IMPACT-I
- The IMPACT-I compiler has shown the performance potential of superblock formation and optimization techniques.
- Superblock formation/scheduling adds about 14% more code and roughly 100% more compile time.
- Speculative execution (with non-trapping instructions) improves performance by 13% to 143%.
- With 64KB I- and D-caches, the performance gain from superblock techniques remains.
- Superscalar and VLIW processors need superblock techniques.
50Tree Groups
- Traces and superblocks are based on the principle that conditional branches are biased, but some branches are 50-50 or 30-70. When that happens, there is overhead at the exits and optimization opportunities are missed.
- Tree groups (or tree regions) are a generalization of superblocks. They still have one entrance and multiple exits, but they include multiple flows of control rather than a single flow of control.
51Tree groups vs. Superblocks
[Figure: CFG with blocks A through G used in the following comparison of superblocks and tree groups]
52Superblocks
[Figure: the superblock formed from the single dominant path A, D, E, G]
53Tree groups
Single entry, multiple exits, with multiple control flows
[Figure: the tree group rooted at A; it includes both directions of the branches (B, C, D, E, F), and block G is duplicated on several leaves]
54Why Superblock in VM
[Figure: a trace shown with compensation code blocks attached at its side entrances and exits, contrasted with a superblock, which needs none of that compensation code; this is why VMs favor superblocks]
55Superblock Formation in VM
- Start Points (a code sketch of the whole formation procedure follows this list)
- When a block's use count reaches a threshold
- Profile all blocks (UQDBT)
- Profile selected blocks (Dynamo)
- Profile only targets of backward branches (which close loops)
- Profile exits from existing superblocks
- Continuation
- Use hottest edges above a threshold (UQDBT)
- Follow current control path (most recent edge)
(Dynamo)
- End Points
- Start point of this superblock
- Start point of some other superblock
- When a maximum size is reached
- When no edge above threshold can be found
(UQDBT)
- When an indirect jump is reached
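A rough C sketch of threshold-based superblock growth following the start/continuation/end rules above (roughly in the UQDBT style); the Block structure, threshold, and size limit are hypothetical stand-ins for a real translator's data.

#include <stdbool.h>
#include <stddef.h>

#define MAX_SB_BLOCKS  32
#define EDGE_THRESHOLD 50

typedef struct Block Block;
struct Block {
    Block   *succ[2];          /* up to two direct successors          */
    unsigned edge_count[2];    /* edge profile counters                */
    bool     is_sb_start;      /* already the start of some superblock */
    bool     ends_in_indirect; /* indirect jump terminates growth      */
};

/* Grow a superblock starting at 'start' by repeatedly following the
   hottest outgoing edge whose count is above the threshold. Returns the
   number of blocks collected into 'out' (to be translated, with tail
   duplication removing any side entrances). */
size_t form_superblock(Block *start, Block *out[]) {
    size_t n = 0;
    Block *b = start;
    while (n < MAX_SB_BLOCKS) {                  /* end: maximum size reached */
        out[n++] = b;
        if (b->ends_in_indirect)                 /* end: indirect jump */
            break;
        int hot = (b->edge_count[1] > b->edge_count[0]) ? 1 : 0;
        Block *next = b->succ[hot];
        if (next == NULL || b->edge_count[hot] < EDGE_THRESHOLD)
            break;                               /* end: no edge above threshold */
        if (next == start || next->is_sb_start)
            break;                               /* end: start point of a superblock */
        b = next;
    }
    start->is_sb_start = true;
    return n;
}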
56Dynamic Optimization Overview
57Optimization and Compatibility
- Trap compatible
- A process VM implementation is trap compatible if any trap that would occur during the native execution of a source instruction is also observed during the emulation.
- At the time of a trap, the memory and register state becomes visible, so their compatibility must be maintained.
- Implications
- Instructions may not be deleted
- Registers must be updated in their original order
- Stores must preserve their original order
- Accesses to volatile locations should be treated as stores
58Register State Compatibility
Source:             Target (reordered):          Target (with scratch register):
r1 ← r2 + r3        r1 ← r2 + r3                 r1 ← r2 + r3
r9 ← r1 + r5        r6 ← r1 + r7                 s1 ← r1 + r7
r6 ← r1 + r7        r9 ← r1 + r5   (trap?)       r9 ← r1 + r5   (trap?)
r3 ← r6 + r1        r3 ← r6 + r1                 r6 ← s1
                                                 r3 ← r6 + r1
If the reordered target traps at r9 ← r1 + r5, register r6 has already been updated out of source order. Writing the early result to scratch register s1 and committing r6 only after the potentially trapping instruction preserves the source register state.
59Example Intel IA32 EL
- Software method for running IA-32 binaries on IPF
- Previous approach was in hardware (iVE)
- Runs with both Windows and Linux
- OS independent section (BTgeneric)
- OS dependent section (BTlib)
- System services
- Two stages
- Fast binary translation (cold code)
- Optimized binary translation (hot code)
- Precise traps are an important consideration
60Cold Code Translation
61Cold Code Translation
- Cold code is generated at basic-block
granularity.
- Simple analysis is done on neighboring blocks
(1-20 basic blocks) for better code generation
- decoding
- building a flow graph
- computing the liveness of IA-32 EFlags bits
- tracking floating point (FP) stack changes
- Translation uses prepared (hand tuned)
translation templates for each IA-32 instruction
- Instrumentation code added
- Block counter, edge counter, misalignment
detection, indirect branch targets
- Backpatching is used to link translated blocks
62Hot Code Translation
63Hot Code Translation
- Trace selection (hyper-block selection)
- Decode and analysis
- IL generation
- Adds misalignment avoidance code
- Tracks IA-32 addresses and their values for CSE.
- Tracks register values for simplifying the
translation.
- Eliminates EFlags generation
- Analyzes FP stack flow and SSE format
conversions.
- Performs other FP optimizations, such as register
allocation and FXCHG elimination
- Build dependence graph
- Dead IL elimination
- Mark sideway ILs
- Annotates weights
- Register renaming
- Control and data speculation
- Scheduling, recovery information
- Cross linking, place generated code in code
cache
Hot code translation is about 20 times slower than cold code translation.
64Precise State Support
- Exception handling
- Once an exception is raised in the translated
code, the OS calls IA-32 EL first.
- IA-32 EL filters out IA-32 unrelated exceptions
- Reconstruct IA-32 state
- Simulating IA-32 application exception handler
- Some exceptions should not happen, and must be
avoided (e.g. masked FP exception)
65Precise State Support (cont.)
- IA-32 state reconstruction for cold code
- Ensure IA-32 state change happens only after
executing the last Itanium instruction that can
fault (i.e. memory and FP instructions).
- IA-32 IP is stored in a dedicated register (with
some additional information) and updated during
execution.
66Precise State Support (cont.)
- IA-32 state reconstruction for hot code
- Challenges
- In hot code, Itanium instructions originating
from different IA-32 instructions are usually
inter-mixed, and the IA-32 state (registers) is
often represented by other registers (for
example, in case of register renaming). As a
result, exceptions in hot code may appear in an
incorrect order and redundant exceptions may
occur.
- Hot blocks are composed of several IA-32 basic blocks and may contain branches, loops, and predicated if-then-else code sequences.
67Precise State Support (cont.)
- IA-32 state reconstruction for hot code
- Commit points
- A commit point is a barrier enabling the translator to generate a consistent IA-32 state.
- The translator associates several faulting points in the code with a single commit point.
- The commit point is the reordering boundary.
- Within a single IA-32 instruction translation, the state update occurs after the last faulting IPF instruction.
- The first commit point is usually set at the beginning of a block.
- IA-32 state is copied to backup registers at a commit point.
68IA-32 Specific Optimizations
- Floating point/MMX
- IPF uses large flat register file
- IA-32 uses stack register file
- IA-32 TAG indicates valid entries
- IA-32 aliases MMX regs to FP regs
- Speculate common case usage and put guard code at
beginning of block, for example
- TOS (Top of Stack) same for all block executions
- No invalid accesses (indicated by TAG)
- 99-100% accurate
- Data Misalignment
- Similar to FX!32
69IA-32 EL Performance
70IA-32 EL Performance
CPU2000: mostly in hot code, very little overhead.
Sysmark: only 45% hot code, 22% in OS (IPF native code).
71Same ISA Optimization
- Objective: optimize binaries on the fly
- Many binaries are un-optimized or compiled at a low optimization level
- Initial emulation can be done very efficiently
- Translation at basic block level is identity
translation
- Initial sample-based profiling is attractive
- Original code can be used, running at native
speeds
- Code patching can be used (e.g. ADORE)
- Patch code cache regions into original code
- Replace original code with branches into the code cache (saves some code duplication; a sketch follows below)
- Can avoid hash table lookup on indirect jumps
- Can bail-out if performance is lost
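A rough sketch of the code-patching idea, assuming x86-64 and that the page holding the original code can be made writable; real systems such as ADORE do this far more carefully (instruction boundaries, atomicity with respect to other threads, and so on).

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the start of an original basic block with a 5-byte
   'jmp rel32' to its optimized translation in the code cache, so future
   executions enter the code cache directly, with no hash-table lookup. */
int patch_branch(void *orig_block, void *code_cache_entry) {
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)orig_block & ~(uintptr_t)(pagesz - 1));

    /* Make the original code writable (it normally is not); cover two
       pages in case the patch site straddles a page boundary. */
    if (mprotect(page, (size_t)pagesz * 2, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;

    /* rel32 is relative to the instruction following the 5-byte jmp. */
    int64_t rel = (int64_t)((uintptr_t)code_cache_entry - ((uintptr_t)orig_block + 5));
    if (rel > INT32_MAX || rel < INT32_MIN)
        return -1;                       /* target too far for a jmp rel32 */

    uint8_t patch[5];
    int32_t rel32 = (int32_t)rel;
    patch[0] = 0xE9;                     /* opcode for jmp rel32 */
    memcpy(&patch[1], &rel32, 4);
    memcpy(orig_block, patch, 5);        /* not atomic; a real DBO must take care */
    return 0;
}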
72Additional readings for Chapter 4
- IA-32 EL
- Dynamo
- DynamoRIO
- ADORE