1
Dynamic Binary Optimization
  • After compatibility, performance is the next
    consideration.
  • In many VMs, simple optimizations are performed
    to smooth out rough edges
  • In some VMs, aggressive optimizations can close
    the gap between a guest's emulated performance
    and native platform performance
  • Profiles serve as a guide for making optimization
    decisions. Runtime-collected profiles are more
    accurate.

2
Optimization Example
3
Types of Profiles
  • Block or node profiles
  • Identify hot code blocks
  • Fewer nodes than edges
  • Edge profiles
  • Give a more precise idea of program flow
  • Block profile can be derived from edge profile
    (not vice versa)

4
Path Profiles
  • A Path profile subsumes an edge profile by
    counting paths containing multiple edges.
  • For superblock formation, the path profile is the
    most appropriate type of profile.
  • Simply following the most frequent edges from an
    edge profile does not always identify the most
    frequent paths.
  • Path profiles can be collected efficiently, but
    would require up-front program analysis to
    determine where to place profile probes.
  • Hardware support can give information on paths
    efficiently.
  • branch trace information
  • global branch history taken/not-taken bit mask
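A cheap software analogue of the last two bullets is to shift a taken/not-taken bit into a global history word at every conditional branch and use the accumulated mask to name the path. A minimal sketch in C; the table size, hash, and function names are illustrative only, not taken from any particular system:

    #include <stdint.h>

    #define PATH_TABLE_SIZE 4096

    static uint32_t path_history;                 /* taken/not-taken bits of recent branches */
    static uint32_t path_count[PATH_TABLE_SIZE];  /* per-path execution counters */

    /* Called (conceptually) at every conditional branch on the profiled path. */
    static inline void profile_branch(int taken)
    {
        path_history = (path_history << 1) | (taken ? 1u : 0u);
    }

    /* Called where a candidate path ends (e.g. at a backward branch): the
       accumulated bit mask records which way every branch on the path went. */
    static inline void profile_path_end(uint32_t start_pc)
    {
        uint32_t key = (start_pc ^ path_history) % PATH_TABLE_SIZE;
        path_count[key]++;
        path_history = 0;    /* start collecting the next path */
    }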

5
Other Profiles
  • Cache miss profiles
  • L1/L2/L3 miss profiles
  • coherence miss profiles
  • miss traffic profiles
  • useless prefetch profiles
  • Value prediction profiles
  • Speculative execution profiles
  • data speculation check profiles
  • Data dependence profiles
  • Exception profiles
  • Indirect branch profiles
  • and more

6
Collecting Profiles
  • Instrumentation-based
  • Software probes
  • Slows down program more
  • Can be intrusive
  • Can be bursty
  • Hardware probes
  • Less overhead than software
  • Less well-supported in processors
  • Typically event counters
  • Sampling based
  • Interrupt at random intervals and take sample
  • Slows down program less
  • Requires longer time to get same amount of data
  • Not useful during interpretation

7
Profiling During Interpretation
  • Profiling code added to interpreter routines
  • can be applied to specific instruction types
  • can be applied to certain classes of
    instructions (e.g. backward branches)
  • Profile table can be merged with the translation
    lookup table.
  • Profile counter decaying
  • saturating counters
  • automatic decay: the profile manager periodically
    divides all profile counts by 2 (a right shift by 1);
    see the sketch below
  • Profiling jump instructions
  • May need to maintain multiple target addresses
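The decay policy above amounts to a periodic sweep over the profile table. A minimal C sketch, assuming 8-bit saturating counters; the sizes and names are illustrative:

    #include <stdint.h>

    #define PROFILE_ENTRIES 8192
    #define COUNTER_MAX     255          /* 8-bit saturating counters */

    static uint8_t profile_count[PROFILE_ENTRIES];

    /* Saturating increment: stick at COUNTER_MAX instead of wrapping around. */
    static inline void profile_bump(unsigned idx)
    {
        if (profile_count[idx] < COUNTER_MAX)
            profile_count[idx]++;
    }

    /* Automatic decay: the profile manager calls this periodically so that
       stale hot spots fade away (divide every count by 2, a right shift by 1). */
    static void profile_decay_all(void)
    {
        for (unsigned i = 0; i < PROFILE_ENTRIES; i++)
            profile_count[i] >>= 1;
    }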

8
Profiling Translated Code
Translated Basic Block

Fall-through stub:
  increment edge counter(i)
  if counter(i) >= trigger then invoke optimizer
  else branch to fall-through basic block

Branch target stub:
  increment edge counter(j)
  if counter(j) >= trigger then invoke optimizer
  else branch to target basic block
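In C terms each stub only bumps an edge counter and hands control to the optimizer once a threshold is crossed. A hedged sketch; the trigger value and helper functions are invented for illustration:

    #define TRIGGER 50                 /* hypothetical hotness threshold */

    extern unsigned edge_counter[];                  /* one counter per translated edge */
    extern void invoke_optimizer(int edge_id);       /* build a superblock from here */
    extern void branch_to_translated_block(int edge_id);

    /* Logic emitted at the fall-through (or branch-target) exit of a block. */
    void profile_stub(int edge_id)
    {
        edge_counter[edge_id]++;
        if (edge_counter[edge_id] >= TRIGGER)
            invoke_optimizer(edge_id);            /* hot: reoptimize along this edge */
        else
            branch_to_translated_block(edge_id);  /* cold: continue as before */
    }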
9
Optimizing Translated Blocks
  • Use dominant control flow for enhancing memory
    locality
  • Enlarged basic blocks (traces, superblocks, tree
    groups) are optimized.
  • Performance metric guided optimizations
  • cache related optimizations such as prefetching
  • value related optimizations (e.g.
    specialization)
  • failure related optimizations such as
    speculative optimization, alignment optimization,
    and so on.

10
Memory Locality Enhancement
[Figure: control-flow graph of basic blocks A through G annotated with edge
profile counts (e.g. 70, 68, 97, 30, 29, 2, 1); the conditional branch in A
is shown as taken (Br cond == T). Caption: Initially arranged BBs.]
11
Memory Locality Enhancement
[Figure: the same control-flow graph of blocks A through G with edge profile
counts. Caption: Initially arranged BBs.]
12
Memory Locality Enhancement
[Figure: the basic blocks rearranged so that the hot blocks fall through to
one another; the branch condition in A is complemented (Br cond == F) so the
frequent successor becomes the fall-through. Caption: Rearrangement to
improve spatial locality.]
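One simple way to obtain such a layout is a greedy pass that starts at the entry block and keeps appending the hottest not-yet-placed successor, so that hot edges become fall-throughs. A sketch under that assumption; the data structures are illustrative:

    #define MAX_BLOCKS 64

    /* edge_count[i][j]: profiled executions of the edge from block i to block j. */
    extern unsigned edge_count[MAX_BLOCKS][MAX_BLOCKS];

    /* Chain blocks so that hot successors are laid out as fall-throughs. */
    int layout_blocks(int entry, int nblocks, int order[])
    {
        int placed[MAX_BLOCKS] = {0};
        int n = 0, cur = entry;

        while (cur >= 0) {
            order[n++] = cur;
            placed[cur] = 1;

            /* follow the hottest outgoing edge to a block not yet placed */
            int best = -1;
            unsigned best_count = 0;
            for (int succ = 0; succ < nblocks; succ++)
                if (!placed[succ] && edge_count[cur][succ] > best_count) {
                    best = succ;
                    best_count = edge_count[cur][succ];
                }
            cur = best;   /* -1 ends the chain; leftover blocks start new chains */
        }
        return n;         /* number of blocks placed in this chain */
    }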
13
Dynamic Call Inlining
Follow dominant flow of control
14
Pros and Cons of Partial Inlining
  • Increased spatial locality
  • Enlarged extended basic blocks (larger scope
    for optimization)
  • Reduced parameter-passing overhead
  • May reduce call/return overhead
  • May increase dynamic code size (due to excessive
    code duplication) and increase Icache and ITLB
    miss rates.
  • May increase register pressure
  • May increase translation time

15
Traces, Superblocks, Treegions
  • Three common ways to rearrange basic blocks
    according to control flow
  • Trace formation
  • follows control flow naturally
  • Superblock formation
  • more widely used by VM implementations
  • enforces a single-entry, multiple-exit trace (often
    using tail duplication)
  • more amenable to inter-basic-block
    optimizations
  • Tree group formation
  • provide a wider scope than superblocks

16
Trace Scheduling and VLIW Processors
  • VLIW Processors
  • Each long instruction contains multiple
    operations (branches, Loads and Stores, Int/FP
    operations, reg-reg transfers) that are executed
    in parallel.
  • No need to track data dependences because
    compiler packs independent operations into
    instructions after analyzing dependences. (not
    entirely true, e.g. memory operations)
  • Complexity of HW dependence checking is replaced by
    complexity in the compiler. (BTW, what is RISC?
    Relegating Important Stuff to the Compiler.)

17
Example Multiflow 500
  • Multiflow 500 can issue up to 28 operations in
    each instruction (instructions can be up to
    1024 bits long).

[Figure: pipeline diagram with Fetch, Decode, Execute, and Writeback stages.]
18
Historical Background
  • 1970s
  • FPS-164 and MARS-432
  • 1980s
  • Multiflow (Fisher), Cydrome (Rau)
  • Competing with Convex, Ardent, VAX, Cray,
  • Lost the battle to superscalar processors (RS6000,
    PA-RISC, MIPS, and the so-called killer micros)
  • Late 90s to 2000s
  • DSP processors (Philips Trimedia)
  • EPIC (IA-64 processors)
  • Was the trend for a while (Sun's MAJC, Transmeta,
    Daisy, ...) until multi-core became the new wave

19
Control Dependences - Instruction Window

Superscalar: Hardware branch prediction guides fetching of instructions to
fill up the instruction window. Instructions are issued from the window as
they become ready, that is, out-of-order execution is possible. Speculative
execution is also possible with out-of-order execution.

VLIW/EPIC: Programs are first profiled. The compiler uses the profiles to
trace out likely paths. A trace is a software instruction window.
Instruction reordering is performed by the compiler within the trace.
20
Data Dependences - Exploiting ILP
Superscalar: Memory dependences: HW load-store disambiguation techniques
used for enabling out-of-order execution. False register dependences avoided
using register renaming. True data dependences must be honored. Value
prediction for out-of-order execution of dependent instructions.

VLIW/EPIC: Memory dependences detected by the compiler using dependency
analysis, with HW support for advanced loads. False data dependences avoided
by the compiler through renaming (memory) and register allocation. True data
dependences are strictly followed.
21
General Comparisons
Superscalar: Smaller code size (no nops, no compensation code, no code
duplication, ...). Binary compatible!! Hiding unpredictable latencies. More
adaptable!

VLIW/EPIC: Simpler hardware: may be easier to implement, easier to verify,
less expensive, smaller die, less power consumption, higher clock rate.
Early VLIWs don't have interlocks. Early VLIWs have clustered register
files. EPIC addresses some binary compatibility issues (not completely).
22
Why VLIW?
  • Superscalar out-of-order implementation is not
    scalable for exploiting ILP
  • Runtime data dependency check complexity not
    scalable
  • Register renaming complexity
  • Large instruction reordering window is expensive
    to implement
  • HW complexity may limit the clock rate, make
    verification more difficult

23
Trace Scheduling is vital for VLIW
  • Typical basic blocks contain about 5
    instructions. These 5 instructions usually have
    data dependences. So how can the compiler exploit
    ILP and make use of the long instruction word?
  • Trace Scheduling
  • It exploits ILP across basic block boundaries

24
Trace Scheduling
  • Trace Selection
  • Find a likely sequence of basic blocks (a trace):
    a long (statically predicted or profile-predicted)
    sequence of straight-line code
  • Trace Compaction
  • Squeeze the trace into as few VLIW instructions
    as possible
  • Need bookkeeping code in case prediction is
    wrong

25
Trace Selection
  • Trace selection is based on either static branch
    prediction or edge profiles.
  • Example of common traces

[Code examples: (1) simple loop unrolling, where the unrolled body repeats
"a[I] = b...; if (I >= n) exit" several times along the trace; (2) error
checks, where rarely taken "if (error) ..." branches leave a long
straight-line hot path.]
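As a concrete illustration of the error-check pattern, consider a loop whose error branches almost never fire; the selected trace is the straight-line sequence of fall-through blocks. The code and helpers below are hypothetical:

    struct record { int value; };

    /* Assumed helpers: return a negative value only on (rare) errors. */
    extern int read_record(struct record *r);
    extern int validate(const struct record *r);

    long sum_records(struct record *buf, int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++) {
            if (read_record(&buf[i]) < 0)   /* error check: rarely taken, off-trace */
                return -1;
            if (validate(&buf[i]) < 0)      /* error check: rarely taken, off-trace */
                return -1;
            total += buf[i].value;          /* hot fall-through path forms the trace */
        }
        return total;
    }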
26
Code Motions in Trace Scheduling
  • Case 1: If a trace operation moves below a
    conditional branch, a copy of it must be placed
    on the off-trace edge.

[Figure: Inst 2 is moved below Branch x; a copy of Inst 2 is placed on the
off-trace edge leading to Inst x, Inst x+1.]
27
Code Motions in Trace Scheduling
  • Why do we want to move an operation downwards? To
    hide latency, to schedule it for free, ...

[Figure: the operation r2 = r1 + 1, which depends on Load r1, is moved below
Branch x to hide the load latency; a copy is placed on the off-trace edge.]
28
Code Motions in Trace Scheduling
  • Case 2: If a trace operation moves above a
    rejoin point, a copy of it must be placed on the
    off-trace rejoin edge.

[Figure: Inst 3 is moved above the rejoin; a copy of Inst 3 is placed on the
off-trace rejoin edge.]
29
Code Motions in Trace Scheduling
  • Case 3: If a trace operation writes to a variable
    and the variable is live on the off-trace edge,
    this operation cannot be moved above the branch.
    However, register renaming can be used.

[Figure: the trace assigns r3 below the branch while r3 holds a live value
on the off-trace edge, so the assignment cannot be hoisted above the branch
without renaming r3.]
30
Code Motions in Trace Scheduling
  • Case 4: Speculative code motion. How can we hide
    the load latency?

[Figure: Load r1 and its use r3 = r1 + 1 sit below Branch x on the trace;
hoisting the load above the branch would hide its latency.]
31
Code Motions in Trace Scheduling
  • Case 4: Speculative code motion. Move the load
    above the branch. But ...

The load may trap!!!
Some architectures introduce non-trapping loads
for speculative code motion.
But what if the moved load really needs to trap?!

[Figure: Load r1 is hoisted above Branch x; its use r3 = r1 + 1 stays below.]
32
Code Motions in Trace Scheduling
  • Case 4: Speculative code motion. IA-64 introduces
    speculative loads and checks for faults.

If the speculative load (load.s) causes an
exception, then the exception should only be
serviced if the condition is true.
The check (chk.s) verifies whether an exception
has occurred and, if so, branches to recovery
code.

[Figure: Load.s r1 is hoisted above Branch x; Chk.s r1 sits at the original
load position, before the use r3 = r1 + 1.]
33
Code Motions in Trace Scheduling
  • Case 5: Moving an instruction below a side
    entrance

[Figure: an instruction from the trace is moved below a side entrance.]

Need to adjust the entrance address and perform code
duplication.
34
Code Motions in Trace Scheduling
  • Summary

(1) Speculative execution
(2) Need code duplication
(3) Need code duplication and branch target adjustment
(4) Need code duplication

[Figure: a trace with a side exit and a side entrance; the four kinds of
code motion are numbered 1-4 relative to them.]
35
Superblock: An Effective Technique for VLIW and
Superscalar Compilation
  • What is a Superblock?
  • Why Superblocks?
  • The implementation of Superblocks in the Impact
    Compiler

36
What is wrong with Trace Scheduling?
Code motions 3 and 4 add much bookkeeping complexity. Side entrances also
make optimizations such as copy propagation more complex. Superblocks try to
eliminate side entrances.

[Figure: a trace with a side exit and a side entrance; code motions 1-4.]
37
Example of Copy Propagation in traces
[Figure: on the trace, r1 = 1 and r2 = 3 feed r4 = r1 + r2; with no side
entrance, copy propagation folds the constants and yields r4 = 4, but a side
entrance joining the trace before r4 = r1 + r2 makes that propagation
unsafe.]
38
Superblock
  • A Superblock is a Trace which has NO side
    entrance.
  • Control may enter at the top but may leave at one
    or more exit points
  • A Superblock is formed in two steps
  • Traces are identified (using profile or static
    branch prediction)
  • Tail duplication to remove all side entrances
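At the source level, tail duplication amounts to giving the cold path its own copy of the shared tail so that the hot path is entered only at the top. A schematic C example; the block names and helpers are hypothetical:

    extern void do_B(void), do_C(void), do_D(void), do_E(void), do_F(void);

    /* Before: block F is reached from both the hot path (B, E) and the cold
       path (C, D), so a trace A-B-E-F would have a side entrance at F. */
    void before(int hot)
    {
        if (hot) { do_B(); do_E(); }    /* frequent path   */
        else     { do_C(); do_D(); }    /* infrequent path */
        do_F();                          /* shared tail: the side entrance */
    }

    /* After tail duplication: the superblock A-B-E-F has a single entry;
       the cold path falls into its own duplicated copy F'. */
    void after(int hot)
    {
        if (hot) { do_B(); do_E(); do_F(); }   /* superblock, no side entrance */
        else     { do_C(); do_D(); do_F(); }   /* F': duplicated tail          */
    }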

39
Superblock formation example
[Figure: control-flow graph with blocks A through F; the hot trace is
selected and the shared tail block F is duplicated so that the resulting
superblock has no side entrance.]
40
Tail Duplication in Superblocks
41
Superblock ILP Optimization
  • Superblock enlarging transformations
  • Branch target expansion
  • Loop peeling
  • Loop unrolling
  • Dependence removing transformations
  • Register renaming
  • Operation migration
  • Induction variable expansion
  • Accumulator variable expansion
  • Operation combining

42
Operation Migration
  • Move an instruction from a superblock where its
    result is not used to a less frequently executed
    block.

[Figure: r2 = r1 op r3 is computed before Branch x, but its result is only
needed on the off-trace path (the trace redefines r2 from r4 before any
use); the operation migrates into the less frequently executed off-trace
block.]
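A source-level analogue of the migration; the registers and helpers are hypothetical, and the real transformation is applied to the translated instructions rather than to C code:

    extern int A, B, r3;
    extern void off_trace(int r2);   /* infrequently executed block */

    /* Before migration: r2 is computed on the hot path even though only the
       off-trace block uses it (the hot path redefines r2 before reading it). */
    int hot_path_before(int cond)
    {
        int r1 = A;
        int r2 = r1 + r3;                              /* unused on the hot path */
        if (cond) { off_trace(r2); return 0; }
        int r4 = B;
        r2 = r4 + r1;                                  /* redefined before any use */
        return r2;
    }

    /* After migration: the computation moves into the rarely taken block. */
    int hot_path_after(int cond)
    {
        int r1 = A;
        if (cond) { off_trace(r1 + r3); return 0; }    /* migrated operation */
        int r4 = B;
        return r4 + r1;
    }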
43
Induction variable expansion
For (I = 0; I < n; I++) A[I] = B[I];

After unrolling by 3 with induction variable expansion (each copy uses its
own induction variable):
  I1 = I + 1; I2 = I + 2;
  A[I] = B[I]; A[I1] = B[I1]; A[I2] = B[I2];
  I += 3; I1 += 3; I2 += 3;
In some cases, the increment of Induction variab
les can be folded into the offset of load and
store instructions
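Written out in C, the unrolled loop keeps one induction variable per copy so the address computations form independent dependence chains. A sketch of the transformation, with a cleanup loop for trip counts not divisible by 3:

    void copy_unrolled(int *A, const int *B, int n)
    {
        /* Unrolled by 3 with induction variable expansion: i, i1 and i2 are
           updated independently, so the three copies share no serial chain. */
        int i = 0, i1 = 1, i2 = 2;
        while (i2 < n) {
            A[i]  = B[i];
            A[i1] = B[i1];
            A[i2] = B[i2];
            i += 3; i1 += 3; i2 += 3;
        }
        for (; i < n; i++)               /* cleanup for n not divisible by 3 */
            A[i] = B[i];
    }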
44
Accumulator variable expansion
For (I = 0; I < n; I++) A = A + B[I];

After unrolling by 3 with accumulator variable expansion, each copy adds
into its own partial sum:
  S1 = S1 + B[I]; S2 = S2 + B[I+1]; S3 = S3 + B[I+2]; I += 3;
and after the loop: A = S1 + S2 + S3.

Not applicable to FP operations unless an option
is provided by the user.
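The corresponding C sketch keeps one partial sum per unrolled copy and combines them after the loop. As the slide notes, this reassociates the additions, which is why it is not applied to FP code unless the user explicitly allows it:

    int sum_unrolled(const int *B, int n)
    {
        /* Accumulator variable expansion: s1, s2, s3 break the single serial
           dependence on the accumulator across the unrolled copies. */
        int s1 = 0, s2 = 0, s3 = 0, i = 0;
        for (; i + 2 < n; i += 3) {
            s1 += B[i];
            s2 += B[i + 1];
            s3 += B[i + 2];
        }
        for (; i < n; i++)          /* cleanup iterations */
            s1 += B[i];
        return s1 + s2 + s3;        /* combine the partial sums */
    }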

45
Superblock Scheduling
  • Typical scheduling
  • Dependence graph construction (include both
    control and data dependence)
  • List scheduling (ready list)
  • Speculative execution support
  • Restricted model: don't move instructions that
    may cause traps, such as loads, stores, integer
    divides, FP operations
  • Using non-trapping versions of those instructions
    (not a correct solution on its own; a speculation
    check is needed)

46
Implementation in the Impact Compiler
  • Compiler Size comparison

47
Base Code Calibration
  • The effect of Superblock optimization will be
    reported as speedup over the baseline code. So
    the question is: how good is the base code? On a
    DECstation 3100 (MIPS R2000 processor), the base
    code (compiled with -O) is faster than the MIPS
    compiler's output (by 4%) and faster than the GNU
    compiler's (by 15%).

48
Compile Time Cost
  • Profile
  • Only profiling with one input set is shown
  • Superblock formation
  • About 2% to 23% of base compilation time
  • Superblock optimization
  • Average 101% of base compile time
  • One worst case: 522% (cmp) of base compile time

49
Conclusion from IMPACT-I
  • The Impact-I compiler has shown the performance
    potential of superblock formation and
    optimization techniques.
  • Superblock formation/scheduling adds 14% more code
    and accounts for about 100% more compile time.
  • Speculative execution (with non-trapping
    instructions) improves performance by 13% to
    143%.
  • With 64KB I and D-cache, performance gain from
    superblock techniques remains.
  • Superscalar and VLIW processors need superblock
    techniques.

50
Tree Groups
  • Traces and superblocks are based on the principle
    that conditional branches are biased, but some
    branches are 50-50 or 30-70. When that happens,
    there is overhead involved at the exits, and there
    are missed optimization opportunities.
  • Tree groups (or tree regions) are a
    generalization of superblocks. They still have a
    single entrance and multiple exits, but they include
    multiple flows of control rather than a single
    flow of control.

51
Tree groups vs. Superblocks
[Figure: control-flow graph with blocks A through G used in the comparison
on the next two slides.]
52
Superblocks
[Figure: the superblock formed along the path A, D, E, G.]
53
Tree groups
Single entry, multiple exits, with multiple
control flows.

[Figure: a tree region rooted at A spanning both branch directions (B, C, D,
E, F), with the join block G duplicated on each path.]
54
Why Superblock in VM
[Figure: a trace surrounded by compensation-code blocks at its side
entrances and exits, contrasted with a superblock that avoids this
compensation code.]
55
Superblock Formation in VM
  • Start Points
  • When block use reaches a threshold
  • Profile all blocks (UQDBT)
  • Profile selected blocks (Dynamo)
  • Profile only targets of backward branches (close
    loops)
  • Profile exits from existing superblocks
  • Continuation
  • Use hottest edges above a threshold (UQDBT)
  • Follow current control path (most recent edge)
    (Dynamo)
  • End Points
  • Start point of this superblock
  • Start point of some other superblock
  • When a maximum size is reached
  • When no edge above threshold can be found
    (UQDBT)
  • When an indirect jump is reached
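Taken together, these policies amount to a simple growth loop. A hedged C sketch of how a translator might implement it; the types, thresholds, and helper functions are invented for illustration:

    #define HOT_EDGE_THRESHOLD 16
    #define MAX_SUPERBLOCK_LEN 32

    struct block;                                        /* a translated basic block */
    extern struct block *hottest_successor(struct block *b, unsigned min_count);
    extern int is_superblock_start(struct block *b);     /* start of this or another SB */
    extern int ends_in_indirect_jump(struct block *b);

    /* Grow a superblock from a hot start point by following hot edges. */
    int form_superblock(struct block *start, struct block *sb[])
    {
        int len = 0;
        struct block *cur = start;

        while (cur && len < MAX_SUPERBLOCK_LEN) {
            sb[len++] = cur;                  /* blocks with other predecessors get
                                                 tail-duplicated when code is emitted */
            if (ends_in_indirect_jump(cur))
                break;                        /* end point: indirect jump reached */
            struct block *next = hottest_successor(cur, HOT_EDGE_THRESHOLD);
            if (!next || next == start || is_superblock_start(next))
                break;                        /* no hot edge, or another SB/loop start */
            cur = next;
        }
        return len;
    }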

56
Dynamic Optimization Overview
57
Optimization and Compatibility
  • Trap compatible
  • A process VM implementation is trap compatible
    if any trap that would occur during the native
    execution of a source instruction is also
    observed during the emulation.
  • At the time of a trap, the memory and register
    state becomes visible. Their compatibility must
    be maintained.
  • Implications
  • Instructions may not be deleted
  • registers must be updated in their original
    order
  • stores must preserve their original orders
  • access to volatiles should be treated as stores

58
Register State Compatibility
Source:
  r1 ← r2 op r3
  r9 ← r1 op r5
  r6 ← r1 op r7
  r3 ← r6 op r1

Target (reordered):
  r1 ← r2 op r3
  r6 ← r1 op r7
  r9 ← r1 op r5   (Trap?)
  r3 ← r6 op r1

Target (register state preserved via scratch register s1):
  r1 ← r2 op r3
  s1 ← r1 op r7
  r9 ← r1 op r5
  r6 ← s1
  r3 ← r6 op r1

If the reordered code traps at r9 ← r1 op r5, r6 has already been updated
out of its original order; writing into the scratch register s1 and
committing r6 afterwards keeps the register state consistent with the
source.
59
Example: Intel IA-32 EL
  • Software method for running IA-32 binaries
  • on IPF
  • Previous approach was in hardware (iVE)
  • Runs with both Windows and Linux
  • OS independent section (BTgeneric)
  • OS dependent section (BTlib)
  • System services
  • Two stages
  • Fast binary translation (cold code)
  • Optimized binary translation (hot code)
  • Precise traps are an important consideration

60
Cold Code Translation
61
Cold Code Translation
  • Cold code is generated at basic-block
    granularity.
  • Simple analysis is done on neighboring blocks
    (1-20 basic blocks) for better code generation
  • decoding
  • building a flow graph
  • computing the liveness of IA-32 EFlags bits
  • tracking floating point (FP) stack changes
  • Translation uses prepared (hand tuned)
    translation templates for each IA-32 instruction
  • Instrumentation code added
  • Block counter, edge counter, misalignment
    detection, indirect branch targets
  • Backpatching is used to link translated blocks

62
Hot Code Translation
63
Hot Code Translation
  • Trace selection (hyper-block selection)
  • Decode and analysis
  • IL generation
  • Adds misalignment avoidance code
  • Tracks IA-32 addresses and their values for CSE.
  • Tracks register values for simplifying the
    translation.
  • Eliminates EFlags generation
  • Analyzes FP stack flow and SSE format
    conversions.
  • Performs other FP optimizations, such as register
    allocation and FXCHG elimination
  • Build dependence graph
  • Dead IL elimination
  • Mark sideway ILs
  • Annotates weights
  • Register renaming
  • Control and data speculation
  • Scheduling, recovery information
  • Cross linking, place generated code in code
    cache

Hot code translation is about 20 times slower
64
Precise State Support
  • Exception handling
  • Once an exception is raised in the translated
    code, the OS calls IA-32 EL first.
  • IA-32 EL filters out IA-32 unrelated exceptions
  • Reconstruct IA-32 state
  • Simulating IA-32 application exception handler
  • Some exceptions should not happen, and must be
    avoided (e.g. masked FP exception)

65
Precise State Support (cont.)
  • IA-32 state reconstruction for cold code
  • Ensure IA-32 state change happens only after
    executing the last Itanium instruction that can
    fault (i.e. memory and FP instructions).
  • IA-32 IP is stored in a dedicated register (with
    some additional information) and updated during
    execution.

66
Precise State Support (cont.)
  • IA-32 state reconstruction for hot code
  • Challenges
  • In hot code, Itanium instructions originating
    from different IA-32 instructions are usually
    inter-mixed, and the IA-32 state (registers) is
    often represented by other registers (for
    example, in case of register renaming). As a
    result, exceptions in hot code may appear in an
    incorrect order and redundant exceptions may
    occur.
  • Hot blocks are composed of several IA-32 basic
    blocks and may contain branches, loops, and
    predicated if-then-else code sequences.

67
Precise State Support (cont.)
  • IA-32 state reconstruction for hot code
  • Commit points
  • a commit point is a barrier enabling the
    translator to generate a consistent IA-32 state
  • the translator associates several faulting points
    in the code with a single commit point
  • a commit point is the reordering boundary
  • within a single IA-32 instruction translation,
    the state update occurs after the last faulting IPF
    instruction
  • the first commit point is usually set at the
    beginning of a block
  • IA-32 state is copied to backup registers at the
    commit point

68
IA-32 Specific Optimizations
  • Floating point/MMX
  • IPF uses large flat register file
  • IA-32 uses stack register file
  • IA-32 TAG indicates valid entries
  • IA-32 aliases MMX regs to FP regs
  • Speculate common case usage and put guard code at
    beginning of block, for example
  • TOS (Top of Stack) same for all block executions
  • No invalid accesses (indicated by TAG)
  • 99-100% accurate
  • Data Misalignment
  • Similar to FX!32

69
IA-32 EL Performance
70
IA-32 EL Performance
CPU2000: mostly in hot code, very little overhead.
Sysmark: only 45% hot code, 22% in OS (IPF native code).
71
Same ISA Optimization
  • Objective: optimize binaries on-the-fly
  • Many binaries are un-optimized or are at a low
    optimization level
  • Initial emulation can be done very efficiently
  • Translation at basic block level is identity
    translation
  • Initial sample-based profiling is attractive
  • Original code can be used, running at native
    speeds
  • Code patching can be used (e.g. ADORE)
  • Patch code cache regions into original code
  • Replace original code with branches into the code
    cache (saves some code duplication)
  • Can avoid hash table lookup on indirect jumps
  • Can bail-out if performance is lost

72
Additional readings for Chapter 4
  • IA-32 EL
  • Dynamo
  • DynamoRIO
  • ADORE