DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers
  • Wei Chung Hsu
  • Computer Science and Engineering Department
  • University of Minnesota, Twin Cities

2
Dynamic Binary Optimization
  • Dynamic binary optimization is the detection of
    program hot spots and the application of
    optimizations to native binary code at run-time.
    Also called runtime binary optimization.
  • Why is static compiler optimization insufficient?

3
Why Dynamic Binary Optimization
  • One size does not fit all: runtime environments
    may differ from what the static binary was
    optimized for.
  • Underlying micro-architectures
  • e.g. running Pentium code on Pentium-II
  • Input data sets
  • e.g. some data sets may not incur cache misses
  • Dynamic phase behavior
  • Dynamic libraries

4
Portable Executable
(Diagram: a .EXE or .SO that ships the intermediate representation together with compile info)
5
Common Binary (fat binary)
(Diagram: one fat binary packaging Itanium-1, Itanium-2, and Itanium-3 binaries, each with its own annotation)
6
Chubby Binary
(Diagram: a common Itanium binary plus Itanium-1-, Itanium-2-, and Itanium-3-specific annotations)
7
Using More Accurate Profiles
(Diagram: a spectrum of increasingly accurate profiles.
At ISV sites: optimize from source; optimize from source with profile feedback; optimize from binary with profile feedback.
At user sites: walk-time (ahead-of-time) optimization; runtime optimization)
8
Dynamo
  • Dynamo means Dynamic Optimization System
  • A collaborative project between HP Labs (under
    Josh Fisher) and HP System Lab.
  • Built on the dynamic translation technology
    developed for ARIES (which migrates PA binaries to
    the Itanium architecture).
  • Considered revolutionary; won the best paper
    award at PLDI 2000
  • Dynamo technology was enhanced and continued at
    MIT and later became Dynamo/RIO.
  • The Dynamo/RIO group then started a company called
    Determina (http://www.determina.com/)

9
Migration vs. Dynamic Optimization
(Diagram, two columns:
Migration (e.g. ARIES): existing incompatible binary, emulator/interpreter, translator, code cache, memory.
DynOpt (e.g. Dynamo): native binary, emulator/interpreter, trace selector, optimizer, dyncode cache, memory)
10
Migration vs. Dynamic Optimization
(Same diagram, annotated:
Migration: the translator and code cache are an optional accelerator; optimization is a 2nd priority.
DynOpt (e.g. Dynamo): the optimizer is not optional; optimization is critical)
11
Why not Static Binary Translation?
  • The Code-Discovery Problem
  • What is the target of an indirect jump?
  • No guarantee that the locations immediately
    following a jump contain valid instructions
  • Some compilers intersperse data with
    instructions
  • More challenging for ISAs with variable-length
    instructions
  • padding to align instructions
  • The Code-Location Problem
  • How to translate indirect jumps? The target is
    not known until runtime.
  • Other problems
  • Self-modifying code
  • Self-referencing code
  • Precise traps

12
How Dynamo Works
(Flowchart:
  interpret until a taken branch;
  look up the branch target, and if it is in the code cache, jump to the code cache;
  otherwise, if the target meets the start-of-trace condition, increment the
  counter for the branch target;
  when the counter exceeds the threshold, interpret in code-gen mode until the
  end-of-trace condition, then create the trace, optimize it, and emit it into
  the cache; a signal handler returns control from the code cache)
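The control loop above can be sketched as a small simulation. This is an illustrative Python sketch, not Dynamo's actual code; the names `HOT_THRESHOLD`, `form_trace`, and `on_taken_branch` are assumptions for exposition.

```python
# Minimal sketch of Dynamo's main loop (illustrative names, not Dynamo's API).
HOT_THRESHOLD = 50          # assumed trace-start threshold for this sketch

code_cache = {}             # start address -> optimized trace
counters = {}               # branch target -> execution count

def form_trace(start):
    """Placeholder: interpret in code-gen mode until an end-of-trace
    condition, then optimize and return the trace."""
    return ("optimized-trace", start)

def on_taken_branch(target):
    """Called by the interpreter at every taken branch."""
    if target in code_cache:
        return code_cache[target]          # jump into the code cache
    counters[target] = counters.get(target, 0) + 1
    if counters[target] > HOT_THRESHOLD:   # counter exceeds threshold
        code_cache[target] = form_trace(target)  # create, optimize, emit
        return code_cache[target]
    return None                            # keep interpreting

# After enough taken branches to one target, execution moves to the cache:
for _ in range(HOT_THRESHOLD + 1):
    result = on_taken_branch(0x4000)
assert result == ("optimized-trace", 0x4000)
```

Note how cold targets stay in the interpreter: only addresses whose counters cross the threshold pay the cost of trace formation.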
13
Trace Selection
(Diagram: original code layout, basic blocks A through I with call and return
edges, versus the hot path chosen by trace selection)
14
Trace Selection
(Diagram: the selected hot trace A-C-D-F-G-I laid out contiguously in the
trace cache, with exit stubs to B, to H, and back to the runtime at the
return)
15
Flow of Control on Translated Traces
(Diagram: translated traces in the code cache; every trace exit goes through a
stub back to the emulation manager)
16
Flow of Control on Translated Traces
(Same diagram: every trace exit returns to the emulation manager through a
stub, which is high overhead)
17
Translation Linking
(Diagram: trace exits are linked directly to other translated traces,
bypassing the emulation manager)
18
Backpatching/Trace Linking
(Diagram: when H becomes hot, a new trace is selected starting from H, and the
trace-exit branch in block F is backpatched to branch to the new trace instead
of going back to the runtime)
19
Importance of Trace Linking
  • Performance slowdown when linking is disabled
  • Not a small trick

20
Execution Migrates to Code Cache
(Diagram: execution gradually migrates from a.out through the
interpreter/emulator, trace selector, and optimizer of the Emulation Manager
into the code cache)
21
Handle Indirect Branches
  • Variable targets cannot be linked
  • Must map addresses in original program to the
    addresses in code cache
  • Hash table lookup
  • Compare the dynamic target with a predicted
    target

22
Handle Indirect Branches (cont.)
  • Compare with a small number of predicted targets.

cmp real_target, hot_target_1
je  hot_target_1
cmp real_target, hot_target_2
je  hot_target_2
call prof_routine
jmp hashtable_lookup
  • A software-based indirect-branch target cache
    avoids going back to the emulation manager.
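The two-level scheme above can be sketched in Python. This is an illustrative model, not Dynamo's implementation; `cache_map`, `hot_targets`, and `resolve_indirect` are hypothetical names, and strings stand in for code-cache entry points.

```python
# Sketch of indirect-branch handling: inline compares against a few predicted
# hot targets (the cmp/je pairs above), then a hash-table fallback.
cache_map = {0x400100: "trace_A", 0x400200: "trace_B", 0x400300: "trace_C"}
hot_targets = [0x400100, 0x400200]   # predicted targets, compared inline

def resolve_indirect(real_target):
    for hot in hot_targets:          # cheap inline compares
        if real_target == hot:
            return cache_map[hot]
    # miss in the inline compares: software target cache / hash-table lookup
    return cache_map.get(real_target, "back_to_emulation_manager")

assert resolve_indirect(0x400100) == "trace_A"
assert resolve_indirect(0x400300) == "trace_C"      # hash-table path
assert resolve_indirect(0x400400) == "back_to_emulation_manager"
```

The design point is cost layering: a couple of compares are far cheaper than a hash lookup, which in turn is far cheaper than re-entering the emulation manager.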

23
Performance
  • Trace formation provides partial procedure
    inlining and better code layout
  • Slowdown
  • Major slowdowns were avoided by early bail-out

24
Summary of Dynamo
  • Dynamic Binary Optimization customizes
    performance delivery
  • Code is optimized by how the code is used
  • Code is optimized for the machine it runs on
  • Code is optimized when all executables are
    available
  • Only the part of the code that matters is
    optimized

25
Dynamo Follow-ups
  • Dynamo/RIO = Dynamo + RIO (Runtime Introspection
    and Optimization) for the x86 architecture
  • More successful in introspection than in
    optimization.
  • Started the company Determina for system security
    enforcement
  • Similar technology can be applied to migration,
    fast simulation, dynamic instrumentation, program
    introspection, security enforcement, power
    management, etc.

26
What Happened to Optimization?
  • Dynamo faces the following challenges
  • Profiling issues
  • frequency based, not time based
  • hard to detect really hot code, may end up with
    too much translation
  • Code duplication issues
  • trace generation could end up with excessive
    code duplication
  • Code cache management issues
  • for real applications, it requires very large
    code cache
  • Indirect branch handling issues
  • Indirect branch handling is expensive

27
ADORE
  • ADORE means ADaptive Object code RE-optimization
  • Developed at the CSE department, University of
    Minnesota
  • Applies a different model for dynamic
    optimization systems (after rethinking dynamic
    optimization)
  • Considered evolutionary

28
ADORE Model
(Diagram: the DynOpt manager patches branch/jump instructions in the
executable so that hot code is redirected into the code cache)
29
ADORE Rationale
  • If the executable is compatible, why should we
    use interpretation/emulation?
  • Instrumentation- or interpretation-based
    profiling does not collect important performance
    events; why not use hardware performance
    monitoring (HPM)?
  • If a program runs well, why bother to translate
    hot code?
  • Redirection of execution can be more effectively
    implemented using branches.

30
ADORE Framework
(Diagram: the main thread runs the executable while a separate dynamic
optimization thread initializes the hardware performance monitoring unit
(PMU), takes kernel interrupts on events and on sample-buffer overflow, runs
phase detection, performs trace selection on a phase change, passes traces to
the optimizer, and finally deploys them by patching the optimized traces into
the code cache)
31
Phase Detection
  • Keep a history buffer of average PC values (M1,
    M2, M3, M4, M5, ...)
  • Compute the average (E) and standard deviation
    (D) of the PC values in the history buffer
  • The band of tolerance is from E-D to E+D. If Mk
    falls outside the band, a phase change is
    triggered
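The band-of-tolerance test can be sketched directly. This is an illustrative Python sketch of the criterion described above, not ADORE's code; the function name `phase_change` is an assumption.

```python
import statistics

# Sketch of ADORE-style phase detection: keep a history of sampled average-PC
# values, compute mean E and standard deviation D, and flag a phase change
# when a new sample Mk falls outside the band [E-D, E+D].
def phase_change(history, new_sample):
    E = statistics.mean(history)
    D = statistics.pstdev(history)
    return not (E - D <= new_sample <= E + D)

history = [1000, 1010, 990, 1005, 995]       # stable phase: samples near 1000
assert phase_change(history, 1000) is False  # inside the band
assert phase_change(history, 5000) is True   # far outside: new phase
```

Because E and D come from cheap PMU samples rather than instrumentation, this check costs almost nothing per interval.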
34
Trace Selection
  • A trace is a single entry, multiple exit code
    sequence (e.g. a superblock)
  • Trace selection is guided by the path profile
    constructed from the branch trace samples (BTB
    samples).
  • Traces can be stitched together to form longer
    traces.
  • Trace-end conditions
  • procedure return, backward branch that forms a
    loop, branches that are not highly biased, trace
    size exceeding a preset threshold.
  • Function calls are treated as fall-through.
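The trace-end conditions listed above amount to a simple predicate. This Python sketch is illustrative only; the dict encoding of instructions, `MAX_TRACE_SIZE`, and the 0.9 bias cutoff are assumptions, not ADORE's actual values.

```python
# Sketch of the trace-end test (illustrative encoding, not ADORE's structures).
MAX_TRACE_SIZE = 1024   # assumed preset size threshold
BIAS_CUTOFF = 0.9       # assumed "highly biased" branch threshold

def ends_trace(instr, trace_size):
    if trace_size >= MAX_TRACE_SIZE:
        return True                                  # size threshold exceeded
    if instr["kind"] == "return":
        return True                                  # procedure return
    if instr["kind"] == "branch":
        if instr.get("backward"):                    # backward branch: a loop
            return True
        if instr.get("taken_ratio", 1.0) < BIAS_CUTOFF:
            return True                              # not highly biased
    return False                                     # calls fall through

assert ends_trace({"kind": "return"}, 10) is True
assert ends_trace({"kind": "branch", "backward": True}, 10) is True
assert ends_trace({"kind": "branch", "taken_ratio": 0.5}, 10) is True
assert ends_trace({"kind": "call"}, 10) is False
```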

35
Runtime D-Cache Pre-fetching
  • Locate the most recent delinquent loads
  • If the load instruction is in a loop-type trace,
    determine the reference pattern via address
    dependence analysis.
  • Calculate the stride if the reference has spatial
    or structural locality.
  • If the reference is pointer-chasing, insert code
    to detect possible strides at runtime.
  • Insert and schedule pre-fetch instructions.
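The stride-calculation step can be sketched as follows. This is an illustrative Python model of the analysis, not ADORE's implementation; `detect_stride`, `prefetch_targets`, and the lookahead `distance` are hypothetical names and parameters.

```python
# Sketch of the stride step: given sampled miss addresses of one delinquent
# load, infer a constant stride and compute prefetch addresses ahead of it.
def detect_stride(addresses):
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None   # constant stride only

def prefetch_targets(addresses, distance=4):
    stride = detect_stride(addresses)
    if stride is None:
        return []        # pointer chasing: needs runtime stride detection
    # prefetch `distance` iterations ahead of the last observed address
    return [addresses[-1] + stride * k for k in range(1, distance + 1)]

samples = [0x1000, 0x1040, 0x1080, 0x10c0]     # spatial locality, stride 64
assert detect_stride(samples) == 64
assert prefetch_targets(samples)[:2] == [0x1100, 0x1140]
```

In the real system the result of this analysis is turned into prefetch instructions scheduled into the optimized trace, not computed per-access at runtime.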

36
Identify Delinquent Loads
  • Using sampled EAR information to identify the
    delinquent loads in a selected trace.
  • Calculate the average latency and the total miss
    penalty of each delinquent load.

.mii
    ldfd f60 = [r15], 8   // average latency 129, penalty ratio 6.38
    add  r8 = 16, r24
    add  r4 = 28, r24
37
Determine Reference Pattern
// A. direct array
Loop:
    add r14 = 4, r14
    st4 [r14] = r20, 4
    ld4 r20 = [r14]
    add r14 = 4, r14
    br.cond Loop

// B. indirect array
Loop:
    ld4 r20 = [r16], 4
    add r15 = r25, r20
    add r15 = 1, r15
    ld8 r15 = [r15]
    br.cond Loop

// C. pointer chasing
Loop:
    add r11 = 104, r34
    ld8 r11 = [r11]
    ld8 r34 = [r11]
    br.cond Loop
38
Perf. of ADORE/Itanium on SPEC
39
Performance on BLAST
40
Static Optimizations on BLAST
  • Performance can often degrade at higher
    optimization levels in all three compilers
  • Long queries, which have a high fraction of stall
    cycles, did not benefit from static optimizations

41
Profile Based Optimizations
  • Less than 5% gain for some inputs
  • Large slowdown for others
  • Combining profiles results in moderate gain for
    some inputs

42
Slowdown from PBO
  • Large increase in system time
  • ECC inserts speculative load for future iteration
    in a loop, which causes TLB misses
  • A TLB miss exception on a speculative load is
    immediately handled by the OS
  • Reconfigured kernel to defer TLB miss on
    speculative loads to hardware
  • On TLB miss for speculative load, the NAT bit is
    set. Recovery code will load data if needed

43
PBO (Kernel Reconfigured)
  • Difficult to find the right set of combined
    training inputs
  • PBO can give performance but has limitations
44
ADORE vs. Dynamo
45
Mis-conceptions about ADORE
  • Misconception: compiler optimizations are very
    complex, so doing them at runtime is a bad idea.
  • The current ADORE deals only with cache misses; it
    does not redo traditional compiler
    optimizations. (It is a complement to, not a
    replacement for, compiler optimization.)
  • Inserting cache-prefetch instructions (and/or
    branch-prediction hints) is a safe optimization
    with no correctness issues.

46
Performance at Different Sampling Rates (based on
ADORE/Itanium performance on SPEC 2000)
47
Mis-conceptions about DynOpt
  • Compilation/optimization overhead is usually
    amortized over thousands of executions of the
    binary. How can runtime optimization overhead be
    amortized in only one execution?

48
Mis-conceptions about ADORE
  • Misconception: ADORE will be unreliable, hard to
    debug, and difficult to maintain.
  • ADORE performs simple transformations, so it can
    be more reliable than a static optimizer.
  • The current ADORE can run large real applications
  • ADORE/Itanium on the bioinformatics application
    BLAST (millions of lines of code):
  • 58% speedup on some long queries
  • ADORE/Sparc on the application Fluent:
  • 14.5% speedup on Panther

49
ADORE/Sparc
  • ADORE has been ported to the Sparc/Solaris
    platform (since 2005).
  • ADORE uses the libcpc interface on Solaris for
    runtime profiling. A kernel buffer enhancement
    was added to Solaris 10.0 to reduce profiling
    and phase-detection overhead
  • Reachability is a real problem (e.g. Oracle,
    Dyna3D)
  • The lack of a branch trace buffer is painful (e.g.
    BLAST)

50
Performance of In-Thread Opt. (USIII)
51
Helper Thread Prefetching for CMP
(Diagram: the main thread on the first core takes an L2 cache miss and fires a
trigger (about 65 cycles of delay) to the second core; a spin-waiting helper
thread there initiates prefetches, so the cache miss is avoided, and then
spins again waiting for the next trigger)
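The trigger/spin-wait protocol above can be sketched with ordinary threads. This is an illustrative Python model only: real helper-thread prefetching issues hardware prefetches on another core, while here a list append stands in for "prefetches initiated", and `threading.Event` stands in for the cross-core trigger.

```python
import threading

# Sketch of the trigger/spin-wait protocol: the main thread fires a trigger on
# an L2 miss; the helper wakes, "issues prefetches", and spins again.
trigger = threading.Event()
done = threading.Event()
prefetched = []

def helper():
    while not done.is_set():
        if trigger.wait(timeout=0.1):     # spin-wait for the trigger
            trigger.clear()
            prefetched.append("prefetch-issued")  # stand-in for prefetches

t = threading.Thread(target=helper)
t.start()
trigger.set()                             # main thread: L2 miss detected
while not prefetched:                     # wait until the helper reacts
    pass
done.set()
t.join()
assert prefetched == ["prefetch-issued"]
```

The latency of `trigger.set()` here plays the role of the roughly 65-cycle activation delay in the diagram: the helper is useful only if it can start prefetching well before the main thread reaches the missing data.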
52
Performance of Helper Thread
53
Summary of ADORE
  • ADORE uses the Hardware Performance Monitoring
    capability to implement a lightweight runtime
    profiling system. Efficient profiling and phase
    detection are the key to the success of ADORE.
  • ADORE can speed up large real-world applications
    optimized by production compilers.
  • ADORE works on two architectures: Itanium and
    SPARC.
  • ADORE can generate helper threads for current and
    future CMPs.