1. DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers
- Wei-Chung Hsu
- Computer Science and Engineering Department
- University of Minnesota, Twin Cities
2. Dynamic Binary Optimization
- The detection of program hot spots and application of optimizations to native binary code at run-time. Also called runtime binary optimization.
- Why is static compiler optimization insufficient?
3. Why Dynamic Binary Optimization?
- One size does not fit all: the runtime environment may differ from what the static binary was optimized for.
  - Underlying micro-architectures
    - e.g. running Pentium code on a Pentium-II
  - Input data sets
    - e.g. some data sets may not incur cache misses
  - Dynamic phase behavior
  - Dynamic libraries
4. Portable Executable
[Diagram: an .EXE or .SO file carrying compile info and an intermediate representation]
5. Common Binary (Fat Binary)
[Diagram: Itanium-1, Itanium-2, and Itanium-3 binaries packaged together, each with its own annotation]
6. Chubby Binary
[Diagram: a common Itanium binary plus Itanium-1-, Itanium-2-, and Itanium-3-specific sections, each with its own annotation]
7. Using More Accurate Profiles
At ISV sites:
- Optimize from source
- Optimize from source with profile feedback
- Optimize from binary with profile feedback
At user sites:
- Walk-time (or ahead-of-time) optimization
- Runtime optimization
8. Dynamo
- Dynamo stands for Dynamic Optimization System.
- A collaborative project between HP Labs (under Josh Fisher) and HP Systems Lab.
- Built on the dynamic translation technology developed for ARIES (which migrates PA-RISC binaries to the Itanium architecture).
- Considered revolutionary; it won the best-paper award at PLDI 2000.
- Dynamo technology was enhanced and continued at MIT and later became Dynamo/RIO.
- The Dynamo/RIO group started a company called Determina (http://www.determina.com/).
9. Migration vs. Dynamic Optimization
[Diagram, two pipelines into memory:]
- Migration (e.g. ARIES): existing incompatible binary -> emulator/interpreter -> translator -> code cache -> memory
- DynOpt (e.g. Dynamo): native binary -> emulator/interpreter -> trace selector -> optimizer -> dyncode cache -> memory
10. Migration vs. Dynamic Optimization
[Same diagram, annotated:]
- Migration: the optimizer is an optional accelerator; optimization is a 2nd priority.
- DynOpt: optimization is critical.
11. Why Not Static Binary Translation?
- The code-discovery problem
  - What is the target of an indirect jump?
  - No guarantee that the locations immediately following a jump contain valid instructions
  - Some compilers intersperse data with instructions
  - More challenging for ISAs with variable-length instructions
  - Padding used to align instructions
- The code-location problem
  - How do we translate indirect jumps? The target is not known until runtime.
- Other problems
  - Self-modifying code
  - Self-referencing code
  - Precise traps
12. How Dynamo Works
[Flowchart:]
1. Interpret until a taken branch.
2. Look up the branch target; if it is already in the code cache, jump to the code cache (a signal handler returns control to the interpreter).
3. Otherwise, increment the counter for the branch target.
4. If the counter exceeds the threshold (start-of-trace condition), switch to interpret-and-code-gen mode until an end-of-trace condition is met.
5. Create the trace, optimize it, and emit it into the code cache.
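The control loop above can be sketched in Python; the threshold value, the return codes, and all names are illustrative assumptions, not Dynamo's actual constants.

```python
# Sketch of Dynamo's dispatch decision (hypothetical threshold and names).
HOT_THRESHOLD = 50          # assumed trip count before a target is "hot"

code_cache = {}             # original start address -> emitted trace
counters = {}               # candidate trace-head address -> hit count

def dispatch(target):
    """Called each time the interpreter reaches a taken-branch target."""
    if target in code_cache:
        return "jump-to-cache"               # execute natively from the cache
    counters[target] = counters.get(target, 0) + 1
    if counters[target] > HOT_THRESHOLD:     # start-of-trace condition
        # Interpret with code generation until an end-of-trace condition,
        # then optimize the trace and emit it into the cache (elided here).
        code_cache[target] = ("trace", target)
        return "emit-and-enter"
    return "interpret"
```

A target is interpreted until its counter crosses the threshold, gets a trace emitted once, and thereafter always hits the code cache.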
13. Trace Selection
[Diagram: the original layout is a control-flow graph of blocks A through I, including a call and a return; trace selection picks the hot path through these blocks.]
14. Trace Selection
[Diagram: the selected trace A-C-D-F-G-I is laid out contiguously in the trace cache, with the call inlined along the trace; side exits branch to B, to H, or back to the runtime, and the off-trace blocks E and H remain outside.]
15. Flow of Control on Translated Traces
[Diagram: each translated trace ends in exit stubs; every trace exit transfers control back to the emulation manager through a stub.]
16. Flow of Control on Translated Traces
[Same diagram, annotated: returning to the emulation manager on every trace exit incurs high overhead.]
17. Translation Linking
[Diagram: trace-exit stubs are patched to branch directly to other translated traces, bypassing the emulation manager.]
18. Backpatching/Trace Linking
[Diagram: when H becomes hot, a new trace is selected starting from H, and the trace-exit branch in block F is backpatched to branch to the new trace. The other exits still go to B or back to the runtime.]
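The backpatching step can be sketched as follows; the data structures and the link rule are simplifications assumed for illustration.

```python
# Sketch of trace-exit stubs and backpatching (hypothetical structures).

class ExitStub:
    def __init__(self, target):
        self.target = target    # original-code address this exit leads to
        self.linked = False     # True once patched to jump trace-to-trace

traces = {}                     # trace entry address -> list of exit stubs

def emit_trace(entry, exit_targets):
    """Emit a new trace and backpatch older stubs that point at its entry."""
    traces[entry] = [ExitStub(t) for t in exit_targets]
    for stubs in traces.values():
        for stub in stubs:
            if stub.target == entry:
                stub.linked = True   # branch straight to the new trace
```

When the trace starting at H is emitted, the earlier trace's exit whose target is H gets patched, so control no longer bounces through the emulation manager.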
19. Importance of Trace Linking
- Performance slows down dramatically when linking is disabled
- Linking is not a small trick
20. Execution Migrates to the Code Cache
[Diagram: numbered steps trace control flow from a.out through the interpreter/emulator, trace selector, and optimizer inside the emulation manager; over time, execution migrates into the code cache.]
21. Handling Indirect Branches
- Variable targets cannot be linked
- Must map addresses in the original program to addresses in the code cache
  - Hash-table lookup
  - Compare the dynamic target with a predicted target
22. Handling Indirect Branches (cont.)
- Compare with a small number of predicted targets:

    cmp real_target, hot_target_1
    je  hot_target_1
    cmp real_target, hot_target_2
    je  hot_target_2
    call prof_routine
    jmp hashtable_lookup

- A software-based indirect-branch-target cache avoids going back to the emulation manager.
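The compare-then-fall-back scheme can be sketched like this; the function names and the two-entry prediction depth are assumptions for illustration.

```python
# Sketch: inline comparisons against a few hot predicted targets, then a
# hash-table lookup mapping original addresses to code-cache addresses.

def make_dispatch(hot_targets, target_map):
    """hot_targets: small list of frequently observed original targets.
    target_map: hash table from original address -> code-cache address."""
    def dispatch(real_target):
        for hot in hot_targets:              # the inlined cmp/je chain
            if real_target == hot:
                return target_map[hot]       # hit: jump to the cached trace
        # Missed all predicted targets: profile, then hash-table lookup.
        return target_map.get(real_target)   # None -> back to the manager
    return dispatch
```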
23. Performance
- Trace formation provides partial procedure inlining and better code layout
- Slowdown: major slowdowns were avoided by early bail-out
24. Summary of Dynamo
- Dynamic binary optimization customizes performance delivery:
  - Code is optimized by how the code is used
  - Code is optimized for the machine it runs on
  - Code is optimized when all executables are available
  - Only the part of the code that matters is optimized
25. Dynamo Follow-ups
- Dynamo/RIO: Dynamo + RIO (Runtime Introspection and Optimization) for the x86 architecture
- More successful in introspection than in optimization
- Started the company Determina for system-security enforcement
- Similar technology can be applied to migration, fast simulation, dynamic instrumentation, program introspection, security enforcement, power management, etc.
26. What Happened to Optimization?
- Dynamo faces the following challenges:
  - Profiling issues
    - Frequency-based, not time-based
    - Hard to detect truly hot code; may end up with too much translation
  - Code-duplication issues
    - Trace generation can produce excessive code duplication
  - Code-cache management issues
    - Real applications require a very large code cache
  - Indirect-branch handling issues
    - Indirect-branch handling is expensive
27. ADORE
- ADORE stands for ADaptive Object code RE-optimization
- Developed in the CSE department, University of Minnesota
- Applies a different model for dynamic optimization systems (after a rethinking of dynamic optimization)
- Considered evolutionary
28. ADORE Model
[Diagram: the DynOpt manager patches branch/jump instructions in the executable itself to redirect hot code into the code cache.]
29. ADORE Rationale
- If the executable is already compatible, why use interpretation/emulation?
- Instrumentation- or interpretation-based profiling does not collect important performance events; why not use hardware performance monitoring (HPM)?
- If a program runs well, why bother translating its hot code?
- Redirection of execution can be implemented more effectively using branches.
30. ADORE Framework
[Diagram: the kernel initializes the hardware performance monitoring unit (PMU), which interrupts on events; on a kernel-buffer overflow interrupt, the dynamic optimization thread runs phase detection, selects traces on a phase change, passes them to the optimizer, and deploys the optimized traces by patching them into the code cache executed by the main thread (after init code).]
31. Phase Detection
[Diagram: a history buffer of average PC values M1..M5. Compute the average (E) and standard deviation (D) of the PC values in the history buffer. The band of tolerance runs from E-D to E+D; if a new sample Mk falls outside the band, a phase change is triggered.]
32. Phase Detection
[Same diagram, next animation step.]

33. Phase Detection
[Same diagram, final animation step.]
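The band-of-tolerance test can be written directly from the slide; the history-window contents in the usage below are illustrative.

```python
import statistics

def phase_changed(history, mk):
    """history: recent average-PC samples (M1..Mk-1); mk: the newest sample.
    A phase change triggers when Mk falls outside [E - D, E + D], where E and
    D are the mean and standard deviation of the history buffer."""
    e = statistics.mean(history)
    d = statistics.pstdev(history)
    return not (e - d <= mk <= e + d)
```

A sample close to the recent average keeps the current phase; a sample far outside the band signals that execution has moved to different code.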
34. Trace Selection
- A trace is a single-entry, multiple-exit code sequence (e.g. a superblock)
- Trace selection is guided by the path profile constructed from the branch-trace samples (BTB samples)
- Traces can be stitched together to form longer traces
- Trace-end conditions: procedure return, backward branch that forms a loop, a branch that is not highly biased, or trace size exceeding a preset threshold
- Function calls are treated as fall-through
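The trace-end conditions above can be sketched as a predicate; the bias cutoff, the size limit, and the instruction encoding are assumed values, not ADORE's published parameters.

```python
# Sketch of ADORE-style end-of-trace tests (hypothetical thresholds).
MAX_TRACE_SIZE = 64
BIAS_CUTOFF = 0.8

def end_of_trace(instr, trace_size):
    """instr: dict describing the next control transfer while growing a trace."""
    if instr["kind"] == "return":
        return True                        # procedure return
    if instr["kind"] == "branch" and instr.get("backward"):
        return True                        # backward branch forming a loop
    if instr["kind"] == "branch" and instr.get("bias", 1.0) < BIAS_CUTOFF:
        return True                        # branch is not highly biased
    if trace_size >= MAX_TRACE_SIZE:
        return True                        # preset size threshold exceeded
    return False                           # calls fall through: keep growing
```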
35. Runtime D-Cache Prefetching
- Locate the most recent delinquent loads
- If the load instruction is in a loop-type trace, determine the reference pattern via address-dependence analysis
- Calculate the stride if the reference has spatial or structural locality
- If the reference is pointer-chasing, insert code to detect possible strides at runtime
- Insert and schedule prefetch instructions
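Stride calculation from sampled data addresses can be sketched like this; the 75% dominance rule is an assumption for illustration, not ADORE's published heuristic.

```python
def infer_stride(addresses):
    """Return the dominant delta between consecutive sampled addresses of a
    load, or None when no single stride dominates (irregular references)."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return None
    dominant = max(set(deltas), key=deltas.count)
    if deltas.count(dominant) >= 0.75 * len(deltas):   # assumed dominance bar
        return dominant
    return None
```

A load striding through an array of 8-byte elements yields a stride of 8; a pointer chase with irregular deltas yields None, in which case runtime stride-detection code would be planted instead.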
36. Identify Delinquent Loads
- Use sampled EAR information to identify the delinquent loads in a selected trace
- Calculate the average latency and the total miss penalty of each delinquent load

    .mii
    ldfd f60 = [r15], 8    // average latency 129, penalty ratio 6.38
    add  r8 = 16, r24
    add  r4 = 28, r24
37. Determine Reference Pattern
[Three Itanium loop kernels, one per pattern:]
- A. Direct array (e.g. b[i] = a[i]): the load address advances by a constant stride each iteration
- B. Indirect array (e.g. c = b[a[k]]): the load address is computed from the value of another load
- C. Pointer chasing (e.g. following tail -> next links): the next address is itself loaded from memory
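Written out in Python for clarity (variable names are illustrative, not taken from the Itanium kernels), the three patterns are:

```python
# A. Direct array: the address advances by a constant stride.
def direct(a):
    return [a[i] for i in range(len(a))]

# B. Indirect array: one load's address depends on another load's value.
def indirect(b, a):
    return [b[a[k]] for k in range(len(a))]

# C. Pointer chasing: the next address is itself loaded from memory.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

def chase(tail):
    vals = []
    while tail is not None:
        vals.append(tail.val)
        tail = tail.nxt
    return vals
```

Only pattern A has a directly computable stride; B's stride depends on the index data, and C needs runtime stride detection, which is why the prefetcher treats the three cases differently.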
38. Performance of ADORE/Itanium on SPEC
[Performance chart]
39. Performance on BLAST
[Performance chart]
40. Static Optimizations on BLAST
- Performance often degrades at higher optimization levels in all three compilers
- The long query, which has a high fraction of stall cycles, did not benefit from static optimizations
41. Profile-Based Optimizations
- Less than 5% gain for some inputs
- Large slowdowns for others
- Combining profiles results in moderate gains for some inputs
42. Slowdown from PBO
- Large increase in system time
- ECC inserts speculative loads for future iterations of a loop, which cause TLB misses
- TLB-miss exceptions for speculative loads are handled immediately by the OS
- The kernel was reconfigured to defer TLB misses on speculative loads to hardware
- On a TLB miss for a speculative load, the NAT bit is set; recovery code will load the data if needed
43. PBO (Kernel Reconfigured)
- It is difficult to find the right set of combined training inputs
- PBO can deliver performance but has limitations
44. ADORE vs. Dynamo
[Comparison table]
45. Misconceptions about ADORE
- "Compiler optimizations are very complex; doing them at runtime is a bad idea."
  - The current ADORE deals only with cache misses; it does not handle traditional compiler optimizations. (It is a complement to, not a replacement for, compiler optimization.)
  - Inserting cache-prefetch instructions (and/or branch-prediction hints) is a safe optimization: there are no correctness issues.
46. Performance at Different Sampling Rates (based on ADORE/Itanium performance on SPEC 2000)
[Performance chart]
47. Misconceptions about DynOpt
- "Compilation/optimization overhead is usually amortized over thousands of executions of the binary. How can runtime-optimization overhead be amortized over only one execution?"
48. Misconceptions about ADORE
- "ADORE will be unreliable, hard to debug, and difficult to maintain."
  - ADORE performs simple transformations; it can be more reliable than a static optimizer.
  - The current ADORE can run large real applications:
    - ADORE/Itanium on the bioinformatics application BLAST (millions of lines of code): 58% speedup on some long queries
    - ADORE/Sparc on the application Fluent: 14.5% speedup on Panther
49. ADORE/Sparc
- ADORE has been ported to the Sparc/Solaris platform (since 2005)
- ADORE uses the libcpc interface on Solaris for runtime profiling; a kernel-buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead
- Reachability is a real problem (e.g. Oracle, Dyna3D)
- The lack of a branch-trace buffer is painful (e.g. BLAST)
50. Performance of In-Thread Optimization (USIII)
[Performance chart]
51. Helper-Thread Prefetching for CMP
[Diagram: the main thread runs on the first core and incurs an L2 cache miss; a trigger (about 65 cycles of delay) activates a helper thread spin-waiting on the second core; the helper thread issues prefetches, so later misses are avoided, then spins again waiting for the next trigger.]
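The trigger/spin-wait protocol can be mimicked with events as below; this is a sketch under assumed synchronization, since real helper threads spin on shared memory, and the roughly 65-cycle activation delay from the slide is not modeled.

```python
import threading

# Sketch of trigger-activated helper-thread prefetching (hypothetical).
prefetched = []
trigger = threading.Event()
done = threading.Event()
stop = threading.Event()

def helper(addresses):
    while True:
        trigger.wait()                 # spin-wait for the main thread
        trigger.clear()
        if stop.is_set():
            return
        prefetched.extend(addresses)   # issue prefetches for delinquent loads
        done.set()                     # then go back to waiting

def run_main(addresses):
    t = threading.Thread(target=helper, args=(addresses,))
    t.start()
    trigger.set()                      # miss region entered: fire the trigger
    done.wait()                        # (the real main thread keeps running)
    stop.set()
    trigger.set()                      # wake the helper so it can exit
    t.join()
```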
52. Performance of Helper Threads
[Performance chart]
53. Summary of ADORE
- ADORE uses hardware performance-monitoring capability to implement a lightweight runtime profiling system; efficient profiling and phase detection are the key to ADORE's success
- ADORE can speed up large real-world applications already optimized by production compilers
- ADORE works on two architectures: Itanium and SPARC
- ADORE can generate helper threads for current and future CMPs