1. DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers
- Wei-Chung Hsu
- Computer Science and Engineering Department
- University of Minnesota, Twin Cities
2. Dynamic Binary Optimization
- The detection of program hot spots and application of optimizations to native binary code at run-time. Also called runtime binary optimization.
- Why is static compiler optimization insufficient?
3. Why Dynamic Binary Optimization?
- One size does not fit all: the runtime environment may differ from what the static binary was optimized for.
  - Underlying micro-architectures
    - e.g. running Pentium code on a Pentium-II
  - Input data sets
    - e.g. some data sets may not incur cache misses
  - Dynamic phase behavior
  - Dynamic libraries
4. Portable Executable
[Diagram: an .EXE or .SO file carrying compile info and an intermediate representation]
5. Common Binary (Fat Binary)
[Diagram: Itanium-1, Itanium-2, and Itanium-3 binaries packaged together, each with its own annotation]
6. Chubby Binary
[Diagram: a common Itanium binary plus Itanium-1-, Itanium-2-, and Itanium-3-specific sections, each with its own annotation]
7. Using More Accurate Profiles
At ISV sites:
- Optimize from source
- Optimize from source with profile feedback
- Optimize from binary with profile feedback
At user sites:
- Walk-time (or ahead-of-time) optimization
- Runtime optimization
8. Dynamo
- Dynamo stands for Dynamic Optimization System.
- A collaborative project between HP Labs (under Josh Fisher) and HP Systems Lab.
- Built on the dynamic translation technology developed for ARIES (which migrates PA-RISC binaries to the Itanium architecture).
- Considered revolutionary; it won the best-paper award at PLDI 2000.
- Dynamo technology was enhanced and continued at MIT and later became Dynamo/RIO.
- The Dynamo/RIO group started a company called Determina (http://www.determina.com/).
9. Migration vs. Dynamic Optimization
[Diagram, two pipelines into memory:]
- Migration (e.g. ARIES): existing incompatible binary -> emulator/interpreter -> translator -> code cache -> memory
- DynOpt (e.g. Dynamo): native binary -> emulator/interpreter -> trace selector -> optimizer -> dyncode cache -> memory
10. Migration vs. Dynamic Optimization
[Same diagram, annotated:]
- Migration: the optimizer is an optional accelerator; optimization is a 2nd priority.
- DynOpt: optimization is critical.
11. Why Not Static Binary Translation?
- The code-discovery problem
  - What is the target of an indirect jump?
  - No guarantee that the locations immediately following a jump contain valid instructions
  - Some compilers intersperse data with instructions
  - More challenging for ISAs with variable-length instructions
  - Padding used to align instructions
- The code-location problem
  - How do we translate indirect jumps? The target is not known until runtime.
- Other problems
  - Self-modifying code
  - Self-referencing code
  - Precise traps
12. How Dynamo Works
[Flowchart:]
1. Interpret until a taken branch.
2. Look up the branch target; if it is already in the code cache, jump to the code cache (a signal handler returns control to the interpreter).
3. Otherwise, increment the counter for the branch target.
4. If the counter exceeds the threshold (start-of-trace condition), switch to interpret-and-code-gen mode until an end-of-trace condition is met.
5. Create the trace, optimize it, and emit it into the code cache.
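The control loop above can be sketched in Python; the threshold value, the return codes, and all names are illustrative assumptions, not Dynamo's actual constants.

```python
# Sketch of Dynamo's dispatch decision (hypothetical threshold and names).
HOT_THRESHOLD = 50          # assumed trip count before a target is "hot"

code_cache = {}             # original start address -> emitted trace
counters = {}               # candidate trace-head address -> hit count

def dispatch(target):
    """Called each time the interpreter reaches a taken-branch target."""
    if target in code_cache:
        return "jump-to-cache"               # execute natively from the cache
    counters[target] = counters.get(target, 0) + 1
    if counters[target] > HOT_THRESHOLD:     # start-of-trace condition
        # Interpret with code generation until an end-of-trace condition,
        # then optimize the trace and emit it into the cache (elided here).
        code_cache[target] = ("trace", target)
        return "emit-and-enter"
    return "interpret"
```

A target is interpreted until its counter crosses the threshold, gets a trace emitted once, and thereafter always hits the code cache.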
13. Trace Selection
[Diagram: the original layout is a control-flow graph of blocks A through I, including a call and a return; trace selection picks the hot path through these blocks.]
14. Trace Selection
[Diagram: the selected trace A-C-D-F-G-I is laid out contiguously in the trace cache, with the call inlined along the trace; side exits branch to B, to H, or back to the runtime, and the off-trace blocks E and H remain outside.]
15. Flow of Control on Translated Traces
[Diagram: each translated trace ends in exit stubs; every trace exit transfers control back to the emulation manager through a stub.]
16. Flow of Control on Translated Traces
[Same diagram, annotated: returning to the emulation manager on every trace exit incurs high overhead.]
17. Translation Linking
[Diagram: trace-exit stubs are patched to branch directly to other translated traces, bypassing the emulation manager.]
18. Backpatching/Trace Linking
[Diagram: when H becomes hot, a new trace is selected starting from H, and the trace-exit branch in block F is backpatched to branch to the new trace. The other exits still go to B or back to the runtime.]
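The backpatching step can be sketched as follows; the data structures and the link rule are simplifications assumed for illustration.

```python
# Sketch of trace-exit stubs and backpatching (hypothetical structures).

class ExitStub:
    def __init__(self, target):
        self.target = target    # original-code address this exit leads to
        self.linked = False     # True once patched to jump trace-to-trace

traces = {}                     # trace entry address -> list of exit stubs

def emit_trace(entry, exit_targets):
    """Emit a new trace and backpatch older stubs that point at its entry."""
    traces[entry] = [ExitStub(t) for t in exit_targets]
    for stubs in traces.values():
        for stub in stubs:
            if stub.target == entry:
                stub.linked = True   # branch straight to the new trace
```

When the trace starting at H is emitted, the earlier trace's exit whose target is H gets patched, so control no longer bounces through the emulation manager.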
19. Importance of Trace Linking
- Performance slows down dramatically when linking is disabled
- Linking is not a small trick
20. Execution Migrates to the Code Cache
[Diagram: numbered steps trace control flow from a.out through the interpreter/emulator, trace selector, and optimizer inside the emulation manager; over time, execution migrates into the code cache.]
21. Handling Indirect Branches
- Variable targets cannot be linked
- Must map addresses in the original program to addresses in the code cache
  - Hash-table lookup
  - Compare the dynamic target with a predicted target
22. Handling Indirect Branches (cont.)
- Compare with a small number of predicted targets:

    cmp real_target, hot_target_1
    je  hot_target_1
    cmp real_target, hot_target_2
    je  hot_target_2
    call prof_routine
    jmp hashtable_lookup

- A software-based indirect-branch-target cache avoids going back to the emulation manager.
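The compare-then-fall-back scheme can be sketched like this; the function names and the two-entry prediction depth are assumptions for illustration.

```python
# Sketch: inline comparisons against a few hot predicted targets, then a
# hash-table lookup mapping original addresses to code-cache addresses.

def make_dispatch(hot_targets, target_map):
    """hot_targets: small list of frequently observed original targets.
    target_map: hash table from original address -> code-cache address."""
    def dispatch(real_target):
        for hot in hot_targets:              # the inlined cmp/je chain
            if real_target == hot:
                return target_map[hot]       # hit: jump to the cached trace
        # Missed all predicted targets: profile, then hash-table lookup.
        return target_map.get(real_target)   # None -> back to the manager
    return dispatch
```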
23. Performance
- Trace formation provides partial procedure inlining and better code layout
- Slowdown: major slowdowns were avoided by early bail-out
24. Summary of Dynamo
- Dynamic binary optimization customizes performance delivery:
  - Code is optimized by how the code is used
  - Code is optimized for the machine it runs on
  - Code is optimized when all executables are available
  - Only the part of the code that matters is optimized
25. Dynamo Follow-ups
- Dynamo/RIO: Dynamo + RIO (Runtime Introspection and Optimization) for the x86 architecture
- More successful in introspection than in optimization
- Started the company Determina for system-security enforcement
- Similar technology can be applied to migration, fast simulation, dynamic instrumentation, program introspection, security enforcement, power management, etc.
26. What Happened to Optimization?
- Dynamo faces the following challenges:
  - Profiling issues
    - Frequency-based, not time-based
    - Hard to detect truly hot code; may end up with too much translation
  - Code-duplication issues
    - Trace generation can produce excessive code duplication
  - Code-cache management issues
    - Real applications require a very large code cache
  - Indirect-branch handling issues
    - Indirect-branch handling is expensive
27. ADORE
- ADORE stands for ADaptive Object code RE-optimization
- Developed in the CSE department, University of Minnesota
- Applies a different model for dynamic optimization systems (after a rethinking of dynamic optimization)
- Considered evolutionary
28. ADORE Model
[Diagram: the DynOpt manager patches branch/jump instructions in the executable itself to redirect hot code into the code cache.]
29. ADORE Rationale
- If the executable is already compatible, why use interpretation/emulation?
- Instrumentation- or interpretation-based profiling does not collect important performance events; why not use hardware performance monitoring (HPM)?
- If a program runs well, why bother translating its hot code?
- Redirection of execution can be implemented more effectively using branches.
30. ADORE Framework
[Diagram: the kernel initializes the hardware performance monitoring unit (PMU), which interrupts on events; on a kernel-buffer overflow interrupt, the dynamic optimization thread runs phase detection, selects traces on a phase change, passes them to the optimizer, and deploys the optimized traces by patching them into the code cache executed by the main thread (after init code).]
31. Phase Detection
[Diagram: a history buffer of average PC values M1..M5. Compute the average (E) and standard deviation (D) of the PC values in the history buffer. The band of tolerance runs from E-D to E+D; if a new sample Mk falls outside the band, a phase change is triggered.]
32. Phase Detection
[Same diagram, next animation step.]

33. Phase Detection
[Same diagram, final animation step.]
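The band-of-tolerance test can be written directly from the slide; the history-window contents in the usage below are illustrative.

```python
import statistics

def phase_changed(history, mk):
    """history: recent average-PC samples (M1..Mk-1); mk: the newest sample.
    A phase change triggers when Mk falls outside [E - D, E + D], where E and
    D are the mean and standard deviation of the history buffer."""
    e = statistics.mean(history)
    d = statistics.pstdev(history)
    return not (e - d <= mk <= e + d)
```

A sample close to the recent average keeps the current phase; a sample far outside the band signals that execution has moved to different code.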
34. Trace Selection
- A trace is a single-entry, multiple-exit code sequence (e.g. a superblock)
- Trace selection is guided by the path profile constructed from the branch-trace samples (BTB samples)
- Traces can be stitched together to form longer traces
- Trace-end conditions: procedure return, backward branch that forms a loop, a branch that is not highly biased, or trace size exceeding a preset threshold
- Function calls are treated as fall-through
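The trace-end conditions above can be sketched as a predicate; the bias cutoff, the size limit, and the instruction encoding are assumed values, not ADORE's published parameters.

```python
# Sketch of ADORE-style end-of-trace tests (hypothetical thresholds).
MAX_TRACE_SIZE = 64
BIAS_CUTOFF = 0.8

def end_of_trace(instr, trace_size):
    """instr: dict describing the next control transfer while growing a trace."""
    if instr["kind"] == "return":
        return True                        # procedure return
    if instr["kind"] == "branch" and instr.get("backward"):
        return True                        # backward branch forming a loop
    if instr["kind"] == "branch" and instr.get("bias", 1.0) < BIAS_CUTOFF:
        return True                        # branch is not highly biased
    if trace_size >= MAX_TRACE_SIZE:
        return True                        # preset size threshold exceeded
    return False                           # calls fall through: keep growing
```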
35. Runtime D-Cache Prefetching
- Locate the most recent delinquent loads
- If the load instruction is in a loop-type trace, determine the reference pattern via address-dependence analysis
- Calculate the stride if the reference has spatial or structural locality
- If the reference is pointer-chasing, insert code to detect possible strides at runtime
- Insert and schedule prefetch instructions
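Stride calculation from sampled data addresses can be sketched like this; the 75% dominance rule is an assumption for illustration, not ADORE's published heuristic.

```python
def infer_stride(addresses):
    """Return the dominant delta between consecutive sampled addresses of a
    load, or None when no single stride dominates (irregular references)."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return None
    dominant = max(set(deltas), key=deltas.count)
    if deltas.count(dominant) >= 0.75 * len(deltas):   # assumed dominance bar
        return dominant
    return None
```

A load striding through an array of 8-byte elements yields a stride of 8; a pointer chase with irregular deltas yields None, in which case runtime stride-detection code would be planted instead.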
36. Identify Delinquent Loads
- Use sampled EAR information to identify the delinquent loads in a selected trace
- Calculate the average latency and the total miss penalty of each delinquent load

    .mii
    ldfd f60 = [r15], 8    // average latency 129, penalty ratio 6.38
    add  r8 = 16, r24
    add  r4 = 28, r24
37. Determine Reference Pattern
[Three Itanium loop kernels, one per pattern:]
- A. Direct array (e.g. b[i] = a[i]): the load address advances by a constant stride each iteration
- B. Indirect array (e.g. c = b[a[k]]): the load address is computed from the value of another load
- C. Pointer chasing (e.g. following tail -> next links): the next address is itself loaded from memory
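Written out in Python for clarity (variable names are illustrative, not taken from the Itanium kernels), the three patterns are:

```python
# A. Direct array: the address advances by a constant stride.
def direct(a):
    return [a[i] for i in range(len(a))]

# B. Indirect array: one load's address depends on another load's value.
def indirect(b, a):
    return [b[a[k]] for k in range(len(a))]

# C. Pointer chasing: the next address is itself loaded from memory.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

def chase(tail):
    vals = []
    while tail is not None:
        vals.append(tail.val)
        tail = tail.nxt
    return vals
```

Only pattern A has a directly computable stride; B's stride depends on the index data, and C needs runtime stride detection, which is why the prefetcher treats the three cases differently.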
38. Performance of ADORE/Itanium on SPEC
[Performance chart]
39. Performance on BLAST
[Performance chart]
40. Static Optimizations on BLAST
- Performance often degrades at higher optimization levels in all three compilers
- The long query, which has a high fraction of stall cycles, did not benefit from static optimizations
41. Profile-Based Optimizations
- Less than 5% gain for some inputs
- Large slowdowns for others
- Combining profiles results in moderate gains for some inputs
42. Slowdown from PBO
- Large increase in system time
- ECC inserts speculative loads for future iterations of a loop, which cause TLB misses
- TLB-miss exceptions for speculative loads are handled immediately by the OS
- The kernel was reconfigured to defer TLB misses on speculative loads to hardware
- On a TLB miss for a speculative load, the NAT bit is set; recovery code will load the data if needed
43. PBO (Kernel Reconfigured)
- It is difficult to find the right set of combined training inputs
- PBO can deliver performance but has limitations
44. ADORE vs. Dynamo
[Comparison table]
45. Misconceptions about ADORE
- "Compiler optimizations are very complex; doing them at runtime is a bad idea."
  - The current ADORE deals only with cache misses; it does not handle traditional compiler optimizations. (It is a complement to, not a replacement for, compiler optimization.)
  - Inserting cache-prefetch instructions (and/or branch-prediction hints) is a safe optimization: there are no correctness issues.
46. Performance at Different Sampling Rates (based on ADORE/Itanium performance on SPEC 2000)
[Performance chart]
47. Misconceptions about DynOpt
- "Compilation/optimization overhead is usually amortized over thousands of executions of the binary. How can runtime-optimization overhead be amortized over only one execution?"
48. Misconceptions about ADORE
- "ADORE will be unreliable, hard to debug, and difficult to maintain."
  - ADORE performs simple transformations; it can be more reliable than a static optimizer.
  - The current ADORE can run large real applications:
    - ADORE/Itanium on the bioinformatics application BLAST (millions of lines of code): 58% speedup on some long queries
    - ADORE/Sparc on the application Fluent: 14.5% speedup on Panther
49. ADORE/Sparc
- ADORE has been ported to the Sparc/Solaris platform (since 2005)
- ADORE uses the libcpc interface on Solaris for runtime profiling; a kernel-buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead
- Reachability is a real problem (e.g. Oracle, Dyna3D)
- The lack of a branch-trace buffer is painful (e.g. BLAST)
50. Performance of In-Thread Optimization (USIII)
[Performance chart]
51. Helper-Thread Prefetching for CMP
[Diagram: the main thread runs on the first core and incurs an L2 cache miss; a trigger (about 65 cycles of delay) activates a helper thread spin-waiting on the second core; the helper thread issues prefetches, so later misses are avoided, then spins again waiting for the next trigger.]
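The trigger/spin-wait protocol can be mimicked with events as below; this is a sketch under assumed synchronization, since real helper threads spin on shared memory, and the roughly 65-cycle activation delay from the slide is not modeled.

```python
import threading

# Sketch of trigger-activated helper-thread prefetching (hypothetical).
prefetched = []
trigger = threading.Event()
done = threading.Event()
stop = threading.Event()

def helper(addresses):
    while True:
        trigger.wait()                 # spin-wait for the main thread
        trigger.clear()
        if stop.is_set():
            return
        prefetched.extend(addresses)   # issue prefetches for delinquent loads
        done.set()                     # then go back to waiting

def run_main(addresses):
    t = threading.Thread(target=helper, args=(addresses,))
    t.start()
    trigger.set()                      # miss region entered: fire the trigger
    done.wait()                        # (the real main thread keeps running)
    stop.set()
    trigger.set()                      # wake the helper so it can exit
    t.join()
```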
52. Performance of Helper Threads
[Performance chart]
53. Summary of ADORE
- ADORE uses hardware performance-monitoring capability to implement a lightweight runtime profiling system; efficient profiling and phase detection are the key to ADORE's success
- ADORE can speed up large real-world applications already optimized by production compilers
- ADORE works on two architectures: Itanium and SPARC
- ADORE can generate helper threads for current and future CMPs