Automatic Differentiation Tutorial

About This Presentation

Title:

Automatic Differentiation Tutorial

Description:

Associativity of the chain rule leads to two main modes: forward and reverse ... Propagation starts with dependent variables must reverse flow of computation ... – PowerPoint PPT presentation

Number of Views:400

Avg rating:3.0/5.0

Slides: 52

Provided by: pauldh5

Learn more at: https://www.mcs.anl.gov

Category:

more less

Transcript and Presenter's Notes

Title: Automatic Differentiation Tutorial

1
Automatic Differentiation Tutorial

Paul Hovland
Andrew Lyons
Boyana Norris
Jean Utke
Mathematics Computer Science Division
Argonne National Laboratory

2
AD in a Nutshell

Technique for computing analytic derivatives of
programs (millions of loc)
Derivatives used in optimization, nonlinear PDEs,
sensitivity analysis, inverse problems, etc.
AD analytic differentiation of elementary
functions propagation by chain rule
Every programming language provides a limited
number of elementary mathematical functions
Thus, every function computed by a program may be
viewed as the composition of these so-called
intrinsic functions
Derivatives for the intrinsic functions are known
and can be combined using the chain rule of
differential calculus
Associativity of the chain rule leads to two main
modes forward and reverse
Can be implemented using source transformation or
operator overloading

3
Modes of AD

Forward Mode
Propagates derivative vectors, often denoted ?u
or g_u
Derivative vector ?u contains derivatives of u
with respect to independent variables
Time and storage proportional to vector length (
indeps)
Reverse Mode
Propagates adjoints, denoted u or u_bar
Adjoint u contains derivatives of dependent
variables with respect to u
Propagation starts with dependent variablesmust
reverse flow of computation
Time proportional to adjoint vector length (
dependents)
Storage proportional to number of operations
Because of this limitation, often applied to
subprograms

4
Which mode to use?

Use forward mode when
independents is very small
Only a directional derivative (Jv) is needed
Reverse mode is not tractable
Use reverse mode when
dependents is very small
Only JTv is needed

5
Ways of Implementing AD

Operator Overloading
Use language features to implement
differentiation rules or to generate trace
(tape) of computation
Implementation can be very simple
Difficult to go beyond one operation/statement at
a time in doing AD
Potential overhead due to compiler-generated
temporaries, e.g. w xyz è tmp xy w
tmpz.
Examples ADOL-C, ADMAT, TFADltgt,SACADO
Source Transformation
Requires significant (compiler) infrastructure
More flexibility in exploiting chain rule
associativity
Examples ADIFOR, ADIC, OpenAD, TAF, TAPENADE

6
Operator Overloading simple example
(implementation)
class a_double private double value,
grad public / constructors /
a_double(double v0.0, double g0.0)valuev
grad g / operators / friend a_double
operator(const a_double g1, const a_double g2)
return a_double(g1.valueg2.value,g1.gradg2
.grad) friend a_double operator-(const
a_double g1, const a_double g2) return
a_double(g1.value-g2.value,g1.grad-g2.grad)
friend a_double operator(const a_double g1,
const a_double g2) return
a_double(g1.valueg2.value,g2.valueg1.gradg1.val
ueg2.grad) friend a_double sin(const
a_double g1) return a_double(sin(g1.value),
cos(g1.value)g1.grad) friend a_double
cos(const a_double g1) return
a_double(cos(g1.value),-sin(g1.value)g1.grad)
// ...
7
Operator Overloading simple example (use)
include ltmath.hgt include ltstream.hgt include
"adouble.hxx" void func(a_double f, a_double x,
a_double y) a_double a,b if (x gt y)
a cos(x) b sin(y)yy else a
xsin(x)/y b exp(y) f
exp(ab)
8
Source Transformation simple example
C Generated by TAPENADE (INRIA,
Tropics team) C ... SUBROUTINE FUNC_D(f,
fd, x, xd, y, yd) DOUBLE PRECISION f, fd,
x, xd, y, yd DOUBLE PRECISION a, ad, arg1,
arg1d, b, bd INTRINSIC COS, EXP, SIN C
IF (x .GT. y) THEN ad -(xdSIN(x))
a COS(x) bd ydCOS(y)yy
SIN(y)(ydyyyd) b SIN(y)yy
ELSE ad ((xdSIN(x)xxdCOS(x))y-xSIN
(x)yd)/y2 a xSIN(x)/y bd
ydEXP(y) b EXP(y) END IF
arg1d adb abd arg1 ab fd
arg1dEXP(arg1) f EXP(arg1) RETURN
END
9
The ADIFOR/ADIC Process
subroutine to be differentiated AD_TOP
deba number of derivatives to be
computed AD_PMAX 5 independent
variables AD_IVARS b,c,d,nu100,nu210
dependent variables AD_OVARS
hmin,pmaxe,cfhp Name of file listing names of
files containing deba and all subroutines
called by it. AD_PROG deba.cmp
After initial setup, usually only input code
changes.
10
Tools

Fortran 95
C/C
Fortran 77
MATLAB
Other languages Ada, Python, ...
More tools at http//www.autodiff.org/

11
Tools Fortran 95

TAF (FastOpt)
Commercial tool
Support for (almost) all of Fortran 95
Used extensively in geophysical sciences
applications
Tapenade (INRIA)
Support for many Fortran 95 features
Developed by a team with extensive compiler
experience
OpenAD/F (Argonne/UChicago/Rice)
Support for many Fortran 95 features
Developed by a team with expertise in
combintorial algorithms, compilers, software
engineering, and numerical analysis
Development driven by climate model
astrophysics code
All three forward and reverse source
transformation

12
Tools C/C

ADOL-C (Dresden)
Mature tool
Support for all of C
Operator overloading forward and reverse modes
ADIC (Argonne/UChicago)
Support for all of C, some C
Source transformation forward mode (reverse
under development)
New version (2.0) based on industrial strength
compiler infrastructure
Shares some infrastructure with OpenAD/F
SACADO
Operator overloading forward and reverse modes
See Phipps presentation
TAC (FastOpt)
Commercial tool (under development)
Support for much of C/C
Source transformation forward and reverse modes
Shares some infrastructure with TAF

13
Tools Fortran 77

ADIFOR (Rice/Argonne)
Mature and very robust tool
Support for all of Fortran 77
Forward and (adequate) reverse modes
Hundreds of users 150 citations
AdiMat (Aachen) source transformation
MAD (Cranfield/TOMLAB) operator overloading
Various research prototypes

Tools MATLAB
14
Simple example revisited
include ltmath.hgt include ltstream.hgt include
"adouble.hxx" void func(double f, double x,
double y) double a,b if (x gt y) a
cos(x) b sin(y)yy else a
xsin(x)/y b exp(y) f
exp(ab)
15
ADIFOR 2.0 simple example (x only)
C This file was generated on 09/15/03 by the
version of C ADIFOR compiled on June, 1998. C
... subroutine g_func(f, g_f, x, g_x, y)
double precision f, x, y, a, b
double precision d1_w, d2_v, d1_p, d2_b, g_a,
g_x, g_d1_w, g_f save g_a, g_d1_w
if (x .gt. y) then d2_v cos(x)
d1_p -sin(x) g_a d1_p g_x
a d2_v C-------- b sin(y) y
y else d2_v sin(x)
d1_p cos(x) d2_b 1.0d0 / y
g_a (d2_b d2_v d2_b x d1_p) g_x
a x d2_v / y C-------- b
exp(y) endif g_d1_w b g_a
d1_w a b d2_v exp(d1_w)
d1_p d2_v g_f d1_p g_d1_w
f d2_v C-------- C return end
16
ADIFOR 3.0 simple example
subroutine g_func_proc(ad_p_, f, g_f, x,
g_x, y, g_y) include 'g_maxs.h' cdeclarat
ions here integer ad_p_ double
precision d_tmp_s_0_a double precision
g_a(ad_pmax_) double precision
g_b(ad_pmax_) double precision
g_f(ad_pmax_) double precision
g_X2_dtmp1(ad_pmax_) double precision
g_x(ad_pmax_) double precision
g_y(ad_pmax_) double precision
d_tmp_ipar_0 double precision
d_tmp_val_0 double precision d_tmp_val_1
double precision f, x, y, a, b
double precision X2_dtmp1 CAD4 Start of
executable statements. d_tmp_ipar_0
(-sin(x)) call g_accum_d_2(ad_p_, g_a,
d_tmp_ipar_0, g_x) a cos(x)
d_tmp_val_0 sin(y) d_tmp_val_1
d_tmp_val_0 y d_tmp_ipar_0 cos(y)
d_tmp_s_0_a d_tmp_val_1 y d_tmp_val_0
y y d_tmp_ipar _0 call
g_accum_d_2(ad_p_, g_b, d_tmp_s_0_a, g_y)
b d_tmp_val_1 y call
g_accum_d_4(ad_p_, g_X2_dtmp1, a, g_b, b, g_a)
X2_dtmp1 a b d_tmp_ipar_0
exp(X2_dtmp1) call g_accum_d_2(ad_p_,
g_f, d_tmp_ipar_0, g_X2_dtmp1) f
exp(X2_dtmp1) C return end
17
TAPENADE reverse mode simple example
C Generated by TAPENADE (INRIA,
Tropics team) C Version 2.0.6 - (Id 1.14 vmp
Stable - Fri Sep 5 140823 MEST 2003) C ...
SUBROUTINE FUNC_B(f, fb, x, xb, y, yb)
DOUBLE PRECISION f, fb, x, xb, y, yb DOUBLE
PRECISION a, ab, arg1, arg1b, b, bb INTEGER
branch INTRINSIC COS, EXP, SIN C IF
(x .GT. y) THEN a COS(x) b
SIN(y)yy CALL PUSHINTEGER4(0)
ELSE a xSIN(x)/y b EXP(y)
CALL PUSHINTEGER4(1) END IF arg1
ab f EXP(arg1) arg1b
EXP(arg1)fb ab barg1b bb
aarg1b CALL POPINTEGER4(branch) IF
(branch .LT. 1) THEN yb yb
(SIN(y)yySIN(y)yyCOS(y))bb xb xb
- SIN(x)ab ELSE yb yb
EXP(y)bb - xSIN(x)ab/y2 xb xb
(xCOS(x)/ySIN(x)/y)ab END IF fb
0.D0 END
18
OpenAD system architecture
19
OpenAD simple example
SUBROUTINE func(F, X, Y) use
w2f__types use active_module C Declarations
C ... IF (Xv .GT. Yv) THEN
OpenAD_Symbol_1 COS(Xv)
OpenAD_Symbol_0 (-SIN(Xv)) Av
OpenAD_Symbol_1 OpenAD_Symbol_5
SIN(Yv) OpenAD_Symbol_2
(YvOpenAD_Symbol_5) OpenAD_Symbol_9
(YvOpenAD_Symbol_2) OpenAD_Symbol_3
OpenAD_Symbol_2 OpenAD_Symbol_6
OpenAD_Symbol_5 OpenAD_Symbol_8
COS(Yv) OpenAD_Symbol_7 Yv
OpenAD_Symbol_4 Yv Bv
OpenAD_Symbol_9 OpenAD_Symbol_25
(OpenAD_Symbol_6 OpenAD_Symbol_8 gt
OpenAD_Symbol_7) OpenAD_Symbol_26
(OpenAD_Symbol_3 OpenAD_Symbol_25 gt
OpenAD_Symbol_4) OpenAD_Symbol_28
OpenAD_Symbol_0 CALL setderiv(OpenAD_Symbo
l_29,X) CALL setderiv(OpenAD_Symbol_27,Y)
CALL sax(OpenAD_Symbol_26,OpenAD_Symbol_27
,B) CALL sax(OpenAD_Symbol_28,OpenAD_Symbo
l_29,A) ELSE C ...
20
ADIC simple example
include "ad_deriv.h" include ltmath.hgt include
"adintrinsics.h" void ad_func(DERIV_TYPE
f,DERIV_TYPE x,DERIV_TYPE y) DERIV_TYPE a,
b, ad_var_0, ad_var_1, ad_var_2 double
ad_adji_0,ad_loc_0,ad_loc_1,ad_adj_0,ad_adj_1,ad_a
dj_2,ad_adj_3 if (DERIV_val(x) gt
DERIV_val(y)) DERIV_val(a) cos(
DERIV_val(x)) /cos/ ad_adji_0 -sin(
DERIV_val(x))
ad_grad_axpy_1((a), ad_adji_0, (x))
DERIV_val(ad_var_0) sin( DERIV_val(y))
/sin/ ad_adji_0 cos( DERIV_val(y))
ad_grad_axpy_1((ad_var_0),
ad_adji_0, (y))
ad_loc_0 DERIV_val(ad_var_0) DERIV_val(y)
ad_loc_1 ad_loc_0 DERIV_val(y)
ad_adj_0 DERIV_val(ad_var_0)
DERIV_val(y) ad_adj_1 DERIV_val(y)
DERIV_val(y) ad_grad_axpy_3((b),
ad_adj_1, (ad_var_0), ad_adj_0, (y), ad_loc_0,
(y)) DERIV_val(b) ad_loc_1
else // ...
21
ADIC-Generated Code Interpretation
Original code y x1x2x3x4
DERIV_val(y) value of program
variable y DERIV_grad(y)derivative object
associated with y
ad_loc_0 DERIV_val(x1) DERIV_val(x2) ad_loc_1
ad_loc_0 DERIV_val(x3) dy/dx4 ad_loc_2
ad_loc_1 DERIV_val(x4) y ad_adj_0 ad_loc_0
DERIV_val(x4) dy/dx3 ad_adj_1
DERIV_val(x3) DERIV_val(x4) ad_adj_2
DERIV_val(x1) ad_adj_1 dy/dx2 ad_adj_3
DERIV_val(x2) ad_adj_1 dy/dx1
reverse (or adjoint) mode of AD
for(i0iltnindepsi) DERIV_grad(y)i
ad_adj_3DERIV_grad(x1)iad_adj_2DERIV_grad(x2
)iad_adj_0DERIV_grad(x3)iad_loc_1DERIV_gr
ad(x4)i
forward mode of AD
DERIV_val(y) ad_loc_2
original value
22
Matrix Coloring

Jacobian matrices are often sparse
The forward mode of AD computes J S, where S is
usually an identity matrix or a vector
Can compress Jacobian by choosing S such that
structurally orthogonal columns are combined
A set of columns are structurally orthogonal if
no two of them have nonzeros in the same row
Equivalent problem color the graph whose
adjacency matrix is JTJ
Equivalent problem distance-2 color the
bipartite graph of J

23
Compressed Jacobian
24
What is feasible practical

Key point forward mode computes JS at cost
proportional to number of columns in S reverse
mode computes JTW at cost proportional to number
of columns in W
Jacobians of functions with small number (11000)
of independent variables (forward mode, SI)
Jacobians of functions with small number (1100)
of dependent variables (reverse/adjoint mode,
SI)
Very (extremely) large, but (very) sparse
Jacobians and Hessians (forward mode plus
coloring)
Jacobian-vector products (forward mode)
Transposed-Jacobian-vector products (adjoint
mode)
Hessian-vector products (forward adjoint modes)
Large, dense Jacobian matrices that are
effectively sparse or effectively low rank (e.g.,
see Abdel-Khalik et al., AD2008)

25
Scenarios
p
k

N small use forward mode on full computation
M small use reverse mode on full computation
M N large, P small use reverse mode on A,
forward mode on BC
M N large, K small use reverse mode on AB,
forward mode on C
N, P, K, M large, Jacobians of A, B, C sparse
compressed forward mode
N, P, K, M large, Jacobians of A, B, C low rank
scarce forward mode

B
C
A
n
m
26
Scenarios
p
k

N small use forward mode on full computation
M small use reverse mode on full computation
M N large, P small use reverse mode on A,
forward mode on BC
M N large, K small use reverse mode on AB,
forward mode on C
N, P, K, M large, Jacobians of A, B, C sparse
compressed forward mode
N, P, K, M large, Jacobians of A, B, C low rank
scarce forward mode
N, P, K, M large Jacobians of A, B, C dense
what to do?

B
C
A
n
m
27
Automatic Differentiation What Can Go Wrong (and
what to do about it)
28
Issues with Black Box Differentiation

Source code may not be available or may be
difficult to work with
Simulation may not be (chain rule) differentiable
Feedback due to adaptive algorithms
Nondifferentiable functions
Noisy functions
Convergence rates
Etc.
Accurate derivatives may not be needed (FD might
be cheaper)
Differentiation and discretization do not commute

29
Difficulties in ADIFOR 2.0 processing of FLUENT

Dubious programming techniques
Type mismatches in actual declared parameters
Bugs
inconsistent number of arguments in subroutine
calls
Not conforming to Fortran77 standard
while statement in one subroutine
ADIFOR2.0 limitations
I/O statements containing function invocations

30
Overflows in Fluent.AD

Dynamic range of derivative code often is larger
than that of original code.
This may lead to overflows in the derivative
code, in particular in 32-bit arithmetic.

AD-generated code r4_v ap / cendiv r4_b 1.0
/ cendiv r5_b (-r4_v) / cendiv do g_i_ 1,
g_pmax_ g_axp(g_i_) g_axp(g_i_) r4_b
g_ap(g_i_) r5_b g_cendiv(g_i_) enddo axp
axp r4_v
Original Code if (cendiv.eq.0.0) cendiv
zero endif axp axpap/cendiv
31
Wisconsin Sea Ice Model
32
Chain rule (non-)differentiability
if (x .eq. 1.0) then a y else if ((x .eq.
0.0) then a 0 else a xy endif b
sqrt(x4 y4)
33
Mathematical Challenges

Derivatives of intrinsic functions at points of
non-differentiability
Derivatives of implicitly defined functions
Derivatives of functions computed using numerical
methods

34
Points of Nondifferentiability

Due to intrinsic functions
Several intrinsic functions are defined at points
where their derivatives are not, e.g.
abs(x), sqrt(x) at x0
max(x,y) at xy
Requirements
Record/report exceptions
Optionally, continue computation using some
generalized gradient
ADIFOR/ADIC approach
User-selected reporting mechanism
User-defined generalized gradients, e.g.
1.0,0.0 for max(x,0)
0.5,0.5 for max(x,y)
Various ways of handling
Verbose reports (file, line, type of exception)
Terse summary (like IEEE flags)
Ignore
Due to conditional branches
May be able to handle using trust regions

35
ADIntrinsics System
Template at F77 intrinsic call site (gen. by AD
Tool)
Translation Blueprints (default or by user)
Reporting Flavor (default or user)
Directives in Code (optional by user)
ADIntrinsics Runtime Library
F77 Derivative Code
Library Calls (optional by user)
Executable
36
Implicitly Defined Functions

Implicitly defined functions often computed using
iterative methods
Function and derivatives may converge at
different rates
Derivative may not be accurate if iteration
halted upon function convergence
Solutions
Tighten function convergence criteria
Add derivative convergence to stopping criteria
Compute derivatives directly, e.g. A ?x ?b

37
Derivatives of Functions Computed Using Numerical
Methods

Differentiation and approximation may not commute
Need to be careful about how derivatives of
numerical approximations are used
E.g., differentiating through an ODE integrator
can provide unexpected results due to feedback
induced by adaptive stepsize control

38
CVODE_S Objective
F(y,p)
p
ODE Solver (CVODE)
ytt1, t2, ...
yt0
automatically
Sensitivity Solver
ytt1, t2, ...
p
dy/dptt1, t2, ...
yt0
39
Possible Approaches
Apply AD to CVODE
ad_F(y,ad_y,p,ad_p)
p,ad_p
ad_CVODE
y, ad_ytt1, t2, ...
y, ad_yt0
Solve sensitivity eqns
ad_F(y,ad_y,p,ad_p)
p
CVODE_S
y, dy/dp tt1, t2, ...
yt0
40
CVODE as ODE sensitivity solver

Augmented ODE initial-value problem

41
CVODE_S Test Problem

Diurnal kinetics advection-diffusion equation
100x100 structured grid
16 Pentium III nodes

42
SensPVODE Number of Timesteps
43
SensPVODE Time/Timestep
44
Addressing Limitations in Black Box AD

Detect points of nondifferentiability
proceed with a subgradient
currently supported for intrinsic functions, but
not conditional statements
Exploit mathematics to avoid differentiating
through an adaptive algorithm
Modify termination criterion for implicitly
defined functions
Tighten tolerance
Add derivatives to termination test (preferred)

45
Conclusions

There are some potential gotchas when applying
AD in a black box fashion
Some care should be taken to ensure that the
desired quantity is computed
There are usually workarounds

46
AD Intro Follow Up
47
Practical Matters constructing computational
graphs

At compile time (source transformation)
Structure of graph is known, but edge weights are
not in effect, implement inspector (symbolic)
phase at compile time (offline), executor
(numeric) phase at run time (online)
In order to assemble graph from individual
statements, must be able to resolve aliases, be
able to match variable definitions and uses
Scope of computational graph construction is
usually limited to statements or basic blocks
Computational graph usually has O(10)O(100)
vertices
At run time (operator overloading)
Structure and weights both discovered at runtime
Completely onlinecannot afford polynomial time
algorithms to analyze graph
Computational graph may have O(10,000) vertices

48
Checkpointing real applications

In practice, need a combination of all of these
techniques
At the timestep level, 2- or 3-level
checkpointing is typical too many timesteps to
checkpoint every timestep
At the call tree level, some mixture of joint and
split mode is desirable
Pure split mode consumes too much memory
Pure joint mode wastes time recomputing at the
lowest levels of the call tree
Currently, OpenAD provides a templating mechanism
to simplify the use of mixed checkpointing
strategies
Future research will attempt to automate some of
the checkpointing strategy selection, including
dynamic adaptation

49
What to do for very large, dense Jacobian
matrices?

Jacobian matrix might be large and dense, but
effectively sparse
Many entries below some threshold e (almost
zero)
Can tolerate errors up to d in entries greater
than e
Example advection-diffusion for finite time
step nonlocal terms fall off exponentially
Solution do a partial coloring to compress this
dense Jacobian requires solving a modified graph
coloring problem
Jacobian might be large and dense, but
effectively low rank
Can be well approximated by a low rank matrix
Jacobian-vector products (and JTv) are cheap
Adaptation of efficient subspace method uses Jv
and JTv in random directions to build low rank
approximation (Abdel-Khalik et al., AD08)

50
Application highlights

Atmospheric chemistry
Breast cancer biostatistical analysis
CFD CFL3D, NSC2KE, (Fluent 4.52 Aachen) ...
Chemical kinetics
Climate and weather MITGCM, MM5, CICE
Semiconductor device simulation
Water reservoir simulation

51
Parameter tuning sea ice model
- Simulated (yellow) and observed (green) March
ice thickness (m)
Tuned parameters
Standard parameters
52
Differentiated Toolkit CVODES (nee SensPVODE)

Diurnal kinetics advection-diffusion equation
100x100 structured grid
16 Pentium III nodes

53
Sensitivity Analysis mesoscale weather model
54
Conclusions Future Work

Automatic differentiation research involves a
wide range of combinatorial problems
AD is a powerful tool for scientific computing
Modern automatic differentiation tools are robust
and produce efficient code for complex simulation
codes
Robustness requires an industrial-strength
compiler infrastructure
Efficiency requires sophisticated compiler
analysis
Effective use of automatic differentiation
depends on insight into problem structure
Future Work
Further develop and test techniques for computing
Jacobians that are effectively sparse or
effectively low rank
Develop techniques to automatically generate
complex and adaptive checkpointing strategies

55
About Argonne

First national laboratory
25 miles southwest of Chicago
3000 employees (1000 scientists)
Mathematics and Computer Science (MCS) Division
100 scientific staff
Research in parallel programming models, grid
computing, scientific computing, component-based
software engineering, applied math,
Summer programs for undergrads and grad students
(Jan/Feb deadlines)
Research semester and co-op opportunities for
undergrads
Postdoc, programmer, and scientist positions
available
AD group has 1 opening

56
For More Information

Andreas Griewank, Evaluating Derivatives, SIAM,
2000.
Griewank, On Automatic Differentiation this
and other technical reports available online at
http//www.mcs.anl.gov/autodiff/tech_reports.html
AD in general http//www.mcs.anl.gov/autodiff/,
http//www.autodiff.org/
ADIFOR http//www.mcs.anl.gov/adifor/
ADIC http//www.mcs.anl.gov/adic/
OpenAD http//www.mcs.anl.gov/openad/
Other tools http//www.autodiff.org/
E-mail hovland_at_mcs.anl.gov