Title: Automatic Differentiation Tutorial
1Automatic Differentiation Tutorial
- Paul Hovland
- Andrew Lyons
- Boyana Norris
- Jean Utke
- Mathematics Computer Science Division
- Argonne National Laboratory
2AD in a Nutshell
- Technique for computing analytic derivatives of
programs (millions of loc) - Derivatives used in optimization, nonlinear PDEs,
sensitivity analysis, inverse problems, etc. - AD analytic differentiation of elementary
functions propagation by chain rule - Every programming language provides a limited
number of elementary mathematical functions - Thus, every function computed by a program may be
viewed as the composition of these so-called
intrinsic functions - Derivatives for the intrinsic functions are known
and can be combined using the chain rule of
differential calculus - Associativity of the chain rule leads to two main
modes forward and reverse - Can be implemented using source transformation or
operator overloading
3Modes of AD
- Forward Mode
- Propagates derivative vectors, often denoted ?u
or g_u - Derivative vector ?u contains derivatives of u
with respect to independent variables - Time and storage proportional to vector length (
indeps) - Reverse Mode
- Propagates adjoints, denoted u or u_bar
- Adjoint u contains derivatives of dependent
variables with respect to u - Propagation starts with dependent variablesmust
reverse flow of computation - Time proportional to adjoint vector length (
dependents) - Storage proportional to number of operations
- Because of this limitation, often applied to
subprograms
4Which mode to use?
- Use forward mode when
- independents is very small
- Only a directional derivative (Jv) is needed
- Reverse mode is not tractable
- Use reverse mode when
- dependents is very small
- Only JTv is needed
5Ways of Implementing AD
- Operator Overloading
- Use language features to implement
differentiation rules or to generate trace
(tape) of computation - Implementation can be very simple
- Difficult to go beyond one operation/statement at
a time in doing AD - Potential overhead due to compiler-generated
temporaries, e.g. w xyz è tmp xy w
tmpz. - Examples ADOL-C, ADMAT, TFADltgt,SACADO
- Source Transformation
- Requires significant (compiler) infrastructure
- More flexibility in exploiting chain rule
associativity - Examples ADIFOR, ADIC, OpenAD, TAF, TAPENADE
6Operator Overloading simple example
(implementation)
class a_double private double value,
grad public / constructors /
a_double(double v0.0, double g0.0)valuev
grad g / operators / friend a_double
operator(const a_double g1, const a_double g2)
return a_double(g1.valueg2.value,g1.gradg2
.grad) friend a_double operator-(const
a_double g1, const a_double g2) return
a_double(g1.value-g2.value,g1.grad-g2.grad)
friend a_double operator(const a_double g1,
const a_double g2) return
a_double(g1.valueg2.value,g2.valueg1.gradg1.val
ueg2.grad) friend a_double sin(const
a_double g1) return a_double(sin(g1.value),
cos(g1.value)g1.grad) friend a_double
cos(const a_double g1) return
a_double(cos(g1.value),-sin(g1.value)g1.grad)
// ...
7Operator Overloading simple example (use)
include ltmath.hgt include ltstream.hgt include
"adouble.hxx" void func(a_double f, a_double x,
a_double y) a_double a,b if (x gt y)
a cos(x) b sin(y)yy else a
xsin(x)/y b exp(y) f
exp(ab)
8Source Transformation simple example
C Generated by TAPENADE (INRIA,
Tropics team) C ... SUBROUTINE FUNC_D(f,
fd, x, xd, y, yd) DOUBLE PRECISION f, fd,
x, xd, y, yd DOUBLE PRECISION a, ad, arg1,
arg1d, b, bd INTRINSIC COS, EXP, SIN C
IF (x .GT. y) THEN ad -(xdSIN(x))
a COS(x) bd ydCOS(y)yy
SIN(y)(ydyyyd) b SIN(y)yy
ELSE ad ((xdSIN(x)xxdCOS(x))y-xSIN
(x)yd)/y2 a xSIN(x)/y bd
ydEXP(y) b EXP(y) END IF
arg1d adb abd arg1 ab fd
arg1dEXP(arg1) f EXP(arg1) RETURN
END
9The ADIFOR/ADIC Process
subroutine to be differentiated AD_TOP
deba number of derivatives to be
computed AD_PMAX 5 independent
variables AD_IVARS b,c,d,nu100,nu210
dependent variables AD_OVARS
hmin,pmaxe,cfhp Name of file listing names of
files containing deba and all subroutines
called by it. AD_PROG deba.cmp
After initial setup, usually only input code
changes.
10Tools
- Fortran 95
- C/C
- Fortran 77
- MATLAB
- Other languages Ada, Python, ...
- More tools at http//www.autodiff.org/
11Tools Fortran 95
- TAF (FastOpt)
- Commercial tool
- Support for (almost) all of Fortran 95
- Used extensively in geophysical sciences
applications - Tapenade (INRIA)
- Support for many Fortran 95 features
- Developed by a team with extensive compiler
experience - OpenAD/F (Argonne/UChicago/Rice)
- Support for many Fortran 95 features
- Developed by a team with expertise in
combintorial algorithms, compilers, software
engineering, and numerical analysis - Development driven by climate model
astrophysics code - All three forward and reverse source
transformation
12Tools C/C
- ADOL-C (Dresden)
- Mature tool
- Support for all of C
- Operator overloading forward and reverse modes
- ADIC (Argonne/UChicago)
- Support for all of C, some C
- Source transformation forward mode (reverse
under development) - New version (2.0) based on industrial strength
compiler infrastructure - Shares some infrastructure with OpenAD/F
- SACADO
- Operator overloading forward and reverse modes
- See Phipps presentation
- TAC (FastOpt)
- Commercial tool (under development)
- Support for much of C/C
- Source transformation forward and reverse modes
- Shares some infrastructure with TAF
13Tools Fortran 77
- ADIFOR (Rice/Argonne)
- Mature and very robust tool
- Support for all of Fortran 77
- Forward and (adequate) reverse modes
- Hundreds of users 150 citations
- AdiMat (Aachen) source transformation
- MAD (Cranfield/TOMLAB) operator overloading
- Various research prototypes
Tools MATLAB
14Simple example revisited
include ltmath.hgt include ltstream.hgt include
"adouble.hxx" void func(double f, double x,
double y) double a,b if (x gt y) a
cos(x) b sin(y)yy else a
xsin(x)/y b exp(y) f
exp(ab)
15ADIFOR 2.0 simple example (x only)
C This file was generated on 09/15/03 by the
version of C ADIFOR compiled on June, 1998. C
... subroutine g_func(f, g_f, x, g_x, y)
double precision f, x, y, a, b
double precision d1_w, d2_v, d1_p, d2_b, g_a,
g_x, g_d1_w, g_f save g_a, g_d1_w
if (x .gt. y) then d2_v cos(x)
d1_p -sin(x) g_a d1_p g_x
a d2_v C-------- b sin(y) y
y else d2_v sin(x)
d1_p cos(x) d2_b 1.0d0 / y
g_a (d2_b d2_v d2_b x d1_p) g_x
a x d2_v / y C-------- b
exp(y) endif g_d1_w b g_a
d1_w a b d2_v exp(d1_w)
d1_p d2_v g_f d1_p g_d1_w
f d2_v C-------- C return end
16ADIFOR 3.0 simple example
subroutine g_func_proc(ad_p_, f, g_f, x,
g_x, y, g_y) include 'g_maxs.h' cdeclarat
ions here integer ad_p_ double
precision d_tmp_s_0_a double precision
g_a(ad_pmax_) double precision
g_b(ad_pmax_) double precision
g_f(ad_pmax_) double precision
g_X2_dtmp1(ad_pmax_) double precision
g_x(ad_pmax_) double precision
g_y(ad_pmax_) double precision
d_tmp_ipar_0 double precision
d_tmp_val_0 double precision d_tmp_val_1
double precision f, x, y, a, b
double precision X2_dtmp1 CAD4 Start of
executable statements. d_tmp_ipar_0
(-sin(x)) call g_accum_d_2(ad_p_, g_a,
d_tmp_ipar_0, g_x) a cos(x)
d_tmp_val_0 sin(y) d_tmp_val_1
d_tmp_val_0 y d_tmp_ipar_0 cos(y)
d_tmp_s_0_a d_tmp_val_1 y d_tmp_val_0
y y d_tmp_ipar _0 call
g_accum_d_2(ad_p_, g_b, d_tmp_s_0_a, g_y)
b d_tmp_val_1 y call
g_accum_d_4(ad_p_, g_X2_dtmp1, a, g_b, b, g_a)
X2_dtmp1 a b d_tmp_ipar_0
exp(X2_dtmp1) call g_accum_d_2(ad_p_,
g_f, d_tmp_ipar_0, g_X2_dtmp1) f
exp(X2_dtmp1) C return end
17TAPENADE reverse mode simple example
C Generated by TAPENADE (INRIA,
Tropics team) C Version 2.0.6 - (Id 1.14 vmp
Stable - Fri Sep 5 140823 MEST 2003) C ...
SUBROUTINE FUNC_B(f, fb, x, xb, y, yb)
DOUBLE PRECISION f, fb, x, xb, y, yb DOUBLE
PRECISION a, ab, arg1, arg1b, b, bb INTEGER
branch INTRINSIC COS, EXP, SIN C IF
(x .GT. y) THEN a COS(x) b
SIN(y)yy CALL PUSHINTEGER4(0)
ELSE a xSIN(x)/y b EXP(y)
CALL PUSHINTEGER4(1) END IF arg1
ab f EXP(arg1) arg1b
EXP(arg1)fb ab barg1b bb
aarg1b CALL POPINTEGER4(branch) IF
(branch .LT. 1) THEN yb yb
(SIN(y)yySIN(y)yyCOS(y))bb xb xb
- SIN(x)ab ELSE yb yb
EXP(y)bb - xSIN(x)ab/y2 xb xb
(xCOS(x)/ySIN(x)/y)ab END IF fb
0.D0 END
18OpenAD system architecture
19OpenAD simple example
SUBROUTINE func(F, X, Y) use
w2f__types use active_module C Declarations
C ... IF (Xv .GT. Yv) THEN
OpenAD_Symbol_1 COS(Xv)
OpenAD_Symbol_0 (-SIN(Xv)) Av
OpenAD_Symbol_1 OpenAD_Symbol_5
SIN(Yv) OpenAD_Symbol_2
(YvOpenAD_Symbol_5) OpenAD_Symbol_9
(YvOpenAD_Symbol_2) OpenAD_Symbol_3
OpenAD_Symbol_2 OpenAD_Symbol_6
OpenAD_Symbol_5 OpenAD_Symbol_8
COS(Yv) OpenAD_Symbol_7 Yv
OpenAD_Symbol_4 Yv Bv
OpenAD_Symbol_9 OpenAD_Symbol_25
(OpenAD_Symbol_6 OpenAD_Symbol_8 gt
OpenAD_Symbol_7) OpenAD_Symbol_26
(OpenAD_Symbol_3 OpenAD_Symbol_25 gt
OpenAD_Symbol_4) OpenAD_Symbol_28
OpenAD_Symbol_0 CALL setderiv(OpenAD_Symbo
l_29,X) CALL setderiv(OpenAD_Symbol_27,Y)
CALL sax(OpenAD_Symbol_26,OpenAD_Symbol_27
,B) CALL sax(OpenAD_Symbol_28,OpenAD_Symbo
l_29,A) ELSE C ...
20ADIC simple example
include "ad_deriv.h" include ltmath.hgt include
"adintrinsics.h" void ad_func(DERIV_TYPE
f,DERIV_TYPE x,DERIV_TYPE y) DERIV_TYPE a,
b, ad_var_0, ad_var_1, ad_var_2 double
ad_adji_0,ad_loc_0,ad_loc_1,ad_adj_0,ad_adj_1,ad_a
dj_2,ad_adj_3 if (DERIV_val(x) gt
DERIV_val(y)) DERIV_val(a) cos(
DERIV_val(x)) /cos/ ad_adji_0 -sin(
DERIV_val(x))
ad_grad_axpy_1((a), ad_adji_0, (x))
DERIV_val(ad_var_0) sin( DERIV_val(y))
/sin/ ad_adji_0 cos( DERIV_val(y))
ad_grad_axpy_1((ad_var_0),
ad_adji_0, (y))
ad_loc_0 DERIV_val(ad_var_0) DERIV_val(y)
ad_loc_1 ad_loc_0 DERIV_val(y)
ad_adj_0 DERIV_val(ad_var_0)
DERIV_val(y) ad_adj_1 DERIV_val(y)
DERIV_val(y) ad_grad_axpy_3((b),
ad_adj_1, (ad_var_0), ad_adj_0, (y), ad_loc_0,
(y)) DERIV_val(b) ad_loc_1
else // ...
21ADIC-Generated Code Interpretation
Original code y x1x2x3x4
DERIV_val(y) value of program
variable y DERIV_grad(y)derivative object
associated with y
ad_loc_0 DERIV_val(x1) DERIV_val(x2) ad_loc_1
ad_loc_0 DERIV_val(x3) dy/dx4 ad_loc_2
ad_loc_1 DERIV_val(x4) y ad_adj_0 ad_loc_0
DERIV_val(x4) dy/dx3 ad_adj_1
DERIV_val(x3) DERIV_val(x4) ad_adj_2
DERIV_val(x1) ad_adj_1 dy/dx2 ad_adj_3
DERIV_val(x2) ad_adj_1 dy/dx1
reverse (or adjoint) mode of AD
for(i0iltnindepsi) DERIV_grad(y)i
ad_adj_3DERIV_grad(x1)iad_adj_2DERIV_grad(x2
)iad_adj_0DERIV_grad(x3)iad_loc_1DERIV_gr
ad(x4)i
forward mode of AD
DERIV_val(y) ad_loc_2
original value
22Matrix Coloring
- Jacobian matrices are often sparse
- The forward mode of AD computes J S, where S is
usually an identity matrix or a vector - Can compress Jacobian by choosing S such that
structurally orthogonal columns are combined - A set of columns are structurally orthogonal if
no two of them have nonzeros in the same row - Equivalent problem color the graph whose
adjacency matrix is JTJ - Equivalent problem distance-2 color the
bipartite graph of J
23Compressed Jacobian
24What is feasible practical
- Key point forward mode computes JS at cost
proportional to number of columns in S reverse
mode computes JTW at cost proportional to number
of columns in W - Jacobians of functions with small number (11000)
of independent variables (forward mode, SI) - Jacobians of functions with small number (1100)
of dependent variables (reverse/adjoint mode,
SI) - Very (extremely) large, but (very) sparse
Jacobians and Hessians (forward mode plus
coloring) - Jacobian-vector products (forward mode)
- Transposed-Jacobian-vector products (adjoint
mode) - Hessian-vector products (forward adjoint modes)
- Large, dense Jacobian matrices that are
effectively sparse or effectively low rank (e.g.,
see Abdel-Khalik et al., AD2008)
25Scenarios
p
k
-
- N small use forward mode on full computation
- M small use reverse mode on full computation
- M N large, P small use reverse mode on A,
forward mode on BC - M N large, K small use reverse mode on AB,
forward mode on C - N, P, K, M large, Jacobians of A, B, C sparse
compressed forward mode - N, P, K, M large, Jacobians of A, B, C low rank
scarce forward mode
B
C
A
n
m
26Scenarios
p
k
-
- N small use forward mode on full computation
- M small use reverse mode on full computation
- M N large, P small use reverse mode on A,
forward mode on BC - M N large, K small use reverse mode on AB,
forward mode on C - N, P, K, M large, Jacobians of A, B, C sparse
compressed forward mode - N, P, K, M large, Jacobians of A, B, C low rank
scarce forward mode - N, P, K, M large Jacobians of A, B, C dense
what to do?
B
C
A
n
m
27Automatic Differentiation What Can Go Wrong (and
what to do about it)
28Issues with Black Box Differentiation
- Source code may not be available or may be
difficult to work with - Simulation may not be (chain rule) differentiable
- Feedback due to adaptive algorithms
- Nondifferentiable functions
- Noisy functions
- Convergence rates
- Etc.
- Accurate derivatives may not be needed (FD might
be cheaper) - Differentiation and discretization do not commute
29Difficulties in ADIFOR 2.0 processing of FLUENT
- Dubious programming techniques
- Type mismatches in actual declared parameters
- Bugs
- inconsistent number of arguments in subroutine
calls - Not conforming to Fortran77 standard
- while statement in one subroutine
- ADIFOR2.0 limitations
- I/O statements containing function invocations
30Overflows in Fluent.AD
- Dynamic range of derivative code often is larger
than that of original code. - This may lead to overflows in the derivative
code, in particular in 32-bit arithmetic.
AD-generated code r4_v ap / cendiv r4_b 1.0
/ cendiv r5_b (-r4_v) / cendiv do g_i_ 1,
g_pmax_ g_axp(g_i_) g_axp(g_i_) r4_b
g_ap(g_i_) r5_b g_cendiv(g_i_) enddo axp
axp r4_v
Original Code if (cendiv.eq.0.0) cendiv
zero endif axp axpap/cendiv
31Wisconsin Sea Ice Model
32Chain rule (non-)differentiability
if (x .eq. 1.0) then a y else if ((x .eq.
0.0) then a 0 else a xy endif b
sqrt(x4 y4)
33Mathematical Challenges
- Derivatives of intrinsic functions at points of
non-differentiability - Derivatives of implicitly defined functions
- Derivatives of functions computed using numerical
methods
34Points of Nondifferentiability
- Due to intrinsic functions
- Several intrinsic functions are defined at points
where their derivatives are not, e.g. - abs(x), sqrt(x) at x0
- max(x,y) at xy
- Requirements
- Record/report exceptions
- Optionally, continue computation using some
generalized gradient - ADIFOR/ADIC approach
- User-selected reporting mechanism
- User-defined generalized gradients, e.g.
- 1.0,0.0 for max(x,0)
- 0.5,0.5 for max(x,y)
- Various ways of handling
- Verbose reports (file, line, type of exception)
- Terse summary (like IEEE flags)
- Ignore
- Due to conditional branches
- May be able to handle using trust regions
35ADIntrinsics System
Template at F77 intrinsic call site (gen. by AD
Tool)
Translation Blueprints (default or by user)
Reporting Flavor (default or user)
Directives in Code (optional by user)
ADIntrinsics Runtime Library
F77 Derivative Code
Library Calls (optional by user)
Executable
36Implicitly Defined Functions
- Implicitly defined functions often computed using
iterative methods - Function and derivatives may converge at
different rates - Derivative may not be accurate if iteration
halted upon function convergence - Solutions
- Tighten function convergence criteria
- Add derivative convergence to stopping criteria
- Compute derivatives directly, e.g. A ?x ?b
37Derivatives of Functions Computed Using Numerical
Methods
- Differentiation and approximation may not commute
- Need to be careful about how derivatives of
numerical approximations are used - E.g., differentiating through an ODE integrator
can provide unexpected results due to feedback
induced by adaptive stepsize control
38CVODE_S Objective
F(y,p)
p
ODE Solver (CVODE)
ytt1, t2, ...
yt0
automatically
Sensitivity Solver
ytt1, t2, ...
p
dy/dptt1, t2, ...
yt0
39Possible Approaches
Apply AD to CVODE
ad_F(y,ad_y,p,ad_p)
p,ad_p
ad_CVODE
y, ad_ytt1, t2, ...
y, ad_yt0
Solve sensitivity eqns
ad_F(y,ad_y,p,ad_p)
p
CVODE_S
y, dy/dp tt1, t2, ...
yt0
40CVODE as ODE sensitivity solver
- Augmented ODE initial-value problem
41CVODE_S Test Problem
- Diurnal kinetics advection-diffusion equation
- 100x100 structured grid
- 16 Pentium III nodes
42SensPVODE Number of Timesteps
43SensPVODE Time/Timestep
44Addressing Limitations in Black Box AD
- Detect points of nondifferentiability
- proceed with a subgradient
- currently supported for intrinsic functions, but
not conditional statements - Exploit mathematics to avoid differentiating
through an adaptive algorithm - Modify termination criterion for implicitly
defined functions - Tighten tolerance
- Add derivatives to termination test (preferred)
45Conclusions
- There are some potential gotchas when applying
AD in a black box fashion - Some care should be taken to ensure that the
desired quantity is computed - There are usually workarounds
46AD Intro Follow Up
47Practical Matters constructing computational
graphs
- At compile time (source transformation)
- Structure of graph is known, but edge weights are
not in effect, implement inspector (symbolic)
phase at compile time (offline), executor
(numeric) phase at run time (online) - In order to assemble graph from individual
statements, must be able to resolve aliases, be
able to match variable definitions and uses - Scope of computational graph construction is
usually limited to statements or basic blocks - Computational graph usually has O(10)O(100)
vertices - At run time (operator overloading)
- Structure and weights both discovered at runtime
- Completely onlinecannot afford polynomial time
algorithms to analyze graph - Computational graph may have O(10,000) vertices
48Checkpointing real applications
- In practice, need a combination of all of these
techniques - At the timestep level, 2- or 3-level
checkpointing is typical too many timesteps to
checkpoint every timestep - At the call tree level, some mixture of joint and
split mode is desirable - Pure split mode consumes too much memory
- Pure joint mode wastes time recomputing at the
lowest levels of the call tree - Currently, OpenAD provides a templating mechanism
to simplify the use of mixed checkpointing
strategies - Future research will attempt to automate some of
the checkpointing strategy selection, including
dynamic adaptation
49What to do for very large, dense Jacobian
matrices?
- Jacobian matrix might be large and dense, but
effectively sparse - Many entries below some threshold e (almost
zero) - Can tolerate errors up to d in entries greater
than e - Example advection-diffusion for finite time
step nonlocal terms fall off exponentially - Solution do a partial coloring to compress this
dense Jacobian requires solving a modified graph
coloring problem - Jacobian might be large and dense, but
effectively low rank - Can be well approximated by a low rank matrix
- Jacobian-vector products (and JTv) are cheap
- Adaptation of efficient subspace method uses Jv
and JTv in random directions to build low rank
approximation (Abdel-Khalik et al., AD08)
50Application highlights
- Atmospheric chemistry
- Breast cancer biostatistical analysis
- CFD CFL3D, NSC2KE, (Fluent 4.52 Aachen) ...
- Chemical kinetics
- Climate and weather MITGCM, MM5, CICE
- Semiconductor device simulation
- Water reservoir simulation
51Parameter tuning sea ice model
- Simulated (yellow) and observed (green) March
ice thickness (m)
Tuned parameters
Standard parameters
52Differentiated Toolkit CVODES (nee SensPVODE)
- Diurnal kinetics advection-diffusion equation
- 100x100 structured grid
- 16 Pentium III nodes
53Sensitivity Analysis mesoscale weather model
54Conclusions Future Work
- Automatic differentiation research involves a
wide range of combinatorial problems - AD is a powerful tool for scientific computing
- Modern automatic differentiation tools are robust
and produce efficient code for complex simulation
codes - Robustness requires an industrial-strength
compiler infrastructure - Efficiency requires sophisticated compiler
analysis - Effective use of automatic differentiation
depends on insight into problem structure - Future Work
- Further develop and test techniques for computing
Jacobians that are effectively sparse or
effectively low rank - Develop techniques to automatically generate
complex and adaptive checkpointing strategies
55About Argonne
- First national laboratory
- 25 miles southwest of Chicago
- 3000 employees (1000 scientists)
- Mathematics and Computer Science (MCS) Division
- 100 scientific staff
- Research in parallel programming models, grid
computing, scientific computing, component-based
software engineering, applied math, - Summer programs for undergrads and grad students
(Jan/Feb deadlines) - Research semester and co-op opportunities for
undergrads - Postdoc, programmer, and scientist positions
available - AD group has 1 opening
56For More Information
- Andreas Griewank, Evaluating Derivatives, SIAM,
2000. - Griewank, On Automatic Differentiation this
and other technical reports available online at
http//www.mcs.anl.gov/autodiff/tech_reports.html - AD in general http//www.mcs.anl.gov/autodiff/,
http//www.autodiff.org/ - ADIFOR http//www.mcs.anl.gov/adifor/
- ADIC http//www.mcs.anl.gov/adic/
- OpenAD http//www.mcs.anl.gov/openad/
- Other tools http//www.autodiff.org/
- E-mail hovland_at_mcs.anl.gov