Robin Hogan - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Robin Hogan

Description:

Fast reverse-mode automatic differentiation using expression templates in C++ Robin Hogan University of Reading – PowerPoint PPT presentation

Number of Views:97
Avg rating:3.0/5.0
Slides: 34
Provided by: RobinH175
Category:

less

Transcript and Presenter's Notes

Title: Robin Hogan


1
Fast reverse-mode automatic differentiation using
expression templates in C
  • Robin Hogan
  • University of Reading

2
Overview
  • Spaceborne radar and lidar
  • Adjoint coding
  • Automatic differentiation
  • New approach
  • Testing with lidar multiple-scattering forward
    models

3
Spaceborne radar, lidar and radiometers
EarthCare
  • The A-Train
  • NASA
  • 700-km orbit
  • CloudSat 94-GHz radar (launch 2006)
  • Calipso 532/1064-nm depol. lidar
  • MODIS multi-wavelength radiometer
  • CERES broad-band radiometer
  • AMSR-E microwave radiometer
  • EarthCARE launch 2015(?)
  • ESAJAXA
  • 400-km orbit more sensitive
  • 94-GHz Doppler radar
  • 355-nm HSRL/depol. lidar
  • Multispectral imager
  • Broad-band radiometer
  • Heart-warming name

4
What do CloudSat and Calipso see?
  • Radar D6, detects whole profile, surface echo
    provides integral constraint
  • Lidar D2, more sensitive to thin cirrus and
    liquid but attenuated
  • Radar-lidar ratio provides size D

Cloudsat radar
CALIPSO lidar
Insects Aerosol Rain Supercooled liquid
cloud Warm liquid cloud Ice and supercooled
liquid Ice Clear No ice/rain but possibly
liquid Ground
Target classification
Delanoe and Hogan (2008, 2010)
5
Unified retrieval
  • Ingredients developed
  • Implement previous work
  • Not yet developed

6
Unified retrieval Forward model
  • From state vector x to forward modelled
    observations H(x)...

x
Ice snow
Liquid cloud
Rain
Aerosol
Lookup tables to obtain profiles of extinction,
scattering backscatter coefficients, asymmetry
factor
Sum the contributions from each constituent
Radiative transfer models
7
Radiative transfer models

Observation Model Speed Status
Radar reflectivity factor Multiscatter single scattering option N OK
Radar reflectivity factor in deep convection Multiscatter single scattering plus TDTS MS model (Hogan and Battaglia 2008) N2 OK
Radar Doppler velocity Single scattering OK if no NUBF fast MS model with Doppler does not exist N2 Not available for MS
HSRL lidar in ice and aerosol Multiscatter PVC model (Hogan 2008) N OK
HSRL lidar in liquid cloud Multiscatter PVC plus TDTS models N2 OK
Lidar depolarization Multiscatter under development N2 In progress
Infrared radiances Delanoe and Hogan (2008) two-stream source function method N No adjoint
Infrared radiances RTTOV (EUMETSAT license) N Disappointing accuracy for clouds
Solar radiances LIDORT (permissive license) N Testing
  • After much pain have hand-coded adjoint for
    multiscatter model (in C) but still need adjoint
    for all the rest of the algorithm (in C)

8
Adjoint and Jacobian coding
  • Variational retrieval methods are posed as
  • find the vector x that minimises the cost
    function J(x)
  • Two common minimization methods
  • The quasi-Newton method requires the adjoint
    code to compute the gradient ?J/?x for any x
  • The Gauss-Newton method writes the observational
    part of the cost function as the sum of the
    squared deviation of the observations from their
    forward modelled counterparts y, and requires a
    code to compute the Jacobian matrix H ?y/?x
  • Since J(x) is complicated (containing all of our
    radiative transfer models), the code to generate
    ?J/?x or ?y/?x is even more complicated
  • Can it be generated automatically?

9
Approaches to adjoint coding
  • Do it by hand (e.g. ECMWF)
  • Painful and time consuming to debug
  • Generates the most efficient code
  • Do it numerically perturb each element of x one
    by one
  • Inefficient and infeasible for large x
  • Subject to round-off error
  • What Im using at the moment with Unified
    Algorithm
  • Automatic differentiation 1 Use a
    source-to-source compiler
  • E.g. TAPENADE/TAF/TAC generate adjoint source
    file from algorithm file generates quite
    efficient code
  • Comercial 5k/year for TAF/TAC academic license
    and need permission to distribute generated
    source code
  • TAPENADE requires to upload file to server
  • Limited support for C classes and no support
    for C templates
  • Automatic differentiation 2 Use an operator
    overloading technique
  • E.g. CppAD, ADOL-C, in principle can work with
    any language features
  • Typically 25 times slower than hand-coded
    adjoint!
  • Can we do better?

10
Simple example
  • Consider simple algorithm y(x0, x1) contrived for
    didactic purposes
  • Implemented in C or Fortran90 as
  • Task given ?J/?y, we want to compute ?J/?x0 and
    ?J/?x1

function algorithm(x) result(y) implicit none
real, intent(in) x(2) real
y real s y 4.0 s
2.0x(1) 3.0x(2)x(2) y y sin(s)
return endfunction
double algorithm(const double x2) double y
4.0 double s 2.0x0 3.0x1x1 y
sin(s) return y
11
Creating the adjoint code 1
  • Differentiate the algorithm
  • Write each statement in matrix form
  • Transpose the matrix to get equivalent adjoint
    statement

12
Creating the adjoint code 2
  • Apply adjoint statements in reverse order

Forward mode
double algorithm_AD(const double x2, double
y_AD1, double x_AD2) double y 4.0 double
s 2.0x0 3.0x1x1 y sin(s) /
Adjoint part / double s_AD 0.0 y_AD0
sin(s) y_AD0 s_AD y cos(s)
y_AD0 x_AD0 3.0 s_AD x_AD1 6.0
x0 s_AD s_AD 0.0 y_AD0 0.0 return
y
Note need to store intermediate values for the
reverse pass Hand-coding is time-consuming and
error prone for large codes
13
Automatic differentiation
  • We want something like this (now in C)
  • Operators (e.g. /) and functions (e.g. sin,
    exp, log) applying to adouble objects are
    overloaded not only to return the result of the
    operation, but also to store the gradient
    information in stack
  • Libraries CppAD, SACADO and ADOL-C do this but
    the result is around 25 times slower than
    hand-coded adjoints why?

adouble algorithm(const adouble x2) adouble y
4.0 adouble s 2.0x0 3.0x1x1 y
sin(s) return y // Main code Stack stack
// Object where info will be stored adouble
x2 , // Set algorithm inputs adouble y
algorithm(x) // Run algorithm and store info
in stack y.set_gradient(y_AD) // Set
dJ/dy stack.reverse() // Run adjoint code from
stored info x_AD0 x0.get_gradient() //
Save resulting values of dJ/dx0 x_AD1
x1.get_gradient() // ... and dJ/dx1
14
Minimum necessary storage
  • What is the minimum necessary storage to store
    these statements?
  • If we label each gradient by an integer (since
    theyre unknown in forward pass) then we need two
    stacks that can be added to as the algorithm
    progresses
  • Can then run backwards through stack to compute
    adjoints

Statement stack
Operation stack
Index to LHS gradient Index to first operation
2 (dy) 0
3 (ds) 0
2 (dy) 2

Multiplier Index to RHS gradient
0 2.0 0 (dx0)
1 6.0x1 1 (dx1)
2 sin(s) 2 (dy)
3 y cos(s) 3 (ds)
4
15
Adjoint algorithm is simple
  • Need to cope with three different types of
    differential statement

Forward mode
Equivalent adjoint statements
General differential statement
for i 0 to n
16
which can be coded as follows
  • This does the right thing in our three cases
  • Zero on RHS
  • One or more gradients on RHS
  • Same gradient on LHS and RHS

17
Dual numbers approach
  • How can these stacks be created?
  • Consider what happens when compiler sees this
    line
  • Compiler splits this up into two parts with
    temporary t
  • We could define adouble as dual number x, dx
    (invented by Clifford 1873) and then overload sin
    and operator
  • sin(s), cos(s)ds sin(s, ds)
  • yt, tdyydt y, dy t, dt
  • This would correctly apply
  • but only if the gradient terms on the
    right-hand-side are known!
  • This is not useful for the reverse-mode (adjoint)
    when we want to store a symbolic representation
    of the gradient on the forward sweep which is
    then filled on the reverse sweep
  • Dual numbers are used in some forward-mode-only
    (tangent linear) automatic differentiation tools.

y y sin(s)
adouble t sin(s) y operator(y, t)
18
So how do CppAD ADOL-C work?
  • In the forward pass they store the whole
    algorithm symbolically, not just the derivative
    form!
  • This means every operator and function needs to
    be stored symbolically (e.g. 0 for plus, 1 for
    minus, 42 for atan etc)
  • The stored algorithm can then be analysed to
    generate an adjoint function
  • This all happens behind the scenes so easy to
    use, but not surprising that it is 25 times
    slower than a hand-coded adjoint

19
Computational graphs
  • The basic problem is that standard operator
    overloading can only pass information from the
    most nested operation outwards

Pass y sin(s) to be new y
operator
Pass value of sin(s)
sin
y
s
20
Implementing the chain rule
 
21
Computational graph 2
  • Clearly differentiation most naturally involves
    passing information in the opposite sense

Each node representing arbitrary function or
operator y(a) needs to be able to take a real
number w and pass wdy/da down the chain Binary
function or operator y(a,b) would pass wdy/da to
one argument and wdy/db to other At the end of
the chain, store the result on the stack But how
do we implement this?
22
What is a template?
  • Templates are a key ingredient to generic
    programming in C
  • Imagine we have a function like this
  • We want it to work with any numerical type
    (single precision, complex numbers etc) but dont
    want to laboriously define a new overloaded
    function for each possible type
  • Can use a function template

double cube(const double x) double y
xxx return y
template lttypename Typegt Type cube(Type x) Type
y xxx return y double a 1.0 b
cube(a) // compiler creates function
cubeltdoublegt complexltdoublegt c(1.0, 2.0) // c
1 2i d cube(c) // compiler creates function
cubeltcomplexltdoublegt gt
23
What is an expression template?
  • C also supports class templates
  • Veldhuizen (1995) used this feature to introduce
    the idea of Expression Templates to optimize
    array operations and make C as fast as
    Fortran-90 for array-wise operations
  • We use it as a way to pass information in both
    directions through the expression tree
  • sin(A) for an argument of arbitrary type A is
    overloaded to return an object of type SinltAgt
  • operator(A,B) for arguments of arbitrary type A
    and B is overloaded to return an object of type
    MultiplyltA,Bgt
  • Now when we compile the statement yysin(x)
  • The right-hand-side resolves to an object RHS
    of type Multiplyltadouble,Sinltadoublegt gt
  • The overloaded assignment operator first calls
    RHS.value() to get y
  • It then calls RHS.calc_gradient(), to add entries
    to operation stack
  • Multiply and Sin are defined with member
    functions so that they can correctly pass
    information up and down the expression tree

24
New approach
  • The following types are passed up the chain at
    compile time

Multiplyltadouble,Sinltadoublegt gt
operator
Sinltadoublegt
adouble
sin
y
adouble
s
25
Implementation of SinltAgt
// Definition of Sin class template ltclass
Agt class Sin public ExpressionltSinltAgt gt
public // Member functions // Constructor
store reference to a and its numerical value
Sin(const ExpressionltAgt a) a_(a),
a_value_(a.value()) // Return the value
double value() const return
sin(a_value_) // Compute derivative and
pass to a void calc_gradient(Stack stack,
double multiplier) const
a_.calc_gradient(stack, cos(a_value_)multiplier)
private // Data members const A a_
// A reference to the object double
a_value_ // The numerical value of object //
Overload the sin function it returns a SinltAgt
object template ltclass Agt inline SinltAgt
sin(const ExpressionltAgt a) return SinltAgt(a)
  • Adept library has done this for all operators
    and functions

26
Optimizations
  • Why are expression templates fast?
  • Compound types representing complex expressions
    are known at compile time
  • C automatically inlines function calls between
    objects in an expression, leaving little more
    than the operations you would put in a hand-coded
    application of the chain rule
  • Further optimizations
  • Stack object keeps memory allocated between calls
    to avoid time spent allocating incrementally more
    memory
  • If the Jacobian is computed it is done in strips
    to exploit vectorization (SSE/SSE2 on Intel) and
    loop unrolling
  • The current stack is accessed by a global but
    thread-local variable, rather than storing a link
    to the stack in every adouble object (as in CppAD
    and ADOL-C)

27
Testing using lidar multiple scattering models
  • Photon Variance-Covariance method for small-angle
    multiple scattering
  • Hogan (JAS 2008)
  • Somewhat similar to a monochromatic radiance
    model
  • Four coupled ODEs are integrated forward in space
  • Several variables at N gates give N output
    signals
  • Computational cost proportional to N
  • Time-dependent two-stream method for wide-angle
    multiple scattering
  • Hogan and Battaglia (JAS 2008)
  • Similar to a time-dependent 1D advection model
  • Four coupled PDEs are integrated forward in time
  • Several variables at N gates gives N output
    signals
  • Computational cost proportional to N 2

28
Simulation of 3D photon transport
  • Animation of scalar flux (II)
  • Colour scale is logarithmic
  • Represents 5 orders of magnitude
  • Domain properties
  • 500-m thick
  • 2-km wide
  • Optical depth of 20
  • No absorption
  • In this simulation the lateral distribution is
    Gaussian at each height and each time

29
Benchmark results
  • Time relative to original code, gcc-4.4, Pentium
    2.5 GHz, 2 MB cache

Adjoint PVC N50 TDTS N50
Hand-coded adjoint 3.0 (1.02.0) 3.6 (1.02.6)
New C library Adept 3.5 (2.70.8) 3.8 (2.61.2)
ADOL-C 25 (187) 20 (155)
CppAD 29 (1577) 34 (1789)
30
Outlook
  • New library Adept (Automatic Differentiation
    using Expression Templates) produces adjoint with
    minimum difficulty for user
  • No knowledge of templates required by user at all
  • Simple and efficient to compute Jacobian matrix
    as well
  • Freely available at http//www.met.reading.ac.uk/c
    louds/adept/
  • Typically 5-20 slower than hand-coded adjoints
  • But immeasurably faster in terms of programmer
    time
  • Code is complete for applying to any C code with
    real numbers
  • Further development desirable
  • Complex numbers
  • Use within C matrix/vector libraries,
    particularly those that already use Expression
    Templates (like the one I use for the Unified
    Algorithm)
  • Easily facilitate checkpointing so large codes
    dont exhaust memory
  • Automatically compute higher-order derivatives
    (e.g. Hessian matrix)
  • Potential for student projects to get small data
    assimilation systems up and running and efficient
    quickly
  • Impossible to apply in Fortran no template
    capability!

31
(No Transcript)
32
Minimizing the cost function
  • and 2nd derivative (the Hessian matrix)
  • Gradient Descent methods
  • Fast adjoint method to calculate ?xJ means dont
    need to calculate Jacobian
  • Disadvantage more iterations needed since we
    dont know curvature of J(x)
  • Quasi-Newton method to get the search direction
    (e.g. L-BFGS used by ECMWF) builds up an
    approximate inverse Hessian A for improved
    convergence
  • Scales well for large x
  • Poorer estimate of the error at the end
  • Gradient of cost function (a vector)
  • Gauss-Newton method
  • Rapid convergence (instant for linear problems)
  • Get solution error covariance for free at the
    end
  • Levenberg-Marquardt is a small modification to
    ensure convergence
  • Need the Jacobian matrix H of every forward
    model can be expensive for larger problems as
    forward model may need to be rerun with each
    element of the state vector perturbed

33
Time-dependent 2-stream approx.
  • Describe diffuse flux in terms of outgoing stream
    I and incoming stream I, and numerically
    integrate the following coupled PDEs
  • These can be discretized quite simply in time and
    space (no implicit methods or matrix inversion
    required)

Source Scattering from the quasi-direct beam
into each of the streams
Time derivative Remove this and we have the
time-independent two-stream approximation
Gain by scattering Radiation scattered from
the other stream
Loss by absorption or scattering Some of lost
radiation will enter the other stream
Spatial derivative Transport of radiation
from upstream
Hogan and Battaglia (2008, J. Atmos. Sci.)
Write a Comment
User Comments (0)
About PowerShow.com