Statistical Approaches to Inverse Problems

About This Presentation

Title:

Statistical Approaches to Inverse Problems

Description:

Statistical Approaches to Inverse Problems DIIG seminars on inverse problems Insight and Algorithms Niels Bohr Institute Copenhagen, Denmark 27-29 May 2002 – PowerPoint PPT presentation

Number of Views:154

Avg rating:3.0/5.0

Slides: 68

Provided by: Phili181

Learn more at: https://www.stat.berkeley.edu

Category:

more less

Transcript and Presenter's Notes

Title: Statistical Approaches to Inverse Problems

1
Statistical Approaches to Inverse Problems

DIIG seminars on inverse problems Insight and
Algorithms
Niels Bohr InstituteCopenhagen, Denmark27-29
May 2002(Revised 13 May 2003)
P.B. Stark
Department of Statistics
University of California
Berkeley, CA 94720-3860
www.stat.berkeley.edu/stark

2
Abstract

It is useful to distinguish between the intrinsic
uncertainty of an inverse problem and the
uncertainty of applying any particular technique
to solve the inverse problem. The intrinsic
uncertainty depends crucially on the prior
constraints on the unknown (including prior
probability distributions in the case of Bayesian
analyses), on the forward operator, on the
statistics of the observational errors, and on
the nature of the properties of the unknown one
wishes to estimate. I will try to convey some
geometrical intuition for uncertainty, and the
relationship between the intrinsic uncertainty of
linear inverse problems and the uncertainty of
some common techniques applied to them.

3
References Acknowledgements

Donoho, D.L., 1994. Statistical Estimation and
Optimal Recovery, Ann. Stat., 22, 238-270.
Donoho, D.L., 1995. Nonlinear solution of linear
inverse problems by wavelet-vaguelette
decomposition, Appl. Comput. Harm. Anal.,2,
101-126.
Evans, S.N. and Stark, P.B., 2002. Inverse
Problems as Statistics, Inverse Problems, 18,
R1-R43 (in press).
Stark, P.B., 1992. Inference in
infinite-dimensional inverse problems
Discretization and duality, J. Geophys. Res., 97,
14,055-14,082.
Stark, P.B., 1992. Minimax confidence intervals
in geomagnetism, Geophys. J. Intl., 108, 329-338.
Created using TexPoint by G. Necula,
http//raw.cs.berkeley.edu/texpoint

4
Outline

Inverse Problems as Statistics
Ingredients Models
Forward and Inverse Problemsapplied perspective
Statistical point of view
Some connections
Notation linear problems illustration
Example geomagnetism from satellite observations
Example seismic velocity from t(p) and x(p)
Example differential rotation of the Sun from
normal mode splitting
Identifiability and uniqueness
Sketch of identifiablity and extremal modeling
Backus-Gilbert theory
Example solar differential rotation
Example seismic velocity in Earths core

5
Outline, contd.

Decision Theory
Decision rules and estimators
Comparing decision rules Loss and Risk
Strategies Bayes/Minimax duality
Mean distance error and bias
Illustration Regularization
Illustration Minimax estimation of linear
functionals
Example Gauss coefficients of the magnetic field
Distinguishing Models metrics and consistency

6
Inverse Problems as Statistics

Measurable space X of possible data.
Set ? of possible descriptions of the
worldmodels.
Family P Pq q 2 Q of probability
distributions on X, indexed by models ?.
Forward operator q a Pq maps model ? into a
probability measure on X .
Data X are a sample from Pq.
Pq is whole story stochastic variability in the
truth, contamination by measurement error,
systematic error, censoring, etc.

7
Models

Set ? usually has special structure.
? could be a convex subset of a separable Banach
space T. (geomag, seismo, grav, MT, )
Physical significance of ? generally gives qaPq
reasonable analytic properties, e.g., continuity.

8
Forward Problems in Geophysics

Composition of steps
transform idealized description of Earth into
perfect, noise-free, infinite-dimensional data
(approximate physics)
censor perfect data to retain only a finite list
of numbers, because can only measure, record, and
compute with such lists
possibly corrupt the list with measurement error.
Equivalent to single-step procedure with
corruption on par with physics, and mapping
incorporating the censoring.

9
Inverse Problems

Observe data X drawn from distribution P? for
some unknown ???. (Assume ? contains at least two
points otherwise, data superfluous.)
Use data X and the knowledge that ??? to learn
about ? for example, to estimate a parameter
g(?) (the value g(?) at ? of a continuous
G-valued function g defined on ?).

10
Example Geomagnetism
11
Geomagetic model parametrization
12
Geomagnetic inverse problem
13
Geophysical Inverse Problems

Inverse problems in geophysics often solved
using applied math methods for Ill-posed problems
(e.g., Tichonov regularization, analytic
inversions)
Those methods are designed to answer different
questions can behave poorly with data (e.g., bad
bias variance)
Inference ? construction statistical viewpoint
more appropriate for interpreting geophysical
data.

14
Elements of the Statistical View

Distinguish between characteristics of the
problem, and characteristics of methods used to
draw inferences.
One fundamental property of a parameter
g is identifiable if for all ?, z ? T,
g(?) ? g(z) ? Ph ? Pz.
In most inverse problems, g(?) ? not
identifiable, and few linear functionals of ? are
identifiable.

15
Deterministic and Statistical Connections

Identifiabilitydistinct parameter values yield
distinct probability distributions for the
observablessimilar to uniquenessforward
operator maps at most one model into the observed
data.
Consistencyparameter can be estimated with
arbitrary accuracy as the number of data
growsrelated to stability of a recovery
algorithmsmall changes in the data produce small
changes in the recovered model.
? quantitative connections too.

16
More Notation

Let T be a separable Banach space, T its
normed dual.
Write the pairing between T and T
lt, gt T x T ? R.

17
Linear Forward Problems

A forward problem is linear if
T is a subset of a separable Banach space T
X Rn, X (Xj)j1n
For some fixed sequence (?j)j1n of elements of
T,
Xj h kj, q i ej, q 2 Q,
where e (ej)j1n is a vector of stochastic
errors whose distribution does not depend on ?.

18
Linear Forward Problems, contd.

Linear functionals ?j are the representers
Distribution P? is the probability distribution
of X. Typically, dim(T) ? at least, n lt
dim(T), so estimating ? is an underdetermined
problem.
Define
K T ? Rn
q ? (lt?j, ?gt)j1n .
Abbreviate forward problem by X K? e, ? ? T.

19
Linear Inverse Problems

Use X K? e, and the constraint ? ? T to
estimate or draw inferences about g(?).
Probability distribution of X depends on ? only
through K?, so if there are two points
?1, ?2 ? T such that K?1 K?2 but
g(?1)?g(?2),
then g(?) is not identifiable.

20
Ex Sampling w/ systematic and random error

Observe
Xj f(tj) rj ej, j 1, 2, , n,
f 2 C, a set of smooth of functions on 0, 1
tj 2 0, 1
rj? 1, j1, 2, , n
?j iid N(0, 1)
Take Q C -1, 1n, X Rn, and q (f, r1, ,
rn).
Then Pq has density
(2p)-n/2 exp-åj1n (xj f(tj)-rj)2.

21
Sketch Identifiability
Pz Ph h z, so q not identifiable
g cannot be estimated with bounded bias
Pz Ph g(h) g(z), so g not identifiable
22
Backus-Gilbert Theory
Let Q T be a Hilbert space. Let g 2 T T be
a linear parameter. Let kjj1n µ T. Then g(q)
is identifiable iff g L K for some 1 n
matrix L. If also Ee 0, then L X is
unbiased for g. If also e has covariance matrix S
EeeT, then the MSE of L X is L S LT.
23
Sketch Backus-Gilbert
24
Example Differential solar rotation

Stellar oscillations known since late 1700s.
Sun's oscillation observed in 1960 by Leighton,
Noyes, Simon. Explained as trapped acoustic waves
by Ulrich, Leibacher, Stein, 1970-1.

Source SOHO-SOI/MDI website

gt107 modes predicted. gt250,000 identified 106
soon

Formal error bars inflated by 200. Hill et al.,
1996. Science 272, 1292-1295
25
Pattern is Superposition of Modes

Like vibrations of a spherical guitar string
3 quantum numbers l, m, n
l and m are spherical surface wavenumbers
n is radial wavenumber

Source GONG website
26
Waves Trapped in Waveguide

Low l modes sample more deeply
p-modes do not sample core well
Sun essentially opaque to EM transparent to
sound to neutrinos

Source forgotten!
27
Spectrum is very Regular

Explanation as modes, plus stellar evolutionary
theory, predict details of spectrum
Details confirmed in data by Deubner, 1975

Source GONG
28
Oscillations Taste Solar Interior

Frequencies sensitive to material properties
Frequencies sensitive to differential rotation
If Sun were spherically symmetric and did not
rotate, frequencies of the 2l1 modes with the
same l and n would be equal
Asphericity and rotation break the degeneracy
(Scheiner measured 27d equatorial rotation from
sunspots by 1630. Polar 33d.)
Like ultrasound for the Sun

29
Linear forward problem for differential rotation
Dnlmn s klmnW(r,q)r dr dq Language change q
is latitude, W is rotation model. Relationship
assumes eigenfunctions and radial structure
known. Observational errors usually assumed to
be zero-mean independent normal random variables
with known variances.
30
Different Modes sample Sun differently
Left raypath for l100, n8 and l2, n8
p-modesRight raypath for l5, n10 g-mode.
g-modes have not been observed
l20 modes. Left m20. Middle m16. (Doppler
velocities) Right section through eigenfunction
of l20, m16, n 14.
Gough et al., 1996. Science 272, 1281-1283
31
Linear Combinations of Splitting Kernels
Cuts through kernels for rotationA l15, m8.
B l28, m14. C l28, m24.D two targeted
combinations 0.7R, 60o 0.82R, 30oThompson et
al., 1996. Science 272, 1300-1305.
Estimated rotation rate as a function of depth at
three latitudes.Source SOHO-SOI/MDI website
32
Backus-Gilbert Necessary conditions

Let g be an identifiable real-valued parameter.
Suppose ? ?0?T, a symmetric convex set T ? T,
c?R, and g T ? R such that
?0 T ? T
For t ?T, g(?0 t) c g(t), and g(-t)
-g(t)
g(a1t1 a2t2) a1g(t1) a2g(t2), t1, t2 ? T,
a1, a2 ? 0, a1a2 1, and
supt ? T g(t) lt?.
Then ? 1n matrix ? s.t. the restriction of g to
T is the restriction of ?.K to T.

33
Backus-Gilbert Sufficient Conditions

Suppose g (gi)i1m is an Rm-valued parameter
that can be written as the restriction to T of
?.K for some mn matrix ?.
Then
g is identifiable.
If Ee 0, ?.X is an unbiased estimator of g.
If, in addition, e has covariance matrix S
EeeT, the covariance matrix of ?.X is ?.S.?T
whatever be P?.

34
Decision Rules

A (randomized) decision rule
d X ? M1(A)
x ? dx(.),
is a measurable mapping from the space X of
possible data to the collection M1(A) of
probability distributions on a separable metric
space A of actions.
A non-randomized decision rule is a randomized
decision rule that, to each x ?X, assigns a unit
point mass at some value
a a(x) ? A.

35
Why randomized rules?

In some problems, have better behavior.
Allowing randomized rules can make the set of
decisions convex (by allowing mixtures of
different decisions), which makes the math
easier.
If the risk is convex, Rao-Blackwell theorem says
that the optimal decision is not randomized.
(More on this later.)

36
Example randomization natural

Coin has chance 1/3 of landing with one side
showing chance 2/3 of the other showing. Dont
know which side is which.
Want to decide whether P(heads) 1/3 or 2/3.
Toss coin 10 times. X heads.
Toss fair coin once. U heads.

Use data to pick the more likely scenario, but if
data dont help, decide by tossing a fair coin.
37
Estimators

An estimator of a parameter g(?) is a decision
rule d for which the space A of possible actions
is the space G of possible parameter values.
gg(X) is common notation for an estimator of
g(?).
Usually write non-randomized estimator as a
G-valued function of x instead of a M1(G)-valued
function.

38
Comparing Decision Rules

Infinitely many decision rules and estimators.
Which one to use?
The best one!
But what does best mean?

39
Loss and Risk

2-player game Nature v. Statistician.
Nature picks ? from T. ? is secret, but
statistician knows T.
Statistician picks d from a set D of rules. d is
secret.
Generate data X from P?, apply d.
Statistician pays loss L(?, d(X)). L should be
dictated by scientific context, but
Risk is expected loss r(?, d) EqL(?, d(X))
Good rule d has small risk, but what does small
mean?

40
Strategy

Rare that one d has smallest risk 8q?Q.
d is admissible if not dominated (if no
estimator does at least as well for every q, and
better for some q).
Minimax decision minimizes rQ(d) supq?Qr(?, d)
over d?D
Minimax risk is rQ infd 2 D rQ(d)
Bayes decision minimizes
rp(d) sQr(q,d)p(dq) over d?D
for a given prior probability distribution p on
Q.
Bayes risk is rp infd 2 D rp(d).

41
Minimax is Bayes for least favorable prior
Pretty generally for convex ?, D,
concave-convexlike r,

If minimax risk gtgt Bayes risk, prior p controls
the apparent uncertainty of the Bayes estimate.

42
Common Risk Mean Distance Error (MDE)

Let dG denote the metric on G, and let g be an
estimator of g.
MDE at ? of g is
MDE?(g, g) Eq d(g(X), g(?)).
When metric derives from norm, MDE is called mean
norm error (MNE).
When the norm is Hilbertian, MNE2 is called mean
squared error (MSE).

43
Shrinkage

Suppose X N(q, I) with dim(q) d 3.
X not admissible for q for squared-error loss
(Stein, 1956).
Dominated by dS(X) X(1 a/(b X2)) for
small a and big b.
James-Stein better dJS(X) X(1-a/X2), for 0
lt a 2(d-2).
Better if take positive part of shrinkage factor
dJS(X) X(1-a/X2), for 0 lt a 2(d-2).
Not minimax, but close.
Implications for Backus-Gilbert estimates of d 3
linear functionals.
9 extensions to other distributions see Evans
Stark (1996).

44
Bias

When G is a Banach space, can define bias at ? of
g
bias?(g, g) Eq g - g(?)
(when the expectation is well-defined).
If bias?(g, g) 0, say g is unbiased at ? (for
g).
If g is unbiased at ? for g for every ???, say g
is unbiased for g. If such g exists, say g is
unbiasedly estimable.
If g is unbiasedly estimable then g is
identifiable.

45
Example Bounded Normal Mean

Observe X N (q, 1). Know a priori q 2 -t, t.
Want to estimate g(q) q.
Let f() be the standard normal density.Let F()
be the standard normal cumulative distribution
function.
Suppose we elect to use squared-error loss
L(q, d) (q - d)2
r(q, d) Eq L(q, d(X)) Eq (q - d(X))2
rQ(d) supq 2 Q r(q, d) supq 2 Q Eq (q -
d(X))2
rQ infd 2 D supq 2 Q Eq (q - d(X))2

46
Risk of X for bounded normal mean

Consider simple estimator d(X) X.
EX q, so X is unbiased for q, and q is
unbiasedly estimable.
r(q, X) Eq (q X)2 Var(X) 1.
Consider Bayesian prior to capture the constraint
q 2 -t, t
p U-t, t, the uniform distribution on the
interval -t, t.
rp(X) s-tt r(q, X) p(dq) s-tt 1 (2t)-1 dq
1.
In this example, frequentist risk of X equals
Bayes risk of X for uniform prior p.

47
X is not the best Truncation

Easy to find an estimator better than X from both
frequentist and Bayes perspectives.
Truncation estimate dT

dT is biased, but has smaller MSE than X,
whatever be q 2 Q. (dT is the constrained maximum
likelihood estimate.)
48
Risk of dT
x
Pq(X lt -t)
dT
f(xq)
0
t
-t
q
-t
0
q
t
49
Minimax Estimation of BNM

Truncation estimate better than X, but not
minimax.
Clear that r min(1, t2) MSE(X) 1, and rQ(0)
t2.
Minimax MSE estimator is a nonlinear shrinkage
estimator.
Minimax MSE risk is t2/(1t2).

50
Bayes estimation of BNM

Posterior density of q given x is

51
Posterior Mean

The mean of the posterior density minimizes the
Bayes risk, when the loss is squared error

52
Bayes Estimator is Nonlinear Shrinkage
Philip B. Stark function f bayesUnif(x, tau) f
x - (normpdf(tau - x) - normpdf(-tau-x))./(normc
df(tau-x) - normcdf(-tau-x)) return
6
X
dT
4
dp
2
0
Bayes estimator dp, t3
-2
-4
-6
-6
-4
-2
0
2
4
6
For t 3, Bayes risk rp ¼ 0.7 (by simulation)
.Minimax risk rQ 0.75.
53
Bayes/Minimax Risks
Philip B. Stark nsim 20000 risk 0 tau
5 for theta -tau.01tau pred
bayesunif(theta randn(nsim,1),tau) risk risk
(pred - theta)'(pred - theta)/nsim end risk/le
ngth(-tau.01tau)
Philip B. Stark
Philip B. Stark
t rp(simulation) rQ
0.5 0.08 0.33
1 0.25 0.50
2 0.55 0.67
3 0.70 0.75
4 0.77 0.80
5 0.82 0.83
t ! 1 1 1
Difference between knowing q 2 -t, t, and q
U-t, t.
54
Sketch Regularization
55
Consistency of Occams Inversion

Common approach minimize norm (or other
regularization functional) subject to mean data
misfit 1.
Sometimes called Occams Inversion (Constable,
Parker and Constable) simplest hypothesis
consistent with the data.
In many circumstances, this estimator is
inconsistent as number of data grows, greater
and greater chance that the estimator is 0.
Allowable misfit grows faster than norm of
noise-free data image.
In common situations, consistency of the general
approach requires data redundancy and averaging.

56
Singular Value Decomposition, Linear Problems

Assume Q ½ T , separable Hilbert space ejj1n
iid N(0, s2)kjj1n linearly independent
K is compact infinite-dimensional null
space.Let K ltn ! T be the adjoint operator to
K.
9 n triples (nj, xj, lj)j1n, nj 2 T, xj 2 X
and lj 2 lt, such that
Knj lj xj,
K xj lj nj.
njj1n can be chosen to be orthonormal in T
xjj1n can be chosen to be orthonormal in X.
lj gt 0, 8 j. Order s.t. l1 l2 L gt 0.
(nj, xj, lj)j1n are singular value
decomposition of K.

57
Singular Value Weighting

Can write minimum norm model that fits data
exactly as
dMN(X) åj1n lj-1 (xj X) nj.
Write q q q? (span of nj and its
orthocomplement)
Biasq(dMN) Eq dMN(X) - q q?.
Varq dMN Eq åj1n lj-1 (xj e) nj 2 s2
åj1n lj-2.
Components associated with small lj make variance
big noise components multiplied by lj-1.
Singular value truncation reconstruct q using
nj with lj t
dSVT åj1m lj-1 (xj X) nj, where m max k
lk gt t.
Mollifies the noise magnification but increases
bias.

58
Bias of SVT

Bias of SVT bigger than of MNE by projection of q
onto span njjm1n.
Variance of SVT smaller by s2 åj m 1n lj-2.
With adequate prior information about q (to
control bias) can exploit bias-variance tradeoff
to reduce MSE.
SVT in family of estimators that re-weight the
singular functions in the reconstruction
dw åj1n w(lj) (xj X) nj.
Regularization using norm penalty, with
regularization parameter l, corresponds to
w(u) u/(u2 l).
These tend to have smaller norm smaller than
maximum likelihood estimate can be viewed as
shrinkage.

59
Examples of Singular Functions

Linear, time-invariant filters complex sinusoids
Circular convolution sinusoids
Band and time-limiting prolate spheroidal
wavefunctions
Main-field geomagnetism spherical harmonics,
times radial polynomials in r-1
Abel transform Jacobi polynomials and Chebychev
polynomials
See Donoho (1995) for more examples and
references.

60
Minimax Estimation of Linear parameters

Observe X Kq e 2 Rn, with
q 2 Q µ T, T a separable Hilbert space
Q convex
eii1n iid N(0,s2).
Seek to learn about g(q) Q ! R, linear, bounded
on Q
For variety of risks (MSE, MAD, length of
fixed-length confidence interval), minimax risk
is controlled by modulus of continuity of g,
calibrated to the noise level.
Full problem no harder than hardest 1-dimensional
subproblem reduces to BNM (Donoho, 1994).

61
Example Geomagnetism

Q q 2 l2(w) ål11 wl åm-ll qlm 2 q .
Estimate g(q) qlm.
Symmetry of Q and linearity of K, g, let us
characterize the modulus

The problem is to maximize a linear functional of
a vector in the intersection of two ellipsoids.
In the main-field geomagnetism problem, as the
data sampling becomes more uniform over the
spherical idealization of a satellite orbit, both
the norm (prior information) and the operator K
are diagonalized by spherical harmonics.
62
Modulus of Continuity
63
Distinguishing two models

Data tell the difference between two models z and
h if the L1 distance between Pz and Ph is large

64
L1 and Hellinger distances
65
Consistency in Linear Inverse Problems

Xi ?i? ?i, i1, 2, 3, ???, subset of
separable Banach space?i? ? linear, bounded
on ? ?i iid ?
? consistently estimable w.r.t. weak topology iff
?Tk, Tk Borel function of X1, . . . , Xk, s.t.
????, ??gt0, ?? ??,
limk Pq?Tk - ??gt? 0

66
Importance of the Error Distribution

µ a prob. measure on ? µa(B) µ(B-a), a? ?
Pseudo-metric on ?
If restriction to ? converges to metric
compatible with weak topology, can estimate ?
consistently in weak topology.
For given sequence of functionals ki, µ rougher
? consistent estimation easier.

67
Summary

Solving inverse problem means different things
to different audiences.
Statistical viewpoint is useful abstraction.
Physics in mapping ? ? P?Prior information in
constraint ???.
There is more information in the assertion q p,
with p supported on Q, than there is in the
constraint q 2 Q.
Separating model q from parameters g(q) of
interest is useful Sabatiers well posed
questions. Many interesting questions can be
answered without knowing the entire model.
Thinking about measures of performance is useful.
Difficulty of problem ? performance of specific
method.