Title: Quantifying Uncertainty in Inverse Problems

1. Quantifying Uncertainty in Inverse Problems
- Mathematical Geophysics and Uncertainty in Earth Models, Colorado School of Mines
- 14-25 June 2004
- P.B. Stark
- Department of Statistics
- University of California, Berkeley, CA 94720-3860
- www.stat.berkeley.edu/stark
2. Inverse Problems are Statistics
- After my first talk at Berkeley, Lucien Le Cam told me: "I don't understand all this fuss about inverse problems. Forward problems are Probability; inverse problems are Statistics."
- It took me 10 years to understand just how right he was.
3. Inverse Problems as Statistics
- Measurable space $\mathcal{X}$ of possible data.
- Set $\Theta$ of descriptions of the world ("models").
- Family $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ of probability distributions on $\mathcal{X}$, indexed by models $\theta$.
- Forward operator $\theta \mapsto P_\theta$ maps a model $\theta$ into a probability measure on $\mathcal{X}$.
- $\mathcal{X}$-valued data $X$ are a sample from $P_\theta$.
- $P_\theta$ is everything: randomness in the "truth," measurement error, systematic error, censoring, etc.
4. Models
- $\Theta$ usually has special structure.
- $\Theta$ could be a convex subset of a separable Banach space $\mathcal{T}$ (geomag, seismo, grav, MT, ...).
- The physical significance of $\Theta$ generally gives $\theta \mapsto P_\theta$ reasonable analytic properties, e.g., continuity.
5. Forward Problems in Physical Science
Often thought of as a composition of steps:
- transform the idealized model $\theta$ into perfect, noise-free, infinite-dimensional data (approximate physics);
- keep a finite number of the perfect data, because we can only measure, record, and compute with finite lists;
- possibly corrupt the list with measurement error.
Equivalent to a single-step procedure, with corruption on a par with the physics and a mapping incorporating the censoring.
6. Inverse Problems
- Observe data $X$ drawn from $P_\theta$ for some unknown $\theta \in \Theta$. (Assume $\Theta$ contains at least two points; otherwise, the data are superfluous.)
- Use $X$ and the knowledge that $\theta \in \Theta$ to learn about $\theta$; for example, to estimate a parameter $g(\theta)$ (the value at $\theta$ of a continuous $\mathcal{G}$-valued function $g$ defined on $\Theta$).
7. Modeling/Prediction/Estimation
- What does it mean to "solve" an inverse problem?
- Constructing a model that fits the data: estimation (e.g., LS, RLS). Predicting other data: estimating the model.
- Properties of a model that fits the data need not be shared by the true model.
- The data need not constrain the parameter of interest: non-uniqueness.
- There is widespread confusion between properties of algorithms and properties of problems.
- "Well-posed questions" of P. Sabatier.
8. Inverse Problems in Physical Science
- Inverse problems in science are often solved using applied-math methods for ill-posed problems (e.g., Tikhonov regularization, analytic inversions).
- Those methods are designed to answer different questions; they can behave poorly with data (e.g., bad bias and variance).
- Inference $\ne$ construction: the statistical viewpoint may be more appropriate for interpreting real data with stochastic errors.
9. Elements of the Statistical View
- Distinguish between characteristics of the problem and characteristics of the methods used to draw inferences.
- One fundamental qualitative property of a parameter: $g$ is identifiable if for all $\eta, \zeta \in \Theta$,
  $g(\eta) \ne g(\zeta) \Rightarrow P_\eta \ne P_\zeta$.
- In most inverse problems, $g(\theta) = \theta$ is not identifiable, and few linear functionals of $\theta$ are identifiable.
10. Deterministic and Statistical Connections
- Identifiability (distinct parameter values yield distinct probability distributions for the observables) is similar to uniqueness (the forward operator maps at most one model into the observed data).
- Consistency (the parameter can be estimated with arbitrary accuracy as the number of data grows) is related to stability of a recovery algorithm (small changes in the data produce small changes in the recovered model).
- There are quantitative connections too.
11. More Notation
- Let $\mathcal{T}$ be a separable Banach space, $\mathcal{T}^*$ its normed dual.
- Write the pairing between $\mathcal{T}^*$ and $\mathcal{T}$ as
  $\langle \cdot, \cdot \rangle : \mathcal{T}^* \times \mathcal{T} \to \mathbf{R}$.
12. Linear Forward Problems
A forward problem is linear if
- $\Theta$ is a subset of a separable Banach space $\mathcal{T}$;
- $\mathcal{X} = \mathbf{R}^n$, $X = (X_j)_{j=1}^n$;
- for some fixed sequence $(\kappa_j)_{j=1}^n$ of elements of $\mathcal{T}^*$,
  $X_j = \langle \kappa_j, \theta \rangle + \epsilon_j$, $\quad \theta \in \Theta$,
  where $\epsilon = (\epsilon_j)_{j=1}^n$ is a vector of stochastic errors whose distribution does not depend on $\theta$.
13. Linear Forward Problems, cont'd.
- The linear functionals $\kappa_j$ are the "representers."
- The distribution $P_\theta$ is the probability distribution of $X$. Typically, $\dim(\Theta) = \infty$; at least, $n < \dim(\Theta)$, so estimating $\theta$ is an underdetermined problem.
- Define
  $K : \Theta \to \mathbf{R}^n$
  $\theta \mapsto (\langle \kappa_j, \theta \rangle)_{j=1}^n$.
- Abbreviate the forward problem by $X = K\theta + \epsilon$, $\theta \in \Theta$ (a toy numerical version follows).
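For concreteness, a toy discretized version of this setup in MATLAB (the grid, the boxcar representers, and the noise level are illustrative assumptions of mine, not from the lecture):

% Discretized linear forward problem X = K*theta + epsilon:
% theta holds samples of a model on a grid; each representer kappa_j
% is a row of K (here, a local average of the model near a point).
m = 200; n = 10; sigma = 0.1;           % model dim, data dim, noise SD
t = linspace(0, 1, m)';                 % grid on [0, 1]
theta = sin(2*pi*t);                    % a "true" model, for illustration
K = zeros(n, m);
for j = 1:n                             % kappa_j: boxcar average near t_j
  c = round(j*m/(n + 1));
  K(j, max(1, c-5):min(m, c+5)) = 1/11;
end
X = K*theta + sigma*randn(n, 1);        % data: a sample from P_theta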
14. Linear Inverse Problems
- Use $X = K\theta + \epsilon$ and the constraint $\theta \in \Theta$ to estimate or make inferences about $g(\theta)$.
- The probability distribution of $X$ depends on $\theta$ only through $K\theta$, so if there are two points $\theta_1, \theta_2 \in \Theta$ such that $K\theta_1 = K\theta_2$ but $g(\theta_1) \ne g(\theta_2)$, then $g(\theta)$ is not identifiable.
- This is like what geophysicists call "non-uniqueness."
15. Example: Sampling with systematic and random error
Observe
  $X_j = f(t_j) + r_j + \epsilon_j$, $\quad j = 1, 2, \ldots, n$,
with
- $f \in \mathcal{C}$, a set of smooth functions on $[0, 1]$;
- $t_j \in [0, 1]$;
- $|r_j| \le 1$, $j = 1, 2, \ldots, n$;
- $\epsilon_j$ iid $N(0, 1)$.
Take $\Theta = \mathcal{C} \times [-1, 1]^n$, $\mathcal{X} = \mathbf{R}^n$, and $\theta = (f, r_1, \ldots, r_n)$. Then $P_\theta$ has density
  $(2\pi)^{-n/2} \exp\{-\tfrac{1}{2}\sum_{j=1}^n (x_j - f(t_j) - r_j)^2\}$.
16. Discretization and Regularization
- Discretization introduces error.
- It also masks error: it reduces the apparent uncertainty, and unidentifiable parameters seem identifiable.
- The literature confuses numerical stability and accuracy.
- Discretization and regularization increase stability and bias. Without constraints or a prior, we can't tell whether the added bias is big or small. Usually it could be infinite.
17. Discretize
  $f(x) \approx \sum_{i=1}^m a_i \phi_i(x)$, $m < n$, with $\{\phi_i\}_{i=1}^m$ linearly independent on $[0, 1]$.
For many choices of $\{\phi_i\}_{i=1}^m$, e.g., sinusoids, splines, boxcars, apparently we can estimate $f$ with bounded bias (without bias if the $r_j$ are 0). Illusion: an artifact of the discretization, as the sketch below suggests.
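A minimal numerical sketch of the illusion (my own illustration; with $m < n$, ordinary least squares applies): the coefficient estimates are unbiased for the discretized model, but the discretized model differs from $f$ by an amount the data cannot reveal.

% Fit m boxcar basis functions to noisy samples of a smooth f by LS.
% The fit is unbiased for the projection of f onto the span of the basis,
% but its bias for f itself is invisible to the data.
n = 50; m = 10; sigma = 0.1;
t = (1:n)'/n;                            % sample points
f = exp(-3*t).*sin(6*pi*t);              % "true" smooth model (illustrative)
Phi = zeros(n, m);
for i = 1:m                              % boxcar basis on [0, 1]
  Phi(:, i) = (t > (i-1)/m) & (t <= i/m);
end
X = f + sigma*randn(n, 1);               % data (no systematic error here)
a = Phi \ X;                             % LS coefficient estimates
Efhat = Phi*(Phi \ f);                   % expected fit, since LS is linear
max(abs(Efhat - f))                      % bias for f: nonzero however big n is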
18. Sketch: Identifiability
- $P_\zeta = P_\eta$ for some $\eta \ne \zeta$, so $\theta$ is not identifiable.
- $g$ cannot be estimated with bounded bias.
- $P_\zeta = P_\eta$ but $g(\eta) \ne g(\zeta)$, so $g$ is not identifiable.
19. Backus-Gilbert Theory
Let $\Theta = \mathcal{T}$ be a Hilbert space. Let $g \in \mathcal{T}^* = \mathcal{T}$ be a linear parameter. Let $\{\kappa_j\}_{j=1}^n \subseteq \mathcal{T}$. Then $g(\theta)$ is identifiable iff $g = \Lambda \cdot K$ for some $1 \times n$ matrix $\Lambda$. If also $\mathbf{E}[\epsilon] = 0$, then $\Lambda \cdot X$ is unbiased for $g$. If also $\epsilon$ has covariance matrix $\Sigma = \mathbf{E}[\epsilon\epsilon^T]$, then the MSE of $\Lambda \cdot X$ is $\Lambda \Sigma \Lambda^T$.
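A discretized MATLAB sketch of the theorem (the toy representers and error covariance are my assumptions): with the representers as the rows of an $n \times m$ matrix K and a linear parameter $g(\theta) = g'\theta$, identifiability means $g' = \Lambda K$ for some $1 \times n$ matrix $\Lambda$.

% Find Lambda with g' = Lambda*K; then Lambda*X is unbiased for g(theta),
% with MSE Lambda*Sigma*Lambda'.
n = 10; m = 200; sigma = 0.1;
K = randn(n, m);                         % toy representers
g = K' * randn(n, 1);                    % a g in the row space of K: identifiable
Lambda = (K' \ g)';                      % solve K'*Lambda' = g   (1 x n)
Sigma = sigma^2 * eye(n);                % error covariance E[ee']
mse = Lambda * Sigma * Lambda';          % MSE of the unbiased estimate Lambda*X
% For any theta: E[Lambda*X] = Lambda*K*theta = g'*theta.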
20. Sketch: Backus-Gilbert
21. Decision Rules
- A (randomized) decision rule
  $\delta : \mathcal{X} \to \mathcal{M}_1(\mathcal{A})$
  $x \mapsto \delta_x(\cdot)$
  is a measurable mapping from the space $\mathcal{X}$ of possible data to the collection $\mathcal{M}_1(\mathcal{A})$ of probability distributions on a separable metric space $\mathcal{A}$ of actions.
- A non-randomized decision rule is a randomized decision rule that, to each $x \in \mathcal{X}$, assigns a unit point mass at some value $a = a(x) \in \mathcal{A}$.
22. Why randomized rules?
- In some problems, they have better behavior.
- Allowing randomized rules can make the set of decisions convex (by allowing mixtures of different decisions), which makes the math easier.
- If the risk is convex, the Rao-Blackwell theorem says that the optimal decision is not randomized. (More on this later.)
23. Example: randomization natural
- A coin has chance 1/3 of landing with one side showing and chance 2/3 of the other showing. We don't know which side is which.
- Want to decide whether P(heads) = 1/3 or 2/3.
- Toss the subject coin 10 times; X = # heads.
- Toss a fair coin once; U = # heads.
- Use the data to pick the more likely scenario, but if the data don't help, decide by tossing a fair coin.
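A simulation sketch of this rule (reading "the data don't help" as the equivocal outcome X = 5, where the two likelihoods are equal):

% Decide P(heads) = 2/3 if X > 5, P(heads) = 1/3 if X < 5; if X = 5,
% let the fair-coin toss U decide.
nsim = 1e5; p = 2/3;                     % simulate under P(heads) = 2/3
X = sum(rand(10, nsim) < p, 1);          % # heads in 10 tosses of subject coin
U = rand(1, nsim) < 0.5;                 % the fair-coin toss
decide23 = (X > 5) | ((X == 5) & U);     % the randomized decision
mean(decide23)                           % chance of deciding correctly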
24. Estimators
- An estimator of a parameter $g(\theta)$ is a decision rule $\delta$ for which the space $\mathcal{A}$ of possible actions is the space $\mathcal{G}$ of possible parameter values.
- $\hat{g} = \hat{g}(X)$ is common notation for an estimator of $g(\theta)$.
- Usually we write a non-randomized estimator as a $\mathcal{G}$-valued function of $x$ instead of an $\mathcal{M}_1(\mathcal{G})$-valued function.
25. Comparing Decision Rules
- There are infinitely many decision rules and estimators.
- Which one should we use?
- The best one!
- But what does "best" mean?
26. Loss and Risk
- 2-player game: Nature v. Statistician.
- Nature picks $\theta$ from $\Theta$. $\theta$ is secret, but the statistician knows $\Theta$.
- Statistician picks $\delta$ from a set $\mathcal{D}$ of rules. $\delta$ is secret.
- Generate data $X$ from $P_\theta$; apply $\delta$.
- Statistician pays loss $L(\theta, \delta(X))$. $L$ should be dictated by the scientific context, but...
- Risk is expected loss: $r(\theta, \delta) = \mathbf{E}_\theta[L(\theta, \delta(X))]$.
- A good rule $\delta$ has small risk, but what does "small" mean?
27. Strategy
- It is rare that one $\delta$ has smallest risk for all $\theta \in \Theta$.
- $\delta$ is admissible if it is not dominated (no rule does at least as well for every $\theta$, and better for at least one $\theta$).
- A minimax decision minimizes $r_\Theta(\delta) = \sup_{\theta \in \Theta} r(\theta, \delta)$ over $\delta \in \mathcal{D}$.
- The minimax risk is $r_\Theta^* = \inf_{\delta \in \mathcal{D}} r_\Theta(\delta)$.
- A Bayes decision minimizes $r_\pi(\delta) = \int_\Theta r(\theta, \delta)\,\pi(d\theta)$ over $\delta \in \mathcal{D}$, for a given prior probability distribution $\pi$ on $\Theta$.
- The Bayes risk is $r_\pi^* = \inf_{\delta \in \mathcal{D}} r_\pi(\delta)$.
28. Minimax is Bayes for the least favorable prior
Pretty generally, for convex $\Theta$ and $\mathcal{D}$ and concave-convexlike $r$,
  $\sup_\pi r_\pi^* = r_\Theta^*$:
the minimax risk equals the Bayes risk for the least favorable prior.
- If the minimax risk is much bigger than the Bayes risk, the prior $\pi$ (not the data or $\Theta$) controls the apparent uncertainty of the Bayes estimate.
29. Common Risk: Mean Distance Error (MDE)
- Let $d_\mathcal{G}$ denote the metric on $\mathcal{G}$, and let $\hat{g}$ be an estimator of $g$.
- The MDE at $\theta$ of $\hat{g}$ is $\mathrm{MDE}_\theta(\hat{g}, g) = \mathbf{E}_\theta[d_\mathcal{G}(\hat{g}(X), g(\theta))]$.
- If the metric derives from a norm, MDE is mean norm error (MNE).
- If the norm is Hilbertian, MNE$^2$ is mean squared error (MSE).
30. Shrinkage
- Suppose $X \sim N(\theta, I)$ with $\dim(\theta) = d \ge 3$.
- $X$ is not admissible for $\theta$ under squared-error loss (Stein, 1956).
- It is dominated by $X(1 - a/(b + \|X\|^2))$ for small $a$ and big $b$.
- James-Stein is better: $\delta_{JS}(X) = X(1 - a/\|X\|^2)$, $0 < a \le 2(d-2)$.
- Even better: take the positive part of the shrinkage factor, $\delta_{JS+}(X) = X(1 - a/\|X\|^2)_+$.
- $\delta_{JS+}$ isn't admissible for MSE, but it is close.
- Implications for Backus-Gilbert estimates of $d \ge 3$ linear functionals.
- There are extensions to other distributions; see Evans & Stark (1996).
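A simulation sketch of the domination (the dimension, $\theta$, and number of trials are illustrative; the code uses implicit expansion, MATLAB R2016b or later):

% Compare the MSE of X with the positive-part James-Stein estimator.
d = 10; nsim = 1e4; a = 2*(d - 2);
theta = ones(d, 1);                      % any fixed theta works
X = theta + randn(d, nsim);              % nsim draws of X ~ N(theta, I)
shrink = max(0, 1 - a ./ sum(X.^2, 1));  % positive-part shrinkage factor
dJS = X .* shrink;
mseX  = mean(sum((X   - theta).^2, 1));  % about d
mseJS = mean(sum((dJS - theta).^2, 1));  % smaller, whatever theta is
[mseX, mseJS]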
31. Example: Bounded Normal Mean (BNM)
- Observe $X \sim N(\theta, 1)$. Know a priori that $\theta \in [-\tau, \tau]$.
- Want to estimate $g(\theta) = \theta$.
- Let $\phi(\cdot)$ be the standard normal density and $\Phi(\cdot)$ the standard normal cumulative distribution function.
- Suppose we choose to use squared-error loss:
  $L(\theta, d) = (\theta - d)^2$
  $r(\theta, \delta) = \mathbf{E}_\theta[L(\theta, \delta(X))] = \mathbf{E}_\theta[(\theta - \delta(X))^2]$
  $r_\Theta(\delta) = \sup_{\theta \in \Theta} r(\theta, \delta) = \sup_{\theta \in \Theta} \mathbf{E}_\theta[(\theta - \delta(X))^2]$
  $r_\Theta^* = \inf_{\delta \in \mathcal{D}} \sup_{\theta \in \Theta} \mathbf{E}_\theta[(\theta - \delta(X))^2]$
32. Risk of X for the bounded normal mean
- Consider the simple (maximum likelihood) estimator $\delta(X) = X$.
- $\mathbf{E}[X] = \theta$, so $X$ is unbiased for $\theta$, and $\theta$ is unbiasedly estimable.
- $r(\theta, X) = \mathbf{E}_\theta[(\theta - X)^2] = \mathrm{Var}(X) = 1$.
- Consider a uniform Bayesian prior to capture the constraint $\theta \in [-\tau, \tau]$:
  $\pi = U[-\tau, \tau]$, the uniform distribution on the interval $[-\tau, \tau]$.
- $r_\pi(X) = \int_{-\tau}^{\tau} r(\theta, X)\,\pi(d\theta) = \int_{-\tau}^{\tau} 1 \cdot (2\tau)^{-1}\,d\theta = 1$.
- In this example, the frequentist risk of $X$ equals the Bayes risk of $X$ for the uniform prior $\pi$ (but $X$ is not the Bayes estimator).
33. Truncation is better (but not best)
- It is easy to find an estimator better than $X$ from both the frequentist and Bayes perspectives.
- Truncation estimate: $\delta_T(X) = \max(-\tau, \min(\tau, X))$, i.e., pull $X$ back to the nearest point of $[-\tau, \tau]$.
- $\delta_T$ is biased, but it has smaller MSE than $X$, whatever $\theta \in \Theta$ may be. ($\delta_T$ is the constrained maximum likelihood estimate.)
34. Risk of $\delta_T$
[Figure: the sampling density $f(x|\theta)$ versus $x$, with ticks at $-\tau$, 0, $\theta$, and $\tau$; $\delta_T$ moves the tail mass $P_\theta(X < -\tau)$ (and likewise the mass beyond $\tau$) to the endpoints of $[-\tau, \tau]$.]
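A quick simulation sketch of the domination ($\tau = 3$ and the $\theta$ grid are illustrative choices of mine):

% Monte Carlo risks of X and of the truncation estimate delta_T.
tau = 3; nsim = 1e5;
thetas = linspace(-tau, tau, 41);
riskX = zeros(size(thetas)); riskT = zeros(size(thetas));
for i = 1:length(thetas)
  X = thetas(i) + randn(nsim, 1);
  dT = min(tau, max(-tau, X));           % truncation estimate
  riskX(i) = mean((X  - thetas(i)).^2);  % about 1 for every theta
  riskT(i) = mean((dT - thetas(i)).^2);  % strictly smaller for every theta
end
max(riskT)                               % < 1: delta_T dominates X on [-tau, tau]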
35. Minimax MSE Estimate of the BNM
- The truncation estimate is better than $X$, but it is neither minimax nor Bayes.
- Clearly $r_\Theta^* \le \min(1, \tau^2)$: $\mathrm{MSE}(X) = 1$, and $r_\Theta(0) = \tau^2$.
- The minimax MSE estimator is a nonlinear shrinkage estimator.
- The minimax MSE among affine estimators is $\tau^2/(1 + \tau^2)$; the nonlinear minimax risk is somewhat smaller (about 0.75 versus 0.9 for $\tau = 3$; see slide 38).
36. Bayes estimation of the BNM
- For the uniform prior $\pi = U[-\tau, \tau]$, the posterior density of $\theta$ given $x$ is the normal density truncated to $[-\tau, \tau]$, shown below.
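Written out (a standard computation; the normalizer is the same one that appears in the bayesUnif code on slide 38):

\[
\pi(\theta \mid x) = \frac{\phi(\theta - x)}{\Phi(\tau - x) - \Phi(-\tau - x)},
\qquad \theta \in [-\tau, \tau],
\]

and $\pi(\theta \mid x) = 0$ for $\theta \notin [-\tau, \tau]$.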
37. Posterior Mean
- The mean of the posterior density minimizes the Bayes risk when the loss is squared error.
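For the uniform prior, the posterior mean has a closed form, the truncated-normal mean (it is exactly what the function bayesUnif on the next slide computes):

\[
\delta_\pi(x) = \mathbf{E}[\theta \mid X = x]
= x - \frac{\phi(\tau - x) - \phi(-\tau - x)}{\Phi(\tau - x) - \Phi(-\tau - x)}.
\]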
38. The Bayes estimator is also nonlinear shrinkage

function f = bayesUnif(x, tau)
% posterior mean of theta given x, for the U[-tau, tau] prior
f = x - (normpdf(tau - x) - normpdf(-tau - x)) ./ ...
    (normcdf(tau - x) - normcdf(-tau - x));
return

[Figure: $X$, the truncation estimate $\delta_T$, and the Bayes estimator $\delta_\pi$ for $\tau = 3$, plotted for $x$ from $-6$ to $6$; $\delta_\pi$ shrinks smoothly toward 0 and stays inside $[-\tau, \tau]$.]
For $\tau = 3$, the Bayes risk is $r_\pi^* \approx 0.7$ (by simulation). The minimax risk is $r_\Theta^* \approx 0.75$.
39. Bayes/Minimax Risks for Mean Squared Error

% average MSE of the Bayes estimator over a grid of theta in [-tau, tau]
nsim = 20000; risk = 0; tau = 5;
for theta = -tau:.01:tau
  pred = bayesUnif(theta + randn(nsim,1), tau);
  risk = risk + (pred - theta)'*(pred - theta)/nsim;
end
risk/length(-tau:.01:tau)
Knowing $\theta \sim U[-\tau, \tau]$ versus knowing only $\theta \in [-\tau, \tau]$.
40. Confidence Interval for the BNM
- We might be interested in a confidence set for $\theta$ instead of a point estimate. A $1-\alpha$ confidence set $I$ satisfies
  $P_\theta(I(X) \ni \theta) \ge 1 - \alpha$, $\forall \theta \in \Theta$.
- Actions are now sets, not points. Decision rules are probability measures on sets.
- A sensible loss? The Lebesgue measure of the confidence set. But consider the application!
- Risk is expected measure.
41. Some Confidence Procedures
- Naïve: $[X - z_{\alpha/2}, X + z_{\alpha/2}]$.
- Truncated: $[X - z_{\alpha/2}, X + z_{\alpha/2}] \cap [-\tau, \tau]$.
- Affine, fixed-length: $[aX + b - c/2, aX + b + c/2]$.
- Nonlinear, fixed-length: $[f(X) - c/2, f(X) + c/2]$, $f$ measurable.
- Variable-length: $[l(X), u(X)]$, $l$, $u$ measurable.
- Likelihood-ratio test (LRT): include all $\eta \in [-\tau, \tau]$ s.t. the likelihood ratio $f_\eta(X)/f_{\hat\eta}(X) \ge c_\eta$; choose $c_\eta$ to get $1-\alpha$ coverage.
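A simulation sketch of the first two procedures ($\tau = 3$, $\alpha = 0.05$, $\theta = 0$ are illustrative; truncation keeps the naïve interval's coverage because $\theta \in [-\tau, \tau]$):

% Coverage and expected length of the truncated 95% interval.
tau = 3; z = 1.96; nsim = 1e5; theta = 0;
X = theta + randn(nsim, 1);
loN = X - z; hiN = X + z;                   % naive interval: length 3.92
loT = max(loN, -tau); hiT = min(hiN, tau);  % truncated interval
covT = mean(loT <= theta & theta <= hiT);   % still >= 0.95
lenT = mean(max(hiT - loT, 0));             % empty interval counts as length 0
[covT, lenT]                                % expected length < 3.92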
42. Minimax Expected Length
- Seek a procedure with minimax expected length for $\theta \in [-\tau, \tau]$.
- For $\tau \le 2z_\alpha$, the minimax procedure is "Truncated Pratt"; it has a simple form. (Evans et al., 2003; Schafer & Stark, 2003.)
- [Table for $\alpha = 0.05$ not reproduced.]
- The naïve interval has length 3.92.
43. Regularization
- From the statistical perspective, regularization tries to exploit a bias/variance tradeoff to reduce MNE.
- There are situations in which regularization is optimal; whether it is depends on the prior information, the forward operator, the loss function, and the regularization functional. See, e.g., Donoho (1995), O'Sullivan (1986).
- Generally we need some prior information about $\theta$ to know whether a given amount of smoothing increases bias$^2$ by more than it decreases variance. (But cf. shrinkage.)
- Can think of regularization in dual ways: minimize a measure of size subject to fitting the data adequately, or minimize a measure of misfit subject to keeping the model small. These are complementary interpretations of the Lagrange multiplier (see the sketch after this list).
- Generally, to get consistency, the smoothing needs to decrease at the right rate as the number of data increases. The rate depends on the true smoothness.
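A sketch of the two dual forms for the linear problem $X = K\theta + \epsilon$ (toy operator and data; $\mu$ plays the role of the Lagrange multiplier):

% Penalized form: minimize ||K*theta - X||^2 + mu*||theta||^2.
n = 10; m = 50; mu = 0.1;
K = randn(n, m); X = randn(n, 1);        % illustrative only
theta_mu = (K'*K + mu*eye(m)) \ (K'*X);
% The same theta_mu minimizes ||theta|| s.t. ||K*theta - X|| <= delta,
% with delta as below; varying mu traces out the size/misfit tradeoff.
delta = norm(K*theta_mu - X);
[norm(theta_mu), delta]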
44. Sketch: Regularization
45. Consistency of Occam's Inversion
- A common approach: minimize the norm (or another regularization functional) subject to mean data misfit $\le 1$.
- Sometimes called "Occam's Inversion" (Constable et al., 1987): the simplest hypothesis consistent with the data.
- In many circumstances, this estimator is inconsistent: as the number of data grows, there is a greater and greater chance that the estimator is 0. The allowable misfit grows faster than the norm of the noise-free data image.
- In common situations, consistency of the general approach requires data redundancy and averaging.
46. Singular Value Decomposition, Linear Problems
- Assume $\Theta \subset \mathcal{T}$, a separable Hilbert space; $\{\epsilon_j\}_{j=1}^n$ iid $N(0, \sigma^2)$; $\{\kappa_j\}_{j=1}^n$ linearly independent.
- $K$ is compact, with an infinite-dimensional null space. Let $K^* : \mathbf{R}^n \to \mathcal{T}$ be the adjoint operator to $K$.
- There exist $n$ triples $\{(\nu_j, \xi_j, \lambda_j)\}_{j=1}^n$, $\nu_j \in \mathcal{T}$, $\xi_j \in \mathbf{R}^n$, and $\lambda_j \in \mathbf{R}$, such that
  $K\nu_j = \lambda_j \xi_j$,
  $K^*\xi_j = \lambda_j \nu_j$.
- $\{\nu_j\}_{j=1}^n$ can be chosen to be orthonormal in $\mathcal{T}$; $\{\xi_j\}_{j=1}^n$ can be chosen to be orthonormal in $\mathbf{R}^n$.
- $\lambda_j > 0$, $\forall j$. Order them so that $\lambda_1 \ge \lambda_2 \ge \cdots > 0$.
- $\{(\nu_j, \xi_j, \lambda_j)\}_{j=1}^n$ are a singular value decomposition of $K$.
47. Singular Value Weighting
- The minimum-norm model that fits the data exactly can be written as
  $\delta_{MN}(X) = \sum_{j=1}^n \lambda_j^{-1}(\xi_j \cdot X)\,\nu_j$.
- Write $\theta = \theta_\| + \theta_\perp$ (the components in the span of $\{\nu_j\}$ and in its orthocomplement).
- $\mathrm{Bias}_\theta(\delta_{MN}) = \mathbf{E}_\theta[\delta_{MN}(X)] - \theta = -\theta_\perp$.
- $\mathrm{Var}_\theta(\delta_{MN}) = \mathbf{E}_\theta\|\sum_{j=1}^n \lambda_j^{-1}(\xi_j \cdot \epsilon)\,\nu_j\|^2 = \sigma^2 \sum_{j=1}^n \lambda_j^{-2}$.
- Components associated with small $\lambda_j$ make the variance big: noise components are multiplied by $\lambda_j^{-1}$.
- Singular value truncation (SVT): reconstruct $\theta$ using only the $\nu_j$ with $\lambda_j > t$:
  $\delta_{SVT}(X) = \sum_{j=1}^m \lambda_j^{-1}(\xi_j \cdot X)\,\nu_j$, where $m = \max\{k : \lambda_k > t\}$.
- This mollifies the noise magnification but increases the bias.
48. Bias of SVT
- The bias of SVT is bigger than that of MNE by the projection of $\theta$ onto the span of $\{\nu_j\}_{j=m+1}^n$.
- The variance of SVT is smaller by $\sigma^2 \sum_{j=m+1}^n \lambda_j^{-2}$.
- With enough prior information about $\theta$ (to control the bias), one can exploit the bias-variance tradeoff to reduce the MSE.
- SVT belongs to a family of estimators that re-weight the singular functions in the reconstruction:
  $\delta_w(X) = \sum_{j=1}^n w(\lambda_j)(\xi_j \cdot X)\,\nu_j$.
- Regularization using a norm penalty, with regularization parameter $\lambda$, corresponds to $w(u) = u/(u^2 + \lambda)$.
- These estimators tend to have norm smaller than the maximum likelihood estimate; they can be viewed as shrinkage.
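A MATLAB sketch comparing the three reconstructions (the decaying spectrum, noise level, and truncation level are illustrative assumptions of mine):

% Build a toy operator with known SVD: xi_j = U0(:,j), nu_j = V0(:,j).
n = 20; m = 100; sigma = 0.01; t = 0.1; lam = 1e-3;
[U0, ~] = qr(randn(n)); [V0, ~] = qr(randn(m));
s0 = 0.8.^(0:n-1)';                      % decaying singular values
K = U0*diag(s0)*V0(:, 1:n)';
theta = exp(-((1:m)' - 50).^2/100);      % toy model
X = K*theta + sigma*randn(n, 1);
[U, S, V] = svd(K, 'econ'); s = diag(S);
dMN  = V*((U'*X)./s);                    % minimum-norm solution
keep = s > t;                            % singular value truncation
dSVT = V(:, keep)*((U(:, keep)'*X)./s(keep));
w = s./(s.^2 + lam);                     % Tikhonov weights w(u) = u/(u^2+lam)
dTIK = V*(w.*(U'*X));
[norm(dMN), norm(dSVT), norm(dTIK)]      % the re-weighted estimates shrink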
49. Examples of Singular Functions
- Linear, time-invariant filters: complex sinusoids.
- Circular convolution: sinusoids.
- Band- and time-limiting: prolate spheroidal wave functions.
- Main-field geomagnetism: spherical harmonics, times radial polynomials in $1/r$.
- Abel transform: Jacobi polynomials and Chebyshev polynomials.
- See Donoho (1995) for more examples and references.
50. Minimax Estimation of Linear Parameters
- Observe $X = K\theta + \epsilon \in \mathbf{R}^n$, with
  - $\theta \in \Theta \subseteq \mathcal{T}$, $\mathcal{T}$ a separable Hilbert space;
  - $\Theta$ convex;
  - $\{\epsilon_j\}_{j=1}^n$ iid $N(0, \sigma^2)$.
- Seek to learn about $g(\theta)$: $g : \Theta \to \mathbf{R}$, linear, bounded on $\Theta$.
- For a variety of risks (MSE, MAD, length of a fixed-length confidence interval), the minimax risk is controlled by the modulus of continuity of $g$, calibrated to the noise level.
- The full problem is no harder than the hardest 1-dimensional subproblem; it reduces to the BNM (Donoho, 1994).
51. Example: Geomagnetism
- $\Theta = \{\theta \in \ell^2(w) : \sum_{l=1}^\infty w_l \sum_{m=-l}^{l} |\theta_{lm}|^2 \le q\}$.
- Estimate $g(\theta) = \theta_{lm}$.
- The symmetry of $\Theta$ and the linearity of $K$ and $g$ let us characterize the modulus.
- The problem: maximize a linear functional of a vector in the intersection of two ellipsoids. In the main-field geomagnetism problem, as the data sampling becomes more uniform over the spherical idealization of a satellite orbit, both the norm (prior information) and the operator $K$ are diagonalized by spherical harmonics.
52. Modulus of Continuity
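In the formulation of Donoho (1994), the modulus of continuity of $g$, calibrated to the noise level $\epsilon$, can be written

\[
\omega(\epsilon) = \sup\{\, g(\theta_1) - g(\theta_2) :
\|K(\theta_1 - \theta_2)\| \le \epsilon,\ \theta_1, \theta_2 \in \Theta \,\}.
\]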
53. Distinguishing Two Models
- The data can tell the difference between two models $\zeta$ and $\eta$ if the $L^1$ distance between $P_\zeta$ and $P_\eta$ is large.
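Concretely, when $P_\zeta$ and $P_\eta$ have densities $p_\zeta$ and $p_\eta$ with respect to a common dominating measure $\mu$,

\[
\|P_\zeta - P_\eta\|_1 = \int_{\mathcal{X}} |p_\zeta - p_\eta|\,d\mu .
\]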
54. Summary
- "Solving" an inverse problem means different things to different audiences.
- The statistical viewpoint is a useful abstraction. The physics is in the mapping $\theta \mapsto P_\theta$; the prior information is in the constraint $\theta \in \Theta$.
- There is more information in the assertion $\theta \sim \pi$, with $\pi$ supported on $\Theta$, than there is in the constraint $\theta \in \Theta$.
- Separating the model $\theta$ from the parameters $g(\theta)$ of interest is useful: Sabatier's "well-posed questions." Many interesting questions can be answered without knowing the entire model.
- Thinking about measures of performance is useful.
- Difficulty of the problem $\ne$ performance of a specific method.
55. References & Acknowledgements
- Constable, S.C., R.L. Parker & C.G. Constable, 1987. Occam's inversion: a practical algorithm for generating smooth models from electromagnetic sounding data, Geophysics, 52, 289-300.
- Donoho, D.L., 1994. Statistical estimation and optimal recovery, Ann. Stat., 22, 238-270.
- Donoho, D.L., 1995. Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition, Appl. Comput. Harm. Anal., 2, 101-126.
- Evans, S.N. & P.B. Stark, 2002. Inverse problems as statistics, Inverse Problems, 18, R1-R43.
- Evans, S.N., B.B. Hansen & P.B. Stark, 2003. Minimax expected measure confidence sets for restricted location parameters, Tech. Rept. 617, Dept. of Statistics, UC Berkeley.
- Le Cam, L., 1986. Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, NY, 742pp.
- O'Sullivan, F., 1986. A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502-518.
- Schafer, C.M. & P.B. Stark, 2003. Using what we know: inference with physical constraints, Tech. Rept., Dept. of Statistics, UC Berkeley.
- Stark, P.B., 1992. Inference in infinite-dimensional inverse problems: discretization and duality, J. Geophys. Res., 97, 14,055-14,082.
- Stark, P.B., 1992. Minimax confidence intervals in geomagnetism, Geophys. J. Intl., 108, 329-338.

Created using TexPoint by G. Necula, http://raw.cs.berkeley.edu/texpoint