Title: Advanced Topics in Least Squares
Slide 1: Lecture 7
- Advanced Topics in Least Squares
Slide 2
- The multivariate normal distribution for data, d:
- $p(d) = (2\pi)^{-N/2}\,|C_d|^{-1/2}\exp[-\tfrac{1}{2}(d-\bar{d})^T C_d^{-1}(d-\bar{d})]$
- Let's assume that the expectation $\bar{d}$ is given by a general linear model, $\bar{d} = Gm$,
- and that the covariance $C_d$ is known (prior covariance).
Slide 3
- Then we have a distribution $p(d; m)$ with unknown parameters m:
- $p(d; m) = (2\pi)^{-N/2}\,|C_d|^{-1/2}\exp[-\tfrac{1}{2}(d-Gm)^T C_d^{-1}(d-Gm)]$
- We can now apply the principle of maximum likelihood to estimate the unknown parameters m.
Slide 4: Principle of Maximum Likelihood
- Last lecture we stated this principle as: maximize $L(m) = \sum_i \ln p(d_i; m)$ with respect to m,
- but in this distribution the whole data vector d is being treated as a single quantity.
- So the principle becomes simply: maximize $L(m) = \ln p(d; m)$, with
- $p(d; m) = (2\pi)^{-N/2}\,|C_d|^{-1/2}\exp[-\tfrac{1}{2}(d-Gm)^T C_d^{-1}(d-Gm)]$
Slide 5
- $L(m) = \ln p(d; m) = -\tfrac{1}{2}N\ln(2\pi) - \tfrac{1}{2}\ln|C_d| - \tfrac{1}{2}(d-Gm)^T C_d^{-1}(d-Gm)$
- The first two terms do not contain m, so the principle of maximum likelihood is:
- maximize $-\tfrac{1}{2}(d-Gm)^T C_d^{-1}(d-Gm)$
- or, equivalently,
- minimize $(d-Gm)^T C_d^{-1}(d-Gm)$
Slide 6
- Minimize $(d-Gm)^T C_d^{-1}(d-Gm)$
- Special case of uncorrelated data with equal variance: $C_d = \sigma_d^2 I$
- Minimize $\sigma_d^{-2}(d-Gm)^T(d-Gm)$ with respect to m,
- which is the same as:
- minimize $(d-Gm)^T(d-Gm)$ with respect to m.
- This is the Principle of Least Squares.
Slide 7
- This is the Principle of Least Squares:
- minimize $E = e^T e = (d-Gm)^T(d-Gm)$ with respect to m.
- It follows from the Principle of Maximum Likelihood in the special case of:
- a multivariate normal distribution, and
- the data being uncorrelated and of equal variance.
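A minimal MatLab sketch of this principle for a straight-line fit; the setup (N, x, the true values a = 1 and b = 2, and the noise level) is an assumption for illustration, not from the lecture:

    % assumed synthetic setup: straight line d = a + b*x plus uncorrelated noise
    N = 101;
    x = (0:N-1)';
    d = 1.0 + 2.0*x + random('Normal',0,5,N,1);
    G = [ones(N,1), x];          % data kernel for d = a + b*x
    m = inv(G'*G)*(G'*d);        % least-squares estimate m = [GTG]^-1 GTd
    e = d - G*m;                 % prediction error
    E = e'*e;                    % total error, minimized by this m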
Slide 8: Corollary
- If your data are
- NOT NORMALLY DISTRIBUTED
- then least-squares is
- not the right method to use!
Slide 9
- What if $C_d = \sigma_d^2 I$ but $\sigma_d$ is unknown?
- Note $|C_d| = \sigma_d^{2N}$, so
- $L(m, \sigma_d) = -\tfrac{1}{2}N\ln(2\pi) - \tfrac{1}{2}\ln|C_d| - \tfrac{1}{2}(d-Gm)^T C_d^{-1}(d-Gm)$
- $\phantom{L(m, \sigma_d)} = -\tfrac{1}{2}N\ln(2\pi) - N\ln\sigma_d - \tfrac{1}{2}\sigma_d^{-2}(d-Gm)^T(d-Gm)$
- The first two terms do not contain m, so the principle of maximum likelihood still implies:
- minimize $(d-Gm)^T(d-Gm) = e^T e = E$
- Then $\partial L/\partial\sigma_d = 0 = -N\sigma_d^{-1} + \sigma_d^{-3}(d-Gm)^T(d-Gm)$
- or, solving for $\sigma_d$:
- $\sigma_d^2 = N^{-1}(d-Gm)^T(d-Gm) = N^{-1}\, e^T e$
Slide 10
- This is the Principle of Maximum Likelihood; it implies that
- $\sigma_d^2 = N^{-1}(d-Gm)^T(d-Gm) = N^{-1}\, e^T e$
- is a good posterior estimate of the variance of the data, when
- the data follow a multivariate normal distribution, and
- the data are uncorrelated and of uniform (but unknown) variance $\sigma_d^2$.
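Continuing the illustrative sketch from slide 7, this posterior variance estimate is one line of MatLab:

    % posterior estimate of the data variance from the residuals e of the fit
    sd2_est = (e'*e)/N;          % sigma_d^2 = N^-1 eTe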
Slide 11
- But back to the general case:
- what formula for m does the rule
- minimize $(d-Gm)^T C_d^{-1}(d-Gm)$
- imply?
Slide 12: Trick
- Minimize $(d-Gm)^T(d-Gm)$ implies $m = [G^T G]^{-1} G^T d$.
- Now write: minimize
- $(d-Gm)^T C_d^{-1}(d-Gm)$
- $= (d-Gm)^T C_d^{-1/2}\, C_d^{-1/2}(d-Gm)$
- $= (C_d^{-1/2}d - C_d^{-1/2}Gm)^T (C_d^{-1/2}d - C_d^{-1/2}Gm)$
- $= (d' - G'm)^T (d' - G'm)$
- with $d' = C_d^{-1/2} d$ and $G' = C_d^{-1/2} G$.
- This is simple least squares, so $m = [G'^T G']^{-1} G'^T d'$
- or
- $m = [G^T C_d^{-1/2} C_d^{-1/2} G]^{-1} G^T C_d^{-1/2} C_d^{-1/2} d = [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d$
- ($C_d$ is symmetric, so its inverse and square root are symmetric, too.)
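This transformation can be checked numerically; MatLab's sqrtm returns the symmetric matrix square root. A small illustrative sketch (the exponential covariance model here anticipates slide 17 and is an assumption):

    N = 21;
    [XX, YY] = meshgrid(1:N, 1:N);
    Cd = exp(-0.25*abs(XX-YY));     % an assumed symmetric covariance
    W = inv(sqrtm(Cd));             % Cd^(-1/2), symmetric because Cd is
    % dp = W*d and Gp = W*G then reduce the problem to simple least squares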
Slide 13
- So,
- minimize $(d-Gm)^T C_d^{-1}(d-Gm)$
- implies
- $m = [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d$
- and
- $C_m = \{[G^T C_d^{-1} G]^{-1} G^T C_d^{-1}\}\, C_d\, \{[G^T C_d^{-1} G]^{-1} G^T C_d^{-1}\}^T$
- $\phantom{C_m} = [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} G\, [G^T C_d^{-1} G]^{-1} = [G^T C_d^{-1} G]^{-1}$
- (Remember the formula $C_m = M C_d M^T$ for $m = Md$.)
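A minimal MatLab sketch of both formulas; the covariance model and synthetic test data are assumptions for illustration:

    % assumed setup: straight line, exponential covariance model
    N = 101; x = (0:N-1)';
    G = [ones(N,1), x];
    [XX, YY] = meshgrid(1:N, 1:N);
    Cd = 25*exp(-0.25*abs(XX-YY));               % assumed prior covariance
    d = 1.0 + 2.0*x + random('Normal',0,5,N,1);  % assumed test data
    Cdi = inv(Cd);
    m  = inv(G'*Cdi*G)*(G'*Cdi*d);   % m  = [GT Cd^-1 G]^-1 GT Cd^-1 d
    Cm = inv(G'*Cdi*G);              % Cm = [GT Cd^-1 G]^-1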
Slide 14: Example with Correlated Noise
[Figure: two noise time series, panels "Uncorrelated Noise" and "Correlated Noise"]
Slide 15: Scatter Plots
- $d_i$ vs. $d_{i+1}$: high correlation
- $d_i$ vs. $d_{i+2}$: some correlation
- $d_i$ vs. $d_{i+3}$: little correlation
Slide 16: data = straight line + correlated noise
- $d = a + bx + n$
[Figure: the data d plotted against x]
Slide 17: Model for Cd
- $[C_d]_{ij} = \sigma_d^2 \exp(-c\,|i-j|)$ with $c = 0.25$: exponential falloff from the main diagonal
- MatLab code (N and sd assumed defined earlier):

    c = 0.25;
    [XX, YY] = meshgrid(1:N, 1:N);
    Cd = (sd^2)*exp(-c*abs(XX-YY));
Slide 18: Results

                   Intercept          Slope
    Correlated     10.96 +/- 20.6     1.92 +/- 0.35
    Uncorrelated    8.42 +/-  7.9     1.97 +/- 0.14
    True            1.0               2.0

- Fit of $d = a + bx + n$: both fits are about the same, but
- note that the error estimates are larger (more realistic?) for the correlated case.
[Figure: data and fitted lines vs. x]
Slide 19: How to make correlated noise

MatLab code:

    w = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]';    % define weighting function
    w = w/sum(w);
    Nw = length(w);
    Nw2 = (Nw-1)/2;
    N = 101;
    N2 = (N-1)/2;
    n1 = random('Normal',0,1,N+Nw,1);            % start with uncorrelated noise
    n = zeros(N,1);
    % correlated noise is a weighted average of
    % neighboring uncorrelated noise values
    for i = -Nw2:Nw2
        n = n + w(i+Nw2+1)*n1(i+Nw-Nw2 : i+Nw+N-1-Nw2);
    end
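A quick check (an assumed addition, not part of the lecture code) confirms that neighboring samples of n are now correlated:

    c1 = corr(n(1:end-1), n(2:end));   % clearly positive for this choice of w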
Slide 20
- Let's look at the transformations
- $d' = C_d^{-1/2} d$
- $G' = C_d^{-1/2} G$
- in the special case of uncorrelated data with different variances, $C_d = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)$:
- $d'_i = \sigma_i^{-1} d_i$: multiply each datum by the reciprocal of its error
- $G'_{ij} = \sigma_i^{-1} G_{ij}$: multiply each row of the data kernel by the same amount
- Then solve by ordinary least squares (a sketch follows below).
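A minimal MatLab sketch of this row-weighting transform; the two-block choice of sigma mirrors the example on the later slides, everything else is assumed:

    % assumed setup: straight line with per-datum standard deviations sigma
    N = 101; x = (0:N-1)';
    G = [ones(N,1), x];
    sigma = [5*ones(50,1); 100*ones(51,1)];
    d = 1.0 + 2.0*x + sigma.*random('Normal',0,1,N,1);
    dp = d./sigma;                     % d'_i = sigma_i^-1 d_i
    Gp = G./repmat(sigma,1,2);         % G'_ij = sigma_i^-1 G_ij
    m = inv(Gp'*Gp)*(Gp'*dp);          % ordinary least squares on (G', d')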
Slide 21
- The transformed system $G'm = d'$, written out:

$\begin{bmatrix} \sigma_1^{-1}G_{11} & \sigma_1^{-1}G_{12} & \sigma_1^{-1}G_{13} \\ \sigma_2^{-1}G_{21} & \sigma_2^{-1}G_{22} & \sigma_2^{-1}G_{23} \\ \sigma_3^{-1}G_{31} & \sigma_3^{-1}G_{32} & \sigma_3^{-1}G_{33} \\ \vdots & \vdots & \vdots \\ \sigma_N^{-1}G_{N1} & \sigma_N^{-1}G_{N2} & \sigma_N^{-1}G_{N3} \end{bmatrix} m = \begin{bmatrix} \sigma_1^{-1}d_1 \\ \sigma_2^{-1}d_2 \\ \sigma_3^{-1}d_3 \\ \vdots \\ \sigma_N^{-1}d_N \end{bmatrix}$

- Rows have been weighted by a factor of $\sigma_i^{-1}$.
Slide 22
- So this special case is often called weighted least squares.
- Note that the total error is $E = e^T C_d^{-1} e = \sum_i \sigma_i^{-2} e_i^2$.
- Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight.
Slide 23: Example: fitting a straight line
- 100 data; the first 50 have a different $\sigma_d$ than the last 50.
Slide 24: MatLab code (x and d assumed defined earlier)

    N = 101;
    N2 = (N-1)/2;
    sd(1:N2-1) = 5;
    sd(N2:N) = 100;
    sd2i = sd.^(-2);
    Cdi = diag(sd2i);
    G(:,1) = ones(N,1);
    G(:,2) = x;
    GTCdiGI = inv(G'*Cdi*G);
    m = GTCdiGI*G'*Cdi*d;
    d2 = m(1) + m(2) .* x;

- Note that $C_d^{-1}$ is explicitly defined as a diagonal matrix.
Slide 25: Equal variance
- left 50: $\sigma_d = 5$; right 50: $\sigma_d = 5$
[Figure: data and fitted straight line]
Slide 26: Left has smaller variance
- first 50: $\sigma_d = 5$; last 50: $\sigma_d = 100$
[Figure: data and fitted straight line]
Slide 27: Right has smaller variance
- first 50: $\sigma_d = 100$; last 50: $\sigma_d = 5$
[Figure: data and fitted straight line]
Slide 28
- Finally, two miscellaneous comments about least-squares.
Slide 29: Comment 1: Case of fitting functions to a dataset
- $d_i = m_1 f_1(x_i) + m_2 f_2(x_i) + m_3 f_3(x_i)$
- e.g. $d_i = m_1 \sin(x_i) + m_2 \cos(x_i) + m_3 \sin(2x_i)$
Slide 30
- In matrix form, $Gm = d$:

$\begin{bmatrix} f_1(x_1) & f_2(x_1) & f_3(x_1) \\ f_1(x_2) & f_2(x_2) & f_3(x_2) \\ f_1(x_3) & f_2(x_3) & f_3(x_3) \\ \vdots & \vdots & \vdots \\ f_1(x_N) & f_2(x_N) & f_3(x_N) \end{bmatrix} m = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{bmatrix}$
Slide 31
- Note that the matrix $G^T G$ has elements
- $[G^T G]_{ij} = \sum_k f_i(x_k) f_j(x_k) = f_i \cdot f_j$
- and thus is diagonal if the functions are orthogonal.
Slide 32
- If the functions are normalized so $f_i \cdot f_i = 1$, then $G^T G = I$ and the least-squares solution is
- $m = G^T d$ and $C_m = \sigma_d^2 I$
- super-simple formula! $m_i = f_i \cdot d$
- guaranteed uncorrelated errors!
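A quick MatLab illustration; the discrete sine basis used here is an assumption (it happens to be orthonormal on N points), not the lecture's example:

    % f_k(j) = sqrt(2/(N+1))*sin(pi*k*j/(N+1)) is an orthonormal set
    N = 101; j = (1:N)'; K = 3;
    G = zeros(N,K);
    for k = 1:K
        G(:,k) = sqrt(2/(N+1))*sin(pi*k*j/(N+1));
    end
    % G'*G is the identity matrix to rounding error
    d = G*[1; 2; 3] + random('Normal',0,0.1,N,1);  % assumed test data
    m = G'*d;                                      % m_i = f_i . d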
Slide 33: Example: straight line
- $y_i = a + b x_i$ implies $f_1(x) = 1$ and $f_2(x) = x$,
- so the condition $f_1(x) \cdot f_2(x) = 0$ implies $\sum_i x_i = 0$, or $\bar{x} = 0$;
- this happens when the x's straddle the origin.
- The choice $f_1(x) = 1$ and $f_2(x) = x - \bar{x}$, i.e. $y = a + b(x - \bar{x})$,
- leads to uncorrelated errors in $(a, b)$.
[Figure: sketch of a straight-line fit, intercept a marked on the y axis and sample points x1 ... x5 on the x axis]
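A short MatLab check of this claim (an illustrative sketch; with $C_d = \sigma_d^2 I$, slide 13 gives $C_m = \sigma_d^2 [G^T G]^{-1}$):

    % centering x makes the covariance of (a, b) diagonal
    N = 101; x = (0:N-1)'; sd = 5;
    G1 = [ones(N,1), x];             % y = a + b*x
    G2 = [ones(N,1), x - mean(x)];   % y = a + b*(x - xbar)
    Cm1 = (sd^2)*inv(G1'*G1);        % nonzero off-diagonal: correlated errors
    Cm2 = (sd^2)*inv(G2'*G2);        % zero off-diagonal: uncorrelated errors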
Slide 34: Example: wavelet functions
- Localized oscillation with a characteristic frequency.
[Figure: a set of wavelet basis functions]
Slide 35
[Figure: the matrix $G^T G$ for the wavelet basis, which is almost diagonal]
Slide 36: Comment 2
- Sometimes writing least-squares as
- $[G^T G]\, m = G^T d$ or $[G^T C_d^{-1} G]\, m = G^T C_d^{-1} d$
- is more useful than
- $m = [G^T G]^{-1} G^T d$,
- since you can use some method other than a matrix inverse for solving the equation.
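In MatLab, for example, the backslash operator solves the normal equations directly; a sketch under the same assumed straight-line setup as the earlier examples:

    % solve [GTG] m = GTd without forming a matrix inverse
    N = 101; x = (0:N-1)';
    G = [ones(N,1), x];
    d = 1.0 + 2.0*x + random('Normal',0,5,N,1);
    m = (G'*G)\(G'*d);      % Gaussian elimination, no explicit inverse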