Title: Predicting Output from Computer Experiments
Slide 1: Predicting Output from Computer Experiments
- Design and Analysis of Computer Experiments
- Chapter 3
- Kevin Leyton-Brown
Slide 2: Overview
- Overall program in this chapter: predict the output of a computer simulation
  - we're going to review approaches to regression, looking for various kinds of optimality
  - first, we'll talk about just predicting our random variable (§3.2); note that in this setting we have no features
  - then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
  - in the end, we'll apply these ideas to computer experiments (§3.3)
- Not covered
  - an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
  - proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
- If you've done the reading you already know
  - the difference between minimum MSPE linear unbiased predictors and BLUPs
  - three different intuitive interpretations of r₀ᵀR⁻¹(Yⁿ − Fβ̂)
  - a lot about statistics
  - whether this chapter has anything to do with computer experiments
- If you haven't, you're in for a treat ☺
Slide 3: Predictors
- Y₀ is our random variable; our data is Yⁿ = (Y₁, …, Yₙ)ᵀ
  - no features: just predict one response from the others
- A generic predictor Ŷ₀(Yⁿ) predicts Y₀ based on Yⁿ
  - to avoid PowerPoint agony, I'll denote it Ŷ₀ from now on
- There are three kinds of predictors discussed
  - Predictors: Ŷ₀(Yⁿ) has unrestricted functional form
  - Linear predictors: Ŷ₀ = a₀ + Σᵢ₌₁ⁿ aᵢYᵢ = a₀ + aᵀYⁿ
  - Linear unbiased predictors (LUPs)
    - again, linear predictors Ŷ₀ = a₀ + aᵀYⁿ
    - furthermore, unbiased with respect to a given family F of distributions for (Y₀, Yⁿ)
- Definition: a predictor Ŷ₀ is unbiased for Y₀ with respect to the class of distributions F over (Y₀, Yⁿ) if for all F ∈ F, E_F[Ŷ₀] = E_F[Y₀]
  - E_F denotes expectation under the F(·) distribution for (Y₀, Yⁿ)
  - this definition depends on F: a linear predictor is unbiased with respect to a class
  - as F gets bigger, the set of LUPs gets weakly smaller
Slide 4: LUP Example 1
- Suppose that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0
- Define F as those distributions in which
  - β₀ is a given nonzero constant
  - σ_ε² is unknown, but σ_ε² > 0 is known
- Any Ŷ₀ = a₀ + aᵀYⁿ is an LP of Y₀
- Which are unbiased? We know that
  - E[Ŷ₀] = E[a₀ + Σᵢ₌₁ⁿ aᵢYᵢ] = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ   (Eq 1)
  - and E[Y₀] = β₀   (Eq 2)
- For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ_ε²
  - since (Eq 1), (Eq 2) are independent of σ_ε², we just need that, given β₀, a satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
- Solutions
  - a₀ = β₀, a such that Σᵢ₌₁ⁿ aᵢ = 0 (data-independent predictor Ŷ₀ = β₀)
  - a₀ = 0, a such that Σᵢ₌₁ⁿ aᵢ = 1
    - e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n
- (a small simulation sketch checking this unbiasedness follows below)
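Below is a minimal simulation sketch (mine, not from the book) of the unbiasedness claims in this example: with β₀ known, both solution families give unbiased linear predictors, while an intercept other than β₀ combined with unit-sum weights is biased. All constants and names are my own choices.

```python
# Hypothetical check of LUP Example 1: with beta0 known, both "a0 = beta0, sum(a) = 0"
# and "a0 = 0, sum(a) = 1" give unbiased linear predictors of Y0.
import numpy as np

rng = np.random.default_rng(0)
n, beta0, sigma_eps = 5, 2.0, 1.3                           # beta0 is a known nonzero constant
Y = beta0 + sigma_eps * rng.standard_normal((200_000, n))   # rows are replicated data sets Y^n

mean_pred = Y.mean(axis=1)                  # a0 = 0, a_i = 1/n  (weights sum to 1)
const_pred = np.full(len(Y), beta0)         # a0 = beta0, a = 0  (weights sum to 0)
biased_pred = 1.0 + Y.mean(axis=1)          # a0 = 1 != beta0 with unit-sum weights: biased

print(mean_pred.mean(), const_pred.mean(), biased_pred.mean())
# the first two are approximately beta0 = 2.0; the last is approximately 3.0
```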
Slide 5: LUP Example 2
- Suppose again that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0
- Define F as those distributions in which
  - β₀ is an unknown real constant
  - σ_ε² is unknown, but σ_ε² > 0 is known
- Any Ŷ₀ = a₀ + aᵀYⁿ is an LP of Y₀
- Which are unbiased? As before,
  - E[Ŷ₀] = E[a₀ + Σᵢ₌₁ⁿ aᵢYᵢ] = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ   (Eq 1)
  - and E[Y₀] = β₀   (Eq 2)
- For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ_ε² and ∀ β₀
  - since (Eq 1), (Eq 2) are independent of σ_ε², we just need that a satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀ for every β₀
- Solutions: because the condition must now hold for all β₀, we are forced to take a₀ = 0 and Σᵢ₌₁ⁿ aᵢ = 1
  - e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n
  - the data-independent predictor Ŷ₀ = β₀ from Example 1 is no longer available, since β₀ is unknown
- This illustrates that a LUP for F is a LUP for subfamilies of F
Slide 6: Best Mean Squared Prediction Error (MSPE) Predictors
- Definition: MSPE(Ŷ₀, F) = E_F[(Ŷ₀ − Y₀)²]
- Definition: Ŷ₀ is a minimum MSPE predictor at F if, for any predictor Ỹ₀, MSPE(Ŷ₀, F) ≤ MSPE(Ỹ₀, F)
  - we'll also call this a best MSPE predictor
- Fundamental theorem of prediction
  - the conditional mean of Y₀ given Yⁿ is the minimum MSPE predictor of Y₀ based on Yⁿ
Slide 7: Best Mean Squared Prediction Error (MSPE) Predictors
- Theorem: Suppose that (Y₀, Yⁿ) has a joint distribution F for which the conditional mean of Y₀ given Yⁿ exists. Then Ŷ₀ = E[Y₀ | Yⁿ] is the best MSPE predictor of Y₀.
- Proof: Fix an arbitrary predictor Ỹ₀(Yⁿ). Then
    MSPE(Ỹ₀, F) = E_F[(Ỹ₀ − Y₀)²]
                = E_F[(Ỹ₀ − Ŷ₀ + Ŷ₀ − Y₀)²]
                = E_F[(Ỹ₀ − Ŷ₀)²] + MSPE(Ŷ₀, F) + 2 E_F[(Ỹ₀ − Ŷ₀)(Ŷ₀ − Y₀)]
                ≥ MSPE(Ŷ₀, F) + 2 E_F[(Ỹ₀ − Ŷ₀)(Ŷ₀ − Y₀)]   (Eq 3)
  and the cross term vanishes:
    E_F[(Ỹ₀ − Ŷ₀)(Ŷ₀ − Y₀)] = E_F[ E_F[(Ỹ₀ − Ŷ₀)(Ŷ₀ − Y₀) | Yⁿ] ]
                             = E_F[ (Ỹ₀ − Ŷ₀)(Ŷ₀ − E_F[Y₀ | Yⁿ]) ]
                             = E_F[ (Ỹ₀ − Ŷ₀) · 0 ] = 0
  Thus MSPE(Ỹ₀, F) ≥ MSPE(Ŷ₀, F).
- Notes
  - Ŷ₀ = E[Y₀ | Yⁿ] is essentially the unique best MSPE predictor
    - MSPE(Ỹ₀, F) = MSPE(Ŷ₀, F) iff Ỹ₀ = Ŷ₀ almost everywhere
  - Ŷ₀ = E[Y₀ | Yⁿ] is always unbiased
    - E[Ŷ₀] = E[ E[Y₀ | Yⁿ] ] = E[Y₀]   (Why can we condition here?)
- (a quick Monte Carlo illustration follows below)
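The following Monte Carlo sketch illustrates the theorem on a joint distribution of my own choosing (a bivariate normal, not an example from the slides): the conditional mean E[Y₀ | Y₁] = ρY₁ attains a smaller MSPE than a competing predictor.

```python
# Illustration (not from the book): for a bivariate normal with zero means, unit
# variances and correlation rho, E[Y0 | Y1] = rho * Y1.  Any other predictor of Y0
# from Y1 should have MSPE at least as large as that of the conditional mean.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
y0, y1 = rng.multivariate_normal([0.0, 0.0], cov, size=500_000).T

cond_mean = rho * y1                       # best MSPE predictor E[Y0 | Y1]
competitor = y1                            # some other (also unbiased) predictor

print(np.mean((y0 - cond_mean) ** 2))      # ~ 1 - rho**2 = 0.36
print(np.mean((y0 - competitor) ** 2))     # ~ 2 * (1 - rho) = 0.40, larger
```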
Slide 8: Example Continued - Best MSPE Predictors
- What is the best MSPE predictor when each Yᵢ ~ N(β₀, σ_ε²) with β₀ known?
  - since the Yᵢ's are independent, Y₀ | Yⁿ ~ N(β₀, σ_ε²)
  - thus Ŷ₀ = E[Y₀ | Yⁿ] = β₀
- What if σ_ε² is known and Yᵢ ~ N(β₀, σ_ε²), but β₀ is unknown (i.e., with the flat prior π(β₀) ∝ 1)?
  - improper priors do not always give proper posteriors, but here Y₀ | Yⁿ = yⁿ ~ N₁(ȳ, σ_ε²(1 + 1/n)), where ȳ is the sample mean of the training data Yⁿ
  - thus the best MSPE predictor of Y₀ is Ŷ₀ = (Σᵢ Yᵢ)/n
- (a small check of the predictive variance follows below)
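A quick check of the claim above, with my own constants: for Yᵢ ~ N(β₀, σ²) and β₀ unknown, the sample-mean predictor attains MSPE σ²(1 + 1/n), which matches the posterior predictive variance quoted above.

```python
# Sketch (my own check, not from the slides): with Y_i ~ N(beta0, sigma^2) and beta0
# unknown, the predictor Y0_hat = sample mean has MSPE sigma^2 * (1 + 1/n).
import numpy as np

rng = np.random.default_rng(2)
n, beta0, sigma = 7, 0.5, 1.0
reps = 400_000
Y = beta0 + sigma * rng.standard_normal((reps, n))      # training responses
Y0 = beta0 + sigma * rng.standard_normal(reps)          # the response to predict

mspe = np.mean((Y0 - Y.mean(axis=1)) ** 2)
print(mspe, sigma**2 * (1 + 1/n))                       # both ~ 1.1428...
```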
Slide 9: Now let's dive in to Gaussian processes (uh oh)
- Consider the regression model from chapter 2: Yᵢ = Y(xᵢ) = Σⱼ₌₁ᵖ fⱼ(xᵢ)βⱼ + Z(xᵢ) = fᵀ(xᵢ)β + Z(xᵢ)
  - each fⱼ is a known regression function
  - β is an unknown nonzero p × 1 vector
  - Z(x) is a zero-mean stationary Gaussian process with dependence specified by Cov[Z(xᵢ), Z(xⱼ)] = σ_Z² R(xᵢ − xⱼ) for some known correlation function R
- Then the joint distribution of Y₀ = Y(x₀) and Yⁿ = (Y(x₁), …, Y(xₙ))ᵀ is multivariate normal:
    (Y₀, Yⁿ) ~ N₁₊ₙ( (f₀ᵀβ, (Fβ)ᵀ)ᵀ , σ_Z² [[1, r₀ᵀ], [r₀, R]] )
  where f₀ = f(x₀), F is the n × p matrix with rows fᵀ(xᵢ), r₀ = (R(x₀ − x₁), …, R(x₀ − xₙ))ᵀ, and R is the n × n matrix with entries R(xᵢ − xⱼ)
- The definition of unbiasedness and the conditional distribution of a multivariate normal give (Eq 4) on the next slide
Slide 10: Gaussian Process Example Continued
- The best MSPE predictor of Y₀ is Ŷ₀ = E[Y₀ | Yⁿ] = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ)   (Eq 4)
- But for what class of distributions F is this true? Ŷ₀ depends on
  - multivariate normality of (Y₀, Yⁿ)
  - β
  - R(·)
- thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ_Z² > 0
Slide 11: Second GP Example
- Analogous to the previous linear example: what if we add uncertainty about β?
  - we assume that σ_Z² is known, although the authors say this isn't required
- Now we have a two-stage model
  - the first stage, our conditional distribution of (Y₀, Yⁿ) given β, is the same distribution we saw before
  - the second stage is our prior on β
- One can show that the best MSPE predictor of Y₀ is Ŷ₀ = E[Y₀ | Yⁿ] = f₀ᵀ E[β | Yⁿ] + r₀ᵀR⁻¹(Yⁿ − F E[β | Yⁿ])
  - compare this to what we had in the one-stage case: Ŷ₀ = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ)
  - the authors give a derivation; see the book
Slide 12: So what about E[β | Yⁿ]?
- Of course, the formula for E[β | Yⁿ] depends on our prior on β
- When this prior is noninformative, we can derive β | Yⁿ ~ Nₚ( (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ, σ_Z² (FᵀR⁻¹F)⁻¹ )
- This (somehow) gives us
    Ŷ₀ = f₀ᵀβ̂ + r₀ᵀR⁻¹(Yⁿ − Fβ̂)   (Eq 5),  where β̂ = (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ
  (as above with Ŷ₀, the original slides write B instead of β̂ for PowerPoint reasons)
- What sense can we make of (Eq 5)? It is
  1. the sum of the regression predictor f₀ᵀβ̂ and a correction r₀ᵀR⁻¹(Yⁿ − Fβ̂)
  2. a function of the training data Yⁿ
  3. a function of x₀, the point at which a prediction is made
  - recall that f₀ = f(x₀) and r₀ = (R(x₀ − x₁), …, R(x₀ − xₙ))ᵀ
- For the moment, we consider (1); we consider (2) and (3) in §3.3
  - (that's right, we're still in §3.2!)
- The correction term is a linear combination of the residuals Yⁿ − Fβ̂ based on the GP model fᵀβ + Z, with prediction-point-specific coefficients: r₀ᵀR⁻¹(Yⁿ − Fβ̂) = Σᵢ cᵢ(x₀)(Yⁿ − Fβ̂)ᵢ, where the weight cᵢ(x₀) is the iᵗʰ element of R⁻¹r₀ and (Yⁿ − Fβ̂)ᵢ is the iᵗʰ residual based on the fitted model
- (a code sketch of (Eq 5) follows below)
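Here is a minimal sketch of (Eq 5) in code. The function and variable names are my own, and the Gaussian correlation family R(h) = exp(−θh²) is assumed only for concreteness (it matches the example on the next slide).

```python
# A minimal sketch of the predictor in (Eq 5).  Names are mine; the Gaussian
# correlation family R(h) = exp(-theta * h^2) is assumed only for concreteness.
import numpy as np

def gaussian_corr(x1, x2, theta):
    """Correlation matrix with entries exp(-theta * (x1_i - x2_j)^2) for 1-D inputs."""
    d = np.subtract.outer(x1, x2)
    return np.exp(-theta * d**2)

def blup_predict(x0, X, Y, f, theta):
    """Eq 5: Y0_hat = f(x0)^T beta_hat + r0^T R^{-1} (Y^n - F beta_hat)."""
    F = np.column_stack([fj(X) for fj in f])                # n x p matrix of regression functions
    R = gaussian_corr(X, X, theta)                          # n x n correlation matrix
    r0 = gaussian_corr(X, np.atleast_1d(x0), theta)[:, 0]   # correlations between x0 and the x_i
    Rinv_F = np.linalg.solve(R, F)
    Rinv_Y = np.linalg.solve(R, Y)
    beta_hat = np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_Y)  # (F^T R^-1 F)^-1 F^T R^-1 Y^n
    f0 = np.array([fj(np.atleast_1d(x0))[0] for fj in f])   # regression functions at x0
    resid = Y - F @ beta_hat                                 # residuals Y^n - F beta_hat
    return f0 @ beta_hat + r0 @ np.linalg.solve(R, resid)   # regression term + correction term
```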
Slide 13: Example
- Suppose the true unknown curve is the 1D damped cosine y(x) = e^(−1.4x) cos(7πx/2)
- 7-point training set
  - x₁ drawn from [0, 1/7]
  - xᵢ = x₁ + (i − 1)/7
- Consider predicting y using a stationary GP Y(x) = β₀ + Z(x)
  - Z has zero mean, variance σ_Z², and correlation function R(h) = e^(−136.1 h²)
  - F is a 7 × 1 column vector of ones
  - i.e., we have no features, just an intercept β₀
- Using the regression/correction interpretation of (Eq 5), we can write Ŷ(x₀) = β̂₀ + Σᵢ₌₁⁷ cᵢ(x₀)(Yᵢ − β̂₀)
  - cᵢ(x₀) is the iᵗʰ element of R⁻¹r₀
  - (Yᵢ − β̂₀) are the residuals from fitting the constant model
- (a concrete version of this example, in code, follows below)
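A concrete version of this example, reusing gaussian_corr and blup_predict from the sketch above. The slide draws x₁ at random from [0, 1/7]; here I fix x₁ = 1/14 purely for reproducibility.

```python
# The 7-point damped-cosine example, reusing gaussian_corr and blup_predict from above.
# The slide draws x1 at random from [0, 1/7]; I fix x1 = 1/14 for reproducibility.
import numpy as np

def y_true(x):
    return np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)

x1 = 1 / 14
X = x1 + np.arange(7) / 7                  # x_i = x1 + (i - 1)/7
Y = y_true(X)                              # deterministic "computer experiment" output
f = [lambda x: np.ones_like(x)]            # constant regressor only, so F is a column of ones
theta = 136.1                              # R(h) = exp(-136.1 * h^2)

x0 = 0.55
print(blup_predict(x0, X, Y, f, theta), y_true(x0))   # prediction vs. true value at x0

R = gaussian_corr(X, X, theta)
r0 = gaussian_corr(X, np.atleast_1d(x0), theta)[:, 0]
c = np.linalg.solve(R, r0)                 # weights c_i(x0): the i-th elements of R^{-1} r0
print(np.round(c, 3))                      # weights for the points far from x0 are near zero
```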
Slide 14: Example continued
- Consider predicting y(x₀) at x₀ = 0.55 (plotted as a cross in the slide's figure)
- The residuals (Yᵢ − β̂₀) and their associated weights cᵢ(x₀) are plotted in the figure
- Note
  - weights can be positive or negative
  - the correction to the regression β̂₀ is based primarily on the residuals at the training data points closest to x₀
  - the weights for the 3 furthest training instances are indistinguishable from zero
- ŷ(0.55) has interpolated the data
- What does the whole curve look like? We need to wait for §3.3 to find out
Slide 15: ...but I'll show you now anyway!
Slide 16: Interpolating the data
- The correction term r₀ᵀR⁻¹(Yⁿ − Fβ̂) forces the model to interpolate the data
- Suppose x₀ is xᵢ for some i ∈ {1, …, n}
  - then f₀ = f(xᵢ), and
  - r₀ = (R(xᵢ − x₁), …, R(xᵢ − xₙ))ᵀ, which is the iᵗʰ row of R
- Because R⁻¹r₀ is then the iᵗʰ column of R⁻¹R = Iₙ, the identity matrix, we have R⁻¹r₀ = (0, …, 0, 1, 0, …, 0)ᵀ = eᵢ, the iᵗʰ unit vector
- Hence r₀ᵀR⁻¹(Yⁿ − Fβ̂) = eᵢᵀ(Yⁿ − Fβ̂) = Yᵢ − fᵀ(xᵢ)β̂
- And so, by (Eq 5), Ŷ(x₀) = fᵀ(xᵢ)β̂ + (Yᵢ − fᵀ(xᵢ)β̂) = Yᵢ
- (a numerical check of this property follows below)
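Continuing the earlier sketch, a quick numerical confirmation of this property: predicting at any training input returns the observed response, up to floating-point error.

```python
# Numerical check of the interpolation property, continuing the example above:
# predicting at any training input x_i returns the observed Y_i (up to rounding error).
for xi, yi in zip(X, Y):
    assert abs(blup_predict(xi, X, Y, f, theta) - yi) < 1e-8
print("the predictor interpolates all 7 training points")
```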
Slide 17: An example showing that best MSPE predictors need not be linear
- Suppose that (Y₀, Y₁) has the joint distribution given on the slide
- Then the conditional distribution of Y₀ given Y₁ = y₁ is uniform over the interval (0, y₁²)
- The best MSPE predictor of Y₀ is the center of this interval: Ŷ₀ = E[Y₀ | Y₁] = Y₁²/2
- The minimum MSPE linear unbiased predictor is Ŷ₀,L = −1/12 + ½ Y₁
  - based on a bunch of calculus
- Their MSPEs are very similar
  - E[(Y₀ − Y₁²/2)²] ≈ 0.01667
  - E[(Y₀ − (−1/12 + ½ Y₁))²] ≈ 0.01806
- (a Monte Carlo check follows below)
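The slide's joint distribution is shown in a figure that is not reproduced here. A joint consistent with the stated conditional distribution and with both MSPE values is Y₁ ~ U(0, 1) with Y₀ | Y₁ ~ U(0, Y₁²); the Monte Carlo sketch below assumes exactly that.

```python
# Monte Carlo check of the two MSPEs.  The joint distribution on the slide is in a
# figure not reproduced here; I assume Y1 ~ U(0, 1) with Y0 | Y1 ~ U(0, Y1^2), which
# is consistent with the stated conditional distribution and both MSPE values.
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.uniform(0.0, 1.0, size=2_000_000)
y0 = rng.uniform(0.0, y1**2)               # Y0 | Y1 = y1 ~ U(0, y1^2)

best = y1**2 / 2                           # best MSPE predictor E[Y0 | Y1]
linear = -1/12 + y1 / 2                    # minimum MSPE linear unbiased predictor

print(np.mean((y0 - best) ** 2))           # ~ 1/60   = 0.01667
print(np.mean((y0 - linear) ** 2))         # ~ 13/720 = 0.01806
```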
Slide 18: Best Linear Unbiased MSPE Predictors
- Minimum MSPE predictors depend on the joint distribution of Yⁿ and Y₀
  - thus, they tend to be optimal only within a very restricted class F
- In an attempt to find predictors that are more broadly optimal, consider
  - predictors that are linear in Yⁿ; the best of these are called best linear predictors (BLPs)
  - predictors that are both linear and unbiased for Y₀; the best of these are called best linear unbiased predictors (BLUPs)
Slide 19: BLUP Example 1
- Recall our first example
  - Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0
  - F: β₀ is a given nonzero constant; σ_ε² is unknown, but σ_ε² > 0 is known
  - any Ŷ₀ = a₀ + aᵀYⁿ is a LUP of Y₀ if a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
- The MSPE of a linear unbiased predictor Ŷ₀ = a₀ + aᵀYⁿ is
    E[(a₀ + Σᵢ₌₁ⁿ aᵢYᵢ − Y₀)²] = E[(a₀ + Σᵢ aᵢ(β₀ + εᵢ) − β₀ − ε₀)²]
                               = (a₀ + β₀ Σᵢ aᵢ − β₀)² + σ_ε² Σᵢ aᵢ² + σ_ε²
                               = σ_ε² (1 + Σᵢ aᵢ²)   (Eq 6)
                               ≥ σ_ε²   (Eq 7)
  - we have equality in (Eq 6) because Ŷ₀ is unbiased, so the squared bias term vanishes
  - we have equality in (Eq 7) iff aᵢ = 0 for all i ∈ {1, …, n} (and hence a₀ = β₀)
- Thus, the unique BLUP is Ŷ₀ = β₀
Slide 20: BLUP Example 2
- Consider again the enlarged model F with β₀ an unknown real and σ_ε² > 0
  - recall that every unbiased Ŷ₀ = a₀ + aᵀYⁿ must satisfy a₀ = 0 and Σᵢ aᵢ = 1
- The MSPE of Ŷ₀ is
    E[(Σᵢ aᵢYᵢ − Y₀)²] = (β₀ Σᵢ aᵢ − β₀)² + σ_ε² Σᵢ aᵢ² + σ_ε²
                       = 0 + σ_ε² (1 + Σᵢ aᵢ²)   (Eq 8)
                       ≥ σ_ε² (1 + 1/n)   (Eq 9)
  - equality holds in (Eq 8) because Σᵢ aᵢ = 1
  - (Eq 9): Σᵢ aᵢ² is minimized subject to Σᵢ aᵢ = 1 when aᵢ = 1/n (a quick numerical check follows below)
- Thus the sample mean Ŷ₀ = (1/n) Σᵢ Yᵢ is the best linear unbiased predictor of Y₀ for the enlarged F
- How can the BLUP for a large class not also be the BLUP for a subclass? (didn't we see a claim to the contrary earlier?)
  - the previous claim was that every LUP for a class is also a LUP for a subclass; it doesn't hold for BLUPs
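A tiny numerical check (my own) of the step from (Eq 8) to (Eq 9): among weight vectors that sum to one, equal weights minimize Σᵢ aᵢ².

```python
# Check that among weights with sum(a) = 1, equal weights a_i = 1/n minimize
# sum(a_i^2), and hence minimize the MSPE sigma^2 * (1 + sum(a_i^2)) in (Eq 8).
import numpy as np

rng = np.random.default_rng(4)
n = 7
equal = np.full(n, 1/n)
print(np.sum(equal**2))                    # 1/n ~ 0.1429
for _ in range(3):
    a = rng.standard_normal(n)
    a /= a.sum()                           # rescale so the weights sum to 1
    print(np.sum(a**2))                    # always >= 1/n
```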
Slide 21: BLUP Example 3
- Consider the measurement error model Yᵢ = Y(xᵢ) = Σⱼ fⱼ(xᵢ)βⱼ + εᵢ, where the fⱼ are known regression functions, the βⱼ are unknown, and each εᵢ ~ N(0, σ_ε²)
- Consider the BLUP of Y(x₀) for unknown real β and σ_ε² > 0
- A linear predictor Ŷ₀ = a₀ + aᵀYⁿ is unbiased provided that, for all (β, σ_ε²), E[a₀ + aᵀYⁿ] = a₀ + aᵀFβ is equal to E[Y₀] = fᵀ(x₀)β
  - this implies a₀ = 0 and Fᵀa = f(x₀)
- The BLUP of Y₀ is Ŷ₀ = fᵀ(x₀)β̂
  - where β̂ = (FᵀF)⁻¹FᵀYⁿ is the ordinary least squares estimator of β
  - and the BLUP is unique
  - this is proved in the chapter notes, §3.4 (a short code sketch follows below)
- ...and now we've reached the end of §3.2!
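A short sketch of this result in code. The specific regressors (an intercept and a linear term), design points, and noise level are my own choices; the point is only that the BLUP is the OLS fit evaluated at x₀.

```python
# Sketch of BLUP Example 3: with i.i.d. measurement error, the BLUP of Y(x0) is the
# ordinary least squares fit evaluated at x0.  Regressors (1, x) are my choice.
import numpy as np

rng = np.random.default_rng(5)
X = np.linspace(0.0, 1.0, 20)
beta_true = np.array([1.0, -2.0])
F = np.column_stack([np.ones_like(X), X])             # known regression functions f = (1, x)
Y = F @ beta_true + 0.1 * rng.standard_normal(len(X))

beta_hat = np.linalg.solve(F.T @ F, F.T @ Y)          # (F^T F)^{-1} F^T Y^n: OLS estimator
x0 = 0.3
f0 = np.array([1.0, x0])
print(f0 @ beta_hat)                                  # BLUP of Y(x0) = f(x0)^T beta_hat
```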
Slide 22: that's all for today!
Slide 23: Prediction for Computer Experiments
- The idea is to build a surrogate for the simulator
  - a model that predicts the output of a simulation, to spare you from having to run the actual simulation
- Neural networks, splines, and GPs all work... guess what, they like GPs
- Let f₁, …, fₚ be known regression functions, β be a vector of unknown regression coefficients, and Z be a stationary GP on X with zero mean, variance σ_Z², and correlation function R
- Then we can view the experimental output Y(x) as a realization of the random function Y(x) = Σⱼ fⱼ(x)βⱼ + Z(x) = fᵀ(x)β + Z(x)
- This model implies that Y₀ and Yⁿ have the multivariate normal distribution from Slide 9, where β and σ_Z² > 0 are unknown
- Now, drop the Gaussian assumption and consider a nonparametric moment model based on an arbitrary second-order stationary process, for unknown β and σ_Z²
Slide 24: Conclusion: is this the right tool when we can get lots of data?