Predicting Output from Computer Experiments - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Predicting Output from Computer Experiments


1
Predicting Output from Computer Experiments
  • Design and Analysis of Computer Experiments
  • Chapter 3
  • Kevin Leyton-Brown

2
Overview
  • Overall program in this chapter:
  • predict the output of a computer simulation
  • we're going to review approaches to regression, looking for various kinds of optimality
  • First, we'll talk about just predicting our random variable (§3.2)
  • note: in this setting, we have no features
  • Then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
  • In the end, we'll apply these ideas to computer experiments (§3.3)
  • Not covered:
  • an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
  • proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
  • If you've done the reading, you already know:
  • the difference between minimum MSPE linear unbiased predictors and BLUPs
  • three different intuitive interpretations of r₀ᵀR⁻¹(Yⁿ − Fβ̂)
  • a lot about statistics
  • whether this chapter has anything to do with computer experiments
  • If you haven't, you're in for a treat!

3
Predictors
  • Y₀ is our random variable; our data is Yⁿ = (Y₁, ..., Yₙ)ᵀ
  • no features: just predict one response from the others
  • A generic predictor Ŷ₀ = Ŷ₀(Yⁿ) predicts Y₀ based on Yⁿ
  • (the hat distinguishes the predictor Ŷ₀ from the random variable Y₀)
  • There are three kinds of predictors discussed (illustrated in the sketch below):
  • Predictors
  • Ŷ₀(Yⁿ) has unrestricted functional form
  • Linear Predictors
  • Ŷ₀ = a₀ + Σᵢ₌₁ⁿ aᵢYᵢ = a₀ + aᵀYⁿ
  • Linear unbiased predictors (LUPs)
  • again, linear predictors Ŷ₀ = a₀ + aᵀYⁿ
  • furthermore, unbiased with respect to a given family ℱ of distributions for (Y₀, Yⁿ)
  • Definition: a predictor Ŷ₀ is unbiased for Y₀ with respect to the class of distributions ℱ over (Y₀, Yⁿ) if for all F ∈ ℱ, E_F[Ŷ₀] = E_F[Y₀]
  • E_F denotes expectation under the F(·) distribution for (Y₀, Yⁿ)
  • this definition depends on ℱ: a linear predictor is unbiased with respect to a class
  • as ℱ gets bigger, the set of LUPs gets weakly smaller
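A minimal numerical sketch (not from the slides) of these classes of predictors, using hypothetical values of β₀ and σ_ε: an arbitrary nonlinear predictor, the linear sample-mean predictor, and a Monte Carlo check that the latter is unbiased.

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, sigma, n = 2.0, 0.5, 5      # hypothetical values of the unknowns

    def simulate():
        # draw (Y0, Y1, ..., Yn) with Yi = beta0 + eps_i, eps_i ~ N(0, sigma^2)
        return beta0 + sigma * rng.standard_normal(n + 1)   # index 0 plays Y0

    # an unrestricted predictor: any function of Yn
    pred_general = lambda Yn: np.median(Yn) ** 2 / (1.0 + abs(np.mean(Yn)))

    # a linear (unbiased) predictor: a0 + a'Yn with a0 = 0, ai = 1/n (the sample mean)
    pred_linear = lambda Yn: float(np.mean(Yn))

    draws = np.array([simulate() for _ in range(20000)])
    print(pred_general(draws[0, 1:]))                      # some arbitrary nonlinear function of the data
    print(np.mean([pred_linear(d[1:]) for d in draws]))    # close to beta0 = 2.0 (unbiased)
    print(np.mean(draws[:, 0]))                            # E[Y0] is also close to 2.0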

4
LUP Example 1
  • Suppose that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0.
  • Define ℱ as those distributions in which:
  • β₀ is a given nonzero constant
  • σ_ε² is unknown (only σ_ε² > 0 is known)
  • Any Ŷ₀ = a₀ + aᵀYⁿ is a LP of Y₀
  • Which are unbiased? We know that
  • E[Ŷ₀] = E[a₀ + Σᵢ₌₁ⁿ aᵢYᵢ] = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ   (Eq 1)
  • and E[Y₀] = β₀   (Eq 2)
  • For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ_ε²
  • since (Eq 1) and (Eq 2) are independent of σ_ε², we just need that, given β₀, a satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
  • solutions:
  • a₀ = β₀ and a such that Σᵢ₌₁ⁿ aᵢ = 0 (includes the data-independent predictor Ŷ₀ = β₀)
  • a₀ = 0 and a such that Σᵢ₌₁ⁿ aᵢ = 1
  • e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n

5
LUP Example 2
  • Suppose that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0.
  • Define ℱ as those distributions in which:
  • β₀ is an unknown real constant
  • σ_ε² is unknown (only σ_ε² > 0 is known)
  • Any Ŷ₀ = a₀ + aᵀYⁿ is a LP of Y₀
  • Which are unbiased? We know that
  • E[Ŷ₀] = E[a₀ + Σᵢ₌₁ⁿ aᵢYᵢ] = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ   (Eq 1)
  • and E[Y₀] = β₀   (Eq 2)
  • For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ_ε² and ∀ β₀
  • since (Eq 1) and (Eq 2) are independent of σ_ε², we just need that a satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀ for all β₀
  • solution: a₀ = 0 and a such that Σᵢ₌₁ⁿ aᵢ = 1 (checked numerically below; the choice a₀ = β₀ from Example 1 is no longer available, since β₀ is not known)
  • e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n
  • This illustrates that a LUP for ℱ is also a LUP for subfamilies of ℱ
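A small check of the unbiasedness condition, with illustrative values of my choosing: the sample mean (a₀ = 0, aᵢ = 1/n) has zero bias for every β₀, while the constant predictor Ŷ₀ = 2 is unbiased only when β₀ happens to equal 2.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma, reps = 5, 0.5, 50000

    def bias(a0, a, beta0):
        # Monte Carlo estimate of E[a0 + a'Yn] - E[Y0] when Yi = beta0 + eps_i
        Yn = beta0 + sigma * rng.standard_normal((reps, n))
        return np.mean(a0 + Yn @ a) - beta0

    a_mean  = np.full(n, 1.0 / n)   # a0 = 0, ai = 1/n  (the sample mean)
    a_const = np.zeros(n)           # a0 = 2, ai = 0    (the constant predictor Yhat0 = 2)

    for beta0 in (2.0, 5.0):
        print(beta0, round(bias(0.0, a_mean, beta0), 3),   # ~0 for every beta0
              round(bias(2.0, a_const, beta0), 3))         # ~0 only when beta0 = 2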

6
Best Mean Squared Prediction Error (MSPE)
Predictors
  • Definition: MSPE(Ŷ₀, F) = E_F[(Ŷ₀ − Y₀)²]
  • Definition: Ŷ₀ is a minimum MSPE predictor at F if, for any predictor Ỹ₀, MSPE(Ŷ₀, F) ≤ MSPE(Ỹ₀, F)
  • we'll also call this a best MSPE predictor
  • Fundamental theorem of prediction (see the Monte Carlo illustration below):
  • the conditional mean of Y₀ given Yⁿ is the minimum MSPE predictor of Y₀ based on Yⁿ
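A Monte Carlo illustration of the fundamental theorem under an assumed bivariate normal model (my example, not the book's): the conditional mean attains the smallest MSPE, and another predictor does strictly worse.

    import numpy as np

    rng = np.random.default_rng(2)
    mu, rho, reps = 0.0, 0.8, 200000

    # jointly normal (Y0, Y1) with means mu, unit variances, correlation rho
    cov = np.array([[1.0, rho], [rho, 1.0]])
    Y0, Y1 = rng.multivariate_normal([mu, mu], cov, size=reps).T

    cond_mean = mu + rho * (Y1 - mu)    # E[Y0 | Y1] for this joint distribution
    other     = Y1                      # some other predictor of Y0

    print(np.mean((Y0 - cond_mean) ** 2))   # ~ 1 - rho^2 = 0.36 (the minimum)
    print(np.mean((Y0 - other) ** 2))       # ~ 2(1 - rho) = 0.40 (strictly larger)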

7
Best Mean Squared Prediction Error (MSPE)
Predictors
  • Theorem: Suppose that (Y₀, Yⁿ) has a joint distribution F for which the conditional mean of Y₀ given Yⁿ exists. Then Ŷ₀ = E[Y₀ | Yⁿ] is the best MSPE predictor of Y₀.
  • Proof: Fix an arbitrary predictor Ỹ₀(Yⁿ). Then
    MSPE(Ỹ₀, F) = E_F[(Y₀ − Ỹ₀)²] = E_F[(Y₀ − Ŷ₀ + Ŷ₀ − Ỹ₀)²]
                = E_F[(Ŷ₀ − Ỹ₀)²] + MSPE(Ŷ₀, F) + 2 E_F[(Ŷ₀ − Ỹ₀)(Y₀ − Ŷ₀)]
                ≥ MSPE(Ŷ₀, F) + 2 E_F[(Ŷ₀ − Ỹ₀)(Y₀ − Ŷ₀)]   (Eq 3)
  • E_F[(Ŷ₀ − Ỹ₀)(Y₀ − Ŷ₀)] = E_F[ (Ŷ₀ − Ỹ₀) E_F[Y₀ − Ŷ₀ | Yⁿ] ] = E_F[ (Ŷ₀ − Ỹ₀)(E_F[Y₀ | Yⁿ] − Ŷ₀) ] = E_F[(Ŷ₀ − Ỹ₀) · 0] = 0
  • Thus, MSPE(Ỹ₀, F) ≥ MSPE(Ŷ₀, F)
  • Notes:
  • Ŷ₀ = E[Y₀ | Yⁿ] is essentially the unique best MSPE predictor:
  • MSPE(Ỹ₀, F) = MSPE(Ŷ₀, F) iff Ỹ₀ = Ŷ₀ almost everywhere
  • Ŷ₀ = E[Y₀ | Yⁿ] is always unbiased:
  • E[Ŷ₀] = E[E[Y₀ | Yⁿ]] = E[Y₀]

(Why can we condition here?)
8
Example Continued: Best MSPE Predictors
  • What is the best MSPE predictor when each Yᵢ ~ N(β₀, σ_ε²), with β₀ known?
  • Since the Yᵢ's are independent, Y₀ | Yⁿ ~ N(β₀, σ_ε²)
  • Thus, Ŷ₀ = E[Y₀ | Yⁿ] = β₀
  • What if σ_ε² is known, and Yᵢ ~ N(β₀, σ_ε²), but β₀ is unknown (i.e., the noninformative prior π(β₀) ∝ 1)?
  • improper priors do not always give proper posteriors, but here Y₀ | Yⁿ = yⁿ ~ N₁(ȳ, σ_ε²(1 + 1/n)), where ȳ is the sample mean of the training data (a quick variance check follows below)
  • Thus, the best MSPE predictor of Y₀ is Ŷ₀ = (Σᵢ Yᵢ)/n
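A quick numerical check (not in the slides) of the stated predictive variance: under this model the prediction error Y₀ − ȳ has variance σ_ε²(1 + 1/n).

    import numpy as np

    rng = np.random.default_rng(3)
    beta0, sigma, n, reps = 1.5, 0.7, 7, 200000   # illustrative values

    Y = beta0 + sigma * rng.standard_normal((reps, n + 1))   # column 0 plays Y0
    err = Y[:, 0] - Y[:, 1:].mean(axis=1)                    # Y0 minus the sample mean

    print(np.var(err))                  # ~ sigma^2 * (1 + 1/n) = 0.56
    print(sigma ** 2 * (1 + 1 / n))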

9
Now let's dive into Gaussian Processes (uh oh)
  • Consider the regression model from chapter 2: Yᵢ = Y(xᵢ) = Σⱼ₌₁ᵖ fⱼ(xᵢ)βⱼ + Z(xᵢ) = fᵀ(xᵢ)β + Z(xᵢ)
  • each fⱼ is a known regression function
  • β is an unknown nonzero p × 1 vector
  • Z(x) is a zero-mean stationary Gaussian process with dependence specified by Cov[Z(xᵢ), Z(xⱼ)] = σ_Z² R(xᵢ − xⱼ) for some known correlation function R
  • Then the joint distribution of Y₀ = Y(x₀) and Yⁿ = (Y(x₁), ..., Y(xₙ))ᵀ is multivariate normal:
      (Y₀, Yⁿ)ᵀ ~ N₁₊ₙ( (f₀ᵀβ, (Fβ)ᵀ)ᵀ , σ_Z² [ 1  r₀ᵀ ; r₀  R ] )
    where f₀ = f(x₀), F is the n × p matrix with i-th row fᵀ(xᵢ), r₀ = (R(x₀ − x₁), ..., R(x₀ − xₙ))ᵀ, and R is the n × n matrix with (i, j) entry R(xᵢ − xⱼ)

the conditional distribution of a multivariate normal then gives (Eq 4)
10
Gaussian Process Example Continued
  • The best MSPE predictor of Y₀ is Ŷ₀ = E[Y₀ | Yⁿ] = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ)   (Eq 4)
  • But for what class of distributions ℱ is this true?
  • Ŷ₀ depends on:
  • multivariate normality of (Y₀, Yⁿ)
  • β
  • R(·)
  • thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ_Z² > 0
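A sketch of computing (Eq 4) directly, assuming β and a Gaussian correlation function are known; the inputs and the helper names (gauss_corr, gp_best_mspe) are mine, not the book's.

    import numpy as np

    def gauss_corr(h, theta=10.0):
        # an assumed Gaussian correlation function R(h) = exp(-theta * h^2)
        return np.exp(-theta * np.asarray(h) ** 2)

    def gp_best_mspe(x0, x, Y, f, beta, corr=gauss_corr):
        # Eq 4: f0' beta + r0' R^{-1} (Yn - F beta), for 1-D inputs x
        F  = np.array([f(xi) for xi in x])          # n x p regression matrix
        f0 = np.array(f(x0))
        R  = corr(x[:, None] - x[None, :])          # n x n correlation matrix
        r0 = corr(x0 - x)                           # correlations between x0 and the xi
        return f0 @ beta + r0 @ np.linalg.solve(R, Y - F @ beta)

    # tiny illustration: constant regression f(x) = (1,), known beta = (0.5,)
    x = np.linspace(0.0, 1.0, 5)
    Y = np.sin(2 * np.pi * x)                       # pretend these are observed outputs
    print(gp_best_mspe(0.3, x, Y, lambda xi: (1.0,), np.array([0.5])))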

11
Second GP example
  • Second example: analogous to the previous linear example, what if we add uncertainty about β?
  • we assume that σ_Z² is known, although the authors say this isn't required
  • Now we have a two-stage model:
  • the first stage, our conditional distribution of (Y₀, Yⁿ) given β, is the same distribution we saw before
  • the second stage is our prior on β
  • One can show that the best MSPE predictor of Y₀ is Ŷ₀ = E[Y₀ | Yⁿ] = f₀ᵀ E[β | Yⁿ] + r₀ᵀR⁻¹(Yⁿ − F E[β | Yⁿ])
  • Compare this to what we had in the one-stage case: Ŷ₀ = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ)
  • we won't go through it, but the authors give a derivation; see the book

12
So what about E[β | Yⁿ]?
  • Of course, the formula for E[β | Yⁿ] depends on our prior for β
  • when this prior is uninformative, we can derive
    β | Yⁿ ~ N_p( (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ, σ_Z²(FᵀR⁻¹F)⁻¹ )
  • this (somehow) gives us
    Ŷ₀ = f₀ᵀβ̂ + r₀ᵀR⁻¹(Yⁿ − Fβ̂),   (Eq 5)
    where β̂ = (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ is the generalized least squares estimator of β
  • What sense can we make of (Eq 5)? It is
  • (1) the sum of the regression predictor f₀ᵀβ̂ and a correction r₀ᵀR⁻¹(Yⁿ − Fβ̂)
  • (2) a function of the training data Yⁿ
  • (3) a function of x₀, the point at which a prediction is made
  • recall that f₀ᵀ = f(x₀)ᵀ and r₀ = (R(x₀ − x₁), ..., R(x₀ − xₙ))ᵀ
  • For the moment, we consider (1); we consider (2) and (3) in §3.3
  • (that's right, we're still in §3.2!)
  • The correction term is a linear combination of the residuals Yⁿ − Fβ̂ based on the GP model fᵀβ + Z, with prediction-point-specific coefficients (computed in the sketch below):
    r₀ᵀR⁻¹(Yⁿ − Fβ̂) = Σᵢ cᵢ(x₀)(Yᵢ − fᵀ(xᵢ)β̂),
    where the weight cᵢ(x₀) is the i-th element of R⁻¹r₀ and (Yᵢ − fᵀ(xᵢ)β̂) is the i-th residual based on the fitted model
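A sketch of the quantities in (Eq 5) on made-up data: the generalized least squares estimate β̂, the weights cᵢ(x₀) = (R⁻¹r₀)ᵢ, and the regression-plus-correction prediction.

    import numpy as np

    def gls_beta_hat(F, R, Y):
        # generalized least squares: (F' R^-1 F)^-1 F' R^-1 Y
        return np.linalg.solve(F.T @ np.linalg.solve(R, F), F.T @ np.linalg.solve(R, Y))

    def eq5_predict(f0, r0, F, R, Y):
        # Eq 5 written as regression term plus weighted sum of residuals
        beta_hat = gls_beta_hat(F, R, Y)
        c = np.linalg.solve(R, r0)          # weights c_i(x0) = (R^-1 r0)_i
        resid = Y - F @ beta_hat            # residuals from the fitted model
        return f0 @ beta_hat + c @ resid, beta_hat, c

    # toy inputs with an assumed Gaussian correlation and intercept-only regression
    x  = np.array([0.1, 0.3, 0.6, 0.9]); Y = np.array([1.2, 0.7, 0.4, 0.9])
    R  = np.exp(-25.0 * (x[:, None] - x[None, :]) ** 2)
    F  = np.ones((4, 1))
    x0 = 0.5
    pred, bhat, c = eq5_predict(np.array([1.0]), np.exp(-25.0 * (x0 - x) ** 2), F, R, Y)
    print(pred, bhat, c)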

13
Example
  • Suppose the true unknown curve is the 1-D dampened cosine y(x) = e^(−1.4x) cos(7πx/2)
  • 7-point training set:
  • x₁ drawn from [0, 1/7]
  • xᵢ = x₁ + (i − 1)/7
  • Consider predicting y using a stationary GP: Y(x) = β₀ + Z(x)
  • Z has zero mean, variance σ_Z², and correlation function R(h) = e^(−136.1 h²)
  • F is a 7 × 1 column vector of ones
  • i.e., we have no features, just an intercept β₀
  • Using the regression/correction interpretation of (Eq 5), we can write Ŷ(x₀) = β̂₀ + Σᵢ₌₁⁷ cᵢ(x₀)(Yᵢ − β̂₀)
  • cᵢ(x₀) is the i-th element of R⁻¹r₀
  • (Yᵢ − β̂₀) are the residuals from fitting the constant model (a runnable version of this example follows below)
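A runnable version of this example; the curve and the correlation function are as stated on the slide, while the particular draw of x₁ and the prediction point are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)

    y   = lambda x: np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)   # true curve
    R_h = lambda h: np.exp(-136.1 * np.asarray(h) ** 2)            # correlation R(h)

    # 7-point training design: x1 drawn from [0, 1/7], then equally spaced
    x1 = rng.uniform(0, 1 / 7)
    x  = x1 + np.arange(7) / 7
    Y  = y(x)                                    # deterministic computer output

    F = np.ones((7, 1))
    R = R_h(x[:, None] - x[None, :])
    beta_hat = np.linalg.solve(F.T @ np.linalg.solve(R, F),
                               F.T @ np.linalg.solve(R, Y)).item()  # GLS intercept

    def predict(x0):
        c = np.linalg.solve(R, R_h(x0 - x))      # weights c_i(x0)
        return beta_hat + c @ (Y - beta_hat)     # regression + correction, (Eq 5)

    print(predict(0.55), y(0.55))                # prediction vs. truth at x0 = 0.55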

14
Example continued
  • Consider predicting y(x₀) at x₀ = 0.55 (plotted as a cross in the slide's figure)
  • The residuals (Yᵢ − β̂₀) and their associated weights cᵢ(x₀) are plotted in the figure
  • Note:
  • weights can be positive or negative
  • the correction to the regression term β̂₀ is based primarily on the residuals at the training points closest to x₀
  • the weights for the 3 furthest training instances are indistinguishable from zero
  • ŷ(0.55) has interpolated the data
  • what does the whole curve look like?
  • We need to wait for §3.3 to find out

15
... but I'll show you now anyway!
16
Interpolating the data
  • The correction term r₀ᵀR⁻¹(Yⁿ − Fβ̂) forces the model to interpolate the data
  • suppose x₀ is xᵢ for some i ∈ {1, ..., n}
  • then f₀ = f(xᵢ), and
  • r₀ = (R(xᵢ − x₁), ..., R(xᵢ − xₙ))ᵀ, which is the i-th row of R
  • Because R⁻¹r₀ is the i-th column of R⁻¹R = Iₙ, the identity matrix, we have R⁻¹r₀ = (0, ..., 0, 1, 0, ..., 0)ᵀ = eᵢ, the i-th unit vector
  • Hence r₀ᵀR⁻¹(Yⁿ − Fβ̂) = eᵢᵀ(Yⁿ − Fβ̂) = Yᵢ − fᵀ(xᵢ)β̂
  • and so, by (Eq 5), Ŷ(x₀) = fᵀ(xᵢ)β̂ + (Yᵢ − fᵀ(xᵢ)β̂) = Yᵢ (verified numerically below)
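A self-contained numerical check on arbitrary made-up data (Gaussian correlation assumed) that the predictor in (Eq 5) reproduces the training outputs exactly at the design points.

    import numpy as np

    x = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
    Y = np.array([2.0, -1.0, 0.5, 3.0, 1.5])                # arbitrary outputs
    R = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)      # assumed Gaussian correlation
    F = np.ones((5, 1))                                     # intercept-only regression

    beta_hat = np.linalg.solve(F.T @ np.linalg.solve(R, F),
                               F.T @ np.linalg.solve(R, Y))

    def predict(x0):
        r0 = np.exp(-50.0 * (x0 - x) ** 2)
        return F[0] @ beta_hat + np.linalg.solve(R, r0) @ (Y - F @ beta_hat)

    # at each design point the prediction reproduces the training output
    print([round(float(predict(xi) - Yi), 8) for xi, Yi in zip(x, Y)])   # all (numerically) zero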
17
An example showing that best MSPE predictors need
not be linear
  • Suppose that (Y₀, Y₁) has the joint distribution given on the slide
  • Then the conditional distribution of Y₀ given Y₁ = y₁ is uniform over the interval (0, y₁²)
  • The best MSPE predictor of Y₀ is the center of this interval: Ŷ₀ = E[Y₀ | Y₁] = Y₁²/2
  • The minimum MSPE linear unbiased predictor is Ŷ₀,L = −1/12 + (1/2)Y₁
  • based on a bunch of calculus
  • Their MSPEs are very similar (reproduced by simulation below):
  • E[(Y₀ − Y₁²/2)²] ≈ 0.01667
  • E[(Y₀ − (−1/12 + (1/2)Y₁))²] ≈ 0.01806
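A Monte Carlo check of these two numbers. The slide's joint density is not reproduced in the transcript; the simulation assumes Y₁ ~ Uniform(0, 1) with Y₀ | Y₁ = y₁ ~ Uniform(0, y₁²), which is consistent with both quoted MSPEs.

    import numpy as np

    rng = np.random.default_rng(5)
    reps = 2_000_000

    # assumed joint distribution: Y1 ~ U(0,1) and Y0 | Y1 = y1 ~ U(0, y1^2)
    Y1 = rng.uniform(0.0, 1.0, reps)
    Y0 = rng.uniform(0.0, 1.0, reps) * Y1 ** 2

    best   = Y1 ** 2 / 2          # best MSPE predictor E[Y0 | Y1]
    linear = -1 / 12 + Y1 / 2     # minimum MSPE linear unbiased predictor

    print(np.mean((Y0 - best) ** 2))     # ~ 0.01667
    print(np.mean((Y0 - linear) ** 2))   # ~ 0.01806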

18
Best Linear Unbiased MSPE Predictors
  • minimum MSPE predictors depend on the joint distribution of Yⁿ and Y₀
  • thus, they tend to be optimal only within a very restricted class ℱ
  • In an attempt to find predictors that are more broadly optimal, consider:
  • predictors that are linear in Yⁿ and have minimum MSPE among linear predictors
  • these are called best linear predictors (BLPs)
  • predictors that are linear, unbiased for Y₀, and have minimum MSPE among linear unbiased predictors
  • these are called best linear unbiased predictors (BLUPs)

19
BLUP Example 1
  • Recall our first example:
  • Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ_ε²), σ_ε² > 0
  • Define ℱ as those distributions in which:
  • β₀ is a given nonzero constant
  • σ_ε² is unknown (only σ_ε² > 0 is known)
  • Any Ŷ₀ = a₀ + aᵀYⁿ is a LUP of Y₀ if a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
  • The MSPE of a linear unbiased predictor Ŷ₀ = a₀ + aᵀYⁿ is
    E[(a₀ + Σᵢ₌₁ⁿ aᵢYᵢ − Y₀)²] = E[(a₀ + Σᵢ aᵢ(β₀ + εᵢ) − β₀ − ε₀)²]
                               = (a₀ + β₀ Σᵢ aᵢ − β₀)² + σ_ε² Σᵢ aᵢ² + σ_ε²
                               = σ_ε² (1 + Σᵢ aᵢ²)   (Eq 6)
                               ≥ σ_ε²   (Eq 7)
  • we have equality in (Eq 6) because Ŷ₀ is unbiased (the squared bias term is zero)
  • we have equality in (Eq 7) iff aᵢ = 0 for all i ∈ {1, ..., n} (and hence a₀ = β₀)
  • Thus, the unique BLUP is Ŷ₀ = β₀

20
BLUP Example 2
  • Consider again the enlarged model ℱ with β₀ an unknown real and σ_ε² > 0
  • recall that every unbiased Ŷ₀ = a₀ + aᵀYⁿ must satisfy a₀ = 0 and Σᵢ aᵢ = 1
  • The MSPE of Ŷ₀ is
    E[(Σᵢ aᵢYᵢ − Y₀)²] = (β₀ Σᵢ aᵢ − β₀)² + σ_ε² Σᵢ aᵢ² + σ_ε²
                       = 0 + σ_ε² (1 + Σᵢ aᵢ²)   (Eq 8)
                       ≥ σ_ε² (1 + 1/n)   (Eq 9)
  • equality holds in (Eq 8) because Σᵢ aᵢ = 1
  • (Eq 9): Σᵢ aᵢ² is minimized subject to Σᵢ aᵢ = 1 when aᵢ = 1/n
  • Thus the sample mean Ŷ₀ = (1/n) Σᵢ Yᵢ is the best linear unbiased predictor of Y₀ for the enlarged ℱ (compared by simulation below)
  • How can the BLUP for a large class not also be the BLUP for a subclass? (didn't we see a claim to the contrary earlier?)
  • the previous claim was that every LUP for a class is also a LUP for a subclass; the analogous statement doesn't hold for BLUPs
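A simulation comparing two unbiased weight vectors (values of my choosing): equal weights achieve MSPE ≈ σ_ε²(1 + 1/n), while other weights summing to 1 give σ_ε²(1 + Σᵢ aᵢ²), which is larger.

    import numpy as np

    rng = np.random.default_rng(6)
    beta0, sigma, n, reps = 3.0, 1.0, 4, 500000

    Yn = beta0 + sigma * rng.standard_normal((reps, n))   # training responses
    Y0 = beta0 + sigma * rng.standard_normal(reps)        # response to predict

    a_mean  = np.full(n, 1.0 / n)                 # equal weights (the sample mean)
    a_other = np.array([0.4, 0.3, 0.2, 0.1])      # also sums to 1, but unequal

    print(np.mean((Yn @ a_mean  - Y0) ** 2), sigma ** 2 * (1 + 1 / n))                 # ~1.25
    print(np.mean((Yn @ a_other - Y0) ** 2), sigma ** 2 * (1 + np.sum(a_other ** 2)))  # ~1.30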

21
BLUP Example 3
  • Consider the measurement error model Yᵢ = Y(xᵢ) = Σⱼ fⱼ(xᵢ)βⱼ + εᵢ, where the fⱼ are known regression functions, the βⱼ are unknown, and each εᵢ ~ N(0, σ_ε²)
  • Consider the BLUP of Y(x₀) for unknown real β and σ_ε² > 0
  • A linear predictor Ŷ₀ = a₀ + aᵀYⁿ is unbiased provided that, for all (β, σ_ε²), E[a₀ + aᵀYⁿ] = a₀ + aᵀFβ is equal to E[Y₀] = fᵀ(x₀)β
  • This implies a₀ = 0 and Fᵀa = f(x₀)
  • The BLUP of Y₀ is Ŷ₀ = fᵀ(x₀)β̂ (sketched below)
  • where β̂ = (FᵀF)⁻¹FᵀYⁿ is the ordinary least squares estimator of β
  • and the BLUP is unique
  • This is proved in the chapter notes, §3.4
  • ... and now we've reached the end of §3.2!
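A minimal sketch of this BLUP on made-up data: ordinary least squares for β̂, then evaluation of the fitted regression at x₀; the quadratic regression functions are purely illustrative.

    import numpy as np

    # hypothetical regression functions f(x) = (1, x, x^2)
    f = lambda x: np.array([1.0, x, x * x])

    x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    Y = np.array([0.9, 1.4, 1.3, 2.1, 2.4])            # noisy observed responses
    F = np.array([f(xi) for xi in x])                  # n x p design matrix

    beta_hat, *_ = np.linalg.lstsq(F, Y, rcond=None)   # OLS estimate (F'F)^-1 F'Y
    x0 = 0.6
    print(f(x0) @ beta_hat)                            # BLUP of Y(x0)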

22
that's all for today!
23
Prediction for Computer Experiments
  • The idea is to build a surrogate (emulator):
  • a model that predicts the output of a simulation, to spare you from having to run the actual simulation
  • Neural networks, splines, and GPs all work; guess what, they like GPs
  • Let f₁, ..., fₚ be known regression functions, β be a vector of unknown regression coefficients, and Z be a stationary GP on X with zero mean, variance σ_Z², and correlation function R
  • Then we can view experimental output Y(x) as the realization of the random function Y(x) = Σⱼ₌₁ᵖ fⱼ(x)βⱼ + Z(x) = fᵀ(x)β + Z(x)
  • This model implies that Y₀ and Yⁿ have the multivariate normal distribution from slide 9, (Y₀, Yⁿ)ᵀ ~ N₁₊ₙ( (f₀ᵀβ, (Fβ)ᵀ)ᵀ, σ_Z² [ 1  r₀ᵀ ; r₀  R ] ), where β and σ_Z² > 0 are unknown
  • Now, drop the Gaussian assumption to consider a nonparametric moment model based on an arbitrary second-order stationary process, for unknown β and σ_Z²

24
Conclusion: is this the right tool when we can get lots of data?