1
CHAPTER 13 MODELING CONSIDERATIONS AND
STATISTICAL INFORMATION
Slides for Introduction to Stochastic Search and
Optimization (ISSO) by J. C. Spall
  • "All models are wrong; some are useful." (George E. P. Box)
  • Organization of chapter in ISSO:
  • Bias-variance tradeoff
  • Model selection: cross-validation
  • Fisher information matrix: definition, examples, and efficient computation

2
Model Definition and MSE
  • Assume the model z = h(θ, x) + v, where z is the output, h(·) is some function, x is the input, v is noise, and θ is the vector of model parameters
  • h(·) may represent a simulation model
  • h(·) may represent a metamodel (response surface) of an existing simulation
  • A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂
  • A common measure of effectiveness for the estimate is the mean of the squared model error (MSE) at a fixed x

3
Bias-Variance Decomposition
  • The MSE of the model at a fixed x can be decomposed as
  • E{[h(θ̂, x) − E(z|x)]² | x}
      = E{[h(θ̂, x) − E(h(θ̂, x))]² | x} + [E(h(θ̂, x)) − E(z|x)]²
      = variance at x + (bias at x)²
  • where the expectations are computed w.r.t. θ̂
  • Above implies:
  • Model too simple ⇒ high bias / low variance
  • Model too complex ⇒ low bias / high variance (illustrated numerically below)
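To make the decomposition concrete, here is a minimal Monte Carlo sketch. The sine-plus-noise process, cubic polynomial fit, sample size, and noise level are illustrative assumptions, not values from ISSO: the model is refit on many fresh data sets, and the MSE at a fixed x splits into variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative process z = f(x) + v and a polynomial model h(theta, x);
# refit on many fresh data sets and decompose the MSE at a fixed x0.
f = lambda x: np.sin(2 * np.pi * x)
x0, n, sigma, degree, n_reps = 0.5, 30, 0.3, 3, 2000

preds = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0.0, 1.0, n)
    z = f(x) + sigma * rng.standard_normal(n)
    theta = np.polyfit(x, z, degree)      # theta-hat from this data set
    preds[r] = np.polyval(theta, x0)      # h(theta-hat, x0)

variance = preds.var()                    # E{[h - E(h)]^2 | x0}
bias_sq = (preds.mean() - f(x0)) ** 2     # [E(h) - E(z|x0)]^2
mse = ((preds - f(x0)) ** 2).mean()       # E{[h - E(z|x0)]^2 | x0}
print(f"variance {variance:.4f} + bias^2 {bias_sq:.4f} = MSE {mse:.4f}")
```

Rerunning with degree 1 versus degree 10 shows the pattern named above: the low-order fit has large bias and small variance, the high-order fit the reverse.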

4
Unbiased Estimator May Not Be Best (Example 13.1 from ISSO)
  • An unbiased estimator is one for which E[h(θ̂, x)] = E(z|x) (i.e., the mean of the prediction is the same as the mean of the data z)
  • Example: Let z̄ denote the sample mean of scalar i.i.d. data, used as an estimator of the true mean μ (h(θ, x) = θ in the notation above)
  • An alternative biased estimator of μ is r·z̄, where 0 < r < 1
  • The MSEs of the biased and unbiased estimators generally satisfy E[(r·z̄ − μ)²] < E[(z̄ − μ)²] for a well-chosen r
  • Biased estimate is better in the MSE sense
  • However, the optimal value of r requires knowledge of the unknown (true) μ (see the simulation below)
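The effect is easy to see in a small simulation. All numbers below are illustrative; note that the optimal shrinkage factor r is computed from the true μ and σ², which are exactly the quantities unknown in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_reps = 1.0, 2.0, 10, 100_000

# Sample mean zbar from many repeated i.i.d. N(mu, sigma^2) samples
zbar = rng.normal(mu, sigma, (n_reps, n)).mean(axis=1)

# Minimizing E[(r*zbar - mu)^2] over r gives the factor below; it
# depends on the unknown true mu, which is the catch noted above.
r_opt = mu**2 / (mu**2 + sigma**2 / n)

print("MSE(zbar)   =", ((zbar - mu) ** 2).mean())          # unbiased
print("MSE(r*zbar) =", ((r_opt * zbar - mu) ** 2).mean())  # biased, smaller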

5
Bias-Variance Tradeoff in Model Selection in Simple Problem (figure)
6
Example 13.2 in ISSO: Bias-Variance Tradeoff
  • Suppose the true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
  • Compare linear, quadratic, and cubic
    approximations
  • Table in ISSO gives the average bias, variance, and MSE for each model
  • Overall pattern: bias decreases and variance increases with model order; the optimal tradeoff is the quadratic model

7
Model Selection
  • The bias-variance tradeoff provides a conceptual framework for determining a good model
  • But the bias-variance tradeoff is not directly usable in practice
  • Need a practical method for optimizing the bias-variance tradeoff
  • Practical aim is to pick a model that minimizes a criterion of the form
  • f1(fitting error from given data) + f2(model complexity)
  • where f1 and f2 are increasing functions
  • All methods are based on a tradeoff between fitting error (high variance) and model complexity (low bias)
  • The criterion above may or may not be explicitly used in a given method; one concrete instance is sketched below
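For instance, AIC (the first entry in the list on the next slide) is one such criterion: assuming i.i.d. Gaussian errors, f1 reduces to n·log(RSS/n) (the log-likelihood fitting term up to an additive constant) and f2 to 2k for a k-parameter model. A minimal sketch with illustrative data:

```python
import numpy as np

def aic_gaussian(z, z_hat, k):
    # f1: n*log(RSS/n), increasing in fitting error (Gaussian
    # log-likelihood up to constants); f2: 2k, increasing in complexity.
    n = len(z)
    rss = ((z - z_hat) ** 2).sum()
    return n * np.log(rss / n) + 2 * k

# Toy comparison of polynomial orders on noisy sine data
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 40)
z = np.sin(np.pi * x) + 0.2 * rng.standard_normal(40)
for d in (1, 3, 10):
    coef = np.polyfit(x, z, d)
    print(d, aic_gaussian(z, np.polyval(coef, x), d + 1))
```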

8
Methods for Model Selection
  • Among many popular methods are
  • Akaike Information Criterion (AIC) (Akaike, 1974)
  • Popular in time series analysis
  • Bayesian selection (Akaike, 1977)
  • Bootstrap-based selection (Efron and Tibshirani, 1997)
  • Cross-validation (Stone, 1974)
  • Minimum description length (Rissanen, 1978)
  • V-C dimension (Vapnik and Chervonenkis, 1971)
  • Popular in computer science
  • Cross-validation appears to be the most popular model selection method

9
Cross-Validation
  • Cross-validation is a simple, general method for comparing candidate models
  • Other specialized methods may work better in
    specific problems
  • Cross-validation uses the training set of data
  • Method is based on iteratively partitioning the
    full set of training data into training and test
    subsets
  • For each partition, estimate model from training
    subset and evaluate model on test subset
  • Number of training (or test) subsets = number of model fits required
  • Select model that performs best over all test
    subsets

10
Choice of Training and Test Subsets
  • Let n denote the total size of the data set and nT the size of each test subset, nT < n
  • A common strategy is leave-one-out: nT = 1
  • Implies n test subsets during the cross-validation process
  • Often better to choose nT > 1
  • Sometimes more efficient (sampling w/o replacement)
  • Sometimes more accurate model selection
  • If nT > 1, sampling may be with or without replacement
  • Sampling with replacement indicates that there are n-choose-nT possible test subsets, written C(n, nT)
  • With replacement may be prohibitive in practice: e.g., n = 30, nT = 6 implies C(30, 6) = 593,775, nearly 600K model fits!
  • Sampling without replacement reduces the number of test subsets to n/nT (disjoint test subsets); see the check below
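A two-line check of these counts for the n = 30, nT = 6 case:

```python
from math import comb

n, n_T = 30, 6
print(comb(n, n_T))  # 593775 possible test subsets with replacement
print(n // n_T)      # 5 disjoint test subsets without replacement
```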

11
Conceptual Example of Sampling-Without-Replacement Cross-Validation with 3 Disjoint Test Subsets (figure)
12
Typical Steps for Cross-Validation
  • Step 0 (initialization) Determine the size of the test subsets and the candidate model. Let i be the counter for the test subset being used.
  • Step 1 (estimation) For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
  • Step 2 (error calculation) Based on the estimate for θ from Step 1 (i-th training subset), calculate the MSE (or other measure) with the data in the i-th test subset.
  • Step 3 (new training and test subsets) Update i to i + 1 and return to Step 1. Form the mean of the MSE values when all test subsets have been evaluated.
  • Step 4 (new model) Repeat Steps 1 to 3 for the next model. Choose the model with the lowest mean MSE as best. (A code sketch of these steps follows.)
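A minimal sketch of Steps 1-3 for disjoint test subsets (sampling without replacement). The fit/predict callables and all names here are illustrative, not an API from ISSO:

```python
import numpy as np

def cross_validate(x, z, fit, predict, n_test):
    """Mean test-subset MSE for one candidate model (Steps 1-3)."""
    idx = np.arange(len(x))
    mses = []
    for start in range(0, len(x), n_test):
        test = idx[start:start + n_test]         # i-th test subset
        train = np.delete(idx, test)             # i-th training subset
        theta = fit(x[train], z[train])          # Step 1: estimate theta
        err = z[test] - predict(theta, x[test])  # Step 2: test-subset error
        mses.append(np.mean(err ** 2))
    return float(np.mean(mses))                  # Step 3: mean MSE
```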

13
Numerical Illustration of Cross-Validation
(Example 13.4 in ISSO)
  • Consider a true system whose output is a sine function of the input plus additive normally distributed noise
  • Consider three candidate models:
  • Linear (affine) model
  • 3rd-order polynomial
  • 10th-order polynomial
  • Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
  • Based on RMS error (equivalent to MSE for ranking) over the test subsets, the 3rd-order polynomial is preferred
  • See the plot on the following slide (a code sketch of this comparison appears below)
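A rough reconstruction of this setup, reusing the cross_validate sketch from the previous slide. The noise level, input range, and seed are assumptions; Example 13.4's exact values may differ:

```python
import numpy as np

# Step 4: 30 sine-plus-noise points, 5 disjoint test subsets (n_test = 6),
# compared across three candidate polynomial models.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 2 * np.pi, 30)
z = np.sin(x) + 0.1 * rng.standard_normal(30)

for d in (1, 3, 10):
    score = cross_validate(
        x, z,
        fit=lambda xt, zt, d=d: np.polyfit(xt, zt, d),
        predict=np.polyval,
        n_test=6,
    )
    print(f"order {d}: mean test MSE = {score:.5f}")
```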

14
Numerical Illustration (cont'd): Relative Fits for 3 Models with Low-Noise Observations (figure)
15
Fisher Information Matrix
  • Fundamental role of data analysis is to extract
    information from data
  • Parameter estimation for models is central to
    process of extracting information
  • The Fisher information matrix plays a central role in parameter estimation as a measure of information
  • The information matrix summarizes the amount of information in the data relative to the parameters being estimated

16
Problem Setting
  • Consider the classical statistical problem of estimating parameter vector θ from n data vectors z1, z2, …, zn
  • Suppose we have a probability density and/or mass function associated with the data
  • The parameters θ appear in the probability function and affect the nature of the distribution
  • Example: zi ~ N(mean(θ), covariance(θ)) for all i
  • Let l(θ | z1, z2, …, zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data

17
Information Matrix: Definition
  • Recall the likelihood function l(θ | z1, z2, …, zn)
  • The information matrix is defined as
  • Fn(θ) = E[(∂ log l/∂θ)(∂ log l/∂θ)ᵀ]
  • where the expectation is w.r.t. z1, z2, …, zn
  • Equivalent form based on the Hessian matrix:
  • Fn(θ) = −E[∂² log l/(∂θ ∂θᵀ)]
  • Fn(θ) is positive semidefinite of dimension p × p (p = dim(θ)); a numerical check of the definition follows
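As a sketch, the outer-product form can be checked by Monte Carlo against the known analytic information matrix for n i.i.d. N(μ, σ²) observations with θ = (μ, σ²); this is a standard textbook case, chosen only for convenience:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sig2, n, n_reps = 0.0, 2.0, 50, 100_000

z = rng.normal(mu, np.sqrt(sig2), (n_reps, n))

# Score vector d(log l)/d(theta) of the Gaussian log-likelihood
d_mu = (z - mu).sum(axis=1) / sig2
d_s2 = ((z - mu) ** 2).sum(axis=1) / (2 * sig2**2) - n / (2 * sig2)
score = np.stack([d_mu, d_s2], axis=1)

F_mc = score.T @ score / n_reps                       # E[score score^T]
F_exact = n * np.diag([1 / sig2, 1 / (2 * sig2**2)])  # analytic F_n(theta)
print(np.round(F_mc, 2))
print(F_exact)
```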

18
Information Matrix: Two Key Properties
  • Connection of Fn(θ) to the uncertainty in the estimate θ̂n is rigorously specified via two famous results (θ* = true value of θ):
  • 1. Asymptotic normality:
  • √n (θ̂n − θ*) → N(0, F̄⁻¹) in distribution
  • where F̄ ≡ lim n→∞ Fn(θ*)/n
  • 2. Cramér-Rao inequality (for unbiased θ̂n):
  • cov(θ̂n) ≥ Fn(θ*)⁻¹ in the matrix sense
  • Above two results indicate that greater variability of θ̂n corresponds to a "smaller" Fn(θ) (and vice versa); a quick check of the bound follows
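A quick numerical check of the Cramér-Rao bound in the scalar Gaussian-mean case, where the sample mean is unbiased and attains the bound Fn(θ*)⁻¹ = σ²/n exactly (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sig2, n, n_reps = 1.0, 2.0, 25, 100_000

# Empirical variance of the unbiased estimator zbar vs. the CR bound
zbar = rng.normal(mu, np.sqrt(sig2), (n_reps, n)).mean(axis=1)
print("var(zbar)     =", zbar.var())
print("CR bound s2/n =", sig2 / n)
```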

19
Selected Applications
  • The information matrix is a measure of performance for several applications. Four uses are:
  • 1. Confidence regions for parameter estimation
  • Uses asymptotic normality and/or Cramér-Rao
    inequality
  • 2. Prediction bounds for mathematical models
  • 3. Basis for D-optimal criterion for
    experimental design
  • Information matrix serves as a measure of how well θ can be estimated for a given set of inputs
  • 4. Basis for noninformative prior in Bayesian
    analysis
  • Sometimes used for objective Bayesian inference