Title: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION
1. CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION
Slides for Introduction to Stochastic Search and
Optimization (ISSO) by J. C. Spall
- "All models are wrong; some are useful." (George E. P. Box)
- Organization of chapter in ISSO:
  - Bias-variance tradeoff
  - Model selection: cross-validation
  - Fisher information matrix: definition, examples, and efficient computation
2. Model Definition and MSE
- Assume model z = h(θ, x) + v, where z is the output, h(·) is some function, x is the input, v is noise, and θ is the vector of model parameters
- h(·) may represent a simulation model
- h(·) may represent a metamodel (response surface) of an existing simulation
- A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂
- A common measure of effectiveness for the estimate is the mean of the squared model error (MSE) at a fixed x
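Written out, the MSE referred to here is the conditional expectation of the squared difference between the model's prediction and the true mean response at x (this is the quantity decomposed on the next slide):

```latex
\mathrm{MSE}(x) \;=\; E\bigl[\,\bigl(h(\hat{\theta}, x) - E(z \mid x)\bigr)^{2} \bigm| x \,\bigr]
```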
3. Bias-Variance Decomposition
- The MSE of the model at a fixed x can be decomposed as

    E[(h(θ̂, x) - E(z|x))² | x]
      = E[(h(θ̂, x) - E[h(θ̂, x)])² | x] + (E[h(θ̂, x)] - E(z|x))²
      = variance at x + (bias at x)²

- where the expectations are computed w.r.t. the distribution of θ̂ (i.e., over the data used to form the estimate)
- Above implies:
  - Model too simple → high bias / low variance
  - Model too complex → low bias / high variance
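A quick numerical check of the decomposition; the true function, noise level, and straight-line model below are illustrative assumptions, not values from ISSO:

```python
# Monte Carlo check that MSE(x0) = variance(x0) + (bias(x0))^2 for a fitted model.
# Assumed setup for illustration: true mean E(z|x) = x^2, straight-line model.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                       # assumed true mean function E(z|x)
x_design = np.linspace(0.0, 1.0, 20)       # fixed design points
x0, n_reps = 0.9, 5000                     # evaluation point and replications

preds = []
for _ in range(n_reps):                    # re-estimate the model many times
    z = f(x_design) + 0.2 * rng.standard_normal(x_design.size)
    coef = np.polyfit(x_design, z, 1)      # linear (too-simple) model
    preds.append(np.polyval(coef, x0))     # prediction h(theta_hat, x0)
preds = np.array(preds)

mse = np.mean((preds - f(x0)) ** 2)        # MSE at x0
var = np.var(preds)                        # variance at x0
bias_sq = (np.mean(preds) - f(x0)) ** 2    # (bias at x0)^2
print(mse, var + bias_sq)                  # the two numbers match
```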
4. Unbiased Estimator May Not Be Best (Example 13.1 from ISSO)
- An unbiased estimator is such that E[h(θ̂, x)] = E(z|x) (i.e., the mean of the prediction is the same as the mean of the data z)
- Example: let z̄ denote the sample mean of scalar i.i.d. data, used as an estimator of the true mean μ (h(θ, x) = θ in the notation above)
- An alternative, biased estimator of μ is r·z̄, where 0 < r < 1
- The MSEs of the biased and unbiased estimators generally satisfy MSE(r·z̄) < MSE(z̄) for appropriately chosen r
  - i.e., the biased estimate is better in the MSE sense
- However, the optimal value of r requires knowledge of the unknown (true) μ
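A worked version of this comparison, under the standard assumptions that the data are i.i.d. with mean μ and variance σ² (so var(z̄) = σ²/n):

```latex
\mathrm{MSE}(\bar{z}) = \operatorname{var}(\bar{z}) = \frac{\sigma^{2}}{n},
\qquad
\mathrm{MSE}(r\bar{z}) = \operatorname{var}(r\bar{z}) + \bigl(E(r\bar{z}) - \mu\bigr)^{2}
                       = r^{2}\,\frac{\sigma^{2}}{n} + (1-r)^{2}\mu^{2}
```

The biased MSE is minimized at r* = μ²/(μ² + σ²/n) < 1, which gives MSE(r*·z̄) < MSE(z̄); note that r* depends on the unknown μ, consistent with the last bullet above.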
5. Bias-Variance Tradeoff in Model Selection in a Simple Problem
6. Example 13.2 in ISSO: Bias-Variance Tradeoff
- Suppose the true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
- Compare linear, quadratic, and cubic approximations
- Table in ISSO gives average bias, variance, and MSE for the three models
- Overall pattern is decreasing bias and increasing variance as model order grows; the optimal tradeoff is the quadratic model
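A sketch of how such a table can be estimated by simulation; the noise level and design points below are placeholders, so the output only illustrates the decreasing-bias/increasing-variance pattern rather than reproducing the ISSO values:

```python
# Average bias^2, variance, and MSE (over a grid of x values) for polynomial
# models of degree 1, 2, 3 (linear, quadratic, cubic), estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: (x + x ** 2) ** 1.1          # assumed true mean function
x_grid = np.linspace(0.1, 1.0, 20)
n_reps, noise_sd = 2000, 0.1

for degree in (1, 2, 3):
    preds = np.empty((n_reps, x_grid.size))
    for r in range(n_reps):
        z = f(x_grid) + noise_sd * rng.standard_normal(x_grid.size)
        preds[r] = np.polyval(np.polyfit(x_grid, z, degree), x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(degree, bias_sq, var, bias_sq + var)   # bias falls, variance rises
```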
7. Model Selection
- The bias-variance tradeoff provides a conceptual framework for determining a good model
- The bias-variance tradeoff is not directly useful by itself
- Need a practical method for optimizing the bias-variance tradeoff
- Practical aim is to pick a model that minimizes a criterion of the form
    f1(fitting error from given data) + f2(model complexity)
  where f1 and f2 are increasing functions
- All methods are based on a tradeoff between fitting error (driving it down raises variance) and model complexity (raising it lowers bias)
- The criterion above may or may not be used explicitly in a given method
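As one concrete instance of this template, the AIC criterion listed on the next slide takes f1 to be a log-likelihood-based fitting error and f2 to be a penalty on the number of estimated parameters p:

```latex
\mathrm{AIC} \;=\; \underbrace{-2\log L(\hat{\theta})}_{f_1(\text{fitting error})} \;+\; \underbrace{2p}_{f_2(\text{complexity})}
```

where L(θ̂) is the maximized likelihood; the model with the smallest AIC is preferred.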
8. Methods for Model Selection
- Among many popular methods are:
  - Akaike Information Criterion (AIC) (Akaike, 1974)
    - Popular in time series analysis
  - Bayesian selection (Akaike, 1977)
  - Bootstrap-based selection (Efron and Tibshirani, 1997)
  - Cross-validation (Stone, 1974)
  - Minimum description length (Rissanen, 1978)
  - V-C dimension (Vapnik and Chervonenkis, 1971)
    - Popular in computer science
- Cross-validation appears to be the most popular model fitting method
9. Cross-Validation
- Cross-validation is a simple, general method for comparing candidate models
- Other specialized methods may work better in specific problems
- Cross-validation uses the training set of data
- Method is based on iteratively partitioning the full set of training data into training and test subsets
- For each partition, estimate the model from the training subset and evaluate the model on the test subset
- Number of training (or test) subsets = number of model fits required
- Select the model that performs best over all test subsets
10. Choice of Training and Test Subsets
- Let n denote the total size of the data set and nT denote the size of a test subset, nT < n
- Common strategy is leave-one-out: nT = 1
  - Implies n test subsets during the cross-validation process
- Often better to choose nT > 1
  - Sometimes more efficient (sampling w/o replacement)
  - Sometimes more accurate model selection
- If nT > 1, sampling may be with or without replacement
- With replacement indicates that there are "n choose nT" test subsets, written n!/[nT!(n - nT)!]
- With replacement may be prohibitive in practice: e.g., n = 30, nT = 6 implies nearly 600K model fits!
- Sampling without replacement reduces the number of test subsets to n/nT (disjoint test subsets); see the quick check below
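A quick check of the counts above for n = 30 and nT = 6:

```python
import math

n, nT = 30, 6
print(math.comb(n, nT))   # 593775 possible test subsets with replacement (~600K model fits)
print(n // nT)            # 5 disjoint test subsets without replacement
```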
11. Conceptual Example of Sampling-Without-Replacement Cross-Validation with 3 Disjoint Test Subsets
12. Typical Steps for Cross-Validation
- Step 0 (initialization): Determine the size of the test subsets and the candidate model. Let i be the counter for the test subset being used.
- Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
- Step 2 (error calculation): Based on the estimate of θ from Step 1 (i-th training subset), calculate the MSE (or other measure) with the data in the i-th test subset.
- Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form the mean of the MSEs when all test subsets have been evaluated.
- Step 4 (new model): Repeat Steps 1 to 3 for the next model. Choose the model with the lowest mean MSE as best (see the sketch below).
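A minimal sketch of these steps in code (not ISSO's implementation); here the candidate models are polynomial fits of a given degree, the test subsets are disjoint (sampling without replacement), and the error measure is the test-subset MSE:

```python
# Cross-validation score (mean MSE over test subsets) for one candidate model.
import numpy as np

def cv_score(x, z, degree, n_subsets=5, seed=0):
    """Steps 1-3 above: fit on each training subset, score on its test subset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # Step 0: form disjoint test subsets
    test_subsets = np.array_split(idx, n_subsets)
    mse = []
    for test_idx in test_subsets:
        train_idx = np.setdiff1d(idx, test_idx)    # remaining data = training subset
        coef = np.polyfit(x[train_idx], z[train_idx], degree)  # Step 1: estimate theta
        pred = np.polyval(coef, x[test_idx])
        mse.append(np.mean((z[test_idx] - pred) ** 2))         # Step 2: test-subset MSE
    return np.mean(mse)                            # Step 3: mean MSE over all test subsets
```

Step 4 then amounts to calling cv_score once per candidate model and keeping the model with the smallest returned value.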
13. Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)
- Consider a true system corresponding to a sine function of the input with additive normally distributed noise
- Consider three candidate models:
  - Linear (affine) model
  - 3rd-order polynomial
  - 10th-order polynomial
- Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
- Based on RMS error (equivalent to MSE) over the test subsets, the 3rd-order polynomial is preferred
- See following plot
14. Numerical Illustration (cont'd): Relative Fits for 3 Models with Low-Noise Observations
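Reusing the cv_score sketch above, the comparison in Example 13.4 might look like the following; the input range and noise level here are assumptions for illustration, not the values used in ISSO:

```python
import numpy as np

# Hypothetical data in the spirit of Example 13.4: sine of the input plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0 * np.pi, 30)          # 30 data points
z = np.sin(x) + 0.1 * rng.standard_normal(30)

for degree in (1, 3, 10):                      # linear, 3rd-order, 10th-order models
    print(degree, cv_score(x, z, degree, n_subsets=5))
# The 3rd-order polynomial typically gives the lowest mean test-subset MSE.
```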
15. Fisher Information Matrix
- A fundamental role of data analysis is to extract information from data
- Parameter estimation for models is central to this process of extracting information
- The Fisher information matrix plays a central role in parameter estimation as a measure of information
- The information matrix summarizes the amount of information in the data relative to the parameters being estimated
16. Problem Setting
- Consider the classical statistical problem of estimating a parameter vector θ from n data vectors z1, z2, ..., zn
- Suppose we have a probability density and/or mass function associated with the data
- The parameters θ appear in the probability function and affect the nature of the distribution
- Example: zi ~ N(mean(θ), covariance(θ)) for all i
- Let l(θ | z1, z2, ..., zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
17. Information Matrix: Definition
- Recall the likelihood function l(θ | z1, z2, ..., zn)
- The information matrix is defined as

    Fn(θ) = E[ (∂ log l / ∂θ) · (∂ log l / ∂θ)ᵀ ]

- where the expectation is w.r.t. z1, z2, ..., zn
- Equivalent form based on the Hessian matrix (under standard regularity conditions):

    Fn(θ) = -E[ ∂² log l / (∂θ ∂θᵀ) ]

- Fn(θ) is positive semidefinite of dimension p × p (p = dim(θ))
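As a simple scalar illustration (not from the slide): for z1, ..., zn i.i.d. N(θ, σ²) with σ² known, the log-likelihood and resulting information are

```latex
\log l(\theta \mid z_1,\dots,z_n) = \text{const} - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(z_i - \theta)^{2},
\qquad
F_n(\theta) = -E\!\left[\frac{\partial^{2}\log l}{\partial\theta^{2}}\right] = \frac{n}{\sigma^{2}}
```

so the information grows linearly in n and shrinks as the noise variance grows.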
18. Information Matrix: Two Key Properties
- The connection between Fn(θ) and the uncertainty in the estimate θ̂n is rigorously specified via two famous results (θ* = true value of θ):
- 1. Asymptotic normality:

    √n (θ̂n - θ*) → N(0, F̄(θ*)⁻¹) in distribution,

  where F̄(θ*) = lim Fn(θ*)/n as n → ∞
- 2. Cramér-Rao inequality: for any unbiased estimator θ̂n,

    cov(θ̂n) ≥ Fn(θ*)⁻¹  (in the matrix sense)

- The above two results indicate that greater variability of θ̂n corresponds to a "smaller" Fn(θ) (and vice versa)
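Continuing the scalar normal illustration from the previous slide, the Cramér-Rao bound there is

```latex
\operatorname{var}(\hat{\theta}_n) \;\ge\; F_n(\theta)^{-1} = \frac{\sigma^{2}}{n} = \operatorname{var}(\bar{z})
```

so the sample mean attains the bound, and a larger information matrix (more data or less noise) corresponds to a smaller attainable variance.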
19. Selected Applications
- The information matrix is a measure of performance for several applications. Four uses are:
- 1. Confidence regions for parameter estimation
  - Uses asymptotic normality and/or the Cramér-Rao inequality
- 2. Prediction bounds for mathematical models
- 3. Basis for "D-optimal" criterion for experimental design
  - Information matrix serves as a measure of how well θ can be estimated for a given set of inputs
- 4. Basis for "noninformative prior" in Bayesian analysis
  - Sometimes used for objective Bayesian inference
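For reference, the standard forms behind items 3 and 4 (not spelled out on the slide) are: choose the design inputs to maximize the determinant of the information matrix, and take the noninformative (Jeffreys) prior proportional to the square root of that determinant:

```latex
\text{D-optimal design: } \max_{\text{inputs}} \,\det F_n(\theta),
\qquad
\text{Jeffreys prior: } p(\theta) \propto \sqrt{\det F(\theta)}
```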