Model Selection and Validation - PowerPoint PPT Presentation

About This Presentation
Title:

Model Selection and Validation

Description:

Model Selection and Validation All models are wrong; some are useful. George E. P. Box Some s were taken from: J. C. Sapll: MODELING CONSIDERATIONS AND ... – PowerPoint PPT presentation

Number of Views:117
Avg rating:3.0/5.0
Slides: 43
Provided by: csTauAc
Category:

less

Transcript and Presenter's Notes

Title: Model Selection and Validation


1
Model Selection and Validation
  • All models are wrong some are useful.
  • ?George E. P. Box
  • Some slides were taken from
  • J. C. Sapll MODELING CONSIDERATIONS AND
    STATISTICAL INFORMATION
  • J. Hinton Preventing overfitting
  • Bei Yu Model Assessment

2
Overfitting
  • The training data contains information about the
    regularities in the mapping from input to output.
    But it also contains noise
  • The target values may be unreliable.
  • There is sampling error. There will be accidental
    regularities just because of the particular
    training cases that were chosen.
  • When we fit the model, it cannot tell which
    regularities are real and which are caused by
    sampling error.
  • So it fits both kinds of regularity.
  • If the model is very flexible it can model the
    sampling error really well. This is a disaster.

3
A simple example of overfitting
  • Which model do you believe?
  • The complicated model fits the data better.
  • But it is not economical
  • A model is convincing when it fits a lot of data
    surprisingly well.
  • It is not surprising that a complicated model can
    fit a small amount of data.

4
Generalization
  • The objective of learning is to achieve good
    generalization to new cases, otherwise just use a
    look-up table.
  • Generalization can be defined as a mathematical
    interpolation or regression over a set of
    training points

f(x)
x
5
Generalization
  • Over-Training is the equivalent of over-fitting a
    set of data points to a curve which is too
    complex
  • Occams Razor (1300s, English Logician)
  • plurality should not be assumed without
    necessity
  • The simplest model which explains the majority of
    the data is usually the best

6
Generalization
  • Preventing Over-training
  • Use a separate test or tuning set of examples
  • Monitor error on the test set as network trains
  • Stop network training just prior to over-fit
    error occurring - early stopping or tuning
  • Number of effective weights is reduced
  • Most new systems have automated early stopping
    methods

7
Generalization
  • Weight Decay an automated method of
    effective weight control
  • Adjust the bp error function to penalize the
    growth of unnecessary weights
  • where weight -cost parameter
  • is decayed by an amount proportional to
    its magnitude those not reinforced gt 0

8
Formal Model Definition
  • Assume model z h(x,??) v, where z is output,
    h() is some function, x is input, v is noise,
    and?? is vector of model parameters
  • A fundamental goal is to take n data points and
    estimate ?, forming

9
Model Error Definition
  • Given a data set xi,yi, i 1,..,n
  • Given a model output h(x,??n), where ??n is taken
    from some family of parameters, the sum squared
    errors (SSE, MSE) is
  • Si yi - h(xi,??n)2,
  • The likelihood is
  • ?iP(h(xi,??n)xi)

10
Error surface as a function of Model parameters
can look like this
11
Error surface can also look like this
Which one is better?
12
Properties of the error surfaces
  • The first surface is rough, thus a small change
    in parameter space can lead to large change in
    error
  • Due to the steepness of the surface, a minimum
    can be found, although a gradient-descent
    optimization algorithm can get stuck in local
    minima
  • The second is very smooth thus, large change in
    parameter set does not lead to much change in
    model error
  • In other words, it is expected that
    generalization performance will be similar to
    performance on a test set

13
Parameter stability
  • Finer detail while the surface is very smooth,
    it is impossible to get to the true minima.
  • Suggests that models that penalize on smoothness
    may be misleading.
  • Breiman (1992) has shown that even in simple
    problems and simple nonlinear models, the degree
    of generalization is strongly dependent on the
    stability of the parameters.

14
Bias-Variance Decomposition
  • Assume
  • Bias-Variance Decomposition
  • K-NN
  • Linear fit
  • Ridge Regression

15
Bias-Variance Decomposition
  • The MSE of the model at a fixed x can be
    decomposed as
  • Eh(x,? ) ? E(zx)2 x
  • Eh(x, ?) ? E(h(x, ?))2x E(h(x,
    ?)) ? E(zx)2
  • variance at x (bias at x)2
  • where expectations are computed w.r.t.
  • Above implies
  • Model too simple ? High bias/low variance
  • Model too complex ? Low bias/high variance

16
Bias-Variance Tradeoff in Model Selection in
Simple Problem
17
Model Selection
  • The bias-variance tradeoff provides conceptual
    framework for determining a good model
  • bias-variance tradeoff not directly useful
  • Many methods for practical determination of a
    good model
  • AIC, Bayesian selection, cross-validation,
    minimum description length, V-C
    dimension, etc.
  • All methods based on a tradeoff between fitting
    error (high variance) and model complexity (low
    bias)
  • Cross-validation is one of the most popular model
    fitting methods

18
Cross-Validation
  • Cross-validation is a simple, general method for
    comparing candidate models
  • Other specialized methods may work better in
    specific problems
  • Cross-validation uses the training set of data
  • Does not work on some pathological distributions
  • Method is based on iteratively partitioning the
    full set of training data into training and test
    subsets
  • For each partition, estimate model from training
    subset and evaluate model on test subset
  • Select model that performs best over all test
    subsets

19
Division of Data for Cross-Validation with
Disjoint Test Subsets
20
Typical Steps for Cross-Validation
  • Step 0 (initialization) Determine size of test
    subsets and candidate model. Let i be counter
    for test subset being used.
  • Step 1 (estimation) For the i th test subset, let
    the remaining data be the i th training subset.
    Estimate ? from this training subset.
  • Step 2 (error calculation) Based on estimate for
    ? from Step 1 (i th training subset), calculate
    MSE (or other measure) with data in i th test
    subset.
  • Step 3 (new training / test subset) Update i to i
    1 and return to step 1. Form mean of MSE when
    all test subsets have been evaluated.
  • Step 4 (new model) Repeat steps 1 to 3 for next
    model. Choose model with lowest mean MSE as best.

21
Numerical Illustration of Cross-Validation
(Example 13.4 in ISSO)
  • Consider true system corresponding to a sine
    function of the input with additive normally
    distributed noise
  • Consider three candidate models
  • Linear (affine) model
  • 3rd-order polynomial
  • 10th-order polynomial
  • Suppose 30 data points are available, divided
    into 5 disjoint test subsets
  • Based on RMS error (equiv. to MSE) over test
    subsets, 3rd-order polynomial is preferred
  • See following plot

22
Numerical Illustration (contd) Relative Fits
for 3 Models with Low-Noise Observations
23
Standard approach to Model Selection
  • Optimize concurrently the likelihood or mean
    squared error together with a complexity penalty.
  • Some penalties norm of the weight vector,
    smoothness, number of terminating leaves (in
    CART), variance weights, cross validation... etc.
  • Spend most computational time on optimizing the
    parameter solution via sophisticated Gradient
    descent methods or even global-minimum seeking
    methods.

24
Alternative approach
  • MDL based model selection
  • Later

25
Model Complexity
26
Preventing overfitting
  • Use a model that has the right capacity
  • enough to model the true regularities
  • not enough to also model the spurious
    regularities (assuming they are weaker).
  • Standard ways to limit the capacity of a neural
    net
  • Limit the number of hidden units.
  • Limit the size of the weights.
  • Stop the learning before it has time to over-fit.

27
Limiting the size of the weights
  • Weight-decay involves adding an extra term to the
    cost function that penalizes the squared weights.
  • Keeps weights small unless they have big error
    derivatives.

C
w
28
The effect of weight-decay
  • It prevents the network from using weights that
    it does not need.
  • This can often improve generalization a lot.
  • It helps to stop it from fitting the sampling
    error.
  • It makes a smoother model in which the output
    changes more slowly as the input changes. w
  • If the network has two very similar inputs it
    prefers to put half the weight on each rather
    than all the weight on one.

w/2
w
w/2
0
29
Model selection
  • How do we decide which limit to use and how
    strong to make the limit?
  • If we use the test data we get an unfair
    prediction of the error rate we would get on new
    test data.
  • Suppose we compared a set of models that gave
    random results, the best one on a particular
    dataset would do better than chance. But it wont
    do better than chance on another test set.
  • So use a separate validation set to do model
    selection.

30
Using a validation set
  • Divide the total dataset into three subsets
  • Training data is used for learning the parameters
    of the model.
  • Validation data is not used of learning but is
    used for deciding what type of model and what
    amount of regularization works best.
  • Test data is used to get a final, unbiased
    estimate of how well the network works. We expect
    this estimate to be worse than on the validation
    data.
  • We could then re-divide the total dataset to get
    another unbiased estimate of the true error rate.

31
Early stopping
  • If we have lots of data and a big model, its very
    expensive to keep re-training it with different
    amounts of weight decay.
  • It is much cheaper to start with very small
    weights and let them grow until the performance
    on the validation set starts getting worse (but
    dont get fooled by noise!)
  • The capacity of the model is limited because the
    weights have not had time to grow big.

32
Why early stopping works
  • When the weights are very small, every hidden
    unit is in its linear range.
  • So a net with a large layer of hidden units is
    linear.
  • It has no more capacity than a linear net in
    which the inputs are directly connected to the
    outputs!
  • As the weights grow, the hidden units start using
    their non-linear ranges so the capacity grows.

outputs
inputs
33
Model Assessment and Selection
  • Loss Function and Error Rate
  • Bias, Variance and Model Complexity
  • Optimization
  • AIC (Akaike Information Criterion)
  • BIC (Bayesian Information Criterion)
  • MDL (Minimum Description Length)

34
Key Methods to Estimate Prediction Error
  • Estimate Optimism, then add it to the training
    error rate.
  • AIC choose the model with smallest AIC
  • BIC choose the model with smallest BIC

35
Model Assessment and Selection
  • Model Selection
  • estimating the performance of different models in
    order to choose the best one.
  • Model Assessment
  • having chosen the model, estimating the
    prediction error on new data.

36
Approaches
  • data-rich
  • data split Train-Validation-Test
  • typical split 50-25-25 (how?)
  • data-insufficient
  • Analytical approaches
  • AIC, BIC, MDL, SRM
  • efficient sample re-use approaches
  • cross validation, bootstrapping

37
Model Complexity
38
Bias-Variance Tradeoff
39
Summary
  • Cross validation A practical way to estimate
    model error.
  • Model Estimation should be done with a penalty
  • When best model estimation is chosen, estimate on
    whole data or average models on cross validated
    data

40
Loss Functions
  • Continuous Response
  • Categorical Response

squared errorabsolute error
0-1 losslog-likelihood
41
Error Functions
  • Training Error
  • the average loss over the training sample.
  • Continuous Response
  • Categorical Response
  • Generalization Error
  • the expected prediction error over an independent
    test sample.
  • Continuous Response
  • Categorical Response

42
Detailed Decomposition for Linear Model Family
  • average squared bias decomposition

0 for LLSF gt0 for ridge regression trade off
with variance
Write a Comment
User Comments (0)
About PowerShow.com