1
CH3: Linear methods for regression
  • Some important points of CH2
  • Goal: find a "good" model to predict a
    variable Y from predictors X1, X2, ..., Xp
  • Quality measure: EPE, the expected prediction error
    E((Y − f(X))²)
  • Solution: take f(x) = E(Y|X = x), but what is
    E(Y|X = x)?
  • Decomposition of the expected prediction error
    EPE:
  • If Y = E(Y|X) + ε with V(ε) = σ², and f̂(X) is
    an estimator of the model f(X), then at a point x0
    EPE(x0) = σ² + Bias²(f̂(x0)) + Var(f̂(x0))
            = σ² + Mean Square Error of f̂(x0)
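
The decomposition referred to above can be written out explicitly; a compact version at a fixed input point x0, assuming Y = f(x0) + ε with E(ε) = 0 and V(ε) = σ²:

```latex
\begin{aligned}
\mathrm{EPE}(x_0) &= E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] \\
                  &= \sigma^2 + \big(E[\hat f(x_0)] - f(x_0)\big)^2
                     + E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big] \\
                  &= \sigma^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big)
                     + \mathrm{Var}\big(\hat f(x_0)\big)
                   = \sigma^2 + \mathrm{MSE}\big(\hat f(x_0)\big).
\end{aligned}
```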
2
Linear regression models
  • Linear regression models suppose that E(Y|X) is
    linear in the inputs
  • Linear models are old tools, but:
  • Still very useful
  • Simple
  • Allow an easy interpretation of the regressors'
    effects
  • Very general, since the Xi's can be any functions of other
    variables (quantitative or qualitative)
  • Useful to understand, because most other methods
    are generalisations of them.

3
Least squares estimation methods
  • Least squares is the most popular method to
    estimate linear models
  • Least squares criterion: RSS(β) = (y − Xβ)ᵀ(y − Xβ)
  • Solution of the minimisation problem:
    β̂ = (XᵀX)⁻¹Xᵀy
  • Predicted values: ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy
  • The predictions are the orthogonal projections of
    y onto the subspace of Rⁿ spanned by the columns of
    the X matrix. H is the projection ("hat") matrix.
  • The method can be generalised to multiple Ys
  • If the Yi's are not correlated, the multivariate
    solution is identical to doing k separate
    regressions
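
A minimal numpy sketch of these formulas (the data and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares solution of the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat (projection) matrix H and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residuals are orthogonal to the column space of X
print(np.allclose(X.T @ (y - y_hat), 0))
```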

4
Properties of least squares estimators
  • If the Yi are independent, X is fixed and V(Yi) = σ²
    is constant,
  • then E(β̂) = β and V(β̂) = (XᵀX)⁻¹σ²
  • If, in addition, Y = f(X) + ε with ε ~ N(0, σ²),
  • then β̂ ~ N(β, (XᵀX)⁻¹σ²)
  • and (n − p − 1)σ̂² ~ σ² χ²(n − p − 1)
  • T tests can be built to test the nullity of one
    parameter
  • F tests can be built to test the nullity of a
    vector of parameters (or of linear combinations of
    them)
  • Confidence intervals can be built for one or for
    a vector of parameters.
  • Gauss-Markov theorem (does not need the normality
    assumption):
  • The least squares estimates of the parameters β
    have the smallest variance among all linear
    UNBIASED estimators.
  • But:
  • LS estimators do not have the smallest mean
    square error if one accepts biased estimators
    (e.g. ridge regression),
  • and since all models are wrong, LSE will
    always be biased
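
Under the normality assumption, standard errors and t statistics follow directly from V(β̂); a small numpy/scipy sketch on simulated data (variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dof = n - X.shape[1]                      # n - p - 1, with the intercept counted
sigma2_hat = resid @ resid / dof          # unbiased estimate of sigma^2

se = np.sqrt(np.diag(XtX_inv) * sigma2_hat)   # standard errors of beta_hat
t_stat = beta_hat / se                        # tests of the nullity of each parameter
p_values = 2 * stats.t.sf(np.abs(t_stat), dof)
print(np.round(t_stat, 2), np.round(p_values, 3))
```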

5
Multiple regression from univariate regression
  • Multiple regression parameters can be estimated
    from a sequence of simple univariate regressions
    of the type Y = βz + ε
  • If the columns of X are orthogonal, the parameter
    estimates are simply the simple regressions of Y on
    the columns of X
  • The algorithm of successive orthogonalisation
    allows one to obtain the estimator of βi in a
    multiple regression as the estimator of the
    parameter of a simple regression of Y on the
    residual Zi of the regression of Xi on the other
    columns of X. Zi is the part of Xi which is
    orthogonal to the space of the other columns of X
    and is obtained by a sequence of successive
    orthogonalisations.
  • The Gram-Schmidt procedure is a numerical
    strategy based on the previous algorithm to
    compute LS estimates (see the sketch below).
  • These developments help to give a geometrical
    view of LS, and they are also the basis of other
    algorithms (NIPALS for PCA, PLS)
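
A small numpy sketch of regression by successive orthogonalisation: the coefficient of the last column of X in the multiple regression is recovered as the simple regression of y on the last Gram-Schmidt residual (the function name is hypothetical):

```python
import numpy as np

def regress_by_orthogonalization(X, y):
    """Successive orthogonalisation: coefficient of the LAST column of X
    in the multiple regression of y on all columns of X."""
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    Z[:, 0] = X[:, 0]
    for j in range(1, p):
        z = X[:, j].astype(float)
        for k in range(j):                 # remove the part of X_j lying in span(Z_0..Z_{j-1})
            z -= (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
        Z[:, j] = z
    zp = Z[:, -1]
    return (zp @ y) / (zp @ zp)            # simple regression of y on the last residual

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = rng.normal(size=30)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(regress_by_orthogonalization(X, y), beta_full[-1]))  # True
```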

6
Other methods to estimate linear models
  • Two reasons why we are not happy with least
    squares:
  • Prediction accuracy: LS often provides
    predictions with low bias but high variance.
  • Interpretation: when the number of regressors is
    too high, the model is difficult to interpret.
    One seeks a smaller set of regressors with
    the strongest effects.
  • Proposed approaches:
  • Subset selection: best subset, forward,
    backward, stepwise
  • Shrinkage methods: ridge regression and the lasso
    (and their generalisation to a Bayesian view)
  • Methods using derived input directions:
    principal component regression, partial least
    squares and canonical correlations

7
Prostate cancer data example
  • Variables:
  • Response: level of prostate-specific antigen
  • Regressors: 8 clinical measures for men
    receiving a prostatectomy

8
Example: results given
  • For each method, the best compromise model is
    given
  • The quality of the models is measured by tenfold
    cross-validation:
  • The data set is divided into ten parts; the model
    is estimated ten times, each time removing one part
    and computing the cross-validation prediction error
    on the removed data.
  • The test error is the mean of the CVk's, the Std
    Error their standard deviation
  • On the graph, the points are the test errors and the
    intervals are at +/- 1 standard error of the test error
  • The best model is the least complex model with a
    mean CV error within +/- 1 standard error of the best mean
    CV error (the "one standard error" rule)
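
A minimal numpy sketch of the tenfold cross-validation scheme described above, applied to plain least squares; the summary convention (mean of the CVk's plus a standard-error-style spread) is one common choice:

```python
import numpy as np

def tenfold_cv_error(X, y, K=10, seed=0):
    """K-fold cross-validation estimate of the prediction error for least squares."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    cv_errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        cv_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))  # CV_k on the removed part
    cv_errors = np.array(cv_errors)
    # summary: mean of the CV_k's and a standard-error-of-the-mean style spread
    return cv_errors.mean(), cv_errors.std(ddof=1) / np.sqrt(K)

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 4))])
y = X @ np.array([1, 2, 0, 0, -1]) + rng.normal(size=100)
print(tenfold_cv_error(X, y))
```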

9
Subset selection methods
  • Idea: retain only the subset of variables that
    gives the best fit. These are automatic methods,
    without reflection on the nature of the
    regressors.
  • Best subset regression:
  • all subsets of k = 1, 2, ..., p regressors are tested
    and, for each k, the best model is retained. Only
    feasible for moderate p. The subset size k is chosen by
    criteria making a trade-off of bias and variance.
  • Forward selection:
  • Starts with the intercept and adds at each step
    the predictor that most improves the fit. The
    improvement is often measured with the p-value of an
    F-test (a sketch of the greedy search follows this list).
  • Backward selection:
  • Starts with the full model and removes the worst
    regressor one by one. Stops when all remaining
    regressors have a significant effect. Can only
    be used when p < n.
  • Stepwise selection:
  • Combines forward and backward to decide at each
    step which variable to remove and/or to add.
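
A sketch of the greedy forward-selection search; the residual sum of squares is used as the improvement criterion here only to keep the code short (the slides mention the p-value of an F-test instead):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that most reduces the residual sum of squares."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_vars):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            Z = np.column_stack([np.ones(n), X[:, cols]])     # intercept always included
            beta = np.linalg.lstsq(Z, y, rcond=None)[0]
            rss = np.sum((y - Z @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 6))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=60)
print(forward_selection(X, y, max_vars=2))   # typically picks columns 2 and 5
```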

10
Shrinkage Methods: Ridge Regression
  • Shrink the coefficients by imposing a penalty on
    their size:
    β̂_ridge = argmin_β Σ_i (y_i − β_0 − Σ_j x_ij β_j)²
    subject to Σ_j β_j² ≤ t
  • This is equivalent to
    β̂_ridge = argmin_β { Σ_i (y_i − β_0 − Σ_j x_ij β_j)² + λ Σ_j β_j² },
    with solution β̂_ridge = (XᵀX + λI)⁻¹Xᵀy
  • If the inputs are orthogonal, β̂_ridge = β̂_LS / (1 + λ)
  • λ is a function of t (there is a one-to-one
    correspondence between the two parameters)
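
A minimal numpy sketch of the penalized (Lagrangian) form of ridge regression, with the inputs standardized and the intercept left unpenalized, as is conventional; lam plays the role of λ:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution beta_hat = (X'X + lam I)^{-1} X'y on centred/standardized data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # do not penalize the intercept
    yc = y - y.mean()
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return beta, y.mean()                          # coefficients and intercept

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([2, 0, 0, -1, 0.5]) + rng.normal(size=100)
for lam in (0.0, 10.0, 100.0):                     # more penalty -> more shrinkage
    beta, _ = ridge_fit(X, y, lam)
    print(lam, np.round(beta, 2))
```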
11
Singular value decomposition (SVD)
  • Writing X = UDVᵀ in terms of its SVD:
    Xβ̂_LS = UUᵀy
    Xβ̂_ridge = Σ_j u_j [d_j² / (d_j² + λ)] u_jᵀ y :
    shrunken coordinates of y with respect to the U basis
  • A greater amount of shrinkage is applied to the basis
    vectors u_j having smaller d_j²
  • λ is the complexity parameter of the model
  • Effective degrees of freedom:
    df(λ) = tr[X(XᵀX + λI)⁻¹Xᵀ] = Σ_j d_j² / (d_j² + λ)
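
The shrinkage factors d_j²/(d_j² + λ) and the effective degrees of freedom can be read directly off the singular values; a small numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

d = np.linalg.svd(Xs, compute_uv=False)           # singular values d_j of X
for lam in (0.0, 10.0, 100.0):
    shrink = d**2 / (d**2 + lam)                  # factor applied to each U-coordinate of y
    df = shrink.sum()                             # effective degrees of freedom df(lambda)
    print(lam, np.round(shrink, 2), round(df, 2))
# lam = 0 gives df = p (least squares); df decreases monotonically as lam grows
```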
12
Relationship with Principal Components
  • Writing X = UDVᵀ in terms of its SVD, the
    eigen decomposition of XᵀX is XᵀX = VD²Vᵀ
  • z_j = Xv_j = d_j u_j is the j-th principal component of X,
    and u_j = z_j / d_j is the normalized principal component
  • Principal components: normalized linear
    combinations of the columns of X
  • Small values d_j² correspond to directions of
    the column space of X having small variance
  • Ridge regression assumes that the response will tend
    to vary the most in the directions of higher variance
    of the inputs, and it therefore protects against high
    variability of Y in directions of small
    variability of X
13
Lasso regression
  • β̂_lasso = argmin_β Σ_i (y_i − β_0 − Σ_j x_ij β_j)²
    subject to Σ_j |β_j| ≤ t, where t ≥ 0
  • If t is small enough, some coefficients are exactly 0,
    so the lasso also acts like a subset selection method
  • If t ≥ Σ_j |β̂_j^LS|, then β̂_lasso = β̂_LS
  • t is chosen to minimize the estimate of the
    expected prediction error
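
A sketch of the lasso solved by coordinate descent with soft thresholding, written in the equivalent penalized form with parameter lam rather than the bound t (function names are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1 (centred data, no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6)); X -= X.mean(axis=0)
y = X @ np.array([3.0, 0, 0, -2.0, 0, 0]) + rng.normal(size=100); y -= y.mean()
for lam in (0.1, 20.0, 200.0):
    print(lam, np.round(lasso_cd(X, y, lam), 2))          # larger lam -> more exact zeros
```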
14
Derived directions: Principal Components Regression
  • Use the linear combinations (principal components)
    z_m = Xv_m as regressors
  • Since the z_m are orthogonal:
    ŷ_pcr(M) = ȳ1 + Σ_m θ̂_m z_m, with θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩
  • Ridge and PC regression both operate via the principal
    components of the X matrix
  • Ridge regression shrinks the regression
    coefficients of the principal components
    depending on the size of the corresponding
    eigenvalue.
  • Principal component regression discards the
    components with the smallest eigenvalues
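
A minimal numpy sketch of principal components regression built from the SVD of the standardized inputs (the function name pcr_fit is hypothetical):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression keeping the M leading components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = Xs @ Vt.T[:, :M]                          # principal components z_m = X v_m
    theta = (Z.T @ yc) / (d[:M] ** 2)             # <z_m, y>/<z_m, z_m>, since <z_m, z_m> = d_m^2
    beta = Vt.T[:, :M] @ theta                    # coefficients on the standardized inputs
    return beta

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
print(np.round(pcr_fit(X, y, M=2), 2))            # M = p would reproduce least squares
```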

15
Derived directions: Partial Least Squares
  • Use a set of linear combinations of the inputs that
    also take y into account
  • Standardize each x_j; set ŷ(0) = ȳ1 and x_j(0) = x_j
  • For m = 1, 2, ..., p:
  •   z_m = Σ_j φ̂_mj x_j(m−1), with φ̂_mj = ⟨x_j(m−1), y⟩
  •   θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩
  •   ŷ(m) = ŷ(m−1) + θ̂_m z_m
  •   Orthogonalize each x_j(m−1) with respect to z_m:
      x_j(m) = x_j(m−1) − [⟨z_m, x_j(m−1)⟩ / ⟨z_m, z_m⟩] z_m
  • Output the sequence of fitted vectors ŷ(m). Since the z_m
    are linear in the original x_j, the coefficients on the
    original inputs can be recovered (a sketch follows below).
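
A direct transcription of the algorithm above into numpy (the function name pls_fit is hypothetical; with M = p it reproduces the least squares fit, as noted on the next slide):

```python
import numpy as np

def pls_fit(X, y, M):
    """Partial least squares: fitted values after M derived directions."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    y_hat = np.full_like(y, y.mean(), dtype=float)   # y_hat^(0) = ybar * 1
    Xm = Xs.copy()
    for m in range(M):
        phi = Xm.T @ (y - y.mean())                  # phi_mj = <x_j^(m-1), y>
        z = Xm @ phi                                 # direction z_m
        theta = (z @ (y - y.mean())) / (z @ z)
        y_hat = y_hat + theta * z                    # y_hat^(m)
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))    # orthogonalize each x_j w.r.t. z_m
    return y_hat

rng = np.random.default_rng(9)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=80)
X1 = np.column_stack([np.ones(80), X])
ls_fit = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
print(np.allclose(pls_fit(X, y, M=4), ls_fit, atol=1e-6))   # M = p recovers least squares
```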

16
Derived directions: Partial Least Squares
  • If M = p, PLS gives the least squares estimates
  • PLS is a non-linear function of y
  • PLS looks for directions with high variance and high
    correlation with y:
  • PCR maximizes Var(Xα) over unit-norm α orthogonal
    to the previous directions
  • PLS maximizes Corr²(y, Xα) · Var(Xα) under the same constraints
  • The variance term tends to dominate, so PLS behaves
    similarly to ridge and PCR
  • If X is orthogonal, PLS finds the least squares estimates
    after the first step; subsequent steps have no effect
17
Multiple Outcome
  • Least-squares individual least squares estimates
    to each output
  • Selection and shrinkage methodsunivariate
    technique individually or simultaneously to all
    outcomes
  • Ridge apply ridge regression to each column of
    , having different , or apply it to
    all columns with the same
  • If correlation between the outputs
  • Canonical correlation analysis (similar to PCA)
  • Sequence of uncorrelated combinations on X ,
    and Y, such that
  • are maximized
  • Leading canonical responses are those linear
    combinations best predicted by the xs
  • CCA solution is computed via SVD of the
    cross-covariance matrix
  • Reduced-rank regression in terms of regression
    model. estimated by

Canonical vectors
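
A small numpy sketch of CCA computed via an SVD of the whitened cross-covariance matrix; the whitening by inverse square roots of the within-block covariances is one standard construction, and the data are simulated:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca(X, Y):
    """Canonical correlation analysis via SVD of the whitened cross-covariance."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A, B = Wx @ U, Wy @ Vt.T                       # canonical vectors for X and Y
    return A, B, rho                               # rho: canonical correlations

rng = np.random.default_rng(10)
Z = rng.normal(size=(200, 2))                      # shared latent signal
X = np.column_stack([Z + 0.1 * rng.normal(size=(200, 2)), rng.normal(size=(200, 2))])
Y = Z @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))
A, B, rho = cca(X, Y)
print(np.round(rho, 2))                            # leading correlations close to 1
```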
18
Comparisons
  • PLS, PCR and ridge regression behave similarly.
  • Ridge shrinks all directions, but shrinks the low
    variance directions more.
  • PCR keeps the higher variance components and discards
    the rest.
  • PLS shrinks low-variance directions but tends to
    inflate some of the higher variance ones, causing
    slightly larger prediction error compared to
    ridge.
  • Ridge is preferred for minimizing the prediction
    error, and it shrinks smoothly.
  • If X is orthogonal, these methods reduce to simple
    transformations of the least squares estimates; as the
    amount of shrinkage (or the number of discarded components)
    goes to zero, they all tend back to the least squares solution.