Multiple Regression Analysis - PowerPoint PPT Presentation

Slides: 51
Provided by: Owne1137
Transcript and Presenter's Notes

Title: Multiple Regression Analysis


1
Multiple Regression Analysis
  • Why use MLR?
  • Hypothesis testing for MLR
  • Diagnostics
  • Multicollinearity
  • Choosing the best model
  • ANCOVA

2
Multiple Linear Regression
  • Extension of simple linear regression to the case
    where there are several explanatory variables.
  • The goal is to explain as much as possible of the
    variation observed in the response (y) variable,
    leaving as little variation as possible to
    unexplained noise.
  • As in simple linear regression, multiple linear
    regression equations can be obtained by an OLS fit
    or by other methods such as LMS or GLS,
    depending on the situation.
  • As multiple OLS linear regression is the most
    common curve fitting technique, we will
    concentrate only on procedures for developing a good
    multiple OLS linear regression model, and on how
    to deal with common problems such as
    multicollinearity.

3
  • As MLR is a complex subject, hand calculation is
    inadvisable when p is greater than 2 or 3 because
    of the amount of work involved.
  • The complexity involves:
  • determining how many and which variables to use,
    including the form of each variable (such as
    linear or nonlinear),
  • interpreting the results, especially the
    regression coefficients, and
  • determining whether an alternative to OLS should
    be used.

4
Why Use MLR?
  • Scientific knowledge and experience usually tell
    us so.
  • Residuals from SLR may indicate that additional
    explanatory variables are required. E.g.
    residuals show there is a temporal trend
    (suggesting time as an additional explanatory
    variable).

5
MLR Model
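  • The MLR model will be denoted
    $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
  • This can be written in matrix notation as
    $$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
  • For n observations and 2 explanatory variables $X_1$
    and $X_2$,

6
Specifically,
    $$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
      \mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} \\ 1 & X_{21} & X_{22} \\ \vdots & \vdots & \vdots \\ 1 & X_{n1} & X_{n2} \end{bmatrix}, \quad
      \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad
      \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

7
  • Where $X_{ij}$ denotes the ith observation on the jth
    explanatory variable. Thus,
    $$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \quad i = 1, \ldots, n$$
  • As with SLR, OLS stipulates that the sum of the
    squared residuals must be minimized. For the
    above model,
    $$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})^2$$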
8
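  • Differentiating the RHS of the equation
    with respect to $\beta_0$, $\beta_1$, and $\beta_2$ (separately),
    and setting each derivative to zero,
    produces 3 equations in the 3 unknown parameters.
  • The 3 equations are called the normal equations and
    they can be written in matrix notation as
    $$\mathbf{X}'\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{Y}$$
  • the solution for which is
    $$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
  • The $\mathbf{X}'\mathbf{X}$ matrix is a $(k+1) \times (k+1)$ symmetric
    matrix whose diagonal elements are the sums of
    squares of the elements in the columns of the X
    matrix, and whose off-diagonal elements are sums
    of cross products of the elements in pairs of
    different columns.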
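
A minimal sketch of solving the normal equations numerically, using only the Python standard library. The data are made up (generated without noise from Y = 1 + 2 X1 + 3 X2), the function names `solve` and `ols_normal_equations` are illustrative, and Gaussian elimination is used here in place of forming the inverse of X'X explicitly.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols_normal_equations(xs, y):
    """Return OLS coefficients [b0, b1, ..., bk] from rows of explanatory values."""
    X = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    p = len(X[0])
    # X'X: sums of squares on the diagonal, sums of cross products elsewhere.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 with no noise.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
beta = ols_normal_equations(xs, y)
print([round(b, 6) for b in beta])
```

Because the data contain no noise, the recovered coefficients should match the generating values up to rounding error.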

9
  • The nature of $\mathbf{X}'\mathbf{X}$ plays an important role in the
    properties of the estimators in $\hat{\boldsymbol{\beta}}$ and will often
    be a large factor in the success (or failure) of
    OLS as an estimation procedure.

10
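Hypothesis Tests for MLR: Nested F Tests
  • The F test is the single most important hypothesis
    test for comparing any two nested models: a
    complex model vs. a simpler model which is a
    subset of the complex model. The test statistic
    is
    $$F = \frac{(SSE_s - SSE_c)/(df_s - df_c)}{SSE_c/df_c}$$
    where SSE is the error sum of squares and df the
    residual degrees of freedom of the simpler (s)
    and more complex (c) models.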

11
  • Complex model has $(m+1)$ parameters and $df_c = n - (m+1)$
  • Simple model has $(k+1)$ parameters and $df_s = n - (k+1)$
  • If $F > F_{tab}$ with $(df_s - df_c)$ and $df_c$ degrees of
    freedom for the selected $\alpha$, then $H_0$ is rejected.
    Rejection indicates that the more complex model
    should be chosen in preference to the simpler
    model; otherwise the simpler model is retained.
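
The nested F statistic is simple arithmetic once both models have been fitted. The sketch below assumes the usual form, with the numerator degrees of freedom taken as df_s - df_c (the simpler model has more residual degrees of freedom); the function name `nested_f` and all SSE values are hypothetical.

```python
def nested_f(sse_s, df_s, sse_c, df_c):
    """Nested F statistic comparing a simpler model (s) with a more complex one (c)."""
    if df_s <= df_c:
        raise ValueError("the simpler model must have more residual degrees of freedom")
    return ((sse_s - sse_c) / (df_s - df_c)) / (sse_c / df_c)


# Hypothetical values: n = 20, simpler model with k = 1, complex model with m = 3.
n, k, m = 20, 1, 3
df_s = n - (k + 1)  # residual df of the simpler model
df_c = n - (m + 1)  # residual df of the more complex model
f_stat = nested_f(sse_s=120.0, df_s=df_s, sse_c=80.0, df_c=df_c)
print(f_stat)
```

The result is then compared with the tabulated F value for (df_s - df_c, df_c) degrees of freedom, here (2, 16), at the chosen significance level.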

12
Overall F test
  • This is a special case of the nested F-test. It
    is of limited use. It tests only whether the
    complex regression equation is better than no
    regression at all. Of much greater interest is
    which of several regression models is best.

13
Partial F test
  • Second special case of the nested F tests. The
    partial F test evaluates whether the mth variable
    adds any new explanatory power to the equation,
    and ought to be in the regression model, given
    that all the other variables are already present.
  • The F value (Minitab reports the equivalent t
    statistic) on a coefficient will change depending
    on what other variables are in the model.
  • Cannot answer: "Does variable m belong in the
    model?"
  • Can only answer: "Does variable m belong in the
    model in the presence of the other variables?"

14
  • If $|t| > 2$ for every coefficient, then it is
    clear that every explanatory variable is
    accounting for a significant amount of variation,
    and all should be present.
  • When one or more of the coefficients has $|t| < 2$,
    some of the variables should be removed from
    the equation, but the t values are not a certain
    guide as to which ones to remove.
  • Partial t or F tests are used to make automatic
    decisions for removal or inclusion in stepwise
    multiple regression.
  • These automatic procedures do not guarantee that
    some best model is obtained. Better procedures
    are available for doing so.

15
Confidence Intervals
  • CIs can be computed for all $\beta$s and for the mean
    response Y at given values of all explanatory
    variables. PIs can be similarly computed around
    an individual estimate of Y. Matrix notation is
    needed for these.

16
Variance-Covariance Matrix
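  • In MLR, the variance-covariance matrix is
    computed from $(\mathbf{X}'\mathbf{X})^{-1}$.
  • Elements of the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix for 3
    explanatory variables are
    $$(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix} C_{00} & C_{01} & C_{02} & C_{03} \\ C_{10} & C_{11} & C_{12} & C_{13} \\ C_{20} & C_{21} & C_{22} & C_{23} \\ C_{30} & C_{31} & C_{32} & C_{33} \end{bmatrix}$$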
17
  • When multiplied by the error variance (estimated
    by the variance of the residuals, $s^2$), the
    diagonal elements of the matrix, $C_{00}$ through $C_{33}$,
    become the variances of the regression
    coefficients; the off-diagonals are covariances
    between coefficients.

18
Confidence Intervals For Slope Coefficients
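  • If the residuals are normally distributed with
    variance $\sigma^2$, a $100(1-\alpha)\%$ CI on $\beta_j$ is
    $$\hat{\beta}_j \pm t_{\alpha/2,\,n-p} \sqrt{s^2 C_{jj}}$$
  • where $C_{jj}$ is the diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$
    corresponding to the jth explanatory variable.
    Often printed is the SE of the regression
    coefficient,
    $$SE(\hat{\beta}_j) = \sqrt{s^2 C_{jj}}$$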

19
Note
  • Cjj is a function of the other explanatory
    variables as well as the jth. Therefore CIs
    will change as explanatory variables are added to
    or deleted from the model.

20
Confidence Intervals for the Mean Response
  • A $100(1-\alpha)\%$ CI for the mean response $\mu(Y_0)$ for a
    given point in multidimensional space $X_0$ is
    symmetric around the regression estimate $\hat{Y}_0$.
    These intervals also require the assumption of
    normality of residuals:
    $$\hat{Y}_0 \pm t_{\alpha/2,\,n-p} \sqrt{s^2\, X_0'(\mathbf{X}'\mathbf{X})^{-1}X_0}$$
  • The variance of the mean is the term under the
    square root sign. It changes with $X_0$, increasing
    as $X_0$ moves away from the multidimensional center
    of the data. In fact the term $X_0'(\mathbf{X}'\mathbf{X})^{-1}X_0$ is
    the leverage statistic $h_i$, expressing the
    distance that $X_0$ is from the center of the data.

21
Prediction Intervals for an individual Y
  • A $100(1-\alpha)\%$ PI for a single response $Y_0$, given a
    point in multidimensional space $X_0$, is symmetric
    around the regression estimate $\hat{Y}_0$. It also
    requires the assumption of normality of the
    residuals:
    $$\hat{Y}_0 \pm t_{\alpha/2,\,n-p} \sqrt{s^2 \left[ 1 + X_0'(\mathbf{X}'\mathbf{X})^{-1}X_0 \right]}$$
  • Notice the addition of a 1 in the square
    brackets. This reflects the additional variance
    for an individual point.
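
The effect of the extra 1 can be seen by computing both half-widths side by side. This is a sketch under stated assumptions: the residual variance, the leverage of the prediction point, and the t quantile are hypothetical inputs supplied by the caller (the t value would be looked up in a table for alpha/2 and n - p degrees of freedom), and `interval_half_widths` is an illustrative name.

```python
import math


def interval_half_widths(s2, h0, t_crit):
    """Half-widths of the CI for the mean response and the PI for one new Y.

    s2     : residual variance (MSE) of the fitted model
    h0     : leverage X0' (X'X)^-1 X0 of the prediction point
    t_crit : t quantile for alpha/2 and n - p degrees of freedom
    """
    ci = t_crit * math.sqrt(s2 * h0)
    pi = t_crit * math.sqrt(s2 * (1.0 + h0))  # the added 1 is the individual-point variance
    return ci, pi


# Hypothetical inputs: s2 = 4.0, leverage 0.25, t = 2.12 (alpha = 0.05, 16 df).
ci, pi = interval_half_widths(s2=4.0, h0=0.25, t_crit=2.12)
print(round(ci, 3), round(pi, 3))
```

The prediction interval is always the wider of the two, and it does not shrink to zero as the leverage goes to zero.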

22
MLR Diagnostics
  • As with SLR, it is very important to use
    graphical tools to diagnose deficiencies in MLR.
    The following residual plots are very important:
  • normal probability plots of residuals
  • residuals vs. predicted values (to identify
    curvature or heteroscedasticity)
  • residuals vs. time sequence or location (to
    identify trends)
  • residuals vs. any candidate or explanatory
    variables not in the model (to identify
    variables, or appropriate transformations of
    them, which may be used to improve the model fit)

23
Leverage and influence
  • Regression diagnostics are much more important in
    MLR than in SLR.
  • It is very difficult to recognize points of high
    leverage or high influence from any set of plots.
  • One observation may not be exceptional in terms
    of each of its explanatory variables taken one at
    a time, but viewed in combination it can be very
    exceptional.
  • Numerical diagnostics can accurately detect such
    anomalies.

24
Leverage Statistics
  • The leverage statistic $h_i$ expresses the distance
    of a given point $X_0$ from the center of the sample
    observations. It has 2 important uses in MLR:
  • identifying points unusual in value of the X
    variables (possible errors, poor model,
    nonlinearity, etc.)
  • making predictions. The leverage value for a
    prediction should not exceed the largest $h_i$ in
    the original data set.
  • (The regression model may not fit well beyond the
    largest $h_i$ even though $X_0$ may not be beyond
    the bounds of any of its individual explanatory
    variables.)
  • Critical $h_i = 3p/n$
25
Influence statistic
  • DFFITS is the measure of influence as defined
    earlier.
  • Example on the use of DFFITS and $h_i$:
  • True model: $C = 30 + 0.5D + \varepsilon$
  • Data given: DE (distance east), DN (distance
    north), D (well depth), and C (concentration).
  • Any acceptable model should closely reproduce
    the true model, and should find C to be independent
    of DE and DN.

26
  • Critical $h_i = 3p/n = 0.6$; critical DFFITS
    $= 2\sqrt{p/n} \approx 0.9$
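
The two critical values can be checked with a few lines of arithmetic. The sketch below makes two assumptions that should be read as such: the DFFITS cutoff is taken to be the common rule of thumb 2*sqrt(p/n), and the sample size n = 20 is inferred from 3p/n = 0.6 with p = 4 parameters (intercept plus DE, DN, and D). Function names are illustrative.

```python
import math


def critical_leverage(p, n):
    """Rule-of-thumb cutoff for high leverage: 3p/n."""
    return 3.0 * p / n


def critical_dffits(p, n):
    """Assumed rule-of-thumb cutoff for high influence: 2 * sqrt(p/n)."""
    return 2.0 * math.sqrt(p / n)


# Assumed values: p = 4 parameters and n = 20 observations (inferred, not stated on the slide).
p, n = 4, 20
print(round(critical_leverage(p, n), 2), round(critical_dffits(p, n), 2))
```

Observations whose leverage or absolute DFFITS exceeds these cutoffs are the ones to examine more closely.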

29
Multicollinearity
  • Very important for users of MLR to understand
    causes and consequences of multicollinearity.
  • Multicollinearity is the condition where at least
    one explanatory variable is closely related to
    one or more other explanatory variables.
  • Consequences:
  • The overall F test is okay, but slope coefficients
    are unrealistically large and their t-tests are
    insignificant.
  • Coefficients are unrealistic in sign. This occurs
    when 2 variables describing approximately the same
    thing are counter-balancing each other in the
    equation, having opposite signs.
  • Slope coefficients are unstable. A small change in
    one or a few values could cause a large change in
    the coefficients.
  • Automatic procedures, e.g. stepwise, forwards, and
    backwards methods, produce different models.

30
Diagnosing Multicollinearity
  • An excellent and simple measure of
    multicollinearity is the variance inflation
    factor (VIF). For variable j the VIF is
    $$VIF_j = \frac{1}{1 - R_j^2}$$
  • where $R_j^2$ is the $R^2$ from a regression of the jth
    explanatory variable on all the other explanatory
    variables (the equation used for adjustment of
    $X_j$ in partial plots).
  • The ideal $VIF_j$ is 1, corresponding to $R_j^2 = 0$.
    Serious problems are indicated when $VIF_j > 10$
    ($R_j^2 > 0.9$).
  • The VIF is a measure of how multicollinearity
    inflates the width of the CI for the jth
    regression coefficient, by the amount
    $\sqrt{VIF_j}$
    compared to what it would be with a perfectly
    independent set of explanatory variables.
  • The average VIF for the model can also be used.
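
A short sketch of how the VIF and the corresponding CI inflation grow with the R-squared of variable j regressed on the other explanatory variables. The R-squared values in the loop are hypothetical, and `vif` is an illustrative name; in practice each value would come from an auxiliary regression.

```python
import math


def vif(r2_j):
    """Variance inflation factor from the R^2 of X_j regressed on the other Xs."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


# Hypothetical auxiliary R^2 values.
for r2 in (0.0, 0.5, 0.9, 0.99):
    v = vif(r2)
    # sqrt(VIF) is the factor by which the CI width is inflated.
    print(r2, round(v, 2), round(math.sqrt(v), 2))
```

An R-squared of 0.9 gives a VIF of 10, which widens the confidence interval on that coefficient by a factor of a little over 3.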

31
Solutions for multicollinearity
  • 1. Center the data. This will work in some
    cases, e.g. polynomial type regression. It
    will not work if the explanatory variables are
    not derived from one another.
  • 2. Eliminate variables. Eliminate the one
    explanatory variable with the highest VIF first.
    Redo the regression and recalculate the VIFs.
  • 3. Collect additional data. Collect data that
    will counteract the multicollinearity.
  • 4. Perform ridge regression. This will give
    biased but more stable estimates of the slopes.
    The method of selecting the biasing factor is
    subjective. (Not available in most popular
    software).

32
Choosing the Best MLR Model
  • Major issue in MLR. There is a tradeoff between
    explaining more variance and reducing degrees of
    freedom, which leads to wider CIs.
  • Remember: $R^2$ never decreases, no matter what
    Xs are added to the model (including random
    numbers!)
  • Objective: find the model that explains the most
    variance with the fewest explanatory
    variables.

33
General Principles
  • 1. $X_j$ must have some effect on Y and make
    physical sense.
  • 2. Add variable only if it makes significant
    improvement in the model.
  • 3. Model must fulfill assumptions of MLR.

34
Selecting Variables: Stepwise Procedures
  • Automatic model selection method, done by the
    computer using preset criteria. 3 versions are
    available: forwards, backwards, and stepwise.
    These procedures use a sequence of partial F or
    t-tests to evaluate the significance of a
    variable. The 3 versions do not always agree on
    the best model. Only one variable is added or
    removed at a time.
  • None of the 3 versions tests all possible
    regressions. This is a major drawback. Also,
    each explanatory variable is assumed to be
    independent of the others. Thus, these
    procedures are hopeless for multicollinear data.
  • Use of automatic procedures is no longer in
    vogue. There are better procedures now.

35
Overall Measures of Quality
  • 3 newer statistics can be used to evaluate each
    of the $2^k$ regression equations possible from k
    candidate explanatory variables.
  • 1. Mallows $C_p$
  • 2. PRESS statistic (jackknife type procedure).
  • 3. Adjusted $R^2$

36
Mallows Cp
  • Designed to achieve a good compromise between
    explaining as much variance in Y as possible and
    minimizing the SE by keeping the number of
    coefficients small.
    $$C_p = p + \frac{(n - p)(s_p^2 - \hat{\sigma}^2)}{\hat{\sigma}^2}$$
  • where n = no. of observations, p = no. of
    parameters (k+1), $s_p^2$ = MSE of this p
    coefficient model, $\hat{\sigma}^2$ = minimum MSE among the $2^k$
    possible models.
  • Best model is the one with the lowest Cp value.
    When several models have nearly equal Cp values,
    then compare in terms of reasonableness,
    multicollinearity, importance of high influence
    points, and cost in order to select the model
    with the best overall properties.
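
A small sketch of ranking candidate models by Cp, assuming the usual form Cp = p + (n - p)(s_p^2 - sigma^2)/sigma^2 with sigma^2 taken as the minimum MSE among the candidates. The three candidate models and their MSE values are hypothetical, and `mallows_cp` is an illustrative name.

```python
def mallows_cp(n, p, mse_p, mse_min):
    """Mallows Cp for a p-parameter model, with mse_min as the error variance estimate."""
    return p + (n - p) * (mse_p - mse_min) / mse_min


# Hypothetical candidates: (label, number of parameters p, MSE of that model).
candidates = [("X1", 2, 6.0), ("X1+X2", 3, 4.2), ("X1+X2+X3", 4, 4.0)]
n = 20
mse_min = min(mse for _, _, mse in candidates)

scores = {label: mallows_cp(n, p, mse, mse_min) for label, p, mse in candidates}
best = min(scores, key=scores.get)
for label, cp in scores.items():
    print(label, round(cp, 2))
print("lowest Cp:", best)
```

In this made-up case the two-variable model has the lowest Cp (3.85 against 4.0 for the full model and 11.0 for the one-variable model): its slightly higher MSE is outweighed by having one fewer coefficient.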

37
PRESS statistic
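  • PRESS is the prediction error sum of squares,
    $$PRESS = \sum_{i=1}^{n} e_{(i)}^2$$
    where $e_{(i)}$ is the prediction error for
    observation i from a model fitted without that
    observation. By minimizing PRESS, the model with
    the least error in the prediction of future
    observations is selected.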
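
PRESS does not require refitting the model n times, because for an OLS fit the prediction residual equals e_i / (1 - h_i), where h_i is the leverage. The sketch below applies that identity to a one-variable regression to stay free of matrix code; the data are made up and `press_slr` is an illustrative name. The same identity holds for MLR with h_i taken from the full X matrix.

```python
def press_slr(x, y):
    """PRESS for a one-variable OLS fit, using e_(i) = e_i / (1 - h_i)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    press = 0.0
    for xi, yi in zip(x, y):
        e_i = yi - (b0 + b1 * xi)               # ordinary residual
        h_i = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of observation i
        press += (e_i / (1.0 - h_i)) ** 2       # squared prediction residual
    return press


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
print(round(press_slr(x, y), 4))
```

PRESS is never smaller than the ordinary SSE, since each residual is divided by a factor between 0 and 1.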

38
Adjusted R2
  • This is an $R^2$ value adjusted for the number of
    explanatory variables (df) in the model. The
    model with the highest adj. $R^2$ is identical to
    the one with the smallest MSE. Comparing $R^2$ with
    adj. $R^2$,
    $$R^2 = 1 - \frac{SSE}{SS_y}, \qquad R_a^2 = 1 - \frac{MSE}{SS_y/(n-1)}$$
  • Overall methods require more computation but
    give more flexibility in choosing between models.
    Stepwise methods may miss the best models.
  • E.g. 2 best models may be nearly identical in
    terms of $C_p$, adj. $R^2$ and/or PRESS statistics, yet
    one involves variables that are much less
    expensive to measure than the other.
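
The difference between the two statistics shows up when a nearly useless variable is added. This sketch assumes the usual definitions, R2 = 1 - SSE/SSy and adjusted R2 = 1 - MSE/(SSy/(n-1)) with MSE = SSE/(n - p); all numbers are hypothetical.

```python
def r_squared(sse, ssy):
    """Ordinary R^2 from the error and total sums of squares."""
    return 1.0 - sse / ssy


def adjusted_r_squared(sse, ssy, n, p):
    """Adjusted R^2 for a model with p parameters (including the intercept)."""
    mse = sse / (n - p)
    return 1.0 - mse / (ssy / (n - 1))


# Hypothetical fits on n = 20 observations with total sum of squares 100.
n, ssy = 20, 100.0
sse_small, p_small = 30.0, 3   # two explanatory variables
sse_large, p_large = 29.5, 4   # a third variable that barely reduces SSE

print(round(r_squared(sse_small, ssy), 3), round(adjusted_r_squared(sse_small, ssy, n, p_small), 3))
print(round(r_squared(sse_large, ssy), 3), round(adjusted_r_squared(sse_large, ssy, n, p_large), 3))
```

Here R2 rises from 0.700 to 0.705 when the extra variable is added, while adjusted R2 falls from about 0.665 to about 0.650, because the small drop in SSE does not pay for the lost degree of freedom.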

39
Analysis of Covariance (ANCOVA)
  • Regression analysis with grouped or qualitative
    variables e.g. site, day/night, male/female,
    before/after, summer/fall/winter/spring, etc.
  • They can be incorporated in an MLR analysis using
    indicator or dummy variables.
  • Very important class of regression models in
    environmental monitoring, e.g. point source
    pollution with a gradient sampling design: is
    there attenuation with distance from year to year?

ANCOVA = Regression + ANOVA
40
Use of One Binary Variable
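  • To the SLR model
    $$Y = \beta_0 + \beta_1 X + \varepsilon \qquad (1)$$
  • an additional factor, e.g. season (winter vs.
    summer), may be an important influence on Y for
    any given value of X.
  • To incorporate the new factor to represent the
    season, define a new variable Z, where
    $$Z_i = \begin{cases} 0 & \text{if } i \text{ is from the winter season} \\ 1 & \text{if } i \text{ is from the summer season} \end{cases}$$
  • to produce the model
    $$Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon \qquad (2)$$
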
41
  • When $\beta_2$ is found to be significant, then there
    are two models:
    For winter season (Z = 0): $Y = \beta_0 + \beta_1 X + \varepsilon$
    For summer season (Z = 1): $Y = (\beta_0 + \beta_2) + \beta_1 X + \varepsilon$
  • Therefore, the regression lines differ for the
    two seasons. Both seasons have the same slope,
    but different intercepts, and will plot as two
    parallel lines.

42
Different slopes and intercepts
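  • If it is suspected that slopes may be different
    as well, the model becomes
    $$Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 ZX + \varepsilon \qquad (3)$$
  • The intercept equals
    $\beta_0$ for winter season
    $\beta_0 + \beta_2$ for summer season
  • The slope equals
    $\beta_1$ for winter season
    $\beta_1 + \beta_3$ for summer season
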
43
Hypothesis Testing
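  • To determine whether the SLR model with no Z, model (1), can
    be improved upon by model (3), the following hypotheses
    are tested:
    $$H_0: \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_2 \text{ and/or } \beta_3 \neq 0$$
  • A nested F statistic is computed
    $$F = \frac{(SSE_s - SSE_c)/(df_s - df_c)}{SSE_c/df_c}$$
  • where s refers to the simpler model (no Z terms)
    of (1) and c refers to the more complex model of
    (3).
  • Reject $H_0$ if $F > F_{\alpha,2,n-4}$.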

44
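  • If $H_0$ is rejected, model (3) should also be
    compared to model (2) to determine whether there
    is a change in slope in addition to the change in
    intercept, or whether the rejection of model (1)
    in favor of (3) was due only to a shift in
    intercept.
  • The hypotheses in this case are
    $$H_0': \beta_3 = 0 \quad \text{versus} \quad H_1': \beta_3 \neq 0$$
  • using the test statistic
    $$F = \frac{(SSE_s - SSE_c)/(df_s - df_c)}{SSE_c/df_c}$$
    where s now refers to model (2) and c to model (3).
  • Reject $H_0'$ if $F > F_{\alpha,1,n-4}$.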

45
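  • Assuming $H_0$ and $H_0'$ are both rejected, the model
    can be expressed as the two separate equations
    For winter season: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
    For summer season: $\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) X$
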
46
  • The coefficients in these two equations will be
    exactly those computed if the two regressions
    were estimated by separating the data, and
    computing two separate regression equations.
  • By using ANCOVA, however, the significance of the
    difference between those two equations has been
    established.
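
The equivalence between the single ANCOVA fit and two separate regressions can be checked numerically. This is a minimal sketch on made-up data, assuming the model Y = b0 + b1 X + b2 Z + b3 ZX with Z = 0 for winter and Z = 1 for summer; `solve` and `ols` are illustrative helper names, not part of any library.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(X, y):
    """OLS coefficients from a design matrix X (rows include the leading 1)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: (X, Y, Z) with Z = 0 for winter and Z = 1 for summer.
data = [(1, 2.1, 0), (2, 2.9, 0), (3, 4.2, 0), (4, 4.8, 0),
        (1, 4.0, 1), (2, 6.1, 1), (3, 7.9, 1), (4, 10.2, 1)]

# One pooled fit with the dummy variable Z and the interaction Z*X.
b0, b1, b2, b3 = ols([[1.0, x, z, z * x] for x, _, z in data],
                     [y for _, y, _ in data])


def fit_season(z):
    """Separate simple regression using only the rows of one season."""
    rows = [(x, y) for x, y, zz in data if zz == z]
    return ols([[1.0, x] for x, _ in rows], [y for _, y in rows])


winter = fit_season(0)
summer = fit_season(1)
print([round(v, 4) for v in (b0, b1)], [round(v, 4) for v in winter])
print([round(v, 4) for v in (b0 + b2, b1 + b3)], [round(v, 4) for v in summer])
```

Each printed pair should agree: the winter line is (b0, b1) and the summer line is (b0 + b2, b1 + b3). What the pooled fit adds is a common error variance, which is what makes the F tests on b2 and b3 possible.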

47
Multiple Dummy Variables
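  • For cases where there are more than 2
    categories, e.g. 4 seasons, 5 stations, 3 flow
    conditions (rising limb, falling limb, baseflow),
    etc.
  • Example: X = discharge, Y = concentration, and
    the observations are classified as either rising,
    falling, or baseflow. Two binary variables are
    required to express these three categories (there
    is always 1 fewer binary variable required than the
    number of categories).
  • Model
    $$Y = \beta_0 + \beta_1 X + \beta_2 R + \beta_3 D + \varepsilon \qquad (4)$$
    where R and D are the two binary variables for the
    rising and falling limbs, with baseflow as the
    category where both are 0.
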
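
The rule of one fewer binary variable than categories can be sketched as a small encoding function. The choice of baseflow as the reference level (all indicators 0) is an assumption made for this illustration, and `dummy_code` is a hypothetical helper name.

```python
def dummy_code(category, categories):
    """Encode one category as len(categories) - 1 binary indicators.

    The first entry of `categories` is the reference level and maps to all zeros.
    """
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == level else 0 for level in categories[1:]]


# Assumed ordering: baseflow is the reference level, then rising, then falling.
levels = ["baseflow", "rising", "falling"]
for c in levels:
    print(c, dummy_code(c, levels))
```

Using a full set of three indicators together with an intercept would make the columns of the X matrix linearly dependent, which is why one category is left as the reference.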
48
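  • To test
    $$H_0: \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_2 \text{ and/or } \beta_3 \neq 0$$
    where $\beta_2$ and $\beta_3$ are the coefficients on R and D,
  • A nested F statistic is again computed
    $$F = \frac{(SSE_s - SSE_c)/(df_s - df_c)}{SSE_c/df_c}$$
  • where s refers to the simpler model (no R or D
    terms) of (1) and c refers to the more complex
    model of (4). Reject $H_0$ if $F > F_{\alpha,2,n-4}$.
  • Greater complexity can be added to include
    interaction terms such as
    $$Y = \beta_0 + \beta_1 X + \beta_2 R + \beta_3 D + \beta_4 RX + \beta_5 DX + \varepsilon \qquad (5)$$
  • The procedures for selecting models follow the
    pattern described above.
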
49
Summary of the Model Selection Criteria
  • 1. Should Y be transformed? Use a plot of $e_i$ vs. $\hat{Y}$.
  • i) constant variance across the range of Y?
  • ii) residuals normal?
  • iii) curvature?
  • $R^2$, SSE, $C_p$, and PRESS are not appropriate for
    comparison of models having different units of Y.
  • 2. Should X (or several Xs) be transformed? Use
    partial plots. Same checks as above. Can use
    $R^2$, SSE, or PRESS to help in the decision.
  • 3. Which model is best if the no. of explanatory
    variables is the same? Use $R^2$, SSE, or PRESS,
    but back up with a residual plot.

50
  • 4. Which of several models (nested or not
    nested), each with the same Y, is preferable? Use
    minimum $C_p$ or minimum PRESS.
  • 5. For ANCOVA, always do an X-Y plot to check for
    linearity, whether regression lines are parallel,
    and outliers.
  • All assumptions of regression must also be
    checked.