Title: Multiple Regression Analysis
1. Multiple Regression Analysis
- Why use MLR?
- Hypothesis testing for MLR
- Diagnostics
- Multicollinearity
- Choosing the best model
- ANCOVA
2. Multiple Linear Regression
- Extension of simple linear regression to the case where there are several explanatory variables.
- The goal is to explain as much of the variation observed in the response (y) variable as possible, leaving as little variation as possible to unexplained noise.
- As in simple linear regression, multiple linear regression equations can be obtained by an OLS fit or by other methods such as LMS, GLS, etc., depending on the situation.
- As multiple OLS linear regression is the most common curve-fitting technique, we will concentrate only on procedures for developing a good multiple OLS linear regression model, and on how to deal with common problems such as multicollinearity.
3.
- As MLR is a complex subject, hand calculation is inadvisable when p is greater than 2 or 3 because of the amount of work involved.
- The complexity involves
  - determining how many and which variables to use, including the form of each variable (such as linear or nonlinear),
  - interpreting the results, especially the regression coefficients, and
  - determining whether an alternative to OLS should be used.
4. Why Use MLR?
- Scientific knowledge and experience usually tell us so.
- Residuals from SLR may indicate that additional explanatory variables are required, e.g. residuals show there is a temporal trend (suggesting time as an additional explanatory variable).
5. MLR Model
- The MLR model will be denoted
  y = β0 + β1X1 + β2X2 + ... + βkXk + ε
- This can be written in matrix notation as
  Y = Xβ + ε
- For n observations and 2 explanatory variables X1 and X2, Y is an n × 1 vector, X is an n × 3 matrix (a column of 1s plus the columns of X1 and X2 values), β is a 3 × 1 vector of coefficients, and ε is an n × 1 vector of errors.
6. Specifically,
  yi = β0 + β1Xi1 + β2Xi2 + εi, for i = 1, ..., n
7.
- where Xij denotes the ith observation on the jth explanatory variable.
- As with SLR, OLS stipulates that the sum of the squared residuals must be minimized. For the above model,
  SS = Σ (yi − β0 − β1Xi1 − β2Xi2)², summed over i = 1 to n.
8.
- Differentiation of the RHS of the equation with respect to β0, β1, and β2 (separately) produces 3 equations in the 3 unknown parameters.
- The 3 equations are called the normal equations, and they can be written in matrix notation as
  (X′X) β = X′Y
- the solution for which is
  β̂ = (X′X)⁻¹ X′Y
- The X′X matrix is a (k+1) × (k+1) symmetric matrix whose diagonal elements are the sums of squares of the elements in the columns of the X matrix, and whose off-diagonal elements are the sums of cross products of elements in pairs of columns.
9.
- The nature of X′X plays an important role in the properties of the estimators of β, and will often be a large factor in the success (or failure) of OLS as an estimation procedure. A numerical sketch of the normal-equation solution follows.
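A minimal numerical sketch (not from the original slides) of the OLS solution β̂ = (X′X)⁻¹X′Y for n observations and two explanatory variables; the data values, seed, and "true" coefficients are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 0.8 * x1 - 1.5 * x2 + rng.normal(0, 1, n)    # assumed "true" model

X = np.column_stack([np.ones(n), x1, x2])               # design matrix with intercept column

XtX = X.T @ X                                           # (k+1) x (k+1) symmetric matrix X'X
beta_hat = np.linalg.solve(XtX, X.T @ y)                # solve the normal equations

residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - X.shape[1])           # residual variance estimate s^2

print("beta_hat:", beta_hat)
print("s^2:", s2)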
10. Hypothesis Tests for MLR: Nested F Tests
- The F test is the single most important hypothesis test for comparing any two nested models: a complex model vs. a simpler model which is a subset of the complex model. The test statistic is
  F = [(SSEs − SSEc) / (dfs − dfc)] / (SSEc / dfc)
  where SSEs and SSEc are the error sums of squares of the simpler and more complex models, and dfs and dfc are their error degrees of freedom.
11.
- The complex model has (m + 1) parameters and df = n − (m + 1).
- The simple model has (k + 1) parameters and df = n − (k + 1).
- If F > F(tab) with (dfs − dfc) and dfc degrees of freedom for the selected α, then H0 is rejected. Rejection indicates that the more complex model should be chosen in preference to the simpler model, and vice versa.
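A hedged sketch of the nested F test just described: fit a simple and a more complex OLS model to the same response and compare their residual sums of squares. The variables, data, and seed are illustrative assumptions, not the slides' example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 0.6 * x1 + rng.normal(0, 1, n)

def ols_sse(X, y):
    """Return the residual sum of squares and residual df for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid, len(y) - X.shape[1]

X_simple = np.column_stack([np.ones(n), x1])             # k + 1 = 2 parameters
X_complex = np.column_stack([np.ones(n), x1, x2])         # m + 1 = 3 parameters

sse_s, df_s = ols_sse(X_simple, y)
sse_c, df_c = ols_sse(X_complex, y)

F = ((sse_s - sse_c) / (df_s - df_c)) / (sse_c / df_c)
p = stats.f.sf(F, df_s - df_c, df_c)                       # upper-tail p-value
print(f"F = {F:.3f}, p = {p:.3f}")                         # reject the simpler model if p < alpha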
12. Overall F Test
- This is a special case of the nested F test. It is of limited use: it tests only whether the complex regression equation is better than no regression at all. Of much greater interest is which of several regression models is best.
13. Partial F Test
- The second special case of the nested F tests. The partial F test evaluates whether the mth variable adds any new explanatory power to the equation, and ought to be in the regression model, given that all the other variables are already present.
- The F value (Minitab uses t) on a coefficient will change depending on what other variables are in the model.
- It cannot answer "Does variable m belong in the model?"
- It can only answer whether m belongs in the model in the presence of the other variables.
14.
- If |t| > 2 for every coefficient, then it is clear that every explanatory variable is accounting for a significant amount of variation, and all should be present.
- When one or more of the coefficients has |t| < 2, some of the variables should be removed from the equation, but t values are not a certain guide as to which ones to remove.
- Partial t or F tests are used to make automatic decisions for removal or inclusion in stepwise multiple regression.
- These automatic procedures do not guarantee that some "best" model is obtained. Better procedures are available for doing so.
15. Confidence Intervals
- CIs can be computed for all βs and for the mean response Y at a given value of all explanatory variables. PIs can be similarly computed around an individual estimate of Y. Matrix notation is needed for these.
16. Variance-Covariance Matrix
- In MLR, the variance-covariance matrix is computed from
  Var(β̂) = σ² (X′X)⁻¹
- The elements of the (X′X)⁻¹ matrix for 3 explanatory variables are denoted C00 through C33 (a symmetric 4 × 4 matrix).
17.
- When multiplied by the error variance (estimated by the variance of the residuals, s²), the diagonal elements of the matrix, C00 through C33, become the variances of the regression coefficients; the off-diagonal elements are the covariances between coefficients.
18. Confidence Intervals for Slope Coefficients
- If the residuals are normally distributed with variance σ², a 100(1−α)% CI on βj is
  β̂j ± t(α/2, n−p) · s · √Cjj
- where Cjj is the diagonal element of (X′X)⁻¹ corresponding to the jth explanatory variable. Often printed is the SE of the regression coefficient, s·√Cjj.
19. Note
- Cjj is a function of the other explanatory variables as well as the jth. Therefore CIs will change as explanatory variables are added to or deleted from the model.
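A sketch of the coefficient CI above, β̂j ± t·s·√Cjj, where Cjj is the jth diagonal element of (X′X)⁻¹ and the residuals are assumed normal; the data and seed are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 25
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 3.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

resid = y - X @ beta_hat
p = X.shape[1]                                   # number of parameters (k + 1)
s = np.sqrt(resid @ resid / (n - p))             # residual standard error

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - p)
for j in range(p):
    se_j = s * np.sqrt(XtX_inv[j, j])            # SE of the j-th coefficient
    lo, hi = beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j
    print(f"beta_{j}: {beta_hat[j]: .3f}  95% CI [{lo: .3f}, {hi: .3f}]")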
20. Confidence Intervals for the Mean Response
- A 100(1−α)% CI for the mean response μ(Y0) at a given point in multidimensional space, x0, is symmetric around the regression estimate Ŷ0:
  Ŷ0 ± t(α/2, n−p) · s · √[x0′(X′X)⁻¹x0]
  These intervals also require the assumption of normality of residuals.
- The variance of the mean is the term under the square root sign. It changes with x0, increasing as x0 moves away from the multidimensional center of the data. In fact the term x0′(X′X)⁻¹x0 is the leverage statistic hi, expressing the distance that x0 is from the center of the data.
21. Prediction Intervals for an Individual Y
- A 100(1−α)% PI for a single response Y0, given a point in multidimensional space x0, is symmetric around the regression estimate Ŷ0:
  Ŷ0 ± t(α/2, n−p) · s · √[1 + x0′(X′X)⁻¹x0]
  It also requires the assumption of normality of the residuals.
- Notice the addition of a 1 in the square brackets. This reflects the additional variance for an individual point.
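A sketch of the mean-response CI and individual PI at a new point x0, using the leverage term x0′(X′X)⁻¹x0 described above. It continues the same illustrative setup as the previous sketch; the new point x0 is an assumption for demonstration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 25
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 3.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
p = X.shape[1]
s = np.sqrt(resid @ resid / (n - p))

x0 = np.array([1.0, 5.0, 2.5])                   # new point (leading 1 for the intercept)
y0_hat = x0 @ beta_hat
h0 = x0 @ XtX_inv @ x0                           # leverage of the new point
t_crit = stats.t.ppf(0.975, n - p)

ci_half = t_crit * s * np.sqrt(h0)               # half-width of the CI for the mean response
pi_half = t_crit * s * np.sqrt(1.0 + h0)         # half-width of the PI (note the extra 1)
print(f"Y0_hat = {y0_hat:.2f}, mean CI +/- {ci_half:.2f}, PI +/- {pi_half:.2f}")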
22. MLR Diagnostics
- As with SLR, it is very important to use graphical tools to diagnose deficiencies in MLR. The following residual plots are very important:
  - normal probability plots of residuals
  - residuals vs. predicted values (to identify curvature or heteroscedasticity)
  - residuals vs. time sequence or location (to identify trends)
  - residuals vs. any candidate explanatory variables not in the model (to identify variables, or appropriate transformations of them, which may be used to improve the model fit)
23. Leverage and Influence
- Regression diagnostics are much more important in MLR.
- It is very difficult to recognize points of high leverage or high influence from any set of plots.
- One observation may not be exceptional in terms of each of its explanatory variables taken one at a time, but viewed in combination it can be very exceptional.
- Numerical diagnostics can accurately detect such anomalies.
24. Leverage Statistics
- The leverage statistic hi expresses the distance of a given point x0 from the center of the sample observations. It has 2 important uses in MLR:
  - identifying points unusual in value of the X variables (possible errors, poor model, nonlinearity, etc.)
  - making predictions. The leverage value for a prediction should not exceed the largest hi in the original data set. (The regression model may not fit well beyond the largest hi, even though x0 may not be beyond the bounds of any of its individual explanatory variables.)
- Critical value: hi > 3p/n
25. Influence Statistic
- DFFITS is the measure of influence, as defined earlier.
- Example on the use of DFFITS and hi:
  - True model: C = 30 + 0.5 D + ε
  - Data given: DE (distance east), DN (distance north), D (well depth), C (concentration).
  - Any acceptable model should closely reproduce the true model, and should find C to be independent of DE and DN.
26.
- Critical hi = 3p/n = 0.6; critical |DFFITS| = 0.9
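A hedged sketch of screening hi and DFFITS with statsmodels for a setting like the one above (concentration vs. well depth plus two location variables). The data are simulated under C = 30 + 0.5·D + error, not the original example's data; with n = 20 and p = 4, the 3p/n and 2√(p/n) screening values work out to the 0.6 and roughly 0.9 quoted above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 20
D = rng.uniform(10, 100, n)                      # well depth
DE = rng.uniform(0, 10, n)                       # distance east
DN = rng.uniform(0, 10, n)                       # distance north
C = 30 + 0.5 * D + rng.normal(0, 2, n)           # assumed true model

X = sm.add_constant(np.column_stack([D, DE, DN]))
res = sm.OLS(C, X).fit()
infl = res.get_influence()

p = X.shape[1]
h = infl.hat_matrix_diag                         # leverage values h_i
dffits, _ = infl.dffits                          # DFFITS (statsmodels also returns its own threshold)

print("high leverage (h > 3p/n):", np.where(h > 3 * p / n)[0])
print("high influence (|DFFITS| > 2*sqrt(p/n)):",
      np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])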
29. Multicollinearity
- It is very important for users of MLR to understand the causes and consequences of multicollinearity.
- Multicollinearity is the condition where at least one explanatory variable is closely related to one or more other explanatory variables.
- Consequences:
  - The overall F test is okay, but slope coefficients are unrealistically large and t-tests are insignificant.
  - Coefficients are unrealistic in sign. This occurs when 2 variables describing approximately the same thing counter-balance each other in the equation, having opposite signs.
  - Slope coefficients are unstable. A small change in one or a few values could cause a large change in the coefficients.
  - Automatic procedures (e.g. stepwise, forwards, and backwards methods) produce different models.
30. Diagnosing Multicollinearity
- An excellent and simple measure of multicollinearity is the variance inflation factor (VIF). For variable j the VIF is
  VIFj = 1 / (1 − Rj²)
- where Rj² is the R² from a regression of the jth explanatory variable on all the other explanatory variables - the equation used for adjustment of Xj in partial plots.
- The ideal VIFj is 1, corresponding to Rj² = 0. Serious problems are indicated when VIFj > 10 (Rj² > 0.9).
- The VIF is a measure of how much multicollinearity inflates the width of the CI for the jth regression coefficient (by the amount √VIFj) compared to what it would be with a perfectly independent set of explanatory variables.
- The average VIF for the model can also be used. A computational sketch of the VIF follows.
31. Solutions for Multicollinearity
- 1. Center the data. This will work in some cases, e.g. polynomial-type regression. It will not work if the explanatory variables are not derived from one another.
- 2. Eliminate variables. Eliminate the one explanatory variable with the highest VIF first. Redo the regression and recalculate the VIFs.
- 3. Collect additional data. Collect data that will counteract the multicollinearity.
- 4. Perform ridge regression. This will give biased but more stable estimates of the slopes; the method of selecting the biasing factor is subjective. (Not available in most popular software.)
32. Choosing the Best MLR Model
- This is a major issue in MLR: a tradeoff between explaining more variance and reducing degrees of freedom, which leads to wider CIs.
- Remember: R² will always increase no matter what Xs are added to the model (including random numbers!!).
- Objective: find the model that explains the most variance with the fewest explanatory variables.
33. General Principles
- 1. Xj must have some effect on Y and make physical sense.
- 2. Add a variable only if it makes a significant improvement in the model.
- 3. The model must fulfill the assumptions of MLR.
34. Selecting Variables: Stepwise Procedures
- Automatic model selection methods, done by the computer using preset criteria. 3 versions are available: forward, backward, and stepwise. These procedures use a sequence of partial F or t-tests to evaluate the significance of a variable. The 3 versions do not always agree on the "best" model. Only one variable is added or removed at a time.
- None of the 3 versions tests all possible regressions. This is a major drawback. Also, each explanatory variable is assumed to be independent of the others; thus, these procedures are hopeless for multicollinear data.
- Use of automatic procedures is no longer in vogue. There are better procedures now.
35. Overall Measures of Quality
- 3 newer statistics can be used to evaluate each of the 2^k regression equations possible from k candidate explanatory variables:
- 1. Mallows' Cp
- 2. The PRESS statistic (a jackknife-type procedure)
- 3. Adjusted R²
36. Mallows' Cp
- Designed to achieve a good compromise between explaining as much variance in Y as possible and minimizing the standard error by keeping the number of coefficients small:
  Cp = p + (n − p)(sp² − σ̂²) / σ̂²
- where n = no. of observations, p = no. of parameters (k+1), sp² = MSE of this p-coefficient model, and σ̂² = the minimum MSE among the 2^k possible models.
- The best model is the one with the lowest Cp value. When several models have nearly equal Cp values, compare them in terms of reasonableness, multicollinearity, importance of high-influence points, and cost in order to select the model with the best overall properties. A computational sketch follows.
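A hedged sketch of Mallows' Cp = p + (n − p)(sp² − σ̂²)/σ̂² evaluated over every non-empty subset of candidate explanatory variables, with σ̂² taken as the smallest MSE among the candidate models as the slide describes. The candidate variables and data are illustrative assumptions.

import itertools
import numpy as np

rng = np.random.default_rng(7)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
x3 = rng.uniform(0, 2, n)                           # irrelevant candidate variable
y = 1.0 + 0.8 * x1 - 0.4 * x2 + rng.normal(0, 1, n)

candidates = {"x1": x1, "x2": x2, "x3": x3}

def mse(cols):
    """MSE and parameter count p for an OLS fit on the named columns."""
    X = np.column_stack([np.ones(n)] + [candidates[c] for c in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid / (n - X.shape[1]), X.shape[1]

subsets = [combo for r in range(1, 4) for combo in itertools.combinations(candidates, r)]
results = {combo: mse(combo) for combo in subsets}
sigma2 = min(m for m, _ in results.values())         # minimum MSE among candidate models

for combo, (s_p2, p) in results.items():
    cp = p + (n - p) * (s_p2 - sigma2) / sigma2
    print(f"{combo}: Cp = {cp:.2f} (p = {p})")        # prefer the smallest Cp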
37. PRESS Statistic
- PRESS is the prediction error sum of squares: the sum of the squared prediction residuals e(i), where each e(i) is the error in predicting observation i from a model fit without that observation. By minimizing PRESS, the model with the least error in the prediction of future observations is selected.
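A sketch of the PRESS statistic. Rather than refitting the model n times, it uses the standard identity e(i) = ei / (1 − hi), where hi is the leverage, so PRESS = Σ [ei / (1 − hi)]². The data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(8)
n = 35
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)         # leverages h_i = diag of X(X'X)^-1 X'
press = np.sum((resid / (1 - h)) ** 2)
print(f"PRESS = {press:.2f}")                        # smaller is better across candidate models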
38. Adjusted R²
- This is an R² value adjusted for the number of explanatory variables (df) in the model:
  adj. R² = 1 − [(n − 1)/(n − p)](1 − R²)
  The model with the highest adj. R² is identical to the one with the smallest MSE.
- The overall methods require more computation but give more flexibility in choosing between models. The stepwise method may miss the best models.
- E.g. the 2 best models may be nearly identical in terms of Cp, adj. R², and/or the PRESS statistic, yet one involves variables that are much less expensive to measure than the other.
39. Analysis of Covariance (ANCOVA)
- Regression analysis with grouped or qualitative variables, e.g. site, day/night, male/female, before/after, summer/fall/winter/spring, etc.
- They can be incorporated in an MLR analysis using indicator or dummy variables.
- A very important class of regression models in environmental monitoring, e.g. point-source pollution with a gradient sampling design: is there attenuation with distance from year to year?
- ANCOVA = Regression + ANOVA
40. Use of One Binary Variable
- To the SLR model
  Y = β0 + β1X + ε    (1)
- an additional factor, e.g. season (winter vs. summer), may be an important influence on Y for any given value of X.
- To incorporate the new factor to represent the season, define a new variable Z, where
  Zi = 0 if observation i is from the winter season
  Zi = 1 if observation i is from the summer season
- to produce the model
  Y = β0 + β1X + β2Z + ε    (2)
41.
- When β2 is found to be significant, there are two models:
  For the winter season (Z = 0): Y = β0 + β1X
  For the summer season (Z = 1): Y = (β0 + β2) + β1X
- Therefore, the regression lines differ for the two seasons. Both seasons have the same slope but different intercepts, and will plot as two parallel lines.
42. Different Slopes and Intercepts
- If it is suspected that the slopes may be different as well, the model becomes
  Y = β0 + β1X + β2Z + β3ZX + ε    (3)
- The intercept equals β0 for the winter season and (β0 + β2) for the summer season.
- The slope equals β1 for the winter season and (β1 + β3) for the summer season.
43. Hypothesis Testing
- To determine whether the SLR model with no Z terms can be improved upon by model (3), the following hypotheses are tested:
  H0: β2 = β3 = 0 versus H1: at least one of β2, β3 ≠ 0
- A nested F statistic is computed:
  F = [(SSEs − SSEc)/2] / [SSEc/(n − 4)]
- where s refers to the simpler model (no Z terms) of (1) and c refers to the more complex model of (3).
- Reject H0 if F > F(α, 2, n−4).
44.
- If H0 is rejected, model (3) should also be compared to model (2) to determine whether there is a change in slope in addition to the change in intercept, or whether the rejection of model (1) in favor of (3) was due only to a shift in intercept.
- The hypotheses in this case are
  H0: β3 = 0 versus H1: β3 ≠ 0
- using the test statistic
  F = (SSEs − SSEc) / [SSEc/(n − 4)]
- where s now refers to model (2) and c to model (3). Reject H0 if F > F(α, 1, n−4).
45.
- Assuming both null hypotheses are rejected, the model can be expressed as two separate equations (a computational sketch follows):
  For the winter season: Y = β0 + β1X
  For the summer season: Y = (β0 + β2) + (β1 + β3)X
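A hedged sketch of the ANCOVA comparisons above: fit models (1)-(3) by OLS and use the two nested F tests, first for any seasonal effect, then for a difference in slope. The season coding, data, and seed are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(0, 10, n)
z = rng.integers(0, 2, n)                            # 0 = winter, 1 = summer
y = 2.0 + 0.5 * x + 1.5 * z + 0.3 * z * x + rng.normal(0, 1, n)

def sse(X):
    """Residual sum of squares for an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ones = np.ones(n)
sse1 = sse(np.column_stack([ones, x]))                # model (1): SLR, no season term
sse2 = sse(np.column_stack([ones, x, z]))             # model (2): shifted intercept
sse3 = sse(np.column_stack([ones, x, z, z * x]))      # model (3): intercept and slope differ

F_any = ((sse1 - sse3) / 2) / (sse3 / (n - 4))        # model (1) vs model (3)
F_slope = ((sse2 - sse3) / 1) / (sse3 / (n - 4))      # model (2) vs model (3)
print(f"(1) vs (3): F = {F_any:.2f}, p = {stats.f.sf(F_any, 2, n - 4):.4f}")
print(f"(2) vs (3): F = {F_slope:.2f}, p = {stats.f.sf(F_slope, 1, n - 4):.4f}")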
46.
- The coefficients in these two equations will be exactly those computed if the two regressions were estimated by separating the data and computing two separate regression equations.
- By using ANCOVA, however, the significance of the difference between those two equations has been established.
47. Multiple Dummy Variables
- For cases where there are more than 2 categories, e.g. 4 seasons, 5 stations, 3 flow conditions (rising limb, falling limb, baseflow), etc.
- Example: X = discharge, Y = concentration, and the observations are classified as either rising, falling, or baseflow. Two binary variables are required to express these three categories (there is always 1 less binary variable required than the number of categories), e.g. R = 1 for the rising limb and 0 otherwise, D = 1 for the falling limb and 0 otherwise (baseflow has R = D = 0).
- Model:
  Y = β0 + β1X + β2R + β3D + ε    (4)
48.
- To test
  H0: β2 = β3 = 0 versus H1: at least one of β2, β3 ≠ 0
- a nested F statistic is again computed:
  F = [(SSEs − SSEc)/2] / [SSEc/(n − 4)]
- where s refers to the simpler model (no R or D terms) of (1) and c refers to the more complex model of (4). Reject H0 if F > F(α, 2, n−4).
- Greater complexity can be added to include interaction terms, such as
  Y = β0 + β1X + β2R + β3D + β4RX + β5DX + ε    (5)
- The procedures for selecting models follow the pattern described above.
49. Summary of the Model Selection Criteria
- 1. Should Y be transformed? Use a plot of the residuals (ei) vs. the predicted values (Ŷi).
  - i) Is the variance constant across the range of Y?
  - ii) Are the residuals normal?
  - iii) Is there curvature?
  - R², SSE, Cp, and PRESS are not appropriate for comparison of models having different units of Y.
- 2. Should X (or several Xs) be transformed? Use partial plots. Same checks as above. Can use R², SSE, or PRESS to help in the decision.
- 3. Which model is best if the number of explanatory variables is the same? Use R², SSE, or PRESS, but back it up with residual plots.
50.
- 4. Which of several models (nested or not nested), each with the same Y, is preferable? Use minimum Cp or minimum PRESS.
- 5. For ANCOVA, always do an X-Y plot to check for linearity, whether the regression lines are parallel, and outliers.
- All assumptions of regression must also be checked.