Title: Multiple Regression 2
Solving for β and b
- The weight for predictor xj will be a function of
  - The correlation between xj and y.
  - The extent to which xj's relationship with y is redundant with other predictors' relationships with y (collinearity).
  - The correlations between y and all other predictors.
  - The correlations between xj and all other predictors.
Solving for β and b: the two-variable case
- β1 is the slope for X1 controlling for the other independent variable, X2.
- β2 is computed the same way; swap the X1s and X2s.
- Compare to the bivariate slope (see the sketch below).
- What happens to b1 if X1 and X2 are totally uncorrelated?
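A minimal sketch of the standard two-predictor solution, in conventional notation (these are the textbook formulas, not quoted from the slides):

    β1 = (r_y1 - r_12 r_y2) / (1 - r_12²)
    β2 = (r_y2 - r_12 r_y1) / (1 - r_12²)
    b1 = β1 (s_y / s_1)

If r_12 = 0, β1 reduces to r_y1 and b1 reduces to the ordinary bivariate slope, which answers the question above.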
Solving for β and b: the two-variable case (cont.)
- Solving for β and b is relatively simple with two variables but becomes increasingly complex with more variables and requires differential calculus to derive the formulas. Matrix algebra can be used to simplify the process.
Matrix Equations
- R² = Σ (r_yj β_j)
  - where each r_yj is the correlation between the DV and the jth IV
  - and each β_j is the standardized regression coefficient for that IV.
- R² = R_yj B_j
  - where R_yj is the row matrix of correlations between the DV and the k IVs
  - and B_j is the column matrix of standardized regression coefficients for the IVs.
- B_j = R_jj⁻¹ R_jy
- In other words, the matrix of standardized regression coefficients is the matrix of correlations between the DV and the IVs "divided by" (premultiplied by the inverse of) the matrix of correlations among the IVs (see the numpy sketch below).
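As a concrete illustration, a minimal numpy sketch of these matrix equations (the correlation values are hypothetical, chosen only for the example):

    import numpy as np

    # Hypothetical correlations among k = 3 IVs (Rjj) and between each IV and the DV (Rjy)
    Rjj = np.array([[1.0, 0.3, 0.4],
                    [0.3, 1.0, 0.2],
                    [0.4, 0.2, 1.0]])
    Rjy = np.array([0.5, 0.4, 0.6])

    B  = np.linalg.solve(Rjj, Rjy)   # B = Rjj^-1 Rjy, the standardized coefficients
    R2 = Rjy @ B                     # R^2 = sum of r_yj * beta_j

    print(B, R2)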
Tests of Regression Coefficients
- Null hypothesis: the predictor Xj is not related to Y when the other predictors are held constant.
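For reference, the test statistic (a standard result, not spelled out on the slide) is t = b_j / SE(b_j) with df = N - k - 1; under the null hypothesis βj = 0, so a significant t means Xj contributes beyond the other predictors.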
Further Interpretation of Regression Coefficients
- Regression coefficients in multiple regression (unstandardized and standardized) are considered partial regression coefficients because each coefficient is calculated after controlling for the other predictors in the model.
- Tests of regression coefficients represent a test of the unique contribution of that variable in predicting y over and above all other predictor variables in the model.
Assumptions
- Predictors are linearly related to the criterion.
- Normality of errors -- residuals are normally distributed around zero.
- Multivariate normal distribution -- the multivariate extension of bivariate normality.
- Homoscedasticity.
- Regression diagnostics check on these assumptions.
Regression Diagnostics
- Detecting multivariate outliers
- Distance, leverage, and influence
- Evaluating Collinearity
Regression Diagnostics
- Methods for identifying problems in your multiple regression analysis -- a good idea for any multiple regression analysis.
- Can help identify
  - violations of assumptions
  - outliers and overly influential cases -- cases you might want to delete or transform
  - important variables you've omitted from the analysis
Three Classes of MR Diagnostic Statistics
- 1. Distance -- detects outliers on the dependent variable and assumption violations -- the primary measure is the residual (Y - Ŷ), the standardized residual (i.e., put in terms of z-scores), or the studentized residual (i.e., put in terms of t-scores).
- 2. Leverage -- identifies potential outliers on the independent variables -- the primary measure is the leverage statistic or hat diagnostic.
Three Classes of MR Diagnostic Statistics (cont.)
- 3. Influence -- combines distance and leverage to identify unusually influential observations (i.e., observations or cases that have a big influence on the MR equation) -- the measure we will use is Cook's D.
Distance
- Analyze residuals.
- Pay attention to standardized or studentized residuals > 2.5; these shouldn't occur in more than 5% of cases.
- Tells you which cases are not predicted well by the regression analysis -- you can learn from this in itself.
- Necessary to test MR assumptions
  - homoscedasticity
  - normality of errors
Distance
- Unstandardized Residuals
  - The difference between an observed value and the value predicted by the model. The mean is 0.
- Standardized Residuals
  - The residual divided by an estimate of its standard error. Standardized residuals have a mean of 0 and a standard deviation of 1.
- Studentized Residuals
  - The residual divided by an estimate of its standard deviation that varies from case to case, depending on the leverage of each case's predictor values in determining model fit. They have a mean of 0 and a standard deviation slightly larger than 1.
Distance
- Deleted Residuals
  - The residual for a case that is excluded from the calculation of the regression coefficients. It is the difference between the value of the dependent variable and the adjusted predicted value.
- Studentized Deleted Residuals
  - A studentized residual with the effect of the observation deleted from the standard error. The residual can be large due to distance, leverage, or influence. The mean is 0 and the variance is slightly greater than 1 (formulas for all of these residual types are sketched below).
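A minimal numpy sketch of how these residual types can be computed from the hat matrix (textbook formulas; SPSS uses the same labels, but this is an illustration, not its exact implementation):

    import numpy as np

    def residual_types(X, y):
        # Fit y on X (with an intercept) and return the residual variants discussed above.
        Xd = np.column_stack([np.ones(len(y)), X])
        H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T          # hat matrix
        h = np.diag(H)                                    # leverage of each case
        e = y - H @ y                                     # unstandardized residuals (mean 0)
        n, p = Xd.shape
        s = np.sqrt(e @ e / (n - p))                      # standard error of estimate
        standardized = e / s                              # mean 0, SD about 1
        studentized = e / (s * np.sqrt(1 - h))            # SE varies with each case's leverage
        deleted = e / (1 - h)                             # residual from a fit excluding the case
        s_i = np.sqrt((e @ e - e**2 / (1 - h)) / (n - p - 1))   # sigma estimated without case i
        studentized_deleted = e / (s_i * np.sqrt(1 - h))  # t-like; flags poorly fit cases
        return e, standardized, studentized, deleted, studentized_deleted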
Distance -- example
- Open mregression1/example2c.sav.
- Regress problems on peak, week, and index.
- Under Statistics, select estimates, covariance matrix, and model fit.
- Under Save, select unstandardized predicted values and all residuals (unstandardized, standardized, studentized, deleted, and studentized deleted).
- Click OK.
Distance -- example output
- Interpret the bs and betas. Compare the betas with the correlations.
  - Zero-order correlations
  - Validity coefficients
- Why is the standard error of estimate different from the standard deviation of the unstandardized residuals?
- Note the casewise diagnostics compared to the saved values.
Leverage (hat diagnostic, "hat diag")
- Tells you about outliers on the X variables.
- Note that this can detect so-called multivariate outliers, that is, cases that are not outliers on any one X variable but are outliers on combinations of X variables. Example: someone who is 60 inches tall and weighs 190 pounds.
- Guideline: pay attention to cases with centered leverage that stands out or is greater than 2k/n for large samples or 3k/n for small samples (.04 in this case). (SPSS prints the centered leverage for each case and the average centered leverage across cases.)
Leverage (hat diagnostic, "hat diag") (cont.)
- Possible leverage values range from a low of 1/N to a high of 1.0.
- The mean of the leverage values is k/N.
- Rerun the regression but save leverage values (only).
- Examine leverage values > .04 (see the sketch below).
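A minimal numpy sketch of the leverage values behind this guideline (note: SPSS's saved value is the centered leverage, h - 1/N, which is what the 2k/n and 3k/n cutoffs refer to; this is an illustration, not SPSS's code):

    import numpy as np

    def leverage(X):
        # Hat (leverage) values for a design matrix with an intercept added.
        Xd = np.column_stack([np.ones(len(X)), X])
        h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)   # ranges from 1/N to 1.0
        return h, h - 1.0 / len(X)                           # raw and centered leverage

    # Usage with k predictors: flag cases whose centered leverage exceeds 2*k/N
    # (large samples) or 3*k/N (small samples), e.g. .04 in the class example.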
Influence Statistics
- Cook's D
  - A measure of how much MSresidual would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's D indicates that excluding a case from computation of the regression statistics changes the coefficients substantially.
- DfBeta(s)
  - The difference in beta value is the change in the regression coefficient that results from the exclusion of a particular case. A value is computed for each term in the model, including the constant.
- Standardized DfBeta(s)
  - The standardized difference in the beta value: the change in the regression coefficient that results from the exclusion of a particular case. You may want to examine cases with absolute values greater than 2/sqrt(N), where N is the number of cases.
Influence
- There is no general rule for what constitutes a large value of Cook's D.
  - Cook's D > 1 is unusual.
  - Look for cases that have a large Cook's D relative to other cases.
- Rerun the analyses and save Cook's D, DfBetas, and standardized DfBetas (only); a sketch of the underlying computations follows.
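A minimal numpy sketch of Cook's D and standardized DfBetas (textbook formulas; the leave-one-out loop is written for clarity, not speed):

    import numpy as np

    def cooks_d(X, y):
        # Cook's D from the residuals and leverage of a single full-sample fit.
        Xd = np.column_stack([np.ones(len(y)), X])
        H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
        h = np.diag(H)
        e = y - H @ y
        p = Xd.shape[1]                               # number of coefficients (k + 1)
        mse = e @ e / (len(y) - p)
        return (e**2 / (p * mse)) * (h / (1 - h)**2)  # D > 1 is unusual

    def standardized_dfbetas(X, y):
        # Standardized change in each coefficient when each case is dropped in turn.
        Xd = np.column_stack([np.ones(len(y)), X])
        b_full, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        se_scale = np.sqrt(np.diag(np.linalg.inv(Xd.T @ Xd)))
        out = np.empty((len(y), Xd.shape[1]))
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            b_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
            resid = y[keep] - Xd[keep] @ b_i
            s_i = np.sqrt(resid @ resid / (keep.sum() - Xd.shape[1]))  # sigma without case i
            out[i] = (b_full - b_i) / (s_i * se_scale)
        return out   # flag rows with any |value| > 2 / sqrt(N)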
Suggestions for Handling Outliers
- Check for recording errors.
- Determine if they are legitimate cases for the population you intended to sample.
- Transform variables. You sacrifice interpretation here, and it doesn't help much for floor or ceiling effects.
- Trimming (delete extreme cases)?
- Winsorizing (assign extreme cases the highest value considered reasonable; see the sketch below)
  - e.g., if someone reports 99 drinks/week, and the next heaviest drinker is 50, change the 99 to 50.
- Run analyses with and without the outliers. If the conclusions don't change, leave them in. If they do change and you take them out, provide justification for removal and note how they affected the results.
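A tiny numpy sketch of the winsorizing example above versus trimming (the values are hypothetical):

    import numpy as np

    drinks = np.array([3, 8, 12, 20, 35, 50, 99])   # hypothetical drinks/week, one extreme case

    cap = 50                                        # highest value judged reasonable
    winsorized = np.clip(drinks, None, cap)         # the 99 becomes 50; other values unchanged
    trimmed = drinks[drinks <= cap]                 # trimming deletes the extreme case instead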
Collinearity Among the Predictors
- Identifying the Source(s) of Collinearity
  - Tolerance
  - Variance Inflation Factor
  - Condition Indices and Variance Proportions
- Handling Collinearity
Collinearity
- Collinearity
  - We want the predictors to be highly correlated with the dependent variable.
  - We do not want the predictors to be highly correlated with each other.
  - Collinearity occurs when a predictor is too highly correlated with one or more of the other predictors.
- Impact of Collinearity
  - The regression coefficients are very sensitive to minor changes in the data.
  - The regression coefficients have large standard errors, which lead to low power for the predictors.
  - In the extreme case, singularity, you cannot calculate the regression equation.
Tolerance
- Tolerance tells us
  - The amount of overlap between the predictor and all other remaining predictors.
  - The degree of instability in the regression coefficients.
- Tolerance values less than 0.10 are often considered to be an indication of collinearity.
Variance Inflation Factor
- The VIF tells us
  - The degree to which the standard error of the predictor is increased due to the predictor's correlation with the other predictors in the model.
- VIF values greater than 10 (or Tolerance values less than 0.10) are often considered to be an indication of collinearity (see the sketch below).
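A minimal numpy sketch of how Tolerance and VIF are defined: regress each predictor on the remaining predictors, then Tolerance_j = 1 - R_j² and VIF_j = 1 / Tolerance_j (an illustration, not SPSS's code):

    import numpy as np

    def tolerance_vif(X):
        # X is an N x k matrix of predictors (no intercept column).
        n, k = X.shape
        tol = np.empty(k)
        for j in range(k):
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # the other predictors
            b, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
            resid = X[:, j] - Z @ b
            ss_total = np.sum((X[:, j] - X[:, j].mean()) ** 2)
            tol[j] = resid @ resid / ss_total          # 1 - R_j^2
        return tol, 1.0 / tol                          # flag Tolerance < .10, i.e. VIF > 10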
Condition Indices and Variance Proportions
- Taken together, they provide information about
  - whether collinearity is a concern
  - if collinearity is a concern, which predictors are too highly correlated
- Weak dependencies have condition indices around 5-10 and two or more variance proportions greater than 0.50.
- Strong dependencies have condition indices around 30 or higher and two or more variance proportions greater than 0.50.
Collinearity diagnostics
- Eigenvalue -- the amount of total variation that can be explained by one dimension among the variables -- when several are close to 0, this indicates high multicollinearity. Ideal situation: all 1s.
- Condition index -- the square root of the ratio of the largest eigenvalue to each successive eigenvalue; > 15 indicates a possible problem and > 30 indicates a serious problem with multicollinearity.
Collinearity diagnostics (cont.)
- Variance proportions -- the proportion of the variance of each coefficient explained by a given dimension -- multicollinearity can be a problem when a dimension explains a high proportion of the variance in more than one variable. The proportions of variance for each variable show the damage multicollinearity does to the estimation of the regression coefficient for each (see the sketch below).
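A minimal numpy sketch of these diagnostics in the SPSS style, where the columns of the design matrix (including the constant) are rescaled to unit length before the eigen-analysis (an illustration, not SPSS's code):

    import numpy as np

    def collinearity_diagnostics(X):
        Z = np.column_stack([np.ones(len(X)), X])            # add the constant column
        Z = Z / np.linalg.norm(Z, axis=0)                    # scale each column to unit length
        eigvals, V = np.linalg.eigh(Z.T @ Z)                 # eigenvalues/vectors of scaled X'X
        order = np.argsort(eigvals)[::-1]                    # largest eigenvalue first
        eigvals, V = eigvals[order], V[:, order]
        cond_index = np.sqrt(eigvals[0] / eigvals)           # sqrt(largest / each eigenvalue)
        phi = V ** 2 / eigvals                               # variance components per dimension
        var_prop = phi / phi.sum(axis=1, keepdims=True)      # rows: coefficients, cols: dimensions
        return eigvals, cond_index, var_prop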
Example
- Regress problems on week, peak, and index.
- Under Statistics, select collinearity diagnostics.
- Click OK.
- Examine tolerance, VIF, and the collinearity diagnostics.
Collinearity in practice
- When is collinearity a problem?
  - When you have predictors that are VERY highly correlated (> .7).
- Centering, product terms, and interactions affect collinearity and the interpretation of the coefficients (see the sketch below).
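A small numpy sketch of why centering matters when product (interaction) terms are added: the uncentered product is typically highly correlated with its components, while the centered product is not (the data here are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(5, 1, 200)                       # hypothetical predictors with nonzero means
    x2 = rng.normal(3, 1, 200)

    raw_product = x1 * x2                            # collinear with x1 and x2
    x1c, x2c = x1 - x1.mean(), x2 - x2.mean()        # mean-center first
    centered_product = x1c * x2c                     # much less correlated with the main effects

    print(np.corrcoef(x1, raw_product)[0, 1])        # typically large
    print(np.corrcoef(x1c, centered_product)[0, 1])  # typically near zero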
Handling Collinearity
- Combine the information contained in your predictors: linear combinations (e.g., the mean of z-scored predictors), factor analysis, SEM (see the sketch below).
- Delete some of the predictors that are too highly correlated.
- Collect additional data, in the hope that the additional data will reduce the collinearity.
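A tiny numpy sketch of the first suggestion, combining two highly correlated predictors into the mean of their z-scores (the variable names and values are hypothetical):

    import numpy as np

    def zscore(x):
        return (x - x.mean()) / x.std(ddof=1)

    # Hypothetical: x1 and x2 are two highly correlated predictors
    x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    x2 = np.array([1.5, 4.5, 5.5, 8.5, 9.5])
    composite = (zscore(x1) + zscore(x2)) / 2     # use this single composite in place of both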