Title: Multiple Regression 2
Solving for β and b
- The weight for predictor xj will be a function of
  - The correlation between xj and y.
  - The extent to which xj's relationship with y is redundant with other predictors' relationships with y (collinearity).
  - The correlations between y and all other predictors.
  - The correlations between xj and all other predictors.
Solving for β and b: the two-variable case
- β1 is the slope for X1 controlling for the other independent variable, X2.
- β2 is computed the same way; swap the X1s and X2s.
- Compare to the bivariate slope (see the sketch below).
- What happens to b1 if X1 and X2 are totally uncorrelated?
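A minimal sketch of the standard two-predictor solution, in conventional notation (these are the textbook formulas, not quoted from the slides):

    β1 = (r_y1 - r_12 r_y2) / (1 - r_12²)
    β2 = (r_y2 - r_12 r_y1) / (1 - r_12²)
    b1 = β1 (s_y / s_1)

If r_12 = 0, β1 reduces to r_y1 and b1 reduces to the ordinary bivariate slope, which answers the question above.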
Solving for β and b: the two-variable case (cont.)
- Solving for β and b is relatively simple with two variables but becomes increasingly complex with more variables and requires differential calculus to derive the formulas. Matrix algebra can be used to simplify the process.
Matrix Equations
- R² = Σ (r_yj β_j)
  - where each r_yj is the correlation between the DV and the jth IV
  - and each β_j is the standardized regression coefficient for that IV.
- R² = R_yj B_j
  - where R_yj is the row matrix of correlations between the DV and the k IVs
  - and B_j is the column matrix of standardized regression coefficients for the IVs.
- B_j = R_jj⁻¹ R_jy
- In other words, the matrix of standardized regression coefficients is the matrix of correlations between the DV and the IVs "divided by" (premultiplied by the inverse of) the matrix of correlations among the IVs (see the numpy sketch below).
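As a concrete illustration, a minimal numpy sketch of these matrix equations (the correlation values are hypothetical, chosen only for the example):

    import numpy as np

    # Hypothetical correlations among k = 3 IVs (Rjj) and between each IV and the DV (Rjy)
    Rjj = np.array([[1.0, 0.3, 0.4],
                    [0.3, 1.0, 0.2],
                    [0.4, 0.2, 1.0]])
    Rjy = np.array([0.5, 0.4, 0.6])

    B  = np.linalg.solve(Rjj, Rjy)   # B = Rjj^-1 Rjy, the standardized coefficients
    R2 = Rjy @ B                     # R^2 = sum of r_yj * beta_j

    print(B, R2)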
Tests of Regression Coefficients
- Null hypothesis: the predictor Xj is not related to Y when the other predictors are held constant.
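For reference, the test statistic (a standard result, not spelled out on the slide) is t = b_j / SE(b_j) with df = N - k - 1; under the null hypothesis βj = 0, so a significant t means Xj contributes beyond the other predictors.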
Further Interpretation of Regression Coefficients
- Regression coefficients in multiple regression (unstandardized and standardized) are considered partial regression coefficients because each coefficient is calculated after controlling for the other predictors in the model.
- Tests of regression coefficients represent a test of the unique contribution of that variable in predicting y over and above all other predictor variables in the model.
Assumptions
- Predictors are linearly related to the criterion.
- Normality of errors -- residuals are normally distributed around zero.
- Multivariate normal distribution -- the multivariate extension of bivariate normality.
- Homoscedasticity.
- Regression diagnostics check on these assumptions.
Regression Diagnostics
- Detecting multivariate outliers
- Distance, leverage, and influence
- Evaluating Collinearity
Regression Diagnostics
- Methods for identifying problems in your multiple regression analysis -- a good idea for any multiple regression analysis.
- Can help identify
  - violations of assumptions
  - outliers and overly influential cases -- cases you might want to delete or transform
  - important variables you've omitted from the analysis
Three Classes of MR Diagnostic Statistics
- 1. Distance -- detects outliers on the dependent variable and assumption violations -- the primary measure is the residual (Y - Ŷ), the standardized residual (i.e., put in terms of z-scores), or the studentized residual (i.e., put in terms of t-scores).
- 2. Leverage -- identifies potential outliers on the independent variables -- the primary measure is the leverage statistic or hat diagnostic.
Three Classes of MR Diagnostic Statistics (cont.)
- 3. Influence -- combines distance and leverage to identify unusually influential observations (i.e., observations or cases that have a big influence on the MR equation) -- the measure we will use is Cook's D.
Distance
- Analyze residuals.
- Pay attention to standardized or studentized residuals > 2.5; these shouldn't occur in more than 5% of cases.
- Tells you which cases are not predicted well by the regression analysis -- you can learn from this in itself.
- Necessary to test MR assumptions
  - homoscedasticity
  - normality of errors
Distance
- Unstandardized Residuals
  - The difference between an observed value and the value predicted by the model. The mean is 0.
- Standardized Residuals
  - The residual divided by an estimate of its standard error. Standardized residuals have a mean of 0 and a standard deviation of 1.
- Studentized Residuals
  - The residual divided by an estimate of its standard deviation that varies from case to case, depending on the leverage of each case's predictor values in determining model fit. They have a mean of 0 and a standard deviation slightly larger than 1.
Distance
- Deleted Residuals
  - The residual for a case that is excluded from the calculation of the regression coefficients. It is the difference between the value of the dependent variable and the adjusted predicted value.
- Studentized Deleted Residuals
  - A studentized residual with the effect of the observation deleted from the standard error. The residual can be large due to distance, leverage, or influence. The mean is 0 and the variance is slightly greater than 1 (formulas for all of these residual types are sketched below).
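A minimal numpy sketch of how these residual types can be computed from the hat matrix (textbook formulas; SPSS uses the same labels, but this is an illustration, not its exact implementation):

    import numpy as np

    def residual_types(X, y):
        # Fit y on X (with an intercept) and return the residual variants discussed above.
        Xd = np.column_stack([np.ones(len(y)), X])
        H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T          # hat matrix
        h = np.diag(H)                                    # leverage of each case
        e = y - H @ y                                     # unstandardized residuals (mean 0)
        n, p = Xd.shape
        s = np.sqrt(e @ e / (n - p))                      # standard error of estimate
        standardized = e / s                              # mean 0, SD about 1
        studentized = e / (s * np.sqrt(1 - h))            # SE varies with each case's leverage
        deleted = e / (1 - h)                             # residual from a fit excluding the case
        s_i = np.sqrt((e @ e - e**2 / (1 - h)) / (n - p - 1))   # sigma estimated without case i
        studentized_deleted = e / (s_i * np.sqrt(1 - h))  # t-like; flags poorly fit cases
        return e, standardized, studentized, deleted, studentized_deleted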
Distance -- example
- Open mregression1/example2c.sav.
- Regress problems on peak, week, and index.
- Under Statistics, select estimates, covariance matrix, and model fit.
- Under Save, select unstandardized predicted values and all residuals (unstandardized, standardized, studentized, deleted, and studentized deleted).
- Click OK.
Distance -- example output
- Interpret the bs and betas. Compare the betas with the correlations.
  - Zero-order correlations
  - Validity coefficients
- Why is the standard error of estimate different from the standard deviation of the unstandardized residuals?
- Note the casewise diagnostics compared to the saved values.
Leverage (hat diagnostic, "hat diag")
- Tells you about outliers on the X variables.
- Note that this can detect so-called multivariate outliers, that is, cases that are not outliers on any one X variable but are outliers on combinations of X variables. Example: someone who is 60 inches tall and weighs 190 pounds.
- Guideline: pay attention to cases with centered leverage that stands out or is greater than 2k/n for large samples or 3k/n for small samples (.04 in this case). (SPSS prints the centered leverage for each case and the average centered leverage across cases.)
Leverage (hat diagnostic, "hat diag") (cont.)
- Possible leverage values range from a low of 1/N to a high of 1.0.
- The mean of the leverage values is k/N.
- Rerun the regression but save leverage values (only).
- Examine leverage values > .04 (see the sketch below).
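A minimal numpy sketch of the leverage values behind this guideline (note: SPSS's saved value is the centered leverage, h - 1/N, which is what the 2k/n and 3k/n cutoffs refer to; this is an illustration, not SPSS's code):

    import numpy as np

    def leverage(X):
        # Hat (leverage) values for a design matrix with an intercept added.
        Xd = np.column_stack([np.ones(len(X)), X])
        h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)   # ranges from 1/N to 1.0
        return h, h - 1.0 / len(X)                           # raw and centered leverage

    # Usage with k predictors: flag cases whose centered leverage exceeds 2*k/N
    # (large samples) or 3*k/N (small samples), e.g. .04 in the class example.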
Influence Statistics
- Cook's D
  - A measure of how much MSresidual would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's D indicates that excluding a case from computation of the regression statistics changes the coefficients substantially.
- DfBeta(s)
  - The difference in beta value is the change in the regression coefficient that results from the exclusion of a particular case. A value is computed for each term in the model, including the constant.
- Standardized DfBeta(s)
  - The standardized difference in the beta value: the change in the regression coefficient that results from the exclusion of a particular case. You may want to examine cases with absolute values greater than 2/sqrt(N), where N is the number of cases.
Influence
- There is no general rule for what constitutes a large value of Cook's D.
  - Cook's D > 1 is unusual.
  - Look for cases that have a large Cook's D relative to other cases.
- Rerun the analyses and save Cook's D, DfBetas, and standardized DfBetas (only); a sketch of the underlying computations follows.
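A minimal numpy sketch of Cook's D and standardized DfBetas (textbook formulas; the leave-one-out loop is written for clarity, not speed):

    import numpy as np

    def cooks_d(X, y):
        # Cook's D from the residuals and leverage of a single full-sample fit.
        Xd = np.column_stack([np.ones(len(y)), X])
        H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
        h = np.diag(H)
        e = y - H @ y
        p = Xd.shape[1]                               # number of coefficients (k + 1)
        mse = e @ e / (len(y) - p)
        return (e**2 / (p * mse)) * (h / (1 - h)**2)  # D > 1 is unusual

    def standardized_dfbetas(X, y):
        # Standardized change in each coefficient when each case is dropped in turn.
        Xd = np.column_stack([np.ones(len(y)), X])
        b_full, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        se_scale = np.sqrt(np.diag(np.linalg.inv(Xd.T @ Xd)))
        out = np.empty((len(y), Xd.shape[1]))
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            b_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
            resid = y[keep] - Xd[keep] @ b_i
            s_i = np.sqrt(resid @ resid / (keep.sum() - Xd.shape[1]))  # sigma without case i
            out[i] = (b_full - b_i) / (s_i * se_scale)
        return out   # flag rows with any |value| > 2 / sqrt(N)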
Suggestions for Handling Outliers
- Check for recording errors.
- Determine if they are legitimate cases for the population you intended to sample.
- Transform variables. You sacrifice interpretation here, and it doesn't help much for floor or ceiling effects.
- Trimming (delete extreme cases)?
- Winsorizing (assign extreme cases the highest value considered reasonable; see the sketch below)
  - e.g., if someone reports 99 drinks/week, and the next heaviest drinker is 50, change the 99 to 50.
- Run analyses with and without the outliers. If the conclusions don't change, leave them in. If they do change and you take them out, provide justification for removal and note how they affected the results.
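A tiny numpy sketch of the winsorizing example above versus trimming (the values are hypothetical):

    import numpy as np

    drinks = np.array([3, 8, 12, 20, 35, 50, 99])   # hypothetical drinks/week, one extreme case

    cap = 50                                        # highest value judged reasonable
    winsorized = np.clip(drinks, None, cap)         # the 99 becomes 50; other values unchanged
    trimmed = drinks[drinks <= cap]                 # trimming deletes the extreme case instead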
Collinearity Among the Predictors
- Identifying the Source(s) of Collinearity
  - Tolerance
  - Variance Inflation Factor
  - Condition Indices and Variance Proportions
- Handling Collinearity
Collinearity
- Collinearity
  - We want the predictors to be highly correlated with the dependent variable.
  - We do not want the predictors to be highly correlated with each other.
  - Collinearity occurs when a predictor is too highly correlated with one or more of the other predictors.
- Impact of Collinearity
  - The regression coefficients are very sensitive to minor changes in the data.
  - The regression coefficients have large standard errors, which lead to low power for the predictors.
  - In the extreme case, singularity, you cannot calculate the regression equation.
Tolerance
- Tolerance tells us
  - The amount of overlap between the predictor and all other remaining predictors.
  - The degree of instability in the regression coefficients.
- Tolerance values less than 0.10 are often considered to be an indication of collinearity.
Variance Inflation Factor
- The VIF tells us
  - The degree to which the standard error of the predictor is increased due to the predictor's correlation with the other predictors in the model.
- VIF values greater than 10 (or Tolerance values less than 0.10) are often considered to be an indication of collinearity (see the sketch below).
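A minimal numpy sketch of how Tolerance and VIF are defined: regress each predictor on the remaining predictors, then Tolerance_j = 1 - R_j² and VIF_j = 1 / Tolerance_j (an illustration, not SPSS's code):

    import numpy as np

    def tolerance_vif(X):
        # X is an N x k matrix of predictors (no intercept column).
        n, k = X.shape
        tol = np.empty(k)
        for j in range(k):
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # the other predictors
            b, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
            resid = X[:, j] - Z @ b
            ss_total = np.sum((X[:, j] - X[:, j].mean()) ** 2)
            tol[j] = resid @ resid / ss_total          # 1 - R_j^2
        return tol, 1.0 / tol                          # flag Tolerance < .10, i.e. VIF > 10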
Condition Indices and Variance Proportions
- Taken together, they provide information about
  - whether collinearity is a concern
  - if collinearity is a concern, which predictors are too highly correlated
- Weak dependencies have condition indices around 5-10 and two or more variance proportions greater than 0.50.
- Strong dependencies have condition indices around 30 or higher and two or more variance proportions greater than 0.50.
Collinearity diagnostics
- Eigenvalue -- the amount of total variation that can be explained by one dimension among the variables -- when several are close to 0, this indicates high multicollinearity. Ideal situation: all 1s.
- Condition index -- the square root of the ratio of the largest eigenvalue to each successive eigenvalue; > 15 indicates a possible problem and > 30 indicates a serious problem with multicollinearity.
Collinearity diagnostics (cont.)
- Variance proportions -- the proportion of the variance of each coefficient explained by a given dimension -- multicollinearity can be a problem when a dimension explains a high proportion of the variance in more than one variable. The proportions of variance for each variable show the damage multicollinearity does to the estimation of the regression coefficient for each (see the sketch below).
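A minimal numpy sketch of these diagnostics in the SPSS style, where the columns of the design matrix (including the constant) are rescaled to unit length before the eigen-analysis (an illustration, not SPSS's code):

    import numpy as np

    def collinearity_diagnostics(X):
        Z = np.column_stack([np.ones(len(X)), X])            # add the constant column
        Z = Z / np.linalg.norm(Z, axis=0)                    # scale each column to unit length
        eigvals, V = np.linalg.eigh(Z.T @ Z)                 # eigenvalues/vectors of scaled X'X
        order = np.argsort(eigvals)[::-1]                    # largest eigenvalue first
        eigvals, V = eigvals[order], V[:, order]
        cond_index = np.sqrt(eigvals[0] / eigvals)           # sqrt(largest / each eigenvalue)
        phi = V ** 2 / eigvals                               # variance components per dimension
        var_prop = phi / phi.sum(axis=1, keepdims=True)      # rows: coefficients, cols: dimensions
        return eigvals, cond_index, var_prop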
Example
- Regress problems on week, peak, and index.
- Under Statistics, select collinearity diagnostics.
- Click OK.
- Examine tolerance, VIF, and the collinearity diagnostics.
Collinearity in practice
- When is collinearity a problem?
  - When you have predictors that are VERY highly correlated (> .7).
- Centering, product terms, and interactions affect collinearity and the interpretation of the coefficients (see the sketch below).
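A small numpy sketch of why centering matters when product (interaction) terms are added: the uncentered product is typically highly correlated with its components, while the centered product is not (the data here are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(5, 1, 200)                       # hypothetical predictors with nonzero means
    x2 = rng.normal(3, 1, 200)

    raw_product = x1 * x2                            # collinear with x1 and x2
    x1c, x2c = x1 - x1.mean(), x2 - x2.mean()        # mean-center first
    centered_product = x1c * x2c                     # much less correlated with the main effects

    print(np.corrcoef(x1, raw_product)[0, 1])        # typically large
    print(np.corrcoef(x1c, centered_product)[0, 1])  # typically near zero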
Handling Collinearity
- Combine the information contained in your predictors: linear combinations (e.g., the mean of z-scored predictors), factor analysis, SEM (see the sketch below).
- Delete some of the predictors that are too highly correlated.
- Collect additional data, in the hope that the additional data will reduce the collinearity.
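A tiny numpy sketch of the first suggestion, combining two highly correlated predictors into the mean of their z-scores (the variable names and values are hypothetical):

    import numpy as np

    def zscore(x):
        return (x - x.mean()) / x.std(ddof=1)

    # Hypothetical: x1 and x2 are two highly correlated predictors
    x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    x2 = np.array([1.5, 4.5, 5.5, 8.5, 9.5])
    composite = (zscore(x1) + zscore(x2)) / 2     # use this single composite in place of both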