Title: Detecting and reducing multicollinearity
1Detecting and reducing multicollinearity
2Detecting multicollinearity
3Common methods of detection
- Realized effects (changes in coefficients, changes in standard errors of coefficients, changes in sequential sums of squares) of multicollinearity.
- Non-significant t-tests for all of the slopes but a significant overall F-test.
- Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).
- Variance inflation factors (VIF).
4The first variance at issue
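For the model
$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is
$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$$
5The second variance at issue
For the model
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is
$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.
6The ratio of the two variances
$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2}$$
7Variance inflation factors
The variance inflation factor for the kth predictor is
$$VIF_k = \frac{1}{1 - R_k^2}$$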
8Variance inflation factors (VIFk)
- A measure of how much the variance of the estimated regression coefficient bk is inflated by the existence of correlation among the predictor variables in the model.
- VIFs exceeding 4 warrant investigation.
- VIFs exceeding 10 are signs of serious multicollinearity.
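The two cutoffs above can be applied mechanically. The sketch below is only an illustration: `screen_vif` is a hypothetical helper, and the VIF values are the ones reported for the six predictors in the blood pressure example that follows.

```python
# VIFs copied from the blood pressure regression on all six predictors.
vifs = {"Age": 1.8, "Weight": 8.4, "BSA": 5.3, "Dur": 1.2, "Pulse": 4.4, "Stress": 1.8}


def screen_vif(vif, investigate=4.0, serious=10.0):
    """Label a VIF using the two rule-of-thumb cutoffs (hypothetical helper)."""
    if vif > serious:
        return "serious multicollinearity"
    if vif > investigate:
        return "warrants investigation"
    return "below both cutoffs"


labels = {name: screen_vif(value) for name, value in vifs.items()}
for name, label in labels.items():
    print(f"{name:7s} VIF = {vifs[name]:4.1f}  {label}")
```

Under these cutoffs Weight, BSA and Pulse are flagged for investigation, and none of the six reaches the "serious" level.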
9Blood pressure example
n = 20 hypertensive individuals
p-1 = 6 predictor variables
10Blood pressure example
             BP     Age  Weight     BSA  Duration   Pulse
Age       0.659
Weight    0.950   0.407
BSA       0.866   0.378   0.875
Duration  0.293   0.344   0.201   0.131
Pulse     0.721   0.619   0.659   0.465     0.402
Stress    0.164   0.368   0.034   0.018     0.312   0.506

Blood pressure (BP) is the response.
11Regress y = BP on all 6 predictors
Predictor      Coef   SE Coef       T      P  VIF
Constant    -12.870     2.557   -5.03  0.000
Age         0.70326   0.04961   14.18  0.000  1.8
Weight      0.96992   0.06311   15.37  0.000  8.4
BSA           3.776     1.580    2.39  0.033  5.3
Dur         0.06838   0.04844    1.41  0.182  1.2
Pulse      -0.08448   0.05161   -1.64  0.126  4.4
Stress     0.005572  0.003412    1.63  0.126  1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       6  557.844  92.974  560.64  0.000
Residual Error  13    2.156   0.166
Total           19  560.000
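The VIF column can be cross-checked without refitting any regression: for standardized predictors, VIF_k equals the kth diagonal element of the inverse of the predictor correlation matrix. The sketch below assumes the pairwise correlations reported for the blood pressure data (rounded to three decimals, so the results should only approximately match the Minitab column). The `invert` helper is an illustrative pure-Python Gauss-Jordan routine, not part of the original analysis.

```python
# Pairwise correlations among the six predictors (BP, the response, is excluded).
names = ["Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]
lower = [
    [],
    [0.407],
    [0.378, 0.875],
    [0.344, 0.201, 0.131],
    [0.619, 0.659, 0.465, 0.402],
    [0.368, 0.034, 0.018, 0.312, 0.506],
]


def full_matrix(lower_rows):
    """Expand a lower-triangular listing into a symmetric matrix with unit diagonal."""
    size = len(lower_rows)
    matrix = [[1.0] * size for _ in range(size)]
    for i, row in enumerate(lower_rows):
        for j, value in enumerate(row):
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (illustrative helper)."""
    size = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


R_inv = invert(full_matrix(lower))
vif = {name: R_inv[k][k] for k, name in enumerate(names)}
for name, value in vif.items():
    print(f"{name:9s} VIF = {value:.1f}")
```

Weight should come out with the largest VIF, in line with its strong correlation with BSA (0.875).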
12Regress x2 = weight on 5 predictors
Predictor      Coef  SE Coef      T      P  VIF
Constant     19.674    9.465   2.08  0.057
Age         -0.1446   0.2065  -0.70  0.495  1.7
BSA          21.422    3.465   6.18  0.000  1.4
Dur          0.0087   0.2051   0.04  0.967  1.2
Pulse        0.5577   0.1599   3.49  0.004  2.4
Stress     -0.02300  0.01308  -1.76  0.101  1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       5  308.839  61.768  20.77  0.000
Residual Error  14   41.639   2.974
Total           19  350.478
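13The variance inflation factor calculated by its definition
$$VIF_{Weight} = \frac{1}{1 - R^2_{Weight}} = \frac{1}{1 - 0.881} = 8.40$$
Here $R^2_{Weight}$ = 88.1% is the R-Sq from regressing weight on the other five predictors.
The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.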
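As a quick arithmetic check, the same factor can be computed from the figures reported when weight is regressed on the other five predictors. This is a sketch with illustrative variable names. The second calculation uses the identity 1/(1 - R²) = SSTO/SSE and differs slightly because the reported R-Sq is rounded.

```python
# R-Sq from regressing Weight on Age, BSA, Dur, Pulse and Stress.
r_sq_weight = 0.881
vif_weight = 1.0 / (1.0 - r_sq_weight)

# Same quantity from the ANOVA table: 1 / (1 - SSR/SSTO) = SSTO / SSE.
ssto, sse = 350.478, 41.639
vif_from_ss = ssto / sse

print(f"VIF from rounded R-Sq:    {vif_weight:.2f}")
print(f"VIF from sums of squares: {vif_from_ss:.2f}")
```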
14The pairwise correlations
             BP     Age  Weight     BSA  Duration   Pulse
Age       0.659
Weight    0.950   0.407
BSA       0.866   0.378   0.875
Duration  0.293   0.344   0.201   0.131
Pulse     0.721   0.619   0.659   0.465     0.402
Stress    0.164   0.368   0.034   0.018     0.312   0.506

Blood pressure (BP) is the response.
15Regress y = BP on age, weight, duration and stress
Predictor      Coef   SE Coef      T      P  VIF
Constant    -15.870     3.195  -4.97  0.000
Age         0.68374   0.06120  11.17  0.000  1.5
Weight      1.03413   0.03267  31.65  0.000  1.2
Dur         0.03989   0.06449   0.62  0.545  1.2
Stress     0.002184  0.003794   0.58  0.573  1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       4  555.45  138.86  458.28  0.000
Residual Error  15    4.55    0.30
Total           19  560.00
16Reducing data-based multicollinearity
17Data-based multicollinearity
- Multicollinearity that results from a poorly
designed experiment, reliance on purely
observational data, or the inability to
manipulate the system on which you collect the
data.
18Some methods
- Modify the regression model by eliminating one or more predictor variables.
- Collect additional data under different experimental or observational conditions.
19(Modified!) Allen Cognitive Level (ACL) Study
- Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit
- Response y = ACL score
- x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
- x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
- x3 = score on Symbol-Digit Modalities Test (SDMT)
20Allen Cognitive Level (ACL) Study on 23 patients
21Strong correlation between Vocab and Abstract
Pearson correlation of Vocab and Abstract = 0.990
22Regress y = ACL on SDMT, Vocab, and Abstract
Predictor      Coef  SE Coef      T      P   VIF
Constant      3.747    1.342   2.79  0.012
SDMT        0.02326  0.01273   1.83  0.083   1.7
Vocab        0.0283   0.1524   0.19  0.855  49.3
Abstract    -0.0138   0.1006  -0.14  0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3   3.6854  1.2285  2.28  0.112
Residual Error  19  10.2476  0.5393
Total           22  13.9330
23Allen Cognitive Level (ACL) Study on 69 patients
24Plot after having collected more data
Pearson correlation of Vocab and Abstract = 0.698
25Regress y = ACL on SDMT, Vocab, and Abstract
Predictor       Coef   SE Coef      T      P  VIF
Constant      3.9463    0.3381  11.67  0.000
SDMT        0.027404  0.007168   3.82  0.000  1.6
Vocab       -0.01740   0.01808  -0.96  0.339  2.1
Abstract     0.01218   0.01159   1.05  0.297  2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3  12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496
26Reducing structural multicollinearity
- In the context of polynomial regression models
27Structural multicollinearity
- Multicollinearity that is a mathematical artifact
caused by creating new predictors from other
predictors, such as creating the predictor x^2
from the predictor x.
28Example
- (General research question) What is the impact of exercise on the human immune system?
- (Specific research question) How is the amount of immunoglobulin in blood (y) related to maximal oxygen uptake (x)?
29Scatter plot
30A quadratic polynomial regression function
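$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
where
- $y_i$ = amount of immunoglobulin in blood (mg)
- $x_i$ = maximal oxygen uptake (ml/kg)
- typical assumptions about error terms (INE: independent, normally distributed, equal variance)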
31Estimated quadratic function
32Interpretation of the regression coefficients
- If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.
- b1 is the slope of the tangent line at x = 0.
- b2 indicates the up/down direction of the curve
- b2 < 0 means the curve is concave down
- b2 > 0 means the curve is concave up
33Regress y = igg on oxygen and oxygen^2
The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor      Coef  SE Coef      T      P   VIF
Constant    -1464.4    411.4  -3.56  0.001
oxygen        88.31    16.47   5.36  0.000  99.9
oxygensq    -0.5362   0.1582  -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029
34Structural multicollinearity
Pearson correlation of oxygen and oxygensq = 0.995
35Center the predictors
Mean of oxygen = 50.637

oxygen   oxcent  oxcentsq
  34.6  -16.037   257.185
  45.0   -5.637    31.776
  62.3   11.663   136.026
  58.9    8.263    68.277
  42.5   -8.137    66.211
  44.3   -6.337    40.158
  67.9   17.263   298.011
  58.5    7.863    61.827
  35.6  -15.037   226.111
  49.6   -1.037     1.075
  33.0  -17.637   311.064
36Wow! It really works!
Pearson correlation of oxcent and oxcentsq = 0.219
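The effect of centering can be sketched in a few lines. The example below assumes only the 11 oxygen values listed in the centering table (the full data set has 30 observations), so it will not reproduce the correlations 0.995 and 0.219 exactly, and even the sign of the centered correlation can differ on a subset. It should still show the same pattern: a near-perfect correlation before centering and a weak one after. The `pearson` helper is illustrative.

```python
from math import sqrt

# The 11 oxygen values shown in the centering table (subset of the 30 observations).
oxygen = [34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0]
mean_oxygen = 50.637  # mean of all 30 observations, as reported


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


oxygensq = [x ** 2 for x in oxygen]          # uncentered squared term
oxcent = [x - mean_oxygen for x in oxygen]   # centered predictor
oxcentsq = [c ** 2 for c in oxcent]          # squared centered predictor

r_raw = pearson(oxygen, oxygensq)
r_centered = pearson(oxcent, oxcentsq)
print(f"corr(oxygen, oxygensq) = {r_raw:.3f}")
print(f"corr(oxcent, oxcentsq) = {r_centered:.3f}")
```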
37A better quadratic polynomial regression function
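$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_2^* (x_i - \bar{x})^2 + \varepsilon_i$$
where
- $x_i - \bar{x}$ = centered maximal oxygen uptake (oxcent), and
- $y_i$ = amount of immunoglobulin in blood (mg)
- typical assumptions about error terms (INE: independent, normally distributed, equal variance)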
38Regress y = igg on oxcent and oxcent^2
The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor      Coef  SE Coef      T      P  VIF
Constant    1632.20    29.35  55.61  0.000
oxcent       34.000    1.689  20.13  0.000  1.1
oxcentsq    -0.5362   0.1582  -3.39  0.002  1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029
39Interpretation of the regression coefficients
- b0 is the predicted response at the predictor mean.
- b1 is the estimated slope of the tangent line at the predictor mean and, often, similar to the estimated slope in the simple model.
- b2 indicates the up/down direction of the curve
- b2 < 0 means the curve is concave down
- b2 > 0 means the curve is concave up
40Estimated regression function
41Similar estimates of coefficients from first-order linear model
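42The relationship between the two forms of the model
Centered model
$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_2^* (x_i - \bar{x})^2 + \varepsilon_i$$
Original model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
where
$$\beta_0 = \beta_0^* - \beta_1^* \bar{x} + \beta_2^* \bar{x}^2, \qquad \beta_1 = \beta_1^* - 2 \beta_2^* \bar{x}, \qquad \beta_2 = \beta_2^*$$
43 Mean of oxygen = 50.637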
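Expanding the centered quadratic b0* + b1*(x - xbar) + b2*(x - xbar)^2 and collecting powers of x gives the original-scale coefficients: b0 = b0* - b1* xbar + b2* xbar^2, b1 = b1* - 2 b2* xbar, and b2 = b2*. The sketch below plugs in the reported centered estimates and the mean of oxygen. Variable names are illustrative, and small differences from the uncentered output (-1464.4 and 88.31) come from rounding in the reported figures.

```python
x_bar = 50.637                              # mean of oxygen
b0_c, b1_c, b2_c = 1632.20, 34.000, -0.5362  # estimates from the centered fit

# Back-transform to the coefficients of the uncentered quadratic.
b0 = b0_c - b1_c * x_bar + b2_c * x_bar ** 2
b1 = b1_c - 2 * b2_c * x_bar
b2 = b2_c

print(f"b0 = {b0:.1f}, b1 = {b1:.2f}, b2 = {b2:.4f}")
```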
44Model evaluation
45Model evaluation
46Model use: What is the predicted IgG if maximal oxygen uptake is 90?
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        2139.6   219.2  (1689.8, 2589.5)  (1639.6, 2639.7) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1          39.4      1549
There is an even greater danger in extrapolation
when modeling data with a polynomial function,
because of changes in direction.
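The fitted value above, and the change of direction that makes extrapolation risky, can both be checked from the centered estimates. This is a sketch with illustrative names, using the reported (rounded) coefficients. The turning point of the fitted parabola sits where the slope b1* + 2 b2*(x - xbar) equals zero.

```python
x_bar = 50.637                              # mean of oxygen
b0_c, b1_c, b2_c = 1632.20, 34.000, -0.5362  # estimates from the centered fit

x_new = 90.0
oxcent_new = x_new - x_bar
fit = b0_c + b1_c * oxcent_new + b2_c * oxcent_new ** 2

# Oxygen value at which the fitted curve stops rising and starts falling.
turning_point = x_bar - b1_c / (2 * b2_c)

print(f"oxcent = {oxcent_new:.1f}, oxcentsq = {oxcent_new ** 2:.0f}")
print(f"predicted IgG at oxygen = 90: {fit:.1f}")
print(f"fitted curve turns downward near oxygen = {turning_point:.1f}")
```

With these estimates the turning point falls below 90, so the prediction is made on the descending branch of the fitted curve, at an X value that Minitab flags as very extreme.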
47The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order
model and then explore whether a lower-order
(simpler) model is adequate.
Is a first-order linear model (line) adequate?
48The hierarchical approach to model fitting
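However, if a polynomial term of a given order
is retained, then all related lower-order terms
are also retained. That is, if the quadratic term
is significant, you would use the regression
function
$$E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$$
keeping the linear term whether or not its own t-test is significant.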