Detecting and reducing multicollinearity - PowerPoint PPT Presentation

Slides: 49
Provided by: lsi4
1
Detecting and reducing multicollinearity
2
Detecting multicollinearity
3
Common methods of detection
  • Realized effects (changes in coefficients,
    changes in standard errors of coefficients,
    changes in sequential sums of squares) of
    multicollinearity.
  • Non-significant t-tests for all of the slopes but
    a significant overall F-test.
  • Significant correlations among pairs of predictor
    variables (correlations, matrix scatter plots).
  • Variance inflation factors (VIF).

4
The first variance at issue
For the model
the variance of the estimated coefficient bk is
5
The second variance at issue
For the model
the variance of the estimated coefficient bk is
6
The ratio of the two variances
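In standard notation, the two variances and their ratio can be sketched as follows. This is the textbook result rather than a copy of the slides: R_k^2 is assumed to denote the R-Sq obtained by regressing the kth predictor on the remaining predictors, and the labels "all predictors" and "x_k alone" are assumptions, since the transcript does not show which model each slide refers to.

```latex
% Model containing x_k alone: y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i
\[
\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}
\]

% Model containing all p-1 predictors:
% y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i,p-1} + \varepsilon_i
\[
\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}
                    \times \frac{1}{1 - R_k^2}
\]

% Ratio of the two variances
\[
\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2}
\]
```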
7
Variance inflation factors
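The variance inflation factor for the kth
predictor is
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$
where R_k^2 is the R-Sq value obtained by regressing
the kth predictor on the remaining predictors.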
8
Variance inflation factors (VIFk)
  • A measure of how much the variance of the
    estimated regression coefficient bk is inflated
    by the existence of correlation among the
    predictor variables in the model.
  • VIFs exceeding 4 warrant investigation.
  • VIFs exceeding 10 are signs of serious
    multicollinearity.
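The rule of thumb above can be tried on simulated data. The following is a minimal sketch, assuming NumPy is available; the function name `vif` and the simulated predictors `x1`, `x2`, `x3` are hypothetical and not part of the presentation. Each VIF is computed by regressing one predictor on the others and applying 1 / (1 - R-Sq).

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n rows, one column per predictor)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for k in range(p):
        y = X[:, k]                                   # predictor k as the response
        others = np.delete(X, k, axis=1)              # remaining predictors
        A = np.column_stack([np.ones(n), others])     # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Simulated predictors (illustrative only): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

In this simulation the two nearly identical predictors should land far above the threshold of 10, while the unrelated predictor should stay close to 1.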

9
Blood pressure example
n = 20 hypertensive individuals
p-1 = 6 predictor variables
10
Blood pressure example
              BP    Age  Weight    BSA  Duration  Pulse
Age        0.659
Weight     0.950  0.407
BSA        0.866  0.378   0.875
Duration   0.293  0.344   0.201  0.131
Pulse      0.721  0.619   0.659  0.465     0.402
Stress     0.164  0.368   0.034  0.018     0.312  0.506
Blood pressure (BP) is the response.
11
Regress y = BP on all 6 predictors

Predictor       Coef   SE Coef       T      P   VIF
Constant     -12.870     2.557   -5.03  0.000
Age          0.70326   0.04961   14.18  0.000   1.8
Weight       0.96992   0.06311   15.37  0.000   8.4
BSA            3.776     1.580    2.39  0.033   5.3
Dur          0.06838   0.04844    1.41  0.182   1.2
Pulse       -0.08448   0.05161   -1.64  0.126   4.4
Stress      0.005572  0.003412    1.63  0.126   1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        6   557.844    92.974   560.64  0.000
Residual Error   13     2.156     0.166
Total            19   560.000
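The summary line of this output (S, R-Sq, R-Sq(adj)) and the overall F statistic all follow from the analysis-of-variance table. A small sketch of that arithmetic, using the sums of squares from the slide; the variable names are illustrative choices, not Minitab's:

```python
import math

# Values copied from the analysis-of-variance table on the slide.
n = 20              # observations
df_reg = 6          # number of predictors
ss_reg = 557.844    # regression sum of squares
ss_err = 2.156      # residual (error) sum of squares

ss_tot = ss_reg + ss_err
df_err = n - 1 - df_reg

ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err

f_stat = ms_reg / ms_err                     # overall F-test statistic
s = math.sqrt(ms_err)                        # residual standard error S
r_sq = ss_reg / ss_tot                       # R-Sq
r_sq_adj = 1 - ms_err / (ss_tot / (n - 1))   # R-Sq(adj)

print(f"F = {f_stat:.1f}, S = {s:.4f}")
print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```

The results agree with the slide up to rounding of the printed sums of squares.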
12
Regress x2 = weight on the other 5 predictors

Predictor       Coef   SE Coef       T      P   VIF
Constant      19.674     9.465    2.08  0.057
Age          -0.1446    0.2065   -0.70  0.495   1.7
BSA           21.422     3.465    6.18  0.000   1.4
Dur           0.0087    0.2051    0.04  0.967   1.2
Pulse         0.5577    0.1599    3.49  0.004   2.4
Stress      -0.02300   0.01308   -1.76  0.101   1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        5   308.839    61.768    20.77  0.000
Residual Error   14    41.639     2.974
Total            19   350.478
13
The variance inflation factor calculated by its
definition
$$\mathrm{VIF}_{Weight} = \frac{1}{1 - R_{Weight}^2} = \frac{1}{1 - 0.881} = 8.40$$
The variance of the weight coefficient is
inflated by a factor of 8.40 due to the existence
of correlation among the predictor variables in
the model.
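The same number can be recovered from the sums of squares in the regression of Weight on the other five predictors. A short sketch of that calculation; the variable names are illustrative, and the square root is included only to show the corresponding inflation of the standard error:

```python
# Sums of squares from regressing Weight on the other five predictors (from the slide).
ss_reg = 308.839
ss_tot = 350.478

r_sq_weight = ss_reg / ss_tot            # about 0.881, the reported R-Sq of 88.1%
vif_weight = 1.0 / (1.0 - r_sq_weight)   # variance inflation factor for Weight
se_inflation = vif_weight ** 0.5         # inflation of the standard error

print(f"R-Sq = {r_sq_weight:.3f}, VIF = {vif_weight:.2f}, SE inflation = {se_inflation:.2f}")
```

A VIF of about 8.4 means the standard error of the Weight coefficient is roughly 2.9 times what it would be if Weight were uncorrelated with the other predictors.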
14
The pairwise correlations
              BP    Age  Weight    BSA  Duration  Pulse
Age        0.659
Weight     0.950  0.407
BSA        0.866  0.378   0.875
Duration   0.293  0.344   0.201  0.131
Pulse      0.721  0.619   0.659  0.465     0.402
Stress     0.164  0.368   0.034  0.018     0.312  0.506
Blood pressure (BP) is the response.
15
Regress y = BP on age, weight, duration and stress

Predictor       Coef   SE Coef       T      P   VIF
Constant     -15.870     3.195   -4.97  0.000
Age          0.68374   0.06120   11.17  0.000   1.5
Weight       1.03413   0.03267   31.65  0.000   1.2
Dur          0.03989   0.06449    0.62  0.545   1.2
Stress      0.002184  0.003794    0.58  0.573   1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        4    555.45    138.86   458.28  0.000
Residual Error   15      4.55      0.30
Total            19    560.00
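The drop in the VIFs after removing BSA and Pulse can be checked from the pairwise correlations alone, because the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. A minimal sketch, assuming NumPy is available; the matrix is typed in from the correlation slide, so the results match the reported VIFs only up to the rounding of those correlations, and the helper name `vif_from_corr` is hypothetical.

```python
import numpy as np

names = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]

# Pairwise correlations among the six predictors, copied from the slide.
R = np.array([
    [1.000, 0.407, 0.378, 0.344, 0.619, 0.368],
    [0.407, 1.000, 0.875, 0.201, 0.659, 0.034],
    [0.378, 0.875, 1.000, 0.131, 0.465, 0.018],
    [0.344, 0.201, 0.131, 1.000, 0.402, 0.312],
    [0.619, 0.659, 0.465, 0.402, 1.000, 0.506],
    [0.368, 0.034, 0.018, 0.312, 0.506, 1.000],
])

def vif_from_corr(corr):
    """VIFs are the diagonal entries of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(corr))

# All six predictors.
full_vif = vif_from_corr(R)

# Keep only Age, Weight, Dur and Stress (drop BSA and Pulse).
keep = [0, 1, 3, 5]
reduced_vif = vif_from_corr(R[np.ix_(keep, keep)])

print(dict(zip(names, np.round(full_vif, 1))))
print(dict(zip([names[i] for i in keep], np.round(reduced_vif, 1))))
```

Weight should show the largest VIF in the full set, and every VIF in the reduced set should fall well below the threshold of 4.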
16
Reducing data-based multicollinearity
17
Data-based multicollinearity
  • Multicollinearity that results from a poorly
    designed experiment, reliance on purely
    observational data, or the inability to
    manipulate the system on which you collect the
    data.

18
Some methods
  • Modify the regression model by eliminating one or
    more predictor variables.
  • Collect additional data under different
    experimental or observational conditions.

19
(Modified!) Allen Cognitive Level (ACL) Study
  • Relationship of ACL test to level of pathology in
    a set of 23 patients in a hospital psychiatry
    unit
  • Response y = ACL score
  • x1 = vocabulary (Vocab) score on Shipley
    Institute of Living Scale
  • x2 = abstraction (Abstract) score on Shipley
    Institute of Living Scale
  • x3 = score on Symbol-Digit Modalities Test (SDMT)

20
Allen Cognitive Level (ACL) Study on 23 patients
21
Strong correlation between Vocab and Abstract
Pearson correlation of Vocab and Abstract = 0.990
22
Regress y = ACL on SDMT, Vocab, and Abstract

Predictor       Coef   SE Coef       T      P   VIF
Constant       3.747     1.342    2.79  0.012
SDMT         0.02326   0.01273    1.83  0.083   1.7
Vocab         0.0283    0.1524    0.19  0.855  49.3
Abstract     -0.0138    0.1006   -0.14  0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        3    3.6854    1.2285     2.28  0.112
Residual Error   19   10.2476    0.5393
Total            22   13.9330
23
Allen Cognitive Level (ACL) Study on 69 patients
24
Plot after having collected more data
Pearson correlation of Vocab and Abstract = 0.698
25
Regress y = ACL on SDMT, Vocab, and Abstract

Predictor       Coef   SE Coef       T      P   VIF
Constant      3.9463    0.3381   11.67  0.000
SDMT        0.027404  0.007168    3.82  0.000   1.6
Vocab       -0.01740   0.01808   -0.96  0.339   2.1
Abstract     0.01218   0.01159    1.05  0.297   2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        3   12.3009    4.1003     8.67  0.000
Residual Error   65   30.7487    0.4731
Total            68   43.0496
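The fall in the Vocab and Abstract VIFs tracks the fall in their pairwise correlation. As a rough check, the sketch below uses the two-predictor special case VIF = 1 / (1 - r^2); it ignores SDMT and uses the rounded correlations from the slides, so it only approximates the reported VIFs. The function name is hypothetical.

```python
def two_predictor_vif(r):
    """VIF shared by two predictors whose Pearson correlation is r."""
    return 1.0 / (1.0 - r ** 2)

vif_23 = two_predictor_vif(0.990)   # Vocab vs. Abstract, 23 patients
vif_69 = two_predictor_vif(0.698)   # Vocab vs. Abstract, 69 patients

print(f"23 patients: {vif_23:.1f}   69 patients: {vif_69:.2f}")
```

The values come out near 50 and near 2, in line with the VIFs of 49.3 and 50.6 in the 23-patient fit and 2.1 and 2.2 in the 69-patient fit.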
26
Reducing structural multicollinearity
  • In the context of polynomial regression models

27
Structural multicollinearity
  • Multicollinearity that is a mathematical artifact
    caused by creating new predictors from other
    predictors, such as creating the predictor x²
    from the predictor x.

28
Example
  • (General research question) What is the impact of
    exercise on the human immune system?
  • (Specific research question) How is the amount of
    immunoglobulin in blood (y) related to maximal
    oxygen uptake (x)?

29
Scatter plot
30
A quadratic polynomial regression function
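  • $y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$
  • where
  • yi = amount of immunoglobulin in blood (mg)
  • xi = maximal oxygen uptake (ml/kg)
  • typical assumptions about error terms (INE:
    independent, normally distributed, equal variances)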

31
Estimated quadratic function
32
Interpretation of the regression coefficients
  • If 0 is a possible x value, then b0 is the
    predicted response at x = 0. Otherwise, b0 has
    no meaningful interpretation.
  • b1 is the slope of the tangent line at x = 0.
  • b2 indicates the up/down direction of the curve
  • b2 < 0 means the curve is concave down
  • b2 > 0 means the curve is concave up

33
Regress y = igg on oxygen and oxygen²

The regression equation is
igg = -1464 + 88.3 oxygen - 0.536 oxygensq

Predictor       Coef   SE Coef       T      P   VIF
Constant     -1464.4     411.4   -3.56  0.001
oxygen         88.31     16.47    5.36  0.000  99.9
oxygensq     -0.5362    0.1582   -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        2   4602211   2301105   203.16  0.000
Residual Error   27    305818     11327
Total            29   4908029
34
Structural multicollinearity
Pearson correlation of oxygen and oxygensq = 0.995
35
Center the predictors
Mean of oxygen = 50.637

oxygen    oxcent   oxcentsq
  34.6   -16.037    257.185
  45.0    -5.637     31.776
  62.3    11.663    136.026
  58.9     8.263     68.277
  42.5    -8.137     66.211
  44.3    -6.337     40.158
  67.9    17.263    298.011
  58.5     7.863     61.827
  35.6   -15.037    226.111
  49.6    -1.037      1.075
  33.0   -17.637    311.064
36
Wow! It really works!
Pearson correlation of oxcent and oxcentsq = 0.219
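The effect of centering can be reproduced on the rows listed on the centering slide. The sketch below assumes NumPy is available and uses only the eleven oxygen values shown there, centered at the reported full-sample mean of 50.637. Because that is a subset of the 30 observations, the two correlations it prints will not equal 0.995 and 0.219 exactly; only the contrast between them is the point.

```python
import numpy as np

# The oxygen values listed on the slide (a subset of the 30 observations).
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])
mean_oxygen = 50.637   # mean of all 30 observations, as reported on the slide

oxcent = oxygen - mean_oxygen   # centered predictor
oxcentsq = oxcent ** 2          # square of the centered predictor

# Correlation between the predictor and its square, before and after centering.
r_raw = np.corrcoef(oxygen, oxygen ** 2)[0, 1]
r_centered = np.corrcoef(oxcent, oxcentsq)[0, 1]

print(np.round(oxcent[:3], 3), np.round(oxcentsq[:3], 3))
print(f"raw: {r_raw:.3f}   centered: {r_centered:.3f}")
```

The first rows of `oxcent` and `oxcentsq` reproduce the slide's columns, and the correlation between the linear and squared terms shrinks sharply in magnitude once the predictor is centered.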
37
A better quadratic polynomial regression function
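  • $y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_{11}^* (x_i - \bar{x})^2 + \varepsilon_i$
  • where $x_i - \bar{x}$ is the centered maximal oxygen
    uptake (oxcent), and
  • yi = amount of immunoglobulin in blood (mg)
  • typical assumptions about error terms (INE:
    independent, normally distributed, equal variances)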

38
Regress y = igg on oxcent and oxcent²

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor       Coef   SE Coef       T      P   VIF
Constant     1632.20     29.35   55.61  0.000
oxcent        34.000     1.689   20.13  0.000   1.1
oxcentsq     -0.5362    0.1582   -3.39  0.002   1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source           DF        SS        MS        F      P
Regression        2   4602211   2301105   203.16  0.000
Residual Error   27    305818     11327
Total            29   4908029
39
Interpretation of the regression coefficients
  • b0 is the predicted response at the predictor mean.
  • b1 is the estimated slope of the tangent line at
    the predictor mean and is often similar to the
    estimated slope in the simple model.
  • b2 indicates the up/down direction of the curve
  • b2 < 0 means the curve is concave down
  • b2 > 0 means the curve is concave up

40
Estimated regression function
41
Similar estimates of coefficients from
first-order linear model
42
The relationship between the two forms of the
model
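Centered model
$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_{11}^* (x_i - \bar{x})^2 + \varepsilon_i$$
Original model
$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$
where
$$\beta_0 = \beta_0^* - \beta_1^* \bar{x} + \beta_{11}^* \bar{x}^2, \qquad \beta_1 = \beta_1^* - 2 \beta_{11}^* \bar{x}, \qquad \beta_{11} = \beta_{11}^*$$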
43
Mean of oxygen = 50.637
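Expanding the centered quadratic around the mean gives back the coefficients of the uncentered fit. The sketch below applies that algebra to the reported centered estimates; the variable names are illustrative, and small differences from the uncentered output come from the rounding of the printed coefficients.

```python
x_bar = 50.637                                 # mean of oxygen
b0_c, b1_c, b11_c = 1632.20, 34.000, -0.5362   # centered-model estimates from the output

# Expand b0_c + b1_c*(x - x_bar) + b11_c*(x - x_bar)**2 and collect powers of x.
b11 = b11_c
b1 = b1_c - 2 * b11_c * x_bar
b0 = b0_c - b1_c * x_bar + b11_c * x_bar ** 2

print(f"igg = {b0:.1f} + {b1:.2f} oxygen + ({b11:.4f}) oxygensq")
```

The recovered intercept and slope agree with the uncentered fit (-1464.4 and 88.31) to within rounding, which shows that the two fits describe the same curve.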
44
Model evaluation
45
Model evaluation
46
Model use: What is the predicted IgG if maximal
oxygen uptake is 90?

Predicted Values for New Observations
New Obs     Fit  SE Fit        95.0% CI            95.0% PI
1        2139.6   219.2  (1689.8, 2589.5)  (1639.6, 2639.7) XX

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1          39.4      1549
There is an even greater danger in extrapolation
when modeling data with a polynomial function,
because of changes in direction.
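The change in direction can be located from the fitted coefficients. The following sketch evaluates the centered fit at an oxygen uptake of 90 and finds where the fitted parabola turns over; the function and variable names are illustrative, and the coefficients are the rounded values from the output.

```python
x_bar = 50.637                                 # mean of oxygen
b0_c, b1_c, b11_c = 1632.20, 34.000, -0.5362   # centered-model estimates from the output

def predict_igg(oxygen):
    """Fitted IgG from the centered quadratic model."""
    c = oxygen - x_bar
    return b0_c + b1_c * c + b11_c * c ** 2

fit_90 = predict_igg(90)

# The fitted curve peaks where its slope b1_c + 2*b11_c*(x - x_bar) equals zero.
turning_point = x_bar - b1_c / (2 * b11_c)

print(f"fit at 90: {fit_90:.1f}   curve turns over at oxygen = {turning_point:.1f}")
```

The fit at 90 matches the reported 2139.6 to within rounding. The fitted curve peaks at an oxygen uptake of about 82, so a prediction at 90 already sits on the descending arm of the parabola, a region the listed data (largest value shown: 67.9) do not cover.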
47
The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order
model and then explore whether a lower-order
(simpler) model is adequate.
Is a first-order linear model (line) adequate?
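One way to answer this question for the immunoglobulin example is the t-test on the quadratic term. The sketch below recomputes the t statistic from the reported coefficient and standard error and compares it with a two-sided 5% critical value; the critical value 2.052 for 27 degrees of freedom is taken from a standard t table and is an assumption about the significance level, not something stated in the slides.

```python
coef = -0.5362    # estimated coefficient of the squared term
se = 0.1582       # its standard error
df_error = 27     # residual degrees of freedom from the ANOVA table

t_stat = coef / se
t_crit = 2.052    # two-sided 5% critical value for 27 df (from a t table)

quadratic_needed = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, quadratic term needed: {quadratic_needed}")
```

Since |t| exceeds the critical value (and the reported P-value is 0.002), the quadratic term is retained and a straight line alone is not adequate for these data.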
48
The hierarchical approach to model fitting
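But if a polynomial term of a given order is
retained, then all related lower-order terms are
also retained. That is, if the quadratic term is
significant, you would use a regression function
that keeps both the linear and the quadratic terms:
$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$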