Title: Lab 5

1 Lab 5: Regression Assumptions and Multicollinearity
Henian Chen, M.D., Ph.D.
2 Four Basic Regression Assumptions: LINE
Linear: the relationship between E(Y) and each X is linear.
Independent: the error terms for different observations are uncorrelated (lack of autocorrelation).
Normal: the error term is normally distributed.
Equal variance: the conditional variance of the error term is constant (homoscedasticity).
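
In symbols, these are the assumptions of the usual linear regression model (a standard textbook formulation, added here for reference):

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)

Linearity refers to the form of E(Y_i); independence, normality, and equal variance are the stated properties of the error terms \varepsilon_i.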
3 Problems Caused by Violation of Assumptions
Violations of some of the assumptions lead to biased estimates of regression coefficients and incorrect standard errors.
Violations of other assumptions lead to incorrect standard errors.
Serious violations of the assumptions potentially lead to incorrect significance tests and confidence intervals.
4 How to Detect Violations of Assumptions
Graphical display and analysis of residuals can be very informative in detecting problems with regression models.
Residuals (errors) represent the portion of each case's score on Y that cannot be accounted for by the regression model.
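
For reference, the residual for case i is the observed score minus the score predicted by the fitted model:

e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip})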
5 Random Sampling
[Diagram: X1 and E are each random samples of n = 100 drawn from a N(0,1) population;
X2 = X1*X1, X3 = X1*X1*X1, and Y1 = X1 + X2 + X3 + E.]
6 SAS Program for Assumption Checking

data assumption;
  do i = 1 to 100;             /* the sample size is 100 */
    x1 = rannor(0);            /* random sampling from N(0,1) */
    x2 = (rannor(0))**2;       /* X2 = X1 * X1 */
    x3 = (rannor(0))**3;       /* X3 = X1 * X1 * X1 */
    e  = rannor(0);
    y1 = x1 + x2 + x3 + e;
    output;
  end;
proc reg;
  model y1 = x1 x2 x3 / r;
  plot student.*x1;            /* plot residuals vs. X1 */
  plot npp.*residual.;         /* normal probability plot of residuals */
  output out=reg r=residual;   /* save the residuals */
proc univariate normal;
  var residual;                /* examine the distribution of residuals */
run;
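
A note on the PLOT statements above: STUDENT. requests the internally studentized residual, i.e. each residual divided by its estimated standard error (standard definition, stated here for reference):

r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}

where h_{ii} is the leverage of observation i. Because the r_i are on a common scale (roughly between -3 and 3 when the model is adequate), plotting them against X1 makes departures from the linearity and equal-variance assumptions easier to see than plotting raw residuals.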
7 DATA assumption

  i       x1        x2        x3         e         y1     residual
  1   -0.19691   0.17381   -0.9672   -0.66817   -1.6585   -0.66149
  2   -0.96728   0.06206   -0.0032   -0.60851   -1.5170   -0.52268
  3    0.16762   3.01545    0.0470   -1.69718    1.5329   -1.64700
  4   -0.04552   0.05732  -11.7605   -0.25936  -12.0081   -0.35362
  5    0.93499   0.67739    1.2696   -0.12053    2.7615   -0.19345
  6   -1.55778   0.00810   -0.0499    1.32103   -0.2785    1.46224
  7   -0.11813   0.13446    0.1438   -0.71702   -0.5569   -0.71036
  8   -0.32067   0.59637   20.9912    0.65614   21.9231    0.85498
  9    0.72745   2.77372   -0.1221   -0.80055    2.5786   -0.81183
 10    0.88710   0.26480   -0.8891    0.61607    0.8788    0.52081
...
 99    0.17938   2.70423    0.2452    0.53972    3.6685    0.58253
100   -1.09027   0.64640  -11.8565   -0.78319  -13.0836   -0.76247
8 Analysis of Residuals

The UNIVARIATE Procedure
Variable: residual

N                      100     Sum Weights              100
Mean                     0     Sum Observations           0
Std Deviation   0.89467855     Variance           0.8004497
Skewness        -0.1266863     Kurtosis          0.14505764

Tests for Normality
Test                  --Statistic---     -----p Value------
Shapiro-Wilk          W      0.993441    Pr < W       0.9132
Kolmogorov-Smirnov    D      0.047899    Pr > D      >0.1500
Cramer-von Mises      W-Sq   0.028132    Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq   0.197547    Pr > A-Sq   >0.2500
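
If a graphical check of the residual distribution is wanted alongside these tests, PROC UNIVARIATE can also draw a histogram with a fitted normal curve and a normal Q-Q plot from the saved residuals (an optional extra step, not part of the original program; it assumes the output data set out=reg created above):

proc univariate data=reg normal;
  var residual;
  histogram residual / normal;                  /* histogram with fitted normal curve */
  qqplot residual / normal(mu=est sigma=est);   /* normal quantile-quantile plot */
run;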
9 Scatterplot of Residuals vs. X1
[Figure: residuals plotted against X1]

10 Normal Probability Plot of Residuals
[Figure: normal probability plot of the residuals]
11 Multicollinearity
Very high multiple correlations among some or all of the predictors in an equation.

Problems of Multicollinearity
The regression coefficient will be very unreliable.
The regression coefficient will have a very large standard error (see the variance formula below).
The confidence interval of the regression coefficient will be so large as to make the estimate of little or no value.
The regression coefficient will become more difficult to interpret.
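
The "very large standard error" can be made precise. For predictor X_j in a multiple regression, a standard result (added here for reference) is

Var(\hat{\beta}_j) = \frac{\sigma^2}{\sum_i (X_{ij} - \bar{X}_j)^2} \cdot \frac{1}{1 - R_j^2}

where R_j^2 is the squared multiple correlation of X_j with the other predictors. As R_j^2 approaches 1, the factor 1/(1 - R_j^2) inflates the variance of \hat{\beta}_j without bound; this factor is the VIF defined on the next slide.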
12 How to Detect Multicollinearity

1. The squared correlation (r²)
As the squared correlation (r²) between a pair of predictors increases toward 1.0, the magnitude of potential problems associated with multicollinearity increases correspondingly.

2. Tolerance (1 - R²)
One minus the squared multiple correlation of a given IV with the other IVs in the equation. Tolerance values of 0.10 or less indicate that there may be serious multicollinearity.

3. The Variance Inflation Factor, VIF = 1/(1 - R²)
VIF is the reciprocal of the tolerance. Any VIF of 10 or more provides evidence of serious multicollinearity.

4. Condition Number (k)
The square root of the ratio of the largest eigenvalue to the smallest eigenvalue. A k of 30 or larger indicates that there may be serious multicollinearity.
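
Written as formulas (standard definitions consistent with the thresholds above):

Tolerance_j = 1 - R_j^2, \qquad VIF_j = \frac{1}{1 - R_j^2} = \frac{1}{Tolerance_j}, \qquad k = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}}

where R_j^2 comes from regressing predictor X_j on the remaining predictors, and \lambda_{max}, \lambda_{min} are the largest and smallest eigenvalues of the scaled cross-products matrix of the predictors. Flags for serious multicollinearity: Tolerance_j \le 0.10, VIF_j \ge 10, or k \ge 30.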
13 Random Sampling
[Diagram: the simulation design for the collinearity example; the correlation between X1 and X2 is very high.]
14 SAS Program for Collinearity

data collinearity;
  do i = 1 to 30;               /* the sample size is 30 */
    x1 = rannor(0)*2 + 8;       /* random sampling from N(8, 2²) */
    x2 = rannor(0) + x1*2;      /* X2 = 2*X1 + e, e from N(0,1) */
    y  = rannor(0) + x1*3 + 1;  /* Y = 3*X1 + 1 + e, e from N(0,1) */
    output;
  end;
proc reg;
  model y = x1;
  model y = x2;
  model y = x1 x2 / vif tol collin collinoint;
run;
proc reg;
  model y = x1 x2 / selection=forward;
run;

Run this program 10 times!
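
One optional extra check, not in the original program, is to confirm the claim on the previous slide that X1 and X2 are very highly correlated; PROC CORR on the simulated data does this directly:

proc corr data=collinearity;
  var x1 x2;   /* Pearson correlation between the two predictors */
run;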
15 DATA collinearity

Obs    i      x1        x2        y
  1    1    7.1051   14.1473   21.2238
  2    2    6.9889   14.3070   20.4224
  3    3    7.5157   15.5566   23.0285
  4    4    3.6790    6.7917   11.0950
  5    5    9.2190   18.2427   28.7113
  6    6   13.3599   28.9172   40.0390
  7    7    8.4446   18.0301   28.3211
  8    8    8.6625   17.9663   26.0228
  9    9   10.7858   21.2204   34.7941
 10   10    6.8514   14.8137   21.7308
 11   11    4.0134    6.5129   11.5200
 12   12    9.1870   19.5773   26.1309
 13   13   10.4770   21.8295   33.4174
 14   14    8.8016   17.9182   26.5724
 15   15    8.5503   17.2934   25.7114
 16   16    5.8284   11.6626   17.5066
 17   17    9.4274   18.5729   29.4839
 18   18    6.8077   11.9619   20.9820
 19   19    9.9442   19.2891   30.7821
...
 29   29   10.1040   20.1968   31.2689
 30   30    7.8775   16.4311   25.9299
16 SAS Output 1

Model: Y = X1
                     Parameter    Standard
Variable      DF     Estimate     Error       t Value   Pr > |t|
Intercept      1     0.36405      0.79911      0.46     0.6522
x1             1     3.07798      0.10463     29.42     <.0001

Model: Y = X2
                     Parameter    Standard
Variable      DF     Estimate     Error       t Value   Pr > |t|
Intercept      1     0.77546      1.39741      0.55     0.5834
x2             1     1.52155      0.09218     16.51     <.0001
17 SAS Output 2

Model: Y = X1 X2

Parameter Estimates
                     Parameter    Standard                                        Variance
Variable      DF     Estimate     Error       t Value   Pr > |t|   Tolerance     Inflation
Intercept      1    -0.31174      0.91947     -0.34     0.7372       .            0
x1             1     3.38771      0.50003      6.77     <.0001     0.04799     20.83709
x2             1    -0.11803      0.24206     -0.49     0.6298     0.04799     20.83709

Collinearity Diagnostics
                         Condition    ---------Proportion of Variation---------
Number    Eigenvalue     Index        Intercept        x1              x2
1         2.95102         1.00000     0.00742      0.00036845      0.00040210
2         0.04727         7.90102     0.94627      0.00963         0.01410
3         0.00171        41.57072     0.04631      0.99000         0.98550

Collinearity Diagnostics (intercept adjusted)
                         Condition    --Proportion of Variation--
Number    Eigenvalue     Index            x1              x2
1         1.97571         1.00000       0.01215         0.01215
2         0.02429         9.01865       0.98785         0.98785
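
These numbers agree with the definitions on slide 12 (a quick check, not part of the SAS output):

VIF = \frac{1}{Tolerance} = \frac{1}{0.04799} \approx 20.84 > 10, \qquad largest condition index = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}} = \sqrt{\frac{2.95102}{0.00171}} \approx 41.6 > 30

Both exceed their rule-of-thumb thresholds, flagging serious multicollinearity between x1 and x2; this is why the individual coefficients in the joint model have much larger standard errors than in the single-predictor models on the previous slide.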
18 SAS Output 3

Dependent Variable: y
              Parameter    Standard
Variable      Estimate     Error        Type II SS   F Value   Pr > F
Intercept      0.97453     0.18014        24.82542     29.27   <.0001
x1             4.05555     0.37985        96.69632    113.99   <.0001
x2            -0.48721     0.18900         5.63716      6.65   0.0157

All variables have been entered into the model.
19 SAS Output 4

Dependent Variable: y
              Parameter    Standard
Variable      Estimate     Error        Type II SS   F Value   Pr > F
Intercept      1.11645     0.14702        36.22174     57.67   <.0001
x1             2.97601     0.15006       247.03498    393.31   <.0001

No other variable met the 0.5000 significance level for entry into the model.
20 SAS Output 5

Dependent Variable: y
              Parameter    Standard
Variable      Estimate     Error        Type II SS   F Value   Pr > F
Intercept      0.85843     0.16774        19.06416     26.19   <.0001
x1             3.37690     0.31533        83.47922    114.69   <.0001
x2            -0.14771     0.15416         0.66820      0.92   0.3465

All variables have been entered into the model.