Title: Multiple Linear Regression
1Multiple Linear Regression
- Dingcai Cao
- University of Chicago
- Feb 9, 2006
2On Tuesday Simple Linear Regression
Linear regression
- Introduction
  - Linear regression is a general method for estimating or describing the association between a continuous outcome (dependent) variable and one or more predictors in a single equation.
- Statistical model
- Least squares estimates
- Diagnosis
  - Normality
  - Constant variance
  - Outliers
Today: the relationship between simple linear regression and ANOVA; multiple linear regression
4Linear Regression vs. ANOVA
ANOVA: dependent variable continuous; independent variables categorical
Linear regression: dependent variable continuous; independent variables continuous or categorical
Both are linear models.
ANOVA and regression are the same thing!!!
5Linear Regression vs. ANOVA
Scientific question: Is there any difference in loneliness between females and males?
H0: Female = Male
H1: Female ≠ Male
Test: Student's t test or ANOVA
6Linear Regression vs. ANOVA
/* ANOVA with SOLUTION */
PROC GLM DATA=MYLIB.LONELINESS;
  CLASS GENDER;
  MODEL LONELY3 = GENDER / SOLUTION;
RUN;

/* Linear regression using GLM */
PROC GLM DATA=MYLIB.LONELINESS;
  MODEL LONELY3 = GENDER;
RUN;
7Linear Regression vs. ANOVA
Dependent Variable: loneliness

ANOVA

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         4.902852      4.902852      2.24   0.1347
Error             498      1087.709443      2.184156
Corrected Total   499      1092.612294

R-Square   Coeff Var   Root MSE   lonely3 Mean
0.004487    29.18049   1.477889       5.064648

Source   DF     Type I SS   Mean Square   F Value   Pr > F
gender    1    4.90285179    4.90285179      2.24   0.1347

Source   DF   Type III SS   Mean Square   F Value   Pr > F
gender    1    4.90285179    4.90285179      2.24   0.1347

Parameter       Estimate      Standard Error   t Value   Pr > |t|
Intercept    4.931462122 B        0.11077245     44.52     <.0001
gender 1     0.206809959 B        0.13803488      1.50     0.1347
gender 2     0.000000000 B                 .         .          .

REGRESSION

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         4.902852      4.902852      2.24   0.1347
Error             498      1087.709443      2.184156
Corrected Total   499      1092.612294

R-Square   Coeff Var   Root MSE   lonely3 Mean
0.004487    29.18049   1.477889       5.064648

Source   DF     Type I SS   Mean Square   F Value   Pr > F
gender    1    4.90285179    4.90285179      2.24   0.1347

Source   DF   Type III SS   Mean Square   F Value   Pr > F
gender    1    4.90285179    4.90285179      2.24   0.1347

Parameter        Estimate     Standard Error   t Value   Pr > |t|
Intercept     5.345082039         0.19850165     26.93     <.0001
gender       -0.206809959         0.13803488     -1.50     0.1347
8Linear Regression vs. ANOVA
In the example, we show that ANOVA is a special
case of linear regression.
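The equivalence can be checked from the two parameter tables in the example. The following is a minimal sketch, assuming that gender is coded 1 and 2 (as the CLASS levels in the ANOVA output suggest) and that the regression run treats that code as a numeric predictor. The variable names are illustrative only.

```python
# Estimates copied from the two PROC GLM outputs in the example.
# ANOVA parameterization: the intercept is the mean of the reference level
# (gender 2) and "gender 1" is the difference between level 1 and level 2.
anova_intercept = 4.931462122
anova_gender1 = 0.206809959

# Regression parameterization: lonely3 = b0 + b1 * gender,
# with gender taking the values 1 and 2 (assumed coding).
reg_intercept = 5.345082039
reg_slope = -0.206809959
slope_se = 0.13803488

anova_means = {1: anova_intercept + anova_gender1, 2: anova_intercept}
reg_means = {g: reg_intercept + reg_slope * g for g in (1, 2)}

for g in (1, 2):
    print(f"gender {g}: ANOVA mean = {anova_means[g]:.4f}, "
          f"regression mean = {reg_means[g]:.4f}")

# With a single 1-df predictor, the model F statistic equals the square
# of the slope's t statistic.
t_value = reg_slope / slope_se
print(f"t = {t_value:.2f}, t squared = {t_value ** 2:.2f}")
```

Both parameterizations give the same two group means, and the squared t value reproduces the F value of 2.24 reported in both outputs.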
What if there are more than 2 groups in the
ANOVA?
9Linear Regression vs. ANOVA
Dummy variables for categorical data
Outcome: Y (continuous)
Predictor: X (categorical)
X: Group 1, Group 2, Group 3
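A three-level predictor enters a regression as two 0/1 indicator (dummy) variables, with the remaining level acting as the reference. The sketch below is one way to build them. The helper name `dummy_code`, the sample labels, and the choice of Group 3 as the reference are assumptions made for illustration; using the last level as reference mirrors the ANOVA output in the example, where gender 2 is set to zero.

```python
def dummy_code(groups, reference):
    """Return one 0/1 indicator column per non-reference level."""
    levels = sorted(set(groups))
    return {
        f"is_{level}": [1 if g == level else 0 for g in groups]
        for level in levels
        if level != reference
    }

# Hypothetical group labels for six observations.
x = ["Group 1", "Group 2", "Group 3", "Group 1", "Group 3", "Group 2"]
dummies = dummy_code(x, reference="Group 3")

for name, column in dummies.items():
    print(name, column)
```

With these two columns as predictors, the intercept estimates the Group 3 mean and each coefficient estimates the difference between its group and Group 3.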
10Linear Regression
- Linear regression is a general method for estimating or describing the association between a continuous outcome (dependent) variable and one or more predictors in a single equation.
  - One predictor: simple linear regression
  - Multiple predictors: multiple linear regression
11Statistical Model
Simple linear regression
Multiple linear regression
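The equations for the two models are not reproduced in this text. In the usual textbook notation (an assumption here, since the original symbols are not available), with $k$ predictors and independent normal errors, they can be written as:

```latex
% Simple linear regression: one predictor
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

% Multiple linear regression: k predictors
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i

% Error assumption in both cases
\varepsilon_i \sim N(0, \sigma^2), \quad \text{independent for } i = 1, \dots, n
```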
12Example The academic performance of schools
 #   Variable    Type   Len   Pos   Label
------------------------------------------------------------------------------
11   acs_46      Num      3    39   avg class size 4-6
10   acs_k3      Num      3    36   avg class size k-3
 3   api00       Num      4    12   api 2000
 4   api99       Num      4    16   api 1999
17   avg_ed      Num      8    57   avg parent ed
15   col_grad    Num      3    51   parent college grad
22   collcat     Num      8    78
 2   dnum        Num      4     8   district number
 7   ell         Num      3    27   english language learners
19   emer        Num      3    68   pct emer credential
20   enroll      Num      4    71   number of students
18   full        Num      3    65   pct full credential
16   grad_sch    Num      3    54   parent grad school
 5   growth      Num      4    20   growth 1999 to 2000
13   hsg         Num      3    45   parent hsg
21   mealcat     Num      3    75   Percentage free meals in 3 categories
 6   meals       Num      3    24   pct free meals
 9   mobility    Num      3    33   pct 1st year in school
12   not_hsg     Num      3    42   parent not hsg
 1   snum        Num      8     0   school number
14   some_col    Num      3    48   parent some college
 8   yr_rnd      Num      3    30   year round school

PROC CONTENTS output by SAS
13Example The academic performance of schools
PROC CORR DATA=MYLIB.ELEMAPI2;
  VAR API00 ELL MEALS YR_RND MOBILITY ACS_K3 ACS_46 FULL EMER ENROLL;
RUN;

PROC CORR output by SAS

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

(first row of each pair: correlation; second row: P-value)

                               api00       ell     meals    yr_rnd  mobility    acs_k3    acs_46      full      emer    enroll
api00                        1.00000  -0.76763  -0.90070  -0.47544  -0.20641   0.17100   0.23291   0.57441  -0.58273  -0.31817
api 2000                                <.0001    <.0001    <.0001    <.0001    0.0006    <.0001    <.0001    <.0001    <.0001
ell                         -0.76763   1.00000   0.77238   0.49793  -0.02046  -0.05565  -0.17330  -0.48476   0.47218   0.40302
english language learners     <.0001              <.0001    <.0001    0.6837    0.2680    0.0005    <.0001    <.0001    <.0001
meals                       -0.90070   0.77238   1.00000   0.41845   0.21665  -0.18797  -0.21309  -0.52756   0.53304   0.24103
pct free meals                <.0001    <.0001              <.0001    <.0001    0.0002    <.0001    <.0001    <.0001    <.0001
yr_rnd                      -0.47544   0.49793   0.41845   1.00000   0.03479   0.02270  -0.04207  -0.39771   0.43472   0.59182
year round school             <.0001    <.0001    <.0001              0.4883    0.6517    0.4032    <.0001    <.0001    <.0001
mobility                    -0.20641  -0.02046   0.21665   0.03479   1.00000   0.04014   0.12769   0.02521   0.05961   0.10502
pct 1st year in school        <.0001    0.6837    <.0001    0.4883              0.4245    0.0110    0.6156    0.2348    0.0360
acs_k3                       0.17100  -0.05565  -0.18797   0.02270   0.04014   1.00000   0.27078   0.16057  -0.11033   0.10890
avg class size k-3            0.0006    0.2680    0.0002    0.6517    0.4245              <.0001    0.0013    0.0277    0.0298
acs_46                       0.23291  -0.17330  -0.21309  -0.04207   0.12769   0.27078   1.00000   0.11773  -0.12446   0.02829
avg class size 4-6            <.0001    0.0005    <.0001    0.4032    0.0110    <.0001              0.0190    0.0131    0.5741
full                         0.57441  -0.48476  -0.52756  -0.39771   0.02521   0.16057   0.11773   1.00000  -0.90568  -0.33769
pct full credential           <.0001    <.0001    <.0001    <.0001    0.6156    0.0013    0.0190              <.0001    <.0001
emer                        -0.58273   0.47218   0.53304   0.43472   0.05961  -0.11033  -0.12446  -0.90568   1.00000   0.34309
pct emer credential           <.0001    <.0001    <.0001    <.0001    0.2348    0.0277    0.0131    <.0001              <.0001
enroll                      -0.31817   0.40302   0.24103   0.59182   0.10502   0.10890   0.02829  -0.33769   0.34309   1.00000
number of students            <.0001    <.0001    <.0001    <.0001    0.0360    0.0298    0.5741    <.0001    <.0001
14Example The academic performance of schools
Dependent Variable: api00 (api 2000)

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               9          6740702        748967    232.41   <.0001
Error             385          1240708    3222.61761
Corrected Total   394          7981410

Root MSE          56.76810   R-Square   0.8446
Dependent Mean   648.65063   Adj R-Sq   0.8409
Coeff Var          8.75172

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1            758.94179         62.28601     12.18     <.0001
ell         english language learners     1             -0.86007          0.21063     -4.08     <.0001
meals       pct free meals                1             -2.94822          0.17035    -17.31     <.0001
yr_rnd      year round school             1            -19.88875          9.25844     -2.15     0.0323
mobility    pct 1st year in school        1             -1.30135          0.43621     -2.98     0.0030
acs_k3      avg class size k-3            1              1.31870          2.25268      0.59     0.5586
acs_46      avg class size 4-6            1              2.03246          0.79832      2.55     0.0113
full        pct full credential           1              0.60972          0.47582      1.28     0.2008
emer        pct emer credential           1             -0.70662          0.60541     -1.17     0.2439
enroll      number of students            1             -0.01216          0.01679     -0.72     0.4693
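Most of the summary statistics in this output follow from the three sums of squares and their degrees of freedom. The sketch below recomputes them from the values printed in the table. The variable names are illustrative, and small differences in the last digit come from the rounded sums of squares.

```python
# Values copied from the Analysis of Variance table for the nine-predictor model.
ss_model, df_model = 6740702, 9
ss_error, df_error = 1240708, 385
ss_total, df_total = 7981410, 394

ms_model = ss_model / df_model
ms_error = ss_error / df_error              # estimate of the error variance
f_value = ms_model / ms_error               # overall F test of the model
root_mse = ms_error ** 0.5
r_square = ss_model / ss_total
adj_r_square = 1 - ms_error / (ss_total / df_total)

# t value for one coefficient: estimate divided by its standard error (ell row).
t_ell = -0.86007 / 0.21063

print(f"F = {f_value:.2f}, Root MSE = {root_mse:.3f}")
print(f"R-Square = {r_square:.4f}, Adj R-Sq = {adj_r_square:.4f}")
print(f"t(ell) = {t_ell:.2f}")
```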
15Diagnosis Normal Distribution?
[Figure: Histogram, Boxplot, and Normal Probability Plot (text plots). The value axis runs from -170 to 150; the normal quantile axis runs from -2 to 2.]
16Diagnosis Constant Variance?
Residual vs fitted value plot
17Diagnosis Outliers?
19Model Selection
Forward Model Selection: Starting with the null model, add variables one at a time until no remaining variable meets the entry criterion.
Backward Model Selection: Starting with the full model, delete variables with large P-values one at a time.
Stepwise Model Selection: Combination of the Backward and Forward methods.

PROC REG DATA=MYLIB.ELEMAPI2;
  M1: MODEL API00 = ELL MEALS YR_RND MOBILITY ACS_K3 ACS_46 FULL EMER ENROLL / SELECTION=FORWARD;
  M2: MODEL API00 = ELL MEALS YR_RND MOBILITY ACS_K3 ACS_46 FULL EMER ENROLL / SELECTION=BACKWARD;
  M3: MODEL API00 = ELL MEALS YR_RND MOBILITY ACS_K3 ACS_46 FULL EMER ENROLL / SELECTION=STEPWISE;
RUN;
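The forward procedure can be sketched in a few lines of plain Python. This is an illustration of the idea, not a reimplementation of PROC REG: it uses a fixed partial-F threshold (`f_to_enter`, a hypothetical parameter) where SAS uses a significance level for entry, and it runs on small made-up data. All function and variable names are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of a least squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def forward_select(data, y, f_to_enter=4.0):
    """Add the variable with the largest partial F at each step; stop below f_to_enter."""
    selected, remaining = [], list(data)
    current_rss = rss([], y)                      # null model: intercept only
    while remaining:
        trials = {v: rss([data[s] for s in selected + [v]], y) for v in remaining}
        best = min(trials, key=trials.get)
        df_error = len(y) - (len(selected) + 2)   # n minus intercept and predictors
        partial_f = (current_rss - trials[best]) / (trials[best] / df_error)
        if partial_f < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current_rss = trials[best]
    return selected

# Hypothetical data: y depends strongly on x1, weakly on x2, and not on x3.
n = 40
x1 = [float(i % 8) for i in range(n)]
x2 = [float((i * 3) % 5) for i in range(n)]
x3 = [float((i * 7) % 11) for i in range(n)]
noise = [((i * 37) % 13 - 6) / 10 for i in range(n)]
y = [5 + 3 * a + 0.5 * b + e for a, b, e in zip(x1, x2, noise)]

selected = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected)
```

The strongest predictor enters first, mirroring how meals enters first in the SAS run. Backward selection would start from the full set and drop the variable with the smallest partial F instead.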
20Forward Model Selection
Forward Selection: Step 1
Variable meals Entered: R-Square = 0.8104 and C(p) = 78.6700

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1          6467843       6467843   1679.39   <.0001
Error             393          1513567    3851.31515
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            890.33789          6.67326     68555604   17800.6   <.0001
meals                 -4.01693          0.09802      6467843   1679.39   <.0001
21Forward Model Selection
Forward Selection: Step 2
Variable emer Entered: R-Square = 0.8256 and C(p) = 42.9320

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2          6589458       3294729    927.86   <.0001
Error             392          1391952    3550.89695
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            891.69625          6.41191     68674840   19340.1   <.0001
meals                 -3.66331          0.11185      3809235   1072.75   <.0001
emer                  -1.78671          0.30530       121615     34.25   <.0001
22Forward Model Selection
Forward Selection: Step 3
Variable ell Entered: R-Square = 0.8352 and C(p) = 21.2736

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3          6665700       2221900    660.30   <.0001
Error             391          1315710    3364.98682
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            887.25446          6.31117     66505925   19764.1   <.0001
ell                   -0.88928          0.18683        76242     22.66   <.0001
meals                 -3.16584          0.15092      1480643    440.01   <.0001
emer                  -1.61167          0.29947        97463     28.96   <.0001
23Forward Model Selection
Forward Selection: Step 4
Variable yr_rnd Entered: R-Square = 0.8380 and C(p) = 16.1280

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               4          6688728       1672182    504.49   <.0001
Error             390          1292682    3314.56983
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            885.52859          6.29784     65531336   19770.7   <.0001
ell                   -0.73721          0.19419        47770     14.41   0.0002
meals                 -3.17476          0.14983      1488241    449.00   <.0001
yr_rnd               -21.74359          8.24936        23028      6.95   0.0087
emer                  -1.40732          0.30716        69579     20.99   <.0001
24Forward Model Selection
Forward Selection: Step 5
Variable mobility Entered: R-Square = 0.8405 and C(p) = 11.9797

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5          6708541       1341708    410.04   <.0001
Error             389          1272868    3272.15552
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            900.34592          8.68410     35172503   10749.0   <.0001
ell                   -0.88118          0.20162        62504     19.10   <.0001
meals                 -3.03351          0.15955      1182889    361.50   <.0001
yr_rnd               -20.98176          8.20226        21412      6.54   0.0109
mobility              -1.01603          0.41290        19814      6.06   0.0143
emer                  -1.44072          0.30549        72777     22.24   <.0001
25Forward Model Selection
Forward Selection: Step 6
Variable acs_46 Entered: R-Square = 0.8434 and C(p) = 6.9282

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               6          6731265       1121878    348.19   <.0001
Error             388          1250144    3222.02150
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            838.88765         24.69433      3718275   1154.02   <.0001
ell                   -0.89299          0.20012        64158     19.91   <.0001
meals                 -2.95860          0.16081      1090573    338.47   <.0001
yr_rnd               -22.35084          8.15549        24200      7.51   0.0064
mobility              -1.22139          0.41695        27648      8.58   0.0036
acs_46                 2.06035          0.77582        22724      7.05   0.0082
emer                  -1.42212          0.30322        70872     22.00   <.0001
26Forward Model Selection
Forward Selection: Step 7
Variable full Entered: R-Square = 0.8442 and C(p) = 6.8016

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               7          6738119        962588    299.63   <.0001
Error             387          1243291    3212.63859
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            770.54568         52.89162       681844    212.24   <.0001
ell                   -0.88545          0.19989        63036     19.62   <.0001
meals                 -2.93428          0.16144      1061304    330.35   <.0001
yr_rnd               -22.76066          8.14844        25066      7.80   0.0055
mobility              -1.35103          0.42570        32358     10.07   0.0016
acs_46                 2.11781          0.77569        23948      7.45   0.0066
full                   0.68305          0.46767   6853.20941      2.13   0.1450
emer                  -0.66301          0.60150   3903.28850      1.21   0.2710
27Forward Model Selection
Forward Selection: Step 8
Variable enroll Entered: R-Square = 0.8444 and C(p) = 8.3427

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               8          6739598        842450    261.86   <.0001
Error             386          1241812    3217.12983
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            777.22985         53.83881       670466    208.40   <.0001
ell                   -0.84478          0.20883        52648     16.36   <.0001
meals                 -2.96498          0.16778      1004713    312.30   <.0001
yr_rnd               -19.80050          9.24933        14743      4.58   0.0329
mobility              -1.29000          0.43540        28240      8.78   0.0032
acs_46                 2.13851          0.77683        24380      7.58   0.0062
full                   0.64876          0.47072   6110.87487      1.90   0.1689
emer                  -0.67253          0.60209   4013.89996      1.25   0.2647
enroll                -0.01134          0.01672   1479.01726      0.46   0.4982
28Forward Model Selection
No other variable met the 0.5000 significance level for entry into the model.

Summary of Forward Selection
       Variable                                Number    Partial      Model
Step   Entered    Label                       Vars In   R-Square   R-Square      C(p)   F Value   Pr > F
   1   meals      pct free meals                    1     0.8104     0.8104   78.6700   1679.39   <.0001
   2   emer       pct emer credential               2     0.0152     0.8256   42.9320     34.25   <.0001
   3   ell        english language learners         3     0.0096     0.8352   21.2736     22.66   <.0001
   4   yr_rnd     year round school                 4     0.0029     0.8380   16.1280      6.95   0.0087
   5   mobility   pct 1st year in school            5     0.0025     0.8405   11.9797      6.06   0.0143
   6   acs_46     avg class size 4-6                6     0.0028     0.8434    6.9282      7.05   0.0082
   7   full       pct full credential               7     0.0009     0.8442    6.8016      2.13   0.1450
   8   enroll     number of students                8     0.0002     0.8444    8.3427      0.46   0.4982
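The Partial R-Square and F Value columns of the summary can be recovered from consecutive steps: each is based on the extra model sum of squares gained when the new variable enters. The sketch below checks this for the first three steps, using numbers copied from the step-by-step output. The variable names are illustrative.

```python
ss_total = 7981410  # corrected total sum of squares

# (variable entered, model sum of squares, error mean square) from Steps 1 to 3.
steps = [
    ("meals", 6467843, 3851.31515),
    ("emer", 6589458, 3550.89695),
    ("ell", 6665700, 3364.98682),
]

rows = []
previous_ss = 0
for name, ss_model, ms_error in steps:
    gain = ss_model - previous_ss        # extra sum of squares from the new variable
    rows.append((name, gain / ss_total, ss_model / ss_total, gain / ms_error))
    previous_ss = ss_model

for name, partial_r2, model_r2, f_value in rows:
    print(f"{name:6s} partial R2 = {partial_r2:.4f}  "
          f"model R2 = {model_r2:.4f}  F = {f_value:.2f}")
```

The gain at each step also equals the Type II SS reported for the entering variable in that step's table, for example 121615 for emer at Step 2.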
29Collinearity/Multicollinearity
- Refers to dependency among the predictors: a predictor is nearly a linear combination of other predictors in the model.
- Why do we care?
  - It is difficult to draw conclusions from the results
    - e.g., Y = X1 + X2, X1 = 2X2, ...
  - Unstable estimates and large standard errors
- It is always important to look at the correlation matrix before model fitting.
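To see why conclusions become difficult, consider the extreme case in which one predictor is exactly twice another. The sketch below uses made-up values and a model without an intercept for simplicity; the names are illustrative. Very different coefficient pairs give identical fitted values, so the data cannot tell them apart.

```python
# Hypothetical predictor values with exact collinearity: x1 is always twice x2.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [2 * v for v in x2]

def fitted(b1, b2):
    """Fitted values b1*x1 + b2*x2 (no intercept, for simplicity)."""
    return [b1 * u + b2 * v for u, v in zip(x1, x2)]

# Three very different coefficient pairs produce the same fitted values,
# so least squares has no basis for choosing among them.
candidates = [(1.0, 1.0), (1.5, 0.0), (0.0, 3.0)]
for b1, b2 in candidates:
    print((b1, b2), fitted(b1, b2))

# Equivalent symptom: the 2x2 cross-product matrix X'X has zero determinant,
# so the normal equations have no unique solution.
s11 = sum(u * u for u in x1)
s22 = sum(v * v for v in x2)
s12 = sum(u * v for u, v in zip(x1, x2))
determinant = s11 * s22 - s12 ** 2
print("det(X'X) =", determinant)
```

With near (rather than exact) collinearity the determinant is small but nonzero, which is what produces unstable estimates and large standard errors.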
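30Collinearity/Multicollinearity
Statistically, collinearity is measured by two values: the variance inflation factor (VIF) and the tolerance (TOL).
$$\mathrm{TOL}(X_1) = 1 - R^2(X_1 \mid X_2, X_3, \dots, X_k)$$
$$\mathrm{TOL}(X_2) = 1 - R^2(X_2 \mid X_1, X_3, \dots, X_k)$$
$$\mathrm{TOL}(X_i) = 1 - R^2(X_i \mid \text{all } X_j,\ j \neq i)$$
$$\mathrm{VIF} = 1/\mathrm{TOL}$$
The lowest possible value of VIF is 1.0.
VIF > 10: strong indication of collinearity.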
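The relationship between the two measures is easy to check numerically. The sketch below takes tolerances for three predictors of api00 from the PROC REG output for this example and converts them to VIF. It then works the two-predictor special case, where the R-square of one predictor on the other is just their squared correlation. The names are illustrative; small differences from the printed VIF values come from the rounded tolerances.

```python
def vif_from_tolerance(tol):
    """Variance inflation factor for a given tolerance."""
    return 1.0 / tol

# Tolerances copied from the PROC REG output for three predictors of api00.
tolerance = {"ell": 0.30090, "full": 0.16289, "emer": 0.16344}

vif = {name: vif_from_tolerance(tol) for name, tol in tolerance.items()}
for name, tol in tolerance.items():
    r2 = 1 - tol  # R-square from regressing this predictor on all the others
    print(f"{name:5s} R2 on other predictors = {r2:.4f}  VIF = {vif[name]:.3f}")

# Two-predictor special case: with only full and emer in a model, the
# tolerance is 1 minus their squared pairwise correlation.
r_full_emer = -0.90568
tol_pair = 1 - r_full_emer ** 2
vif_pair = vif_from_tolerance(tol_pair)
print(f"full and emer only: TOL = {tol_pair:.4f}, VIF = {vif_pair:.2f}")
```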
31Collinearity SAS Option
PROC REG DATA=MYLIB.ELEMAPI2;
  MODEL API00 = ELL MEALS YR_RND MOBILITY ACS_K3 ACS_46 FULL EMER ENROLL / VIF TOL;
  OUTPUT OUT=C P=PRED L95=L95 U95=U95 R=RESID COOKD=COOKD;
RUN;

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance   Variance Inflation
Intercept   Intercept                     1            758.94179         62.28601     12.18     <.0001           .                    0
ell         english language learners     1             -0.86007          0.21063     -4.08     <.0001     0.30090              3.32338
meals       pct free meals                1             -2.94822          0.17035    -17.31     <.0001     0.27706              3.60929
yr_rnd      year round school             1            -19.88875          9.25844     -2.15     0.0323     0.53272              1.87716
mobility    pct 1st year in school        1             -1.30135          0.43621     -2.98     0.0030     0.76279              1.31098
acs_k3      avg class size k-3            1              1.31870          2.25268      0.59     0.5586     0.85399              1.17098
acs_46      avg class size 4-6            1              2.03246          0.79832      2.55     0.0113     0.86593              1.15483
full        pct full credential           1              0.60972          0.47582      1.28     0.2008     0.16289              6.13907
emer        pct emer credential           1             -0.70662          0.60541     -1.17     0.2439     0.16344              6.11858
enroll      number of students            1             -0.01216          0.01679     -0.72     0.4693     0.56555              1.76820
32Goodness of Fit
$R^2$: the larger, the better
Adjusted $R^2$: the larger, the better
Root MSE: the smaller, the better
$C_p$: probably the most frequently used statistic; look for $C_p \approx p$, where $p = 1 +$ number of independent variables ($k$)
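The C(p) values in the forward-selection output can be reproduced with the usual definition of Mallows' statistic, $C_p = RSS_p / s^2 - (n - 2p)$, where $s^2$ is the error mean square of the full model. That formula is not stated on the slide, so treat it as an assumption; the sketch below shows that it matches the reported values. The names are illustrative.

```python
n = 395                  # observations: corrected total DF (394) plus 1
s2_full = 3222.61761     # error mean square of the full nine-predictor model

def mallows_cp(rss, k):
    """Mallows' C(p) for a submodel with k predictors and error sum of squares rss."""
    p = k + 1            # number of parameters, including the intercept
    return rss / s2_full - (n - 2 * p)

# (number of predictors k, error sum of squares) from forward steps 1, 6 and 8.
cp_values = {k: mallows_cp(rss, k) for k, rss in [(1, 1513567), (6, 1250144), (8, 1241812)]}
for k, cp in cp_values.items():
    print(f"k = {k}: C(p) = {cp:.2f}, p = {k + 1}")
```

At step 6 the statistic (about 6.93) is close to p = 7, which is the pattern to look for. At step 1 it is far above p = 2, signalling that important predictors are still missing.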
33Model Comparison
Full Model 1: $RSS_1$ ($df_1$)
Simple Model 2: $RSS_2$ ($df_2$)
$$F(df_2 - df_1,\ df_1) = \frac{(RSS_2 - RSS_1)/(df_2 - df_1)}{RSS_1/df_1}$$
34Model Comparison
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               8          6739598        842450    261.86   <.0001
Error             386          1241812    3217.12983
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            777.22985         53.83881       670466    208.40   <.0001
ell                   -0.84478          0.20883        52648     16.36   <.0001
meals                 -2.96498          0.16778      1004713    312.30   <.0001
yr_rnd               -19.80050          9.24933        14743      4.58   0.0329
mobility              -1.29000          0.43540        28240      8.78   0.0032
acs_46                 2.13851          0.77683        24380      7.58   0.0062
full                   0.64876          0.47072   6110.87487      1.90   0.1689
emer                  -0.67253          0.60209   4013.89996      1.25   0.2647
enroll                -0.01134          0.01672   1479.01726      0.46   0.4982
35Model Comparison
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5          6708541       1341708    410.04   <.0001
Error             389          1272868    3272.15552
Corrected Total   394          7981410

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            900.34592          8.68410     35172503   10749.0   <.0001
ell                   -0.88118          0.20162        62504     19.10   <.0001
meals                 -3.03351          0.15955      1182889    361.50   <.0001
yr_rnd               -20.98176          8.20226        21412      6.54   0.0109
mobility              -1.01603          0.41290        19814      6.06   0.0143
emer                  -1.44072          0.30549        72777     22.24   <.0001
36Model Comparison
Full Model 1: api00 = ell meals yr_rnd mobility acs_46 full emer enroll
  $RSS_1 = 1241812$, $df_1 = 386$
Simple Model 2: api00 = ell meals yr_rnd mobility emer
  $RSS_2 = 1272868$, $df_2 = 389$
$$F(df_2 - df_1,\ df_1) = \frac{(RSS_2 - RSS_1)/(df_2 - df_1)}{RSS_1/df_1}$$
$$F(3,\ 386) = \frac{(1272868 - 1241812)/3}{1241812/386} = 3.22, \quad P \approx 0.023$$
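The same comparison can be scripted so that it is easy to repeat for other pairs of nested models. The sketch below computes only the F statistic; the function name `nested_f` and its argument names are illustrative. The P-value would come from the upper tail of an F distribution with 3 and 386 degrees of freedom, which is not computed here.

```python
def nested_f(rss_full, df_full, rss_reduced, df_reduced):
    """F statistic for comparing a reduced model with the full model that contains it."""
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    denominator = rss_full / df_full
    return numerator / denominator

# Error sums of squares and error degrees of freedom from the two fitted models.
f_value = nested_f(rss_full=1241812, df_full=386,
                   rss_reduced=1272868, df_reduced=389)
print(f"F(3, 386) = {f_value:.2f}")
```

A large F value means that the three extra predictors (acs_46, full, enroll) jointly explain more variation than would be expected by chance.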