Title: SIMPLE AND MULTIPLE REGRESSION
1. SIMPLE AND MULTIPLE REGRESSION
2. Relationships between variables
- Between discrete variables
- (for an example, see Titanic)
- Between continuous variables (regression!)
- Between discrete and continuous variables (regression!)
3. Mathematics performance vs. number of books at home
PISA 2003
4. Mathematics performance vs. number of books at home
PISA 2003
5. Linear regression?
PISA 2003
6. Linear regression?
PISA 2003
7. Linear regression?
PISA 2003
8. Linear regression
9. We first load the PISAespanya.sav file; this is then the syntax file for the SPSS analysis.
Q38: How often do these things happen in your math class? Students don't listen to what the teacher says.
CROSSTABS
  /TABLES=subnatio BY st38q02
  /FORMAT=AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS=COUNT ROW .
FACTOR
  /VARIABLES pv1math pv2math pv3math pv4math pv5math
    pv1math1 pv2math1 pv3math1 pv4math1 pv5math1
    pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3
    pv1math4 pv2math4 pv3math4 pv4math4 pv5math4
  /MISSING LISTWISE
  /ANALYSIS pv1math pv2math pv3math pv4math pv5math
    pv1math1 pv2math1 pv3math1 pv4math1 pv5math1
    pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3
    pv1math4 pv2math4 pv3math4 pv4math4 pv5math4
  /PRINT INITIAL EXTRACTION FSCORE
  /PLOT EIGEN ROTATION
  /CRITERIA FACTORS(1) ITERATE(25)
  /EXTRACTION ML
  /ROTATION NOROTATE
  /SAVE REG(ALL) .
GRAPH
  /SCATTERPLOT(BIVAR)=st19q01 WITH fac1_1
  /MISSING=LISTWISE .
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT fac1_1
  /METHOD=ENTER st19q01
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /RESIDUALS HIST(ZRESID) NORM(ZRESID) .
10. Reading the data in R
> library(foreign)
> help(read.spss)
> data <- read.spss("G:/DATA/PISAdata2003/ReducedDataSpain.sav",
+                   use.value.labels=TRUE, to.data.frame=TRUE)
> names(data)
 [1] "SUBNATIO" "SCHOOLID" "ST03Q01"  "ST19Q01"  "ST26Q04"  "ST26Q05"
 [7] "ST27Q01"  "ST27Q02"  "ST27Q03"  "ST30Q02"  "EC07Q01"  "EC07Q02"
[13] "EC07Q03"  "EC08Q01"  "IC01Q01"  "IC01Q02"  "IC01Q03"  "IC02Q01"
[19] "IC03Q01"  "MISCED"   "FISCED"   "HISCED"   "PARED"    "PCMATH"
[25] "RMHMWK"   "CULTPOSS" "HEDRES"   "HOMEPOS"  "ATSCHL"   "STUREL"
[31] "BELONG"   "INTMAT"   "INSTMOT"  "MATHEFF"  "ANXMAT"   "MEMOR"
[37] "COMPLRN"  "COOPLRN"  "TEACHSUP" "ESCS"     "W.FSTUWT" "OECD"
[43] "UH"       "FAC1.1"
> attach(data)
> mean(FAC1.1)
[1] -8.95814e-16
> tabulate(ST19Q01)
[1]  106    0   15 1266 1927 2372 3575 1155  375
> table(ST19Q01)
ST19Q01
                 Miss   106
              Invalid     0
                  N/A    15
  More than 500 books  1266
        201-500 books  1927
        101-200 books  2372
         26-100 books  3575
          11-25 books  1155
           0-10 books   375
11. Effect of the family's cultural possessions
12. Data
Variables Y and X are observed on a sample of size n:
$(y_i, x_i), \quad i = 1, 2, \ldots, n$
13. Covariance and correlation
14. Scatterplot for various values of correlation
15. Correlation coefficient r = 0, even though there is an exact (nonlinear!) functional relationship
> cbind(x, y)
        x   y
 [1,] -10 100
 [2,]  -9  81
 [3,]  -8  64
 [4,]  -7  49
 [5,]  -6  36
 [6,]  -5  25
 [7,]  -4  16
 [8,]  -3   9
 [9,]  -2   4
[10,]  -1   1
[11,]   0   0
[12,]   1   1
[13,]   2   4
[14,]   3   9
[15,]   4  16
[16,]   5  25
[17,]   6  36
[18,]   7  49
[19,]   8  64
[20,]   9  81
[21,]  10 100
>
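The zero correlation for this table can be checked numerically. The following is a minimal sketch in Python (not the R session used in the slides); the helper name `pearson_r` is an assumption introduced here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation: cross-product sum over the root of the two sums of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = list(range(-10, 11))   # -10, -9, ..., 10, as in the table
y = [v ** 2 for v in x]    # exact nonlinear relation y = x^2
r = pearson_r(x, y)
print(r)                   # zero up to floating-point rounding
```

Because x is symmetric around zero, the positive and negative cross-products cancel, so the linear correlation vanishes even though y is a deterministic function of x.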
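16. Simple linear regression
Variables: Y, X
$E(Y \mid X) = \alpha + \beta X$
$\mathrm{Var}(Y \mid X) = \sigma^2$
17. Linear relation: $y = 1 + 0.6\,X$
18. Linear relation and sample data
19. Regression Model
$Y_i = \alpha + \beta X_i + \varepsilon_i$
$\varepsilon_i$: mean zero, variance $\sigma^2$, normally distributed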
20. Sample data: scatterplot
21. Fitted regression line
Fitted: a = 0.5789, b = 0.6270
22. Fitted and true regression lines
True: α = 1, β = 0.6
Fitted: a = 0.5789, b = 0.6270
23. Fitted and true regression lines in repeated (20) sampling
True: α = 1, β = 0.6
24. OLS estimate of beta (under repeated sampling)
Estimate of beta for different samples (100):
0.619 0.575 0.636 0.543 0.555 0.594 0.611 0.584 0.576 ......
True: α = 1, β = 0.6
> mean(bs)
[1] 0.6042086
> sd(bs)
[1] 0.03599894
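The repeated-sampling experiment can be sketched as follows. This is an illustrative Python version, not the code used for the slides: the true line α = 1, β = 0.6 and the 100 replications come from the slides, while the error standard deviation (`SIGMA = 3.0`), the sample size (`N = 100`), the uniform design for X and the seed are assumptions made here.

```python
import random
import statistics

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

ALPHA, BETA = 1.0, 0.6   # true line, from the slides
SIGMA = 3.0              # assumed error standard deviation
N, REPS = 100, 100       # assumed sample size; 100 samples as in the slides

rng = random.Random(2003)                       # fixed seed (arbitrary choice)
xs = [rng.uniform(-15, 15) for _ in range(N)]   # assumed design, held fixed

bs = []                                         # one slope estimate per sample
for _ in range(REPS):
    ys = [ALPHA + BETA * x + rng.gauss(0, SIGMA) for x in xs]
    bs.append(ols_fit(xs, ys)[1])

print(statistics.mean(bs), statistics.stdev(bs))
```

The mean of the estimated slopes should sit close to the true β, and their standard deviation approximates the standard error of the OLS slope under the assumed design.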
25. REGRESSION: Analysis of the Simulated Data
- (with R and other software)
26. Fitted regression line
True: α = 1, β = 0.6
Fitted: a = 1.0232203, b = 0.6436286
27. Regression Analysis
> regression <- lm(Y ~ X)
> summary(regression)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0860 -2.1429 -0.1863  1.9695  9.4817

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0232     0.3188    3.21  0.00180 **
X             0.6436     0.0377   17.07  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.182 on 98 degrees of freedom
Multiple R-Squared: 0.7483, Adjusted R-squared: 0.7458
F-statistic: 291.4 on 1 and 98 DF, p-value: < 2.2e-16
28. Regression Analysis with Stata
. use "E:\Albert\COURSES\cursDAS\AS2003\data\MONT.dta", clear
. regress y x

  Source |       SS       df       MS              Number of obs =     100
---------+------------------------------           F(  1,    98) =  291.42
   Model |  2950.73479     1  2950.73479           Prob > F      =  0.0000
Residual |  992.280727    98  10.1253135           R-squared     =  0.7483
---------+------------------------------           Adj R-squared =  0.7458
   Total |  3943.01551    99  39.8284395           Root MSE      =   3.182

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   .6436286   .0377029    17.071   0.000       .5688085    .7184488
   _cons |    1.02322   .3187931     3.210   0.002       .3905858    1.655855
------------------------------------------------------------------------------

. predict yh
. graph yh y x, c(s.) s(io)
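The summary statistics in the Stata header follow directly from the sums of squares in the ANOVA table. A small Python sketch of that arithmetic, using the SS and df values printed above (the variable names are chosen here for illustration):

```python
import math

# Sums of squares and degrees of freedom copied from the Stata output
ss_model, df_model = 2950.73479, 1
ss_resid, df_resid = 992.280727, 98
ss_total = ss_model + ss_resid
df_total = df_model + df_resid

ms_model = ss_model / df_model                        # mean square, model
ms_resid = ss_resid / df_resid                        # mean square, residual
r_squared = ss_model / ss_total                       # share of variation explained
adj_r_squared = 1 - ms_resid / (ss_total / df_total)  # penalised for df
f_stat = ms_model / ms_resid                          # overall F(1, 98)
root_mse = math.sqrt(ms_resid)                        # residual standard error

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(f_stat, 2), round(root_mse, 3))
```

These reproduce the reported R-squared, Adj R-squared, F and Root MSE to the printed precision.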
29. Regression analysis with SPSS
30. Estimation
31. Scatterplot
32. Fitted Regression
$\hat{Y}_i = 1.02 + 0.64\,X_i$, $R^2 = 0.74$
s.e. (0.037), t-value = 17.07
The regression coefficient of X is significant (5% significance level), with the expected value of Y increasing by 0.64 for each unit increase of X. The 95% confidence interval for the regression coefficient is
$[0.64 - 1.96 \times 0.037,\; 0.64 + 1.96 \times 0.037] = [0.57,\; 0.71]$.
74% of the variation of Y is explained by the variation of X.
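The interval and the t-value on this slide are simple arithmetic on the estimate and its standard error. A minimal Python sketch, assuming the normal approximation (z = 1.96) used on the slide; the function name `normal_ci` is introduced here for illustration only.

```python
def normal_ci(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval: estimate +/- z * standard error."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Rounded values from the slide
lower, upper = normal_ci(0.64, 0.037)

# t-value from the unrounded Stata estimates (coefficient / standard error)
t_value = 0.6436286 / 0.0377029

print(round(lower, 2), round(upper, 2), round(t_value, 2))
```

Stata's own interval (.5688, .7184) is slightly wider because it uses the Student-t quantile with 98 degrees of freedom and unrounded estimates instead of 1.96.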
33. Fitted regression line
34. Residual plot
35. OLS analysis
36. Variance decomposition
37. Properties of OLS estimation
38. Sampling distributions
39. Inferences
40. Student-t distribution
41. Significance tests
42. F-test
43. Prediction of Y
44. Multiple Regression
45. t-test and CI
46. F-test
47. Confidence bounds for Y
48. Interpreting multiple regression by means of simple regression
49. Adjusted $R^2$
50. Example of regression analysis
51. Variables
52. Matrix Plot
The variables need to be transformed!
53. Transformation of variables
54. Matrix Plot
(Approximately) linear relationships between the transformed variables!
55. Regression analysis
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /SAVE RESID .
56. Residuals vs. fitted y
57. Partial regression
58. Partial regression
59. Partial regression
60. Regression analysis
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg cal2
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /SAVE RESID .
61. Residuals vs. fitted y
62. Case statistics
- Case misfit
- Potential for influence: leverage
- Influence
63. Residuals
64. Hat matrix
65. Influence Analysis
66. Diagnostic case statistics
After fitting the regression, use the instruction
  fpredict namevar, option
where option is one of:
  (none)       predicted value of y
  cooksd       Cook's D influence measure
  dfbeta(x1)   DFBETAS for the regression coefficient on var x1
  dfits        DFFITS influence measures
  hat          diagonal elements of the hat matrix (leverage)
  leverage     (same as hat)
  residuals    residuals
  rstandard    standardized residuals
  rstudent     Studentized residuals
  stdf         standard errors of predicted individual y (standard error of forecast)
  stdp         standard errors of predicted mean y
  stdr         standard errors of residuals
  welsch       Welsch's distance influence measure
....
display tprob(47, 2.337)
sort namevar
list x1 x2 5/1
67. Leverages
After fit:
. fpredict lev, leverage
. gsort -lev
. list nombre lev in 1/5

      nombre                                     lev
  1.  RECAGUREN SOCIEDAD LIMITADA,          .0803928
  2.  EL MARQUES DEL AMPURDAN S,L,          .0767692
  3.  ADOBOS Y DERIVADOS S,L,               .0572497
  4.  CONSERVAS GARAVILLA SA                .0549707
  5.  PLANTA DE ELABORADOS ALIMENTICIOS MA  .0531497
68. Box plot of leverage values
. predict lev, leverage
. graph lev, box s(_n)
Cases with extreme leverages
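For a simple regression with an intercept, the leverage of case i has the closed form h_i = 1/n + (x_i - mean(x))^2 / Sxx. The sketch below computes it in plain Python on hypothetical data (the x values are invented for illustration, not taken from the course data) and flags cases above 2(p+1)/n, the rule of thumb listed for leverage in these slides.

```python
def leverages(xs):
    """Diagonal of the hat matrix for a simple regression with intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data; last case lies far from the rest
h = leverages(x)

p = 1                                  # number of predictors
cutoff = 2 * (p + 1) / len(x)          # rule of thumb: 2(p+1)/n
flagged = [i for i, h_i in enumerate(h) if h_i > cutoff]

print(flagged, round(sum(h), 6))
```

The leverages always sum to the number of estimated coefficients (here 2), which is why the cutoff scales with (p+1)/n.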
69. Leverage versus squared residual plot
. lvr2plot, s(_n)
70. Dfbetas
. fpredict beta, dfbeta(nt_paau)
. graph beta, box s(_n)
71. Regression Outliers, basic idea
Outlier
72. Regression Outliers, indicators
Indicator | Description | Rule of thumb (when wrong)
Resid | Residual: actual minus predicted | -
ZResid | Standardized residual: residual divided by the standard deviation of the residuals | > 3 (in absolute value)
SResid | Studentized residual: residual divided by the standard deviation of the residual at that particular data point of X values | > 3 (in absolute value)
DResid | Difference between the residual and the residual when the data point is deleted | -
SDResid | See DResid; standardized by the standard deviation at that particular data point of X values | > 3 (in absolute value)
73. Regression Outliers, in SPSS
74. Regression Influential Points, Basic Idea
Influential point (not an outlier!)
75. Regression Influential Points, Indicators
Indicator | Description | Rule of thumb
Lever | Distance between the point and the other points (NB: potentially influential) | > 2 (p+1) / n
Cook | Change in the residuals of the other cases when a particular case is left out of the regression | > 1
DfBeta | Difference between the betas when the case is included and when it is left out of the model (NB: we get one for each beta) | -
SdfBeta | DfBeta / standard error of DfBeta (NB: one for each beta) | > 2/√n
DfFit | Difference between the predicted value when the case is included versus left out of the model | -
SDfFit | DfFit / standard deviation of DfFit | > 2 √(p/n)
CovRatio | Change in the variances/covariances when the point is left out | |CovRatio − 1| > 3 p / n
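The DfBeta idea in the table, the change in a coefficient when one case is left out, can be illustrated by brute-force refitting. This is a minimal Python sketch on hypothetical data (nine points exactly on y = 1 + 0.6x plus one high-leverage point off the line); it is not how SPSS or Stata compute the statistic internally, and the names `ols_fit` and `dfbeta_slope` are introduced here.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def dfbeta_slope(xs, ys):
    """Unstandardized DfBeta: slope with all cases minus slope without case i."""
    full_slope = ols_fit(xs, ys)[1]
    changes = []
    for i in range(len(xs)):
        xs_wo = xs[:i] + xs[i + 1:]    # data with case i deleted
        ys_wo = ys[:i] + ys[i + 1:]
        changes.append(full_slope - ols_fit(xs_wo, ys_wo)[1])
    return changes

# Hypothetical data: first nine points on y = 1 + 0.6 x, last point far off the line
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [1.6, 2.2, 2.8, 3.4, 4.0, 4.6, 5.2, 5.8, 6.4, 8.0]

db = dfbeta_slope(x, y)
most_influential = max(range(len(db)), key=lambda i: abs(db[i]))
print(most_influential, db[most_influential])
```

Dividing each change by its standard error would give the standardized version (SdfBeta) to which the 2/√n rule of thumb applies.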
76. Regression Influential Points, in SPSS
Case 2 is an influential point
81. Regression Influential Points, what to do?
- Nothing at all.
- Check the data.
- Delete atypical data points, then repeat the regression without them.
Whether or not to delete a point is an issue statisticians disagree on.
82. MULTICOLLINEARITY
83. Regression Multicollinearity
- If predictors are highly correlated, we speak of multicollinearity.
- Is this a problem? If you want to assess the influence of each predictor, yes, because:
- Standard errors blow up, making coefficients non-significant.
84. Analyzing math data
. use "G:\Albert\COURSES\cursDAS\AS2003b\data\mat.dta", clear
. save "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta"
file G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta saved
. edit
- preserve
. gen perform = (nt_m1 + nt_m2 + nt_m3)/3
(110 missing values generated)
. corr perform nt_paau nt_acces nt_exp
(obs=189)

         |  perform  nt_paau nt_acces   nt_exp
---------+------------------------------------
 perform |   1.0000
 nt_paau |   0.3535   1.0000
nt_acces |   0.5057   0.8637   1.0000
  nt_exp |   0.5002   0.3533   0.7760   1.0000

. outfile nt_exp nt_paau nt_acces perform using "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dat"
86. Multiple regression: perform vs nt_acces nt_paau
. regress perform nt_acces nt_paau

  Source |       SS       df       MS              Number of obs =     245
---------+------------------------------           F(  2,   242) =   31.07
   Model |  71.1787647     2  35.5893823           Prob > F      =  0.0000
Residual |  277.237348   242  1.14560888           R-squared     =  0.2043
---------+------------------------------           Adj R-squared =  0.1977
   Total |  348.416112   244  1.42793489           Root MSE      =  1.0703

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
nt_acces |   1.272819   .2427707     5.243   0.000       .7946054    1.751032
 nt_paau |  -.2755092   .1835091    -1.501   0.135      -.6369882    .0859697
   _cons |  -1.513124   .9729676    -1.555   0.121       -3.42969    .4034425
------------------------------------------------------------------------------

perform = performance in Mathematics I to III
87. Collinearity
88. Diagnostics for multicollinearity
. corr nt_paau nt_exp nt_acces
(obs=276)

         |  nt_paau   nt_exp nt_acces
---------+---------------------------
 nt_paau |   1.0000
  nt_exp |   0.3435   1.0000
nt_acces |   0.8473   0.7890   1.0000

. fit perform nt_paau nt_exp nt_acces
. vif

Variable |      VIF       1/VIF
---------+----------------------
nt_acces |  1201.85    0.000832
 nt_paau |   514.27    0.001945
  nt_exp |   384.26    0.002602
---------+----------------------
Mean VIF |   700.13

$\mathrm{VIF}_j = 1 / (1 - R_j^2)$
The ratio 1/VIF is called the tolerance.
Any explanatory variable with a VIF greater than 5 (or a tolerance less than .2) shows a degree of collinearity that may be problematic.
In the case of just nt_paau and nt_exp we get:
. vif

Variable |      VIF       1/VIF
---------+----------------------
  nt_exp |     1.14    0.875191
 nt_paau |     1.14    0.875191
---------+----------------------
Mean VIF |     1.14
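With only two predictors, R_j^2 in the VIF formula is just the squared correlation between them, so the last table can be checked by hand. A minimal Python sketch, using the correlation 0.3533 between nt_paau and nt_exp reported in the earlier corr output (obs=189); the function name is an assumption introduced here.

```python
def vif_two_predictors(r):
    """VIF for either predictor when the model has exactly two predictors
    with correlation r, so that R_j^2 = r^2."""
    return 1.0 / (1.0 - r ** 2)

r = 0.3533                       # corr(nt_paau, nt_exp), obs=189
vif = vif_two_predictors(r)
tolerance = 1.0 / vif            # tolerance = 1 - R_j^2

print(round(vif, 2), round(tolerance, 3))
```

This agrees with the VIF of 1.14 and tolerance of about 0.875 in the Stata output; with three or more predictors R_j^2 has to come from regressing predictor j on all the others.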
89. Multiple regression: perform vs nt_paau nt_exp
. regress perform nt_paau nt_exp

  Source |       SS       df       MS              Number of obs =     189
---------+------------------------------           F(  2,   186) =   37.24
   Model |  75.2441994     2  37.6220997           Prob > F      =  0.0000
Residual |  187.897174   186  1.01019986           R-squared     =  0.2859
---------+------------------------------           Adj R-squared =  0.2783
   Total |  263.141373   188  1.39968815           Root MSE      =  1.0051

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 nt_paau |   .3382551   .1109104     3.050   0.003        .119451    .5570593
  nt_exp |   .9040681   .1396126     6.476   0.000       .6286403    1.179496
   _cons |  -3.295308   1.104543    -2.983   0.003      -5.474351   -1.116266
------------------------------------------------------------------------------

. corr nt_exp nt_paau nt_acces
(obs=276)

         |   nt_exp  nt_paau nt_acces
---------+---------------------------
  nt_exp |   1.0000
 nt_paau |   0.3435   1.0000
nt_acces |   0.7890   0.8473   1.0000

. predict yh
(option xb assumed; fitted values)
(82 missing values generated)
. predict e, resid
(169 missing values generated)
90. Regression Multicollinearity, Indicators
Indicator | Description | Rule of thumb (when wrong)
Overall F-test versus tests of coefficients | Overall F-test is significant, but individual coefficients are not | -
Beta | Standardized coefficient | Outside [-1, 1]
Tolerance | Unique variance of a predictor (not shared with or explained by the other predictors). NB: one tolerance per coefficient | < 0.01
Variance Inflation Factor | √VIF indicates how much the standard error of a particular coefficient is inflated due to correlation between this predictor and the other predictors. NB: one VIF per coefficient | > 10
Eigenvalues | Rather technical | ≈ 0
Condition Index | Rather technical | > 30
Variance Proportion | Rather technical; look at the loadings on the dimensions | Loadings around 1
91. Regression Multicollinearity, in SPSS
Diagnostics
92. Regression Multicollinearity, in SPSS
Beta > 1
Tolerance and VIF are fine
93. Regression Multicollinearity, in SPSS
2 eigenvalues around 0
These variables cause the multicollinearity
Condition index (CI) is fine
94. Regression Multicollinearity, what to do?
- Nothing (if there is no interest in the individual coefficients, only in good prediction)
- Leave one (or more) predictor(s) out
- Use PCA to reduce highly correlated variables to a smaller number of uncorrelated variables
95. Categorical variables
Use Survey_sample.sav, in i/.../data
96. Salary vs. gender, years of education, work status
97. Creating dichotomous (dummy) variables
GET FILE='G:\Albert\Web\Metodes2005\Dades\survey_sample.sav'.
COMPUTE D1 = (wrkstat=1).
EXECUTE .
COMPUTE D2 = (wrkstat=2).
EXECUTE .
COMPUTE D3 = (wrkstat=3).
EXECUTE .
COMPUTE D4 = (wrkstat=4).
EXECUTE .
COMPUTE D5 = (wrkstat=5).
EXECUTE .
COMPUTE D6 = (wrkstat=6).
EXECUTE .
COMPUTE D7 = (wrkstat=7).
EXECUTE .
COMPUTE D8 = (wrkstat=8).
EXECUTE .
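The same indicator coding can be sketched outside SPSS. The Python example below builds one 0/1 column per work-status code and then drops the first as the reference category, mirroring the later regression blocks that enter d2 to d8 only. The `wrkstat` values are hypothetical, invented for illustration, and `dummy_code` is a helper name introduced here.

```python
def dummy_code(values, categories):
    """Return {category: list of 0/1 indicators}, one column per category."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

wrkstat = [1, 3, 2, 8, 1, 5]                # hypothetical work-status codes (1..8)
dummies = dummy_code(wrkstat, range(1, 9))  # D1 ... D8

# Drop D1 so it serves as the reference category; entering all eight
# together with an intercept would be perfectly collinear.
predictors = {f"d{c}": col for c, col in dummies.items() if c != 1}

print(sorted(predictors))
```

Each case has exactly one indicator equal to 1 across D1 to D8, which is why one category has to be left out of the regression.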
98. Regression in blocks
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8 .
100. Regression in blocks
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER educ
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8 .
101. Categorical Predictors
Is income dependent on years of age and religion?
102. Categorical Predictors
Compute a dummy variable for each category, except the last
103. Categorical Predictors
And so on
104. Categorical Predictors
Block 1
105. Categorical Predictors
Block 2
106. Categorical Predictors
Ask for $R^2$ change
107. Categorical Predictors
Look at the R Square change to judge the importance of the categorical variable
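The R Square change between two blocks is usually judged with an F test on the added predictors. A minimal Python sketch of the standard formula follows; the function name and argument names are introduced here, and the second call uses hypothetical values (they are not results from the survey data).

```python
def r2_change_f(r2_reduced, r2_full, n, k_full, q):
    """F statistic for the R^2 change when q predictors are added.

    k_full is the number of predictors in the larger model (intercept excluded).
    Returns (F, numerator df, denominator df).
    """
    df2 = n - k_full - 1
    f = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df2)
    return f, q, df2

# Check against the simulated-data regression: going from no predictors
# to X alone (R^2 = 0.7483, n = 100) gives the overall F, up to rounding.
f_overall, _, df_overall = r2_change_f(0.0, 0.7483, n=100, k_full=1, q=1)

# Hypothetical block example: 7 work-status dummies added to sex and educ
f_block, q_block, df_block = r2_change_f(0.10, 0.16, n=500, k_full=9, q=7)

print(round(f_overall, 1), round(f_block, 2), q_block, df_block)
```

A large F for the block indicates that the categorical variable as a whole adds explanatory power, even if individual dummy coefficients are not significant.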