SIMPLE AND MULTIPLE REGRESSION

Title: Mètodes Estadístics Aplicats a les Ciències Polítiques i de l'Administració (12084)
Author: cjt239
Last modified by: upf

Slides: 108

Provided by: cjt9

Learn more at: https://www.econ.upf.edu

1
SIMPLE AND MULTIPLE REGRESSION
2
Relationships between variables

Between discrete variables
(for an example, see Titanic)
Between continuous variables (regression!)
Between discrete and continuous variables (regression!)

3
Mathematics performance, number of books at home
PISA 2003
4
Mathematics performance, number of books at home
PISA 2003
5
Linear regression?
PISA 2003
6
Linear regression?
PISA 2003
7
Linear regression?
PISA 2003
8
Linear regression

9
We first load the PISAespanya.sav file; this is then the syntax file for the
SPSS analysis:

* Q38 How often do these things happen in your math class: Students don't listen to what the teacher says.
CROSSTABS
  /TABLES=subnatio BY st38q02
  /FORMAT=AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS=COUNT ROW .

FACTOR
  /VARIABLES pv1math pv2math pv3math pv4math pv5math
    pv1math1 pv2math1 pv3math1 pv4math1 pv5math1
    pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3
    pv1math4 pv2math4 pv3math4 pv4math4 pv5math4
  /MISSING LISTWISE
  /ANALYSIS pv1math pv2math pv3math pv4math pv5math
    pv1math1 pv2math1 pv3math1 pv4math1 pv5math1
    pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3
    pv1math4 pv2math4 pv3math4 pv4math4 pv5math4
  /PRINT INITIAL EXTRACTION FSCORE
  /PLOT EIGEN ROTATION
  /CRITERIA FACTORS(1) ITERATE(25)
  /EXTRACTION ML
  /ROTATION NOROTATE
  /SAVE REG(ALL) .

GRAPH
  /SCATTERPLOT(BIVAR)=st19q01 WITH fac1_1
  /MISSING=LISTWISE .

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT fac1_1
  /METHOD=ENTER st19q01
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /RESIDUALS HIST(ZRESID) NORM(ZRESID) .
10
> library(foreign)
> help(read.spss)
> data=read.spss("G:/DATA/PISAdata2003/ReducedDataSpain.sav",
+   use.value.labels=TRUE, to.data.frame=TRUE)
> names(data)
 [1] "SUBNATIO" "SCHOOLID" "ST03Q01"  "ST19Q01"  "ST26Q04"  "ST26Q05"
 [7] "ST27Q01"  "ST27Q02"  "ST27Q03"  "ST30Q02"  "EC07Q01"  "EC07Q02"
[13] "EC07Q03"  "EC08Q01"  "IC01Q01"  "IC01Q02"  "IC01Q03"  "IC02Q01"
[19] "IC03Q01"  "MISCED"   "FISCED"   "HISCED"   "PARED"    "PCMATH"
[25] "RMHMWK"   "CULTPOSS" "HEDRES"   "HOMEPOS"  "ATSCHL"   "STUREL"
[31] "BELONG"   "INTMAT"   "INSTMOT"  "MATHEFF"  "ANXMAT"   "MEMOR"
[37] "COMPLRN"  "COOPLRN"  "TEACHSUP" "ESCS"     "W.FSTUWT" "OECD"
[43] "UH"       "FAC1.1"
> attach(data)
> mean(FAC1.1)
[1] -8.95814e-16
> tabulate(ST19Q01)
[1]  106    0   15 1266 1927 2372 3575 1155  375
> table(ST19Q01)
ST19Q01
               Miss             Invalid                 N/A More than 500 books
                106                   0                  15                1266
      201-500 books       101-200 books        26-100 books         11-25 books
               1927                2372                3575                1155
         0-10 books
                375
11
Effect of the family's cultural possessions
12
Data
Variables Y and X observed on a sample of size n:
$(y_i, x_i)$, $i = 1, 2, \ldots, n$
13
Covariance and correlation
14
Scatterplot for various values of correlation
15
Correlation coefficient r = 0, even though there is
an exact functional relationship (nonlinear!)
> cbind(x,y)
        x   y
 [1,] -10 100
 [2,]  -9  81
 [3,]  -8  64
 [4,]  -7  49
 [5,]  -6  36
 [6,]  -5  25
 [7,]  -4  16
 [8,]  -3   9
 [9,]  -2   4
[10,]  -1   1
[11,]   0   0
[12,]   1   1
[13,]   2   4
[14,]   3   9
[15,]   4  16
[16,]   5  25
[17,]   6  36
[18,]   7  49
[19,]   8  64
[20,]   9  81
[21,]  10 100
>
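A minimal Python sketch of the point made on this slide: for x = -10, ..., 10 and y = x squared (the data listed above), the Pearson correlation is zero even though y is an exact function of x. The helper name `pearson_r` is an illustrative choice, not something from the original slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Same data as the cbind(x, y) listing: x = -10..10, y = x^2
x = list(range(-10, 11))
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # zero up to floating-point rounding
```

The positive and negative halves of x contribute products of equal size and opposite sign, so the covariance cancels; r only measures linear association.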
16
Simple Linear Regression
Variables: Y, X
$E(Y \mid X) = \alpha + \beta X$
$\mathrm{Var}(Y \mid X) = \sigma^2$
17
Linear relation: $y = 1 + .6X$
18
Linear relation and sample data
19
Regression Model
$Y_i = \alpha + \beta X_i + \epsilon_i$
$\epsilon_i$: mean zero, variance $\sigma^2$, normally
distributed
20
Sample Data Scatterplot
21
Fitted regression line
$\hat{\alpha} = 0.5789$, $\hat{\beta} = 0.6270$
22
Fitted and true regression lines
$\alpha = 1$, $\beta = .6$
$\hat{\alpha} = 0.5789$, $\hat{\beta} = 0.6270$
23
Fitted and true regression lines in repeated
sampling (20 samples)
$\alpha = 1$, $\beta = .6$
24
OLS estimate of beta (under repeated sampling)
Estimates of beta for 100 different samples:
0.619 0.575 0.636 0.543 0.555 0.594 0.611 0.584
0.576 ......
$\alpha = 1$, $\beta = .6$
> mean(bs)
[1] 0.6042086
> sd(bs)
[1] 0.03599894
>
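The repeated-sampling idea can be sketched in Python as follows. The true values alpha = 1 and beta = .6 and the 100 replications come from the slides; the sample size (100), the error standard deviation (3), the uniform distribution of X on (0, 30) and the seed are assumptions made only for this illustration, so the numbers will not match the slide exactly.

```python
import random
import statistics

def ols_slope(xs, ys):
    """OLS slope estimate: sum of cross-products over sum of squares of x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(n_samples=100, n=100, alpha=1.0, beta=0.6, sigma=3.0, seed=12345):
    """Draw n_samples independent samples from Y = alpha + beta*X + e and refit each one."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is reproducible
    slopes = []
    for _ in range(n_samples):
        xs = [rng.uniform(0, 30) for _ in range(n)]                 # assumed design for X
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]   # normal errors
        slopes.append(ols_slope(xs, ys))
    return slopes

bs = simulate_slopes()
print(statistics.fmean(bs), statistics.stdev(bs))
```

The mean of the estimates should sit close to the true beta (unbiasedness), and their standard deviation approximates the standard error of the slope.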
25
REGRESSION Analysis of the Simulated Data

(with R and other software)

26
Fitted regression line
$\alpha = 1$, $\beta = .6$
$\hat{\alpha} = 1.0232203$, $\hat{\beta} = 0.6436286$
27
Regression Analysis
> regression = lm(Y ~ X)
> summary(regression)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0860 -2.1429 -0.1863  1.9695  9.4817

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0232     0.3188    3.21  0.00180 **
X             0.6436     0.0377   17.07  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.182 on 98 degrees of freedom
Multiple R-Squared: 0.7483,     Adjusted R-squared: 0.7458
F-statistic: 291.4 on 1 and 98 DF,  p-value: < 2.2e-16
>
28
Regression Analysis with Stata
. use "E:\Albert\COURSES\cursDAS\AS2003\data\MONT.dta", clear
. regress y x

  Source |       SS       df       MS              Number of obs =     100
---------+------------------------------           F(  1,    98) =  291.42
   Model |  2950.73479     1  2950.73479           Prob > F      =  0.0000
Residual |  992.280727    98  10.1253135           R-squared     =  0.7483
---------+------------------------------           Adj R-squared =  0.7458
   Total |  3943.01551    99  39.8284395           Root MSE      =   3.182

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   .6436286   .0377029    17.071   0.000       .5688085    .7184488
   _cons |    1.02322   .3187931     3.210   0.002       .3905858    1.655855
------------------------------------------------------------------------------

. predict yh
. graph yh y x, c(s.) s(io)
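As a check on how the summary statistics in the Stata output relate to one another, the following Python sketch recomputes them from the sums of squares, degrees of freedom, coefficient and standard error printed above. Only the variable names are introduced here; all numbers are taken from the output.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table above
ss_model, df_model = 2950.73479, 1
ss_resid, df_resid = 992.280727, 98
ss_total = ss_model + ss_resid          # 3943.0155...
df_total = df_model + df_resid          # 99

ms_model = ss_model / df_model          # mean square of the model
ms_resid = ss_resid / df_resid          # mean square of the residuals
f_stat = ms_model / ms_resid            # F(1, 98)
r_squared = ss_model / ss_total         # share of variation explained
adj_r_squared = 1 - ms_resid / (ss_total / df_total)
root_mse = math.sqrt(ms_resid)          # residual standard error

# t statistic of the slope: coefficient divided by its standard error
t_slope = 0.6436286 / 0.0377029

print(round(f_stat, 2), round(r_squared, 4), round(adj_r_squared, 4),
      round(root_mse, 3), round(t_slope, 3))
```

In a simple regression the F statistic equals the square of the slope's t statistic, which can be verified with `t_slope ** 2`.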

29
Regression analysis with SPSS
30
Estimation
31
Scatterplot
32
Fitted Regression
$\hat{Y}_i = 1.02 + .64 X_i$, $R^2 = .74$
s.e.: (.037), t-value: 17.07

The regression coefficient of X is significant (5%
significance level), with the expected value of Y
increasing by .64 for each unit increase of X. The
95% confidence interval for the regression
coefficient is [.64 - 1.96 × .037,
.64 + 1.96 × .037] = [.57, .71]. 74% of the variation
of Y is explained by the variation of X.
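The interval arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch using the rounded values quoted on the slide (.64 and .037) and the normal critical value 1.96; the function name is illustrative. The Stata output shown earlier uses the unrounded estimates and a Student-t critical value, which is why its bounds differ slightly in the third decimal.

```python
def confidence_interval(estimate, std_error, critical_value=1.96):
    """Symmetric interval: estimate plus or minus critical_value times the standard error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width

b, se = 0.64, 0.037                      # rounded slope and standard error from the slide
lower, upper = confidence_interval(b, se)
print(round(lower, 2), round(upper, 2))  # 0.57 0.71

# The coefficient is significant at the 5% level when the interval excludes zero
significant = not (lower <= 0.0 <= upper)
```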
33
Fitted regression line
34
Residual plot
35
OLS analysis
36
Variance decomposition
37
Properties of OLS estimation
38
Sampling distributions
39
Inferences
40
Student-t distribution
41
Significance tests
42
F-test
43
Prediction of Y
44
Multiple Regression
45
t-test and CI
46
F-test
47
Confidence bounds for Y
48
Interpreting multiple regression by means of
simple regression
49
Adjusted $R^2$
50
Example of Regression Analysis

Data from paisos.sav

51
Variables
52
Matrix Plot
The variables need to be transformed!
53
Transformation of variables
54
Matrix Plot
(Approximately) linear relationships between
the transformed variables!
55
Regression Analysis
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /SAVE RESID .
56
Residuals vs. fitted y
57
Partial regression
58
Partial regression
59
Partial regression
60
Regression Analysis
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg cal2
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /SAVE RESID .
61
Residuals vs. fitted y
62
Case statistics

Case misfit
Potential for influence: leverage
Influence

63
Residuals
64
Hat matrix
65
Influence Analysis
66
Diagnostic case statistics
After fitting the regression, use the instruction

fpredict namevar

which gives the predicted value of y, or add one of the options:

  , cooksd       Cook's D influence measure
  , dfbeta(x1)   DFBETAS for regression coefficient on var x1
  , dfits        DFFITS influence measures
  , hat          diagonal elements of hat matrix (leverage)
  , leverage     (same as hat)
  , residuals    residuals
  , rstandard    standardized residuals
  , rstudent     Studentized residuals
  , stdf         standard errors of predicted individual y
                 (standard error of forecast)
  , stdp         standard errors of predicted mean y
  , stdr         standard errors of residuals
  , welsch       Welsch's distance influence measure

....
display tprob(47, 2.337)
sort namevar
list x1 x2 5/1
67
Leverages
After fit:
. fpredict lev, leverage
. gsort -lev
. list nombre lev in 1/5

     nombre                                        lev
  1. RECAGUREN SOCIEDAD LIMITADA,             .0803928
  2. EL MARQUES DEL AMPURDAN S,L,             .0767692
  3. ADOBOS Y DERIVADOS S,L,                  .0572497
  4. CONSERVAS GARAVILLA SA                   .0549707
  5. PLANTA DE ELABORADOS ALIMENTICIOS MA     .0531497

.
68
Box plot of leverage values
. predict lev, leverage
. graph lev, box s(_n)
Cases with extreme leverages
69
Leverage versus residual square plot
. lvr2plot, s(_n)
70
Dfbetas
. fpredict beta, dfbeta(nt_paau)
. graph beta, box s(_n)
71
Regression Outliers, basic idea
Outlier
72
Regression Outliers, indicators
Indicator | Description | Rule of thumb (when wrong)
Resid | Residual: actual - predicted | -
ZResid | Standardized residual: residual divided by the standard deviation of the residuals | > 3 (in absolute value)
SResid | Studentized residual: residual divided by the standard deviation of the residual at that particular data point of X values | > 3 (in absolute value)
DResid | Difference between the residual and the residual when the data point is deleted | -
SDResid | See DResid; standardized by the standard deviation at that particular data point of X values | > 3 (in absolute value)
73
Regression Outliers, in SPSS
74
Regression Influential Points, Basic Idea
Influential Point (no outlier!)
75
Regression Influential Points, Indicators
Indicator | Description | Rule of thumb
Lever | Distance between the point and the other points. NB: potentially influential | > 2(p+1)/n
Cook | Change in the residuals of the other cases when a particular case is left out of the regression | > 1
DfBeta | Difference between the betas when the case is included and when it is left out of the model. NB: we get one of these for each beta | -
SdfBeta | DfBeta / standard error of DfBeta. NB: for each beta | > 2/√n
DfFit | Difference between the predicted value when the case is included versus left out of the model | -
SDfFit | DfFit / standard deviation (SdFit) | > 2√(p/n)
CovRatio | Change in the variances/covariances when the point is left out | |CovRatio - 1| > 3p/n
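To make the leverage rule of thumb concrete, here is a small Python sketch for the simple-regression case, where the leverage of case i is 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The data are hypothetical, chosen so that one x value lies far from the rest, and p is read as the number of predictors (an assumption about the slide's notation).

```python
def leverages(xs):
    """Diagonal of the hat matrix for a regression on one predictor plus intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

def high_leverage_cases(xs, n_predictors=1):
    """Indices of cases whose leverage exceeds the 2(p+1)/n rule of thumb."""
    cutoff = 2 * (n_predictors + 1) / len(xs)
    return [i for i, h in enumerate(leverages(xs)) if h > cutoff]

# Hypothetical predictor values: the last case is far from the others
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverages(x)
flagged = high_leverage_cases(x)
print(flagged)  # [9]
```

The leverages always sum to the number of estimated coefficients (here 2), so 2(p+1)/n is twice the average leverage. High leverage marks a case as potentially influential; whether it actually is depends on its residual as well.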
76
Regression Influential points, in SPSS
Case 2 is an influential point
81
Regression Influential Points, what to do?

Nothing at all
Check the data
Delete atypical data points, then repeat the
regression without them

Whether or not to delete a point is an issue
statisticians disagree on
82
MULTICOLLINEARITY

Diagnostic tools

83
Regression Multicollinearity

If the predictors are highly correlated, we speak of
multicollinearity
Is this a problem? If you want to assess the
influence of each predictor, yes it is, because
standard errors blow up, making the coefficients
non-significant

84
Analyzing math data
. use "G:\Albert\COURSES\cursDAS\AS2003b\data\mat.dta", clear
. save "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta"
file G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta saved
. edit
- preserve
. gen perform = (nt_m1 + nt_m2 + nt_m3)/3
(110 missing values generated)
. corr perform nt_paau nt_acces nt_exp
(obs=189)

         |  perform  nt_paau nt_acces   nt_exp
---------+------------------------------------
 perform |   1.0000
 nt_paau |   0.3535   1.0000
nt_acces |   0.5057   0.8637   1.0000
  nt_exp |   0.5002   0.3533   0.7760   1.0000

. outfile nt_exp nt_paau nt_acces perform using "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dat"
.
86
Multiple regression perform vs nt_acces nt_paau
. regress perform nt_acces nt_paau

  Source |       SS       df       MS              Number of obs =     245
---------+------------------------------           F(  2,   242) =   31.07
   Model |  71.1787647     2  35.5893823           Prob > F      =  0.0000
Residual |  277.237348   242  1.14560888           R-squared     =  0.2043
---------+------------------------------           Adj R-squared =  0.1977
   Total |  348.416112   244  1.42793489           Root MSE      =  1.0703

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
nt_acces |   1.272819   .2427707     5.243   0.000       .7946054    1.751032
 nt_paau |  -.2755092   .1835091    -1.501   0.135      -.6369882    .0859697
   _cons |  -1.513124   .9729676    -1.555   0.121       -3.42969    .4034425
------------------------------------------------------------------------------

.
perform = performance in Mathematics I to III
87
Collinearity
88
Diagnostics for multicollinearity
. corre nt_paau nt_exp nt_acces
(obs=276)

         |  nt_paau   nt_exp nt_acces
---------+---------------------------
 nt_paau |   1.0000
  nt_exp |   0.3435   1.0000
nt_acces |   0.8473   0.7890   1.0000

. fit perform nt_paau nt_exp nt_acces
. vif

Variable |      VIF       1/VIF
---------+----------------------
nt_acces |  1201.85    0.000832
 nt_paau |   514.27    0.001945
  nt_exp |   384.26    0.002602
---------+----------------------
Mean VIF |   700.13

.

$\mathrm{VIF}_j = 1 / (1 - R_j^2)$
Any explanatory variable with a VIF greater than
5 (or a tolerance less than .2) shows a degree of
collinearity that may be problematic.
The ratio 1/VIF is called the tolerance.
In the case of just nt_paau and nt_exp we get:
. vif

Variable |      VIF       1/VIF
---------+----------------------
  nt_exp |     1.14    0.875191
 nt_paau |     1.14    0.875191
---------+----------------------
Mean VIF |     1.14
.
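With only two predictors, the R squared of one predictor regressed on the other is simply their squared correlation, so the VIF can be checked by hand. The Python sketch below does this for nt_paau and nt_exp, using the correlation 0.3533 reported in the 189-observation correlation matrix of the math data; the function name is illustrative.

```python
def vif_two_predictors(r):
    """VIF and tolerance when a model has exactly two predictors with correlation r."""
    r_squared = r ** 2            # R_j^2 of one predictor regressed on the other
    tolerance = 1 - r_squared     # 1/VIF
    return 1 / tolerance, tolerance

# Correlation between nt_paau and nt_exp from the (obs=189) correlation matrix
vif, tolerance = vif_two_predictors(0.3533)
print(round(vif, 2), round(tolerance, 4))  # 1.14 0.8752
```

The result agrees with the Stata output for the two-predictor model (VIF 1.14, tolerance 0.875191) up to the rounding of the correlation. With three or more predictors, R_j^2 has to come from regressing predictor j on all the others.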
89
Multiple regression perform vs nt_paau nt_exp
. regress perform nt_paau nt_exp

  Source |       SS       df       MS              Number of obs =     189
---------+------------------------------           F(  2,   186) =   37.24
   Model |  75.2441994     2  37.6220997           Prob > F      =  0.0000
Residual |  187.897174   186  1.01019986           R-squared     =  0.2859
---------+------------------------------           Adj R-squared =  0.2783
   Total |  263.141373   188  1.39968815           Root MSE      =  1.0051

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 nt_paau |   .3382551   .1109104     3.050   0.003        .119451    .5570593
  nt_exp |   .9040681   .1396126     6.476   0.000       .6286403    1.179496
   _cons |  -3.295308   1.104543    -2.983   0.003      -5.474351   -1.116266
------------------------------------------------------------------------------

. corr nt_exp nt_paau nt_acces
(obs=276)

         |   nt_exp  nt_paau nt_acces
---------+---------------------------
  nt_exp |   1.0000
 nt_paau |   0.3435   1.0000
nt_acces |   0.7890   0.8473   1.0000

. predict yh
(option xb assumed; fitted values)
(82 missing values generated)
. predict e, resid
(169 missing values generated)
.
90
Regression Multicollinearity, Indicators
Indicator | Description | Rule of thumb (when wrong)
Overall F-test versus tests of coefficients | Overall F-test is significant, but individual coefficients are not | -
Beta | Standardized coefficient | Outside [-1, 1]
Tolerance | Tolerance = unique variance of a predictor (not shared with/explained by other predictors). NB: one tolerance per coefficient | < 0.01
Variance Inflation Factor | √VIF indicates how much the standard error of a particular coefficient is inflated due to correlation between this particular predictor and the other predictors. NB: one VIF per coefficient | > 10
Eigenvalues | Rather technical | +/- 0
Condition Index | Rather technical | > 30
Variance Proportion | Rather technical; look at the loadings on the dimensions | Loadings around 1
91
Regression Multicollinearity, in SPSS
diagnostics
92
Regression Multicollinearity, in SPSS
Beta > 1
Tolerance, VIF OK
93
Regression Multicollinearity, in SPSS
2 eigenvalues around 0
These variables cause the multicollinearity
Condition index OK
94
Regression Multicollinearity, what to do?

Nothing (if there is no interest in the
individual coefficients, only in good prediction)
Leave one (or more) predictor(s) out
Use PCA to reduce the highly correlated variables to
a smaller number of uncorrelated variables

95
Categorical Variables
Use Survey_sample.sav, in i/.../data
96
Salary vs. gender, years of education, work
status
97
Creating dummy (dichotomous) variables
GET FILE='G:\Albert\Web\Metodes2005\Dades\survey_sample.sav'.
COMPUTE D1 = wrkstat=1.
EXECUTE .
COMPUTE D2 = wrkstat=2.
EXECUTE .
COMPUTE D3 = wrkstat=3.
EXECUTE .
COMPUTE D4 = wrkstat=4.
EXECUTE .
COMPUTE D5 = wrkstat=5.
EXECUTE .
COMPUTE D6 = wrkstat=6.
EXECUTE .
COMPUTE D7 = wrkstat=7.
EXECUTE .
COMPUTE D8 = wrkstat=8.
EXECUTE .
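The same dummy coding can be sketched in Python: one 0/1 indicator per category of the work-status variable. The `wrkstat` values below are hypothetical and the helper name `make_dummies` is illustrative; only the eight categories and the D1 to D8 naming come from the SPSS syntax above.

```python
def make_dummies(codes, categories):
    """Return one 0/1 indicator list per category, keyed D<category>."""
    return {f"D{c}": [1 if code == c else 0 for code in codes] for c in categories}

# Hypothetical work-status codes for six respondents (categories 1..8)
wrkstat = [1, 2, 1, 5, 8, 3]
dummies = make_dummies(wrkstat, range(1, 9))

print(dummies["D1"])  # [1, 0, 1, 0, 0, 0]
print(dummies["D5"])  # [0, 0, 0, 1, 0, 0]
```

Because the eight indicators always sum to one for each case, only seven of them can enter a regression that has an intercept; the omitted category then acts as the reference group.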
98
Regression in blocks
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8 .
100
Regression in blocks
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER educ
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8 .
101
Categorical Predictors
Does income depend on age (in years) and religion?
102
Categorical Predictors
Compute dummy variable for each category, except
last
103
Categorical Predictors
And so on
104
Categorical Predictors
Block 1
105
Categorical Predictors
Block 2
106
Categorical Predictors
Ask for $R^2$ change
107
Categorical Predictors
Look at R Square change for importance of
categorical variable
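The test behind an R square change can be sketched in Python. When q predictors (for example, the dummies of one categorical variable) are added as a block, the usual F statistic for the change is ((R2_full - R2_reduced) / q) / ((1 - R2_full) / (n - k - 1)), where k is the number of predictors in the full model. The function and argument names are illustrative; the check at the end reuses the simple regression from earlier in the deck (R squared .7483, n = 100, one predictor added to an empty model).

```python
def r2_change_f(r2_reduced, r2_full, n, k_full, q):
    """F statistic for the increase in R^2 when q predictors are added as a block.

    n is the sample size and k_full the number of predictors in the full model;
    the statistic has q and n - k_full - 1 degrees of freedom.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator

# Check against the simulated-data regression: going from no predictors to X
f_change = r2_change_f(r2_reduced=0.0, r2_full=0.7483, n=100, k_full=1, q=1)
print(round(f_change, 1))
```

The value lands close to the overall F of 291.42 reported for that regression; the small gap comes from using the rounded R squared. For a block of dummies, q is the number of dummies entered and the same formula applies.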
