Title: Regression analysis
1 Regression analysis
2 (Simple) Regression
- y: dependent variable
- at least interval measurement
- x: independent or explanatory variable
- interval or dummy; also known as predictor
- β0 and β1: regression coefficients
- β0: intercept, where the line intersects the y-axis (for x = 0)
- β1: slope, the change in y for a change in x of 1 measurement unit
- e: error or residual
- i: index for individual/observation (case), n observations
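Written out, the model that these components form is:

  y_i = β0 + β1·x_i + e_i,   i = 1, ..., n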
3 Regression
- Assumes a linear relation between y and x
- Goal: explaining or predicting y using x
- Data = Model + Error
- NB: the relation is not necessarily causal
- (cor)relational
- theory may justify a causal interpretation
4 Regression
- Estimate the line (or optimize the model) by minimizing the total error, i.e. the sum of squared distances between the observations and the line
- How? By applying Ordinary Least Squares (OLS)
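As a brief illustration of what OLS does, here is a minimal numpy sketch (the data are invented):

    import numpy as np

    # invented data: n = 50 observations of one predictor
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

    # design matrix: a column of ones (intercept b0) and the predictor x
    X = np.column_stack([np.ones_like(x), x])

    # OLS chooses b to minimize the sum of squared residuals ||y - Xb||^2
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    print("intercept:", b[0], "slope:", b[1])
    print("sum of squared residuals:", np.sum(residuals ** 2))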
5 Multiple regression
- more than 1 explanatory variable
6 Explained Variance
- Measure of regression quality
- How well does the line fit the observations?
- How much variance of the dependent variable is explained by the model?
- Total variance = model variance + residual variance
7 Analysis of Variance table for regression
8 Explained variance - 2
- R² lies between 0 and 1,
- the proportion (percentage) of explained variance
- R is also known as the multiple correlation coefficient
- (also between 0 and 1)
- What does it mean if R² = 0?
- No relation of y with x, or of y with x1, x2, x3.
- Tested with an F-test (null hypothesis H0: R² = 0)
- F = MSreg / MSE with (k, n - k - 1) degrees of freedom
- F = (R²/k) / ((1 - R²)/(n - k - 1))
- Equivalent to the null hypothesis H0: β1 = β2 = β3 = 0
- What does a significant F imply? (interpret the p-value)
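A small sketch of this F-test in Python (the values of R², k and n are invented):

    import scipy.stats as st

    r2, k, n = 0.35, 3, 120          # invented example values

    # F = (R²/k) / ((1 - R²)/(n - k - 1)) with (k, n - k - 1) degrees of freedom
    F = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = st.f.sf(F, k, n - k - 1)     # right-tail p-value
    print(F, p)                      # small p: reject H0 that R² = 0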
9 Explained variance - 3
- The more explanatory variables (predictors), the better the fit of the line (or, with more predictors, the regression plane) to the observations.
- So explained variance increases with an increasing number of predictors.
- Part of this increase is random, due to sampling fluctuation (capitalisation on chance)
- Adjustment via a correction for the number of predictors: R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
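The same correction as a small helper function (the input values are illustrative):

    def adjusted_r2(r2, n, k):
        # R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.35, n=120, k=3))   # slightly below the raw R² of 0.35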
10 Model selection
- Process of determining whether adding variables improves model fit.
- Does R² increase? Compare adjusted R².
- Better: test model improvement using a partial F-test
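One common way to carry out the partial F-test is to compare nested models; a sketch with statsmodels on invented data (the variable names x1, x2, x3 are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 1 + 0.8 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=100)

    reduced = smf.ols("y ~ x1", data=df).fit()
    full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

    # partial F-test: do x2 and x3 together improve on the reduced model?
    print(anova_lm(reduced, full))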
11 Model selection and multicollinearity
- The more predictors, the better? No!
- Because of multicollinearity: association between the predictors.
- Ideal
- Uncorrelated predictors
- predictors highly correlated with Y
- Perfect multicollinearity: r12 = 1
12 Problems caused by multicollinearity
- R² does not improve
- The variance of the estimated regression coefficients increases (VIF = Variance Inflation Factor) → not good for testing
- So: selection of predictors
- How? Substantive, and/or automatic (based on selection rules).
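The VIF can be inspected per predictor, for example with statsmodels (invented, deliberately collinear data):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)      # nearly collinear with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    # VIF_j = 1 / (1 - R²_j), where R²_j regresses predictor j on the others
    for j, name in enumerate(["const", "x1", "x2"]):
        print(name, variance_inflation_factor(X, j))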
13 Model selection procedures
- Model selection = selection of predictors
- Enter: use all explanatory variables in the model
- Stepwise methods
- forward
- 1. First add the X with the largest r(X,Y)
- 2. Add the most significant X (based on F, p < 0.05)
- backward (elimination)
- 1. Use all predictors
- 2. Delete the least significant X (F, p > 0.10)
- stepwise: combination of forward and backward
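A minimal, illustrative forward-selection sketch based on the entry rule above (p < 0.05); here X is assumed to be a pandas DataFrame of candidate predictors and y the outcome, and real statistical packages implement more refined versions:

    import statsmodels.api as sm

    def forward_select(y, X, alpha=0.05):
        """Greedy forward selection: repeatedly add the candidate predictor
        with the smallest coefficient p-value, as long as it is below alpha."""
        selected, remaining = [], list(X.columns)
        while remaining:
            pvals = {}
            for cand in remaining:
                fit = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
                pvals[cand] = fit.pvalues[cand]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            selected.append(best)
            remaining.remove(best)
        return selected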
14 (Dis)advantages of selection methods
- Advantage: easy
- Disadvantages
- Order of entry is arbitrary: hard to interpret and (therefore) not substantively relevant
- Danger of capitalization on chance (especially forward), implying overestimation of significance
- Less problematic if
- Prediction is not the main goal
- Many observations in relation to the number of predictors (n/k > 40)
- Cross-validation confirms the results
15 Possible solution
- Before the analysis: model or variable selection based on substantive reasons.
- Distinguish
- Control variables and/or variables that need to be included in the model
- Variables with undefined status
- With many variables a solution may be to combine variables (factors)
16 Model assumptions of multiple regression
- Independent observations
- How to check? Difficult!
- Method used for data collection.
- Autocorrelation
- Linearity
- inspection via plots
- predicted y vs. residual
- partial regression plots
17 Model assumptions of multiple regression - 2
- Residuals normally distributed with constant variance (homoscedasticity)
- testing or visual inspection of normality with a Q-Q plot and/or histogram (of the estimated residuals), or a boxplot
- inspection of constant variance via a plot of predicted y vs. residual
- Residuals and predictors independent
- also via constant variance
- X fixed (measured without error)
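Visual checks of normality and constant variance can be sketched like this (invented data; `fit` is a statsmodels OLS result):

    import numpy as np
    import scipy.stats as st
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 1 + 0.5 * x + rng.normal(size=100)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    fig, axes = plt.subplots(1, 2, figsize=(9, 4))

    # predicted y vs. residual: curvature suggests non-linearity,
    # a funnel shape suggests non-constant variance
    axes[0].scatter(fit.fittedvalues, fit.resid)
    axes[0].axhline(0, color="grey")
    axes[0].set(xlabel="predicted y", ylabel="residual")

    # Q-Q plot of the residuals against the normal distribution
    st.probplot(fit.resid, dist="norm", plot=axes[1])
    plt.show()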
18 Possible solutions when assumptions are violated
- Non-constant variance
- variance-stabilizing transformation (√, ln, 1/Y)
- different estimation method (WLS instead of OLS)
- Non-linearity
- transformation (of Y or X)
- adding other explanatory variables
19 Regression diagnostics
- How well does the model fit?
- Residual analysis
- Does the model hold for all observations?
- Outliers
- How does the model change when leaving out one observation?
- Influential points
20 When are diagnostics helpful?
- Known behavior of the diagnostic (under the null hypothesis of no violation of assumptions)
- Easy to compute
- Preferably graphical
- Providing an indication for a solution to the model violation
21 Residual analysis
- Via plots (the same ones as before)
- Per observation
- Standardized (ZRESID)
- Studentized (without assuming constant variance, corrected for leverage)
- Deleted residual (leaving out the observation)
- All residuals are compared to a standard normal distribution
22 Outliers
- Large residuals
- indicate deviation in the Y-dimension; testable (studentized residuals are t-distributed)
- Leverage plus Mahalanobis distance
- indicate deviation in the X-dimension; testable
- Cook's distance, a combination of residual and leverage
- indicates deviation in both the X- and Y-dimension
- rule of thumb: > 1 (but actually dependent on n)
23 Influential points
- How does one observation influence or change the results?
- Via the change in a regression coefficient (DfBeta); rule of thumb: 2/√n
- Via the change in fit (DfFit); rule of thumb: 2√(k/n)
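In statsmodels these diagnostics are available from the influence object of a fitted OLS model; a sketch on invented data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n, k = 100, 2
    X = sm.add_constant(rng.normal(size=(n, k)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
    fit = sm.OLS(y, X).fit()

    infl = fit.get_influence()
    student = infl.resid_studentized_external   # outliers in the Y-dimension
    leverage = infl.hat_matrix_diag             # deviation in the X-dimension
    cooks_d, _ = infl.cooks_distance            # influence on the whole fit
    dfbetas = infl.dfbetas                      # change per coefficient
    dffits, _ = infl.dffits                     # change in fit

    print("flag |DfBeta| >", 2 / np.sqrt(n), "and |DfFit| >", 2 * np.sqrt(k / n))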
24 (Semi-)Partial correlation
- Partial correlation: correlation of two variables for a fixed third variable (i.e. corrected for the influence of a third variable)
- Semi-partial correlation: correlation of two variables, correcting one of them for the influence of a third variable
- In regression: the unique contribution of an additional x to the explained variance of y; provides a complete decomposition of the squared multiple correlation coefficient; can be useful for model selection.
25 R²y.123 = r²y1 + r²y2.1(s) + r²y3.12(s)
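The squared semi-partial correlation of an added predictor equals the increase in R² it produces, which is easy to verify on invented data (names x1, x2, x3 are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 0.6 * df["x1"] + 0.3 * df["x2"] + 0.2 * df["x3"] + rng.normal(size=200)

    r2_1 = smf.ols("y ~ x1", data=df).fit().rsquared
    r2_12 = smf.ols("y ~ x1 + x2", data=df).fit().rsquared
    r2_123 = smf.ols("y ~ x1 + x2 + x3", data=df).fit().rsquared

    # r²y1, squared semi-partial of x2 given x1, and of x3 given x1 and x2
    print(r2_1, r2_12 - r2_1, r2_123 - r2_12)   # these sum to r2_123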
26 Interaction/moderation
- Combination of two variables X1 and X2, usually their product
- Can be a solution for non-linearity
- Substantive: the effect of X1 depends on (the value of) X2
- different slopes for different folks
- Interpretation depends on the type of variable (continuous, dichotomous, nominal).
27 X1 dichotomous and X2 continuous
- X1 = 0 or X1 = 1
- (e.g. man/woman, control/experimental group)
- Y = β0 + β1X1 + β2X2 + β3X1X2
- X1 = 0: Y = β0 + β2X2
- X1 = 1: Y = (β0 + β1) + (β2 + β3)X2
- so the intercept and the effect of X2 change
- the interaction effect represents the change in the effect of X2 or, better, the difference between the groups
- In general: Y = (β0 + β1X1) + (β2 + β3X1)X2
28 X1 and X2 dichotomous
- X1 = 0 or X1 = 1 (e.g. man/woman)
- X2 = 0 or X2 = 1 (e.g. control/experimental)
- Y = β0 + β1X1 + β2X2 + β3X1X2
- X1 = 0, X2 = 0: Y = β0
- X1 = 1, X2 = 0: Y = β0 + β1
- X1 = 0, X2 = 1: Y = β0 + β2
- X1 = 1, X2 = 1: Y = β0 + β1 + β2 + β3
- 4 groups with different means
29 X1 nominal and X2 continuous
- X1 has more than 2 (c) values
- (e.g. age groups, control/experiment 1/experiment 2)
- Make dummies, one for each difference with a reference group (e.g. youngest or oldest, control); this results in c - 1 dichotomous variables.
- It is also possible to make dummies (indicators) for all groups, but then the model will have no intercept.
30 c = 3, group 1 = reference group
- group  D1  D2
-   1     0   0
-   2     1   0
-   3     0   1
- Y = β0 + β1d·D1 + β2d·D2 + β2X2 + β3D1X2 + β4D2X2
- group 1: Y = β0 + β2X2
- group 2: Y = (β0 + β1d) + (β2 + β3)X2
- group 3: Y = (β0 + β2d) + (β2 + β4)X2
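In code, this dummy coding is usually done automatically by marking X1 as categorical; a sketch on invented data (patsy's C() takes the first category as the reference group):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    df = pd.DataFrame({
        "group": rng.integers(1, 4, size=300),   # nominal X1 with c = 3 values
        "x2": rng.normal(size=300),
    })
    df["y"] = df["group"] + 0.5 * df["x2"] + rng.normal(size=300)

    # C(group) creates c - 1 = 2 dummies relative to group 1;
    # C(group) * x2 also adds the D1·x2 and D2·x2 interaction terms
    fit = smf.ols("y ~ C(group) * x2", data=df).fit()
    print(fit.params)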
31 X1 and X2 continuous
- Same idea, more difficult interpretation
- Y = β0 + β1X1 + β2X2 + β3X1X2
- Y = (β0 + β1X1) + (β2 + β3X1)X2
- Y = (β0 + β2X2) + (β1 + β3X2)X1
- Centering (or standardizing) the predictors facilitates interpretation
32 X1 and X2 continuous - 2
33 X1 and X2 continuous - 3
- NB: Centering (or standardizing) only changes the regression coefficients, not the model fit (R² and F)
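A quick numerical check of this claim (invented data with two continuous predictors):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    df = pd.DataFrame({"x1": rng.normal(5, 2, size=200),
                       "x2": rng.normal(10, 3, size=200)})
    df["y"] = (1 + 0.4 * df["x1"] + 0.3 * df["x2"]
               + 0.2 * df["x1"] * df["x2"] + rng.normal(size=200))

    raw = smf.ols("y ~ x1 * x2", data=df).fit()
    centered = smf.ols("y ~ x1 * x2",
                       data=df.assign(x1=df.x1 - df.x1.mean(),
                                      x2=df.x2 - df.x2.mean())).fit()

    print(raw.params, centered.params, sep="\n")   # coefficients differ
    print(raw.rsquared, centered.rsquared)         # identical R²
    print(raw.fvalue, centered.fvalue)             # identical F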
34 Mediation
- The effect of X on Y (explaining Y with X) runs via M
- Partial mediation: X has a direct effect on Y in addition to the effect via M (explains its own part of the variance of Y)
- (e.g. Y = test score, X = IQ, M = earlier test score)
- Possible substantive interpretation of multicollinearity
- X explains Y
- X explains M
- M explains Y, controlling for X.
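The three regressions behind this reasoning, sketched on invented data (the names x, m, y are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    df = pd.DataFrame({"x": rng.normal(size=300)})
    df["m"] = 0.7 * df["x"] + rng.normal(size=300)
    df["y"] = 0.5 * df["m"] + 0.2 * df["x"] + rng.normal(size=300)

    total = smf.ols("y ~ x", data=df).fit()         # X explains Y (total effect)
    x_to_m = smf.ols("m ~ x", data=df).fit()        # X explains M
    combined = smf.ols("y ~ x + m", data=df).fit()  # M explains Y, controlling for X

    # partial mediation: the coefficient of x shrinks but remains non-zero
    print(total.params["x"], x_to_m.params["x"],
          combined.params["x"], combined.params["m"])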