Title: Regression analysis
1 Regression analysis
2 (Simple) Regression
- y: dependent variable
- at least interval measurement
- x: independent or explanatory variable
- interval or dummy; also known as predictor
- β0 and β1: regression coefficients
- β0: intercept, where the line intersects the y-axis (for x = 0)
- β1: slope, the change in y for a change in x of 1 measurement unit
- e: error or residual
- i: index for individual/observation (case), n observations
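Written out, the model that these components form is:

  y_i = β0 + β1·x_i + e_i,   i = 1, ..., n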
3 Regression
- Assumes a linear relation between y and x
- Goal: explaining or predicting y using x
- Data = Model + Error
- NB: the relation is not necessarily causal
- (cor)relational
- theory may justify a causal interpretation
4 Regression
- Estimate the line (or optimize the model) by minimizing the total error, i.e. the sum of squared distances between the observations and the line
- How? By applying Ordinary Least Squares (OLS)
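As a brief illustration of what OLS does, here is a minimal numpy sketch (the data are invented):

    import numpy as np

    # invented data: n = 50 observations of one predictor
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

    # design matrix: a column of ones (intercept b0) and the predictor x
    X = np.column_stack([np.ones_like(x), x])

    # OLS chooses b to minimize the sum of squared residuals ||y - Xb||^2
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    print("intercept:", b[0], "slope:", b[1])
    print("sum of squared residuals:", np.sum(residuals ** 2))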
5 Multiple regression
- more than 1 explanatory variable
6 Explained Variance
- Measure of regression quality
- How well does the line fit the observations?
- How much variance of the dependent variable is explained by the model?
- Total variance = model variance + residual variance
7 Analysis of Variance table for regression
8 Explained variance - 2
- R² lies between 0 and 1,
- the proportion (percentage) of explained variance
- R is also known as the multiple correlation coefficient
- (also between 0 and 1)
- What does it mean if R² = 0?
- No relation of y with x, or of y with x1, x2, x3.
- Tested with an F-test (null hypothesis H0: R² = 0)
- F = MSreg / MSE with (k, n - k - 1) degrees of freedom
- F = (R²/k) / ((1 - R²)/(n - k - 1))
- Equivalent to the null hypothesis H0: β1 = β2 = β3 = 0
- What does a significant F imply? (interpret the p-value)
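A small sketch of this F-test in Python (the values of R², k and n are invented):

    import scipy.stats as st

    r2, k, n = 0.35, 3, 120          # invented example values

    # F = (R²/k) / ((1 - R²)/(n - k - 1)) with (k, n - k - 1) degrees of freedom
    F = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = st.f.sf(F, k, n - k - 1)     # right-tail p-value
    print(F, p)                      # small p: reject H0 that R² = 0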
9 Explained variance - 3
- The more explanatory variables (predictors), the better the fit of the line (or, with more predictors, the regression plane) to the observations.
- So explained variance increases with an increasing number of predictors.
- Part of this increase is random, due to sampling fluctuation (capitalisation on chance)
- Adjustment via a correction for the number of predictors: R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
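The same correction as a small helper function (the input values are illustrative):

    def adjusted_r2(r2, n, k):
        # R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.35, n=120, k=3))   # slightly below the raw R² of 0.35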
10 Model selection
- Process of determining whether adding variables improves model fit.
- Does R² increase? Compare adjusted R².
- Better: test model improvement using a partial F-test
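One common way to carry out the partial F-test is to compare nested models; a sketch with statsmodels on invented data (the variable names x1, x2, x3 are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 1 + 0.8 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=100)

    reduced = smf.ols("y ~ x1", data=df).fit()
    full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

    # partial F-test: do x2 and x3 together improve on the reduced model?
    print(anova_lm(reduced, full))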
11 Model selection and multicollinearity
- The more predictors, the better? No!
- Because of multicollinearity: association between the predictors.
- Ideal
- Uncorrelated predictors
- predictors highly correlated with Y
- Perfect multicollinearity: r12 = 1
12 Problems caused by multicollinearity
- R² does not improve
- The variance of the estimated regression coefficients increases (VIF = Variance Inflation Factor) → not good for testing
- So: selection of predictors
- How? Substantive, and/or automatic (based on selection rules).
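The VIF can be inspected per predictor, for example with statsmodels (invented, deliberately collinear data):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)      # nearly collinear with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    # VIF_j = 1 / (1 - R²_j), where R²_j regresses predictor j on the others
    for j, name in enumerate(["const", "x1", "x2"]):
        print(name, variance_inflation_factor(X, j))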
13 Model selection procedures
- Model selection = selection of predictors
- Enter: use all explanatory variables in the model
- Stepwise methods
- forward
- 1. First add the X with the largest r(X,Y)
- 2. Add the most significant X (based on F, p < 0.05)
- backward (elimination)
- 1. Use all predictors
- 2. Delete the least significant X (F, p > 0.10)
- stepwise: combination of forward and backward
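A minimal, illustrative forward-selection sketch based on the entry rule above (p < 0.05); here X is assumed to be a pandas DataFrame of candidate predictors and y the outcome, and real statistical packages implement more refined versions:

    import statsmodels.api as sm

    def forward_select(y, X, alpha=0.05):
        """Greedy forward selection: repeatedly add the candidate predictor
        with the smallest coefficient p-value, as long as it is below alpha."""
        selected, remaining = [], list(X.columns)
        while remaining:
            pvals = {}
            for cand in remaining:
                fit = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
                pvals[cand] = fit.pvalues[cand]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            selected.append(best)
            remaining.remove(best)
        return selected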
14 (Dis)advantages of selection methods
- Advantage: easy
- Disadvantages
- Order of entry is arbitrary: hard to interpret and (therefore) not substantively relevant
- Danger of capitalization on chance (especially forward), implying overestimation of significance
- Less problematic if
- Prediction is not the main goal
- Many observations in relation to the number of predictors (n/k > 40)
- Cross-validation confirms the results
15 Possible solution
- Before the analysis: model or variable selection based on substantive reasons.
- Distinguish
- Control variables and/or variables that need to be included in the model
- Variables with undefined status
- With many variables a solution may be to combine variables (factors)
16 Model assumptions of multiple regression
- Independent observations
- How to check? Difficult!
- Method used for data collection.
- Autocorrelation
- Linearity
- inspection via plots
- predicted y vs. residual
- partial regression plots
17 Model assumptions of multiple regression - 2
- Residuals normally distributed with constant variance (homoscedasticity)
- testing or visual inspection of normality with a Q-Q plot and/or histogram (of the estimated residuals), or a boxplot
- inspection of constant variance via a plot of predicted y vs. residual
- Residuals and predictors independent
- also via constant variance
- X fixed (measured without error)
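Visual checks of normality and constant variance can be sketched like this (invented data; `fit` is a statsmodels OLS result):

    import numpy as np
    import scipy.stats as st
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 1 + 0.5 * x + rng.normal(size=100)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    fig, axes = plt.subplots(1, 2, figsize=(9, 4))

    # predicted y vs. residual: curvature suggests non-linearity,
    # a funnel shape suggests non-constant variance
    axes[0].scatter(fit.fittedvalues, fit.resid)
    axes[0].axhline(0, color="grey")
    axes[0].set(xlabel="predicted y", ylabel="residual")

    # Q-Q plot of the residuals against the normal distribution
    st.probplot(fit.resid, dist="norm", plot=axes[1])
    plt.show()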
18 Possible solutions when assumptions are violated
- Non-constant variance
- variance-stabilizing transformation (√, ln, 1/Y)
- different estimation method (WLS instead of OLS)
- Non-linearity
- transformation (of Y or X)
- adding other explanatory variables
19 Regression diagnostics
- How well does the model fit?
- Residual analysis
- Does the model hold for all observations?
- Outliers
- How does the model change when leaving out one observation?
- Influential points
20 When are diagnostics helpful?
- Known behavior of the diagnostic (under the null hypothesis of no violation of assumptions)
- Easy to compute
- Preferably graphical
- Providing an indication for a solution to the model violation
21 Residual analysis
- Via plots (the same ones as before)
- Per observation
- Standardized (ZRESID)
- Studentized (without assuming constant variance, corrected for leverage)
- Deleted residual (leaving out the observation)
- All residuals are compared to a standard normal distribution
22 Outliers
- Large residuals
- indicate deviation in the Y-dimension; testable (studentized residuals are t-distributed)
- Leverage plus Mahalanobis distance
- indicate deviation in the X-dimension; testable
- Cook's distance, a combination of residual and leverage
- indicates deviation in both the X- and Y-dimension
- rule of thumb: > 1 (but actually dependent on n)
23 Influential points
- How does one observation influence or change the results?
- Via the change in a regression coefficient (DfBeta); rule of thumb: 2/√n
- Via the change in fit (DfFit); rule of thumb: 2√(k/n)
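In statsmodels these diagnostics are available from the influence object of a fitted OLS model; a sketch on invented data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n, k = 100, 2
    X = sm.add_constant(rng.normal(size=(n, k)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
    fit = sm.OLS(y, X).fit()

    infl = fit.get_influence()
    student = infl.resid_studentized_external   # outliers in the Y-dimension
    leverage = infl.hat_matrix_diag             # deviation in the X-dimension
    cooks_d, _ = infl.cooks_distance            # influence on the whole fit
    dfbetas = infl.dfbetas                      # change per coefficient
    dffits, _ = infl.dffits                     # change in fit

    print("flag |DfBeta| >", 2 / np.sqrt(n), "and |DfFit| >", 2 * np.sqrt(k / n))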
24 (Semi-)Partial correlation
- Partial correlation: correlation of two variables for a fixed third variable (i.e. corrected for the influence of a third variable)
- Semi-partial correlation: correlation of two variables, correcting one of them for the influence of a third variable
- In regression: the unique contribution of an additional x to the explained variance of y; provides a complete decomposition of the squared multiple correlation coefficient; can be useful for model selection.
25 R²y.123 = r²y1 + r²y2.1(s) + r²y3.12(s)
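The squared semi-partial correlation of an added predictor equals the increase in R² it produces, which is easy to verify on invented data (names x1, x2, x3 are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 0.6 * df["x1"] + 0.3 * df["x2"] + 0.2 * df["x3"] + rng.normal(size=200)

    r2_1 = smf.ols("y ~ x1", data=df).fit().rsquared
    r2_12 = smf.ols("y ~ x1 + x2", data=df).fit().rsquared
    r2_123 = smf.ols("y ~ x1 + x2 + x3", data=df).fit().rsquared

    # r²y1, squared semi-partial of x2 given x1, and of x3 given x1 and x2
    print(r2_1, r2_12 - r2_1, r2_123 - r2_12)   # these sum to r2_123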
26 Interaction/moderation
- Combination of two variables X1 and X2, usually their product
- Can be a solution for non-linearity
- Substantive: the effect of X1 depends on (the value of) X2
- different slopes for different folks
- Interpretation depends on the type of variable (continuous, dichotomous, nominal).
27 X1 dichotomous and X2 continuous
- X1 = 0 or X1 = 1
- (e.g. man/woman, control/experimental group)
- Y = β0 + β1X1 + β2X2 + β3X1X2
- X1 = 0: Y = β0 + β2X2
- X1 = 1: Y = (β0 + β1) + (β2 + β3)X2
- so the intercept and the effect of X2 change
- the interaction effect represents the change in the effect of X2 or, better, the difference between the groups
- In general: Y = (β0 + β1X1) + (β2 + β3X1)X2
28 X1 and X2 dichotomous
- X1 = 0 or X1 = 1 (e.g. man/woman)
- X2 = 0 or X2 = 1 (e.g. control/experimental)
- Y = β0 + β1X1 + β2X2 + β3X1X2
- X1 = 0, X2 = 0: Y = β0
- X1 = 1, X2 = 0: Y = β0 + β1
- X1 = 0, X2 = 1: Y = β0 + β2
- X1 = 1, X2 = 1: Y = β0 + β1 + β2 + β3
- 4 groups with different means
29 X1 nominal and X2 continuous
- X1 has more than 2 (c) values
- (e.g. age groups, control/experiment 1/experiment 2)
- Make dummies, one for each difference with a reference group (e.g. youngest or oldest, control); this results in c - 1 dichotomous variables.
- It is also possible to make dummies (indicators) for all groups, but then the model will have no intercept.
30 c = 3, group 1 = reference group
- group  D1  D2
-   1     0   0
-   2     1   0
-   3     0   1
- Y = β0 + β1d·D1 + β2d·D2 + β2X2 + β3D1X2 + β4D2X2
- group 1: Y = β0 + β2X2
- group 2: Y = (β0 + β1d) + (β2 + β3)X2
- group 3: Y = (β0 + β2d) + (β2 + β4)X2
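In code, this dummy coding is usually done automatically by marking X1 as categorical; a sketch on invented data (patsy's C() takes the first category as the reference group):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    df = pd.DataFrame({
        "group": rng.integers(1, 4, size=300),   # nominal X1 with c = 3 values
        "x2": rng.normal(size=300),
    })
    df["y"] = df["group"] + 0.5 * df["x2"] + rng.normal(size=300)

    # C(group) creates c - 1 = 2 dummies relative to group 1;
    # C(group) * x2 also adds the D1·x2 and D2·x2 interaction terms
    fit = smf.ols("y ~ C(group) * x2", data=df).fit()
    print(fit.params)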
31 X1 and X2 continuous
- Same idea, more difficult interpretation
- Y = β0 + β1X1 + β2X2 + β3X1X2
- Y = (β0 + β1X1) + (β2 + β3X1)X2
- Y = (β0 + β2X2) + (β1 + β3X2)X1
- Centering (or standardizing) the predictors facilitates interpretation
32 X1 and X2 continuous - 2
33 X1 and X2 continuous - 3
- NB: Centering (or standardizing) only changes the regression coefficients, not the model fit (R² and F)
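A quick numerical check of this claim (invented data with two continuous predictors):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    df = pd.DataFrame({"x1": rng.normal(5, 2, size=200),
                       "x2": rng.normal(10, 3, size=200)})
    df["y"] = (1 + 0.4 * df["x1"] + 0.3 * df["x2"]
               + 0.2 * df["x1"] * df["x2"] + rng.normal(size=200))

    raw = smf.ols("y ~ x1 * x2", data=df).fit()
    centered = smf.ols("y ~ x1 * x2",
                       data=df.assign(x1=df.x1 - df.x1.mean(),
                                      x2=df.x2 - df.x2.mean())).fit()

    print(raw.params, centered.params, sep="\n")   # coefficients differ
    print(raw.rsquared, centered.rsquared)         # identical R²
    print(raw.fvalue, centered.fvalue)             # identical F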
34 Mediation
- The effect of X on Y (explaining Y with X) runs via M
- Partial mediation: X has a direct effect on Y in addition to the effect via M (explains its own part of the variance of Y)
- (e.g. Y = test score, X = IQ, M = earlier test score)
- Possible substantive interpretation of multicollinearity
- X explains Y
- X explains M
- M explains Y, controlling for X.
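The three regressions behind this reasoning, sketched on invented data (the names x, m, y are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    df = pd.DataFrame({"x": rng.normal(size=300)})
    df["m"] = 0.7 * df["x"] + rng.normal(size=300)
    df["y"] = 0.5 * df["m"] + 0.2 * df["x"] + rng.normal(size=300)

    total = smf.ols("y ~ x", data=df).fit()         # X explains Y (total effect)
    x_to_m = smf.ols("m ~ x", data=df).fit()        # X explains M
    combined = smf.ols("y ~ x + m", data=df).fit()  # M explains Y, controlling for X

    # partial mediation: the coefficient of x shrinks but remains non-zero
    print(total.params["x"], x_to_m.params["x"],
          combined.params["x"], combined.params["m"])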