1
Regression analysis
  • Marijtje van Duijn

2
(Simple) Regression
  • Model: yi = β0 + β1xi + ei
  • y dependent variable
  • at least interval measurement
  • x independent or explanatory variable
  • interval or dummy, also known as predictor
  • β0 and β1 regression coefficients
  • β0 intercept: where the line intercepts the y-axis (for
    x = 0)
  • β1 slope: change in y for a change in x of 1 measurement
    unit
  • e error or residual
  • i index for individual/observation (case), n
    observations

3
Regression
  • Assumes a linear relation of y and x
  • Goal: explaining or predicting y using x
  • Data = Model + Error
  • NB the relation is not necessarily causal
  • (cor)relational
  • theory may justify causality

4
Regression
  • Estimate the line (or optimize the model) by minimizing
    the total error, i.e. the sum of squared distances
    between the observations and the line
  • How? Applying Ordinary Least Squares (OLS), as in the
    sketch below
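
Not on the original slide: a minimal OLS sketch in Python, assuming
numpy and statsmodels are available; the data and all names are
illustrative.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  x = rng.normal(size=100)                   # illustrative predictor
  y = 2.0 + 0.5 * x + rng.normal(size=100)   # true b0 = 2, b1 = 0.5, plus error

  X = sm.add_constant(x)    # adds the column of 1s that carries the intercept
  fit = sm.OLS(y, X).fit()  # minimizes the sum of squared residuals
  print(fit.params)         # estimated [b0, b1], close to [2, 0.5]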

5
Multiple regression: more than 1 explanatory
variable
6
Explained Variance
  • Measure of regression quality
  • How well does the line fit the observations?
  • How much variance of the dependent variable is
    explained by the model?
  • Total variance = model variance + residual
    variance

7
Analysis of Variance table for regression
8
Explained variance - 2
  • R2 lies between 0 and 1:
  • the proportion of explained variance
  • R also known as the multiple correlation coefficient
  • (also between 0 and 1)
  • What does it mean if R2 = 0?
  • No relation of y with x, or of y with x1, x2, x3
  • Tested with an F-test (null hypothesis H0: R2 = 0)
  • F = MSreg/MSE with (k, n-k-1) degrees of freedom
  • F = (R2/k) / ((1-R2)/(n-k-1)), as in the sketch below
  • Equivalent to the null hypothesis H0: β1 = β2 = β3 = 0
  • What does a significant F imply (interpret the
    p-value)?
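
A sketch verifying the F formula on simulated data (Python, numpy and
statsmodels assumed; everything here is illustrative):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(1)
  n, k = 100, 3
  X = rng.normal(size=(n, k))
  y = 1.0 + X @ [0.5, -0.3, 0.0] + rng.normal(size=n)

  fit = sm.OLS(y, sm.add_constant(X)).fit()
  r2 = fit.rsquared
  f = (r2 / k) / ((1 - r2) / (n - k - 1))   # F from R2, as on the slide
  print(f, fit.fvalue, fit.f_pvalue)        # f matches the reported F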

9
Explained variance - 3
  • The more explanatory variables (predictors), the
    better the fit of the line (or the k-dimensional
    plane) to the observations.
  • So explained variance increases with an increasing
    number of predictors.
  • Part of this increase is random, due to sample
    fluctuation (capitalisation on chance)
  • Adjustment via a correction for the number of
    predictors: R2adj = 1 - (1-R2)(n-1)/(n-k-1)
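
The correction can be checked directly against statsmodels (a small
illustrative sketch on simulated data):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(2)
  n, k = 50, 4
  X = rng.normal(size=(n, k))
  y = X @ [0.4, 0.0, 0.0, 0.0] + rng.normal(size=n)

  fit = sm.OLS(y, sm.add_constant(X)).fit()
  r2_adj = 1 - (1 - fit.rsquared) * (n - 1) / (n - k - 1)
  print(r2_adj, fit.rsquared_adj)   # identical: the same correction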

10
Model selection
  • Process of determining whether adding variables
    improves model fit.
  • Does R2 increase? Compare adjusted R2.
  • Better: test model improvement using the partial
    F-test (see the sketch below)
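
statsmodels exposes the partial F-test for nested models as
compare_f_test; a sketch on simulated data (names illustrative):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(3)
  n = 100
  X = rng.normal(size=(n, 3))
  y = 1.0 + 0.6 * X[:, 0] + rng.normal(size=n)

  small = sm.OLS(y, sm.add_constant(X[:, :1])).fit()  # restricted: x1 only
  full = sm.OLS(y, sm.add_constant(X)).fit()          # full: x1, x2, x3
  f, p, df = full.compare_f_test(small)               # partial F-test
  print(f, p, df)   # large p here: x2, x3 do not improve the fit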

11
Model selection and multicollinearity
  • The more predictors, the better? No!
  • Because of multicollinearity: association between the
    predictors.
  • Ideal:
  • uncorrelated predictors
  • predictors highly correlated with Y
  • Perfect multicollinearity: r12 = 1

12
Problems caused by multicollinearity
  • R2 does not improve
  • The variance of the estimated regression
    coefficients increases (VIF = Variance Inflation
    Factor; see the sketch below) → not good for testing
  • So: selection of predictors
  • How? Substantively, and/or automatically (based on
    selection rules).
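
A sketch of the VIF computation (statsmodels'
variance_inflation_factor; simulated, nearly collinear predictors):

  import numpy as np
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  rng = np.random.default_rng(4)
  n = 200
  x1 = rng.normal(size=n)
  x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
  X = sm.add_constant(np.column_stack([x1, x2]))

  # VIF per predictor (column 0 is the constant); VIF_j = 1/(1 - R2_j),
  # where R2_j comes from regressing predictor j on the other predictors
  for j in (1, 2):
      print(variance_inflation_factor(X, j))   # large: multicollinearity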

13
Model selection procedures
  • Model selection = selection of predictors
  • Enter: use all explanatory variables in the model
  • Stepwise methods:
  • forward (see the sketch below)
  • 1. First add the X with the largest r(X,Y)
  • 2. Add the most significant X (based on F, p < 0.05)
  • backward (elimination)
  • 1. Use all predictors
  • 2. Delete the least significant X (F, p > 0.10)
  • stepwise: combination of forward and backward
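
A bare-bones forward-selection loop (a sketch, not any package's exact
procedure; the p < 0.05 entry criterion follows the slide, and the
data are simulated):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(5)
  n = 120
  X = rng.normal(size=(n, 4))
  y = 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(size=n)

  selected, remaining = [], list(range(X.shape[1]))
  while remaining:
      # p-value of each candidate when added to the current model
      pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
                    .pvalues[-1]
               for j in remaining}
      best = min(pvals, key=pvals.get)
      if pvals[best] >= 0.05:      # no candidate passes the entry criterion
          break
      selected.append(best)
      remaining.remove(best)
  print(selected)                  # expected to pick up columns 0 and 2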

14
(Dis)advantages selection methods
  • Advantage: easy
  • Disadvantages:
  • Order is arbitrary; hard to interpret and (therefore)
    not substantively relevant
  • Danger of capitalization on chance (especially
    forward), implying overestimation of significance
  • Less problematic if:
  • prediction is not the main goal
  • there are many observations in relation to the number
    of predictors (n/k > 40)
  • cross-validation confirms the results

15
Possible solution
  • Before the analysis: model or variable selection
    based on substantive reasons.
  • Distinguish:
  • control variables and/or variables that need to
    be included in the model
  • variables with undefined status
  • With many variables a solution may be to combine
    variables (factors)

16
Model assumptions of multiple regression
  • Independent observations
  • How to check? Difficult!
  • Method used for data collection.
  • Autocorrelation
  • Linearity
  • inspection via plots
  • predicted y vs. residual
  • partial regression plots

17
  • Residuals normally distributed with constant
    variance (homoscedasticity)
  • testing or visual inspection of normality with a
    Q-Q plot and/or histogram (of the estimated
    residuals), or a boxplot
  • inspection of constant variance via a plot of
    predicted y vs. residual (see the sketch below)
  • Residuals and predictors independent
  • also via constant variance
  • X fixed (measured without error)
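
A sketch of the two standard visual checks (Python with matplotlib and
statsmodels assumed; simulated data):

  import numpy as np
  import matplotlib.pyplot as plt
  import statsmodels.api as sm

  rng = np.random.default_rng(6)
  x = rng.normal(size=100)
  y = 1.0 + 0.5 * x + rng.normal(size=100)
  fit = sm.OLS(y, sm.add_constant(x)).fit()

  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
  # predicted y vs. residual: look for curvature (non-linearity)
  # or a funnel shape (non-constant variance)
  ax1.scatter(fit.fittedvalues, fit.resid)
  ax1.axhline(0)
  ax1.set(xlabel='fitted', ylabel='residual')
  # Q-Q plot of the residuals against the normal distribution
  sm.qqplot(fit.resid, line='45', fit=True, ax=ax2)
  plt.show()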

18
Possible solutions when assumptions are violated
  • Non-constant variance
  • variance stabilizing transformation (√Y, ln Y, 1/Y)
  • a different estimation method (WLS instead of OLS;
    see the sketch below)
  • Non-linearity
  • transformation (of Y or X)
  • adding other explanatory variables
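
A sketch contrasting OLS with WLS when the error variance grows with x
(weights = 1/variance; the data and weighting scheme are illustrative):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(7)
  n = 200
  x = rng.uniform(1, 10, size=n)
  y = 2.0 + 0.5 * x + rng.normal(scale=x, size=n)  # error s.d. grows with x

  X = sm.add_constant(x)
  ols = sm.OLS(y, X).fit()
  wls = sm.WLS(y, X, weights=1.0 / x**2).fit()     # weight = 1/variance
  print(ols.bse, wls.bse)   # WLS downweights the noisy observations;
                            # compare the estimated standard errors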

19
Regression diagnostics
  • How well does the model fit?
  • Residual analysis
  • Does the model hold for all observations?
  • Outliers
  • How does the model change leaving out 1
    observation?
  • Influential points

20
When are diagnostics helpful?
  • Known behavior of the diagnostic (under the null
    hypothesis of no violation of assumptions)
  • Easy to compute
  • Preferably graphical
  • Provides an indication of how to remedy the model
    violation

21
Residual analysis
  • Via plots (the same ones as before)
  • Per observation:
  • standardized (ZRESID)
  • studentized (without assuming constant variance,
    corrected for leverage)
  • deleted residual (leaving out the observation)
  • All residuals are compared to a standard normal
    distribution (see the sketch below)
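
In statsmodels these residual types live on the fit's get_influence()
object; a sketch on simulated data:

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(8)
  x = rng.normal(size=50)
  y = 0.5 * x + rng.normal(size=50)
  infl = sm.OLS(y, sm.add_constant(x)).fit().get_influence()

  # internally studentized: standardized and corrected for leverage
  print(infl.resid_studentized_internal[:3])
  # externally studentized = deleted residual, t-distributed
  print(infl.resid_studentized_external[:3])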

22
Outliers
  • Large residuals
  • indicate deviation in the Y-dimension; testable
    (studentized residuals are t-distributed)
  • Leverage, plus Mahalanobis distance
  • indicates deviation in the X-dimension; testable
  • Cook's distance: combination of residual and
    leverage
  • indicates deviation in both the X- and Y-dimension
  • rule of thumb: > 1 (but actually dependent on n)

23
Influential points
  • How does one observation influence or change the
    results?
  • Via change in a regression coefficient (DfBeta),
    rule of thumb: 2/√n
  • Via change in fit (DfFit),
    rule of thumb: 2√(k/n)
  • (see the sketch below)
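
A sketch pulling the influence measures and the rules of thumb
together (statsmodels, simulated data; with clean data few or no
observations should be flagged):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(9)
  n, k = 60, 2
  X = rng.normal(size=(n, k))
  y = X @ [0.5, -0.5] + rng.normal(size=n)
  infl = sm.OLS(y, sm.add_constant(X)).fit().get_influence()

  cooks_d, _ = infl.cooks_distance
  print(np.where(cooks_d > 1)[0])                  # Cook's D rule of thumb
  print(np.where(np.abs(infl.dfbetas) > 2 / np.sqrt(n)))   # DfBeta
  dffits, _ = infl.dffits
  print(np.where(np.abs(dffits) > 2 * np.sqrt(k / n))[0])  # DfFit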

24
(Semi-)Partial correlation
  • Partial correlation: correlation of two variables
    for a fixed third variable (i.e. corrected for the
    influence of the third variable)
  • Semi-partial correlation: correlation of two
    variables, correcting one variable for the
    influence of a third variable
  • In regression: the unique contribution of an
    additional x to the explained variance of y;
    provides a complete separation of the multiple
    correlation coefficient; can be useful for model
    selection.

25
R2y.123 = r2y1 + r2y2.1(s) + r2y3.12(s)
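
This decomposition can be seen by adding the predictors one at a time:
each squared semi-partial correlation equals the increment in R2
(illustrative sketch on simulated data):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(10)
  n = 200
  X = rng.normal(size=(n, 3))
  y = X @ [0.6, 0.4, 0.2] + rng.normal(size=n)

  r2 = [sm.OLS(y, sm.add_constant(X[:, :m])).fit().rsquared
        for m in (1, 2, 3)]
  print(r2[0], r2[1] - r2[0], r2[2] - r2[1])  # r2y1 and the increments
  print(r2[2], r2[0] + (r2[1] - r2[0]) + (r2[2] - r2[1]))  # sum = R2y.123
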
26
Interaction/moderation
  • Combination of two variables X1 and X2, usually their
    product
  • Can be a solution for non-linearity
  • Substantively: the effect of X1 depends on (the
    value of) X2
  • "different slopes for different folks"
  • Interpretation depends on the type of variable
    (continuous, dichotomous, nominal).

27
X1 dichotomous and X2 continuous
  • X1 = 0 or X1 = 1
  • (e.g. man/woman, control/experimental group)
  • Y = β0 + β1X1 + β2X2 + β3X1X2
  • X1 = 0: Y = β0 + β2X2
  • X1 = 1: Y = (β0 + β1) + (β2 + β3)X2
  • so the intercept and the effect of X2 change
  • the interaction effect represents the change in
    the effect of X2 or, better, the difference
    between the groups
  • In general: Y = (β0 + β1X1) + (β2 + β3X1)X2
  • (see the sketch below)
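
A sketch recovering the two group-specific slopes from a fitted
interaction model (simulated data; coefficients are invented):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(11)
  n = 200
  x1 = rng.integers(0, 2, size=n)   # dichotomous (e.g. control/experimental)
  x2 = rng.normal(size=n)           # continuous
  y = 1.0 + 0.5 * x1 + 0.8 * x2 + 0.6 * x1 * x2 + rng.normal(size=n)

  X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
  b = sm.OLS(y, X).fit().params
  print('slope of x2 when x1 = 0:', b[2])          # b2
  print('slope of x2 when x1 = 1:', b[2] + b[3])   # b2 + b3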

28
X1 and X2 dichotomous
  • X1 = 0 or X1 = 1 (e.g. man/woman)
  • X2 = 0 or X2 = 1 (e.g. control/experimental)
  • Y = β0 + β1X1 + β2X2 + β3X1X2
  • X1 = 0, X2 = 0: Y = β0
  • X1 = 1, X2 = 0: Y = β0 + β1
  • X1 = 0, X2 = 1: Y = β0 + β2
  • X1 = 1, X2 = 1: Y = β0 + β1 + β2 + β3
  • 4 groups with different means

29
X1 nominal and X2 continuous
  • X1 has more than 2 values (c values)
  • (e.g. age groups, control/experiment1/experiment2)
  • Make dummies, one for each difference with a
    reference group (e.g. youngest or oldest,
    control)
  • results in c-1 dichotomous variables.
  • It is also possible to make dummies (indicators)
    for all groups, but then the model will have no
    intercept.

30
c = 3, group 1 = reference group
  • group  D1  D2
  •   1     0   0
  •   2     1   0
  •   3     0   1
  • Y = β0 + β1dD1 + β2dD2 + β2X2 + β3D1X2 + β4D2X2
  • group 1: Y = β0 + β2X2
  • group 2: Y = (β0 + β1d) + (β2 + β3)X2
  • group 3: Y = (β0 + β2d) + (β2 + β4)X2
  • (see the sketch below)
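
A sketch using statsmodels' formula interface, which builds the c-1
dummies and the interactions automatically (the group labels g1-g3 and
slopes are illustrative; g1 is the reference group):

  import numpy as np
  import pandas as pd
  import statsmodels.formula.api as smf

  rng = np.random.default_rng(12)
  n = 300
  df = pd.DataFrame({'group': rng.choice(['g1', 'g2', 'g3'], size=n),
                     'x2': rng.normal(size=n)})
  slopes = {'g1': 0.5, 'g2': 0.8, 'g3': 0.2}   # invented per-group slopes
  df['y'] = df['x2'] * df['group'].map(slopes) + rng.normal(size=n)

  # C(group) creates the c-1 = 2 dummies; '*' adds the interactions
  fit = smf.ols('y ~ C(group) * x2', data=df).fit()
  print(fit.params)   # intercept, D1, D2, slope for g1, D1:x2, D2:x2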

31
X1 and X2 continuous
  • Same idea, more difficult interpretation
  • Y = β0 + β1X1 + β2X2 + β3X1X2
  • Y = (β0 + β1X1) + (β2 + β3X1)X2
  • Y = (β0 + β2X2) + (β1 + β3X2)X1
  • Centering (or standardizing) the predictors
    facilitates interpretation

32
X1 and X2 continuous - 2
33
X1 and X2 continuous - 3
  • NB Centering (or standardizing) only changes the
    regression coefficients, not the model fit (R2
    and F); see the sketch below
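
A sketch illustrating the point: centering changes the coefficients
but leaves R2 (and hence F) untouched (simulated data):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(13)
  n = 200
  x1, x2 = rng.normal(2, 1, size=(2, n))
  y = 1 + 0.5 * x1 + 0.5 * x2 + 0.4 * x1 * x2 + rng.normal(size=n)

  def fit(a, b):
      return sm.OLS(y, sm.add_constant(np.column_stack([a, b, a * b]))).fit()

  raw = fit(x1, x2)
  cen = fit(x1 - x1.mean(), x2 - x2.mean())
  print(raw.params, cen.params)        # coefficients differ...
  print(raw.rsquared, cen.rsquared)    # ...but R2 (and F) are identical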

34
Mediation
  • The effect of X on Y (explaining Y with X) runs via M
  • Partial mediation: X has a direct effect on Y in
    addition to M (explains its own part of the variance
    of Y)
  • (e.g. Y = test score, X = IQ, M = earlier test score)
  • Possible substantive interpretation of
    multicollinearity:
  • X explains Y
  • X explains M
  • M explains Y, controlling for X.
  • (see the sketch below)
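
A sketch of the three regressions (simulated data; the mediation
structure and effect sizes are invented for illustration):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(14)
  n = 500
  x = rng.normal(size=n)                       # e.g. IQ
  m = 0.7 * x + rng.normal(size=n)             # e.g. earlier test score
  y = 0.3 * x + 0.5 * m + rng.normal(size=n)   # test score: partial mediation

  print(sm.OLS(y, sm.add_constant(x)).fit().params)  # X explains Y (total)
  print(sm.OLS(m, sm.add_constant(x)).fit().params)  # X explains M
  print(sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params)
  # M explains Y controlling for X; X's coefficient drops from ~0.65 to ~0.3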