1
Advanced topics in regression
  • Tron Anders Moger
  • 18.10.2006

2
Last time
  • Had the model:
    death rate per 1000 = a + b·(car age) + c·(prop. light trucks)
  • Annotated regression output from last time, summarized as formulas (a small numerical sketch follows):
    K = no. of independent variables
    SSR: d.f.(SSR) = K, MSR = SSR/K
    SSE: d.f.(SSE) = n - K - 1, MSE = se² = SSE/(n - K - 1)
    SST = SSR + SSE: d.f.(SST) = n - 1
    R² = 1 - SSE/SST, Pearson's r = √R², Adj. R² = 1 - (SSE/(n - K - 1))/(SST/(n - 1))
    F test statistic = MSR/MSE, with P-value for the test "all β's = 0" vs. "at least one β ≠ 0"
    For each coefficient: the β-estimate, SE(β), T test statistic = β/SE(β), 95% CI for β, and the P-value for the test β = 0 vs. β ≠ 0
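As a quick numerical illustration of how these quantities fit together, here is a minimal sketch with made-up sums of squares (the numbers are hypothetical, not from the traffic-death data):

```python
# Hypothetical sums of squares, only to show how the formulas above connect.
SSR, SSE, n, K = 80.0, 20.0, 50, 3
SST = SSR + SSE
R2 = 1 - SSE / SST                                  # coefficient of determination
adj_R2 = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))  # penalizes extra variables
MSR, MSE = SSR / K, SSE / (n - K - 1)
F = MSR / MSE                                       # compare to F(K, n-K-1)
print(R2, adj_R2, F)                                # 0.8  0.787  61.3
```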
3
Why did we remove car weight and percentage
imported cars from the model?
  • They did not show a significant relationship with the dependent variable (β not different from 0)
  • Unless independent variables are completely uncorrelated, you will get different b's when including several variables in your model compared to just one variable (collinearity)
  • Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the effect of important independent variables

4
Relationship between R² and b
  • Which result would make you most happy?

[Two scatter plots, both axes running 0 to 100: left panel with low R² and high b (with wide CI); right panel with high R² and low b (with narrow CI)]
5
Centered variables
  • Remember, we found the model:
  • Birth weight = 2369.672 + 4.429·(mother's weight)
  • Hence, the constant has no interpretation (it is the predicted birth weight for a mother weighing 0 lbs)
  • Construct mother's weight2 = mother's weight - mean(mother's weight)
  • Get the model:
  • Birth weight = 2944.656 + 4.429·(mother's weight2)
  • The constant is now the predicted birth weight for a mother of mean weight, 130 lbs (see the sketch below)
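A minimal sketch of the centering step on simulated data (the numbers are stand-ins for the birth-weight file, which is not reproduced here); the slope is unchanged, and only the constant moves to the prediction at the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mwt = rng.normal(130, 30, 189)                    # hypothetical mothers' weights (lbs)
bwt = 2370 + 4.4 * mwt + rng.normal(0, 500, 189)  # hypothetical birth weights (g)

slope, const = np.polyfit(mwt, bwt, 1)                   # raw predictor
slope_c, const_c = np.polyfit(mwt - mwt.mean(), bwt, 1)  # centered predictor

print(const, slope)      # constant = predicted bwt at mwt = 0 (no interpretation)
print(const_c, slope_c)  # same slope; constant = predicted bwt at the mean weight
```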
6
Indicator variables
  • Binary variables (yes/no, male/female, …) can be represented as 1/0 and used as independent variables.
  • Also called dummy variables in the book.
  • When used directly, they influence only the
    constant term of the regression
  • It is also possible to use a binary variable so
    that it changes both constant term and slope for
    the regression

7
Example: Regression of birth weight with mother's
weight and smoking status as independent variables
8
Interpretation
  • Have fitted the model:
  • Birth weight = 2500.174 + 4.238·(mother's weight) - 270.013·(smoking status)
  • If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight?
  • -270.013·1 ≈ -270 grams
  • What is the predicted weight of the child of a 150 pound, smoking woman?
  • 2500.174 + 4.238·150 - 270.013·1 ≈ 2866 grams (see the sketch below)
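To make the plug-in step explicit, here is a tiny sketch evaluating the fitted equation from this slide (the function name is ours; the coefficients are the slide's):

```python
# Coefficients copied from the fitted model on this slide.
CONST, B_MWT, B_SMOKE = 2500.174, 4.238, -270.013

def predict_bwt(mwt_lbs: float, smokes: bool) -> float:
    """Predicted birth weight (grams) for a given mother's weight and smoking status."""
    return CONST + B_MWT * mwt_lbs + B_SMOKE * int(smokes)

print(predict_bwt(150, True))    # ~2866 g, the answer above
print(predict_bwt(150, False))   # ~3136 g for a comparable non-smoker
```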

9
Will R² automatically be low for indicator
variables?
[Scatter plot of the dependent variable against an indicator variable taking only the values 0 and 1]
10
What if a categorical variable has more than two
values?
  • Example: Ethnicity: black, white, other
  • For categorical variables with m possible values, use m - 1 indicators (see the sketch below)
  • Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other
  • If this is unsuitable, use an additional interaction variable (product of indicators)
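A small sketch of the coding step with pandas, using a hypothetical data frame; 'white' is taken as the reference level, so m = 3 categories give m - 1 = 2 indicators:

```python
import pandas as pd

df = pd.DataFrame({
    "ethnicity": ["white", "black", "other", "white", "black"],
    "smoker": [0, 1, 0, 1, 1],       # a second, hypothetical indicator
})

# 'white' is the baseline: it corresponds to black = 0 and other = 0.
df["black"] = (df["ethnicity"] == "black").astype(int)
df["other"] = (df["ethnicity"] == "other").astype(int)

# If pure additivity is unsuitable, a product of indicators adds an interaction.
df["black_x_smoker"] = df["black"] * df["smoker"]
print(df)
```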

11
Model birth weight as a function of ethnicity
  • Have constructed variables black (0 or 1) and other (0 or 1)
  • Model: Birth weight = a + b·black + c·other
  • Get (from the SPSS output): a ≈ 3104, b ≈ -384, c ≈ -299
  • Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others
  • Predicted birth weight for whites is 3104 grams

12
Interaction
  • Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
  • This means that you get, e.g., different slopes for x1 for different values of x2
  • Usually modelled by constructing a product of the two variables and including it in the model
  • Example: bwt = a + b·mwt + c·smoking + d·mwt·smoking
  •        = a + (b + d·smoking)·mwt + c·smoking

13
Get SPSS to do the estimation
  • Get:
  • bwt = 2347 + 5.41·mwt + 47.87·smoking - 2.46·mwt·smoking
  • mwt = 100 lbs vs. mwt = 200 lbs for non-smokers:
  • bwt = 2888 g and bwt = 3428 g, difference = 540 g
  • mwt = 100 lbs vs. mwt = 200 lbs for smokers:
  • bwt = 2690 g and bwt = 2985 g, difference = 295 g (checked in the sketch below)
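A short sketch re-doing this slide's arithmetic from the rounded coefficients (the 1 g discrepancies against the slide are rounding):

```python
# Coefficients as printed on this slide (rounded).
a, b, c, d = 2347.0, 5.41, 47.87, -2.46

def bwt(mwt: float, smoking: int) -> float:
    return a + b * mwt + c * smoking + d * mwt * smoking

for s in (0, 1):
    lo, hi = bwt(100, s), bwt(200, s)
    print(f"smoking={s}: {lo:.0f} g vs {hi:.0f} g, difference {hi - lo:.0f} g")
# The slope per lb is b = 5.41 for non-smokers but b + d = 2.95 for smokers.
```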

14
What does this mean?
  • Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)

15
What does this mean (contd.)?
  • We see that the slope is steeper for non-smokers
  • In fact, a model with mwt and mwt·smoking fits better than the model with mwt and smoking

16
Should you always look at all possible
interactions?
  • No.
  • The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
  • An interaction between two continuous variables is slightly more complicated
  • Interactions between three or more variables are difficult to interpret
  • It doesn't matter if you have a good model if you can't interpret it
  • You are often interested in interactions you think are there before you do the study

17
Multicollinearity
  • Means that two or more independent variables are closely correlated
  • To discover it, make plots and compute correlations, or regress one independent variable on the others (see the sketch below)
  • To deal with it:
  • Remove unnecessary variables
  • Define and compute an index
  • If the correlated variables are kept, the model can still be used for prediction
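A sketch of the detection step on simulated data (the two variables are deliberately constructed to be correlated; names and numbers are illustrative): compute the pairwise correlation, or regress one independent variable on the other and inspect R².

```python
import numpy as np

rng = np.random.default_rng(1)
car_weight = rng.normal(3000, 400, 50)                        # lbs, hypothetical
prop_imported = 0.001 * car_weight + rng.normal(0, 0.05, 50)  # built to correlate

print(np.corrcoef(car_weight, prop_imported)[0, 1])           # high r flags trouble

# Regress one independent variable on the other; high R^2 -> multicollinearity.
slope, const = np.polyfit(car_weight, prop_imported, 1)
resid = prop_imported - (const + slope * car_weight)
print(1 - resid.var() / prop_imported.var())                  # R^2 close to 1
```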

18
Example: Traffic deaths
  • Recall: We used four variables to predict traffic deaths in the U.S.
  • Among them: average car weight and prop. of imported cars
  • However, the correlation between these two variables is pretty high

19
Correlation: car weight vs. imported cars
  • Pearson's r is 0.94
  • It is problematic to use both of these as independent variables in a regression

20
Choice of variables
  • Include variables which you believe have a clear influence on the dependent variable, even if the variable is uninteresting: this helps find the true relationship between the interesting variables and the dependent variable.
  • Avoid including a pair (or a set) of variables whose values are clearly linearly related

21
Choice of values
  • Values should have a good spread: again, avoid collinearity
  • Should cover the range for which the model will
    be used
  • For categorical variables, one may choose to
    combine levels in a systematic way.

22
Specification bias
  • Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
  • Not including one variable may bias the estimation of the other
  • Thus, one should be humble when interpreting regression results: there are probably always variables one could have added

23
Heteroscedasticity: what is it?
  • In the standard regression model, it is assumed that all error terms εi have the same variance.
  • If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic.
  • Sometimes, it is clear that data exhibit such properties.

24
Heteroscedasticity: why does it matter?
  • Our standard methods for estimation, confidence
    intervals, and hypothesis testing assume equal
    variances.
  • If we go on and use these methods anyway, our
    answers might be quite wrong!

25
Heteroscedasticity: how to detect it?
  • Fit a regression model, and study the residuals:
  • make a plot of them against the independent variables
  • make a plot of them against the predicted values for the dependent variable
  • Possibility: test for heteroscedasticity by regressing the squared residuals on the predicted values (see the sketch below).
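A sketch of that residual-based check on simulated heteroscedastic data (all names and numbers illustrative): under constant variance, n·R² from the auxiliary regression is roughly chi-squared with 1 d.f. (the Breusch-Pagan idea), so a large value signals heteroscedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)        # error sd grows with x

slope, const = np.polyfit(x, y, 1)
fitted = const + slope * x
sq_resid = (y - fitted) ** 2

# Auxiliary regression: squared residuals on the predicted values.
g, c = np.polyfit(fitted, sq_resid, 1)
ss_res = ((sq_resid - (c + g * fitted)) ** 2).sum()
ss_tot = ((sq_resid - sq_resid.mean()) ** 2).sum()
R2 = 1 - ss_res / ss_tot
print(R2, len(x) * R2)        # n*R^2 well above 3.84 (chi^2(1), 5%) -> reject
```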

26
Example: The model traffic deaths = a + b·(car age) + c·(light trucks)
  • The residual plot (not shown here) does not look too bad

27
What is bad?
  • [Two residual plots of e against the predicted value of y, each centred on e = 0, where the spread of the residuals changes systematically with the predicted value]
28
Heteroscedasticity: what to do about it?
  • Use a transformation of the dependent variable, e.g. log-linear models
  • If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility

29
Dependence over time
  • Sometimes, y1, y2, …, yn are not completely independent observations (given the independent variables).
  • Lagged values: yi may depend on yi-1 in addition to its independent variables
  • Autocorrelated errors: the residuals ei are correlated
  • Often relevant for time-series data

30
Lagged values
  • In this case, we may run a multiple regression just as before, but including the previous dependent variable yt-1 as a predictor variable for yt.
  • Use the model yt = β0 + β1·x1 + φ·yt-1 + εt
  • A 1-unit increase in x1 in the first time period yields an expected increase in y of β1, an increase of β1·φ in the second period, β1·φ² in the third period, and so on (see the sketch below)
  • The total expected increase over all future periods is β1/(1 - φ)
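A tiny sketch of this geometric decay, with illustrative values of β1 and φ (they happen to match the pension-fund example that follows):

```python
beta1, phi = 0.235, 0.954                   # illustrative coefficient values

effects = [beta1 * phi ** k for k in range(500)]
print(effects[0], effects[1], effects[2])   # effect in periods 1, 2, 3
print(sum(effects), beta1 / (1 - phi))      # partial sum approaches beta1/(1 - phi)
```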

31
Example: Pension funds (from the textbook CD)
  • Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
  • Have data for 25 yrs → 24 observations (one year is lost to the lagged variable)

32
Get the model
  • yt = 1.397 + 0.235·(stock return) + 0.954·yt-1
  • A one million increase in stock return one year yields a 0.24 increase in pension fund portfolios at market value
  • For the next year: 0.235·0.954 ≈ 0.22
  • And the third year: 0.235·0.954² ≈ 0.21
  • For all future: 0.235/(1 - 0.954) ≈ 5.1
  • What if you have a 2 million increase? (All the effects simply double, since the model is linear.)

33
Autocorrelated errors
  • In the standard regression model, the errors are assumed to be independent.
  • Using the standard regression formulas anyway can lead to errors: typically, the uncertainty in the results is underestimated.

34
Autocorrelation: how to detect it?
  • Plot the residuals against time!
  • The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model

This is an option in SPSS.
The test depends on K (no. of independent variables), n (no. of observations) and the significance level α.
Test H0: ρ = 0 vs. H1: ρ > 0. Reject H0 if d < dL; accept H0 if d > dU; the test is inconclusive if dL < d < dU.
Test statistic: d = Σ(et - et-1)² / Σet², with the sums over t = 2, …, n and t = 1, …, n (computed in the sketch below)
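A minimal sketch of the statistic itself, applied to simulated residuals with built-in first-order autocorrelation (everything here is illustrative, not SPSS output):

```python
import numpy as np

def durbin_watson(e):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d near 2 means no autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
u = rng.normal(size=100)
e = np.empty(100)
e[0] = u[0]
for t in range(1, 100):
    e[t] = 0.5 * e[t - 1] + u[t]     # residuals with rho = 0.5
print(durbin_watson(e))              # well below 2 -> evidence that rho > 0
```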
35
Example: Pension funds
  • Want to test ρ = 0 at the 5% level
  • Test statistic d = 1.008
  • Have one independent variable (K = 1 in table 12 on p. 876) and n = 24
  • Find critical values dL = 1.27 and dU = 1.45
  • Since d = 1.008 < dL, reject H0

36
Autocorrelation: what to do?
  • It is possible to use a two-stage regression procedure
  • If a first-order autoregressive model εt = ρ·εt-1 + ut is appropriate, the model
  • yt - ρ·yt-1 = β0·(1 - ρ) + β1·(xt - ρ·xt-1) + ut
  • will have uncorrelated errors
  • Estimate ρ from the Durbin-Watson statistic, and estimate the β's from the model above (see the sketch below)
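A minimal sketch of the two-stage idea under these assumptions (first-stage estimate ρ̂ = 1 - d/2 from the Durbin-Watson statistic, then quasi-differencing as in the model above):

```python
import numpy as np

def quasi_difference(y, x, rho):
    """Transform y and x so the errors of the transformed model are uncorrelated."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return y[1:] - rho * y[:-1], x[1:] - rho * x[:-1]

d = 1.008                  # Durbin-Watson statistic from the pension-fund example
rho_hat = 1 - d / 2        # ~0.50, a rough first-stage estimate of rho
print(rho_hat)

# Second stage (given series y and x):
#   y_star, x_star = quasi_difference(y, x, rho_hat)
#   fit y_star = beta0*(1 - rho_hat) + beta1*x_star + u by ordinary least squares
```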

37
Next time
  • What if the assumption of normality for your data
    is invalid?
  • You have to forget all you have learnt so far, and do something else:
  • Non-parametric statistics