1
Advanced topics in regression
  • Tron Anders Moger
  • 18.10.2006

2
Last time
  • Had the model:
    death rate per 1000 = a + b·(car age) + c·(prop. light trucks)
  • Annotated regression output from last time, summarized as formulas (a small numerical sketch follows):
    K = no. of independent variables
    SSR: d.f.(SSR) = K, MSR = SSR/K
    SSE: d.f.(SSE) = n - K - 1, MSE = se² = SSE/(n - K - 1)
    SST = SSR + SSE: d.f.(SST) = n - 1
    R² = 1 - SSE/SST, Pearson's r = √R², Adj. R² = 1 - (SSE/(n - K - 1))/(SST/(n - 1))
    F test statistic = MSR/MSE, with P-value for the test "all β's = 0" vs. "at least one β ≠ 0"
    For each coefficient: the β-estimate, SE(β), T test statistic = β/SE(β), 95% CI for β, and the P-value for the test β = 0 vs. β ≠ 0
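As a quick numerical illustration of how these quantities fit together, here is a minimal sketch with made-up sums of squares (the numbers are hypothetical, not from the traffic-death data):

```python
# Hypothetical sums of squares, only to show how the formulas above connect.
SSR, SSE, n, K = 80.0, 20.0, 50, 3
SST = SSR + SSE
R2 = 1 - SSE / SST                                  # coefficient of determination
adj_R2 = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))  # penalizes extra variables
MSR, MSE = SSR / K, SSE / (n - K - 1)
F = MSR / MSE                                       # compare to F(K, n-K-1)
print(R2, adj_R2, F)                                # 0.8  0.787  61.3
```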
3
Why did we remove car weight and percentage
imported cars from the model?
  • They did not show a significant relationship with the dependent variable (β not different from 0)
  • Unless independent variables are completely uncorrelated, you will get different b's when including several variables in your model compared to just one variable (collinearity)
  • Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the effect of important independent variables

4
Relationship between R² and b
  • Which result would make you most happy?

[Two scatter plots, both axes running 0 to 100: left panel with low R² and high b (with wide CI); right panel with high R² and low b (with narrow CI)]
5
Centered variables
  • Remember, we found the model:
  • Birth weight = 2369.672 + 4.429·(mother's weight)
  • Hence, the constant has no interpretation (it is the predicted birth weight for a mother weighing 0 lbs)
  • Construct mother's weight2 = mother's weight - mean(mother's weight)
  • Get the model:
  • Birth weight = 2944.656 + 4.429·(mother's weight2)
  • The constant is now the predicted birth weight for a mother of mean weight, 130 lbs (see the sketch below)
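A minimal sketch of the centering step on simulated data (the numbers are stand-ins for the birth-weight file, which is not reproduced here); the slope is unchanged, and only the constant moves to the prediction at the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mwt = rng.normal(130, 30, 189)                    # hypothetical mothers' weights (lbs)
bwt = 2370 + 4.4 * mwt + rng.normal(0, 500, 189)  # hypothetical birth weights (g)

slope, const = np.polyfit(mwt, bwt, 1)                   # raw predictor
slope_c, const_c = np.polyfit(mwt - mwt.mean(), bwt, 1)  # centered predictor

print(const, slope)      # constant = predicted bwt at mwt = 0 (no interpretation)
print(const_c, slope_c)  # same slope; constant = predicted bwt at the mean weight
```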
6
Indicator variables
  • Binary variables (yes/no, male/female, …) can be represented as 1/0 and used as independent variables.
  • Also called dummy variables in the book.
  • When used directly, they influence only the
    constant term of the regression
  • It is also possible to use a binary variable so
    that it changes both constant term and slope for
    the regression

7
Example: Regression of birth weight with mother's
weight and smoking status as independent variables
8
Interpretation
  • Have fitted the model:
  • Birth weight = 2500.174 + 4.238·(mother's weight) - 270.013·(smoking status)
  • If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight?
  • -270.013·1 ≈ -270 grams
  • What is the predicted weight of the child of a 150 pound, smoking woman?
  • 2500.174 + 4.238·150 - 270.013·1 ≈ 2866 grams (see the sketch below)
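To make the plug-in step explicit, here is a tiny sketch evaluating the fitted equation from this slide (the function name is ours; the coefficients are the slide's):

```python
# Coefficients copied from the fitted model on this slide.
CONST, B_MWT, B_SMOKE = 2500.174, 4.238, -270.013

def predict_bwt(mwt_lbs: float, smokes: bool) -> float:
    """Predicted birth weight (grams) for a given mother's weight and smoking status."""
    return CONST + B_MWT * mwt_lbs + B_SMOKE * int(smokes)

print(predict_bwt(150, True))    # ~2866 g, the answer above
print(predict_bwt(150, False))   # ~3136 g for a comparable non-smoker
```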

9
Will R² automatically be low for indicator
variables?
[Scatter plot of the dependent variable against an indicator variable taking only the values 0 and 1]
10
What if a categorical variable has more than two
values?
  • Example: Ethnicity: black, white, other
  • For categorical variables with m possible values, use m - 1 indicators (see the sketch below)
  • Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other
  • If this is unsuitable, use an additional interaction variable (product of indicators)
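A small sketch of the coding step with pandas, using a hypothetical data frame; 'white' is taken as the reference level, so m = 3 categories give m - 1 = 2 indicators:

```python
import pandas as pd

df = pd.DataFrame({
    "ethnicity": ["white", "black", "other", "white", "black"],
    "smoker": [0, 1, 0, 1, 1],       # a second, hypothetical indicator
})

# 'white' is the baseline: it corresponds to black = 0 and other = 0.
df["black"] = (df["ethnicity"] == "black").astype(int)
df["other"] = (df["ethnicity"] == "other").astype(int)

# If pure additivity is unsuitable, a product of indicators adds an interaction.
df["black_x_smoker"] = df["black"] * df["smoker"]
print(df)
```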

11
Model birth weight as a function of ethnicity
  • Have constructed variables black (0 or 1) and other (0 or 1)
  • Model: Birth weight = a + b·black + c·other
  • Get (from the SPSS output): a ≈ 3104, b ≈ -384, c ≈ -299
  • Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others
  • Predicted birth weight for whites is 3104 grams

12
Interaction
  • Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
  • This means that you get, e.g., different slopes for x1 for different values of x2
  • Usually modelled by constructing a product of the two variables and including it in the model
  • Example: bwt = a + b·mwt + c·smoking + d·mwt·smoking
  •        = a + (b + d·smoking)·mwt + c·smoking

13
Get SPSS to do the estimation
  • Get:
  • bwt = 2347 + 5.41·mwt + 47.87·smoking - 2.46·mwt·smoking
  • mwt = 100 lbs vs. mwt = 200 lbs for non-smokers:
  • bwt = 2888 g and bwt = 3428 g, difference = 540 g
  • mwt = 100 lbs vs. mwt = 200 lbs for smokers:
  • bwt = 2690 g and bwt = 2985 g, difference = 295 g (checked in the sketch below)
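A short sketch re-doing this slide's arithmetic from the rounded coefficients (the 1 g discrepancies against the slide are rounding):

```python
# Coefficients as printed on this slide (rounded).
a, b, c, d = 2347.0, 5.41, 47.87, -2.46

def bwt(mwt: float, smoking: int) -> float:
    return a + b * mwt + c * smoking + d * mwt * smoking

for s in (0, 1):
    lo, hi = bwt(100, s), bwt(200, s)
    print(f"smoking={s}: {lo:.0f} g vs {hi:.0f} g, difference {hi - lo:.0f} g")
# The slope per lb is b = 5.41 for non-smokers but b + d = 2.95 for smokers.
```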

14
What does this mean?
  • Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)

15
What does this mean (contd.)?
  • We see that the slope is steeper for non-smokers
  • In fact, a model with mwt and mwt·smoking fits better than the model with mwt and smoking

16
Should you always look at all possible
interactions?
  • No.
  • The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
  • An interaction between two continuous variables is slightly more complicated
  • Interactions between three or more variables are difficult to interpret
  • It doesn't matter if you have a good model if you can't interpret it
  • You are often interested in interactions you think are there before you do the study

17
Multicollinearity
  • Means that two or more independent variables are closely correlated
  • To discover it, make plots and compute correlations, or regress one independent variable on the others (see the sketch below)
  • To deal with it:
  • Remove unnecessary variables
  • Define and compute an index
  • If the correlated variables are kept, the model can still be used for prediction
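A sketch of the detection step on simulated data (the two variables are deliberately constructed to be correlated; names and numbers are illustrative): compute the pairwise correlation, or regress one independent variable on the other and inspect R².

```python
import numpy as np

rng = np.random.default_rng(1)
car_weight = rng.normal(3000, 400, 50)                        # lbs, hypothetical
prop_imported = 0.001 * car_weight + rng.normal(0, 0.05, 50)  # built to correlate

print(np.corrcoef(car_weight, prop_imported)[0, 1])           # high r flags trouble

# Regress one independent variable on the other; high R^2 -> multicollinearity.
slope, const = np.polyfit(car_weight, prop_imported, 1)
resid = prop_imported - (const + slope * car_weight)
print(1 - resid.var() / prop_imported.var())                  # R^2 close to 1
```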

18
Example: Traffic deaths
  • Recall: We used four variables to predict traffic deaths in the U.S.
  • Among them: average car weight and prop. of imported cars
  • However, the correlation between these two variables is pretty high

19
Correlation: car weight vs. imported cars
  • Pearson's r is 0.94
  • It is problematic to use both of these as independent variables in a regression

20
Choice of variables
  • Include variables which you believe have a clear influence on the dependent variable, even if the variable is uninteresting: this helps find the true relationship between the interesting variables and the dependent variable.
  • Avoid including a pair (or a set) of variables whose values are clearly linearly related

21
Choice of values
  • Values should have a good spread: again, avoid collinearity
  • Should cover the range for which the model will
    be used
  • For categorical variables, one may choose to
    combine levels in a systematic way.

22
Specification bias
  • Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
  • Not including one variable may bias the estimation of the other
  • Thus, one should be humble when interpreting regression results: there are probably always variables one could have added

23
Heteroscedasticity: what is it?
  • In the standard regression model, it is assumed that all error terms εi have the same variance.
  • If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic.
  • Sometimes, it is clear that data exhibit such properties.

24
Heteroscedasticity: why does it matter?
  • Our standard methods for estimation, confidence
    intervals, and hypothesis testing assume equal
    variances.
  • If we go on and use these methods anyway, our
    answers might be quite wrong!

25
Heteroscedasticity: how to detect it?
  • Fit a regression model, and study the residuals:
  • make a plot of them against the independent variables
  • make a plot of them against the predicted values for the dependent variable
  • Possibility: test for heteroscedasticity by regressing the squared residuals on the predicted values (see the sketch below).
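A sketch of that residual-based check on simulated heteroscedastic data (all names and numbers illustrative): under constant variance, n·R² from the auxiliary regression is roughly chi-squared with 1 d.f. (the Breusch-Pagan idea), so a large value signals heteroscedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)        # error sd grows with x

slope, const = np.polyfit(x, y, 1)
fitted = const + slope * x
sq_resid = (y - fitted) ** 2

# Auxiliary regression: squared residuals on the predicted values.
g, c = np.polyfit(fitted, sq_resid, 1)
ss_res = ((sq_resid - (c + g * fitted)) ** 2).sum()
ss_tot = ((sq_resid - sq_resid.mean()) ** 2).sum()
R2 = 1 - ss_res / ss_tot
print(R2, len(x) * R2)        # n*R^2 well above 3.84 (chi^2(1), 5%) -> reject
```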

26
Example: The model traffic deaths = a + b·(car age) + c·(light trucks)
  • The residual plot (not shown here) does not look too bad

27
What is bad?
  • [Two residual plots of e against the predicted value of y, each centred on e = 0, where the spread of the residuals changes systematically with the predicted value]
28
Heteroscedasticity: what to do about it?
  • Use a transformation of the dependent variable, e.g. log-linear models
  • If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility

29
Dependence over time
  • Sometimes, y1, y2, …, yn are not completely independent observations (given the independent variables).
  • Lagged values: yi may depend on yi-1 in addition to its independent variables
  • Autocorrelated errors: the residuals ei are correlated
  • Often relevant for time-series data

30
Lagged values
  • In this case, we may run a multiple regression just as before, but including the previous dependent variable yt-1 as a predictor variable for yt.
  • Use the model yt = β0 + β1·x1 + φ·yt-1 + εt
  • A 1-unit increase in x1 in the first time period yields an expected increase in y of β1, an increase of β1·φ in the second period, β1·φ² in the third period, and so on (see the sketch below)
  • The total expected increase over all future periods is β1/(1 - φ)
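A tiny sketch of this geometric decay, with illustrative values of β1 and φ (they happen to match the pension-fund example that follows):

```python
beta1, phi = 0.235, 0.954                   # illustrative coefficient values

effects = [beta1 * phi ** k for k in range(500)]
print(effects[0], effects[1], effects[2])   # effect in periods 1, 2, 3
print(sum(effects), beta1 / (1 - phi))      # partial sum approaches beta1/(1 - phi)
```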

31
Example: Pension funds (from the textbook CD)
  • Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
  • Have data for 25 yrs → 24 observations (one year is lost to the lagged variable)

32
Get the model
  • yt = 1.397 + 0.235·(stock return) + 0.954·yt-1
  • A one million increase in stock return one year yields a 0.24 increase in pension fund portfolios at market value
  • For the next year: 0.235·0.954 ≈ 0.22
  • And the third year: 0.235·0.954² ≈ 0.21
  • For all future: 0.235/(1 - 0.954) ≈ 5.1
  • What if you have a 2 million increase? (All the effects simply double, since the model is linear.)

33
Autocorrelated errors
  • In the standard regression model, the errors are assumed to be independent.
  • Using the standard regression formulas anyway can lead to errors: typically, the uncertainty in the results is underestimated.

34
Autocorrelation: how to detect it?
  • Plot the residuals against time!
  • The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model

This is an option in SPSS.
The test depends on K (no. of independent variables), n (no. of observations) and the significance level α.
Test H0: ρ = 0 vs. H1: ρ > 0. Reject H0 if d < dL; accept H0 if d > dU; the test is inconclusive if dL < d < dU.
Test statistic: d = Σ(et - et-1)² / Σet², with the sums over t = 2, …, n and t = 1, …, n (computed in the sketch below)
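A minimal sketch of the statistic itself, applied to simulated residuals with built-in first-order autocorrelation (everything here is illustrative, not SPSS output):

```python
import numpy as np

def durbin_watson(e):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d near 2 means no autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
u = rng.normal(size=100)
e = np.empty(100)
e[0] = u[0]
for t in range(1, 100):
    e[t] = 0.5 * e[t - 1] + u[t]     # residuals with rho = 0.5
print(durbin_watson(e))              # well below 2 -> evidence that rho > 0
```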
35
Example: Pension funds
  • Want to test ρ = 0 at the 5% level
  • Test statistic d = 1.008
  • Have one independent variable (K = 1 in table 12 on p. 876) and n = 24
  • Find critical values dL = 1.27 and dU = 1.45
  • Since d = 1.008 < dL, reject H0

36
Autocorrelation: what to do?
  • It is possible to use a two-stage regression procedure
  • If a first-order autoregressive model εt = ρ·εt-1 + ut is appropriate, the model
  • yt - ρ·yt-1 = β0·(1 - ρ) + β1·(xt - ρ·xt-1) + ut
  • will have uncorrelated errors
  • Estimate ρ from the Durbin-Watson statistic, and estimate the β's from the model above (see the sketch below)
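A minimal sketch of the two-stage idea under these assumptions (first-stage estimate ρ̂ = 1 - d/2 from the Durbin-Watson statistic, then quasi-differencing as in the model above):

```python
import numpy as np

def quasi_difference(y, x, rho):
    """Transform y and x so the errors of the transformed model are uncorrelated."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return y[1:] - rho * y[:-1], x[1:] - rho * x[:-1]

d = 1.008                  # Durbin-Watson statistic from the pension-fund example
rho_hat = 1 - d / 2        # ~0.50, a rough first-stage estimate of rho
print(rho_hat)

# Second stage (given series y and x):
#   y_star, x_star = quasi_difference(y, x, rho_hat)
#   fit y_star = beta0*(1 - rho_hat) + beta1*x_star + u by ordinary least squares
```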

37
Next time
  • What if the assumption of normality for your data
    is invalid?
  • You have to forget all you have learnt so far, and do something else:
  • Non-parametric statistics