Title: Advanced topics in regression
1. Advanced topics in regression
- Tron Anders Moger
- 18.10.2006
2. Last time
- Had the model:
- death rate per 1000 = a + b·(car age) + c·(prop. light trucks)
- K = no. of independent variables
- Sums of squares: SST (total, d.f. n-1) splits into SSR (regression, d.f. K) and SSE (error, d.f. n-K-1)
- R² = 1 - SSE/SST; Pearson's r = √R²
- Adj. R² = 1 - (SSE/(n-K-1))/(SST/(n-1))
- MSR = SSR/K; MSE = s_e² = SSE/(n-K-1)
- F test statistic = MSR/MSE gives the P-value for the test: all βs = 0 vs. at least one β ≠ 0
- t test statistic = β/SE(β) gives the P-value for the test: β = 0 vs. β ≠ 0, and the 95% CI for β
- (The slide shows these quantities as an annotated diagram of the output, with the β-estimates and SE(β).)
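A minimal sketch in Python (the course itself uses SPSS) of how the quantities on this slide fit together; the function and variable names are illustrative, not from the course material:

```python
import numpy as np

def regression_summary(y, y_pred, K):
    """Summary quantities for a fitted regression with K independent variables."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)          # SSE: unexplained variation
    sst = np.sum((y - y.mean()) ** 2)        # SST: total variation
    ssr = sst - sse                          # SSR: explained variation
    r2 = 1 - sse / sst                       # R^2 = 1 - SSE/SST
    adj_r2 = 1 - (sse / (n - K - 1)) / (sst / (n - 1))
    msr = ssr / K                            # MSR = SSR/K
    mse = sse / (n - K - 1)                  # MSE = s_e^2 = SSE/(n-K-1)
    return {"R2": r2, "adj_R2": adj_r2, "F": msr / mse}
```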
3. Why did we remove car weight and percentage imported cars from the model?
- They did not show a significant relationship with the dependent variable (β not different from 0)
- Unless independent variables are completely uncorrelated, you will get different bs when including several variables in your model compared to just one variable (collinearity)
- Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the effect of important independent variables
4. Relationship between R² and b
- Which result would make you most happy?
- Low R², high b (with wide CI)
- High R², low b (with narrow CI)
5. Centered variables
- Remember, we found the model
- Birth weight = 2369.672 + 4.429·(mother's weight)
- Hence, the constant has no interpretation
- Construct mother's weight 2 = mother's weight - mean(mother's weight)
- Get the model
- Birth weight = 2944.656 + 4.429·(mother's weight 2)
- The constant is now the predicted birth weight for a 130 lbs mother
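A minimal Python sketch of centering (the weights here are made up; the identity holds for any data, since only the constant term moves):

```python
import numpy as np

# Hypothetical mothers' weights in lbs (illustrative, not the course data).
mothers_weight = np.array([105.0, 120.0, 130.0, 150.0, 145.0])
centered = mothers_weight - mothers_weight.mean()

# Same slope, shifted intercept: the two parameterizations predict identically.
pred_raw = 2369.672 + 4.429 * mothers_weight
pred_centered = (2369.672 + 4.429 * mothers_weight.mean()) + 4.429 * centered
assert np.allclose(pred_raw, pred_centered)
```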
6. Indicator variables
- Binary variables (yes/no, male/female, ...) can be represented as 1/0, and used as independent variables
- Also called dummy variables in the book
- When used directly, they influence only the constant term of the regression
- It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression
7. Example: Regression of birth weight with mother's weight and smoking status as independent variables
8. Interpretation
- Have fitted the model
- Birth weight = 2500.174 + 4.238·(mother's weight) - 270.013·(smoking status)
- If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight?
- -270.013·1 = -270 grams
- What is the predicted weight of the child of a 150-pound, smoking woman?
- 2500.174 + 4.238·150 - 270.013·1 ≈ 2866 grams
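As a quick check outside SPSS, a small Python sketch of these predictions, with the coefficients copied from the slide:

```python
# Predicted birth weight (grams) from the fitted model on this slide.
def predict_bwt(mwt, smoking):
    return 2500.174 + 4.238 * mwt - 270.013 * smoking

print(predict_bwt(150, 1))                       # ~2866 g: 150 lb smoker
print(predict_bwt(150, 1) - predict_bwt(150, 0)) # -270 g effect of smoking
```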
9. Will R² automatically be low for indicator variables?
(Slide shows a plot of the data against an indicator taking the values 0 and 1.)
10. What if a categorical variable has more than two values?
- Example: Ethnicity: black, white, other
- For categorical variables with m possible values, use m-1 indicators
- Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other
- If this may be unsuitable, use an additional interaction variable (product of indicators)
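A sketch of the m-1 coding in Python/pandas; the level and column names here are illustrative, not from the course data:

```python
import pandas as pd

# m = 3 ethnicity levels -> m - 1 = 2 indicator columns.
eth = pd.Series(["white", "black", "other", "white", "black"])
dummies = pd.get_dummies(eth, prefix="eth").astype(int)
# Drop one column so 'white' becomes the reference category.
X = dummies[["eth_black", "eth_other"]]
print(X)
```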
11. Model birth weight as a function of ethnicity
- Have constructed variables black = 0 or 1 and other = 0 or 1
- Model: Birth weight = a + b·black + c·other
- Get:
- Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others
- Predicted birth weight for whites is 3104 grams
12. Interaction
- Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
- Means that you e.g. get different slopes for x1 for different values of x2
- Usually modelled by constructing a product of the two variables, and including it in the model
- Example: bwt = a + b·mwt + c·smoking + d·mwt·smoking
- = a + (b + d·smoking)·mwt + c·smoking
13. Get SPSS to do the estimation
- Get:
- bwt = 2347 + 5.41·mwt + 47.87·smoking - 2.46·mwt·smoking
- mwt = 100 lbs vs. mwt = 200 lbs for non-smokers:
- bwt = 2888 g and bwt = 3428 g, difference 540 g
- mwt = 100 lbs vs. mwt = 200 lbs for smokers:
- bwt = 2690 g and bwt = 2985 g, difference 295 g
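The same comparison in a short Python sketch, with the coefficients as reconstructed from the slide (the non-smoker difference comes out as 541 g before rounding):

```python
# Predictions from the interaction model above.
def predict_bwt(mwt, smoking):
    return 2347 + 5.41 * mwt + 47.87 * smoking - 2.46 * mwt * smoking

for s in (0, 1):
    diff = predict_bwt(200, s) - predict_bwt(100, s)
    print(f"smoking={s}: effect of 100 extra lbs = {diff:.0f} g")
# smoking=0: slope 5.41 per lb; smoking=1: slope 5.41 - 2.46 = 2.95 per lb
```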
14. What does this mean?
- Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)
15. What does this mean, cont'd?
- We see that the slope is steeper for non-smokers
- In fact, a model with mwt and mwt·smoking fits better than the model with mwt and smoking
16. Should you always look at all possible interactions?
- No.
- The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
- Interaction between two continuous variables: slightly more complicated
- Interaction between three or more variables: difficult to interpret
- It doesn't matter if you have a good model if you can't interpret it
- You are often interested in interactions you think are there before you do the study
17. Multicollinearity
- Means that two or more independent variables are closely correlated
- To discover it, make plots and compute correlations (or make a regression of one independent variable on the others; see the sketch below)
- To deal with it:
- Remove unnecessary variables
- Define and compute an index
- If variables are kept, the model could still be used for prediction
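A minimal numpy sketch of that regress-one-on-the-others check; a high R² here signals multicollinearity (this R² also defines the variance inflation factor, VIF = 1/(1 - R²)). The function name is illustrative:

```python
import numpy as np

def collinearity_r2(X, j):
    """R^2 from regressing column j of X on the remaining columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
```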
18. Example: Traffic deaths
- Recall: Used four variables to predict traffic deaths in the U.S.
- Among them: average car weight and prop. imported cars
- However, the correlation between these two variables is pretty high
19. Correlation: car weight vs. imported cars
- Pearson's r is 0.94
- It is problematic to use both of these as independent variables in a regression
20. Choice of variables
- Include variables which you believe have a clear influence on the dependent variable, even if the variable is uninteresting: this helps find the true relationship between the interesting variables and the dependent variable
- Avoid including a pair (or a set) of variables whose values are clearly linearly related
21. Choice of values
- Should have a good spread: again, avoid collinearity
- Should cover the range for which the model will be used
- For categorical variables, one may choose to combine levels in a systematic way
22. Specification bias
- Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
- Not including one variable may bias the estimation of the other
- Thus, one should be humble when interpreting regression results: there are probably always variables one could have added
23. Heteroscedasticity: what is it?
- In the standard regression model y_i = β0 + β1·x_1i + ... + βK·x_Ki + ε_i, it is assumed that all ε_i have the same variance
- If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic
- Sometimes, it is clear that data exhibit such properties
24. Heteroscedasticity: why does it matter?
- Our standard methods for estimation, confidence intervals, and hypothesis testing assume equal variances
- If we go on and use these methods anyway, our answers might be quite wrong!
25. Heteroscedasticity: how to detect it?
- Fit a regression model, and study the residuals:
- make a plot of them against the independent variables
- make a plot of them against the predicted values for the dependent variable
- Possibility: test for heteroscedasticity by doing a regression of the squared residuals on the predicted values (sketched below)
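A minimal numpy sketch of that suggestion; a formal test would also need a standard error or p-value for the slope, which is omitted here:

```python
import numpy as np

def squared_residual_slope(residuals, y_pred):
    """Slope from regressing squared residuals on predicted values;
    a clearly non-zero slope suggests heteroscedasticity."""
    A = np.column_stack([np.ones(len(y_pred)), y_pred])
    beta, *_ = np.linalg.lstsq(A, np.asarray(residuals) ** 2, rcond=None)
    return beta[1]
```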
26. Example: The model traffic deaths = a + b·(car age) + c·(light trucks)
27. What is bad?
(Slide shows two plots of the residuals e against the predicted values y, each with a zero line.)
28. Heteroscedasticity: what to do about it?
- Use a transformation of the dependent variable:
- log-linear models (see the simulated example below)
- If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility
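As an illustration of why a log transform can help, a small simulated example (entirely made-up data, not from the course): when errors are multiplicative, the spread of y grows with its level, but log(y) has roughly constant error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, 200)
y = 2.0 * x * np.exp(rng.normal(0.0, 0.3, 200))  # error sd grows with y
log_y = np.log(y)  # log y = log 2 + log x + e, with constant-variance e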
29. Dependence over time
- Sometimes, y1, y2, ..., yn are not completely independent observations (given the independent variables)
- Lagged values: y_t may depend on y_{t-1} in addition to its independent variables
- Autocorrelated errors: the residuals e_t are correlated
- Often relevant for time-series data
30. Lagged values
- In this case, we may run a multiple regression just as before, but including the previous dependent variable y_{t-1} as a predictor variable for y_t
- Use the model y_t = β0 + β1·x_1t + φ·y_{t-1} + ε_t
- A 1-unit increase in x1 in the first time period yields an expected increase in y of β1, an increase of β1·φ in the second period, β1·φ² in the third period, and so on
- The total expected increase over all future periods is β1/(1-φ) (see the derivation below)
31. Example: Pension funds (from textbook CD)
- Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
- Have data for 25 yrs -> 24 observations
32. Get the model
- y_t = 1.397 + 0.235·(stock return) + 0.954·y_{t-1}
- A one million increase in stock return in one year yields a 0.24 increase in pension fund portfolios at market value
- For the next year: 0.235·0.954 ≈ 0.22
- And the third year: 0.235·0.954² ≈ 0.21
- For all future: 0.235/(1-0.954) ≈ 5.1
- What if you have a 2 million increase?
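A quick check of these numbers in Python:

```python
# Checking the slide's arithmetic for the pension fund model.
beta1, phi = 0.235, 0.954
print(beta1 * phi)         # next year: ~0.22
print(beta1 * phi ** 2)    # third year: ~0.21
print(beta1 / (1 - phi))   # all future: ~5.1
# By linearity, a 2 million increase simply doubles each of these effects.
```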
33. Autocorrelated errors
- In the standard regression model, the errors are independent
- Using standard regression formulas anyway can lead to errors: typically, the uncertainty in the result is underestimated
34. Autocorrelation: how to detect it?
- Plot the residuals against time!
- The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model
- Option in SPSS
- The test depends on K (no. of independent variables), n (no. of observations) and the significance level α
- Test H0: ρ = 0 vs. H1: ρ > 0. Reject H0 if d < dL; accept H0 if d > dU; inconclusive if dL < d < dU
- Test statistic: d = Σ_{t=2..n} (e_t - e_{t-1})² / Σ_{t=1..n} e_t²
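A minimal numpy sketch of the Durbin-Watson statistic, computed from the regression residuals e_1, ..., e_n:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d: near 2 for independent errors,
    small for positive autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```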
35. Example: Pension funds
- Want to test ρ = 0 at the 5% level
- Test statistic d = 1.008
- Have one independent variable (K = 1 in Table 12 on p. 876) and n = 24
- Find critical values dL = 1.27 and dU = 1.45
- Since d = 1.008 < dL, reject H0
36. Autocorrelation: what to do?
- It is possible to use a two-stage regression procedure
- If a first-order autoregressive model with parameter ρ is appropriate, the model y_t - ρ·y_{t-1} = β0(1-ρ) + β1·(x_1t - ρ·x_{1,t-1}) + u_t will have uncorrelated errors
- Estimate ρ from the Durbin-Watson statistic (the usual relation is ρ ≈ 1 - d/2), and estimate the βs from the model above (sketched below)
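A hedged numpy sketch of the two-stage idea for a single x variable; this is the standard Cochrane-Orcutt-style transformation, and the function name is illustrative:

```python
import numpy as np

def two_stage(y, x, d):
    """Quasi-difference y and x using rho estimated from the
    Durbin-Watson statistic d, then rerun ordinary least squares."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    rho = 1.0 - d / 2.0                    # rho-hat from d ~ 2(1 - rho)
    y_star = y[1:] - rho * y[:-1]          # y_t - rho * y_{t-1}
    x_star = x[1:] - rho * x[:-1]          # x_t - rho * x_{t-1}
    A = np.column_stack([np.ones(len(y_star)), x_star])
    beta, *_ = np.linalg.lstsq(A, y_star, rcond=None)
    b0 = beta[0] / (1.0 - rho)             # recover the original intercept
    return b0, beta[1], rho
```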
37. Next time
- What if the assumption of normality for your data is invalid?
- You have to forget all you have learnt so far, and do something else:
- Non-parametric statistics