Title: Model Building, Estimation
1. Model Building, Estimation and Prediction
Topics:
- Motivational Example
- Multicollinearity (Redundancy)
- Avoiding Multicollinearity
- Stepwise Procedure
- Some Caveats
- Estimation and Prediction with Regression Models
2. Problem Scenario
- A real estate agent wanted to develop a model to
predict the selling price of a home. The agent
believed that the most important variables in
determining the price of a house are its size,
number of bedrooms, and lot size. Accordingly, he
gathered relevant data on a random sample of 100
recently sold homes.
3. Problem Scenario: Regression Model
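As a sketch, a model for this scenario could be written as follows. The variable names H_Size, Lot_Sz, and Bedrooms are the ones used in the avoidance example of this deck; the β and ε notation is an assumption.

```latex
\text{Price} = \beta_0 + \beta_1\,\text{H\_Size} + \beta_2\,\text{Lot\_Sz} + \beta_3\,\text{Bedrooms} + \varepsilon
```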
4. Contradicting Conclusions from Regression Output
- Global F-test is highly significant
  - At least one explanatory variable is useful in
    predicting home price
- All t-tests indicate none of the explanatory
  variables have significant marginal contribution
- Home price decreases as lot size increases
5. What's Going On Here?
- High correlation among X-variables leads to
  redundancy
6. Multicollinearity
- Exists when any X-variable can be expressed
  (exactly or nearly) as a linear combination of
  the other X-variables; this is the root cause
  - E.g. X1 = 0.9 X2, or X3 = 0.4 X2 + 0.3 X1
- Another symptom is unstable coefficient
  estimates from one sample to the next (high
  variance of b_i)
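The instability of the coefficient estimates can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the home-price data: the helper names `ols_coefs` and `slope_spread`, the true coefficients, and the noise levels are all assumptions chosen for illustration. It compares how much the estimated slope on X1 varies across repeated samples when X1 is almost 0.9 X2 versus when X1 and X2 are unrelated.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients for y = b0 + X b (intercept column added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs

def slope_spread(collinear, n=100, n_samples=200, seed=0):
    """Standard deviation of the estimated X1 slope across repeated samples."""
    rng = np.random.default_rng(seed)
    b1 = []
    for _ in range(n_samples):
        x2 = rng.normal(size=n)
        if collinear:
            # X1 is almost an exact multiple of X2 (hypothetical noise level)
            x1 = 0.9 * x2 + 0.05 * rng.normal(size=n)
        else:
            x1 = rng.normal(size=n)
        # Hypothetical true model used only to generate data
        y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
        b1.append(ols_coefs(np.column_stack([x1, x2]), y)[1])
    return float(np.std(b1))

spread_collinear = slope_spread(collinear=True)
spread_independent = slope_spread(collinear=False)
print("spread of b1, collinear X's:  ", spread_collinear)
print("spread of b1, independent X's:", spread_independent)
```

Under these assumptions the spread of b1 should be far larger in the collinear case, which is the "high variance of b_i" symptom.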
7. Implications of Multicollinearity
- Interpretation of regression coefficients not
  meaningful in presence of multiC
- Is often a problem in models that contain
  interaction terms and quadratic terms
  - E.g. X1, X2, X1X2, X1^2, X2^2
- Does not affect predictions
- Significant F-test with one or more significant
  t-tests does NOT imply multiC
8. Avoiding Multicollinearity
- If chief purpose of the model is to predict Y
  with no interest in interpreting relationships,
  ignore multiC
- Inspect correlation matrix and choose X variables
  that have highest correlation with Y but low
  correlation with other Xs
- Eliminate insignificant X variables from model in
  stepwise fashion
9. Avoidance Example
- H_Size has highest correlation with Price
- H_Size also highly correlated with Lot_Sz and
  Bedrooms
- Choose 1-variable model with H_Size only
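The correlation-matrix inspection can be sketched as follows. The data here are synthetic stand-ins generated for illustration (the deck's actual sample of 100 homes is not reproduced), and every numeric setting is an assumption; only the variable names come from the slides.

```python
import numpy as np

# Synthetic stand-in data (assumed values), n = 100 as in the scenario
rng = np.random.default_rng(1)
n = 100
h_size = rng.normal(2000, 400, n)                 # house size
lot_sz = 3 * h_size + rng.normal(0, 400, n)       # lot size tracks house size
bedrooms = h_size / 600 + rng.normal(0, 0.3, n)   # bedrooms track house size
price = 100 * h_size + rng.normal(0, 5000, n)     # price driven by house size

names = ["Price", "H_Size", "Lot_Sz", "Bedrooms"]
corr = np.corrcoef([price, h_size, lot_sz, bedrooms])  # rows are variables

# Correlation of each X with Price (row 0, skipping Price itself)
corr_with_price = dict(zip(names[1:], corr[0, 1:]))
best = max(corr_with_price, key=lambda k: abs(corr_with_price[k]))

print(np.round(corr, 2))
print("X most correlated with Price:", best)
print("corr(H_Size, Lot_Sz):", round(float(corr[1, 2]), 2))
```

With data built this way, H_Size has the highest correlation with Price and is also highly correlated with the other X's, so the one-variable model is the natural choice.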
10. Manual Stepwise Regression Review
1. Include all reasonable potential X variables in
   the model
2. If all t-stats > 2 (in absolute value) or all
   p-values < .05, stop and accept the model
3. Else, drop the X variable with the lowest t-stat
   or highest p-value and rerun the regression
4. Go back to step 2 and continue until all t-stats
   > 2 or all p-values < .05
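The manual procedure above can be sketched in a few lines. This is an illustrative sketch only: it uses the |t-stat| > 2 rule, the function names `ols_t_stats` and `backward_eliminate` are hypothetical, and the data are synthetic (x2 is pure noise by construction).

```python
import numpy as np

def ols_t_stats(X, y):
    """t-statistics of the slopes (an intercept is fitted but not reported)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    s2 = resid @ resid / (n - A.shape[1])          # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return (coefs / se)[1:]

def backward_eliminate(X, y, names, t_cut=2.0):
    """Drop the weakest variable and refit until every |t| exceeds t_cut."""
    keep = list(range(X.shape[1]))
    while keep:
        t = np.abs(ols_t_stats(X[:, keep], y))
        weakest = int(np.argmin(t))
        if t[weakest] > t_cut:
            break                                  # step 2: accept the model
        del keep[weakest]                          # step 3: drop and rerun
    return [names[j] for j in keep]

# Synthetic example: y depends on x0 and x1 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)
selected = backward_eliminate(X, y, ["x0", "x1", "x2"])
print(selected)
```

Refitting after each single deletion matters: once one redundant variable is removed, the t-stats of the remaining variables can change substantially.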
11. Automatic Stepwise Regression
- Is a search process that adds or deletes
  variables at each step until no changes can
  improve the model
- Three variants available
  - Forward
  - Backward
  - General
12. Variants of Stepwise Regression
- Forward procedure begins with no explanatory
  variables in the model and successively adds one
  at a time until no new explanatory variable makes
  a significant contribution
- Typical criterion to enter: p-value < .05
13. Variants of Stepwise Regression
- Backward procedure begins with all potential
  explanatory variables in the model and deletes
  them one at a time until further deletion would
  do more harm than good
- Typical criterion to leave: p-value > .05
14. Variants of Stepwise Regression
- General procedure is much like the Forward
  variant, but a variable that enters the model
  could be deleted in a later step
- Typical criterion to enter: p-value < .05
- Typical criterion to leave: p-value > .10
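The general variant can be sketched as below. This is not how StatTools implements it; it is a simplified illustration under stated assumptions: p-values use a normal approximation to the t distribution (reasonable only for large samples), the function names `slope_p_values` and `general_stepwise` are hypothetical, and the data are synthetic.

```python
import math
import numpy as np

def slope_p_values(X, y):
    """Two-sided p-values for the slopes, via a normal approximation (assumption)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    s2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    t = coefs[1:] / se[1:]
    return [math.erfc(abs(ti) / math.sqrt(2)) for ti in t]

def general_stepwise(X, y, names, p_enter=0.05, p_leave=0.10, max_steps=50):
    inside = []                                    # indices currently in the model
    for _ in range(max_steps):                     # guard against cycling
        changed = False
        # Entry step: add the outside variable with the smallest p-value below p_enter
        best_j, best_p = None, p_enter
        for j in range(X.shape[1]):
            if j in inside:
                continue
            p = slope_p_values(X[:, inside + [j]], y)[-1]
            if p < best_p:
                best_j, best_p = j, p
        if best_j is not None:
            inside.append(best_j)
            changed = True
        # Removal step: delete the weakest inside variable if its p-value exceeds p_leave
        if inside:
            p_in = slope_p_values(X[:, inside], y)
            worst = max(range(len(inside)), key=lambda i: p_in[i])
            if p_in[worst] > p_leave:
                del inside[worst]
                changed = True
        if not changed:
            break
    return [names[j] for j in inside]

# Synthetic example: y depends on x0 and x1 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)
chosen = general_stepwise(X, y, ["x0", "x1", "x2"])
print(chosen)
```

In this sketch, setting `p_leave=1.0` disables removal, which reduces the loop to the Forward procedure.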
15. Application of General Stepwise: Predicting
Monthly Rent Payment
16. Caveats with Stepwise
- Do NOT use automatic procedure mindlessly. Can
  get nonsensical models
- Does NOT guarantee best model
- Regard number of X variables given in final model
  as a guide to how many should be your target.
  Check for glaring omissions or inclusions
17. Running Stepwise Regression in StatTools
- Name the data set in the usual way
- Place the cursor anywhere in the spreadsheet and
  click on the Regression and Classification icon
  (3rd from right)
- Select Regression, then arrow down to highlight
  the regression type: forward, backward, or
  (general) stepwise
18. Running Stepwise Regression in StatTools
- Click to insert a check mark in the box next
  to the Y variable (D) and the X-variables (I)
- Click to insert a check mark in the box next
  to the advanced option "Include detailed step
  information"
- Accept the default radio button "Use p-values",
  adjust the criteria to enter or leave (if needed),
  then click OK
19. Prediction Interval for a New Individual
Observation
- Best point estimate for response variable (Y)
  when explanatory variables take on given values
  is given by plugging given values into final
  regression equation
20. Prediction Interval for a New Individual
Observation
- Account for sampling variation by expressing the
  prediction as an interval
21. Prediction Interval for a New Individual
Observation
- t_mult value depends on level of confidence
- Typically 95% level employed
- Lower and upper prediction limits available in
  StatTools
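The interval has the usual form "point estimate plus or minus a multiplier times a standard error". As a sketch in standard least-squares notation (the symbols below are assumptions, not taken from the slides: x_0 is the vector of given X values with a leading 1, X is the design matrix, s_e is the standard error of estimate, and t_mult comes from a t distribution with n - k - 1 degrees of freedom for k explanatory variables):

```latex
\hat{Y}_0 \;\pm\; t_{\text{mult}} \; s_e \sqrt{\,1 + \mathbf{x}_0^{\top} (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{x}_0\,}
```

For a large sample and X values near the center of the data, the square root is close to 1, so the interval is roughly \hat{Y}_0 \pm t_{\text{mult}} \, s_e.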
22. Example: Prediction Interval for New Individual
Obs.
- A particular family not included in the study has
  the following description
  - Family size 5, located in NE
23. Example: Prediction Interval for New Individual
Obs.
  - Rents home, first wage earner $50K
  - Second wage earner $20K
  - Avg. monthly util. $200, total debt $5K
24. StatTools Prediction Interval for New Obs.
- With 95% confidence, we predict that a new
  family fitting this description will pay between
  $442 and $1,409 per month in rent
25. Confidence Interval for Mean Obs. with Given
Characteristic
- Best point estimate for response variable (Y) is
  the same as for the prediction interval: plug
  the given values into the final regression
  equation
26. Confidence Interval for Mean Obs. with Given
Characteristic
- Sampling variation is smaller for estimating a
  mean than for predicting an individual new
  observation
27. Confidence Interval Formula for Mean Obs.
- Lower and upper confidence limits NOT available
  in StatTools
- Easy hand calculation in Excel
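The same hand calculation can be sketched outside Excel. The example below is an illustration on synthetic data, not the rent data: the function name `intervals`, the generated values, and the multiplier 1.97 (roughly the 95% t multiplier for about 200 degrees of freedom) are assumptions. It computes the confidence interval for the mean and, for comparison, the prediction interval for an individual observation at the same X values.

```python
import numpy as np

def intervals(X, y, x_new, t_mult):
    """Return (y_hat, CI for the mean, PI for a new individual) at x_new."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    a0 = np.concatenate([[1.0], x_new])            # given X values with leading 1
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    s_e = np.sqrt(resid @ resid / (n - A.shape[1]))  # standard error of estimate
    h = a0 @ np.linalg.inv(A.T @ A) @ a0
    y_hat = float(a0 @ coefs)
    half_mean = float(t_mult * s_e * np.sqrt(h))       # mean: no "1 +" term
    half_pred = float(t_mult * s_e * np.sqrt(1 + h))   # individual: adds "1 +"
    ci = (y_hat - half_mean, y_hat + half_mean)
    pi = (y_hat - half_pred, y_hat + half_pred)
    return y_hat, ci, pi

# Synthetic data (assumed values) with two explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 10 + 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=3.0, size=200)

y_hat, ci, pi = intervals(X, y, np.array([0.5, -0.5]), t_mult=1.97)
print("point estimate:", round(y_hat, 2))
print("95% CI for the mean:", ci)
print("95% PI for an individual:", pi)
```

Both intervals are centered on the same point estimate; the confidence interval for the mean is the narrower of the two because it omits the variation of an individual observation around the mean.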
28. Example: Confidence Interval for Mean of All Obs.
- Estimate mean monthly payment for all families
  having the following description
  - Family size 5, located in NE
29. Example: Confidence Interval for Mean of All Obs.
  - Rents home, first wage earner $50K
  - Second wage earner $20K
  - Avg. monthly util. $200, total debt $5K
30. Excel Calculation of Confidence Interval for Mean
Obs.
- From StatTools, Ŷ = 925.35
31. Excel Calculation of Confidence Interval for Mean
Obs.
- We are 95% confident that families fitting this
  description will pay on average between $905 and
  $947 per month in rent
32. Concluding Remarks about Use of Regression Model
- Some useful variables were found to predict and
  estimate monthly rent/mortgage payments
- R² of 53% implies there might be other useful
  variables not yet considered
33. Concluding Remarks about Use of Regression Model
- Wide prediction interval due to large s_e
- Need to investigate assumption violations