Ch 9: Regression Wisdom Sifting Residuals for Groups

Provided by: Addison6

Transcript and Presenter's Notes

1
Ch 9 Regression Wisdom Sifting Residuals for
Groups

No regression analysis is complete without a
display of the residuals to check that the linear
model is reasonable.
Residuals often reveal subtleties that were not
clear from a plot of the original data.
Sometimes the subtleties we see are additional
details that help confirm or refine our
understanding.
Sometimes they reveal violations of the
regression conditions that require our attention.
This chapter looks at some of the things that can
lead to problems in regression models and, when
possible, how to react.

2
Regression Sanity Check

Before you plug-and-chug with your regression and
write the answer down, check to see if it makes
sense.
Are you sure you are using the right units?
Regression models with coefficients expressed in
thousands of dollars will produce goofy results
if you use plain dollars instead. The same goes
for using monthly sales when the model was built
assuming weekly sales.
Will fractional predictions make sense in the
context? 14.5 houses, 10.3 people, etc.
What do we do if our model predicts a negative
number?
We will look at extrapolation and leverage
shortly.

3
Example: Sanity Check

Arnie has been given a model that predicts his
weekend car sales at Arnie's Autos based on the
interest rate he can offer for financing:
Cars-sold = 50 - 570 × rate
Will the model results change if we interpret a
6% rate to be 6 versus .06?
What is the numerical answer? Does it make sense
from a practical standpoint? What do we need to
do first?
How many cars are we predicted to sell if the
interest rate goes up to 9%? What does this
really mean?
At what interest rate (and above) would Arnie
expect to see no sales? How do we find this rate?
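
The questions above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the model expects the rate as a proportion (.06 for 6%); the slide leaves that choice open, and the function name is hypothetical.

```python
def predict_cars_sold(rate):
    """Arnie's model from the slide: Cars-sold = 50 - 570 * rate."""
    return 50 - 570 * rate

# Same 6% rate, entered two different ways.
as_proportion = predict_cars_sold(0.06)  # rate as a proportion
as_percent = predict_cars_sold(6)        # rate as a whole-number percent

# Prediction at 9%, assuming the proportion form is the intended one.
at_nine_percent = predict_cars_sold(0.09)

# Predicted sales reach zero where 50 - 570 * rate = 0.
break_even_rate = 50 / 570

print(f"6% entered as .06: {as_proportion:.1f} cars")
print(f"6% entered as 6:   {as_percent:.1f} cars")
print(f"9% entered as .09: {at_nine_percent:.1f} cars")
print(f"No sales at or above about {break_even_rate * 100:.2f}%")
```

Entering 6 instead of .06 gives a prediction thousands of cars below zero, which is the kind of result the sanity check is meant to catch. The 9% prediction is slightly negative, so in practice it reads as "no sales".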

5
Sifting Residuals for Groups - the Tools!

It is smart to look at both a histogram of the
residuals and a scatter-plot of the residuals vs.
predicted values.
Using the residuals from Chapter 8's cereal model,
the small modes in the histogram are marked with
different colors and symbols in the residual plot
above. What do you see?
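
As a rough sketch of the two tools, the code below fits a line, computes the residuals, and prints a text histogram plus the (predicted, residual) pairs a residual plot would show. The sugar and calorie numbers are hypothetical stand-ins, not the textbook's cereal data, and `fit_line` is a helper written for this illustration.

```python
from collections import Counter

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical sugar (g) and calorie values for eight cereals.
sugar = [1, 3, 5, 6, 8, 10, 12, 14]
calories = [70, 90, 100, 110, 110, 120, 150, 140]

intercept, slope = fit_line(sugar, calories)
predicted = [intercept + slope * x for x in sugar]
residuals = [y - p for y, p in zip(calories, predicted)]

# Text histogram of the residuals, in bins 5 calories wide.
bin_width = 5
counts = Counter(int(r // bin_width) * bin_width for r in residuals)
for left in sorted(counts):
    print(f"[{left:>4}, {left + bin_width:>4}) {'*' * counts[left]}")

# The pairs a residuals vs. predicted values plot would display.
for p, r in zip(predicted, residuals):
    print(f"predicted={p:7.2f}  residual={r:7.2f}")
```

Separate clumps in the histogram are a hint to go back to the residual plot and see which cases they are.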

Figures: Residual Plot for Predicting Calories Given Sugar; Histogram of Residuals for Predicting Calories Given Sugar

6
Subsets

An important, often unstated condition for
fitting models is that all the data must come
from the same group.
An examination of residuals often leads us to
discover groups of observations that are
different from the rest.
When we discover that there is more than one
group in a regression, we can:
Build models for each of the subgroups separately.
Work within the confines of a single model but
note the condition.
Use advanced techniques, such as multiple
regression, which are beyond the scope of DS212.
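
The first option, a separate model per subgroup, can be sketched as follows. The group labels and data points are hypothetical and were built to lie exactly on two different lines so the result is easy to check; `fit_line` is a helper written for this illustration.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical (group, x, y) records from two groups with different trends.
records = [
    ("group A", 2, 104), ("group A", 6, 112), ("group A", 10, 120),
    ("group B", 3, 92), ("group B", 7, 108), ("group B", 12, 128),
]

# Split the records by group, then fit one line per group.
by_group = defaultdict(list)
for group, x, y in records:
    by_group[group].append((x, y))

models = {}
for group, points in by_group.items():
    xs, ys = zip(*points)
    models[group] = fit_line(xs, ys)

for group, (intercept, slope) in models.items():
    print(f"{group}: y = {intercept:.1f} + {slope:.1f} * x")
```

A single line fit to all six points would describe neither group well, which is why the residuals are the place to look for subsets.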

7
Subsets Example

Accepted industry wisdom in the breakfast cereal
business is that different cereals are targeted
at different consumer groups and thus placed on
different shelves.
Figure 9.3 from the text shows regression lines
fit to calories and sugar for each of the three
cereal shelves in a supermarket (shelf 1 is
the lowest shelf).

Figure: Predicted Calories by Sugar Content, Given Shelf Level

8
Another Typical Problem: Getting the Bends

Linear regression only works for linear models.
(That sounds obvious, but when you fit a
regression, you can't take it for granted.)
A curved relationship between two variables might
not be apparent when looking at a scatter-plot
alone, but will be more obvious in a plot of the
residuals.
Remember, we want to see nothing in a plot of
the residuals.
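
To illustrate, here is a small sketch that fits a straight line to deliberately curved, hypothetical data (y = x squared) and prints the residuals. The helper `fit_line` is written for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical curved relationship: y grows with the square of x.
xs = list(range(1, 10))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# A bent pattern shows up as runs of same-signed residuals.
for x, r in zip(xs, residuals):
    sign = "+" if r > 0 else "-"
    print(f"x={x}  residual={r:7.2f}  {sign}")
```

The residuals are positive at both ends and negative in the middle. That U shape is the "something" we do not want to see in a residual plot.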

9
Getting the Bends - Example

The curved relationship between fuel efficiency
and weight is more apparent in the plot of the
residuals than in the original scatter-plot.

Figures: Predicting a Car's Fuel Efficiency Given Weight; Residuals Plot from Predicting Fuel Efficiency Given Weight

10
Extrapolation: Reaching Beyond the Data

Linear models give a predicted value for each
case in the data.
We cannot assume that a linear relationship in
the data exists much beyond the range of the
data.
Once we venture into new x territory, such a
prediction is called an extrapolation.
Extrapolations are dubious because they require
the additional (and very questionable) assumption
that nothing about the relationship between x and
y changes even at extreme values of x.
Extrapolations can get you into trouble. You are
better off not making extrapolations, especially
distant ones.
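
One simple habit is to flag any prediction whose x-value falls outside the range used to fit the model. The sketch below assumes a hypothetical model and data range; none of the numbers come from the slides.

```python
def predict_with_range_check(x, intercept, slope, x_min, x_max):
    """Return the prediction and whether x lies outside the fitted range."""
    prediction = intercept + slope * x
    is_extrapolation = x < x_min or x > x_max
    return prediction, is_extrapolation

# Hypothetical model fit on x-values between 10 and 50.
INTERCEPT, SLOPE = 5.0, 2.0
X_MIN, X_MAX = 10, 50

for x in (30, 50, 80):
    y_hat, flagged = predict_with_range_check(x, INTERCEPT, SLOPE, X_MIN, X_MAX)
    note = "EXTRAPOLATION: treat with suspicion" if flagged else "within data range"
    print(f"x={x}: predicted {y_hat:.1f} ({note})")
```

The check does not make an extrapolation any safer. It only makes sure nobody reports one without noticing.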

11
Predicting the Future

Extrapolation is always dangerous. But, when the
x-variable in the model is time, extrapolation
becomes an attempt to peer into the future.
Here's some more realistic advice: if you must
extrapolate into the distant future, at least
don't believe that the prediction will come true.

12
Extrapolation (cont.)

A regression of mean age at first marriage for
men vs. year fit to the first few decades of the
20th century does not hold for later years.

13
Outliers, Leverage, and Influence

Outlying points can strongly influence a
regression. Even a single point far from the body
of the data can dominate the analysis.
Any point that stands away from the others can be
called an outlier and deserves your special
attention.

14
Outliers, Leverage, and Influence: An Example

The following scatter-plot suggests that
something was awry in Palm Beach County, Florida,
during the 2000 presidential election.

Figure: Predicting Votes for Buchanan Given Votes for Nader, Florida Counties in 2000

15
Outliers Example Continued

The red line shows the effects that one unusual
point can have on a regression.
The R² for the original regression was 43%.
Re-running the regression without Palm Beach
gives an R² of 82%.
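
The same with-and-without comparison can be sketched on made-up numbers. The data below are hypothetical and are not the Florida vote counts; `r_squared` is a helper written for this illustration.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical points that follow a clear linear trend...
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5, 14.5, 15.5]

# ...plus one hypothetical point far above the trend.
outlier_x, outlier_y = 5, 40

r2_with = r_squared(xs + [outlier_x], ys + [outlier_y])
r2_without = r_squared(xs, ys)

print(f"R-squared with the unusual point:    {r2_with:.1%}")
print(f"R-squared without the unusual point: {r2_without:.1%}")
```

Reporting both numbers side by side, as the slide does, is more informative than quietly choosing one.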

Figure: Predicting Votes for Buchanan Given Votes for Nader, 2 Different Regression Lines

16
Outliers, Leverage, and Influence (cont.)

The linear model doesn't fit points with large
residuals very well.
Because they seem to be different from the other
cases, it is important to pay special attention
to points with large residuals.
A data point can also be unusual if its x-value
is far from the mean of the x-values. Such points
are said to have high leverage.
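
Leverage can be put on a numeric scale. The slides do not give a formula; the sketch below uses the standard one for a regression with a single predictor, h = 1/n + (x - mean of x)² / sum of squared deviations of x, on hypothetical x-values.

```python
def leverages(xs):
    """Leverage of each point in a one-predictor regression with intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# Hypothetical x-values: five clustered together and one far away.
xs = [1, 2, 3, 4, 5, 20]
hs = leverages(xs)

for x, h in zip(xs, hs):
    print(f"x={x:>2}  leverage={h:.3f}")

# Leverage depends only on x, so y-values are not needed here.
most_leveraged_x = xs[hs.index(max(hs))]
print(f"Highest leverage at x={most_leveraged_x}")
```

The point at x = 20 gets most of the leverage because it sits far from the mean of the x-values, which matches the definition on the slide.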

17
Outliers, Leverage, and Influence (cont.)

A point with high leverage has the potential to
change the regression line.
We say that a point is influential if omitting it
from the analysis gives a very different model.
Bozo's effect on the Comedians' IQ vs. Shoe Size
model

Figure: Predicting IQ Given Shoe Size, 2 Models Wholly Dependent on 1 Outlier with Leverage

18
Outliers, Leverage, and Influence (cont.)

When we investigate an unusual point, we often
learn more about the situation than we could have
learned from the model alone.
You cannot simply delete unusual points from the
data. You can, however, fit a model with and
without these points as long as you examine and
discuss the two regression models to understand
how they differ.

19
Outliers, Leverage, and Influence (cont.)

Warning
Influential points can hide in plots of
residuals.
Points with high leverage pull the line close to
them (as per the IQ vs. shoe size example), so
they often have small residuals.
You'll see influential points more easily in
scatter-plots of the original data or by finding
a regression model with and without the points.
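
The warning can be demonstrated with a small sketch in the spirit of the IQ vs. shoe size example. All numbers are hypothetical, including the "Bozo" point, and `fit_line` is a helper written for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical shoe sizes and IQs with no real trend...
shoe = [8, 9, 10, 11, 12]
iq = [105, 95, 100, 108, 92]
# ...plus one hypothetical high-leverage point.
bozo_shoe, bozo_iq = 22, 150

b0_all, b1_all = fit_line(shoe + [bozo_shoe], iq + [bozo_iq])
b0_rest, b1_rest = fit_line(shoe, iq)

# Residuals from the model that includes every point.
bozo_residual = bozo_iq - (b0_all + b1_all * bozo_shoe)
other_residuals = [y - (b0_all + b1_all * x) for x, y in zip(shoe, iq)]

print(f"Slope with Bozo:    {b1_all:.2f}")
print(f"Slope without Bozo: {b1_rest:.2f}")
print(f"Bozo's residual:    {bozo_residual:.2f}")
print(f"Largest other residual: {max(abs(r) for r in other_residuals):.2f}")
```

The slope changes sign when the point is dropped, yet the point's own residual is smaller than some of the ordinary ones, so a residual plot alone would not single it out.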

20
Lurking Variables and Causation

No matter how strong the association, no matter
how large the R² value, no matter how straight
the line, there is no way to conclude from a
regression alone that one variable causes the
other.
There's always the possibility that some third
variable is driving both of the variables you
have observed.
With observational data, as opposed to data from
a designed experiment, there is no way to be sure
that a lurking variable is not the cause of any
apparent association.

21
Lurking Variables and Causation: An Example

The following scatter-plot shows that the average
life expectancy for a country is related to the
number of doctors per person in that country.
We could come up with all sorts of reasonable
explanations justifying this, but...

22
Lurking Variables and Causation: Another Example

This new scatter-plot shows that the average life
expectancy for a country is also related to the
number of televisions per person in that country.
And the relationship is even stronger: R² of 72%
instead of 62%.
Since TVs are cheaper than doctors, why don't we
send TVs to countries with low life expectancies
in order to extend lifetimes? Right?

23
Lurking Variables and Causation: An Example

How about considering a lurking variable? That
makes more sense.
Countries with higher standards of living have
both longer life expectancies and more doctors
(and TVs!).
If higher living standards cause changes in these
other variables, improving living standards might
be expected to prolong lives and, incidentally,
also increase the numbers of doctors and TVs.
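
A small simulation can show how a lurking variable produces a strong association between two variables that have no direct link. Everything here is hypothetical: the "living standard" scale, the noise levels, and the seed are arbitrary choices for illustration, not country data.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical lurking variable: a living-standard score per country.
living_standard = [rng.uniform(0, 10) for _ in range(200)]

# TVs and life expectancy each depend on living standard plus noise.
# Neither one is computed from the other.
tvs = [z + rng.uniform(-1, 1) for z in living_standard]
life_expectancy = [z + rng.uniform(-1, 1) for z in living_standard]

r = correlation(tvs, life_expectancy)
print(f"Correlation between TVs and life expectancy: {r:.2f}")
```

In this setup, shipping TVs would change nothing about life expectancy, because the code never uses one to generate the other. The correlation comes entirely from the shared driver.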

24
Working With Summary Values

Scatter-plots of statistics summarized over
groups tend to show less variability than we
would see if we measured the same variable on
individuals.
If instead of data on individuals we only had the
mean weight for each height value, we would see
an even stronger association.

Figures: Mean Weight Given Mean Height; Weight Given Height for a Survey of Men

25
Working With Summary Values (cont.)

Means vary less than individual values.
We will study this effect later in DS212.
Therefore, scatter-plots of summary statistics
show less scatter than the baseline data on
individuals.
This can give a false impression of how well a
line summarizes the data.
There is no simple correction for this
phenomenon.
Once we reduce our data to just summary data,
there's no simple way to get the original values
back.
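
The effect is easy to reproduce. The sketch below builds hypothetical height and weight data with a fixed spread of weights at each height, then compares the correlation for individuals with the correlation for the per-height means. The data are constructed so the means fall exactly on a line, which exaggerates the effect for clarity.

```python
import math
from collections import defaultdict

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical individuals: five weights (lb) at each height (in).
heights = range(64, 75)
offsets = [-30, -15, 0, 15, 30]
individuals = [(h, 5 * h - 180 + d) for h in heights for d in offsets]

# Reduce the data to one mean weight per height.
weights_by_height = defaultdict(list)
for h, w in individuals:
    weights_by_height[h].append(w)
means = [(h, sum(ws) / len(ws)) for h, ws in sorted(weights_by_height.items())]

r_individuals = correlation(*zip(*individuals))
r_means = correlation(*zip(*means))

print(f"Correlation for individuals: {r_individuals:.2f}")
print(f"Correlation for mean weight per height: {r_means:.2f}")
```

The means hide the person-to-person spread, so the line looks like a far better summary than it is for any individual.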

26
What Can Go Wrong?

Make sure the relationship is linear.
Check the Straight Enough condition.
Be on guard for different groups in your
regression.
If you find subsets that behave differently,
consider fitting different models for each
subset.
Beware of extrapolation, especially into the
future!
Look for unusual points.
Unusual points always deserve attention and
may well reveal more about your data than the
rest of the points combined.
Beware of high leverage points, and especially
those that are influential.
Consider fitting two regressions, one with the
extraordinary points and one without, and then
compare the results. But
don't just arbitrarily remove unusual points to
get a model that fits better.
Beware of lurking variables: don't assume that
association is causation.
Watch out when dealing with data that are
summaries.
Summary data inflate the impression of the
relationship strength.

27
Example

Pr 19. Federal employees with the authority to
carry firearms and make arrests are sometimes
assaulted, and even hurt or killed. The table
below shows these rates for the 5 years 1995-9.
Can assault rates be used to accurately predict
injury or death?
What do we do first?

28
Example

Our scatter-plot shows a significant outlier.
Do you think that this outlier has significant
leverage?

29
Example

Does it look like the outlier has a large
residual?
Does it look like the outlier has a lot of
leverage?
Does it look like the model is a decent
predictor?
What might we consider doing next?

30
Example: Reanalyzing without the Outlier

Removing the one outlier dramatically changes the
model results. Now it appears our model is
worthless.
Was the model a better predictor when we had the
NPS outlier?

31
Example: Old Test Question

A sales associate at Northstar's SkiSports shop
is trying to predict sales of their ski
bootwarmers given the average daily temperature
(measured in °F). Excel's regression wizard
produces the following summary output.

32
Example: Old Test Question

What is the correlation for this model?
What % of the variation in the sales of
bootwarmers is NOT explained by temperature?
Write the equation for this linear regression:
Sales = ___________
How many bootwarmers are predicted to sell if the
temperature is 10°F? (Since we can't sell
fractional bootwarmers, use normal rounding!)
If we sold 14 bootwarmers on a 10°F day, the
model over / under predicted sales.
Calculate the residual for that day. (Leave the
prediction in non-integer form.)
Based on model results, for what temperature
RANGE can we interpret that we are unlikely to
sell any bootwarmers?

33
Example: Old Test Question

Justin is writing a paper for his Human
Resources class on compensation for computer
programmers. He surveyed 11 programmers who
have given truthful answers about how much they
earn annually and their age. He used Excel to
perform a regression and generated the chart
below and determined the following values:
(intercept) b0 = 55, (slope) b1 = -.60, and
R² = .05.

34
Example: Old Test Question

What is the correlation of the model? Does age
appear to be a good predictor of salary in this
situation? Explain why in statistical terms.
Use this model to predict the salary of a
30-year-old computer programmer: $___K/yr
Re-examining the graph, Justin realizes there is
a significant outlier which has
(Circle 1) High / Low leverage and also a
(Circle 1) High / Low residual error.
When he removes this point from his data set and
re-runs the regression he gets the following
values:
(intercept) b0 = 30, (slope) b1 = .90, and
R² = .27.
What is the correlation of the revised model?
Does the revised model appear to be a better
predictor?
Use the revised model to predict the 30-year-old
programmer's salary.
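
The arithmetic for both models can be sketched as below. It relies on the fact that, with one predictor, the correlation is the square root of R² carrying the sign of the slope. It also assumes salary is measured in thousands of dollars per year and age in years, which the slides imply but do not state. The function names are hypothetical.

```python
import math

def correlation_from_fit(r_squared, slope):
    """Correlation for a one-predictor model: sqrt(R^2) with the slope's sign."""
    return math.copysign(math.sqrt(r_squared), slope)

def predict_salary(b0, b1, age):
    """Predicted salary (assumed $K/yr) from intercept b0 and slope b1."""
    return b0 + b1 * age

# Values given on the slides for the original and revised models.
original = {"b0": 55, "b1": -0.60, "r2": 0.05}
revised = {"b0": 30, "b1": 0.90, "r2": 0.27}

r_original = correlation_from_fit(original["r2"], original["b1"])
r_revised = correlation_from_fit(revised["r2"], revised["b1"])
salary_original = predict_salary(original["b0"], original["b1"], 30)
salary_revised = predict_salary(revised["b0"], revised["b1"], 30)

print(f"Original model: r = {r_original:.2f}, age 30 -> {salary_original:.0f}K/yr")
print(f"Revised model:  r = {r_revised:.2f}, age 30 -> {salary_revised:.0f}K/yr")
```

Even the revised model leaves most of the variation in salary unexplained, so neither prediction should be read as precise.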

35
Example: Old Test Question

Justin picks the better predictor of the two
models to base his analysis on. Circle ALL the
relevant conclusions (Yes, there can be more than
1 answer!). The model:
a) proves age discrimination exists in that older
programmers make more money than younger ones
b) proves age discrimination exists in that
younger programmers make more money than older
ones
c) does not provide definitive proof about
whether there is age discrimination for
programmers.
d) shows a positive correlation between age and
salary for computer programmers
e) shows a negative correlation between age and
salary for computer programmers
f) shows neither a positive nor a negative
correlation between age and salary for computer
programmers