Title: Ch 9: Regression Wisdom Sifting Residuals for Groups
1. Ch 9 Regression Wisdom: Sifting Residuals for Groups
- No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.
- Residuals often reveal subtleties that were not clear from a plot of the original data.
  - Sometimes the subtleties we see are additional details that help confirm or refine our understanding.
  - Sometimes they reveal violations of the regression conditions that require our attention.
- This chapter looks at some of the things that can lead to problems in regression models and, when possible, how to react.
2. Regression Sanity Check
- Before you plug-and-chug with your regression and write the answer down, check to see if it makes sense.
- Are you sure you are using the right units?
  - Regression models with coefficients expressed in thousands of dollars ($K) will produce goofy results if you use plain dollars instead. The same goes for plugging in monthly sales when the model was built on weekly sales.
- Will fractional predictions make sense in the context? 14.5 houses, 10.3 people, etc.
- What do we do if our model predicts a negative number?
- We will look at extrapolation and leverage shortly.
3. Example: Sanity Check
- Arnie has been given a model that predicts his weekend car sales at Arnie's Autos based on the interest rate he can offer for financing.
- Cars-sold = 50 - 570 × rate
- Will the model results change if we interpret a 6% rate to be 6 versus .06?
- What is the numerical answer? Does it make sense from a practical standpoint? What do we need to do first?
- How many cars are we predicted to sell if the interest rate goes up to 9%? What does this really mean?
- At what interest rate (and above) would Arnie expect to see no sales? How do we find this rate?
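One way to work through these questions is to evaluate the model under both readings of the rate. This is only a sketch: it assumes the model on this slide reads cars sold = 50 - 570 × rate and that the rate is meant as a decimal (0.06 for 6%). The function and variable names are illustrative and do not come from the slides.

```python
def predicted_cars_sold(rate):
    """Arnie's model as read from the slide: cars sold = 50 - 570 * rate.

    Assumption: `rate` is a decimal fraction (0.06 means 6%).
    """
    return 50 - 570 * rate


as_decimal = predicted_cars_sold(0.06)    # 6% entered as 0.06
as_whole_number = predicted_cars_sold(6)  # 6% entered as 6 (wrong units)
at_nine_percent = predicted_cars_sold(0.09)

# Predicted sales reach zero where 50 - 570 * rate = 0.
zero_sales_rate = 50 / 570

print(f"rate 0.06 -> {as_decimal:.1f} cars")
print(f"rate 6    -> {as_whole_number:.1f} cars")
print(f"rate 0.09 -> {at_nine_percent:.1f} cars")
print(f"zero sales at rate = {zero_sales_rate:.4f}")
```

Under these assumptions, entering 6 instead of 0.06 gives a large negative prediction, which fails the sanity check on its face. At 9% the prediction is slightly negative, which in practice can only mean that no sales are expected.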
5. Sifting Residuals for Groups: the Tools!
- It is smart to look at both a histogram of the residuals and a scatter-plot of the residuals vs. predicted values.
- Using the residuals from Ch 8's cereal model, the small modes in the histogram are marked with different colors and symbols in the residual plot. What do you see?
Figure: Residual Plot for Predicting Calories Given Sugar
Figure: Histogram of Residuals for Predicting Calories Given Sugar
6. Subsets
- An important, often unstated condition for fitting models is that all the data must come from the same group.
- An examination of residuals often leads us to discover groups of observations that are different from the rest.
- When we discover that there is more than one group in a regression, we can do one of the following:
  - Build models for each of the subgroups separately.
  - Work within the confines of a single model but note the condition.
  - Use advanced techniques, such as multiple regression, which are beyond the scope of DS212.
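The first option, building a model for each subgroup, can be sketched as follows. The data are made up for illustration (they are not the cereal data), and `fit_line` is a hypothetical helper that computes the ordinary least-squares line by hand.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: two groups measured at the same x-values.
x_values = [2, 4, 6, 8, 10]
group_a_y = [102, 104, 106, 108, 110]  # follows y = 100 + 1x
group_b_y = [88, 96, 104, 112, 120]    # follows y = 80 + 4x

pooled_fit = fit_line(x_values + x_values, group_a_y + group_b_y)
group_a_fit = fit_line(x_values, group_a_y)
group_b_fit = fit_line(x_values, group_b_y)

for label, (b0, b1) in [
    ("pooled", pooled_fit),
    ("group A", group_a_fit),
    ("group B", group_b_fit),
]:
    print(f"{label}: y = {b0:.1f} + {b1:.1f}x")
```

In this toy case the pooled slope lands between the two group slopes and describes neither group well, which is the situation the separate fits are meant to expose.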
7. Subsets: Example
- Accepted industry wisdom in the breakfast cereal business is that different cereals are targeted at different consumer groups and thus placed on different shelves.
- Figure 9.3 from the text shows regression lines fit to calories and sugar for each of the three cereal shelves in a supermarket (shelf 1 is the lowest shelf).
Figure: Predicted Calories by Sugar Content, Given Shelf Level
8. Another Typical Problem: Getting the Bends
- Linear regression only works for linear models. (That sounds obvious, but when you fit a regression, you can't take it for granted.)
- A curved relationship between two variables might not be apparent when looking at a scatter-plot alone, but will be more obvious in a plot of the residuals.
- Remember, we want to see "nothing" (no pattern at all) in a plot of the residuals.
9. Getting the Bends: Example
- The curved relationship between fuel efficiency and weight is more apparent in the plot of the residuals than in the original scatter-plot.
Figure: Predicting a Car's Fuel Efficiency Given Weight
Figure: Residuals Plot from Predicting Fuel Efficiency Given Weight
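The same effect can be reproduced with a small made-up data set. The sketch below is an illustration only: the data follow y = x² rather than the fuel-efficiency data, and `fit_line` is a hypothetical hand-written least-squares helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up curved data: y = x squared.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)

# Residual = observed value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r:+.1f}")
```

The residuals are positive at both ends and negative in the middle. A straight-line fit to curved data leaves this kind of U-shaped pattern, which is far from the "nothing" we hope to see.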
10. Extrapolation: Reaching Beyond the Data
- Linear models give a predicted value for each case in the data.
- We cannot assume that a linear relationship in the data exists much beyond the range of the data.
- Once we venture into new x territory, such a prediction is called an extrapolation.
- Extrapolations are dubious because they require the additional (and very questionable) assumption that nothing about the relationship between x and y changes even at extreme values of x.
- Extrapolations can get you into trouble. You are better off not making extrapolations, especially distant ones.
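A simple guard against accidental extrapolation is to check each new x-value against the range of the x-values used to fit the model. The sketch below shows one way to do that. The function name, the observed x-values, and the coefficients are all hypothetical.

```python
def predict_with_range_check(x_new, observed_xs, b0, b1):
    """Predict from y = b0 + b1 * x and flag x-values outside the observed range."""
    prediction = b0 + b1 * x_new
    is_extrapolation = not (min(observed_xs) <= x_new <= max(observed_xs))
    return prediction, is_extrapolation


observed_xs = [10, 15, 20, 25, 30]  # hypothetical x-values used to fit the model
b0, b1 = 5.0, 2.0                   # hypothetical fitted intercept and slope

inside = predict_with_range_check(22, observed_xs, b0, b1)
outside = predict_with_range_check(60, observed_xs, b0, b1)

for label, (prediction, flagged) in [("x = 22", inside), ("x = 60", outside)]:
    note = "EXTRAPOLATION, treat with caution" if flagged else "within the data range"
    print(f"{label}: predicted y = {prediction:.1f} ({note})")
```

The check does not make a distant prediction any more trustworthy. It only makes sure the extrapolation is a deliberate choice rather than an oversight.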
11. Predicting the Future
- Extrapolation is always dangerous. But when the x-variable in the model is time, extrapolation becomes an attempt to peer into the future.
- Here's some more realistic advice: if you must extrapolate into the distant future, at least don't believe that the prediction will come true.
12. Extrapolation (cont.)
- A regression of mean age at first marriage for men vs. year, fit to the first few decades of the 20th century, does not hold for later years.
13. Outliers, Leverage, and Influence
- Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
- Any point that stands away from the others can be called an outlier and deserves your special attention.
14. Outliers, Leverage, and Influence: An Example
- The following scatter-plot suggests that something was awry in Palm Beach County, Florida, during the 2000 presidential election.
Figure: Predicting Votes for Buchanan Given Votes for Nader, Florida Counties in 2000
15. Outliers: Example Continued
- The red line shows the effect that one unusual point can have on a regression.
- The R² for the original regression was 43%. Re-running the regression without Palm Beach gives an R² of 82%.
Figure: Predicting Votes for Buchanan Given Votes for Nader, 2 Different Regression Lines
16. Outliers, Leverage, and Influence (cont.)
- The linear model doesn't fit points with large residuals very well.
- Because they seem to be different from the other cases, it is important to pay special attention to points with large residuals.
- A data point can also be unusual if its x-value is far from the mean of the x-values. Such points are said to have high leverage.
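Leverage can be put on a numerical scale. The slides do not give a formula, so the sketch below uses the standard leverage formula for a one-predictor regression, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The x-values are made up for illustration.

```python
def leverages(xs):
    """Leverage of each point in a one-predictor least-squares regression.

    h_i = 1/n + (x_i - mean_x)^2 / sum_j (x_j - mean_x)^2
    """
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Made-up x-values: four points close together and one far from the mean.
xs = [1, 2, 3, 4, 20]
h = leverages(xs)

for x, h_i in zip(xs, h):
    print(f"x = {x:>2}: leverage = {h_i:.3f}")
```

Leverage depends only on the x-values, not on y. In this toy data the point at x = 20 has a leverage close to 1, while the others sit near 0.2 to 0.3.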
17. Outliers, Leverage, and Influence (cont.)
- A point with high leverage has the potential to change the regression line.
- We say that a point is influential if omitting it from the analysis gives a very different model.
- Example: Bozo's effect on the Comedians' IQ vs. Shoe Size model.
Figure: Predicting IQ Given Shoe Size, 2 Models Wholly Dependent on 1 Outlier with Leverage
18. Outliers, Leverage, and Influence (cont.)
- When we investigate an unusual point, we often learn more about the situation than we could have learned from the model alone.
- You cannot simply delete unusual points from the data. You can, however, fit a model with and without these points, as long as you examine and discuss the two regression models to understand how they differ.
19. Outliers, Leverage, and Influence (cont.)
- Warning:
  - Influential points can hide in plots of residuals.
  - Points with high leverage pull the line close to them (as per the IQ vs. shoe size example), so they often have small residuals.
  - You'll see influential points more easily in scatter-plots of the original data or by fitting a regression model with and without the points.
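The warning can be illustrated by fitting a line with and without a single high-leverage point. Everything in the sketch below is made up for illustration: the data, the variable names, and the hand-written `fit_line` helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data: five ordinary points plus one point far out in x (the last one).
xs = [1, 2, 3, 4, 5, 20]
ys = [5, 3, 4, 5, 3, 20]

b0_with, b1_with = fit_line(xs, ys)
b0_without, b1_without = fit_line(xs[:-1], ys[:-1])

# Residuals from the model that includes the far point.
residuals = [y - (b0_with + b1_with * x) for x, y in zip(xs, ys)]
far_point_residual = residuals[-1]
largest_ordinary_residual = max(abs(r) for r in residuals[:-1])

print(f"slope with the far point:    {b1_with:+.2f}")
print(f"slope without the far point: {b1_without:+.2f}")
print(f"residual of the far point:   {far_point_residual:+.2f}")
print(f"largest ordinary residual:   {largest_ordinary_residual:.2f}")
```

In this toy case the far point flips the slope from slightly negative to clearly positive, yet its own residual is smaller than the largest residuals among the ordinary points. A residual plot alone would not single it out. Comparing the two fits does.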
20. Lurking Variables and Causation
- No matter how strong the association, no matter how large the R² value, no matter how straight the line, there is no way to conclude from a regression alone that one variable causes the other.
- There's always the possibility that some third variable is driving both of the variables you have observed.
- With observational data, as opposed to data from a designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.
21. Lurking Variables and Causation: An Example
- The following scatter-plot shows that the average life expectancy for a country is related to the number of doctors per person in that country.
- We could come up with all sorts of reasonable explanations justifying this, but...
22. Lurking Variables and Causation: Another Example
- This new scatter-plot shows that the average life expectancy for a country is also related to the number of televisions per person in that country.
- And the relationship is even stronger: an R² of 72% instead of 62%.
- Since TVs are cheaper than doctors, why don't we send TVs to countries with low life expectancies in order to extend lifetimes? Right?
23. Lurking Variables and Causation: An Example
- How about considering a lurking variable? That makes more sense.
- Countries with higher standards of living have both longer life expectancies and more doctors (and TVs!).
- If higher living standards cause changes in these other variables, improving living standards might be expected to prolong lives and, incidentally, also increase the numbers of doctors and TVs.
24. Working With Summary Values
- Scatter-plots of statistics summarized over groups tend to show less variability than we would see if we measured the same variable on individuals.
- If, instead of data on individuals, we only had the mean weight for each height value, we would see an even stronger association.
Figure: Mean Weight Given Mean Height
Figure: Weight Given Height for a Survey of Men
25. Working With Summary Values (cont.)
- Means vary less than individual values.
  - We will study this effect later in DS212.
- Therefore, scatter-plots of summary statistics show less scatter than the baseline data on individuals.
  - This can give a false impression of how well a line summarizes the data.
- There is no simple correction for this phenomenon.
  - Once we reduce our data to just summary data, there's no simple way to get the original values back.
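The effect of summarizing can be seen by comparing the correlation for individuals with the correlation for group means. The sketch below uses made-up heights and weights (not the survey data) and a hypothetical hand-written `correlation` helper.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up data: three individuals at each height, scattered around a trend.
heights = [64, 66, 68, 70, 72]
offsets = [-20, 0, 20]

individual_heights = []
individual_weights = []
for height in heights:
    for offset in offsets:
        individual_heights.append(height)
        individual_weights.append(100 + 5 * (height - 64) + offset)

# Summary data: one mean weight per height value.
mean_weights = []
for height in heights:
    group = [w for h, w in zip(individual_heights, individual_weights) if h == height]
    mean_weights.append(sum(group) / len(group))

r_individuals = correlation(individual_heights, individual_weights)
r_means = correlation(heights, mean_weights)

print(f"r for individuals: {r_individuals:.2f}")
print(f"r for group means: {r_means:.2f}")
```

Averaging removes the scatter within each height group, so the means line up far more tightly than the individuals do. The stronger correlation describes the means. It says little about how well height predicts any one person's weight.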
26. What Can Go Wrong?
- Make sure the relationship is linear.
  - Check the Straight Enough condition.
- Be on guard for different groups in your regression.
  - If you find subsets that behave differently, consider fitting different models for each subset.
- Beware of extrapolation... especially into the future!
- Look for unusual points.
  - Unusual points always deserve attention, and that attention may well reveal more about your data than the rest of the points combined.
- Beware of high leverage points, and especially those that are influential.
  - Consider running two regressions, one with the extraordinary points and one without, and then compare the results. But...
  - Don't just arbitrarily remove unusual points to get a model that fits better.
- Beware of lurking variables: don't assume that association is causation.
- Watch out when dealing with data that are summaries.
  - Summary data inflate the impression of the relationship strength.
27. Example
- Pr 19. Federal employees with the authority to carry firearms and make arrests are sometimes assaulted, and even hurt or killed. The table below shows these rates for 5 years 1995-9.
- Can assault rates be used to accurately predict injury or death?
- What do we do first?
28. Example
- Our scatter-plot shows a significant outlier.
- Do you think that this outlier has significant leverage?
29. Example
- Does it look like the outlier has a large residual?
- Does it look like the outlier has a lot of leverage?
- Does it look like the model is a decent predictor?
- What might we consider doing next?
30. Example: Re-Analyzing without the Outlier
- Removing the one outlier dramatically changes the model results. Now it appears our model is worthless.
- Was the model a better predictor when we had the NPS outlier?
31. Example: Old Test Question
- A sales associate at Northstar's SkiSports shop is trying to predict sales of their ski bootwarmers given the average daily temperature (measured in °F). Excel's regression wizard produces the following summary output.
32. Example: Old Test Question
- What is the correlation for this model?
- What % of the variation in the sales of bootwarmers is NOT explained by temperature?
- Write the equation for this linear regression.
- Sales = ___________
- How many bootwarmers are predicted to sell if the temperature is 10°F? (Since we can't sell fractional bootwarmers, use normal rounding!)
- If we sold 14 bootwarmers on a 10°F day, the model (over / under) predicted sales.
- Calculate the residual for that day. (Leave the prediction in non-integer form.)
- Based on model results, for what temperature RANGE can we interpret that we are unlikely to sell any bootwarmers?
33. Example: Old Test Question
- Justin is writing a paper for his Human Resources class on compensation for computer programmers. He surveyed 11 programmers, who gave truthful answers about how much they earn annually and their age. He used Excel to perform a regression, generated the chart below, and determined the following values: (intercept) b0 = 55, (slope) b1 = -.60, and R² = .05.
34. Example: Old Test Question
- What is the correlation of the model? Does age appear to be a good predictor of salary in this situation? Explain why in statistical terms.
- Use this model to predict the salary of a 30-year-old computer programmer: $___K/yr
- Re-examining the graph, Justin realizes there is a significant outlier which has the following: (Circle 1) High / Low leverage and also a (Circle 1) High / Low residual error.
- When he removes this point from his data set and re-runs the regression, he gets the following values: (intercept) b0 = 30, (slope) b1 = .90, and R² = .27.
- What is the correlation of the revised model?
- Does the revised model appear to be a better predictor?
- Use the revised model to predict the 30-year-old programmer's salary.
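The arithmetic behind these questions can be sketched as follows. The sketch assumes the values on the slides read b0 = 55, b1 = -0.60, R² = 0.05 for the original model and b0 = 30, b1 = 0.90, R² = 0.27 for the revised one, with salary in $K per year. The function names are hypothetical. It relies on the fact that, in a one-predictor regression, r has the same sign as the slope and a magnitude equal to the square root of R².

```python
from math import copysign, sqrt


def correlation_from_fit(slope, r_squared):
    """r takes the sign of the slope and the magnitude sqrt(R^2)."""
    return copysign(sqrt(r_squared), slope)


def predict_salary(b0, b1, age):
    """Predicted salary (assumed to be in $K per year) from salary = b0 + b1 * age."""
    return b0 + b1 * age


# Values as read from the slides (see the assumptions in the lead-in).
original_model = {"b0": 55, "b1": -0.60, "r_squared": 0.05}
revised_model = {"b0": 30, "b1": 0.90, "r_squared": 0.27}

results = {}
for label, model in [("original", original_model), ("revised", revised_model)]:
    r = correlation_from_fit(model["b1"], model["r_squared"])
    salary_at_30 = predict_salary(model["b0"], model["b1"], 30)
    results[label] = (r, salary_at_30)
    print(f"{label}: r = {r:+.2f}, predicted salary at age 30 = {salary_at_30:.0f}K/yr")
```

Under these assumptions both correlations are modest, so neither model explains much of the variation in salary. That is worth keeping in mind when judging which model is "better".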
35. Example: Old Test Question
- Justin picks the better predictor of the two models to base his analysis on. Circle ALL the relevant conclusions (yes, there can be more than 1 answer!). The model:
- a) proves age discrimination exists in that older programmers make more money than younger ones
- b) proves age discrimination exists in that younger programmers make more money than older ones
- c) does not provide definitive proof about whether there is age discrimination for programmers
- d) shows a positive correlation between age and salary for computer programmers
- e) shows a negative correlation between age and salary for computer programmers
- f) shows neither a positive nor a negative correlation between age and salary for computer programmers