Ch 9: Regression Wisdom - Sifting Residuals for Groups

Provided by: Addison6

1
Ch 9: Regression Wisdom - Sifting Residuals for Groups
  • No regression analysis is complete without a
    display of the residuals to check that the linear
    model is reasonable.
  • Residuals often reveal subtleties that were not
    clear from a plot of the original data.
  • Sometimes the subtleties we see are additional
    details that help confirm or refine our
    understanding.
  • Sometimes they reveal violations of the
    regression conditions that require our attention.
  • This chapter looks at some of the things that can
    lead to problems in regression models and, when
    possible, how to react.

2
Regression Sanity Check
  • Before you plug-and-chug with your regression and
    write the answer down, check to see if it makes
    sense.
  • Are you sure you are using the right units?
  • Regression models with coefficients expressed in
    thousands of dollars ($K) will produce goofy
    results if you use plain dollars instead. The
    same happens if you use monthly sales when the
    model was built assuming weekly sales.
  • Will fractional predictions make sense in the
    context? 14.5 houses, 10.3 people, etc
  • What do we do if our model predicts a negative
    number?
  • We will look at extrapolation and leverage
    shortly

3
Example: Sanity Check
  • Arnie has been given a model that predicts his
    weekend car sales at Arnie's Autos based on the
    interest rate he can offer for financing:
  • Cars-sold = 50 - 570 × rate
  • Will the model results change if we interpret a
    6% rate to be 6 versus 0.06?
  • What is the numerical answer? Does it make sense
    from a practical standpoint? What do we need to
    do first?
  • How many cars are we predicted to sell if the
    interest rate goes up to 9%? What does this
    really mean?
  • At what interest rate (and above) would Arnie
    expect to see no sales? How do we find this rate?
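
A minimal sketch of the unit check for Arnie's model. It assumes the model expects the rate as a decimal fraction (0.06 for 6%), which the slide does not state; the function name `predicted_cars_sold` is made up for illustration.

```python
def predicted_cars_sold(rate):
    """Arnie's model: Cars-sold = 50 - 570 * rate.

    Assumption: rate is a decimal fraction (0.06 means 6%).
    """
    return 50 - 570 * rate


as_fraction = predicted_cars_sold(0.06)  # 6% entered as 0.06
as_percent = predicted_cars_sold(6)      # 6% entered as 6 (suspect units)
at_nine = predicted_cars_sold(0.09)      # 9% entered as 0.09

# The prediction reaches zero where 50 - 570 * rate = 0.
break_even_rate = 50 / 570

print(f"rate = 0.06 -> {as_fraction:.1f} cars")
print(f"rate = 6    -> {as_percent:.1f} cars")
print(f"rate = 0.09 -> {at_nine:.1f} cars")
print(f"zero sales at rate = {break_even_rate:.4f}")
```

Entering 6 instead of 0.06 gives a prediction of thousands of negative cars, which is the kind of result the sanity check is meant to catch. The 9% prediction is also slightly negative, so in practice it reads as "no sales".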

5
Sifting Residuals for Groups - the Tools!
  • It is smart to look at both a histogram of the
    residuals and a scatter-plot of the residuals vs.
    predicted values
  • Using the residuals from Ch. 8's cereal model, the
    small modes in the histogram are marked with
    different colors and symbols in the residual
    plot. What do you see?
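
A minimal sketch of how the two displays are built, using NumPy only. The (sugar, calories) values are invented for illustration and are not the cereal data from the text.

```python
import numpy as np

# Hypothetical (sugar, calories) pairs, invented for illustration only.
sugar = np.array([1, 3, 5, 6, 8, 9, 11, 12, 14, 15], dtype=float)
calories = np.array([90, 100, 105, 110, 110, 120, 115, 130, 125, 140], dtype=float)

# Least-squares line: calories = intercept + slope * sugar.
slope, intercept = np.polyfit(sugar, calories, 1)
predicted = intercept + slope * sugar
residuals = calories - predicted

# Histogram of the residuals (bin counts and bin edges).
counts, bin_edges = np.histogram(residuals, bins=5)
print("histogram counts:", counts)

# The residual plot pairs each predicted value with its residual.
for p, r in zip(predicted, residuals):
    print(f"predicted = {p:6.1f}   residual = {r:6.1f}")
```

With a plotting library, `counts` would become the histogram bars and the (predicted, residual) pairs would become the scatter-plot.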

Figure: Residual plot for predicting calories given sugar
Figure: Histogram of residuals for predicting calories given sugar

6
Subsets
  • An important, often unstated condition for
    fitting models is that all the data must come
    from the same group.
  • An examination of residuals often leads us to
    discover groups of observations that are
    different from the rest.
  • When we discover that there is more than one
    group in a regression, we can do one of the
    following:
  • Build models for each of the subgroups separately.
  • Work within the confines of a single model but
    note the condition.
  • Use advanced techniques, such as multiple
    regression, which are beyond the scope of DS212.
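
A minimal sketch of the first option, fitting a separate line to each subgroup. The data and the group labels "A" and "B" are hypothetical, chosen so that the two groups follow opposite trends.

```python
import numpy as np

# Hypothetical data: group A follows y = 1 + 2x, group B follows y = 21 - x.
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11, 20, 19, 18, 17, 16], dtype=float)
group = np.array(["A"] * 5 + ["B"] * 5)

# One pooled line that ignores the groups.
pooled_slope, pooled_intercept = np.polyfit(x, y, 1)
print(f"pooled:  y = {pooled_intercept:.1f} + {pooled_slope:.1f} x")

# A separate line for each group.
fits = {}
for g in np.unique(group):
    mask = group == g
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    fits[g] = (slope, intercept)
    print(f"group {g}: y = {intercept:.1f} + {slope:.1f} x")
```

The pooled slope sits between the two group slopes and describes neither group well, which is why mixed groups are a problem for a single model.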

7
Subsets Example
  • Accepted industry wisdom in the breakfast cereal
    business is that different cereals are targeted
    at different consumer groups and thus placed on
    different shelves.
  • Figure 9.3 from the text shows regression lines
    fit to calories and sugar for each of the three
    cereal shelves in a supermarket (shelf 1 is
    the lowest shelf)

Figure: Predicted calories by sugar content, given shelf level

8
Another Typical Problem: Getting the Bends
  • Linear regression only works for linear models.
    (That sounds obvious, but when you fit a
    regression, you can't take it for granted.)
  • A curved relationship between two variables might
    not be apparent when looking at a scatter-plot
    alone, but will be more obvious in a plot of the
    residuals.
  • Remember, we want to see nothing in a plot of
    the residuals.
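
A minimal sketch of why the residuals expose a bend. It fits a straight line to a hypothetical curved relationship (y = x squared, chosen only for illustration) and prints the sign of each residual.

```python
import numpy as np

# Hypothetical curved relationship, chosen only for illustration.
x = np.arange(0, 11, dtype=float)
y = x ** 2

# Fit a straight line anyway.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
r_squared = np.corrcoef(x, y)[0, 1] ** 2

print(f"R^2 = {r_squared:.2f}")
print("residual signs:", "".join("+" if r > 0 else "-" for r in residuals))
```

R² is high, yet the residuals are positive at both ends and negative in the middle. That systematic pattern is the "something" we do not want to see in a residual plot.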

9
Getting the Bends: Example
  • The curved relationship between fuel efficiency
    and weight is more apparent in the plot of the
    residuals than in the original scatter-plot.

Figure: Predicting a car's fuel efficiency given weight
Figure: Residuals plot from predicting fuel efficiency given weight

10
Extrapolation: Reaching Beyond the Data
  • Linear models give a predicted value for each
    case in the data.
  • We cannot assume that a linear relationship in
    the data exists much beyond the range of the
    data.
  • Once we venture into new x territory, such a
    prediction is called an extrapolation.
  • Extrapolations are dubious because they require
    the additional (and very questionable) assumption
    that nothing about the relationship between x and
    y changes even at extreme values of x.
  • Extrapolations can get you into trouble. You are
    better off not making extrapolations, especially
    distant ones.
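
One way to stay honest about extrapolation is to flag any prediction whose x-value lies outside the range of the data used to fit the model. The sketch below uses invented data, and the `predict` helper is hypothetical.

```python
import numpy as np

# Hypothetical data, invented for illustration.
x = np.array([10, 12, 15, 18, 20], dtype=float)
y = np.array([22, 25, 31, 36, 41], dtype=float)

slope, intercept = np.polyfit(x, y, 1)


def predict(x_new):
    """Return (prediction, is_extrapolation) for a new x value."""
    is_extrapolation = bool(x_new < x.min() or x_new > x.max())
    return intercept + slope * x_new, is_extrapolation


for x_new in (16, 40):
    value, flagged = predict(x_new)
    note = "extrapolation: treat with suspicion" if flagged else "within the data range"
    print(f"x = {x_new}: predicted y = {value:.1f} ({note})")
```

The flag does not make the extrapolated value any more trustworthy; it only marks that the linear model has no data to support it there.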

11
Predicting the Future
  • Extrapolation is always dangerous. But, when the
    x-variable in the model is time, extrapolation
    becomes an attempt to peer into the future.
  • Here's some more realistic advice: if you must
    extrapolate into the distant future, at least
    don't believe that the prediction will come true.

12
Extrapolation (cont.)
  • A regression of mean age at first marriage for
    men vs. year fit to the first few decades of the
    20th century does not hold for later years

13
Outliers, Leverage, and Influence
  • Outlying points can strongly influence a
    regression. Even a single point far from the body
    of the data can dominate the analysis.
  • Any point that stands away from the others can be
    called an outlier and deserves your special
    attention.

14
Outliers, Leverage, and Influence: An Example
  • The following scatter-plot suggests that
    something was awry in Palm Beach County, Florida,
    during the 2000 presidential election

Figure: Predicting votes for Buchanan given votes for Nader, Florida counties in 2000

15
Outliers: Example Continued
  • The red line shows the effects that one unusual
    point can have on a regression.
  • The R² for the original regression was 43%.
    Re-running the regression without Palm Beach
    gives an R² of 82%.

Figure: Predicting votes for Buchanan given votes for Nader, 2 different regression lines

16
Outliers, Leverage, and Influence (cont.)
  • The linear model doesn't fit points with large
    residuals very well.
  • Because they seem to be different from the other
    cases, it is important to pay special attention
    to points with large residuals.
  • A data point can also be unusual if its x-value
    is far from the mean of the x-values. Such points
    are said to have high leverage.
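
For a regression with a single predictor, leverage has a standard closed form: h_i = 1/n + (x_i - mean(x))² / Σ(x_j - mean(x))². The sketch below applies it to invented shoe sizes with one extreme value; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical shoe sizes; the last one is far from the rest.
x = np.array([8, 9, 9.5, 10, 10.5, 11, 22], dtype=float)

n = len(x)
deviations_sq = (x - x.mean()) ** 2

# Leverage of each point in a simple (one-predictor) regression.
leverage = 1 / n + deviations_sq / deviations_sq.sum()

for xi, hi in zip(x, leverage):
    print(f"x = {xi:5.1f}   leverage = {hi:.3f}")
```

Leverage depends only on the x-values, so a point can have high leverage regardless of its y-value. In a one-predictor model the leverages always sum to 2.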

17
Outliers, Leverage, and Influence (cont.)
  • A point with high leverage has the potential to
    change the regression line.
  • We say that a point is influential if omitting it
    from the analysis gives a very different model.
  • Example: Bozo's effect on the comedians' IQ vs.
    shoe size model.

Figure: Predicting IQ given shoe size, 2 models wholly dependent on 1 outlier with leverage

18
Outliers, Leverage, and Influence (cont.)
  • When we investigate an unusual point, we often
    learn more about the situation than we could have
    learned from the model alone.
  • You cannot simply delete unusual points from the
    data. You can, however, fit a model with and
    without these points as long as you examine and
    discuss the two regression models to understand
    how they differ.

19
Outliers, Leverage, and Influence (cont.)
  • Warning
  • Influential points can hide in plots of
    residuals.
  • Points with high leverage pull the line close to
    them (as in the IQ vs. shoe size example), so
    they often have small residuals.
  • You'll see influential points more easily in
    scatter-plots of the original data or by fitting
    a regression model with and without the points.
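
A minimal sketch of the with-and-without comparison on invented data. The last point has a far-out x-value; it changes the slope dramatically, yet its own residual is not the largest in the fit.

```python
import numpy as np

# Hypothetical data; the last point has a far-out x-value.
x = np.array([1, 2, 3, 4, 5, 20], dtype=float)
y = np.array([2, 1, 3, 2, 2, 20], dtype=float)

# Fit with all points, then without the high-leverage point.
slope_with, intercept_with = np.polyfit(x, y, 1)
slope_without, intercept_without = np.polyfit(x[:-1], y[:-1], 1)

residuals_with = y - (intercept_with + slope_with * x)

print(f"with the point:    slope = {slope_with:.2f}")
print(f"without the point: slope = {slope_without:.2f}")
print("residuals (all points):", np.round(residuals_with, 2))
```

Both models should be reported and discussed; the comparison shows how much the conclusion rests on a single point.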

20
Lurking Variables and Causation
  • No matter how strong the association, no matter
    how large the R² value, no matter how straight
    the line, there is no way to conclude from a
    regression alone that one variable causes the
    other.
  • There's always the possibility that some third
    variable is driving both of the variables you
    have observed.
  • With observational data, as opposed to data from
    a designed experiment, there is no way to be sure
    that a lurking variable is not the cause of any
    apparent association.

21
Lurking Variables and Causation: An Example
  • The following scatter-plot shows that the average
    life expectancy for a country is related to the
    number of doctors per person in that country.
    We could come up with all sorts of reasonable
    explanations justifying this, but...
22
Lurking Variables and Causation: Another Example
  • This new scatter-plot shows that the average life
    expectancy for a country is also related to the
    number of televisions per person in that country.
  • And the relationship is even stronger: R² of 72%
    instead of 62%.
  • Since TVs are cheaper than doctors, why don't we
    send TVs to countries with low life expectancies
    in order to extend lifetimes? Right?

23
Lurking Variables and Causation: An Example
  • How about considering a lurking variable? That
    makes more sense
  • Countries with higher standards of living have
    both longer life expectancies and more doctors
    (and TVs!).
  • If higher living standards cause changes in these
    other variables, improving living standards might
    be expected to prolong lives and, incidentally,
    also increase the numbers of doctors and TVs.
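
A small simulation of the lurking-variable story, under invented assumptions: a "living standard" score drives both TVs per person and life expectancy, and TVs have no direct effect on life expectancy. All variable names, coefficients, and scales are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Lurking variable: living standard (arbitrary standardized scale).
living_standard = rng.normal(0, 1, n)

# Both observed variables depend on living standard plus independent noise.
tvs = 2 * living_standard + rng.normal(0, 1, n)
life_expectancy = 2 * living_standard + rng.normal(0, 1, n)

raw_corr = np.corrcoef(tvs, life_expectancy)[0, 1]


def residualize(v, z):
    """Remove the linear effect of z from v."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (intercept + slope * z)


# Correlation that remains once living standard is accounted for.
adjusted_corr = np.corrcoef(
    residualize(tvs, living_standard),
    residualize(life_expectancy, living_standard),
)[0, 1]

print(f"correlation of TVs with life expectancy: {raw_corr:.2f}")
print(f"after removing living standard:          {adjusted_corr:.2f}")
```

The raw correlation is strong even though TVs do nothing in this simulation, and it nearly vanishes once the lurking variable is removed. With real observational data the lurking variable is usually unmeasured, so this adjustment is not available.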

24
Working With Summary Values
  • Scatter-plots of statistics summarized over
    groups tend to show less variability than we
    would see if we measured the same variable on
    individuals.
  • If instead of data on individuals we only had the
    mean weight for each height value, we would see
    an even stronger association

Figure: Mean weight given mean height
Figure: Weight given height for a survey of men

25
Working With Summary Values (cont.)
  • Means vary less than individual values.
  • We will study this effect later in DS212
  • Therefore, scatter-plots of summary statistics
    show less scatter than the baseline data on
    individuals.
  • This can give a false impression of how well a
    line summarizes the data.
  • There is no simple correction for this
    phenomenon.
  • Once we reduce our data to just summary data,
    there's no simple way to get the original values
    back.
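
A small simulation of the summary-value effect, using invented heights and weights (not the survey data from the text). It compares R² for individuals with R² for the mean weight at each height.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical survey: 50 men at each whole-inch height from 64 to 76.
heights = np.repeat(np.arange(64, 77), 50).astype(float)
weights = -200 + 5.5 * heights + rng.normal(0, 25, size=heights.size)


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    return np.corrcoef(x, y)[0, 1] ** 2


# Reduce the data to one mean weight per height value.
unique_heights = np.unique(heights)
mean_weights = np.array([weights[heights == h].mean() for h in unique_heights])

r2_individuals = r_squared(heights, weights)
r2_means = r_squared(unique_heights, mean_weights)

print(f"R^2 using individuals: {r2_individuals:.2f}")
print(f"R^2 using group means: {r2_means:.2f}")
```

The underlying relationship is the same in both fits; only the scatter around the line has been averaged away, which is why the summary version looks so much stronger.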

26
What Can Go Wrong?
  • Make sure the relationship is linear.
  • Check the Straight Enough condition.
  • Be on guard for different groups in your
    regression.
  • If you find subsets that behave differently,
    consider fitting different models for each
    subset.
  • Beware of extrapolation...especially into the
    future!
  • Look for unusual points.
  • Unusual points always deserve attention and that
    may well reveal more about your data than the
    rest of the points combined.
  • Beware of high leverage points, and especially
    those that are influential.
  • Consider fitting two regressions, one with the
    extraordinary points and one without, and then
    compare the results. But:
  • Don't just arbitrarily remove unusual points to
    get a model that fits better.
  • Beware of lurking variables: don't assume that
    association is causation.
  • Watch out when dealing with data that are
    summaries.
  • Summary data inflate the impression of the
    relationship strength.

27
Example
  • Pr 19. Federal employees with the authority to
    carry firearms and make arrests are sometimes
    assaulted, and even hurt or killed. The table
    below shows these rates for the 5 years 1995-99.
  • Can assault rates be used to accurately predict
    injury or death?
  • What do we do first?

28
Example
  • Our scatter-plot shows a significant outlier.
  • Do you think that this outlier has significant
    leverage?

29
Example
  • Does it look like the outlier has a large
    residual?
  • Does it look like the outlier has a lot of
    leverage?
  • Does it look like the model is a decent
    predictor?
  • What might we consider doing next?

30
Example: Reanalyzing without the Outlier
  • Removing the one outlier dramatically changes the
    model results. Now it appears our model is
    worthless.
  • Was the model a better predictor when we had the
    NPS outlier?

31
Example: Old Test Question
  • A sales associate at Northstar's SkiSports shop
    is trying to predict sales of their ski
    bootwarmers given the average daily temperature
    (measured in °F). Excel's regression wizard
    produces the following summary output.

32
Example: Old Test Question
  • What is the correlation for this model?
  • What % of the variation in the sales of
    bootwarmers is NOT explained by temperature?
  • Write the equation for this linear regression:
  • Sales = ___________
  • How many bootwarmers are predicted to sell if the
    temperature is 10°F? (Since we can't sell
    fractional bootwarmers, use normal rounding!)
  • If we sold 14 bootwarmers on a 10°F day, the
    model over / under predicted sales.
  • Calculate the residual for that day. (Leave the
    prediction in non-integer form.)
  • Based on the model results, over what temperature
    RANGE would we expect to sell no bootwarmers?

33
Example: Old Test Question
  • Justin is writing a paper for his Human
    Resources class on compensation for computer
    programmers. He surveyed 11 programmers who
    have given truthful answers about how much they
    earn annually and their age. He used Excel to
    perform a regression, generated the chart
    below, and determined the following values:
    (intercept) b0 = 55, (slope) b1 = -0.60, and
    R² = 0.05.

34
Example: Old Test Question
  • What is the correlation of the model? Does age
    appear to be a good predictor of salary in this
    situation? Explain why in statistical terms.
  • Use this model to predict the salary of a
    30-year-old computer programmer: $___K/yr
  • Re-examining the graph, Justin realizes there is
    a significant outlier which has
    (Circle 1) High / Low leverage and also a
    (Circle 1) High / Low residual error.
  • When he removes this point from his data set and
    re-runs the regression he gets the following
    values:
  • (intercept) b0 = 30, (slope) b1 = 0.90, and
    R² = 0.27.
  • What is the correlation of the revised model?
  • Does the revised model appear to be a better
    predictor?
  • Use the revised model to predict the 30-year-old
    programmer's salary.
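
A worked sketch of the arithmetic in this question. In a one-predictor regression the correlation is the square root of R², carrying the sign of the slope. The code assumes age is in years and salary is in $K per year, which the slide implies but does not state; the helper names are hypothetical.

```python
import math


def correlation_from_r2(r_squared, slope):
    """r = sqrt(R^2) with the sign of the slope (one-predictor regression)."""
    return math.copysign(math.sqrt(r_squared), slope)


def predict(intercept, slope, x):
    """Predicted y from the fitted line."""
    return intercept + slope * x


# Original model: b0 = 55, b1 = -0.60, R^2 = 0.05.
r_original = correlation_from_r2(0.05, -0.60)
salary_original = predict(55, -0.60, 30)

# Revised model (outlier removed): b0 = 30, b1 = 0.90, R^2 = 0.27.
r_revised = correlation_from_r2(0.27, 0.90)
salary_revised = predict(30, 0.90, 30)

print(f"original: r = {r_original:.2f}, predicted salary = {salary_original:.0f}K")
print(f"revised:  r = {r_revised:.2f}, predicted salary = {salary_revised:.0f}K")
```

Removing one point flips the sign of the slope and moves the prediction by 20K, which is what an influential point looks like in practice.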

35
Example: Old Test Question
  • Justin picks the better predictor of the two
    models to base his analysis on. Circle ALL the
    relevant conclusions (Yes, there can be more than
    1 answer!). The model:
  • a) proves age discrimination exists in that older
    programmers make more money than younger ones
  • b) proves age discrimination exists in that
    younger programmers make more money than older
    ones
  • c) does not provide definitive proof about
    whether there is age discrimination for
    programmers.
  • d) shows a positive correlation between age and
    salary for computer programmers
  • e) shows a negative correlation between age and
    salary for computer programmers
  • f) shows neither a positive nor a negative
    correlation between age and salary for computer
    programmers