Diagnostics - PowerPoint PPT Presentation

Slides: 56
Provided by: lsi4

Transcript and Presenter's Notes

Title: Diagnostics


1
Diagnostics Part I
  • Using plots to check to see if the assumptions we
    made about the model are realistic

2
Diagnostic methods
  • Some simple (but subjective) plots. (Now)
  • Formal statistical tests. (Next)

3
Review of some simple plots
  • while checking scope of model

4
Dot Plot
5
Dot Plot
  • Summarizes quantitative data.
  • Horizontal axis represents measurement scale.
  • Plot one dot for each data point.
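As a rough illustration of the three points above, here is a minimal text-only sketch in Python. The function name `dot_plot` and the sample values are hypothetical, and the sketch assumes small integer data so that each value gets its own column.

```python
from collections import Counter

def dot_plot(values):
    """Text dot plot for integer data: the measurement scale runs left to right."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):  # build the top row first
        row = "".join(" . " if counts[v] >= level else "   " for v in scale)
        rows.append(row.rstrip())
    rows.append("".join(f"{v:^3}" for v in scale).rstrip())  # axis labels
    return "\n".join(rows)

# Hypothetical data, for illustration only.
sample = [2, 3, 3, 4, 4, 4, 5, 7]
plot = dot_plot(sample)
print(plot)
```

Each data point contributes exactly one dot, stacked above its position on the horizontal scale.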

6
Stem-and-Leaf Plot
Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0

  12   0  223334444444
  63   0  555555555555566666666677777778888888888888999999999
 (33)  1  000000000000011112222233333333444
  43   1  555555556667777888
  25   2  0000000000023
  12   2  5557
   8   3  0023
   4   3
   4   4  00
   2   4
   2   5  0
   1   5
   1   6
   1   6
   1   7
   1   7  5
7
Stem-and-Leaf Plot
  • Summarizes quantitative data.
  • Each data point is broken down into a stem and
    a leaf.
  • First, stems are aligned in a column.
  • Then, leaves are attached to the stems.
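The stem and leaf steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name `stem_and_leaf` and the data are hypothetical, values are non-negative, and each stem gets a single row (the Minitab display shown earlier splits each stem across two rows).

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return (stem, leaves) rows; the last digit is the leaf, the rest is the stem."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)     # drop anything finer than the leaf unit
        stem, leaf = divmod(scaled, 10)  # split into stem and leaf
        rows[stem].append(str(leaf))
    stems = range(min(rows), max(rows) + 1)  # keep empty stems so gaps stay visible
    return [(stem, "".join(rows.get(stem, []))) for stem in stems]

# Hypothetical data, for illustration only.
shoes = [2, 5, 9, 10, 13, 15, 20, 23, 40]
display = stem_and_leaf(shoes)
for stem, leaves in display:
    print(f"{stem:>2} | {leaves}")
```

Keeping the empty stem 3 in the output makes the gap between 23 and 40 visible, which is part of what the plot is for.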

8
Box Plot
9
Box Plot
  • Summarizes quantitative data.
  • Vertical (or horizontal) axis represents
    measurement scale.
  • Lines in box represent the 25th percentile
    (first quartile), the 50th percentile
    (median), and the 75th percentile (third
    quartile), respectively.

10
Box Plot (contd)
  • Whiskers are drawn to the most extreme data
    points that are not more than 1.5 times the
    length of the box beyond either quartile.
  • Whiskers are useful for identifying outliers.
  • Outliers, or extreme observations, are denoted
    by asterisks.
  • Generally, data points falling beyond the
    whiskers are considered outliers.
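The whisker rule above can be made concrete with a short Python sketch. The data and the function name `box_plot_summary` are hypothetical. Quartile conventions differ between packages; this sketch assumes the "exclusive" method of Python's `statistics.quantiles`, so other software may give slightly different quartiles for the same data.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and points beyond the whiskers (1.5 x box length rule)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    box_length = q3 - q1
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # most extreme points within the fences
        "outliers": outliers,
    }

# Hypothetical data, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
summary = box_plot_summary(data)
print(summary)
```

Here the box runs from 3 to 9, so the fences sit at -6 and 18; the whiskers stop at 1 and 10, and 30 is the only point drawn as an outlier.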

11
Okay, now the really new stuff
12
Simple linear regression model
The response Yi is a function of a systematic
linear component and a random error component
with assumptions that
  • Error terms have mean 0, i.e., E(εi) = 0.
  • εi and εj are uncorrelated (independent).
  • Error terms have same variance, i.e., Var(εi)
    = σ².
  • Error terms εi are normally distributed.
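Written out, the model described on this slide is usually stated as below. The symbols for the intercept and slope (β0 and β1) are the conventional notation and are an assumption here, since the slide's own formula did not survive in the transcript.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n,
\qquad E(\varepsilon_i) = 0, \quad \operatorname{Var}(\varepsilon_i) = \sigma^2
```

The term β0 + β1 Xi is the systematic linear component, and εi is the random error component.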

13
Why should we keep nagging ourselves about the
model?
  • All of the estimates, confidence intervals,
    prediction intervals, hypothesis tests, etc. have
    been developed assuming that the model is
    correct.
  • If the model is incorrect, then the formulas and
    methods we use are at risk of being incorrect.
    (Some are more forgiving than others.)

14
Things that can go wrong with the model
  • Regression function is not linear.
  • Error terms do not have constant variance.
  • Error terms are not independent.
  • The model fits all but one or a few outlier
    observations.
  • Error terms are not normally distributed.
  • Important predictor variable(s) has been left out
    of the model.

15
Residual analysis: the basic idea
We would expect the observed residuals, ei = Yi - Ŷi,
to reflect the properties assumed for the
unknown true error terms, εi.
So, investigate the observed residuals to see if
they behave properly.
16
Some points of clarification about residuals
  • The mean of the residuals, e-bar, is 0. So, there is
    no need to check that the mean of the residuals is 0;
    the least squares (LS) estimation method has made it so.
  • The residuals are not independent, since they are
    all a function of the same estimated regression
    function.
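Both points can be checked numerically. The sketch below fits a least squares line to hypothetical data and confirms that the residuals sum to 0 and that the sum of Xi times ei is also 0. These two constraints hold for any LS fit with an intercept, and they are the reason the residuals cannot all vary independently. The helper name `least_squares` is illustrative only.

```python
def least_squares(x, y):
    """Return the intercept b0 and slope b1 of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(residuals)                                   # zero up to rounding
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))    # zero up to rounding
print(f"sum of residuals = {sum_e:.2e}, sum of x*residual = {sum_xe:.2e}")
```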

17
Example: Alcohol consumption (X) and Arm muscle
strength (Y)
18
A well-behaved residuals vs. fits plot
19
Characteristics of a well-behaved residual
versus fits plot
  • The residuals bounce randomly around the 0
    line. (Linear is reasonable).
  • No one residual stands out from the basic
    random pattern of residuals. (No outliers).
  • The residuals roughly form a horizontal band
    around 0 line. (Constant variance).

20
Residuals versus predictor plot offers nothing
different.
21
Example: Is tire tread wear linearly related to
mileage?

mileage   groove
   0      394.33
   4      329.50
   8      291.00
  12      255.17
  16      229.33
  20      204.83
  24      179.00
  28      163.83
  32      150.33

X = mileage in 1000 miles, Y = groove depth in
mils (0.001 inches)
22
Example: Is tire tread wear linearly related to
mileage?
23
A residual versus fits plot suggesting
relationship is not linear
24
How a non-linear function shows up on a residual
versus fits plot
  • The residuals depart from 0 in a systematic
    fashion, such as being positive for small X
    values, negative for medium X values, and
    positive again for large X values.
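The tire tread data from the example make this pattern easy to see without a plot. The sketch below fits a straight line by least squares and prints the sign of each residual in order of mileage. The helper name `least_squares` is illustrative; the data are the nine mileage and groove depth pairs given in the example.

```python
def least_squares(x, y):
    """Return the intercept b0 and slope b1 of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Tire tread data from the example (mileage in 1000 miles, groove depth in mils).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
groove = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

b0, b1 = least_squares(mileage, groove)
residuals = [y - (b0 + b1 * x) for x, y in zip(mileage, groove)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"fitted line: groove = {b0:.2f} + ({b1:.3f}) * mileage")
print("residual signs by mileage:", signs)
```

The signs come out as one positive, six negative, then two positive: the residuals are above 0 at both ends of the mileage range and below 0 in the middle, which is the systematic departure described above.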

25
Example: How is plutonium activity related to
alpha particle counts?
26
A residual versus fits plot suggesting
non-constant error variance
27
How non-constant error variance shows up on a
residual vs. fits plot
  • The plot has a fanning effect, such as the
    residuals being close to 0 for small X values and
    being much more spread out for large X values.
  • The fanning effect can also be in the reverse
    direction.
  • Or, the spread of the residuals can vary in some
    complex fashion.
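A crude numerical companion to the fanning pattern is to compare the typical size of the residuals for small and large fitted values. This is not a method from the slides, only an informal sketch; the function name `spread_by_half` and the fitted values and residuals are hypothetical.

```python
def spread_by_half(fits, residuals):
    """Mean absolute residual for the lower and upper halves of the fitted values."""
    pairs = sorted(zip(fits, residuals))  # order the residuals by fitted value
    half = len(pairs) // 2
    lower = [abs(e) for _, e in pairs[:half]]
    upper = [abs(e) for _, e in pairs[half:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)

# Hypothetical fitted values and residuals showing a fanning effect.
fits = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.5, -1.4]

low_spread, high_spread = spread_by_half(fits, residuals)
print(f"mean |residual|: low fits = {low_spread:.2f}, high fits = {high_spread:.2f}")
```

A much larger value in one half than the other is a hint, not a test, that the error variance may not be constant; the plot remains the primary tool here.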

28
Example: Relationship between tobacco use and
alcohol use?

Region             Alcohol   Tobacco
North                6.47      4.03
Yorkshire            6.13      3.76
Northeast            6.19      3.77
East Midlands        4.89      3.34
West Midlands        5.63      3.47
East Anglia          4.52      2.92
Southeast            5.89      3.20
Southwest            4.79      2.71
Wales                5.27      3.53
Scotland             6.08      4.51
Northern Ireland     4.02      4.56

  • Family Expenditure Survey of British Dept. of
    Employment
  • X = average weekly expenditure on tobacco
  • Y = average weekly expenditure on alcohol

29
Example: Relationship between tobacco use and
alcohol use?
30
A residual versus fits plot suggesting an
outlier exists.
31
How large does a residual need to be before being
flagged?
  • The magnitude of the residuals depends on the
    units of the response variable.
  • Make the residuals unitless by dividing by
    their standard deviation. That is, use
    standardized residuals.
  • Then, an observation with a standardized residual
    greater than 2 or smaller than -2 should be
    flagged for further investigation.
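To make the rule concrete, the sketch below computes standardized residuals for the alcohol and tobacco data from the example. It assumes the usual definition, in which each residual is divided by its own estimated standard deviation, sqrt(MSE × (1 − hi)), where hi is the leverage of observation i. That this is the exact quantity the software reports is an assumption, and the function name `standardized_residuals` is illustrative only.

```python
import math

# Alcohol (Y) and tobacco (X) expenditure data from the example, in table order.
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard deviations (assumed definition)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    result = []
    for xi, ei in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        result.append(ei / math.sqrt(mse * (1 - leverage)))
    return result

std_resid = standardized_residuals(tobacco, alcohol)
# Observation numbers (1-based) with a standardized residual beyond +/- 2.
flagged = [i + 1 for i, r in enumerate(std_resid) if abs(r) > 2]

for obs in flagged:
    print(f"observation {obs}: standardized residual = {std_resid[obs - 1]:.3f}")
```

Under this definition only observation 11 (Northern Ireland) is flagged, with a standardized residual of roughly -2.6.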

32
Standardized residuals versus fits plot
33
Minitab identifies observations with large
standardized residuals
Unusual Observations

Obs  Tobacco  Alcohol    Fit  SE Fit   Resid  St Resid
 11     4.56    4.020  5.728   0.482  -1.708    -2.58R

R denotes an observation with a large standardized residual.
34
Anscombe data set 3
35
A residual versus fits plot suggesting an
outlier exists.
36
How an outlier shows up on a residuals vs. fits
plot
  • The observation's residual stands apart from the
    basic random pattern of the rest of the
    residuals.
  • The random pattern of the residual plot can even
    disappear if one outlier really deviates from the
    straight line of the rest of the data.

37
Other simple plots that might help spot an outlier
  • Boxplots
  • Stem-n-leaf plots
  • Dotplots

38
Boxplot of residuals for Alcohol (Y) and Tobacco
(X) example
39
Dotplot of residuals for Alcohol (Y) and Tobacco
(X) example
40
Residuals vs. order plots to assess
non-independence of error terms
  • If the data are obtained in a time (or space)
    sequence, a residuals vs. order plot helps to
    see if there is any correlation between error
    terms that are near each other in the sequence.
  • A horizontal band bouncing randomly around 0
    suggests errors are independent, while a
    systematic pattern suggests not.
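One informal number to read alongside a residuals vs. order plot is the correlation between each residual and the one before it in the sequence. This is a swapped-in illustration rather than a method from the slides; the function name `lag1_correlation` and both residual sequences are hypothetical.

```python
def lag1_correlation(residuals):
    """Correlation between each residual and its predecessor in the sequence."""
    n = len(residuals)
    mean = sum(residuals) / n
    numerator = sum(
        (residuals[t] - mean) * (residuals[t - 1] - mean) for t in range(1, n)
    )
    denominator = sum((e - mean) ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals listed in time order.
trending = [-3, -2, -1, 0, 1, 2, 3]           # neighbours resemble each other
alternating = [1, -1, 1, -1, 1, -1, 1, -1]    # neighbours have opposite signs

r_trend = lag1_correlation(trending)
r_alt = lag1_correlation(alternating)
print(f"trending: {r_trend:.2f}, alternating: {r_alt:.2f}")
```

A value well above 0 corresponds to a drifting pattern on the plot, and a value well below 0 to a zigzag pattern; either is a systematic pattern that suggests the errors are not independent.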

41
Residuals vs order plots suggesting
non-independence of error terms
42
Normal probability plot to assess normality of
error terms
  • Plot of residuals on horizontal axis against
    expected values of the residuals under normality
    (normal scores) on vertical axis.
  • Plot that is nearly linear suggests normality of
    error terms.
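The points on such a plot can be computed as in the sketch below. It assumes one common approximation for the normal scores, the standard normal quantile at (i − 0.375) / (n + 0.25) for the i-th smallest residual; statistical packages may use a slightly different formula. The residuals and function names are hypothetical.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected values of n ordered standard normal observations."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def probability_plot_points(residuals):
    """Pair each sorted residual with its normal score."""
    ordered = sorted(residuals)
    return list(zip(ordered, normal_scores(len(ordered))))

def correlation(pairs):
    """Pearson correlation of the (residual, normal score) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    return sab / (saa * sbb) ** 0.5

# Hypothetical residuals, for illustration only.
residuals = [-1.2, 0.3, 0.8, -0.4, 0.1, 1.1, -0.7]
points = probability_plot_points(residuals)
scores = [score for _, score in points]
r = correlation(points)
print(f"correlation between residuals and normal scores: {r:.3f}")
```

A correlation close to 1 corresponds to a nearly linear plot; how close is close enough is the subject of a formal test rather than of this plot.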

43
Normal probability plot interpretation
[Figure: three example plots, labeled skewed left, normal, and skewed right]
44
Normal probability plot for Alcohol (X) and
Strength (Y) example
45
Normal probability plot for Tree diameter (X) and
C-dating Age (Y)
46
Residuals vs omitted predictors plots
  • To determine whether there are any other key
    variables that could provide additional
    predictive power to the response.
  • Look for systematic patterns.
  • If the plot reveals that the residuals vary
    systematically, we don't say the original model
    is wrong. It's just that it can be improved.

47
Residuals vs omitted plot
48
In summary,
49
Nonlinearity of regression function
  • Scatter plot of response versus predictor
  • (Standardized) residuals versus fits plot
  • (Standardized) residuals versus predictor plot

50
Nonconstancy of error variance
  • (Standardized) residuals versus fits plot
  • (Standardized) residuals versus predictor plot

51
Presence of outliers
  • (Standardized) residuals versus fits plot
  • (Standardized) residuals versus predictor plot
  • Box plots, stem-n-leaf plots, dot plots of
    (standardized) residuals

52
Non-independence of error terms
  • (Standardized) residuals versus order plot

53
Non-normality of error terms
  • Normal probability plots
  • Box plots, dotplots, stem-n-leaf plots
  • Mean far from median?

54
Residual vs plots in Minitab
  • Stat >> Regression >> Regression.
  • Specify predictor and response.
  • Under Graphs, specify whether regular or
    standardized residuals desired. Select which
    residual plots are desired. If residual versus
    predictor plot desired, specify predictor in box.
  • Select OK. Select OK.

55
Boxplots, dotplots, etc. of residuals
  • Stat >> Regression >> Regression
  • Specify predictor and response.
  • Under Storage, select residuals and/or
    standardized residuals. They will be stored in
    worksheet. Then:
  • Graph >> Boxplot or Graph >> Dotplot or
    Graph >> Stem-and-Leaf