Diagnostics


Provided by: lsi4


1
Diagnostics Part I

Using plots to check whether the assumptions we
made about the model are realistic

2
Diagnostic methods

Some simple (but subjective) plots. (Now)
Formal statistical tests. (Next)

3
Review of some simple plots

while checking scope of model

4
Dot Plot
5
Dot Plot

Summarizes quantitative data.
Horizontal axis represents measurement scale.
Plot one dot for each data point.

6
Stem-and-Leaf Plot
Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0

  12   0  223334444444
  63   0  555555555555566666666677777778888888888888999999999
 (33)  1  000000000000011112222233333333444
  43   1  555555556667777888
  25   2  0000000000023
  12   2  5557
   8   3  0023
   4   3
   4   4  00
   2   4
   2   5  0
   1   5
   1   6
   1   6
   1   7
   1   7  5
7
Stem-and-Leaf Plot

Summarizes quantitative data.
Each data point is broken down into a stem and
a leaf.
First, stems are aligned in a column.
Then, leaves are attached to the stems.
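
The two construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Minitab builds its display: it assumes non-negative integers with a leaf unit of 1, prints one line per stem, and omits the depth column. The function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a simple stem-and-leaf display for non-negative integers.

    Leaf unit is 1: the stem is the tens part, the leaf is the ones digit.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # First, stems are aligned in a column (empty stems included) ...
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        # ... then the sorted leaves are attached to each stem.
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

# Hypothetical data, for illustration only.
display = stem_and_leaf([7, 12, 15, 21, 23, 41])
print("\n".join(display))
```

Keeping the empty stem (3 in this data) preserves the shape of the distribution, which is the point of the display.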

8
Box Plot
9
Box Plot

Summarizes quantitative data.
Vertical (or horizontal) axis represents
measurement scale.
Lines in box represent the 25th percentile
(first quartile), the 50th percentile
(median), and the 75th percentile (third
quartile), respectively.

10
Box Plot (contd)

Whiskers are drawn to the most extreme data
points that are not more than 1.5 times the
length of the box beyond either quartile.
Whiskers are useful for identifying outliers.
Outliers, or extreme observations, are denoted
by asterisks.
Generally, data points falling beyond the
whiskers are considered outliers.
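
The whisker and outlier rule above can be sketched as follows. This is an illustration under stated assumptions: quartile conventions differ between packages, so the `"inclusive"` method of Python's `statistics.quantiles` is just one choice and may not match Minitab's quartiles exactly. The function name `box_plot_summary` and the data are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return quartiles, whisker ends, and outliers under the 1.5 x box-length rule."""
    data = sorted(values)
    # Quartile conventions differ between packages; "inclusive" is one choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "quartiles": (q1, median, q3),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical data, for illustration only.
summary = box_plot_summary([2, 3, 4, 4, 5, 5, 6, 7, 8, 30])
print(summary)
```

Here the value 30 lies beyond the upper fence, so it would be drawn as an asterisk, and the upper whisker stops at 8.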

11
Okay, now the really new stuff
12
Simple linear regression model
The response Y_i is a function of a systematic
linear component and a random error component,
Y_i = β_0 + β_1 X_i + ε_i, with the assumptions that:

Error terms have mean 0, i.e., E(ε_i) = 0.
ε_i and ε_j are uncorrelated (independent).
Error terms have the same variance, i.e., Var(ε_i) = σ^2.
Error terms ε_i are normally distributed.

13
Why should we keep nagging ourselves about the
model?

All of the estimates, confidence intervals,
prediction intervals, hypothesis tests, etc. have
been developed assuming that the model is
correct.
If the model is incorrect, then the formulas and
methods we use are at risk of being incorrect.
(Some are more forgiving than others.)

14
Things that can go wrong with the model

Regression function is not linear.
Error terms do not have constant variance.
Error terms are not independent.
The model fits all but one or a few outlier
observations.
Error terms are not normally distributed.
One or more important predictor variables have
been left out of the model.

15
Residual analysis: the basic idea
We would expect the observed residuals,
e_i = Y_i - Yhat_i (observed response minus fitted value),
to reflect the properties assumed for the
unknown true error terms.
So, investigate the observed residuals to see if
they behave properly.
16
Some points of clarification about residuals

The mean of the residuals, e-bar, is 0. So, there is no
need to check that the mean of the residuals is 0;
the least squares (LS) estimation method has made it so.
The residuals are not independent, since they are
all a function of the same estimated regression
function.
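
Both points follow from the least squares conditions. The short derivation below is a sketch for the simple linear regression model with an intercept; the symbols b_0 and b_1 (the estimated intercept and slope) and Q (the sum of squared deviations) are notation assumed here, not taken from the slides.

```latex
\begin{align*}
Q(b_0, b_1) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \\
\frac{\partial Q}{\partial b_0} &= -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)
  = -2 \sum_{i=1}^{n} e_i = 0
  \;\Longrightarrow\; \bar{e} = 0 \\
\frac{\partial Q}{\partial b_1} &= -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i)
  = -2 \sum_{i=1}^{n} X_i e_i = 0
\end{align*}
```

The n residuals therefore satisfy two linear constraints, which is why they cannot be fully independent even when the true error terms are.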

17
Example: Alcohol consumption (X) and arm muscle
strength (Y)
18
A well-behaved residuals vs. fits plot
19
Characteristics of a well-behaved residual
versus fits plot

The residuals bounce randomly around the 0
line. (Linear is reasonable).
No one residual stands out from the basic
random pattern of residuals. (No outliers).
The residuals roughly form a horizontal band
around the 0 line. (Constant variance).

20
Residuals versus predictor plot offers nothing
different.
21
Example: Is tire tread wear linearly related to
mileage?

mileage   groove
      0   394.33
      4   329.50
      8   291.00
     12   255.17
     16   229.33
     20   204.83
     24   179.00
     28   163.83
     32   150.33

X = mileage in 1000 miles
Y = groove depth in mils (0.001 inches)
22
Example: Is tire tread wear linearly related to
mileage?
23
A residual versus fits plot suggesting
relationship is not linear
24
How a non-linear function shows up on a residual
versus fits plot

The residuals depart from 0 in a systematic
fashion, such as being positive for small X
values, negative for medium X values, and
positive again for large X values.
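
As a rough check of this pattern, the sketch below fits a straight line to the tire tread data listed earlier and prints the sign of each residual in order of mileage. It uses plain least squares formulas rather than Minitab; the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Tire tread data from the example earlier in this transcript.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
groove = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

b0, b1 = fit_line(mileage, groove)
residuals = [y - (b0 + b1 * x) for x, y in zip(mileage, groove)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"fit: groove = {b0:.1f} + ({b1:.2f}) * mileage")
print("residual signs in order of mileage:", signs)
```

The residuals are positive at the lowest mileage, negative through the middle, and positive again at the highest mileages, which is the systematic departure from 0 described above.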

25
Example: How is plutonium activity related to
alpha particle counts?
26
A residual versus fits plot suggesting
non-constant error variance
27
How non-constant error variance shows up on a
residual vs. fits plot

The plot has a fanning effect, such as the
residuals being close to 0 for small X values and
being much more spread out for large X values.
The fanning effect can also be in the reverse
direction.
Or, the spread of the residuals can vary in some
complex fashion.
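
The fanning effect can also be seen numerically. The sketch below is an illustration only: the data are synthetic (not the plutonium data, which the transcript does not list), built so that the error size grows with X, and comparing the average absolute residual in the lower and upper halves of the fits is an informal device assumed here, not a method from the slides.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Synthetic data (not from the presentation): the error size grows with x.
x = list(range(1, 21))
y = [2 + 3 * xi + 0.5 * xi * (1 if xi % 2 else -1) for xi in x]

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]

# Compare the typical residual size for the lower and upper halves of the fits.
pairs = sorted(zip(fits, residuals))
half = len(pairs) // 2
spread_low = sum(abs(e) for _, e in pairs[:half]) / half
spread_high = sum(abs(e) for _, e in pairs[half:]) / (len(pairs) - half)

print(f"mean |residual|, lower half of fits: {spread_low:.2f}")
print(f"mean |residual|, upper half of fits: {spread_high:.2f}")
```

A much larger spread in one half than the other is the numeric counterpart of the fan shape on the plot.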

28
Example: Relationship between tobacco use and
alcohol use?

Region             Alcohol   Tobacco
North                 6.47      4.03
Yorkshire             6.13      3.76
Northeast             6.19      3.77
EastMidlands          4.89      3.34
WestMidlands          5.63      3.47
EastAnglia            4.52      2.92
Southeast             5.89      3.20
Southwest             4.79      2.71
Wales                 5.27      3.53
Scotland              6.08      4.51
Northern Ireland      4.02      4.56

Family Expenditure Survey of British Dept. of
Employment
X = average weekly expenditure on tobacco
Y = average weekly expenditure on alcohol

29
Example: Relationship between tobacco use and
alcohol use?
30
A residual versus fits plot suggesting an
outlier exists.
31
How large does a residual need to be before being
flagged?

The magnitude of the residuals depends on the
units of the response variable.
Make the residuals unitless by dividing by
their standard deviation. That is, use
standardized residuals.
Then, an observation with a standardized residual
greater than 2 or smaller than -2 should be
flagged for further investigation.

32
Standardized residuals versus fits plot
33
Minitab identifies observations with large
standardized residuals
Unusual Observations

Obs  Tobacco  Alcohol    Fit  SE Fit   Resid  St Resid
 11     4.56    4.020  5.728   0.482  -1.708    -2.58R

R denotes an observation with a large standardized residual.
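
The flagged value can be reproduced by hand from the alcohol and tobacco data listed earlier. The sketch below assumes the standardized residual is e_i / (s * sqrt(1 - h_i)), where s is the square root of the mean squared error and h_i is the leverage of observation i. That formula is an assumption about what the software computes, although it does give a value close to the St Resid shown in the output above.

```python
import math

# Alcohol and tobacco data from the example earlier in this transcript.
regions = ["North", "Yorkshire", "Northeast", "EastMidlands", "WestMidlands",
           "EastAnglia", "Southeast", "Southwest", "Wales", "Scotland",
           "Northern Ireland"]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]

n = len(tobacco)
x_bar = sum(tobacco) / n
y_bar = sum(alcohol) / n
sxx = sum((x - x_bar) ** 2 for x in tobacco)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tobacco, alcohol))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(tobacco, alcohol)]
# s estimates the error standard deviation (n - 2 degrees of freedom).
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

std_resid = []
for x, e in zip(tobacco, residuals):
    leverage = 1 / n + (x - x_bar) ** 2 / sxx  # h_i (assumed formula)
    std_resid.append(e / (s * math.sqrt(1 - leverage)))

# Flag observations with a standardized residual beyond +/- 2.
flagged = [(name, round(r, 2)) for name, r in zip(regions, std_resid) if abs(r) > 2]
print(flagged)
```

Under these assumptions only Northern Ireland (observation 11) is flagged, in agreement with the Minitab output.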
34
Anscombe data set 3
35
A residual versus fits plot suggesting an
outlier exists.
36
How an outlier shows up on a residuals vs. fits
plot

The observation's residual stands apart from the
basic random pattern of the rest of the
residuals.
The random pattern of the residual plot can even
disappear if one outlier really deviates from the
straight line of the rest of the data.

37
Other simple plots that might help spot an outlier

Boxplots
Stem-n-leaf plots
Dotplots

38
Boxplot of residuals for Alcohol (Y) and Tobacco
(X) example
39
Dotplot of residuals for Alcohol (Y) and Tobacco
(X) example
40
Residuals vs. order plots to assess
non-independence of error terms

If the data are obtained in a time (or space)
sequence, a residuals vs. order plot helps to
see if there is any correlation between error
terms that are near each other in the sequence.
A horizontal band bouncing randomly around 0
suggests errors are independent, while a
systematic pattern suggests not.
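
One informal numeric companion to the residuals vs. order plot is the correlation between each residual and the one before it in the sequence. The sketch below is an assumption-laden illustration: the function name `lag1_correlation` and both residual sequences are hypothetical, and the slides themselves rely on the plot, not on this number.

```python
def lag1_correlation(residuals):
    """Correlation between neighboring residuals in the observed order."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    numerator = sum(dev[t] * dev[t - 1] for t in range(1, n))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

# Hypothetical residual sequences, for illustration only.
trending = [-3, -2, -1, 0, 1, 2, 3]   # drifts steadily upward over time
alternating = [1, -1, 1, -1, 1, -1]   # flips sign at every step

r_trend = lag1_correlation(trending)
r_alt = lag1_correlation(alternating)
print(f"trending: {r_trend:.3f}, alternating: {r_alt:.3f}")
```

Values far from 0 in either direction correspond to the systematic patterns that suggest the errors are not independent; a value near 0 is consistent with a random horizontal band.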

41
Residuals vs order plots suggesting
non-independence of error terms
42
Normal probability plot to assess normality of
error terms

Plot of residuals on horizontal axis against
expected values of the residuals under normality
(normal scores) on vertical axis.
Plot that is nearly linear suggests normality of
error terms.
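
The normal scores can be approximated and paired with the sorted residuals as sketched below. The formula Phi^-1((i - 0.375) / (n + 0.25)) is one common approximation to the expected normal order statistics and is an assumption here; the software may use a different one. The correlation is only a rough numeric stand-in for judging linearity by eye, and both data sets are hypothetical.

```python
import math
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected values of n ordered standard normal observations."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def normal_plot_correlation(residuals):
    """Correlation between the sorted residuals and their normal scores."""
    ordered = sorted(residuals)
    return correlation(ordered, normal_scores(len(ordered)))

# Hypothetical residuals, for illustration only.
roughly_symmetric = [-2.1, -1.2, -0.6, -0.1, 0.2, 0.7, 1.1, 1.9]
skewed_right = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 2.5, 7.0]

r_symmetric = normal_plot_correlation(roughly_symmetric)
r_skewed = normal_plot_correlation(skewed_right)
print(f"symmetric: {r_symmetric:.3f}, skewed: {r_skewed:.3f}")
```

A correlation close to 1 corresponds to a nearly linear plot; the skewed sample gives a noticeably lower value because its points bend away from a straight line.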

43
Normal probability plot interpretation
skewed left
normal
skewed right
44
Normal probability plot for Alcohol (X) and
Strength (Y) example
45
Normal probability plot for Tree diameter (X) and
C-dating Age (Y)
46
Residuals vs omitted predictors plots

To determine whether there are any other key
variables that could provide additional
predictive power to the response.
Look for systematic patterns.
If the plot reveals that the residuals vary
systematically, we don't say the original model
is wrong. It's just that it can be improved.

47
Residuals vs. omitted predictor plot
48
In summary,
49
Nonlinearity of regression function

Scatter plot of response versus predictor
(Standardized) residuals versus fits plot
(Standardized) residuals versus predictor plot

50
Nonconstancy of error variance

(Standardized) residuals versus fits plot
(Standardized) residuals versus predictor plot

51
Presence of outliers

(Standardized) residuals versus fits plot
(Standardized) residuals versus predictor plot
Box plots, stem-n-leaf plots, dot plots of
(standardized) residuals

52
Non-independence of error terms

(Standardized) residuals versus order plot

53
Non-normality of error terms

Normal probability plots
Box plots, dotplots, stem-n-leaf plots
Mean far from median?

54
Residual plots in Minitab

Stat >> Regression >> Regression.
Specify predictor and response.
Under Graphs, specify whether regular or
standardized residuals desired. Select which
residual plots are desired. If residual versus
predictor plot desired, specify predictor in box.
Select OK. Select OK.

55
Boxplots, dotplots, etc. of residuals

Stat >> Regression >> Regression.
Specify predictor and response.
Under Storage, select residuals and/or
standardized residuals. They will be stored in
the worksheet. Then:
Graph >> Boxplot or Graph >> Dotplot or
Graph >> Stem-and-Leaf
