Title: Diagnostics
1Diagnostics Part I
- Using plots to check whether the assumptions we
made about the model are realistic
2Diagnostic methods
- Some simple (but subjective) plots. (Now)
- Formal statistical tests. (Next)
3Review of some simple plots
- while checking scope of model
4Dot Plot
5Dot Plot
- Summarizes quantitative data.
- Horizontal axis represents measurement scale.
- Plot one dot for each data point.
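The three points above can be sketched as a text-only dot plot. This is a minimal illustration, not the plot from the slide: the function name `dot_plot` is hypothetical, and the sample values are made up (small non-negative integers so that each value fits in a three-character column).

```python
from collections import Counter

def dot_plot(values):
    """Text dot plot: horizontal axis is the measurement scale,
    and one '*' is stacked above a value for each data point equal to it."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    height = max(counts.values())
    rows = []
    # Build the stacks from the top level down to the axis.
    for level in range(height, 0, -1):
        row = "".join(" * " if counts.get(v, 0) >= level else "   "
                      for v in range(lo, hi + 1))
        rows.append(row.rstrip())
    # Axis labels, one three-character column per integer value.
    rows.append("".join(f"{v:^3}" for v in range(lo, hi + 1)).rstrip())
    return "\n".join(rows)

sample = [2, 3, 3, 4, 4, 4, 5, 7]  # illustrative values, not from the slides
print(dot_plot(sample))
```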
6Stem-and-Leaf Plot
Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0

  12   0  223334444444
  63   0  555555555555566666666677777778888888888888999999999
 (33)  1  000000000000011112222233333333444
  43   1  555555556667777888
  25   2  0000000000023
  12   2  5557
   8   3  0023
   4   3
   4   4  00
   2   4
   2   5  0
   1   5
   1   6
   1   6
   1   7
   1   7  5
7Stem-and-Leaf Plot
- Summarizes quantitative data.
- Each data point is broken down into a stem and
a leaf.
- First, stems are aligned in a column.
- Then, leaves are attached to the stems.
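The construction just described can be sketched in a few lines. This is a simplified illustration under stated assumptions: the names `stem_and_leaf` and `render` are hypothetical, the data are made-up non-negative values, and each stem gets a single row (the Minitab display on the previous slide splits each stem into two rows and adds a cumulative count column, which this sketch omits).

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Break each value into a stem (leading digits) and a leaf (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # express the value in leaf units
        stems[scaled // 10].append(scaled % 10)
    return dict(stems)

def render(stems):
    """Align stems in a column, then attach the leaves to each stem."""
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)

data = [2, 5, 9, 10, 13, 13, 20, 25, 34, 75]  # illustrative values
print(render(stem_and_leaf(data)))
```

Stems with no data points are still printed, so gaps in the data stay visible.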
8Box Plot
9Box Plot
- Summarizes quantitative data.
- Vertical (or horizontal) axis represents
measurement scale.
- Lines in box represent the 25th percentile
(first quartile), the 50th percentile
(median), and the 75th percentile (third
quartile), respectively.
10Box Plot (contd)
- Whiskers are drawn to the most extreme data
points that are not more than 1.5 times the
length of the box beyond either quartile.
- Whiskers are useful for identifying outliers.
- Outliers, or extreme observations, are denoted
by asterisks.
- Generally, data points falling beyond the
whiskers are considered outliers.
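The whisker rule can be written out numerically. The sketch below is an illustration, not Minitab's implementation: the function name `box_plot_summary` is hypothetical, the sample is made up, and the quartiles come from Python's `statistics.quantiles` with its default method. Packages differ in how they compute quartiles, so results for small samples may not match other software exactly.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers using the 1.5 x box-length rule."""
    q1, median, q3 = quantiles(values, n=4)   # 25th, 50th, 75th percentiles
    box_length = q3 - q1
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers end at the most extreme points still inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # illustrative values
summary = box_plot_summary(sample)
print(summary)
```

For this sample the only point beyond the whiskers is 30, which is the one that would be drawn as an asterisk.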
11Okay, now the really new stuff
12Simple linear regression model
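The response $Y_i$ is a function of a systematic
linear component and a random error component,
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,
with assumptions that
- Error terms have mean 0, i.e., $E(\varepsilon_i) = 0$.
- $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated (independent) for $i \neq j$.
- Error terms have same variance, i.e., $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
- Error terms $\varepsilon_i$ are normally distributed.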
13Why should we keep nagging ourselves about the
model?
- All of the estimates, confidence intervals,
prediction intervals, hypothesis tests, etc. have
been developed assuming that the model is
correct.
- If the model is incorrect, then the formulas and
methods we use are at risk of being incorrect.
(Some are more forgiving than others.)
14Things that can go wrong with the model
- Regression function is not linear.
- Error terms do not have constant variance.
- Error terms are not independent.
- The model fits all but one or a few outlier
observations.
- Error terms are not normally distributed.
- One or more important predictor variables have
been left out of the model.
15Residual analysis: the basic idea
We would expect the observed residuals,
$e_i = Y_i - \hat{Y}_i$,
to reflect the properties assumed for the
unknown true error terms,
$\varepsilon_i = Y_i - E(Y_i)$.
So, investigate the observed residuals to see if
they behave properly.
16Some points of clarification about residuals
- The mean of the residuals, e-bar, is 0. So, there
is no need to check that the mean of the residuals
is 0; the LS (least squares) estimation method has
made it so.
- The residuals are not independent, since they are
all a function of the same estimated regression
function.
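Both points can be checked numerically. The sketch below fits a least squares line to a small made-up data set (the values and the helper name `least_squares` are assumptions for illustration) and shows that the residuals sum to 0 and are also orthogonal to the predictor. These two constraints come from the least squares normal equations and are what tie the residuals together.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(residuals)                                # constraint 1: sum of e_i
sum_xe = sum(xi * e for xi, e in zip(x, residuals))   # constraint 2: sum of x_i * e_i
print(f"sum of residuals:     {sum_e:.2e}")
print(f"sum of x * residuals: {sum_xe:.2e}")
```

Both sums are 0 up to floating-point rounding, whatever data are used.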
17Example: Alcohol consumption (X) and arm muscle
strength (Y)
18A well-behaved residuals vs. fits plot
19Characteristics of a well-behaved residual
versus fits plot
- The residuals bounce randomly around the 0
line. (Linearity is reasonable.)
- No one residual stands out from the basic
random pattern of residuals. (No outliers.)
- The residuals roughly form a horizontal band
around the 0 line. (Constant variance.)
20Residuals versus predictor plot offers nothing
different.
21Example: Is tire tread wear linearly related to
mileage?
mileage   groove
    0     394.33
    4     329.50
    8     291.00
   12     255.17
   16     229.33
   20     204.83
   24     179.00
   28     163.83
   32     150.33
X = mileage in 1000 miles, Y = groove depth in
mils (0.001 inches)
22Example: Is tire tread wear linearly related to
mileage?
23A residual versus fits plot suggesting
relationship is not linear
24How a non-linear function shows up on a residual
versus fits plot
- The residuals depart from 0 in a systematic
fashion, such as being positive for small X
values, negative for medium X values, and
positive again for large X values.
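The tire tread data from the earlier slide show this pattern directly. The sketch below fits a straight line by least squares in plain Python (the helper name `least_squares` is an assumption for illustration) and prints the sign of each residual in order of increasing mileage.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Tire tread data from the slides: mileage (1000 miles), groove depth (mils).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
groove = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

b0, b1 = least_squares(mileage, groove)
residuals = [y - (b0 + b1 * x) for x, y in zip(mileage, groove)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

for x, e in zip(mileage, residuals):
    print(f"mileage {x:>2}: residual {e:8.2f}")
print("sign pattern:", signs)
```

The residual is positive at the smallest mileage, negative through the middle of the range, and positive again at the two largest mileages, which is the systematic departure from 0 described above.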
25Example: How is plutonium activity related to
alpha particle counts?
26A residual versus fits plot suggesting
non-constant error variance
27How non-constant error variance shows up on a
residual vs. fits plot
- The plot has a fanning effect, such as the
residuals being close to 0 for small X values and
being much more spread out for large X values.
- The fanning effect can also be in the reverse
direction.
- Or, the spread of the residuals can vary in some
complex fashion.
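A rough numeric companion to the fanning pattern is to compare the typical size of the residuals at low and high fitted values. The sketch below is only an informal illustration: the data are synthetic (constructed so that the error size grows with X; they are not the plutonium data), and the names `fit_line`, `spread_low`, and `spread_high` are hypothetical. It is not a formal test.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Synthetic data (an assumption for illustration): the deviation from the
# line alternates in sign and grows in proportion to X.
x = list(range(1, 21))
y = [2 + 3 * xi + 0.5 * xi * (-1) ** xi for xi in x]

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]

# Order the residuals by fitted value, then compare the two halves.
ordered = [e for _, e in sorted(zip(fits, residuals))]
half = len(ordered) // 2
spread_low = sum(abs(e) for e in ordered[:half]) / half
spread_high = sum(abs(e) for e in ordered[half:]) / (len(ordered) - half)
print(f"mean |residual|, low fits:  {spread_low:.2f}")
print(f"mean |residual|, high fits: {spread_high:.2f}")
```

A much larger spread in one half than in the other is the numeric counterpart of the fan shape on the plot.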
28Example: Relationship between tobacco use and
alcohol use?
Region            Alcohol  Tobacco
North               6.47     4.03
Yorkshire           6.13     3.76
Northeast           6.19     3.77
EastMidlands        4.89     3.34
WestMidlands        5.63     3.47
EastAnglia          4.52     2.92
Southeast           5.89     3.20
Southwest           4.79     2.71
Wales               5.27     3.53
Scotland            6.08     4.51
Northern Ireland    4.02     4.56
- Family Expenditure Survey of British Dept. of
Employment
- X = average weekly expenditure on tobacco
- Y = average weekly expenditure on alcohol
29Example: Relationship between tobacco use and
alcohol use?
30A residual versus fits plot suggesting an
outlier exists.
31How large does a residual need to be before being
flagged?
- The magnitude of the residuals depends on the
units of the response variable. - Make the residuals unitless by dividing by
their standard deviation. That is, use
standardized residuals. - Then, an observation with a standardized residual
greater than 2 or smaller than -2 should be
flagged for further investigation.
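As a sketch of this rule, the code below computes standardized residuals for the alcohol and tobacco data from the earlier slide. It assumes the usual definition for simple linear regression, $r_i = e_i / (s\sqrt{1 - h_i})$ with leverage $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$ and $s^2 = \mathrm{SSE}/(n-2)$; the slides do not spell out the formula, so treat that choice, and the function name `standardized_residuals`, as assumptions.

```python
from math import sqrt

# Alcohol and tobacco data from the slides (average weekly expenditure).
regions = ["North", "Yorkshire", "Northeast", "EastMidlands", "WestMidlands",
           "EastAnglia", "Southeast", "Southwest", "Wales", "Scotland",
           "Northern Ireland"]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard deviation (assumed formula)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e * e for e in resid) / (n - 2))   # root mean squared error
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx         # leverage of this observation
        out.append(e / (s * sqrt(1 - h)))
    return out

std_resid = standardized_residuals(tobacco, alcohol)
flagged = [r for r, z in zip(regions, std_resid) if abs(z) > 2]

for r, z in zip(regions, std_resid):
    print(f"{r:<17} {z:6.2f}")
print("flagged:", flagged)
```

Under this definition the only region flagged by the greater-than-2 rule is Northern Ireland, with a standardized residual of about -2.58.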
32Standardized residuals versus fits plot
33Minitab identifies observations with large
standardized residuals
Unusual Observations
Obs  Tobacco  Alcohol    Fit  SE Fit   Resid  St Resid
 11     4.56    4.020  5.728   0.482  -1.708    -2.58R
R denotes an observation with a large
standardized residual.
34Anscombe data set 3
35A residual versus fits plot suggesting an
outlier exists.
36How an outlier shows up on a residuals vs. fits
plot
- The observation's residual stands apart from the
basic random pattern of the rest of the
residuals.
- The random pattern of the residual plot can even
disappear if one outlier really deviates from the
straight line of the rest of the data.
37Other simple plots that might help spot an outlier
- Boxplots
- Stem-n-leaf plots
- Dotplots
38Boxplot of residuals for Alcohol (Y) and Tobacco
(X) example
39Dotplot of residuals for Alcohol (Y) and Tobacco
(X) example
40Residuals vs. order plots to assess
non-independence of error terms
- If the data are obtained in a time (or space)
sequence, a residuals vs. order plot helps to
see if there is any correlation between error
terms that are near each other in the sequence.
- A horizontal band bouncing randomly around 0
suggests errors are independent, while a
systematic pattern suggests not.
41Residuals vs order plots suggesting
non-independence of error terms
42Normal probability plot to assess normality of
error terms
- Plot of residuals on horizontal axis against
expected values of the residuals under normality
(normal scores) on vertical axis.
- Plot that is nearly linear suggests normality of
error terms.
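The points on such a plot can be computed by pairing each ordered residual with a normal score. The sketch below is an illustration under stated assumptions: it uses the plotting positions $(i - 0.375)/(n + 0.25)$, a common approximation that may differ from what a given package uses, the residual values are made up, and the function names are hypothetical. The correlation at the end is only an informal summary of how linear the plot is.

```python
from math import sqrt
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected values of n ordered standard normal observations."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def probability_plot_points(residuals):
    """Pair each ordered residual (horizontal) with its normal score (vertical)."""
    ordered = sorted(residuals)
    return list(zip(ordered, normal_scores(len(ordered))))

def pearson(a, b):
    """Plain correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)

residuals = [-1.2, 0.4, -0.3, 1.1, 0.0]   # illustrative residuals
points = probability_plot_points(residuals)
for e, z in points:
    print(f"residual {e:5.2f}   normal score {z:6.3f}")
print("correlation:", round(pearson(*zip(*points)), 3))
```

A correlation close to 1 corresponds to a nearly linear plot; how close is close enough is a judgment call when reading the plot.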
43Normal probability plot interpretation
(Figure panels: skewed left, normal, skewed right.)
44Normal probability plot for Alcohol (X) and
Strength (Y) example
45Normal probability plot for Tree diameter (X) and
C-dating Age (Y)
46Residuals vs omitted predictors plots
- Used to determine whether any other key
variables could provide additional predictive
power for the response.
- Look for systematic patterns.
- If the plot reveals that the residuals vary
systematically, we don't say the original model
is wrong. It's just that it can be improved.
47Residuals vs omitted plot
48In summary,
49Nonlinearity of regression function
- Scatter plot of response versus predictor
- (Standardized) residuals versus fits plot
- (Standardized) residuals versus predictor plot
50Nonconstancy of error variance
- (Standardized) residuals versus fits plot
- (Standardized) residuals versus predictor plot
51Presence of outliers
- (Standardized) residuals versus fits plot
- (Standardized) residuals versus predictor plot
- Box plots, stem-n-leaf plots, dot plots of
(standardized) residuals
52Non-independence of error terms
- (Standardized) residuals versus order plot
53Non-normality of error terms
- Normal probability plots
- Box plots, dotplots, stem-n-leaf plots
- Mean far from median?
54Residual vs plots in Minitab
- Stat >> Regression >> Regression.
- Specify predictor and response.
- Under Graphs, specify whether regular or
standardized residuals desired. Select which
residual plots are desired. If residual versus
predictor plot desired, specify predictor in box.
- Select OK. Select OK.
55Boxplots, dotplots, etc. of residuals
- Stat >> Regression >> Regression.
- Specify predictor and response.
- Under Storage, select residuals and/or
standardized residuals. They will be stored in
the worksheet.
- Then, Graph >> Boxplot, Graph >> Dotplot, or
Graph >> Stem-and-Leaf.