Title: Regression Diagnostics
Regression Diagnostics
Contents
- Residuals
- Graphical Methods
- Multicollinearity, Nonnormality, Heteroscedasticity, Autocorrelation
- Measures of Influence
4.1 Introduction
- The conditions required for the model must be checked; violation of any condition makes the inferences invalid.
- Is the error variable normally distributed? Draw a histogram of the residuals.
- Is the error variance constant?
- Are the errors independent? Plot the residuals versus the time periods.
- Can we identify outliers?
- Is multicollinearity (intercorrelation) a problem?
4.2 Residuals
- Ordinary least squares residuals: $e_i = Y_i - \hat{Y}_i$, $i = 1, \ldots, n$.
- Internally studentized residuals: $r_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1 - p_{ii}}}$.
- Externally studentized residuals: $r_i^* = \dfrac{e_i}{\hat{\sigma}_{(i)}\sqrt{1 - p_{ii}}}$,
where $p_{ii}$ is the $i$th diagonal element of the hat matrix $P = X(X'X)^{-1}X'$ and $\hat{\sigma}_{(i)}$ is the estimate of $\sigma$ with the $i$th observation deleted.
4.2 Residuals (continued)
Once the internally studentized residuals are calculated, the externally studentized residuals can be calculated through the relationship
$r_i^* = r_i \sqrt{\dfrac{n - p - 2}{n - p - 1 - r_i^2}}$.
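As a quick check of this relationship, here is a minimal R sketch; it assumes `fit` is an already-fitted lm() object (a placeholder, not a course file).

## Verify the relationship above for any lm() fit.
r <- rstandard(fit)                    # internally studentized residuals
n <- nobs(fit)                         # number of observations
p <- length(coef(fit)) - 1             # number of predictors
r.ext <- r * sqrt((n - p - 2) / (n - p - 1 - r^2))
all.equal(r.ext, rstudent(fit))        # should be TRUE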
4.3 Graphical Methods
- "There is no single statistical tool that is as powerful as a well-chosen graph." (Chambers et al., 1983)
- "Eye-balling can give diagnostic insights no formal diagnostics will ever provide." (Huber, 1991)
Anscombe's Quartet: Four Data Sets Having the Same Summary Statistics
Mean(Y) = 7.501, Mean(X) = 9.0, Std(Y) = 2.031, Std(X) = 3.32, Cor(Y, X) = 0.8, fitted line $\hat{Y} = 3 + 0.5X$, etc.
[Figure: scatter plots of the four Anscombe data sets, each with the same fitted line]
Graphical methods can be useful in many ways:
- Detect errors in the data (e.g., an outlying point may be the result of a typographical error),
- Recognize patterns in the data (e.g., clusters, outliers),
- Explore relationships among variables,
- Discover new phenomena,
- Confirm or negate assumptions,
- Assess the adequacy of a fitted model,
- Suggest remedial actions (e.g., transform the data),
- Enhance numerical analysis in general.
Graphical methods can be classified into two classes:
- Graphs before fitting the model. These are useful for correcting errors in the data and for selecting a model.
- Graphs after fitting the model. These are particularly useful for checking the model assumptions and for assessing the goodness of the fit.
Functionality of common plots
- Plot of Y vs. Xi, i = 1, ..., p, to reveal the Y-X relationship.
- Normal probability plot of the studentized residuals for checking the normality assumption.
- Scatter plots of the studentized residuals against each of the predictor variables for checking the linearity of the Y-X relation and the constancy of the error variance.
Functionality of common plots (continued)
- Scatter plot of the studentized residuals versus the fitted values (similar to the above).
- Index plot of the studentized residuals for checking the independence assumption.
- Matrix plot of the predictors for checking multicollinearity.
Example 4.1. Using R (lm() and plot()) to produce the following plots based on the motor inn data in Example 3.1; a sketch follows this list.
- Plot Y vs. each of X1, X2, X3, X4, X5, X6.
- Give a matrix plot of all X variables.
- Plot the OLS residuals vs. Xi, i = 1, 2, ..., 6.
- Plot the OLS residuals vs. the fitted values.
- Normal probability plot of the studentized residuals r (R command: qqnorm(r)).
- Index plot of the studentized residuals.
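A minimal R sketch of these plots, assuming the motor inn data of Example 3.1 are in a data frame named inn with response Y and predictors X1-X6 (placeholder names, not the actual file's):

inn.fit <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = inn)

par(mfrow = c(2, 3))
for (x in paste0("X", 1:6))            # Y vs. each predictor
  plot(inn[[x]], inn$Y, xlab = x, ylab = "Y")

pairs(inn[paste0("X", 1:6)])           # matrix plot of the predictors

e <- resid(inn.fit)                    # OLS residuals vs. each Xi
for (x in paste0("X", 1:6))
  plot(inn[[x]], e, xlab = x, ylab = "OLS residual")

plot(fitted(inn.fit), e,               # OLS residuals vs. fitted values
     xlab = "Fitted values", ylab = "OLS residual")

r <- rstudent(inn.fit)                 # studentized residuals
qqnorm(r); qqline(r)                   # normal probability plot
plot(r, xlab = "Index", ylab = "Studentized residual")   # index plot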
Diagnostics: Multicollinearity
- Example 4.2: Predicting house price (EX4-01.xls)
- A real estate agent believes that a house's selling price can be predicted from the house size, the number of bedrooms, and the lot size.
- A random sample of 100 houses was drawn and the data recorded.
- Analyze the relationship among the four variables.
Diagnostics: Multicollinearity
- The proposed model is PRICE = β0 + β1 BEDROOMS + β2 H-SIZE + β3 LOTSIZE + ε.
The model is valid, but no variable is significantly related to the selling price! Why?
Diagnostics: Multicollinearity
- Multicollinearity is found to be a problem (a sketch of how to check it follows this list).
- Multicollinearity causes two kinds of difficulties:
- The t statistics appear to be too small.
- The b coefficients cannot be interpreted as slopes.
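A hedged sketch of detecting the problem, assuming the Example 4.2 data are in a data frame houses with columns BEDROOMS, HSIZE, and LOTSIZE (placeholder names for the EX4-01.xls variables):

X <- houses[c("BEDROOMS", "HSIZE", "LOTSIZE")]
round(cor(X), 3)                       # high pairwise correlations signal trouble

## Variance inflation factors from first principles:
## VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the rest.
vif <- sapply(seq_along(X), function(j) {
  r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
  1 / (1 - r2)
})
names(vif) <- names(X)
vif                                    # values well above 10 indicate severe multicollinearity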
Remedying Violations of the Required Conditions
- Nonnormality or heteroscedasticity can be remedied by applying transformations to the Y variable.
- The transformations can also improve the linear relationship between the dependent variable and the independent variables.
- Many software systems make such transformations easy.
Reducing Nonnormality by Transformations
- A brief list of transformations (an R sketch follows this list):
- Y' = log Y (for Y > 0): use when $\sigma_\varepsilon$ increases with Y, or when the error distribution is positively skewed.
- Y' = Y^2: use when $\sigma_\varepsilon^2$ is proportional to E(Y), or when the error distribution is negatively skewed.
- Y' = Y^{1/2} (for Y > 0): use when $\sigma_\varepsilon^2$ is proportional to E(Y).
- Y' = 1/Y: use when $\sigma_\varepsilon^2$ increases significantly when Y increases beyond some critical value.
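A brief R sketch of refitting after each transformation; dat, Y, and X are placeholder names, not from the course files:

fit.log  <- lm(log(Y) ~ X, data = dat)    # sigma_e increases with Y / positive skew
fit.sq   <- lm(I(Y^2) ~ X, data = dat)    # negative skew
fit.sqrt <- lm(sqrt(Y) ~ X, data = dat)   # variance proportional to E(Y)
fit.inv  <- lm(I(1/Y) ~ X, data = dat)    # variance grows past a critical value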
Durbin-Watson Test: Are the Errors Autocorrelated?
- This test detects first-order autocorrelation between consecutive residuals in a time series.
- If autocorrelation exists, the error variables are not independent.
The test statistic is
$d = \dfrac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$,
where $e_i$ is the residual at time $i$.
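The statistic is easy to compute directly from the residuals; a minimal sketch for any lm() object fit (the helper name dw is hypothetical; lmtest::dwtest is a packaged alternative if that package is installed):

dw <- function(fit) {
  e <- resid(fit)
  sum(diff(e)^2) / sum(e^2)            # d = sum (e_i - e_{i-1})^2 / sum e_i^2
}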
Positive First Order Autocorrelation
[Figure: residuals plotted against time, drifting in runs about 0]
Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
Negative First Order Autocorrelation
[Figure: residuals plotted against time, alternating in sign about 0]
Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
One-Tail Test for Positive First Order Autocorrelation
- If d < dL, there is enough evidence to conclude that positive first-order autocorrelation exists.
- If d > dU, there is not enough evidence to conclude that positive first-order autocorrelation exists.
- If d is between dL and dU, the test is inconclusive.
One-Tail Test for Negative First Order Autocorrelation
- If d > 4 - dL, negative first-order autocorrelation exists.
- If d < 4 - dU, negative first-order autocorrelation does not exist.
- If d falls between 4 - dU and 4 - dL, the test is inconclusive.
Two-Tail Test for First Order Autocorrelation
- If d < dL or d > 4 - dL, first-order autocorrelation exists.
- If d falls between dL and dU, or between 4 - dU and 4 - dL, the test is inconclusive.
- If d falls between dU and 4 - dU, there is no evidence of first-order autocorrelation. (A small helper applying this rule follows.)
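A tiny helper applying the two-tail rule above; dL and dU are the table values for the given n and p, and the function name is hypothetical:

dw.two.tail <- function(d, dL, dU) {
  if (d < dL || d > 4 - dL)      "first-order autocorrelation exists"
  else if (d > dU && d < 4 - dU) "no evidence of first-order autocorrelation"
  else                           "inconclusive"
}
dw.two.tail(0.5931, dL = 1.10, dU = 1.54)   # values from Example 4.3 below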
Testing the Existence of Autocorrelation: Example
- Example 4.3 (EX4-03)
- How does the weather affect the sales of lift tickets at a ski resort?
- Data on the past 20 years' ticket sales, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
- The hypothesized model was TICKETS = b0 + b1 SNOWFALL + b2 TEMPERATURE + e.
- Regression analysis yielded the following results:
The Regression Equation: Assessment (I)
The model seems to be very poor:
- R-square = 0.1200.
- It is not valid (Significance F = 0.3373).
- No variable is linearly related to Sales.
Diagnostics: The Error Distribution
[Figure: histogram of the residuals]
The errors may be normally distributed.
Diagnostics: Heteroscedasticity
Diagnostics: First Order Autocorrelation
The errors are not independent!
Diagnostics: First Order Autocorrelation (continued)
Using the computer (Excel):
Tools > Data Analysis > Regression (check the residuals option, then OK). Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK.
Test for positive first-order autocorrelation: n = 20, p = 2. From the Durbin-Watson table we have dL = 1.10 and dU = 1.54. The statistic is d = 0.5931. Conclusion: because d < dL, there is sufficient evidence to infer that positive first-order autocorrelation exists.
[Figure: the residuals from the regression run]
The Modified Model: Time Included
The modified regression model (EX4-02mod.xls):
TICKETS = b0 + b1 SNOWFALL + b2 TEMPERATURE + b3 TIME + e
- All the required conditions are met for this model.
- The fit of this model is high: R2 = 0.7410.
- The model is valid: Significance F = 0.0001.
- SNOWFALL and TIME are linearly related to ticket sales.
- TEMPERATURE is not linearly related to ticket sales.
(An R sketch of this refit follows.)
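A sketch of the refit, assuming the Example 4.3 data are in a data frame ski with columns TICKETS, SNOWFALL, and TEMPERATURE recorded over 20 consecutive years (placeholder names), and reusing the dw() helper sketched earlier:

ski$TIME <- seq_len(nrow(ski))         # time index 1, 2, ..., 20
fit2 <- lm(TICKETS ~ SNOWFALL + TEMPERATURE + TIME, data = ski)
summary(fit2)                          # R-squared, F test, t tests
dw(fit2)                               # recheck the Durbin-Watson statistic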
4.4 Leverage, Influence, and Outliers
- Influential point: a point is influential if its deletion causes substantial changes in the fitted model (estimated coefficients, fitted values, t-tests, etc.).
- Outliers in the response variable: observations with large standardized residuals are outliers in the response variable. A rule of thumb: more than 3 standard deviations away from the mean (zero).
- Leverage value: p_ii, the ith diagonal element of the P matrix.
- Outliers in the predictors: outliers in the predictors (the X-space) are defined based on the magnitude of p_ii;
4.4 Leverage, Influence, and Outliers (continued)
- Specifically, if $p_{ii} > 2(p+1)/n$, then the ith observation is an outlying observation with respect to the X variables.
- This is because $p_{ii}$ measures the distance of a point from the center of the X-space. It is clearer in simple linear regression, where
$p_{ii} = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$.
(A sketch applying these rules follows.)
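A short sketch flagging the two kinds of outliers under these rules; fit is any lm() object:

p <- length(coef(fit)) - 1             # number of predictors
n <- nobs(fit)
h <- hatvalues(fit)                    # leverage values p_ii
which(h > 2 * (p + 1) / n)             # outlying in the X-space
which(abs(rstandard(fit)) > 3)         # outlying in the response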
4.5 Measures of Influence
- Let $\hat{Y}_{(i)}$ and $\hat{\sigma}_{(i)}$ be the fitted values and the estimate of $\sigma$ when we drop the ith observation.
- Cook's distance measures the influence of the ith observation by summarizing the differences between the fitted values obtained from the full data and the fitted values obtained by deleting the ith observation:
$C_i = \dfrac{\sum_{j=1}^{n}\bigl(\hat{Y}_j - \hat{Y}_{j(i)}\bigr)^2}{(p+1)\,\hat{\sigma}^2}$.
4.5 Measures of Influence (continued)
- Cook's distance can be calculated through the relation
$C_i = \dfrac{r_i^2}{p+1} \cdot \dfrac{p_{ii}}{1 - p_{ii}}$,
where $r_i$ is the internally studentized residual.
- A rule of thumb: if Ci > 1, then the ith observation is influential.
- A more flexible and informative way of detecting influential observations is an index plot of Ci (see the sketch below).
- There are other measures of influence; see the text, Sec. 4.9.
- See the R script file MotorInn.r on the course website for details on calculating the various quantities and plotting.
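A minimal sketch of these diagnostics; fit is any lm() object (cooks.distance() is built into R):

Ci <- cooks.distance(fit)
which(Ci > 1)                          # rule-of-thumb influential points
plot(Ci, type = "h",                   # index plot of Cook's distance
     xlab = "Index", ylab = "Cook's distance")
abline(h = 1, lty = 2)                 # reference line at the cutoff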