Title: Prediction and Lack of Fit in Regression
1. Prediction, Correlation, and Lack of Fit in Regression (11.4, 11.5, 11.7)
Outline
- Confidence interval and prediction interval.
- Regression Assumptions.
- Checking Assumptions (model adequacy).
- Correlation.
- Influential observations.
2. Prediction

Our regression model is

    y_i = b0 + b1*x_i + e_i,   i = 1, ..., n,

fit to the repair-time data (x = number of components, y = repair time):

     i   x_i   y_i
     1    1     23
     2    2     29
     3    4     64
     4    4     72
     5    4     80
     6    5     87
     7    6     96
     8    6    105
     9    8    127
    10    8    119
    11    9    145
    12    9    149
    13   10    165
    14   10    154

so that the average value of the response at X = x* is

    E(Y | x*) = b0 + b1*x*.
3. The Estimated Average Response

The estimated average response at X = x* is therefore

    yhat(x*) = b0 + b1*x*,

an estimate of the expected value E(Y | x*). This quantity is a statistic, a random variable, hence it has a sampling distribution.

Regression assumptions: normal distribution for the errors e_i. The sample estimate has associated variance

    Var(yhat(x*)) = sigma^2 * (1/n + (x* - xbar)^2 / Sxx),

estimated by replacing sigma^2 with MSE. A (1 - a)100% CI for the average response at X = x* is therefore

    yhat(x*) +/- t(a/2, n-2) * sqrt( MSE * (1/n + (x* - xbar)^2 / Sxx) ).
4. Prediction and Predictor Confidence

The best predictor of an individual response y at X = x*, denoted yhat_pred(x*), is simply the estimated average response at x*. The estimates b0 and b1 are random variables; they vary from sample to sample, hence the predicted value is also a random variable.

The variance associated with an individual prediction is larger than that for the mean value. Why? Because it must account for both the uncertainty in estimating the mean response and the variability of a single new observation around that mean.

A (1 - a)100% prediction interval for an individual response at X = x* is

    yhat(x*) +/- t(a/2, n-2) * sqrt( MSE * (1 + 1/n + (x* - xbar)^2 / Sxx) ).
5. Prediction and Confidence Bands

Prediction band: what we would expect for one new observation.
Confidence band: what we would expect for the mean of many observations taken at the value x*.
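The two intervals above can be sketched in code. This is a minimal illustration using the slides' repair-time data; the choice of x* = 7 is arbitrary, and the t critical value t(0.025, 12) = 2.179 is the standard tabled value for a 95% interval with n - 2 = 12 degrees of freedom.

```python
import math

# Repair-time data from the slides: x = number of components, y = repair time
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx                      # fitted slope
b0 = ybar - b1 * xbar               # fitted intercept
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
MSE = SSE / (n - 2)

t = 2.179    # t(0.025, 12): tabled value for a 95% interval, n - 2 = 12 df
xstar = 7    # an arbitrary new value of X for illustration
yhat = b0 + b1 * xstar

# Standard error for the MEAN response at x* ...
se_mean = math.sqrt(MSE * (1 / n + (xstar - xbar) ** 2 / Sxx))
# ... and for an INDIVIDUAL response: one extra MSE term, so it is wider
se_pred = math.sqrt(MSE * (1 + 1 / n + (xstar - xbar) ** 2 / Sxx))

ci = (yhat - t * se_mean, yhat + t * se_mean)   # confidence interval (mean)
pi = (yhat - t * se_pred, yhat + t * se_pred)   # prediction interval (individual)
print(f"yhat = {yhat:.1f}")
print(f"95% CI for mean response: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for new response:  ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is always wider than the confidence interval at the same x*, because of the extra "1" inside the square root.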
7. Regression Assumptions and Lack of Fit
Regression Model Assumptions
- Effect additivity (multiple regression)
- Normality of the residuals
- Homoscedasticity of the residuals
- Independence of the residuals
8. Additivity
Additivity assumption.
The expected value of an observation is a
weighted linear combination of a number of
factors.
- Which factors? (model uncertainty)
- number of factors in the model
- interactions of factors
- powers or transformations of factors
9. Homoscedasticity and Normality

Observations never equal their expected values; the model assumes there are no systematic biases.

Homoscedasticity assumption: the unexplained component has a common variance for all i.

Normality assumption: the unexplained component has a normal distribution.
10. Independence
Independence assumption.
Responses in one experimental unit are not
correlated with, affected by, or related to,
responses for other experimental units.
11. Correlation Coefficient

A measure of the strength of the linear relationship between two variables. The product-moment correlation coefficient is

    r = Sxy / sqrt(Sxx * Syy).

In SLR, r is related to the slope of the fitted regression equation: b1 = r * sqrt(Syy / Sxx), so r has the same sign as b1.

r^2 (or R^2) represents the proportion of total variability of the Y-values that is accounted for by the linear regression on the independent variable X: R^2 = proportion of variability in Y explained by X.
12. Properties of r

1. r lies between -1 and 1.
   r > 0 indicates a positive linear relationship.
   r < 0 indicates a negative linear relationship.
   r = 0 indicates no linear relationship.
   r = +/-1 indicates a perfect linear relationship.
2. The larger the absolute value of r, the stronger the linear relationship.
3. r^2 also lies between 0 and 1.
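As a small check, r and r^2 can be computed directly from the slides' repair-time data using the product-moment formula above:

```python
import math

# Repair-time data from the prediction slides
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Product-moment correlation coefficient
r = Sxy / math.sqrt(Sxx * Syy)
# r has the same sign as the slope b1 = Sxy / Sxx,
# and r^2 is the proportion of variability in Y explained by X
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

For these data the correlation is strongly positive, consistent with the steadily increasing repair times in the table.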
13. Checking Assumptions

How well does the model fit?

- Do predicted values seem to be placed in the middle of observed values?
- Do residuals satisfy the regression assumptions? (Problems seen in a plot of X vs. Y will be reflected in the residual plot.)
- Constant variance?
- Regularities suggestive of lack of independence, or of a more complex model?
- Poorly fit observations?
14. Model Adequacy

Studentized residuals allow us to gauge whether a residual is too large:

    e_i* = e_i / sqrt( MSE(i) * (1 - h_i) ).

Here MSE(i) is the MSE calculated leaving observation i out of the computations, and h_i is the ith diagonal element of the projection matrix for the predictor space (the ith hat diagonal element). Each e_i* should have approximately a standard normal distribution, hence it is very unlikely that any studentized residual will fall outside the range [-3, 3].
15. Normality of Residuals

Formal goodness-of-fit tests: the Kolmogorov-Smirnov test, the Shapiro-Wilk test (n < 50), and D'Agostino's test (n >= 50). All are quite conservative: they fail to reject the hypothesis of normality more often than they should.

Graphical approach: the quantile-quantile plot (qq-plot).

1. Compute and sort the simple residuals e_1, e_2, ..., e_n.
2. Associate to each sorted residual a standard normal quantile z_i = normsinv((i - 0.5)/n).
3. Plot z_i versus e_(i). Compare to the 45-degree line.
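The three qq-plot steps above can be sketched directly. The residuals below are hypothetical values for illustration only (not computed from the slides' data); the normal quantiles use the same (i - 0.5)/n plotting positions as the slide.

```python
from statistics import NormalDist

# Hypothetical simple residuals e_i from some fitted regression (illustration)
e = [2.1, -0.5, 3.3, -4.0, 1.2, -1.8, 0.4, -2.6,
     5.1, -2.2, 0.9, -1.4, 2.8, -3.3]
n = len(e)

# Step 1: sort the residuals
e_sorted = sorted(e)

# Step 2: standard normal quantile for each plotting position (i - 0.5)/n
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Step 3: "plot" z_i versus e_(i); near-normal residuals fall close to a
# straight line (the 45-degree line after standardizing the residuals)
for zi, ei in zip(z, e_sorted):
    print(f"{zi:7.3f}  {ei:6.2f}")
```

In practice the (z_i, e_(i)) pairs would be passed to a plotting routine; systematic curvature in the pattern suggests non-normal residuals.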
17. Influence Diagnostics (Ways to Detect Influential Observations)

Does a particular observation, consisting of a pair of (X, Y) values (a case), have undue influence on the fit of the regression model? That is, which cases are greatly affecting the estimates of the p regression parameters in the model? (For simple linear regression, p = 2.)

Standardized/studentized residuals: the e_i* are used to detect cases that are outlying with respect to their Y values. Check cases with |e_i*| > 2 or 3.

Hat diagonal elements: the h_i are used to detect cases that are outlying with respect to their X values. Check cases with h_i > 2p/n.
18. Dffits: measures the influence that the ith case has on the ith fitted value. Compares the ith fitted value with the ith fitted value obtained by omitting the ith case. Check cases for which |Dffits| > 2*sqrt(p/n).

Cook's distance: similar to Dffits, but considers instead the influence of the ith case on all n fitted values. Check when Cook's distance > F(p, n-p, 0.50).

Covariance ratio: the change in the determinant of the covariance matrix that occurs when the ith case is deleted. Check cases with |CovRatio - 1| >= 3p/n.

Dfbetas: a measure of the influence of the ith case on each estimated regression parameter. For each regression parameter, check cases with |Dfbeta| > 2/sqrt(n).
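A sketch of these diagnostics for the slides' repair-time data, assuming the standard textbook definitions of the hat diagonals, studentized deleted residuals, Dffits, and Cook's distance (the leave-one-out MSE is computed via the usual algebraic identity rather than by refitting):

```python
import math

# Repair-time data from the slides; p = 2 parameters in SLR
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n, p = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # simple residuals
MSE = sum(ei ** 2 for ei in e) / (n - p)

# Hat diagonals for SLR; flag h_i > 2p/n
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]

for i in range(n):
    # Leave-one-out MSE via the standard identity (no refit needed)
    mse_i = ((n - p) * MSE - e[i] ** 2 / (1 - h[i])) / (n - p - 1)
    t_i = e[i] / math.sqrt(mse_i * (1 - h[i]))       # studentized deleted residual
    dffits = t_i * math.sqrt(h[i] / (1 - h[i]))      # flag |Dffits| > 2*sqrt(p/n)
    r_i = e[i] / math.sqrt(MSE * (1 - h[i]))         # internally studentized
    cook = r_i ** 2 * h[i] / (p * (1 - h[i]))        # compare to F(p, n-p, 0.50)
    print(f"obs {i + 1:2d}: h = {h[i]:.3f}  t = {t_i:6.2f}  "
          f"Dffits = {dffits:6.2f}  CookD = {cook:.3f}")

print(f"cutoffs: h > {2 * p / n:.2f}, |Dffits| > {2 * math.sqrt(p / n):.2f}, "
      f"|Dfbeta| > {2 / math.sqrt(n):.2f}")
```

The printed cutoffs match the numeric thresholds quoted on the next slide for this data set (n = 14, p = 2).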
19. Cutoffs for the repair-time data (n = 14, p = 2): hat = 2p/n = 0.29, CovRatio = 3p/n = 0.43, Dffits = 2*sqrt(p/n) = 0.76, Dfbetas = 2/sqrt(n) = 0.53.
(Slides 20-21: diagnostic plots; observations 1, 2, and 5 are labeled.)