Title: Linear Regression
1Linear Regression
2An example
3About the straight line
$Y = a + b x$
a: intercept
b: slope
4Questions
- How to obtain the best straight line ?
- Is this straight line the best curve to use ?
- How to use this straight line ?
5How to obtain the best straight line ?
Proceed in three main steps
- write a (statistical) model
- estimate the parameters
- graphical inspection of data
6Write a model
A statistical model
Mean model: functional relationship
Variance model: assumptions on the residuals
7Write a model
$Y_i = a + b x_i + e_i$
Mean model: $a + b x_i$
Residual (error term): $e_i$
8Assumptions on the residuals
- the $x_i$'s are not random variables: they are known with high precision
- the $e_i$'s have a constant variance (homoscedasticity)
- the $e_i$'s are independent
- the $e_i$'s are normally distributed (normality)
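The assumptions above can be made concrete by simulating data that satisfy them. This is a minimal sketch: the function name `simulate`, the six concentration levels and the values of a, b and sigma are all hypothetical, chosen only for illustration.

```python
import random

def simulate(a, b, xs, sigma, seed=0):
    """Draw Y_i = a + b*x_i + e_i with independent e_i ~ N(0, sigma^2).

    The x_i's are fixed (not random), the variance is constant
    (homoscedasticity) and the errors are normal (normality).
    """
    rng = random.Random(seed)
    return [a + b * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical design: 6 fixed levels of x, 3 replicates each (18 points)
xs = [x for x in (10, 20, 40, 80, 160, 320) for _ in range(3)]
ys = simulate(a=20.0, b=3.0, xs=xs, sigma=8.0)
```

Setting `sigma=0` returns points lying exactly on the line, which is a quick way to check the function.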
9Homoscedasticity
[Figure: two scatterplots, one illustrating homoscedasticity, the other heteroscedasticity]
10Normality
[Figure: plot of Y against x illustrating the normality assumption]
11Estimate the parameters
A criterion is needed to estimate parameters
A statistical model
A criterion
12How to estimate the "best" a and b ?
Intuitive criterion: $\sum_i e_i$ minimum, but positive and negative residuals compensate
Reasonable criterion: $\sum_i e_i^2$ minimum
Linear model + homoscedasticity + normality: least squares criterion (L.S.)
13The least squares criterion
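A sketch of the criterion, written in the notation of the model $Y_i = a + b x_i + e_i$. The symbol $SS$ is an assumption introduced here for illustration.

```latex
SS(a, b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - a - b x_i \right)^2 ,
\qquad
(\hat{a}, \hat{b}) = \arg\min_{a,\, b} SS(a, b)
```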
14Result of optimisation
$\hat{a}$ and $\hat{b}$ change with samples
$\hat{a}$ and $\hat{b}$ are random variables
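For a straight line, minimising the sum of squared residuals has a well-known closed-form solution. The sketch below computes it with the standard library only; the function name `least_squares` and the four data points are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a_hat, b_hat) minimising sum((y - a - b*x)**2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx                # slope
    a_hat = y_bar - b_hat * x_bar    # intercept
    return a_hat, b_hat

# Hypothetical sample; another sample would give other estimates
a_hat, b_hat = least_squares([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
```

Running the function on a second sample drawn from the same line gives different values, which is what "random variables" means for the estimates.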
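15Summary
True mean straight line: $E(Y) = a + b x$
Estimated straight line: $\hat{Y} = \hat{a} + \hat{b} x$
Mean predicted value for the ith observation: $\hat{Y}_i = \hat{a} + \hat{b} x_i$
ith residual: $\hat{e}_i = Y_i - \hat{Y}_i$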
16Example
Dep Var: HPLC   N: 18
Effect     Coefficient   Std Error   t        P(2 Tail)
CONSTANT   20.046        3.682        5.444   0.000
CONCENT     2.916        0.069       42.030   0.000
Intercept: 20.046
Slope: 2.916
Estimated straight line: HPLC = 20.046 + 2.916 CONCENT
17Example
18Example
19Residual variance
By construction $\sum_i \hat{e}_i = 0$, but $\sum_i \hat{e}_i^2 > 0$
The residual variance is defined by $s^2 = \frac{1}{n-2} \sum_i \hat{e}_i^2$
$s$: standard error of estimate
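A minimal sketch of the residual variance computation, assuming the usual divisor n - 2 for a straight line (two estimated parameters). The function name, the data and the estimates passed in are hypothetical.

```python
import math

def residual_variance(xs, ys, a_hat, b_hat):
    """Return (s2, s): residual variance and standard error of estimate."""
    residuals = [y - (a_hat + b_hat * x) for x, y in zip(xs, ys)]
    n = len(xs)
    s2 = sum(e ** 2 for e in residuals) / (n - 2)  # n - 2 degrees of freedom
    return s2, math.sqrt(s2)

# Hypothetical data with their least squares estimates a_hat=1.15, b_hat=1.94
s2, s = residual_variance([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8], 1.15, 1.94)
```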
20Example
Dep Var: HPLC   N: 18   Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991   Standard error of estimate: 8.282
Effect     Coefficient   Std Error   t        P(2 Tail)
CONSTANT   20.046        3.682        5.444   0.000
CONCENT     2.916        0.069       42.030   0.000
21Questions
- How to obtain the best straight line ?
- Is this straight line the best curve to use ?
- How to use this straight line ?
22Is this model the best one to use ?
- Tools to check the mean model
  - scatterplot residuals vs fitted values
  - test(s)
- Tools to check the variance model
  - scatterplot residuals vs fitted values
  - probability plot (Pplot)
23Checking the mean model
[Figure: two scatterplots of residuals vs fitted values]
Structure in the residuals: change the mean model
No structure in the residuals: OK
24Checking the mean model: tests
Two cases
replications
no replication
25Without replication
Try another mean model and test the improvement
Example: $Y_i = a + b x_i + c x_i^2 + e_i$
If the test on c is significant (c ≠ 0) then keep this model
Dep Var: HPLC   N: 18   Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991   Standard error of estimate: 8.539
Effect            Coefficient   Std Error   t       P(2 Tail)
CONSTANT          21.284        6.649       3.201   0.006
CONCENT            2.842        0.335       8.486   0.000
CONCENT*CONCENT    0.001        0.003       0.227   0.824
Here the test on c is not significant (P = 0.824)
26With replications
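Perform a test of lack of fit
Principle: compare the lack-of-fit variance (departure from the straight line) to the pure error variance (dispersion between replicates)
If lack of fit > pure error, then change the model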
27Test of lack of fit how to do it ?
Three steps
1) Linear regression
2) One way ANOVA
3) Compute the lack-of-fit F ratio from the results of steps 1 and 2; if it exceeds the critical value, then change the model
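The usual form of the lack-of-fit statistic can be sketched as follows. The notation is an assumption, not taken from the source: $SS_{res}$ is the residual sum of squares of the linear regression ($n - 2$ df), $SS_{pe}$ is the error (pure error) sum of squares of the one way ANOVA ($n - k$ df), and $k$ is the number of distinct x values.

```latex
F = \frac{\left( SS_{res} - SS_{pe} \right) / (k - 2)}{SS_{pe} / (n - k)} ,
\qquad
\text{change the model if } F > F_{1-\alpha}(k - 2,\; n - k)
```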
28Test of lack of fit example
Three steps
1) Linear regression
2) One way ANOVA
Dep Var: HPLC   N: 18
Analysis of Variance
Source    Sum-of-Squares   df   Mean-Square   F-ratio   P
CONCENT   121251.776        5   24250.355     289.434   0.000
Error       1005.427       12      83.786
3) The lack-of-fit F ratio is below the critical value: we keep the straight line
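Step 3 can be sketched from the numbers reported in the outputs: the standard error of estimate of the linear regression (8.282, N = 18) and the Error line of the ANOVA (1005.427, 12 df). The function name is hypothetical, and the critical value 3.26 is the tabulated 5 % point of F(4, 12); the choice of a 5 % risk is an assumption.

```python
def lack_of_fit_f(ss_res, df_res, ss_pe, df_pe):
    """F ratio: lack-of-fit mean square over pure error mean square."""
    ms_lof = (ss_res - ss_pe) / (df_res - df_pe)
    ms_pe = ss_pe / df_pe
    return ms_lof / ms_pe

n = 18
see = 8.282                    # standard error of estimate (regression output)
ss_res = see ** 2 * (n - 2)    # residual sum of squares of the straight line
ss_pe, df_pe = 1005.427, 12    # Error line of the one way ANOVA

f_ratio = lack_of_fit_f(ss_res, n - 2, ss_pe, df_pe)
F_CRIT = 3.26                  # tabulated F(0.95; 4, 12), assumed 5 % risk
keep_straight_line = f_ratio <= F_CRIT
```

With these figures the ratio is well below 1, which agrees with the conclusion of the slide.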
29Checking the variance model: homoscedasticity
[Figure: two scatterplots of residuals vs fitted values]
No structure in the residuals but heteroscedasticity: change the model (criterion)
Homoscedasticity: OK
30What to do with heteroscedasticity ?
[Figure: scatterplot of residuals vs fitted values]
Model the dispersion.
The standard deviation of the residuals increases with $\hat{Y}$: it increases with x
31What to do with heteroscedasticity ?
Estimate the slope and the intercept again, but with weights inversely proportional to the variance:
minimise $\sum_i w_i (Y_i - a - b x_i)^2$ with $w_i \propto 1 / \mathrm{Var}(e_i)$
and check that the weighted residuals $\sqrt{w_i}\, \hat{e}_i$ are homoscedastic
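A minimal sketch of a weighted least squares fit for a straight line. The function names are hypothetical, and the weights $w_i = 1/x_i^2$ in the example assume a standard deviation proportional to x, which is only one possible model of the dispersion.

```python
import math

def weighted_least_squares(xs, ys, ws):
    """Return (a_hat, b_hat) minimising sum(w * (y - a - b*x)**2)."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted means
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

def weighted_residuals(xs, ys, ws, a_hat, b_hat):
    """Residuals scaled by sqrt(w); these should look homoscedastic."""
    return [math.sqrt(w) * (y - a_hat - b_hat * x)
            for x, y, w in zip(xs, ys, ws)]

# Hypothetical data; weights assume SD proportional to x
xs = [1, 2, 4, 8]
ys = [3.2, 4.7, 9.5, 16.1]
ws = [1 / x ** 2 for x in xs]
a_hat, b_hat = weighted_least_squares(xs, ys, ws)
res = weighted_residuals(xs, ys, ws, a_hat, b_hat)
```

With equal weights the function reduces to ordinary least squares.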
32Checking the variance model: normality
[Figure: two probability plots of the residuals against the expected value for a normal distribution]
No curvature: normality
Curvature: non-normality. Is it so important?
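The coordinates of such a probability plot can be sketched as follows. The function name is hypothetical, and the plotting positions (i - 0.5)/n are one common convention among several, assumed here; statistical packages may use a slightly different one.

```python
from statistics import NormalDist

def normal_scores(residuals):
    """Pair each ordered residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n: an assumed convention
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

# Hypothetical residuals
points = normal_scores([0.3, -1.2, 0.8, -0.1, 1.5])
```

Plotting the ordered residuals against the expected quantiles gives the Pplot: points close to a straight line suggest normality, a curved pattern suggests non-normality.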
33What to do with non normality ?
Try to model the distribution of the residuals
In general, this is difficult with few observations
If enough observations are available, non-normality does not affect the result too much.
34An interesting index: R²
R² = squared correlation coefficient
= % of dispersion of the $Y_i$'s explained by the straight line (the model)
0 ≤ R² ≤ 1
If R² = 1, all the $e_i$ = 0: the straight line explains all the variation of the $Y_i$'s
If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s
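The two limiting cases follow from the usual decomposition of the dispersion of the $Y_i$'s, sketched here with $\hat{e}_i$ the ith residual and $\bar{Y}$ the mean of the $Y_i$'s (notation assumed, not taken from the source):

```latex
R^2 = 1 - \frac{\sum_i \hat{e}_i^2}{\sum_i \left( Y_i - \bar{Y} \right)^2}
```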
35An interesting index: R²
R² and R (correlation coefficient) are not designed to measure linearity!
Example
Multiple R: 0.990   Squared multiple R: 0.980   Adjusted squared multiple R: 0.980
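A small numerical illustration that a high R² does not imply linearity: the hypothetical data below lie exactly on the curve y = x², yet a straight line fitted to them gives an R² close to 0.95. The function name and the data are assumptions made for this sketch.

```python
def r_squared(xs, ys):
    """Squared correlation coefficient between x and y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical, clearly non-linear data: y = x**2 for x = 1..10
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]
r2 = r_squared(xs, ys)   # high, although the relationship is a parabola
```

A plot of residuals vs fitted values would show the curvature immediately, which is why it is the tool to check the mean model.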
36Questions
- How to obtain the best straight line ?
- Is this straight line the best curve to use ?
- How to use this straight line ?
37How to use this straight line ?
- Direct use: for a given x
  - predict the mean Y
  - construct a confidence interval of the mean Y
  - construct a prediction interval of Y
- Reverse use (calibration, approximate results): for a given Y
  - predict the mean x
  - construct a confidence interval of the mean x
  - construct a prediction interval of X
38For a given x predict the mean Y
Example
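A minimal sketch of the direct use, taking the coefficients of the HPLC output (CONSTANT 20.046, CONCENT 2.916). The function name and the concentration 50 are hypothetical.

```python
A_HAT = 20.046   # estimated intercept (CONSTANT)
B_HAT = 2.916    # estimated slope (CONCENT)

def predict_mean_y(x):
    """Mean predicted value on the estimated straight line."""
    return A_HAT + B_HAT * x

y_hat = predict_mean_y(50)   # hypothetical concentration
```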
39Confidence interval of the mean Y
There is a probability 1 − α that a + b x belongs to this interval
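The usual form of this interval under the model assumptions is sketched below. The notation is assumed: $s$ is the standard error of estimate, $\bar{x}$ the mean of the $x_i$'s, and $t_{1-\alpha/2,\, n-2}$ the Student quantile with $n - 2$ degrees of freedom.

```latex
\hat{a} + \hat{b} x \;\pm\; t_{1-\alpha/2,\; n-2} \; s \,
\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```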
40Confidence interval of the mean Y
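[Figure: lower (L) and upper (U) limits of the confidence interval of the mean Y, read at the axis value 30]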
41Example
42Prediction interval of Y
100(1 − α) % of the measurements carried out for this x belong to this interval
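The two intervals differ only by an extra term for the variability of a single measurement. The sketch below computes both with the usual formulas; the function name and the data are hypothetical, and the Student quantile is passed in as a tabulated value (2.776 for 4 degrees of freedom at a 5 % two-sided risk) because the standard library does not provide it.

```python
import math

def intervals(xs, ys, x0, t_crit):
    """Return (confidence interval of the mean Y, prediction interval of Y) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a_hat = y_bar - b_hat * x_bar
    # Standard error of estimate, n - 2 degrees of freedom
    s = math.sqrt(sum((y - a_hat - b_hat * x) ** 2
                      for x, y in zip(xs, ys)) / (n - 2))
    y0 = a_hat + b_hat * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_ci = t_crit * s * math.sqrt(leverage)       # mean Y
    half_pi = t_crit * s * math.sqrt(1 + leverage)   # single measurement
    return (y0 - half_ci, y0 + half_ci), (y0 - half_pi, y0 + half_pi)

# Hypothetical data: n = 6, so n - 2 = 4 degrees of freedom
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
T_CRIT = 2.776   # tabulated t(0.975; 4)
ci, pi = intervals(xs, ys, x0=3.5, t_crit=T_CRIT)
```

The prediction interval is always wider than the confidence interval of the mean, and both are narrowest at the mean of the $x_i$'s.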
43Prediction interval of Y
[Figure: lower (L) and upper (U) limits of the prediction interval of Y, read at the axis value 30]
44Example
45Reverse use: for a given Y = y0, predict the mean X
Example
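A minimal sketch of the reverse use, obtained by solving the estimated straight line for x. It takes the coefficients of the HPLC output (CONSTANT 20.046, CONCENT 2.916); the function name and the response value are hypothetical.

```python
A_HAT = 20.046   # estimated intercept (CONSTANT)
B_HAT = 2.916    # estimated slope (CONCENT)

def predict_mean_x(y0):
    """Invert the fitted line Y = A_HAT + B_HAT * x for a given response y0."""
    return (y0 - A_HAT) / B_HAT

x_hat = predict_mean_x(165.846)   # hypothetical HPLC response
```

The line is still fitted with Y as the response; only its use is reversed.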
46For a given Y = y0: a confidence interval of the mean X
[Figure: plot of Y against X showing y0 on the Y axis and the limits L and U on the X axis]
47Confidence interval of the mean X
There is a probability 1 − α that the mean X belongs to [L, U]
L and U are such that the confidence limits of the mean Y at x = L and x = U are equal to y0
48Example
49What you should no longer believe
- One can fit the straight line by inverting x and Y
- If the correlation coefficient is high, the straight line is the best model
- Normality of the $x_i$'s is required to perform a regression
- Normality of the $e_i$'s is essential to perform a good regression