Title: Cotton example
1 Cotton example
- Cotton is particularly sensitive to rainfall. Dry weather in June appears to slow growth.
- The following data are records from an agricultural experiment station.
2 June rainfall (cm) and yield (lb/acre)

    June rainfall (cm)   Yield (lb/acre)
                     3              1120
                     6              1750
                     7              1940
                     9              2130
                    11              2380
                    15              2650
                    17              2990
                    19              3130
3 - Yield is the response variable.
- Rainfall in June is the regressor.
- Denote Yield as Y and Rainfall as X.
4 Blackboard
6 Coefficient of determination
- The proportion of the total variation in Y that is explained by the fitted regression.
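In symbols (standard notation, not spelled out in the transcript): with SST the total sum of squares, SSR the regression sum of squares and SSE the residual sum of squares, r2 = SSR / SST = 1 - SSE / SST.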
7 Cotton example
Yield = 999 + 116 × June rainfall
8 Coefficient of determination
- 96.7% of the total variation is explained by the fitted model.
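A minimal Python/numpy sketch (not part of the original slides, which appear to show Minitab-style output) that reproduces the fitted line and r2 from the rainfall-yield table above:

    import numpy as np

    # June rainfall (cm) and yield (lb/acre) from the table on slide 2
    rain = np.array([3, 6, 7, 9, 11, 15, 17, 19], dtype=float)
    yld = np.array([1120, 1750, 1940, 2130, 2380, 2650, 2990, 3130], dtype=float)

    # Least-squares slope and intercept (roughly 116 and 999)
    slope, intercept = np.polyfit(rain, yld, 1)

    # Coefficient of determination r2 = 1 - SSE/SST (roughly 0.967)
    fitted = intercept + slope * rain
    sse = ((yld - fitted) ** 2).sum()
    sst = ((yld - yld.mean()) ** 2).sum()
    r2 = 1 - sse / sst

    print(f"Yield = {intercept:.0f} + {slope:.0f} x rainfall, r2 = {r2:.3f}")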
10 Recall the assumptions we made for the regression
- For each X there is a normal distribution of Y values.
- The variance of that normal distribution is the same for all X values.
- The mean of the Y values at a given X lies on a straight line, the regression line.
11 Another example
- Plot tree height against time as an indication of growth rate.
- What is the problem here? The observations are not independent.
12 - Assumption
- The Y observations collected for Xi must be independent of the Y observations collected for Xk.
13 - Make the Y observations independent by just looking at incremental growth.
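A minimal sketch of this idea (hypothetical numbers, not from the slides): the cumulative heights are dependent, but the year-to-year increments can be treated as independent Y observations.

    import numpy as np

    # Hypothetical cumulative tree heights (m), one measurement per year
    height = np.array([2.1, 2.9, 3.8, 4.5, 5.4, 6.0])

    # Yearly incremental growth (0.8, 0.9, 0.7, 0.9, 0.6) -- these
    # differences, not the raw heights, go into the regression
    growth = np.diff(height)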
14 Linear relationship?
16 - It is fairly easy to misuse regression. A rule of thumb is to use your common sense. If the results of an analysis don't make any sense, get help. Ultimately, statistics is a tool employed to help us understand life. Although understanding life can be tricky, it is not usually perverse. Before accepting conclusions which seem silly based on statistical analyses, consult with a veteran data analyst. Most truly revolutionary results from data analyses are based on data entry errors.
- Cody, R. P. and Smith, J. K. (1991) Applied Statistics and the SAS Programming Language.
17 Multiple regression
- The growth rate of a tree depends, for example, on
  - water supply
  - hours of daylight
  - soil composition
  - genetics
18 - Using only one of these variables to explain growth rate is not going to give a very accurate prediction of growth rate.
- In other words, dividing the total variation into the regression variation from just one of these variables plus the residual variation is likely to leave a huge residual term.
19 Multiple regression
The unknown parameters are called regression coefficients or partial regression coefficients.
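The model formula itself is not reproduced in the transcript; in standard notation, with k regressors, it is Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, and β1, ..., βk are the (partial) regression coefficients.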
20 In addition to the assumptions of simple linear regression, we also have to make the following
- Assumption: The variables X1, X2, X3, X4, ... are independent; there is no correlation between any pair of variables.
21 Air pollution in different American cities
- The response variable Y is sulfur dioxide (SO2)
- X1 Temperature
- X2 Number of factories
- X3 Population size
- X4 Wind
- X5 Precipitation
- X6 Number of days of precipitation
22 Air pollution
- Our aim is to see how well these six variables explain the amount of SO2.
- In total, 41 different cities were included in the study.
23 - Just as in simple linear regression, we ask if the regression coefficients are different from zero:
- H0: βj = 0 for all j
- H1: βj ≠ 0 for at least one j
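The usual test of this null hypothesis is the overall F-test from the regression ANOVA (the output itself is not reproduced in the transcript): F = (SSR / k) / (SSE / (n - k - 1)), here with k = 6 regressors and n = 41 cities.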
25 Air pollution
SO2 conc = 111 - 1.26 Temp (F) + 0.0650 No. factories - 0.0394 Population - 3.17 Wind + 0.509 Ppt - 0.050 No. days Ppt

Y = 111 - 1.26 X1 + 0.065 X2 - 0.0394 X3 - 3.17 X4 + 0.509 X5 - 0.050 X6
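The 41-city data set is not included in the transcript, so the following is only a sketch of how such a fit could be reproduced in Python with statsmodels; the file name and column names are hypothetical (the original output looks like Minitab):

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names -- the actual data are not given here
    data = pd.read_csv("usair.csv")
    X = data[["temp", "factories", "population", "wind", "precip", "precip_days"]]
    y = data["so2"]

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.summary())   # coefficients, standard errors, p-values, r2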
26 Conclusion
- At least one of the regression coefficients is different from zero.
27 - New question: Which regression coefficients are different from zero?

    Predictor    Coef       SE Coef    P
    Temp (F)     -1.2592    0.6203     0.049
    No. fact      0.06500   0.01575    0.000
    Populati     -0.03937   0.01514    0.014
    Wind         -3.169     1.815      0.090
    Ppt           0.5092    0.3629     0.170
    No. days     -0.0498    0.1617     0.760

- Only 3 of the 6 variables have regression coefficients significantly different from zero. What happens if we fit a regression model keeping only these three (and omitting Wind, Ppt and No. days)?
28 - Follow-up null hypothesis
- H0 ( )
29 Air pollution
SO2 conc = 58 - 0.584 Temp (F) + 0.0713 No. factories - 0.0467 Population

Y = 58 - 0.584 X1 + 0.0713 X2 - 0.0467 X3
30
    Predictor     Coef       SE Coef    P
    Temp (F)      -0.5841    0.3710     0.124
    No. fact       0.07131   0.01606    0.000
    Population    -0.04672   0.01537    0.004

- What happened here? Suddenly the regression coefficient of temperature is not significant!
- Also, r2 for the model including all six variables is 0.669, while r2 for the model including only three variables is 0.612.
31 - Questions
- 1. Which of the two models predicts the sulfur dioxide concentration best?
- 2. Should I further reduce the model by eliminating temperature?
32 Multicollinearity of variables
- Many, if not most, regression analyses are conducted on data sets where the independent variables show some degree of correlation. These data sets, resulting from non-experimental research, are common in all fields... The potential for a researcher to be misled by a non-experimental data set is high; for a novice researcher it is a near certainty.
- Cody, R. P. and Smith, J. K. (1991) Applied Statistics and the SAS Programming Language.
33 - The problem is that correlation among the independent variables causes the regression estimates to change depending on which independent variables are being used. That is, the impact of B on A depends on whether C is in the equation or not. With C omitted, B can look very influential. With C included, the impact of B can disappear completely!
- The reason for this is as follows: a regression coefficient tells us the unique contribution of an independent variable to the dependent variable. That is, the coefficient for B tells us what B contributes all by itself, with no overlap with any other variable. If B is the only variable in the equation, this is no problem. But if we add C, and if B and C are correlated, then the unique contribution of B to A will change.
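A small simulated sketch of exactly this effect (Python with statsmodels; not part of the original lecture). Here A really depends on C, and B is only correlated with C:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    c = rng.normal(size=n)
    b = c + 0.3 * rng.normal(size=n)   # B is strongly correlated with C
    a = 2.0 * c + rng.normal(size=n)   # A actually depends on C, not B

    # With C omitted, B looks very influential (slope close to 2)
    print(sm.OLS(a, sm.add_constant(b)).fit().params)

    # With C included, B's unique contribution largely disappears
    X = sm.add_constant(np.column_stack([b, c]))
    print(sm.OLS(a, X).fit().params)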