Title: Cotton example
1 Cotton example
- Cotton is particularly sensitive to rainfall. Dry weather in June appears to slow growth.
- The following data are records from an agricultural experiment station.
2 June rainfall (cm) and yield (lb/acre)

    June rainfall (cm)   Yield (lb/acre)
                     3              1120
                     6              1750
                     7              1940
                     9              2130
                    11              2380
                    15              2650
                    17              2990
                    19              3130
3 - Yield is the response variable.
- Rainfall in June is the regressor.
- Denote Yield as Y and Rainfall as X.
4 Blackboard
6 Coefficient of determination
- The proportion of the total variation in Y that is explained by the fitted regression.
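In symbols (standard notation, not spelled out in the transcript): with SST the total sum of squares, SSR the regression sum of squares and SSE the residual sum of squares, r2 = SSR / SST = 1 - SSE / SST.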
7 Cotton example
Yield = 999 + 116 × June rainfall
8 Coefficient of determination
- 96.7% of the total variation is explained by the fitted model.
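A minimal Python/numpy sketch (not part of the original slides, which appear to show Minitab-style output) that reproduces the fitted line and r2 from the rainfall-yield table above:

    import numpy as np

    # June rainfall (cm) and yield (lb/acre) from the table on slide 2
    rain = np.array([3, 6, 7, 9, 11, 15, 17, 19], dtype=float)
    yld = np.array([1120, 1750, 1940, 2130, 2380, 2650, 2990, 3130], dtype=float)

    # Least-squares slope and intercept (roughly 116 and 999)
    slope, intercept = np.polyfit(rain, yld, 1)

    # Coefficient of determination r2 = 1 - SSE/SST (roughly 0.967)
    fitted = intercept + slope * rain
    sse = ((yld - fitted) ** 2).sum()
    sst = ((yld - yld.mean()) ** 2).sum()
    r2 = 1 - sse / sst

    print(f"Yield = {intercept:.0f} + {slope:.0f} x rainfall, r2 = {r2:.3f}")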
10 Recall the assumptions we made for the regression
- For each X there is a normal distribution of Y values.
- The variance of that normal distribution is the same for all X values.
- The mean of the Y values at a given X lies on a straight line, the regression line.
11 Another example
- Plot tree height against time as an indication of growth rate.
- What is the problem here? The observations are not independent.
12 - Assumption
- The Y observations collected for Xi must be independent of the Y observations collected for Xk.
13 - Make the Y observations independent by just looking at incremental growth.
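A minimal sketch of this idea (hypothetical numbers, not from the slides): the cumulative heights are dependent, but the year-to-year increments can be treated as independent Y observations.

    import numpy as np

    # Hypothetical cumulative tree heights (m), one measurement per year
    height = np.array([2.1, 2.9, 3.8, 4.5, 5.4, 6.0])

    # Yearly incremental growth (0.8, 0.9, 0.7, 0.9, 0.6) -- these
    # differences, not the raw heights, go into the regression
    growth = np.diff(height)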
14 Linear relationship?
16 - It is fairly easy to misuse regression. A rule of thumb is to use your common sense. If the results of an analysis don't make any sense, get help. Ultimately, statistics is a tool employed to help us understand life. Although understanding life can be tricky, it is not usually perverse. Before accepting conclusions which seem silly based on statistical analyses, consult with a veteran data analyst. Most truly revolutionary results from data analyses are based on data entry errors.
- Cody, R. P. and Smith, J. K. (1991) Applied Statistics and the SAS Programming Language.
17 Multiple regression
- The growth rate of a tree depends, for example, on
  - water supply
  - hours of daylight
  - soil composition
  - genetics
18 - Using only one of these variables to explain growth rate is not going to give a very accurate prediction of growth rate.
- In other words, dividing the total variation into the regression variation from just one of these variables plus the residual variation is likely to leave a huge residual term.
19 Multiple regression
The unknown parameters are called regression coefficients or partial regression coefficients.
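The model formula itself is not reproduced in the transcript; in standard notation, with k regressors, it is Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, and β1, ..., βk are the (partial) regression coefficients.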
20 In addition to the assumptions of simple linear regression, we also have to make the following
- Assumption: The variables X1, X2, X3, X4, ... are independent; there is no correlation between any pair of variables.
21 Air pollution in different American cities
- The response variable Y is sulfur dioxide (SO2)
- X1 Temperature
- X2 Number of factories
- X3 Population size
- X4 Wind
- X5 Precipitation
- X6 Number of days of precipitation
22 Air pollution
- Our aim is to see how well these six variables explain the amount of SO2.
- In total, 41 different cities were included in the study.
23 - Just as in simple linear regression, we ask if the regression coefficients are different from zero:
- H0: βj = 0 for all j
- H1: βj ≠ 0 for at least one j
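The usual test of this null hypothesis is the overall F-test from the regression ANOVA (the output itself is not reproduced in the transcript): F = (SSR / k) / (SSE / (n - k - 1)), here with k = 6 regressors and n = 41 cities.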
25 Air pollution
SO2 conc = 111 - 1.26 Temp (F) + 0.0650 No. factories - 0.0394 Population - 3.17 Wind + 0.509 Ppt - 0.050 No. days Ppt

Y = 111 - 1.26 X1 + 0.065 X2 - 0.0394 X3 - 3.17 X4 + 0.509 X5 - 0.050 X6
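The 41-city data set is not included in the transcript, so the following is only a sketch of how such a fit could be reproduced in Python with statsmodels; the file name and column names are hypothetical (the original output looks like Minitab):

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names -- the actual data are not given here
    data = pd.read_csv("usair.csv")
    X = data[["temp", "factories", "population", "wind", "precip", "precip_days"]]
    y = data["so2"]

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.summary())   # coefficients, standard errors, p-values, r2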
26 Conclusion
- At least one of the regression coefficients is different from zero.
27 - New question: Which regression coefficients are different from zero?

    Predictor    Coef       SE Coef    P
    Temp (F)     -1.2592    0.6203     0.049
    No. fact      0.06500   0.01575    0.000
    Populati     -0.03937   0.01514    0.014
    Wind         -3.169     1.815      0.090
    Ppt           0.5092    0.3629     0.170
    No. days     -0.0498    0.1617     0.760

- Only 3 of the 6 variables have regression coefficients significantly different from zero. What happens if we fit a regression model keeping only these three (and omitting Wind, Ppt and No. days)?
28 - Follow-up null hypothesis
- H0 ( )
29 Air pollution
SO2 conc = 58 - 0.584 Temp (F) + 0.0713 No. factories - 0.0467 Population

Y = 58 - 0.584 X1 + 0.0713 X2 - 0.0467 X3
30
    Predictor     Coef       SE Coef    P
    Temp (F)      -0.5841    0.3710     0.124
    No. fact       0.07131   0.01606    0.000
    Population    -0.04672   0.01537    0.004

- What happened here? Suddenly the regression coefficient of temperature is not significant!
- Also, r2 for the model including all six variables is 0.669, while r2 for the model including only three variables is 0.612.
31 - Questions
- 1. Which of the two models predicts the sulfur dioxide concentration best?
- 2. Should I further reduce the model by eliminating temperature?
32 Multicollinearity of variables
- Many, if not most, regression analyses are conducted on data sets where the independent variables show some degree of correlation. These data sets, resulting from non-experimental research, are common in all fields... The potential for a researcher to be misled by a non-experimental data set is high; for a novice researcher it is a near certainty.
- Cody, R. P. and Smith, J. K. (1991) Applied Statistics and the SAS Programming Language.
33 - The problem is that correlation among the independent variables causes the regression estimates to change depending on which independent variables are being used. That is, the impact of B on A depends on whether C is in the equation or not. With C omitted, B can look very influential. With C included, the impact of B can disappear completely!
- The reason for this is as follows: a regression coefficient tells us the unique contribution of an independent variable to the dependent variable. That is, the coefficient for B tells us what B contributes all by itself, with no overlap with any other variable. If B is the only variable in the equation, this is no problem. But if we add C, and if B and C are correlated, then the unique contribution of B to A will change.
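A small simulated sketch of exactly this effect (Python with statsmodels; not part of the original lecture). Here A really depends on C, and B is only correlated with C:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    c = rng.normal(size=n)
    b = c + 0.3 * rng.normal(size=n)   # B is strongly correlated with C
    a = 2.0 * c + rng.normal(size=n)   # A actually depends on C, not B

    # With C omitted, B looks very influential (slope close to 2)
    print(sm.OLS(a, sm.add_constant(b)).fit().params)

    # With C included, B's unique contribution largely disappears
    X = sm.add_constant(np.column_stack([b, c]))
    print(sm.OLS(a, X).fit().params)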