Title: STATS 330: Lecture 16
1STATS 330 Lecture 16
Case Study
2Case study
- Aim of todays lecture
- To illustrate the modelling process using the
evaporation data.
3The Evaporation data
- Data in data frame evap.df
- Aims of the analysis
- Understand relationships between explanatory
variables and the response - Be able to predict evaporation loss given the
other variables
4Case Study Evaporation data
- Recall from Lecture 15 variables are
- evap the amount of moisture evaporating from
the soil in the 24 hour period (response) - maxst maximum soil temperature over the 24 hour
period - minst minimum soil temperature over the 24 hour
period - avst average soil temperature over the 24 hour
period - maxat maximum air temperature over the 24 hour
period - minat minimum air temperature over the 24 hour
period - avat average air temperature over the 24 hour
period - maxh maximum humidity over the 24 hour period
- minh minimum humidity over the 24 hour period
- avh average humidity over the 24 hour period
- wind average wind speed over the 24 hour period.
5Modelling cycle
6Modelling cycle (2)
- Our plan of attack
- Graphical check
- Suitability for regression
- Gross outliers
- Preliminary fit
- Model selection (for prediction)
- Transforming if required
- Outlier check
- Use model for prediction
7Step 1 Plots
- Preliminary plots
- Want to get an initial idea of suitability of
data for regression modelling - Check for linear relationships, outliers
- Pairs plots, coplots
- Data looks OK to proceed, but evap/maxh plot
looks curved
8(No Transcript)
9Points to note
- Avh has very few values
- Strong relationships between response and some
variables (particularly maxh, avst) - Not much relationship between response and minst,
minat, wind - strong relationships between min, av and max
- No obvious outliers
10Step 2 preliminary fit
- Coefficients
- Estimate Std. Error t value
Pr(gtt) - (Intercept) -54.074877 130.720826 -0.414
0.68164 - avst 2.231782 1.003882 2.223
0.03276 - minst 0.204854 1.104523 0.185
0.85393 - maxst -0.742580 0.349609 -2.124
0.04081 - avat 0.501055 0.568964 0.881
0.38452 - minat 0.304126 0.788877 0.386
0.70219 - maxat 0.092187 0.218054 0.423
0.67505 - avh 1.109858 1.133126 0.979
0.33407 - minh 0.751405 0.487749 1.541
0.13242 - maxh -0.556292 0.161602 -3.442
0.00151 - wind 0.008918 0.009167 0.973
0.33733 - Residual standard error 6.508 on 35 degrees of
freedom - Multiple R-squared 0.8463, Adjusted
R-squared 0.8023 - F-statistic 19.27 on 10 and 35 DF, p-value
2.073e-11
11(No Transcript)
12Findings
- Plots OK, normality dubious
- Gam plots indicated no transformations
- Point 31 has quite high Cooks distance but
removing it doesnt change regression much - Model is OK.
- Could interpret coefficients, but variables
highly correlated.
13Step 3 Model selection
- Use APR
- Model selected was
- evap maxat maxh wind
- However, this model does not fit all that well
(outliers, non-normality) - Try best AIC model
- evap avst maxst maxat minhmaxh
- Now proceed to step 4
14Step 4 Diagnostic checks
- For a quick check, plot the regression object
produced by lm
model1.lmlt-lm(evap avst maxst maxat
minhmaxh, dataevap.df) plot(model1.lm)
15Outliers ? Non-normal?
16Conclusions?
- No real evidence of non-linearity, but will check
further with gams - Normal plot looks curved
- Some largish outliers
- Points 2, 41 have largish Cooks D
17Checking linearity
- Check for linearity with gams
gt library(mgcv) gtplot(gam(evap s(avst)
s(maxst) s(maxat) s(maxh) s(wind),
dataevap.df))
18Transform avst, maxh ?
19Remedy
- Gam plots for avst and maxh are curved
- Try cubics in these variables
- Plots look better
- Cubic terms are significant
20(No Transcript)
21gt model2.lmlt-lm(evap poly(avst,3) maxst
maxat minhpoly(maxh,3), dataevap.df) gt
summary(model2.lm) Coefficients
Estimate Std. Error t value Pr(gtt)
(Intercept) 74.6521 25.4308 2.935
0.00577 poly(avst, 3)1 83.0106 27.3221
3.038 0.00441 poly(avst, 3)2 21.4666
8.3097 2.583 0.01399 poly(avst, 3)3
14.1680 7.2199 1.962 0.05749 . maxst
-0.8167 0.1697 -4.814 2.65e-05
maxat 0.4175 0.1177 3.546
0.00111 minh 0.4580 0.3253
1.408 0.16766 poly(maxh, 3)1 -89.0809
20.0297 -4.447 8.02e-05 poly(maxh, 3)2
-10.7374 7.9265 -1.355 0.18398
poly(maxh, 3)3 15.1172 6.3209 2.392
0.02212 --- Residual standard error 5.276
on 36 degrees of freedom Multiple R-squared
0.8961, Adjusted R-squared 0.8701
F-statistic 34.49 on 9 and 36 DF, p-value
4.459e-15
22New model
- Lets now adopt model
- lm(evappoly(avst,3)maxstmaxatpoly(maxh,3)
wind - Outliers are not too bad but lets check
gt influenceplots(model2.lm)
23(No Transcript)
24(No Transcript)
25Deletion of points
- Points 2, 6, 7, 41 are affecting the fitted
values, some coefficients. Removing these one at
a time and refitting indicates that the cubics
are not very robust, so we revert to the
non-polynomial model - The coefficients of the non-polynomial model are
fairly stable when we delete these points one at
a time, so we decide to retain them.
26Normality?
- However, the normal plot for the non-polynomial
model is not very straight WB test has p-value
0. - Normality of polynomial model is better
- Try predictions with both
27predict.df data.frame(avst mean(evap.dfavst),
maxst mean(evap.dfmaxst), maxat
mean(evap.dfmaxat), maxh mean(evap.dfmaxh), mi
nh mean(evap.dfminh)) rbind(predict(model1.lm,
predict.df,interval"p" ), predict(model2.lm,
predict.df,interval"p" )) fit lwr
upr 1 34.67391 21.75680 47.59103 1 32.38471
21.39857 43.37084 CV fit fit lwr
upr 1 34.67391 21.02628 48.32154