STATS 330: Lecture 16 - PowerPoint PPT Presentation

About This Presentation

Title:

STATS 330: Lecture 16

Description:

No real evidence of non-linearity, but will check further with gams ... Check for linearity with gams library(mgcv) plot(gam(evap~s(avst) s(maxst) s(maxat) ... – PowerPoint PPT presentation

Number of Views:38

Avg rating:3.0/5.0

Slides: 28

Provided by: statAuc

Category:

more less

Transcript and Presenter's Notes

Title: STATS 330: Lecture 16

1
STATS 330 Lecture 16
Case Study
2
Case study

Aim of todays lecture
To illustrate the modelling process using the
evaporation data.

3
The Evaporation data

Data in data frame evap.df
Aims of the analysis
Understand relationships between explanatory
variables and the response
Be able to predict evaporation loss given the
other variables

4
Case Study Evaporation data

Recall from Lecture 15 variables are
evap the amount of moisture evaporating from
the soil in the 24 hour period (response)
maxst maximum soil temperature over the 24 hour
period
minst minimum soil temperature over the 24 hour
period
avst average soil temperature over the 24 hour
period
maxat maximum air temperature over the 24 hour
period
minat minimum air temperature over the 24 hour
period
avat average air temperature over the 24 hour
period
maxh maximum humidity over the 24 hour period
minh minimum humidity over the 24 hour period
avh average humidity over the 24 hour period
wind average wind speed over the 24 hour period.

5
Modelling cycle
6
Modelling cycle (2)

Our plan of attack
Graphical check
Suitability for regression
Gross outliers
Preliminary fit
Model selection (for prediction)
Transforming if required
Outlier check
Use model for prediction

7
Step 1 Plots

Preliminary plots
Want to get an initial idea of suitability of
data for regression modelling
Check for linear relationships, outliers
Pairs plots, coplots
Data looks OK to proceed, but evap/maxh plot
looks curved

8
(No Transcript)
9
Points to note

Avh has very few values
Strong relationships between response and some
variables (particularly maxh, avst)
Not much relationship between response and minst,
minat, wind
strong relationships between min, av and max
No obvious outliers

10
Step 2 preliminary fit

Coefficients
Estimate Std. Error t value
Pr(gtt)
(Intercept) -54.074877 130.720826 -0.414
0.68164
avst 2.231782 1.003882 2.223
0.03276
minst 0.204854 1.104523 0.185
0.85393
maxst -0.742580 0.349609 -2.124
0.04081
avat 0.501055 0.568964 0.881
0.38452
minat 0.304126 0.788877 0.386
0.70219
maxat 0.092187 0.218054 0.423
0.67505
avh 1.109858 1.133126 0.979
0.33407
minh 0.751405 0.487749 1.541
0.13242
maxh -0.556292 0.161602 -3.442
0.00151
wind 0.008918 0.009167 0.973
0.33733
Residual standard error 6.508 on 35 degrees of
freedom
Multiple R-squared 0.8463, Adjusted
R-squared 0.8023
F-statistic 19.27 on 10 and 35 DF, p-value
2.073e-11

11
(No Transcript)
12
Findings

Plots OK, normality dubious
Gam plots indicated no transformations
Point 31 has quite high Cooks distance but
removing it doesnt change regression much
Model is OK.
Could interpret coefficients, but variables
highly correlated.

13
Step 3 Model selection

Use APR
Model selected was
evap maxat maxh wind
However, this model does not fit all that well
(outliers, non-normality)
Try best AIC model
evap avst maxst maxat minhmaxh
Now proceed to step 4

14
Step 4 Diagnostic checks

For a quick check, plot the regression object
produced by lm

model1.lmlt-lm(evap avst maxst maxat
minhmaxh, dataevap.df) plot(model1.lm)
15
Outliers ? Non-normal?
16
Conclusions?

No real evidence of non-linearity, but will check
further with gams
Normal plot looks curved
Some largish outliers
Points 2, 41 have largish Cooks D

17
Checking linearity

Check for linearity with gams

gt library(mgcv) gtplot(gam(evap s(avst)
s(maxst) s(maxat) s(maxh) s(wind),
dataevap.df))
18
Transform avst, maxh ?
19
Remedy

Gam plots for avst and maxh are curved
Try cubics in these variables
Plots look better
Cubic terms are significant

20
(No Transcript)
21
gt model2.lmlt-lm(evap poly(avst,3) maxst
maxat minhpoly(maxh,3), dataevap.df) gt
summary(model2.lm) Coefficients
Estimate Std. Error t value Pr(gtt)
(Intercept) 74.6521 25.4308 2.935
0.00577 poly(avst, 3)1 83.0106 27.3221
3.038 0.00441 poly(avst, 3)2 21.4666
8.3097 2.583 0.01399 poly(avst, 3)3
14.1680 7.2199 1.962 0.05749 . maxst
-0.8167 0.1697 -4.814 2.65e-05
maxat 0.4175 0.1177 3.546
0.00111 minh 0.4580 0.3253
1.408 0.16766 poly(maxh, 3)1 -89.0809
20.0297 -4.447 8.02e-05 poly(maxh, 3)2
-10.7374 7.9265 -1.355 0.18398
poly(maxh, 3)3 15.1172 6.3209 2.392
0.02212 --- Residual standard error 5.276
on 36 degrees of freedom Multiple R-squared
0.8961, Adjusted R-squared 0.8701
F-statistic 34.49 on 9 and 36 DF, p-value
4.459e-15
22
New model

Lets now adopt model
lm(evappoly(avst,3)maxstmaxatpoly(maxh,3)
wind
Outliers are not too bad but lets check

gt influenceplots(model2.lm)
23
(No Transcript)
24
(No Transcript)
25
Deletion of points

Points 2, 6, 7, 41 are affecting the fitted
values, some coefficients. Removing these one at
a time and refitting indicates that the cubics
are not very robust, so we revert to the
non-polynomial model
The coefficients of the non-polynomial model are
fairly stable when we delete these points one at
a time, so we decide to retain them.

26
Normality?

However, the normal plot for the non-polynomial
model is not very straight WB test has p-value
0.
Normality of polynomial model is better
Try predictions with both

27
predict.df data.frame(avst mean(evap.dfavst),
maxst mean(evap.dfmaxst), maxat
mean(evap.dfmaxat), maxh mean(evap.dfmaxh), mi
nh mean(evap.dfminh)) rbind(predict(model1.lm,
predict.df,interval"p" ), predict(model2.lm,
predict.df,interval"p" )) fit lwr
upr 1 34.67391 21.75680 47.59103 1 32.38471
21.39857 43.37084 CV fit fit lwr
upr 1 34.67391 21.02628 48.32154

Write a Comment

User Comments (0)