Title: Stats 330: Lecture 25
1Stats 330 Lecture 25
Multiple Logistic Regression Prediction and a
case study
2Plan of the day
- In todays lecture we discuss prediction and
present a logistic regression case study. - Topics covered will be
- Prediction in logistic regression
- In-sample and out-of-sample error rates
- Cross-validation and bootstrap estimates of error
rates - Sensitivity and specificity
- ROC curves
- Then, a case study
3Housekeeping
- Error in slide 34 in lecture 23
- Function is now influenceplots
- Bug in ROC.curve download replacement from web
page
4Prediction
- Suppose we have fitted a logistic model and we
want to use the model to predict new cases. If a
new case presents with explanatory variables x,
how do we predict the y-value, 0 or 1? - Work out the estimated log-odds for the case
- Work out probability Prob exp(log-odds)/(1exp(
log.odds)) - Predict
- Y1 if prob gt 0.5 (equivalently log.odds gt0)
- Y0 if prob lt 0.5 (equivalently log.odds lt0)
5Estimating the prediction error
- Prediction error is the probability of a wrong
classification ( 0s predicted as 1s, 1s
predicted as 0s) - As in linear regression, using the training data
to estimate these proportions tends to give an
optimistic estimate - We can use cross-validation or the bootstrap to
improve the estimate see the case study
6Sensitivity and specificity
- Sensitivity probability of predicting a 1 when
the case is truly a 1 the true positive rate - Specificity probability of predicting a 0 when
the case is truly a 0 the true negative rate
(1-specificity is called the false positive
rate) - Ideally, want both to be close to 1
- We would like to know what these would be for new
data use cross-validation and the bootstrap as
for normal regression
7Calculating sensitivity and specificity
Model predicts Model predicts
Failure (0) Success (1)
Actual Failure (0) 100 200
Success ( 1) 250 600
Specificity 100/(100200) 33 Sensitivity
600/(600250) 70 In-sample error rate
(200250)/1150
8ROC curves
- We have predicted a success (Y1) if the
log-odds are positive. - We can generalize this to predict a success if
log-odds gtc for some constant c - If c is large and ve, almost every case will be
predicted as a success (1) - Sensitivity close to 1, specificity close to 0
- If c is large and ve, almost every case will be
predicted as a failure (0) - Sensitivity close to 0, specificity close to 1
- Allows a trade-off between sensitivity and
specificity - As c varies, the sensitivity and specificity
change. - ROC curve is a plot of the points (1-specificity,
sensitivity) - as c changes.
9(No Transcript)
10ROC curves - cont
Perfect prediction
Worst case prediction
True positive rate
True positive rate
False positive rate
Predictor no help
False positive rate
True positive rate
False positive rate
11Area under the curve
- For a perfect predictor, the area under the ROC
curve (AUC) is 1. - If the predictor is independent of the response,
the sensitivity and specificity are both 0.5. - AUC curve serves as a measure of how good the
model is at predicting.
12Case study
- The data comes from the University of
Massachusetts AIDS Research Unit IMPACT study, a
medical study performed in the US in the early
90s. -
- The study aimed to evaluate two different
treatments for drug addiction. - Reference Hosmer and Lemeshow, Applied Logistic
Regression (2nd Ed), p28
13List of variables
- Variable Description Codes/Values Name
- Identification Code 1-575 ID
- Age at Enrollment Years AGE
- Beck Depression Score 0-54 BECK
- IV Drug Use History 1 Never, IVHX
- at Admission 2 Previous, 3 Recent
- No of prior Treatments 0-40 NDRUGTX
- Subject's Race 0 White , RACE
- 1
Other - Treatment Duration 0short, 1 Long TREAT
- Treatment Site 0 A, 1 B SITE
- Remained Drug Free 1 Yes, 0 No DFREE
-
14The variables
- The response DFREE is binary records if subject
is drug-free after conclusion of treatment. - There is a mix of categorical and continuous
explanatory variables - Categorical IVHX, RACE, TREAT, SITE
- Continuous AGE, BECK, NDRUGTX.
15Questions
- Is the longer treatment more effective?
- Did Site A deliver the program more effectively
than site B? - What other variables have an effect on successful
rehabilitation of addicts? - Can we predict who is likely to be drug-free in
12 months?
16Analysis strategy
- Preliminary plots, tables
- Variable selection
- Model fitting
- Interpretation of coefficients
- Evaluation as a predictor of recovery from
addiction.
17Preliminary plots
18Preliminary plots (2)
19Preliminary plots (3)
- Seems like number of previous drug treatments
have an effect - Seems like factors IVHX (Recent IV drug use),
SITE (Site A or Site B) and TREAT (short or long
treatment) have an effect
20Preliminary fits (1)
- Call
- glm(formula DFREE . - IVHX - ID
factor(IVHX), family binomial, - data drug.df)
- Deviance Residuals
- Min 1Q Median 3Q Max
- -1.3465 -0.8091 -0.6326 1.1834 2.4231
Dont include ID!
21Preliminary fits (2)
- Coefficients
- Estimate Std. Error z value
Pr(gtz) - (Intercept) -2.4111283 0.5983427 -4.030
5.59e-05 - AGE 0.0504143 0.0174057 2.896
0.00377 - BECK 0.0002759 0.0107982 0.026
0.97961 - NDRUGTX -0.0615329 0.0256441 -2.399
0.01642 - RACE 0.2260262 0.2233685 1.012
0.31159 - TREAT 0.4424802 0.1992922 2.220
0.02640 - SITE 0.1489209 0.2176062 0.684
0.49375 - factor(IVHX)2 -0.6036962 0.2875974 -2.099
0.03581 - factor(IVHX)3 -0.7336591 0.2549893 -2.877
0.00401 - (Dispersion parameter for binomial family taken
to be 1) - Null deviance 653.73 on 574 degrees of freedom
- Residual deviance 619.25 on 566 degrees of
freedom - AIC 637.25
- Number of Fisher Scoring iterations 4
22Preliminary conclusions
- Important variables seem to be AGE, NDRUGTX,
TREAT, IVHX - Data are ungrouped, cant assess goodness of fit
with residual deviance - No extremely large residuals
23Hosmer-Lemeshow test
- gt HLstat(drug.glm)
- Value of HL statistic 5.05
- P-value 0.752
No evidence of a bad fit using this test
24Variable selection (1)
- gt anova(drug.glm, test"Chisq")
- Analysis of Deviance Table
- Model binomial, link logit
- Response DFREE
- Terms added sequentially (first to last)
- Df Deviance Resid. Df Resid. Dev
P(gtChi) - NULL 574 653.73
- AGE 1 1.40 573 652.33
0.24 - BECK 1 0.57 572 651.76
0.45 - NDRUGTX 1 14.07 571 637.69
0.0001760 - RACE 1 3.06 570 634.63
0.08 - TREAT 1 4.96 569 629.67
0.03 - SITE 1 1.07 568 628.60
0.30 - factor(IVHX) 2 9.35 566 619.25
0.01
25Variable selection (2)
- Step AIC 632.59
- DFREE NDRUGTX IVHX AGE TREAT
- Call glm(formula DFREE NDRUGTX IVHX AGE
TREAT, - family binomial, data drug.df)
- Degrees of Freedom 574 Total (i.e. Null) 569
Residual - Null Deviance 653.7
- Residual Deviance 620.6 AIC 632.6
26Sub-model
- gt sub.glmlt-glm(DFREE NDRUGTX factor(IVHX)
AGE TREAT, familybinomial, datadrug.df) - gt summary(sub.glm)
- Call
- glm(formula DFREE NDRUGTX factor(IVHX)
AGE TREAT, family binomial, data drug.df) - Deviance Residuals
- Min 1Q Median 3Q Max
- -1.2598 -0.8051 -0.6284 1.1401 2.4574
27Sub-model (ii)
All variables significant, but use caution
- Coefficients
- Estimate Std. Error z value
Pr(gtz) - (Intercept) -2.33276 0.54838 -4.254
2.1e-05 - NDRUGTX -0.06376 0.02563 -2.488
0.012844 - factor(IVHX)2 -0.62366 0.28470 -2.191
0.028484 - factor(IVHX)3 -0.80561 0.24453 -3.294
0.000986 - AGE 0.05259 0.01721 3.056
0.002244 - TREAT 0.45134 0.19860 2.273
0.023048 - (Dispersion parameter for binomial family taken
to be 1) - Null deviance 653.73 on 574 degrees of freedom
- Residual deviance 620.59 on 569 degrees of
freedom - AIC 632.59
- Number of Fisher Scoring iterations 4
28Do we need interaction terms?
- gt sub.glmlt-glm(DFREE NDRUGTX IVHX AGE
TREAT , familybinomial, datadrug.df) - gt sub2.glmlt-glm(DFREE NDRUGTXIVHX AGEIVHX
AGETREAT NDRUGTXTREAT , familybinomial,
datadrug.df) - gt
- gt anova(sub.glm, sub2.glm, test"Chisq")
- Analysis of Deviance Table
- Model 1 DFREE NDRUGTX IVHX AGE TREAT
- Model 2 DFREE NDRUGTX IVHX AGE IVHX
AGE TREAT NDRUGTX TREAT - Resid. Df Resid. Dev Df Deviance P(gtChi)
- 1 569 620.59
- 2 563 616.16 6 4.42 0.62
Model with interactions
Big p-value so interactions not required
29Do we need to transform?
par(mfrowc(1,2)) sub.gamlt-gam(DFREE s(NDRUGTX)
factor(IVHX) s(AGE) TREAT ,
familybinomial, datadrug.df) plot(sub.gam)
30Transforming
- Suggests a possible quadratic in NDRUGTX
gt subq.glmlt-glm(DFREE poly(NDRUGTX,2) IVHX
AGE TREAT, familybinomial, datadrug.df) gt
summary(subq.glm) Coefficients
Estimate Std. Error z value Pr(gtz)
(Intercept) -2.70763 0.56715 -4.774
1.80e-06 poly(NDRUGTX, 2)1 -7.24501
2.93274 -2.470 0.01350 poly(NDRUGTX, 2)2
4.21319 2.69624 1.563 0.11814 IVHX2
-0.59018 0.28608 -2.063 0.03912
IVHX3 -0.76015 0.24725 -3.074
0.00211 AGE 0.05458 0.01730
3.154 0.00161 TREAT 0.44379
0.19904 2.230 0.02577
But term is not significant, so we stick with no
transformation
31Diagnostics
Pt 85
7, 471
7, 471
32Influence of 7, 471, 85
Effect on coefficients of removing cases
- None 7 85 471
All - (Intercept) -2.333 -2.295 -2.222 -2.447 -2.293
- NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100
- IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662
- IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747
- AGE 0.053 0.053 0.049 0.057 0.054
- TREAT 0.451 0.434 0.441 0.479 0.450
None seem particularly influential! We will not
delete them
33Over-dispersion
- qsub.glmlt-glm(DFREE NDRUGTX IVHX AGE
TREAT, familyquasibinomial, datadrug.df) - gt summary(qsub.glm)
- Coefficients
- Estimate Std. Error t value Pr(gtt)
- (Intercept) -2.33276 0.55435 -4.208 2.99e-05
- NDRUGTX -0.06376 0.02591 -2.461 0.01414
- IVHX2 -0.62366 0.28780 -2.167 0.03065
- IVHX3 -0.80561 0.24720 -3.259 0.00118
- AGE 0.05259 0.01740 3.023 0.00262
- TREAT 0.45134 0.20076 2.248 0.02495
- ---
- (Dispersion parameter for quasibinomial family
taken to be 1.021892)
Very close to 1 so no overdispersion
34Interpretation
- Coefficients
- Estimate Std. Error z value Pr(gtz)
- (Intercept) -2.33276 0.54838 -4.254 2.1e-05
- NDRUGTX -0.06376 0.02563 -2.488 0.012844
- IVHX2 -0.62366 0.28470 -2.191 0.028484
- IVHX3 -0.80561 0.24453 -3.294 0.000986
- AGE 0.05259 0.01721 3.056 0.002244
- TREAT 0.45134 0.19860 2.273 0.023048
- As the number of prior treatments goes up, the
probability of a drug-free recovery goes down - The probability of a drug-free recovery for
persons with no IV drug use is more than for
persons with previous IV drug use - The probability of a drug-free recovery for
persons with previous IV drug use is more than
for persons with recent IV drug use. - The probability of a drug-free recovery goes up
with age - The probability of a drug-free recovery is higher
for the long treatment
35Interpreting p-values after model selection
- We have seen that this is not valid, as model
selection changes the distribution of the
estimated coefficients. - We can use the bootstrap to examine the revised
distribution - Leave TREAT in the model, use forward selection
to select the other variables.
36Procedure
- Draw a bootstrap sample
- Do forward selection, record value of regression
coef for TREAT (forced to be in every model) - Repeat 200 times, draw histogram of the results
37R code
- bootstrap sample
- n dim(drug.df)1
- B200
- beta.boot numeric(B)
- for(b in 1B)
- ni rmultinom(1, n ,probrep(1/n,n))
- newdata drug.dfrep(1n,ni),
- drug.boot.glm glm(DFREE NDRUGTX
factor(IVHX) AGE BECK TREAT RACE SITE, - familybinomial, data newdata)
- chosen step(drug.boot.glm, list(lower DFREE
TREAT, upper formula(drug.boot.glm)), - direction forward", trace0)
- k match("TREAT",names(coef(chosen)))
- beta.bootb summary(chosen)coefficientsk,1
-
38Histogram
gt mean(beta.boot) 1 0.4540209 gt
sd(beta.boot) 1 0.2019222 gt z.val
mean(beta.boot)/ sd(beta.boot) gt
2(1-pnorm(z.val)) 1 0.02454468 Compare Beta
0.45134 SE 0.19860 P-value 0.023048
39Prediction
- Sensitivity chance the model predicts a
successful recovery (drug-free at end of
program), when one will actually occur - Specificity chance the model predicts a failure
(return to drug use before end of program), when
one actually will occur
40R code
- sub.glmlt-glm(DFREE NDRUGTX IVHX AGE TREAT
, familybinomial, datadrug.df) - gt pred predict(sub.glm, type"response")
- gt predcode ifelse(predlt0.5, 0,1)
- gt table(drug.dfDFREE,predcode)
- predicted
- 0 1
- Actual 0 426 2
- 1 144 3
- Sensitivity 3/147 0.02040816
- Specificity 426/428 0.9953271
- Error rate 146/575 0.2539130
- Proportion correctly classified 429/575
0.746087
41ROC curve
ROC.curve(DFREE NDRUGTX factor(IVHX) AGE
TREAT, data drug.df) in the R330 package
42Prediction (2)
- Use 10-fold cross-validation
- Split data into 10 parts
- Calculate sensitivity and specificity for each
part, using model fitted to the remaining parts - Average results
- Repeat for different splits, average repeats
- E.g. for one part
43CV and bootstrap Results
- gt cross.val(DFREE NDRUGTX factor(IVHX) AGE
TREAT, drug.df) - Mean Specificity 0.9908424
- Mean Sensitivity 0.02854005
- Mean Correctly classified 0.7446491
- gt err.boot(DFREE NDRUGTX factor(IVHX) AGE
TREAT, data drug.df) - err
- 1 0.2539130
- Err
- 1 0.2552974
A poor classifier, but this doesnt mean that
the model fits poorly there are very few cases
with fitted probs over 0.5, and many with
fitted probabilities between 0.2 and 0.5. We
expect a moderate number of these to be
misclassified, as some events (being drug free)
with probs 0.2 to 0.5 have occurred.
Bootstrap estimate
Training set estimate
44Overall conclusions
- Model seems to fit well
- Strong evidence that longer treatments are better
- No apparent difference between sites
- Age and prior IV drug use affect recovery
- Model predicts poorly for the covariates in the
data set effectively always predicts patients
will not be drug free