Stats 330: Lecture 25

About This Presentation

Title:

Stats 330: Lecture 25

Description:

The study aimed to evaluate two different treatments for drug addiction. ... List of variables. Variable Description Codes/Values Name. Identification Code 1-575 ID ... – PowerPoint PPT presentation

Number of Views:57

Avg rating:3.0/5.0

Slides: 45

Provided by: T10877

Category:

more less

Transcript and Presenter's Notes

Title: Stats 330: Lecture 25

1
Stats 330 Lecture 25
Multiple Logistic Regression Prediction and a
case study
2
Plan of the day

In todays lecture we discuss prediction and
present a logistic regression case study.
Topics covered will be
Prediction in logistic regression
In-sample and out-of-sample error rates
Cross-validation and bootstrap estimates of error
rates
Sensitivity and specificity
ROC curves
Then, a case study

3
Housekeeping

Error in slide 34 in lecture 23
Function is now influenceplots
Bug in ROC.curve download replacement from web
page

4
Prediction

Suppose we have fitted a logistic model and we
want to use the model to predict new cases. If a
new case presents with explanatory variables x,
how do we predict the y-value, 0 or 1?
Work out the estimated log-odds for the case
Work out probability Prob exp(log-odds)/(1exp(
log.odds))
Predict
Y1 if prob gt 0.5 (equivalently log.odds gt0)
Y0 if prob lt 0.5 (equivalently log.odds lt0)

5
Estimating the prediction error

Prediction error is the probability of a wrong
classification ( 0s predicted as 1s, 1s
predicted as 0s)
As in linear regression, using the training data
to estimate these proportions tends to give an
optimistic estimate
We can use cross-validation or the bootstrap to
improve the estimate see the case study

6
Sensitivity and specificity

Sensitivity probability of predicting a 1 when
the case is truly a 1 the true positive rate
Specificity probability of predicting a 0 when
the case is truly a 0 the true negative rate
(1-specificity is called the false positive
rate)
Ideally, want both to be close to 1
We would like to know what these would be for new
data use cross-validation and the bootstrap as
for normal regression

7
Calculating sensitivity and specificity
Model predicts Model predicts
Failure (0) Success (1)
Actual Failure (0) 100 200
Success ( 1) 250 600
Specificity 100/(100200) 33 Sensitivity
600/(600250) 70 In-sample error rate
(200250)/1150
8
ROC curves

We have predicted a success (Y1) if the
log-odds are positive.
We can generalize this to predict a success if
log-odds gtc for some constant c
If c is large and ve, almost every case will be
predicted as a success (1)
Sensitivity close to 1, specificity close to 0
If c is large and ve, almost every case will be
predicted as a failure (0)
Sensitivity close to 0, specificity close to 1
Allows a trade-off between sensitivity and
specificity
As c varies, the sensitivity and specificity
change.
ROC curve is a plot of the points (1-specificity,
sensitivity)
as c changes.

9
(No Transcript)
10
ROC curves - cont
Perfect prediction
Worst case prediction
True positive rate
True positive rate
False positive rate
Predictor no help
False positive rate
True positive rate
False positive rate
11
Area under the curve

For a perfect predictor, the area under the ROC
curve (AUC) is 1.
If the predictor is independent of the response,
the sensitivity and specificity are both 0.5.
AUC curve serves as a measure of how good the
model is at predicting.

12
Case study

The data comes from the University of
Massachusetts AIDS Research Unit IMPACT study, a
medical study performed in the US in the early
90s.
The study aimed to evaluate two different
treatments for drug addiction.
Reference Hosmer and Lemeshow, Applied Logistic
Regression (2nd Ed), p28

13
List of variables

Variable Description Codes/Values Name
Identification Code 1-575 ID
Age at Enrollment Years AGE
Beck Depression Score 0-54 BECK
IV Drug Use History 1 Never, IVHX
at Admission 2 Previous, 3 Recent
No of prior Treatments 0-40 NDRUGTX
Subject's Race 0 White , RACE
1
Other
Treatment Duration 0short, 1 Long TREAT
Treatment Site 0 A, 1 B SITE
Remained Drug Free 1 Yes, 0 No DFREE

14
The variables

The response DFREE is binary records if subject
is drug-free after conclusion of treatment.
There is a mix of categorical and continuous
explanatory variables
Categorical IVHX, RACE, TREAT, SITE
Continuous AGE, BECK, NDRUGTX.

15
Questions

Is the longer treatment more effective?
Did Site A deliver the program more effectively
than site B?
What other variables have an effect on successful
rehabilitation of addicts?
Can we predict who is likely to be drug-free in
12 months?

16
Analysis strategy

Preliminary plots, tables
Variable selection
Model fitting
Interpretation of coefficients
Evaluation as a predictor of recovery from
addiction.

17
Preliminary plots
18
Preliminary plots (2)
19
Preliminary plots (3)

Seems like number of previous drug treatments
have an effect
Seems like factors IVHX (Recent IV drug use),
SITE (Site A or Site B) and TREAT (short or long
treatment) have an effect

20
Preliminary fits (1)

Call
glm(formula DFREE . - IVHX - ID
factor(IVHX), family binomial,
data drug.df)
Deviance Residuals
Min 1Q Median 3Q Max
-1.3465 -0.8091 -0.6326 1.1834 2.4231

Dont include ID!
21
Preliminary fits (2)

Coefficients
Estimate Std. Error z value
Pr(gtz)
(Intercept) -2.4111283 0.5983427 -4.030
5.59e-05
AGE 0.0504143 0.0174057 2.896
0.00377
BECK 0.0002759 0.0107982 0.026
0.97961
NDRUGTX -0.0615329 0.0256441 -2.399
0.01642
RACE 0.2260262 0.2233685 1.012
0.31159
TREAT 0.4424802 0.1992922 2.220
0.02640
SITE 0.1489209 0.2176062 0.684
0.49375
factor(IVHX)2 -0.6036962 0.2875974 -2.099
0.03581
factor(IVHX)3 -0.7336591 0.2549893 -2.877
0.00401
(Dispersion parameter for binomial family taken
to be 1)
Null deviance 653.73 on 574 degrees of freedom
Residual deviance 619.25 on 566 degrees of
freedom
AIC 637.25
Number of Fisher Scoring iterations 4

22
Preliminary conclusions

Important variables seem to be AGE, NDRUGTX,
TREAT, IVHX
Data are ungrouped, cant assess goodness of fit
with residual deviance
No extremely large residuals

23
Hosmer-Lemeshow test

gt HLstat(drug.glm)
Value of HL statistic 5.05
P-value 0.752

No evidence of a bad fit using this test
24
Variable selection (1)

gt anova(drug.glm, test"Chisq")
Analysis of Deviance Table
Model binomial, link logit
Response DFREE
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
P(gtChi)
NULL 574 653.73
AGE 1 1.40 573 652.33
0.24
BECK 1 0.57 572 651.76
0.45
NDRUGTX 1 14.07 571 637.69
0.0001760
RACE 1 3.06 570 634.63
0.08
TREAT 1 4.96 569 629.67
0.03
SITE 1 1.07 568 628.60
0.30
factor(IVHX) 2 9.35 566 619.25
0.01

25
Variable selection (2)

Step AIC 632.59
DFREE NDRUGTX IVHX AGE TREAT
Call glm(formula DFREE NDRUGTX IVHX AGE
TREAT,
family binomial, data drug.df)
Degrees of Freedom 574 Total (i.e. Null) 569
Residual
Null Deviance 653.7
Residual Deviance 620.6 AIC 632.6

26
Sub-model

gt sub.glmlt-glm(DFREE NDRUGTX factor(IVHX)
AGE TREAT, familybinomial, datadrug.df)
gt summary(sub.glm)
Call
glm(formula DFREE NDRUGTX factor(IVHX)
AGE TREAT, family binomial, data drug.df)
Deviance Residuals
Min 1Q Median 3Q Max
-1.2598 -0.8051 -0.6284 1.1401 2.4574

27
Sub-model (ii)
All variables significant, but use caution

Coefficients
Estimate Std. Error z value
Pr(gtz)
(Intercept) -2.33276 0.54838 -4.254
2.1e-05
NDRUGTX -0.06376 0.02563 -2.488
0.012844
factor(IVHX)2 -0.62366 0.28470 -2.191
0.028484
factor(IVHX)3 -0.80561 0.24453 -3.294
0.000986
AGE 0.05259 0.01721 3.056
0.002244
TREAT 0.45134 0.19860 2.273
0.023048
(Dispersion parameter for binomial family taken
to be 1)
Null deviance 653.73 on 574 degrees of freedom
Residual deviance 620.59 on 569 degrees of
freedom
AIC 632.59
Number of Fisher Scoring iterations 4

28
Do we need interaction terms?

gt sub.glmlt-glm(DFREE NDRUGTX IVHX AGE
TREAT , familybinomial, datadrug.df)
gt sub2.glmlt-glm(DFREE NDRUGTXIVHX AGEIVHX
AGETREAT NDRUGTXTREAT , familybinomial,
datadrug.df)
gt
gt anova(sub.glm, sub2.glm, test"Chisq")
Analysis of Deviance Table
Model 1 DFREE NDRUGTX IVHX AGE TREAT
Model 2 DFREE NDRUGTX IVHX AGE IVHX
AGE TREAT NDRUGTX TREAT
Resid. Df Resid. Dev Df Deviance P(gtChi)
1 569 620.59
2 563 616.16 6 4.42 0.62

Model with interactions
Big p-value so interactions not required
29
Do we need to transform?
par(mfrowc(1,2)) sub.gamlt-gam(DFREE s(NDRUGTX)
factor(IVHX) s(AGE) TREAT ,
familybinomial, datadrug.df) plot(sub.gam)
30
Transforming

Suggests a possible quadratic in NDRUGTX

gt subq.glmlt-glm(DFREE poly(NDRUGTX,2) IVHX
AGE TREAT, familybinomial, datadrug.df) gt
summary(subq.glm) Coefficients
Estimate Std. Error z value Pr(gtz)
(Intercept) -2.70763 0.56715 -4.774
1.80e-06 poly(NDRUGTX, 2)1 -7.24501
2.93274 -2.470 0.01350 poly(NDRUGTX, 2)2
4.21319 2.69624 1.563 0.11814 IVHX2
-0.59018 0.28608 -2.063 0.03912
IVHX3 -0.76015 0.24725 -3.074
0.00211 AGE 0.05458 0.01730
3.154 0.00161 TREAT 0.44379
0.19904 2.230 0.02577
But term is not significant, so we stick with no
transformation
31
Diagnostics
Pt 85
7, 471
7, 471
32
Influence of 7, 471, 85
Effect on coefficients of removing cases

None 7 85 471
All
(Intercept) -2.333 -2.295 -2.222 -2.447 -2.293
NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100
IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662
IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747
AGE 0.053 0.053 0.049 0.057 0.054
TREAT 0.451 0.434 0.441 0.479 0.450

None seem particularly influential! We will not
delete them
33
Over-dispersion

qsub.glmlt-glm(DFREE NDRUGTX IVHX AGE
TREAT, familyquasibinomial, datadrug.df)
gt summary(qsub.glm)
Coefficients
Estimate Std. Error t value Pr(gtt)
(Intercept) -2.33276 0.55435 -4.208 2.99e-05
NDRUGTX -0.06376 0.02591 -2.461 0.01414
IVHX2 -0.62366 0.28780 -2.167 0.03065
IVHX3 -0.80561 0.24720 -3.259 0.00118
AGE 0.05259 0.01740 3.023 0.00262
TREAT 0.45134 0.20076 2.248 0.02495
---
(Dispersion parameter for quasibinomial family
taken to be 1.021892)

Very close to 1 so no overdispersion
34
Interpretation

Coefficients
Estimate Std. Error z value Pr(gtz)
(Intercept) -2.33276 0.54838 -4.254 2.1e-05
NDRUGTX -0.06376 0.02563 -2.488 0.012844
IVHX2 -0.62366 0.28470 -2.191 0.028484
IVHX3 -0.80561 0.24453 -3.294 0.000986
AGE 0.05259 0.01721 3.056 0.002244
TREAT 0.45134 0.19860 2.273 0.023048
As the number of prior treatments goes up, the
probability of a drug-free recovery goes down
The probability of a drug-free recovery for
persons with no IV drug use is more than for
persons with previous IV drug use
The probability of a drug-free recovery for
persons with previous IV drug use is more than
for persons with recent IV drug use.
The probability of a drug-free recovery goes up
with age
The probability of a drug-free recovery is higher
for the long treatment

35
Interpreting p-values after model selection

We have seen that this is not valid, as model
selection changes the distribution of the
estimated coefficients.
We can use the bootstrap to examine the revised
distribution
Leave TREAT in the model, use forward selection
to select the other variables.

36
Procedure

Draw a bootstrap sample
Do forward selection, record value of regression
coef for TREAT (forced to be in every model)
Repeat 200 times, draw histogram of the results

37
R code

bootstrap sample
n dim(drug.df)1
B200
beta.boot numeric(B)
for(b in 1B)
ni rmultinom(1, n ,probrep(1/n,n))
newdata drug.dfrep(1n,ni),
drug.boot.glm glm(DFREE NDRUGTX
factor(IVHX) AGE BECK TREAT RACE SITE,
familybinomial, data newdata)
chosen step(drug.boot.glm, list(lower DFREE
TREAT, upper formula(drug.boot.glm)),
direction forward", trace0)
k match("TREAT",names(coef(chosen)))
beta.bootb summary(chosen)coefficientsk,1

38
Histogram
gt mean(beta.boot) 1 0.4540209 gt
sd(beta.boot) 1 0.2019222 gt z.val
mean(beta.boot)/ sd(beta.boot) gt
2(1-pnorm(z.val)) 1 0.02454468 Compare Beta
0.45134 SE 0.19860 P-value 0.023048
39
Prediction

Sensitivity chance the model predicts a
successful recovery (drug-free at end of
program), when one will actually occur
Specificity chance the model predicts a failure
(return to drug use before end of program), when
one actually will occur

40
R code

sub.glmlt-glm(DFREE NDRUGTX IVHX AGE TREAT
, familybinomial, datadrug.df)
gt pred predict(sub.glm, type"response")
gt predcode ifelse(predlt0.5, 0,1)
gt table(drug.dfDFREE,predcode)
predicted
0 1
Actual 0 426 2
1 144 3
Sensitivity 3/147 0.02040816
Specificity 426/428 0.9953271
Error rate 146/575 0.2539130
Proportion correctly classified 429/575
0.746087

41
ROC curve
ROC.curve(DFREE NDRUGTX factor(IVHX) AGE
TREAT, data drug.df) in the R330 package
42
Prediction (2)

Use 10-fold cross-validation
Split data into 10 parts
Calculate sensitivity and specificity for each
part, using model fitted to the remaining parts
Average results
Repeat for different splits, average repeats
E.g. for one part

43
CV and bootstrap Results

gt cross.val(DFREE NDRUGTX factor(IVHX) AGE
TREAT, drug.df)
Mean Specificity 0.9908424
Mean Sensitivity 0.02854005
Mean Correctly classified 0.7446491
gt err.boot(DFREE NDRUGTX factor(IVHX) AGE
TREAT, data drug.df)
err
1 0.2539130
Err
1 0.2552974

A poor classifier, but this doesnt mean that
the model fits poorly there are very few cases
with fitted probs over 0.5, and many with
fitted probabilities between 0.2 and 0.5. We
expect a moderate number of these to be
misclassified, as some events (being drug free)
with probs 0.2 to 0.5 have occurred.
Bootstrap estimate
Training set estimate
44
Overall conclusions

Model seems to fit well
Strong evidence that longer treatments are better
No apparent difference between sites
Age and prior IV drug use affect recovery
Model predicts poorly for the covariates in the
data set effectively always predicts patients
will not be drug free

Write a Comment

User Comments (0)