Building useful models: Some new developments and easily avoidable errors

1
Building useful models: Some new developments and
easily avoidable errors
  • Michael Babyak, PhD

2
What is a model?
Y = f(x1, x2, x3, ..., xn)
Y = a + b1x1 + b2x2 + ... + bnxn
Y = a + b1x1 + b2x2 + ... + bnxn + e
3
"All models are wrong, some are useful" -- George Box
  • A useful model is
  • Not very biased
  • Interpretable
  • Replicable (predicts in a new sample)

4
(No Transcript)
5
Some Premises
  • Statistics is a cumulative, evolving field
  • Newer is not necessarily better, but should be
    entertained in the context of the scientific
    question at hand
  • Data analytic practice resides along a continuum,
    from exploratory to confirmatory. Both are
    important, but the difference has to be
    recognized.
  • There's no substitute for thinking about the
    problem

6
Statistics is a cumulative, evolving field: How
do we know this stuff?
  • Theory
  • Simulation

7
Concept of Simulation
Y = bX + error
[Diagram: repeated samples are drawn from this model; each sample yields an estimate b_s1, b_s2, ..., b_sk]
8
Concept of Simulation
Y = bX + error
[Diagram: the sample estimates b_s1, b_s2, ..., b_sk are then collected and evaluated against the true b]
9
Simulation Example
Y = .4X + error
[Diagram: repeated samples are drawn from this model, each yielding an estimate b_s1, b_s2, ..., b_sk]
10
Simulation Example
Y = .4X + error
[Diagram: the estimates b_s1, b_s2, ..., b_sk are then collected and evaluated against the true value .4]
11
True Model: Y = .4x1 + e
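The simulation idea sketched above can be run in a few lines of R. This is a minimal sketch under assumed settings (standard normal x and error, n = 100 per sample, 1000 simulated samples), not the code behind the slides:
set.seed(42)
n.sims <- 1000; n <- 100
b.hat <- numeric(n.sims)
for (k in 1:n.sims) {
  x <- rnorm(n)
  y <- 0.4 * x + rnorm(n)          # true model: Y = .4x + e
  b.hat[k] <- coef(lm(y ~ x))[2]   # slope estimate from this sample
}
hist(b.hat)    # sampling distribution of the estimates
mean(b.hat)    # close to the true value .4
Each pass through the loop is one "sample" in the diagram; the histogram is the set of estimates being evaluated against the true value.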
12
Ingredients of a Useful Model
  • Correct probability model
  • Based on theory
  • Good measures/no loss of information
  • Comprehensive
  • Parsimonious
  • Tested fairly
  • Flexible
13
Correct Model
  • Gaussian: General Linear Model
  • Multiple linear regression
  • Binary (or ordinal): Generalized Linear Model
  • Logistic Regression
  • Proportional Odds/Ordinal Logistic
  • Time to event:
  • Cox Regression or parametric survival models
    (see the R sketch below)
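For reference, a minimal sketch of the corresponding R calls (the data frame d and its variables are hypothetical; only the model choices come from the slide):
library(survival)
fit.lin <- lm(y ~ x1 + x2, data = d)                          # Gaussian general linear model
fit.log <- glm(event ~ x1 + x2, family = binomial, data = d)  # logistic regression
fit.cox <- coxph(Surv(time, status) ~ x1 + x2, data = d)      # Cox proportional hazards model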

14
Generalized Linear Model
  • Normal: General Linear Model / Linear Regression, ANOVA/t-test, ANCOVA,
    Regression w/ Transformed DV
  • Binary/Binomial: Logistic Regression, Chi-square
  • Count, heavy skew, lots of zeros: Poisson, ZIP, negative binomial, gamma
  • Can be applied to clustered (e.g., repeated
    measures) data
15
Factor Analytic Family
Structural Equation Models
Partial Least Squares
Latent Variable Models (Confirmatory Factor
Analysis)
Multiple regression
Common Factor Analysis
Principal Components
16
Use Theory
  • Theory and expert information are critical in
    helping sift out artifact
  • Numbers can look very systematic when they are in
    fact random
  • http://www.tufts.edu/gdallal/multtest.htm

17
Measure well
  • Adequate range
  • Representative values
  • Watch for ceiling/floor effects

18
Using all the information
  • Preserving cases in data sets with missing data
  • Conventional approaches
  • Use only complete cases
  • Fill in with mean or median
  • Use a missing data indicator in the model

19
Missing Data
  • Imputation or related approaches are almost
    ALWAYS better than deleting incomplete cases
  • Multiple Imputation
  • Full Information Maximum Likelihood

20
Multiple Imputation
21
Modern Missing Data Techniques
  • Preserve more information from original sample
  • Incorporate uncertainty about missingness into
    final estimates
  • Produce better estimates of population (true)
    values (a multiple-imputation sketch follows)
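A minimal multiple-imputation sketch, assuming the R package mice (the slides do not name a package; the data set mydata and variables y, x1, x2 are hypothetical):
library(mice)
imp  <- mice(mydata, m = 5, seed = 1)  # five imputed (completed) data sets
fits <- with(imp, lm(y ~ x1 + x2))     # fit the substantive model in each one
pool(fits)                             # combine estimates; SEs reflect imputation uncertainty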

22
Don't throw away information from variables
  • Use all the information about the variables of
    interest
  • Don't create clinical cutpoints before modeling
  • Model with ALL the data first, then use
    prediction to make decisions about cutpoints

23
Dichotomizing for Convenience: Dubious
Practice (C.R.A.P.)
  • Convoluted Reasoning and Anti-intellectual
    Pomposity
  • Streiner & Norman, Biostatistics: The Bare
    Essentials

24
Implausible measurement assumption
[Figure: a depression score axis cut into "not depressed" vs. "depressed", with three cases A, B, and C marked along the scale]
25
Loss of power
http://psych.colorado.edu/mcclella/MedianSplit/
Sometimes, through sampling error, you can get a
lucky cut.
http://www.bolderstats.com/jmsl/doc/medianSplit.html
26
Dichotomization, by definition, reduces the
magnitude of the estimate by a minimum of about
30%.
Dear Project Officer, In order to facilitate
analysis and interpretation, we have decided to
throw away about 30% of our data. Even though
this will waste about three or four hundred thousand
dollars' worth of subject recruitment and testing
money, we are confident that you will
understand. Sincerely, Dick O. Tomi, PhD Prof.
Richard Obediah Tomi, PhD
27
Power to detect a non-zero b-weight when x is
continuous versus dichotomized
True model: y = .4x + e
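A minimal sketch of such a power comparison in R, assuming n = 60 per sample, 2000 replications, and alpha = .05 (settings chosen for illustration, not taken from the slide):
set.seed(1)
reps <- 2000; n <- 60
p.cont <- p.dich <- numeric(reps)
for (k in 1:reps) {
  x <- rnorm(n)
  y <- 0.4 * x + rnorm(n)
  xd <- as.numeric(x > median(x))                      # median split
  p.cont[k] <- summary(lm(y ~ x))$coefficients[2, 4]   # p-value, continuous x
  p.dich[k] <- summary(lm(y ~ xd))$coefficients[2, 4]  # p-value, dichotomized x
}
mean(p.cont < .05)   # power with the continuous predictor
mean(p.dich < .05)   # power after the median split (substantially lower)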
28
Dichotomizing will obscure non-linearity
[Figure: outcome plotted against CESD score after splitting into Low vs. High]
29
Dichotomizing will obscure non-linearity: same
data as previous slide, modeled continuously
30
Type I error rates for the relation between x2
and y after dichotomizing two continuous
predictors. Maxwell and Delaney calculated the
effect of dichotomizing two continuous predictors
as a function of the correlation between them.
The true model is y = .5x1 + 0x2, where all
variables are continuous. If x1 and x2 are
dichotomized, the Type I error rate for the relation
between x2 and y increases as the correlation
between x1 and x2 increases.

         Correlation between x1 and x2
N         0      .3     .5     .7
50       .05    .06    .08    .10
100      .05    .08    .12    .18
200      .05    .10    .19    .31
31
Is it ever a good idea to categorize
quantitatively measured variables?
  • Yes
  • when the variable is truly categorical
  • for descriptive/presentational purposes
  • for hypothesis testing, if enough categories are
    made.
  • However, using many categories can lead to
    problems of multiple significance tests and still
    run the risk of misclassification

32
CONCLUSIONS
  • Cutting
  • Doesn't always make measurement sense
  • Almost always reduces power
  • Can fool you with too much power in some
    instances
  • Can completely miss important features of the
    underlying function
  • Modern computing/statistical packages can
    handle continuous variables
  • Want to make good clinical cutpoints? Model
    first, decide on cuts afterward.

33
Sample size and the problem of underfitting vs
overfitting
  • Model assumption is that ALL relevant variables
    be included (the anti-parsimony principle)
  • Tempered by the fact that estimating too many
    unknowns with too little data will yield junk

34
Sample Size Requirements
  • Linear regression
  • Minimum of N = 50 + 8 per predictor (Green, 1990)
  • Logistic Regression
  • Minimum of N = 10-15 per predictor among the smallest
    outcome group (Peduzzi et al., 1990a)
  • Survival Analysis
  • Minimum of N = 10-15 per predictor (Peduzzi et al.,
    1990b); a worked example follows
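Worked example using the rules above: with 10 candidate predictors, a linear regression needs roughly N = 50 + 8(10) = 130 subjects, and a logistic or survival model needs roughly 10-15 x 10 = 100-150 events (counted in the smaller outcome group for logistic regression).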

35
Consequences of inadequate sample size
  • Lack of power for individual tests
  • Unstable estimates
  • Spurious good fit: lots of unstable estimates will
    produce spurious good-looking (big) regression
    coefficients

36
All noise, but good fit
[Figure: R-squares from a population model of completely random variables, plotted against the events-per-predictor ratio]
37
Simulation: number of events/predictor ratio
Y = .5x1 + 0x2 + .2x3 + 0x4, where r(x1, x4) = .4; N/p = 3, 5, 10, 20, 50
38
Parameter stability and n/p ratio
39
Peduzzi's Simulation: number of events/predictor ratio
P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC
Events/p = 2, 5, 10, 15, 20, 25
Relative bias = ((estimated b - true b)/true b) x 100
40
Simulation results: number of events/predictor ratio
41
Simulation results: number of events/predictor ratio
42
Approaches to variable selection
  • Stepwise automated selection
  • Pre-screening using univariate tests
  • Combining or eliminating redundant predictors
  • Fixing some coefficients
  • Theory, expert opinion and experience
  • Penalization/Random effects
  • Propensity Scoring
  • Matches individuals on multiple dimensions to
    improve baseline balance
  • Tibshirani's Lasso

43
Any variable selection technique based on looking
at the data first will likely be biased
44
  • "I now wish I had never written the stepwise
    selection code for SAS."
  • --Frank Harrell, author of the forward and backward
    selection algorithm for SAS PROC REG

45
Automated Selection Derksen and Keselman (1992)
Simulation Study
  • Studied backward and forward selection
  • Some authentic variables and some noise variables
    among candidate variables
  • Manipulated correlation among candidate
    predictors
  • Manipulated sample size

46
Automated Selection Derksen and Keselman (1992)
Simulation Study
  • The degree of correlation between candidate
    predictors affected the frequency with which the
    authentic predictors found their way into the
    model.
  • The greater the number of candidate predictors,
    the more noise variables were included in the
    model.
  • Sample size was of little practical importance
    in determining the number of authentic variables
    contained in the final model.

47
Simulation results: number of noise variables included
[Figure; x-axis: sample size; 20 candidate predictors, 100 samples]
48
Simulation results: R-square from noise variables
[Figure; x-axis: sample size; 20 candidate predictors, 100 samples]
49
Simulation results: R-square from noise variables
[Figure; x-axis: sample size; 20 candidate predictors, 100 samples]
50
SOME of the problems with stepwise variable selection
1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Stat in Med).
4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
9. It allows us to not think about the problem.
10. It uses a lot of paper.
51
Chatfield, C. (1995). "Model uncertainty, data mining and statistical
inference (with discussion)." JRSSA, 158, 419-466.
  --bias by selecting model because it fits the data well; bias in standard errors.
P. 420: "... need for a better balance in the literature and in statistical
teaching between techniques and problem solving strategies."
P. 421: "It is 'well known' to be 'logically unsound and practically
misleading' (Zhang, 1992) to make inferences as if a model is known to be
true when it has, in fact, been selected from the same data to be used for
estimation purposes. However, although statisticians may admit this
privately (Breiman (1992) calls it a 'quiet scandal'), they (we) continue to
ignore the difficulties because it is not clear what else could or should be
done."
P. 421: "Estimation errors for regression coefficients are usually smaller
than errors from failing to take into account model specification."
P. 422: "Statisticians must stop pretending that model uncertainty does not
exist and begin to find ways of coping with it."
P. 426: "It is indeed strange that we often admit model uncertainty by
searching for a best model but then ignore this uncertainty by making
inferences and predictions as if certain that the best fitting model is
actually true."
52
Phantom Degrees of Freedom
  • Faraway (1992) showed that any pre-modeling
    strategy costs df over and above the df used later
    in modeling.
  • Pre-modeling strategies included variable
    selection, outlier detection, linearity tests,
    residual analysis.
  • Thus, although not accounted for in the final model,
    these phantom df will render the model too
    optimistic

53
Phantom Degrees of Freedom
  • Therefore, if you transform, select, etc., you
    must include those df in (i.e., penalize for them in)
    the final model

54
Conventional Univariate Pre-selection
  • Non-significant tests also cost a DF
  • Non-significance is NOT necessarily related to
    importance
  • Variables may not behave the same way in a
    multivariable model: a variable not significant in a
    univariate test may be very important in the
    presence of other variables

55
Conventional Univariate Pre-selection
  • Despite the convention, testing for confounding
    has not been systematically studied; in many cases
    it leads to overadjustment and an underestimate of
    the true effect of the variable of interest.
  • At the very least, pulling variables in and out
    of models inflates the model fit, often
    dramatically

56
Better approach
  • Pick variables a priori
  • Stick with them
  • Penalize appropriately for any data-driven
    decision about how to model a variable

57
Spending DF wisely
  • If there is not enough N per predictor, combine covariates
    using techniques that do not look at Y in the
    sample: PCA, FA, conceptual clustering,
    collapsing, scoring, established indexes (see the
    sketch below).
  • Save DF for a finer-grained look at the variables of
    most interest, e.g., non-linear functions
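A minimal principal-components sketch in R (the data frame d and variable names are hypothetical); the key point is that Y is never consulted:
pc <- prcomp(d[, c("x1", "x2", "x3")], scale. = TRUE)  # components computed without looking at Y
d$covariate.score <- pc$x[, 1]                         # first component used as a single adjustment covariate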

58
Help is on the way?
  • Penalization/Random effects
  • Propensity Scoring
  • Matches individuals on multiple dimensions to
    improve baseline balance
  • Tibshirani's Lasso (see the sketch below)
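The slide names the lasso; a minimal sketch assuming the R package glmnet (an implementation choice not made in the slides; X is a numeric predictor matrix and y a binary outcome, both hypothetical):
library(glmnet)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 gives the lasso penalty
coef(cvfit, s = "lambda.1se")                             # shrunken coefficients; some are exactly zero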

59
http://myspace.com/monkeynavigatedrobots
60
Validation
  • Apparent fit
  • Usually too optimistic
  • Internal
  • cross-validation, bootstrap
  • honest estimate for model performance
  • provides an upper limit to what would be found on
    external validation
  • External validation
  • replication with new sample, different
    circumstances

61
Validation
  • Steyerberg et al. (1999) compared validation
    methods
  • Found that split-half was far too conservative
  • Bootstrap was equal or superior to all other
    techniques

62
Conclusions
  • Measure well
  • Use all the information
  • Recognize the limitations based on how much data
    you actually have
  • In the confirmatory mode, be as explicit as
    possible about the model a priori, test it, and
    live with it
  • By all means, explore data, but recognize, and
    state frankly, the limits post hoc analysis
    places on inference

63
Advanced topics and examples
64
Bootstrap
My Sample
[Diagram: k resamples are drawn from "My Sample" WITH REPLACEMENT; the statistic (θ1, θ2, ..., θk) is computed in each resample and then evaluated]
65
1, 3, 4, 5, 7, 10
7 1 1 4 5 10
10 3 2 2 2 1
3 5 1 4 2 7
2 1 1 7 2 7
4 4 1 4 2 10
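Resampling with replacement takes one line of R; a minimal sketch using the six values listed at the top of this slide:
set.seed(7)
s <- c(1, 3, 4, 5, 7, 10)
replicate(5, sample(s, size = length(s), replace = TRUE))  # each column is one bootstrap resample
The statistic of interest is then computed in each resample and its distribution examined.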
66
Can use data to determine where to spend DF
  • Use Spearman's rho to test importance
  • Not peeking, because we have chosen to include the
    term in the model regardless of its relation to Y
  • Use more DF for non-linearity (see the sketch below)
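A sketch of this step, assuming the Hmisc package's spearman2 function and a hypothetical data frame titanic:
library(Hmisc)
sp <- spearman2(survived ~ fare + age + sex, data = titanic, p = 2)  # rank-based strength of association with Y
plot(sp)  # give more knots/df to predictors with larger adjusted rho^2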

67
Example: Predict survival from age, gender, and
fare on the Titanic, using S-Plus (or R) software
68
If you have already decided to include them (and
promise to keep them in the model) you can peek
at predictors in order to see where to add
complexity
69
(No Transcript)
70
Non-linearity using splines
71
Linear Spline (piecewise regression)
Y = a + b1(x < 10) + b2(10 < x < 20) + b3(x > 20)
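One common way to code a linear spline in R uses the truncated-line basis; a minimal sketch with hypothetical knots at 10 and 20 and a hypothetical data frame d (equivalent in spirit to the segment formulation above):
fit.ls <- lm(y ~ x + pmax(x - 10, 0) + pmax(x - 20, 0), data = d)  # slope is allowed to change at x = 10 and x = 20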
72
Cubic Spline (non-linear piecewise regression)
[Figure: smooth fitted curve with knots marked]
73
Logistic regression model
fitfare <- lrm(survived ~ (rcs(fare,3) + age + sex)^2, x=T, y=T)
anova(fitfare)
Spline with 3 knots
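The fitted spline and interactions can then be displayed; a hedged sketch in rms-style syntax (the slides used the S-Plus Design library, whose plotting calls differ slightly; the data frame titanic is hypothetical):
library(rms)
dd <- datadist(titanic); options(datadist = "dd")
plot(Predict(fitfare, fare, sex, fun = plogis))  # predicted probability of survival across fare, by sex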
74
Wald Statistics: Response: survived

Factor                                        Chi-Square  d.f.  P
fare (Factor+Higher Order Factors)                  55.1     6  <.0001
  All Interactions                                  13.8     4  0.0079
  Nonlinear (Factor+Higher Order Factors)           21.9     3  0.0001
age (Factor+Higher Order Factors)                   22.2     4  0.0002
  All Interactions                                  16.7     3  0.0008
sex (Factor+Higher Order Factors)                  208.7     4  <.0001
  All Interactions                                  20.2     3  0.0002
fare * age (Factor+Higher Order Factors)             8.5     2  0.0142
  Nonlinear                                          8.5     1  0.0036
  Nonlinear Interaction: f(A,B) vs. AB               8.5     1  0.0036
fare * sex (Factor+Higher Order Factors)             6.4     2  0.0401
  Nonlinear                                          1.5     1  0.2153
  Nonlinear Interaction: f(A,B) vs. AB               1.5     1  0.2153
age * sex (Factor+Higher Order Factors)              9.9     1  0.0016
TOTAL NONLINEAR                                     21.9     3  0.0001
TOTAL INTERACTION                                   24.9     5  0.0001
TOTAL NONLINEAR + INTERACTION                       38.3     6  <.0001
TOTAL                                              245.3     9  <.0001
75
(Same Wald statistics table as slide 74)
76
(Same Wald statistics table as slide 74)
77
(Same Wald statistics table as slide 74)
78
(Same Wald statistics table as slide 74)
79
(No Transcript)
80
(No Transcript)
81
(No Transcript)
82
Bootstrap Validation
Index        Training   Corrected
Dxy            0.6565       0.646
R2             0.4273       0.407
Intercept      0.0000      -0.011
Slope          1.0000       0.952
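A table like this can be produced with the validate() function from the Design/rms library; a minimal sketch (B is the number of bootstrap repetitions; fitfare is the model fit earlier with x=T, y=T):
validate(fitfare, B = 200)  # apparent (training) and bias-corrected Dxy, R2, intercept, slope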
83
Summary
  • Think about your model
  • Collect enough data

84
Summary
  • Measure well
  • Don't destroy what you've measured

85
Summary
  • Pick your variables ahead of time and collect
    enough data to test the model you want
  • Keep all your variables in the model unless
    extremely unimportant

86
Summary
  • Use more df on important variables, fewer df on
    nuisance variables
  • Don't peek at Y to combine, discard, or transform
    variables

87
Summary
  • Estimate validity and shrinkage with bootstrap

88
Summary
  • By all means, tinker with the model later, but be
    aware of the costs of tinkering
  • Don't forget to say you tinkered
  • Go collect more data

89
Web links for references, software, and more
  • Harrell's regression modeling text
  • http://hesweb1.med.virginia.edu/biostat/rms/
  • SAS Macros for spline estimation
  • http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt
  • Some results comparing validation methods
  • http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf
  • SAS code for bootstrap
  • ftp://ftp.sas.com/pub/neural/jackboot.sas
  • S-Plus home page
  • insightful.com
  • Mike Babyak's e-mail
  • michael.babyak@duke.edu
  • This presentation
  • http://www.duke.edu/mbabyak

90
  • www.duke.edu/mababyak
  • michael.babyak@duke.edu
  • symptomresearch.nih.gov/chapter_8/