Title: Entering Multidimensional Space: Multiple Regression
Slide 1: Statistics for Health Research
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan, Professor of Epidemiology and Biostatistics
Slide 2: Objectives of session
- Recognise the need for multiple regression
- Understand methods of selecting variables
- Understand the strengths and weaknesses of selection methods
- Carry out multiple regression in SPSS and interpret the output
Slide 3: Why do we need multiple regression?
Research is rarely as simple as the effect of one variable on one outcome, especially with observational data. We need to assess many factors simultaneously to build more realistic models.
Slide 4: Consider the fitted model
y = a + b1x1 + b2x2
(3-D plot: dependent variable y against explanatory variables x1 and x2.)
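The fitted equation with two explanatory variables can be sketched numerically. A minimal illustration in Python; the coefficients below are made up for illustration, not estimates from the lecture's LDL data:

```python
# Minimal numeric sketch of the multiple regression equation
# y = a + b1*x1 + b2*x2. The coefficients are hypothetical,
# not estimates from the lecture's LDL example.

def predict(a, b1, b2, x1, x2):
    """Fitted value on the regression surface."""
    return a + b1 * x1 + b2 * x2

# Holding one explanatory variable fixed, y depends linearly on the
# other: fixing x2 simply shifts the intercept by b2*x2.
y_hat = predict(1.0, 0.5, -0.25, x1=4.0, x2=8.0)
print(y_hat)  # 1.0 + 2.0 - 2.0 = 1.0
```

With more predictors the same equation gains further b*x terms; the fitted "line" becomes a plane (two predictors) or hyperplane.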
Slide 5: 3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
Slide 6: When to use multiple regression modelling (1)
Assess the relationship between two variables while adjusting, or allowing, for another variable. Sometimes the second variable is considered a nuisance factor. Example: physical activity, allowing for age and medications.
Slide 7: When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance between the arms of the trial in the baseline characteristics of subjects, e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage and co-morbidity at baseline.
Slide 8: When to use multiple regression modelling (2, continued)
A special case of this is adjusting for the baseline level of the primary outcome in an RCT. The baseline level is added as a factor in the regression model. This will be covered in the Trials part of the course.
Slide 9: When to use multiple regression modelling (3)
With observational data, in order to produce a prognostic equation for future prediction of risk of mortality, e.g. predicting future risk of CHD using 10-year data from the Framingham cohort.
Slide 10: When to use multiple regression modelling (4)
With observational designs, in order to adjust for possible confounders, e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity.
Slide 11: Definition of confounding
A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway.
Slide 12: Example of confounding
(Diagram: Smoking is related to both Deprivation and Lung Cancer, confounding the Deprivation-Lung Cancer association.)
Slide 13: But it is also worth adjusting for factors related only to the outcome
(Diagram: Deprivation and Exercise both linked to Lung Cancer; Exercise is related only to the outcome.)
Slide 14: Not worth adjusting for an intermediate factor in a causal pathway
(Diagram: Exercise → Blood viscosity → Stroke.)
In a causal pathway each factor is merely a marker of the others, i.e. they are correlated (collinearity).
Slide 15: SPSS: add both baseline LDL and age to the Independent(s) box in Linear Regression
Slide 16: Output from SPSS linear regression on age at baseline
Slide 17: Output from SPSS linear regression on baseline LDL
Slide 18: Output: multiple regression
R² is now improved to 13%. Both variables are still significant, independently of each other.
Slide 19: How do you select which variables to enter the model?
- Usually consider what hypotheses you are testing
- If there is a main exposure variable, enter it first and assess confounders one at a time
- For derivation of a CPR you want powerful predictors
- Also include clinically important factors, e.g. cholesterol in CHD prediction
- Significance is important, but it is acceptable to keep an important variable without statistical significance
Slide 20: How do you decide which variables to enter in the model? Correlations? With great difficulty!
Slide 21: 3-dimensional scatterplot from SPSS of time from surgery in relation to Dukes staging and age
Slide 22: Approaches to model building
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above
Slide 23: 1) Let science or clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first. Next allow for age and gender. Add adherence as important? Add BMI and smoking?
Slide 24: 1) Let science or clinical factors guide selection
Results in a model of:
- Baseline LDL
- Age and gender
- Adherence
- BMI and smoking
Is this a good model?
Slide 25: 1) Let science or clinical factors guide selection: final model
Note: three variables were entered but are not statistically significant.
Slide 26: 1) Let science or clinical factors guide selection
Is this the best model? Should I leave out the non-significant factors (Model 2)?

Model | Adj R² | F (from ANOVA) | No. of parameters (p)
1 | 0.137 | 37.48 | 7
2 | 0.134 | 72.021 | 4

Adjusted R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
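The adjusted R² used in the comparison comes from the standard formula, which penalises R² for each extra parameter. A minimal sketch; the inputs (R² = 0.15, n = 500, p = 6) are hypothetical, since the sample size is not shown on the slide:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalises R2 for the number of
    explanatory parameters p, given sample size n."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical inputs: R2 = 0.15 with n = 500 observations and
# p = 6 parameters shrinks slightly after adjustment.
print(round(adjusted_r2(0.15, 500, 6), 4))  # 0.1397
```

Unlike raw R², this can fall when a weak variable is added, which is why it is one basis for comparing models of different sizes.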
Slide 27: Kullback-Leibler information
Kullback and Leibler (1951) quantified the meaning of information in relation to Fisher's sufficient statistics. Basically, we have reality f and a model g that approximates f; the K-L information is I(f, g).
Slide 28: Kullback-Leibler information
We want to minimise I(f, g) over candidate models to obtain the best one. I(f, g) is the information lost, or the "distance" between reality and a model, so it needs to be minimised.
Slide 29: Akaike's Information Criterion
It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion, or AIC.
Slide 30: Selection criteria
- With a large number of factors the type 1 error is large, so we are likely to end up with a model containing many variables
- Two standard criteria:
  1) Akaike's Information Criterion (AIC)
  2) Schwarz's Bayesian Information Criterion (BIC)
- Both penalise models with a large number of variables, especially when the sample size is large
Slide 31: Akaike's Information Criterion
AIC = −2 log-likelihood + 2p
- where p is the number of parameters and the −2 log-likelihood is given in the output
- Hence AIC penalises models with a large number of variables
- Select the model that minimises −2LL + 2p
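The criterion is a one-line function. A minimal sketch with a made-up log-likelihood, just to show the parameter penalty at work:

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: AIC = -2*LL + 2p.
    Smaller is better; the 2p term penalises extra parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical example: with log-likelihood -100 and 3 parameters,
# a 4th parameter must raise the log-likelihood by more than 1
# for the AIC to fall.
print(aic(-100.0, 3))  # 206.0
```

This makes the trade-off concrete: each added parameter costs 2 AIC points, so it must buy at least that much improvement in fit.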
Slide 32: Generalized linear models
- Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics
- Need to use: Analyze → Generalized Linear Models…
Slide 33: Generalized linear models: the default is linear
- Add Min LDL achieved as the dependent variable, as in REGRESSION in SPSS
- Next go to Predictors…
Slide 34: Generalized linear models: predictors
- WARNING!
- Make sure you add the predictors in the correct box
- Categorical variables in the FACTORS box
- Continuous variables in the COVARIATES box
Slide 35: Generalized linear models: model
- Add all factors and covariates to the model as main effects
Slide 36: Generalized linear models: parameter estimates
Note: identical to the REGRESSION output.
Slide 37: Generalized linear models: goodness-of-fit
Note: the output gives the log-likelihood and the AIC of 2835 (from AIC = −2LL + 2p with LL = −1409.6). The footnote explains that a smaller AIC is better.
Slide 38: Let science or clinical factors guide selection: optimal model
- The log-likelihood is a measure of goodness-of-fit
- Seek the optimal model that maximises the log-likelihood, or equivalently minimises the AIC

Model | Log-likelihood | p | AIC
1 Full model | −1409.6 | 7 | 2835.6
2 Non-significant variables removed | −1413.6 | 4 | 2837.2

The change is 1.6.
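The comparison in the table reduces to a difference in AIC. Using the two AIC values quoted above:

```python
# AIC values quoted in the table: the full model and the model with
# the non-significant variables removed.
aic_full = 2835.6
aic_reduced = 2837.2

delta = aic_reduced - aic_full
# A change of less than about 4 (the rule of thumb used later in
# the lecture) does not clearly separate the two models.
print(round(delta, 1))  # 1.6
```

Since the smaller AIC belongs to the full model, dropping the non-significant variables does not improve fit by this criterion.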
Slide 39: 1) Let science or clinical factors guide selection
- Key points
- Results demonstrate a significant association with baseline LDL, age and adherence
- Difficult choices with gender, smoking and BMI
- AIC only changes by 1.6 when they are removed
- Generally, changes of 4 or more in AIC are considered important
Slide 40: 1) Let science or clinical factors guide selection
- Key points
- Conclude there is little to choose between the models
- AIC is actually lower with the larger model, and gender and BMI are considered important factors, so keep the larger model, but this has to be justified
- Model building is manual, logical, transparent and under your control
Slide 41: 2) Use automatic selection procedures
These are based on automatic, mechanical algorithms, usually driven by statistical significance. Common ones are stepwise, forward and backward elimination. They can be selected in SPSS using Method in the dialogue box.
Slide 42: 2) Use automatic selection procedures (e.g. stepwise)
Select Method: Stepwise
Slide 43: 2) Use automatic selection procedures (e.g. stepwise)
1st step
2nd step
Final model
Slide 44: 2) Change in AIC with stepwise selection
Note: only available from Generalized Linear Models.

Step | Model | Log-likelihood | AIC | Change in AIC | No. of parameters (p)
1 | Baseline LDL | −1423.1 | 2852.2 | - | 2
2 | Adherence | −1418.0 | 2844.1 | 8.1 | 3
3 | Age | −1413.6 | 2837.2 | 6.9 | 4
Slide 45: 2) Advantages and disadvantages of stepwise
Advantages:
- Simple to implement
- Gives a parsimonious model
- Selection is certainly objective
Disadvantages:
- Selection is not stable: stepwise considers many models that are very similar
- The p-value on entry may be smaller once the procedure is finished, so p-values are exaggerated
- Predictions in an external dataset are usually worse for stepwise procedures, which tend to add bias
Slide 46: 2) Automatic procedures: backward elimination
- Backward elimination starts by eliminating the least significant factor from the full model, and has a few advantages over forward selection
- The modeller has to consider the full model and sees results for all factors simultaneously
- Correlated factors can remain in the model (in forward methods they may not even enter)
- Criteria for removal tend to be more lax in backward elimination, so you end up with more parameters
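The backward procedure can be sketched as a greedy loop. The sketch below uses AIC rather than the p-value criterion SPSS applies, and the AIC values for each candidate subset are made-up numbers standing in for refitted models:

```python
# Sketch of backward elimination driven by AIC. In practice each
# candidate subset would be refitted; here a hypothetical lookup
# table of AIC values stands in for the refits.
AIC_TABLE = {
    frozenset({"x1", "x2", "x3"}): 210.0,  # full model
    frozenset({"x1", "x2"}): 207.0,
    frozenset({"x1", "x3"}): 212.0,
    frozenset({"x2", "x3"}): 215.0,
    frozenset({"x1"}): 208.5,
    frozenset({"x2"}): 220.0,
    frozenset({"x3"}): 221.0,
}

def backward_eliminate(variables, aic_of):
    """Repeatedly drop the variable whose removal lowers AIC the
    most; stop when no single removal improves on the current AIC."""
    current = frozenset(variables)
    current_aic = aic_of(current)
    while len(current) > 1:
        trials = [(aic_of(current - {v}), current - {v}) for v in current]
        best_aic, best_subset = min(trials, key=lambda t: t[0])
        if best_aic < current_aic:
            current, current_aic = best_subset, best_aic
        else:
            break
    return current, current_aic

final_vars, final_aic = backward_eliminate({"x1", "x2", "x3"}, AIC_TABLE.get)
print(sorted(final_vars), final_aic)  # ['x1', 'x2'] 207.0
```

Removing x3 lowers AIC from 210.0 to 207.0, but no further single removal helps, so the loop stops with two variables, mirroring how the backward Method in SPSS terminates when no candidate meets the removal criterion.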
Slide 47: 2) Use automatic selection procedures (e.g. backward)
Select Method: Backward
Slide 48: 2) Backward elimination in SPSS
1st step: Gender removed
2nd step: BMI removed
Final model
Slide 49: Summary of automatic selection
- Automatic selection may not give the optimal model (it may leave out important factors)
- Different methods may give different results (forward vs. backward elimination)
- Backward elimination is preferred as it is less stringent
- Too easily fitted in SPSS!
- Model assessment still requires some thought
Slide 50: 3) A mixture of automatic procedures and self selection
- Use automatic procedures as a guide
- Think about which factors are important
- Add important factors
- Do not blindly follow statistical significance
- Consider AIC
Slide 51: Summary of model selection
- Selection of factors for multiple linear regression models requires some judgement
- Automatic procedures are available, but treat the results with caution
- They are easily fitted in SPSS
- Check the AIC or log-likelihood for fit
Slide 52: Summary
- Multiple regression models are the most used analytical tool in quantitative research
- They are easily fitted in SPSS
- Model assessment requires some thought
- Parsimony is better (Occam's Razor)
Reference: Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid-lowering response to statin treatment in diabetes: a Go-DARTS study. Pharmacogenetics and Genomics 2008; 18: 279-87.
Slide 53: Remember Occam's Razor
"Entia non sunt multiplicanda praeter necessitatem": entities must not be multiplied beyond necessity.
William of Ockham, 14th-century friar and logician (1288-1347)
Slide 54: Practical on multiple regression
- Read in LDL Data.sav
- Try fitting a multiple regression model on Min LDL obtained, using forward and backward elimination. Are the results the same? Add factors other than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
- Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?