Entering Multidimensional Space: Multiple Regression - PowerPoint PPT Presentation

1
Statistics for Health Research
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan, Professor of Epidemiology and Biostatistics
2
Objectives of session
  • Recognise the need for multiple regression
  • Understand methods of selecting variables
  • Understand strengths and weaknesses of selection
    methods
  • Carry out multiple regression in SPSS and interpret
    the output

3
Why do we need multiple regression?
Research is rarely as simple as the effect of one
variable on one outcome, especially with
observational data. We need to assess many factors
simultaneously to build more realistic models.
4
Consider the fitted line y = a + b_1 x_1 + b_2 x_2
[3-D plot with axes: Dependent (y), Explanatory (x_1), Explanatory (x_2)]
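As a rough illustration of fitting y = a + b_1 x_1 + b_2 x_2 outside SPSS, the sketch below uses NumPy least squares. The data are hypothetical and noise-free, generated with a = 1, b_1 = 2 and b_2 = 3, so the fit should simply recover those values.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept a, then x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of (a, b1, b2)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

With real data the points scatter around the fitted plane rather than lying on it, and the coefficients are estimates with standard errors.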
5
3-dimensional scatterplot from SPSS of Min LDL in
relation to baseline LDL and age
6
When to use multiple regression modelling (1)
Assess the relationship between two variables while
adjusting or allowing for another variable.
Sometimes the second variable is considered a
nuisance factor.
Example: physical activity, allowing for age and
medications.
7
When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance between the
arms of the trial in baseline characteristics of
the subjects, e.g. survival in colorectal cancer on
two different randomised therapies, adjusted for
age, gender, stage and co-morbidity at baseline.
8
When to use multiple regression modelling (2)
A special case of this is adjusting for the
baseline level of the primary outcome in an RCT.
The baseline level is added as a factor in the
regression model. This will be covered in the Trials
part of the course.
9
When to use multiple regression modelling (3)
With observational data, in order to produce a
prognostic equation for future prediction of risk
of mortality, e.g. predicting future risk of CHD
using 10-year data from the Framingham cohort.
10
When to use multiple regression modelling (4)
With observational designs, in order to adjust for
possible confounders, e.g. survival in colorectal
cancer in those with hypertension, adjusted for
age, gender, social deprivation and co-morbidity.
11
Definition of Confounding
A confounder is a factor which is related to both
the variable of interest (explanatory) and the
outcome, but is not an intermediary in a causal
pathway
12
Example of Confounding
[Diagram linking Deprivation, Smoking and Lung Cancer]
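A small simulation can make the definition of confounding concrete. The sketch below is entirely hypothetical (it is not the deprivation and smoking data): a confounder z drives both the exposure x and the outcome y, and the true effect of x on y is set to 0.5. The crude slope is inflated, and adding z to the model brings the estimate back towards 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulation: z is related to both the exposure and the outcome
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # exposure, related to z
y = 0.5 * x + 1.0 * z + rng.normal(size=n)   # outcome; true effect of x is 0.5

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

crude = ols([x], y)[1]        # slope for x, ignoring z
adjusted = ols([x, z], y)[1]  # slope for x, adjusting for z
print(f"crude slope = {crude:.2f}, adjusted slope = {adjusted:.2f}")
```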
13
But it is also worth adjusting for factors related
only to the outcome
[Diagram linking Deprivation, Exercise and Lung Cancer]
14
Not worth adjusting for an intermediate factor in a
causal pathway
[Diagram: Exercise, Blood viscosity, Stroke]
In a causal pathway each factor is merely a
marker of the other factors, i.e. they are
correlated (collinearity)
15
SPSS: add both baseline LDL and age to the
independent box in linear regression
16
Output from SPSS linear regression on Age at
baseline
17
Output from SPSS linear regression on Baseline LDL
18
Output: multiple regression
R2 now improved to 13%
Both variables still significant INDEPENDENTLY of
each other
19
How do you select which variables to enter the
model?
  • Usually consider: what hypotheses are you testing?
  • If there is a main exposure variable, enter it first
    and assess confounders one at a time
  • For derivation of a clinical prediction rule (CPR)
    you want powerful predictors
  • Also clinically important factors, e.g.
    cholesterol in CHD prediction
  • Significance is important, but it is acceptable to
    have an important variable without statistical
    significance

20
How do you decide what variables to enter in the
model? Correlations? With great difficulty!
21
3-dimensional scatterplot from SPSS of Time from
Surgery in relation to Dukes staging and age
22
Approaches to model building
  • 1. Let scientific or clinical factors guide
    selection
  • 2. Use automatic selection algorithms
  • 3. A mixture of the above
23
1) Let Science or Clinical factors guide
selection
Baseline LDL cholesterol is an important factor
determining LDL outcome, so enter it first.
Next, allow for age and gender.
Add adherence as important?
Add BMI and smoking?
24
1) Let Science or Clinical factors guide
selection
  • Results in model of
  • Baseline LDL
  • age and gender
  • Adherence
  • BMI and smoking
  • Is this a good model?

25
1) Let Science or Clinical factors guide
selection Final Model
Note: three variables were entered but are not
statistically significant
26
1) Let Science or Clinical factors guide
selection
Is this the best model? Should I leave out the
non-significant factors (Model 2)?
Model   Adj R2   F from ANOVA   No. of parameters (p)
1       0.137    37.48          7
2       0.134    72.021         4

Adj R2 is lower, F has increased and the number of
parameters is smaller in the 2nd model. Is this better?
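To see why F can rise while adjusted R2 falls slightly, it may help to write both in terms of R2, the sample size n and the number of predictors k (excluding the intercept). The sketch below uses the standard textbook formulas with made-up numbers, not the values from the SPSS output above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F from the ANOVA table: explained vs residual mean square."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Hypothetical values: R2 = 0.5 from n = 11 observations and k = 2 predictors
print(adjusted_r2(0.5, 11, 2))   # 0.375
print(f_statistic(0.5, 11, 2))   # 4.0
```

Because F divides the explained variation by k, dropping predictors that add little to R2 tends to increase F, which is the pattern seen when moving from Model 1 to Model 2.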
27
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the
meaning of information related to Fisher's
sufficient statistics.
Basically, we have reality f and a model g to
approximate f, so the K-L information is I(f,g).
28
Kullback-Leibler Information
We want to minimise I(f,g) to obtain the best
model over other models. I(f,g) is the
information lost, or the distance between reality
and a model, so we need to minimise it.
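For discrete distributions the K-L information can be written as I(f,g) = sum of f_i log(f_i / g_i). The sketch below is a minimal illustration of that standard definition; the "reality" f and the two candidate models are invented numbers.

```python
import math

def kl_information(f, g):
    """I(f, g) = sum_i f_i * log(f_i / g_i) for discrete distributions."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

f = [0.5, 0.3, 0.2]          # hypothetical "reality"
g_close = [0.45, 0.35, 0.2]  # hypothetical model close to f
g_far = [0.1, 0.1, 0.8]      # hypothetical model far from f

print(kl_information(f, g_close))  # small: little information lost
print(kl_information(f, g_far))    # larger: more information lost
print(kl_information(f, f))        # 0.0 when the model equals reality
```

In practice f is unknown, which is why a criterion that can be computed from the fitted model alone is needed.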
29
Akaike's Information Criterion
It turns out that the function I(f,g) is related
to a very simple measure of goodness-of-fit:
Akaike's Information Criterion, or AIC.
30
Selection Criteria
  • With a large number of factors the type 1 error is
    large, so we are likely to end up with a model with
    many variables
  • Two standard criteria:
  • 1) Akaike's Information Criterion (AIC)
  • 2) Schwarz's Bayesian Information Criterion
    (BIC)
  • Both penalise models with a large number of
    variables; for BIC the penalty increases with
    sample size

31
Akaike's Information Criterion
  • AIC = -2LL + 2p
  • where p is the number of parameters and the -2 log
    likelihood (-2LL) is in the output
  • Hence AIC penalises models with a large number of
    variables
  • Select the model that minimises (-2LL + 2p)
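The two criteria are simple enough to compute by hand from a log likelihood. The sketch below uses the standard definitions, AIC = -2LL + 2p and BIC = -2LL + p ln(n); the log likelihood, p and n are hypothetical values chosen only to show the arithmetic.

```python
import math

def aic(log_lik, p):
    """Akaike's Information Criterion: -2LL + 2p."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """Schwarz's Bayesian Information Criterion: -2LL + p * ln(n)."""
    return -2.0 * log_lik + p * math.log(n)

# Hypothetical model: log likelihood -100 with p = 3 parameters, n = 50
print(aic(-100.0, 3))       # 206.0
print(bic(-100.0, 3, 50))   # larger than AIC, since ln(50) > 2
```

Because ln(n) exceeds 2 once n is 8 or more, BIC penalises each extra parameter more heavily than AIC in all but very small samples.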

32
Generalized linear models
  • Unfortunately the standard REGRESSION in SPSS
    does not give these statistics
  • Need to use Analyze > Generalized Linear Models...

33
Generalized linear models: default is linear
  • Add Min LDL achieved as the dependent variable, as
    in REGRESSION in SPSS
  • Next go to Predictors...

34
Generalized linear models: Predictors
  • WARNING!
  • Make sure you add the predictors in the correct
    box
  • Categorical in FACTORS box
  • Continuous in COVARIATES box
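The reason the box matters is that a categorical factor is expanded into indicator (dummy) columns, whereas a continuous covariate enters as a single numeric column. The sketch below illustrates that coding in plain Python; the data, the helper name dummy_code and the choice of reference level are all assumptions made for the example, not SPSS behaviour.

```python
def dummy_code(values, reference):
    """Return the non-reference levels and one 0/1 column per level."""
    levels = sorted(set(values) - {reference})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows

# Hypothetical data: gender is categorical (a factor), age is continuous
gender = ["F", "M", "M", "F"]
age = [54.0, 61.0, 47.0, 70.0]

levels, dummies = dummy_code(gender, reference="F")

# Design matrix rows: intercept, indicator for "M", then age as it stands
design = [[1.0] + d + [a] for d, a in zip(dummies, age)]
for row in design:
    print(row)
```

Entering a categorical variable as a covariate would instead treat its numeric codes as a measured quantity, which is rarely what is intended.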

35
Generalized linear models: Model
  • Add all factors and covariates in the model as
    main effects

36
Generalized Linear Models: Parameter Estimates
Note: identical to REGRESSION output
37
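Generalized Linear Models: Goodness-of-fit
Note: the output gives the log likelihood (-1409.6) and an
AIC of 2835. By hand, AIC = -2 × (-1409.6) + 2 × 7 = 2833.2;
the reported value is about 2 higher, which would be
consistent with SPSS also counting the estimated scale
parameter. A footnote explains that smaller AIC is better.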
38
Let Science or Clinical factors guide selection
Optimal model
  • The log likelihood is a measure of
    GOODNESS-OF-FIT
  • Seek optimal model that maximises the log
    likelihood or minimises the AIC




Model                                 Log likelihood   p   AIC
1 Full model                          -1409.6          7   2835.6
2 Non-significant variables removed   -1413.6          4   2837.2

Change in AIC is 1.6
39
1) Let Science or Clinical factors guide
selection
  • Key points
  • Results demonstrate a significant association
    with baseline LDL, Age and Adherence
  • Difficult choices with Gender, smoking and BMI
  • AIC only changes by 1.6 when removed
  • Generally changes of 4 or more in AIC are
    considered important
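The comparison can be written out as a short calculation. The sketch below takes the two AIC values reported for the full and reduced models and applies the rule of thumb that a change of 4 or more is important; the variable names are illustrative only.

```python
# AIC values as reported in the SPSS output for the two models
reported_aic = {
    "full model": 2835.6,
    "non-significant variables removed": 2837.2,
}

IMPORTANT_CHANGE = 4.0  # rule-of-thumb threshold quoted in the key points

change = round(
    reported_aic["non-significant variables removed"] - reported_aic["full model"], 1
)
is_important = abs(change) >= IMPORTANT_CHANGE

print(f"Change in AIC = {change}, important: {is_important}")
```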

40
1) Let Science or Clinical factors guide
selection
  • Key points
  • Conclude there is little to choose between the models
  • AIC is actually lower with the larger model, and we
    consider gender and BMI important factors, so keep
    the larger model, but this has to be justified
  • Model building is manual, logical, transparent and
    under your control

41
2) Use automatic selection procedures
These are based on automatic mechanical
algorithms, usually related to statistical
significance. Common ones are stepwise, forward or
backward elimination. They can be selected in SPSS
using Method in the dialogue box.
42
2) Use automatic selection procedures (e.g.
Stepwise)
Select Method: Stepwise
43
2) Use automatic selection procedures (e.g.
Stepwise)
1st step
2nd step
Final Model
44
2) Change in AIC with Stepwise selection
Note: only available from Generalized Linear
Models
Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters (p)
1      Baseline LDL   -1423.1          2852.2   -               2
2      Adherence      -1418.0          2844.1   8.1             3
3      Age            -1413.6          2837.2   6.9             4
45
2) Advantages and disadvantages of stepwise
Advantages
  • Simple to implement
  • Gives a parsimonious model
  • Selection is certainly objective
Disadvantages
  • Selection is not stable: stepwise considers many
    models that are very similar
  • The p-value on entry may be smaller once the
    procedure is finished, so the p-value is exaggerated
  • Predictions in an external dataset are usually worse
    for stepwise procedures; it tends to add bias
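The mechanics of a forward procedure can be sketched in a few lines. Note that this is not the SPSS algorithm: SPSS enters variables on p-values, whereas the sketch below adds, at each step, the variable that lowers AIC most and stops when no addition lowers it. The data are simulated, and the variable names (including a pure "noise" candidate) are hypothetical.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = -2LL + 2p for a least-squares fit with normal errors."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    # Maximised log likelihood with the error variance estimated as rss / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    p = X.shape[1]  # regression coefficients, including the intercept
    return -2 * log_lik + 2 * p

def forward_select(candidates, y):
    """Add the variable that lowers AIC most; stop when none does."""
    n = len(y)
    chosen = []

    def design(names):
        return np.column_stack([np.ones(n)] + [candidates[k] for k in names])

    best = gaussian_aic(design(chosen), y)  # intercept-only model
    while True:
        trials = {name: gaussian_aic(design(chosen + [name]), y)
                  for name in candidates if name not in chosen}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        chosen.append(name)
        best = trials[name]
    return chosen, best

# Simulated data: two real predictors and one pure-noise candidate
rng = np.random.default_rng(1)
n = 300
candidates = {
    "baseline_ldl": rng.normal(size=n),
    "age": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
y = 2.0 * candidates["baseline_ldl"] + 1.0 * candidates["age"] + rng.normal(size=n)

chosen, final_aic = forward_select(candidates, y)
print(chosen, round(final_aic, 1))
```

Rerunning with different seeds shows the instability mentioned above: the noise variable is sometimes selected purely by chance.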
46
2) Automatic procedures Backward elimination
  • Backward starts by eliminating the least
    significant factor from the full model and has a
    few advantages over forward
  • Modeller has to consider the full model and
    sees results for all factors simultaneously
  • Correlated factors can remain in the model (in
    forward methods they may not even enter)
  • Criteria for removal tend to be more lax in
    backward so end up with more parameters
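Backward elimination can be sketched in the same spirit. The version below starts from the full model and repeatedly drops the variable with the smallest absolute t statistic (the least significant one) until every remaining variable exceeds a threshold. The threshold of 1.96 is an assumption standing in for a p-value removal criterion in large samples, and the data and variable names are simulated, not the LDL study data.

```python
import numpy as np

def t_statistics(X, y):
    """Coefficient / standard error for each column of X."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = float(resid @ resid) / (n - p)   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return coef / np.sqrt(np.diag(cov))

def backward_eliminate(variables, y, t_remove=1.96):
    """Drop the least significant variable until all have |t| >= t_remove."""
    kept = list(variables)
    n = len(y)
    while kept:
        X = np.column_stack([np.ones(n)] + [variables[k] for k in kept])
        t = np.abs(t_statistics(X, y))[1:]    # skip the intercept
        weakest = int(np.argmin(t))
        if t[weakest] >= t_remove:
            break
        kept.pop(weakest)
    return kept

# Simulated data: gender and bmi have no true effect here
rng = np.random.default_rng(2)
n = 300
variables = {
    "baseline_ldl": rng.normal(size=n),
    "age": rng.normal(size=n),
    "gender": rng.integers(0, 2, size=n).astype(float),
    "bmi": rng.normal(size=n),
}
y = 2.0 * variables["baseline_ldl"] + 1.0 * variables["age"] + rng.normal(size=n)

kept = backward_eliminate(variables, y)
print(kept)
```

Unlike the forward sketch, every candidate is seen in the full model first, which is the advantage described in the bullets above.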

47
2) Use automatic selection procedures (e.g.
Backward)
Select Method: Backward
48
2) Backward elimination in SPSS
1st step: Gender removed
2nd step: BMI removed
Final Model
49
Summary of automatic selection
  • Automatic selection may not give optimal model
    (may leave out important factors)
  • Different methods may give different results
    (forward vs. backward elimination)
  • Backward elimination preferred as less stringent
  • Too easily fitted in SPSS!
  • Model assessment still requires some thought

50
3) A mixture of automatic procedures and self
selection
  • Use automatic procedures as a guide
  • Think about what factors are important
  • Add important factors
  • Do not blindly follow statistical significance
  • Consider AIC

51
Summary of Model selection
  • Selection of factors for Multiple Linear
    regression models requires some judgement
  • Automatic procedures are available but treat
    results with caution
  • They are easily fitted in SPSS
  • Check AIC or log likelihood for fit

52
Summary
  • Multiple regression models are the most used
    analytical tool in quantitative research
  • They are easily fitted in SPSS
  • Model assessment requires some thought
  • Parsimony is better: Occam's Razor
  • Donnelly LA, Palmer CNA, Whitley AL, Lang C,
    Doney ASF, Morris AD, Donnan PT. Apolipoprotein E
    genotypes are associated with lipid lowering
    response to statin treatment in diabetes: a
    Go-DARTS study. Pharmacogenetics and Genomics
    2008; 18: 279-87.

53
Remember Occam's Razor
"Entia non sunt multiplicanda praeter
necessitatem": entities must not be multiplied
beyond necessity
William of Ockham, 14th-century friar and
logician (1288-1347)
54
Practical on Multiple Regression
  • Read in LDL Data.sav
  • Try fitting multiple regression model on Min LDL
    obtained using forward and backward elimination.
    Are the results the same? Add other factors than
    those considered in the presentation such as BMI,
    smoking. Remember the goal is to assess the
    association of APOE with LDL response.
  • Try fitting multiple regression models for Min
    Chol achieved. Is the model similar to that found
    for Min LDL?