Title: Nicky Best
1 Bayesian graphical models for multiple bias
modelling in epidemiological studies
Nicky Best, Department of Epidemiology and Public
Health, Imperial College, London
n.best_at_imperial.ac.uk
2 BIAS Project
Bayesian methods for integrated bias modelling
and analysis of multiple data sources
www.bias-project.org.uk
- Talk Outline
- Common biases in observational data
- Graphical models
- Case study: combining multiple data sources to
study effects of water disinfection by-products
on risk of low birth weight
3 Biases in observational data
- Random errors (sampling variation)
- Missing data
- Unmeasured confounders
- Selection biases
- Measurement errors
- Multiple data sources often necessary to identify
the biases and inform about different aspects of
the research question
4 Simple example of graphical model
- The genotypes of the couple are independent: just
two random genotypes out in the world
5 Simple example of graphical model
Mendelian inheritance
[Figure: graph with nodes M, F and C]
- C: genotype of child
- Once the couple have a child and become parents,
their genotypes become associated through the
child, e.g. paternity testing
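The effect on this slide can be checked by simulation. The sketch below is an illustration only, under assumptions that are not in the talk: M and F are read as the mother's and father's genotypes, there is a single locus with two alleles, the allele frequency is 0.3, and mating is random. All function names are hypothetical.

```python
import random

def draw_genotype(rng, freq_a=0.3):
    """Copies (0, 1 or 2) of allele 'a', assuming Hardy-Weinberg proportions."""
    return (rng.random() < freq_a) + (rng.random() < freq_a)

def transmit(rng, genotype):
    """Allele (1 = 'a', 0 = other) passed on by a parent with `genotype` copies of 'a'."""
    return 1 if rng.random() < genotype / 2 else 0

def simulate(n=100_000, seed=1):
    rng = random.Random(seed)
    trios = []
    for _ in range(n):
        m = draw_genotype(rng)                    # mother, drawn independently
        f = draw_genotype(rng)                    # father, drawn independently
        c = transmit(rng, m) + transmit(rng, f)   # child, Mendelian inheritance
        trios.append((m, f, c))
    return trios

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

trios = simulate()
# Parents' genotypes ignoring the child: should be close to uncorrelated.
marginal = correlation([(m, f) for m, f, _ in trios])
# Parents' genotypes among trios whose child carries exactly one copy of 'a'.
given_child = correlation([(m, f) for m, f, c in trios if c == 1])
print(f"corr(M, F)         = {marginal:+.3f}")
print(f"corr(M, F | C = 1) = {given_child:+.3f}")
```

Conditioning on the child's genotype induces a clear negative correlation between the parents, because a child with one copy must have received it from exactly one of them.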
6 Building complex models
[Figure: graph on nodes A, B, C, D, E and F]
- Conditional independence provides mathematical
basis for expressing large system as fusion of
smaller components
7 Building complex models
[Figure: the graph on nodes A to F redrawn as separate components; nodes C, D and E each appear twice]
- Conditional independence provides mathematical
basis for expressing large system as fusion of
smaller components
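As a small numerical illustration of this point, the sketch below builds a joint distribution from local pieces and checks the implied conditional independence. It does not use the six-node graph from the slide, whose edges are not recoverable here; it assumes a hypothetical binary chain A -> B -> C with made-up probability tables.

```python
from itertools import product

# Hypothetical binary chain A -> B -> C; all numbers are made up.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p_c_given_b[b][c]

# Global model = product of the small local pieces.
joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Probability that the named variables (a, b, c) take the given values."""
    names = ("a", "b", "c")
    return sum(
        p for values, p in joint.items()
        if all(values[names.index(k)] == v for k, v in fixed.items())
    )

# Check that A and C are independent given B:
# P(a, c | b) should equal P(a | b) * P(c | b) for every combination.
max_gap = 0.0
for a, b, c in product((0, 1), repeat=3):
    lhs = prob(a=a, b=b, c=c) / prob(b=b)
    rhs = (prob(a=a, b=b) / prob(b=b)) * (prob(b=b, c=c) / prob(b=b))
    max_gap = max(max_gap, abs(lhs - rhs))

print(f"total probability: {sum(joint.values()):.6f}")
print(f"largest deviation from conditional independence: {max_gap:.2e}")
```

Each table involves at most two variables, yet together they define the full joint distribution; this is the sense in which a large system can be assembled from small components.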
8 Building complex models
- Key idea
- understand complex system
- through global model
- built from small pieces
- comprehensible
- each with only a few variables
- modular
- Present context: each piece could represent a
separate data source
9 Case study: Combining birth register, survey and
census data to study effects of water
disinfection by-products on risk of low birth
weight
10 Low birth-weight and chlorine byproducts
- Does exposure to chlorine byproducts (i.e. total
trihalomethanes (THMs)) during pregnancy
increase the risk of a low birth-weight baby?
- Combine datasets with different strengths
- Survey data (Millennium Cohort Study)
- Small, great individual detail.
- Administrative data (national births register)
- Large, but little individual detail.
- Single underlying model assumed to govern both
datasets; elaborate as appropriate to handle
biases
11 Low birth-weight
- Important determinant of future health →
population health indicator
- Low birth-weight needs to be stratified by
gestational age
- Full-term low birth-weight: babies born > 37
weeks
- Pre-term low birth-weight: babies born < 37
weeks
- Established risk factors
- Mother's tobacco smoking status during pregnancy.
- Mother's ethnicity (South Asian), maternal age,
weight, height, number of previous births.
- Baby's sex
- Role of environmental risk factors, such as THMs,
less clear (inconclusive).
- Some recent studies suggest a link, but others do
not.
12 Data sources (1): Millennium Cohort Study
- About 11,695 births in England between Sep
2000 and August 2001
- About 1,333 singleton births when restricted to
the United Utilities (UU) water company
- UU company is located in the northwest part of
England.
- Postcode made available to us under strict
security arrangements
- Match individuals with exposure to chlorine
byproducts estimated in separate model (Whitaker
et al, 2005)
- Birth weight, baby's gestational age and reasonably
complete set of confounder data available
- Allows a reasonable analysis, but issues remain
- Low power to detect small effect → could be
improved by incorporating other data.
- Potential selection bias
13 Data sources (2): National birth register (NBR)
- Every birth in the population recorded.
- Individual data with postcode (→ THM exposure)
and birth weight available to us under strict
security.
- We study subjects from wards which were covered
by the UU water company and which are present in
both MCS and NBR samples: 7945 singleton births
between Sep 2000 and Aug 2001.
- Larger dataset, no selection bias
- but no confounder information, especially
ethnicity and smoking.
- No record of gestational age.
14 Data sources (3): Aggregate data
- Ethnic composition of the population
- 2001 census
- for census output areas (500 individuals)
- Tobacco expenditure
- consumer surveys (CACI, who produce ACORN
consumer classification data)
- for census output areas.
- linked by postcode to Millennium Cohort and
national register data.
15 [Figure panels, source MCS: birth weight and THM; birth weight and race; birth weight and smoke]
16 Models for formally analysing combined data
- Want estimate of the association between low
birth-weight (full-term and pre-term) and THM
exposure, using all data, accounting for
- Selection bias in MCS
- Adjust models for predictors of selection
- Missing confounders in register
- Bayesian graphical model
- Missing outcomes in register data: no gestation
age information to stratify the birth weight
- Bayesian graphical model
17 Graphical model representation
[Figure: graph for baby j in MCS, with nodes THMj, Cj and MODEL parameters and outcome categories Normal, LBWP and LBWF; the legend marks nodes as known or unknown]
Multinomial logistic regression model:
$$\mathrm{BWI}_j \sim \mathrm{Multinomial}(p_{j,1:3}, 1)$$
$$\log(p_{j,2} / p_{j,1}) = b_{10} + b_{11}\,\mathrm{THM}_j + b_{12}\,C_j$$
$$\log(p_{j,3} / p_{j,1}) = b_{20} + b_{21}\,\mathrm{THM}_j + b_{22}\,C_j$$
- BWI: birth weight indicator (1 = normal, 2 = LBWP, 3 = LBWF)
- LBWP: low birth weight, pre-term
- LBWF: low birth weight, full-term
- THM: THM (chlorine byproduct) exposure
- C: confounders such as ethnicity and smoking, only in MCS
- Baby's gestational age only observed in MCS
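To make the link between the two log-ratio equations and the three category probabilities concrete, here is a minimal sketch. It assumes a single scalar confounder C and uses made-up coefficient values; the function name is hypothetical and nothing here is an estimate from the study.

```python
import math

def category_probabilities(thm, c, b1, b2):
    """Return (p1, p2, p3) for one baby.

    b1 = (b10, b11, b12) are the coefficients of log(p2 / p1),
    b2 = (b20, b21, b22) are the coefficients of log(p3 / p1).
    Category 1 (normal birth weight) is the reference category.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c
    eta3 = b2[0] + b2[1] * thm + b2[2] * c
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return 1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom

# Made-up coefficients, for illustration only.
b1 = (-3.0, 0.1, 0.5)   # pre-term low birth weight (LBWP) vs normal
b2 = (-3.5, 0.3, 0.8)   # full-term low birth weight (LBWF) vs normal
p = category_probabilities(thm=1, c=1, b1=b1, b2=b2)
print("p1, p2, p3 =", tuple(round(x, 4) for x in p))
```

Dividing by the common denominator guarantees that the three probabilities sum to one while preserving the two log ratios.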
18 Graphical model representation
[Figure: the graph for baby j in MCS alongside a matching graph for baby i in register; each has nodes THM and C and outcome categories Normal, LBWP and LBWF, with a single set of MODEL parameters; the legend marks nodes as known or unknown]
- LBWF: low birth-weight, full-term
- LBWP: low birth-weight, pre-term
- THM: THM (chlorine byproduct) exposure
- C: confounders such as ethnicity and smoking, only in the MCS
- Same MODEL assumed to govern both datasets
- Baby's gestational age only observed in MCS
19 Missing confounder imputation model
[Figure: graph with nodes AGGi and Ci for baby i in register (small area for baby i), nodes AGGj and Cj for baby j in MCS (small area for baby j), and MODEL parameters]
Multivariate probit regression
- AGGi: aggregate ethnicity/smoking data for area
of residence of baby i
- MODEL for imputation of Ci in terms of aggregate
data and MCS data
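The following sketch shows the general mechanics of a bivariate probit for two binary confounders, as one way to read "multivariate probit regression" here. It is an assumption-laden illustration, not the talk's model: the coefficients and correlation are fixed, made-up values (in the talk they would be unknown parameters informed by the MCS data), each latent mean depends on one aggregate covariate only, and all names are hypothetical.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def draw_confounders(agg_eth, agg_smoke, coef, rho, rng):
    """Draw (ethnicity, smoking) indicators for one baby from a bivariate probit.

    Each indicator is 1 when its latent normal variable is positive; the two
    latent errors are standard normal with correlation rho.
    """
    mean_eth = coef["eth0"] + coef["eth1"] * agg_eth
    mean_smoke = coef["smoke0"] + coef["smoke1"] * agg_smoke
    e1 = rng.gauss(0.0, 1.0)
    e2 = rho * e1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    return int(mean_eth + e1 > 0), int(mean_smoke + e2 > 0)

# Made-up coefficients and correlation, for illustration only.
coef = {"eth0": -1.5, "eth1": 3.0, "smoke0": -1.0, "smoke1": 2.0}
rho = 0.3
rng = random.Random(42)

def imputed_rates(agg_eth, agg_smoke, n=20_000):
    """Share of imputed draws equal to 1 for each confounder in one area."""
    draws = [draw_confounders(agg_eth, agg_smoke, coef, rho, rng) for _ in range(n)]
    return sum(d[0] for d in draws) / n, sum(d[1] for d in draws) / n

# Two hypothetical small areas with different aggregate proportions.
low_area = imputed_rates(agg_eth=0.05, agg_smoke=0.1)
high_area = imputed_rates(agg_eth=0.60, agg_smoke=0.8)
print("low area  (ethnicity, smoking):", low_area)
print("high area (ethnicity, smoking):", high_area)
print("theoretical ethnicity rate, low area:", round(normal_cdf(-1.5 + 3.0 * 0.05), 4))
```

The stronger the dependence of the latent means on the aggregate data, the more the imputed confounders differ between areas, which is what makes the aggregate data informative about C.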
20 Combining models
[Figure: the imputation graph (AGGi, AGGj, Ci, Cj and its MODEL parameters) joined to the outcome graph (THMi, THMj, outcome categories Normal, LBWP and LBWF, and its MODEL parameters), for baby i in register and baby j in MCS, each within its small area]
We used the unified model to impute (multiple
draws) LBWP and LBWF in register
21 Investigating the performance of the unified model
Missing outcome model: Y (1, 2, 3)
- Good performance of model depends on
- how well the aggregate data can inform C
(covariate)
- how strongly C and Y are linked
MCS data show
1. strong association between aggregate data and race, smoke
2. strong association between race, smoke and Y (LBW)
22 Simulation Study
- Step 1: Create data (N = 1333) under the scenario:
strong C-aggregate association, strong Y-C link
- Step 3: Compare the prediction based on
- analysis using fully observed data (no
imputation)
- analysis using partially observed data
(imputation).
23 Examining the missing outcome model: imputing Y
- Missing outcome data are either pre- or
full-term LBW (Y = 2 or Y = 3)
- If we are to accurately impute Y, these
probabilities must be accurately estimated.
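Because a missing outcome is known to be either category 2 or category 3, an imputed value can be drawn from the model's category probabilities renormalised over those two categories. The sketch below illustrates this with made-up probabilities for a single baby; in a full analysis the probabilities would presumably vary by baby and by posterior draw. Function names are hypothetical.

```python
import random

def prob_preterm_given_lbw(p2, p3):
    """P(Y = 2 | Y in {2, 3}): renormalise over the two possible categories."""
    return p2 / (p2 + p3)

def impute_outcome(p2, p3, rng):
    """Draw one imputed value, 2 (LBWP) or 3 (LBWF), for a low birth weight baby."""
    return 2 if rng.random() < prob_preterm_given_lbw(p2, p3) else 3

# Made-up category probabilities for one register baby (illustration only).
p2, p3 = 0.06, 0.03
rng = random.Random(0)
draws = [impute_outcome(p2, p3, rng) for _ in range(10_000)]
share_preterm = draws.count(2) / len(draws)
print(f"P(Y = 2 | Y in {{2, 3}}) = {prob_preterm_given_lbw(p2, p3):.4f}")
print(f"share of imputed draws equal to 2 = {share_preterm:.4f}")
```

Only the ratio of the two probabilities matters for the imputation, so errors in that ratio carry straight through to the imputed outcomes.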
24 Examining the missing outcome model: imputing Y
Y contains 50% missing values at categories 2
and 3; S and R are fully observed
25 Examining the missing outcome model: imputing Y
Y contains 50% missing values at categories 2
and 3; S and R are fully observed
More challenging! Y contains 50% missing values
at categories 2 and 3; S and R contain 80%
missing values
26 Examining the missing covariate model: imputing
C (smoke, race)
[Figure: panels for Smoke and Race]
27 Examining the missing covariate model: imputing
C (smoke, race)
28 Real data analysis: United Utilities water
company
Data restricted to singleton births; period Sep
2000 to Aug 2001
Subjects
Total 9278
MCS 1333
NBR 7945
[Figure legend: missing race; missing smoke; missing outcome at levels 2 (LBWP) and 3 (LBWF); completely observed information]
Missing in race and smoke: 85%; missing in
outcome: 7%
29 Real data analysis: United Utilities water
company
- Exposure variable: THMs
- Dichotomized into 2 groups
- low-medium exposure group (< 60 µg/l): 57.35%
- high exposure group (> 60 µg/l): 42.65%
- Estimated in separate model (Whitaker et al,
2005) and linked to health data via postcode
30 Models for real data analysis
Standard (STATA) vs. Multiple bias (Bayesian)
a. Multinomial logistic regression model for
MCS data only: no imputation
b. Bayesian multiple bias model for combined
NBR, MCS and aggregate data: impute missing
outcome and covariates
31 Results for the real data analysis (Low
birth-weight full-term vs Normal)
95% Bayesian credible interval
All parameter estimates adjusted for baby's sex,
maternal age, ward of residence
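For readers unfamiliar with how a 95% credible interval is obtained from posterior simulation output, here is a minimal sketch of an equal-tailed interval. The draws are simulated stand-ins rather than study results, the talk does not say which kind of interval was reported, and the function name is hypothetical.

```python
import random

def credible_interval(draws, level=0.95):
    """Equal-tailed interval from empirical quantiles of the posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lo_idx], ordered[hi_idx]

rng = random.Random(7)
# Stand-in for posterior draws of a log odds ratio (simulated, not study output).
log_or_draws = [rng.gauss(0.4, 0.2) for _ in range(20_000)]
low, high = credible_interval(log_or_draws)
print(f"95% credible interval for the log odds ratio: ({low:.3f}, {high:.3f})")
```

An interval on the log odds ratio scale that excludes zero corresponds to an odds ratio interval that excludes one.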
32 Conclusion
- Evidence for association between THM exposure and
low birth-weight full-term (but not with pre-term
LBW)
- Combining the datasets can
- increase statistical power of the survey data
- alleviate bias due to unmeasured confounding in
the administrative data
- Benefits of combining data via graphical model
will depend on amount of information and strength
of association provided by each sub-model
- Must allow for selection mechanism of survey when
combining data, and check compatibility of data
sources
33 THANKS
- Jassy Molitor
- Sylvia Richardson
- Chris Jackson
- Mireille Toledano
- Mark Nieuwenhuijsen
- James Bennett
- Peter Hambly
- Daniela Fecht
www.bias-project.org.uk
34 Using one-level imputation
35 Using one-level imputation
[Figure panels: Strong Y-C; Weak Y-C]
Y contains 50% missing values at categories 2
and 3
36 Two-level vs one-level imputation
[Figure: panels for Y = 1, Y = 2 and Y = 3 under three scenarios: strong C-aggregate with strong Y-C; weak C-aggregate with strong Y-C; strong C-aggregate with weak Y-C]
37 [Figure panels: Without cut function; Cut function]
38 Using two-level imputation
[Figure panels: Without cut function; Cut function]