Nicky Best


About This Presentation

Slides: 39

Provided by: cja52

Transcript and Presenter's Notes

Title: Nicky Best

1
Bayesian graphical models for multiple bias
modelling in epidemiological studies
Nicky Best
Department of Epidemiology and Public Health
Imperial College, London
n.best_at_imperial.ac.uk
2
BIAS Project
Bayesian methods for integrated bias modelling
and analysis of multiple data sources
www.bias-project.org.uk

Talk Outline
Common biases in observational data
Graphical models
Case study: combining multiple data sources to
study effects of water disinfection by-products
on risk of low birth weight

3
Biases in observational data

Random errors (sampling variation)
Missing data
Unmeasured confounders
Selection biases
Measurement errors
Multiple data sources often necessary to identify
the biases and inform about different aspects of
the research question

4
Simple example of graphical model

The genotypes of the couple are independent, just
two random sets out in the world

They meet and...

5
Simple example of graphical model
Mendelian inheritance
[Diagram: parent nodes M and F, child node C]

C = genotype of child
Once the couple have a child and become parents,
their genotypes become associated through the
child, e.g. paternity testing
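
The point of this slide can be checked by brute-force enumeration. The sketch below is an illustration only: it assumes a single locus with two alleles, an allele frequency of 1/2 and Hardy-Weinberg proportions for the parents, none of which come from the slides. It shows that M and F are independent before the child is observed, but not once C is known.

```python
from fractions import Fraction
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
# Assumed prior for each parent: allele frequency 1/2, Hardy-Weinberg proportions.
PRIOR = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}


def p_transmit_A(genotype):
    """Probability that a parent with this genotype passes allele A on."""
    return Fraction(genotype.count("A"), 2)


def p_child(c, m, f):
    """Mendelian inheritance: P(C = c | M = m, F = f)."""
    a_m, a_f = p_transmit_A(m), p_transmit_A(f)
    if c == "AA":
        return a_m * a_f
    if c == "aa":
        return (1 - a_m) * (1 - a_f)
    return a_m * (1 - a_f) + (1 - a_m) * a_f


# Joint distribution implied by the graph M -> C <- F:
# p(m, f, c) = p(m) p(f) p(c | m, f)
joint = {
    (m, f, c): PRIOR[m] * PRIOR[f] * p_child(c, m, f)
    for m, f, c in product(GENOTYPES, repeat=3)
}


def parents_marginal(m, f):
    """P(M = m, F = f) with the child summed out."""
    return sum(joint[(m, f, c)] for c in GENOTYPES)


def parents_given_child(c):
    """P(M, F | C = c) as a dict keyed by (m, f)."""
    p_c = sum(joint[(m, f, c)] for m, f in product(GENOTYPES, repeat=2))
    return {(m, f): joint[(m, f, c)] / p_c for m, f in product(GENOTYPES, repeat=2)}


post = parents_given_child("Aa")
p_m = {m: sum(post[(m, f)] for f in GENOTYPES) for m in GENOTYPES}
p_f = {f: sum(post[(m, f)] for m in GENOTYPES) for f in GENOTYPES}

# Two AA parents cannot have an Aa child, so the joint posterior is 0,
# while the product of the two posterior marginals is not.
print(post[("AA", "AA")], p_m["AA"] * p_f["AA"])
```

Conditioning on the child rules out some parent combinations, which is what couples the two genotypes.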

6
Building complex models
[Diagram: graph on nodes A, B, C, D, E, F]

Conditional independence provides mathematical
basis for expressing large system as fusion of
smaller components

7
Building complex models
[Diagram: the graph on nodes A to F split into smaller components; nodes C, D and E each appear in two components]

Conditional independence provides mathematical
basis for expressing large system as fusion of
smaller components

8
Building complex models

Key idea: understand a complex system through a
global model built from small pieces that are
comprehensible
each with only a few variables
modular
Present context: each piece could represent a
separate data source

9
Case study: combining birth register, survey and
census data to study effects of water
disinfection by-products on risk of low birth
weight
10
Low birth-weight and chlorine byproducts

Does exposure to chlorine byproducts (i.e. total
trihalomethanes, THMs) during pregnancy
increase the risk of a low birth-weight baby?
Combine datasets with different strengths
Survey data (Millennium Cohort Study):
small, great individual detail.
Administrative data (national births register):
large, but little individual detail.
Single underlying model assumed to govern both
datasets; elaborate as appropriate to handle
biases

11
Low birth-weight

Important determinant of future health →
population health indicator
Low birth-weight needs to be stratified by
gestational age
Full-term low birth-weight: babies born > 37
weeks
Pre-term low birth-weight: babies born < 37
weeks
Established risk factors
Mother's tobacco smoking status during pregnancy.
Mother's ethnicity (South Asian), maternal age,
weight, height, number of previous births.
Baby's sex
Role of environmental risk factors, such as THMs,
less clear (inconclusive).
Some recent studies suggest a link, but others do
not.
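
The stratification on this slide can be sketched as a small classifier using the 1/2/3 coding that the graphical-model slide introduces later (1 normal, 2 LBWP, 3 LBWF). The function name is hypothetical, the 2500 g low birth-weight cutoff is an assumption (the slides do not state one), and treating exactly 37 weeks as full-term is also an assumption, since the slides only say "< 37" and "> 37".

```python
LOW_BIRTH_WEIGHT_GRAMS = 2500  # assumed cutoff; not given in the slides
TERM_WEEKS = 37                # from the slides


def birth_weight_indicator(weight_grams, gestation_weeks=None):
    """Return 1 (normal), 2 (LBWP), 3 (LBWF), or None if it cannot be determined.

    A low birth-weight baby with no gestational age (as in the birth
    register) cannot be assigned to category 2 or 3, so None is returned.
    """
    if weight_grams >= LOW_BIRTH_WEIGHT_GRAMS:
        return 1
    if gestation_weeks is None:
        return None
    # Assumption: exactly 37 weeks counts as full-term.
    return 2 if gestation_weeks < TERM_WEEKS else 3


print(birth_weight_indicator(3200, 40))   # normal weight
print(birth_weight_indicator(2100, 34))   # low weight, pre-term
print(birth_weight_indicator(2300, 39))   # low weight, full-term
print(birth_weight_indicator(2300))       # low weight, gestational age unknown
```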

12
Data sources (1) Millennium Cohort Study

About 11,695 births in England between Sep
2000 and Aug 2001
About 1,333 singleton births when restricted to
the United Utilities (UU) water company
UU is located in the north-west of
England.
Postcode made available to us under strict
security arrangements
Match individuals with exposure to chlorine
byproducts estimated in separate model (Whitaker
et al, 2005)
Birth weight, baby's gestational age and a reasonably
complete set of confounder data available
Allows a reasonable analysis, but issues remain
Low power to detect a small effect; this could be
improved by incorporating other data.
Potential selection bias

13
Data sources (2) National birth register (NBR)

Every birth in the population recorded.
Individual data with postcode (→ THM exposure)
and birth weight available to us under strict
security.
We study subjects from wards which were covered
by the UU water company and which are present in
both MCS and NBR samples: 7945 singleton births
between Sep 2000 and Aug 2001.
Larger dataset, no selection bias
but no confounder information, especially
ethnicity and smoking.
No record of gestational age.

14
Data sources (3) Aggregate data

Ethnic composition of the population
2001 census
for census output areas (500 individuals)
Tobacco expenditure
consumer surveys (CACI, who produce ACORN
consumer classification data)
for census output areas.
linked by postcode to Millennium Cohort and
national register data.

15
[Plots: birth weight by THM, by race and by smoking (source: MCS)]
16
Models for formally analysing combined data

Want an estimate of the association between low
birth-weight (full-term and pre-term) and THM
exposure, using all data, accounting for
Selection bias in MCS:
adjust models for predictors of selection
Missing confounders in register:
Bayesian graphical model
Missing outcomes in register data (no gestational
age information to stratify the birth weight):
Bayesian graphical model

17
Graphical model representation
[Diagram: nodes THMj, Cj, MODEL parameters and the outcome (Normal, LBWP, LBWF) for baby j in MCS; legend: known, unknown]
Multinomial logistic regression model
\mathrm{BWI}_j \sim \mathrm{Multinomial}(p_{j,1:3}, 1)
\log(p_{j,2} / p_{j,1}) = b_{10} + b_{11}\,\mathrm{THM}_j + b_{12}\,C_j
\log(p_{j,3} / p_{j,1}) = b_{20} + b_{21}\,\mathrm{THM}_j + b_{22}\,C_j
BWI: birth weight indicator (1 = normal, 2 = LBWP, 3 = LBWF)
LBWP: low birth weight pre-term
LBWF: low birth weight full-term
THM: THM (chlorine byproduct) exposure
C: confounders such as ethnicity and smoking (only in MCS)
Baby's gestational age only observed in MCS
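
To make the multinomial logistic model on this slide concrete, the following sketch turns the two log-odds equations into the three category probabilities, with category 1 (normal) as the baseline. It assumes a single scalar confounder C for brevity, and the coefficient values in the example call are hypothetical, chosen only to exercise the function.

```python
import math


def category_probabilities(thm, c, b1, b2):
    """Return (p1, p2, p3) for one baby.

    b1 = (b10, b11, b12) gives log(p2 / p1); b2 = (b20, b21, b22) gives
    log(p3 / p1). Category 1 (normal birth weight) is the baseline.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c
    eta3 = b2[0] + b2[1] * thm + b2[2] * c
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)


# Hypothetical coefficients, for illustration only.
b1 = (-3.0, 0.1, 0.5)
b2 = (-3.5, 0.3, 0.8)
p1, p2, p3 = category_probabilities(thm=1, c=1, b1=b1, b2=b2)
print(p1, p2, p3)
```

Dividing by the common denominator guarantees that the three probabilities sum to one, and taking log(p2 / p1) or log(p3 / p1) recovers the corresponding linear predictor.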
18
Graphical model representation
[Diagram: nodes THMi, Ci and the outcome (Normal, LBWP, LBWF) for baby i in register; nodes THMj, Cj and the outcome (Normal, LBWP, LBWF) for baby j in MCS; shared MODEL parameters; legend: known, unknown]
LBWF: low birth-weight full-term
LBWP: low birth-weight pre-term
THM: THM (chlorine byproduct) exposure
C: confounders such as ethnicity and smoking (only in the MCS)
Same MODEL assumed to govern both datasets
Baby's gestational age only observed in MCS
19
Missing confounder imputation model
[Diagram: nodes AGGi and Ci in the small area for baby i (in register); nodes AGGj and Cj in the small area for baby j (in MCS); shared MODEL parameters]
Multivariate probit regression
AGGi: aggregate ethnicity/smoking data for area
of residence of baby i
MODEL for imputation of Ci in terms of aggregate
data and MCS data
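
A multivariate probit model can be read as correlated latent normal variables that are thresholded at zero. The sketch below draws two binary confounders (ethnicity and smoking indicators) for one baby from a bivariate probit whose means depend on the aggregate data. Everything specific here is an assumption for illustration: the function names, the binary coding of C, the linear form of the means, and the coefficient and correlation values. It is not the authors' implementation.

```python
import math
import random
from statistics import NormalDist


def draw_confounders(agg_eth, agg_smk, coef_eth, coef_smk, rho, rng):
    """Draw (ethnicity, smoking) indicators from a bivariate probit.

    coef_* = (intercept, slope) for the latent mean given the aggregate
    covariate; rho is the correlation between the two latent errors.
    """
    mu_eth = coef_eth[0] + coef_eth[1] * agg_eth
    mu_smk = coef_smk[0] + coef_smk[1] * agg_smk
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    z_eth = mu_eth + e1
    # Second error has unit variance and correlation rho with the first.
    z_smk = mu_smk + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
    return int(z_eth > 0), int(z_smk > 0)


def marginal_probability(coef, agg):
    """P(indicator = 1) implied by the probit link for one confounder."""
    return NormalDist().cdf(coef[0] + coef[1] * agg)


# Hypothetical values, for illustration only.
rng = random.Random(0)
coef_eth, coef_smk = (-1.5, 2.0), (-1.0, 1.5)
draws = [draw_confounders(0.3, 0.4, coef_eth, coef_smk, 0.2, rng) for _ in range(20000)]
freq_eth = sum(d[0] for d in draws) / len(draws)
freq_smk = sum(d[1] for d in draws) / len(draws)
print(freq_eth, marginal_probability(coef_eth, 0.3))
print(freq_smk, marginal_probability(coef_smk, 0.4))
```

In the slides the imputation model is fitted jointly with the outcome model; this sketch only shows what one draw of C given AGG looks like once parameters are fixed.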
20
Combining models
[Diagram: the imputation model (AGGi, Ci and AGGj, Cj within small areas, with its MODEL parameters) joined to the outcome model (THMi, Ci and THMj, Cj with outcomes Normal, LBWP, LBWF, and its MODEL parameters) for baby i in register and baby j in MCS]
We used the unified model to impute (multiple
draws) LBWP and LBWF in register
21
Investigating the performance of the unified model
Missing outcome model
Y (categories 1, 2, 3)

Good performance of the model depends on
how well the aggregate data can inform C
(covariate)
how strongly C and Y are linked

MCS data show
1. strong association between aggregate data and
race, smoke
2. strong association between race, smoke and Y
(LBW)
22
Simulation Study
Step 1: Create data (N = 1333) under the scenario
strong C-aggregate association
strong Y-C link

Step 3: Compare the prediction based on
analysis using fully observed data (no
imputation)
analysis using partially observed data
(imputation).

23
Examining the missing outcome model imputing Y

Missing outcome data are either pre- or
full-term LBW
(Y = 2 or Y = 3)

If we are to accurately impute Y, these
probabilities
must be accurately estimated.
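
Because a missing outcome is known to be either category 2 or category 3, the quantity needed for imputation is the conditional probability P(Y = 2 | Y in {2, 3}) = p2 / (p2 + p3). The sketch below draws multiple imputations from that conditional for one baby. The function names and the probability values are hypothetical, and it treats p2 and p3 as fixed, whereas in the Bayesian model they would vary across posterior draws.

```python
import random


def prob_preterm_given_lbw(p_lbwp, p_lbwf):
    """P(Y = 2 | Y in {2, 3}) from the unconditional category probabilities."""
    return p_lbwp / (p_lbwp + p_lbwf)


def impute_lbw_category(p_lbwp, p_lbwf, rng):
    """Draw Y = 2 (pre-term) or Y = 3 (full-term) for a low birth-weight baby."""
    return 2 if rng.random() < prob_preterm_given_lbw(p_lbwp, p_lbwf) else 3


def multiple_imputations(p_lbwp, p_lbwf, n_draws, seed=0):
    """Return n_draws independent imputations for one baby."""
    rng = random.Random(seed)
    return [impute_lbw_category(p_lbwp, p_lbwf, rng) for _ in range(n_draws)]


# Hypothetical probabilities, for illustration only.
draws = multiple_imputations(p_lbwp=0.03, p_lbwf=0.01, n_draws=10000)
share_preterm = draws.count(2) / len(draws)
print(prob_preterm_given_lbw(0.03, 0.01), share_preterm)
```

If p2 and p3 are poorly estimated, their ratio is too, and the imputed split between pre-term and full-term low birth weight inherits that error.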

24
Examining the missing outcome model imputing Y
Y contains 50 missing values at categories 2
and 3; S and R are fully observed
25
Examining the missing outcome model imputing Y
Y contains 50 missing values at categories 2
and 3; S and R are fully observed
More challenging! Y contains 50 missing values
at categories 2 and 3; S and R contain 80
missing values
26
Examining the missing covariate model: imputing
C (smoke, race)
[Plots: panels for Smoke and Race]
27
Examining the missing covariate model: imputing
C (smoke, race)
28
Real data analysis: United Utilities water
company
Data restricted to singleton births
Period: Sep 2000 to Aug 2001
Subjects
Total 9278
MCS 1333
NBR 7945

Legend: missing race; missing smoke; missing
outcome at levels 2 (LBWP) and 3 (LBWF);
complete observed information
Missing in race and smoke: 85%
Missing in outcome: 7%
29
Real data analysis United Utilities water
company

Exposure variable: THMs
Dichotomized into 2 groups
low-medium exposure group (< 60 µg/l): 57.35%
high exposure group (> 60 µg/l): 42.65%
Estimated in separate model (Whitaker et al,
2005) and linked to health data via postcode
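
The dichotomization on this slide amounts to a single threshold. A minimal sketch, assuming the THM estimate is expressed in the same concentration units as the cutoff on the slide; the function name is hypothetical, and since the slide does not say which group a value of exactly 60 belongs to, it is placed in the low-medium group here as an assumption.

```python
THM_CUTOFF = 60.0  # cutoff from the slide, in the slide's concentration units


def thm_exposure_group(thm_concentration):
    """Return 'high' above the cutoff, otherwise 'low-medium'.

    Assumption: a value exactly at the cutoff is treated as low-medium.
    """
    return "high" if thm_concentration > THM_CUTOFF else "low-medium"


print(thm_exposure_group(45.2))
print(thm_exposure_group(71.0))
```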

30
Models for real data analysis
Standard (STATA) vs. multiple bias (Bayesian)
a. Multinomial logistic regression model for
MCS data only: no imputation
b. Bayesian multiple bias model for combined
NBR, MCS and aggregate data: impute missing
outcome and covariates
31
Results for the real data analysis (low
birth-weight full-term vs. normal)
95% Bayesian credible interval
All parameter estimates adjusted for baby's sex,
maternal age, ward of residence
32
Conclusion

Evidence for association between THM exposure and
low birth-weight full-term (but not with pre-term
LBW)
Combining the datasets can
increase statistical power of the survey data
alleviate bias due to unmeasured confounding in
the administrative data
Benefits of combining data via graphical model
will depend on amount of information and strength
of association provided by each sub-model
Must allow for selection mechanism of survey when
combining data, and check compatibility of data
sources

33
THANKS

Jassy Molitor
Sylvia Richardson
Chris Jackson

Mireille Toledano
Mark Nieuwenhuijsen
James Bennett
Peter Hambly
Daniela Fecht

www.bias-project.org.uk
34
using one-level imputation
35
using one-level imputation
Strong Y-C
Y contains 50 missing values at categories 2
and 3
Weak Y-C
36
Two-level vs. one-level imputation
[Plot panels: Y = 1, Y = 2, Y = 3]
Strong C-aggre Strong Y-C
Weak C-aggre Strong Y-C
Strong C-aggre Weak Y-C
37
Without cut function
Cut function
38
using two-level imputation
Without cut function
Cut function
