1
Bayesian graphical models for multiple bias
modelling in epidemiological studies
Nicky Best
Department of Epidemiology and Public Health
Imperial College, London
n.best_at_imperial.ac.uk
2
BIAS Project
Bayesian methods for integrated bias modelling
and analysis of multiple data sources
www.bias-project.org.uk
  • Talk Outline
  • Common biases in observational data
  • Graphical models
  • Case study: combining multiple data sources to
    study effects of water disinfection by-products
    on risk of low birth weight

3
Biases in observational data
  • Random errors (sampling variation)
  • Missing data
  • Unmeasured confounders
  • Selection biases
  • Measurement errors
  • Multiple data sources often necessary to identify
    the biases and inform about different aspects of
    the research question

4
Simple example of graphical model

  • The genotypes of the couple are independent: just
    two random draws from the population
  • They meet and...

5
Simple example of graphical model
Mendelian inheritance
[Diagram: parents' genotypes M and F, each with an arrow into C]
  • C: genotype of child
  • Once the couple have a child and become parents,
    their genotypes become associated through the
    child, e.g. paternity testing
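
The association induced by the child can be checked by exact enumeration. The sketch below is illustrative only: it assumes a single locus with two alleles, an allele frequency of 0.3 and Hardy-Weinberg proportions for the parents, none of which come from the talk. Genotypes are coded as the number of copies (0, 1, 2) of one allele.

```python
from fractions import Fraction
from itertools import product

P_ALLELE = Fraction(3, 10)  # assumed allele frequency (illustrative only)
GENOTYPES = range(3)        # number of copies of the allele: 0, 1 or 2


def genotype_prior(g):
    """Hardy-Weinberg probability of a parent carrying g copies."""
    p, q = P_ALLELE, 1 - P_ALLELE
    return {0: q * q, 1: 2 * p * q, 2: p * p}[g]


def transmit(g):
    """Probability that a parent with g copies passes the allele on."""
    return Fraction(g, 2)


def child_given_parents(c, m, f):
    """Mendelian inheritance: P(C = c | M = m, F = f)."""
    tm, tf = transmit(m), transmit(f)
    return {
        0: (1 - tm) * (1 - tf),
        1: tm * (1 - tf) + (1 - tm) * tf,
        2: tm * tf,
    }[c]


def joint(m, f, c):
    """p(M, F, C) = p(M) p(F) p(C | M, F)."""
    return genotype_prior(m) * genotype_prior(f) * child_given_parents(c, m, f)


def conditional_parents(c):
    """P(M = m, F = f | C = c) for every pair of parental genotypes."""
    p_c = sum(joint(m, f, c) for m, f in product(GENOTYPES, repeat=2))
    return {(m, f): joint(m, f, c) / p_c
            for m, f in product(GENOTYPES, repeat=2)}


# Given a heterozygous child, the parents cannot both carry zero copies,
# although each parent on their own can: M and F are dependent given C.
cond = conditional_parents(1)
p_m0 = sum(cond[(0, f)] for f in GENOTYPES)
p_f0 = sum(cond[(m, 0)] for m in GENOTYPES)
print(cond[(0, 0)], p_m0 * p_f0)
```

Summing `joint` over the child's genotype recovers `genotype_prior(m) * genotype_prior(f)`, so the parents are independent until the child is observed.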

6
Building complex models
[Diagram: graph on nodes A, B, C, D, E, F]
  • Conditional independence provides mathematical
    basis for expressing large system as fusion of
    smaller components

7
Building complex models
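[Diagram: the graph on nodes A, B, C, D, E, F split into
components; nodes C, D and E appear in more than one component]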
  • Conditional independence provides mathematical
    basis for expressing large system as fusion of
    smaller components
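
As a general statement for directed acyclic graphs (a standard result, not specific to the graph drawn on the slide), conditional independence lets the joint distribution factorise into local pieces, each involving one node and its parents:

```latex
p(v_1, \dots, v_n) \;=\; \prod_{k=1}^{n} p\bigl(v_k \mid \mathrm{pa}(v_k)\bigr)
```

Here $\mathrm{pa}(v_k)$ denotes the parents of node $v_k$ in the graph. For the genotype example this reads $p(M, F, C) = p(M)\,p(F)\,p(C \mid M, F)$.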

8
Building complex models
  • Key idea: understand a complex system through a
    global model built from small pieces that are
  • comprehensible
  • each with only a few variables
  • modular
  • In the present context, each piece could represent
    a separate data source

9
Case study: combining birth register, survey and
census data to study effects of water
disinfection by-products on risk of low birth
weight
10
Low birth-weight and chlorine byproducts
  • Does exposure to chlorine byproducts (i.e. total
    trihalomethanes, THMs) during pregnancy
    increase the risk of a low birth-weight baby?
  • Combine datasets with different strengths
  • Survey data (Millennium Cohort Study)
  • Small, great individual detail.
  • Administrative data (national births register)
  • Large, but little individual detail.
  • Single underlying model assumed to govern both
    datasets; elaborate as appropriate to handle
    biases

11
Low birth-weight
  • Important determinant of future health →
    population health indicator
  • Low birth-weight needs to be stratified by
    gestational age
  • Full-term low birth-weight: babies born > 37
    weeks
  • Pre-term low birth-weight: babies born < 37
    weeks
  • Established risk factors
  • Mother's tobacco smoking status during pregnancy.
  • Mother's ethnicity (South Asian), maternal age,
    weight, height, number of previous births.
  • Baby's sex
  • Role of environmental risk factors, such as THMs,
    less clear (inconclusive).
  • Some recent studies suggest a link, but others do
    not.

12
Data sources (1) Millennium Cohort Study
  • About 11,695 births in England between Sep
    2000 and August 2001
  • About 1,333 singleton births when restricted to
    the United Utilities (UU) water company
  • The UU company is located in the northwest of
    England.
  • Postcode made available to us under strict
    security arrangements
  • Match individuals with exposure to chlorine
    byproducts estimated in a separate model (Whitaker
    et al, 2005)
  • Birth weight, baby's gestational age and a reasonably
    complete set of confounder data available
  • Allows a reasonable analysis, but issues remain
  • Low power to detect small effect → could be
    improved by incorporating other data.
  • Potential selection bias

13
Data sources (2) National birth register (NBR)
  • Every birth in the population recorded.
  • Individual data with postcode (→ THM exposure)
    and birth weight available to us under strict
    security.
  • We study subjects from wards which were covered
    by the UU water company and which are present in
    both MCS and NBR samples: 7945 singleton births
    between Sep 2000 and Aug 2001.
  • Larger dataset, no selection bias
  • but no confounder information, especially
    ethnicity and smoking.
  • No record of gestational age.

14
Data sources (3) Aggregate data
  • Ethnic composition of the population
  • 2001 census
  • for census output areas (500 individuals)
  • Tobacco expenditure
  • consumer surveys (CACI, who produce ACORN
    consumer classification data)
  • for census output areas.
  • linked by postcode to Millennium Cohort and
    national register data.

15
[Figure panels: Birth weight vs. THM (source: MCS);
Birth weight vs. Race (source: MCS);
Birth weight vs. Smoke (source: MCS)]
16
Models for formally analysing combined data
  • Want an estimate of the association between low
    birth-weight (full-term and pre-term) and THM
    exposure, using all data, accounting for:
  • Selection bias in MCS: adjust models for
    predictors of selection
  • Missing confounders in register: Bayesian
    graphical model
  • Missing outcomes in register data (no gestational
    age information to stratify the birth weight):
    Bayesian graphical model

17
Graphical model representation
[Diagram: plate "baby j in MCS" containing nodes THM_j, C_j and
the outcome (Normal, LBWP, LBWF), linked to the MODEL parameters;
nodes are marked as known or unknown]
Multinomial logistic regression model:
$BWI_j \sim \mathrm{Multinomial}(p_{j,1:3}, 1)$
$\log(p_{j,2} / p_{j,1}) = b_{10} + b_{11}\,THM_j + b_{12}\,C_j$
$\log(p_{j,3} / p_{j,1}) = b_{20} + b_{21}\,THM_j + b_{22}\,C_j$
BWI: birth weight indicator (1 = normal, 2 = LBWP, 3 = LBWF)
LBWP: low birth weight pre-term
LBWF: low birth weight full-term
THM: THM (chlorine byproduct) exposure
C: confounders such as ethnicity and smoking (only in MCS)
Baby's gestational age only observed in MCS
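
The two log-ratio equations of the multinomial logistic regression fix the three category probabilities once "normal" is taken as the baseline. A minimal sketch of that mapping, with a single scalar confounder and coefficient values that are purely hypothetical (they are not estimates from the talk):

```python
import math


def category_probabilities(thm, c, b1, b2):
    """Return (p1, p2, p3) for BWI = 1 (normal), 2 (LBWP), 3 (LBWF).

    b1 = (b10, b11, b12) drives log(p2 / p1);
    b2 = (b20, b21, b22) drives log(p3 / p1).
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c
    eta3 = b2[0] + b2[1] * thm + b2[2] * c
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return 1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom


# Hypothetical coefficients, chosen only to make the example run.
B1 = (-3.0, 0.1, 0.5)
B2 = (-3.5, 0.3, 0.8)

p_normal, p_lbwp, p_lbwf = category_probabilities(thm=1, c=1, b1=B1, b2=B2)
print(p_normal, p_lbwp, p_lbwf)
```

With several confounders, `c` and the matching coefficient would become vectors and the products would become dot products.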
18
Graphical model representation
[Diagram: two plates sharing the MODEL parameters.
Plate "baby i in register": THM_i, C_i and the outcome (Normal, LBWP, LBWF).
Plate "baby j in MCS": THM_j, C_j and the outcome (Normal, LBWP, LBWF).
Nodes are marked as known or unknown.]
LBWF: low birth-weight full-term
LBWP: low birth-weight pre-term
THM: THM (chlorine byproduct) exposure
C: confounders such as ethnicity and smoking (only in the MCS)
Same MODEL assumed to govern both datasets
Baby's gestational age only observed in MCS
19
Missing confounder imputation model
[Diagram: plates "baby i in register" (AGG_i, C_i, within the small
area for baby i) and "baby j in MCS" (AGG_j, C_j, within the small
area for baby j), sharing the MODEL parameters]
Multivariate probit regression
AGG_i: aggregate ethnicity/smoking data for the area
of residence of baby i
MODEL for imputation of C_i in terms of aggregate data and MCS data
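
The slide names a multivariate probit regression for imputing C from the aggregate data. A minimal sketch of the bivariate case follows. It assumes two binary confounders (an ethnicity indicator and a smoking indicator), and the coefficients and the latent correlation `rho` are hypothetical values chosen for illustration, not quantities from the talk.

```python
import math
import random


def draw_confounders(agg_ethnic, agg_smoke, rng, rho=0.3):
    """Draw (race, smoke) indicators from a bivariate probit model.

    Latent variables are normal with unit variance and correlation rho;
    each indicator is 1 when its latent variable exceeds zero.
    """
    # Hypothetical linear predictors in the area-level (aggregate) data.
    mu_race = -1.0 + 2.0 * agg_ethnic
    mu_smoke = -0.8 + 1.5 * agg_smoke

    # Correlated standard normal errors built from two independent draws.
    e_race = rng.gauss(0.0, 1.0)
    e_smoke = rho * e_race + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)

    return int(mu_race + e_race > 0), int(mu_smoke + e_smoke > 0)


rng = random.Random(1)
print([draw_confounders(0.5, 0.4, rng) for _ in range(5)])
```

Marginally each indicator follows an ordinary probit model, with P(race = 1) equal to the standard normal CDF at `mu_race`; the correlation only affects how the two indicators move together.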
20
Combining models
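[Diagram: the imputation model and the outcome model joined.
Plate "baby i in register": AGG_i, THM_i, C_i and the outcome
(Normal, LBWP, LBWF), within the small area for baby i.
Plate "baby j in MCS": AGG_j, THM_j, C_j and the outcome
(Normal, LBWP, LBWF), within the small area for baby j.
Each sub-model has its own MODEL parameters.]
We used the unified model to impute (multiple
draws) LBWP and LBWF in the register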
21
Investigating the performance of the unified model
Missing Outcome Model
Y (1, 2, 3)
  • Good performance of the model depends on
  • how well the aggregate data can inform C
    (covariate)
  • how strongly C and Y are linked

MCS data show
1. strong association between aggregate data and race, smoke
2. strong association between race, smoke and Y (LBW)
22
Simulation Study
  • Step 1: Create data (N = 1333) under the scenario
    strong C-aggregate association, strong Y-C link
  • Step 3: Compare the prediction based on
  • analysis using fully observed data (no
    imputation)
  • analysis using partially observed data
    (imputation).

23
Examining the missing outcome model: imputing Y
  • Missing outcome data are either pre- or
    full-term LBW
  • (Y = 2 or Y = 3)
  • If we are to accurately impute Y, these
    probabilities
  • must be accurately estimated.
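
Because a missing outcome is known to be either pre-term or full-term LBW, one way to picture the imputation is as a draw between categories 2 and 3 with the category probabilities renormalised. The sketch below assumes the three probabilities (p1, p2, p3) for a baby are already available; the function names and numbers are illustrative, not the talk's implementation.

```python
import random


def impute_lbw_type(p, rng):
    """Draw Y in {2, 3} for a baby known to be low birth weight.

    p = (p1, p2, p3) are the probabilities of normal, LBWP and LBWF;
    given Y is 2 or 3, P(Y = 2) = p2 / (p2 + p3).
    """
    p_preterm = p[1] / (p[1] + p[2])
    return 2 if rng.random() < p_preterm else 3


def multiple_draws(p, n_draws, seed=0):
    """Repeat the imputation to obtain multiple draws for one baby."""
    rng = random.Random(seed)
    return [impute_lbw_type(p, rng) for _ in range(n_draws)]


# Illustrative probabilities only.
print(multiple_draws((0.90, 0.04, 0.06), n_draws=10))
```

If p2 and p3 are poorly estimated, the ratio p2 / (p2 + p3) is off and every imputed value inherits that error, which is why the slide stresses accurate estimation.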

24
Examining the missing outcome model: imputing Y
Y contains 50% missing values at categories 2
and 3; S and R are fully observed
25
Examining the missing outcome model: imputing Y
Y contains 50% missing values at categories 2
and 3; S and R are fully observed
More challenging! Y contains 50% missing values
at categories 2 and 3; S and R contain 80%
missing values
26
Examining the missing covariate model: imputing
C (smoke, race)
[Figure panels: Smoke; Race]
27
Examining the missing covariate model: imputing
C (smoke, race)
28
Real data analysis: United Utilities water
company
Data: restricted to singleton births
Period: Sep 2000 to Aug 2001
Subjects:
Total 9278
MCS 1333
NBR 7945
[Diagram legend: missing race; missing smoke; missing outcome at
levels 2 (LBWP) and 3 (LBWF); complete observed information]
Missing in race and smoke: 85%
Missing in outcome: 7%
29
Real data analysis United Utilities water
company
  • Exposure variable: THMs
  • Dichotomized into 2 groups
  • low-medium exposure group (< 60 µg/l): 57.35%
  • high exposure group (> 60 µg/l): 42.65%
  • Estimated in a separate model (Whitaker et al,
    2005) and linked to health data via postcode

30
Models for real data analysis
Standard (STATA) vs. Multiple bias (Bayesian)
a. Multinomial logistic regression model for
MCS data only (no imputation)
b. Bayesian multiple bias model for combined
NBR, MCS and aggregate data (impute missing
outcome and covariates)
31
Results for the real data analysis (low
birth-weight full-term vs. normal)
95% Bayesian credible interval
All parameter estimates adjusted for baby's sex,
maternal age, ward of residence
32
Conclusion
  • Evidence for association between THM exposure and
    low birth-weight full-term (but not with pre-term
    LBW)
  • Combining the datasets can
  • increase statistical power of the survey data
  • alleviate bias due to unmeasured confounding in
    the administrative data
  • Benefits of combining data via graphical model
    will depend on amount of information and strength
    of association provided by each sub-model
  • Must allow for selection mechanism of survey when
    combining data, and check compatibility of data
    sources

33
THANKS
  • Jassy Molitor
  • Sylvia Richardson
  • Chris Jackson
  • Mireille Toledano
  • Mark Nieuwenhuijsen
  • James Bennett
  • Peter Hambly
  • Daniela Fecht

www.bias-project.org.uk
34
using one-level imputation
35
using one-level imputation
Strong Y-C
Y contains 50% missing values at categories 2
and 3
Weak Y-C
36
two-levels VS one-level imputation
Y1 Y2 Y3
Strong C-aggre Strong Y-C
Weak C-aggre Strong Y-C
Strong C-aggre Weak Y-C
37
Without cut function
Cut function
38
using two-level imputation
Without cut function
Cut function