Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Count Data

1 / 56

About This Presentation

Title:

Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Count Data

Description:

Modeling Cryptosporidium Concentrations: A Bayesian GLMM of ... Stomacher. Suspended. Solution. Centrifuge. Pellet of. Solids. Fraction (F) of Pellet. Suspended ... –

Number of Views:87

Avg rating:3.0/5.0

Slides: 57

Provided by: chr1227

Category:

more less

Transcript and Presenter's Notes

Title: Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Count Data

1
Modeling Cryptosporidium Concentrations A
Bayesian GLMM of Regional Count Data

Christopher Behr
Advisor Prof. Jery Stedinger
Prof. David Ruppert (ORIE)
Ciprian Craniceanu (Statistics)

2
Outline

Background on Cryptosporidium parvum
Issues in Cryptosporidium Data
Model Formulation
Model Computation and Parameterization
Analyses of Cryptosporidium concentrations
Water Quality Prediction Health Risk Analysis
(ICR)
Alternative Structure of Variance Components
(ICR)
ICR and ICRSS datasets
Conclusions

3
Background on Cryptosporidium

Waterborne pathogenic protozoa
Surface waters (65-97 of all in U.S.)
Difficult to kill
Can survive more than 100 days in environment
Resistant to chlorination found in 8-27
finished water supplies Swimming pools

4
Concern about Cryptosporidium

C. parvum causes mild to serious infections
Outbreaks largest in Milwaukee (1994)
Potentially high endemic levels
2,000-3,000 reported cases/yr between 1995-97
Unreported cases estimated between 0.2 and 2 of
population for industrialized countries
? 0.5 to 5 million cases/year in U.S.

5
EPA Regulatory Status

New Surface Water Treatment Rule being considered
for Cryptosporidium control
Data Information Collection Rule (ICR) and ICR
Supplementary Survey (ICRSS)
Survey of drinking water supply quality
Samples collected a utility intakes
Spiking studies evaluate testing method
effectiveness
Need appropriate statistical models

6
ICR Datasets
7
Issues in Cryptosporidium Data Variability in
Cryptosporidium Measurement

Cryptosporidium testing methods
Immunoflourescence assay method (IFA)
Immunomagnetic separation (1623)
Difficult to detect highly variable recovery
rates
Low expected recovery rates
IFA ? 10 1623 ? 40
High variability (in CV)
IFA ? 100 1623 ? 50
Recovery rate may be modeled as beta-distributed

8
Oocyst Recovery IFA Method
Walker, 1995
Stomacher
Centrifuge
Fiber Filter
Raw Water (V)
Suspended Solution
Separated Liquid
Top Layer
Centrifuge
Fraction (F) of Pellet
Suspended Solution
Pellet of Solids
Oocysts counted DISCRETELY!
Acetate Membrane
Dye added
Slides
9
Issues in Cryptosporidium Data Variability in
Cryptosporidium Measurement

Cryptosporidium testing methods
Immunoflourescence assay method (ICR)
Immunomagnetic separation (1623)
Difficult to detect highly variable recovery
rates
Low expected recovery rates
ICR ? 10 1623 ? 40
High variability (in CV)
ICR ? 100 1623 ? 50
Recovery rate may be modeled as beta-distributed

10
Issues in Cryptosporidium Data Low Counts
Across Many Sites

Many zero counts (93 - ICR 85 - ICRSS)
Many sites have only zero counts

11
Modeling Implications

Linear regression is in appropriate
No information in so many zero counts
We cannot assume normal errors
Information at most sites is insufficient to
estimate concentrations
Want to combine information at sites
Account for correlation within and between sites
Together Generalized Linear Mixed Model

Use a Hierarchical Poisson Model
Introduce random effects to create a mixed model
12
Generalized Linear Mixed Model Example

Hierarchical Poisson-lognormal structure
Log link function
Random effects captured by t, s

13
Bayesian Pathogen Concentration Model

Model Elements
Yij pathogen counts
Cij pathogen conc.
vij volume of water
Rij recovery rate
Xij predictor matrix
random effects
tij time-site effects
sj site effects
rk(j) regional effects

Hierarchical Model Yij ? PoissonvijCijRij lo
g Cij XijT? tij Rij Beta (a, b) where
tij N sj , st2 sj N rk(j), ss2 rk(j) N
m, sr2 diffuse prior distributions
coefficients (q) Normal Var comp. (s2) Inv
Gamma
14
Bayesian Statistical Approach

Frequentist Approach
Uses likelihood function f(y?) given data y
Point estimates of ? by maximizing likelihood
(MLE)
Bayesian Approach
Provide prior distribution(s), ?(?)
Obtain posterior distribution, p(?y)
where p(?y) ? f(y?) ?(?)

15
Justification of Bayesian Approach

Posterior of q provides more than a point
estimate
Estimating MLE in GLMM requires approximation
that may induce large biases
Prior may be chosen to induce little theoretical
difference between posterior mean and MLE
WinBUGS software offers flexible format

16
Outline

Background on Cryptosporidium parvum
Issues in Cryptosporidium Data
Model Formulation
Model Computation and Parameterization
Analyses of Cryptosporidium concentrations
Water Quality Prediction Health Risk Analysis
(ICR)
Alternative Structure of Variance Components
(ICR)
ICR and ICRSS datasets
Conclusions

17
Bayesian Computation

Want posterior conditional on fixed effects only
Random effects that must be integrated out
With 100 sites and 15 observations per site,
this is analytically intractable
Instead use Markov Chain Monte Carlo methods,
such as Gibbs Sampling

18
Gibbs Sampling Method

Gibbs Sampling is one type of Markov Chain Monte
Carlo method
General Idea to obtain p(?y)
Start with initial values
Iteratively sample values from p(?y)
Over many iterations obtain p(?y) empirically

19
Gibbs Sampling Algorithm

Each iteration i, Gibbs Sampler (GS) generates
q1(i) p(q1 q2 (i-1) , ... , qd (i-1) , y)
qd(i) p(qd q1 (i) , ... , qd-1 (i) , y)
Since q1(i) , ... , qd(i) ? p (qy) as i ? ?,
After T iterations, the posterior mean
qj ? mj (?qj(i)) /T E(mj) qj

20
Evaluation of Posterior Means

Let M be the estimator of the posterior mean for
q
With T iterations
Var(M) MC Error2 v sM2/T where v 1 2
S rk
We approximate S rk by
where for ARMA (1,1) model rk r1fk-1
estimate r16 and f ln(rk) ln(r16) (k-16)
ln(f)
for k16,, 49
Effective sample size (ESS) T/v

21
Performance of MCMC

Mixing the pattern over which samples are drawn
from the joint posterior distribution
Good mixing quickly produces samples throughout
the support of the posterior
Apply ESS to evaluate mixing
Various parameterizations improve mixing

22
Reparameterizations Center and Orthogonalize
Covariates (Gilks and Roberts, 1995)

Centering
Model yi m xib ? yi m xib
where xi xi - mean(x) and
m m mean(x)b
Gram-Schmidt Orthogonalization
X matrix of centered data
Determine X UA
where A is a triangular matrix and
U is orthogonal basis in subspace spanned by X
Model yi m XiTb ? yi m UiTg
To recover original coefficients b A-1g

23
Reparameterization Hierarchically Centering
Random Effects (Gelfand et al., 1995)

Model yjkm sj ejk j1,,m k1,,n
sjN(0,ss2)
ejk N(0,se2)
Posterior correlations are high (poor mixing) for
large n or large ss2 / se2

24
Posterior Correlation for m sites or regions
(not Hierarchically Centered)
25
Reparameterization Hierarchically Centering
Random Effects (Gelfand et al., 1995)

Model yjk hj ejk j1,,m k1,,n
hj m sj
hjN(m,ss2)
ejk N(0,se2)
Posterior correlations improved under the same
conditions (large n or large ss2 / se2 )

26
Effect of Hierarchical Centering on Posterior
Correlation for m sites or regions
27
Hierarchical Centering in ICR Model
Hierarchical Model Yij ? PoissonvijCijRij lo
g Cij XijT? tij Rij Beta (a, b) where
tij N sj , st2 sj N rk(j), ss2 rk(j) N
m, sr2 diffuse prior distributions
coefficients (q) Normal Var comp. (s2) Inv
Gamma
Standard Model Yij ? PoissonvijCijRij log
Cij XijT?tijsjrk(j)m Rij Beta (a,
b) where tij N 0 , st2 sj N 0,
ss2 rk(j) N 0, sr2 diffuse prior
distributions coefficients (q) Normal Var
comp. (s2) Inv Gamma
28
Parameterization Comparison

Effective Sample Sizes after 10,000 iterations
Reparameterizations C Centered
OOrthogonalized HC Hierarchical centered

29
Key Findings

Hierarchical centering of random effects and
centering data dramatically increase MCMC
efficiency.
Orthogonalization offers modest improvements over
centering data for some variables.
Best performance with hierarchically centering
random effects and with centered and/or
orthogonalized fixed effects.

30
Outline

Background on Cryptosporidium parvum
Issues in Cryptosporidium Data
Model Formulation
Model Computation and Parameterization
Analyses of Cryptosporidium concentrations
Water Quality Prediction Health Risk Analysis
(ICR)
Alternative Structure of Variance Components
(ICR)
Water Quality Prediction (ICR and ICRSS)
Conclusions

31
Model Covariates

Time-site Covariates
Log-turbidity, Carbonate hardness, Total organic
carbon
Hydrologic Variables (stream sites only)
Seasonal Effects
Spline Function chart
Temperature Anomaly
Site-Specific Covariates
Urban land area, sediment export potential,
population
Log-Residence Time (reservoir/lake sites only)

32
Spline Function for Seasonal Effect

Consists of
4 basis functions hn( )
? hn(di) Constant
di sampling date
Model estimates
ßn for n1,2,3
ß40 is model constraint
Seasonal Effect
? ßn hn(di)

33
Model Covariates

Time-site Covariates
Log-turbidity, Carbonate Hardness, Total Organic
Carbon
Hydrologic Variables (stream sites only)
Seasonal Effects
Spline Function chart
Temperature Anomaly
Site-Specific Covariates
Urban land area, sediment export potential,
population
Log-Residence Time (reservoir/lake sites only)

34
Modeling Objectives of Pathogen Concentrations

Water Quality Prediction (WQP)
Treatment plant-focused
Identify readily-measured indicators of
concentration levels
Estimate parameters for prediction at all times
Health Risk Analysis (HRA)
EPA-focused
Estimate annual average concentrations
Estimate must be applicable over the long-run

35
Implications from Modeling Objectives

Water Quality Prediction
Focus covariates that vary over time and place
Includes all relevant covariates
Model for Health Risk Analysis
Focus covariates known over time at a given
place
Includes site characteristics and the spline
function

36
Notes on Covariate Transformations

Special log-transformation if covariate includes
values of zero
log(xij 0.1mean(x))
Normalize to compare posterior means
(xij-mean(x))/SD(x)

37
Posterior Means in Models of Reservoir/Lakes
Not Significant Parameters. Other parameters in
Full model Temp. anomaly, log-population,
log-urban land area, soil permeability sediment
export, log residence time, seasonal spline
coefficients.
38
Observations on WQP/HRA Reservoir/Lake

WQP
Geometric mean 1 oocyst per 100 liters
Log-turbidity, Carbonate hardness, Tot. Org.
Carbon have large positive influence
Regions explain high proportion of variance
HRA
Model HRA-1 has lowest sts but no sig. param.
Negative reservoir residence time parameter
indicates smaller reservoirs have higher
concentrations
Regions contribute smaller proportion of variance

39
Posterior Means in Models of Streams
Not Significant Parameters. Other parameters in
Full model Total Organic Carbon, Temp. anomaly,
log-population, soil permeability, sediment
export, hydrologic variables. Seasonal Spline
Coefficients are not significant in WQP model
40
Seasonal Adjustment
41
Observations on WQP/HRA Streams

WQP
Log-turbidity, Carbonate Hardness, Urban Land
Area
Seasonal spline coefficients not significant
Regions not important in model
HRA
Seasonal spline captures some of effect of
turbidity in WQP model with lowest concentrations
in Winter
Regional variation is larger than in WQP models

42
Summary of Results
43
Issues in Site Definition Reservoirs
44
Issues in Site Definition

This research defines a site for each water body
where data was collected
Some water bodies supply several ICR treatment
plants
An alternative index defines sites by treatment
plant

45
Issues in Site Definition Streams
46
Observations on Site Definition

Variance increases with ICR-defined sites
Larger increase in variance for reservoir sites
than for stream sites
Carbonate hardness (Stream model) change due to
specific relationship with treatment plants from
same source

47
Issues in ICR / ICRSS Datasets

ICRSS includes a sample of sites also in ICR and
sites not represented in ICR
ICRSS includes 12 months of data ICR includes
(up to) 18 months
Concentrations modeled with dataset-specific
recovery rates
Interaction terms labeled ss
Total effect from ICRSS
ICRSSTICRSS avg (ICRSS-seasonal effects )

48
ICR-ICRSS Dataset Differences Reservoirs
49
Seasonal Adjustment (ICRSS)
50
ICR-ICRSS Dataset Differences Streams
51
Seasonal Adjustment (ICRSS)
52
Observations on ICR-ICRSS Data

ICRSS predicted concentrations
Higher than ICR overall
Large seasonal effect in March/April
Log-turbidity has smaller effect on
concentrations modeled in ICRSS but difference is
not significant

53
Summary of Results

Significant covariates include
Log-Turbidity, Carbonate Hardness, Total Organic
Carbon (Resv.), Log-urban land area (Stream)
Larger concentrations
Streams
High turbidity, carbonate hardness,
Spring season

54
Summary of Results (contd)