Title: Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Count Data
1Modeling Cryptosporidium Concentrations A
Bayesian GLMM of Regional Count Data
- Christopher Behr
- Advisor Prof. Jery Stedinger
- Prof. David Ruppert (ORIE)
- Ciprian Craniceanu (Statistics)
2Outline
- Background on Cryptosporidium parvum
- Issues in Cryptosporidium Data
- Model Formulation
- Model Computation and Parameterization
- Analyses of Cryptosporidium concentrations
- Water Quality Prediction Health Risk Analysis
(ICR) - Alternative Structure of Variance Components
(ICR) - ICR and ICRSS datasets
- Conclusions
3Background on Cryptosporidium
- Waterborne pathogenic protozoa
- Surface waters (65-97 of all in U.S.)
- Difficult to kill
- Can survive more than 100 days in environment
- Resistant to chlorination found in 8-27
finished water supplies Swimming pools
4Concern about Cryptosporidium
- C. parvum causes mild to serious infections
- Outbreaks largest in Milwaukee (1994)
- Potentially high endemic levels
- 2,000-3,000 reported cases/yr between 1995-97
- Unreported cases estimated between 0.2 and 2 of
population for industrialized countries - ? 0.5 to 5 million cases/year in U.S.
5EPA Regulatory Status
- New Surface Water Treatment Rule being considered
for Cryptosporidium control - Data Information Collection Rule (ICR) and ICR
Supplementary Survey (ICRSS) - Survey of drinking water supply quality
- Samples collected a utility intakes
- Spiking studies evaluate testing method
effectiveness - Need appropriate statistical models
6ICR Datasets
7Issues in Cryptosporidium Data Variability in
Cryptosporidium Measurement
- Cryptosporidium testing methods
- Immunoflourescence assay method (IFA)
- Immunomagnetic separation (1623)
- Difficult to detect highly variable recovery
rates - Low expected recovery rates
- IFA ? 10 1623 ? 40
- High variability (in CV)
- IFA ? 100 1623 ? 50
- Recovery rate may be modeled as beta-distributed
8Oocyst Recovery IFA Method
Walker, 1995
Stomacher
Centrifuge
Fiber Filter
Raw Water (V)
Suspended Solution
Separated Liquid
Top Layer
Centrifuge
Fraction (F) of Pellet
Suspended Solution
Pellet of Solids
Oocysts counted DISCRETELY!
Acetate Membrane
Dye added
Slides
9Issues in Cryptosporidium Data Variability in
Cryptosporidium Measurement
- Cryptosporidium testing methods
- Immunoflourescence assay method (ICR)
- Immunomagnetic separation (1623)
- Difficult to detect highly variable recovery
rates - Low expected recovery rates
- ICR ? 10 1623 ? 40
- High variability (in CV)
- ICR ? 100 1623 ? 50
- Recovery rate may be modeled as beta-distributed
10Issues in Cryptosporidium Data Low Counts
Across Many Sites
- Many zero counts (93 - ICR 85 - ICRSS)
- Many sites have only zero counts
11Modeling Implications
- Linear regression is in appropriate
- No information in so many zero counts
- We cannot assume normal errors
-
- Information at most sites is insufficient to
estimate concentrations - Want to combine information at sites
- Account for correlation within and between sites
- Together Generalized Linear Mixed Model
Use a Hierarchical Poisson Model
Introduce random effects to create a mixed model
12Generalized Linear Mixed Model Example
- Hierarchical Poisson-lognormal structure
- Log link function
- Random effects captured by t, s
13Bayesian Pathogen Concentration Model
- Model Elements
- Yij pathogen counts
- Cij pathogen conc.
- vij volume of water
- Rij recovery rate
- Xij predictor matrix
- random effects
- tij time-site effects
- sj site effects
- rk(j) regional effects
Hierarchical Model Yij ? PoissonvijCijRij lo
g Cij XijT? tij Rij Beta (a, b) where
tij N sj , st2 sj N rk(j), ss2 rk(j) N
m, sr2 diffuse prior distributions
coefficients (q) Normal Var comp. (s2) Inv
Gamma
14Bayesian Statistical Approach
- Frequentist Approach
- Uses likelihood function f(y?) given data y
- Point estimates of ? by maximizing likelihood
(MLE) - Bayesian Approach
- Provide prior distribution(s), ?(?)
- Obtain posterior distribution, p(?y)
- where p(?y) ? f(y?) ?(?)
15Justification of Bayesian Approach
- Posterior of q provides more than a point
estimate - Estimating MLE in GLMM requires approximation
that may induce large biases - Prior may be chosen to induce little theoretical
difference between posterior mean and MLE - WinBUGS software offers flexible format
16Outline
- Background on Cryptosporidium parvum
- Issues in Cryptosporidium Data
- Model Formulation
- Model Computation and Parameterization
- Analyses of Cryptosporidium concentrations
- Water Quality Prediction Health Risk Analysis
(ICR) - Alternative Structure of Variance Components
(ICR) - ICR and ICRSS datasets
- Conclusions
17Bayesian Computation
- Want posterior conditional on fixed effects only
- Random effects that must be integrated out
- With 100 sites and 15 observations per site,
this is analytically intractable - Instead use Markov Chain Monte Carlo methods,
such as Gibbs Sampling
18Gibbs Sampling Method
- Gibbs Sampling is one type of Markov Chain Monte
Carlo method - General Idea to obtain p(?y)
- Start with initial values
- Iteratively sample values from p(?y)
- Over many iterations obtain p(?y) empirically
19Gibbs Sampling Algorithm
- Each iteration i, Gibbs Sampler (GS) generates
- q1(i) p(q1 q2 (i-1) , ... , qd (i-1) , y)
-
- qd(i) p(qd q1 (i) , ... , qd-1 (i) , y)
- Since q1(i) , ... , qd(i) ? p (qy) as i ? ?,
- After T iterations, the posterior mean
- qj ? mj (?qj(i)) /T E(mj) qj
20Evaluation of Posterior Means
- Let M be the estimator of the posterior mean for
q - With T iterations
- Var(M) MC Error2 v sM2/T where v 1 2
S rk - We approximate S rk by
-
- where for ARMA (1,1) model rk r1fk-1
- estimate r16 and f ln(rk) ln(r16) (k-16)
ln(f) - for k16,, 49
- Effective sample size (ESS) T/v
21Performance of MCMC
- Mixing the pattern over which samples are drawn
from the joint posterior distribution - Good mixing quickly produces samples throughout
the support of the posterior - Apply ESS to evaluate mixing
- Various parameterizations improve mixing
22Reparameterizations Center and Orthogonalize
Covariates (Gilks and Roberts, 1995)
- Centering
- Model yi m xib ? yi m xib
- where xi xi - mean(x) and
- m m mean(x)b
- Gram-Schmidt Orthogonalization
- X matrix of centered data
- Determine X UA
- where A is a triangular matrix and
- U is orthogonal basis in subspace spanned by X
- Model yi m XiTb ? yi m UiTg
- To recover original coefficients b A-1g
23Reparameterization Hierarchically Centering
Random Effects (Gelfand et al., 1995)
- Model yjkm sj ejk j1,,m k1,,n
- sjN(0,ss2)
- ejk N(0,se2)
- Posterior correlations are high (poor mixing) for
large n or large ss2 / se2
24Posterior Correlation for m sites or regions
(not Hierarchically Centered)
25Reparameterization Hierarchically Centering
Random Effects (Gelfand et al., 1995)
- Model yjk hj ejk j1,,m k1,,n
- hj m sj
- hjN(m,ss2)
- ejk N(0,se2)
- Posterior correlations improved under the same
conditions (large n or large ss2 / se2 )
26Effect of Hierarchical Centering on Posterior
Correlation for m sites or regions
27Hierarchical Centering in ICR Model
Hierarchical Model Yij ? PoissonvijCijRij lo
g Cij XijT? tij Rij Beta (a, b) where
tij N sj , st2 sj N rk(j), ss2 rk(j) N
m, sr2 diffuse prior distributions
coefficients (q) Normal Var comp. (s2) Inv
Gamma
Standard Model Yij ? PoissonvijCijRij log
Cij XijT?tijsjrk(j)m Rij Beta (a,
b) where tij N 0 , st2 sj N 0,
ss2 rk(j) N 0, sr2 diffuse prior
distributions coefficients (q) Normal Var
comp. (s2) Inv Gamma
28Parameterization Comparison
- Effective Sample Sizes after 10,000 iterations
- Reparameterizations C Centered
OOrthogonalized HC Hierarchical centered
29Key Findings
- Hierarchical centering of random effects and
centering data dramatically increase MCMC
efficiency. - Orthogonalization offers modest improvements over
centering data for some variables. - Best performance with hierarchically centering
random effects and with centered and/or
orthogonalized fixed effects.
30Outline
- Background on Cryptosporidium parvum
- Issues in Cryptosporidium Data
- Model Formulation
- Model Computation and Parameterization
- Analyses of Cryptosporidium concentrations
- Water Quality Prediction Health Risk Analysis
(ICR) - Alternative Structure of Variance Components
(ICR) - Water Quality Prediction (ICR and ICRSS)
- Conclusions
31Model Covariates
- Time-site Covariates
- Log-turbidity, Carbonate hardness, Total organic
carbon - Hydrologic Variables (stream sites only)
- Seasonal Effects
- Spline Function chart
- Temperature Anomaly
- Site-Specific Covariates
- Urban land area, sediment export potential,
population - Log-Residence Time (reservoir/lake sites only)
32Spline Function for Seasonal Effect
- Consists of
- 4 basis functions hn( )
- ? hn(di) Constant
- di sampling date
- Model estimates
- ßn for n1,2,3
- ß40 is model constraint
- Seasonal Effect
- ? ßn hn(di)
33Model Covariates
- Time-site Covariates
- Log-turbidity, Carbonate Hardness, Total Organic
Carbon - Hydrologic Variables (stream sites only)
- Seasonal Effects
- Spline Function chart
- Temperature Anomaly
- Site-Specific Covariates
- Urban land area, sediment export potential,
population - Log-Residence Time (reservoir/lake sites only)
34Modeling Objectives of Pathogen Concentrations
- Water Quality Prediction (WQP)
- Treatment plant-focused
- Identify readily-measured indicators of
concentration levels - Estimate parameters for prediction at all times
- Health Risk Analysis (HRA)
- EPA-focused
- Estimate annual average concentrations
- Estimate must be applicable over the long-run
35Implications from Modeling Objectives
- Water Quality Prediction
- Focus covariates that vary over time and place
- Includes all relevant covariates
- Model for Health Risk Analysis
- Focus covariates known over time at a given
place - Includes site characteristics and the spline
function
36Notes on Covariate Transformations
- Special log-transformation if covariate includes
values of zero -
- log(xij 0.1mean(x))
-
- Normalize to compare posterior means
-
- (xij-mean(x))/SD(x)
37Posterior Means in Models of Reservoir/Lakes
Not Significant Parameters. Other parameters in
Full model Temp. anomaly, log-population,
log-urban land area, soil permeability sediment
export, log residence time, seasonal spline
coefficients.
38Observations on WQP/HRA Reservoir/Lake
- WQP
- Geometric mean 1 oocyst per 100 liters
- Log-turbidity, Carbonate hardness, Tot. Org.
Carbon have large positive influence - Regions explain high proportion of variance
- HRA
- Model HRA-1 has lowest sts but no sig. param.
- Negative reservoir residence time parameter
indicates smaller reservoirs have higher
concentrations - Regions contribute smaller proportion of variance
39Posterior Means in Models of Streams
Not Significant Parameters. Other parameters in
Full model Total Organic Carbon, Temp. anomaly,
log-population, soil permeability, sediment
export, hydrologic variables. Seasonal Spline
Coefficients are not significant in WQP model
40Seasonal Adjustment
41Observations on WQP/HRA Streams
- WQP
- Log-turbidity, Carbonate Hardness, Urban Land
Area - Seasonal spline coefficients not significant
- Regions not important in model
- HRA
- Seasonal spline captures some of effect of
turbidity in WQP model with lowest concentrations
in Winter - Regional variation is larger than in WQP models
42Summary of Results
43Issues in Site Definition Reservoirs
44Issues in Site Definition
- This research defines a site for each water body
where data was collected - Some water bodies supply several ICR treatment
plants - An alternative index defines sites by treatment
plant
45Issues in Site Definition Streams
46Observations on Site Definition
- Variance increases with ICR-defined sites
- Larger increase in variance for reservoir sites
than for stream sites - Carbonate hardness (Stream model) change due to
specific relationship with treatment plants from
same source
47Issues in ICR / ICRSS Datasets
- ICRSS includes a sample of sites also in ICR and
sites not represented in ICR - ICRSS includes 12 months of data ICR includes
(up to) 18 months - Concentrations modeled with dataset-specific
recovery rates - Interaction terms labeled ss
- Total effect from ICRSS
- ICRSSTICRSS avg (ICRSS-seasonal effects )
48ICR-ICRSS Dataset Differences Reservoirs
49Seasonal Adjustment (ICRSS)
50ICR-ICRSS Dataset Differences Streams
51Seasonal Adjustment (ICRSS)
52Observations on ICR-ICRSS Data
- ICRSS predicted concentrations
- Higher than ICR overall
- Large seasonal effect in March/April
- Log-turbidity has smaller effect on
concentrations modeled in ICRSS but difference is
not significant
53Summary of Results
- Significant covariates include
- Log-Turbidity, Carbonate Hardness, Total Organic
Carbon (Resv.), Log-urban land area (Stream) - Larger concentrations
- Streams
- High turbidity, carbonate hardness,
- Spring season
54Summary of Results (contd)
- Variance components are quite large
- Covariates add modest improvement in prediction
- Source-water sites definition induces smaller
variance - ICRSS suggests higher concentrations
55Future Research Areas
- Develop model assessment approaches
- Explore alternative model for ICR recovery rate
- Develop model for missing values in significant
covariates - Apply model to new datasets
56(No Transcript)