Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Count Data

1
Modeling Cryptosporidium Concentrations: A
Bayesian GLMM of Regional Count Data
  • Christopher Behr
  • Advisor: Prof. Jery Stedinger
  • Prof. David Ruppert (ORIE)
  • Ciprian Crainiceanu (Statistics)

2
Outline
  • Background on Cryptosporidium parvum
  • Issues in Cryptosporidium Data
  • Model Formulation
  • Model Computation and Parameterization
  • Analyses of Cryptosporidium concentrations
  • Water Quality Prediction / Health Risk Analysis
    (ICR)
  • Alternative Structure of Variance Components
    (ICR)
  • ICR and ICRSS datasets
  • Conclusions

3
Background on Cryptosporidium
  • Waterborne pathogenic protozoa
  • Found in surface waters (65-97% of all in U.S.)
  • Difficult to kill
  • Can survive more than 100 days in environment
  • Resistant to chlorination: found in 8-27% of
    finished water supplies; swimming pools

4
Concern about Cryptosporidium
  • C. parvum causes mild to serious infections
  • Outbreaks: largest in Milwaukee (1994)
  • Potentially high endemic levels
  • 2,000-3,000 reported cases/yr between 1995-97
  • Unreported cases estimated between 0.2% and 2% of
    population for industrialized countries
  • i.e., roughly 0.5 to 5 million cases/year in U.S.
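
The step from the 0.2-2% range to the national case count is simple arithmetic. A minimal sketch follows; the population figure of 260 million is an assumption for the mid-1990s U.S. and does not appear in the slides.

```python
# Assumed U.S. population in the mid-1990s (an assumption, not from the slides).
US_POPULATION = 260e6

def annual_cases(fraction_infected, population=US_POPULATION):
    """Cases per year implied by an annual infected fraction of the population."""
    return fraction_infected * population

low = annual_cases(0.002)   # 0.2% of the population
high = annual_cases(0.02)   # 2% of the population
print(f"{low / 1e6:.2f} to {high / 1e6:.1f} million cases/year")
```

Under that assumed population the range comes out at roughly half a million to five million cases per year, in line with the slide.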

5
EPA Regulatory Status
  • New Surface Water Treatment Rule being considered
    for Cryptosporidium control
  • Data: Information Collection Rule (ICR) and ICR
    Supplementary Survey (ICRSS)
  • Survey of drinking water supply quality
  • Samples collected at utility intakes
  • Spiking studies evaluate testing method
    effectiveness
  • Need appropriate statistical models

6
ICR Datasets
7
Issues in Cryptosporidium Data: Variability in
Cryptosporidium Measurement
  • Cryptosporidium testing methods
  • Immunofluorescence assay method (IFA)
  • Immunomagnetic separation (Method 1623)
  • Difficult to detect: highly variable recovery
    rates
  • Low expected recovery rates
  • IFA ≈ 10%, Method 1623 ≈ 40%
  • High variability (in CV)
  • IFA ≈ 100%, Method 1623 ≈ 50%
  • Recovery rate may be modeled as beta-distributed
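
A beta distribution for the recovery rate can be matched to a stated mean and coefficient of variation by the method of moments. The sketch below takes the slide's figures to be mean recoveries of about 10% (IFA) and 40% (Method 1623) with CVs of about 100% and 50%; that reading of the transcript, and the function name `beta_from_mean_cv`, are assumptions rather than the author's procedure.

```python
def beta_from_mean_cv(mean, cv):
    """Method-of-moments Beta(a, b) parameters for a given mean and CV."""
    var = (cv * mean) ** 2
    if not 0.0 < mean < 1.0 or var >= mean * (1.0 - mean):
        raise ValueError("no Beta distribution has this mean and CV")
    total = mean * (1.0 - mean) / var - 1.0   # a + b
    return mean * total, (1.0 - mean) * total

# Assumed reading of the slide: IFA mean ~10%, CV ~100%; Method 1623 mean ~40%, CV ~50%.
a_ifa, b_ifa = beta_from_mean_cv(0.10, 1.00)
a_1623, b_1623 = beta_from_mean_cv(0.40, 0.50)
print(a_ifa, b_ifa, a_1623, b_1623)
```

With these inputs the IFA recovery is close to Beta(0.8, 7.2) and the Method 1623 recovery close to Beta(2, 3); the first has most of its mass near zero, which is one way to see why IFA counts are so often zero.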

8
Oocyst Recovery: IFA Method (Walker, 1995)
[Flow diagram of the IFA recovery steps. Labels: Raw Water (V), Fiber Filter, Stomacher, Suspended Solution, Centrifuge, Pellet of Solids, Fraction (F) of Pellet, Suspended Solution, Centrifuge, Top Layer, Separated Liquid, Acetate Membrane, Dye added, Slides]
Oocysts counted DISCRETELY!
9
Issues in Cryptosporidium Data: Variability in
Cryptosporidium Measurement
  • Cryptosporidium testing methods
  • Immunofluorescence assay method (ICR)
  • Immunomagnetic separation (Method 1623)
  • Difficult to detect: highly variable recovery
    rates
  • Low expected recovery rates
  • ICR ≈ 10%, Method 1623 ≈ 40%
  • High variability (in CV)
  • ICR ≈ 100%, Method 1623 ≈ 50%
  • Recovery rate may be modeled as beta-distributed

10
Issues in Cryptosporidium Data: Low Counts
Across Many Sites
  • Many zero counts (93% in ICR, 85% in ICRSS)
  • Many sites have only zero counts

11
Modeling Implications
  • Linear regression is inappropriate
  • No information in so many zero counts
  • We cannot assume normal errors
  • Information at most sites is insufficient to
    estimate concentrations
  • Want to combine information at sites
  • Account for correlation within and between sites
  • Together: Generalized Linear Mixed Model

Use a Hierarchical Poisson Model
Introduce random effects to create a mixed model
12
Generalized Linear Mixed Model Example
  • Hierarchical Poisson-lognormal structure
  • Log link function
  • Random effects captured by t, s

13
Bayesian Pathogen Concentration Model
  • Model Elements
  • Y_ij: pathogen counts
  • C_ij: pathogen conc.
  • v_ij: volume of water
  • R_ij: recovery rate
  • X_ij: predictor matrix
  • random effects
  • t_ij: time-site effects
  • s_j: site effects
  • r_k(j): regional effects

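Hierarchical Model:
  $Y_{ij} \sim \mathrm{Poisson}(v_{ij} C_{ij} R_{ij})$
  $\log C_{ij} = X_{ij}^T \theta + t_{ij}$
  $R_{ij} \sim \mathrm{Beta}(a, b)$
where
  $t_{ij} \sim N(s_j, \sigma_t^2)$
  $s_j \sim N(r_{k(j)}, \sigma_s^2)$
  $r_{k(j)} \sim N(\mu, \sigma_r^2)$
Diffuse prior distributions:
  coefficients ($\theta$): Normal
  variance components ($\sigma^2$): Inverse Gamma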
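
To make the hierarchy concrete, the following sketch simulates counts from the model forward, from regions down to individual samples. All numerical values (number of regions, standard deviations, the Beta parameters, a 100-liter sample volume, a geometric mean near 0.01 oocysts per liter) are illustrative assumptions, and `simulate_counts` is a hypothetical helper, not code from the presentation.

```python
import numpy as np

def simulate_counts(rng, n_regions=3, sites_per_region=4, n_times=6,
                    theta=(0.5,), mu=np.log(0.01),
                    sd_r=0.5, sd_s=0.7, sd_t=1.0, a=0.8, b=7.2, volume=100.0):
    """Draw counts Y[i, j] (time i, site j) from the hierarchical model."""
    theta = np.asarray(theta, dtype=float)
    n_sites = n_regions * sites_per_region
    region_of_site = np.repeat(np.arange(n_regions), sites_per_region)  # k(j)
    r = rng.normal(mu, sd_r, size=n_regions)              # r_k ~ N(mu, sd_r^2)
    s = rng.normal(r[region_of_site], sd_s)               # s_j ~ N(r_k(j), sd_s^2)
    t = rng.normal(s, sd_t, size=(n_times, n_sites))      # t_ij ~ N(s_j, sd_t^2)
    X = rng.normal(size=(n_times, n_sites, theta.size))   # made-up centered covariates
    log_C = X @ theta + t                                 # log C_ij = X_ij^T theta + t_ij
    R = rng.beta(a, b, size=(n_times, n_sites))           # R_ij ~ Beta(a, b)
    Y = rng.poisson(volume * np.exp(log_C) * R)           # Y_ij ~ Poisson(v C R)
    return Y, log_C, R

Y, log_C, R = simulate_counts(np.random.default_rng(0))
print("share of zero counts:", (Y == 0).mean())
```

With a low mean concentration and a low recovery rate, most simulated counts are zero, which mirrors the zero-heavy ICR and ICRSS data described earlier.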
14
Bayesian Statistical Approach
  • Frequentist Approach
  • Uses likelihood function $f(y \mid \theta)$ given data y
  • Point estimates of $\theta$ by maximizing likelihood
    (MLE)
  • Bayesian Approach
  • Provide prior distribution(s), $\pi(\theta)$
  • Obtain posterior distribution, $p(\theta \mid y)$
  • where $p(\theta \mid y) \propto f(y \mid \theta)\, \pi(\theta)$
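
The proportionality between posterior, likelihood and prior can be checked numerically on a one-parameter problem. The sketch below uses a Poisson likelihood for a single concentration with a Gamma prior; the counts, volumes and prior parameters are hypothetical, and the grid approximation is only for illustration (the presentation itself relies on MCMC).

```python
import numpy as np

counts = np.array([0, 0, 1, 0, 2])               # hypothetical oocyst counts
volumes = np.array([1.0, 1.0, 1.0, 1.0, 1.0])    # hypothetical sample volumes
a, b = 0.5, 0.1                                  # assumed Gamma(shape a, rate b) prior

theta = np.linspace(1e-6, 20.0, 200_001)         # grid over the concentration
log_prior = (a - 1.0) * np.log(theta) - b * theta
log_lik = counts.sum() * np.log(theta) - volumes.sum() * theta
log_post = log_prior + log_lik                   # log of f(y|theta) * pi(theta), up to a constant
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()                         # normalise on the grid

grid_mean = float((theta * weights).sum())
exact_mean = (a + counts.sum()) / (b + volumes.sum())   # conjugate Gamma posterior mean
print(grid_mean, exact_mean)
```

Because the Gamma prior is conjugate to the Poisson likelihood, the grid-based posterior mean can be compared with the closed-form value; in the full GLMM no such closed form exists.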

15
Justification of Bayesian Approach
  • Posterior of $\theta$ provides more than a point
    estimate
  • Estimating MLE in GLMM requires approximation
    that may induce large biases
  • Prior may be chosen to induce little theoretical
    difference between posterior mean and MLE
  • WinBUGS software offers flexible format

16
Outline
  • Background on Cryptosporidium parvum
  • Issues in Cryptosporidium Data
  • Model Formulation
  • Model Computation and Parameterization
  • Analyses of Cryptosporidium concentrations
  • Water Quality Prediction / Health Risk Analysis
    (ICR)
  • Alternative Structure of Variance Components
    (ICR)
  • ICR and ICRSS datasets
  • Conclusions

17
Bayesian Computation
  • Want posterior conditional on fixed effects only
  • Random effects must be integrated out
  • With 100 sites and 15 observations per site,
    this is analytically intractable
  • Instead use Markov Chain Monte Carlo methods,
    such as Gibbs Sampling

18
Gibbs Sampling Method
  • Gibbs Sampling is one type of Markov Chain Monte
    Carlo method
  • General idea to obtain $p(\theta \mid y)$:
  • Start with initial values
  • Iteratively sample values from $p(\theta \mid y)$
  • Over many iterations obtain $p(\theta \mid y)$ empirically

19
Gibbs Sampling Algorithm
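  • Each iteration i, Gibbs Sampler (GS) generates
  • $\theta_1^{(i)} \sim p(\theta_1 \mid \theta_2^{(i-1)}, \ldots, \theta_d^{(i-1)}, y)$
  • $\theta_d^{(i)} \sim p(\theta_d \mid \theta_1^{(i)}, \ldots, \theta_{d-1}^{(i)}, y)$
  • Since $(\theta_1^{(i)}, \ldots, \theta_d^{(i)}) \sim p(\theta \mid y)$ as $i \to \infty$,
  • After T iterations, the posterior mean
  • $\bar{\theta}_j \approx m_j = \left(\sum_i \theta_j^{(i)}\right) / T$, with $E(m_j) = \bar{\theta}_j$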
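
The update scheme above can be illustrated on a toy target where the full conditionals are known exactly. The sketch below runs a Gibbs sampler on a standard bivariate normal with correlation 0.5; the target and the burn-in length are assumptions chosen for illustration, not the Cryptosporidium model.

```python
import numpy as np

def gibbs_bivariate_normal(n_iter, rho, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    draws = np.empty((n_iter, 2))
    theta1, theta2 = start
    cond_sd = np.sqrt(1.0 - rho ** 2)
    for i in range(n_iter):
        theta1 = rng.normal(rho * theta2, cond_sd)   # theta1 | theta2 from iteration i-1
        theta2 = rng.normal(rho * theta1, cond_sd)   # theta2 | theta1 from iteration i
        draws[i] = theta1, theta2
    return draws

draws = gibbs_bivariate_normal(20_000, rho=0.5, rng=np.random.default_rng(1))
kept = draws[1_000:]                 # discard an assumed burn-in of 1,000 iterations
m = kept.mean(axis=0)                # m_j = sum_i theta_j^(i) / T
sample_corr = np.corrcoef(kept.T)[0, 1]
print(m, sample_corr)
```

Each coordinate is drawn from its conditional given the most recent value of the other, and the averages of the retained draws estimate the means of the target distribution.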

20
Evaluation of Posterior Means
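  • Let M be the estimator of the posterior mean for
    $\theta$
  • With T iterations
  • $\mathrm{Var}(M) = (\text{MC Error})^2 = v\, \sigma_M^2 / T$, where $v = 1 + 2 \sum_k \rho_k$
  • We approximate $\sum_k \rho_k$ by [formula not captured in transcript]
  • where for ARMA(1,1) model $\rho_k = \rho_1 \phi^{k-1}$
  • estimate $\rho_{16}$ and $\phi$: $\ln(\rho_k) = \ln(\rho_{16}) + (k - 16) \ln(\phi)$
  • for $k = 16, \ldots, 49$
  • Effective sample size (ESS) $= T / v$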
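
A minimal sketch of the ESS calculation follows. It keeps the definition ESS = T / v with v = 1 + 2 Σ ρ_k, but swaps in a simpler rule for the sum: sample autocorrelations are added up until the first non-positive one, instead of the ARMA(1,1) tail fit over lags 16 to 49 described on the slide. The AR(1) test chain and the function names are assumptions for illustration.

```python
import numpy as np

def autocorr(chain, max_lag):
    """Sample autocorrelations rho_1..rho_max_lag of a 1-D chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def effective_sample_size(chain, max_lag=200):
    """ESS = T / v with v = 1 + 2 * sum(rho_k), truncated at the first rho_k <= 0."""
    rho = autocorr(chain, max_lag)
    nonpos = np.nonzero(rho <= 0.0)[0]
    if nonpos.size:
        rho = rho[:nonpos[0]]
    v = 1.0 + 2.0 * rho.sum()
    return len(chain) / v

def ar1_chain(n_iter, phi, rng):
    """Stand-in for MCMC output: an AR(1) series with autocorrelation phi."""
    x = np.empty(n_iter)
    x[0] = rng.normal()
    for i in range(1, n_iter):
        x[i] = phi * x[i - 1] + rng.normal()
    return x

chain = ar1_chain(20_000, phi=0.9, rng=np.random.default_rng(0))
ess = effective_sample_size(chain)   # AR(1) theory: T * (1 - phi) / (1 + phi)
print(round(ess))
```

For an AR(1) chain the theoretical value is T(1 - φ)/(1 + φ), so 20,000 strongly correlated draws carry about as much information as roughly a thousand independent ones.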

21
Performance of MCMC
  • Mixing: the pattern over which samples are drawn
    from the joint posterior distribution
  • Good mixing quickly produces samples throughout
    the support of the posterior
  • Apply ESS to evaluate mixing
  • Various parameterizations improve mixing

22
Reparameterizations: Center and Orthogonalize
Covariates (Gilks and Roberts, 1995)
  • Centering
  • Model: $y_i = \mu + x_i \beta \;\Rightarrow\; y_i = \mu' + x_i' \beta$
  • where $x_i' = x_i - \mathrm{mean}(x)$ and
  • $\mu' = \mu + \mathrm{mean}(x)\, \beta$
  • Gram-Schmidt Orthogonalization
  • X: matrix of centered data
  • Determine $X = UA$
  • where A is a triangular matrix and
  • U is an orthogonal basis of the subspace spanned by X
  • Model: $y_i = \mu + X_i^T \beta \;\Rightarrow\; y_i = \mu + U_i^T \gamma$
  • To recover original coefficients: $\beta = A^{-1} \gamma$
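
Both reparameterizations can be tried out on a small regression. The sketch below centers a made-up covariate matrix, obtains X = UA from a QR decomposition (numerically equivalent to Gram-Schmidt, used here as a stand-in), fits in the orthogonal parameterization by least squares, and maps the coefficients back with β = A⁻¹γ. The data and the least-squares fit are assumptions for illustration; the presentation applies the same transformations inside an MCMC sampler.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
mixing = np.array([[1.0, 0.8, 0.5],
                   [0.0, 1.0, 0.7],
                   [0.0, 0.0, 1.0]])
x_raw = rng.normal(size=(n, 3)) @ mixing + 5.0      # correlated, off-centre covariates
beta_true = np.array([0.5, -1.0, 2.0])
y = 1.0 + x_raw @ beta_true + rng.normal(scale=0.1, size=n)

X = x_raw - x_raw.mean(axis=0)        # centering: x_i' = x_i - mean(x)
U, A = np.linalg.qr(X)                # X = U A; U has orthonormal columns, A is upper triangular

# Fit the orthogonal parameterisation y_i = mu' + U_i^T gamma.
design = np.column_stack([np.ones(n), U])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
mu_centered, gamma = coef[0], coef[1:]

beta = np.linalg.solve(A, gamma)                 # beta = A^{-1} gamma
mu = mu_centered - x_raw.mean(axis=0) @ beta     # undo centering: mu = mu' - mean(x) beta
print(mu, beta)
```

The recovered `mu` and `beta` agree with a direct fit on the raw covariates, so the reparameterization changes the geometry seen by the sampler without changing the model.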

23
Reparameterization: Hierarchically Centering
Random Effects (Gelfand et al., 1995)
  • Model: $y_{jk} = \mu + s_j + e_{jk}$, $j = 1, \ldots, m$; $k = 1, \ldots, n$
  • $s_j \sim N(0, \sigma_s^2)$
  • $e_{jk} \sim N(0, \sigma_e^2)$
  • Posterior correlations are high (poor mixing) for
    large n or large $\sigma_s^2 / \sigma_e^2$

24
Posterior Correlation for m sites or regions
(not Hierarchically Centered)
25
Reparameterization: Hierarchically Centering
Random Effects (Gelfand et al., 1995)
  • Model: $y_{jk} = \eta_j + e_{jk}$, $j = 1, \ldots, m$; $k = 1, \ldots, n$
  • $\eta_j = \mu + s_j$
  • $\eta_j \sim N(\mu, \sigma_s^2)$
  • $e_{jk} \sim N(0, \sigma_e^2)$
  • Posterior correlations improved under the same
    conditions (large n or large $\sigma_s^2 / \sigma_e^2$)
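
The effect of hierarchical centering on mixing can be seen in a small experiment. The sketch below runs a Gibbs sampler for μ in the one-way random effects model under both parameterizations and compares the lag-1 autocorrelation of the μ draws. Known variances, a flat prior on μ, and the values m = 10, n = 20 are simplifying assumptions made here for illustration.

```python
import numpy as np

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def gibbs_mu(y, var_s, var_e, n_iter, rng, centered):
    """Gibbs draws of mu for y_jk = mu + s_j + e_jk (known variances, flat prior on mu)."""
    m, n = y.shape
    ybar_j = y.mean(axis=1)
    prec = n / var_e + 1.0 / var_s            # conditional precision of a site-level effect
    mu, draws = 0.0, np.empty(n_iter)
    for i in range(n_iter):
        if centered:
            # eta_j | mu, y   then   mu | eta
            eta = rng.normal((n * ybar_j / var_e + mu / var_s) / prec, np.sqrt(1.0 / prec))
            mu = rng.normal(eta.mean(), np.sqrt(var_s / m))
        else:
            # s_j | mu, y   then   mu | s, y
            s = rng.normal((n / var_e) * (ybar_j - mu) / prec, np.sqrt(1.0 / prec))
            mu = rng.normal((ybar_j - s).mean(), np.sqrt(var_e / (m * n)))
        draws[i] = mu
    return draws

rng = np.random.default_rng(3)
m, n, var_s, var_e = 10, 20, 1.0, 1.0
y = 2.0 + rng.normal(0.0, np.sqrt(var_s), size=(m, 1)) + rng.normal(0.0, np.sqrt(var_e), size=(m, n))

rho_standard = lag1_autocorr(gibbs_mu(y, var_s, var_e, 5000, rng, centered=False))
rho_centered = lag1_autocorr(gibbs_mu(y, var_s, var_e, 5000, rng, centered=True))
print(rho_standard, rho_centered)
```

With n σ_s² / σ_e² = 20, successive μ draws are strongly autocorrelated in the standard parameterization and nearly independent in the centered one, which is the pattern the two slides describe.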

26
Effect of Hierarchical Centering on Posterior
Correlation for m sites or regions
27
Hierarchical Centering in ICR Model
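Hierarchical Model:
  $Y_{ij} \sim \mathrm{Poisson}(v_{ij} C_{ij} R_{ij})$
  $\log C_{ij} = X_{ij}^T \theta + t_{ij}$
  $R_{ij} \sim \mathrm{Beta}(a, b)$
where
  $t_{ij} \sim N(s_j, \sigma_t^2)$
  $s_j \sim N(r_{k(j)}, \sigma_s^2)$
  $r_{k(j)} \sim N(\mu, \sigma_r^2)$
Diffuse prior distributions:
  coefficients ($\theta$): Normal
  variance components ($\sigma^2$): Inverse Gamma
Standard Model:
  $Y_{ij} \sim \mathrm{Poisson}(v_{ij} C_{ij} R_{ij})$
  $\log C_{ij} = X_{ij}^T \theta + t_{ij} + s_j + r_{k(j)} + \mu$
  $R_{ij} \sim \mathrm{Beta}(a, b)$
where
  $t_{ij} \sim N(0, \sigma_t^2)$
  $s_j \sim N(0, \sigma_s^2)$
  $r_{k(j)} \sim N(0, \sigma_r^2)$
Diffuse prior distributions:
  coefficients ($\theta$): Normal
  variance components ($\sigma^2$): Inverse Gamma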
28
Parameterization Comparison
  • Effective Sample Sizes after 10,000 iterations
  • Reparameterizations: C = Centered,
    O = Orthogonalized, HC = Hierarchically centered

29
Key Findings
  • Hierarchical centering of random effects and
    centering data dramatically increase MCMC
    efficiency.
  • Orthogonalization offers modest improvements over
    centering data for some variables.
  • Best performance with hierarchically centered
    random effects and with centered and/or
    orthogonalized fixed effects.

30
Outline
  • Background on Cryptosporidium parvum
  • Issues in Cryptosporidium Data
  • Model Formulation
  • Model Computation and Parameterization
  • Analyses of Cryptosporidium concentrations
  • Water Quality Prediction / Health Risk Analysis
    (ICR)
  • Alternative Structure of Variance Components
    (ICR)
  • Water Quality Prediction (ICR and ICRSS)
  • Conclusions

31
Model Covariates
  • Time-site Covariates
  • Log-turbidity, Carbonate hardness, Total organic
    carbon
  • Hydrologic Variables (stream sites only)
  • Seasonal Effects
  • Spline Function chart
  • Temperature Anomaly
  • Site-Specific Covariates
  • Urban land area, sediment export potential,
    population
  • Log-Residence Time (reservoir/lake sites only)

32
Spline Function for Seasonal Effect
  • Consists of
  • 4 basis functions $h_n(\cdot)$
  • $\sum_n h_n(d_i) = \text{Constant}$
  • $d_i$: sampling date
  • Model estimates
  • $\beta_n$ for $n = 1, 2, 3$
  • $\beta_4 = 0$ is model constraint
  • Seasonal Effect
  • $\sum_n \beta_n h_n(d_i)$
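
The slides do not say which spline basis was used, so the sketch below substitutes a simple one with the stated properties: four periodic piecewise-linear "hat" functions over the year whose sum is constant. Because the four functions sum to a constant, they are collinear with the model intercept, which is why one coefficient has to be fixed (here β_4 = 0). The knot positions and function names are assumptions.

```python
import numpy as np

KNOTS = np.array([0.0, 0.25, 0.5, 0.75])   # assumed knot positions (fractions of a year)

def seasonal_basis(day_of_year, period=365.25):
    """Four periodic 'hat' basis functions h_n(d); the four values sum to 1."""
    phase = (np.asarray(day_of_year, dtype=float) / period) % 1.0
    dist = np.abs(phase[..., None] - KNOTS)
    dist = np.minimum(dist, 1.0 - dist)             # circular distance to each knot
    return np.clip(1.0 - dist / 0.25, 0.0, None)    # linear decay over one knot spacing

def seasonal_effect(day_of_year, beta123):
    """sum_n beta_n h_n(d) with the constraint beta_4 = 0."""
    beta = np.append(np.asarray(beta123, dtype=float), 0.0)
    return seasonal_basis(day_of_year) @ beta

days = np.arange(0, 366)
H = seasonal_basis(days)
effect = seasonal_effect(days, [1.0, 2.0, 3.0])   # illustrative coefficients
print(H.sum(axis=1).min(), H.sum(axis=1).max())
```

A smoother basis (for example periodic cubic B-splines) would follow the same pattern: evaluate the basis at each sampling date and append the columns for n = 1, 2, 3 to the covariate matrix.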

33
Model Covariates
  • Time-site Covariates
  • Log-turbidity, Carbonate Hardness, Total Organic
    Carbon
  • Hydrologic Variables (stream sites only)
  • Seasonal Effects
  • Spline Function chart
  • Temperature Anomaly
  • Site-Specific Covariates
  • Urban land area, sediment export potential,
    population
  • Log-Residence Time (reservoir/lake sites only)

34
Modeling Objectives of Pathogen Concentrations
  • Water Quality Prediction (WQP)
  • Treatment plant-focused
  • Identify readily-measured indicators of
    concentration levels
  • Estimate parameters for prediction at all times
  • Health Risk Analysis (HRA)
  • EPA-focused
  • Estimate annual average concentrations
  • Estimate must be applicable over the long-run

35
Implications from Modeling Objectives
  • Water Quality Prediction
  • Focus: covariates that vary over time and place
  • Includes all relevant covariates
  • Model for Health Risk Analysis
  • Focus: covariates known over time at a given
    place
  • Includes site characteristics and the spline
    function

36
Notes on Covariate Transformations
  • Special log-transformation if covariate includes
    values of zero
  • $\log(x_{ij} + 0.1\, \mathrm{mean}(x))$
  • Normalize to compare posterior means
  • $(x_{ij} - \mathrm{mean}(x)) / \mathrm{SD}(x)$
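
Both transformations are one-liners in practice. The sketch below assumes the special log transform is log(x_ij + 0.1·mean(x)), which is a reading of the transcript, and uses the sample standard deviation for SD(x); the covariate values are made up.

```python
import numpy as np

def shifted_log(x):
    """log(x + 0.1 * mean(x)), which stays finite where x == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x + 0.1 * x.mean())

def standardize(x):
    """(x - mean(x)) / SD(x), putting covariates on a comparable scale."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

flow = np.array([0.0, 2.0, 5.0, 0.0, 13.0])   # hypothetical covariate containing zeros
z = standardize(shifted_log(flow))
print(z)
```

After standardization each coefficient measures the change in log concentration per standard deviation of its covariate, so posterior means can be compared across covariates.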

37
Posterior Means in Models of Reservoir/Lakes
Not Significant Parameters. Other parameters in
Full model: Temp. anomaly, log-population,
log-urban land area, soil permeability, sediment
export, log residence time, seasonal spline
coefficients.
38
Observations on WQP/HRA: Reservoir/Lake
  • WQP
  • Geometric mean 1 oocyst per 100 liters
  • Log-turbidity, Carbonate hardness, Tot. Org.
    Carbon have large positive influence
  • Regions explain high proportion of variance
  • HRA
  • Model HRA-1 has lowest sts but no sig. param.
  • Negative reservoir residence time parameter
    indicates smaller reservoirs have higher
    concentrations
  • Regions contribute smaller proportion of variance

39
Posterior Means in Models of Streams
Not Significant Parameters. Other parameters in
Full model: Total Organic Carbon, Temp. anomaly,
log-population, soil permeability, sediment
export, hydrologic variables. Seasonal Spline
Coefficients are not significant in WQP model.
40
Seasonal Adjustment
41
Observations on WQP/HRA: Streams
  • WQP
  • Log-turbidity, Carbonate Hardness, Urban Land
    Area
  • Seasonal spline coefficients not significant
  • Regions not important in model
  • HRA
  • Seasonal spline captures some of effect of
    turbidity in WQP model with lowest concentrations
    in Winter
  • Regional variation is larger than in WQP models

42
Summary of Results
43
Issues in Site Definition: Reservoirs
44
Issues in Site Definition
  • This research defines a site for each water body
    where data was collected
  • Some water bodies supply several ICR treatment
    plants
  • An alternative index defines sites by treatment
    plant

45
Issues in Site Definition: Streams
46
Observations on Site Definition
  • Variance increases with ICR-defined sites
  • Larger increase in variance for reservoir sites
    than for stream sites
  • Carbonate hardness (Stream model) change due to
    specific relationship with treatment plants from
    same source

47
Issues in ICR / ICRSS Datasets
  • ICRSS includes a sample of sites also in ICR and
    sites not represented in ICR
  • ICRSS includes 12 months of data; ICR includes
    (up to) 18 months
  • Concentrations modeled with dataset-specific
    recovery rates
  • Interaction terms labeled ss
  • Total effect from ICRSS
  • ICRSS_T = ICRSS + avg(ICRSS-seasonal effects)

48
ICR-ICRSS Dataset Differences: Reservoirs
49
Seasonal Adjustment (ICRSS)
50
ICR-ICRSS Dataset Differences: Streams
51
Seasonal Adjustment (ICRSS)
52
Observations on ICR-ICRSS Data
  • ICRSS predicted concentrations
  • Higher than ICR overall
  • Large seasonal effect in March/April
  • Log-turbidity has smaller effect on
    concentrations modeled in ICRSS but difference is
    not significant

53
Summary of Results
  • Significant covariates include
  • Log-Turbidity, Carbonate Hardness, Total Organic
    Carbon (Resv.), Log-urban land area (Stream)
  • Larger concentrations
  • Streams
  • High turbidity, carbonate hardness,
  • Spring season

54
Summary of Results (cont'd)
  • Variance components are quite large
  • Covariates add modest improvement in prediction
  • Source-water site definition induces smaller
    variance
  • ICRSS suggests higher concentrations

55
Future Research Areas
  • Develop model assessment approaches
  • Explore alternative model for ICR recovery rate
  • Develop model for missing values in significant
    covariates
  • Apply model to new datasets
