Title: Model checks for complex hierarchical models
1. Model checks for complex hierarchical models
- Alex Lewin and Sylvia Richardson
- Imperial College
- Centre for Biostatistics
2. Background and Aims
- Many complex models used in bioinformatics
- Classification/clustering can be greatly affected by choice of distributions
- Our approach: exploit the structure of the model to perform predictive checks
  - hierarchical models generally involve exchangeability assumptions
  - mixture models are partially exchangeable
3. Outline of Talk
- Mixture model for gene expression data
- Model checks for mixture model
  - distribution for gene-specific variances
  - different mixture priors
- Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)
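4. Hierarchical mixture model for gene expression data
$w \sim \mathrm{Dirichlet}(1,\dots,1)$, various priors for $\delta_g$, $\sigma_g$
$\delta_g \sim \sum_j w_j h_j(\theta_j)$, $\sigma_g^2 \mid \mu,\tau \sim f(\mu,\tau)$
$y_{gr} \mid \delta_g, \sigma_g \sim N(\delta_g, \sigma_g^2)$
$g$ = gene, $r$ = replicate, $j$ = mixture component, $\theta_j$ = parameters of component $j$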
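To make the hierarchy above concrete, here is a minimal simulation sketch. It assumes one particular set of choices that the slide leaves open: a three-component mixture with negative Gamma, point mass at zero, and Gamma components (all with shape 1.5 and an arbitrary scale of 1), and a Gamma distribution for the gene variances. The function name `simulate_expression` and every numeric value are illustrative assumptions, not the authors' settings.

```python
import random

def simulate_expression(n_genes, n_reps, seed=0):
    """Simulate y[g][r] from a 3-component hierarchical mixture (illustrative values only)."""
    rng = random.Random(seed)
    # w ~ Dirichlet(1, 1, 1): normalise three independent Exponential(1) draws
    raw = [rng.expovariate(1.0) for _ in range(3)]
    w = [x / sum(raw) for x in raw]
    y, z, delta = [], [], []
    for _ in range(n_genes):
        # z_g: 0 = under-expressed, 1 = null, 2 = over-expressed
        j = rng.choices([0, 1, 2], weights=w)[0]
        if j == 0:
            d = -rng.gammavariate(1.5, 1.0)   # negative Gamma (shape 1.5, assumed scale 1)
        elif j == 1:
            d = 0.0                           # point mass at zero
        else:
            d = rng.gammavariate(1.5, 1.0)    # Gamma (shape 1.5, assumed scale 1)
        # gene-specific variance: assumed Gamma(shape 2, scale 0.25) as a stand-in for f(mu, tau)
        sigma2 = rng.gammavariate(2.0, 0.25)
        y.append([rng.gauss(d, sigma2 ** 0.5) for _ in range(n_reps)])
        z.append(j)
        delta.append(d)
    return w, y, z, delta
```

Calling `simulate_expression(1000, 4)` returns the mixture weights, the replicate data, the true component labels and the true effects, which is the kind of simulated data set against which predictive checks can be validated.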
5. Mixture model for gene expression data
- Many mixture models have been proposed for gene expression data
- Set-up is similar to variable selection prior: point mass + alternative distribution
- Particular choices for alternative:
  - Normal (Lönnstedt and Speed)
  - Uniform (Parmigiani et al)
  - many others
6. Mixture model for gene expression data
Allow for asymmetry in over- and under-expressed genes ⇒ 3-component mixture model:
$\delta_g \sim w_1 h_1(\theta_1) + w_2 h_2(\theta_2) + w_3 h_3(\theta_3)$
Data: 6 knock-out and 5 wildtype mice, MAS5.0 processed
7. Mixture model for gene expression data
Classify each gene into mixture components using
posterior probabilities
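As an illustration of classification by posterior probabilities, the sketch below estimates $P(z_g = j \mid \text{data})$ as the fraction of MCMC iterations in which gene $g$ is allocated to component $j$, then assigns each gene to its most probable component. The input layout `z_draws[t][g]` and both function names are assumptions made for this example; the slide does not specify how allocations are stored or which decision rule was used.

```python
from collections import Counter

def posterior_class_probs(z_draws, n_components=3):
    """z_draws[t][g] = component (0-based) of gene g at MCMC iteration t."""
    n_iter = len(z_draws)
    n_genes = len(z_draws[0])
    probs = []
    for g in range(n_genes):
        counts = Counter(draw[g] for draw in z_draws)
        # Monte Carlo estimate of P(z_g = j | data) for each component j
        probs.append([counts[j] / n_iter for j in range(n_components)])
    return probs

def classify(probs):
    """Assign each gene to the component with the largest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]
```

For example, with four iterations and two genes, `posterior_class_probs([[0, 1], [0, 2], [1, 2], [0, 2]])` gives `[[0.75, 0.25, 0.0], [0.0, 0.25, 0.75]]`, and `classify` then returns `[0, 2]`.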
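8. Choice of mixture prior affects classification results
Mixture prior for $\delta_g$ | Est. $w_2$ (in null)
$w_1\,\mathrm{Unif}(-\theta_-, 0) + w_2\,\delta(0) + w_3\,\mathrm{Unif}(0, \theta_+)$ | 0.96
$w_1\,\mathrm{Gam}^-(1.5, \theta_-) + w_2\,\delta(0) + w_3\,\mathrm{Gam}(1.5, \theta_+)$ | 0.68
$w_1\,\mathrm{Gam}^-(1.5, \theta_-) + w_2\,N(0, \varepsilon) + w_3\,\mathrm{Gam}(1.5, \theta_+)$ | 0.99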
9. Outline of Talk
- Mixture model for gene expression data
- Model checks for mixture model
  - distribution for gene-specific variances
  - different mixture priors
- Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)
10. Predictive model checks
- Predict new data from the model
- Use posterior predictive distribution
- Condition on hyperparameters (mixed predictive ⇒ not very conservative)
- Get Bayesian p-value for each gene/marker/sample
- Use all p-values together (100s or 1000s) to assess model fit
- Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003
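One way to use all the p-values together is to compare their empirical distribution with Uniform(0,1). The sketch below is an assumed summary, not something stated on the slides: it bins the p-values for a histogram and computes the Kolmogorov-Smirnov distance to the uniform distribution. Both function names are hypothetical.

```python
def histogram(pvals, n_bins=10):
    """Counts of p-values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pvals:
        # p == 1.0 is placed in the last bin
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

def ks_uniform(pvals):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(pvals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # compare the uniform CDF (x itself) with the ECDF just after and just before x
        d = max(d, (i + 1) / n - x, x - i / n)
    return d
```

A flat histogram and a small distance suggest the model distribution fits. P-values piled up near 0.5 point to a conservative check, while piles near 0 or 1 point to a distribution that is too narrow or skewed relative to the data.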
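11. Checking distribution for gene variances
Bayesian p-value for gene $g$: $p_g = \mathrm{Prob}(S^{\mathrm{mpred}} > S_g^{\mathrm{obs}} \mid \text{data})$, where $S_g^{\mathrm{obs}}$ is the observed sample variance for gene $g$ and $S^{\mathrm{mpred}}$ is its mixed predictive counterpart
All genes are exchangeable ⇒ histogram of p-values for all genes together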
12. Mixed v. posterior predictive
- Predictive p-values for data simulated from the model
- Histograms should be Uniform
- Mixed predictive distribution much less conservative than posterior predictive
[Figure: histograms of predictive p-values. Panels: "Using global distribution"; "Using gene-specific distributions"]
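The two predictive distributions being compared can be written out as follows. This is a sketch in the notation of the variance model, following the usual definitions of posterior and mixed predictive checks (as in Marshall and Spiegelhalter 2003); reading "gene-specific distributions" as the posterior predictive and "global distribution" as the mixed predictive is an interpretation of the panel titles, and $p(S \mid \sigma^2)$ denotes the sampling distribution of the test statistic.

```latex
\begin{align*}
\text{Posterior predictive:}\quad
  & \sigma_g^2 \sim p(\sigma_g^2 \mid \text{data}), \qquad
    S_g^{\mathrm{pred}} \sim p(S \mid \sigma_g^2) \\
\text{Mixed predictive:}\quad
  & (\mu,\tau) \sim p(\mu,\tau \mid \text{data}), \qquad
    \sigma^{2,\mathrm{pred}} \sim f(\mu,\tau), \qquad
    S^{\mathrm{mpred}} \sim p(S \mid \sigma^{2,\mathrm{pred}})
\end{align*}
```

In the posterior predictive case the prediction for gene $g$ reuses a parameter that was fitted to that gene's own data, which pulls the p-values towards 0.5. The mixed predictive draws a fresh gene-level parameter from the shared distribution, so each gene's own data enters only through the hyperparameters.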
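13. Checking different variance models
Model differential expression between 3 transgenic and 3 wildtype mice
Variance models compared:
- $\sigma_g^2 \mid \mu,\tau \sim \mathrm{Gam}(\mu,\tau)$, $\mu$ fixed
- $\sigma_g^2 = \sigma^2$ for all genes
- $\sigma_g^2 \mid \mu,\tau \sim \mathrm{Gam}(\mu,\tau)$
- $\sigma_g^2 \mid \mu,\tau \sim \mathrm{logNorm}(\mu,\tau)$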
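14. Implementation (MCMC)
- $p_g = 0$
- for $t = 1,\dots,n_{\mathrm{iter}}$:
  - $\sigma_t^{\mathrm{pred}} \sim f(\mu_t, \tau_t)$
  - $S_t^{\mathrm{mpred}} \sim \mathrm{Gam}(m,\ m\,(\sigma_t^{\mathrm{pred}})^{-2})$
  - $p_g \leftarrow p_g + I[S_t^{\mathrm{mpred}} > S_g^{\mathrm{obs}}]$
- $p_g \leftarrow p_g / n_{\mathrm{iter}}$
$n_{\mathrm{iter}}$ = no. MCMC iterations; $m$ = (no. replicates − 1)/2
Just two extra parameters predicted at each iteration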
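The loop above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: it takes posterior draws of the hyperparameters as given (a list `hyper_draws` of `(mu_t, tau_t)` pairs), takes $f$ to be a Gamma distribution for the variance with shape $\mu$ and rate $\tau$, and reads $\mathrm{Gam}(m, \cdot)$ as shape and rate. The function name is hypothetical.

```python
import random

def mixed_predictive_pvalues(s_obs, hyper_draws, n_reps, seed=0):
    """Mixed predictive p-values for the observed sample variances s_obs[g]."""
    rng = random.Random(seed)
    m = (n_reps - 1) / 2.0
    counts = [0] * len(s_obs)
    for mu_t, tau_t in hyper_draws:
        # Predict a new variance from the global distribution.
        # ASSUMPTION: f is Gamma(shape mu, rate tau); gammavariate takes (shape, scale).
        sigma2_pred = rng.gammavariate(mu_t, 1.0 / tau_t)
        # Predict a sample variance: Gamma(shape m, rate m / sigma2_pred)
        s_pred = rng.gammavariate(m, sigma2_pred / m)
        # The same two predicted quantities are compared with every gene
        for g, s in enumerate(s_obs):
            if s_pred > s:
                counts[g] += 1
    return [c / len(hyper_draws) for c in counts]
```

Because one predicted variance and one predicted sample variance are shared by all genes at each iteration, the extra cost on top of fitting the model is two draws per iteration, as the slide notes. In a real analysis the `(mu_t, tau_t)` pairs would come from the fitted model's MCMC output.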
15. Outline of Talk
- Mixture model for gene expression data
- Model checks for mixture model
  - distribution for gene-specific variances
  - different mixture priors
- Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)
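16. Checking mixture prior
$\delta_g \sim w_1 h_1(\theta_1) + w_2 h_2(\theta_2) + w_3 h_3(\theta_3)$
OR
$\delta_g \mid \theta, z_g = j \sim h_j(\theta_j)$, $j = 1,\dots,3$, with $P(z_g = j) = w_j$
Model checking: focus on separate mixture components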
17. Issues for mixture model checking
- $\delta_g \mid \theta, z_g = j \sim h_j(\theta_j)$, $j = 1,\dots,3$
- Think about MCMC iterations:
  - Mixture component is estimated from genes currently assigned to that component
  - Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on $z_g$ in p-value)
  - So check each component using only the genes currently assigned (i.e. condition on $z_g$ in histogram)
18. Predictive checks for mixture model
Bayesian p-value for gene $g$ and mix. component $j$: $p_{gj} = \mathrm{Prob}(\bar{y}_{gj}^{\mathrm{mpred}} > \bar{y}_g^{\mathrm{obs}} \mid \text{data}, z_g = j)$
- Genes assigned to the same mix. component are exchangeable
- ⇒ histogram of p-values for each mix. component separately
- histogram for component $j$ made only from genes with large $P(z_g = j)$
19. Condition on classification to check separate components
Predictive p-values for data simulated from the model
[Figure: histograms of p-values. Panels: "All genes with $P(z_g = j) > 0$"; "Only genes with $P(z_g = j) > 0.5$"]
Effectively we condition on a best classification
20. Checking different mixture distributions
$w_1\,\mathrm{Unif}(-\theta_-, 0) + w_2\,\delta(0) + w_3\,\mathrm{Unif}(0, \theta_+)$
- Outer mix. components skewed too much away from zero
- Null component too narrow
21. Checking different mixture distributions
$w_1\,\mathrm{Gam}^-(1.5, \theta_-) + w_2\,\delta(0) + w_3\,\mathrm{Gam}(1.5, \theta_+)$
- Outer components skewed in the opposite direction
- Null still too narrow?
22. Checking different mixture distributions
$w_1\,\mathrm{Gam}^-(1.5, \theta_-) + w_2\,N(0, \varepsilon) + w_3\,\mathrm{Gam}(1.5, \theta_+)$
- Better fit for all components
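23. Implementation
- $p_{gj} = 0$
- for $t = 1,\dots,n_{\mathrm{iter}}$:
  - $\delta_{jt}^{\mathrm{pred}} \sim h_{jt}(\theta_{jt})$, $j = 1,\dots,3$
  - $\bar{y}_{gt}^{\mathrm{mpred}} \sim N(\delta_{jt}^{\mathrm{pred}},\ \sigma_g^2 / n_{\mathrm{rep}})$ for $j = z_{gt}$
  - $p_{gj} \leftarrow p_{gj} + I[\bar{y}_{gt}^{\mathrm{mpred}} > \bar{y}_g^{\mathrm{obs}}]$ for $j = z_{gt}$
- $p_{gj} \leftarrow p_{gj} / n_{\mathrm{iter}}(z_g = j)$
Need $n_{\mathrm{genes}}$ extra parameters at each iteration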
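The component-wise loop above can be sketched as follows. The sketch assumes the sampler's output has already been collected into a list of per-iteration dictionaries holding the predicted effects `delta_pred[j]` (drawn from each component's distribution), the current allocations `z[g]` and the current variances `sigma2[g]`. That layout and the function names are hypothetical, chosen only for this example.

```python
import random

def component_pvalues(ybar_obs, iterations, n_reps, n_components=3, seed=0):
    """p[g][j]: predictive p-value for gene g, using only iterations with z_g = j."""
    rng = random.Random(seed)
    n_genes = len(ybar_obs)
    exceed = [[0] * n_components for _ in range(n_genes)]
    visits = [[0] * n_components for _ in range(n_genes)]
    for it in iterations:
        for g in range(n_genes):
            j = it["z"][g]                              # current allocation of gene g
            sd = (it["sigma2"][g] / n_reps) ** 0.5      # sd of the mean of n_reps replicates
            ybar_pred = rng.gauss(it["delta_pred"][j], sd)
            visits[g][j] += 1
            if ybar_pred > ybar_obs[g]:
                exceed[g][j] += 1
    n_iter = len(iterations)
    # Divide by the number of iterations with z_g = j; undefined (None) if never visited
    pvals = [[exceed[g][j] / visits[g][j] if visits[g][j] else None
              for j in range(n_components)] for g in range(n_genes)]
    post = [[visits[g][j] / n_iter for j in range(n_components)]
            for g in range(n_genes)]
    return pvals, post

def pvalues_for_histogram(pvals, post, j, threshold=0.5):
    """P-values for component j, restricted to genes with P(z_g = j) above threshold."""
    return [pvals[g][j] for g in range(len(pvals)) if post[g][j] > threshold]
```

One predicted mean is drawn per gene per iteration, which matches the count of extra parameters noted on the slide. The helper `pvalues_for_histogram` applies the rule of building each component's histogram only from genes with a large allocation probability, with 0.5 as an assumed default threshold.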
24. Summary of model checking procedure
- Find part of model where individuals are assumed to be exchangeable (so information is shared)
- Choose test statistic $T$ (e.g. sample mean or variance)
- Predict $T^{\mathrm{pred}}$ from distribution for exchangeable individuals (whole posterior for $T^{\mathrm{pred}}$)
- Compare observed $T_i$ for each individual $i$ to distribution of $T^{\mathrm{pred}}$
- For checking mixture components, condition on the best classification
25. Outline of Talk
- Mixture model for gene expression data
- Model checks for mixture model
  - distribution for gene-specific variances
  - different mixture priors
- Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)
26. Clustering and variable selection (Tadesse et al. 2005)
- $y_i$ = vector of gene expression for each sample $i = 1,\dots,n$
- Multivariate mixture model for clustering samples
  - $y_i \mid z_i = j \sim \mathrm{MVN}(\mu_j, \Sigma_j)$, $j = 1,\dots,J$
  - $P(z_i = j) = w_j$
- No. of mix. components ($J$) is estimated in the model
- Aim to select genes which are informative for clustering the samples
27. Clustering and variable selection (Tadesse et al. 2005)
Likelihood conditional on allocation to mixture
$\gamma$ = vector of indices of selected variables
$\gamma^c$ = vector of indices of variables not used to cluster samples
Conjugate priors on multivariate means and covariance matrices
$P(\gamma_g = 1) = \varphi$
$i$ = sample, $g$ = gene, $j$ = mix. component
28. Clustering and variable selection (Tadesse et al. 2005)
Model checking: want to check the distribution for each mixture component separately (conditional on $J$).
In addition, need to condition on a given variable selection.
Clearly impossible computationally.
$i$ = sample, $g$ = gene, $j$ = mix. component
29. Computing predictive p-values
- Run model with no prediction
- Find the best configuration:
  - set of selected variables $(\gamma)$
  - no. mixture components $J$
  - allocation of samples to mixture components $z_i$
- Re-run model, with $(\gamma)$, $J$ and $z_i$ fixed, calculating predictive p-values
$p_{ij} = \mathrm{Prob}(T_j^{\mathrm{pred}} > T_i^{\mathrm{obs}} \mid \text{data}, z_i = j, J, (\gamma))$, where $T = y^2$ (for example)
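The slide does not say how the "best configuration" is chosen. As one possible reading, the sketch below picks the most frequently visited pair of $J$ and selected-variable set from the first run, then takes the modal allocation of each sample among the iterations that visited that pair. The input layout (a list of `(J, gamma, z)` tuples) and the function name are assumptions, and label switching between iterations is ignored.

```python
from collections import Counter

def best_configuration(samples):
    """samples: list of (J, gamma, z) from the first MCMC run; gamma and z are tuples."""
    # ASSUMPTION: "best" means the most frequently visited (J, gamma) pair
    (J, gamma), _ = Counter((s[0], s[1]) for s in samples).most_common(1)[0]
    # Keep only the allocations from iterations that visited that pair
    kept = [s[2] for s in samples if s[0] == J and s[1] == gamma]
    n_samples = len(kept[0])
    # Modal allocation of each sample (no relabelling of components is attempted)
    z = tuple(Counter(zs[i] for zs in kept).most_common(1)[0][0]
              for i in range(n_samples))
    return J, gamma, z
```

The returned `J`, `gamma` and `z` would then be held fixed in the second run, in which the predictive p-values are accumulated.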
30. Conclusions
- Choice of model distributions can greatly influence results of clustering and classification
- For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation
- Should be possible to do this even for quite complex models (if you can fit the model, you can check it)
31. Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant: Natalia Bochkina, Clare Marshall, Peter Green
Meeting on model checking in Cambridge: David Spiegelhalter, Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/