Title: b
1Bayesian Modelling for Differential Gene
Expression
Alex Lewin (Imperial College) Sylvia Richardson
(IC Epidemiology) Tim Aitman (IC Microarray
Centre) In collaboration with Anne-Mette Hein,
Natalia Bochkina (IC Epidemiology) Helen Causton
(IC Microarray Centre) Peter Green (Bristol)
2Insulin-resistance gene Cd36
cDNA microarray hybridisation signal for SHR
much lower than for Brown Norway and SHR.4
control strains Aitman et al 1999, Nature
Genet 2176-83
3Larger microarray experiment look for other
genes associated with Cd36
Microarray Data 3 SHR compared with 3 transgenic
rats (with Cd36) 3 wildtype (normal) mice
compared with 3 mice with Cd36 knocked out ?
12000 genes on each array Biological
Question Find genes which are expressed
differently between animals with and without Cd36.
4- Bayesian Hierarchical Model for Differential
Expression - Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and
differential expression - Gene Ontology analysis for differentially
expressed genes
5Microarray analysis is amulti-step process
We aim to integrate all the steps in a common
statistical framework
6Bayesian Modelling Framework
- Model different sources of variability
simultaneously, - within array, between array
- Uncertainty propagated from data to parameter
estimates (so not over-optimistic in
conclusions). - Share information in appropriate ways to get
robust estimates.
7Gene Expression Data
3 wildtype mice, Fat tissue hybridised to
Affymetrix chips
Newton et al. 2001 Showed data fit well by Gamma
or Log Normal distributions Kerr et al.
2000 Linear model on log scale
8Bayesian hierarchical model for differential
expression
- Data ygsr log expression for gene g, condition
s, replicate r - ?g gene effect
- dg differential effect for gene g between 2
conditions - ?r(g)s array effect (expression-level
dependent) - ?gs2 gene variance
- 1st level
- yg1r ?g, dg, ?g1 ? N(?g ½ dg ?r(g)1 ,
?g12), - yg2r ?g, dg, ?g2 ? N(?g ½ dg ?r(g)2 ,
?g22), - Sr ?r(g)s 0
- ?r(g)s function of ?g , parameters a
and b
9Priors for gene effects
- Mean effect ?g
- ?g Unif (much wider than data range)
- Differential effect dg
- dg N(0,104) fixed effects (no structure
in prior) - OR mixture
- dg p0d0 p1G_ (1.5, ?1) p2G (1.5, ?2)
10References
- Fixed Effects
- Kerr et al. 2000
- Mixture Models
- Newton et al. 2004 (non-parametric mixture)
- Löenstedt and Speed 2003, Smyth 2004
- (conjugate mixture prior)
- Broet et al. 2002 (several levels of DE)
11Prior for gene variances
- Two extreme cases
- (1) Constant variance ?gsr ? N(0, ?2)
- Too stringent Poor fit
- (2) Independent variances ?gsr ? N(0, ?g2)
- ! Variance estimates based on few replications
are highly variable -
- Need to share information between genes to
- better estimate their variance, while allowing
- some variability
- Hierarchical
model
12Prior for gene variances
- 2nd level
- ?gs2 µs, ts ? logNormal (µs, ts)
- Hyper-parameters µs and ts can be influential.
- Empirical Bayes
- Eg. Löenstedt and Speed 2003, Smyth 2004
- Fixes µs , ts
- Fully Bayesian
- 3rd level
- µs ? N( c, d) ts ? Gamma (e, f)
13Gene specific variances are stabilised
- Variances estimated using information from all G
x R measurements (12000 x 3) rather than just 3 - Variances stabilised and shrunk towards average
variance
14Prior for array effects (Normalization)
- Spline Curve
- ?r(g)s quadratic in ?g for ars(k-1) ?g
ars(k) - with coeff (brsk(1), brsk(2) ), k 1,
breakpoints
Locations of break points not fixed Must do
sensitivity checks on break points
15Array effect as a function of gene effect
16Effect of normalisation on density
Before (ygsr)
After (ygsr- ?r(g)s )
17Bayesian hierarchical model for differential
expression
- 1st level
- ygsr ?g, dg, ?gs ? N(?g ½ dg ?r(g)s ,
?gs2), -
- 2nd level
- Fixed effect priors for ?g, dg
- Array effect coefficients, Normal and Uniform
- ?gs2 µs, ts ? logNormal (µs, ts)
- 3rd level
- µs ? N( c, d)
- ts ? Gamma (e, f)
18WinBUGS software for fitting Bayesian models
WinBUGS does the calculations
for( i in 1 ngenes ) for( j in 1 nreps)
y1i, j dnorm(x1i, j, tau1i)
x1i, j lt- alphai - 0.5deltai
beta1i, j for( i
in 1 ngenes ) tau1i lt- 1.0/sig21i
sig21i lt- exp(lsig21i) lsig21i
dnorm(mm1,tt1) mm1 dnorm( 0.0,1.0E-3) tt1
dgamma(0.01,0.01)
19WinBUGS software for fitting Bayesian models
- Whole posterior distribution
- Posterior means, medians, quantiles
20- Bayesian Hierarchical Model for Differential
Expression - Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and
differential expression - Gene Ontology analysis for differentially
expressed genes
21Decision Rules for Inference
- So far, discussed fitting the model.
- How do we decide which genes are differentially
expressed? - Parameters of interest ?g , dg , ?g
- What quantity do we consider, dg , (dg /?g) , ?
- How do we summarize the posterior distribution?
22Fixed Effects Model
- Inference on d
- (1) dg E(dg data) posterior mean
- Like point estimate of log fold change.
- Decision Rule gene g is DE if dg gt dcut
- (2) pg P( dg gt dcut data)
- posterior probability (incorporates
uncertainty) - Decision Rule gene g is DE if pg gt pcut
- This allows biologist to specify what size of
effect is interesting (not just statistical
significance)
23Fixed Effects Model
- Inference on d, ?
- (1) tg E(dg data) / E(?g data)
- Like t-statistic.
- Decision Rule gene g is DE if tg gt tcut
- (2) pg P( dg /?g gt tcut data)
- Decision Rule gene g is DE if pg gt pcut
-
- Bochkina and Richardson (in preparation)
24Mixture Model
- dg p0d0 p1G_ (1.5, ?1) p2G (1.5, ?2)
(1) dg E(dg data) posterior mean Shrunk
estimate of log fold change. Decision Rule
gene g is DE if dg gt dcut (2) Classify genes
into the mixture components. pg P(gene g not
in H0 data) Decision Rule gene g is DE if pg
gt pcut
25Illustration of decision rule
pg P( dg gt log(2) and ?g gt 4 data) x
pg gt 0.8 ? t-statistic gt 2.78 (95 CI)
26- Bayesian Hierarchical Model for Differential
Expression - Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and
differential expression - Gene Ontology analysis for differentially
expressed genes
27Bayesian P-values
- Compare observed data to a null distribution
- P-value probability of an observation from the
null distribution being more extreme than the
actual observation - If all observations come from the null
distribution, the distribution of p-values is
Uniform
28Cross-validation p-values
Idea of cross validation is to split the data
one part for fitting the model, the rest for
validation n units of observation For each
observation yi, run model on rest of data y-i,
predict new data yinew from posterior
distribution.
Bayesian p-value pi Prob(yinew gt yi data y-i)
Distribution of p-values pi, i1,,n is
approximately Uniform if model adequately
describes the data.
29Posterior Predictive p-values
For large n, not possible to run model n
times. Run model on all data. For each
observation yi, predict new data yinew from
posterior distribution.
Bayesian p-value pi Prob(yinew gt yi all data)
all data includes yi p-values are less extreme
than they should be p-values are
conservative (not quite Uniform).
30Example Check priors on gene variances
- Compare equal and exchangeable variance models
- Compare different exchangeable priors
Want to compare data for each gene, not gene and
replicate, so use sample variance Sg2 (suppress
index s here)
Bayesian p-value Prob( Sg2 new gt Sg2 obs data)
31WinBUGS code for posterior predictive checks
replicate relevant sampling distribution calculat
e sample variances count no. times predicted
sample variance is bigger than observed sample
variance
for( i in 1 ngenes ) for( j in 1 nreps)
y1i, j dnorm(x1i, j, tau1i)
ynew1i, j dnorm(x1i, j, tau1i)
x1i, j lt- alphai - 0.5deltai
beta1i, j s21i
lt- pow(sd(y1i, ), 2) s2new1i lt-
pow(sd(ynew1i, ), 2) pval1i lt-
step(s2new1i - s21i)
32Posterior predictive
Graph shows structure of model
33Less conservative than posterior
predictive (Marshall and Spiegelhalter, 2003)
Mixed predictive
34Four models for gene variances
Equal variance model Model 1 ?2 ? log Normal
(0, 10000) Exchangeable variance models Model
2 ?g-2 ? Gamma (2, ß) Model 3 ?g-2 ? Gamma
(a, ß) Model 4 ?g2 ? log Normal (µ, t) (a, ß,
µ, t all parameters)
35Bayesian predictive p-values
36- Bayesian Hierarchical Model for Differential
Expression - Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and
differential expression - Gene Ontology analysis for differentially
expressed genes
37Expression level dependent normalization
Many gene expression data sets need normalization
which depends on expression level. Usually
normalization is performed in a pre-processing
step before the model for differential expression
is used. These analyses ignore the fact that the
expression level is measured with
variability. Ignoring this variability leads to
bias in the function used for normalization.
38Simulated Data
Gene variances similar range and distribution to
mouse data Array effects cubic functions of
expression level Differential effects 900
genes dg 0 50 genes dg ? N( log(3),
0.12) 50 genes dg ? N( -log(3), 0.12)
39Array Effects and Variability for Simulated Data
40Two-step method (using loess)
- Use loess smoothing to obtain array effects
?loessr(g)s - Subtract loess array effects from data
yloessgsr ygsr - ?loessr(g)s - Run our model on yloessgsr with no array effects
41Decision rules for selecting differentially
expressed genes
If P( dg gt dcut data) gt pcut then gene g is
called differentially expressed. dcut chosen
according to biological hypothesis of interest
(here we use log(3) ). pcut corresponds to the
error rate (e.g. False Discovery Rate or
Mis-classification Penalty) considered acceptable.
42Full model v. two-step method
Plot observed False Discovery Rate against pcut
(averaged over 5 simulations) Solid line for
full model Dashed line for pre-normalized method
43Different two-step methods
- yloessgsr ygsr - ?loessr(g)s
- ymodelgsr ygsr - E(?r(g)s data)
- Results from 2 different two-step methods are
much closer to each other than to full model
results.
44- Bayesian Hierarchical Model for Differential
Expression - Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and
differential expression - Gene Ontology analysis for differentially
expressed genes
45Gene Ontology (GO)
Database of biological terms Arranged in graph
connecting related terms Directed Acyclic Graph
links indicate more specific terms 16,000 terms
from QuickGO website (EBI)
46Gene Ontology (GO)
from QuickGO website (EBI)
47Gene Annotations
- Genes/proteins annotated to relevant GO terms
- Gene may be annotated to several GO terms
- GO term may have 1000s of genes annotated to it
(or none) - Gene annotated to term A ? annotated to all
ancestors of A (terms that are related and more
general)
48GO annotations of genes associated with the
insulin-resistance gene Cd36
Compare GO annotations of genes most and least
differentially expressed Most differentially
expressed ? pg gt 0.5 (280 genes) Least
differentially expressed ? pg lt 0.2 (11171
genes)
49GO annotations of genes associated with the
insulin-resistance gene Cd36
For each GO term, Fishers exact test on
proportion of differentially expressed genes
with annotations v. proportion of
non-differentially expressed genes with
annotations observed O A expected E
C(AB)/(CD) if no association of GO annotation
with DE FatiGO website http//fatigo.bioinfo.cnio
.es/
50GO annotations of genes associated with the
insulin-resistance gene Cd36
O observed no. differentially expressed genes E
expected no. differentially expressed genes
51All GO ancestors of Inflammatory response
This term was not accessed by FatiGO Relations
between GO terms were found using
QuickGO http//www.ebi.ac.uk/ego/
52Further Work to do on GO
- Account for dependencies between GO terms
- Multiple testing corrections
- Uncertainty in annotation
- ( work in preparation )
53Summary
- Bayesian hierarchical model flexible, estimates
variances robustly - Predictive model checks show exchangeable prior
good for gene variances - Useful to find GO terms over-represented in the
most differentially-expressed genes - Paper available (Lewin et al. 2005, Biometrics,
in press) - http //www.bgx.org.uk/
54(No Transcript)
55Decision Rules
- In full Bayesian framework, introduce latent
allocation variable zg 0,1 for gene g in null,
alternative - For each gene, calculate posterior probability of
belonging to unmodified component pg Pr( zg
0 data ) - Classify using cut-off on pg (Bayes rule
corresponds to 0.5) - For any given pg , can estimate FDR, FNR.
For gene-list S, est. (FDR data) Sg ? S pg
/ S
56The Null Hypothesis
Composite Null Point Null, alternative not
modelled Point Null, alternative modelled