Title: b
1Bayesian hierarchical modelling of gene
expression data
Sylvia Richardson Centre for Biostatistics Imperia
l College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Marys) Helen Causton
and Tim Aitman (Hammersmith) Graeme Ambler and
Peter Green (Bristol) Philippe Broët (INSERM,
Paris) BBSRC Exploiting Genomics grant
2Outline
- Hierarchical modelling framework
- A Bayesian gene expression index
- Modelling differential expression
- False discovery rate and mixture models
3Introduction
- Gene expression is a hierarchical process
- Substantive question
- Experimental design
- Sample preparation
- Array design manufacture
- Gene expression matrix
- Probe level data
- Image level data
Interesting variability (signal)
Obscuring variability (noise)
- Interest in using statistical framework capable
to handle multiple sources of variability
coherently
Bayesian statistics
4Bayesian hierarchical model framework
- Has the flexibility to model various sources of
variability between probes, gene specific,
within array, between array, - Building of all these features into a common
model - Avoids the need to use systematically a plug-in
approach - uncertainty is propagated
- Borrow strength / share out information
according to principle - Allows some model checking
5Gene expression analysis is amulti-step process
Low-level Model (how is the measured expression
related to the signal)
Multi-arrays processing (how to make appropriate
combined inference)
We build all these steps in a common statistical
framework
Differential Expression
Clustering Partition Model
6Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Gene and condition BGX index
Gene and condition BGX index
Differential expression parameter
7 A fully Bayesian Gene eXpression index for
Affymetrix GeneChip arraysAnne Mette HeinSR,
Helen Causton, Graeme Ambler, Peter Green
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe)
Gene index BGX
8Single array model Motivation
Key observations
Conclusions
- PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)
MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed
- Spread of PMs increase with level
- Considerable variability in PM (and MM) response
within a probe set
Varying reliability in gene expression estimation
for different genes
Estimate gene expression measure from PMs and MMs
on log scale
- Probe effects approximately additive on
log-scale
9Model assumptions and key biological parameters
- The intensity for the PM measurement for probe
(reporter) j and gene g is due to binding - of labelled fragments that perfectly match
the oligos in the spot - The true Signal Sgj
- of labelled fragments that do not
- perfectly match these
oligos - The non-specific hybridisation Hgj
- The intensity of the corresponding MM measurement
is caused - by a binding fraction F of the true signal Sgj
- by non-specific hybridisation Hgj
10BGX single array modelg1,,G (thousands),
j1,,J (11-20)
11Implementation
- In WinBugs for ease of model development
- and C for efficiency
- Joint estimation of parameters in full Bayesian
framework - Base inference on posterior distribution
- of all unknown quantities, Sgj, Hgj ,
- qg Median of TN(mg, ? g2), .
- and use appropriate summaries
12Single array model performanceData set
varying concentrations (geneLogic)
- 14 samples of cRNA from acute myeloid leukemia
(AML) tumor cell line - In sample k each of 11 genes spiked in at
concentration ck - sample k 1 2 3
4 5 6 7 8 9 10 11
12 13 14 - conc. ck(pM) 0.0 0.5 0.75 1.0
1.5 2.0 3.0 5.0 12.5 25 50 75
100 150 - Each sample hybridised to an array
Consider subset consisting of 500 normal genes
11 spike-ins
13Single array model performanceOne array four
genes spiked in at concentration 5.0
Log Sgj
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
o log(PM-MM)
posterior distributions reflect variability
BGX index
14Single array model performance signal and
expression index 10 arrays gene 1 spiked-in at
increasing concentrations
true signal/ expression index BGX increases
with concentration
log(Hgj1)
as previously
o log(PM-MM)
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
15Single array model performance non-specific
hybridization 10 arrays gene 1 spiked-in at
increasing concentrations
Signals Signals/cross
Non-specific hybridization does not increase
with concentration
2.5-97.5 credibility interval
log(Hgj1)
TN(medPost(mg),medPost(? g2))
16Single array model performance11 genes spiked
in at 13 (increasing) concentrations
Comparison with other expression measures
BGX index qg increases with concentration ..
except for gene 7 (spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
172.5 97.5 credibility intervals for the
Bayesian expression index
11 spike-in genes at 13 different concentration
(data set A)
Note how the variability is substantially larger
for low expression level
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
18What variability is captured?
- For some genes, there is considerable discrepancy
between the information given by the different
probes - Posterior becomes flat or bimodal
- Hard to summarise by a single number
- Less reproducibility of point estimates of
expression level - Model improvement -- stratify F by CG
- content ? -- less weight to the MM in some
cases? more robust summary of index
distribution or heavy tail distributions?
19Single array modelexamples of posterior
distributions of BGX expression indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,Jg (at 0 if not defined)
Mean - 1SD
20Differential expression and array effectsAlex
Lewin SR, Natalia Bochkina, Anne Glazier, Tim
Aitman
21Data Set and Biological question
Previous Work (Tim Aitman, Anne Marie
Glazier) The spontaneously hypertensive rat
(SHR) A model of human insulin resistance
syndromes. Deficiency in gene Cd36 found to be
associated with insulin resistance in
SHR Following this, several animal models were
developed where other relevant genes are knocked
out comparison between knocked out and
wildtype (normal) mice or rats.
22Data Set and Biological question
Microarray Data Data set A (MAS 5) (? 12000
genes on each array) 3 SHR compared with 3
transgenic rats Data set B (RMA) (? 22700 genes
on each array) 8 wildtype (normal) mice compared
with 8 knocked out mice Biological
Question Find genes which are expressed
differently in wildtype and knockout / transgenic
mice
23Condition 2
Condition 1
PM MM
Gene specific error term
Gene specific error term
Hierarchical model of replicate Variability and
array effect
Hierarchical model of replicate Variability and
array effect
Posterior distribution (flat prior)
Differential expression parameter
Mixture modelling for classification
24Model for Differential Expression
- Expression-level-dependent normalisation
- Only few replicates per gene, so share
information between genes to estimate variability
of gene expression between the replicates - To select interesting genes
- Use posterior distribution of quantities of
interest, function of, ranks . - Use mixture prior on the differential expression
parameter
25Bayesian hierarchical model for replicate
expression data (under one condition)
- Data ygr log gene expression for gene g,
replicate r - (for the present, ygr is treated as known data)
- ?g gene effect
- ?r( ) array effect (possibly expression-level
dependent) - ?g2 gene specific variance
- 1st level
- ygr ? N(?g ?r(?g), ?g2), Sr ?r (?g) 0
- ?r( ) smooth function of ?g
Piecewise polynomial with unknown break points
26Exploratory analysis of array effect
Needs normalisation Spline curves shown
27Hierarchical structure for gene specific
parameters
- 2nd level
- Priors for ?g (flat) , coefficients and break
points - Sr ? (?g) 0 constraint imposed
- ?g2 ? lognormal (µ, t)
- Hyper-parameters µ and t can be influential.
- In a full Bayesian analysis, these are not fixed
- 3rd level
- µ ? N( c, d) t ? lognormal (e, f)
28Smoothing of the gene specific variances
- Variances are estimated using information from
all G x R measurements (12000 x 3) rather than
just 3 - Variances are stabilised and shrunk towards
average variance
29Bayesian Model Checking
- Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ? - Predict sample variance Sg2 new (a chosen
checking function) from the model specification
(not using the data for this) - Compare predicted Sg2 new with observed Sg2 obs
- Bayesian p-value Prob( Sg2 new gt Sg2 obs )
- Distribution of p-values approx Uniform if model
is true (Marshall and Spiegelhalter, 2003) - Easily implemented in MCMC algorithm
30Data set A
31 Differential expression model
The quantity of interest is the difference
between conditions for each gene dg , g 1,
,N Joint model for the 2 conditions yg1r ?g
- ½ dg ?1r(?g) ?g1r , r 1, R1 yg2r
?g ½ dg ?2r(?g) ?g2r , r 1, R2
- ?g is now the overall gene effect over the
conditions - The parameter of interest dg is given a flat
prior (for now) - Same assumptions for the distribution of s2gs
- Modelling of ?sr(?g) as before, s 1, 2 , sum
to zero constraint imposed within each condition
32Possible Statistics for Differential Expression
dg log fold change dg dg / (s2 g1 / R1
s2 g2 / R2 )½ (standardised difference)
- We obtain the posterior distribution of all dg
and/or dg - Can compute directly posterior probability of
genes satisfying criterion X of interest - pg,X Prob( g of interest Criterion X,
data) - Can compute the distributions of ranks
33Data set A 3 wildtype mice compared to 3
knockout mice (U74A chip) Mas5
Criterion X
Gene is of interest if log fold change gt
log(2) and log (overall expression) gt 4
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 280 pg,X gt 0.8
(red) 46
The majority of the genes have very small pg,X
90 of genes have pg,X lt 0.2
pg,X 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
34Experiment 8 wildtype mice compared to 8
knockout mice RMA
Gene is of interest if log fold change gt log
(1.5)
Criterion X
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 292 pg,X gt 0.8
(red) 139
The majority of the genes have very small pg,X
97 of genes have pg,X lt 0.2
35Posterior probabilities and log fold change
Data set A 3 replicates MAS5
Data set B 8 replicates RMA
36Credibility intervals for ranks
Data set B
100 genes with lowest rank (most under/ over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
37Using the posterior distribution of dg
(standardised difference)
- Compute
- Probability ( dg gt 2 data)
- Bayesian analogue of a t test !
- Order genes
- Select genes such that
- Probability ( dg gt 2 data) gt cut-off
38Volcano plots
Bayesian T test
(Bayesian estimate)
For illustration, cut-offs lines drawn at 0.95
39Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Distribution of expression index for gene g ,
condition 1
Distribution of expression index for gene g ,
condition 2
Distribution of differential expression parameter
40BGX Multiple array model conditions c1,,C,
replicates r 1,,Rc
PMgjcr ? N( Sgjcr Hgjcr , tcr2) MMgjcr?
N(FSgjcr Hgjcr , tcr2)
Gene and condition specific BGX
qgcmedian(TN(µgc, ? gc2)) Pools
information over replicate probe sets
j 1,J, r 1,,Rc
41Posterior distributions of BGXSingle array vs
multiple array analyses
Three replicate arrays analysed together
(multiple array model)
Mean - 1SD
Three replicate arrays analysed separately
42Subset of AffyU133A spike-in data set(AffyComp)
- Consider
- Six arrays, 1154 genes (every 20th and 42
spike-ins) - Same cRNA hybridised to all arrays EXCEPT for
spike-ins
1 2 3 12
13 14 Spike-in genes 1-3
4-6 7-9 34-36 37-39 40-42 Spike-in
conc (pM) Condition 1 (array 1-3) 0.0
0.25 0.50 128 256 512 Condition 2
(array 4-6) 0.25 0.50 1.00 256
512 0.00 Fold change
- 2 2 2 2
-
43M v A plots
A (1/2)(exprg,1exprg,2), M (exprg,1-exprg,2)
NB! Point estimates used MAS5 and RMA exprgc
mean over three replicates BGX Multiple array
index
True fold changes Black zero Red 2
44BGX measure of uncertainty providedPosterior
mean - 1SD credibility intervals
diffgbgxg,1- bgxg,2
Spike in 1113 -1154 above the blue line
Blue stars show RMA measure
45Mixture and Bayesian estimation of false
discovery ratesNatalia Bochkina, Alex Lewin
SR, Philippe Broët
46Multiple Testing Problem
- Gene lists can be built by computing separately a
criteria for each gene and ranking - Thousands of genes are considered simultaneously
- How to assess the performance of such lists ?
Statistical Challenge Select interesting genes
without including too many false positives in a
gene list
A gene is a false positive if it is included in
the list when it is truly unmodified under the
experimental set up
Want an evaluation of the expected false
discovery rate (FDR)
47Bayesian Estimate of FDR
- Step 1 Choose a gene specific parameter (e.g.
dg ) or a gene statistic (see later) - Step 2 Model its prior (resp marginal)
distribution using a mixture model - -- with one component that models the
unaffected genes (null hypothesis) e.g. point
mass at 0 for dg - -- other components that model (flexibly) the
alternative - Step 3 Calculate the posterior probability for
any gene of belonging to the unmodified component
pg0 data - Step 4 Evaluate FDR (and FNR) for any list
- Assuming that all the gene classification are
independent
Bayes FDR (list) data 1/card(list) Sg ? list
pg0
48Mixture prior
- To obtain a gene list, a commonly used method
- (cf Lonnstedt Speed 2002, Newton 2003, Smyth
2003, ) is to define a mixture prior for dg - H0 dg 0 point mass at 0 with probability p0
- H1 dg flexible 2-sided distribution to model
differential expression - Classify each gene following its posterior
probabilities of not being in the null 1- pg0 - Use Bayes rule or fix the FDR
49Classification with mixture prior
- Joint estimation of all the mixture parameters
(including p0) avoids plugging-in of values
(e.g. p0) that are influential on the
classification - Sensitivity to prior settings of the alternative
distribution and performance has been tested on
simulated data sets - Work in progress
- Poster by Natalia Bochkina
50Performance of the mixture prior
- yg1r ?g - ½ dg ?g1r , r 1, R1
- yg2r ?g ½ dg ?g2r , r 1, R2
- (For simplification, we assume that the data has
been pre normalised) - s2g IG(a, b)
- dg p0d0 p1G (1.5, ?1) p2G (1.5, ?2)
- H0 H1
- Dirichlet distribution for (p0, p1, p2)
- Exponential hyper prior for ?1 and ?2
51Simulated data
Plot of the true differences
ygr N(dg , s2g) (8 replicates) s2g IG(1.5,
0.05) dg (-1)Bern(0.5) G(2,2), g1200 dg 0,
g2011000
Choice of simulation parameters inspired by
estimates found in analyses of biological data
sets
52Posterior estimates of fold change using mixture
model
53Comparison of mixture classification and
posterior probabilities for the standardised
differences
Post Prob (g ? H1)
10 6 False positive
In red, 200 genes with dg ? 0
31 4 False negative
Probability ( dg gt 2 data)
54Bayes rule
FDR (black) FNR (blue) as a function of 1- pg0
Post Prob (g ? H1) 1- pg0
55Using mixtures for modelling the marginal
distribution of gene statistics
- Instead of modelling the prior for dg as a
mixture, an alternative is - To summarise differential expression by a gene
statistic - To model is marginal distribution as a mixture
such that the distribution is approximately known
under H0 and use a flexible distribution for the
alternative
56Mixture modelling of transformed F statistics
Gene statistic based on classical F statistic
(this was developed to analyse multiclass ( gt 2
conditions) experiments) Gives a de-centred
asymmetric marginal distribution rather than a
two-tailed one Transform F -gt approx. standard
Normal if no change across conditions (H0). Use
a mixture of normals (variable number) for
modelling the alternative (following Richardson
and Green 1997)
57Results for Simulated Data (to detect modified
profile over 3 conditions) Broet, Lewin, SR 2004
Bayes mixture estimate of FDR is close to true
value
Case A well separated null and alternative
hypotheses
Case B less separated null and alternative
hypotheses
For details, see the poster by Alex Lewin
58Marginal mixture performance for the simulated
data (2 conditions, same data as for the prior
mixture)
Number on list as a function of cut-off prob
Expected number of false positive
59Simulated data, comparison of prior and marginal
mixture classification
Good agreement between the 2 approaches
The marginal mixture has more false
positives Transformation to Normality for 2
conditions??
Further comparison in progress
60Summary
- Bayesian gene expression measure (BGX)
- Good range of resolution , provides credibility
intervals - Differential Expression
- Expression-level-dependent normalisation
- Borrow information across genes for variances
- Joint distribution of ranks, gene lists based on
posterior probabilities - False Discovery Rate
- Mixture gives good estimate of FDR and classifies
well - Future work
- Mixture prior on BGX index, with uncertainty
propagated to mixture parameters, comparison of
marginal and prior mixture approaches, clustering
for more general experimental set-ups
61- Papers and technical reports
- Hein AM., Richardson S., Causton H., Ambler G.
and Green P. (2004) - BGX a fully Bayesian gene expression index for
Affymetrix GeneChip data (submitted) - Lewin A., Richardson S., Marshall C., Glazier A.
and Aitman T. (2003) - Bayesian Modelling of Differential Gene
Expression (submitted) - Broët P., Lewin A., Richardson S., Dalmasso C.
and Magdelenat H. (2004) - A mixture model-based strategy for selecting
sets of genes in multiclass response microarray
experiments. (Bioinformatics, advanced access
April 29 2004) - Broët, P., Richardson, S. and Radvanyi, F. (2002)
- Bayesian Hierarchical Model for Identifying
Changes in Gene Expression from Microarray
Experiments , Journal of Computational Biology 9,
671-683. - Available at
- http //www.bgx.org.uk/
Thanks