PPT – b PowerPoint presentation | free to download

About This Presentation

Title:

b

Description:

Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary's) ... Gene is of interest if |log fold change| log(2) and log (overall expression) 4 ... – PowerPoint PPT presentation

Number of Views:44

Avg rating:3.0/5.0

Slides: 62

Provided by: richar509

Category:

Tags: changelog

more less

Transcript and Presenter's Notes

Title: b

1
Bayesian hierarchical modelling of gene
expression data
Sylvia Richardson Centre for Biostatistics Imperia
l College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Marys) Helen Causton
and Tim Aitman (Hammersmith) Graeme Ambler and
Peter Green (Bristol) Philippe Broët (INSERM,
Paris) BBSRC Exploiting Genomics grant
2
Outline

Hierarchical modelling framework
A Bayesian gene expression index
Modelling differential expression
False discovery rate and mixture models

3
Introduction

Gene expression is a hierarchical process
Substantive question
Experimental design
Sample preparation
Array design manufacture
Gene expression matrix
Probe level data
Image level data

Interesting variability (signal)

Obscuring variability (noise)

Interest in using statistical framework capable
to handle multiple sources of variability
coherently

Bayesian statistics
4
Bayesian hierarchical model framework

Has the flexibility to model various sources of
variability between probes, gene specific,
within array, between array,
Building of all these features into a common
model
Avoids the need to use systematically a plug-in
approach
uncertainty is propagated
Borrow strength / share out information
according to principle
Allows some model checking

5
Gene expression analysis is amulti-step process
Low-level Model (how is the measured expression
related to the signal)
Multi-arrays processing (how to make appropriate
combined inference)
We build all these steps in a common statistical
framework
Differential Expression
Clustering Partition Model
6
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Gene and condition BGX index
Gene and condition BGX index
Differential expression parameter
7
A fully Bayesian Gene eXpression index for
Affymetrix GeneChip arraysAnne Mette HeinSR,
Helen Causton, Graeme Ambler, Peter Green
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe)
Gene index BGX
8
Single array model Motivation
Key observations
Conclusions

PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)

MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed

Spread of PMs increase with level

Considerable variability in PM (and MM) response
within a probe set

Varying reliability in gene expression estimation
for different genes
Estimate gene expression measure from PMs and MMs
on log scale

Probe effects approximately additive on
log-scale

9
Model assumptions and key biological parameters

The intensity for the PM measurement for probe
(reporter) j and gene g is due to binding
of labelled fragments that perfectly match
the oligos in the spot
The true Signal Sgj
of labelled fragments that do not
perfectly match these
oligos
The non-specific hybridisation Hgj
The intensity of the corresponding MM measurement
is caused
by a binding fraction F of the true signal Sgj
by non-specific hybridisation Hgj

10
BGX single array modelg1,,G (thousands),
j1,,J (11-20)
11
Implementation

In WinBugs for ease of model development
and C for efficiency
Joint estimation of parameters in full Bayesian
framework
Base inference on posterior distribution
of all unknown quantities, Sgj, Hgj ,
qg Median of TN(mg, ? g2), .
and use appropriate summaries

12
Single array model performanceData set
varying concentrations (geneLogic)

14 samples of cRNA from acute myeloid leukemia
(AML) tumor cell line
In sample k each of 11 genes spiked in at
concentration ck
sample k 1 2 3
4 5 6 7 8 9 10 11
12 13 14
conc. ck(pM) 0.0 0.5 0.75 1.0
1.5 2.0 3.0 5.0 12.5 25 50 75
100 150
Each sample hybridised to an array

Consider subset consisting of 500 normal genes
11 spike-ins
13
Single array model performanceOne array four
genes spiked in at concentration 5.0
Log Sgj
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
o log(PM-MM)
posterior distributions reflect variability
BGX index
14
Single array model performance signal and
expression index 10 arrays gene 1 spiked-in at
increasing concentrations
true signal/ expression index BGX increases
with concentration
log(Hgj1)
as previously
o log(PM-MM)
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
15
Single array model performance non-specific
hybridization 10 arrays gene 1 spiked-in at
increasing concentrations
Signals Signals/cross
Non-specific hybridization does not increase
with concentration
2.5-97.5 credibility interval
log(Hgj1)
TN(medPost(mg),medPost(? g2))
16
Single array model performance11 genes spiked
in at 13 (increasing) concentrations
Comparison with other expression measures
BGX index qg increases with concentration ..
except for gene 7 (spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
17
2.5 97.5 credibility intervals for the
Bayesian expression index
11 spike-in genes at 13 different concentration
(data set A)
Note how the variability is substantially larger
for low expression level
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
18
What variability is captured?

For some genes, there is considerable discrepancy
between the information given by the different
probes
Posterior becomes flat or bimodal
Hard to summarise by a single number
Less reproducibility of point estimates of
expression level
Model improvement -- stratify F by CG
content ? -- less weight to the MM in some
cases? more robust summary of index
distribution or heavy tail distributions?

19
Single array modelexamples of posterior
distributions of BGX expression indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,Jg (at 0 if not defined)
Mean - 1SD
20
Differential expression and array effectsAlex
Lewin SR, Natalia Bochkina, Anne Glazier, Tim
Aitman
21
Data Set and Biological question
Previous Work (Tim Aitman, Anne Marie
Glazier) The spontaneously hypertensive rat
(SHR) A model of human insulin resistance
syndromes. Deficiency in gene Cd36 found to be
associated with insulin resistance in
SHR Following this, several animal models were
developed where other relevant genes are knocked
out comparison between knocked out and
wildtype (normal) mice or rats.
22
Data Set and Biological question
Microarray Data Data set A (MAS 5) (? 12000
genes on each array) 3 SHR compared with 3
transgenic rats Data set B (RMA) (? 22700 genes
on each array) 8 wildtype (normal) mice compared
with 8 knocked out mice Biological
Question Find genes which are expressed
differently in wildtype and knockout / transgenic
mice
23
Condition 2
Condition 1
PM MM
Gene specific error term
Gene specific error term
Hierarchical model of replicate Variability and
array effect
Hierarchical model of replicate Variability and
array effect
Posterior distribution (flat prior)
Differential expression parameter
Mixture modelling for classification
24
Model for Differential Expression

Expression-level-dependent normalisation
Only few replicates per gene, so share
information between genes to estimate variability
of gene expression between the replicates
To select interesting genes
Use posterior distribution of quantities of
interest, function of, ranks .
Use mixture prior on the differential expression
parameter

25
Bayesian hierarchical model for replicate
expression data (under one condition)

Data ygr log gene expression for gene g,
replicate r
(for the present, ygr is treated as known data)
?g gene effect
?r( ) array effect (possibly expression-level
dependent)
?g2 gene specific variance
1st level
ygr ? N(?g ?r(?g), ?g2), Sr ?r (?g) 0
?r( ) smooth function of ?g

Piecewise polynomial with unknown break points
26
Exploratory analysis of array effect
Needs normalisation Spline curves shown

27
Hierarchical structure for gene specific
parameters

2nd level
Priors for ?g (flat) , coefficients and break
points
Sr ? (?g) 0 constraint imposed
?g2 ? lognormal (µ, t)
Hyper-parameters µ and t can be influential.
In a full Bayesian analysis, these are not fixed
3rd level
µ ? N( c, d) t ? lognormal (e, f)

28
Smoothing of the gene specific variances

Variances are estimated using information from
all G x R measurements (12000 x 3) rather than
just 3
Variances are stabilised and shrunk towards
average variance

29
Bayesian Model Checking

Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ?
Predict sample variance Sg2 new (a chosen
checking function) from the model specification
(not using the data for this)
Compare predicted Sg2 new with observed Sg2 obs
Bayesian p-value Prob( Sg2 new gt Sg2 obs )
Distribution of p-values approx Uniform if model
is true (Marshall and Spiegelhalter, 2003)
Easily implemented in MCMC algorithm

30
Data set A
31
Differential expression model
The quantity of interest is the difference
between conditions for each gene dg , g 1,
,N Joint model for the 2 conditions yg1r ?g
- ½ dg ?1r(?g) ?g1r , r 1, R1 yg2r
?g ½ dg ?2r(?g) ?g2r , r 1, R2

?g is now the overall gene effect over the
conditions
The parameter of interest dg is given a flat
prior (for now)
Same assumptions for the distribution of s2gs
Modelling of ?sr(?g) as before, s 1, 2 , sum
to zero constraint imposed within each condition

32
Possible Statistics for Differential Expression
dg log fold change dg dg / (s2 g1 / R1
s2 g2 / R2 )½ (standardised difference)

We obtain the posterior distribution of all dg
and/or dg
Can compute directly posterior probability of
genes satisfying criterion X of interest
pg,X Prob( g of interest Criterion X,
data)
Can compute the distributions of ranks

33
Data set A 3 wildtype mice compared to 3
knockout mice (U74A chip) Mas5
Criterion X
Gene is of interest if log fold change gt
log(2) and log (overall expression) gt 4
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 280 pg,X gt 0.8
(red) 46
The majority of the genes have very small pg,X
90 of genes have pg,X lt 0.2
pg,X 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
34
Experiment 8 wildtype mice compared to 8
knockout mice RMA
Gene is of interest if log fold change gt log
(1.5)
Criterion X
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 292 pg,X gt 0.8
(red) 139
The majority of the genes have very small pg,X
97 of genes have pg,X lt 0.2
35
Posterior probabilities and log fold change
Data set A 3 replicates MAS5
Data set B 8 replicates RMA
36
Credibility intervals for ranks
Data set B
100 genes with lowest rank (most under/ over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
37
Using the posterior distribution of dg
(standardised difference)

Compute
Probability ( dg gt 2 data)
Bayesian analogue of a t test !
Order genes
Select genes such that
Probability ( dg gt 2 data) gt cut-off

38
Volcano plots
Bayesian T test
(Bayesian estimate)
For illustration, cut-offs lines drawn at 0.95
39
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Distribution of expression index for gene g ,
condition 1
Distribution of expression index for gene g ,
condition 2
Distribution of differential expression parameter
40
BGX Multiple array model conditions c1,,C,
replicates r 1,,Rc
PMgjcr ? N( Sgjcr Hgjcr , tcr2) MMgjcr?
N(FSgjcr Hgjcr , tcr2)
Gene and condition specific BGX
qgcmedian(TN(µgc, ? gc2)) Pools
information over replicate probe sets
j 1,J, r 1,,Rc
41
Posterior distributions of BGXSingle array vs
multiple array analyses
Three replicate arrays analysed together
(multiple array model)
Mean - 1SD
Three replicate arrays analysed separately
42
Subset of AffyU133A spike-in data set(AffyComp)

Consider
Six arrays, 1154 genes (every 20th and 42
spike-ins)
Same cRNA hybridised to all arrays EXCEPT for
spike-ins

1 2 3 12
13 14 Spike-in genes 1-3
4-6 7-9 34-36 37-39 40-42 Spike-in
conc (pM) Condition 1 (array 1-3) 0.0
0.25 0.50 128 256 512 Condition 2
(array 4-6) 0.25 0.50 1.00 256
512 0.00 Fold change
- 2 2 2 2
-
43
M v A plots
A (1/2)(exprg,1exprg,2), M (exprg,1-exprg,2)
NB! Point estimates used MAS5 and RMA exprgc
mean over three replicates BGX Multiple array
index
True fold changes Black zero Red 2
44
BGX measure of uncertainty providedPosterior
mean - 1SD credibility intervals
diffgbgxg,1- bgxg,2

Spike in 1113 -1154 above the blue line
Blue stars show RMA measure
45
Mixture and Bayesian estimation of false
discovery ratesNatalia Bochkina, Alex Lewin
SR, Philippe Broët
46
Multiple Testing Problem

Gene lists can be built by computing separately a
criteria for each gene and ranking
Thousands of genes are considered simultaneously
How to assess the performance of such lists ?

Statistical Challenge Select interesting genes
without including too many false positives in a
gene list
A gene is a false positive if it is included in
the list when it is truly unmodified under the
experimental set up
Want an evaluation of the expected false
discovery rate (FDR)
47
Bayesian Estimate of FDR

Step 1 Choose a gene specific parameter (e.g.
dg ) or a gene statistic (see later)
Step 2 Model its prior (resp marginal)
distribution using a mixture model
-- with one component that models the
unaffected genes (null hypothesis) e.g. point
mass at 0 for dg
-- other components that model (flexibly) the
alternative
Step 3 Calculate the posterior probability for
any gene of belonging to the unmodified component
pg0 data
Step 4 Evaluate FDR (and FNR) for any list
Assuming that all the gene classification are
independent

Bayes FDR (list) data 1/card(list) Sg ? list
pg0
48
Mixture prior

To obtain a gene list, a commonly used method
(cf Lonnstedt Speed 2002, Newton 2003, Smyth
2003, ) is to define a mixture prior for dg
H0 dg 0 point mass at 0 with probability p0
H1 dg flexible 2-sided distribution to model
differential expression
Classify each gene following its posterior
probabilities of not being in the null 1- pg0
Use Bayes rule or fix the FDR

49
Classification with mixture prior

Joint estimation of all the mixture parameters
(including p0) avoids plugging-in of values
(e.g. p0) that are influential on the
classification
Sensitivity to prior settings of the alternative
distribution and performance has been tested on
simulated data sets
Work in progress
Poster by Natalia Bochkina

50
Performance of the mixture prior

yg1r ?g - ½ dg ?g1r , r 1, R1
yg2r ?g ½ dg ?g2r , r 1, R2
(For simplification, we assume that the data has
been pre normalised)
s2g IG(a, b)
dg p0d0 p1G (1.5, ?1) p2G (1.5, ?2)
H0 H1
Dirichlet distribution for (p0, p1, p2)
Exponential hyper prior for ?1 and ?2

51
Simulated data
Plot of the true differences
ygr N(dg , s2g) (8 replicates) s2g IG(1.5,
0.05) dg (-1)Bern(0.5) G(2,2), g1200 dg 0,
g2011000
Choice of simulation parameters inspired by
estimates found in analyses of biological data
sets
52
Posterior estimates of fold change using mixture
model
53
Comparison of mixture classification and
posterior probabilities for the standardised
differences
Post Prob (g ? H1)
10 6 False positive
In red, 200 genes with dg ? 0
31 4 False negative
Probability ( dg gt 2 data)
54
Bayes rule
FDR (black) FNR (blue) as a function of 1- pg0
Post Prob (g ? H1) 1- pg0
55
Using mixtures for modelling the marginal
distribution of gene statistics

Instead of modelling the prior for dg as a
mixture, an alternative is
To summarise differential expression by a gene
statistic
To model is marginal distribution as a mixture
such that the distribution is approximately known
under H0 and use a flexible distribution for the
alternative

56
Mixture modelling of transformed F statistics
Gene statistic based on classical F statistic
(this was developed to analyse multiclass ( gt 2
conditions) experiments) Gives a de-centred
asymmetric marginal distribution rather than a
two-tailed one Transform F -gt approx. standard
Normal if no change across conditions (H0). Use
a mixture of normals (variable number) for
modelling the alternative (following Richardson
and Green 1997)

57
Results for Simulated Data (to detect modified
profile over 3 conditions) Broet, Lewin, SR 2004
Bayes mixture estimate of FDR is close to true
value
Case A well separated null and alternative
hypotheses
Case B less separated null and alternative
hypotheses
For details, see the poster by Alex Lewin
58
Marginal mixture performance for the simulated
data (2 conditions, same data as for the prior
mixture)
Number on list as a function of cut-off prob
Expected number of false positive
59
Simulated data, comparison of prior and marginal
mixture classification
Good agreement between the 2 approaches
The marginal mixture has more false
positives Transformation to Normality for 2
conditions??
Further comparison in progress
60
Summary

Bayesian gene expression measure (BGX)
Good range of resolution , provides credibility
intervals
Differential Expression
Expression-level-dependent normalisation
Borrow information across genes for variances
Joint distribution of ranks, gene lists based on
posterior probabilities
False Discovery Rate
Mixture gives good estimate of FDR and classifies
well
Future work
Mixture prior on BGX index, with uncertainty
propagated to mixture parameters, comparison of
marginal and prior mixture approaches, clustering
for more general experimental set-ups

Papers and technical reports
Hein AM., Richardson S., Causton H., Ambler G.
and Green P. (2004)
BGX a fully Bayesian gene expression index for
Affymetrix GeneChip data (submitted)
Lewin A., Richardson S., Marshall C., Glazier A.
and Aitman T. (2003)
Bayesian Modelling of Differential Gene
Expression (submitted)
Broët P., Lewin A., Richardson S., Dalmasso C.
and Magdelenat H. (2004)
A mixture model-based strategy for selecting
sets of genes in multiclass response microarray
experiments. (Bioinformatics, advanced access
April 29 2004)
Broët, P., Richardson, S. and Radvanyi, F. (2002)
Bayesian Hierarchical Model for Identifying
Changes in Gene Expression from Microarray
Experiments , Journal of Computational Biology 9,
671-683.
Available at
http //www.bgx.org.uk/