b - PowerPoint PPT Presentation

About This Presentation
Title:

b

Description:

Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary's) ... Gene is of interest if |log fold change| log(2) and log (overall expression) 4 ... – PowerPoint PPT presentation

Number of Views:44
Avg rating:3.0/5.0
Slides: 62
Provided by: richar509
Category:
Tags: changelog

less

Transcript and Presenter's Notes

Title: b


1
Bayesian hierarchical modelling of gene
expression data
Sylvia Richardson Centre for Biostatistics Imperia
l College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Marys) Helen Causton
and Tim Aitman (Hammersmith) Graeme Ambler and
Peter Green (Bristol) Philippe Broët (INSERM,
Paris) BBSRC Exploiting Genomics grant
2
Outline
  • Hierarchical modelling framework
  • A Bayesian gene expression index
  • Modelling differential expression
  • False discovery rate and mixture models

3
Introduction
  • Gene expression is a hierarchical process
  • Substantive question
  • Experimental design
  • Sample preparation
  • Array design manufacture
  • Gene expression matrix
  • Probe level data
  • Image level data

Interesting variability (signal)

Obscuring variability (noise)
  • Interest in using statistical framework capable
    to handle multiple sources of variability
    coherently

Bayesian statistics
4
Bayesian hierarchical model framework
  • Has the flexibility to model various sources of
    variability between probes, gene specific,
    within array, between array,
  • Building of all these features into a common
    model
  • Avoids the need to use systematically a plug-in
    approach
  • uncertainty is propagated
  • Borrow strength / share out information
    according to principle
  • Allows some model checking

5
Gene expression analysis is amulti-step process
Low-level Model (how is the measured expression
related to the signal)
Multi-arrays processing (how to make appropriate
combined inference)
We build all these steps in a common statistical
framework
Differential Expression
Clustering Partition Model
6
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Gene and condition BGX index
Gene and condition BGX index
Differential expression parameter
7
A fully Bayesian Gene eXpression index for
Affymetrix GeneChip arraysAnne Mette HeinSR,
Helen Causton, Graeme Ambler, Peter Green
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe)
Gene index BGX
8
Single array model Motivation
Key observations
Conclusions
  • PMs and MMs both increase with spike-in
    concentration (MMs slower than PMs)

MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed
  • Spread of PMs increase with level
  • Considerable variability in PM (and MM) response
    within a probe set

Varying reliability in gene expression estimation
for different genes
Estimate gene expression measure from PMs and MMs
on log scale
  • Probe effects approximately additive on
    log-scale

9
Model assumptions and key biological parameters
  • The intensity for the PM measurement for probe
    (reporter) j and gene g is due to binding
  • of labelled fragments that perfectly match
    the oligos in the spot
  • The true Signal Sgj
  • of labelled fragments that do not
  • perfectly match these
    oligos
  • The non-specific hybridisation Hgj
  • The intensity of the corresponding MM measurement
    is caused
  • by a binding fraction F of the true signal Sgj
  • by non-specific hybridisation Hgj

10
BGX single array modelg1,,G (thousands),
j1,,J (11-20)
11
Implementation
  • In WinBugs for ease of model development
  • and C for efficiency
  • Joint estimation of parameters in full Bayesian
    framework
  • Base inference on posterior distribution
  • of all unknown quantities, Sgj, Hgj ,
  • qg Median of TN(mg, ? g2), .
  • and use appropriate summaries

12
Single array model performanceData set
varying concentrations (geneLogic)
  • 14 samples of cRNA from acute myeloid leukemia
    (AML) tumor cell line
  • In sample k each of 11 genes spiked in at
    concentration ck
  • sample k 1 2 3
    4 5 6 7 8 9 10 11
    12 13 14
  • conc. ck(pM) 0.0 0.5 0.75 1.0
    1.5 2.0 3.0 5.0 12.5 25 50 75
    100 150
  • Each sample hybridised to an array

Consider subset consisting of 500 normal genes
11 spike-ins
13
Single array model performanceOne array four
genes spiked in at concentration 5.0
Log Sgj
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
o log(PM-MM)
posterior distributions reflect variability
BGX index
14
Single array model performance signal and
expression index 10 arrays gene 1 spiked-in at
increasing concentrations
true signal/ expression index BGX increases
with concentration
log(Hgj1)
as previously
o log(PM-MM)
Posterior distributions
TN(medPost(mg),medPost(? g2))
2.5-97.5 credibility intervals
Log(Sgj1)
qg
15
Single array model performance non-specific
hybridization 10 arrays gene 1 spiked-in at
increasing concentrations
Signals Signals/cross
Non-specific hybridization does not increase
with concentration
2.5-97.5 credibility interval
log(Hgj1)
TN(medPost(mg),medPost(? g2))
16
Single array model performance11 genes spiked
in at 13 (increasing) concentrations
Comparison with other expression measures
BGX index qg increases with concentration ..
except for gene 7 (spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
17
2.5 97.5 credibility intervals for the
Bayesian expression index
11 spike-in genes at 13 different concentration
(data set A)
Note how the variability is substantially larger
for low expression level
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
18
What variability is captured?
  • For some genes, there is considerable discrepancy
    between the information given by the different
    probes
  • Posterior becomes flat or bimodal
  • Hard to summarise by a single number
  • Less reproducibility of point estimates of
    expression level
  • Model improvement -- stratify F by CG
  • content ? -- less weight to the MM in some
    cases? more robust summary of index
    distribution or heavy tail distributions?

19
Single array modelexamples of posterior
distributions of BGX expression indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,Jg (at 0 if not defined)
Mean - 1SD
20
Differential expression and array effectsAlex
Lewin SR, Natalia Bochkina, Anne Glazier, Tim
Aitman
21
Data Set and Biological question
Previous Work (Tim Aitman, Anne Marie
Glazier) The spontaneously hypertensive rat
(SHR) A model of human insulin resistance
syndromes. Deficiency in gene Cd36 found to be
associated with insulin resistance in
SHR Following this, several animal models were
developed where other relevant genes are knocked
out comparison between knocked out and
wildtype (normal) mice or rats.
22
Data Set and Biological question
Microarray Data Data set A (MAS 5) (? 12000
genes on each array) 3 SHR compared with 3
transgenic rats Data set B (RMA) (? 22700 genes
on each array) 8 wildtype (normal) mice compared
with 8 knocked out mice Biological
Question Find genes which are expressed
differently in wildtype and knockout / transgenic
mice
23
Condition 2
Condition 1
PM MM
Gene specific error term
Gene specific error term
Hierarchical model of replicate Variability and
array effect
Hierarchical model of replicate Variability and
array effect
Posterior distribution (flat prior)
Differential expression parameter
Mixture modelling for classification
24
Model for Differential Expression
  • Expression-level-dependent normalisation
  • Only few replicates per gene, so share
    information between genes to estimate variability
    of gene expression between the replicates
  • To select interesting genes
  • Use posterior distribution of quantities of
    interest, function of, ranks .
  • Use mixture prior on the differential expression
    parameter

25
Bayesian hierarchical model for replicate
expression data (under one condition)
  • Data ygr log gene expression for gene g,
    replicate r
  • (for the present, ygr is treated as known data)
  • ?g gene effect
  • ?r( ) array effect (possibly expression-level
    dependent)
  • ?g2 gene specific variance
  • 1st level
  • ygr ? N(?g ?r(?g), ?g2), Sr ?r (?g) 0
  • ?r( ) smooth function of ?g

Piecewise polynomial with unknown break points
26
Exploratory analysis of array effect
Needs normalisation Spline curves shown

27
Hierarchical structure for gene specific
parameters
  • 2nd level
  • Priors for ?g (flat) , coefficients and break
    points
  • Sr ? (?g) 0 constraint imposed
  • ?g2 ? lognormal (µ, t)
  • Hyper-parameters µ and t can be influential.
  • In a full Bayesian analysis, these are not fixed
  • 3rd level
  • µ ? N( c, d) t ? lognormal (e, f)

28
Smoothing of the gene specific variances
  • Variances are estimated using information from
    all G x R measurements (12000 x 3) rather than
    just 3
  • Variances are stabilised and shrunk towards
    average variance

29
Bayesian Model Checking
  • Check assumptions on gene variances, e.g.
    exchangeable variances, what distribution ?
  • Predict sample variance Sg2 new (a chosen
    checking function) from the model specification
    (not using the data for this)
  • Compare predicted Sg2 new with observed Sg2 obs
  • Bayesian p-value Prob( Sg2 new gt Sg2 obs )
  • Distribution of p-values approx Uniform if model
    is true (Marshall and Spiegelhalter, 2003)
  • Easily implemented in MCMC algorithm

30
Data set A
31
Differential expression model
The quantity of interest is the difference
between conditions for each gene dg , g 1,
,N Joint model for the 2 conditions yg1r ?g
- ½ dg ?1r(?g) ?g1r , r 1, R1 yg2r
?g ½ dg ?2r(?g) ?g2r , r 1, R2
  • ?g is now the overall gene effect over the
    conditions
  • The parameter of interest dg is given a flat
    prior (for now)
  • Same assumptions for the distribution of s2gs
  • Modelling of ?sr(?g) as before, s 1, 2 , sum
    to zero constraint imposed within each condition

32
Possible Statistics for Differential Expression
dg log fold change dg dg / (s2 g1 / R1
s2 g2 / R2 )½ (standardised difference)
  • We obtain the posterior distribution of all dg
    and/or dg
  • Can compute directly posterior probability of
    genes satisfying criterion X of interest
  • pg,X Prob( g of interest Criterion X,
    data)
  • Can compute the distributions of ranks

33
Data set A 3 wildtype mice compared to 3
knockout mice (U74A chip) Mas5
Criterion X
Gene is of interest if log fold change gt
log(2) and log (overall expression) gt 4
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 280 pg,X gt 0.8
(red) 46
The majority of the genes have very small pg,X
90 of genes have pg,X lt 0.2
pg,X 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
34
Experiment 8 wildtype mice compared to 8
knockout mice RMA
Gene is of interest if log fold change gt log
(1.5)
Criterion X
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 292 pg,X gt 0.8
(red) 139
The majority of the genes have very small pg,X
97 of genes have pg,X lt 0.2
35
Posterior probabilities and log fold change
Data set A 3 replicates MAS5
Data set B 8 replicates RMA
36
Credibility intervals for ranks
Data set B
100 genes with lowest rank (most under/ over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
37
Using the posterior distribution of dg
(standardised difference)
  • Compute
  • Probability ( dg gt 2 data)
  • Bayesian analogue of a t test !
  • Order genes
  • Select genes such that
  • Probability ( dg gt 2 data) gt cut-off

38
Volcano plots
Bayesian T test
(Bayesian estimate)
For illustration, cut-offs lines drawn at 0.95
39
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Distribution of expression index for gene g ,
condition 1
Distribution of expression index for gene g ,
condition 2
Distribution of differential expression parameter
40
BGX Multiple array model conditions c1,,C,
replicates r 1,,Rc
PMgjcr ? N( Sgjcr Hgjcr , tcr2) MMgjcr?
N(FSgjcr Hgjcr , tcr2)
Gene and condition specific BGX
qgcmedian(TN(µgc, ? gc2)) Pools
information over replicate probe sets
j 1,J, r 1,,Rc
41
Posterior distributions of BGXSingle array vs
multiple array analyses
Three replicate arrays analysed together
(multiple array model)
Mean - 1SD
Three replicate arrays analysed separately
42
Subset of AffyU133A spike-in data set(AffyComp)
  • Consider
  • Six arrays, 1154 genes (every 20th and 42
    spike-ins)
  • Same cRNA hybridised to all arrays EXCEPT for
    spike-ins

1 2 3 12
13 14 Spike-in genes 1-3
4-6 7-9 34-36 37-39 40-42 Spike-in
conc (pM) Condition 1 (array 1-3) 0.0
0.25 0.50 128 256 512 Condition 2
(array 4-6) 0.25 0.50 1.00 256
512 0.00 Fold change
- 2 2 2 2
-
43
M v A plots
A (1/2)(exprg,1exprg,2), M (exprg,1-exprg,2)
NB! Point estimates used MAS5 and RMA exprgc
mean over three replicates BGX Multiple array
index
True fold changes Black zero Red 2
44
BGX measure of uncertainty providedPosterior
mean - 1SD credibility intervals
diffgbgxg,1- bgxg,2

Spike in 1113 -1154 above the blue line
Blue stars show RMA measure
45
Mixture and Bayesian estimation of false
discovery ratesNatalia Bochkina, Alex Lewin
SR, Philippe Broët
46
Multiple Testing Problem
  • Gene lists can be built by computing separately a
    criteria for each gene and ranking
  • Thousands of genes are considered simultaneously
  • How to assess the performance of such lists ?

Statistical Challenge Select interesting genes
without including too many false positives in a
gene list
A gene is a false positive if it is included in
the list when it is truly unmodified under the
experimental set up
Want an evaluation of the expected false
discovery rate (FDR)
47
Bayesian Estimate of FDR
  • Step 1 Choose a gene specific parameter (e.g.
    dg ) or a gene statistic (see later)
  • Step 2 Model its prior (resp marginal)
    distribution using a mixture model
  • -- with one component that models the
    unaffected genes (null hypothesis) e.g. point
    mass at 0 for dg
  • -- other components that model (flexibly) the
    alternative
  • Step 3 Calculate the posterior probability for
    any gene of belonging to the unmodified component
    pg0 data
  • Step 4 Evaluate FDR (and FNR) for any list
  • Assuming that all the gene classification are
    independent

Bayes FDR (list) data 1/card(list) Sg ? list
pg0
48
Mixture prior
  • To obtain a gene list, a commonly used method
  • (cf Lonnstedt Speed 2002, Newton 2003, Smyth
    2003, ) is to define a mixture prior for dg
  • H0 dg 0 point mass at 0 with probability p0
  • H1 dg flexible 2-sided distribution to model
    differential expression
  • Classify each gene following its posterior
    probabilities of not being in the null 1- pg0
  • Use Bayes rule or fix the FDR

49
Classification with mixture prior
  • Joint estimation of all the mixture parameters
    (including p0) avoids plugging-in of values
    (e.g. p0) that are influential on the
    classification
  • Sensitivity to prior settings of the alternative
    distribution and performance has been tested on
    simulated data sets
  • Work in progress
  • Poster by Natalia Bochkina

50
Performance of the mixture prior
  • yg1r ?g - ½ dg ?g1r , r 1, R1
  • yg2r ?g ½ dg ?g2r , r 1, R2
  • (For simplification, we assume that the data has
    been pre normalised)
  • s2g IG(a, b)
  • dg p0d0 p1G (1.5, ?1) p2G (1.5, ?2)
  • H0 H1
  • Dirichlet distribution for (p0, p1, p2)
  • Exponential hyper prior for ?1 and ?2

51
Simulated data
Plot of the true differences
ygr N(dg , s2g) (8 replicates) s2g IG(1.5,
0.05) dg (-1)Bern(0.5) G(2,2), g1200 dg 0,
g2011000
Choice of simulation parameters inspired by
estimates found in analyses of biological data
sets
52
Posterior estimates of fold change using mixture
model
53
Comparison of mixture classification and
posterior probabilities for the standardised
differences
Post Prob (g ? H1)
10 6 False positive
In red, 200 genes with dg ? 0
31 4 False negative
Probability ( dg gt 2 data)
54
Bayes rule
FDR (black) FNR (blue) as a function of 1- pg0
Post Prob (g ? H1) 1- pg0
55
Using mixtures for modelling the marginal
distribution of gene statistics
  • Instead of modelling the prior for dg as a
    mixture, an alternative is
  • To summarise differential expression by a gene
    statistic
  • To model is marginal distribution as a mixture
    such that the distribution is approximately known
    under H0 and use a flexible distribution for the
    alternative

56
Mixture modelling of transformed F statistics
Gene statistic based on classical F statistic
(this was developed to analyse multiclass ( gt 2
conditions) experiments) Gives a de-centred
asymmetric marginal distribution rather than a
two-tailed one Transform F -gt approx. standard
Normal if no change across conditions (H0). Use
a mixture of normals (variable number) for
modelling the alternative (following Richardson
and Green 1997)

57
Results for Simulated Data (to detect modified
profile over 3 conditions) Broet, Lewin, SR 2004
Bayes mixture estimate of FDR is close to true
value
Case A well separated null and alternative
hypotheses
Case B less separated null and alternative
hypotheses
For details, see the poster by Alex Lewin
58
Marginal mixture performance for the simulated
data (2 conditions, same data as for the prior
mixture)
Number on list as a function of cut-off prob
Expected number of false positive
59
Simulated data, comparison of prior and marginal
mixture classification
Good agreement between the 2 approaches
The marginal mixture has more false
positives Transformation to Normality for 2
conditions??
Further comparison in progress
60
Summary
  • Bayesian gene expression measure (BGX)
  • Good range of resolution , provides credibility
    intervals
  • Differential Expression
  • Expression-level-dependent normalisation
  • Borrow information across genes for variances
  • Joint distribution of ranks, gene lists based on
    posterior probabilities
  • False Discovery Rate
  • Mixture gives good estimate of FDR and classifies
    well
  • Future work
  • Mixture prior on BGX index, with uncertainty
    propagated to mixture parameters, comparison of
    marginal and prior mixture approaches, clustering
    for more general experimental set-ups

61
  • Papers and technical reports
  • Hein AM., Richardson S., Causton H., Ambler G.
    and Green P. (2004)
  • BGX a fully Bayesian gene expression index for
    Affymetrix GeneChip data (submitted)
  • Lewin A., Richardson S., Marshall C., Glazier A.
    and Aitman T. (2003)
  • Bayesian Modelling of Differential Gene
    Expression (submitted)
  • Broët P., Lewin A., Richardson S., Dalmasso C.
    and Magdelenat H. (2004)
  • A mixture model-based strategy for selecting
    sets of genes in multiclass response microarray
    experiments. (Bioinformatics, advanced access
    April 29 2004)
  • Broët, P., Richardson, S. and Radvanyi, F. (2002)
  • Bayesian Hierarchical Model for Identifying
    Changes in Gene Expression from Microarray
    Experiments , Journal of Computational Biology 9,
    671-683.
  • Available at
  • http //www.bgx.org.uk/

Thanks
Write a Comment
User Comments (0)
About PowerShow.com