Title: Classification of Microarray Data - Recent Statistical Approaches
1Classification of Microarray Data - Recent Statistical Approaches
Geoff McLachlan and Liat Ben-Tovim Jones
Department of Mathematics and Institute for Molecular Bioscience, University of Queensland
Tutorial for the APBC, January 2005
2Institute for Molecular Bioscience, University of Queensland
3Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
4Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
BREAK
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
6Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
7"Large-scale gene expression studies are not a passing fashion, but are instead one aspect of a new kind of biological experimentation, one involving large-scale, high-throughput assays."
Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC
8Growth of the microarray and microarray methodology literature listed in PubMed from 1995 to 2003. The category "all microarray papers" includes those found by searching PubMed for "microarray" OR "gene expression profiling". The category "statistical microarray papers" includes those found by searching PubMed for ("statistical method" OR "statistical techniq" OR "statistical approach") AND ("microarray" OR "gene expression profiling").
9A microarray is a new technology which allows the
measurement of the expression levels of
thousands of genes simultaneously.
Two developments made this possible:
- (1) sequencing of the genome (human, mouse, and others)
- (2) improvement in technology to generate high-density arrays on chips (glass slides or nylon membranes).
The entire genome of an organism can be probed at
a single point in time.
10(1) mRNA Levels Indirectly Measure Gene Activity
Every cell contains the same DNA.
Cells differ in the DNA (gene) which is active at any one time.
Genes code for proteins through the intermediary of mRNA.
The activity of a gene (its expression) can therefore be determined by the presence of its complementary mRNA.
11Target and Probe DNA
Probe DNA - known
Sample (target) - unknown
12(2) Microarrays Indirectly Measure Levels of mRNA
- mRNA is extracted from the cell
- mRNA is reverse transcribed to cDNA (mRNA itself is unstable)
- cDNA is labeled with fluorescent dye (the TARGET)
- The sample is hybridized to known DNA sequences on the array (tens of thousands of genes; the PROBE)
- If present, complementary target binds to probe DNA (complementary base pairing)
- Target bound to probe DNA fluoresces
13Spotted cDNA Microarray
Compare the gene expression levels for two cell
populations on a single microarray.
15Microarray Image
Red: high expression in the target labelled with cyanine 5 dye
Green: high expression in the target labelled with cyanine 3 dye
Yellow: similar expression in both target samples
16Assumptions
Gene Expression → mRNA → Fluorescence Intensity
(1) Cellular mRNA levels directly reflect gene expression.
(2) The intensity of bound target is a measure of the abundance of the mRNA in the sample.
17Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
18The Microarray Technologies
Spotted Microarray: cDNAs, clones, or short and long oligonucleotides deposited onto glass slides. Each gene (or EST) is represented by its purified PCR product. Simultaneous analysis of two samples (treated vs untreated cells) provides an internal control. Gives relative gene expressions.
Affymetrix GeneChip: short oligonucleotides synthesized in situ onto glass wafers. Each gene is represented multiply, using 16-20 (preferably non-overlapping) 25-mers. Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity. Gives absolute gene expressions.
Each with its own advantages and disadvantages.
19Pros and Cons of the Technologies
Affymetrix GeneChip: more expensive yet less flexible. Good for whole-genome expression analysis where the genome of that organism has been sequenced. High quality with little variability between slides. Gives a measure of absolute expression of genes.
Spotted Microarray: flexible and cheaper. Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions). Variability in spot quality from slide to slide. Provides information only on relative gene expressions between cells or tissue samples.
20Aims of a Microarray Experiment
- To observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins)
- To compare gene expressions between different tissue types (tumour vs normal cell samples)
- To gain understanding of
- the function of unknown genes
- the disease process at the molecular level
- Ultimately, to use as tools in clinical medicine for diagnosis, prognosis and therapeutic management.
21Importance of Experimental Design
- Good DNA microarray experiments should have clear objectives.
- They are not performed as aimless data mining in search of unanticipated patterns that will provide answers to unasked questions.
- (Richard Simon, BioTechniques 34: S16-S21, 2003)
22Replicates
Technical replicates: arrays that have been hybridized to the same biological source (using the same treatment, protocols, etc.).
Biological replicates: arrays that have been hybridized to different biological sources, but with the same preparation, treatments, etc.
23Extracting Data from the Microarray
- Cleaning
- Image processing
- Filtering
- Missing value estimation
- Normalization
- Remove sources of systematic variation.
Sample 1, Sample 2, Sample 3, Sample 4, etc.
24Gene Expressions from Measured Intensities
Spotted Microarray: log2(Cy5 intensity / Cy3 intensity)
Affymetrix: Perfect Match Intensity − Mismatch Intensity
25Data Transformation
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
26Representation of Data from M Microarray Experiments
Assume we have extracted gene expression values from the intensities.
Columns: Sample 1, Sample 2, ..., Sample M (each column is an expression signature).
Rows: Gene 1, Gene 2, ..., Gene N (each row is an expression profile).
Gene expressions can be shown as heat maps.
29Microarrays present new problems for statistics
because the data is very high dimensional with
very little replication.
30Gene Expression Data Represented as an N x M Matrix
N rows correspond to the N genes (expression profiles).
M columns correspond to the M samples, i.e. the microarray experiments (expression signatures).
31Microarray Data Notation
Represent the data by the N x M matrix A.
Classifying tissues on gene expressions: A = (y1, ..., yM), where the feature vector yj contains the expression levels of the N genes in the jth experiment (j = 1, ..., M); yj is an expression signature.
Classifying genes on the tissues: A^T = (y1, ..., yN), where the feature vector yj contains the expression levels over the M tissues for the jth gene (j = 1, ..., N); yj is an expression profile.
32In the N x M matrix A: N = no. of genes (10^3 - 10^4), M = no. of tissues (10 - 10^2).
Classification of tissues on gene expressions: standard statistical methodology is appropriate when M >> N, BUT here N >> M.
Classification of genes on the basis of the tissues: falls in the standard framework, BUT not all the genes are independently distributed.
33Mehta et al. (Nature Genetics, Sept. 2004):
The field of expression data analysis is particularly active, with novel analysis strategies and tools being published weekly, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of forensic statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003), who doggedly detect and correct other HDB (high-dimensional biology) investigators' prominent mistakes, has been created.
34Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
36Class 1
Class 2
37Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
38Multiplicity Problem
Perform a test for each gene to determine the
statistical significance of differential
expression for that gene. Problem: when many hypotheses are tested, the probability of a type I error (false positive) increases sharply with the number of hypotheses.
Further complicated by gene co-regulation and
subsequent correlation between the test
statistics.
39Example
Suppose we measure the expression of 10,000 genes
in a microarray experiment.
If all 10,000 genes were not differentially expressed, then we would expect:
for P = 0.05 for each test, 500 false positives;
for P = 0.05/10,000 for each test, 0.05 false positives.
40Methods for dealing with the Multiplicity Problem
- The Bonferroni Method controls the family-wise error rate (FWER). (FWER is the probability that at least one false positive error will be made.) But this method is very conservative, as it tries to make it unlikely that even one false rejection is made.
- The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.
41Test of a Single Hypothesis
The M tissue samples are classified with respect to g classes on the basis of the N gene expressions. Assume that there are ni tissue samples from class Ci (i = 1, ..., g), where M = n1 + ... + ng.
Take the simple case where g = 2. The aim is to detect whether some of the thousands of genes have different expression levels in class C1 than in class C2.
42Test of a Single Hypothesis (cont.)
For gene j, let Hj denote the null hypothesis of no association between its expression level and membership of the class (j = 1, ..., N).
Hj = 0: the null hypothesis for the jth gene holds.
Hj = 1: the null hypothesis for the jth gene does not hold.
              Retain Null       Reject Null
Hj = 0        correct           type I error
Hj = 1        type II error     correct
43Two-Sample t-Statistic
Student's t-statistic for gene j (allowing unequal class variances):
  Tj = (ȳ1j − ȳ2j) / sqrt(s1j²/n1 + s2j²/n2)
44Two-Sample t-Statistic
Pooled form of Student's t-statistic, assuming a common variance in the two classes:
  Tj = (ȳ1j − ȳ2j) / (sj sqrt(1/n1 + 1/n2)),   where sj² is the pooled sample variance for gene j
45Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001): a small constant s0 is added to the denominator to damp the effect of genes with very small variances:
  Tj = (ȳ1j − ȳ2j) / (sj sqrt(1/n1 + 1/n2) + s0)
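A minimal Python sketch (not part of the original tutorial; numpy is assumed) of the three gene-wise statistics above. Tusher et al. (2001) choose the fudge constant s0 from the data; the fixed s0 used below is an illustrative assumption only.

import numpy as np

def two_sample_t_stats(X1, X2, s0=0.0):
    """Gene-wise two-sample t-statistics.

    X1 : (n1, N) expressions for class 1 (rows = samples, columns = genes)
    X2 : (n2, N) expressions for class 2
    s0 : fudge constant added to the denominator (SAM-style); s0 = 0 gives
         the usual pooled t-statistic.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0, ddof=1), X2.var(axis=0, ddof=1)

    # Unequal-variance (Welch-type) statistic
    t_welch = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

    # Pooled-variance statistic (assumes a common variance in the two classes)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    t_pooled = (m1 - m2) / se

    # Modified (SAM-style) statistic: add s0 to the denominator
    t_mod = (m1 - m2) / (se + s0)
    return t_welch, t_pooled, t_mod

# Simulated example: 7 vs 8 samples, 1000 genes
rng = np.random.default_rng(0)
X1 = rng.normal(size=(7, 1000))
X2 = rng.normal(size=(8, 1000))
t_w, t_p, t_m = two_sample_t_stats(X1, X2, s0=0.1)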
46Multiple Hypothesis Testing
Consider measures of error for multiple
hypotheses. Focus on the rate of false positives
with respect to the number of rejected
hypotheses, Nr.
47Possible Outcomes for N Hypothesis Tests
48Possible Outcomes for N Hypothesis Tests
The FWER is the probability of getting one or more false positives out of all the hypotheses tested.
49Bonferroni method for controlling the FWER
Consider N hypothesis tests H0j versus H1j, j = 1, ..., N, and let P1, ..., PN denote the N P-values for these tests.
The Bonferroni Method: given P-values P1, ..., PN, reject the null hypothesis H0j if
  Pj < α / N.
50False Discovery Rate (FDR)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
51Possible Outcomes for N Hypothesis Tests
52Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
- Genovese and Wasserman (2002)
- Storey (2002, 2003)
- Storey and Tibshirani (2003a, 2003b)
- Storey, Taylor and Siegmund (2004)
- Black (2004)
- Cox and Wong (2004)
53Benjamini-Hochberg (BH) Procedure
Controls the FDR at level α when the P-values following the null distribution are independent and uniformly distributed.
(1) Let P(1) ≤ ... ≤ P(N) be the ordered observed P-values.
(2) Calculate k̂ = max{ k : P(k) ≤ kα/N }.
(3) If k̂ exists, then reject the null hypotheses corresponding to P(1), ..., P(k̂). Otherwise, reject nothing.
54Example Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are
carried out leading to the following ordered
P-values
0.00017 0.00448 0.00671 0.00907
0.01220 0.33626 0.39341 0.53882 0.58125
0.98617
(a) With α = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than α/10 = 0.005. Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that P(k) ≤ kα/N. Here k = 5, thus we reject the first five hypotheses.
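A minimal Python sketch (not from the tutorial; numpy assumed) applying the Bonferroni and BH rules to the ten P-values listed above; the function names are illustrative only.

import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0j whenever Pj < alpha / N (FWER control)."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (FDR control):
    find the largest k with P(k) <= k * alpha / N and reject the
    hypotheses with the k smallest P-values."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= (np.arange(1, N + 1) * alpha / N)
    reject = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # index of the largest k satisfying the bound
        reject[order[:k + 1]] = True
    return reject

# Ordered P-values from the slide above
p = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
     0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
print(bonferroni_reject(p).sum())   # 2 rejections
print(bh_reject(p).sum())           # 5 rejections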
55SHORT BREAK
56Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
57Using just the B permutations of the class labels for the gene-specific statistic Tj, the P-value for Tj = tj is assessed as
  pj = #{ b : |t0j^(b)| ≥ |tj| } / B,
where t0j^(b) is the null version of tj after the bth permutation of the class labels.
58If we pool over all N genes, then
  pj = Σ(i=1..N) #{ b : |t0i^(b)| ≥ |tj| } / (N B).
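A minimal Python sketch (not from the tutorial; numpy assumed) of the gene-specific permutation P-value described on the two slides above, using B random permutations of the class labels; the slides' small example instead enumerates all distinct permutations, and pooling across genes is omitted here.

import numpy as np

def pooled_t(x, labels):
    """Pooled two-sample t-statistic for one gene."""
    x1, x2 = x[labels == 0], x[labels == 1]
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def permutation_pvalue(x, labels, B=1000, rng=None):
    """Estimate the P-value of the observed statistic by permuting the
    class labels; resolution is of order 1/B."""
    rng = np.random.default_rng(rng)
    t_obs = pooled_t(x, labels)
    t_null = np.empty(B)
    for b in range(B):
        perm = rng.permutation(labels)          # b-th permutation of the class labels
        t_null[b] = pooled_t(x, perm)
    return np.mean(np.abs(t_null) >= abs(t_obs))

labels = np.array([0, 0, 0, 1, 1, 1])           # three samples from each class
gene1 = np.array([1.2, 0.8, 1.1, 3.0, 2.7, 3.3])
print(permutation_pvalue(gene1, labels, B=1000))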
59Null Distribution of the Test Statistic Example
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Suppose we have two classes of tissue samples,
with three samples from each class. Consider
the expressions of two genes, Gene 1 and Gene 2.
60Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
To find the null distribution of the test
statistic for Gene 1, we proceed under the
assumption that there is no difference between
the classes (for Gene 1) so that
Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1)
A6(1)
And permute the class labels
Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1)
A6(1) ... There are 10 distinct permutations.
61Ten Permutations of Gene 1
A1(1) A2(1) A3(1)   A4(1) A5(1) A6(1)
A1(1) A2(1) A4(1)   A3(1) A5(1) A6(1)
A1(1) A2(1) A5(1)   A3(1) A4(1) A6(1)
A1(1) A2(1) A6(1)   A3(1) A4(1) A5(1)
A1(1) A3(1) A4(1)   A2(1) A5(1) A6(1)
A1(1) A3(1) A5(1)   A2(1) A4(1) A6(1)
A1(1) A3(1) A6(1)   A2(1) A4(1) A5(1)
A1(1) A4(1) A5(1)   A2(1) A3(1) A6(1)
A1(1) A4(1) A6(1)   A2(1) A3(1) A5(1)
A1(1) A5(1) A6(1)   A2(1) A3(1) A4(1)
62As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this method, in that the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.
63Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1. Suppose that
Gene 2 is differentially expressed, then the
null values of the test statistic for Gene 2 will
have a mixed distribution.
64Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2)
B6(2)
Permute the class labels
Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2)
B6(2) ...
65Example of a null case with 7 N(0,1) points and 8 N(0,1) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
66Example of a non-null case with 7 N(0,1) points and 8 N(10,9) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
67The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher
et al., 2001).
The order statistics t(1), ... , t(N) are
plotted against their null expectations above.
A good test in situations where there are more
genes being over-expressed than under-expressed,
or vice-versa.
68Two-component Mixture Model Framework
69Two-component model:
  f(tj) = π0 f0(tj) + π1 f1(tj),
where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 − π0 is the proportion that are.
70Two-component model:
  f(tj) = π0 f0(tj) + π1 f1(tj),
where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 − π0 is the proportion that are. Then
  τ0(tj) = π0 f0(tj) / f(tj)
is the posterior probability that gene j is not differentially expressed.
711) Form a statistic tj for each gene, where a large positive value of tj corresponds to a gene that is differentially expressed across the tissues.
2) Compute the P-values Pj corresponding to the tj and fit a mixture of beta distributions (including a uniform component) to them, where the uniform component corresponds to the class of genes that are not differentially expressed.
723) Fit to t1, ..., tN a mixture of two normal densities with a common variance, where the first component has the smaller mean (it corresponds to the class of genes that are not differentially expressed). It is assumed that the tj have been transformed so that they are (approximately) normally distributed.
4) Let τ0(tj) denote the (estimated) posterior probability that gene j belongs to the first component of the mixture.
73If we conclude that gene j is differentially expressed if
  τ0(tj) ≤ c0,
then this decision minimizes the (estimated) Bayes risk.
74Estimated FDR:
  FDR-hat = (1/Nr) Σj τ0(tj) I[τ0(tj) ≤ c0],
where Nr is the number of genes declared to be differentially expressed (those with τ0(tj) ≤ c0).
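A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of the two-component normal mixture approach: fit the mixture to the gene statistics tj, take the component with the smaller mean as the null component, and report the estimated FDR for a chosen c0. Unlike the slides, the common-variance constraint is not enforced here, and the data are simulated.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_component(t, c0=0.1):
    """Fit a two-component normal mixture to the gene statistics t; return
    the posterior probability tau0_j of the null component, the genes
    declared differentially expressed (tau0_j <= c0), and the estimated FDR
    = average tau0 over the declared genes."""
    t = np.asarray(t).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(t)
    null = np.argmin(gm.means_.ravel())        # component with the smaller mean taken as null
    tau0 = gm.predict_proba(t)[:, null]
    selected = tau0 <= c0
    Nr = selected.sum()
    fdr_hat = tau0[selected].sum() / Nr if Nr > 0 else 0.0
    return tau0, selected, fdr_hat

# Simulated example: 900 null genes and 100 differentially expressed genes
rng = np.random.default_rng(1)
t = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
tau0, selected, fdr_hat = fit_two_component(t, c0=0.2)
print(selected.sum(), round(fdr_hat, 3))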
75Suppose τ0(t) is monotonic decreasing in t. Then for
76Suppose τ0(t) is monotonic decreasing in t. Then for
77Suppose τ0(t) is monotonic decreasing in t. Then for
where
78For a desired control level α, say α = 0.05, let c0 be chosen so that the estimated FDR is no greater than α.   (1)
If τ0(t) is monotonic in t, then using (1) to control the FDR, with the proportion π0 of non-differentially expressed genes set to 1 and the null distribution taken to be the empirical distribution function, is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic tj.
79Example
The study of Hedenfalk et al. (2001) used cDNA arrays to measure the gene expressions in breast tumour tissues taken from patients with mutations in either the BRCA1 or BRCA2 genes. We consider their data set of M = 15 patients, comprising two patient groups, BRCA1-mutation positive (7) versus BRCA2-mutation positive (8), with N = 3,226 genes.
The problem is to find genes which are
differentially expressed between the BRCA1 and
BRCA2 patients.
80Example of a null case with 7 N(0,1) points and 8 N(0,1) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
81Example of a non-null case with 7 N(0,1) points and 8 N(10,9) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
82Fit a two-component normal mixture to the N values of tj (the pooled two-sample t-statistic).
The jth gene is taken to be differentially expressed if τ0(tj) ≤ c0.
83Estimated FDR for various levels of c0
c0     Nr      Estimated FDR
0.5    1702    0.29
0.4    1235    0.23
0.3    850     0.18
0.2    483     0.12
0.1    175     0.06
84Use of the P-Value as a Summary Statistic (Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj in the test of the null hypothesis of no difference in expression between the two classes.
The distribution of the P-values is modelled by the h-component mixture of beta distributions
  f(pj) = Σ(i=1..h) λi β(pj; ai1, ai2),
where a11 = a12 = 1, so that the first component is the uniform distribution.
85Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the jth gene, pj has a uniform distribution on the unit interval, i.e. the beta(1,1) distribution.
The beta(a1, a2) density is given by
  β(p; a1, a2) = p^(a1−1) (1 − p)^(a2−1) / B(a1, a2),   0 < p < 1,
where B(a1, a2) = Γ(a1) Γ(a2) / Γ(a1 + a2).
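A minimal Python sketch (not from the tutorial; numpy and scipy assumed) fitting a two-component version of the above model, a uniform (beta(1,1)) component plus one beta(a1, a2) component, to P-values by direct maximum likelihood rather than by the fitting method of Allison et al.; the data are simulated.

import numpy as np
from scipy import stats, optimize

def fit_uniform_beta_mixture(p):
    """Fit f(p) = pi0 * 1 + (1 - pi0) * beta(p; a1, a2) to P-values by
    maximum likelihood; the uniform component represents the genes that
    are not differentially expressed."""
    p = np.clip(np.asarray(p), 1e-6, 1 - 1e-6)

    def nll(theta):
        pi0, a1, a2 = theta
        dens = pi0 + (1 - pi0) * stats.beta.pdf(p, a1, a2)
        return -np.sum(np.log(dens))

    res = optimize.minimize(nll, x0=[0.8, 0.5, 2.0], method="L-BFGS-B",
                            bounds=[(1e-3, 1 - 1e-3), (1e-2, 50), (1e-2, 50)])
    pi0, a1, a2 = res.x
    # Posterior probability that each gene is not differentially expressed
    tau0 = pi0 / (pi0 + (1 - pi0) * stats.beta.pdf(p, a1, a2))
    return pi0, (a1, a2), tau0

# Simulated example: 80% uniform P-values, 20% concentrated near zero
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=800), rng.beta(0.3, 4.0, size=200)])
pi0, (a1, a2), tau0 = fit_uniform_beta_mixture(p)
print(round(pi0, 2), round(a1, 2), round(a2, 2))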
87Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
88Gene Expression Data Represented as an N x M Matrix
N rows correspond to the N genes (expression profiles).
M columns correspond to the M samples, i.e. the microarray experiments (expression signatures).
89Two Clustering Problems
- Clustering of genes on the basis of the tissues: the genes are not all independent (n = N, p = M).
- Clustering of tissues on the basis of the genes: the latter is a nonstandard problem in cluster analysis (n = M, p = N, so n << p).
91CLUSTERING OF GENES WITH REPEATED
MEASUREMENTS Medvedovic and Sivaganesan
(2002) Celeux, Martin, and Lavergne
(2004) Chapter 5 of McLachlan et al. (2004)
92UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS): INFER THE CLASS LABELS z1, ..., zn OF y1, ..., yn.
Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, and Botstein (1998, PNAS).
95Hierarchical (agglomerative) clustering
algorithms are largely heuristically motivated
and there exist a number of unresolved issues
associated with their use, including how to
determine the number of clusters.
in the absence of a well-grounded statistical
model, it seems difficult to define what is meant
by a good clustering algorithm or the right
number of clusters.
(Yeung et al., 2001, Model-Based Clustering and
Data Transformations for Gene Expression Data,
Bioinformatics 17)
96McLachlan and Khan (2004). On a resampling
approach for tests on the number of clusters with
mixture model-based clustering of the tissue
samples. Special issue of the Journal of
Multivariate Analysis 90 (2004) edited by Mark
van der Laan and Sandrine Dudoit (UC Berkeley).
97Attention is now turning towards a model-based
approach to the analysis of microarray data
For example
- Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9.
- Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18.
- Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7.
- Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3.
- Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.
98The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in
advance e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
99In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
101MIXTURE OF g NORMAL COMPONENTS
  f(y) = π1 φ(y; μ1, Σ1) + ... + πg φ(y; μg, Σg)
102MIXTURE OF g NORMAL COMPONENTS
103Equal spherical covariance matrices: Σi = σ² Ip (i = 1, ..., g)
104With a mixture model-based approach to
clustering, an observation is assigned outright
to the ith cluster if its density in the ith
component of the mixture distribution (weighted
by the prior probability of that component) is
greater than in the other (g-1) components.
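A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of this outright assignment with a fitted normal mixture model; the data are simulated and already of low dimension, since in practice the tissues' dimension must first be reduced (gene selection, metagenes, or factor analyzers) because N >> M.

import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated tissue-by-feature matrix: 20 tissues, 5 reduced features
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(10, 5)),
               rng.normal(2, 1, size=(10, 5))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)

# Outright assignment: each tissue goes to the component with the largest
# weighted density, i.e. the largest posterior probability.
posterior = gm.predict_proba(X)
cluster = posterior.argmax(axis=1)
print(cluster)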
105Figure 7: Contours of the fitted component densities on the 2nd and 3rd variates for the blue crab data set.
106Estimation of Mixture Distributions
- It was the publication of the seminal paper of
Dempster, Laird, and Rubin (1977) on the EM
algorithm that greatly stimulated interest in the
use of finite mixture distributions to model
heterogeneous data. - McLachlan and Krishnan (1997, Wiley)
107- If need be, the normal mixture model can be
made less sensitive to outlying observations by
using t component densities. - With this t mixture model-based approach, the
normal distribution for each component in the
mixture is embedded in a wider class of
elliptically symmetric distributions with an
additional parameter called the degrees of
freedom.
108The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the normal
mixture model, the outliers have to be much
larger.
109Mixtures of Factor Analyzers
A normal mixture model without restrictions on
the component-covariance matrices may be viewed
as too general for many situations in practice,
in particular, with high dimensional data. One
approach for reducing the number of parameters
is to work in a lower dimensional space by using
principal components; another is to use mixtures of factor analyzers (Ghahramani and Hinton, 1997).
110Two Groups in Two Dimensions. All cluster
information would be lost by collapsing to the
first principal component. The principal
ellipses of the two groups are shown as solid
curves.
111Mixtures of Factor Analyzers
- Principal components or a single-factor analysis
model provides only a global linear model. - A global nonlinear approach by postulating a
mixture of linear submodels
112Bi is a p x q matrix and Di is a diagonal
matrix.
113Single-Factor Analysis Model
The Uj are iid N(0, Iq), independently of the errors ej, which are iid N(0, D), where D is a diagonal matrix.
115Conditional on ith component membership of the
mixture,
where Ui1, ..., Uin are independent, identically distributed (iid) N(0, Iq), independently of the eij, which are iid N(0, Di), where Di is a diagonal matrix (i = 1, ..., g).
116There is an infinity of choices for Bi, as the model still holds if Bi is replaced by Bi Ci, where Ci is an orthogonal matrix. Choose Ci so that
  Bi^T Di^(-1) Bi
is diagonal. The number of free parameters per component-covariance matrix is then
  pq + p − q(q − 1)/2.
117The reduction in the number of parameters per component-covariance matrix is then
  p(p + 1)/2 − {pq + p − q(q − 1)/2} = {(p − q)² − (p + q)}/2.
- We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm.
1181st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of πi and μi.
2nd cycle: declare the missing data to be also the factors. Update the estimates of Bi and Di.
119M-step on 1st cycle: update for i = 1, ..., g.
120M-step on 2nd cycle.
122Work in the q-dimensional space:
  (Bi Bi^T + Di)^(-1) = Di^(-1) − Di^(-1) Bi (Iq + Bi^T Di^(-1) Bi)^(-1) Bi^T Di^(-1),
  |Bi Bi^T + Di| = |Di| / |Iq − Bi^T (Bi Bi^T + Di)^(-1) Bi|.
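A minimal Python sketch (not from the tutorial; numpy assumed) that checks the two q-dimensional identities above numerically on a randomly generated Bi and Di.

import numpy as np

rng = np.random.default_rng(4)
p, q = 6, 2
B = rng.normal(size=(p, q))                       # p x q factor loading matrix
D = np.diag(rng.uniform(0.5, 2.0, size=p))        # diagonal noise covariance

Sigma = B @ B.T + D
Dinv = np.diag(1.0 / np.diag(D))

# Inverse via the q-dimensional (Woodbury) form
inv_q = Dinv - Dinv @ B @ np.linalg.inv(np.eye(q) + B.T @ Dinv @ B) @ B.T @ Dinv
print(np.allclose(np.linalg.inv(Sigma), inv_q))   # True

# Determinant identity: |B B^T + D| = |D| / |I_q - B^T (B B^T + D)^(-1) B|
lhs = np.linalg.det(Sigma)
rhs = np.linalg.det(D) / np.linalg.det(np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)
print(np.isclose(lhs, rhs))                       # True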
123PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
125Example Microarray Data: Colon Data of Alon et al. (1999)
M = 62 (40 tumours + 22 normals) tissue samples on N = 2,000 genes, giving a 2,000 x 62 data matrix.
128Mixture of 2 normal components
129Mixture of 2 t components
131Clustering of COLON Data Genes using EMMIX-GENE
132Grouping for Colon Data
135Clustering of COLON Data Tissues using EMMIX-GENE
136Grouping for Colon Data
137Heat Map Displaying the Reduced Set of 4,869
Genes on the 98 Breast Cancer Tumours
138Heat Map of Top 1867 Genes
144where i = group number, mi = number of genes in group i, and Ui = −2 log λi.
145Heat Map of Genes in Group G1
146Heat Map of Genes in Group G2
147Heat Map of Genes in Group G3
148Number of Components in a Mixture Model
- Testing for the number of components, g, in a
mixture is an important but very difficult
problem which has not been completely resolved.
149Order of a Mixture Model
- A mixture density with g components might be
empirically indistinguishable from one with
either fewer than g components or more than g
components. It is therefore sensible in practice
to approach the question of the number of
components in a mixture model in terms of an
assessment of the smallest number of components
in the mixture compatible with the data.
150Likelihood Ratio Test Statistic
- An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the likelihood ratio test statistic (LRTS), −2 log λ. Suppose we wish to test the null hypothesis
  H0: g = g0   versus   H1: g = g1
for some g1 > g0.
151- We let Ψ̂i denote the MLE of Ψ calculated under Hi (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if −2 log λ is sufficiently large, where
  −2 log λ = 2{ log L(Ψ̂1) − log L(Ψ̂0) }.
152Bootstrapping the LRTS
- McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing H0: g = g0 for a specified value of g0.
153Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) is given by
  log L(Ψ̂) − (d/2) log n
as the penalized log likelihood to be maximized in model selection, including the present situation for the number of components g in a mixture model (d is the number of free parameters in the model and n is the sample size).
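A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) choosing g by BIC. Note that scikit-learn's bic() returns −2 log L + d log n, so the best g minimizes it, which is equivalent to maximizing the penalized log likelihood shown above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])

# sklearn's bic() returns -2 log L + d log n, so the best g minimizes it
bics = {g: GaussianMixture(n_components=g, random_state=0).fit(X).bic(X)
        for g in range(1, 6)}
g_best = min(bics, key=bics.get)
print(bics, g_best)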
154 Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)
155Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
156Supervised Classification (Two Classes)
Samples 1, ..., n on genes 1, ..., p, with each sample labelled as Class 1 (good prognosis) or Class 2 (poor prognosis).
157Microarray to be used as routine clinical
screen by C. M. Schubert Nature Medicine 9, 9,
2003.
The Netherlands Cancer Institute in Amsterdam is
to become the first institution in the world to
use microarray techniques for the routine
prognostic screening of cancer patients. Aiming
for a June 2003 start date, the center will use a
panoply of 70 genes to assess the tumor profile
of breast cancer patients and to determine which
women will receive adjuvant treatment after
surgery.
158Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, 6562-6566, May 14, 2002.
http://www.pnas.org/cgi/content/full/99/10/6562
159Supervised Classification of Tissue Samples
160Sample 1, Sample 2, ..., Sample M (columns: expression signatures); Gene 1, Gene 2, ..., Gene N (rows: expression profiles).
161LINEAR CLASSIFIER
FORM:
  c(y) = β0 + β^T y
for the prediction of the group label z of a future entity with feature vector y.
162FISHER'S LINEAR DISCRIMINANT FUNCTION
  β = S^(-1)(ȳ1 − ȳ2),   β0 = −(1/2)(ȳ1 + ȳ2)^T S^(-1)(ȳ1 − ȳ2),
where ȳ1, ȳ2 and S are the sample means and pooled sample covariance matrix found from the training data.
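A minimal Python sketch (not from the tutorial; numpy assumed) of Fisher's rule built from the training-sample means and pooled covariance matrix; the simulated data and the helper name fisher_rule are illustrative only.

import numpy as np

def fisher_rule(X1, X2):
    """Fisher's linear discriminant: assign y to class 1 if
    (ybar1 - ybar2)' S^(-1) (y - (ybar1 + ybar2)/2) > 0."""
    y1, y2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = X1.shape[0], X2.shape[0]
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, y1 - y2)
    beta0 = -0.5 * beta @ (y1 + y2)
    return lambda y: 1 if beta @ y + beta0 > 0 else 2

rng = np.random.default_rng(6)
X1 = rng.normal(0, 1, size=(20, 3))
X2 = rng.normal(1.5, 1, size=(20, 3))
classify = fisher_rule(X1, X2)
print(classify(np.zeros(3)), classify(1.5 * np.ones(3)))   # 1 2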
163SUPPORT VECTOR CLASSIFIER (Vapnik, 1995)
β0 and β are obtained as follows:
  minimize ||β||   subject to   zj(β0 + β^T yj) ≥ 1 − ξj,   ξj ≥ 0,   Σj ξj ≤ constant,
where the ξj are the slack variables (ξj = 0 for all j in the separable case).
164The solution for β has the form
  β = Σj α̂j zj yj,
with non-zero coefficients α̂j only for those observations j for which the constraints are exactly met (the support vectors).
165Support Vector Machine (SVM)
REPLACE the inner product yj^T yk BY K(yj, yk), where the kernel function K(·,·) is the inner product in the transformed feature space.
166HASTIE et al. (2001, Chapter 12): The Lagrange (primal) function is
  LP = (1/2)||β||² + C Σj ξj − Σj αj [ zj(yj^T β + β0) − (1 − ξj) ] − Σj μj ξj,   (1)
which we minimize w.r.t. β, β0 and the ξj. Setting the respective derivatives to zero, we get
  β = Σj αj zj yj,   (2)
  0 = Σj αj zj,   (3)
  αj = C − μj,   (4)
with αj, μj, ξj ≥ 0 for all j.
167By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
  LD = Σj αj − (1/2) Σj Σk αj αk zj zk yj^T yk.   (5)
We maximize (5) subject to 0 ≤ αj ≤ C and Σj αj zj = 0. In addition to (2) to (4), the constraints include
  αj [ zj(yj^T β + β0) − (1 − ξj) ] = 0,   (6)
  μj ξj = 0,   (7)
  zj(yj^T β + β0) − (1 − ξj) ≥ 0,   (8)
for all j. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problems.
168Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.
169GUYON, WESTON, BARNHILL and VAPNIK (2002, Machine Learning)
- LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
- COLON DATA: using only 4 genes, the CVE is 2%
170Since p >> n, consideration is given to the selection of suitable genes:
SVM: FORWARD or BACKWARD selection (in terms of the magnitude of the weight βi); RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD selection ONLY (in terms of the CVE)
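A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of backward gene selection by SVM-RFE, ranking genes by the magnitude of the linear SVM weights; the simulated data are illustrative only.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Simulated data: n = 60 tissues, p = 500 genes, only the first 10 informative
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 500))
X[:, :10] += y[:, None] * 1.5

# Linear SVM; RFE repeatedly drops the genes with the smallest |weight|
svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=10, step=0.1).fit(X, y)
selected_genes = np.where(rfe.support_)[0]
print(selected_genes)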
171GUYON et al. (2002)
LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
COLON DATA: using only 4 genes, the CVE is 2%
172GUYON et al. (2002)
The success of the RFE indicates that RFE has a
built in regularization mechanism that we do not
understand yet that prevents overfitting the
training data in its selection of gene subsets.
173Error Rate Estimation
Suppose there are two groups G1 and G2
c(y) is a classifier formed from the data set
(y1, y2, y3,, yn)
The apparent error is the proportion of the data
set misallocated by c(y).
174Cross-Validation (leave-one-out)
First delete the observation y1 from the training set, and form the classifier c(1)(y) from this reduced set.
Then use c(1)(y1) to allocate y1 to either G1 or G2.
175Repeat this process for the second data point, y2, so that this point is assigned to either G1 or G2 on the basis of the classifier c(2)(y2). And so on, up to yn.
176Ten-Fold Cross-Validation
The data are split into 10 blocks; in turn, each block is used as the test set, with the remaining 9 blocks forming the training set.
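A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) contrasting the internal (biased) and external (corrected) ways of combining gene selection with ten-fold cross-validation, which is the selection-bias point of Ambroise and McLachlan (2002); the data are pure noise, so an honest error rate should be near 0.5.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 2000))       # no real class difference

# Biased: genes selected on ALL the data before cross-validation
top = SelectKBest(f_classif, k=20).fit(X, y).get_support()
biased = cross_val_score(SVC(kernel="linear"), X[:, top], y, cv=10).mean()

# Corrected (external): selection is redone inside every training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
unbiased = cross_val_score(pipe, X, y, cv=10).mean()

# The biased error rate is typically far too optimistic; the external one is near 0.5
print(round(1 - biased, 2), round(1 - unbiased, 2))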
177Figure 1 Error rates of the SVM rule with RFE
procedure averaged over 50 random splits of colon
tissue samples
178Figure 3 Error rates of Fishers rule with
stepwise forward selection procedure using all
the colon data
179Figure 5 Error rates of the SVM rule averaged
over 20 noninformative samples generated by
random permutations of the class labels of the
colon tumor tissues
180 ADDITIONAL REFERENCES Selection bias
ignored XIONG et al. (2001, Molecular Genetics
and Metabolism) XIONG et al. (2001, Genome
Research) ZHANG et al. (2001, PNAS)
Aware of selection bias SPANG et al. (2001,
Silico Biology) WEST et al. (2001,
PNAS) NGUYEN and ROCKE (2002)
181BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:
  B.632 = 0.368 AE + 0.632 B1,
where AE is the apparent error rate and B1 is the bootstrap error rate when the rule is applied to points not in the training (bootstrap) sample. A Monte Carlo estimate of B1 averages, over the bootstrap replications, the error of each bootstrap rule on the original observations omitted from that bootstrap sample.
182Toussaint and Sharpe (1975) proposed the ERROR RATE ESTIMATOR
  A(w) = (1 − w) AE + w CV1E,
where CV1E denotes the leave-one-out cross-validated error rate.
McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of the sample sizes and the Mahalanobis distance between the two groups.
183The .632+ estimate of Efron and Tibshirani (1997, JASA):
  B.632+ = (1 − w) AE + w B1,   with   w = 0.632 / (1 − 0.368 r),
where r is the relative overfitting rate, r = (B1 − AE) / (γ − AE), and γ is an estimate of the no-information error rate.
If r = 0, then w = 0.632 and B.632+ = B.632; if r = 1, then w = 1 and B.632+ = B1.
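A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of the plain .632 estimator, B.632 = 0.368 AE + 0.632 B1, with B1 estimated by Monte Carlo from the out-of-bootstrap points; the .632+ adjustment via the relative overfitting rate is omitted for brevity.

import numpy as np
from sklearn.svm import SVC

def err632(X, y, make_clf, B=50, rng=None):
    """Efron's .632 error estimator: 0.368 * apparent error + 0.632 * B1,
    where B1 is the error of bootstrap-trained rules on the points left
    out of each bootstrap sample."""
    rng = np.random.default_rng(rng)
    n = len(y)
    ae = np.mean(make_clf().fit(X, y).predict(X) != y)   # apparent error
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)            # points not drawn
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue
        clf = make_clf().fit(X[idx], y[idx])
        errs.append(np.mean(clf.predict(X[oob]) != y[oob]))
    b1 = np.mean(errs)
    return 0.368 * ae + 0.632 * b1

rng = np.random.default_rng(9)
y = np.repeat([0, 1], 25)
X = rng.normal(size=(50, 10)) + y[:, None]
print(round(err632(X, y, lambda: SVC(kernel="linear"), B=50, rng=0), 3))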
184MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times each gene was selected.
No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
185MARKER GENES FOR HARVARD DATA
tubulin, alpha, ubiquitous Cluster Incl N90862 cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4) DEK oncogene (DNA binding) Cluster Incl AF035316 transducin-like enhancer of split 2, homolog of Drosophila E(sp1) ADP-ribosyltransferase (NAD poly (ADP-ribose) polymerase) benzodiazapine receptor (peripheral) Cluster Incl D21063 galactosidase, beta 1 high-mobility group (nonhistone chromosomal) protein 2 cold inducible RNA-binding protein Cluster Incl U79287 BAF53 tubulin, beta polypeptide thromboxane A2 receptor H1 histone family, member X Fc fragment of IgG, receptor, transporter, alpha sine oculis homeobox (Drosophila) homolog 3 transcriptional intermediary factor 1 gamma transcription elongation factor A (SII)-like 1 like mouse brain protein E46 minichromosome maintenance deficient (mis5, S. pombe) 6 transcription factor 12 (HTF4, helix-loop-helix transcription factors 4) guanine nucleotide binding protein (G protein), gamma 3, linked dihydropyrimidinase-like 2 Cluster Incl AI951946 transforming growth factor, beta receptor II (70-80kD) protein kinase C-like 1
No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
186Breast cancer data set of van 't Veer et al. (van 't Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415).
These data were the result of microarray
experiments on three patient groups with
different classes of breast cancer tumours. The
overall goal was to identify a set of genes that
could distinguish between the different tumour
groups based upon the gene expression information
for these groups.
187van de Vijver et al. (2002) considered a further
234 breast cancer tumours but have only made
available the data for the top 70 genes based on
the previous study of van 't Veer et al. (2002).
188Number of Genes Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70) Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70) Error Rate for 5422 Genes (with correction for Selection Bias)
1 0.50 0.53 0.56
2 0.32 0.41 0.44
4 0.26 0.40 0.41
8 0.27 0.32 0.43
16 0.28 0.31 0.35
32 0.22 0.35 0.34
64 0.20 0.34 0.35
70 0.19 0.33 -
128 - - 0.39
256 - - 0.33
512 - - 0.34
1024 - - 0.33
2048 - - 0.37
4096 - - 0.40
5422 - - 0.44
190Nearest-Shrunken Centroids (Tibshirani et al., 2002)
The usual estimates of the class means ȳiv are shrunk toward the overall mean ȳv of the data by soft thresholding:
  ȳ'iv = ȳv + mi (sv + s0) d'iv,   with   d'iv = sign(div) (|div| − Δ)+,
where div = (ȳiv − ȳv) / {mi (sv + s0)}, sv is the pooled within-class standard deviation of gene v, s0 is a small positive constant, mi is a standardizing constant depending on the class sample sizes, and Δ is the shrinkage threshold.
191The nearest-centroid rule assigns a tissue with feature vector y to the class whose centroid is closest in the gene-wise standardized distance, where yv is the vth element of the feature vector y and the contribution of gene v is scaled by (sv + s0)².
192In the previous definition, we replace the sample mean ȳiv of the vth gene by its shrunken estimate ȳ'iv.
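A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed). scikit-learn's NearestCentroid with a shrink_threshold implements soft-thresholding of the class centroids toward the overall centroid, which is the idea above, though its details are not identical to the procedure of Tibshirani et al. (2002); the data are simulated.

import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
y = np.repeat([0, 1], 31)                       # 62 tissues, as in the colon data
X = rng.normal(size=(62, 2000))
X[y == 1, :50] += 1.0                           # 50 informative genes

# shrink_threshold soft-thresholds each class centroid towards the overall
# centroid, zeroing out genes that do not discriminate between the classes
for delta in [0.0, 0.2, 0.5]:
    clf = NearestCentroid(shrink_threshold=delta if delta > 0 else None)
    err = 1 - cross_val_score(clf, X, y, cv=10).mean()
    print(delta, round(err, 3))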
193Comparison of Nearest-Shrunken Centroids with the SVM
Apply (i) nearest-shrunken centroids and (ii) the SVM with RFE to the colon data set of Alon et al. (1999), with N = 2000 genes and M = 62 tissues (40 tumours, 22 normals).
194Nearest-Shrunken Centroids applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
195SVM with RFE applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
196Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
197Breast tumours have a genetic signature. The
expression pattern of a set of 70 genes can
predict whether a tumour is going to prove
lethal, despite treatment, or not. This gene
expression profile will outperform all currently
used clinical parameters in predicting disease
outcome.
van 't Veer et al. (2002), van de Vijver et al. (2002)
198Problems
- Censored observations: the time of occurrence of the event (death) has not yet been observed.
- Small sample sizes: the study is limited by patient numbers.
- Specific patient group: is the study applicable to other populations?
- Difficulty in integrating different studies (different microarray platforms).
199A Case Study The Lung Cancer data sets from
CAMDA03
Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario). The challenge: to integrate information from the different data sets (2 Affymetrix chips of different versions, 2 cDNA arrays). The final goal: to make an impact on cancer biology and eventually patient care. "Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis" (Park et al., Bioinformatics S120, 2002).
200Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al., 2000). Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox. Unsupervised vs. supervised methods: the semi-supervised approach of Bair and Tibshirani (2004) combines gene expression data with the clinical data.
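A minimal Python sketch (not from the tutorial) of the survival-analysis steps above, assuming the lifelines package (plus numpy and pandas) is available: cluster-specific Kaplan-Meier estimates, a log-rank comparison, and a Cox proportional hazards fit with the cluster label as covariate; the data are simulated.

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(11)
n = 40
cluster = rng.integers(0, 2, size=n)                     # cluster label per tissue
time = rng.exponential(scale=np.where(cluster == 0, 20, 60))
event = rng.uniform(size=n) < 0.7                        # 1 = recurrence, 0 = censored

# Kaplan-Meier curve for each cluster, plus a log-rank comparison
for c in (0, 1):
    km = KaplanMeierFitter().fit(time[cluster == c], event[cluster == c],
                                 label=f"cluster {c}")
    print(km.median_survival_time_)
print(logrank_test(time[cluster == 0], time[cluster == 1],
                   event[cluster == 0], event[cluster == 1]).p_value)

# Cox proportional hazards model with the cluster label as a covariate
df = pd.DataFrame({"time": time, "event": event.astype(int), "cluster": cluster})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.summary[["exp(coef)", "p"]])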
201AIM: To link gene-expression data with survival from lung cancer in the CAMDA03 challenge.
A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.
202Lung Cancer
Approximately 80% of lung cancer patients have NSCLC (of which adenocarcinoma is the most common form). All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases). Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery. We want a prognostic test for early-stage lung adenocarcinoma to identify patients more likely to recur, and therefore who would benefit from adjuvant therapy.
203Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001),
Bhattacharjee et al. (2001), Beer et al. (2002).
204Heat Map for 2880 Ontario Genes (39 Tissues)
Genes
Tissues
206Heat Maps for the 20 Ontario Gene-Groups (39
Tissues)
Genes
Tissues
Tissues are ordered as Recurrence (1-24) and
Censored (25-39)
207Expression Profiles for Useful Metagenes (Ontario
39 Tissues)
Gene Group 1
Gene Group 2
Our Tissue Cluster 1
Our Tissue Cluster 2
Log Expression Value
Recurrence (1-24)
Censored (25-39)
Gene Group 19
Gene Group 20
Tissues
208Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of the 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31 tissues): 23 (recurrence) plus 8 (censored) - poor prognosis
CLUSTER 2 (8 tissues): 1 (recurrence) plus 7 (censored) - good prognosis
209SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
  S(t) = π1 S1(t) + π2,
where T is the time to recurrence and π1 = 1 − π2 is the prior probability of recurrence. Adopt a Weibull model for the survival function for recurrence, S1(t).
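A minimal Python sketch (not from the tutorial; numpy and scipy assumed) fitting the long-term survivor model above, S(t) = π1 S1(t) + π2 with a Weibull S1, by maximizing the censored-data likelihood; the parameterization and the simulated data are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def lts_negloglik(theta, t, delta):
    """Negative log-likelihood of the long-term survivor model
    S(t) = pi1 * exp(-(t/b)**a) + (1 - pi1),
    where delta = 1 for a recurrence and 0 for a censored time."""
    pi1, a, b = theta
    S1 = np.exp(-(t / b) ** a)
    f1 = (a / b) * (t / b) ** (a - 1) * S1            # Weibull density
    like = np.where(delta == 1, pi1 * f1, pi1 * S1 + (1 - pi1))
    return -np.sum(np.log(like))

rng = np.random.default_rng(12)
n = 150
cured = rng.uniform(size=n) < 0.4                     # long-term survivors
t_event = rng.weibull(1.5, size=n) * 20               # recurrence times
t_cens = rng.uniform(5, 60, size=n)                   # censoring times
t = np.where(cured, t_cens, np.minimum(t_event, t_cens))
delta = (~cured & (t_event <= t_cens)).astype(int)

res = minimize(lts_negloglik, x0=[0.5, 1.0, 10.0], args=(t, delta),
               method="L-BFGS-B",
               bounds=[(0.01, 0.99), (0.1, 10), (0.1, 200)])
pi1_hat, a_hat, b_hat = res.x
print(round(pi1_hat, 2), round(a_hat, 2), round(b_hat, 2))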
210Fitted LTS Model vs. Kaplan-Meier
211PCA of Tissues Based on Metagenes
Second PC
First PC
212PCA of Tissues Based on Metagenes
Second PC
First PC
213PCA of Tissues Based on All Genes (via SVD)
Second PC
First PC
214PCA of Tissues Based on All Genes (via SVD)
Second PC
First PC
215Cluster-Specific Kaplan-Meier Plots
216Survival Analysis for the Ontario Dataset
Cluster   No. of Tissues   No. Censored   Mean Time to Failure (±SE)
1         29               8              665 ± 85.9
2         8                7              1388 ± 155.7
There is a significant difference between the Kaplan-Meier estimates for the two clusters (P = 0.027).
- Cox's proportional hazards analysis:
Variable                        Hazard Ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2         6.78 (0.9 - 51.5)       0.06
Tumor stage (I vs. II-III)      1.07 (0.57 - 2.0)       0.83
217Discriminant Analysis (Supervised
Classification) A prognosis classifier was
developed to predict the class of origin of a
tumor tissue with a small error rate after
correction for the selection bias. A support
vector machine (SVM) was adopted to identify
important genes that play a key role on
predicting the clinical outcome, using all the
genes, and the metagenes. A cross-validation
(CV) procedure was used to calculate the
prediction error, after correction for the
selection bias.
218ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of error rate (CV10E) against log2(number of genes).]
Ten-fold cross-validation error rate (CV10E) of the SVM, applied to the g = 2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).
219STANFORD DATA
918 genes based on 73 tissue samples from 67
patients. Row and column normalized, retained
451 genes after select-genes step. Used 20
metagenes to cluster tissues. Retrieved
histological groups.
220Heat Maps for the 20 Stanford Gene-Groups (73
Tissues)
Genes
Tissues
Tissues are ordered by their histological
classification Adenocarcinoma (1-41), Fetal Lung
(42), Large cell (43-47), Normal (48-52),
Squamous cell (53-68), Small cell (69-73)
221STANFORD CLASSIFICATION
Cluster 1: tissues 1-19 (good prognosis)
Cluster 2: tissues 20-26 (long-term survivors)
Cluster 3: tissues 27-35 (poor prognosis)
222Heat Maps for the 15 Stanford Gene-Groups (35
Tissues)
Genes
Tissues
Tissues are ordered by the Stanford
classification into AC groups AC group 1 (1-19),
AC group 2 (20-26), AC group 3 (27-35)
223Expression Profiles for Top Metagenes (Stanford
35 AC Tissues)
Gene Group 1
Gene Group 2
Stanford AC group 1
Stanford AC group 2
Stanford AC group 3
Misallocated
Log Expression Value
Gene Group 4
Gene Group 3
Tissues
224Cluster-Specific Kaplan-Meier Plots
225Cluster-Specific Kaplan-Meier Plots
226Survival Analysis for the Stanford Dataset
Cluster   No. of Tissues   No. Censored   Mean Time to Failure (±SE)
1         17               10             37.5 ± 5.0
2         5                0              5.2 ± 2.3
There is a significant difference in survival between the clusters (P < 0.001).
- Cox's proportional hazards analysis:
Variable                          Hazard Ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1 and 2    13.2 (2.1 - 81.1)       0.005
Grade 3 vs. grades 1 or 2         1.94 (0.5 - 8.5)        0.38
Tumor size                        0.96 (0.3 - 2.8)        0.93
No. of tumors in lymph nodes      1.65 (0.7 - 3.9)        0.25
Presence of metastases            4.41 (1.0 - 19.8)       0.05
227Survival Analysis for the Stanford Dataset
- Univariate Cox's proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
228Survival Analysis for the Stanford Dataset
- Multivariate Cox's proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031
The final model consists of four metagenes.
229STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of error rate (CV10E) against log2(number of genes).]
Ten-fold cross-validation error rate (CV10E) of the SVM, applied to the g = 2 clusters.
230CONCLUSIONS
- We applied a model-based clustering approach to classify tumors, using their gene signatures, into
  (a) clusters corresponding to tumor type
  (b) clusters corresponding to clinical outcomes for tumors of a given subtype
- In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).
231CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical
outcomes (e.g. recurrence/recurrence-free and
death/long-term survival). We were able to show
that gene-expression data provide prognostic
information, beyond that of clinical indicators
such as stage.
232CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant
analysis using support vector machines (SVM)
demonstrated further the potential of gene
expression as a tool for guiding treatment
therapy and patient care to lung cancer patients.
This supervised classification procedure was
used to provide marker genes for prediction of
clinical outcomes. (In addition to those
provided by the cluster-genes step in the initial
unsupervised classification.)
233LIMITATIONS
Small number of tumors available (e.g. Ontario and Stanford datasets). Clinical data available for only subsets of the tumors, often for only one tumor type (AC). A high proportion of censored observations limits the comparison of survival rates.