Title: Classification of Microarray Data - Recent Statistical Approaches
1Classification of Microarray Data - Recent Statistical Approaches
Geoff McLachlan and Liat Ben-Tovim Jones
Department of Mathematics and Institute for Molecular Bioscience, University of Queensland
Tutorial for the APBC, January 2005
2Institute for Molecular Bioscience, University of Queensland
3Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
4Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
BREAK
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
6Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
7"Large-scale gene expression studies are not a passing fashion, but are instead one aspect of a new kind of biological experimentation, one involving large-scale, high-throughput assays."
Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC
8Growth of the microarray and microarray methodology literature listed in PubMed from 1995 to 2003. The category "all microarray papers" includes those found by searching PubMed for "microarray" OR "gene expression profiling". The category "statistical microarray papers" includes those found by searching PubMed for ("statistical method" OR "statistical techniq" OR "statistical approach") AND ("microarray" OR "gene expression profiling").
9A microarray is a new technology which allows the
measurement of the expression levels of
thousands of genes simultaneously.
Two developments made this possible:
- (1) sequencing of the genome (human, mouse, and others)
- (2) improvement in technology to generate high-density arrays on chips (glass slides or nylon membranes).
The entire genome of an organism can be probed at
a single point in time.
10(1) mRNA Levels Indirectly Measure Gene Activity
Every cell contains the same DNA.
Cells differ in the DNA (gene) which is active at any one time.
Genes code for proteins through the intermediary of mRNA.
The activity of a gene (its expression) can therefore be determined by the presence of its complementary mRNA.
11Target and Probe DNA
Probe DNA - known
Sample (target) - unknown
12(2) Microarrays Indirectly Measure Levels of mRNA
- mRNA is extracted from the cell
- mRNA is reverse transcribed to cDNA (mRNA itself is unstable)
- cDNA is labeled with fluorescent dye (the TARGET)
- The sample is hybridized to known DNA sequences on the array (tens of thousands of genes; the PROBE)
- If present, complementary target binds to probe DNA (complementary base pairing)
- Target bound to probe DNA fluoresces
13Spotted cDNA Microarray
Compare the gene expression levels for two cell
populations on a single microarray.
15Microarray Image
Red: high expression in the target labelled with cyanine 5 dye
Green: high expression in the target labelled with cyanine 3 dye
Yellow: similar expression in both target samples
16Assumptions
Gene Expression → mRNA → Fluorescence Intensity
(1) Cellular mRNA levels directly reflect gene expression.
(2) The intensity of bound target is a measure of the abundance of the mRNA in the sample.
17Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
18The Microarray Technologies
Spotted Microarray: cDNAs, clones, or short and long oligonucleotides deposited onto glass slides. Each gene (or EST) is represented by its purified PCR product. Simultaneous analysis of two samples (treated vs untreated cells) provides an internal control. Gives relative gene expressions.
Affymetrix GeneChip: short oligonucleotides synthesized in situ onto glass wafers. Each gene is represented multiply, using 16-20 (preferably non-overlapping) 25-mers. Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity. Gives absolute gene expressions.
Each with its own advantages and disadvantages.
19Pros and Cons of the Technologies
Affymetrix GeneChip: more expensive yet less flexible. Good for whole-genome expression analysis where the genome of that organism has been sequenced. High quality with little variability between slides. Gives a measure of absolute expression of genes.
Spotted Microarray: flexible and cheaper. Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions). Variability in spot quality from slide to slide. Provides information only on relative gene expressions between cells or tissue samples.
20Aims of a Microarray Experiment
- To observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins)
- To compare gene expressions between different tissue types (tumour vs normal cell samples)
- To gain understanding of
- the function of unknown genes
- the disease process at the molecular level
- Ultimately, to use as tools in clinical medicine for diagnosis, prognosis and therapeutic management.
21Importance of Experimental Design
- Good DNA microarray experiments should have clear objectives.
- They are not performed as aimless data mining in search of unanticipated patterns that will provide answers to unasked questions.
- (Richard Simon, BioTechniques 34: S16-S21, 2003)
22Replicates
Technical replicates: arrays that have been hybridized to the same biological source (using the same treatment, protocols, etc.).
Biological replicates: arrays that have been hybridized to different biological sources, but with the same preparation, treatments, etc.
23Extracting Data from the Microarray
- Cleaning
- Image processing
- Filtering
- Missing value estimation
- Normalization
- Remove sources of systematic variation.
Sample 1, Sample 2, Sample 3, Sample 4, etc.
24Gene Expressions from Measured Intensities
Spotted Microarray: log2(Cy5 intensity / Cy3 intensity)
Affymetrix: Perfect Match Intensity − Mismatch Intensity
25Data Transformation
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
26Representation of Data from M Microarray Experiments
Assume we have extracted gene expression values from the intensities.
Columns: Sample 1, Sample 2, ..., Sample M (each column is an expression signature).
Rows: Gene 1, Gene 2, ..., Gene N (each row is an expression profile).
Gene expressions can be shown as heat maps.
29Microarrays present new problems for statistics
because the data is very high dimensional with
very little replication.
30Gene Expression Data Represented as an N x M Matrix
N rows correspond to the N genes (expression profiles).
M columns correspond to the M samples, i.e. the microarray experiments (expression signatures).
31Microarray Data Notation
Represent the data by the N x M matrix A.
Classifying tissues on gene expressions: A = (y1, ..., yM), where the feature vector yj contains the expression levels of the N genes in the jth experiment (j = 1, ..., M); yj is an expression signature.
Classifying genes on the tissues: A^T = (y1, ..., yN), where the feature vector yj contains the expression levels over the M tissues for the jth gene (j = 1, ..., N); yj is an expression profile.
32In the N x M matrix A: N = no. of genes (10^3 - 10^4), M = no. of tissues (10 - 10^2).
Classification of tissues on gene expressions: standard statistical methodology is appropriate when M >> N, BUT here N >> M.
Classification of genes on the basis of the tissues: falls in the standard framework, BUT not all the genes are independently distributed.
33Mehta et al. (Nature Genetics, Sept. 2004):
The field of expression data analysis is particularly active, with novel analysis strategies and tools being published weekly, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of forensic statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003), who doggedly detect and correct other HDB (high-dimensional biology) investigators' prominent mistakes, has been created.
34Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
36Class 1
Class 2
37Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
38Multiplicity Problem
Perform a test for each gene to determine the
statistical significance of differential
expression for that gene. Problem: when many hypotheses are tested, the probability of a type I error (false positive) increases sharply with the number of hypotheses.
Further complicated by gene co-regulation and
subsequent correlation between the test
statistics.
39Example
Suppose we measure the expression of 10,000 genes
in a microarray experiment.
If all 10,000 genes were not differentially expressed, then we would expect:
for P = 0.05 for each test, 500 false positives;
for P = 0.05/10,000 for each test, 0.05 false positives.
40Methods for dealing with the Multiplicity Problem
- The Bonferroni Method controls the family-wise error rate (FWER). (FWER is the probability that at least one false positive error will be made.) But this method is very conservative, as it tries to make it unlikely that even one false rejection is made.
- The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.
41Test of a Single Hypothesis
The M tissue samples are classified with respect to g classes on the basis of the N gene expressions. Assume that there are ni tissue samples from class Ci (i = 1, ..., g), where M = n1 + ... + ng.
Take the simple case where g = 2. The aim is to detect whether some of the thousands of genes have different expression levels in class C1 than in class C2.
42Test of a Single Hypothesis (cont.)
For gene j, let Hj denote the null hypothesis of no association between its expression level and membership of the class (j = 1, ..., N).
Hj = 0: the null hypothesis for the jth gene holds.
Hj = 1: the null hypothesis for the jth gene does not hold.
              Retain Null       Reject Null
Hj = 0        correct           type I error
Hj = 1        type II error     correct
43Two-Sample t-Statistic
Student's t-statistic for gene j (allowing unequal class variances):
  Tj = (ȳ1j − ȳ2j) / sqrt(s1j²/n1 + s2j²/n2)
44Two-Sample t-Statistic
Pooled form of Student's t-statistic, assuming a common variance in the two classes:
  Tj = (ȳ1j − ȳ2j) / (sj sqrt(1/n1 + 1/n2)),   where sj² is the pooled sample variance for gene j
45Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001): a small constant s0 is added to the denominator to damp the effect of genes with very small variances:
  Tj = (ȳ1j − ȳ2j) / (sj sqrt(1/n1 + 1/n2) + s0)
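A minimal Python sketch (not part of the original tutorial; numpy is assumed) of the three gene-wise statistics above. Tusher et al. (2001) choose the fudge constant s0 from the data; the fixed s0 used below is an illustrative assumption only.

import numpy as np

def two_sample_t_stats(X1, X2, s0=0.0):
    """Gene-wise two-sample t-statistics.

    X1 : (n1, N) expressions for class 1 (rows = samples, columns = genes)
    X2 : (n2, N) expressions for class 2
    s0 : fudge constant added to the denominator (SAM-style); s0 = 0 gives
         the usual pooled t-statistic.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0, ddof=1), X2.var(axis=0, ddof=1)

    # Unequal-variance (Welch-type) statistic
    t_welch = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

    # Pooled-variance statistic (assumes a common variance in the two classes)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    t_pooled = (m1 - m2) / se

    # Modified (SAM-style) statistic: add s0 to the denominator
    t_mod = (m1 - m2) / (se + s0)
    return t_welch, t_pooled, t_mod

# Simulated example: 7 vs 8 samples, 1000 genes
rng = np.random.default_rng(0)
X1 = rng.normal(size=(7, 1000))
X2 = rng.normal(size=(8, 1000))
t_w, t_p, t_m = two_sample_t_stats(X1, X2, s0=0.1)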
46Multiple Hypothesis Testing
Consider measures of error for multiple
hypotheses. Focus on the rate of false positives
with respect to the number of rejected
hypotheses, Nr.
47Possible Outcomes for N Hypothesis Tests
48Possible Outcomes for N Hypothesis Tests
The FWER is the probability of getting one or more false positives out of all the hypotheses tested.
49Bonferroni method for controlling the FWER
Consider N hypothesis tests H0j versus H1j, j = 1, ..., N, and let P1, ..., PN denote the N P-values for these tests.
The Bonferroni Method: given P-values P1, ..., PN, reject the null hypothesis H0j if
  Pj < α / N.
50False Discovery Rate (FDR)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
51Possible Outcomes for N Hypothesis Tests
52Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
- Genovese and Wasserman (2002)
- Storey (2002, 2003)
- Storey and Tibshirani (2003a, 2003b)
- Storey, Taylor and Siegmund (2004)
- Black (2004)
- Cox and Wong (2004)
53Benjamini-Hochberg (BH) Procedure
Controls the FDR at level α when the P-values following the null distribution are independent and uniformly distributed.
(1) Let P(1) ≤ ... ≤ P(N) be the ordered observed P-values.
(2) Calculate k̂ = max{ k : P(k) ≤ kα/N }.
(3) If k̂ exists, then reject the null hypotheses corresponding to P(1), ..., P(k̂). Otherwise, reject nothing.
54Example Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are
carried out leading to the following ordered
P-values
0.00017 0.00448 0.00671 0.00907
0.01220 0.33626 0.39341 0.53882 0.58125
0.98617
(a) With α = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than α/10 = 0.005. Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that P(k) ≤ kα/N. Here k = 5, thus we reject the first five hypotheses.
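A minimal Python sketch (not from the tutorial; numpy assumed) applying the Bonferroni and BH rules to the ten P-values listed above; the function names are illustrative only.

import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0j whenever Pj < alpha / N (FWER control)."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (FDR control):
    find the largest k with P(k) <= k * alpha / N and reject the
    hypotheses with the k smallest P-values."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= (np.arange(1, N + 1) * alpha / N)
    reject = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # index of the largest k satisfying the bound
        reject[order[:k + 1]] = True
    return reject

# Ordered P-values from the slide above
p = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
     0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
print(bonferroni_reject(p).sum())   # 2 rejections
print(bh_reject(p).sum())           # 5 rejections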
55SHORT BREAK
56Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
57Using just the B permutations of the class labels for the gene-specific statistic Tj, the P-value for Tj = tj is assessed as
  pj = #{ b : |t0j^(b)| ≥ |tj| } / B,
where t0j^(b) is the null version of tj after the bth permutation of the class labels.
58If we pool over all N genes, then
  pj = Σ(i=1..N) #{ b : |t0i^(b)| ≥ |tj| } / (N B).
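A minimal Python sketch (not from the tutorial; numpy assumed) of the gene-specific permutation P-value described on the two slides above, using B random permutations of the class labels; the slides' small example instead enumerates all distinct permutations, and pooling across genes is omitted here.

import numpy as np

def pooled_t(x, labels):
    """Pooled two-sample t-statistic for one gene."""
    x1, x2 = x[labels == 0], x[labels == 1]
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def permutation_pvalue(x, labels, B=1000, rng=None):
    """Estimate the P-value of the observed statistic by permuting the
    class labels; resolution is of order 1/B."""
    rng = np.random.default_rng(rng)
    t_obs = pooled_t(x, labels)
    t_null = np.empty(B)
    for b in range(B):
        perm = rng.permutation(labels)          # b-th permutation of the class labels
        t_null[b] = pooled_t(x, perm)
    return np.mean(np.abs(t_null) >= abs(t_obs))

labels = np.array([0, 0, 0, 1, 1, 1])           # three samples from each class
gene1 = np.array([1.2, 0.8, 1.1, 3.0, 2.7, 3.3])
print(permutation_pvalue(gene1, labels, B=1000))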
59Null Distribution of the Test Statistic Example
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Suppose we have two classes of tissue samples,
with three samples from each class. Consider
the expressions of two genes, Gene 1 and Gene 2.
60Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
To find the null distribution of the test
statistic for Gene 1, we proceed under the
assumption that there is no difference between
the classes (for Gene 1) so that
Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1)
A6(1)
And permute the class labels
Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1)
A6(1) ... There are 10 distinct permutations.
61Ten Permutations of Gene 1
A1(1) A2(1) A3(1)   A4(1) A5(1) A6(1)
A1(1) A2(1) A4(1)   A3(1) A5(1) A6(1)
A1(1) A2(1) A5(1)   A3(1) A4(1) A6(1)
A1(1) A2(1) A6(1)   A3(1) A4(1) A5(1)
A1(1) A3(1) A4(1)   A2(1) A5(1) A6(1)
A1(1) A3(1) A5(1)   A2(1) A4(1) A6(1)
A1(1) A3(1) A6(1)   A2(1) A4(1) A5(1)
A1(1) A4(1) A5(1)   A2(1) A3(1) A6(1)
A1(1) A4(1) A6(1)   A2(1) A3(1) A5(1)
A1(1) A5(1) A6(1)   A2(1) A3(1) A4(1)
62As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this method, in that the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.
63Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1. Suppose that
Gene 2 is differentially expressed, then the
null values of the test statistic for Gene 2 will
have a mixed distribution.
64Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2)
B6(2)
Permute the class labels
Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2)
B6(2) ...
65Example of a null case with 7 N(0,1) points and 8 N(0,1) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
66Example of a non-null case with 7 N(0,1) points and 8 N(10,9) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
67The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher
et al., 2001).
The order statistics t(1), ... , t(N) are
plotted against their null expectations above.
A good test in situations where there are more
genes being over-expressed than under-expressed,
or vice-versa.
68Two-component Mixture Model Framework
69Two-component model:
  f(tj) = π0 f0(tj) + π1 f1(tj),
where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 − π0 is the proportion that are.
70Two-component model:
  f(tj) = π0 f0(tj) + π1 f1(tj),
where π0 is the proportion of genes that are not differentially expressed, and π1 = 1 − π0 is the proportion that are. Then
  τ0(tj) = π0 f0(tj) / f(tj)
is the posterior probability that gene j is not differentially expressed.
711) Form a statistic tj for each gene, where a large positive value of tj corresponds to a gene that is differentially expressed across the tissues.
2) Compute the P-values Pj corresponding to the tj and fit a mixture of beta distributions (including a uniform component) to them, where the uniform component corresponds to the class of genes that are not differentially expressed.
723) Fit to t1, ..., tN a mixture of two normal densities with a common variance, where the first component has the smaller mean (it corresponds to the class of genes that are not differentially expressed). It is assumed that the tj have been transformed so that they are (approximately) normally distributed.
4) Let τ0(tj) denote the (estimated) posterior probability that gene j belongs to the first component of the mixture.
73If we conclude that gene j is differentially expressed if
  τ0(tj) ≤ c0,
then this decision minimizes the (estimated) Bayes risk.
74Estimated FDR:
  FDR-hat = (1/Nr) Σj τ0(tj) I[τ0(tj) ≤ c0],
where Nr is the number of genes declared to be differentially expressed (those with τ0(tj) ≤ c0).
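A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of the two-component normal mixture approach: fit the mixture to the gene statistics tj, take the component with the smaller mean as the null component, and report the estimated FDR for a chosen c0. Unlike the slides, the common-variance constraint is not enforced here, and the data are simulated.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_component(t, c0=0.1):
    """Fit a two-component normal mixture to the gene statistics t; return
    the posterior probability tau0_j of the null component, the genes
    declared differentially expressed (tau0_j <= c0), and the estimated FDR
    = average tau0 over the declared genes."""
    t = np.asarray(t).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(t)
    null = np.argmin(gm.means_.ravel())        # component with the smaller mean taken as null
    tau0 = gm.predict_proba(t)[:, null]
    selected = tau0 <= c0
    Nr = selected.sum()
    fdr_hat = tau0[selected].sum() / Nr if Nr > 0 else 0.0
    return tau0, selected, fdr_hat

# Simulated example: 900 null genes and 100 differentially expressed genes
rng = np.random.default_rng(1)
t = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
tau0, selected, fdr_hat = fit_two_component(t, c0=0.2)
print(selected.sum(), round(fdr_hat, 3))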
75Suppose τ0(t) is monotonic decreasing in t. Then for
76Suppose τ0(t) is monotonic decreasing in t. Then for
77Suppose τ0(t) is monotonic decreasing in t. Then for
where
78For a desired control level α, say α = 0.05, let c0 be chosen so that the estimated FDR is no greater than α.   (1)
If τ0(t) is monotonic in t, then using (1) to control the FDR, with the proportion π0 of non-differentially expressed genes set to 1 and the null distribution taken to be the empirical distribution function, is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic tj.
79Example
The study of Hedenfalk et al. (2001) used cDNA arrays to measure the gene expressions in breast tumour tissues taken from patients with mutations in either the BRCA1 or BRCA2 genes. We consider their data set of M = 15 patients, comprising two patient groups, BRCA1-mutation positive (7) versus BRCA2-mutation positive (8), with N = 3,226 genes.
The problem is to find genes which are
differentially expressed between the BRCA1 and
BRCA2 patients.
80Example of a null case with 7 N(0,1) points and 8 N(0,1) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
81Example of a non-null case with 7 N(0,1) points and 8 N(10,9) points: histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.
82Fit a two-component normal mixture to the N values of tj (the pooled two-sample t-statistic).
The jth gene is taken to be differentially expressed if τ0(tj) ≤ c0.
83Estimated FDR for various levels of c0
c0     Nr      Estimated FDR
0.5    1702    0.29
0.4    1235    0.23
0.3    850     0.18
0.2    483     0.12
0.1    175     0.06
84Use of the P-Value as a Summary Statistic (Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj in the test of the null hypothesis of no difference in expression between the two classes.
The distribution of the P-values is modelled by the h-component mixture of beta distributions
  f(pj) = Σ(i=1..h) λi β(pj; ai1, ai2),
where a11 = a12 = 1, so that the first component is the uniform distribution.
85Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the jth gene, pj has a uniform distribution on the unit interval, i.e. the beta(1,1) distribution.
The beta(a1, a2) density is given by
  β(p; a1, a2) = p^(a1−1) (1 − p)^(a2−1) / B(a1, a2),   0 < p < 1,
where B(a1, a2) = Γ(a1) Γ(a2) / Γ(a1 + a2).
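A minimal Python sketch (not from the tutorial; numpy and scipy assumed) fitting a two-component version of the above model, a uniform (beta(1,1)) component plus one beta(a1, a2) component, to P-values by direct maximum likelihood rather than by the fitting method of Allison et al.; the data are simulated.

import numpy as np
from scipy import stats, optimize

def fit_uniform_beta_mixture(p):
    """Fit f(p) = pi0 * 1 + (1 - pi0) * beta(p; a1, a2) to P-values by
    maximum likelihood; the uniform component represents the genes that
    are not differentially expressed."""
    p = np.clip(np.asarray(p), 1e-6, 1 - 1e-6)

    def nll(theta):
        pi0, a1, a2 = theta
        dens = pi0 + (1 - pi0) * stats.beta.pdf(p, a1, a2)
        return -np.sum(np.log(dens))

    res = optimize.minimize(nll, x0=[0.8, 0.5, 2.0], method="L-BFGS-B",
                            bounds=[(1e-3, 1 - 1e-3), (1e-2, 50), (1e-2, 50)])
    pi0, a1, a2 = res.x
    # Posterior probability that each gene is not differentially expressed
    tau0 = pi0 / (pi0 + (1 - pi0) * stats.beta.pdf(p, a1, a2))
    return pi0, (a1, a2), tau0

# Simulated example: 80% uniform P-values, 20% concentrated near zero
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=800), rng.beta(0.3, 4.0, size=200)])
pi0, (a1, a2), tau0 = fit_uniform_beta_mixture(p)
print(round(pi0, 2), round(a1, 2), round(a2, 2))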
87Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
88Gene Expression Data Represented as an N x M Matrix
N rows correspond to the N genes (expression profiles).
M columns correspond to the M samples, i.e. the microarray experiments (expression signatures).
89Two Clustering Problems
- Clustering of genes on the basis of the tissues: the genes are not all independent (n = N, p = M).
- Clustering of tissues on the basis of the genes: the latter is a nonstandard problem in cluster analysis (n = M, p = N, so n << p).
91CLUSTERING OF GENES WITH REPEATED
MEASUREMENTS Medvedovic and Sivaganesan
(2002) Celeux, Martin, and Lavergne
(2004) Chapter 5 of McLachlan et al. (2004)
92UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS): INFER THE CLASS LABELS z1, ..., zn OF y1, ..., yn.
Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, and Botstein (1998, PNAS).
95Hierarchical (agglomerative) clustering
algorithms are largely heuristically motivated
and there exist a number of unresolved issues
associated with their use, including how to
determine the number of clusters.
in the absence of a well-grounded statistical
model, it seems difficult to define what is meant
by a good clustering algorithm or the right
number of clusters.
(Yeung et al., 2001, Model-Based Clustering and
Data Transformations for Gene Expression Data,
Bioinformatics 17)
96McLachlan and Khan (2004). On a resampling
approach for tests on the number of clusters with
mixture model-based clustering of the tissue
samples. Special issue of the Journal of
Multivariate Analysis 90 (2004) edited by Mark
van der Laan and Sandrine Dudoit (UC Berkeley).
97Attention is now turning towards a model-based
approach to the analysis of microarray data
For example
- Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9.
- Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18.
- Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7.
- Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3.
- Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.
98The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in
advance e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
99In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
101MIXTURE OF g NORMAL COMPONENTS
  f(y) = π1 φ(y; μ1, Σ1) + ... + πg φ(y; μg, Σg)
102MIXTURE OF g NORMAL COMPONENTS
103Equal spherical covariance matrices: Σi = σ² Ip (i = 1, ..., g)
104With a mixture model-based approach to
clustering, an observation is assigned outright
to the ith cluster if its density in the ith
component of the mixture distribution (weighted
by the prior probability of that component) is
greater than in the other (g-1) components.
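A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of this outright assignment with a fitted normal mixture model; the data are simulated and already of low dimension, since in practice the tissues' dimension must first be reduced (gene selection, metagenes, or factor analyzers) because N >> M.

import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated tissue-by-feature matrix: 20 tissues, 5 reduced features
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(10, 5)),
               rng.normal(2, 1, size=(10, 5))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)

# Outright assignment: each tissue goes to the component with the largest
# weighted density, i.e. the largest posterior probability.
posterior = gm.predict_proba(X)
cluster = posterior.argmax(axis=1)
print(cluster)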
105Figure 7: Contours of the fitted component densities on the 2nd and 3rd variates for the blue crab data set.
106Estimation of Mixture Distributions
- It was the publication of the seminal paper of
Dempster, Laird, and Rubin (1977) on the EM
algorithm that greatly stimulated interest in the
use of finite mixture distributions to model
heterogeneous data. - McLachlan and Krishnan (1997, Wiley)
107- If need be, the normal mixture model can be
made less sensitive to outlying observations by
using t component densities. - With this t mixture model-based approach, the
normal distribution for each component in the
mixture is embedded in a wider class of
elliptically symmetric distributions with an
additional parameter called the degrees of
freedom.
108The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the normal
mixture model, the outliers have to be much
larger.
109Mixtures of Factor Analyzers
A normal mixture model without restrictions on
the component-covariance matrices may be viewed
as too general for many situations in practice,
in particular, with high dimensional data. One
approach for reducing the number of parameters
is to work in a lower dimensional space by using
principal components; another is to use mixtures of factor analyzers (Ghahramani and Hinton, 1997).
110Two Groups in Two Dimensions. All cluster
information would be lost by collapsing to the
first principal component. The principal
ellipses of the two groups are shown as solid
curves.
111Mixtures of Factor Analyzers
- Principal components or a single-factor analysis
model provides only a global linear model. - A global nonlinear approach by postulating a
mixture of linear submodels
112Bi is a p x q matrix and Di is a diagonal
matrix.
113Single-Factor Analysis Model
The Uj are iid N(0, Iq), independently of the errors ej, which are iid N(0, D), where D is a diagonal matrix.
115Conditional on ith component membership of the
mixture,
where Ui1, ..., Uin are independent, identically distributed (iid) N(0, Iq), independently of the eij, which are iid N(0, Di), where Di is a diagonal matrix (i = 1, ..., g).
116There is an infinity of choices for Bi, as the model still holds if Bi is replaced by Bi Ci, where Ci is an orthogonal matrix. Choose Ci so that
  Bi^T Di^(-1) Bi
is diagonal. The number of free parameters per component-covariance matrix is then
  pq + p − q(q − 1)/2.
117The reduction in the number of parameters per component-covariance matrix is then
  p(p + 1)/2 − {pq + p − q(q − 1)/2} = {(p − q)² − (p + q)}/2.
- We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm.
1181st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of πi and μi.
2nd cycle: declare the missing data to be also the factors. Update the estimates of Bi and Di.
119M-step on 1st cycle: update for i = 1, ..., g.
120M-step on 2nd cycle.
122Work in the q-dimensional space:
  (Bi Bi^T + Di)^(-1) = Di^(-1) − Di^(-1) Bi (Iq + Bi^T Di^(-1) Bi)^(-1) Bi^T Di^(-1),
  |Bi Bi^T + Di| = |Di| / |Iq − Bi^T (Bi Bi^T + Di)^(-1) Bi|.
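A minimal Python sketch (not from the tutorial; numpy assumed) that checks the two q-dimensional identities above numerically on a randomly generated Bi and Di.

import numpy as np

rng = np.random.default_rng(4)
p, q = 6, 2
B = rng.normal(size=(p, q))                       # p x q factor loading matrix
D = np.diag(rng.uniform(0.5, 2.0, size=p))        # diagonal noise covariance

Sigma = B @ B.T + D
Dinv = np.diag(1.0 / np.diag(D))

# Inverse via the q-dimensional (Woodbury) form
inv_q = Dinv - Dinv @ B @ np.linalg.inv(np.eye(q) + B.T @ Dinv @ B) @ B.T @ Dinv
print(np.allclose(np.linalg.inv(Sigma), inv_q))   # True

# Determinant identity: |B B^T + D| = |D| / |I_q - B^T (B B^T + D)^(-1) B|
lhs = np.linalg.det(Sigma)
rhs = np.linalg.det(D) / np.linalg.det(np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)
print(np.isclose(lhs, rhs))                       # True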
123PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
125Example Microarray Data: Colon Data of Alon et al. (1999)
M = 62 (40 tumours + 22 normals) tissue samples on N = 2,000 genes, giving a 2,000 x 62 data matrix.
128Mixture of 2 normal components
129Mixture of 2 t components
131Clustering of COLON Data Genes using EMMIX-GENE
132Grouping for Colon Data
135Clustering of COLON Data Tissues using EMMIX-GENE
136Grouping for Colon Data
137Heat Map Displaying the Reduced Set of 4,869
Genes on the 98 Breast Cancer Tumours
138Heat Map of Top 1867 Genes
144where i = group number, mi = number of genes in group i, and Ui = −2 log λi.
145Heat Map of Genes in Group G1
146Heat Map of Genes in Group G2
147Heat Map of Genes in Group G3
148Number of Components in a Mixture Model
- Testing for the number of components, g, in a
mixture is an important but very difficult
problem which has not been completely resolved.
149Order of a Mixture Model
- A mixture density with g components might be
empirically indistinguishable from one with
either fewer than g components or more than g
components. It is therefore sensible in practice
to approach the question of the number of
components in a mixture model in terms of an
assessment of the smallest number of components
in the mixture compatible with the data.
150Likelihood Ratio Test Statistic
- An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the likelihood ratio test statistic (LRTS), −2 log λ. Suppose we wish to test the null hypothesis
  H0: g = g0   versus   H1: g = g1
for some g1 > g0.
151- We let Ψ̂i denote the MLE of Ψ calculated under Hi (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if −2 log λ is sufficiently large, where
  −2 log λ = 2{ log L(Ψ̂1) − log L(Ψ̂0) }.
152Bootstrapping the LRTS
- McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing H0: g = g0 for a specified value of g0.
153Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) is given by
  log L(Ψ̂) − (d/2) log n
as the penalized log likelihood to be maximized in model selection, including the present situation for the number of components g in a mixture model (d is the number of free parameters in the model and n is the sample size).
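A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) choosing g by BIC. Note that scikit-learn's bic() returns −2 log L + d log n, so the best g minimizes it, which is equivalent to maximizing the penalized log likelihood shown above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])

# sklearn's bic() returns -2 log L + d log n, so the best g minimizes it
bics = {g: GaussianMixture(n_components=g, random_state=0).fit(X).bic(X)
        for g in range(1, 6)}
g_best = min(bics, key=bics.get)
print(bics, g_best)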
154 Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)
155Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
156Supervised Classification (Two Classes)
Samples 1, ..., n on genes 1, ..., p, with each sample labelled as Class 1 (good prognosis) or Class 2 (poor prognosis).
157Microarray to be used as routine clinical
screen by C. M. Schubert Nature Medicine 9, 9,
2003.
The Netherlands Cancer Institute in Amsterdam is
to become the first institution in the world to
use microarray techniques for the routine
prognostic screening of cancer patients. Aiming
for a June 2003 start date, the center will use a
panoply of 70 genes to assess the tumor profile
of breast cancer patients and to determine which
women will receive adjuvant treatment after
surgery.
158Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, 6562-6566, May 14, 2002.
http://www.pnas.org/cgi/content/full/99/10/6562
159Supervised Classification of Tissue Samples
160Sample 1, Sample 2, ..., Sample M (columns: expression signatures); Gene 1, Gene 2, ..., Gene N (rows: expression profiles).
161LINEAR CLASSIFIER
FORM:
  c(y) = β0 + β^T y
for the prediction of the group label z of a future entity with feature vector y.
162FISHER'S LINEAR DISCRIMINANT FUNCTION
  β = S^(-1)(ȳ1 − ȳ2),   β0 = −(1/2)(ȳ1 + ȳ2)^T S^(-1)(ȳ1 − ȳ2),
where ȳ1, ȳ2 and S are the sample means and pooled sample covariance matrix found from the training data.
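A minimal Python sketch (not from the tutorial; numpy assumed) of Fisher's rule built from the training-sample means and pooled covariance matrix; the simulated data and the helper name fisher_rule are illustrative only.

import numpy as np

def fisher_rule(X1, X2):
    """Fisher's linear discriminant: assign y to class 1 if
    (ybar1 - ybar2)' S^(-1) (y - (ybar1 + ybar2)/2) > 0."""
    y1, y2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = X1.shape[0], X2.shape[0]
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, y1 - y2)
    beta0 = -0.5 * beta @ (y1 + y2)
    return lambda y: 1 if beta @ y + beta0 > 0 else 2

rng = np.random.default_rng(6)
X1 = rng.normal(0, 1, size=(20, 3))
X2 = rng.normal(1.5, 1, size=(20, 3))
classify = fisher_rule(X1, X2)
print(classify(np.zeros(3)), classify(1.5 * np.ones(3)))   # 1 2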
163SUPPORT VECTOR CLASSIFIER (Vapnik, 1995)
β0 and β are obtained as follows:
  minimize ||β||   subject to   zj(β0 + β^T yj) ≥ 1 − ξj,   ξj ≥ 0,   Σj ξj ≤ constant,
where the ξj are the slack variables (ξj = 0 for all j in the separable case).
164The solution for β has the form
  β = Σj α̂j zj yj,
with non-zero coefficients α̂j only for those observations j for which the constraints are exactly met (the support vectors).
165Support Vector Machine (SVM)
REPLACE the inner product yj^T yk BY K(yj, yk), where the kernel function K(·,·) is the inner product in the transformed feature space.
166HASTIE et al. (2001, Chapter 12): The Lagrange (primal) function is
  LP = (1/2)||β||² + C Σj ξj − Σj αj [ zj(yj^T β + β0) − (1 − ξj) ] − Σj μj ξj,   (1)
which we minimize w.r.t. β, β0 and the ξj. Setting the respective derivatives to zero, we get
  β = Σj αj zj yj,   (2)
  0 = Σj αj zj,   (3)
  αj = C − μj,   (4)
with αj, μj, ξj ≥ 0 for all j.
167By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
  LD = Σj αj − (1/2) Σj Σk αj αk zj zk yj^T yk.   (5)
We maximize (5) subject to 0 ≤ αj ≤ C and Σj αj zj = 0. In addition to (2) to (4), the constraints include
  αj [ zj(yj^T β + β0) − (1 − ξj) ] = 0,   (6)
  μj ξj = 0,   (7)
  zj(yj^T β + β0) − (1 − ξj) ≥ 0,   (8)
for all j. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problems.
168Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.
169GUYON, WESTON, BARNHILL and VAPNIK (2002, Machine Learning)
- LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
- COLON DATA: using only 4 genes, the CVE is 2%
170Since p >> n, consideration is given to the selection of suitable genes:
SVM: FORWARD or BACKWARD selection (in terms of the magnitude of the weight βi); RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD selection ONLY (in terms of the CVE)
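A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of backward gene selection by SVM-RFE, ranking genes by the magnitude of the linear SVM weights; the simulated data are illustrative only.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Simulated data: n = 60 tissues, p = 500 genes, only the first 10 informative
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 500))
X[:, :10] += y[:, None] * 1.5

# Linear SVM; RFE repeatedly drops the genes with the smallest |weight|
svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=10, step=0.1).fit(X, y)
selected_genes = np.where(rfe.support_)[0]
print(selected_genes)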
171GUYON et al. (2002)
LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
COLON DATA: using only 4 genes, the CVE is 2%
172GUYON et al. (2002)
The success of the RFE indicates that RFE has a
built in regularization mechanism that we do not
understand yet that prevents overfitting the
training data in its selection of gene subsets.
173Error Rate Estimation
Suppose there are two groups G1 and G2
c(y) is a classifier formed from the data set
(y1, y2, y3,, yn)
The apparent error is the proportion of the data
set misallocated by c(y).
174Cross-Validation (leave-one-out)
First delete the observation y1 from the training set, and form the classifier c(1)(y) from this reduced set.
Then use c(1)(y1) to allocate y1 to either G1 or G2.
175Repeat this process for the second data point, y2, so that this point is assigned to either G1 or G2 on the basis of the classifier c(2)(y2). And so on, up to yn.
176Ten-Fold Cross-Validation
The data are split into 10 blocks; in turn, each block is used as the test set, with the remaining 9 blocks forming the training set.
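A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) contrasting the internal (biased) and external (corrected) ways of combining gene selection with ten-fold cross-validation, which is the selection-bias point of Ambroise and McLachlan (2002); the data are pure noise, so an honest error rate should be near 0.5.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 2000))       # no real class difference

# Biased: genes selected on ALL the data before cross-validation
top = SelectKBest(f_classif, k=20).fit(X, y).get_support()
biased = cross_val_score(SVC(kernel="linear"), X[:, top], y, cv=10).mean()

# Corrected (external): selection is redone inside every training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
unbiased = cross_val_score(pipe, X, y, cv=10).mean()

# The biased error rate is typically far too optimistic; the external one is near 0.5
print(round(1 - biased, 2), round(1 - unbiased, 2))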
177Figure 1 Error rates of the SVM rule with RFE
procedure averaged over 50 random splits of colon
tissue samples
178Figure 3 Error rates of Fishers rule with
stepwise forward selection procedure using all
the colon data
179Figure 5 Error rates of the SVM rule averaged
over 20 noninformative samples generated by
random permutations of the class labels of the
colon tumor tissues
180 ADDITIONAL REFERENCES Selection bias
ignored XIONG et al. (2001, Molecular Genetics
and Metabolism) XIONG et al. (2001, Genome
Research) ZHANG et al. (2001, PNAS)
Aware of selection bias SPANG et al. (2001,
Silico Biology) WEST et al. (2001,
PNAS) NGUYEN and ROCKE (2002)
181BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:
  B.632 = 0.368 AE + 0.632 B1,
where AE is the apparent error rate and B1 is the bootstrap error rate when the rule is applied to points not in the training (bootstrap) sample. A Monte Carlo estimate of B1 averages, over the bootstrap replications, the error of each bootstrap rule on the original observations omitted from that bootstrap sample.
182Toussaint and Sharpe (1975) proposed the ERROR RATE ESTIMATOR
  A(w) = (1 − w) AE + w CV1E,
where CV1E denotes the leave-one-out cross-validated error rate.
McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of the sample sizes and the Mahalanobis distance between the two groups.
183The .632+ estimate of Efron and Tibshirani (1997, JASA):
  B.632+ = (1 − w) AE + w B1,   with   w = 0.632 / (1 − 0.368 r),
where r is the relative overfitting rate, r = (B1 − AE) / (γ − AE), and γ is an estimate of the no-information error rate.
If r = 0, then w = 0.632 and B.632+ = B.632; if r = 1, then w = 1 and B.632+ = B1.
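A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed) of the plain .632 estimator, B.632 = 0.368 AE + 0.632 B1, with B1 estimated by Monte Carlo from the out-of-bootstrap points; the .632+ adjustment via the relative overfitting rate is omitted for brevity.

import numpy as np
from sklearn.svm import SVC

def err632(X, y, make_clf, B=50, rng=None):
    """Efron's .632 error estimator: 0.368 * apparent error + 0.632 * B1,
    where B1 is the error of bootstrap-trained rules on the points left
    out of each bootstrap sample."""
    rng = np.random.default_rng(rng)
    n = len(y)
    ae = np.mean(make_clf().fit(X, y).predict(X) != y)   # apparent error
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)            # points not drawn
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue
        clf = make_clf().fit(X[idx], y[idx])
        errs.append(np.mean(clf.predict(X[oob]) != y[oob]))
    b1 = np.mean(errs)
    return 0.368 * ae + 0.632 * b1

rng = np.random.default_rng(9)
y = np.repeat([0, 1], 25)
X = rng.normal(size=(50, 10)) + y[:, None]
print(round(err632(X, y, lambda: SVC(kernel="linear"), B=50, rng=0), 3))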
184MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times each gene was selected.
No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
185MARKER GENES FOR HARVARD DATA
tubulin, alpha, ubiquitous Cluster Incl N90862 cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4) DEK oncogene (DNA binding) Cluster Incl AF035316 transducin-like enhancer of split 2, homolog of Drosophila E(sp1) ADP-ribosyltransferase (NAD poly (ADP-ribose) polymerase) benzodiazapine receptor (peripheral) Cluster Incl D21063 galactosidase, beta 1 high-mobility group (nonhistone chromosomal) protein 2 cold inducible RNA-binding protein Cluster Incl U79287 BAF53 tubulin, beta polypeptide thromboxane A2 receptor H1 histone family, member X Fc fragment of IgG, receptor, transporter, alpha sine oculis homeobox (Drosophila) homolog 3 transcriptional intermediary factor 1 gamma transcription elongation factor A (SII)-like 1 like mouse brain protein E46 minichromosome maintenance deficient (mis5, S. pombe) 6 transcription factor 12 (HTF4, helix-loop-helix transcription factors 4) guanine nucleotide binding protein (G protein), gamma 3, linked dihydropyrimidinase-like 2 Cluster Incl AI951946 transforming growth factor, beta receptor II (70-80kD) protein kinase C-like 1
No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
186Breast cancer data set of van 't Veer et al. (van 't Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415).
These data were the result of microarray
experiments on three patient groups with
different classes of breast cancer tumours. The
overall goal was to identify a set of genes that
could distinguish between the different tumour
groups based upon the gene expression information
for these groups.
187van de Vijver et al. (2002) considered a further
234 breast cancer tumours but have only made
available the data for the top 70 genes based on
the previous study of van 't Veer et al. (2002).
188Number of Genes Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70) Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70) Error Rate for 5422 Genes (with correction for Selection Bias)
1 0.50 0.53 0.56
2 0.32 0.41 0.44
4 0.26 0.40 0.41
8 0.27 0.32 0.43
16 0.28 0.31 0.35
32 0.22 0.35 0.34
64 0.20 0.34 0.35
70 0.19 0.33 -
128 - - 0.39
256 - - 0.33
512 - - 0.34
1024 - - 0.33
2048 - - 0.37
4096 - - 0.40
5422 - - 0.44
190Nearest-Shrunken Centroids (Tibshirani et al., 2002)
The usual estimates of the class means ȳiv are shrunk toward the overall mean ȳv of the data by soft thresholding:
  ȳ'iv = ȳv + mi (sv + s0) d'iv,   with   d'iv = sign(div) (|div| − Δ)+,
where div = (ȳiv − ȳv) / {mi (sv + s0)}, sv is the pooled within-class standard deviation of gene v, s0 is a small positive constant, mi is a standardizing constant depending on the class sample sizes, and Δ is the shrinkage threshold.
191The nearest-centroid rule assigns a tissue with feature vector y to the class whose centroid is closest in the gene-wise standardized distance, where yv is the vth element of the feature vector y and the contribution of gene v is scaled by (sv + s0)².
192In the previous definition, we replace the sample mean ȳiv of the vth gene by its shrunken estimate ȳ'iv.
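A minimal Python sketch (not from the tutorial; numpy and scikit-learn assumed). scikit-learn's NearestCentroid with a shrink_threshold implements soft-thresholding of the class centroids toward the overall centroid, which is the idea above, though its details are not identical to the procedure of Tibshirani et al. (2002); the data are simulated.

import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
y = np.repeat([0, 1], 31)                       # 62 tissues, as in the colon data
X = rng.normal(size=(62, 2000))
X[y == 1, :50] += 1.0                           # 50 informative genes

# shrink_threshold soft-thresholds each class centroid towards the overall
# centroid, zeroing out genes that do not discriminate between the classes
for delta in [0.0, 0.2, 0.5]:
    clf = NearestCentroid(shrink_threshold=delta if delta > 0 else None)
    err = 1 - cross_val_score(clf, X, y, cv=10).mean()
    print(delta, round(err, 3))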
193Comparison of Nearest-Shrunken Centroids with the SVM
Apply (i) nearest-shrunken centroids and (ii) the SVM with RFE to the colon data set of Alon et al. (1999), with N = 2000 genes and M = 62 tissues (40 tumours, 22 normals).
194Nearest-Shrunken Centroids applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
195SVM with RFE applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
196Outline of Tutorial
- Introduction to Microarray Technology
- Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
- Cluster Analysis: Clustering Genes and Clustering Tissues
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
197Breast tumours have a genetic signature. The
expression pattern of a set of 70 genes can
predict whether a tumour is going to prove
lethal, despite treatment, or not. This gene
expression profile will outperform all currently
used clinical parameters in predicting disease
outcome.
van 't Veer et al. (2002), van de Vijver et al. (2002)
198Problems
- Censored observations: the time of occurrence of the event (death) has not yet been observed.
- Small sample sizes: the study is limited by patient numbers.
- Specific patient group: is the study applicable to other populations?
- Difficulty in integrating different studies (different microarray platforms).
199A Case Study The Lung Cancer data sets from
CAMDA03
Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario). The challenge: to integrate information from the different data sets (2 Affymetrix chips of different versions, 2 cDNA arrays). The final goal: to make an impact on cancer biology and eventually patient care. "Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis" (Park et al., Bioinformatics S120, 2002).
200Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al., 2000). Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox. Unsupervised vs. supervised methods: the semi-supervised approach of Bair and Tibshirani (2004) combines gene expression data with the clinical data.
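A minimal Python sketch (not from the tutorial) of the survival-analysis steps above, assuming the lifelines package (plus numpy and pandas) is available: cluster-specific Kaplan-Meier estimates, a log-rank comparison, and a Cox proportional hazards fit with the cluster label as covariate; the data are simulated.

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(11)
n = 40
cluster = rng.integers(0, 2, size=n)                     # cluster label per tissue
time = rng.exponential(scale=np.where(cluster == 0, 20, 60))
event = rng.uniform(size=n) < 0.7                        # 1 = recurrence, 0 = censored

# Kaplan-Meier curve for each cluster, plus a log-rank comparison
for c in (0, 1):
    km = KaplanMeierFitter().fit(time[cluster == c], event[cluster == c],
                                 label=f"cluster {c}")
    print(km.median_survival_time_)
print(logrank_test(time[cluster == 0], time[cluster == 1],
                   event[cluster == 0], event[cluster == 1]).p_value)

# Cox proportional hazards model with the cluster label as a covariate
df = pd.DataFrame({"time": time, "event": event.astype(int), "cluster": cluster})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.summary[["exp(coef)", "p"]])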
201AIM: To link gene-expression data with survival from lung cancer in the CAMDA03 challenge.
A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.
202Lung Cancer
Approximately 80% of lung cancer patients have NSCLC (of which adenocarcinoma is the most common form). All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases). Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery. We want a prognostic test for early-stage lung adenocarcinoma to identify patients more likely to recur, and therefore who would benefit from adjuvant therapy.
203Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001),
Bhattacharjee et al. (2001), Beer et al. (2002).
204Heat Map for 2880 Ontario Genes (39 Tissues)
Genes
Tissues
206Heat Maps for the 20 Ontario Gene-Groups (39
Tissues)
Genes
Tissues
Tissues are ordered as Recurrence (1-24) and
Censored (25-39)
207Expression Profiles for Useful Metagenes (Ontario
39 Tissues)
Gene Group 1
Gene Group 2
Our Tissue Cluster 1
Our Tissue Cluster 2
Log Expression Value
Recurrence (1-24)
Censored (25-39)
Gene Group 19
Gene Group 20
Tissues
208Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of the 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31 tissues): 23 (recurrence) plus 8 (censored) - poor prognosis
CLUSTER 2 (8 tissues): 1 (recurrence) plus 7 (censored) - good prognosis
209SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
  S(t) = π1 S1(t) + π2,
where T is the time to recurrence and π1 = 1 − π2 is the prior probability of recurrence. Adopt a Weibull model for the survival function for recurrence, S1(t).
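A minimal Python sketch (not from the tutorial; numpy and scipy assumed) fitting the long-term survivor model above, S(t) = π1 S1(t) + π2 with a Weibull S1, by maximizing the censored-data likelihood; the parameterization and the simulated data are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def lts_negloglik(theta, t, delta):
    """Negative log-likelihood of the long-term survivor model
    S(t) = pi1 * exp(-(t/b)**a) + (1 - pi1),
    where delta = 1 for a recurrence and 0 for a censored time."""
    pi1, a, b = theta
    S1 = np.exp(-(t / b) ** a)
    f1 = (a / b) * (t / b) ** (a - 1) * S1            # Weibull density
    like = np.where(delta == 1, pi1 * f1, pi1 * S1 + (1 - pi1))
    return -np.sum(np.log(like))

rng = np.random.default_rng(12)
n = 150
cured = rng.uniform(size=n) < 0.4                     # long-term survivors
t_event = rng.weibull(1.5, size=n) * 20               # recurrence times
t_cens = rng.uniform(5, 60, size=n)                   # censoring times
t = np.where(cured, t_cens, np.minimum(t_event, t_cens))
delta = (~cured & (t_event <= t_cens)).astype(int)

res = minimize(lts_negloglik, x0=[0.5, 1.0, 10.0], args=(t, delta),
               method="L-BFGS-B",
               bounds=[(0.01, 0.99), (0.1, 10), (0.1, 200)])
pi1_hat, a_hat, b_hat = res.x
print(round(pi1_hat, 2), round(a_hat, 2), round(b_hat, 2))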
210Fitted LTS Model vs. Kaplan-Meier
211PCA of Tissues Based on Metagenes
Second PC
First PC
212PCA of Tissues Based on Metagenes
Second PC
First PC
213PCA of Tissues Based on All Genes (via SVD)
Second PC
First PC
214PCA of Tissues Based on All Genes (via SVD)
Second PC
First PC
215Cluster-Specific Kaplan-Meier Plots
216Survival Analysis for the Ontario Dataset
Cluster   No. of Tissues   No. Censored   Mean Time to Failure (±SE)
1         29               8              665 ± 85.9
2         8                7              1388 ± 155.7
There is a significant difference between the Kaplan-Meier estimates for the two clusters (P = 0.027).
- Cox's proportional hazards analysis:
Variable                        Hazard Ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2         6.78 (0.9 - 51.5)       0.06
Tumor stage (I vs. II-III)      1.07 (0.57 - 2.0)       0.83
217Discriminant Analysis (Supervised
Classification) A prognosis classifier was
developed to predict the class of origin of a
tumor tissue with a small error rate after
correction for the selection bias. A support
vector machine (SVM) was adopted to identify
important genes that play a key role on
predicting the clinical outcome, using all the
genes, and the metagenes. A cross-validation
(CV) procedure was used to calculate the
prediction error, after correction for the
selection bias.
218ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of error rate (CV10E) against log2(number of genes).]
Ten-fold cross-validation error rate (CV10E) of the SVM, applied to the g = 2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).
219STANFORD DATA
918 genes based on 73 tissue samples from 67
patients. Row and column normalized, retained
451 genes after select-genes step. Used 20
metagenes to cluster tissues. Retrieved
histological groups.
220Heat Maps for the 20 Stanford Gene-Groups (73
Tissues)
Genes
Tissues
Tissues are ordered by their histological
classification Adenocarcinoma (1-41), Fetal Lung
(42), Large cell (43-47), Normal (48-52),
Squamous cell (53-68), Small cell (69-73)
221STANFORD CLASSIFICATION
Cluster 1: tissues 1-19 (good prognosis)
Cluster 2: tissues 20-26 (long-term survivors)
Cluster 3: tissues 27-35 (poor prognosis)
222Heat Maps for the 15 Stanford Gene-Groups (35
Tissues)
Genes
Tissues
Tissues are ordered by the Stanford
classification into AC groups AC group 1 (1-19),
AC group 2 (20-26), AC group 3 (27-35)
223Expression Profiles for Top Metagenes (Stanford
35 AC Tissues)
Gene Group 1
Gene Group 2
Stanford AC group 1
Stanford AC group 2
Stanford AC group 3
Misallocated
Log Expression Value
Gene Group 4
Gene Group 3
Tissues
224Cluster-Specific Kaplan-Meier Plots
225Cluster-Specific Kaplan-Meier Plots
226Survival Analysis for the Stanford Dataset
Cluster   No. of Tissues   No. Censored   Mean Time to Failure (±SE)
1         17               10             37.5 ± 5.0
2         5                0              5.2 ± 2.3
There is a significant difference in survival between the clusters (P < 0.001).
- Cox's proportional hazards analysis:
Variable                          Hazard Ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1 and 2    13.2 (2.1 - 81.1)       0.005
Grade 3 vs. grades 1 or 2         1.94 (0.5 - 8.5)        0.38
Tumor size                        0.96 (0.3 - 2.8)        0.93
No. of tumors in lymph nodes      1.65 (0.7 - 3.9)        0.25
Presence of metastases            4.41 (1.0 - 19.8)       0.05
227Survival Analysis for the Stanford Dataset
- Univariate Cox's proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
228Survival Analysis for the Stanford Dataset
- Multivariate Cox's proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031
The final model consists of four metagenes.
229STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of error rate (CV10E) against log2(number of genes).]
Ten-fold cross-validation error rate (CV10E) of the SVM, applied to the g = 2 clusters.
230CONCLUSIONS
- We applied a model-based clustering approach to classify tumors, using their gene signatures, into
  (a) clusters corresponding to tumor type
  (b) clusters corresponding to clinical outcomes for tumors of a given subtype
- In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).
231CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical
outcomes (e.g. recurrence/recurrence-free and
death/long-term survival). We were able to show
that gene-expression data provide prognostic
information, beyond that of clinical indicators
such as stage.
232CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant
analysis using support vector machines (SVM)
demonstrated further the potential of gene
expression as a tool for guiding treatment
therapy and patient care to lung cancer patients.
This supervised classification procedure was
used to provide marker genes for prediction of
clinical outcomes. (In addition to those
provided by the cluster-genes step in the initial
unsupervised classification.)
233LIMITATIONS
Small number of tumors available (e.g. Ontario and Stanford datasets). Clinical data available for only subsets of the tumors, often for only one tumor type (AC). A high proportion of censored observations limits the comparison of survival rates.