Title: The Classification of Microarray Data

1. The Classification of Microarray Data
Geoff McLachlan
Department of Mathematics and Institute of Molecular Bioscience, University of Queensland
http://www.maths.uq.edu.au/~gjm
3. Outline of Talk
- Introduction
- Supervised classification of tissue samples: selection bias
- Unsupervised classification (clustering) of tissues: mixture model-based approach
6. Supervised Classification (Two Classes)
[Schematic: an expression matrix with rows Gene 1, ..., Gene p and columns Sample 1, ..., Sample n; the samples are labelled Class 1 (good prognosis) or Class 2 (poor prognosis).]
7. Microarray to be used as routine clinical screen
C. M. Schubert, Nature Medicine 9, 9, 2003.

The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
8. Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences 99 (10), 6562-6566, May 14, 2002.
http://www.pnas.org/cgi/content/full/99/10/6562
9. LINEAR CLASSIFIER

FORM

c(x) = \beta_0 + \beta^T x

for the prediction of the group label y of a future entity with feature vector x.
10. FISHER'S LINEAR DISCRIMINANT FUNCTION

\beta = S^{-1}(\bar{x}_1 - \bar{x}_2), \quad \beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2),

where \bar{x}_1, \bar{x}_2, and S are the sample means and pooled sample covariance matrix found from the training data.
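As a minimal sketch of this rule (assuming NumPy; the function name and the toy data in the usage note are my own), Fisher's discriminant can be computed directly from two training samples:

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fisher's linear discriminant from two training samples.

    X1, X2: (n1, p) and (n2, p) arrays of feature vectors for the two
    classes.  Returns (beta0, beta) so that a new x is assigned to
    class 1 when beta0 + beta @ x > 0.
    """
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)   # S^{-1}(xbar1 - xbar2)
    beta0 = -0.5 * (m1 + m2) @ beta      # threshold at the midpoint
    return beta0, beta
```

For two well-separated Gaussian samples, the sign of beta0 + beta @ x then recovers the group labels.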
11. SUPPORT VECTOR CLASSIFIER
Vapnik (1995)

c(x) = \beta_0 + \beta^T x,

where \beta_0 and \beta are obtained as follows:

\min_{\beta_0, \beta} \; \tfrac{1}{2}\|\beta\|^2 + C \sum_{j=1}^{n} \xi_j

subject to

y_j(\beta_0 + \beta^T x_j) \ge 1 - \xi_j, \quad \xi_j \ge 0 \quad (j = 1, \ldots, n).

The constraints relate to the slack variables \xi_j; C = \infty gives the separable case.
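The primal objective above can be minimised crudely by sub-gradient descent on the equivalent hinge-loss form. This is a toy sketch only (NumPy assumed; function name, step size and epoch count are my own choices), not Vapnik's dual quadratic-programming formulation, which a dedicated solver would use in practice:

```python
import numpy as np

def svm_train(X, y, C=1.0, lr=0.01, epochs=500):
    """Bare-bones linear SVM: sub-gradient descent on
    0.5*||beta||^2 + C * sum_j max(0, 1 - y_j*(beta0 + beta @ x_j)).
    Labels y must be coded +1 / -1."""
    n, p = X.shape
    beta = np.zeros(p)
    beta0 = 0.0
    for _ in range(epochs):
        margins = y * (X @ beta + beta0)
        viol = margins < 1.0                 # points with positive slack
        grad_beta = beta - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_beta0 = -C * y[viol].sum()
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta0, beta
```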
12. Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.
13. Guyon, Weston, Barnhill & Vapnik (2002, Machine Learning)
- LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
- COLON DATA: using only 4 genes, CVE is 2
14. Figure 1: Error rates of the SVM rule with RFE procedure, averaged over 50 random splits of colon tissue samples.
15. Figure 3: Error rates of Fisher's rule with stepwise forward selection procedure, using all the colon data.
16. Figure 5: Error rates of the SVM rule, averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues.
17. BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:

B.632 = 0.368 AE + 0.632 B1,

where AE is the apparent error and B1 is the bootstrap error when the rule is applied to a point not in the training sample. A Monte Carlo estimate of B1 averages, over the observations x_j, the error of the rules formed from bootstrap samples not containing x_j.
18. Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

A(w) = (1 - w) AE + w A^{(CV)},

where AE is the apparent error rate and A^{(CV)} is the cross-validated error rate. McLachlan (1977) proposed w = w_0, where w_0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w_0 was found to range between 0.6 and 0.7, depending on the values of the underlying parameters.
19. .632+ estimate of Efron & Tibshirani (1997, JASA)

B.632+ = (1 - w) AE + w B1,

where

w = 0.632 / (1 - 0.368 r),

r = (B1 - AE) / (\gamma - AE)   (relative overfitting rate),

and \gamma is an estimate of the no-information error rate. If r = 0, then w = .632, and so B.632+ = B.632; if r = 1, then w = 1, and so B.632+ = B1.
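A Monte Carlo version of the .632 estimator can be sketched as follows (NumPy assumed; the function names and the toy nearest-centroid rule are my own, standing in for whatever classifier is being assessed):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Toy rule: the two class centroids (labels assumed 0/1)."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def nearest_centroid_predict(model, X):
    c0, c1 = model
    return (np.linalg.norm(X - c1, axis=1)
            < np.linalg.norm(X - c0, axis=1)).astype(int)

def b632(X, y, fit, predict, n_boot=50, seed=0):
    """Efron's .632 estimator B.632 = 0.368*AE + 0.632*B1, with B1
    estimated by averaging, over each point, the error of rules
    trained on bootstrap samples that omit that point."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ae = np.mean(predict(fit(X, y), X) != y)   # apparent error
    errs = np.zeros(n)
    hits = np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # bootstrap draw with replacement
        out = np.setdiff1d(np.arange(n), idx)  # points left out of the draw
        model = fit(X[idx], y[idx])
        errs[out] += (predict(model, X[out]) != y[out])
        hits[out] += 1
    used = hits > 0
    b1 = np.mean(errs[used] / hits[used])
    return ae, b1, 0.368 * ae + 0.632 * b1
```

The apparent error AE is optimistic; B1 is pessimistic; the .632 weighting trades the two biases off against each other.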
20. Ten-Fold Cross-Validation
[Schematic: the data are split into ten blocks; in turn, one block serves as the test set while the remaining nine form the training set.]
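The selection bias of Ambroise and McLachlan (2002) can be reproduced on pure-noise data: selecting marker genes on all the data before cross-validation gives a wildly optimistic error rate, whereas re-selecting within each training fold does not. This sketch assumes NumPy; the function names and the simple standardized-mean-difference selection rule are my own:

```python
import numpy as np

def cv_error(X, y, k=10, select_inside=True, n_genes=10, seed=0):
    """k-fold CV error of a nearest-centroid rule built on the n_genes
    features most correlated with the labels.  With select_inside=False
    the selection uses ALL the data first (the biased protocol)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)

    def top_genes(Xa, ya, m):
        d = Xa[ya == 1].mean(axis=0) - Xa[ya == 0].mean(axis=0)
        s = Xa.std(axis=0) + 1e-12
        return np.argsort(-np.abs(d / s))[:m]

    if not select_inside:
        pre = top_genes(X, y, n_genes)        # selection outside CV (biased)
    errs = 0
    for f in folds:
        tr = np.setdiff1d(np.arange(n), f)
        g = top_genes(X[tr], y[tr], n_genes) if select_inside else pre
        c0 = X[np.ix_(tr[y[tr] == 0], g)].mean(axis=0)
        c1 = X[np.ix_(tr[y[tr] == 1], g)].mean(axis=0)
        Xt = X[np.ix_(f, g)]
        pred = (np.linalg.norm(Xt - c1, axis=1)
                < np.linalg.norm(Xt - c0, axis=1)).astype(int)
        errs += (pred != y[f]).sum()
    return errs / n
```

On data with no real class structure, the honest (internal-selection) error sits near 50%, while the external-selection error can approach zero.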
21. MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
23. Breast cancer data set of van 't Veer et al.
(van 't Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)

These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.
24. van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but have only made available the data for the top 70 genes based on the previous study of van 't Veer et al. (2002).
28. Two Clustering Problems
- Clustering of genes on the basis of tissues: the genes are not independent
- Clustering of tissues on the basis of genes: the latter is a nonstandard problem in cluster analysis (n << p)
30. "Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade."
Gary Churchill (The Jackson Laboratory), contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni, Statistical Science (2003) 18, 64-69.
33. The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance, e.g. k-means. Usually, however, there is no a priori metric (or, equivalently, no user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
34. In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
35. Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated, and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.

"In the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters."
(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)
36. McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90, edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).
37. MIXTURE OF g NORMAL COMPONENTS

f(x) = \sum_{i=1}^{g} \pi_i \, \phi(x; \mu_i, \Sigma_i),

where the \pi_i are the mixing proportions and \phi(x; \mu_i, \Sigma_i) denotes the multivariate normal density with mean \mu_i and covariance matrix \Sigma_i.

38. MIXTURE OF g NORMAL COMPONENTS

An observation x is assigned to the component with the highest posterior probability of membership,

\tau_i(x) = \pi_i \, \phi(x; \mu_i, \Sigma_i) / f(x) \quad (i = 1, \ldots, g).
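A normal mixture of this kind is usually fitted by the EM algorithm. The following is a minimal sketch (NumPy assumed; the function name, the crude farthest-point initialisation of the means, and the small ridge added to each covariance matrix for numerical stability are my own choices, not part of the talk):

```python
import numpy as np

def normal_mixture_em(X, g, n_iter=50):
    """EM for the g-component normal mixture
    f(x) = sum_i pi_i phi(x; mu_i, Sigma_i), returning the fitted
    parameters and the posterior probabilities tau used to cluster
    the data (assign x_j to the component maximising tau)."""
    n, p = X.shape
    # Crude farthest-point initialisation of the component means
    idx = [0]
    for _ in range(g - 1):
        d = np.min(np.linalg.norm(X[:, None, :] - X[idx], axis=2), axis=1)
        idx.append(int(d.argmax()))
    mu = X[idx].astype(float).copy()
    pi = np.full(g, 1.0 / g)
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(g)])
    for _ in range(n_iter):
        # E-step: log component densities, then posterior probabilities
        logd = np.empty((n, g))
        for i in range(g):
            diff = X - mu[i]
            Sinv = np.linalg.inv(Sigma[i])
            _, logdet = np.linalg.slogdet(Sigma[i])
            maha = np.einsum('nj,jk,nk->n', diff, Sinv, diff)
            logd[:, i] = np.log(pi[i]) - 0.5 * (p * np.log(2 * np.pi)
                                                + logdet + maha)
        logd -= logd.max(axis=1, keepdims=True)   # stabilised softmax
        tau = np.exp(logd)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted updates of pi_i, mu_i, Sigma_i
        for i in range(g):
            w = tau[:, i]
            pi[i] = w.mean()
            mu[i] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mu[i]
            Sigma[i] = (w[:, None] * diff).T @ diff / w.sum() \
                       + 1e-6 * np.eye(p)
    return pi, mu, Sigma, tau
```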
39. In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.
40. Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
41. Mixtures of Factor Analyzers
- A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data.
- One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).
42. The component-covariance matrices are modelled as

\Sigma_i = B_i B_i^T + D_i,

where
- B_i is a p x q matrix of factor loadings and
- D_i is a diagonal matrix.
43. Single-Factor Analysis Model

Y_j = \mu + B U_j + e_j \quad (j = 1, \ldots, n)

44. The U_j are iid N(0, I_q), independently of the errors e_j, which are iid N(0, D), where D is a diagonal matrix.
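Simulating from this model makes the implied covariance structure concrete: Cov(Y) = B B^T + D. A short sketch (NumPy assumed; the dimensions and parameter values are arbitrary choices of mine):

```python
import numpy as np

# Simulate from the single-factor-analysis model Y_j = mu + B U_j + e_j
rng = np.random.default_rng(0)
p, q, n = 10, 2, 100_000
mu = np.zeros(p)
B = rng.normal(size=(p, q))               # p x q matrix of loadings
d = rng.uniform(0.2, 1.0, p)              # diagonal of D
U = rng.normal(size=(n, q))               # factors, iid N(0, I_q)
e = rng.normal(size=(n, p)) * np.sqrt(d)  # errors, iid N(0, D)
Y = mu + U @ B.T + e                      # implied Cov(Y) = B B^T + D
```

With n this large, the sample covariance matrix of Y agrees closely with B B^T + D.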
45. Mixtures of Factor Analyzers
- A single-factor analysis model provides only a global linear model.
- A global nonlinear approach is obtained by postulating a mixture of linear submodels.
46. Conditional on ith component membership of the mixture,

Y_j = \mu_i + B_i U_{ij} + e_{ij},

where U_{i1}, ..., U_{in} are independent, identically distributed (iid) N(0, I_q), independently of the e_{ij}, which are iid N(0, D_i), where D_i is a diagonal matrix (i = 1, ..., g).
47. There is an infinity of choices for B_i, as the model still holds if B_i is replaced by B_i C_i, where C_i is an orthogonal matrix. Choose C_i so that

B_i^T D_i^{-1} B_i

is diagonal. The number of free parameters per component-covariance matrix is then

pq + p - q(q - 1)/2.
48. The reduction in the number of parameters, relative to an unrestricted component-covariance matrix, is then

\tfrac{1}{2}\{(p - q)^2 - (p + q)\}.

- We can fit the mixture of factor analyzers model using an alternating ECM algorithm.
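These parameter counts are easy to tabulate; a small helper (the function name and the example values p = 2000 genes, q = 6 factors are my own) makes the saving explicit:

```python
def cov_params(p, q):
    """Free parameters in one component-covariance matrix: an
    unrestricted Sigma_i versus the factor-analytic form
    B_i B_i^T + D_i (after fixing the rotation of B_i)."""
    full = p * (p + 1) // 2
    fa = p * q + p - q * (q - 1) // 2
    return full, fa, full - fa

# e.g. p = 2000 genes, q = 6 factors
full, fa, saved = cov_params(2000, 6)
```

The difference full - fa reproduces the closed form (1/2){(p - q)^2 - (p + q)} above.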
49. 1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of \pi_i and \mu_i.
2nd cycle: declare the missing data to be also the factors. Update the estimates of B_i and D_i.
50. M-step on the 1st cycle:

\pi_i = \sum_{j=1}^{n} \tau_{ij} / n, \quad \mu_i = \sum_{j=1}^{n} \tau_{ij} y_j \Big/ \sum_{j=1}^{n} \tau_{ij},

for i = 1, ..., g, where \tau_{ij} is the current posterior probability that y_j belongs to the ith component.
51. M-step on the 2nd cycle: update B_i and D_i, using the current conditional expectations of the factors given the data. [Formulas not transcribed.]
53. Work in q-dimensional space:

(B_i B_i^T + D_i)^{-1} = D_i^{-1} - D_i^{-1} B_i (I_q + B_i^T D_i^{-1} B_i)^{-1} B_i^T D_i^{-1},

|B_i B_i^T + D_i| = |D_i| \, / \, |I_q - B_i^T (B_i B_i^T + D_i)^{-1} B_i|.
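Both q-dimensional identities are easy to check numerically (NumPy assumed; the dimensions and the random B and D are my own illustrative choices):

```python
import numpy as np

# Numerical check of the q-dimensional identities for Sigma = B B^T + D
rng = np.random.default_rng(0)
p, q = 50, 3
B = rng.normal(size=(p, q))          # p x q loading matrix
d = rng.uniform(0.5, 2.0, p)         # positive diagonal of D
D = np.diag(d)
Sigma = B @ B.T + D

# Inverse: only a q x q matrix has to be inverted (Woodbury form)
Dinv = np.diag(1.0 / d)
M = np.eye(q) + B.T @ Dinv @ B
Sigma_inv = Dinv - Dinv @ B @ np.linalg.inv(M) @ B.T @ Dinv

# Determinant: |Sigma| = |D| / |I_q - B^T Sigma^{-1} B|
sign, logdet_q = np.linalg.slogdet(np.eye(q) - B.T @ Sigma_inv @ B)
logdet_Sigma = np.log(d).sum() - logdet_q
```

For p = 2000 genes and a handful of factors, inverting a q x q matrix instead of a p x p one is what makes the ECM iterations feasible.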
54. PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
56. Example: Microarray Data. Colon data of Alon et al. (1999)
n = 62 (40 tumours, 22 normals) tissue samples of p = 2,000 genes, in a 2,000 x 62 matrix.
59. Mixture of 2 normal components
60. Mixture of 2 t components
62. Clustering of COLON Data Genes using EMMIX-GENE
63. Grouping for Colon Data
66. Grouping for Colon Data
69. Heat Map of Genes in Group G1
70. Heat Map of Genes in Group G2
71. Heat Map of Genes in Group G3
73. "An efficient algorithm based on a heuristically justified objective function, delivered in reasonable time, is usually preferable to a principled statistical approach that takes years to develop or ages to run. Having said this, the case for a more principled approach can be made more effectively once cruder approaches have exhausted their harvest of low-hanging fruit."
Gilks (2004)
74. In bioinformatics, algorithms are generally viewed as more important than models or statistical efficiency. Unless the methodological research results in a web-based tool or, at the very least, downloadable code that can be easily run by the user, it is effectively useless.