Transcript and Presenter's Notes

Title: Geoff McLachlan


1
The Classification of Microarray Data
Geoff McLachlan Department of Mathematics
Institute of Molecular Bioscience University of
Queensland
http://www.maths.uq.edu.au/gjm
2
(No Transcript)
3
Outline of Talk
  • Introduction
  • Supervised classification of tissue samples: selection bias
  • Unsupervised classification (clustering) of tissues: mixture model-based approach

4
(No Transcript)
5
(No Transcript)
6
Supervised Classification (Two Classes)
. . . . . . .
Sample 1
Sample n
Gene 1
. . . . . . .
Gene p
Class 2 (poor prognosis)
Class 1 (good prognosis)
7
"Microarray to be used as routine clinical screen", by C. M. Schubert, Nature Medicine 9, 9, 2003.
The Netherlands Cancer Institute in Amsterdam is
to become the first institution in the world to
use microarray techniques for the routine
prognostic screening of cancer patients. Aiming
for a June 2003 start date, the center will use a
panoply of 70 genes to assess the tumor profile
of breast cancer patients and to determine which
women will receive adjuvant treatment after
surgery.
8
Selection bias in gene extraction on the basis
of microarray gene-expression data
Ambroise and McLachlan Proceedings of the
National Academy of Sciences Vol. 99, Issue 10,
6562-6566, May 14, 2002. http://www.pnas.org/cgi/content/full/99/10/6562
9
LINEAR CLASSIFIER
FORM
for the prediction of the group label y of a future entity with feature vector x.
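The form itself appears only as an image on the slide; a standard linear classifier (the sign convention below is an assumption) is

c(x) = \beta_0 + \beta^T x,

with the entity assigned to class 1 if c(x) > 0 and to class 2 otherwise.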
10
FISHER'S LINEAR DISCRIMINANT FUNCTION
where x̄1, x̄2, and S are the sample means and pooled sample covariance matrix found from the training data.
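The formula itself is not in the transcript; the usual sample version of Fisher's rule, assuming equal group priors, plugs the following into the linear form above:

\beta = S^{-1}(\bar{x}_1 - \bar{x}_2), \qquad
\beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2).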
11
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)
where β0 and β are obtained as follows (a standard formulation is sketched below), subject to constraints involving the slack variables ξj; the separable case has all ξj = 0.
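The optimization problem appears only as an image; the standard soft-margin formulation (the cost parameter C is an assumption, as it is not named in the transcript) is

\min_{\beta_0,\,\beta,\,\xi} \ \tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_{j=1}^{n}\xi_j
\quad \text{subject to} \quad
y_j(\beta_0 + \beta^T x_j) \ge 1 - \xi_j, \ \ \xi_j \ge 0 \quad (j = 1, \dots, n),

where the ξj are the slack variables; the separable case corresponds to forcing all ξj = 0.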
12
Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.
13
GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
  • LEUKAEMIA DATA
  • Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
  • COLON DATA
  • Using only 4 genes, the CVE is 2%

14
Figure 1 Error rates of the SVM rule with RFE
procedure averaged over 50 random splits of colon
tissue samples
15
Figure 3 Error rates of Fisher's rule with
stepwise forward selection procedure using all
the colon data
16
Figure 5 Error rates of the SVM rule averaged
over 20 noninformative samples generated by
random permutations of the class labels of the
colon tumor tissues
17
BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator,
where B1 is the bootstrap error rate when the rule is applied to a point not in the training sample. A Monte Carlo estimate of B1 is sketched below.
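The estimator and its Monte Carlo approximation appear only as images; in standard notation, with \overline{\mathrm{err}} the apparent (resubstitution) error rate, they take the form

B^{.632} = 0.368\,\overline{\mathrm{err}} + 0.632\,B1, \qquad
\widehat{B1} = \frac{1}{n}\sum_{j=1}^{n} \frac{\sum_{b=1}^{B} I_{bj}\,Q_{bj}}{\sum_{b=1}^{B} I_{bj}},

where I_{bj} = 1 if observation j does not appear in bootstrap sample b (0 otherwise) and Q_{bj} = 1 if the rule formed from bootstrap sample b misclassifies observation j.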
18
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR A(w), which depends on a weight w.
McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of the underlying parameters.
19
.632+ estimate of Efron & Tibshirani (1997, JASA)
where the weight w depends on r, the relative overfitting rate, and on an estimate of the no-information error rate.
If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
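In the notation of Efron and Tibshirani (1997), with \overline{\mathrm{err}} the apparent error rate and \hat{\gamma} the no-information error rate (the symbols are standard, not taken from the slide),

r = \frac{B1 - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}, \qquad
w = \frac{0.632}{1 - 0.368\,r}, \qquad
B^{.632+} = (1 - w)\,\overline{\mathrm{err}} + w\,B1.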
20
Ten-Fold Cross Validation
(Diagram: the data are partitioned into 10 blocks; each block in turn serves as the test set, with the remaining nine blocks used for training.)
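To make the selection-bias point concrete, the sketch below (not the authors' code) runs 10-fold cross-validation with the gene selection redone inside each training fold, so that the test fold plays no part in choosing the genes. It assumes scikit-learn; the simulated data, the 64-gene F-test filter, and the linear SVM settings are placeholders, not the procedure used for the Harvard data.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 62, 2000                      # a colon-data-sized problem
X = rng.standard_normal((n, p))      # placeholder expression matrix (samples x genes)
y = np.array([0] * 40 + [1] * 22)    # placeholder class labels

errors = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # Select the genes using the training fold only (external cross-validation).
    selector = SelectKBest(f_classif, k=64).fit(X[train], y[train])
    clf = SVC(kernel="linear").fit(selector.transform(X[train]), y[train])
    errors.append(np.mean(clf.predict(selector.transform(X[test])) != y[test]))

print("Externally cross-validated error rate:", np.mean(errors))

Selecting the genes once on all the data and then cross-validating only the classifier would give the optimistically biased (internal) estimate criticized in the Ambroise and McLachlan paper.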
21
MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
22
MARKER GENES FOR HARVARD DATA

No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10
23
Breast cancer data set in van 't Veer et al. (van 't Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)
These data were the result of microarray
experiments on three patient groups with
different classes of breast cancer tumours. The
overall goal was to identify a set of genes that
could distinguish between the different tumour
groups based upon the gene expression information
for these groups.
24
van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but made available only the data for the top 70 genes based on the previous study of van 't Veer et al. (2002).
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Two Clustering Problems
  • Clustering of genes on the basis of the tissues: the genes are not independent.
  • Clustering of tissues on the basis of the genes: the latter is a nonstandard problem in cluster analysis (n << p).

29
(No Transcript)
30
"Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade."
Gary Churchill (The Jackson Laboratory), contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni, Statistical Science (2003) 18, 64-69.
31
(No Transcript)
32
(No Transcript)
33
The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in advance, e.g. k-means. Usually, there is no a priori metric (or, equivalently, a user-defined distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
34
In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
35
Hierarchical (agglomerative) clustering
algorithms are largely heuristically motivated
and there exist a number of unresolved issues
associated with their use, including how to
determine the number of clusters.
"in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters."
(Yeung et al., 2001, Model-Based Clustering and
Data Transformations for Gene Expression Data,
Bioinformatics 17)
36
McLachlan and Khan (2004). On a resampling
approach for tests on the number of clusters with
mixture model-based clustering of the tissue
samples. Special issue of the Journal of
Multivariate Analysis 90 (2004) edited by Mark
van der Laan and Sandrine Dudoit (UC Berkeley).
37
MIXTURE OF g NORMAL COMPONENTS
38
MIXTURE OF g NORMAL COMPONENTS
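The density on these slides appears only as an image; the g-component normal mixture model, and the posterior probabilities that give the implied clustering, are usually written as

f(x) = \sum_{i=1}^{g} \pi_i\, \phi(x; \mu_i, \Sigma_i), \qquad
\tau_i(x) = \frac{\pi_i\, \phi(x; \mu_i, \Sigma_i)}{f(x)} \quad (i = 1, \dots, g),

with each tissue assigned to the component for which its posterior probability \tau_i(x) is greatest.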
39
In exploring high-dimensional data sets for
group structure, it is typical to rely on
principal component analysis.
40
Two Groups in Two Dimensions. All cluster
information would be lost by collapsing to the
first principal component. The principal
ellipses of the two groups are shown as solid
curves.
41
Mixtures of Factor Analyzers
  • A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data.
  • One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).

42
where
  • Bi is a p × q matrix, and
  • Di is a diagonal matrix.
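The displayed equation is not in the transcript; in the mixture of factor analyzers the component-covariance matrices are constrained to the form

\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \dots, g),

so that the mixture density becomes f(x) = \sum_{i=1}^{g} \pi_i\, \phi(x; \mu_i, B_i B_i^T + D_i).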

43
Single-Factor Analysis Model
44
The Uj are iid N(0, Iq), independently of the errors ej, which are iid N(0, D), where D is a diagonal matrix.
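The model equation itself appears only as an image; the single-factor analysis model described here is commonly written as

Y_j = \mu + B\,U_j + e_j \quad (j = 1, \dots, n), \qquad
\operatorname{Cov}(Y_j) = B B^T + D.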
45
Mixtures of Factor Analyzers
  • A single-factor analysis model provides only a global linear model.
  • A global nonlinear approach is obtained by postulating a mixture of linear submodels.

46
Conditional on membership of the ith component of the mixture,
where Ui1, ..., Uin are independent, identically distributed (iid) N(0, Iq), independently of the eij, which are iid N(0, Di), where Di is a diagonal matrix (i = 1, ..., g).
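Putting these assumptions together, the component-level model (shown only as an image on the slide) is, conditional on membership of the ith component,

Y_j = \mu_i + B_i\,U_{ij} + e_{ij} \quad (j = 1, \dots, n).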
47
There is an infinity of choices for Bi: the model still holds if Bi is replaced by BiCi, where Ci is an orthogonal matrix. Ci is chosen so that the indicated matrix is diagonal. The number of free parameters is then as sketched below.
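The constraint and the count appear only as images; a common choice, assumed here, is to take Ci so that B_i^T D_i^{-1} B_i is diagonal, in which case the number of free parameters in each \Sigma_i = B_i B_i^T + D_i is

pq + p - \tfrac{1}{2}q(q - 1),

compared with \tfrac{1}{2}p(p + 1) for an unrestricted \Sigma_i, a reduction of \tfrac{1}{2}\{(p - q)^2 - (p + q)\}.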
48
The reduction in the number of parameters is then as given in the sketch above.
  • We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm.

49
1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of the mixing proportions πi and the component means μi.
2nd cycle: declare the missing data to be also the factors. Update the estimates of Bi and Di.
50
M-step on 1st cycle, for i = 1, ..., g:
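The update formulas appear only as images; the standard EM updates of the mixing proportions and component means, with \tau_{ij} the current posterior probability that y_j belongs to the ith component, are

\pi_i^{(k+1)} = \frac{1}{n}\sum_{j=1}^{n} \tau_{ij}^{(k)}, \qquad
\mu_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(k)}\, y_j}{\sum_{j=1}^{n} \tau_{ij}^{(k)}}.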
51
M-step on 2nd cycle: update Bi and Di.
52
(No Transcript)
53
Work in q-dimensional space:

(B_i B_i^T + D_i)^{-1} = D_i^{-1} - D_i^{-1} B_i (I_q + B_i^T D_i^{-1} B_i)^{-1} B_i^T D_i^{-1},

|B_i B_i^T + D_i| = |D_i| \, / \, |I_q - B_i^T (B_i B_i^T + D_i)^{-1} B_i|.
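These two identities mean that all inversions and determinants can be computed in the q-dimensional factor space rather than the p-dimensional gene space. The short NumPy check below (illustrative only; the dimensions are arbitrary) verifies both identities numerically.

import numpy as np

rng = np.random.default_rng(0)
p, q = 50, 3                                   # p variables, q factors

B = rng.standard_normal((p, q))                # factor loadings B_i
D = np.diag(rng.uniform(0.5, 2.0, size=p))     # diagonal D_i
Sigma = B @ B.T + D                            # component covariance

# Inverse via the q-dimensional (Woodbury) form.
Dinv = np.diag(1.0 / np.diag(D))
M = np.eye(q) + B.T @ Dinv @ B                 # only a q x q system is solved
Sigma_inv = Dinv - Dinv @ B @ np.linalg.solve(M, B.T @ Dinv)

# Determinant via |Sigma| = |D| / |I_q - B^T Sigma^{-1} B|.
sign, logdet_q = np.linalg.slogdet(np.eye(q) - B.T @ Sigma_inv @ B)
logdet_Sigma = np.log(np.diag(D)).sum() - logdet_q

print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))           # True
print(np.isclose(logdet_Sigma, np.linalg.slogdet(Sigma)[1]))  # True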
54
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
55
(No Transcript)
56
Example Microarray Data: Colon Data of Alon et al. (1999)
n = 62 (40 tumours, 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.
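As a rough illustration only (this uses scikit-learn's GaussianMixture on a handful of principal components, not EMMIX-GENE or the mixture of factor analyzers, and the data below are simulated placeholders for the 62 × 2,000 colon matrix), model-based clustering of the tissue samples can be sketched as follows.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n, p = 62, 2000
X = rng.standard_normal((n, p))                # placeholder tissue-by-gene matrix

X_red = PCA(n_components=4).fit_transform(X)   # work in a low-dimensional space (n << p)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=1)
labels = gmm.fit_predict(X_red)                # implied clustering of the 62 tissues

print(np.bincount(labels))                     # cluster sizes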
57
(No Transcript)
58
(No Transcript)
59
Mixture of 2 normal components
60
Mixture of 2 t components
61
(No Transcript)
62
Clustering of COLON Data Genes using EMMIX-GENE
63
Grouping for Colon Data
64
(No Transcript)
65
(No Transcript)
66
Grouping for Colon Data
67
(No Transcript)
68
(No Transcript)
69
Heat Map of Genes in Group G1
70
Heat Map of Genes in Group G2
71
Heat Map of Genes in Group G3
72
(No Transcript)
73
"An efficient algorithm based on a heuristically justified objective function, delivered in reasonable time, is usually preferable to a principled statistical approach that takes years to develop or ages to run. Having said this, the case for a more principled approach can be made more effectively once cruder approaches have exhausted their harvest of low-hanging fruit."
Gilks (2004)
74
In bioinformatics, algorithms are generally
viewed as more important than models or
statistical efficiency. Unless the
methodological research results in a web-based
tool or, at the very least, downloadable code
that can be easily run by the user, it is
effectively useless.