MICROARRAY DATA - PowerPoint PPT Presentation

Slides: 104
Provided by: kar143

Transcript and Presenter's Notes

1
MICROARRAY DATA
REPRESENTED by an N × M matrix whose jth column
contains the gene expressions for the N genes
of the jth tissue sample (j = 1, ..., M).
N = No. of genes (10^3 - 10^4)
M = No. of tissue samples (10 - 10^2)
STANDARD STATISTICAL METHODOLOGY IS APPROPRIATE
FOR M >> N
HERE N >> M
2
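Microarray Data represented as N x M Matrix
[Figure: matrix with columns Sample 1, Sample 2, ..., Sample M and rows Gene 1, Gene 2, ..., Gene N. Each column is an expression signature (M columns (samples), about 10^2); each row is an expression profile (N rows (genes), about 10^4).]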
3
Two Clustering Problems
  • Clustering of genes on basis of tissues
    (genes not independent)
  • Clustering of tissues on basis of genes
    (the latter is a nonstandard problem in
    cluster analysis: n << p)

4
UNSUPERVISED CLASSIFICATION (CLUSTER
ANALYSIS): INFER CLASS LABELS z1, ..., zn of y1,
..., yn
Initially, hierarchical distance-based methods of
cluster analysis were used to cluster the tissues
and the genes: Eisen, Spellman, Brown, and Botstein
(1998, PNAS)
5
The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in
advance, e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
6
In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
9
"Hierarchical clustering methods for the analysis
of gene expression data caught on like the hula
hoop. I, for one, will be glad to see them fade."
Gary Churchill (The Jackson Laboratory),
contribution to the discussion of the paper by Sebastiani,
Gussoni, Kohane, and Ramoni. Statistical Science
(2003) 18, 64-69.
10
Hierarchical (agglomerative) clustering
algorithms are largely heuristically motivated
and there exist a number of unresolved issues
associated with their use, including how to
determine the number of clusters.
"... in the absence of a well-grounded statistical
model, it seems difficult to define what is meant
by a good clustering algorithm or the right
number of clusters."
(Yeung et al., 2001, Model-Based Clustering and
Data Transformations for Gene Expression Data,
Bioinformatics 17)
11
McLachlan and Khan (2004). On a resampling
approach for tests on the number of clusters with
mixture model-based clustering of the tissue
samples. Special issue of the Journal of
Multivariate Analysis 90 (2004) edited by Mark
van der Laan and Sandrine Dudoit (UC Berkeley).
12
Attention is now turning towards a model-based
approach to the analysis of microarray data.
For example:
  • Broët, Richardson, and Radvanyi (2002). Bayesian
    hierarchical model for identifying changes in
    gene expression from microarray experiments.
    Journal of Computational Biology 9
  • Ghosh and Chinnaiyan (2002). Mixture modelling of
    gene expression data from microarray experiments.
    Bioinformatics 18
  • Liu, Zhang, Palumbo, and Lawrence (2003).
    Bayesian clustering with variable and
    transformation selection. In Bayesian Statistics
    7
  • Pan, Lin, and Le (2002). Model-based cluster
    analysis of microarray gene expression data.
    Genome Biology 3
  • Yeung et al. (2001). Model-based clustering and
    data transformations for gene expression data.
    Bioinformatics 17

16
http://www.maths.uq.edu.au/gjm
McLachlan and Peel (2000), Finite Mixture
Models. Wiley.
18
Mixture Software: EMMIX
EMMIX for UNIX
McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/gjm/emmix/emmix.html
19
Basic Definition
  • We let Y1, ..., Yn denote a random sample of size n,
    where Yj is a p-dimensional random vector with
    probability density function f(yj),
    $$ f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j), $$
  • where the f_i(y_j) are densities and the π_i are
    nonnegative quantities that sum to one.
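
The definition above can be sketched in a few lines. This is a minimal illustration, assuming univariate normal component densities; the parameter values and the names `normal_pdf` and `mixture_pdf` are made up for the example and do not come from the slides.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(y, props, means, variances):
    """Mixture density: sum over components of pi_i * f_i(y)."""
    if any(p < 0 for p in props) or abs(sum(props) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must be nonnegative and sum to one")
    return sum(p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances))

# Illustrative (made-up) parameters for a g = 2 component mixture.
props, means, variances = [0.75, 0.25], [0.0, 3.0], [1.0, 1.0]

# Riemann-sum check that the mixture is itself a density (area close to one).
step = 0.01
grid = [-10.0 + step * k for k in range(2301)]  # covers -10 to 13
area = sum(mixture_pdf(y, props, means, variances) for y in grid) * step
print(round(area, 4))
```

Because the proportions sum to one and each component integrates to one, the mixture integrates to one as well, which is what the numerical check confirms.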

20
Mixture distributions are applied to data with
two main purposes in mind
  • To provide an appealing semiparametric framework
    in which to model unknown distributional shapes,
    as an alternative to, say, the kernel density
    method.
  • To use the mixture model to provide a model-based
    clustering. (In both situations, there is the
    question of how many components to include in the
    mixture.)

21
Shapes of Some Univariate Normal Mixtures
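  • Consider
    $$ f(y) = \pi_1 \phi(y; \mu_1, \sigma^2) + \pi_2 \phi(y; \mu_2, \sigma^2), $$
  • where
    $$ \phi(y; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y - \mu)^2 / \sigma^2\} $$
  • denotes the univariate normal density with mean μ
    and variance σ².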
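
To see how the shape changes with the separation of the means, the sketch below counts the local maxima of a two-component normal mixture on a grid. It assumes equal proportions and common variance 1; the separations 1, 3 and 4 and the helper names are illustrative choices, not values taken from the slides.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def two_component_pdf(y, pi1, mu1, mu2, var):
    """Two normal components with a common variance."""
    return pi1 * normal_pdf(y, mu1, var) + (1.0 - pi1) * normal_pdf(y, mu2, var)

def count_modes(pi1, mu1, mu2, var, lo=-6.0, hi=10.0, step=0.01):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    n = int(round((hi - lo) / step)) + 1
    dens = [two_component_pdf(lo + step * k, pi1, mu1, mu2, var) for k in range(n)]
    return sum(1 for k in range(1, n - 1)
               if dens[k] > dens[k - 1] and dens[k] >= dens[k + 1])

# Equal proportions, common variance 1, increasing separation of the means.
modes_by_separation = {delta: count_modes(0.5, 0.0, float(delta), 1.0)
                       for delta in (1, 3, 4)}
print(modes_by_separation)
```

With equal proportions and a common variance, the mixture is bimodal only when the means are more than two standard deviations apart, so a two-component mixture does not necessarily show two visible peaks.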

22
Figure 1: Plot of a mixture density of two
univariate normal components in equal proportions
with common variance σ² = 1
23
Figure 2: Plot of a mixture density of two
univariate normal components in proportions 0.75
and 0.25 with common variance
26
Normal Mixtures
  • Computationally convenient for multivariate data
  • Provide an arbitrarily accurate estimate of the
    underlying density with g sufficiently large
  • Provide a probabilistic clustering of the data
    into g clusters - outright clustering by
    assigning a data point to the component to which
    it has the greatest posterior probability of
    belonging

27
Synthetic Data Set 1
28
Synthetic Data Set 2
29
Parameter  True Values  Initial Values  Estimates by EM
π1         0.333        0.333           0.294
π2         0.333        0.333           0.337
π3         0.333        0.333           0.370
μ1         (0 2)^T      (-1 0)^T        (-0.154 1.961)^T
μ2         (0 0)^T      (0 0)^T         (0.360 0.115)^T
μ3         (0 2)^T      (1 0)^T         (-0.004 2.027)^T
S1
S1
S1
30
Figure 7
32
Figure 8
34
MIXTURE OF g NORMAL COMPONENTS
35
MIXTURE OF g NORMAL COMPONENTS
36
Equal spherical covariance matrices
37
With a mixture model-based approach to
clustering, an observation is assigned outright
to the ith cluster if its density in the ith
component of the mixture distribution (weighted
by the prior probability of that component) is
greater than in the other (g-1) components.
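
The assignment rule above can be sketched as follows. This is a minimal univariate illustration with made-up parameters; `posterior_probs` and `assign` are hypothetical helper names, and normal components are assumed.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_probs(y, props, means, variances):
    """Posterior probability of each component given y (Bayes' rule)."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

def assign(y, props, means, variances):
    """Outright assignment: index of the component with the largest
    prior-weighted density, i.e. the largest posterior probability."""
    tau = posterior_probs(y, props, means, variances)
    return max(range(len(tau)), key=tau.__getitem__)

# Illustrative (made-up) two-component mixture.
props, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

for y in (1.0, 2.0, 3.5):
    print(y, posterior_probs(y, props, means, variances), assign(y, props, means, variances))
```

The posterior probabilities give the probabilistic (soft) clustering; taking the largest one turns it into the outright (hard) clustering described above.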
38
Figure 7: Contours of the fitted component
densities on the 2nd and 3rd variates for the blue
crab data set.
39
Estimation of Mixture Distributions
  • It was the publication of the seminal paper of
    Dempster, Laird, and Rubin (1977) on the EM
    algorithm that greatly stimulated interest in the
    use of finite mixture distributions to model
    heterogeneous data.
  • McLachlan and Krishnan (1997, Wiley)
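
As a rough illustration of how the EM algorithm fits a mixture, the sketch below runs EM for a univariate normal mixture on simulated data. The data, starting values, iteration count and function names are assumptions made for this example; it is not the EMMIX implementation.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, props, means, variances):
    """Observed-data log likelihood of the mixture."""
    return sum(math.log(sum(p * normal_pdf(y, m, v)
                            for p, m, v in zip(props, means, variances)))
               for y in data)

def em_step(data, props, means, variances):
    """One EM iteration for a univariate normal mixture."""
    n, g = len(data), len(props)
    # E-step: posterior probabilities tau[j][i] of component i for point j.
    tau = []
    for y in data:
        w = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
        s = sum(w)
        tau.append([wi / s for wi in w])
    # M-step: weighted proportions, means and variances.
    new_props, new_means, new_vars = [], [], []
    for i in range(g):
        ni = sum(t[i] for t in tau)
        mi = sum(t[i] * y for t, y in zip(tau, data)) / ni
        vi = sum(t[i] * (y - mi) ** 2 for t, y in zip(tau, data)) / ni
        new_props.append(ni / n)
        new_means.append(mi)
        new_vars.append(max(vi, 1e-6))  # crude guard against a collapsing component
    return new_props, new_means, new_vars

# Simulated (made-up) data: 200 points around 0 and 100 points around 4.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(4.0, 1.0) for _ in range(100)]

props, means, variances = [0.5, 0.5], [-1.0, 5.0], [1.0, 1.0]
loglik_path = [log_likelihood(data, props, means, variances)]
for _ in range(50):
    props, means, variances = em_step(data, props, means, variances)
    loglik_path.append(log_likelihood(data, props, means, variances))

print(props, means, variances)
```

Each iteration cannot decrease the log likelihood, which is the property that makes EM attractive for mixtures; the result can still depend on the starting values.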

40
  • If need be, the normal mixture model can be
    made less sensitive to outlying observations by
    using t component densities.
  • With this t mixture model-based approach, the
    normal distribution for each component in the
    mixture is embedded in a wider class of
    elliptically symmetric distributions with an
    additional parameter called the degrees of
    freedom.
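
The effect of using a t density can be sketched with a single component rather than a full mixture. The example assumes fixed degrees of freedom (3, an arbitrary choice), made-up data with one gross outlier, and the standard EM weights for a t location-scale model; the function name is hypothetical.

```python
def t_location_scale(data, nu=3.0, iterations=200):
    """EM for a univariate t location-scale model with fixed degrees of freedom.

    Each point gets weight u = (nu + 1) / (nu + squared standardized distance),
    so points far from the centre are downweighted."""
    ordered = sorted(data)
    mu = ordered[len(ordered) // 2]  # start at the (upper) median
    var = 1.0
    n = len(data)
    for _ in range(iterations):
        u = [(nu + 1.0) / (nu + (y - mu) ** 2 / var) for y in data]
        mu = sum(ui * y for ui, y in zip(u, data)) / sum(u)
        var = sum(ui * (y - mu) ** 2 for ui, y in zip(u, data)) / n
    return mu, var

# Made-up data: nine points spread evenly around 0, plus one gross outlier.
inliers = [-2.0 + 0.5 * k for k in range(9)]
data = inliers + [50.0]

sample_mean = sum(data) / len(data)   # the normal-model ML estimate of location
mu_t, var_t = t_location_scale(data)  # the t-model estimate of location and scale
print(sample_mean, mu_t)
```

The sample mean is pulled a long way towards the outlier, while the t-based location estimate stays close to the bulk of the data because the outlier receives a very small weight.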

41
The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the normal
mixture model, the outliers have to be much
larger.
42
In exploring high-dimensional data sets for group
structure, it is typical to rely on principal
component analysis.
43
Two Groups in Two Dimensions. All cluster
information would be lost by collapsing to the
first principal component. The principal
ellipses of the two groups are shown as solid
curves.
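
The situation in this figure can be imitated with a small made-up data set: two groups that are long in one direction and separated in the other. The sketch uses the closed-form principal axis of a 2 x 2 covariance matrix; all names and values are illustrative.

```python
import math

# Two made-up groups: elongated along x, separated only along y.
group_a = [(float(t), 1.0) for t in range(-10, 11)]
group_b = [(float(t), -1.0) for t in range(-10, 11)]
points = group_a + group_b

def principal_axes(pts):
    """Centre and principal directions of 2-D data (closed form for 2 x 2)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the major axis
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return (mx, my), first, second

def scores(pts, centre, direction):
    """Projections of the centred points onto a direction."""
    return [(x - centre[0]) * direction[0] + (y - centre[1]) * direction[1]
            for x, y in pts]

centre, pc1, pc2 = principal_axes(points)
pc1_a, pc1_b = scores(group_a, centre, pc1), scores(group_b, centre, pc1)
pc2_a, pc2_b = scores(group_a, centre, pc2), scores(group_b, centre, pc2)
print(pc1, sorted(pc1_a) == sorted(pc1_b), min(pc2_a), max(pc2_b))
```

On the first principal component the two groups have identical scores, so the group structure is invisible; it is the second, lower-variance component that separates them.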
44
Mixtures of Factor Analyzers
A normal mixture model without restrictions on
the component-covariance matrices may be viewed
as too general for many situations in practice,
in particular, with high-dimensional data. One
approach for reducing the number of parameters
is to work in a lower-dimensional space by using
principal components; another is to use mixtures
of factor analyzers (Ghahramani and Hinton, 1997).
45
Mixtures of Factor Analyzers
  • Principal components or a single-factor analysis
    model provides only a global linear model.
  • A global nonlinear approach is obtained by postulating a
    mixture of linear submodels.

46
$$ \Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g), $$
where B_i is a p x q matrix and D_i is a diagonal
matrix.
47
Single-Factor Analysis Model
$$ Y_j = \mu + B U_j + e_j \quad (j = 1, \ldots, n) $$
48
The U_j are iid N(0, I_q), independently of the
errors e_j, which are iid N(0, D), where D is a
diagonal matrix.
49
Conditional on ith component membership of the
mixture,
$$ Y_j = \mu_i + B_i U_{ij} + e_{ij}, $$
where U_i1, ..., U_in are independent and
identically distributed (iid) N(0, I_q),
independently of the e_ij, which are iid
N(0, D_i), where D_i is a diagonal matrix
(i = 1, ..., g).
50
There is an infinity of choices for B_i, since the model still
holds if B_i is replaced by B_i C_i, where C_i is an
orthogonal matrix. Choose C_i so that
$$ B_i^T D_i^{-1} B_i $$
is diagonal.
The number of free parameters is then
$$ pq + p - \tfrac{1}{2} q(q - 1). $$
51
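  • Reduction in the number of parameters is then
    $$ \tfrac{1}{2} p(p + 1) - pq - p + \tfrac{1}{2} q(q - 1) = \tfrac{1}{2}\{(p - q)^2 - (p + q)\}. $$
  • We can fit the mixture of factor analyzers model
    using an alternating ECM algorithm.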
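
The size of the reduction is easy to tabulate. The sketch below assumes the usual count for a factor-analytic covariance matrix, pq loadings plus p diagonal terms minus q(q-1)/2 rotational constraints, against p(p+1)/2 for an unrestricted covariance matrix; the function names and the choice q = 4 are illustrative.

```python
def cov_params_unrestricted(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def cov_params_factor(p, q):
    """Free parameters in B B^T + D with B (p x q) and D diagonal:
    pq loadings + p diagonal terms - q(q-1)/2 rotational constraints."""
    return p * q + p - q * (q - 1) // 2

def reduction(p, q):
    """Parameters saved per component by the factor-analytic form."""
    return cov_params_unrestricted(p) - cov_params_factor(p, q)

# p = 2000 genes as in a typical tissue-clustering problem; q = 4 is an
# arbitrary illustrative number of factors.
for p, q in ((10, 2), (2000, 4)):
    print(p, q, cov_params_unrestricted(p), cov_params_factor(p, q), reduction(p, q))
```

For p in the thousands the unrestricted count runs to millions of parameters per component, which is far more than a few dozen tissue samples can support.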

52
1st cycle: declare the missing data to be the
component-indicator vectors. Update the
estimates of π_i and μ_i.
2nd cycle: declare the missing data to be also
the factors. Update the estimates of B_i and D_i.
53
M-step on 1st cycle
for i = 1, ..., g.
54
M-step on 2nd cycle
where
56
Work in q-dim space:
$$ (B_i B_i^T + D_i)^{-1} = D_i^{-1} - D_i^{-1} B_i (I_q + B_i^T D_i^{-1} B_i)^{-1} B_i^T D_i^{-1}, $$
$$ |B_i B_i^T + D_i| = |D_i| \,/\, |I_q - B_i^T (B_i B_i^T + D_i)^{-1} B_i|. $$
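
The inversion identity can be checked numerically. The sketch below uses small made-up matrices (p = 5, q = 2) and plain Python helpers written for this example; the point is that only a q x q matrix has to be inverted, since D is diagonal.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Made-up loadings B (p x q) and diagonal of D.
p, q = 5, 2
B = [[1.0, 0.5], [0.3, -1.2], [0.0, 0.7], [2.0, 0.1], [-0.4, 0.9]]
d = [1.0, 2.0, 0.5, 1.5, 3.0]

D_inv = [[1.0 / d[i] if i == j else 0.0 for j in range(p)] for i in range(p)]
Bt_Dinv = matmul(transpose(B), D_inv)            # q x p
inner = matmul(Bt_Dinv, B)                       # q x q
for k in range(q):
    inner[k][k] += 1.0                           # I_q + B^T D^-1 B
correction = matmul(matmul(transpose(Bt_Dinv), inverse(inner)), Bt_Dinv)
sigma_inv = [[D_inv[i][j] - correction[i][j] for j in range(p)] for i in range(p)]

# Direct check: (B B^T + D) times the formula should be the identity.
sigma = matmul(B, transpose(B))
for i in range(p):
    sigma[i][i] += d[i]
product = matmul(sigma, sigma_inv)
max_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                for i in range(p) for j in range(p))
print(max_error)
```

The p x p inverse is used here only to verify the result; in a fit with p in the thousands it would be avoided entirely.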
57
where
58
With EM
where
59
To avoid potential computational problems with
small-sized clusters, we impose the constraint
61
Number of Components in a Mixture Model
  • Testing for the number of components, g, in a
    mixture is an important but very difficult
    problem which has not been completely resolved.

62
Order of a Mixture Model
  • A mixture density with g components might be
    empirically indistinguishable from one with
    either fewer than g components or more than g
    components. It is therefore sensible in practice
    to approach the question of the number of
    components in a mixture model in terms of an
    assessment of the smallest number of components
    in the mixture compatible with the data.

63
Likelihood Ratio Test Statistic
  • An obvious way of approaching the problem of
    testing for the smallest value of the number of
    components in a mixture model is to use the LRTS,
    -2 log λ. Suppose we wish to test the null
    hypothesis
    $$ H_0: g = g_0 $$
    versus
    $$ H_1: g = g_1 $$
    for some g1 > g0.
64
  • We let $\hat{\Psi}_i$ denote the MLE of Ψ calculated
    under H_i (i = 0, 1). Then the evidence against H_0
    will be strong if λ is sufficiently small, or
    equivalently, if -2 log λ is sufficiently large,
    where
    $$ -2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}. $$

65
Bootstrapping the LRTS
  • McLachlan (1987) proposed a resampling approach
    to the assessment of the P-value of the LRTS in
    testing
    $$ H_0: g = g_0 \quad \text{versus} \quad H_1: g = g_1 $$
    for a specified value of g0.

66
Bayesian Information Criterion
The Bayesian information criterion (BIC) of
Schwarz (1978) is given by

as the penalized log likelihood to be maximized
in model selection, including the present
situation for the number of components g in a
mixture model.
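
A sketch of how such a criterion is used to choose g. It assumes the form 2 log L - d log n, one common scaling of BIC written so that larger is better; the scaling on the original slide may differ. The log likelihood values are hypothetical numbers invented for the example, and d counts the parameters of a univariate normal mixture with unequal variances.

```python
import math

def mixture_param_count(g):
    """Free parameters of a univariate normal mixture with g components:
    (g - 1) proportions + g means + g variances."""
    return (g - 1) + g + g

def bic(loglik, d, n):
    """Penalized log likelihood (assumed scaling), to be maximized."""
    return 2.0 * loglik - d * math.log(n)

# Hypothetical maximized log likelihoods for g = 1, 2, 3 on n = 300 points.
n = 300
loglik_by_g = {1: -650.0, 2: -600.0, 3: -597.0}

bic_by_g = {g: bic(ll, mixture_param_count(g), n) for g, ll in loglik_by_g.items()}
best_g = max(bic_by_g, key=bic_by_g.get)
print(bic_by_g, best_g)
```

The log likelihood always improves as components are added, so the penalty term is what stops the criterion from simply choosing the largest g.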
67
Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)

68
PROVIDES A MODEL-BASED APPROACH TO
CLUSTERING: McLachlan, Bean, and Peel, 2002, A
Mixture Model-Based Approach to the Clustering of
Microarray Expression Data, Bioinformatics 18,
413-422
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
70
Example: Microarray Data
Colon Data of Alon et al. (1999)
M = 62 (40 tumours + 22 normals) tissue samples
of N = 2,000 genes in a 2,000 × 62 matrix.
73
Mixture of 2 normal components
74
Mixture of 2 t components
75
The t distribution does not have substantially
better breakdown behavior than the normal (Tyler,
1994). The advantage of the t mixture model is
that, although the number of outliers needed for
breakdown is almost the same as with the
normal mixture model, the outliers have to be
much larger. This point is made more precise in
Hennig (2002), who has provided an excellent
account of breakdown points for ML estimation of
location-scale mixtures with a fixed number of
components g. Of course, as explained in Hennig
(2002), mixture models can be made more robust by
allowing the number of components g to grow with
the number of outliers.
76
For Normal mixtures breakdown begins with an
additional point at about 15.2. For a mixture of
t3-distributions, the outlier must lie at about
800, t1-mixtures need the outlier at about ,
and a Normal mixture with additional noise
component breaks down with an additional point at
79
Clustering of COLON Data Genes using EMMIX-GENE
80
Grouping for Colon Data
83
Clustering of COLON Data Tissues using EMMIX-GENE
84
Grouping for Colon Data
85
Heat Map Displaying the Reduced Set of 4,869
Genes on the 98 Breast Cancer Tumours
86
Heat Map of Top 1867 Genes
92
where i = group number, m_i = number in group i,
U_i = -2 log λ_i
93
Heat Map of Genes in Group G1
94
Heat Map of Genes in Group G2
95
Heat Map of Genes in Group G3
96
Clustering of gene expression profiles
  • Longitudinal (with or without replication, for
    example time-course)
  • Cross-sectional data

EMMIX-WIRE: EM-based MIXture analysis With Random
Effects
A Mixture Model with Random-Effects Components
for Clustering Correlated Gene-Expression
Profiles. S.K. Ng, G. J. McLachlan, K. Wang, L.
Ben-Tovim Jones, S-W. Ng.
97
Clustering of Correlated Gene Profiles
99
N(μ_h, B_h), with
100
Yeast Cell Cycle
X is an 18 x 2 matrix with the (l+1)th row (l =
0, ..., 17)
$$ \left(\cos(2\pi(7l)/\omega + \Phi),\ \sin(2\pi(7l)/\omega + \Phi)\right). $$
Yeast data is from Spellman et al. (1998). The 18 rows
represent the 18 time points of the α-factor (pheromone)
synchronization, where the yeast cells were
sampled at 7 minute intervals for 119 minutes. ω
is the period of the cell cycle and Φ is
the phase offset, estimated using least squares
to be ω = 53 and Φ = 0.
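
The design matrix described above can be built directly. The sketch assumes each row has the form (cos(2π·7l/period + phase), sin(2π·7l/period + phase)), which is consistent with the period and phase offset quoted on the slide; the variable names are illustrative.

```python
import math

PERIOD = 53.0  # period of the cell cycle in minutes, as quoted on the slide
PHASE = 0.0    # phase offset, as quoted on the slide

# 18 sampling times at 7 minute intervals: 0, 7, ..., 119.
times = [7 * l for l in range(18)]

# Assumed row form: one cosine and one sine regressor per time point.
X = [[math.cos(2.0 * math.pi * t / PERIOD + PHASE),
      math.sin(2.0 * math.pi * t / PERIOD + PHASE)] for t in times]

for t, row in zip(times, X):
    print(t, round(row[0], 3), round(row[1], 3))
```

With a cosine and a sine column, any sinusoid of the given period can be written as a linear combination of the two, so the amplitude and timing of each gene's cycle enter the model linearly.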
101
Clustering Results for Spellman Yeast Cell Cycle
Data
102
Plots of First versus Second Principal Components
(a) Our clustering
(b) Muro clustering
103
A Mixture Model with Random-Effects Components
for Clustering Correlated Gene-Expression
Profiles. S.K. Ng, G. J. McLachlan, K. Wang, L.
Ben-Tovim Jones, S-W. Ng.