Principal Component Analysis (PCA) for Clustering Gene Expression Data



1
Principal Component Analysis (PCA) for
Clustering Gene Expression Data
  • K. Y. Yeung and W. L. Ruzzo

2
Organization
  • PCA and its relevance to this paper
  • Approach of this paper
  • Data sets
  • Clustering algorithms and similarity metrics
  • Results and discussion

3
The Functions of PCA?
  • PCA can reduce the dimensionality of the data
    set.
  • A few PCs may capture most of the variation in
    the original data set.
  • PCs are uncorrelated and ordered.
  • We expect that the first few PCs may capture the
    cluster structure in the original data set (see
    the sketch below).
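
A minimal sketch of this idea using scikit-learn; the
gene-by-condition matrix X here is random data, purely for
illustration:

  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  X = rng.normal(size=(477, 7))    # hypothetical genes x conditions matrix

  pca = PCA()                      # compute all principal components
  scores = pca.fit_transform(X)    # genes projected onto the PCs

  k = 3                            # keep the first k PCs
  explained = pca.explained_variance_ratio_[:k].sum()
  print(f"first {k} PCs capture {explained:.1%} of the variance")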

4
This Paper's Point of View
  • A theoretical result shows that the first few
    PCs may not contain cluster information (Chang,
    1983).
  • Chang's example.
  • A motivating example (coming next).

5
A Motivating Example
  • Data: A subset of the sporulation data (477
    genes) was classified into seven temporal
    patterns (Chu et al., 1998).
  • The first 2 PCs contain 85.9% of the variation
    in the data (Figure 1a).
  • The first 3 PCs contain 93.2% of the variation
    in the data (Figure 1b).

6
Sporulation Data
  • The patterns overlap around the origin in Figure
    1a.
  • The patterns are much more separated in Figure
    1b.

7
The Goal
  • EMPIRICALLY investigate the effectiveness of
    clustering gene expression data using PCs instead
    of the original variables.

8
Outline of Methods
  • Genes are to be clustered, and the experimental
    conditions are the variables.
  • The effectiveness of clustering with the
    original data and with different sets of PCs is
    measured by comparing the clustering results to
    an objective external criterion.
  • The number of clusters is assumed to be known.

9
Agreement Between Two Partitions
  • The Rand index (Rand, 1971)
  • Given a set of n objects S, let U and V be two
    different partitions of S. Let
  • a = # of pairs that are placed in the same
    cluster in U and in the same cluster in V
  • d = # of pairs that are placed in different
    clusters in U and in different clusters in V
  • Rand index = (a + d) / C(n, 2)

10
Agreement (Cont'd)
  • The adjusted Rand index (ARI; Hubert & Arabie,
    1985).
  • Note: a higher ARI means higher correspondence
    between the two partitions (see the sketch
    below).
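
A minimal sketch of both indices. The pair counting follows
the definitions above; the adjusted index uses scikit-learn's
implementation, and the two example partitions are made up:

  from itertools import combinations
  from sklearn.metrics import adjusted_rand_score

  def rand_index(U, V):
      """Rand index = (a + d) / C(n, 2) for two partitions of the same objects."""
      a = d = 0
      for i, j in combinations(range(len(U)), 2):
          same_u, same_v = U[i] == U[j], V[i] == V[j]
          if same_u and same_v:
              a += 1        # same cluster in both partitions
          elif not same_u and not same_v:
              d += 1        # different clusters in both partitions
      return (a + d) / (len(U) * (len(U) - 1) // 2)

  U = [0, 0, 1, 1, 2, 2]    # hypothetical external classes
  V = [0, 0, 1, 2, 2, 2]    # hypothetical clustering result
  print(rand_index(U, V))               # plain Rand index
  print(adjusted_rand_score(U, V))      # ARI (Hubert & Arabie, 1985)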

11
Subset of PCs
  • Motivated by Chang's example, it may be possible
    to find other subsets of PCs that preserve the
    cluster structure better than the first few PCs.
  • How?
  • The greedy approach.
  • The modified greedy approach.

12
The Greedy Approach
  • Let m0 be the minimum number of PCs to be
    clustered, and p the number of variables in the
    data.
  • Search for a set of m0 PCs with maximum ARI,
    denoted s_m0.
  • For each m = m0+1, ..., p, add another PC to
    s_(m-1) and compute the ARI; the PC giving the
    maximum ARI is added to obtain s_m (see the
    sketch below).
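
A minimal sketch of this search, assuming the PC score
matrix, the external class labels, and a k-means clustering
step (the paper's algorithms and metrics vary; this is
illustration only):

  from itertools import combinations
  from sklearn.cluster import KMeans
  from sklearn.metrics import adjusted_rand_score

  def ari_of_subset(scores, subset, labels, n_clusters):
      """Cluster the chosen PC columns and score against the external classes."""
      pred = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(scores[:, list(subset)])
      return adjusted_rand_score(labels, pred)

  def greedy_pc_subsets(scores, labels, n_clusters, m0=2):
      p = scores.shape[1]
      # exhaustive search over all size-m0 subsets for the best starting set
      subset = list(max(combinations(range(p), m0),
                        key=lambda s: ari_of_subset(scores, s, labels, n_clusters)))
      yield list(subset)
      while len(subset) < p:
          # add the single PC that yields the maximum ARI
          best = max((j for j in range(p) if j not in subset),
                     key=lambda j: ari_of_subset(scores, subset + [j],
                                                 labels, n_clusters))
          subset.append(best)
          yield list(subset)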

13
The Modified Greedy Approach
  • In each step of the greedy approach (# of PCs =
    m), retain the k best subsets of PCs for the next
    step (# of PCs = m+1).
  • If k = 1, this is just the greedy approach.
  • k = 3 in this paper (see the sketch below).
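
A minimal beam-search style sketch of the modification,
reusing ari_of_subset from the previous sketch (again purely
illustrative):

  from itertools import combinations

  def modified_greedy(scores, labels, n_clusters, m0=2, k=3):
      """Keep the k best subsets at each size; k = 1 recovers the plain greedy."""
      p = scores.shape[1]
      beam = sorted(combinations(range(p), m0),
                    key=lambda s: ari_of_subset(scores, s, labels, n_clusters),
                    reverse=True)[:k]
      beam = [list(s) for s in beam]
      yield beam[0]
      for _ in range(m0, p):
          # grow every retained subset by one PC, then keep the k best
          grown = {tuple(sorted(s + [j]))
                   for s in beam for j in range(p) if j not in s}
          beam = sorted((list(g) for g in grown),
                        key=lambda s: ari_of_subset(scores, s, labels, n_clusters),
                        reverse=True)[:k]
          yield beam[0]   # best subset found at this size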

14
The Scheme of the Study
  • Given a gene expression data set with n genes
    (subjects) and p experimental conditions
    (variables), apply a clustering algorithm to each
    of the following and compute the ARI against the
    external criterion:
  • the given data set;
  • the first m PCs, for m = m0, ..., p;
  • the subset of PCs found by the (modified) greedy
    approach;
  • 30 sets of random PCs;
  • 30 sets of random orthogonal projections (see
    the sketch below).
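
A minimal sketch of the two randomized baselines, assuming p
conditions and a target dimension m (both values here are
made up):

  import numpy as np

  rng = np.random.default_rng(0)
  p, m = 24, 3                     # hypothetical: conditions, target dimension

  # Random set of m PCs: choose m of the p PC columns at random
  pc_subset = rng.choice(p, size=m, replace=False)

  # Random orthogonal projection: orthonormalize a Gaussian matrix via QR
  Q, _ = np.linalg.qr(rng.normal(size=(p, m)))   # Q has m orthonormal columns
  # an n x p data matrix X is then projected as X @ Q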

15
Data Sets
  • "Class" refers to a group in the external
    criterion; "cluster" refers to a group obtained
    by a clustering algorithm.
  • There are two real data sets and three synthetic
    data sets in this study.

16
The Ovary Data
  • The data contains 235 clones and 24 tissue
    samples.
  • For the 24 tissue samples, 7 are from normal
    tissues, 4 from blood samples, and 13 from
    ovarian cancers.
  • The 235 clones were found to correspond to four
    different genes (classes), with 58, 88, 57 and 32
    clones respectively.
  • The data for each clone was normalized across the
    24 experiments to have mean 0 and variance 1 (see
    the sketch below).
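
A minimal sketch of this row-wise standardization (X is any
clones-by-experiments matrix):

  import numpy as np

  def standardize_rows(X):
      """Normalize each row (clone) to mean 0, variance 1 across experiments."""
      mu = X.mean(axis=1, keepdims=True)
      sd = X.std(axis=1, keepdims=True)
      return (X - mu) / sd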

17
The Yeast Cell Cycle Data
  • The data set shows the fluctuation of expression
    levels over two cell cycles.
  • 380 genes were classified into five phases
    (classes).
  • The data for each gene was normalized to have
    mean 0 and variance 1 across each cell cycle.

18
Mixture of Normals on Ovary
  • For each gene (class), the sample covariance
    matrix and the mean vector are computed.
  • Sample (58, 88, 57, 32) clones from the MVN of
    each class; 10 replicates.
  • This preserves the mean and covariance of the
    original data, but relies on the MVN assumption
    (see the sketch below).
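
A minimal sketch of generating one such replicate, assuming a
clones-by-experiments matrix X and per-clone class labels:

  import numpy as np

  def mvn_replicate(X, labels, rng):
      """Fit a mean vector and sample covariance per class, then redraw
      each class's clones from the fitted multivariate normal."""
      parts = []
      for c in np.unique(labels):
          Xc = X[labels == c]
          mu = Xc.mean(axis=0)
          cov = np.cov(Xc, rowvar=False)   # sample covariance of the class
          parts.append(rng.multivariate_normal(mu, cov, size=len(Xc)))
      return np.vstack(parts)

  # e.g., ten replicates: [mvn_replicate(X, labels, np.random.default_rng(i))
  #                        for i in range(10)]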

19
Marginal Normality
20
Randomly Resample Ovary Data
  • For each class c (c = 1, ..., 4) and experimental
    condition j (j = 1, ..., 24), resample the
    expression levels with replacement. Retain the
    size of each class; 10 replicates.
  • No MVN assumption. Sampling independently across
    experimental conditions appears reasonable upon
    inspection of the data (see the sketch below).
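
A minimal sketch of this class-wise, condition-wise bootstrap
(all names are hypothetical):

  import numpy as np

  def resample_replicate(X, labels, rng):
      """Within each class, resample each condition's expression levels
      with replacement, independently across conditions."""
      out = np.empty_like(X)
      for c in np.unique(labels):
          idx = np.flatnonzero(labels == c)
          for j in range(X.shape[1]):    # conditions resampled independently
              out[idx, j] = rng.choice(X[idx, j], size=idx.size, replace=True)
      return out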

21
Cyclic Data
  • This data set models the cyclic behavior of
    genes over different time points.
  • The behavior of genes is modeled by a sine
    function.
  • A drawback of this model is the arbitrary choice
    of several parameters (see the sketch below).
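
A minimal sketch of such a sine-based generator; the class
count, phase offsets, amplitude, and noise level are
arbitrary choices, which is exactly the drawback noted above:

  import numpy as np

  rng = np.random.default_rng(0)
  n_genes, n_times, n_classes = 300, 20, 5
  t = np.arange(n_times)

  phase = rng.uniform(0, 2 * np.pi, size=n_classes)  # one phase per class
  labels = rng.integers(0, n_classes, size=n_genes)
  X = (np.sin(2 * np.pi * t / n_times + phase[labels, None])
       + rng.normal(scale=0.2, size=(n_genes, n_times)))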

22
Clustering Algorithms and Similarity Metrics
  • Clustering algorithms
  • Cluster affinity search technique (CAST)
  • Hierarchical average-link algorithm
  • k-means algorithm
  • Similarity metrics
  • Euclidean distance (m0 = 2)
  • Correlation coefficient (m0 = 3) (see the sketch
    below)
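
CAST has no standard library implementation, but the other
algorithm/metric combinations can be sketched with scipy and
scikit-learn (data and cluster count are made up):

  import numpy as np
  from scipy.cluster.hierarchy import fcluster, linkage
  from scipy.spatial.distance import pdist
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  X, k = rng.normal(size=(100, 10)), 4   # hypothetical data, cluster count

  # Hierarchical average-link with Euclidean distance
  Z = linkage(pdist(X, metric="euclidean"), method="average")
  hier_labels = fcluster(Z, t=k, criterion="maxclust")

  # Average-link with correlation-based dissimilarity (1 - r)
  Zc = linkage(pdist(X, metric="correlation"), method="average")

  # k-means (Euclidean by construction)
  km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)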

23-27
(No transcript: image-only slides)
28
Table 1
29
Table 2
  • One-sided Wilcoxon signed-rank test (see the
    sketch below).
  • CAST always favors no PCA.
  • The two significant results for PCA are not
    clear successes.
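
A minimal sketch of such a test with scipy; the paired ARI
values are invented for illustration:

  import numpy as np
  from scipy.stats import wilcoxon

  # Hypothetical paired ARIs: same algorithm without and with PCA
  ari_no_pca = np.array([0.62, 0.55, 0.71, 0.48, 0.66])
  ari_pca = np.array([0.58, 0.50, 0.69, 0.45, 0.60])

  # One-sided test: does clustering without PCA give higher ARI?
  stat, p = wilcoxon(ari_no_pca, ari_pca, alternative="greater")
  print(f"W = {stat}, p = {p:.3f}")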

30
Conclusion
  • The quality of clustering results on the data
    after PCA is not necessarily higher than that on
    the original data, and is sometimes lower.
  • The first m PCs do not necessarily give the
    highest adjusted Rand index, i.e., another set of
    PCs can give a higher ARI.

31
Conclusion (Contd)
  • There is no clear trend in the optimal number of
    PCs across the data sets, clustering algorithms,
    and similarity metrics, and no obvious
    relationship between cluster quality and the
    number or set of PCs used.

32
Conclusion (Contd)
  • On average, the quality of clusters obtained by
    clustering random sets of PCs tends to be
    slightly lower than that of clusters obtained
    from random sets of orthogonal projections,
    especially when the number of components is
    small.

33
Grand Conclusion
  • In general, we recommend AGAINST using PCA to
    reduce the dimensionality of the data before
    applying clustering algorithms, unless external
    information is available.