Exploring Data using Dimension Reduction and Clustering
Transcript and Presenter's Notes
1
Exploring Data using Dimension Reduction and Clustering
  • Naomi Altman
  • Nov. 06

2
Spellman Cell Cycle data
  • Yeast cells were synchronized by arrest of a
    cdc15 temperature-sensitive mutant.
  • Samples were taken every 10 minutes and one array
    was hybridized for each sample using a reference
    design. 2 complete cycles are in the data.
  • I downloaded the data and normalized using loess.
    (Print tip data were not available.)
  • I used the normalized value of M as the primary
    data.

3
What they did
  • Supervised dimension reduction: regression
  • They were looking for genes that have cyclic
    behavior - i.e. a sine or cosine wave in time.
  • They regressed Mi on sine and cosine waves and
    selected genes for which the R2 was high.
  • The period of the wave was known (from observing
    the cells?), so they regressed against sin(wt)
    and cos(wt), where w is set to give the
    appropriate period (a minimal sketch follows below).
  • If the period is unknown, a method called Fourier
    analysis can be used to discover it.
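A minimal sketch of this kind of fit; the gene vector M.i, the sampling times, and the 120-minute period are illustrative assumptions, not values from the slides:

    time <- c(10,30,50,10*(7:25),270,290)        # sampling times in minutes
    period <- 120                                # assumed cell-cycle period
    w <- 2*pi/period                             # angular frequency
    M.i <- rnorm(length(time))                   # placeholder for one gene's M values
    fit <- lm(M.i ~ sin(w*time) + cos(w*time))   # regress on sine and cosine waves
    summary(fit)$r.squared                       # select genes for which R2 is high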

4
Regression
  • Suppose we are looking for genes that are
    associated with a particular quantitative
    phenotype, or have a pattern that is known in
    advance.
  • E.g. Suppose we are interested in genes that
    change linearly with temperature and
    quadratically with pH.
  • Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise
  • We might fit this model for each gene (assuming
    that the arrays came from samples subjected to
    different levels of Temp and pH); a sketch of
    such a per-gene fit follows below.
  • This is similar to differential expression
    analysis - we have a multiple comparisons problem.
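A minimal sketch of such a per-gene fit on simulated placeholder data (the design values and gene count are assumptions for illustration):

    set.seed(1)
    Temp <- rep(c(20, 25, 30), each = 4)         # hypothetical array-level temperatures
    pH   <- rep(c(5, 6, 7, 8), times = 3)        # hypothetical array-level pH values
    M    <- matrix(rnorm(50 * 12), nrow = 50)    # 50 placeholder genes x 12 arrays
    # fit Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise, gene by gene
    r2 <- apply(M, 1, function(y)
      summary(lm(y ~ Temp + pH + I(pH^2)))$r.squared)
    head(sort(r2, decreasing = TRUE))            # genes best fitting the pattern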

5
Regression
  • We might compute an adjusted p-value, or a
    goodness-of-fit statistic, to select genes based
    on the fit to a pattern (a small sketch follows
    below).
  • If we have many "conditions" we do not need to
    replicate as much as in differential expression
    analysis because we consider any deviation from
    the "pattern" to be random variation.

6
What I did
  • Unsupervised dimension reduction
  • I used SVD on the 832 genes x 24 time points.
  • We can see that eigengene 5 has the cyclic genes.

7
For class
  • I extracted the 304 spots with variance greater
    than 0.25 (a sketch of this filter follows below).
  • To my surprise, several of these were empty or
    control spots. I removed these.
  • This leaves 295 genes, which are in yeast.txt.
  • Read these into R.
  • Also time <- c(10,30,50,10*(7:25),270,290)
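A sketch of the variance filter; M.all is a placeholder for the full normalized genes-by-arrays matrix of M values (the name and dimensions are assumptions):

    M.all <- matrix(rnorm(832 * 24), nrow = 832)    # placeholder for the full data
    gene.var <- apply(M.all, 1, var, na.rm = TRUE)  # per-spot variance across arrays
    high.var <- M.all[gene.var > 0.25, ]            # keep spots with variance > 0.25
    dim(high.var)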

8
  • yeast <- read.delim("yeast.txt", header=T)
  • time <- c(10,30,50,10*(7:25),270,290)
  • M.yeast <- yeast[,2:25]    # strip off the gene names
  • svd.m <- svd(M.yeast)      # svd
  • plot(1:24, svd.m$d)        # scree plot
  • par(mfrow=c(4,4))          # plot the first 16 "eigengenes"
  • for (i in 1:16) plot(time, svd.m$v[,i], main=paste("Eigen",i), type="l")
  • par(mfrow=c(1,1))
  • plot(time, svd.m$v[,1], type="l", ylim=c(min(svd.m$v), max(svd.m$v)))
  • for (i in 2:4) lines(time, svd.m$v[,i], col=i)
  • It looks like "eigengenes" 2-4 have the periodic
    components.

9
  • Reduce dimension by finding genes that are
    linear combinations of these 3 patterns by
    regression.
  • We can use limma to fit a regression to every
    gene and use e.g. the F or p-value to pick
    significant genes.
  • library(limma)
  • design.reg <- model.matrix(~svd.m$v[,2:4])
  • fit.reg <- lmFit(M.yeast, design.reg)
  • The "reduced dimension" version of each gene is
    its fitted values b0 + b1*v2 + b2*v3 + b3*v4,
    where vi is the ith column of svd.m$v and the bi
    are the coefficients.
  • Let's look at gene 1 (not periodic) and genes
    5, 6, 7:
  • plot(time, M.yeast[i,], type="l")
  • lines(time, fit.reg$coef[i,1] + fit.reg$coef[i,2]*svd.m$v[,2] + fit.reg$coef[i,3]*svd.m$v[,3] + fit.reg$coef[i,4]*svd.m$v[,4])

10
  • Select the genes with a strong periodic component.
  • We could use R2, but in limma it is simplest to
    compute the moderated F-test for regression and
    then use the p-values.
  • limma requires us to remove the intercept from
    the coefficients to get this test:
  • contrast.matrix <- cbind(c(0,1,0,0), c(0,0,1,0), c(0,0,0,1))
  • fit.contrast <- contrasts.fit(fit.reg, contrast.matrix)
  • efit <- eBayes(fit.contrast)
  • We will use the Bonferroni method to pick a
    significance level:
  • a = 0.05/(number of genes) = 0.05/295 = 0.00017
  • sigGenes <- which(efit$F.p.value < 0.00017)
  • Plot a few of these genes.
  • You might also want to plot a few genes with
    p-value > 0.5.

11
Note that we used the normalized but uncentered,
unscaled data for this exercise. Things might
look very different if the data were
transformed.
12
Clustering
  • We might ask which genes have similar expression
    patterns.
  • Once we have expressed (dis)similarity as a
    distance measure, we can use this measure to
    cluster genes that are similar (two common
    choices are sketched below).
  • There are many methods. We will discuss two:
    hierarchical clustering and k-means clustering.
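Two common choices of (dis)similarity for gene expression profiles, sketched on a small placeholder matrix X (rows = genes):

    X <- matrix(rnorm(5 * 24), nrow = 5)  # 5 placeholder gene profiles over 24 arrays
    d.euc <- dist(X)                      # Euclidean distance between profiles
    d.cor <- as.dist(1 - cor(t(X)))       # correlation-based distance: 1 - Pearson r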

13
Hierarchical Clustering (agglomerative)
  • Choose a distance function for points d(x1,x2)
  • Choose a distance function for clusters D(C1,C2)
    (for clusters formed by just one point, D
    reduces to d).
  • Start from N clusters, each containing one data
    point.
  • At each iteration
  • a) Using the current matrix of cluster
    distances, find the two closest clusters.
  • b) Update the list of clusters by merging the
    two closest.
  • c) Update the matrix of cluster distances
    accordingly.
  • Repeat until all data points are joined in one
    cluster.
  • Remarks
  • The method is sensitive to anomalous data
    points/outliers
  • F. Chiaromonte Sp 06 5

14
Hierarchical Clustering (agglomerative)
  • Choose a distance function for points d(x1,x2)
  • Choose a distance function for clusters D(C1,C2)
    (for clusters formed by just one point, D
    reduces to d).
  • Start from N clusters, each containing one data
    point.
  • At each iteration
  • a) Using the current matrix of cluster
    distances, find the two closest clusters.
  • b) Update the list of clusters by merging the
    two closest.
  • c) Update the matrix of cluster distances
    accordingly.
  • Repeat until all data points are joined in one
    cluster.
  • Remarks
  • The method is sensitive to anomalous data
    points/outliers.
  • Mergers are irreversible: bad mergers
    occurring early on affect the structure of the
    nested sequence.
  • If two pairs of clusters are equally (and
    maximally) close at a given iteration, we have to
    choose arbitrarily; the choice will affect the
    structure of the nested sequence.
  • F. Chiaromonte Sp 06 5

15
Defining cluster distance: the linkage function.
D(C1,C2) is a function f of the point distances
d(x1i,x2j), x1i in C1, x2j in C2:
  • Single (string-like, long): f = min
  • Complete (ball-like, compact): f = max
  • Average: f = average
  • Centroid: d(ave(x1i), ave(x2j))
Single and complete linkages produce nested
sequences invariant under monotone transformations
of d; this is not the case for average linkage.
However, the latter is a compromise between the
long, stringy clusters produced by single linkage
and the round, compact clusters produced by
complete linkage (a comparison sketch follows
below). F. Chiaromonte Sp 06 5
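A sketch comparing the three linkages on placeholder data, using hclust's method names "single", "complete", and "average":

    X <- matrix(rnorm(20 * 24), nrow = 20)         # 20 placeholder profiles
    d <- dist(X)
    hc.single   <- hclust(d, method = "single")    # f = min: long, stringy clusters
    hc.complete <- hclust(d, method = "complete")  # f = max: round, compact clusters
    hc.average  <- hclust(d, method = "average")   # f = average: a compromise
    par(mfrow = c(1, 3))
    plot(hc.single); plot(hc.complete); plot(hc.average)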
16
Example: agglomeration step in constructing the
nested sequence (first iteration):
1. 3 and 5 are the closest, and are therefore
   merged into cluster 35.
2. A new distance matrix is computed with complete
   linkage.
Ordinate: the distance, or height, at which each
merger occurred. The horizontal ordering of the
data points is any order preventing intersections
of branches. F. Chiaromonte Sp 06 5
[Figure: dendrograms under single linkage and complete linkage]
17
Hierarchical Clustering
  • Hierarchical clustering, per se, does not dictate
    a partition and a number of clusters.
  • It provides a nested sequence of partitions
    (this is more informative than just one
    partition).
  • To settle on one partition, we have to cut the
    dendrogram (see the sketch below).
  • Usually we pick a height and cut there, but the
    most informative cuts are often at different
    heights for different branches.
  • F. Chiaromonte Sp 06 5
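A sketch of cutting a dendrogram with cutree(), assuming the M.yeast matrix from the earlier slides (the cut height here is arbitrary):

    hc <- hclust(dist(M.yeast), method = "complete")
    plot(hc)                   # inspect the dendrogram
    cl.h <- cutree(hc, h = 5)  # cut at an arbitrary height
    cl.k <- cutree(hc, k = 6)  # or ask directly for 6 clusters
    table(cl.k)                # cluster sizes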

18
hclust(dist(M.yeast), method="single")
19
Partitioning algorithms: K-means.
  • Choose a distance function for points d(xi,xj).
  • Choose K, the number of clusters.
  • Initialize the K cluster centroids (with points
    chosen at random).
  • Use the data to iteratively relocate centroids,
    and reallocate points to closest centroid.
  • At each iteration
  • Compute distance of each data point from each
    current centroid.
  • Update current cluster membership of each data
    point, selecting the centroid to which the point
    is closest.
  • Update current centroids, as averages of the new
    clusters formed in step 2.
  • Repeat until cluster memberships, and thus
    centroids, stop changing (a from-scratch sketch
    follows below).
    F. Chiaromonte Sp 06 5
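A from-scratch sketch of this iteration with Euclidean distance, on placeholder two-dimensional data (in practice we would simply call kmeans()):

    X <- matrix(rnorm(100 * 2), ncol = 2)      # placeholder data
    K <- 3
    centroids <- X[sample(nrow(X), K), ]       # initialize with points chosen at random
    member <- rep(0L, nrow(X))                 # current cluster membership
    repeat {
      # squared distance of each point from each current centroid
      d2 <- sapply(1:K, function(k) colSums((t(X) - centroids[k, ])^2))
      new.member <- max.col(-d2)               # closest centroid for each point
      if (all(new.member == member)) break     # memberships stopped changing
      member <- new.member
      # update centroids as averages of the new clusters
      centroids <- t(sapply(1:K, function(k)
        colMeans(X[member == k, , drop = FALSE])))
    }
    table(member)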

20
  • Remarks
  • This method is sensitive to anomalous data
    points/outliers.
  • Points can move from one cluster to another, but
    the final solution depends strongly on centroid
    initialization, so we usually restart several
    times to check (a restart sketch follows after
    this list).
  • If two centroids are equally (and maximally)
    close to an observation at a given iteration, we
    have to choose arbitrarily (the problem here is
    not so serious because points can move later).
  • There are several variants of the k-means
    algorithm, using e.g. the median instead of the
    mean.
  • K-means converges to a local minimum of the total
    within-cluster squared distance (the total
    within-cluster sum of squares), not necessarily a
    global one.
  • Clusters tend to be ball-shaped with respect to
    the chosen distance.
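A sketch of guarding against bad initializations with multiple restarts, assuming M.yeast from the earlier slides; kmeans() keeps the start with the lowest total within-cluster sum of squares:

    k.once  <- kmeans(M.yeast, centers = 6)               # a single random start
    k.multi <- kmeans(M.yeast, centers = 6, nstart = 25)  # best of 25 random starts
    c(k.once$tot.withinss, k.multi$tot.withinss)          # compare total within-cluster SS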

21
Starting from the arbitrarily chosen open
rectangles (in the figure): assign every data
value to the cluster defined by the nearest
centroid, recompute the centroids based on the
most current clustering, then reassign data values
to clusters and repeat.
Remarks: The algorithm does not indicate how to
pick K (a sketch of one informal check follows
below). To change K, redo the partitioning. The
clusters are not necessarily nested.
F. Chiaromonte Sp 06 5
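One informal check for K is to plot the total within-cluster sum of squares against K and look for an "elbow"; a sketch, again assuming M.yeast:

    Ks  <- 2:10
    wss <- sapply(Ks, function(k)
      kmeans(M.yeast, centers = k, nstart = 10)$tot.withinss)
    plot(Ks, wss, type = "b", xlab = "K",
         ylab = "Total within-cluster SS")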
22
Here is the yeast data (4 runs). To display the
clusters, we often use the main eigendirections
(the columns of svd.m$u). These do show that much
of the clustering is defined by these 2 directions,
but it is not clear that there really are clusters.
23
[Figure: k-means results with 6 clusters and 4 clusters]
k.out <- kmeans(M.yeast, centers=6)
plot(svd.m$u[,1], svd.m$u[,2], col=k.out$cluster)
24
Other partitioning Methods
  • Partitioning around medoids (PAM): instead of
    averages, use multidimensional medians (medoids)
    as centroids (cluster prototypes). Dudoit and
    Fridlyand (2002). (A PAM sketch follows after
    this list.)
  • Self-organizing maps (SOM): add an underlying
    topology (a neighboring structure on a lattice)
    that relates cluster centroids to one another.
    Kohonen (1997), Tamayo et al. (1999).
  • Fuzzy k-means: allows for a gradation of points
    between clusters (soft partitions). Gasch and
    Eisen (2002).
  • Mixture-based clustering: implemented through an
    EM (Expectation-Maximization) algorithm. This
    provides soft partitioning, and allows for
    modeling of cluster centroids and shapes. Yeung
    et al. (2001), McLachlan et al. (2002).
  • F. Chiaromonte Sp 06 5
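A sketch of PAM using the cluster package (assumed to be installed), applied to M.yeast and displayed on the first two eigendirections as on slide 23:

    library(cluster)
    pam.out <- pam(M.yeast, k = 6)   # partition around 6 medoids
    table(pam.out$clustering)        # cluster sizes
    plot(svd.m$u[,1], svd.m$u[,2], col = pam.out$clustering)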

25
Assessing the Clusters Computationally
  • The bottom line is that the clustering is "good"
    if it is biologically meaningful (but this is
    hard to assess).
  • Computationally we can:
  • 1) Use a goodness-of-cluster measure, such as the
    within-cluster distances compared to the
    between-cluster distances (a silhouette sketch
    follows below).
  • 2) Perturb the data and assess cluster changes
  • a) add noise (maybe residuals after ANOVA)
  • b) resample (genes, arrays)
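A sketch of one such goodness-of-cluster measure, the silhouette width, which compares within-cluster to between-cluster distances; it uses the cluster package and the k-means result k.out from slide 23:

    library(cluster)
    sil <- silhouette(k.out$cluster, dist(M.yeast))
    summary(sil)$avg.width    # average silhouette width (closer to 1 is better)
    plot(sil)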