Exploring Data using Dimension Reduction and Clustering
Transcript and Presenter's Notes
1
Exploring Data using Dimension Reduction and Clustering
  • Naomi Altman
  • Nov. 06

2
Spellman Cell Cycle data
  • Yeast cells were synchronized by arrest of a
    cdc15 temperature-sensitive mutant.
  • Samples were taken every 10 minutes and one array
    was hybridized for each sample using a reference
    design. 2 complete cycles are in the data.
  • I downloaded the data and normalized using loess.
    (Print tip data were not available.)
  • I used the normalized value of M as the primary
    data.

3
What they did
  • Supervised dimension reduction: regression
  • They were looking for genes that have cyclic
    behavior - i.e. a sine or cosine wave in time.
  • They regressed Mi on sine and cosine waves and
    selected genes for which the R2 was high.
  • The period of the wave was known (from observing
    the cells?), so they regressed against sin(wt)
    and cos(wt), where w is set to give the
    appropriate period (a minimal sketch follows below).
  • If the period is unknown, a method called Fourier
    analysis can be used to discover it.
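A minimal sketch of this kind of fit; the gene vector M.i, the sampling times, and the 120-minute period are illustrative assumptions, not values from the slides:

    time <- c(10,30,50,10*(7:25),270,290)        # sampling times in minutes
    period <- 120                                # assumed cell-cycle period
    w <- 2*pi/period                             # angular frequency
    M.i <- rnorm(length(time))                   # placeholder for one gene's M values
    fit <- lm(M.i ~ sin(w*time) + cos(w*time))   # regress on sine and cosine waves
    summary(fit)$r.squared                       # select genes for which R2 is high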

4
Regression
  • Suppose we are looking for genes that are
    associated with a particular quantitative
    phenotype, or have a pattern that is known in
    advance.
  • E.g. Suppose we are interested in genes that
    change linearly with temperature and
    quadratically with pH.
  • Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise
  • We might fit this model for each gene (assuming
    that the arrays came from samples subjected to
    different levels of Temp and pH); a sketch of
    such a per-gene fit follows below.
  • This is similar to differential expression
    analysis - we have a multiple comparisons problem.
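A minimal sketch of such a per-gene fit on simulated placeholder data (the design values and gene count are assumptions for illustration):

    set.seed(1)
    Temp <- rep(c(20, 25, 30), each = 4)         # hypothetical array-level temperatures
    pH   <- rep(c(5, 6, 7, 8), times = 3)        # hypothetical array-level pH values
    M    <- matrix(rnorm(50 * 12), nrow = 50)    # 50 placeholder genes x 12 arrays
    # fit Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise, gene by gene
    r2 <- apply(M, 1, function(y)
      summary(lm(y ~ Temp + pH + I(pH^2)))$r.squared)
    head(sort(r2, decreasing = TRUE))            # genes best fitting the pattern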

5
Regression
  • We might compute an adjusted p-value, or a
    goodness-of-fit statistic, to select genes based
    on the fit to a pattern (a small sketch follows
    below).
  • If we have many "conditions" we do not need to
    replicate as much as in differential expression
    analysis because we consider any deviation from
    the "pattern" to be random variation.

6
What I did
  • Unsupervised dimension reduction
  • I used SVD on the 832 genes x 24 time points.
  • We can see that eigengene 5 has the cyclic genes.

7
For class
  • I extracted the 304 spots with variance greater
    than 0.25 (a sketch of this filter follows below).
  • To my surprise, several of these were empty or
    control spots. I removed these.
  • This leaves 295 genes, which are in yeast.txt.
  • Read these into R.
  • Also time <- c(10,30,50,10*(7:25),270,290)
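A sketch of the variance filter; M.all is a placeholder for the full normalized genes-by-arrays matrix of M values (the name and dimensions are assumptions):

    M.all <- matrix(rnorm(832 * 24), nrow = 832)    # placeholder for the full data
    gene.var <- apply(M.all, 1, var, na.rm = TRUE)  # per-spot variance across arrays
    high.var <- M.all[gene.var > 0.25, ]            # keep spots with variance > 0.25
    dim(high.var)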

8
  • yeast <- read.delim("yeast.txt", header=T)
  • time <- c(10,30,50,10*(7:25),270,290)
  • M.yeast <- yeast[,2:25]    # strip off the gene names
  • svd.m <- svd(M.yeast)      # svd
  • plot(1:24, svd.m$d)        # scree plot
  • par(mfrow=c(4,4))          # plot the first 16 "eigengenes"
  • for (i in 1:16) plot(time, svd.m$v[,i], main=paste("Eigen",i), type="l")
  • par(mfrow=c(1,1))
  • plot(time, svd.m$v[,1], type="l", ylim=c(min(svd.m$v), max(svd.m$v)))
  • for (i in 2:4) lines(time, svd.m$v[,i], col=i)
  • It looks like "eigengenes" 2-4 have the periodic
    components.

9
  • Reduce dimension by finding genes that are
    linear combinations of these 3 patterns by
    regression.
  • We can use limma to fit a regression to every
    gene and use e.g. the F or p-value to pick
    significant genes.
  • library(limma)
  • design.reg <- model.matrix(~svd.m$v[,2:4])
  • fit.reg <- lmFit(M.yeast, design.reg)
  • The "reduced dimension" version of each gene is
    its fitted values b0 + b1*v2 + b2*v3 + b3*v4,
    where vi is the ith column of svd.m$v and the bi
    are the coefficients.
  • Let's look at gene 1 (not periodic) and genes
    5, 6, 7:
  • plot(time, M.yeast[i,], type="l")
  • lines(time, fit.reg$coef[i,1] + fit.reg$coef[i,2]*svd.m$v[,2] + fit.reg$coef[i,3]*svd.m$v[,3] + fit.reg$coef[i,4]*svd.m$v[,4])

10
  • Select the genes with a strong periodic component.
  • We could use R2, but in limma it is simplest to
    compute the moderated F-test for regression and
    then use the p-values.
  • limma requires us to remove the intercept from
    the coefficients to get this test:
  • contrast.matrix <- cbind(c(0,1,0,0), c(0,0,1,0), c(0,0,0,1))
  • fit.contrast <- contrasts.fit(fit.reg, contrast.matrix)
  • efit <- eBayes(fit.contrast)
  • We will use the Bonferroni method to pick a
    significance level:
  • a = 0.05/(number of genes) = 0.05/295 = 0.00017
  • sigGenes <- which(efit$F.p.value < 0.00017)
  • Plot a few of these genes.
  • You might also want to plot a few genes with
    p-value > 0.5.

11
Note that we used the normalized but uncentered,
unscaled data for this exercise. Things might
look very different if the data were
transformed.
12
Clustering
  • We might ask which genes have similar expression
    patterns.
  • Once we have expressed (dis)similarity as a
    distance measure, we can use this measure to
    cluster genes that are similar (two common
    choices are sketched below).
  • There are many methods. We will discuss two:
    hierarchical clustering and k-means clustering.
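Two common choices of (dis)similarity for gene expression profiles, sketched on a small placeholder matrix X (rows = genes):

    X <- matrix(rnorm(5 * 24), nrow = 5)  # 5 placeholder gene profiles over 24 arrays
    d.euc <- dist(X)                      # Euclidean distance between profiles
    d.cor <- as.dist(1 - cor(t(X)))       # correlation-based distance: 1 - Pearson r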

13
Hierarchical Clustering (agglomerative)
  • Choose a distance function for points d(x1,x2)
  • Choose a distance function for clusters D(C1,C2)
    (for clusters formed by just one point, D
    reduces to d).
  • Start from N clusters, each containing one data
    point.
  • At each iteration
  • a) Using the current matrix of cluster
    distances, find the two closest clusters.
  • b) Update the list of clusters by merging the
    two closest.
  • c) Update the matrix of cluster distances
    accordingly.
  • Repeat until all data points are joined in one
    cluster.
  • Remarks
  • The method is sensitive to anomalous data
    points/outliers
  • F. Chiaromonte Sp 06 5

14
Hierarchical Clustering (agglomerative)
  • Choose a distance function for points d(x1,x2)
  • Choose a distance function for clusters D(C1,C2)
    (for clusters formed by just one point, D
    reduces to d).
  • Start from N clusters, each containing one data
    point.
  • At each iteration
  • a) Using the current matrix of cluster
    distances, find the two closest clusters.
  • b) Update the list of clusters by merging the
    two closest.
  • c) Update the matrix of cluster distances
    accordingly.
  • Repeat until all data points are joined in one
    cluster.
  • Remarks
  • The method is sensitive to anomalous data
    points/outliers.
  • Mergers are irreversible: bad mergers
    occurring early on affect the structure of the
    nested sequence.
  • If two pairs of clusters are equally (and
    maximally) close at a given iteration, we have to
    choose arbitrarily; the choice will affect the
    structure of the nested sequence.
  • F. Chiaromonte Sp 06 5

15
Defining cluster distance: the linkage function.
D(C1,C2) is a function f of the point distances
d(x1i,x2j), x1i in C1, x2j in C2:
  • Single (string-like, long): f = min
  • Complete (ball-like, compact): f = max
  • Average: f = average
  • Centroid: d(ave(x1i), ave(x2j))
Single and complete linkages produce nested
sequences invariant under monotone transformations
of d; this is not the case for average linkage.
However, the latter is a compromise between the
long, stringy clusters produced by single linkage
and the round, compact clusters produced by
complete linkage (a comparison sketch follows
below). F. Chiaromonte Sp 06 5
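A sketch comparing the three linkages on placeholder data, using hclust's method names "single", "complete", and "average":

    X <- matrix(rnorm(20 * 24), nrow = 20)         # 20 placeholder profiles
    d <- dist(X)
    hc.single   <- hclust(d, method = "single")    # f = min: long, stringy clusters
    hc.complete <- hclust(d, method = "complete")  # f = max: round, compact clusters
    hc.average  <- hclust(d, method = "average")   # f = average: a compromise
    par(mfrow = c(1, 3))
    plot(hc.single); plot(hc.complete); plot(hc.average)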
16
Example: agglomeration step in constructing the
nested sequence (first iteration):
1. 3 and 5 are the closest, and are therefore
   merged into cluster 35.
2. A new distance matrix is computed with complete
   linkage.
Ordinate: the distance, or height, at which each
merger occurred. The horizontal ordering of the
data points is any order preventing intersections
of branches. F. Chiaromonte Sp 06 5
[Figure: dendrograms under single linkage and complete linkage]
17
Hierarchical Clustering
  • Hierarchical clustering, per se, does not dictate
    a partition and a number of clusters.
  • It provides a nested sequence of partitions
    (this is more informative than just one
    partition).
  • To settle on one partition, we have to cut the
    dendrogram (see the sketch below).
  • Usually we pick a height and cut there, but the
    most informative cuts are often at different
    heights for different branches.
  • F. Chiaromonte Sp 06 5
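A sketch of cutting a dendrogram with cutree(), assuming the M.yeast matrix from the earlier slides (the cut height here is arbitrary):

    hc <- hclust(dist(M.yeast), method = "complete")
    plot(hc)                   # inspect the dendrogram
    cl.h <- cutree(hc, h = 5)  # cut at an arbitrary height
    cl.k <- cutree(hc, k = 6)  # or ask directly for 6 clusters
    table(cl.k)                # cluster sizes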

18
hclust(dist(M.yeast), method="single")
19
Partitioning algorithms: K-means.
  • Choose a distance function for points d(xi,xj).
  • Choose K, the number of clusters.
  • Initialize the K cluster centroids (with points
    chosen at random).
  • Use the data to iteratively relocate centroids,
    and reallocate points to closest centroid.
  • At each iteration
  • Compute distance of each data point from each
    current centroid.
  • Update current cluster membership of each data
    point, selecting the centroid to which the point
    is closest.
  • Update current centroids, as averages of the new
    clusters formed in step 2.
  • Repeat until cluster memberships, and thus
    centroids, stop changing (a from-scratch sketch
    follows below).
    F. Chiaromonte Sp 06 5
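A from-scratch sketch of this iteration with Euclidean distance, on placeholder two-dimensional data (in practice we would simply call kmeans()):

    X <- matrix(rnorm(100 * 2), ncol = 2)      # placeholder data
    K <- 3
    centroids <- X[sample(nrow(X), K), ]       # initialize with points chosen at random
    member <- rep(0L, nrow(X))                 # current cluster membership
    repeat {
      # squared distance of each point from each current centroid
      d2 <- sapply(1:K, function(k) colSums((t(X) - centroids[k, ])^2))
      new.member <- max.col(-d2)               # closest centroid for each point
      if (all(new.member == member)) break     # memberships stopped changing
      member <- new.member
      # update centroids as averages of the new clusters
      centroids <- t(sapply(1:K, function(k)
        colMeans(X[member == k, , drop = FALSE])))
    }
    table(member)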

20
  • Remarks
  • This method is sensitive to anomalous data
    points/outliers.
  • Points can move from one cluster to another, but
    the final solution depends strongly on centroid
    initialization, so we usually restart several
    times to check (a restart sketch follows after
    this list).
  • If two centroids are equally (and maximally)
    close to an observation at a given iteration, we
    have to choose arbitrarily (the problem here is
    not so serious because points can move later).
  • There are several variants of the k-means
    algorithm, using e.g. the median instead of the
    mean.
  • K-means converges to a local minimum of the total
    within-cluster squared distance (the total
    within-cluster sum of squares), not necessarily a
    global one.
  • Clusters tend to be ball-shaped with respect to
    the chosen distance.
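A sketch of guarding against bad initializations with multiple restarts, assuming M.yeast from the earlier slides; kmeans() keeps the start with the lowest total within-cluster sum of squares:

    k.once  <- kmeans(M.yeast, centers = 6)               # a single random start
    k.multi <- kmeans(M.yeast, centers = 6, nstart = 25)  # best of 25 random starts
    c(k.once$tot.withinss, k.multi$tot.withinss)          # compare total within-cluster SS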

21
Starting from the arbitrarily chosen open
rectangles (in the figure): assign every data
value to the cluster defined by the nearest
centroid, recompute the centroids based on the
most current clustering, then reassign data values
to clusters and repeat.
Remarks: The algorithm does not indicate how to
pick K (a sketch of one informal check follows
below). To change K, redo the partitioning. The
clusters are not necessarily nested.
F. Chiaromonte Sp 06 5
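One informal check for K is to plot the total within-cluster sum of squares against K and look for an "elbow"; a sketch, again assuming M.yeast:

    Ks  <- 2:10
    wss <- sapply(Ks, function(k)
      kmeans(M.yeast, centers = k, nstart = 10)$tot.withinss)
    plot(Ks, wss, type = "b", xlab = "K",
         ylab = "Total within-cluster SS")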
22
Here is the yeast data (4 runs). To display the
clusters, we often use the main eigendirections
(the columns of svd.m$u). These do show that much
of the clustering is defined by these 2 directions,
but it is not clear that there really are clusters.
23
[Figure: k-means results with 6 clusters and 4 clusters]
k.out <- kmeans(M.yeast, centers=6)
plot(svd.m$u[,1], svd.m$u[,2], col=k.out$cluster)
24
Other partitioning Methods
  • Partitioning around medoids (PAM): instead of
    averages, use multidimensional medians (medoids)
    as centroids (cluster prototypes). Dudoit and
    Fridlyand (2002). (A PAM sketch follows after
    this list.)
  • Self-organizing maps (SOM): add an underlying
    topology (a neighboring structure on a lattice)
    that relates cluster centroids to one another.
    Kohonen (1997), Tamayo et al. (1999).
  • Fuzzy k-means: allows for a gradation of points
    between clusters (soft partitions). Gasch and
    Eisen (2002).
  • Mixture-based clustering: implemented through an
    EM (Expectation-Maximization) algorithm. This
    provides soft partitioning, and allows for
    modeling of cluster centroids and shapes. Yeung
    et al. (2001), McLachlan et al. (2002).
  • F. Chiaromonte Sp 06 5
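A sketch of PAM using the cluster package (assumed to be installed), applied to M.yeast and displayed on the first two eigendirections as on slide 23:

    library(cluster)
    pam.out <- pam(M.yeast, k = 6)   # partition around 6 medoids
    table(pam.out$clustering)        # cluster sizes
    plot(svd.m$u[,1], svd.m$u[,2], col = pam.out$clustering)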

25
Assessing the Clusters Computationally
  • The bottom line is that the clustering is "good"
    if it is biologically meaningful (but this is
    hard to assess).
  • Computationally we can:
  • 1) Use a goodness-of-cluster measure, such as the
    within-cluster distances compared to the
    between-cluster distances (a silhouette sketch
    follows below).
  • 2) Perturb the data and assess cluster changes
  • a) add noise (maybe residuals after ANOVA)
  • b) resample (genes, arrays)
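A sketch of one such goodness-of-cluster measure, the silhouette width, which compares within-cluster to between-cluster distances; it uses the cluster package and the k-means result k.out from slide 23:

    library(cluster)
    sil <- silhouette(k.out$cluster, dist(M.yeast))
    summary(sil)$avg.width    # average silhouette width (closer to 1 is better)
    plot(sil)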