Title: Important clustering methods used in microarray data analysis
1. Important clustering methods used in microarray data analysis
- Steve Horvath
- Human Genetics and Biostatistics
- UCLA
2. Contents
- Multidimensional scaling (MDS) plots (related to principal component analysis)
- k-means clustering
- Hierarchical clustering
3. Introduction to clustering
4. MDS plot of clusters
5. MDS plot of clusters
6. Two references for clustering
- T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
- L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability
7. Introduction to clustering
Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy: the clusters themselves are successively grouped or merged so that, at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.
8. Proximity matrices are the input to most clustering algorithms

Proximity between pairs of objects can be expressed as either similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms work with (symmetric) dissimilarities (e.g. distances), but the triangle inequality does not have to hold.
Triangle inequality: d(i, k) ≤ d(i, j) + d(j, k) for all i, j, k.
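As a minimal sketch of building such an input matrix (Python and the Euclidean distance are illustrative assumptions, not part of the original slides):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(objects, d=euclidean):
    # Symmetric matrix of pairwise dissimilarities d(i, j)
    n = len(objects)
    return [[d(objects[i], objects[j]) for j in range(n)] for i in range(n)]

def sim_to_dissim(s):
    # A monotone-decreasing conversion of a similarity s in [0, 1]
    # to a dissimilarity (one of many possible choices)
    return 1.0 - s

objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(objects)
print(D[0][1])  # 5.0
```

Any symmetric matrix of nonnegative dissimilarities with zero diagonal can serve as input; as the slide notes, the entries need not satisfy the triangle inequality.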
9. Different intergroup dissimilarities

Let G and H represent two groups.
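The three standard intergroup dissimilarities (single, complete, and group-average linkage) can be sketched in pure Python; the Euclidean point distance here is an illustrative assumption:

```python
import math

def dist(a, b):
    # Euclidean distance between two points (an assumed choice)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(G, H):
    # d(G, H) = minimum over all between-group pairs (nearest neighbours)
    return min(dist(g, h) for g in G for h in H)

def complete_linkage(G, H):
    # d(G, H) = maximum over all between-group pairs (furthest neighbours)
    return max(dist(g, h) for g in G for h in H)

def group_average(G, H):
    # d(G, H) = average over all between-group pairs
    return sum(dist(g, h) for g in G for h in H) / (len(G) * len(H))

G = [(0.0,), (1.0,)]
H = [(3.0,), (5.0,)]
print(single_linkage(G, H))    # 2.0
print(complete_linkage(G, H))  # 5.0
print(group_average(G, H))     # 3.5
```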
10. Agglomerative clustering, hierarchical clustering, and dendrograms
11. Hierarchical clustering plot
12. Agglomerative clustering
- Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
- At each of the N-1 subsequent steps, the two closest (least dissimilar) clusters are merged into a single cluster.
- Therefore a measure of dissimilarity between two clusters must be defined.
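The steps above can be sketched as a naive pure-Python implementation (group-average linkage and Euclidean distance are assumed choices; real analyses would use an optimized library routine):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(G, H, points):
    # Group-average dissimilarity between clusters G and H (index lists)
    return sum(dist(points[i], points[j]) for i in G for j in H) / (len(G) * len(H))

def agglomerate(points, linkage=average_linkage):
    # Start with N singleton clusters; perform N-1 merges of the
    # closest (least dissimilar) pair, recording each merge height.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [(0.0,), (0.5,), (4.0,), (4.4,)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The recorded merge heights are exactly what a dendrogram plots: observations 2 and 3 join first (height 0.4), then 0 and 1 (height 0.5), and the two resulting clusters join last at a much larger height.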
13. Comparing different linkage methods

- If there is a strong clustering tendency, all three methods produce similar results.
- Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.
- Complete linkage can be bad: it may lead to clusters in which some observations are much closer to members of other clusters than to some members of their own cluster. Use it for very compact clusters (like pearls on a string).
- Group average clustering represents a compromise between the extremes of single and complete linkage. Use it for ball-shaped clusters.
14. Dendrogram

- Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
- The root node represents the entire data set.
- The N terminal nodes of the tree represent the individual observations.
- Each nonterminal node ("parent") has two daughter nodes.
- The binary tree can therefore be plotted so that the height of each node is proportional to the intergroup dissimilarity between its two daughters.
- A dendrogram provides a complete description of the hierarchical clustering in graphical format.
15. Comments on dendrograms

- Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms.
- Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
- In general, dendrograms are a description of the results of the algorithm, not a graphical summary of the data.
- They are a valid summary only to the extent that the pairwise observation dissimilarities obey the ultrametric inequality d(i, k) ≤ max(d(i, j), d(j, k)) for all i, j, k.
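The ultrametric condition can be checked directly; a minimal sketch (the example matrices below are illustrative, not from the slides):

```python
def is_ultrametric(D, tol=1e-9):
    # D: symmetric dissimilarity matrix. Check the ultrametric
    # inequality d(i, k) <= max(d(i, j), d(j, k)) for all i, j, k.
    n = len(D)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if D[i][k] > max(D[i][j], D[j][k]) + tol:
                    return False
    return True

# Cophenetic distances read off a dendrogram (merge heights) are
# ultrametric by construction:
coph = [[0.0,  0.5,  3.95, 3.95],
        [0.5,  0.0,  3.95, 3.95],
        [3.95, 3.95, 0.0,  0.4],
        [3.95, 3.95, 0.4,  0.0]]
print(is_ultrametric(coph))  # True

# Plain Euclidean distances on a line generally are not:
eucl = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]
print(is_ultrametric(eucl))  # False
```

This is the precise sense in which a dendrogram summarizes the algorithm rather than the data: the tree heights replace the original dissimilarities with the nearest ultrametric ones the method produces.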
16. Figure 1: dendrograms produced by average, complete, and single linkage.