1
Clustering methods used in microarray data
analysis
  • Steve Horvath
  • Human Genetics and Biostatistics
  • UCLA
  • Acknowledgement: based in part on lecture notes
    from Darlene Goldstein's web site
    http://ludwig-sun2.unil.ch/darlene/

2
Contents
  • Background on clustering
  • k-means clustering
  • hierarchical clustering

3
References for clustering
  • Gentleman, Carey, et al. (Bioinformatics and
    Computational Biology Solutions Using R),
    Chapters 11, 12, 13
  • T. Hastie, R. Tibshirani, J. Friedman (2002) The
    Elements of Statistical Learning. Springer Series
  • L. Kaufman, P. Rousseeuw (1990) Finding Groups in
    Data. Wiley Series in Probability

4
Clustering
  • Historically, objects are clustered into groups
  • periodic table of the elements (chemistry)
  • taxonomy (zoology, botany)
  • Why cluster?
  • Understand the global structure of the data (see
    the forest instead of the trees)
  • detect heterogeneity in the data, e.g. different
    tumor classes
  • Find biological pathways (cluster gene expression
    profiles)
  • Find data outliers (cluster microarray samples)

5
Classification, Clustering and Prediction
  • WARNING
  • many people talk about classification when they
    mean clustering (unsupervised learning)
  • Other people talk about classification when they
    mean prediction (supervised learning)
  • Usually, the meaning is context-specific. I
    prefer to avoid the term classification and to
    talk about clustering or prediction or another
    more specific term.
  • Common denominator: classification divides
    objects into groups based on a set of values
  • Unlike a theory, clustering is neither true nor
    false, and should be judged largely on the
    usefulness of results.
  • CLUSTERING IS AND ALWAYS WILL BE SOMEWHAT OF AN
    ART FORM
  • However, a classification (clustering) may be
    useful for suggesting a theory, which could then
    be tested

6
Cluster analysis
  • Addresses the problem: given n objects, each
    described by p variables (or features), derive a
    useful division into a number of classes
  • Usually want a partition of objects
  • But also fuzzy clustering
  • Could also take an exploratory perspective
  • Unsupervised learning

7
Difficulties in defining "cluster"
8
Wordy Definition
Cluster analysis aims to group or segment a
collection of objects into subsets or "clusters",
such that those within each cluster are more
closely related to one another than objects
assigned to different clusters.   An object can
be described by a set of measurements (e.g.
covariates, features, attributes) or by its
relation to other objects.   Sometimes the goal
is to arrange the clusters into a natural
hierarchy, which involves successively grouping
or merging the clusters themselves so that at
each level of the hierarchy clusters within the
same group are more similar to each other than
those in different groups.  

9
Clustering Gene Expression Data
  • Can cluster genes (rows), e.g. to (attempt to)
    identify groups of co-regulated genes
  • Can cluster samples (columns), e.g. to identify
    tumors based on profiles
  • Can cluster both rows and columns at the same
    time (to my knowledge, not in R)

10
Clustering Gene Expression Data
  • Leads to readily interpretable figures
  • Can be helpful for identifying patterns in time
    or space
  • Useful (essential?) when seeking new subclasses
    of samples
  • Can be used for exploratory purposes

11
Similarity / Proximity
  • Similarity sij indicates the strength of
    relationship between two objects i and j
  • Usually 0 ≤ sij ≤ 1
  • Ex 1: absolute value of the Pearson correlation
    coefficient
  • Use of correlation-based similarity is quite
    common in gene expression studies but is in
    general contentious...
  • Ex 2: the topological overlap matrix from
    co-expression network methods
  • Ex 3: random forest similarity

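As an R sketch of Ex 1 (assuming a hypothetical
expression matrix "expr" with genes in rows and
samples in columns):

  sim <- abs(cor(t(expr)))   # |Pearson correlation| between genes, in [0,1]
  dissim <- 1 - sim          # a corresponding dissimilarity
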
12
Proximity matrices are the input to most
clustering algorithms

Proximity between pairs of objects: similarity or
dissimilarity. If the original data were
collected as similarities, a monotone-decreasing
function can be used to convert them to
dissimilarities. Most algorithms use (symmetric)
dissimilarities (e.g. distances), but the
triangle inequality does not have to hold.
Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j).
13
Dissimilarity and Distance
  • Associated with a similarity measure sij bounded
    by 0 and 1 is a dissimilarity dij = 1 - sij
  • Distance measures have the metric property
    (triangle inequality: dij ≤ dik + dkj)
  • Many examples: Euclidean (as the crow flies),
    Manhattan (city block), etc.
  • Distance measure has a large effect on
    performance
  • Behavior of distance measure related to scale of
    measurement

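A minimal R sketch of these distances (assuming a
hypothetical numeric matrix "x" with objects in rows):

  x.s <- scale(x)                           # put variables on a common scale
  d.euc <- dist(x.s, method = "euclidean")  # as the crow flies
  d.man <- dist(x.s, method = "manhattan")  # city block
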
14
Partitioning Methods
  • Partition the objects into a prespecified number
    of groups K
  • Iteratively reallocate objects to clusters until
    some criterion is met (e.g. minimize within
    cluster sums of squares)
  • Examples: k-means, self-organizing maps (SOM),
    partitioning around medoids (PAM), model-based
    clustering

15
K-means clustering
  • Prespecify number of clusters K, and cluster
    centers
  • Minimize within cluster sum of squares from the
    centers
  • Iterate (until cluster assignments do not
    change)
  • For a given cluster assignment, find the cluster
    means
  • For a given set of means, minimize the within
    cluster sum of squares by allocating each object
    to the closest cluster mean
  • Intended for situations where all variables are
    quantitative, with (squared) Euclidean distance
    (so scale variables suitably before use)

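A minimal R sketch using the built-in kmeans function
(the matrix "x" and the choice K = 3 are hypothetical):

  set.seed(1)  # results depend on the random starting centers
  km <- kmeans(scale(x), centers = 3, nstart = 25)
  km$cluster        # cluster assignment of each object
  km$tot.withinss   # total within-cluster sum of squares
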
16
PAM clustering
  • Also need to prespecify number of clusters K
  • Unlike K-means, the cluster centers (medoids)
    are objects, not averages of objects
  • Can use general dissimilarity
  • Minimize (unsquared) distances from objects to
    cluster centers, so more robust than K-means

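A corresponding R sketch with pam from the cluster
package (again with a hypothetical matrix "x" and
K = 3):

  library(cluster)
  pm <- pam(x, k = 3)   # or pam(dist(x), k = 3) for a general dissimilarity
  pm$medoids            # cluster centers are actual objects
  pm$clustering         # cluster assignment of each object
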
17
Combinatorial clustering algorithms. Example:
K-means clustering
18
Clustering algorithms
  • Goal: partition the observations into groups
    ("clusters") so that the pairwise dissimilarities
    between those assigned to the same cluster tend
    to be smaller than those in different clusters.
  • Three types of clustering algorithms: mixture
    modeling, mode seekers (e.g. the PRIM algorithm),
    and combinatorial algorithms.
  • We focus on the most popular combinatorial
    algorithms.

19
Combinatorial clustering algorithms
  • Most popular clustering algorithms directly
    assign each observation to a group or cluster
    without regard to a probability model describing
    the data.
  • Notation: label observations by an integer i in
    {1,...,N} and clusters by an integer k in
    {1,...,K}.
  • The cluster assignments can be characterized by a
    many-to-one mapping C (the "encoder") that
    assigns the i-th observation to the k-th
    cluster: C(i) = k.
  • One seeks a particular encoder C(i) that
    minimizes a particular loss function (aka
    energy function).

20
Loss functions for judging clusterings
  • One seeks a particular encoder C(i) that
    minimizes a particular loss function (aka
    energy function).
  • Example: the within-cluster point scatter, shown
    below
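In LaTeX, the within-cluster point scatter has the
standard form given in Hastie et al.:

  \[
  W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})
  \]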

21
Cluster analysis by combinatorial optimization
  • Straightforward in principle: simply minimize
    W(C) over all possible assignments of the N data
    points to K clusters.
  • Unfortunately such optimization by complete
    enumeration is feasible only for small data sets.
  • For this reason practical clustering algorithms
    are able to examine only a fraction of all
    possible encoders C.
  • The goal is to identify a small subset that is
    likely to contain the optimal one or at least a
    good sub-optimal partition.
  • Feasible strategies are based on iterative greedy
    descent.

22
K-means clustering is a very popular iterative
descent clustering method.
  • Setting: all variables are of the quantitative
    type and one uses the squared Euclidean distance
    d(x_i, x_i') = ||x_i - x_i'||^2.
  • In this case the within-cluster point scatter
    W(C) can be re-expressed in terms of the cluster
    means, as shown below.
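Following Hastie et al., with \bar{x}_k the mean
vector of cluster k and N_k the number of
observations assigned to it:

  \[
  W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k}
         \lVert x_i - x_{i'} \rVert^2
       = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2
  \]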

23
Thus one can obtain the optimal C by solving the
enlarged optimization problem shown below.
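In the notation of Hastie et al., with free mean
vectors m_1, ..., m_K:

  \[
  \min_{C,\,\{m_k\}_{k=1}^{K}} \; \sum_{k=1}^{K} N_k \sum_{C(i)=k}
  \lVert x_i - m_k \rVert^2
  \]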

This can be minimized by an alternating
optimization procedure given on the next slide
24
The K-means clustering algorithm leads to a local
minimum
  • 1. For a given cluster assignment C, the total
    cluster variance is minimized with respect to
    m_1,...,m_K, yielding the means of the currently
    assigned clusters, i.e. find the cluster means.
  • 2. Given the current set of means, TotVar is
    minimized by assigning each observation to the
    closest (current) cluster mean. That is,
  • C(i) = argmin_k ||x_i - m_k||^2
  • 3. Steps 1 and 2 are iterated until the
    assignments do not change.
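A bare-bones R sketch of these two alternating steps
(an illustrative toy re-implementation, not the
production kmeans function; "x" is a hypothetical
numeric matrix):

  simple.kmeans <- function(x, K, max.iter = 100) {
    # start from K randomly chosen observations as the initial means
    m <- x[sample(nrow(x), K), , drop = FALSE]
    cl <- rep(0, nrow(x))
    for (iter in 1:max.iter) {
      # step 2: assign each observation to the closest current mean
      d2 <- sapply(1:K, function(k) colSums((t(x) - m[k, ])^2))
      new.cl <- max.col(-d2)        # argmin_k of the squared distances
      if (all(new.cl == cl)) break  # assignments unchanged: local minimum
      cl <- new.cl
      # step 1: recompute each cluster mean (no empty-cluster handling)
      for (k in 1:K) m[k, ] <- colMeans(x[cl == k, , drop = FALSE])
    }
    list(cluster = cl, centers = m)
  }
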

25
Recommendations for k-means clustering
  • Either start with many different random choices
    of starting means, and choose the solution having
    the smallest value of the objective function
  • Or use another clustering method (e.g.
    hierarchical clustering) to determine an initial
    set of cluster centers
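Both recommendations in one R sketch (hypothetical
matrix "x", K = 3):

  # many random starts: nstart re-runs kmeans, keeping the best solution
  km1 <- kmeans(x, centers = 3, nstart = 50)

  # hierarchical initialization: cut a dendrogram into K groups and
  # use the group means as starting centers
  h <- hclust(dist(x), method = "average")
  g <- cutree(h, k = 3)
  init <- do.call(rbind, lapply(split(as.data.frame(x), g), colMeans))
  km2 <- kmeans(x, centers = init)
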

26
Agglomerative clustering, hierarchical clustering
and dendrograms
27
Hierarchical clustering plot
28
Hierarchical Clustering
  • Produce a dendrogram
  • Avoid prespecification of the number of clusters
    K
  • The tree can be built in two distinct ways
  • Bottom-up agglomerative clustering
  • Top-down divisive clustering

29
Agglomerative Methods
  • Start with n mRNA sample (or G gene) clusters
  • At each step, merge the two closest clusters
    using a measure of between-cluster dissimilarity
    which reflects the shape of the clusters
  • Examples of between-cluster dissimilarities
  • Unweighted Pair Group Method with Arithmetic Mean
    (UPGMA): average of pairwise dissimilarities
  • Single-link (NN): minimum of pairwise
    dissimilarities
  • Complete-link (FN): maximum of pairwise
    dissimilarities

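These three dissimilarities map directly onto the
method argument of R's hclust (sketch; "x" is a
hypothetical data matrix):

  d <- dist(x)                              # pairwise dissimilarities
  hc.avg <- hclust(d, method = "average")   # UPGMA
  hc.sin <- hclust(d, method = "single")    # single-link (NN)
  hc.com <- hclust(d, method = "complete")  # complete-link (FN)
  plot(hc.avg)                              # draw the dendrogram
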
30
Agglomerative clustering
  • Agglomerative clustering algorithms begin with
    every observation representing a singleton
    cluster.
  • At each of the N-1 steps, the closest 2 (least
    dissimilar) clusters are merged into a single
    cluster.
  • Therefore a measure of dissimilarity between 2
    clusters must be defined.

31
Between-cluster distances (also known as linkage
methods)
32
Different intergroup dissimilarities
Let G and H represent 2 groups.
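In LaTeX, the standard definitions (following Hastie
et al.), where d_{ii'} denotes the dissimilarity
between observations i and i':

  \[
  \begin{aligned}
  d_{\mathrm{SL}}(G,H) &= \min_{i \in G,\, i' \in H} d_{ii'}
     && \text{(single linkage)} \\
  d_{\mathrm{CL}}(G,H) &= \max_{i \in G,\, i' \in H} d_{ii'}
     && \text{(complete linkage)} \\
  d_{\mathrm{GA}}(G,H) &= \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}
     && \text{(group average, UPGMA)}
  \end{aligned}
  \]
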
33
Comparing different linkage methods
  •  If there is a strong clustering tendency, all 3
    methods produce similar results.
  • Single linkage has a tendency to combine
    observations linked by a series of close
    intermediate observations ("chaining"). Good for
    elongated clusters
  • Bad: complete linkage may lead to clusters where
    observations assigned to a cluster can be much
    closer to members of other clusters than they are
    to some members of their own cluster. Use for
    very compact clusters (like pearls on a string)
  • Group average clustering represents a compromise
    between the extremes of single and complete
    linkage. Use for ball-shaped clusters

34
Dendrogram
  • Recursive binary splitting/agglomeration can be
    represented by a rooted binary tree.
  • The root node represents the entire data set.
  • The N terminal nodes of the tree represent
    individual observations.
  • Each nonterminal node ("parent") has two daughter
    nodes.
  • Thus the binary tree can be plotted so that the
    height of each node is proportional to the value
    of the intergroup dissimilarity between its 2
    daughters.
  • A dendrogram provides a complete description of
    the hierarchical clustering in graphical format.

35
Comments on dendrograms
  • Caution: different hierarchical methods, as well
    as small changes in the data, can lead to
    different dendrograms.
  • Hierarchical methods impose hierarchical
    structure whether or not such structure actually
    exists in the data.
  • In general, dendrograms are a description of the
    results of the algorithm and not a graphical
    summary of the data.
  • They are a valid summary only to the extent that
    the pairwise observation dissimilarities obey the
    ultrametric inequality
    d(i,i') ≤ max{ d(i,k), d(i',k) } for all i, i', k
36
Figure 1: dendrograms from average, complete, and
single linkage.
37
Divisive Methods
  • Start with only one cluster
  • At each step, split clusters into two parts
  • Advantage: obtains the main structure of the data
    (i.e. focuses on the upper levels of the
    dendrogram)
  • Disadvantage: computational difficulties when
    considering all possible divisions into two groups
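In R, a divisive counterpart is diana from the cluster
package (sketch; "x" is a hypothetical data matrix):

  library(cluster)
  dv <- diana(x)                  # DIvisive ANAlysis clustering
  plot(dv)                        # banner plot and dendrogram
  cutree(as.hclust(dv), k = 2)    # cut the tree into two groups
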

38
Discussion
39
Partitioning vs. Hierarchical
  • Partitioning
  • Advantage: provides clusters that satisfy some
    optimality criterion (approximately)
  • Disadvantages: need the initial K; long
    computation time
  • Hierarchical
  • Advantage: fast computation (agglomerative)
  • Disadvantages: rigid; cannot correct later for
    erroneous decisions made earlier
  • Word on the street: most data analysts prefer
    hierarchical clustering over partitioning methods
    when it comes to gene expression data

40
Generic Clustering Tasks
  • Estimating number of clusters
  • Assigning each object to a cluster
  • Assessing strength/confidence of cluster
    assignments for individual objects
  • Assessing cluster homogeneity

41
How many clusters K?
  • Many suggestions for how to decide this!
  • Milligan and Cooper (Psychometrika 50:159-179,
    1985) studied 30 methods
  • A number of newer methods, including GAP
    (Tibshirani) and clest (Fridlyand and Dudoit;
    uses bootstrapping); see also prediction strength
    methods:
  • http://www.genetics.ucla.edu/labs/horvath/GeneralPredictionStrength/
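The gap statistic is available in R as clusGap in the
cluster package (sketch; "x" is a hypothetical data
matrix):

  library(cluster)
  gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 25)
  # pick the smallest K within one SE of the first local maximum
  maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
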

42
R clustering
  • A number of R packages (libraries) contain
    functions to carry out clustering, including
  • mva: kmeans, hclust
  • cluster: pam (among others)
  • cclust: convex clustering, also methods to
    estimate K
  • mclust: model-based clustering
  • GeneSOM: self-organizing maps