Title: 4. Ad-hoc I: Hierarchical clustering
- Hierarchical versus Flat
  - Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time.
  - Hierarchical methods generate a hierarchy of partitions, i.e.
    - a partition P1 into 1 cluster (the entire collection)
    - a partition P2 into 2 clusters
    - ...
    - a partition Pn into n clusters (each object forms its own cluster)
  - It is then up to the user to decide which of the partitions reflects actual sub-populations in the data.
- Note: A sequence of partitions is called "hierarchical" if each cluster in a given partition is the union of clusters in the next larger partition.
- [Figure: top, a hierarchical sequence of partitions; bottom, a non-hierarchical sequence.]
- Hierarchical methods again come in two varieties, agglomerative and divisive.
- Agglomerative methods (see the sketch after this list)
  - Start with partition Pn, where each object forms its own cluster.
  - Merge the two closest clusters, obtaining Pn-1.
  - Repeat the merge until only one cluster is left.
- Divisive methods
  - Start with P1.
  - Split the collection into two clusters that are as homogeneous (and as different from each other) as possible.
  - Apply the splitting procedure recursively to the clusters.
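A minimal sketch of the agglomerative loop, assuming Euclidean distances and single linkage as the merge rule; the function name agglomerate and the list-based bookkeeping are choices made for this illustration only.

```python
import numpy as np

def agglomerate(X):
    """Naive agglomerative clustering with single linkage.
    Returns the sequence of partitions P_n, P_{n-1}, ..., P_1,
    each partition given as a list of index lists."""
    n = len(X)
    clusters = [[i] for i in range(n)]           # P_n: every object is its own cluster
    partitions = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]
        partitions.append([c[:] for c in clusters])
    return partitions

# example: 10 random 2-D points give 10 partitions, from 10 clusters down to 1
print(len(agglomerate(np.random.rand(10, 2))))
```

A real implementation would cache inter-cluster distances instead of recomputing all pairwise minima at every merge.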
- Note: Agglomerative methods require a rule to decide which clusters to merge. Typically one defines a distance between clusters and then merges the two clusters that are closest. Divisive methods require a rule for splitting a cluster.
4.1 Hierarchical agglomerative clustering

Need to define a distance d(P,Q) between groups, given a distance measure d(x,y) between observations. Commonly used distance measures:

1. d1(P,Q) = min d(x,y), for x in P, y in Q  (single linkage)
2. d2(P,Q) = ave d(x,y), for x in P, y in Q  (average linkage)
3. d3(P,Q) = max d(x,y), for x in P, y in Q  (complete linkage)
4. d4(P,Q) = d(mean(P), mean(Q)), the distance between the cluster means  (centroid method)
5. d5(P,Q) = (|P| |Q| / (|P| + |Q|)) * || mean(P) - mean(Q) ||^2  (Ward's method)

d5 is called Ward's distance.
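For reference, these five group distances correspond to the method argument of scipy.cluster.hierarchy.linkage; the snippet below is a sketch on made-up data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # made-up 2-D observations

# each method implements one of the group distances d1..d5 above
for method in ["single", "average", "complete", "centroid", "ward"]:
    Z = linkage(X, method=method)     # Z records the n-1 merges and their heights
    print(method, round(float(Z[-1, 2]), 3))   # height of the final merge
```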
- Motivation for Ward's distance
  - Let P^k = {P1, ..., Pk} be a partition of the observations into k groups.
  - Measure the goodness of a partition by the sum of squared distances of observations from their cluster means:
    RSS(P^k) = sum_{i=1..k} sum_{x in Pi} || x - mean(Pi) ||^2
  - Consider all possible (k-1)-partitions obtainable from P^k by a merge.
  - Merging the two clusters with smallest Ward's distance optimizes the goodness of the new partition.
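A small numerical check of this motivation, using made-up data: Ward's distance between two clusters equals the increase in RSS incurred by merging them, so the merge with smallest Ward's distance degrades the goodness measure least.

```python
import numpy as np

def rss(clusters):
    """Sum of squared distances of observations from their cluster means."""
    return sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters)

def ward_distance(P, Q):
    """Ward's distance d5 between two clusters of observations."""
    nP, nQ = len(P), len(Q)
    diff = P.mean(axis=0) - Q.mean(axis=0)
    return nP * nQ / (nP + nQ) * (diff @ diff)

rng = np.random.default_rng(1)
P, Q = rng.normal(size=(5, 2)), rng.normal(size=(8, 2))

# increase in RSS caused by merging P and Q equals Ward's distance d5(P, Q)
increase = rss([np.vstack([P, Q])]) - rss([P, Q])
print(np.isclose(increase, ward_distance(P, Q)))   # True
```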
- 4.2 Hierarchical divisive clustering
- There are divisive versions of single linkage, average linkage, and Ward's method.
- Divisive version of single linkage (see the sketch after this list)
  - Compute the minimal spanning tree (the graph connecting all the objects with smallest total edge length).
  - Break the longest edge to obtain 2 subtrees, and a corresponding partition of the objects.
  - Apply the process recursively to the subtrees.
- Agglomerative and divisive versions of single linkage give identical results (more later).
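A sketch of one split of the divisive single-linkage procedure, assuming Euclidean distances and using scipy's spanning-tree and connected-components routines; the function name mst_split is invented for this illustration.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_split(X):
    """One divisive single-linkage step: build the minimal spanning tree,
    delete its longest edge, and return labels for the two resulting groups."""
    D = distance_matrix(X, X)
    mst = minimum_spanning_tree(D).tocoo()      # n-1 edges with smallest total length
    longest = np.argmax(mst.data)               # index of the longest MST edge
    keep = np.ones(len(mst.data), dtype=bool)
    keep[longest] = False
    pruned = np.zeros_like(D)
    pruned[mst.row[keep], mst.col[keep]] = mst.data[keep]
    # the two connected components of the pruned tree define the 2-partition
    _, labels = connected_components(pruned, directed=False)
    return labels

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 2))
print(mst_split(X))   # array of 0/1 cluster labels
```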
Divisive version of Ward's method: given a cluster R, we need to find the split of R into 2 groups P, Q that minimizes

  RSS(P, Q) = sum_{x in P} || x - mean(P) ||^2 + sum_{x in Q} || x - mean(Q) ||^2

or, equivalently, maximizes Ward's distance between P and Q.

Note: There is no computationally feasible method for finding the optimal P, Q when R is large; an approximation has to be used.
- Iterative algorithm to search for the optimal Ward split (a sketch follows this list)
  - Project the observations in R on the largest principal component.
  - Split at the median to obtain initial clusters P, Q.
  - Repeat
    - Assign each observation to the cluster with the closest mean.
    - Re-compute the cluster means.
  - Until convergence
- Note
  - Each step reduces RSS(P, Q).
  - There is no guarantee of finding the optimal partition.
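A minimal sketch of this search, assuming Euclidean data; the function name ward_split and the use of an SVD to obtain the largest principal component are choices made here for illustration.

```python
import numpy as np

def ward_split(R, max_iter=100):
    """Approximate the optimal Ward split of cluster R into two groups."""
    Xc = R - R.mean(axis=0)
    pc = np.linalg.svd(Xc, full_matrices=False)[2][0]    # largest principal component
    scores = Xc @ pc
    labels = (scores > np.median(scores)).astype(int)    # split at the median
    for _ in range(max_iter):
        means = np.array([R[labels == g].mean(axis=0) for g in (0, 1)])
        # assign each observation to the cluster with the closest mean
        new = np.argmin(((R[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels) or len(set(new)) < 2:
            break
        labels = new
    return labels

rng = np.random.default_rng(3)
R = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
print(ward_split(R))   # 0/1 labels for the two groups
```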
Divisive version of average linkage: Algorithm DIANA (Struyf, Hubert, and Rousseeuw, p. 22).
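The slide only names the algorithm; the following is a rough sketch of DIANA's splitting step as described by Struyf, Hubert, and Rousseeuw: the object with the largest average dissimilarity to the others seeds a "splinter group", and objects keep moving to it as long as they are, on average, closer to the splinter group than to the remaining objects. The function name diana_split is invented here.

```python
import numpy as np
from scipy.spatial import distance_matrix

def diana_split(X):
    """One DIANA-style split of a cluster into a splinter group and the rest."""
    D = distance_matrix(X, X)
    n = len(X)
    rest = list(range(n))
    # seed the splinter group with the object farthest (on average) from the others
    seed = int(np.argmax(D.sum(axis=1) / (n - 1)))
    splinter = [seed]
    rest.remove(seed)
    while len(rest) > 1:
        # gain = avg. distance to the remaining objects minus avg. distance to the splinter group
        gains = [D[i, [j for j in rest if j != i]].mean() - D[i, splinter].mean()
                 for i in rest]
        best = int(np.argmax(gains))
        if gains[best] <= 0:           # nobody is closer to the splinter group; stop
            break
        splinter.append(rest.pop(best))
    return splinter, rest

print(diana_split(np.random.rand(10, 2)))   # indices of the two groups
```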
- 4.3 Dendrograms
- The result of hierarchical clustering can be represented as a binary tree:
  - The root of the tree represents the entire collection.
  - Terminal nodes represent observations.
  - Each interior node represents a cluster.
  - Each subtree represents a partition.
- Note: The tree defines many more partitions than the n-2 nontrivial ones constructed during the merge (or split) process.
- Note: For HAC methods, the merge order defines a sequence of n subtrees of the full tree. For HDC methods, a sequence of subtrees can be defined if there is a figure of merit for each split.
If the distance between daughter clusters is monotonically increasing as we move up the tree, we can draw a dendrogram: the y-coordinate of a vertex is the distance between its daughter clusters.

[Figure: a point set and the corresponding single linkage dendrogram.]
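A picture of this kind can be reproduced with scipy's dendrogram routine; the sketch below uses made-up two-cluster data and single linkage.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])

Z = linkage(X, method="single")   # single linkage merges and their heights
dendrogram(Z)                     # y-coordinate of each vertex = merge distance
plt.ylabel("single linkage distance")
plt.show()
```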
- Standard method to extract clusters from a dendrogram (see the sketch below)
  - Pick the number of clusters k.
  - Cut the dendrogram at a level that results in k subtrees.
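In scipy this cut corresponds to fcluster with criterion="maxclust" (the analogue of R's cutree, which the next slide mentions); a short sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
Z = linkage(X, method="average")

k = 3
labels = fcluster(Z, t=k, criterion="maxclust")   # cut the tree into (at most) k clusters
print(labels)
```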
- 4.4 Experiment
- Try hierarchical methods on unimodal 2D datasets.
- Experiments suggest:
  - Except in completely clear-cut situations, tree cutting (cutree) is useless for extracting clusters from a dendrogram.
  - Complete linkage fails completely for elongated clusters.
- Needed
  - Diagnostics to decide whether the daughters of a dendrogram node really correspond to spatially separated clusters.
  - Automatic and manual methods for dendrogram pruning.
  - Methods for assigning observations in pruned subtrees to clusters.