Title: Clustering algorithms and methods
1Clustering algorithms and methods
Andreas Held
28 June 2007
2Content
- What is a cluster and the clustering process
- Proximity measures
- Hierarchical clustering
- Agglomerative
- Divisive
- Partitioning clustering
- K-means
- Density-based Clustering
- DBSCAN
3The Cluster
- A cluster is a group or accumulation of objects with similar attributes
- Conditions on clusters
  - (i) Homogeneity within a cluster
  - (ii) Heterogeneity towards other clusters
- Possible objects in biology
  - genes (transcriptomics)
  - individuals (plant systematics)
  - sequences (sequence analysis)
Figure: Ruspini dataset, an artificially generated dataset
4Objectives of Clustering
- Generation of clusters that are as homogeneous internally and as heterogeneous towards each other as possible
- Identification of categories, classes or groups in the data
- Recognition of relations within the data
- Concise structuring of the data (e.g. as a dendrogram)
5The clustering process
- Experimental data: the expression levels of genes under different conditions
- Preprocessing: take only the expression levels under the conditions of interest
  -> attribute vectors x_i = (y_1, ..., y_m)
- Raw data matrix: create the raw data matrix by stacking the attribute vectors row by row
- Proximity measures: define the distance or similarity functions and build the distance matrix, whose rows and columns both correspond to the objects
- Clustering algorithm: choose a clustering algorithm and apply it to the data (a minimal sketch of this pipeline follows below)
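A minimal sketch of this pipeline in Python, assuming NumPy and SciPy are available; the expression values below are purely illustrative and not data from the talk:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Illustrative raw data matrix: 5 genes (objects) x 3 conditions (attributes).
# Each attribute vector x_i = (y_1, ..., y_m) becomes one row.
raw_data = np.array([
    [2.1, 0.4, 1.8],
    [1.9, 0.5, 2.0],
    [0.2, 3.1, 0.1],
    [0.3, 2.9, 0.2],
    [5.0, 5.2, 4.9],
])

# Proximity measure: pairwise distances between the objects, arranged as a
# matrix whose rows and columns both correspond to the objects.
distances = pdist(raw_data, metric="euclidean")
distance_matrix = squareform(distances)
print(distance_matrix.round(2))

# Clustering algorithm: here an agglomerative (single-linkage) method.
merge_history = linkage(distances, method="single")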
6Distance functions for objects
- d(x, y) calculates the distance between the two objects x and y
- Distance measures: e.g. Euclidean distance, Manhattan distance
- Example: see the sketch below
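As an illustration (not taken from the slides), two common distance functions in Python; Euclidean and Manhattan distance are the measures used in the later examples:

import math

def euclidean_distance(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Example with two made-up attribute vectors:
x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 3.0]
print(euclidean_distance(x, y))  # sqrt(5) ~ 2.236
print(manhattan_distance(x, y))  # 3.0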
7Distance measures for clusters
Calculating the distance between two clusters is
important for some algorithms (e.g. hierarchical
algorithms)
- Single linkage: min d(a, b), a ∈ A, b ∈ B
- Complete linkage: max d(a, b), a ∈ A, b ∈ B
- Average linkage: mean of d(a, b) over all pairs a ∈ A, b ∈ B
Figure: two clusters X and Y (objects A-E plotted against condition 1 and condition 2) with the single-, complete- and average-linkage distances drawn in (a code sketch follows below).
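A small sketch of the three cluster distance measures, assuming a point-to-point distance function d is already given; the clusters and coordinates are illustrative:

def single_linkage(A, B, d):
    # distance between clusters A and B: minimum over all object pairs
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B, d):
    # distance between clusters A and B: maximum over all object pairs
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    # distance between clusters A and B: mean over all object pairs
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

# Illustrative clusters X and Y and a Manhattan distance between objects:
manhattan = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))
cluster_x = [(1.0, 1.0), (2.0, 1.5)]
cluster_y = [(4.0, 4.5), (5.0, 5.0)]
print(single_linkage(cluster_x, cluster_y, manhattan))    # 5.0
print(complete_linkage(cluster_x, cluster_y, manhattan))  # 8.0
print(average_linkage(cluster_x, cluster_y, manhattan))   # 6.5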
8Differentiation of clustering algorithms
9Hierarchical Clustering
- Two methods of hierarchical clustering
  - agglomerative (bottom-up)
  - divisive (top-down)
- Agglomerative vs. divisive
  - divisive and agglomerative methods produce the same kind of result (a dendrogram)
  - divisive algorithms need much more computing power, so in practice only agglomerative methods are used
- Agglomerative algorithm example: UPGMA, used in phylogenetics
- Conditions
  - a given distance or similarity measure for objects
  - a given distance measure for clusters
- Result is a dendrogram
10Agglomerative hierarchical clustering
Algorithm
1) Start with the objects and a given distance measure between clusters.
2) Construct the finest partition (every object forms its own cluster) and compute the distance matrix D.
3) Find the two clusters with the closest distance and merge them into one cluster. Compute the new distance matrix.
4) Repeat step 3) until all clusters are agglomerated into one.
Example: Manhattan distance, single linkage.
Figure: the objects A-E in the data space, the corresponding distance matrix, and the resulting dendrogram of the clusters (a code sketch of the procedure follows below).
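A naive Python sketch of the agglomerative procedure described above, using Manhattan distance and single linkage as in the example; the coordinates for the objects A-E are made up for illustration and do not reproduce the figure:

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def single_linkage(A, B, d):
    return min(d(a, b) for a in A for b in B)

def agglomerate(objects, d, cluster_dist):
    # Start with the finest partition: every object is its own cluster.
    clusters = [[o] for o in objects]
    merges = []
    # Repeat until all clusters are agglomerated into one.
    while len(clusters) > 1:
        # Find the two clusters with the closest distance ...
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist_ij = cluster_dist(clusters[i], clusters[j], d)
                if best is None or dist_ij < best[0]:
                    best = (dist_ij, i, j)
        dist_ij, i, j = best
        # ... record the merge (this history is what the dendrogram shows) ...
        merges.append((clusters[i], clusters[j], dist_ij))
        # ... and put those two clusters into one; the cluster distances are
        # recomputed from scratch in the next pass (the "new distance matrix").
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Illustrative 2-D coordinates for objects A-E (not the values from the figure):
objects = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (5.5, 6.0), (8.0, 2.0)]
for left, right, dist in agglomerate(objects, manhattan, single_linkage):
    print(left, "+", right, "at distance", dist)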
11Hierarchical clustering - conclusions -
- Advantages
  - The dendrogram allows interpretation: depending on the level of the dendrogram, different clustering granularities can be explored.
  - Usable on any data space for which a distance measure can be defined.
- Disadvantages
  - The user has to identify the clusters themselves.
  - Repeated recalculation of the large distance matrix makes the algorithm resource-intensive.
  - Higher runtime compared with non-hierarchical methods.
12Partitioning clustering - k-means algorithm -
- Partition n objects into k clusters
- Calculate centroids from a given clustering
  - c_i = (1/|C_i|) · Σ_{x ∈ C_i} x, the centroid of cluster C_i
- Calculate a clustering from given centroids
  -> assign every object to the cluster with the minimum distance to its centroid
13k-means algorithm principle
Figure: cluster centers (centroids) and the resulting clustering
- In general, neither the centroids nor the clustering is known
  -> start with a guess
14k-means Algorithm
- Euclidean distance, k = 3
0) Init: randomly place 3 cluster centroids
1) Assign every object to the cluster with the nearest cluster centroid
2) Compute the new cluster centroids from the given clustering
3) Repeat 1) and 2) until the centroids stop moving
- In each step the centroids and the clustering improve (see the code sketch below)
Figure: example run on 2-D data in the unit square, showing the objects and the three centroids.
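A minimal k-means (Lloyd's algorithm) sketch following steps 0) to 3); the sample points and the helper names (kmeans, squared_distance) are illustrative assumptions, not code from the talk:

import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 0) Init: randomly pick k objects as the initial cluster centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # 1) Assign every object to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # 2) Compute the new cluster centroids from the given clustering.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(vals) / len(members) for vals in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # 3) Repeat 1) and 2) until all centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative 2-D points, Euclidean distance, k = 3 as on the slide:
points = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.8), (0.5, 0.1), (0.55, 0.15)]
centroids, clusters = kmeans(points, k=3)
print(centroids)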
15k-means algorithm - problems -
- Not every run achieves the same result, because the result depends on the random initialization of the clusters
  -> run the algorithm several times and take the best result
- A fixed number of clusters k has to be known before starting the algorithm
  -> try different values for k and take the best result
- Computing the optimal number of clusters is not trivial; one approach is the elbow criterion (see the sketch below).
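A sketch of the elbow criterion, assuming scikit-learn is available: the within-cluster sum of squares (inertia) is computed for several values of k, and the "elbow" where it stops dropping sharply suggests a suitable k. The random data is illustrative only:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three blob-like groups of 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) * 0.3 + center
               for center in [(0, 0), (3, 3), (6, 0)]])

# Within-cluster sum of squares for increasing k; the "elbow" where the
# inertia stops dropping noticeably (here around k = 3) is a reasonable choice.
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 2))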
16k-means algorithm - advantages -
- Easy to implement
- Linear runtime allows execution on large databases
  - for example the clustering of microarray data, with vectors of up to 20,000 dimensions depending on the experiment
17Partitioning clustering - density-based method -
- Condition on the data space
  - regions where objects lie closely together are separated from regions where the objects lie less closely together
  -> clusters with arbitrary shape can be found
18Density-based clustering - parameters -
- ε: the environment (radius) considered around an object
- ε(o): all objects in the ε-environment of object o
- MinPts: the minimum number of objects that have to lie in an object's ε-environment for that object to be a core object
Figure: an object o and its ε-environment
19Density-based clustering - definitions -
- An object o ∈ O is a core object if |ε(o)| ≥ MinPts
- An object p ∈ O is directly density-reachable from q ∈ O if p ∈ ε(q) and q is a core object
- An object p is density-reachable from an object q if there is a chain of directly density-reachable objects between p and q
Figure: a core object o, and objects p and q illustrating direct density-reachability and density-reachability
20Density-based clustering - example DBSCAN -
Parameters: MinPts = 4, ε as drawn in the figure
Algorithm
1) Visit all objects incrementally
2) Find a core object (|ε(o)| ≥ MinPts = 4)
3) Start a new cluster and assign this core object to it
4) Search for all objects that are density-reachable from it and assign them to the cluster as well (a simplified sketch follows below)
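A simplified DBSCAN-style sketch following steps 1) to 4); the function names, parameters and sample points are illustrative assumptions, not the original DBSCAN implementation:

def region_query(points, idx, eps, dist):
    # All objects in the eps-environment of points[idx] (including itself).
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts, dist):
    # Returns one cluster label per object; -1 marks noise.
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):          # 1) visit all objects incrementally
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps, dist)
        if len(neighbors) < min_pts:      # 2) not a core object -> noise (for now)
            labels[i] = -1
            continue
        cluster_id += 1                   # 3) start a new cluster with this core object
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                      # 4) collect everything density-reachable
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps, dist)
            if len(j_neighbors) >= min_pts:   # j is itself a core object -> expand further
                seeds.extend(j_neighbors)
    return labels

# Illustrative run with MinPts = 4 (as on the slide) and a made-up eps:
euclid = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0)]
print(dbscan(points, eps=0.5, min_pts=4, dist=euclid))  # [0, 0, 0, 0, -1]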
21Density-based clustering - conclusions -
- Advantages
  - Minimal domain knowledge required to determine the input parameters
  - Discovery of clusters with arbitrary shape
  - Good efficiency on large databases
- Disadvantages
  - Problems on data spaces whose density differs strongly between regions
  - Poor efficiency on high-dimensional databases
22More clustering methods
- Hierarchical methods (agglomerative, divisive)
- Partitioning methods (e.g. k-means)
- Density-based methods (e.g. DBSCAN)
- Fuzzy clustering
- Grid-based methods
- Constraint-based methods
- High-dimensional clustering
23Clustering algorithms - conclusions -
- Choosing a clustering algorithm for a particular problem is not trivial.
- A single algorithm covers only part of the given requirements (runtime behaviour, precision, influence of outliers, ...).
  -> No algorithm has been found (yet) that is optimally usable for every purpose; such an algorithm has still to be developed.
24End