Clustering algorithms and methods
Transcript of a PowerPoint presentation (provided by: unimue; 25 slides)
1
Clustering algorithms and methods
  • - Review and usage -

Andreas Held
28 June 2007
2
Content
  • What is a cluster and the clustering process
  • Proximity measures
  • Hierarchical clustering
  • Agglomerative
  • Divisive
  • Partitioning clustering
  • K-means
  • Density-based Clustering
  • DBSCAN

3
The Cluster
  • A cluster is a group or accumulation of objects
    with similar attributes
  • Conditions on clusters:
  • (i) Homogeneity within a cluster
  • (ii) Heterogeneity towards other clusters
  • Possible objects in biology:
  • genes (transcriptomics)
  • individuals (plant systematics)
  • sequences (sequence analysis)

Ruspini dataset: an artificially generated dataset
4
Objectives of Clustering
  • Generation of clusters that are as internally
    homogeneous and as mutually heterogeneous as
    possible
  • Identification of categories, classes or groups
    in the data
  • Recognition of relations within the data
  • Concise structuring of the data
    (e.g. as a dendrogram)

5
The clustering process
  • experimental data: the expression levels of
    genes under different conditions
  • preprocessing: take only the expression levels
    under the conditions of interest
    → attribute vectors x_i = (y_1, …, y_m)
  • raw data matrix: create the raw data matrix by
    stacking the attribute vectors row by row
  • proximity measures: define the distance or
    similarity functions and build the distance
    matrix, whose rows and columns are indexed by
    the objects
  • clustering algorithm: choose a clustering
    algorithm and apply it to the data
6
Distance functions for objects
  • d(x, y) calculates the distance between the two
    objects x and y
  • Distance measures (e.g. Euclidean, Manhattan)
  • Example
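The object-level distance functions named above can be sketched directly; a minimal Python illustration of the Euclidean and Manhattan distances (the example vectors are assumptions, not the slide's):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```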
7
Distance measures for clusters
Calculating the distance between two clusters is
important for some algorithms (e.g. hierarchical
algorithms).
  • Single linkage: min d(a, b), a ∈ A, b ∈ B
  • Complete linkage: max d(a, b), a ∈ A, b ∈ B
  • Average linkage: mean of d(a, b) over all pairs
    a ∈ A, b ∈ B
[Figure: points A–E in two clusters X and Y, plotted
against Condition 1 and Condition 2, illustrating the
single, complete, and average linkage distances.]
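The three linkage distances are simply the minimum, maximum, and mean of the object distance over all cross-cluster pairs; a minimal Python sketch (the example clusters X and Y are assumptions):

```python
from itertools import product

def single_linkage(A, B, d):
    # minimum over all pairs (a, b) with a in A, b in B
    return min(d(a, b) for a, b in product(A, B))

def complete_linkage(A, B, d):
    # maximum over all pairs
    return max(d(a, b) for a, b in product(A, B))

def average_linkage(A, B, d):
    # mean over all pairs
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
X = [(1, 1), (2, 2)]
Y = [(4, 4), (5, 5)]
print(single_linkage(X, Y, manhattan))    # 4
print(complete_linkage(X, Y, manhattan))  # 8
print(average_linkage(X, Y, manhattan))   # 6.0
```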
8
Differentiation of clustering algorithms
9
Hierarchical Clustering
  • Two methods of hierarchical clustering
  • agglomerative (bottom-up)
  • divisive (top-down)
  • agglomerative vs. divisive
  • divisive and agglomerative methods produce the
    same results
  • divisive algorithms need much more computing
    power, so in practice only agglomerative methods
    are used
  • example of an agglomerative algorithm: UPGMA,
    used in phylogenetics
  • Conditions
  • a given distance or similarity measure for objects
  • a given distance measure for clusters
  • the result is a dendrogram

10
Agglomerative hierarchical clustering
Algorithm
  • Start with the objects and a given distance
    measure between clusters
  • Construct the finest partition and compute the
    distance matrix D
  • Find the two clusters with the smallest distance
    and merge them into one; compute the new
    distance matrix
  • Repeat until all clusters are agglomerated

Distance measures used in the example
  • Manhattan distance
  • Single linkage

[Figure: objects A–E in a scatter plot, the
distance matrix, and the resulting dendrogram with
clusters a–d formed at increasing distances.]
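The agglomerative procedure above can be sketched in Python, using the slide's Manhattan distance and single linkage (a minimal illustration; the point coordinates are assumptions, not read from the figure):

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def single_linkage(A, B):
    return min(manhattan(a, b) for a in A for b in B)

def agglomerate(objects):
    # start with the finest partition: one cluster per object
    clusters = [[o] for o in objects]
    merges = []
    while len(clusters) > 1:
        # find the two clusters with the smallest linkage distance
        (i, j), dist = min(
            (((i, j), single_linkage(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[1])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return merges

# five example points standing in for the slide's objects A..E
points = [(1, 1), (2, 3), (5, 7), (6, 8), (7, 9)]
for a, b, d in agglomerate(points):
    print(a, b, d)
```

Each printed line is one merge step; the sequence of merge distances is what the dendrogram draws as heights.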
11
Hierarchical clustering- conclusions -
  • Advantages
  • the dendrogram allows interpretation
  • depending on the level of the dendrogram,
    different clustering granularities can be
    explored
  • usable on any data space for which a distance
    measure can be defined
  • Disadvantages
  • the user has to identify the clusters themselves
  • repeated recalculation of the large distance
    matrix makes the algorithm resource-intensive
  • higher runtimes than non-hierarchical methods

12
Partitioning clustering - k-means algorithm -
  • partition n objects into k clusters
  • calculate centroids from a given clustering:
    c_i = (1/|C_i|) Σ_{x ∈ C_i} x,
    the centroid of cluster C_i
  • calculate a clustering from given centroids
  • → assign each object to the cluster with the
    minimum distance to its centroid
13
k-means algorithm principle
Cluster center (centroid)
Clustering
  • In general, neither the centroids nor the
    clustering is known
  • → start from an initial guess

14
k-means Algorithm
- Euclidean distance, k = 3 -
0) Init: place 3 cluster centroids randomly
1) Assign every object to the cluster with the
   nearest cluster centroid
2) Compute the new cluster centroids from the
   given clustering
3) Repeat 1) and 2) until all centroids stop moving
  • in each step, the centroids and the clustering
    improve

[Figure: scatter plot on the unit square showing
the objects, the centroids, and the resulting
clusters.]
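Steps 0)–3) can be sketched in Python (a minimal illustration; the sample points are assumptions, not the slide's data):

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # 0) init: pick k random objects as initial centroids
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # 1) assign every object to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # 2) recompute centroids as the mean of each cluster
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
               else centroids[i]
               for i, cl in enumerate(clusters)]
        # 3) stop once the centroids no longer move
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9),
       (0.85, 0.8), (0.5, 0.1), (0.55, 0.15)]
centroids, clusters = kmeans(pts, k=3)
print(centroids)
```

A different seed can give a different result, which is exactly the initialisation problem discussed on the next slide.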
15
k-means algorithm- problems -
  • Not every run achieves the same result, because
    the result depends on the random initialisation
    of the clusters
  • → run the algorithm several times and take the
    best result
  • the number of clusters k must be fixed before
    starting the algorithm
  • → try different values for k and take the best
    result
  • computing the optimal number of clusters is not
    trivial; one approach is the elbow criterion
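The elbow criterion mentioned above can be sketched by running k-means for increasing k and watching the total within-cluster sum of squares (WCSS) flatten out; a minimal Python illustration (the sample points are assumptions):

```python
import math
import random

def kmeans(points, k, seed=0, iters=50):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

def wcss(centroids, clusters):
    # total within-cluster sum of squared distances to the centroid
    return sum(math.dist(p, c) ** 2
               for c, cl in zip(centroids, clusters) for p in cl)

pts = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9),
       (0.85, 0.8), (0.5, 0.1), (0.55, 0.15)]
for k in range(1, 5):
    print(k, round(wcss(*kmeans(pts, k)), 4))
```

The k after which the WCSS stops dropping sharply (the "elbow" of the printed curve) is taken as the cluster count.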

16
k-means algorithm - advantages -
  • easy to implement
  • linear runtime allows execution on large
    databases
  • for example, the clustering of microarray data
    (depending on the experiment, 20,000-dimensional
    vectors)

17
Partitioning clustering - density-based method -
  • Condition on the data space
  • areas where objects lie closely together are
    separated from areas where the objects lie less
    closely together
  • → clusters of arbitrary shape can be found
18
Density-based clustering - parameters -
  • ε: the radius of the environment around an
    object
  • N_ε(o): all objects in the ε-environment of
    object o
  • MinPts: the minimum number of objects that have
    to be in an object's ε-environment for that
    object to be a core object

[Figure: the ε-environment around an object o.]
19
Density-based clustering - definitions -
  • an object o ∈ O is a core object
    if |N_ε(o)| ≥ MinPts
  • an object p ∈ O is directly density-reachable
    from q ∈ O if p ∈ N_ε(q)
  • an object p is density-reachable from an object
    q if there is a chain of directly
    density-reachable objects between p and q

[Figure: a core object o; p directly
density-reachable from q; p density-reachable from
q via a chain.]
20
Density-based clustering- example DBSCAN-
Parameters: MinPts = 4, ε as shown in the figure
Algorithm
1) Visit all objects incrementally
2) Find a core object (|N_ε(o)| ≥ MinPts = 4)
3) Start a new cluster and add the object to this
   cluster
4) Search for all density-reachable objects and add
   them to the cluster as well
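The DBSCAN steps above can be sketched as follows (a minimal brute-force illustration; the point set and the ε value are assumptions, not the slide's figure):

```python
import math

def dbscan(points, eps, min_pts):
    # brute-force eps-neighbourhood query, N_eps(i)
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        n = neighbours(i)
        if len(n) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        # i is a core object: start a new cluster and expand it
        labels[i] = cluster
        seeds = list(n)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:   # j is also a core object
                seeds.extend(nj)     # keep expanding from it
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),            # dense block
       (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),  # second block
       (5, 25)]                                               # isolated point
print(dbscan(pts, eps=1.5, min_pts=4))
```

The two dense blocks come out as clusters 0 and 1, while the isolated point is labelled -1 (noise).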
21
Density-based clustering - conclusions -
  • Advantages
  • minimal domain knowledge required to determine
    the input parameters
  • discovery of clusters of arbitrary shape
  • good efficiency on large databases
  • Disadvantages
  • problems on data spaces with strongly differing
    densities in different regions
  • poor efficiency on high-dimensional databases

22
More clustering methods
  • Hierarchical methods (agglomerative, divisive)
  • Partitioning methods (e.g. k-means)
  • Density-based methods (e.g. DBSCAN)
  • Fuzzy clustering
  • Grid-based methods
  • Constraint-based methods
  • High-dimensional clustering

23
Clustering algorithms- conclusions -
  • Choosing a clustering algorithm for a particular
    problem is not trivial
  • individual algorithms cover only part of the
    given requirements (runtime behaviour, precision,
    influence of outliers, ...)
  • → no algorithm has been found (yet) that is
    optimally usable for every purpose; such an
    algorithm has still to be developed

24
End