Title: Clustering in General
1 Clustering in General
- In a vector space, a cluster is the set of vectors that lie within ε of a cluster vector; techniques differ in how they determine the cluster vector and ε.
- Clustering is unsupervised pattern classification.
- Unsupervised means there is no correct answer or feedback.
- Patterns are typically samples of feature vectors or matrices.
- Classification means collecting the samples into groups of similar members.
2 Clustering Decisions
- Pattern representation
  - feature selection (e.g., stop word removal, stemming)
  - number of categories
- Pattern proximity
  - distance measure on pairs of patterns
- Grouping
  - characteristics of clusters (e.g., fuzzy, hierarchical)
- Clustering algorithms embody different assumptions about these decisions and about the form of clusters.
3 Formal Definitions
- A feature vector x is a single datum of d measurements.
- Hard clustering techniques assign a class label to each feature vector x; members of clusters are mutually exclusive.
- Fuzzy clustering techniques assign a fractional degree of membership in each label to each x.
4 Proximity Measures
- Generally, use Euclidean distance or mean squared distance.
- In IR, use a similarity measure from retrieval (e.g., the cosine measure on TF-IDF vectors).
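A minimal sketch of the cosine measure on TF-IDF vectors (the vocabulary size and weights below are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two TF-IDF vectors (1.0 = identical direction)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical TF-IDF vectors for two documents over a 4-term vocabulary.
doc1 = np.array([0.0, 1.2, 0.4, 0.0])
doc2 = np.array([0.3, 0.9, 0.0, 0.0])
print(cosine_similarity(doc1, doc2))   # higher value = more similar documents
```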
5 Jain, Murty & Flynn Taxonomy of Clustering
- Clustering
  - Hierarchical (e.g., HAC)
    - Single Link
    - Complete Link
  - Partitional
    - Square Error (e.g., k-means)
    - Graph Theoretic
    - Mixture Resolving (e.g., Expectation Maximization)
    - Mode Seeking
6 Clustering Issues
7 Hierarchical Algorithms
- Produce a hierarchy of classes (a taxonomy), from singleton clusters up to a single all-inclusive cluster.
- Select a level of the hierarchy at which to extract the cluster set.
- The representation is a dendrogram.
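A minimal sketch of building a dendrogram and extracting a cluster set at a chosen level, using SciPy's hierarchical clustering utilities (the toy 2-D points are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (invented for illustration).
points = np.array([[1, 2], [2, 2], [8, 9], [9, 8], [25, 30], [24, 29]])

# Build the full merge hierarchy (complete-link here); Z encodes the dendrogram.
Z = linkage(points, method="complete")

# "Select level for extracting cluster set": cut the dendrogram into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # one cluster label per point, e.g. [1 1 2 2 3 3]
```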
8 Complete-Link Revisited
- Used to create a statistical thesaurus.
- Agglomerative, hard, deterministic, batch.
- 1. Start with one cluster per sample.
- 2. Find the two clusters with the lowest distance.
- 3. Merge the two clusters and add the merge to the hierarchy.
- 4. Repeat from step 2 until a termination criterion is met or all clusters have merged.
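A minimal from-scratch sketch of these steps (a naive O(n³) version on invented 2-D points; a fixed target cluster count stands in for the termination criterion):

```python
import numpy as np

def complete_link(points: np.ndarray, num_clusters: int) -> list[list[int]]:
    """Naive complete-link agglomerative clustering down to num_clusters clusters."""
    clusters = [[i] for i in range(len(points))]        # start: one cluster per sample
    while len(clusters) > num_clusters:
        best, best_dist = None, np.inf
        # Find the pair of clusters with the lowest complete-link distance
        # (maximum pairwise distance between their members).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_dist, best = d, (a, b)
        a, b = best
        clusters[a] += clusters[b]                       # merge the two clusters
        del clusters[b]
    return clusters

pts = np.array([[1, 1], [2, 1], [9, 9], [10, 10], [5, 5]])
print(complete_link(pts, 2))                             # e.g. [[0, 1, 4], [2, 3]]
```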
9 Single-Link
- Like complete-link, except the distance between two clusters is the minimum of the distances between all pairs of samples drawn one from each cluster (complete-link uses the maximum).
- Single-link suffers a chaining effect with elongated clusters, but can construct clusters of more complex shapes.
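To see the shape difference, a sketch using scikit-learn (the two-moons data set is a stand-in for elongated, complex-shaped clusters; with low noise, single linkage typically traces each moon end to end via chaining, while complete linkage tends to cut across them):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
complete = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)

# single tends to follow each moon (chaining); complete produces two compact
# groups that usually split each moon across the middle.
print(single[:10], complete[:10])
```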
10 Example Plot
11 Example Proximity Matrix
12 Complete-Link Solution
(Dendrogram of the complete-link solution over the 16 example points (29,26), (1,28), (9,16), (21,15), (29,22), (45,42), (46,30), (23,32), (4,9), (13,18), (31,15), (33,21), (35,35), (42,45), (21,27), (26,25); internal merge nodes labelled C1–C15.)
13 Single-Link Solution
(Dendrogram of the single-link solution over the same 16 points; internal merge nodes labelled C1–C15.)
14 Hierarchical Agglomerative Clustering (HAC)
- Agglomerative, hard, deterministic, batch.
- 1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
- 2. Merge the most similar pair of clusters and update the proximity matrix.
- 3. Repeat step 2 until all clusters have merged.
- Variants differ in how the proximity matrix is updated.
- This family can combine the benefits of both the single-link and complete-link algorithms.
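The update step is where variants diverge; a minimal sketch of the rule for the two linkages discussed above (merging clusters x and y, then recomputing their distance to an untouched cluster w; the dict representation of the proximity matrix is illustrative):

```python
def update_proximity(D, x, y, w, linkage="complete"):
    """Distance from the merged cluster (x ∪ y) to another cluster w.

    Single-link keeps the minimum of the old distances; complete-link keeps the
    maximum. D maps frozenset pairs of cluster ids to distances (illustrative).
    """
    d_xw, d_yw = D[frozenset({x, w})], D[frozenset({y, w})]
    return min(d_xw, d_yw) if linkage == "single" else max(d_xw, d_yw)

# Toy proximity matrix over clusters 0, 1, 2; merge 0 and 1, then update against 2.
D = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 4.0, frozenset({1, 2}): 6.0}
print(update_proximity(D, 0, 1, 2, linkage="single"))    # 4.0
print(update_proximity(D, 0, 1, 2, linkage="complete"))  # 6.0
```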
15 HAC for IR
- Intra-cluster similarity: Sim(X) = Σ_{d∈X} cosine(d, c), where the documents d are drawn from the set S of TF-IDF document vectors and c is the centroid of cluster X.
- Proximity is the similarity of all documents to the cluster centroid.
- Select the pair of clusters that produces the smallest decrease in similarity, i.e., if merge(X, Y) → Z, choose the pair that maximizes Sim(Z) − (Sim(X) + Sim(Y)).
16 HAC for IR: Alternatives
- Centroid similarity
  - cosine similarity between the centroids of the two clusters
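A sketch of both criteria, assuming each cluster is a NumPy array of dense TF-IDF row vectors (the data and function names are illustrative):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intra_cluster_sim(docs: np.ndarray) -> float:
    """Sim(X): sum of cosine similarities of each document to the cluster centroid."""
    centroid = docs.mean(axis=0)
    return sum(cos(d, centroid) for d in docs)

def centroid_sim(docs_x: np.ndarray, docs_y: np.ndarray) -> float:
    """Alternative criterion: cosine similarity between the two cluster centroids."""
    return cos(docs_x.mean(axis=0), docs_y.mean(axis=0))

def merge_score(docs_x: np.ndarray, docs_y: np.ndarray) -> float:
    """Change in intra-cluster similarity if X and Y merge into Z; HAC for IR
    merges the pair with the largest score (smallest decrease)."""
    z = np.vstack([docs_x, docs_y])
    return intra_cluster_sim(z) - (intra_cluster_sim(docs_x) + intra_cluster_sim(docs_y))

X = np.array([[0.0, 1.0, 0.5], [0.1, 0.9, 0.4]])   # hypothetical TF-IDF rows, cluster X
Y = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])   # hypothetical TF-IDF rows, cluster Y
print(centroid_sim(X, Y), merge_score(X, Y))
```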
17 Partitional Algorithms
- Result in a set of unrelated (flat) clusters.
- Issues
  - How many clusters are enough?
  - How do we search the space of possible partitions?
  - What is an appropriate clustering criterion?
18 k-Means
- The number of clusters k is set by the user.
- Non-deterministic (sensitive to the initial centroids).
- The clustering criterion is squared error: e²(S, L) = Σ_{j=1..K} Σ_{i=1..n_j} ||x_i(j) − c_j||², where S is the document set, L is a clustering, K is the number of clusters, x_i(j) is the ith document in the jth cluster, and c_j is the centroid of the jth cluster.
19 k-Means Clustering Algorithm
- 1. Randomly select k samples as the initial cluster centroids.
- 2. Assign each pattern to the closest cluster centroid.
- 3. Recompute the centroids.
- 4. If the convergence criterion (e.g., minimal decrease in squared error or no change in cluster composition) is not met, return to step 2.
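A minimal NumPy sketch of these four steps (toy data and the choice of k are invented; ties and empty clusters are not handled):

```python
import numpy as np

def k_means(X: np.ndarray, k: int, seed: int = 0, max_iter: int = 100) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # 1. Randomly select k samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2. Assign each pattern to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. Convergence criterion: no change in cluster composition.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Recompute each centroid as the mean of its cluster's members.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [0.5, 1.2]])
print(k_means(X, k=2))   # e.g. [0 0 1 1 0], labels depend on the initialization
```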
20 Example: k-Means Solutions
21 k-Means Sensitivity to Initialization
(Figure: seven points A–G clustered with K = 3; the red solution started with centroids A, D, and F, the yellow solution with A, B, and C.)
22 k-Means for IR
- Update centroids incrementally.
- Calculate the centroid as with the hierarchical methods.
- Can be refined into a divisive hierarchical method by starting with a single cluster and repeatedly splitting with k-means until there are k clusters with the highest summed similarities (bisecting k-means); see the sketch below.
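A minimal sketch of bisecting k-means using scikit-learn's KMeans for the two-way splits (the split-selection rule here, splitting the largest cluster, is one common choice and an assumption, as is the toy data):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_k_means(X: np.ndarray, k: int, seed: int = 0) -> list[np.ndarray]:
    """Divisive clustering: start with one cluster, repeatedly 2-means-split until k clusters."""
    clusters = [np.arange(len(X))]                 # start with a single cluster of all indices
    while len(clusters) < k:
        # Choose a cluster to split; here simply the largest one (an assumed heuristic).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[halves == 0])
        clusters.append(members[halves == 1])
    return clusters

X = np.array([[1, 1], [2, 1], [8, 8], [9, 9], [1, 2], [8, 9], [20, 20], [21, 19]], dtype=float)
for c in bisecting_k_means(X, k=3):
    print(c)                                        # index array for each of the 3 clusters
```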
23 Other Types of Clustering Algorithms
- Graph theoretic: construct a minimal spanning tree and delete the edges with the largest lengths (a sketch follows this list).
- Expectation Maximization (EM): assume the clusters are drawn from distributions; use maximum likelihood to estimate the parameters of the distributions.
- Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
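A sketch of the graph-theoretic approach with SciPy (toy points invented for illustration; removing the k−1 longest MST edges leaves k connected components, which are taken as the clusters):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points: np.ndarray, k: int) -> np.ndarray:
    """Cluster by building a minimal spanning tree and deleting the k-1 longest edges."""
    dist = squareform(pdist(points))                 # full pairwise Euclidean distances
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    # Zero out (delete) the k-1 largest edges of the MST.
    edges = np.argsort(mst, axis=None)[::-1][: k - 1]
    mst[np.unravel_index(edges, mst.shape)] = 0
    # The remaining connected components are the clusters.
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

pts = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [30, 0]], dtype=float)
print(mst_clusters(pts, k=3))   # e.g. [0 0 0 1 1 2]
```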
24 Comparison of Clustering Algorithms (Steinbach et al.)
- Implement 3 versions of HAC and 2 versions of k-means.
- Compare performance on documents hand-labelled as relevant to one of a set of classes.
- Well-known data sets (TREC).
- Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs.
- M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.
25 Evaluation Metrics (1)
- Evaluation: how do we measure cluster quality?
- Entropy of a cluster j: E_j = −Σ_i p_ij · log(p_ij), where p_ij is the probability that a member of cluster j belongs to class i.
- Entropy of a clustering solution CS: E(CS) = Σ_{j=1..m} (n_j / n) · E_j, where n_j is the size of cluster j, m is the number of clusters, and n is the number of docs.
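A sketch of the entropy computation, assuming each document has a hand-assigned class label and a cluster label (the labels below are invented):

```python
import math
from collections import Counter

def clustering_entropy(class_labels: list[str], cluster_labels: list[int]) -> float:
    """Weighted average entropy over clusters; lower is better (0 = pure clusters)."""
    n = len(class_labels)
    total = 0.0
    for j in set(cluster_labels):
        members = [c for c, k in zip(class_labels, cluster_labels) if k == j]
        n_j = len(members)
        # E_j = -sum over classes i of p_ij * log(p_ij), with p_ij estimated from counts.
        e_j = -sum((cnt / n_j) * math.log(cnt / n_j) for cnt in Counter(members).values())
        total += (n_j / n) * e_j
    return total

classes  = ["sports", "sports", "politics", "politics", "sports", "politics"]
clusters = [0, 0, 1, 1, 0, 1]
print(clustering_entropy(classes, clusters))   # 0.0 for this perfectly pure clustering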
26 Comparison Measures (2)
- The F measure combines precision and recall.
- Treat each cluster as the result of a query and each class as the relevant set of docs: precision(i, j) = n_ij / n_j, recall(i, j) = n_ij / n_i, and F(i, j) = 2 · precision(i, j) · recall(i, j) / (precision(i, j) + recall(i, j)).
- Overall F = Σ_i (n_i / n) · max_j F(i, j), where n_ij is the number of members of class i in cluster j, n_j is the number of docs in cluster j, n_i is the number of docs in class i, and n is the number of docs.
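A sketch of the F measure on the same kind of labels, taking the best-matching cluster for each class and weighting by class size (consistent with the formula above; the labels are invented):

```python
from collections import Counter

def clustering_f_measure(class_labels: list[str], cluster_labels: list[int]) -> float:
    """Overall F: each class keeps its best F over all clusters, weighted by class size."""
    n = len(class_labels)
    n_i = Counter(class_labels)                        # docs per class
    n_j = Counter(cluster_labels)                      # docs per cluster
    n_ij = Counter(zip(class_labels, cluster_labels))  # docs of class i in cluster j
    total = 0.0
    for i in n_i:
        best = 0.0
        for j in n_j:
            if n_ij[(i, j)] == 0:
                continue
            precision = n_ij[(i, j)] / n_j[j]
            recall = n_ij[(i, j)] / n_i[i]
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i[i] / n) * best
    return total

classes  = ["sports", "sports", "politics", "politics", "sports", "politics"]
clusters = [0, 0, 1, 1, 0, 1]
print(clustering_f_measure(classes, clusters))   # 1.0 for this perfectly matching clustering
```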