Title: Different Perspectives at Clustering: The Number-of-Clusters Case

1. Different Perspectives at Clustering: The Number-of-Clusters Case
- B. Mirkin
- School of Computer Science
- Birkbeck College, University of London
- IFCS 2006
2. Different Perspectives at Number of Clusters: Talk Outline
- Clustering and K-Means: a discussion
- Clustering goals and four perspectives
- Number of clusters in:
  - Classical statistics perspective
  - Machine learning perspective
  - Data mining perspective (including a simulation study with 8 methods)
  - Knowledge discovery perspective (including a comparative genomics project)
3.
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation Aids
- WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
- DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means and for Ward; Extensions to Other Data Types; One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
- GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
4. Example: W. Jevons (1835-1882), updated in Mirkin 1996
- Pluto doesn't fit in the two clusters of planets
5. Example: A Few Clusters
- Clustering interface to Web search engines (Grouper)
- Query: "Israel" (after O. Zamir and O. Etzioni, 2001)

Cluster | Sites | Interpretation
1       | 24    | Society, religion: Israel and Judaism; Judaica collection
2       | 12    | Middle East, war, history: The state of Israel; Arabs and Palestinians
3       | 31    | Economy, travel: Israel Hotel Association; Electronics in Israel
6. Clustering: Main Steps
- Data collecting
- Data pre-processing
- Finding clusters (the only step appreciated in conventional clustering)
- Interpretation
- Drawing conclusions
7. Conventional Clustering: Cluster Algorithms
- Single Linkage (Nearest Neighbour)
- Ward Agglomeration
- Conceptual Clustering
- K-Means
- Kohonen SOM
- ...
8. K-Means: a generic clustering method
- Entities are presented as multidimensional points
- 0. Put K hypothetical centroids (seeds)
- 1. Assign points to the centroids according to the minimum-distance rule
- 2. Put centroids at the gravity centres of the clusters thus obtained
- 3. Iterate 1. and 2. until convergence
- K = 3 hypothetical centroids (@)
9. K-Means: a generic clustering method (steps 0-3 repeated from the previous slide)
10. K-Means: a generic clustering method (steps 0-3 repeated)
11. K-Means: a generic clustering method
- Entities are presented as multidimensional points
- 0. Put K hypothetical centroids (seeds)
- 1. Assign points to the centroids according to the minimum-distance rule
- 2. Put centroids at the gravity centres of the clusters thus obtained
- 3. Iterate 1. and 2. until convergence
- 4. Output final centroids and clusters
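A minimal Python/NumPy sketch of the generic K-Means loop on slides 8-11 follows; the data matrix Y, the value of K and the random choice of seeds in step 0 are illustrative assumptions, not prescriptions from the talk.

    import numpy as np

    def k_means(Y, K, max_iter=100, seed=0):
        """Generic K-Means: seed K centroids, then alternate the minimum-distance
        assignment (step 1) and the gravity-centre update (step 2) until convergence."""
        rng = np.random.default_rng(seed)
        Y = np.asarray(Y, dtype=float)
        # 0. Put K hypothetical centroids (here: K entities picked at random as seeds)
        centroids = Y[rng.choice(len(Y), size=K, replace=False)].copy()
        labels = np.full(len(Y), -1)
        for _ in range(max_iter):
            # 1. Assign points to centroids by the minimum-distance rule
            dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # 3. convergence: assignments stop changing
            labels = new_labels
            # 2. Put centroids at the gravity centres of the clusters thus obtained
            for k in range(K):
                if np.any(labels == k):
                    centroids[k] = Y[labels == k].mean(axis=0)
        # 4. Output final centroids and clusters
        return centroids, labels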
12. Advantages of K-Means
- Conventional:
  - Models typology building
  - Computationally effective
  - Can be incremental, on-line
- Unconventional:
  - Associates feature salience with feature scales and correlation/association
  - Applicable to mixed-scale data
13. Drawbacks of K-Means
- No advice on:
  - Data pre-processing
  - Number of clusters
  - Initial setting
- Instability of results
- Criterion can be inadequate
- Insufficient interpretation aids
14. Initial Centroids: Correct
- Two-cluster case
15. Initial Centroids: Correct
- (Figure: initial and final centroid positions)
16. Different Initial Centroids
17. Different Initial Centroids: Wrong, even though in different clusters
- (Figure: initial and final centroid positions)
18. Two types of goals (with no clear-cut borderline)
- Engineering goals
- Data analysis goals
19. Engineering goals (examples)
- Devising a market segmentation to minimise promotion and advertisement expenses
- Dividing a large scheme into modules to minimise the cost
- Organisation structure design
20. Data analysis goals (examples)
- Recovery of the distribution function
- Prediction
- Revealing patterns in data
- Enhancing knowledge with additional concepts and regularities
- Each of these is realised in a different perspective at clustering
21. Clustering Perspectives
- Classical statistics: recovery of a multimodal distribution function
- Machine learning: prediction
- Data mining: revealing patterns in data
- Knowledge discovery: additional concepts and regularities
22. Clustering Perspectives at Clusters
- Classical statistics: as many as meaningful modes (mixture items)
- Machine learning: as many as needed for acceptable prediction
- Data mining: as many as meaningful patterns in data (including incomplete clustering)
- Knowledge discovery: as many as needed to produce concepts and regularities adequate to the domain
23. Main Sources for Deriving Clusters
- Classical statistics: model of the world
- Machine learning: cost/accuracy trade-off
- Data mining: data
- Knowledge discovery: domain knowledge
24. Classical Statistics Perspective
- There must be a model of data generation, e.g., a mixture of Gaussians
- The task: identify all parameters of the model by using observed data
- E.g., the number of Gaussians and their probabilities, means and covariances
25. Mixture of 3 Gaussian densities
26. Classical statistics perspective on K-Means
- K-Means is but a maximum likelihood method with spherical Gaussians of the same variance:
  - within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z-scoring is a must then)
  - the issue of the number of clusters can then be approached with conventional approaches to hypothesis testing
27. Machine learning perspective
- Clusters should be of help in learning incrementally generated data
- The number should be specified by the trade-off between accuracy and cost
- A criterion should guarantee partitioning of the feature space with clearly separated high-density areas
- A method should be proven to be consistent with the criterion on the population
28. Machine learning on K-Means
- The number of clusters is to be specified according to prediction goals
- Pre-processing: no advice
- An incremental version of K-Means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (MacQueen 1967 is the major reference, though the method can be traced back a decade or two earlier)
29. Data mining perspective
30. Data recovery framework for data mining methods
- Types of data: similarity, temporal, entity-to-feature, co-occurrence
- Types of model: regression, principal components, clusters

Model: Data = Model_Derived_Data + Residual
Pythagoras: Data^2 = Model_Derived_Data^2 + Residual^2
The better the fit, the better the model.
31. K-Means as a data recovery method
32. Representing a partition
- Cluster k: centroid c_kv (v = feature), binary 1/0 membership z_ik (i = entity)
33. Basic equations (analogous to PCA)
- y = data entry, z = membership, c = cluster centroid, N = cardinality; i = entity, v = feature/category, k = cluster
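The equation itself did not survive the conversion; judging from the legend above and from the single-cluster version on slide 40, it is presumably the K-Means data recovery model

    y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv},    i = 1, ..., N,

with the residuals e_{iv} minimized in the least-squares sense.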
34. Meaning of Data Scatter
- The sum of contributions of features: the basis for feature pre-processing (dividing by range, not std)
- Proportional to the summary variance
35. Contribution of a feature F to a partition, Contrib(F)
- Proportional to:
  - the correlation ratio η² if F is quantitative
  - a contingency coefficient between the cluster partition and F, if F is nominal:
    - Pearson chi-square (Poisson normalised)
    - Goodman-Kruskal tau-b (range normalised)
36. Contribution of a quantitative feature to a partition
- Proportional to the correlation ratio η²
37. Contribution of a nominal feature to a partition
- Proportional to a contingency coefficient:
  - Pearson chi-square (Poisson normalised)
  - Goodman-Kruskal tau-b (range normalised)
38. Pythagorean decomposition of the data scatter for interpretation
39. Contribution-based description of clusters
- C. Dickens: FCon = 0
- M. Twain: LenD < 28
- L. Tolstoy: NumCh > 3 or Direct = 1
40. Principal Cluster Analysis (Anomalous Pattern) Method
- y_iv = c_v * z_i + e_iv, where z_i = 1 if i is in S and z_i = 0 if i is not in S
- With squared Euclidean distance, c_S must be anomalous, that is, interesting
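A sketch of the Anomalous Pattern idea in Python/NumPy follows; the details (starting from the entity farthest from the grand mean, and using the grand mean as the reference point in the minimum-distance rule) are my reading of the method and should be checked against the original.

    import numpy as np

    def anomalous_pattern(Y):
        """Sketch of Anomalous Pattern clustering: grow one cluster S around the
        entity farthest from the grand mean, keeping the points that are closer
        to the cluster centroid than to the grand mean (the reference point)."""
        Y = np.asarray(Y, dtype=float)
        reference = Y.mean(axis=0)                           # grand mean of the data
        start = np.linalg.norm(Y - reference, axis=1).argmax()
        centroid = Y[start].copy()
        members = None
        while True:
            d_centroid = np.sum((Y - centroid) ** 2, axis=1)     # squared Euclidean
            d_reference = np.sum((Y - reference) ** 2, axis=1)
            new_members = d_centroid < d_reference                # minimum-distance rule
            if members is not None and np.array_equal(new_members, members):
                break                                             # cluster S stabilized
            members = new_members
            centroid = Y[members].mean(axis=0)                    # update centroid of S
        return np.flatnonzero(members), centroid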
41. Initial setting with Anomalous Single Cluster for iK-Means
42. iK-Means with Anomalous Single Clusters
43. Anomalous clusters + K-Means
- After extracting 2 clusters (how can one know that 2 is right?)
- (Figure: final clusters)
44. Simulation study of 8 methods (joint work with Mark Chiang): number-of-clusters methods
- Variance based:
  - Hartigan (HK)
  - Calinski & Harabasz (CH)
  - Jump Statistic (JS)
- Structure based:
  - Silhouette Width (SW)
- Consensus based:
  - Consensus Distribution area (CD)
  - Consensus Distribution mean (DD)
- Sequential extraction of APs:
  - Least Squares (LS)
  - Least Moduli (LM)
45. Data generation for the experiment
- Gaussian mixture (6, 7, 9 clusters) with:
  - Cluster spatial size:
    - constant (spherical)
    - k-proportional
    - k²-proportional
  - Cluster spread (distance between centroids):

Spread | Spherical | k-proportional (PPCA model) | k²-proportional (PPCA model)
Large  | 2 (?)     | 10 (?)                      | 10 (?)
Small  | 0.2 (?)   | 0.5 (?)                     | 2 (?)
46. Evaluation of results: estimated clustering versus the generated one
- Number of clusters
- Distance between centroids
- Similarity between partitions
47. Distance between estimated centroids and those generated
- (Figure: generated centroids G1(p1), G2(p2), G3(p3) and estimated centroids e1(q1), ..., e5(q5); nearest-centroid assignment: g1--e2, g2--e4, g3--e5)
48. Distance between estimated centroids and those generated
- (Figure: the remaining estimated centroids assigned as well: g1--e2, e1; g2--e4, e3; g3--e5)
49. Distance between centroids: quadratic and city-block
- g1(p1) -- e1(q1), e2(q2); g2(p2) -- e3(q3), e4(q4); g3(p3) -- e5(q5)
- d1 = (q1*d(g1,e1) + q2*d(g1,e2)) / (q1 + q2)
- d2 = (q3*d(g2,e3) + q4*d(g2,e4)) / (q3 + q4)
- d3 = q5*d(g3,e5) / q5
50. Distance between centroids: quadratic and city-block
- 1. Assignment
- 2. Distancing
- 3. Averaging: p1*d1 + p2*d2 + p3*d3
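The three steps can be put together in a small Python function; the function name is mine, and the weights p (generated cluster proportions) and q (estimated cluster proportions) are passed in as assumed inputs.

    import numpy as np

    def centroid_recovery_distance(generated, p, estimated, q):
        """Slides 47-50 in code: (1) assign each estimated centroid to its nearest
        generated centroid, (2) average the distances per generated centroid with
        weights q, (3) average those values over generated centroids with weights p."""
        generated = np.asarray(generated, dtype=float)
        estimated = np.asarray(estimated, dtype=float)
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        # 1. Assignment: nearest generated centroid g_k for each estimated centroid e_i
        d = np.linalg.norm(estimated[:, None, :] - generated[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        total = 0.0
        for k in range(len(generated)):
            mask = assign == k
            if mask.any():
                # 2. Distancing: d_k = sum_i q_i d(g_k, e_i) / sum_i q_i
                d_k = np.sum(q[mask] * d[mask, k]) / np.sum(q[mask])
                # 3. Averaging: add p_k * d_k
                total += p[k] * d_k
        return total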
51. Similarity between partitions according to their confusion table
- Relative distance (Mirkin-Cherny 1970)
- Tchouprov coefficient (Cramer 1943)
- Adjusted Rand Index (Hubert-Arabie 1985)
- Average Overlap (Mirkin 2005)
52. Results (at 9 clusters, 1000 entities, 20 features generated)
- (Table: for each method HK, CH, JS, SW, CD, DD, LS, LM, the estimated number of clusters, the distance between centroids, and the Adjusted Rand Index, each under large and small spread; the numerical values appeared only in the slide graphic)
53. Knowledge discovery perspective on clustering
- Conforming to and enhancing domain knowledge
- Informal considerations so far
- Relevant items:
  - Decision trees
  - External validation
54. A case to generalise
- Entities with a similarity measure
- Clustering interpretation tool developed
- Clustering method using a similarity threshold leading to a number of clusters
- Domain knowledge leading to constraints on the similarity threshold
- Best-fitting interpretation provides for the best number of clusters
55. Entities with a similarity measure
- 740 Homologous Protein Families (HPFs) in 30 herpes-virus genomes
- Homology defined by a protein sequence fragment
- Sequence-neighbourhood-based similarity measure on HPFs
- (Figure: families F1, F2, F3)
56. Interpretation tool: mapping to an evolutionary tree over genomes
- (Figure: families F3, F2, F1 mapped onto the tree)
57. Algorithm ADDI-S (Mirkin, JoC 1987), a data approximation technique
- Criterion to maximize (contribution to the data scatter): the average within-cluster similarity c multiplied by the cluster's size |S|
- Algorithm ADDI-S:
  - Take S = {j} for an arbitrary j
  - Given S, find c = c(S) and the similarities b(i, S) to S for all entities i in and out of S
  - Check the differences b(i, S) - c/2. If they are consistent, change the state of a most contributing entity. Else, stop and output S.
- The resulting S has a tightness property.
- Related work: Holzinger (1941) B-coefficient; Arkadiev & Braverman (1964, 1967) Specter; Mirkin (1976, 1987) ADDI family; Ben-Dor, Shamir, Yakhini (1999) CAST
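A Python/NumPy sketch of one ADDI-S run follows; it implements the b(i,S) - c/2 rule in the common reading (flip the status of the most contributing entity whose membership disagrees with the sign of the difference, stop when all signs agree), and the handling of self-similarities is a simplification of mine.

    import numpy as np

    def addi_s(B, j, max_iter=1000):
        """Sketch of one ADDI-S run on similarity matrix B from seed j: grow or
        shrink cluster S by the rule on b(i,S) - c/2, where c is the average
        within-cluster similarity and b(i,S) the average similarity of entity i
        to the current members of S."""
        B = np.asarray(B, dtype=float)
        n = len(B)
        in_S = np.zeros(n, dtype=bool)
        in_S[j] = True
        for _ in range(max_iter):
            members = np.flatnonzero(in_S)
            b = B[:, members].mean(axis=1)            # avg similarity of each entity to S
            sub = B[np.ix_(members, members)]
            m = len(members)
            # average within-cluster similarity c(S) over off-diagonal pairs
            c = (sub.sum() - np.trace(sub)) / (m * (m - 1)) if m > 1 else 0.0
            gain = b - c / 2.0
            # membership is consistent when it agrees with the sign of gain
            inconsistent = np.where(in_S, gain < 0, gain > 0)
            inconsistent[j] = False                   # keep the seed inside, for simplicity
            if not inconsistent.any():
                break                                 # tightness reached: output S
            # flip the most contributing inconsistent entity
            i = np.flatnonzero(inconsistent)[np.abs(gain[inconsistent]).argmax()]
            in_S[i] = not in_S[i]
        return np.flatnonzero(in_S)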
58. Algorithm ADDI-S (Mirkin 1987), a data approximation technique
- Number of clusters: depends on the similarity shift threshold b
- b(i,j) <- b(i,j) - b
59. Domain knowledge: function is known at some HPFs
- 287 pairs of HPFs with known function, of which 86 are SYNONYMOUS (same function)
- (Figure: density of similarity for synonymous vs. non-synonymous pairs; two threshold values: 0.42, minimum error, and 0.67, no non-synonymous pairs)
60. Knowledge enhancing
- Analyzing the reconstructed contents of 3 family ancestors and HUCA (the root)
- Analyzing differences between the b = 0.42 and b = 0.67 cluster reconstructions
- Analyzing gene arrangement within genomes
- Glycoprotein L's HPFs are sequence-dissimilar, but they are always followed in genomes by a glycolase that is mapped to HUCA (Glyc L -> Glycolase)
- Therefore glycoprotein L must be in HUCA too
61. Final HPFs and APFs
- HPFs with a sequence-based similarity measure
- Interpretation: parsimonious histories
- Clustering: ADDI-S using a similarity threshold leading to a number of clusters
- Domain knowledge: 86 pairs should be in the same clusters and 201 in different clusters, leading to 2 suggested similarity thresholds
- Best fitting: 102 APFs (aggregating 249 HPFs) and 491 singleton HPFs
62. Whole HPF aggregation method's structure (joint work with R. Camargo, T. Fenner, P. Kellam, G. Loizou)
63. Conclusion I: Number of clusters?
- Engineering perspective: defined by cost/effect
- Classical statistics perspective: can and should be determined from the data with a model
- Machine learning perspective: can be specified according to the prediction accuracy to achieve
- Data mining perspective: not to be pre-specified; only those clusterings are of interest that bear interesting patterns
- Knowledge discovery perspective: not to be pre-specified; those clusterings are best that most enhance knowledge
64. Conclusion II: Each other data analysis concept
- Classical statistics perspective: can be determined from the data with a model
- Machine learning perspective: prediction accuracy to achieve
- Data mining perspective: data approximation
- Knowledge discovery perspective: knowledge enhancing
65. Variance based methods
- Hartigan (HK):
  - calculate HT = (W_k / W_{k+1} - 1)(N - k - 1), where N is the number of entities
  - find the k at which HT falls below a threshold of 10
- Calinski and Harabasz (CH):
  - calculate CH = ((T - W_k)/(k - 1)) / (W_k/(N - k)), where T is the data scatter
  - find the k which maximizes CH
- W_k is, for a given K, the smallest within-cluster summary distance to centroids among those found at different K-Means initializations
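A small sketch of the two rules, assuming the sequence W = (W_1, ..., W_Kmax) of best within-cluster summary distances and the data scatter T have already been computed:

    import numpy as np

    def hartigan_index(W, N):
        """HT(k) = (W_k / W_{k+1} - 1) * (N - k - 1); pick the first k where HT < 10."""
        W = np.asarray(W, dtype=float)      # W[k-1] = best within-cluster sum for k clusters
        ks = np.arange(1, len(W))           # k ranges over 1..len(W)-1 (needs W_{k+1})
        HT = (W[:-1] / W[1:] - 1.0) * (N - ks - 1)
        below = np.flatnonzero(HT < 10)
        return int(ks[below[0]]) if below.size else None

    def calinski_harabasz(W, N, T):
        """CH(k) = ((T - W_k)/(k - 1)) / (W_k/(N - k)); pick the k maximizing CH."""
        W = np.asarray(W, dtype=float)
        ks = np.arange(2, len(W) + 1)       # CH is undefined at k = 1
        CH = ((T - W[1:]) / (ks - 1)) / (W[1:] / (N - ks))
        return int(ks[CH.argmax()])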
66. Variance based methods
- Jump Statistic (JS):
  - for each entity i, a clustering S = {S_1, S_2, ..., S_K} and centroids C = {C_1, C_2, ..., C_K}
  - calculate d(i, S_k) = (y_i - C_k)' G^{-1} (y_i - C_k) and d_K = (sum_i d(i, S_k)) / (P N), where P is the number of features, N is the number of rows and G is the covariance matrix of y
  - select a transformation power, typically P/2
  - calculate the jumps JS_K = d_K^{-P/2} - d_{K-1}^{-P/2}
  - find the K which maximizes JS_K
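A simplified Python sketch of the Jump Statistic, taking the covariance matrix G as the identity and assuming the K-Means results are supplied in order for K = 1, 2, ..., Kmax; the jump formula d_K^{-P/2} - d_{K-1}^{-P/2} is the standard Sugar-James definition, reconstructed here because the slide's formula was garbled.

    import numpy as np

    def jump_statistic(Y, centroids_per_k, labels_per_k):
        """Average distortion d_K per K, transformed with power -P/2; return the K
        with the largest jump d_K^{-P/2} - d_{K-1}^{-P/2} (G taken as identity)."""
        Y = np.asarray(Y, dtype=float)
        N, P = Y.shape
        d = []
        for C, labels in zip(centroids_per_k, labels_per_k):
            diffs = Y - np.asarray(C)[labels]             # y_i - C_{k(i)}
            d.append(np.sum(diffs * diffs) / (P * N))     # distortion d_K
        d = np.array(d)
        transformed = d ** (-P / 2.0)                     # transformation power P/2
        jumps = np.diff(np.concatenate(([0.0], transformed)))   # d_0^{-P/2} := 0
        return int(jumps.argmax() + 1)                    # index 0 corresponds to K = 1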
67. Structure based methods
- Silhouette Width (SW):
  - for each entity i, a(i) = average dissimilarity between i and all other entities of the cluster to which i belongs
  - for each other cluster S_k, d(i, S_k) = average dissimilarity between i and all entities of S_k
  - b(i) = min over the other clusters S_k of d(i, S_k)
  - s(i) = (b(i) - a(i)) / max(a(i), b(i))
  - calculate the average silhouette, s = (sum_i s(i)) / N
  - find the K maximizing the average s
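A direct Python transcription of this computation from a precomputed dissimilarity matrix D and an integer label vector; for entity-to-feature data the same average is available via sklearn.metrics.silhouette_score.

    import numpy as np

    def average_silhouette_width(D, labels):
        """Average silhouette width: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
        averaged over all entities, from dissimilarity matrix D and labels."""
        D = np.asarray(D, dtype=float)
        labels = np.asarray(labels)
        s = []
        for i in range(len(D)):
            own = (labels == labels[i])
            own[i] = False
            if not own.any():
                s.append(0.0)                     # convention: singletons get s(i) = 0
                continue
            a = D[i, own].mean()                  # a(i): avg dissimilarity within own cluster
            b = min(D[i, labels == k].mean()      # b(i): closest other cluster on average
                    for k in set(labels.tolist()) if k != labels[i])
            s.append((b - a) / max(a, b))
        return float(np.mean(s))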
68. Consensus based methods
- Consensus Distribution area (CD):
  - for each of the different K-Means initializations, find the connectivity matrix
  - calculate the consensus matrix
  - calculate its cumulative distribution function (CDF)
  - calculate the area under the CDF, A(K)
  - calculate Δ(K+1)
  - find the K which maximizes Δ(K)
69. Consensus based methods
- Consensus Distribution mean (DD):
  - μ_K is the mean and s_K^2 the variance of the consensus matrix
  - avdis(K) = μ_K (1 - μ_K) - s_K^2
  - davdis(K) = (avdis(K) - avdis(K+1)) / avdis(K+1)
  - find the K which maximizes davdis(K)
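A sketch of these consensus statistics in Python, assuming the consensus matrix M for each K (the share of K-Means runs in which a pair of entities falls in the same cluster) has already been accumulated, and that avdis is available for consecutive values of K:

    import numpy as np

    def consensus_statistics(M):
        """Return the mean mu_K, the variance s_K^2, and
        avdis(K) = mu_K * (1 - mu_K) - s_K^2 for consensus matrix M (slide 69)."""
        M = np.asarray(M, dtype=float)
        iu = np.triu_indices_from(M, k=1)          # off-diagonal upper-triangle entries
        vals = M[iu]
        mu, var = vals.mean(), vals.var()
        return mu, var, mu * (1.0 - mu) - var

    def dd_rule(avdis_by_k):
        """DD rule: davdis(K) = (avdis(K) - avdis(K+1)) / avdis(K+1); return the K
        maximizing davdis. avdis_by_k maps consecutive K values to avdis(K)."""
        ks = sorted(avdis_by_k)[:-1]               # avdis(K+1) must exist
        dav = {K: (avdis_by_k[K] - avdis_by_k[K + 1]) / avdis_by_k[K + 1] for K in ks}
        return max(dav, key=dav.get)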
70. Sequential cluster extraction
- Intelligent K-Means:
  - Anomalous Patterns (initial clusters)
  - Removal of singletons
  - K-Means
- Euclidean distance / within-cluster mean -> Least Squares Criterion (LS)
- Manhattan distance / within-cluster median -> Least Moduli Criterion (LM)