Title: Clustering: Tackling Challenges with Data Recovery Approach
1. Clustering: Tackling Challenges with Data Recovery Approach
- B. Mirkin
- School of Computer Science
- Birkbeck, University of London
- Advert: Special Issue of The Computer Journal, "Profiling Expertise and Behaviour". Deadline: 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt
2. Outline
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
- WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
- DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means and for Ward; Extensions to Other Data Types; One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
- GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
3. What is clustering?
- Finding homogeneous fragments, mostly sets of entities, in data for further analysis
4. Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
- Pluto doesn't fit in the two clusters of planets
5. Example: a few clusters
- Clustering interface to Web search engines (Grouper)
- Query: Israel (after O. Zamir and O. Etzioni, 2001)

  Cluster   Sites   Interpretation
  1         24      Society, religion: "Israel and Judaism", "Judaica collection"
  2         12      Middle East, war, history: "The state of Israel", "Arabs and Palestinians"
  3         31      Economy, travel: "Israel Hotel Association", "Electronics in Israel"
6. Clustering algorithms
- Nearest neighbour
- Ward
- Conceptual clustering
- K-means
- Kohonen SOM
- Etc.
7. K-Means: a generic clustering method
- Entities are represented as multidimensional points
- 0. Put K hypothetical centroids (seeds)
- 1. Assign points to the centroids according to the minimum-distance rule
- 2. Put centroids at the gravity centres of the clusters thus obtained
- 3. Iterate steps 1 and 2 until convergence
- 4. Output the final centroids and clusters
(Figure: K = 3 hypothetical centroids, marked @)
8-10. (Animation frames repeating the K-Means steps above, showing the assignments and centroid updates over the iterations.)
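A minimal sketch of these steps in Python with NumPy (my illustration, not code from the talk):

```python
import numpy as np

def k_means(Y, seeds, max_iter=100):
    """Generic K-Means: Y is an N x V entity-to-feature matrix,
    seeds a K x V array of initial centroids."""
    centroids = seeds.astype(float)
    for _ in range(max_iter):
        # 1. Minimum-distance rule: each point goes to its nearest centroid
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Move each centroid to the gravity centre of its cluster
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(len(centroids))
        ])
        # 3. Iterate steps 1 and 2 until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output the final centroids and clusters
    return centroids, labels
```

The minimum-distance rule and the gravity-centre update are the two alternating minimisations of the K-Means square-error criterion.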
11. Advantages of K-Means
- Models typology building
- Computationally efficient
- Can be used incrementally, on-line
Shortcomings of K-Means
- Instability of results
- Convex cluster shapes only
12. Initial Centroids: Correct (two-cluster case)
13. Initial Centroids: Correct (figure: initial and final centroid positions)
14. Different Initial Centroids
15. Different Initial Centroids: Wrong (figure: initial and final centroid positions)
16. Clustering issues
K-Means gives no advice on:
- Number of clusters
- Initial setting
- Data normalisation
- Mixed variable scales
- Multiple data sets
K-Means gives limited advice on:
- Interpretation of results
17. Data recovery for data mining (discovery of patterns in data)
- Type of data:
  - Similarity
  - Temporal
  - Entity-to-feature
  - Co-occurrence
- Type of model:
  - Regression
  - Principal components
  - Clusters
- Model: Data = Model_Derived_Data + Residual
- Pythagoras: Data² = Model_Derived_Data² + Residual²
- The better the fit, the better the model
18. The Pythagorean decomposition in the data recovery approach provides for:
- Data scatter as a unique data characteristic (a perspective on data normalisation)
- Additive contributions of entities or features to clusters (a perspective for interpretation)
- Feature contributions as correlation/association measures affected by scaling (mixed-scale data become treatable)
- Clusters that can be extracted one by one (a data mining perspective: incomplete clustering, number of clusters)
- Multiple data sources that can be approximated as well as single ones (not covered today)
19. Example: a mixed-scale data table (figure)
20. Conventional quantitative coding and data standardisation (figure)
21. Standardisation of features
- y_ik = (x_ik − A_k) / B_k
- X: original data
- Y: standardised data
- i: entities
- k: features
- A_k: shift of the origin, typically the average
- B_k: rescaling factor, traditionally the standard deviation, but the range may be better in clustering
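A sketch of this standardisation in Python (an illustration; the `scale` switch between range and std is my naming):

```python
import numpy as np

def standardise(X, scale="range"):
    """y_ik = (x_ik - A_k) / B_k with A_k the feature average and
    B_k either the range (recommended for clustering) or the std."""
    A = X.mean(axis=0)                       # shift of the origin
    if scale == "range":
        B = X.max(axis=0) - X.min(axis=0)    # rescaling by the range
    else:
        B = X.std(axis=0)                    # conventional z-scoring
    return (X - A) / B
```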
22. No standardisation (figure; Tom Sawyer marked)
23. Z-scoring (scaling by std) (figure; Tom Sawyer marked)
24. Standardising by range (figure; Tom Sawyer marked)
25. K-Means as a data recovery method
26. Representing a partition
- Cluster k: centroid c_kv (v: feature)
- Binary 1/0 membership z_ik (i: entity)
27. Basic equations (analogous to PCA, with score vectors z_k constrained to be binary)
- y_iv = Σ_k c_kv z_ik + e_iv
- y: data entry; z: membership, not score; c: cluster centroid; N: cardinality
- i: entity; v: feature/category; k: cluster
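Since z_ik is binary, the model-derived data matrix simply repeats the centroid of each entity's cluster, and the Pythagorean split of the data scatter can be checked directly. A sketch (my illustration; it assumes the centroids are the within-cluster means, which is what makes the cross term vanish):

```python
import numpy as np

def scatter_decomposition(Y, labels, centroids):
    """Split the data scatter T(Y) into the part recovered by the model
    y_iv = sum_k c_kv z_ik and the residual (the K-Means criterion).
    Holds when centroids are the within-cluster means."""
    T = (Y ** 2).sum()                    # data scatter
    recovered = centroids[labels]         # model-derived data
    explained = (recovered ** 2).sum()    # sum_k N_k * ||c_k||^2
    residual = ((Y - recovered) ** 2).sum()
    assert np.isclose(T, explained + residual)
    return T, explained, residual
```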
28. Meaning of the data scatter
- The sum of contributions of features: the basis for feature pre-processing (dividing by range rather than by std)
- Proportional to the summary variance
29. Contribution of a feature F to a partition, Contrib(F)
- Proportional to:
- the correlation ratio η², if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal:
- Pearson chi-square (Poisson-normalised)
- Goodman-Kruskal tau-b (range-normalised)
30. Contribution of a quantitative feature to a partition
- Proportional to the correlation ratio η²
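The correlation ratio is straightforward to compute as the between-cluster share of the feature's scatter; a sketch (my illustration):

```python
import numpy as np

def correlation_ratio(feature, labels):
    """eta^2: the share of a quantitative feature's scatter explained
    by the between-cluster (centroid) differences."""
    grand_mean = feature.mean()
    total = ((feature - grand_mean) ** 2).sum()
    between = sum(
        (labels == k).sum() * (feature[labels == k].mean() - grand_mean) ** 2
        for k in np.unique(labels)
    )
    return between / total
```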
31. Contribution of a nominal feature to a partition
- Proportional to a contingency coefficient:
- Pearson chi-square (Poisson-normalised)
- Goodman-Kruskal tau-b (range-normalised)
- (formula figure, with B_j = 1)
32. Pythagorean decomposition of the data scatter for interpretation (figure)
33. Contribution-based description of clusters
- C. Dickens: FCon = 0
- M. Twain: LenD < 28
- L. Tolstoy: NumCh > 3 or Direct = 1
34. PCA-based Anomalous Pattern clustering
- y_iv = c_v z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S
- With squared Euclidean distance, c_S must be anomalous, that is, interesting
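A sketch of one Anomalous Pattern extraction in Python (my reading of the model above; it assumes the data have been centred, so the reference point is the origin):

```python
import numpy as np

def anomalous_pattern(Y):
    """One Anomalous Pattern cluster versus the reference point 0
    (the grand mean, if the data have been centred)."""
    ref = np.zeros(Y.shape[1])
    # Start from the entity farthest from the reference point
    c = Y[((Y - ref) ** 2).sum(axis=1).argmax()].copy()
    while True:
        # Assign each entity to the tentative centroid c or to the reference
        in_S = ((Y - c) ** 2).sum(axis=1) < ((Y - ref) ** 2).sum(axis=1)
        new_c = Y[in_S].mean(axis=0)
        if np.allclose(new_c, c):
            return in_S, c
        c = new_c
```

Iterating this, setting aside each found cluster's entities, yields the Anomalous Pattern clusters that iK-Means uses for both the number of clusters and the initial seeds.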
35. Initial setting with Anomalous Pattern clusters (figure)
36. Anomalous Pattern clusters: iterate (figure; reference point 0)
37. iK-Means: Anomalous clusters + K-Means
- After extracting 2 clusters (how can one know that 2 is right?) (figure: final clustering)
38. Example of iK-Means: media-mirrored Russian corruption (55 cases), with M. Levin and E. Bakaleinik
- Features:
- Corrupt office (1)
- Client (1)
- Rendered service (6)
- Mechanism of corruption (2)
- Environment (1)
39. A schema for bribery (diagram: Environment; Interaction; Office; Client; Service)
40. Data standardisation
- Categories coded as one/zero variables
- Subtracting the average
- All features: normalising by range
- Categories: sometimes also by the number of them
41. iK-Means: initial setting with iterative Anomalous Pattern clustering
- 13 clusters found with AP clustering, of which 8 do not fit (4 singletons, 4 doublets)
- 5 clusters remain to take initial seeds from
- Cluster elements are taken as seeds
42. Interpretation II: Patterning (Interpretation I: Representatives; Interpretation III: Conceptual description)
- Patterns in centroid values of salient features
- Salience of feature v at cluster k: (grand mean − within-cluster mean)²
43. Interpretation II + III
- Cluster 1 (7 cases):
- Other branch (877)
- Improper categorisation (439)
- Level of client (242)
- Cluster 2 (19 cases):
- Obstruction of justice (467)
- Law enforcement (379)
- Occasional (251)
Conceptual descriptions:
- Cluster 1: Branch = Other
- Cluster 2: Branch = Law Enforc.; Service: no Cover-Up; Client Level ? Organisation
44. Interpretation II (pattern) + III (APPCOD)
- Cluster 3 (10 cases):
- Extortion (474)
- Organisation (289)
- Government (275)
Conceptual description: 0 < Extort − Obstruct < 1; 2 < Extort + Bribe < 3; no Inspection; no Protection
NO ERRORS
45. Overall description: it is Branch that matters
- Government:
- Extortion for free services (Cluster 3)
- Protection (Cluster 4)
- Law enforcement:
- Obstruction of justice (Cluster 2)
- Cover-up (Cluster 5)
- Other:
- Category change (Cluster 1)
Is this knowledge enhancement?
46. Data recovery clustering of similarities
- Example: similarities between algebraic functions in an experimental method for knowledge evaluation

        lnx   x²   x³   x½   x¼
  lnx    -    1    1   2.5  2.5
  x²     1    -    6   2.5  2.5
  x³     1    6    -    3    3
  x½    2.5  2.5   3    -    4
  x¼    2.5  2.5   3    4    -

- Similarities between the algebraic functions scored by a 6th-grade student on a scale of 1 to 7
47. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 0: all are functions: lnx, x², x³, x½, x¼; intensity 1 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    1    1    1    1
  x²     1    -    1    1    1
  x³     1    6    -    1    1
  x½    2.5  2.5   3    -    1
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
48. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 1: power functions: x², x³, x½, x¼; intensity 2 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    0    0
  x²     1    -    2    2    2
  x³     1    6    -    2    2
  x½    2.5  2.5   3    -    2
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
49. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 2: sub-linear functions: lnx, x½, x¼; intensity 1 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    1    1
  x²     1    -    0    0    0
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    1
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
50. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 3: fast-growing functions: x², x³; intensity 3 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    0    0
  x²     1    -    3    0    0
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    0
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
51. Additive clustering
- Similarities are the sum of intensities of clusters
- Residuals are relatively small (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    .5   .5
  x²     1    -    0   -.5  -.5
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    0
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
52. Data recovery: additive clustering
- Observed similarity matrix:
- B = A_g + A_1 + A_2 + A_3 + E
- Problem: given B, find the A's to minimise E, the difference between B and the summary of the A's:
- ‖B − (A_g + A_1 + A_2 + A_3)‖² → min over the A's
53. Doubly greedy strategy
- OUTER LOOP: one cluster at a time
- Find real c and binary z to minimise L²(B, c, z)
- Take cluster S = {i : z_i = 1}
- Update B ← B − c z zᵀ
- Reiterate
- After m iterations: clusters S_k, cardinalities N_k = |S_k|, intensities c_k
- T(B) = c₁² N₁² + … + c_m² N_m² + L²   (*)
54. Inner loop: finding a cluster
- Maximise the contribution to (*): max (c N_S)²
- Property: the average similarity b(i, S) of i to S is > c/2 if i ∈ S and < c/2 if i ∉ S
- Algorithm ADDI-S:
- Take S = {i} for an arbitrary i
- Given S, find c = c(S) and b(i, S) for all i
- If b(i, S) − c/2 > 0 for some i ∉ S, or < 0 for some i ∈ S, change the state of i; else stop and output S
- The resulting S satisfies the property
- History: Holzinger (1941) B-coefficient; Arkadiev and Braverman (1964, 1967) Specter; Mirkin (1976, 1987) ADDI-S; Ben-Dor, Shamir, Yakhini (1999) CAST
55. DRA on mixed variable scales and normalisation
- Feature normalisation: any measure clear of the distribution, e.g., the range
- Nominal scales: binary categories normalised so that the total feature contribution comes out right, e.g., by the square root of the number of categories
56. DRA on interpretation
- Cluster centroids are supplemented with contributions of feature/cluster pairs or entity/cluster pairs
- K-Means: what is representative?
- Distance: Min (conventional)
- Inner product: Max (data recovery)
57. DRA on incomplete clustering
- With the model assigning un-clustered entities to the norm (e.g., the gravity centre): Anomalous Pattern clustering (iterated)
58. DRA on the number of clusters
- iK-Means (under the assumption that every cluster, in sequence, contributes more than the next one: a planetary model)
- Otherwise, the issue is rather bleak
59. Failure of statistically sound criteria
- Ming-Tso Chiang (2006): 100 entities in 6D, 4 clusters, between-cluster distances 50 times greater than within-cluster distances
- Hartigan's F coefficient and the Jump statistic fail
60. Conclusion
- The data recovery approach should be the major mathematical underpinning for data mining as a framework for finding patterns in data