Title: Clustering: Tackling Challenges with Data Recovery Approach
1. Clustering: Tackling Challenges with Data Recovery Approach
- B. Mirkin
- School of Computer Science
- Birkbeck, University of London
- Advert: Special Issue of The Computer Journal, "Profiling Expertise and Behaviour". Deadline: 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt
2. Outline
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
- WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
- DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means and for Ward; Extensions to Other Data Types; One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
- GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
3. What is clustering?
- Finding homogeneous fragments, mostly sets of entities, in data for further analysis
4. Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
- Pluto doesn't fit in the two clusters of planets
5. Example: a few clusters
- Clustering interface to Web search engines (Grouper)
- Query: Israel (after O. Zamir and O. Etzioni, 2001)

  Cluster   Sites   Interpretation
  1         24      Society, religion: "Israel and Judaism", "Judaica collection"
  2         12      Middle East, war, history: "The state of Israel", "Arabs and Palestinians"
  3         31      Economy, travel: "Israel Hotel Association", "Electronics in Israel"
6. Clustering algorithms
- Nearest neighbour
- Ward
- Conceptual clustering
- K-means
- Kohonen SOM
- Etc.
7. K-Means: a generic clustering method
- Entities are represented as multidimensional points
- 0. Put K hypothetical centroids (seeds)
- 1. Assign points to the centroids according to the minimum-distance rule
- 2. Put centroids at the gravity centres of the clusters thus obtained
- 3. Iterate steps 1 and 2 until convergence
- 4. Output the final centroids and clusters
(Figure: K = 3 hypothetical centroids, marked @)
8-10. (Animation frames repeating the K-Means steps above, showing the assignments and centroid updates over the iterations.)
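A minimal sketch of these steps in Python with NumPy (my illustration, not code from the talk):

```python
import numpy as np

def k_means(Y, seeds, max_iter=100):
    """Generic K-Means: Y is an N x V entity-to-feature matrix,
    seeds a K x V array of initial centroids."""
    centroids = seeds.astype(float)
    for _ in range(max_iter):
        # 1. Minimum-distance rule: each point goes to its nearest centroid
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Move each centroid to the gravity centre of its cluster
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(len(centroids))
        ])
        # 3. Iterate steps 1 and 2 until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output the final centroids and clusters
    return centroids, labels
```

The minimum-distance rule and the gravity-centre update are the two alternating minimisations of the K-Means square-error criterion.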
11. Advantages of K-Means
- Models typology building
- Computationally efficient
- Can be used incrementally, on-line
Shortcomings of K-Means
- Instability of results
- Convex cluster shapes only
12. Initial Centroids: Correct (two-cluster case)
13. Initial Centroids: Correct (figure: initial and final centroid positions)
14. Different Initial Centroids
15. Different Initial Centroids: Wrong (figure: initial and final centroid positions)
16. Clustering issues
K-Means gives no advice on:
- Number of clusters
- Initial setting
- Data normalisation
- Mixed variable scales
- Multiple data sets
K-Means gives limited advice on:
- Interpretation of results
17. Data recovery for data mining (discovery of patterns in data)
- Type of data:
  - Similarity
  - Temporal
  - Entity-to-feature
  - Co-occurrence
- Type of model:
  - Regression
  - Principal components
  - Clusters
- Model: Data = Model_Derived_Data + Residual
- Pythagoras: Data² = Model_Derived_Data² + Residual²
- The better the fit, the better the model
18. The Pythagorean decomposition in the data recovery approach provides for:
- Data scatter as a unique data characteristic (a perspective on data normalisation)
- Additive contributions of entities or features to clusters (a perspective for interpretation)
- Feature contributions as correlation/association measures affected by scaling (mixed-scale data become treatable)
- Clusters that can be extracted one by one (a data mining perspective: incomplete clustering, number of clusters)
- Multiple data sources that can be approximated as well as single ones (not covered today)
19. Example: a mixed-scale data table (figure)
20. Conventional quantitative coding and data standardisation (figure)
21. Standardisation of features
- y_ik = (x_ik − A_k) / B_k
- X: original data
- Y: standardised data
- i: entities
- k: features
- A_k: shift of the origin, typically the average
- B_k: rescaling factor, traditionally the standard deviation, but the range may be better in clustering
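A sketch of this standardisation in Python (an illustration; the `scale` switch between range and std is my naming):

```python
import numpy as np

def standardise(X, scale="range"):
    """y_ik = (x_ik - A_k) / B_k with A_k the feature average and
    B_k either the range (recommended for clustering) or the std."""
    A = X.mean(axis=0)                       # shift of the origin
    if scale == "range":
        B = X.max(axis=0) - X.min(axis=0)    # rescaling by the range
    else:
        B = X.std(axis=0)                    # conventional z-scoring
    return (X - A) / B
```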
22. No standardisation (figure; Tom Sawyer marked)
23. Z-scoring (scaling by std) (figure; Tom Sawyer marked)
24. Standardising by range (figure; Tom Sawyer marked)
25. K-Means as a data recovery method
26. Representing a partition
- Cluster k: centroid c_kv (v: feature)
- Binary 1/0 membership z_ik (i: entity)
27. Basic equations (analogous to PCA, with score vectors z_k constrained to be binary)
- y_iv = Σ_k c_kv z_ik + e_iv
- y: data entry; z: membership, not score; c: cluster centroid; N: cardinality
- i: entity; v: feature/category; k: cluster
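Since z_ik is binary, the model-derived data matrix simply repeats the centroid of each entity's cluster, and the Pythagorean split of the data scatter can be checked directly. A sketch (my illustration; it assumes the centroids are the within-cluster means, which is what makes the cross term vanish):

```python
import numpy as np

def scatter_decomposition(Y, labels, centroids):
    """Split the data scatter T(Y) into the part recovered by the model
    y_iv = sum_k c_kv z_ik and the residual (the K-Means criterion).
    Holds when centroids are the within-cluster means."""
    T = (Y ** 2).sum()                    # data scatter
    recovered = centroids[labels]         # model-derived data
    explained = (recovered ** 2).sum()    # sum_k N_k * ||c_k||^2
    residual = ((Y - recovered) ** 2).sum()
    assert np.isclose(T, explained + residual)
    return T, explained, residual
```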
28. Meaning of the data scatter
- The sum of contributions of features: the basis for feature pre-processing (dividing by range rather than by std)
- Proportional to the summary variance
29. Contribution of a feature F to a partition, Contrib(F)
- Proportional to:
- the correlation ratio η², if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal:
- Pearson chi-square (Poisson-normalised)
- Goodman-Kruskal tau-b (range-normalised)
30. Contribution of a quantitative feature to a partition
- Proportional to the correlation ratio η²
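The correlation ratio is straightforward to compute as the between-cluster share of the feature's scatter; a sketch (my illustration):

```python
import numpy as np

def correlation_ratio(feature, labels):
    """eta^2: the share of a quantitative feature's scatter explained
    by the between-cluster (centroid) differences."""
    grand_mean = feature.mean()
    total = ((feature - grand_mean) ** 2).sum()
    between = sum(
        (labels == k).sum() * (feature[labels == k].mean() - grand_mean) ** 2
        for k in np.unique(labels)
    )
    return between / total
```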
31. Contribution of a nominal feature to a partition
- Proportional to a contingency coefficient:
- Pearson chi-square (Poisson-normalised)
- Goodman-Kruskal tau-b (range-normalised)
- (formula figure, with B_j = 1)
32. Pythagorean decomposition of the data scatter for interpretation (figure)
33. Contribution-based description of clusters
- C. Dickens: FCon = 0
- M. Twain: LenD < 28
- L. Tolstoy: NumCh > 3 or Direct = 1
34. PCA-based Anomalous Pattern clustering
- y_iv = c_v z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S
- With squared Euclidean distance, c_S must be anomalous, that is, interesting
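A sketch of one Anomalous Pattern extraction in Python (my reading of the model above; it assumes the data have been centred, so the reference point is the origin):

```python
import numpy as np

def anomalous_pattern(Y):
    """One Anomalous Pattern cluster versus the reference point 0
    (the grand mean, if the data have been centred)."""
    ref = np.zeros(Y.shape[1])
    # Start from the entity farthest from the reference point
    c = Y[((Y - ref) ** 2).sum(axis=1).argmax()].copy()
    while True:
        # Assign each entity to the tentative centroid c or to the reference
        in_S = ((Y - c) ** 2).sum(axis=1) < ((Y - ref) ** 2).sum(axis=1)
        new_c = Y[in_S].mean(axis=0)
        if np.allclose(new_c, c):
            return in_S, c
        c = new_c
```

Iterating this, setting aside each found cluster's entities, yields the Anomalous Pattern clusters that iK-Means uses for both the number of clusters and the initial seeds.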
35. Initial setting with Anomalous Pattern clusters (figure)
36. Anomalous Pattern clusters: iterate (figure; reference point 0)
37. iK-Means: Anomalous clusters + K-Means
- After extracting 2 clusters (how can one know that 2 is right?) (figure: final clustering)
38. Example of iK-Means: media-mirrored Russian corruption (55 cases), with M. Levin and E. Bakaleinik
- Features:
- Corrupt office (1)
- Client (1)
- Rendered service (6)
- Mechanism of corruption (2)
- Environment (1)
39. A schema for bribery (diagram: Environment; Interaction; Office; Client; Service)
40. Data standardisation
- Categories coded as one/zero variables
- Subtracting the average
- All features: normalising by range
- Categories: sometimes also by the number of them
41. iK-Means: initial setting with iterative Anomalous Pattern clustering
- 13 clusters found with AP clustering, of which 8 do not fit (4 singletons, 4 doublets)
- 5 clusters remain to take initial seeds from
- Cluster elements are taken as seeds
42. Interpretation II: Patterning (Interpretation I: Representatives; Interpretation III: Conceptual description)
- Patterns in centroid values of salient features
- Salience of feature v at cluster k: (grand mean − within-cluster mean)²
43. Interpretation II + III
- Cluster 1 (7 cases):
- Other branch (877)
- Improper categorisation (439)
- Level of client (242)
- Cluster 2 (19 cases):
- Obstruction of justice (467)
- Law enforcement (379)
- Occasional (251)
Conceptual descriptions:
- Cluster 1: Branch = Other
- Cluster 2: Branch = Law Enforc.; Service: no Cover-Up; Client Level ? Organisation
44. Interpretation II (pattern) + III (APPCOD)
- Cluster 3 (10 cases):
- Extortion (474)
- Organisation (289)
- Government (275)
Conceptual description: 0 < Extort − Obstruct < 1; 2 < Extort + Bribe < 3; no Inspection; no Protection
NO ERRORS
45. Overall description: it is Branch that matters
- Government:
- Extortion for free services (Cluster 3)
- Protection (Cluster 4)
- Law enforcement:
- Obstruction of justice (Cluster 2)
- Cover-up (Cluster 5)
- Other:
- Category change (Cluster 1)
Is this knowledge enhancement?
46. Data recovery clustering of similarities
- Example: similarities between algebraic functions in an experimental method for knowledge evaluation

        lnx   x²   x³   x½   x¼
  lnx    -    1    1   2.5  2.5
  x²     1    -    6   2.5  2.5
  x³     1    6    -    3    3
  x½    2.5  2.5   3    -    4
  x¼    2.5  2.5   3    4    -

- Similarities between the algebraic functions scored by a 6th-grade student on a scale of 1 to 7
47. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 0: all are functions: lnx, x², x³, x½, x¼; intensity 1 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    1    1    1    1
  x²     1    -    1    1    1
  x³     1    6    -    1    1
  x½    2.5  2.5   3    -    1
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
48. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 1: power functions: x², x³, x½, x¼; intensity 2 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    0    0
  x²     1    -    2    2    2
  x³     1    6    -    2    2
  x½    2.5  2.5   3    -    2
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
49. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 2: sub-linear functions: lnx, x½, x¼; intensity 1 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    1    1
  x²     1    -    0    0    0
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    1
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
50. Additive clustering
- Similarities are the sum of intensities of clusters
- Cl. 3: fast-growing functions: x², x³; intensity 3 (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    0    0
  x²     1    -    3    0    0
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    0
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
51. Additive clustering
- Similarities are the sum of intensities of clusters
- Residuals are relatively small (upper sub-matrix)

        lnx   x²   x³   x½   x¼
  lnx    -    0    0    .5   .5
  x²     1    -    0   -.5  -.5
  x³     1    6    -    0    0
  x½    2.5  2.5   3    -    0
  x¼    2.5  2.5   3    4    -

- Lower sub-matrix: the student's similarity scores (scale 1 to 7)
52. Data recovery: additive clustering
- Observed similarity matrix:
- B = A_g + A_1 + A_2 + A_3 + E
- Problem: given B, find the A's to minimise E, the difference between B and the summary of the A's:
- ‖B − (A_g + A_1 + A_2 + A_3)‖² → min over the A's
53. Doubly greedy strategy
- OUTER LOOP: one cluster at a time
- Find real c and binary z to minimise L²(B, c, z)
- Take cluster S = {i : z_i = 1}
- Update B ← B − c z zᵀ
- Reiterate
- After m iterations: clusters S_k, cardinalities N_k = |S_k|, intensities c_k
- T(B) = c₁² N₁² + … + c_m² N_m² + L²   (*)
54. Inner loop: finding a cluster
- Maximise the contribution to (*): max (c N_S)²
- Property: the average similarity b(i, S) of i to S is > c/2 if i ∈ S and < c/2 if i ∉ S
- Algorithm ADDI-S:
- Take S = {i} for an arbitrary i
- Given S, find c = c(S) and b(i, S) for all i
- If b(i, S) − c/2 > 0 for some i ∉ S, or < 0 for some i ∈ S, change the state of i; else stop and output S
- The resulting S satisfies the property
- History: Holzinger (1941) B-coefficient; Arkadiev and Braverman (1964, 1967) Specter; Mirkin (1976, 1987) ADDI-S; Ben-Dor, Shamir, Yakhini (1999) CAST
55. DRA on mixed variable scales and normalisation
- Feature normalisation: any measure clear of the distribution, e.g., the range
- Nominal scales: binary categories normalised so that the total feature contribution comes out right, e.g., by the square root of the number of categories
56. DRA on interpretation
- Cluster centroids are supplemented with contributions of feature/cluster pairs or entity/cluster pairs
- K-Means: what is representative?
- Distance: Min (conventional)
- Inner product: Max (data recovery)
57. DRA on incomplete clustering
- With the model assigning un-clustered entities to the norm (e.g., the gravity centre): Anomalous Pattern clustering (iterated)
58. DRA on the number of clusters
- iK-Means (under the assumption that every cluster, in sequence, contributes more than the next one: a planetary model)
- Otherwise, the issue is rather bleak
59. Failure of statistically sound criteria
- Ming-Tso Chiang (2006): 100 entities in 6D, 4 clusters, between-cluster distances 50 times greater than within-cluster distances
- Hartigan's F coefficient and the Jump statistic fail
60. Conclusion
- The data recovery approach should be the major mathematical underpinning for data mining as a framework for finding patterns in data