Title: Two-mode cluster analysis: Monte Carlo tests of the accuracy of methods
1. Two-mode cluster analysis: Monte Carlo tests of the accuracy of methods
- Sabine Krolak-Schwerdt
- Saarland University
2. Overview
- Indication for two-mode clustering
- Classification of two-mode clustering methods
- Monte Carlo study
- Data model used to construct the data sets
- Experiment 1: Non-overlapping clusters
- Experiment 2: Overlapping clusters
- Conclusions for selection of methods
3. Classification of two-mode clustering methods
- Generalizations of the ADCLUS model
  - GENNCLUS (DeSarbo, 1982)
  - PENCLUS (Both & Gaul, 1987)
  - Baier et al. (1996)
- Representations by ultrametric tree structures
  - Missing value method (Espejo & Gaul, 1986)
  - Centroid effect method (Eckes & Orlik, 1993)
  - ESOCLUS (Schwaiger, 1997)
  - ...
- Reordering methods
  - Bond energy algorithm (McCormick, Schweitzer & White, 1972)
  - Modal block method (Hartigan, 1976)
  - Two-way joining (Hartigan, 1975)
  - GRIDPAT (Krolak-Schwerdt, Orlik & Ganter, 1994)
  - ...
4. Generalizations of the ADCLUS model
- Data
  - a nonsymmetric (similarity) matrix of order n x m
- Model (cf. DeSarbo, 1982)
  - the data matrix is approximated by the product of three matrices: object memberships, weights, and transposed attribute memberships
- where the three matrices are
  - a binary n x k matrix designating membership of the n objects in k clusters
  - a k x k matrix of weights
  - a binary m x k matrix designating membership of the m attributes in k clusters
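The structural part of this model can be illustrated with a minimal sketch. The names P, W and Q are hypothetical labels for the object membership matrix, the weight matrix and the attribute membership matrix (the original symbols did not survive on the slide), and any additive constant or error term of the full model is left out here.

```python
def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def two_mode_model(P, W, Q):
    """Return the n x m matrix P W Q^T implied by memberships and weights."""
    return matmul(matmul(P, W), transpose(Q))


# Hypothetical toy example: n = 4 objects, m = 3 attributes, k = 2 clusters.
P = [[1, 0], [1, 0], [0, 1], [1, 1]]  # object 4 belongs to both clusters
Q = [[1, 0], [0, 1], [0, 1]]          # attribute memberships
W = [[2.0, 0.0], [0.0, 3.0]]          # cluster weights

X_hat = two_mode_model(P, W, Q)
```

Because the last object belongs to both clusters, its row is the sum of the two cluster profiles, which is how overlap enters a model of this form.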
5. Representations by ultrametric trees
6. Representations by ultrametric trees: Grand matrix
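As a rough illustration of the grand matrix idea, the sketch below embeds an n x m two-mode matrix in an (n + m) x (n + m) one-mode matrix. It assumes the usual reading in which the within-mode blocks (object by object, attribute by attribute) are unobserved; marking them, and the diagonal, as NaN is a choice made for this example, and the function name is hypothetical.

```python
import math


def grand_matrix(X):
    """Embed an n x m two-mode matrix in an (n+m) x (n+m) one-mode matrix.

    The between-mode blocks hold X and its transpose. The within-mode
    blocks are treated as missing and filled with NaN (an assumption).
    """
    n, m = len(X), len(X[0])
    size = n + m
    G = [[math.nan] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            G[i][n + j] = X[i][j]  # object i vs. attribute j
            G[n + j][i] = X[i][j]  # symmetric counterpart
    return G


G = grand_matrix([[1, 2, 3], [4, 5, 6]])
```

A one-mode hierarchical method that tolerates missing entries could then be applied to such a matrix; the sketch stops at building it.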
7. Reordering approaches: Two-way joining (Hartigan, 1975)
8. Monte Carlo study: Selected methods
- Generalizations of the ADCLUS model
  - GENNCLUS (DeSarbo, 1982)
  - PENCLUS (Both & Gaul, 1987)
  - Baier et al. (1996)
- Representations by ultrametric tree structures
  - Missing value method (Espejo & Gaul, 1986)
  - Centroid effect method (Eckes & Orlik, 1993)
  - ESOCLUS (Schwaiger, 1997)
  - ...
- Reordering methods
  - Bond energy algorithm (McCormick, Schweitzer & White, 1972)
  - Modal block method (Hartigan, 1976)
  - Two-way joining (Hartigan, 1975)
  - GRIDPAT (Krolak-Schwerdt, Orlik & Ganter, 1994)
  - ...
Results in Experiments 1 and 2 are reported for five of these methods: Baier et al., Centroid effect method, ESOCLUS, Two-way joining, and GRIDPAT.
10. Data model
11. Data model
- a nonsymmetric matrix of order n x m (n objects, m attributes)
- a binary n x K matrix designating membership of the n objects in K clusters
- a K x K matrix of weights
- a binary m x K matrix designating membership of the m attributes in K clusters
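One way such data sets could be generated is sketched below: the structural part is built from the two membership matrices and the weight matrix, and normally distributed noise is added to every cell. The names P, W, Q, sigma and seed are hypothetical, and the additive form of the noise is an assumption; the slides only indicate that a normal distribution with a large or small spread was involved.

```python
import random


def simulate_two_mode_data(P, W, Q, sigma, seed=0):
    """Return an n x m data matrix: structure P W Q^T plus N(0, sigma) noise."""
    rng = random.Random(seed)
    n, m, K = len(P), len(Q), len(W)
    X = []
    for i in range(n):
        row = []
        for j in range(m):
            # Structural value: weights of all cluster pairs (c, d) such that
            # object i is in object cluster c and attribute j in cluster d.
            structure = sum(P[i][c] * W[c][d] * Q[j][d]
                            for c in range(K) for d in range(K))
            row.append(structure + rng.gauss(0.0, sigma))
        X.append(row)
    return X


# Hypothetical toy configuration: 2 objects, 3 attributes, 2 clusters.
P = [[1, 0], [0, 1]]
Q = [[1, 0], [0, 1], [0, 1]]
W = [[1.0, 0.0], [0.0, 2.0]]

noise_free = simulate_two_mode_data(P, W, Q, sigma=0.0)
noisy = simulate_two_mode_data(P, W, Q, sigma=0.5, seed=1)
```

Fixing the seed makes each simulated data set reproducible, which is convenient when several methods are to be compared on identical input.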
12. Data model: Model of non-overlapping clusters
13. Non-overlapping clusters
14. Data model
M overlapping clusters
15. Data model
Weight matrix of Experiment 2: Toeplitz matrix
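A Toeplitz matrix has constant entries along each diagonal, so in the symmetric case entry (i, j) depends only on |i - j|. The sketch below builds such a matrix from its first column. Whether the weight matrix of the study was symmetric, and how its parameter entered, cannot be recovered from the slide; the geometric decay in `geometric_toeplitz` is purely a hypothetical parameterisation for illustration.

```python
def toeplitz(first_column):
    """Symmetric Toeplitz matrix: entry (i, j) depends only on |i - j|."""
    K = len(first_column)
    return [[first_column[abs(i - j)] for j in range(K)] for i in range(K)]


def geometric_toeplitz(K, a):
    """Hypothetical parameterisation: weights decay as a ** |i - j|."""
    return toeplitz([a ** d for d in range(K)])


W_small = geometric_toeplitz(3, 0.5)
```

Under this illustrative parameterisation, a larger value of `a` gives heavier off-diagonal weights, that is, stronger links between different object and attribute clusters.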
16. Non-overlapping clusters
17. Model parameters of Experiment 1 (3 clusters, non-overlapping)
18. Model parameters of Experiment 2
Factors of the experimental design:
- Overlap: large vs. small
- Cluster number: 3, 5 or 8
- Parameter of Toeplitz matrix: large vs. small
- Size of variance: large vs. small
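The four factors listed above can be enumerated as design cells with a short sketch. It assumes the factors were fully crossed, which the slide does not state explicitly, and it says nothing about the number of replications per cell. The dictionary keys are hypothetical names.

```python
from itertools import product

# Factor levels as listed on the slide; key names are illustrative.
FACTORS = {
    "overlap": ["large", "small"],
    "n_clusters": [3, 5, 8],
    "toeplitz_parameter": ["large", "small"],
    "variance": ["large", "small"],
}


def design_cells(factors):
    """Return one dict per combination of factor levels (full crossing)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


cells = design_cells(FACTORS)
```

Under full crossing this yields 2 x 3 x 2 x 2 = 24 data conditions, each of which would then be analysed by every clustering method.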
19. Experiment 1: ANOVA of Adjusted Rand indices
20. Experiment 1: ANOVA of Adjusted Rand indices
Number of clusters: F(2,180) = 112.21, p < 0.001
Number of clusters   3      5      8
Mean                 0.81   0.69   0.56
Structure of clusters: F(3,180) = 19.86, p < 0.001
Structure            1      2      3      4
Mean                 0.75   0.71   0.68   0.61
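For reference, the adjusted Rand index in its usual Hubert and Arabie form can be computed from two label vectors as sketched below. This is a generic implementation for two partitions of the same items; how the study defined the items being compared (for example, objects, attributes, or matrix cells) is not stated on the slides, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index equals 1 for identical partitions regardless of how the clusters are labelled, is close to 0 for chance-level agreement, and can become negative when agreement is below chance.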
21. Experiment 1: ANOVA of Adjusted Rand indices
Method by number of clusters: F(8,180) = 68.44, p < 0.001
                         Number of clusters
Method                   3      5      8
ESOCLUS                  1.00   0.90   0.52
Centroid effect method   0.96   0.78   0.57
Baier et al.             0.54   0.53   0.50
GRIDPAT                  0.90   0.76   0.30
Two-way joining          0.63   0.48   0.94
22. Experiment 1: ANOVA of Adjusted Rand indices
Method by structure of clusters: F(12,180) = 3.12, p < 0.001
                         Structure of clusters
Method                   1      2      3      4
ESOCLUS                  0.88   0.84   0.80   0.71
Centroid effect method   0.87   0.84   0.78   0.59
Baier et al.             0.61   0.50   0.50   0.47
GRIDPAT                  0.67   0.69   0.60   0.65
Two-way joining          0.70   0.70   0.71   0.63
23. Experiment 2: ANOVA of Omega indices
24. Experiment 2: ANOVA of Omega indices
Cluster overlap: F(1,240) = 29.72, p < 0.001
Overlap   Large   Small
Mean      0.84    0.87
σ of the normal distribution: F(1,240) = 28.66, p < 0.001
σ         Large   Small
Mean      0.84    0.87
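The Omega index extends the adjusted Rand index to overlapping clusterings by comparing, for every pair of items, the number of clusters the pair shares in each solution. The sketch below follows that standard definition; the function names are hypothetical, clusters are given as lists of item indices, and the choice of items to compare in a two-mode setting is left open because the slides do not specify it.

```python
from collections import Counter
from itertools import combinations


def shared_cluster_counts(clusters, n_items):
    """Map each item pair to the number of clusters containing both items."""
    counts = {pair: 0 for pair in combinations(range(n_items), 2)}
    for members in clusters:
        for pair in combinations(sorted(set(members)), 2):
            counts[pair] += 1
    return counts


def omega_index(clusters_a, clusters_b, n_items):
    """Chance-corrected agreement between two possibly overlapping clusterings."""
    if n_items < 2:
        raise ValueError("need at least two items")
    counts_a = shared_cluster_counts(clusters_a, n_items)
    counts_b = shared_cluster_counts(clusters_b, n_items)
    n_pairs = len(counts_a)
    # Observed agreement: pairs sharing the same number of clusters in both.
    observed = sum(counts_a[p] == counts_b[p] for p in counts_a) / n_pairs
    # Expected agreement from the marginal distributions of shared counts.
    hist_a = Counter(counts_a.values())
    hist_b = Counter(counts_b.values())
    expected = sum(hist_a[j] * hist_b[j] for j in hist_a) / n_pairs ** 2
    if expected == 1.0:
        return 1.0  # degenerate case: no variation in either solution
    return (observed - expected) / (1.0 - expected)


same = omega_index([[0, 1, 2], [2, 3]], [[0, 1, 2], [2, 3]], n_items=4)
crossed = omega_index([[0, 1], [2, 3]], [[0, 2], [1, 3]], n_items=4)
```

For non-overlapping solutions the result coincides with the adjusted Rand index, which is why the two experiments can use the two indices for the same purpose.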
25. Experiment 2: ANOVA of Omega indices
F(8,240) = 6.61, p < 0.001
Large overlap
                         Number of clusters
Method                   3      5      8
ESOCLUS                  0.80   0.81   0.81
Centroid effect method   0.81   0.81   0.82
Baier et al.             0.92   0.85   0.78
GRIDPAT                  0.92   0.87   0.82
Two-way joining          0.80   0.89   0.76
Small overlap
                         Number of clusters
Method                   3      5      8
ESOCLUS                  0.88   0.90   0.91
Centroid effect method   0.87   0.91   0.92
Baier et al.             0.85   0.84   0.86
GRIDPAT                  0.82   0.87   0.87
Two-way joining          0.68   0.78   0.79
26. Conclusions
- Recovery performance of two-mode clustering methods depends on the type and complexity of the data structure.
- Methods performed best when the input data corresponded to the data structure presumed by the method:
  - Non-overlapping clusters: ESOCLUS, Centroid effect method
  - Overlapping clusters: Baier et al., Two-way joining, GRIDPAT
- Some a priori knowledge or hypothesis about the type and structure of the data is necessary for selecting an optimal method.