1 Ensembles of Partitions via Data Resampling
- Behrouz Minaei, Alexander Topchy and William Punch
- Department of Computer Science and Engineering
- ITCC 2004, Las Vegas, April 7, 2004
2 Outline
- Overview of Data Mining Tasks
- Cluster analysis and its difficulty
- Clustering Ensemble
  - How to generate different partitions?
  - How to combine multiple partitions?
- Resampling Methods
  - Bootstrap vs. Subsampling
- Experimental study
  - Methods
  - Results
- Conclusion
3 Overview of Data Mining Tasks
- Classification
  - The goal is to predict the class variable based on the feature values of samples
  - Avoid overfitting
- Clustering (unsupervised learning)
- Association Analysis
- Dependence Modeling
  - A generalization of the classification task: any feature variable can occur both in the antecedent and in the consequent of a rule
- Association Rules
  - Find binary relationships among data items
4 Clustering vs. Classification
- Identification of a pattern as a member of a category (pattern class) we already know, or are familiar with
  - Supervised Classification (known categories)
- Unsupervised Classification, or Clustering (creation of new categories)
5 Classification vs. Clustering
- Classification: given some training patterns from each class, the goal is to construct decision boundaries or to partition the feature space
- Clustering: given some patterns, the goal is to discover the underlying structure (categories) in the data based on inter-pattern similarities
6 Taxonomy of Clustering Approaches
A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: a review", ACM Computing Surveys, 31(3):264-323, September 1999.
7 k-Means Algorithm
- Minimize the sum of within-cluster squared errors
- Start with k cluster centers
- Iterate between:
  - Assign data points to the closest cluster centers
  - Adjust the cluster centers to be the means of their assigned data points
- User-specified parameters: k and the initialization of the cluster centers
- Fast: O(kNI) for N points and I iterations
- Proven to converge to a local optimum
- In practice, converges quickly
- Tends to produce spherical, equal-sized clusters

(Figure: k-means result, k = 3)
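The two alternating steps above fit in a few lines. The following is a minimal NumPy sketch, not the authors' implementation; random initialization and Euclidean distance are assumed:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's iteration: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Start with k cluster centers: k distinct points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: closest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):  # local optimum reached
            break
        centers = new_centers
    return labels, centers
```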
8 Single-Link Algorithm
- Forms a hierarchy of the data points (a dendrogram), which can be used to partition the data
- The closest data points are joined to form a cluster at each step
- Closely related to minimum-spanning-tree-based clustering
(Figures: data set; dendrogram; single-link partition, k = 3)
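For reference, single-link clustering and the dendrogram cut are available off the shelf in SciPy; a minimal sketch with illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data; any (n_samples, n_features) array works.
X = np.random.default_rng(0).normal(size=(50, 2))

# Single-link: at each step, merge the two clusters whose closest
# members are nearest to each other.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram into a flat partition with k = 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
```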
9 User's Dilemma!
- Which similarity measure and which features to use?
- How many clusters?
- Which is the best clustering method?
- Are the individual clusters and the partitions valid?
- How to choose algorithmic parameters?
10 How Many Clusters?
(Figures: k-means results for k = 2, 3, 4, 5)
11 Any Best Clustering Algorithm?
- Clustering is an ill-posed problem: there does not exist a uniformly best clustering algorithm
- In practice, we need to determine which clustering algorithm(s) are appropriate for the given data

(Figures: k-means, 3 clusters; single-link, 30 clusters; spectral, 3 clusters; EM, 3 clusters)
12 Ensemble Benefits
- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions
13 Is Meaningful Clustering Combination Possible?
A combination of 4 different partitions can reveal the true clusters!
14 Pattern Matrix, Distance Matrix
Pattern matrix (N patterns x d features):

       f1    f2    ...  fj    ...  fd
  X1   x11   x12   ...  x1j   ...  x1d
  X2   x21   x22   ...  x2j   ...  x2d
  ...
  Xi   xi1   xi2   ...  xij   ...  xid
  ...
  XN   xN1   xN2   ...  xNj   ...  xNd

Distance matrix (N x N):

       X1    X2    ...  Xj    ...  XN
  X1   d11   d12   ...  d1j   ...  d1N
  X2   d21   d22   ...  d2j   ...  d2N
  ...
  Xi   di1   di2   ...  dij   ...  diN
  ...
  XN   dN1   dN2   ...  dNj   ...  dNN
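A minimal sketch of how the two matrices relate, assuming Euclidean distance and illustrative random data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Pattern matrix: N patterns (rows) x d features (columns).
X = np.random.default_rng(0).normal(size=(100, 4))

# Distance matrix: D[i, j] = Euclidean distance between Xi and Xj.
D = squareform(pdist(X, metric='euclidean'))  # shape (N, N), zero diagonal
```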
15 Representation of Multiple Partitions
- Combination of partitions can be viewed as another clustering problem, where each Pi represents a new feature with categorical values
- Cluster membership of a pattern in the different partitions is regarded as a new feature vector
- Combining the partitions is equivalent to clustering these tuples (see the sketch below)

7 objects clustered by 4 algorithms:

  objects  P1  P2  P3  P4
  x1       1   A   ?   Z
  x2       1   A   ?   Y
  x3       3   D   ?   ?
  x4       2   D   ?   Y
  x5       2   B   ?   Z
  x6       3   C   ?   Z
  x7       3   C   ?   ?
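A small sketch of this representation: P1 and P2 encode the table above (with A-D mapped to 0-3), while P3 and P4 use made-up placeholder values because some of their labels did not survive extraction:

```python
import numpy as np

# Four base partitions of the same 7 objects, as integer label arrays.
P1 = np.array([1, 1, 3, 2, 2, 3, 3])
P2 = np.array([0, 0, 3, 3, 1, 2, 2])   # A..D mapped to 0..3
P3 = np.array([0, 1, 1, 0, 1, 2, 1])   # placeholder values
P4 = np.array([2, 1, 0, 1, 2, 2, 0])   # placeholder values

# Row i is the new categorical feature vector of object i; combining
# the ensemble amounts to clustering these rows with a suitable
# categorical similarity (e.g., the co-association measure of slide 17).
features = np.column_stack([P1, P2, P3, P4])   # shape (7, 4)
```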
16 Re-labeling and Voting
Original labels from four clusterings:

  objects  C-1  C-2  C-3  C-4
  x1       1    A    ?    Z
  x2       1    A    ?    Y
  x3       3    B    ?    ?
  x4       2    C    ?    Y
  x5       2    B    ?    Z
  x6       3    C    ?    Z
  x7       3    B    ?    ?

After re-labeling (all partitions mapped onto the labels of C-1), with the final consensus FC obtained by plurality voting:

  objects  C-1  C-2  C-3  C-4  FC
  x1       1    1    1    2    1
  x2       1    1    2    1    1
  x3       3    3    2    ?    3
  x4       2    2    1    1    ?
  x5       2    3    2    2    2
  x6       3    2    ?    2    2
  x7       3    3    2    ?    3
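One possible implementation of this step: solve the label correspondence with the Hungarian method (SciPy's linear_sum_assignment), then take a per-object plurality vote. This is a sketch, not the authors' code; labels are assumed to be integers 0..k-1, and ties are broken toward the smallest label rather than marked "?" as in the table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import mode

def relabel(reference, partition, k):
    """Map `partition` labels onto `reference` labels by maximizing
    label co-occurrence (Hungarian assignment)."""
    overlap = np.zeros((k, k), dtype=int)
    for r, p in zip(reference, partition):
        overlap[r, p] += 1          # contingency table of the two labelings
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = dict(zip(cols, rows))
    return np.array([mapping[p] for p in partition])

def plurality_vote(partitions, k):
    """Align every partition to the first one, then vote per object."""
    aligned = [np.asarray(partitions[0])] + [
        relabel(partitions[0], p, k) for p in partitions[1:]]
    return mode(np.stack(aligned), axis=0).mode.ravel()
```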
17 Co-association as Consensus Function
- Similarity between objects can be estimated by the number of clusters shared by the two objects across all the partitions of an ensemble
- This similarity definition expresses the strength of co-association of n objects by an n x n matrix:

  sim(xi, xj) = (1/N) * sum over k of I(pk(xi) = pk(xj))

  where xi is the i-th pattern, pk(xi) is the cluster label of xi in the k-th partition, I(.) is the indicator function, and N is the number of different partitions
- This consensus function eliminates the need for solving the label correspondence problem
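A direct NumPy sketch of this matrix (not the authors' code); `partitions` is any sequence of label arrays over the same n objects:

```python
import numpy as np

def co_association(partitions):
    """C[i, j] = fraction of partitions in which objects i and j
    share a cluster (the similarity defined above)."""
    P = np.asarray(partitions)          # shape (N_partitions, n_objects)
    n = P.shape[1]
    C = np.zeros((n, n))
    for labels in P:
        C += (labels[:, None] == labels[None, :])   # pairwise I(.) terms
    return C / len(P)
```

A consensus partition can then be obtained by treating 1 - C as a distance matrix and clustering it, e.g. with single-link, which is the "Co-association, SL" combination appearing in the results below.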
18 Taxonomy of Clustering Combination Approaches
19 Resampling Methods
- Bootstrapping (sampling with replacement)
  - Create an artificial list by randomly drawing N elements from the original list; some elements will be picked more than once
  - Statistically, on average about 37% of the drawn elements are repeats, so roughly the same fraction of the original data is left out of each sample
- Subsampling (sampling without replacement)
  - Gives control over the size of the subsample (both schemes are sketched below)
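Both schemes reduce to a single NumPy call each; a sketch in which N and the 80% subsample size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400   # data-set size (e.g., Halfrings)

# Bootstrap: N draws with replacement; ~63% of objects appear at
# least once, so ~37% of the draws duplicate earlier ones.
boot_idx = rng.choice(N, size=N, replace=True)

# Subsample: draw without replacement, with direct control over size.
sub_idx = rng.choice(N, size=int(0.8 * N), replace=False)

print(len(np.unique(boot_idx)) / N)   # approximately 0.63
```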
20 Experiment: Data Sets

  Data set     Classes  Features  Total patterns  Patterns per class
  Halfrings    2        2         400             100-300
  2-spirals    2        2         200             100-100
  Star/Galaxy  2        14        4192            2082-2110
  Wine         3        13        178             59-71-48
  LON          2        6         227             64-163
  Iris         3        4         150             50-50-50
21 Half Rings Data Set
k-means with k = 2 does not identify the true clusters

(Figures: original data set; k-means result, k = 2)
22 Half Rings Data Set
- Both the SL and k-means algorithms fail on this data, but the clustering combination detects the true clusters

(Figures: dendrograms produced by the single-link algorithm using Euclidean distance over the original data set and using the co-association matrix, k = 15, N = 200; the 2-cluster lifetime is marked)
23 Bootstrap Results on Iris
24 Bootstrap Results on Galaxy/Star
25 Bootstrap Results on Galaxy/Star, k = 5, Different Consensus Functions
26 Error Rate for Individual Clustering
Error rates (%):

  Data set     k-means  Single Link  Complete Link  Average Link
  Halfrings    25       24.3         14             5.3
  2-spirals    43.5     0            48             48
  Iris         15.1     32           16             9.3
  Wine         30.2     56.7         32.6           42
  LON          27       27.3         25.6           27.3
  Star/Galaxy  21       49.7         44.1           49.7
27 Summary of the Best Results of Bootstrapping
(k = number of clusters in the component partitions, B = number of bootstrap samples)

  Data set     Best consensus function(s)                                 Lowest error rate (%)  Parameters
  Halfrings    Co-association, SL / Co-association, AL                    0 / 0                  k = 10, B = 100 / k = 15, B = 100
  2-spirals    Co-association, SL                                         0                      k = 10, B = 100
  Iris         Hypergraph-HGPA                                            2.7                    k = 10, B = 20
  Wine         Hypergraph-CSPA                                            26.8                   k = 10, B = 20
  LON          Co-association, CL                                         21.1                   k = 4, B = 100
  Star/Galaxy  Hypergraph-MCLA / Co-association, AL / Mutual Information  9.5 / 10 / 11          k = 20, B = 10 / k = 10, B = 100 / k = 3, B = 20
28 Discussion
- What is the trade-off between the accuracy of the overall clustering combination and the computational cost of generating the component partitions?
- What are the optimal size and granularity of the component partitions?
- What is the best consensus function for combining bootstrap partitions?
29 References
- B. Minaei-Bidgoli, A. Topchy, and W.F. Punch, "Effect of the Resampling Methods on Clustering Ensemble Efficacy", prepared for submission to Intl. Conf. on Machine Learning; Models, Technologies and Applications, 2004
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain, and W.F. Punch, "Adaptive Clustering Ensembles", Intl. Conf. on Pattern Recognition (ICPR 2004), in press
- A. Topchy, A.K. Jain, and W. Punch, "A Mixture Model of Clustering Ensembles", in Proc. SIAM Conf. on Data Mining, April 2004, in press
30 Clusters of Galaxies