1 Ensembles of Partitions via Data Resampling
- Behrouz Minaei, Alexander Topchy and William Punch
- Department of Computer Science and Engineering
- ITCC 2004, Las Vegas, April 7, 2004
2 Outline
- Overview of Data Mining Tasks
- Cluster analysis and its difficulty
- Clustering Ensemble
  - How to generate different partitions?
  - How to combine multiple partitions?
- Resampling Methods
  - Bootstrap vs. Subsampling
- Experimental study
  - Methods
  - Results
- Conclusion
3 Overview of Data Mining Tasks
- Classification
  - The goal is to predict the class variable based on the feature values of samples
  - Avoid overfitting
- Clustering (unsupervised learning)
- Association Analysis
- Dependence Modeling
  - A generalization of the classification task: any feature variable can occur both in the antecedent and in the consequent of a rule
- Association Rules
  - Find binary relationships among data items
4 Clustering vs. Classification
- Identification of a pattern as a member of a category (pattern class) we already know, or are familiar with
  - Supervised Classification (known categories)
- Unsupervised Classification, or Clustering (creation of new categories)
5 Classification vs. Clustering
- Classification: given some training patterns from each class, the goal is to construct decision boundaries or to partition the feature space
- Clustering: given some patterns, the goal is to discover the underlying structure (categories) in the data based on inter-pattern similarities
6 Taxonomy of Clustering Approaches
A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: a review", ACM Computing Surveys, 31(3):264-323, September 1999.
7 k-Means Algorithm
- Minimize the sum of within-cluster squared errors
- Start with k cluster centers
- Iterate between:
  - Assign data points to the closest cluster centers
  - Adjust the cluster centers to be the means of their assigned data points
- User-specified parameters: k and the initialization of the cluster centers
- Fast: O(kNI) for N points and I iterations
- Proven to converge to a local optimum
- In practice, converges quickly
- Tends to produce spherical, equal-sized clusters

(Figure: k-means result, k = 3)
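The two alternating steps above fit in a few lines. The following is a minimal NumPy sketch, not the authors' implementation; random initialization and Euclidean distance are assumed:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's iteration: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Start with k cluster centers: k distinct points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: closest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):  # local optimum reached
            break
        centers = new_centers
    return labels, centers
```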
8 Single-Link Algorithm
- Forms a hierarchy of the data points (a dendrogram), which can be used to partition the data
- The closest data points are joined to form a cluster at each step
- Closely related to minimum-spanning-tree-based clustering
(Figures: data set; dendrogram; single-link partition, k = 3)
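For reference, single-link clustering and the dendrogram cut are available off the shelf in SciPy; a minimal sketch with illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data; any (n_samples, n_features) array works.
X = np.random.default_rng(0).normal(size=(50, 2))

# Single-link: at each step, merge the two clusters whose closest
# members are nearest to each other.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram into a flat partition with k = 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
```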
9 User's Dilemma!
- Which similarity measure and which features to use?
- How many clusters?
- Which is the best clustering method?
- Are the individual clusters and the partitions valid?
- How to choose algorithmic parameters?
10 How Many Clusters?
(Figures: k-means results for k = 2, 3, 4, 5)
11 Any Best Clustering Algorithm?
- Clustering is an ill-posed problem: there does not exist a uniformly best clustering algorithm
- In practice, we need to determine which clustering algorithm(s) are appropriate for the given data

(Figures: k-means, 3 clusters; single-link, 30 clusters; spectral, 3 clusters; EM, 3 clusters)
12 Ensemble Benefits
- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions
13 Is Meaningful Clustering Combination Possible?
A combination of 4 different partitions can reveal the true clusters!
14 Pattern Matrix, Distance Matrix
Pattern matrix (N patterns x d features):

       f1    f2    ...  fj    ...  fd
  X1   x11   x12   ...  x1j   ...  x1d
  X2   x21   x22   ...  x2j   ...  x2d
  ...
  Xi   xi1   xi2   ...  xij   ...  xid
  ...
  XN   xN1   xN2   ...  xNj   ...  xNd

Distance matrix (N x N):

       X1    X2    ...  Xj    ...  XN
  X1   d11   d12   ...  d1j   ...  d1N
  X2   d21   d22   ...  d2j   ...  d2N
  ...
  Xi   di1   di2   ...  dij   ...  diN
  ...
  XN   dN1   dN2   ...  dNj   ...  dNN
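A minimal sketch of how the two matrices relate, assuming Euclidean distance and illustrative random data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Pattern matrix: N patterns (rows) x d features (columns).
X = np.random.default_rng(0).normal(size=(100, 4))

# Distance matrix: D[i, j] = Euclidean distance between Xi and Xj.
D = squareform(pdist(X, metric='euclidean'))  # shape (N, N), zero diagonal
```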
15 Representation of Multiple Partitions
- Combination of partitions can be viewed as another clustering problem, where each Pi represents a new feature with categorical values
- Cluster membership of a pattern in the different partitions is regarded as a new feature vector
- Combining the partitions is equivalent to clustering these tuples (see the sketch below)

7 objects clustered by 4 algorithms:

  objects  P1  P2  P3  P4
  x1       1   A   ?   Z
  x2       1   A   ?   Y
  x3       3   D   ?   ?
  x4       2   D   ?   Y
  x5       2   B   ?   Z
  x6       3   C   ?   Z
  x7       3   C   ?   ?
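A small sketch of this representation: P1 and P2 encode the table above (with A-D mapped to 0-3), while P3 and P4 use made-up placeholder values because some of their labels did not survive extraction:

```python
import numpy as np

# Four base partitions of the same 7 objects, as integer label arrays.
P1 = np.array([1, 1, 3, 2, 2, 3, 3])
P2 = np.array([0, 0, 3, 3, 1, 2, 2])   # A..D mapped to 0..3
P3 = np.array([0, 1, 1, 0, 1, 2, 1])   # placeholder values
P4 = np.array([2, 1, 0, 1, 2, 2, 0])   # placeholder values

# Row i is the new categorical feature vector of object i; combining
# the ensemble amounts to clustering these rows with a suitable
# categorical similarity (e.g., the co-association measure of slide 17).
features = np.column_stack([P1, P2, P3, P4])   # shape (7, 4)
```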
16 Re-labeling and Voting
Original labels from four clusterings:

  objects  C-1  C-2  C-3  C-4
  x1       1    A    ?    Z
  x2       1    A    ?    Y
  x3       3    B    ?    ?
  x4       2    C    ?    Y
  x5       2    B    ?    Z
  x6       3    C    ?    Z
  x7       3    B    ?    ?

After re-labeling (all partitions mapped onto the labels of C-1), with the final consensus FC obtained by plurality voting:

  objects  C-1  C-2  C-3  C-4  FC
  x1       1    1    1    2    1
  x2       1    1    2    1    1
  x3       3    3    2    ?    3
  x4       2    2    1    1    ?
  x5       2    3    2    2    2
  x6       3    2    ?    2    2
  x7       3    3    2    ?    3
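One possible implementation of this step: solve the label correspondence with the Hungarian method (SciPy's linear_sum_assignment), then take a per-object plurality vote. This is a sketch, not the authors' code; labels are assumed to be integers 0..k-1, and ties are broken toward the smallest label rather than marked "?" as in the table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import mode

def relabel(reference, partition, k):
    """Map `partition` labels onto `reference` labels by maximizing
    label co-occurrence (Hungarian assignment)."""
    overlap = np.zeros((k, k), dtype=int)
    for r, p in zip(reference, partition):
        overlap[r, p] += 1          # contingency table of the two labelings
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = dict(zip(cols, rows))
    return np.array([mapping[p] for p in partition])

def plurality_vote(partitions, k):
    """Align every partition to the first one, then vote per object."""
    aligned = [np.asarray(partitions[0])] + [
        relabel(partitions[0], p, k) for p in partitions[1:]]
    return mode(np.stack(aligned), axis=0).mode.ravel()
```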
17 Co-association as Consensus Function
- Similarity between objects can be estimated by the number of clusters shared by the two objects across all the partitions of an ensemble
- This similarity definition expresses the strength of co-association of n objects by an n x n matrix:

  sim(xi, xj) = (1/N) * sum over k of I(pk(xi) = pk(xj))

  where xi is the i-th pattern, pk(xi) is the cluster label of xi in the k-th partition, I(.) is the indicator function, and N is the number of different partitions
- This consensus function eliminates the need for solving the label correspondence problem
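A direct NumPy sketch of this matrix (not the authors' code); `partitions` is any sequence of label arrays over the same n objects:

```python
import numpy as np

def co_association(partitions):
    """C[i, j] = fraction of partitions in which objects i and j
    share a cluster (the similarity defined above)."""
    P = np.asarray(partitions)          # shape (N_partitions, n_objects)
    n = P.shape[1]
    C = np.zeros((n, n))
    for labels in P:
        C += (labels[:, None] == labels[None, :])   # pairwise I(.) terms
    return C / len(P)
```

A consensus partition can then be obtained by treating 1 - C as a distance matrix and clustering it, e.g. with single-link, which is the "Co-association, SL" combination appearing in the results below.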
18 Taxonomy of Clustering Combination Approaches
19 Resampling Methods
- Bootstrapping (sampling with replacement)
  - Create an artificial list by randomly drawing N elements from the original list; some elements will be picked more than once
  - Statistically, on average about 37% of the drawn elements are repeats, so roughly the same fraction of the original data is left out of each sample
- Subsampling (sampling without replacement)
  - Gives control over the size of the subsample (both schemes are sketched below)
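Both schemes reduce to a single NumPy call each; a sketch in which N and the 80% subsample size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400   # data-set size (e.g., Halfrings)

# Bootstrap: N draws with replacement; ~63% of objects appear at
# least once, so ~37% of the draws duplicate earlier ones.
boot_idx = rng.choice(N, size=N, replace=True)

# Subsample: draw without replacement, with direct control over size.
sub_idx = rng.choice(N, size=int(0.8 * N), replace=False)

print(len(np.unique(boot_idx)) / N)   # approximately 0.63
```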
20 Experiment: Data Sets

  Data set     Classes  Features  Total patterns  Patterns per class
  Halfrings    2        2         400             100-300
  2-spirals    2        2         200             100-100
  Star/Galaxy  2        14        4192            2082-2110
  Wine         3        13        178             59-71-48
  LON          2        6         227             64-163
  Iris         3        4         150             50-50-50
21 Half Rings Data Set
k-means with k = 2 does not identify the true clusters

(Figures: original data set; k-means result, k = 2)
22 Half Rings Data Set
- Both the SL and k-means algorithms fail on this data, but the clustering combination detects the true clusters

(Figures: dendrograms produced by the single-link algorithm using Euclidean distance over the original data set and using the co-association matrix, k = 15, N = 200; the 2-cluster lifetime is marked)
23 Bootstrap Results on Iris
24 Bootstrap Results on Galaxy/Star
25 Bootstrap Results on Galaxy/Star, k = 5, Different Consensus Functions
26 Error Rate for Individual Clustering
Error rates (%):

  Data set     k-means  Single Link  Complete Link  Average Link
  Halfrings    25       24.3         14             5.3
  2-spirals    43.5     0            48             48
  Iris         15.1     32           16             9.3
  Wine         30.2     56.7         32.6           42
  LON          27       27.3         25.6           27.3
  Star/Galaxy  21       49.7         44.1           49.7
27 Summary of the Best Results of Bootstrapping
(k = number of clusters in the component partitions, B = number of bootstrap samples)

  Data set     Best consensus function(s)                                 Lowest error rate (%)  Parameters
  Halfrings    Co-association, SL / Co-association, AL                    0 / 0                  k = 10, B = 100 / k = 15, B = 100
  2-spirals    Co-association, SL                                         0                      k = 10, B = 100
  Iris         Hypergraph-HGPA                                            2.7                    k = 10, B = 20
  Wine         Hypergraph-CSPA                                            26.8                   k = 10, B = 20
  LON          Co-association, CL                                         21.1                   k = 4, B = 100
  Star/Galaxy  Hypergraph-MCLA / Co-association, AL / Mutual Information  9.5 / 10 / 11          k = 20, B = 10 / k = 10, B = 100 / k = 3, B = 20
28 Discussion
- What is the trade-off between the accuracy of the overall clustering combination and the computational cost of generating the component partitions?
- What are the optimal size and granularity of the component partitions?
- What is the best consensus function for combining bootstrap partitions?
29 References
- B. Minaei-Bidgoli, A. Topchy, and W.F. Punch, "Effect of the Resampling Methods on Clustering Ensemble Efficacy", prepared for submission to Intl. Conf. on Machine Learning; Models, Technologies and Applications, 2004
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain, and W.F. Punch, "Adaptive Clustering Ensembles", Intl. Conf. on Pattern Recognition (ICPR 2004), in press
- A. Topchy, A.K. Jain, and W. Punch, "A Mixture Model of Clustering Ensembles", in Proc. SIAM Conf. on Data Mining, April 2004, in press
30 Clusters of Galaxies