1
Ensembles of Partitions via Data Resampling
  • Behrouz Minaei, Alexander Topchy and William
    Punch
  • Department of Computer Science and Engineering
  • ITCC 2004, Las Vegas, April 7th 2004

2
Outline
  • Overview of Data Mining Tasks
  • Cluster analysis and its difficulty
  • Clustering Ensemble
  • How to generate different partitions?
  • How to combine multiple partitions?
  • Resampling Methods
  • Bootstrap vs. Subsampling
  • Experimental study
  • Methods
  • Results
  • Conclusion

3
Overview of Data Mining Tasks
  • Classification
  • The goal is to predict the class variable based
    on the feature values of samples; avoid
    overfitting
  • Clustering (unsupervised learning)
  • Association Analysis
  • Dependence Modeling
  • A generalization of the classification task: any
    feature variable can occur both in the antecedent
    and in the consequent of a rule
  • Association Rules
  • Find binary relationships among data items

4
Clustering vs. Classification
  • Identification of a pattern as a member of a
    category (pattern class) that we already know or
    are familiar with
  • Supervised Classification (known categories)
  • Unsupervised Classification, or Clustering
    (creation of new categories)

Clustering
5
Classification vs. Clustering
Classification: given some training patterns from
each class, the goal is to construct decision
boundaries or to partition the feature space.
Clustering: given some patterns, the goal is to
discover the underlying structure (categories) in
the data based on inter-pattern similarities.
6
Taxonomy of Clustering Approaches
A. Jain, M. N. Murty, and P. Flynn. Data
Clustering: A Review. ACM Computing Surveys,
31(3):264-323, September 1999.
7
k-Means Algorithm
  • Minimize the sum of within-cluster square errors
  • Start with k cluster centers
  • Iterate between
  • Assign data points to the closest cluster centers
  • Adjust the cluster centers to be the means of the
    data points
  • User-specified parameters: k, initialization of
    cluster centers
  • Fast: O(kNI)
  • Proven to converge to a local optimum
  • In practice, converges quickly
  • Tends to produce spherical, equal-sized clusters

k-means, k=3
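The iteration described above can be sketched in NumPy. This is a minimal illustration, not the implementation used in the experiments; for determinism it initializes the centers by greedy farthest-point selection, which is just one of the many user-chosen initialization schemes the slide alludes to:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: minimize the sum of within-cluster square errors."""
    rng = np.random.default_rng(seed)
    # Start with k cluster centers: one random point, then farthest points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each data point to the closest cluster center.
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Adjust each center to be the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged to a local optimum
            break
        centers = new
    return labels, centers
```

Each pass costs O(kN) distance evaluations, so I iterations give the O(kNI) bound quoted above.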
8
Single-Link algorithm
  • Form a hierarchy for the data points
    (dendrogram), which can be used to partition the
    data
  • The closest data points are joined to form a
    cluster at each step
  • Closely related to the minimum spanning
    tree-based clustering

[Figure: scatter plot of the data and the corresponding
dendrogram; single-link result with k=3]
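The single-link procedure can be sketched as a naive agglomerative loop: start from singletons and repeatedly merge the two clusters whose closest members are nearest. This is an illustrative O(n^3) sketch; practical implementations exploit the minimum-spanning-tree connection noted above:

```python
import numpy as np

def single_link(X, k):
    """Naive agglomerative single-link clustering: merge the two clusters
    with the smallest minimum inter-point distance until k remain."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))   # pairwise distances
    clusters = [[i] for i in range(n)]                   # start from singletons
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]                       # join the closest pair
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

Recording the merge order and merge distances instead of stopping at k would yield the full dendrogram.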
9
User's Dilemma!
  • Which similarity measure and which features to
    use?
  • How many clusters?
  • Which is the best clustering method?
  • Are the individual clusters and the partitions
    valid?
  • How to choose algorithmic parameters?

10
How Many Clusters?
k-means, k=2
k-means, k=3
k-means, k=4
k-means, k=5
11
Any Best Clustering Algorithm?
  • Clustering is an ill-posed problem: there does
    not exist a uniformly best clustering algorithm
  • In practice, we need to determine which
    clustering algorithm(s) is appropriate for the
    given data

k-means, 3 clusters
Single-link, 30 clusters
Spectral, 3 clusters
EM, 3 clusters
12
Ensemble Benefits
  • Combinations of classifiers have proved to be very
    effective in the supervised learning framework,
    e.g. bagging and boosting algorithms
  • Distributed data mining requires efficient
    algorithms capable of integrating the solutions
    obtained from multiple sources of data and
    features
  • Ensembles of clusterings can provide novel,
    robust, and stable solutions

13
Is Meaningful Clustering Combination Possible?
A combination of 4 different partitions can lead
to the true clusters!
14
Pattern Matrix and Distance Matrix

Pattern matrix (N patterns x d features):

     f1   f2   ...  fj   ...  fd
X1   x11  x12  ...  x1j  ...  x1d
X2   x21  x22  ...  x2j  ...  x2d
...
Xi   xi1  xi2  ...  xij  ...  xid
...
XN   xN1  xN2  ...  xNj  ...  xNd

Distance matrix (N x N):

     X1   X2   ...  Xj   ...  XN
X1   d11  d12  ...  d1j  ...  d1N
X2   d21  d22  ...  d2j  ...  d2N
...
Xi   di1  di2  ...  dij  ...  diN
...
XN   dN1  dN2  ...  dNj  ...  dNN
15
Representation of Multiple Partitions
  • Combination of partitions can be viewed as
    another clustering problem, where each Pi
    represents a new feature with categorical values
  • Cluster membership of a pattern in different
    partitions is regarded as a new feature vector
  • Combining the partitions is equivalent to
    clustering these tuples

objects   P1  P2  P3  P4
X1        1   A   ?   Z
X2        1   A   ?   Y
X3        3   D   ?   ?
X4        2   D   ?   Y
X5        2   B   ?   Z
X6        3   C   ?   Z
X7        3   C   ?   ?

7 objects clustered by 4 algorithms
16
Re-labeling and Voting

Original labels from four clusterings:

       C-1  C-2  C-3  C-4
X1     1    A    ?    Z
X2     1    A    ?    Y
X3     3    B    ?    ?
X4     2    C    ?    Y
X5     2    B    ?    Z
X6     3    C    ?    Z
X7     3    B    ?    ?

After re-labeling to a common label set, with the
final clustering (FC) obtained by plurality voting:

       C-1  C-2  C-3  C-4   FC
X1     1    1    1    2     1
X2     1    1    2    1     1
X3     3    3    2    ?     3
X4     2    2    1    1     ?
X5     2    3    2    2     2
X6     3    2    ?    2     2
X7     3    3    2    ?     3
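Re-labeling and voting can be sketched as follows. The `relabel` helper maps each cluster to the reference label it overlaps most, a greedy simplification of the full label correspondence problem; both function names are hypothetical, not the presenters' code:

```python
import numpy as np
from collections import Counter

def relabel(reference, labels):
    """Map each cluster in `labels` to the reference label it overlaps most
    (a greedy stand-in for solving full label correspondence)."""
    mapping = {c: Counter(reference[labels == c]).most_common(1)[0][0]
               for c in np.unique(labels)}
    return np.array([mapping[c] for c in labels])

def vote(partitions):
    """Re-label every partition against the first one, then assign each
    object the plurality label across the re-labeled partitions."""
    ref = partitions[0]
    aligned = np.array([ref] + [relabel(ref, p) for p in partitions[1:]])
    return np.array([Counter(col).most_common(1)[0][0] for col in aligned.T])
```

As in the table above, objects whose plurality vote is tied have no well-defined final label.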
17
Co-association As Consensus Function
  • Similarity between objects can be estimated by
    the number of clusters shared by two objects in
    all the partitions of an ensemble
  • This similarity definition expresses the strength
    of co-association of n objects by an n x n matrix:

    sim(xi, xj) = (1/N) Sum_k I( pk(xi) = pk(xj) )

  • Here xi is the i-th pattern, pk(xi) is the cluster
    label of xi in the k-th partition, I() is the
    indicator function, and N is the no. of different
    partitions
  • This consensus function eliminates the need for
    solving the label correspondence problem
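The co-association similarity described above can be sketched directly; `partitions` holds one label vector per ensemble member, and no label alignment is needed:

```python
import numpy as np

def co_association(partitions):
    """S[i, j] = fraction of partitions in which objects i and j share a
    cluster: sim(xi, xj) = (1/N) * sum_k I(pk(xi) == pk(xj))."""
    P = np.asarray(partitions)            # shape: (N partitions, n objects)
    n = P.shape[1]
    S = np.zeros((n, n))
    for labels in P:                      # accumulate the indicator I(...)
        S += labels[:, None] == labels[None, :]
    return S / len(P)
```

Treating 1 - S as a distance matrix, any similarity-based algorithm (e.g. single-link) can then extract the consensus partition.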

18
Taxonomy of Clustering Combination Approaches
19
Resampling Methods
  • Bootstrapping (Sampling with replacement)
  • Create an artificial list by randomly drawing N
    elements from the original list. Some elements
    will be picked more than once.
  • Statistically, on average about 37% of the draws
    are repeated elements
  • Subsampling (Sampling without replacement)
  • Control over the size of subsample
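Both resampling schemes are one-liners in NumPy; the roughly 37% figure follows because a bootstrap sample contains on average 1 - 1/e, about 63.2%, distinct elements:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000)

# Bootstrap: draw N elements with replacement; some elements repeat.
boot = rng.choice(data, size=len(data), replace=True)
unique_frac = len(np.unique(boot)) / len(data)   # approx. 0.632 on average

# Subsampling: draw without replacement, with full control over the size.
sub = rng.choice(data, size=len(data) // 2, replace=False)
```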

20
Experiment: Data Sets

Data set      Classes  Features  Total patterns  Patterns per class
Halfrings     2        2         400             100-300
2-spirals     2        2         200             100-100
Star/Galaxy   2        14        4192            2082-2110
Wine          3        13        178             59-71-48
LON           2        6         227             64-163
Iris          3        4         150             50-50-50
21
Half Rings Data Set
k-means with k=2 does not identify the true
clusters

Original data set
k-means, k=2
22
Half Rings Data Set
  • Both SL and k-means algorithms fail on this data,
    but clustering combination detects true clusters

Dendrograms produced by the single-link algorithm,
using Euclidean distance over the original data set
and over the co-association matrix (k=15, N=200);
l2, l3: 2-cluster lifetime
23
Bootstrap results on Iris
24
Bootstrap results on Galaxy/Star
25
Bootstrap results on Galaxy/Star, k=5, different
consensus functions
26
Error rate (%) for individual clustering algorithms

Data set      k-means  Single Link  Complete Link  Average Link
Halfrings     25       24.3         14             5.3
2 Spiral      43.5     0            48             48
Iris          15.1     32           16             9.3
Wine          30.2     56.7         32.6           42
LON           27       27.3         25.6           27.3
Star/Galaxy   21       49.7         44.1           49.7
27
Summary of the best results of Bootstrapping
Data set      Best consensus function(s)   Lowest error rate (%)   Parameters
Halfrings     Co-association, SL           0                       k=10, B=100
              Co-association, AL           0                       k=15, B=100
2 Spiral      Co-association, SL           0                       k=10, B=100
Iris          Hypergraph-HGPA              2.7                     k=10, B=20
Wine          Hypergraph-CSPA              26.8                    k=10, B=20
LON           Co-association, CL           21.1                    k=4, B=100
Star/Galaxy   Hypergraph-MCLA              9.5                     k=20, B=10
              Co-association, AL           10                      k=10, B=100
              Mutual Information           11                      k=3, B=20
28
Discussion
  • What is the trade-off between the accuracy of the
    overall clustering combination and computational
    cost of generating component partitions?
  • What is the optimal size and granularity of the
    component partitions?
  • What is the best consensus function to combine
    bootstrap partitions?

29
References
  • B. Minaei-Bidgoli, A. Topchy and W.F. Punch,
    "Effect of the Resampling Methods on Clustering
    Ensemble Efficacy", prepared for submission to Intl.
    Conf. on Machine Learning; Models, Technologies
    and Applications, 2004
  • A. Topchy, B. Minaei-Bidgoli, A.K. Jain and W.F.
    Punch, "Adaptive Clustering Ensembles", Intl.
    Conf. on Pattern Recognition, ICPR 2004, in press
  • A. Topchy, A.K. Jain and W. Punch, "A Mixture
    Model of Clustering Ensembles", in Proceedings of the
    SIAM Conf. on Data Mining, April 2004, in press

30
Clusters of Galaxies