Title: Ensemble Clustering in Medical Diagnostics
1Ensemble Clustering in Medical Diagnostics
- Derek Greene, Alexey Tsymbal
- Nadia Bolshakova, Pádraig Cunningham
Department of Computer Science, Trinity College
Dublin, Ireland
2Agenda
- Cluster Analysis
- Overview
- Applications in Medical Diagnostics
- Ensemble Clustering
- Motivation
- General Model
- Design Issues
- Experimental Evaluation
- Ensemble techniques
- Empirical results
- Implementation
- Conclusion
3Cluster Analysis
- Data mining approach to discover hidden patterns
in data.
- Divide a dataset into groups or clusters of
objects based on a given similarity criterion.
- Unsupervised learning procedure
- Often no information exists concerning underlying
structure of partition.
- i.e. number of clusters or their composition.
4Applications in Medical Diagnostics
- Examples
- Categorization of patients into cohesive
sub-groups.
- e.g. clustering of cancer patient data to define
previously unrecognized tumour sub-types
- Analysis of medical imaging data.
- e.g. identification of cell tissue types from MRI
image data
- Gene expression analysis.
- e.g. identification of co-regulated gene groups
5Common Cluster Analysis Methods
- Hierarchical Methods
- e.g. hierarchical agglomerative
6Ensemble Clustering - Overview
- Ensembles in Supervised Learning
- Ensembles have been successfully applied in cases
where classes are well-defined (e.g. Breiman,
1996).
- Ensemble Clustering
- Combine the strengths of multiple partitions to
produce a superior clustering (Strehl & Ghosh,
2002).
7Ensemble Clustering - Motivation
- Accuracy in unsupervised learning
- No pre-defined model for the data
- No definitive measure of accuracy.
- A clustering that agrees with domain expert
opinion is desirable.
- Issues with common clustering algorithms
- May be influenced by bias of clustering algorithm
toward cluster shape and dispersion.
- Goal for Ensemble Clustering
- Aggregate a collection of base clusterings to
produce a more accurate partition of a dataset.
8Ensemble Clustering - Model
- Generic model for Ensemble Clustering
[Figure: generic ensemble clustering model, with the dataset as input]
9Ensemble Design Decisions
- Base Algorithm
- Which clustering algorithm to apply to produce
the base clusterings?
- e.g. k-means, k-medoids, weak clustering
- Generation Strategy
- How many base clusterings to generate?
- How can we ensure diversity among the base
clusterings?
- Integration Strategy
- How should the base clusterings be aggregated?
10Experimental Overview
- Goal
- Evaluate ensemble generation and integration
strategies on a varied collection of datasets.
- Data
- Benchmark datasets
- Iris, 2-Spirals, Half-rings
- Real-world medical databases from UCI ML
repository
- Breast cancer
- Pima Indians diabetes
- Cleveland heart disease
- BUPA liver-disorders
- Lymphography
- Thyroid disease
11Generation Strategies
- Plain
- Rely on stochastic element in base clustering
algorithm.
- Random-k
- Randomly select number of clusters (k).
- Bagging
- Generate clusterings on random subset of data.
- Random projection
- Randomly transform data to new set of features.
- Random subspacing
- Randomly select subset of original features.
- Heterogeneous ensembles
- Use multiple different base clustering algorithms.
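The generation strategies listed above can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in the talk: the helper names (`kmeans_labels`, `generate_base_clustering`), the parameter defaults, and the tiny NumPy k-means used as the base algorithm are all assumptions. Bagging is shown as sampling a random subset without replacement, following the wording on the slide, and heterogeneous ensembles are omitted because they need more than one base algorithm.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=20):
    """Tiny Lloyd's k-means (illustrative); returns one label per row of X."""
    X = np.asarray(X, dtype=float)
    # Random initial centres: the stochastic element the "plain" strategy relies on.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest centre (squared Euclidean distance).
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its members; leave empty clusters alone.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels

STRATEGIES = ("plain", "random-k", "bagging", "random-projection", "random-subspace")

def generate_base_clustering(X, strategy, rng, k=3, k_range=(2, 6),
                             sample_frac=0.8, n_dims=2):
    """Return (object_indices, labels) for one base clustering."""
    n, d = X.shape
    idx = np.arange(n)
    data = X
    if strategy == "plain":
        pass                                   # only the initialisation varies
    elif strategy == "random-k":
        k = int(rng.integers(k_range[0], k_range[1] + 1))
    elif strategy == "bagging":
        # Assumption: a random subset drawn without replacement.
        idx = np.sort(rng.choice(n, size=int(sample_frac * n), replace=False))
        data = X[idx]
    elif strategy == "random-projection":
        data = X @ rng.normal(size=(d, n_dims))   # random linear map to n_dims features
    elif strategy == "random-subspace":
        data = X[:, rng.choice(d, size=n_dims, replace=False)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return idx, kmeans_labels(data, k, rng)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                   # synthetic stand-in for a real dataset
ensemble = [generate_base_clustering(X, s, rng) for s in STRATEGIES for _ in range(5)]
```

Each ensemble member keeps the indices of the objects it clustered, because a bagged clustering covers only part of the dataset and the integration step needs to know which objects were present.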
12Integration Strategies
- Co-Occurrence Method
- Determine level of association between each
pair of objects in a dataset (Jain & Fred, 2002).
[Figure: base clusterings of objects A-E and the resulting co-occurrence matrix]
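As a rough sketch of the co-occurrence method, the association between two objects can be taken as the fraction of base clusterings that place them in the same cluster. The function name `co_occurrence` and the three base clusterings of objects A to E below are hypothetical, chosen only to make the arithmetic easy to follow; the sketch also assumes every base clustering labels every object.

```python
import numpy as np

def co_occurrence(base_clusterings):
    """Fraction of base clusterings in which each pair of objects shares a cluster."""
    labels = np.asarray(base_clusterings)          # shape: (n_clusterings, n_objects)
    # same[c, i, j] is True when clustering c puts objects i and j together.
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Three hypothetical base clusterings of the objects A, B, C, D, E (in that order).
base = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
M = co_occurrence(base)
# A and B are always together (M[0, 1] is 1), A and D never are (M[0, 3] is 0),
# and A and C share a cluster in one of the three clusterings (M[0, 2] is 1/3).
```

Cluster labels are compared only within each clustering, so the matrix does not depend on how the individual clusterings happen to number their clusters.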
13Integration Strategies (cont.)
- Which algorithm to use for meta-clustering?
- We apply a hierarchical agglomerative clustering
algorithm to the co-occurrence matrix.
- Single-linkage
- Complete-linkage
- Average-linkage
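One way to sketch this meta-clustering step is with SciPy's hierarchical clustering routines. The conversion of association to distance as one minus the co-occurrence value, the function name `meta_cluster`, and the small matrix below are assumptions made for illustration, not details taken from the talk.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def meta_cluster(co_matrix, n_clusters, method="average"):
    """Cut an agglomerative tree built on a co-occurrence matrix into n_clusters."""
    # Assumption: objects that co-occur often are "close", so distance = 1 - association.
    dist = 1.0 - np.asarray(co_matrix, dtype=float)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)     # condensed form expected by linkage
    tree = linkage(condensed, method=method)       # "single", "complete" or "average"
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Hypothetical co-occurrence matrix for five objects A..E.
M = np.array([
    [1.0, 1.0, 1/3, 0.0, 0.0],
    [1.0, 1.0, 1/3, 0.0, 0.0],
    [1/3, 1/3, 1.0, 2/3, 2/3],
    [0.0, 0.0, 2/3, 1.0, 1.0],
    [0.0, 0.0, 2/3, 1.0, 1.0],
])
labels_by_method = {m: meta_cluster(M, 2, m) for m in ("single", "complete", "average")}
```

On this small matrix all three linkage criteria group A with B and C with D and E; on real data they can disagree, which is why the choice of linkage is treated as a design decision.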
14Evaluation - Accuracy
- Comparison to single k-means algorithm based on
Jaccard accuracy score
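The slides do not spell out how the Jaccard accuracy score is computed. A common pair-counting form, shown here as an assumption, compares a clustering with reference class labels: of all object pairs grouped together by at least one of the two, it reports the fraction grouped together by both. The function name `jaccard_score` and the example labels are illustrative.

```python
from itertools import combinations

def jaccard_score(predicted, reference):
    """Pair-counting Jaccard index between a clustering and reference labels."""
    both = only_pred = only_ref = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        if same_pred and same_ref:
            both += 1          # pair together in both partitions
        elif same_pred:
            only_pred += 1     # together only in the clustering
        elif same_ref:
            only_ref += 1      # together only in the reference
    total = both + only_pred + only_ref
    # If no pair is ever grouped together, the two partitions trivially agree.
    return both / total if total else 1.0

reference = [0, 0, 0, 1, 1]
predicted = [1, 1, 0, 0, 0]
score = jaccard_score(predicted, reference)   # 2 shared pairs out of 6, i.e. 1/3
```

Because only pair memberships are compared, the score is unaffected by how clusters are numbered and does not require the clustering to use the same number of groups as the reference.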
15Evaluation - Diversity v. Accuracy
- Comparison of generator diversity with Jaccard
accuracy scores across all datasets
- Results indicate that diversity alone is not
sufficient to yield an improved solution.
- Base accuracy is also important.
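The diversity measure used in the evaluation is not given on the slides. As one possible stand-in, the sketch below scores an ensemble by the mean pairwise disagreement between its base clusterings, where disagreement is the fraction of object pairs that one clustering groups together and the other separates. Both function names and the example ensemble are hypothetical.

```python
from itertools import combinations

import numpy as np

def pairwise_disagreement(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings disagree."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)       # count each unordered pair once
    return float(np.mean(same_a[upper] != same_b[upper]))

def ensemble_diversity(base_clusterings):
    """Mean disagreement over all pairs of base clusterings (assumed measure)."""
    pairs = list(combinations(base_clusterings, 2))
    return sum(pairwise_disagreement(p, q) for p, q in pairs) / len(pairs)

# Hypothetical ensemble of three clusterings of four objects.
ensemble = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
diversity = ensemble_diversity(ensemble)
```

A score of zero means all base clusterings are identical; higher values mean more variety, which by itself says nothing about whether the individual clusterings are accurate.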
16Evaluation - Meta-clustering Algorithms
- Comparison of hierarchical meta-clustering
algorithms based on Jaccard accuracy scores
across all datasets
- Results indicate that choice of integration
strategy is important
- Choice of algorithm may often be domain-specific.
17Implementation
- MachaonClustering Framework
http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html
18Conclusion
- Summary
- Ensemble clustering offers potential to improve
our ability to identify hidden patterns in data.
- To exploit this, appropriate design decisions
must be made
- Sufficient level of diversity in base
clusterings.
- Suitable meta-clustering algorithm.
- Future Work
- Examine relationship between accuracy of ensemble
members and final output.
- Consider alternative integration strategies.
19Contact Details
Derek Greene
Department of Computer Science
Trinity College Dublin, Ireland
Derek.Greene_at_cs.tcd.ie
20IEEE CBMS 2005
Trinity College Dublin June 23-24
The 18th IEEE Symposium on Computer-Based Medical
Systems