Discovering Clusters in High-Dimensional Feature Spaces Using a Dynamic Agglomerative Decimation Clustering Algorithm

Youngser Park
Research Director: Professor Peter Bock
Department of Computer Science, The George Washington University
Agenda
- Problem Description
- Solution Description
- Research Goals
- Preliminary Experiments
- Preliminary Results
Statistical Pattern Recognition
(Diagram from [Jain00]; figure not reproduced here.)
Clustering Analysis

Goal: To discover a reasonable categorization of the data (if one exists)

Data Vectors
- number, type, and scale of features
- number of feature vectors

Interpattern Similarity
- measured by a distance function (e.g., Euclidean, Mahalanobis, etc.)

Classification
- hard vs. fuzzy
- hierarchical vs. partitional
- cluster tendency (reasonableness)
- cluster validity (termination criteria)

Feature Selection
- to identify the most effective subset of the original features

Feature Extraction
- to produce new salient features from one or more input features
Problem Description

Existing clustering algorithms have difficulty finding satisfactory solutions to problems involving:
- non-trivial geometric shapes
- high-dimensional spaces (> 20)
- large sample sizes (e.g., millions)
- high noise levels
- unspecified termination criteria
- data with different units

The User's Dilemma:
- How many clusters?
- What (and how many) features?
- How to handle a large data set?
Related Work

Partitional Algorithms
- Approaches
  - squared error vs. graph-based
  - hard vs. fuzzy
- Pseudo-code
    set k = desired number of clusters
    select initial centroids μ1, μ2, ..., μk
    repeat until no change in μi:
        assign each of the n samples to its nearest μi
        recompute each μi
    end
- Pros and cons
  - O(kn)
  - needs k+1 parameters (k and the initial μi)
  - sensitive to the initial parameters
  - may converge to a local minimum

Hierarchical Algorithms
- Approaches
  - agglomerative vs. divisive
  - single-link vs. complete-link vs. Ward
- Pseudo-code
    set k = desired number of clusters
    assign each data vector to its own cluster
    repeat until k clusters remain:
        merge the closest pair of clusters
        recompute the distances
    end
- Pros and cons
  - needs only one parameter, k
  - O(n²)

(A minimal runnable sketch of both loops follows below.)
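The two pseudo-code outlines above can be made concrete with a short sketch. The following Python code is an illustrative, naive implementation (NumPy, the function names, and the random initialization are my own choices, not part of the original presentation): a minimal k-means loop and a minimal agglomerative merge loop.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Partitional clustering: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # select mu_1 ... mu_k
    for _ in range(max_iter):
        # assign each of the n samples to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each mu_i; keep the old centroid if a cluster went empty
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):  # no change in mu_i
            break
        centroids = new_centroids
    return labels, centroids

def agglomerative(X, k):
    """Hierarchical clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[i] for i in range(len(X))]        # each vector starts as its own cluster
    while len(clusters) > k:                       # repeat until k clusters remain
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best                             # merge the closest cluster pair
        clusters[a].extend(clusters.pop(b))
    return clusters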
Related Work

Partitional
- k-means [MacQueen67]
- ISODATA [Ball65]
- MST [Zhan77]
- Leader [Hartigan75]
- Vector Quantization [Gray84]

Hierarchical
- Single-link [Sneath73]
- Complete-link [King67]
- Ward [Ward63]
- Dendritic [Bock99]

Applications: image processing, bioinformatics, data mining, etc.

Which method? (No Free Lunch Theorem)
Dendritic Clusterer

Hierarchical agglomerative algorithm [Bock99], based on quantization and reduction.

Pros
- robust to noise (due to the reduction step)

Cons
- requires M-dimensional histogram binning, which causes quantization error
- does not yield cluster membership
- only valid with dimensionless data
Dendritic Clustering

(Animated sequence of figures, not reproduced here, tracing the dendritic merge steps side by side for the Euc metric, labeled "BC Euc 1.00", and the EucLog metric, labeled "BA EucLog 1.34". The intermediate merges differ noticeably between the two metrics ("Notice the difference!"), but the final clusterings are the same ("Final results are the same!").)
Dendritic Clustering: Test Data

- 2-D data space
- noisy Gaussian mixtures
- 5 Gaussian clusters, precision 200x200
- Note: for higher-dimensional data sets, the parameters of the centroids are repeated systematically.
Performance Criteria and Research Objective

Performance Criteria
- robust to high noise levels
- works in high-dimensional spaces (> 20)
- requires very few parameters
- yields the position and membership of clusters
- requires no parametric assumptions about the data
- eliminates quantization error during clustering

Research Objective: To design and test an unsupervised method for discovering subtle but significant clusters in high-dimensional, noisy, sparsely populated feature spaces, subject to the performance criteria listed above.
Dynamic Agglomerative Decimation (DAD) Clustering Algorithm

An improved version of the Dendritic Clusterer:
- additional distance metrics (Z, ZLog, M, MLog)
- more efficient memory management (dynamically allocated list)
- one-dimensional histogram used for decimation only (S << Q^M)
- no binning required during the clustering phase, hence no quantization error
- designates membership, so classification error can be measured
Dynamic Agglomerative Decimation (DAD) Clustering Algorithm
Distance Metrics
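The formulas behind the Euc, EucLog, Z, ZLog, M, and MLog metrics were given in a figure that is not reproduced in this text. Purely as a hypothetical sketch, assuming that the "Log" suffix denotes a logarithmic compression of the distance and that Z and M denote z-score-normalized Euclidean and Mahalanobis distances, the family might look like this in Python:

import numpy as np

# Hypothetical reconstructions only; the exact definitions used in the talk
# are in a figure that is not reproduced here.

def euc(x, y):
    """Plain Euclidean distance."""
    return np.linalg.norm(x - y)

def euc_log(x, y):
    """Assumed "EucLog": log-compressed Euclidean distance."""
    return np.log1p(np.linalg.norm(x - y))

def z_dist(x, y, mu, sigma):
    """Assumed "Z": Euclidean distance between z-score-normalized vectors."""
    return np.linalg.norm((x - mu) / sigma - (y - mu) / sigma)

def m_dist(x, y, cov):
    """Assumed "M": Mahalanobis distance with covariance matrix cov."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))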
DAD Decimation Phase (optional)

    dynamically allocate the input vectors to DAL
    assign a count of 1 to each vector in DAL
    for each vector in DAL:
        quantize it to precision QD and store it in QDAL
    sort the quantized vectors in QDAL, using each dimension as a key, and store them in SQDAL
    collapse identical vectors in SQDAL, summing their counts, and store a copy of each vector and its count in CSQDAL
    for each vector in CSQDAL:
        if its count < the cut-off threshold C, remove the vector from CSQDAL
    store the remaining vectors in RDAL
    replace each vector in RDAL with the element-wise average of the associated original real-valued vectors in DAL
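A minimal Python sketch of the decimation phase above. Uniform per-dimension binning, a dictionary in place of the DAL/QDAL/SQDAL/CSQDAL/RDAL lists, and the function name are illustrative assumptions, not the dissertation's implementation:

import numpy as np
from collections import defaultdict

def decimate(X, q_precision, cutoff):
    """Quantize the input vectors, collapse identical quantized vectors,
    drop vectors whose bin count falls below the cut-off threshold C, and
    return the averages of the surviving original real-valued vectors."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # quantize each vector to q_precision levels per dimension (precision QD)
    quantized = np.floor((X - lo) / (hi - lo + 1e-12) * q_precision).astype(int)

    # collapse identical quantized vectors and sum their counts
    bins = defaultdict(list)
    for i, key in enumerate(map(tuple, quantized)):
        bins[key].append(i)

    # keep bins whose count is at least C, replacing each surviving bin with
    # the element-wise average of its original real-valued vectors
    kept = [idx for idx in bins.values() if len(idx) >= cutoff]
    reduced = np.array([X[idx].mean(axis=0) for idx in kept])
    counts = np.array([len(idx) for idx in kept])
    return reduced, counts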
DAD Clustering Phase

Algorithm (same as the Dendritic Clusterer):
    repeat until a reasonable number of clusters remains:
        for each vector in RDAL:
            if either this vector or its nearest neighbor is a new cluster:
                find the new distance between them
                find the nearest neighbor
        find the new global minimum distance
        combine the two vectors separated by the minimum distance
        recompute the new centroid and its new count
    end
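A simplified Python sketch of the clustering phase. It stops at a fixed target number of clusters and recomputes all pairwise distances on every pass, rather than using the slide's "reasonable number of clusters" criterion and lazy distance updates; the count-weighted centroid update is my reading of "recompute the new centroid and its new count":

import numpy as np

def dad_cluster(vectors, counts, n_clusters,
                dist=lambda a, b: float(np.linalg.norm(a - b))):
    """Agglomerative merging of the decimated vectors, weighted by their counts."""
    centroids = [np.asarray(v, dtype=float) for v in vectors]
    counts = list(counts)
    while len(centroids) > n_clusters:
        # find the global minimum distance over all pairs
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # combine the pair: count-weighted centroid and summed count
        total = counts[i] + counts[j]
        centroids[i] = (counts[i] * centroids[i] + counts[j] * centroids[j]) / total
        counts[i] = total
        del centroids[j], counts[j]
    return centroids, counts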
Performance Metrics
- Classification error (Ec)
- CPU time, measured separately for each phase
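One plausible way to compute the first metric, assuming labeled data and that a cluster's predicted class is the majority class of its members (the slide does not give the exact formula for Ec):

import numpy as np
from collections import Counter

def classification_error(cluster_labels, true_labels):
    """Fraction of samples whose true class differs from the majority class
    of the cluster they were assigned to."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    errors = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        majority = Counter(members.tolist()).most_common(1)[0][1]
        errors += len(members) - majority   # members outside the majority class
    return errors / len(true_labels)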
Normalization Methods
- z-score normalization (mean 0, standard deviation 1)
- automatic clipping
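A brief sketch of the z-score normalization step; the +/- 3 sigma clipping bound is an assumed choice, since the slide only says "automatic clipping":

import numpy as np

def z_normalize(X, clip_sigma=3.0):
    """Normalize each feature to mean 0 and standard deviation 1, then clip
    outliers (the +/- 3 sigma bound is an assumption, not from the slide)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return np.clip(Z, -clip_sigma, clip_sigma)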
Preliminary Experiment 1: Quantization Precision

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, cut-off threshold C = 1, noise level N = 0, distance metric D = EucLog

Fixed conditions and parameters: data set G = Gaussian, quantization precision Q = 6, cut-off threshold C = 1, noise level N = 0, distance metric D = EucLog
Preliminary Experiment 2: Gaussian Data Reduction

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, noise level N = 0, distance metric D = EucLog

Factor: cut-off threshold C = 1 and 2

Conclusion: A low quantization precision should be chosen!
Preliminary Experiment 3: Gaussian Data Cut-Off

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, noise level N = 0, distance metric D = EucLog

Factor: cut-off threshold C = 1 and 2

Conclusion: C = 1 is enough!
Preliminary Experiment 4: Gaussian Data Noise

Random Noise Test
Fixed conditions and parameters: sample size S = 2,000, data vector dimension M = 50, quantization precision Q = 2, cut-off threshold C = 0

Confusion Noise Test
Fixed conditions and parameters: sample size S = 1,000, data vector dimension M = 2, no decimation

Conclusion: The logarithm factor in the distance metric improves performance.
Preliminary Experiment 4: Classification

Fisher's Iris Data
- 3 classes: setosa, versicolor, virginica
- 4 features
- 50 vectors per class

Fixed conditions and parameters: sample size S = 150, data vector dimension M = 4, noise level N = 0, no decimation, no normalization
Original and Significant Contributions
- New distance metrics
- Robust to high levels of noise
- Works with high-dimensional data
- No binning required for clustering
- New termination criteria