Discovering Clusters in High-Dimensional Feature Spaces Using a Dynamic Agglomerative Decimation Clustering Algorithm

Youngser Park
Research Director: Professor Peter Bock
Department of Computer Science, The George Washington University
Agenda
- Problem Description
- Solution Description
- Research Goals
- Preliminary Experiments
- Preliminary Results
Statistical Pattern Recognition
(Diagram from [Jain00]; figure not reproduced here.)
Clustering Analysis

Goal: To discover a reasonable categorization of the data (if one exists)

Data Vectors
- number, type, and scale of features
- number of feature vectors

Interpattern Similarity
- measured by a distance function (e.g., Euclidean, Mahalanobis, etc.)

Classification
- hard vs. fuzzy
- hierarchical vs. partitional
- cluster tendency (reasonableness)
- cluster validity (termination criteria)

Feature Selection
- to identify the most effective subset of the original features

Feature Extraction
- to produce new salient features from one or more input features
Problem Description

Existing clustering algorithms have difficulty finding satisfactory solutions to problems involving:
- non-trivial geometric shapes
- high-dimensional spaces (> 20)
- large sample sizes (e.g., millions)
- high noise levels
- unspecified termination criteria
- data with different units

The User's Dilemma:
- How many clusters?
- What (and how many) features?
- How to handle a large data set?
Related Work

Partitional Algorithms
- Approaches
  - squared error vs. graph-based
  - hard vs. fuzzy
- Pseudo-code
    set k = desired number of clusters
    select initial centroids μ1, μ2, ..., μk
    repeat until no change in μi:
        assign each of the n samples to its nearest μi
        recompute each μi
    end
- Pros and cons
  - O(kn)
  - needs k+1 parameters (k and the initial μi)
  - sensitive to the initial parameters
  - may converge to a local minimum

Hierarchical Algorithms
- Approaches
  - agglomerative vs. divisive
  - single-link vs. complete-link vs. Ward
- Pseudo-code
    set k = desired number of clusters
    assign each data vector to its own cluster
    repeat until k clusters remain:
        merge the closest pair of clusters
        recompute the distances
    end
- Pros and cons
  - needs only one parameter, k
  - O(n²)

(A minimal runnable sketch of both loops follows below.)
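The two pseudo-code outlines above can be made concrete with a short sketch. The following Python code is an illustrative, naive implementation (NumPy, the function names, and the random initialization are my own choices, not part of the original presentation): a minimal k-means loop and a minimal agglomerative merge loop.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Partitional clustering: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # select mu_1 ... mu_k
    for _ in range(max_iter):
        # assign each of the n samples to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each mu_i; keep the old centroid if a cluster went empty
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):  # no change in mu_i
            break
        centroids = new_centroids
    return labels, centroids

def agglomerative(X, k):
    """Hierarchical clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[i] for i in range(len(X))]        # each vector starts as its own cluster
    while len(clusters) > k:                       # repeat until k clusters remain
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best                             # merge the closest cluster pair
        clusters[a].extend(clusters.pop(b))
    return clusters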
Related Work

Partitional
- k-means [MacQueen67]
- ISODATA [Ball65]
- MST [Zhan77]
- Leader [Hartigan75]
- Vector Quantization [Gray84]

Hierarchical
- Single-link [Sneath73]
- Complete-link [King67]
- Ward [Ward63]
- Dendritic [Bock99]

Applications: image processing, bioinformatics, data mining, etc.

Which method? (No Free Lunch Theorem)
Dendritic Clusterer

Hierarchical agglomerative algorithm [Bock99], based on quantization and reduction.

Pros
- robust to noise (due to the reduction step)

Cons
- requires M-dimensional histogram binning, which causes quantization error
- does not yield cluster membership
- only valid with dimensionless data
Dendritic Clustering

(Animated sequence of figures, not reproduced here, tracing the dendritic merge steps side by side for the Euc metric, labeled "BC Euc 1.00", and the EucLog metric, labeled "BA EucLog 1.34". The intermediate merges differ noticeably between the two metrics ("Notice the difference!"), but the final clusterings are the same ("Final results are the same!").)
Dendritic Clustering: Test Data

- 2-D data space
- noisy Gaussian mixtures
- 5 Gaussian clusters, precision 200x200
- Note: for higher-dimensional data sets, the parameters of the centroids are repeated systematically.
Performance Criteria and Research Objective

Performance Criteria
- robust to high noise levels
- works in high-dimensional spaces (> 20)
- requires very few parameters
- yields the position and membership of clusters
- requires no parametric assumptions about the data
- eliminates quantization error during clustering

Research Objective: To design and test an unsupervised method for discovering subtle but significant clusters in high-dimensional, noisy, sparsely populated feature spaces, subject to the performance criteria listed above.
Dynamic Agglomerative Decimation (DAD) Clustering Algorithm

An improved version of the Dendritic Clusterer:
- additional distance metrics (Z, ZLog, M, MLog)
- more efficient memory management (dynamically allocated list)
- one-dimensional histogram used for decimation only (S << Q^M)
- no binning required during the clustering phase, hence no quantization error
- designates membership, so classification error can be measured
Dynamic Agglomerative Decimation (DAD) Clustering Algorithm
Distance Metrics
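The formulas behind the Euc, EucLog, Z, ZLog, M, and MLog metrics were given in a figure that is not reproduced in this text. Purely as a hypothetical sketch, assuming that the "Log" suffix denotes a logarithmic compression of the distance and that Z and M denote z-score-normalized Euclidean and Mahalanobis distances, the family might look like this in Python:

import numpy as np

# Hypothetical reconstructions only; the exact definitions used in the talk
# are in a figure that is not reproduced here.

def euc(x, y):
    """Plain Euclidean distance."""
    return np.linalg.norm(x - y)

def euc_log(x, y):
    """Assumed "EucLog": log-compressed Euclidean distance."""
    return np.log1p(np.linalg.norm(x - y))

def z_dist(x, y, mu, sigma):
    """Assumed "Z": Euclidean distance between z-score-normalized vectors."""
    return np.linalg.norm((x - mu) / sigma - (y - mu) / sigma)

def m_dist(x, y, cov):
    """Assumed "M": Mahalanobis distance with covariance matrix cov."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))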
DAD Decimation Phase (optional)

    dynamically allocate the input vectors to DAL
    assign a count of 1 to each vector in DAL
    for each vector in DAL:
        quantize it to precision QD and store it in QDAL
    sort the quantized vectors in QDAL, using each dimension as a key, and store them in SQDAL
    collapse identical vectors in SQDAL, summing their counts, and store a copy of each vector and its count in CSQDAL
    for each vector in CSQDAL:
        if its count < the cut-off threshold C, remove the vector from CSQDAL
    store the remaining vectors in RDAL
    replace each vector in RDAL with the element-wise average of the associated original real-valued vectors in DAL
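A minimal Python sketch of the decimation phase above. Uniform per-dimension binning, a dictionary in place of the DAL/QDAL/SQDAL/CSQDAL/RDAL lists, and the function name are illustrative assumptions, not the dissertation's implementation:

import numpy as np
from collections import defaultdict

def decimate(X, q_precision, cutoff):
    """Quantize the input vectors, collapse identical quantized vectors,
    drop vectors whose bin count falls below the cut-off threshold C, and
    return the averages of the surviving original real-valued vectors."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # quantize each vector to q_precision levels per dimension (precision QD)
    quantized = np.floor((X - lo) / (hi - lo + 1e-12) * q_precision).astype(int)

    # collapse identical quantized vectors and sum their counts
    bins = defaultdict(list)
    for i, key in enumerate(map(tuple, quantized)):
        bins[key].append(i)

    # keep bins whose count is at least C, replacing each surviving bin with
    # the element-wise average of its original real-valued vectors
    kept = [idx for idx in bins.values() if len(idx) >= cutoff]
    reduced = np.array([X[idx].mean(axis=0) for idx in kept])
    counts = np.array([len(idx) for idx in kept])
    return reduced, counts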
DAD Clustering Phase

Algorithm (same as the Dendritic Clusterer):
    repeat until a reasonable number of clusters remains:
        for each vector in RDAL:
            if either this vector or its nearest neighbor is a new cluster:
                find the new distance between them
                find the nearest neighbor
        find the new global minimum distance
        combine the two vectors separated by the minimum distance
        recompute the new centroid and its new count
    end
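A simplified Python sketch of the clustering phase. It stops at a fixed target number of clusters and recomputes all pairwise distances on every pass, rather than using the slide's "reasonable number of clusters" criterion and lazy distance updates; the count-weighted centroid update is my reading of "recompute the new centroid and its new count":

import numpy as np

def dad_cluster(vectors, counts, n_clusters,
                dist=lambda a, b: float(np.linalg.norm(a - b))):
    """Agglomerative merging of the decimated vectors, weighted by their counts."""
    centroids = [np.asarray(v, dtype=float) for v in vectors]
    counts = list(counts)
    while len(centroids) > n_clusters:
        # find the global minimum distance over all pairs
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # combine the pair: count-weighted centroid and summed count
        total = counts[i] + counts[j]
        centroids[i] = (counts[i] * centroids[i] + counts[j] * centroids[j]) / total
        counts[i] = total
        del centroids[j], counts[j]
    return centroids, counts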
Performance Metrics
- Classification error (Ec)
- CPU time, measured separately for each phase
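One plausible way to compute the first metric, assuming labeled data and that a cluster's predicted class is the majority class of its members (the slide does not give the exact formula for Ec):

import numpy as np
from collections import Counter

def classification_error(cluster_labels, true_labels):
    """Fraction of samples whose true class differs from the majority class
    of the cluster they were assigned to."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    errors = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        majority = Counter(members.tolist()).most_common(1)[0][1]
        errors += len(members) - majority   # members outside the majority class
    return errors / len(true_labels)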
Normalization Methods
- z-score normalization (mean 0, standard deviation 1)
- automatic clipping
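A brief sketch of the z-score normalization step; the +/- 3 sigma clipping bound is an assumed choice, since the slide only says "automatic clipping":

import numpy as np

def z_normalize(X, clip_sigma=3.0):
    """Normalize each feature to mean 0 and standard deviation 1, then clip
    outliers (the +/- 3 sigma bound is an assumption, not from the slide)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return np.clip(Z, -clip_sigma, clip_sigma)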
Preliminary Experiment 1: Quantization Precision

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, cut-off threshold C = 1, noise level N = 0, distance metric D = EucLog

Fixed conditions and parameters: data set G = Gaussian, quantization precision Q = 6, cut-off threshold C = 1, noise level N = 0, distance metric D = EucLog
Preliminary Experiment 2: Gaussian Data Reduction

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, noise level N = 0, distance metric D = EucLog

Factor: cut-off threshold C = 1 and 2

Conclusion: A low quantization precision should be chosen!
Preliminary Experiment 3: Gaussian Data Cut-Off

Fixed conditions and parameters: data set G = Gaussian, sample size S = 100,000, noise level N = 0, distance metric D = EucLog

Factor: cut-off threshold C = 1 and 2

Conclusion: C = 1 is enough!
Preliminary Experiment 4: Gaussian Data Noise

Random Noise Test
Fixed conditions and parameters: sample size S = 2,000, data vector dimension M = 50, quantization precision Q = 2, cut-off threshold C = 0

Confusion Noise Test
Fixed conditions and parameters: sample size S = 1,000, data vector dimension M = 2, no decimation

Conclusion: The logarithm factor in the distance metric improves performance.
Preliminary Experiment 4: Classification

Fisher's Iris Data
- 3 classes: setosa, versicolor, virginica
- 4 features
- 50 vectors per class

Fixed conditions and parameters: sample size S = 150, data vector dimension M = 4, noise level N = 0, no decimation, no normalization
Original and Significant Contributions
- New distance metrics
- Robust to high levels of noise
- Works with high-dimensional data
- No binning required for clustering
- New termination criteria