Title: Efficient Algorithms for Non-Parametric Clustering With Clutter
1Efficient Algorithms for Non-Parametric
Clustering With Clutter
- Weng-Keen Wong
- Andrew Moore
2Problems From the Physical Sciences
Minefield detection (Dasgupta and Raftery 1998)
Earthquake faults (Byers and Raftery 1998)
3Problems From the Physical Sciences
(Pereira 2002)
(Sloan Digital Sky Survey 2000)
4A Simplified Example
5Clustering with Traditional Algorithms
Single Linkage Clustering
Mixture of Gaussians with a Uniform Background
Component
6Clustering with CFF
Cuevas-Febrero-Fraiman
Original Dataset
7Related Work
- (Dasgupta and Raftery 98)
  - Mixture model approach: mixture of Gaussians for features, Poisson process for clutter
- (Byers and Raftery 98)
  - K-nearest neighbour distances for all points modeled as a mixture of two gamma distributions, one for clutter and one for the features
  - Classify each data point based on which component it was most likely generated from
8Outline
- 1. Introduction: Clustering and Clutter
- 2. The Cuevas-Febrero-Fraiman Algorithm
- 3. Optimizing Step One of CFF
- 4. Optimizing Step Two of CFF
- 5. Results
9The CFF Algorithm Step One
- Find the high-density datapoints
10The CFF Algorithm Step Two
- Cluster the high-density points using Single Linkage Clustering
- Stop when link length > ε
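Single linkage stopped at a link-length cutoff yields the connected components of the graph that joins every pair of points no farther apart than the cutoff. A minimal brute-force sketch of that idea follows; the function name `epsilon_clusters`, the parameter `eps`, and the union-find helper are illustrative assumptions, not code from the talk.

```python
import math

def find(parent, i):
    # Follow parent links to the root, compressing the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def epsilon_clusters(points, eps):
    """Label points so that two points share a label exactly when they are
    linked by a chain of hops, each of length <= eps (brute force, O(N^2))."""
    n = len(points)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[ri] = rj
    return [find(parent, i) for i in range(n)]
```

The quadratic loop is only for clarity; the later slides are about avoiding exactly this cost.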
11The CFF Algorithm
- Originally intended to estimate the number of clusters
- Can also be used to find clusters against a noisy background
12Step One Non-Parametric Density Estimator
- A datapoint is a high-density datapoint if the number of datapoints within a hypersphere of radius h is > threshold c
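The rule above can be sketched as a brute-force neighbour count. The function name `high_density_points` is hypothetical, and counting a point as its own neighbour is an assumption of this sketch; the slide does not say whether the centre point is included.

```python
import math

def high_density_points(points, h, c):
    """Return the points that have more than c datapoints (themselves
    included, by assumption) within a hypersphere of radius h."""
    keep = []
    for p in points:
        count = sum(1 for q in points if math.dist(p, q) <= h)
        if count > c:
            keep.append(p)
    return keep
```

For example, with four points at the corners of a unit square and one far-away point, `high_density_points(pts, 1.5, 2)` keeps the four corners and drops the isolated point.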
13Speeding up the Non-Parametric Density Estimator
- Addressed in a separate paper (Gray and Moore 2001)
- Two basic ideas
  - 1. Use a dual tree algorithm (Gray and Moore 2000)
  - 2. Cut search off early without computing exact densities (Moore 2000)
14Step Two Euclidean Minimum Spanning Trees (EMSTs)
- Traditional MST algorithms assume you are given all the distances
- Implies O(N^2) memory usage
- Want to use a Euclidean Minimum Spanning Tree algorithm
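To make the memory cost concrete, here is a minimal sketch of Kruskal's algorithm run on the complete distance graph. It materialises all N(N-1)/2 edges before sorting them, which is the quadratic footprint the slide refers to. The function name `kruskal_emst` is an illustrative assumption.

```python
import math
from itertools import combinations

def kruskal_emst(points):
    """Kruskal's MST over every pairwise Euclidean distance.
    The edge list alone holds N*(N-1)/2 entries."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # edge joins two different components
            parent[ri] = rj
            tree.append((i, j, d))
            if len(tree) == n - 1:
                break
    return tree
```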
15Optimizing Clustering Step
- Exploit recent results in computational geometry for efficient EMSTs
- Involves modification to GeoMST2 algorithm by (Narasimhan et al 2000)
- GeoMST2 is based on Well-Separated Pairwise Decompositions (WSPDs) (Callahan 1995)
- Our optimizations gain an order of magnitude speedup, especially in higher dimensions
16Outline for Optimizing Step Two
- 1. High level overview of GeoMST2
- 2. Example of a WSPD
- 3. More detailed description of GeoMST2
- 4. Our optimizations
17Intuition behind GeoMST2
18Intuition behind GeoMST2
19High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
20High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
   Each pair (Ai,Bi) represents a possible edge in the MST
21High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
2. Take the pair (Ai,Bi) that corresponds to the shortest edge
3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.
22A Well-Separated Pair (Callahan 1995)
- Let A and B be point sets in ℝ^d
- Let R_A and R_B be their respective bounding hyper-rectangles
- Define MargDistance(A,B) to be the minimum distance between R_A and R_B
23A Well-Separated Pair (Cont)
- The point sets A and B are considered to be well-separated if
  MargDistance(A,B) ≥ max{Diam(R_A), Diam(R_B)}
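A minimal sketch of the well-separated test, assuming the criterion reads MargDistance(A,B) ≥ max{Diam(R_A), Diam(R_B)} and taking Diam of a hyper-rectangle to be the length of its diagonal. Both readings are assumptions here, and the helper names are illustrative.

```python
import math

def bounding_box(pts):
    # Axis-aligned bounding hyper-rectangle as (low corner, high corner).
    dims = range(len(pts[0]))
    lo = tuple(min(p[k] for p in pts) for k in dims)
    hi = tuple(max(p[k] for p in pts) for k in dims)
    return lo, hi

def diam(box):
    # Assumed definition: length of the box diagonal.
    lo, hi = box
    return math.dist(lo, hi)

def marg_distance(box_a, box_b):
    # Minimum distance between two boxes: per-dimension gap, 0 if they overlap.
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = [max(lo_b[k] - hi_a[k], lo_a[k] - hi_b[k], 0.0)
            for k in range(len(lo_a))]
    return math.hypot(*gaps)

def well_separated(a, b):
    ra, rb = bounding_box(a), bounding_box(b)
    return marg_distance(ra, rb) >= max(diam(ra), diam(rb))
```

Two unit-square-sized sets four units apart pass the test; the same sets one unit apart do not, because the gap is smaller than either diagonal.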
24A Well-Separated Pairwise Decomposition
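Pair 1: ({0}, {1})
Pair 2: ({0,1}, {2})
Pair 3: ({0,1,2}, {3,4})
Pair 4: ({3}, {4})
The set of pairs ({0},{1}), ({0,1},{2}), ({0,1,2},{3,4}), ({3},{4}) forms a Well-Separated Pairwise Decomposition: every pair of distinct points is covered by exactly one pair of sets.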
25The Size of a WSPD
A WSPD: (A1,B1), (A2,B2), ..., (Am,Bm)
If there are n points, a WSPD can be constructed with O(n) pairs using a fair split tree (Callahan 1995)
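The construction can be sketched recursively: split the point set at the midpoint of the longest side of its bounding box, and for each split emit pairs of subsets, subdividing whichever side has the larger diameter until the two sides are well-separated. This is a simplified illustration assuming distinct points, the diagonal as Diam, and the ≥ form of the well-separated test. It recomputes bounding boxes on every call and makes no attempt to match the running time of Callahan's construction. All names are hypothetical.

```python
import math

def bbox(pts):
    dims = range(len(pts[0]))
    return (tuple(min(p[k] for p in pts) for k in dims),
            tuple(max(p[k] for p in pts) for k in dims))

def diam(pts):
    lo, hi = bbox(pts)
    return math.dist(lo, hi)

def marg_distance(a, b):
    (lo_a, hi_a), (lo_b, hi_b) = bbox(a), bbox(b)
    return math.hypot(*(max(lo_b[k] - hi_a[k], lo_a[k] - hi_b[k], 0.0)
                        for k in range(len(lo_a))))

def well_separated(a, b):
    return marg_distance(a, b) >= max(diam(a), diam(b))

def split(pts):
    # Cut the bounding box at the midpoint of its longest side.
    lo, hi = bbox(pts)
    k = max(range(len(lo)), key=lambda d: hi[d] - lo[d])
    mid = (lo[k] + hi[k]) / 2.0
    return ([p for p in pts if p[k] <= mid],
            [p for p in pts if p[k] > mid])

def find_pairs(a, b, out):
    if well_separated(a, b):
        out.append((a, b))
        return
    if diam(a) < diam(b):         # always subdivide the larger set
        a, b = b, a
    left, right = split(a)
    find_pairs(left, b, out)
    find_pairs(right, b, out)

def wspd(pts, out=None):
    """Return a list of well-separated pairs covering every pair of points once."""
    if out is None:
        out = []
    if len(pts) > 1:
        left, right = split(pts)
        find_pairs(left, right, out)
        wspd(left, out)
        wspd(right, out)
    return out
```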
26High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
2. Take the pair (Ai,Bi) that corresponds to the shortest edge
3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.
27Bichromatic Closest Pair Distance
- Given two sets (Ai,Bi), the Bichromatic Closest Pair (BCP) Distance is the closest distance from a point in Ai to a point in Bi
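As a point of reference, the definition above can be computed by brute force over all cross pairs. The function name `bcp` and its return shape (distance plus the two endpoints) are illustrative choices, not taken from GeoMST2.

```python
import math

def bcp(a, b):
    """Bichromatic closest pair by brute force: the smallest distance from a
    point in a to a point in b, returned as (distance, point_a, point_b)."""
    return min((math.dist(p, q), p, q) for p in a for q in b)
```

For `a = [(0, 0), (1, 0)]` and `b = [(3, 0), (5, 0)]`, the closest cross pair is `(1, 0)` and `(3, 0)` at distance 2.0.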
28High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
2. Take the pair (Ai,Bi) with the shortest BCP distance
3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.
29GeoMST2 Example Start
Current MST
30GeoMST2 Example Iteration 1
Current MST
31GeoMST2 Example Iteration 2
Current MST
32GeoMST2 Example Iteration 3
Current MST
33GeoMST2 Example Iteration 4
Current MST
34High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
2. Take the pair (Ai,Bi) with the shortest BCP distance
3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.
Modification for CFF: if the BCP distance is > ε, terminate
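The three steps, with the early-termination cutoff, can be sketched as a priority-queue loop. This is a simplified illustration, not the GeoMST2 implementation: it takes the list of pairs as an input, computes every BCP eagerly by brute force, and checks connectivity only for the two BCP endpoints. The names `geomst2_sketch`, `pairs`, and `eps` are assumptions.

```python
import heapq
import math

def bcp(a, b):
    # Brute-force bichromatic closest pair: (distance, point_a, point_b).
    return min((math.dist(p, q), p, q) for p in a for q in b)

def geomst2_sketch(points, pairs, eps=math.inf):
    """Process pairs in order of BCP distance, adding a BCP edge whenever it
    joins two components. Stops once the shortest remaining BCP exceeds eps."""
    index = {p: i for i, p in enumerate(points)}
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    heap = [bcp(a, b) for a, b in pairs]
    heapq.heapify(heap)
    edges = []
    while heap:
        d, p, q = heapq.heappop(heap)
        if d > eps:               # modification for CFF: terminate early
            break
        rp, rq = find(index[p]), find(index[q])
        if rp != rq:
            parent[rp] = rq
            edges.append((p, q, d))
    return edges
```

With `eps` left at infinity the loop returns a spanning tree; with a finite `eps` it returns a forest whose connected components are the clusters.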
35Optimizations
- We don't need the EMST
- We just need to cluster all points that are within ε distance or less from each other
- Allows two optimizations to GeoMST2 code
36High Level Overview of GeoMST2
Optimizations take place in Step 1
1. Create the Well-Separated Pairwise Decomposition:
   (A1,B1), (A2,B2), ..., (Am,Bm)
2. Take the pair (Ai,Bi) with the shortest BCP distance
3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.
37Optimization 1 Illustration
38Optimization 1
- Ignore all links that are > ε
- Every pair (Ai, Bi) in the WSPD becomes an edge unless it joins two already connected components
- If MargDistance(Ai,Bi) > ε, then an edge of length ≤ ε cannot exist between a point in Ai and a point in Bi
- Don't include such a pair in the WSPD
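A minimal sketch of this pruning rule, written as a filter over an existing list of pairs for clarity. In the talk the check is applied while the WSPD is being built, so that pruned pairs are never created. The function name `prune_pairs` and the parameter `eps` are hypothetical.

```python
import math

def bbox(pts):
    dims = range(len(pts[0]))
    return (tuple(min(p[k] for p in pts) for k in dims),
            tuple(max(p[k] for p in pts) for k in dims))

def marg_distance(a, b):
    # Minimum distance between the bounding boxes of a and b.
    (lo_a, hi_a), (lo_b, hi_b) = bbox(a), bbox(b)
    return math.hypot(*(max(lo_b[k] - hi_a[k], lo_a[k] - hi_b[k], 0.0)
                        for k in range(len(lo_a))))

def prune_pairs(pairs, eps):
    """Drop every pair whose bounding boxes are more than eps apart:
    no point of one set can be within eps of a point of the other."""
    return [(a, b) for a, b in pairs if marg_distance(a, b) <= eps]
```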
39Optimization 2 Illustration
40Optimization 2
- Join all elements that are within ε distance of each other
- If the max distance separating the bounding hyper-rectangles of Ai and Bi is ≤ ε, then join all the points in Ai and Bi if they are not already connected
- Do not add such a pair (Ai,Bi) to the WSPD
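The second rule can be sketched with a farthest-corner distance between the two bounding boxes: if even that is within the cutoff, every point of one set is within the cutoff of every point of the other, so both sets collapse into one component. The names `max_box_distance` and `join_if_all_close`, and the `parent`/`index` union-find representation, are assumptions of this sketch.

```python
import math

def bbox(pts):
    dims = range(len(pts[0]))
    return (tuple(min(p[k] for p in pts) for k in dims),
            tuple(max(p[k] for p in pts) for k in dims))

def max_box_distance(a, b):
    # Largest possible distance between a point in box(a) and one in box(b).
    (lo_a, hi_a), (lo_b, hi_b) = bbox(a), bbox(b)
    return math.hypot(*(max(hi_b[k] - lo_a[k], hi_a[k] - lo_b[k])
                        for k in range(len(lo_a))))

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def join_if_all_close(a, b, eps, parent, index):
    """If every cross pair is guaranteed to be within eps, merge all points of
    a and b into one component and report True (the pair can then be skipped)."""
    if max_box_distance(a, b) > eps:
        return False
    root = find(parent, index[a[0]])
    for p in a[1:] + b:
        parent[find(parent, index[p])] = root
    return True
```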
41Implications of the optimizations
- Reduce the amount of time spent in creating the WSPD
- Reduce the number of pairs in the WSPD, thereby speeding up the GeoMST2 algorithm by reducing the size of the priority queue
42Results
- Ran step two algorithms on subsets of the Sloan Digital Sky Survey
- Compared Kruskal, GeoMST2, and ε-clustering
- 7 attributes: 4 colors, 2 sky coordinates, 1 redshift value
43Results (GeoMST2 vs ε-Clustering vs Kruskal in 4D)
44Results (GeoMST2 vs ε-Clustering in 3D)
45Results (GeoMST2 vs ε-Clustering in 4D)
46Results (Change in Time as ε changes for 4D data)
47Results (Increasing Dimensions vs Time)
48Conclusions
- ε-clustering outperforms GeoMST2 by nearly an order of magnitude in higher dimensions
- Combining the optimizations in both steps will yield an efficient algorithm for clustering against clutter on massive data sets