Title: Clustering of Uncertain data objects by Voronoi-diagram-based approach
1Clustering of Uncertain data objects by
Voronoi-diagram-based approach
- Speaker: Chan Kai Fong, Paul
- Dept of CS, HKU
2Presentation Outline
- Introduction
- Concept of clustering, clustering of uncertain objects
- Example application of clustering on uncertain data
- UK-means algorithm
- Motivation
- Voronoi-diagram-based (VD) clustering
- MinMax-based (MM) clustering
- VD is strictly better than MinMax
- Clustering algorithms
- VDBi, VDBiP, VD-based methods with Cluster Shift
- When are VD-based methods better than MM-based methods?
- Experiments
- Conclusion
3Introduction
4Introduction
- Clustering
- Group similar data objects together to form clusters
- Partition-based clustering
- Input: number of clusters (k), number of objects (n)
- Iterative method
- In each iteration, divide the n data objects into k groups to minimize an objective function
- e.g., minimize the sum of squared distances
- Stop when the result converges
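The iterative procedure above can be sketched in Python for ordinary (certain) 2D points. This is only a minimal illustration, assuming Euclidean distance, the sum-of-squares objective and initial representatives supplied by the caller; the function name kmeans and its parameters are hypothetical and not taken from the talk.

```python
import math

def kmeans(points, reps, max_iter=100):
    """Plain K-means on 2D points; reps holds the k initial representatives."""
    reps = list(reps)
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest representative.
        new_assign = [
            min(range(len(reps)), key=lambda j: math.dist(p, reps[j]))
            for p in points
        ]
        if new_assign == assign:  # converged: no point changed cluster
            break
        assign = new_assign
        # Update step: move each representative to the mean of its members.
        for j in range(len(reps)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old representative if the cluster is empty
                reps[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assign, reps
```

Each pass performs n × k distance evaluations in the assignment step, which is the part that becomes expensive once objects are uncertain.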
5Introduction
- To cluster data points in 2D space
- Data objects: n data points
- Apply any partition-based clustering algorithm (e.g., K-means)
- Distance measure: Euclidean distance, Manhattan distance, etc.
6Introduction
- To cluster uncertain objects in 2D space
- Uncertain objects: objects with uncertainty (e.g., location uncertainty)
- No fixed coordinates in 2D space
- An object's location is described by a probability density function (pdf) over an uncertainty region
- Assume the pdf of each object can be obtained
- Uncertainty region (ur): the region in which the object may appear, following a certain probability distribution; the probability that the object appears outside its uncertainty region is zero
- Each object may have an irregular uncertainty region, and its pdf may be arbitrary
(Figure: uncertainty region o1.ur and the MBR of o1.ur)
7Expected distance computation
- The expected distance (ED) measures the distance between an uncertain object and a cluster representative.
- ED(oi, pj) = ∫ d(x, pj) f(x) dx, integrated over all x in oi's uncertainty region
- ED is the expected distance function, d is the Euclidean distance function, x is any point inside oi's uncertainty region, f is the pdf of the uncertain object oi, and pj is any cluster representative.
- ED computations are very expensive; each iteration of K-means requires n × k of them.
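As a rough illustration of why ED is costly, the sketch below approximates it by a weighted sum over sample points of the object's pdf. It assumes the pdf has already been discretized into (point, weight) pairs whose weights sum to 1; the name expected_distance and the sample layout are hypothetical, not from the talk.

```python
import math

def expected_distance(samples, rep):
    """Approximate ED(o_i, p_j) from weighted sample points.

    samples: list of ((x, y), weight) pairs covering the uncertainty
    region of o_i; the weights discretize the pdf and should sum to 1.
    rep: the (x, y) position of the cluster representative p_j.
    """
    return sum(w * math.dist(x, rep) for x, w in samples)

# An object whose pdf is uniform over four sample points.
obj = [((0.0, 0.0), 0.25), ((2.0, 0.0), 0.25),
       ((0.0, 2.0), 0.25), ((2.0, 2.0), 0.25)]
print(expected_distance(obj, (1.0, 1.0)))
```

With s sample points per object, one iteration costs on the order of n × k × s distance evaluations, which is what the pruning methods try to avoid.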
8Application: Clustering the vehicles
- Objective: find traffic patterns by clustering vehicles in a city
- Data objects: vehicles on a 2D map
- Uncertainty: location uncertainty of the vehicles; each pdf, defined over the object's uncertainty region, represents the probability distribution of a vehicle's possible location over a certain period of time
9
- Degree of uncertainty is affected by the following factors:
- Time
- Traffic on the roads
- Shape of the roads
- Speed of the vehicles
10Results
11UK-means
- UK-means: the first extension of the K-means algorithm to handle uncertain objects
- Distance measure: expected distance (ED)
- Disadvantage: slow and inefficient
- Shows that K-means can be adapted to cluster uncertain objects
12Two Approaches to solve clustering problem by
UK-means
- MinMax-based approach (Jacky)
- Voronoi-Diagram-based approach (Paul)
13Motivation
14Two Approaches to solve clustering problem by
UK-means
- MinMax-based approach (Jacky)
- Basic MinMax distance pruning (MinMax)
- MinMax with pre-computation of ED
- MinMax with Cluster Shift (MinMax-Shift)
- Voronoi-Diagram-based approach (Paul)
- Voronoi diagram with Bisector Pruning (VDBi)
- Voronoi diagram with Bisector Pruning and Partial expected distance computations (VDBiP)
- Voronoi diagram with Bisector Pruning and Cluster Shift (VDBi-Shift)
- Voronoi diagram with Bisector Pruning, Partial expected distance computations and Cluster Shift (VDBiP-Shift)
15MinMax-based Approach
- UK-means with MinMax distance pruning
- Objective: avoid expected distance computations
- Use mindist and maxdist between an object's MBR and the cluster representatives as lower and upper bounds on ED(cj, oi) and ED(cm, oi)
- E.g., given an object oi and cluster representatives cj and cm:
- if mindist(cj, oi) > maxdist(cm, oi), then ED(cj, oi) > ED(cm, oi), so cj can be pruned and ED(cj, oi) need not be calculated.
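The MinMax rule can be sketched as follows. This is an illustrative Python version, assuming each MBR is an axis-aligned rectangle given as (xlo, ylo, xhi, yhi) and each representative is a 2D point; the function names are hypothetical.

```python
import math

def mindist(rep, mbr):
    """Smallest distance from point rep to rectangle mbr = (xlo, ylo, xhi, yhi)."""
    xlo, ylo, xhi, yhi = mbr
    dx = max(xlo - rep[0], 0.0, rep[0] - xhi)
    dy = max(ylo - rep[1], 0.0, rep[1] - yhi)
    return math.hypot(dx, dy)

def maxdist(rep, mbr):
    """Largest distance from rep to mbr, attained at the farthest corner."""
    xlo, ylo, xhi, yhi = mbr
    dx = max(abs(rep[0] - xlo), abs(rep[0] - xhi))
    dy = max(abs(rep[1] - ylo), abs(rep[1] - yhi))
    return math.hypot(dx, dy)

def minmax_candidates(reps, mbr):
    """Indices of representatives that survive MinMax pruning."""
    # The smallest maxdist is an upper bound on the ED of the winning cluster.
    best_upper = min(maxdist(c, mbr) for c in reps)
    # c_j is pruned when mindist(c_j, o_i) > maxdist(c_m, o_i) for some c_m.
    return [j for j, c in enumerate(reps) if mindist(c, mbr) <= best_upper]
```

ED then only has to be computed for the surviving indices; if exactly one index survives, the object is assigned without any ED computation.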
16MinMax-based Approach
- The upper and lower bounds can be tightened by the Cluster Shift (CS) and ED Pre-computation (PC) methods
- These replace the loose mindist and maxdist estimates with tighter distance bounds
- For details, refer to Jacky's work
17Voronoi-diagram-based approach
(Figure: uncertain object o1 indexed by an R-tree; Voronoi diagram for 5 cluster representatives)
- Each object's uncertainty region is bounded by its minimum bounding rectangle (MBR)
- The objects' MBRs are indexed by an R-tree
- A Voronoi diagram is constructed for the cluster representatives in each iteration
18Voronoi-diagram-based approach
- If the bisector of two cluster representatives p1 and p2 does not cut an object's MBR, and the MBR falls on p2's side of the bisector, then
- ED(p1, o1) > ED(p2, o1)
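One way to test this condition without constructing the bisector explicitly is to compare corner distances. The sketch below assumes an axis-aligned MBR given as (xlo, ylo, xhi, yhi); the function name and return convention are hypothetical.

```python
def mbr_side_of_bisector(p1, p2, mbr):
    """Return 1 or 2 if mbr lies wholly on p1's or p2's side, else 0.

    mbr = (xlo, ylo, xhi, yhi). Each side of the bisector is a half-plane,
    so the rectangle lies inside it exactly when all four corners do.
    """
    xlo, ylo, xhi, yhi = mbr
    corners = [(xlo, ylo), (xlo, yhi), (xhi, ylo), (xhi, yhi)]

    def sq(a, b):  # squared Euclidean distance, avoids the square root
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    if all(sq(c, p2) < sq(c, p1) for c in corners):
        return 2  # every point of the MBR is nearer p2, so ED(p1, o) > ED(p2, o)
    if all(sq(c, p1) < sq(c, p2) for c in corners):
        return 1
    return 0  # the bisector cuts (or touches) the MBR: no conclusion
```

Only the MBR is consulted here, so the object's pdf does not have to be read to reach a decision.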
19Voronoi-diagram-based approach(Cluster
Assignment)
ED(o1, p2) < ED(o1, p1) and ED(o1, p2) < ED(o1, p3), so o1 is assigned to cluster p2.
20Voronoi-diagram-based approach
(Figure labels: object enclosed entirely in a Voronoi cell; candidate objects retrieved for the cluster; object that intersects more than one Voronoi cell)
- In each iteration,
- For each Voronoi cell (approximated by an MBR), issue a range query to the objects' R-tree to retrieve the candidate objects for the cluster
- If a candidate's MBR is completely enclosed in the Voronoi cell, assign the object to the cluster
- If a candidate's MBR intersects more than one Voronoi cell, special handling is required to prune away the unqualified clusters
21Advantages of using Voronoi-diagram-based
clustering
- Avoid expected distance computation
- If an object is completely enclosed in a Voronoi cell, it must belong to that cluster
- In the best case, no expensive expected distance calculations are needed, and the objects' pdfs need not be retrieved during clustering
22Advantages of using Voronoi-diagram-based
clustering
- Voronoi diagram construction cost is independent of the number of objects
- Only O(k log k) time is needed to compute the 2D Voronoi diagram in each iteration, where k is the number of clusters; k does not depend on the number of objects
- n is much larger than k
23Difficulties of Voronoi based clustering
(Figure labels: o1, c1)
- Handling of uncertain objects that intersect more than one Voronoi cell
- We cannot determine the nearest cluster just by looking at the Voronoi diagram
24Is VD better than basic MinMax?
- Theorem
- VD is strictly better than basic MinMax
- Given an object oi that is assigned to cluster c1, in any iteration of UK-means, if VD calculates ED(oi, cp) for some cp, then MM must calculate ED(oi, cp) as well.
- In some cases where VD does not calculate ED(oi, cp), MM must still calculate ED(oi, cp).
25In some situations, VD based is better
- VD-based methods are always better than basic MinMax, but they may not beat MinMax-Shift
- In some situations, VD-based methods outperform all MM-based methods
- When the objects' uncertainty is very small, VD-based methods are preferred
26Clustering algorithms
27Clustering Methods
- Voronoi-diagram-based approach
- Voronoi diagram with bisector pruning (VDBi)
- Voronoi diagram with bisector pruning and partial
expected distance computation (VDBiP)
28MinMax-based Methods
- For each object,
- Find the upper and lower bounds of the ED values
- If the Cluster Shift (CS) method is not enabled, the upper and lower bounds are estimated by maxdist and mindist respectively (MinMax)
- If the CS method is enabled, the upper and lower bounds become tighter (MinMax-Shift)
- Prune unwanted clusters using the upper and lower bounds
- For all unpruned clusters, compute the ED values to determine the cluster assignment of the object
29Voronoi-diagram-based Methods
- Before each iteration, a Voronoi diagram is constructed for all cluster representatives
- For each cluster representative,
- Find the objects that are completely enclosed in the cluster's Voronoi cell
- Apply bisector pruning to prune unrelated clusters
30Voronoi diagram with Bisector Pruning (VDBi)
(Figure labels: o1, c1)
Comparing c1 and c3, o1 falls on c1's side of bisector(c1, c3), so c3 can be pruned. Since the bisector of c1 and c2 cuts o1's MBR, o1 may be assigned to either c1 or c2.
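The pruning in this example can be sketched as a loop over all pairs of representatives. This is a hypothetical Python illustration, assuming axis-aligned MBRs given as (xlo, ylo, xhi, yhi) and a brute-force pairwise comparison rather than an actual Voronoi diagram; the coordinates in the usage lines are made up to mimic the figure.

```python
from itertools import product

def vdbi_candidates(reps, mbr):
    """Indices of cluster representatives left after bisector pruning."""
    xlo, ylo, xhi, yhi = mbr
    corners = list(product((xlo, xhi), (ylo, yhi)))

    def sq(a, b):  # squared Euclidean distance
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def dominates(a, b):
        # True when the whole MBR lies on a's side of bisector(a, b).
        return all(sq(c, a) < sq(c, b) for c in corners)

    return [
        j for j, cj in enumerate(reps)
        if not any(dominates(cm, cj) for m, cm in enumerate(reps) if m != j)
    ]

# c1, c2, c3 and an MBR for o1 that straddles bisector(c1, c2).
reps = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
print(vdbi_candidates(reps, (1.5, -0.5, 2.5, 0.5)))
```

When a single index remains, the MBR lies inside that representative's Voronoi cell and no ED computation is needed; otherwise ED is computed only for the remaining candidates.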
31Voronoi diagram with bisector pruning and partial
expected distance computation (VDBiP)
- Cut the object's MBR into two equal halves, (a) and (b)
32VDBiP
- If o1(b)'s MBR is completely enclosed in the Voronoi cell of c2:
- Compute ED(o1(a), c1) and ED(o1(a), c2)
- Since o1(b) lies inside c2's cell, ED(o1(b), c2) < ED(o1(b), c1)
- If ED(o1(a), c2) < ED(o1(a), c1), then
- ED(o1(a), c2) + ED(o1(b), c2) < ED(o1(a), c1) + ED(o1(b), c1)
- Hence c1 can be pruned
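The pruning argument on this slide can be written out step by step. The derivation below is a reconstruction, assuming the partial expected distance over a half is the integral of d(x, c) f(x) restricted to that half, so that the two partial values add up to the full expected distance; the notation o_1(h) is illustrative.

```latex
\begin{align*}
ED(o_1(h), c) &= \int_{x \in o_1(h)} d(x, c)\, f(x)\, dx, \qquad h \in \{a, b\},\\
ED(o_1, c) &= ED(o_1(a), c) + ED(o_1(b), c).
\end{align*}
If $o_1(b)$ lies inside the Voronoi cell of $c_2$, then $d(x, c_2) < d(x, c_1)$
for every $x \in o_1(b)$, and therefore
\begin{align*}
ED(o_1(b), c_2) &\le ED(o_1(b), c_1).
\end{align*}
If the computed partial values also satisfy $ED(o_1(a), c_2) < ED(o_1(a), c_1)$,
adding the two inequalities gives
\begin{align*}
ED(o_1, c_2) &= ED(o_1(a), c_2) + ED(o_1(b), c_2)\\
             &< ED(o_1(a), c_1) + ED(o_1(b), c_1) = ED(o_1, c_1),
\end{align*}
so $c_1$ can be pruned after integrating over half (a) only.
```

The non-strict inequality for half (b) covers the case where the pdf is zero on that half; the final comparison stays strict because of half (a).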
33Experiments
34Experiments
- Measures
- Efficiency (number of expected distance computations required)
- Comparison with
- Basic MinMax distance pruning (MinMax)
- Voronoi diagram with Bisector Pruning (VDBi)
- Voronoi diagram with Bisector Pruning and Partial expected distance computation (VDBiP)
- MM-based with Cluster Shift (MinMax-Shift)
- VD-based with Cluster Shift (VDBi-Shift, VDBiP-Shift)
35Experimental Settings
Data set: randomly generated synthetic data set
Probability density function: random
Domain: 100 x 100 2D space
Number of objects: 10000
Number of clusters: varied
Maximum length of an MBR's side: 10, 1, 0.1
Number of sample points: 20 x 20
36Degree of uncertainty is large (MBR width 10)
- VDBi performs only slightly better than basic MinMax
- The cluster shift method greatly improves the performance of basic MinMax and VDBi
37Degree of uncertainty is small (MBR width 1)
- The cluster shift method cannot greatly improve the performance of MinMax
- The VD-based approach outperforms the MM-based approach
- The VD-based approach is still better than the MM-based approach, but VD performs slightly better if there are fewer clusters
38Degree of uncertainty is very small (MBR width
0.1)
39Performance analysis
Algorithm: Description
MinMax: the worst one
MinMax-Shift: good when objects are large
VDBi: good when objects are small
VDBi-Shift: good in all cases, outperforms MinMax-based methods
VDBiP: better than VDBi, performs well when MBR width is small
VDBiP-Shift: further improvement over VDBiP
40Performance Analysis
- Basic MinMax performs badly because of the loose upper and lower bounds estimated by maxdist and mindist.
- When an object's degree of uncertainty is small, MinMax with cluster shift (improved distance bounds) cannot greatly tighten the distance bounds, since mindist and maxdist are already accurate enough
- In that case, MinMax-Shift's performance is similar to that of basic MinMax
- Because the objects are smaller, fewer objects intersect multiple Voronoi cells; we also proved that VD is better than basic MinMax
- VD is good for small objects, and a hybrid of cluster shift (CS) and VD performs well in all cases
41Conclusion
- Uncertain clustering
- Voronoi-diagram-based approach and MinMax-based approach
- VDBi is strictly better than basic MinMax
- The Voronoi-diagram-based approach beats the MinMax-based approach when the objects' uncertainty is small
- The hybrid approach is good in all cases
42Thank you