Title: An Efficient Distance Calculation Method for Uncertain Objects
Slide 1: An Efficient Distance Calculation Method for Uncertain Objects
- Edward Hung
- csehung_at_comp.polyu.edu.hk
- Hong Kong Polytechnic University
- 2007 CIDM, Hawaii, USA, Apr 1-5, 2007
Slide 2: Outline
- Why do we care about uncertain objects and their distances?
- Analytic Solutions for Uniform and Gaussian Distributions
- Five Approximation Methods (DM, PRS, GAPS, PGM, ASG) for Arbitrary Distributions
- Equivalence of PRS, PGM and ASG
- Performance Study
- Conclusion
Slide 3: Uncertain Objects: From Where?
- Sources
- Readings from sensors
- Classification results of image processing using statistical classifiers
- Results from predictive programs used for the stock market
- Weather prediction
- Etc.
Slide 4: Uncertain Objects: How to Represent?
- Representation
- An exact value with margins of error
- E.g., 1560.5, 23.8, 24.9
- An uncertainty domain with a probability distribution/density function (PDF/pdf)
- Discrete: E.g., for object $o_1$, $UD(o_1) = \{5.1, 5.2, 5.3\}$, $P_1(5.1) = 0.3$, $P_1(5.2) = 0.4$, $P_1(5.3) = 0.3$
- Continuous: E.g., for object $o_2$ with uniform distribution, $UD(o_2) = [6, 11]$, $p_2(x) = 0.2$ where $6 \le x \le 11$
Slide 5: Uncertain Objects: Handled Traditionally
- Transformed into exact values to store in traditional databases
- Weighted average or mean
- Value of highest frequency or possibility
- Why is this bad?
- Intermediate and final results of mining or queries will also be approximate and may be wrong
- E.g., deviation of cluster centroids and wrong assignment of some data
- Shown in experimental results later
Slide 6: Distance: Why Important?
- Various queries and data mining tasks, e.g.,
- Nearest-neighbor queries
- Clustering (e.g., K-means clustering)
Slide 7: Distance: Why Expensive?
- An uncertain object has more than one possible location
- Discrete: E.g., $o_1$ ($o_2$) has $n_1$ ($n_2$) possible locations
- $n_1 \times n_2$ possible pair-wise combinations of their locations to calculate distances
- Probability of each location may be different
[Figure: objects o1 and o2]
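The cost described on this slide can be made concrete with a minimal sketch. It assumes each discrete object is stored as a list of (location, probability) pairs and that the quantity wanted is the probability-weighted squared Euclidean distance; the function name and the data for o2 are illustrative and not taken from the talk.

```python
from itertools import product

def expected_sq_distance_discrete(obj_a, obj_b):
    """Weighted average of squared Euclidean distances over all location pairs.

    Each object is a list of (location, probability) pairs, where a location
    is a tuple of coordinates. This layout is an assumption for illustration.
    """
    total = 0.0
    for (xa, pa), (xb, pb) in product(obj_a, obj_b):  # n1 * n2 pairs
        sq_dist = sum((u - v) ** 2 for u, v in zip(xa, xb))
        total += pa * pb * sq_dist
    return total

# o1 reuses the discrete example values 5.1, 5.2, 5.3; o2 is hypothetical.
o1 = [((5.1,), 0.3), ((5.2,), 0.4), ((5.3,), 0.3)]
o2 = [((6.0,), 0.5), ((7.0,), 0.5)]
ed = expected_sq_distance_discrete(o1, o2)
```

The loop body runs once per pair, so the work grows with the product of the two location counts.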
Slide 8: Distance: Why Expensive?
- Continuous: E.g., take n samples on each uncertain object
- More samples in regions of higher probability density
- Each sample has the same probability
[Figure: objects o1 and o2]
Slide 9: Distance: Why Expensive?
- Approximation by a grid of a finite number of cells formed on the uncertainty domain (region) [1]
- A grid of 14 × 14 cells
- Probability of each cell determined by sampling
- All combinations of cells of two objects → 196 × 196 distance calculations
[1] E.g., used in Ngai et al., "Efficient clustering of uncertain data," in the 2006 IEEE International Conference on Data Mining (ICDM).
Slide 10: Why Expected Distance?
- All possible pair-wise combinations → a distance function $d_{i,j}(x)$ that returns the probability (or density) that the distance between objects $o_i$ and $o_j$ is $x$
- VERY expensive (previous slides)
- Expected distance: weighted average of the distances of all combinations
- Could be much cheaper IF we do NOT need to try all combinations
- Squared Euclidean distance chosen
- Easier integration compared with Euclidean distance or Manhattan distance
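One way to write the quantity on this slide, offered as an assumed formalization rather than the talk's own notation: let $p_i$ and $p_j$ be the pdfs of the two objects, treat the objects as independent, and let $d_{i,j}$ be the density of the squared distance. The expected distance is then the mean of $d_{i,j}$, which can also be written as an integral over both uncertainty domains without ever constructing $d_{i,j}$.

```latex
ED(o_i, o_j)
  = \int_0^{\infty} t \, d_{i,j}(t) \, dt
  = \int\!\!\int \| x - y \|^2 \, p_i(x) \, p_j(y) \, dx \, dy
```

The squared norm is a polynomial in the coordinates of $x$ and $y$, so the double integral reduces to first and second moments of each object; the plain Euclidean norm introduces a square root that does not separate this way.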
Slide 11: Analytic Solutions
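Slide 12: Uniform pdf
- (1) $c^2 + (a^2 - ab + b^2)/3$
- (2) $C^2 + r^2/2$
- (3) $C^2 + 3r^2/5$
- (4) $C^2 + (r_1^2 + r_2^2)/3$
- (5) $C^2 + (r_1^2 + r_2^2)/2$
- (6) $C^2 + r_1^2/2 + 3r_2^2/5$
- (7) $C^2 + 3(r_1^2 + r_2^2)/5$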
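The slide lists closed forms without the accompanying figures, so which shape each one belongs to is not recoverable from the text. As a hedged sanity check, the sketch below assumes that formula (5) is the expected squared distance between two uniformly filled disks of radii r1 and r2 whose centres are C apart, and compares it with a Monte Carlo estimate. The disk positions, radii and helper names are hypothetical.

```python
import math
import random

def sample_disk(center, radius, rng):
    """Draw one point uniformly from a disk (sqrt keeps the area density flat)."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))

def monte_carlo_sq_distance(c1, r1, c2, r2, n, rng):
    """Average squared distance over n independent point pairs."""
    total = 0.0
    for _ in range(n):
        x = sample_disk(c1, r1, rng)
        y = sample_disk(c2, r2, rng)
        total += (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    return total / n

rng = random.Random(0)
c1, r1 = (0.0, 0.0), 2.0   # hypothetical disks, not from the talk
c2, r2 = (10.0, 0.0), 3.0
C = math.dist(c1, c2)
closed_form = C ** 2 + (r1 ** 2 + r2 ** 2) / 2.0   # assumed reading of formula (5)
estimate = monte_carlo_sq_distance(c1, r1, c2, r2, 100_000, rng)
```

Under this reading the two numbers should agree to within sampling noise, since a uniform disk of radius r has mean squared distance r²/2 from its own centre.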
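Slide 13: Gaussian pdf
- For objects $o_i$ with Gaussian pdf $N(\mu_i, \Sigma_i)$, where $\mu_i$ is a $d \times 1$ mean vector and $\Sigma_i$ is a $d \times d$ covariance matrix,
- the expected distance between objects $o_i$, $o_j$ is
- $ED_{AS}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$
- where $\mathrm{trace}(\Sigma_i)$ is the sum of all diagonal elements in $\Sigma_i$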
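The Gaussian closed form translates almost directly into code. The following is a minimal sketch, assuming mean vectors are plain sequences and covariance matrices are nested lists; the function name and the two example objects are illustrative only.

```python
def gaussian_expected_sq_distance(mu_i, sigma_i, mu_j, sigma_j):
    """||mu_i - mu_j||^2 + trace(Sigma_i) + trace(Sigma_j).

    mu_* are length-d sequences; sigma_* are d x d nested lists.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    trace_i = sum(sigma_i[k][k] for k in range(len(mu_i)))
    trace_j = sum(sigma_j[k][k] for k in range(len(mu_j)))
    return mean_term + trace_i + trace_j

# Hypothetical 2-D objects chosen only to exercise the formula.
mu1, sigma1 = [0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]]
mu2, sigma2 = [3.0, 4.0], [[4.0, 0.0], [0.0, 1.0]]
ed = gaussian_expected_sq_distance(mu1, sigma1, mu2, sigma2)  # 25 + 3 + 5
```

Only the diagonal entries are read, so correlations between dimensions do not change the expected squared distance.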
Slide 14: Approximation Methods for Arbitrary pdf
- 5 methods proposed
- Distance between Means (DM)
- Pair-wise between Random Samples (PRS)
- Grid Approximation and Pair-wise between Samples (GAPS)
- Pair-wise between Gaussian Mixture (PGM)
- Approximation by Single Gaussian (ASG)
Slide 15: 1. Distance between Means (DM)
[Figure: objects o1 and o2]
Slide 16: 2. Pair-wise between Random Samples (PRS)
- Take n samples on each uncertain object
- More samples in regions of higher probability density; each sample has the same probability
[Figure: objects o1 and o2]
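A minimal sketch of PRS, assuming samples have already been drawn according to each object's pdf so that every sample carries equal weight. The Gaussian sampler is a hypothetical stand-in for whatever process produces the samples; it is not part of the method itself.

```python
import random

def prs_expected_sq_distance(samples_i, samples_j):
    """Average squared Euclidean distance over all n_i * n_j sample pairs."""
    total = 0.0
    for x in samples_i:
        for y in samples_j:
            total += sum((a - b) ** 2 for a, b in zip(x, y))
    return total / (len(samples_i) * len(samples_j))

def draw_gaussian_samples(mean, std, n, rng):
    """Hypothetical sampler: independent Gaussian noise per coordinate."""
    return [tuple(rng.gauss(m, std) for m in mean) for _ in range(n)]

rng = random.Random(1)
samples_1 = draw_gaussian_samples((0.0, 0.0), 1.0, 300, rng)
samples_2 = draw_gaussian_samples((5.0, 0.0), 2.0, 300, rng)
ed_prs = prs_expected_sq_distance(samples_1, samples_2)
```

For these two hypothetical objects the Gaussian closed form gives 25 + 2 + 8 = 35, so the estimate should land near that value, at the cost of 300 × 300 distance evaluations.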
Slide 17: 3. Grid Approximation and Pair-wise between Samples (GAPS)
- Approximation by a grid of $\sqrt{s} \times \sqrt{s}$ cells formed on the uncertainty domain
- Probability of each cell determined by sampling
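The grid idea can be sketched as follows, under several assumptions that the slide does not spell out: the uncertainty domain is a 2-D square, each cell is represented by its centre, and a cell's probability is the fraction of samples that fall in it. All names and the two uniform test objects are hypothetical.

```python
import random

def grid_cells(samples, lo, hi, g):
    """Histogram 2-D samples into a g x g grid over the square [lo, hi]^2.

    Returns a list of (cell_centre, probability) for non-empty cells.
    """
    width = (hi - lo) / g
    counts = {}
    for x, y in samples:
        col = min(int((x - lo) / width), g - 1)  # clamp points on the upper edge
        row = min(int((y - lo) / width), g - 1)
        counts[(col, row)] = counts.get((col, row), 0) + 1
    n = len(samples)
    return [((lo + (c + 0.5) * width, lo + (r + 0.5) * width), k / n)
            for (c, r), k in counts.items()]

def gaps_expected_sq_distance(cells_i, cells_j):
    """Probability-weighted squared distance over all pairs of cells."""
    return sum(p * q * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
               for a, p in cells_i for b, q in cells_j)

rng = random.Random(3)
n, g = 5000, 14
# Hypothetical objects: uniform over the squares [0, 4]^2 and [10, 14]^2.
s1 = [(rng.uniform(0, 4), rng.uniform(0, 4)) for _ in range(n)]
s2 = [(rng.uniform(10, 14), rng.uniform(10, 14)) for _ in range(n)]
cells_1 = grid_cells(s1, 0.0, 4.0, g)
cells_2 = grid_cells(s2, 10.0, 14.0, g)
ed_gaps = gaps_expected_sq_distance(cells_1, cells_2)
```

With a 14 × 14 grid per object the final sum visits up to 196 × 196 cell pairs, regardless of how many samples were used to fill the cells.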
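Slide 18: 4. Pair-wise between Gaussian Mixture (PGM)
- Approximate an uncertain object $o_i$ by a mixture of Gaussian distributions $\sum_{u \in C_i} A_{i,u} N(\mu_{i,u}, \Sigma_{i,u})$
- (use K-means to cluster samples into a few clusters)
- $ED_{PGM}(o_i, o_j) = \sum_{u \in C_i} \sum_{v \in C_j} A_{i,u} A_{j,v} \left( \|\mu_{i,u} - \mu_{j,v}\|^2 + \mathrm{trace}(\Sigma_{i,u}) + \mathrm{trace}(\Sigma_{j,v}) \right)$
[Figure: objects o1 and o2]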
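A rough sketch of PGM in plain Python. It assumes a basic Lloyd-style K-means, mixture weights equal to cluster fractions, and covariance traces computed with divisor n; none of these details are stated on the slide, and all function names are made up for illustration.

```python
import random

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean_of(points):
    d = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(d))

def kmeans(samples, k, rng, iters=20):
    """Plain Lloyd's K-means; returns the non-empty clusters as lists of points."""
    centres = rng.sample(samples, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in samples:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[nearest].append(p)
        centres = [mean_of(c) if c else centres[idx]
                   for idx, c in enumerate(clusters)]
    return [c for c in clusters if c]

def mixture_components(samples, k, rng):
    """Summarise each cluster as (weight, mean, trace of covariance)."""
    comps = []
    for cluster in kmeans(samples, k, rng):
        mu = mean_of(cluster)
        trace = sum(sq_dist(p, mu) for p in cluster) / len(cluster)  # divisor n
        comps.append((len(cluster) / len(samples), mu, trace))
    return comps

def pgm_expected_sq_distance(comps_i, comps_j):
    """Weighted sum of the Gaussian closed form over all component pairs."""
    return sum(a * b * (sq_dist(mu, nu) + ti + tj)
               for a, mu, ti in comps_i for b, nu, tj in comps_j)

rng = random.Random(2)
# Hypothetical sample sets standing in for two uncertain objects.
samples_1 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
samples_2 = [(rng.gauss(6, 2), rng.gauss(1, 2)) for _ in range(200)]
ed_pgm = pgm_expected_sq_distance(mixture_components(samples_1, 3, rng),
                                  mixture_components(samples_2, 3, rng))
```

Once the components are fitted, the distance itself costs one closed-form evaluation per pair of clusters rather than per pair of samples.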
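Slide 19: 5. Approximation by Single Gaussian (ASG)
- Approximate an uncertain object $o_i$ by a single Gaussian distribution $N(\mu_i, \Sigma_i)$
- $ED_{ASG}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$
- Complexity: $O((n_i + n_j) d)$
[Figure: objects o1 and o2]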
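ASG needs only a mean and a covariance trace per object, which is why its cost is linear in the number of samples. A minimal sketch, assuming the variance estimator uses divisor n (the slide does not say which estimator is used) and with illustrative function names:

```python
def summarise(samples):
    """Linear-time summary: mean vector and trace of the covariance matrix.

    The trace is the sum of per-dimension variances (divisor n, an assumption),
    so the full d x d covariance matrix is never formed.
    """
    n, d = len(samples), len(samples[0])
    mean = [sum(p[k] for p in samples) / n for k in range(d)]
    trace = sum(sum((p[k] - mean[k]) ** 2 for p in samples) / n
                for k in range(d))
    return mean, trace

def asg_expected_sq_distance(samples_i, samples_j):
    """||mu_i - mu_j||^2 + trace(Sigma_i) + trace(Sigma_j)."""
    mu_i, tr_i = summarise(samples_i)
    mu_j, tr_j = summarise(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j)) + tr_i + tr_j

# Tiny hypothetical sample sets, small enough to check by hand.
samples_1 = [(0.0, 0.0), (2.0, 0.0)]
samples_2 = [(1.0, 0.0), (1.0, 2.0)]
ed_asg = asg_expected_sq_distance(samples_1, samples_2)
```

For this toy input the four pairwise squared distances are 1, 5, 1 and 5, whose average matches the value returned by the function.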
Slide 20: Equivalence of PRS, PGM and ASG
- Theorem
- Given any uncertain objects $o_i$, $o_j$ and their samples $x_{i,1}, \dots, x_{i,n_i}$ and $x_{j,1}, \dots, x_{j,n_j}$, $ED_{PRS}(o_i, o_j) = ED_{PGM}(o_i, o_j) = ED_{ASG}(o_i, o_j)$
- Theoretically, ASG is the least expensive of all the methods (except DM) and gives the same results as PRS and PGM
- How does it compare with DM and GAPS?
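A short derivation sketch of why PRS and ASG agree, assuming $\mu_i$ is the sample mean and $\Sigma_i$ is the sample covariance with divisor $n_i$ (the slides do not state the estimator, and the identity is exact only for that choice):

```latex
\begin{aligned}
ED_{PRS}(o_i, o_j)
 &= \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \| x_{i,a} - x_{j,b} \|^2 \\
 &= \frac{1}{n_i} \sum_{a} \| x_{i,a} \|^2
    - 2\, \mu_i^{\top} \mu_j
    + \frac{1}{n_j} \sum_{b} \| x_{j,b} \|^2 \\
 &= \| \mu_i - \mu_j \|^2
    + \Big( \tfrac{1}{n_i} \sum_{a} \| x_{i,a} \|^2 - \| \mu_i \|^2 \Big)
    + \Big( \tfrac{1}{n_j} \sum_{b} \| x_{j,b} \|^2 - \| \mu_j \|^2 \Big) \\
 &= \| \mu_i - \mu_j \|^2 + \operatorname{trace}(\Sigma_i) + \operatorname{trace}(\Sigma_j)
  = ED_{ASG}(o_i, o_j)
\end{aligned}
```

The same expansion applied cluster by cluster, weighted by the cluster fractions, gives the PGM value, which is why the partition chosen by K-means does not change the result.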
Slide 21: Performance Study
- Experimental results also show that ASG is
- much more accurate than DM with comparable speed
- much faster than GAPS with higher or comparable accuracy
- (number of grid cells = number of samples)
Slide 22: Experiment 1
- 100 uncertain objects (4 Gaussian pdfs, variances in [1, 10])
Slide 23: Experiment 1
Slide 24: Experiment 1
Slide 25: Experiment 2
- Data generated in the same way as in
- Ngai et al., "Efficient clustering of uncertain data," in the 2006 IEEE International Conference on Data Mining (ICDM)
- A grid of 14 × 14 cells
- Probability of each cell randomly generated
- normalized
GAPS produces the correct solution, but how close is ASG?
Slide 26: Experiment 2
Slide 27: Experiment 2
Slide 28: Experiment 3
- ASG also approximates objects with uniform pdf well
- 10 objects with radius in [1, 5], randomly located in a 100 × 100 2D space
- ASG takes 100 samples, and repeats 6 times
- Accuracy
- Worst case > 0.98
- Average > 0.99
Slide 29: Experiment 4
- Scalability w.r.t. Dimensions
- 2/3/4-D
- 256/216/256 samples/cells
- ASG
- Accuracy: 0.97 to 0.99
- Time: 0.02 ms or less
Slide 30: Experiment 4
Slide 31: Experiment 4
Slide 32: Conclusion
- Importance of expected distance calculation in queries and data mining applications on uncertain data
- Analytic solutions of special cases (uniform/Gaussian pdf)
- ASG can obtain highly accurate results quickly
- ASG can replace GAPS used in recent research work