An Efficient Distance Calculation Method for Uncertain Objects


1
An Efficient Distance Calculation Method for
Uncertain Objects

Edward Hung
csehung_at_comp.polyu.edu.hk
Hong Kong Polytechnic University
2007 CIDM, Hawaii, USA, Apr 1-5, 2007

2
Outline

Why do we care about uncertain objects and their
distances?
Analytic Solutions for Uniform and Gaussian
Distributions
Five Approximation Methods (DM, PRS, GAPS, PGM,
ASG) for Arbitrary Distributions
Equivalence of PRS, PGM and ASG
Performance Study
Conclusion

3
Uncertain Objects: From Where?

Sources
Readings from sensors
Classification results of image processing using
statistical classifiers
Results from predictive programs used for stock
market
Weather prediction
Etc.

4
Uncertain Objects: How to Represent?

Representation
An exact value with margins of error
E.g., $156 \pm 0.5$, $[23.8, 24.9]$
An uncertainty domain with a probability
distribution/density function (PDF/pdf)
Discrete: e.g., for object $o_1$, $UD(o_1) = \{5.1, 5.2, 5.3\}$,
$P_1(5.1) = 0.3$, $P_1(5.2) = 0.4$, $P_1(5.3) = 0.3$
Continuous: e.g., for object $o_2$ with uniform
distribution, $UD(o_2) = [6, 11]$, $p_2(x) = 0.2$ where
$6 \le x \le 11$
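
The two representations can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper names `discrete_mean` and `uniform_pdf` are assumptions made here, not something taken from the slides.

```python
# Hypothetical layout: a discrete uncertain object maps each possible
# value in its uncertainty domain to the probability of that value.
o1 = {5.1: 0.3, 5.2: 0.4, 5.3: 0.3}


def discrete_mean(obj):
    """Probability-weighted mean of a discrete uncertain object."""
    return sum(value * prob for value, prob in obj.items())


def uniform_pdf(x, low, high):
    """Density of a uniform distribution on the interval [low, high]."""
    if low <= x <= high:
        return 1.0 / (high - low)
    return 0.0


# The probabilities of o1 sum to 1, and the density of o2 on [6, 11]
# is 1 / (11 - 6) = 0.2, matching the values on the slide.
total_probability = sum(o1.values())
mean_o1 = discrete_mean(o1)
density_o2 = uniform_pdf(8.0, 6.0, 11.0)
```

The weighted mean computed here is the kind of single exact value that traditional databases would store in place of the full distribution.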

5
Uncertain Objects: Handled Traditionally

Transformed into exact values to store in
traditional databases
Weighted average or mean
Value of highest frequency or possibility
Why is this bad?
Intermediate and final results of mining or
queries will also be approximate and may be wrong
E.g., deviation of cluster centroids and wrong
assignment of some data
Shown in experimental results later

6
Distance: Why Important?

Various queries and data mining tasks, e.g.,
Nearest-neighbor queries
Clustering (e.g., K-means clustering)

7
Distance: Why Expensive?

An uncertain object has more than one possible
location
Discrete: e.g., $o_1$ ($o_2$) has $n_1$ ($n_2$) possible
locations
$n_1 \times n_2$ possible pair-wise combinations of their
locations to calculate distances
Probability of each location may be different

8
Distance: Why Expensive?

Continuous: e.g., take n samples on each uncertain
object
More samples in region of higher probability
density
Each sample has the same probability

9
Distance: Why Expensive?

Approximation by a grid of a finite number of
cells formed on the uncertainty domain (region) [1]
A grid of 14 × 14 cells
Probability of each cell determined by sampling
All combinations of cells of two objects →
196 × 196 distance calculations

[1] E.g., used in Ngai et al., "Efficient
clustering of uncertain data," 2006 IEEE
International Conference on Data Mining (ICDM).
10
Why Expected Distance?

All possible pair-wise combinations → a distance
function $d_{i,j}(x)$ that returns the probability (or
density) that the distance between objects $o_i$ and
$o_j$ is $x$
VERY expensive (previous slides)
Expected distance: weighted average of the
distances of all combinations
Could be much cheaper IF we do NOT need to try
all combinations
Squared Euclidean distance chosen
Easier integration compared with Euclidean
distance or Manhattan distance
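
As a point of reference, the brute-force expected squared Euclidean distance between two discrete uncertain objects can be sketched as below. The list-of-pairs layout and the function name `expected_sq_distance` are illustrative assumptions, and the sketch also assumes the two objects are independent, so that the probability of a pair of locations is the product of the two location probabilities.

```python
from itertools import product


def expected_sq_distance(obj_i, obj_j):
    """Weighted average of squared distances over all location pairs.

    Each object is a list of (location, probability) pairs, where a
    location is a tuple of coordinates. Independence of the two
    objects is assumed, so each pair is weighted by p_i * p_j.
    """
    total = 0.0
    for (loc_i, p_i), (loc_j, p_j) in product(obj_i, obj_j):
        sq = sum((a - b) ** 2 for a, b in zip(loc_i, loc_j))
        total += p_i * p_j * sq
    return total


# Hypothetical 1-D objects: 3 x 2 = 6 pair-wise combinations.
o1 = [((5.1,), 0.3), ((5.2,), 0.4), ((5.3,), 0.3)]
o2 = [((6.0,), 0.5), ((7.0,), 0.5)]
result = expected_sq_distance(o1, o2)
```

The cost grows with the product of the two location counts, which is the expense the later slides try to avoid.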

11
Analytic Solutions

Uniform pdf
Gaussian pdf

12
Uniform pdf

(1) $c^2 + (a^2 - ab + b^2)/3$
(2) $C^2 + r^2/2$
(3) $C^2 + 3r^2/5$
(4) $C^2 + (r_1^2 + r_2^2)/3$
(5) $C^2 + (r_1^2 + r_2^2)/2$
(6) $C^2 + r_1^2/2 + 3r_2^2/5$
(7) $C^2 + 3(r_1^2 + r_2^2)/5$
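
A formula of this kind can be checked numerically. The sketch below assumes one particular geometry, which is an interpretation rather than something the transcript states: a fixed point at distance C from the centre of a disc of radius r that carries a uniform pdf. Under that assumption the expected squared distance is C squared plus r squared over 2, because the mean squared distance of a uniform disc from its own centre is r squared over 2. The function name and parameter values are hypothetical.

```python
import random


def mc_expected_sq_distance(C, r, n=100_000, seed=0):
    """Monte Carlo estimate of E[squared distance] between the fixed
    point (C, 0) and a point drawn uniformly from the disc of radius r
    centred at the origin (rejection sampling from the bounding square).
    """
    rng = random.Random(seed)
    total = 0.0
    accepted = 0
    while accepted < n:
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        if x * x + y * y <= r * r:  # keep only points inside the disc
            total += (x - C) ** 2 + y ** 2
            accepted += 1
    return total / n


C, r = 5.0, 2.0
estimate = mc_expected_sq_distance(C, r)
closed_form = C ** 2 + r ** 2 / 2  # assumed reading of the disc case
```

The estimate should agree with the closed form up to sampling noise, while the closed form itself needs no samples at all.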
13
Gaussian pdf

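For objects $o_i$ with Gaussian pdf $N(\mu_i, \Sigma_i)$, where
$\mu_i$ is a $d \times 1$ mean vector and $\Sigma_i$ is a $d \times d$
covariance matrix,
the expected distance between objects $o_i$ and $o_j$ is
$ED_{AS}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$
where $\mathrm{trace}(\Sigma_i)$ is the sum of all diagonal elements
of $\Sigma_i$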
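
A short derivation shows where this expression comes from. It assumes, as the slide does not state explicitly, that the two objects are independent random vectors $X$ and $Y$ with means $\mu_i$, $\mu_j$ and covariances $\Sigma_i$, $\Sigma_j$. It uses the standard identity $E\|Z\|^2 = \|E[Z]\|^2 + \mathrm{trace}(\mathrm{Cov}(Z))$ with $Z = X - Y$.

```latex
\begin{align*}
E\|X - Y\|^2
  &= \|E[X - Y]\|^2 + \operatorname{trace}\bigl(\operatorname{Cov}(X - Y)\bigr) \\
  &= \|\mu_i - \mu_j\|^2 + \operatorname{trace}(\Sigma_i + \Sigma_j) \\
  &= \|\mu_i - \mu_j\|^2 + \operatorname{trace}(\Sigma_i) + \operatorname{trace}(\Sigma_j)
\end{align*}
```

The steps use only the means and covariances, not the Gaussian shape, so the same expression holds for any pair of independent distributions with finite second moments.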

14
Approximation Methods for Arbitrary pdf

5 methods proposed
Distance between Means (DM)
Pair-wise between Random Samples (PRS)
Grid Approximation and Pair-wise between Samples
(GAPS)
Pair-wise between Gaussian Mixture (PGM)
Approximation by Single Gaussian (ASG)

15
1. Distance between Means (DM)

$ED_{DM}(o_i, o_j) = \|\mu_i - \mu_j\|^2$

16
2. Pair-wise between Random Samples (PRS)

Take n samples on each uncertain object
More samples in regions of higher probability
density; each sample has the same probability

17
3. Grid Approximation and Pair-wise between
Samples (GAPS)

Approximation by a grid of $\sqrt{s} \times \sqrt{s}$ cells formed
on the uncertainty domain
Probability of each cell determined by sampling

18
4. Pair-wise between Gaussian Mixture (PGM)

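Approximate an uncertain object $o_i$ by a mixture
of Gaussian distributions $\sum_{u \in C_i} A_{i,u} N(\mu_{i,u}, \Sigma_{i,u})$
(use K-means to cluster samples into a few
clusters)
$ED_{PGM}(o_i, o_j) = \sum_{u \in C_i} \sum_{v \in C_j} A_{i,u} A_{j,v} (\|\mu_{i,u} - \mu_{j,v}\|^2 + \mathrm{trace}(\Sigma_{i,u}) + \mathrm{trace}(\Sigma_{j,v}))$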
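
The double sum can be sketched directly once the mixture components are known. The sketch below skips the K-means step and takes the components as given. The tuple layout `(weight, mean, trace_of_covariance)` and the function names are assumptions for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def ed_pgm(mix_i, mix_j):
    """Expected squared distance between two Gaussian mixtures.

    Each mixture is a list of (weight, mean, trace_of_covariance)
    components. Only the trace of each covariance matrix is needed,
    so the full matrices are not stored in this sketch.
    """
    total = 0.0
    for w_u, mu_u, tr_u in mix_i:
        for w_v, mu_v, tr_v in mix_j:
            total += w_u * w_v * (sq_dist(mu_u, mu_v) + tr_u + tr_v)
    return total


# Hypothetical mixtures: two components for o_i, one for o_j.
mix_i = [(0.5, (0.0, 0.0), 1.0), (0.5, (2.0, 0.0), 1.0)]
mix_j = [(1.0, (1.0, 0.0), 2.0)]
result = ed_pgm(mix_i, mix_j)
```

With one component per object the double sum collapses to a single term, which is the single-Gaussian expression.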

19
5. Approximation by Single Gaussian (ASG)

Approximate an uncertain object $o_i$ by a single
Gaussian distribution
$N(\mu_i, \Sigma_i)$
$ED_{ASG}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$
Complexity: $O((n_i + n_j) d)$
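
A minimal sketch of ASG computed from samples follows. It assumes the mean and covariance are estimated from the samples with the 1/n (population) convention, which the slide does not specify. The function names are illustrative. Only the trace of the covariance is needed, so the sketch sums per-dimension variances instead of building a full matrix, and each sample is visited a constant number of times per dimension.

```python
def mean_and_trace(samples):
    """Return (mean vector, trace of the 1/n covariance) of the samples.

    The trace is the sum of the per-dimension variances, so the
    off-diagonal covariance entries are never computed.
    """
    n = len(samples)
    d = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    trace = sum(
        sum((s[k] - mean[k]) ** 2 for s in samples) / n for k in range(d)
    )
    return mean, trace


def ed_asg(samples_i, samples_j):
    """Squared distance between means plus the two covariance traces."""
    mean_i, trace_i = mean_and_trace(samples_i)
    mean_j, trace_j = mean_and_trace(samples_j)
    between_means = sum((a - b) ** 2 for a, b in zip(mean_i, mean_j))
    return between_means + trace_i + trace_j


# Hypothetical samples of two 2-D uncertain objects.
samples_i = [(0.0, 0.0), (2.0, 0.0)]
samples_j = [(4.0, 3.0), (4.0, 5.0)]
result = ed_asg(samples_i, samples_j)
```

In this small example the means are (1, 0) and (4, 4) and each trace is 1, so the result is 25 + 1 + 1 = 27.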

20
Equivalence of PRS, PGM and ASG

Theorem
Given any uncertain objects $o_i$, $o_j$ and their
samples $x_{i,1}, \ldots, x_{i,n_i}$, $x_{j,1}, \ldots, x_{j,n_j}$,
$ED_{PRS}(o_i, o_j) = ED_{PGM}(o_i, o_j) = ED_{ASG}(o_i, o_j)$
Theoretically, ASG is the least expensive of all
the methods (except DM) and gives
the same results as PRS and PGM
How does it compare with DM and GAPS?
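
The PRS and ASG part of the theorem can be checked numerically on arbitrary samples. The sketch below makes two assumptions that are not spelled out in the transcript: PRS is the plain average of the squared distances over all pairs of samples (each sample has the same probability), and ASG uses the 1/n sample covariance. PGM is not covered here. All names and the sample distributions are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def ed_prs(xs, ys):
    """Average squared distance over all len(xs) * len(ys) sample pairs."""
    total = sum(sq_dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mean_and_trace(samples):
    """Mean vector and trace of the 1/n covariance of the samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    trace = sum(
        sum((s[k] - mean[k]) ** 2 for s in samples) / n for k in range(d)
    )
    return mean, trace


def ed_asg(xs, ys):
    """Squared distance between sample means plus both traces."""
    mean_x, trace_x = mean_and_trace(xs)
    mean_y, trace_y = mean_and_trace(ys)
    return sq_dist(mean_x, mean_y) + trace_x + trace_y


# Samples from two deliberately non-Gaussian, differently sized objects.
rng = random.Random(42)
xs = [(rng.gauss(0, 1), rng.gauss(0, 2)) for _ in range(50)]
ys = [(rng.gauss(5, 3), rng.uniform(-1, 1)) for _ in range(70)]
prs = ed_prs(xs, ys)
asg = ed_asg(xs, ys)
```

The two values agree up to floating-point rounding, although PRS visits every pair of samples and ASG makes a fixed number of passes over each sample set.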

21
Performance Study

Experimental results also show that ASG is
much more accurate than DM with comparable speed
much faster than GAPS with higher or comparable
accuracy
(number of grid cells = number of samples)

22
Experiment 1

100 uncertain objects (4 Gaussian pdfs, variances
in [1, 10])

23
Experiment 1
24
Experiment 1

ASG: 0.02 ms

25
Experiment 2

Data generated in the same way as in
Ngai et al., "Efficient clustering of uncertain
data," 2006 IEEE International Conference
on Data Mining (ICDM)
A grid of 14 × 14 cells
Probability of each cell randomly generated and
normalized

GAPS produces the correct solution, but how close
is ASG?
26
Experiment 2
27
Experiment 2

ASG: 0.02 ms

28
Experiment 3

ASG also approximates objects with uniform
pdf well
10 objects with radius in [1, 5], randomly located
in a 100 × 100 2D space
ASG takes 100 samples, repeated 6 times
Accuracy
Worst case > 0.98
Average > 0.99

29
Experiment 4

Scalability w.r.t. Dimensions
2/3/4-D
256/216/256 samples/cells
ASG
Accuracy: 0.97 to 0.99
Time: 0.02 ms or less

30
Experiment 4
31
Experiment 4
32
Conclusion

Importance of expected distance calculation in
queries and data mining applications on uncertain
data
Analytic solutions of special cases
(uniform/Gaussian pdf)
ASG can obtain highly accurate results quickly
ASG can replace GAPS used in recent research work
