An Efficient Distance Calculation Method for Uncertain Objects


Transcript and Presenter's Notes

Title: An Efficient Distance Calculation Method for Uncertain Objects


1
An Efficient Distance Calculation Method for
Uncertain Objects
  • Edward Hung
  • csehung_at_comp.polyu.edu.hk
  • Hong Kong Polytechnic University
  • 2007 CIDM, Hawaii, USA, Apr 1-5, 2007

2
Outline
  • Why do we care about uncertain objects and their
    distances?
  • Analytic Solutions for Uniform and Gaussian
    Distributions
  • Five Approximation Methods (DM, PRS, GAPS, PGM,
    ASG) for Arbitrary Distributions
  • Equivalence of PRS, PGM and ASG
  • Performance Study
  • Conclusion

3
Uncertain Objects: From Where?
  • Sources
  • Readings from sensors
  • Classification results of image processing using
    statistical classifiers
  • Results from predictive programs used for stock
    market
  • Weather prediction
  • Etc

4
Uncertain Objects: How to Represent?
  • Representation
  • An exact value with margins of error
  • E.g., 1560.5, 23.8, 24.9
  • An uncertainty domain with a probability
    distribution/density function (PDF/pdf)
  • Discrete: e.g., for object o1, UD(o1) =
    {5.1, 5.2, 5.3}, P1(5.1) = 0.3, P1(5.2) = 0.4,
    P1(5.3) = 0.3
  • Continuous: e.g., for object o2 with uniform
    distribution, UD(o2) = [6, 11], p2(x) = 0.2 where
    6 ≤ x ≤ 11
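
The two representations on this slide can be sketched as small data structures. This is only an illustration: the class names `DiscreteUncertainObject` and `UniformUncertainObject` are hypothetical and do not come from the presentation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscreteUncertainObject:
    """Hypothetical container: a finite uncertainty domain with probabilities."""
    values: tuple
    probs: tuple

    def __post_init__(self):
        if len(self.values) != len(self.probs):
            raise ValueError("values and probs must have the same length")
        if not math.isclose(sum(self.probs), 1.0):
            raise ValueError("probabilities must sum to 1")

    def mean(self):
        return sum(v * p for v, p in zip(self.values, self.probs))


@dataclass(frozen=True)
class UniformUncertainObject:
    """Hypothetical container: a uniform pdf over the interval [low, high]."""
    low: float
    high: float

    def pdf(self, x):
        # Constant density inside the uncertainty domain, zero outside.
        return 1.0 / (self.high - self.low) if self.low <= x <= self.high else 0.0

    def mean(self):
        return (self.low + self.high) / 2.0


# The slide's two examples.
o1 = DiscreteUncertainObject(values=(5.1, 5.2, 5.3), probs=(0.3, 0.4, 0.3))
o2 = UniformUncertainObject(low=6.0, high=11.0)
```

With these definitions, `o2.pdf(7.0)` evaluates to 1 / (11 - 6) = 0.2, matching p2(x) on the slide.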

5
Uncertain Objects handled traditionally
  • Transformed into exact values to store in
    traditional databases
  • Weighted average or mean
  • Value of highest frequency or possibility
  • Why is this bad?
  • Intermediate and final results of mining or
    queries will also be approximate and may be wrong
  • E.g., deviation of cluster centroids and wrong
    assignment of some data
  • Shown in experimental results later

6
Distance: Why Important?
  • Various queries and data mining tasks, e.g.,
  • Nearest-neighbor queries
  • Clustering (e.g., K-means clustering)

7
Distance: Why Expensive?
  • An uncertain object has more than one possible
    location
  • Discrete: e.g., o1 (o2) has n1 (n2) possible
    locations
  • n1 × n2 possible pair-wise combinations of their
    locations to calculate distances
  • Probability of each location may be different

[Figure: uncertain objects o1 and o2]

8
Distance: Why Expensive?
  • Continuous: e.g., take n samples from each
    uncertain object
  • More samples fall in regions of higher
    probability density
  • Each sample has the same probability

[Figure: uncertain objects o1 and o2]

9
Distance: Why Expensive?
  • Approximation by a grid of a finite number of
    cells formed on the uncertainty domain (region) [1]
  • E.g., a grid of 14 × 14 cells
  • Probability of each cell determined by sampling
  • All combinations of cells of two objects →
    196 × 196 distance calculations

[1] E.g., used in Ngai et al., "Efficient
clustering of uncertain data," in the 2006 IEEE
International Conference on Data Mining (ICDM).

10
Why Expected Distance?
  • All possible pair-wise combinations → a distance
    function $d_{i,j}(x)$ that returns the probability (or
    density) that the distance between objects $o_i$ and
    $o_j$ is x
  • VERY expensive (previous slides)
  • Expected distance = weighted average of the
    distances of all combinations
  • Could be much cheaper IF we do NOT need to try
    all combinations
  • Squared Euclidean distance chosen
  • Easier integration compared with Euclidean
    distance or Manhattan distance
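
To make the cost concrete, here is a minimal sketch of the brute-force expected squared Euclidean distance between two discrete uncertain objects. It assumes the two objects are independent, so each pair of locations is weighted by the product of its probabilities; the function name is illustrative, not from the presentation.

```python
def expected_sq_distance(locs_i, probs_i, locs_j, probs_j):
    """Weighted average of squared distances over all location pairs.

    Cost is O(n_i * n_j * d): every pair-wise combination is visited.
    Assumes the two objects are independent of each other.
    """
    total = 0.0
    for x, px in zip(locs_i, probs_i):
        for y, py in zip(locs_j, probs_j):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += px * py * sq_dist
    return total


# o1 is at (0, 0) or (2, 0) with equal probability; o2 is surely at (0, 1).
# Squared distances are 1 and 5, so the expectation is 0.5*1 + 0.5*5 = 3.
result = expected_sq_distance(
    [(0.0, 0.0), (2.0, 0.0)], [0.5, 0.5],
    [(0.0, 1.0)], [1.0],
)
```

The nested loop is exactly the n1 × n2 enumeration that the later methods try to avoid.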

11
Analytic Solutions
  • Uniform pdf
  • Gaussian pdf

12
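Uniform pdf
  • (1) $c^2 + (a^2 - ab + b^2)/3$
  • (2) $C^2 + r^2/2$
  • (3) $C^2 + 3r^2/5$
  • (4) $C^2 + (r_1^2 + r_2^2)/3$
  • (5) $C^2 + (r_1^2 + r_2^2)/2$
  • (6) $C^2 + r_1^2/2 + 3r_2^2/5$
  • (7) $C^2 + 3(r_1^2 + r_2^2)/5$
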
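
The constants 1/3, 1/2 and 3/5 in these formulas match a standard fact: for a uniform pdf over a d-dimensional ball of radius r, the mean squared distance to the center is d·r²/(d+2). The sketch below assumes that cases (2) to (7) describe uniform objects shaped as line segments (d = 1, half-length r), discs (d = 2) and solid spheres (d = 3) whose centers are C apart. That reading is an interpretation, since the slide's figures are not part of the transcript. Function names are illustrative.

```python
from fractions import Fraction


def uniform_ball_spread(radius, dim):
    """E||x - center||^2 for a uniform pdf over a dim-dimensional ball."""
    return Fraction(dim, dim + 2) * radius ** 2


def expected_sq_distance_uniform(center_dist, r1, dim1, r2, dim2):
    """Expected squared distance between two independent uniform objects.

    Cross terms vanish because each object is symmetric about its center,
    leaving the squared center distance plus the two spreads.
    A point is modelled as a ball of radius 0.
    """
    return (center_dist ** 2
            + uniform_ball_spread(r1, dim1)
            + uniform_ball_spread(r2, dim2))


# Point to disc, as in case (2): C^2 + r^2/2 with C = 3, r = 2.
point_to_disc = expected_sq_distance_uniform(3, 2, 2, 0, 1)
# Disc to solid sphere, as in case (6): C^2 + r1^2/2 + 3*r2^2/5.
disc_to_sphere = expected_sq_distance_uniform(0, 2, 2, 5, 3)
```

Using `Fraction` keeps the results exact for integer inputs; floating-point radii also work and yield floats.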
13
Gaussian pdf
  • For objects $o_i$ with Gaussian pdf $N(\mu_i, S_i)$, where
    $\mu_i$ is a $d \times 1$ mean vector and $S_i$ is a
    $d \times d$ covariance matrix,
  • the expected distance between objects $o_i$ and $o_j$ is
  • $ED_{AS}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(S_i) + \mathrm{trace}(S_j)$
  • where $\mathrm{trace}(S_i)$ is the sum of all diagonal
    elements of $S_i$
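
The closed form above needs only the two mean vectors and the diagonals of the two covariance matrices. A minimal sketch, with an illustrative function name and plain nested lists standing in for matrices:

```python
def gaussian_expected_sq_distance(mu_i, cov_i, mu_j, cov_j):
    """||mu_i - mu_j||^2 + trace(S_i) + trace(S_j) for two Gaussian objects."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    trace_i = sum(cov_i[k][k] for k in range(len(cov_i)))
    trace_j = sum(cov_j[k][k] for k in range(len(cov_j)))
    return mean_term + trace_i + trace_j


# Means are 5 apart (3-4-5 triangle), traces are 3 and 1: 25 + 3 + 1 = 29.
ed = gaussian_expected_sq_distance(
    [0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]],
    [3.0, 4.0], [[0.5, 0.0], [0.0, 0.5]],
)
```

Off-diagonal covariance entries (0.5 in the example) do not affect the result, because only the trace enters the formula.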

14
Approximation Methods for Arbitrary pdf
  • 5 methods proposed
  • Distance between Means (DM)
  • Pair-wise between Random Samples (PRS)
  • Grid Approximation and Pair-wise between Samples
    (GAPS)
  • Pair-wise between Gaussian Mixture (PGM)
  • Approximation by Single Gaussian (ASG)

15
1. Distance between Means (DM)
  • $ED_{DM}(o_i, o_j) = \|\mu_i - \mu_j\|^2$
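
As a rough sketch, DM can be computed from samples by estimating each mean and taking the squared distance between the two means. The helper names are illustrative, and estimating the means from samples is an assumption; the slide does not say how the means are obtained.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equal-length points."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(s[k] for s in samples) / n for k in range(dim)]


def ed_dm(samples_i, samples_j):
    """Squared Euclidean distance between the two sample means."""
    mu_i = mean_vector(samples_i)
    mu_j = mean_vector(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))


# Means are (1, 0) and (1, 4), so DM gives 16.
dm = ed_dm([(0.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (1.0, 5.0)])
```

DM ignores the spread of each object, so two widely scattered objects and two point-like objects with the same means get the same distance.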

[Figure: uncertain objects o1 and o2]

16
2. Pair-wise between Random Samples (PRS)
  • Take n samples from each uncertain object
  • More samples fall in regions of higher
    probability density; each sample has the same
    probability
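
The slide gives no formula for PRS. The sketch below assumes it is the plain average of squared distances over all pairs of samples, which is how the method's name reads and is consistent with every sample having the same probability. The samples are assumed to have been drawn from each object's pdf already.

```python
def ed_prs(samples_i, samples_j):
    """Average squared distance over all n_i * n_j sample pairs.

    Cost is O(n_i * n_j * d).
    """
    total = 0.0
    for x in samples_i:
        for y in samples_j:
            total += sum((a - b) ** 2 for a, b in zip(x, y))
    return total / (len(samples_i) * len(samples_j))


# Pair-wise squared distances are 10, 26, 10 and 26, so the average is 18.
prs = ed_prs([(0.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (1.0, 5.0)])
```

For these samples the distance between the means alone would be 16; the extra 2 comes from the spread of the two sample sets.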

[Figure: uncertain objects o1 and o2]

17
3. Grid Approximation and Pair-wise between
Samples (GAPS)
  • Approximation by a grid of √s × √s cells formed
    on the uncertainty domain
  • Probability of each cell determined by sampling
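
A rough 2-D sketch of the grid idea follows. Several details are assumptions that the slide does not specify: the grid covers the bounding box of the samples, each cell is represented by its center, and a cell's probability is the fraction of samples that fall in it. Function names are illustrative.

```python
def grid_approximation(samples, cells_per_side):
    """Map 2-D samples to {cell_center: probability} on a square grid."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    x0, y0 = min(xs), min(ys)
    # Cell widths; fall back to 1.0 if all samples share a coordinate.
    wx = (max(xs) - x0) / cells_per_side or 1.0
    wy = (max(ys) - y0) / cells_per_side or 1.0
    counts = {}
    for x, y in samples:
        # Clamp so that points on the upper boundary land in the last cell.
        cx = min(int((x - x0) / wx), cells_per_side - 1)
        cy = min(int((y - y0) / wy), cells_per_side - 1)
        counts[(cx, cy)] = counts.get((cx, cy), 0) + 1
    n = len(samples)
    return {(x0 + (cx + 0.5) * wx, y0 + (cy + 0.5) * wy): c / n
            for (cx, cy), c in counts.items()}


def ed_gaps(samples_i, samples_j, cells_per_side):
    """Probability-weighted squared distance over all pairs of occupied cells."""
    grid_i = grid_approximation(samples_i, cells_per_side)
    grid_j = grid_approximation(samples_j, cells_per_side)
    return sum(p * q * ((xi - xj) ** 2 + (yi - yj) ** 2)
               for (xi, yi), p in grid_i.items()
               for (xj, yj), q in grid_j.items())


gaps = ed_gaps([(0, 0), (0, 0), (4, 4), (4, 4)], [(10, 0), (12, 2)], 2)
```

Only occupied cells are stored here, so the pair-wise loop is at most (number of cells)² and often smaller; a dense grid, as in the 196 × 196 example, would visit every cell pair.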

18
4. Pair-wise between Gaussian Mixture (PGM)
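  • Approximate an uncertain object $o_i$ by a mixture
    of Gaussian distributions $\sum_{u \in C_i} A_{i,u} N(\mu_{i,u}, S_{i,u})$
  • (use K-means to cluster samples into a few
    clusters)
  • $ED_{PGM}(o_i, o_j) = \sum_{u \in C_i} \sum_{v \in C_j} A_{i,u} A_{j,v} \left( \|\mu_{i,u} - \mu_{j,v}\|^2 + \mathrm{trace}(S_{i,u}) + \mathrm{trace}(S_{j,v}) \right)$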
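
The double sum can be sketched as follows, taking the mixture components as given. The K-means fitting step is omitted, and the tuple layout `(weight, mean, trace_of_covariance)` is a hypothetical convenience, not something defined in the presentation.

```python
def ed_pgm(mixture_i, mixture_j):
    """Weighted sum of Gaussian expected squared distances over component pairs.

    Each mixture is a list of (weight, mean_vector, trace_of_covariance)
    tuples whose weights sum to 1.
    """
    total = 0.0
    for a_u, mu_u, tr_u in mixture_i:
        for a_v, mu_v, tr_v in mixture_j:
            sq_dist = sum((p - q) ** 2 for p, q in zip(mu_u, mu_v))
            total += a_u * a_v * (sq_dist + tr_u + tr_v)
    return total


# Two components for o_i, one for o_j. Both component pairs contribute
# 13 + 1 + 2 = 16, so the weighted sum is 16.
pgm = ed_pgm(
    [(0.5, (0.0, 0.0), 1.0), (0.5, (4.0, 0.0), 1.0)],
    [(1.0, (2.0, 3.0), 2.0)],
)
```

The cost of the sum itself depends only on the number of components, not on the number of samples, although fitting the mixture still requires a pass (or several) over the samples.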

[Figure: uncertain objects o1 and o2]

19
5. Approximation by Single Gaussian (ASG)
  • Approximate an uncertain object $o_i$ by a single
    Gaussian distribution
  • $N(\mu_i, S_i)$
  • $ED_{ASG}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(S_i) + \mathrm{trace}(S_j)$
  • Complexity: $O((n_i + n_j) d)$
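
A minimal sketch of ASG computed directly from samples. Only the trace of each covariance matrix is needed, so the full matrix is never built, and each sample set is scanned independently of the other. The sketch assumes the covariance is estimated with a 1/n divisor (not 1/(n-1)); the slide does not state which estimator is used.

```python
def mean_and_trace(samples):
    """Sample mean and trace of the 1/n covariance, in O(n * d) time."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    # The trace is the sum of the per-dimension variances.
    trace = sum((s[k] - mean[k]) ** 2 for s in samples for k in range(dim)) / n
    return mean, trace


def ed_asg(samples_i, samples_j):
    """||mu_i - mu_j||^2 + trace(S_i) + trace(S_j), estimated from samples."""
    mu_i, tr_i = mean_and_trace(samples_i)
    mu_j, tr_j = mean_and_trace(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j)) + tr_i + tr_j


# Means (1, 0) and (1, 4) give 16; each trace is 1, so the total is 18.
asg = ed_asg([(0.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (1.0, 5.0)])
```

Because the mean and trace of each object do not depend on the other object, they can be computed once per object and reused for every distance query.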

[Figure: uncertain objects o1 and o2]

20
Equivalence of PRS, PGM and ASG
  • Theorem
  • Given any uncertain objects $o_i$, $o_j$ and their
    samples $x_{i,1}, \ldots, x_{i,n_i}$ and $x_{j,1}, \ldots, x_{j,n_j}$,
    $ED_{PRS}(o_i, o_j) = ED_{PGM}(o_i, o_j) = ED_{ASG}(o_i, o_j)$
  • Theoretically, ASG is the least expensive of all
    methods except DM, and gives the same results as
    PRS and PGM
  • What about compared with DM and GAPS?
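
The PRS and ASG part of the theorem can be checked numerically. The sketch below assumes PRS is the plain average over all sample pairs and that ASG uses 1/n sample covariances; under those assumptions the two agree up to floating-point error on any sample sets. PGM is not covered here, and the function names and test data are illustrative.

```python
import math
import random


def ed_prs(samples_i, samples_j):
    """Average squared distance over all sample pairs: O(n_i * n_j * d)."""
    total = sum(sum((a - b) ** 2 for a, b in zip(x, y))
                for x in samples_i for y in samples_j)
    return total / (len(samples_i) * len(samples_j))


def mean_and_trace(samples):
    """Sample mean and trace of the 1/n covariance: O(n * d)."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    trace = sum((s[k] - mean[k]) ** 2 for s in samples for k in range(dim)) / n
    return mean, trace


def ed_asg(samples_i, samples_j):
    """Single-Gaussian approximation: O((n_i + n_j) * d)."""
    mu_i, tr_i = mean_and_trace(samples_i)
    mu_j, tr_j = mean_and_trace(samples_j)
    return sum((a - b) ** 2 for a, b in zip(mu_i, mu_j)) + tr_i + tr_j


# Deliberately non-Gaussian data for o_j: the identity does not depend on
# the samples actually following a Gaussian distribution.
rng = random.Random(0)
samples_i = [[rng.gauss(0.0, 2.0) for _ in range(3)] for _ in range(50)]
samples_j = [[rng.uniform(5.0, 9.0) for _ in range(3)] for _ in range(70)]

prs = ed_prs(samples_i, samples_j)
asg = ed_asg(samples_i, samples_j)
assert math.isclose(prs, asg, rel_tol=1e-9)
```

The agreement follows from expanding the squared distance around the two sample means: the cross terms average to zero, leaving the squared distance between the means plus the two variance totals.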

21
Performance Study
  • Experimental results also show that ASG is
  • much more accurate than DM with comparable speed
  • much faster than GAPS with higher or comparable
    accuracy
  • (number of grid cells = number of samples)

22
Experiment 1
  • 100 uncertain objects (4 Gaussian pdfs, variances
    in [1, 10])

23
Experiment 1
24
Experiment 1
  • ASG: 0.02 ms

25
Experiment 2
  • Data generated in the same way as in
  • Ngai et al., "Efficient clustering of uncertain
    data," in the 2006 IEEE International Conference
    on Data Mining (ICDM)
  • A grid of 14 × 14 cells
  • Probability of each cell randomly generated,
    then normalized

GAPS produces the correct solution, but how close
is ASG?
26
Experiment 2
27
Experiment 2
  • ASG: 0.02 ms

28
Experiment 3
  • ASG also approximates objects with uniform pdf
    well
  • 10 objects with radii in [1, 5], randomly located
    in a 100 × 100 2D space
  • ASG takes 100 samples, repeated 6 times
  • Accuracy
  • Worst case > 0.98
  • Average > 0.99

29
Experiment 4
  • Scalability w.r.t. Dimensions
  • 2/3/4-D
  • 256/216/256 samples/cells
  • ASG
  • Accuracy: 0.97 to 0.99
  • Time: 0.02 ms or less

30
Experiment 4
31
Experiment 4
32
Conclusion
  • Importance of expected distance calculation in
    queries and data mining applications on uncertain
    data
  • Analytic solutions of special cases
    (uniform/Gaussian pdf)
  • ASG can obtain highly accurate results quickly
  • ASG can replace GAPS used in recent research work