The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

1
The Power-Method: A Comprehensive Estimation
Technique for Multi-Dimensional Queries
  • Yufei Tao (U. Hong Kong)
  • Christos Faloutsos (CMU)
  • Dimitris Papadias (Hong Kong UST)

2
Roadmap
  • Problem motivation
  • Survey
  • Proposed method: main idea
  • Proposed method: details
  • Experiments
  • Conclusions

3
Target query types
  • DB: a set of m d-dimensional points.
  • Range search (RS)
  • k nearest neighbor (KNN)
  • Regional distance (self-) join (RDJ)
  • e.g., in Louisiana, find all pairs of music stores
    closer than 1 mi to each other

4
Target problem
  • Estimate
  • Query selectivity
  • Query (I/O) cost
  • for any Lp metric
  • using a single method

5
Target Problem
  • for any Lp metric
  • using a single method

6
Roadmap
  • Problem motivation
  • Survey
  • Proposed method: main idea
  • Proposed method: details
  • Experiments
  • Conclusions

7
Older Query estimation approaches
  • Vast literature
  • Sampling, kernel estimation, singular value
    decomposition, compressed histograms, sketches,
    maximal independence, Euler formula, etc.
  • BUT: they target specific cases (mostly range
    search selectivity under the L∞ norm), and their
    extensions to other problems are unclear

8
Main competitors
  • Local method
  • Representative methods: Histograms
  • Global method
  • Provides a single estimate corresponding to the
    average selectivity/cost of all queries,
    independently of their locations
  • Representative methods: Fractal and power law

9
Rationale and problems of histograms
  • Partition the data space into a set of buckets
    and assume (local) uniformity
  • Problems:
  • the uniformity assumption (real data are rarely
    uniform)
  • tricky/slow estimations, for all but the L∞ norm

10
Roadmap
  • Problem motivation
  • Survey
  • Proposed method: main idea
  • Proposed method: details
  • Experiments
  • Conclusions

11
Inherent defect of histograms
  • Density trap: what is the density in the
    vicinity of q?

[Figure: points on a line around q. With diameter 10
the neighborhood holds 10 points, so density =
10/100 = 0.1; with diameter 100 it holds 100 points,
so density = 100/10,000 = 0.01. Q: What is going on?]
12
Inherent defect of histograms
  • Density trap: what is the density in the
    vicinity of q?

[Figure: the same line example. Diameter 10:
density = 10/100 = 0.1; diameter 100:
density = 100/10,000 = 0.01.]
Q: What is going on? A: We are asking a silly
question: what is the area of a line?
13
Density Trap
  • Not caused by a mathematical oddity like the
    Hilbert curve, but by a line, a perfectly
    well-behaved Euclidean object!
  • This trap will appear for any non-uniform
    dataset
  • Almost ALL real point-sets are non-uniform -> the
    trap is real

14
Density Trap
  • In short: density (count / area) is meaningless
  • What should we do instead?

15
Density Trap
  • In short: density (count / area) is meaningless
  • What should we do instead?
  • A: plot log(count of neighbors) vs. log(area)

16
Local power law
  • In more detail local power law
  • nb neighbors of point p, within radius r
  • cp local constant
  • np local exponent ( local intrinsic
    dimensionality)
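A minimal sketch (not the paper's exact procedure) of how c_p and n_p could be fitted for one point: count neighbors at a few radii and fit a line in log-log space. The function name fit_lpl, the choice of radii, and the least-squares fit are assumptions of this illustration.

import numpy as np

def fit_lpl(data, p, radii, order=np.inf):
    """Fit the local power law nb_p(r) ~ c_p * r**n_p for one point.

    data : (m, d) array of points
    p    : (d,) point whose neighborhood is examined
    radii: increasing radii (each should enclose at least one neighbor)
    order: order of the Lx norm, e.g. 2 (Euclidean) or np.inf
    Returns (c_p, n_p), fitted by least squares in log-log space.
    """
    dist = np.linalg.norm(data - p, ord=order, axis=1)
    counts = np.array([(dist <= r).sum() for r in radii])
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p

Applied to points spread along a line, such a fit should return an exponent close to 1, matching the earlier example.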

17
Local power law
  • Intuitively: to avoid the density trap, use
  • n_p (the local intrinsic dimensionality)
  • instead of density

18
Does LPL make sense?
  • For point q, the LPL gives
  • nb_q(r) = <constant> * r^1
  • (no need for density, nor uniformity)

[Figure: the line example again: diameter 10 ->
10 neighbors, diameter 100 -> 100 neighbors,
i.e. the count grows linearly with the radius.]
19
Local power law and Lx
  • if a point obeys the LPL under L∞,
  • ditto for any other Lx metric,
  • with the same local exponent
  • -> the LPL works easily, for ANY Lx metric
    (a quick check follows below)
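As a quick, self-contained sanity check of this claim (on synthetic uniform data, an assumption of this sketch rather than one of the paper's datasets), one can fit the exponent under two different Lx norms and compare:

import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 100.0, size=(20_000, 2))   # uniform 2-d point set
q = np.array([50.0, 50.0])
radii = np.array([1.0, 2.0, 4.0, 8.0])

def lpl_exponent(order):
    # slope of log(neighbor count) vs. log(radius) under the given Lx norm
    dist = np.linalg.norm(pts - q, ord=order, axis=1)
    counts = np.array([(dist <= r).sum() for r in radii])
    slope, _intercept = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# Both slopes come out close to 2 (the intrinsic dimension of this data);
# the fitted constants, by contrast, differ between the two norms.
print(lpl_exponent(np.inf), lpl_exponent(2))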

20
Examples
[Figure: log-log plot of neighbors(<r) vs. radius
for two points p1 and p2]
p1 has a higher local exponent (local
intrinsic dimensionality) than p2
21
Examples
22
Roadmap
  • Problem motivation
  • Survey
  • Proposed method: main idea
  • Proposed method: details
  • Experiments
  • Conclusions

23
Proposed method
  • Main idea: if we know (or can approximate) the c_p
    and n_p of every point p, we can solve all the
    posed problems (one such estimate is sketched below)
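To make the main idea concrete, here is a hedged, first-order sketch of one such estimate: range-search selectivity read directly off the LPL at q. The function name and the simple clamped formula are illustrative assumptions; the paper's Lemma 3.2 gives the actual estimators.

def estimate_rs_selectivity(c_q, n_q, r, m):
    """Rough range-search selectivity around query point q:
    expected neighbors within radius r (from the LPL), as a
    fraction of the m points in the dataset, clamped to [0, 1]."""
    expected_neighbors = c_q * r ** n_q
    return min(1.0, expected_neighbors / m)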

24
Target Problem
  • for any Lp metric
  • using a single method

25
Target Problem
  • for any Lp metric (Lemma 3.2)
  • using a single method

26
Theoretical results
  • An interesting observation:
  • (Thm 3.4) the cost of a kNN query q depends
  • only on the local exponent,
  • and NOT on the local constant,
  • nor on the cardinality of the dataset
    (a small illustration follows below)
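For intuition only, the obvious inversion of the LPL that kNN estimation starts from (an illustrative helper, not the paper's Theorem 3.4 cost formula): solving nb_q(r) = k for r gives an estimated distance to the k-th neighbor.

def estimate_knn_distance(c_q, n_q, k):
    """Distance at which about k neighbors of q are expected,
    obtained by solving c_q * r**n_q = k for r."""
    return (k / c_q) ** (1.0 / n_q)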

27
Implementation
  • Given a query point q, we need its local exponent
    and constant to perform estimation
  • but these are too expensive to store for every
    point.
  • Q: What to do?

28
Implementation
  • Given a query point q, we need its local exponent
    and constant to perform estimation
  • but these are too expensive to store for every
    point.
  • Q: What to do?
  • A: exploit locality

29
Implementation
  • nearby points usually have similar local
    constants and exponents. Thus, one solution:
  • anchors: pre-compute the LPLaw for a set of
    representative points (anchors); use the nearest
    anchor to q

30
Implementation
  • choose anchors with sampling, DBS, or any other
    method (a minimal sketch follows below).
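A compact sketch of the two preceding slides combined: sample some anchors, pre-compute their LPL coefficients, and serve estimation requests with the coefficients of the anchor nearest to q. Plain random sampling and brute-force nearest-anchor search are simplifying assumptions of this sketch.

import numpy as np

def build_anchors(data, num_anchors, radii, order=np.inf, seed=0):
    """Sample anchor points and pre-compute their LPL coefficients."""
    rng = np.random.default_rng(seed)
    anchors = data[rng.choice(len(data), num_anchors, replace=False)]
    coeffs = []
    for a in anchors:
        dist = np.linalg.norm(data - a, ord=order, axis=1)
        counts = np.array([(dist <= r).sum() for r in radii])
        n_a, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
        coeffs.append((np.exp(log_c), n_a))
    return anchors, coeffs

def lpl_at(q, anchors, coeffs, order=np.inf):
    """Approximate (c_q, n_q) by the nearest anchor's coefficients."""
    dist = np.linalg.norm(anchors - q, ord=order, axis=1)
    return coeffs[int(np.argmin(dist))]

The returned (c, n) pair then feeds estimators such as the selectivity and kNN sketches above.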

31
Implementation
  • (In addition to anchors, we also tried using
    patches of near-constant c_p and n_p; it gave
    similar accuracy, with a more complicated
    implementation)

32
Experiments - Settings
  • Datasets:
  • SC: 40k points representing the coastlines of
    Scandinavia
  • LB: 53k points corresponding to locations in
    Long Beach county
  • Structure: R-tree
  • Compare the Power method to:
  • Minskew
  • Global method (fractal)

33
Experiments - Settings
  • The LPLaw coefficients of each anchor point are
    computed using L∞ 0.05-neighborhoods
  • Queries: biased (following the data distribution)
  • A query workload contains 500 queries
  • We report the average error
    Σ_i |act_i - est_i| / Σ_i act_i
    (computed as in the sketch below)
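For concreteness, the reported error measure can be computed as follows (a direct reading of the formula above; the function and variable names are only illustrative):

import numpy as np

def average_error(actual, estimated):
    """Workload error: sum_i |act_i - est_i| / sum_i act_i."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return np.abs(actual - estimated).sum() / actual.sum()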

34
Target Problem
  • for any Lp metric (Lemma 3.2)
  • using a single method

35
Range search selectivity
  • the LPL method wins

36
Target Problem
  • for any Lp metric (Lemma 3.2)
  • using a single method

37
Regional distance join selectivity
  • No known global method in this case
  • The LPL method wins, by a wider margin

38
Target Problem
  • for any Lp metric (Lemma 3.2)
  • using a single method

39
Range search query cost
40
k nearest neighbor cost
41
Regional distance join cost
42
Conclusions
  • We spot the density trap problem of the local
    uniformity assumption (<- histograms)
  • we show how to resolve it, using the local
    intrinsic dimension instead (-> Local Power
    Law)
  • and we solve all posed problems

43
Conclusions contd
  • for any Lp metric
  • using a single method

44
Conclusions contd
  • for any Lp metric (Lemma 3.2)
  • using a single method (LPL anchors)