Title: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
1The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
- Yufei Tao, U. Hong Kong
- Christos Faloutsos, CMU
- Dimitris Papadias, Hong Kong UST
2Roadmap
- Problem motivation
- Survey
- Proposed method main idea
- Proposed method details
- Experiments
- Conclusions
3Target query types
- DB: a set of m d-dimensional points.
- Range search (RS)
- k nearest neighbor (kNN)
- Regional distance (self-) join (RDJ)
- e.g., "in Louisiana, find all pairs of music stores closer than 1 mi to each other"
4Target problem
- Estimate
- Query selectivity
- Query (I/O) cost
- for any Lp metric
- using a single method
5Target Problem
- for any Lp metric
- using a single method
7Older query estimation approaches
- Vast literature
- Sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.
- BUT: they target specific cases (mostly range search selectivity under the L∞ norm), and their extensions to other problems are unclear
8Main competitors
- Local methods
- Representative methods: histograms
- Global methods
- Provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
- Representative methods: fractal and power law
9Rationale and problems of histograms
- Partition the data space into a set of buckets
and assume (local) uniformity
- Problems
- uniformity
- tricky/slow estimations, for all but the L∞ norm
12Inherent defect of histograms
- Density trap: what is the density in the vicinity of q?
- diameter = 10: 10/100 = 0.1
- diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
- A: We ask a silly question: "what is the area of a line?"
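The two density figures on this slide can be reproduced with a small sketch. It assumes, purely for illustration, unit-spaced points on a horizontal line and a square window of the given diameter centered at q; the helper name `density_in_window` is hypothetical and not part of the talk.

```python
def density_in_window(points, q, diameter):
    """Count points inside the square window of side `diameter` centered at q,
    and divide that count by the window's area."""
    half = diameter / 2.0
    count = sum(
        1
        for (x, y) in points
        if abs(x - q[0]) <= half and abs(y - q[1]) <= half
    )
    return count, count / (diameter ** 2)


# Assumed toy data: 1,000 points on the x-axis, one per unit length.
line_points = [(i + 0.5, 0.0) for i in range(-500, 500)]
q = (0.0, 0.0)

for diameter in (10, 100):
    count, density = density_in_window(line_points, q, diameter)
    print(f"diameter={diameter}: {count} points, density={density}")
```

The count grows linearly with the diameter while the area grows quadratically, so the "density" keeps shrinking as the window widens, even though the data around q have not changed.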
13Density Trap
- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly well-behaved Euclidean object!
- This trap will appear for any non-uniform dataset
- Almost ALL real point-sets are non-uniform -> the trap is real
15Density Trap
- In short:
- "density" (count / area) is meaningless
- What should we do instead?
- A: log(count_of_neighbors) vs. log(area)
16Local power law
- In more detail, the "local power law": nbp(r) = cp · r^np
- nbp(r): number of neighbors of point p, within radius r
- cp: local constant
- np: local exponent (= local intrinsic dimensionality)
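One way to see what the local constant and exponent mean in practice is to fit them for a single point. The sketch below assumes the law has the form nb(r) = c · r^n and fits it by ordinary least squares on (log r, log count). The function name `fit_local_power_law`, the choice of radii, and the use of Euclidean distance are illustrative assumptions, not the authors' procedure.

```python
import math


def fit_local_power_law(points, p, radii):
    """Fit nb_p(r) ~ c_p * r**n_p by least squares on (log r, log count).

    Returns (c_p, n_p). Distances are Euclidean; points equal to p are not counted.
    """
    xs, ys = [], []
    for r in radii:
        count = sum(1 for pt in points if pt != p and math.dist(pt, p) <= r)
        if count > 0:  # log(0) is undefined, so skip empty neighborhoods
            xs.append(math.log(r))
            ys.append(math.log(count))
    if len(xs) < 2:
        raise ValueError("need at least two radii with non-empty neighborhoods")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n_p = sxy / sxx                        # slope = local exponent
    c_p = math.exp(mean_y - n_p * mean_x)  # exp(intercept) = local constant
    return c_p, n_p


# Assumed toy data: unit-spaced points on a line through p.
line = [(float(i), 0.0) for i in range(-500, 501)]
c, n = fit_local_power_law(line, (0.0, 0.0), [10, 20, 40, 80])
print(n, c)
```

For this toy line the neighbor count is exactly 2r, so the fitted exponent comes out at about 1 and the constant at about 2, matching the intuition that a line has intrinsic dimensionality 1.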
17Local power law
- Intuitively: to avoid the density trap, use
- np: local intrinsic dimensionality
- instead of density
18Does LPL make sense?
- For point q, LPL gives
- nbq(r) = <constant> · r^1
- (no need for density, nor uniformity)
- [Figure: points on a line around q; diameter = 10: 10/100 = 0.1; diameter = 100: 100/10,000 = 0.01]
19Local power law and Lx
- If a point obeys the LPL under L∞,
- ditto for any other Lx metric,
- with the same local exponent
- -> LPL works easily, for ANY Lx metric
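The claim that the local exponent does not depend on the metric can be checked empirically on a toy dataset. The sketch below assumes points on a diagonal line and estimates the exponent as the log-log slope between two radii under L1, L2, and L∞. The helper names `minkowski` and `local_exponent` are hypothetical, and this is an illustration rather than a proof of the lemma.

```python
import math


def minkowski(a, b, x):
    """L_x distance between points a and b; x = math.inf gives L-infinity."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if x == math.inf:
        return max(diffs)
    return sum(d ** x for d in diffs) ** (1.0 / x)


def local_exponent(points, p, radii, x):
    """Slope of log(count) vs. log(r) between the smallest and largest radius.

    Assumes the neighborhood at the smallest radius is non-empty.
    """
    def count(r):
        return sum(1 for pt in points if pt != p and minkowski(pt, p, x) <= r)

    r_lo, r_hi = min(radii), max(radii)
    return (math.log(count(r_hi)) - math.log(count(r_lo))) / (
        math.log(r_hi) - math.log(r_lo)
    )


# Assumed toy data: points on the diagonal y = x.
diagonal = [(float(i), float(i)) for i in range(-2000, 2001)]
origin = (0.0, 0.0)
exponents = {x: local_exponent(diagonal, origin, [50, 800], x) for x in (1, 2, math.inf)}
for x, n in exponents.items():
    print(x, round(n, 2))
```

On this data the metric changes how many points fall within a given radius (the local constant), but the slope stays close to 1 in all three cases.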
20Examples
- [Figure: neighbors(<r) vs. radius, for points p1 and p2]
- p1 has a higher local exponent (= local intrinsic dimensionality) than p2
21Examples
23Proposed method
- Main idea: if we know (or can approximate) the cp and np of every point p, we can solve all the problems
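As a rough illustration of why the two local parameters suffice, assume the local power law has the form nb(r) = c · r^n (number of neighbors within radius r). The expected result size of a range query then follows directly, and inverting the law gives the radius expected to enclose k neighbors. The function names below are hypothetical, and these are simple consequences of the formula rather than the paper's exact estimators.

```python
def estimate_range_count(c_q, n_q, r):
    """Expected number of points within distance r of q: c_q * r**n_q."""
    return c_q * r ** n_q


def estimate_knn_radius(c_q, n_q, k):
    """Radius at which the law predicts k neighbors: solve c_q * r**n_q = k."""
    return (k / c_q) ** (1.0 / n_q)


# Assumed parameters for a query point lying on a line-like part of the data.
c_q, n_q = 2.0, 1.0
print(estimate_range_count(c_q, n_q, 10.0))  # predicted neighbors within r = 10
print(estimate_knn_radius(c_q, n_q, 20))     # predicted distance to the 20th neighbor
```

The two functions are inverses of each other, which is why a single pair (cp, np) can serve both range-style and nearest-neighbor-style estimates.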
25Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method
26Theoretical results
- Interesting observation:
- (Thm 3.4) the cost of a kNN query q depends
- only on the local exponent
- and NOT on the local constant,
- nor on the cardinality of the dataset
28Implementation
- Given a query point q, we need its local constant and exponent to perform estimation
- but these are too expensive to store for every point.
- Q: What to do?
- A: Exploit locality
29Implementation
- Nearby points usually have similar local constants and exponents. Thus, one solution:
- Anchors: pre-compute the LPLaw for a set of representative points (anchors); use the nearest anchor to q
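The anchor idea can be sketched as a small lookup structure: store a (constant, exponent) pair per anchor, and answer a query with the parameters of the anchor nearest to q. The class name `AnchorEstimator`, the linear scan, and the sample anchor values are assumptions made for illustration; the talk does not specify these details.

```python
import math


class AnchorEstimator:
    """Stores (c, n) per anchor and answers queries via the nearest anchor."""

    def __init__(self, anchors):
        # anchors: list of (point, c, n) triples, assumed pre-computed offline.
        if not anchors:
            raise ValueError("at least one anchor is required")
        self.anchors = list(anchors)

    def nearest_anchor(self, q):
        # Linear scan for clarity; a spatial index could replace it.
        return min(self.anchors, key=lambda a: math.dist(a[0], q))

    def estimate_range_count(self, q, r):
        # Borrow the nearest anchor's parameters and apply nb(r) = c * r**n.
        _, c, n = self.nearest_anchor(q)
        return c * r ** n


# Hypothetical anchors: one in a line-like region, one in a plane-like region.
est = AnchorEstimator([((0.0, 0.0), 2.0, 1.0), ((100.0, 100.0), 3.0, 2.0)])
print(est.estimate_range_count((1.0, 2.0), 10.0))    # served by the first anchor
print(est.estimate_range_count((90.0, 95.0), 10.0))  # served by the second anchor
```

Storage grows with the number of anchors rather than with the dataset size, which is the point of exploiting locality.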
30Implementation
- choose anchors with sampling, DBS, or any other
method.
31Implementation
- (In addition to anchors, we also tried to use patches of near-constant cp and np; this gave similar accuracy, at the cost of a more complicated implementation)
32Experiments - Settings
- Datasets
- SC: 40k points representing the coastlines of Scandinavia
- LB: 53k points corresponding to locations in Long Beach county
- Structure: R-tree
- Compare the Power method to:
- Minskew
- Global method (fractal)
33Experiments - Settings
- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution)
- A query workload contains 500 queries
- We report the average error: Σi |acti - esti| / Σi acti
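The error measure for a workload can be computed as below. The sketch assumes the reported error is the sum of absolute differences between actual and estimated values, divided by the sum of actual values; the function name `workload_error` and the sample numbers are hypothetical.

```python
def workload_error(actual, estimated):
    """Relative workload error: sum_i |act_i - est_i| / sum_i act_i."""
    if len(actual) != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("sum of actual values must be non-zero")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / total_actual


# Made-up values for three queries, only to show the call.
err = workload_error([10, 20, 30], [12, 18, 30])
print(err)
```

Normalizing by the total of the actual values, rather than averaging per-query ratios, keeps queries with very small actual results from dominating the reported error.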
35Range search selectivity
37Regional distance join selectivity
- No known global method in this case
- The LPL method wins, with higher margin
39Range search query cost
40k nearest neighbor cost
41Regional distance join cost
42Conclusions
- We spot the "density trap" problem of the local uniformity assumption (<- histograms)
- We show how to resolve it, using the local intrinsic dimension instead (-> Local Power Law)
- and we solved all posed problems
44Conclusions contd
- for any Lp metric (Lemma 3.2)
- using a single method (LPL anchors)