The PowerMethod: A Comprehensive Estimation Technique for MultiDimensional Queries

Provided by: ssu51

Learn more at: http://www.cs.cmu.edu

1
The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries

Yufei Tao, U. Hong Kong
Christos Faloutsos, CMU
Dimitris Papadias, Hong Kong UST

2
Roadmap

Problem motivation
Survey
Proposed method: main idea
Proposed method: details
Experiments
Conclusions

3
Target query types

DB: a set of m d-dimensional points.
Range search (RS)
k nearest neighbor (KNN)
Regional distance (self-) join (RDJ)
e.g., in Louisiana, find all pairs of music stores
closer than 1 mi to each other
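
To make the three query types concrete, here is a minimal brute-force sketch of what each one returns. The function names, the rectangular representation of the region, and the choice of inclusive or strict comparisons are assumptions made for illustration; the slides give no reference code.

```python
from itertools import combinations

INF = float("inf")

def lp_dist(a, b, p=2.0):
    """L_p distance between two points; p=INF gives the max-norm."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == INF:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def range_search(points, q, r, p=2.0):
    """RS: all points within distance r of the query point q."""
    return [pt for pt in points if lp_dist(pt, q, p) <= r]

def knn(points, q, k, p=2.0):
    """KNN: the k points closest to q."""
    return sorted(points, key=lambda pt: lp_dist(pt, q, p))[:k]

def regional_distance_join(points, region, t, p=2.0):
    """RDJ: all pairs of points inside `region` that are closer than t.

    `region` is assumed to be an axis-parallel box (low_corner, high_corner).
    """
    low, high = region
    inside = [pt for pt in points
              if all(l <= x <= h for x, l, h in zip(pt, low, high))]
    return [(a, b) for a, b in combinations(inside, 2)
            if lp_dist(a, b, p) < t]
```

Selectivity is the size of such a result relative to the input. The estimation problem in these slides is to predict that size (and the I/O cost) without running the query.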

4
Target problem

Estimate
Query selectivity
Query (I/O) cost
for any Lp metric
using a single method

5
Target Problem

for any Lp metric
using a single method

6
Roadmap

Problem motivation
Survey
Proposed method: main idea
Proposed method: details
Experiments
Conclusions

7
Older query estimation approaches

Vast literature:
sampling, kernel estimation, singular value
decomposition, compressed histograms, sketches,
maximal independence, Euler formula, etc.
BUT: they target specific cases (mostly range
search selectivity under the L∞ norm), and their
extensions to other problems are unclear

8
Main competitors

Local method
Representative methods: histograms
Global method
Provides a single estimate corresponding to the
average selectivity/cost of all queries,
independently of their locations
Representative methods: fractal and power law

9
Rationale and problems of histograms

Partition the data space into a set of buckets
and assume (local) uniformity

Problems:
the uniformity assumption itself
tricky/slow estimations for all but the L∞ norm

10
Roadmap

Problem motivation
Survey
Proposed method: main idea
Proposed method: details
Experiments
Conclusions

11
Inherent defect of histograms

Density trap: what is the density in the
vicinity of q?

diameter = 10: density = 10/100 = 0.1
diameter = 100: density = 100/10,000 = 0.01
Q: What is going on?
12
Inherent defect of histograms

Density trap: what is the density in the
vicinity of q?

diameter = 10: density = 10/100 = 0.1
diameter = 100: density = 100/10,000 = 0.01
Q: What is going on?
A: we ask a silly question: what is the area of a
line?
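
The numbers on this slide can be reproduced with a small sketch. The dataset (unit-spaced points on a horizontal line) and the query position are assumptions chosen so that the counts match the slide; the point is only that the measured "density" changes with the size of the window.

```python
def density_in_square(points, q, side):
    """Count the points inside an axis-parallel square of the given side
    centered at q, and divide by the square's area."""
    half = side / 2.0
    count = sum(1 for (x, y) in points
                if abs(x - q[0]) <= half and abs(y - q[1]) <= half)
    return count, count / (side * side)

# Assumed toy data: points spaced one unit apart along the line y = 0.
points = [(float(x), 0.0) for x in range(-500, 501)]
q = (0.5, 0.0)

for side in (10, 100):
    count, density = density_in_square(points, q, side)
    print(f"side={side}: count={count}, density={density}")
```

The count grows linearly with the side of the square while the area grows quadratically, so no single density value describes the neighborhood of q.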
13
Density Trap

Not caused by a mathematical oddity like the
Hilbert curve, but by a line, a perfectly
behaving Euclidean object!
This trap will appear for any non-uniform
dataset
Almost ALL real point-sets are non-uniform -> the
trap is real

14
Density Trap

In short:
density (count / area) is meaningless
What should we do instead?

15
Density Trap

In short:
density (count / area) is meaningless
What should we do instead?
A: log(count_of_neighbors) vs. log(area)

16
Local power law

In more detail, the local power law:
$nb_p(r) = c_p \cdot r^{n_p}$
nb_p(r): number of neighbors of point p within radius r
c_p: local constant
n_p: local exponent (= local intrinsic
dimensionality)
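
One way to obtain the local constant and exponent of a point is a least-squares line through (log r, log count) pairs, as sketched below. The L∞ neighborhoods, the choice of radii, and the function names are assumptions; the slides do not specify the fitting procedure at this level of detail.

```python
import math

def count_neighbors(points, p, r):
    """Number of points within L-infinity distance r of p."""
    return sum(1 for pt in points
               if max(abs(a - b) for a, b in zip(pt, p)) <= r)

def fit_local_power_law(points, p, radii):
    """Fit log(count) = log(c) + n * log(r) by least squares.

    Returns (c, n): the estimated local constant and local exponent at p.
    Assumes every radius captures at least one point.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(count_neighbors(points, p, r)) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope
```

For unit-spaced points on a line the fitted exponent comes out at about 1, and for a filled 2-d grid it approaches 2, which matches the reading of the exponent as a local intrinsic dimensionality.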

17
Local power law

Intuitively: to avoid the density trap, use
n_p (the local intrinsic dimensionality)
instead of density

18
Does LPL make sense?

For point q, the LPL gives
$nb_q(r) = \text{constant} \cdot r^{1}$
(no need for density, nor uniformity)

diameter = 10: density = 10/100 = 0.1
diameter = 100: density = 100/10,000 = 0.01
19
Local power law and Lx

If a point obeys the LPL under L∞,
it also does under any other Lx metric,
with the same local exponent
-> LPL works easily, for ANY Lx metric
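
A short sketch of why this claim is plausible, based only on the standard equivalence of norms in $\mathbb{R}^d$. It is not necessarily the argument the authors use, and the superscript notation $nb^{(p)}_q$ for counts under the $L_p$ metric is introduced here for illustration.

```latex
% Assumption: q obeys nb^{(\infty)}_q(r) = c_q r^{n_q} under L_\infty
% for the radii of interest; d is the dimensionality of the space.
\begin{align*}
  \|x\|_\infty &\le \|x\|_p \le d^{1/p}\,\|x\|_\infty
    && \text{for all } x \in \mathbb{R}^d,\ p \ge 1 \\
  B_\infty\!\left(q,\, r\,d^{-1/p}\right) &\subseteq B_p(q, r) \subseteq B_\infty(q, r)
    && \text{(nested balls)} \\
  c_q \left(r\,d^{-1/p}\right)^{n_q} &\le nb^{(p)}_q(r) \le c_q\, r^{n_q}
    && \text{(count the points in each ball)} \\
  n_q \log r + \log c_q - \tfrac{n_q}{p}\log d
    &\le \log nb^{(p)}_q(r) \le n_q \log r + \log c_q
    && \text{(take logarithms)}
\end{align*}
```

On a log-log plot the $L_p$ count is therefore confined between two parallel lines of slope $n_q$: the exponent is preserved, and only the constant can change with the metric.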

20
Examples
[Figure: plot of neighbors(<r) versus radius for points p1 and p2]
p1 has a higher local exponent (local
intrinsic dimensionality) than p2
21
Examples
22
Roadmap

Problem motivation
Survey
Proposed method: main idea
Proposed method: details
Experiments
Conclusions

23
Proposed method

Main idea: if we know (or can approximate) the c_p
and n_p of every point p, we can solve all the
problems
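
As an illustration of how the two local parameters feed an estimate, the sketch below applies the power law directly: a range count is c * r^n, and the radius expected to enclose k neighbors is obtained by inverting that formula. These are straightforward consequences of the local power law rather than the exact formulas from the paper, and the I/O cost estimates are not covered here. All function names are hypothetical.

```python
def estimate_range_count(c, n, r):
    """Expected number of points within radius r of a query point whose
    local constant is c and local exponent is n."""
    return c * r ** n

def estimate_range_selectivity(c, n, r, cardinality):
    """Range count as a fraction of the dataset size, capped at 1."""
    return min(1.0, estimate_range_count(c, n, r) / cardinality)

def estimate_knn_radius(c, n, k):
    """Radius at which the power law predicts k neighbors:
    solve c * r**n = k for r."""
    return (k / c) ** (1.0 / n)

# Example: a point on a line-like neighborhood (exponent 1, constant 2).
print(estimate_range_count(2.0, 1.0, 5.0))
print(estimate_range_selectivity(2.0, 1.0, 5.0, 1000))
print(estimate_knn_radius(2.0, 1.0, 10))
```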

24
Target Problem

for any Lp metric
using a single method

25
Target Problem

for any Lp metric (Lemma 3.2)
using a single method

26
Theoretical results

Interesting observation
(Thm 3.4): the cost of a kNN query q depends
only on the local exponent
and NOT on the local constant,
nor on the cardinality of the dataset

27
Implementation

Given a query point q, we need its local exponent
and constant to perform estimation,
but these are too expensive to store for every point.
Q: What to do?

28
Implementation

Given a query point q, we need its local exponent
and constant to perform estimation,
but these are too expensive to store for every point.
Q: What to do?
A: exploit locality

29
Implementation

Nearby points usually have similar local
constants and exponents. Thus, one solution is
anchors: pre-compute the LPLaw for a set of
representative points (anchors), and use the
anchor nearest to q

30
Implementation

Choose anchors by sampling, density-biased sampling (DBS), or any
other method.
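
The anchor idea can be sketched as follows: sample some data points, fit the local power law at each, and answer a query with the parameters of its nearest anchor. Uniform random sampling, the L∞ metric, the radii, and all function names are assumptions made for this illustration; it is not the authors' implementation.

```python
import math
import random

def linf(a, b):
    """L-infinity distance between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

def fit_lpl(points, p, radii):
    """Least-squares fit of log(count) = log(c) + n * log(r) at point p."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(sum(1 for pt in points if linf(pt, p) <= r)) for r in radii]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    n = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - n * mean_x), n

def build_anchors(points, num_anchors, radii, seed=0):
    """Pre-compute (anchor, c, n) for a uniform random sample of the data.
    Anchors are data points, so every neighborhood count is at least 1."""
    rng = random.Random(seed)
    return [(a, *fit_lpl(points, a, radii))
            for a in rng.sample(points, num_anchors)]

def estimate_range_count(anchors, q, r):
    """Use the power law of the anchor nearest to q."""
    _, c, n = min(anchors, key=lambda entry: linf(entry[0], q))
    return c * r ** n
```

Only the anchors and their two coefficients need to be stored, which keeps the summary small. The trade-off is that accuracy depends on how similar the query's neighborhood is to that of its nearest anchor.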

31
Implementation

(In addition to anchors, we also tried using
patches of near-constant c_p and n_p; it gave
similar accuracy, at the price of a more
complicated implementation)

32
Experiments - Settings

Datasets:
SC: 40k points representing the coastlines
of Scandinavia
LB: 53k points corresponding to
locations in Long Beach county
Structure: R-tree
Compare the Power method to:
Minskew
Global method (fractal)

33
Experiments - Settings

The LPLaw coefficients of each anchor point are
computed using L∞ 0.05-neighborhoods
Queries: biased (following the data distribution)
A query workload contains 500 queries
We report the average error $\sum_i |act_i - est_i| / \sum_i act_i$
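
The reported error metric can be computed as in the sketch below, where `actual` and `estimated` hold the per-query true and estimated values (selectivity or cost) for one workload. The function name and the input checks are assumptions.

```python
def workload_relative_error(actual, estimated):
    """Sum of absolute estimation errors divided by the sum of actual values."""
    if len(actual) != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    total = sum(actual)
    if total == 0:
        raise ValueError("sum of actual values must be non-zero")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / total
```

Because the normalization is by the total over the workload rather than per query, queries with large actual values dominate the metric and queries with near-zero results cannot blow it up.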

34
Target Problem

for any Lp metric (Lemma 3.2)
using a single method

35
Range search selectivity

the LPL method wins

36
Target Problem

for any Lp metric (Lemma 3.2)
using a single method

37
Regional distance join selectivity

No known global method in this case
The LPL method wins, by a wider margin

38
Target Problem

for any Lp metric (Lemma 3.2)
using a single method

39
Range search query cost
40
k nearest neighbor cost
41
Regional distance join cost
42
Conclusions

We spot the density trap problem of the local
uniformity assumption (<- histograms)
We show how to resolve it, using the local
intrinsic dimension instead (-> Local Power
Law)
and we solved all the posed problems

43
Conclusions (cont'd)