Selectivity Estimation For Boolean Queries - PowerPoint PPT Presentation

About This Presentation

Title:

Selectivity Estimation For Boolean Queries

Description:

AT&T Labs. 1. Selectivity Estimation For Boolean Queries. Zhiyuan Chen (Speaker) ... AT&T Labs. 2. Motivation(1) Boolean queries on substring predicates are ... – PowerPoint PPT presentation

Slides: 21

Provided by: zhiyua1

Learn more at: http://www.cs.cornell.edu

Transcript and Presenter's Notes

1
Selectivity Estimation For Boolean Queries

Zhiyuan Chen (Speaker)
Flip Korn, Nick Koudas, S. Muthukrishnan

2
Motivation (1)
Document 1: Peanut butter lovers club ...
Document 2: peanut stock...
Document 3: butter, the natural choice
...
Document 100: ...
How many documents contain the substring "peanut" but not "butter"?
Boolean queries on substring predicates are ubiquitous:
Information retrieval.
Bibliographic search.
Web searching: 40 million queries per day at AltaVista.
RDB queries.
3
Motivation (2)

Computing exact answers is expensive!
It requires either super-linear space or linear time.

Use estimates for:
Query optimization.
Best filtering order.
Interactive query refinement.
It is hard to write a query that returns 1-20 answers.
The ranking approach does not always work and is expensive.
Estimate -> refine query -> ... -> exact answer

4
Outline

Problem Definition
Related work
Our approach
Experiments
Conclusions

5
Problem Definition
Document 1: Peanut butter lovers club ...
Document 2: peanut stock...
...
Document 100: ...
How many documents contain the substring "peanut" but not "butter"?

Substring predicate: σ(s) is true iff string s contains σ as a substring.
Boolean queries: substring predicates combined with AND, OR, NOT.
For a string set S and a Boolean query q, the selectivity P(q) is the fraction of strings in S that satisfy q.
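The definition can be made concrete with a minimal Python sketch. It assumes case-insensitive matching and a toy collection that holds only the three example documents from the motivation slide; the helper names `contains` and `selectivity` are illustrative and not from the talk. It computes the exact answer by scanning every string, which is the linear-time cost the talk wants to avoid.

```python
# Toy collection: only the three example documents shown on the slides.
DOCS = [
    "Peanut butter lovers club",
    "peanut stock",
    "butter, the natural choice",
]

def contains(sub):
    """Substring predicate: true iff the string contains `sub`.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return lambda s: sub.lower() in s.lower()

def selectivity(query, strings):
    """Fraction of strings in `strings` that satisfy the Boolean query."""
    return sum(1 for s in strings if query(s)) / len(strings)

peanut = contains("peanut")
butter = contains("butter")

# Boolean query: peanut AND NOT butter.
def peanut_not_butter(s):
    return peanut(s) and not butter(s)

print(selectivity(peanut_not_butter, DOCS))
```

Only the second document satisfies the query, so the selectivity on this toy collection is one third.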

6
Related Work (1)

Histograms: not suitable for strings.
The selectivity of adjacent substrings often differs a lot!

End-biased histograms (IP95).
The structure of substring dependence is not used.
If σ is pruned, then for any α: Count(ασ) = Count(σ) = Default.

7
Related Work (2)

Existing work for substring queries.
Conjunction-only queries (KVI96, WVI97, JNS99, JKNS99).

Preprocess a compact data structure: the pruned suffix tree.

Correlation between predicates is stored explicitly.
Otherwise: independence assumption on substring predicates.

Pruned case:
Parse the query into subqueries and estimate each subquery.
Combine the estimates with a probabilistic formula:
Independence assumption.
Maximal overlap (conditioning on overlaps).

8
Use Previous Approach?

NO!
- Exponential (2^(2m)) space to store the correlation between substring predicates (m is the number of suffixes).

Not store the correlation?
- No! Correlation is important!

Independence assumption: P(peanut ∧ butter) = P(peanut) × P(butter) = 2/100 × 2/100 = 0.04%.
Document 1: Peanut butter lovers club ...
Document 2: peanut stock
Document 3: butter, the natural choice
...
Document 100: ...
The true selectivity is 1/100 = 1%, so the estimate is 25 times smaller than the true count!
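The arithmetic behind the factor of 25 can be checked with a few lines of Python. The sketch assumes, as the slide's numbers imply, that only documents 1 and 2 contain "peanut" and only documents 1 and 3 contain "butter" among the 100 documents; the variable names are illustrative.

```python
from fractions import Fraction

N = 100                      # number of documents in the example
p_peanut = Fraction(2, N)    # documents 1 and 2 contain "peanut"
p_butter = Fraction(2, N)    # documents 1 and 3 contain "butter"
p_true = Fraction(1, N)      # only document 1 contains both

# Independence assumption: multiply the individual selectivities.
p_independent = p_peanut * p_butter

# How many times smaller the estimate is than the true selectivity.
underestimate_factor = p_true / p_independent

print(float(p_independent))   # 0.0004, i.e. 0.04%
print(underestimate_factor)   # 25
```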
9
Set-Oriented Approach - Store Correlation Implicitly
Base set: the set of IDs of the strings that contain the substring.
Document 1: Peanut butter lovers club ...
Document 2: peanut stock...
Document 3: butter, the natural choice
...
Document 100: ...
BaseSet(peanut ∧ butter) = BaseSet(peanut) ∩ BaseSet(butter) = {1}
Base sets can be huge in the worst case: O(number of strings)!
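A small Python sketch of the set-oriented idea, assuming case-insensitive matching and a dictionary holding only the three documents that the slide spells out (the function name `base_set` is illustrative). Boolean operators on predicates become set operations on base sets, so correlation never has to be stored explicitly.

```python
# Document ID -> text; only the documents shown on the slide.
docs = {
    1: "Peanut butter lovers club",
    2: "peanut stock",
    3: "butter, the natural choice",
}

def base_set(substring, docs):
    """IDs of the documents whose text contains `substring` (case-insensitive)."""
    needle = substring.lower()
    return {doc_id for doc_id, text in docs.items() if needle in text.lower()}

peanut = base_set("peanut", docs)
butter = base_set("butter", docs)

print(peanut & butter)   # AND -> intersection: {1}
print(peanut | butter)   # OR  -> union
print(peanut - butter)   # peanut AND NOT butter -> difference: {2}
```

Each base set can hold up to one ID per string, which is the worst-case size the slide warns about and the reason the talk replaces base sets with short signatures.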
10
Set-Hashing Approach

A Monte Carlo technique (Cohen94, Broder98).

Two sets A, B.
Set inclusion-exclusion for unions and complements.
11
Signature Generation
Randomly permute the universe 3 times.
For each permutation, pick the first element that is in A = {1, 2, 3}.
The picked elements form the signature of A.
12
Reconstruction
A's signature: SA = (1, 3, 2)
B's signature: SB = (2, 3, 2)
Definition: r = (number of pair-wise matches of SA and SB) / |SA| = 2/3
Theorem:
|A ∩ B| ≈ r / (1 + r) × (|A| + |B|) = (2/3) / (1 + 2/3) × (3 + 3) = 2.4.
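The reconstruction step can be replayed in Python with the slide's two signatures, (1, 3, 2) and (2, 3, 2), and set sizes of 3 each. This is a sketch only: the function names are illustrative, and exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction

def match_ratio(sig_a, sig_b):
    """r: fraction of signature positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return Fraction(matches, len(sig_a))

def estimate_intersection(sig_a, sig_b, size_a, size_b):
    """Estimate |A ∩ B| as r / (1 + r) * (|A| + |B|)."""
    r = match_ratio(sig_a, sig_b)
    return r / (1 + r) * (size_a + size_b)

sig_a = (1, 3, 2)
sig_b = (2, 3, 2)

print(match_ratio(sig_a, sig_b))                         # 2/3
print(float(estimate_intersection(sig_a, sig_b, 3, 3)))  # 2.4
```

The formula follows from r estimating |A ∩ B| / |A ∪ B| together with |A ∪ B| = |A| + |B| - |A ∩ B|. With only three signature components the estimate is coarse; longer signatures reduce the variance.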
13
Implementation Issues

Approximate permutations:
Use a set of independent hash functions.
Pick the minimal hash images as signature components: Sig(A) = min{h(x) : x in A}.
Signature of unions: Sig(peanut ∨ butter) = pair-wise min of Sig(p) and Sig(b).
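A hedged Python sketch of these implementation points. The hash family h(x) = (a·x + b) mod p with a fixed seed is an assumption chosen for illustration, since the talk does not say which hash functions were used; string IDs are assumed to be non-negative integers below the prime, and base sets are assumed non-empty.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; assumed larger than any string ID

def make_hash_functions(k, seed=0):
    """k random linear hash functions, each stored as a pair (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def signature(ids, hash_functions):
    """Sig(A): for each hash function h, the minimum of h(x) over x in A."""
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_functions]

def union_signature(sig_a, sig_b):
    """Signature of A ∪ B: component-wise minimum of the two signatures."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]

hashes = make_hash_functions(64)
peanut = {1, 2}   # base set of "peanut" in the running example
butter = {1, 3}   # base set of "butter"

sig_union = union_signature(signature(peanut, hashes), signature(butter, hashes))

# The pair-wise min reproduces the signature of the true union exactly.
print(sig_union == signature(peanut | butter, hashes))
```

Because the minimum over a union equals the minimum of the two minima, disjunctions can be handled from stored signatures alone, without access to the base sets.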

14
Algorithm Outline - No Pruning

Without negations.
Convert to CNF, e.g. (peanut ∨ butter) ∧ sandwich.
Estimate using Sig(peanut ∨ butter) and Sig(sandwich).

With negations.
P((peanut ∨ butter) ∧ ¬sandwich)
= P((p ∧ ¬s) ∨ (b ∧ ¬s)) (convert to DNF)
= P(p ∧ ¬s) + P(b ∧ ¬s) - P(p ∧ b ∧ ¬s) (eliminate disjunction by set inclusion-exclusion)
= P(p) - P(p ∧ s) + P(b) - P(b ∧ s) - P(p ∧ b) + P(p ∧ b ∧ s) (eliminate negations)

Comment: works fine with short queries.
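The negation-elimination identity for (p ∨ b) ∧ ¬s can be sanity-checked on exact sets. In the Python sketch below the three ID sets are made-up values for illustration, and the function names are not from the talk; in the actual method each conjunction count would be estimated from signatures rather than computed from base sets.

```python
def count_direct(p, b, s):
    """Exact count of IDs satisfying (p OR b) AND NOT s."""
    return len((p | b) - s)

def count_negation_free(p, b, s):
    """Same count using only negation-free conjunctions:
    |p| - |p∧s| + |b| - |b∧s| - |p∧b| + |p∧b∧s|.
    """
    return (len(p) - len(p & s)
            + len(b) - len(b & s)
            - len(p & b) + len(p & b & s))

# Hypothetical base sets of string IDs.
p = {1, 2, 4, 7}
b = {1, 3, 4, 8}
s = {4, 5, 7, 8}

print(count_direct(p, b, s))          # 3
print(count_negation_free(p, b, s))   # 3
```

The number of terms grows quickly with the number of predicates, which is consistent with the slide's remark that the approach works fine for short queries.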
15
Pruned Suffix Tree Case

Parse the query into subqueries that contain only predicates present in the suffix tree.

Use signatures to estimate each subquery.

Combine the estimates using a probabilistic formula: maximal overlap parsing and conditioning on the overlap. E.g., with abc parsed into ab and bc:
P(¬(abc ∧ 12) ∧ 23)
= P(23) - P(abc ∧ 12 ∧ 23)
= P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | ab ∧ 12 ∧ 23)
≈ P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | b ∧ 12 ∧ 23)
= P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23)
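The final combination step, P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23), is plain arithmetic once the subquery estimates are available. The Python sketch below uses made-up selectivities purely for illustration; in the method they would come from signature-based estimates, and the function and parameter names are hypothetical.

```python
from fractions import Fraction

def combine_with_overlap(p_23, p_ab, p_bc, p_b):
    """Combine subquery estimates by conditioning on the overlap b.

    p_23 : P(23)
    p_ab : P(ab ∧ 12 ∧ 23)
    p_bc : P(bc ∧ 12 ∧ 23)
    p_b  : P(b ∧ 12 ∧ 23), the overlap of ab and bc
    """
    if p_b == 0:
        # No string matches the overlap, so none matches ab or bc either.
        return p_23
    return p_23 - p_ab * p_bc / p_b

# Hypothetical subquery selectivities (not from the talk).
estimate = combine_with_overlap(
    p_23=Fraction(10, 100),
    p_ab=Fraction(4, 100),
    p_bc=Fraction(3, 100),
    p_b=Fraction(6, 100),
)
print(estimate)   # 2/25, i.e. 8%
```

Under the plain independence assumption the subtracted term would be P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23); dividing by the overlap's selectivity corrects for the fact that ab and bc share the substring b.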
16
Complexity

Theorem
Preprocessing (building the tree and signatures): O(signature length × database size) time and space.
Online estimate: O(2^O(L)) time, where L is the query length.
Online time depends only on the query length.
L is small in real life.
Below 1 millisecond in experiments.

17
Experiments - Setup

Data set: real AT&T data (service descriptions).
130 K strings, 2.5 MB.
Queries
Templates
T1 (A ? B) ? (C ? D)
T2 (A ? B) ? (C ? D) ? (E ? F) ? (G ? H)
With a certain probability of negation.
Positive and negative queries.
Compare with the independence assumption.
Run on a 350 MHz Intel PC with 128 MB RAM.
1 minute preprocessing time.

18
PST-Positive Queries
Probability of negation: 5%. Metric: average absolute relative error.
T1 (A ? B) ? (C ? D)
19
PST-Negative Queries
Probability of negation: 5%. Metric: average root-mean-square error (count).
T1 (A ? B) ? (C ? D)
20
Conclusions

Contributions
A novel problem.
A novel approach: store correlation implicitly and generate it as needed by set hashing.
Far superior to the independence assumption:
1.0% space,
4 times more accurate for positive queries, many orders of magnitude more accurate for negative queries.
Ongoing and future work:
Twig estimation for XML documents.
Regular expressions, position constraints.