Title: Selectivity Estimation For Boolean Queries
1Selectivity Estimation For Boolean Queries
- Zhiyuan Chen (Speaker)
- Flip Korn, Nick Koudas, S. Muthukrishnan
2Motivation(1)
Document 1. Peanut butter lovers club ...
Document 2. peanut stock ...
Document 3. butter, the natural choice
Document 100. ...
How many documents contain the substring peanut but not butter?
Boolean queries on substring predicates are ubiquitous:
- Information retrieval.
- Bibliographic search.
- Web searching: 40 million queries per day at AltaVista.
- RDB queries.
3Motivation(2)
- Computing exact answers is expensive!
- Requires either super-linear space or linear time.
- Use estimates for:
- Query optimization.
- Best filtering order.
- Interactive query refinement.
- Hard to write a query having 1-20 answers.
- Ranking approach does not always work and is expensive.
- Estimate, refine query, ..., then exact answer.
4Outline
- Problem Definition
- Related work
- Our approach
- Experiments
- Conclusions
5Problem Definition
Document 1. Peanut butter lovers club ...
Document 2. peanut stock ...
Document 100. ...
How many documents contain the substring peanut but not butter?
- Substring predicate σ(s) is true iff string s contains σ as a substring.
- Boolean queries
- Substring predicates combined with AND, OR, NOT.
- For a string set S and a Boolean query q:
- Selectivity P(q) is the fraction of strings in S that satisfy q.
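The definition above can be sketched as a brute-force computation. This is only an illustration of what P(q) means, not the estimation method of the talk; the helper names (`contains`, `AND`, `OR`, `NOT`, `selectivity`) and the toy string set are assumptions made for this example.

```python
from typing import Callable, Iterable

# A query is modelled as a function from a string to True/False.
Query = Callable[[str], bool]

def contains(sub: str) -> Query:
    """Substring predicate: true iff the string contains `sub`."""
    return lambda s: sub in s

def AND(a: Query, b: Query) -> Query:
    return lambda s: a(s) and b(s)

def OR(a: Query, b: Query) -> Query:
    return lambda s: a(s) or b(s)

def NOT(a: Query) -> Query:
    return lambda s: not a(s)

def selectivity(strings: Iterable[str], q: Query) -> float:
    """Exact selectivity P(q): fraction of strings satisfying q (linear scan)."""
    strings = list(strings)
    return sum(1 for s in strings if q(s)) / len(strings)

# Hypothetical toy string set, lower-cased so matching ignores case.
docs = [
    "peanut butter lovers club",
    "peanut stock",
    "butter, the natural choice",
    "grape jelly",
]

peanut_not_butter = AND(contains("peanut"), NOT(contains("butter")))
print(selectivity(docs, peanut_not_butter))  # 0.25: only "peanut stock" qualifies
```

The scan touches every string, which is exactly the linear-time cost the estimation techniques aim to avoid.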
6Related Work (1)
- Histograms: not suitable for strings.
- Selectivity of adjacent substrings often differs a lot!
- End-biased histograms (IP95).
- Structure of substring dependence is not used.
- If ? is pruned, for any ?
- Count(??) Count(?) Default ????
7Related Work (2)
- Existing work for substring queries.
- Conjunction-only queries (KVI96, WVI97, JNS99, JKNS99).
- Preprocess a compact data structure: the Pruned Suffix Tree.
- Correlation between predicates is explicitly stored.
- Otherwise: independence assumption on substring predicates.
- Pruned case
- Parse into subqueries; estimate each subquery.
- Probabilistic formula to combine estimates:
- Independence assumption.
- Maximal overlap (conditioning on overlaps).
8Use Previous Approach?
- NO!
- Exponential (22m) space to store correlation between substring predicates (m is the number of suffixes).
- Not to store correlation?
- No! Correlation is important!
Independence assumption: P(peanut ∧ butter) = P(peanut) × P(butter) = 2/100 × 2/100 = 0.04%
Document 1. Peanut butter lovers club ...
Document 2. peanut stock
Document 3. butter, the natural choice
Document 100. ...
25 times smaller than the true count!
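The arithmetic behind the factor of 25 can be checked with a short script. It assumes, consistently with the 2/100 figures on the slide, that documents 4 to 100 contain neither peanut nor butter; the filler text and the helper `p` are hypothetical.

```python
from fractions import Fraction

# Documents 1-3 as on the slide (lower-cased); documents 4-100 are assumed
# to contain neither term, which matches P(peanut) = P(butter) = 2/100.
docs = {
    1: "peanut butter lovers club",
    2: "peanut stock",
    3: "butter, the natural choice",
}
docs.update({i: "other text" for i in range(4, 101)})

def p(pred) -> Fraction:
    """Exact fraction of documents whose text satisfies pred."""
    return Fraction(sum(1 for text in docs.values() if pred(text)), len(docs))

p_peanut = p(lambda t: "peanut" in t)
p_butter = p(lambda t: "butter" in t)
p_true = p(lambda t: "peanut" in t and "butter" in t)
p_indep = p_peanut * p_butter  # independence assumption

print(p_indep, p_true, p_true / p_indep)  # 1/2500 1/100 25
```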
9Set-Oriented Approach - Store Correlation
Implicitly
Base set: the set of IDs of strings that contain the substring.
Document 1. Peanut butter lovers club ...
Document 2. peanut stock ...
Document 3. butter, the natural choice
Document 100. ...
BaseSet(peanut ∧ butter) = BaseSet(peanut) ∩ BaseSet(butter) = {1}
Base sets can be huge in the worst case: O(number of strings)!
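A minimal sketch of the set-oriented idea, using exact Python sets over the three example documents. The function name `base_set` and the case-insensitive matching are assumptions of this example.

```python
# The three example documents, keyed by document ID.
docs = {
    1: "Peanut butter lovers club",
    2: "peanut stock",
    3: "butter, the natural choice",
}

def base_set(sub: str) -> set:
    """IDs of the documents that contain `sub` (case-insensitive here)."""
    return {doc_id for doc_id, text in docs.items() if sub in text.lower()}

peanut = base_set("peanut")            # {1, 2}
butter = base_set("butter")            # {1, 3}

both = peanut & butter                 # AND -> set intersection: {1}
either = peanut | butter               # OR  -> set union: {1, 2, 3}
peanut_not_butter = peanut - butter    # AND NOT -> set difference: {2}

print(both, either, peanut_not_butter)
```

Correlation never has to be stored explicitly: it falls out of the set operations. The price is that each base set may be as large as the collection itself.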
10Set-Hashing Approach
- A Monte Carlo technique (Cohen94, Broder98).
- Two sets A, B.
- Set inclusion-exclusion for unions and complements.
11Signature Generation
Randomly permute the universe 3 times.
Pick the first element of A = {1, 2, 3} under each permutation.
The picked elements form the signature of A.
12Reconstruction
A's signature SA = (1, 3, 2)
B's signature SB = (2, 3, 2)
Definition: r = (# of pair-wise matches of SA and SB) / |SA| = 2/3
Theorem: |A ∩ B| ≈ r / (1 + r) × (|A| + |B|) = (2/3) / (1 + 2/3) × (3 + 3) = 2.4
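Signature generation and reconstruction can be sketched as follows. The code uses explicit random permutations, as on the slides; the function names, the seed, and the larger test sets are assumptions of this example rather than details from the talk.

```python
import random

def make_permutations(universe, k, seed=0):
    """k random permutations of the universe, each stored as element -> rank."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        order = sorted(universe)
        rng.shuffle(order)
        perms.append({x: rank for rank, x in enumerate(order)})
    return perms

def signature(s, perms):
    """For each permutation, the element of s that comes first in it."""
    return [min(s, key=perm.__getitem__) for perm in perms]

def match_ratio(sa, sb):
    """r = (# of pair-wise matches of the two signatures) / signature length."""
    return sum(a == b for a, b in zip(sa, sb)) / len(sa)

def estimate_intersection(size_a, size_b, r):
    """|A ∩ B| ≈ r / (1 + r) * (|A| + |B|)."""
    return r / (1 + r) * (size_a + size_b)

# The slide's example: two matching components out of three.
r = match_ratio([1, 3, 2], [2, 3, 2])
print(estimate_intersection(3, 3, r))  # about 2.4

# A larger, hypothetical example: the true intersection size is 300.
A = set(range(0, 600))
B = set(range(300, 900))
perms = make_permutations(A | B, k=500)
r_hat = match_ratio(signature(A, perms), signature(B, perms))
estimate = estimate_intersection(len(A), len(B), r_hat)
print(round(estimate))  # should land near 300
```

The match ratio estimates |A ∩ B| / |A ∪ B|, and substituting |A ∪ B| = |A| + |B| - |A ∩ B| gives the r / (1 + r) form of the theorem.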
13Implementation Issues
- Approximate permutations
- Use a set of independent hash functions.
- Pick the minimal hash images as signature components: Sig(A) = min{h(x) : x in A}.
- Signature of unions: Sig(peanut ∨ butter)
- Pair-wise min of Sig(p) and Sig(b).
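A hedged sketch of the hash-based variant. The linear hash family `(a*x + b) mod P` is an assumption chosen for illustration (the slides only say "independent hash functions"), and the document-ID sets are hypothetical.

```python
import random

P = (1 << 61) - 1  # a large prime modulus (assumption of this sketch)

def make_hashes(k, seed=0):
    """k hash functions h(x) = (a*x + b) mod P, given as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def sig(ids, hashes):
    """Sig(A): for each hash function, the minimal hash image over A."""
    return [min((a * x + b) % P for x in ids) for a, b in hashes]

def union_sig(sig_1, sig_2):
    """Signature of a union: pair-wise min of the two signatures."""
    return [min(u, v) for u, v in zip(sig_1, sig_2)]

hashes = make_hashes(k=50)
peanut = {1, 2, 17, 40}   # hypothetical base set of "peanut"
butter = {1, 3, 40, 99}   # hypothetical base set of "butter"

sig_p = sig(peanut, hashes)
sig_b = sig(butter, hashes)

# The pair-wise min equals the signature computed directly on the union,
# so the union's base set never has to be materialised.
print(union_sig(sig_p, sig_b) == sig(peanut | butter, hashes))  # True
```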
14Algorithm Outline - No Pruning
- Without negations.
- Convert to CNF form: (peanut ∨ butter) ∧ sandwich
- Estimate using Sig(peanut ∨ butter), Sig(sandwich).
- With negations.
- (peanut ∨ butter) ∧ ¬sandwich
- (p ∧ ¬s) ∨ (b ∧ ¬s) (convert to DNF)
- P(p ∧ ¬s) + P(b ∧ ¬s) − P(p ∧ b ∧ ¬s) (eliminate disjunction by set inclusion-exclusion)
- P(p) − P(p ∧ s) + P(b) − P(b ∧ s) − P(p ∧ b) + P(p ∧ b ∧ s) (eliminate negations)
Comments: works fine with short queries.
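The negation-elimination steps can be checked on exact sets. In the approach described here each conjunction term would be estimated from signatures; the sketch below uses exact counts on hypothetical base sets only to show that the rewritten expression gives the same answer as the original query.

```python
# Hypothetical base sets for peanut (p), butter (b) and sandwich (s).
p = {1, 2, 4, 7}
b = {1, 3, 4, 8}
s = {4, 5, 7}

# Direct evaluation of (p OR b) AND NOT s.
direct = len((p | b) - s)

# After converting to DNF, applying inclusion-exclusion and eliminating
# negations, only counts of negation-free conjunctions remain.
rewritten = (
    len(p) - len(p & s)
    + len(b) - len(b & s)
    - len(p & b) + len(p & b & s)
)

print(direct, rewritten)  # 4 4
```

The number of terms grows quickly with the number of predicates, which is why the method is described as working well for short queries.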
15Pruned Suffix Tree Case
- Parse a query into subqueries that only have predicates in the suffix tree.
- Use signatures to estimate each subquery.
- Combine them using a probabilistic formula: maximal overlap parsing and conditioning on overlap.
E.g. (abc parsed into ab and bc):
P(¬(abc ∧ 12) ∧ 23)
= P(23) − P(abc ∧ 12 ∧ 23)
= P(23) − P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | ab ∧ 12 ∧ 23)
≈ P(23) − P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | b ∧ 12 ∧ 23)
= P(23) − P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23)
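The final line of the example is a simple combination of four subquery estimates. A minimal sketch, assuming those four selectivities are already available (in the method they would come from signatures); the function name and the input numbers are hypothetical.

```python
def maximal_overlap_estimate(p_23, p_ab, p_bc, p_b):
    """
    Combine subquery estimates when abc is parsed into ab and bc:
        P(23) - P(ab ∧ 12 ∧ 23) * P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23)
    p_23 = P(23); p_ab, p_bc, p_b = selectivities of the ab, bc and b
    subqueries, each conjoined with 12 and 23.
    """
    if p_b == 0:
        # If the overlap b never co-occurs with 12 and 23, neither does abc.
        return p_23
    return p_23 - p_ab * p_bc / p_b

# Hypothetical subquery selectivities, for illustration only.
estimate = maximal_overlap_estimate(p_23=0.30, p_ab=0.10, p_bc=0.08, p_b=0.16)
print(round(estimate, 4))  # 0.25
```

Conditioning on the overlap b rather than on the whole of ab is the approximation step: it keeps every term inside the pruned suffix tree.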
16Complexity
- Theorem
- Preprocessing (building tree and signatures):
- O(signature length × database size) time and space.
- Online estimate:
- O(2^O(L)), where L is the query length.
- Online time depends only on query length.
- L is small in real life.
- Below 1 millisecond in experiments.
17Experiments - Setup
- Data set: real AT&T data (service descriptions).
- 130K strings, 2.5 MB.
- Queries
- Templates
- T1 (A ? B) ? (C ? D)
- T2 (A ? B) ? (C ? D) ? (E ? F) ? (G ? H)
- With a certain probability of negation.
- Positive and negative queries.
- Compare with independence assumption.
- Run on an Intel PC, 350 MHz, 128 MB RAM.
- 1 minute preprocessing time.
18PST-Positive Queries
Probability of negations: 5%. Error metric: average absolute relative error.
T1 (A ? B) ? (C ? D)
19PST-Negative Queries
Probability of negations: 5%. Error metric: average root-mean-square error (count).
T1 (A ? B) ? (C ? D)
20Conclusions
- Contributions
- A novel problem.
- A novel approach of implicitly storing correlation and generating correlation as needed by set-hashing.
- Far superior to the independence assumption.
- 1.0% space,
- 4 times more accurate for positive queries, many orders of magnitude for negative queries.
- Ongoing and future work.
- Twig estimation for XML documents.
- Regular expressions, position constraints.