Selectivity Estimation For Boolean Queries - PowerPoint PPT Presentation

Title:

Selectivity Estimation For Boolean Queries

Description:

AT&T Labs. 1. Selectivity Estimation For Boolean Queries. Zhiyuan Chen (Speaker) ... AT&T Labs. 2. Motivation(1) Boolean queries on substring predicates are ... – PowerPoint PPT presentation

Slides: 21
Provided by: zhiyua1

Transcript and Presenter's Notes


1
Selectivity Estimation For Boolean Queries
  • Zhiyuan Chen (Speaker)
  • Flip Korn, Nick Koudas, S. Muthukrishnan

2
Motivation(1)
Document 1: Peanut butter lovers club ...
Document 2: peanut stock ...
Document 3: butter, the natural choice
Document 100: ...
How many documents contain the substring "peanut" but
not "butter"?
  • Boolean queries on substring predicates are
    ubiquitous.
  • Information retrieval.
  • Bibliographic search.
  • Web searching: 40 million per day at AltaVista.
  • RDB queries.
3
Motivation(2)
  • Computing exact answers is expensive!
  • Requires either super-linear space or linear time.
  • Use estimates for:
  • Query optimization.
  • Best filtering order.
  • Interactive query refinement.
  • It is hard to write a query that returns 1-20 answers.
  • The ranking approach does not always work and is
    expensive.
  • Estimate -> refine query -> ... -> exact answer.

4
Outline
  • Problem Definition
  • Related work
  • Our approach
  • Experiments
  • Conclusions

5
Problem Definition
Document 1: Peanut butter lovers club ...
Document 2: peanut stock ...
Document 100: ...
How many documents contain the substring "peanut"
but not "butter"?
  • Substring predicate σ(s) is true iff string s
    contains σ as a substring.
  • Boolean queries:
  • Substring predicates combined with AND, OR,
    NOT.
  • For a string set S and a Boolean query q:
  • Selectivity P(q) is the fraction of strings in S
    that satisfy q.
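
As a point of reference, the definition above can be computed exactly by brute force. The following is a minimal sketch, not the authors' method: the helper names (`contains`, `selectivity`), the case-insensitive matching, and the four-document collection are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Predicate = Callable[[str], bool]

def contains(sub: str) -> Predicate:
    """Substring predicate: true iff the string contains `sub`.
    Case-insensitive matching is an assumption of this sketch."""
    return lambda s: sub.lower() in s.lower()

def selectivity(strings: Sequence[str], query: Predicate) -> float:
    """Fraction of strings in the set that satisfy the Boolean query."""
    if not strings:
        raise ValueError("string set must be non-empty")
    return sum(1 for s in strings if query(s)) / len(strings)

# Hypothetical collection, loosely based on the slide's example documents.
docs = [
    "Peanut butter lovers club",
    "peanut stock",
    "butter, the natural choice",
    "strawberry jam",
]

peanut, butter = contains("peanut"), contains("butter")
peanut_not_butter = lambda s: peanut(s) and not butter(s)

print(selectivity(docs, peanut_not_butter))  # 0.25 (only "peanut stock" qualifies)
```

This exact scan takes time linear in the collection, which is the cost the estimation techniques in the rest of the talk aim to avoid.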

6
Related Work (1)
  • Histograms: not suitable for strings.
  • Selectivity of adjacent substrings often differs
    a lot!
  • End-biased histograms (IP95).
  • Structure of substring dependence is not used.
  • If α is pruned, then for any β:
  • Count(αβ) = Count(α) = Default.

7
Related Work (2)
  • Existing work for substring queries.
  • Conjunction-only queries (KVI96, WVI97, JNS99,
    JKNS99).
  • Preprocess a compact data structure: the pruned
    suffix tree.
  • Correlation between predicates explicitly stored.
  • Otherwise, independence assumption on substring
    predicates.
  • Pruned case:
  • Parse into subqueries; estimate each subquery.
  • Probabilistic formula to combine estimates.
  • Independence assumption.
  • Maximal overlap (conditioning on overlaps).

8
Use Previous Approach?
  • NO!
  • Exponential (22m) space to store correlation
    between substring predicates (m is the number of
    suffixes).
  • Then do not store correlation?
  • No! Correlation is important!

Independence assumption:
P(peanut ∧ butter) = P(peanut) × P(butter) = 2/100 × 2/100 = 0.04%
Document 1: Peanut butter lovers club ...
Document 2: peanut stock
Document 3: butter, the natural choice
Document 100: ...
The estimate is 25 times smaller than the true count!
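
The arithmetic behind this slide can be checked in a few lines. This is only a worked example: the counts (2 documents with "peanut", 2 with "butter", 1 with both, out of 100) are read off the slide's example, and the variable names are illustrative.

```python
# Counts taken from the slide's 100-document example.
n_docs = 100
count_peanut = 2   # documents 1 and 2
count_butter = 2   # documents 1 and 3
count_both = 1     # document 1

# Independence assumption: multiply the individual selectivities.
independent = (count_peanut / n_docs) * (count_butter / n_docs)
# True selectivity of the conjunction.
true_selectivity = count_both / n_docs

print(f"{independent:.2%}")        # 0.04%
print(f"{true_selectivity:.2%}")   # 1.00%
print(round(true_selectivity / independent))  # 25
```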
9
Set-Oriented Approach - Store Correlation
Implicitly
Base set: the set of IDs of strings that contain
the substring.
Document 1: Peanut butter lovers club ...
Document 2: peanut stock ...
Document 3: butter, the natural choice
Document 100: ...
BaseSet(peanut ∧ butter) = BaseSet(peanut) ∩ BaseSet(butter) = {1}
Base sets can be huge in the worst case!
O(number of strings)
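
A small sketch of the base-set idea, using exact Python sets rather than any compact structure. The function name `build_base_sets`, the case-insensitive matching, and the three-document dictionary are assumptions for illustration; the talk's actual data structure is a suffix tree.

```python
from typing import Dict, Iterable, Set

def build_base_sets(docs: Dict[int, str], substrings: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each substring to the set of IDs of documents that contain it."""
    base: Dict[str, Set[int]] = {sub: set() for sub in substrings}
    for doc_id, text in docs.items():
        lowered = text.lower()  # case-insensitive matching is an assumption here
        for sub in base:
            if sub in lowered:
                base[sub].add(doc_id)
    return base

# Documents 1-3 from the slide's example.
docs = {
    1: "Peanut butter lovers club",
    2: "peanut stock",
    3: "butter, the natural choice",
}
base = build_base_sets(docs, ["peanut", "butter"])

both = base["peanut"] & base["butter"]               # conjunction -> intersection
either = base["peanut"] | base["butter"]             # disjunction -> union
peanut_not_butter = base["peanut"] - base["butter"]  # negation -> difference

print(both)               # {1}
print(peanut_not_butter)  # {2}
```

Correlation between predicates is never stored explicitly here; it falls out of the set operations. The cost is that each base set may hold up to one ID per string.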
10
Set-Hashing Approach
  • A Monte Carlo technique (Cohen94, Broder98).

Two Sets A, B
Set inclusion-exclusion for unions and
complements.
11
Signature Generation
A = {1, 2, 3}
Randomly permute the universe 3 times.
For each permutation, pick the first element that is in A.
The picked elements form the signature of A.
12
Reconstruction
A's signature SA = (1, 3, 2)
B's signature SB = (2, 3, 2)
Definition: r = (number of pair-wise matches of SA and SB) / |SA| = 2/3
Theorem: |A ∩ B| ≈ r / (1 + r) × (|A| + |B|)
                 = (2/3) / (1 + 2/3) × (3 + 3) = 2.4
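
The reconstruction step can be sketched directly from the slide's numbers: signatures (1, 3, 2) and (2, 3, 2), with |A| = |B| = 3. The function names below are illustrative, and the sketch assumes the formula |A ∩ B| ≈ r / (1 + r) × (|A| + |B|), where r is the fraction of matching signature positions.

```python
from typing import Sequence

def match_ratio(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of positions at which the two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def estimate_intersection(sig_a: Sequence[int], sig_b: Sequence[int],
                          size_a: int, size_b: int) -> float:
    """Estimate |A ∩ B| as r / (1 + r) * (|A| + |B|)."""
    r = match_ratio(sig_a, sig_b)
    return r / (1 + r) * (size_a + size_b)

r = match_ratio([1, 3, 2], [2, 3, 2])                      # 2 of 3 positions match
est = estimate_intersection([1, 3, 2], [2, 3, 2], 3, 3)
print(round(est, 2))  # 2.4
```

The formula follows from r estimating |A ∩ B| / |A ∪ B| together with |A ∪ B| = |A| + |B| - |A ∩ B|.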
13
Implementation Issues
  • Approximate permutations:
  • Use a set of independent hash functions.
  • Pick the minimal hash images as signature
    components: Sig(A) = min{h(x) : x in A} for each
    hash function h.
  • Signature of unions, e.g. Sig(peanut ∨ butter):
  • Pair-wise min of Sig(p) and Sig(b).
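
The bullets above can be sketched as follows. The hash family used here (random linear functions modulo a prime) is an assumption chosen for illustration; the slides do not say which hash functions were used. The ID sets for "peanut" and "butter" are likewise hypothetical.

```python
import random
from typing import Iterable, List, Tuple

PRIME = 2_147_483_647  # 2^31 - 1, a Mersenne prime

def make_hash_functions(k: int, seed: int = 0) -> List[Tuple[int, int]]:
    """k hash functions h(x) = (a*x + b) mod PRIME (an assumed hash family)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def signature(ids: Iterable[int], hash_functions: List[Tuple[int, int]]) -> List[int]:
    """One component per hash function: the minimal hash image over the set."""
    ids = list(ids)
    if not ids:
        raise ValueError("cannot build a signature of an empty set")
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_functions]

def union_signature(sig_p: List[int], sig_b: List[int]) -> List[int]:
    """Signature of a union: the pair-wise minimum of the two signatures."""
    return [min(p, b) for p, b in zip(sig_p, sig_b)]

hs = make_hash_functions(64)
peanut_ids = set(range(0, 60))    # hypothetical base set for "peanut"
butter_ids = set(range(40, 100))  # hypothetical base set for "butter"

sig_union = union_signature(signature(peanut_ids, hs), signature(butter_ids, hs))
# The pair-wise min equals the signature computed from the union itself.
print(sig_union == signature(peanut_ids | butter_ids, hs))  # True
```

The equality holds exactly because the minimum over a union is the smaller of the two minima, so the union's signature never requires touching the underlying ID sets.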

14
Algorithm Outline - No Pruning
  • Without negations:
  • Convert to CNF, e.g. (peanut ∨ butter) ∧
    sandwich.
  • Estimate using Sig(peanut ∨ butter),
    Sig(sandwich).
  • With negations:
  • (peanut ∨ butter) ∧ ¬sandwich
  • (p ∧ ¬s) ∨ (b ∧ ¬s) (convert to DNF)
  • (p ∧ ¬s) + (b ∧ ¬s) - (p ∧ b ∧ ¬s)
  • (eliminate disjunction by set inclusion-exclusion)
  • p - (p ∧ s) + b - (b ∧ s) - (p ∧ b) + (p ∧
    b ∧ s)
  • (eliminate negations)

Comment: works fine with short queries.
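
The rewriting of (p ∨ b) ∧ ¬s into counts of negation-free conjunctions can be checked on small sets. In this sketch the conjunction counts are exact set intersections; in the approach described in the talk they would instead be estimated from signatures. The base sets and helper names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set

def count_or_and_not(count: Callable[[Iterable[str]], int]) -> int:
    """Count of (p OR b) AND NOT s, written using only negation-free conjunctions:
    p - (p AND s) + b - (b AND s) - (p AND b) + (p AND b AND s)."""
    return (count(["p"]) - count(["p", "s"])
            + count(["b"]) - count(["b", "s"])
            - count(["p", "b"]) + count(["p", "b", "s"]))

# Hypothetical base sets of document IDs for three predicates.
base: Dict[str, Set[int]] = {
    "p": {1, 2, 4},
    "b": {1, 3, 5},
    "s": {4, 5, 6},
}

def exact_count(predicates: Iterable[str]) -> int:
    """Exact count of documents satisfying every predicate in the conjunction."""
    return len(set.intersection(*(base[name] for name in predicates)))

via_formula = count_or_and_not(exact_count)
direct = len((base["p"] | base["b"]) - base["s"])
print(via_formula, direct)  # 3 3
```

Each eliminated disjunction or negation can double the number of terms, which is consistent with the comment that the method works well for short queries.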
15
Pruned Suffix Tree Case
  • Parse a query into subqueries that contain only
    predicates present in the pruned suffix tree.
  • Use signatures to estimate each subquery.

Combine them using a probabilistic formula: maximal
overlap parsing and conditioning on overlap. E.g.,
with abc parsed into ab and bc:
P(¬(abc ∧ 12) ∧ 23)
  = P(23) - P(abc ∧ 12 ∧ 23)
  = P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | ab ∧ 12 ∧ 23)
  ≈ P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23 | b ∧ 12 ∧ 23)
  = P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23)
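
The final line of the derivation, P(23) - P(ab ∧ 12 ∧ 23) × P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23), is a simple combination of subquery estimates. The sketch below assumes that form; the function name, the zero-overlap handling, and the probability values are hypothetical and do not come from the talk.

```python
def combine_with_overlap(p_23: float, p_ab: float, p_bc: float, p_b: float) -> float:
    """Combine subquery estimates when abc is parsed into ab and bc (overlap b).

    p_ab, p_bc and p_b are each the joint probability with 12 and 23, e.g.
    p_ab = P(ab AND 12 AND 23). Returns P(23) - p_ab * p_bc / p_b.
    """
    if p_b == 0.0:
        # If nothing contains the overlap b (with 12 and 23), nothing contains abc,
        # so the subtracted term vanishes. This handling is an assumption.
        return p_23
    return p_23 - p_ab * p_bc / p_b

# Made-up subquery estimates, for illustration only.
estimate = combine_with_overlap(p_23=0.10, p_ab=0.04, p_bc=0.03, p_b=0.06)
print(round(estimate, 4))  # 0.08
```

Conditioning on the overlap b, rather than multiplying p_ab and p_bc directly, avoids counting the shared substring's selectivity twice.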
16
Complexity
  • Theorem:
  • Preprocessing (building the tree and signatures):
  • O(signature length × database size) time and
    space.
  • Online estimation:
  • O(2^O(L)) time, where L is the query length.
  • Online time depends only on the query length.
  • L is small in real life.
  • Below 1 millisecond in experiments.

17
Experiments - Setup
  • Data set: real AT&T data (service descriptions).
  • 130 K strings, 2.5 MB.
  • Queries:
  • Templates:
  • T1 (A ? B) ? (C ? D)
  • T2 (A ? B) ? (C ? D) ? (E ? F) ? (G ? H)
  • With a certain probability of negation.
  • Positive and negative queries.
  • Compare with the independence assumption.
  • Run on a 350 MHz Intel PC with 128 MB RAM.
  • 1 minute preprocessing time.

18
PST-Positive Queries
Probability of negations: 5%. Metric: average absolute
relative error.
T1 (A ? B) ? (C ? D)
19
PST-Negative Queries
Probability of negations: 5%. Metric: average
root-mean-square error (count).
T1 (A ? B) ? (C ? D)
20
Conclusions
  • Contributions:
  • A novel problem.
  • A novel approach: storing correlation implicitly
    and generating it as needed by set hashing.
  • Far superior to the independence assumption:
  • 1.0% space,
  • 4 times more accurate for positive queries, many
    orders of magnitude for negative queries.
  • Ongoing and future work:
  • Twig estimation for XML documents.
  • Regular expressions, position constraints.