Title: Efficient Detection of Empty Result Queries
1Efficient Detection of Empty Result Queries
- Gang Luo
- IBM T.J. Watson Research Center
- luog_at_us.ibm.com
2Empty Result Problem
- Query returns an empty result set
- User gets lost about where to look at next
- Frequently encountered in interactive exploration
of massive data sets - Our contribution method for quickly detecting
empty result sets
3Example Percentages of Empty Result Queries
- In a Customer Relationship Management (CRM)
application developed by IBM - 18.07 (3,396 empty result queries in 18,793
queries) - In a real estate application developed by IBM
- 5.75
- In a digital library application JCM00
- 10.53
- In a bioinformatics application RCP98
- 38
4Empty Result Queries May Not Finish Execution
Quickly
- Consider a query joining two relations
- Query execution time is longer than join time, no
matter whether or not query result set is empty - Even if a query finishes in a few seconds in a
lightly loaded RDBMS, it may last longer than one
minute in a heavily loaded RDBMS
5Outline
- Limitations of previous approaches
- Fast detection method for empty result queries
- Some experiments
6Existing Solutions to the Empty Result Problem
- Explain what leads to the empty result set
- Automatically generalize the query so that the
generalized query will return some answers
7Limitations of Existing Solutions
- Require domain specific knowledge
- Only apply to a restricted form of queries
- Require an excessive amount of time
- Give too many reasons why the result set is empty
- Users cannot reuse each others query results
8Outline
- Limitations of previous approaches
- Fast detection method for empty result queries
- Some experiments
9Our Solution
- Only consider read-only environment
- From previous queries execution, remember the
query parts that lead to empty result sets - When a new query Q comes, match it with the
remembered query parts. If such a match exists,
report that Q will return an empty result set
without executing Q - Utilize special properties of empty result sets
and thus often more powerful than traditional
materialized view method
10Definitions
- Empty result propagating operator An operator
whose output is empty if any input is empty - Empty result propagating query A query whose
query plan only contains empty result propagating
operators (our focus) - Query part A sub-tree of a query plan
- Atomic query part An ordered pair (relation
names RN, selection condition SC) - Corresponds to a relational algebra formula
first product join all relations in RN, then
apply SC - SC is a conjunction of primitive terms, where
each primitive term is a comparison
11Definitions Cont.
- Cover Atomic query part P1(RN1, SC1) covers
atomic query part P2(RN2, SC2) if - RN1?RN2
- Whenever SC2 is true, SC1 is true
- Property Suppose atomic query part P1 covers
atomic query part P2. For a given database, if
the output of P1 is empty, the output of P2 is
also empty.
12Given an Empty Result Query
- Find the lowest-level query part P whose output
is empty
13Transforming P into a Simplified Query Part Ps
- Drop all operators (e.g., projection, hash, sort)
that have no influence on the emptiness of the
output - Replace each physical join operator with a
logical join operator - Replace each index-scan operator with a
table-scan operator followed by a selection
operator, where the selection condition is the
index-scan condition
14Transforming P into a Simplified Query Part Ps
Cont.
- Corresponding relational algebra formula
- (?50ltA.alt100 ? A.b200 (A)) ?A.cB.d (?B.elt40 ?
B.e50 (B))
15Breaking Ps into Atomic Query Parts
- Get all selection conditions in the
selection/join operators - Rewrite the conjunction of these selection
conditions into a disjunctive normal form (DNF) - Negations on numeric or string attributes are
removed using complementary operators - Interval-based comparison is treated as a single
primitive term - Generate a set of atomic query parts (RN, SC)
- RN input relations of all table-scan operators
in Ps - SC a term in the DNF
16Breaking Ps into Atomic Query Parts Cont.
(?50ltA.alt100 (A)) ?A.cB.d (?B.elt40
(B)) (?A.b200 (A)) ?A.cB.d (?B.elt40
(B)) (?50ltA.alt100 (A)) ?A.cB.d (?B.e50
(B)) (?A.b200 (A)) ?A.cB.d (?B.e50 (B))
- Property The following three assertions are
equivalent to each other - The output of the query part P is empty
- The output of the simplified query part Ps is
empty - The output of each generated atomic query part is
empty
17Storing the Generated Atomic Query Parts
- For each generated atomic query part Pa
- Insert Pa into a collection Caqp of atomic query
parts - Remove from Caqp all previously stored atomic
query parts that are covered by Pa - See paper for details of the coverage checking
algorithm
18When Getting a New Query Q
- Break Q into a set of atomic query parts
- For each such atomic query part Pa, check whether
some atomic query part Ai in Caqp covers Pa - If such an Ai exists for each Pa, report that Q
will return an empty result set without executing
Q
19Outline
- Limitations of previous approaches
- Fast detection method for empty result queries
- Some experiments
20Setup
- Testing environment
- PostgreSQL 7.3.4
- Windows XP OS
- Dell Inspiron 8500 PC with one 2.2GHz CPU, 512MB
memory, one 40GB disk - TPC-R benchmark
- See paper for detection probability analysis
21Overhead Experiment
- Query Q1 Find the information about certain
parts that were sold on certain days
select from orders o, lineitem lwhere
o.orderkeyl.orderkey and (o.orderdated1 or
or o.orderdatede) and (l.partkeyp1 or or
l.partkeypf)
22Overhead Experiment Cont.
- Query Q2 Find the information about certain
parts that were sold to certain customers on
certain days
select from orders o, lineitem l, customer
cwhere o.orderkeyl.orderkey and
o.custkeyc.custkey and (o.orderdated1 or or
o.orderdatede) and (l.partkeyp1 or or
l.partkeypf) and (c.nationkeyn1 or or
c.nationkeyng)
23Overhead Experiment Cont.
- The overhead of our method increases with both
query complexity and the number of atomic query
parts stored in Caqp - When check fails, the overhead of our method is
higher than that when check succeeds
24Overhead Experiment Cont.
- The overhead of our method is trivial compared to
query execution overhead
25Summary
- Provide a fast detection method for empty result
queries - Low overhead
- High detection probability once enough
information has been accumulated
26Open Issues
- In the presence of update, correctly preserve as
much stored information as possible - A hybrid method that can combine the advantages
of both our method and the existing solutions - More aggressive storage saving technique