Title: Top-k Query Processing in Uncertain Database
1Top-k Query Processing in Uncertain Database
- Mohamed A. Soliman, Ihab F. Ilyas,
- Kevin Chen-Chuan Chang. ICDE07
- Kai, Jiang Fudan University
2Outline
- Introduction
- Processing Framework
- U-Topk Queries
- U-kRanks Queries
- Queries with Tuple Independence
- Experiments
- Conclusion
3Introduction
- Uncertain (probabilistic) data
- Sensor networks, moving-object tracking, data cleaning, etc.
- Uncertain data model
- Possible worlds: a set of possible instances
- Confidence: membership uncertainty
- Generation rules: logical formulas that determine valid worlds
- Independent tuples: tuples correlated with no rules
4Uncertain Database
5Motivation Challenges
- Different from traditional top-k queries
- Answers depend not only on the score function but also on membership probability
- Two interesting top-k queries
- Top-k speeding cars in the last hour
- A ranking over the models of the top-k speeding cars
- Interaction between "most probable" and "top-k": several different possible interpretations
- Involves both ranking and aggregation across worlds, which is prohibitively expensive
6Problem Definition U-Topk
- Uncertain Top-k Query (U-Topk)
- Let $D$ be an uncertain database with possible worlds space $PW = \{PW^1, \ldots, PW^n\}$. Let $T = \{T^1, \ldots, T^m\}$ be a set of k-length tuple vectors, where for each $T^i \in T$:
- (1) Tuples of $T^i$ are ordered according to scoring function $F$
- (2) $T^i$ is the top-k answer for a non-empty set of possible worlds $PW(T^i) \subseteq PW$
- A U-Topk query, based on $F$, returns $T^* \in T$, where $T^* = \arg\max_{T^i \in T} \sum_{w \in PW(T^i)} \Pr(w)$
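The U-Topk definition can be checked by brute force on a tiny database. The sketch below assumes independent tuples (no generation rules) and enumerates every possible world; the tuple names, scores, and confidences are made up for illustration, and the approach is exponential, so it is only a reference for the definition, not the algorithm from the paper.

```python
from itertools import product

def possible_worlds(tuples):
    """Yield (world, probability) pairs, assuming independent tuples.

    `tuples` is a list of (name, score, confidence) triples.
    """
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for present, (name, score, conf) in zip(mask, tuples):
            prob *= conf if present else 1.0 - conf
            if present:
                world.append((name, score))
        yield world, prob

def u_topk(tuples, k):
    """Return the most probable k-length top-k vector and its probability."""
    answer_prob = {}
    for world, prob in possible_worlds(tuples):
        if len(world) < k or prob == 0.0:
            continue  # this world has no k-length answer
        ranked = sorted(world, key=lambda t: -t[1])[:k]
        vector = tuple(name for name, _ in ranked)
        answer_prob[vector] = answer_prob.get(vector, 0.0) + prob
    best = max(answer_prob, key=answer_prob.get)
    return best, answer_prob[best]

# Hypothetical speeding-car readings: (name, speed, confidence).
cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.5), ("t4", 100, 1.0)]
best_vector, best_prob = u_topk(cars, 2)
```

On this data the vector (t1, t2) wins with probability 0.4 x 0.7 = 0.28, even though t1 alone is less likely to exist than not.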
7Problem Definition U-kRanks
- Uncertain k Ranks Query (U-kRanks)
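- Let $D$ be an uncertain database with possible worlds space $PW = \{PW^1, \ldots, PW^n\}$. For $i = 1 \ldots k$, let $\{x_i^1, \ldots, x_i^m\}$ be a set of tuples, where each tuple $x_i^j$ appears at rank $i$ in a non-empty set of possible worlds $PW(x_i^j) \subseteq PW$ based on scoring function $F$. A U-kRanks query, based on $F$, returns $\{x_i^*;\ i = 1 \ldots k\}$, where $x_i^* = \arg\max_{x_i^j} \sum_{w \in PW(x_i^j)} \Pr(w)$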
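A brute-force reading of the U-kRanks definition, again assuming independent tuples and made-up data: for each rank i, add up the probability of every world in which a tuple sits at rank i, then report the most probable tuple per rank. This is a reference sketch for the definition only.

```python
from itertools import product

def u_kranks(tuples, k):
    """For each rank 1..k, return (tuple name, probability of holding that rank).

    `tuples` is a list of (name, score, confidence) triples, assumed independent.
    """
    rank_prob = [dict() for _ in range(k)]
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for present, (_, _, conf) in zip(mask, tuples):
            prob *= conf if present else 1.0 - conf
        world = [t for present, t in zip(mask, tuples) if present]
        world.sort(key=lambda t: -t[1])  # rank by descending score
        for i, (name, _, _) in enumerate(world[:k]):
            rank_prob[i][name] = rank_prob[i].get(name, 0.0) + prob
    return [max(d.items(), key=lambda kv: kv[1]) for d in rank_prob]

# Hypothetical data: (name, score, confidence).
cars = [("t1", 130, 0.4), ("t2", 120, 0.7), ("t3", 110, 0.5), ("t4", 100, 1.0)]
ranks = u_kranks(cars, 2)
```

Here rank 1 goes to t2 (probability 0.42) and rank 2 to t4 (probability 0.36). The per-rank winners need not form the most probable top-k vector, which is why the two query types are defined separately.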
8Processing Framework
9Data Access
- Theorem: Among all sequential access methods, sorted score access is optimal in the number of retrieved tuples needed to answer uncertain top-k queries.
- An algorithm A that retrieves tuples sequentially out of score order cannot decide whether a seen tuple t belongs to any possible top-k answer.
- Retrieving tuples in confidence order is also not optimal: it cannot guarantee that it has seen all tuples with higher scores than t.
10Process Overview
11Computing State Probabilities
- Probability Reduction
- Extending a combination of tuple events by adding another tuple existence/absence event results in a new combination with at most the same probability
- State Probability
- After $d$ accesses to $D$ in $F$ order: $P(s_l) = \Pr(s_l \cap \neg I(s_l, d))$, where $I(s_l, d)$ is the set of seen tuples not in $s_l$
- State Extension (extend $s_l$ with tuple $t$)
- A modified version of $s_l$: event $\neg t$
- A state $s_{l+1}$, which is $s_l$ appended by $t$: event $t$
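State extension can be sketched as follows, under the simplifying assumption that tuples are independent, so a state's probability is just a product of confidences and their complements. The `State` class and `extend` function are illustrative names, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    members: tuple  # names of tuples assumed to exist, in score order
    seen: int       # number of tuples consumed from the score-ordered stream
    prob: float     # probability of this combination of existence/absence events

def extend(state, name, conf):
    """Extend `state` with the next tuple in score order (independent tuples)."""
    # Event "not t": same members, one more tuple seen and assumed absent.
    absent = State(state.members, state.seen + 1, state.prob * (1.0 - conf))
    # Event "t": the tuple is appended, giving a state one element longer.
    present = State(state.members + (name,), state.seen + 1, state.prob * conf)
    return absent, present

empty = State((), 0, 1.0)
absent, present = extend(empty, "t1", 0.4)
```

Both children have probability no larger than their parent, which is the probability reduction property the search relies on.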
12U-Topk Queries
- OptU-Topk algorithm
- Buffer the ranked tuples retrieved from D
- Q: a priority queue of states ordered by probability, with ties broken by state length; initialized with the empty state $s_{0,0}$, where $P(s_{0,0}) = 1$
- Lazy materialization: at each step, extend only the state with the highest probability into its two possible successor states
- Terminate when the top state of Q is a complete state.
- Can be extended to return the n most probable U-Topk answers
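A minimal sketch of the best-first search described for OptU-Topk, assuming independent tuples so that state probabilities are simple products (the paper handles generation rules as well; that part is omitted here). The function name, the tuple layout of heap entries, and the sample data are illustrative choices.

```python
import heapq

def opt_u_topk(stream, k):
    """Best-first search for the most probable top-k vector.

    `stream` lists (name, confidence) pairs in descending score order.
    Returns (vector, probability, tuples_read), or None if no answer exists.
    """
    # Heap entries are (-probability, -length, seen, members). heapq is a
    # min-heap, so negation puts the most probable state on top and breaks
    # ties in favour of longer states.
    heap = [(-1.0, 0, 0, ())]
    depth = 0  # deepest position read from the stream so far
    while heap:
        neg_p, neg_len, seen, members = heapq.heappop(heap)
        if -neg_len == k:
            return members, -neg_p, depth  # top state is complete: stop
        if seen == len(stream):
            continue  # stream exhausted, this state cannot be completed
        name, conf = stream[seen]
        depth = max(depth, seen + 1)
        if conf < 1.0:  # successor in which the tuple is absent
            heapq.heappush(heap, (neg_p * (1.0 - conf), neg_len, seen + 1, members))
        if conf > 0.0:  # successor in which the tuple is present
            heapq.heappush(heap, (neg_p * conf, neg_len - 1, seen + 1, members + (name,)))
    return None

# Hypothetical score-ordered stream of (name, confidence).
stream = [("t1", 0.4), ("t2", 0.7), ("t3", 0.5), ("t4", 1.0)]
vector, prob, tuples_read = opt_u_topk(stream, 2)
```

On this stream the search reports (t1, t2) with probability 0.28 after reading only three of the four tuples, because every remaining state already has a lower probability than the complete one.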
14Optimality
- Among all algorithms that access tuples ordered on score, OptU-Topk is optimal in the number of accessed tuples.
- Let x be the answer reported by OptU-Topk. Among all algorithms that access tuples ordered on score, there is no algorithm that can skip a state visited by OptU-Topk and report x as the U-Topk query answer.
15U-kRanks Queries
- OptU-kRanks
- Extend maintained states based on each seen tuple
- Compute $P_{t,i}$ for $i = 1, \ldots, k$
- For each rank i, remember the most probable answer obtained so far
- Terminate at rank i when
- Optimality
- Among all algorithms that access tuples ordered on score, OptU-kRanks is optimal in the number of accessed tuples.
17U-Topk Queries with Tuple Independence
- Under tuple independence, if all states are maintained after seeing the same tuples, $x_n$ and $y_m$ ($n = m$) would follow the same path to reach a complete state. If $P(x_n) > P(y_m)$, prune $y_m$.
- IndepU-Topk groups states into equivalence classes based on their lengths and keeps at most one state for each length $0, \ldots, k$ in a candidate set.
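The one-candidate-per-length idea can be sketched as a small dynamic program. This is an illustration under the stated independence assumption, with an invented function name and a simple stopping test (stop once no incomplete candidate is more probable than the complete one); it is not a transcription of IndepU-Topk.

```python
def indep_u_topk(stream, k):
    """Most probable top-k vector for independent tuples, one candidate per length.

    `stream` lists (name, confidence) pairs in descending score order.
    Returns (probability, vector), or None if no k-length answer exists.
    """
    # best[l] = (probability, members) of the most probable length-l state so far.
    best = [(1.0, ())] + [(0.0, ())] * k
    for name, conf in stream:
        nxt = list(best)  # best[k] is complete and is never extended further
        for l in range(k):
            p, members = best[l]
            nxt[l] = (p * (1.0 - conf), members)       # tuple absent
        for l in range(k):
            p, members = best[l]
            grown = (p * conf, members + (name,))      # tuple present
            if grown[0] > nxt[l + 1][0]:
                nxt[l + 1] = grown                     # prune the weaker state
        best = nxt
        # Incomplete states can only lose probability when extended,
        # so stop once none of them beats the complete candidate.
        if best[k][0] > 0.0 and all(best[k][0] >= best[l][0] for l in range(k)):
            break
    return best[k] if best[k][0] > 0.0 else None

stream = [("t1", 0.4), ("t2", 0.7), ("t3", 0.5), ("t4", 1.0)]
prob, vector = indep_u_topk(stream, 2)
```

Only k + 1 states are kept at any time, which is the space saving the equivalence classes are meant to give.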
18IndepU-Topk
19U-kRanks Queries with Tuple Independence
20Experiment
21Experiment
22Experiment
23Experiment
24Experiment
25Conclusion
- First paper to address top-k query processing under possible worlds semantics
- Formulates the problem as a state space search, giving query algorithms with optimality guarantees on accessed tuples and materialized search states.
- The processing framework leverages existing techniques and can be integrated with existing DBMSs
26Efficient Top-k Query Evaluation on
Probabilistic Data
- Christopher Re, Nilesh Dalvi,
- Dan Suciu. ICDE07
27Outline
- Introduction
- Preliminaries
- Top-k Query Evaluations
- Discussion
- Experiments
- Conclusion
28Introduction
- Imprecise data: probabilistic databases
- Compute and rank the top-k answers of a SQL query
- Answers with approximate probabilities
- Shift focus from probabilities to ranks
29Application Example
30Challenges
- Computing the exact output probabilities is computationally hard (#P-complete)
- The number of potential answers is large
- 1415 answers in the previous example
- Only interested in the first few of them
31Approach given in the paper
- Guarantee
- The top k answers are correct
- The ranking of the top k answers is correct
- Limitations
- Probabilities listed explicitly
- Do not handle continuous attribute values
32Possible World
- Definition
- A probabilistic database over schema $S$ is a pair $(W, P)$, where $W = \{W_1, \ldots, W_n\}$ is a set of database instances over $S$, and $P: W \to [0, 1]$ is a probability distribution. Each instance $W_j$ for which $P(W_j) > 0$ is called a possible world.
33Possible World
- Representation
- Table
-
- Constraint
- Each instance $J^p$ over schema $S^p$ represents a probabilistic database over $S$, denoted $Mod(J^p)$
- $S$: a single relation name $R(A_1, \ldots, A_m, B_1, \ldots, B_n)$, abbreviated $R(A, B)$
- $J^p$: table $R^p(A, B, p)$
- $W_j$: subsets of
34Example
35DNF Formulas over Tuples
- Definition
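- Let $(W, P)$ be a probabilistic database and $t_1, t_2, \ldots$ all of its tuples
- $t_i$ is true if $t_i \in W$ and false if $t_i \notin W$
- For a formula $E$: $P(E) = \sum \{P(W_i) \mid E \text{ is true in } W_i\}$
- Example: $E = (t_1 \wedge t_5) \vee t_2$, $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$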
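The probability of a formula is the total weight of the worlds in which it holds. The sketch below evaluates a DNF formula of the shape (t1 AND t5) OR t2 against an explicit list of worlds; the four worlds and their probabilities are invented for illustration and are not the ones on the slide.

```python
def dnf_probability(worlds, dnf):
    """Sum P(W) over the worlds W in which the DNF formula is true.

    `worlds` is a list of (set_of_tuple_names, probability) pairs;
    `dnf` is a list of conjuncts, each a set of tuple names that must all be present.
    """
    return sum(p for world, p in worlds
               if any(conjunct <= world for conjunct in dnf))

# Hypothetical probabilistic database with four possible worlds.
worlds = [
    ({"t1", "t5"}, 0.2),
    ({"t2"}, 0.3),
    ({"t1"}, 0.1),
    ({"t3", "t5"}, 0.4),
]
E = [{"t1", "t5"}, {"t2"}]  # (t1 AND t5) OR t2
p_E = dnf_probability(worlds, E)
```

Only the first two worlds satisfy the formula, so the result is 0.2 + 0.3 = 0.5.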
36Queries
- Consider SQL queries of the form
- Aggregate operators: sum, count (i.e., sum(1)), min, max
- Given (W,P), the answer is a table like
37Possible worlds semantics
- SQL query on Wj a set of tuples
-
-
-
38Semantics based on DNF Formulas
- Possible worlds semantics is not practical
- DNF formulas
- Modify $q \to q_e$
- Evaluate $q_e$ on $J^p$ and denote the answer $ET$
- Form of $ET$: $t = (t_1, \ldots, t_r)$, where each $t_i$ is a tuple of $J^p$
- $t.E = t_1 \wedge t_2 \wedge \ldots \wedge t_r$
- $P(t.E)$ can be computed easily
39DNF Formulas(continue)
- Partition ET by GROUP-BY
- $ET = G_1 \cup G_2 \cup \ldots \cup G_n$, $G = \{t_1, \ldots, t_m\}$
-
-
- Computing $P(G.E)$ is #P-complete
40Monte Carlo (MC) Simulation
- Naïve MC algorithm
- Approximate $P(G.E)$ by repeatedly choosing a random possible world and computing the frequency with which $G.E$ is true
- Luby and Karp's improved MC algorithm
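The two simulation styles can be compared on a small DNF over independent tuples. The slides name the Luby and Karp algorithm without listing its steps, so the second function below is the textbook coverage estimator for DNF probability and may differ in detail from the variant used in the paper; all names and numbers are illustrative.

```python
import random

def clause_prob(clause, probs):
    """Probability that every tuple of a conjunct is present (independent tuples)."""
    p = 1.0
    for t in clause:
        p *= probs[t]
    return p

def naive_mc(dnf, probs, n, rng):
    """Sample whole worlds and count how often the DNF is true."""
    hits = 0
    for _ in range(n):
        world = {t for t, p in probs.items() if rng.random() < p}
        hits += any(c <= world for c in dnf)
    return hits / n

def karp_luby(dnf, probs, n, rng):
    """Coverage estimator: sample (clause, world) pairs and count the pairs
    in which the sampled clause is the first one the world satisfies."""
    weights = [clause_prob(c, probs) for c in dnf]
    total = sum(weights)
    hits = 0
    for _ in range(n):
        i = rng.choices(range(len(dnf)), weights=weights)[0]
        # World conditioned on clause i being true: its tuples are forced present.
        world = set(dnf[i]) | {t for t, p in probs.items()
                               if t not in dnf[i] and rng.random() < p}
        first = next(j for j, c in enumerate(dnf) if c <= world)
        hits += (first == i)
    return total * hits / n

# Hypothetical tuple probabilities and a three-clause DNF; exact answer is 0.608.
probs = {"t1": 0.5, "t2": 0.6, "t3": 0.7, "t4": 0.2}
dnf = [{"t1", "t2"}, {"t2", "t3"}, {"t4"}]
rng = random.Random(0)
estimate_naive = naive_mc(dnf, probs, 20000, rng)
estimate_kl = karp_luby(dnf, probs, 20000, rng)
```

The coverage estimator never wastes a sample on a world where the formula is false, which is what makes it attractive when the true probability is small.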
41Property of Luby and Karp's algorithm
- Let $\delta > 0$
- m number of disjuncts
- N the number of steps executed
- Define
-
- Then
42Top-k Query Evaluation
- Evaluation has two parts
- Evaluate the extended SQL query $q_e$ and group the answer tuples
- Run an MC simulation on each group to compute the probabilities, then return the top k probabilities
- Goal: minimize the total number of simulation steps
43Multisimulation (MS)
- $G = \{G_1, \ldots, G_n\}$ with unknown probabilities $p_1, \ldots, p_n$; the goal is to find the k objects with the highest probabilities, denoted TopK $\subseteq G$
- Assumptions
44Multisimulation
- Given two intervals $[a_i, b_i]$ and $[a_j, b_j]$, if $b_i \le a_j$ the first is below and the second is above
- Two intervals are separated if we know $p_i < p_j$
- n intervals are k-separated if there is a set $T \subseteq G$ of k intervals such that any interval in T is above any interval not in T
45Multisimulation
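- Sound strategy: round robin
- Cost: $n \cdot N_{opt}$
- Notations and definitions
- $top_k(x_1, \ldots, x_n)$ is the k-th largest value
- Critical region: $(c, d) = (top_k(a_1, \ldots, a_n),\ top_{k+1}(b_1, \ldots, b_n))$
- Top objects: $T = \{G_i \mid d \le a_i\}$, $T \subseteq$ TopK
- Bottom objects: $B = \{G_i \mid b_i \le c\}$, $B \cap$ TopK $= \emptyset$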
46Multisimulation
- There is a k-separation iff the critical region is empty, i.e., $c \ge d$; then TopK $= T$
- $G_i$ is a double crosser if $a_i < c$ and $d < b_i$
- $G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$)
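The critical region and the object classes can be computed directly from the interval endpoints. The sketch below assumes a crosser is an interval that overlaps the open region (c, d), which the slides imply but do not spell out; the function names, label strings, and sample intervals are illustrative.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest lower bound and the (k+1)-th largest upper bound."""
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    return lows[k - 1], highs[k]

def classify(intervals, k):
    """Label each interval relative to a non-empty critical region (c < d)."""
    c, d = critical_region(intervals, k)
    labels = []
    for a, b in intervals:
        if d <= a:
            labels.append("top")
        elif b <= c:
            labels.append("bottom")
        elif a < c and d < b:
            labels.append("double crosser")
        elif a < c:
            labels.append("lower crosser")
        elif d < b:
            labels.append("upper crosser")
        else:
            labels.append("inside")  # lies within [c, d] without sticking out
    return labels

# Hypothetical probability intervals for four answer groups, with k = 2.
intervals = [(0.6, 0.9), (0.5, 0.8), (0.2, 0.7), (0.1, 0.3)]
region = critical_region(intervals, 2)
labels = classify(intervals, 2)
```

For these intervals the critical region is (0.5, 0.7): the last interval is already a bottom object, while the other three still cross the region and need more simulation.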
47Multisimulation
- MS Algorithm
- First, try a double crosser
- Then try to find an upper and lower crosser pair
- If no such pair exists, then either all crossers have the same left endpoint $a_i = c$ or the same right endpoint $d = b_i$. Pick the maximal crosser
- After each iteration, re-compute the critical region
- Stop when $c \ge d$, and return the set T
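The selection loop can be sketched end to end with a stand-in for the simulation step. Several things here are assumptions rather than facts from the slides: a crosser is taken to be any interval overlapping the open critical region, "maximal crosser" is read as the widest one, and `step` is a toy refinement that halves an interval toward a hidden true probability instead of running Monte Carlo.

```python
def ms_topk(objects, k, step):
    """Pick which interval to refine until the top k are separated.

    `objects` is a list of [a, b] intervals that `step(i)` refines in place.
    Returns (indices of the top-k objects, number of refinement steps).
    """
    steps = 0
    while True:
        lows = sorted((a for a, _ in objects), reverse=True)
        highs = sorted((b for _, b in objects), reverse=True)
        c, d = lows[k - 1], highs[k]  # critical region (c, d)
        if c >= d:  # region empty: the intervals are k-separated
            return [i for i, (a, _) in enumerate(objects) if d <= a], steps
        crossers = [i for i, (a, b) in enumerate(objects) if a < d and b > c]
        double = [i for i in crossers if objects[i][0] < c and d < objects[i][1]]
        upper = [i for i in crossers if d < objects[i][1]]
        lower = [i for i in crossers if objects[i][0] < c]
        if double:
            chosen = [double[0]]
        elif upper and lower:
            chosen = [upper[0], lower[0]]
        else:  # assumption: "maximal" means the widest crosser
            chosen = [max(crossers, key=lambda i: objects[i][1] - objects[i][0])]
        for i in chosen:
            step(i)
            steps += 1

# Toy setup: hidden true probabilities and intervals that start at [0, 1].
truth = [0.8, 0.6, 0.5, 0.2]
objects = [[0.0, 1.0] for _ in truth]

def step(i):
    # Hypothetical stand-in for one simulation step: move both endpoints
    # halfway toward the true probability.
    a, b = objects[i]
    objects[i] = [(a + truth[i]) / 2, (b + truth[i]) / 2]

top, steps_used = ms_topk(objects, 2, step)
```

The loop returns the two objects with the largest hidden probabilities and leaves the clearly low-ranked interval only coarsely refined, which is the saving multisimulation is after.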
48The Multisimulation Algorithm
49Algorithm Guarantee
- The algorithm always terminates and returns the correct TopK.
- For any deterministic algorithm computing the top k and for any $c < 2$, there exists an instance on which its cost is at least $c \cdot N_{opt}$.
- Let A be any deterministic algorithm for finding TopK. Then (a) on any instance the cost of MS_TopK is at most twice the cost of A, and (b) for any $c < 1$ there exists an instance where the cost of A is greater than $c$ times the cost of MS_TopK.
50Discussion
- Extensions
- Extend MS to compute and rank the top k answers.
- $T_k$ = MS_TopK($G$, $k$)
- $T_{k-1}$ = MS_TopK($T_k$, $k-1$)
- $T_{k-2}$ = MS_TopK($T_{k-1}$, $k-2$)
- ...
- $T_1$ = MS_TopK($T_2$, 1)
- Variation
- Any-time algorithm, which computes and returns the top answers in order 1, 2, 3, ... and can be stopped at any time
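The ranking extension only needs a routine that returns an unordered top-n set. The sketch below shows the peeling pattern with a placeholder `topk` function that sorts by known probabilities; in the real setting that call would be a multisimulation run, and the names and numbers here are illustrative.

```python
def rank_by_repeated_topk(items, k, topk):
    """Rank the top k items by shrinking the top set one element at a time.

    `topk(candidates, n)` must return the n best candidates in any order.
    """
    current = topk(items, k)  # T_k
    ranking = []
    for size in range(k - 1, 0, -1):
        smaller = topk(current, size)  # T_size, drawn from T_(size+1)
        # The element that dropped out holds rank size + 1.
        ranking.append(next(x for x in current if x not in smaller))
        current = smaller
    ranking.extend(current)  # the single element of T_1 holds rank 1
    return ranking[::-1]

# Placeholder top-k routine over hypothetical, exactly known probabilities.
probs = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1}

def exact_topk(candidates, n):
    return sorted(candidates, key=probs.get, reverse=True)[:n]

ranked = rank_by_repeated_topk(list(probs), 3, exact_topk)
```

Each later call works on a smaller candidate set, so most of the effort is spent in the first call on the full set.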
51Review the assumptions
- Precision
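- Each step guarantees $P(p \in [a^N, b^N]) > 1 - \delta$; for global precision $> 1 - \delta_0$ we need $(1 - \delta)^N \ge 1 - \delta_0$, which holds for $\delta = \delta_0 / N$
- Progress fails in general
- After step $N$ of MC, the midpoint moves by $1/N$, while the width of the interval shrinks by only $O(1/N^{1/2} - 1/(N+1)^{1/2}) = O(N^{-3/2})$
- Solution: run MC for $N^{1/2}$ iterations at each step
- The midpoint moves between steps $N$ and $N + N^{\alpha}$ by $N^{\alpha - 1}$
- The width shrinks by $O(1/N^{1/2} - 1/(N + N^{\alpha})^{1/2}) = O(N^{\alpha - 3/2})$, with $\alpha = 1/2$
- MS algorithm runs at most $2(N_{opt} + N_{opt}^{1/2})$ steps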
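The two estimates on this slide follow from standard inequalities. The short derivation below writes $\delta$ for the per-step error bound, $\delta_0$ for the global one, and $\alpha$ for the exponent of the batch size; it is a sketch of why the stated orders of magnitude hold, not a reproduction of the proof in the paper.

```latex
% Precision: Bernoulli's inequality with \delta = \delta_0 / N
(1 - \delta)^N \;\ge\; 1 - N\delta \;=\; 1 - \delta_0 .

% Width shrinkage over a batch of N^{\alpha} steps, 0 \le \alpha < 1:
\frac{1}{\sqrt{N}} - \frac{1}{\sqrt{N + N^{\alpha}}}
  \;=\; \frac{1}{\sqrt{N}} \left( 1 - \bigl(1 + N^{\alpha - 1}\bigr)^{-1/2} \right)
  \;\approx\; \frac{1}{\sqrt{N}} \cdot \frac{N^{\alpha - 1}}{2}
  \;=\; \tfrac{1}{2}\, N^{\alpha - 3/2} .

% The single-step case \alpha = 0 gives the O(N^{-3/2}) shrinkage.
```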
52Optimization
- Initialize the intervals $[a_i, b_i]$ to better estimates than $[0, 1]$
- Eliminate low-ranking objects from the start
- Safe plan rewriting: identify subqueries whose probabilities can be computed inside the SQL engine.
53Experiment
54Experiment
55Conclusion
- Describes a method for answering top-k queries on probabilistic databases
- Proves the technique to be near-optimal and validates it experimentally
56The End