Title: Probabilistic Ranking of Database Query Results
1Probabilistic Ranking of Database Query Results
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Vagelis Hristidis, Florida International
University - Gerhard Weikum, MPI Informatik
Presented by Weimin He CSE_at_UTA
2Outline
- Motivation
- Problem Definition
- System Architecture
- Construction of Ranking Function
- Implementation
- Experiments
- Conclusion and open problems
3Motivating example
- Realtor DB
- Table D(TID, Price , City, Bedrooms,
Bathrooms, LivingArea, SchoolDistrict, View,
Pool, Garage, BoatDock)
SQL query Select From D Where CitySeattle
AND ViewWaterfront
4Motivation
- Many-answers problem
- Two alternative solutions
- Query reformulation
- Automatic ranking
- Apply probabilistic model in IR to DB tuple
ranking -
5Problem Definition
- Given a database table D with n tuples t1, ,
tn over a set of m categorical attributes A
A1, , Am - and a query Q SELECT FROM D
- WHERE
- X1x1 AND AND Xsxs
- where each Xi is an attribute from A and xi is a
value in its domain. - The set of attributes X X1, , Xs is known as
the set of attributes specified by the query,
while the set Y A X is known as the set of
unspecified attributes - Let be the answer set of Q
- How to rank tuples in S and return top-k tuples
to the user ?
6System Architecture
7Intuition for Ranking Function
- Select From D Where CitySeattle And
ViewWaterfront - Score of a Result Tuple t depends on
- Global Score Global Importance of Unspecified
Attribute Values - E.g., Homes with good school districts are
globally desirable - Conditional Score Correlations between Specified
and Unspecified Attribute Values - E.g., Waterfront ? BoatDock
8Probabilistic Model in IR
- Document t, Query QR Relevant document setR
D - R Irrelevant document set
9Adaptation of PIR to DB
- Tuple t is considered as a document
- Partition t into t(X) and t(Y)
- t(X) and t(Y) are written as X and Y
- Derive from initial scoring function until final
ranking function is obtained
10Preliminary Derivation
11Limited Independence Assumptions
- Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the X
and Y values are allowed
12Continuing Derivation
13Workload-based Estimation of
Assume a collection of past queries existed in
system Workload W is represented as a set of
tuples Given query Q and specified attribute
set X, approximate R as all query tuples in W
that also request for X All properties of the
set of relevant tuple set R can be obtained by
only examining the subset of the workload that
caontains queries that also request for X
14Final Ranking Function
15Pre-computing Atomic Probabilities in Ranking
Function
Relative frequency in W
Relative frequency in D
(of tuples in W that conatains x, y)/total of
tuples in W
(of tuples in D that conatains x, y)/total of
tuples in D
16Example for Computing Atomic Probabilities
- Select From D Where CitySeattle And
ViewWaterfront - YSchoolDistrict, BoatDock,
- D10,000 W1000
- Wexcellent10
- Wwaterfront yes5
- p(excellentW)10/10000.1
- p(excellentD)10/10,0000.01
- p(waterfrontyes,W)5/10000.005
- p(waterfrontyes,D)5/10,0000.0005
17Indexing Atomic Probabilities
AttName, AttVal, Prob B tree index on
(AttName, AttVal)
AttName, AttVal, Prob B tree index on
(AttName, AttVal)
AttNameLeft, AttValLeft, AttNameRight,
AttValRight, Prob B tree index on
(AttNameLeft, AttValLeft, AttNameRight,
AttValRight)
AttNameLeft, AttValLeft, AttNameRight,
AttValRight, Prob B tree index on
(AttNameLeft, AttValLeft, AttNameRight,
AttValRight)
18Scan Algorithm
- Preprocessing - Atomic Probabilities Module
- Computes and Indexes the Quantities P(y W),
P(y D), P(x y, W), and P(x y, D) for All
Distinct Values x and y - Execution
- Select Tuples that Satisfy the Query
- Scan and Compute Score for Each Result-Tuple
- Return Top-K Tuples
19Beyond Scan Algorithm
- Scan algorithm is Inefficient
- Many tuples in the answer set
- Another extreme
- Pre-compute top-K tuples for all possible
queries - Still infeasible in practice
- Trade-off solution
- Pre-compute ranked lists of tuples for all
possible atomic queries - At query time, merge ranked lists to get top-K
tuples -
20Two kinds of Ranked List
- CondList Cx
- AttName, AttVal, TID, CondScore
- B tree index on (AttName, AttVal, CondScore)
- GlobList Gx
- AttName, AttVal, TID, GlobScore
- B tree index on (AttName, AttVal, GlobScore)
-
21Index Module
22List Merge Algorithm
23Experimental Setup
- Datasets
- MSR HomeAdvisor Seattle (http//houseandhome.msn.c
om/) - Internet Movie Database (http//www.imdb.com)
- Software and Hardware
- Microsoft SQL Server2000 RDBMS
- P4 2.8-GHz PC, 1 GB RAM
- C, Connected to RDBMS through DAO
24Quality Experiments
- Conducted on Seattle Homes and Movies tables
- Collect a workload from users
- Compare Conditional Ranking Method in the paper
with the Global Method CIDR03
25Quality Experiment-Average Precision
- For each query Qi , generate a set Hi of 30
tuples likely to contain a good mix of relevant
and irrelevant tuples - Let each user mark 10 tuples in Hi as most
relevant to Qi - Measure how closely the 10 tuples marked by the
user match the 10 tuples returned by each
algorithm
26Quality Experiment- Fraction of Users Preferring
Each Algorithm
- 5 new queries
- Users were given the top-5 results
27Performance Experiments
- Compare 2 Algorithms
- Scan algorithm
- List Merge algorithm
28Performance Experiments Pre-computation Time
29Performance Experiments Execution Time
30Performance Experiments Execution Time
31Performance Experiments Execution Time
32Conclusion and Open Problems
- Automatic ranking for many-answers
- Adaptation of PIR to DB
- Mutiple-table query
- Non-categorical attributes