Title: Probabilistic Ranking of Database Query Results
1 Probabilistic Ranking of Database Query Results
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Vagelis Hristidis, Florida International University
- Gerhard Weikum, MPI Informatik
- 30th VLDB Conference, Toronto, Canada, 2004
- Presented by Abhishek Jamloki, CSE@UB
2 Motivating Example
- Realtor DB
- Table D(TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
- SQL query:
  SELECT * FROM D
  WHERE City = 'Seattle' AND View = 'Waterfront'
3 Formal Definition
- Consider a database table D with n tuples {t1, ..., tn} over a set of m categorical attributes A = {A1, ..., Am}
- A query Q: SELECT * FROM D WHERE X1 = x1 AND ... AND Xs = xs, where each Xi is an attribute from A and xi is a value in its domain
- Specified attributes: X = {X1, ..., Xs}
- Unspecified attributes: Y = A \ X
- Let S be the answer set of Q
- How do we rank the tuples in S and return the top-k tuples to the user?
4 Many-Answers Problem
- IR treatment:
  - Query reformulation
  - Automatic ranking
- Correlations are ignored in the high-dimensional spaces of IR
- An automated ranking function is proposed, based on:
  - A global score of the unspecified attributes
  - A conditional score (the strength of correlation between specified and unspecified attributes)
- Both scores are estimated automatically using workload and data analysis
5 Architecture of Ranking System
6 Probabilistic Model in IR
- Document t, query Q
- R: the relevant document set
- D - R: the irrelevant document set (the resulting ranking principle is sketched below)
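The equations on this slide are images in the original deck; the classical probabilistic IR ranking it refers to (the odds of relevance, rewritten with Bayes' rule) can be sketched as:

```latex
\mathrm{Score}(t)
  = \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})},
  \qquad \bar{R} = D - R .
```

Documents are ranked by this odds ratio; the remainder of the talk adapts it to structured tuples.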
7 Adaptation of PIR Models for Structured Data
- Each tuple t is treated as a document
- Partition t into two parts:
  - t(X) contains the specified attributes
  - t(Y) contains the unspecified attributes
- Replace t with X and Y
- Approximate the irrelevant set D - R by D itself (R is typically much smaller than D)
8 Derivation
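This slide is an equation image in the transcript; under the adaptation above (t split into its specified part X and unspecified part Y, and the irrelevant set D - R approximated by D), the derivation presumably proceeds as:

```latex
\mathrm{Score}(t)
  \;\propto\; \frac{p(t \mid R)}{p(t \mid D)}
  = \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
  = \frac{p(Y \mid R)}{p(Y \mid D)} \cdot \frac{p(X \mid Y, R)}{p(X \mid Y, D)} .
```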
9 Limited Independence Assumptions
- Comprehensive dependency models have unacceptable preprocessing and query-processing costs
- Choose a middle ground:
- Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
10 Derivation Continued
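Another equation slide; applying the limited independence assumptions (independence within X and within Y, with pair-wise dependencies between X and Y values still allowed), the score presumably factorizes into atomic terms:

```latex
\mathrm{Score}(t)
  \;\propto\; \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
  \;\cdot\; \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, R)}{p(x \mid y, D)} .
```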
11 Workload-Based Estimation of p(y|R)
- Workload W: a collection of ranking queries that have been executed on the system in the past
- Represented as a set of tuples, where each tuple represents a query and is a vector containing the corresponding values of the specified attributes
- R is approximated as all query tuples in W that also request X (this approximation is novel to the paper)
- Properties of the set of relevant tuples R can thus be obtained by examining only the subset of the workload consisting of queries that also request X
- Substitute p(y|R) with p(y|X,W)
12 Derivation Continued
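With the workload-based substitution p(y|R) ≈ p(y|X,W) from the previous slide (and the corresponding substitution for the conditional factors), the final ranking function likely takes the form:

```latex
\mathrm{Score}(t)
  \;\propto\;
  \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
  \;\cdot\;
  \underbrace{\prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}} ,
```

which uses exactly the four atomic probabilities estimated on the next slide.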
13 Computing the Atomic Probabilities
- p(y|W): the relative frequency of each distinct value y in the workload
- p(y|D): the relative frequency of each distinct value y in the database (similar to the IDF concept in IR)
- p(x|y,W): the confidence of the pair-wise association rule y => x in the workload, i.e., (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)
- p(x|y,D): the analogous confidence in the database, i.e., (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)
- Stored as auxiliary tables in the intermediate knowledge representation layer (a sketch of their computation follows)
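A minimal sketch of how these atomic probabilities could be estimated, assuming the workload W and the database D are each given as a list of dicts mapping attribute names to values (the function and variable names are illustrative, not from the paper):

```python
from collections import Counter
from itertools import combinations

def atomic_probabilities(rows):
    """Estimate p(v) and p(x | y) from rows, each a dict of attribute -> value."""
    n = len(rows)
    value_counts = Counter()   # occurrences of each (attribute, value) pair
    pair_counts = Counter()    # co-occurrences of two (attribute, value) pairs
    for row in rows:
        values = [(a, v) for a, v in row.items() if v is not None]
        value_counts.update(values)
        for a, b in combinations(values, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # p(y): relative frequency of each distinct value
    p_single = {v: c / n for v, c in value_counts.items()}
    # p(x | y): association-rule confidence, #(x and y) / #(y)
    p_cond = {(x, y): c / value_counts[y] for (x, y), c in pair_counts.items()}
    return p_single, p_cond

# Running this once over W and once over D yields estimates of
# p(y|W), p(x|y,W) and p(y|D), p(x|y,D) respectively.
```

In the system itself these estimates are stored in the auxiliary tables of the next slide rather than in Python dicts.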
14 Implementation
- p(y|W): table (AttName, AttVal, Prob), B-tree index on (AttName, AttVal)
- p(y|D): table (AttName, AttVal, Prob), B-tree index on (AttName, AttVal)
- p(x|y,W): table (AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob), B-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
- p(x|y,D): table (AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob), B-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
15 Naïve Scan Algorithm
- Preprocessing: the Atomic Probabilities Module computes and indexes the quantities p(y|W), p(y|D), p(x|y,W), and p(x|y,D) for all distinct values x and y
- Execution (sketched below):
  - Select the tuples that satisfy the query
  - Scan and compute the score for each result tuple
  - Return the top-k tuples
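A runnable sketch of the scan algorithm, reusing the hypothetical atomic_probabilities output above; the smoothing constant for missing estimates is an assumption, not the paper's handling:

```python
import heapq

def naive_scan(D, query, p_W, p_D, pc_W, pc_D, k=10, eps=1e-6):
    """D: list of tuples (dicts); query: dict of specified attribute -> value (the X part).
    p_W/p_D: single-value estimates; pc_W/pc_D: pair-wise p(x | y) estimates."""
    X = set(query.items())
    scored = []
    for tid, row in enumerate(D):
        items = set(row.items())
        if not X <= items:                     # keep only tuples satisfying the query
            continue
        Y = items - X                          # unspecified attribute values of this tuple
        score = 1.0
        for y in Y:
            score *= p_W.get(y, eps) / p_D.get(y, eps)                    # global part
            for x in X:
                score *= pc_W.get((x, y), eps) / pc_D.get((x, y), eps)    # conditional part
        scored.append((score, tid))
    return heapq.nlargest(k, scored)           # top-k (score, tid) pairs
```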
16 Adapting the Threshold Algorithm (TA)
- Trade-off between pre-processing and query processing
- Pre-compute ranked lists of the tuples for all possible atomic queries; at query time, given an actual query that specifies a set of values X, merge the ranked lists corresponding to each x in X to compute the final top-k tuples
- We should be able to perform the merge without scanning the entire ranked lists
- The Threshold Algorithm (TA) can be used for this purpose
- A feasible adaptation of TA should keep the number of sorted streams small
- Naïvely, the number of sorted streams would depend on the number of attributes in the database
17 Pre-Computed Data Structures
- At query time we do a TA-like merge of several ranked lists (i.e., sorted streams)
- The required number of sorted streams depends only on the number of specified attribute values in the query, not on the total number of attributes in the database
- Such a merge operation is possible only because of the specific functional form of our ranking function, which results from the limited independence assumptions
18 The Computed Lists
- The Index Module takes as input the association rules and the database and, for every distinct value x, creates two lists Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways
- Conditional list Cx: pairs of the form <TID, CondScore>, ordered by descending CondScore, where TID is the tuple-id of a tuple t that contains x
- Global list Gx: pairs of the form <TID, GlobScore>, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x (the score definitions are sketched below)
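The per-list scores are defined in formula images on the original slide; consistent with the merging described on the next slide, they are presumably of the form (the product running over all attribute values z of tuple t):

```latex
\mathrm{CondScore}(x, t) \;=\; \prod_{z \in t} \frac{p(x \mid z, W)}{p(x \mid z, D)},
\qquad
\mathrm{GlobScore}(t) \;=\; \prod_{z \in t} \frac{p(z \mid W)}{p(z \mid D)} .
```

Multiplying GlobScore(t) by CondScore(x, t) for each specified value x then differs from the ranking function above only by factors that depend on X alone, which are identical for every tuple in the answer set.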
19 Retrieving Values
- At query time we retrieve and multiply the scores of t in the lists Cx1, ..., Cxs and in one of Gx1, ..., Gxs. This requires only s + 1 multiplications and yields a value that is proportional to the actual score (see the merge sketch below)
- Two kinds of efficient access operations are needed:
  - First, given a value x, it should be possible to perform a GetNextTID operation on the lists Cx and Gx in constant time: tuple-ids in the lists should be efficiently retrievable one by one in order of decreasing score. This corresponds to the sorted-stream access of TA
  - Second, it should be possible to perform random access on the lists: given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time
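A compact sketch of the TA-style merge over these lists, assuming non-empty, in-memory lists sorted by descending score with dict-based random access (the real system performs sorted and random accesses against the indexed database tables of the next slide):

```python
import heapq
from itertools import cycle

def ta_merge(cond_lists, glob_list, k=10):
    """cond_lists: one list per specified value x, each [(tid, CondScore), ...] sorted desc.
    glob_list: [(tid, GlobScore), ...] sorted desc.
    The score of a tuple is proportional to GlobScore * product of its CondScores."""
    streams = cond_lists + [glob_list]
    random_access = [dict(s) for s in streams]        # tid -> score, per stream
    positions = [0] * len(streams)                    # sorted-access cursors
    top, seen = [], set()                             # min-heap of (score, tid); visited tids
    for i in cycle(range(len(streams))):              # round-robin sorted access
        if all(p >= len(s) for p, s in zip(positions, streams)):
            break                                     # every stream exhausted
        if positions[i] >= len(streams[i]):
            continue
        tid, _ = streams[i][positions[i]]
        positions[i] += 1
        if tid not in seen:
            seen.add(tid)
            # a tuple missing from some stream does not match every specified value: skip it
            if all(tid in ra for ra in random_access):
                score = 1.0
                for ra in random_access:              # random access for the full score
                    score *= ra[tid]
                heapq.heappush(top, (score, tid))
                if len(top) > k:
                    heapq.heappop(top)
        # TA threshold: product of the scores at the current sorted-access frontier
        threshold = 1.0
        for p, s in zip(positions, streams):
            threshold *= s[min(p, len(s) - 1)][1]
        if len(top) == k and top[0][0] >= threshold:
            break                                     # the top-k can no longer change
    return sorted(top, reverse=True)
```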
20 The Conditional and Global Lists
- These lists are stored as database tables
- CondList Cx: (AttName, AttVal, TID, CondScore), B-tree index on (AttName, AttVal, CondScore)
- GlobList Gx: (AttName, AttVal, TID, GlobScore), B-tree index on (AttName, AttVal, GlobScore)
21 Index Module
22 Query Processing Component
23 Limited Available Space
- The space consumed by the lists is O(mn) bytes (m is the number of attributes and n the number of tuples of the database table)
- We can store only a subset of the lists at preprocessing time, at the expense of an increase in query processing time
- Which lists to retain or omit at preprocessing time is determined by analyzing the workload (see the sketch below)
- Store the conditional lists Cx and the corresponding global lists Gx only for those attribute values x that occur most frequently in the workload
- Probe the intermediate knowledge representation layer at query time to compute the missing information
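A minimal sketch of the workload-driven selection of which lists to materialize; the space budget expressed as a number of values is an illustrative simplification:

```python
from collections import Counter

def choose_lists_to_materialize(workload, budget):
    """workload: list of past queries, each a dict of specified attribute -> value.
    Returns the attribute values x for which Cx and Gx should be pre-computed,
    keeping only the `budget` most frequently requested values."""
    freq = Counter()
    for query in workload:
        freq.update(query.items())        # count each specified (attribute, value) pair
    return {x for x, _ in freq.most_common(budget)}
```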
24 Experimental Setup
- Datasets:
  - MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  - Internet Movie Database (http://www.imdb.com)
- Software and hardware:
  - Microsoft SQL Server 2000 RDBMS
  - P4 2.8-GHz PC, 1 GB RAM
  - C#, connected to the RDBMS through DAO
25 Quality Experiments
- Evaluated two ranking methods:
  1) Conditional
  2) Global
- Several hundred workload queries were collected for both datasets, and the ranking algorithm was trained on this workload
26 Quality Experiment: Average Precision
- For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in Hi as most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm (see the overlap sketch below)
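A tiny sketch of the overlap measure this likely corresponds to, treating the user's 10 marked tuples as the relevant set (the exact metric in the paper may be defined slightly differently):

```python
def top10_overlap(user_marked, algorithm_top10):
    """Fraction of the algorithm's top-10 tuples that the user also marked as relevant."""
    return len(set(user_marked) & set(algorithm_top10)) / len(algorithm_top10)

# Example: if 6 of the algorithm's top-10 were also marked by the user, the overlap is 0.6.
```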
27 Quality Experiment: Fraction of Users Preferring Each Algorithm
- Users were given the top-5 results of the two ranking methods for 5 queries (different from the previous survey) and were asked to choose which ranking they preferred
28 Performance Experiments
- Compared the performance of the various implementations of the Conditional algorithm: List Merge, its space-saving variant, and Scan
- Datasets used
29 Performance Experiments: Pre-Computation Time
30 Performance Experiments: Execution Time
31 Performance Experiments: Execution Time
32 Performance Experiments: Execution Time
33 Conclusion
- A completely automated approach to the Many-Answers Problem that leverages data and workload statistics and correlations
- Probabilistic IR models were adapted for structured data
- Experiments demonstrate both the efficiency and the quality of the ranking system
34 Open Questions
- Many relational databases contain text columns in addition to numeric and categorical columns. Can correlations between text and non-text data be leveraged in a meaningful way for ranking?
- Comprehensive quality benchmarks for database ranking need to be established