Probabilistic Ranking of Database Query Results - PowerPoint PPT Presentation

Slides: 33

Provided by: wei88

Learn more at: https://ranger.uta.edu

1
Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA
2
Outline

Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems

3
Motivating example

Realtor DB
Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
4
Motivation

Many-answers problem
Two alternative solutions
Query reformulation
Automatic ranking
Apply the probabilistic model from IR to DB tuple ranking

5
Problem Definition

Given a database table D with n tuples {t1, ..., tn} over a set of m categorical attributes A = {A1, ..., Am}
and a query Q: SELECT * FROM D WHERE X1 = x1 AND ... AND Xs = xs
where each Xi is an attribute from A and xi is a value in its domain.
The set of attributes X = {X1, ..., Xs} is known as the set of attributes specified by the query, while the set Y = A − X is known as the set of unspecified attributes.
Let S ⊆ {t1, ..., tn} be the answer set of Q.
How should the tuples in S be ranked so that the top-k tuples are returned to the user?

6
System Architecture
7
Intuition for Ranking Function

SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
Score of a result tuple t depends on:
Global Score: global importance of unspecified attribute values
E.g., homes with good school districts are globally desirable
Conditional Score: correlations between specified and unspecified attribute values
E.g., Waterfront ⇒ BoatDock

8
Probabilistic Model in IR

Bayes Rule
Product Rule

Document t, Query Q
R: relevant document set
R̄ = D − R: irrelevant document set
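
The formulas on this slide did not survive in the transcript. For reference, the two standard identities it names are written out below. The event symbols a, b, c are generic placeholders chosen here, not notation taken from the slide.

```latex
\begin{align*}
\text{Bayes' rule:}   \quad & p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} \\
\text{Product rule:}  \quad & p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)
\end{align*}
```

In a PIR setting these are typically used to rank a document t by the odds of relevance, which for a fixed query is proportional to p(t | R) / p(t | R̄).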

9
Adaptation of PIR to DB

Tuple t is considered as a document
Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Start from an initial scoring function and derive step by step until the final ranking function is obtained

10
Preliminary Derivation
11
Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the X
and Y values are allowed
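
One way to write this assumption down is sketched below. The symbol C stands for a generic conditioning event and is an assumption of this sketch; the slide's own formula is not preserved in the transcript.

```latex
\begin{align*}
p(X \mid C) &= \prod_{x \in X} p(x \mid C), &
p(Y \mid C) &= \prod_{y \in Y} p(y \mid C)
\end{align*}
```

Dependencies across the two groups, such as p(x | y, C), are left unconstrained, which is what lets the conditional score capture correlations between specified and unspecified values.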

12
Continuing Derivation
13
Workload-based Estimation of
Assume a collection of past queries exists in the system
Workload W is represented as a set of "tuples"
Given query Q and specified attribute set X, approximate R as all query "tuples" in W that also request X
All properties of the relevant tuple set R can then be obtained by examining only the subset of the workload that contains queries that also request X
14
Final Ranking Function
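
The formula on this slide is missing from the transcript. A form consistent with the four atomic quantities the deck names elsewhere, p(y | W), p(y | D), p(x | y, W) and p(x | y, D), is sketched below. Treat it as a reconstruction under that assumption, not as a verbatim copy of the slide.

```latex
\[
\mathrm{Score}(t) \;\propto\;
\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
\;\cdot\;
\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
\]
```

Read this way, the first product corresponds to the global score and the second to the conditional score from the intuition slide.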
15
Pre-computing Atomic Probabilities in Ranking Function
p(y | W): relative frequency of y in W
p(y | D): relative frequency of y in D
p(x | y, W): (# of tuples in W that contain x, y) / (total # of tuples in W)
p(x | y, D): (# of tuples in D that contain x, y) / (total # of tuples in D)
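
The counting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each row (a data tuple or a workload "tuple") is a dict from attribute name to value; the function name `atomic_probabilities` and the sample workload are hypothetical. It follows the slide's wording and divides pair counts by the total number of rows.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Return (single, pair) relative frequencies over a list of dict rows.

    single[(attr, val)]        = (# rows containing val) / (# rows)
    pair[((a1, v1), (a2, v2))] = (# rows containing both) / (# rows)
    """
    n = len(rows)
    single, pair = Counter(), Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)                  # count each attribute value once per row
        pair.update(permutations(items, 2))   # ordered pairs (x, y) of distinct attributes
    return ({k: c / n for k, c in single.items()},
            {k: c / n for k, c in pair.items()})


# Hypothetical four-query workload, for illustration only.
W = [
    {"SchoolDistrict": "Excellent", "View": "Waterfront"},
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle"},
    {"City": "Seattle", "BoatDock": "Yes"},
]
p_single, p_pair = atomic_probabilities(W)
print(p_single[("View", "Waterfront")])
print(p_pair[(("View", "Waterfront"), ("BoatDock", "Yes"))])
```

The same function can be run once over W and once over D to obtain all four families of atomic quantities.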
16
Example for Computing Atomic Probabilities

SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
Y = {SchoolDistrict, BoatDock, ...}
|D| = 10,000, |W| = 1000
|W{excellent}| = 10
|W{waterfront & yes}| = 5
p(excellent | W) = 10/1000 = 0.01
p(excellent | D) = 10/10,000 = 0.001
p(waterfront | yes, W) = 5/1000 = 0.005
p(waterfront | yes, D) = 5/10,000 = 0.0005

17
Indexing Atomic Probabilities
P(y | W) table: {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
P(y | D) table: {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
P(x | y, W) table: {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
P(x | y, D) table: {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
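
As a rough illustration of these auxiliary tables, the sketch below creates one table of each shape in an in-memory SQLite database and indexes it on the listed columns. The table names `GlobalW` and `CondW`, the helper `lookup_global`, and the probability values are hypothetical. SQLite is used only because it ships with Python, and its index structure may differ from the one used in the original system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GlobalW (AttName TEXT, AttVal TEXT, Prob REAL);
CREATE INDEX GlobalW_idx ON GlobalW (AttName, AttVal);

CREATE TABLE CondW (AttNameLeft TEXT, AttValLeft TEXT,
                    AttNameRight TEXT, AttValRight TEXT, Prob REAL);
CREATE INDEX CondW_idx
    ON CondW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
""")

# Illustrative rows only; the numbers are made up.
conn.execute("INSERT INTO GlobalW VALUES (?, ?, ?)",
             ("SchoolDistrict", "Excellent", 0.2))
conn.execute("INSERT INTO CondW VALUES (?, ?, ?, ?, ?)",
             ("View", "Waterfront", "BoatDock", "Yes", 0.005))


def lookup_global(att, val):
    """Fetch one atomic probability by (AttName, AttVal), or None if absent."""
    row = conn.execute(
        "SELECT Prob FROM GlobalW WHERE AttName = ? AND AttVal = ?",
        (att, val),
    ).fetchone()
    return row[0] if row else None


print(lookup_global("SchoolDistrict", "Excellent"))
```

The matching tables for D would have the same shape, so only the W side is shown.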
18
Scan Algorithm

Preprocessing: Atomic Probabilities Module
Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
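
The execution steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes tuples are dicts, that every needed atomic probability is present and nonzero, and that a tuple's score is the product of P(y | W)/P(y | D) over its unspecified values times the product of P(x | y, W)/P(x | y, D) over specified/unspecified pairs. The function name `scan_top_k` and all sample numbers are hypothetical.

```python
import heapq
from math import prod


def scan_top_k(table, query, k, p_w, p_d, pc_w, pc_d):
    """table: list of dict tuples; query: dict of specified attribute -> value.

    p_w / p_d map (attr, val) to P(y|W) / P(y|D).
    pc_w / pc_d map ((xattr, xval), (yattr, yval)) to P(x|y,W) / P(x|y,D).
    The TID column is skipped when collecting unspecified values.
    """
    xs = list(query.items())

    def score(t):
        ys = [(a, v) for a, v in t.items() if a not in query and a != "TID"]
        glob = prod(p_w[y] / p_d[y] for y in ys)
        cond = prod(pc_w[(x, y)] / pc_d[(x, y)] for y in ys for x in xs)
        return glob * cond

    # Step 1: select tuples that satisfy the query.
    answers = [t for t in table if all(t.get(a) == v for a, v in xs)]
    # Steps 2 and 3: score every answer tuple and keep the k best.
    return heapq.nlargest(k, answers, key=score)


# Hypothetical data, for illustration only.
table = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"TID": 2, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Poor"},
    {"TID": 3, "City": "Bellevue", "View": "Waterfront", "SchoolDistrict": "Excellent"},
]
query = {"City": "Seattle", "View": "Waterfront"}
p_w = {("SchoolDistrict", "Excellent"): 0.4, ("SchoolDistrict", "Poor"): 0.1}
p_d = {("SchoolDistrict", "Excellent"): 0.2, ("SchoolDistrict", "Poor"): 0.2}
pairs = [(x, y) for x in query.items() for y in p_w]
pc_w = {pair: 0.5 for pair in pairs}
pc_d = {pair: 0.5 for pair in pairs}

top = scan_top_k(table, query, 2, p_w, p_d, pc_w, pc_d)
print([t["TID"] for t in top])
```

Every tuple in the answer set is scored, so the cost grows with the size of the answer set.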

19
Beyond Scan Algorithm

Scan algorithm is Inefficient
Many tuples in the answer set
Another extreme
Pre-compute top-K tuples for all possible
queries
Still infeasible in practice
Trade-off solution
Pre-compute ranked lists of tuples for all
possible atomic queries
At query time, merge ranked lists to get top-K
tuples

20
Two kinds of Ranked List

CondList Cx
{AttName, AttVal, TID, CondScore}
B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx
{AttName, AttVal, TID, GlobScore}
B+ tree index on (AttName, AttVal, GlobScore)

21
Index Module
22
List Merge Algorithm
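
The body of this slide is not preserved in the transcript. As a rough illustration of merging pre-computed ranked lists at query time, the sketch below uses a threshold-style merge in the spirit of Fagin's Threshold Algorithm. That choice is an assumption of this sketch, as is the rule that a tuple's overall score is the product of its positive scores across all lists. The function name `merge_top_k` and the sample lists are hypothetical.

```python
import heapq
from math import prod


def merge_top_k(lists, k):
    """lists: one ranked list per atomic query, each a list of (tid, score)
    pairs sorted by descending positive score. Tuples missing from any
    list are treated as outside the answer set."""
    lookup = [dict(lst) for lst in lists]        # random access by TID
    seen, top = set(), []                        # top is a min-heap of (score, tid)
    shortest = min(len(lst) for lst in lists)
    for depth in range(shortest):
        row = [lst[depth] for lst in lists]      # sorted access, one entry per list
        for tid, _ in row:
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookup):
                total = prod(lk[tid] for lk in lookup)
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)           # drop the current worst
        # No unseen tuple can score higher than the product of the frontier.
        threshold = prod(score for _, score in row)
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(tid, score) for score, tid in sorted(top, reverse=True)]


# Hypothetical ranked lists for two atomic queries.
list_seattle = [(1, 3.0), (2, 2.0), (3, 1.0), (4, 0.5)]
list_waterfront = [(2, 4.0), (1, 1.0), (4, 1.0), (3, 0.5)]
result = merge_top_k([list_seattle, list_waterfront], 2)
print(result)
```

In this example the loop stops after reading two entries from each list, because the second-best score found so far already meets the threshold.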
23
Experimental Setup

Datasets
MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
Internet Movie Database (http://www.imdb.com)

Software and Hardware
Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, connected to RDBMS through DAO

24
Quality Experiments

Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare the Conditional Ranking Method in the paper with the Global Method [CIDR03]

25
Quality Experiment: Average Precision

For each query Qi , generate a set Hi of 30
tuples likely to contain a good mix of relevant
and irrelevant tuples
Let each user mark 10 tuples in Hi as most
relevant to Qi
Measure how closely the 10 tuples marked by the
user match the 10 tuples returned by each
algorithm
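
One plausible reading of "how closely the tuples match" is the fraction of an algorithm's returned tuples that the user also marked. The sketch below computes that overlap. The exact metric used in the experiments is not stated in the transcript, so the function `overlap_precision` and the sample TIDs are assumptions for illustration.

```python
def overlap_precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuples that the user also marked."""
    returned = list(algorithm_top)
    if not returned:
        return 0.0
    return len(set(user_marked) & set(returned)) / len(returned)


# Hypothetical TIDs: the user marks tuples 1..10, the algorithm returns 10 tuples.
user_marked = set(range(1, 11))
algorithm_top = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
print(overlap_precision(user_marked, algorithm_top))
```

Averaging this value over all queries and users would give one number per ranking algorithm.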

26
Quality Experiment: Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results

27
Performance Experiments

Datasets

Compare 2 Algorithms
Scan algorithm
List Merge algorithm

28
Performance Experiments: Pre-computation Time
29
Performance Experiments: Execution Time
30
Performance Experiments: Execution Time
31
Performance Experiments: Execution Time
32
Conclusion and Open Problems