Title: Probabilistic Ranking of Database Query Results
1 Probabilistic Ranking of Database Query Results
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Vagelis Hristidis, Florida International University
- Gerhard Weikum, MPI Informatik
- 30th VLDB Conference, Toronto, Canada, 2004
- Presented by Abhishek Jamloki, CSE@UB
2 Motivating Example
- Realtor DB
- Table D(TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
- SQL query:
  SELECT * FROM D
  WHERE City = 'Seattle' AND View = 'Waterfront'
3 Formal Definition
- Consider a database table D with n tuples {t1, ..., tn} over a set of m categorical attributes A = {A1, ..., Am}
- A query Q: SELECT * FROM D WHERE X1 = x1 AND ... AND Xs = xs, where each Xi is an attribute from A and xi is a value in its domain
- Specified attributes: X = {X1, ..., Xs}
- Unspecified attributes: Y = A \ X
- Let S be the answer set of Q
- How do we rank the tuples in S and return the top-k tuples to the user?
4 Many-Answers Problem
- IR treatment:
  - Query reformulation
  - Automatic ranking
- Correlations are ignored in the high-dimensional spaces of IR
- An automated ranking function is proposed, based on:
  - A global score of the unspecified attributes
  - A conditional score (the strength of correlation between specified and unspecified attributes)
- Both scores are estimated automatically using workload and data analysis
5 Architecture of Ranking System
6 Probabilistic Model in IR
- Document t, query Q
- R: the relevant document set
- D - R: the irrelevant document set (the resulting ranking principle is sketched below)
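The equations on this slide are images in the original deck; the classical probabilistic IR ranking it refers to (the odds of relevance, rewritten with Bayes' rule) can be sketched as:

```latex
\mathrm{Score}(t)
  = \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})},
  \qquad \bar{R} = D - R .
```

Documents are ranked by this odds ratio; the remainder of the talk adapts it to structured tuples.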
7 Adaptation of PIR Models for Structured Data
- Each tuple t is treated as a document
- Partition t into two parts:
  - t(X) contains the specified attributes
  - t(Y) contains the unspecified attributes
- Replace t with X and Y
- Approximate the irrelevant set D - R by D itself (R is typically much smaller than D)
8 Derivation
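This slide is an equation image in the transcript; under the adaptation above (t split into its specified part X and unspecified part Y, and the irrelevant set D - R approximated by D), the derivation presumably proceeds as:

```latex
\mathrm{Score}(t)
  \;\propto\; \frac{p(t \mid R)}{p(t \mid D)}
  = \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
  = \frac{p(Y \mid R)}{p(Y \mid D)} \cdot \frac{p(X \mid Y, R)}{p(X \mid Y, D)} .
```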
9 Limited Independence Assumptions
- Comprehensive dependency models have unacceptable preprocessing and query-processing costs
- Choose a middle ground:
- Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
10 Derivation Continued
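Another equation slide; applying the limited independence assumptions (independence within X and within Y, with pair-wise dependencies between X and Y values still allowed), the score presumably factorizes into atomic terms:

```latex
\mathrm{Score}(t)
  \;\propto\; \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
  \;\cdot\; \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, R)}{p(x \mid y, D)} .
```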
11 Workload-Based Estimation of p(y|R)
- Workload W: a collection of ranking queries that have been executed on the system in the past
- Represented as a set of tuples, where each tuple represents a query and is a vector containing the corresponding values of the specified attributes
- R is approximated as all query tuples in W that also request X (this approximation is novel to the paper)
- Properties of the set of relevant tuples R can thus be obtained by examining only the subset of the workload consisting of queries that also request X
- Substitute p(y|R) with p(y|X,W)
12 Derivation Continued
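With the workload-based substitution p(y|R) ≈ p(y|X,W) from the previous slide (and the corresponding substitution for the conditional factors), the final ranking function likely takes the form:

```latex
\mathrm{Score}(t)
  \;\propto\;
  \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
  \;\cdot\;
  \underbrace{\prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}} ,
```

which uses exactly the four atomic probabilities estimated on the next slide.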
13 Computing the Atomic Probabilities
- p(y|W): the relative frequency of each distinct value y in the workload
- p(y|D): the relative frequency of each distinct value y in the database (similar to the IDF concept in IR)
- p(x|y,W): the confidence of the pair-wise association rule y => x in the workload, i.e., (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)
- p(x|y,D): the analogous confidence in the database, i.e., (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)
- Stored as auxiliary tables in the intermediate knowledge representation layer (a sketch of their computation follows)
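A minimal sketch of how these atomic probabilities could be estimated, assuming the workload W and the database D are each given as a list of dicts mapping attribute names to values (the function and variable names are illustrative, not from the paper):

```python
from collections import Counter
from itertools import combinations

def atomic_probabilities(rows):
    """Estimate p(v) and p(x | y) from rows, each a dict of attribute -> value."""
    n = len(rows)
    value_counts = Counter()   # occurrences of each (attribute, value) pair
    pair_counts = Counter()    # co-occurrences of two (attribute, value) pairs
    for row in rows:
        values = [(a, v) for a, v in row.items() if v is not None]
        value_counts.update(values)
        for a, b in combinations(values, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # p(y): relative frequency of each distinct value
    p_single = {v: c / n for v, c in value_counts.items()}
    # p(x | y): association-rule confidence, #(x and y) / #(y)
    p_cond = {(x, y): c / value_counts[y] for (x, y), c in pair_counts.items()}
    return p_single, p_cond

# Running this once over W and once over D yields estimates of
# p(y|W), p(x|y,W) and p(y|D), p(x|y,D) respectively.
```

In the system itself these estimates are stored in the auxiliary tables of the next slide rather than in Python dicts.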
14 Implementation
- p(y|W): table (AttName, AttVal, Prob), B-tree index on (AttName, AttVal)
- p(y|D): table (AttName, AttVal, Prob), B-tree index on (AttName, AttVal)
- p(x|y,W): table (AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob), B-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
- p(x|y,D): table (AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob), B-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
15 Naïve Scan Algorithm
- Preprocessing: the Atomic Probabilities Module computes and indexes the quantities p(y|W), p(y|D), p(x|y,W), and p(x|y,D) for all distinct values x and y
- Execution (sketched below):
  - Select the tuples that satisfy the query
  - Scan and compute the score for each result tuple
  - Return the top-k tuples
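A runnable sketch of the scan algorithm, reusing the hypothetical atomic_probabilities output above; the smoothing constant for missing estimates is an assumption, not the paper's handling:

```python
import heapq

def naive_scan(D, query, p_W, p_D, pc_W, pc_D, k=10, eps=1e-6):
    """D: list of tuples (dicts); query: dict of specified attribute -> value (the X part).
    p_W/p_D: single-value estimates; pc_W/pc_D: pair-wise p(x | y) estimates."""
    X = set(query.items())
    scored = []
    for tid, row in enumerate(D):
        items = set(row.items())
        if not X <= items:                     # keep only tuples satisfying the query
            continue
        Y = items - X                          # unspecified attribute values of this tuple
        score = 1.0
        for y in Y:
            score *= p_W.get(y, eps) / p_D.get(y, eps)                    # global part
            for x in X:
                score *= pc_W.get((x, y), eps) / pc_D.get((x, y), eps)    # conditional part
        scored.append((score, tid))
    return heapq.nlargest(k, scored)           # top-k (score, tid) pairs
```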
16 Adapting the Threshold Algorithm (TA)
- Trade-off between pre-processing and query processing
- Pre-compute ranked lists of the tuples for all possible atomic queries; at query time, given an actual query that specifies a set of values X, merge the ranked lists corresponding to each x in X to compute the final top-k tuples
- We should be able to perform the merge without scanning the entire ranked lists
- The Threshold Algorithm (TA) can be used for this purpose
- A feasible adaptation of TA should keep the number of sorted streams small
- Naïvely, the number of sorted streams would depend on the number of attributes in the database
17 Pre-Computed Data Structures
- At query time we do a TA-like merge of several ranked lists (i.e., sorted streams)
- The required number of sorted streams depends only on the number of specified attribute values in the query, not on the total number of attributes in the database
- Such a merge operation is possible only because of the specific functional form of our ranking function, which results from the limited independence assumptions
18 The Computed Lists
- The Index Module takes as input the association rules and the database and, for every distinct value x, creates two lists Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways
- Conditional list Cx: pairs of the form <TID, CondScore>, ordered by descending CondScore, where TID is the tuple-id of a tuple t that contains x
- Global list Gx: pairs of the form <TID, GlobScore>, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x (the score definitions are sketched below)
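The per-list scores are defined in formula images on the original slide; consistent with the merging described on the next slide, they are presumably of the form (the product running over all attribute values z of tuple t):

```latex
\mathrm{CondScore}(x, t) \;=\; \prod_{z \in t} \frac{p(x \mid z, W)}{p(x \mid z, D)},
\qquad
\mathrm{GlobScore}(t) \;=\; \prod_{z \in t} \frac{p(z \mid W)}{p(z \mid D)} .
```

Multiplying GlobScore(t) by CondScore(x, t) for each specified value x then differs from the ranking function above only by factors that depend on X alone, which are identical for every tuple in the answer set.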
19 Retrieving Values
- At query time we retrieve and multiply the scores of t in the lists Cx1, ..., Cxs and in one of Gx1, ..., Gxs. This requires only s + 1 multiplications and yields a value that is proportional to the actual score (see the merge sketch below)
- Two kinds of efficient access operations are needed:
  - First, given a value x, it should be possible to perform a GetNextTID operation on the lists Cx and Gx in constant time: tuple-ids in the lists should be efficiently retrievable one by one in order of decreasing score. This corresponds to the sorted-stream access of TA
  - Second, it should be possible to perform random access on the lists: given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time
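A compact sketch of the TA-style merge over these lists, assuming non-empty, in-memory lists sorted by descending score with dict-based random access (the real system performs sorted and random accesses against the indexed database tables of the next slide):

```python
import heapq
from itertools import cycle

def ta_merge(cond_lists, glob_list, k=10):
    """cond_lists: one list per specified value x, each [(tid, CondScore), ...] sorted desc.
    glob_list: [(tid, GlobScore), ...] sorted desc.
    The score of a tuple is proportional to GlobScore * product of its CondScores."""
    streams = cond_lists + [glob_list]
    random_access = [dict(s) for s in streams]        # tid -> score, per stream
    positions = [0] * len(streams)                    # sorted-access cursors
    top, seen = [], set()                             # min-heap of (score, tid); visited tids
    for i in cycle(range(len(streams))):              # round-robin sorted access
        if all(p >= len(s) for p, s in zip(positions, streams)):
            break                                     # every stream exhausted
        if positions[i] >= len(streams[i]):
            continue
        tid, _ = streams[i][positions[i]]
        positions[i] += 1
        if tid not in seen:
            seen.add(tid)
            # a tuple missing from some stream does not match every specified value: skip it
            if all(tid in ra for ra in random_access):
                score = 1.0
                for ra in random_access:              # random access for the full score
                    score *= ra[tid]
                heapq.heappush(top, (score, tid))
                if len(top) > k:
                    heapq.heappop(top)
        # TA threshold: product of the scores at the current sorted-access frontier
        threshold = 1.0
        for p, s in zip(positions, streams):
            threshold *= s[min(p, len(s) - 1)][1]
        if len(top) == k and top[0][0] >= threshold:
            break                                     # the top-k can no longer change
    return sorted(top, reverse=True)
```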
20 The Conditional and Global Lists
- These lists are stored as database tables
- CondList Cx: (AttName, AttVal, TID, CondScore), B-tree index on (AttName, AttVal, CondScore)
- GlobList Gx: (AttName, AttVal, TID, GlobScore), B-tree index on (AttName, AttVal, GlobScore)
21 Index Module
22 Query Processing Component
23 Limited Available Space
- The space consumed by the lists is O(mn) bytes (m is the number of attributes and n the number of tuples of the database table)
- We can store only a subset of the lists at preprocessing time, at the expense of an increase in query processing time
- Which lists to retain or omit at preprocessing time is determined by analyzing the workload (see the sketch below)
- Store the conditional lists Cx and the corresponding global lists Gx only for those attribute values x that occur most frequently in the workload
- Probe the intermediate knowledge representation layer at query time to compute the missing information
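A minimal sketch of the workload-driven selection of which lists to materialize; the space budget expressed as a number of values is an illustrative simplification:

```python
from collections import Counter

def choose_lists_to_materialize(workload, budget):
    """workload: list of past queries, each a dict of specified attribute -> value.
    Returns the attribute values x for which Cx and Gx should be pre-computed,
    keeping only the `budget` most frequently requested values."""
    freq = Counter()
    for query in workload:
        freq.update(query.items())        # count each specified (attribute, value) pair
    return {x for x, _ in freq.most_common(budget)}
```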
24 Experimental Setup
- Datasets:
  - MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  - Internet Movie Database (http://www.imdb.com)
- Software and hardware:
  - Microsoft SQL Server 2000 RDBMS
  - P4 2.8-GHz PC, 1 GB RAM
  - C#, connected to the RDBMS through DAO
25 Quality Experiments
- Evaluated two ranking methods:
  1) Conditional
  2) Global
- Several hundred workload queries were collected for both datasets, and the ranking algorithm was trained on this workload
26 Quality Experiment: Average Precision
- For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in Hi as most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm (see the overlap sketch below)
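A tiny sketch of the overlap measure this likely corresponds to, treating the user's 10 marked tuples as the relevant set (the exact metric in the paper may be defined slightly differently):

```python
def top10_overlap(user_marked, algorithm_top10):
    """Fraction of the algorithm's top-10 tuples that the user also marked as relevant."""
    return len(set(user_marked) & set(algorithm_top10)) / len(algorithm_top10)

# Example: if 6 of the algorithm's top-10 were also marked by the user, the overlap is 0.6.
```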
27 Quality Experiment: Fraction of Users Preferring Each Algorithm
- Users were given the top-5 results of the two ranking methods for 5 queries (different from the previous survey) and were asked to choose which ranking they preferred
28 Performance Experiments
- Compared the performance of the various implementations of the Conditional algorithm: List Merge, its space-saving variant, and Scan
- Datasets used
29 Performance Experiments: Pre-Computation Time
30 Performance Experiments: Execution Time
31 Performance Experiments: Execution Time
32 Performance Experiments: Execution Time
33 Conclusion
- A completely automated approach to the Many-Answers Problem that leverages data and workload statistics and correlations
- Probabilistic IR models were adapted for structured data
- Experiments demonstrate both the efficiency and the quality of the ranking system
34 Open Questions
- Many relational databases contain text columns in addition to numeric and categorical columns. Can correlations between text and non-text data be leveraged in a meaningful way for ranking?
- Comprehensive quality benchmarks for database ranking need to be established