Probabilistic Ranking of Database Query Results - PowerPoint PPT Presentation

1 / 31

About This Presentation

Title:

Probabilistic Ranking of Database Query Results

Description:

Gerhard Weikum, MPI Informatik. Vagelis Hristidis. VLDB 2004. 2. Roadmap. Problem Definition ... House Attributes: Price, City, Bedrooms, Bathrooms, ... – PowerPoint PPT presentation

Number of Views:62

Avg rating:3.0/5.0

Slides: 32

Provided by: usersC8

Category:

more less

Transcript and Presenter's Notes

Title: Probabilistic Ranking of Database Query Results

1
Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International
University
Gerhard Weikum, MPI Informatik

2
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

3
Motivation

SQL Returns Unordered Sets of Results
Overwhelms Users of Information Discovery
Applications
How Can Ranking be Introduced, Given that ALL
Results Satisfy Query?

4
Example Realtor Database

House Attributes Price, City, Bedrooms,
Bathrooms, SchoolDistrict, Waterfront, BoatDock,
Year
Query City Seattle AND Waterfront TRUE
Too Many Results!
Intuitively, Houses with lower Price, more
Bedrooms, or BoatDock are generally preferable

5
Rank According to Unspecified Attributes

Score of a Result Tuple t depends on
Global Score Global Importance of Unspecified
Attribute Values CIDR2003
E.g., Newer Houses are generally preferred
Conditional Score Correlations between Specified
and Unspecified Attribute Values
E.g., Waterfront ? BoatDock Many Bedrooms ?
Good School District

6
Key Problems

Given a Query Q, How to Combine the Global and
Conditional Scores into a Ranking Function.Use
Probabilistic Information Retrieval (PIR).
How to Calculate the Global and Conditional
Scores.Use Query Workload and Data.

7
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

8
Architecture
9
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

10
PIR Review

Bayes Rule
Product Rule

Document (Tuple) t, Query QR Relevant
DocumentsR D - R Irrelevant Documents

11
Ranking Function Adapt PIR

Query Q X1x1 AND AND Xsxs, X X1, , Xs
Result-Tuple t(X,Y), where X Specified
Attributes, Y Unspecified Attributes

R?D
X, R, D common. R satisfies X.
12
Ranking Function Limited Conditional
Independence

Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the X
and Y values are allowed

, C involves Y, R, D
, C involves X, R, D
Use Workload
Use Data
13
Atomic Probabilities Estimation Using Workload W

If Many Queries Specify Set X of Conditions then
there is Preference Correlation between
Attributes in X.
Global E.g., If Many Queries ask for Waterfront
then p(WaterfrontTRUE) is high.
Conditional E.g., If Many Queries ask for
4-Bedroom Houses in Good School Districts, then
p(Bedrooms4 SchoolDistrictgood),
p(SchoolDistrictgood Bedrooms4) are high.

Using Limited Conditional Independence

Probabilities p(x y, W) (p(x y, D))
Calculated Using Standard Association Rule Mining
Techniques on W (D)

Conditional Part
Global Part
14
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

15
Performance
16
Scan Algorithm

Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities P(y W),
P(y D), P(x y, W), and P(x y, D) for All
Distinct Values x and y
Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples

17
List Merge Algorithm
Final Formula

Preprocessing
For Each Distinct Value x of Database, Calculate
and Store the Conditional (Cx) and the Global
(Gx) Lists as follows
For Each Tuple t Containing x Calculate

and add to Cx and Gx respectively
Sort Cx, Gx by decreasing scores

Execution Query Q X1x1 AND AND Xsxs
Execute Threshold Algorithm Fag01 on the
following lists Cx1,,Cxs, and Gxb, where Gxb
is the shortest list among Gx1,,Gxs

18
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

19
Quality Experiments

Compare our Conditional Ranking Method with the
Global Method CIDR03
Surveyed 14 MSR employees
Datasets
MSR HomeAdvisor Seattle (http//houseandhome.msn.c
om/)
Internet Movie Database (http//www.imdb.com)
Each User Behaved According to Various Profiles.
E.g.
singles, middle-class family, rich retirees
teenage males, people interested in comedies of
the 80s
First Collect Workloads, Then Compare Results of
2 Methods for a Set of Queries

20
Quality Experiments Average Precision

For 5 queries, ask users to Mark 10 out of a Set
of 30 likely results containing the Top-10
results of both the Conditional and Global plus a
few randomly selected tuples.
Precision Recall

21
Quality Experiments - Fraction of Users
Preferring Each Algorithm

Seattle Homes and Movies Datasets
5 new queries
Top-5 Result-lists

22
Performance Experiments

Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C, Connected to RDBMS through DAO
Datasets

Compared Algorithms
LM List Merge
Scan

23
Performance Experiments - Precomputation

Time and Space Consumed by Index Module

24
Performance Experiments - Execution
Varying Number of Tuples Satisfying Selection
Conditions

US Homes Database
2-Attributes Queries

25
Performance Experiments - Execution

US Homes Database

26
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

27
Related Work

CIDR2003
Use Workload, Focus on Empty-Answer Problem.
Drawback Global Ranking Regardless of Query.
E.g. Tram is desirable to be away from expensive
houses, but close to cheap.
Collaborative Filtering
Require Training Data of Queries and their Ranked
Results
Relevance-Feedback Techniques for Learning
Similarity in Multimedia and Relational Databases

28
Roadmap

Problem Definition
Architecture
Probabilistic Information Retrieval
Performance
Experiments
Related Work
Conclusion

29
Conclusions Future Work

Conclusions
Completely Automated Approach for the
Many-Answers Problem which Leverages Data and
Workload Statistics and Correlations
Based on PIR
Future Work
Empty-Answer Problem
Handle Plain Text Attributes

30
Questions?
31
Performance Experiments - Execution