Title: Probabilistic Ranking of Database Query Results
1Probabilistic Ranking of Database Query Results
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Vagelis Hristidis, Florida International
University - Gerhard Weikum, MPI Informatik
2Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
3Motivation
- SQL Returns Unordered Sets of Results
- Overwhelms Users of Information Discovery
Applications - How Can Ranking be Introduced, Given that ALL
Results Satisfy Query?
4Example Realtor Database
- House Attributes Price, City, Bedrooms,
Bathrooms, SchoolDistrict, Waterfront, BoatDock,
Year - Query City Seattle AND Waterfront TRUE
- Too Many Results!
- Intuitively, Houses with lower Price, more
Bedrooms, or BoatDock are generally preferable
5Rank According to Unspecified Attributes
- Score of a Result Tuple t depends on
- Global Score Global Importance of Unspecified
Attribute Values CIDR2003 - E.g., Newer Houses are generally preferred
- Conditional Score Correlations between Specified
and Unspecified Attribute Values - E.g., Waterfront ? BoatDock Many Bedrooms ?
Good School District
6Key Problems
- Given a Query Q, How to Combine the Global and
Conditional Scores into a Ranking Function.Use
Probabilistic Information Retrieval (PIR). - How to Calculate the Global and Conditional
Scores.Use Query Workload and Data.
7Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
8Architecture
9Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
10PIR Review
- Document (Tuple) t, Query QR Relevant
DocumentsR D - R Irrelevant Documents
11Ranking Function Adapt PIR
- Query Q X1x1 AND AND Xsxs, X X1, , Xs
- Result-Tuple t(X,Y), where X Specified
Attributes, Y Unspecified Attributes
R?D
X, R, D common. R satisfies X.
12Ranking Function Limited Conditional
Independence
- Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the X
and Y values are allowed
, C involves Y, R, D
, C involves X, R, D
Use Workload
Use Data
13Atomic Probabilities Estimation Using Workload W
- If Many Queries Specify Set X of Conditions then
there is Preference Correlation between
Attributes in X. - Global E.g., If Many Queries ask for Waterfront
then p(WaterfrontTRUE) is high. - Conditional E.g., If Many Queries ask for
4-Bedroom Houses in Good School Districts, then
p(Bedrooms4 SchoolDistrictgood),
p(SchoolDistrictgood Bedrooms4) are high.
Using Limited Conditional Independence
- Probabilities p(x y, W) (p(x y, D))
Calculated Using Standard Association Rule Mining
Techniques on W (D)
Conditional Part
Global Part
14Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
15Performance
16Scan Algorithm
- Preprocessing - Atomic Probabilities Module
- Computes and Indexes the Quantities P(y W),
P(y D), P(x y, W), and P(x y, D) for All
Distinct Values x and y - Execution
- Select Tuples that Satisfy the Query
- Scan and Compute Score for Each Result-Tuple
- Return Top-K Tuples
17List Merge Algorithm
Final Formula
- Preprocessing
- For Each Distinct Value x of Database, Calculate
and Store the Conditional (Cx) and the Global
(Gx) Lists as follows - For Each Tuple t Containing x Calculate
- and add to Cx and Gx respectively
- Sort Cx, Gx by decreasing scores
- Execution Query Q X1x1 AND AND Xsxs
- Execute Threshold Algorithm Fag01 on the
following lists Cx1,,Cxs, and Gxb, where Gxb
is the shortest list among Gx1,,Gxs
18Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
19Quality Experiments
- Compare our Conditional Ranking Method with the
Global Method CIDR03 - Surveyed 14 MSR employees
- Datasets
- MSR HomeAdvisor Seattle (http//houseandhome.msn.c
om/) - Internet Movie Database (http//www.imdb.com)
- Each User Behaved According to Various Profiles.
E.g. - singles, middle-class family, rich retirees
- teenage males, people interested in comedies of
the 80s - First Collect Workloads, Then Compare Results of
2 Methods for a Set of Queries
20Quality Experiments Average Precision
- For 5 queries, ask users to Mark 10 out of a Set
of 30 likely results containing the Top-10
results of both the Conditional and Global plus a
few randomly selected tuples. - Precision Recall
21Quality Experiments - Fraction of Users
Preferring Each Algorithm
- Seattle Homes and Movies Datasets
- 5 new queries
- Top-5 Result-lists
22Performance Experiments
- Microsoft SQL Server 2000 RDBMS
- P4 2.8-GHz PC, 1 GB RAM
- C, Connected to RDBMS through DAO
- Datasets
- Compared Algorithms
- LM List Merge
- Scan
23Performance Experiments - Precomputation
- Time and Space Consumed by Index Module
24Performance Experiments - Execution
Varying Number of Tuples Satisfying Selection
Conditions
- US Homes Database
- 2-Attributes Queries
25Performance Experiments - Execution
26Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
27Related Work
- CIDR2003
- Use Workload, Focus on Empty-Answer Problem.
- Drawback Global Ranking Regardless of Query.
E.g. Tram is desirable to be away from expensive
houses, but close to cheap. - Collaborative Filtering
- Require Training Data of Queries and their Ranked
Results - Relevance-Feedback Techniques for Learning
Similarity in Multimedia and Relational Databases
28Roadmap
- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance
- Experiments
- Related Work
- Conclusion
29Conclusions Future Work
- Conclusions
- Completely Automated Approach for the
Many-Answers Problem which Leverages Data and
Workload Statistics and Correlations - Based on PIR
- Future Work
- Empty-Answer Problem
- Handle Plain Text Attributes
30Questions?
31Performance Experiments - Execution
- LMM List Merge where lists for one of the two
specified attributes are missing, halving space - Seattle Homes Database