Title: QPIAD: Query Processing over Incomplete Autonomous Databases
1 QPIAD: Query Processing over Incomplete Autonomous Databases
- Hemal Khatri (Arizona State University)
- Jianchun Fan (Arizona State University)
- Garrett Wolf (Arizona State University)
- Yi Chen (Arizona State University)
- Subbarao Kambhampati (Arizona State University)
2 Incompleteness in Web Databases
Automated Extraction
3 Problem
How to retrieve ranked, relevant uncertain answers for user queries?
- Challenges
- How to retrieve relevant uncertain answers through the form-based interfaces of autonomous databases?
- How to keep query processing cost manageable?
- How to rank the retrieved uncertain answers?
- Possible Approaches
- Q: Body style = Convt
- 1. CERTAIN ANSWERS ONLY: Return only certain answers, as in traditional databases (low recall).
- 2. ALL RETURNED: Return certain answers plus answers whose body style value is missing (low precision, infeasible).
- 3. ALL RANKED: Rank all answers by predicting the values of the missing attribute (costly, infeasible).
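To make the trade-offs between the three approaches concrete, here is a minimal Python sketch over a toy in-memory relation. The rows, the function names, and the `predict` callback are all hypothetical and only illustrate the idea; a real autonomous database would be reached through its form-based interface, not a Python list.

```python
# Toy relation; None marks a missing (null) body style. All rows are hypothetical.
CARS = [
    {"make": "Audi", "model": "A4", "body_style": "Convt"},
    {"make": "BMW", "model": "Z4", "body_style": None},
    {"make": "Honda", "model": "Civic", "body_style": None},
    {"make": "Honda", "model": "Civic", "body_style": "Sedan"},
]


def certain_answers_only(rows, attr, value):
    """Approach 1: keep only tuples whose attribute is present and matches."""
    return [r for r in rows if r[attr] == value]


def all_returned(rows, attr, value):
    """Approach 2: certain answers followed by every tuple with a null attribute."""
    certain = certain_answers_only(rows, attr, value)
    unknown = [r for r in rows if r[attr] is None]
    return certain + unknown


def all_ranked(rows, attr, value, predict):
    """Approach 3: like approach 2, but null tuples are ordered by predicted relevance.

    `predict(row, attr, value)` is an assumed callback returning an estimate
    of P(attr = value | row).
    """
    certain = certain_answers_only(rows, attr, value)
    unknown = [r for r in rows if r[attr] is None]
    unknown.sort(key=lambda r: predict(r, attr, value), reverse=True)
    return certain + unknown
```

Approaches 2 and 3 both have to touch every tuple with a null body style, which is the cost the slide flags as infeasible.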
4 QPIAD System Architecture
5 Retrieving Relevant Answers via Query Rewriting
Given a query Q: (Body style = Convt), retrieve all relevant tuples
- Q → Base Result Set
- AFD: Model → Body style
- Use F-measure to select top K Rewritten Queries (Q1: Model = A4; Q2: Model = Z4; Q3: Model = Boxster)
- Re-order top K queries based on Estimated Precision
- Ranked Relevant Uncertain Answers
F-Measure: $F = \frac{(1+\alpha)\,P\,R}{\alpha P + R}$, where P = Estimated Precision and R = Estimated Recall (based on P and Estimated Selectivity)
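The select-then-reorder step can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's code: the `RewrittenQuery` class and `select_and_order` function are hypothetical names, and estimated recall is assumed here to be each query's share of the expected relevant tuples (precision times selectivity) across all candidates, since the slide only says that recall is based on P and estimated selectivity.

```python
from dataclasses import dataclass


@dataclass
class RewrittenQuery:
    label: str          # e.g. "Model = A4"
    precision: float    # estimated precision P
    selectivity: float  # estimated number of tuples the query returns


def f_measure(p, r, alpha):
    """F = (1 + alpha) * P * R / (alpha * P + R); 0.0 when the denominator is 0."""
    denom = alpha * p + r
    return (1 + alpha) * p * r / denom if denom > 0 else 0.0


def select_and_order(queries, k, alpha):
    """Pick the top-k rewritten queries by F-measure, then order them by precision."""
    # Assumption: recall is the query's share of all expected relevant tuples.
    total = sum(q.precision * q.selectivity for q in queries)

    def recall(q):
        return q.precision * q.selectivity / total if total > 0 else 0.0

    top_k = sorted(
        queries,
        key=lambda q: f_measure(q.precision, recall(q), alpha),
        reverse=True,
    )[:k]
    # Issue the most precise queries first so the best uncertain answers come first.
    return sorted(top_k, key=lambda q: q.precision, reverse=True)
```

With `alpha = 0` the F-measure reduces to P, so selection is driven by precision alone; larger values of `alpha` give recall more weight.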
6 Learning Statistics to Support Ranking and Rewriting
- Learning attribute correlations by Approximate Functional Dependency (AFD) and Approximate Key (AKey)
Sample Database → TANE → AFDs (X → Y) with confidence → Prune based on AKey → Determining Set(Y) = dtrSet(Y)
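As a rough illustration of what an AFD or AKey confidence measures, the sketch below computes both by brute force on a small sample. It is not TANE, which searches the lattice of attribute sets far more efficiently. Confidence is taken here as one minus the g3 error, that is, the largest fraction of rows that can be kept so that the dependency holds exactly; the function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Confidence of the AFD lhs -> rhs, computed as 1 - g3 error."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    # Within each lhs group, keep only the rows carrying the most common rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Confidence that attrs is a key: fraction of rows left after de-duplicating on attrs."""
    if not rows:
        return 0.0
    return len({tuple(row[a] for a in attrs) for row in rows}) / len(rows)


# Hypothetical sample rows.
SAMPLE = [
    {"vin": 1, "model": "A4", "body_style": "Convt"},
    {"vin": 2, "model": "A4", "body_style": "Convt"},
    {"vin": 3, "model": "A4", "body_style": "Sedan"},
    {"vin": 4, "model": "Z4", "body_style": "Convt"},
    {"vin": 5, "model": "Civic", "body_style": "Sedan"},
]
```

On this sample, `vin` determines `body_style` with confidence 1.0 only because `vin` is a key, so the dependency tells us nothing about tuples outside the sample. This is one way to see why AFDs whose determining set is an approximate key are pruned.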
- Learning value distributions using Naïve Bayes Classifiers (NBC)
Determining Set(Am) → Feature Selection → Learn NBC classifiers with m-estimates
Estimated Precision = $P(A_m = v_m \mid \mathrm{dtrSet}(A_m))$
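A minimal sketch of a Naïve Bayes classifier smoothed with m-estimates is given below. It assumes the feature list has already been chosen (for example, the determining set of the target attribute) and uses a uniform prior over each feature's observed values. The class name, the default `m`, and the uniform prior are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter, defaultdict


class NaiveBayesMEstimate:
    """Categorical Naive Bayes with m-estimate smoothing:
    P(x | c) = (n_xc + m * p) / (n_c + m), with p a uniform prior (assumed)."""

    def __init__(self, features, target, m=1.0):
        self.features, self.target, self.m = features, target, m

    def fit(self, rows):
        # Train only on rows where the target attribute is present.
        rows = [r for r in rows if r[self.target] is not None]
        self.total = len(rows)
        self.class_counts = Counter(r[self.target] for r in rows)
        self.cond = defaultdict(Counter)   # (feature, class) -> value counts
        self.domain = defaultdict(set)     # feature -> observed values
        for r in rows:
            for f in self.features:
                self.cond[(f, r[self.target])][r[f]] += 1
                self.domain[f].add(r[f])
        return self

    def _likelihood(self, f, value, cls):
        prior = 1.0 / max(len(self.domain[f]), 1)
        return (self.cond[(f, cls)][value] + self.m * prior) / (self.class_counts[cls] + self.m)

    def predict_proba(self, row):
        """Return {class value: estimated P(target = value | row's features)}."""
        scores = {}
        for cls, n in self.class_counts.items():
            score = n / self.total
            for f in self.features:
                score *= self._likelihood(f, row[f], cls)
            scores[cls] = score
        z = sum(scores.values())
        if z == 0:
            return {cls: 1.0 / len(scores) for cls in scores}
        return {cls: s / z for cls, s in scores.items()}
```

For a tuple with a missing body style, `predict_proba(row)["Convt"]` would play the role of the estimated precision of returning that tuple for the query Body style = Convt.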
- Learning Selectivity Estimates of Rewritten Queries (QSel) based on
- Selectivity of the rewritten query issued on the sample
- Ratio of the original database size over the sample size
- Percentage of incomplete tuples encountered while creating the sample
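One plausible way to combine the three quantities above is to multiply them, as in the sketch below. The slide only lists the inputs, so the multiplicative form, the function name, and the parameter names are assumptions made for illustration.

```python
def estimate_selectivity(sample_selectivity, db_size, sample_size, incomplete_fraction):
    """Estimate how many incomplete tuples a rewritten query returns from the full database.

    sample_selectivity:  number of sample tuples matched by the rewritten query
    db_size/sample_size: scale-up ratio from the sample to the original database
    incomplete_fraction: share of incomplete tuples observed while sampling (0..1)

    Assumption: the three factors are simply multiplied together.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_selectivity * (db_size / sample_size) * incomplete_fraction
```

For example, a rewritten query matching 12 tuples in a 5,000-tuple sample of a 100,000-tuple database, with 10% of tuples incomplete, would be estimated to return about 24 incomplete tuples.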
7 Empirical Evaluation
Two experimental databases: Cars (Cars.com) and Census (UCI ML)
- Experimental Setup
- Oracular study used to measure precision/recall by artificially introducing missing values in the databases.
- AFDs and NBC classifiers learned for various sample sizes ranging from 3 to 15.
- Purpose of Experiments
- Measuring the quality of uncertain results returned by QPIAD (Figure 1).
- Efficiency of QPIAD in retrieving relevant results (Figure 2).
- Robustness of the learning algorithms used in QPIAD with respect to various sample sizes (Figure 3).
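The oracular setup can be sketched as follows: hide an attribute in a random fraction of complete tuples, remember the true values, and score the returned uncertain answers against them. The helper names, the fixed seed, and the index-based bookkeeping are assumptions made for this illustration.

```python
import random


def mask_attribute(rows, attr, fraction, seed=0):
    """Return (masked_rows, truth): copies of the rows with `attr` set to None in a
    random fraction of them, plus the hidden true values keyed by row index."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(rows)), int(len(rows) * fraction)))
    masked = [dict(r, **{attr: None}) if i in hidden else dict(r)
              for i, r in enumerate(rows)]
    truth = {i: rows[i][attr] for i in hidden}
    return masked, truth


def precision_recall(returned, truth, value):
    """Score the row indices returned as uncertain answers for `attr = value`."""
    relevant = {i for i, v in truth.items() if v == value}
    hits = len(relevant & set(returned))
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Because the hidden values are known, every returned uncertain answer can be judged right or wrong without human relevance judgments.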
Figure 1
Figure 3
Figure 2
8 QPIAD Web Interface