Title: Ranking of Database Query Results
1Ranking of Database Query Results
Nitesh Maan, Arujn Saraswat, Nishant Kapoor
2Introduction
- As the name suggests Ranking is the process of
ordering a set of values (or data items) based on
some parameter that is of high relevance to the
user of ranking process. - Ranking and returning the most relevant results
of users query is a popular paradigm in
information retrieval.
3Ranking and databases
- Not much work has been done in ranking of results
of query in database systems - We have all seen example of ranking of results in
the internet. The most common example is the
internet search engines (like Google). A set of
WebPages (satisfying the users search criteria)
are returned, with most relevant results
featuring at the top of the list.
4- In contrast to the WWW, databases support only a
Boolean query model. For example a selection
query on a SQL database schema returns all tuples
that satisfy the conditions specified in the
query. Depending on the conditions specified in
the query, two situations may arise
5- Empty Answers when the query is too selective,
the answer may be empty. - Many Answers when the query is not too
selective, too many tuples may be there in the
answer. - We next consider these two scenarios in detail
and look at various mechanism to produce ranked
results in these circumstances.
6The Empty Answers Problem
- Empty answers problem is the consequence of a
very selective query in database system. - In this case it would be desirable to return a
ranked list of approximately matching tuples
without burdening the user to specify any
additional conditions. In other words, an
automated approach for ranking and returning
approximately matching tuples.
7Automated Ranking Functions
- Automated ranking of query results is the process
of taking a user query and mapping it to a Top-K
query with a ranking function that depends on
conditions specified in the user query. - A ranking function should be able to work well
even for large databases and have minimum side
effects on query processing
8Automated Ranking functions for the Empty
Answers Problem
- IDF Similarity
- QF Similarity
- QFIDF Similarity
9IDF Similarity
- IDF (inverse document frequency) is an adaptation
of popular IR technique based on the philosophy
that frequently occurring words convey less
information about users needs than rarely
occurring words, and thus should be weighted less.
10IDF Similarity, formal definition
- For every value t in the domain of attribute
A, IDF(t) can be defined as log(n/F(t)), - where n number of tuples in the database
- F(t) frequency of tuples in database where
A t - The similarity between a tuple T and a query
Q is defined as - i.e., similarity between a tuple T and a query
Q is simply the sum of corresponding similarity
coefficients over all attributes in T
11QF Similarity leveraging workloads
- There may be instances where relevance of a
attribute value may be due to factors other than
the frequency of its occurrence. - QF similarity is based on this very philosophy.
According to QF Similarity, the importance of
attribute values is directly related to the
frequency of their occurrence in query strings in
workload.
12QFIDF Similarity
- QF is purely workload based, i.e., it does not
use data at all. This may be a disadvantage in
situations wherein we have insufficient or
unreliable workloads. - QFIDF Similarity is a remedy in such situations.
It combines QF and IDF weights. This way even if
a value is never referenced in the workload, it
gets a small non-zero QF.
13Breaking ties.
- In case of many answers problem, the recently
discussed ranking functions might fail to
perform. - This is because many tuples may tie for the same
similarity score. Such a scenario could arise for
empty answer problem also. - To break this tie, requires looking beyond the
attributes specified in the query, i.e., missing
attributes.
14Many Answers Problem
- We know by now, that many answers problem in
database systems is the consequence of not too
selective queries. - Such a query on a database system produces a
large number of tuples that satisfy the condition
specified in the query. - Let us see how ranking of results in such a
scenario is accomplished.
15Basic Approach
- Any ranking function for many answers problem has
to look beyond the attributes specified in the
query, since all or a large number of tuples
satisfy the specified conditions. - To determine precisely the unspecified attributes
is a challenging task. We show adaptation of
Probabilistic Information Retrieval (PIR) ranking
methods.
16Ranking function for Many Answers Problem
- Ranking function for many answers problem is
developed by adaptation of PIR models that best
model data dependencies and correlations. - The ranking function of a tuple depends on two
factors (a) a global score, and (b) a
conditional score. - These scores can be computed through workload as
well as data analysis.
17Ranking Function Adaptation of PIR Models for
Structured Data
- The basic philosophy of PIR models is that given
a document collection, D, the set of relevant
documents, R, and the set of irrelevant
documents, ( D R), any document t in
D can be ranked by finding out score(t). The
score(t) is the probability of t belonging to
the relevant set, R
18Problem in adapting this approach
- The problem in computing the score(t) using PIR
model for the databases is that the relevant set,
R is unknown at query time. - This approach is well suited to IR domain as R
is usually determined through user feedback. - User feedback based estimation of R might be
attempted in databases also but we propose an
automated approach.
19The ranking formula
- This is the final ranking formula that we will
use in computing scores of tuples in order to
rank them. - The ranking formula is composed of two large
factors - Global part of the score measures the global
importance of unspecified attributes - Conditional part of the score measures the
dependencies between the specified and
unspecified attributes
20Architecture of Ranking System
Detailed Architecture of the Ranking System
21Implementation
- Pre-Processing the pre-processing component is
composed of Atomic Probabilities Module and
Index Module. - Atomic Probabilities Module is responsible
for computation of several atomic probabilities
necessary for the computation of Ranking
Function, Score(t). - Index Module is responsible for
pre-computation of ranked lists necessary for
improving the efficiency of query processing
module.
22Implementation
- Intermediate Layer the atomic probabilities,
lists computed by the Index Module are all stored
as database tables in the intermediate layer. All
the tables in the intermediate layer are indexed
on appropriate attributes for fast access during
the later stages. - Primary purpose of the intermediate layer is to
avoid computing the score from scratch each time
a query is received, by storing pre-computed
results of all atomic computations
23The Index Module
- Index module pre-computes the ranked lists of
tuples for each possible atomic query. - Purpose is to take the run-time load off the
query processing component. - To assist the query processing component in
returning the Top-K tuples, ranked lists of the
tuples for all possible atomic queries are
pre-computed - Taking as input, the association rules and the
database, Conditional List and Global List are
created for each distinct value x in the
database
24Query Processing Component
- List merge algorithm is the key player in the
query processing component. - Its function is to take the user query, compute
scores for all the tuples that satisfy the
condition specified in the query, rank the tuples
in a sorted order of the scores and then return
the Top-K tuples.
25Space Requirements
- To build the conditional and the global lists,
space consumed is O(mn) bytes (where m is the
number of attributes and n is the number of
tuples of the database table) - There may be applications where space is an
expensive resource. - In such cases, only a subset of the lists may be
stored at pre-processing times, but this will at
the expense of an increase in query processing
time.
26Whats Next..
- The ranking function so presented works on single
table databases and does not allow presence of
NULL values. - A very interesting but nevertheless challenging
extension to this work would be to develop
ranking functions that work on multi-table
databases and allow NULLs as well as non-text
data in database columns.