Title: Effective Keyword Based Selection of Relational Databases
1Effective Keyword Based Selection of Relational
Databases
- Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H
Tung
2Overview
- What is unstructured retrieval?
- This is retrieving data from documents like
journals, articles etc. - What is structured retrieval?
- Retrieving data from databases, XML files etc.
(that is, structural relationship between data
exists)
3Traditional IR approach
- Use keyword frequency and document frequency
statistics for query words to determine relevance
of a document - Keyword frequency No. of times a keyword
appears in a document - Document frequency No. of documents in which a
keyword appears. - Use the combination of the two as a weighting
factor
4Traditional IR technique is inadequate for
relational databases
- Traditional IR techniques do not capture the
relationship between data sources in a normalized
database - Need to take into account the relationship
between keywords in a database - Example
- A keyword is in a tuple referenced by many other
tuples - No. of joins that need to be performed to get all
keywords in a query
5Example
- DB1
- Inproceedings
Conferences -
id procID Conference
t3 23 The conference on Very Large Databases (VLDB)
t4 18 ACM Sigmod Conf on management of data
id inprocID title procID year mon annote
t1 Adiba1986 Historical Multimedia Databases 23 1988 Aug temporal
t2 Abarbanel1987 ConnectionPerspective Reform 18 1987 May Intellicorp
6Example
7Example
- Query (Multimedia, Database, VLDB)
- DB1 will give us good results,
- But traditional IR model will return DB2 as the
better one as term frequencies are higher in DB2 - Hence we need to effectively summarize
relationships between keywords in databases
8Contributions
- Address the problem of selection of structured
data sources for keyword based queries - Propose a method for summarizing relationships
between keywords in a database - Define metrics to rank source databases given a
keyword query based on keyword relationships - Evaluation of proposed summarization using real
datasets
9Measuring Strength of Relationships Between
Keywords
- Strength of relationships between two keywords
measured as a combination of two factors - Proximity factor Inverse of distance
- Frequency factor, given a distance d Number of
combinations of exactly d1 distinct tuples that
can be joined in a sequence to get the two
keywords in the end tuples
10Modeling of an RDBMS
- Let m No. of distinct keywords in database DB
- Let n Total no. of tuples in DB.
- Then matrix D t1 t2 . tn
- k1
- k2
-
-
- km
- D represents presence or absence of a keyword in
a tuple (Similar to term-document incidence
matrix in VSM)
11Modeling of an RDBMS Contd
- Matrix T represents relationship between
tuples(for example, foreign key) - T t1 t2 tn
- t1 0 1
- t2 1 0
-
-
- tn
12Mathematical representation of keyword
relationships
13Mathematical representation of keyword
relationships Contd
- A Keyword Relationship Matrix (KRM) R represents
the relationship between any two pair of keywords
with respect to d and K
14Mathematical representation of keyword
relationships Contd
15Example
- For two given keywords k1 and k2, and K40
- Database A has 5 joining sequences connecting
them at distance 1 Then score 5 (1/2)
2.5 - Database B has 40 joining sequences connecting
them at distance 4 Then score 40(1/5) 8 - Here B wins.
16Example (contd)
- If we bring down K to 10, then A wins.
- Thus one may prefer A to B due to better quality.
- K defines the number of top results users expect
from the database.
17Computation of KRM
- How to compute
- Few definitions
-
18Three proven propositions aiding the computation
of the KRM
19Three proven propositions aiding the Computation
of KRM Contd
20 Comparison of frequencies of keyword pairs in
DB1 and DB2
- Frequencies of keyword pairs in DB1
-
-
- Frequencies of keyword pairs in DB2
- Our query was Q (Multimedia, Database, VLDB )
- Observation tells us that query words are more
closely related in DB1
Keyword pair d0 d1 d2 d3 d4
databasemultimedia 1 1 - - -
multimediaVLDB 0 1 - - -
DatabaseVLDB 1 1 - - -
Keyword pair d0 d1 d2 d3 d4
databasemultimedia 0 0 0 0 2
multimediaVLDB 0 0 0 0 0
DatabaseVLDB 0 0 1 0 0
21Comparison of relationship scores of DB1 and DB2
-
-
-
- Sample computation for DB1 (K10)
- Rel Database, multimedia 1 1 0.5
1 1.5
Keyword pair DB1 DB2
Databasemultimedia 1.5 0.4
MultimediaVLDB 0.5 0
DatabaseVLDB 1.5 0.33
22Implementation with SQL
- Relation RD(kId, tId) represents the non-zero
entries of the keyword incidence matrix D - kId is the keyword ID and tId is the tuple ID
- RK(kId, keyword) stores the keyword IDs and
keywords (similar to a word dictionary in IR) - Matrices T1, T2, T3... (Tuple relationship
matrices) are represented with relations RT1,RT2
,RT3.. - RT1 - Produced by joining pairs of tables
- RT2 - Produced by self-joining RT1
23Implementation with SQL Contd
- RT3 produced using the following SQLs
- INSERT INTO RT3 (tId1, tId2)
- SELECT s1.tId1, s2.tId2
- FROM RT2 s1, RT1 s2
- WHERE s1.tId2 s2.tId1
- INSERT INTO RT3 (tId1, tId2)
- SELECT s1.tId1, s2.tId1
- FROM RT2 s1, RT1 s2
- WHERE s1.tId2 s2.tId2 AND s1.tId1 lt
s2.tId1 - INSERT INTO RT3 (tId1, tId2)
- SELECT s2.tId1, s1.tId2
- FROM RT2 s1, RT1 s2
- WHERE s1.tId1 s2.tId2
24Implementation with SQL Contd
- INSERT INTO RT3 (tId1, tId2)
- SELECT s1.tId2, s2.tId2
- FROM RT2 s1, RT1 s2
- WHERE s1.tId1 s2.tId1 AND s1.tId2 lt s2.tId2
- DELETE a FROM RT3 a, RT2 b, RT1 c
- WHERE (a.tId1 b.tId1 AND a.tId2 b.tId2) OR
- (a.tId1 c.tId1 AND a.tId2 c.tId2)
- In general, RTd is generated by joining RTd-1
with RT1 - and excluding the tuples already in RTd-1,
RTd-2, RT1
25Creation of W0,W1, W2.(Matrices representing
frequencies)
- W0 is represented with a relation RW0(kId1, kId2,
freq) - tuple (kId1, kId2, freq) records the pair of
keywords (kId1,kId2) (kId1 lt kId2), and its
frequency (freq) at 0 distance, where freq is
greater than 0. - RW0 is the result of self-joining RD (kId, tId).
- SQL for creating RW0
- INSERT INTO RW0 (kId1, kId2, freq)
- SELECT s1.kId AS kId1, s2.kId AS kId2,
count() - FROM RD s1, RD s2
- WHERE s1.tId s2.tId AND s1.kId lt s2.kId
- GROUP BY kId1, kId2
-
26Creation of W0,W1, W2.(Matrices representing
frequencies)
- SQL for creating RWd , d gt 0
- INSERT INTO RWd (kId1, kId2, freq)
- SELECT s1.kId AS kId1, s2.kId AS kId2,
count() - FROM RD s1, RD s2, RTd r
- WHERE ((s1.tId r.tId1 AND s2.tId r.tId2)
OR - (s1.tId r.tId2 AND s2.tId r.tId1)) AND
s1.kId lt s2.kId - GROUP BY kId1, kId2
-
27Final resulting KRM
- The final resulting KRM, R is stored in a
relation RR(kId1,kId2),consisting of pairs of
keywords and their relationship score. - It is computed using the formula
-
- Update issues -
- The tables for storing these matrices can be
updated dynamically.
28Estimating multi-keyword relationships
- Mutiple keywords are connected with Steiner
trees. - It is an NP complete problem to find a minimum
Steiner - tree.
- Most current keyword search algorithms rely on
heuristics to find top-K results. - Hence estimation between multiple keywords
estimated - using derived keyword relationships described
above.
29Estimating multi-keyword relationships Contd
30Estimating multi-keyword relationships Contd
31Estimating multi-keyword relationships Contd
32Database ranking and indexing
- With KR summary, we can effectively rank a set of
databases D DB1,DB2,,DBN for a given keyword
query. - We can use either a global index or a local index
- Global Index
- Analogous to an inverted index in IRUse keyword
pairs as key, and ltdatabase Id, relationship
scoregt as a postings entry - To evaluate a query, fetch the corresponding
inverted lists, and compute the score for each
database.
33Database ranking and indexing Contd
- Decentralized index
- Each machine can store a subset of the index
(that is, keyword pairs and inverted lists) - When a query is received at a node, search
messages are sent across nodes and the
corresponding postings lists are retrieved.
34Experiments done to evaluate efficiency of this
system
- K-R score compared with score from brute force
method (real_rank) over 82 databases spread
across 16 nodes. - Effectiveness of this technique has been
successfully established over distributed
databases - Definitions used for comparison -
-
35Experiments done to evaluate efficiency of this
system
36Experiments done to evaluate efficiency of this
system Contd
- Effects of (length of joining sequence)
-
- Selection performance of keyword queries
generally gets better when grows larger. - Precision and recall values for different values
tend to cluster into groups - There are big gaps in both precision and recall
values - when and when is greater
37Experiments done to evaluate efficiency of this
system Contd
-
- Recall and precision of 2-keyword queries using
KR summaries and KF-summaries
38Experiments done to evaluate efficiency of this
system Contd
- Effects of number of query keywords
- 1) Performance of 2-keyword queries generally
better than - 3-keyword and 4-keyword queries
- 5-keyword queries give better recall than
3 and 4 keyword queries - as they are more selective
-
- 2) Generally, the difference in the recall of
queries with - different no. of keywords is less than
that of the precision - This shows that the system is effective
in assigning high ranks to - useful databases, although less relevant
or irrelevant databases - may also be selected.
39Experiments done to evaluate efficiency of this
system Contd
- Comparison of four kinds of estimations
- (MIN,MAX,SUM,PROD)
- SUM and PROD have similar behavior
- and outperform the other two methods
- Hence it is more effective to take into account
relationship information of every keyword pair in
the query when estimating overall scores
40Experiments done to evaluate efficiency of this
system Contd
- Recall and precision of K-R summaries using
different estimations ( )