Effective Keyword Based Selection of Relational Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Effective Keyword Based Selection of Relational Databases

Description:

What is unstructured retrieval? ... What is structured retrieval? ... Experiments done to evaluate efficiency of this system Cont'd ... – PowerPoint PPT presentation

Number of Views:21

Avg rating:3.0/5.0

Slides: 41

Provided by: ybed

Learn more at: https://cse.buffalo.edu

Category:

more less

Transcript and Presenter's Notes

Title: Effective Keyword Based Selection of Relational Databases

1
Effective Keyword Based Selection of Relational
Databases

Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H
Tung

2
Overview

What is unstructured retrieval?
This is retrieving data from documents like
journals, articles etc.
What is structured retrieval?
Retrieving data from databases, XML files etc.
(that is, structural relationship between data
exists)

3
Traditional IR approach

Use keyword frequency and document frequency
statistics for query words to determine relevance
of a document
Keyword frequency No. of times a keyword
appears in a document
Document frequency No. of documents in which a
keyword appears.
Use the combination of the two as a weighting
factor

4
Traditional IR technique is inadequate for
relational databases

Traditional IR techniques do not capture the
relationship between data sources in a normalized
database
Need to take into account the relationship
between keywords in a database
Example
A keyword is in a tuple referenced by many other
tuples
No. of joins that need to be performed to get all
keywords in a query

5
Example

DB1
Inproceedings
Conferences

id procID Conference
t3 23 The conference on Very Large Databases (VLDB)
t4 18 ACM Sigmod Conf on management of data
id inprocID title procID year mon annote
t1 Adiba1986 Historical Multimedia Databases 23 1988 Aug temporal
t2 Abarbanel1987 ConnectionPerspective Reform 18 1987 May Intellicorp
6
Example

7
Example

Query (Multimedia, Database, VLDB)
DB1 will give us good results,
But traditional IR model will return DB2 as the
better one as term frequencies are higher in DB2
Hence we need to effectively summarize
relationships between keywords in databases

8
Contributions

Address the problem of selection of structured
data sources for keyword based queries
Propose a method for summarizing relationships
between keywords in a database
Define metrics to rank source databases given a
keyword query based on keyword relationships
Evaluation of proposed summarization using real
datasets

9
Measuring Strength of Relationships Between
Keywords

Strength of relationships between two keywords
measured as a combination of two factors
Proximity factor Inverse of distance
Frequency factor, given a distance d Number of
combinations of exactly d1 distinct tuples that
can be joined in a sequence to get the two
keywords in the end tuples

10
Modeling of an RDBMS

Let m No. of distinct keywords in database DB
Let n Total no. of tuples in DB.
Then matrix D t1 t2 . tn
k1
k2
km
D represents presence or absence of a keyword in
a tuple (Similar to term-document incidence
matrix in VSM)

11
Modeling of an RDBMS Contd

Matrix T represents relationship between
tuples(for example, foreign key)
T t1 t2 tn
t1 0 1
t2 1 0
tn

12
Mathematical representation of keyword
relationships
13
Mathematical representation of keyword
relationships Contd

A Keyword Relationship Matrix (KRM) R represents
the relationship between any two pair of keywords
with respect to d and K

14
Mathematical representation of keyword
relationships Contd
15
Example

For two given keywords k1 and k2, and K40
Database A has 5 joining sequences connecting
them at distance 1 Then score 5 (1/2)
2.5
Database B has 40 joining sequences connecting
them at distance 4 Then score 40(1/5) 8
Here B wins.

16
Example (contd)

If we bring down K to 10, then A wins.
Thus one may prefer A to B due to better quality.
K defines the number of top results users expect
from the database.

17
Computation of KRM

How to compute
Few definitions

18
Three proven propositions aiding the computation
of the KRM
19
Three proven propositions aiding the Computation
of KRM Contd
20
Comparison of frequencies of keyword pairs in
DB1 and DB2

Frequencies of keyword pairs in DB1
Frequencies of keyword pairs in DB2
Our query was Q (Multimedia, Database, VLDB )
Observation tells us that query words are more
closely related in DB1

Keyword pair d0 d1 d2 d3 d4
databasemultimedia 1 1 - - -
multimediaVLDB 0 1 - - -
DatabaseVLDB 1 1 - - -
Keyword pair d0 d1 d2 d3 d4
databasemultimedia 0 0 0 0 2
multimediaVLDB 0 0 0 0 0
DatabaseVLDB 0 0 1 0 0
21
Comparison of relationship scores of DB1 and DB2

Sample computation for DB1 (K10)
Rel Database, multimedia 1 1 0.5
1 1.5

Keyword pair DB1 DB2
Databasemultimedia 1.5 0.4
MultimediaVLDB 0.5 0
DatabaseVLDB 1.5 0.33
22
Implementation with SQL

Relation RD(kId, tId) represents the non-zero
entries of the keyword incidence matrix D
kId is the keyword ID and tId is the tuple ID
RK(kId, keyword) stores the keyword IDs and
keywords (similar to a word dictionary in IR)
Matrices T1, T2, T3... (Tuple relationship
matrices) are represented with relations RT1,RT2
,RT3..
RT1 - Produced by joining pairs of tables
RT2 - Produced by self-joining RT1

23
Implementation with SQL Contd

RT3 produced using the following SQLs
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId2 s2.tId1
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId1
FROM RT2 s1, RT1 s2
WHERE s1.tId2 s2.tId2 AND s1.tId1 lt
s2.tId1
INSERT INTO RT3 (tId1, tId2)
SELECT s2.tId1, s1.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 s2.tId2

24
Implementation with SQL Contd

INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId2, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 s2.tId1 AND s1.tId2 lt s2.tId2
DELETE a FROM RT3 a, RT2 b, RT1 c
WHERE (a.tId1 b.tId1 AND a.tId2 b.tId2) OR
(a.tId1 c.tId1 AND a.tId2 c.tId2)
In general, RTd is generated by joining RTd-1
with RT1
and excluding the tuples already in RTd-1,
RTd-2, RT1

25
Creation of W0,W1, W2.(Matrices representing
frequencies)

W0 is represented with a relation RW0(kId1, kId2,
freq)
tuple (kId1, kId2, freq) records the pair of
keywords (kId1,kId2) (kId1 lt kId2), and its
frequency (freq) at 0 distance, where freq is
greater than 0.
RW0 is the result of self-joining RD (kId, tId).
SQL for creating RW0
INSERT INTO RW0 (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2,
count()
FROM RD s1, RD s2
WHERE s1.tId s2.tId AND s1.kId lt s2.kId
GROUP BY kId1, kId2

26
Creation of W0,W1, W2.(Matrices representing
frequencies)

SQL for creating RWd , d gt 0
INSERT INTO RWd (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2,
count()
FROM RD s1, RD s2, RTd r
WHERE ((s1.tId r.tId1 AND s2.tId r.tId2)
OR
(s1.tId r.tId2 AND s2.tId r.tId1)) AND
s1.kId lt s2.kId
GROUP BY kId1, kId2

27
Final resulting KRM

The final resulting KRM, R is stored in a
relation RR(kId1,kId2),consisting of pairs of
keywords and their relationship score.
It is computed using the formula
Update issues -
The tables for storing these matrices can be
updated dynamically.

28
Estimating multi-keyword relationships

Mutiple keywords are connected with Steiner
trees.
It is an NP complete problem to find a minimum
Steiner
tree.
Most current keyword search algorithms rely on
heuristics to find top-K results.
Hence estimation between multiple keywords
estimated
using derived keyword relationships described
above.

29
Estimating multi-keyword relationships Contd
30
Estimating multi-keyword relationships Contd
31
Estimating multi-keyword relationships Contd
32
Database ranking and indexing

With KR summary, we can effectively rank a set of
databases D DB1,DB2,,DBN for a given keyword
query.
We can use either a global index or a local index
Global Index
Analogous to an inverted index in IRUse keyword
pairs as key, and ltdatabase Id, relationship
scoregt as a postings entry
To evaluate a query, fetch the corresponding
inverted lists, and compute the score for each
database.

33
Database ranking and indexing Contd

Decentralized index
Each machine can store a subset of the index
(that is, keyword pairs and inverted lists)
When a query is received at a node, search
messages are sent across nodes and the
corresponding postings lists are retrieved.

34
Experiments done to evaluate efficiency of this
system

K-R score compared with score from brute force
method (real_rank) over 82 databases spread
across 16 nodes.
Effectiveness of this technique has been
successfully established over distributed
databases
Definitions used for comparison -

35
Experiments done to evaluate efficiency of this
system
36
Experiments done to evaluate efficiency of this
system Contd

Effects of (length of joining sequence)
Selection performance of keyword queries
generally gets better when grows larger.
Precision and recall values for different values
tend to cluster into groups
There are big gaps in both precision and recall
values
when and when is greater

37
Experiments done to evaluate efficiency of this
system Contd

Recall and precision of 2-keyword queries using
KR summaries and KF-summaries

38
Experiments done to evaluate efficiency of this
system Contd

Effects of number of query keywords
1) Performance of 2-keyword queries generally
better than
3-keyword and 4-keyword queries
5-keyword queries give better recall than
3 and 4 keyword queries
as they are more selective
2) Generally, the difference in the recall of
queries with
different no. of keywords is less than
that of the precision
This shows that the system is effective
in assigning high ranks to
useful databases, although less relevant
or irrelevant databases
may also be selected.

39
Experiments done to evaluate efficiency of this
system Contd

Comparison of four kinds of estimations
(MIN,MAX,SUM,PROD)
SUM and PROD have similar behavior
and outperform the other two methods
Hence it is more effective to take into account
relationship information of every keyword pair in
the query when estimating overall scores

40
Experiments done to evaluate efficiency of this
system Contd