Effective Keyword Based Selection of Relational Databases

Slides: 41
Provided by: ybed
Learn more at: https://cse.buffalo.edu

1
Effective Keyword Based Selection of Relational
Databases
  • Bei Yu, Guoliang Li, Karen Sollins, Anthony K. H. Tung

2
Overview
  • What is unstructured retrieval?
  • Retrieving data from free-text documents such as
    journals and articles
  • What is structured retrieval?
  • Retrieving data from databases, XML files etc.
    (that is, structural relationship between data
    exists)

3
Traditional IR approach
  • Use keyword frequency and document frequency
    statistics for query words to determine relevance
    of a document
  • Keyword frequency: the number of times a keyword
    appears in a document
  • Document frequency: the number of documents in
    which a keyword appears
  • Use the combination of the two as a weighting
    factor

4
Traditional IR technique is inadequate for
relational databases
  • Traditional IR techniques do not capture the
    relationship between data sources in a normalized
    database
  • Need to take into account the relationship
    between keywords in a database
  • Example
  • A keyword is in a tuple referenced by many other
    tuples
  • No. of joins that need to be performed to get all
    keywords in a query

5
Example
  • DB1
  • Tables: Inproceedings and Conferences

Inproceedings
id  inprocID       title                            procID  year  mon  annote
t1  Adiba1986      Historical Multimedia Databases  23      1988  Aug  temporal
t2  Abarbanel1987  ConnectionPerspective Reform     18      1987  May  Intellicorp

Conferences
id  procID  Conference
t3  23      The conference on Very Large Databases (VLDB)
t4  18      ACM Sigmod Conf on management of data

6
Example
  • DB2

7
Example
  • Query: Q = (Multimedia, Database, VLDB)
  • DB1 gives good results, since the query keywords
    are closely related in it
  • But a traditional IR model returns DB2 as the
    better source, because term frequencies are
    higher in DB2
  • Hence we need to effectively summarize the
    relationships between keywords in databases

8
Contributions
  • Address the problem of selection of structured
    data sources for keyword based queries
  • Propose a method for summarizing relationships
    between keywords in a database
  • Define metrics to rank source databases given a
    keyword query based on keyword relationships
  • Evaluation of proposed summarization using real
    datasets

9
Measuring Strength of Relationships Between
Keywords
  • The strength of the relationship between two
    keywords is measured as a combination of two
    factors
  • Proximity factor: the inverse of the distance
  • Frequency factor, given a distance d: the number
    of combinations of exactly d + 1 distinct tuples
    that can be joined in a sequence to get the two
    keywords in the end tuples

10
Modeling of an RDBMS
  • Let m = number of distinct keywords in database DB
  • Let n = total number of tuples in DB
  • Then D is an m × n matrix with one row per
    keyword (k1, k2, ..., km) and one column per
    tuple (t1, t2, ..., tn)
  • D represents the presence or absence of a keyword
    in a tuple (similar to the term-document
    incidence matrix in the vector space model, VSM)
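
A minimal sketch of how the incidence matrix D could be built. The tuple contents, the whitespace tokenizer, and the function name build_incidence_matrix are illustrative assumptions, not part of the original slides.

```python
# Hypothetical toy tuples: tuple ID -> text content.
tuples = {
    "t1": "Historical Multimedia Databases",
    "t2": "Connection Perspective Reform",
    "t3": "Very Large Databases VLDB",
}

def build_incidence_matrix(tuples):
    """Return (keywords, tuple_ids, D); D[i][j] is 1 if keyword i occurs in tuple j."""
    tuple_ids = sorted(tuples)
    # Assumption: keywords are lowercase whitespace-separated tokens.
    tokens = {tid: set(tuples[tid].lower().split()) for tid in tuple_ids}
    keywords = sorted(set().union(*tokens.values()))
    D = [[1 if kw in tokens[tid] else 0 for tid in tuple_ids] for kw in keywords]
    return keywords, tuple_ids, D

keywords, tuple_ids, D = build_incidence_matrix(tuples)
```

Each row of D is one keyword and each column one tuple, so D has m rows and n columns as defined on this slide.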

11
Modeling of an RDBMS Contd
  • Matrix T represents relationships between tuples
    (for example, foreign key references)
  • T has one row and one column per tuple

          t1  t2  ...  tn
      t1   0   1  ...
      t2   1   0  ...
      ...
      tn
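
A small sketch of building T from a list of joinable tuple pairs. It assumes T is symmetric (a join can be followed in either direction), which matches the entries shown on this slide; the function name build_tuple_matrix is hypothetical, and the pairs follow the procID values of the DB1 example.

```python
def build_tuple_matrix(tuple_ids, joins):
    """Return an n x n 0/1 matrix; T[i][j] is 1 if tuples i and j join directly."""
    index = {tid: i for i, tid in enumerate(tuple_ids)}
    n = len(tuple_ids)
    T = [[0] * n for _ in range(n)]
    for a, b in joins:
        i, j = index[a], index[b]
        # Assumption: the relationship is undirected, so T is symmetric.
        T[i][j] = 1
        T[j][i] = 1
    return T

tuple_ids = ["t1", "t2", "t3", "t4"]
# In DB1, t1 references procID 23 (tuple t3) and t2 references procID 18 (tuple t4).
joins = [("t1", "t3"), ("t2", "t4")]
T = build_tuple_matrix(tuple_ids, joins)
```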

12
Mathematical representation of keyword
relationships
13
Mathematical representation of keyword
relationships Contd
  • A Keyword Relationship Matrix (KRM) R represents
    the relationship between every pair of keywords
    with respect to d and K

14
Mathematical representation of keyword
relationships Contd
15
Example
  • For two given keywords k1 and k2, and K = 40
  • Database A has 5 joining sequences connecting
    them at distance 1. Then score = 5 × (1/2) = 2.5
  • Database B has 40 joining sequences connecting
    them at distance 4. Then score = 40 × (1/5) = 8
  • Here B wins.

16
Example (contd)
  • If we bring K down to 10, only 10 of B's joining
    sequences count, so B scores 10 × (1/5) = 2 and
    A, with 2.5, wins.
  • Thus one may prefer A to B because its results
    are of better quality.
  • K defines the number of top results users expect
    from the database.
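
The scoring in this example can be sketched as follows. This is an interpretation inferred from the worked numbers, not the authors' exact formula: each joining sequence at distance d contributes 1/(d + 1), sequences are consumed closest first, and at most K sequences are counted. The function name relationship_score is hypothetical.

```python
def relationship_score(freq_by_distance, K):
    """Sum 1/(d + 1) per joining sequence, closest first, counting at most K."""
    score = 0.0
    remaining = K
    for d in sorted(freq_by_distance):
        if remaining <= 0:
            break
        used = min(freq_by_distance[d], remaining)
        score += used / (d + 1)
        remaining -= used
    return score

# Database A: 5 joining sequences at distance 1.
# Database B: 40 joining sequences at distance 4.
score_a_k40 = relationship_score({1: 5}, K=40)
score_b_k40 = relationship_score({4: 40}, K=40)
score_a_k10 = relationship_score({1: 5}, K=10)
score_b_k10 = relationship_score({4: 40}, K=10)
```

With K = 40 database B scores higher, and with K = 10 database A does, which reproduces the outcome stated on these two slides.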

17
Computation of KRM
  • How to compute
  • Few definitions

18
Three proven propositions aiding the computation
of the KRM
19
Three proven propositions aiding the Computation
of KRM Contd
20
Comparison of frequencies of keyword pairs in
DB1 and DB2
  • Our query was Q = (Multimedia, Database, VLDB)
  • Frequencies of keyword pairs in DB1

Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   1    1    -    -    -
multimedia:VLDB       0    1    -    -    -
database:VLDB         1    1    -    -    -

  • Frequencies of keyword pairs in DB2

Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   0    0    0    0    2
multimedia:VLDB       0    0    0    0    0
database:VLDB         0    0    1    0    0

  • The frequencies show that the query keywords are
    more closely related in DB1

21
Comparison of relationship scores of DB1 and DB2
  • Sample computation for DB1 (K = 10)
  • rel(database, multimedia) = 1 × 1 + 1 × 0.5 = 1.5

Keyword pair          DB1   DB2
database:multimedia   1.5   0.4
multimedia:VLDB       0.5   0
database:VLDB         1.5   0.33

22
Implementation with SQL
  • Relation RD(kId, tId) represents the non-zero
    entries of the keyword incidence matrix D
  • kId is the keyword ID and tId is the tuple ID
  • RK(kId, keyword) stores the keyword IDs and
    keywords (similar to a word dictionary in IR)
  • Matrices T1, T2, T3, ... (tuple relationship
    matrices) are represented with relations RT1,
    RT2, RT3, ...
  • RT1 - Produced by joining pairs of tables
  • RT2 - Produced by self-joining RT1

23
Implementation with SQL Contd
  • RT3 is produced using the following SQL
    statements:
  • INSERT INTO RT3 (tId1, tId2)
    SELECT s1.tId1, s2.tId2
    FROM RT2 s1, RT1 s2
    WHERE s1.tId2 = s2.tId1
  • INSERT INTO RT3 (tId1, tId2)
    SELECT s1.tId1, s2.tId1
    FROM RT2 s1, RT1 s2
    WHERE s1.tId2 = s2.tId2 AND s1.tId1 < s2.tId1
  • INSERT INTO RT3 (tId1, tId2)
    SELECT s2.tId1, s1.tId2
    FROM RT2 s1, RT1 s2
    WHERE s1.tId1 = s2.tId2

24
Implementation with SQL Contd
  • INSERT INTO RT3 (tId1, tId2)
    SELECT s1.tId2, s2.tId2
    FROM RT2 s1, RT1 s2
    WHERE s1.tId1 = s2.tId1 AND s1.tId2 < s2.tId2
  • DELETE a FROM RT3 a, RT2 b, RT1 c
    WHERE (a.tId1 = b.tId1 AND a.tId2 = b.tId2) OR
          (a.tId1 = c.tId1 AND a.tId2 = c.tId2)
  • In general, RTd is generated by joining RTd-1
    with RT1 and excluding the tuples already in
    RTd-1, RTd-2, ..., RT1
25
Creation of W0, W1, W2, ... (matrices representing
frequencies)
  • W0 is represented with a relation RW0(kId1, kId2,
    freq)
  • A tuple (kId1, kId2, freq) records a pair of
    keywords (kId1, kId2), with kId1 < kId2, and its
    frequency (freq) at distance 0, where freq is
    greater than 0.
  • RW0 is the result of self-joining RD(kId, tId).
  • SQL for creating RW0:
  • INSERT INTO RW0 (kId1, kId2, freq)
    SELECT s1.kId AS kId1, s2.kId AS kId2, COUNT(*)
    FROM RD s1, RD s2
    WHERE s1.tId = s2.tId AND s1.kId < s2.kId
    GROUP BY kId1, kId2

26
Creation of W0, W1, W2, ... (matrices representing
frequencies)
  • SQL for creating RWd, d > 0:
  • INSERT INTO RWd (kId1, kId2, freq)
    SELECT s1.kId AS kId1, s2.kId AS kId2, COUNT(*)
    FROM RD s1, RD s2, RTd r
    WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
           (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND
          s1.kId < s2.kId
    GROUP BY kId1, kId2
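
To try the frequency queries end to end, the sketch below runs them against an in-memory SQLite database from Python, for d = 0 and d = 1. The table contents are hypothetical toy data, the comparison operators are written out explicitly, and grouping is done on the source columns rather than the aliases to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RD  (kId INTEGER, tId INTEGER);
CREATE TABLE RT1 (tId1 INTEGER, tId2 INTEGER);
CREATE TABLE RW0 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
CREATE TABLE RW1 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
""")
# Toy data: keywords 1 and 2 occur in tuple 10, keyword 3 occurs in tuple 20.
conn.executemany("INSERT INTO RD VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
# Tuples 10 and 20 can be joined directly (distance 1).
conn.execute("INSERT INTO RT1 VALUES (10, 20)")

# Distance 0: keyword pairs that co-occur in the same tuple.
conn.execute("""
INSERT INTO RW0 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, COUNT(*)
FROM RD s1, RD s2
WHERE s1.tId = s2.tId AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")

# Distance 1: keyword pairs whose tuples are linked by a row of RT1.
conn.execute("""
INSERT INTO RW1 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, COUNT(*)
FROM RD s1, RD s2, RT1 r
WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
       (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")

rw0 = conn.execute("SELECT kId1, kId2, freq FROM RW0 ORDER BY kId1, kId2").fetchall()
rw1 = conn.execute("SELECT kId1, kId2, freq FROM RW1 ORDER BY kId1, kId2").fetchall()
```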

27
Final resulting KRM
  • The final resulting KRM, R, is stored in a
    relation RR(kId1, kId2), consisting of pairs of
    keywords and their relationship score.
  • It is computed using the formula
  • Update issues:
  • The tables for storing these matrices can be
    updated dynamically.

28
Estimating multi-keyword relationships
  • Multiple keywords are connected with Steiner
    trees.
  • Finding a minimum Steiner tree is an NP-complete
    problem.
  • Most current keyword search algorithms rely on
    heuristics to find top-K results.
  • Hence the relationship among multiple keywords is
    estimated using the pairwise keyword
    relationships derived above.

29
Estimating multi-keyword relationships Contd
30
Estimating multi-keyword relationships Contd
31
Estimating multi-keyword relationships Contd
32
Database ranking and indexing
  • With the KR summary, we can effectively rank a
    set of databases D = {DB1, DB2, ..., DBN} for a
    given keyword query.
  • We can use either a global index or a local index
  • Global index
  • Analogous to an inverted index in IR. Use keyword
    pairs as keys, and <database Id, relationship
    score> as a postings entry
  • To evaluate a query, fetch the corresponding
    inverted lists, and compute the score for each
    database.
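
A minimal sketch of such a global index, using the pairwise scores from the DB1/DB2 comparison table. Summing the pair scores per database is only one possible way to combine them and is an assumption here, as are the function names build_global_index and rank_databases.

```python
from collections import defaultdict
from itertools import combinations

def build_global_index(summaries):
    """Map each keyword pair to a postings list of (database Id, relationship score)."""
    index = defaultdict(list)
    for db_id, pair_scores in summaries.items():
        for pair, score in pair_scores.items():
            index[tuple(sorted(pair))].append((db_id, score))
    return index

def rank_databases(index, query):
    """Fetch the postings of every query keyword pair and sum the scores per database."""
    totals = defaultdict(float)
    for pair in combinations(sorted(set(query)), 2):
        for db_id, score in index.get(pair, []):
            totals[db_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Pair scores taken from the comparison table; zero entries are not stored.
summaries = {
    "DB1": {("database", "multimedia"): 1.5,
            ("multimedia", "vldb"): 0.5,
            ("database", "vldb"): 1.5},
    "DB2": {("database", "multimedia"): 0.4,
            ("database", "vldb"): 0.33},
}
index = build_global_index(summaries)
ranking = rank_databases(index, ["multimedia", "database", "vldb"])
```

For the example query, DB1 is ranked ahead of DB2, in line with the earlier observation that the query keywords are more closely related in DB1.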

33
Database ranking and indexing Contd
  • Decentralized index
  • Each machine can store a subset of the index
    (that is, keyword pairs and inverted lists)
  • When a query is received at a node, search
    messages are sent across nodes and the
    corresponding postings lists are retrieved.

34
Experiments done to evaluate efficiency of this
system
  • K-R score compared with score from brute force
    method (real_rank) over 82 databases spread
    across 16 nodes.
  • Effectiveness of this technique has been
    successfully established over distributed
    databases
  • Definitions used for comparison -

35
Experiments done to evaluate efficiency of this
system
36
Experiments done to evaluate efficiency of this
system Contd
  • Effects of d (length of joining sequence)
  • Selection performance of keyword queries
    generally gets better when d grows larger.
  • Precision and recall values for different values
    of d tend to cluster into groups
  • There are big gaps in both precision and recall
    values
  • when and when is greater

37
Experiments done to evaluate efficiency of this
system Contd
  • Recall and precision of 2-keyword queries using
    KR summaries and KF-summaries

38
Experiments done to evaluate efficiency of this
system Contd
  • Effects of number of query keywords
  • 1) Performance of 2-keyword queries is generally
    better than that of 3-keyword and 4-keyword
    queries
  • 5-keyword queries give better recall than
    3-keyword and 4-keyword queries, as they are
    more selective
  • 2) Generally, the difference in recall between
    queries with different numbers of keywords is
    smaller than the difference in precision
  • This shows that the system is effective in
    assigning high ranks to useful databases,
    although less relevant or irrelevant databases
    may also be selected.

39
Experiments done to evaluate efficiency of this
system Contd
  • Comparison of four kinds of estimations (MIN,
    MAX, SUM, PROD)
  • SUM and PROD have similar behavior and outperform
    the other two methods
  • Hence it is more effective to take into account
    relationship information of every keyword pair in
    the query when estimating overall scores
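
The four estimations can be illustrated with plain aggregations over the pairwise scores of a query. This is a simplified reading that assumes each estimation applies the named operator directly to the pair scores; the exact definitions are not given in this transcript. The input values come from the DB1/DB2 comparison table, and the name estimate_query_score is hypothetical.

```python
import math

# Assumption: each estimation is the plain aggregate of the pairwise scores.
AGGREGATORS = {
    "MIN": min,
    "MAX": max,
    "SUM": sum,
    "PROD": math.prod,
}

def estimate_query_score(pair_scores, method):
    """Combine the pairwise relationship scores of a query with the chosen method."""
    return AGGREGATORS[method](list(pair_scores))

# Scores for (database, multimedia), (multimedia, VLDB), (database, VLDB).
db1_pairs = [1.5, 0.5, 1.5]
db2_pairs = [0.4, 0.0, 0.33]

db1_estimates = {m: estimate_query_score(db1_pairs, m) for m in AGGREGATORS}
db2_estimates = {m: estimate_query_score(db2_pairs, m) for m in AGGREGATORS}
```

MIN and MAX each depend on a single keyword pair, whereas SUM and PROD use every pair, which is the distinction the last bullet draws.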

40
Experiments done to evaluate efficiency of this
system Contd
  • Recall and precision of K-R summaries using
    different estimations ( )