Efficient IR-Style Keyword Search over Relational Databases - PowerPoint PPT Presentation

About This Presentation
Title:

Efficient IR-Style Keyword Search over Relational Databases

Description:

0.00 56218.00 281.00 0.02 0.00 11367.63 116.00 0.16 0.00 56688.00 281.00 0.02 5.00 23941.25 154.92 0.19 0.00 57047.00 297.00 0.02 20.00 12446.00 111.48 0.36 0.00 ... – PowerPoint PPT presentation

Number of Views:35
Avg rating:3.0/5.0
Slides: 33
Provided by: csUcrEdu2
Learn more at: http://www.cs.ucr.edu
Category:

less

Transcript and Presenter's Notes

Title: Efficient IR-Style Keyword Search over Relational Databases


1
Efficient IR-Style Keyword Searchover Relational
Databases
  • Vagelis Hristidis
  • University of California, San Diego
  • Luis Gravano
  • Columbia University
  • Yannis Papakonstantinou
  • University of California, San Diego

2
Motivation
  • Keyword search is the dominant information
    discovery method in documents
  • Increasing amount of data is stored in databases
  • Plain text coexists with structured data

3
Motivation
  • Up until recently, information discovery in
    databases required
  • Knowledge of schema
  • Knowledge of a query language (e.g., SQL)
  • Knowledge of the role of the keywords
  • Goal Enable IR-style keyword search over DBMSs
    without the above requirements

4
IR-Style Search over DBMSs
  • IR keyword search well developed for document
    search
  • Modern DBMSs offer IR-style keyword search over
    individual text attributes
  • What is equivalent to document in databases?

5
Example Complaints Database Schema
6
Example - Complaints Database Data
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
7
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
8
Keyword Query Semantics (definition of
document in databases)
  • Keywords are
  • in same tuple
  • in same relation
  • in tuples connected through primary-foreign key
    relationships
  • Score of result
  • distance of keywords within a tuple
  • distance between keywords in terms of
    primary-foreign key connections
  • IR-style score of result tree

9
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
Results (1) c3, (2) p2? c3, (3) p1? c1
10
Result of Keyword Query
  • Result is tree T of tuples where
  • each edge corresponds to a primary-foreign key
    relationship
  • no tuple of T is redundant (minimality)
  • - AND query semantics Every query keyword
    appears in T
  • - OR query semantics Some query keywords
    might be missing from T

11
Score of Result T
  • Combining function Score combines scores of
    attribute values of T
  • One reasonable choice
  • Score?a?TScore(a)/size(T)
  • Attribute value scores Score(a) calculated using
    the DBMS's IR datablades

12
Shortcomings of Prior Work
  • Simplistic ranking methods (e.g., based only on
    size of connecting tree), ignoring well-studied
    IR ranking strategies
  • No straightforward extension to improve
    efficiency by returning just top-k results
  • Not good in handling free-text attributes

DBXplorer,DISCOVER
13
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
score
1/3
1/3
4/3
Score(c3) 4/3
Score(p1? c1) (11/3)/2 4/6
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
score
1
1
0
Score(p2? c3) (14/3)/2 7/6
Results (1) c3, (2) p2? c3, (3) p1? c1
14
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
15
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
16
Candidate Network Generator
  • Find all trees of tuple sets (free or non-free)
    that may produce a result, based on DISCOVER's CN
    generator VLDB 2002
  • Use single non-free tuple set for each relation
  • allows OR semantics
  • fewer CNs are generated
  • extra filtering step required for AND semantics

17
Candidate Network Generator Example
  • For query Maxtor Netvista, CNs
  • ComplaintsQ
  • ProductsQ
  • ComplaintsQ ? ProductsQ
  • ComplaintsQ ? Customer ?ComplaintsQ
  • ComplaintsQ ? Product ? ComplaintsQ
  • Non-CNs
  • ComplaintsQ ? Customer ?Complaints
  • Product Q ? Complaints ? ProductQ

18
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
19
Execution Algorithms
  • Users usually want top-k results.
  • Hence, submitting to DBMS a SQL query for each CN
    (Naïve algorithm) is inefficient.
  • When queries produce at most very few results,
    Naïve algorithm is efficient, since it fully
    exploits DBMS.
  • Monotonic combining functions if results T, T'
    have same schema and for every attribute
    Score(ai)Score(a'i) then Score(T)Score(T')

20
Sparse Algorithm Example Execution
CN results score MFS
ProductsQ
ComplaintsQ
ComplaintsQ ? ProductsQ
p1 9 9
c2 7 7
c1 ? p1 (95)/27 (97)/2 8
  • Best when query produces at most a few results

21
Single Pipelined Algorithm Example Execution
CN ComplaintsQ ? ProductsQ
Results queue
Get next tuple from most promising non-free tuple
set
MPFS
Max(59)/2, (76)/27
Max(19)/2, (76)/26.5
Max(19)/2, (71)/25
result score


p1?c1 7
p2?c2 6.5
Output p1?c1
p2?c2
22
Global Pipelined Algorithm Example Execution
global MPFSmax(MPFSi) over all CNs Ci
  • Best when query produces many results.

23
Hybrid Algorithm
  • Estimate number of results.
  • For OR-semantics, use DBMS estimator
  • For AND-semantics, probabilistically adjust
    DBMS estimator.
  • If at most a few query results expected, then use
    Sparse Algorithm.
  • If many query results expected, then use Global
    Pipelined Algorithm.

24
Related Work
  • DBXplorer ICDE 2002, DISCOVER VLDB 2002
  • Similar three-step architecture
  • Score 1/size(T)
  • Only AND semantics
  • No straightforward extension for efficient top-k
    execution
  • BANKS ICDE 2002, Goldman et al. VLDB 1998
  • Database viewed as graph
  • No use of schema
  • Florescu et al. WWW 2000, XQuery Full-Text
  • Ilyas et al. VLDB 2003, J algorithm VLDB
    2001
  • Top-k algorithms for join queries

25
Experiments DBLP Dataset
C Conference Y Year P Paper A Author
DBLP contains few citation edges. Synthetic
citation edges were added such that average
citations is 20. Final dataset is
56MB. Experiments run over state-of-the-art
commercial RDBMS.
26
OR Semantics Effect of Maximum Allowed CN Size
Average execution time of 100 2-keyword top-10
queries
27
OR Semantics Effect of Number of Objects
Requested k
Average execution time of 100 2-keyword queries
with maximum candidate-network size of 6
28
OR Semantics Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with
maximum candidate-network size of 6
29
Conclusions
  • Extend IR-style ranking to databases.
  • Exploit text-search capabilities of modern DBMSs,
    to generate results of higher quality.
  • Support both AND and OR semantics.
  • Achieve substantial speedup over prior work via
    pipelined top-k query processing algorithms.

30
Questions?
31
Compare algorithms wrt Result size
OR-semantics
AND-semantics
Max CN size 6, top-10, 2 keywords, OR-semantics
32
Ranking Functions
  • Proposed algorithms support tuple monotone
    combining functions
  • That is, if results T, T' have same schema and
    for every attribute Score(ai)Score(a'i) then
    Score(T)Score(T')
Write a Comment
User Comments (0)
About PowerShow.com