Efficient IR-Style Keyword Search over Relational Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Efficient IR-Style Keyword Search over Relational Databases

Description:

0.00 56218.00 281.00 0.02 0.00 11367.63 116.00 0.16 0.00 56688.00 281.00 0.02 5.00 23941.25 154.92 0.19 0.00 57047.00 297.00 0.02 20.00 12446.00 111.48 0.36 0.00 ... – PowerPoint PPT presentation

Number of Views:35

Avg rating:3.0/5.0

Slides: 33

Provided by: csUcrEdu2

Learn more at: http://www.cs.ucr.edu

Category:

more less

Transcript and Presenter's Notes

Title: Efficient IR-Style Keyword Search over Relational Databases

1
Efficient IR-Style Keyword Searchover Relational
Databases

Vagelis Hristidis
University of California, San Diego
Luis Gravano
Columbia University
Yannis Papakonstantinou
University of California, San Diego

2
Motivation

Keyword search is the dominant information
discovery method in documents
Increasing amount of data is stored in databases
Plain text coexists with structured data

3
Motivation

Up until recently, information discovery in
databases required
Knowledge of schema
Knowledge of a query language (e.g., SQL)
Knowledge of the role of the keywords

Goal Enable IR-style keyword search over DBMSs
without the above requirements

4
IR-Style Search over DBMSs

IR keyword search well developed for document
search
Modern DBMSs offer IR-style keyword search over
individual text attributes
What is equivalent to document in databases?

5
Example Complaints Database Schema
6
Example - Complaints Database Data
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
7
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
8
Keyword Query Semantics (definition of
document in databases)

Keywords are
in same tuple
in same relation
in tuples connected through primary-foreign key
relationships
Score of result
distance of keywords within a tuple
distance between keywords in terms of
primary-foreign key connections
IR-style score of result tree

9
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
Results (1) c3, (2) p2? c3, (3) p1? c1
10
Result of Keyword Query

Result is tree T of tuples where
each edge corresponds to a primary-foreign key
relationship
no tuple of T is redundant (minimality)
- AND query semantics Every query keyword
appears in T
- OR query semantics Some query keywords
might be missing from T

11
Score of Result T

Combining function Score combines scores of
attribute values of T
One reasonable choice
Score?a?TScore(a)/size(T)
Attribute value scores Score(a) calculated using
the DBMS's IR datablades

12
Shortcomings of Prior Work

Simplistic ranking methods (e.g., based only on
size of connecting tree), ignoring well-studied
IR ranking strategies
No straightforward extension to improve
efficiency by returning just top-k results
Not good in handling free-text attributes

DBXplorer,DISCOVER
13
Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
score
1/3
1/3
4/3
Score(c3) 4/3
Score(p1? c1) (11/3)/2 4/6
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
score
1
1
0
Score(p2? c3) (14/3)/2 7/6
Results (1) c3, (2) p2? c3, (3) p1? c1
14
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
15
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
16
Candidate Network Generator

Find all trees of tuple sets (free or non-free)
that may produce a result, based on DISCOVER's CN
generator VLDB 2002
Use single non-free tuple set for each relation
allows OR semantics
fewer CNs are generated
extra filtering step required for AND semantics

17
Candidate Network Generator Example

For query Maxtor Netvista, CNs
ComplaintsQ
ProductsQ
ComplaintsQ ? ProductsQ
ComplaintsQ ? Customer ?ComplaintsQ
ComplaintsQ ? Product ? ComplaintsQ
Non-CNs
ComplaintsQ ? Customer ?Complaints
Product Q ? Complaints ? ProductQ

18
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
19
Execution Algorithms

Users usually want top-k results.
Hence, submitting to DBMS a SQL query for each CN
(Naïve algorithm) is inefficient.
When queries produce at most very few results,
Naïve algorithm is efficient, since it fully
exploits DBMS.
Monotonic combining functions if results T, T'
have same schema and for every attribute
Score(ai)Score(a'i) then Score(T)Score(T')

20
Sparse Algorithm Example Execution
CN results score MFS
ProductsQ
ComplaintsQ
ComplaintsQ ? ProductsQ
p1 9 9
c2 7 7
c1 ? p1 (95)/27 (97)/2 8

Best when query produces at most a few results

21
Single Pipelined Algorithm Example Execution
CN ComplaintsQ ? ProductsQ
Results queue
Get next tuple from most promising non-free tuple
set
MPFS
Max(59)/2, (76)/27
Max(19)/2, (76)/26.5
Max(19)/2, (71)/25
result score

p1?c1 7
p2?c2 6.5
Output p1?c1
p2?c2
22
Global Pipelined Algorithm Example Execution
global MPFSmax(MPFSi) over all CNs Ci

Best when query produces many results.

23
Hybrid Algorithm

Estimate number of results.
For OR-semantics, use DBMS estimator
For AND-semantics, probabilistically adjust
DBMS estimator.
If at most a few query results expected, then use
Sparse Algorithm.
If many query results expected, then use Global
Pipelined Algorithm.

24
Related Work

DBXplorer ICDE 2002, DISCOVER VLDB 2002
Similar three-step architecture
Score 1/size(T)
Only AND semantics
No straightforward extension for efficient top-k
execution
BANKS ICDE 2002, Goldman et al. VLDB 1998
Database viewed as graph
No use of schema
Florescu et al. WWW 2000, XQuery Full-Text
Ilyas et al. VLDB 2003, J algorithm VLDB
2001
Top-k algorithms for join queries

25
Experiments DBLP Dataset
C Conference Y Year P Paper A Author
DBLP contains few citation edges. Synthetic
citation edges were added such that average
citations is 20. Final dataset is
56MB. Experiments run over state-of-the-art
commercial RDBMS.
26
OR Semantics Effect of Maximum Allowed CN Size
Average execution time of 100 2-keyword top-10
queries
27
OR Semantics Effect of Number of Objects
Requested k
Average execution time of 100 2-keyword queries
with maximum candidate-network size of 6
28
OR Semantics Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with
maximum candidate-network size of 6
29
Conclusions

Extend IR-style ranking to databases.
Exploit text-search capabilities of modern DBMSs,
to generate results of higher quality.
Support both AND and OR semantics.
Achieve substantial speedup over prior work via
pipelined top-k query processing algorithms.

30
Questions?
31
Compare algorithms wrt Result size
OR-semantics
AND-semantics
Max CN size 6, top-10, 2 keywords, OR-semantics
32
Ranking Functions

Proposed algorithms support tuple monotone
combining functions
That is, if results T, T' have same schema and
for every attribute Score(ai)Score(a'i) then
Score(T)Score(T')

Write a Comment

User Comments (0)