Title: Efficient IR-Style Keyword Search over Relational Databases
1Efficient IR-Style Keyword Searchover Relational
Databases
- Vagelis Hristidis
- University of California, San Diego
- Luis Gravano
- Columbia University
- Yannis Papakonstantinou
- University of California, San Diego
2Motivation
- Keyword search is the dominant information
discovery method in documents - Increasing amount of data is stored in databases
- Plain text coexists with structured data
3Motivation
- Up until recently, information discovery in
databases required - Knowledge of schema
- Knowledge of a query language (e.g., SQL)
- Knowledge of the role of the keywords
- Goal Enable IR-style keyword search over DBMSs
without the above requirements
4IR-Style Search over DBMSs
- IR keyword search well developed for document
search - Modern DBMSs offer IR-style keyword search over
individual text attributes - What is equivalent to document in databases?
5Example Complaints Database Schema
6Example - Complaints Database Data
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
7Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
8Keyword Query Semantics (definition of
document in databases)
- Keywords are
- in same tuple
- in same relation
- in tuples connected through primary-foreign key
relationships - Score of result
- distance of keywords within a tuple
- distance between keywords in terms of
primary-foreign key connections - IR-style score of result tree
9Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
Results (1) c3, (2) p2? c3, (3) p1? c1
10Result of Keyword Query
- Result is tree T of tuples where
- each edge corresponds to a primary-foreign key
relationship - no tuple of T is redundant (minimality)
- - AND query semantics Every query keyword
appears in T - - OR query semantics Some query keywords
might be missing from T
11Score of Result T
- Combining function Score combines scores of
attribute values of T - One reasonable choice
- Score?a?TScore(a)/size(T)
- Attribute value scores Score(a) calculated using
the DBMS's IR datablades
12Shortcomings of Prior Work
- Simplistic ranking methods (e.g., based only on
size of connecting tree), ignoring well-studied
IR ranking strategies - No straightforward extension to improve
efficiency by returning just top-k results - Not good in handling free-text attributes
DBXplorer,DISCOVER
13Example Keyword Query Maxtor Netvista
Complaints
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 disk crashed after just one week of moderate use on an IBM Netvista X41
c2 p131 c3131 7-3-2002 lower-end IBM Netvista caught fire, starting apparently with disk
c3 p131 c3143 8-3-2002 IBM Netvista unstable with Maxtor HD
score
1/3
1/3
4/3
Score(c3) 4/3
Score(p1? c1) (11/3)/2 4/6
Customers
Products
tupleId custId name occupation
u1 c3232 John Smith Software Engineer
u2 c3131 Jack Lucas Architect
u3 c3143 John Mayer Student
tupleId prodId manufacturer model
p1 p121 Maxtor D540X
p2 p131 IBM Netvista
p3 p141 Tripplite Smart 700VA
score
1
1
0
Score(p2? c3) (14/3)/2 7/6
Results (1) c3, (2) p2? c3, (3) p1? c1
14Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
15Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
16Candidate Network Generator
- Find all trees of tuple sets (free or non-free)
that may produce a result, based on DISCOVER's CN
generator VLDB 2002 - Use single non-free tuple set for each relation
- allows OR semantics
- fewer CNs are generated
- extra filtering step required for AND semantics
17Candidate Network Generator Example
- For query Maxtor Netvista, CNs
- ComplaintsQ
- ProductsQ
- ComplaintsQ ? ProductsQ
- ComplaintsQ ? Customer ?ComplaintsQ
- ComplaintsQ ? Product ? ComplaintsQ
- Non-CNs
- ComplaintsQ ? Customer ?Complaints
- Product Q ? Complaints ? ProductQ
18Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
19Execution Algorithms
- Users usually want top-k results.
- Hence, submitting to DBMS a SQL query for each CN
(Naïve algorithm) is inefficient. - When queries produce at most very few results,
Naïve algorithm is efficient, since it fully
exploits DBMS. - Monotonic combining functions if results T, T'
have same schema and for every attribute
Score(ai)Score(a'i) then Score(T)Score(T')
20Sparse Algorithm Example Execution
CN results score MFS
ProductsQ
ComplaintsQ
ComplaintsQ ? ProductsQ
p1 9 9
c2 7 7
c1 ? p1 (95)/27 (97)/2 8
- Best when query produces at most a few results
21Single Pipelined Algorithm Example Execution
CN ComplaintsQ ? ProductsQ
Results queue
Get next tuple from most promising non-free tuple
set
MPFS
Max(59)/2, (76)/27
Max(19)/2, (76)/26.5
Max(19)/2, (71)/25
result score
p1?c1 7
p2?c2 6.5
Output p1?c1
p2?c2
22Global Pipelined Algorithm Example Execution
global MPFSmax(MPFSi) over all CNs Ci
- Best when query produces many results.
23Hybrid Algorithm
- Estimate number of results.
- For OR-semantics, use DBMS estimator
- For AND-semantics, probabilistically adjust
DBMS estimator. - If at most a few query results expected, then use
Sparse Algorithm. - If many query results expected, then use Global
Pipelined Algorithm.
24Related Work
- DBXplorer ICDE 2002, DISCOVER VLDB 2002
- Similar three-step architecture
- Score 1/size(T)
- Only AND semantics
- No straightforward extension for efficient top-k
execution - BANKS ICDE 2002, Goldman et al. VLDB 1998
- Database viewed as graph
- No use of schema
- Florescu et al. WWW 2000, XQuery Full-Text
- Ilyas et al. VLDB 2003, J algorithm VLDB
2001 - Top-k algorithms for join queries
25Experiments DBLP Dataset
C Conference Y Year P Paper A Author
DBLP contains few citation edges. Synthetic
citation edges were added such that average
citations is 20. Final dataset is
56MB. Experiments run over state-of-the-art
commercial RDBMS.
26OR Semantics Effect of Maximum Allowed CN Size
Average execution time of 100 2-keyword top-10
queries
27OR Semantics Effect of Number of Objects
Requested k
Average execution time of 100 2-keyword queries
with maximum candidate-network size of 6
28OR Semantics Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with
maximum candidate-network size of 6
29Conclusions
- Extend IR-style ranking to databases.
- Exploit text-search capabilities of modern DBMSs,
to generate results of higher quality. - Support both AND and OR semantics.
- Achieve substantial speedup over prior work via
pipelined top-k query processing algorithms.
30Questions?
31Compare algorithms wrt Result size
OR-semantics
AND-semantics
Max CN size 6, top-10, 2 keywords, OR-semantics
32Ranking Functions
- Proposed algorithms support tuple monotone
combining functions - That is, if results T, T' have same schema and
for every attribute Score(ai)Score(a'i) then
Score(T)Score(T')