Keyword Search in Structured Databases - PowerPoint PPT Presentation

1
Keyword Search in Structured Databases

Vagelis Hristidis
University of California, San Diego
Research Exam & Thesis Proposal

2
Roadmap

Motivation
Related Work
Preliminary Work
Future Work

3
Roadmap

Motivation
Related Work
Preliminary Work
Future Work

4
Motivation

Keyword search is the dominant information discovery method in documents
Increasing amount of data stored in databases

5
Motivation

Currently, information discovery in databases requires
Knowledge of schema
Knowledge of a query language (e.g., SQL, XQuery)
Knowledge of the role of the keywords

Our work eliminates these requirements

6
Motivation - Example
7
Roadmap

Motivation
Related Work
Preliminary Work
Future Work

8
Related Work - Structured Queries

XML
XQuery [29], XML-QL [30], Quilt [31]
for $t in document('db.xml')/items
where $t/text() = 'Greek recipe'
return <dish>$t</dish>
Relational
SQL, QBE, Datalog
SELECT *
FROM Complaints C
WHERE C.comments = 'disk crash'

9
Related Work - Structured Queries + IR-ranking

XML
XIRQL [26], Florescu et al. [10], ELIXIR [27]
for $t in document('db.xml')/items
where $t/text() 'Greek recipe'
return <dish>$t</dish>
Relational
WHIRL [25], Masermann et al. [11], Oracle Intermedia
SELECT *
FROM Complaints C
WHERE CONTAINS (C.comments, 'disk crash', 1) > 0
ORDER BY score(1) DESC

10
Related Work - Keyword Queries

DBXplorer. S. Agrawal et al. ICDE 2002
Three-step architecture
Drawbacks
Incomplete solutions (relations are not re-used)
Poor performance (no reuse of common subexpressions)
BANKS. G. Bhalotia et al. ICDE 2002
Database viewed as graph
No schema info
Steiner tree problem approximations
Proximity searching in databases. R. Goldman et al. VLDB 1998
Database viewed as graph
No schema info
Hub nodes

11
Related Work
Types of queries
Presence of schema in keyword queries
Our published work
Our submitted work
12
Roadmap

Motivation
Related Work
Preliminary Work
Future Work

13
Preliminary Work

DISCOVER: Keyword Search in Relational Databases.
Vagelis Hristidis, Yannis Papakonstantinou
VLDB, 2002
Keyword Proximity Search on XML Graphs.
Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin
ICDE, 2003
Efficient IR-Style Keyword Search over Relational Databases.
Vagelis Hristidis, Luis Gravano, Yannis Papakonstantinou
submitted
Adding context to XML keyword queries.
Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou, Divesh Srivastava
submitted
A System for Keyword Search on XML Databases.
Balmin, Hristidis, Koudas, Papakonstantinou, Srivastava, Wang
submitted as demo proposal

14
Roadmap

Motivation
Related Work
Preliminary Work
DISCOVER
XKeyword
IR-Ranking Top-k Algorithms
Keyword Search on Trees
Future Work

15
Keyword Query - Semantics

Keywords are
in same node (XML node/tuple)
in nodes of same type (XML type/relation)
data/metadata
connected (through edges/primary-foreign key relationships)
Score of result
distance of keywords within a node
distance between keywords in edges
weighted distance
IR-style ranking
random walk probability
keywords contained
combination of the above

16
Result of Keyword Query

Result is tree T of nodes where
each edge corresponds to an edge of the data
graph
every keyword contained in a node of T
no node of T is redundant (minimal)
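The three conditions above can be checked mechanically. The following is a minimal Python sketch, not the DISCOVER implementation; the function name `is_result_tree`, the tuple identifiers (`o1`, `c1`, ...) and the keyword sets are hypothetical, and minimality is interpreted as "removing any leaf loses a keyword".

```python
from collections import defaultdict


def is_result_tree(tree_edges, node_keywords, data_edges, query):
    """Check the three slide conditions for a candidate result tree T.

    tree_edges:    list of (u, v) pairs forming T
    node_keywords: node id -> set of query keywords contained in that node
                   (its keys are taken to be the nodes of T)
    data_edges:    edges of the data graph, treated as undirected here
    """
    nodes = set(node_keywords)
    if not nodes:
        return False
    undirected = {frozenset(e) for e in data_edges}

    # Condition 1: each edge of T corresponds to an edge of the data graph.
    if any(frozenset(e) not in undirected for e in tree_edges):
        return False

    # T must really be a tree: |nodes| - 1 edges and connected.
    if len(tree_edges) != len(nodes) - 1:
        return False
    adj = defaultdict(set)
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    if seen != nodes:
        return False

    # Condition 2: every keyword is contained in some node of T.
    def covers(subset):
        found = set().union(*(node_keywords[n] for n in subset))
        return set(query) <= found

    if not covers(nodes):
        return False

    # Condition 3 (assumed reading of "minimal"): dropping any leaf loses a keyword.
    leaves = [n for n in nodes if len(adj[n]) <= 1]
    return all(not covers(nodes - {leaf}) for leaf in leaves)


# Hypothetical data graph: two orders of the same customer, plus its nation.
data_edges = [("c1", "o1"), ("c1", "o2"), ("n1", "c1")]
keywords = {"o1": {"Smith"}, "c1": set(), "o2": {"Miller"}}
ok = is_result_tree([("o1", "c1"), ("c1", "o2")], keywords, data_edges, ["Smith", "Miller"])
```

Adding the nation tuple `n1` to the same tree would make it non-minimal, because `n1` is a leaf that contributes no keyword.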

17
Example (Relational) - Schema
Subset of TPC-H schema
[Figure: schema graph over ORDERS, CUSTOMER, and NATION; edges labeled n:1]
18
Example (Relational) - Data
19
Example (Relational) - Keyword Query
Query: Smith, Miller
20
Example - Keyword Query
Query: Smith, Miller
Results
21
Example - Keyword Query
Query: Smith, Miller
Results
Smaller sizes usually denote tighter association between keywords
22
DISCOVER - Architecture
User
23
DISCOVER - Architecture
24
Candidate Networks Generator - Challenges

A keyword may appear in multiple tuples
Candidate networks can be too big (sometimes unbounded)

25
Candidate Network - Example
[Figure: tuple-set graph with nodes ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION; edges labeled n:1]
26
Candidate Network - Example
[Figure: same tuple-set graph as on slide 25]
CN1: O^Smith ← C → O^Miller, size = 2
27
Candidate Network - Example
[Figure: same tuple-set graph as on slide 25]
CN1: O^Smith ← C → O^Miller, size = 2
CN2: O^Smith ← C ← N → C → O^Miller, size = 4
28
Candidate Network - Example
[Figure: same tuple-set graph as on slide 25]
CN3: O^Smith ← C → O^Miller ← C, size = 3
29
Candidate Network - Example
[Figure: same tuple-set graph as on slide 25]
CN4: O^Smith ← C → O ← C → O^Miller, size = 4

-------------------------------------------------
c1 → o ← c2 implies c1 = c2, because of the primary-to-foreign-key relationship from CUSTOMER to ORDERS
Pruning condition: R^K → S ← R^L
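A minimal Python sketch of the pruning test follows. It assumes edges point from the primary-key side to the foreign-key side (as in "primary to foreign key from CUSTOMER to ORDERS") and that there is at most one foreign key between any pair of relations. The function name and the node numbering are hypothetical.

```python
def violates_pruning(cn_nodes, cn_edges):
    """Return True if some node S has two parents from the same relation R.

    cn_nodes: node id -> relation name (keyword annotations are irrelevant here)
    cn_edges: (pk_node, fk_node) pairs, i.e. edges R -> S
    """
    parent_relations = {}
    for pk_node, fk_node in cn_edges:
        relation = cn_nodes[pk_node]
        seen = parent_relations.setdefault(fk_node, set())
        if relation in seen:
            # Pattern R -> S <- R: both R tuples would have to be the same tuple.
            return True
        seen.add(relation)
    return False


# CN1: O^Smith <- C -> O^Miller
cn1_nodes = {0: "ORDERS", 1: "CUSTOMER", 2: "ORDERS"}
cn1_edges = [(1, 0), (1, 2)]

# CN4: O^Smith <- C -> O <- C -> O^Miller
cn4_nodes = {0: "ORDERS", 1: "CUSTOMER", 2: "ORDERS", 3: "CUSTOMER", 4: "ORDERS"}
cn4_edges = [(1, 0), (1, 2), (3, 2), (3, 4)]

keep_cn1 = not violates_pruning(cn1_nodes, cn1_edges)
keep_cn4 = not violates_pruning(cn4_nodes, cn4_edges)
```

Under these assumptions CN1 is kept, while CN4 is pruned because its middle ORDERS node is reached from two CUSTOMER nodes.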

30
Candidate Networks Generator is Complete and Non-Redundant

Prove that the set of Candidate Networks generated is
Complete: every solution is generated by some CN
Non-redundant: for each CN there is a database instance where removing that CN loses a solution

31
Size of Candidate Networks may be Unbounded

Size is unbounded iff schema graph G has one of the following properties
There is a node of G that has at least two incoming edges,
e.g., PARTSUPP → LINEITEM ← ORDERS
G has a directed cycle,
e.g., ancestor schemas
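The condition above is easy to test on a schema graph. This Python sketch assumes the schema is given as a list of directed edges (primary-key relation, foreign-key relation); the function name `cn_size_unbounded` and the PERSON self-reference used for the ancestor example are hypothetical.

```python
def cn_size_unbounded(schema_edges):
    """True iff some node has >= 2 incoming edges or the graph has a directed cycle."""
    indegree, successors, nodes = {}, {}, set()
    for src, dst in schema_edges:
        nodes.update((src, dst))
        indegree[dst] = indegree.get(dst, 0) + 1
        successors.setdefault(src, []).append(dst)

    # Property 1: a node with at least two incoming edges.
    if any(count >= 2 for count in indegree.values()):
        return True

    # Property 2: a directed cycle, found by depth-first search with three colors.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(nodes, WHITE)

    def has_cycle(u):
        color[u] = GRAY
        for v in successors.get(u, []):
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[n] == WHITE and has_cycle(n) for n in nodes)


chain = [("NATION", "CUSTOMER"), ("CUSTOMER", "ORDERS")]
two_incoming = [("PARTSUPP", "LINEITEM"), ("ORDERS", "LINEITEM")]
ancestor = [("PERSON", "PERSON")]  # hypothetical self-referencing schema

results = [cn_size_unbounded(g) for g in (chain, two_incoming, ancestor)]
```

For the three-relation chain used in the running example the check is negative, which is consistent with the pruning of CN3 and CN4.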

32
DISCOVER - Architecture
33
Execution Plan - Challenges

Generated SQL queries are expensive due to joins
Reusability opportunities

34
Execution Plan

Each CN corresponds to a SQL statement
CN1: O^Smith ← C → O^Miller
CN2: O^Smith ← C ← N → C → O^Miller
Execution Plan
CN1 ← O^Smith ⋈ C ⋈ O^Miller
CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
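As an illustration of how a CN maps to one SQL statement, here is a small Python sketch that assembles the join query for CN1. It assumes the tuple sets are available as tables with the hypothetical names `ORDERS_Smith` and `ORDERS_Miller`; the join columns `O_CUSTKEY` and `C_CUSTKEY` follow the TPC-H schema. The helper `cn_to_sql` is not part of DISCOVER.

```python
def cn_to_sql(aliases, joins):
    """Build a join query from a candidate network.

    aliases: alias -> table name, one entry per CN node
    joins:   (left_column, right_column) pairs, one per CN edge
    """
    from_clause = ", ".join(f"{table} {alias}" for alias, table in aliases.items())
    where_clause = " AND ".join(f"{left} = {right}" for left, right in joins)
    return f"SELECT * FROM {from_clause} WHERE {where_clause}"


# CN1: both keyword-bearing orders belong to the same customer.
cn1_sql = cn_to_sql(
    {"o1": "ORDERS_Smith", "c": "CUSTOMER", "o2": "ORDERS_Miller"},
    [("o1.O_CUSTKEY", "c.C_CUSTKEY"), ("o2.O_CUSTKEY", "c.C_CUSTKEY")],
)
print(cn1_sql)
```

Each CN edge contributes one equality predicate, so a CN of size n produces a statement with n joins.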

35
Reuse Common Subexpressions - Example

Execution Plan
CN1 ← O^Smith ⋈ C ⋈ O^Miller
CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
Optimized Execution Plan
Temp ← O^Smith ⋈ C
CN1 ← Temp ⋈ O^Miller
CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller

36
Optimal Reuse of Common Subexpressions is NP-Complete

Simple cost model: each join has cost 1
Prove that finding optimal common subexpressions is NP-complete
Proof: reduction from the string compression problem
Heuristics
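Under the unit-cost model, the saving from the Temp example on the previous slide can be counted directly. The sketch below only counts joins for two given plans; it is not the heuristic used by DISCOVER, and the dictionary layout is an assumption made for illustration.

```python
def plan_cost(plan):
    """Total number of joins when each entry is a list of operands joined in sequence."""
    return sum(len(operands) - 1 for operands in plan.values())


# Plan without reuse: every CN is evaluated from scratch.
naive_plan = {
    "CN1": ["O^Smith", "C", "O^Miller"],
    "CN2": ["O^Smith", "C", "N", "C", "O^Miller"],
}

# Plan with reuse: the shared prefix O^Smith joined with C is computed once as Temp.
optimized_plan = {
    "Temp": ["O^Smith", "C"],
    "CN1": ["Temp", "O^Miller"],
    "CN2": ["Temp", "N", "C", "O^Miller"],
}

naive_cost = plan_cost(naive_plan)          # 2 + 4 joins
optimized_cost = plan_cost(optimized_plan)  # 1 + 1 + 3 joins
```

With only two CNs the saving is a single join; the benefit grows with the number of CNs that share a subexpression.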

37
Roadmap

Motivation
Related Work
Preliminary Work
DISCOVER
XKeyword
IR-Ranking Top-k Algorithms
Keyword Search on Trees
Future Work

38
XKeyword

Keyword search on XML graphs
Dedicated system
Handles storing of XML data in RDBMS
Compares various decomposition methods
Smart presentation methods
Demo on DBLP database at www.db.ucsd.edu/XKeyword

39
XKeyword - Presentation

Target Object is minimum presentation unit.
As small as possible while meaningful.

40
XKeyword Presentation Graph

Avoid MVD-like duplication of results
Graph below corresponds to 5210 results

[Figure: presentation graph with paper nodes p1 to p7 and author nodes a1 (Yannis), a2 (Jeffrey Ullman), a3 (Vasilis Vassalos)]
41
XKeyword Presentation Graph
42
XKeyword Presentation Graph
43
Roadmap

Motivation
Related Work
Preliminary Work
DISCOVER
XKeyword
IR-Ranking Top-k Algorithms
Keyword Search on Trees
Future Work

44
IR ranking Top-k results

Limitations of DISCOVER, XKeyword, DBXplorer
no leverage of IR ranking techniques
strict AND-semantics
all results calculated

45
IR ranking Top-k results

Nodes ranked by IR-ranking function
Node scores combined by combining function.
eg
Not all keywords have to be contained (OR-semantics)

46
IR ranking Top-k results

OR-semantics and IR-ranking give a continuous scale of scores
Top-k results problem becomes more challenging
Our top-k algorithm is orders of magnitude faster than previous work

47
Top-k ranked queries

PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries
Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou
ACM SIGMOD, 2001
Algorithms and Applications for answering Ranked Queries using Ranked Views
Vagelis Hristidis, Yannis Papakonstantinou
submitted for journal publication
Multi-Dimensional Processing of Ranked Queries
Y. Tao, V. Hristidis, D. Papadias, Y. Papakonstantinou
submitted for journal publication

48
Roadmap

Motivation
Related Work
Preliminary Work
DISCOVER
XKeyword
IR-Ranking Top-k Algorithms
Keyword Search on Trees
Future Work

49
Keyword search on schema-less XML trees

Smallest connecting tree has the LCA (lowest common ancestor) as root
Inverted index
L(k1), L(k2): lists of nodes containing keywords k1, k2
Naive algorithm complexity: O(|L(k1)| × |L(k2)|)
Our algorithm makes a single pass over the lists
Results are presented in groups
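The naive pairwise approach can be sketched in a few lines of Python. The sketch assumes each node is identified by a Dewey-style path (a tuple of child positions from the root), which is an assumption not stated on the slide; under that encoding the LCA of two nodes is their longest common prefix. The single-pass algorithm mentioned above is not shown.

```python
def lca(a, b):
    """Longest common prefix of two Dewey-style node ids."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def naive_connecting_roots(list_k1, list_k2):
    """Naive enumeration: one LCA computation per pair, |L(k1)| * |L(k2)| in total."""
    return {(n1, n2): lca(n1, n2) for n1 in list_k1 for n2 in list_k2}


# Hypothetical inverted-index lists for two keywords.
L_k1 = [(0, 0, 1), (0, 2)]
L_k2 = [(0, 0, 3)]
roots = naive_connecting_roots(L_k1, L_k2)
```

Here the pair `(0, 0, 1)`, `(0, 0, 3)` meets at node `(0, 0)`, while `(0, 2)`, `(0, 0, 3)` only meets at `(0,)`, so the first pair forms the smaller connecting tree.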

50
Roadmap

Motivation
Related Work
Preliminary Work
Future Work

51
Future Work

Investigate other proximity semantics
Abstraction from schema design
Results estimation - Relaxation

52
Future Work - Random Walks

By proximity semantics: same score

Intuitively, Vagelis-Yannis are closer

53
Future Work - Random Walks
[Figure: graph with edges labeled by transition probabilities 1/2, 1/3, 1/2, 1, 1/3, 1/3, 1]
Vagelis-Koudas: 1/3 × 1/2 = 1/6
Vagelis-Yannis: 1/3 × 1/2 + 1/3 × 1 + 1/3 × 1 = 5/6
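The two scores on this slide can be reproduced with exact fractions. The sketch assumes each score is the sum, over the paths connecting the two authors, of the product of the transition probabilities along the path; the path lists below are read off the slide's numbers, since the graph itself is not in the transcript.

```python
from fractions import Fraction
from math import prod

third, half, one = Fraction(1, 3), Fraction(1, 2), Fraction(1)


def walk_score(paths):
    """Sum over paths of the product of edge probabilities on each path."""
    return sum(prod(path) for path in paths)


vagelis_koudas = walk_score([[third, half]])
vagelis_yannis = walk_score([[third, half], [third, one], [third, one]])
```

The larger score for Vagelis-Yannis matches the intuition stated on the previous slide.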
54
Future Work - Random Walks: Open Issues