1
Keyword Search in Structured Databases
  • Vagelis Hristidis
  • University of California, San Diego
  • Research Exam & Thesis Proposal

2
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • Future Work

3
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • Future Work

4
Motivation
  • Keyword Search is the dominant information
    discovery method in documents
  • Increasing amount of data stored in databases

5
Motivation
  • Currently, information discovery in databases
    requires
    • Knowledge of the schema
    • Knowledge of a query language (e.g., SQL, XQuery)
    • Knowledge of the role of the keywords
  • Our work eliminates these requirements

6
Motivation - Example
7
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • Future Work

8
Related Work: Structured Queries
  • XML
    • XQuery [29], XML-QL [30], Quilt [31]
        for $t in document('db.xml')/items
        where $t/text() = 'Greek recipe'
        return <dish>$t</dish>
  • Relational
    • SQL, QBE, Datalog
        SELECT *
        FROM Complaints C
        WHERE C.comments = 'disk crash'

9
Related Work: Structured Queries + IR-Ranking
  • XML
    • XIRQL [26], Florescu et al. [10], ELIXIR [27]
        for $t in document('db.xml')/items
        where $t/text() = 'Greek recipe'
        return <dish>$t</dish>
  • Relational
    • WHIRL [25], Masermann et al. [11], Oracle
      interMedia
        SELECT *
        FROM Complaints C
        WHERE CONTAINS(C.comments, 'disk crash', 1) > 0
        ORDER BY SCORE(1) DESC

10
Related Work: Keyword Queries
  • DBXplorer. S. Agrawal et al. ICDE 2002
    • Three-step architecture
    • Drawbacks
      • Incomplete solutions (relations are not re-used)
      • Poor performance (no reuse of common
        subexpressions)
  • BANKS. G. Bhalotia et al. ICDE 2002
    • Database viewed as graph
    • No schema info
    • Steiner tree problem approximations
  • Proximity searching in databases. R. Goldman et
    al. VLDB 1998
    • Database viewed as graph
    • No schema info
    • Hub nodes

11
Related Work
[Figure: related work arranged by type of query and by presence of schema in keyword queries; the legend distinguishes our published work from our submitted work]
12
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • Future Work

13
Preliminary Work
  • DISCOVER: Keyword Search in Relational Databases.
    • Vagelis Hristidis, Yannis Papakonstantinou
    • VLDB, 2002
  • Keyword Proximity Search on XML Graphs.
    • Vagelis Hristidis, Yannis Papakonstantinou,
      Andrey Balmin
    • ICDE, 2003
  • Efficient IR-Style Keyword Search over Relational
    Databases.
    • Vagelis Hristidis, Luis Gravano, Yannis
      Papakonstantinou
    • submitted
  • Adding Context to XML Keyword Queries.
    • Vagelis Hristidis, Nick Koudas, Yannis
      Papakonstantinou, Divesh Srivastava
    • submitted
  • A System for Keyword Search on XML Databases.
    • Balmin, Hristidis, Koudas, Papakonstantinou,
      Srivastava, Wang
    • submitted as demo proposal

14
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • DISCOVER
  • XKeyword
  • IR-Ranking Top-k Algorithms
  • Keyword Search on Trees
  • Future Work

15
Keyword Query - Semantics
  • Keywords are
    • in the same node (XML node/tuple)
    • in nodes of the same type (XML type/relation)
    • data/metadata
    • connected (through edges/primary-foreign key
      relationships)
  • Score of result
    • distance of keywords within a node
    • distance between keywords in edges
    • weighted distance
    • IR-style ranking
    • random walk probability
    • keywords contained
    • combination of the above

16
Result of Keyword Query
  • The result is a tree T of nodes where
    • each edge of T corresponds to an edge of the
      data graph
    • every keyword is contained in some node of T
    • no node of T is redundant (T is minimal)
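
The three conditions can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation. It assumes the data graph is given as undirected edges and that each node carries a set of keywords; the names `is_valid_result`, `data_edges`, and `node_keywords` are hypothetical. Minimality is read here as "no leaf can be dropped without losing a keyword", since dropping an internal node would disconnect the tree.

```python
from collections import defaultdict


def is_valid_result(tree_nodes, tree_edges, data_edges, node_keywords, query):
    """Check the three result conditions for a candidate tree T."""
    nodes = set(tree_nodes)
    query = set(query)
    undirected = {frozenset(e) for e in data_edges}

    # T must be a tree: |E| = |V| - 1 and connected.
    if not nodes or len(tree_edges) != len(nodes) - 1:
        return False
    adj = defaultdict(set)
    for u, v in tree_edges:
        # Condition 1: each tree edge is an edge of the data graph.
        if u not in nodes or v not in nodes:
            return False
        if frozenset((u, v)) not in undirected:
            return False
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    if seen != nodes:
        return False

    def covered(subset):
        found = set()
        for n in subset:
            found |= node_keywords.get(n, set()) & query
        return found == query

    # Condition 2: every keyword is contained in some node of T.
    if not covered(nodes):
        return False
    # Condition 3: dropping any leaf must lose a keyword.
    leaves = [n for n in nodes if len(adj[n]) <= 1]
    return all(not covered(nodes - {leaf}) for leaf in leaves)


# Hypothetical data: two orders of one customer, plus the customer's nation.
data_edges = [("o1", "c1"), ("c1", "o2"), ("c1", "n1")]
node_keywords = {"o1": {"Smith"}, "o2": {"Miller"}}
query = {"Smith", "Miller"}

minimal = is_valid_result(
    {"o1", "c1", "o2"}, [("o1", "c1"), ("c1", "o2")],
    data_edges, node_keywords, query)
padded = is_valid_result(
    {"o1", "c1", "o2", "n1"}, [("o1", "c1"), ("c1", "o2"), ("c1", "n1")],
    data_edges, node_keywords, query)
```

Here `minimal` is `True`, while `padded` is `False` because the leaf `n1` contributes no keyword.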

17
Example (Relational) - Schema
Subset of TPC-H schema
[Figure: schema graph in which ORDERS relates n:1 to CUSTOMER and CUSTOMER relates n:1 to NATION]
18
Example (Relational) - Data
19
Example (Relational) - Keyword Query
Query: Smith, Miller
20
Example Keyword Query
Query: Smith, Miller
Results
21
Example Keyword Query
Query: Smith, Miller
Results
Smaller sizes usually denote tighter association
between keywords
22
DISCOVER - Architecture
[Figure: DISCOVER architecture diagram]
23
DISCOVER - Architecture
24
Candidate Networks Generator - Challenges
  • A keyword may appear in multiple tuples
  • Candidate networks can be very large (sometimes
    unbounded in size)

25
Candidate Network - Example
[Figure: tuple-set graph. ORDERS^Smith, ORDERS^Miller, and ORDERS are each connected n:1 to CUSTOMER; CUSTOMER is connected n:1 to NATION]
26
Candidate Network - Example
[Figure: tuple-set graph over ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION]
CN1: O^Smith ← C → O^Miller (size 2)
27
Candidate Network - Example
[Figure: tuple-set graph over ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION]
CN1: O^Smith ← C → O^Miller (size 2)
CN2: O^Smith ← C ← N → C → O^Miller (size 4)
28
Candidate Network - Example
[Figure: tuple-set graph over ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION]
CN3: O^Smith ← C → O^Miller ← C (size 3)
29
Candidate Network - Example
[Figure: tuple-set graph over ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION]
CN4: O^Smith ← C → O ← C → O^Miller (size 4)
  • c1 → o ← c2
  • c1 ≡ c2, because of the primary-to-foreign-key
    edge from CUSTOMER to ORDERS
  • Pruning condition: R^K → S ← R^L
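
The pruning condition can be illustrated with a short check. This is a minimal Python sketch under stated assumptions, not the DISCOVER code: edges are taken to point from the primary-key side to the foreign-key side (CUSTOMER to ORDERS, as in the explanation on this slide), and the schema is assumed to have at most one primary-to-foreign-key edge between any two relations. The function name `violates_pruning` and the node numbering are hypothetical.

```python
def violates_pruning(cn_nodes, cn_edges):
    """Return True if the network contains the pattern R -> S <- R.

    cn_nodes maps a node id to its relation name (keyword annotations do
    not matter for this check). cn_edges holds (src, dst) pairs directed
    from the primary-key side to the foreign-key side.
    """
    incoming = {}
    for src, dst in cn_edges:
        incoming.setdefault(dst, []).append(cn_nodes[src])
    # A tuple of S references a single tuple of R, so two incoming
    # neighbours from the same relation R would be the same tuple.
    return any(len(rels) != len(set(rels)) for rels in incoming.values())


# CN1: O^Smith <- C -> O^Miller
cn1_nodes = {1: "ORDERS", 2: "CUSTOMER", 3: "ORDERS"}
cn1_edges = [(2, 1), (2, 3)]

# CN4: O^Smith <- C -> O <- C -> O^Miller
cn4_nodes = {1: "ORDERS", 2: "CUSTOMER", 3: "ORDERS",
             4: "CUSTOMER", 5: "ORDERS"}
cn4_edges = [(2, 1), (2, 3), (4, 3), (4, 5)]

cn1_pruned = violates_pruning(cn1_nodes, cn1_edges)
cn4_pruned = violates_pruning(cn4_nodes, cn4_edges)
```

Under these assumptions `cn1_pruned` is `False` and `cn4_pruned` is `True`, because the middle ORDERS node of CN4 has two CUSTOMER neighbours that must coincide.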

30
Candidate Networks Generator is Complete and
Non-Redundant
  • We prove that the generated set of Candidate
    Networks is
    • Complete: every solution is generated by some CN
    • Non-redundant: for each CN there is a database
      instance where removing that CN loses a solution

31
Size of Candidate Networks may be Unbounded
  • Size is unbounded iff the schema graph G has one
    of the following properties
    • There is a node of G that has at least two
      incoming edges,
      e.g., PARTSUPP → LINEITEM ← ORDERS
    • G has a directed cycle,
      e.g., ancestor schemas
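
The two properties are easy to test on a schema graph. Below is a minimal Python sketch, assuming the schema graph is given as a list of directed (source, target) edges that follow the primary-key-to-foreign-key direction; the function name `cn_size_unbounded` is hypothetical, and the cycle test uses Kahn's topological-sort algorithm as one convenient choice.

```python
def cn_size_unbounded(schema_edges):
    """Return True if the schema graph has a node with at least two
    incoming edges or contains a directed cycle."""
    edges = list(schema_edges)
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        indegree[dst] += 1
        successors[src].append(dst)

    # Property 1: some node has at least two incoming edges.
    if any(d >= 2 for d in indegree.values()):
        return True

    # Property 2: directed cycle. Repeatedly remove nodes without
    # incoming edges; nodes that can never be removed lie on a cycle.
    remaining = dict(indegree)
    ready = [n for n, d in remaining.items() if d == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for nxt in successors[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    return removed < len(nodes)


two_incoming = cn_size_unbounded(
    [("PARTSUPP", "LINEITEM"), ("ORDERS", "LINEITEM")])
chain = cn_size_unbounded(
    [("NATION", "CUSTOMER"), ("CUSTOMER", "ORDERS")])
self_reference = cn_size_unbounded([("PERSON", "PERSON")])
```

With these inputs `two_incoming` and `self_reference` are `True` and `chain` is `False`. The PERSON relation is a made-up stand-in for an ancestor-style schema.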

32
DISCOVER - Architecture
33
Execution Plan - Challenges
  • Generated SQL queries are expensive due to joins
  • Reusability opportunities

34
Execution Plan
  • Each CN corresponds to a SQL statement
    • CN1: O^Smith ← C → O^Miller
    • CN2: O^Smith ← C ← N → C → O^Miller
  • Execution plan
    • CN1 ← O^Smith ⋈ C ⋈ O^Miller
    • CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

35
Reuse Common Subexpressions - Example
  • Execution plan
    • CN1 ← O^Smith ⋈ C ⋈ O^Miller
    • CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
  • Optimized execution plan
    • Temp ← O^Smith ⋈ C
    • CN1 ← Temp ⋈ O^Miller
    • CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
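
The saving from the shared intermediate result can be counted directly. The sketch below is a minimal Python illustration, assuming each plan is a linear chain of tuple sets and that every join costs 1; the function names are hypothetical, and the sketch only handles a shared prefix rather than arbitrary common subexpressions.

```python
def joins_without_reuse(plans):
    """A plan that joins m tuple sets needs m - 1 joins."""
    return sum(len(plan) - 1 for plan in plans)


def joins_with_shared_prefix(plans, prefix_len):
    """Materialize the first prefix_len tuple sets once (Temp) and
    reuse that result in every plan."""
    prefix = plans[0][:prefix_len]
    if prefix_len < 2 or any(p[:prefix_len] != prefix for p in plans):
        raise ValueError("plans must share a prefix of at least two tuple sets")
    temp_cost = prefix_len - 1
    # Each plan joins Temp with its remaining tuple sets, one join each.
    rest_cost = sum(len(plan) - prefix_len for plan in plans)
    return temp_cost + rest_cost


cn1 = ["O^Smith", "C", "O^Miller"]
cn2 = ["O^Smith", "C", "N", "C", "O^Miller"]

plain = joins_without_reuse([cn1, cn2])           # 2 + 4 = 6 joins
shared = joins_with_shared_prefix([cn1, cn2], 2)  # 1 + 1 + 3 = 5 joins
```

Reusing Temp saves one join in this example; the saving grows with the number of candidate networks that share the subexpression.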

36
Optimal Reuse of Common Subexpressions is
NP-Complete
  • Simple cost model: each join has cost 1
  • We prove that finding the optimal set of common
    subexpressions is NP-complete
  • Proof: reduction from the string compression
    problem
  • Heuristics

37
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • DISCOVER
  • XKeyword
  • IR-Ranking Top-k Algorithms
  • Keyword Search on Trees
  • Future Work

38
XKeyword
  • Keyword search on XML graphs
  • Dedicated system
  • Handles storage of XML data in an RDBMS
  • Compares various decomposition methods
  • Smart presentation methods
  • Demo on DBLP database at
  • www.db.ucsd.edu/XKeyword

39
XKeyword - Presentation
  • A target object is the minimum presentation unit.
  • As small as possible while still meaningful.

40
XKeyword Presentation Graph
  • Avoid MVD-like duplication of results
  • Graph below corresponds to 5210 results

[Figure: presentation graph with paper nodes p1 to p7 and author nodes a1 (Yannis), a2 (Jeffrey Ullman), and a3 (Vasilis Vassalos)]
41
XKeyword Presentation Graph
42
XKeyword Presentation Graph
43
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • DISCOVER
  • XKeyword
  • IR-Ranking Top-k Algorithms
  • Keyword Search on Trees
  • Future Work

44
IR ranking Top-k results
  • Limitations of DISCOVER, XKeyword, DBXplorer
    • IR ranking techniques are not leveraged
    • Strict AND-semantics
    • All results are computed

45
IR ranking Top-k results
  • Nodes ranked by IR-ranking function
  • Node scores combined by combining function.
  • eg
  • Not all keywords have to be contained
  • (OR-semantics)

46
IR ranking Top-k results
  • OR-semantics and IR-ranking give a continuous
    scale of scores
  • Top-k results problem becomes more challenging
  • Our top-k algorithm is orders of magnitude faster
    than previous work

47
Top-k ranked queries
  • PREFER: A System for the Efficient Execution of
    Multi-parametric Ranked Queries
    • Vagelis Hristidis, Nick Koudas, Yannis
      Papakonstantinou
    • ACM SIGMOD, 2001
  • Algorithms and Applications for Answering Ranked
    Queries Using Ranked Views
    • Vagelis Hristidis, Yannis Papakonstantinou
    • submitted for journal publication
  • Multi-Dimensional Processing of Ranked Queries
    • Y. Tao, V. Hristidis, D. Papadias, Y.
      Papakonstantinou
    • submitted for journal publication

48
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • DISCOVER
  • XKeyword
  • IR-Ranking Top-k Algorithms
  • Keyword Search on Trees
  • Future Work

49
Keyword search on schema-less XML trees
  • The smallest connecting tree has the LCA (lowest
    common ancestor) as root
  • Inverted index
    • L(k1), L(k2): lists of nodes containing the
      keywords
  • Naive algorithm complexity: O(|L(k1)| · |L(k2)|)
  • Our algorithm makes a single pass over the lists
  • Present results in a grouped way
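
To make the naive baseline concrete, here is a minimal Python sketch of the quadratic algorithm only; it is not the single-pass algorithm mentioned above. It assumes each XML node is identified by its root-to-node path of child positions (a Dewey-style identifier), which the slides do not specify; `lca` and `naive_connecting_trees` are hypothetical names.

```python
from itertools import product


def lca(a, b):
    """Longest common prefix of two paths, i.e. the lowest common
    ancestor of the two nodes."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)


def naive_connecting_trees(list_k1, list_k2):
    """Pair every node of L(k1) with every node of L(k2), which takes
    |L(k1)| * |L(k2)| LCA computations."""
    results = []
    for n1, n2 in product(list_k1, list_k2):
        root = lca(n1, n2)
        # Tree size in edges: root down to n1 plus root down to n2.
        size = (len(n1) - len(root)) + (len(n2) - len(root))
        results.append((root, n1, n2, size))
    # Smaller trees first, following the proximity intuition.
    return sorted(results, key=lambda r: r[3])


# Hypothetical inverted-index entries for two keywords.
l_k1 = [(0, 0, 1), (0, 1, 0)]
l_k2 = [(0, 0, 2)]
trees = naive_connecting_trees(l_k1, l_k2)
```

The first entry of `trees` connects `(0, 0, 1)` and `(0, 0, 2)` under the root `(0, 0)` with two edges; the other pair meets only at the document root `(0,)` and needs four edges.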

50
Roadmap
  • Motivation
  • Related Work
  • Preliminary Work
  • Future Work

51
Future Work
  • Investigate other proximity semantics
  • Abstraction from schema design
  • Results estimation - Relaxation

52
Future Work - Random Walks
  • By proximity semantics: same score
  • Intuitively, Vagelis-Yannis are closer

53
Future Work - Random Walks
[Figure: graph whose edges are labeled with random-walk transition probabilities 1/2, 1/3, and 1]
Vagelis-Koudas: 1/3 · 1/2 = 1/6
Vagelis-Yannis: 1/3 · 1/2 + 1/3 · 1 + 1/3 · 1 = 5/6
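
The two scores on this slide can be read as sums, over the connecting paths, of the products of the edge probabilities along each path. The following minimal Python sketch reproduces that arithmetic with exact fractions. The graph itself is not recoverable from the transcript, so the paths are entered directly as lists of edge probabilities; `walk_probability` is a hypothetical helper.

```python
from fractions import Fraction


def walk_probability(paths):
    """Sum over paths of the product of edge transition probabilities."""
    total = Fraction(0)
    for path in paths:
        prob = Fraction(1)
        for edge_prob in path:
            prob *= edge_prob
        total += prob
    return total


third, half, one = Fraction(1, 3), Fraction(1, 2), Fraction(1)

# One connecting path: 1/3 * 1/2.
vagelis_koudas = walk_probability([[third, half]])
# Three connecting paths: 1/3 * 1/2 + 1/3 * 1 + 1/3 * 1.
vagelis_yannis = walk_probability([[third, half], [third, one], [third, one]])
```

This gives `Fraction(1, 6)` for Vagelis-Koudas and `Fraction(5, 6)` for Vagelis-Yannis, matching the intuition that the second pair is more closely related.
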
54
Future Work - Random Walks: Open Issues
  • Semantics
    • Role of direction of edges
    • Length of paths
    • More than 2 keywords
  • Proposed semantics for 2 nodes A, B
    • Sum of probabilities of circuits A → B → A
    • Ignore direction of edges
    • Stop walking on paths with probability < ε

55
Future Work - Random Walks: Open Issues
  • Performance
  • exact calculation is expensive
  • approximation techniques
  • Random walk semantics can be generalized for the
    Web
  • Study and use work by algorithms community

56
Future Work - Schema Design Abstraction
  • Score depends on how attributes are split into
    relations
  • Propose schema-like graph structures independent
    of specific schema design
  • Algorithms that are efficiently reducible to
    corresponding algorithms on schema
  • Random walks can be viewed as schema abstraction
    technique

57
Future Work - Results Estimation
  • Top-k algorithms are usually slow when there are
    very few results.
  • Estimating the number of results a priori allows
    other evaluation strategies (e.g., naive execution)
  • If there are very few results, relax the query
    (e.g., remove the most frequent keyword)

58
Questions?