Title: Keyword Search in Structured Databases
1Keyword Search in Structured Databases
- Vagelis Hristidis
- University of California, San Diego
- Research Exam Thesis Proposal
2Roadmap
- Motivation
- Related Work
- Preliminary Work
- Future Work
4Motivation
- Keyword search is the dominant information discovery method in documents
- Increasing amount of data stored in databases
5Motivation
- Currently, information discovery in databases requires
- Knowledge of schema
- Knowledge of a query language (e.g., SQL, XQuery)
- Knowledge of the role of the keywords
- Our work eliminates these requirements
6Motivation - Example
7Roadmap
- Motivation
- Related Work
- Preliminary Work
- Future Work
8Related Work - Structured Queries
- XML
- XQuery [29], XML-QL [30], Quilt [31]
- for $t in document("db.xml")/items
- where $t/text() = "Greek recipe"
- return <dish>$t</dish>
- Relational
- SQL, QBE, Datalog
- SELECT *
- FROM Complaints C
- WHERE C.comments = 'disk crash'
9Related Work - Structured Queries with IR-ranking
- XML
- XIRQL [26], Florescu et al. [10], ELIXIR [27]
- for $t in document("db.xml")/items
- where $t/text() ~ "Greek recipe"
- return <dish>$t</dish>
- Relational
- WHIRL [25], Masermann et al. [11], Oracle Intermedia
- SELECT *
- FROM Complaints C
- WHERE CONTAINS (C.comments, 'disk crash', 1) > 0
- ORDER BY score(1) DESC
10Related Work Keyword Queries
- DBXplorer. S. Agrawal et al. ICDE 2002
- Three step architecture
- Drawbacks
- Incomplete solutions (relations are not re-used)
- Poor performance (no common subexpression reusability)
- BANKS. G. Bhalotia et al. ICDE 2002
- Database viewed as graph
- No schema info
- Steiner tree problem approximations
- Proximity searching in databases. R. Goldman et al. VLDB 1998
- Database viewed as graph
- No schema info
- hub nodes
11Related Work
[Figure: related work classified by type of queries and by presence of schema in keyword queries; legend: our published work, our submitted work]
12Roadmap
- Motivation
- Related Work
- Preliminary Work
- Future Work
13Preliminary Work
- DISCOVER: Keyword Search in Relational Databases.
- Vagelis Hristidis, Yannis Papakonstantinou
- VLDB, 2002
- Keyword Proximity Search on XML Graphs.
- Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin
- ICDE, 2003
- Efficient IR-Style Keyword Search over Relational Databases.
- Vagelis Hristidis, Luis Gravano, Yannis Papakonstantinou
- submitted
- Adding context to XML keyword queries.
- Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou, Divesh Srivastava
- submitted
- A System for Keyword Search on XML Databases
- Balmin, Hristidis, Koudas, Papakonstantinou, Srivastava, Wang
- submitted as demo proposal
14Roadmap
- Motivation
- Related Work
- Preliminary Work
- DISCOVER
- XKeyword
- IR-Ranking Top-k Algorithms
- Keyword Search on Trees
- Future Work
15Keyword Query - Semantics
- Keywords are
- in same node (XML node/tuple)
- in nodes of same type (XML type/relation)
- data/metadata
- connected (through edges/primary-foreign key relationships)
- Score of result
- distance of keywords within a node
- distance between keywords in edges
- weighted distance
- IR-style ranking
- random walk probability
- keywords contained
- combination of the above
16Result of Keyword Query
- Result is tree T of nodes where
- each edge corresponds to an edge of the data graph
- every keyword is contained in a node of T
- no node of T is redundant (minimal)
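The three conditions above can be checked mechanically. The following is a minimal Python sketch, not the DISCOVER implementation: the function names, the representation of the data graph as undirected edge pairs, and the reading of minimality as "removing any leaf loses a keyword" are all assumptions made for illustration.

```python
from collections import defaultdict

def _connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == nodes

def is_result(tree_nodes, tree_edges, data_edges, node_words, keywords):
    """Check the result conditions for a candidate tree T (hypothetical helper)."""
    tree_nodes = set(tree_nodes)
    data = {frozenset(e) for e in data_edges}
    # Condition 1: every edge of T is an edge of the data graph.
    if any(frozenset(e) not in data for e in tree_edges):
        return False
    # T must be a tree: connected, with |nodes| - 1 edges.
    if len(tree_edges) != len(tree_nodes) - 1 or not _connected(tree_nodes, tree_edges):
        return False

    def covers(nodes):
        words = set()
        for n in nodes:
            words |= node_words[n]
        return set(keywords) <= words

    # Condition 2: every keyword is contained in some node of T.
    if not covers(tree_nodes):
        return False
    # Condition 3 (assumed reading of "minimal"): dropping any leaf loses a keyword.
    degree = defaultdict(int)
    for a, b in tree_edges:
        degree[a] += 1
        degree[b] += 1
    leaves = [n for n in tree_nodes if degree[n] <= 1]
    return all(not covers(tree_nodes - {leaf}) for leaf in leaves)

# Toy data graph: two orders of the same customer, and the customer's nation.
data_edges = [("o1", "c1"), ("o2", "c1"), ("c1", "n1")]
node_words = {"o1": {"smith"}, "o2": {"miller"}, "c1": set(), "n1": set()}
keywords = ["smith", "miller"]

good = is_result({"o1", "c1", "o2"}, [("o1", "c1"), ("c1", "o2")],
                 data_edges, node_words, keywords)
padded = is_result({"o1", "c1", "o2", "n1"},
                   [("o1", "c1"), ("c1", "o2"), ("c1", "n1")],
                   data_edges, node_words, keywords)
```

Here `good` is a valid result, while `padded` fails the minimality check because the nation node adds no keyword.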
17Example (Relational) - Schema
Subset of TPC-H schema
[Figure: ORDERS n:1 CUSTOMER, CUSTOMER n:1 NATION]
18Example (Relational) - Data
19Example (Relational) - Keyword Query
Query: Smith, Miller
20Example - Keyword Query
Query: Smith, Miller
Results
21Example - Keyword Query
Query: Smith, Miller
Results
Smaller sizes usually denote tighter association between keywords
22DISCOVER - Architecture
24Candidate Networks Generator - Challenges
- A keyword may appear in multiple tuples
- candidate networks can be too big (sometimes unbounded)
25Candidate Network - Example
[Figure: tuple-set graph with ORDERS^Smith, ORDERS^Miller and ORDERS each n:1 to CUSTOMER, and CUSTOMER n:1 to NATION]
26Candidate Network - Example
CN1: O^Smith ← C → O^Miller (size 2)
27Candidate Network - Example
CN1: O^Smith ← C → O^Miller (size 2)
CN2: O^Smith ← C ← N → C → O^Miller (size 4)
28Candidate Network - Example
CN3: O^Smith ← C → O^Miller ← C (size 3)
29Candidate Network - Example
CN4: O^Smith ← C → O ← C → O^Miller (size 4)
- c1 → o ← c2
- c1 = c2, because of the primary-to-foreign-key relationship from CUSTOMER to ORDERS
- Pruning condition: R^K → S ← R^L
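The pruning argument can be sketched in Python. This is an illustrative check only, assuming edges point from the primary-key side to the foreign-key side (as in the CUSTOMER to ORDERS remark above) and that each pair of relations is linked by at most one foreign key; `violates_pruning` and the node names are hypothetical.

```python
from collections import defaultdict

def violates_pruning(node_relation, edges):
    """Return True if some node has two parents from the same relation.

    node_relation maps each network node to its relation name.
    edges are (pk_node, fk_node) pairs in primary-to-foreign-key direction.
    A foreign-key tuple references a single primary-key tuple, so two
    parents from the same relation would have to be the same tuple.
    """
    parents = defaultdict(list)
    for pk_node, fk_node in edges:
        parents[fk_node].append(node_relation[pk_node])
    return any(len(rels) != len(set(rels)) for rels in parents.values())

# CN1: one customer with an order containing Smith and an order containing Miller.
cn1_nodes = {"o_smith": "ORDERS", "c": "CUSTOMER", "o_miller": "ORDERS"}
cn1_edges = [("c", "o_smith"), ("c", "o_miller")]

# CN4: a free ORDERS node in the middle with two CUSTOMER parents.
cn4_nodes = {"o_smith": "ORDERS", "c1": "CUSTOMER", "o": "ORDERS",
             "c2": "CUSTOMER", "o_miller": "ORDERS"}
cn4_edges = [("c1", "o_smith"), ("c1", "o"), ("c2", "o"), ("c2", "o_miller")]

cn1_pruned = violates_pruning(cn1_nodes, cn1_edges)
cn4_pruned = violates_pruning(cn4_nodes, cn4_edges)
```

Under these assumptions CN1 is kept and CN4 is pruned, since the middle order would need two distinct customers.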
30Candidate Networks Generator is Complete and Non-Redundant
- Prove that the set of Candidate Networks generated is
- Complete: every solution is generated by some CN
- Non-redundant: for each CN there is a database instance where removing that CN loses a solution
31Size of Candidate Networks may be Unbounded
- Size is unbounded iff schema graph G has one of the following properties
- There is a node of G that has at least two incoming edges.
- e.g., PARTSUPP → LINEITEM ← ORDERS
- G has a directed cycle.
- e.g., ancestor schemas
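The two conditions can be tested directly on a schema graph. The sketch below is an assumption-laden illustration: the schema is given as a list of directed edges, and the function name `cn_size_unbounded` and the PERSON self-reference used for the ancestor example are hypothetical.

```python
def cn_size_unbounded(nodes, edges):
    """True if some node has >= 2 incoming edges or the graph has a directed cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        indegree[v] += 1
        successors[u].append(v)
    # Property 1: a node with at least two incoming edges.
    if any(d >= 2 for d in indegree.values()):
        return True
    # Property 2: a directed cycle, found by depth-first search with colors.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def has_cycle(u):
        color[u] = GREY
        for v in successors[u]:
            if color[v] == GREY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[n] == WHITE and has_cycle(n) for n in nodes)

chain = cn_size_unbounded(
    ["NATION", "CUSTOMER", "ORDERS"],
    [("NATION", "CUSTOMER"), ("CUSTOMER", "ORDERS")])
two_incoming = cn_size_unbounded(
    ["PARTSUPP", "LINEITEM", "ORDERS"],
    [("PARTSUPP", "LINEITEM"), ("ORDERS", "LINEITEM")])
ancestor = cn_size_unbounded(["PERSON"], [("PERSON", "PERSON")])
```

The three-relation chain is bounded, while the LINEITEM example and the self-referencing schema are not.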
32DISCOVER - Architecture
33Execution Plan - Challenges
- Generated SQL queries are expensive due to joins
- Reusability opportunities
34Execution Plan
- Each CN corresponds to a SQL statement
- CN1: O^Smith ← C → O^Miller
- CN2: O^Smith ← C ← N → C → O^Miller
- Execution Plan
- CN1 ← O^Smith ⋈ C ⋈ O^Miller
- CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
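As a rough illustration of how a CN maps to a SQL statement, the sketch below builds and runs the join for CN1 on an in-memory SQLite database. It assumes the tuple sets of ORDERS containing Smith and Miller have been materialized as tables named `ORDERS_Smith` and `ORDERS_Miller`; those table names, the helper `cn_to_sql`, and the sample rows are hypothetical, and only the TPC-H style column names are taken from the benchmark schema.

```python
import sqlite3

def cn_to_sql(nodes, joins):
    """Build a SELECT for a candidate network.

    nodes: list of (alias, table) pairs, one per CN node.
    joins: list of (left_column, right_column) equality conditions, one per CN edge.
    """
    from_clause = ", ".join(f"{table} {alias}" for alias, table in nodes)
    where_clause = " AND ".join(f"{left} = {right}" for left, right in joins)
    return f"SELECT * FROM {from_clause} WHERE {where_clause}"

cn1_sql = cn_to_sql(
    [("os", "ORDERS_Smith"), ("c", "CUSTOMER"), ("om", "ORDERS_Miller")],
    [("os.o_custkey", "c.c_custkey"), ("om.o_custkey", "c.c_custkey")],
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (c_custkey INTEGER, c_name TEXT);
CREATE TABLE ORDERS_Smith (o_orderkey INTEGER, o_custkey INTEGER);
CREATE TABLE ORDERS_Miller (o_orderkey INTEGER, o_custkey INTEGER);
INSERT INTO CUSTOMER VALUES (1, 'c1');
INSERT INTO CUSTOMER VALUES (2, 'c2');
INSERT INTO ORDERS_Smith VALUES (10, 1);
INSERT INTO ORDERS_Miller VALUES (11, 1);
INSERT INTO ORDERS_Miller VALUES (12, 2);
""")
rows = conn.execute(cn1_sql).fetchall()
```

Only customer 1 has both a Smith order and a Miller order, so the query returns a single joined row.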
35Reuse Common Subexpressions - Example
- Execution Plan
- CN1 ← O^Smith ⋈ C ⋈ O^Miller
- CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
- Optimized Execution Plan
- Temp ← O^Smith ⋈ C
- CN1 ← Temp ⋈ O^Miller
- CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
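The saving from the shared subexpression can be made concrete by counting joins. This is a small sketch that assumes every join costs one unit and that each plan step joins its operands left to right; the dictionary layout and the name `plan_cost` are illustrative.

```python
def plan_cost(plan):
    """Total number of joins in a plan: a step with n operands needs n - 1 joins."""
    return sum(len(operands) - 1 for operands in plan.values())

naive_plan = {
    "CN1": ["O_Smith", "C", "O_Miller"],
    "CN2": ["O_Smith", "C", "N", "C", "O_Miller"],
}
optimized_plan = {
    "Temp": ["O_Smith", "C"],
    "CN1": ["Temp", "O_Miller"],
    "CN2": ["Temp", "N", "C", "O_Miller"],
}

naive_cost = plan_cost(naive_plan)          # 2 + 4 joins
optimized_cost = plan_cost(optimized_plan)  # 1 + 1 + 3 joins
```

Under this cost model the optimized plan needs 5 joins instead of 6.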
36Optimal Reuse of Common Subexpressions is NP-Complete
- Simple cost model: each join has cost 1
- Prove that finding the optimal common subexpressions is NP-complete.
- Proof: reduction from the string compression problem
- Heuristics
37Roadmap
- Motivation
- Related Work
- Preliminary Work
- DISCOVER
- XKeyword
- IR-Ranking Top-k Algorithms
- Keyword Search on Trees
- Future Work
38XKeyword
- Keyword search on XML graphs
- Dedicated system
- Handles storing of XML data in RDBMS
- Compares various decomposition methods
- Smart presentation methods
- Demo on DBLP database at www.db.ucsd.edu/XKeyword
39XKeyword - Presentation
- Target Object is minimum presentation unit.
- As small as possible while meaningful.
40XKeyword Presentation Graph
- Avoid mvd-like duplication of results
- Graph below corresponds to 5210 results
[Figure: presentation graph with paper nodes p1 to p7 and author nodes a1 (Yannis), a2 (Jeffrey Ullman), a3 (Vasilis Vassalos)]
43Roadmap
- Motivation
- Related Work
- Preliminary Work
- DISCOVER
- XKeyword
- IR-Ranking Top-k Algorithms
- Keyword Search on Trees
- Future Work
44IR ranking Top-k results
- Limitations of DISCOVER, XKeyword, DBXplorer
- no leverage of IR ranking techniques
- strict AND-semantics
- all results calculated
45IR ranking Top-k results
- Nodes ranked by IR-ranking function
- Node scores combined by combining function.
- eg
- Not all keywords have to be contained
- (OR-semantics)
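A toy version of this scoring scheme can be sketched as follows. Both functions are stand-ins chosen for illustration, not the ones used in the work: `node_score` counts keyword occurrences as a placeholder for an IR ranking function, and `combine` averages node scores as one possible combining function.

```python
import heapq

def node_score(text, keywords):
    """Placeholder IR score: total number of keyword occurrences in the node text."""
    words = text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)

def combine(node_scores):
    """Hypothetical combining function: average of the node scores."""
    return sum(node_scores) / len(node_scores)

def top_k(results, keywords, k):
    """Return the k best (score, result_index) pairs under OR-semantics."""
    scored = [
        (combine([node_score(text, keywords) for text in nodes]), index)
        for index, nodes in enumerate(results)
    ]
    return heapq.nlargest(k, scored)

# Each result is a list of node texts; a result need not contain every keyword.
results = [
    ["disk crash after disk swap", "customer smith"],
    ["disk replaced quickly"],
    ["network outage", "crash report", "vendor note"],
]
best = top_k(results, ["disk", "crash"], k=2)
```

The second result ranks in the top two even though it lacks "crash", which is the effect of OR-semantics.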
46IR ranking Top-k results
- OR-semantics and IR-ranking give a continuous scale of scores
- Top-k results problem becomes more challenging
- Our top-k algorithm is orders of magnitude faster than previous work
47Top-k ranked queries
- PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries
- Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou
- ACM SIGMOD, 2001
- Algorithms and Applications for answering Ranked Queries using Ranked Views
- Vagelis Hristidis, Yannis Papakonstantinou
- submitted for journal publication
- Multi-Dimensional Processing of Ranked Queries
- Y. Tao, V. Hristidis, D. Papadias, Y. Papakonstantinou
- submitted for journal publication
48Roadmap
- Motivation
- Related Work
- Preliminary Work
- DISCOVER
- XKeyword
- IR-Ranking Top-k Algorithms
- Keyword Search on Trees
- Future Work
49Keyword search on schema-less XML trees
- Smallest connecting tree has LCA as root
- Inverted Index
- L(k1), L(k2): lists of nodes containing keywords
- Naive algorithm complexity O(|L(k1)| · |L(k2)|)
- Our algorithm makes a single pass of lists
- Present results in grouped way
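The naive baseline can be sketched in a few lines. This illustration assumes each XML node is labelled with a Dewey-style path (a tuple of child positions from the root), so the LCA of two nodes is their longest common prefix; the labelling scheme and the function names are assumptions, and the single-pass algorithm itself is not shown.

```python
def lca(a, b):
    """LCA of two nodes given as Dewey paths: their longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def naive_connecting_trees(list1, list2):
    """Try every pair from the two inverted lists: |L(k1)| * |L(k2)| LCA computations.

    Returns (size, root, node1, node2) tuples sorted by size, where size is the
    number of edges on the two paths from the LCA down to the keyword nodes.
    """
    trees = []
    for a in list1:
        for b in list2:
            root = lca(a, b)
            size = (len(a) - len(root)) + (len(b) - len(root))
            trees.append((size, root, a, b))
    return sorted(trees)

l_k1 = [(0, 0, 1), (1, 0)]      # nodes containing keyword k1
l_k2 = [(0, 0, 2), (1, 1, 0)]   # nodes containing keyword k2
trees = naive_connecting_trees(l_k1, l_k2)
```

The smallest tree joins the two siblings under node `(0, 0)`; the pairs whose LCA is the document root come last.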
50Roadmap
- Motivation
- Related Work
- Preliminary Work
- Future Work
51Future Work
- Investigate other proximity semantics
- Abstraction from schema design
- Results estimation - Relaxation
52Future Work - Random Walks
- By proximity semantics: same score
- Intuitively, Vagelis-Yannis are closer
53Future Work - Random Walks
[Figure: graph with random-walk transition probabilities on the edges]
Vagelis-Koudas: 1/3 · 1/2 = 1/6
Vagelis-Yannis: 1/3 · 1/2 + 1/3 · 1 + 1/3 · 1 = 5/6
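The arithmetic above can be reproduced with a short Python sketch. The graph is an assumption: one co-authorship graph consistent with the numbers on the slide, where Vagelis has three papers, one of them shared with both Yannis and Koudas and two shared with Yannis only. The walk is assumed to pick neighbours uniformly and never step straight back to the start node; `two_step_probabilities` is a hypothetical helper.

```python
from collections import defaultdict
from fractions import Fraction

def two_step_probabilities(graph, start):
    """Probability of reaching each node in two steps from start, without backtracking."""
    probabilities = defaultdict(Fraction)
    first_hops = graph[start]
    for middle in first_hops:
        p_middle = Fraction(1, len(first_hops))
        # Second step: any neighbour of the middle node except the start node.
        second_hops = [n for n in graph[middle] if n != start]
        for end in second_hops:
            probabilities[end] += p_middle * Fraction(1, len(second_hops))
    return dict(probabilities)

# Assumed author-paper graph (undirected adjacency lists).
graph = {
    "Vagelis": ["p1", "p2", "p3"],
    "p1": ["Vagelis", "Yannis", "Koudas"],
    "p2": ["Vagelis", "Yannis"],
    "p3": ["Vagelis", "Yannis"],
}
walk = two_step_probabilities(graph, "Vagelis")
```

With this graph the walk reaches Koudas with probability 1/6 and Yannis with probability 5/6, matching the intuition that Vagelis and Yannis are closer.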
54Future Work - Random Walks: Open Issues
- Semantics
- role of direction of edges
- length of paths
- more than 2 keywords
- Proposed semantics for 2 nodes A, B
- sum of probabilities of circuits A → B → A
- ignore direction of edges
- stop walking on paths with prob < ε
55Future Work - Random Walks: Open Issues
- Performance
- exact calculation is expensive
- approximation techniques
- Random walk semantics can be generalized for the Web
- Study and use work by algorithms community
56Future Work - Schema Design Abstraction
- Score depends on how attributes are split to relations
- Propose schema-like graph structures independent of specific schema design
- Algorithms that are efficiently reducible to corresponding algorithms on schema
- Random walks can be viewed as schema abstraction technique
57Future Work - Results Estimation
- Top-k algorithms are usually slow when there are very few results.
- Estimating that a priori allows other evaluation ways (e.g., naive execution)
- If there are very few results, relax (e.g., remove the most frequent keyword)
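The relaxation step could look roughly like the sketch below. Everything here is hypothetical: `estimate` stands for whatever a priori result-size estimator is eventually used, `doc_freq` is an assumed table of keyword frequencies, and the loop simply drops the most frequent keyword until enough results are expected.

```python
def relax(keywords, doc_freq):
    """Drop the most frequent keyword from the query."""
    most_frequent = max(keywords, key=lambda k: doc_freq.get(k, 0))
    return [k for k in keywords if k != most_frequent]

def relax_until_enough(keywords, doc_freq, estimate, min_results):
    """Relax the query while the estimated number of results is too small."""
    while len(keywords) > 1 and estimate(keywords) < min_results:
        keywords = relax(keywords, doc_freq)
    return keywords

# Assumed keyword frequencies and a stand-in estimator keyed on query length.
doc_freq = {"smith": 120, "miller": 80, "tv": 900}
fake_estimates = {3: 0, 2: 2, 1: 50}
estimate = lambda kws: fake_estimates[len(kws)]

relaxed = relax_until_enough(["smith", "miller", "tv"], doc_freq, estimate, min_results=1)
```

In this toy run the three-keyword query is estimated to return nothing, so the most frequent keyword is removed and the two-keyword query is kept.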
58Questions?