1
Keyword Searching and Browsing in Databases using BANKS

Arvind Hulgeri
Joint work with Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan
Dept. of Computer Science and Engg.
Indian Institute of Technology Bombay

2
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

4
Motivation

Keyword search of documents on the Web has been
enormously successful
Simple and intuitive, no need to learn any query
language
Database querying using keywords is desirable
SQL is not appropriate for casual users
Form interfaces are cumbersome
Require a separate form for each type of query; confusing for casual users of Web information systems
Not suitable for ad hoc queries

5
Motivation

Many Web documents are dynamically generated from
databases
E.g. Catalog data
Keyword querying of generated Web documents
May miss answers that need to combine information
on different pages
Suffers from duplication overheads

6
Examples of Keyword Queries

On a book store database
sudarshan databases
On a travel reservation database
mumbai hong-kong
On a university database
database course
On an e-store database
camcorder panasonic

7
Differences from IR/Web Search

A logical unit of information is split across multiple tuples due to normalization
E.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id)
Different keywords may match tuples from different relations
What joins are to be computed can only be decided on the fly

8
Information Splitting: An Example
[Figure: the paper "MultiQuery Optimization" in Papers is linked through Writes tuples to the authors Sudarshan and Prasan Roy in Authors]
9
Connectivity

Tuples may be connected by
Foreign key and object references
Inclusion dependencies and join conditions
Implicit links (shared words), etc.
Would like to find sets of (closely) connected
tuples that match all given keywords

10
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

11
Basic Model

Database modeled as a graph
Nodes: tuples
Edges: references between tuples (foreign keys, inclusion dependencies, ...)
Edges are directed

[Figure: example graph in which paper nodes ("MultiQuery Optimization", "BANKS Keyword search") are linked through writes nodes to author nodes (S. Sudarshan, Prasan Roy, Charuta)]
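
The graph model on this slide can be sketched in a few lines of Python. This is only an illustration, not the BANKS implementation: the tuple ids (`p1`, `a1`, `w1`, ...), the assignment of writes tuples to papers and authors, and the `build_graph` helper are assumptions made for the example. Each tuple is a node, and each foreign-key reference is a directed edge from the referencing tuple to the referenced one.

```python
from collections import defaultdict

# Hypothetical labels for the tuples shown in the figure; the ids are invented.
LABELS = {
    "p1": "MultiQuery Optimization",
    "p2": "BANKS Keyword search",
    "a1": "S. Sudarshan",
    "a2": "Prasan Roy",
    "a3": "Charuta",
}

# Assumed foreign-key references: (referencing tuple, referenced tuple).
# Each writes tuple w* points at one paper and one author.
REFERENCES = [
    ("w1", "p1"), ("w1", "a1"),
    ("w2", "p1"), ("w2", "a2"),
    ("w3", "p2"), ("w3", "a1"),
    ("w4", "p2"), ("w4", "a3"),
]


def build_graph(references):
    """Return (out_edges, in_edges) adjacency maps for directed edges."""
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for src, dst in references:
        out_edges[src].append(dst)
        in_edges[dst].append(src)
    return out_edges, in_edges


out_edges, in_edges = build_graph(REFERENCES)
print(out_edges["w1"])      # tuples referenced by w1
print(len(in_edges["a1"]))  # indegree of the S. Sudarshan node
```

Keeping both adjacency maps is a convenience: the in-edge map gives the indegree used for weighting and supports the reverse traversal used later in the talk.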
12
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper "MultiQuery Optimization", connected through two writes tuples to the authors S. Sudarshan and Prasan Roy]
13
The BANKS Answer Model

Query: set of keywords k1, k2, ..., kn
Each keyword ki matches a set of nodes Si
Answer: rooted, directed (Steiner) tree connecting nodes, with one node from each Si
Root node (we call it an information node) has special significance; may be restricted to some relations
E.g. relations representing entities, not relationships
Multiple answers are ranked by relevance
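
As a rough illustration of how each keyword ki yields a node set Si, the sketch below builds a small in-memory inverted index over node text. The node ids, the `match_keywords` helper, and the simple tokenization are assumptions for this example; the slides do not describe how BANKS indexes text.

```python
from collections import defaultdict


def match_keywords(node_text, keywords):
    """Return [S1, ..., Sn]: for each keyword, the set of nodes containing it."""
    index = defaultdict(set)
    for node, text in node_text.items():
        # Assumed tokenization: lowercase, drop periods, split on whitespace.
        for token in text.lower().replace(".", " ").split():
            index[token].add(node)
    return [set(index.get(k.lower(), ())) for k in keywords]


# Hypothetical node ids for the tuples used in the running example.
NODE_TEXT = {
    "p1": "MultiQuery Optimization",
    "a1": "S. Sudarshan",
    "a2": "Prasan Roy",
}

S = match_keywords(NODE_TEXT, ["sudarshan", "roy"])
print(S)
```

An answer tree must then contain one node from each set in `S`.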

14
Ranking the answers

Answers are ranked based on their relevance
Relevance: a function of proximity and prestige
Proximity: a function of the edge weights of an answer tree (edge score E)
Prestige: a function of the node weights (node score N)

15
Edge Directionality

Some popular tuples are connected to many other tuples
E.g. students -> departments -> university
Popular tuples would create misleading shortcuts from every tuple to every other
E.g. every student would be closely linked with every other student via the department/university
Solution: define different forward and backward edge weights
Forward edges: in the direction of the foreign key reference

16
Edge Weight

Weight of forward edge: based on schema
E.g. cites link weights > writes link weights
Weight of backward edge: indegree of edges pointing to the node

17
Edge Weight Scaling

Problem: some backward edges have unduly large weights
Scale edge weights using log(1 + raw-edge-weight)
For an answer tree:
total-edge-weight = Σ edge-weights
Edge score E = 1 / total-edge-weight
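
The edge score can be worked through on a small example. The sketch below is a minimal illustration under stated assumptions: the slide does not give the base of the logarithm, so base 2 is assumed, and the raw weights (two forward edges of weight 1 and one backward edge of weight 7) are made up. The function names are hypothetical.

```python
import math


def scaled_weight(raw_weight):
    """Log-scale a raw edge weight: log(1 + raw). Base 2 is an assumption."""
    return math.log2(1 + raw_weight)


def edge_score(raw_weights):
    """E = 1 / (sum of scaled edge weights of the answer tree)."""
    if not raw_weights:
        raise ValueError("answer tree has no edges")
    total = sum(scaled_weight(w) for w in raw_weights)
    return 1.0 / total


# Two forward edges (raw weight 1) and one backward edge into a node
# with indegree 7 (raw weight 7); these numbers are invented.
tree_edges = [1, 1, 7]
E = edge_score(tree_edges)
print(E)
```

With these numbers the scaled weights are 1, 1 and 3, so the total is 5 and E is 0.2. Without scaling, the backward edge alone would contribute 7 and dominate the score.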

18
Node Weight

Nodes have prestige weights too
Observation: nodes with intuitively greater prestige tend to have greater indegree
Node weight = indegree
Problem: nodes with many in-edges result in skewed answers
Subdue extreme node weights using log(1 + indegree)
For an answer tree:
Node score N = root-node-weight + Σ leaf-node-weights
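
A small worked example of the node score follows. It is a sketch only: the logarithm base (2) and the indegrees are assumptions, and the function names are hypothetical. The score is read here as the root weight plus the sum of the leaf weights.

```python
import math


def node_weight(indegree):
    """Subdued prestige weight: log(1 + indegree). Base 2 is an assumption."""
    return math.log2(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """N = root node weight + sum of leaf node weights."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)


# Invented indegrees: root has 3 in-edges, the two leaves have 1 and 7.
N = node_score(3, [1, 7])
print(N)
```

Here the weights are 2 for the root and 1 and 3 for the leaves, giving N = 6. Using raw indegrees instead would give 11, with the high-indegree leaf contributing most of it.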

19
Combining Scores

Problem: how to combine two independent metrics, node weight and edge weight
Normalize each to the range 0-1
Combine using a weighting factor λ
Additive: (1 - λ) E + λ N
Multiplicative: E × N^λ
Performance study to compare alternatives and to find reasonable values for λ
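
The two combination rules can be compared side by side. In the sketch below, the normalization step (dividing by the largest value seen) is an assumption, since the slide only says that each score is normalized to 0-1; the function names and sample scores are also made up.

```python
def normalize(value, max_value):
    """Assumed normalization: divide by the maximum observed value."""
    return value / max_value if max_value > 0 else 0.0


def additive(e, n, lam):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n


def multiplicative(e, n, lam):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam


# Invented scores for one answer tree, already on comparable scales.
e = normalize(0.1, 0.2)   # edge score relative to the best edge score
n = normalize(6.0, 6.0)   # node score relative to the best node score
print(additive(e, n, 0.2), multiplicative(e, n, 0.2))
```

With lam = 0 both rules reduce to the edge score alone, so lam controls how much prestige is allowed to influence the ranking.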

20
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

21
Finding Answer Trees

Backward Expanding Search Algorithm
Intuition: find vertices from which a forward path exists to at least one node from each Si
Run concurrent single-source shortest path algorithms from each node matching a keyword
Create an iterator for each node matching a keyword
Traverse the graph edges in the reverse direction
Intersection of the shortest path iterators (corresponding to one node from each Si) results in an answer tree rooted at the intersection node
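
The steps above can be sketched as follows. This is a simplified illustration and not the BANKS code: it reports at most one tree per root, does not restrict roots to information nodes, and orders expansion purely by path length. The node ids, edge weights, and function name are assumptions for the example. Both forward and backward edges are listed explicitly in the edge list.

```python
import heapq
from collections import defaultdict


def backward_expanding_search(edges, keyword_nodes):
    """edges: (u, v, weight) directed edges u -> v.
    keyword_nodes: list of sets S_i, one per keyword.
    Yields (root, paths), where paths[i] is a forward path from root
    to some node in S_i.
    """
    # To expand backward from v we need every u with an edge u -> v.
    reverse = defaultdict(list)
    for u, v, w in edges:
        reverse[v].append((u, w))

    # One shortest-path iterator per keyword node. All iterators share a
    # single heap, so the closest frontier node overall is expanded first.
    heap = []
    for i, nodes in enumerate(keyword_nodes):
        for origin in sorted(nodes):
            heapq.heappush(heap, (0.0, origin, i, origin, (origin,)))

    settled = set()              # (keyword index, origin, node) already expanded
    reached = defaultdict(dict)  # node -> {keyword index: forward path}
    emitted = set()

    while heap:
        dist, node, i, origin, path = heapq.heappop(heap)
        if (i, origin, node) in settled:
            continue
        settled.add((i, origin, node))
        # Keep the first (shortest) path from this node to keyword i.
        reached[node].setdefault(i, path)
        if len(reached[node]) == len(keyword_nodes) and node not in emitted:
            emitted.add(node)
            yield node, [reached[node][k] for k in range(len(keyword_nodes))]
        for prev, w in reverse[node]:
            if (i, origin, prev) not in settled:
                heapq.heappush(heap, (dist + w, prev, i, origin, (prev,) + path))


# Invented example: writes tuples w1, w2 reference paper p1 and authors a1, a2.
EDGES = [
    ("w1", "p1", 1.0), ("w1", "a1", 1.0), ("w2", "p1", 1.0), ("w2", "a2", 1.0),  # forward
    ("p1", "w1", 2.0), ("p1", "w2", 2.0), ("a1", "w1", 1.0), ("a2", "w2", 1.0),  # backward
]
answers = list(backward_expanding_search(EDGES, [{"a1"}, {"a2"}]))
root, paths = answers[0]
print(root, paths)
```

In this example the first tree found is rooted at the paper `p1`, with one path to each author through a writes tuple. Later answers rooted at other nodes are also produced, which is why a separate ranking step is needed.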

22
Backward Expanding Search
Query: sudarshan roy
[Figure: the search starts at the authors nodes S. Sudarshan and Prasan Roy]
23
Backward Expanding Search
Query: sudarshan roy
[Figure: the search expands backward from the authors nodes S. Sudarshan and Prasan Roy to the writes nodes]
24
Result Ordering

Answer trees may not be generated in relevance order
Solution:
Best-first search across all iterators, based on path length
Insert answers into a buffer (heap)
Output the highest-ranked answer from the buffer to the user when the buffer is full
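
The buffering idea can be illustrated with a bounded heap. This is a sketch under assumptions: the `reorder` function, the buffer sizes, and the scores are invented, and "full" is taken to mean that the buffer holds more than `buffer_size` answers.

```python
import heapq


def reorder(answers, buffer_size):
    """answers: iterable of (score, tree) in generation order.
    Yields trees, releasing the best buffered one whenever the buffer
    overflows, then draining the rest in score order.
    """
    heap = []  # max-heap simulated with negated scores
    for seq, (score, tree) in enumerate(answers):
        # seq breaks ties so that trees themselves are never compared.
        heapq.heappush(heap, (-score, seq, tree))
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)[2]
    while heap:
        yield heapq.heappop(heap)[2]


generated = [(0.3, "A"), (0.9, "B"), (0.5, "C"), (0.7, "D")]
print(list(reorder(generated, 2)))
print(list(reorder(generated, 1)))
```

With a buffer of 2 the four answers come out in full relevance order (B, D, C, A); with a buffer of 1 the order is B, C, D, A, which is only approximately sorted. A larger buffer gives better ordering at the cost of a longer wait for the first answer.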

25
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

26
The BANKS System

Connects to any database using JDBC
JDBC metadata features used to provide schema
browsing
No programming needed for customization
Minimal preprocessing of database to create
indices and give weights to links
Extensive set of browsing features

[Figure: architecture. The user connects over HTTP to a Web server running the BANKS servlets, which access the database through JDBC]
27
The BANKS System

BANKS provides keyword search coupled with extensive browsing facilities
Schema browsing and data browsing
Graphical display of data
Implemented using Java servlets
Keyword search response times are typically 1 to 3 seconds on the DBLP database with 100,000 tuples/300,000 edges (P3 600 MHz, 512 MB RAM)
Check the online demo with (part of) DBLP data: http://www.cse.iitb.ac.in/banks

28
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

29
Example of Browsing in BANKS
30
Anecdotes

Query: Mohan
Returns C. Mohan at top based on prestige (number of papers written)
Query: Transaction
Returns Jim Gray's classic paper and textbook as top answers based on prestige (number of citations)
Query: Sunita Seltzer
No common papers, but both have papers with Stonebraker; the system finds this connection

31
Effect of Parameters

Log scaling of edge weights worked well
(1 - λ) E + λ N versus E × N^λ made little difference
Best with λ = 0.2 (subdue node weights but not entirely)

[Figure: results chart; surviving label: EdgeLog]
32
Roadmap

Motivation
Model
Algorithm
System
Results
Related Work
Conclusion

33
Related Work

DataSpot (DTL)/Mercado Intuifind [Dar et al., VLDB 98]
Based on patent by Palmon (filed 1995, granted 1998)
Based on hypergraph model; similar answer model to ours
Differences: our model of backward link weights and prestige
Proximity Search [Goldman et al., VLDB 98]
Different model of proximity, based on adding up support
No edge weights or prestige; different evaluation algorithm
Information units (linked Web pages) [Li et al., WWW10]
No directionality; only studied in Web context
Microsoft DBExplorer [Agrawal et al., this conference]
No ranking; based on SQL generation
Addresses efficient construction of text indexes

34
Conclusions and Future Work

Keyword searching in databases: important and practical
Future work
BANKS for XML
Disambiguating queries by selecting
Tree structure: coauthors or cites
Boolean queries, thesaurus
Metadata: column/relation names

35
Thesis Work

Memory cognizant query optimization
Parametric query optimization (PQO)
For linear cost functions
For piecewise linear cost function
Proposed PQO for p2p-XML-stream?!!!

36
Memory Cognizant Query Optimization

Deals with
dividing available memory amongst the operators
running simultaneously in a pipeline
deciding where to break a "big" pipeline
integrating these decisions into an optimizer

37
Parametric Query Optimization (PQO)

The cost of a query plan depends on many
parameters
PQO optimizes a query into a number of candidate
plans, each optimal for some region of the
parameter space
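
To make the idea of "one optimal plan per region" concrete, the sketch below evaluates two candidate plans whose costs are linear in a single parameter and reports where each one is cheapest. It is not the algorithm developed in this thesis work: it simply samples the parameter, and the plan names, cost coefficients, and function names are invented for illustration.

```python
def optimal_plan(plans, s):
    """plans: name -> (fixed, per_unit); cost(s) = fixed + per_unit * s."""
    return min(plans, key=lambda name: plans[name][0] + plans[name][1] * s)


def regions(plans, samples):
    """Group consecutive sample points by the plan that is cheapest there."""
    result = []
    for s in samples:
        best = optimal_plan(plans, s)
        if result and result[-1][0] == best:
            result[-1] = (best, result[-1][1], s)  # extend the current region
        else:
            result.append((best, s, s))            # start a new region
    return result


# Hypothetical plans with costs linear in a selectivity parameter s in [0, 1].
PLANS = {
    "index_scan": (1.0, 100.0),  # cheap to start, expensive per matching row
    "table_scan": (20.0, 10.0),  # expensive to start, cheap per matching row
}
SAMPLES = [i / 10 for i in range(11)]
print(regions(PLANS, SAMPLES))
```

With these made-up costs the index scan is cheapest for samples up to 0.2 and the table scan from 0.3 onward; the true crossover lies between those two sample points.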

38
PQO for Linear Cost functions

We have developed an algorithm which is
Simple
Minimally intrusive
Works for arbitrary number of parameters
Almost optimal (in terms of number of invocations
of the conventional optimizer)

39
PQO for Piecewise Linear Cost Functions

We have extended existing optimizers; the extensions are
Intrusive
Work for an arbitrary number of parameters
Very general, since nonlinear and discontinuous cost functions can be approximated by piecewise linear forms
We showed how to extend the System R algorithm
We have also extended the Volcano optimization algorithm

40
Publications

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE 2002
Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3): 22-32 (2001)
Arvind Hulgeri, S. Seshadri, S. Sudarshan: Memory Cognizant Query Optimization. COMAD 2000
Arvind Hulgeri, S. Sudarshan: Parametric Query Optimization for Linear and Piecewise Linear Cost Functions. Submitted to VLDB 2002
Available from my homepage: http://www.cse.iitb.ac.in/aru