Title: Keyword Searching and Browsing in Databases using BANKS
1Keyword Searching and Browsing in Databases using
BANKS
- Arvind Hulgeri
- Joint work with
- Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan
- Dept of Computer Science and Engg.
- Indian Institute of Technology Bombay
2Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
4Motivation
- Keyword search of documents on the Web has been enormously successful
- Simple and intuitive, no need to learn any query language
- Database querying using keywords is desirable
- SQL is not appropriate for casual users
- Form interfaces are cumbersome
- Require a separate form for each type of query, confusing for casual users of Web information systems
- Not suitable for ad hoc queries
5Motivation
- Many Web documents are dynamically generated from databases
- E.g. Catalog data
- Keyword querying of generated Web documents
- May miss answers that need to combine information on different pages
- Suffers from duplication overheads
6Examples of Keyword Queries
- On a book store database
- sudarshan databases
- On a travel reservation database
- mumbai hong-kong
- On a university database
- database course
- On an e-store database
- camcorder panasonic
7Differences from IR/Web Search
- A logical unit of information is split across multiple tuples due to normalization
- E.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id)
- Different keywords may match tuples from different relations
- What joins are to be computed can only be decided on the fly
8Information Splitting: An example
[Figure: the paper "MultiQuery Optimization" (Papers) and the authors Sudarshan and Prasan Roy (Authors), linked through Writes]
9Connectivity
- Tuples may be connected by
- Foreign key and object references
- Inclusion dependencies and join conditions
- Implicit links (shared words), etc.
- Would like to find sets of (closely) connected
tuples that match all given keywords
10Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
11Basic Model
- Database modeled as a graph
- Nodes: tuples
- Edges: references between tuples
- foreign key, inclusion dependencies, ..
- Edges are directed.
[Figure: example graph with paper nodes "MultiQuery Optimization" and "BANKS Keyword search", writes nodes, and author nodes S. Sudarshan, Prasan Roy and Charuta]
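A minimal sketch of the graph model above, assuming tuples are identified by hypothetical (relation, key) pairs and that each foreign-key reference becomes one directed edge from the referencing tuple to the referenced tuple. The identifiers and the helper name `build_graph` are illustrative and do not come from the BANKS system.

```python
from collections import defaultdict

# Hypothetical foreign-key references: (referencing tuple, referenced tuple).
references = [
    (("writes", "w1"), ("paper", "p1")),
    (("writes", "w1"), ("author", "a1")),
    (("writes", "w2"), ("paper", "p1")),
    (("writes", "w2"), ("author", "a2")),
]

def build_graph(references):
    """Return (forward adjacency lists, indegree per node) for the tuple graph."""
    forward = defaultdict(list)
    indegree = defaultdict(int)
    for src, dst in references:
        forward[src].append(dst)   # edge in the direction of the reference
        indegree[dst] += 1         # count references pointing at dst
    return forward, indegree

forward, indegree = build_graph(references)
```

The indegree counts are kept because the later slides use them for backward edge weights and node prestige.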
12Answer Example
Query: sudarshan roy
[Figure: answer tree with the paper node "MultiQuery Optimization" connected through two writes nodes to the author nodes S. Sudarshan and Prasan Roy]
13The BANKS Answer Model
- Query: set of keywords k1, k2, .., kn
- Each keyword ki matches a set of nodes Si
- Answer: rooted, directed (Steiner) tree connecting nodes, with one node from each Si
- Root node (we call it an information node) has special significance, may be restricted to some relations
- E.g. relations representing entities, not relationships
- Multiple answers ranked by relevance
14Ranking the answers
- Answers are ranked based on their relevance
- Relevance: a function of proximity and prestige
- Proximity: a function of the edge weights of an answer tree (edge score E)
- Prestige: a function of the node weights (node score N)
15Edge Directionality
- Some popular tuples are connected to many other tuples
- E.g. Students -> departments -> university
- Popular tuples would create misleading shortcuts from every tuple to every other
- E.g. every student would be closely linked with every other student via the department/university
- Solution: define different forward and backward edge weights
- Forward edges: in the direction of the foreign key reference
16Edge Weight
- Weight of forward edge: based on schema
- e.g. cites link weights > writes link weights
- Weight of backward edge = indegree of edges pointing to the node
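A minimal sketch of the weighting rule above, assuming a hypothetical per-link-type table of forward weights (the values 2.0 and 1.0 are illustrative only) and taking the backward weight to be the indegree of the referenced node, as the slide states. The function name `edge_weights` is an assumption for illustration.

```python
from collections import Counter

# Assumed schema-level forward weights; only the ordering cites > writes
# comes from the slide, the numbers are made up.
FORWARD_WEIGHT = {"cites": 2.0, "writes": 1.0}

def edge_weights(fk_edges):
    """fk_edges: list of (src, dst, link_type) in foreign-key direction.

    Returns {(u, v): weight} containing a forward edge src -> dst and a
    backward edge dst -> src for every reference.
    """
    indegree = Counter(dst for _, dst, _ in fk_edges)
    weights = {}
    for src, dst, link_type in fk_edges:
        weights[(src, dst)] = FORWARD_WEIGHT[link_type]   # forward edge
        weights[(dst, src)] = float(indegree[dst])        # backward edge
    return weights

example = edge_weights([
    ("w1", "p1", "writes"),
    ("w2", "p1", "writes"),
    ("w1", "a1", "writes"),
])
```

With this rule a heavily referenced node (such as a department or university) gets expensive backward edges, so it stops acting as a shortcut between the tuples that reference it.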
17Edge Weight Scaling
- Problem: some backward edges have unduly large weights
- Scale edge weights by using log(1 + raw-edge-weight)
- For an answer tree
- total-edge-weight = Σ edge-weights
- Edge score E = 1 / total-edge-weight
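A minimal sketch of the edge score, assuming natural logarithms (the slide does not give the base) and assuming that a tree with no edges gets an infinite score, a case the slide does not cover. The function names are illustrative.

```python
import math

def scaled_weight(raw_edge_weight):
    """Dampen large raw weights: log(1 + raw-edge-weight)."""
    return math.log(1 + raw_edge_weight)

def edge_score(raw_edge_weights):
    """E = 1 / (sum of scaled edge weights) for one answer tree."""
    total = sum(scaled_weight(w) for w in raw_edge_weights)
    if total == 0:
        return math.inf  # assumption: a single-node answer has no edge cost
    return 1.0 / total
```

A tree containing one backward edge of raw weight 100 still scores lower than a tree whose edges all have raw weight 1, but far less drastically than it would without the log scaling.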
18Node Weight
- Nodes have prestige weights too
- Observation: nodes with intuitively greater prestige tend to have greater indegree
- Node weight = indegree
- Problem: nodes with many in-edges result in skewed answers
- Subdue extreme node weights by using log(1 + indegree)
- For an answer tree
- Node score N = root-node-weight + Σ leaf-node-weights
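A minimal sketch of the node score, reading the slide's formula as the root weight plus the sum of the leaf weights (an interpretation of the slide) and assuming natural logarithms. The function names are illustrative.

```python
import math

def node_weight(indegree):
    """Prestige of one node, dampened as log(1 + indegree)."""
    return math.log(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """N for one answer tree: root weight plus the weights of all leaves."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)
```

Only the root and the leaves (the keyword-matching nodes) contribute here; intermediate nodes of the tree are ignored.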
19Combining Scores
- Problem: how to combine two independent metrics, node weight and edge weight
- Normalize each to 0-1
- Combine using weighting factor λ
- Additive: (1 - λ) E + λ N
- Multiplicative: E × N^λ
- Performance study to compare alternatives and to find reasonable values for λ
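A minimal sketch of the two combination rules, assuming normalization divides by the largest score among the candidate answers (the slide only says "normalize each to 0-1") and writing the weighting factor as `lam`. All names are illustrative.

```python
def normalize(scores):
    """Scale non-negative scores into [0, 1] by dividing by the maximum."""
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]

def additive(e, n, lam):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n

def multiplicative(e, n, lam):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam
```

With `lam = 0` both rules rank by edge score alone; larger values give node prestige more influence.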
20Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
21Finding Answer Trees
- Backward Expanding Search Algorithm
- Intuition: find vertices from which a forward path exists to at least one node from each Si.
- Run concurrent single source shortest path algorithm from each node matching a keyword
- Create an iterator for each node matching a keyword
- Traverse the graph edges in reverse direction
- Intersection of the shortest path iterators (corresponding to one node from each Si) results in an answer tree rooted at the intersection node.
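A simplified sketch of the idea above, not the BANKS implementation. It assumes one multi-source Dijkstra run per keyword set Si over reversed edges, instead of one interleaved iterator per matching node, and it keeps only the cheapest tree per root. Node identifiers are assumed to be mutually comparable (e.g. strings), and all names are illustrative.

```python
import heapq
from collections import defaultdict

def reverse_dijkstra(edges, sources):
    """edges: {(u, v): weight} for directed edges u -> v.

    Returns (dist, nxt): dist[v] is the cheapest forward-path cost from v
    to any source, nxt[v] is the next hop on that path (None at a source).
    """
    incoming = defaultdict(list)
    for (u, v), w in edges.items():
        incoming[v].append((u, w))
    dist = {s: 0.0 for s in sources}
    nxt = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w in incoming[v]:  # traverse edges in reverse direction
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                nxt[u] = v
                heapq.heappush(heap, (d + w, u))
    return dist, nxt

def answer_trees(edges, keyword_node_sets):
    """Return [(cost, root, paths)] sorted by cost, one entry per root
    that reaches at least one node of every keyword set."""
    runs = [reverse_dijkstra(edges, s) for s in keyword_node_sets]
    roots = set(runs[0][0])
    for dist, _ in runs[1:]:
        roots &= set(dist)  # reached from every keyword set
    answers = []
    for root in roots:
        cost = sum(dist[root] for dist, _ in runs)
        paths = []
        for _, nxt in runs:
            path, v = [root], root
            while nxt[v] is not None:
                v = nxt[v]
                path.append(v)
            paths.append(path)
        answers.append((cost, root, paths))
    return sorted(answers, key=lambda a: (a[0], a[1]))

toy_edges = {("p1", "w1"): 1.0, ("w1", "a1"): 1.0,
             ("p1", "w2"): 1.0, ("w2", "a2"): 1.0}
toy_answers = answer_trees(toy_edges, [{"a1"}, {"a2"}])
```

In the toy graph only `p1` has forward paths to both keyword nodes, so it is the single root found. The cost used here is the sum of path lengths, which counts a shared edge once per path.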
22Backward Expanding Search
Query: sudarshan roy
[Figure: search starts at the author nodes S. Sudarshan and Prasan Roy]
23Backward Expanding Search
Query: sudarshan roy
[Figure: search expands backward from the author nodes S. Sudarshan and Prasan Roy to a writes node]
24Result Ordering
- Answer trees may not be generated in relevance order
- Solution
- Best-first search across all iterators, based on path length
- Insert answers into a buffer (heap)
- Output highest ranked answer from buffer to user when buffer is full
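A minimal sketch of the buffering step, assuming answers arrive as (relevance, answer) pairs in generation order and that "full" means the buffer would exceed a fixed size. The function name and the flush at the end of the search are assumptions for illustration.

```python
import heapq

def emit_in_approximate_rank_order(generated, buffer_size):
    """Yield answers, releasing the best buffered one whenever the
    buffer overflows, then draining the buffer in rank order."""
    heap = []  # max-heap emulated with negated relevance
    for seq, (relevance, answer) in enumerate(generated):
        # seq breaks ties so answers themselves are never compared
        heapq.heappush(heap, (-relevance, seq, answer))
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)[2]
    while heap:
        yield heapq.heappop(heap)[2]

stream = [(0.5, "a"), (0.9, "b"), (0.7, "c"), (0.95, "d")]
ordered = list(emit_in_approximate_rank_order(stream, buffer_size=2))
```

The output order is only approximately by relevance: in the example, "b" is emitted before the better answer "d" because "d" had not been generated yet. A larger buffer trades latency for a better ordering.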
25Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
26The BANKS System
- Connects to any database using JDBC
- JDBC metadata features used to provide schema browsing
- No programming needed for customization
- Minimal preprocessing of database to create indices and give weights to links
- Extensive set of browsing features
[Figure: system architecture. The User talks over HTTP to BANKS (Web Server + Servlets), which talks over JDBC to the Database]
27The BANKS System
- BANKS provides keyword search coupled with extensive browsing facilities
- Schema browsing + data browsing
- Graphical display of data
- Implemented using Java servlets
- Keyword search response times typically 1 to 3 seconds on
- DBLP database with 100,000 tuples/300,000 edges
- P3 600 MHz, 512 MB RAM
- Check online demo with (part of) DBLP data
- http://www.cse.iitb.ac.in/banks
28Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
29Example of Browsing in BANKS
30Anecdotes
- Query: Mohan
- Returns C. Mohan at top based on prestige (number of papers written)
- Query: Transaction
- Returns Jim Gray's classic paper and textbook as top answers based on prestige (number of citations)
- Query: Sunita Seltzer
- No common papers, but both have papers with Stonebraker; system finds this connection
31Effect of Parameters
- Log scaling of edge weights worked well
- (1 - λ) E + λ N versus E × N^λ: made little difference
- Best with λ = 0.2 (subdue node weights but not entirely)
[Figure: results plot, label EdgeLog]
32Roadmap
- Motivation
- Model
- Algorithm
- System
- Results
- Related Work
- Conclusion
33Related Work
- DataSpot (DTL)/Mercado Intuifind [Dar et al., VLDB 98]
- Based on patent by Palmon (filed 1995, granted 1998)
- Based on hypergraph model, similar answer model to ours
- Differences: our model of backward link weights and prestige
- Proximity Search [Goldman et al., VLDB 98]
- Different model of proximity based on adding up support
- No edge weights, prestige; different evaluation algorithm
- Information units (linked Web pages) [Li et al., WWW10]
- No directionality, only studied in Web context
- Microsoft DBExplorer [Agrawal et al., this conference]
- No ranking, based on SQL generation
- Addresses efficient construction of text indexes
34Conclusions and Future Work
- Keyword searching in databases: important and practical
- Future work
- BANKS for XML
- Disambiguating queries by selecting
- Tree structure: coauthors or cites
- Boolean queries, thesaurus
- Metadata: column/relation names
35Thesis Work
- Memory cognizant query optimization
- Parametric query optimization (PQO)
- For linear cost functions
- For piecewise linear cost function
- Proposed PQO for p2p-XML-stream?!!!
36Memory Cognizant Query Optimization
- Deals with
- dividing available memory amongst the operators running simultaneously in a pipeline
- deciding where to break a "big" pipeline
- integrating these decisions into an optimizer
37Parametric Query Optimization (PQO)
- The cost of a query plan depends on many parameters
- PQO optimizes a query into a number of candidate plans, each optimal for some region of the parameter space
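A minimal sketch of how a set of candidate plans could be used at run time, assuming each plan's cost is a linear function of the parameters (constant plus a weighted sum). The plan names, coefficients and the single selectivity parameter are hypothetical and chosen only to show that different plans win in different regions of the parameter space.

```python
def plan_cost(coeffs, constant, params):
    """Linear cost: constant + sum(coeff_i * param_i)."""
    return constant + sum(c * p for c, p in zip(coeffs, params))

def choose_plan(plans, params):
    """Pick the candidate plan that is cheapest at the given parameter values."""
    return min(plans, key=lambda name: plan_cost(*plans[name], params))

# Hypothetical candidates: name -> (coefficients, constant).
plans = {
    "index_scan": ((100.0,), 1.0),   # cost grows with selectivity
    "seq_scan": ((0.0,), 50.0),      # cost independent of selectivity
}
```

Here `choose_plan(plans, (0.1,))` selects `index_scan` and `choose_plan(plans, (0.9,))` selects `seq_scan`.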
38PQO for Linear Cost functions
- We have developed an algorithm which is
- Simple
- Minimally intrusive
- Works for arbitrary number of parameters
- Almost optimal (in terms of number of invocations
of the conventional optimizer)
39PQO for Piecewise Linear Cost Functions
- We have extended existing optimizers and the extensions are
- Intrusive
- Works for arbitrary number of parameters
- Very general since nonlinear and discontinuous cost functions can be approximated to piecewise linear form
- We showed how to extend the System R algorithm
- We have also extended the Volcano optimization algorithm
40Publications
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE 2002
- Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3): 22-32 (2001)
- Arvind Hulgeri, S. Seshadri, S. Sudarshan: Memory Cognizant Query Optimization. COMAD 2000
- Arvind Hulgeri, S. Sudarshan: Parametric Query Optimization for Linear and Piecewise Linear Cost Functions. Submitted to VLDB 2002
- Available from my homepage: http://www.cse.iitb.ac.in/aru
41Thank you!
- ChnageFontToWebdings(Thank you!)
42BANKS Query Result Example
44Browsing Features
- Hyperlinks are automatically added to all displayed results
- Template facilities to do a variety of tasks
- Browsing data by grouping and creating crosstabs
- e.g., theses grouped by department and year
- Hierarchical views of data
- Nested XML style, even on relational data
- Graphical displays
- Bar charts, pie charts, etc
- Templates are generic and can be applied on any data matching assumed schema
- Can be applied after applying selections
- New templates can be created by user, interactively
45Combining Keyword Search and Browsing
- Catalog searching applications
- Keywords may restrict answers to a small set, then user needs to browse answers
- If there are multiple answers, hierarchical browsing required on the answers
46Keyword Searching and Browsing in Databases using
BANKS
- Charuta Nakhe
- Joint work with
- Gaurav Bhalotia, Arvind Hulgeri, Soumen Chakrabarti, S. Sudarshan
- I.I.T. Bombay
47Query: Bangalore Hong-Kong
[Figure: graph over the relations Cities (Bangalore, Mumbai, Hong-Kong), Trains (Udyan Exp.) and Flights (Jet Air 411, Cathay 123), with SRC and DST edges labeled x, y and z, where x < z < y]