Title: Keyword Searching and Browsing in Databases using BANKS


1
Keyword Searching and Browsing in Databases using
BANKS
  • Arvind Hulgeri
  • Joint work with
  • Gaurav Bhalotia, Charuta Nakhe, Soumen
    Chakrabarti, S. Sudarshan
  • Dept of Computer Science and Engg.
  • Indian Institute of Technology Bombay

2
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

4
Motivation
  • Keyword search of documents on the Web has been
    enormously successful
  • Simple and intuitive, no need to learn any query
    language
  • Database querying using keywords is desirable
  • SQL is not appropriate for casual users
  • Form interfaces are cumbersome
  • They require a separate form for each type of query,
    which is confusing for casual users of Web
    information systems
  • Not suitable for ad hoc queries

5
Motivation
  • Many Web documents are dynamically generated from
    databases
  • E.g. Catalog data
  • Keyword querying of generated Web documents
  • May miss answers that need to combine information
    on different pages
  • Suffers from duplication overheads

6
Examples of Keyword Queries
  • On a book store database
  • sudarshan databases
  • On a travel reservation database
  • mumbai hong-kong
  • On a university database
  • database course
  • On an e-store database
  • camcorder panasonic

7
Differences from IR/Web Search
  • A logical unit of information is split across
    multiple tuples due to normalization
  • E.g. Paper (paper-id, title, journal),
    Author (author-id, name),
    Writes (author-id, paper-id, position)
  • Different keywords may match tuples from
    different relations
  • What joins are to be computed can only be decided
    on the fly
  • Cites(citing-paper-id, cited-paper-id)
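
As a rough illustration of how one logical unit is split across relations, the sketch below loads the example schema into an in-memory SQLite database and reassembles the authors of one paper. The sample rows, ids, the journal value, and the helper name authors_of_paper are invented for illustration and are not from the talk.

```python
import sqlite3

SCHEMA_AND_SAMPLE_ROWS = """
CREATE TABLE paper  (paper_id INTEGER PRIMARY KEY, title TEXT, journal TEXT);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE writes (author_id INTEGER REFERENCES author(author_id),
                     paper_id  INTEGER REFERENCES paper(paper_id),
                     position  INTEGER);
-- Sample rows are made up; ids, positions and the journal are placeholders.
INSERT INTO paper  VALUES (1, 'MultiQuery Optimization', 'placeholder journal');
INSERT INTO author VALUES (10, 'Prasan Roy'), (11, 'S. Sudarshan');
INSERT INTO writes VALUES (10, 1, 1), (11, 1, 2);
"""


def authors_of_paper(title):
    """Return the author names of a paper, ordered by author position."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_AND_SAMPLE_ROWS)
        rows = conn.execute(
            """SELECT a.name
               FROM paper p
               JOIN writes w ON w.paper_id = p.paper_id
               JOIN author a ON a.author_id = w.author_id
               WHERE p.title = ?
               ORDER BY w.position""",
            (title,),
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

A keyword query names neither the relations nor the join path, so a keyword search system has to discover a two-join connection like this one by itself.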

8
Information Splitting: An Example
[Figure: the paper "MultiQuery Optimization" (Papers) is linked through Writes tuples to the authors Sudarshan and Prasan Roy (Authors)]

9
Connectivity
  • Tuples may be connected by
  • Foreign key and object references
  • Inclusion dependencies and join conditions
  • Implicit links (shared words), etc.
  • Would like to find sets of (closely) connected
    tuples that match all given keywords

10
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

11
Basic Model
  • Database modeled as a graph
  • Nodes: tuples
  • Edges: references between tuples
  • Foreign keys, inclusion dependencies, ...
  • Edges are directed.
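
A minimal sketch of this graph model, assuming tuples are identified by hashable ids and each reference is given as a (referencing, referenced) pair. The function name build_graph and the ids in the example are hypothetical.

```python
def build_graph(tuple_ids, references):
    """Build adjacency sets for a directed tuple graph.

    tuple_ids:  iterable of hashable node ids, one per database tuple.
    references: iterable of (referencing_id, referenced_id) pairs,
                e.g. one pair per foreign-key value.
    Returns (out_edges, in_edges), each mapping a node id to a set of ids.
    """
    nodes = list(tuple_ids)
    out_edges = {node: set() for node in nodes}
    in_edges = {node: set() for node in nodes}
    for src, dst in references:
        out_edges[src].add(dst)   # edge follows the direction of the reference
        in_edges[dst].add(src)
    return out_edges, in_edges


# Hypothetical ids for the tuples of the running example.
TUPLES = ["paper:mqo", "author:sudarshan", "author:roy", "writes:1", "writes:2"]
REFERENCES = [
    ("writes:1", "paper:mqo"), ("writes:1", "author:sudarshan"),
    ("writes:2", "paper:mqo"), ("writes:2", "author:roy"),
]
```

Keeping both adjacency maps makes it cheap to read a node's indegree and to walk edges in either direction.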

[Figure: example graph with paper nodes ("MultiQuery Optimization", "BANKS Keyword search"), author nodes (S. Sudarshan, Prasan Roy, Charuta), and writes nodes linking them]

12
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper "MultiQuery Optimization", connected through two writes nodes to the author nodes S. Sudarshan and Prasan Roy]

13
The BANKS Answer Model
  • Query: a set of keywords k1, k2, ..., kn
  • Each keyword ki matches a set of nodes Si
  • Answer: a rooted, directed (Steiner) tree
    connecting nodes, with one node from each Si
  • Root node (we call it an information node) has
    special significance, may be restricted to some
    relations
  • E.g. relations representing entities, not
    relationships
  • Multiple answers ranked by relevance

14
Ranking the answers
  • Answers are ranked based on their relevance
  • Relevance: a function of proximity and prestige
  • Proximity: a function of the edge weights of an
    answer tree (edge score E)
  • Prestige: a function of the node weights (node
    score N)

15
Edge Directionality
  • Some popular tuples are connected to many other
    tuples
  • E.g. Students -> departments -> university
  • Popular tuples would create misleading shortcuts
    from every tuple to every other
  • E.g. every student would be closely linked with
    every other student via the department/university
  • Solution: define different forward and backward
    edge weights
  • Forward edges: in the direction of the foreign
    key reference

16
Edge Weight
  • Weight of forward edge: based on schema
  • E.g. cites link weights > writes link
    weights
  • Weight of backward edge = number of edges
    pointing to the node (its indegree)
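
The weighting rule can be sketched as follows. This is an illustration under assumptions: every forward edge gets the same weight (the talk derives it from the schema), and when two tuples reference each other the smaller weight is kept. The name edge_weights is hypothetical.

```python
from collections import Counter


def edge_weights(references, forward_weight=1.0):
    """Assign weights to forward and backward edges.

    references: iterable of (referencing_id, referenced_id) pairs.
    A forward edge u -> v gets forward_weight (assumed uniform here).
    The matching backward edge v -> u gets the indegree of v, so that
    paths through heavily referenced tuples become expensive.
    """
    references = list(references)
    indegree = Counter(dst for _, dst in references)
    weights = {}

    def keep_min(edge, weight):
        # Assumption: if an edge is produced twice, keep the lighter one.
        weights[edge] = min(weights.get(edge, float("inf")), weight)

    for src, dst in references:
        keep_min((src, dst), forward_weight)
        keep_min((dst, src), float(indegree[dst]))
    return weights
```

With three students referencing one department, each student -> department edge weighs 1 while each department -> student edge weighs 3, so the student-to-student shortcut through the department costs 4 rather than 2.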

17
Edge Weight Scaling
  • Problem: some backward edges have unduly large
    weights
  • Scale edge weights by using log(1 + raw-edge-weight)
  • For an answer tree:
  • total-edge-weight = Σ edge-weights
  • Edge score E = 1 / total-edge-weight
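
A small sketch of the edge score, assuming the natural logarithm (the slide does not state the base) and an answer tree with at least one edge. The function name edge_score is hypothetical.

```python
import math


def edge_score(raw_edge_weights):
    """Edge score of an answer tree from the raw weights of its edges.

    Each raw weight is damped with log(1 + w), the damped weights are
    summed, and the score is the reciprocal of that total.
    """
    total = sum(math.log(1 + w) for w in raw_edge_weights)
    if total <= 0:
        # Assumption: a tree without (positive-weight) edges has no
        # meaningful edge score in this sketch.
        raise ValueError("answer tree needs at least one weighted edge")
    return 1.0 / total
```

Because of the damping, a tree containing one very heavy backward edge still scores lower than a tree of light edges, but not by the full ratio of the raw weights.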

18
Node Weight
  • Nodes have prestige weights too
  • Observation: nodes with intuitively greater
    prestige tend to have greater indegree
  • Node weight = indegree
  • Problem: nodes with many in-edges result in
    skewed answers
  • Subdue extreme node weights by using
    log(1 + indegree)
  • For an answer tree:
  • Node score N = root-node-weight + Σ
    leaf-node-weights
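
A sketch of the node score, reading the slide's formula as the root weight plus the sum of the leaf weights (an interpretation of the transcript) and assuming the natural logarithm. The function names are hypothetical.

```python
import math


def node_weight(indegree):
    """Prestige weight of a node: its indegree, damped with log(1 + x)."""
    return math.log(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """Node score of an answer tree: root weight plus all leaf weights."""
    return node_weight(root_indegree) + sum(
        node_weight(d) for d in leaf_indegrees
    )
```

For example, a root with indegree 3 and a single leaf with indegree 1 gives log(4) + log(2) = log(8).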

19
Combining Scores
  • Problem: how to combine two independent metrics,
    node weight and edge weight
  • Normalize each to 0-1
  • Combine using weighting factor λ
  • Additive: (1 - λ) E + λ N
  • Multiplicative: E × N^λ
  • Performance study to compare alternatives and to
    find reasonable values for λ
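
The two combination rules can be sketched as below, with the weighting factor called lam. How the scores are normalized to 0-1 is not specified on the slide; dividing by the maximum over the candidate answers is an assumption made here, and all function names are hypothetical.

```python
def normalize(values):
    """Scale non-negative scores into 0-1 by dividing by the maximum.

    Assumption: max-normalization; the talk only says "normalize to 0-1".
    """
    values = list(values)
    top = max(values, default=0.0)
    if top == 0:
        return [0.0 for _ in values]
    return [v / top for v in values]


def combine_additive(e, n, lam):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n


def combine_multiplicative(e, n, lam):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam
```

With lam = 0 both rules rank by edge score alone; increasing lam gives node prestige more influence.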

20
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

21
Finding Answer Trees
  • Backward Expanding Search Algorithm
  • Intuition: find vertices from which a forward
    path exists to at least one node from each Si
  • Run concurrent single-source shortest-path
    algorithms, one from each node matching a keyword
  • Create an iterator for each node matching a
    keyword
  • Traverse the graph edges in the reverse direction
  • An intersection of the shortest-path iterators
    (corresponding to one node from each Si) results
    in an answer tree rooted at the intersection
    node
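
The idea can be sketched in a simplified form. This is not the BANKS implementation: it interleaves the per-node shortest-path searches through a single shared heap, reports each root once together with the nearest matching node per keyword, and does not reconstruct the tree paths or restrict which nodes may be roots. The function name, the node ids, and the edge weights in the example are hypothetical.

```python
import heapq
from itertools import count


def backward_expanding_search(in_edges, keyword_sets):
    """Yield (root, hits) for vertices that can reach every keyword.

    in_edges:     maps vertex v to a list of (u, weight) pairs, one per
                  directed edge u -> v (forward and backward edges alike).
    keyword_sets: list of sets S_i of vertices matching keyword k_i.
    hits[i] is (distance, node) for the node of S_i nearest to the root.
    """
    tie = count()  # tie-breaker so the heap never compares vertex ids
    heap = []
    for origin in set().union(*keyword_sets):
        heapq.heappush(heap, (0.0, next(tie), origin, origin))
    settled = set()   # (origin, vertex) pairs whose distance is final
    nearest = {}      # vertex -> {keyword index: (distance, origin)}
    emitted = set()
    while heap:
        dist, _, origin, v = heapq.heappop(heap)
        if (origin, v) in settled:
            continue
        settled.add((origin, v))
        hits = nearest.setdefault(v, {})
        for i, matches in enumerate(keyword_sets):
            if origin in matches and i not in hits:
                hits[i] = (dist, origin)
        if len(hits) == len(keyword_sets) and v not in emitted:
            emitted.add(v)  # v is reached from every S_i: candidate root
            yield v, [hits[i] for i in range(len(keyword_sets))]
        for u, w in in_edges.get(v, ()):  # follow edges u -> v backwards
            if (origin, u) not in settled:
                heapq.heappush(heap, (dist + w, next(tie), origin, u))


# Illustrative graph: paper p, writes tuples w1/w2, authors a1/a2.
EDGES = [  # (u, v, weight)
    ("w1", "a1", 1.0), ("w1", "p", 1.0), ("w2", "a2", 1.0), ("w2", "p", 1.0),
    ("a1", "w1", 1.0), ("a2", "w2", 1.0), ("p", "w1", 2.0), ("p", "w2", 2.0),
]
IN_EDGES = {}
for u, v, w in EDGES:
    IN_EDGES.setdefault(v, []).append((u, w))
```

On this graph the first root produced for the keyword sets [{"a1"}, {"a2"}] is the paper p, at distance 3 from each author. Later roots such as the writes tuples also qualify in this sketch, which hints at why the answer model may restrict roots to certain relations.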

22
Backward Expanding Search
Query: sudarshan roy
[Figure: the search starts from the authors nodes matching the keywords, S. Sudarshan and Prasan Roy]

23
Backward Expanding Search
Query: sudarshan roy
[Figure: the search has expanded backward from the authors nodes S. Sudarshan and Prasan Roy to writes]

24
Result Ordering
  • Answer trees may not be generated in relevance
    order
  • Solution:
  • Best-first search across all iterators, based on
    path length
  • Insert answers into a buffer (heap)
  • When the buffer is full, output the highest-ranked
    answer from the buffer to the user
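
A minimal sketch of such an output buffer, assuming answers arrive as (score, answer) pairs with higher scores ranking first. The function name and the capacity parameter are hypothetical, and the ordering it produces is only approximate, as on the slide.

```python
import heapq
from itertools import count


def reorder_with_buffer(scored_answers, capacity):
    """Yield answers in approximately decreasing score order.

    scored_answers: iterable of (score, answer) in generation order.
    capacity:       buffer size; once the buffer holds this many
                    answers, the best one is released.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    tie = count()  # keeps the heap from comparing answers on equal scores
    heap = []
    for score, answer in scored_answers:
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(heap, (-score, next(tie), answer))
        if len(heap) >= capacity:
            yield heapq.heappop(heap)[2]
    while heap:  # input exhausted: flush the rest, best first
        yield heapq.heappop(heap)[2]
```

A larger buffer brings the output closer to true relevance order at the cost of a longer delay before the first answer is shown; a capacity of 1 simply passes answers through in generation order.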

25
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

26
The BANKS System
  • Connects to any database using JDBC
  • JDBC metadata features used to provide schema
    browsing
  • No programming needed for customization
  • Minimal preprocessing of database to create
    indices and give weights to links
  • Extensive set of browsing features

[Figure: BANKS architecture: User, connected over HTTP to a Web server with servlets, connected over JDBC to the Database]

27
The BANKS System
  • BANKS provides keyword search coupled with
    extensive browsing facilities
  • Schema browsing and data browsing
  • Graphical display of data
  • Implemented using Java servlets
  • Keyword search response times typically 1 to 3
    seconds on:
  • DBLP database with 100,000 tuples/300,000 edges
  • P3 600 MHz, 512 MB RAM
  • Check online demo with (part of) DBLP data:
  • http://www.cse.iitb.ac.in/banks

28
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

29
Example of Browsing in BANKS

30
Anecdotes
  • Mohan
  • Returns C. Mohan at top based on prestige (number
    of papers written)
  • Transaction
  • Returns Jim Gray's classic paper and textbook as
    top answers based on prestige (number of
    citations)
  • Sunita Seltzer
  • No common papers, but both have papers with
    Stonebraker; the system finds this connection

31
Effect of Parameters
  • Log scaling of edge weights worked well
  • (1 - λ) E + λ N versus E × N^λ: made little
    difference
  • Best with λ = 0.2 (subdue node weights but not
    entirely)

32
Roadmap
  • Motivation
  • Model
  • Algorithm
  • System
  • Results
  • Related Work
  • Conclusion

33
Related Work
  • DataSpot (DTL)/Mercado Intuifind [Dar et al.
    VLDB 98]
  • Based on patent by Palmon (filed 1995, granted
    1998)
  • Based on hypergraph model, similar answer model
    to ours
  • Differences: our model of backward link weights
    and prestige
  • Proximity Search [Goldman et al. VLDB 98]
  • Different model of proximity based on adding up
    support
  • No edge weights or prestige; different evaluation
    algorithm
  • Information units (linked Web pages) [Li et al.
    WWW10]
  • No directionality, only studied in Web context
  • Microsoft DBXplorer [Agrawal et al., this
    conference]
  • No ranking, based on SQL generation
  • Addresses efficient construction of text indexes

34
Conclusions and Future Work
  • Keyword searching in databases: important and
    practical
  • Future work
  • BANKS for XML
  • Disambiguating queries by selecting
  • Tree structure: coauthors or cites
  • Boolean queries, thesaurus
  • Metadata: column/relation names

35
Thesis Work
  • Memory cognizant query optimization
  • Parametric query optimization (PQO)
  • For linear cost functions
  • For piecewise linear cost functions
  • Proposed PQO for p2p-XML-stream?!!!

36
Memory Cognizant Query Optimization
  • Deals with
  • dividing available memory amongst the operators
    running simultaneously in a pipeline
  • deciding where to break a "big" pipeline
  • integrating these decisions into an optimizer

37
Parametric Query Optimization (PQO)
  • The cost of a query plan depends on many
    parameters
  • PQO optimizes a query into a number of candidate
    plans, each optimal for some region of the
    parameter space
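
To illustrate the notion of plans that are each optimal for a region of the parameter space, the sketch below evaluates linear cost functions of a single parameter at sampled points. It is only a demonstration of the concept, not the algorithm from the thesis; the plan names, cost coefficients, and function names are invented.

```python
def cheapest_plan(plans, p):
    """Return the name of the cheapest plan at parameter value p.

    plans maps a plan name to (fixed_cost, cost_per_unit); the cost of
    a plan is fixed_cost + cost_per_unit * p.
    """
    return min(plans, key=lambda name: plans[name][0] + plans[name][1] * p)


def plan_regions(plans, sample_points):
    """Group sorted sample points into runs sharing the same cheapest plan.

    Returns a list of (plan_name, first_point, last_point).
    """
    regions = []
    for p in sorted(sample_points):
        best = cheapest_plan(plans, p)
        if regions and regions[-1][0] == best:
            regions[-1] = (best, regions[-1][1], p)
        else:
            regions.append((best, p, p))
    return regions


# Hypothetical plans whose cost depends on a selectivity-like parameter.
PLANS = {"index_scan": (10.0, 100.0), "full_scan": (50.0, 0.0)}
```

Here the two cost lines cross at p = 0.4, so the index scan is the candidate plan for small values of the parameter and the full scan for large ones.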

38
PQO for Linear Cost functions
  • We have developed an algorithm which is
  • Simple
  • Minimally intrusive
  • Works for arbitrary number of parameters
  • Almost optimal (in terms of number of invocations
    of the conventional optimizer)

39
PQO for Piecewise Linear Cost Functions
  • We have extended existing optimizers and the
    extensions are
  • Intrusive
  • Works for arbitrary number of parameters
  • Very general since nonlinear and discontinuous
    cost functions can be approximated to piecewise
    linear form
  • We showed how to extend the System R algorithm
  • We have also extended the Volcano optimization
    algorithm

40
Publications
  • Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
    Soumen Chakrabarti, S. Sudarshan. Keyword
    Searching and Browsing in Databases using BANKS.
    ICDE 2002
  • Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe,
    Soumen Chakrabarti, S. Sudarshan. Keyword Search
    in Databases. IEEE Data Engineering Bulletin
    24(3): 22-32 (2001)
  • Arvind Hulgeri, S. Seshadri, S. Sudarshan. Memory
    Cognizant Query Optimization. COMAD 2000
  • Arvind Hulgeri, S. Sudarshan. Parametric Query
    Optimization for Linear and Piecewise Linear Cost
    Functions. Submitted to VLDB 2002
  • Available from my homepage:
    http://www.cse.iitb.ac.in/aru

41
Thank you!
  • ChangeFontToWebdings(Thank you!)

42
BANKS Query Result Example
  • Result of the query 'Soumen Sunita'

44
Browsing Features
  • Hyperlinks are automatically added to all
    displayed results
  • Template facilities to do a variety of tasks
  • Browsing data by grouping and creating crosstabs
  • e.g., theses grouped by department and year
  • Hierarchical views of data
  • Nested XML style, even on relational data
  • Graphical displays
  • Bar charts, pie charts, etc
  • Templates are generic and can be applied on any
    data matching assumed schema
  • Can be applied after applying selections
  • New templates can be created by user,
    interactively

45
Combining Keyword Search and Browsing
  • Catalog searching applications
  • Keywords may restrict answers to a small set,
    then user needs to browse answers
  • If there are multiple answers, hierarchical
    browsing required on the answers

46
Keyword Searching and Browsing in Databases using
BANKS
  • Charuta Nakhe
  • Joint work with
  • Gaurav Bhalotia, Arvind Hulgeri, Soumen
    Chakrabarti, S. Sudarshan
  • I.I.T. Bombay

47
Query: Bangalore Hong-Kong
[Figure: graph with Cities nodes (Bangalore, Mumbai, Hong-Kong), a Trains node (Udyan Exp.) and Flights nodes (Jet Air 411, Cathay 123), with SRC and DST edges to cities; edges are labeled x, y, z with x < z < y]