Title: Keyword Querying on SemiStructured Data
Joint work with Soumen Chakrabarti, and
a large number of past
and current students
2Keyword Search on Semi-Structured Data
- A significant fraction of data is resident in relational databases or in semi-structured data (XML)
- Organizational, government, scientific, medical data
- Goal: ad-hoc/exploratory database querying using keywords
- SQL/XQuery is not appropriate for casual users
- Form interfaces are cumbersome
- Require a separate form for each type of query
- Not suitable for ad hoc queries
3Keyword Search on Semistructured Data (Cont.)
- Differences from IR/Web Search
- Normalization splits related data across multiple tuples
- In XML/HTML, edges represent connections between different nodes
- To answer a keyword query we need to find a (closely) connected set of tuples/nodes that together match all given keywords
4Graph Representation of Data
- Database modeled as a graph
- Nodes: tuples
- Directed edges for foreign key, inclusion dependencies, etc.
- Information integration: graph representation of integrated information, with keyword querying
- Can model relational, XML, HTML, .., data in a single graph
5Graph Data Model (2)
- E.g., XML data
<proceedings conf="VLDB" year="2009">
  <paper id="1">
    <title>Recovering from Query Optimization</title>
    ...
  </paper>
  <paper id="2">
    <title>Concurrency Control for Keyword Search</title>
    ...
    <cite ref="1">Recovering from Query Optimization</cite>
  </paper>
</proceedings>
6Answer Model
- Answer: minimal rooted directed tree connecting nodes containing keywords
- Undirected tree: Discover, DBXplorer, ..
- Multiple answers possible
- Answer relevance computed from answer edge score combined with answer node score
E.g., Sudarshan Roy
7Edge Directionality
- Some popular tuples are connected to many other tuples
- Paper1 → vldb06 ← paper2
- Students → departments → university
- Popular tuples would create misleading shortcuts
- E.g. every student would be closely linked with every other student via the department/university
8Edge Weight Model in BANKS
- Idea: define different forward and backward edge weights
- Forward edge weight (in direction of foreign key ref/XML containment)
- Default to 1, can be based on schema
- e.g. citation link weight 10, writes link weight 5
- Backward edge u→v weight (where foreign key v→u)
- Proportional to # edges pointing to u
- Log scaling
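The edge weight model above can be sketched in a few lines. This is only an illustration: the function name `build_edge_weights`, the `(src, dst, edge_type)` input format, the exact scaling `forward * log2(1 + indegree)` and the rule of keeping the smaller weight when two edges coincide are assumptions, since the slide only says "proportional to # edges" with "log scaling".

```python
import math
from collections import defaultdict

def build_edge_weights(foreign_keys, type_weight=None):
    """foreign_keys: iterable of (src, dst, edge_type), meaning src references dst.
    Returns {(from, to): weight} containing forward and backward edges."""
    fks = list(foreign_keys)
    type_weight = type_weight or {}
    indegree = defaultdict(int)
    for _, dst, _ in fks:
        indegree[dst] += 1

    weights = {}

    def put(u, v, w):
        # Assumption: if two edges coincide, keep the cheaper one.
        weights[(u, v)] = min(w, weights.get((u, v), math.inf))

    for src, dst, etype in fks:
        forward = type_weight.get(etype, 1.0)   # default forward weight is 1
        put(src, dst, forward)
        # Backward edge dst -> src grows with the number of edges pointing to dst,
        # log-scaled (assumed form of the scaling).
        put(dst, src, forward * math.log2(1 + indegree[dst]))
    return weights

fks = [("s1", "dept", "member"), ("s2", "dept", "member"),
       ("s3", "dept", "member"), ("p2", "p1", "cite")]
w = build_edge_weights(fks, type_weight={"cite": 10.0})
print(w[("s1", "dept")], w[("dept", "s1")])
```

With three students pointing at one department, the backward edge from the department costs more than the forward edge, which is what discourages shortcuts through popular tuples.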
9Node Prestige in BANKS
- Node weight (prestige) based on indegree
- More incoming edges ⇒ higher prestige
- Google PageRank style transfer of prestige
- Node weight computed using biased random walk model
- Bias based on edge type, direction
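A PageRank-style biased random walk could look roughly like the sketch below. The damping factor, the iteration count, the uniform treatment of nodes without outgoing edges, and the name `node_prestige` are all assumptions for illustration; the slide only states that the walk is biased by edge type and direction.

```python
def node_prestige(nodes, edge_bias, damping=0.85, iters=50):
    """nodes: list of node ids.
    edge_bias: {(u, v): bias}, the relative likelihood of stepping u -> v
    (hypothetically derived from edge type and direction).
    Returns {node: prestige}, summing to 1."""
    out = {u: [] for u in nodes}
    for (u, v), bias in edge_bias.items():
        out[u].append((v, bias))
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            total = sum(b for _, b in out[u])
            if total == 0:
                # Node with no outgoing edges: jump anywhere uniformly.
                for v in nodes:
                    nxt[v] += damping * p[u] / n
            else:
                for v, b in out[u]:
                    nxt[v] += damping * p[u] * b / total
        p = nxt
    return p

prestige = node_prestige(["a", "b", "c"], {("a", "c"): 1.0, ("b", "c"): 1.0})
print(max(prestige, key=prestige.get))
```

Node `c`, the only one with incoming edges, ends up with the highest prestige.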
10Response Ranking
- Edge score $E_A$
- Smaller tree => higher score
- E.g., BANKS: $E_A = 1 / \sum (\text{edge weights})$
- Variant
- Score of path from root to leaf: probability of random walk from root reaching that leaf
- Tree score: product of leaf path scores
- Node score $N_A$
- Measure of authority of nodes in tree
- E.g., BANKS: $N_A = \sum_{\text{leaf/root nodes}} \log(\text{node authority})$
- Overall score $f(E_A, N_A)$
- E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A^{\lambda}$
- $\lambda = 0.2$ works well
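A small sketch of how the scores above combine, under stated assumptions: the function names are hypothetical, the answer tree has at least one edge, and node authorities are scaled to be greater than 1 so that the log sum stays positive (the slide does not say how authorities are scaled).

```python
import math

def edge_score(edge_weights):
    """E_A = 1 / (sum of edge weights in the answer tree)."""
    total = sum(edge_weights)
    if total <= 0:
        raise ValueError("answer tree needs at least one positively weighted edge")
    return 1.0 / total

def node_score(authorities):
    """N_A = sum of log(authority) over the root and leaf nodes.
    Assumes every authority is > 1 so the result is positive."""
    return sum(math.log(a) for a in authorities)

def overall_score(edge_weights, authorities, lam=0.2):
    """f(E_A, N_A) = E_A * N_A ** lambda, with lambda = 0.2 as on the slide."""
    return edge_score(edge_weights) * node_score(authorities) ** lam

auth = [math.e, math.e]                     # two nodes, log-authority 1 each
small = overall_score([1.0, 1.0], auth)
large = overall_score([1.0, 1.0, 2.0], auth)
print(small > large)
```

The smaller tree scores higher for the same node authorities, matching the first bullet.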
11The BANKS System
[Figure: system diagram; recoverable labels: "Web Server Servlets", "XML Data source"]
- Available on the web, with DBLP, IMDB and IITB ETD data
- http://www.cse.iitb.ac.in/banks/
- Preprocessing to create indices and give weights to links
- Provides keyword search and browsing features
12BANKS Architecture
- Data resident on disk
- Graph representation of data resident in memory
- Nodes and edges with their types/counts
- 16×|V| + 8×|E| bytes
- Search done in memory
- Why in memory?
- Allows us to use interesting graph traversal based algorithms without being constrained by SQL and related performance issues
- With current memory sizes, database graphs for many applications will fit in server memory
- External memory search: ongoing work
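To make the memory estimate concrete, the 16 bytes per node plus 8 bytes per edge formula can be applied to the dataset sizes quoted in the experiments part of these slides. The helper name `graph_memory_bytes` is only illustrative.

```python
def graph_memory_bytes(num_nodes, num_edges, bytes_per_node=16, bytes_per_edge=8):
    """In-memory graph size using the 16 x |V| + 8 x |E| estimate."""
    return bytes_per_node * num_nodes + bytes_per_edge * num_edges

dblp_like = graph_memory_bytes(2_000_000, 9_000_000)     # DBLP / IMDB scale
patents = graph_memory_bytes(4_000_000, 15_000_000)      # US Patent DB scale
print(dblp_like / 1e6, "MB")   # 104.0 MB
print(patents / 1e6, "MB")     # 184.0 MB
```

Both graphs come to well under 200 MB, which supports the claim that such graphs fit in server memory.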
13Related Work
- Keyword querying on relational databases
- DBXplorer (ICDE02), Discover (VLDB02)
- Use SQL generation
- BANKS (ICDE02) (G. Bhalotia, Charuta N., A. Hulgeri, Soumen C., S. Sudarshan)
- pays more attention to result ranking
- does not require schema
- Keyword querying on XML
- Tree model (answer based on containment edges)
- XRank (Cornell), proximity in XML (AT&T Research)
- Schema-Free XQuery (Michigan)
- Tree model cannot handle arbitrary graph edges
- Graph model
- Sphere search (VLDB2005)
- Generates XML tags to represent context
- Query can specify keyword context
- Does not explore edge weights
14Proximity (Near) Queries
15Proximity Queries
- Node weight by proximity
- E.g. author (near recovery)
- Node prestige is higher if close to multiple nodes matching near keywords
- Example applications
- Finding experts on a particular area
- faculty (near earthquake)
- Author (near recovery)
[Figure: example for faculty (near earthquake); recoverable titles: "Analysis of Earthquake ..", "Earthquake Resistant", "Earthquake Measurement", "Building Earthquake"]
16Proximity via Spreading Activation
- Idea
- Each near keyword has activation of 1
- Divided among nodes matching keyword, proportional to their node prestige
- Each node
- keeps fraction 1-µ of its received activation and
- spreads fraction µ amongst its neighbors
- Graph may have cycles
- Combine activation received from neighbors
- $a = 1 - (1 - a_1)(1 - a_2)$ (belief function)
- Additive combination ($a_1 + a_2$) may diverge w/ cycles
17Activation Change Propagation
- Algorithm to incrementally propagate activation change δ
- Nodes to propagate δ from are in queue
- Best first propagation
- Propagation to node already in queue simply modifies its δ value
- Stops when δ becomes smaller than cutoff
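One way to read the propagation steps above is the sketch below. It is an interpretation, not the BANKS implementation: the names `belief` and `spread_activation`, the default values for `mu` and `cutoff`, equal division among neighbors, and the use of a lazy-deletion heap as the queue are assumptions. A node keeps fraction 1-µ of each processed δ and passes fraction µ on, with the belief function used wherever activations are combined.

```python
import heapq

def belief(a1, a2):
    """Belief combination: stays within [0, 1] even when the graph has cycles."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

def spread_activation(neighbors, seeds, mu=0.5, cutoff=1e-4):
    """neighbors: {node: [neighbor, ...]}.
    seeds: {node: initial activation} for one near keyword (summing to 1).
    Returns {node: retained activation}."""
    activation = {}
    pending = dict(seeds)                       # node -> delta still to propagate
    heap = [(-d, n) for n, d in pending.items()]
    heapq.heapify(heap)
    while heap:
        neg_d, node = heapq.heappop(heap)
        delta = pending.get(node, 0.0)
        if delta != -neg_d:
            continue                            # stale entry: delta was modified later
        del pending[node]
        if delta < cutoff:
            break                               # best-first: the rest are smaller still
        activation[node] = belief(activation.get(node, 0.0), (1.0 - mu) * delta)
        nbrs = neighbors.get(node, [])
        if not nbrs:
            continue
        share = mu * delta / len(nbrs)
        for nb in nbrs:
            # A node already in the queue simply has its delta modified.
            pending[nb] = belief(pending.get(nb, 0.0), share)
            heapq.heappush(heap, (-pending[nb], nb))
    return activation

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
act = spread_activation(chain, {"a": 1.0})
print(sorted(act, key=act.get, reverse=True))
```

On the three-node chain, activation falls off with distance from the seed node `a`, even though the chain contains cycles through its undirected links.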
18Near Queries with Multiple Keywords
- Spread activation from each keyword separately
- Then combine the activations from different keywords
- OR: use addition or belief combination
- AND: take product of activations
- Gives better results
19Proximity and Tree Scores
- Queries can combine proximity scores with tree scores
- author(near transactions) data integration
- Related work
- Goldman et al. (VLDB98)
- Considers only shortest path from each node
- Author (near Surajit Chaudhuri)
- Object Rank (VLDB04)
- Done independently
- Precomputed: high space overhead
- Subsequently extended to IR context in the SPIN
20Example Answers
- Anecdotal results on DBLP Bibliography
- author (near recovery): Dave Lomet, C. Mohan, etc.
- Transaction: Jim Gray's classic paper and textbook at the top, based on prestige (# of citations)
- Johnson (near OLAP): Theodore Johnson
- And on IIT Bombay Thesis/Dissertation Database
- faculty (near earthquake): R.S. Jangid, P. Banerji, R. Sinha
- faculty (near database)
21Other Query Extensions
- Restriction of context
- author:washington vs. state:washington
- Twig and approximate twig patterns
- recovery cites optimization
22Graph Search Algorithms To Find Answer Trees
23Bidirectional Expansion for Keyword Search on Graph Databases
Varun Kacholia
Shashank Pandit Soumen Chakrabarti
S. Sudarshan Rushi Desai
Hrishikesh Karambelkar
24Finding Answer Trees
- Backward Expanding Search
- Intuition: travel backwards from keyword nodes till you hit a common node
Query: sudarshan roy
MultiQuery Optimization
Prasan Roy
25Backward Search Algorithm
- Algorithm
- Run concurrent single source shortest path iterators from each node matching a keyword
- Traverse the graph edges in reverse direction
- Output next nearest node on each get-next() call
- Do best-first search across iterators
- Output node if in the intersection of sets of nodes reached from each keyword
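The backward expanding search described above might be sketched as follows. This is a simplified illustration rather than the BANKS code: the names `dijkstra_backward` and `backward_search`, the edge-list input, and the use of the sum of per-keyword distances as the answer cost are assumptions, and the sketch reports answer roots only instead of building the full trees.

```python
import heapq

def dijkstra_backward(rev_adj, source):
    """Single source shortest path iterator over reversed edges.
    Yields (distance, node) in nondecreasing distance; each next() is a get-next()."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        yield d, u
        for v, w in rev_adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))

def backward_search(edges, keyword_nodes, k=1):
    """edges: [(u, v, weight)] directed u -> v; keyword_nodes: one set of nodes per keyword.
    Returns up to k (root, cost) pairs, where root reaches a node of every keyword."""
    rev_adj = {}
    for u, v, w in edges:
        rev_adj.setdefault(v, []).append((u, w))
    iterators, frontier = [], []
    for ki, nodes in enumerate(keyword_nodes):
        for s in sorted(nodes):
            gen = dijkstra_backward(rev_adj, s)
            iterators.append((gen, ki))
            d, u = next(gen)
            heapq.heappush(frontier, (d, len(iterators) - 1, u))
    reached = {}                                  # node -> {keyword index: distance}
    roots, seen = [], set()
    while frontier and len(roots) < k:
        d, it, u = heapq.heappop(frontier)        # best-first across iterators
        gen, ki = iterators[it]
        per_kw = reached.setdefault(u, {})
        per_kw.setdefault(ki, d)                  # first visit is the nearest
        if len(per_kw) == len(keyword_nodes) and u not in seen:
            seen.add(u)                           # u is in the intersection
            roots.append((u, sum(per_kw.values())))
        nxt = next(gen, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], it, nxt[1]))
    return roots

# Forward edges (weight 1) plus backward edges, as in the edge weight model.
edges = [("w1", "sudarshan", 1.0), ("w1", "p1", 1.0),
         ("w2", "roy", 1.0), ("w2", "p1", 1.0),
         ("p1", "w1", 1.0), ("p1", "w2", 1.0),
         ("sudarshan", "w1", 2.0), ("roy", "w2", 2.0)]
roots = backward_search(edges, [{"sudarshan"}, {"roy"}])
print(roots)
```

For the query "sudarshan roy" on this toy graph, the two iterators meet at the paper node `p1`, which becomes the root of the answer tree.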
26Backward Search Limitations
- Wasteful exploration of graph
- Frequently occurring keywords
- Hub nodes in the graph (high in-degree)
[Figure: example graph; recoverable labels: "Shashank Sudarshan Database", "Schema Legend"]
27Bidirectional Search Motivation
28Bidir Search Intuition
- First cut solution
- Don't go backward if keyword matches many nodes
- Don't go backward if node points to a hub
- Instead explore forward from other keywords
29Bidir Search Example
[Figure: example graph; recoverable labels: "Shashank Sudarshan Database", "Schema Legend"]
30Bidir Search Issues
- What should threshold for not expanding be?
- Our solution: prioritize expansion of nodes based on spreading activation
- to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?
31Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
- Gives low activation to frequently occurring keywords
32Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ = 0.3)
- Gives low activation to neighbors of hubs
[Figure: a node with activation 1/5 retains 0.3 x 1/5 and passes 0.7 x 1/5 x 1/4 to each of four neighbors]
33Bidir Search Iterators
- How to manage exploration in both directions?
- Single backward iterator + single forward iterator w/ suitable data structures
- E.g., to keep track of parents of nodes
[Figure: nodes annotated with (Dist from A, Dist from B)]
34Bidir Search Algorithm
- Algorithm
- Activate matching nodes; insert into backward iterator
- while (iterators are not empty)
- Choose iterator for expansion in best-first manner
- Explore node with highest activation
- Spread activation to neighbors
- Update path weights (and other data structures)
- Propagate values to ancestors if necessary
- Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
- Stop when top-k results are produced
35Bidir Search top-k results
- Results need not be generated in-order
- Naïve solution
- Store results in an intermediate heap
- Output top k results after mk total results have been generated (m = 10)
- Can do better
- Compute upper bound on score of next result; output answers with a higher score
- Similar to NRA algorithm (Fagin et al., PODS01)
36Experiments
- Datasets
- DBLP, IMDB: 2 million nodes, 9 million edges
- US Patent DB: 4 million nodes, 15 million edges
- Workload
- Keywords randomly picked from results of SQL join statements
- Search algorithms
- MI-Bkwd: original backward search
- Iterator for every node matching a keyword
- SI-Bkwd: backward search with single backward iterator
- Bidirec: bidirectional search
- Time taken/nodes explored
- Measured when 10th answer is generated (or last answer if # answers < 10)
- Origin size
- # nodes matched by keywords in the query
37Experiments (2)
- MI-Bkwd versus SI-Bkwd
- SI-Bkwd gain increases with origin size,
38Experiments (3)
- SI-Bkwd versus Bidirec
- Bidirec gain increases with origin size,
39Experiments (4)
- Precision/Recall experiments
- Relevant answers are well-defined: can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
- Recall: 100%
- Precision: 100% at near full recall
- Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!
- Bidirectional search as dynamic per-tuple join ordering
- Related work in this area: Eddies
- Unlike Eddies, bidirectional search is
- Schema-less
- Priority based on activation instead of selectivity
- Generates answers in relevance order
- Graph model
- Common denominator representation
- Multiple types of queries required
- Near queries, spanning tree queries
- Ranking is critical
- Edge and node weights, spreading activation
- Efficient graph search is critical
- Bidirectional graph search
42Ongoing and Future Work
- Graphs larger than memory
- Idea: use multi-level graph representation
- Higher levels are condensed representation of lower levels
- Revised approach to search
- Search on condensed super-graph (in-memory), to find potential answers
- Expand nodes (disk I/O)
- Redo search on expanded nodes to find real answers
43Graph Condensation
- Cluster nodes to get supergraph
- Different clusterings possible, e.g. Raghavan/Garcia-Molina (ICDE 2003) for web graphs
- Currently building infrastructure and exploring
44Searching in Condensed Graphs
- Weight of super-edge S1→S2 = min(real edge weights between S1 and S2)
- Issues
- E.g. minimal answers on full graph may be non-minimal on condensed graph
- Multi-granularity representation
- Other types of queries on condensed graphs
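The super-edge weight rule can be illustrated with a short sketch. The function name `condense` and the `cluster_of` mapping are hypothetical, and how the clusters are chosen is left open here, as it is on the slides.

```python
def condense(edges, cluster_of):
    """edges: [(u, v, weight)] on the full graph.
    cluster_of: {node: supernode id}.
    Returns {(S1, S2): weight}, where each super-edge takes the minimum weight
    of the real edges going from a node in S1 to a node in S2."""
    super_edges = {}
    for u, v, w in edges:
        s1, s2 = cluster_of[u], cluster_of[v]
        if s1 == s2:
            continue                      # edge is internal to a supernode
        key = (s1, s2)
        super_edges[key] = min(w, super_edges.get(key, float("inf")))
    return super_edges

edges = [("a", "x", 3.0), ("b", "x", 1.0), ("a", "b", 5.0), ("x", "a", 2.0)]
clusters = {"a": "S1", "b": "S1", "x": "S2"}
print(condense(edges, clusters))
```

Taking the minimum keeps path lengths on the supergraph no larger than the real ones, so a search on the condensed graph yields potential answers that then have to be checked on the expanded nodes.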
45Further Directions
- Querying graph representation of integrated data
S. Sarawagi
Sunita S.
Suneeta S.
46Future of Keyword Search in DBs
- Next generation of intelligent search will require context information
- E.g. search email, files, calendar, ..
- Information integration will be important
- Graph structured data will be a key component
- Security
- Is there a killer app?
47Thank You!
48Experiments (5)
- Comparison with Sparse
- Hristidis et al. (VLDB03)
- Generate join expressions leading to query results
- Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
- For top-k results automatically determine required number of join expressions
- Sparse-LB
- Manually generate required join expressions
- Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
- Bidirectional outperforms by a factor of 3 (esp. when # joins is large)
49Experiments (6)
- SI-Bkwd versus Bidirec by origin size
- Bidirec gains more with unbalanced origin sizes
A (T,S,S,S), B (M,M,M,M), C (M,L,L,L), D (M,M,L,L), E (T,L,L,L), F (T,S,M,L), G (T,M,L,L), H (T,T,T,L)
50Bidir Search top-k results (2)
- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
- $m_i$: minimum path length explored backward from keyword $i$
- Unseen answer node: $1/(m_1 + m_2 + \dots + m_n)$
- Visited answer node: suppose reached from first $x$ keywords with distance $d_i$
- $1/(d_1 + d_2 + \dots + d_x + m_{x+1} + m_{x+2} + \dots + m_n)$
- Combine this with max node prestige
- We simply use $1/(m_1 + m_2 + \dots + m_n)$!
- Experiments show no significant loss in using this heuristic
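The two bounds above differ only in which keywords already have a known distance, so a single helper can cover both cases. The name `edge_score_bound` and the dictionary of known distances are illustrative assumptions.

```python
def edge_score_bound(m, known=None):
    """m: list of m_i, the minimum path length explored backward from keyword i.
    known: {keyword index: d_i} for keywords from which the node was already reached.
    Returns 1 / (sum of d_i where known, else m_i)."""
    known = known or {}
    total = sum(known.get(i, m_i) for i, m_i in enumerate(m))
    return float("inf") if total == 0 else 1.0 / total

m = [2.0, 3.0, 5.0]
unseen = edge_score_bound(m)                    # 1 / (2 + 3 + 5)
visited = edge_score_bound(m, known={0: 4.0})   # 1 / (4 + 3 + 5)
print(unseen, visited)
```

Since a node is reached from a keyword at distance no smaller than that keyword's current frontier, the unseen-node value is never below the visited-node value, which is why using it alone remains a safe, if looser, bound.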
51Bidirectional Search (1)
- Single backward search iterator across all keywords
- Unlike per-matched-node in backward exp.
- Changes answer set slightly
- Different justifications for same root may be lost
- Didn't find any meaningful answers lost
- Spreading activation to prioritize backward search
- Activation spread per keyword
- Nodes prioritized by sum of activations
- Single forward iterator
52Bidir. Search (2)
- For each node in backward iterator
- dist(u,i): best path from u to node in Si
- Si: nodes matching keyword Ki
- sp(u,i): next node in shortest path from u to Ki
- a(u,i): activation at u from keyword Ki
- a(u): sum of a(u,i)
53Bidir. Search (3)
- Spreading of activation
- Done separately for each keyword
- For nodes in Si (nodes matching keywords)
- initial activation proportional to node prestige
- total of 1 across nodes in Si
- Node retains µ fraction of received activation, spreads (1-µ) fraction
- Activation spread from a node V divided among neighbors Ui in inverse proportion to weight Ui→V
- Thus incorporates path score too
- Activation combining function: max
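A minimal sketch of the per-keyword spreading rules on this slide, assuming hypothetical helper names and µ = 0.3 (the value shown on the earlier spreading activation slide). It covers one spreading step only, not the full search loop.

```python
def initial_activation(matching_nodes, prestige):
    """Activation for the nodes in S_i: proportional to prestige, totalling 1."""
    total = sum(prestige[n] for n in matching_nodes)
    return {n: prestige[n] / total for n in matching_nodes}

def spread_from(a_v, in_edge_weights, mu=0.3):
    """a_v: activation received at node V.
    in_edge_weights: {U: weight of edge U -> V}.
    V retains mu * a_v; the remaining (1 - mu) * a_v is divided among the
    neighbors U in inverse proportion to the edge weight."""
    retained = mu * a_v
    inverse = {u: 1.0 / w for u, w in in_edge_weights.items()}
    norm = sum(inverse.values())
    shares = {u: (1.0 - mu) * a_v * x / norm for u, x in inverse.items()}
    return retained, shares

def combine(a_old, a_new):
    """Activation combining function: max."""
    return max(a_old, a_new)

# A keyword matching five equally prestigious nodes gives each 1/5.
init = initial_activation(["n1", "n2", "n3", "n4", "n5"],
                          {n: 1.0 for n in ["n1", "n2", "n3", "n4", "n5"]})
retained, shares = spread_from(init["n1"], {"u1": 1.0, "u2": 1.0, "u3": 1.0, "u4": 1.0})
print(retained, shares["u1"])
```

With five matching nodes and four equally weighted neighbors, this gives 0.3 x 1/5 retained and 0.7 x 1/5 x 1/4 per neighbor, so frequent keywords and hub nodes both dilute the activation.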
54Bidir. Search (3)
- Forward search iterator
- Forward search from all nodes reached by backward search
- Track best forward path to each keyword
- Initially infinite cost
- Whenever this changes, propagate cost change to all affected ancestors
55Bidir. Search (3)
- On each path length update (due to backward or forward search)
- Check if node can reach all keywords
- If so, add it to output heap
- If same undirected tree not already present
- Output heap deals with out of order answer generation
- When to output nodes from heap
- For each keyword Ki, track Mi
- Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree
- 1/Max(Mi) is upper bound on edge score
- 1/Sum(Mi) can be used instead at risk of out-of-order answers
- Use max of node scores to compute overall score upper bound
- Output answer if its score is > overall score upper bound
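The output heap logic above can be sketched as a small buffer that releases answers once their score exceeds the current upper bound. The class and function names are hypothetical, and combining the edge bound with the maximum node score through the E · N^λ formula (λ = 0.2) is an assumption carried over from the ranking slide.

```python
import heapq

def score_upper_bound(M, max_node_score, lam=0.2, use_sum=False):
    """M: list of M_i, one per keyword.
    1/max(M_i) bounds the edge score; 1/sum(M_i) is the alternative that
    risks out-of-order answers."""
    denom = sum(M) if use_sum else max(M)
    return (1.0 / denom) * max_node_score ** lam

class OutputHeap:
    """Buffers answers that may be generated out of order."""
    def __init__(self):
        self._heap = []
        self._seq = 0                     # tie-breaker so answers are never compared

    def add(self, score, answer):
        heapq.heappush(self._heap, (-score, self._seq, answer))
        self._seq += 1

    def release(self, upper_bound):
        """Pop every buffered answer whose score is > upper_bound, best first."""
        out = []
        while self._heap and -self._heap[0][0] > upper_bound:
            neg_score, _, answer = heapq.heappop(self._heap)
            out.append((answer, -neg_score))
        return out

heap = OutputHeap()
heap.add(0.5, "tree-A")
heap.add(0.2, "tree-B")
heap.add(0.35, "tree-C")
first = heap.release(0.3)      # bound while the search is still running
second = heap.release(0.1)     # bound after more of the graph is explored
print(first, second)
```

As the search explores further, each Mi grows, the bound shrinks, and more buffered answers become safe to output in score order.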
56Probabilistic Edge Score Model
- Probabilistic edge scoring model: alternative to edge weight model
- Path weight ∝ probability of following each edge
- Edge probability
- Forward ∝ 1/out-degree
- Backward ∝ 1/in-degree
- Can have separate in/out degrees by edge type, probability of following each edge type
- Edge score E: (harmonic) mean of path weights from leaves to root