Title: Keyword Querying on Semi-Structured Data
1Keyword Querying on Semi-Structured Data
Joint work with Soumen Chakrabarti, and
a large number of past
and current students
http://www.cse.iitb.ac.in/banks/
2Keyword Search on Semi-Structured Data
- A significant fraction of data is resident in relational databases or in semi-structured data (XML)
- Organizational, government, scientific, medical data
- Goal: ad-hoc/exploratory database querying using keywords
- SQL/XQuery is not appropriate for casual users
- Form interfaces cumbersome
- Require separate form for each type of query
- Not suitable for ad hoc queries
3Keyword Search on Semi-Structured Data (Cont.)
- Differences from IR/Web Search
- Normalization splits related data across multiple tuples
- In XML/HTML, edges represent connections between different nodes
- To answer a keyword query we need to find a (closely) connected set of tuples/nodes that together match all given keywords
4Graph Representation of Data
- Database modeled as a graph
- Nodes: tuples
- Directed edges for foreign key, inclusion dependencies, etc.
- Information integration: graph representation of integrated information + keyword querying
- Can model relational, XML, HTML, ... data in a single graph
5Graph Data Model (2)
- E.g., XML data
<proceedings conf="VLDB" year="2009">
  <paper id="1">
    <title>Recovering from Query Optimization</title> ...
  </paper>
  <paper id="2">
    <title>Concurrency Control for Keyword Search</title> ...
    <cite ref="1">Recovering from Query Optimization</cite>
  </paper>
</proceedings>
6Answer Model
- Answer: minimal rooted directed tree connecting nodes containing keywords
- Undirected tree: Discover, DBXplorer, ..
- Multiple answers possible
- Answer relevance computed from answer edge score combined with answer node score (prestige)
E.g., query: Sudarshan Roy
7Edge Directionality
- Some popular tuples are connected to many other tuples
- Paper1 → vldb06 ← paper2
- Students → departments → university
- Popular tuples would create misleading shortcuts
- E.g. every student would be closely linked with every other student via the department/university
8Edge Weight Model in BANKS
- Idea: define different forward and backward edge weights
- Forward edge weight (in direction of foreign key ref/XML containment)
- Defaults to 1, can be based on schema
- e.g. citation link weight 10, writes link weight 5
- Backward edge u→v weight (where foreign key v→u)
- Proportional to # edges pointing to u
- Log scaling
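The forward/backward weighting above can be sketched in a few lines. This is only an illustration: the function name `edge_weights`, the `(src, dst, edge_type)` input format, and the exact backward formula `forward * log2(1 + indegree)` are assumptions, since the slide only says "proportional to # edges pointing to u" with "log scaling".

```python
import math
from collections import Counter

def edge_weights(fk_edges, type_weight=None):
    """fk_edges: list of (src, dst, edge_type), each in foreign-key direction.

    Returns a dict mapping directed edge (a, b) to its weight.
    Assumes no pair of nodes references each other in both directions.
    """
    type_weight = type_weight or {}
    # Number of foreign-key edges pointing at each node.
    indegree = Counter(dst for _, dst, _ in fk_edges)
    weights = {}
    for src, dst, etype in fk_edges:
        forward = type_weight.get(etype, 1.0)  # default forward weight is 1
        weights[(src, dst)] = forward
        # Backward edge dst -> src: grows with the number of edges pointing
        # at dst, damped by a log (assumed formula, not from the slides).
        weights[(dst, src)] = forward * math.log2(1 + indegree[dst])
    return weights
```

With three papers pointing at one conference node, each forward edge keeps weight 1 while each backward edge out of the conference gets weight log2(4) = 2, so paths that shortcut through the popular node become more expensive.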
9Node Prestige in BANKS
- Node weight (prestige) based on indegree
- More incoming edges => higher prestige
- Google PageRank style transfer of prestige
- Node weight computed using biased random walk model
- Bias based on edge type, direction
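A PageRank-style biased random walk can be sketched as a simple power iteration. The damping factor, the iteration count, and the handling of nodes without outgoing edges are assumptions made for this sketch; the slides do not specify them.

```python
def prestige(out_edges, damping=0.85, iterations=50):
    """out_edges: node -> {neighbor: bias}, where bias reflects the edge
    type and direction. Returns node -> prestige (values sum to 1)."""
    nodes = set(out_edges)
    for nbrs in out_edges.values():
        nodes.update(nbrs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            nbrs = out_edges.get(u, {})
            total = sum(nbrs.values())
            if total == 0:
                # Node with no outgoing edges: restart uniformly (assumption).
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
            else:
                # Follow each edge with probability proportional to its bias.
                for v, bias in nbrs.items():
                    nxt[v] += damping * rank[u] * bias / total
        rank = nxt
    return rank
```

A node referenced by many others, such as a conference tuple pointed to by its papers, ends up with higher prestige than any single paper.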
10Response Ranking
- Edge score $E_A$
- Smaller tree => higher score
- E.g., BANKS: $E_A = 1 / \sum (\text{edge weights})$
- Variant
- Score of path from root to leaf = probability of random walk from root reaching that leaf
- Tree score = product of leaf path scores
- Node score $N_A$
- Measure of authority of nodes in tree
- E.g., BANKS: $N_A = \sum_{\text{leaf/root nodes}} \log(\text{node authority})$
- Overall score $f(E_A, N_A)$
- E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A^{\lambda}$
- $\lambda = 0.2$ works well
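As a rough worked example of the ranking formula, the sketch below combines an edge score and a node score with exponent 0.2. One deliberate deviation is labeled in the code: it uses log(1 + authority) rather than a plain log so that the node score stays non-negative. That choice, and the function names, are assumptions. Answer trees with no edges are not handled.

```python
import math

def edge_score(edge_weights):
    # E_A = 1 / (sum of edge weights): smaller trees score higher.
    return 1.0 / sum(edge_weights)

def node_score(authorities):
    # Sum of log-scaled authorities of the root and leaf nodes.
    # Assumption: log(1 + a) instead of log(a) keeps the score >= 0.
    return sum(math.log(1.0 + a) for a in authorities)

def overall_score(edge_weights, authorities, lam=0.2):
    # f(E_A, N_A) = E_A * N_A ** lambda, with lambda = 0.2.
    return edge_score(edge_weights) * node_score(authorities) ** lam
```

With equal node authorities, a two-edge tree outranks a four-edge tree; with equal edge weights, the tree whose root and leaves have higher authority ranks first.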
11The BANKS System
[Figure: system architecture with components User, Web Server + Servlets (BANKS), Database, and XML data source; links labeled HTTP, JDBC, and preprocess]
- Available on the web, with DBLP, IMDB and IITB ETD data
- http://www.cse.iitb.ac.in/banks/
- Preprocessing to create indices and give weights to links
- Provides keyword search and browsing features
12BANKS Architecture
- Data resident on disk
- Graph representation of data resident in memory
- Nodes and edges with their types/counts
- $16 \times |V| + 8 \times |E|$ bytes
- Search done in memory
- Why in memory?
- Allows us to use interesting graph traversal based algorithms without being constrained by SQL and related performance issues
- With current memory sizes, database graphs for many applications will fit in server memory
- External memory search: ongoing work
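To get a feel for the in-memory footprint, the size estimate can be evaluated directly. The sketch reads the estimate as 16 bytes per node plus 8 bytes per edge (an assumption about the intended formula) and plugs in the graph sizes quoted in the Experiments slide.

```python
def graph_bytes(num_nodes, num_edges, per_node=16, per_edge=8):
    # Assumed reading of the estimate: 16 bytes per node + 8 bytes per edge.
    return per_node * num_nodes + per_edge * num_edges

# Graph sizes quoted in the experiments: 2M nodes / 9M edges (DBLP, IMDB)
# and 4M nodes / 15M edges (US Patent DB).
for name, v, e in [("DBLP/IMDB", 2_000_000, 9_000_000),
                   ("US Patent", 4_000_000, 15_000_000)]:
    print(name, graph_bytes(v, e) / 1e6, "MB")
```

Under that reading the two graphs need roughly 104 MB and 184 MB, which is in line with the claim that many database graphs fit in server memory.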
13Related Work
- Keyword querying on relational databases
- DBXplorer (ICDE 2002), Discover (VLDB 2002)
- Use SQL generation
- BANKS (ICDE 2002) (G. Bhalotia, Charuta N., A. Hulgeri, Soumen C., S. Sudarshan)
- Pays more attention to result ranking
- Does not require schema
- Keyword querying on XML
- Tree model (answer based on containment edges)
- XRank (Cornell), proximity in XML (AT&T Research)
- Schema-Free XQuery (Michigan)
- Tree model cannot handle arbitrary graph edges
- Graph model
- SphereSearch (VLDB 2005)
- Generates XML tags to represent context
- Query can specify keyword context
- Does not explore edge weights
14Proximity (Near) Queries
15Proximity Queries
- Node weight by proximity
- E.g. author (near recovery)
- Node prestige is higher if close to multiple nodes matching near keywords
- Example applications
- Finding experts on a particular area
- faculty (near earthquake)
- author (near recovery)
[Figure: example graph with node labels Mohan, Raghu, "Analysis of Earthquake ..", "Earthquake Resistant", "Earthquake Measurement", "Building Earthquake"]
16Proximity via Spreading Activation
- Idea
- Each near keyword has activation of 1
- Divided among nodes matching keyword, proportional to their node prestige
- Each node
- keeps fraction 1-µ of its received activation and
- spreads fraction µ amongst its neighbors
- Graph may have cycles
- Combine activation received from neighbors
- $a = 1 - (1-a_1)(1-a_2)$ (belief function)
- Additive combination ($a_1 + a_2$) may diverge with cycles
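The difference between the belief function and plain addition is easy to see numerically. The helper below generalizes the two-input formula to any number of inputs, which is a natural extension but not something the slide states.

```python
from functools import reduce

def belief_combine(*activations):
    # a = 1 - (1 - a1)(1 - a2)...: stays within [0, 1] for inputs in [0, 1].
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), activations, 1.0)

def additive_combine(*activations):
    # Plain sum: can exceed 1 and keeps growing as a cycle feeds
    # activation back into the same node.
    return sum(activations)
```

For example, two inputs of 0.6 combine to 0.84 under the belief function but to 1.2 under addition; feeding ten such inputs into one node still yields a belief value below 1.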
17Activation Change Propagation
- Algorithm to incrementally propagate activation change δ
- Nodes to propagate δ from are in a queue
- Best-first propagation
- Propagation to a node already in the queue simply modifies its δ value
- Stops when δ becomes smaller than cutoff
[Figure: example propagation with node values 1, 0.6, 0.2, 0.2, 0.12, 0.12, 0.08, 0.08]
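The propagation loop might look roughly like the following. It is a sketch under several assumptions that the slides leave open: a node's pending change is split equally among its neighbors, pending changes for a queued node are added together, and the kept share is merged into the node's activation with the belief function. The names `propagate`, `neighbors`, and `seeds` are hypothetical.

```python
import heapq

def belief(a1, a2):
    return 1.0 - (1.0 - a1) * (1.0 - a2)

def propagate(neighbors, seeds, mu=0.5, cutoff=1e-3):
    """neighbors: node -> list of adjacent nodes.
    seeds: node -> initial activation change (delta).
    Returns node -> final activation."""
    activation = {}
    pending = dict(seeds)                       # node -> delta waiting to spread
    heap = [(-d, n) for n, d in pending.items()]
    heapq.heapify(heap)
    while heap:
        neg_d, node = heapq.heappop(heap)
        delta = pending.get(node)
        if delta is None or delta != -neg_d:
            continue                            # stale queue entry, skip it
        if delta < cutoff:
            break                               # largest pending delta is negligible
        del pending[node]
        # Keep fraction (1 - mu), merged with what the node already has.
        activation[node] = belief(activation.get(node, 0.0), (1.0 - mu) * delta)
        nbrs = neighbors.get(node, [])
        for nb in nbrs:
            # Spread fraction mu; a node already queued just gets a larger delta.
            pending[nb] = pending.get(nb, 0.0) + mu * delta / len(nbrs)
            heapq.heappush(heap, (-pending[nb], nb))
    return activation
```

On a two-node cycle seeded at one node, the loop terminates because each hop shrinks the pending change by a factor of µ, and the seeded node ends up with the higher activation.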
18Near Queries with Multiple Keywords
- Spread activation from each keyword separately
- Then combine the activations from different keywords
- OR: use addition or belief combination
- AND: take product of activations
- Gives better results
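Combining per-keyword activations can be sketched as below. The node names in the usage note and the function name `combine_near` are purely illustrative; the OR branch uses belief combination, one of the two options listed above.

```python
import math

def combine_near(per_keyword, mode="and"):
    """per_keyword: list of dicts (one per near keyword), node -> activation.
    Returns node -> combined activation."""
    nodes = set().union(*per_keyword)
    combined = {}
    for n in nodes:
        acts = [kw.get(n, 0.0) for kw in per_keyword]
        if mode == "and":
            combined[n] = math.prod(acts)                      # product
        else:
            combined[n] = 1.0 - math.prod(1.0 - a for a in acts)  # belief OR
    return combined
```

With activations `{"n1": 0.6, "n2": 0.5}` for one keyword and `{"n1": 0.5, "n3": 0.8}` for another, AND keeps only `n1` with a non-zero score (0.3), while OR scores `n1` and `n3` both at 0.8.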
19Proximity and Tree Scores
- Queries can combine proximity scores with tree scores
- author(near transactions) data integration
- Related work
- Goldman et al. (VLDB 1998)
- Considers only shortest path from each node
- author (near Surajit Chaudhuri)
- ObjectRank (VLDB 2004)
- Done independently
- Precomputed; high space overhead
- Subsequently extended to IR context in the SPIN system
20Example Answers
- Anecdotal results on DBLP Bibliography
- author (near recovery): Dave Lomet, C. Mohan, etc.
- transaction: Jim Gray's classic paper and textbook at the top, based on prestige (# of citations)
- Johnson (near OLAP): Theodore Johnson
- And on IIT Bombay Thesis/Dissertation Database
- faculty (near earthquake): R.S. Jangid, P. Banerji, R. Sinha
- faculty (near database)
21Other Query Extensions
- Restriction of context
- author:washington vs. state:washington
- Twig and approximate twig patterns
- recovery cites optimization
22Graph Search Algorithms To Find Answer Trees
23Bidirectional Expansion for Keyword Search on Graph Databases
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
24Finding Answer Trees
- Backward Expanding Search
- BANKS (ICDE 2002)
- Intuition: travel backwards from keyword nodes till you hit a common node
Query: sudarshan roy
[Figure: answer tree for the query, with paper "MultiQuery Optimization" linked through writes tuples to authors Sudarshan and Prasan Roy]
25Backward Search Algorithm
- Algorithm
- Run concurrent single source shortest path iterators from each node matching a keyword
- Traverse the graph edges in reverse direction
- Output next nearest node on each get-next() call
- Do best-first search across iterators
- Output node if in the intersection of sets of nodes reached from each keyword
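The steps above can be sketched with one shared priority queue that interleaves all the shortest-path iterators in best-first order. This is a simplified illustration, not the BANKS implementation: it reports each candidate root once, with the sum of its best distances to each keyword, and does not build or deduplicate the answer trees. The input format (`rev_edges`, `keyword_nodes`) is an assumption.

```python
import heapq
from collections import defaultdict

def backward_search(rev_edges, keyword_nodes):
    """rev_edges: node -> list of (predecessor, weight), i.e. edges reversed.
    keyword_nodes: list of sets, one set of matching nodes per keyword.
    Yields (root, total distance) as roots reachable from all keywords appear."""
    heap = []  # (distance, origin node, keyword index, current node)
    for k, nodes in enumerate(keyword_nodes):
        for origin in nodes:
            heapq.heappush(heap, (0.0, origin, k, origin))
    settled = set()                 # (origin, node) pairs already output
    best = defaultdict(dict)        # node -> {keyword index: best distance}
    emitted = set()
    while heap:
        dist, origin, k, node = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        # Entries pop in increasing distance, so the first value per
        # (node, keyword) is the shortest one.
        best[node].setdefault(k, dist)
        if len(best[node]) == len(keyword_nodes) and node not in emitted:
            emitted.add(node)
            yield node, sum(best[node].values())
        for pred, weight in rev_edges.get(node, ()):
            if (origin, pred) not in settled:
                heapq.heappush(heap, (dist + weight, origin, k, pred))
```

For the "sudarshan roy" example, with unit-weight edges paper → writes → author, expanding backward from both author nodes meets at the paper node, which is reported as the root with total distance 4.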
26Backward Search Limitations
- Wasteful exploration of graph
- Frequently occurring keywords
- Hub nodes in the graph (high in-degree)
[Figure: query "Shashank Sudarshan Database"; schema legend: author, writes, paper; keyword nodes: Database, Shashank, Sudarshan]
27Bidirectional Search Motivation
28Bidir Search Intuition
- First cut solution
- Don't go backward if keyword matches many nodes
- Don't go backward if node points to a hub
- Instead explore forward from other keywords
29Bidir Search Example
[Figure: bidirectional search example for query "Shashank Sudarshan Database"; schema legend: author, writes, paper; keyword nodes: Database, Shashank, Sudarshan]
30Bidir Search Issues
- What should threshold for not expanding be?
- Our solution: prioritize expansion of nodes based on spreading activation
- to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?
31Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
- Gives low activation to frequently occurring keywords
[Figure: keyword "John" matches five nodes, each given initial activation 1/5]
32Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ = 0.3)
- Gives low activation to neighbors of hubs
[Figure: a node with activation 1/5 retains 0.3 x 1/5 and spreads 0.7 x 1/5 x 1/4 to each of four neighbors; other nodes carry activation 0 or 1]
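The numbers in the two spreading-activation figures can be reproduced with a few lines. The sketch assumes all matching nodes have equal prestige (so the keyword's activation of 1 is split evenly) and that the spread share is divided equally among neighbors; the function names are hypothetical.

```python
def initial_activation(matching_nodes):
    # A keyword's total activation of 1 is split over the nodes it matches,
    # so a frequent keyword gives each node little activation.
    return {n: 1.0 / len(matching_nodes) for n in matching_nodes}

def spread(activation, neighbors, mu=0.3):
    # The node retains fraction mu and spreads (1 - mu) over its neighbors,
    # so each neighbor of a hub (many neighbors) receives very little.
    retained = mu * activation
    share = (1.0 - mu) * activation / len(neighbors) if neighbors else 0.0
    return retained, {nb: share for nb in neighbors}
```

A keyword matching five nodes gives each 1/5; one of them with four neighbors retains 0.3 x 1/5 = 0.06 and passes 0.7 x 1/5 x 1/4 = 0.035 to each neighbor.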
33Bidir Search Iterators
- How to manage exploration in both directions?
- Single backward iterator and single forward iterator with suitable data structures
- E.g., to keep track of parents of nodes
[Figure: example graph with endpoints A and B; each node is labeled with (dist from A, dist from B), using ∞ for a distance not yet known]
34Bidir Search Algorithm
- Algorithm
- Activate matching nodes; insert into backward iterator
- while (iterators are not empty)
- Choose iterator for expansion in best-first manner
- Explore node with highest activation
- Spread activation to neighbors
- Update path weights (and other data structures)
- Propagate values to ancestors if necessary
- Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
- Stop when top-k results are produced
35Bidir Search top-k results
- Results need not be generated in order
- Naïve solution
- Store results in an intermediate heap
- Output top k results after m·k total results have been generated (m = 10)
- Can do better
- Compute upper bound on score of next result; output answers with a higher score
- Similar to NRA algorithm (Fagin et al., PODS 2001)
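The "can do better" idea amounts to buffering answers and releasing only those that no future answer can beat. A minimal sketch follows; the class name `AnswerBuffer` and its methods are hypothetical, and the upper bound is taken as an input since how it is computed depends on the search state.

```python
import heapq

class AnswerBuffer:
    """Holds answers generated out of order until it is safe to output them."""

    def __init__(self):
        self._heap = []  # max-heap emulated with negated scores

    def add(self, score, answer):
        heapq.heappush(self._heap, (-score, answer))

    def release(self, upper_bound):
        """Return buffered answers, best first, whose score exceeds the
        upper bound on the score of any answer not yet generated."""
        out = []
        while self._heap and -self._heap[0][0] > upper_bound:
            neg_score, answer = heapq.heappop(self._heap)
            out.append((-neg_score, answer))
        return out
```

After adding answers with scores 0.5, 0.2 and 0.9, `release(0.4)` returns the 0.9 and 0.5 answers in that order; the 0.2 answer stays buffered until the bound drops below it.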
36Experiments
- Datasets
- DBLP, IMDB: 2 million nodes, 9 million edges
- US Patent DB: 4 million nodes, 15 million edges
- Workload
- Keywords randomly picked from results of SQL join statements
- Search algorithms
- MI-Bkwd: original backward search
- Iterator for every node matching a keyword
- SI-Bkwd: backward search with single backward iterator
- Bidirec: bidirectional search
- Time taken/nodes explored
- Measured when 10th answer is generated (or last answer if # answers < 10)
- Origin size
- # nodes matched by keywords in the query
37Experiments (2)
- MI-Bkwd versus SI-Bkwd
- SI-Bkwd gain increases with origin size, # keywords
38Experiments (3)
- SI-Bkwd versus Bidirec
- Bidirec gain increases with origin size, # keywords
39Experiments (4)
- Precision/Recall experiments
- Relevant answers are well-defined: can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
- Recall: 100%
- Precision: 100% at near full recall
- Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!
40Discussion
- Bidirectional search as dynamic per-tuple join ordering
- Related work in this area: Eddies
- Unlike Eddies, bidirectional search is
- Schema-less
- Priority based on activation instead of selectivity
- Generates answers in relevance order
41Conclusions
- Graph model
- Common denominator representation
- Multiple types of queries required
- Near queries, spanning tree queries
- Ranking is critical
- Edge and node weights, spreading activation
- Efficient graph search is critical
- Bidirectional graph search
42Ongoing and Future Work
- Graphs larger than memory
- Idea: use multi-level graph representation
- Higher levels are condensed representation of lower levels
- Revised approach to search
- Search on condensed super-graph (in-memory), to find potential answers
- Expand nodes (disk I/O)
- Redo search on expanded nodes to find real answers
43Graph Condensation
[Figure: nodes clustered into supernodes S1, S2, S3]
- Cluster nodes to get supergraph
- Different clusterings possible, e.g. Raghavan/Garcia-Molina (ICDE 2003) for web graphs
- Currently building infrastructure and exploring techniques
44Searching in Condensed Graphs
- Weight of super-edge S1→S2 = min(real edge weights between S1 and S2)
- Issues
- E.g. minimal answers on full graph may be non-minimal on condensed graph
- Multi-granularity representation
- Other types of queries on condensed graphs
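Deriving super-edge weights from a clustering is a small aggregation. In this sketch the edge and cluster representations (`edges`, `cluster_of`) are assumptions, and edges inside a cluster are simply dropped.

```python
def super_edges(edges, cluster_of):
    """edges: {(u, v): weight} for directed edges of the full graph.
    cluster_of: node -> supernode id.
    Returns {(S1, S2): min weight over real edges from S1 to S2}."""
    result = {}
    for (u, v), weight in edges.items():
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # edge internal to a supernode
        key = (cu, cv)
        result[key] = min(weight, result.get(key, weight))
    return result
```

Because each super-edge takes the minimum real weight, it never exceeds the weight of any real edge it stands for, which is why answers found on the condensed graph are only potential answers that must be rechecked on the expanded nodes.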
45Further Directions
- Querying graph representation of integrated data
[Figure: name variants "S. Sarawagi", "Sunita S." and "Suneeta S." connected by edges labeled 0.1 and 5]
46Future of Keyword Search in DBs
- Next generation of intelligent search will require context information
- E.g. search email, files, calendar, ..
- Information integration will be important
- Graph structured data will be a key component
- Security
- Is there a killer app?
47Thank You!
Questions??
48Experiments (5)
- Comparison with Sparse
- Hristidis et al. (VLDB 2003)
- Generate join expressions leading to query results
- Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
- For top-k results, automatically determine required number of join expressions
- Sparse-LB
- Manually generate required join expressions
- Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
- Bidirectional outperforms by a factor of 3 (esp. when # joins is large)
49Experiments (6)
- SI-Bkwd versus Bidirec by origin size
- Bidirec gains more with unbalanced origin sizes
Legend: A = (T,S,S,S), B = (M,M,M,M), C = (M,L,L,L), D = (M,M,L,L), E = (T,L,L,L), F = (T,S,M,L), G = (T,M,L,L), H = (T,T,T,L)
50Bidir Search top-k results (2)
- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
- $m_i$: minimum path length explored backward from keyword $i$
- unseen answer node: $1/(m_1 + m_2 + \dots + m_n)$
- visited answer node: suppose reached from first $x$ keywords with distance $d_i$
- $1/\big((d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)\big)$
- combine this with max node prestige
- We simply use $1/(m_1 + m_2 + \dots + m_n)$!
- Experiments show no significant loss in using this heuristic
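The bound can be written as one small function. The sketch assumes the visited-node case means "known distances for the keywords already reached, frontier minima for the rest", all summed in the denominator; the function and parameter names are hypothetical, and the case where every minimum is still 0 (before any expansion) is not handled.

```python
def edge_score_bound(frontier_min, known_dist=None):
    """frontier_min: list with m_i for each keyword i (minimum path length
    explored backward so far). known_dist: {keyword index: d_i} for keywords
    from which the node has already been reached (empty for unseen nodes).
    Returns an upper bound on the edge score of a future answer."""
    known_dist = known_dist or {}
    total = sum(known_dist.get(i, m_i) for i, m_i in enumerate(frontier_min))
    return 1.0 / total
```

With frontier minima (2, 3, 1), an unseen node is bounded by 1/6; a node already reached from the first keyword at distance 4 is bounded by 1/(4 + 3 + 1) = 0.125.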
51Bidirectional Search (1)
- Single backward search iterator across all keywords
- Unlike per-matched-node in backward exp.
- Changes answer set slightly
- Different justifications for same root may be lost
- Didn't find any meaningful answers lost
- Spreading activation to prioritize backward search
- Activation spread per keyword
- Nodes prioritized by sum of activations
- Single forward iterator
52Bidir. Search (2)
- For each node in backward iterator
- dist(u,i): best path from u to a node in $S_i$
- $S_i$: nodes matching keyword $K_i$
- sp(u,i): next node in shortest path from u to $K_i$
- a(u,i): activation at u from keyword $K_i$
- a(u): sum of a(u,i)
53Bidir. Search (3)
- Spreading of activation
- Done separately for each keyword
- For nodes in $S_i$ (nodes matching keyword $K_i$)
- initial activation proportional to node prestige
- total of 1 across nodes in $S_i$
- Node retains µ fraction of received activation, spreads (1-µ) fraction
- Activation spread from a node V divided among neighbors $U_i$ in inverse proportion to weight of edge $U_i \to V$
- Thus incorporates path score too
- Activation combining function: max
[Figure: example graph with nodes matching keyword1 and keyword2; numeric labels are all 1 except for a single 2]
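Dividing the spread activation in inverse proportion to edge weight could look like this. The input format and function names are assumptions; the max combining function follows the last bullet above.

```python
def spread_by_weight(activation, in_edges, mu=0.3):
    """activation: activation received at node V.
    in_edges: {U_i: weight of edge U_i -> V}.
    Returns {U_i: activation passed to U_i}."""
    inverse = {u: 1.0 / w for u, w in in_edges.items()}
    total = sum(inverse.values())
    spreadable = (1.0 - mu) * activation  # V retains fraction mu
    # Lighter (cheaper) edges receive a larger share.
    return {u: spreadable * inv / total for u, inv in inverse.items()}

def combine_max(current, incoming):
    # A node keeps the largest activation seen for a keyword.
    return max(current, incoming)
```

For a node with activation 1 and incoming edges of weight 1 and 2, the neighbor on the weight-1 edge receives twice as much as the other, and together they receive 0.7.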
54Bidir. Search (3)
- Forward search iterator
- Forward search from all nodes reached by backward search
- Track best forward path to each keyword
- Initially infinite cost
- Whenever this changes, propagate cost change to all affected ancestors
[Figure: example graph with keyword nodes k1 and k2]
55Bidir. Search (3)
- On each path length update (due to backward or forward search)
- Check if node can reach all keywords
- If so, add it to output heap
- If same undirected tree not already present
- Output heap deals with out-of-order answer generation
- When to output nodes from heap
- For each keyword $K_i$, track $M_i$
- $M_i$: minimum path length to $K_i$ among all yet-to-be-explored nodes in backward search tree
- $1/\max_i(M_i)$ is upper bound on edge score
- $1/\sum_i M_i$ can be used instead, at risk of out-of-order answers
- Use max of node scores to compute overall score upper bound
- Output answer if its score is > overall score upper bound
56Probabilistic Edge Score Model
- Probabilistic edge scoring model: alternative to edge weight model
- Path weight = product of the probabilities of following each edge
- Edge probability
- Forward ∝ 1/out-degree
- Backward ∝ 1/in-degree
- Can have separate in/out degrees by edge type, probability of following each edge type
- Edge score E = (harmonic) mean of path weights from leaves to root
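A small numeric sketch of the probabilistic model follows. It makes simplifying assumptions that go beyond the slide: every edge on a path is a forward edge followed with probability exactly 1/out-degree, a path's weight is the product of those probabilities, and per-edge-type probabilities are ignored. The function names are hypothetical.

```python
import math
from statistics import harmonic_mean

def path_probability(path, out_degree):
    """path: list of nodes from the root to a keyword leaf.
    Each forward edge out of node u is followed with probability
    1 / out_degree[u] (assumed exact rather than merely proportional)."""
    return math.prod(1.0 / out_degree[u] for u in path[:-1])

def probabilistic_edge_score(paths, out_degree):
    # Harmonic mean of the root-to-leaf path weights: one very unlikely
    # path pulls the score down sharply.
    return harmonic_mean([path_probability(p, out_degree) for p in paths])
```

With out-degrees r = 2, a = 1, b = 4, the paths r → a → k1 and r → b → k2 have weights 0.5 and 0.125, giving a harmonic mean of 0.2.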