Keyword Querying on Semi-Structured Data

1
Keyword Querying on Semi-Structured Data

S. Sudarshan, IIT Bombay

Joint work with Soumen Chakrabarti and a large number of past and current students
http://www.cse.iitb.ac.in/banks/
2
Keyword Search on Semi-Structured Data

A significant fraction of data is resident in
relational databases or in semi-structured data
(XML)
Organizational, government, scientific, medical
data
Goal: ad hoc/exploratory database querying using keywords
SQL/XQuery is not appropriate for casual users
Form interfaces cumbersome
Require separate form for each type of query
Not suitable for ad hoc queries

3
Keyword Search on Semi-Structured Data (Cont.)

Differences from IR/Web Search
Normalization splits related data across multiple
tuples
In XML/HTML, edges represent connections between
different nodes
To answer a keyword query we need to find a
(closely) connected set of tuples/nodes that
together match all given keywords

4
Graph Representation of Data

Database modeled as a graph
Nodes: tuples
Directed edges for foreign keys, inclusion dependencies, etc.
Information integration: graph representation of integrated information plus keyword querying
Can model relational, XML, HTML, ... data in a single graph

5
Graph Data Model (2)

E.g., XML data
<proceedings conf="VLDB" year="2009">
  <paper id="1">
    <title>Recovering from Query Optimization</title> ...
  </paper>
  <paper id="2">
    <title>Concurrency Control for Keyword Search</title> ...
    <cite ref="1">Recovering from Query Optimization</cite>
  </paper>
</proceedings>

6
Answer Model

Answer: minimal rooted directed tree connecting nodes containing keywords
Undirected tree: Discover, DBXplorer, ...

Multiple answers possible
Answer relevance computed from answer edge score combined with answer node score (prestige)

E.g., query: Sudarshan Roy
7
Edge Directionality

Some popular tuples are connected to many other
tuples
Paper1 → vldb06 ← paper2
Students → departments → university
Popular tuples would create misleading shortcuts
E.g. every student would be closely linked with
every other student via the department/university

8
Edge Weight Model in BANKS

Idea: define different forward and backward edge weights
Forward edge weight (in direction of foreign key reference/XML containment)
Defaults to 1, can be based on schema
e.g. citation link weight 10, writes link weight 5
Backward edge u→v weight (where the foreign key is v→u)
Proportional to the number of edges pointing to u
Log scaling
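
The weight model above can be sketched in a few lines. This is a minimal illustration, not the BANKS implementation: the function name `edge_weights` and the specific scaling `log2(1 + indegree)` are assumptions, since the slide only says the backward weight is proportional to the number of incoming edges with log scaling.

```python
import math
from collections import Counter

def edge_weights(fk_edges, forward_weight=1.0):
    """fk_edges: list of (v, u) pairs, each a foreign-key reference v -> u.

    Returns a dict mapping directed edges to weights. The log2(1 + indegree)
    form is an assumed instance of "log scaling", not taken from the slides.
    """
    indegree = Counter(u for _, u in fk_edges)
    weights = {}
    for v, u in fk_edges:
        weights[(v, u)] = forward_weight  # forward edge, default weight 1
        # Backward edge u -> v grows with the number of edges pointing to u,
        # so popular tuples (hubs) become expensive shortcuts.
        weights[(u, v)] = forward_weight * math.log2(1 + indegree[u])
    return weights

# Three students referencing one department: the department is a mild hub.
weights = edge_weights([("s1", "dept"), ("s2", "dept"), ("s3", "dept")])
```

With this choice, a path between two students through the department costs 1 + 2 = 3 rather than 2, and the penalty grows as more tuples reference the same hub.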

9
Node Prestige in BANKS

Node weight (prestige) based on indegree
More incoming edges ⇒ higher prestige
Google PageRank style transfer of prestige
Node weight computed using biased random walk
model
Bias based on edge type, direction

10
Response Ranking

Edge score $E_A$
Smaller tree ⇒ higher score
E.g., BANKS: $E_A = 1 / \sum (\text{edge weights})$
Variant
Score of path from root to leaf = probability of random walk from root reaching that leaf
Tree score = product of leaf path scores
Node score $N_A$
Measure of authority of nodes in tree
E.g., BANKS: $N_A = \sum_{\text{leaf/root nodes}} \log(\text{node authority})$
Overall score $f(E_A, N_A)$
E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A^{\lambda}$
$\lambda = 0.2$ works well
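
As a small worked example of the overall score, the sketch below combines an edge score of the form 1 / (sum of edge weights) with a node score raised to the power 0.2. The function names are hypothetical, the node score is passed in as a precomputed number, and the tree is assumed to have at least one edge.

```python
def edge_score(edge_weights):
    """Edge score: inverse of the total edge weight of the answer tree.

    Assumes a non-empty list of positive weights (a single-node answer
    would need special handling, which this sketch omits).
    """
    return 1.0 / sum(edge_weights)

def overall_score(edge_weights, node_score, lam=0.2):
    """Overall score: edge score times node score raised to the power lam."""
    return edge_score(edge_weights) * node_score ** lam

# A two-edge tree versus a four-edge tree with the same node score.
small = overall_score([1, 1], node_score=1.0)
large = overall_score([1, 1, 1, 1], node_score=1.0)
```

Because the exponent is small, the node score only moderates the ranking: the tree size still dominates unless the prestige values differ by a large factor.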

11
The BANKS System
[Figure: system diagram with components User, Web Server (Servlets), BANKS, Database and XML data source; connections labeled HTTP, JDBC and preprocess]

Available on the web, with DBLP, IMDB and IITB ETD data
http://www.cse.iitb.ac.in/banks/
Preprocessing to create indices and give weights to links
Provides keyword search and browsing features

12
BANKS Architecture

Data resident on disk
Graph representation of data resident in memory
Nodes and edges with their types/counts
$16 \times |V| + 8 \times |E|$ bytes
Search done in memory
Why in memory?
Allows us to use interesting graph traversal
based algorithms without being constrained by SQL
and related performance issues
With current memory sizes, database graphs for
many applications will fit in server memory
External memory search: ongoing work

13
Related Work

Keyword querying on relational databases
DBXplorer (ICDE'02), Discover (VLDB'02)
Use SQL generation
BANKS (ICDE'02) (G. Bhalotia, Charuta N., A. Hulgeri, Soumen C., S. Sudarshan)
pays more attention to result ranking
does not require schema
Keyword querying on XML
Tree model (answer based on containment edges)
XRank (Cornell), proximity in XML (AT&T Research),
Schema-Free XQuery (Michigan)
Tree model cannot handle arbitrary graph edges
Graph model
SphereSearch (VLDB 2005)
Generates XML tags to represent context
Query can specify keyword context
Does not explore edge weights

14
Proximity (Near) Queries
15
Proximity Queries

Node weight by proximity
E.g. author (near recovery)
Node prestige is higher if the node is close to multiple nodes matching the near keywords
Example applications
Finding experts on a particular area
faculty (near earthquake)
Author (near recovery)

[Figure: example graph connecting the nodes Mohan and Raghu to nodes titled "Analysis of Earthquake ..", "Earthquake Resistant", "Earthquake Measurement" and "Building Earthquake"]
16
Proximity via Spreading Activation

Idea
Each near keyword has activation of 1
Divided among nodes matching keyword, proportional to their node prestige
Each node keeps fraction 1-µ of its received activation and spreads fraction µ amongst its neighbors
Graph may have cycles
Combine activation received from neighbors
$a = 1 - (1 - a_1)(1 - a_2)$ (belief function)
Additive combination ($a_1 + a_2$) may diverge with cycles

17
Activation Change Propagation

Algorithm to incrementally propagate activation change δ
Nodes to propagate δ from are in queue
Best-first propagation
Propagation to node already in queue simply modifies its δ value
Stops when δ becomes smaller than cutoff
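
The propagation steps above can be sketched as a best-first loop over pending activation changes. This is an illustrative reading of the slide, not the authors' code: `propagate_activation`, the adjacency-dict input and the equal split among neighbors are assumptions, and the sketch follows this part of the talk in letting a node keep a fraction 1-µ and spread a fraction µ.

```python
import heapq

def propagate_activation(neighbors, seeds, mu=0.5, cutoff=1e-3):
    """neighbors: node -> list of adjacent nodes; seeds: node -> initial change.

    Returns node -> activation after best-first propagation.
    """
    activation = {}
    pending = dict(seeds)  # node -> accumulated change waiting in the queue
    heap = [(-delta, node) for node, delta in pending.items()]
    heapq.heapify(heap)
    while heap:
        neg_delta, node = heapq.heappop(heap)
        if pending.get(node) != -neg_delta:
            continue  # stale entry: the node's pending change was modified
        delta = pending.pop(node)
        if delta < cutoff:
            break  # best-first order: every remaining change is smaller
        kept, spread = (1 - mu) * delta, mu * delta
        old = activation.get(node, 0.0)
        # Belief combination keeps activation below 1 even on cycles.
        activation[node] = 1 - (1 - old) * (1 - kept)
        adjacent = neighbors.get(node, [])
        for other in adjacent:
            # A node already in the queue just has its pending change updated.
            pending[other] = pending.get(other, 0.0) + spread / len(adjacent)
            heapq.heappush(heap, (-pending[other], other))
    return activation

# Two nodes in a cycle: activation bounces back and forth until the cutoff.
result = propagate_activation({"a": ["b"], "b": ["a"]}, {"a": 1.0}, cutoff=0.2)
```

Each step removes a change δ from the queue and re-inserts only µ·δ in total, so the pending mass shrinks and the loop terminates even when the graph has cycles.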

[Figure: example graph annotated with propagated activation values]
18
Near Queries with Multiple Keywords

Spread activation from each keyword separately
Then combine the activations from different
keywords
OR: use addition or belief combination
AND: take product of activations
Gives better results
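
A short sketch of the two combination modes, assuming each near keyword has already produced its own map from node to activation. The names `combine_or`, `combine_and` and `combine_keywords` are hypothetical; OR is shown with belief combination rather than plain addition.

```python
import math

def combine_or(activations):
    """Belief combination: 1 - product of (1 - a_i)."""
    return 1 - math.prod(1 - a for a in activations)

def combine_and(activations):
    """Product of activations: zero unless every keyword reaches the node."""
    return math.prod(activations)

def combine_keywords(per_keyword, mode="and"):
    """per_keyword: list of dicts (node -> activation), one per near keyword."""
    nodes = set().union(*per_keyword)
    combine = combine_and if mode == "and" else combine_or
    return {n: combine([act.get(n, 0.0) for act in per_keyword]) for n in nodes}

per_keyword = [{"x": 0.5, "y": 0.5}, {"x": 0.5}]
and_scores = combine_keywords(per_keyword, mode="and")
or_scores = combine_keywords(per_keyword, mode="or")
```

Under AND, node y drops to zero because the second keyword never reaches it, which is one way to see why the product rewards nodes that are close to all of the near keywords.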

19
Proximity and Tree Scores

Queries can combine proximity scores with tree
scores
author (near transactions) data integration
Related work
Goldman et al. (VLDB'98)
Considers only shortest path from each node
Author (near Surajit Chaudhuri)
ObjectRank (VLDB'04)
Done independently
Precomputed: high space overhead
Subsequently extended to IR context in the SPIN system

20
Example Answers

Anecdotal results on DBLP Bibliography
author (near recovery): Dave Lomet, C. Mohan, etc.
transaction: Jim Gray's classic paper and textbook at the top, based on prestige (# of citations)
Johnson (near OLAP): Theodore Johnson
And on the IIT Bombay Thesis/Dissertation Database
faculty (near earthquake): R.S. Jangid, P. Banerji, R. Sinha
faculty (near database)

21
Other Query Extensions

Restriction of context
author:washington vs. state:washington
Twig and approximate twig patterns
recovery cites optimization

22
Graph Search Algorithms To Find Answer Trees
23
Bidirectional Expansion for Keyword Search on Graph Databases
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti,
S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
24
Finding Answer Trees

Backward Expanding Search
BANKS (ICDE'02)
Intuition: travel backwards from keyword nodes till you hit a common node

Query: sudarshan roy
[Figure: answer tree in which the paper "MultiQuery Optimization" is connected through writes tuples to the authors Sudarshan and Prasan Roy]
25
Backward Search Algorithm

Algorithm
Run concurrent single source shortest path
iterators from each node matching a keyword
Traverse the graph edges in reverse direction
Output next nearest node on each get-next() call
Do best-first search across iterators
Output node if in the intersection of sets of
nodes reached from each keyword
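
The backward expanding search can be sketched as one Dijkstra iterator per keyword node, all running over reversed edges, with a best-first choice among iterators. This is a simplified illustration: the class and function names are hypothetical, the graph is an adjacency dict that already contains both forward and backward edges, and an answer is reported as a (root, total path length) pair rather than a full tree.

```python
import heapq
from collections import defaultdict

class ShortestPathIterator:
    """Dijkstra from one keyword node, following edges in reverse."""

    def __init__(self, reverse_edges, source, keyword):
        self.reverse_edges = reverse_edges
        self.keyword = keyword
        self.heap = [(0.0, source)]
        self.done = set()

    def peek(self):
        """Distance of the next node to be output, or None if exhausted."""
        while self.heap and self.heap[0][1] in self.done:
            heapq.heappop(self.heap)  # skip stale entries
        return self.heap[0][0] if self.heap else None

    def get_next(self):
        """Output the next nearest node; call only after peek() is not None."""
        dist, node = heapq.heappop(self.heap)
        self.done.add(node)
        for pred, weight in self.reverse_edges[node]:
            if pred not in self.done:
                heapq.heappush(self.heap, (dist + weight, pred))
        return node, dist

def backward_search(edges, keyword_nodes, k=1):
    """edges: node -> [(successor, weight)]; keyword_nodes: keyword -> [nodes]."""
    reverse_edges = defaultdict(list)
    for u, successors in edges.items():
        for v, w in successors:
            reverse_edges[v].append((u, w))
    iterators = [ShortestPathIterator(reverse_edges, n, kw)
                 for kw, nodes in keyword_nodes.items() for n in nodes]
    reached = defaultdict(dict)  # node -> {keyword: shortest distance}
    roots, emitted = [], set()
    while len(roots) < k:
        live = [it for it in iterators if it.peek() is not None]
        if not live:
            break
        it = min(live, key=lambda i: i.peek())  # best-first across iterators
        node, dist = it.get_next()
        reached[node].setdefault(it.keyword, dist)
        if node not in emitted and len(reached[node]) == len(keyword_nodes):
            emitted.add(node)  # node is reached from every keyword
            roots.append((node, sum(reached[node].values())))
    return roots

edges = {"w1": [("p", 1), ("a1", 1)], "w2": [("p", 1), ("a2", 1)],
         "p": [("w1", 1), ("w2", 1)], "a1": [("w1", 1)], "a2": [("w2", 1)]}
roots = backward_search(edges, {"sudarshan": ["a1"], "roy": ["a2"]})
```

In the toy graph, the two author nodes meet at the paper node p, which is reported as the root of the first answer.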

26
Backward Search Limitations

Wasteful exploration of graph
Frequently occurring keywords
Hub nodes in the graph (high in-degree)

E.g., query: Shashank Sudarshan Database

[Figure: example graph with a schema legend (author, writes, paper) and nodes matching Shashank, Sudarshan and Database]
27
Bidirectional Search Motivation
28
Bidir Search Intuition

First cut solution
Don't go backward if keyword matches many nodes
Don't go backward if node points to a hub
Instead explore forward from other keywords

29
Bidir Search Example
E.g., query: Shashank Sudarshan Database

[Figure: the same example graph (author, writes and paper nodes; keyword nodes Shashank, Sudarshan, Database) explored bidirectionally]
30
Bidir Search Issues

What should threshold for not expanding be?
Our solution: prioritize expansion of nodes based on spreading activation,
to penalize frequent keywords and bushy trees
How to manage exploration in both directions?

31
Bidir Search Spreading Activation

Spreading Activation
Node with highest activation explored first
Every node given an initial activation
Gives low activation to frequently occurring
keywords

[Figure: five nodes matching the keyword "John", each with initial activation 1/5]
32
Bidir Search Spreading Activation

Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (µ = 0.3)
Gives low activation to neighbors of hubs
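
The two effects described on these slides can be reproduced numerically. The sketch below is an assumption-laden illustration: `initial_activation` divides a keyword's activation equally among its matching nodes (that is, equal prestige is assumed), and `spread` lets a node retain a fraction µ and divide the rest equally among its neighbors.

```python
def initial_activation(matching_nodes):
    """Split one unit of activation equally among nodes matching a keyword."""
    share = 1.0 / len(matching_nodes)
    return {node: share for node in matching_nodes}

def spread(activation, neighbors, mu=0.3):
    """Return (retained activation, node -> activation passed to each neighbor)."""
    retained = mu * activation
    per_neighbor = (1 - mu) * activation / len(neighbors) if neighbors else 0.0
    return retained, {n: per_neighbor for n in neighbors}

# A keyword matching five nodes; one of them has four neighbors.
start = initial_activation(["j1", "j2", "j3", "j4", "j5"])
retained, passed = spread(start["j1"], ["n1", "n2", "n3", "n4"])
```

Each of the four neighbors receives 0.7 x 1/5 x 1/4 = 0.035, so both a frequent keyword and a high-degree node push the resulting priorities down.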

[Figure: activation spreading example; a node with activation 1/5 retains 0.3 x 1/5 and each of its four neighbors receives 0.7 x 1/5 x 1/4]
33
Bidir Search Iterators

How to manage exploration in both directions?
Single backward iterator plus single forward iterator, with suitable data structures
E.g., to keep track of parents of nodes

[Figure: example graph with keyword nodes A and B; each node is annotated with the pair (dist from A, dist from B), using ∞ for distances not yet known]
34
Bidir Search Algorithm

Algorithm
Activate matching nodes, insert into backward iterator
while (iterators are not empty)
  Choose iterator for expansion in best-first manner
  Explore node with highest activation
  Spread activation to neighbors
  Update path weights (and other data structures)
  Propagate values to ancestors if necessary
  Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
  Stop when top-k results are produced
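
The loop above can be fleshed out into a compact sketch. It is a simplification under several assumptions and not the BANKS code: all names are hypothetical, activation is split equally among predecessors, queue priorities are fixed at insertion time, answers are reported as (root, sum of path lengths) in discovery order, and the top-k bound is omitted.

```python
import heapq
from collections import defaultdict

INF = float("inf")

def bidirectional_search(edges, keyword_nodes, mu=0.3, k=1):
    """edges: node -> [(successor, weight)]; keyword_nodes: keyword -> [nodes]."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for u, successors in edges.items():
        for v, w in successors:
            out_edges[u].append((v, w))
            in_edges[v].append((u, w))
    kws = list(keyword_nodes)
    dist = defaultdict(lambda: dict.fromkeys(kws, INF))  # best path per keyword
    act = defaultdict(float)
    q_in, q_out = [], []  # backward/forward iterators, keyed on -activation
    seen_in, seen_out, emitted, answers = set(), set(), set(), []

    def relax(u, v, w):
        """Use edge u -> v to shorten u's paths; report whether anything changed."""
        changed = False
        for kw in kws:
            if dist[v][kw] + w < dist[u][kw]:
                dist[u][kw] = dist[v][kw] + w
                changed = True
        return changed

    def update(u):
        """Emit u if it reaches all keywords; push changes to known ancestors."""
        stack = [u]
        while stack:
            x = stack.pop()
            if x not in emitted and all(d < INF for d in dist[x].values()):
                emitted.add(x)
                answers.append((x, sum(dist[x].values())))
            for p, w in in_edges[x]:
                if (p in seen_in or p in seen_out) and relax(p, x, w):
                    stack.append(p)

    for kw, nodes in keyword_nodes.items():
        for n in nodes:
            dist[n][kw] = 0.0
            act[n] += 1.0 / len(nodes)  # frequent keywords get low activation
    for n in list(act):
        seen_in.add(n)
        heapq.heappush(q_in, (-act[n], n))
    for n in list(seen_in):
        update(n)  # covers nodes that match every keyword by themselves

    while (q_in or q_out) and len(answers) < k:
        if q_in and (not q_out or q_in[0][0] <= q_out[0][0]):
            _, v = heapq.heappop(q_in)  # backward step: follow incoming edges
            preds = in_edges[v]
            for u, w in preds:
                act[u] += (1 - mu) * act[v] / len(preds)
                if relax(u, v, w):
                    update(u)
                if u not in seen_in:
                    seen_in.add(u)
                    heapq.heappush(q_in, (-act[u], u))
            if v not in seen_out:
                seen_out.add(v)  # queue v for future forward exploration
                heapq.heappush(q_out, (-act[v], v))
        else:
            _, u = heapq.heappop(q_out)  # forward step: follow outgoing edges
            for v, w in out_edges[u]:
                if relax(u, v, w):
                    update(u)
                if v not in seen_out:
                    seen_out.add(v)
                    heapq.heappush(q_out, (-act[v], v))
    return answers[:k]

# Keyword k2 matches three nodes, so its nodes start with low activation and
# the answer is found by exploring forward from r instead.
edges = {"r": [("a", 1), ("b1", 1)], "x2": [("b2", 1)], "x3": [("b3", 1)]}
found = bidirectional_search(edges, {"k1": ["a"], "k2": ["b1", "b2", "b3"]})
```

In this example the root r is completed by a forward step from r to b1, before any of the low-activation k2 nodes has been expanded backward.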

35
Bidir Search top-k results

Results need not be generated in-order
Naïve solution
Store results in an intermediate heap
Output top k results after m·k total results have been generated (m = 10)
Can do better
Compute upper bound on score of next result
Output answers with a higher score
Similar to NRA algorithm (Fagin et al., PODS'01)
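
The bound-based output step can be sketched as a small buffer that holds generated answers and releases only those whose score exceeds the current upper bound on any future answer. The class name `OutputHeap` and its methods are hypothetical; how the bound itself is computed is left to the caller.

```python
import heapq

class OutputHeap:
    """Buffers out-of-order answers; releases them in decreasing score order."""

    def __init__(self):
        self._heap = []  # max-heap via negated scores

    def add(self, score, answer):
        heapq.heappush(self._heap, (-score, answer))

    def release(self, upper_bound):
        """Pop every buffered answer scoring above the bound on unseen answers."""
        released = []
        while self._heap and -self._heap[0][0] > upper_bound:
            neg_score, answer = heapq.heappop(self._heap)
            released.append((answer, -neg_score))
        return released

buffer = OutputHeap()
buffer.add(0.2, "tree2")
buffer.add(0.5, "tree1")
first = buffer.release(upper_bound=0.3)   # only tree1 is safe to output
second = buffer.release(upper_bound=0.1)  # the bound has dropped; tree2 follows
```

An answer is held back exactly as long as some not-yet-generated answer could still outrank it, which is the same idea as the threshold test in NRA-style top-k processing.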

36
Experiments

Datasets
DBLP, IMDB: 2 million nodes, 9 million edges
US Patent DB: 4 million nodes, 15 million edges
Workload
Keywords randomly picked from results of SQL join statements
Search algorithms
MI-Bkwd: original backward search
Iterator for every node matching a keyword
SI-Bkwd: backward search with single backward iterator
Bidirec: bidirectional search
Time taken/nodes explored
Measured when 10th answer is generated (or last answer if fewer than 10 answers)
Origin size
Number of nodes matched by keywords in the query

37
Experiments (2)

MI-Bkwd versus SI-Bkwd
SI-Bkwd gain increases with origin size and number of keywords

38
Experiments (3)

SI-Bkwd versus Bidirec
Bidirec gain increases with origin size and number of keywords

39
Experiments (4)

Precision/Recall experiments
Relevant answers are well-defined: can be generated through SQL statements
Both MI-Backward and Bidirectional show similar performance
Recall: 100%
Precision: 100% at near-full recall
Few irrelevant answers produced before generating all relevant answers
Bidirectional runs faster, yet with minimal loss of relevant results!

40
Discussion

Bidirectional search as dynamic per-tuple join
ordering
Related work in this area: Eddies
Unlike Eddies, bidirectional search is
Schema-less
Priority based on activation instead of
selectivity
Generates answers in relevance order

41
Conclusions

Graph model
Common denominator representation
Multiple types of queries required
Near queries, spanning tree queries
Ranking is critical
Edge and node weights, spreading activation
Efficient graph search is critical
Bidirectional graph search

42
Ongoing and Future Work

Graphs larger than memory
Idea: use multi-level graph representation
Higher levels are condensed representation of
lower levels
Revised approach to search
Search on condensed super-graph (in-memory), to
find potential answers
Expand nodes (disk I/O)
Redo search on expanded nodes to find real
answers

43
Graph Condensation
[Figure: graph whose nodes are clustered into supernodes S1, S2 and S3]

Cluster nodes to get supergraph
Different clusterings possible, e.g. Raghavan/Garcia-Molina (ICDE 2003) for web graphs
Currently building infrastructure and exploring
techniques

44
Searching in Condensed Graphs

Weight of super-edge S1→S2 = min(real edge weights between S1 and S2)
Issues
E.g. minimal answers on full graph may be
non-minimal on condensed graph
Multi-granularity representation
Other types of queries on condensed graphs
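
The super-edge rule can be illustrated with a small condensation routine. This is a sketch under assumptions: the clustering is given as a ready-made node-to-supernode map, edges inside a cluster are dropped, and the names `condense` and `cluster_of` are hypothetical.

```python
def condense(edges, cluster_of):
    """edges: (u, v) -> weight; cluster_of: node -> supernode id.

    Returns (S_u, S_v) -> minimum weight of any real edge between the clusters.
    """
    super_edges = {}
    for (u, v), weight in edges.items():
        su, sv = cluster_of[u], cluster_of[v]
        if su == sv:
            continue  # edge is internal to one supernode
        key = (su, sv)
        super_edges[key] = min(weight, super_edges.get(key, float("inf")))
    return super_edges

edges = {("a", "c"): 3, ("b", "c"): 1, ("a", "b"): 2, ("c", "d"): 5}
cluster_of = {"a": "S1", "b": "S1", "c": "S2", "d": "S3"}
super_graph = condense(edges, cluster_of)
```

Taking the minimum makes every super-edge an optimistic estimate, so a search on the condensed graph can only underestimate real path costs; candidate answers then have to be re-checked after expanding the supernodes.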

45
Further Directions

Querying graph representation of integrated data

[Figure: graph of integrated data in which the nodes "S. Sarawagi", "Sunita S." and "Suneeta S." are linked by edges labeled 0.1 and 5]
46
Future of Keyword Search in DBs

Next generation of intelligent search will
require context information
E.g. search email, files, calendar, ..
Information integration will be important
Graph structured data will be a key component
Security
Is there a killer app?

47
Thank You!
Questions??
48
Experiments (5)

Comparison with Sparse
Hristidis et al. (VLDB'03)
Generates join expressions leading to query results
Uses DB-provided scores for ranking tuples and aggregates them to rank answer trees
For top-k results, automatically determines the required number of join expressions
Sparse-LB
Manually generate required join expressions
Sparse needs to do at least this much (and usually a lot more!)
Bidirectional versus Sparse-LB
Bidirectional outperforms by a factor of 3 (esp. when the number of joins is large)

49
Experiments (6)

SI-Bkwd versus Bidirec by origin size
Bidirec gains more with unbalanced origin sizes

Origin size classes:
A (T,S,S,S), B (M,M,M,M), C (M,L,L,L), D (M,M,L,L),
E (T,L,L,L), F (T,S,M,L), G (T,M,L,L), H (T,T,T,L)
50
Bidir Search top-k results (2)

Compute upper bound on score of next result
Output answers with a higher score
Computing the bound
$m_i$ = minimum path length explored backward from keyword $i$
Unseen answer node: $1 / (m_1 + m_2 + \dots + m_n)$
Visited answer node: suppose reached from first $x$ keywords with distance $d_i$
$1 / \big( (d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n) \big)$
Combine this with max node prestige
We simply use $1 / (m_1 + m_2 + \dots + m_n)$!
Experiments show no significant loss in using this heuristic
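
The two edge-score bounds can be written out directly. The helper names below are hypothetical, path lengths are assumed to be positive, and the combination with the maximum node prestige is left out.

```python
def bound_unseen(m):
    """Bound for a not-yet-seen root: 1 / (m_1 + ... + m_n)."""
    return 1.0 / sum(m)

def bound_visited(d, m_rest):
    """Bound for a root already reached from the first x keywords.

    d: known distances d_1..d_x; m_rest: frontier minima m_{x+1}..m_n.
    """
    return 1.0 / (sum(d) + sum(m_rest))

# Three keywords with frontier minima 1, 1 and 2.
unseen = bound_unseen([1, 1, 2])
visited = bound_visited([2], [1, 2])  # reached from keyword 1 at distance 2
```

Using only the first function, as the slide suggests, avoids tracking per-node state for the bound at the cost of a possibly looser estimate.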

51
Bidirectional Search (1)

Single backward search iterator across all
keywords
Unlike one iterator per matched node in backward expanding search
Changes answer set slightly
Different justifications for same root may be lost
Didn't find any meaningful answers lost
Spreading activation to prioritize backward
search
Activation spread per keyword
Nodes prioritized by sum of activations
Single forward iterator

52
Bidir. Search (2)

For each node in backward iterator
dist(u,i) = length of best path from u to a node in Si
Si = nodes matching keyword Ki
sp(u,i) = next node on the shortest path from u to Ki
a(u,i) = activation at u from keyword Ki
a(u) = sum of a(u,i)

53
Bidir. Search (3)

Spreading of activation
Done separately for each keyword
For nodes in Si (nodes matching keyword Ki)
Initial activation proportional to node prestige, total of 1 across nodes in Si
Node retains µ fraction of received activation, spreads (1-µ) fraction
Activation spread from a node V is divided among neighbors Ui in inverse proportion to the weight of edge Ui→V
Thus incorporates path score too
Activation combining function: max

[Figure: example graph with numeric edge labels and two keyword nodes, keyword1 and keyword2]
54
Bidir. Search (3)

Forward search iterator
Forward search from all nodes reached by backward
search
Track best forward path to each keyword
Initially infinite cost
Whenever this changes, propagate cost change to
all affected ancestors

[Figure: forward search example with keyword nodes k1 and k2]
55
Bidir. Search (3)

On each path length update (due to backward or
forward search)
Check if node can reach all keywords
If so, add it to output heap
If same undirected tree not already present
Output heap deals with out of order answer
generation
When to output nodes from heap
For each keyword Ki, track Mi
Mi = minimum path length to Ki among all yet-to-be-explored nodes in backward search tree
1/Max(Mi) is an upper bound on edge score
1/Sum(Mi) can be used instead, at risk of out-of-order answers
Use max of node scores to compute overall score upper bound
Output answer if its score is > overall score upper bound

56
Probabilistic Edge Score Model

Probabilistic edge scoring model: alternative to edge weight model
Path weight = product of the probabilities of following each edge
Edge probability
Forward: 1/out-degree
Backward: 1/in-degree
Can have separate in/out degrees by edge type, probability of following each edge type
Edge score E = (harmonic) mean of path weights from leaves to root