XRANK: Ranked Keyword Search over XML Documents

1
XRANK: Ranked Keyword Search over XML Documents
Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram
Presentation by Meghana Kshirsagar, Nitin Gupta
Indian Institute of Technology, Bombay
2
Outline

Motivation
Problem Definition, Query Semantics
Ranking Function
A New Data Structure: Dewey Inverted List (DIL)
Algorithms
Performance Evaluation

3
Motivation
4
Motivation - I

Why do we need search over XML data?
Why not use search techniques used on WWW
(keyword search on HTML)?

5
Motivation - II: Keyword Search, XML vs. HTML

XML
Structural
Links: IDREFs and XLinks
Tags: content specifiers
Ranking
Result: XML element (a tree)
Element-level ranking
Proximity: width and height

HTML
Structural
Links: document-to-document
Tags: format specifiers
Ranking
Result: document
Page-level ranking
Proximity: width (distance between words)

6
Problem Definition, Query Semantics, and Ranking
7
Problem Definition

Input: Set of keywords
Output: Ranked XML elements

What is a result? How to rank results ?
8
Bird's eye view of the system
Results
Query Keywords
XML doc repository
Preprocessing (ElemRank computation)
9
What is a result?

A minimal Steiner tree of XML elements

The result set is the set of XML elements that contain all query keywords at least once, after excluding the occurrences of keywords in contained results (if any).

10
result 1
result 2
11
Result: Graphical representation
containment edge
ancestor
descendant
12
Ranking: Which results to return first?

Properties
The Ranking function should
reflect Result Specificity
consider Keyword-Proximity
be Hyperlink Aware
Ranking function
f (height, width, link-structure)

13
Less specific result
More specific result
14
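Ranking Function
For a single XML element (node):
$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{\,t-1}$
where $v_1$ is the result element, $v_t$ is the descendant that directly contains keyword $k_i$, and $t$ is the number of elements on the containment path from $v_1$ to $v_t$.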
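As a rough illustration of the single-element ranking formula, the sketch below evaluates r(v1, ki) for a containment path of t elements. The function name `keyword_rank` and the sample numbers are assumptions made for this example, not values from the slides.

```python
def keyword_rank(elem_rank_vt: float, path_length: int, decay: float) -> float:
    """r(v1, ki) = ElemRank(vt) * decay^(t-1), where path_length is t.

    path_length == 1 means the result element itself directly contains
    the keyword, so no decay is applied.
    """
    if path_length < 1:
        raise ValueError("path_length (t) must be at least 1")
    return elem_rank_vt * decay ** (path_length - 1)


# Hypothetical numbers: the keyword sits two levels below the result element.
direct = keyword_rank(0.8, path_length=1, decay=0.5)
nested = keyword_rank(0.8, path_length=3, decay=0.5)
```

With a decay below 1, a keyword found deeper inside the result contributes less, which is how the formula rewards more specific results.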
15
Ranking Function
Combining ranks in case of multiple occurrences
Overall Rank
16
Semantics of the ranking function
Link structure
$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{\,t-1}$
Specificity (height)
Proximity
17
ElemRank Computation: adopt PageRank?

PageRank

Shortcomings
Fails to capture
bidirectional transfer of ElemRanks
discrimination between edge types (containment and hyperlink)
aggregation of ElemRanks for reverse containment relationships

18
ElemRank Computation - I

Consider both forward and reverse ElemRank propagation.

Ne: total number of XML elements
Nh(u): number of hyperlinks from 'u'
Nc(u): number of children of 'u'
E = HE ∪ CE ∪ CE'
CE': reverse containment edges

19
ElemRank Computation - II

Separate containment and hyperlink edges

CE: containment edges
HE: hyperlink edges
ElemRank(sub-element) ∝ 1 / (number of sibling sub-elements)

20
ElemRank Computation - III

Sum over the reverse-containment edges, instead of distributing the weight

Nd: total number of XML documents
Nde(v): number of elements in the XML document containing v
ElemRank(parent) ∝ Sum(ElemRank(sub-elements))
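The formula images did not survive in this transcript, so the following is only a sketch of how the three refinements could be combined in a power iteration. It assumes the update e(v) = (1 - d1 - d2 - d3) / (Nd * Nde(v)) + d1 * sum over hyperlinks of e(u)/Nh(u) + d2 * sum over containment edges of e(u)/Nc(u) + d3 * sum over reverse containment edges of e(u). The function name, the dictionary-based input format, and the fixed iteration count are all assumptions of this example.

```python
def elem_rank(children, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iters=100):
    """Iterate an ElemRank-style update (assumed form, see lead-in).

    children:   dict element -> list of child elements (containment edges)
    hyperlinks: dict element -> list of hyperlink targets
    doc_of:     dict element -> id of the document containing it
    """
    elems = list(doc_of)
    n_docs = len(set(doc_of.values()))
    doc_size = {}
    for e in elems:
        doc_size[doc_of[e]] = doc_size.get(doc_of[e], 0) + 1

    rank = {e: 1.0 / len(elems) for e in elems}
    for _ in range(iters):
        # Random-jump term, scaled per document and per element in it.
        new = {e: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[e]]) for e in elems}
        for u in elems:
            targets = hyperlinks.get(u, [])
            for v in targets:
                new[v] += d1 * rank[u] / len(targets)  # split among hyperlinks
            kids = children.get(u, [])
            for c in kids:
                new[c] += d2 * rank[u] / len(kids)     # forward: split among children
                new[u] += d3 * rank[c]                 # reverse: summed, not split
        rank = new
    return rank


# Tiny hypothetical document: a contains b and c; b hyperlinks to c.
ranks = elem_rank({"a": ["b", "c"]}, {"b": ["c"]}, {"a": 0, "b": 0, "c": 0})
```

In this toy input, c receives the same containment share as b plus the hyperlink from b, so it ends up ranked above b.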

21
Datastructures and Algorithms
22
Naïve Algorithm

Approach
Treat each XML element as a document
Use keyword search as on the WWW

Limitations
Space overhead (in inverted indices)
Failure to model hierarchical relationships (ancestor-descendant)
Inaccurate ranking
Need a new data structure that can model hierarchical relationships!
Answer: Dewey Inverted Lists

23
Labeling nodes using Dewey Ids
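The labeling figure is not part of the transcript, so here is a minimal sketch of Dewey labeling. It assumes the first component is a document ID and each further component is the zero-based position of an element among its siblings, which matches IDs such as 5.0.3.0.0 used later in the trace. The function name `assign_dewey_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET


def assign_dewey_ids(root, doc_id):
    """Return a dict mapping Dewey ID (tuple of ints) -> element tag."""
    labels = {}

    def visit(elem, dewey):
        labels[dewey] = elem.tag
        for position, child in enumerate(elem):
            # A child's ID is its parent's ID plus its sibling position.
            visit(child, dewey + (position,))

    visit(root, (doc_id,))
    return labels


labels = assign_dewey_ids(ET.fromstring("<a><b/><c><d/></c></a>"), doc_id=5)
```

Because an ancestor's ID is always a prefix of its descendants' IDs, ancestor-descendant tests reduce to prefix comparisons.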
24
Dewey Inverted Lists

One entry per keyword
Entry for keyword 'k' has Dewey-IDs of elements
directly containing 'k'

A simple equi merge-join of the Dewey-ID lists won't work!
Prefixes need to be computed.
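A small sketch of how such an index could be built: each keyword maps to the sorted Dewey IDs of the elements whose own text contains it. The whitespace tokenization, lowercasing, and the name `build_dil` are assumptions of this example; the ranks and position lists that the slides also mention are left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root, doc_id):
    """Map keyword -> sorted list of Dewey IDs of directly containing elements."""
    index = defaultdict(list)

    def visit(elem, dewey):
        # Text directly inside elem: its leading text plus the text that
        # follows each child (ElementTree stores that as the child's tail).
        pieces = [elem.text or ""] + [child.tail or "" for child in elem]
        for word in set(" ".join(pieces).lower().split()):
            index[word].append(dewey)
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (doc_id,))
    return {word: sorted(ids) for word, ids in index.items()}


doc = ET.fromstring("<paper><title>XQL basics</title><author>Ricardo</author></paper>")
dil = build_dil(doc, doc_id=5)
```

Note that the ancestor `paper` does not appear in either list, which is why a plain equality join on the two lists finds nothing and common prefixes are needed.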

25
System Architecture
26
DIL Query Processing

Simple equality merge-join will not work
Need to find LCP (longest common prefix) over all
elements with query keyword-match.
Single pass over the inverted lists suffices!
Compute LCP while merging the ILs of individual
keywords.
ILs are sorted on Dewey-IDs

27
Datastructures

Array of all inverted lists: invertedList
invertedList[i] for keyword 'i'
each invertedList[i] is sorted on Dewey-ID
Heap to maintain top-m results: resultHeap
Stack to store current Dewey-ID, ranks, position list, longest common prefixes: deweyStack

28
Algorithm on DILs - Abstract

While not all inverted lists are fully processed:
Read the next entry, the one with the smallest Dewey-ID, from the DIL; call it 'currentEntry'.
Find the longest common prefix (lcp) between the stack components and the entry read from the DIL: lcp(deweyStack, currentEntry).
Pop non-matching entries from deweyStack, adding a result to the heap if appropriate:
check whether the current top of stack contains all keywords;
if yes, compute OverallRank and put this result onto the heap;
else, non-matching entries are popped one component at a time, updating (rank, posList) on each pop.
Push the non-matching part of 'currentEntry' onto deweyStack: the non-matching components of 'currentEntry.deweyID' are pushed onto the stack.
Update the components of the top entry of deweyStack.
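The loop above can be sketched as a single merge pass with a stack. This is a simplified illustration, not the authors' implementation: it tracks one rank per keyword, combines repeated occurrences with max, scores a result as the plain sum of its keyword ranks, and ignores position lists and proximity. `DECAY`, `dil_query`, and the input format are assumptions of this example.

```python
import heapq

DECAY = 0.5  # hypothetical decay factor


def dil_query(inverted_lists, keywords, m):
    """inverted_lists: keyword -> list of (dewey_tuple, elem_rank), sorted by Dewey ID.
    Returns up to m (score, dewey) pairs, best first."""
    n = len(keywords)
    streams = [[(dewey, i, rank) for dewey, rank in inverted_lists[kw]]
               for i, kw in enumerate(keywords)]
    stack = []    # frames: [dewey component, per-keyword ranks]
    results = []

    def pop_frame():
        component, ranks = stack.pop()
        dewey = tuple(frame[0] for frame in stack) + (component,)
        if all(r > 0 for r in ranks):
            # Contains every keyword: report it, and do not pass its
            # occurrences up to the ancestors.
            results.append((sum(ranks), dewey))
        elif stack:
            parent_ranks = stack[-1][1]
            for i, r in enumerate(ranks):
                parent_ranks[i] = max(parent_ranks[i], r * DECAY)

    # heapq.merge yields entries from all lists in global Dewey order.
    for dewey, i, rank in heapq.merge(*streams):
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:           # pop non-matching components
            pop_frame()
        for component in dewey[lcp:]:     # push the non-matching part
            stack.append([component, [0.0] * n])
        stack[-1][1][i] = max(stack[-1][1][i], rank)
    while stack:
        pop_frame()
    return heapq.nlargest(m, results)


lists = {
    "xql": [((0, 1, 0), 0.4), ((0, 2), 0.3)],
    "ricardo": [((0, 1, 1), 0.5), ((0, 3), 0.2)],
}
top = dil_query(lists, ["xql", "ricardo"], m=10)
```

In this toy input, element 0.1 is reported because its two children cover both keywords, and the root 0 is also reported because the occurrences at 0.2 and 0.3 lie outside that contained result.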

29
Example
Query: XQL Ricardo
30
Algorithm Trace Step 1
Rank[i]: rank due to keyword 'i'
PosList[i]: list of occurrences of keyword 'i'
Smallest ID 5.0.3.0.0
DeweyStack
DIL invertedList
push all components and find rank, posL
31
Algorithm Trace Step 2
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
find lcp and pop nonmatching components
32
Algorithm Trace Step 3
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
updated rank, posL
33
Algorithm Trace Step 4
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
push non-matching components
34
Algorithm Trace Step 5
Smallest ID 6.0.3.8.3
DeweyStack
DIL invertedList
find lcp, update, finally pop all components
35
Problems with DIL

Scans the entire inverted-list for all keywords
before a result is output
Very inefficient for top-k computation

36
Other Techniques - RDIL

Ranked Dewey Inverted List
For efficient top-k result computation
IL is ordered by ElemRank
Each IL has a B+-tree index on the Dewey-IDs
Algorithm with RDIL uses a threshold

37
Algorithm using RDIL (Abstract)

Choose the next entry from one of the inverted lists in round-robin fashion.
Say the chosen IL is invertedList[i].
d = top-ranked Dewey-ID from invertedList[i]
Find the longest common prefix that contains all query keywords.
Probe the B+-tree index of all other keyword ILs for the longest common prefix.
Claim:
d2 = smallest Dewey-ID in invertedList[j] of query keyword 'j' that is greater than d
d3 = immediate predecessor of d2
lcp = max_prefix(lcp(d, d2), lcp(d, d3))
Check if 'lcp' is a complete result.
Recompute 'threshold' = sum(ElemRank of last processed element in each query keyword IL)
If (rank of top-k results on heap) > threshold, return.
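The probe in the claim can be illustrated with a sorted Python list and `bisect` standing in for the index on Dewey IDs. This is a sketch under that substitution; the names `lcp` and `deepest_common_prefix` are hypothetical, and d2 is taken here as the first ID not smaller than d.

```python
from bisect import bisect_left


def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    length = 0
    while length < len(a) and length < len(b) and a[length] == b[length]:
        length += 1
    return a[:length]


def deepest_common_prefix(d, sorted_ids):
    """Longest prefix of d shared with any ID in sorted_ids.

    Only the neighbours of d in sort order need checking: d2, the first ID
    not smaller than d, and d3, its immediate predecessor.
    """
    pos = bisect_left(sorted_ids, d)
    candidates = []
    if pos < len(sorted_ids):
        candidates.append(sorted_ids[pos])      # d2
    if pos > 0:
        candidates.append(sorted_ids[pos - 1])  # d3
    return max((lcp(d, c) for c in candidates), key=len, default=())


other_list = [(5, 0, 1), (5, 0, 3, 0, 1), (6, 0)]
prefix = deepest_common_prefix((5, 0, 3, 0, 0), other_list)
```

Checking two neighbours is enough because IDs sharing a longer prefix with d sort closer to d than IDs sharing a shorter one.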

38
Performance of RDIL

Works well for queries with highly correlated keywords
But it becomes equivalent to (actually worse than) DIL for totally uncorrelated keywords
Need an intermediate technique

39
HDIL

Uses both DIL and RDIL
Adaptive strategy
Start with RDIL
Switch to DIL if performance is bad
Performance?
Estimated remaining time for RDIL = (m - r) * t / r
t: time spent so far
r: number of results above threshold so far
m: desired number of results
Estimated remaining time for DIL ?
No. of query-keywords is known
Size of each IL is known

40
HDIL

Datastructures?
Store full IL sorted on Dewey-ID
Store small fraction of IL sorted on ElemRank
Share the leaf level between IL and B+-tree (in RDIL)
Overhead: top levels of B+-tree

41
Updating the lists

Updating is easy
Insertion is very bad!
Techniques from Tatarinov et al.
(We've seen a better technique in this course: OrdPath)

42
Evaluation

Criteria
no. of query-keywords
correlation between query-keywords
desired no. of query results
selectivity of keywords
Setup
Datasets used: DBLP, XMark
d1 = 0.35, d2 = 0.25, d3 = 0.25
2.8 GHz Pentium IV, 1 GB RAM, 80 GB HDD

43
Performance - 1
44
Performance - 2
45
Critique

New datastructure (DIL) defined to represent
hierarchical relationships accurately and
efficiently.
Hyperlinks and IDREFs are considered only while
computing ElemRank. Not used while returning
results.
Only containment edges (ancestor-descendant) are
considered while computing result trees.
Works only on trees, can't handle graphs.

48
The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents

Jens Graupmann, Ralf Schenkel, Gerhard Weikum
Max-Planck-Institut für Informatik
Presentation by
Nitin Gupta, Meghana Kshirsagar
Indian Institute of Technology Bombay

49
Why another search engine ?

To cope with diversity in the structures and
annotations of the data
Ranked retrieval paradigm for producing relevance
ordered results lists rather than a mere boolean
retrieval.
Shortcomings of the current search engines: they lack
Concept awareness
Context awareness (or link awareness)
Abstraction awareness
A query language

50
Concept awareness

Example: the query "researcher max planck" yields many results about researchers who work at the Max Planck Society
A better formulation would be: researcher person=max planck
Objective attained by
Transformation to XML
Data Annotation

51
Concept awareness Transformation

<Experiments>
... Text1 ...
<Settings>
... Text2 ...
</Settings>
</Experiments>
...

<H1>Experiments</H1>
... Text1 ...
<H2>Settings</H2>
... Text2 ...
<H1> ...

52
Abstraction Awareness

Example: synonyms, ontologies
Is a connection to various encyclopedias or wikis possible?
Objective attained by using
Ontology Service provides quantified ontological
information to the system
Preprocessed information based on focused web
crawls to estimate statistical correlations
between the characteristic words of related
concepts

53
Context Awareness

Query may not be answered by web search engines
as no single web page may be a match
Unlike usual navigation axes in XML, context
should go beyond trees.
Consider graph structure spanned by
Xlink/XPointer references and href hyperlinks
Objective attained by
introduction of the concept of a SPHERE

54
Context Awareness Sphere

What is a sphere?
Relevance of an element for a group of query
conditions is not just determined by its own
content, but also by the content of other
neighboring elements, including linked documents,
in an environment - called Sphere - of the
element.

55
Query Language

Query S = (Q, J) consists of
set Q = {G1, ..., Gq} of query groups
set J = {J1, ..., Jm} of join conditions
Each Qi consists of
set of keyword conditions t1, ..., tk
set of concept-value conditions c1 = v1, ..., cl = vl
Each join has the form Qi.v = Qj.w (or Qi.v ~ Qj.w)

56
Query Language

Example
P(professor, location=Germany)
C(course, databases)
R(project, XML)
A(gothic, church)
B(romanic, church)
A.location = B.location

First query (groups P, C, R): German professors who teach database courses and have projects on XML
Second query (groups A, B): Gothic and Romanic churches at the same location
57
Data Model

Collection X = (D, L) of XML documents D together with a set L of (href, XPointer, or XLink) links between their elements
Consider all attributes as elements; then the element-level graph GE(X) = (VE(X), EE(X)) has the union of all the elements of the documents as nodes and undirected edges between them
Each edge has a nonnegative weight:
1 for parent-child edges, λ for links
A distance function dX(x,y) computes the weight of a shortest path in GE(X) between x and y
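The distance function can be sketched with Dijkstra's algorithm over an undirected weighted edge list. The function name, the edge-list format, and the link weight of 3 in the example are assumptions of this sketch; the slides only state that parent-child edges weigh 1 and that links carry their own weight.

```python
import heapq


def shortest_distance(edges, source, target):
    """Weight of a shortest path between two elements.

    edges: iterable of (x, y, weight) undirected edges with weight >= 0.
    Returns float('inf') when target is unreachable from source.
    """
    graph = {}
    for x, y, weight in edges:
        graph.setdefault(x, []).append((y, weight))
        graph.setdefault(y, []).append((x, weight))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


# Two parent-child edges (weight 1) and one link with a hypothetical weight of 3.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 3)]
distance = shortest_distance(edges, "a", "c")
```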

58
Spheres and Query Groups

Node score ns(n,t) is computed using the Okapi BM25 model
Similarity condition ~K: compute exp(K) for the keyword. The node score is defined as
$\max_{x \in \exp(K)} sim(K,x) \cdot ns(n,x)$
where sim(K,x) is the ontological similarity
Concept value:
$ns(n, c = v) = 0$ if $name(n) \neq c$, $ns(n,v)$ otherwise
Similarity concept value ~c = v: $sim(name(n), c) \cdot ns(n,v)$
This is insufficient
in the presence of linked documents
when content is spread over several elements
59
Spheres and Query Groups

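Sphere $S_d(n)$: set of nodes at distance d from node n
$s_d(n,t) = \sum_{v \in S_d(n)} ns(v,t)$
$s(n,t) = \sum_i s_i(n,t)\,\alpha^i$

$s(1,t) = 1 + 4 \cdot 0.5 + 2 \cdot 0.25 + 5 \cdot 0.125 = 4.125$
$s(2,t) = 3 + 0 \cdot 0.5 + 0 \cdot 0.25 + 1 \cdot 0.125 = 3.125$
$s(n,G) = \sum_j s(n,t_j) + \sum_j s(n, c_j = v_j)$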
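To make the damped sum concrete, the sketch below recomputes the two example rows, taking the per-distance totals (1, 4, 2, 5) and (3, 0, 0, 1) and a damping factor of 0.5 from the slide. The function name `sphere_score` is hypothetical.

```python
def sphere_score(scores_by_distance, alpha):
    """Sum of s_i * alpha**i, where scores_by_distance[i] is the total
    node score of all nodes at distance i from the centre node."""
    return sum(s * alpha ** i for i, s in enumerate(scores_by_distance))


score_node_1 = sphere_score([1, 4, 2, 5], alpha=0.5)  # 1 + 2 + 0.5 + 0.625
score_node_2 = sphere_score([3, 0, 0, 1], alpha=0.5)  # 3 + 0 + 0 + 0.125
```

Node 1 scores higher than node 2 even though its own node score is lower, because its neighbourhood is rich in matches.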
60
Spheres and Query Groups Ranking

Create a connection graph G(N) = (V(N), E(N))
Weight of an edge between x and y:
0 if x and y are not connected
$1 / (d_X(x,y) + 1)$ otherwise

Compactness C(N) of a potential answer N is then the sum of the edge weights of a maximum spanning tree for G(N), and the score is given by
$s(N, S) = \beta\, C(N) + (1 - \beta) \sum_i s(n_i, G_i)$
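A minimal sketch of the compactness and the combined score, using Kruskal's algorithm with edges taken in descending weight order to build a maximum spanning tree. All function names and the sample distances are assumptions of this example; unconnected pairs are simply left out of the edge list.

```python
def edge_weight(distance):
    """Connection-graph weight 1 / (d + 1) for two connected nodes."""
    return 1.0 / (distance + 1)


def compactness(nodes, weighted_edges):
    """Total weight of a maximum spanning tree (forest, if disconnected).

    weighted_edges: list of (weight, x, y) tuples.
    """
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    total = 0.0
    for weight, x, y in sorted(weighted_edges, reverse=True):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:  # edge joins two components: keep it
            parent[root_x] = root_y
            total += weight
    return total


def answer_score(compact, group_scores, beta):
    """beta * C(N) + (1 - beta) * sum of the per-group sphere scores."""
    return beta * compact + (1 - beta) * sum(group_scores)


# Hypothetical distances: a-b = 1, a-c = 1, b-c = 3.
edges = [(edge_weight(1), "a", "b"), (edge_weight(1), "a", "c"), (edge_weight(3), "b", "c")]
c_value = compactness(["a", "b", "c"], edges)
score = answer_score(c_value, [2.0, 3.0], beta=0.5)
```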
61
Spheres and Query Groups Joins

New virtual links to form an extended collection X' = (D, L')
Connect the elements that match the join
Similarity join: for Qi.v ~ Qj.w, consider the sets N(v) (resp. N(w)) of elements that have name v (w) or contain v (w) in their content. For each pair x ∈ N(v), y ∈ N(w), add a link {x, y} with weight 1/csim(x,y)

62
System Architecture
Focused web crawls are used to estimate statistical correlations between the characteristic words of related concepts. The current version uses the Dice coefficient.
Content is stored in inverted lists with corresponding tf-idf-style term statistics
The indexer stores with each element the corresponding Dewey encoding of its position within the document
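For reference, the Dice coefficient of two sets is 2|A ∩ B| / (|A| + |B|). The sketch below applies it to the sets of crawled pages in which two terms occur; that input representation and the function name `dice` are assumptions, since the slides do not say how the statistics are collected.

```python
def dice(docs_a, docs_b):
    """Dice coefficient of two sets of document IDs."""
    if not docs_a and not docs_b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))


# Hypothetical page IDs containing each of two terms.
correlation = dice({1, 2, 3}, {2, 3, 4})
```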
63
Query Processor

First compute a result list for each query group
Add virtual links for join conditions
Compute the compactness of a subset of all
potential answers of the query in order to return
the top-k results

Compute a list of results for each of query
keywords and concept-value conditions.
Candidate nodes Nodes that are at distance at
most D from any node that occurs in at least one
of the lists. Sphere score is computed only for
these nodes since only these can have a non-zero
score!
For each candidate node N, look up the node scores of the nodes in the sphere of N and add these scores with a proper damping factor.

64
Query Processor

Virtual links: the processor considers only a limited set of possible end points, for efficient computation
Nodes in the spheres up to distance D around nodes with nonzero sphere score for any query group
Why? Any other node will have distance at least D+1 to any result node and thus contributes at most 1/((D+1)+1) to the compactness, which is negligible
This set of candidate nodes can be computed on the fly
Set further reduced by testing join attributes; for example, A.x = B.y results in two sets of potential end points.

65
Query Processor

Generating answers
Naïve method: generate all possible potential answers from the answers to query groups, compute connection graphs and compactness, and finally their score
For top-k answers, use Fagin's Threshold Algorithm with sorted lists only
Input: sorted lists of node scores and pairwise node scores (edges)
Output: k potential answers with the best scores

66
Experiments

Sun V40z, 16GB RAM, Windows 2003 Server, Tomcat
4.0.6 environment, Oracle 10g database
Benchmarks: XMach, XMark, INEX, TREC

Does not consider XML at all
Semantically poor tags
Designed for XQuery-style exact match
Wikipedia collection: from the Wikipedia project; an HTML collection transformed into XML and annotated
Extended Wikipedia collection: extension of Wikipedia with IMDB data, with generated XML files for each movie and actor
DBLP collection: based on the DBLP project, which indexes more than 480,000 publications
INEX: set of 12,107 XML documents and a set of queries with and without structural constraints
67
Experiments
Conversion from HTML to XML
Dataset Statistics
68
Experiments

SSE-basic: basic version limited to keyword conditions, using sphere-based scoring
SSE-CV: basic version plus concept-value conditions
SSE-QC: CV version plus query groups (full context awareness)
SSE-Join: full version with all features
SSE-KW: very restricted version with simple keyword search
GoogleWiki: Google search restricted to wikipedia.org
GoogleWiki: Google on wikipedia.org with Google's operator for query expansion
GoogleWeb: Google search on the entire web
GoogleWeb: Google search on the entire web with query expansion

69
Experiments
Aggregated results for Wikipedia
70
Experiments
Aggregated results for Wikipedia and DBLP
71
Experiments
Graph showing the average runtimes for different
versions
72
Thank you
