Title: XRANK: Ranked Keyword Search over XML Documents
1XRANK Ranked Keyword Search over XML Documents
Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram
Presentation by Meghana Kshirsagar and Nitin Gupta
Indian Institute of Technology, Bombay
2Outline
- Motivation
- Problem Definition, Query Semantics
- Ranking Function
- A New Data Structure: Dewey Inverted List (DIL)
- Algorithms
- Performance Evaluation
3Motivation
4Motivation - I
- Why do we need search over XML data?
- Why not use search techniques used on WWW
(keyword search on HTML)?
5Motivation - II: Keyword Search, XML vs. HTML
- XML
  - Structure
    - Links: IDREFs and XLinks
    - Tags: content specifiers
  - Ranking
    - Result: an XML element (a tree)
    - Element-level ranking
    - Proximity: width and height
- HTML
  - Structure
    - Links: document-to-document
    - Tags: format specifiers
  - Ranking
    - Result: a document
    - Page-level ranking
    - Proximity: width (distance between words)
6Problem Definition, Query Semantics, and Ranking
7Problem Definition
- Input: a set of keywords
- Output: ranked XML elements
What is a result? How should results be ranked?
8Bird's eye view of the system
Results
Query Keywords
XML doc repository
Preprocessing (ElemRank computation)
9What is a result?
- A minimal Steiner tree of XML elements
- Result-set is a set of XML elements that
- includes a subset of elements containing all
query-keywords at least once, after excluding the
occurrences of keywords in contained results (if
any).
10result 1
result 2
11Result Graphical representation
containment edge
ancestor
descendant
12Ranking Which results to return first?
- Properties
- The Ranking function should
- reflect Result Specificity
- consider Keyword-Proximity
- be Hyperlink Aware
- Ranking function
- f (height, width, link-structure)
13Less specific result
More specific result
14Ranking Function
For a single XML element (node)
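$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$
where $v_1$ is the result element, $v_t$ is the descendant of $v_1$ that directly contains keyword $k_i$, and $t-1$ is the number of containment edges from $v_1$ down to $v_t$.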
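A minimal Python sketch of the single-element rank above. The function name and the list-of-ElemRanks input are illustrative assumptions, not part of the slides.

```python
def keyword_rank(elem_ranks_on_path, decay):
    """Rank of v1 with respect to one occurrence of a keyword.

    elem_ranks_on_path holds the ElemRank values of v1, v2, ..., vt along
    the containment path, where vt directly contains the keyword.
    (Hypothetical input format chosen for illustration.)
    """
    t = len(elem_ranks_on_path)
    # Only the ElemRank of vt counts; it is scaled down once per level
    # between the result element v1 and vt.
    return elem_ranks_on_path[-1] * decay ** (t - 1)
```

A keyword found two levels below the result element is scaled by decay squared, so deeper (less specific) matches contribute less.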
15Ranking Function
Combining ranks in case of multiple occurrences
Overall Rank
16Semantics of the ranking function
Link structure
$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$
Specificity (height)
Proximity
17ElemRank Computation: adopt PageRank?
- Shortcomings
- Fails to capture
- bidirectional transfer of ElemRanks
- discrimination between edge types (containment and hyperlink)
- Doesn't aggregate ElemRanks for reverse containment relationships
18ElemRank Computation - I
- Consider both forward and reverse ElemRank propagation.
- $N_e$: total number of XML elements
- $N_h(u)$: number of hyperlinks from u
- $N_c(u)$: number of children of u
- $E = HE \cup CE \cup CE'$
- $CE'$: reverse containment edges
19ElemRank Computation - II
- Separate containment and hyperlink edges
- CE: containment edges
- HE: hyperlink edges
- ElemRank(sub-element) $\propto$ 1 / (number of sibling sub-elements)
20ElemRank Computation - III
- Sum over the reverse-containment edges, instead of distributing the weight
- $N_d$: total number of XML documents
- $N_{de}(v)$: number of elements in the XML document containing v
- ElemRank(parent) $\propto$ $\sum$ ElemRank(sub-elements)
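The formula for this slide did not survive in the transcript, so the following Python sketch is a reconstruction from the bullets only. It assumes three damping factors d1, d2, d3 for hyperlink, containment, and reverse-containment edges, and a per-document base term built from the document count and document size. The function name, argument layout, and fixed iteration count are illustrative assumptions.

```python
def elem_rank(children, links, doc_of, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Iteratively compute an ElemRank-style score (assumed formulation).

    children: dict element -> list of child elements (containment edges)
    links:    dict element -> list of hyperlink targets (hyperlink edges)
    doc_of:   dict element -> id of the document containing it
    """
    elems = list(doc_of)
    n_docs = len(set(doc_of.values()))
    doc_size = {}
    for e in elems:
        doc_size[doc_of[e]] = doc_size.get(doc_of[e], 0) + 1

    rank = {e: 1.0 / len(elems) for e in elems}
    for _ in range(iters):
        # Base term: spread evenly over documents, then over their elements.
        new = {e: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[e]])
               for e in elems}
        for u in elems:
            for v in links.get(u, []):
                # Hyperlinks: rank is divided among the link targets.
                new[v] += d1 * rank[u] / len(links[u])
            kids = children.get(u, [])
            for v in kids:
                # Forward containment: parent's rank is divided among children.
                new[v] += d2 * rank[u] / len(kids)
                # Reverse containment: children's ranks are summed, not divided.
                new[u] += d3 * rank[v]
        rank = new
    return rank
```

With a root and two children and no hyperlinks, the two children receive equal scores and the root, which aggregates both, scores higher than either child.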
21Datastructures and Algorithms
22Naïve Algorithm
- Approach
- Treat each XML element as a document
- Use keyword search as on the WWW
- Limitations
- Space overhead (in inverted indices)
- Failure to model hierarchical relationships (ancestor-descendant)
- Inaccurate ranking
- Need a new data structure that can model hierarchical relationships
- Answer: Dewey Inverted Lists
23Labeling nodes using Dewey Ids
24Dewey Inverted Lists
- One entry per keyword
- The entry for keyword 'k' holds the Dewey-IDs of the elements directly containing 'k'
- A simple equality merge-join of the Dewey-ID lists won't work
- Need to compute prefixes
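As an illustration of Dewey labeling and of the list layout, here is a small Python sketch that assigns Dewey-IDs while walking a document and records, per keyword, the IDs of the elements that directly contain it. The function name is hypothetical, and the sketch omits the rank and position list that a full entry would also carry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(xml_text, doc_id=0):
    """Map each keyword to the sorted Dewey-IDs of elements directly containing it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)

    def visit(elem, dewey):
        # Text directly inside this element: its leading text plus the
        # tails that follow each child (text of descendants is excluded).
        words = (elem.text or "").split()
        for child in elem:
            words += (child.tail or "").split()
        for w in {w.lower() for w in words}:
            index[w].append(dewey)
        # A child's Dewey-ID is the parent's ID plus the child's position.
        for pos, child in enumerate(elem):
            visit(child, dewey + (pos,))

    # The first component is the document id, as in IDs such as 5.0.3.0.0.
    visit(root, (doc_id,))
    for ids in index.values():
        ids.sort()
    return dict(index)
```

Because an ancestor's ID is a prefix of every descendant's ID, ancestor-descendant relationships can be tested on the IDs alone.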
25System Architecture
26DIL Query Processing
- A simple equality merge-join will not work
- Need to find the LCP (longest common prefix) over all elements with a query-keyword match
- A single pass over the inverted lists suffices
- Compute the LCP while merging the ILs of the individual keywords
- ILs are sorted on Dewey-ID
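The longest common prefix of two Dewey-IDs identifies their deepest common ancestor. A minimal Python sketch, assuming Dewey-IDs are represented as tuples of integers:

```python
def longest_common_prefix(a, b):
    """Return the longest common prefix of two Dewey-IDs (tuples of ints)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]
```

For 5.0.3.0.0 and 5.0.3.0.1 the prefix is 5.0.3.0; IDs from different documents share no prefix.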
27Datastructures
- Array of all inverted lists: invertedList
- invertedList[i] for keyword 'i'
- each invertedList[i] is sorted on Dewey-ID
- Heap to maintain the top-m results: resultHeap
- Stack to store the current Dewey-ID, ranks, position lists, and longest common prefixes: deweyStack
28Algorithm on DILs - Abstract
- While not all inverted lists have been processed
  - Read the next entry, the one with the smallest Dewey-ID, from the DIL; call it 'currentEntry'
  - Find the longest common prefix (lcp) between the stack components and the entry read from the DIL: lcp(deweyStack, currentEntry)
  - Pop non-matching entries from deweyStack; add a result to the heap if appropriate
    - Check whether the current top-of-stack contains all keywords
    - If yes, compute OverallRank and put this result onto the heap
    - Else, pop the non-matching entries one component at a time and update (rank, posList) on each pop
  - Push the non-matching part of 'currentEntry' onto 'deweyStack'
    - The non-matching components of 'currentEntry.deweyID' are pushed onto the stack
  - Update the components of the top entry of deweyStack
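The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each list entry is a (Dewey-ID, ElemRank) pair, multiple occurrences of a keyword are combined with max, the overall rank is taken as the plain sum of per-keyword ranks, and position lists and the proximity factor are omitted.

```python
import heapq


def dil_query(inverted_lists, decay=0.5, top_m=10):
    """Single-pass merge over Dewey-ordered lists (simplified sketch).

    inverted_lists: one list per query keyword of (dewey_id, elem_rank)
    pairs, each sorted by dewey_id (tuples of ints).
    """
    n = len(inverted_lists)
    # Merge all lists so entries arrive in increasing Dewey-ID order.
    merged = heapq.merge(*[
        [(dewey, k, rank) for dewey, rank in lst]
        for k, lst in enumerate(inverted_lists)
    ])
    stack = []    # one frame per Dewey component: [component, ranks per keyword]
    results = []

    def pop_frame():
        comp, ranks = stack.pop()
        dewey = tuple(f[0] for f in stack) + (comp,)
        if all(r is not None for r in ranks):
            # Element contains every keyword: report it and do not pass
            # its occurrences up to the ancestors.
            results.append((sum(ranks), dewey))
        elif stack:
            parent = stack[-1][1]
            for k, r in enumerate(ranks):
                if r is not None:
                    scaled = r * decay  # one level further from the keyword
                    parent[k] = scaled if parent[k] is None else max(parent[k], scaled)

    for dewey, k, rank in merged:
        # Length of the longest common prefix of the stack and the entry.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:          # pop non-matching components
            pop_frame()
        for comp in dewey[lcp:]:         # push non-matching components
            stack.append([comp, [None] * n])
        top = stack[-1][1]               # update the top entry
        top[k] = rank if top[k] is None else max(top[k], rank)

    while stack:
        pop_frame()
    return heapq.nlargest(top_m, results)
```

For one keyword at 5.0.3.0.0 and another at 5.0.3.0.1, the only result is their common ancestor 5.0.3.0, with each keyword's rank decayed by one level.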
29Example
Query XQL Ricardo
30Algorithm Trace Step 1
Rank[i]: rank due to keyword 'i'
PosList[i]: list of occurrences of keyword 'i'
Smallest ID 5.0.3.0.0
DeweyStack
DIL invertedList
push all components and find rank, posL
31Algorithm Trace Step 2
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
find lcp and pop nonmatching components
32Algorithm Trace Step 3
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
updated rank, posL
33Algorithm Trace Step 4
Smallest ID 5.0.3.0.1
DeweyStack
DIL invertedList
push non-matching components
34Algorithm Trace Step 5
Smallest ID 6.0.3.8.3
DeweyStack
DIL invertedList
find lcp, update, finally pop all components
35Problems with DIL
- Scans the entire inverted list of every keyword before a result is output
- Very inefficient for top-k computation
36Other Techniques - RDIL
- Ranked Dewey Inverted List
- For efficient top-k result computation
- IL is ordered by ElemRank
- Each IL has a B+-tree index on the Dewey-IDs
- Algorithm with RDIL uses a threshold
37Algorithm using RDIL (Abstract)
- Choose the next entry from one of the inverted lists in round-robin fashion
  - Say the chosen IL is invertedList[i]
  - d = top-ranked Dewey-ID from invertedList[i]
- Find the longest common prefix that contains all query keywords
  - Probe the B+-tree index of every other keyword's IL for the longest common prefix
  - Claim
    - d2 = smallest Dewey-ID in invertedList[j] of query keyword 'j' that is not smaller than d
    - d3 = immediate predecessor of d2
    - lcp = max_prefix(lcp(d, d2), lcp(d, d3))
- Check whether 'lcp' is a complete result
- Recompute 'threshold' = sum(ElemRank of the last processed element in each query keyword's IL)
- If (rank of the top-k results on the heap) > threshold, return
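Two pieces of this procedure can be sketched in Python: the index probe and the stopping test. Here a binary search over a sorted list of Dewey-IDs stands in for the B+-tree probe (an assumption made for illustration), and the function names are hypothetical.

```python
from bisect import bisect_left


def common_prefix(a, b):
    """Longest common prefix of two Dewey-IDs (tuples of ints)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def probe_lcp(d, sorted_ids):
    """Longest prefix d shares with any ID in sorted_ids.

    Only two neighbours need checking: the first ID not smaller than d
    and its immediate predecessor.
    """
    i = bisect_left(sorted_ids, d)
    candidates = []
    if i < len(sorted_ids):
        candidates.append(sorted_ids[i])       # d2
    if i > 0:
        candidates.append(sorted_ids[i - 1])   # d3
    return max((common_prefix(d, c) for c in candidates), key=len, default=())


def rdil_can_stop(result_ranks, last_seen_elem_ranks, k):
    """True once k results are known and the k-th best exceeds the threshold."""
    threshold = sum(last_seen_elem_ranks)
    top_k = sorted(result_ranks, reverse=True)[:k]
    return len(top_k) == k and top_k[-1] > threshold
```

The threshold bounds the rank of any result not yet seen, so once the k-th best result exceeds it, no later entry can displace the current top-k.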
38Performance of RDIL
- Works well for queries with highly correlated keywords
- But becomes equivalent to (actually worse than) DIL for totally uncorrelated keywords
- Need an intermediate technique
39HDIL
- Uses both DIL and RDIL
- Adaptive strategy
- Start with RDIL
- Switch to DIL if performance is bad
- Performance?
- Estimated remaining time for RDIL = (m - r) × t / r
- t: time spent so far
- r: number of results above the threshold so far
- m: desired number of results
- Estimated remaining time for DIL ?
- No. of query-keywords is known
- Size of each IL is known
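A small Python sketch of the switching test. The RDIL estimate follows the expression (m - r) × t / r; the DIL estimate is passed in as a number because the slide only says it can be derived from the number of keywords and the list sizes. The function names and the `dil_estimate` parameter are illustrative assumptions.

```python
def rdil_remaining_time(m, r, t):
    """Estimated remaining RDIL time: (m - r) * t / r.

    m: desired number of results
    r: results above the threshold so far
    t: time spent so far
    """
    if r == 0:
        # No result produced yet, so the estimate is unbounded.
        return float("inf")
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, dil_estimate):
    """Switch when RDIL is expected to take longer than a full DIL scan."""
    return rdil_remaining_time(m, r, t) > dil_estimate
```

For example, with 2 of 10 desired results found after 4 time units, the estimated remaining time is 16 units.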
40HDIL
- Datastructures?
- Store the full IL sorted on Dewey-ID
- Store a small fraction of the IL sorted on ElemRank
- Share the leaf level between the IL and the B+-tree (in RDIL)
- Overhead: the top levels of the B+-tree
41Updating the lists
- Updates are easy
- Insertions are very bad!
- Techniques from Tatarinov et al.
- We've seen a better technique in this course: OrdPath
42Evaluation
- Criteria
- no. of query-keywords
- correlation between query-keywords
- desired no. of query results
- selectivity of keywords
- Setup
- Datasets used: DBLP, XMark
- d1 = 0.35, d2 = 0.25, d3 = 0.25
- 2.8 GHz Pentium IV, 1 GB RAM, 80 GB HDD
43Performance - 1
44Performance - 2
45Critique
- A new data structure (DIL) is defined to represent hierarchical relationships accurately and efficiently.
- Hyperlinks and IDREFs are considered only while computing ElemRank; they are not used while returning results.
- Only containment edges (ancestor-descendant) are considered while computing result trees.
- Works only on trees; can't handle graphs.
48The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents
- Jens Graupmann, Ralf Schenkel, Gerhard Weikum
- Max-Planck-Institut für Informatik
- Presentation by
- Nitin Gupta and Meghana Kshirsagar
- Indian Institute of Technology Bombay
49Why another search engine ?
- To cope with diversity in the structures and annotations of the data
- Ranked retrieval paradigm for producing relevance-ordered result lists rather than mere Boolean retrieval
- Shortcomings of current search engines
- Concept awareness
- Context awareness (or link awareness)
- Abstraction awareness
- Query language
50Concept awareness
- Example: the query researcher max planck yields many results about researchers who work at institutes of the Max Planck Society
- A better formulation would be researcher person=max planck
- Objective attained by
- Transformation to XML
- Data annotation
51Concept awareness Transformation
- <Experiments>
- ... Text1 ...
- <Settings>
- ... Text2 ...
- </Settings>
- </Experiments>
- ...
- <H1>Experiments</H1>
- ... Text1 ...
- <H2>Settings</H2>
- ... Text2 ...
- <H1> ...
52Abstraction Awareness
- Example: synonyms, ontologies
- Is a connection to various encyclopedias or wikis possible?
- Objective attained by using
- Ontology service: provides quantified ontological information to the system
- Preprocessed information based on focused web crawls to estimate statistical correlations between the characteristic words of related concepts
53Context Awareness
- A query may not be answered by web search engines because no single web page may be a match
- Unlike the usual navigation axes in XML, context should go beyond trees
- Consider the graph structure spanned by XLink/XPointer references and href hyperlinks
- Objective attained by
- introduction of the concept of a SPHERE
54Context Awareness Sphere
- What is a sphere?
- Relevance of an element for a group of query
conditions is not just determined by its own
content, but also by the content of other
neighboring elements, including linked documents,
in an environment - called Sphere - of the
element.
55Query Language
- A query S = (Q, J) consists of
- a set Q = {G1, ..., Gq} of query groups
- a set J = {J1, ..., Jm} of join conditions
- Each Gi consists of
- a set of keyword conditions t1 ... tk
- a set of concept-value conditions c1=v1 ... cl=vl
- Each join has the form Gi.v = Gj.w (or Gi.v ~ Gj.w for a similarity join)
56Query Language
- Example
- German professors who teach database courses and have projects on XML:
- P(professor, location=Germany)
- C(course, databases)
- R(project, XML)
- Gothic and Romanic churches at the same location:
- A(gothic, church)
- B(romanic, church)
- A.location = B.location
57Data Model
- A collection X = (D, L) of XML documents D together with a set L of (href, XPointer, or XLink) links between their elements
- Treat all attributes as elements; then the element-level graph $G_E(X) = (V_E(X), E_E(X))$ has the union of all elements of the documents as nodes, with undirected edges between them
- Each edge has a nonnegative weight
- 1 for parent-child edges, λ for links
- A distance function $d_X(x, y)$ computes the weight of a shortest path in $G_E(X)$ between x and y
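The distance function can be illustrated with a standard Dijkstra shortest-path computation over the weighted element graph. In this Python sketch the adjacency-dict representation and the function name are assumptions; parent-child edges carry weight 1 and link edges carry whatever link weight the collection uses.

```python
import heapq


def distances_from(source, edges):
    """Shortest-path weights from source to every reachable node.

    edges: dict node -> list of (neighbor, weight); the graph is
    undirected, so each edge is stored in both directions.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

If a link edge is heavier than the parent-child path between the same two elements, the tree path determines the distance.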
58Spheres and Query Groups
- Node score ns(n, t) is computed using the Okapi BM25 model
- Similarity condition ~K: compute exp(K) for the keyword. The node score is defined as $\max_{x \in \exp(K)} \mathrm{sim}(K, x) \cdot ns(n, x)$
- where sim(K, x) is the ontological similarity
- Concept value
- $ns(n, c{=}v) = 0$ if name(n) ≠ c
- $ns(n, c{=}v) = ns(n, v)$ otherwise
- Similarity concept value ~c=v: $\mathrm{sim}(\mathrm{name}(n), c) \cdot ns(n, v)$
- This is insufficient
- in the presence of linked documents
- when content is spread over several elements
59Spheres and Query Groups
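- Sphere $S_d(n)$: the set of nodes at distance d from node n
- $s_d(n, t) = \sum_{v \in S_d(n)} ns(v, t)$
- $s(n, t) = \sum_{i=0}^{D} s_i(n, t) \cdot \alpha^i$
$s(1, t) = 1 + 4 \cdot 0.5 + 2 \cdot 0.25 + 5 \cdot 0.125 = 4.125$
$s(2, t) = 3 + 0 \cdot 0.5 + 0 \cdot 0.25 + 1 \cdot 0.125 = 3.125$
$s(n, G) = \sum_j s(n, t_j) + \sum_j s(n, c_j{=}v_j)$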
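The damped sum over spheres can be checked with a few lines of Python. The input layout (one list of node scores per distance) and the function name are assumptions for illustration; the two calls use the per-distance totals and the damping factor 0.5 from the worked example.

```python
def sphere_score(node_scores_by_distance, alpha):
    """Sum the node scores of each sphere, damped by alpha**distance.

    node_scores_by_distance[i] lists ns(v, t) for the nodes v at
    distance i from the node being scored (index 0 is the node itself).
    """
    return sum(sum(scores) * alpha ** i
               for i, scores in enumerate(node_scores_by_distance))


# Per-distance totals for node 1: 1, 4, 2, 5; for node 2: 3, 0, 0, 1.
score_node_1 = sphere_score([[1], [4], [2], [5]], 0.5)
score_node_2 = sphere_score([[3], [0], [0], [1]], 0.5)
```

Node 1 outranks node 2 even though its own score is lower, because its neighbourhood contributes more.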
60Spheres and Query Groups Ranking
- Create a connection graph G(N) = (V(N), E(N))
- Weight of an edge between x and y
- 0 if x and y are not connected
- $1 / (d_X(x, y) + 1)$ otherwise
The compactness C(N) of a potential answer N is then the sum of the edge weights of a maximal spanning tree for G(N), and the score is given by $s(N, S) = \beta \cdot C(N) + (1 - \beta) \cdot \sum_i s(n_i, G_i)$
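A Python sketch of compactness and the combined score, assuming the maximal spanning tree is found with Kruskal's algorithm over edges sorted by decreasing weight. Function names and the (x, y, weight) edge triples are illustrative choices.

```python
def edge_weight(distance):
    """Connection-graph weight for two connected nodes: 1 / (distance + 1)."""
    return 1.0 / (distance + 1)


def compactness(nodes, weighted_edges):
    """Total weight of a maximum spanning tree (Kruskal with union-find).

    weighted_edges: iterable of (x, y, weight); unconnected pairs are
    simply left out, which matches giving them weight 0.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = 0.0
    # Heaviest edges first, keeping an edge only if it joins two components.
    for x, y, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
            total += w
    return total


def answer_score(compact, group_scores, beta):
    """beta * C(N) + (1 - beta) * sum of the per-group sphere scores."""
    return beta * compact + (1 - beta) * sum(group_scores)
```

Nodes that lie close together get heavier edges, so a tightly connected answer has higher compactness than one whose nodes are scattered.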
61Spheres and Query Groups Joins
- New virtual links form an extended collection X' = (D, L')
- Connect the elements that match the join
- Similarity join: for Gi.v ~ Gj.w, consider the sets N(v) (resp. N(w)) of elements that have name v (w) or contain v (w) in their content. For each pair x ∈ N(v), y ∈ N(w), add a link {x, y} with weight 1/csim(x, y)
62System Architecture
Focused web crawls used to estimate statistical
correlations between the characteristic words of
related concepts. Current version uses Dice
coefficient.
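For reference, the Dice coefficient of two sets is twice the size of their intersection divided by the sum of their sizes. A minimal Python sketch, assuming each word is represented by the set of crawled documents it occurs in (that representation is an assumption, not stated on the slide):

```python
def dice(docs_a, docs_b):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) of two document-id sets."""
    a, b = set(docs_a), set(docs_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))
```

Words that always co-occur score 1, and words that never co-occur score 0.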
Content is stored in inverted lists with corresponding tf·idf-style term statistics.
The indexer stores with each element the corresponding Dewey encoding of its position within the document.
63Query Processor
- First compute a result list for each query group
- Add virtual links for join conditions
- Compute the compactness of a subset of all
potential answers of the query in order to return
the top-k results
- Compute a list of results for each of the query keywords and concept-value conditions
- Candidate nodes: nodes that are at distance at most D from any node that occurs in at least one of the lists. The sphere score is computed only for these nodes, since only they can have a non-zero score
- For each candidate node N, look up the node scores of the nodes in the sphere of N and add these scores with a proper damping factor
64Query Processor
- Virtual links: the processor considers only a limited set of possible end points for efficient computation
- Nodes in the spheres up to distance D around nodes with a nonzero sphere score for any query group
- Why? Any other node will have distance at least D+1 to any result node and thus contributes at most 1/((D+1)+1) to the compactness, which is negligible
- This set of candidate nodes can be computed on the fly
- The set is further reduced by testing join attributes; for example, A.x = B.y results in two sets of potential end points
65Query Processor
- Generating answers
- Naïve method: generate all possible potential answers from the answers to the query groups, compute connection graphs and compactness, and finally their scores
- For top-k answers, use Fagin's Threshold Algorithm with sorted lists only
- Input: sorted lists of node scores and pairwise node scores (edges)
- Output: k potential answers with the best scores
66Experiments
- Sun V40z, 16 GB RAM, Windows 2003 Server, Tomcat 4.0.6 environment, Oracle 10g database
- Benchmarks: XMach, XMark, INEX, TREC
Does not consider XML at all
Semantically poor tags
Designed for XQuery-style exact match
- Wikipedia Collection: from the Wikipedia project; HTML collection transformed into XML and annotated
- Wikipedia++ Collection: extension of Wikipedia with IMDB data, with generated XML files for each movie and actor
- DBLP Collection: based on the DBLP project, which indexes more than 480,000 publications
- INEX: a set of 12,107 XML documents and a set of queries with and without structural constraints
67Experiments
Conversion from HTML to XML
Dataset Statistics
68Experiments
- SSE-basic: basic version limited to keyword conditions, using sphere-based scoring
- SSE-CV: basic version plus concept-value conditions
- SSE-QC: CV version plus query groups (full context awareness)
- SSE-Join: full version with all features
- SSE-KW: very restricted version with simple keyword search
- GoogleWiki: Google search restricted to wikipedia.org
- Google~Wiki: Google on wikipedia.org with Google's ~ operator for query expansion
- GoogleWeb: Google search on the entire web
- Google~Web: Google search on the entire web with query expansion
69Experiments
Aggregated results for Wikipedia
70Experiments
Aggregated results for Wikipedia and DBLP
71Experiments
Graph showing the average runtimes for different
versions
72Thank you