XQuery Processing with Relevance Ranking

1
XQuery Processing with Relevance Ranking

Leonidas Fegaras
University of Texas at Arlington
fegaras_at_cse.uta.edu
http://lambda.uta.edu/

2
Motivation

Many IR techniques for approximate matching over text-rich documents
  keyword search queries over flat documents only
  ranked results
XQuery is very powerful in expressing exact queries over XML
Our goal
  XQuery processing + IR-style approximate matching
Challenges
  relevance ranking functions
  ordering the results by their relevance scores
  propagating and combining scores in a query
  complexity of XQuery
  multiple documents
  nested queries

3
XQuery with Approximate Matching

XQuery syntax extensions
  full-text search: e ~ S
  search specification S: "phrase", S1 and S2, S1 or S2, not S
  relevance assessment: score(e)
  all indexed documents: document()
<answer>{
  ( for $db in document()/biblio,
        $b in $db/bib[title ~ ("XQuery processing" and "relevance")]
    where $b/abstract ~ ("SAX" and not "DOM")
    order by score($b) descending
    return <paper>{ $b/author/name,
                    $b/title,
                    score($b) }</paper>
  )[position()<10]
}</answer>

4
Design Goals

Design an XML indexing scheme for
  path navigation
  search predicates
Provide relevance ranking functions based on the indexes
Build a highly pipelined XQuery engine
  that uses merge-joins exclusively
  does not materialize intermediate results
This engine should consist of operators that
  propagate and combine relevance scores
  naturally reflect the syntactic structures of XQuery and
  can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
The XQuery translation should be concise, clean, and completely compositional

5
Related Work

TIX algebra [Al-Khalifa et al., SIGMOD'03]
TeXQuery [Amer-Yahia et al., WWW'04]
XQuery/IR [Bremer et al., WebDB'02]
Languages: ELIXIR, XIRQL
Systems: XXL, XRANK, XIRCUS

6
Relevance Ranking

e ~ term: the term is associated with a pair (weight, position)
  the position is relative to the beginning of the e element
  the weight is based on
    the standard IR tf-idf (term-frequency/inverse-document-frequency)
    the difference between the nesting levels of term and e
The scores for phrases and boolean connectives are based on term proximity
  a conjunction of terms is summarized by their center of mass
    $position = \frac{\sum_i position_i \cdot weight_i}{\sum_i weight_i}$
    $weight = \frac{\sum_i position_i \cdot weight_i}{\sum_i position_i}$
  two sets of position/weight pairs
    positive terms (T): a disjunction of possibilities
    negative terms (F): a conjunction of forbidden terms
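
The center-of-mass summary on this slide can be sketched in Java as follows. This is only an illustration, assuming the two formulas are sums over the terms of the conjunction; the names `Hit`, `CenterOfMass`, and `summarize` are hypothetical and do not come from the slides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical (position, weight) pair for one matched term.
final class Hit {
    final double position;
    final double weight;

    Hit(double position, double weight) {
        this.position = position;
        this.weight = weight;
    }
}

final class CenterOfMass {
    // Summarize a conjunction of terms by a single (position, weight) pair:
    //   position = sum(p_i * w_i) / sum(w_i)
    //   weight   = sum(p_i * w_i) / sum(p_i)
    // Assumes at least one hit and non-zero sums of positions and weights.
    static Hit summarize(List<Hit> hits) {
        double moment = 0, weights = 0, positions = 0;
        for (Hit h : hits) {
            moment += h.position * h.weight;
            weights += h.weight;
            positions += h.position;
        }
        return new Hit(moment / weights, moment / positions);
    }

    public static void main(String[] args) {
        Hit c = summarize(Arrays.asList(new Hit(10, 0.5), new Hit(30, 0.5)));
        System.out.println(c.position + " " + c.weight);
    }
}
```

Two equally weighted terms at positions 10 and 30 are summarized at the midpoint, position 20, with weight 0.5.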

7
Relevance Ranking (cont.)

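Merging terms from sets A and B
  $A \otimes B = \{\, ( \frac{p_1 w_1 + p_2 w_2}{p_1 + p_2},\ \frac{p_1 w_1 + p_2 w_2}{w_1 + w_2} ) \mid (p_1,w_1) \in A,\ (p_2,w_2) \in B \,\}$
Position/weights of search specifications
  $(S_1 \text{ and } S_2).T = S_1.T \otimes S_2.T$     $(S_1 \text{ and } S_2).F = S_1.F \cup S_2.F$
  $(S_1 \text{ or } S_2).T = S_1.T \cup S_2.T$     $(S_1 \text{ or } S_2).F = S_1.F \otimes S_2.F$
  $(\text{not } S).T = S.F$     $(\text{not } S).F = S.T$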
Cost of a search specification S
  calculate $\{\, \frac{|p_1 - p_2|}{size} \cdot w_1 \cdot (1 - w_2) \mid (p_1,w_1) \in S.T,\ (p_2,w_2) \in S.F \,\}$
  reduce the set by the function $x \oplus y = x + y - x y$
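
The cost computation can be sketched as below. The operators in the slide's formula were lost in extraction, so this sketch assumes that the per-pair value is the product |p1 - p2| / size * w1 * (1 - w2) and that the reducing function is x + y - x*y; both readings are assumptions, and the names `SpecCost`, `oplus`, and `cost` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class SpecCost {
    // Assumed reading of the reducing function: x (+) y = x + y - x*y.
    // The result stays within [0,1] when both inputs are in [0,1].
    static double oplus(double x, double y) {
        return x + y - x * y;
    }

    // Each entry is {position, weight}. For every pair of a positive term
    // and a negative term, compute |p1 - p2| / size * w1 * (1 - w2),
    // then fold all values with oplus (0 is its identity element).
    static double cost(List<double[]> positive, List<double[]> negative, double size) {
        double total = 0.0;
        for (double[] t : positive) {
            for (double[] f : negative) {
                double c = Math.abs(t[0] - f[0]) / size * t[1] * (1 - f[1]);
                total = oplus(total, c);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<double[]> pos = Arrays.asList(new double[] {10, 0.8});
        List<double[]> neg = Arrays.asList(new double[] {60, 0.5});
        System.out.println(cost(pos, neg, 100));
    }
}
```

Under the product reading every per-pair value lies in [0,1], which is what keeps the x + y - x*y fold bounded.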

8
Inverse Indexes
[Figure: index diagram with keys, postings, and hits]

Four indexes
  XML tags: each hit has a begin/end position
  text terms: each hit has a position
  attribute names
  attribute values
Each index delivers the posting/hit pairs in (document_number, begin_position) order

9
The Pipeline Units

abstract class Element {
  float score;            // relevance assessment of element
}
class Fragment extends Element {
  int document;           // document ID
  short begin;            // the start position in document
  short end;              // the end position in document
  short level;            // depth of term in document
}
class ConstructedElement extends Element {
  String tagname;
  Element[] sequence;     // children
  Attributes attributes;  // SAX-like attributes
}
class PCData extends Element { String data; }

10
The Pipeline Units (cont.)

Need an element to capture all indexed elements
  class Pattern extends Element {
    int min_level;  // minimum depth in document
    int max_level;  // maximum depth in document
  }
  for queries such as count(document()//*)
  as a starting element for document()
The unit of communication between pipeline operators is a tuple
  class Tuple { Element[] components; }
  one element for each for-variable in a FLWOR expression

11
Pipeline Iterators

class Iterator {
  Tuple current ();  // current tuple from stream
  void open ();      // open the stream iterator
  Tuple next ();     // get the next tuple from stream
  boolean eos ();    // is this the end of stream?
}
An iterator reads data from the input stream(s) and delivers data to the output stream
Connected through pipelines
  an iterator (the producer) delivers a stream element to the output only when requested by the next operator in pipeline (the consumer)
  to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, etc., until the end of stream
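
The producer/consumer behavior described on this slide can be illustrated with a small pull-based pipeline. This is a simplified sketch rather than the engine's code: tuples are plain `Integer` values instead of the slides' `Tuple` class, and the names `PullIterator`, `ArraySource`, and `EvenFilter` are hypothetical.

```java
// Simplified stand-in for the slides' Iterator: tuples are Integers here.
abstract class PullIterator {
    protected Integer current;               // null means end of stream

    Integer current() { return current; }
    abstract void open();                    // position on the first element
    abstract Integer next();                 // advance and return the new current
    boolean eos() { return current == null; }
}

// A leaf producer that streams the values of an array.
final class ArraySource extends PullIterator {
    private final int[] data;
    private int pos;

    ArraySource(int[] data) { this.data = data; }

    void open() { pos = 0; next(); }

    Integer next() {
        current = pos < data.length ? Integer.valueOf(data[pos++]) : null;
        return current;
    }
}

// A consumer of its input and a producer for the next operator:
// it pulls as many input elements as needed to deliver one even value.
final class EvenFilter extends PullIterator {
    private final PullIterator input;

    EvenFilter(PullIterator input) { this.input = input; }

    void open() { input.open(); skip(); }

    Integer next() { input.next(); skip(); return current; }

    private void skip() {
        while (!input.eos() && input.current() % 2 != 0) input.next();
        current = input.current();
    }
}
```

A consumer drives the pipeline with `for (it.open(); !it.eos(); it.next())` and reads `it.current()` in the loop body, so nothing is computed until it is requested.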

12
Example

class Child extends Iterator {
  String tag;
  Iterator input;
  IndexIterator ti;
  Tuple next () {
    while (!ti.eos() && !input.eos())
      if (input.current[0] instanceof Fragment) {
        Fragment f = (Fragment) input.current[0];
        Posting p = ti.posting();
        TagHit h = (TagHit) ti.hit();
        if ( f.document == p.document
             && f.begin < h.begin && f.end > h.end
             && h.level == f.level+1 ) {
          ti.next();
          return new Tuple(new Fragment(p.document,h.begin,h.end,h.level));
        }
        ...
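
The containment test at the heart of the Child iterator can be isolated as a predicate. This sketch assumes the garbled condition on the slide reads as: same document, strictly nested begin/end positions, and exactly one level deeper. The class name `Region` and the method `isChild` are hypothetical.

```java
// Hypothetical region: a document ID, begin/end positions, and nesting depth.
final class Region {
    final int document, begin, end, level;

    Region(int document, int begin, int end, int level) {
        this.document = document;
        this.begin = begin;
        this.end = end;
        this.level = level;
    }

    // True when 'hit' is a direct child of 'parent': same document,
    // positions strictly nested inside the parent, one level deeper.
    static boolean isChild(Region parent, Region hit) {
        return parent.document == hit.document
            && parent.begin < hit.begin
            && parent.end > hit.end
            && hit.level == parent.level + 1;
    }

    public static void main(String[] args) {
        Region parent = new Region(1, 0, 100, 1);
        System.out.println(isChild(parent, new Region(1, 10, 20, 2)));
        System.out.println(isChild(parent, new Region(1, 12, 15, 3)));
    }
}
```

Dropping the level comparison would turn the same test into a descendant check, since nesting of positions alone already implies containment.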

13
For-Loops using Iterators

Need a stepper for a for-loop

class Step extends Iterator {
  boolean first;
  Tuple tuple;
  void open () { first = true; current = tuple; }
  Tuple next () { first = false; return current; }
  void set ( Tuple t ) { tuple = t; }
  boolean eos () { return !first; }
}

class Loop extends Iterator {
  Iterator left;
  Step right_step;
  Iterator right;
}

Tuple Loop.next () {
  if (!left.eos()) {
    while (right.eos()) {
      left.next();
      right_step.set(left.current());
      right.open();
    }
    current = left.current().append(right.current());
    right.next();
    return current;
  }
}

[Figure: Loop with its left input, the right pipeline, and right_step; Loop calls set on the Step]
14
Let-Bindings using Iterators

Let-bindings are the hardest to implement
  the let-value may be a sequence
  one producer -- many consumers
  we do not want to materialize the let-value in memory

[Figure: queue with tail and head; fastest consumer, slowest consumer, backlog]

Some cases are hopeless: let $v := e return ($v,$v)
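
The one-producer, many-consumers queue can be sketched as a shared buffer that keeps only the backlog between the slowest and the fastest consumer. This is an illustrative sketch, not the engine's implementation: the class `SharedStream` and its methods are hypothetical, the producer is a plain `java.util.Iterator`, and elements are assumed to be non-null.

```java
import java.util.ArrayDeque;
import java.util.Iterator;

// One producer, many consumers; only the backlog is kept in memory.
final class SharedStream<T> {
    private final Iterator<T> producer;
    private final ArrayDeque<T> queue = new ArrayDeque<T>();
    private long head = 0;          // absolute index of the first queued element
    private final long[] cursors;   // next absolute index wanted by each consumer

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursors = new long[consumers];
    }

    // Next element for the given consumer, or null at end of stream.
    T next(int consumer) {
        long want = cursors[consumer];
        while (head + queue.size() <= want) {   // the fastest consumer pulls
            if (!producer.hasNext()) return null;
            queue.addLast(producer.next());
        }
        T value = elementAt(want);
        cursors[consumer] = want + 1;
        trim();
        return value;
    }

    // Number of elements currently buffered.
    int backlog() { return queue.size(); }

    // Linear scan; acceptable for a sketch with a short backlog.
    private T elementAt(long index) {
        long skip = index - head;
        for (T t : queue) if (skip-- == 0) return t;
        throw new IllegalStateException("element already discarded");
    }

    // Drop everything the slowest consumer has already read.
    private void trim() {
        long slowest = Long.MAX_VALUE;
        for (long c : cursors) slowest = Math.min(slowest, c);
        while (head < slowest && !queue.isEmpty()) { queue.removeFirst(); head++; }
    }
}
```

When the consumers advance at similar rates the backlog stays small. In the hopeless case above, the first use of $v reads the whole sequence before the second use starts, so the backlog grows to the full let-value.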
15
Future Work

Integration with a pull-based, event-oriented processing of local XML files (instead of DOM-based)
Incorporate evaluation techniques for top-K selection queries
Use it in a peer-to-peer system as a distributed XML database
  current P2P indexing techniques (based on DHTs) are overkill
    for query /A/B, need to send all A index entries from peer A to peer B
  preprocessing of XQueries using Bloom filters