Title: XQuery Processing with Relevance Ranking
1 XQuery Processing with Relevance Ranking
- Leonidas Fegaras
- University of Texas at Arlington
- fegaras_at_cse.uta.edu
- http://lambda.uta.edu/
2 Motivation
- Many IR techniques for approximate matching over text-rich documents
  - keyword search queries over flat documents only
  - ranked results
- XQuery is very powerful in expressing exact queries over XML
- Our goal
  - XQuery processing with IR-style approximate matching
- Challenges
  - relevance ranking functions
  - ordering the results by their relevance scores
  - propagating and combining scores in a query
  - complexity of XQuery
    - multiple documents
    - nested queries
3 XQuery with Approximate Matching
- XQuery syntax extensions
  - full-text search: e ~ S
  - search specification S: phrase, S1 and S2, S1 or S2, not S
  - relevance assessment: score(e)
  - all indexed documents: document()

    <answer>{
      ( for $db in document()/biblio,
            $b in $db/bib[title ~ ("XQuery processing"
                                   and "relevance")]
        where $b/abstract ~ ("SAX" and not "DOM")
        order by score($b) descending
        return <paper>{ $b/author/name,
                        $b/title,
                        score($b)
               }</paper>
      )[position() < 10]
    }</answer>
4 Design Goals
- Design an XML indexing scheme for
  - path navigation
  - search predicates
- Provide relevance ranking functions based on the indexes
- Build a highly pipelined XQuery engine
  - that uses merge-joins exclusively
  - does not materialize intermediate results
- This engine should consist of operators that
  - propagate and combine relevance scores
  - naturally reflect the syntactic structures of XQuery and
  - can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
- The XQuery translation should be concise, clean, and completely compositional
5 Related Work
- TIX algebra [Al-Khalifa et al., SIGMOD'03]
- TeXQuery [Amer-Yahia et al., WWW'04]
- XQuery/IR [Bremer et al., WebDB'02]
- Languages: ELIXIR, XIRQL
- Systems: XXL, XRANK, XIRCUS
6 Relevance Ranking
- e ~ term: the term is associated with a pair (weight, position)
  - the position is relative to the beginning of the e element
  - the weight is based on
    - the standard IR tf-idf (term-frequency/inverse-document-frequency)
    - the difference between the nesting levels of term and e
- The scores for phrases and boolean connectives are based on term proximity
  - a conjunction of terms is summarized by their center of mass
    - $\mathit{position} = \left(\sum_i \mathit{position}_i \cdot \mathit{weight}_i\right) / \sum_i \mathit{weight}_i$
    - $\mathit{weight} = \left(\sum_i \mathit{position}_i \cdot \mathit{weight}_i\right) / \sum_i \mathit{position}_i$
- two sets of position/weight pairs
  - positive terms (T): a disjunction of possibilities
  - negative terms (F): a conjunction of forbidden terms
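The center-of-mass summary can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class names `TermHit` and `CenterOfMass` are hypothetical (they do not appear on the slides), and the input is assumed to be a non-empty list with positive positions and weights.

```java
import java.util.List;

// Hypothetical (position, weight) pair for one term occurrence.
final class TermHit {
    final double position;
    final double weight;
    TermHit(double position, double weight) {
        this.position = position;
        this.weight = weight;
    }
}

final class CenterOfMass {
    // Summarizes a conjunction of terms with the two ratios from the slide:
    //   position = sum(position_i * weight_i) / sum(weight_i)
    //   weight   = sum(position_i * weight_i) / sum(position_i)
    // Assumes a non-empty list whose positions and weights are positive.
    static TermHit summarize(List<TermHit> hits) {
        double moment = 0, positions = 0, weights = 0;
        for (TermHit h : hits) {
            moment += h.position * h.weight;
            positions += h.position;
            weights += h.weight;
        }
        return new TermHit(moment / weights, moment / positions);
    }
}
```

For example, two terms at positions 10 and 20, each with weight 0.5, are summarized as position 15 and weight 0.5.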
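7 Relevance Ranking (cont.)
- Merging terms from sets A and B
  - $A \otimes B = \{\, ((p_1 w_1 + p_2 w_2)/(p_1 + p_2),\ (p_1 w_1 + p_2 w_2)/(w_1 + w_2)) \mid (p_1,w_1) \in A,\ (p_2,w_2) \in B \,\}$
- Position/weights of search specifications
  - $(S_1 \text{ and } S_2).T = S_1.T \otimes S_2.T$ and $(S_1 \text{ and } S_2).F = S_1.F \cup S_2.F$
  - $(S_1 \text{ or } S_2).T = S_1.T \cup S_2.T$ and $(S_1 \text{ or } S_2).F = S_1.F \otimes S_2.F$
  - $(\text{not } S).T = S.F$ and $(\text{not } S).F = S.T$
- Cost of a search specification S
  - calculate $\{\, (|p_1 - p_2| / \mathit{size}) \cdot w_1 \cdot (1 - w_2) \mid (p_1,w_1) \in S.T,\ (p_2,w_2) \in S.F \,\}$
  - reduce the set by the function $x \oplus y = x + y - x y$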
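The cost computation can be sketched as follows. This is a hedged reading, not the author's implementation: it assumes that the per-pair value is the product (|p1 - p2| / size) * w1 * (1 - w2) and that the resulting set is folded with x ⊕ y = x + y - xy. The names `PosWeight` and `SpecCost` are hypothetical. How a specification with no negative terms is scored is not stated on the slide and is left open here (this sketch simply returns 0 for it).

```java
import java.util.List;

// Hypothetical position/weight pair.
final class PosWeight {
    final double p, w;
    PosWeight(double p, double w) { this.p = p; this.w = w; }
}

final class SpecCost {
    // x (+) y = x + y - x*y; the result stays in [0,1] when both inputs are in [0,1].
    static double oplus(double x, double y) {
        return x + y - x * y;
    }

    // Folds the per-pair values over all (positive, negative) combinations.
    // Assumption: the per-pair value is a product of the three factors.
    static double cost(List<PosWeight> t, List<PosWeight> f, double size) {
        double acc = 0.0; // 0 is the identity of oplus
        for (PosWeight pos : t) {
            for (PosWeight neg : f) {
                double v = Math.abs(pos.p - neg.p) / size * pos.w * (1.0 - neg.w);
                acc = oplus(acc, v);
            }
        }
        return acc;
    }
}
```

Under this reading, a positive term contributes more when it is heavy and when the forbidden term is far away or light.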
8 Inverse Indexes
[Figure: index structure relating keys, postings, and hits]
- Four indexes
  - XML tags: each hit has a begin/end position
  - text terms: each hit has a position
  - attribute names
  - attribute values
- Each index delivers the posting/hit pairs in (document_number, begin_position) order
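The ordering guarantee is what makes merge-joins possible downstream. A minimal in-memory sketch of a tag index that delivers hits in (document_number, begin_position) order is shown below; `TagOccurrence` and `TagIndex` are hypothetical names, and sorting at lookup time is a simplification (a real index would presumably keep its postings sorted on disk).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit of the tag index: a begin/end position inside one document.
final class TagOccurrence {
    final int document, begin, end;
    TagOccurrence(int document, int begin, int end) {
        this.document = document;
        this.begin = begin;
        this.end = end;
    }
}

final class TagIndex {
    private final Map<String, List<TagOccurrence>> entries = new HashMap<>();

    void add(String tag, int document, int begin, int end) {
        entries.computeIfAbsent(tag, k -> new ArrayList<>())
               .add(new TagOccurrence(document, begin, end));
    }

    // Returns the hits for a key sorted by (document, begin).
    List<TagOccurrence> lookup(String tag) {
        List<TagOccurrence> hits = new ArrayList<>(entries.getOrDefault(tag, List.of()));
        hits.sort(Comparator.comparingInt((TagOccurrence h) -> h.document)
                            .thenComparingInt(h -> h.begin));
        return hits;
    }
}
```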
9 The Pipeline Units

    abstract class Element {
        float score;            // relevance assessment of element
    }

    class Fragment extends Element {
        int document;           // document ID
        short begin;            // the start position in document
        short end;              // the end position in document
        short level;            // depth of term in document
    }

    class ConstructedElement extends Element {
        String tagname;
        Element[] sequence;     // children
        Attributes attributes;  // SAX-like attributes
    }

    class PCData extends Element { String data; }
10 The Pipeline Units (cont.)
- Need an element to capture all indexed elements

    class Pattern extends Element {
        int min_level;   // minimum depth in document
        int max_level;   // maximum depth in document
    }

  - for queries such as count(document()//*)
  - as a starting element for document()
- The unit of communication between pipeline operators is a tuple

    class Tuple { Element[] components; }

  - one element for each for-variable in a FLWOR expression
11 Pipeline Iterators

    class Iterator {
        Tuple current();   // current tuple from stream
        void open ();      // open the stream iterator
        Tuple next ();     // get the next tuple from stream
        boolean eos ();    // is this the end of stream?
    }

- An iterator reads data from the input stream(s) and delivers data to the output stream
- Connected through pipelines
  - an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  - to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
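The producer/consumer behavior can be illustrated with a small pull-based pipeline. This sketch is a simplification under stated assumptions: the stream carries plain ints instead of tuples, `next()` only advances (the current element is read with `current()`), and the class names `PullIterator`, `ArraySource`, and `EvenFilter` are hypothetical.

```java
// Simplified iterator protocol: open, read current, advance, test for end of stream.
abstract class PullIterator {
    abstract void open();
    abstract int current();
    abstract void next();
    abstract boolean eos();
}

// Producer at the start of a pipeline: streams a fixed array.
class ArraySource extends PullIterator {
    private final int[] data;
    private int i;
    int pulled = 0;  // how many times a consumer asked this source to advance

    ArraySource(int... data) { this.data = data; }
    void open() { i = 0; }
    int current() { return data[i]; }
    void next() { i++; pulled++; }
    boolean eos() { return i >= data.length; }
}

// Consumer of its input and producer for the next operator: keeps even values only.
class EvenFilter extends PullIterator {
    private final PullIterator input;

    EvenFilter(PullIterator input) { this.input = input; }

    // Requests as many input elements as necessary to reach one output element.
    private void skip() {
        while (!input.eos() && input.current() % 2 != 0) input.next();
    }

    void open() { input.open(); skip(); }
    int current() { return input.current(); }
    void next() { input.next(); skip(); }
    boolean eos() { return input.eos(); }
}
```

A consumer drives the pipeline with `for (f.open(); !f.eos(); f.next())` and reads `f.current()` in the loop body; the source advances only as far as each request requires.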
12 Example

    class Child extends Iterator {
        String tag;
        Iterator input;
        IndexIterator ti;

        Tuple next () {
            while (!ti.eos() && !input.eos())
                if (input.current().components[0] instanceof Fragment) {
                    Fragment f = (Fragment) input.current().components[0];
                    Posting p = ti.posting();
                    TagHit h = (TagHit) ti.hit();
                    // h is a child of f: same document, nested inside f, one level deeper
                    if ( f.document == p.document
                         && f.begin < h.begin && f.end > h.end
                         && h.level == f.level+1) {
                        ti.next();
                        return new Tuple(new Fragment(p.document,h.begin,h.end,h.level));
                    }
                    ...
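Since the slide elides the rest of the loop, the following is only a sketch of how a child step could be completed as a merge-join, not the author's code. It assumes both inputs are sorted by (document, begin) and that the parent fragments do not overlap each other; `Span` and `ChildJoin` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a fragment or tag hit: a positioned span at a nesting level.
final class Span {
    final int document, begin, end, level;
    Span(int document, int begin, int end, int level) {
        this.document = document;
        this.begin = begin;
        this.end = end;
        this.level = level;
    }
}

final class ChildJoin {
    // The containment test from the slide: same document, nested, one level deeper.
    static boolean isChild(Span parent, Span hit) {
        return parent.document == hit.document
            && parent.begin < hit.begin && parent.end > hit.end
            && hit.level == parent.level + 1;
    }

    // Merge-join over two lists sorted by (document, begin).
    // Assumption: parents do not overlap each other.
    static List<Span> children(List<Span> parents, List<Span> hits) {
        List<Span> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < parents.size() && j < hits.size()) {
            Span f = parents.get(i), h = hits.get(j);
            if (h.document < f.document
                    || (h.document == f.document && h.begin <= f.begin)) {
                j++;   // hit starts before this parent: skip the hit
            } else if (h.document > f.document || h.begin > f.end) {
                i++;   // hit starts after this parent ends: move to the next parent
            } else {
                if (isChild(f, h)) out.add(h);   // inside the parent; keep direct children only
                j++;
            }
        }
        return out;
    }
}
```

Each input is scanned once and nothing is buffered, which is in the spirit of the merge-join design goal.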
13 For-Loops using Iterators
- Need a stepper for a for-loop

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    Tuple Loop.next () {
        if (!left.eos()) {
            while (right.eos()) {
                left.next();
                right_step.set(left.current());
                right.open();
            }
            current = left.current().append(right.current());
            right.next();
        }
        return current;
    }

[Figure: a Loop operator with its left input, its right pipeline, and the Step (right_step) on which it calls set]

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
    }
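The interplay of Loop and Step can be tried out with a self-contained variant. This is a sketch under stated assumptions rather than the engine's code: tuples are lists of ints, `next()` only advances (the tuple is read from the `current` field), the loop stops cleanly when the left input is exhausted, and `Iter`, `Source`, and `Range` are hypothetical helper classes. `Range` plays the role of the right pipeline: for an outer value n it streams 1..n.

```java
import java.util.ArrayList;
import java.util.List;

abstract class Iter {
    List<Integer> current;   // current tuple (a list of ints in this sketch)
    abstract void open();
    abstract void next();
    abstract boolean eos();
}

// Left input: one single-component tuple per array entry.
class Source extends Iter {
    private final int[] data;
    private int i;
    Source(int... data) { this.data = data; }
    void open() { i = 0; load(); }
    void next() { i++; load(); }
    boolean eos() { return i >= data.length; }
    private void load() { current = eos() ? null : List.of(data[i]); }
}

// Stepper: a one-tuple stream that holds the current outer tuple.
class Step extends Iter {
    private boolean first;
    private List<Integer> tuple;
    void set(List<Integer> t) { tuple = t; }
    void open() { first = true; current = tuple; }
    void next() { first = false; }
    boolean eos() { return !first; }
}

// Right pipeline: reads the outer value n from its input and streams 1..n.
class Range extends Iter {
    private final Iter input;
    private int n, k;
    Range(Iter input) { this.input = input; }
    void open() { input.open(); n = input.current.get(0); k = 1; load(); }
    void next() { k++; load(); }
    boolean eos() { return k > n; }
    private void load() { current = eos() ? null : List.of(k); }
}

class Loop extends Iter {
    private final Iter left;
    private final Step rightStep;
    private final Iter right;
    Loop(Iter left, Step rightStep, Iter right) {
        this.left = left; this.rightStep = rightStep; this.right = right;
    }
    void open() { left.open(); bindRight(); fill(); }
    void next() { right.next(); fill(); }
    boolean eos() { return left.eos(); }

    // Rebinds the right pipeline to the current outer tuple and restarts it.
    private void bindRight() {
        if (!left.eos()) { rightStep.set(left.current); right.open(); }
    }
    // Skips outer tuples whose inner stream is exhausted, then joins the two tuples.
    private void fill() {
        while (!left.eos() && right.eos()) { left.next(); bindRight(); }
        if (left.eos()) { current = null; return; }
        current = new ArrayList<>(left.current);
        current.addAll(right.current);
    }
}
```

With `Step step = new Step(); Loop loop = new Loop(new Source(2, 0, 1), step, new Range(step));`, iterating `for (loop.open(); !loop.eos(); loop.next())` yields the tuples [2, 1], [2, 2], [1, 1]; the outer value 0 produces no output because its inner stream is empty.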
14 Let-Bindings using Iterators
- Let-bindings are the hardest to implement
  - the let-value may be a sequence
  - one producer -- many consumers
  - we do not want to materialize the let-value in memory
[Figure: a shared queue with head and tail pointers; the backlog is the span between the slowest consumer and the fastest consumer]
- Some cases are hopeless: let $v := e return ($v, $v)
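The one-producer, many-consumers queue can be sketched as a buffer that keeps only the backlog between the slowest and the fastest consumer. This is an illustrative, single-threaded sketch; the class name `SharedStream` and its methods are hypothetical, and at least one consumer is assumed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SharedStream<T> {
    private final Iterator<T> producer;
    private final List<T> queue = new ArrayList<>();  // the backlog
    private long head = 0;           // stream index of queue.get(0)
    private final long[] cursor;     // next stream index for each consumer

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursor = new long[consumers];
    }

    boolean hasNext(int c) {
        return cursor[c] < head + queue.size() || producer.hasNext();
    }

    // Returns the next element for consumer c; call hasNext(c) first.
    T next(int c) {
        // The fastest consumer pulls a fresh element from the producer.
        if (cursor[c] == head + queue.size()) queue.add(producer.next());
        T item = queue.get((int) (cursor[c] - head));
        cursor[c]++;
        trim();
        return item;
    }

    // Drops every element that the slowest consumer has already read.
    private void trim() {
        long slowest = cursor[0];
        for (long k : cursor) slowest = Math.min(slowest, k);
        queue.subList(0, (int) (slowest - head)).clear();
        head = slowest;
    }

    int backlog() { return queue.size(); }
}
```

When the consumers advance in lockstep the backlog stays small. In the hopeless case above, the first consumer reads the entire sequence before the second one starts, so the backlog grows to the full let-value.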
15 Future Work
- Integration with a pull-based, event-oriented processing of local XML files (instead of DOM-based)
- Incorporate evaluation techniques for top-K selection queries
- Use it in a peer-to-peer system as a distributed XML database
  - current P2P indexing techniques (based on DHTs) are an overkill
  - for query /A/B need to send all A index entries from peer A to peer B
  - preprocessing of XQueries using Bloom filters