XQuery Processing with Relevance Ranking

1
XQuery Processing with Relevance Ranking
  • Leonidas Fegaras
  • University of Texas at Arlington
  • fegaras_at_cse.uta.edu
  • http://lambda.uta.edu/

2
Motivation
  • Many IR techniques for approximate matching over
    text-rich documents
      • keyword search queries over flat documents only
      • ranked results
  • XQuery is very powerful in expressing exact
    queries over XML
  • Our goal
      • XQuery processing + IR-style approximate
        matching
  • Challenges
      • relevance ranking functions
      • ordering the results by their relevance scores
      • propagating and combining scores in a query
      • complexity of XQuery
          • multiple documents
          • nested queries

3
XQuery with Approximate Matching
  • XQuery syntax extensions
      • full-text search: e ~ S
      • search specification S: phrase, S1 and S2,
        S1 or S2, not S
      • relevance assessment: score(e)
      • all indexed documents: document()

    <answer>{
      ( for $db in document()/biblio,
            $b in $db/bib[title ~ ("XQuery processing"
                                   and "relevance")]
        where $b/abstract ~ ("SAX" and not "DOM")
        order by score($b) descending
        return <paper>{ $b/author/name,
                        $b/title,
                        score($b) }</paper>
      )[position()<10]
    }</answer>

4
Design Goals
  • Design an XML indexing scheme for
      • path navigation
      • search predicates
  • Provide relevance ranking functions based on the
    indexes
  • Build a highly pipelined XQuery engine that
      • uses merge-joins exclusively
      • does not materialize intermediate results
  • This engine should consist of operators that
      • propagate and combine relevance scores
      • naturally reflect the syntactic structures of
        XQuery
      • can be composed into pipelines in the same way
        the corresponding XQuery structures are composed
        to form complex queries
  • The XQuery translation should be concise, clean,
    and completely compositional

5
Related Work
  • TIX algebra [Al-Khalifa et al., SIGMOD'03]
  • TeXQuery [Amer-Yahia et al., WWW'04]
  • XQuery/IR [Bremer et al., WebDB'02]
  • Languages: ELIXIR, XIRQL
  • Systems: XXL, XRANK, XIRCUS

6
Relevance Ranking
  • e ~ term: the term is associated with a pair
    (weight, position)
      • the position is relative to the beginning of the
        element e
      • the weight is based on
          • the standard IR tf-idf (term frequency /
            inverse document frequency)
          • the difference between the nesting levels of
            the term and e
  • The scores for phrases and boolean connectives are
    based on term proximity
      • a conjunction of terms is summarized by their
        center of mass
          $\text{position} = \dfrac{\sum_i \text{position}_i \times \text{weight}_i}{\sum_i \text{weight}_i}$
          $\text{weight} = \dfrac{\sum_i \text{position}_i \times \text{weight}_i}{\sum_i \text{position}_i}$
      • two sets of position/weight pairs
          • positive terms (T): a disjunction of
            possibilities
          • negative terms (F): a conjunction of forbidden
            terms
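
The center-of-mass summary above can be sketched in Java as follows. This is only an illustration of the two formulas on the slide: the class and method names (CenterOfMass, Term, summarize) are made up for this sketch, and it assumes that the weights and the positions each sum to a non-zero value.

```java
import java.util.List;

/** Sketch only: the names here are illustrative, not taken from the slides. */
final class CenterOfMass {
    /** One matched term: its position and its weight. */
    static final class Term {
        final double position;
        final double weight;

        Term(double position, double weight) {
            this.position = position;
            this.weight = weight;
        }
    }

    /** Collapses a conjunction of terms into a single (position, weight) pair. */
    static Term summarize(List<Term> terms) {
        double sumPW = 0.0; // sum of position_i * weight_i
        double sumW = 0.0;  // sum of weight_i
        double sumP = 0.0;  // sum of position_i
        for (Term t : terms) {
            sumPW += t.position * t.weight;
            sumW += t.weight;
            sumP += t.position;
        }
        // position = sumPW / sumW, weight = sumPW / sumP (assumes both sums are non-zero)
        return new Term(sumPW / sumW, sumPW / sumP);
    }
}
```

For two terms at positions 10 and 20 with equal weights 0.5, the summary sits at position 15 and keeps weight 0.5.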

7
Relevance Ranking (cont.)
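  • Merging terms from sets A and B yields the set
    $\left\{ \left( \dfrac{p_1 w_1 + p_2 w_2}{p_1 + p_2},\ \dfrac{p_1 w_1 + p_2 w_2}{w_1 + w_2} \right) \;\middle|\; (p_1, w_1) \in A,\ (p_2, w_2) \in B \right\}$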
  • Position/weights of search specifications
      • (S1 and S2).T = S1.T ? S2.T
        (S1 and S2).F = S1.F ? S2.F
      • (S1 or S2).T = S1.T ? S2.T
        (S1 or S2).F = S1.F ? S2.F
      • (not S).T = S.F
        (not S).F = S.T
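  • Cost of a search specification S
      • calculate the set
        $\left\{ \dfrac{|p_1 - p_2|}{\text{size}} \times w_1 \times (1 - w_2) \;\middle|\; (p_1, w_1) \in S.T,\ (p_2, w_2) \in S.F \right\}$
      • reduce the set by the function
        $x \oplus y = x + y - x\,y$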
8
Inverse Indexes
[Diagram: index layout with keys, postings, and hits]
  • Four indexes
      • XML tags: each hit has a begin/end position
      • text terms: each hit has a position
      • attribute names
      • attribute values
  • Each index delivers the posting/hit pairs in
    (document_number, begin_position) order
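
The delivery order matters because merge-based operators rely on it. A minimal Java sketch of that ordering, with a hypothetical Hit class that carries only the two sort keys, could look like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: Hit and HitOrder are illustrative names, not the engine's classes. */
final class HitOrder {
    /** The two fields that define the delivery order. */
    static final class Hit {
        final int document;
        final int begin;

        Hit(int document, int begin) {
            this.document = document;
            this.begin = begin;
        }
    }

    /** Orders by document number first, then by begin position within the document. */
    static final Comparator<Hit> BY_DOCUMENT_THEN_BEGIN =
            Comparator.<Hit>comparingInt(h -> h.document).thenComparingInt(h -> h.begin);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Hit> sorted(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_DOCUMENT_THEN_BEGIN);
        return copy;
    }
}
```

Two streams that both follow this order can be combined in a single forward pass, which is what makes merge-joins possible without materializing either side.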

9
The Pipeline Units
    abstract class Element {
        float score;            // relevance assessment of element
    }

    class Fragment extends Element {
        int document;           // document ID
        short begin;            // the start position in document
        short end;              // the end position in document
        short level;            // depth of term in document
    }

    class ConstructedElement extends Element {
        String tagname;
        Element[] sequence;     // children
        Attributes attributes;  // SAX-like attributes
    }

    class PCData extends Element { String data; }

10
The Pipeline Units (cont.)
  • Need an element to capture all indexed elements

        class Pattern extends Element {
            int min_level;      // minimum depth in document
            int max_level;      // maximum depth in document
        }

      • for queries such as count(document()//*)
      • as a starting element for document()
  • The unit of communication between pipeline
    operators is a tuple

        class Tuple { Element[] components; }

      • one element for each for-variable in a FLWOR
        expression

11
Pipeline Iterators
    class Iterator {
        Tuple current();        // current tuple from stream
        void open ();           // open the stream iterator
        Tuple next ();          // get the next tuple from stream
        boolean eos ();         // is this the end of stream?
    }

  • An iterator reads data from the input stream(s)
    and delivers data to the output stream
  • Connected through pipelines
      • an iterator (the producer) delivers a stream
        element to the output only when requested by the
        next operator in the pipeline (the consumer)
      • to deliver one stream element to the output, the
        producer becomes a consumer by requesting from
        the previous iterator as many elements as
        necessary to produce a single element, and so on,
        until the end of stream
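
The pull discipline described above can be sketched in Java with two tiny operators. Everything here is an assumption made for illustration: the generic element type stands in for Tuple, and PullIterator, ListSource, Select, and Pipelines are hypothetical names. The sketch assumes that open() positions the iterator on its first element and next() advances to the following one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch only: a generic pull iterator with the four operations from the slide. */
abstract class PullIterator<T> {
    protected T current;
    T current() { return current; }
    abstract void open();    // position on the first element
    abstract T next();       // advance and return the new current element
    abstract boolean eos();  // true once the stream is exhausted
}

/** Producer at the bottom of a pipeline: reads from an in-memory list. */
final class ListSource<T> extends PullIterator<T> {
    private final List<T> items;
    private int index;

    ListSource(List<T> items) { this.items = items; }

    @Override void open() { index = 0; load(); }
    @Override T next() { index++; load(); return current; }
    @Override boolean eos() { return index >= items.size(); }
    private void load() { current = eos() ? null : items.get(index); }
}

/** Consumer of its input and producer for the next operator: keeps matching elements. */
final class Select<T> extends PullIterator<T> {
    private final PullIterator<T> input;
    private final Predicate<T> keep;

    Select(PullIterator<T> input, Predicate<T> keep) {
        this.input = input;
        this.keep = keep;
    }

    @Override void open() { input.open(); skip(); }
    @Override T next() { input.next(); skip(); return current; }
    @Override boolean eos() { return input.eos(); }

    /** Pulls from the input as many elements as needed to find one that matches. */
    private void skip() {
        while (!input.eos() && !keep.test(input.current())) input.next();
        current = input.eos() ? null : input.current();
    }
}

final class Pipelines {
    /** Drives a pipeline from the top; nothing is computed until it is asked for. */
    static <T> List<T> drain(PullIterator<T> it) {
        List<T> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }
}
```

Draining a Select that keeps even numbers over the source 1..6 pulls all six inputs, one request at a time, and delivers 2, 4, and 6.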

12
Example
    class Child extends Iterator {
        String tag;
        Iterator input;
        IndexIterator ti;

        Tuple next () {
            while (!ti.eos() && !input.eos())
                if (input.current[0] instanceof Fragment) {
                    Fragment f = (Fragment) input.current[0];
                    Posting p = ti.posting();
                    TagHit h = (TagHit) ti.hit();
                    // h is a child of f: same document, nested inside f, one level deeper
                    if ( f.document == p.document
                         && f.begin < h.begin && f.end > h.end
                         && h.level == f.level+1 ) {
                        ti.next();
                        return new Tuple(new Fragment(p.document,h.begin,h.end,h.level));
                    }
                    ...
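
The containment test at the heart of Child can be exercised on its own. The sketch below isolates just that condition. Region and Containment are hypothetical names, and the advancing logic that the slide elides is deliberately left out.

```java
/** Sketch only: isolates the parent/child condition used by the Child operator. */
final class Containment {
    /** A region of a document: where it starts and ends, and how deep it is nested. */
    static final class Region {
        final int document;
        final int begin;
        final int end;
        final int level;

        Region(int document, int begin, int end, int level) {
            this.document = document;
            this.begin = begin;
            this.end = end;
            this.level = level;
        }
    }

    /** True when candidate lies strictly inside parent, in the same document, one level deeper. */
    static boolean isChild(Region parent, Region candidate) {
        return parent.document == candidate.document
                && parent.begin < candidate.begin
                && parent.end > candidate.end
                && candidate.level == parent.level + 1;
    }
}
```

Dropping the level condition would turn the same test into a descendant check, since any nested region at a deeper level would then qualify.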

13
For-Loops using Iterators
  • Need a stepper for a for-loop

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
    }

    Tuple Loop.next () {
        if (!left.eos()) {
            while (right.eos()) {
                left.next();
                right_step.set(left.current());
                right.open();
            }
            current = left.current().append(right.current());
            right.next();
        }
        return current;
    }

[Diagram: a Loop operator with inputs left and right; Loop calls set on right_step (a Step), which feeds the right pipeline]
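
A runnable sketch of the stepper idea follows. It is not the engine's code: the types are simplified (integers and strings instead of tuples), UpTo is an invented right-hand operator that expands n into 1..n, and open() is assumed to position an iterator on its first element. The point it illustrates is that the right pipeline is built once over a Step and re-opened for every element of the left input.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: minimal pull iterator; open() positions on the first item. */
abstract class It<T> {
    protected T current;
    T current() { return current; }
    abstract void open();
    abstract void next();
    abstract boolean eos();
}

/** Iterates over a fixed list (stands in for the left input). */
final class Src<T> extends It<T> {
    private final List<T> items;
    private int i;
    Src(List<T> items) { this.items = items; }
    @Override void open() { i = 0; load(); }
    @Override void next() { i++; load(); }
    @Override boolean eos() { return i >= items.size(); }
    private void load() { current = eos() ? null : items.get(i); }
}

/** One-element stream whose value is re-set for every left element. */
final class Step<T> extends It<T> {
    private T value;
    private boolean first; // false until opened, so an unopened Step reports eos
    void set(T v) { value = v; }
    @Override void open() { first = true; current = value; }
    @Override void next() { first = false; }
    @Override boolean eos() { return !first; }
}

/** Hypothetical right pipeline: expands the step value n into 1, 2, ..., n. */
final class UpTo extends It<Integer> {
    private final It<Integer> input;
    private int limit = 0, k = 1; // eos until opened
    UpTo(It<Integer> input) { this.input = input; }
    @Override void open() { input.open(); limit = input.eos() ? 0 : input.current(); k = 1; load(); }
    @Override void next() { k++; load(); }
    @Override boolean eos() { return k > limit; }
    private void load() { current = eos() ? null : k; }
}

/** For each left element, binds the step and iterates the right pipeline. */
final class Loop extends It<String> {
    private final It<Integer> left;
    private final Step<Integer> rightStep;
    private final It<Integer> right;
    Loop(It<Integer> left, Step<Integer> rightStep, It<Integer> right) {
        this.left = left; this.rightStep = rightStep; this.right = right;
    }
    @Override void open() { left.open(); bindRight(); settle(); }
    @Override void next() { right.next(); settle(); }
    @Override boolean eos() { return left.eos(); }
    private void bindRight() {
        if (!left.eos()) { rightStep.set(left.current()); right.open(); }
    }
    /** Advances left until the right pipeline has an element or left is exhausted. */
    private void settle() {
        while (!left.eos() && right.eos()) { left.next(); bindRight(); }
        current = left.eos() ? null : left.current() + ":" + right.current();
    }
}

final class LoopDemo {
    static List<String> pairs(List<Integer> outer) {
        Step<Integer> step = new Step<>();
        Loop loop = new Loop(new Src<>(outer), step, new UpTo(step));
        List<String> out = new ArrayList<>();
        for (loop.open(); !loop.eos(); loop.next()) out.add(loop.current());
        return out;
    }
}
```

Unlike the slide's version, this sketch checks for the end of the left input before rebinding the step, so an empty right pipeline for the last left element simply ends the loop.
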
14
Let-Bindings using Iterators
  • Let-bindings are the hardest to implement
  • the let-value may be a sequence
  • one producer -- many consumers
  • we do not want to materialize the let-value in
    memory

[Diagram: a queue with head and tail, shared by a fastest consumer and a slowest consumer, with the backlog between them]
Some cases are hopeless: let $v := e return ($v,$v)
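
One way to picture the queue is a buffer that pulls from the producer only when the fastest consumer needs a new item and discards items once the slowest consumer has read them. The Java sketch below is an assumption-laden illustration of that idea, not the engine's design: SharedStream and its methods are hypothetical names, consumers are identified by an index, and the producer is a plain java.util.Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Sketch only: one producer, several consumers, and a backlog trimmed by the slowest. */
final class SharedStream<T> {
    private final Iterator<T> producer;
    private final List<T> backlog = new ArrayList<>(); // items not yet read by everyone
    private final int[] cursors;                        // absolute read position per consumer
    private int base = 0;                               // absolute position of backlog.get(0)

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursors = new int[consumers];
    }

    boolean hasNext(int consumer) {
        return cursors[consumer] < base + backlog.size() || producer.hasNext();
    }

    T next(int consumer) {
        int pos = cursors[consumer];
        if (pos == base + backlog.size()) {
            // This consumer is ahead of everything buffered: pull one item from the producer.
            if (!producer.hasNext()) throw new NoSuchElementException();
            backlog.add(producer.next());
        }
        T item = backlog.get(pos - base);
        cursors[consumer] = pos + 1;
        trim();
        return item;
    }

    /** Drops every item that all consumers have already read. */
    private void trim() {
        int slowest = Integer.MAX_VALUE;
        for (int c : cursors) slowest = Math.min(slowest, c);
        backlog.subList(0, slowest - base).clear();
        base = slowest;
    }

    int backlogSize() { return backlog.size(); }
}
```

In the hopeless case, the first consumer reads the whole sequence before the second one starts, so the backlog grows to the full length of the let-value and nothing is saved.
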
15
Future Work
  • Integrate with pull-based, event-oriented
    processing of local XML files (instead of
    DOM-based processing)
  • Incorporate evaluation techniques for top-K
    selection queries
  • Use it in a peer-to-peer system as a distributed
    XML database
      • current P2P indexing techniques (based on DHTs)
        are overkill
          • for the query /A/B, all A index entries need to
            be sent from peer A to peer B
      • preprocessing of XQueries using Bloom filters