1
Identifying and Ranking Relevant Document Elements
  • Andrew Trotman
  • Richard A. O'Keefe
  • Department of Computer Science
  • University of Otago
  • Dunedin, New Zealand

2
IR System Design
  • Build a corpus tree
  • Annotate the indexes
  • Convert the topic into a Boolean expression
  • Annotate the accumulators
  • Resolve the query against the indexes
  • Compute the coverage
  • Choose the results

3
Document Trees
  • Each well-formed document is a tree
  • Attributes (@) and instancing are included
  • Trees are not unique

4
Build a Corpus Tree
  • The superimposition of all document trees
  • Built as encountered
  • Label each node as encountered
  • Matches the structure of no one document
  • Has every path through every document
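The superimposition step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: trees are represented as nested dicts mapping a tag to its children, and the tag names are invented for the example.

```python
def add_document(corpus, doc_tree):
    """Superimpose one document tree onto the corpus tree.

    Both trees are dicts mapping a tag name to a dict of children,
    e.g. {"article": {"sec": {"p": {}}}}.  Each node is created
    (labelled) the first time it is encountered.
    """
    for tag, children in doc_tree.items():
        node = corpus.setdefault(tag, {})   # label node on first encounter
        add_document(node, children)

corpus = {}
add_document(corpus, {"article": {"sec": {"p": {}, "fig": {}}}})
add_document(corpus, {"article": {"abs": {"p": {}}}})
# corpus now contains every path through both documents, though it
# matches the structure of neither document exactly
```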

5
Annotate the Indexes
  • Traditionally in inverted files
  • &lt;d1,f1&gt;, &lt;d2,f2&gt;, …, &lt;dn,fn&gt;
  • Now
  • &lt;d1,p1,f1&gt;, &lt;d2,p2,f2&gt;, …, &lt;dn,pn,fn&gt;
  • or
  • &lt;d1,p1,w1&gt;, &lt;d2,p2,w2&gt;, …, &lt;dn,pn,wn&gt;
  • Where
  • pn is a node from the corpus tree
  • wn is a term number used for phrase search
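A small sketch of the annotated postings, assuming the third form above: each posting is a tuple (document, corpus-tree node, term number), with the term number supporting phrase search. The terms, node paths, and helper function are invented for illustration.

```python
# Hypothetical postings: (doc id, corpus-tree node, term number wn).
postings = {
    "information": [(1, "article/sec/p", 4), (2, "article/abs/p", 1)],
    "retrieval":   [(1, "article/sec/p", 5), (3, "article/sec/p", 9)],
}

def phrase_match(term_a, term_b):
    """Documents where term_b immediately follows term_a in the same node,
    found by comparing the stored term numbers."""
    hits = set(postings[term_a])
    return sorted({d for (d, p, w) in postings[term_b]
                   if (d, p, w - 1) in hits})

# phrase_match("information", "retrieval") -> [1]
```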

6
Fast Access Indexes
  • Sorted by pn
  • Store d before w
  • Only examine some pn
  • Decompress only those dn needed

7
Nodes to Use
  • Build a big list of tags and children nodes
  • about(./p, term)
  • about(./@c, term)
  • about(./p2, term)
  • about(./sec/p, term)
  • Tag equivalence
  • e.g. @c ≡ p2

8
Convert Topic to Boolean
  • For each about clause
  • Collect like terms
  • Mandatory terms are ANDed
  • Optional terms are ORed
  • Exclusion terms are ANDed
  • Build a sub-expression
  • Mandatory AND (OR of Optional) NOT Exclusion
  • Operators around about clause are preserved
  • Boolean / BM25 hybrid Ranking
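The sub-expression construction can be sketched as below. The function returns a string for readability; a real system would build a parse tree, and the term lists are invented examples.

```python
def about_to_boolean(mandatory, optional, exclusion):
    """Build the sub-expression: Mandatory AND (OR of Optional) NOT Exclusion."""
    parts = []
    if mandatory:
        parts.append(" AND ".join(mandatory))          # mandatory terms ANDed
    if optional:
        parts.append("(" + " OR ".join(optional) + ")")  # optional terms ORed
    expr = " AND ".join(parts)
    for term in exclusion:                             # exclusions ANDed in as NOTs
        expr += " NOT " + term
    return expr

# about_to_boolean(["xml"], ["index", "search"], ["sql"])
# -> "xml AND (index OR search) NOT sql"
```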

9
Resolve Query Against Indexes
  • Build then walk the parse tree
  • At leaves, load postings and process
  • BM25 score added to each document accumulator
  • Coverage added to each document accumulator
  • Convert into a bit-string, one per leaf
  • At nodes, Boolean operate
  • Boolean operate the bit-strings
  • Ignore the accumulators
  • At root of parse tree
  • Bit-string of relevant documents
  • Accumulators with BM25 scores and coverage
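A toy version of this walk, under the simplifying assumptions that a document bit-string is a Python integer and a parse-tree node is a tuple; the postings and BM25 scores are invented.

```python
from collections import defaultdict

# Hypothetical per-term postings: doc id -> precomputed BM25 contribution.
postings = {"xml": {1: 2.1, 3: 0.7}, "search": {1: 1.4, 2: 3.0}}
accumulators = defaultdict(float)        # one set per query: doc -> BM25 sum

def resolve(node):
    op = node[0]
    if op == "term":                     # leaf: load postings, score, emit bits
        bits = 0
        for doc, bm25 in postings[node[1]].items():
            accumulators[doc] += bm25    # BM25 added to the doc accumulator
            bits |= 1 << doc             # convert postings into a bit-string
        return bits
    left, right = resolve(node[1]), resolve(node[2])
    # internal node: Boolean-operate the bit-strings, ignore accumulators
    return left & right if op == "AND" else left | right

bits = resolve(("AND", ("term", "xml"), ("term", "search")))
docs = [d for d in accumulators if bits >> d & 1]   # relevant documents
```

At the root, `bits` is the bit-string of relevant documents and `accumulators` still holds the BM25 scores for every document touched.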

10
Annotate the Accumulators
  • Traditionally
  • Per document
  • One real-valued cell
  • Now
  • Per document
  • One real-valued cell
  • Per node in corpus tree
  • One integer-valued cell
  • As Always
  • Accumulators are stored sparsely
  • There is only one set of accumulators per query
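One way to realise these sparse accumulators is sketched below. The structure (a dict per document holding one real BM25 cell plus integer cells per corpus-tree node) is a guess at the design described on the slide, not the paper's code.

```python
from collections import defaultdict

class Accumulators:
    """Sparse per-query accumulators: one real-valued BM25 cell per document,
    plus one integer-valued coverage cell per corpus-tree node.  Cells are
    only materialised for documents and nodes actually touched."""
    def __init__(self):
        self.bm25 = defaultdict(float)                         # doc -> score
        self.coverage = defaultdict(lambda: defaultdict(int))  # doc -> node -> count

    def add(self, doc, node, score):
        self.bm25[doc] += score
        self.coverage[doc][node] += 1

acc = Accumulators()                        # one set of accumulators per query
acc.add(7, "article/sec[1]/p[2]", 1.8)
acc.add(7, "article/sec[1]", 1.8)
```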

11
Compute the Coverage
  • For each posting
  • document elements at or above the posting
  • For a term / document pair
  • superimposition of coverage of each posting
  • Weighted Coverage for a document node
  • The number of unique search terms that cover the node, i.e. that lie at
    or below it in the document tree
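The coverage computation can be sketched as follows, assuming nodes are '/'-separated paths so that a posting at a node covers that node and every ancestor; the postings are invented for the example.

```python
def weighted_coverage(postings, doc):
    """For each node of one document, count the unique search terms that
    occur at or below it (a posting covers its node and all ancestors)."""
    covered = {}                                 # node -> set of unique terms
    for term, entries in postings.items():
        for d, path in entries:
            if d != doc:
                continue
            parts = path.split("/")
            for i in range(1, len(parts) + 1):   # the node and every ancestor
                covered.setdefault("/".join(parts[:i]), set()).add(term)
    return {node: len(terms) for node, terms in covered.items()}

postings = {"xml":    [(1, "article/sec/p")],
            "search": [(1, "article/sec/title"), (1, "article/sec/p")]}
cov = weighted_coverage(postings, 1)
# The root ("article") necessarily has the highest coverage.
```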

12
Properties of Weighted Coverage
  • The highest coverage
  • Must be at the root of the document tree
  • For all nodes with the highest coverage
  • Lowest nodes are most information dense
  • As weighting decreases
  • specificity increases
  • Likely that
  • tags between two relevant tags are also relevant

13
Choose The Results
  • Recall determined by Boolean expression
  • Precision determined by
  • Document ranking
  • BM25 score
  • Node ranking
  • Weighted coverage
  • Use the lowest nodes with the highest coverage
  • For SCAS
  • Restrict weighted coverage to required nodes
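The selection rule, "use the lowest nodes with the highest coverage", can be sketched like this; it reuses the path-based node representation and invented coverage values from above.

```python
def choose_elements(coverage):
    """Pick the deepest nodes that attain the maximum weighted coverage.
    Depth is measured by path length, so 'deepest' means lowest in the tree."""
    best = max(coverage.values())
    candidates = [n for n, c in coverage.items() if c == best]
    depth = max(n.count("/") for n in candidates)
    return sorted(n for n in candidates if n.count("/") == depth)

cov = {"article": 2, "article/sec": 2, "article/sec/p": 2,
       "article/sec/title": 1}
# choose_elements(cov) -> ["article/sec/p"]
```

For SCAS the same selection would simply be run over the coverage cells of the nodes the topic requires, rather than all nodes.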

14
INEX03 Results
  • Did badly at inex_eval
  • Prefers tags high in the tree
  • Doesn't return overlapping results
  • Creates a dependence of tags on documents
  • Did well at inex_eval_ng
  • Coverage related to density
  • Relevant documents have relevant sections
  • Did better at generalized than specific
  • Tags returned in document order by BM25

15
Questions?