INEX: Evaluating content-oriented XML retrieval

1
INEX: Evaluating content-oriented XML retrieval
  • Mounia Lalmas
  • Queen Mary University of London
  • http://qmir.dcs.qmul.ac.uk

2
Outline
  • Content-oriented XML retrieval
  • Evaluating XML retrieval: INEX

3
XML Retrieval
  • Traditional IR is about finding documents
    relevant to a user's information need, e.g. an
    entire book.
  • XML retrieval allows users to retrieve document
    components that are more focussed on their
    information needs, e.g. a chapter of a book
    instead of an entire book.
  • The structure of documents is exploited to
    identify which document components to retrieve.

4
Structured Documents
  • Linear order of words, sentences, paragraphs
  • Hierarchy or logical structure of a book's
    chapters and sections
  • Links (hyperlink), cross-references, citations
  • Temporal and spatial relationships in multimedia
    documents

5
Structured Documents
  • Explicit structure formalised through
    document representation standards (mark-up
    languages)
  • Layout
  • LaTeX (publishing), HTML (Web publishing)
  • Structure
  • SGML, XML (Web publishing, engineering), MPEG-7
    (broadcasting)
  • Content/Semantics
  • RDF, DAML+OIL, OWL (Semantic Web)

<b><font size=2>SDR</font></b><img src="qmir.jpg" border=0>
<section> <subsection> <paragraph> ... </paragraph>
<paragraph> ... </paragraph> </subsection> </section>
<Book rdf:about="book"> <rdf:author .../>
<rdf:title/> </Book>
6
XML: eXtensible Mark-up Language
  • Meta-language (user-defined tags) currently being
    adopted as the document format language by the W3C
  • Used to describe content and structure (and not
    layout)
  • Grammar described in a DTD (used for validation)

<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into XML retrieval </title>
    <paragraph> ... </paragraph>
  </chapter>
</lecture>

<!ELEMENT lecture (title, author, chapter)>
<!ELEMENT author (fnm, snm)>
<!ELEMENT fnm (#PCDATA)>
7
XML: eXtensible Mark-up Language
  • Use of XPath notation to refer to the XML
    structure

chapter/title          title is a direct sub-component of chapter
//title                any title
chapter//title         title is a direct or indirect sub-component of chapter
chapter/paragraph[2]   any direct second paragraph of any chapter
chapter/*              all direct sub-components of a chapter

<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into SDR </title>
    <paragraph> ... </paragraph>
  </chapter>
</lecture>
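As a minimal sketch (not from the slides), the path expressions above can be tried out with Python's standard xml.etree.ElementTree over the lecture document; the document text and printed values are illustrative only.

import xml.etree.ElementTree as ET

lecture_xml = """
<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>first paragraph</paragraph>
    <paragraph>second paragraph</paragraph>
  </chapter>
</lecture>
"""

root = ET.fromstring(lecture_xml)

# chapter/title : titles that are direct sub-components of a chapter
print([t.text for t in root.findall("chapter/title")])

# //title : any title at any depth (ElementTree spells this ".//title")
print([t.text for t in root.findall(".//title")])

# chapter/paragraph[2] : the second direct paragraph of a chapter
print([p.text for p in root.findall("chapter/paragraph[2]")])

# chapter/* : all direct sub-components of a chapter
print([e.tag for e in root.findall("chapter/*")])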
8
Querying XML documents
  • Content-only (CO) queries
  • 'open standards for digital video in distance
    learning'
  • Content-and-structure (CAS) queries
  • //article[about(., 'formal methods verify
    correctness aviation systems')]
    /body//section[about(., 'case study application
    model checking theorem proving')]
  • Structure-only (SA) queries
  • /article//section/paragraph[2]
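As an illustration only (not part of the slides), a CAS query of this kind can be read as an XPath target filtered by a soft about() predicate. The sketch below uses a toy term-overlap predicate in place of a real retrieval function; the threshold, function names and scoring are hypothetical.

import xml.etree.ElementTree as ET

def about(element, query):
    # Toy about() predicate: fraction of query terms occurring in the
    # element's text. Real XML retrieval systems use ranked retrieval here.
    text = " ".join(element.itertext()).lower()
    terms = query.lower().split()
    return sum(t in text for t in terms) / len(terms)

def cas_search(root, threshold=0.3):
    # //article[about(., 'formal methods ...')]/body//section[about(., 'case study ...')]
    hits = []
    for article in root.iter("article"):
        if about(article, "formal methods verify correctness aviation systems") >= threshold:
            for body in article.findall("body"):
                for section in body.iter("section"):
                    score = about(section, "case study application model checking theorem proving")
                    if score >= threshold:
                        hits.append((score, section))
    return sorted(hits, key=lambda h: -h[0])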

9
Content-oriented XML retrieval
  • Return document components of varying
    granularity (e.g. a book, a chapter, a section, a
    paragraph, a table, a figure, etc.), relevant to
    the user's information need with regard to both
    content and structure.

10
Content-oriented XML retrieval
  • Retrieve the best components according to
    content and structure criteria
  • INEX: the most specific component that satisfies
    the query, while being exhaustive to the query
  • Shakespeare study: best entry points, which are
    components from which many relevant components
    can be reached through browsing
  • ???

11
Challenges
[Figure: an example Article with Title, Section 1 and Section 2, each
carrying weighted terms, e.g. XML 0.9, retrieval 0.4, authoring 0.7,
and element weights between 0.2 and 0.6]
  • no fixed retrieval unit: nested elements,
    element types
  • how to obtain document and collection statistics?
  • which component is a good retrieval unit?
  • which components contribute best to the content
    of the article?
  • how to estimate?
  • how to aggregate? (sketched below)
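A minimal sketch (my illustration, not from the slides) of one simple way to aggregate: propagate term weights from child elements up to their parent, so that an article's representation is an aggregate of its components. The decay factor, term weights and structure are hypothetical.

from collections import Counter

def aggregate(own_terms, children, decay=0.6):
    # own_terms: term counts occurring directly in this element
    # children: already-aggregated term weights of the child elements
    # decay: hypothetical factor down-weighting terms inherited from children
    agg = Counter(own_terms)
    for child in children:
        for term, weight in child.items():
            agg[term] += decay * weight
    return agg

# Example: an article whose content is aggregated from its components
title = Counter({"xml": 1})
section1 = Counter({"xml": 2, "retrieval": 1})
section2 = Counter({"xml": 1, "authoring": 2})
article = aggregate(Counter(), [title, section1, section2])
print(article)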

12
Approaches
vector space model, Bayesian network, fusion, collection statistics,
language model, cognitive model, smoothing, proximity search, tuning,
belief model, Boolean model, relevance feedback, phrase, parameter
estimation, probabilistic model, logistic regression, component
statistics, ontology, term statistics, natural language processing,
extending DB model
13
Vector space model
Separate indexes are built per element type (article, abstract,
section, sub-section, paragraph) and their results are merged;
tf and idf are computed as for fixed and non-nested retrieval units.
(IBM Haifa, INEX 2003)
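A rough sketch of this idea (my own, assuming scikit-learn as a dependency; the element texts and the naive merge rule are illustrative, not the IBM Haifa implementation): one tf-idf index per element type, with the ranked lists merged afterwards.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical texts, one list per element type
elements = {
    "article":   ["structured document retrieval with xml"],
    "section":   ["xml query languages", "evaluation of xml retrieval"],
    "paragraph": ["inex provides a test collection", "tf and idf statistics"],
}

def search(query, top_k=5):
    ranked = []
    for etype, texts in elements.items():
        # one tf-idf index per element type, so idf is computed within that type
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(texts)
        scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
        ranked += [(float(s), etype, i) for i, s in enumerate(scores)]
    # naive merge of the per-index results into a single ranked list
    return sorted(ranked, reverse=True)[:top_k]

print(search("xml retrieval evaluation"))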
14
Language model
Each element is scored with a mixture of an element language model
and a collection language model, with smoothing parameter λ; a high
value of λ leads to an increase in the size of the retrieved elements.
Element size, element score and article score are combined to rank
elements. Query expansion with blind feedback is used, and elements
with fewer than 20 terms are ignored. Results with λ = 0.9, 0.5 and
0.2 are similar.
(University of Amsterdam, INEX 2003)
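A minimal sketch of a smoothed element language-model score (my own illustration of the general technique; the exact formula and parameters used at INEX may differ). A higher λ gives more weight to the element's own statistics, which, as noted above, tends to favour larger elements.

import math
from collections import Counter

def lm_score(query, element_text, collection_tf, collection_len, lam=0.5):
    # Jelinek-Mercer smoothing of an element language model with the
    # collection language model: P(t|e) = lam*P(t|element) + (1-lam)*P(t|collection)
    element_tf = Counter(element_text.lower().split())
    element_len = sum(element_tf.values())
    score = 0.0
    for t in query.lower().split():
        p_elem = element_tf[t] / element_len if element_len else 0.0
        p_coll = collection_tf[t] / collection_len
        p = lam * p_elem + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score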
15
Evaluation of XML retrieval: INEX
  • Evaluating the effectiveness of content-oriented
    XML retrieval approaches
  • Collaborative effort: participants contribute to
    the development of the collection
  • queries
  • relevance assessments
  • Similar methodology to TREC, but adapted to
    XML retrieval
  • 40 participants worldwide
  • Workshop in Schloss Dagstuhl in December (20
    institutions)

16
INEX Test Collection
  • Documents (500MB): 12,107 articles in XML format
    from the IEEE Computer Society; 8 million elements
  • INEX 2002
  • 30 CO and 30 CAS queries
  • inex2002 metric
  • INEX 2003
  • 36 CO and 30 CAS queries
  • CAS queries are defined according to an enhanced
    subset of XPath
  • inex2002 and inex2003 metrics
  • INEX 2004 is just starting

17
Tasks
  • CO: aim is to decrease user effort by pointing
    the user to the most specific relevant portions
    of documents.
  • SCAS: retrieve relevant nodes that match the
    structure specified in the query.
  • VCAS: retrieve relevant nodes that may not be the
    same as the target elements, but are structurally
    similar.

18
Relevance in XML
  • An element is relevant if it has significant and
    demonstrable bearing on the matter at hand
  • Common assumptions in IR
  • Objectivity
  • Topicality
  • Binary nature
  • Independence

[Figure: example tree - an article containing sections 1-3, each with
paragraphs 1-2]
19
Relevance in INEX
all sections relevant → article very relevant
all sections relevant → article better than the sections
one section relevant → article less relevant
one section relevant → section better than the article
  • Exhaustivity
  • how exhaustively a document component discusses
    the query: 0, 1, 2, 3
  • Specificity
  • how focused the component is on the query: 0, 1,
    2, 3
  • Relevance
  • (3,3), (2,3), (1,1), (0,0), ...

20
Relevance assessment task
  • Completeness
  • Element → parent element, child elements
  • Consistency
  • The parent of a relevant element must also be
    relevant, although to a different extent
    (sketched below)
  • Exhaustivity increases going up the tree
  • Specificity decreases going up the tree
  • Use of an online interface
  • Assessing a query takes a week!
  • On average, 2 topics per participant
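A small sketch (my illustration) of the consistency rule as a check over (exhaustivity, specificity) pairs; the example values are made up.

def consistent(parent, child):
    # parent, child: (exhaustivity, specificity) pairs on the 0-3 scale.
    # Going up the tree, exhaustivity may not decrease and
    # specificity may not increase.
    pe, ps = parent
    ce, cs = child
    return pe >= ce and ps <= cs

print(consistent((3, 1), (2, 3)))  # True: the article is more exhaustive, less specific
print(consistent((1, 2), (2, 1)))  # False: a parent cannot be less exhaustive than its child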

21
Interface
[Screenshot of the online assessment interface, showing current
assessments, topic groups and navigation]
22
Assessments
  • With respect to the elements to assess:
  • 26% of assessments are on elements in the pool (66%
    in INEX 2002)
  • 68% are highly specific elements not in the pool
  • 7% of elements are automatically assessed
  • INEX 2002
  • 23 inconsistent assessments per query for one
    rule

23
Metrics
  • Need to consider
  • Two dimensions of relevance
  • Independence assumption does not hold
  • No predefined retrieval unit
  • Overlap
  • Linear vs. clustered ranking

24
INEX 2002 metric
  • Quantization
  • strict
  • generalized
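The quantization functions themselves were shown as formulas on the slide; the following is a reconstruction over (exhaustivity, specificity) pairs, where the strict case follows the INEX definition and the generalised weights are indicative only and should be checked against the INEX 2002 proceedings:

f_{strict}(e, s) = \begin{cases} 1 & \text{if } (e, s) = (3, 3) \\ 0 & \text{otherwise} \end{cases}

f_{generalised}(e, s) = \begin{cases} 1 & (e,s) = (3,3) \\ 0.75 & (e,s) \in \{(2,3), (3,2), (3,1)\} \\ 0.5 & (e,s) \in \{(1,3), (2,2), (2,1)\} \\ 0.25 & (e,s) \in \{(1,2), (1,1)\} \\ 0 & \text{otherwise} \end{cases}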

25
INEX 2002 metric
  • Precision as defined by Raghavan et al. (1989),
    based on the expected search length (ESL)
  • where n is estimated
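The formula itself did not survive the export; a reconstruction of the expected-search-length based precision of Raghavan et al. (1989), as used by the inex2002 metric (my reconstruction, to be checked against the original paper):

P(x) = \frac{x \cdot n}{x \cdot n + \mathrm{esl}_x}

where x is the recall point, n the (estimated) total number of relevant components, and esl_x the expected number of non-relevant components seen before recall x is reached.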

26
Overlap problem
27
INEX 2003 metric
  • Ideal concept space (Wong & Yao 95)

28
INEX 2003 metric
  • Quantization
  • strict
  • generalised

29
INEX 2003 metric
  • Ignoring overlap

30
INEX 2003 metric
  • Considering overlap

31
INEX 2003 metric
  • Penalises overlap by scoring only the novel
    information in overlapping results
  • Assumes a uniform distribution of relevant
    information within an element
  • Issue of stability
  • Size is considered directly in precision (is it
    intuitive that large is good or not?)
  • Recall is defined using exhaustivity only
  • Precision is defined using specificity only

32
Alternative metrics
  • User-effort oriented measures
  • Expected Relevant Ratio
  • Tolerance to Irrelevance
  • Discounted Cumulated Gain

33
Lessons learnt
  • Good definition of relevance
  • Expressing CAS queries was not easy
  • Relevance assessment process must be improved
  • Further development on metrics needed
  • User studies required

34
Conclusion
  • XML retrieval is not just about the effective
    retrieval of XML documents, but also about how to
    evaluate effectiveness
  • INEX 2004 tracks
  • Relevance feedback
  • Interactive
  • Heterogeneous collection
  • Natural language query

http://inex.is.informatik.uni-duisburg.de:2004/
35
INEX: Evaluating content-oriented XML retrieval
  • Mounia Lalmas
  • Queen Mary University of London
  • http://qmir.dcs.qmul.ac.uk