INEX: Evaluating content-oriented XML retrieval

Provided by: mou57


1
INEX: Evaluating content-oriented XML retrieval

Mounia Lalmas
Queen Mary University of London
http://qmir.dcs.qmul.ac.uk

2
Outline

Content-oriented XML retrieval
Evaluating XML retrieval: INEX

3
XML Retrieval

Traditional IR is about finding documents relevant
to a user's information need, e.g. an entire book.
XML retrieval allows users to retrieve document
components that are more focussed on their
information needs, e.g. a chapter of a book
instead of an entire book.
The structure of documents is exploited to
identify which document components to retrieve.

4
Structured Documents

Linear order of words, sentences, paragraphs
Hierarchy or logical structure of a book: chapters, sections
Links (hyperlink), cross-references, citations
Temporal and spatial relationships in multimedia
documents

5
Structured Documents

Explicit structure formalised through
document representation standards (mark-up
languages)
Layout: LaTeX (publishing), HTML (Web publishing)
Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)
Content/Semantic: RDF, DAML+OIL, OWL (semantic web)

<b><font size=2>SDR</font></b><img src="qmir.jpg" border=0>
<section> <subsection> <paragraph> </paragraph> <paragraph> </paragraph> </subsection> </section>
<Book rdf:about=book> <rdf:author ../> <rdf:title/> </Book>
6
XML: eXtensible Mark-up Language

Meta-language (user-defined tags) currently being
adopted as the document format language by W3C
Used to describe content and structure (and not
layout)
Grammar described in a DTD (used for validation)

<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into XML retrieval </title>
    <paragraph> ... </paragraph>
  </chapter>
</lecture>

<!ELEMENT lecture (title, author, chapter)>
<!ELEMENT author (fnm, snm)>
<!ELEMENT fnm (#PCDATA)>
7
XML: eXtensible Mark-up Language

Use of XPath notation to refer to the XML
structure

chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: any direct second paragraph of any chapter
chapter/*: all direct sub-components of a chapter
<lecture>
  <title> Structured Document Retrieval </title>
  <author> <fnm> Smith </fnm> <snm> John </snm> </author>
  <chapter>
    <title> Introduction into SDR </title>
    <paragraph> ... </paragraph>
  </chapter>
</lecture>
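
The XPath expressions on this slide can be tried out with a minimal sketch like the one below. It uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset, so paths are evaluated relative to the lecture root and //title is written .//title. The two paragraphs and the nested section are hypothetical additions to the slide's example, included only so that every expression returns something distinct.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the slide's <lecture> example; the paragraph
# texts and the nested <section> are made up for illustration.
DOC = """
<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
    <section><title>Nested section</title></section>
  </chapter>
</lecture>
"""

root = ET.fromstring(DOC)  # root is the <lecture> element


def texts(path):
    """Return the text of every element matched by path, in document order."""
    return [el.text for el in root.findall(path)]


direct_titles = texts("chapter/title")               # direct sub-component
any_titles = texts(".//title")                       # any title below the root
nested_titles = texts("chapter//title")              # direct or indirect
second_paragraphs = texts("chapter/paragraph[2]")    # second paragraph child
chapter_children = [el.tag for el in root.findall("chapter/*")]

print(direct_titles)
print(any_titles)
print(nested_titles)
print(second_paragraphs)
print(chapter_children)
```

The difference between chapter/title and chapter//title shows up only because of the nested section: the first returns one title, the second returns two.
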
8
Querying XML documents

Content-only (CO) queries
'open standards for digital video in distance
learning'
Content-and-structure (CAS) queries
//article[about(., 'formal methods verify correctness aviation systems')]/body//section[about(., 'case study application model checking theorem proving')]
Structure-only (SA) queries
/article//section/paragraph[2]
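
As a rough illustration of how a CAS query combines structure and content, the sketch below splits a query into (path, terms) steps. It assumes the bracketed form path[about(., '...')] for each step. The regular expression and the parse_cas helper are hypothetical and cover only this simple pattern, not the full query language used at INEX.

```python
import re

# One step = a structural path followed by a single about(., '...') filter.
# This pattern is an assumption made for illustration only.
STEP = re.compile(r"(?P<path>[^\[\]]+)\[about\(\.,\s*'(?P<terms>[^']*)'\)\]")


def parse_cas(query):
    """Split a CAS query into (path, list of content terms) pairs."""
    return [(m.group("path"), m.group("terms").split())
            for m in STEP.finditer(query)]


query = (
    "//article[about(., 'formal methods verify correctness aviation systems')]"
    "/body//section[about(., 'case study application model checking theorem proving')]"
)
steps = parse_cas(query)

for path, terms in steps:
    print(path, terms)
```

The last step's path (/body//section) names the target element to return, while the earlier step constrains the article that contains it.
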

9
Content-oriented XML retrieval

Return document components of varying
granularity (e.g. a book, a chapter, a section, a
paragraph, a table, a figure, etc.), relevant to
the user's information need with regard to both
content and structure.

10
Content-oriented XML retrieval

Retrieve the best components according to
content and structure criteria
INEX: most specific component that satisfies the
query, while being exhaustive to the query
Shakespeare study: best entry points, which are
components from which many relevant components
can be reached through browsing
???

11
Challenges

[Figure: an Article tree with Title, Section 1 and Section 2 nodes; the query terms XML, retrieval and authoring are annotated with weights between 0.2 and 0.9]

no fixed retrieval unit, nested elements, element types
how to obtain document and collection statistics?
which component is a good retrieval unit?
which components contribute best to content of
Article?
how to estimate?
how to aggregate?

12
Approaches
vector space model
Bayesian network
fusion
collection statistics
language model
cognitive model
smoothing
proximity search
tuning
belief model
Boolean model
relevance feedback
phrase
parameter estimation
probabilistic model
logistic regression
component statistics
ontology
term statistics
natural language processing
extending DB model
13
Vector space model
article index
abstract index
section index
merge
sub-section index
paragraph index
tf and idf as for fixed and non-nested retrieval
units
(IBM Haifa, INEX 2003)
14
Language model
element language model, collection language model, smoothing parameter λ
element score
high value of λ leads to increase in size of retrieved elements
element size, element score, article score
rank element
query expansion with blind feedback; ignore short elements (20-term cutoff)
results with λ = 0.9, 0.5 and 0.2 similar
(University of Amsterdam, INEX 2003)
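
The slide's formulas did not survive in this transcript. As a hedged illustration of the general idea, the sketch below scores an element by mixing an element language model with a collection language model through a smoothing parameter λ (plain linear interpolation). This is a generic textbook formulation, not necessarily the exact scoring used by the University of Amsterdam; the function name, the direction of the λ weighting and the toy data are all assumptions.

```python
import math
from collections import Counter


def lm_score(query_terms, element_terms, collection_terms, lam=0.5):
    """Log-likelihood of the query under a smoothed element language model.

    Assumed form: P(t|e) = lam * P_ml(t|element) + (1 - lam) * P_ml(t|collection)
    """
    elem = Counter(element_terms)
    coll = Counter(collection_terms)
    elem_len = sum(elem.values())
    coll_len = sum(coll.values())

    score = 0.0
    for term in query_terms:
        p_elem = elem[term] / elem_len if elem_len else 0.0
        p_coll = coll[term] / coll_len if coll_len else 0.0
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            # Term unseen in both the element and the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data, made up for illustration.
collection = ["xml", "retrieval", "authoring", "xml"]
section_a = ["xml", "retrieval"]
section_b = ["authoring"]
query = ["xml", "retrieval"]

score_a = lm_score(query, section_a, collection)
score_b = lm_score(query, section_b, collection)
print(score_a, score_b)
```

Because every element, at every nesting level, gets its own language model, the same function can rank a paragraph, a section and a whole article against each other.
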
15
Evaluation of XML retrieval: INEX

Evaluating the effectiveness of content-oriented
XML retrieval approaches
Collaborative effort: participants contribute to
the development of the collection
queries
relevance assessments
Methodology similar to that of TREC, but adapted to
XML retrieval
40 participants worldwide
Workshop in Schloss Dagstuhl in December (20
institutions)

16
INEX Test Collection

Documents (500MB), which consist of 12,107
articles in XML format from the IEEE Computer
Society: 8 million elements
INEX 2002
30 CO and 30 CAS queries
inex2002 metric
INEX 2003
36 CO and 30 CAS queries
CAS queries are defined according to an enhanced
subset of XPath
inex2002 and inex2003 metrics
INEX 2004 is just starting

17
Tasks

CO: the aim is to decrease user effort by pointing
the user to the most specific relevant portions
of documents.
SCAS: retrieve relevant nodes that match the
structure specified in the query.
VCAS: retrieve relevant nodes that may not be the
same as the target elements, but are structurally
similar.

18
Relevance in XML

An element is relevant if it has significant and
demonstrable bearing on the matter at hand
Common assumptions in IR
Objectivity
Topicality
Binary nature
Independence

[Figure: document tree with an article, sections 1 to 3, and paragraphs 1 and 2]
19
Relevance in INEX
all sections relevant → article very relevant
all sections relevant → article better than sections
one section relevant → article less relevant
one section relevant → section better than article
[Figure: an article and its sections]

Exhaustivity
how exhaustively a document component discusses the query: 0, 1, 2, 3
Specificity
how focused the component is on the query: 0, 1, 2, 3
Relevance
(3,3), (2,3), (1,1), (0,0), ...

20
Relevance assessment task

Completeness
Element → parent element, children elements
Consistency
Parent of a relevant element must also be
relevant, although to a different extent
Exhaustivity increases going up the tree
Specificity decreases going up the tree
Use of an online interface
Assessing a query takes a week!
On average, 2 topics per participant
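
The exhaustivity part of the consistency rule can be sketched as a simple check over parent/child pairs: a parent should be at least as exhaustive as each of its children, which also means the parent of a relevant element is itself relevant. The data layout (dictionaries keyed by element path) and the function name are hypothetical; this is not the check implemented in the INEX assessment interface.

```python
def inconsistent_pairs(parent_of, exhaustivity):
    """Return (child, parent) pairs where the parent is less exhaustive.

    parent_of:    maps an element path to its parent's path
    exhaustivity: maps an element path to its assessed level (0..3);
                  unassessed elements are treated as 0
    """
    bad = []
    for child, parent in parent_of.items():
        if exhaustivity.get(parent, 0) < exhaustivity.get(child, 0):
            bad.append((child, parent))
    return sorted(bad)


# Toy assessments, made up for illustration.
parent_of = {
    "article/section[1]": "article",
    "article/section[1]/paragraph[2]": "article/section[1]",
}
exhaustivity = {
    "article": 2,
    "article/section[1]": 3,               # more exhaustive than its parent
    "article/section[1]/paragraph[2]": 1,
}

violations = inconsistent_pairs(parent_of, exhaustivity)
print(violations)
```
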

[Figure: document tree with an article, sections 1 to 3, and paragraphs 1 and 2]
21
Interface
[Screenshot of the online assessment interface: current assessments, groups, navigation]
22
Assessments

With respect to the elements to assess
26% of assessments on elements in the pool (66%
in INEX 2002).
68% highly specific elements not in the pool
7% elements automatically assessed
INEX 2002
23 inconsistent assessments per query for one
rule

23
Metrics

Need to consider
Two dimensions of relevance
Independence assumption does not hold
No predefined retrieval unit
Overlap
Linear vs. clustered ranking

24
INEX 2002 metric

Quantization
strict
generalized

25
INEX 2002 metric

Precision as defined by Raghavan et al. (1989), based on
expected search length (ESL)
where n is estimated
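
The slide's formula is missing from this transcript. A hedged sketch of an ESL-based precision, in the form usually given for this kind of measure, is shown below: at recall level x, with n relevant elements in total, precision is x·n / (x·n + esl), where esl is the expected number of non-relevant elements seen before x·n relevant ones are found. The function and parameter names are assumptions, and the exact INEX 2002 definition should be checked against the official material.

```python
def expected_search_length(j, s, i, r):
    """Expected number of non-relevant elements inspected.

    j: non-relevant elements in the ranks before the final rank
    s: relevant elements still needed from the final (tied) rank
    i: non-relevant elements in the final rank
    r: relevant elements in the final rank
    """
    return j + s * i / (r + 1)


def precision_at_recall(x, n, esl):
    """Precision at recall level x (0 < x <= 1) given n relevant elements."""
    if not 0 < x <= 1:
        raise ValueError("recall level x must be in (0, 1]")
    wanted = x * n  # number of relevant elements the user wants to see
    return wanted / (wanted + esl)


# Toy numbers, made up for illustration.
esl = expected_search_length(j=2, s=1, i=3, r=2)
precision = precision_at_recall(x=0.5, n=10, esl=esl)
print(esl, precision)
```
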

26
Overlap problem
27
INEX 2003 metric

Ideal concept space (Wong & Yao 95)

[Figure: concept space with component c and topic t]
28
INEX 2003 metric

Quantization
strict
generalised
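
The quantisation formulas are missing from this transcript. The sketch below shows how an (exhaustivity, specificity) pair can be mapped to a single relevance value. The strict function credits only (3,3). The generalised table uses the values usually quoted for INEX 2003, reproduced from memory, so it should be treated as an assumption and checked against the official definition; the function names are hypothetical.

```python
def quant_strict(exh, spec):
    """Strict quantisation: only highly exhaustive and highly specific counts."""
    return 1.0 if (exh, spec) == (3, 3) else 0.0


# Assumed generalised values (to be verified against the INEX 2003 definition).
GENERALISED = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
    (0, 0): 0.0,
}


def quant_generalised(exh, spec):
    """Generalised quantisation: partial credit for partially relevant elements."""
    try:
        return GENERALISED[(exh, spec)]
    except KeyError:
        raise ValueError(f"not a valid assessment: ({exh}, {spec})") from None


for pair in [(3, 3), (2, 3), (1, 1), (0, 0)]:
    print(pair, quant_strict(*pair), quant_generalised(*pair))
```
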

29
INEX 2003 metric

Ignoring overlap

30
INEX 2003 metric

Considering overlap

31
INEX 2003 metric

Penalises overlap by only scoring novel
information in overlapping results
Assume uniform distribution of relevant
information
Issue of stability
Size considered directly in precision (is it
intuitive that large is good or not?)
Recall defined using exhaustivity only
Precision defined using specificity only

32
Alternative metrics

User-effort oriented measures
Expected Relevant Ratio
Tolerance to Irrelevance
Discounted Cumulated Gain
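
As an illustration of the last measure in the list, here is a minimal sketch of discounted cumulated gain in its standard form: gains are summed down the ranking, and from rank b onwards each gain is divided by log_b(rank). The gain values are hypothetical; how graded XML assessments would be turned into gains is left open here.

```python
import math


def dcg(gains, base=2):
    """Return the discounted cumulated gain at every rank.

    gains: graded relevance of the results, in ranked order
    base:  logarithm base b; ranks below b are not discounted
    """
    total = 0.0
    curve = []
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain
        else:
            total += gain / math.log(rank, base)
        curve.append(total)
    return curve


# Hypothetical graded gains for a ranked list of five elements.
curve = dcg([3, 2, 3, 0, 1])
print(curve)
```

A larger base delays the discount, modelling a more patient user who is willing to look further down the ranking.
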

33
Lessons learnt

Good definition of relevance
Expressing CAS queries was not easy
Relevance assessment process must be improved
Further development on metrics needed
User studies required

34
Conclusion

XML retrieval is not just about the effective
retrieval of XML documents, but also about how to
evaluate effectiveness
INEX 2004 tracks
Relevance feedback
Interactive
Heterogeneous collection
Natural language query

http://inex.is.informatik.uni-duisburg.de:2004/
35
INEX: Evaluating content-oriented XML retrieval