INEX 2003

1
INEX 2003
  • Evaluation metrics working group

Djoerd Hiemstra, Jaap Kamps, Gabriella Kazai,
Yosi Mass, Vojkan Mihajlovic, Paul Ogilvie,
Jovan Pehcevski, Arjen de Vries, Huyen-Trang Vu
Dagstuhl, Germany, 15-17 December 2003
2
What we did
  • First session: overview of current and proposed
    metrics
  • INEX02
  • INEX03 (ng)
  • Expected Relevant Ratio
  • Tolerance to Irrelevance
  • Second session: discussion → results reported
    here

3
General Concerns
  • Systems could not be tuned → need to publish
    metrics prior to evaluation
  • Issues in understanding the metrics
  • Installation issues
  • Stability of metrics questioned
  • Consistency of assessments due to cognitive load
    (the more complex, the less reliable?)

4
Tasks to evaluate
  • CO: aim is to decrease user effort by pointing
    the user to the most specific relevant portions
    of documents.
  • SCAS: retrieve relevant nodes that match the
    structure specified in the query.
  • VCAS: retrieve relevant nodes that may not be the
    same as the target elements, but are structurally
    similar.

5
Metrics need to consider
  • Two dimensions of relevance
  • Independence assumption does not hold
  • No predefined retrieval unit
  • Varying user effort per result
  • Overlap
  • Related result components
  • Linear vs. clustered ranking

6
CO
  • What to evaluate?
  • Retrieve relevant (to any degree) XML fragments
    within the XML documents
  • Influence of the strict INEX 2002 metric:
    retrieve only (3,3) elements
  • Question regarding the cost of assessment if it
    is not used in evaluation

7
Agreement
  • Graded assessments are valuable!
  • Not just for near misses!
  • Systems should find all levels of relevant info,
    but should rank highly relevant first
  • Systems that do well on retrieving (3,3) may not
    do well on recall-oriented retrieval
  • This is currently not evaluated sufficiently (due
    to problems with generalised metrics)
  • Future metrics can make use of the rich data
    even if we don't yet know how (Birger)

8
Suggestions
  • Need additional strict quantization functions
    (INEX 2002 metric)
  • Specificity-oriented metrics, e.g.

9
Suggestions
  • Exhaustivity-oriented metrics, e.g.
  • P@5, P@10, P@20, ...
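As a rough illustration of such exhaustivity-oriented cutoff measures, the sketch below computes P@k over a ranked list of (exhaustivity, specificity) assessment pairs, giving credit by exhaustivity only; the e/3 quantization and the toy run are assumptions made for illustration, not the working group's agreed definition.

  from typing import List, Tuple

  def quant_exhaustivity(e: int, s: int) -> float:
      # Exhaustivity-oriented quantization: credit grows with e (0..3),
      # specificity is ignored (illustrative choice, not an agreed one).
      return e / 3.0

  def precision_at_k(ranked: List[Tuple[int, int]], k: int) -> float:
      # ranked: (exhaustivity, specificity) pairs in rank order.
      return sum(quant_exhaustivity(e, s) for e, s in ranked[:k]) / k

  run = [(3, 1), (2, 3), (0, 0), (3, 3), (1, 2)]
  print(precision_at_k(run, 5))   # exhaustivity-oriented P@5 for this toy run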

10
Suggestions
  • We also need the generalised metrics BUT there
    are unsolved issues due to overlap of result
    elements!

11
Overlap problem
12
Suggestions
  • Remove overlapping results from submissions
  • Penalise overlapping result (agreed)
  • Only score first hit on same relevant reference
    component (varying recall-base)
  • Worry
  • Article-only baseline hard to beat (Solution
    go to TREC if more than 3 articles in run ? )
  • Stability of evaluation score (INEX02):
    Run A: r1 = article (3,1), r2 = sec (3,3) → P = 0.75
    Run B: r1 = sec (3,3), r2 = article (3,1) → P = 1.0

...but the right ordering is the best (Paul)
13
INEX 2003 metric (ng)
  • Penalises overlap by only scoring novel
    information in overlapping results
  • Problem: assumes uniform distribution of relevant
    information
  • Suggestion: use the assessment data
  • Issue of stability? Say an article has 10
    sections, each with 10 words:
    Run A: r1 = article (3,1), r2 = sec (3,3)
      Po = 0.3 + 0 = 0.3
      Ro = (1 + 0) / 2 = 0.5
    Run B: r1 = sec (3,3), r2 = article (3,1)
      Po = 1·0.1 + 0.3·0.9 = 0.37
      Ro = (1·0.1 + 1·0.9) / 2 = 0.5
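A small sketch of the size-weighted, overlap-aware precision behind these numbers (the word counts and the quantized values, 1.0 for (3,3) and 0.3 for (3,1), are taken from the example above; this is not the official inex_eval_ng code): only words not already returned earlier in the ranking contribute, each weighted by the component's quantized value.

  def size_weighted_precision(ranked, total_words=100):
      # ranked: list of (name, word_set, quantized_value) in rank order.
      # Only words not already seen earlier in the ranking contribute
      # (this is how the ng metric penalises overlap), each weighted by
      # the component's quantized value.
      seen, gain = set(), 0.0
      for name, words, value in ranked:
          novel = words - seen
          gain += value * len(novel)
          seen |= words
      return gain / total_words

  article = set(range(100))    # 10 sections x 10 words
  section = set(range(10))     # one fully specific section
  run_a = [("article", article, 0.3), ("sec", section, 1.0)]
  run_b = [("sec", section, 1.0), ("article", article, 0.3)]
  print(size_weighted_precision(run_a))   # 0.3  (section adds no novel words)
  print(size_weighted_precision(run_b))   # 0.37 (10 words at 1.0 + 90 words at 0.3)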

14
INEX 2003 metric (ng)
  • Other issues
  • Size considered directly in precision (is it
    intuitive that large is good or not?)
  • Recall defined using exh only
  • Precision defined using spec only:
    Run A: r1 (1,3), r2 (1,3), r3 (1,3)
    Run B: r1 (3,3), r2 (3,3), r3 (3,3)
    Both get 100% precision (but recall is OK, so MAP
    works) → P@DCV cannot be used

15
Alternative metrics
  • User-effort oriented measures
  • Expected Relevant Ratio (Benjamin)
  • Tolerance to Irrelevance (Arjen, Gabriella,
    Mounia)
  • Discounted Cumulated Gain
  • Component length (Huyen-Trang Vu)
  • Assessments-based (Gabriella)

16
Which metric for what task
  • CO
  • INEX 2002
  • INEX 2003 (ng)
  • Expected Relevant Ratio
  • Tolerance to Irrelevance
  • SCAS
  • INEX 2002
  • Expected Relevant Ratio
  • Tolerance to Irrelevance
  • VCAS
  • ?? (extend CO metrics: partial score based on
    structural similarity using distance measures)
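One way such a partial score could look, purely as a sketch (the path-based similarity and the scaling are assumptions, not anything agreed by the group): discount an element's relevance value by a distance between the retrieved element's tag path and the target path.

  import difflib

  def structural_similarity(retrieved_path: str, target_path: str) -> float:
      # Crude tag-path similarity in [0, 1] based on matching path steps;
      # a real VCAS metric would need a properly justified distance measure.
      a = retrieved_path.strip("/").split("/")
      b = target_path.strip("/").split("/")
      return difflib.SequenceMatcher(None, a, b).ratio()

  def vcas_score(relevance: float, retrieved_path: str, target_path: str) -> float:
      # Partial credit: relevance value scaled by structural similarity.
      return relevance * structural_similarity(retrieved_path, target_path)

  print(vcas_score(1.0, "/article/bdy/sec", "/article/bdy/sec/p"))   # ≈ 0.86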

17
Appendix
18
Overview of metrics
  • INEX02
  • INEX03 (ng)
  • Expected Relevant Ratio
  • Tolerance to Irrelevance

19
INEX 2002 metric 1
  • Quantization
  • strict
  • generalized
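The quantization functions themselves appear as formulas on the original slide; the sketch below restates them in code. The strict function is as commonly defined; the generalised value table follows the usual INEX 2002 formulation and should be read as an assumption here, since the slide's formula is not reproduced in this transcript.

  def quant_strict(e: int, s: int) -> float:
      # Strict quantization: only fully exhaustive and fully
      # specific components, i.e. (3,3), count as relevant.
      return 1.0 if (e, s) == (3, 3) else 0.0

  def quant_generalised(e: int, s: int) -> float:
      # Generalised quantization; values as commonly cited for the
      # INEX 2002 metric (assumed here, not confirmed by the slide).
      if (e, s) == (3, 3):
          return 1.0
      if (e, s) in {(2, 3), (3, 2), (3, 1)}:
          return 0.75
      if (e, s) in {(1, 3), (2, 2), (2, 1)}:
          return 0.5
      if (e, s) in {(1, 2), (1, 1)}:
          return 0.25
      return 0.0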

20
INEX 2002 metric 2
  • Precision as defined by Raghavan89 (based on
    ESL)
  • where n is estimated
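This precision is usually written as "precall": at recall level x, P(x) = x·n / (x·n + esl_x), where n is the (estimated) total number of relevant components and esl_x is the expected search length, i.e. the expected number of non-relevant components inspected before recall x is reached. A minimal sketch, assuming those symbol meanings:

  def precall(x: float, n: float, esl_x: float) -> float:
      # Precision at recall level x in the Raghavan (ESL-based) sense:
      # n     = (estimated) total number of relevant components,
      # esl_x = expected number of non-relevant components inspected
      #         before recall level x is reached.
      return (x * n) / (x * n + esl_x)

  print(precall(0.5, 10, 3))   # 5 relevant needed, 3 non-relevant expected -> 0.625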

21
INEX 2003 metric (ng) 1
  • Ideal concept space (Wong & Yao '95)

22
INEX 2003 metric (ng) 2
  • Quantization
  • strict
  • generalised

23
INEX 2003 metric (ng) 3
  • Ignoring overlap

24
INEX 2003 metric (ng) 4
  • Considering overlap

25
Expected Relevant Ratio
  • Considers probability of relevance and the user's
    browsing behaviour
  • (For further details see ERR slides)

26
Tolerance to Irrelevance
  • Based on time to view
  • Count hits based on the user's tolerance to
    irrelevance
  • User model instantiated in the metrics of:
  • Precision at DCV
  • ESL
  • Raghavan's P(rel | retr)
  • (For further details see T2I slides)
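A rough sketch of the kind of user model this implies (the tolerance parameter, the per-result effort in words, and the stopping rule are all illustrative assumptions): scan the ranked list, accumulate the amount of irrelevant text inspected, and stop once it exceeds the user's tolerance; the hits counted up to that point could then feed P@DCV- or ESL-style measures.

  from typing import List, Tuple

  def hits_within_tolerance(ranked: List[Tuple[bool, int]], tolerance: int) -> int:
      # ranked: (is_relevant, words_to_read) per result, in rank order.
      # The simulated user stops once the irrelevant text read exceeds
      # the tolerance (in words); returns relevant hits seen until then.
      irrelevant_read, hits = 0, 0
      for is_relevant, words in ranked:
          if is_relevant:
              hits += 1
          else:
              irrelevant_read += words
              if irrelevant_read > tolerance:
                  break
      return hits

  run = [(False, 40), (True, 120), (False, 80), (True, 30)]
  print(hits_within_tolerance(run, 100))   # stops after 120 irrelevant words -> 1 hit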

27
References
  • INEX02: in the INEX 2002 proceedings
  • INEX03 (ng): as a technical report
  • ERR: in preparation
  • T2I: submitted for publication
  • CG-based (Gabriella): submitted for publication