Title: INEX 2003
1. INEX 2003
- Evaluation metrics working group
Djoerd Hiemstra, Jaap Kamps, Gabriella Kazai,
Yosi Mass, Vojkan Mihajlovic, Paul Ogilvie,
Jovan Pehcevski, Arjen de Vries, Huyen-Trang Vu
Dagstuhl, Germany, 15-17 December 2003
2. What we did
- First session: overview of current and proposed metrics
  - INEX 2002
  - INEX 2003 (ng)
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- Second session: discussion -> results reported here
3. General Concerns
- Systems could not be tuned -> need to publish the metrics prior to evaluation
- Issues in understanding the metrics
- Installation issues
- Stability of the metrics questioned
- Consistency of assessments due to cognitive load (more complex -> less reliable?)
4. Tasks to evaluate
- CO: aim is to decrease user effort by pointing the user to the most specific relevant portions of documents.
- SCAS: retrieve relevant nodes that match the structure specified in the query.
- VCAS: retrieve relevant nodes that may not be the same as the target elements, but are structurally similar.
5. Metrics need to consider
- Two dimensions of relevance
- Independence assumption does not hold
- No predefined retrieval unit
- Varying user effort per result
- Overlap
- Related result components
- Linear vs. clustered ranking
6. CO
- What to evaluate?
  - Retrieve relevant (to any degree) XML fragments within the XML documents
  - Influence of the strict INEX 2002 metric: retrieve only (3,3) elements
- Question regarding the cost of assessment if it is not used in evaluation
7. Agreement
- Graded assessments are valuable!
  - Not just for near misses!
- Systems should find all levels of relevant information, but should rank highly relevant first
- Systems that do well on retrieving (3,3) elements may not do well on recall-oriented retrieval
- This is currently not evaluated sufficiently (due to problems with the generalised metrics)
- Future metrics can make use of the rich data even if we don't yet know how (Birger)
8. Suggestions
- Need additional strict quantization functions (INEX 2002 metric)
- Specificity-oriented metrics, e.g.
9. Suggestions
- Exhaustivity-oriented metrics, e.g.
  - P@5, P@10, P@20, ...
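The precision-at-cutoff measures above can be sketched as follows. This is an illustrative sketch only: the element names, the `strict_exhaustive` quantization, and the idea of reducing a graded (exhaustivity, specificity) assessment to a binary judgement via a quantization function are assumptions for the example, not the working group's exact definitions.

```python
def precision_at_k(run, assessments, k, quantize):
    """Precision at cutoff k: fraction of the top-k results whose
    (exhaustivity, specificity) assessment quantizes to a non-zero value."""
    top = run[:k]
    hits = sum(1 for elem in top if quantize(assessments.get(elem, (0, 0))) > 0)
    return hits / k

def strict_exhaustive(assessment):
    """Hypothetical exhaustivity-oriented quantization: an element counts
    as relevant iff it is highly exhaustive (e == 3), ignoring specificity."""
    e, s = assessment
    return 1 if e == 3 else 0

# Unassessed elements default to (0, 0), i.e. not relevant.
assessments = {"article[1]": (3, 1), "article[1]/sec[2]": (3, 3)}
run = ["article[1]", "article[1]/sec[2]", "article[1]/sec[3]"]
print(precision_at_k(run, assessments, 3, strict_exhaustive))
```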
10. Suggestions
- We also need the generalised metrics, BUT there are unsolved issues due to overlap of result elements!
11. Overlap problem
12. Suggestions
- Remove overlapping results from submissions
- Penalise overlapping results (agreed)
- Only score the first hit on the same relevant reference component (varying recall-base)
- Worries:
  - Article-only baseline hard to beat (Solution: go to TREC if more than 3 articles in a run?)
  - Stability of the evaluation score (INEX 2002):

        Run A                  Run B
        r1: article (3,1)      r1: sec (3,3)
        r2: sec (3,3)          r2: article (3,1)
        P = 0.75               P = 1.0

    ...but the right ordering scores best (Paul)
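The stability worry can be illustrated with plain binary average precision under strict quantization. This is a deliberate simplification: the slide's P values come from the INEX 2002 ESL-based precision, so the exact numbers differ here, but the effect is the same, i.e. ranking the (3,3) section above the (3,1) article scores higher.

```python
def average_precision(run, relevant):
    """Plain binary average precision: mean of the precision values at
    the ranks where a relevant element is retrieved."""
    hits, total = 0, 0.0
    for rank, elem in enumerate(run, start=1):
        if elem in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Strict quantization: only the (3,3) section counts as relevant.
relevant = {"sec"}
print(average_precision(["article", "sec"], relevant))  # 0.5
print(average_precision(["sec", "article"], relevant))  # 1.0
```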
13. INEX 2003 metric (ng)
- Penalises overlap by only scoring novel information in overlapping results
- Problem: assumes a uniform distribution of relevant information
- Suggestion: use the assessments data
- Issue of stability? Say an article has 10 sections, each with 10 words:

        Run A                      Run B
        r1: article (3,1)          r1: sec (3,3)
        r2: sec (3,3)              r2: article (3,1)
        Po = 0.3 + 0 = 0.3         Po = 1*0.1 + 0.3*0.9 = 0.37
        Ro = (1 + 0)/2 = 0.5       Ro = (1*0.1 + 1*0.9)/2 = 0.5
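A minimal sketch reproducing the arithmetic of the stability example, under the assumption that the ng metric credits only novel words of each retrieved element, weighting them by a specificity-based quantization for precision (1.0 for s=3, 0.3 for s=1 are assumed values) and an exhaustivity-based one for recall, normalised by the number of relevant components:

```python
def ng_precision_recall(run, quant_spec, quant_exh, total_words, n_relevant):
    """run: list of (name, set-of-word-ids). Only words not already seen
    (novel words) are scored; returns (Po, Ro)."""
    seen = set()
    po = ro = 0.0
    for name, words in run:
        novel = words - seen
        frac = len(novel) / total_words
        po += quant_spec[name] * frac
        ro += quant_exh[name] * frac
        seen |= words
    return po, ro / n_relevant

article = set(range(100))          # 10 sections x 10 words
sec = set(range(10))               # one fully relevant section
qs = {"article": 0.3, "sec": 1.0}  # spec quantization: s=1 -> 0.3, s=3 -> 1.0 (assumed)
qe = {"article": 1.0, "sec": 1.0}  # exh quantization: e=3 -> 1.0 (assumed)

po, ro = ng_precision_recall([("article", article), ("sec", sec)], qs, qe, 100, 2)
print(round(po, 2), round(ro, 2))  # 0.3 0.5
po, ro = ng_precision_recall([("sec", sec), ("article", article)], qs, qe, 100, 2)
print(round(po, 2), round(ro, 2))  # 0.37 0.5
```

As on the slide, the second section-first ordering earns slightly higher precision for the same recall, because the article then contributes only its 90 novel words.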
14. INEX 2003 metric (ng)
- Other issues:
  - Size considered directly in precision (is it intuitive that large is good, or not?)
  - Recall defined using exhaustivity only
  - Precision defined using specificity only:

        Run A          Run B
        r1: (1,3)      r1: (3,3)
        r2: (1,3)      r2: (3,3)
        r3: (1,3)      r3: (3,3)

    Both runs get 100% precision (but recall is OK, so MAP works) -> P@DCV cannot be used
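A sketch of why specificity-only precision cannot separate the two runs above; the element names and the "s == 3 counts as a hit" threshold are illustrative assumptions:

```python
def spec_only_precision(run, assessments):
    """Precision that looks at specificity only: any retrieved element
    with top specificity (s == 3) counts as a hit."""
    return sum(1 for e in run if assessments[e][1] == 3) / len(run)

# Run A retrieves barely exhaustive elements, Run B highly exhaustive ones,
# yet both score identically because every element has s == 3.
a = {"r1": (1, 3), "r2": (1, 3), "r3": (1, 3)}
b = {"r1": (3, 3), "r2": (3, 3), "r3": (3, 3)}
print(spec_only_precision(["r1", "r2", "r3"], a))  # 1.0
print(spec_only_precision(["r1", "r2", "r3"], b))  # 1.0
```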
15. Alternative metrics
- User-effort oriented measures:
  - Expected Relevant Ratio (Benjamin)
  - Tolerance to Irrelevance (Arjen, Gabriella, Mounia)
  - Discounted Cumulated Gain
- Component length (Huyen-Trang Vu)
- Assessments-based (Gabriella)
16. Which metric for what task
- CO
  - INEX 2002
  - INEX 2003 (ng)
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- SCAS
  - INEX 2002
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- VCAS
  - ?? (extend CO metrics: partial score based on structural similarity using distance measures)
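One way the "partial score via structural distance" idea for VCAS could be realised, purely as an illustration rather than a working-group decision, is a normalised edit distance over tag paths:

```python
def path_similarity(path_a, path_b):
    """Similarity of two element paths (e.g. 'article/sec/p') in [0, 1],
    computed as 1 minus the normalised edit distance over tag sequences."""
    a, b = path_a.split("/"), path_b.split("/")
    # Classic dynamic-programming edit distance over the tag sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1 - dp[-1][-1] / max(len(a), len(b))

# A retrieved paragraph inside the target section gets partial credit.
print(round(path_similarity("article/sec/p", "article/sec"), 2))  # 0.67
```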
17. Appendix
18. Overview of metrics
- INEX 2002
- INEX 2003 (ng)
- Expected Relevant Ratio
- Tolerance to Irrelevance
19. INEX 2002 metric (1)
- Quantization:
  - strict
  - generalised
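The two quantization functions can be sketched as follows. The strict rule (full credit only for (3,3) elements) matches the INEX 2002 definition; the partial-credit values in the generalised table below are illustrative placeholders, not the official INEX 2002 table.

```python
def quantize_strict(e, s):
    """Strict: only fully exhaustive AND fully specific elements count."""
    return 1.0 if (e, s) == (3, 3) else 0.0

def quantize_generalised(e, s):
    """Generalised: partial credit for partially relevant elements.
    The values below are assumed for illustration only."""
    credit = {(3, 3): 1.0, (2, 3): 0.75, (3, 2): 0.75,
              (1, 3): 0.5, (2, 2): 0.5, (3, 1): 0.5,
              (1, 2): 0.25, (2, 1): 0.25, (1, 1): 0.25}
    return credit.get((e, s), 0.0)

print(quantize_strict(3, 1))       # 0.0
print(quantize_generalised(3, 1))  # 0.5
```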
20. INEX 2002 metric (2)
- Precision as defined by Raghavan et al. (1989), based on ESL
- where n is estimated
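A simplified sketch of ESL-based precision, ignoring Raghavan et al.'s handling of ties within ranks (an assumption made here for brevity): the expected search length for the j-th relevant item is the number of non-relevant items inspected before reaching it, giving precision j / (j + esl_j).

```python
def precision_at_j(run, relevant, j):
    """ESL-style precision at the recall point of the j-th relevant item:
    j / (j + number of non-relevant items seen before it)."""
    seen_rel = nonrel = 0
    for elem in run:
        if elem in relevant:
            seen_rel += 1
            if seen_rel == j:
                return j / (j + nonrel)
        else:
            nonrel += 1
    return 0.0  # fewer than j relevant items retrieved

# Second relevant item ("b") is found after one non-relevant item ("x").
print(precision_at_j(["a", "x", "b", "y", "c"], {"a", "b", "c"}, 2))
```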
21. INEX 2003 metric (ng) (1)
- Ideal concept space (Wong &amp; Yao, 1995)
22. INEX 2003 metric (ng) (2)
- Quantization:
  - strict
  - generalised
23. INEX 2003 metric (ng) (3)
24. INEX 2003 metric (ng) (4)
25. Expected Relevant Ratio
- Considers the probability of relevance and the user's browsing behaviour
- (For further details see the ERR slides)
26. Tolerance to Irrelevance
- Based on time to view
- Count hits based on the user's tolerance to irrelevance
- User model instantiated in metrics of:
  - Precision at DCV
  - ESL
  - Raghavan's P(rel|retr)
- (For further details see the T2I slides)
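The tolerance-to-irrelevance idea can be sketched with a simulated user who abandons the ranking after a fixed number of irrelevant results; this particular stopping rule is an illustrative assumption, not the published T2I definition:

```python
def t2i_hits(run, relevant, tolerance):
    """Scan the ranking top-down; stop after `tolerance` irrelevant
    results and return the number of relevant results seen by then."""
    hits = misses = 0
    for elem in run:
        if elem in relevant:
            hits += 1
        else:
            misses += 1
            if misses >= tolerance:
                break  # user's tolerance to irrelevance is exhausted
    return hits

# The user gives up at the second miss ("y"), never reaching "c".
print(t2i_hits(["a", "x", "b", "y", "z", "c"], {"a", "b", "c"}, 2))  # 2
```

A more tolerant user (higher `tolerance`) reads deeper and credits more of the relevant material, so the same run scores differently under different user models.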
27. References
- INEX 2002: in the INEX 2002 proceedings
- INEX 2003: as a technical report
- ERR: in preparation
- T2I: submitted for publication
- CG-based (Gabriella): submitted for publication