Title: INEX 2003
1. INEX 2003
- Evaluation metrics working group
Djoerd Hiemstra, Jaap Kamps, Gabriella Kazai,
Yosi Mass, Vojkan Mihajlovic, Paul Ogilvie,
Jovan Pehcevski, Arjen de Vries, Huyen-Trang Vu
Dagstuhl, Germany, 15-17 December 2003
2. What we did
- First session: overview of current and proposed metrics
  - INEX 2002
  - INEX 2003 (ng)
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- Second session: discussion -> results reported here
3. General Concerns
- Systems could not be tuned -> need to publish the metrics prior to evaluation
- Issues in understanding the metrics
- Installation issues
- Stability of the metrics questioned
- Consistency of assessments due to cognitive load (more complex -> less reliable?)
4. Tasks to evaluate
- CO: aim is to decrease user effort by pointing the user to the most specific relevant portions of documents.
- SCAS: retrieve relevant nodes that match the structure specified in the query.
- VCAS: retrieve relevant nodes that may not be the same as the target elements, but are structurally similar.
5. Metrics need to consider
- Two dimensions of relevance
- Independence assumption does not hold
- No predefined retrieval unit
- Varying user effort per result
- Overlap
- Related result components
- Linear vs. clustered ranking
6. CO
- What to evaluate?
  - Retrieve relevant (to any degree) XML fragments within the XML documents
  - Influence of the strict INEX 2002 metric: retrieve only (3,3) elements
- Question regarding the cost of assessment if it is not used in evaluation
7. Agreement
- Graded assessments are valuable!
  - Not just for near misses!
- Systems should find all levels of relevant information, but should rank highly relevant first
- Systems that do well on retrieving (3,3) elements may not do well on recall-oriented retrieval
- This is currently not evaluated sufficiently (due to problems with the generalised metrics)
- Future metrics can make use of the rich data even if we don't yet know how (Birger)
8. Suggestions
- Need additional strict quantization functions (INEX 2002 metric)
- Specificity-oriented metrics, e.g.
9. Suggestions
- Exhaustivity-oriented metrics, e.g.
  - P@5, P@10, P@20, ...
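The precision-at-cutoff measures above can be sketched as follows. This is an illustrative sketch only: the element names, the `strict_exhaustive` quantization, and the idea of reducing a graded (exhaustivity, specificity) assessment to a binary judgement via a quantization function are assumptions for the example, not the working group's exact definitions.

```python
def precision_at_k(run, assessments, k, quantize):
    """Precision at cutoff k: fraction of the top-k results whose
    (exhaustivity, specificity) assessment quantizes to a non-zero value."""
    top = run[:k]
    hits = sum(1 for elem in top if quantize(assessments.get(elem, (0, 0))) > 0)
    return hits / k

def strict_exhaustive(assessment):
    """Hypothetical exhaustivity-oriented quantization: an element counts
    as relevant iff it is highly exhaustive (e == 3), ignoring specificity."""
    e, s = assessment
    return 1 if e == 3 else 0

# Unassessed elements default to (0, 0), i.e. not relevant.
assessments = {"article[1]": (3, 1), "article[1]/sec[2]": (3, 3)}
run = ["article[1]", "article[1]/sec[2]", "article[1]/sec[3]"]
print(precision_at_k(run, assessments, 3, strict_exhaustive))
```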
10. Suggestions
- We also need the generalised metrics, BUT there are unsolved issues due to overlap of result elements!
11. Overlap problem
12. Suggestions
- Remove overlapping results from submissions
- Penalise overlapping results (agreed)
- Only score the first hit on the same relevant reference component (varying recall-base)
- Worries:
  - Article-only baseline hard to beat (Solution: go to TREC if more than 3 articles in a run?)
  - Stability of the evaluation score (INEX 2002):

        Run A                  Run B
        r1: article (3,1)      r1: sec (3,3)
        r2: sec (3,3)          r2: article (3,1)
        P = 0.75               P = 1.0

    ...but the right ordering scores best (Paul)
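The stability worry can be illustrated with plain binary average precision under strict quantization. This is a deliberate simplification: the slide's P values come from the INEX 2002 ESL-based precision, so the exact numbers differ here, but the effect is the same, i.e. ranking the (3,3) section above the (3,1) article scores higher.

```python
def average_precision(run, relevant):
    """Plain binary average precision: mean of the precision values at
    the ranks where a relevant element is retrieved."""
    hits, total = 0, 0.0
    for rank, elem in enumerate(run, start=1):
        if elem in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Strict quantization: only the (3,3) section counts as relevant.
relevant = {"sec"}
print(average_precision(["article", "sec"], relevant))  # 0.5
print(average_precision(["sec", "article"], relevant))  # 1.0
```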
13. INEX 2003 metric (ng)
- Penalises overlap by only scoring novel information in overlapping results
- Problem: assumes a uniform distribution of relevant information
- Suggestion: use the assessments data
- Issue of stability? Say an article has 10 sections, each with 10 words:

        Run A                      Run B
        r1: article (3,1)          r1: sec (3,3)
        r2: sec (3,3)              r2: article (3,1)
        Po = 0.3 + 0 = 0.3         Po = 1*0.1 + 0.3*0.9 = 0.37
        Ro = (1 + 0)/2 = 0.5       Ro = (1*0.1 + 1*0.9)/2 = 0.5
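A minimal sketch reproducing the arithmetic of the stability example, under the assumption that the ng metric credits only novel words of each retrieved element, weighting them by a specificity-based quantization for precision (1.0 for s=3, 0.3 for s=1 are assumed values) and an exhaustivity-based one for recall, normalised by the number of relevant components:

```python
def ng_precision_recall(run, quant_spec, quant_exh, total_words, n_relevant):
    """run: list of (name, set-of-word-ids). Only words not already seen
    (novel words) are scored; returns (Po, Ro)."""
    seen = set()
    po = ro = 0.0
    for name, words in run:
        novel = words - seen
        frac = len(novel) / total_words
        po += quant_spec[name] * frac
        ro += quant_exh[name] * frac
        seen |= words
    return po, ro / n_relevant

article = set(range(100))          # 10 sections x 10 words
sec = set(range(10))               # one fully relevant section
qs = {"article": 0.3, "sec": 1.0}  # spec quantization: s=1 -> 0.3, s=3 -> 1.0 (assumed)
qe = {"article": 1.0, "sec": 1.0}  # exh quantization: e=3 -> 1.0 (assumed)

po, ro = ng_precision_recall([("article", article), ("sec", sec)], qs, qe, 100, 2)
print(round(po, 2), round(ro, 2))  # 0.3 0.5
po, ro = ng_precision_recall([("sec", sec), ("article", article)], qs, qe, 100, 2)
print(round(po, 2), round(ro, 2))  # 0.37 0.5
```

As on the slide, the second section-first ordering earns slightly higher precision for the same recall, because the article then contributes only its 90 novel words.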
14. INEX 2003 metric (ng)
- Other issues:
  - Size considered directly in precision (is it intuitive that large is good, or not?)
  - Recall defined using exhaustivity only
  - Precision defined using specificity only:

        Run A          Run B
        r1: (1,3)      r1: (3,3)
        r2: (1,3)      r2: (3,3)
        r3: (1,3)      r3: (3,3)

    Both runs get 100% precision (but recall is OK, so MAP works) -> P@DCV cannot be used
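A sketch of why specificity-only precision cannot separate the two runs above; the element names and the "s == 3 counts as a hit" threshold are illustrative assumptions:

```python
def spec_only_precision(run, assessments):
    """Precision that looks at specificity only: any retrieved element
    with top specificity (s == 3) counts as a hit."""
    return sum(1 for e in run if assessments[e][1] == 3) / len(run)

# Run A retrieves barely exhaustive elements, Run B highly exhaustive ones,
# yet both score identically because every element has s == 3.
a = {"r1": (1, 3), "r2": (1, 3), "r3": (1, 3)}
b = {"r1": (3, 3), "r2": (3, 3), "r3": (3, 3)}
print(spec_only_precision(["r1", "r2", "r3"], a))  # 1.0
print(spec_only_precision(["r1", "r2", "r3"], b))  # 1.0
```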
15. Alternative metrics
- User-effort oriented measures:
  - Expected Relevant Ratio (Benjamin)
  - Tolerance to Irrelevance (Arjen, Gabriella, Mounia)
  - Discounted Cumulated Gain
- Component length (Huyen-Trang Vu)
- Assessments-based (Gabriella)
16. Which metric for what task
- CO
  - INEX 2002
  - INEX 2003 (ng)
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- SCAS
  - INEX 2002
  - Expected Relevant Ratio
  - Tolerance to Irrelevance
- VCAS
  - ?? (extend CO metrics: partial score based on structural similarity using distance measures)
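One way the "partial score via structural distance" idea for VCAS could be realised, purely as an illustration rather than a working-group decision, is a normalised edit distance over tag paths:

```python
def path_similarity(path_a, path_b):
    """Similarity of two element paths (e.g. 'article/sec/p') in [0, 1],
    computed as 1 minus the normalised edit distance over tag sequences."""
    a, b = path_a.split("/"), path_b.split("/")
    # Classic dynamic-programming edit distance over the tag sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1 - dp[-1][-1] / max(len(a), len(b))

# A retrieved paragraph inside the target section gets partial credit.
print(round(path_similarity("article/sec/p", "article/sec"), 2))  # 0.67
```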
17. Appendix
18. Overview of metrics
- INEX 2002
- INEX 2003 (ng)
- Expected Relevant Ratio
- Tolerance to Irrelevance
19. INEX 2002 metric (1)
- Quantization:
  - strict
  - generalised
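The two quantization functions can be sketched as follows. The strict rule (full credit only for (3,3) elements) matches the INEX 2002 definition; the partial-credit values in the generalised table below are illustrative placeholders, not the official INEX 2002 table.

```python
def quantize_strict(e, s):
    """Strict: only fully exhaustive AND fully specific elements count."""
    return 1.0 if (e, s) == (3, 3) else 0.0

def quantize_generalised(e, s):
    """Generalised: partial credit for partially relevant elements.
    The values below are assumed for illustration only."""
    credit = {(3, 3): 1.0, (2, 3): 0.75, (3, 2): 0.75,
              (1, 3): 0.5, (2, 2): 0.5, (3, 1): 0.5,
              (1, 2): 0.25, (2, 1): 0.25, (1, 1): 0.25}
    return credit.get((e, s), 0.0)

print(quantize_strict(3, 1))       # 0.0
print(quantize_generalised(3, 1))  # 0.5
```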
20. INEX 2002 metric (2)
- Precision as defined by Raghavan et al. (1989), based on ESL
- where n is estimated
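A simplified sketch of ESL-based precision, ignoring Raghavan et al.'s handling of ties within ranks (an assumption made here for brevity): the expected search length for the j-th relevant item is the number of non-relevant items inspected before reaching it, giving precision j / (j + esl_j).

```python
def precision_at_j(run, relevant, j):
    """ESL-style precision at the recall point of the j-th relevant item:
    j / (j + number of non-relevant items seen before it)."""
    seen_rel = nonrel = 0
    for elem in run:
        if elem in relevant:
            seen_rel += 1
            if seen_rel == j:
                return j / (j + nonrel)
        else:
            nonrel += 1
    return 0.0  # fewer than j relevant items retrieved

# Second relevant item ("b") is found after one non-relevant item ("x").
print(precision_at_j(["a", "x", "b", "y", "c"], {"a", "b", "c"}, 2))
```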
21. INEX 2003 metric (ng) (1)
- Ideal concept space (Wong &amp; Yao, 1995)
22. INEX 2003 metric (ng) (2)
- Quantization:
  - strict
  - generalised
23. INEX 2003 metric (ng) (3)
24. INEX 2003 metric (ng) (4)
25. Expected Relevant Ratio
- Considers the probability of relevance and the user's browsing behaviour
- (For further details see the ERR slides)
26. Tolerance to Irrelevance
- Based on time to view
- Count hits based on the user's tolerance to irrelevance
- User model instantiated in metrics of:
  - Precision at DCV
  - ESL
  - Raghavan's P(rel|retr)
- (For further details see the T2I slides)
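The tolerance-to-irrelevance idea can be sketched with a simulated user who abandons the ranking after a fixed number of irrelevant results; this particular stopping rule is an illustrative assumption, not the published T2I definition:

```python
def t2i_hits(run, relevant, tolerance):
    """Scan the ranking top-down; stop after `tolerance` irrelevant
    results and return the number of relevant results seen by then."""
    hits = misses = 0
    for elem in run:
        if elem in relevant:
            hits += 1
        else:
            misses += 1
            if misses >= tolerance:
                break  # user's tolerance to irrelevance is exhausted
    return hits

# The user gives up at the second miss ("y"), never reaching "c".
print(t2i_hits(["a", "x", "b", "y", "z", "c"], {"a", "b", "c"}, 2))  # 2
```

A more tolerant user (higher `tolerance`) reads deeper and credits more of the relevant material, so the same run scores differently under different user models.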
27. References
- INEX 2002: in the INEX 2002 proceedings
- INEX 2003: as a technical report
- ERR: in preparation
- T2I: submitted for publication
- CG-based (Gabriella): submitted for publication