Title: Information Retrieval Evaluation with Partial Relevance Judgment


1
Information Retrieval Evaluation with Partial
Relevance Judgment
  • Shengli Wu and Sally McClean
  • School of Computing and Mathematics University of
    Ulster

2
  • Motivation
  • Measures
  • Experimental results
  • Conclusions

3
Motivation
  • Partial relevance judgment has become the norm
    for major information retrieval evaluation events
    such as TREC, but its effect on system-oriented
    measures such as mean average precision is not
    well understood.
  • A new measure, average precision at all document
    levels, is evaluated.

4
Measures (MAP)
  • Mean average precision (MAP)
  • Ranking: 1(r) 2(i) 3(i) 4(r) 5(i)
  • map = (1/1 + 2/4)/5 = 0.3 (dividing by the total
    number of relevant documents for the query, here 5)
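A minimal Python sketch of this computation (not the authors' code; it assumes the denominator is the total number of relevant documents for the query, 5 in this example):

    # Average precision for one ranked list (sketch).
    # `ranking` marks each rank as relevant (True) or irrelevant (False);
    # `total_relevant` is the assumed number of relevant documents for the query.
    def average_precision(ranking, total_relevant):
        hits = 0
        precision_sum = 0.0
        for rank, is_relevant in enumerate(ranking, start=1):
            if is_relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / total_relevant

    # Slide example: 1(r) 2(i) 3(i) 4(r) 5(i)
    print(average_precision([True, False, False, True, False], 5))  # 0.3

MAP is then the mean of this value over all queries.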

5
Measures (AP_all)
  • Average precision over all document levels
  • Ranking: 1(r) 2(i) 3(i) 4(r) 5(i)
  • ap_all = (1/1 + 1/2 + 1/3 + 2/4 + 2/5)/5 ≈ 0.55
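The same ranking under this measure, as I read the formula above (a sketch; precision is accumulated at every rank, not only at the relevant ones):

    # Average precision over all document levels (sketch).
    def ap_all(ranking):
        hits = 0
        precision_sum = 0.0
        for rank, is_relevant in enumerate(ranking, start=1):
            if is_relevant:
                hits += 1
            precision_sum += hits / rank  # precision contributes at every rank
        return precision_sum / len(ranking)

    print(ap_all([True, False, False, True, False]))  # approx. 0.547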

6
Precision at the 10 document level
  • P(10) counts how many relevant documents appear
    among the top 10 documents.
  • Ranking: 1(r) 2(i) 3(i) 4(r) 5(i) 6(i) 7(i) 8(i)
    9(r) 10(r)
  • P(10) = 4/10 = 0.4
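P(10) can be computed directly from the top of the ranking (a sketch with the slide's example):

    # Precision at document level k (sketch).
    def precision_at(ranking, k=10):
        return sum(ranking[:k]) / k

    # Slide example: relevant at ranks 1, 4, 9, and 10.
    top10 = [True, False, False, True, False, False, False, False, True, True]
    print(precision_at(top10))  # 0.4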

7
Experimental Setting
  • 9 groups of data (TREC 5-8: ad hoc track; TREC 9,
    2001, and 2002: Web track; TREC 2003 and 2004:
    robust track) for binary relevance judgment, 699
    queries in total.
  • 3 groups of data (TREC 9, TREC 2001, and the
    second half of TREC 2003) for 3-category
    relevance judgment, 150 queries in total.

8
Error rates of the measures
  • Suppose we have two results A and B such that
    A's average performance is better than B's
    average performance by over 5% over all k
    queries. Then we consider these k queries one by
    one. If A is better than B by over 5% for m
    queries, and B is better than A by over 5% for n
    queries (m + n ≤ k), then the error rate is
    n/(m + n).
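A sketch of this error-rate computation from per-query scores of two systems A and B (the names are illustrative, and the 5% threshold is treated here as an absolute difference in the measure's value, which is my reading of the slide):

    # Error rate for one pair of results (sketch).
    def error_rate(scores_a, scores_b, threshold=0.05):
        # m: queries where A beats B by more than the threshold;
        # n: queries where B beats A by more than the threshold.
        m = sum(1 for a, b in zip(scores_a, scores_b) if a - b > threshold)
        n = sum(1 for a, b in zip(scores_a, scores_b) if b - a > threshold)
        return n / (m + n) if m + n > 0 else 0.0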

9
Error rates of using three measures

10
How different are these measures?
  • For the same group of results, we may obtain
    different values if we use different measures to
    evaluate them.
    One question is: how different are these measures?

11
Kendall's tau coefficient among rankings
generated using different measures
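One standard way to compute this coefficient between two system rankings is SciPy's kendalltau (a sketch with illustrative numbers, not the authors' data):

    from scipy.stats import kendalltau

    # Ranks of the same set of systems under two different measures.
    ranks_by_map = [1, 2, 3, 4, 5]
    ranks_by_ap_all = [2, 1, 3, 5, 4]
    tau, p_value = kendalltau(ranks_by_map, ranks_by_ap_all)
    print(tau)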
12
Some drawbacks of MAP
  • Under partial relevance judgment, the MAP value
    of one result is affected by the other results
    involved.
  • The fewer results involved, the larger the MAP
    value of any given result.

13
Correlation of rankings using full sets of
results and rankings using partial sets of results
14
Performance difference of the same result in
different environments
15
Conclusions
  • The values of all four measures are exaggerated
    when incomplete relevance judgment, such as that
    produced by the pooling method, is applied.
  • From these experimental results, we conclude that
    MAP is the most sensitive but least reliable
    measure, NAP and NDCG are the most reliable but
    least sensitive measures, and R-precision lies in
    between.