1
Using Question Series to Evaluate QA System
Effectiveness
  • Ellen Voorhees

2
TREC 2004 Series Task
  • Test set of questions divided into different
    series
  • each series is about a specified target, and the
    goal is to gather info about, or define, the
    target
  • a series contains factoid and list questions,
    plus a final other question
  • individual questions are tagged as to type

3
Question Series
  • Total of 65 series
  • minimum questions per series: 4, maximum: 10
  • targets: 23 PERSON, 25 ORGANIZATION, 17 THING
4
Question Series
  • Series are abstractions of user sessions
  • targets taken from web logs
  • assessor created the questions and looked for
    answers in the corpus, also recording other
    information
  • NIST heavily edited the results to form the test
    set
  • Limitations
  • unlike QACIAD in NTCIR-4, the series are not true
    dialogs
  • not an accurate model of the information the
    assessors were most interested in

5
Document Set
  • AQUAINT documents
  • articles from NY Times newswire (1998-2000), AP
    newswire (1998-2000), and Xinhua News Agency
    (1996-2000)
  • approximately 3 GB of text
  • approximately 1,033,000 articles
  • Document(s) from this set required to support
    responses

6
Factoid Component
  • Same as previous years
  • response is a [docid, answer-string] pair or NIL
  • human assessors judge each pair as exactly one of
    wrong, unsupported, inexact, or right
  • NIL is right if there is no known answer in the
    collection, otherwise wrong
  • Factoid scoring (a sketch follows below)
  • component score is accuracy, the percentage of
    questions whose response was judged right
  • 230 total factoid questions, 22 with no known
    correct answer in the collection
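
A minimal sketch of the factoid scoring rule, assuming per-question judgment labels as plain strings (the data layout is illustrative, not the track's actual format):

    def factoid_accuracy(judgments):
        # judgments: one label per factoid question, each exactly
        # one of "wrong", "unsupported", "inexact", or "right";
        # a NIL response is judged "right" only when the question
        # has no known answer in the collection
        return sum(j == "right" for j in judgments) / len(judgments)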

7
List Component
  • Systems return a set of [docid, string] pairs
  • questions seek multiple instances of a type
  • target number of instances is not given
  • systems are to return the complete set of answers
  • multiple answers per document and multiple
    documents with answers are possible
  • List scoring (a sketch follows below)
  • single assessor per question
  • created the list of known, correct answers from
    the submitted results
  • combine instance recall and precision using F
  • F = (2 × P × R) / (P + R)
  • component score is average F over the 55 list
    questions
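
A sketch of per-question list scoring under these rules, assuming answers are compared as normalized strings (real assessment is a human judgment, not string matching):

    def list_f(system_answers, known_answers):
        # system_answers / known_answers: sets of distinct answer
        # instances for one list question
        correct = len(system_answers & known_answers)
        if correct == 0:
            return 0.0
        p = correct / len(system_answers)   # instance precision
        r = correct / len(known_answers)    # instance recall
        return (2 * p * r) / (p + r)

The component score is then the mean of list_f over the 55 list questions.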

8
Other Component
  • System response is an unordered set of
    [docid, string] pairs
  • each string is interpreted as a facet in the
    definition of the target
  • answers from previous questions in the series are
    to be excluded from this response

9
Other Component
  • Other scoring (a sketch follows below)
  • assessor determines the set of information nuggets
    a good response should contain
  • assessor marks which nuggets appear in the system
    response
  • using the assessor judgments, compute nugget
    recall plus an approximation to nugget precision
    (a function of response length)
  • score for an other question is F(β = 3), which
    gives recall 3 times the weight of precision
  • other component score is average F(β = 3) over the
    64 questions
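
A sketch of the other-question score. The recall term follows the slide; the precision term shown is the TREC-style length allowance (100 non-whitespace characters per matched nugget), which is an assumption here since the slide gives no constant:

    def other_f(num_nuggets, num_matched, response_len, beta=3.0):
        # nugget recall: fraction of the assessor's nuggets that
        # appear somewhere in the system response
        recall = num_matched / num_nuggets if num_nuggets else 0.0
        # precision approximation: responses within the length
        # allowance get precision 1.0 (allowance constant assumed)
        allowance = 100 * num_matched
        if response_len <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_len - allowance) / response_len
        b2 = beta * beta                    # beta = 3 favors recall
        denom = b2 * precision + recall
        return (b2 + 1) * precision * recall / denom if denom else 0.0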

10
Overall Combined Score Results
Final combined scores for the best run from each of
the top 10 groups
11
Aggregating by Question Type
  • Facilitates question-type analysis
  • e.g., what is the relative performance of systems
    on list questions? on other questions?
  • But the evaluation doesn't match the user task
  • the series structure is completely ignored

12
Aggregating by Series
  • Treat the series as the unit of evaluation
  • score individual questions as before
  • combine the question-type scores as for the
    overall score, but use only the series' own
    questions (a sketch follows below)
  • if a series has no list question, use
    SeriesScore = (2/3) × Factoid + (1/3) × Other
  • compute the average over all series
  • Better match for the user task
  • each series is given equal weight in the final
    score
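
A sketch of the per-series aggregation. The 1/2, 1/4, 1/4 weighting mirrors the track's overall combined score and is assumed here; the 2/3, 1/3 fallback for series without a list question is from the slide:

    def series_score(factoid, other, list_score=None):
        # one score per series, combining its question-type scores
        if list_score is None:              # no list question
            return (2 / 3) * factoid + (1 / 3) * other
        return 0.5 * factoid + 0.25 * list_score + 0.25 * other

    def run_score(per_series):
        # per_series: (factoid, other, list_score) tuples; each
        # series gets equal weight in the final average
        return sum(series_score(*s) for s in per_series) / len(per_series)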

13
Run Scores when Aggregated by Question Type vs.
by Series
Kendall τ = 0.971 (for all runs when ranked by
score using the two aggregation methods)
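
The comparison can be reproduced with an off-the-shelf Kendall's τ, as in this sketch with hypothetical run scores (the real value, 0.971, comes from the 63 submitted runs):

    from scipy.stats import kendalltau

    # hypothetical final scores for the same five runs under
    # the two aggregation methods
    by_type = [0.61, 0.55, 0.48, 0.40, 0.33]
    by_series = [0.60, 0.56, 0.47, 0.41, 0.32]

    tau, _ = kendalltau(by_type, by_series)
    print(tau)  # 1.0 here, since the two rankings agree exactly
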
14
Does it Matter?
  • Yes!
  • Despite apparently inconsequential differences in
    scores, aggregating by series is a much better
    evaluation
  • the series is small enough to be meaningful at
    the task level: it represents a single user's
    interaction
  • the series is large enough for an individual
    series score to be meaningful

15
Distribution of Series Scores over Runs
16
Distribution of Question Scores
  • Individual question distributions are badly
    skewed toward 0
  • questions for which the median score = 0:
  • factoids: 212/230 (92.2%)
  • lists: 39/55 (70.9%)
  • others: 41/64 (64.1%)
  • With heavily skewed distributions,
    meta-evaluation is not meaningful

17
Sensitivity Analysis
  • Empirically determine the difference in scores
    needed to be confident one run is better than
    another
  • methodology originally created for document
    retrieval where topic score distributions are not
    heavily skewed
  • methodology uses run pairs to establish the
    relationship among the number of series in the
    test set, the size of the difference in final
    scores, and the likelihood that the comparison is
    stable (a rough sketch follows below)
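
A rough sketch of the run-pair methodology under stated assumptions (subset sizes, trial count, and the swap criterion are illustrative; the original procedure is Voorhees's document-retrieval stability analysis):

    import random

    def swap_rate(scores_a, scores_b, subset_size, trials=10000,
                  min_diff=0.05):
        # scores_a / scores_b: per-series scores of two runs on
        # the same series (requires len >= 2 * subset_size); each
        # trial draws two disjoint random subsets and counts a
        # "swap" when an ordering by mean score with margin
        # min_diff on one subset is reversed on the other
        idx = list(range(len(scores_a)))
        compared = swaps = 0
        for _ in range(trials):
            random.shuffle(idx)
            s1 = idx[:subset_size]
            s2 = idx[subset_size:2 * subset_size]
            d1 = sum(scores_a[i] - scores_b[i] for i in s1) / subset_size
            d2 = sum(scores_a[i] - scores_b[i] for i in s2) / subset_size
            if abs(d1) >= min_diff:
                compared += 1
                if d1 * d2 < 0:
                    swaps += 1
        return swaps / compared if compared else 0.0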

18
Sensitivity of Per-Series Score
d = 0.05
19
Are TREC Runs Different?
  • Can the per-series score differentiate among runs
    seen in practice?
  • with 63 runs submitted to the track, 1953 pairwise
    comparisons can be made (63 × 62 / 2)
  • 1380 (70.7%) of the pairs have a difference > 0.05
  • many in the remaining 30% really are equivalent

20
Per-Series Score Differences
21
Conclusion
  • Series task added to recent QA evaluations
  • reasonable abstraction of actual user task
  • broad scope for participation
  • Series-based evaluation also has nice properties
  • the series is an appropriate level of aggregation
  • series-based average scores are sufficiently
    stable and sensitive