Title: Using Question Series to Evaluate QA System Effectiveness
1. Using Question Series to Evaluate QA System Effectiveness
2. TREC 2004 Series Task
- Test set of questions divided into different series
  - each series is about a specified target, and the goal is to gather information about, or define, the target
  - a series contains factoid and list questions, plus a final "other" question
  - individual questions are tagged as to type
3. Question Series
- Total of 65 series
  - minimum of 4 and maximum of 10 questions per series
  - 23 PERSON targets, 25 ORGANIZATION targets, 17 THING targets
4. Question Series
- Series are abstractions of user sessions
  - targets taken from web logs
  - an assessor created the questions and looked for answers in the corpus, also recording other information
  - NIST heavily edited the results to form the test set
- Limitations
  - unlike QACIAD in NTCIR-4, the series are not true dialogs
  - not an accurate model of the information the assessors were most interested in
5. Document Set
- AQUAINT documents
  - articles from NY Times newswire (1998-2000), AP newswire (1998-2000), and Xinhua News Agency (1996-2000)
  - approximately 3 GB of text
  - approximately 1,033,000 articles
- Document(s) from this set are required to support responses
6. Factoid Component
- Same as previous years
  - response is a [docid, answer-string] pair or NIL
  - human assessors judge each pair as exactly one of wrong, unsupported, inexact, or right
  - NIL is right if the collection contains no known answer, otherwise wrong
- Factoid scoring (see the sketch below)
  - component score is accuracy: the percentage of questions whose response was judged right
  - 230 total factoid questions, 22 with no known correct answer in the collection
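A minimal sketch of the factoid accuracy computation described above, assuming per-question judgments are available; the data structures and the questions_without_answer set are illustrative placeholders, not the official TREC scoring code.

    # Sketch: factoid accuracy = fraction of factoid questions judged "right".
    # A NIL response counts as right only when the collection holds no known answer.
    def factoid_accuracy(responses, judgments, questions_without_answer):
        """responses: {qid: "NIL" or (docid, answer_string)}
        judgments: {qid: "right" | "inexact" | "unsupported" | "wrong"}
        questions_without_answer: qids with no known answer in the collection."""
        right = 0
        for qid, response in responses.items():
            if response == "NIL":
                right += qid in questions_without_answer
            else:
                right += judgments.get(qid) == "right"
        return right / len(responses)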
7. List Component
- Systems return a set of [docid, string] pairs
  - questions seek multiple instances of a type
  - target number of instances is not given
  - systems are to return the complete set of answers
  - multiple answers per document, and multiple documents with answers, are possible
- List scoring (see the sketch below)
  - single assessor per question
  - assessor created the list of known, correct answers from the pooled results
  - instance recall and precision are combined using F: F = (2 × P × R) / (P + R)
  - component score is the average F over the 55 list questions
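A sketch of the list F computation defined above; the argument names are illustrative, and answer-string judging/normalization is assumed to have already produced comparable instances.

    # Sketch: list score = balanced F of instance precision and instance recall.
    def list_f(submitted, known_correct):
        """submitted: answer instances returned for one question (after judging).
        known_correct: set of distinct correct instances for that question."""
        if not submitted or not known_correct:
            return 0.0
        correct_returned = {inst for inst in submitted if inst in known_correct}
        precision = len(correct_returned) / len(submitted)
        recall = len(correct_returned) / len(known_correct)
        if precision + recall == 0:
            return 0.0
        return (2 * precision * recall) / (precision + recall)

    # Component score: mean F over all list questions.
    def list_component_score(per_question_f):
        return sum(per_question_f) / len(per_question_f)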
8. Other Component
- System response is an unordered set of [docid, string] pairs
  - each string is interpreted as a facet in the definition of the target
  - answers to previous questions in the series are to be excluded from this response
9. Other Component
- Other scoring (see the sketch below)
  - assessor determines the set of information nuggets that a good response should contain
  - assessor marks which nuggets appear in the system response
  - using the assessor judgments, compute nugget recall plus an approximation to nugget precision (a function of response length)
  - score for an other question is F(β = 3), which gives recall 3 times the weight of precision
  - other component score is the average F(β = 3) over the 64 other questions
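A sketch of the F(β = 3) score for one other question. The length-based precision approximation shown here, with an allowance of 100 non-whitespace characters per matched nugget, follows the TREC definition-question evaluation; treat that constant and the simplified recall (over all required nuggets) as assumptions of this sketch.

    # Sketch: nugget recall is exact; nugget precision is approximated from
    # response length (assumed allowance: 100 non-whitespace chars per matched nugget).
    def other_score(matched_nuggets, required_nuggets, response_length, beta=3.0):
        """matched_nuggets: nuggets the assessor found in the response.
        required_nuggets: nuggets a good response should contain.
        response_length: total non-whitespace characters in the response."""
        recall = matched_nuggets / required_nuggets if required_nuggets else 0.0
        allowance = 100 * matched_nuggets
        if response_length <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length - allowance) / response_length
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta  # beta = 3 gives recall three times the weight of precision
        return (1 + b2) * precision * recall / (b2 * precision + recall)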
10. Overall Combined Score Results
[Figure: final combined scores for the best run per group, for the top 10 groups]
11. Aggregating by Question Type
- Facilitates question-type analysis
  - e.g., what is the relative performance of systems on list questions? on other questions?
- But the evaluation doesn't match the user task
  - the series structure is completely ignored
12. Aggregating by Series
- Treat the series as the unit of evaluation (see the sketch below)
  - score individual questions as before
  - combine the different question-type scores as for the overall score, but using only that series' questions
  - if the series has no list question, use SeriesScore = 2/3 × factoid + 1/3 × other
  - compute the average over all series
- Better match for the user task
  - each series is given equal weight in the final score
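A sketch of per-series scoring under the combination rule above. The three-component weighting of 1/2 factoid + 1/4 list + 1/4 other is assumed here from the TREC 2004 combined score; it renormalizes to the 2/3 and 1/3 weights stated for series without a list question.

    # Sketch: score one series with the overall-score weights, then give every
    # series equal weight in the run's final score.
    def series_score(factoid_acc, other_f, list_f=None):
        if list_f is None:  # series contains no list question
            return (2 / 3) * factoid_acc + (1 / 3) * other_f
        return 0.5 * factoid_acc + 0.25 * list_f + 0.25 * other_f  # assumed weights

    def per_series_run_score(series_scores):
        return sum(series_scores) / len(series_scores)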
13. Run Scores when Aggregated by Question Type vs. by Series
Kendall's τ = 0.971 (for all runs when ranked by score using the two aggregation methods; see the sketch below)
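A sketch of how such a rank correlation can be computed; the score dictionaries are placeholders for the per-run scores produced by each aggregation method, and SciPy's kendalltau is used for the correlation itself.

    # Sketch: compare the run rankings induced by the two aggregation methods.
    from scipy.stats import kendalltau

    def ranking_agreement(scores_by_type, scores_by_series):
        """Each argument: {run_id: final score} under one aggregation method."""
        runs = sorted(scores_by_type)  # fix a common run order
        tau, _ = kendalltau([scores_by_type[r] for r in runs],
                            [scores_by_series[r] for r in runs])
        return tau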
14. Does it Matter?
- Yes!
  - despite the apparently inconsequential differences in scores, aggregating by series is a much better evaluation
  - the series is small enough to be meaningful at the task level: it represents a single user's interaction
  - the series is large enough for an individual series score to be meaningful
15. Distribution of Series Scores over Runs
16. Distribution of Question Scores
- Individual question score distributions are badly skewed toward 0
- Questions whose median score is 0 (see the sketch below):
  - factoids: 212/230 (92.2%)
  - lists: 39/55 (70.9%)
  - others: 41/64 (64.1%)
- With such heavily skewed distributions, meta-evaluation is not meaningful
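A small sketch of the skewness check described above: compute each question's median score across all runs and report the fraction that is zero. The data layout is an assumption for illustration.

    # Sketch: fraction of questions whose median score across runs is 0.
    from statistics import median

    def fraction_median_zero(scores_by_question):
        """scores_by_question: {qid: [score of each run on that question]}"""
        zeros = sum(1 for scores in scores_by_question.values() if median(scores) == 0)
        return zeros / len(scores_by_question)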
17. Sensitivity Analysis
- Empirically determine the difference in scores needed to be confident that one run is better than another (see the sketch below)
  - methodology originally created for document retrieval, where topic score distributions are not heavily skewed
  - methodology uses run pairs to establish the relationship among the number of series in the test set, the size of the difference in final scores, and the likelihood that the comparison is stable
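One simple variant of this kind of stability analysis, sketched here as an illustration rather than the exact TREC methodology: for a pair of runs, repeatedly sample subsets of series and measure how often the sampled ordering agrees with the full-test-set ordering.

    # Sketch: stability of the ordering of two runs under resampled series sets.
    import random

    def ordering_stability(scores_a, scores_b, subset_size, trials=1000, seed=0):
        """scores_a, scores_b: per-series scores for two runs, in the same series order."""
        rng = random.Random(seed)
        n = len(scores_a)
        a_better_overall = sum(scores_a) > sum(scores_b)
        agree = 0
        for _ in range(trials):
            idx = rng.sample(range(n), subset_size)
            diff = sum(scores_a[i] - scores_b[i] for i in idx)
            agree += (diff > 0) == a_better_overall
        return agree / trials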
18. Sensitivity of Per-Series Score
[Figure: score difference d = 0.05]
19. Are TREC Runs Different?
- Can the per-series score differentiate among runs seen in practice? (see the sketch below)
  - with 63 runs submitted to the track, 1,953 pairwise comparisons can be made
  - 1,380 pairs (70.7%) have a score difference of at least 0.05
  - many of the remaining 30% really are equivalent
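A short sketch of the pairwise comparison count, using the 0.05 difference from the sensitivity analysis as the threshold; run_scores is an assumed mapping from run id to its per-series average score.

    # Sketch: count run pairs whose final scores differ by at least the threshold.
    from itertools import combinations

    def distinguishable_pairs(run_scores, threshold=0.05):
        pairs = list(combinations(run_scores.values(), 2))
        hits = sum(1 for a, b in pairs if abs(a - b) >= threshold)
        return hits, len(pairs)  # 63 runs yield 1953 pairs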
20. Per-Series Score Differences
21. Conclusion
- Series task added to recent QA evaluations
  - a reasonable abstraction of the actual user task
  - broad scope for participation
- Series-based evaluation also has nice evaluation properties
  - the series is an appropriate level of aggregation
  - series-based average scores are sufficiently stable and sensitive