Title: Using Question Series to Evaluate QA System Effectiveness
1. Using Question Series to Evaluate QA System Effectiveness
2. TREC 2004 Series Task
- Test set of questions divided into different series
  - each series is about a specified target, and the goal is to gather information about, or define, the target
  - a series contains factoid and list questions, plus a final "other" question
  - individual questions are tagged as to type
3. Question Series
- Total of 65 series
  - minimum of 4 and maximum of 10 questions per series
  - 23 PERSON targets, 25 ORGANIZATION targets, 17 THING targets
4. Question Series
- Series are abstractions of user sessions
  - targets taken from web logs
  - an assessor created the questions and looked for answers in the corpus, also recording other information
  - NIST heavily edited the results to form the test set
- Limitations
  - unlike QACIAD in NTCIR-4, the series are not true dialogs
  - not an accurate model of the information the assessors were most interested in
5. Document Set
- AQUAINT documents
  - articles from NY Times newswire (1998-2000), AP newswire (1998-2000), and Xinhua News Agency (1996-2000)
  - approximately 3 GB of text
  - approximately 1,033,000 articles
- Document(s) from this set are required to support responses
6. Factoid Component
- Same as previous years
  - response is a [docid, answer-string] pair or NIL
  - human assessors judge each pair as exactly one of wrong, unsupported, inexact, or right
  - NIL is right if the collection contains no known answer, otherwise wrong
- Factoid scoring (see the sketch below)
  - component score is accuracy: the percentage of questions whose response was judged right
  - 230 total factoid questions, 22 with no known correct answer in the collection
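A minimal sketch of the factoid accuracy computation described above, assuming per-question judgments are available; the data structures and the questions_without_answer set are illustrative placeholders, not the official TREC scoring code.

    # Sketch: factoid accuracy = fraction of factoid questions judged "right".
    # A NIL response counts as right only when the collection holds no known answer.
    def factoid_accuracy(responses, judgments, questions_without_answer):
        """responses: {qid: "NIL" or (docid, answer_string)}
        judgments: {qid: "right" | "inexact" | "unsupported" | "wrong"}
        questions_without_answer: qids with no known answer in the collection."""
        right = 0
        for qid, response in responses.items():
            if response == "NIL":
                right += qid in questions_without_answer
            else:
                right += judgments.get(qid) == "right"
        return right / len(responses)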
7. List Component
- Systems return a set of [docid, string] pairs
  - questions seek multiple instances of a type
  - target number of instances is not given
  - systems are to return the complete set of answers
  - multiple answers per document, and multiple documents with answers, are possible
- List scoring (see the sketch below)
  - single assessor per question
  - assessor created the list of known, correct answers from the pooled results
  - instance recall and precision are combined using F: F = (2 × P × R) / (P + R)
  - component score is the average F over the 55 list questions
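A sketch of the list F computation defined above; the argument names are illustrative, and answer-string judging/normalization is assumed to have already produced comparable instances.

    # Sketch: list score = balanced F of instance precision and instance recall.
    def list_f(submitted, known_correct):
        """submitted: answer instances returned for one question (after judging).
        known_correct: set of distinct correct instances for that question."""
        if not submitted or not known_correct:
            return 0.0
        correct_returned = {inst for inst in submitted if inst in known_correct}
        precision = len(correct_returned) / len(submitted)
        recall = len(correct_returned) / len(known_correct)
        if precision + recall == 0:
            return 0.0
        return (2 * precision * recall) / (precision + recall)

    # Component score: mean F over all list questions.
    def list_component_score(per_question_f):
        return sum(per_question_f) / len(per_question_f)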
8. Other Component
- System response is an unordered set of [docid, string] pairs
  - each string is interpreted as a facet in the definition of the target
  - answers to previous questions in the series are to be excluded from this response
9. Other Component
- Other scoring (see the sketch below)
  - assessor determines the set of information nuggets that a good response should contain
  - assessor marks which nuggets appear in the system response
  - using the assessor judgments, compute nugget recall plus an approximation to nugget precision (a function of response length)
  - score for an other question is F(β = 3), which gives recall 3 times the weight of precision
  - other component score is the average F(β = 3) over the 64 other questions
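A sketch of the F(β = 3) score for one other question. The length-based precision approximation shown here, with an allowance of 100 non-whitespace characters per matched nugget, follows the TREC definition-question evaluation; treat that constant and the simplified recall (over all required nuggets) as assumptions of this sketch.

    # Sketch: nugget recall is exact; nugget precision is approximated from
    # response length (assumed allowance: 100 non-whitespace chars per matched nugget).
    def other_score(matched_nuggets, required_nuggets, response_length, beta=3.0):
        """matched_nuggets: nuggets the assessor found in the response.
        required_nuggets: nuggets a good response should contain.
        response_length: total non-whitespace characters in the response."""
        recall = matched_nuggets / required_nuggets if required_nuggets else 0.0
        allowance = 100 * matched_nuggets
        if response_length <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length - allowance) / response_length
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta  # beta = 3 gives recall three times the weight of precision
        return (1 + b2) * precision * recall / (b2 * precision + recall)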
10. Overall Combined Score Results
[Figure: final combined scores for the best run per group, for the top 10 groups]
11. Aggregating by Question Type
- Facilitates question-type analysis
  - e.g., what is the relative performance of systems on list questions? on other questions?
- But the evaluation doesn't match the user task
  - the series structure is completely ignored
12. Aggregating by Series
- Treat the series as the unit of evaluation (see the sketch below)
  - score individual questions as before
  - combine the different question-type scores as for the overall score, but using only that series' questions
  - if the series has no list question, use SeriesScore = 2/3 × factoid + 1/3 × other
  - compute the average over all series
- Better match for the user task
  - each series is given equal weight in the final score
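A sketch of per-series scoring under the combination rule above. The three-component weighting of 1/2 factoid + 1/4 list + 1/4 other is assumed here from the TREC 2004 combined score; it renormalizes to the 2/3 and 1/3 weights stated for series without a list question.

    # Sketch: score one series with the overall-score weights, then give every
    # series equal weight in the run's final score.
    def series_score(factoid_acc, other_f, list_f=None):
        if list_f is None:  # series contains no list question
            return (2 / 3) * factoid_acc + (1 / 3) * other_f
        return 0.5 * factoid_acc + 0.25 * list_f + 0.25 * other_f  # assumed weights

    def per_series_run_score(series_scores):
        return sum(series_scores) / len(series_scores)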
13. Run Scores when Aggregated by Question Type vs. by Series
Kendall's τ = 0.971 (for all runs when ranked by score using the two aggregation methods; see the sketch below)
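A sketch of how such a rank correlation can be computed; the score dictionaries are placeholders for the per-run scores produced by each aggregation method, and SciPy's kendalltau is used for the correlation itself.

    # Sketch: compare the run rankings induced by the two aggregation methods.
    from scipy.stats import kendalltau

    def ranking_agreement(scores_by_type, scores_by_series):
        """Each argument: {run_id: final score} under one aggregation method."""
        runs = sorted(scores_by_type)  # fix a common run order
        tau, _ = kendalltau([scores_by_type[r] for r in runs],
                            [scores_by_series[r] for r in runs])
        return tau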
14. Does it Matter?
- Yes!
  - despite the apparently inconsequential differences in scores, aggregating by series is a much better evaluation
  - the series is small enough to be meaningful at the task level: it represents a single user's interaction
  - the series is large enough for an individual series score to be meaningful
15. Distribution of Series Scores over Runs
16. Distribution of Question Scores
- Individual question score distributions are badly skewed toward 0
- Questions whose median score is 0 (see the sketch below):
  - factoids: 212/230 (92.2%)
  - lists: 39/55 (70.9%)
  - others: 41/64 (64.1%)
- With such heavily skewed distributions, meta-evaluation is not meaningful
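A small sketch of the skewness check described above: compute each question's median score across all runs and report the fraction that is zero. The data layout is an assumption for illustration.

    # Sketch: fraction of questions whose median score across runs is 0.
    from statistics import median

    def fraction_median_zero(scores_by_question):
        """scores_by_question: {qid: [score of each run on that question]}"""
        zeros = sum(1 for scores in scores_by_question.values() if median(scores) == 0)
        return zeros / len(scores_by_question)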
17. Sensitivity Analysis
- Empirically determine the difference in scores needed to be confident that one run is better than another (see the sketch below)
  - methodology originally created for document retrieval, where topic score distributions are not heavily skewed
  - methodology uses run pairs to establish the relationship among the number of series in the test set, the size of the difference in final scores, and the likelihood that the comparison is stable
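One simple variant of this kind of stability analysis, sketched here as an illustration rather than the exact TREC methodology: for a pair of runs, repeatedly sample subsets of series and measure how often the sampled ordering agrees with the full-test-set ordering.

    # Sketch: stability of the ordering of two runs under resampled series sets.
    import random

    def ordering_stability(scores_a, scores_b, subset_size, trials=1000, seed=0):
        """scores_a, scores_b: per-series scores for two runs, in the same series order."""
        rng = random.Random(seed)
        n = len(scores_a)
        a_better_overall = sum(scores_a) > sum(scores_b)
        agree = 0
        for _ in range(trials):
            idx = rng.sample(range(n), subset_size)
            diff = sum(scores_a[i] - scores_b[i] for i in idx)
            agree += (diff > 0) == a_better_overall
        return agree / trials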
18. Sensitivity of Per-Series Score
[Figure: score difference d = 0.05]
19. Are TREC Runs Different?
- Can the per-series score differentiate among runs seen in practice? (see the sketch below)
  - with 63 runs submitted to the track, 1,953 pairwise comparisons can be made
  - 1,380 pairs (70.7%) have a score difference of at least 0.05
  - many of the remaining 30% really are equivalent
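A short sketch of the pairwise comparison count, using the 0.05 difference from the sensitivity analysis as the threshold; run_scores is an assumed mapping from run id to its per-series average score.

    # Sketch: count run pairs whose final scores differ by at least the threshold.
    from itertools import combinations

    def distinguishable_pairs(run_scores, threshold=0.05):
        pairs = list(combinations(run_scores.values(), 2))
        hits = sum(1 for a, b in pairs if abs(a - b) >= threshold)
        return hits, len(pairs)  # 63 runs yield 1953 pairs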
20. Per-Series Score Differences
21. Conclusion
- Series task added to recent QA evaluations
  - a reasonable abstraction of the actual user task
  - broad scope for participation
- Series-based evaluation also has nice evaluation properties
  - the series is an appropriate level of aggregation
  - series-based average scores are sufficiently stable and sensitive