Title: Evaluating Question Answering Systems
1 Evaluating Question Answering Systems
2 So, enough already!
- Evaluations in HLT proliferating
- DARPA speech evals
- MUC, ACE
- TREC, NTCIR, CLEF
- senseval, parseval, DUC, TDT...
- Let's stop evaluating and get some real work done!
3 Case for Community Evaluations
- Form/solidify a research community
- Establish the research methodology
- Facilitate technology transfer
- Document the state-of-the-art
- Amortize the costs of infrastructure
4 Of course, some downside
- Evaluations do take resources from other efforts
- money to defray evaluation costs
- researcher time
- minimize effect by keeping evaluation-only tasks such as result reporting simple
- Overfitting
- entire community trains to peculiarities of test set
- minimize effect by having multiple test sets, evolving the evaluation task
5 What is a Good Eval Task?
- Abstraction of real-world task so variables affecting performance can be controlled...
- ...but must capture salient aspects of real task or exercise is pointless
- Metrics must accurately predict relative effectiveness on those aspects
- Adequate level of difficulty
- Best if measures are diagnostic
6 TREC QA Track
- Goal
- encourage research into systems that return answers, rather than document lists
- Motivation
- bring benefits of large-scale evaluation to QA task
- provide a common problem for the IR and IE communities
- investigate appropriate evaluation methodologies for QA
7 Original QA Track Task
- Given
- set of fact-based, short-answer questions
- 3 GB of newspaper/newswire text
- Return
- ranked list of document, answer-string pairs
- strings limited to 50 or 250 bytes
- document must support answer
- Guidelines
- completely automatic processing
- answer guaranteed to exist in collection
- assume documents are factual
8 Sample Questions
- How much folic acid should an expectant mother get daily?
- Who invented the paper clip?
- What university was Woodrow Wilson president of?
- Where is Rider College located?
- Name a film in which Jude Law acted.
9 Evaluation
- Human assessors judge correctness of responses
- Score using mean reciprocal rank (see the sketch below)
- score for individual question is the reciprocal of the rank at which the first correct response is returned (0 if no correct response returned)
- score of a run is the mean over the set of questions
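As a concrete illustration, the sketch below computes mean reciprocal rank from per-question judgments; the data layout (one list of rank-ordered correct/incorrect flags per question) is an assumption for the example, not the track's actual file format.

def mean_reciprocal_rank(judged_runs):
    """judged_runs: one list per question, holding booleans in rank
    order (True where the response at that rank is correct)."""
    total = 0.0
    for judgments in judged_runs:
        reciprocal = 0.0
        for rank, correct in enumerate(judgments, start=1):
            if correct:                 # first correct response determines the score
                reciprocal = 1.0 / rank
                break
        total += reciprocal            # questions with no correct response add 0
    return total / len(judged_runs)

# e.g., correct at ranks 1 and 3, and never: (1 + 1/3 + 0) / 3 = 0.444
print(mean_reciprocal_rank([[True], [False, False, True], [False, False]]))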
10 Question Answering Techniques
- 250 bytes is enough for passage retrieval techniques; 50 bytes is not
- For the 50-byte limit
- classify question type
- look for close-by entities that match the entailed answer type
- fall back to passage retrieval on failure
11 Template Matching
- Determine question type by template matching (see the sketch below)
- Template matching is successful when questions are predictable
- Who vs. What <person-description>
- Occasional problems even when templates match
- Who was the first American in space?
- As for Wilson himself, he became a senator by defeating Jerry Brown, who has been called the first American in space.
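A minimal sketch of the template-matching idea; the templates and answer-type labels are invented for illustration, not those of any particular TREC system, which used far richer question taxonomies.

import re

# Illustrative templates only; labels and patterns are assumptions.
TEMPLATES = [
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^where\b", re.I), "LOCATION"),
    (re.compile(r"\bhow (much|many)\b", re.I), "QUANTITY"),
    (re.compile(r"^when\b|\bwhat year\b", re.I), "DATE"),
    (re.compile(r"^what\b|^which\b", re.I), "THING"),
]

def classify_question(question: str) -> str:
    """Return the entailed answer type, or UNKNOWN to trigger the
    passage-retrieval fallback."""
    for pattern, answer_type in TEMPLATES:
        if pattern.search(question):
            return answer_type
    return "UNKNOWN"

print(classify_question("Who invented the paper clip?"))     # PERSON
print(classify_question("Where is Rider College located?"))  # LOCATION

As the Jerry Brown example shows, a matching template only constrains the answer type; it does not guarantee the nearby entity of that type is the right one.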
12 Question Answering Techniques
- TREC-8 approaches generally retained
- better question classification
- wider variety of methods for finding entailed answer types
- frequent use of WordNet
- High-quality document search still helpful
- TREC 2001 saw onset of methods that substitute massive amounts of data for sophisticated processing
13 TREC 2001 Main Task Results
Scores for the best run of the top 8 groups using strict evaluation
14 Source of Questions
- TREC-8: developed for the track
- as unambiguous as possible
- tended to be back-formulations
- TREC-9: ideas mined from Excite log
- no reference to docs during creation
- TREC 2001: literal questions from logs
- logs donated by AskJeeves and Microsoft
- Had major impact on task: real questions much harder because more ambiguous
15 Evaluation Methodology
- Different philosophies
- IR: the user is the sole judge of a satisfactory response
- human assessors judge responses
- flexible interpretation of correct response
- final scores comparative, not absolute
- IE: there exists the answer
- answer keys developed by application expert
- requires enumeration of all acceptable responses at outset
- subsequent scoring trivial; final scores absolute
16 QA Evaluation Methodology
- NIST assessors judge answer strings
- binary judgment of correct/incorrect
- document provides context for answer
- Each question independently judged by 3 assessors
- can build high-quality final judgment set
- provided data for measuring effect of differences on final scores
17 User Evaluation Necessary
- Even for these questions, context matters
- Taj Mahal casino in Atlantic City
- Legitimate differences in opinion as to whether a string contains a correct answer
- granularity of dates
- completeness of names
- confusability of answer string
- If assessors' opinions differ, so will eventual end-users' opinions
18 Validating the Methodology
- User-based evaluation is appropriate, but is it reliable?
- how do judgment differences affect scores?
- Does the methodology produce the equivalent of an IR test collection for QA?
- researchers able to evaluate their own runs
19 Mean Reciprocal Rank by Qrels
20 Kendall Correlations
21 Comparative Scores are Stable
- Mean Kendall τ of .96 (18 swaps), equivalent to variation found in TREC IR test collections (see the sketch below)
- Judgment sets using 1 judge's opinion equivalent to adjudicated judgments
- adjudicated > 3 times the cost of 1-judge qrels
22 QA Test Collections
- Goal is reusable test collections
- researchers evaluate own variants
- main source of improvement for IR systems in TREC
- But not the case for QA
- strings judged; little overlap across runs
- need procedure for deciding if unjudged string is okay given a set of judged strings
23 Automatic Evaluation
- In general, evaluating strings is equivalent to solving the original problem
- Approximations
- U. Ottawa: accept any string that contains a string judged correct
- MITRE: have human produce answer key and use word recall threshold
- NIST: produce patterns from judged strings; accept any string with a match
24 Example Patterns
- Who invented Silly Putty?
- General\sElectric
- Where is the location of the Orange Bowl?
- \sMiami\s
- \sin\sMiami\s\.?
- \sto\sMiami
- at\sMiami
- Miami\s'?\ss\sdowntown
- Orange.\sin\s.Miami
- Orange\sBowl\s,\sMiami
- Miami\s'?\ss
- Orange
- Dade\sCounty
- Who was Jane Goodall?
- naturalist
- expert\son\schimps
- chimpanzee\sspecialist
- chimpanzee\sresearcher
- chimpanzee\s-?\sobserver
- ethologists?
- pioneered.study\sof\sprimates
- anthropologist
- ethnologist
- primatologist
- animal\sbehaviorist
- wife.van\sLawick
- scientist\sof\sunquestionable\sreputation
- most\srecognizable\sliving\sscientist
25 Judging Strings Using Patterns
- Consider a string correct if any pattern for that question matches (see the sketch below)
- Compute MRR score for each run based on the pattern judgments
- Compute Kendall τ between ranking based on adjudicated judgments and ranking based on pattern judgments
- τ = .96, equivalent to different humans, for TREC-8
- τ = .89 for TREC-9
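A minimal sketch of pattern-based judging under these rules; the pattern table and question ids are placeholders, not the NIST-distributed pattern files themselves.

import re

# Placeholder patterns keyed by question id, for illustration only.
ANSWER_PATTERNS = {
    "silly_putty": [r"General\sElectric"],
}

def pattern_correct(question_id: str, answer_string: str) -> bool:
    """A string is judged correct if any pattern for the question matches."""
    return any(re.search(p, answer_string)
               for p in ANSWER_PATTERNS.get(question_id, []))

print(pattern_correct("silly_putty", "Silly Putty was created at General Electric"))  # True
print(pattern_correct("silly_putty", "It was invented by accident"))                  # False

MRR for a run can then be recomputed by applying this judge at each rank, as in the earlier mean-reciprocal-rank sketch.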
26 Problem Solved?
- Not really...
- patterns were created from runs that were then re-scored
- patterns don't differentiate between documents
- patterns don't penalize answer stuffing
- errors are correlated with system functionality
- document pools woefully incomplete
- Useful if limitations understood
- Still need full solution
27 Extensions to the Original Task
- TREC 2001
- systems must determine if there is an answer present in the collection
- list task added
- TREC 2002
- exact answer rather than text snippet
- TREC 2003
- definition questions added
- main task included all question types
- TREC 2004
- question series as abstraction of dialog
28 Motivation for Exact Answers
What river in the US is known as the Big Muddy?
- the Mississippi
- Known as Big Muddy, the Mississippi is the longest
- as Big Muddy , the Mississippi is the longest
- messed with . Known as Big Muddy , the Mississip
- Mississippi is the longest river in the US
- the Mississippi is the longest river in the US,
- the Mississippi is the longest river(Mississippi)
- has brought the Mississippi to ist lowest
- ipes.In Life on the Mississippi,Mark Twain wrote t
- SoutheastMississippiMark Twainofficials began
- Known Mississippi US, Minnesota Gulf Mexico
- Mud Island,MississippiThe-- history,Memphis
29 Motivation for Exact Answers
- Text snippets masking important differences among systems
- Pinpointing precise extent of answer important to driving technology
- not a statement that deployed systems should return only exact answers
- exact answers may be important as component in larger language systems
30 Recognizing Exact Answers
- Gave assessors guidelines
- most minimal response possible not the only exact answer
- e.g., accept "Mississippi river" for "What is the longest river in the United States?"
- ungrammatical responses not exact
- e.g., "in Mississippi" vs. "Mississippi in"
- justification is not exact
- e.g., "At 2,348 miles the Mississippi river is the longest US river" is inexact
31 Assessors Continue to Disagree
- 80% of judgments were Wrong
- 50% of responses where at least one judgment was not W had disagreements
- Of those, 33% involved disagreements between Right and ineXact
- well-known granularity issue now reflected here
- For dates and quantities, disagreement among Wrong and ineXact
32 But Comparative Results Still Stable
- Kendall τ scores between system rankings > 0.9
- Scores for rankings using adjudicated judgments > 0.94
33 QA List Task
- Goal: force systems to assemble an answer from multiple documents
- Instance-finding task
- Name 4 U.S. cities that have a Shubert theater
- What are 9 novels written by John Updike?
- later tracks did not give target number of instances
- response is an unordered set of (docid, answer-string) pairs
34 QA List Evaluation
- Each list judged as a unit
- individual instances marked correct/unsupported/incorrect/inexact
- subset of correct instances could be marked distinct
- Evaluation metric (see the sketch below)
- accuracy (number of distinct correct instances over the target number) when target number of instances given
- average F(β=1) when no target number specified
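A small sketch of the two list metrics under the stated assumptions: accuracy as distinct correct instances over the target size, and an instance-level F with β = 1 otherwise. The function and parameter names, and the recall denominator (number of known instances), are assumptions for illustration.

def f_measure(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def list_score(distinct_correct, returned, target=None, known_instances=None):
    """Accuracy when a target number of instances was specified (early
    tracks); otherwise F(beta=1) over instance precision and recall."""
    if target is not None:
        return distinct_correct / target
    precision = distinct_correct / returned if returned else 0.0
    recall = distinct_correct / known_instances if known_instances else 0.0
    return f_measure(precision, recall, beta=1.0)

print(list_score(3, 4, target=4))           # 0.75
print(list_score(3, 5, known_instances=9))  # F1 ~ 0.43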
35 TREC 2003 List Results
F score of best run for top 15 groups for list component
36 Definition Questions
- Represented about 25% of test set in TREC 2001
- What is an atom? What is epilepsy?
- What are invertebrates? Who is Colin Powell?
- Are hard for systems to answer and for assessors to judge
- lack of context/user model unrealistic
- while real, there are better ways of finding definitions than looking in a large corpus
- What is a good answer?
37 Issues
- Have same concept-matching problem as in other NLP evals (e.g., summarization)
- want to reward systems for retrieving all of the important concepts required and penalize systems for retrieving irrelevant or redundant concepts
- Recall = |Retrieved ∩ Required| / |Required|
- Precision = |Retrieved ∩ Required| / |Retrieved| (worked example below)
- but concepts represented in English in many ways
- no one-to-one correspondence between items and concepts
- Different questions have very different sizes for Required
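A tiny worked example of the set formulas above, pretending (as the surrounding bullets note is unrealistic) that concepts can be matched as discrete items; the concept names are invented.

# Invented concept sets for illustration only.
required = {"agreement with executives", "generous remuneration",
            "triggered by job loss", "anti-takeover incentive"}
retrieved = {"agreement with executives", "generous remuneration",
             "IRS excise tax"}

matched = retrieved & required
recall = len(matched) / len(required)      # 2/4 = 0.5
precision = len(matched) / len(retrieved)  # 2/3 ~ 0.67
print(recall, precision)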
38 Definition Evaluation
- Have assessor create list of concepts that definition should contain
- indicate essential concepts
- okay concepts
- Mark concepts in system responses
- mark a concept at most once
- individual item may have multiple, one, or no concepts
39 What is a golden parachute?
- Assessor nuggets
- Agreement between companies and top executives
- Provides remuneration to executives who lose jobs
- Remuneration is usually very generous
- Encourages executives not to resist takeover beneficial to shareholders
- Incentive for executives to join companies
- Arrangement for which IRS can impose excise tax
Judged system response
40 Evaluation
- With this methodology, concept recall is computable, but not concept precision
- no satisfactory way to list all concepts retrieved
- assessors cannot enumerate all concepts in text
- granularity issue
- unnatural task
- items not well correlated, very easy to game
- Rough approximation to concept precision: length (see the sketch below)
- count (non-white-space) characters in all items
- intuition is that users prefer the shorter of two definitions with the same concepts
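A minimal sketch of the resulting definition score: nugget recall over vital nuggets, a length-based stand-in for precision, and an F with β = 5. The 100 non-white-space-character allowance per matched nugget follows the TREC definition evaluation, though the exact constant and the variable names here should be treated as assumptions.

def definition_f(vital_returned, total_vital, okay_returned, length, beta=5.0):
    """length: non-white-space characters across all items in the response."""
    recall = vital_returned / total_vital if total_vital else 0.0
    allowance = 100 * (vital_returned + okay_returned)  # assumed allowance per matched nugget
    if length < allowance:
        precision = 1.0                                  # short responses are not penalized
    else:
        precision = 1.0 - (length - allowance) / length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# 3 of 5 vital nuggets, 1 okay nugget, 600 characters returned: score ~ 0.60
print(definition_f(vital_returned=3, total_vital=5, okay_returned=1, length=600))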
41 TREC 2003 Definition Results
F(β=5) score of best run for top 15 groups for definition component
42 Reliability of Definition Evaluation
- Mistakes by assessors
- exist in all evaluations
- can be directly measured since no pooling
- Differences of opinion
- different assessors disagree as to correctness
- inherent in NLP tasks
- Sample of questions
- different systems do relatively differently on different questions
- particular sample of questions can skew results
- more questions lead to more stable results
43 Mistakes by Assessors
- 14 pairs of identical definition components
- across all pairs, 19 different definition questions judged differently
- (roughly) uniform across assessors
- number of questions affected ranges from 0 to 10
- difference in F(β=5) scores ranges from 0.0 to 0.043, with a mean of 0.013
- differences in F scores of up to 0.043 between different systems must clearly be considered equivalent
- New task
- consistency improved somewhat as assessors gained experience
- better training re granularity will help some
- will never eliminate all errors
44 Differences of Opinion
- Each question independently judged by 2 assessors
- assessors differed in what nuggets desired
- assessors differed in whether nuggets vital
- assessors did not differ as much in whether a nugget was present (modulo mistakes)
- Correlation among system rankings when questions judged by different assessors
- compute Kendall τ correlation between rankings
- τ = 0.848, representing 113/1485 pairwise swaps
- 8 swaps among systems whose F(β=5) scores as judged by original assessors differed by > 0.1
- largest F(β=5) difference with swap was 0.123
45 Sample of Questions in Test Set
Need difference > 0.1 for error rate < 5%
46 Definition Evaluation
- Noise within definition evaluation comparatively large
- need to consider F scores within 0.1 of one another equivalent
- coarse evaluation
- large equivalence classes of runs
- one fix is to increase number of questions
- larger sample of questions
- individual mistakes have less effect
- evaluation more costly
47 TREC 2004 Task
- Process question series
- each series is about a specified target, and the goal is to gather info about, or define, the target
- a series contains factoid and list questions, plus a final "other" question
- questions are tagged as to type and scored by type
- final score is a weighted average of 3 component scores (see the sketch below)
- FinalScore = 1/2 Factoid + 1/4 List + 1/4 Other
48 QA Track Results
- Solidify a community
- enormous growth in QA community
- world-wide interest (e.g., QA tasks in NTCIR, CLEF)
- Establish the research methodology
- showed that even facts are context-sensitive
- first steps toward evaluating complex answers
- Facilitate technology transfer
- common architecture for factoid questions
- Document the state-of-the-art
- task for which NLP techniques show real benefit
- rough boundary when IR techniques insufficient
- Amortize the costs of infrastructure
- patterns a partial solution; use with care
49 Where to next?
- Synergy between QA and summarization
- continue to explore evaluation methodologies for complex questions
- reduce emphasis on factoids to a supporting role
- Context-sensitive QA
- personalized to user
- takes account of implicit background as well as explicit cues within interaction