Title: Evaluating Question Answering Systems
1 Evaluating Question Answering Systems
2 So, enough already!
- Evaluations in HLT proliferating
- DARPA speech evals
- MUC, ACE
- TREC, NTCIR, CLEF
- senseval, parseval, DUC, TDT...
- Let's stop evaluating and get some real work done!
3 Case for Community Evaluations
- Form/solidify a research community
- Establish the research methodology
- Facilitate technology transfer
- Document the state-of-the-art
- Amortize the costs of infrastructure
4 Of course, some downside
- Evaluations do take resources from other efforts
- money to defray evaluation costs
- researcher time
- minimize effect by keeping evaluation-only tasks such as result reporting simple
- Overfitting
- entire community trains to peculiarities of test set
- minimize effect by having multiple test sets, evolving the evaluation task
5 What is a Good Eval Task?
- Abstraction of real-world task so variables affecting performance can be controlled...
- ...but must capture salient aspects of real task or exercise is pointless
- Metrics must accurately predict relative effectiveness on those aspects
- Adequate level of difficulty
- Best if measures are diagnostic
6 TREC QA Track
- Goal
- encourage research into systems that return answers, rather than document lists
- Motivation
- bring benefits of large-scale evaluation to QA task
- provide a common problem for the IR and IE communities
- investigate appropriate evaluation methodologies for QA
7 Original QA Track Task
- Given
- set of fact-based, short-answer questions
- 3 GB of newspaper/newswire text
- Return
- ranked list of document, answer-string pairs
- strings limited to 50 or 250 bytes
- document must support answer
- Guidelines
- completely automatic processing
- answer guaranteed to exist in collection
- assume documents are factual
8 Sample Questions
- How much folic acid should an expectant mother get daily?
- Who invented the paper clip?
- What university was Woodrow Wilson president of?
- Where is Rider College located?
- Name a film in which Jude Law acted.
9 Evaluation
- Human assessors judge correctness of responses
- Score using mean reciprocal rank (see the sketch below)
- score for individual question is the reciprocal of the rank at which the first correct response is returned (0 if no correct response returned)
- score of a run is the mean over the set of questions
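As a concrete illustration, the sketch below computes mean reciprocal rank from per-question judgments; the data layout (one list of rank-ordered correct/incorrect flags per question) is an assumption for the example, not the track's actual file format.

def mean_reciprocal_rank(judged_runs):
    """judged_runs: one list per question, holding booleans in rank
    order (True where the response at that rank is correct)."""
    total = 0.0
    for judgments in judged_runs:
        reciprocal = 0.0
        for rank, correct in enumerate(judgments, start=1):
            if correct:                 # first correct response determines the score
                reciprocal = 1.0 / rank
                break
        total += reciprocal            # questions with no correct response add 0
    return total / len(judged_runs)

# e.g., correct at ranks 1 and 3, and never: (1 + 1/3 + 0) / 3 = 0.444
print(mean_reciprocal_rank([[True], [False, False, True], [False, False]]))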
10 Question Answering Techniques
- 250 bytes is enough for passage retrieval techniques; 50 bytes is not
- For the 50-byte limit
- classify question type
- look for close-by entities that match the entailed answer type
- fall back to passage retrieval on failure
11 Template Matching
- Determine question type by template matching (see the sketch below)
- Template matching is successful when questions are predictable
- Who vs. What <person-description>
- Occasional problems even when templates match
- Who was the first American in space?
- As for Wilson himself, he became a senator by defeating Jerry Brown, who has been called the first American in space.
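A minimal sketch of the template-matching idea; the templates and answer-type labels are invented for illustration, not those of any particular TREC system, which used far richer question taxonomies.

import re

# Illustrative templates only; labels and patterns are assumptions.
TEMPLATES = [
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^where\b", re.I), "LOCATION"),
    (re.compile(r"\bhow (much|many)\b", re.I), "QUANTITY"),
    (re.compile(r"^when\b|\bwhat year\b", re.I), "DATE"),
    (re.compile(r"^what\b|^which\b", re.I), "THING"),
]

def classify_question(question: str) -> str:
    """Return the entailed answer type, or UNKNOWN to trigger the
    passage-retrieval fallback."""
    for pattern, answer_type in TEMPLATES:
        if pattern.search(question):
            return answer_type
    return "UNKNOWN"

print(classify_question("Who invented the paper clip?"))     # PERSON
print(classify_question("Where is Rider College located?"))  # LOCATION

As the Jerry Brown example shows, a matching template only constrains the answer type; it does not guarantee the nearby entity of that type is the right one.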
12 Question Answering Techniques
- TREC-8 approaches generally retained
- better question classification
- wider variety of methods for finding entailed answer types
- frequent use of WordNet
- High-quality document search still helpful
- TREC 2001 saw onset of methods that substitute massive amounts of data for sophisticated processing
13 TREC 2001 Main Task Results
Scores for the best run of the top 8 groups using strict evaluation
14 Source of Questions
- TREC-8: developed for the track
- as unambiguous as possible
- tended to be back-formulations
- TREC-9: ideas mined from Excite log
- no reference to docs during creation
- TREC 2001: literal questions from logs
- logs donated by AskJeeves and Microsoft
- Had major impact on task: real questions much harder because more ambiguous
15 Evaluation Methodology
- Different philosophies
- IR: the user is the sole judge of a satisfactory response
- human assessors judge responses
- flexible interpretation of correct response
- final scores comparative, not absolute
- IE: there exists the answer
- answer keys developed by application expert
- requires enumeration of all acceptable responses at outset
- subsequent scoring trivial; final scores absolute
16 QA Evaluation Methodology
- NIST assessors judge answer strings
- binary judgment of correct/incorrect
- document provides context for answer
- Each question independently judged by 3 assessors
- can build high-quality final judgment set
- provided data for measuring effect of differences on final scores
17 User Evaluation Necessary
- Even for these questions, context matters
- Taj Mahal casino in Atlantic City
- Legitimate differences in opinion as to whether a string contains a correct answer
- granularity of dates
- completeness of names
- confusability of answer string
- If assessors' opinions differ, so will eventual end-users' opinions
18 Validating the Methodology
- User-based evaluation is appropriate, but is it reliable?
- how do judgment differences affect scores?
- Does the methodology produce the equivalent of an IR test collection for QA?
- researchers able to evaluate their own runs
19 Mean Reciprocal Rank by Qrels
20 Kendall Correlations
21 Comparative Scores are Stable
- Mean Kendall τ of .96 (18 swaps), equivalent to variation found in TREC IR test collections (see the sketch below)
- Judgment sets using 1 judge's opinion equivalent to adjudicated judgments
- adjudicated > 3 times the cost of 1-judge qrels
22 QA Test Collections
- Goal is reusable test collections
- researchers evaluate own variants
- main source of improvement for IR systems in TREC
- But not the case for QA
- strings judged; little overlap across runs
- need procedure for deciding if unjudged string is okay given a set of judged strings
23 Automatic Evaluation
- In general, evaluating strings is equivalent to solving the original problem
- Approximations
- U. Ottawa: accept any string that contains a string judged correct
- MITRE: have human produce answer key and use word recall threshold
- NIST: produce patterns from judged strings; accept any string with a match
24 Example Patterns
- Who invented Silly Putty?
- General\sElectric
- Where is the location of the Orange Bowl?
- \sMiami\s
- \sin\sMiami\s\.?
- \sto\sMiami
- at\sMiami
- Miami\s'?\ss\sdowntown
- Orange.\sin\s.Miami
- Orange\sBowl\s,\sMiami
- Miami\s'?\ss
- Orange
- Dade\sCounty
- Who was Jane Goodall?
- naturalist
- expert\son\schimps
- chimpanzee\sspecialist
- chimpanzee\sresearcher
- chimpanzee\s-?\sobserver
- ethologists?
- pioneered.study\sof\sprimates
- anthropologist
- ethnologist
- primatologist
- animal\sbehaviorist
- wife.van\sLawick
- scientist\sof\sunquestionable\sreputation
- most\srecognizable\sliving\sscientist
25 Judging Strings Using Patterns
- Consider a string correct if any pattern for that question matches (see the sketch below)
- Compute MRR score for each run based on the pattern judgments
- Compute Kendall τ between ranking based on adjudicated judgments and ranking based on pattern judgments
- τ = .96, equivalent to different humans, for TREC-8
- τ = .89 for TREC-9
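A minimal sketch of pattern-based judging under these rules; the pattern table and question ids are placeholders, not the NIST-distributed pattern files themselves.

import re

# Placeholder patterns keyed by question id, for illustration only.
ANSWER_PATTERNS = {
    "silly_putty": [r"General\sElectric"],
}

def pattern_correct(question_id: str, answer_string: str) -> bool:
    """A string is judged correct if any pattern for the question matches."""
    return any(re.search(p, answer_string)
               for p in ANSWER_PATTERNS.get(question_id, []))

print(pattern_correct("silly_putty", "Silly Putty was created at General Electric"))  # True
print(pattern_correct("silly_putty", "It was invented by accident"))                  # False

MRR for a run can then be recomputed by applying this judge at each rank, as in the earlier mean-reciprocal-rank sketch.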
26 Problem Solved?
- Not really...
- patterns were created from runs that were then re-scored
- patterns don't differentiate between documents
- patterns don't penalize answer stuffing
- errors are correlated with system functionality
- document pools woefully incomplete
- Useful if limitations understood
- Still need full solution
27 Extensions to the Original Task
- TREC 2001
- systems must determine if there is an answer present in the collection
- list task added
- TREC 2002
- exact answer rather than text snippet
- TREC 2003
- definition questions added
- main task included all question types
- TREC 2004
- question series as abstraction of dialog
28 Motivation for Exact Answers
What river in the US is known as the Big Muddy?
- the Mississippi
- Known as Big Muddy, the Mississippi is the longest
- as Big Muddy , the Mississippi is the longest
- messed with . Known as Big Muddy , the Mississip
- Mississippi is the longest river in the US
- the Mississippi is the longest river in the US,
- the Mississippi is the longest river(Mississippi)
- has brought the Mississippi to ist lowest
- ipes.In Life on the Mississippi,Mark Twain wrote t
- SoutheastMississippiMark Twainofficials began
- Known Mississippi US, Minnesota Gulf Mexico
- Mud Island,MississippiThe-- history,Memphis
29 Motivation for Exact Answers
- Text snippets masking important differences among systems
- Pinpointing precise extent of answer important to driving technology
- not a statement that deployed systems should return only exact answers
- exact answers may be important as component in larger language systems
30 Recognizing Exact Answers
- Gave assessors guidelines
- most minimal response possible not the only exact answer
- e.g., accept "Mississippi river" for "What is the longest river in the United States?"
- ungrammatical responses not exact
- e.g., "in Mississippi" vs. "Mississippi in"
- justification is not exact
- e.g., "At 2,348 miles the Mississippi river is the longest US river" is inexact
31 Assessors Continue to Disagree
- 80% of judgments were Wrong
- 50% of responses where at least one judgment was not W had disagreements
- Of those, 33% involved disagreements between Right and ineXact
- well-known granularity issue now reflected here
- For dates and quantities, disagreement among Wrong and ineXact
32 But Comparative Results Still Stable
- Kendall τ scores between system rankings > 0.9
- Scores for rankings using adjudicated judgments > 0.94
33 QA List Task
- Goal: force systems to assemble an answer from multiple documents
- Instance-finding task
- Name 4 U.S. cities that have a Shubert theater
- What are 9 novels written by John Updike?
- later tracks did not give target number of instances
- response is an unordered set of (docid, answer-string) pairs
34 QA List Evaluation
- Each list judged as a unit
- individual instances marked correct/unsupported/incorrect/inexact
- subset of correct instances could be marked distinct
- Evaluation metric (see the sketch below)
- accuracy (number of distinct correct instances over the target number) when target number of instances given
- average F(β=1) when no target number specified
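A small sketch of the two list metrics under the stated assumptions: accuracy as distinct correct instances over the target size, and an instance-level F with β = 1 otherwise. The function and parameter names, and the recall denominator (number of known instances), are assumptions for illustration.

def f_measure(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def list_score(distinct_correct, returned, target=None, known_instances=None):
    """Accuracy when a target number of instances was specified (early
    tracks); otherwise F(beta=1) over instance precision and recall."""
    if target is not None:
        return distinct_correct / target
    precision = distinct_correct / returned if returned else 0.0
    recall = distinct_correct / known_instances if known_instances else 0.0
    return f_measure(precision, recall, beta=1.0)

print(list_score(3, 4, target=4))           # 0.75
print(list_score(3, 5, known_instances=9))  # F1 ~ 0.43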
35 TREC 2003 List Results
F score of best run for top 15 groups for list component
36 Definition Questions
- Represented about 25% of test set in TREC 2001
- What is an atom? What is epilepsy?
- What are invertebrates? Who is Colin Powell?
- Are hard for systems to answer and for assessors to judge
- lack of context/user model unrealistic
- while real, there are better ways of finding definitions than looking in a large corpus
- What is a good answer?
37 Issues
- Have same concept-matching problem as in other NLP evals (e.g., summarization)
- want to reward systems for retrieving all of the important concepts required and penalize systems for retrieving irrelevant or redundant concepts
- Recall = |Retrieved ∩ Required| / |Required|
- Precision = |Retrieved ∩ Required| / |Retrieved| (worked example below)
- but concepts represented in English in many ways
- no one-to-one correspondence between items and concepts
- Different questions have very different sizes for Required
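A tiny worked example of the set formulas above, pretending (as the surrounding bullets note is unrealistic) that concepts can be matched as discrete items; the concept names are invented.

# Invented concept sets for illustration only.
required = {"agreement with executives", "generous remuneration",
            "triggered by job loss", "anti-takeover incentive"}
retrieved = {"agreement with executives", "generous remuneration",
             "IRS excise tax"}

matched = retrieved & required
recall = len(matched) / len(required)      # 2/4 = 0.5
precision = len(matched) / len(retrieved)  # 2/3 ~ 0.67
print(recall, precision)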
38 Definition Evaluation
- Have assessor create list of concepts that definition should contain
- indicate essential concepts
- okay concepts
- Mark concepts in system responses
- mark a concept at most once
- individual item may have multiple, one, or no concepts
39 What is a golden parachute?
- Assessor nuggets
- Agreement between companies and top executives
- Provides remuneration to executives who lose jobs
- Remuneration is usually very generous
- Encourages executives not to resist takeover beneficial to shareholders
- Incentive for executives to join companies
- Arrangement for which IRS can impose excise tax
Judged system response
40 Evaluation
- With this methodology, concept recall is computable, but not concept precision
- no satisfactory way to list all concepts retrieved
- assessors cannot enumerate all concepts in text
- granularity issue
- unnatural task
- items not well correlated, very easy to game
- Rough approximation to concept precision: length (see the sketch below)
- count (non-white-space) characters in all items
- intuition is that users prefer the shorter of two definitions with the same concepts
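A minimal sketch of the resulting definition score: nugget recall over vital nuggets, a length-based stand-in for precision, and an F with β = 5. The 100 non-white-space-character allowance per matched nugget follows the TREC definition evaluation, though the exact constant and the variable names here should be treated as assumptions.

def definition_f(vital_returned, total_vital, okay_returned, length, beta=5.0):
    """length: non-white-space characters across all items in the response."""
    recall = vital_returned / total_vital if total_vital else 0.0
    allowance = 100 * (vital_returned + okay_returned)  # assumed allowance per matched nugget
    if length < allowance:
        precision = 1.0                                  # short responses are not penalized
    else:
        precision = 1.0 - (length - allowance) / length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# 3 of 5 vital nuggets, 1 okay nugget, 600 characters returned: score ~ 0.60
print(definition_f(vital_returned=3, total_vital=5, okay_returned=1, length=600))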
41 TREC 2003 Definition Results
F(β=5) score of best run for top 15 groups for definition component
42 Reliability of Definition Evaluation
- Mistakes by assessors
- exist in all evaluations
- can be directly measured since no pooling
- Differences of opinion
- different assessors disagree as to correctness
- inherent in NLP tasks
- Sample of questions
- different systems do relatively differently on different questions
- particular sample of questions can skew results
- more questions lead to more stable results
43 Mistakes by Assessors
- 14 pairs of identical definition components
- across all pairs, 19 different definition questions judged differently
- (roughly) uniform across assessors
- number of questions affected ranges from 0 to 10
- difference in F(β=5) scores ranges from 0.0 to 0.043, with a mean of 0.013
- differences in F scores of up to 0.043 between different systems must clearly be considered equivalent
- New task
- consistency improved somewhat as assessors gained experience
- better training re granularity will help some
- will never eliminate all errors
44 Differences of Opinion
- Each question independently judged by 2 assessors
- assessors differed in what nuggets desired
- assessors differed in whether nuggets vital
- assessors did not differ as much in whether a nugget was present (modulo mistakes)
- Correlation among system rankings when questions judged by different assessors
- compute Kendall τ correlation between rankings
- τ = 0.848, representing 113/1485 pairwise swaps
- 8 swaps among systems whose F(β=5) scores as judged by original assessors differed by > 0.1
- largest F(β=5) difference with swap was 0.123
45 Sample of Questions in Test Set
Need difference > 0.1 for error rate < 5%
46 Definition Evaluation
- Noise within definition evaluation comparatively large
- need to consider F scores within 0.1 of one another equivalent
- coarse evaluation
- large equivalence classes of runs
- one fix is to increase number of questions
- larger sample of questions
- individual mistakes have less effect
- evaluation more costly
47 TREC 2004 Task
- Process question series
- each series is about a specified target, and the goal is to gather info about, or define, the target
- a series contains factoid and list questions, plus a final "other" question
- questions are tagged as to type and scored by type
- final score is a weighted average of 3 component scores (see the sketch below)
- FinalScore = 1/2 Factoid + 1/4 List + 1/4 Other
48 QA Track Results
- Solidify a community
- enormous growth in QA community
- world-wide interest (e.g., QA tasks in NTCIR, CLEF)
- Establish the research methodology
- showed that even facts are context-sensitive
- first steps toward evaluating complex answers
- Facilitate technology transfer
- common architecture for factoid questions
- Document the state-of-the-art
- task for which NLP techniques show real benefit
- rough boundary when IR techniques insufficient
- Amortize the costs of infrastructure
- patterns a partial solution; use with care
49 Where to next?
- Synergy between QA and summarization
- continue to explore evaluation methodologies for complex questions
- reduce emphasis on factoids to a supporting role
- Context-sensitive QA
- personalized to user
- takes account of implicit background as well as explicit cues within interaction