Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises


1
Lessons Learned from Evaluation of Summarization
Systems Nightmares and Pleasant Surprises
  • Kathleen McKeown
  • Department of Computer Science
  • Columbia University
  • Major contributors: Ani Nenkova, Becky Passonneau

2
(No Transcript)
3
Questions
  • What kinds of evaluation are possible?
  • What are the pitfalls?
  • Are evaluation metrics fair?
  • Is real research progress possible?
  • What are the benefits?
  • Should we evaluate our systems?

4
What is the feel of the evaluation?
  • Is it competitive?
  • Does it foster a feeling of community?
  • Are the guidelines clearly established ahead of
    time?
  • Are the metrics fair? Do they measure what you
    want to measure?

5
(No Transcript)
6
The night Max wore his wolf suit and made
mischief of one kind
7
and another
and another
8
His mother called him "WILD THING" and he said
"I'LL EAT YOU UP!" so he was sent to bed without
eating anything.
9
DARPA GALE: Global Autonomous Language Environment
  • Three large teams: BBN, IBM, SRI
  • SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio State
  • Generate responses to open-ended questions
  • 17 templates: definitions, biographies, events, relationships, reactions, etc.
  • Using English, Chinese, and Arabic text and speech, blogs to news
  • Find all instances when a fact is mentioned (redundancy)

10
GALE Evaluation
  • Can systems do at least 50% as well as a human?
  • If not, the GALE program will not continue
  • The team that does worst may be cut
  • Independent evaluator: BAE
  • Has never done text evaluation before
  • Has experience with task-based evaluation
  • Gold standard:
  • System responses graded by two judges
  • Relevant facts added to the pool
  • Granularity of scoring: nuggets
  • Metrics:
  • Variants of precision/recall (weighted)
  • Document citations
  • Redundancy

11
Year 1 Sample QA
  • LIST FACTS ABOUT The Trial of Saddam Hussein
  • The judge , however, that all people should have
    heard voices, the order of a court to solve
    technical problems. (Chi)
  • His account of events surrounding the torture and
    execution of more than 140 men and teenage boys
    from the Dujail , appeared to do little to
    advance the prosecution's goal of establishing
    Saddam 's "command responsibility" for the
    deaths.
  • A trial without Saddam could be an embarrassment
    for the U.S. government, which has worked hard to
    help create a tribunal that would be perceived by
    Iraqis as independent and fair.
  • As the trial got under way, a former secret
    police officer testified that he had not received
    any orders from Saddam during the investigations
    that followed an assassination attempt against
    him in Dujail in 1982 .

12
Year 1 Results
  • F-value (beta of 1)
  • Machine average: 0.230
  • Human average: 0.353
  • Machine-to-human average: 0.678

13
DUC: Document Understanding Conference
  • Established and funded by DARPA TIDES
  • Run by an independent evaluator, NIST
  • Open to the summarization community
  • Annual evaluations on common datasets
  • 2001-present
  • Tasks:
  • Single-document summarization
  • Headline summarization
  • Multi-document summarization
  • Multi-lingual summarization
  • Focused summarization
  • Update summarization

14
DUC is changing direction again
  • The DARPA GALE effort is cutting back participation in DUC
  • Considering co-locating with TREC QA
  • Considering new data sources and tasks

15
DUC Evaluation
  • Gold standard:
  • Human summaries written by NIST
  • From 2 to 9 summaries per input set
  • Multiple metrics:
  • Manual:
  • Coverage (early years)
  • Pyramids (later years)
  • Responsiveness (later years)
  • Quality questions
  • Automatic:
  • Rouge (-1, -2, skip-bigrams, LCS, BE)
  • Granularity:
  • Manual: sub-sentential elements
  • Automatic: sentences

16
TREC definition pilot
  • Long answer to request for a definition
  • As a pilot, less emphasis on results
  • Part of TREC QA

17
Evaluation Methods
  • Pool system responses and break them into nuggets
  • A judge scores nuggets as vital, OK, or invalid
  • Measure information precision and recall
  • Can a judge reliably determine which facts belong in a definition?
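
The nugget-based precision/recall used here and in the GALE evaluation can be made concrete. Below is a minimal sketch in the spirit of the TREC definition-pilot F-measure (Voorhees-style length allowance); the 100-character constant and the exact grading rules are illustrative assumptions, and the formulas actually used in GALE or the pilot may differ.

```python
def nugget_f(vital_returned, vital_total, okay_returned,
             answer_length_chars, beta=1.0):
    """Nugget-style F-measure in the spirit of TREC definition-QA scoring.

    Recall counts only vital nuggets. Precision is a length-based
    approximation: each returned vital/OK nugget earns a character
    allowance, and text beyond the allowance is penalized.
    """
    recall = vital_returned / vital_total if vital_total else 0.0
    allowance = 100 * (vital_returned + okay_returned)  # illustrative constant
    if answer_length_chars <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length_chars - allowance) / answer_length_chars
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Example: 3 of 5 vital nuggets found, 2 OK nuggets, 800-character answer.
# The answer exceeds its 500-character allowance, so precision drops below 1.
print(nugget_f(3, 5, 2, 800))  # ~0.61
```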

18
Considerations Across Evaluations
  • Independent evaluator:
  • Not always as knowledgeable as researchers
  • Impartial determination of approach
  • Extensive collection of resources
  • Determination of task:
  • Appealing to a broad cross-section of the community
  • Changes over time:
  • DUC 2001-2002: single- and multi-document
  • DUC 2003: headlines, multi-document
  • DUC 2004: headlines, multilingual and multi-document, focused
  • DUC 2005: focused summarization
  • DUC 2006: focused and a new task, up for discussion
  • How long do participants have to prepare?
  • When is a task dropped?
  • Scoring of text at the sub-sentential level

19
Task-based Evaluation
  • Use the summarization system as a browser to do another task
  • Newsblaster: write a report given a broad prompt
  • DARPA utility evaluation: given a request for information, use question answering to write a report

20
Task Evaluation
  • Hypothesis: multi-document summaries enable users to find information efficiently
  • Task: fact-gathering given a topic and questions
  • Resembles an intelligence analyst's task

21
User Study Objectives
  • Does multi-document summarization help?
  • Do summaries help the user find information
    needed to perform a report writing task?
  • Do users use information from summaries in
    gathering their facts?
  • Do summaries increase user satisfaction with the
    online news system?
  • Do users create better quality reports with
    summaries?
  • How do full multi-document summaries compare with
    minimal 1-sentence summaries such as Google News?

22
User Study Design
  • Compared 4 parallel news browsing systems:
  • Level 1: source documents only
  • Level 2: one-sentence multi-document summaries (e.g., Google News) linked to documents
  • Level 3: Newsblaster multi-document summaries linked to documents
  • Level 4: human-written multi-document summaries linked to documents
  • All groups write reports given four scenarios
  • A task similar to an analyst's
  • Can only use Newsblaster for research
  • Time-restricted

23
User Study Execution
  • 4 scenarios
  • 4 event clusters each
  • 2 directly relevant, 2 peripherally relevant
  • Average 10 documents/cluster
  • 45 participants
  • Balance between liberal arts and engineering
  • 138 reports
  • Exit survey:
  • Multiple-choice and open-ended questions
  • Usage tracking:
  • Each click logged, on- or off-site

24
Geneva Prompt
  • The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the "road map" for peace, a diplomatic effort sponsored by ...
  • Who participated in the negotiations that
    produced the Geneva Accord?
  • Apart from direct participants, who supported the
    Geneva Accord preparations and how?
  • What has the response been to the Geneva Accord
    by the Palestinians?

25
Measuring Effectiveness
  • Score report content and compare across summary
    conditions
  • Compare user satisfaction per summary condition
  • Compare where subjects took report content from

26
Newsblaster
27
User Satisfaction
  • Users found Newsblaster more effective than a web search
  • Not true with documents only or single-sentence summaries
  • Easier to complete the task with summaries than with documents only
  • Users more often felt they had enough time with summaries than with documents only
  • Which summaries helped most:
  • 5%: single-sentence summaries
  • 24%: Newsblaster summaries
  • 43%: human summaries

28
User Study Conclusions
  • Summaries measurably improve a news browser's effectiveness for research
  • Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News
  • Users want search
  • Not included in the evaluation

29
Potential Problems
30
That very night in Max's room a forest grew
31
And grew
32
And grew until the ceiling hung with vines and
the walls became the world all around
33
And an ocean tumbled by with a private boat for
Max and he sailed all through the night and day
34
And he sailed in and out of weeks and almost over
a year to where the wild things are
35
And when he came to where the wild things are
they roared their terrible roars and gnashed
their terrible teeth
36
Comparing Text Against Text
  • Which human summary makes a good gold standard?
    Many summaries are good
  • At what granularity is the comparison made?
  • When can we say that two pieces of text match?

37
Measuring variation
  • Types of variation between humans, by application:
  • Translation → same content, different wording
  • Summarization → different content??, different wording
  • Generation → different content??, different wording
38
Human variation: content words (Ani Nenkova)
  • Summaries differ in vocabulary
  • Differences cannot be explained by paraphrase
  • Data: 7 translations of 20 documents; 7 summaries of 20 document sets
  • Faster vocabulary growth in summarization
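
A minimal sketch of the kind of vocabulary-growth comparison described above; the whitespace tokenization and the toy inputs are simplifying assumptions, not the actual study's scripts or data.

```python
def vocabulary_growth(texts):
    """Cumulative vocabulary size as each additional human text about the
    same source (a translation or a summary) is added. Faster growth for
    summaries than for translations suggests humans differ in content
    choice, not just wording. Naive whitespace tokenization."""
    vocab, sizes = set(), []
    for text in texts:
        vocab.update(text.lower().split())
        sizes.append(len(vocab))
    return sizes

# Toy inputs (hypothetical): translations mostly share words,
# summaries overlap far less.
translations = ["the ministers exchanged visits",
                "the ministers exchanged reciprocal visits"]
summaries = ["ministers exchanged reciprocal visits",
             "diplomatic relations warmed quickly"]
print(vocabulary_growth(translations))  # [4, 5]: slow growth
print(vocabulary_growth(summaries))     # [4, 8]: fast growth
```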

39
Variation impacts evaluation
  • Comparing content is hard
  • All kinds of judgment calls:
  • Paraphrases:
  • VP vs. NP:
  • "Ministers have been exchanged"
  • "Reciprocal ministerial visits"
  • Length and constituent type:
  • "Robotics assists doctors in the medical operating theater"
  • "Surgeons started using robotic assistants"

40
Nightmare: only one gold standard
  • System may have chosen an equally good sentence, just not the one in the single gold standard:
  • "Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile."
  • "Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government."
  • In DUC 2001 (one gold standard), the choice of human model had a significant impact on scores (McKeown et al.)
  • Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)
  • DUC 2003 data:
  • 3 topic sets: 1 highest-scoring and 2 lowest-scoring
  • 10 model summaries

41
How many summaries are enough?
42
Scoring
  • Two main approaches used in DUC:
  • ROUGE (Lin and Hovy)
  • Pyramids (Nenkova and Passonneau)
  • Problems:
  • Are the results stable?
  • How difficult is it to do the scoring?

43
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE: n-gram co-occurrence metrics measuring content overlap

ROUGE-n = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries)
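
A minimal sketch of ROUGE-n recall as defined above, assuming simple whitespace tokenization; the official ROUGE toolkit adds stemming, stopword options, and jackknifing across references.

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    """ROUGE-n recall: clipped n-gram overlap between a candidate summary
    and the model (reference) summaries, divided by the total number of
    n-grams in the models. The bare formula only."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split())
    overlap = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split())
        total += sum(ref_counts.values())
        # Clip each n-gram's credit at its count in the candidate
        overlap += sum(min(count, cand[gram])
                       for gram, count in ref_counts.items())
    return overlap / total if total else 0.0

models = ["pinochet was arrested in london",
          "former dictator pinochet arrested at spanish request"]
print(rouge_n("pinochet arrested in london at spanish request", models, n=1))
```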
44
ROUGE
  • Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements
  • Automatic and thus easy to apply
  • Important to consider confidence intervals when determining differences between systems:
  • Scores falling within the same interval are not significantly different
  • Rouge scores place systems into large groups; it can be hard to definitively say one system is better than another
  • Sometimes results are unintuitive:
  • Multilingual scores as high as English scores
  • Use in speech summarization shows no discrimination
  • Good for training: regardless of intervals, one can see trends
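
One common way to obtain confidence intervals like those mentioned above is bootstrap resampling over per-topic scores. This sketch illustrates the idea with hypothetical scores; it is not necessarily how the official intervals were computed.

```python
import random

def bootstrap_ci(per_topic_scores, n_resamples=10000, alpha=0.05):
    """95% bootstrap confidence interval for a system's mean score,
    resampling topics with replacement. Systems whose intervals overlap
    should not be declared significantly different."""
    k = len(per_topic_scores)
    means = sorted(
        sum(random.choices(per_topic_scores, k=k)) / k
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-topic Rouge-2 scores for one system
scores = [0.12, 0.15, 0.09, 0.14, 0.11, 0.13, 0.10, 0.16]
print(bootstrap_ci(scores))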

45
Pyramids
  • Uses multiple human summaries
  • Information is ranked by its importance
  • Allows for multiple good summaries
  • A pyramid is created from the human summaries
  • Elements of the pyramid are content units
  • System summaries are scored by comparison with
    the pyramid
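
A minimal sketch of the original pyramid score as described by Nenkova and Passonneau: the total weight of the SCUs a summary expresses, normalized by the best score achievable with the same number of SCUs. The SCU matching itself is done manually; the weights below are illustrative.

```python
def pyramid_score(matched_scu_weights, pyramid_scu_weights):
    """Original pyramid score: total weight of the SCUs a summary
    expresses, divided by the maximum weight achievable with the same
    number of SCUs (taken from the top tiers of the pyramid first).
    Assumes SCU matching has already been done by the annotators."""
    d = sum(matched_scu_weights)
    ideal = sorted(pyramid_scu_weights, reverse=True)[:len(matched_scu_weights)]
    max_d = sum(ideal)
    return d / max_d if max_d else 0.0

# Illustrative pyramid built from 4 human summaries (weights 4 down to 1)
pyramid = [4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
# A system summary judged to express 4 SCUs with these weights:
print(pyramid_score([4, 3, 2, 1], pyramid))  # 10 / 14 ≈ 0.71
```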

46
Content units: a better unit for studying variation than sentences
  • Semantic units
  • Link different surface realizations with the same
    meaning
  • Emerge from the comparison of several texts

47
Content unit example
  • S1: "Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile."
  • S2: "Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government."
  • S3: "Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London."

48
SCU: "A cable car caught fire" (Weight 4)
  • A. The cause of the fire was unknown.
  • B. A cable car caught fire just after entering a
    mountainside tunnel in an alpine resort in
    Kaprun, Austria on the morning of November 11,
    2000.
  • C. A cable car pulling skiers and snowboarders
    to the Kitzsteinhorn resort, located 60 miles
    south of Salzburg in the Austrian Alps, caught
    fire inside a mountain tunnel, killing
    approximately 170 people.
  • D. On November 10, 2000, a cable car filled to
    capacity caught on fire, trapping 180 passengers
    inside the Kitzsteinhorn mountain, located in the
    town of Kaprun, 50 miles south of Salzburg in the
    central Austrian Alps.

49
SCU: "The cause of the fire is unknown" (Weight 1)
  • A. The cause of the fire was unknown.
  • B. A cable car caught fire just after entering a
    mountainside tunnel in an alpine resort in
    Kaprun, Austria on the morning of November 11,
    2000.
  • C. A cable car pulling skiers and snowboarders
    to the Kitzsteinhorn resort, located 60 miles
    south of Salzburg in the Austrian Alps, caught
    fire inside a mountain tunnel, killing
    approximately 170 people.
  • D. On November 10, 2000, a cable car filled to
    capacity caught on fire, trapping 180 passengers
    inside the Kitzsteinhorn mountain, located in the
    town of Kaprun, 50 miles south of Salzburg in the
    central Austrian Alps.

50
Idealized representation
  • Tiers of differentially weighted SCUs
  • Top: few SCUs, high weight
  • Bottom: many SCUs, low weight

[Pyramid figure: tiers labeled W3 (top), W2, and W1 (bottom)]
51
Comparison of Scoring Methods in DUC 2005
  • Analysis of scores for the 20 pyramid sets
  • Columbia prepared the pyramids
  • Participants scored systems against the pyramids
  • Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4
  • Pyramid score computed from multiple humans
  • Responsiveness is just one human's judgment
  • Rouge-SU4 equivalent to Rouge-2

52
Creation of pyramids
  • Done at Columbia for each of 20 out of 50 sets
  • Primary annotator, secondary checker
  • Held round-table discussions of problematic constructions that occurred in this data set:
  • Comma-separated lists:
  • "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medicinal plants without deforestation."
  • General vs. specific:
  • "Eastern Europe" vs. "Hungary, Poland, Lithuania, and Turkey"

53
Characteristics of the Responses
  • Proportion of SCUs of weight 1 is large:
  • 44% (D324) to 81% (D695)
  • Mean SCU weight: 1.9
  • Agreement among human responders is quite low

54
[Histogram: number of SCUs at each weight, by SCU weight]
55
Preview of Results
  • Manual metrics:
  • Large differences between humans and machines
  • No single system the clear winner
  • But a top group identified by all metrics
  • Significant differences:
  • Different predictions from manual and automatic metrics
  • Correlations between metrics:
  • Some correlation, but one cannot be substituted for another
  • This is good

56
Human performance / best system

  Pyramid      Modified     Resp       ROUGE-SU4
  B  0.5472    B  0.4814    A  4.895   A  0.1722
  A  0.4969    A  0.4617    B  4.526   B  0.1552
  14 0.2587    10 0.2052    4  2.85    15 0.139

Best system: 50% of human performance on manual metrics
Best system: 80% of human performance on ROUGE
57
  Pyramid (orig)   Modified     Resp       Rouge-SU4
  14 0.2587        10 0.2052    4  2.85    15 0.139
  17 0.2492        17 0.1972    14 2.8     4  0.134
  15 0.2423        14 0.1908    10 2.65    17 0.1346
  10 0.2379        7  0.1852    15 2.6     19 0.1275
  4  0.2321        15 0.1808    17 2.55    11 0.1259
  7  0.2297        4  0.177     11 2.5     10 0.1278
  16 0.2265        16 0.1722    28 2.45    6  0.1239
  6  0.2197        11 0.1703    21 2.45    7  0.1213
  32 0.2145        6  0.1671    6  2.4     14 0.1264
  21 0.2127        12 0.1664    24 2.4     25 0.1188
  12 0.2126        19 0.1636    19 2.4     21 0.1183
  11 0.2116        21 0.1613    6  2.4     16 0.1218
  26 0.2106        32 0.1601    27 2.35    24 0.118
  19 0.2072        26 0.1464    12 2.35    12 0.116
  28 0.2048        3  0.145     7  2.3     3  0.1198
  13 0.1983        28 0.1427    25 2.2     28 0.1203
  3  0.1949        13 0.1424    32 2.15    27 0.110

61
Significant Differences
  • Manual metrics:
  • Few differences between systems
  • Pyramid: 23 is worse
  • Responsiveness: 23 and 31 are worse
  • Both humans better than all systems
  • Automatic (Rouge-SU4):
  • More differences between systems
  • One human indistinguishable from 5 systems

62
Correlations (Pearson's r, 25 systems)

            Pyr-mod   Resp-1   Resp-2   R-2    R-SU4
  Pyr-orig  0.96      0.77     0.86     0.84   0.80
  Pyr-mod             0.81     0.90     0.90   0.86
  Resp-1                       0.83     0.92   0.92
  Resp-2                                0.88   0.87
  R-2                                          0.98
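
Entries like these are Pearson's r over the 25 systems' per-system scores under two metrics. A minimal sketch, using only the top five systems' Pyramid-original and Rouge-SU4 scores from the table above as sample input (the published correlations use all 25 systems).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two metrics' per-system scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Top five systems' Pyramid-original and Rouge-SU4 scores from the table
pyr_orig = [0.2587, 0.2492, 0.2423, 0.2379, 0.2321]  # systems 14, 17, 15, 10, 4
rouge_su4 = [0.1264, 0.1346, 0.139, 0.1278, 0.134]   # same systems' R-SU4
print(pearson_r(pyr_orig, rouge_su4))
```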
63
Correlations (Pearson's r, 25 systems)
Questionable that responsiveness could be a gold
standard
64
Pyramid and responsiveness
High correlation, but the metrics are not
mutually substitutable
65
Pyramid and Rouge
High correlation, but the metrics are not
mutually substitutable
66
Correlations
  • Original and modified pyramid scores can substitute for each other
  • High correlation between manual and automatic, but automatic is not yet a substitute
  • Similar patterns between pyramid and responsiveness

67
Nightmare
  • A scoring metric that is not stable is used to decide funding
  • Insignificant differences between systems determine funding

68
Is Task Evaluation Nightmare Free?
  • Impact of user interface issues:
  • Can have more impact than the summary
  • Controlling for a proper mix of subjects
  • The number of subjects and the time to carry out the study are large

69
Till Max said "Be still!" and tamed them with the
magic trick
70
Of staring into their yellow eyes without
blinking once. And they were frightened and called
him the most wild thing of all
71
And made him king of all wild things
72
"And now," cried Max, "let the wild rumpus start!"
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
Are we having fun yet? Benefits of evaluation
  • Emergence of evaluation methods:
  • ROUGE
  • Pyramids
  • Nuggetteer
  • Research into characteristics of metrics
  • Analyses of sub-sentential units
  • Paraphrase as a research issue

77
Available Data
  • DUC data sets:
  • 4 years of summary/document set pairs
  • Multi-document summarization training data not available beforehand
  • 4 years of scoring patterns
  • Led to analysis of human summaries
  • Pyramids:
  • Pyramids and peers for 40 topics (DUC04, DUC05)
  • Many more from Nenkova and Passonneau
  • Training data for paraphrase
  • Training data for abstraction -> see systems moving away from pure sentence extraction

78
Wrapping up
79
Lessons Learned
  • Evaluation environment is important
  • Find a task with broad appeal
  • Use an independent evaluator
  • At least a committee
  • Use multiple gold standards
  • Compare text at the content-unit level
  • Evaluate the metrics
  • Look at significant differences

80
Is Evaluation Worth It?
  • DUC: creation of a community
  • From 15 participants in year 1 -> 30 participants in year 5
  • No longer impacts funding
  • Enables research into evaluation
  • At the start, no idea how to evaluate summaries
  • But results do not tell us everything

81
And he sailed back over a year, in and out of
weeks and through a day
82
And into the night of his very own room where he
found his supper waiting for him ... and it was
still warm.