Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises presentation

About This Presentation

Slides: 83

Provided by: Kathleen281

Learn more at: https://www.ling.ohio-state.edu

Transcript and Presenter's Notes

Title: Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises

1
Lessons Learned from Evaluation of Summarization
Systems: Nightmares and Pleasant Surprises

Kathleen McKeown
Department of Computer Science
Columbia University
Major contributors: Ani Nenkova, Becky Passonneau

2
(No Transcript)
3
Questions

What kinds of evaluation are possible?
What are the pitfalls?
Are evaluation metrics fair?
Is real research progress possible?
What are the benefits?
Should we evaluate our systems?

4
What is the feel of the evaluation?

Is it competitive?
Does it foster a feeling of community?
Are the guidelines clearly established ahead of
time?
Are the metrics fair? Do they measure what you
want to measure?

5
(No Transcript)
6
The night Max wore his wolf suit and made
mischief of one kind
7
and another
and another
8
His mother called him 'WILD THING' and he said
'I'LL EAT YOU UP!' so he was sent to bed without
eating anything.
9
DARPA GALE: Global Autonomous Language Exploitation

Three large teams: BBN, IBM, SRI
SRI: UC Berkeley, U Washington, UCSD, Columbia,
NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio
State
Generate responses to open-ended questions
17 templates: definitions, biographies, events,
relationships, reactions, etc.
Using English, Chinese, and Arabic text and
speech, blogs to news
Find all instances when a fact is mentioned
(redundancy)

10
GALE Evaluation

Can systems do at least 50% as well as a human?
If not, the GALE program will not continue
The team that does worst may be cut
Independent evaluator: BAE
Has never done text evaluation before
Has experience with task-based evaluation
Gold Standard
System responses graded by two judges
Relevant facts added to the pool
Granularity of scoring: nuggets
Metrics
Variants of precision/recall, weighted
Document citations
Redundancy

11
Year 1 Sample QA

LIST FACTS ABOUT The Trial of Saddam Hussein
The judge , however, that all people should have
heard voices, the order of a court to solve
technical problems. (Chi)
His account of events surrounding the torture and
execution of more than 140 men and teenage boys
from the Dujail , appeared to do little to
advance the prosecution's goal of establishing
Saddam 's "command responsibility" for the
deaths.
A trial without Saddam could be an embarrassment
for the U.S. government, which has worked hard to
help create a tribunal that would be perceived by
Iraqis as independent and fair.
As the trial got under way, a former secret
police officer testified that he had not received
any orders from Saddam during the investigations
that followed an assassination attempt against
him in Dujail in 1982 .

12
Year 1 Results

F-value (Beta of 1)
Machine average 0.230
Human average 0.353
Machine to Human average 0.678
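
The F-value on this slide can be sketched as the usual weighted harmonic mean of precision and recall. This is a minimal illustration, not the evaluator's actual scoring code; the function name `f_beta` is hypothetical. The slide does not say how the machine-to-human figure was computed. Dividing the two averages gives about 0.65 rather than 0.678, so the reported figure may have been averaged per question, which is an assumption.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 weights them equally)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Averages reported on the slide.
machine_avg = 0.230
human_avg = 0.353

# Ratio of the two averages; this is NOT necessarily how the slide's
# "machine to human" figure was obtained (that method is not stated).
ratio_of_averages = machine_avg / human_avg
print(f"ratio of averages: {ratio_of_averages:.3f}")
```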

13
DUC: Document Understanding Conference

Established and funded by DARPA TIDES
Run by independent evaluator NIST
Open to summarization community
Annual evaluations on common datasets
2001-present
Tasks
Single document summarization
Headline summarization
Multi-document summarization
Multi-lingual summarization
Focused summarization
Update summarization

14
DUC is changing direction again

DARPA GALE effort cutting back participation in
DUC
Considering co-locating with TREC QA
Considering new data sources and tasks

15
DUC Evaluation

Gold Standard
Human summaries written by NIST
From 2 to 9 summaries per input set
Multiple metrics
Manual
Coverage (early years)
Pyramids (later years)
Responsiveness (later years)
Quality questions
Automatic
ROUGE (-1, -2, skip-bigrams, LCS, BE)
Granularity
Manual: sub-sentential elements
Automatic: sentences

16
TREC definition pilot

Long answer to request for a definition
As a pilot, less emphasis on results
Part of TREC QA

17
Evaluation Methods

Pool system responses and break into nuggets
A judge scores nuggets as vital, OK or invalid
Measure information precision and recall
Can a judge reliably determine which facts belong
in a definition?
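
The nugget-based method above can be sketched as follows. This is a simplified illustration under stated assumptions, not the official scorer: recall counts only vital nuggets, precision is approximated with a length allowance per matched nugget, and the allowance of 100 characters and the default beta of 5 are assumptions that do not come from the slides. The names `Nugget` and `nugget_scores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    text: str
    vital: bool  # True = vital, False = OK; invalid nuggets are simply not listed


def nugget_scores(matched, all_nuggets, answer_length,
                  beta=5.0, allowance_per_nugget=100):
    """Return (precision, recall, F) for one response.

    matched: nuggets (vital or OK) the judge found in the response.
    all_nuggets: every vital or OK nugget in the judge's list.
    answer_length: response length in characters.
    """
    vital_total = sum(1 for n in all_nuggets if n.vital)
    vital_found = sum(1 for n in matched if n.vital)
    recall = vital_found / vital_total if vital_total else 0.0

    # Length-based precision: each matched nugget "earns" some characters.
    allowance = allowance_per_nugget * len(matched)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length

    denominator = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f


if __name__ == "__main__":
    a = Nugget("born in 1915", vital=True)
    b = Nugget("led the 1973 coup", vital=True)
    c = Nugget("arrested in London", vital=False)
    print(nugget_scores(matched=[a, c], all_nuggets=[a, b, c], answer_length=400))
```

Because recall depends entirely on which nuggets the judge marks as vital, the reliability question raised on the slide directly affects the score.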

18
Considerations Across Evaluations

Independent evaluator
Not always as knowledgeable as researchers
Impartial determination of approach
Extensive collection of resources
Determination of task
Appealing to a broad cross-section of community
Changes over time
DUC 2001-2002: single and multi-document
DUC 2003: headlines, multi-document
DUC 2004: headlines, multilingual and
multi-document, focused
DUC 2005: focused summarization
DUC 2006: focused and a new task, up for
discussion
How long do participants have to prepare?
When is a task dropped?
Scoring of text at the sub-sentential level

19
Task-based Evaluation

Use the summarization system as a browser to do
another task
Newsblaster: write a report given a broad prompt
DARPA utility evaluation: given a request for
information, use question answering to write
report

20
Task Evaluation

Hypothesis: multi-document summaries enable users
to find information efficiently
Task: fact-gathering given topic and questions
Resembles intelligence analyst task

21
User Study Objectives

Does multi-document summarization help?
Do summaries help the user find information
needed to perform a report writing task?
Do users use information from summaries in
gathering their facts?
Do summaries increase user satisfaction with the
online news system?
Do users create better quality reports with
summaries?
How do full multi-document summaries compare with
minimal 1-sentence summaries such as Google News?

22
User Study Design

Compared 4 parallel news browsing systems
Level 1: Source documents only
Level 2: One-sentence multi-document summaries
(e.g., Google News) linked to documents
Level 3: Newsblaster multi-document summaries
linked to documents
Level 4: Human-written multi-document summaries
linked to documents
All groups write reports given four scenarios
A task similar to analysts'
Can only use Newsblaster for research
Time-restricted

23
User Study Execution

4 scenarios
4 event clusters each
2 directly relevant, 2 peripherally relevant
Average 10 documents/cluster
45 participants
Balance between liberal arts, engineering
138 reports
Exit survey
Multiple-choice and open-ended questions
Usage tracking
Each click logged, on or off-site

24
Geneva Prompt

The conflict between Israel and the Palestinians
has been difficult for government negotiators to
settle. Most recently, implementation of the
road map for peace, a diplomatic effort
sponsored by
Who participated in the negotiations that
produced the Geneva Accord?
Apart from direct participants, who supported the
Geneva Accord preparations and how?
What has the response been to the Geneva Accord
by the Palestinians?

25
Measuring Effectiveness

Score report content and compare across summary
conditions
Compare user satisfaction per summary condition
Compare where subjects took report content from

26
Newsblaster
27
User Satisfaction

More effective than a web search with Newsblaster
Not true with documents only or single-sentence
summaries
Easier to complete the task with summaries than
with documents only
More often enough time with summaries than with documents only
Summaries helped most
5% single-sentence summaries
24% Newsblaster summaries
43% human summaries

28
User Study Conclusions

Summaries measurably improve a news browser's
effectiveness for research
Users are more satisfied with Newsblaster
summaries than with single-sentence
summaries like those of Google News
Users want search
Not included in evaluation

29
Potential Problems
30
That very night in Max's room a forest grew
31
And grew
32
And grew until the ceiling hung with vines and
the walls became the world all around
33
And an ocean tumbled by with a private boat for
Max and he sailed all through the night and day
34
And he sailed in and out of weeks and almost over
a year to where the wild things are
35
And when he came to where the wild things are
they roared their terrible roars and gnashed
their terrible teeth
36
Comparing Text Against Text

Which human summary makes a good gold standard?
Many summaries are good
At what granularity is the comparison made?
When can we say that two pieces of text match?

37
Measuring variation

Types of variation between humans

Translation: same content, different wording
Applications
Summarization: different content??, different wording
Generation: different content??, different wording
38
Human variation: content words (Ani Nenkova)

Summaries differ in vocabulary
Differences cannot be
explained by paraphrase

7 translations x 20 documents
7 summaries x 20 document sets
Faster vocabulary growth in summarization

39
Variation impacts evaluation

Comparing content is hard
All kinds of judgment calls
Paraphrases
VP vs. NP
Ministers have been exchanged
Reciprocal ministerial visits
Length and constituent type
Robotics assists doctors in the medical operating
theater
Surgeons started using robotic assistants

40
Nightmare: only one gold standard

System may have chosen an equally good sentence
but not in the one gold standard
Pinochet arrested in London on Oct 16 at a
Spanish judge's request for atrocities against
Spaniards in Chile.
Former Chilean dictator Augusto Pinochet has
been arrested in London at the request of the
Spanish government
In DUC 2001 (one gold standard), human model had
significant impact on scores (McKeown et al.)
Five human summaries needed to avoid changes in
rank (Nenkova and Passonneau)
DUC2003 data
3 topic sets, 1 highest scoring and 2 lowest
scoring
10 model summaries

41
How many summaries are enough?
42
Scoring

Two main approaches used in DUC
ROUGE (Lin and Hovy)
Pyramids (Nenkova and Passonneau)
Problems
Are the results stable?
How difficult is it to do the scoring?

43
ROUGE: Recall-Oriented Understudy for Gisting
Evaluation
ROUGE: n-gram co-occurrence metrics measuring
content overlap
ROUGE-N = (count of n-gram overlaps between candidate and
model summaries) / (total n-grams in model summaries)
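
The ratio on this slide can be sketched in a few lines. This is a bare-bones illustration of n-gram recall, assuming whitespace tokenization and lowercasing; it is not the official ROUGE toolkit, which adds options such as stemming and stopword removal. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, models: list[str], n: int = 1) -> float:
    """Clipped n-gram overlap with the model summaries, divided by
    the total number of n-grams in the model summaries (a recall)."""
    cand = ngrams(candidate.lower().split(), n)
    overlap = 0
    total = 0
    for model in models:
        ref = ngrams(model.lower().split(), n)
        # An n-gram counts at most as often as it appears in the candidate.
        overlap += sum(min(count, cand[gram]) for gram, count in ref.items())
        total += sum(ref.values())
    return overlap / total if total else 0.0


if __name__ == "__main__":
    model = ["the cat sat on the mat"]
    print(rouge_n("the cat sat", model, n=1))
    print(rouge_n("the cat sat", model, n=2))
```

Because the denominator counts n-grams in the models only, a longer candidate can never lower its own score, which is why the metric is described as recall-oriented.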
44
ROUGE

Experimentation with different units of
comparison: unigrams, bigrams, longest common
subsequence, skip-bigrams, basic elements
Automatic and thus easy to apply
Important to consider confidence intervals when
determining differences between systems
Scores falling within same interval not
significantly different
ROUGE scores place systems into large groups: can
be hard to definitively say one is better than
another
Sometimes results are unintuitive
Multilingual scores as high as English scores
Use in speech summarization shows no
discrimination
Good for training: regardless of intervals, can
see trends
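
The point about confidence intervals can be illustrated with a percentile bootstrap over per-topic scores. This is a sketch under assumptions: the slides do not describe how the intervals were computed, so the resampling scheme, the 95% level, and the function names are all illustrative choices.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Made-up per-topic scores for two hypothetical systems.
    system_x = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.10, 0.15]
    system_y = [0.13, 0.14, 0.12, 0.15, 0.12, 0.17, 0.11, 0.14]
    ci_x = bootstrap_ci(system_x)
    ci_y = bootstrap_ci(system_y)
    print(ci_x, ci_y, intervals_overlap(ci_x, ci_y))
```

Treating overlapping intervals as "not significantly different" is the conservative reading used on the slide; it groups systems rather than ranking them.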

45
Pyramids

Uses multiple human summaries
Information is ranked by its importance
Allows for multiple good summaries
A pyramid is created from the human summaries
Elements of the pyramid are content units
System summaries are scored by comparison with
the pyramid

46
Content units: better study of variation than
sentences

Semantic units
Link different surface realizations with the same
meaning
Emerge from the comparison of several texts

47
Content unit example

S1 Pinochet arrested in London on Oct 16 at a
Spanish judge's request for atrocities against
Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has
been arrested in London at the request of the
Spanish government.
S3 Britain caused international controversy and
Chilean turmoil by arresting former Chilean
dictator Pinochet in London.

48
SCU: A cable car caught fire (Weight 4)

A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a
mountainside tunnel in an alpine resort in
Kaprun, Austria on the morning of November 11,
2000.
C. A cable car pulling skiers and snowboarders
to the Kitzsteinhorn resort, located 60 miles
south of Salzburg in the Austrian Alps, caught
fire inside a mountain tunnel, killing
approximately 170 people.
D. On November 10, 2000, a cable car filled to
capacity caught on fire, trapping 180 passengers
inside the Kitzsteinhorn mountain, located in the
town of Kaprun, 50 miles south of Salzburg in the
central Austrian Alps.

49
SCU: The cause of the fire is unknown (Weight 1)

A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a
mountainside tunnel in an alpine resort in
Kaprun, Austria on the morning of November 11,
2000.
C. A cable car pulling skiers and snowboarders
to the Kitzsteinhorn resort, located 60 miles
south of Salzburg in the Austrian Alps, caught
fire inside a mountain tunnel, killing
approximately 170 people.
D. On November 10, 2000, a cable car filled to
capacity caught on fire, trapping 180 passengers
inside the Kitzsteinhorn mountain, located in the
town of Kaprun, 50 miles south of Salzburg in the
central Austrian Alps.

50
Idealized representation

Tiers of differentially weighted SCUs
Top: few SCUs, high weight
Bottom: many SCUs, low weight

W3
W2
W1
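
Scoring against such a pyramid can be sketched as follows. This is one reading of the method, assuming the score is the total weight of the SCUs a summary expresses divided by the best total weight achievable with the same number of SCUs; passing a different SCU count (for example the average number of SCUs in the human summaries) gives a modified variant. The function names and example weights are hypothetical.

```python
def max_weight(pyramid_weights, n_scus):
    """Best total weight achievable with n_scus content units:
    take the highest-weighted SCUs from the pyramid first."""
    return sum(sorted(pyramid_weights, reverse=True)[:n_scus])


def pyramid_score(peer_weights, pyramid_weights, n_ideal=None):
    """peer_weights: weights of the SCUs found in the system summary
    (0 for content that appears in no human summary).
    n_ideal: SCU count of the ideal summary; defaults to the number
    of SCUs in the system summary itself."""
    observed = sum(peer_weights)
    size = len(peer_weights) if n_ideal is None else n_ideal
    best = max_weight(pyramid_weights, size)
    return observed / best if best else 0.0


if __name__ == "__main__":
    pyramid = [4, 3, 3, 2, 1, 1, 1]   # made-up SCU weights
    peer = [4, 2, 1]                  # weights of SCUs in one system summary
    print(pyramid_score(peer, pyramid))             # same-size ideal summary
    print(pyramid_score(peer, pyramid, n_ideal=5))  # larger ideal summary
```

Any summary that draws only on top-tier SCUs scores 1.0 under the first variant, which is how the method allows for multiple equally good summaries.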
51
Comparison of Scoring Methods in DUC05

Analysis of scores for the 20 pyramid sets
Columbia prepared pyramids
Participants scored systems against pyramids
Comparisons between Pyramid (original, modified),
responsiveness, and ROUGE-SU4
Pyramid score computed from multiple humans
Responsiveness is just one human's judgment
ROUGE-SU4 equivalent to ROUGE-2

52
Creation of pyramids

Done at Columbia for each of 20 out of 50 sets
Primary annotator, secondary checker
Held round-table discussions of problematic
constructions that occurred in this data set
Comma separated lists
Extractive reserves have been formed for managed
harvesting of timber, rubber, Brazil nuts, and
medical plants without deforestation.
General vs. specific
Eastern Europe vs. Hungary, Poland, Lithuania,
and Turkey

53
Characteristics of the Responses

Proportion of SCUs of Weight 1 is large
44% (D324) to 81% (D695)
Mean SCU weight: 1.9
Agreement among human responders is quite low

54
[Chart: SCUs at each weight, plotted against SCU weights]
55
Preview of Results

Manual metrics
Large differences between humans and machines
No single system the clear winner
But a top group identified by all metrics
Significant differences
Different predictions from manual and automatic
metrics
Correlations between metrics
Some correlation but one cannot be substituted
for another
This is good

56
Human performance/Best sys

Pyramid      Modified     Resp       ROUGE-SU4
B  0.5472    B  0.4814    A 4.895    A  0.1722
A  0.4969    A  0.4617    B 4.526    B  0.1552
14 0.2587    10 0.2052    4 2.85     15 0.139

Best system 50% of human performance on manual metrics
Best system 80% of human performance on ROUGE
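
The human-versus-system gap on this slide can be checked with simple arithmetic. The values below are copied from the slide (best human and best system per metric); the dictionary names are only for illustration.

```python
# Best human and best system score for each metric, as listed on the slide.
human_best = {"Pyramid": 0.5472, "Modified": 0.4814, "Resp": 4.895, "ROUGE-SU4": 0.1722}
system_best = {"Pyramid": 0.2587, "Modified": 0.2052, "Resp": 2.85, "ROUGE-SU4": 0.139}

# Best system score as a fraction of the best human score.
ratios = {metric: system_best[metric] / human_best[metric] for metric in human_best}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.0%}")
```

The manual metrics come out between roughly 43% and 58%, and ROUGE-SU4 at about 81%, consistent with the rounded figures on the slide.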
57

Pyramid original   Modified      Resp        Rouge-SU4
Sys  Score         Sys  Score    Sys  Score  Sys  Score
14   0.2587        10   0.2052   4    2.85   15   0.139
17   0.2492        17   0.1972   14   2.8    4    0.134
15   0.2423        14   0.1908   10   2.65   17   0.1346
10   0.2379        7    0.1852   15   2.6    19   0.1275
4    0.2321        15   0.1808   17   2.55   11   0.1259
7    0.2297        4    0.177    11   2.5    10   0.1278
16   0.2265        16   0.1722   28   2.45   6    0.1239
6    0.2197        11   0.1703   21   2.45   7    0.1213
32   0.2145        6    0.1671   6    2.4    14   0.1264
21   0.2127        12   0.1664   24   2.4    25   0.1188
12   0.2126        19   0.1636   19   2.4    21   0.1183
11   0.2116        21   0.1613   6    2.4    16   0.1218
26   0.2106        32   0.1601   27   2.35   24   0.118
19   0.2072        26   0.1464   12   2.35   12   0.116
28   0.2048        3    0.145    7    2.3    3    0.1198
13   0.1983        28   0.1427   25   2.2    28   0.1203
3    0.1949        13   0.1424   32   2.15   27   0.110

61
Significant Differences

Manual metrics
Few differences between systems
Pyramid: 23 is worse
Responsiveness: 23 and 31 are worse
Both humans better than all systems
Automatic (Rouge-SU4)
More differences between systems
One human indistinguishable from 5 systems

62
Correlations: Pearson's, 25 systems
          Pyr-mod  Resp-1  Resp-2  R-2   R-SU4
Pyr-orig  0.96     0.77    0.86    0.84  0.80
Pyr-mod            0.81    0.90    0.90  0.86
Resp-1                     0.83    0.92  0.92
Resp-2                             0.88  0.87
R-2                                      0.98
63
Correlations: Pearson's, 25 systems
Questionable that responsiveness could be a gold
standard
64
Pyramid and responsiveness
High correlation, but the metrics are not
mutually substitutable
65
Pyramid and Rouge
High correlation, but the metrics are not
mutually substitutable
66
Correlations

Original and modified can substitute for each
other
High correlation between manual and automatic,
but automatic not yet a substitute
Similar patterns between pyramid and
responsiveness
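
Why a high correlation does not make two metrics interchangeable can be shown with a small example. The sketch below computes Pearson's r directly from its definition; the scores are made up for illustration and are not taken from the DUC data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for four systems under two metrics.
    metric_a = [1, 2, 3, 4]
    metric_b = [1, 3, 2, 4]  # the two middle systems swap rank
    print(pearson(metric_a, metric_b))
```

Here the correlation is 0.8 even though the two metrics disagree about which of the middle systems is better, which is the kind of disagreement that matters when rankings drive decisions.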

67
Nightmare

Scoring metric that is not stable used to decide
funding
Insignificant differences between systems
determine funding

68
Is Task Evaluation Nightmare Free?

Impact of user interface issues
Can have more impact than the summary
Controlling for proper mix of subjects
Quantity of subjects and time to carry out is
large

69
Till Max said "Be still!" and tamed them with the
magic trick
70
Of staring into their yellow eyes without
blinking once, and they were frightened and called
him the most wild thing of all
71
And made him king of all wild things
72
"And now," cried Max, "let the wild rumpus start!"
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
Are we having fun yet? Benefits of evaluation

Emergence of evaluation methods
ROUGE
Pyramids
Nuggeteer
Research into characteristics of metrics
Analyses of sub-sentential units
Paraphrase as a research issue

77
Available Data

DUC data sets
4 years of summary/document set pairs
Multidocument summarization training data not
available beforehand
4 years of scoring patterns
Led to analysis of human summaries
Pyramids
Pyramids and peers for 40 topics (DUC04, DUC05)
Many more from Nenkova and Passonneau
Training data for paraphrase
Training data for abstraction -> see systems
moving away from pure sentence extraction

78
Wrapping up
79
Lessons Learned

Evaluation environment is important
Find a task with broad appeal
Use independent evaluator
At least a committee
Use multiple gold standards
Compare text at the content unit level
Evaluate the metrics
Look at significant differences

80
Is Evaluation Worth It?

DUC: creation of a community
From 15 participants year 1 -> 30 participants
year 5
No longer impacts funding
Enables research into evaluation
At start, no idea how to evaluate summaries
But, results do not tell us everything

81
And he sailed back over a year, in and out of
weeks and through a day
82
And into the night of his very own room where he
found his supper waiting for him ... And it was
still warm.
