Title: The Pyramid Method at DUC05
1The Pyramid Method at DUC05
- Ani Nenkova
- Becky Passonneau
- Kathleen McKeown
- Other team members: David Elson, Advaith
Siddharthan, Sergey Siegelman
2Overview
- Review of Pyramids (Kathy)
- Characteristics of the responses
- Analyses (Ani)
- Scores and Significant Differences
- Reliability of Pyramid scoring
- Comparisons between annotators
- Impact of editing on scores
- Impact of Weight 1 SCUs
- Correlation with responsiveness and Rouge
- Lessons learned
3Pyramids
- Uses multiple human summaries
- Previous data indicated 5 needed for score
stability
- Information is ranked by its importance
- Allows for multiple good summaries
- A pyramid is created from the human summaries
- Elements of the pyramid are content units
- System summaries are scored by comparison with
the pyramid
4Summarization Content Units
- Near-paraphrases from different human summaries
- Clause or less
- Avoids explicit semantic representation
- Emerges from analysis of human summaries
5SCU A cable car caught fire (Weight 4)
- A. The cause of the fire was unknown.
- B. A cable car caught fire just after entering a
mountainside tunnel in an alpine resort in
Kaprun, Austria on the morning of November 11,
2000.
- C. A cable car pulling skiers and snowboarders
to the Kitzsteinhorn resort, located 60 miles
south of Salzburg in the Austrian Alps, caught
fire inside a mountain tunnel, killing
approximately 170 people.
- D. On November 10, 2000, a cable car filled to
capacity caught on fire, trapping 180 passengers
inside the Kitzsteinhorn mountain, located in the
town of Kaprun, 50 miles south of Salzburg in the
central Austrian Alps.
6SCU The cause of the fire is unknown (Weight 1)
- A. The cause of the fire was unknown.
- B. A cable car caught fire just after entering a
mountainside tunnel in an alpine resort in
Kaprun, Austria on the morning of November 11,
2000.
- C. A cable car pulling skiers and snowboarders
to the Kitzsteinhorn resort, located 60 miles
south of Salzburg in the Austrian Alps, caught
fire inside a mountain tunnel, killing
approximately 170 people.
- D. On November 10, 2000, a cable car filled to
capacity caught on fire, trapping 180 passengers
inside the Kitzsteinhorn mountain, located in the
town of Kaprun, 50 miles south of Salzburg in the
central Austrian Alps.
7SCU The accident happened in the Austrian Alps
(Weight 3)
- A. The cause of the fire was unknown.
- B. A cable car caught fire just after entering a
mountainside tunnel in an alpine resort in
Kaprun, Austria on the morning of November 11,
2000.
- C. A cable car pulling skiers and snowboarders
to the Kitzsteinhorn resort, located 60 miles
south of Salzburg in the Austrian Alps, caught
fire inside a mountain tunnel, killing
approximately 170 people.
- D. On November 10, 2000, a cable car filled to
capacity caught on fire, trapping 180 passengers
inside the Kitzsteinhorn mountain, located in the
town of Kaprun, 50 miles south of Salzburg in the
central Austrian Alps.
8Idealized representation
- Tiers of differentially weighted SCUs
- Top: few SCUs, high weight
- Bottom: many SCUs, low weight
[Figure: pyramid with tiers labeled W3, W2, W1 from top to bottom]
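The tier structure can be sketched in a few lines. This is only an illustration: it assumes each SCU's weight is the number of human summaries that express it, and `build_tiers` and `scu_weights` are hypothetical names. The three SCU labels and weights are the ones shown on the cable car slides.

```python
from collections import defaultdict


def build_tiers(scu_weights):
    """Group SCU labels into tiers keyed by weight, highest weight first."""
    tiers = defaultdict(list)
    for label, weight in scu_weights.items():
        tiers[weight].append(label)
    # Sort by weight so the heaviest (top) tier comes first.
    return dict(sorted(tiers.items(), reverse=True))


# SCU labels and weights taken from the cable car example slides.
scu_weights = {
    "A cable car caught fire": 4,
    "The accident happened in the Austrian Alps": 3,
    "The cause of the fire is unknown": 1,
}

tiers = build_tiers(scu_weights)
for weight, labels in tiers.items():
    print(f"W{weight}: {labels}")
```

In a full pyramid the low tiers would hold many more SCUs than the top tiers, which is what gives the structure its shape.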
9Creation of pyramids
- Done for each of 20 out of 50 sets
- Primary annotator, secondary checker
- Held round-table discussions of problematic
constructions that occurred in this data set - Comma separated lists
- Extractive reserves have been formed for managed
harvesting of timber, rubber, Brazil nuts, and
medical plants without deforestation. - General vs. specific
- Eastern Europe vs. Hungary, Poland, Lithuania,
and Turkey
10Characteristics of the Responses
- Proportion of SCUs of Weight 1 is large
- 44% (D324) to 81% (D695)
- Mean SCU weight: 1.9
- Agreement among human responders is quite low
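The two statistics on this slide are simple to compute once a pyramid's SCU weights are listed. The sketch below uses a made-up list of weights (not DUC05 data), and `weight_profile` is a hypothetical helper name.

```python
def weight_profile(weights):
    """Return (mean SCU weight, share of SCUs that have weight 1)."""
    mean_weight = sum(weights) / len(weights)
    share_w1 = sum(1 for w in weights if w == 1) / len(weights)
    return mean_weight, share_w1


# Hypothetical pyramid with ten SCUs; the values are invented for illustration.
example_weights = [4, 3, 2, 2, 1, 1, 1, 1, 1, 1]

mean_weight, share_w1 = weight_profile(example_weights)
print(f"mean SCU weight: {mean_weight:.2f}")
print(f"share of weight-1 SCUs: {share_w1:.0%}")
```

A large weight-1 share means most content units were chosen by only one human summarizer, which is one way to read the low agreement noted above.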
11% of SCUs at each weight
[Figure: percentage of SCUs at each weight; x-axis: SCU weights]
12Pyramids DUC 2003
- 100 word summaries (vs. 250 word)
- 10 500-word articles per cluster (vs. 30 720-word
articles)
- 3 clusters (vs. 20 clusters)
- Mean SCU Weight (7 models)
- 2005 avg 1.9
- 2003 avg 2.4
- Proportion of SCUs of W1
- 2005 avg 60%, 44% to 81%
- 2003 avg 40%, 37% to 47%
13DUC03 DUC05
[Figure: plots for DUC03 and DUC05]
14Computing pyramid scores: Ideally informative
summary
- Does not include an SCU from a lower tier unless
all SCUs from higher tiers are included as well
20Original Pyramid Score
- SCORE = D/MAX
- D = Sum of the weights of the SCUs in a summary
- MAX = Sum of the weights of the SCUs in an ideally
informative summary
- Measures the proportion of good information in
the summary: precision
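A minimal sketch of the original score, under one stated assumption: the ideally informative summary has as many SCUs as the peer summary and is filled from the top tiers downward. The function name and the example weights are hypothetical.

```python
def original_pyramid_score(peer_weights, pyramid_weights):
    """SCORE = D / MAX for one peer summary.

    peer_weights: weights of the SCUs found in the peer summary
        (a content unit absent from the pyramid can be passed as 0).
    pyramid_weights: weights of all SCUs in the pyramid.
    """
    d = sum(peer_weights)
    n = len(peer_weights)
    # Assumption: the ideal summary holds n SCUs, taken from the
    # highest tiers first, so MAX is the sum of the n largest weights.
    max_weight = sum(sorted(pyramid_weights, reverse=True)[:n])
    return d / max_weight if max_weight else 0.0


# Hypothetical pyramid built from 4 model summaries.
pyramid_weights = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]
# Hypothetical peer summary expressing three SCUs.
peer_weights = [4, 2, 1]

score = original_pyramid_score(peer_weights, pyramid_weights)
print(f"original pyramid score: {score:.2f}")  # D = 7, MAX = 4 + 3 + 3 = 10
```

Because MAX depends on how many SCUs the peer itself contains, the score rewards picking high-weight content for the space used, which is why the slide reads it as precision.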
21Modified pyramid score (recall)
- EN = average number of SCUs in the human models
- This is the number of content units humans chose
to convey about the story
- W = weight of a maximally informative
summary of size EN
- D/W is the modified pyramid score
- Shows the proportion of expected good information
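The modified score differs only in the denominator. The sketch below assumes EN is rounded to a whole number of SCUs before the top tiers are summed; the slides do not say how a fractional EN is handled, and all names and numbers here are hypothetical.

```python
def modified_pyramid_score(peer_weights, pyramid_weights, model_scu_counts):
    """Modified score D / W, with W sized by the human models.

    model_scu_counts: number of SCUs in each human model summary.
    """
    d = sum(peer_weights)
    # EN: average number of SCUs per human model
    # (rounding to an integer is an assumption of this sketch).
    en = round(sum(model_scu_counts) / len(model_scu_counts))
    # W: weight of a maximally informative summary with EN SCUs.
    w = sum(sorted(pyramid_weights, reverse=True)[:en])
    return d / w if w else 0.0


# Hypothetical pyramid from 4 models that each contain 5 SCUs.
pyramid_weights = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]
model_scu_counts = [5, 5, 5, 5]
peer_weights = [4, 2, 1]

score = modified_pyramid_score(peer_weights, pyramid_weights, model_scu_counts)
print(f"modified pyramid score: {score:.2f}")  # D = 7, W = 4+3+3+2+2 = 14
```

Since W no longer depends on the peer's own SCU count, annotators do not need to count the peer's unmatched content units to compute it.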
22Scoring Methods
- Presents scores for the 20 pyramid sets
- Recompute Rouge for comparison
- We compute Rouge using only 7 models
- 8 and 9 reserved for computing human performance
- Best because of significant topic effect
- Comparisons between Pyramid (original, modified),
responsiveness, and Rouge-SU4
- Pyramid score computed from multiple humans
- Responsiveness is just one human's judgment
- Rouge-SU4 equivalent to Rouge-2
23Preview of Results
- Manual metrics
- Large differences between humans and machines
- No single system the clear winner
- But a top group identified by all metrics
- Significant differences
- Different predictions from manual and automatic
metrics
- Correlations between metrics
- Some correlation but one cannot be substituted
for another
- This is good
24Human performance/Best sys
Pyramid      Modified     Resp       ROUGE-SU4
B   0.5472   B   0.4814   A  4.895   A   0.1722
A   0.4969   A   0.4617   B  4.526   B   0.1552
14  0.2587   10  0.2052   4  2.85    15  0.139
Best system: 50% of human performance on manual
metrics
Best system: 80% of human performance on
ROUGE
25
Pyramid original   Modified      Resp       Rouge-SU4
14  0.2587         10  0.2052    4   2.85   15  0.139
17  0.2492         17  0.1972    14  2.8    4   0.134
15  0.2423         14  0.1908    10  2.65   17  0.1346
10  0.2379         7   0.1852    15  2.6    19  0.1275
4   0.2321         15  0.1808    17  2.55   11  0.1259
7   0.2297         4   0.177     11  2.5    10  0.1278
16  0.2265         16  0.1722    28  2.45   6   0.1239
6   0.2197         11  0.1703    21  2.45   7   0.1213
32  0.2145         6   0.1671    6   2.4    14  0.1264
21  0.2127         12  0.1664    24  2.4    25  0.1188
12  0.2126         19  0.1636    19  2.4    21  0.1183
11  0.2116         21  0.1613    6   2.4    16  0.1218
26  0.2106         32  0.1601    27  2.35   24  0.118
19  0.2072         26  0.1464    12  2.35   12  0.116
28  0.2048         3   0.145     7   2.3    3   0.1198
13  0.1983         28  0.1427    25  2.2    28  0.1203
3   0.1949         13  0.1424    32  2.15   27  0.110
29Significant Differences
- Manual metrics
- Few differences between systems
- Pyramid: 23 is worse
- Responsiveness: 23 and 31 are worse
- Both humans better than all systems
- Automatic (Rouge-SU4)
- Many differences between systems
- One human indistinguishable from 5 systems
30Multiple and pairwise comparisons
- Multiple comparisons
- Tukey's method
- Control for the experiment-wise type I error
- Show fewer significant differences
- Pairwise comparisons
- Wilcoxon paired test
- Controls the error for individual comparisons
- Appropriate for seeing how your system did during development
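For a development-time pairwise check, a Wilcoxon signed-rank test can be sketched with the standard library alone. This is a simplified version, not the procedure used for the DUC05 analysis: it assumes scores are paired per document set, uses the normal approximation for the p-value, and applies no tie or continuity correction. The scores are invented.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, two-sided p) for paired samples x and y."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation to the null distribution of the statistic.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical per-set scores for two systems on the same eight sets.
system_x = [0.30, 0.25, 0.40, 0.35, 0.20, 0.45, 0.33, 0.28]
system_y = [0.20, 0.27, 0.31, 0.30, 0.21, 0.30, 0.29, 0.20]

w_plus, w_minus, p_value = wilcoxon_signed_rank(system_x, system_y)
print(w_plus, w_minus, round(p_value, 4))
```

With so few pairs the normal approximation is rough; an exact test, or a library implementation, would be preferable for anything beyond a quick check. Running many such pairwise tests does not control the experiment-wise error, which is the role of the multiple-comparison procedure above.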
31Peer / Better than
Peer: 21 32 6 12 19 11 16 4 15 7 14 17 10 A B
Better than: 23 23 23 23 23 23 23 23 23 23 23 23 20 23 20 23 20 30 24 31 1 27 25 28 13 26 3 21 32 6 12 19 11 16 4 15 7 14 17 10 23 20 30 24 31 1 27 25 28 13 26 3 21 32 6 12 19 11 16 4 15 7 14 17 10
- Modified pyramid significant differences
- One system accounts for most of the differences
- Humans significantly better than all systems
32
Peer: 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4 B A
Better than: 23 23 23 23 23 23 23 23 23 23 31 23 31 23 31 23 31 23 31 23 31 23 31 23 31 23 31 1 23 31 1 23 31 1 30 26 13 20 23 31 1 30 26 13 20 3 23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4 23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4
- Responsiveness-1 significant differences
- Differences primarily between 2 systems
- Differences between humans and each system
33
Peer: 16 12 15 28 3 7 4 14 17 10 B A
Better than: 23 23 23 23 23 23 23 23 31 20 23 31 20 23 31 20 23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4 23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4
- Responsiveness-2
- Similar shape to original
34
Peer: 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6 19 10 17 4 15 B A
Better than: 23 23 23 23 23 20 23 20 31 23 20 31 23 20 31 23 20 31 23 20 31 23 20 31 23 20 31 23 20 31 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6 23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6 19 10 17 4 15
- Skip-bigram significant differences
- Many more differences between systems than any
manual metric
- No difference between human and 5 systems
36Pairwise comparisons: Modified Pyramid
10 17 14 7 15 4 16 11 19 12 6 32 21 3 26 13 28 25 27 31 24 30 20 23 3 25 27 24 30 20 23 25 27 1 24 30 20 23 13 25 27 31 24 30 20 23 3 25 27 1 24 30 20 23 25 27 31 24 30 20 23 24 30 23 24 30 23 24 30 23 30 23 31 30 23 24 30 20 23 24 30 23 30 23 23 23 30 20 23
37Agreement between annotators
                   Overall  Low  High
Percent Agreement  95       90   96
Kappa              .57      .46  .62
Alpha              .57      .41  .59
Alpha-Dice         .67      .49  .68
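The gap between percent agreement and kappa in the table is what chance correction tends to produce when one label dominates. The sketch below computes Cohen's kappa, one common formulation; the slides do not say which kappa variant or which unit of annotation was used, and the labels are invented.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(a)
    p_observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Chance agreement from each annotator's own label distribution.
    p_expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical binary decisions on 100 items: 3 joint positives,
# 2 + 2 disagreements, 93 joint negatives.
annotator_1 = [1] * 3 + [1] * 2 + [0] * 2 + [0] * 93
annotator_2 = [1] * 3 + [0] * 2 + [1] * 2 + [0] * 93

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

In this invented case the annotators agree on 96% of items, yet kappa is far lower because most of that agreement is on the dominant label.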
38Editing of participant annotations
- To correct obvious errors
- Ensures uniform checking
- Predominantly involved correcting the splitting of
non-matching SCUs
- Average paired differences
- Original 0.0043
- Modified 0.0005
- Average magnitude of the difference
- Original 0.0115
- Modified 0.0032
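The two summaries on this slide differ in whether signs cancel. A minimal sketch with invented scores (`paired_difference_summary` is a hypothetical helper):

```python
def paired_difference_summary(before, after):
    """Return (mean signed difference, mean absolute difference)."""
    diffs = [b - a for a, b in zip(before, after)]
    mean_diff = sum(diffs) / len(diffs)
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    return mean_diff, mean_abs_diff


# Hypothetical scores for four peers before and after editing.
before_editing = [0.20, 0.25, 0.18, 0.30]
after_editing = [0.21, 0.24, 0.18, 0.32]

mean_diff, mean_abs_diff = paired_difference_summary(before_editing, after_editing)
print(f"average paired difference: {mean_diff:.4f}")
print(f"average magnitude of difference: {mean_abs_diff:.4f}")
```

The signed average shows whether editing moved scores systematically in one direction; the magnitude shows how much individual scores changed regardless of direction.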
39Excluding weight 1 SCUs
- Removing weight 1 SCUs improves agreement
- Kappa 0.64 (was 0.57)
- Annotating without weight 1 has negligible impact
on scores
- Set D324 done without weight 1 SCUs
- Average magnitude of paired differences
- On average 0.07 difference
40Correlations: Pearson's, 25 systems
          Pyr-mod  Resp-1  Resp-2  R-2   R-SU4
Pyr-orig  0.96     0.77    0.86    0.84  0.80
Pyr-mod            0.81    0.90    0.90  0.86
Resp-1                     0.83    0.92  0.92
Resp-2                             0.88  0.87
R-2                                      0.98
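Each cell in the table is a Pearson correlation over per-system scores. The sketch below shows the computation on the 17 systems whose original and modified pyramid scores both appear in the ranked score table earlier in the deck. It covers only that subset, so it need not reproduce the 0.96 reported for all 25 systems; the variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# System ID -> score, copied from the ranked score table (top 17 only).
pyr_orig = {14: 0.2587, 17: 0.2492, 15: 0.2423, 10: 0.2379, 4: 0.2321,
            7: 0.2297, 16: 0.2265, 6: 0.2197, 32: 0.2145, 21: 0.2127,
            12: 0.2126, 11: 0.2116, 26: 0.2106, 19: 0.2072, 28: 0.2048,
            13: 0.1983, 3: 0.1949}
pyr_mod = {10: 0.2052, 17: 0.1972, 14: 0.1908, 7: 0.1852, 15: 0.1808,
           4: 0.177, 16: 0.1722, 11: 0.1703, 6: 0.1671, 12: 0.1664,
           19: 0.1636, 21: 0.1613, 32: 0.1601, 26: 0.1464, 3: 0.145,
           28: 0.1427, 13: 0.1424}

# Correlate only systems that have a score under both metrics.
shared = sorted(set(pyr_orig) & set(pyr_mod))
r = pearson([pyr_orig[s] for s in shared], [pyr_mod[s] for s in shared])
print(f"Pearson r over {len(shared)} systems: {r:.2f}")
```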
41Correlations: Pearson's, 25 systems
Questionable that responsiveness could be a gold
standard
42Pyramid and responsiveness
High correlation, but the metrics are not
mutually substitutable
43Pyramid and Rouge
High correlation, but the metrics are not
mutually substitutable
44Lessons Learned
- Comparing content is hard
- All kinds of judgment calls
- We didn't evaluate the NIST assessors in previous
years
- Paraphrases
- VP vs. NP
- Ministers have been exchanged
- Reciprocal ministerial visits
- Length and constituent type
- Robotics assists doctors in the medical operating
theater
- Surgeons started using robotic assistants
45Modified scores better
- Easier peer annotation
- Can drop weight 1 SCUs
- Better agreement
- No emphasis on splitting non-matching SCUs
46Agreement between annotators
- Participants can perform peer annotation reliably
- Absolute difference between scores
- Original 0.0555
- Modified 0.0617
- Empirical prediction of difference 0.06 (HLT 2004)
47Correlations
- Original and modified can substitute for each
other
- High correlation between manual and automatic,
but automatic not yet a substitute
- Similar patterns between pyramid and
responsiveness
48Current Directions
- Automated identification of SCUs (Harnly et al.
05)
- Applied to DUC05 pyramid data set
- Correlation of .91 with modified pyramid scores
49Questions
- What was the experience annotating pyramids?
- Does it offer insight into the problem?
- Are people willing to do it again?
- Would you have been willing to go through
training?
- If you've done pyramid analysis, can you share
your insights?
52Correlations of Scores on Matched Sets
53SCU Weight by Cardinality(Ten pyramids)
54Mean SCU Weight(Ten pyramids)