Title: Designing Test-Beds for General Anaphora Resolution
1Designing Test-Beds for General Anaphora Resolution
- Oana Postolache
- oana_at_coli.uni-sb.de
- University of Saarland, Saarbrücken, Germany
- Al. I. Cuza University of Iasi, Romania
- Dan Cristea
- dcristea_at_infoiasi.ro
- Al. I. Cuza University of Iasi, Romania
- ICS - Romanian Academy, the Iasi Branch, Romania
2Motivation
- AR is a key component of other NLP processes (e.g. summarisation, IE, Q/A)
- In the larger setting it is often important to measure the degree to which a component degrades the overall performance of the system
- E.g. the detection of markables alone, the AR component alone, etc.
3Aims
- Propose a methodology for detecting bottlenecks in a pipe-line NLP system
- Experiment with an architecture made of a markable detection module and an AR module
- Propose a methodology for evaluating the behaviour of such a system when markables are not given
- Report recent results of a markable detection module and an AR module on two types of input
4Evaluation of a minimum AR system
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
5Evaluation of a minimum AR system
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
- Test the whole system globally
- Test the RE-extractor
- Test only the AR-engine
6Our corpora
- A plain-text corpus of approx. 19,500 words in 1,966 sentences, extracted from Orwell's novel 1984 (Orwell, 1949)
- A corpus manually annotated for syntactic structure, containing approx. 6,250 words in 281 sentences, extracted from the English Penn Treebank (Marcus et al., 1994)
7Markables
- Generally conformant with MUC-7 and ACE criteria
- Differences
  - do not include relative clauses
  - each term of an apposition is taken separately (Big Brother, the primal traitor)
  - conjoined expressions are annotated individually (John and Mary, hills and mountains)
  - modifying nouns appearing in noun-noun modification are not marked separately (glass doors, prison food, the junk bond market)
8Markables
- What do we mark?
  - noun phrases
    - definite (the principle, the flying object)
    - indefinite (a book, a future star)
    - undetermined (sole guardian of truth)
  - names (Winston Smith, The Ministry of Love)
  - dates (April)
  - currency expressions ($40)
  - percentages (48%)
  - pronouns
    - personal (I, you, he, him, she, her, it, they, them)
    - possessive (his, her, hers, its, their, theirs)
    - reflexive (himself, herself, itself, themselves)
    - demonstrative (this, that, these, those)
    - wh-pronouns when they replace an entity (which, who, whom, whose, that)
  - numerals
    - when they refer to entities (four of them, the first, the second)
9The Orwell corpus
- Chapters 1, 2, 3 and 5 from George Orwell's Nineteen Eighty-Four
- Automatic detection of markables
  - POS-tagging
  - FDG parser
  - markable: any construction dominated by a noun/pronoun
  - detection of head and lemma (given)
  - deletion of relative clauses
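The markable detection steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Token` record, the coarse POS tags (`N`, `PRON`, ...) and the `relcl` dependency label are hypothetical stand-ins for whatever the FDG parser actually outputs.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int      # position in the sentence (equal to the list index)
    form: str     # word form
    pos: str      # assumed coarse tag: "N", "PRON", "V", "DET", "A"
    head: int     # index of the governing token, -1 for the root
    deprel: str   # assumed label; "relcl" marks the verb of a relative clause

NOMINAL = {"N", "PRON"}

def dominated(tokens, root):
    """Indices dominated by `root`, skipping relative-clause subtrees."""
    keep = [root]
    for tok in tokens:
        if tok.head == root and tok.deprel != "relcl":
            keep.extend(dominated(tokens, tok.idx))
    return sorted(keep)

def extract_markables(tokens):
    """Return one markable (head word plus dominated words) per noun/pronoun."""
    markables = []
    for tok in tokens:
        if tok.pos in NOMINAL:
            span = dominated(tokens, tok.idx)
            text = " ".join(tokens[i].form for i in span)
            markables.append({"head": tok.form, "text": text})
    return markables

# "a blond boy who was playing", annotated by hand for this example
sentence = [
    Token(0, "a", "DET", 2, "det"),
    Token(1, "blond", "A", 2, "mod"),
    Token(2, "boy", "N", -1, "root"),
    Token(3, "who", "PRON", 5, "subj"),
    Token(4, "was", "V", 5, "aux"),
    Token(5, "playing", "V", 2, "relcl"),
]
markables = extract_markables(sentence)
```

In this sketch the relative clause is cut out of the span of "boy", while the wh-pronoun inside it still becomes a markable of its own.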
10The Orwell corpus dimension
11The Penn Treebank corpus
- 7 files from WSJ
- Extraction of markables from the PTB-style constituency trees
- Collins rules to extract head
- WordNet script for lemma
- Dependency links between words
12The Penn Treebank corpus
13AR-engine: the architecture
14Terminology
[Diagram: text layer with REa, REb, REc, REd, ..., REx; projection layer with PSx; semantic layer with DE1, ..., DEj, ..., DEm]
17What is an AR model?
[Diagram: text layer with REa, REb, REc, REd, ..., REx; projection layer with PSx; semantic layer with DE1, ..., DEj, ..., DEm]
18Phases of the engine
projection phase
19Phases of the engine
proposing/evoking phase
[Diagram: REa]
20Phases of the engine
proposing/evoking phase
[Diagram: text layer, projection layer, semantic layer]
21Phases of the engine
completion phase
[Diagram: REa]
22Phases of the engine
completion phase
[Diagram: REa, DEa]
23Phases of the engine
completion phase
[Diagram: text layer, projection layer, semantic layer]
25Phases of the engine
re-evaluation phase
[Diagram: text layer with REb, REa, REc; projection layer with PSb, PSc; semantic layer with DEa; one link marked with a question mark]
26Phases of the engine
re-evaluation phase
[Diagram: text layer with REb, REa, REc; projection layer with PSb, PSc; semantic layer with DEa]
30Phases of the engine
re-evaluation phase
[Diagram: text layer with REb, REa, REc; projection layer; semantic layer with DEa]
31Our model: primary attributes
- Lexical and morphological
  - lemma
  - number
  - POS
  - headForm
- Syntactic
  - synt-role
  - dependency-link
  - npText
  - includedNPs
  - isDefinite, isIndefinite
  - predNameOf
- Semantic
  - isMaleName, isFemaleName, isFamilyName, isPerson
  - HeSheItThey
- Positional
  - offset
  - sentenceID
32Our model: knowledge sources
- For each attribute there is a knowledge source that fetches the value using:
  - the POS tagger output
  - the FDG structure
  - large name databases
  - the WordNet hierarchy
  - punctuation
33Knowledge sources - HeSheItThey
- HeSheItThey: Phe, Pshe, Pit, Pthey
- for pronouns: straightforward
- for NPs:
  - n = number of synsets of the head
  - f = number of synsets which are hyponyms of <female>
  - m = number of synsets which are hyponyms of <male>
  - p = number of synsets which are hyponyms of <person>
  - If the NP is plural: Phe = 0, Pshe = 0, Pit = 0, Pthey = 1
  - Else: Phe, Pshe and Pit are derived from n, f, m and p; Pthey = 0
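The HeSheItThey attribute can be illustrated with a small sketch. Only the plural case and Pthey = 0 for singular NPs come from the slide. The singular-case formulas for Phe, Pshe and Pit are not given above, so the ones used here are purely an illustrative assumption: ungendered person synsets are split evenly between "he" and "she", and non-person synsets count towards "it". The function name and the precondition f + m <= p <= n are also assumptions.

```python
def he_she_it_they(n, f, m, p, is_plural):
    """Return (Phe, Pshe, Pit, Pthey) for the head of an NP.

    n: number of synsets of the head
    f, m, p: how many of them are hyponyms of <female>, <male>, <person>
    Assumes f + m <= p <= n.
    """
    if is_plural:
        return (0.0, 0.0, 0.0, 1.0)      # as stated on the slide
    if n == 0:
        return (0.0, 0.0, 1.0, 0.0)      # ASSUMPTION: unknown heads default to "it"
    # ASSUMPTION: illustrative weighting, not the authors' formulas
    ungendered = p - f - m               # person synsets with no explicit gender
    p_he = (m + ungendered / 2) / n
    p_she = (f + ungendered / 2) / n
    p_it = (n - p) / n
    return (p_he, p_she, p_it, 0.0)      # Pthey = 0 for singular NPs

# A head with 4 synsets: 2 persons, one of which is explicitly male
example = he_she_it_they(n=4, f=0, m=1, p=2, is_plural=False)
```

Under these assumptions the four values always sum to 1, so they can be read as a distribution over the pronoun classes that might later refer to the NP.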
34Knowledge sources - wh
- Source for detecting the referent of a wh-pronoun
- Case 1
  - I saw a blond boy who was playing in the garden.
- Case 2
  - The colour of the chair which was underneath the table
  - The atmosphere of happiness which she carried with her.
35Our model: rules
- Demolishing rules
  - IncludingRule: prohibits coreference between nested REs
- Certifying rules
  - PredNameRule
  - ProperNameRule
- Promoting/demoting rules
  - HeSheItTheyRule
  - RoleRule
  - NumberRule
  - LemmaRule
  - PersonRule
  - SynonymyRule
  - HypernymyRule
  - WordnetChainRule
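One way the three rule families above could interact is sketched below. It assumes that a demolishing rule vetoes a candidate, a certifying rule accepts it outright, and promoting/demoting rules add to or subtract from a score. The rule bodies, the weights, the threshold and the dictionary fields are all hypothetical, and candidates are simplified to earlier markables rather than discourse entities.

```python
def nested(a, b):
    """True if one token span contains the other."""
    return ((a["start"] <= b["start"] and b["end"] <= a["end"]) or
            (b["start"] <= a["start"] and a["end"] <= b["end"]))

def including_rule(expr, cand):        # demolishing: nested REs cannot corefer
    return nested(expr, cand)

def proper_name_rule(expr, cand):      # certifying (assumed condition)
    return expr["is_name"] and cand["is_name"] and expr["lemma"] == cand["lemma"]

def number_rule(expr, cand):           # promoting/demoting, hypothetical weight
    return 1.0 if expr["number"] == cand["number"] else -1.0

def lemma_rule(expr, cand):            # promoting only, hypothetical weight
    return 2.0 if expr["lemma"] == cand["lemma"] else 0.0

DEMOLISHING = [including_rule]
CERTIFYING = [proper_name_rule]
SCORING = [number_rule, lemma_rule]

def resolve(expr, candidates, threshold=2.0):
    """Pick an antecedent for `expr`, or None if no candidate is good enough."""
    scored = []
    for cand in candidates:
        if any(rule(expr, cand) for rule in DEMOLISHING):
            continue                   # vetoed
        if any(rule(expr, cand) for rule in CERTIFYING):
            return cand                # accepted without scoring
        scored.append((sum(rule(expr, cand) for rule in SCORING), cand))
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None

winston = dict(id=1, lemma="Winston", number="sg", is_name=True, start=0, end=0)
door = dict(id=2, lemma="door", number="sg", is_name=False, start=3, end=4)
later_door = dict(id=3, lemma="door", number="sg", is_name=False, start=10, end=11)
inner_door = dict(id=4, lemma="door", number="sg", is_name=False, start=3, end=3)

antecedent = resolve(later_door, [winston, door])
```

Here `later_door` is linked to `door` because number and lemma both agree, whereas `inner_door` would be blocked from `door` by the nesting veto.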
36Our model: domain of referential accessibility
37Evaluation of the RE-extractor
Test the RE-extractor
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
- When do a gold markable and a test markable match?
38Evaluation of the RE-extractor
Test the RE-extractor
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
- When do a gold markable and a test markable match?
  - head matching (HM): if they have the same head
[Diagram: a gold markable compared with a test markable]
39Evaluation of the RE-extractor
Test the RE-extractor: P, R, F
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
- When do a gold markable and a test markable match?
  - partial matching (PM): if they have the same head and the mutual overlap is higher than 50% (compared to the longest span)
[Diagram: gold span of length l1, test span of length l2; l2 / l1 > 0.5]
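The two matching criteria and the resulting P, R, F can be sketched as follows. Spans are assumed to be inclusive token offsets and heads are assumed to be identified by their token offset. The greedy one-to-one pairing of gold and test markables is a simplification introduced for this example.

```python
def overlap_ratio(gold_span, test_span):
    """Shared length divided by the length of the longer span."""
    (gs, ge), (ts, te) = gold_span, test_span
    shared = max(0, min(ge, te) - max(gs, ts) + 1)
    longest = max(ge - gs + 1, te - ts + 1)
    return shared / longest

def head_match(gold, test):
    """HM: the two markables have the same head."""
    return gold["head"] == test["head"]

def partial_match(gold, test):
    """PM: same head and mutual overlap above 50% of the longest span."""
    return head_match(gold, test) and overlap_ratio(gold["span"], test["span"]) > 0.5

def precision_recall_f(gold, test, match):
    """Pair each gold markable with at most one unused test markable."""
    used, matched = set(), 0
    for g in gold:
        for j, t in enumerate(test):
            if j not in used and match(g, t):
                used.add(j)
                matched += 1
                break
    p = matched / len(test) if test else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [{"head": 2, "span": (0, 2)}, {"head": 7, "span": (4, 9)}]
test = [{"head": 2, "span": (1, 2)}, {"head": 7, "span": (7, 8)},
        {"head": 12, "span": (11, 12)}]

hm_scores = precision_recall_f(gold, test, head_match)
pm_scores = precision_recall_f(gold, test, partial_match)
```

In this toy example both gold markables match under HM, but the second one fails PM because the test span covers only 2 of the 6 gold tokens.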
40Evaluation of the AR-engine
- Same set of markables (on the identity-of-head criterion)
- For each anaphor in the gold:
  - If it belongs to a chain that doesn't contain any other anaphor, then we look in the test set to see if it belongs to a similar trivial chain, in which case it takes the value 1
[Diagram: anaphor i forms a trivial chain in the test as well; ci = 1]
41Evaluation of the AR-engine
- Same set of markables (on the identity-of-head criterion)
- For each anaphor in the gold:
  - If it belongs to a chain that doesn't contain any other anaphor, then we look in the test set to see if it belongs to a similar trivial chain; otherwise it gets the value 0
[Diagram: anaphor i does not form a trivial chain in the test; ci = 0]
42Evaluation of the AR-engine
- Same set of markables (on the identity-of-head criterion)
- For each anaphor in the gold:
  - If the anaphor belongs to a chain containing n other anaphors, then we look in the test set and count how many of these n anaphors belong to the chain corresponding to the current test-set anaphor (we denote this number by m). The ratio m/n is the value assigned to the current anaphor.
[Diagram: anaphor i; the other gold chain members are marked 1, 1, 1; in the test they are marked 0, 1, 1; ci = 2/3]
43Evaluation of the AR-engine
- Same set of markables (on the identity-of-head criterion)
- For each anaphor in the gold:
  - If the anaphor belongs to a chain containing n other anaphors, then we look in the test set and count how many of these n anaphors belong to the chain corresponding to the current test-set anaphor (we denote this number by m). The ratio m/n is the value assigned to the current anaphor.
- Then we sum these values over all anaphors and divide by the number of anaphors: Σ ci / N
[Diagram: anaphor i; the other gold chain members are marked 1, 1, 1; in the test they are marked 0, 1, 1; ci = 2/3]
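The per-anaphor value ci and the global score Σ ci / N described above can be sketched in a few lines. The sketch assumes that chains are given as lists of markable identifiers, that every markable counts as an anaphor, and that a markable absent from all chains forms a trivial chain; the function names are made up for the example.

```python
def chain_of(markable, chains):
    """Return the chain containing `markable`, or a trivial one-element chain."""
    for chain in chains:
        if markable in chain:
            return set(chain)
    return {markable}

def anaphor_score(i, gold_chains, test_chains):
    """ci: 1 or 0 for trivial gold chains, m/n otherwise."""
    gold_others = chain_of(i, gold_chains) - {i}
    test_others = chain_of(i, test_chains) - {i}
    if not gold_others:                       # trivial chain in the gold
        return 1.0 if not test_others else 0.0
    m = len(gold_others & test_others)        # gold partners also found in test
    return m / len(gold_others)

def ar_score(gold_chains, test_chains):
    """Sum of ci over all gold anaphors, divided by their number N."""
    anaphors = [m for chain in gold_chains for m in chain]
    if not anaphors:
        return 0.0
    total = sum(anaphor_score(i, gold_chains, test_chains) for i in anaphors)
    return total / len(anaphors)

# Gold: i corefers with a, b and c; d is alone.
gold_chains = [["i", "a", "b", "c"], ["d"]]
# Test: a was left out of the chain of i.
test_chains = [["i", "b", "c"], ["a"], ["d"]]

c_i = anaphor_score("i", gold_chains, test_chains)
score = ar_score(gold_chains, test_chains)
```

With these chains, ci for `i` is 2/3 as in the diagram, `a` scores 0, `b` and `c` score 2/3 each, `d` scores 1, and the global score is 3/5.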
44Evaluation of the AR-engine working on coreferences
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
45Evaluation of the whole system
- Possibly different sets of markables, identified on the identity-of-head criterion and, where found in both gold and test, with possibly different spans
- Same global formula, but the contribution of each markable is weighted by the mutual overlap score (mos), which measures how much the test markable overlaps the gold one
[Diagram: gold span of length a, test span of length b; mosi = b/a]
46Evaluation of the whole system
- Possibly different sets of markables, identified on the identity-of-head criterion and, where found in both gold and test, with possibly different spans
- Same global formula, but the contribution of each markable is weighted by the mutual overlap score (mos), which measures how much the test markable overlaps the gold one
[Diagram: anaphor i; the other gold chain members are marked 1, 1, 1; in the test they are weighted 0.5, 0, 0.7; ci = 1.2/3]
- R = Σ ci / Ng
47Evaluation of the whole system
- Possibly different sets of markables, identified on the identity-of-head criterion and, where found in both gold and test, with possibly different spans
- Misses (failures to find certain markables) influence R
- False alarms (markables erroneously included in the test) influence P
[Diagram: anaphor i; the other gold chain members are marked 1, 1, 1; the test contains a miss (0), a match weighted 0.7, and a false alarm]
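The overlap-weighted evaluation of the whole system can be sketched as below. Markables are keyed by a head identifier and spans are inclusive token offsets. Several points are assumptions made for the example rather than statements from the slides: mos is computed as shared length over the longer span, a trivial gold chain scores the anaphor's own mos, and P divides the same sum by the number of test markables (the slides give only R = Σ ci / Ng).

```python
def mos(head, gold_spans, test_spans):
    """Mutual overlap score; 0 if the markable is missing on either side."""
    if head not in gold_spans or head not in test_spans:
        return 0.0
    (gs, ge), (ts, te) = gold_spans[head], test_spans[head]
    shared = max(0, min(ge, te) - max(gs, ts) + 1)
    return shared / max(ge - gs + 1, te - ts + 1)

def others(head, chains):
    """Chain partners of `head`; empty if it is alone or absent."""
    for chain in chains:
        if head in chain:
            return set(chain) - {head}
    return set()

def weighted_score(i, gold_spans, test_spans, gold_chains, test_chains):
    """ci with each recovered partner weighted by its mos."""
    if i not in test_spans:
        return 0.0                                   # miss
    gold_others = others(i, gold_chains)
    test_others = others(i, test_chains)
    if not gold_others:                              # ASSUMPTION for trivial chains
        return mos(i, gold_spans, test_spans) if not test_others else 0.0
    found = gold_others & test_others
    return sum(mos(j, gold_spans, test_spans) for j in found) / len(gold_others)

def evaluate(gold_spans, test_spans, gold_chains, test_chains):
    """Return (P, R); P's denominator is an assumption of this sketch."""
    total = sum(weighted_score(i, gold_spans, test_spans, gold_chains, test_chains)
                for i in gold_spans)
    recall = total / len(gold_spans) if gold_spans else 0.0
    precision = total / len(test_spans) if test_spans else 0.0
    return precision, recall

gold_spans = {"i": (40, 40), "a": (0, 9), "b": (12, 14), "c": (20, 29)}
test_spans = {"i": (40, 40), "a": (0, 4), "c": (20, 26),
              "x": (50, 51), "y": (60, 60)}          # b missed; x, y false alarms
gold_chains = [["a", "b", "c", "i"]]
test_chains = [["a", "c", "i"], ["x"], ["y"]]

c_i = weighted_score("i", gold_spans, test_spans, gold_chains, test_chains)
precision, recall = evaluate(gold_spans, test_spans, gold_chains, test_chains)
```

For anaphor `i` this reproduces the diagram's value (0.5 + 0 + 0.7) / 3. The missed markable `b` lowers R, while the false alarms `x` and `y` only enlarge the denominator of P.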
48Evaluation of the whole system
[Diagram: pipeline made of the RE-extractor followed by the AR-engine]
49Commentaries
- The RE-extractor module gives better results on PTB than on Orwell
  - human syntactic annotation versus automatic FDG structure detection
- The AR module gives slightly better results on PTB than on Orwell
  - news (finance) versus belles-lettres
  - heads in PTB are extracted by rules relying on the human syntactic annotation; in Orwell they are extracted by rules relying on the FDG results
- Difficult to compare with other approaches/authors
  - apparently we are in the upper class
  - BUT: not the same corpus, not the same evaluation metric
50Conclusions
- We propose a methodology to evaluate pipe-line architectures when gold and test data are available at intermediate steps of the processing chain. The method makes it possible to assess the contribution of individual modules irrespective of the degradation of the results caused by weaknesses in the other contributing modules
- We report and compare new coreference resolution results on input belonging to two different registers (belles-lettres and news) and to two different types of input (plain text and treebank annotation)
- We introduce a method to evaluate a coreference resolution module when the markables in test and gold differ not only in number but also in span
- The coreference resolution model uses a new heuristic based on WordNet (the HeSheItThey metric, a kind of natural gender for nouns), which helps a lot
51Thank you