Designing TestBeds for General Anaphora Resolution


1
Designing Test-Beds for General Anaphora Resolution

Oana Postolache
oana_at_coli.uni-sb.de
University of Saarland, Saarbrücken, Germany
Al. I. Cuza University of Iasi, Romania

Dan Cristea
dcristea_at_infoiasi.ro
Al. I. Cuza University of Iasi, Romania
ICS - Romanian Academy, the Iasi Branch, Romania
2
Motivation

AR is always a key component of other NLP processes (e.g. summarisation, IE, Q/A)
In the larger setting, it is often important to measure the degree to which a component degrades the overall performance of the system
E.g. the detection of markables alone, the AR component alone, etc.

3
Aims

Propose a methodology for detecting bottlenecks in a pipeline NLP system
Experiment with an architecture made of a markable detection module and an AR module
Propose a methodology for evaluating the behaviour of such a system when markables are not given
Report recent results of a markable detection module and an AR module on two types of input

4
Evaluation of a minimum AR system
[Figure: pipeline of two modules, RE-extractor followed by AR-engine]
5
Evaluation of a minimum AR system
Three test settings:
  test the whole system globally
  test the RE-extractor
  test only the AR-engine
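
The three test settings can be sketched as a small harness. This is only an illustration under stated assumptions: the function name `evaluate_pipeline`, its parameters and the idea of feeding gold markables to the AR-engine when it is tested alone are hypothetical choices, not taken from the slides.

```python
from typing import Any, Callable, Dict


def evaluate_pipeline(
    text: str,
    gold_markables: Any,
    gold_chains: Any,
    extract: Callable[[str], Any],          # RE-extractor (hypothetical signature)
    resolve: Callable[[Any], Any],          # AR-engine (hypothetical signature)
    score_markables: Callable[[Any, Any], float],
    score_chains: Callable[[Any, Any], float],
) -> Dict[str, float]:
    """Score each module alone and the pipeline as a whole."""
    test_markables = extract(text)
    return {
        # RE-extractor alone: compare its output with the gold markables.
        "re_extractor": score_markables(gold_markables, test_markables),
        # AR-engine alone: give it gold markables (assumption), so that
        # extraction errors cannot lower its score.
        "ar_engine": score_chains(gold_chains, resolve(gold_markables)),
        # Whole system: the AR-engine works on the extractor's own output.
        "whole_system": score_chains(gold_chains, resolve(test_markables)),
    }
```

A large gap between the "ar_engine" and "whole_system" scores would point to the RE-extractor as the bottleneck.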
6
Our corpora

A plain-text corpus of approx. 19,500 words in 1,966 sentences, extracted from Orwell's novel 1984 (Orwell, 1949)
A corpus manually annotated for syntactic structure, containing approx. 6,250 words in 281 sentences, extracted from the English Penn Treebank (Marcus et al., 1994)

7
Markables

Generally conformant with MUC-7 and ACE criteria
Differences:
  do not include relative clauses
  each term of an apposition is taken separately (Big Brother, the primal traitor)
  conjoined expressions are annotated individually (John and Mary, hills and mountains)
  modifying nouns appearing in noun-noun modification are not marked separately (glass doors, prison food, the junk bond market)

8
Markables

What do we mark?
noun phrases
  definite (the principle, the flying object)
  indefinite (a book, a future star)
  undetermined (sole guardian of truth)
  names (Winston Smith, The Ministry of Love)
  dates (April)
  currency expressions ($40)
  percentages (48%)
pronouns
  personal (I, you, he, him, she, her, it, they, them)
  possessive (his, her, hers, its, their, theirs)
  reflexive (himself, herself, itself, themselves)
  demonstrative (this, that, these, those)
  wh-pronouns when they replace an entity (which, who, whom, whose, that)
numerals
  when they refer to entities (four of them, the first, the second)

9
The Orwell corpus

Chapters 1, 2, 3 and 5 from George Orwell's Nineteen Eighty-Four
Automatic detection of markables
  POS-tagging
  FDG parser
  markable = any construction dominated by a noun/pronoun
  detection of head and lemma (given)
  deletion of relative clauses
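
The detection steps above can be illustrated on a toy dependency structure. This is a minimal sketch, not the FDG parser's output format: the `Token` fields, the tag names "NOUN" and "PRON" and the "relcl" label for relative clauses are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    form: str
    pos: str        # coarse tag, e.g. "NOUN", "PRON" (hypothetical tag set)
    head: int       # index of the governing token, -1 for the root
    rel: str = ""   # dependency label; "relcl" marks a relative clause (assumed)


def subtree(tokens: List[Token], root: int) -> List[int]:
    """Indices of `root` and everything it dominates, minus relative clauses."""
    out = [root]
    for i, tok in enumerate(tokens):
        if tok.head == root and tok.rel != "relcl":
            out.extend(subtree(tokens, i))
    return sorted(out)


def markables(tokens: List[Token]) -> List[Tuple[int, int, int]]:
    """(first token, last token, head token) for each noun/pronoun-headed span."""
    result = []
    for i, tok in enumerate(tokens):
        if tok.pos in {"NOUN", "PRON"}:
            span = subtree(tokens, i)
            result.append((span[0], span[-1], i))
    return result
```

For "a boy who played", with "played" attached to "boy" as a relative clause, the sketch yields one markable for "a boy" and one for the wh-pronoun "who".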

10
The Orwell corpus: dimensions
11
The Penn Treebank corpus

7 files from WSJ
Extraction of markables from the PTB-style
constituency trees
Collins' rules to extract the head
WordNet script for the lemma
Dependency links between words

12
The Penn Treebank corpus
13
AR-engine: the architecture
14
Terminology
[Figure: three layers. Text layer: REa, REb, REc, REd, REx. Projection layer: PSx. Semantic layer: DE1, DEj, DEm.]
17
What is an AR model?
[Figure: three layers. Text layer: REa, REb, REc, REd, REx. Projection layer: PSx. Semantic layer: DE1, DEj, DEm.]
18
Phases of the engine
projection phase
19-20
Phases of the engine
proposing/evoking phase
[Figure: text layer (REa), projection layer, semantic layer]
21-24
Phases of the engine
completion phase
[Figure: text layer (REa), projection layer, semantic layer (DEa)]
25-30
Phases of the engine
re-evaluation phase
[Figure: text layer (REa, REb, REc), projection layer (PSb, PSc), semantic layer (DEa)]
31
Our model: primary attributes

Lexical and morphological
  lemma
  number
  POS
  headForm
Syntactic
  synt-role
  dependency-link
  npText
  includedNPs
  isDefinite, isIndefinite
  predNameOf
Semantic
  isMaleName, isFemaleName, isFamilyName, isPerson
  HeSheItThey
Positional
  offset
  sentenceID

32
Our model: knowledge sources

For each attribute there is a knowledge source that fetches the value using:
  the POS tagger output
  the FDG structure
  large name databases
  the WordNet hierarchy
  punctuation

33
Knowledge sources - HeSheItThey

HeSheItThey = (Phe, Pshe, Pit, Pthey)
for pronouns: straightforward
for NPs:
  n = number of synsets of the head
  f = number of synsets which are hyponyms of <female>
  m = number of synsets which are hyponyms of <male>
  p = number of synsets which are hyponyms of <person>
  If the NP is plural: Phe = 0, Pshe = 0, Pit = 0, Pthey = 1
  Else: Phe = ..., Pshe = ..., Pit = ..., Pthey = 0

34
Knowledge sources - wh

Source for detecting the referent of a wh-pronoun
Case 1:
  I saw a blond boy who was playing in the garden.
Case 2:
  The colour of the chair which was underneath the table
  The atmosphere of happiness which she carried with her.

35
Our model: rules

Demolishing rules
  IncludingRule: prohibits coreference between nested REs
Certifying rules
  PredNameRule
  ProperNameRule
Promoting/demoting rules
  HeSheItTheyRule
  RoleRule
  NumberRule
  LemmaRule
  PersonRule
  SynonymyRule
  HypernymyRule
  WordnetChainRule

36
Our model: domain of referential accessibility

Linear

37
Evaluation of the RE-extractor
Test the RE-extractor (measures: P, R, F)
When does a gold-test pair of markables match?
38
Evaluation of the RE-extractor
head matching (HM): if they have the same head
39
Evaluation of the RE-extractor
partial matching (PM): if they have the same head and the mutual overlap is higher than 50% (compared to the longest span)
[Figure: gold and test markable spans, labelled l1 and l2; match condition l2 / l1 > 0.5]
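
The two matching criteria can be sketched as follows. The sketch assumes token offsets with an exclusive end, identifies a head by its token offset, and scores P, R and F by counting any markable that has at least one match on the other side; the names `Markable`, `head_match`, `partial_match` and `prf` are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Markable:
    start: int  # first token offset (inclusive)
    end: int    # last token offset (exclusive); spans are assumed non-empty
    head: int   # offset of the head token (assumed way of identifying heads)


def head_match(gold: Markable, test: Markable) -> bool:
    """HM: the two markables have the same head."""
    return gold.head == test.head


def partial_match(gold: Markable, test: Markable, threshold: float = 0.5) -> bool:
    """PM: same head, and overlap / longest span above the threshold."""
    overlap = max(0, min(gold.end, test.end) - max(gold.start, test.start))
    longest = max(gold.end - gold.start, test.end - test.start)
    return head_match(gold, test) and overlap / longest > threshold


def prf(gold: List[Markable], test: List[Markable],
        match: Callable[[Markable, Markable], bool]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure under a given matching criterion."""
    matched_gold = sum(any(match(g, t) for t in test) for g in gold)
    matched_test = sum(any(match(g, t) for g in gold) for t in test)
    p = matched_test / len(test) if test else 0.0
    r = matched_gold / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Passing `head_match` or `partial_match` as the `match` argument switches between the two criteria without changing the scoring code.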
40-41
Evaluation of the AR-engine

Same set of markables (on the identity-of-head criterion)
For each anaphor i in the gold:
If it belongs to a chain that doesn't contain any other anaphor, we look in the test set to see whether it also belongs to a trivial chain there. If it does, it gets the value ci = 1; otherwise it gets the value ci = 0.
42-43
Evaluation of the AR-engine

Same set of markables (on the identity-of-head criterion)
For each anaphor i in the gold:
If the anaphor belongs to a chain containing n other anaphors, we look in the test set and count how many of these n anaphors belong to the chain of the corresponding test-set anaphor (we call this number m). The ratio m/n is the value ci assigned to the current anaphor.
Example: the gold chain contains 3 other anaphors, of which 2 are found in the test chain, so ci = 2/3.
Then we add up these values for all anaphors and divide by the number of anaphors: Σ ci / N
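
The per-anaphor value and the global score can be sketched as below. The sketch assumes that chains are given as sets of markable identifiers, that every chain member counts as an anaphor, and that a gold anaphor absent from the test is treated as a singleton chain; the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def anaphor_score(i: Hashable, gold_chain: Set[Hashable],
                  test_chain: Set[Hashable]) -> float:
    """Value ci of anaphor i, given its gold chain and its test chain."""
    others = gold_chain - {i}
    if not others:
        # Trivial gold chain: 1 if the test chain is trivial too, else 0.
        return 1.0 if not (test_chain - {i}) else 0.0
    # m of the n other gold anaphors are also in the test chain of i.
    m = len(others & test_chain)
    return m / len(others)


def ar_engine_score(gold: List[Set[Hashable]], test: List[Set[Hashable]]) -> float:
    """Sum of ci over all gold anaphors, divided by their number N."""
    gold_of: Dict[Hashable, Set[Hashable]] = {a: c for c in gold for a in c}
    test_of: Dict[Hashable, Set[Hashable]] = {a: c for c in test for a in c}
    scores = [anaphor_score(a, gold_of[a], test_of.get(a, {a})) for a in gold_of]
    return sum(scores) / len(scores)
```

With a gold chain {a, b, c, d} that the test splits into {a, c, d} and {b}, anaphor a gets 2/3, as in the slide's example, and b gets 0.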
44
Evaluation of the AR-engine working on coreferences
[Figure: RE-extractor and AR-engine]
45-46
Evaluation of the whole system

Possibly different sets of markables, matched on the identity-of-head criterion and, where found in both gold and test, possibly with different spans
same global formula, but the contribution of each markable is weighted by its mutual overlap score (mos), which measures how far the test markable overlaps the gold one
[Figure: gold and test spans, labelled a and b; mosi = b/a]
Example: a gold chain with three other anaphors whose test counterparts contribute 0.5, 0 and 0.7 gives ci = (0.5 + 0 + 0.7)/3 = 1.2/3
R = Σ ci / Ng
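
The weighted value and the recall formula can be checked with a few lines of code. This is a sketch under assumptions: Ng is read here as the number of anaphors in the gold, the mutual overlap scores are taken as given inputs, and both function names are hypothetical.

```python
from typing import List


def weighted_anaphor_score(contributions: List[float]) -> float:
    """ci for one gold anaphor.

    `contributions` holds one entry per other anaphor of the gold chain:
    its mutual overlap score if it was found in the test chain, else 0.
    """
    if not contributions:
        raise ValueError("expected at least one other anaphor in the gold chain")
    return sum(contributions) / len(contributions)


def recall(c_values: List[float], n_gold: int) -> float:
    """R = sum of ci over the gold anaphors, divided by Ng (assumed meaning)."""
    return sum(c_values) / n_gold


# The slide's example: contributions 0.5, 0 and 0.7 give ci = 1.2/3.
example_ci = weighted_anaphor_score([0.5, 0.0, 0.7])
```

Setting every found contribution to 1 reduces this to the unweighted m/n value used when the AR-engine is tested alone.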
47
Evaluation of the whole system

Possibly different sets of markables, matched on the identity-of-head criterion and, where found in both gold and test, possibly with different spans
misses (failures to find certain markables) influence R
false alarms (markables erroneously included in the test) influence P
[Figure: a gold chain and a test chain, showing one miss and one false alarm]
48
Evaluation of the whole system
[Figure: RE-extractor and AR-engine]
49
Commentaries

The RE-extractor module gives better results on PTB than on Orwell
  human syntactic annotation versus automatic FDG structure detection
The AR module gives slightly better results on PTB than on Orwell
  news (finance) versus belles-lettres
  heads in PTB are extracted by rules relying on the human syntactic annotation; in Orwell, by rules relying on the FDG results
Difficult to compare with other approaches/authors
  apparently our results are in the upper range
  BUT: not the same corpus, not the same evaluation metric

50
Conclusions

We propose a methodology to evaluate pipeline architectures when gold and test data are available at intermediate steps of the processing chain. The method makes it possible to assess the contribution of each module independently of the degradation caused by weaknesses in the modules that feed it.
We report and compare new coreference resolution results on input from two different registers (belles-lettres and news) and of two different types (plain text and treebank annotation).
We introduce a method to evaluate a coreference resolution module when the markables in test and gold differ not only in number but also in span.
The coreference resolution model uses a new heuristic based on WordNet (the HeSheItThey metric, a kind of natural gender for nouns), which helps considerably.

51
Thank you
