1
Learning Dutch noun phrase coreference resolution
Véronique Hoste and Walter Daelemans
CNTS Language Technology Group, University of Antwerp
CLIN 2004
2
Outline
  • Definition
  • Data set
  • Cross-validation
  • The effect of optimization
  • The effect of skewedness
  • Results on the test set
  • Error analysis
  • Conclusion

6
Definition (Hirst, 1981)
  • Anaphora is the device of making in discourse an
    abbreviated reference to some entity in the
    expectation that the perceiver will be able to
    disabbreviate the reference and thereby determine
    the identity of the entity.
  • ANAPHOR: the abbreviated reference
  • ANTECEDENT or REFERENT: the entity referred to
  • RESOLUTION: disabbreviating the reference to
    determine the identity of the entity

7
Example (KNACK-2002)
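  • Zacarias Moussaoui, de eerste persoon die
    door het Amerikaanse gerecht aangeklaagd is voor
    de terreuraanvallen van 11 september, pleit
    onschuldig bij zijn eerste verschijning voor de
    rechtbank. De Fransman van Marokkaanse afkomst
    wordt ervan verdacht de twintigste
    vliegtuigkaper te zijn die door omstandigheden
    (hij zat in een Amerikaanse cel) niet aan de
    kapingen kon deelnemen.
  • (English: Zacarias Moussaoui, the first person to
    be charged by the American judiciary for the
    terror attacks of 11 September, pleads not guilty
    at his first court appearance. The Frenchman of
    Moroccan descent is suspected of being the
    twentieth hijacker, who due to circumstances (he
    was in an American cell) could not take part in
    the hijackings.)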

10
KNACK-2002
  • New corpus annotated with coreferential relations
    between noun phrases
  • Existing corpora for Dutch are small and only
    contain anaphoric relations for pronouns (op
    den Akker et al., 2002; Bouma, 2003)
  • Articles from KNACK, a Flemish weekly magazine
    with articles on national and international
    current affairs.
  • 267 annotated texts, ca. 12,500 annotated NPs
  • Experiments: random selection of 50 texts (25 for
    training, 25 for testing)

11
Which anaphora?
  • Annotation: adaptation of the MUC guidelines
  • http://cnts.uia.ac.be/hoste/manual_dutch.ps
  • Identity, bound, ISA (identity of sense),
    modality relations
  • <-> part-whole relation: If the gas tank is
    empty, you should refuel the car.
  • Between NPs:
  • Personal, possessive and demonstrative pronouns
  • Non-lexicalized reflexive pronouns
  • Names and named entities
  • Definite NPs

12
Approaches
  • The field is still highly knowledge-based
    (constraints and preferences, centering and
    focusing theory), e.g. Lappin & Leass (1994),
    Baldwin (1996), Poesio et al. (2004)
  • Recently: machine learning (C4.5, Ripper, maximum
    entropy), in which coreference resolution is
    defined as a classification task
  • E.g. De Verenigde Staten probeerden van
    Pakistan en India de belofte af te dwingen dat
    ze geen kernwapens zouden inzetten.
    (English: The United States tried to extract from
    Pakistan and India the promise that they would
    not use nuclear weapons.)
  • ze - de belofte: not coreferential
  • ze - Pakistan en India: coreferential
  • ze - De Verenigde Staten: not coreferential

13
Preprocessing
14
Positive and negative instances
  • Per NP type (pronouns / proper nouns / common
    nouns)
  • Positive: combination of the anaphor with each
    preceding element in the coreference chain.
  • Negative: combination of the anaphor with each
    preceding NP which is not part of the coreference
    chain (search scope < 20 sentences)
  • Highly skewed class distribution:
  • positive: 6,457 inst.
  • negative: 95,919 inst.
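
The instance construction described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the `NP` structure and the `make_instances` function are hypothetical names, and the sketch assumes that only NPs with at least one preceding member of their own chain are treated as anaphors.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NP:
    """A noun phrase mention (hypothetical structure, not from the slides)."""
    sent: int             # index of the sentence containing the NP
    text: str
    chain: Optional[int]  # coreference chain id, or None if not in a chain


def make_instances(nps: List[NP], max_sent_dist: int = 20) -> List[Tuple[NP, NP, int]]:
    """Pair each anaphor with its preceding NPs and label the pairs.

    Label 1: the candidate precedes the anaphor in the same chain.
    Label 0: the candidate is a preceding NP outside the chain and fewer
             than max_sent_dist sentences away.
    """
    instances = []
    for j, anaphor in enumerate(nps):
        preceding = nps[:j]
        # Assumption: an NP counts as an anaphor only if an earlier NP
        # belongs to the same chain.
        if anaphor.chain is None or not any(c.chain == anaphor.chain for c in preceding):
            continue
        for cand in preceding:
            if cand.chain == anaphor.chain:
                instances.append((cand, anaphor, 1))
            elif anaphor.sent - cand.sent < max_sent_dist:
                instances.append((cand, anaphor, 0))
    return instances


# NPs of the example sentence from the "Approaches" slide, in document order.
nps = [
    NP(0, "De Verenigde Staten", None),
    NP(0, "Pakistan en India", 1),
    NP(0, "de belofte", None),
    NP(0, "ze", 1),
    NP(0, "kernwapens", None),
]
instances = make_instances(nps)
for cand, anaphor, label in instances:
    print(f"{anaphor.text} - {cand.text}: {label}")
```

Because every anaphor is paired with all preceding NPs in the search scope but only with the few members of its own chain, negatives quickly outnumber positives, which is consistent with the skewed counts reported above.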

15
Information sources
  • Positional features (e.g. dist_sent, dist_NP)
  • Local context features
  • Morphological and lexical features (e.g.
    i/j/ij-pron, j_demon, j_def, i/j/ij-proper,
    num_agree)
  • Syntactic features (e.g. i/j/ij_SBJ/OBJ/PREDC,
    appositive)
  • String-matching features (comp_match,
    part_match, alias, same_head)
  • Semantic features (synonym, hyperonym, same_NE,
    (linguistic) gender of antecedent and anaphor)

16
Algorithms compared
  • Ripper (Cohen, 1995)
  • Rule induction
  • Algorithm parameters: different class ordering
    principles, negative conditions or not, loss
    ratio values, cover parameter values
  • TiMBL
  • Memory-based learning
  • Algorithm parameters: IB1, IGTree; overlap, MVDM;
    5 feature weighting methods; 4 distance weighting
    methods; different values of k

17
Two step procedure
  • First step: cross-validation
  • Application of TiMBL and Ripper on the training
    set (10-fold CV)
  • Extensive feature selection and parameter
    optimization using a genetic algorithm
  • Undersampling of the negative class
  • Evaluation: accuracy, precision, recall, F-beta
  • Second step: testing
  • Training of TiMBL and Ripper on the training set,
    testing on the test set.
  • Reconstruction of coreference chains
  • Evaluation using the MUC scoring software

18
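GA optimization
[Figure: example individual (chromosome) in the genetic algorithm]
  • Features: gene values 0, 1, 2
  • Parameters: feature weighting (0, 1, 2, 3, 4),
    neighbour weighting (0, 1, 2, 3), k
  • Example individual: 0 1 0 1 2 0 2 1 0 2 0 0 2 1
    0 2 2 0 3 2 2.0288721872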
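
A genetic algorithm over such individuals can be sketched in Python as follows. This is a generic GA with truncation selection, one-point crossover, mutation and elitism, not the authors' implementation. The gene layout (18 feature genes followed by two parameter genes, with the k gene left out), the population settings and the toy fitness function are all assumptions; in the setup described in the slides, fitness would come from cross-validation performance of the classifier.

```python
import random
from typing import Callable, List

# Allowed values per gene: feature genes with values 0-2, then feature
# weighting (0-4) and neighbour weighting (0-3). Gene count is an assumption.
GENE_VALUES = [[0, 1, 2]] * 18 + [[0, 1, 2, 3, 4], [0, 1, 2, 3]]


def random_individual(rng: random.Random) -> List[int]:
    return [rng.choice(vals) for vals in GENE_VALUES]


def crossover(a: List[int], b: List[int], rng: random.Random) -> List[int]:
    """One-point crossover: head of a, tail of b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(ind: List[int], rng: random.Random, rate: float = 0.05) -> List[int]:
    """Resample each gene from its allowed values with probability rate."""
    return [rng.choice(GENE_VALUES[i]) if rng.random() < rate else g
            for i, g in enumerate(ind)]


def evolve(fitness: Callable[[List[int]], float], pop_size: int = 20,
           generations: int = 30, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - 1)]
        pop = [pop[0]] + children       # elitism: keep the best individual
    return max(pop, key=fitness)


def toy_fitness(ind: List[int]) -> float:
    """Stand-in for a cross-validation score: count the genes set to 1."""
    return float(sum(1 for g in ind if g == 1))


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```
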
19
Cross-validation results
[Results table: default settings vs. GA optimization]
20
Testing
  • Application of the optimized classifiers on the
    held-out test set
  • Antecedent selection: 1 antecedent per anaphor.
    Some basic heuristics to select the most likely
    antecedent among the positive instances
  • New evaluation procedure using the MUC scoring
    software: evaluation of the equivalence classes
    (transitive closure of a coreference chain)

21
2 baselines
  • Baseline I: link every NP to its immediately
    preceding NP
  • Baseline II: application of simple rules, viz.
    (i) select the closest NP with same gender and
    number, (ii) select the closest antecedent which
    matches the anaphor

22
First test results
23
Error analysis
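  • POS tagging / chunking errors and
    inconsistencies
  • De moeder van Moussaoui gaf een persconferentie
    waarin ze om een eerlijk proces vroeg.
    (English: Moussaoui's mother gave a press
    conference in which she asked for a fair trial.)
  • In de opiniepeilingen liggen Jospin en Chirac zij
    aan zij.
    (English: In the opinion polls, Jospin and Chirac
    are running side by side.)
  • Zacarias Moussaoui, de eerste persoon (...) De
    moeder van Moussaoui (...)
    (English: Zacarias Moussaoui, the first person
    (...) Moussaoui's mother (...))
  • Low informativeness of some feature vectors, e.g.
    linguistic gender vs. real gender, erroneous
    apposition recognition
  • Zij stelden dat het moeilijk zou zijn om de
    studie te dupliceren. Waarmee werd gezegd dat ze
    niet wetenschappelijk verantwoord was uitgevoerd.
    (English: They stated that it would be difficult
    to duplicate the study. Which meant that it had
    not been carried out in a scientifically sound
    way.)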

24
Error analysis
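  • Limited synonym recognition
  • Donderdag gaven Stevaert en Picque elkaar de
    schuld voor het disfunctioneren van twee
    onbemande camera's. Picque - bevoegd voor de
    erkenning van de flitspalen (...)
    (English: On Thursday Stevaert and Picque blamed
    each other for the malfunctioning of two unmanned
    cameras. Picque, responsible for the approval of
    the speed cameras (...))
  • No recognition of hyponyms
  • Zacarias Moussaoui is aangeklaagd voor de
    terreuraanvallen van 11 september. Hij kon door
    omstandigheden niet aan de kapingen deelnemen.
    (English: Zacarias Moussaoui has been charged for
    the terror attacks of 11 September. Due to
    circumstances he could not take part in the
    hijackings.)
  • No world knowledge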

25
Conclusion
  • First system for Dutch noun phrase coreference
    resolution
  • Approach works for English (results among the
    state of the art)
  • Substantial room for improvement
  • Future work:
  • restart preprocessing
  • use the web for synonym, hyp(er)onym and
    collocation search