Corpus Annotation II

1
Corpus Annotation II
  • Martin Volk
  • Stockholm University

2
Overview
  • Clean-Up and Text Structure Recognition
  • Sentence Boundary Recognition
  • Proper Name Recognition and Classification
  • Part-of-Speech Tagging
  • Tagging Correction and Sentence Boundary Corr.
  • Lemmatisation and Lemma Filtering
  • NP/PP Chunk Recognition
  • Recognition of Local and Temporal PPs
  • Clause Boundary Recognition

3
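[Diagram: processing pipeline with the resources used at each stage]
  • Input documents
  • Tokenizer and Sentence Boundary Recognizer
    (resource: abbreviations)
  • Proper Name Recognizer for persons, locations, ...
    (resources: first name list, location list)
  • Part-of-Speech Tagger and Lemmatiser
    (resource: training corpus SUC)
  • Swetwol, morphological analyser for lemmas, tags,
    compounds (resources: morphological rules, lexicon)
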
4
Part-of-Speech Tagging for German
  • was done with the TreeTagger
  • (by Helmut Schmid, IMS Stuttgart).
  • The TreeTagger
  • is a statistical tagger.
  • uses the STTS tag set (50 PoS tags and 3 tags for
    punctuation).
  • assigns exactly 1 tag to each word form.
  • preserves pre-set tags.

5
A statistical Part-of-Speech tagger
  • learns tagging rules from a manually
    Part-of-Speech annotated corpus (the training
    corpus).
  • Example: Vid/PR kliniken/NN i/PR Huddinge/PM
    övervakas/VB nu/AB Mijailovic/PM ständigt/AB
    av/PR två/RG vårdare/NN.
  • applies the learned rules to new sentences.
  • Problems:
  • words that were not in the training corpus.
  • words with many possible tags.
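The idea of learning from a tagged corpus can be illustrated with a deliberately simple baseline. This is only a sketch: it picks the most frequent tag per word form and ignores context, whereas real statistical taggers also use the neighbouring tags. The function names and the UNKNOWN label are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

# Training data in word/TAG format, taken from the example sentence above.
TRAINING = ("Vid/PR kliniken/NN i/PR Huddinge/PM övervakas/VB nu/AB "
            "Mijailovic/PM ständigt/AB av/PR två/RG vårdare/NN")

def train(tagged_text):
    """Count how often each word form occurs with each tag."""
    counts = defaultdict(Counter)
    for token in tagged_text.split():
        word, tag = token.rsplit("/", 1)
        counts[word][tag] += 1
    return counts

def tag_words(words, counts, unknown_tag="UNKNOWN"):
    """Assign each word its most frequent training tag (no context used)."""
    result = []
    for word in words:
        if word in counts:
            result.append((word, counts[word].most_common(1)[0][0]))
        else:
            # Problem 1 from the slide: the word was not in the training corpus.
            result.append((word, unknown_tag))
    return result

model = train(TRAINING)
tagged = tag_words(["Vid", "kliniken", "övervakas", "Lindh"], model)
```

The last word, Lindh, does not occur in the training sentence and therefore receives the placeholder tag, which is exactly the first problem listed above.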

6
Two Swedish example word forms with multiple PoS
tags in SUC
  • av
  • adverb (AB): 48 times
  • particle (PL): 407 times
  • proper name (PM): 4 times
  • preposition (PR): 14580 times
  • foreign word (UO): 2 times
  • lagar (EN: laws, or to make/repair)
  • noun (NN): 43 times
  • verb (VB): 5 times
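To see how skewed these distributions are, the counts above can be turned into relative frequencies. The sketch below only does arithmetic on the numbers from the slide; the function name is an assumption.

```python
# Tag counts for the two word forms, copied from the slide.
SUC_COUNTS = {
    "av": {"AB": 48, "PL": 407, "PM": 4, "PR": 14580, "UO": 2},
    "lagar": {"NN": 43, "VB": 5},
}

def most_frequent_tag_share(tag_counts):
    """Return the most frequent tag and its share of all occurrences."""
    best_tag = max(tag_counts, key=tag_counts.get)
    share = tag_counts[best_tag] / sum(tag_counts.values())
    return best_tag, share

for word, tag_counts in SUC_COUNTS.items():
    best_tag, share = most_frequent_tag_share(tag_counts)
    print(f"{word}: {best_tag} in {share:.1%} of occurrences")
```

Always choosing the most frequent tag would be right for about 96.9% of the occurrences of av (14580 of 15041) and about 89.6% of the occurrences of lagar (43 of 48). The remaining cases are the ones where context has to decide.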

7
Part-of-Speech Tagging for Swedish
  • is done with the TreeTagger
  • which is trained on SUC (Stockholm-Umeå-Corpus,
    1 million words)
  • with the SUC tag set (slightly enlarged)
  • originally 22 tags plus VBFIN, VBINF, VBSUP,
    VBIMP
  • has an estimated error rate of 4% (i.e. every
    25th word is incorrectly tagged!)

8
Part-of-Speech Tagging with Lemmatisation
  • The TreeTagger also assigns lemmas that it has
    learned from the training corpus.
  • Rule: If word form W in the corpus has
  • lemma L1 with tag T1 and
  • lemma L2 with tag T2,
  • then the TreeTagger will assign the lemma
    corresponding to the chosen tag.
  • Example: Swedish låg
  • ligga (EN: to lie) with VBFIN (finite verb)
  • låg (EN: low) with JJ (adjective)
  • A nice example of PoS tagging as word sense
    disambiguation.

9
PoS Tagging with Lemmatisation
  • But it is possible that word form W has more
    than one lemma with tag T1 in the training
    corpus.
  • Example: Swedish kön
  • kö (EN: queue), noun
  • kön (EN: gender, sex), noun
  • The TreeTagger will simply assign all lemmas to W
    that go with T1 (no lemma disambiguation).
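The lemma assignment described on the last two slides amounts to a table lookup keyed by word form and tag. The following is a minimal sketch of that idea, not the TreeTagger's actual data structure. The verb lemma is written as the infinitive ligga and the tag names follow the Swedish tag set; both are assumptions for the illustration.

```python
from collections import defaultdict

# (word form, tag, lemma) triples as they might be observed in a training corpus.
TRAINING_TRIPLES = [
    ("låg", "VBFIN", "ligga"),
    ("låg", "JJ", "låg"),
    ("kön", "NN", "kö"),
    ("kön", "NN", "kön"),
]

def build_lemma_table(triples):
    """Map each (word form, tag) pair to the set of lemmas seen with it."""
    table = defaultdict(set)
    for form, tag, lemma in triples:
        table[(form, tag)].add(lemma)
    return table

def lemmas_for(form, tag, table):
    """Return all lemmas seen for this form under the chosen tag, sorted."""
    return sorted(table.get((form, tag), set()))

table = build_lemma_table(TRAINING_TRIPLES)
```

For låg the chosen tag selects exactly one lemma, while for kön with tag NN both lemmas come back, because the tag alone cannot separate them.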

10
Tagging Correction in German
  • Correction of observed tagger problems
  • Sentence-initial adjectives
  • are often tagged as noun (NN)
  • '...liche[nr]' or '...ische[nr]' → ADJA
  • Verb group patterns
  • the verb in front of 'worden' must be a perfect
    participle
  • VVXXX + 'worden' → VVPP
  • if verb + modal verb, then the verb must be an
    infinitive
  • VVXXX + VMYYY → VVINF
  • Unknown prepositions (a, via, innert, ennet)
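The two verb group patterns can be written as a small post-processing pass over the tagger output. This is a sketch under the assumption that the input is a list of (word, tag) pairs with STTS tags, where full-verb tags start with VV and modal-verb tags start with VM; the function name is hypothetical.

```python
def correct_verb_groups(tagged):
    """Apply the two verb-group corrections to a list of (word, tag) pairs."""
    corrected = list(tagged)
    for i in range(len(corrected) - 1):
        word, tag = corrected[i]
        next_word, next_tag = corrected[i + 1]
        if not tag.startswith("VV"):
            continue  # only full verbs are corrected
        if next_word.lower() == "worden":
            # A full verb in front of 'worden' must be a perfect participle.
            corrected[i] = (word, "VVPP")
        elif next_tag.startswith("VM"):
            # A full verb in front of a modal verb must be an infinitive.
            corrected[i] = (word, "VVINF")
    return corrected

passive = correct_verb_groups(
    [("ist", "VAFIN"), ("verkauft", "VVFIN"), ("worden", "VAPP")])
modal = correct_verb_groups(
    [("arbeiten", "VVFIN"), ("muss", "VMFIN")])
```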

11
Correction of sentence boundaries
  • E.g. suspected ordinal number followed by a
    capitalized
  • determiner or
  • pronoun or
  • preposition or
  • adverb
  • → insert a sentence boundary.
  • Open question: Could all sentence boundary
    detection be done after PoS tagging?
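The ordinal-number rule could look roughly like this once PoS tags are available. The regular expression for a suspected ordinal and the choice of STTS tags standing for determiner, pronoun, preposition and adverb are assumptions of this sketch.

```python
import re

# A digit sequence followed by a period, e.g. "3." (suspected ordinal number).
ORDINAL = re.compile(r"^\d+\.$")
# STTS tags assumed to cover determiner, pronouns, preposition, adverb.
TRIGGER_TAGS = {"ART", "PPER", "PDS", "APPR", "ADV"}

def boundary_positions(tagged):
    """Return the token indices before which a sentence boundary is inserted."""
    positions = []
    for i in range(len(tagged) - 1):
        word, _ = tagged[i]
        next_word, next_tag = tagged[i + 1]
        if (ORDINAL.match(word)
                and next_word[:1].isupper()
                and next_tag in TRIGGER_TAGS):
            positions.append(i + 1)
    return positions

split_case = boundary_positions(
    [("am", "APPRART"), ("3.", "ADJA"), ("Der", "ART"), ("Zug", "NN")])
date_case = boundary_positions(
    [("am", "APPRART"), ("3.", "ADJA"), ("Mai", "NN")])
```

In the first input the capitalized determiner after "3." triggers a boundary. In the second the following noun does not, so "3. Mai" stays together.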

12
Lemmatisation for Swedish
  • is (partly) done by the TreeTagger by re-using
    the lemmas from SUC (Stockholm-Umeå-Corpus)
  • Limits:
  • word forms that are not in SUC, in particular
  • names → proper name recognition
  • compounds → Swetwol
  • neologisms, foreign expressions → ??
  • SUC lemmas have no compound boundaries
  • (byskolan → byskola), (konstindustriskolan →
    konstindustriskola)
  • elliptical compounds (e.g. kostnads- och
    tidseffektivt) → ??
  • TreeTagger ignores the hyphen.
  • upper case / lower case (e.g. Bo vs. bo) → ??
  • TreeTagger treats them separately.

13
Morphological information
  • such as case, number, gender etc.
  • is important for correct linguistic analysis.
  • could be taken from SUC based on the triple
  • word form + PoS tag + lemma
  • Examples:
  • kön + NN + kön → NEUtrum SINgular INDefinite
    NOMinative
  • kön + NN + kö → UTRum SINgular DEFinite
    NOMinative
  • Limits:
  • word forms that are not in SUC, and
  • triples that have more than 1 set of
    morphological features.

14
Lemmatisation for Swedish
  • can be done with Swetwol (Lingsoft Oy, Helsinki)
    for
  • adjectives (inflection: lyckligt - lyckliga;
    gradation: söt - sötare - sötaste),
  • nouns (inflection: hus - husen - huset),
  • verbs (inflection: arbeta - arbetar - arbetat).
  • Swetwol
  • is a two-level morphology analyzer for Swedish
  • is lexicon-based
  • returns all possible interpretations for each
    word form
  • kön → kön N NEU INDEF SG/PL NOM
  • kön → kö N UTR DEF SG NOM
  • segments compound words dynamically if all parts
    are known
  • cirkusskolan → cirkusskola
  • analyzes hyphenated compounds only if all parts
    are known
  • FN-uppdraget → FN-uppdrag
  • tPA-plantan → ?? although plantan → planta
  • → feed the last element to Swetwol

15
Lemmatisation for German
  • can be done with Gertwol (Lingsoft Oy, Helsinki)
    for
  • adjectives (inflection: schöne - schönes;
    gradation: schöner - schönste),
  • nouns (inflection: Haus - Hauses - Häuser -
    Häusern),
  • prepositions (contraction: zum, zur → zu), and
  • verbs (inflection: zeige - zeigst - zeigt -
    zeigte - zeigten).
  • Gertwol
  • is a two-level morphology analyzer for German
  • is lexicon-based
  • returns all possible interpretations for each
    word form
  • segments compound words dynamically
  • analyzes hyphenated compounds only if all parts
    are known
  • e.g. Software-Aktien but not Informix-Aktien
  • → feed the last element to Gertwol
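The fallback for hyphenated compounds, feeding only the last element to the analyser, might be sketched as follows. The toy lexicon stands in for Swetwol or Gertwol, whose real interfaces are not shown in the slides. Re-attaching the unanalysed first part to the lemma of the last element is an assumption of this sketch.

```python
# Toy stand-in for a lexicon-based analyser: word form -> lemma.
TOY_LEXICON = {"Aktien": "Aktie", "uppdraget": "uppdrag", "plantan": "planta"}

def analyse(word_form):
    """Return the lemma of a known word form, or None if it is unknown."""
    return TOY_LEXICON.get(word_form)

def lemmatise_hyphenated(word_form):
    """Try the whole form first; otherwise analyse only the last element."""
    lemma = analyse(word_form)
    if lemma is not None:
        return lemma
    head, hyphen, last = word_form.rpartition("-")
    if hyphen:
        last_lemma = analyse(last)
        if last_lemma is not None:
            # Assumption: keep the unknown first part unchanged.
            return head + hyphen + last_lemma
    return None

german = lemmatise_hyphenated("Informix-Aktien")
swedish = lemmatise_hyphenated("tPA-plantan")
```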

16
Lemma Filtering (a project by Julian Käser)
  • After lemmatisation: merging of Ger/Swetwol and
    tagger information
  • Case 1: The lemma was prespecified during proper
    name recognition (IBMs → IBM)
  • Case 2: Ger/Swetwol does not find a lemma →
    insert the word form as lemma (mark it with '?')

17
Lemma Filtering
  • Case 3: Ger/Swetwol finds exactly one lemma for
    the given PoS → insert the lemma
  • Case 4: Ger/Swetwol finds multiple lemmas for the
    given PoS → disambiguate and insert the best
    lemma
  • Disambiguation weights the segmentation symbols:
  • Strong compound segment boundary: 4 points
  • Weak compound segment boundary: 2 points
  • Derivational segment boundary: 1 point
  • the lemma with the lowest score wins!
  • Examples:
  • Abteilungen → Abteilunge (5 points) vs.
    Abteilung (3 points)
  • rådhusklockan → rådhusklocka (6 p.) vs.
    rådhusklocka (8 p.)
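The scoring scheme can be made concrete as follows. The segmentation symbols are not visible in the transcript, so this sketch assumes '#' for a strong compound boundary, '|' for a weak one and '~' for a derivational one. The segmented candidates below are hypothetical and were chosen only so that they reproduce the scores given on the slide.

```python
# Points per segmentation symbol (the symbols themselves are an assumption).
WEIGHTS = {"#": 4, "|": 2, "~": 1}

def score(segmented_lemma):
    """Sum the weights of all segmentation symbols in a candidate lemma."""
    return sum(WEIGHTS.get(char, 0) for char in segmented_lemma)

def best_lemma(candidates):
    """The candidate with the lowest score wins."""
    return min(candidates, key=score)

# Hypothetical segmentations matching the 5 vs. 3 and 6 vs. 8 points above.
german_candidates = ["Abtei#lung~e", "Ab|teil~ung"]
swedish_candidates = ["råd|hus#klocka", "råd#hus#klocka"]

german_best = best_lemma(german_candidates)
swedish_best = best_lemma(swedish_candidates)
```

Preferring the lowest score favours readings with fewer and weaker internal boundaries, which is why Abteilung beats the spurious compound reading.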

18
Lemma Filtering
  • Case 5: Ger/Swetwol finds a lemma but not for the
    given PoS
  • → this indicates a tagger error (Ger/Swetwol is
    more reliable than the tagger).
  • Case 5.1: Ger/Swetwol finds a lemma for exactly
    one PoS → insert the lemma and exchange the PoS
    tag
  • Case 5.2: Ger/Swetwol finds lemmas for more than
    one PoS → find the closest PoS tag, or guess
  • Option: Check if the PoS tag in the corpus was
    licensed by SUC. If yes, ask the user for a
    decision.
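Taken together, the five cases form a decision cascade. The sketch below is one possible reading of it, not the project's implementation. The analyser output is assumed to be a dict from PoS tag to candidate lemmas, the '?' mark is appended to the word form, Case 4 uses a placeholder choice function, and Case 5.2 is left undecided.

```python
def filter_lemma(word, tag, analyses, prespecified=None, choose=min):
    """Return (lemma, tag, case) following the lemma filtering cases.

    analyses: dict mapping a PoS tag to the list of lemmas the
    morphological analyser found for the word (assumed format).
    choose: placeholder for the Case 4 disambiguation step.
    """
    if prespecified is not None:
        return prespecified, tag, "case 1"
    if not analyses:
        return word + "?", tag, "case 2"
    lemmas = analyses.get(tag, [])
    if len(lemmas) == 1:
        return lemmas[0], tag, "case 3"
    if len(lemmas) > 1:
        return choose(lemmas), tag, "case 4"
    if len(analyses) == 1:
        # The analyser disagrees with the tagger: trust the analyser.
        new_tag, new_lemmas = next(iter(analyses.items()))
        return choose(new_lemmas), new_tag, "case 5.1"
    # Several competing PoS readings: closest tag, guess, or ask the user.
    return None, tag, "case 5.2"

name = filter_lemma("IBMs", "NE", {}, prespecified="IBM")
unknown = filter_lemma("Waferfabriken", "NN", {})
single = filter_lemma("Häuser", "NN", {"NN": ["Haus"]})
retagged = filter_lemma("schöne", "NN", {"ADJA": ["schön"]})
```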

19
Lemma Filtering for German
  • 0.74% of all PoS tags were exchanged (2% of
    adjective tags, noun tags, verb tags).
  • In other words: 14,000 tags per annual volume of
    the ComputerZeitung were exchanged.
  • 85% are cases with exactly one Gertwol tag, 15%
    are guesses.

20
Limitations of Gertwol
  • Compounds are lemmatized only if all parts are
    known.
  • Idea: Use a corpus for lemmatizing the remaining
    compounds
  • Examples: kaputtreden, Waferfabriken
  • Solution:
  • If the first part occurs standing alone AND
  • the second part occurs standing alone with a
    lemma,
  • then segment and lemmatize!
  • and store the first part as lemma (of itself)!
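The corpus-based solution can be sketched as a search over split points. Everything here is simplified for illustration: word forms are lowercased, the corpus is represented by a set of stand-alone forms and a form-to-lemma dict, and the first acceptable split is taken.

```python
def split_compound(word, standalone_forms, lemma_of):
    """Segment an unknown compound using corpus evidence.

    standalone_forms: word forms that occur on their own in the corpus.
    lemma_of: dict from stand-alone word forms to their lemmas.
    Returns (first part, lemma of second part) or None.
    """
    lowered = word.lower()
    # Both parts must have at least two characters.
    for i in range(2, len(lowered) - 1):
        first, second = lowered[:i], lowered[i:]
        if first in standalone_forms and second in lemma_of:
            # Store the first part as lemma of itself, as the slide suggests.
            lemma_of.setdefault(first, first)
            return first, lemma_of[second]
    return None

corpus_forms = {"kaputt", "wafer"}
corpus_lemmas = {"reden": "reden", "fabriken": "fabrik"}

verb_split = split_compound("kaputtreden", corpus_forms, corpus_lemmas)
noun_split = split_compound("Waferfabriken", corpus_forms, corpus_lemmas)
```

Taking the first acceptable split keeps the sketch short, but short accidental prefixes could produce wrong segmentations, so a real system would need a way to rank competing splits.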

21
NP/PP Chunk Recognition (a project by Dominik A.
Merz)
  • adapted to Swedish by Jerker Hagman, 2004
  • Pattern matcher with patterns over PoS tags
  • Example patterns:
  • ADV ADJ --> AP
  • ART AP NN --> NP
  • PR NP --> PP
  • Example tree
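A pattern matcher over PoS tags can be sketched as repeated bottom-up rewriting: whenever a sequence of labels matches a pattern, it is replaced by a single chunk node. This is only an illustration with the three example patterns above, not the 135-rule chunker. The Swedish example phrase and its tags are assumptions.

```python
# The three example patterns from the slide: (sequence of labels, new label).
RULES = [
    (("ADV", "ADJ"), "AP"),
    (("ART", "AP", "NN"), "NP"),
    (("PR", "NP"), "PP"),
]

def chunk(tagged):
    """Rewrite label sequences into chunks until no pattern matches."""
    # A node is (label, content): a word for leaves, a list of nodes for chunks.
    nodes = [(tag, word) for word, tag in tagged]
    changed = True
    while changed:
        changed = False
        for pattern, label in RULES:
            size = len(pattern)
            for i in range(len(nodes) - size + 1):
                window = nodes[i:i + size]
                if tuple(node[0] for node in window) == pattern:
                    nodes[i:i + size] = [(label, window)]
                    changed = True
                    break
            if changed:
                break
    return nodes

def to_brackets(node):
    """Render a node as a labelled bracketing."""
    label, content = node
    if isinstance(content, str):
        return f"{content}/{label}"
    return f"[{label} " + " ".join(to_brackets(child) for child in content) + "]"

tree = chunk([("i", "PR"), ("en", "ART"), ("mycket", "ADV"),
              ("stor", "ADJ"), ("stad", "NN")])
bracketing = to_brackets(tree[0])
```

For the phrase "i en mycket stor stad" the AP is built first, then the NP around it, then the PP, giving the partial tree [PP i/PR [NP en/ART [AP mycket/ADV stor/ADJ] stad/NN]].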

22
Jerker Hagman's results
  • 135 chunking rules
  • Categories:
  • AdjP, AdvP,
  • MPN, Coordinated_MPN, MPN_genitive
  • NP, Coordinated_NP, NP_genitive
  • PP
  • VerbCluster (hade gått), InfinitiveGroup (att
    minska)
  • Evaluation against a small treebank:
  • 75% precision
  • 68% recall

23
Recognition of temporal PPs in German (a project
by Stefan Höfler)
  • A second step towards semantic annotation.
  • Starting point:
  • Prepositions (3) that always introduce a temporal
    PP: binnen, während, zeit
  • Prepositions (30) that may introduce a temporal
    PP: ab, an, auf, bis, ... + additional evidence
  • Additional evidence:
  • Temporal adverb in PP: heute, niemals, wann, ...
  • Temporal noun in PP: Minute, Stunde, Jahr,
    Anfang, ...
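The two-tier decision (prepositions that are always temporal versus prepositions that need additional evidence) might be sketched like this. The word lists contain only the items named on the slide, and the PP is assumed to arrive as a list of lemmatised words starting with the preposition.

```python
ALWAYS_TEMPORAL = {"binnen", "während", "zeit"}
MAYBE_TEMPORAL = {"ab", "an", "auf", "bis"}  # only 4 of the 30 are listed
TEMPORAL_ADVERBS = {"heute", "niemals", "wann"}
TEMPORAL_NOUNS = {"Minute", "Stunde", "Jahr", "Anfang"}

def is_temporal_pp(pp_words):
    """Classify a PP given as [preposition, word, word, ...]."""
    preposition, rest = pp_words[0].lower(), pp_words[1:]
    if preposition in ALWAYS_TEMPORAL:
        return True
    if preposition in MAYBE_TEMPORAL:
        # Ambiguous preposition: require a temporal adverb or noun in the PP.
        return any(word in TEMPORAL_NOUNS or word.lower() in TEMPORAL_ADVERBS
                   for word in rest)
    return False

examples = {
    "während der Sitzung": is_temporal_pp(["während", "der", "Sitzung"]),
    "bis heute": is_temporal_pp(["bis", "heute"]),
    "auf dem Tisch": is_temporal_pp(["auf", "dem", "Tisch"]),
}
```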

24
Recognition of temporal PPs
  • Evaluation corpus: 990 sentences with 263
    manually checked temporal PPs
  • Result:
  • Precision: 81%
  • Recall: 76%

25
Recognition of local PPs
  • Starting point:
  • Prepositions that always introduce a local PP:
    fern, oberhalb, südlich von
  • Prepositions that may introduce a local PP: ab,
    auf, bei, ... + additional evidence
  • Additional evidence:
  • Local adverb in PP: dort, hier, oben, rechts,
    ...
  • Local noun in PP: Strasse, Quartier, Land,
    Norden, <LOC>, ...

26
Recognition of temporal and local PPs

27
A Word on Recall and Precision
  • The focus varies with the application!
  • Often precision is more important than recall!
  • Idea: If I annotate something, then I want to be
    'sure' that it is correct.
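For reference, precision is the share of the system's annotations that are correct, and recall is the share of the gold-standard annotations that the system found. A minimal sketch, with made-up token spans as the annotation units:

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical (start, end) token spans, not data from the evaluations here.
gold_spans = {(0, 2), (5, 7), (9, 12), (15, 16)}
predicted_spans = {(0, 2), (5, 7), (20, 22)}

precision, recall = precision_recall(predicted_spans, gold_spans)
```

Here two of the three predicted spans are correct (precision 2/3), and two of the four gold spans were found (recall 1/2). A cautious annotator that marks only clear cases raises precision at the cost of recall.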

28
Clause Boundary Recognition (a project by
Gaudenz Lügstenmann)
  • Definition: A clause is a unit consisting of a
    full verb together with its (non-clausal)
    complements and adjuncts.
  • A sentence consists of one or more clauses, and a
    clause consists of one or more phrases.
  • Clauses are important for determining the
    cooccurrence of verbs and PPs (among other
    things).

29
Dagens Nyheter, 20. Sept. 2004
  • <S> Mijailovic vårdas på sjukhus
  • <S> Anna Lindhs mördare Mijailo Mijailovic är så
    sjuk <CB> att han förts till sjukhus.
  • <S> Sedan i lördags vårdas han vid
    rättspsykiatriska kliniken på Karolinska
    universitetssjukhuset i Huddinge. <S> Dit fördes
    han <CB> sedan en läkare vid Kronobergshäktet i
    Stockholm konstaterat <CB> att det fanns risk
    <CB> att han skulle försöka <CB> ta livet av sig
    i häktet. <S> Det skriver Aftonbladet och
    Expressen.
  • <S> Mijailovic, <CB> som väntar på rättegången i
    Högsta domstolen <CB> efter att ha dömts till
    sluten psykiatrisk vård och inte till fängelse,
    <CB> ska enligt tidningarna ha slutat ta sina
    tabletter <CB> och blivit starkt förvirrad. <S>
    Enligt Kriminalvårdsstyrelsens bestämmelser ska i
    sådana fall en fånge föras till sjukhus.

30
Clause Boundary Recognition
  • Exceptions from the definition: clauses with more
    than one verb
  • Coordinated verbs
  • Daten können überführt und verarbeitet werden.
  • Perception verb + infinitive verb (AcI)
  • die den Markt wachsen sehen.
  • 'lassen' + infinitive verb
  • lässt die Handbücher übertragen

31
Clause Boundary Recognition
  • Exceptions from the definition: clauses without a
    verb
  • Elliptical clauses (e.g. in coordinated
    structures)
  • Examples:
  • Er beobachtet den Markt und seine Mitarbeiter die
    Konkurrenz.
  • Heute kann die Welt nur mehr knapp 30 dieser
    früher äusserst populären Riesenbilder bewundern,
    drei davon in der Schweiz.

32
Clause Boundary Recognition
  • The German CB recognizer is realized as a pattern
    matcher over PoS tags (34 patterns).
  • Examples:
  • Comma + Relative Pronoun
  • Finite verb ... Conjunction ... Finite Verb
  • Most difficult: CB without overt punctuation
    symbol or trigger word
  • Example: Simple Budgetreduzierungen in der IT in
    den Vordergrund zu stellen <CB> ist der falsche
    Ansatz.
  • Does this happen often in Swedish?
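One of the patterns, comma followed by a relative pronoun, can be sketched as a scan over the tagged tokens. The STTS tags "$," for the comma and PRELS/PRELAT for relative pronouns are assumed, the boundary marker is written here as <CB>, and the function name is hypothetical.

```python
RELATIVE_PRONOUN_TAGS = {"PRELS", "PRELAT"}

def mark_relative_clause_boundaries(tagged):
    """Insert a <CB> marker after every comma that precedes a relative pronoun."""
    output = []
    for i, (word, tag) in enumerate(tagged):
        output.append(word)
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "$," and next_tag in RELATIVE_PRONOUN_TAGS:
            output.append("<CB>")
    return output

sentence = [("schrieb", "VVFIN"), ("der", "ART"), ("Präsident", "NN"),
            (",", "$,"), ("der", "PRELS"), ("Michael", "NE"),
            ("Eisner", "NE"), ("kannte", "VVFIN"), (",", "$,"),
            ("im", "APPRART"), ("Jahresbericht", "NN")]
marked = mark_relative_clause_boundaries(sentence)
```

Only the boundary that opens the relative clause is found by this single pattern. The closing boundary after the second comma would need a further pattern, which illustrates why a set of patterns is required.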

33
Clause Boundary Recognition for German
  • Evaluation corpus: 1150 sentences with 754
    intra-sentential CBs.
  • Results (counting all CBs):
  • Precision: 95.8%
  • Recall: 84.9%
  • Results (counting only intra-sentential CBs):
  • Precision: 90.5%
  • Recall: 61.1%

34
Using a PoS Tagger for Clause Boundary
Recognition in German
  • A CB recognizer can be seen as a disambiguator
    over commas and CB-trigger-tokens (if we
    disregard the CBs without trigger).
  • A tagger may serve the same purpose.
  • Example:
  • ... schrieb der Präsident,<Co> Michael
    Eisner,<Co> im Jahresbericht.
  • ... schrieb der Präsident,<CB> der Michael Eisner
    kannte,<CB> im Jahresbericht.

35
Using a PoS Tagger for Clause Boundary
Recognition in German
  • Evaluation corpus: 1150 sentences with 754
    intra-sentential CBs.
  • Training the Brill-Tagger on 75% and applying it
    to the remaining 25%
  • Results:
  • 93% precision
  • 91% recall
  • Caution: very small evaluation corpus!

36
Clause Boundary Recognition vs. Clause Recognition
  • CB recognition marks only the boundaries. It does
    not identify discontinuous parts of clauses. It
    does not identify nesting.
  • Example:
  • <S> Mijailovic, <CB> som väntar på rättegången i
    Högsta domstolen <CB> efter att ha dömts till
    sluten psykiatrisk vård och inte till fängelse,
    <CB> ska enligt tidningarna ha slutat ta sina
    tabletter <CB> och blivit starkt förvirrad.
  • <C> Mijailovic, <C> som väntar på rättegången i
    Högsta domstolen <C> efter att ha dömts till
    sluten psykiatrisk vård och inte till fängelse,
    </C></C> ska enligt tidningarna ha slutat ta sina
    tabletter </C><C> och blivit starkt
    förvirrad.</C>
  • Clause Recognition should be done with a
    recursive parsing approach because of clause
    nesting.
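What the nested markup adds over flat boundaries can be shown by reading it back with a depth counter. This sketch does not recognise clauses. It only interprets text that is already annotated with opening and closing clause tags (written here as <C> and </C>), and the function name is an assumption.

```python
import re

def clause_depths(marked_text):
    """Split clause-annotated text into (nesting depth, text) pieces."""
    pieces, depth = [], 0
    for part in re.split(r"(</?C>)", marked_text):
        if part == "<C>":
            depth += 1
        elif part == "</C>":
            depth -= 1
            if depth < 0:
                raise ValueError("closing tag without opening tag")
        elif part.strip():
            pieces.append((depth, part.strip()))
    if depth != 0:
        raise ValueError("unclosed clause")
    return pieces

example = ("<C> Mijailovic, <C> som väntar <C> efter att ha dömts, </C></C> "
           "ska ha slutat ta sina tabletter </C><C> och blivit förvirrad.</C>")
pieces = clause_depths(example)
```

In the shortened example the pieces "Mijailovic," and "ska ha slutat ta sina tabletter" both come out at depth 1: they are the two discontinuous parts of the same main clause, interrupted by material at depths 2 and 3. Flat boundary markers cannot express this.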

37
Summary
  • Part-of-Speech tagging based on statistical
    methods is robust and reliable.
  • The TreeTagger assigns PoS tags and lemmas.
  • Swetwol is a morphological analyser that, given a
    word form, outputs the PoS tag, the lemma and the
    morphological features for all its readings.
  • Multiple knowledge sources (e.g. PoS-tagger and
    Swetwol) may lead to conflicting tags.
  • Chunking (partial parsing) builds partial trees.
  • Clause boundary detection can be realized as
    pattern matching over PoS tags.