Annotation of Grammatemes in the Prague Dependency Treebank 2.0

Description:

Examples of grammateme value assignment. Final remarks. LREC 2006, ... 16,065 sentences with 1,960,657 tokens. 75 % of the m-layer data annotated at the a-layer ...

1
Annotation of Grammatemes in the Prague
Dependency Treebank 2.0
  • Magda Razímová
  • Zdeněk Žabokrtský
  • Institute of Formal and Applied Linguistics
  • Charles University
  • Prague, Czech Republic
  • {razimova,zabokrtsky}@ufal.mff.cuni.cz

2
Outline of the talk
  • Introduction
  • Prague Dependency Treebank 2.0
  • Annotation of grammatemes
      • Motivation
      • Grammateme attributes
      • Two-level node hierarchy
      • Examples of grammateme value assignment
  • Final remarks

3
Introduction
  • grammatemes in the PDT 2.0
      • one type of attributes of nodes of a deep
        syntactic tree
      • capturing morphological meanings that are
        semantically indispensable
      • number for nouns, degree of comparison for
        adjectives, tense for verbs, etc.
  • annotation of grammatemes
      • the last task in the PDT 2.0 annotation procedure
      • can be assigned automatically, profiting from
        the already available annotation:
          • annotation of the same sentence at the lower
            layers
          • already available components of the t-tree (tree
            structure, types of dependency relations,
            co-reference, etc.)

4
Historical background and development of the PDT
project
  • mid-1960s: Praguian Functional Generative
    Description (Petr Sgall et al.)
  • 1994: Czech National Corpus
  • 1995: PDT started
  • 1998: PDT 0.5 pre-release
  • 2001: PDT 1.0 released by LDC
      • manual annotation of morphology and surface
        syntax
  • 2006: PDT 2.0 to be released by LDC
      • interlinked morphological, surface-syntactic and
        complex deep-syntactic annotation
      • including annotation of grammatemes

6
Layers of annotation
  • tectogrammatical layer
      • deep-syntactic dependency tree
  • analytical layer
      • surface-syntactic dependency tree
  • morphological layer
      • m-lemma and m-tag associated with each token
  • word layer
      • original text, segmented on word boundaries

lit.: He-was would went to forest.
'He would have gone to the forest.'
7
Interlinking the layers
  • any unit at any layer has a PDT unique ID
  • neighboring layers connected by top-down pointers
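
The interlinking described above can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class `Unit`, the function `descend`, the IDs (`t1`, `a1`, `m1`, ...) and the attribute names are hypothetical and do not reproduce the actual PDT 2.0 file format.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    uid: str                                    # treebank-wide unique ID (hypothetical format)
    layer: str                                  # "m", "a" or "t" in this sketch
    attrs: dict = field(default_factory=dict)   # layer-specific attributes
    lower: list = field(default_factory=list)   # IDs of linked units one layer below

def descend(index, uid, target_layer):
    """Follow top-down pointers from `uid` to the linked units on `target_layer`."""
    unit = index[uid]
    if unit.layer == target_layer:
        return [unit]
    found = []
    for lower_id in unit.lower:
        found.extend(descend(index, lower_id, target_layer))
    return found

# Toy data for the verbal form "byl by šel": one t-node, three a-nodes, three m-units.
units = [
    Unit("m1", "m", {"form": "Byl"}),
    Unit("m2", "m", {"form": "by"}),
    Unit("m3", "m", {"form": "šel"}),
    Unit("a1", "a", lower=["m1"]),
    Unit("a2", "a", lower=["m2"]),
    Unit("a3", "a", lower=["m3"]),
    Unit("t1", "t", {"t_lemma": "jít"}, lower=["a1", "a2", "a3"]),
]
index = {u.uid: u for u in units}

forms = [u.attrs["form"] for u in descend(index, "t1", "m")]
print(forms)
```

Because every pointer goes from a higher layer to a lower one, a procedure working on a t-node can reach all a-layer and m-layer information of the word forms it covers.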

lit.: He-was would went to forest.
'He would have gone to the forest.'
8
Size of the PDT 2.0 data (i)
  • 7,129 manually annotated textual documents
  • all documents annotated at the m-layer
      • 16,065 sentences with 1,960,657 tokens
  • 75 % of the m-layer data annotated at the a-layer
      • 5,338 documents, 87,980 sentences,
        1,504,847 tokens
  • 44 % of the m-layer data annotated also at the
    t-layer
      • 3,168 documents, 49,442 sentences, 833,357
        tokens

9
Size of the PDT 2.0 data (ii)
  • training data (80 %)
  • development test data (10 %)
  • evaluation test data (10 %)

10
M-layer
  • sentence represented as a sequence of tokens
  • each token lemmatized and tagged (attributes
    m-lemma and m-tag)
  • positional m-tag
      • 15 characters
      • 1. (main) POS
      • 2. detailed POS
      • 3. gender
      • 4. number
      • 5. case
      • ...
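
A minimal sketch of reading such a positional tag, assuming only what the slide states: the tag has 15 characters and the first five positions hold main POS, detailed POS, gender, number and case. The function name `parse_m_tag` and the field names are hypothetical, and positions 6 to 15 are kept as an uninterpreted remainder because the slide does not list them.

```python
# Names of the first five tag positions, as listed on the slide.
FIELD_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]

def parse_m_tag(tag):
    """Split a 15-character positional m-tag into its first five named fields."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}: {tag!r}")
    fields = dict(zip(FIELD_NAMES, tag))   # zip stops after the five named positions
    fields["rest"] = tag[5:]               # positions 6-15, not interpreted here
    return fields

# Tag quoted later in the talk for the noun "les" (forest).
parsed = parse_m_tag("NNIS2-----A----")
print(parsed["pos"], parsed["number"], parsed["case"])
```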

lit.: Some contours problem(gen)
reflexive_pronoun though after resurgence(instr)
Havel's speech(instr) they-seem to-be
clearer.
'Some contours of the problem seem to be
clearer after the resurgence by Havel's speech.'
11
A-layer
  • rooted ordered tree with labeled nodes and edges
  • a-nodes
      • one token of the m-layer is represented by
        exactly one a-node
      • labeled with a-lemmas (identical with word forms)
  • a-edges
      • represent dependency relations (Sb, Obj, Adv,
        Atr)
      • represent non-dependency relations (Coord)
      • analytical function attribute appears as an
        a-node attribute

Some contours of the problem seem to be clearer
after the resurgence by Havel's speech.
12
T-layer
  • rooted ordered tree with labeled nodes and edges
  • t-nodes
      • complex typed feature structures
      • represent auto-semantic words
      • functional words do not have nodes of their own
      • artificially added nodes
  • t-edges
      • dependency relations (functor)
      • non-dependency relations (coordination
        constructions)
      • functor attribute appears as a t-node attribute

Some contours of the problem seem to be clearer
after the resurgence by Havel's speech.
13
Areas of annotation at the t-layer
Všem bylo předáno osvědčení o úspěšném
absolvování kurzu.
  • tree structure
  • t-lemma attribute
  • dependency relation (functor and subfunctor)
  • topic-focus attributes
  • co-reference attributes
  • node typing attributes (nodetype and sempos)
  • grammateme attributes

lit.: To all was handed over a certificate of
successful graduation from the course.
'They all received a certificate of successful
graduation from this course.'
15
Grammatemes: Motivation
  • grammatemes
  • t-node attributes representing inflectional
    information that is semantically indispensable
    (morphological meanings such as number for nouns,
    tense for verbs, degree of comparison for
    adjectives, etc.)
  • semantically irrelevant morphological meanings
    are not part of the t-layer (e.g. case for nouns)

16
Grammateme attributes
  • 15 grammatemes
  • indeftype
  • numertype
  • negation
  • degcmp
  • tense
  • aspect
  • verbmod
  • deontmod
  • dispmod
  • resultative
  • iterativeness
  • number
  • gender
  • person
  • politeness

17
Conditioned presence/absence of grammatemes
  • obviously, not all grammatemes are relevant for
    all nodes
      • no tense for dog, no degree of comparison for
        (he) waits, etc.
  • how to formally declare presence/absence of a
    given grammateme attribute in a given node?
      • the need for node typing
  • chosen solution: two-level typing
      • 1st level: 8 more general types of nodes
          • grammatemes relevant only for one of them
      • 2nd level: 19 more specific subtypes,
        corresponding to detailed semantic parts of
        speech

18
Presence/absence of grammateme values: Two-level
t-node hierarchy
  • 1st level: attribute nodetype
  • 2nd level: attribute sempos

19
First level of the hierarchy: attribute nodetype
  • 8 attribute values
      • root, qcomplex, list, atom, coap, dphr,
        fphr, complex
  • fully automatic annotation: use of
      • the tree structure → root
      • t-attributes
          • t-lemma → qcomplex, list
          • functor → atom, coap, dphr, fphr
      • else → complex
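
The cascade above (tree structure, then t-lemma, then functor, else complex) can be sketched as an ordered sequence of checks. The order of the checks follows the slide; the two lookup tables are assumptions made for illustration only, are deliberately incomplete, and are not the lists used in the actual PDT 2.0 procedure. The function name `assign_nodetype` is likewise hypothetical.

```python
# Illustrative, incomplete lookup tables (assumptions, not the PDT 2.0 lists).
T_LEMMA_TO_NODETYPE = {
    "#Forn": "list",
    "#Idph": "list",
    "#Gen": "qcomplex",
    "#Cor": "qcomplex",
}
FUNCTOR_TO_NODETYPE = {
    "RHEM": "atom",
    "PREC": "atom",
    "CONJ": "coap",
    "APPS": "coap",
    "DPHR": "dphr",
    "FPHR": "fphr",
}

def assign_nodetype(is_root, t_lemma, functor):
    """Apply the checks in the order given on the slide; the first match wins."""
    if is_root:                                  # decided by the tree structure
        return "root"
    if t_lemma in T_LEMMA_TO_NODETYPE:           # decided by the t-lemma
        return T_LEMMA_TO_NODETYPE[t_lemma]
    if functor in FUNCTOR_TO_NODETYPE:           # decided by the functor
        return FUNCTOR_TO_NODETYPE[functor]
    return "complex"                             # everything else

print(assign_nodetype(False, "benzín", "PAT"))
```

Making `complex` the default means that only complex nodes proceed to the second level of typing.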

Levnější benzín na Východě, dražší na Západě
'Cheaper gasoline in the East, more expensive one
in the West'
20
Second level of the hierarchy: attribute sempos
  • only complex nodes grouped into semantic parts of
    speech
  • 19 values of the attribute sempos
      • n. ... adj. ... adv. ... v. ...
  • fully automatic annotation: use of
      • m-tag
      • t-lemma
      • other t-attributes
  • sempos value delimits the set of relevant
    grammatemes
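
How a sempos value can delimit the relevant grammatemes is sketched below. The table covers only three sempos values and its contents are an assumption built from the examples in the talk (number for nouns, degree of comparison for adjectives, tense for verbs); the value `adj.denot`, the exact grammateme sets and both function names are illustrative, not the official PDT 2.0 specification.

```python
# Illustrative mapping from sempos to relevant grammatemes (an assumption).
RELEVANT_GRAMMATEMES = {
    "n.denot": frozenset({"number", "gender"}),
    "adj.denot": frozenset({"degcmp", "negation"}),
    "v": frozenset({"tense", "aspect", "verbmod", "deontmod",
                    "dispmod", "resultative", "iterativeness"}),
}

def relevant_grammatemes(nodetype, sempos):
    """Only complex nodes carry a sempos and therefore grammatemes."""
    if nodetype != "complex":
        return frozenset()
    return RELEVANT_GRAMMATEMES.get(sempos, frozenset())

def unexpected_grammatemes(node):
    """Names of grammatemes present on the node but not licensed by its type."""
    allowed = relevant_grammatemes(node["nodetype"], node.get("sempos"))
    return sorted(set(node.get("gram", {})) - allowed)

# A noun node that wrongly carries a tense grammateme.
noun = {"nodetype": "complex", "sempos": "n.denot",
        "gram": {"number": "sg", "tense": "sim"}}
print(unexpected_grammatemes(noun))
```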

21
Values of nodetype and sempos in the PDT 2.0:
an overview
  • nodetype values
  • sempos values

22
Grammateme value assignment
  • n-tred: environment for processing the PDT data
    http://ufal.mff.cuni.cz/pajas
  • automatic annotation
      • 2000 lines of Perl code
      • crucial importance of inter-layer links: use of
        t-attributes, a-attributes, m-attributes
      • rules using a special economical notation
          • 2000 lines written in a text file
      • lexical resources
          • special-purpose lists of adverbs / verbs
  • manual annotation of special problems
      • two annotators working in parallel
      • simplified annotation environment: treebank
        positions extracted into simple HTML forms

23
Simple HTML-based environment for manual
annotation
lit.: The difference you would have to pay
yourself.
24
Automatic vs. manual assignment
  • at the t-layer of the PDT 2.0
      • 1,594,333 grammateme values assigned
      • at 550,947 complex nodes
  • manually assigned
      • 17,520 grammateme values
      • inter-annotator agreement 70-85 %

25
Grammateme assignment and m-tag
n.denot, number=sg
  • number grammateme: values sg, pl
  • assigned automatically using the m-tag
      • e.g. les (forest)
      • m-layer: tag NNIS2-----A----
      • → t-layer: number=sg
  • manual assignment
      • nouns with only plural forms (identified by a
        list extracted from the machine-readable
        dictionary of standard Czech)
      • e.g. dveře (door/doors)
      • m-layer: always plural
      • t-layer: annotator's decision between sg and pl
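
The division of labour on this slide can be sketched as follows: the number grammateme is read from the fourth position of the m-tag unless the lemma is on the list of plural-only nouns, in which case the decision is left to an annotator. The one-item list, the mapping of tag characters `S` and `P` to `sg` and `pl`, the tag shown for dveře and the function name are assumptions for illustration; the real procedure used a list extracted from a dictionary.

```python
# Hypothetical stand-in for the list of plural-only nouns.
PLURAL_ONLY_NOUNS = {"dveře"}

# Assumed mapping from the number position of the m-tag to grammateme values.
TAG_NUMBER_TO_GRAMMATEME = {"S": "sg", "P": "pl"}

def assign_number(m_lemma, m_tag):
    """Return 'sg' or 'pl', or None when the value needs a manual decision."""
    if m_lemma in PLURAL_ONLY_NOUNS:
        return None                       # m-layer is always plural; annotator decides
    return TAG_NUMBER_TO_GRAMMATEME.get(m_tag[3])   # 4th position holds number

print(assign_number("les", "NNIS2-----A----"))
print(assign_number("dveře", "NNFP1-----A----"))   # illustrative tag
```

Returning `None` instead of guessing keeps the automatic step from overwriting cases that the slide reserves for manual assignment.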

lit.: He-was would went to forest.
'He would have gone to the forest.'
26
Grammateme assignment and tree structure
v, verbmod=cdn
  • mood grammateme verbmod: values ind, imp, cdn
  • assigned automatically
      • one-word verbal forms
          • e.g. jde (goes)
          • m-tag information
      • verbal forms consisting of more word forms
        (represented by a single node at the t-layer)
          • e.g. byl by šel (would have gone)
          • corresponding a-layer subtree involves the node
            by
          • m-tag of the node by
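
A rough sketch of this decision, assuming that the a-layer nodes behind one verbal t-node are available as (form, m-tag) pairs. It further assumes that the detailed-POS character `c` marks the conditional form by and `i` marks an imperative; these tag conventions, the example tags and the function name `assign_verbmod` are assumptions for illustration and should be checked against the tagset documentation.

```python
def assign_verbmod(a_nodes):
    """Guess the verbmod value from the m-tags of the word forms of one verb.

    a_nodes: list of (form, m_tag) pairs for the a-layer nodes that
    correspond to a single verbal t-node.
    """
    detailed = [tag[:2] for _, tag in a_nodes]   # main POS + detailed POS
    if "Vc" in detailed:       # the subtree contains the conditional form "by"
        return "cdn"
    if "Vi" in detailed:       # imperative form
        return "imp"
    return "ind"               # default: indicative

# Example tags are illustrative, not copied from the treebank.
complex_form = [("Byl", "VpYS---XR-AA---"),
                ("by", "Vc-X---3-------"),
                ("šel", "VpYS---XR-AA---")]
simple_form = [("jde", "VB-S---3P-AA---")]

print(assign_verbmod(complex_form))
print(assign_verbmod(simple_form))
```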

lit.: He-was would went to forest.
'He would have gone to the forest.'
27
Grammateme assignment and co-reference
Ze zbytku suroviny mlékárna vyrábí sušené mléko,
které vyváží do Asie a Jižní Ameriky.
  • grammatemes gender, number and person in relative
    pronouns are left underspecified (value inher),
    since they are imposed only by grammatical
    agreement (thus can be inherited from the
    antecedents)
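
A consumer of the data could resolve the value inher by following the co-reference link to the antecedent, roughly as sketched here. The node dictionaries, the key names `gram` and `coref`, the node IDs, the grammateme value `neut` and the function name are hypothetical and serve only to illustrate the idea.

```python
def resolve_grammateme(nodes, node_id, name):
    """Follow co-reference links until a value other than 'inher' is found.

    Returns None if the chain ends or loops without yielding a value.
    """
    visited = set()
    while node_id is not None and node_id not in visited:
        visited.add(node_id)
        node = nodes[node_id]
        value = node["gram"].get(name)
        if value != "inher":
            return value
        node_id = node.get("coref")      # move on to the antecedent
    return None

# Toy t-nodes for "mléko, které ..." (milk, which ...).
nodes = {
    "t_mleko": {"gram": {"number": "sg", "gender": "neut"}, "coref": None},
    "t_ktery": {"gram": {"number": "inher", "gender": "inher"},
                "coref": "t_mleko"},
}

print(resolve_grammateme(nodes, "t_ktery", "number"))
print(resolve_grammateme(nodes, "t_ktery", "gender"))
```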

lit.: From remainder of raw material the dairy
produces dried milk, which it exports to Asia
and South America.
'From the rest of the material, the dairy produces
dried milk, which is exported by it to Asia and
South America.'
29
Final remarks
  • achievements
      • two-level typing of t-layer nodes, which makes it
        possible to formally capture presence/absence of
        individual grammatemes in a given node
      • automatic procedure for assigning the node
        classification and the grammateme attributes
      • verification of the procedure on large-scale data
  • experience
      • it was the existence of the lower annotation
        layers and of the inter-layer links that made it
        possible to make grammateme assignment more or
        less automatic

30
  • http://ufal.mff.cuni.cz/pdt2.0/