Title: Annotation of Grammatemes in the Prague Dependency Treebank 2.0
1Annotation of Grammatemes in the Prague
Dependency Treebank 2.0
- Magda Razímová
- Zdeněk Žabokrtský
- Institute of Formal and Applied Linguistics
- Charles University
- Prague, Czech Republic
- razimova,zabokrtsky_at_ufal.mff.cuni.cz
2Outline of the talk
- Introduction
- Prague Dependency Treebank 2.0
- Annotation of grammatemes
- Motivation
- Grammateme attributes
- Two-level node hierarchy
- Examples of grammateme value assignment
- Final remarks
3Introduction
- grammatemes in the PDT 2.0
- one type of attributes of nodes of a deep syntactic tree
- capturing morphological meanings that are semantically indispensable
- number for nouns, degree of comparison for adjectives, tense for verbs, etc.
- annotation of grammatemes
- the last task in the PDT 2.0 annotation procedure
- possible to assign automatically, profiting from the already available annotation
- annotation of the same sentence at the lower layers
- already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)
4Historical background and development of the PDT project
- mid 1960s: Praguian Functional Generative Description (Petr Sgall et al.)
- 1994: Czech National Corpus
- 1995: PDT started
- 1998: PDT 0.5 pre-release
- 2001: PDT 1.0 released by LDC
- manual annotation of morphology and surface syntax
- 2006: PDT 2.0 to be released by LDC
- interlinked morphological, surface-syntactic and complex deep-syntactic annotation
- including annotation of grammatemes
6Layers of annotation
- tectogrammatical layer
- deep-syntactic dependency tree
- analytical layer
- surface-syntactic dependency tree
- morphological layer
- m-lemma and m-tag
- associated with each token
- word layer
- original text, segmented on word boundaries
lit.: He-was would went toforest.
'He would have gone to the forest.'
7Interlinking the layers
- any unit at any layer has
- a PDT unique ID
- neighboring layers connected by top-down pointers
lit.: He-was would went toforest.
'He would have gone to the forest.'
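The interlinking described on this slide can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Unit` class, the IDs (`t1`, `a1`, ...) and the `descend` helper are hypothetical names, not the PDT 2.0 file format. The labels are taken from the literal gloss on the slide, and the t-node label is a stand-in.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """A unit at any layer: a unique ID plus pointers to the layer below."""
    uid: str                 # hypothetical ID scheme, for illustration only
    label: str
    lower: List[str] = field(default_factory=list)  # top-down pointers


def build_index(units: List[Unit]) -> Dict[str, Unit]:
    """Index units by their unique ID so pointers can be followed."""
    return {u.uid: u for u in units}


def descend(index: Dict[str, Unit], uid: str) -> List[str]:
    """Follow top-down pointers until units without lower pointers are reached."""
    unit = index[uid]
    if not unit.lower:
        return [uid]
    reached: List[str] = []
    for child in unit.lower:
        reached.extend(descend(index, child))
    return reached


units = [
    # word layer (labels from the literal gloss)
    Unit("w1", "He-was"), Unit("w2", "would"), Unit("w3", "went"),
    # morphological layer: one token per word
    Unit("m1", "He-was", ["w1"]), Unit("m2", "would", ["w2"]), Unit("m3", "went", ["w3"]),
    # analytical layer: one a-node per m-layer token
    Unit("a1", "He-was", ["m1"]), Unit("a2", "would", ["m2"]), Unit("a3", "went", ["m3"]),
    # tectogrammatical layer: a single node for the whole verb form
    Unit("t1", "verb (stand-in label)", ["a1", "a2", "a3"]),
]
index = build_index(units)
words_under_t1 = descend(index, "t1")
```

Because every pointer goes from a higher layer to its lower neighbour, a t-node can reach the m-tags and word forms it was built from, which is what the automatic grammateme assignment relies on.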
8Size of the PDT 2.0 data (i)
- 7,129 manually annotated textual documents
- all documents annotated at the m-layer
- 16,065 sentences with 1,960,657 tokens
- 75% of the m-layer data annotated at the a-layer
- 5,338 documents, 87,980 sentences, 1,504,847 tokens
- 44% of the m-layer data annotated also at the t-layer
- 3,168 documents, 49,442 sentences, 833,357 tokens
9Size of the PDT 2.0 data (ii)
- training data (80%)
- development test data (10%)
- evaluation test data (10%)
10M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes m-lemma and m-tag)
- positional m-tag
- 15 characters
- 1. (main) POS
- 2. detailed POS
- 3. gender
- 4. number
- 5. case
- ...
lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer.
'Some contours of the problem seem to be clearer after the resurgence by Havel's speech.'
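A positional tag of this kind can be unpacked by index. The sketch below names only the five positions listed on this slide and keeps the remaining ten characters as an uninterpreted string. The function name `decode_m_tag` and the field names are assumptions made for illustration. The sample tag is the one quoted for les (forest) in this talk.

```python
# Names for the first five positions, as listed on the slide.
FIELD_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_m_tag(tag: str) -> dict:
    """Split a 15-character positional m-tag into named fields.

    Positions 6-15 are not named on the slide, so they are returned
    unchanged under the key "rest".
    """
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    fields = dict(zip(FIELD_NAMES, tag))
    fields["rest"] = tag[len(FIELD_NAMES):]
    return fields


decoded = decode_m_tag("NNIS2-----A----")
```

Here `decoded["number"]` is `"S"` and `decoded["case"]` is `"2"`, matching the singular genitive reading of the example.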
11A-layer
- rooted ordered tree with labeled nodes and edges
- a-nodes
- one token of the m-layer is represented by exactly one a-node
- labeled with a-lemmas (identical with word forms)
- a-edges
- represent dependency relations (Sb, Obj, Adv, Atr)
- represent non-dependency relations (Coord)
- analytical function attribute appears as an a-node attribute
Some contours of the problem seem to be clearer
after the resurgence by Havel's speech.
12T-layer
- rooted ordered tree with labeled nodes and edges
- t-nodes
- complex typed feature structures
- represent auto-semantic words
- functional words do not have nodes of their own
- artificially added nodes
- t-edges
- dependency relations (functor)
- non-dependency relations (coordination constructions)
- functor attribute appears as a t-node attribute
Some contours of the problem seem to be clearer
after the resurgence by Havel's speech.
13Areas of annotation at the t-layer
Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
- tree structure
- t-lemma attribute
- dependency relation
- (functor and subfunctor)
- topic-focus attributes
- co-reference attributes
- node typing attributes (nodetype and sempos)
- grammateme attributes
lit.: To all was handed over a certificate of successful graduation from the course.
'They all received a certificate of successful graduation from this course.'
15Grammatemes: Motivation
- grammatemes
- t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
- semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)
16Grammateme attributes
- indeftype
- numertype
- negation
- degcmp
- tense
- aspect
- verbmod
- deontmod
- dispmod
- resultative
- iterativeness
- number
- gender
- person
- politeness
17Conditioned presence/absence of grammatemes
- obviously, not all grammatemes are relevant for all nodes
- no tense for dog, no degree of comparison for (he) waits, etc.
- how to formally declare presence/absence of a given grammateme attribute in a given node?
- the need for node typing
- chosen solution: two-level typing
- 1st level: 8 more general types of nodes
- grammatemes relevant only for one of them
- 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech
18Presence/absence of grammateme values: Two-level t-node hierarchy
- 1st level: attribute nodetype
- 2nd level: attribute sempos
19First level of the hierarchy: attribute nodetype
- 8 attribute values
- root, qcomplex, list, atom, coap, dphr, fphr, complex
- fully automatic annotation, use of
- the tree structure → root
- t-attributes
- t-lemma → qcomplex, list
- functor → atom, coap, dphr, fphr
- else → complex
Levnější benzín na Východě, dražší na Západě
'Cheaper gasoline in the East, more expensive one in the West'
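The decision order on this slide (tree structure first, then t-lemma, then functor, otherwise complex) can be sketched as a cascade. The concrete t-lemma and functor sets below are placeholders assumed for illustration; they are not the actual PDT 2.0 inventories, and `assign_nodetype` is a hypothetical name rather than the authors' Perl code.

```python
from typing import Optional

# ASSUMPTION: illustrative placeholder sets, not the real PDT 2.0 lists.
QCOMPLEX_LEMMAS = {"#Gen", "#Cor"}
LIST_LEMMAS = {"#Idph", "#Forn"}
FUNCTOR_TO_NODETYPE = {
    "RHEM": "atom",
    "CONJ": "coap",
    "DPHR": "dphr",
    "FPHR": "fphr",
}


def assign_nodetype(is_root: bool, t_lemma: str, functor: Optional[str]) -> str:
    """Return one of the 8 nodetype values, testing the cues in slide order."""
    if is_root:                          # cue 1: position in the tree
        return "root"
    if t_lemma in QCOMPLEX_LEMMAS:       # cue 2: t-lemma
        return "qcomplex"
    if t_lemma in LIST_LEMMAS:
        return "list"
    if functor in FUNCTOR_TO_NODETYPE:   # cue 3: functor
        return FUNCTOR_TO_NODETYPE[functor]
    return "complex"                     # everything else


example = assign_nodetype(False, "benzín", "ACT")
```

With these placeholder sets, an ordinary noun node such as the one for benzín falls through every test and is typed as complex, which is the only type for which grammatemes are relevant.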
20Second level of the hierarchy: attribute sempos
- only complex nodes grouped into semantic parts of speech
- 19 values of the attribute sempos
- n. ... adj. ... adv. ... v. ...
- fully automatic annotation, use of
- m-tag
- t-lemma
- other t-attributes
- sempos value delimits the set of relevant grammatemes
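How a sempos value can delimit the relevant grammatemes is sketched below. The table is deliberately partial: it contains only pairings mentioned in this talk (number for nouns, degree of comparison for adjectives, tense and verbmod for verbs) and should not be read as the full PDT 2.0 specification. The names `RELEVANT` and `irrelevant_grammatemes` are assumptions made for the example.

```python
from typing import Dict, Optional, Set

# ASSUMPTION: partial table built only from pairings named in the talk.
RELEVANT: Dict[str, Set[str]] = {
    "n": {"number"},
    "adj": {"degcmp"},
    "v": {"tense", "verbmod"},
}


def sempos_group(sempos: str) -> str:
    """Map a detailed value such as 'n.denot' to its coarse group 'n'."""
    return sempos.split(".", 1)[0]


def irrelevant_grammatemes(nodetype: str, sempos: Optional[str],
                           grammatemes: Dict[str, str]) -> Set[str]:
    """Return grammateme names that should not be present on the node."""
    if nodetype != "complex":
        # Grammatemes are relevant only for complex nodes.
        return set(grammatemes)
    allowed = RELEVANT.get(sempos_group(sempos or ""), set())
    return set(grammatemes) - allowed


ok = irrelevant_grammatemes("complex", "n.denot", {"number": "sg"})
bad = irrelevant_grammatemes("complex", "n.denot", {"number": "sg", "tense": "?"})
```

In this sketch `ok` is empty, while `bad` contains `"tense"`, mirroring the "no tense for dog" example.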
21Values of nodetype and sempos in the PDT 2.0: an overview
22Grammateme value assignment
- n-tred environment for processing the PDT data http://ufal.mff.cuni.cz/pajas
- automatic annotation
- 2000 lines of Perl code
- crucial importance of inter-layer links: use of
- t-attributes, a-attributes, m-attributes
- rules written in a special economical notation
- 2000 lines written in a text file
- lexical resources
- special purpose lists of adverbs / verbs
- manual annotation of special problems
- two annotators working in parallel
- simplified annotation environment: treebank positions extracted into simple HTML forms
23Simple HTML-based environment for manual annotation
lit.: The difference you would have to pay yourself.
24Automatic vs. manual assignment
- at the t-layer of the PDT 2.0
- 1,594,333 grammateme values assigned
- at 550,947 complex nodes
- manually assigned
- 17,520 grammateme values
- inter-annotator agreement 70-85%
25Grammateme assignment and m-tag
n.denot, number=sg
- number grammateme: values sg, pl
- assigned automatically using m-tag
- e.g. les (forest)
- m-layer: tag NNIS2-----A----
- → t-layer: number=sg
- manual assignment
- nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
- e.g. dveře (door/doors)
- m-layer: always plural
- t-layer: annotator's decision, sg or pl
lit.: He-was would went toforest.
'He would have gone to the forest.'
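The split between automatic and manual assignment of the number grammateme can be sketched as follows. This is an illustration, not the authors' Perl code. The value `S` at the fourth tag position comes from the les example on this slide. The value `P` for plural, the sample tag for dveře and the function name `assign_number` are assumptions.

```python
from typing import Optional, Set

NUMBER_INDEX = 3  # fourth position of the positional m-tag, 0-based

# "S" is attested by the slide's example; "P" is an ASSUMPTION.
TAG_TO_NUMBER = {"S": "sg", "P": "pl"}


def assign_number(m_lemma: str, m_tag: str,
                  plural_only_nouns: Set[str]) -> Optional[str]:
    """Return 'sg' or 'pl', or None when an annotator has to decide."""
    if m_lemma in plural_only_nouns:
        # Morphology is always plural here, so it says nothing about meaning.
        return None
    return TAG_TO_NUMBER.get(m_tag[NUMBER_INDEX])


plural_only = {"dveře"}  # stand-in for the dictionary-derived list
les = assign_number("les", "NNIS2-----A----", plural_only)
# ASSUMPTION: hypothetical tag for dveře, used only to exercise the branch.
dvere = assign_number("dveře", "NNFP1-----A----", plural_only)
```

Returning `None` for listed nouns marks exactly the positions that would be exported to the manual annotation forms.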
26Grammateme assignment and tree structure
v, verbmod=cdn
- mood grammateme verbmod: values ind, imp, cdn
- assigned automatically
- one-word verbal forms
- e.g. jde (goes)
- m-tag information
- verbal forms consisting of more word forms (represented by a single node at the t-layer)
- e.g. byl by šel (would have gone)
- corresponding a-layer subtree involves the node by
- m-tag of the node by
lit.: He-was would went toforest.
'He would have gone to the forest.'
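The use of the a-layer subtree for the verbmod grammateme can be sketched as below. The input is the list of (word form, m-tag) pairs of the a-nodes that one t-node stands for. The tag prefixes `Vc` (conditional auxiliary by) and `Vi` (imperative), the sample tags and the function name `assign_verbmod` are assumptions for illustration; only the idea of looking at the m-tag of the node by comes from the slide.

```python
from typing import List, Tuple

# ASSUMPTION: tag prefixes taken to mark the conditional auxiliary and imperatives.
CONDITIONAL_PREFIX = "Vc"
IMPERATIVE_PREFIX = "Vi"


def assign_verbmod(a_subtree: List[Tuple[str, str]]) -> str:
    """Return 'cdn', 'imp' or 'ind' for the verb form behind one t-node."""
    tags = [m_tag for _form, m_tag in a_subtree]
    if any(tag.startswith(CONDITIONAL_PREFIX) for tag in tags):
        return "cdn"   # the subtree contains the auxiliary "by"
    if any(tag.startswith(IMPERATIVE_PREFIX) for tag in tags):
        return "imp"
    return "ind"


# Illustrative tags; only their first two characters matter in this sketch.
byl_by_sel = [
    ("byl", "VpYS---XR-AA---"),
    ("by", "Vc" + "-" * 13),
    ("šel", "VpYS---XR-AA---"),
]
jde = [("jde", "VB-S---3P-AA---")]

mood_complex = assign_verbmod(byl_by_sel)
mood_simple = assign_verbmod(jde)
```

For the multi-word form the decision depends on a node that has no counterpart of its own at the t-layer, which is why the links down to the a-layer and m-layer are needed.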
27Grammateme assignment and co-reference
Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
- grammatemes gender, number and person in relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be inherited from the antecedents)
lit.: From remainder of raw material the dairy produces dried milk, which it exports to Asia and South America.
'From the rest of the material, the dairy produces dried milk, which it exports to Asia and South America.'
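How an inher value could later be resolved through co-reference links is sketched below. The data layout (dictionaries keyed by node label), the function name `resolve` and the gender value `neut` are assumptions for illustration; the slide states only that the values can be inherited from the antecedent.

```python
from typing import Dict, Optional


def resolve(node: str, name: str,
            grammatemes: Dict[str, Dict[str, str]],
            antecedent_of: Dict[str, str]) -> Optional[str]:
    """Follow co-reference links until the grammateme has a concrete value.

    Returns 'inher' if no antecedent supplies a value (or the links loop),
    and None if a node on the chain lacks the grammateme altogether.
    """
    seen = set()
    current: Optional[str] = node
    while current is not None and current not in seen:
        seen.add(current)
        value = grammatemes[current].get(name)
        if value != "inher":
            return value
        current = antecedent_of.get(current)
    return "inher"


# The relative pronoun "které" refers back to "mléko" in the slide's sentence.
grammatemes = {
    "mléko": {"gender": "neut", "number": "sg"},   # 'neut' is an assumed value name
    "které": {"gender": "inher", "number": "inher"},
}
antecedent_of = {"které": "mléko"}

number_of_ktere = resolve("které", "number", grammatemes, antecedent_of)
gender_of_ktere = resolve("které", "gender", grammatemes, antecedent_of)
```

Keeping the stored value underspecified and resolving it on demand avoids duplicating information that the co-reference annotation already provides.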
29Final remarks
- achievements
- two-level typing of t-layer nodes, which makes it possible to formally capture presence/absence of individual grammatemes in a given node
- automatic procedure for capturing the node classification and the grammateme attributes
- verification of the procedure on large-scale data
- experience
- it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make the procedure of grammateme assignment more or less automatic
30
- http://ufal.mff.cuni.cz/pdt2.0/