Title: Prague Dependency Treebank 1.0
1Prague Dependency Treebank 1.0
- CD-ROM PRESENTATION
- Dec 18, 2000
2Prague Dependency Treebank 1.0
Functional Generative Description
- CD-ROM PRESENTATION
- Dec 18, 2000
3Functional Generative Description
- theoretical framework based on the findings of
European structural linguistics, esp. of the
classical Prague School
- methodological requirements of a formal
description
- levels
- tectogrammatical (underlying) representations
(TRs) with dependency-based syntax
- morphemics
- phonemics and phonetics
- TRs (see Sgall, Hajičová and Panevová 1986,
formally specified by Petkevič, also in a
declarative way)
4Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
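The one-to-one relation between the tree and its linearized form can be illustrated with a small sketch. The `Node` class and the `linearize` and `wrap` functions are hypothetical names introduced here. The placement rule is read off the example above, not taken from a formal definition: a left dependent carries its functor after the closing parenthesis, a right dependent carries it after the opening one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a dependency tree (illustrative structure, not the PDT format)."""
    label: str
    functor: Optional[str] = None          # None for the root
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def wrap(dep: Node, head_on_right: bool) -> str:
    """Bracket a dependent; the functor sits on the parenthesis facing the head."""
    inner = linearize(dep)
    if head_on_right:
        return f"({inner}){dep.functor}"
    return f"({dep.functor} {inner})"


def linearize(node: Node) -> str:
    """Left dependents, then the node itself, then right dependents."""
    parts = [wrap(child, head_on_right=True) for child in node.left]
    parts.append(node.label)
    parts += [wrap(child, head_on_right=False) for child in node.right]
    return " ".join(parts)


brother = Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])
sentence = Node("arrive.Pret.Indic",
                left=[brother],
                right=[Node("there", "Dir"), Node("yesterday", "Temp")])

print(linearize(sentence))
```

Since every bracket pair corresponds to exactly one edge, the tree could in principle be rebuilt from the string; that inverse direction is not shown here.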
5Dependency Tree
- labels - lexical meanings (abstract symbols) with
indices
- functors - subscripts at parentheses oriented
towards head
- grammatemes - values of morphological categories
- Tense, Modality, Number, Definiteness, etc.
- projectivity
- valency
- arguments (inner participants) and adjuncts
(circumstantials or 'free modifications')
- obligatory and optional with a given head
- deletable or not
6Dependency Tree
- participants (arguments) of verbs
- Actor/Bearer (underlying subject)
- Objective (Patient, underlying direct object)
- Addressee (underlying indirect object)
- Effect ('second' object: to choose sb. as sth.)
- Origin (to make sth. out of sth.)
- adjuncts
- Locative, several Directional and Temporal
modifications
- Condition, Means, Manner, etc.
7Dependency Tree
Complementations dependent mainly on nouns
- inner participants
- Material (Partitive): two baskets of sth.
- Identity: the river Danube, the notion of
operator
- free modifications
- Possession (Appurtenance): my table, Jim's
brother
- Restrictive: rich man
- Descriptive: the Swedes, who are a Scandinavian
nation
8Dependency Tree
- syntactic grammatemes
- Loc, Dir - in, on, under, between...
- Regard - with, without
- operational (testable) criteria
- for distinguishing
- arguments from adjuncts,
- from each other
- deletability (dialogue test)
9Simplified valency frames
- brother N Appurt
- man N
- glass N Material
- full A Material
- read V Act Addr Obj
- change V Act Obj Orig Eff
- give V Act Addr Obj
obligatory complementations in blue
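The frames above can be held in a simple lookup table. The following is a minimal sketch, with `parse_frames` and `split_dependents` as hypothetical helper names. The colour coding of obligatory complementations did not survive in this transcript, so the sketch records only which functors a frame lists, not which of them are obligatory.

```python
FRAMES_TEXT = """\
brother N Appurt
man N
glass N Material
full A Material
read V Act Addr Obj
change V Act Obj Orig Eff
give V Act Addr Obj
"""


def parse_frames(text):
    """Map each lemma to (part of speech, list of functors in its frame)."""
    frames = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        lemma, pos, *functors = fields
        frames[lemma] = (pos, functors)
    return frames


def split_dependents(frames, lemma, functors):
    """Separate observed functors into those listed in the frame and the rest."""
    _, slots = frames[lemma]
    in_frame = [f for f in functors if f in slots]
    outside = [f for f in functors if f not in slots]
    return in_frame, outside


frames = parse_frames(FRAMES_TEXT)
print(frames["change"])
print(split_dependents(frames, "give", ["Act", "Obj", "Temp"]))
```

Functors that fall outside the frame (here `Temp`) would be candidates for adjuncts; deciding that properly needs the operational criteria mentioned earlier, which this sketch does not model.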
10Topic-focus articulation
- contextual boundness
- main verb CB/NB (T/F)
- dependents to the left/right
- communicative dynamism
- left-right (mother, sisters, transitive)
- partial ordering
- underlying word order
- left-right
- linear ordering
left-to-right order of nodes together with the
index T or (prototypically) F indicates the TFA
of the sentence (of the TR)
11Topic-focus articulation
- TFA - one of the basic aspects of underlying
structures
12Complex sentence
My brother, whom you know, arrived there
yesterday.
- a subordinated (dependent) clause (i.e. its main
verb) depends on a word contained in its
governing clause
13Complex sentence
Martin came there late, since he had to accompany
his sick mother.
- function words (synsemantic) are viewed as
function morphemes, syntactically fixed to
certain lexical (autosemantic) words -
prepositions and articles to nouns, conjunctions
and auxiliaries to verbs
14Complex sentence
Martin arrived late to the session, since he had
to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he
have.ed to accompany he.s sick mother
- dot (.) - close connection of morphemes
('semes')
15- deleted items restored
- order of items - difference between 'underlying'
and surface (morphemic) word order
- transductive components - Panevová, Oliva,
Borota
- coordination (multidimensional)
- Jim and Mary, who have two children, went to
Boston.
- the linearized notation is adequate
- ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr
children)))Act went (Dir Boston)
- structures close to Boolean, i.e. no complex
'innate properties' specific for natural language
are needed.
16Prague Dependency Treebank - corpus annotation
- an intermediate level - 'analytical'
representations
- dependency trees, not always projective
- nodes for all word tokens, even for punctuation
marks
- tectogrammatical tree: coordinating conjunction
as the head
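Whether a given dependency tree is projective can be tested mechanically. The sketch below assumes a common formulation that is not spelled out in the slides: an edge is projective if every word lying between the head and the dependent is dominated by that head. The function name `is_projective` and the head-list encoding are illustrative choices, and edges from the technical root (head 0) are skipped.

```python
def is_projective(heads):
    """heads[i] is the 1-based position of the head of word i+1; 0 marks the root."""

    def dominated_by(node, ancestor):
        # Walk up the tree from node until the ancestor or the root is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue
        low, high = sorted((head, dep))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


# "My younger brother arrived there yesterday."
print(is_projective([3, 3, 4, 0, 4, 4]))
# An artificial four-word tree with crossing edges 1-3 and 2-4.
print(is_projective([3, 4, 0, 3]))
```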
18Prague Dependency Treebank 1.0
Morphological Layer
- CD-ROM PRESENTATION
- Dec 18, 2000
19ACKNOWLEDGEMENTS
20ANNOTATED CORPORA
PDT version 1.0, 2000 (1996 - 2000)
Penn Treebank, release 3, 1999 (1989 - 1999)
21TAG SETs
- Czech - ambiguous inflective language
- nový, nového, novému, novém, novým, nová, nové,
novou, nových, novým, novými, novější,
novějšího, novějšímu, novějším, ..., nejnovější,
nejnovějšího, nejnovějšímu, nejnovějším, ...,
nejnovějších, nejnovějším, ...
- English - language with poor inflection
- work, works, worked, working
23TEXT SOURCES
- Lidové noviny
- Mladá Fronta Dnes
- Vesmír
- Českomoravský Profit
- ...taken from Czech National Corpus
- 88, 89 WSJ articles
- Air Travel Information System transcripts
- Brown Corpus
- Switchboard transcripts
24ANNOTATION STRATEGY - Penn Treebank
- TEXT
- Ken Church's stochastic tagger, Eric
Brill's transformation tagger
- corrections by annotator (GNU Emacs Lisp based
package)
25ANNOTATION STRATEGY - PDT
- Automatic Morphological Analyzer (AMA)
- two independent annotators (Linux, Win
tools)
- differences resolved by third annotator
- comparison with the current AMA
- manual resolution (Win tools)
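The step in which a third annotator resolves differences presupposes a list of tokens on which the two independent annotations disagree. A minimal sketch of producing such a list follows. The function name `disagreements` and the list-of-pairs encoding are assumptions, and the second annotation is invented for illustration (nominative instead of accusative for "zázrak"); the actual PDT tools are not shown in the slides.

```python
def disagreements(first, second):
    """Return (position, form, tag_a, tag_b) for every token tagged differently."""
    if len(first) != len(second):
        raise ValueError("annotations cover a different number of tokens")
    diffs = []
    for position, ((form_a, tag_a), (form_b, tag_b)) in enumerate(zip(first, second)):
        if form_a != form_b:
            raise ValueError(f"token mismatch at position {position}")
        if tag_a != tag_b:
            diffs.append((position, form_a, tag_a, tag_b))
    return diffs


annotator_a = [
    ("Pokus", "NNIS1-----A----"),
    ("o", "RR--4----------"),
    ("zázrak", "NNIS4-----A----"),
]
# Hypothetical second annotation that picks a different case for the last noun.
annotator_b = [
    ("Pokus", "NNIS1-----A----"),
    ("o", "RR--4----------"),
    ("zázrak", "NNIS1-----A----"),
]

to_resolve = disagreements(annotator_a, annotator_b)
print(to_resolve)
```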
26INTERNAL FORMAT
27SAMPLES
<s id=ln95040020-p1s1>
<f>Pokus<l>pokus<t>NNIS1-----A----
<f>o<l>o<t>RR--4----------
<f>zázrak<l>zázrak<t>NNIS4-----A----
<d>.<l>.<t>Z:-------------
The/DT envelope/NN arrives/VBZ in/IN the/DT
mail/NN ./.
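A PDT token in the sample has the shape `<f>form<l>lemma<t>tag`, with `<d>` instead of `<f>` for punctuation, and the tag is a 15-character positional string. The sketch below splits such a line and names the filled positions. The position names follow the PDT positional tag system as we understand it from its documentation; they do not appear in these slides, and `parse_token` and `decode_tag` are hypothetical helper names.

```python
import re

# Names of the 15 tag positions (taken from the PDT tag documentation, not the slides).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]

TOKEN_RE = re.compile(r"<([fd])>(.*?)<l>(.*?)<t>(\S+)")


def parse_token(line):
    """Split one token line into its kind, word form, lemma and tag."""
    match = TOKEN_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a token line: {line!r}")
    kind, form, lemma, tag = match.groups()
    return {"kind": kind, "form": form, "lemma": lemma, "tag": tag}


def decode_tag(tag):
    """Return only the positions that carry a value other than '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError("a positional tag must have exactly 15 characters")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


token = parse_token("<f>Pokus<l>pokus<t>NNIS1-----A----")
print(token["form"], decode_tag(token["tag"]))
```

By contrast, the Penn Treebank sample needs no positional decoding: each `word/TAG` pair carries a single atomic tag.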
28CONVERSION
pdt2wsj.pl
pdt2wsjFLT.pl
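The scripts themselves are not reproduced in the slides. As a rough idea of what a conversion from the PDT token format to a WSJ-style `word/TAG` line might involve, here is a minimal sketch; it is an assumption about the general approach, not a port of `pdt2wsj.pl`, and `to_slash_format` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(r"<[fd]>(.*?)<l>.*?<t>(\S+)")


def to_slash_format(lines):
    """Join every recognised token as form/TAG; skip lines that are not tokens."""
    pairs = []
    for line in lines:
        match = TOKEN_RE.fullmatch(line.strip())
        if match is not None:
            pairs.append(f"{match.group(1)}/{match.group(2)}")
    return " ".join(pairs)


sample = [
    "<s id=ln95040020-p1s1>",
    "<f>Pokus<l>pokus<t>NNIS1-----A----",
    "<f>o<l>o<t>RR--4----------",
]
print(to_slash_format(sample))
```

Note that the lemma is dropped in this sketch, since the `word/TAG` layout has no slot for it.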
29DATA SIZE
30DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
31TOOLS
- Automatic Morphological Analyser/Generator of
Czech
- HMAnalyze.pl, HMGenerate.pl
- Dictionary CZE_a
- Remote Access
- Czech Taggers
- HMM
- Exponential
33Prague Dependency Treebank 1.0
Analytical Layer in PDT
- CD-ROM PRESENTATION
- Dec 18, 2000
34Introduction
- Input: morphologically tagged sentences
- Graph Editor: user-friendly software
- Output: ATS structure
- surface syntax tree structure
- nodes labelled by the analytical functions
35Two stages (chronologically)
- (A) manual analytic annotation (ATS)
- training data for (B)(a)
- (B)
- (a) semiautomatic procedure (Collins parser)
- (b) manual correcting of (B)(a)
36Constraints and limitations
- any string has a node of its own
- word-form, punctuation mark, etc.
- AuxV, AuxP, AuxC, AuxX, AuxG
- reflecting the coordination and apposition
relations
- so-called third dimension of the graph in the
plain tree (X_Co, X_Ap, X_Pa, where X is one of
the analytic functions, such as Sb, Obj, Adv, etc.)
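The suffix convention can be unpacked with a few lines of code. In the sketch below, `split_afun` is a hypothetical helper. Reading `_Pa` as marking parenthesis members is our understanding of PDT practice, since the slide itself glosses only coordination and apposition.

```python
# Suffix meanings; "parenthesis" for Pa is an assumption not stated on the slide.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(afun):
    """Split an analytic function label into (base function, suffix or None)."""
    base, separator, suffix = afun.rpartition("_")
    if separator and suffix in SUFFIXES:
        return base, suffix
    return afun, None


for label in ["Sb_Co", "Obj_Ap", "Adv", "AuxP"]:
    print(label, split_afun(label))
```

Labels whose part after the underscore is not one of the three suffixes are returned unchanged, so other labels containing an underscore are not mistaken for suffixed ones.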
37Constraints and limitations
- no missing nodes (on the surface) can be added
- analytic function Ex_D is used
- relations between semi-automatic and manual
procedure
- 80% of edges are established correctly automatically
38Project organization
- team consisting of 5-6 annotators
- handbook for ATS structure annotation
- 1999: 100,000 sentences on ATS
- tectogrammatical annotation follows
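39První restituční zákon českého parlamentu se do
sněmovních lavic může vrátit jako bumerang.
- The first restitution law of the Czech parliament
may return to the parliamentary benches like a boomerang.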
41Prague Dependency Treebank 1.0
From the Analytical towards the Tectogrammatical
layer
- CD-ROM PRESENTATION
- Dec 18, 2000
42Introduction
- ATS annotation
- nodes
- word forms
- punctuation
- graphical symbols
- TGTS annotation
- autosemantic words
- deletions
- edges
- surface relations
- deep layer functions
43Annotation process
Input Czech sentence
44Transition procedure
- deterministic procedure operating on trees
- macro language for Graph Editor (C like)
- automatic changes, tools for annotators
- Requirements
- new attributes for tectogrammatical layer
- ATS is recoverable from TGTS
- automatized to a maximally high degree
45New attributes
- trlemma - lemma of the original node or lemma
composed of joined nodes
- morphological grammatemes
- gender, number, degree of comparison, tense,
- aspect, iterativeness, verbal modality, deontic
modality, sentence modality
- position of the node
- functor, topic-focus articulation, syntactic
grammateme,
- type of relation (dependency, coordination,
apposition),
- phraseme, deletion, quoted word, direct speech,
- coreference, antecedent
46Tree Structure Pruning
- U toho, kdo začíná opravdu od nuly, není daňový
výnos pro stát podstatný.
- For those who actually start from zero, the tax
revenue is not substantial for the state.
47Tree Structure Pruning
- U toho, kdo začíná opravdu od nuly, není daňový
výnos pro stát podstatný.
- For those who actually start from zero, the tax
revenue is not substantial for the state.
48Verbal Nodes
- podnikatelé by měli mít daně
- entrepreneurs should have (their) taxes
49Attribute Assignments
- prepositions stored as fw attribute
- quoted words
- clause in quotes -> DSP
- one pair of quotes in the sentence -> DSPP
- string in quotes -> QUOT
- gender, number, tense, degcmp, aspect
- default values
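To make the first item concrete, the following sketch removes preposition nodes from a small analytical tree and records each preposition on the node it governed. Everything beyond "prepositions stored as fw attribute" is an assumption: the dictionary encoding, the function name `absorb_prepositions`, and the analytic functions assigned to the three words, which are chosen only for illustration.

```python
def absorb_prepositions(nodes):
    """Drop AuxP nodes; re-attach their dependents and store the preposition as fw."""
    pruned = {}
    for node_id, node in nodes.items():
        if node["afun"] == "AuxP":
            continue  # the preposition loses its own node
        kept = dict(node)
        head = kept["head"]
        # Climb over preposition nodes, remembering the closest one as fw.
        while head in nodes and nodes[head]["afun"] == "AuxP":
            kept.setdefault("fw", nodes[head]["form"])
            head = nodes[head]["head"]
        kept["head"] = head
        pruned[node_id] = kept
    return pruned


# "Pokus o zázrak" with illustrative analytic functions; head 0 is the root.
ats = {
    1: {"form": "Pokus", "afun": "ExD", "head": 0},
    2: {"form": "o", "afun": "AuxP", "head": 1},
    3: {"form": "zázrak", "afun": "Atr", "head": 2},
}
tgts_like = absorb_prepositions(ats)
print(tgts_like)
```

Keeping the preposition's form on the remaining node is one way the analytical node could later be re-created, in the spirit of the requirement that ATS be recoverable from TGTS.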
50Macros for Annotators
- keyboard shortcuts (in Graph editor)
- structure changes
- hide/recover nodes
- merge nodes
- add new nodes
- functor assignments
51Manual annotation
- structure checking
- functors
- deletions of obligatory modifications
- feedback for formulating the handbook for
annotators
53Prague Dependency Treebank 1.0
Tectogrammatical Layer
- CD-ROM PRESENTATION
- Dec 18, 2000
56- Jirka se včera opil do němoty
a Honza dneska.
- George himself yesterday drank to silence and
Honza today.
57Attributes of Coreferential relations
- only in MC
- attribute values
- coref: the lemma of the antecedent
- corsnt: NIL - in the same sentence;
PREV1 ... PREVi - position of the sentence which
includes the antecedent
- grammatical coreference
- antec: the functor of the antecedent
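The three attributes can be pictured as one small record per coreferring node. The class below is a hypothetical illustration, not the PDT data format. Reading `PREVi` as "i sentences back" is an assumption based on the wording above, and the sample values (antecedent "Honza" with functor Act in the same sentence) are our reading of the sentence "Honza slíbil přijít včas."

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Coreference:
    coref: str   # lemma of the antecedent
    corsnt: str  # "NIL" or "PREV1", "PREV2", ...
    antec: str   # functor of the antecedent

    def sentence_distance(self) -> int:
        """0 for the same sentence, otherwise the index carried by PREVi."""
        if self.corsnt == "NIL":
            return 0
        match = re.fullmatch(r"PREV(\d+)", self.corsnt)
        if match is None:
            raise ValueError(f"unexpected corsnt value: {self.corsnt!r}")
        return int(match.group(1))


# Unexpressed subject of the infinitive, assumed to corefer with "Honza".
link = Coreference(coref="Honza", corsnt="NIL", antec="Act")
print(link.sentence_distance())
```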
58Example
Honza slíbil přijít včas.
Honza promised to come in time.