Title: Prague Dependency Treebank 2.0
1Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics
Charles University, Prague
zabokrtsky_at_ufal.mff.cuni.cz
2Outline of the talk
- Introduction
- Layers of annotation
- Data
- Software tools
- Documentation
- Tour through the CD-ROM
- Final remarks
3Introduction
- treebank
- syntactically annotated corpus
- (bank of syntactic trees)
- Prague Dependency Treebank
- collection of linguistically annotated Czech texts (2 MW), software tools and documentation
- morphological and surface- and deep-syntactic dependency-oriented sentence analyses
4About Czech
- western group of Slavic languages
- rich inflectional morphology
- (relatively) free word order language
- Latin alphabet extended with accents
- (příliš žluťoučký kůň)
- spoken in the Czech Republic
- 10 million speakers
5Historical background and development of PDT
- 1920s Prague Linguistic Circle founded
- 1930-50s influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
- mid 1960s Petr Sgall's Functional Generative Description
- 1992 Penn Treebank
- 1994 Czech National Corpus
- 1995 PDT started
- 1998 PDT 0.5 pre-release
- 2001 PDT 1.0 released by LDC
- 2006 PDT 2.0 to be released by LDC
7Layered annotation scheme
- tectogrammatical layer
- deep-syntactic dependency tree
- analytical layer
- surface-syntactic dependency tree
- morphological layer
- morphological lemma and tag associated with each token
- word layer
- original text, segmented on word boundaries
He would have gone intoforest.
8M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes lemma and tag)
- 15-character long positional morphological tag
- 1. (main) POS
- 2. detailed POS
- 3. gender
- 4. number
- 5. case
- ...
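The positional tag can be unpacked one character per position. The following is a minimal sketch, assuming only the five position names listed on this slide; the other ten positions are kept under generic keys because the slide does not name them, and the sample tag string is an illustrative assumption rather than a value quoted from the data.

```python
# Names for positions 1-5 come from the slide; positions 6-15 stay generic
# because the slide does not name them.
NAMED_POSITIONS = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag: str) -> dict:
    """Split a 15-character positional tag into one value per position."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    names = NAMED_POSITIONS + [
        f"position_{i}" for i in range(len(NAMED_POSITIONS) + 1, TAG_LENGTH + 1)
    ]
    return dict(zip(names, tag))


# Illustrative tag (an assumption, not taken from the treebank).
decoded = decode_tag("NNMS1-----A----")
print(decoded["pos"], decoded["gender"], decoded["number"], decoded["case"])
```

Because every category sits at a fixed offset, a query such as "all tokens in a given case" reduces to a single character comparison.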
9A-layer (1) - nodes and edges
- sentence represented as a rooted ordered tree with labeled nodes and edges
- edges labeled with analytical functions
- dependency relations (Sb, Obj, Adv, Atr)
- non-dep. relations (Coord)
- auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)
- special treatment of coordination constructions
10A-layer (2) - coordination
- intricate interplay between dependency and coordination relations
- PDT solution: both conjuncts (members of coordination) and shared modifiers attached below the coordination conjunction (but distinguished from each other by a special attribute is_member)
- direct parent vs. effective parent
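The difference between the direct and the effective parent can be sketched in a few lines. This is a simplified illustration, not the TrEd implementation: the `ANode` class, its field names and the toy sentence are assumptions, and the handling of AuxP/AuxC nodes and appositions is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ANode:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["ANode"] = field(default=None, repr=False)
    children: List["ANode"] = field(default_factory=list, repr=False)

    def attach(self, child: "ANode") -> "ANode":
        """Make `child` a direct dependent of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord: ANode) -> List[ANode]:
    """Members of a coordination, expanding nested coordinations."""
    result = []
    for child in coord.children:
        if child.is_member:
            result.extend(conjuncts(child) if child.afun == "Coord" else [child])
    return result


def effective_parents(node: ANode) -> List[ANode]:
    # A conjunct takes its governor from above the coordination node(s).
    top = node
    while top.is_member and top.parent is not None:
        top = top.parent
    parent = top.parent
    if parent is None:
        return []
    # A non-member child of a coordination is a modifier shared by all conjuncts.
    if parent.afun == "Coord":
        return conjuncts(parent)
    return [parent]


# Toy tree for "old men and women came".
came = ANode("came", "Pred")
and_ = came.attach(ANode("and", "Coord"))
men = and_.attach(ANode("men", "Sb", is_member=True))
women = and_.attach(ANode("women", "Sb", is_member=True))
old = and_.attach(ANode("old", "Atr"))

print([p.form for p in effective_parents(men)])
print([p.form for p in effective_parents(old)])
```

In the toy tree the direct parent of both conjuncts and of the shared modifier is the conjunction, while the effective parent of each conjunct is the verb and the effective parents of the shared modifier are the two conjuncts.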
11T-layer (1) - nodes
- t-nodes
- complex typed feature structures
- nodes represent autosemantic words
- functional words do not have nodes of their own
- artificially added nodes (e.g. for pro-drops)
- node attributes
- tectogrammatical lemma
- dependency relation functor and subfunctor
- grammateme attributes (representing morphological meanings)
- attributes for topic-focus articulation
- attributes for coreference relations
12T-layer (2) - dependency relations
- according to FGD, two types of functors
- actants (arguments)
- ACT actor
- PAT patient
- ADDR addressee
- EFF effect
- ORIG origin
- free modifiers (adjuncts)
- various types of temporal modifiers: TWHEN, TTILL, TSIN...
- spatial and directional modifiers: LOC, DIR1, DIR2, DIR3
- MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...
- additional functors for representing non-dependency relations
- coordinations: CONJ, DISJ, ADVS ...
- appositions: APPS
- parenthetical constructions: PAR
- expressions in foreign language: FPHR
13T-layer (3) - valency
- all occurrences of all verbs in t-trees interlinked with the valency lexicon PDT-VALLEX
- individual valency frames roughly correspond to individual senses of the given verb
- valency frame: a sequence of frame slots, for each of which its functor, obligatoriness and possible surface realizations are specified
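A valency frame as described above can be modeled as a list of slots. The sketch below is illustrative only: the `FrameSlot` class, the plain-string notation for surface realizations and the sample frame are assumptions and are not copied from PDT-VALLEX.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class FrameSlot:
    functor: str
    obligatory: bool
    forms: FrozenSet[str]  # allowed surface realizations (hypothetical notation)


def frame_violations(frame: List[FrameSlot], realized: Dict[str, str]) -> List[str]:
    """Compare realized arguments {functor: surface form} with a frame."""
    problems = []
    for slot in frame:
        form = realized.get(slot.functor)
        if form is None:
            if slot.obligatory:
                problems.append(f"missing obligatory {slot.functor}")
        elif form not in slot.forms:
            problems.append(
                f"{slot.functor} realized as {form!r}, "
                f"expected one of {sorted(slot.forms)}"
            )
    return problems


# Illustrative frame for a verb of giving; not an actual lexicon entry.
GIVE = [
    FrameSlot("ACT", True, frozenset({"nominative"})),
    FrameSlot("PAT", True, frozenset({"accusative"})),
    FrameSlot("ADDR", False, frozenset({"dative"})),
]

print(frame_violations(GIVE, {"ACT": "nominative", "PAT": "accusative"}))
print(frame_violations(GIVE, {"ACT": "nominative"}))
```

Functors that are not part of the frame (free modifiers) are ignored by the check, since a frame only constrains its own slots.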
14T-layer (3) - coreference
- two types of coreference according to FGD
- grammatical (verbs of control, relative clauses, reflexive pronouns...)
- textual (personal pronouns, incl. elided ones)
- coreference in PDT
- binary relation between t-nodes
- depicted as a non-tree arc (arrow)
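Since coreference is a binary relation between t-nodes, it can be stored apart from the tree edges and followed like a linked list. The sketch below is a simplified illustration: the node IDs are hypothetical, and it assumes at most one antecedent per node.

```python
from typing import Dict, List, Tuple

# coref[node] = (antecedent, type); all IDs are hypothetical placeholders.
COREF: Dict[str, Tuple[str, str]] = {
    "t-he2": ("t-he1", "textual"),
    "t-he1": ("t-john", "textual"),
    "t-who": ("t-john", "grammatical"),
}


def antecedent_chain(node_id: str, coref: Dict[str, Tuple[str, str]]) -> List[str]:
    """Follow coreference arcs from a node back to its first antecedent."""
    chain = []
    seen = {node_id}
    current = node_id
    while current in coref:
        current = coref[current][0]
        if current in seen:  # guard against accidental cycles in the data
            break
        seen.add(current)
        chain.append(current)
    return chain


print(antecedent_chain("t-he2", COREF))
```

Keeping the arcs in a separate mapping mirrors the slide's point that they are non-tree arcs: the dependency structure stays a tree while the coreference links may cross it freely.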
15T-layer (4) - grammatemes
- grammatemes
- t-node attributes representing morphological meanings
- motivation
- number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...
16T-layer (5) - node typing
- presence/absence of a given attribute?
- hence the need for node typing
- two-level hierarchy of t-layer node types used in PDT 2.0
17Interlinking the layers
- any unit at any layer has a PDT unique ID
- neighboring layers connected by top-down pointers
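The interlinking can be pictured as four ID-keyed tables in which each unit points to the layer directly below it. The sketch below uses hypothetical IDs, attribute names and a toy English sentence; the actual PDT 2.0 files use their own ID scheme and XML format.

```python
# All IDs and attribute names are hypothetical placeholders.
W = {"w1": "He", "w2": "would", "w3": "have", "w4": "gone", "w5": "into", "w6": "forest"}
M = {
    "m1": {"lemma": "he", "w": "w1"},
    "m2": {"lemma": "would", "w": "w2"},
    "m3": {"lemma": "have", "w": "w3"},
    "m4": {"lemma": "go", "w": "w4"},
    "m5": {"lemma": "into", "w": "w5"},
    "m6": {"lemma": "forest", "w": "w6"},
}
A = {
    "a1": {"afun": "Sb", "m": "m1"},
    "a2": {"afun": "AuxV", "m": "m2"},
    "a3": {"afun": "AuxV", "m": "m3"},
    "a4": {"afun": "Pred", "m": "m4"},
    "a5": {"afun": "AuxP", "m": "m5"},
    "a6": {"afun": "Adv", "m": "m6"},
}
# A t-node points to every a-node it covers, including function words.
T = {
    "t1": {"t_lemma": "go", "functor": "PRED", "a": ["a2", "a3", "a4"]},
    "t2": {"t_lemma": "he", "functor": "ACT", "a": ["a1"]},
    "t3": {"t_lemma": "forest", "functor": "DIR3", "a": ["a5", "a6"]},
}


def ids_are_unique() -> bool:
    """Every unit on every layer must have an ID that is unique overall."""
    ids = [*W, *M, *A, *T]
    return len(ids) == len(set(ids))


def surface_forms(t_id: str) -> list:
    """Follow t -> a -> m -> w pointers and return the word forms."""
    return [W[M[A[a_id]["m"]]["w"]] for a_id in T[t_id]["a"]]


print(ids_are_unique())
print(surface_forms("t1"))
print(surface_forms("t3"))
```

Because the pointers only run downwards, the lower layers can be read and processed without loading the layers above them.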
19Sources of text
- texts provided by the Czech National Corpus
- 7000 articles (or article fragments) from Czech newspapers and journals
- Lidové noviny (daily newspaper)
- Mladá fronta Dnes (daily newspaper)
- Českomoravský profit (business weekly)
- Vesmír (scientific journal)
20Amount of annotated data
- m-layer data
- 1.96 MW in 116 kS
- a-layer data (75% of m-layer)
- 1.5 MW in 88 kS
- t-layer data (59% of a-layer)
- 0.8 MW in 49 kS
21Division into files
- 1 XML file per document and annotation layer
22Train/test data
- train : devtest : evaltest = 8 : 1 : 1
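An 8 : 1 : 1 division over whole documents can be sketched as follows. How documents were actually assigned in PDT 2.0 is not stated on the slide; the round-robin rule and the function names here are assumptions chosen only to reproduce the ratio deterministically.

```python
from typing import Dict, Iterable, List


def assign_portion(index: int) -> str:
    """Map a running document index to a portion in an 8:1:1 ratio."""
    slot = index % 10
    if slot < 8:
        return "train"
    return "devtest" if slot == 8 else "evaltest"


def split_documents(doc_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Distribute whole documents (never single sentences) over the portions."""
    portions: Dict[str, List[str]] = {"train": [], "devtest": [], "evaltest": []}
    for i, doc_id in enumerate(doc_ids):
        portions[assign_portion(i)].append(doc_id)
    return portions


split = split_documents(f"doc{i:03d}" for i in range(100))
print({name: len(docs) for name, docs in split.items()})
```

Splitting by document rather than by sentence fits the one-file-per-document layout and keeps sentences of one article from leaking between training and test data.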
23Full vs. sample data
- sample data
- 500 sentences
- a freely available subset of the full data
- converted also to HTML (can be viewed in any WWW browser, no tree editor needed)
- the whole PDT 2.0 except for the full data (but including the sample data, all tools and docs) is available on the web
- the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium
25Tree editor TrEd
- general customizable tree editor implemented in Perl
- the main editing and browsing tool in the PDT project
26Batch processing of the data
- btred: batch-processing version of tred
- ntred: networked (parallelized) version of btred
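- example: print the t_lemma of every node directly below the root that has a child with a directional (DIR) functor
btred -TNe 'print "$this->{t_lemma}\n"
    if $this->parent == $root and
    grep { $_->{functor} =~ /DIR/ } $this->children()' \
  data/sample/*.t.gz -q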
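The logic of the btred query can be followed on a toy tree without the PDT data. The sketch below is a Python re-statement of the same filter, not btred itself: the `TNode` class and the sample tree are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class TNode:
    t_lemma: str
    functor: str
    children: List["TNode"] = field(default_factory=list)


def root_children_with_dir(root: TNode) -> List[str]:
    """t_lemmas of nodes directly below the root that govern a DIR* node."""
    return [
        node.t_lemma
        for node in root.children
        if any("DIR" in child.functor for child in node.children)
    ]


# Toy t-tree: technical root -> go(PRED) -> he(ACT), forest(DIR3)
tree = TNode("#root", "", [
    TNode("go", "PRED", [TNode("he", "ACT"), TNode("forest", "DIR3")]),
])

print(root_children_with_dir(tree))
```

The substring test on the functor plays the role of the `/DIR/` regular expression, so it matches DIR1, DIR2 and DIR3 alike.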
27Netgraph
- client-server application for on-line PDT search
- implemented in Java
28Tools for post-annotation consistency checking
- hundreds of btred scripts of various types
- technical tests
- e.g. each sentence contains at least one token
- all identifiers are unique, all referred identifiers exist...
- m-layer tests
- locative (6th case) cannot occur without a preposition
- improbable word forms (e.g. imperatives haš, tel)
- a-layer tests
- not more than one subject in a clause
- attributes (afun=Atr) should not appear directly below verbs
- t-layer tests
- surface forms of verb arguments match the specifications in the valency lexicon
- relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
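The two technical tests on identifiers are easy to state in code. The actual checks are btred scripts; the sketch below only restates their logic in Python over a hypothetical list of records with an `id` and an optional `refs` list.

```python
from collections import Counter
from typing import Dict, List


def check_identifiers(nodes: List[Dict]) -> List[str]:
    """Report duplicate IDs and references to IDs that do not exist."""
    problems = []
    counts = Counter(node["id"] for node in nodes)
    for node_id, count in counts.items():
        if count > 1:
            problems.append(f"duplicate id: {node_id}")
    known = set(counts)
    for node in nodes:
        for ref in node.get("refs", []):
            if ref not in known:
                problems.append(f"dangling reference: {node['id']} -> {ref}")
    return problems


# Hypothetical records: one duplicated ID and one reference to a missing unit.
records = [
    {"id": "t1", "refs": ["a1", "a2"]},
    {"id": "a1"},
    {"id": "a1"},
    {"id": "t2", "refs": ["a9"]},
    {"id": "a2"},
]

for problem in check_identifiers(records):
    print(problem)
```

Checks of this kind return a list of findings instead of stopping at the first error, which suits post-annotation review where every finding is inspected by hand.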
29Tools for automatic annotation
- chain of tools for automatic text processing (from a raw text to a-layer trees)
- 1. sentence segmentation and tokenization
- 2. morphological analysis
- 3. morphological disambiguation
- 4. dependency parsing (adapted Collins)
- 5. analytical function assignment
30Tools for format conversions
- conversion not only between PDT data formats, but also from other treebank formats
- constituency trees from Negra in TrEd
32PDT 2.0 Documentation
- PDT Guide
- overview of all parts of PDT 2.0
- mirrors the directory structure of the PDT 2.0 CD-ROM
- Annotation guidelines
- m-layer (100 pages)
- a-layer (250 pages)
- t-layer (800 pages)
- Publications
- conference and journal papers, technical reports, theses ...
- Technical documentation (software tools and data formats)
35Want to experiment with...
- tagging ?
- dependency parsing ?
- semantic-role labeling ?
- frame semantics ?
- word-sense disambiguation ?
- anaphora resolution ?
- information structure ?
- ...
Use PDT 2.0, it's all there!
36Annotation scheme not limited to Czech
T-layer in English
T-layer in German
A-layer in German
A-layer in Arabic
A-layer in Slovene
A-layer in Romanian
37Those involved (some of)
38Thank you!
- BTW, anyone interested in beta-testing?