Prague Dependency Treebank 2.0 Zdenek - PowerPoint PPT Presentation

About This Presentation

Title:

Prague Dependency Treebank 2.0 Zdenek

Description:

http://ufal.mff.cuni.cz/pdt2.0. PDT 2.0. Prague Dependency ... number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...

Provided by: zdeneka

Transcript and Presenter's Notes

Title: Prague Dependency Treebank 2.0 Zdenek

1
Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics
Charles University, Prague
zabokrtsky_at_ufal.mff.cuni.cz
2
Outline of the talk

Introduction
Layers of annotation
Data
Software tools
Documentation
Tour through the CD-ROM
Final remarks

3
Introduction

treebank
syntactically annotated corpus
(bank of syntactic trees)
Prague Dependency Treebank
collection of linguistically annotated Czech
texts (2MW), software tools and documentation
morphological and surface- and deep-syntactic
dependency-oriented sentence analyses

4
About Czech

western group of Slavic languages
rich inflectional morphology
(relatively) free word order language
Latin alphabet extended with accents
(příliš žluťoučký kůň)
spoken in the Czech Republic
10 million speakers

5
Historical background and development of PDT

1920s Prague Linguistic Circle founded
1930-50s influential dependency-oriented
works of Lucien Tesnière and Vladimír Šmilauer
mid 1960s Petr Sgall's Functional Generative
Description
1992 Penn Treebank
1994 Czech National Corpus
1995 PDT started
1998 PDT 0.5 pre-release
2001 PDT 1.0 released by LDC
2006 PDT 2.0 to be released by LDC

7
Layered annotation scheme

tectogrammatical layer
deep-syntactic dependency tree
analytical layer
surface-syntactic dependency tree
morphological layer
morphological lemma and tag associated with each
token
word layer
original text, segmented on word boundaries

He would have gone intoforest.
8
M-layer

sentence represented as a sequence of tokens
each token lemmatized and tagged (attributes
lemma and tag)
15-character long positional morphological tag
1. (main) POS
2. detailed POS
3. gender
4. number
5. case
...
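
Because the tag is positional, it can be unpacked by index. The following is a minimal sketch that names only the five positions listed above and returns the remaining ten characters unparsed; the function name `split_tag` and the tag value `NNIS2-----A----` are illustrative assumptions, not taken from the slides.

```python
# Names of the first five tag positions, as listed on the slide.
POSITION_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15  # every positional tag has exactly 15 characters


def split_tag(tag):
    """Split a 15-character positional tag into named fields.

    Positions 6-15 are returned unparsed under the key "rest".
    """
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    # zip stops after the five named positions.
    fields = dict(zip(POSITION_NAMES, tag))
    fields["rest"] = tag[len(POSITION_NAMES):]
    return fields


# Illustrative tag value (an assumption, not from the slides).
example = split_tag("NNIS2-----A----")
print(example)
```

Keeping the values as raw characters avoids hard-coding a value inventory that the slides do not spell out.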

9
A-layer (1) - nodes and edges

sentence represented as a rooted ordered tree
with labeled nodes and edges
edges labeled with analytical functions
dependency relations (Sb, Obj, Adv, Atr)
non-dep. relations (Coord)
auxiliary (functional) nodes (AuxP for
prepositions, AuxC for subordinating
conjunctions...)
special treatment of coordination constructions

10
A-layer (2) - coordination

intricate interplay between dependency and
coordination relations
PDT solution: both conjuncts (members of
coordination) and shared modifiers are attached below
the coordination conjunction (but distinguished
from each other by a special attribute, is_member)
direct parent vs. effective parent
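
The difference between the direct and the effective parent can be sketched on a toy tree. The code below is an assumption-laden illustration, not the PDT implementation: the `Node` class, the function names and the English example are made up, and auxiliary nodes (AuxP, AuxC) and appositions are ignored.

```python
class Node:
    """A minimal a-layer node: word form, analytical function, membership flag."""

    def __init__(self, form, afun, is_member=False):
        self.form = form
        self.afun = afun
        self.is_member = is_member
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord):
    """Return the conjuncts below a coordination node, expanding nested coordinations."""
    result = []
    for child in coord.children:
        if child.is_member:
            if child.afun == "Coord":
                result.extend(conjuncts(child))
            else:
                result.append(child)
    return result


def effective_parents(node):
    """Return the nodes that `node` depends on linguistically."""
    current = node
    # A conjunct depends on whatever the whole coordination depends on,
    # so climb from the conjunct to the outermost coordination node.
    while current.is_member and current.parent is not None:
        current = current.parent
    parent = current.parent
    if parent is None:
        return []
    if parent.afun == "Coord":
        # A non-member child of a coordination is a shared modifier:
        # it modifies every conjunct.
        return conjuncts(parent)
    return [parent]


# Toy tree for "old men and women came" (hypothetical example).
root = Node("came", "Pred")
coord = root.add(Node("and", "Coord"))
men = coord.add(Node("men", "Sb", is_member=True))
women = coord.add(Node("women", "Sb", is_member=True))
old = coord.add(Node("old", "Atr"))  # shared modifier, not a member

print(men.parent.form, [p.form for p in effective_parents(men)])
print(old.parent.form, [p.form for p in effective_parents(old)])
```

In this sketch the direct parent of both "men" and "old" is the conjunction "and", while their effective parents are the verb and the two conjuncts, respectively.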

11
T-layer (1) - nodes

t-nodes
complex typed feature structures
nodes represent autosemantic words
functional words do not have nodes of their own
artificially added nodes (e.g. for pro-drops)
node attributes
tectogrammatical lemma
dependency relation functor and subfunctor
grammateme attributes (representing morphological
meanings)
attributes for topic-focus articulation
attributes for coreference relations

12
T-layer (2) - dependency relations

according to FGD, two types of functors
actants (arguments)
ACT actor
PAT patient
ADDR addressee
EFF effect
ORIG origin
free modifiers (adjuncts)
various types of temporal modifiers - TWHEN,
TTIL, TSIN...
spatial and directional modifiers LOC, DIR1,
DIR2, DIR3
MEANS, BENeficiary, CAUSe, REGard, EXTent,
MATerial, CONDition...
additional functors for representing
non-dependency relations
coordinations CONJ, DISJ, ADVS ...
appositions APPS
parenthetical constructions PAR
expressions in foreign language FPHR

13
T-layer (3) - valency

all occurrences of all verbs in t-trees
interlinked with the valency lexicon PDT-VALLEX
individual valency frames roughly correspond to
individual senses of the given verb
valency frame: a sequence of frame slots; for
each slot, its functor, obligatoriness and
possible surface realizations are specified
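
A valency frame as described above maps naturally onto a small record type. The sketch below is only an illustration: the class and field names, the frame for "give" and the plain-English surface-form labels are assumptions and do not reproduce the PDT-VALLEX notation.

```python
from dataclasses import dataclass, field


@dataclass
class FrameSlot:
    functor: str                 # e.g. ACT, PAT, ADDR
    obligatory: bool
    surface_forms: list = field(default_factory=list)


@dataclass
class ValencyFrame:
    lemma: str
    slots: list


def missing_obligatory(frame, present_functors):
    """List obligatory functors of the frame that are absent from a set of functors."""
    return [slot.functor for slot in frame.slots
            if slot.obligatory and slot.functor not in present_functors]


# Hypothetical frame, invented for illustration only.
give = ValencyFrame(
    lemma="give",
    slots=[
        FrameSlot("ACT", True, ["nominative"]),
        FrameSlot("PAT", True, ["accusative"]),
        FrameSlot("ADDR", True, ["dative"]),
        FrameSlot("ORIG", False, ["from + genitive"]),
    ],
)

print(missing_obligatory(give, {"ACT", "PAT"}))
```

A check of this shape is one way to compare the arguments found in a t-tree with the frame a verb occurrence is linked to.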

14
T-layer (3) - coreference

two types of coreference according to FGD
grammatical (verbs of control, relative clauses,
reflexive pronouns...)
textual (personal pronouns, incl. elided ones)

coreference in PDT
binary relation between t-nodes
depicted as a non-tree arc (arrow)

15
T-layer (4) - grammatemes

grammatemes
t-node attributes representing morphological
meanings
motivation
number for nouns, tense for verbs, degree for
adjectives, deontic/verb/sentence modality ...

16
T-layer (5) - node typing

presence/absence of a given attribute?
→ the need for node typing
two-level hierarchy of t-layer node types used in
PDT 2.0

17
Interlinking the layers

every unit at every layer has an ID that is unique within PDT
neighboring layers connected by top-down pointers
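
Following the pointers from the top layer down recovers the surface tokens behind a t-node. The sketch below assumes made-up IDs and field names (`w_ref`, `m_ref`, `a_refs`) and a cleaned-up version of the example sentence; it does not reproduce the actual PDT data format.

```python
WORDS = ["He", "would", "have", "gone", "into", "forest"]

# One dict per layer, keyed by a unique unit ID (IDs are invented here).
w_layer = {f"w{i}": {"token": tok} for i, tok in enumerate(WORDS, 1)}
m_layer = {f"m{i}": {"form": tok, "w_ref": f"w{i}"} for i, tok in enumerate(WORDS, 1)}
a_layer = {f"a{i}": {"m_ref": f"m{i}"} for i in range(1, len(WORDS) + 1)}

# Function words have no t-node of their own, so one t-node may point
# to several a-nodes (auxiliaries, prepositions).
t_layer = {
    "t1": {"t_lemma": "he", "a_refs": ["a1"]},
    "t2": {"t_lemma": "go", "a_refs": ["a2", "a3", "a4"]},
    "t3": {"t_lemma": "forest", "a_refs": ["a5", "a6"]},
}


def surface_tokens(t_id):
    """Follow the top-down pointers t -> a -> m -> w and collect the tokens."""
    tokens = []
    for a_id in t_layer[t_id]["a_refs"]:
        m_id = a_layer[a_id]["m_ref"]
        w_id = m_layer[m_id]["w_ref"]
        tokens.append(w_layer[w_id]["token"])
    return tokens


print(surface_tokens("t2"))
```

Because the pointers only go downward, the lower layers can be read and processed without loading the layers above them.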

19
Sources of text

texts provided by the Czech National Corpus
7000 articles (or article fragments) from Czech
newspapers and journals
Lidové noviny (daily newspapers)
Mladá fronta Dnes (daily newspapers)
Českomoravský profit (business weekly)
Vesmír (scientific journal)

20
Amount of annotated data

m-layer data
1.96 MW (million words) in 116 kS (thousand sentences)
a-layer data (75% of m-layer)
1.5 MW in 88 kS
t-layer data (59% of a-layer)
0.8 MW in 49 kS

21
Division into files

1 XML file per document and annotation layer
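
With one file per document and layer, a directory listing can be regrouped by document. The sketch below assumes a `<document>.<layer>.gz` naming pattern: the `.t.gz` suffix appears in the btred example elsewhere in this talk, while the `w`, `m` and `a` suffixes and the document names are assumptions made by analogy.

```python
from collections import defaultdict

LAYERS = ("w", "m", "a", "t")  # assumed one-letter layer codes


def group_by_document(filenames):
    """Group file names of the form <document>.<layer>.gz by document."""
    docs = defaultdict(dict)
    for name in filenames:
        parts = name.rsplit(".", 2)
        if len(parts) != 3 or parts[2] != "gz" or parts[1] not in LAYERS:
            raise ValueError(f"unexpected file name: {name}")
        stem, layer, _ = parts
        docs[stem][layer] = name
    return dict(docs)


# Hypothetical file names for illustration.
files = [
    "sample0.w.gz", "sample0.m.gz", "sample0.a.gz", "sample0.t.gz",
    "sample1.w.gz", "sample1.m.gz",
]
grouped = group_by_document(files)
for doc, layers in grouped.items():
    print(doc, sorted(layers))
```

In the toy listing, the second document has only two layers, which mirrors the fact that not every document is annotated on all layers.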

22
Train/test data

train : devtest : evaltest = 8 : 1 : 1
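
An 8:1:1 ratio can be obtained by cycling over blocks of ten documents. The sketch below only illustrates the ratio; the function name and the assignment by document index are assumptions, and the slides do not say how documents were actually assigned to the three sets.

```python
from collections import Counter


def assign_split(doc_index):
    """Map a document index to a split so that every block of ten yields 8:1:1."""
    slot = doc_index % 10
    if slot < 8:
        return "train"
    if slot == 8:
        return "devtest"
    return "evaltest"


# 7000 is the approximate number of articles mentioned earlier in the talk.
counts = Counter(assign_split(i) for i in range(7000))
print(counts)
```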

23
Full vs. sample data

sample data
500 sentences
a freely available subset of the full data
converted also to HTML (can be viewed in any WWW
browser, no tree editor needed)
the whole PDT 2.0 except for the full data (but
including the sample data, all tools, and docs)
is available on the web
the full data will be available only to the
licensed users who obtain the CD from the
Linguistic Data Consortium

25
Tree editor TrEd

general customizable tree editor implemented in
Perl
the main editing and browsing tool in the PDT
project

26
Batch processing of the data

btred: batch processing version of tred
ntred: networked (parallelized) version of btred

btred -TNe 'print "$this->{t_lemma}\n"
  if $this->parent == $root and
  grep { $_->{functor} =~ /DIR/ } $this->children()' \
  data/sample/*.t.gz -q
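
For readers who do not use Perl, the logic of the query above, as far as it can be read from the transcript, can be restated in Python over an in-memory toy tree: report the t_lemma of every node that hangs directly below the root and has at least one child whose functor contains DIR. The `TNode` class and the sample tree are assumptions for illustration; btred itself works on PDT files through TrEd's Perl interface.

```python
import re


class TNode:
    """A toy t-node with a lemma, a functor and child nodes."""

    def __init__(self, t_lemma, functor, children=()):
        self.t_lemma = t_lemma
        self.functor = functor
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def descendants(node):
    """Yield all nodes below `node` in depth-first order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def matching_lemmas(root):
    """Lemmas of root's children that govern a directional modifier."""
    return [
        node.t_lemma
        for node in descendants(root)
        if node.parent is root
        and any(re.search("DIR", child.functor) for child in node.children)
    ]


# Toy tree for "He would have gone into forest" (functor labels assumed).
root = TNode(None, "ROOT", [
    TNode("go", "PRED", [
        TNode("he", "ACT"),
        TNode("forest", "DIR3"),
    ]),
])

print(matching_lemmas(root))
```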
27
Netgraph

client-server application for on-line PDT search
implemented in Java

28
Tools for post-annotation consistency checking

hundreds of btred scripts of various types
technical tests
e.g. each sentence contains at least one token
all identifiers are unique, all referred
identifiers exist...
m-layer tests
locative (6th case) cannot occur without a
preposition
improbable word forms (e.g. imperatives haš,
tel)
a-layer tests
not more than one subject in a clause
attributes (afun Atr) should not appear directly
below verbs
t-layer tests
surface forms of verb arguments match the
specifications in the valency lexicon
relative pronouns in relative clauses should be
in agreement with their antecedent (in the sense
of grammatical coreference)
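
The technical tests listed first are simple to express. The sketch below checks that identifiers are unique and that every referred identifier exists; it is written in Python for illustration only, with an assumed input shape of `(unit_id, referenced_ids)` pairs, whereas the real checks are btred scripts.

```python
from collections import Counter


def check_ids(units):
    """Check a list of (unit_id, referenced_ids) pairs.

    Returns a list of error messages; an empty list means the data passed.
    """
    errors = []
    counts = Counter(unit_id for unit_id, _ in units)
    for unit_id, count in counts.items():
        if count > 1:
            errors.append(f"duplicate id: {unit_id}")
    known = set(counts)
    for unit_id, refs in units:
        for ref in refs:
            if ref not in known:
                errors.append(f"{unit_id} refers to missing id: {ref}")
    return errors


# Hypothetical units: a2 is declared twice and once points to a missing m9.
units = [
    ("a1", ["m1"]),
    ("a2", ["m2"]),
    ("a2", ["m9"]),
    ("m1", []),
    ("m2", []),
]
errors = check_ids(units)
for message in errors:
    print(message)
```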

29
Tools for automatic annotation

chain of tools for automatic text processing
(from a raw text to a-layer trees)
1. sentence segmentation and tokenization
2. morphological analysis
3. morphological disambiguation
4. dependency parsing (adapted Collins parser)
5. analytical function assignment

30
Tools for format conversions

conversion not only between PDT data formats,
but also from other treebank formats
e.g. constituency trees from Negra in TrEd

32
PDT 2.0 Documentation

PDT Guide
overview of all parts of PDT 2.0
mirrors the directory structure of the PDT 2.0
CD-ROM
Annotation guidelines
m-layer (100 pages)
a-layer (250 pages)
t-layer (800 pages)
Publications
conference and journal papers, technical
reports, theses ...
Technical documentation (software tools and data
formats)

35
Want to experiment with...

tagging ?
dependency parsing ?
semantic-role labeling ?
frame semantics ?
word-sense disambiguation ?
anaphora resolution ?
information structure ?
...

Use PDT 2.0, it's all there!
36
Annotation scheme not limited to Czech
T-layer in English
T-layer in German
A-layer in German
A-layer in Arabic
A-layer in Slovene
A-layer in Romanian
37
(Some of) those involved
38