Prague Dependency Treebank 2.0

http://ufal.mff.cuni.cz/pdt2.0

1
Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics
Charles University, Prague
zabokrtsky_at_ufal.mff.cuni.cz
2
Outline of the talk
  • Introduction
  • Layers of annotation
  • Data
  • Software tools
  • Documentation
  • Tour through the CD-ROM
  • Final remarks

3
Introduction
  • treebank
  • syntactically annotated corpus
  • (bank of syntactic trees)
  • Prague Dependency Treebank
  • collection of linguistically annotated Czech
    texts (2 million words), software tools and
    documentation
  • morphological, surface-syntactic and
    deep-syntactic dependency-oriented sentence
    analyses

4
About Czech
  • western group of Slavic languages
  • rich inflectional morphology
  • (relatively) free word order language
  • Latin alphabet extended with accents
  • (příliš žluťoučký kůň)
  • spoken in the Czech Republic
  • 10 million speakers

5
Historical background and development of PDT
  • 1920s: Prague Linguistic Circle founded
  • 1930s-50s: influential dependency-oriented
    works of Lucien Tesnière and Vladimír Šmilauer
  • mid-1960s: Petr Sgall's Functional Generative
    Description
  • 1992: Penn Treebank
  • 1994: Czech National Corpus
  • 1995: PDT started
  • 1998: PDT 0.5 pre-release
  • 2001: PDT 1.0 released by LDC
  • 2006: PDT 2.0 to be released by LDC

7
Layered annotation scheme
  • tectogrammatical layer
  • deep-syntactic dependency tree
  • analytical layer
  • surface-syntactic dependency tree
  • morphological layer
  • morphological lemma and tag associated with each
    token
  • word layer
  • original text, segmented on word boundaries

He would have gone intoforest.
8
M-layer
  • sentence represented as a sequence of tokens
  • each token lemmatized and tagged (attributes
    lemma and tag)
  • 15-character long positional morphological tag
  • 1. (main) POS
  • 2. detailed POS
  • 3. gender
  • 4. number
  • 5. case
  • ...
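
The positional tag can be illustrated with a small decoder. This is only a sketch: it names the five positions listed above and keeps the remaining ten as a raw string, since they are not enumerated here. The example tag string and the field names are assumptions made for illustration.

```python
# Names of the first five tag positions, as listed on the slide.
# Positions 6-15 are not enumerated there, so they stay unnamed here.
NAMED_POSITIONS = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag: str) -> dict:
    """Split a 15-character positional tag into named fields."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    fields = dict(zip(NAMED_POSITIONS, tag))
    # Keep the remaining positions together as one raw string.
    fields["rest"] = tag[len(NAMED_POSITIONS):]
    return fields


# Illustrative tag string (an assumption made for this example).
example = decode_tag("NNMS1-----A----")
print(example)
```

Because every position has a fixed meaning, a single character can be tested directly (for instance `example["case"]`) without parsing the whole tag.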

9
A-layer (1) - nodes and edges
  • sentence represented as a rooted ordered tree
    with labeled nodes and edges
  • edges labeled with analytical functions
  • dependency relations (Sb, Obj, Adv, Atr)
  • non-dep. relations (Coord)
  • auxiliary (functional) nodes (AuxP for
    prepositions, AuxC for subordinating
    conjunctions...)
  • special treatment of coordination constructions

10
A-layer (2) - coordination
  • intricate interplay between dependency and
    coordination relations
  • PDT solution: both conjuncts (members of
    coordination) and shared modifiers are attached
    below the coordination conjunction (but
    distinguished from each other by a special
    attribute, is_member)
  • direct parent vs. effective parent
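
The difference between direct and effective parents can be sketched on a toy tree. The code below is a simplified illustration, not the PDT tooling: the `ANode` class, the `add` helper and the function names are hypothetical, and only the `is_member` attribute and the `Coord` label come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ANode:
    form: str
    afun: str                      # analytical function, e.g. Sb, Atr, Coord
    is_member: bool = False        # True for conjuncts of a coordination
    parent: Optional["ANode"] = field(default=None, repr=False)
    children: List["ANode"] = field(default_factory=list, repr=False)

    def add(self, child: "ANode") -> "ANode":
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord: ANode) -> List[ANode]:
    """Members of a coordination, expanding nested coordinations."""
    result = []
    for child in coord.children:
        if child.is_member:
            result.extend(conjuncts(child) if child.afun == "Coord" else [child])
    return result


def effective_parents(node: ANode) -> List[ANode]:
    # A conjunct's direct parent is the coordination node; climb past it.
    current = node
    while current.is_member and current.parent is not None:
        current = current.parent
    parent = current.parent
    if parent is None:
        return []
    # A shared modifier hangs below the coordination node, so it
    # effectively modifies every conjunct.
    if parent.afun == "Coord":
        return conjuncts(parent)
    return [parent]


# Toy tree for "old men and women sleep" (a made-up example).
sleep = ANode("sleep", "Pred")
and_ = sleep.add(ANode("and", "Coord"))
men = and_.add(ANode("men", "Sb", is_member=True))
women = and_.add(ANode("women", "Sb", is_member=True))
old = and_.add(ANode("old", "Atr"))

print(men.parent.form, [p.form for p in effective_parents(men)])
print(old.parent.form, [p.form for p in effective_parents(old)])
```

In this toy tree both "men" and "old" have the conjunction "and" as their direct parent, but the effective parent of "men" is "sleep", while the shared modifier "old" has the two conjuncts as its effective parents.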

11
T-layer (1) - nodes
  • t-nodes
  • complex typed feature structures
  • nodes represent autosemantic words
  • functional words do not have nodes of their own
  • artificially added nodes (e.g. for pro-drops)
  • node attributes
  • tectogrammatical lemma
  • dependency relation functor and subfunctor
  • grammateme attributes (representing morphological
    meanings)
  • attributes for topic-focus articulation
  • attributes for coreference relations

12
T-layer (2) - dependency relations
  • according to FGD, two types of functors
  • actants (arguments)
  • ACT - actor
  • PAT - patient
  • ADDR - addressee
  • EFF - effect
  • ORIG - origin
  • free modifiers (adjuncts)
  • various types of temporal modifiers - TWHEN,
    TTILL, TSIN...
  • spatial and directional modifiers - LOC, DIR1,
    DIR2, DIR3
  • MEANS, BENeficiary, CAUSe, REGard, EXTent,
    MATerial, CONDition...
  • additional functors for representing
    non-dependency relations
  • coordinations - CONJ, DISJ, ADVS...
  • appositions - APPS
  • parenthetical constructions - PAR
  • expressions in a foreign language - FPHR

13
T-layer (3) - valency
  • all occurrences of all verbs in t-trees
    interlinked with the valency lexicon PDT-VALLEX
  • individual valency frames roughly correspond to
    individual senses of the given verb
  • valency frame: a sequence of frame slots, for
    each of which the functor, obligatoriness and
    possible surface realizations are specified
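
A valency frame as described above can be modelled as a small data structure. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the frame for "give" is invented for illustration and not taken from PDT-VALLEX, and the "case:N" strings are a made-up notation for surface realizations.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FrameSlot:
    functor: str                 # e.g. ACT, PAT, ADDR
    obligatory: bool
    forms: Tuple[str, ...]       # permitted surface realizations


@dataclass(frozen=True)
class ValencyFrame:
    lemma: str
    slots: Tuple[FrameSlot, ...]


def missing_obligatory(frame: ValencyFrame, present: Iterable[str]) -> List[str]:
    """Functors of obligatory slots that are not among the present functors."""
    present = set(present)
    return [s.functor for s in frame.slots
            if s.obligatory and s.functor not in present]


def form_allowed(frame: ValencyFrame, functor: str, form: str) -> bool:
    """True if some slot with this functor lists the given surface form."""
    return any(s.functor == functor and form in s.forms for s in frame.slots)


# Hypothetical frame, not taken from PDT-VALLEX; "case:N" is invented notation.
give = ValencyFrame("give", (
    FrameSlot("ACT", True, ("case:1",)),
    FrameSlot("PAT", True, ("case:4",)),
    FrameSlot("ADDR", False, ("case:3",)),
))

print(missing_obligatory(give, ["ACT"]))
print(form_allowed(give, "PAT", "case:4"))
```

A check of the form `form_allowed(frame, functor, form)` is the kind of comparison that a consistency test between t-trees and the lexicon would need.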

14
T-layer (3) - coreference
  • two types of coreference according to FGD
  • grammatical (verbs of control, relative clauses,
    reflexive pronouns...)
  • textual (personal pronouns, incl. elided ones)
  • coreference in PDT
  • binary relation between t-nodes
  • depicted as a non-tree arc (arrow)

15
T-layer (4) - grammatemes
  • grammatemes
  • t-node attributes representing morphological
    meanings
  • motivation
  • number for nouns, tense for verbs, degree for
    adjectives, deontic/verb/sentence modality ...

16
T-layer (5) - node typing
  • presence/absence of a given attribute?
  • hence the need for node typing
  • two-level hierarchy of t-layer node types used in
    PDT 2.0

17
Interlinking the layers
  • every unit at every layer has an ID that is
    unique within PDT
  • neighboring layers connected by top-down pointers
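
Following the pointers from the top layer down to the original tokens might look as follows. This is a sketch only: all IDs, attribute names (`a_refs`, `m_ref`, `w_ref`) and the toy data are hypothetical and do not reproduce the actual PDT data format.

```python
# All IDs and attribute names below are hypothetical.
w_layer = {
    "w1": {"token": "would"},
    "w2": {"token": "have"},
    "w3": {"token": "gone"},
}
m_layer = {
    "m1": {"lemma": "would", "w_ref": "w1"},
    "m2": {"lemma": "have", "w_ref": "w2"},
    "m3": {"lemma": "go", "w_ref": "w3"},
}
a_layer = {
    "a1": {"m_ref": "m1"},
    "a2": {"m_ref": "m2"},
    "a3": {"m_ref": "m3"},
}
# One t-node may point to several a-nodes, because function words
# have no t-nodes of their own.
t_layer = {
    "t1": {"t_lemma": "go", "a_refs": ["a1", "a2", "a3"]},
}


def surface_tokens(t_id: str) -> list:
    """Follow top-down pointers t -> a -> m -> w and collect the tokens."""
    tokens = []
    for a_id in t_layer[t_id]["a_refs"]:
        m_id = a_layer[a_id]["m_ref"]
        w_id = m_layer[m_id]["w_ref"]
        tokens.append(w_layer[w_id]["token"])
    return tokens


print(surface_tokens("t1"))
```

Since the pointers run only from a higher layer to the one directly below it, going in the opposite direction requires building an inverse index first.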

19
Sources of text
  • texts provided by the Czech National Corpus
  • 7000 articles (or article fragments) from Czech
    newspapers and journals
  • Lidové noviny (daily newspaper)
  • Mladá fronta Dnes (daily newspaper)
  • Českomoravský Profit (business weekly)
  • Vesmír (scientific journal)

20
Amount of annotated data
  • m-layer data
  • 1.96 MW (million words) in 116 kS (thousand
    sentences)
  • a-layer data (75% of m-layer)
  • 1.5 MW in 88 kS
  • t-layer data (59% of a-layer)
  • 0.8 MW in 49 kS

21
Division into files
  • 1 XML file per document and annotation layer

22
Train/test data
  • train : devtest : evaltest = 8 : 1 : 1
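
One simple way to obtain an 8 : 1 : 1 division is to assign documents to the three sets by their index. The slide does not say how the PDT split was actually produced, so the sketch below is purely illustrative and the function name is hypothetical.

```python
from collections import Counter


def assign_split(doc_index: int) -> str:
    """Map a document index to a data set so that the ratio is 8 : 1 : 1."""
    bucket = doc_index % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "devtest"
    return "evaltest"


# Distribution over 100 hypothetical documents.
counts = Counter(assign_split(i) for i in range(100))
print(counts["train"], counts["devtest"], counts["evaltest"])
```

Splitting by whole documents rather than by sentences keeps all sentences of one article in the same set.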

23
Full vs. sample data
  • sample data
  • 500 sentences
  • a freely available subset of the full data
  • converted also to HTML (can be viewed in any WWW
    browser, no tree editor needed)
  • the whole PDT 2.0 except for the full data (but
    including the sample data, all tools and
    documentation) is available on the web
  • the full data will be available only to the
    licensed users who obtain the CD from the
    Linguistic Data Consortium

25
Tree editor TrEd
  • general customizable tree editor implemented in
    Perl
  • the main editing and browsing tool in the PDT
    project

26
Batch processing of the data
  • btred: batch-processing version of TrEd
  • ntred: networked (parallelized) version of btred

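Example: print the t_lemma of every node directly below the root that has
a child whose functor matches DIR:
btred -TNe 'print "$this->{t_lemma}\n"
  if $this->parent==$root and
  grep{$_->{functor}=~/DIR/} $this->children()' \
  data/sample/*.t.gz -q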
27
Netgraph
  • client-server application for on-line PDT search
  • implemented in Java

28
Tools for post-annotation consistency checking
  • hundreds of btred scripts of various types
  • technical tests
  • e.g. each sentence contains at least one token
  • all identifiers are unique, all referred
    identifiers exist...
  • m-layer tests
  • locative (6th case) cannot occur without a
    preposition
  • improbable word forms (e.g. imperatives haš,
    tel)
  • a-layer tests
  • not more than one subject in a clause
  • attributes (afun = Atr) should not appear
    directly below verbs
  • t-layer tests
  • surface forms of verb arguments match the
    specifications in the valency lexicon
  • relative pronouns in relative clauses should be
    in agreement with their antecedent (in the sense
    of grammatical coreference)
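
The technical tests listed above (unique identifiers, no dangling references) can be sketched in a few lines. The real checks are btred scripts; the Python version below only illustrates the idea over hypothetical records with an `id` field and an optional `refs` list.

```python
from collections import Counter


def check_identifiers(units):
    """Return (duplicate IDs, referred IDs that do not exist).

    `units` is a list of dicts with an "id" key and an optional
    "refs" list; this record layout is an assumption of the sketch.
    """
    counts = Counter(u["id"] for u in units)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    known = set(counts)
    dangling = sorted({r for u in units
                       for r in u.get("refs", []) if r not in known})
    return duplicates, dangling


# Hypothetical records: "a2" occurs twice and "m9" is never defined.
units = [
    {"id": "m1"},
    {"id": "a1", "refs": ["m1"]},
    {"id": "a2", "refs": ["m9"]},
    {"id": "a2"},
]
duplicates, dangling = check_identifiers(units)
print(duplicates, dangling)
```

Linguistic tests such as "not more than one subject in a clause" follow the same pattern: traverse the data, collect violations, and report them for manual inspection.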

29
Tools for automatic annotation
  • chain of tools for automatic text processing
    (from a raw text to a-layer trees)
  • 1. sentence segmentation and tokenization
  • 2. morphological analysis
  • 3. morphological disambiguation
  • 4. dependency parsing (adapted Collins)
  • 5. analytical function assignment

30
Tools for format conversions
  • conversion not only between PDT data formats,
    but also from the formats of other treebanks
  • e.g. constituency trees from Negra in TrEd

32
PDT 2.0 Documentation
  • PDT Guide
  • overview of all parts of PDT 2.0
  • mirrors the directory structure of the PDT 2.0
    CD-ROM
  • Annotation guidelines
  • m-layer (100 pages)
  • a-layer (250 pages)
  • t-layer (800 pages)
  • Publications
  • conference and journal papers, technical
    reports, theses ...
  • Technical documentation (software tools and data
    formats)

35
Want to experiment with...
  • tagging?
  • dependency parsing?
  • semantic-role labeling?
  • frame semantics?
  • word-sense disambiguation?
  • anaphora resolution?
  • information structure?
  • ...

Use PDT 2.0, it's all there!
36
Annotation scheme not limited to Czech
T-layer in English
T-layer in German
A-layer in German
A-layer in Arabic
A-layer in Slovene
A-layer in Romanian
37
(Some of) those involved
38
  • Thank you!
  • BTW, anyone interested in beta-testing?