Prague Dependency Treebank 1.0

Slides: 59
Provided by: alena

1
Prague Dependency Treebank 1.0
  • CD-ROM PRESENTATION
  • Dec 18, 2000

2
Prague Dependency Treebank 1.0
Functional Generative Description

3
Functional Generative Description
  • theoretical framework based on the findings of
    European structural linguistics, esp. of the
    classical Prague School
  • methodological requirements of a formal
    description
  • levels
      • tectogrammatical (underlying) representations
        (TRs) with dependency based syntax
      • morphemics
      • phonemics and phonetics
  • TRs (see Sgall, Hajičová and Panevová 1986,
    formally specified by Petkevič, also in a
    declarative way)

4
Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)

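The one-to-one relation between the tree and its linearized form can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `linearize` function are hypothetical names, and the rule that a dependent to the left of its head carries its functor after the closing parenthesis, while a dependent to the right carries it after the opening one, is read off the example sentence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a dependency tree (illustrative structure)."""
    label: str                       # lexical label, e.g. "brother"
    functor: Optional[str] = None    # relation to the head; None for the root
    left: List["Node"] = field(default_factory=list)    # dependents before the head
    right: List["Node"] = field(default_factory=list)   # dependents after the head


def _wrap(child: Node, on_left: bool) -> str:
    inner = linearize(child)
    # The functor is written on the side of the parenthesis that faces the head.
    if on_left:
        return f"({inner}){child.functor}"
    return f"({child.functor} {inner})"


def linearize(node: Node) -> str:
    """Return the bracketed string for the subtree rooted at `node`."""
    parts = [_wrap(child, on_left=True) for child in node.left]
    parts.append(node.label)
    parts += [_wrap(child, on_left=False) for child in node.right]
    return " ".join(parts)


tree = Node(
    "arrive.Pret.Indic",
    left=[Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
print(linearize(tree))
```

Run on the tree for the example sentence, the sketch reproduces the linearized form given on the slide.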
5
Dependency Tree
  • labels - lexical meanings (abstract symbols) with
    indices
  • functors
  • subscripts at parentheses oriented towards head
  • grammatemes - values of morphological categories
  • Tense, Modality, Number, Definiteness, etc.
  • projectivity
  • valency
  • arguments (inner participants) and adjuncts
    (circumstantials or 'free modifications')
  • obligatory and optional with a given head,
  • deletable or not

6
Dependency Tree
  • participants (arguments) of verbs
      • Actor/Bearer (underlying subject)
      • Objective (Patient, underlying direct object)
      • Addressee (underlying indirect object)
      • Effect ('second' object: to choose so. as sth.)
      • Origin (to make sth. out of sth.)
  • adjuncts
      • Locative, several Directional and Temporal
        modifications
      • Condition, Means, Manner, etc.

7
Dependency Tree
Complementations dependent mainly on nouns
  • inner participants
      • Material (Partitive): two baskets of sth.
      • Identity: the river Danube, the notion of
        operator
  • free modifications
      • Possession (Appurtenance): my table, Jim's
        brother
      • Restrictive: rich man
      • Descriptive: the Swedes, who are a Scandinavian
        nation

8
Dependency Tree
  • syntactic grammatemes
  • Loc, Dir - in, on, under, between...
  • Regard - with, without
  • operational (testable) criteria
  • for distinguishing
  • arguments from adjuncts,
  • from each other
  • deletability (dialogue test)

9
Simplified valency frames
  • brother N Appurt
  • man N
  • glass N Material
  • full A Material
  • read V Act Addr Obj
  • change V Act Obj Orig Eff
  • give V Act Addr Obj

obligatory complementations in blue
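As a rough illustration, the simplified frames above can be held in a small lookup table. The sketch below is an assumption-laden toy, not the project's lexicon format: `VALENCY_FRAMES` and `participants_outside_frame` are hypothetical names, and because the obligatory/optional distinction is colour-coded on the slide and not recoverable from this transcript, every listed functor is treated simply as "allowed by the frame".

```python
# Frames copied from the slide: lemma -> (part of speech, functors in the frame).
# Which of these are obligatory is NOT encoded here (the colour coding is lost).
VALENCY_FRAMES = {
    "brother": ("N", ["Appurt"]),
    "man": ("N", []),
    "glass": ("N", ["Material"]),
    "full": ("A", ["Material"]),
    "read": ("V", ["Act", "Addr", "Obj"]),
    "change": ("V", ["Act", "Obj", "Orig", "Eff"]),
    "give": ("V", ["Act", "Addr", "Obj"]),
}


def participants_outside_frame(lemma, functors):
    """Return the functors in `functors` that the frame of `lemma` does not list.

    Intended for inner participants only; free modifications such as
    Loc or Temp are not part of these frames and need separate handling.
    """
    _pos, frame = VALENCY_FRAMES[lemma]
    return [f for f in functors if f not in frame]


print(participants_outside_frame("give", ["Act", "Obj"]))
print(participants_outside_frame("read", ["Act", "Orig"]))
```

A fuller version would mark each functor in a frame as obligatory or optional, which is exactly the information the blue highlighting carries in the original slides.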
10
Topic-focus articulation
  • contextual boundness
  • main verb CB/NB (T/F)
  • dependents to the left/right
  • communicative dynamism
  • left-right (mother, sisters, transitive)
  • partial ordering
  • underlying word order
  • left-right
  • linear ordering

left-to-right order of nodes together with the
index T or (prototypically) F indicates the TFA
of the sentence (of the TR)
11
Topic-focus articulation
  • TFA - one of the basic aspects of underlying
    structures

12
Complex sentence
My brother, whom you know, arrived there
yesterday.
  • a subordinated (dependent) clause (i.e. its main
    verb) depends on a word contained in its
    governing clause

13
Complex sentence
Martin came there late, since he had to accompany
his sick mother.
  • function words (synsemantic) are viewed as
    function morphemes, syntactically fixed to
    certain lexical (autosemantic) words -
    prepositions and articles to nouns, conjunctions
    and auxiliaries to verbs

14
Complex sentence
Martin arrived late to the session, since he had
to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he
have.ed to accompany he.s sick mother
dot - close connection of morphemes ('semes')

15
  • deleted items restored
  • order of items - difference between 'underlying'
    and surface (morphemic) word order
  • transductive components - Panevová, Oliva,
    Borota
  • coordination (multidimensional)
  • Jim and Mary, who have two children, went to
    Boston.
  • the linearized notation is adequate
  • ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr
    children)))Act went (Dir Boston)
  • structures close to Boolean, i.e. no complex
    'innate properties' specific for natural language
    are needed.

16
Prague Dependency Treebank - corpus annotation
  • an intermediate level - 'analytical'
    representations
  • dependency trees, not always projective
  • nodes for all word tokens, even for punctuation
    marks
  • tectogrammatical tree: coordinating conjunction
    as the head

18
Prague Dependency Treebank 1.0
Morphological Layer

19
ACKNOWLEDGEMENTS
20
ANNOTATED CORPORA
  • PDT version 1.0, 2000 (1996 - 2000)
  • Penn Treebank, release 3, 1999 (1989 - 1999)

21
TAG SETs
  • Czech - ambiguous inflective language
  • nový, nového, novému, novém, novým, nová, nové,
    novou, nových, novým, novými, novější,
    novějšího, novějšímu, novějším, ..., nejnovější,
    nejnovějšího, nejnovějšímu, nejnovějším, ...,
    nejnovějších, nejnovějším, ...
  • English - language with poor inflection
  • work, works, worked, working

23
TEXT SOURCES
  • PDT
      • Lidové noviny
      • Mladá Fronta Dnes
      • Vesmír
      • Českomoravský Profit
      • ...taken from Czech National Corpus
  • Penn Treebank
      • 88, 89 WSJ articles
      • Air Travel Information System transcripts
      • Brown Corpus
      • Switchboard transcripts

24
ANNOTATION STRATEGY - Penn Treebank
TEXT -> Ken Church's stochastic tagger, Eric
Brill's transformation tagger -> corrections
by annotator (GNU Emacs Lisp based package)

25
ANNOTATION STRATEGY - PDT
  • Automatic Morphological Analyzer (AMA)
  • two independent annotators (Linux, Win
    tools)
  • differences resolved by third annotator
  • comparison with the current AMA,
    manual resolution (Win tools)

26
INTERNAL FORMAT
  • SGML coding, csts dtd
  • word/tag(tag)

27
SAMPLES
<s id=ln95040020-p1s1>
<f>Pokus<l>pokus<t>NNIS1-----A----
<f>o<l>o<t>RR--4----------
<f>zázrak<l>zázrak<t>NNIS4-----A----
<d>.<l>.<t>Z:-------------

The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.

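The Czech tags in the sample are positional: each of the 15 characters encodes one morphological category, and "-" means the category does not apply. A minimal decoding sketch follows. The position names in `POSITIONS` are not given on the slide; they are stated here as an assumption from general knowledge of the PDT tag set, and `decode_tag` is a hypothetical helper name.

```python
# Assumed names of the 15 tag positions (not taken from the slide).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("NNIS1-----A----"))   # tag of "Pokus" in the sample
print(decode_tag("RR--4----------"))   # tag of the preposition "o"
```

For "Pokus" this yields a noun (POS and SubPOS "N") with gender "I", number "S", case "1" and negation "A"; for "o" it yields a preposition with case "4".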
28
CONVERSION
  • SGML coding -> word/tag (pdt2wsj.pl)
  • SGML coding -> word/lemma/tag (pdt2wsjFLT.pl)

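The idea of converting the SGML coding into word/tag or word/lemma/tag output can be sketched as follows. This is not the content of pdt2wsj.pl or pdt2wsjFLT.pl, which is not shown in the presentation; it is a small Python illustration that assumes tokens are coded as `<f>form<l>lemma<t>tag` (or `<d>` for punctuation), a markup reconstructed from the garbled sample. `TOKEN` and `csts_to_slash` are hypothetical names.

```python
import re

# One token: <f> or <d> element, then lemma and tag (assumed layout).
TOKEN = re.compile(r"<[fd][^>]*>([^<]*)<l>([^<]*)<t>([^<\s]*)")


def csts_to_slash(line, with_lemma=False):
    """Rewrite SGML-coded tokens as word/tag or word/lemma/tag."""
    out = []
    for form, lemma, tag in TOKEN.findall(line):
        fields = (form, lemma, tag) if with_lemma else (form, tag)
        out.append("/".join(fields))
    return " ".join(out)


sample = (
    "<s id=ln95040020-p1s1> "
    "<f>Pokus<l>pokus<t>NNIS1-----A---- "
    "<f>o<l>o<t>RR--4---------- "
    "<f>zázrak<l>zázrak<t>NNIS4-----A---- "
    "<d>.<l>.<t>Z:-------------"
)
print(csts_to_slash(sample))
print(csts_to_slash(sample, with_lemma=True))
```

The sentence element `<s ...>` is skipped because the pattern only matches `<f>` and `<d>` elements. A real converter would also have to handle attributes, ambiguous lemma/tag alternatives and forms containing a slash.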
29
DATA SIZE
30
DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
31
TOOLS
  • Automatic Morphological Analyser/Generator of
    Czech
  • HMAnalyze.pl, HMGenerate.pl
  • Dictionary CZE_a
  • Remote Access
  • Czech Taggers
  • HMM
  • Exponential

33
Prague Dependency Treebank 1.0
Analytical Layer in PDT

34
Introduction
  • Input: morphologically tagged sentences
  • Graph Editor: user-friendly software
  • Output: ATS structure
  • surface syntax tree structure
  • nodes labelled by the analytical functions

35
Two stages (chronologically)
  • (A) manual analytic annotation (ATS)
  • training data for (B)(a)
  • (B)
  • (a) semiautomatic procedure (Collins parser)
  • (b) manual correcting of (B)(a)

36
Constraints and limitations
  • any string has a node of its own
  • word-form, punctuation mark, etc.
  • AuxV, AuxP, AuxC, AuxX, AuxG
  • reflecting the coordination and apposition
    relations
  • so called third dimension of the graph in the
    plain tree (X_Co, X_Ap, X_Pa, where X is one of
    analytic functions, such as Sb, Obj, Adv, etc.)
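The suffix convention can be made concrete with a small helper that separates the analytic function from its "third dimension" marker. The sketch is illustrative only: `split_afun` is a hypothetical name, and reading `_Pa` as parenthesis is an assumption from general knowledge of the PDT annotation, since the slide names only coordination and apposition.

```python
# Suffixes from the slide; the gloss for "Pa" is an assumption.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(afun):
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); return (afun, None) if unsuffixed."""
    base, sep, suffix = afun.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return afun, None


for label in ["Sb_Co", "Obj_Ap", "Adv", "Ex_D"]:
    print(label, split_afun(label))
```

Only the three known suffixes are stripped, so a label such as Ex_D, which contains an underscore for another reason, is returned unchanged.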

37
Constraints and limitations
  • no missing nodes (on the surface) can be added
  • analytic function Ex_D is used
  • relations between semi-automatic and manual
    procedure
  • 80% of edges are established correctly automatically

38
Project organization
  • team consisting of 5-6 annotators
  • handbook for ATS structure annotation
  • 1999: 100,000 sentences on ATS
  • tectogrammatical annotation follows

39
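První restituční zákon českého parlamentu se do
sněmovních lavic může vrátit jako bumerang.
(The first restitution law of the Czech parliament may
return to the parliamentary benches like a boomerang.)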
41
Prague Dependency Treebank 1.0
From the Analytical towards the Tectogrammatical
layer

42
Introduction
  • ATS annotation
  • nodes
  • word forms
  • punctuation
  • graphical symbols
  • TGTS annotation
  • autosemantic words
  • deletions
  • edges
  • surface relations
  • deep layer functions

43
Annotation process
Input: Czech sentence
44
Transition procedure
  • deterministic procedure operating on trees
  • macro language for Graph Editor (C like)
  • automatic changes tools for annotators
  • Requirements
  • new attributes for tectogrammatical layer
  • ATS is recoverable from TGTS
  • automatized to a maximally high degree
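A deterministic tree-to-tree step of this kind can be sketched in Python (the actual procedure is written in the Graph Editor's macro language, which is not shown here). The sketch makes several assumptions: nodes are plain dictionaries, a preposition (AuxP) governs its noun in the analytic tree, punctuation carries AuxX or AuxG, and `to_tecto` is a hypothetical name. It folds the preposition into an `fw` attribute of the autosemantic word and drops punctuation nodes; it does not attempt to keep the ATS recoverable from the result.

```python
PUNCT_AFUNS = {"AuxX", "AuxG"}   # punctuation / graphic symbols


def to_tecto(node):
    """Return the list of tectogrammatical-style nodes that replace `node`."""
    kids = []
    for child in node["children"]:
        kids.extend(to_tecto(child))
    if node["afun"] == "AuxP":
        # The preposition loses its own node; it is stored on its dependents.
        for kid in kids:
            kid["fw"] = node["form"]
        return kids
    if node["afun"] in PUNCT_AFUNS:
        return kids                # punctuation is not kept as a node
    return [{
        "trlemma": node["lemma"],
        "afun": node["afun"],
        "fw": None,
        "children": kids,
    }]


# Hand-built analytic fragment for "začíná od nuly," (illustrative only).
ats = {"form": "začíná", "lemma": "začínat", "afun": "Pred", "children": [
    {"form": "od", "lemma": "od", "afun": "AuxP", "children": [
        {"form": "nuly", "lemma": "nula", "afun": "Adv", "children": []},
    ]},
    {"form": ",", "lemma": ",", "afun": "AuxX", "children": []},
]}

tgts = to_tecto(ats)[0]
print(tgts)
```

After the step, the verb has a single dependent with trlemma "nula" and fw "od", and the comma node is gone. Assigning functors and restoring deleted items would be later steps, largely manual.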

45
New attributes
  • trlemma - lemma of the original node or lemma
    composed of joined nodes
  • morphological grammatemes
  • gender, number, degree of comparison, tense,
  • aspect, iterativeness, verbal modality, deontic
    modality, sentence modality
  • position of the node
  • functor, topic-focus articulation, syntactic
    grammateme,
  • type of relation (dependency, coordination,
    apposition),
  • phraseme, deletion, quoted word, direct speech,
  • coreference, antecedent

46
Tree Structure Pruning
  • U toho, kdo začíná opravdu od nuly, není daňový
    výnos pro stát podstatný.
  • For those, who start actually at zero, the tax
    outcome for the state is not substantial.

48
Verbal Nodes
  • podnikatelé by měli mít daně
  • entrepreneurs should have (their) taxes

49
Attribute Assignments
  • prepositions stored as fw attribute
  • quoted words
  • clause in quotes -> DSP
  • one pair of quotes in the sentence -> DSPP
  • string in quotes -> QUOT
  • gender, number, tense, degcmp, aspect
  • default values

50
Macros for Annotators
  • keyboard shortcuts (in Graph editor)
  • structure changes
  • hide/recover nodes
  • merge nodes
  • add new nodes
  • functor assignments

51
Manual annotation
  • structure checking
  • functors
  • deletions of obligatory modifications
  • feedback for formulating the handbook for
    annotators

53
Prague Dependency Treebank 1.0
Tectogrammatical Layer

56
  • Jirka se včera opil do němoty
    a Honza dneska.
  • George himself yesterday drank to silence and
    Honza today.

57
Attributes of Coreferential relations
  • only in MC
  • attribute values
      • coref: the lemma of the antecedent
      • corsnt: NIL - in the same sentence;
        PREV1 ... PREVi - position of the sentence
        which includes the antecedent
  • grammatical coreference
      • antec: the functor of the antecedent

58
Example
Honza slíbil přijít včas.
Honza promised to come in time.
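The three coreference attributes can be pictured as a small record attached to a node. The sketch below is hypothetical throughout: the `Coreference` class, the helper `sentence_offset`, the reading of PREVi as "i sentences back", and the attribute values chosen for the restored subject of "přijít" are all assumptions made for illustration, since the annotated tree from the slide is not part of this transcript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Coreference:
    coref: str                    # lemma of the antecedent
    corsnt: str                   # "NIL" or "PREV1", "PREV2", ...
    antec: Optional[str] = None   # functor of the antecedent (grammatical coreference)


def sentence_offset(corsnt):
    """Return 0 for NIL (same sentence) and i for PREVi (assumed: i sentences back)."""
    if corsnt == "NIL":
        return 0
    if corsnt.startswith("PREV") and corsnt[4:].isdigit():
        return int(corsnt[4:])
    raise ValueError(f"unexpected corsnt value: {corsnt!r}")


# Assumed annotation of the unexpressed subject of "přijít" in the example:
# it refers back to "Honza", the Actor of "slíbil", within the same sentence.
restored_subject = Coreference(coref="Honza", corsnt="NIL", antec="Act")
print(restored_subject, sentence_offset(restored_subject.corsnt))
```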