Olga Pustylnikov, Alexander Mehler - PowerPoint PPT Presentation

Description:

A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data. Olga Pustylnikov, Alexander Mehler, Bielefeld University

Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes



1
A Unified Database of Dependency Treebanks:
Integrating, Quantifying & Evaluating Dependency Data
  • Olga Pustylnikov, Alexander Mehler
  • Bielefeld University

2
Motivation
  • Exploring similarities among languages by means
    of syntactic treebanks
  • We collected a database covering 11 languages
  • Treebanks have been developed separately by
    different research projects
  • Quantitative investigations on these treebanks
    -> the need for unification

3
Motivation
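Corpus: John loves Mary
Structure: dependency tree with the head "loves" and the dependents "John" and "Mary"
Annotation: the same sentence in three treebank formats
  • Bracketed: (loves v ( (John n) (Mary n) ) )
  • Tabular (ID, form, POS, head):
    1 John n 2
    2 loves v 0
    3 Mary n 2
  • XML: <S ID="1"> <W DOM="2" ID="1"> John </W> <W DOM="_root" ID="2"> loves </W> <W DOM="2" ID="3"> Mary </W> </S>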
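The three annotations above encode the same head-dependent information, which is what makes a unified format plausible. A minimal Python sketch of that idea, assuming the tabular columns are ID, form, POS and head ID (with 0 marking the root); the function names are hypothetical and not part of any treebank tool:

```python
from xml.sax.saxutils import escape

# Tabular annotation from the slide: id, form, POS, head id (0 = root).
TABULAR = "1 John n 2\n2 loves v 0\n3 Mary n 2"

def parse_tabular(text):
    """Return a list of token dicts from whitespace-separated rows."""
    tokens = []
    for row in text.splitlines():
        tok_id, form, pos, head = row.split()
        tokens.append({"id": int(tok_id), "form": form, "pos": pos, "head": int(head)})
    return tokens

def to_bracketed(tokens, head=0):
    """Render all subtrees attached to `head` as nested brackets."""
    parts = []
    for tok in tokens:
        if tok["head"] == head:
            children = to_bracketed(tokens, tok["id"])
            inner = f"{tok['form']} {tok['pos']}"
            parts.append(f"({inner} ({children}))" if children else f"({inner})")
    return " ".join(parts)

def to_xml(tokens, sentence_id=1):
    """Render the tokens in the slide's <S>/<W> style; DOM names the head."""
    words = []
    for tok in tokens:
        dom = "_root" if tok["head"] == 0 else str(tok["head"])
        words.append(f'<W DOM="{dom}" ID="{tok["id"]}">{escape(tok["form"])}</W>')
    return f'<S ID="{sentence_id}">' + "".join(words) + "</S>"

tokens = parse_tabular(TABULAR)
bracketed = to_bracketed(tokens)   # nested-bracket view of the same tree
xml_text = to_xml(tokens)          # XML view of the same tree
```

Both renderings are derived from one token list, so neither format is privileged; this is only an illustration, not the conversion used for the database.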
4
Motivation
Demands on the unified format of treebanks
  • generic: able to represent as many treebanks as
    possible
  • extensible: to new treebanks
  • complete: preserving all corpus-specific
    information
  • transferable: to other kinds of corpora
  • complex: exhibiting the minimal complexity
  • -> graph representations

5
Motivation
GXL (Holt et al., 2006)
  • Graph eXchange Language is a graph model
    representing corpora in terms of graphs

[Diagram labels: XML, Multimodal Data, GXL, Tools, eGXL, Wiki, Treebanks]
  • GXL can be applied to any kind of corpus (see
    e.g. Mehler and Gleim (2005), Ferrer i Cancho et
    al. (2007), Pustylnikov and Mehler (2008))

6
Agenda
7
eGXL
2-level data model
Types
<graph id="Types"> <node id="POS" /> <node id="t245" name="VERB" /> </graph>
IDREF
Sentences
<graph id="Sentences"> <graph id="g8"> <node id="s8_1" form="Detta" pos="t151" /> <node id="s8_2" form="vill" pos="t245" /> ...
<rel> <relend direction="in" target="s8_2" /> <relend direction="out" target="s8_1" /> </rel> ... </graph>
9
The eGXL Types-graph
  • The Types-graph contains treebank-specific
    attributes (e.g. POS, morphological attributes,
    etc.) -> nodes
  • Each instance of an attribute is given a unique
    identifier

<graph id="Types"> <node id="POS" /> <node id="t245" name="VERB" /> </graph>
  • id: a unique identifier
  • name: the value of the attribute
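The idea of giving each attribute value its own identified node can be sketched as follows. This is a minimal Python illustration, assuming a simple running counter for identifiers (the slide's "t245" scheme is not documented here) and hypothetical tag values:

```python
import xml.etree.ElementTree as ET

def build_types_graph(pos_values):
    """Build a Types graph and return it with a value -> identifier map."""
    graph = ET.Element("graph", {"id": "Types"})
    ET.SubElement(graph, "node", {"id": "POS"})
    ids = {}
    for value in pos_values:
        if value not in ids:
            # Hypothetical numbering: t1, t2, ... in order of first occurrence.
            ids[value] = f"t{len(ids) + 1}"
            ET.SubElement(graph, "node", {"id": ids[value], "name": value})
    return graph, ids

# "PRON" and "VERB" are illustrative tag values, not taken from a treebank.
types_graph, pos_ids = build_types_graph(["PRON", "VERB", "PRON"])
types_xml = ET.tostring(types_graph, encoding="unicode")
```

Repeated values reuse the same identifier, so tokens can later point to a type node instead of repeating the tag string.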
10
The eGXL Sentences-graph
[Figure: dependency tree of the example sentence with the tokens vill, ., Detta, bestämt, jag, bemöta]
<graph id="Sentences"> <graph id="g8"> <node id="s8_1" form="Detta" pos="t151" /> <node id="s8_2" form="vill" pos="t245" /> ...
<rel> <relend direction="in" target="s8_2" /> <relend direction="out" target="s8_1" /> </rel> ... </graph>
  • node: each token of a treebank
  • pos: an IDREF to the POS-node of the Types-graph
  • form: word form
  • rel: a (syntactic) relation
  • in: from (e.g. a head verb)
  • out: to (e.g. a dependent argument)
11
The eGXL Sentences-graph
[Figure: dependency tree of the example sentence with the tokens vill, ., Detta, bestämt, jag, bemöta]
  • node: each token of a treebank
  • id: a unique identifier
  • form: word form
  • pos: an IDREF to the POS-node of the Types-graph
  • rel: a (syntactic) relation
  • relend: a relation anchor
  • in: from (e.g. a head verb)
  • out: to (e.g. a dependent argument)
<graph id="Sentences"> <graph id="g8"> <node id="s8_1" form="Detta" pos="t151" /> <node id="s8_2" form="vill" pos="t245" /> ...
<rel> <relend direction="in" target="s8_2" /> <relend direction="out" target="s8_1" /> </rel> ... </graph>
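How the two levels fit together can be illustrated by reading a sentence graph and resolving each pos IDREF against the Types-graph. The following Python sketch assumes the slide fragments are completed with closing tags and that the elided "..." parts are simply left out; the helper names are hypothetical. The type t151 is not defined in the slide excerpt, so it resolves to None here:

```python
import xml.etree.ElementTree as ET

# Slide fragments, closed so that they parse; elided parts are omitted.
TYPES_XML = '<graph id="Types"><node id="POS"/><node id="t245" name="VERB"/></graph>'
SENTENCES_XML = """
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151"/>
    <node id="s8_2" form="vill" pos="t245"/>
    <rel>
      <relend direction="in" target="s8_2"/>
      <relend direction="out" target="s8_1"/>
    </rel>
  </graph>
</graph>
"""

def pos_names(types_root):
    """Map each type identifier to its attribute value (nodes with a name)."""
    return {n.get("id"): n.get("name") for n in types_root.iter("node") if n.get("name")}

def read_sentence(graph, names):
    """Return (tokens, edges) for one sentence graph."""
    tokens = {n.get("id"): (n.get("form"), names.get(n.get("pos")))
              for n in graph.findall("node")}
    edges = []
    for rel in graph.findall("rel"):
        ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
        edges.append((ends["in"], ends["out"]))  # (from, to) as in the slide legend
    return tokens, edges

names = pos_names(ET.fromstring(TYPES_XML))
sentence = ET.fromstring(SENTENCES_XML).find("graph")
tokens, edges = read_sentence(sentence, names)
```

The single edge read from the fragment runs from "vill" (s8_2) to "Detta" (s8_1), matching the "in: from" and "out: to" reading of the relend directions.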
12
eGXL
13
Agenda
14
11 Dependency Treebanks
7 different formats
15
Input vs. Output Formats
  • Examples from Dutch, Swedish, Italian treebanks

16
Unification is possible
  • due to the separation of the core from the
    secondary parts

Diversity (Types-graph):
<graph id="Types"> <node id="POS" /> <node id="t245" name="VERB" /> </graph>
Commonality (Sentences-graph):
<graph id="Sentences"> <graph id="g8"> <node id="s8_1" form="Detta" pos="t151" /> <node id="s8_2" form="vill" pos="t245" /> ...
<rel> <relend direction="in" target="s8_2" /> <relend direction="out" target="s8_1" /> </rel> ... </graph>
17
The TreebankWiki
  • http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/

18
Agenda
19
Complexity of eGXL
  • Logical Scaling Factor (LSF): the number of logical
    elements (e.g. XML elements) required to represent
    a treebank unit (e.g. a word form, POS, etc.)

[Chart: eGXL compared with other formats; legend: node, rel]
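One way to make the LSF definition concrete is to count XML elements and divide by the number of treebank units. The Python sketch below assumes that reading (elements per token, root included); the slides do not spell out the exact counting procedure, and the two encodings are toy inputs, not the measurements behind the chart:

```python
import xml.etree.ElementTree as ET

def count_elements(xml_text):
    """Count all XML elements in a document, including the root."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())

def elements_per_unit(xml_text, unit_count):
    """Average number of XML elements per treebank unit (assumed LSF reading)."""
    return count_elements(xml_text) / unit_count

# Two hypothetical encodings of the same two-token fragment.
FLAT = '<S ID="1"><W DOM="_root" ID="1">vill</W><W DOM="1" ID="2">Detta</W></S>'
GRAPH = ('<graph id="g"><node id="n1" form="vill"/><node id="n2" form="Detta"/>'
         '<rel><relend direction="in" target="n1"/>'
         '<relend direction="out" target="n2"/></rel></graph>')

flat_lsf = elements_per_unit(FLAT, 2)    # 3 elements for 2 tokens
graph_lsf = elements_per_unit(GRAPH, 2)  # 6 elements for 2 tokens
```

Counting attributes as logical elements as well would change both numbers, so the choice of what counts as a logical element should be fixed before comparing formats.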
20
Agenda
21
DTDB
22
Agenda
23
Conclusions
  • a database covering 11 languages
  • eGXL: a generic XML graph model adapted to
    syntactic treebanks
  • use of treebanks within a single application
    (Ariadne)
  • olga.pustylnikov_at_uni-bielefeld.de
  • alexander.mehler_at_uni-bielefeld.de
  • ruediger.gleim_at_uni-bielefeld.de
  • SFB 673
  • Thank you for your attention!