Title: CoNLL-X Shared Task on Multilingual Dependency Parsing
1CoNLL-X Shared Task on Multilingual Dependency
Parsing
- Sabine Buchholz, Speech Technology Group, Cambridge Research Lab, Toshiba Research Europe Ltd, UK
- Erwin Marsi, Tilburg University, The Netherlands
- Amit Dubey, University of Edinburgh, UK
- Yuval Krymolowski, University of Haifa, Israel
2Overview
- Introduction
- Dependency structures
- Data format
- Treebanks used
- Evaluation metric
- Parsing approaches
- Conclusion
- Results
- Analysis
- The Future
3Dependency structures
- No constituents (unlike phrase structure)
- Dependency relations between two lexical items (tokens)
- One possible graphical representation
4Dependency structure terminology
[Figure: a subj arc between the tokens "This" and "is"]
Note: Other people may draw arrows from head to child!
5Dependency structures in the shared task
- Virtual root node
- Each token except BOS has exactly one head
- More than one token can link to BOS
- Crossing arcs are allowed, i.e. structures can be non-projective
[Figure: arcs labelled ROOT, subj, comp, det and punc over the sentence below]
BOS This is a test .
0 1 2 3 4 5
Projective example: Do you need it for something ?
Non-projective example: What do you need it for ?
An arc (i,j) is projective iff all nodes
occurring between i and j are dominated by i
(where dominates is the transitive closure of the
arc relation)
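The definition above can be checked mechanically. The following is a minimal sketch, assuming a sentence is given as a list `heads` where `heads[k]` is the head index of token `k` and index 0 is the virtual BOS node (head `None`). The function names and the head assignments for the example sentence are illustrative assumptions, not part of the shared task materials, and the input is assumed to be a well-formed tree (no cycles).

```python
def dominates(heads, i, k):
    """True if node i is an ancestor of node k (transitive closure of the arc relation)."""
    node = heads[k]
    while node is not None:
        if node == i:
            return True
        node = heads[node]
    return False


def is_projective_arc(heads, head, child):
    """An arc (head, child) is projective iff every node strictly between
    the two positions is dominated by the head."""
    lo, hi = sorted((head, child))
    return all(dominates(heads, head, k) for k in range(lo + 1, hi))


def non_projective_arcs(heads):
    """List all (head, child) arcs that violate projectivity."""
    return [
        (heads[child], child)
        for child in range(1, len(heads))
        if not is_projective_arc(heads, heads[child], child)
    ]


# Assumed analysis of "What do you need it for ?" (tokens 1..7, BOS = 0):
# What -> for, do -> BOS, you -> do, need -> do, it -> need, for -> need, ? -> do
question = [None, 6, 0, 2, 2, 4, 4, 2]

# Assumed analysis of "This is a test ." (tokens 1..5, BOS = 0)
statement = [None, 2, 0, 4, 2, 2]

print(non_projective_arcs(question))
print(non_projective_arcs(statement))
```

Under these assumed heads, only the arc from "for" to "What" is reported, because "do" lies between them and is not dominated by "for".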
6Data format
[Figure: arcs labelled ROOT, subj, comp, det and punc over the sentence below]
BOS This is a test .
0 1 2 3 4 5
7Data format details
- ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL guaranteed to contain a non-dummy value
- except Spanish DEPREL bug ... ?
- Although CPOSTAG and POSTAG may be identical
- German and Swedish
- LEMMA and FEATS allowed to contain dummy value ( _ )
- if information not available in original treebank
- Additional columns PHEAD and PDEPREL (in training data) not used by anybody
- Unicode (UTF-8)
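As a rough illustration of the columns listed above, the sketch below reads the format into Python dictionaries. It assumes the column order ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL, tab-separated fields, one token per line and a blank line between sentences, as in the shared task's published format description. The function name, the tag values and the head assignments in the sample are hypothetical.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_conllx(text):
    """Parse CoNLL-X style text into a list of sentences (lists of token dicts).

    Rows with fewer than ten fields (e.g. blind test data) simply yield
    fewer keys, because zip stops at the shorter sequence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token = dict(zip(COLUMNS, line.split("\t")))
        token["ID"] = int(token["ID"])
        if "HEAD" in token:
            token["HEAD"] = int(token["HEAD"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample: tags and heads are made up for illustration.
SAMPLE = "\n".join([
    "1\tThis\t_\tPRON\tPRON\t_\t2\tsubj\t_\t_",
    "2\tis\t_\tV\tV\t_\t0\tROOT\t_\t_",
    "3\ta\t_\tDET\tDET\t_\t4\tdet\t_\t_",
    "4\ttest\t_\tN\tN\t_\t2\tcomp\t_\t_",
    "5\t.\t_\tPUNC\tPUNC\t_\t2\tpunc\t_\t_",
    "",
])

for token in parse_conllx(SAMPLE)[0]:
    print(token["ID"], token["FORM"], token["HEAD"], token["DEPREL"])
```

Keeping the dummy value `_` as a plain string leaves it to the caller to decide how LEMMA and FEATS should be treated when the original treebank did not provide them.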
8Treebanks used
- Czech: Prague Dependency Treebank (PDT)
- Arabic: Prague Arabic Dependency Treebank (PADT)
- Slovene: Slovene Dependency Treebank (SDT)
- Danish: Danish Dependency Treebank (DDT)
- Swedish: Talbanken05
- Turkish: Metu-Sabanci treebank
- German: TIGER treebank
- Japanese: Japanese Verbmobil treebank
- Portuguese: The Bosque part of the Floresta sintá(c)tica
- Dutch: Alpino treebank
- Chinese: Sinica treebank
- Spanish: Cast3LB
- Bulgarian: BulTreeBank
Dependency format
Constituents and functions
Constituents and some functions
9Data format some examples
55.39031 VP(evaluationDbb?HeadV_11?range NP(propertyNv3??HeadNab??))?(PERIODCATEGORY)

1,AuxS,tagHEADLINE,1,ord0,commentSun Oct 3 050228 2004 \SyntaxFS.pl 1.06\,x_id_ord1_1,x_commentALH20010911.0001_story Wed Jul 21 125109 2004 \MorphoFS.pl 1.09\\(
\??????,ExD,?????_1,N-------1R,????,ord1,x_id_ord1/1_12,x_lookupgyAb,giyAbu,absence/disappearance \def.nom.\\(
\???????,Atr,???????_2,Z---------,?????,ord3,x_id_ord1/3_6,x_lookupknEAn,kanoEAn,Kan'an\(
\?????,Atr,?????_2,Z---------,????,ord2,x_id_ord1/2_11,x_lookupfAd,fuAd,Fuad/Fouad\)))

<node id="0" rel="top" cat="top" begin="0" end="6">
  <node id="1" rel="--" cat="sv1" begin="0" end="5">
    <node id="2" rel="hd" pos="verb" begin="0" end="1" root="ben" word="Ben"/>
    <node id="3" rel="su" pos="noun" begin="1" end="2" root="je" word="je"/>
    <node id="4" rel="predc" cat="mwu" begin="2" end="5">
      <node id="5" rel="mwp" pos="adj" begin="2" end="3" root="op" word="op"/>
      <node id="6" rel="mwp" pos="adj" begin="3" end="4" root="de" word="de"/>
      <node id="7" rel="mwp" pos="adj" begin="4" end="5" root="hoogte" word="hoogte"/>
    </node>
  </node>
  <node id="8" rel="--" pos="punct" begin="5" end="6" root="?" word="?"/>
</node>
<sentence>Ben je op de hoogte ?</sentence>

<S No="2">
<W IX="1" LEM="" MORPH=" " IG='(1,"kurtul+Verb+Pos")(2,"Noun+Inf+A3sg+Pnon+Nom")' REL="2,1,(OBJECT)"> Kurtulmak </W>
<W IX="2" LEM="" MORPH=" " IG='(1,"iste+Verb+Neg+Prog1+A1sg")' REL="3,1,(SENTENCE)"> istemiyorum </W>
<W IX="3" LEM="" MORPH=" " IG='(1,".+Punc")' REL=",( )"> . </W>
</S>

SOURCE: CETEMPúblico n=22 sec=eco sem=92a
CP22-4 Mas se falhar?
A1
UTT:acl
=CO:conj-c('mas') Mas
=ADVL:fcl
==SUB:conj-s('se') se
==P:v-fin('falhar' FUT 3S SUBJ) falhar
?
- Head table
- acl: COM, PRD, P, leftmost non-punctuation
- fcl: P, PAUX,
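One way to read the head table is as a priority list per constituent category: pick the first child whose function matches an entry, otherwise fall back to the leftmost non-punctuation child. The sketch below applies that reading to the Portuguese example. The `Node` class, the function names, the priority-order interpretation and the root head index 0 are all assumptions made for illustration. The `fcl` entry uses only the two functions visible on the slide, since the rest of that row is cut off.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical encoding of the two head-table rows shown on the slide.
HEAD_TABLE = {"acl": ["COM", "PRD", "P"], "fcl": ["P", "PAUX"]}


@dataclass
class Node:
    func: Optional[str]          # function label, e.g. "P", "SUB"
    cat: str                     # constituent category or word class
    idx: int = 0                 # token position (leaves only)
    word: str = ""               # surface form (leaves only)
    children: List["Node"] = field(default_factory=list)


def is_punct(node):
    """A leaf whose form consists only of Unicode punctuation characters."""
    return (not node.children and node.word != ""
            and all(unicodedata.category(c).startswith("P") for c in node.word))


def head_child(node):
    """First child matching the table's priorities, else leftmost non-punctuation."""
    for func in HEAD_TABLE.get(node.cat, []):
        for child in node.children:
            if child.func == func:
                return child
    for child in node.children:
        if not is_punct(child):
            return child
    return node.children[0]


def lexical_head(node):
    """Follow head children down to a token."""
    while node.children:
        node = head_child(node)
    return node


def to_dependencies(node, head_idx=0, deps=None) -> Dict[int, int]:
    """Map each token index to the index of its head (0 = virtual root)."""
    if deps is None:
        deps = {}
    own = lexical_head(node)
    deps[own.idx] = head_idx
    if node.children:
        chosen = head_child(node)
        for child in node.children:
            to_dependencies(child, head_idx if child is chosen else own.idx, deps)
    return deps


# "Mas se falhar ?" as tokens 1..4
tree = Node("UTT", "acl", children=[
    Node("CO", "conj-c", 1, "Mas"),
    Node("ADVL", "fcl", children=[
        Node("SUB", "conj-s", 2, "se"),
        Node("P", "v-fin", 3, "falhar"),
    ]),
    Node(None, "punct", 4, "?"),
])

print(to_dependencies(tree))
```

Under this reading, "falhar" heads "se" because `fcl` prefers the P child, and "Mas" becomes the head of the utterance because no COM, PRD or P child exists at the `acl` level.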
14Data format training data and test data
- Training data
- Contains all columns
- Blind test data (given to participants)
- Contains only first six columns
- Participants predict HEAD and DEPREL
- Approximately 5000 scoring tokens for each
language
15Evaluation metric
- Official metric: Labelled attachment score (LAS)
- The percentage of scoring tokens for which the system predicted the correct HEAD and DEPREL value
- A token is non-scoring if all characters of the FORM value have the Unicode category property Punctuation
- E.g. . , ? ( -- _ ?_?
- Also computed, for error analysis and system comparison:
- Unlabelled attachment score (UAS)
- The percentage of scoring tokens for which the system predicted the correct HEAD value
- Label accuracy
- The percentage of scoring tokens for which the system predicted the correct DEPREL value
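The three scores can be sketched in a few lines. This is not the official evaluation script, only an illustration under the assumption that gold and predicted analyses are aligned lists of (FORM, HEAD, DEPREL) triples. The function names and the example predictions are made up.

```python
import unicodedata


def is_scoring(form):
    """A token is non-scoring if every character of FORM is Unicode punctuation."""
    return not all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, pred):
    """Return LAS, UAS and label accuracy as percentages over scoring tokens."""
    total = las = uas = la = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if not is_scoring(form):
            continue
        total += 1
        head_ok = g_head == p_head
        rel_ok = g_rel == p_rel
        uas += head_ok
        la += rel_ok
        las += head_ok and rel_ok
    if total == 0:
        return {"LAS": 0.0, "UAS": 0.0, "LA": 0.0}
    return {"LAS": 100.0 * las / total,
            "UAS": 100.0 * uas / total,
            "LA": 100.0 * la / total}


# Hypothetical gold and system output for "This is a test ."
gold = [("This", 2, "subj"), ("is", 0, "ROOT"), ("a", 4, "det"),
        ("test", 2, "comp"), (".", 2, "punc")]
pred = [("This", 2, "subj"), ("is", 0, "ROOT"), ("a", 2, "det"),
        ("test", 2, "obj"), (".", 0, "punc")]

print(attachment_scores(gold, pred))
```

The wrong head on the final "." does not affect any score, because that token is non-scoring. Of the four scoring tokens, two have both head and label right, three have the right head and three have the right label.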
16Parsing approaches
- Many different approaches!
- How to deal with non-projectivity
- Different machine learners
- perceptron, Maximum Entropy, SVM, MLE, ...
- Search
- deterministic, n-best, approximate, optimal, ...
- FORM versus LEMMA, CPOSTAG versus POSTAG
- Always use one, always use both, one or the other, ...
- FEATS
- Ignore, treat as atomic, split into components, cross-product, ...
- Unlabelled parsing (HEAD) versus labelling (DEPREL)
- Interleaved or separate step
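To make the FEATS options concrete, here is a small sketch of three of them. It assumes FEATS values are separated by `|` with `_` as the dummy value, as in the shared task's format description. The function names, the feature-string prefixes and the example values are hypothetical, and "cross-product" is interpreted here as pairing every head component with every child component, which is only one possible reading of the slide.

```python
def split_feats(feats):
    """Split a FEATS value into its components; the dummy value yields none."""
    return [] if feats == "_" else feats.split("|")


def feats_atomic(feats):
    """Treat the whole FEATS string as a single feature."""
    return [] if feats == "_" else ["FEATS=" + feats]


def feats_components(feats):
    """One feature per component."""
    return ["FEAT=" + f for f in split_feats(feats)]


def feats_cross_product(head_feats, child_feats):
    """One feature per (head component, child component) pair."""
    return ["HEADxCHILD=" + h + "&" + c
            for h in split_feats(head_feats)
            for c in split_feats(child_feats)]


print(feats_atomic("num=sg|gen=f"))
print(feats_components("num=sg|gen=f"))
print(feats_cross_product("num=sg|gen=f", "num=sg|case=nom"))
```

The atomic variant keeps the feature space small but cannot generalise across partial matches, while the component and cross-product variants trade more features for the chance to capture agreement between head and child.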
17Parsing approaches Parsing order
- Four clusters (1 long + several spotlight talks) of talks today
- All pairs (cluster 4) versus stepwise (clusters 1-3)
- Chart-parsing (most of cluster 3) versus classifier-based (1+2)
- Child-focused (cluster 1) versus direction-focused (cluster 2)