Title: 6th Intex Workshop
1 6th Intex Workshop: 10 years of Intex (Silberztein, 1993)
2 Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
- Cvetana Krstev, Duško Vitas
- University of Belgrade
- Tomaž Erjavec
- Jožef Stefan Institute, Ljubljana
3 Motivation
- general
- use of different tools
- use of multilingual resources
- comparison of results in NLP
- specific
- inclusion of the Serbian language in the MULTEXT-East specification and production of Slovenian Intex resources
- production of a tagged Serbian translation of Orwell's 1984
4 MULTEXT-East morphosyntactic specification
- aim
- exhaustive description of the morphological and morphosyntactic features of different languages, and establishment of unique codes for common features
- scope
- English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian
5 Fourteen MULTEXT-East types or PoS
- new types cannot be introduced
- Nouns (N)
- Verbs (V)
- Adjectives (A)
- Pronouns (P)
- Determiners (D)
- Adpositions (S)
- Conjunctions (C)
- Numerals (M)
- Interjections (I)
- Abbreviations (Y)
- Particles (Q)
- Adverbs (R)
- Articles (T)
- Residuals (X)
6 Type attributes
- Each type has a set of attributes that are appropriate to it
- Each type attribute has its own position in the MSD
- Adding new attributes to a type is not recommended
7 Attribute values
- a set of values is assigned to each attribute
- each value is coded by one alphanumeric character
- new values can be added to an attribute if necessary
- Types
- Verb attributes
- Adjective attributes
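The positional scheme described on the two slides above can be sketched in a few lines of Python. This is only an illustration, not part of the MULTEXT-East tooling: the function name `build_msd` is hypothetical, and the use of a hyphen for an unset position and the dropping of trailing hyphens are assumptions read off tags such as Afcmsa--n. The adjective positions used in the example (1 Type, 2 Degree, 3 Gender, 4 Number, 5 Case, 8 Animate) come from the adjective tables in this deck.

```python
def build_msd(type_code, values, n_positions):
    """Assemble a positional MSD string (illustrative sketch).

    type_code   -- one-letter type (PoS) code, e.g. "A" for Adjective
    values      -- dict: 1-based attribute position -> one-character value code
    n_positions -- number of attribute positions defined for the type
    """
    slots = ["-"] * n_positions  # assumption: "-" marks a position with no value
    for position, code in values.items():
        if not 1 <= position <= n_positions:
            raise ValueError(f"position {position} is out of range")
        if len(code) != 1:
            raise ValueError("each value is coded by exactly one character")
        slots[position - 1] = code
    # Assumption: trailing unset positions are not written out.
    return type_code + "".join(slots).rstrip("-")


# Qualificative, comparative, masculine, singular, accusative, not animate.
tag = build_msd("A", {1: "f", 2: "c", 3: "m", 4: "s", 5: "a", 8: "n"}, 13)
print(tag)  # Afcmsa--n
```

Because every attribute owns a fixed position, two tags can be compared attribute by attribute without parsing anything beyond character offsets.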
8 Adjective attribute values/1
- Adjective (A)
- 13 positions
- EN RO SL CS BG ET HU HR SR
- P ATT VAL C x x x x x x x x x
- 1 Type qualificative f x x x x x x x
- 1 Type indefinite i
- 1 Type possessive s x x x x
- 1 Type ordinal o x x
- 2 Degree positive p x x x x x x x x
- 2 Degree comparative c x x x x x x x x
- 2 Degree superlative s x x x x x x x x
- 2 Degree elative e x x
9 Adjective attribute values/2
- EN RO SL CS BG ET HU HR SR
- P ATT VAL C x x x x x x x x x
- 3 Gender masculine m x x x x x x
- 3 Gender feminine f x x x x x x
- 3 Gender neuter n x x x x x x
- 4 Number singular s x x x x x x x x
- 4 Number plural p x x x x x x x x
- 4 Number dual d x x
- 4 Number paucal c x
- 5 Case nominative n x x x x x x
- 5 Case genitive g x x x x x x
- 5 Case dative d x x x x x
- 5 Case accusative a x x x x x
- ... (various more values) ...
10 Adjective attribute values/3
- EN RO SL CS BG ET HU HR SR
- 6 Definiteness no n x x x x x
- 6 Definiteness yes y x x x x x
- 6 Definiteness short_art s x
- 6 Definiteness full_art f x
- 7 Clitic no n x
- 7 Clitic yes y x
- 8 Animate no n x x x x x
- 8 Animate yes y x x x x x
- 9 Formation nominal n x
- 9 Formation compound c x
- ... various Hungarian-specific attributes ...
11 An example from the Slovenian MULTEXT-East dictionary
- čistejši čist Afcfda
- the lemma čist (Engl. clean) corresponds to the simple word form čistejši, which is tagged as a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).
- čistejši čist Afcmsa--n
- the lemma čist (Engl. clean) corresponds to the simple word form čistejši, which is tagged as a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
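The reading of the two tags above can be reproduced mechanically. The following Python sketch decodes an adjective MSD using only the positions and value codes listed on the adjective slides; the names `ADJECTIVE_ATTRIBUTES` and `decode_adjective` are hypothetical. The Case table is incomplete here because the slide elides further values, and the Hungarian-specific positions are left out for the same reason.

```python
# Positions 1-9 of the adjective MSD, as listed on the adjective slides.
ADJECTIVE_ATTRIBUTES = [
    ("Type", {"f": "qualificative", "i": "indefinite", "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative", "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    # Only the four case values shown on the slide; the rest are elided there.
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes", "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
    ("Formation", {"n": "nominal", "c": "compound"}),
]


def decode_adjective(msd):
    """Map an adjective MSD such as 'Afcmsa--n' to {attribute: value}."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD")
    decoded = {}
    # zip stops at the shorter sequence, so short tags need no padding.
    for (name, values), code in zip(ADJECTIVE_ATTRIBUTES, msd[1:]):
        if code == "-":  # position not applicable or not specified
            continue
        decoded[name] = values.get(code, code)  # unknown codes are kept as-is
    return decoded
```

Under these assumptions, `decode_adjective("Afcfda")` yields qualificative, comparative, feminine, dual, accusative, and `decode_adjective("Afcmsa--n")` additionally reports `Animate: no`.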
12 The first sentence of the Slovene translation of Orwell's 1984, tagged
- <w lemma="biti" ana="Vcps-sma">Bil</w>
- <w lemma="biti" ana="Vcip3s--n">je</w>
- <w lemma="jasen" ana="Afpmsnn">jasen</w>
- <c>,</c>
- <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
- <w lemma="aprilski" ana="Aopmsn">aprilski</w>
- <w lemma="dan" ana="Ncmsn">dan</w>
- <w lemma="in" ana="Ccs">in</w>
- <w lemma="ura" ana="Ncfpn">ure</w>
- <w lemma="biti" ana="Vcip3p--n">so</w>
- <w lemma="biti" ana="Vmps-pfa">bile</w>
- <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
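A tagged sentence of this kind can be read with the Python standard library. The sketch below assumes that word tokens are `w` elements carrying `lemma` and `ana` attributes and that punctuation is a `c` element, as on the slide. The enclosing `s` element and the function name `read_tokens` are assumptions made only so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# First tokens of the sentence; the <s> wrapper is an assumption.
SAMPLE = """<s>
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples; punctuation has no lemma or MSD."""
    root = ET.fromstring(xml_text)
    tokens = []
    for element in root:
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(form, lemma, msd)
```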
13 Intex MSD for Serbian
- one DELAS entry: cyist,A17
- one of its corresponding DELAF entries: cyistiji,cyist.A17bems1gbems4qbems5gbemp1gbemp5g
- produced by the regular expression A17.exp
- ..............
- ijemu/bems3gbems7gbens3gbens7g
- iji/bems1gbems4qbems5gbemp1gbemp5g
- o/aens1gaens4gaens5g
- ..............
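The relation between the DELAS entry and its DELAF forms can be illustrated with a rough Python sketch. It is not the Intex inflection mechanism: it assumes that each line of A17.exp shown above is a plain suffix followed by "/" and a run of codes, that the suffix is simply appended to the lemma, and that every code is six characters wide (as all the codes on the slide happen to be). The function name `inflect` and the `code_width` parameter are hypothetical.

```python
def inflect(lemma, suffix_rules, code_width=6):
    """Expand suffix/code rules into (form, lemma, [codes]) triples."""
    entries = []
    for rule in suffix_rules:
        suffix, codes = rule.split("/", 1)
        # Assumption: codes are fixed-width, so the run can be cut evenly.
        code_list = [codes[i:i + code_width] for i in range(0, len(codes), code_width)]
        entries.append((lemma + suffix, lemma, code_list))
    return entries


# The three rules visible on the slide; the elided ones are not guessed.
RULES = [
    "ijemu/bems3gbems7gbens3gbens7g",
    "iji/bems1gbems4qbems5gbemp1gbemp5g",
    "o/aens1gaens4gaens5g",
]

entries = inflect("cyist", RULES)
for form, lemma, codes in entries:
    print(form, lemma, codes)
```

With these rules the second triple is the form cyistiji with the five codes listed in the DELAF entry above.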
14 Attributes and their values for Serbian adjectives in DELAS/DELAF
15 Syntactic and semantic marks in Serbian DELAS
16 Problems of correspondence between MULTEXT-East MSD and Intex/1
- The need to force the existing coding scheme onto a particular language
- Example: how to encode the active present and past gerunds?
- In Serbian, for the verb ići (Engl. to go), those gerunds are idući and išavši
- There are attributes in the verb tables of the MULTEXT-East specification that describe them. However, no Slavic language except Bulgarian uses them.
17 Problems/2
- a common encoding scheme does not guarantee that true standardization will be achieved
- Example
- only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'); the other Slavic languages, at least, could make use of that value of the attribute Type.
18 Problems/3
- Encoding of verb tenses
- EN RO SL CS BG ET HU HR SR
- P ATT VAL C x x x x x x x x x
- 2 VForm indicative i x x x x x x x x x
- 2 VForm subjunctive s x
- 2 VForm imperative m x x x x x x x x
- 2 VForm conditional c x x x x x x x
- 2 VForm infinitive n x x x x x x x x
- 2 VForm participle p x x x x x x x x
- 2 VForm gerund g x x x
- 2 VForm supine u x x
- 2 VForm transgressive t x
- 2 VForm quotative q x
- 3 Tense present p x x x x x x x x x
- 3 Tense imperfect i x x x x x
- 3 Tense future f x x x x
- 3 Tense past s x x x x x x x x x
19 Problems/3
- The second attribute specifies the verb form and the third the tense. However, because of composite tenses, some verb forms are used in the construction of several tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have); it produces the perfect tense when used with the present indicative form of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula.
20 Problems/3
- <w lemma="Winston" ana="Npmsn">Winston</w>
- <w lemma="Smith" ana="Npmsn">Smith</w>
- <w lemma="biti" ana="Vcip3s--n">je</w>
- <w lemma="imeti" ana="Vmps-sma">imel</w>
- ..........................................
- <w lemma="da" ana="Css">da</w>
- <w lemma="biti" ana="Vcc">bi</w>
- <w lemma="on" ana="Pp3msa--y-n">ga</w>
- <w lemma="imeti" ana="Vmps-sma">imel</w>
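The point of the two excerpts is that the same participle tag (Vmps-sma) takes part in two different composite tenses, so the tense can only be recovered by looking at the accompanying copula. A minimal Python sketch of that inference follows. The function name `composite_tense` is hypothetical, the rule covers only the two constructions described above, and it assumes that Type code `c` marks the copula, as the tags for biti suggest.

```python
def composite_tense(aux_msd, participle_msd):
    """Classify copula + past participle as 'perfect' or 'conditional'.

    Verb MSD layout used here: index 0 is 'V', index 1 Type,
    index 2 VForm, index 3 Tense. Returns None for any other pairing.
    """
    def field(msd, index):
        return msd[index] if len(msd) > index else "-"

    is_copula = aux_msd.startswith("Vc")  # assumption: Type 'c' = copula
    is_past_participle = (
        participle_msd.startswith("V")
        and field(participle_msd, 2) == "p"   # VForm: participle
        and field(participle_msd, 3) == "s"   # Tense: past
    )
    if not (is_copula and is_past_participle):
        return None
    if field(aux_msd, 2) == "i" and field(aux_msd, 3) == "p":
        return "perfect"       # present indicative of biti + participle
    if field(aux_msd, 2) == "c":
        return "conditional"   # conditional of biti + participle
    return None


print(composite_tense("Vcip3s--n", "Vmps-sma"))  # je ... imel
print(composite_tense("Vcc", "Vmps-sma"))        # bi ... imel
```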
21 Problems/4
- different interpretations of various grammatical categories across languages and the lack of a clear cross-linguistic correspondence are discussed in Przepiórkowski (EACL 2003), for example the dual number in Slovene and the paucal in Serbian.
- certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).
22 Application of the MSD-Intex mapping to Serbian 1984
- SBio,biti.V77Gsm
- (je,jesam.V575ImperfItIrefAuxPzsi je,on.PROPrssz2fisz4fi)
- vedar,.A18akms1gakms4q
- (i,.CONJ i,.PAR)
- hladan,.A18akms1gakms4q
- aprilski,.A2PosQadms1gaems4qaems5gaemp1gaemp5g
- (dan,.A1PPakms1gaems4q dan,dati.V103PerfTrIrefRefTms)
-
- S
- (na,.PREPp4 na,.PREPp7)
- cyasovnicima,.?
- (je,jesam.V575ImperfItIrefAuxPzsi je,on.PROPrssz2fisz4fi)
- izbijalo,izbijati.V101PerfTrItIrefGsn
- trinaest,.?
- .
23 A tool that facilitates lemmatization and disambiguation
24 Tagged Serbian translation of 1984 after hand disambiguation and resolving of unknown words
- SBio,biti.V77Gsm
- je,jesam.V575ImperfItIrefAuxPzsi
- vedar,.A18akms1g
- (i,.CONJ)
- hladan,.A18akms1g
- aprilski,.A2PosQadms1g
- dan,.N1ms1q
-
- S
- na,.PREPp7
- cyasovnicima,cyasovnik.N5mp7q
- je,jesam.V575ImperfItIrefAuxPzsi
- izbijalo,izbijati.V101PerfTrItIrefGsn
- trinaest,.NumCar
- .
25 A simple Perl script maps Serbian Intex codes to MULTEXT-East MSDs
if (($POS eq "V") && ($kategorije !~ /XS/)) {      # it is a verb
    $glagol = "V" . "---------------";
    if ($semkat =~ /Aux/) {                         # type, attribute 1
        substr($glagol, 1, 1) = "a";
    } else {
        substr($glagol, 1, 1) = "m";
    }
    if ($kategorije =~ /([WYGTIFA])/) {             # form, attribute 2
        substr($glagol, 2, 1) = $1;
        $glagol =~ tr/WYGTIFA/nmppiii/;
    }
    if (($lema eq "biti") && ($kategorije =~ /A/)) {
        substr($glagol, 2, 1) = "c";
    }
    if ($kategorije =~ /([PIFAGY])/) {              # tense, attribute 3
        substr($glagol, 3, 1) = $1;
        $glagol =~ tr/PIFAGY/pofasp/;
    }
    if ($kategorije =~ /([xyz])/) {                 # "broj" (number) in the source comment, attribute 4
        substr($glagol, 4, 1) = $1;
        $glagol =~ tr/xyz/123/;
    }
    # ........
}
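The steps of the fragment above can be traced on the verbs of the sample sentence with a short Python rendering. This is a sketch, not the authors' tool: the function name `verb_msd_prefix` is hypothetical, only the first four attributes are covered because the rest of the script is elided, and the letter tables are copied as printed in the fragment. It also assumes that the inflectional codes of Bio and je are `Gsm` and `Pzsi`, and that the semantic marks of je contain `Aux`, as the Intex entries shown earlier suggest.

```python
import re

# Letter tables copied from the three tr/// operations of the fragment.
FORM_MAP = dict(zip("WYGTIFA", "nmppiii"))   # Intex letter -> MSD VForm
TENSE_MAP = dict(zip("PIFAGY", "pofasp"))    # Intex letter -> MSD Tense
FOURTH_MAP = dict(zip("xyz", "123"))         # x/y/z -> 1/2/3, attribute 4


def verb_msd_prefix(kategorije, semkat, lema):
    """Return 'V' plus the first four attribute codes of a verb MSD."""
    msd = ["V", "-", "-", "-", "-"]
    msd[1] = "a" if "Aux" in semkat else "m"       # attribute 1: type
    form = re.search(r"[WYGTIFA]", kategorije)     # attribute 2: form
    if form:
        msd[2] = FORM_MAP[form.group()]
    if lema == "biti" and "A" in kategorije:       # special case in the fragment
        msd[2] = "c"
    tense = re.search(r"[PIFAGY]", kategorije)     # attribute 3: tense
    if tense:
        msd[3] = TENSE_MAP[tense.group()]
    fourth = re.search(r"[xyz]", kategorije)       # attribute 4
    if fourth:
        msd[4] = FOURTH_MAP[fourth.group()]
    return "".join(msd)


print(verb_msd_prefix("Gsm", "", "biti"))                    # Bio
print(verb_msd_prefix("Pzsi", "ImperfItIrefAux", "jesam"))   # je
```

Under these assumptions the two calls give `Vmps-` and `Va-p3`, which agree with the first five characters of the MSDs assigned to Bio and je in the converted text.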
26 Tagged Serbian 1984 using MULTEXT-East MSD
- <w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
- <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
- <w lemma="vedar" ana="Afpms1n">vedar</w>
- <w lemma="i" ana="Ccs">i</w>
- <w lemma="hladan" ana="Afpms1n">hladan</w>
- <w lemma="aprilski" ana="Aopms1y">aprilski</w>
- <w lemma="dan" ana="Ncmsn--n">dan</w>
- <w lemma="na" ana="Sps-">na</w>
- <w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
- <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
- <w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
- <w lemma="trinaest" ana="Mc---l">trinaest</w>
27 Conclusion
- It is possible to convert from Intex to MULTEXT-East
- It is possible to convert from MULTEXT-East to Intex to a certain extent. Some information cannot be recovered, such as the inflectional class code
28 Noun attributes
- Type
- Gender
- Number
- Case
- Definiteness
- Clitic
- Animate
- Owner_Number
- Owner_Person
- Owned_Number
29 Verb attributes
- Type
- VForm
- Tense
- Person
- Number
- Gender
- Voice
- Negative
- Definiteness
- Clitic
- Case
- Animate
- Clitic_s
- Aspect
30 Adjective attributes
- Type
- Degree
- Gender
- Number
- Case
- Definiteness
- Clitic
- Animate
- Formation
- Owner_Number
- Owner_Person
- Owned_Number
31 Adverb attributes
- Type
- Degree
- Clitic
- Number
- Person
- Wh_Type
32 Values of the attribute VForm of the type Verb
- indicative (i)
- subjunctive (s)
- imperative (m)
- conditional (c)
- infinitive (n)
- participle (p)
- gerund (g)
- supine (u)
- transgressive (t)
- quotative (q)
33 Values of the attribute Tense of the type Verb
- present (p)
- imperfect (i)
- future (f)
- past (s)
- pluperfect (l)
- aorist (a)