6th Intex Workshop

1
6th Intex Workshop 10 years of (Silberztein,
1993)
  • Sofia, 28-30 May 2003

2
Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
  • Cvetana Krstev, Duško Vitas
  • University of Belgrade
  • Tomaž Erjavec
  • Jožef Stefan Institute, Ljubljana

3
Motivation
  • general
      • use of different tools
      • use of multilingual resources
      • comparison of results in NLP
  • specific
      • inclusion of the Serbian language in the MULTEXT-East
        specification and production of Slovenian Intex
        resources
      • production of a tagged Serbian translation of
        Orwell's 1984

4
MULTEXT-East morphosyntactic specification
  • aim
      • exhaustive description of the morphological and
        morphosyntactic features of different languages
        and establishment of unique codes for common
        features
  • scope
      • English, Romanian, Slovene, Czech, Bulgarian,
        Estonian, Hungarian, Croatian (Concede), and
        Serbian

5
14 MULTEXT-East types or PoS - new types cannot
be introduced
  • Nouns (N)
  • Verbs (V)
  • Adjectives (A)
  • Pronouns (P)
  • Determiners (D)
  • Adpositions (S)
  • Conjunctions (C)
  • Numerals (M)
  • Interjections (I)
  • Abbreviations (Y)
  • Particles (Q)
  • Adverbs (R)
  • Articles (T)
  • Residuals (X)

6
Type attributes
  • Each type has a set of attributes that are
    appropriate to it
  • Each type attribute has its position in MSD
    description
  • It is not recommended to add new attributes to a
    type

7
Attribute values
  • a set of values is assigned to each attribute
  • each value is coded by one alphanumeric character
  • new values can be added to an attribute if
    necessary

8
Adjective attribute values/1
  • Adjective (A)
  • 13 positions
  • Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

    P  ATT     VAL            C  x x x x x x x x x
    -  ------  -------------  -  -----------------
    1  Type    qualificative  f  x x x x x x x
               indefinite     i
               possessive     s  x x x x
               ordinal        o  x x
    -  ------  -------------  -  -----------------
    2  Degree  positive       p  x x x x x x x x
               comparative    c  x x x x x x x x
               superlative    s  x x x x x x x x
               elative        e  x x
    -  ------  -------------  -  -----------------

9
Adjective attribute values/2
  • Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

    P  ATT     VAL         C  x x x x x x x x x
    -  ------  ----------  -  -----------------
    3  Gender  masculine   m  x x x x x x
               feminine    f  x x x x x x
               neuter      n  x x x x x x
    -  ------  ----------  -  -----------------
    4  Number  singular    s  x x x x x x x x
               plural      p  x x x x x x x x
               dual        d  x x
               paucal      c  x
    -  ------  ----------  -  -----------------
    5  Case    nominative  n  x x x x x x
               genitive    g  x x x x x x
               dative      d  x x x x x
               accusative  a  x x x x x
               ... (various more values) ...

10
Adjective attribute values/3
  • Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

    P  ATT           VAL        C  languages
    -  ------------  ---------  -  ---------
    6  Definiteness  no         n  x x x x x
                     yes        y  x x x x x
                     short_art  s  x
                     full_art   f  x
    -  ------------  ---------  -  ---------
    7  Clitic        no         n  x
                     yes        y  x
    -  ------------  ---------  -  ---------
    8  Animate       no         n  x x x x x
                     yes        y  x x x x x
    -  ------------  ---------  -  ---------
    9  Formation     nominal    n  x
                     compound   c  x
    -  ------------  ---------  -  ---------
    ... various Hungarian-specific attributes ...

11
An example from the Slovenian MULTEXT-East
dictionary
  • čistejši čist Afcfda
  • The lemma čist (Engl. clean) corresponds to the
    simple word form čistejši, which is described as a
    qualificative (f) adjective (A) in the comparative
    degree (c), feminine gender (f), dual number (d),
    and accusative case (a).
  • čistejši čist Afcmsa--n
  • The lemma čist (Engl. clean) corresponds to the
    simple word form čistejši, which is described as a
    qualificative (f) adjective (A) in the comparative
    degree (c), masculine gender (m), singular number (s),
    accusative case (a), and not animate (n).
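
The positional reading of an MSD described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_adjective` and the table `ADJ_POSITIONS` are hypothetical, the value codes are copied from the adjective tables on the preceding slides, and only the first eight attribute positions are covered. Codes not listed on the slides (for example the further case values) are returned unchanged.

```python
# Attribute positions 1-8 for adjectives, with the value codes shown on the slides.
ADJ_POSITIONS = [
    ("Type", {"f": "qualificative", "i": "indefinite", "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative", "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes", "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_adjective(msd):
    """Turn an adjective MSD such as 'Afcfda' into attribute/value pairs."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    features = {}
    # Character i of the MSD (after the type letter) belongs to attribute i.
    for (name, values), code in zip(ADJ_POSITIONS, msd[1:]):
        if code != "-":  # '-' marks a position that is not used
            features[name] = values.get(code, code)
    return features


print(decode_adjective("Afcfda"))
print(decode_adjective("Afcmsa--n"))
```

The second call skips the two unused positions (Definiteness and Clitic) and reports `Animate` as `no`, which matches the prose reading of `Afcmsa--n` on this slide.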

12
The first sentence of the Slovene translation of
Orwell's 1984 tagged
  • <w lemma="biti" ana="Vcps-sma">Bil</w>
  • <w lemma="biti" ana="Vcip3s--n">je</w>
  • <w lemma="jasen" ana="Afpmsnn">jasen</w>
  • <c>,</c>
  • <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
  • <w lemma="aprilski" ana="Aopmsn">aprilski</w>
  • <w lemma="dan" ana="Ncmsn">dan</w>
  • <w lemma="in" ana="Ccs">in</w>
  • <w lemma="ura" ana="Ncfpn">ure</w>
  • <w lemma="biti" ana="Vcip3p--n">so</w>
  • <w lemma="biti" ana="Vmps-pfa">bile</w>
  • <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
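
A tagged sentence of this kind can be read with a standard XML parser. The sketch below assumes that each word is an element of the form `<w lemma="..." ana="...">form</w>` and each punctuation mark a `<c>` element. The wrapping `<s>` element and the helper name `read_tokens` are assumptions made here to obtain a well-formed, self-contained example; they are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# First tokens of the sentence, wrapped in a hypothetical <s> element.
SAMPLE = """<s>
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples; punctuation has no lemma or MSD."""
    tokens = []
    for element in ET.fromstring(xml_text):
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(form, lemma, msd)
```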

13
Intex MSD for Serbian
  • one DELAS entry: cyist,A17
  • one of its corresponding DELAF entries:
    cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
  • produced by the regular expression A17.exp
  • ..............
  • ijemu/bems3g:bems7g:bens3g:bens7g
  • iji/bems1g:bems4q:bems5g:bemp1g:bemp5g
  • o/aens1g:aens4g:aens5g
  • ..............
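
A DELAF entry bundles several analyses of one word form. The sketch below splits such an entry into its parts, assuming the usual DELAF layout `form,lemma.CODE+mark+mark:infl:infl`, in which inflectional codes are separated by colons, syntactic and semantic marks are attached with `+`, and an empty lemma means that the lemma equals the form. The function name `parse_delaf` and the keys of the returned dictionary are hypothetical.

```python
def parse_delaf(entry):
    """Split a DELAF entry into form, lemma, class code, marks and inflectional codes."""
    form, rest = entry.split(",", 1)
    lemma, codes = rest.split(".", 1)
    grammatical, *inflections = codes.split(":")
    class_code, *marks = grammatical.split("+")
    return {
        "form": form,
        "lemma": lemma or form,  # an empty lemma field stands for the form itself
        "class": class_code,
        "marks": marks,
        "inflections": inflections,
    }


entry = parse_delaf("cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g")
print(entry["lemma"], entry["class"], entry["inflections"])
```

Each element of `inflections` is one candidate analysis of the form, which is why a single DELAF line can later give rise to several MSDs.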

14
Attributes and their values for Serbian
adjectives in DELAS/DELAF

15
Syntactic and semantic marks in Serbian DELAS

16
Problems of correspondence between MULTEXT-East
MSD and Intex/1
  • The need to impose the existing coding scheme on
    a particular language
  • Example: how to encode the present and past active
    gerunds?
  • In Serbian, for the verb ići (Engl. to go) these
    gerunds are idući and išavši
  • There are attributes in the verb tables of the
    MULTEXT-East specification that describe them.
    However, no Slavic language except Bulgarian
    uses them.

17
Problems/2
  • a common encoding scheme does not guarantee
    that true standardization is achieved
  • Example
  • only in Bulgarian do we find the attribute value
    'adjectival' for adverbs (with the examples
    'umno, veselo, studeno'); other Slavic
    languages, at least, could also make use of that
    value of the attribute Type.

18
Problems/3
  • Encoding of verb tenses
  • Encoding of verb tenses
  • Languages: EN RO SL CS BG ET HU HR SR (one x per language that uses the value)

    P  ATT    VAL            C  x x x x x x x x x
    -  -----  -------------  -  -----------------
    2  VForm  indicative     i  x x x x x x x x x
              subjunctive    s  x
              imperative     m  x x x x x x x x
              conditional    c  x x x x x x x
              infinitive     n  x x x x x x x x
              participle     p  x x x x x x x x
              gerund         g  x x x
              supine         u  x x
              transgressive  t  x
              quotative      q  x
    -  -----  -------------  -  -----------------
    3  Tense  present        p  x x x x x x x x x
              imperfect      i  x x x x x
              future         f  x x x x
              past           s  x x x x x x x x x

19
Problems/3
  • The second attribute specifies the verb form, and the
    third the tense. However, because of composite
    tenses, some verb forms are used in the
    construction of several tenses. In Slovenian,
    the verb form imel is the past participle of the verb
    imeti (Engl. to have). It produces the
    perfect tense when used with the present indicative
    of the copula verb biti (Engl. to be), and the
    conditional when used with the
    conditional form of the same copula verb.

20
Problems/3
  • <w lemma="Winston" ana="Npmsn">Winston</w>
  • <w lemma="Smith" ana="Npmsn">Smith</w>
  • <w lemma="biti" ana="Vcip3s--n">je</w>
  • <w lemma="imeti" ana="Vmps-sma">imel</w>
  • ..........................................
  • <w lemma="da" ana="Css">da</w>
  • <w lemma="biti" ana="Vcc">bi</w>
  • <w lemma="on" ana="Pp3msa--y-n">ga</w>
  • <w lemma="imeti" ana="Vmps-sma">imel</w>
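
The point of this example is that the same participle MSD (Vmps-sma) takes part in two different composite tenses, so the tense can only be read off the auxiliary. A minimal sketch of that lookup follows. It assumes that position 2 of a verb MSD is VForm and position 3 is Tense, as in the verb table above; the function name `compound_tense` and the returned labels are hypothetical, and only the two Slovenian constructions discussed here are handled.

```python
def compound_tense(aux_msd, participle_msd):
    """Name the composite tense formed by a copula form and a past participle."""

    def attribute(msd, position):
        # Positions beyond the end of a short MSD count as unused ('-').
        return msd[position] if len(msd) > position else "-"

    if not (aux_msd.startswith("V") and participle_msd.startswith("V")):
        return None
    # The second element must be a past participle: VForm 'p', Tense 's'.
    if attribute(participle_msd, 2) != "p" or attribute(participle_msd, 3) != "s":
        return None
    vform, tense = attribute(aux_msd, 2), attribute(aux_msd, 3)
    if vform == "i" and tense == "p":  # present indicative of the copula
        return "perfect"
    if vform == "c":  # conditional form of the copula
        return "conditional"
    return None


print(compound_tense("Vcip3s--n", "Vmps-sma"))  # je imel
print(compound_tense("Vcc", "Vmps-sma"))        # bi ... imel
```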

21
Problems/4
  • different interpretations of various grammatical
    categories across languages and the lack of a clear
    cross-linguistic correspondence are discussed in
    Przepiórkowski (EACL 2003), for example the dual
    number in Slovene and the paucal in Serbian.
  • certain morphosyntactic phenomena have not been
    taken into consideration, such as various problems of
    agreement (Vitas, Krstev, to appear).

22
Application of the MSD-Intex mapping to the Serbian 1984
  • {S}
  • Bio,biti.V77:Gsm
  • (je,jesam.V575+Imperf+It+Iref+Aux:Pzsi
    je,on.PRO+Prs:sz2fi:sz4fi)
  • vedar,.A18:akms1g:akms4q
  • (i,.CONJ i,.PAR)
  • hladan,.A18:akms1g:akms4q
  • aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g
  • (dan,.A1+PP:akms1g:aems4q
    dan,dati.V103+Perf+Tr+Iref+Ref:Tms)
  • {S}
  • (na,.PREP+p4 na,.PREP+p7)
  • cyasovnicima,.?
  • (je,jesam.V575+Imperf+It+Iref+Aux:Pzsi
    je,on.PRO+Prs:sz2fi:sz4fi)
  • izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn
  • trinaest,.?
  • .

23
Tool that facilitates lemmatization and
disambiguation

24
Tagged Serbian translation of 1984 after hand
disambiguation and resolving of unknown words
  • {S}
  • Bio,biti.V77:Gsm
  • je,jesam.V575+Imperf+It+Iref+Aux:Pzsi
  • vedar,.A18:akms1g
  • (i,.CONJ)
  • hladan,.A18:akms1g
  • aprilski,.A2+PosQ:adms1g
  • dan,.N1:ms1q
  • {S}
  • na,.PREP+p7
  • cyasovnicima,cyasovnik.N5:mp7q
  • je,jesam.V575+Imperf+It+Iref+Aux:Pzsi
  • izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn
  • trinaest,.NumCar
  • .

25
A simple Perl script maps Serbian Intex codes to
MULTEXT-East MSDs
    if (($POS eq "V") && ($kategorije !~ /XS/)) {       # it is a verb
        $glagol = "V" . "---------------";
        if ($semkat =~ /Aux/) {                         # type, attribute 1
            substr($glagol,1,1) = "a";
        } else {
            substr($glagol,1,1) = "m";
        }
        if ($kategorije =~ /([WYGTIFA])/) {             # form, attribute 2
            substr($glagol,2,1) = $1;
            $glagol =~ tr/WYGTIFA/nmppiii/;
            if (($lema eq "biti") && ($kategorije =~ /A/)) {
                substr($glagol,2,1) = "c";
            }
        }
        if ($kategorije =~ /([PIFAGY])/) {              # tense, attribute 3
            substr($glagol,3,1) = $1;
            $glagol =~ tr/PIFAGY/pofasp/;
        }
        if ($kategorije =~ /([xyz])/) {                 # number, attribute 4
            substr($glagol,4,1) = $1;
            $glagol =~ tr/xyz/123/;
        }
        ........
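
For readers who prefer Python, the verb branch of the script can be sketched as follows. This is a reading of the slide, not the authors' code: the operators and character classes were lost in the transcript and are assumed here, the variable names (`glagol`, `kategorije`, `semkat`, `lema`) are kept from the slide, and the function name `verb_msd` is hypothetical. Only the four attributes visible on the slide are filled in; the elided rest of the script is not reconstructed.

```python
import re


def verb_msd(pos, kategorije, semkat, lema):
    """Map an Intex verb analysis to the first positions of a MULTEXT-East MSD."""
    if pos != "V" or "XS" in kategorije:
        return None
    glagol = ["V"] + ["-"] * 15  # type letter plus 15 unused positions

    # Attribute 1 (Type): auxiliary or main verb.
    glagol[1] = "a" if "Aux" in semkat else "m"

    # Attribute 2 (VForm): first form letter found in the inflectional code.
    form = re.search(r"[WYGTIFA]", kategorije)
    if form:
        glagol[2] = form.group().translate(str.maketrans("WYGTIFA", "nmppiii"))
        if lema == "biti" and "A" in kategorije:
            glagol[2] = "c"  # this form of 'biti' is coded as conditional

    # Attribute 3 (Tense).
    tense = re.search(r"[PIFAGY]", kategorije)
    if tense:
        glagol[3] = tense.group().translate(str.maketrans("PIFAGY", "pofasp"))

    # Attribute 4 (Person on the verb attribute slide; the script's comment reads 'broj').
    person = re.search(r"[xyz]", kategorije)
    if person:
        glagol[4] = person.group().translate(str.maketrans("xyz", "123"))

    return "".join(glagol)


print(verb_msd("V", "Gsm", "", "biti"))
print(verb_msd("V", "Pzsi", "Imperf+It+Iref+Aux", "jesam"))
```

Under these assumptions the two calls yield MSDs beginning with `Vmps` and `Va-p3`, which agree with the first positions of the MSDs given for Bio and je in the tagged Serbian text.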

26
Tagged Serbian 1984 using MULTEXT-East MSD
  • <w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
  • <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
  • <w lemma="vedar" ana="Afpms1n">vedar</w>
  • <w lemma="i" ana="Ccs">i</w>
  • <w lemma="hladan" ana="Afpms1n">hladan</w>
  • <w lemma="aprilski" ana="Aopms1y">aprilski</w>
  • <w lemma="dan" ana="Ncmsn--n">dan</w>
  • <w lemma="na" ana="Sps-">na</w>
  • <w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
  • <w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
  • <w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
  • <w lemma="trinaest" ana="Mc---l">trinaest</w>

27
Conclusion
  • It is possible to convert from Intex to
    MULTEXT-East
  • It is possible to convert from MULTEXT-East to
    Intex to a certain extent. Some information cannot
    be recovered, such as the inflectional class code
28
Noun attributes
  • Type
  • Gender
  • Number
  • Case
  • Definiteness
  • Clitic
  • Animate
  • Owner_Number
  • Owner_Person
  • Owned_Number

29
Verb Attributes
  • Type
  • VForm
  • Tense
  • Person
  • Number
  • Gender
  • Voice
  • Negative
  • Definiteness
  • Clitic
  • Case
  • Animate
  • Clitic_s
  • Aspect

30
Adjective attributes
  • Type
  • Degree
  • Gender
  • Number
  • Case
  • Definiteness
  • Clitic
  • Animate
  • Formation
  • Owner_Number
  • Owner_Person
  • Owned_Number

31
Adverb attributes
  • Type
  • Degree
  • Clitic
  • Number
  • Person
  • Wh_Type

32
Values of the attribute VForm of the type Verb
  • indicative (i)
  • subjunctive (s)
  • imperative (m)
  • conditional (c)
  • infinitive (n)
  • participle (p)
  • gerund (g)
  • supine (u)
  • transgressive (t)
  • quotative (q)

33
Values of the attribute Tense of the type Verb
  • present (p)
  • imperfect (i)
  • future (f)
  • past (s)
  • pluperfect (l)
  • aorist (a)