Writing Lexical Transducers Using xfst - PowerPoint PPT Presentation

Provided by: anneschi

1
Writing Lexical Transducers Using xfst
  • Overview of Transduction
  • Review of xfst Rules
  • Creating Two-Level Lexicons
  • Putting it All Together

2
Theory-Neutral Morphological Analysis
[Diagram: a black-box morphological analyzer sitting between Analyses (top) and Words (bottom)]
3
Finite-State Transducers (FSTs)
  • An FST encodes a Regular Relation, i.e. a
    relation between two regular languages.
  • FSTs can be used for morphological analysis, if
  • The set of surface words (strings) to be analyzed
    is a regular language, and
  • The analyses are also defined to be a regular
    language, i.e. just another set of strings

[Diagram: an FST sitting between the Analysis String Language (top) and the Surface String Language (bottom)]
4
What Do the Two Languages Look Like?
  • In commercial natural-language processing
  • The surface language (e.g. French words written
    in the standard French orthography) is usually a
    given.
  • Periodic official spelling reforms may require
    fixes to your analyzer.
  • You may have to worry about national variations.
  • In contrast, the analysis-language strings must
    be designed by the linguist. In the most common
    Xerox convention, each analysis string consists
    of the traditional dictionary-citation baseform
    followed by multicharacter-symbol tags.
  • cantar+Verb+PInd+1P+Sg
  • canto+Noun+Masc+Sg
  • alto+Adj+Fem+Pl

5
Non-Commercial (Lesser-Studied) Languages
  • 1. All normal human beings speak a natural
    language, but there is nothing necessary or
    natural about reading and writing.
  • 2. An orthography is a set of symbols, and
    conventions for using them, for making language
    visible.
  • 3. Orthographies are technologies, like
    agriculture or metalworking.
  • 4. Most languages have never been written, i.e.
    there is no standard orthography, or linguists
    and governments may have proposed several
    competing orthographies.
  • 5. When working with lesser-studied languages,
    you may have to choose (or devise) a surface
    orthography for use in your morphological
    analyzer.

6
Two Main Tasks to Morphology
  • Morphotactics
  • Describe the structure/grammar of words
  • Classic finite-state operations required
  • Concatenation of one morpheme to the next
  • Union of morphemes within classes
  • Some languages require other finite-state
    operations
  • Arabic stems require intersection
  • Malay requires special algorithms for
    reduplication
  • Phonological/Orthographical Alternation
  • Union and concatenation by themselves tend to
    build abstract morphophonemic strings
  • Use finite-state rules to map from underlying (or
    lexical) morphophonemic strings to surface
    strings

7
Describing Morphotactics Using Regular Expressions
  • Some very simple morphotactics can be described
    using just union, concatenation and perhaps
    optionality.
  • Simple Esperanto Verbs

    Opt. Prefix   Req. Root   Opt. Aspect   Req. Verb Ending
    ne            don         ad            as
    mal           dir                       is
                  pens                      os
                  ir                        us
                  ...                       u
                                            i

8
Esperanto Verb Morphotactics
  • xfst[0]: read regex
  • ( n e | m a l )
  • [ d o n | d i r | p e n s | i r ]
  • ( a d )
  • [ a s | i s | o s | u s | u | i ] ;
  • Each morpheme class is a unioned list of
    morphemes.
  • Optional classes are surrounded with
    parentheses.
  • Then morpheme classes are concatenated together,
    in the right order.

9
Esperanto Verb Morphotactics, Version 2 (xfst script)
  • xfst[0]: define Prefix n e | m a l ;
  • xfst[0]: define Root d o n | d i r | p e n s | i r ;
  • xfst[0]: define Aspect a d ;
  • xfst[0]: define VSuff a s | i s | o s | u s | u | i ;
  • xfst[0]: read regex (Prefix) Root (Aspect) VSuff ;
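
To make the morphotactics concrete, here is a minimal sketch in Python (not xfst) that enumerates the same language by brute force. It assumes the four morpheme classes listed on the slide; the names `PREFIX`, `ROOT`, `ASPECT`, `VSUFF` and `esperanto_verbs` are illustrative only, and an empty string stands in for an optional class that is skipped.

```python
from itertools import product

# Morpheme classes from the slide; "" models skipping an optional class.
PREFIX = ["", "ne", "mal"]
ROOT = ["don", "dir", "pens", "ir"]
ASPECT = ["", "ad"]
VSUFF = ["as", "is", "os", "us", "u", "i"]


def esperanto_verbs():
    """Union within each class, concatenation across classes, in fixed order."""
    return {"".join(parts) for parts in product(PREFIX, ROOT, ASPECT, VSUFF)}


if __name__ == "__main__":
    verbs = esperanto_verbs()
    print(len(verbs))
    print("maldonadis" in verbs, "donmal" in verbs)
```

A real network stores this set compactly as states and arcs rather than as a list of strings; the enumeration is only practical here because the toy language is finite and small.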

10
Morphophonological/Orthographical Alternations
  • If simple concatenation doesn't produce valid
    words, then we need to handle alternations.
  • In today's exercises, we will use Replace Rules,
    e.g. if Spanish pluralization is done by
    concatenating s to a noun, we will need to
    fix cases like the following:
  • [ p e z s ]
  • .o.
  • z -> c e || _ s .#.

[Diagram: the upper string pezs passes through the FST and comes out as the lower string peces]
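
The effect of the Spanish rule (z becomes c e when followed by a word-final s) can be approximated in Python with a regular-expression substitution. This is only a sketch of the downward direction under that reading of the rule; `fix_plural` is a hypothetical helper name, and a real replace rule compiles to a transducer that can also be applied upward.

```python
import re


def fix_plural(upper: str) -> str:
    """Approximate  z -> c e || _ s .#.  in the downward direction.

    The lookahead (?=s\\Z) checks the right context (an s at the very end
    of the word) without consuming it, so only the z itself is rewritten.
    """
    return re.sub(r"z(?=s\Z)", "ce", upper)


if __name__ == "__main__":
    for word in ["pezs", "luzs", "zorros", "pez"]:
        print(word, "->", fix_plural(word))
```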
11
The Simplest Xerox Replace Rules
  • Schema: upper -> lower || left _ right
  • where upper, lower, left and right are regular
    expressions denoting regular languages (not
    relations!)
  • Remember to use regular-expression syntax:
    Replace Rules are regular expressions! The
    overall Replace Rule denotes a relation.
  • E.g. s -> z || [ a | e | i | o | u ] _
    [ a | e | i | o | u ]
  • A context can be left empty, which is equivalent
    to a context of ?*
  • E.g. s -> z || _ m
  • p -> m || m _
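
As a rough Python analogy for the three example rules on this slide (s to z between vowels, s to z before m, p to m after m), left and right contexts can be written as lookbehind and lookahead assertions. The function names are hypothetical. Because lookarounds inspect the input string rather than the output, the contexts are checked against the upper-side string, which is the behaviour of the `||` form of the rule.

```python
import re

VOWEL = "[aeiou]"


def s_to_z_between_vowels(upper: str) -> str:
    # s -> z || [a|e|i|o|u] _ [a|e|i|o|u]
    return re.sub(rf"(?<={VOWEL})s(?={VOWEL})", "z", upper)


def s_to_z_before_m(upper: str) -> str:
    # s -> z || _ m      (left context empty, i.e. anything)
    return re.sub(r"s(?=m)", "z", upper)


def p_to_m_after_m(upper: str) -> str:
    # p -> m || m _      (right context empty, i.e. anything)
    return re.sub(r"(?<=m)p", "m", upper)


if __name__ == "__main__":
    print(s_to_z_between_vowels("casasa"))
    print(s_to_z_before_m("kosmos"))
    print(p_to_m_after_m("kampp"))
```

In the last call only the first p changes: the second p follows a p on the upper side, not an m, so its left context does not match.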

12
The Simplest Replace Rules II
  • Referring to the beginning or the end of a word
  • z -> s || _ .#.
  • e -> i || _ (s) .#.
  • e -> i || .#. p _ r
  • A rule may be unconditioned, with no context at
    all
  • c h ->
  • s s -> s
  • Do not write ss or ch in regular expressions
    unless you want them to be treated as single
    (multicharacter) symbols. Remember to unspecialize
    special symbols with % when you want a literal
    dollar sign (%$), etc.

13
Rule Abbreviations
  • Instead of two rules: e -> i || _ (s) .#.
  • o -> u || _ (s) .#.
  • You can write: e -> i , o -> u || _ (s) .#.
  • a comma separates the left-hand sides of the
    rule
  • Instead of two rules: e -> i || _ (s) .#.
  • e -> i || .#. p _ r
  • You can write: e -> i || _ (s) .#. , .#. p _ r
  • a comma separates the right-hand sides (the
    contexts) of the rule
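
Both abbreviations have a loose counterpart in Python regular expressions: several left-hand sides sharing one context become a character class with a lookup table, and several contexts for one left-hand side become alternatives joined with `|`. The sketch below assumes the two abbreviated rules on this slide (e to i and o to u before an optional word-final s; e to i in that context or between word-initial p and r). The function names and test words are made up for illustration.

```python
import re

RAISE = {"e": "i", "o": "u"}


def raise_final_vowel(upper: str) -> str:
    # e -> i , o -> u || _ (s) .#.
    # One shared right context: an optional s, then the end of the word.
    return re.sub(r"[eo](?=s?\Z)", lambda m: RAISE[m.group()], upper)


def raise_e_two_contexts(upper: str) -> str:
    # e -> i || _ (s) .#. , .#. p _ r
    # Two alternative contexts for the same replacement.
    return re.sub(r"e(?=s?\Z)|(?<=^p)e(?=r)", "i", upper)


if __name__ == "__main__":
    print(raise_final_vowel("montes"), raise_final_vowel("libro"))
    print(raise_e_two_contexts("perdone"))
```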

14
Simple Replace-Rule Semantics
  • upper -> lower || leftcontext _
    rightcontext
  • The overall rule denotes a finite-state relation
    (not an algorithm)
  • The upper-side language of a -> relation is the
    universal language (?*)
  • By default, all symbols on the upper side are
    mapped to the same symbol on the lower side
  • But IF a string on the upper side contains a
    designated upper string, in the designated
    context, then it is mapped to a string (or
    strings) on the lower side where the matched
    substring is replaced by the designated lower
    string.
  • The context must match on the upper side string
  • A right-arrow -> rule has a downward
    orientation.

15
Understanding Replace Rules
  • xfst> read regex a -> b ;
  • xfst> apply down a
  • xfst> apply down aaa
  • xfst> apply down dog
  • xfst> apply up b
  • xfst> apply up bbb
  • xfst> apply up dog
  • xfst> read regex a:b ;
  • xfst> apply down a
  • xfst> apply down aaa
  • xfst> apply up b
  • xfst> apply up bbb
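
The contrast this exercise is driving at can be previewed with a small Python model of the two relations, assuming the first network is the replace rule that rewrites every a as b and the second is the single symbol pair with a on the upper side and b on the lower side. The four function names are hypothetical; each returns the list of strings that the corresponding apply command would be expected to print.

```python
from itertools import product


def rule_apply_down(upper: str) -> list[str]:
    """Replace rule a -> b: every a becomes b, all else maps to itself."""
    return [upper.replace("a", "b")]


def rule_apply_up(lower: str) -> list[str]:
    """Invert the rule: each lower b may come from an upper a or an upper b."""
    if "a" in lower:
        return []  # an upper a is always rewritten, so no output contains a
    choices = [("a", "b") if ch == "b" else (ch,) for ch in lower]
    return sorted("".join(combo) for combo in product(*choices))


def pair_apply_down(upper: str) -> list[str]:
    """Symbol pair a:b: the relation contains exactly one pair, <a, b>."""
    return ["b"] if upper == "a" else []


def pair_apply_up(lower: str) -> list[str]:
    return ["a"] if lower == "b" else []


if __name__ == "__main__":
    print(rule_apply_down("aaa"), rule_apply_down("dog"))
    print(rule_apply_up("b"), len(rule_apply_up("bbb")))
    print(pair_apply_down("a"), pair_apply_down("aaa"), pair_apply_up("bbb"))
```

The rule accepts any upper string because its upper language is the universal language, whereas the symbol pair relates only the one-symbol string a to the one-symbol string b, so longer inputs yield nothing.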

16
Review of Notations for Transducers
  • The cross-product operator
  • [ u p p e r ] .x. [ l o w e r ]
  • In general, for any two regular expressions A and
    B denoting languages
  • A .x. B
  • For convenience, we can also write
  • a:b, equivalent to a .x. b
  • Tag:{ing}, equivalent to Tag .x. [ i n g ]
  • {upper}:{lower}, equivalent to
    [ u p p e r ] .x. [ l o w e r ]

17
Esperanto Verb Morphotactics, Version 3: A Lexicon with Two Levels
  • xfst[0]: define Prefix Neg:{ne} | Op:{mal} ;
  • xfst[0]: define Root d o n | d i r | p e n s | i r ;
  • xfst[0]: define Aspect Cont:{ad} ;
  • xfst[0]: define VSuff Pres:{as} | Past:{is} | Fut:{os} |
    Cond:{us} | Subj:u | Inf:i ;
  • xfst[0]: read regex (Prefix) Root (Aspect) VSuff ;
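
A two-level lexicon can be pictured as a set of (upper, lower) string pairs. The sketch below builds that set in Python for this toy lexicon, assuming each tag (Neg, Op, Cont, Pres, Past, Fut, Cond, Subj, Inf, spelled as they appear in this transcript) is paired with the surface morpheme that follows it on the slide, and that roots map to themselves. `LEXICON`, `apply_up` and `apply_down` are illustrative names, not xfst commands.

```python
from itertools import product

# Each entry is (upper/analysis side, lower/surface side).
PREFIX = [("", ""), ("Neg", "ne"), ("Op", "mal")]           # optional class
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]       # identity pairs
ASPECT = [("", ""), ("Cont", "ad")]                         # optional class
VSUFF = [("Pres", "as"), ("Past", "is"), ("Fut", "os"),
         ("Cond", "us"), ("Subj", "u"), ("Inf", "i")]

# Concatenate the classes pairwise on both levels at once.
LEXICON = [
    ("".join(up for up, _ in combo), "".join(low for _, low in combo))
    for combo in product(PREFIX, ROOT, ASPECT, VSUFF)
]


def apply_up(surface: str) -> list[str]:
    """Analysis: find every upper string paired with this surface string."""
    return [up for up, low in LEXICON if low == surface]


def apply_down(analysis: str) -> list[str]:
    """Generation: find every surface string paired with this analysis."""
    return [low for up, low in LEXICON if up == analysis]


if __name__ == "__main__":
    print(apply_up("malpensadus"))
    print(apply_down("OpdonContPast"))
```

The same pair set serves both directions, which is the practical payoff of encoding the lexicon as a relation rather than as a one-way procedure.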

18
Esperanto Verb Transducer
[Diagram: state network of the Esperanto verb transducer. Arcs pair the tags Neg, Op, Cont, Pres, Past, Fut, Cond, Subj and Inf with surface letters; 0 marks epsilon.]
Apply up: malpensadus
19
The Usual Strategy: Define a dictionary and alternation rules
  • Dictionary Transducer (upper: OpdonContPast, lower: maldonadis)
  • .o.
  • Alternation Rules
  • Result: Final FST
  • As necessary, apply alternation rules via
    composition
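
The dictionary-plus-rules strategy can be sketched in Python by treating the dictionary as a list of (upper, lower) pairs and "composing" it with an alternation rule applied to every lower string. Everything here is an assumption made for illustration: the tiny Spanish dictionary, the tag spellings `+Noun`, `+Sg`, `+Pl`, and the helper names. The rule is the plural fix from the earlier Spanish example (z becomes ce before a word-final s).

```python
import re

# Hypothetical dictionary transducer: analysis string -> intermediate string.
DICTIONARY = [
    ("pez+Noun+Sg", "pez"),
    ("pez+Noun+Pl", "pezs"),
    ("casa+Noun+Sg", "casa"),
    ("casa+Noun+Pl", "casas"),
]


def alternation(intermediate: str) -> str:
    """Alternation rule: z -> c e before a word-final s."""
    return re.sub(r"z(?=s\Z)", "ce", intermediate)


# Composition: the rule's input side is matched against the dictionary's
# lower side, and the intermediate level disappears from the result.
FINAL = [(upper, alternation(lower)) for upper, lower in DICTIONARY]


def apply_up(surface: str) -> list[str]:
    return [upper for upper, lower in FINAL if lower == surface]


def apply_down(analysis: str) -> list[str]:
    return [lower for upper, lower in FINAL if upper == analysis]


if __name__ == "__main__":
    print(apply_down("pez+Noun+Pl"))
    print(apply_up("peces"), apply_up("pezs"))
```

After composition the intermediate form pezs is no longer a valid surface word, which is why looking it up returns nothing.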
20
The Bambona Language
  • Review the Xerox regular-expression syntax.
  • Review the difference between
  • regular expression file
  • contains a single regular expression, ends with a
    semicolon and newline
  • xfst[0]: read regex < myfile.regex
  • script file
  • contains a list of commands to xfst (including
    perhaps define and read regex commands)
  • xfst[0]: source myfile.script
  • Read the description carefully (not just the
    final test data).
  • Describe the morphotactics using union and
    concatenation.
  • Handle the variations using replace rules.