Title: Writing Lexical Transducers Using xfst
1. Writing Lexical Transducers Using xfst
- Overview of Transduction
- Review of xfst Rules
- Creating Two-Level Lexicons
- Putting it All Together
2. Theory-Neutral Morphological Analysis
[Figure: a Black-Box Morphological Analyzer sits between Words (below) and Analyses (above).]
3. Finite-State Transducers (FSTs)
- An FST encodes a Regular Relation, i.e. a relation between two regular languages.
- FSTs can be used for morphological analysis, if
  - The set of surface words (strings) to be analyzed is a regular language, and
  - The analyses are also defined to be a regular language, i.e. just another set of strings
[Figure: the FST sits between the Analysis String Language (above) and the Surface String Language (below).]
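
As a rough illustration of a relation between two string languages, the sketch below models a tiny finite relation in Python as an explicit set of (analysis, surface) pairs. The three pairs and the function names `apply_up` and `apply_down` (borrowed from the xfst commands) are illustrative assumptions. A real FST encodes such a relation compactly as a network rather than as a list.

```python
# A toy relation: each pair is (analysis string, surface string).
# The pairs are illustrative, not taken from a real analyzer.
RELATION = {
    ("cantar+Verb+PInd+1P+Sg", "canto"),
    ("canto+Noun+Masc+Sg", "canto"),
    ("alto+Adj+Fem+Pl", "altas"),
}


def apply_up(surface):
    """Analysis: return every analysis string paired with the surface word."""
    return sorted(a for a, s in RELATION if s == surface)


def apply_down(analysis):
    """Generation: return every surface string paired with the analysis."""
    return sorted(s for a, s in RELATION if a == analysis)


print(apply_up("canto"))
print(apply_down("alto+Adj+Fem+Pl"))
```

Because the relation is just a set of pairs, the same data serves both directions: looking up the surface side gives analysis, and looking up the analysis side gives generation.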
4. What Do the Two Languages Look Like?
- In commercial natural-language processing
  - The surface language (e.g. French words written in the standard French orthography) is usually a given.
    - Periodic official spelling reforms may require fixes to your analyzer.
    - You may have to worry about national variations.
  - In contrast, the analysis-language strings must be designed by the linguist. In the most common Xerox convention, each analysis string consists of the traditional dictionary-citation baseform followed by multicharacter-symbol tags.
    - cantar+Verb+PInd+1P+Sg
    - canto+Noun+Masc+Sg
    - alto+Adj+Fem+Pl
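
To make the analysis-string convention concrete, here is a minimal Python sketch that splits an analysis string into its baseform and its tags. It assumes that every tag is written with a leading `+` (as in `cantar+Verb+PInd+1P+Sg`) and that the baseform itself contains no `+`. The function name `split_analysis` is hypothetical.

```python
def split_analysis(analysis):
    """Split 'baseform+Tag1+Tag2...' into (baseform, [tags]).

    Assumes each tag starts with '+' and the baseform contains no '+'.
    """
    baseform, *tags = analysis.split("+")
    return baseform, ["+" + tag for tag in tags]


print(split_analysis("cantar+Verb+PInd+1P+Sg"))
print(split_analysis("alto+Adj+Fem+Pl"))
```

Inside an FST each tag is a single multicharacter symbol, so this string splitting only mimics how a person would read the analysis.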
5. Non-Commercial (Lesser-Studied) Languages
- 1. All normal human beings speak a natural language, but there is nothing necessary or natural about reading and writing.
- 2. An orthography is a set of symbols, and conventions for using them, for making language visible.
- 3. Orthographies are technologies, like agriculture or metalworking.
- 4. Most languages have never been written, i.e. there is no standard orthography, or linguists and governments may have proposed several competing orthographies.
- 5. When working with lesser-studied languages, you may have to choose (or devise) a surface orthography for use in your morphological analyzer.
6. Two Main Tasks to Morphology
- Morphotactics
  - Describe the structure/grammar of words
  - Classic finite-state operations required
    - Concatenation of one morpheme to the next
    - Union of morphemes within classes
  - Some languages require other finite-state operations
    - Arabic stems require intersection
    - Malay requires special algorithms for reduplication
- Phonological/Orthographical Alternation
  - Union and concatenation by themselves tend to build abstract morphophonemic strings
  - Use finite-state rules to map from underlying (or lexical) morphophonemic strings to surface strings
7. Describing Morphotactics Using Regular Expressions
- Some very simple morphotactics can be described using just union, concatenation and perhaps optionality.
- Simple Esperanto Verbs

  Opt. Prefix   Req. Root   Opt. Aspect   Req. Verb Ending
  ne            don         ad            as
  mal           dir                       is
                pens                      os
                ir                        us
                ...                       u
                                          i
8. Esperanto Verb Morphotactics
- xfst[0]: read regex
    ( n e | m a l )
    [ d o n | d i r | p e n s | i r ]
    ( a d )
    [ a s | i s | o s | u s | u | i ] ;
- Each morpheme class is a unioned list of morphemes.
- Optional classes are surrounded with parentheses.
- Then morpheme classes are concatenated together, in the right order.
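
The effect of union, optionality and concatenation can be checked by brute force. The following Python sketch enumerates every string in the language described by the four morpheme classes. It is only an illustration of what the regular expression denotes, not of how xfst compiles it. The empty string stands in for an absent optional class, and the variable names are assumptions.

```python
from itertools import product

# Each list is one morpheme class (a union of morphemes).
# "" models the case where an optional class is absent.
PREFIX = ["", "ne", "mal"]
ROOT = ["don", "dir", "pens", "ir"]
ASPECT = ["", "ad"]
ENDING = ["as", "is", "os", "us", "u", "i"]


def verb_forms():
    """Concatenate one choice from each class, in the fixed order."""
    return {"".join(parts) for parts in product(PREFIX, ROOT, ASPECT, ENDING)}


forms = verb_forms()
print(len(forms))               # 3 * 4 * 2 * 6 combinations
print("maldonadis" in forms)    # mal + don + ad + is
print("donmalis" in forms)      # wrong morpheme order
```

Swapping the order of the lists in the `product` call changes the language, which is the point of the last bullet: the classes must be concatenated in the right order.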
9. Esperanto Verb Morphotactics, Version 2 (xfst script)
- xfst[0]: define Prefix [ n e | m a l ] ;
- xfst[0]: define Root [ d o n | d i r | p e n s | i r ] ;
- xfst[0]: define Aspect [ a d ] ;
- xfst[0]: define VSuff [ a s | i s | o s | u s | u | i ] ;
- xfst[0]: read regex (Prefix) Root (Aspect) VSuff ;
10. Morphophonological/Orthographical Alternations
- If simple concatenation doesn't produce valid words, then we need to handle alternations.
- In today's exercises, we will use Replace Rules, e.g. if Spanish pluralization is done by concatenating s to a noun, we will need to fix cases like the following:
  - pezs
  - .o.
  - [ z -> c e || _ s .#. ]
[Figure: the FST maps the upper string pezs to the lower string peces.]
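
The pluralization example can be imitated in Python with an ordinary regular-expression substitution. This is a minimal sketch of the downward direction only, assuming the rule is `z -> c e || _ s .#.` (z becomes ce before a word-final s). The function name `pluralize` is hypothetical, and the sketch ignores every other fact about Spanish plurals.

```python
import re


def pluralize(noun):
    """Concatenate s, then repair z before word-final s (z -> c e || _ s .#.)."""
    underlying = noun + "s"                      # morphotactics: plain concatenation
    return re.sub(r"z(?=s$)", "ce", underlying)  # alternation: fix the z + s case


print(pluralize("pez"))
print(pluralize("casa"))
```

The lookahead `(?=s$)` plays the role of the right context `_ s .#.`: it checks for a following s at the end of the word without consuming it.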
11. The Simplest Xerox Replace Rules
- Schema: upper -> lower || left _ right
  - where upper, lower, left and right are regular expressions denoting regular languages (not relations!)
- Remember to use regular-expression syntax. Replace Rules are regular expressions! The overall Replace Rule denotes a relation.
  - E.g. s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
- A context can be left empty, which is equivalent to a context of ?*
  - E.g. s -> z || _ m
  - p -> m || m _
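
For readers who know Perl-style regular expressions, the three example rules can be approximated in Python with lookbehind and lookahead assertions. This is only a sketch of the downward (generation) direction, assuming the rules are `s -> z || [a|e|i|o|u] _ [a|e|i|o|u]`, `s -> z || _ m` and `p -> m || m _`. The function names are hypothetical. Lookarounds inspect the input string, which matches the convention that `||` contexts are checked on the upper side.

```python
import re

VOWEL = "[aeiou]"


def s_to_z_between_vowels(word):
    # s -> z || [a|e|i|o|u] _ [a|e|i|o|u]
    return re.sub(rf"(?<={VOWEL})s(?={VOWEL})", "z", word)


def s_to_z_before_m(word):
    # s -> z || _ m   (left context left empty)
    return re.sub(r"s(?=m)", "z", word)


def p_to_m_after_m(word):
    # p -> m || m _   (right context left empty)
    return re.sub(r"(?<=m)p", "m", word)


print(s_to_z_between_vowels("casa"))
print(s_to_z_before_m("mismo"))
print(p_to_m_after_m("campo"))
```

Unlike these one-way substitutions, the xfst rule compiles into a transducer that can also be applied upward.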
12. The Simplest Replace Rules II
- Referring to the beginning or the end of a word
  - z -> s || _ .#.
  - e -> i || _ (s) .#.
  - e -> i || .#. p _ r
- A rule may be unconditioned, with no context at all
  - c h ->
  - s s -> s
- Do not write ss or ch in regular expressions unless you want them to be treated as single symbols. Remember to unspecialize special symbols when you want a literal dollar sign, etc.
13. Rule Abbreviations
- Instead of two rules
  - e -> i || _ (s) .#.
  - o -> u || _ (s) .#.
- You can write: e -> i , o -> u || _ (s) .#.
  - a comma separates the left-hand sides of the rule
- Instead of two rules
  - e -> i || _ (s) .#.
  - e -> i || .#. p _ r
- You can write: e -> i || _ (s) .#. , .#. p _ r
  - a comma separates the right-hand sides of the rule
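
The two abbreviations can be sketched in Python as single substitutions. This is a rough analogue of the downward direction only, assuming the abbreviated rules read `e -> i , o -> u || _ (s) .#.` and `e -> i || _ (s) .#. , .#. p _ r`. The function names and test words are illustrative.

```python
import re


def raise_final_vowels(word):
    # e -> i , o -> u || _ (s) .#.
    # Two replacements share one context: word-final, optionally before s.
    mapping = {"e": "i", "o": "u"}
    return re.sub(r"[eo](?=s?$)", lambda m: mapping[m.group()], word)


def raise_e(word):
    # e -> i || _ (s) .#. , .#. p _ r
    # One replacement with two alternative contexts.
    # The second alternative matches "pe" at the start of the word before r,
    # so only the last character of the match is rewritten.
    return re.sub(r"e(?=s?$)|^pe(?=r)", lambda m: m.group()[:-1] + "i", word)


print(raise_final_vowels("gatos"))
print(raise_e("perro"))
print(raise_e("pere"))
```

In both functions `$` and `^` stand in for the word-boundary symbol `.#.`, and the optional `s?` mirrors the parenthesized `(s)`.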
14. Simple Replace-Rule Semantics
- upper -> lower || leftcontext _ rightcontext
- The overall rule denotes a finite-state relation (not an algorithm)
- The upper-side language of a -> relation is the universal language (?*)
- By default, all symbols on the upper side are mapped to the same symbol on the lower side
- But IF a string on the upper side contains a designated upper string, in the designated context, then it is mapped to a string (or strings) on the lower side where the matched substring is replaced by the designated lower string.
- The context must match on the upper side string
- A right-arrow -> rule has a downward orientation.
15. Understanding Replace Rules
- xfst> read regex a -> b ;
- xfst> apply down a
- xfst> apply down aaa
- xfst> apply down dog
- xfst> apply up b
- xfst> apply up bbb
- xfst> apply up dog
- xfst> read regex a:b ;
- xfst> apply down a
- xfst> apply down aaa
- xfst> apply up b
- xfst> apply up bbb
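
The contrast between the two networks in this exercise can be previewed with a small Python model. It assumes the first expression is the unconditioned rule `a -> b` and the second is the symbol pair `a:b`. The function names are hypothetical. The model follows the semantics stated earlier: the rule's upper side is the universal language and the replacement is obligatory, while `a:b` relates only the single string a to the single string b.

```python
from itertools import product


def down_rule(word):
    """apply down for [ a -> b ]: every a is obligatorily replaced by b."""
    return {word.replace("a", "b")}


def up_rule(word):
    """apply up for [ a -> b ]: a lower b may come from an upper a or b."""
    if "a" in word:
        return set()  # a can never appear on the lower side
    choices = [("a", "b") if ch == "b" else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


def down_pair(word):
    """apply down for [ a:b ]: the relation contains only the pair <a, b>."""
    return {"b"} if word == "a" else set()


def up_pair(word):
    """apply up for [ a:b ]."""
    return {"a"} if word == "b" else set()


print(sorted(down_rule("aaa")), sorted(down_rule("dog")))
print(sorted(up_rule("b")), len(up_rule("bbb")))
print(sorted(down_pair("a")), sorted(down_pair("aaa")))
```

The rule accepts any input and passes unrelated strings such as dog through unchanged, whereas the symbol pair rejects everything except its one upper string.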
16. Review of Notations for Transducers
- The cross-product operator
  - [ u p p e r ] .x. [ l o w e r ]
- In general, for any two regular expressions A and B denoting languages
  - A .x. B
- For convenience, we can also write
  - a:b, equivalent to a .x. b
  - Tag:{ing}, equivalent to Tag .x. [ i n g ]
  - {upper}:{lower}, equivalent to [ u p p e r ] .x. [ l o w e r ]
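
For finite languages the cross-product is easy to picture: it pairs every string of the upper language with every string of the lower language. The Python sketch below shows this, assuming languages are represented as plain sets of strings. The function name `cross_product` is hypothetical.

```python
def cross_product(upper_language, lower_language):
    """Pair every upper string with every lower string (A .x. B)."""
    return {(u, l) for u in upper_language for l in lower_language}


# [ u p p e r ] .x. [ l o w e r ] relates exactly one pair of strings.
print(cross_product({"upper"}, {"lower"}))

# With larger languages, every combination is related.
print(sorted(cross_product({"a", "b"}, {"x"})))
```

xfst builds the same relation as a network, so it also works when A and B are infinite regular languages, which a set of pairs cannot represent.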
17. Esperanto Verb Morphotactics, Version 3: A Lexicon with Two Levels
- xfst[0]: define Prefix [ Neg:{ne} | Op:{mal} ] ;
- xfst[0]: define Root [ d o n | d i r | p e n s | i r ] ;
- xfst[0]: define Aspect [ Cont:{ad} ] ;
- xfst[0]: define VSuff [ Pres:{as} | Past:{is} | Fut:{os} | Cond:{us} | Subj:u | Inf:i ] ;
- xfst[0]: read regex (Prefix) Root (Aspect) VSuff ;
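18. Esperanto Verb Transducer
[Figure: state diagram of the Esperanto verb transducer. Its arcs pair the tags Neg, Op, Cont, Pres, Past, Fut, Cond, Subj and Inf on the upper side with the letters of the corresponding morphemes on the lower side, using 0 (epsilon) where one side has no symbol.]
Apply up: malpensadus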
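
To see what applying up to malpensadus should return, the two-level lexicon can be modelled in Python as lists of (upper, lower) pairs that are concatenated side by side. This is a brute-force sketch, assuming the tag names Neg, Op, Cont, Pres, Past, Fut, Cond, Subj and Inf from the lexicon and printing tags without separators. The function names are hypothetical.

```python
from itertools import product

# Each entry is an (upper, lower) pair; ("", "") marks an optional class as absent.
PREFIX = [("", ""), ("Neg", "ne"), ("Op", "mal")]
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]  # roots map to themselves
ASPECT = [("", ""), ("Cont", "ad")]
VSUFF = [("Pres", "as"), ("Past", "is"), ("Fut", "os"),
         ("Cond", "us"), ("Subj", "u"), ("Inf", "i")]


def build_lexicon():
    """Concatenate the classes in order, keeping upper and lower sides aligned."""
    pairs = set()
    for combo in product(PREFIX, ROOT, ASPECT, VSUFF):
        upper = "".join(u for u, _ in combo)
        lower = "".join(l for _, l in combo)
        pairs.add((upper, lower))
    return pairs


LEXICON = build_lexicon()


def apply_up(surface):
    return sorted(u for u, l in LEXICON if l == surface)


def apply_down(analysis):
    return sorted(l for u, l in LEXICON if u == analysis)


print(apply_up("malpensadus"))
print(apply_down("OpdonContPast"))
```

The transducer stores the same pairs, but shares states and arcs among them instead of listing each pair separately.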
19. The Usual Strategy: Define a dictionary and alternation rules
[Figure: the Dictionary Transducer relates the upper string OpdonContPast to the lower string maldonadis. Composing it (.o.) with the Alternation Rules yields the Final FST.]
- As necessary, apply alternation rules via composition
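
The dictionary-plus-rules strategy can be sketched in Python by composing a small dictionary relation with a rewrite function. The example reuses the Spanish plural rule `z -> c e || _ s .#.`. The dictionary entries and the tags `+Noun` and `+Pl` are hypothetical, and the alternation is modelled as a one-way function rather than a true relation.

```python
import re

# Hypothetical dictionary relation: (upper analysis, lower morphophonemic string).
DICTIONARY = {
    ("pez+Noun+Pl", "pezs"),
    ("casa+Noun+Pl", "casas"),
}


def alternation(intermediate):
    """z -> c e || _ s .#. applied to a dictionary lower-side string."""
    return re.sub(r"z(?=s$)", "ce", intermediate)


def compose(dictionary, rule):
    """Feed each dictionary lower string through the rule (Dictionary .o. Rule)."""
    return {(upper, rule(lower)) for upper, lower in dictionary}


FINAL = compose(DICTIONARY, alternation)


def apply_up(surface):
    return sorted(u for u, l in FINAL if l == surface)


print(sorted(FINAL))
print(apply_up("peces"))
```

After composition the intermediate string pezs no longer appears: the final relation links the analysis directly to the surface word, as a composed FST does.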
20. The Bambona Language
- Review the Xerox regular-expression syntax.
- Review the difference between
  - regular expression file
    - contains a single regular expression, ends with a semicolon and newline
    - xfst[0]: read regex < myfile.regex
  - script file
    - contains a list of commands to xfst (including perhaps define and read regex commands)
    - xfst[0]: source myfile.script
- Read the description carefully (not just the final test data).
- Describe the morphotactics using union and concatenation.
- Handle the variations using replace rules.