Writing Lexical Transducers Using xfst

Slides: 21

Provided by: anneschi

1
Writing Lexical Transducers Using xfst

Overview of Transduction
Review of xfst Rules
Creating Two-Level Lexicons
Putting it All Together

2
Theory-Neutral Morphological Analysis

[Diagram: Words (bottom) <-> Black-Box Morphological Analyzer <-> Analyses (top)]

3
Finite-State Transducers (FSTs)

An FST encodes a regular relation, i.e. a relation between two regular languages.
FSTs can be used for morphological analysis if
  the set of surface words (strings) to be analyzed is a regular language, and
  the analyses are also defined to be a regular language, i.e. just another set of strings.

[Diagram: Analysis String Language (upper side) <-> FST <-> Surface String Language (lower side)]

4
What Do the Two Languages Look Like?

In commercial natural-language processing:
  The surface language (e.g. French words written in the standard French orthography) is usually a given.
    Periodic official spelling reforms may require fixes to your analyzer.
    You may have to worry about national variations.
  In contrast, the analysis-language strings must be designed by the linguist.
  In the most common Xerox convention, each analysis string consists of the traditional dictionary-citation baseform followed by multicharacter-symbol tags:
    cantar+Verb+PInd+1P+Sg
    canto+Noun+Masc+Sg
    alto+Adj+Fem+Pl

5
Non-Commercial (Lesser-Studied) Languages

1. All normal human beings speak a natural language, but there is nothing necessary or natural about reading and writing.
2. An orthography is a set of symbols, and conventions for using them, for making language visible.
3. Orthographies are technologies, like agriculture or metalworking.
4. Most languages have never been written: either there is no standard orthography, or linguists and governments may have proposed several competing orthographies.
5. When working with lesser-studied languages, you may have to choose (or devise) a surface orthography for use in your morphological analyzer.

6
Two Main Tasks to Morphology

Morphotactics
  Describe the structure/grammar of words.
  Classic finite-state operations required:
    Concatenation of one morpheme to the next
    Union of morphemes within classes
  Some languages require other finite-state operations:
    Arabic stems require intersection.
    Malay requires special algorithms for reduplication.
Phonological/Orthographical Alternation
  Union and concatenation by themselves tend to build abstract morphophonemic strings.
  Use finite-state rules to map from underlying (or lexical) morphophonemic strings to surface strings.

7
Describing Morphotactics Using Regular Expressions

Some very simple morphotactics can be described using just union, concatenation and perhaps optionality.

Simple Esperanto Verbs

Opt. Prefix   Req. Root   Opt. Aspect   Req. Verb Ending
ne            don         ad            as
mal           dir                       is
              pens                      os
              ir                        us
              ...                       u
                                        i

8
Esperanto Verb Morphotactics

xfst read regex
  ( [ n e | m a l ] )
  [ d o n | d i r | p e n s | i r ]
  ( a d )
  [ a s | i s | o s | u s | u | i ] ;

Each morpheme class is a unioned list of morphemes.
Optional classes are surrounded with parentheses.
Then morpheme classes are concatenated together, in the right order.
9
Esperanto Verb Morphotactics, Version 2 (xfst script)

xfst define Prefix [ n e | m a l ] ;
xfst define Root [ d o n | d i r | p e n s | i r ] ;
xfst define Aspect [ a d ] ;
xfst define VSuff [ a s | i s | o s | u s | u | i ] ;
xfst read regex (Prefix) Root (Aspect) VSuff ;

10
Morphophonological/Orthographical Alternations

If simple concatenation doesn't produce valid words, then we need to handle alternations.
In today's exercises, we will use Replace Rules. E.g. if Spanish pluralization is done by concatenating s to a noun, we will need to fix cases like the following:

  pezs
  .o.
  z -> c e || _ s .#. ;

[Diagram: upper side pezs -> FST -> lower side peces]
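
As a rough illustration of the pez example, the Python sketch below first concatenates the plural s and then repairs the result with a regular-expression substitution that plays the role of the replace rule (z becomes c e before a word-final s). The function name `pluralize_es` is invented for this example, and `re.sub` only approximates a replace rule for simple, non-overlapping cases like this one.

```python
import re


def pluralize_es(noun: str) -> str:
    """Concatenate plural s, then rewrite z as ce before a word-final s."""
    underlying = noun + "s"                       # morphotactics: pez + s -> pezs
    return re.sub(r"z(?=s\Z)", "ce", underlying)  # alternation: pezs -> peces


print(pluralize_es("pez"))    # peces
print(pluralize_es("canto"))  # cantos (no z before the final s, so unchanged)
```
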
11
The Simplest Xerox Replace Rules

Schema: upper -> lower || left _ right
where upper, lower, left and right are regular expressions denoting regular languages (not relations!)
Remember to use regular-expression syntax. Replace Rules are regular expressions! The overall Replace Rule denotes a relation.
E.g. s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
A context can be left empty, which is equivalent to a context of ?*
E.g. s -> z || _ m
     p -> m || m _

12
The Simplest Replace Rules II

Referring to the beginning or the end of a word:
  z -> s || _ .#.
  e -> i || _ (s) .#.
  e -> i || .#. p _ r
A rule may be unconditioned, with no context at all:
  c h ->
  s s -> s
Do not write ss or ch in regular expressions unless you want them to be treated as single symbols.
Remember to unspecialize special symbols when you want a literal dollar sign, etc.
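
For readers more used to conventional regex engines, the three word-boundary rules above can be approximated in Python as follows. This is a sketch under the assumption that each rule is applied on its own to a plain string; the function names are invented here, `\A` and `\Z` stand in for the xfst boundary symbol, and the test words are arbitrary strings rather than claims about any language.

```python
import re


def z_to_s_word_final(word: str) -> str:
    """Approximates a rule rewriting z as s at the end of a word."""
    return re.sub(r"z\Z", "s", word)


def e_to_i_before_optional_s(word: str) -> str:
    """Approximates a rule rewriting e as i before an optional s at word end."""
    return re.sub(r"e(?=s?\Z)", "i", word)


def e_to_i_after_initial_p(word: str) -> str:
    """Approximates a rule rewriting e as i between word-initial p and r."""
    return re.sub(r"\Ape(?=r)", "pi", word)


print(z_to_s_word_final("luz"))            # lus
print(e_to_i_before_optional_s("padres"))  # padris
print(e_to_i_after_initial_p("pera"))      # pira
print(e_to_i_after_initial_p("espera"))    # espera (the p is not word-initial)
```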

13
Rule Abbreviations

Instead of two rules:
  e -> i || _ (s) .#.
  o -> u || _ (s) .#.
You can write:
  e -> i , o -> u || _ (s) .#.
A comma separates the left-hand sides (the replacements) of the rule.

Instead of two rules:
  e -> i || _ (s) .#.
  e -> i || .#. p _ r
You can write:
  e -> i || _ (s) .#. , .#. p _ r
A comma separates the right-hand sides (the contexts) of the rule.
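
The first abbreviation (two replacements sharing one context) has a close analogue in ordinary regex code: one pattern with a shared lookahead and a small lookup table for the replacements. The sketch below is an approximation in Python, with `VOWEL_MAP` and `raise_final_vowel` as names made up for the illustration.

```python
import re

# Both replacements share the context "before an optional s at word end".
VOWEL_MAP = {"e": "i", "o": "u"}


def raise_final_vowel(word: str) -> str:
    """Rewrite e as i and o as u in one pass, only in the shared context."""
    return re.sub(r"[eo](?=s?\Z)", lambda m: VOWEL_MAP[m.group()], word)


print(raise_final_vowel("libros"))  # librus
print(raise_final_vowel("padre"))   # padri
print(raise_final_vowel("toro"))    # toru (only the final o is in context)
```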

14
Simple Replace-Rule Semantics

upper -> lower || leftcontext _ rightcontext

The overall rule denotes a finite-state relation (not an algorithm).
The upper-side language of a -> relation is the universal language (?*).
By default, all symbols on the upper side are mapped to the same symbol on the lower side.
But IF a string on the upper side contains a designated upper string, in the designated context, then it is mapped to a string (or strings) on the lower side where the matched substring is replaced by the designated lower string.
The context must match on the upper-side string.
A right-arrow -> rule has a downward orientation.

15
Understanding Replace Rules

xfst> read regex a -> b ;
xfst> apply down a
xfst> apply down aaa
xfst> apply down dog
xfst> apply up b
xfst> apply up bbb
xfst> apply up dog
xfst> read regex a:b ;
xfst> apply down a
xfst> apply down aaa
xfst> apply up b
xfst> apply up bbb
16
Review of Notations for Transducers

The cross-product operator
  [ u p p e r ] .x. [ l o w e r ]
In general, for any two regular expressions A and B denoting languages:
  A .x. B
For convenience, we can also write
  a:b                is equivalent to   a .x. b
  %+Tag:{ing}        is equivalent to   %+Tag .x. [ i n g ]
  {upper}:{lower}    is equivalent to   [ u p p e r ] .x. [ l o w e r ]

17
Esperanto Verb Morphotactics, Version 3: A Lexicon with Two Levels

xfst define Prefix [ Neg%+:{ne} | Op%+:{mal} ] ;
xfst define Root [ d o n | d i r | p e n s | i r ] ;
xfst define Aspect [ %+Cont:{ad} ] ;
xfst define VSuff [ %+Pres:{as} | %+Past:{is} | %+Fut:{os} | %+Cond:{us} | %+Subj:u | %+Inf:i ] ;
xfst read regex (Prefix) Root (Aspect) VSuff ;
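
A two-level lexicon can be pictured as a set of (upper, lower) string pairs. The Python sketch below builds that set for the Esperanto verb classes by concatenating one pair from each class, then looks strings up in either direction. It is an enumeration, not a real transducer, and it assumes the tags are spelled with a plus sign in the usual Xerox style (Neg+, Op+, +Cont, +Pres, and so on); `RELATION`, `apply_up` and `apply_down` are names chosen for this example.

```python
from itertools import product

# (upper, lower) pairs per class; ("", "") models skipping an optional class.
# The "+" in the tag spellings is an assumed convention for this sketch.
PREFIX = [("", ""), ("Neg+", "ne"), ("Op+", "mal")]
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]
ASPECT = [("", ""), ("+Cont", "ad")]
VSUFF = [("+Pres", "as"), ("+Past", "is"), ("+Fut", "os"),
         ("+Cond", "us"), ("+Subj", "u"), ("+Inf", "i")]


def build_relation() -> set:
    """Concatenate the upper sides and the lower sides in parallel."""
    pairs = set()
    for parts in product(PREFIX, ROOT, ASPECT, VSUFF):
        upper = "".join(u for u, _ in parts)
        lower = "".join(l for _, l in parts)
        pairs.add((upper, lower))
    return pairs


RELATION = build_relation()


def apply_up(surface: str) -> list:
    """All analysis strings paired with a surface string."""
    return sorted(u for u, l in RELATION if l == surface)


def apply_down(analysis: str) -> list:
    """All surface strings paired with an analysis string."""
    return sorted(l for u, l in RELATION if u == analysis)


print(apply_up("malpensadus"))         # ['Op+pens+Cont+Cond']
print(apply_down("Op+don+Cont+Past"))  # ['maldonadis']
```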

18
Esperanto Verb Transducer
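[Diagram: state graph of the Esperanto verb transducer. Arcs pair the tag symbols (Neg, Op, Cont, Pres, Past, Fut, Cond, Subj, Inf) with the letters of the corresponding morphemes, using 0 (epsilon) to pad the shorter side; root letters map to themselves.]

Apply up: malpensadus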

19
The Usual Strategy: Define a dictionary and alternation rules

Upper: Op+don+Cont+Past
Dictionary Transducer
Lower: maldonadis
.o.
Alternation Rules
= Final FST

As necessary, apply alternation rules via composition.
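
The strategy on this slide, a dictionary whose lower side is fed through alternation rules, can be mimicked in Python by composing a lookup table with a rewriting function. The entries and tags below (`pez+Noun+Pl`, `canto+Noun+Pl`) are hypothetical and reuse the Spanish plural example from earlier in the deck; in xfst the same effect would come from composing the two networks, not from looping over a dictionary.

```python
import re

# Hypothetical dictionary: analysis string (upper) -> lexical string (lower).
LEXICON = {
    "pez+Noun+Pl": "pezs",
    "canto+Noun+Pl": "cantos",
}


def alternation(lexical: str) -> str:
    """Stand-in for the alternation rules: z becomes ce before final s."""
    return re.sub(r"z(?=s\Z)", "ce", lexical)


# "Composition": pipe every lower-side string through the rules.
FINAL = {upper: alternation(lower) for upper, lower in LEXICON.items()}

# Inverting the table gives the analysis direction (apply up).
SURFACE_TO_ANALYSIS = {surface: upper for upper, surface in FINAL.items()}

print(FINAL["pez+Noun+Pl"])           # peces
print(SURFACE_TO_ANALYSIS["peces"])   # pez+Noun+Pl
```

The intermediate string pezs never appears in the final table, which mirrors how composition removes the intermediate level from the final FST.
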
20
The Bambona Language

Review the Xerox regular-expression syntax.
Review the difference between
  a regular expression file
    contains a single regular expression, ends with a semicolon and newline
    xfst read regex < myfile.regex
  a script file
    contains a list of commands to xfst (including perhaps define and read regex commands)
    xfst source myfile.script
Read the description carefully (not just the final test data).
Describe the morphotactics using union and concatenation.
Handle the variations using replace rules.