Title: Chapter 3: Morphology and Finite-State Transducers
1. Chapter 3: Morphology and Finite-State Transducers
- Heshaam Faili
- hfaili_at_ece.ut.ac.ir
- University of Tehran
2. Morphology
- Morphology is the study of the internal structure of words
- Morphemes: (roughly) the minimal meaning-bearing units in a language, the smallest building blocks of words
- Morphological parsing is the task of breaking a word down into its component morphemes, i.e., assigning structure
  - going → go + ing
  - running → run + ing
- Spelling rules are different from morphological rules
- Parsing can also provide us with an analysis
  - going → go-VERB + ing-GERUND
3. Kinds of morphology
- Inflectional morphology: grammatical morphemes that are required for words in certain syntactic situations
  - I run
  - John runs
  - -s is an inflectional morpheme marking the third person singular of the verb
- Derivational morphology: morphemes that are used to produce new words, providing new meanings and/or new parts of speech
  - establish
  - establishment
  - -ment is a derivational morpheme that turns verbs into nouns
4. More on morphology
- We will refer to the stem of a word (its main part) and its affixes (additions), which include prefixes, suffixes, infixes, and circumfixes
- Most inflectional morphological endings (and some derivational ones) are productive: they apply to every word in a given class
  - -ing can attach to any verb (running, hurting)
  - re- can attach to any verb (rerun, rehurt)
- Morphology is highly complex in more agglutinative languages like Persian and Turkish
- Some of the work done by syntax in English is done by morphology in Turkish
- This shows that we can't simply list all possible words
5. Overview
- A. Morphological recognition with finite-state automata (FSAs)
- B. Morphological parsing with finite-state transducers (FSTs)
- C. Combining FSTs
- D. More applications of FSTs
6. A. Morphological recognition with FSAs
- Before we talk about assigning a full structure to a word, we can talk about recognizing legitimate words
- We have the technology to do this: finite-state automata (FSAs)
7. Overview of English verbal morphology
- 4 English regular verb forms: base, -s, -ing, -ed
  - walk/walks/walking/walked
  - merge/merges/merging/merged
  - try/tries/trying/tried
  - map/maps/mapping/mapped
  - Generally productive forms
- English irregular verbs (about 250)
  - eat/eats/eating/ate/eaten
  - catch/catches/catching/caught/caught
  - cut/cuts/cutting/cut/cut
  - etc.
8. Analyzing English verbs
- For the -s and -ing forms, both regular and irregular verbs use their base forms
- Irregulars differ in how they treat the past and the past participle forms
9. FSA for English verbal morphology (morphotactics)
- initial: 0; final: 1, 2, 3
- 0 →verb-past-irreg→ 3
- 0 →vstem-reg→ 1
- 1 →past (-ed)→ 3
- 1 →pastpart (-ed)→ 3
- 0 →vstem-reg→ 2
- 0 →vstem-irreg→ 2
- 2 →prog (-ing)→ 3
- 2 →3sing (-s)→ 3
- N.B. this covers morphotactics, but not spelling rules (the latter require a separate FSA)
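As a sketch, the morphotactics above can be encoded directly as a recognizer. The tiny stem lists below are illustrative assumptions standing in for the FSA's arc labels, not a real lexicon:

```python
# Hypothetical mini-lexicon standing in for the arc labels of the FSA.
VSTEM_REG = {"walk", "merge", "try", "map"}
VSTEM_IRREG = {"eat", "catch", "cut"}
VERB_PAST_IRREG = {"ate", "caught"}

def recognize(word):
    """Accept words licensed by the morphotactics; spelling rules
    (e.g. try -> tries, map -> mapping) are deliberately NOT handled,
    matching the slide's caveat."""
    if word in VERB_PAST_IRREG:                      # 0 ->verb-past-irreg-> 3
        return True
    if any(word == s + "ed" for s in VSTEM_REG):     # 0 -> 1 -> 3 (past/pastpart)
        return True
    stems = VSTEM_REG | VSTEM_IRREG                  # 0 ->vstem-(ir)reg-> 2
    if any(word == s + "ing" or word == s + "s" for s in stems):
        return True                                  # 2 ->prog/3sing-> 3
    return word in stems                             # bare stems: final states 1, 2
```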
10. A Fun FSA Exercise: Isleta Morphology
- Consider the following data from Isleta, a dialect of Southern Tiwa, a Native American language spoken in New Mexico
  - temiban: I went
  - amiban: you went
  - temiwe: I am going
  - mimiay: he was going
  - tewanban: I came
  - tewanhi: I will come
11. Practising Isleta
- List the morphemes corresponding to the following English translations
  - I
  - you
  - he
  - go
  - come
  - past
  - present_progressive
  - past_progressive
  - future
- What is the order of morphemes in Isleta?
- How would you say each of the following in Isleta?
  - He went
  - I will go
  - You were coming
12. An FSA for Isleta Verbal Inflection
- initial: 0; final: 3
- 0 →{mi, te, a}→ 1
- 1 →{mi, wan}→ 2
- 2 →{ban, we, ay, hi}→ 3
13. B. Morphological Parsing with FSTs
- Using a finite-state automaton (FSA) to recognize a morphological realization of a word is useful
- But what if we also want to analyze that word?
  - e.g., given cats, tell us that it's cat +N +PL
- A finite-state transducer (FST) can give us the necessary technology to do this
- Two-level morphology
  - Lexical level: stem plus affixes
  - Surface level: actual spelling/realization of the word
- Roughly, we'll have the following for cats
  - c:c a:a t:t +N:ε +PL:s
14. Finite-State Transducers
- While an FSA recognizes (accepts/rejects) an input expression, it doesn't produce any other output
- An FST, on the other hand, also produces an output expression; we define this in terms of relations
- So, an FSA is a recognizer, whereas an FST translates from one expression to another
- So, it reads from one tape and writes to another tape (see Figure 3.8, p. 71)
- Actually, it can also read from the output tape and write to the input tape
- So, FSTs can be used for both analysis and generation (they are bidirectional)
15. Transducers and Relations
- Let's pretend we want to translate from the Cyrillic alphabet to the Roman alphabet
- We can use a mapping table, such as
  - А : A
  - Б : B
  - Г : G
  - Д : D
  - etc.
- We define R = {⟨А, A⟩, ⟨Б, B⟩, ⟨Г, G⟩, ⟨Д, D⟩, …}
- We can think of this as a relation R ⊆ Cyrillic × Roman
- To understand FSTs, we need to understand relations
16. The Cyrillic Transducer
- initial: 0; final: 0
- 0 →А:A→ 0
- 0 →Б:B→ 0
- 0 →Г:G→ 0
- 0 →Д:D→ 0
- …
- Transducers implement a mapping defined by a relation
  - R = {⟨А, A⟩, ⟨Б, B⟩, ⟨Г, G⟩, ⟨Д, D⟩, …}
- These relations are called regular relations (since each side expresses a regular language)
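Since the transducer has a single state, it is just the relation R applied symbol by symbol. A minimal sketch (only the four pairs shown on the slide):

```python
# The relation R from the slide, as a dict.
R = {"А": "A", "Б": "B", "Г": "G", "Д": "D"}

def transliterate(s):
    """Read from the Cyrillic tape, write to the Roman tape."""
    return "".join(R[c] for c in s)

# FSTs are bidirectional: inverting the relation inverts the mapping.
R_INV = {roman: cyr for cyr, roman in R.items()}
```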
17. FSAs and FSTs
- FSTs, then, are almost identical to FSAs. Both have:
  - Q: a finite set of states
  - q0: a designated start state
  - F: a set of final states
  - δ: a transition function
- The difference: the alphabet (Σ) for an FST is now comprised of complex symbols (e.g., X:Y)
  - FSA: Σ is a finite alphabet of symbols
  - FST: Σ is a finite alphabet of complex symbols, or pairs
- As a shorthand, if we have X:X, we can write this as X
18. FSTs for morphology
- For morphology, using FSTs means that we can
  - set up pairs between the lexical level (stem/affixes) and the surface level (actual realization)
    - c:c a:a t:t +N:ε +PL:s
  - set up pairs to go from one form to another, i.e., the underlying base form maps to the plural
    - g:g o:e o:e s:s e:e (goose → geese)
- We can combine both kinds of information into the same FST
  - g:g o:o o:o s:s e:e +N:ε +SG:ε
  - g:g o:e o:e s:s e:e +N:ε +PL:ε
19. Isleta Verbal Inflection
- Surface tape: te ε ε mi hi ε
- Lexical tape: te +PRO +1P mi hi +FUT
- I will go
- Surface: temihi
- Lexical: te+PRO+1P mi hi+FUT
- Note that the cells have to line up across tapes.
- So, if an input symbol gives rise to more/fewer output symbols, epsilons have to be added to the input/output tape in the appropriate positions.
20. An FST for Isleta Verbal Inflection
- initial: 0; final: 3
- 0 →{mi:mi+PRO+3P, te:te+PRO+1P, a:a+PRO+2P}→ 1
- 1 →{mi, wan}→ 2
- 2 →{ban:ban+PAST, we:we+PRESPROG, ay:ay+PASTPROG, hi:hi+FUT}→ 3
- Interpret te:te+PRO+1P as shorthand for 3 separate arcs (te:te, ε:+PRO, ε:+1P)
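A sketch of this FST in the analysis direction, trying each segmentation of the surface form; the glosses follow the slides, and the concatenated `+`-notation output is an assumption:

```python
# Arcs of the three stages: surface piece -> lexical gloss.
PRONOUN = {"te": "te+PRO+1P", "a": "a+PRO+2P", "mi": "mi+PRO+3P"}
STEM = {"mi": "mi", "wan": "wan"}
TENSE = {"ban": "ban+PAST", "we": "we+PRESPROG",
         "ay": "ay+PASTPROG", "hi": "hi+FUT"}

def parse(surface):
    """Surface -> lexical; returns None if no segmentation works."""
    for p in PRONOUN:
        for s in STEM:
            for t in TENSE:
                if p + s + t == surface:
                    return PRONOUN[p] + STEM[s] + TENSE[t]
    return None
```

Run in the other direction (generation), the same tables map a lexical string back to a surface form, illustrating the bidirectionality of FSTs.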
21. A Lexical Transducer (Xerox)
- Remember that FSTs can be used in either direction
- Pairs (upper : lower):
  - leave+VBZ : leaves
  - leave+VB : leave
  - leave+VBG : leaving
  - leave+VBD : left
  - leave+NN : leave
  - leave+NNS : leaves
  - leaf+NNS : leaves
  - left+JJ : left
- Left-to-right: Input leave+VBD (upper language), Output left (lower language)
- Right-to-left: Input leaves (lower language), Outputs leave+NNS, leave+VBZ, leaf+NNS (upper language)
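Treating the transducer as its set of pairs makes the two directions concrete; a minimal sketch using just the pairs above:

```python
# Upper:lower pairs from the slide.
PAIRS = [("leave+VBZ", "leaves"), ("leave+VB", "leave"),
         ("leave+VBG", "leaving"), ("leave+VBD", "left"),
         ("leave+NN", "leave"), ("leave+NNS", "leaves"),
         ("leaf+NNS", "leaves"), ("left+JJ", "left")]

def generate(upper):   # left-to-right: upper language -> lower language
    return [lo for up, lo in PAIRS if up == upper]

def analyze(lower):    # right-to-left: lower language -> upper language
    return [up for up, lo in PAIRS if lo == lower]
```

Note that analysis is one-to-many: a single lower-language string can map to several upper-language analyses.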
22. Transducer Example (Xerox)
- L1 = [a-z]*
- Consider the language L2 that results from replacing any instance of "ab" in L1 by "x".
- So, to define the mapping, we define a relation R ⊆ L1 × L2
  - e.g., ⟨"abacab", "xacx"⟩ is in R.
- Note: "xacx" in the lower language is paired with 4 strings in the upper language: "abacab", "abacx", "xacab", "xacx"
- N.B. ? = [a-z] \ {a, b, x}
23. C. Combining FSTs: Spelling Rules
- So far, we have gone from a lexical level (e.g., cat+N+PL) to a surface level (e.g., cats)
- But this surface level is actually an intermediate level: it doesn't take spelling into account
- So, the lexical level fox+N+PL corresponds to fox^s
- We will use ^ to refer to a morpheme boundary
- We need another level to account for spelling rules
24. Lexicon FST
- The lexicon FST will convert a lexical level to an intermediate form
  - dog+N+PL → dog^s
  - fox+N+PL → fox^s
  - mouse+N+PL → mouse^s
  - dog+V+SG → dog^s
- This will be of the form
  - 0 →f→ 1, 1 →o→ 2, 2 →x→ 3
  - 3 →+N:^→ 4
  - 4 →+PL:s→ 5
  - 4 →+SG:ε→ 6
- And so on
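As a string-rewriting sketch of the lexicon FST, covering just the slide's toy feature set; the `#` word-final boundary follows JM Fig 3.11, and irregulars like mouse are left to the rule FST, as on the slide:

```python
def lexicon_fst(lexical):
    """Lexical -> intermediate: realize the feature string as an affix,
    with ^ marking the morpheme boundary and # the end of the word."""
    if lexical.endswith("+N+PL") or lexical.endswith("+V+SG"):
        return lexical[:-len("+N+PL")] + "^s#"   # plural noun / 3sg verb: -s
    if lexical.endswith("+N+SG"):
        return lexical[:-len("+N+SG")] + "#"     # singular noun: no affix
    return None                                  # not in this toy lexicon
```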
25. English noun lexicon as an FST
- [JM Fig 3.9]
- Expanding the aliases: LEX-FST [JM Fig 3.11]
26. LEX-FST
- Let's allow ε to pad the tape
- Then, we won't force both tapes to have the same length
- Also, let's pretend we're generating
- [Figure: lexical and intermediate tapes, showing the morpheme boundary (^) and the word-final boundary (#)]
27. Rule FST
- The rule FST will convert the intermediate form into the surface form
  - dog^s → dogs (covers both N and V forms)
  - fox^s → foxes
  - mouse^s → mice
- Assuming we include other arcs for every other character, this will be of the form
  - 0 →f→ 0, 0 →o→ 0, 0 →x→ 1
  - 1 →^:ε→ 2
  - 2 →ε:e→ 3
  - 3 →s→ 4
- This FST is too impoverished
28. Some English Spelling Rules
29. E-insertion FST [JM Fig 3.14, p. 78]
30. E-insertion FST
- [Figure: intermediate and surface tapes]
- Trace
  - generating foxes from fox^s#
    - q0 -f→ q0 -o→ q0 -x→ q1 -^:ε→ q2 -ε:e→ q3 -s→ q4 -#→ q0
  - generating foxs from fox^s#
    - q0 -f→ q0 -o→ q0 -x→ q1 -^:ε→ q2 -s→ q5 -#→ FAIL
  - generating salt from salt#
    - q0 -s→ q1 -a→ q0 -l→ q0 -t→ q0 -#→ q0
  - parsing assess
    - q0 -a→ q0 -s→ q1 -s→ q1 -^:ε→ q2 -ε:e→ q3 -s→ q4 -s→ FAIL
    - q0 -a→ q0 -s→ q1 -s→ q1 -e→ q0 -s→ q1 -s→ q1 -#→ q0
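The e-insertion rule (ε → e / {x, s, z} ^ __ s #) can be sketched as a rewrite over the intermediate tape. A minimal sketch (mouse → mice would need a separate rule and is omitted):

```python
import re

def e_insertion(intermediate):
    """Intermediate -> surface: insert e between {x,s,z}^ and s#,
    then erase the ^ and # boundary symbols."""
    s = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    return s.replace("^", "").replace("#", "")
```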
31. Combining Lexicon and Rule FSTs
- We would like to combine these two FSTs, so that we can go from the lexical level to the surface level.
- How do we integrate the intermediate level?
  - Cascade the FSTs: run one after the other
  - Compose the FSTs: combine the rules at each state
32. Cascading FSTs
- The idea of cascading FSTs is simple
  - Input1 → FST1 → Output1
  - Output1 → FST2 → Output2
- The output of the first FST is used as the input of the second
- Since both FSTs are reversible, the cascaded FSTs are still reversible/bidirectional.
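A cascade is just running the second FST on the first one's output; a sketch with toy string-rewriting stand-ins for the lexicon and rule FSTs (the ^ and # boundary symbols follow the chapter's conventions):

```python
import re

def lexicon_fst(lexical):                  # lexical -> intermediate
    return re.sub(r"\+N\+PL$", "^s#", lexical)

def rule_fst(intermediate):                # intermediate -> surface (e-insertion)
    s = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    return s.replace("^", "").replace("#", "")

def cascade(lexical):
    """Output1 of FST1 becomes the input of FST2."""
    return rule_fst(lexicon_fst(lexical))
```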
33. Composing FSTs
- We can compose each transition in one FST with a transition in another
  - FST1: p0 →a:b→ p1, p0 →d:e→ p1
  - FST2: q0 →b:c→ q1, q0 →e:f→ q0
- Composed FST
  - (p0,q0) →a:c→ (p1,q1)
  - (p0,q0) →d:f→ (p1,q0)
- The new state names (e.g., (p0,q0)) seem somewhat arbitrary, but this ensures that two FSTs with different structures can still be composed
  - e.g., a:b and d:e originally went to the same state, but now we have to distinguish those states
- Why doesn't e:f loop anymore?
34. Composing FSTs for morphology
- With our lexical, intermediate, and surface levels, this means that we'll compose
  - p2 →x→ p3, p3 →+N:^→ p4, p4 →+PL:s→ p5, p4 →ε:ε→ p4
  - q0 →x→ q1, q1 →^:ε→ q2, q2 →ε:e→ q3, q3 →s→ q4
- into
  - (p2,q0) →x→ (p3,q1)
  - (p3,q1) →+N:ε→ (p4,q2)
  - (p4,q2) →ε:e→ (p4,q3)
  - (p4,q3) →+PL:s→ (p5,q4)
35. Generating or Parsing with the FST lexicon and rules
36. Lexicon-Free FST: The Porter Stemmer
- Used in IR and search engines
  - e.g., a search for Foxes should also match Fox
- Stemming
- The lexicon-free Porter algorithm, e.g.:
  - ATIONAL → ATE (relational → relate)
  - ING → ε if the stem contains a vowel (motoring → motor)
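A sketch of just the two rules named above; the full Porter algorithm has several ordered steps and measure conditions that this ignores:

```python
import re

def porter_step(word):
    """Apply the two example rules: ATIONAL -> ATE, and strip ING
    when the remaining stem contains a vowel."""
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"   # relational -> relate
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]                        # motoring -> motor
    return word
```

Note the vowel condition blocks over-stemming: sing is left alone because the would-be stem s contains no vowel.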
37. D. More applications of FSTs
- Syntactic parsing using FSTs
  - approximates the actual structure
  - (it won't work in general for syntax)
- Noun Phrase (NP) parsing using FSTs
  - also referred to as NP chunking, or partial parsing
  - often used for prepositional phrases (PPs), too
38. Syntactic parsing using FSTs
- Parsing is more than recognition: it returns a structure
- For syntactic recognition, an FSA could be used
- How does syntax work?
  - S → NP VP
  - NP → (D) N
  - VP → V NP
  - D → the; N → girl; N → zebras; V → saw
- How do we go about encoding this?
39. Syntactic Parsing using FSTs
- [Figure: a cascade over the input The girl saw zebras (tagged D N V N): FST1 brackets NPs, FST2 builds VPs, FST3 builds Ss]
- FST1: initial: 0; final: 2
  - 0 →N:NP→ 2
  - 0 →D:ε→ 1
  - 1 →N:NP→ 2
- D N V N → (FST1) NP V NP → (FST2) NP VP → (FST3) S
40. Syntactic structure with FSTs
- Note that the previous FSTs only output labels after the phrase has been completed.
- Where did the phrase start?
- To fully capture the structure of a sentence, we need an FST which delineates the beginning and the end of a phrase
  - 0 →Det:NP-Start→ 1
  - 1 →N:NP-Finish→ 2
- Another FST can group the pieces into complete phrases
41. Why FSTs can't always be used for syntax
- Syntax is infinite, but we have set up a finite number of levels (depth) of a tree with a finite number of FSTs
- We can still use FSTs, but (arguably) not as elegantly
  - The girl saw that zebras saw the giraffes.
  - We have a VP over a VP and will have to run FST2 twice at different times.
- Furthermore, we begin to get very complex FST abbreviations, e.g., /Det? Adj N PP/, which don't match linguistic structure
- Center-embedding constructions
  - Allowed in languages like English
  - Mathematically impossible to capture with finite-state methods
42. Center embedding
- Examples
  - The man (that) the woman saw laughed.
  - The man Harry said the woman saw laughed.
- An S in the middle of another S
- A problem for FSA/FST technology
  - There's no way for finite-state grammars to make sure that the number of NPs matches the number of verbs
  - These are a^n c b^n constructions, which are not regular
- We have to use context-free grammars, a topic we'll return to later in the course
43. Noun Phrase (NP) parsing using FSTs
- If we make the task more narrow, we can have more success, e.g., only parsing NPs
  - The man on the floor likes the woman who is a trapeze artist
  - [The man]NP on [the floor]NP likes [the woman]NP who is [a trapeze artist]NP
- Taking the NP chunker's output as input, a PP chunker gives:
  - [The man]NP [on [the floor]NP]PP likes [the woman]NP who is [a trapeze artist]NP
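An NP chunker of this kind can be sketched as a left-to-right, longest-match pass over POS-tagged input; the tagset and the tagged sentence in the test are assumptions for illustration:

```python
def np_chunk(tagged):
    """tagged: list of (word, tag) pairs. Bracket every (DET)? ADJ* N
    sequence as an NP, longest match first, left to right."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":               # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1                              # any number of adjectives
        if j < len(tagged) and tagged[j][1] == "N":
            out.append("[" + " ".join(w for w, _ in tagged[i:j + 1]) + "]")
            i = j + 1                           # emit the NP chunk
        else:
            out.append(tagged[i][0])            # pass other words through
            i += 1
    return " ".join(out)
```

A PP chunker could then run over this output, wrapping a preposition plus the following NP bracket, exactly as in the cascade idea above.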
44. Exercises
- 3.1, 3.4, 3.6 (3.3, 3.5 in the new 2005 edition)
- Write a Persian morphology analyzer (including nouns and verbs) in Perl
- See the document on morphological analysis from the Shiraz Project