Title: CPSC 503 Computational Linguistics
1 CPSC 503 Computational Linguistics
- Lecture 4
- Giuseppe Carenini
2 Today 1/23
- Finite State Transducers (FSTs) and Morphological Parsing
- Stemming (Porter Stemmer)
3 Computational problems in Morphology
- Recognition: recognize whether a string is an English word (FSA)
- Parsing/Generation: word ↔ stem, class, lexical features
- e.g., lies ↔ lie N PL or lie V 3SG
- Stemming: word → stem
4 Finite State Transducers (FSTs)
- An FSA cannot help with parsing/generation: it only accepts or rejects a string
- Need to extend the FSA
- Add another tape
- Add extra symbols to the transitions
- On one tape we read cats, on the other we write cat N PL (or vice versa)
5 FSTs as translators
[Figure: an FST translating between two tapes; one direction is labeled generation, the other parsing]
6 Example
[Figure: FST for "lie" with states q0 to q7 and arcs labeled l:l, i:i, e:e, N:ε, PL:s, V:ε, 3SG:s]
- Transitions (as a translator)
- l:l means read an l on one tape and write an l on the other (or vice versa)
- N:ε means read an N symbol on one tape and write nothing on the other (or vice versa)
- PL:s means read PL and write an s (or vice versa)
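The transitions described above can be run in both directions with a small nondeterministic walk. The following is only a sketch: the arc labels come from the slide, but the assignment of arcs to particular state numbers, the `translate` function, and its `read_lexical` flag are assumptions made for illustration.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# Arcs: (source state, lexical symbol, surface symbol, target state).
# Which arc joins which numbered state is an assumption; the slide fixes only the labels.
ARCS = [
    ("q0", "l", "l", "q1"),
    ("q1", "i", "i", "q2"),
    ("q2", "e", "e", "q3"),
    ("q3", "N", EPS, "q4"),
    ("q4", "PL", "s", "q5"),
    ("q3", "V", EPS, "q6"),
    ("q6", "3SG", "s", "q7"),
]
START, FINALS = "q0", {"q5", "q7"}


def translate(symbols, read_lexical):
    """Return every output sequence the machine produces for `symbols`.

    read_lexical=True  -> generation (lexical tape in, surface tape out)
    read_lexical=False -> parsing    (surface tape in, lexical tape out)
    """
    results = []

    def walk(state, pos, out):
        if pos == len(symbols) and state in FINALS:
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src != state:
                continue
            inp, outp = (lex, surf) if read_lexical else (surf, lex)
            emitted = out + ([outp] if outp != EPS else [])
            if inp == EPS:
                walk(dst, pos, emitted)        # epsilon move: consume nothing
            elif pos < len(symbols) and symbols[pos] == inp:
                walk(dst, pos + 1, emitted)    # consume one input symbol

    walk(START, 0, [])
    return results


parses = translate(list("lies"), read_lexical=False)
surface = translate(["l", "i", "e", "V", "3SG"], read_lexical=True)
```

Parsing "lies" follows both the N and the V branch, so `parses` holds two lexical strings, while generating from l i e V 3SG yields a single surface string.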
7 Examples (as a translator)
[Figure: surface tape l i e s given, lexical tape to be filled in; lexical tape l i e V 3SG given, surface tape to be filled in]
8 Examples (as a recognizer and a generator)
[Figure: recognizer with both tapes given (lexical l i e V 3SG, surface l i e s); generator with both the lexical and the surface tape empty]
9 FST definition
- Q: a finite set of states
- I, O: input and output alphabets (which may include ε)
- Σ: a finite alphabet of complex symbols i:o, i ∈ I and o ∈ O
- q0: the start state
- F: a set of accept/final states (F ⊆ Q)
- A transition relation δ that maps Q × Σ to Q
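As an illustration of the definition, the components can be stored directly in a small data structure. This is a minimal sketch: the class name `FST`, its field names, and the `recognizes` method are hypothetical, and the example machine reuses the "lie" arcs with an assumed state numbering.

```python
from dataclasses import dataclass

EPS = "ε"


@dataclass
class FST:
    states: frozenset   # Q
    sigma: frozenset    # Σ: complex symbols (i, o) with i in I and o in O
    start: str          # q0
    finals: frozenset   # F, a subset of Q
    delta: dict         # δ: (state, (i, o)) -> state

    def recognizes(self, pairs):
        """True iff the sequence of (i, o) pairs leads from q0 to a final state."""
        state = self.start
        for pair in pairs:
            state = self.delta.get((state, pair))
            if state is None:   # no arc for this complex symbol
                return False
        return state in self.finals


delta = {
    ("q0", ("l", "l")): "q1",
    ("q1", ("i", "i")): "q2",
    ("q2", ("e", "e")): "q3",
    ("q3", ("N", EPS)): "q4",
    ("q4", ("PL", "s")): "q5",
    ("q3", ("V", EPS)): "q6",
    ("q6", ("3SG", "s")): "q7",
}
lie = FST(
    states=frozenset(f"q{n}" for n in range(8)),
    sigma=frozenset(pair for _, pair in delta),
    start="q0",
    finals=frozenset({"q5", "q7"}),
    delta=delta,
)
```

Treating each i:o pair as one symbol makes δ an ordinary function on Q × Σ, so recognition is the same loop as for an FSA.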
10 FST can be used as
- Translators: input one string from I, output another from O (or vice versa)
- Recognizers: input a string from I×O
- Generators: output a string from I×O
Terminology warning!
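The generator reading can be sketched by enumerating every accepted string of i:o pairs and then projecting it onto the two tapes. The arc table below reuses the "lie" machine with an assumed state numbering; `generate` and `tapes` are hypothetical helper names, and the enumeration terminates only because this toy machine has no cycles.

```python
EPS = "ε"

# state -> list of ((lexical symbol, surface symbol), target state)
ARCS = {
    "q0": [(("l", "l"), "q1")],
    "q1": [(("i", "i"), "q2")],
    "q2": [(("e", "e"), "q3")],
    "q3": [(("N", EPS), "q4"), (("V", EPS), "q6")],
    "q4": [(("PL", "s"), "q5")],
    "q6": [(("3SG", "s"), "q7")],
}
FINALS = {"q5", "q7"}


def generate(state="q0", prefix=()):
    """Yield every accepted sequence of i:o pairs, depth first."""
    if state in FINALS:
        yield prefix
    for pair, target in ARCS.get(state, []):
        yield from generate(target, prefix + (pair,))


def tapes(pairs):
    """Project a pair sequence onto its lexical and surface tapes, dropping ε."""
    lexical = " ".join(i for i, _ in pairs if i != EPS)
    surface = "".join(o for _, o in pairs if o != EPS)
    return lexical, surface


outputs = [tapes(pairs) for pairs in generate()]
```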
11 A step back: FSA can represent morphological knowledge
- Lexicon: list of stems and affixes, together with basic information about them
- Morphotactics: the rules governing the ordering of morphemes
- Orthographic rules: model changes in morphemes when they combine
12 FSA for inflectional morphology of plural
[Figure: FSA with separate paths for some regular nouns and some irregular nouns]
13 FST for inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns; one arc is labeled o:i]
14 Examples
[Figure: surface tape m i c e given, lexical tape to be filled in; lexical tape c a t N PL given, surface tape to be filled in]
15 Problems/Challenges
- Ambiguity: one word can correspond to multiple structures
- Spelling changes: may occur when two morphemes are combined (inflectionally)
- e.g. butterfly + -s → butterflies
16 Ambiguity
- ND recognition: multiple paths through a machine may lead to an accept state (it didn't matter which path was actually traversed)
- In ND parsing the path to an accept state does matter: different paths represent different parses, and different outputs will result
[Figure: the FST for "lie" with states q0 to q7; after l:l, i:i, e:e the N:ε and V:ε branches each continue with an arc that writes s]
17 Ambiguity: more complex example
- What's the right parse for Unionizable?
- Union-ize-able
- Un-ion-ize-able
- Each would represent a valid path through an FST for derivational morphology.
18 Dealing with Morphological Ambiguity
- There are a number of ways to deal with this problem
- Simply take the first output found
- Find all the possible outputs (all paths) and return them all (without choosing)
- Bias the search so that only one or a few likely paths are explored
Then use part-of-speech tagging to choose
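The three strategies can be compared on the Unionizable example with a toy segmenter. This is a sketch under loud assumptions: the `MORPHS` table is a hypothetical five-entry lexicon (with "iz" standing for -ize before -able), no morphotactic constraints are checked, and the "fewest morphemes" preference is an invented heuristic that ranks finished parses rather than truly biasing the search.

```python
# Hypothetical surface morphs -> lexical morphemes.
MORPHS = {"un": "un-", "ion": "ion", "union": "union", "iz": "-ize", "able": "-able"}


def segmentations(word, start=0):
    """Lazily yield every way to split word[start:] into known morphs."""
    if start == len(word):
        yield []
        return
    for end in range(start + 1, len(word) + 1):
        piece = word[start:end]
        if piece in MORPHS:
            for rest in segmentations(word, end):
                yield [MORPHS[piece]] + rest


word = "unionizable"
first = next(segmentations(word), None)          # take the first output found
all_parses = list(segmentations(word))           # return every path, without choosing
preferred = min(segmentations(word), key=len)    # stand-in for a biased search
```

Because `segmentations` is a generator, taking only the first parse stops the search early, whereas the other two strategies explore every path.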
19 Spelling Changes
- When morphemes are combined inflectionally, the spelling at the boundaries may change
- Examples
- E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
- Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., try, butterfly)
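The two rules can be written down as plain string functions to check them against the listed examples. This is a sketch, not a transducer: `add_s` and `add_ed` are hypothetical names, only these two rules are covered, and the requirement that the -y follow a consonant is an added assumption (it keeps forms such as "boys" intact) that the slide does not state.

```python
import re

SIBILANT_END = re.compile(r"(s|z|sh|ch|x)$")
CONSONANT_Y_END = re.compile(r"[^aeiou]y$")   # assumption: y must follow a consonant


def add_s(stem):
    """Attach -s, applying E-insertion or Y-replacement where the rule fires."""
    if SIBILANT_END.search(stem):
        return stem + "es"            # E-insertion: kiss -> kisses
    if CONSONANT_Y_END.search(stem):
        return stem[:-1] + "ies"      # Y-replacement: y -> ie before -s
    return stem + "s"


def add_ed(stem):
    """Attach -ed, applying Y-replacement where the rule fires."""
    if CONSONANT_Y_END.search(stem):
        return stem[:-1] + "ied"      # Y-replacement: y -> i before -ed
    return stem + "ed"


plurals = [add_s(w) for w in ["kiss", "waltz", "bush", "watch", "box", "butterfly", "cat"]]
```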
20 Solution: Multi-Tape Machines
- Add intermediate tape
- Use the output of one tape machine as the input to the next
- Add intermediate symbols
- ^ morpheme boundary
- # word boundary
21 Multi-Level Tape Machines
[Figure: lexical, intermediate and surface tapes connected by FST-1 and FST-2]
- FST-1 translates between the lexical and the intermediate level
- FST-2 handles the spelling changes (due to one rule) to the surface tape
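The two-stage cascade can be sketched in the generation direction with ordinary functions standing in for the two transducers. Everything here is an illustrative assumption: the tiny lexicon, the function names, the use of ^ for the morpheme boundary and # for the word boundary, and the regular expression that plays the role of the E-insertion rule.

```python
import re

REGULAR = {"cat", "fox", "dog"}                      # hypothetical regular nouns
IRREGULAR_PL = {"mouse": "mice", "goose": "geese"}   # hypothetical irregular nouns


def lexical_to_intermediate(stem, features):
    """Stand-in for FST-1: ('fox', ('N', 'PL')) -> 'fox^s#'."""
    if stem not in REGULAR and stem not in IRREGULAR_PL:
        raise ValueError(f"unknown stem: {stem}")
    if features == ("N", "SG"):
        return stem + "#"
    if features == ("N", "PL"):
        if stem in IRREGULAR_PL:
            return IRREGULAR_PL[stem] + "#"
        return stem + "^s#"
    raise ValueError(f"unsupported features: {features}")


def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: insert e between a sibilant-final stem and s#, then drop ^ and #."""
    with_e = re.sub(r"(s|z|sh|ch|x)\^(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(stem, features):
    """Feed the output of the first stage into the second."""
    return intermediate_to_surface(lexical_to_intermediate(stem, features))


forms = [generate(stem, ("N", "PL")) for stem in ["fox", "cat", "mouse"]]
```

Keeping the stages separate means the spelling rule never needs to know about the lexicon: it only inspects the boundary symbols on the intermediate tape.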
22 FST-1 for inflectional morphology of plural
[Figure: FST-1 with paths for some regular nouns and some irregular nouns; recoverable arc labels include o:i and a PL arc that writes s]
23 Example
[Figure: lexical tape f o x N PL with the intermediate tape to be filled in; lexical tape m o u s e N PL with the intermediate tape to be filled in]
24 FST-2 for E-insertion (Intermediate to Surface)
- E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
- as in fox^s# → foxes
[Figure: FST-2 for the E-insertion rule]
25 Examples
[Figure: intermediate tape f o x ^ s # with the surface tape to be filled in; intermediate tape b o x ^ i n g # with the surface tape to be filled in]
26 Where are we?
27 Final Scheme Part 1
28 Final Scheme Part 2
29 Intersection(T1,T2)
- States of T1 and T2: Q1 and Q2
- States of intersection: Q1 × Q2
- Transitions of T1 and T2: δ1, δ2
- Transitions of intersection: δ3
- δ3((xa, ya), i:c) = (xb, yb) iff
- δ1(xa, i:c) = xb AND
- δ2(ya, i:c) = yb
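The transition rule for the intersection can be sketched as a product construction over two transition tables. This assumes each table is a dictionary from (state, pair) to a single next state and treats every pair as one atomic symbol; the function name `intersect` and the two small machines are hypothetical, and start and final states are left out.

```python
def intersect(delta1, delta2):
    """Build δ3 on paired states: an arc exists only if both machines have it."""
    delta3 = {}
    for (xa, pair1), xb in delta1.items():
        for (ya, pair2), yb in delta2.items():
            if pair1 == pair2:   # the same complex symbol in T1 and T2
                delta3[((xa, ya), pair1)] = (xb, yb)
    return delta3


# Hypothetical machines that share only the a:b arc.
T1 = {("p0", ("a", "a")): "p0", ("p0", ("a", "b")): "p1"}
T2 = {("r0", ("b", "b")): "r0", ("r0", ("a", "b")): "r1"}
T3 = intersect(T1, T2)
```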
30 Composition(T1,T2)
- States of T1 and T2: Q1 and Q2
- States of composition: Q1 × Q2
- Transitions of T1 and T2: δ1, δ2
- Transitions of composition: δ3
- δ3((xa, ya), i:o) = (xb, yb) iff
- There exists c such that
- δ1(xa, i:c) = xb AND
- δ2(ya, c:o) = yb
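The composition rule can be sketched the same way: an i:o arc exists on a paired state whenever some intermediate symbol c links an i:c arc of T1 to a c:o arc of T2. The function name `compose` and the example machines are hypothetical; targets are collected in sets because different choices of c may lead to different state pairs, and ε-transitions are not handled.

```python
def compose(delta1, delta2):
    """Compose two tables of the form {(state, (in, out)): next_state}."""
    delta3 = {}
    for (xa, (i, c)), xb in delta1.items():
        for (ya, (c2, o)), yb in delta2.items():
            if c == c2:   # T1's output symbol is T2's input symbol
                delta3.setdefault(((xa, ya), (i, o)), set()).add((xb, yb))
    return delta3


# Hypothetical machines: T1 rewrites a as b, T2 rewrites b as c.
T1 = {("p0", ("a", "b")): "p1"}
T2 = {("r0", ("b", "c")): "r1"}
T3 = compose(T1, T2)
```

The composed machine maps a directly to c, so the intermediate symbol b never appears on either of its tapes.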
31 Other important applications of FSTs in NLP
- Segmentation: finding word boundaries in text (?!)
- Shallow syntactic parsing: e.g., find only noun phrases
- Dialogue Act Disambiguation: "right" (IUI-04)
- Phonological Rules
32 FSTs in Practice
- Install an FST package (pointers)
- Describe your formal language (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
- Your specification is compiled into an FST
- NOTE: FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs
33 Computational problems in Morphology
- Recognition: recognize whether a string is an English word (FSA)
- Parsing/Generation (FST): word ↔ stem, class, lexical features
- e.g., lies ↔ lie N PL or lie V 3SG
- Stemming: word → stem
34 Stemmer
- E.g. the Porter algorithm (Appendix B), which is based on a series of sets of simple cascaded rewrite rules
- ATIONAL → ATE (relational → relate)
- ING → ε if stem contains vowel (motoring → motor)
- Cascade of rules applied to computerization
- -ization → -ize: computerize
- -ize → ε: computer
- Errors occur
- organization → organ, doing → doe, university → universe
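The cascade idea can be illustrated with the four rewrite rules quoted above, grouped into two stages. This is a toy sketch and not the Porter algorithm: the grouping into stages, the `STAGES` table, and the function names are assumptions, and the real algorithm has many more rules and conditions.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def has_vowel(stem):
    return VOWEL.search(stem) is not None


# Each stage is a list of (suffix, replacement, condition on the remaining stem).
STAGES = [
    [("ization", "ize", None), ("ational", "ate", None)],
    [("ize", "", None), ("ing", "", has_vowel)],
]


def apply_stage(word, rules):
    """Apply the first rule in the stage whose suffix and condition both match."""
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition is None or condition(stem):
                return stem + replacement
    return word


def toy_stem(word):
    """Run the word through every stage in order, feeding each output forward."""
    for rules in STAGES:
        word = apply_stage(word, rules)
    return word


stemmed = {w: toy_stem(w) for w in ["relational", "motoring", "computerization", "organization"]}
```

The last entry reproduces one of the errors listed above: organization passes through organize and ends up as organ.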
35 Stemming is mainly used in Information Retrieval
- Run a stemmer on the documents to be indexed
- Run a stemmer on users' queries
- Compute similarity between queries and documents (based on the stems they contain)
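The three steps can be sketched with stem sets and a simple overlap score. All of this is assumed for illustration: `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer, and Jaccard overlap is just one possible similarity measure, chosen here for brevity.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """The set of stems in a whitespace-tokenized, lowercased text."""
    return {toy_stem(token) for token in text.lower().split()}


def similarity(query, document):
    """Jaccard overlap between the stem sets of a query and a document."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


def rank(query, documents):
    """Order documents from most to least similar to the query."""
    return sorted(documents, key=lambda doc: similarity(query, doc), reverse=True)


docs = ["the cat sat", "stemmers index document"]
ranked = rank("indexing documents", docs)
```

Without stemming, "indexing documents" shares no exact token with either document; after stemming it shares index and document with the second one.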
36 Porter as an FST
- The original exposition of the Porter stemmer did not describe it as a transducer, but
- Each stage is a separate transducer
- The stages can be composed to get one big transducer
37 Formalisms and associated Algorithms
Formalism: Linguistic Knowledge
- State Machines (no prob.), i.e. Finite State Automata (and Regular Expressions) and Finite State Transducers: (English) Morphology
- Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars: Syntax
- Logical formalisms (First-Order Logics): Semantics
- AI planners: Pragmatics, Discourse and Dialogue
38 Next Time
- Intro to probability and information theory
- In your preferred source, read about
- Conditional probability
- Bayes' rule
- Independence
- Entropy
- Conditional Entropy and Mutual Information
39 Lexical to Intermediate Level
40 FST for inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns]
41 Foxes
42 FST Review
- FSTs allow us to take an input and deliver a structure based on it
- Or take a structure and create a surface form
- Or take a structure and create another structure
44 Review
- In many applications it's convenient to decompose the problem into a set of cascaded transducers where
- The output of one feeds into the input of the next.
45 English Spelling Changes
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
46 FST can be used as
- Translators: input one string (a sequence from I), output another one (a sequence from O), or vice versa
- Recognizers: input both strings (a sequence from I×O)
- Generators: output both strings (a sequence from I×O)