Title: CPSC 503 Computational Linguistics
1CPSC 503 Computational Linguistics
- Lecture 3
- Giuseppe Carenini
2- Subscribe to mailing list
- Some more Intros
- NLP_at_UBC
3Introductions
- Your Name
- Previous experience in NLP?
- Why are you interested in NLP?
- Are you thinking of NLP as your main research area? If not, what else do you want to specialize in?
- Anything else
4NLP research at UBC
- TOPICS
- Generation and Summarization of Evaluative Text (e.g., customer reviews)
- Summarization of conversations (emails, blogs, meetings)
- PEOPLE: G. Carenini, R. Ng (Profs), G. Murray (Postdoc), Students
- SUPPORT: NSERC, Google, BObjects (now SAP), MSResearch
5Formalisms and associated Algorithms
Linguistic Knowledge and the formalisms used for it
- (English) Morphology: State Machines (no prob.), i.e., Finite State Automata (and Regular Expressions) and Finite State Transducers
- Syntax: Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
- Semantics: Logical formalisms (First-Order Logics)
- Pragmatics, Discourse and Dialogue: AI planners
6Computational tasks in Morphology
- Recognition: recognize whether a string is an English word (FSA)
- Parsing/Generation: map between a word and its stem, class, and lexical features
- e.g., bought <-> buy V PAST-PART, or bought <-> buy V PAST
7Today Sept 15
- Finite State Transducers (FSTs) and Morphological Parsing
- Stemming (Porter Stemmer)
8FST definition
- Q: a finite set of states
- I, O: input and output alphabets (which may include ε)
- Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
- q0: the start state
- F: a set of accept/final states (F ⊆ Q)
- A transition relation δ that maps Q × Σ to 2^Q
E.g., if |Q| = 3, I = {a, b, c, ε}, O = {a, b}: |Σ| = ? and 0 < |δ| < ?
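The definition above can be sketched as a small data structure. This is a minimal illustration, not a reference implementation: the class name `FST`, the method names, and the toy "cat" machine are all assumptions made for the example, the empty string stands in for ε, and the search assumes the machine has no cycle of ε-input arcs.

```python
from collections import defaultdict

EPS = ""  # the empty string stands in for epsilon in this sketch


class FST:
    """Toy FST: delta maps (state, (i, o)) to a set of next states (Q x Sigma -> 2^Q)."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        self.delta = defaultdict(set)

    def add_arc(self, src, i, o, dst):
        self.delta[(src, (i, o))].add(dst)

    def transduce(self, symbols):
        """Return every output string the machine can emit for the input symbols."""
        results = set()
        stack = [(self.start, 0, "")]  # (state, input position, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.add(out)
            for (src, (i, o)), dsts in self.delta.items():
                if src != state:
                    continue
                if i == EPS:
                    step = 0  # epsilon input: consume nothing
                elif pos < len(symbols) and symbols[pos] == i:
                    step = 1
                else:
                    continue
                for dst in dsts:
                    stack.append((dst, pos + step, out + o))
        return results


# Hypothetical machine for one regular noun: c:c a:a t:t N:eps PL:s
plural = FST(start=0, finals=[5])
arcs = [("c", "c"), ("a", "a"), ("t", "t"), ("N", EPS), ("PL", "s")]
for src, (i, o) in enumerate(arcs):
    plural.add_arc(src, i, o, src + 1)

print(plural.transduce(["c", "a", "t", "N", "PL"]))  # {'cats'}
```

Because `delta` returns a set of next states, the same code handles nondeterministic machines: every accepting path contributes one output string.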
9FST can be used as
- Translators: input one string from I, output another from O (or vice versa)
- Recognizers: input a string from I × O
- Generators: output a string from I × O
Terminology warning!
E.g., if I = {a, b, c, ε}, O = {a, b}
10FST inflectional morphology of plural
[Figure: FST with one branch for some regular nouns and one for some irregular nouns (arc labels include o:i)]
Notes
- X stands for X:X
- Arc labels read lexical:surface
11Examples
[Figure: two pairs of lexical and surface tapes. First pair: the surface tape holds m i c e. Second pair: the lexical tape holds c a t N PL.]
12Computational Morphology Problems/Challenges
- Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
- Spelling changes: may occur when two morphemes are combined
- e.g., butterfly + -s -> butterflies
13Ambiguity more complex example
- What's the right parse for Unionizable?
- Union-ize-able
- Un-ion-ize-able
- Each would represent a valid path through an FST for derivational morphology.
- Both Adj
14Deal with Morphological Ambiguity
- Find all the possible outputs (all paths) and
return them all (without choosing)
Then use part-of-speech tagging to choose: look at the neighboring words
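Returning all parses without choosing can be sketched as an exhaustive segmentation. This is only an illustration: the function name `all_segmentations` and the morph list are hypothetical, and the list uses surface forms (so "iz" rather than "ize") to sidestep spelling changes.

```python
def all_segmentations(word, morphs):
    """Return every way to split `word` into a sequence of known morphs."""
    if word == "":
        return [[]]  # exactly one way to segment the empty string
    parses = []
    for m in morphs:
        if word.startswith(m):
            for rest in all_segmentations(word[len(m):], morphs):
                parses.append([m] + rest)
    return parses


MORPHS = ["un", "ion", "union", "iz", "able"]  # hypothetical surface morphs
parses = all_segmentations("unionizable", MORPHS)
for p in parses:
    print("-".join(p))
# un-ion-iz-able
# union-iz-able
```

Both segmentations come back; a later component such as a part-of-speech tagger would be responsible for picking one.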
15(2) Spelling Changes
- When morphemes are combined inflectionally, the spelling at the boundaries may change
- Examples
- E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
- Y-replacement: when -s or -ed is added to a word ending in -y, -y changes to -ie or -i respectively (e.g., try, butterfly)
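The two rules above can be sketched as a single suffixing function. The name `add_suffix` is hypothetical, and the requirement that the -y follow a consonant is an added assumption not stated on the slide (it keeps "boy" + "s" as "boys").

```python
import re


def add_suffix(stem, suffix):
    """Attach -s or -ed to a stem, applying E-insertion and Y-replacement."""
    # E-insertion: insert -e before -s after -s, -z, -sh, -ch, -x
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    # Y-replacement (assumption: only when -y follows a consonant)
    if re.search(r"[^aeiou]y$", stem):
        if suffix == "s":
            return stem[:-1] + "ies"  # -y becomes -ie before -s
        if suffix == "ed":
            return stem[:-1] + "ied"  # -y becomes -i before -ed
    return stem + suffix


for stem, suffix in [("kiss", "s"), ("watch", "s"), ("try", "s"), ("try", "ed"), ("cat", "s")]:
    print(stem, "+", suffix, "->", add_suffix(stem, suffix))
```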
16Solution Multi-Tape Machines
- Add intermediate tape
- Use the output of one tape machine as the input to the next
- Add intermediate symbols
- ^ morpheme boundary
- # word boundary
17Multi-Level Tape Machines
[Figure: FST-1 and FST-2 in cascade]
- FST-1 translates between the lexical and the intermediate level
- FST-2 handles the spelling changes (due to one rule) to the surface tape
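18FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST-1 with one branch for some regular nouns and one for some irregular nouns (arc labels include o:i); the PL feature is realized as the plural suffix on the intermediate tape]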
19Example
[Figure: two pairs of lexical and intermediate tapes. First pair: lexical f o x N PL. Second pair: lexical m o u s e N PL.]
20FST-2 for E-insertion (Intermediate <-> Surface)
- E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
- as in fox^s# <-> foxes
21Examples
[Figure: two pairs of intermediate and surface tapes. First pair: intermediate f o x + s. Second pair: intermediate b o x + i n g.]
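The two-stage cascade from the preceding slides (lexical to intermediate, then intermediate to surface) can be sketched with plain string functions rather than real transducers. Everything here is illustrative: the function names, the toy irregular lexicon, and the use of `^` and `#` as morpheme and word boundary symbols are assumptions made for the example.

```python
import re

IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}  # toy lexicon


def lexical_to_intermediate(stem, features):
    """Stand-in for FST-1: ('fox', ['N', 'PL']) -> 'fox^s#'."""
    if "PL" in features:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    return stem + "#"


def intermediate_to_surface(form):
    """Stand-in for FST-2: apply E-insertion, then erase boundary symbols."""
    form = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", form)
    return form.replace("^", "").replace("#", "")


def generate(stem, features):
    return intermediate_to_surface(lexical_to_intermediate(stem, features))


print(generate("fox", ["N", "PL"]))    # foxes
print(generate("mouse", ["N", "PL"]))  # mice
print(intermediate_to_surface("box^ing#"))  # boxing
```

Keeping the spelling rule in its own stage means the first stage never needs to know which stems trigger E-insertion.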
22Where are we?
23Final Scheme Part 1
24Final Scheme Part 2
25Intersection (FST1, FST2)
- States of FST1 and FST2: Q1 and Q2
- States of intersection: Q1 × Q2
- Transitions of FST1 and FST2: δ1, δ2
- Transitions of intersection: δ3
- For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff
- δ1(q1i, a:b) = q1n AND
- δ2(q2j, a:b) = q2m
[Figure: an a:b arc from q1i to q1n in FST1 and an a:b arc from q2j to q2m in FST2 give an a:b arc from (q1i,q2j) to (q1n,q2m) in the intersection]
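The transition rule for the intersection can be sketched directly over transition tables. This is a minimal illustration assuming each machine is given as a dict from (state, (a, b)) to a single next state; the function name `intersect` and the state names are hypothetical, and start and final states are left out.

```python
def intersect(d1, d2):
    """Build d3 over paired states: an arc exists only if both machines have it."""
    d3 = {}
    for (q1, label1), n1 in d1.items():
        for (q2, label2), n2 in d2.items():
            if label1 == label2:  # same a:b pair in both machines
                d3[((q1, q2), label1)] = (n1, n2)
    return d3


d1 = {("p0", ("a", "b")): "p1", ("p1", ("c", "c")): "p1"}
d2 = {("r0", ("a", "b")): "r1", ("r1", ("a", "a")): "r1"}
d3 = intersect(d1, d2)
print(d3)  # {(('p0', 'r0'), ('a', 'b')): ('p1', 'r1')}
```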
26Composition(FST1, FST2)
- States of FST1 and FST2: Q1 and Q2
- States of composition: Q1 × Q2
- Transitions of FST1 and FST2: δ1, δ2
- Transitions of composition: δ3
- For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff there exists c such that
- δ1(q1i, a:c) = q1n AND
- δ2(q2j, c:b) = q2m
[Figure: an a:c arc from q1i to q1n in FST1 and a c:b arc from q2j to q2m in FST2 give an a:b arc from (q1i,q2j) to (q1n,q2m) in the composition]
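The composition rule can be sketched the same way. This illustration again assumes transition tables as dicts from (state, (x, y)) to a next state, with hypothetical names throughout. Values in the result are sets because two different intermediate symbols c can yield the same a:b label, and ε gets no special treatment here, just as in the definition above.

```python
def compose(d1, d2):
    """Build d3: an a:b arc exists when FST1 has a:c and FST2 has c:b for some c."""
    d3 = {}
    for (q1, (a, c1)), n1 in d1.items():
        for (q2, (c2, b)), n2 in d2.items():
            if c1 == c2:  # output of FST1 matches input of FST2
                d3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return d3


d1 = {("p0", ("a", "c")): "p1"}
d2 = {("r0", ("c", "b")): "r1", ("r0", ("d", "b")): "r1"}
d3_comp = compose(d1, d2)
print(d3_comp)  # {(('p0', 'r0'), ('a', 'b')): {('p1', 'r1')}}
```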
27FSTs in Practice
- Install an FST package (pointers)
- Describe your formal language (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
- Your specification is compiled into a single FST
- Ref: Finite State Morphology (Beesley and Karttunen, 2003, CSLI Publications)
- Complexity/Coverage
- FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs
- Spanish (1996): 46 × 10^3 stems, 3.4 × 10^6 word forms
- Arabic (2002?): 131 × 10^3 stems, 7.7 × 10^6 word forms
28Other important applications of FST in NLP
- From segmenting words into morphemes to
- Tokenization
- Finding word boundaries in text (?!): maxmatch
- Finding sentence boundaries: punctuation helps, but "." is ambiguous; look at the example in Fig. 3.22
- Shallow syntactic parsing: e.g., find only noun phrases
- Phonological Rules (Chpt. ?11?)
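The maxmatch idea mentioned under tokenization can be sketched as a greedy longest-match loop. The function name, the toy vocabulary, and the fallback of emitting a single character when nothing matches are assumptions made for this illustration.

```python
def max_match(text, vocab):
    """Greedy segmentation: at each position take the longest word in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no word matches: emit one character
            i += 1
    return tokens


VOCAB = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", VOCAB))
# ['theta', 'bled', 'own', 'there']
```

The output shows why the slide flags this with "(?!)": the greedy choice of "theta" blocks the intended reading "the table down there".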
29Computational tasks in Morphology
- Recognition: recognize whether a string is an English word (FSA)
- Parsing/Generation: map between a word and its stem, class, and lexical features
- e.g., bought <-> buy V PAST-PART, or bought <-> buy V PAST
30Stemmer
- E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules
- (condition) S1 -> S2
- ATIONAL -> ATE (relational -> relate)
- (v) ING -> ε if stem contains vowel (motoring -> motor)
- Cascade of rules applied to computerization
- ization -> -ize: computerize
- ize -> ε: computer
- Errors occur
- organization -> organ, doing -> doe, university -> universe
Code freely available in most languages: Python, Java, ...
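The cascade of rewrite rules can be sketched with just the rules listed above. This is a toy and not the Porter algorithm: the real stemmer has many more rules and conditions, and the names `STAGES`, `has_vowel`, and `toy_stem`, as well as the grouping of rules into two stages, are assumptions made for the example.

```python
import re


def has_vowel(stem):
    """Condition (v): the stem contains at least one vowel."""
    return re.search(r"[aeiou]", stem) is not None


# Each stage is a list of (suffix, replacement, condition) rules.
STAGES = [
    [("ational", "ate", None), ("ization", "ize", None), ("ing", "", has_vowel)],
    [("ize", "", None)],
]


def toy_stem(word):
    """Run the stages in order; at most one rule fires per stage."""
    for stage in STAGES:
        for suffix, replacement, condition in stage:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)]
                if condition is None or condition(stem):
                    word = stem + replacement
                    break
    return word


for w in ["relational", "motoring", "computerization", "organization", "sing"]:
    print(w, "->", toy_stem(w))
```

The "organization" case reproduces one of the errors listed above: the cascade strips it down to "organ". The vowel condition is what keeps "sing" from being reduced to "s".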
31Stemming mainly used in Information Retrieval
- Run a stemmer on the documents to be indexed
- Run a stemmer on users' queries
- Compute similarity between queries and documents
(based on stems they contain)
Seems to work especially well with smaller
documents
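The three steps above can be sketched end to end. This is only an illustration: `crude_stem` is a deliberately simple stand-in for a real stemmer, and Jaccard overlap of stem sets is one assumed choice of similarity measure, not something the slide specifies.

```python
def crude_stem(word):
    """Very rough suffix stripper, used here in place of a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Set of stems for a whitespace-tokenized, lowercased text."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Jaccard overlap between the stem sets of a query and a document."""
    q, d = stems(query), stems(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


print(similarity("indexing documents", "index document"))  # 1.0
print(similarity("cats", "dog"))  # 0.0
```

Without stemming, the first query and document share no exact word; after stemming they match on both terms.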
32Porter as an FST
- The original exposition of the Porter stemmer did not describe it as a transducer, but
- Each stage is a separate transducer
- The stages can be composed to get one big transducer
33Next Time
- Read handout
- Probability
- Stats
- Information theory
- Next Lecture
- Finish Chpt 3, 3.10-11
- Start Probabilistic Models for NLP (Chpt. 4, 4.1-4.2 and 5.9!)