CPSC 503 Computational Linguistics

1
CPSC 503 Computational Linguistics
  • Lecture 3
  • Giuseppe Carenini

2
  • Subscribe to mailing list
  • Some more Intros
  • NLP_at_UBC

3
Introductions
  • Your Name
  • Previous experience in NLP?
  • Why are you interested in NLP?
  • Are you thinking of NLP as your main research
    area? If not, what else do you want to specialize
    in?
  • Anything else

4
NLP research at UBC
  • TOPICS
  • Generation and Summarization of Evaluative Text
    (e.g., customer reviews)
  • Summarization of conversations (emails, blogs,
    meetings)
  • PEOPLE: G. Carenini, R. Ng (Profs), G. Murray
    (Postdoc), Students
  • SUPPORT: NSERC, Google, BObjects (now SAP),
    MSResearch

5
Linguistic Knowledge: Formalisms and associated Algorithms
  • (English) Morphology: State Machines (no prob.),
    i.e., Finite State Automata (and Regular
    Expressions) and Finite State Transducers
  • Syntax: Rule systems (and prob. version), e.g.,
    (Prob.) Context-Free Grammars
  • Semantics: Logical formalisms (First-Order Logics)
  • Pragmatics, Discourse and Dialogue: AI planners

6
Computational tasks in Morphology
  • Recognition: recognize whether a string is an
    English word (FSA)
  • Parsing/Generation: word <-> stem, class, lexical
    features
  • e.g., bought <-> buy +V +PAST-PART or buy +V +PAST
  • Stemming: word -> stem

7
Today (Sept 15)
  • Finite State Transducers (FSTs) and Morphological
    Parsing
  • Stemming (Porter Stemmer)

8
FST definition
  • Q: a finite set of states
  • I, O: input and output alphabets (which may
    include ε)
  • Σ: a finite alphabet of complex symbols i:o, with
    i ∈ I and o ∈ O
  • q0: the start state
  • F: a set of accept/final states (F ⊆ Q)
  • A transition relation δ that maps Q × Σ to 2^Q

E.g., |Q| = 3, I = {a, b, c, ε}, O = {a, b}: |Σ| = ?
0 ≤ |δ| ≤ ?
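
The definition above can be sketched directly as a data structure. This is a minimal illustration, not a reference implementation: the class name `FST`, the method names, the use of the empty string for ε, and the toy "cat" machine are all assumptions made for the example. The state set Q is left implicit in the arcs.

```python
from dataclasses import dataclass, field

EPS = ""  # epsilon is modelled as the empty string (an assumption of this sketch)


@dataclass
class FST:
    start: object
    finals: set
    # delta maps (state, (i, o)) to a set of next states: Q x Sigma -> 2^Q
    delta: dict = field(default_factory=dict)

    def add_arc(self, src, i, o, dst):
        self.delta.setdefault((src, (i, o)), set()).add(dst)

    def transduce(self, text):
        """Return every output string the machine can produce for `text`.

        Assumes there is no cycle of arcs whose input side is epsilon.
        """
        results = set()
        stack = [(self.start, 0, "")]  # (state, input position, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(text) and state in self.finals:
                results.add(out)
            for (src, (i, o)), targets in self.delta.items():
                if src == state and text.startswith(i, pos):
                    for dst in targets:
                        stack.append((dst, pos + len(i), out + o))
        return results


# Hypothetical toy machine: maps "cat+N+PL" to "cats" and "cat+N+SG" to "cat".
toy = FST(start=0, finals={5})
for n, ch in enumerate("cat"):
    toy.add_arc(n, ch, ch, n + 1)
toy.add_arc(3, "+N", EPS, 4)
toy.add_arc(4, "+PL", "s", 5)
toy.add_arc(4, "+SG", EPS, 5)

print(toy.transduce("cat+N+PL"))
```

Returning a set of outputs reflects the fact that δ maps into 2^Q: the machine may be nondeterministic, so one input can have zero, one, or several outputs.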
9
FST can be used as
  • Translators: input one string from I, output
    another from O (or vice versa)
  • Recognizers: input a string from I x O
  • Generators: output a string from I x O

Terminology warning!
E.g., if I = {a, b, c, ε}, O = {a, b}
10
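FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns]
Notes
  • X -> X:X (a single symbol X on an arc abbreviates X:X)
  • Arc labels read lexical:surface
  • Irregular-noun arcs include pairs such as o:i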
11
Examples
  • surface: m i c e; lexical: ?
  • lexical: c a t +N +PL; surface: ?
12
Computational Morphology Problems/Challenges
  • Ambiguity: one word can correspond to multiple
    structures (more critical in morphologically
    richer languages)
  • Spelling changes may occur when two morphemes
    are combined
  • e.g., butterfly + -s -> butterflies

13
Ambiguity: a more complex example
  • What's the right parse for Unionizable?
  • Union-ize-able
  • Un-ion-ize-able
  • Each would represent a valid path through an FST
    for derivational morphology.
  • Both are adjectives

14
Dealing with Morphological Ambiguity
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Then use part-of-speech tagging to choose: look at
    the neighboring words
15
(2) Spelling Changes
  • When morphemes are combined inflectionally, the
    spelling at the boundaries may change
  • Examples
  • E-insertion: when -s is added to a word, -e is
    inserted if the word ends in -s, -z, -sh, -ch, -x
    (e.g., kiss, miss, waltz, bush, watch, rich, box)
  • Y-replacement: when -s or -ed is added to a word
    ending in -y, -y changes to -ie or -i
    respectively (e.g., try, butterfly)
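
The two spelling rules can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `attach` is made up for the example, and the check that -y follows a consonant (so that "toy" gives "toys") is an addition that the slide does not state.

```python
# Endings that trigger e-insertion before -s, as listed on the slide.
E_INSERTION_ENDINGS = ("s", "z", "sh", "ch", "x")


def attach(stem, suffix):
    """Attach the suffix "s" or "ed" to a stem, applying the two spelling rules."""
    # E-insertion: kiss + s -> kisses, box + s -> boxes
    if suffix == "s" and stem.endswith(E_INSERTION_ENDINGS):
        return stem + "es"
    # Y-replacement: try + s -> tries, try + ed -> tried.
    # Assumption (not on the slide): only when -y follows a consonant.
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        if suffix == "s":
            return stem[:-1] + "ies"
        if suffix == "ed":
            return stem[:-1] + "ied"
    return stem + suffix


for word in ("kiss", "waltz", "watch", "box", "try", "butterfly", "cat"):
    print(word, "->", attach(word, "s"))
```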

16
Solution: Multi-Tape Machines
  • Add intermediate tape
  • Use the output of one tape machine as the input
    to the next
  • Add intermediate symbols
  • ^ morpheme boundary
  • # word boundary

17
Multi-Level Tape Machines
[Figure: FST-1 links the lexical and intermediate tapes; FST-2 links the intermediate and surface tapes]
  • FST-1 translates between the lexical and the
    intermediate level
  • FST-2 handles the spelling changes (due to one
    rule) to the surface tape
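
The two-stage cascade can be sketched with two plain Python functions standing in for FST-1 and FST-2. Everything here is an assumption made for illustration: the function names, the tiny irregular-plural lexicon, the "+N+PL" / "+N+SG" lexical notation, and the use of ^ for the morpheme boundary and # for the word boundary. Real systems compile such rules into actual transducers.

```python
import re

# Hypothetical toy lexicon of irregular plurals.
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(lexical):
    """Stand-in for FST-1: 'fox+N+PL' -> 'fox^s#', 'mouse+N+PL' -> 'mice#'."""
    stem, _, features = lexical.partition("+N")
    if features == "+PL":
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    return stem + "#"


def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: apply e-insertion, then erase boundary symbols."""
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical):
    """Feed the output of the first stage into the second."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for form in ("fox+N+PL", "cat+N+PL", "mouse+N+PL", "cat+N+SG"):
    print(form, "->", lexical_to_intermediate(form), "->", generate(form))
```

Keeping the spelling rule in its own stage means the first stage never needs to know about orthography; it only inserts boundary symbols.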

18
FST-1 for inflectional morphology of plural
(Lexical <-> Intermediate)
[Figure: FST-1 with paths for some regular nouns and some irregular nouns; the arc labels are only partly recoverable (e.g., +PL paired with s, and o:i)]
19
Example
  • lexical: f o x +N +PL; intermediate: ?
  • lexical: m o u s e +N +PL; intermediate: ?
20
FST-2 for E-insertion (Intermediate <-> Surface)
  • E-insertion: when -s is added to a word, -e is
    inserted if the word ends in -s, -z, -sh, -ch, -x
  • as in fox^s# <-> foxes

[Figure: E-insertion transducer]
21
Examples
  • intermediate: f o x ^ s #; surface: ?
  • intermediate: b o x ^ i n g #; surface: ?
22
Where are we?

23
Final Scheme Part 1
24
Final Scheme Part 2
25
Intersection (FST1, FST2)
  • States of FST1 and FST2: Q1 and Q2
  • States of intersection: (Q1 x Q2)
  • Transitions of FST1 and FST2: δ1, δ2
  • Transitions of intersection: δ3
  • For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) =
    (q1_n, q2_m) iff
  • δ1(q1_i, a:b) = q1_n AND
  • δ2(q2_j, a:b) = q2_m

[Figure: arcs q1_i -a:b-> q1_n and q2_j -a:b-> q2_m combine into (q1_i, q2_j) -a:b-> (q1_n, q2_m)]
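
The intersection construction can be sketched as a product of two transition tables. This is a minimal sketch under assumptions: each machine is a plain dict with hypothetical keys "start", "finals" and "delta"; transitions are deterministic, as in the slide's notation; and each pair a:b is treated as one atomic symbol with no ε on either side.

```python
def intersect(t1, t2):
    """Product machine: a pair-state moves on a:b only if both machines do."""
    delta3 = {}
    for (q1, label1), n1 in t1["delta"].items():
        for (q2, label2), n2 in t2["delta"].items():
            if label1 == label2:
                delta3[((q1, q2), label1)] = (n1, n2)
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "delta": delta3,
    }


def accepts(t, pairs):
    """Run the machine as a recognizer over a sequence of (input, output) pairs."""
    state = t["start"]
    for pair in pairs:
        if (state, pair) not in t["delta"]:
            return False
        state = t["delta"][(state, pair)]
    return state in t["finals"]


AB, CD = ("a", "b"), ("c", "d")
# t1 accepts any number of a:b followed by c:d.
t1 = {"start": 0, "finals": {1}, "delta": {(0, AB): 0, (0, CD): 1}}
# t2 accepts an odd number of a:b followed by c:d.
t2 = {"start": "p", "finals": {"r"},
      "delta": {("p", AB): "q", ("q", AB): "p", ("q", CD): "r"}}

both = intersect(t1, t2)
print(accepts(both, [AB, CD]), accepts(both, [AB, AB, CD]))
```

A pair sequence is accepted by the product machine exactly when both original machines accept it.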
26
Composition(FST1, FST2)
  • States of FST1 and FST2: Q1 and Q2
  • States of composition: Q1 x Q2
  • Transitions of FST1 and FST2: δ1, δ2
  • Transitions of composition: δ3
  • For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) =
    (q1_n, q2_m) iff
  • There exists c such that
  • δ1(q1_i, a:c) = q1_n AND
  • δ2(q2_j, c:b) = q2_m

[Figure: arcs q1_i -a:c-> q1_n and q2_j -c:b-> q2_m combine into (q1_i, q2_j) -a:b-> (q1_n, q2_m)]
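
Composition can be sketched the same way, matching the output side of the first machine against the input side of the second. Again this is a sketch under assumptions: machines are plain dicts with hypothetical keys "start", "finals" and "delta"; δ maps to a set of next states, because different intermediate symbols c can yield the same a:b; and ε is not handled.

```python
def compose(t1, t2):
    """Build a machine with an a:b arc wherever t1 has a:c and t2 has c:b."""
    delta3 = {}
    for (q1, (a, c1)), nexts1 in t1["delta"].items():
        for (q2, (c2, b)), nexts2 in t2["delta"].items():
            if c1 == c2:
                targets = delta3.setdefault(((q1, q2), (a, b)), set())
                targets.update((n1, n2) for n1 in nexts1 for n2 in nexts2)
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "delta": delta3,
    }


def translate(t, symbols):
    """Return all outputs for an input sequence (epsilon-free machines only)."""
    frontier = {(t["start"], "")}
    for sym in symbols:
        frontier = {
            (dst, out + o)
            for state, out in frontier
            for (q, (i, o)), targets in t["delta"].items()
            if q == state and i == sym
            for dst in targets
        }
    return {out for state, out in frontier if state in t["finals"]}


# t1 rewrites a as b; t2 rewrites b as c; their composition rewrites a as c.
t1 = {"start": 0, "finals": {0}, "delta": {(0, ("a", "b")): {0}}}
t2 = {"start": 0, "finals": {0}, "delta": {(0, ("b", "c")): {0}}}
t3 = compose(t1, t2)
print(translate(t3, "aa"))
```

The composed machine never materializes the intermediate string, which is the point of composing a cascade into a single transducer.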
27
FSTs in Practice
  • Install an FST package (pointers)
  • Describe your formal language (e.g., lexicon,
    morphotactics, and rules) in a RegExp-like notation
    (pointer)
  • Your specification is compiled into a single FST
  • Ref: Finite State Morphology (Beesley and
    Karttunen, 2003, CSLI Publications)
  • Complexity/Coverage
  • FSTs for the morphology of a natural language may
    have 10^5 to 10^7 states and arcs
  • Spanish (1996): 46 x 10^3 stems, 3.4 x 10^6 word
    forms
  • Arabic (2002?): 131 x 10^3 stems, 7.7 x 10^6 word
    forms

28
Other important applications of FST in NLP
  • From segmenting words into morphemes to:
  • Tokenization
  • Finding word boundaries in text (?!): maxmatch
  • Finding sentence boundaries: punctuation helps,
    but "." is ambiguous (see the example in Fig. 3.22)
  • Shallow syntactic parsing, e.g., find only noun
    phrases
  • Phonological Rules (Chpt. ?11?)
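
The maxmatch idea for finding word boundaries can be sketched as a greedy longest-match loop. This is a minimal sketch: the function name `max_match`, the single-character fallback for unknown text, and the tiny word list are assumptions made for the example.

```python
def max_match(text, dictionary):
    """Greedy segmentation: repeatedly take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


words = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", words))
```

On this input the greedy choice picks "theta" rather than "the", so the result is "theta bled own there" rather than "the table down there". This shows why maxmatch can fail on languages like English where short words are common.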

29
Computational tasks in Morphology
  • Recognition: recognize whether a string is an
    English word (FSA)
  • Parsing/Generation: word <-> stem, class, lexical
    features
  • e.g., bought <-> buy +V +PAST-PART or buy +V +PAST
  • Stemming: word -> stem

30
Stemmer
  • E.g., the Porter algorithm, which is based on a
    series of sets of simple cascaded rewrite rules
  • (condition) S1 -> S2
  • ATIONAL -> ATE (relational -> relate)
  • (*v*) ING -> ε if stem contains vowel (motoring ->
    motor)
  • Cascade of rules applied to computerization
  • -ization -> -ize: computerize
  • -ize -> ε: computer
  • Errors occur
  • organization -> organ, doing -> doe, university ->
    universe

Code freely available in most languages: Python,
Java, ...
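
The cascade of rewrite rules can be sketched with a handful of suffix rules applied in sequence. This is not the Porter algorithm: it implements only the four rules quoted on the slide, omits Porter's other conditions and steps, and all function names are made up for the illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def rule_ational(word):
    # ATIONAL -> ATE
    return word[:-7] + "ate" if word.endswith("ational") else word


def rule_ization(word):
    # IZATION -> IZE
    return word[:-7] + "ize" if word.endswith("ization") else word


def rule_ize(word):
    # IZE -> (nothing)
    return word[:-3] if word.endswith("ize") else word


def rule_ing(word):
    # (*v*) ING -> (nothing): only if the remaining stem contains a vowel
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


# Each stage rewrites the output of the previous one.
STAGES = [rule_ational, rule_ization, rule_ize, rule_ing]


def toy_stem(word):
    for rule in STAGES:
        word = rule(word)
    return word


for w in ("relational", "motoring", "computerization", "organization", "sing"):
    print(w, "->", toy_stem(w))
```

The "organization" case reproduces one of the errors listed on the slide: the cascade conflates it with "organ".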
31
Stemming: mainly used in Information Retrieval
  1. Run a stemmer on the documents to be indexed
  2. Run a stemmer on users' queries
  3. Compute similarity between queries and documents
    (based on the stems they contain)

Seems to work especially well with smaller
documents
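
The three steps can be sketched end to end with a deliberately crude stemmer. Everything here is an assumption for illustration: `crude_stem` is a stand-in suffix stripper (not Porter), and Jaccard overlap of stem sets is just one simple choice of similarity measure.

```python
def crude_stem(word):
    """Hypothetical stand-in stemmer: strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: reduce a document or a query to its set of stems."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


def rank(query, documents):
    return sorted(documents, key=lambda doc: similarity(query, doc), reverse=True)


docs = ["stemming algorithms for indexing", "cooking recipes"]
print(rank("stemming algorithm", docs))
```

Without stemming, the query term "algorithm" would not match "algorithms" in the first document. Mapping both to the same stem is what lets the overlap register.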
32
Porter as an FST
  • The original exposition of the Porter stemmer did
    not describe it as a transducer, but:
  • Each stage is a separate transducer
  • The stages can be composed to get one big
    transducer

33
Next Time
  • Read handout
  • Probability
  • Stats
  • Information theory
  • Next Lecture
  • Finish Chpt. 3 (3.10-11)
  • Start Probabilistic Models for NLP (Chpt. 4, 4.1-4.2
    and 5.9!)