Title: Fall 2005
1EECS 595 / LING 541 / SI 661
Natural Language Processing
- Fall 2005
- Lecture Notes 4
2Features and unification
3Introduction
- Grammatical categories have properties
- Constraint-based formalisms
- Example: *this flights (agreement is difficult to handle at the level of grammatical categories)
- Example: *many water (count/mass nouns)
- Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)
4Feature structures
[ CAT     NP
  NUMBER  SINGULAR
  PERSON  3 ]

[ CAT        NP
  AGREEMENT  [ NUMBER  SG
               PERSON  3 ] ]

Feature paths: <x agreement number>
5Unification
- [NUMBER SG] ⊔ [NUMBER SG] = [NUMBER SG]
- [NUMBER SG] ⊔ [NUMBER PL] fails
- [NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
- [NUMBER SG] ⊔ [PERSON 3] = ?
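The unification cases above can be tried out with a minimal sketch in which feature structures are nested Python dicts, atomic values are strings or integers, and `{}` stands for the empty (unspecified) value. The names `unify` and `follow` are illustrative, and reentrant (shared) structures are not modeled, so this is a simplification rather than a full unification algorithm.

```python
def unify(f, g):
    """Unify two feature structures; return None when unification fails."""
    # The empty structure carries no information and unifies with anything.
    if f == {}:
        return g
    if g == {}:
        return f
    # Atomic values unify only when they are identical.
    if not isinstance(f, dict) or not isinstance(g, dict):
        return f if f == g else None
    result = dict(f)
    for feature, value in g.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None  # one clash makes the whole unification fail
            result[feature] = merged
        else:
            result[feature] = value
    return result


def follow(fs, path):
    """Follow a feature path (a list of feature names) to its value."""
    for feature in path:
        fs = fs[feature]
    return fs


np = {"CAT": "NP", "AGREEMENT": {"NUMBER": "SG", "PERSON": 3}}
print(follow(np, ["AGREEMENT", "NUMBER"]))
print(unify({"NUMBER": "SG"}, {"PERSON": 3}))
```

In this sketch the last call merges the two structures, because they share no feature that could clash.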
6Agreement
- S → NP VP
  <NP AGREEMENT> = <VP AGREEMENT>
- Does this flight serve breakfast?
- Do these flights serve breakfast?
- S → Aux NP VP
  <Aux AGREEMENT> = <NP AGREEMENT>
7Agreement
- These flights
- This flight
- NP → Det Nominal
  <Det AGREEMENT> = <Nominal AGREEMENT>
- Verb → serve
  <Verb AGREEMENT NUMBER> = PL
- Verb → serves
  <Verb AGREEMENT NUMBER> = SG
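As a rough illustration of how an agreement constraint on NP → Det Nominal filters out *this flights, the sketch below stores an AGREEMENT structure with each word and checks that the two structures do not clash. The lexicon entries and the helper names `compatible` and `np_ok` are made up for this example.

```python
# Hypothetical mini-lexicon: each word carries a category and agreement features.
LEXICON = {
    "this": {"CAT": "Det", "AGREEMENT": {"NUMBER": "SG"}},
    "these": {"CAT": "Det", "AGREEMENT": {"NUMBER": "PL"}},
    "flight": {"CAT": "Nominal", "AGREEMENT": {"NUMBER": "SG"}},
    "flights": {"CAT": "Nominal", "AGREEMENT": {"NUMBER": "PL"}},
}


def compatible(agr1, agr2):
    """Two flat agreement structures are compatible if shared features match."""
    return all(agr1[f] == agr2[f] for f in agr1.keys() & agr2.keys())


def np_ok(det, nominal):
    """Check the constraint <Det AGREEMENT> = <Nominal AGREEMENT>."""
    return compatible(LEXICON[det]["AGREEMENT"], LEXICON[nominal]["AGREEMENT"])


for det, noun in [("this", "flight"), ("these", "flights"), ("this", "flights")]:
    print(det, noun, np_ok(det, noun))
```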
8Subcategorization
- VP → Verb
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = INTRANS
- VP → Verb NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = TRANS
- VP → Verb NP NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = DITRANS
9Regular Expressions and Automata
10Regular expressions
- Searching for woodchuck
- Searching for woodchucks with an optional final s
- Regular expressions
- Finite-state automata (singular: automaton)
11Regular expressions
- Basic regular expression patterns
- Perl-based syntax (slightly different from other notations for regular expressions)
- Disjunctions: [abc]
- Ranges: [A-Z]
- Negations: [^Ss]
- Optional characters: ? and *
- Wild cards: .
- Anchors: ^ and $, also \b and \B
- Disjunction, grouping, and precedence
12Writing correct expressions
- Exercise: write a Perl regular expression to match the English article the
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
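To see why the exercise needs successive refinements, the sketch below counts the matches of four of the candidate patterns on one sample sentence, using Python's `re` module (whose syntax for these operators matches Perl's). The sample sentence and the pattern labels are made up for illustration.

```python
import re

PATTERNS = {
    "plain": r"the",
    "case": r"[tT]he",
    "word_boundary": r"\b[tT]he\b",
    "explicit_context": r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
}

TEXT = "The other day the_dog bit them there."


def count_matches(name, text):
    """Count non-overlapping matches of the named pattern in text."""
    return sum(1 for _ in re.finditer(PATTERNS[name], text))


for name in PATTERNS:
    print(name, count_matches(name, TEXT))
```

The plain pattern also fires inside other, them, and there; `\b` misses the_dog because the underscore counts as a word character; the last pattern spells out the context explicitly but requires a non-letter after the word.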
13A more complex example
- Exercise: write a regular expression that will match any PC with more than 500 MHz and 32 Gb of disk space for less than $1000
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\b\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
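A minimal sketch of how the price, clock-speed, and disk-size patterns might be applied with Python's `re` module follows. The sample ad text and the function name `find_specs` are hypothetical. Two caveats: the leading `\b` of the price pattern is dropped here, because a word boundary cannot occur between a space and `$`; and the patterns only locate the figures, so comparing them against 500 MHz, 32 Gb, and $1000 would need extra code.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")


def find_specs(ad):
    """Return the first price, clock speed, and disk size found in ad (or None)."""
    found = {}
    for name, pattern in (("price", PRICE), ("speed", SPEED), ("disk", DISK)):
        match = pattern.search(ad)
        found[name] = match.group(0) if match else None
    return found


print(find_specs("Pentium PC, 600 MHz, 40 Gb disk, only $999.99"))
```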
14Advanced operators
15Substitutions and memory
s/colour/color/
- Memory (\1, \2, etc. refer back to matches)
s/([0-9]+)/<\1>/
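The same two substitutions can be written with Python's `re.sub`, where `\1` in the replacement refers back to the first parenthesized group. The wrapper names and the sample strings are illustrative only.

```python
import re


def americanize(text):
    """Equivalent of s/colour/color/ applied to every occurrence."""
    return re.sub(r"colour", "color", text)


def bracket_numbers(text):
    """Wrap every run of digits in angle brackets, using \\1 as memory."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


print(americanize("my favourite colour"))
print(bracket_numbers("the 35 boxes"))
```

Note that Perl's `s///` replaces only the first match unless the `g` flag is given, whereas `re.sub` replaces all matches by default.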
16Eliza (Weizenbaum, 1966)
- User: Men are all alike
- ELIZA: IN WHAT WAY
- User: They're always bugging us about something or other
- ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
- User: Well, my boyfriend made me come here
- ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
- User: He says I'm depressed much of the time
- ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
17Eliza-style regular expressions
Step 1: replace first-person references with second-person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
- s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
- s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
- s/.* all .*/IN WHAT WAY/
- s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
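The three steps can be sketched in a few lines of Python. This is a toy under stated assumptions: the pronoun-swap list is a tiny made-up sample, ranking is approximated by the order of the rule list rather than by scores, word boundaries replace the literal spaces of the slide's patterns, and the fallback reply is invented.

```python
import re

# Step 1: first-person to second-person swaps (hypothetical, far from complete).
PERSON_SWAPS = [
    (r"\bI'?m\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Step 2: reply rules; Step 3 is approximated by trying them in list order.
RULES = [
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance):
    """Return an ELIZA-style reply for one user utterance."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for pattern, reply in RULES:
        match = re.match(pattern, text, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 with the text captured by the rule
            return match.expand(reply).upper()
    return "PLEASE GO ON"  # invented fallback when no rule applies


print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```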
18Finite-state automata
- Finite-state automata (FSA)
- Regular languages
- Regular expressions
19Finite-state automata (machines)
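baa! baaa! baaaa! baaaaa! ...
[Figure: FSA for the sheep language, with states q0 to q4 (q4 is the final state) and transitions q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus an a-loop on q3; callouts label a state, a transition, and the final state]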
20Input tape
[Figure: input tape holding the symbols a b a ! b, with the machine in state q0]
21Finite-state automata
- Q: a finite set of N states q0, q1, ..., qN
- Σ: a finite input alphabet of symbols
- q0: the start state
- F: the set of final states
- δ(q,i): transition function
22State-transition tables
         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅
(∅ = no transition)
23The FSM toolkit and friends
- Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
- Download:
  http://www.research.att.com/sw/tools/fsm/tech.html
  http://www.research.att.com/sw/tools/lextools/
- Tutorial available
- 4 useful parts: FSM, Lextools, GRM, Dot (separate)
- /data2/tools/fsm-3.6/bin
- /data2/tools/lextools/bin
- /data2/tools/dot/bin
24D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
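A direct Python rendering of D-RECOGNIZE for the sheep-language machine might look as follows. The dict encoding of the transition table (missing keys stand for empty cells) and the name `d_recognize` are choices made for this sketch.

```python
# Transition table of the sheep-language FSA; absent keys mean "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = frozenset({4})


def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False  # empty table cell: reject immediately
        state = transitions[(state, symbol)]
    # End of input: accept only if the machine stopped in an accept state.
    return state in accept


for tape in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(tape, d_recognize(tape))
```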
25Adding a failing state
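[Figure: the sheep-language FSA (q0 to q4) extended with a fail state qF; every transition on b, a, or ! that is missing from the original machine now leads to qF]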
26Languages and automata
- Formal languages: regular languages, non-regular languages
- Deterministic vs. non-deterministic FSAs
- Epsilon (ε) transitions
27Using NFSAs to accept strings
- Backup: add markers at choice points, then possibly revisit unexplored markers
- Look-ahead: look ahead in input
- Parallelism: look at alternatives in parallel
28Using NFSAs
         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4        ∅    ∅      ∅    ∅
(∅ = no transition)
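Recognition with a non-deterministic machine can be framed as a search over (state, tape position) pairs. The sketch below keeps an agenda of such pairs; popping from the end gives depth-first search, popping from the front gives breadth-first search. The encoding (sets of target states, the empty string as the ε label) and the name `nd_recognize` are assumptions of this example.

```python
from collections import deque

EPSILON = ""  # label used in this sketch for epsilon transitions

# The non-deterministic sheep machine from the table: state 2 has two a-moves.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def nd_recognize(tape, transitions, start, accept, depth_first=True):
    """Search over (state, index) pairs; True if some path accepts the tape."""
    agenda = deque([(start, 0)])
    seen = {(start, 0)}  # avoids revisiting pairs, e.g. around epsilon loops
    while agenda:
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if index == len(tape) and state in accept:
            return True
        # Epsilon moves keep the tape position; symbol moves advance it.
        successors = [(s, index) for s in transitions.get((state, EPSILON), ())]
        if index < len(tape):
            successors += [
                (s, index + 1) for s in transitions.get((state, tape[index]), ())
            ]
        for item in successors:
            if item not in seen:
                seen.add(item)
                agenda.append(item)
    return False


print(nd_recognize("baaa!", NFSA, 0, {4}))
print(nd_recognize("ba!", NFSA, 0, {4}, depth_first=False))
```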
29More about FSAs
- Transducers
- Equivalence of DFSAs and NFSAs
- Recognition as search: depth-first, breadth-first
30Recognition using NFSAs
31Regular languages
- Operations on regular languages and FSAs: concatenation, closure, union
- Properties of regular languages (closed under concatenation, union, disjunction, intersection, difference, complementation, reversal, Kleene closure)
32An exercise
- J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
33Morphology and Finite-State Transducers
34Morphemes
- Stems, affixes
- Affixes: prefixes, suffixes, infixes (hingi (borrow) → humingi (agent) in Tagalog), circumfixes (sagen → gesagt in German)
- Concatenative morphology
- Templatic morphology (Semitic languages)
- lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
35Morphological analysis
36Inflectional morphology
- Tense, number, person, mood, aspect
- Five verb forms in English
- 40 forms in French
- Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html
- Up to 40,000 forms in Turkish (you cause X to cause Y to do Z)
37Derivational morphology
- Nominalization: computerization, appointee, killer, fuzziness
- Formation of adjectives: computational, embraceable, clueless
38Finite-state morphological parsing
- Cats: cat +N +PL
- Cat: cat +N +SG
- Cities: city +N +PL
- Geese: goose +N +PL
- Ducks: (duck +N +PL) or (duck +V +3SG)
- Merging: merge +V +PRES-PART
- Caught: (catch +V +PAST-PART) or (catch +V +PAST)
39Principles of morphological parsing
- Lexicon
- Morphotactics (e.g., plural follows noun)
- Orthography (easy → easier)
- Irregular nouns: e.g., geese, sheep, mice
- Irregular verbs: e.g., caught, ate, eaten
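The interplay of lexicon, morphotactics, orthography, and irregular forms can be illustrated with a toy parser. Everything here is a simplification made up for the example: the stem lists are tiny, only the -s and -ing suffixes are handled, and plain string operations stand in for finite-state transducers.

```python
# Hypothetical mini-lexicon.
NOUN_STEMS = {"cat", "city", "duck", "goose"}
VERB_STEMS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}


def parse(word):
    """Return all analyses of word that the toy lexicon and rules license."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))
    if word in NOUN_STEMS:
        analyses.append(f"{word} +N +SG")
    # Morphotactics: -s follows a noun (plural) or a verb (3SG).
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")  # orthography: cities -> city
    if word.endswith("s"):
        candidates.append(word[:-1])
    for stem in candidates:
        if stem in NOUN_STEMS:
            analyses.append(f"{stem} +N +PL")
        if stem in VERB_STEMS:
            analyses.append(f"{stem} +V +3SG")
    # -ing follows a verb; also try restoring a deleted final e (merging -> merge).
    if word.endswith("ing"):
        for stem in (word[:-3], word[:-3] + "e"):
            if stem in VERB_STEMS:
                analyses.append(f"{stem} +V +PRES-PART")
    return analyses


for w in ["cats", "cities", "geese", "ducks", "merging", "caught"]:
    print(w, parse(w))
```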
40FSA for adjectives
- Big, bigger, biggest
- Cool, cooler, coolest, coolly
- Red, redder, reddest
- Clear, clearer, clearest, clearly, unclear, unclearly
- Happy, happier, happiest, happily
- Unhappy, unhappier, unhappiest, unhappily
- What about unbig, redly, and realest?
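One way to rule out forms like unbig and redly is to split the adjective roots into classes and give each class its own path through the automaton. The sketch below works on morpheme sequences rather than spelled words, so spelling changes such as bigger or happier are ignored. The two root classes and their members are assumptions made for this example.

```python
# Hypothetical root classes: ROOT1 takes un- and -ly, ROOT2 takes neither.
ROOT1 = {"clear", "happy"}
ROOT2 = {"big", "red"}

# States: 0 start, 1 after un-, 2 after a ROOT1 root, 3 after a ROOT2 root,
# 4 after a suffix.
TRANSITIONS = {
    (0, "un"): 1,
    (0, "root1"): 2,
    (1, "root1"): 2,
    (0, "root2"): 3,
    (2, "er"): 4,
    (2, "est"): 4,
    (2, "ly"): 4,
    (3, "er"): 4,
    (3, "est"): 4,
}
FINAL_STATES = {2, 3, 4}


def symbol(morpheme):
    """Map a root to its class label; affixes are their own labels."""
    if morpheme in ROOT1:
        return "root1"
    if morpheme in ROOT2:
        return "root2"
    return morpheme


def accepts(morphemes):
    """Return True if the morpheme sequence is a licensed adjective form."""
    state = 0
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, symbol(morpheme)))
        if state is None:
            return False
    return state in FINAL_STATES


print(accepts(["un", "clear", "ly"]))
print(accepts(["un", "big"]))
```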
41Using FSA for recognition
- Is a string a legitimate word or not?
- Two-level morphology: lexical level, surface level (Koskenniemi 83)
- Finite-state transducers (FST) used for regular relations
- Inversion and composition of FST
42Orthographic rules
- Beg/begging
- Make/making
- Watch/watches
- Try/tries
- Panic/panicked
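The five spelling changes above can be approximated by a cascade of rewrite rules applied to an intermediate form in which `^` marks the morpheme boundary. This is only a sketch under simplifying assumptions: regular-expression substitutions stand in for transducers, the rule contexts are deliberately crude (the doubling rule would wrongly give visitting), and the rule order was chosen by hand.

```python
import re

# Each rule rewrites the intermediate form stem^suffix; order matters.
RULES = [
    (r"(s|x|z|ch|sh)\^s$", r"\1^es"),  # e-insertion: watch^s -> watch^es
    (r"([^aeiou])y\^s$", r"\1ie^s"),   # y-replacement: try^s -> trie^s
    (r"([^aeiou])([aeiou])([bdgmnprt])\^(?=[aeiou])", r"\1\2\3\3^"),  # doubling
    (r"c\^(?=[ei])", "ck^"),           # k-insertion: panic^ed -> panick^ed
    (r"e\^(?=[ei])", "^"),             # e-deletion: make^ing -> mak^ing
]


def spell(stem, suffix):
    """Apply the rule cascade to stem^suffix and return the surface form."""
    form = f"{stem}^{suffix}"
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form.replace("^", "")


for stem, suffix in [("beg", "ing"), ("make", "ing"), ("watch", "s"),
                     ("try", "s"), ("panic", "ed")]:
    print(stem, suffix, spell(stem, suffix))
```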
43Combining FST lexicon and rules
- Cascades of transducers: the output of one becomes the input of another
44Weighted Automata
45Phonetic symbols
46Using WFST for language modeling
- Phonetic representation
- Part-of-speech tagging
47Word Classes and Part-of-Speech Tagging
48Some POS statistics
- Preposition list from COBUILD
- Single-word particles
- Conjunctions
- Pronouns
- Modal verbs
49Tagsets for English
- Penn Treebank
- Other tagsets (see Week 1 slides)
50POS ambiguity
- Degrees of ambiguity (DeRose 1988)
- Rule-based POS tagging
- ENGTWOL (Voutilainen et al.)
- Sample rule
- Adverbial-That rule (it isn't that odd)
  Given input: that
  if
    (+1 A/ADV/QUANT)
    (+2 SENT-LIM)
    (NOT -1 SVOC/A) (not a verb like consider)
  then eliminate non-ADV tags
  else eliminate ADV tag
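A constraint of this kind can be read as a function that narrows the set of candidate tags for that. The sketch below assumes each token comes with a set of candidate tags; apart from ADV, A, QUANT, SENT-LIM, and SVOC/A, the tag names (PRON, V, DET, CS) are invented for the example, and this is not how ENGTWOL itself is implemented.

```python
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}


def adverbial_that(tokens, i):
    """Apply the adverbial-that constraint to the token at position i.

    tokens is a list of (word, candidate_tag_set) pairs; the return value is
    the reduced tag set for tokens[i], which is assumed to be the word that.
    """
    tags = tokens[i][1]
    next_ok = i + 1 < len(tokens) and bool(tokens[i + 1][1] & ADJ_ADV_QUANT)
    limit_ok = i + 2 < len(tokens) and "SENT-LIM" in tokens[i + 2][1]
    prev_ok = i == 0 or "SVOC/A" not in tokens[i - 1][1]
    if next_ok and limit_ok and prev_ok:
        return tags & {"ADV"}  # eliminate non-ADV tags
    return tags - {"ADV"}      # eliminate the ADV tag


THAT_TAGS = {"ADV", "DET", "CS", "PRON"}
sentence = [
    ("it", {"PRON"}),
    ("isn't", {"V"}),
    ("that", THAT_TAGS),
    ("odd", {"A"}),
    (".", {"SENT-LIM"}),
]
print(adverbial_that(sentence, 2))
```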
51Evaluating POS taggers
- Percent correct
- What is the lower bound on a system's performance?
- What about the upper bound?
52 Kappa
- N: number of items (index i)
- n: number of categories (index j)
- k: number of annotators
- when kappa > .8, agreement is considered high
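The formula itself did not survive in these notes. Given the three quantities listed (N items, n categories, k annotators), one plausible reading is a multi-annotator kappa in the style of Fleiss, which compares observed agreement P(A) with chance agreement P(E) as (P(A) - P(E)) / (1 - P(E)). The sketch below implements that reading as an assumption, not as the definition used in the lecture.

```python
def fleiss_kappa(counts):
    """Multi-annotator kappa.

    counts[i][j] is the number of annotators who put item i in category j;
    every row must sum to the same number of annotators k.
    """
    N = len(counts)     # number of items
    n = len(counts[0])  # number of categories
    k = sum(counts[0])  # number of annotators
    # Observed agreement: mean proportion of agreeing annotator pairs per item.
    p_a = sum(
        (sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts
    ) / N
    # Chance agreement from the overall share of each category.
    shares = [sum(row[j] for row in counts) / (N * k) for j in range(n)]
    p_e = sum(p * p for p in shares)
    return (p_a - p_e) / (1 - p_e)


# Two items, two categories, three annotators in full agreement on each item.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

The value is undefined when every annotation falls into a single category, since chance agreement is then 1.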
53Midterm reading list
- Chapter 1: Introduction
- Chapter 2: Regular expressions and automata
- Chapter 3: Morphology and finite-state transducers; FSM tutorial
- Chapter 8: Word classes and POS tagging
- Chapter 9: Context-free grammars for English
- Chapter 10: Parsing with context-free grammars
- Chapter 11: Features and unification
54Syntaxscape
- Written by Juno Suk of Lucent
- http://www.cs.columbia.edu/radev/syntaxscape/
56Read by yourselves
- 9.9. Spoken language syntax
- 9.10. Grammar equivalence
- 9.11. Finite-state and context-free grammars