Title: SIMS 290-2: Applied Natural Language Processing
1SIMS 290-2 Applied Natural Language Processing
Marti Hearst Sept 8, 2004
2Today
- Tokenizing using Regular Expressions
- Elementary Morphology
- Frequency Distributions in NLTK
3Tokenizing in NLTK
- The Whitespace Tokenizer doesn't work very well
- What are some of the problems?
- NLTK provides an easy way to incorporate regexs into your tokenizer
- Uses Python's regex package (re)
- http://docs.python.org/lib/re-syntax.html
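To make the whitespace problems concrete, here is a minimal sketch in plain Python. It uses `str.split` as a stand-in for a whitespace tokenizer (it is not NLTK's own class), and the example sentence is an assumption chosen for illustration.

```python
# Stand-in for a whitespace tokenizer: split on runs of whitespace only.
# The sentence is a made-up example, not taken from the course data.
text = "This is the girl's depart- ment store. It costs $19.99 (plus tax)."
tokens = text.split()

# Punctuation stays glued to words ("store.", "(plus", "tax).") and the
# end-of-line hyphenated word is split into "depart-" and "ment".
print(tokens)
```

Sentence-final periods, parentheses, and line-break hyphens all end up inside or between tokens, which is what the regex-based approach tries to repair.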
4Regexs for Tokenizing
- Build up your recognizer piece by piece
- Make a string of regexs combined with ORs
- Put each one in a group (surrounded by parens)
- Things to recognize
- urls
- words with hyphens in them
- words in which hyphens should be removed (end of line hyphens)
- Numerical terms
- Words with apostrophes
5Regexs for Tokenizing
- Here are some I put together
- url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
- Allows port number but no argument variables.
- hyphen = r'(\w+\-\s?\w+)'
- Allows for a space after the hyphen
- apostro = r'(\w+\'\w+)'
- numbers = r'((\$|#)?\d+(\.)?\d+%?)'
- Needs to handle large numbers with commas
- punct = r'([^\w\s]+)'
- wordr = r'(\w+)'
- A nice Python trick
- regexp = string.join([url, hyphen, apostro, numbers, wordr, punct], "|")
- Makes one string in which a | goes in between each substring
6Regexs for Tokenizing
- More code
- import string
- from nltk.token import *
- from nltk.tokenizer import *
- t = Token(TEXT='This is the girl\'s depart- ment store.')
- regexp = string.join([url, hyphen, apostro, numbers, wordr, punct], "|")
- RegexpTokenizer(regexp, SUBTOKENS='WORDS').tokenize(t)
- print t['WORDS']
- [<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
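The same idea can be sketched with only Python's `re` module, assuming a current Python 3 rather than the 2004 NLTK Token API used on the slide. The patterns below are written out in full as a best reading of the slide (the `numbers` pattern in particular is an assumption), and `tokenize` is a hypothetical helper name.

```python
import re

# One parenthesized group per token type, tried in this order.
url = r"((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)"
hyphen = r"(\w+\-\s?\w+)"            # allows a space after the hyphen
apostro = r"(\w+'\w+)"
numbers = r"((\$|#)?\d+(\.)?\d+%?)"  # assumed form; no commas handled
wordr = r"(\w+)"
punct = r"([^\w\s]+)"

# Join the pieces with "|" so the alternatives are tried left to right.
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])
TOKEN_RE = re.compile(regexp)

def tokenize(text):
    """Return the matched substrings, scanning the text left to right."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]

print(tokenize("This is the girl's depart- ment store."))
```

Order matters: because `wordr` comes after `hyphen` and `apostro`, "girl's" and "depart- ment" are kept whole instead of being broken at the apostrophe or hyphen.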
7Tokenization Issues
- Sentence Boundaries
- Include parens around sentences?
- What about quotation marks around sentences?
- Periods: end of line or not?
- We'll study this in detail in a couple of weeks.
- Proper Names
- What to do about
- New York-New Jersey train?
- California Governor Arnold Schwarzenegger?
- Clitics and Contractions
8Morphology
- Morphology
- The study of the way words are built up from smaller meaning units.
- Morphemes
- The smallest meaningful unit in the grammar of a language.
- Contrasts
- Derivational vs. Inflectional
- Regular vs. Irregular
- Concatenative vs. Templatic (root-and-pattern)
- A useful resource
- Glossary of linguistic terms by Eugene Loos
- http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
9Examples (English)
- unladylike
- 3 morphemes, 4 syllables
- un-: not
- lady: (well behaved) female adult human
- -like: having the characteristics of
- Can't break any of these down further without distorting the meaning of the units
- technique
- 1 morpheme, 2 syllables
- dogs
- 2 morphemes, 1 syllable
- -s, a plural marker on nouns
10Morpheme Definitions
- Root
- The portion of the word that
- is common to a set of derived or inflected forms, if any, when all affixes are removed
- is not further analyzable into meaningful elements
- carries the principal portion of meaning of the words
- Stem
- The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
- Affix
- A bound morpheme that is joined before, after, or within a root or stem.
- Clitic
- a morpheme that functions syntactically like a word, but does not appear as an independent phonological word
- Spanish: un beso, las aguas
- English: Hal's (genitive marker)
11Inflectional vs. Derivational
- Word Classes
- Parts of speech: noun, verb, adjective, etc.
- Word class dictates how a word combines with morphemes to form new words
- Inflection
- Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
- Doesn't change the word class
- Usually produces a predictable, nonidiosyncratic change of meaning.
- Derivation
- The formation of a new word or inflectable stem from another word or stem.
12Inflectional Morphology
- Adds
- tense, number, person, mood, aspect
- Word class doesn't change
- Word serves new grammatical role
- Examples
- come is inflected for person and number
- The pizza guy comes at noon.
- las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s
- las manzanas rojas (the red apples)
13Derivational Morphology
- Nominalization (formation of nouns from other parts of speech, primarily verbs in English)
- computerization
- appointee
- killer
- fuzziness
- Formation of adjectives (primarily from nouns)
- computational
- clueless
- embraceable
- Difficult cases
- building: from which sense of build?
- A resource
- CatVar: Categorial Variation Database
- http://clipdemos.umiacs.umd.edu/catvar
14Concatenative Morphology
- Morpheme+Morpheme+Morpheme
- Stems: also called lemma, base form, root, lexeme
- hope+ing -> hoping, hop -> hopping
- Affixes
- Prefixes: Antidisestablishmentarianism (anti-, dis-)
- Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
- Infixes: hingi (borrow), humingi (borrower) in Tagalog
- Circumfixes: sagen (say), gesagt (said) in German
- Agglutinative Languages
- uygarlastiramadiklarimizdanmissinizcasina
- uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina
- Behaving as if you are among those whom we could not cause to become civilized
15Templatic Morphology
- Roots and Patterns
- Example: Hebrew verbs
- Root
- Consists of 3 consonants CCC
- Carries basic meaning
- Template
- Gives the ordering of consonants and vowels
- Specifies semantic information about the verb
- Active, passive, middle voice
- Example
- lmd (to learn or study)
- CaCaC -> lamad (he studied)
- CiCeC -> limed (he taught)
- CuCaC -> lumad (he was taught)
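The root-and-pattern idea can be sketched as slot filling: each C in the template takes the next consonant of the root. This is a toy illustration under that assumption, not a model of Hebrew morphology, and `apply_template` is a hypothetical name.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Vowels in the template are copied through unchanged.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

The root lmd never appears as a contiguous substring of the output, which is what distinguishes templatic from concatenative morphology.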
16Nouns and Verbs (in English)
- Nouns have simple inflectional morphology
- cat
- cats, cat's
- Verbs have more complex morphology
17Nouns and Verbs (in English)
- Nouns
- Have simple inflectional morphology
- Cat/Cats
- Mouse/Mice, Ox/Oxen, Goose/Geese
- Verbs
- More complex morphology
- Walk/Walked
- Go/Went, Fly/Flew
18Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped
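The regular forms in the table follow a few spelling rules (drop a final e before -ing, change consonant+y to -ies/-ied, double a final consonant after a single vowel). The sketch below encodes those rules as rough heuristics. The function names are hypothetical and the doubling test ignores stress, so it covers the table's stems but not English in general.

```python
VOWELS = "aeiou"

def ends_in_consonant_y(stem):
    return len(stem) > 1 and stem[-1] == "y" and stem[-2] not in VOWELS

def doubles_final_consonant(stem):
    # Rough consonant-vowel-consonant check (map, hop); it ignores stress,
    # so it would wrongly double the t in a word like "visit".
    return (len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS)

def s_form(stem):
    if ends_in_consonant_y(stem):
        return stem[:-1] + "ies"          # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"

def ing_form(stem):
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"          # merge -> merging
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ing"    # map -> mapping
    return stem + "ing"

def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if ends_in_consonant_y(stem):
        return stem[:-1] + "ied"          # try -> tried
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"

for stem in ("walk", "merge", "try", "map"):
    print(stem, s_form(stem), ing_form(stem), ed_form(stem))
```

Irregular verbs such as eat or catch cannot be produced this way and need lexicon entries instead.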
19Irregular (English) Verbs
Morphological Form Classes   Irregularly Inflected Verbs
Stem                         eat      catch     cut
-s form                      eats     catches   cuts
-ing form                    eating   catching  cutting
Past form                    ate      caught    cut
-ed participle               eaten    caught    cut
20To love in Spanish
21Syntax and Morphology
- Phrase-level agreement
- Subject-Verb
- John studies hard (STUDY+3SG)
- Noun-Adjective
- Las vacas hermosas
- Sub-word phrasal structures
- That+in+book+PL+Poss+1PL
- Which are in our books
22Phonology and Morphology
- Script Limitations
- Spoken English has 14 vowels
- heed hid hayed head had hoed hood who'd hide how'd taught Tut toy enough
- English Alphabet has 5
- Use vowel combinations: far fair fare
- Consonantal doubling (hopping vs. hoping)
23Computational Morphology
- Approaches
- Lexicon only
- Rules only
- Lexicon and Rules
- Finite-state Automata
- Finite-state Transducers
- Systems
- WordNet's morphy
- PCKimmo
- Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
- Accurate but complex
- http://www.sil.org/pckimmo/
- Two-level morphology
- Commercial version available from InXight Corp.
- Background
- Chapter 3 of Jurafsky and Martin
- A short history of Two-Level Morphology
- http://www.ling.helsinki.fi/koskenni/esslli-2001-karttunen/
24Porter Stemmer
- Discount morphology
- So not all that accurate
- Uses a series of cascaded rewrite rules
- ATIONAL -> ATE
- (relational -> relate)
- ING -> ε if stem contains vowel
- (motoring -> motor)
25Porter Stemmer
- Step 4: Derivational Morphology I: Multiple Suffixes
- (m>0) ATIONAL -> ATE    relational -> relate
- (m>0) TIONAL -> TION    conditional -> condition, rational -> rational
- (m>0) ENCI -> ENCE      valenci -> valence
- (m>0) ANCI -> ANCE      hesitanci -> hesitance
- (m>0) IZER -> IZE       digitizer -> digitize
- (m>0) ABLI -> ABLE      conformabli -> conformable
- (m>0) ALLI -> AL        radicalli -> radical
- (m>0) ENTLI -> ENT      differentli -> different
- (m>0) ELI -> E          vileli -> vile
- (m>0) OUSLI -> OUS      analogousli -> analogous
- (m>0) IZATION -> IZE    vietnamization -> vietnamize
- (m>0) ATION -> ATE      predication -> predicate
- (m>0) ATOR -> ATE       operator -> operate
- (m>0) ALISM -> AL       feudalism -> feudal
- (m>0) IVENESS -> IVE    decisiveness -> decisive
- (m>0) FULNESS -> FUL    hopefulness -> hopeful
- (m>0) OUSNESS -> OUS    callousness -> callous
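The (m>0) condition refers to Porter's measure m, the number of vowel-consonant sequences in the stem that remains after the suffix is removed. The sketch below covers only this one step, using the rule list above. It is not a complete Porter stemmer, and `measure` and `apply_rules` are hypothetical names.

```python
import re

# Suffix rewrite rules copied from the list above.
RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous",
}

def measure(stem):
    """Count vowel-consonant sequences: m in the form [C](VC)^m[V]."""
    pattern = []
    for i, ch in enumerate(stem.lower()):
        # y counts as a vowel only when it follows a consonant.
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "c"):
            pattern.append("v")
        else:
            pattern.append("c")
    return len(re.findall(r"v+c+", "".join(pattern)))

def apply_rules(word):
    """Apply the rule for the longest matching suffix if the stem has m > 0."""
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + RULES[suffix]
            return word  # longest suffix matched, but the condition failed
    return word

for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", apply_rules(w))
```

"rational" is left unchanged because removing ATIONAL leaves the stem "r", which has m = 0. This is why the rule list shows rational -> rational.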
26Porter Stemmer
- Errors of Omission (related forms that are not conflated)
- European / Europe
- analysis / analyzes
- matrices / matrix
- noise / noisy
- explain / explanation
- Errors of Commission (unrelated words that are conflated)
- organization / organ
- doing / doe
- generalization / generic
- numerical / numerous
- university / universe
27Computational Morphology
- WORD STEM (FEATURES)
- cats cat N PL
- cat cat N SG
- cities city N PL
- geese goose N PL
- ducks (duck N PL) or (duck V 3SG)
- merging merge V PRES-PART
- caught (catch V PAST-PART) or (catch V PAST)
28Lexicon-only Morphology
- The lexicon lists all surface level and lexical level pairs
- No rules
- Analysis/Generation is easy
- Very large for English
- What about
- Arabic or
- Turkish or
- Chinese?
Surface form    Lexical form
acclaim         acclaim N
acclaim         acclaim V0
acclaimed       acclaim Ved
acclaimed       acclaim Ven
acclaiming      acclaim Ving
acclaims        acclaim Ns
acclaims        acclaim Vs
acclamation     acclamation N
acclamations    acclamation Ns
acclimate       acclimate V0
acclimated      acclimate Ved
acclimated      acclimate Ven
acclimates      acclimate Vs
acclimating     acclimate Ving
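With a lexicon-only approach, analysis and generation are both plain table lookups. A minimal sketch, using the acclaim/acclimate pairs listed above and two hypothetical helper names, `analyze` and `generate`:

```python
# (surface form, stem, tag) triples from the listing above.
ENTRIES = [
    ("acclaim", "acclaim", "N"), ("acclaim", "acclaim", "V0"),
    ("acclaimed", "acclaim", "Ved"), ("acclaimed", "acclaim", "Ven"),
    ("acclaiming", "acclaim", "Ving"),
    ("acclaims", "acclaim", "Ns"), ("acclaims", "acclaim", "Vs"),
    ("acclamation", "acclamation", "N"),
    ("acclamations", "acclamation", "Ns"),
    ("acclimate", "acclimate", "V0"),
    ("acclimated", "acclimate", "Ved"), ("acclimated", "acclimate", "Ven"),
    ("acclimates", "acclimate", "Vs"),
    ("acclimating", "acclimate", "Ving"),
]

ANALYSES = {}   # surface form -> list of (stem, tag)
SURFACE = {}    # (stem, tag) -> surface form
for surface, stem, tag in ENTRIES:
    ANALYSES.setdefault(surface, []).append((stem, tag))
    SURFACE[(stem, tag)] = surface

def analyze(word):
    """Return every (stem, tag) analysis listed for a surface form."""
    return ANALYSES.get(word, [])

def generate(stem, tag):
    """Return the surface form for a (stem, tag) pair, or None if unlisted."""
    return SURFACE.get((stem, tag))

print(analyze("acclaimed"))
print(generate("acclimate", "Ving"))
```

Ambiguous forms such as "acclaimed" simply return more than one analysis, and any word missing from the table gets no analysis at all, which is the cost of having no rules.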
29For Next Week
- Software status
- Software on 3 lab machines, more coming
- Lecture on Monday Sept 13
- Part of speech tagging
- For Wed Sept 15
- Do exercises 1-3 in Tutorial 2 (Tokenizing)
- Do the following exercises from Tutorial 3 (Tagging)
- 1a-h
- 2, 3, 4, 5a-b
- Turn them in online
- (I'll have something available for this by then)