SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2 Applied Natural Language Processing
Marti Hearst Sept 8, 2004    
2
Today
  • Tokenizing using Regular Expressions
  • Elementary Morphology
  • Frequency Distributions in NLTK

3
Tokenizing in NLTK
  • The Whitespace Tokenizer doesn't work very well
  • What are some of the problems?
  • NLTK provides an easy way to incorporate regexes
    into your tokenizer
  • Uses Python's regex package (re)
  • http://docs.python.org/lib/re-syntax.html

4
Regexes for Tokenizing
  • Build up your recognizer piece by piece
  • Make a string of regexes combined with ORs
  • Put each one in a group (surrounded by parens)
  • Things to recognize
  • URLs
  • words with hyphens in them
  • words in which hyphens should be removed (end of
    line hyphens)
  • Numerical terms
  • Words with apostrophes

5
Regexes for Tokenizing
  • Here are some I put together
  • url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
  • Allows port number but no argument variables.
  • hyphen = r'(\w+\-\s?\w+)'
  • Allows for a space after the hyphen
  • apostro = r'(\w+\'\w+)'
  • numbers = r'((\$|#)?\d+(\.)?\d+%?)'
  • Needs to handle large numbers with commas
  • punct = r'([^\w\s]+)'
  • wordr = r'(\w+)'
  • A nice Python trick
  • regexp = string.join([url, hyphen, apostro,
    numbers, wordr, punct], "|")
  • Makes one string in which a | goes in between
    each substring

6
Regexes for Tokenizing
  • More code
  • import string
  • from nltk.token import *
  • from nltk.tokenizer import *
  • t = Token(TEXT='This is the girl\'s depart- ment store.')
  • regexp = string.join([url, hyphen, apostro, numbers,
    wordr, punct], "|")
  • RegexpTokenizer(regexp, SUBTOKENS='WORDS').tokenize(t)
  • print t['WORDS']
  • [<This>, <is>, <the>, <girl's>, <depart- ment>,
    <store>, <.>]
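
The slide code targets the 2004 NLTK API and Python 2, where string.join was still available. As a rough sketch only, the same piece-by-piece idea can be tried with nothing but Python 3's built-in re module. The patterns below are a best-effort reading of the slide's regexes, and the tokenize helper is a hypothetical name introduced here, not part of NLTK.

```python
import re

# Patterns follow the slide; their exact form is an assumption,
# since the transcript lost several special characters.
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
hyphen = r'(\w+\-\s?\w+)'        # allows a space after the hyphen
apostro = r"(\w+'\w+)"
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
wordr = r'(\w+)'
punct = r'([^\w\s]+)'

# Python 3 replacement for string.join(list, "|"):
# alternatives are tried left to right, so order matters.
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])


def tokenize(text):
    """Return every non-overlapping match of the combined pattern."""
    return [m.group(0) for m in re.finditer(regexp, text)]


tokens = tokenize("This is the girl's depart- ment store.")
print(tokens)
```

Because the more specific patterns come first in the alternation, "girl's" and "depart- ment" are each kept as one token instead of being split by the generic word pattern.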

7
Tokenization Issues
  • Sentence Boundaries
  • Include parens around sentences?
  • What about quotation marks around sentences?
  • Periods: end of line or not?
  • We'll study this in detail in a couple of weeks.
  • Proper Names
  • What to do about
  • New York-New Jersey train?
  • California Governor Arnold Schwarzenegger?
  • Clitics and Contractions

8
Morphology
  • Morphology
  • The study of the way words are built up from
    smaller meaning units.
  • Morphemes
  • The smallest meaningful unit in the grammar of a
    language.
  • Contrasts
  • Derivational vs. Inflectional
  • Regular vs. Irregular
  • Concatenative vs. Templatic (root-and-pattern)
  • A useful resource
  • Glossary of linguistic terms by Eugene Loos
  • http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

9
Examples (English)
  • unladylike
  • 3 morphemes, 4 syllables
  • un-: not
  • lady: (well-behaved) female adult human
  • -like: having the characteristics of
  • Can't break any of these down further without
    distorting the meaning of the units
  • technique
  • 1 morpheme, 2 syllables
  • dogs
  • 2 morphemes, 1 syllable
  • -s, a plural marker on nouns

10
Morpheme Definitions
  • Root
  • The portion of the word that
  • is common to a set of derived or inflected forms,
    if any, when all affixes are removed
  • is not further analyzable into meaningful
    elements
  • carries the principal portion of meaning of the
    words
  • Stem
  • The root or roots of a word, together with any
    derivational affixes, to which inflectional
    affixes are added.
  • Affix
  • A bound morpheme that is joined before, after, or
    within a root or stem.
  • Clitic
  • a morpheme that functions syntactically like a
    word, but does not appear as an independent
    phonological word
  • Spanish: un beso, las aguas
  • English: Hal's (genitive marker)

11
Inflectional vs. Derivational
  • Word Classes
  • Parts of speech: noun, verb, adjective, etc.
  • Word class dictates how a word combines with
    morphemes to form new words
  • Inflection
  • Variation in the form of a word, typically by
    means of an affix, that expresses a grammatical
    contrast.
  • Doesn't change the word class
  • Usually produces a predictable, nonidiosyncratic
    change of meaning.
  • Derivation
  • The formation of a new word or inflectable stem
    from another word or stem.

12
Inflectional Morphology
  • Adds
  • tense, number, person, mood, aspect
  • Word class doesn't change
  • Word serves new grammatical role
  • Examples
  • come is inflected for person and number
  • The pizza guy comes at noon.
  • las and rojas are inflected for agreement with
    manzanas in grammatical gender by -a and in
    number by -s
  • las manzanas rojas (the red apples)

13
Derivational Morphology
  • Nominalization (formation of nouns from other
    parts of speech, primarily verbs in English)
  • computerization
  • appointee
  • killer
  • fuzziness
  • Formation of adjectives (primarily from nouns)
  • computational
  • clueless
  • embraceable
  • Difficult cases
  • building -> from which sense of build?
  • A resource
  • CatVar: Categorial Variation Database
  • http://clipdemos.umiacs.umd.edu/catvar

14
Concatenative Morphology
  • Morpheme + Morpheme + Morpheme
  • Stems: also called lemma, base form, root, lexeme
  • hope + ing -> hoping; hop + ing -> hopping
  • Affixes
  • Prefixes: Antidisestablishmentarianism (anti-, dis-)
  • Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
  • Infixes: hingi (borrow) -> humingi (borrower) in
    Tagalog
  • Circumfixes: sagen (say) -> gesagt (said) in
    German
  • Agglutinative Languages
  • uygarlastiramadiklarimizdanmissinizcasina
  • uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina
  • Behaving as if you are among those whom we could
    not cause to become civilized

15
Templatic Morphology
  • Roots and Patterns
  • Example Hebrew verbs
  • Root
  • Consists of 3 consonants CCC
  • Carries basic meaning
  • Template
  • Gives the ordering of consonants and vowels
  • Specifies semantic information about the verb
  • Active, passive, middle voice
  • Example
  • lmd (to learn or study)
  • CaCaC -> lamad (he studied)
  • CiCeC -> limed (he taught)
  • CuCaC -> lumad (he was taught)
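
As an illustrative sketch (not taken from the lecture), root-and-pattern word formation can be modeled as filling the C slots of a template with the root's consonants in order. The function name apply_template is hypothetical.

```python
def apply_template(root, template):
    """Interleave root consonants into the 'C' slots of a template.

    Every character of the template other than 'C' is copied through
    unchanged (here, the vowels that carry the voice information).
    """
    if template.count("C") != len(root):
        raise ValueError("template needs one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)


for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

With the root lmd this reproduces the three forms on the slide: lamad, limed, lumad.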

16
Nouns and Verbs (in English)
  • Nouns have simple inflectional morphology
  • cat
  • cats, cat's
  • Verbs have more complex morphology

17
Nouns and Verbs (in English)
  • Nouns
  • Have simple inflectional morphology
  • Cat/Cats
  • Mouse/Mice, Ox/Oxen, Goose/Geese
  • Verbs
  • More complex morphology
  • Walk/Walked
  • Go/Went, Fly/Flew

18
Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped
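
The regular paradigm above is predictable enough to generate with a few spelling rules. The sketch below is a simplification written for this table, not a full account of English orthography: the function names are hypothetical, and the consonant-doubling test ignores stress, so it would wrongly double the final consonant of stems such as "visit".

```python
VOWELS = "aeiou"


def _ends_cvc(stem):
    """True if the stem ends consonant-vowel-consonant (last one not w, x, y)."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return (c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")


def s_form(stem):
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"          # try -> tries
    return stem + "s"


def ing_form(stem):
    if stem.endswith("e") and not stem.endswith(("ee", "ye", "oe")):
        return stem[:-1] + "ing"          # merge -> merging (e-deletion)
    if _ends_cvc(stem):
        return stem + stem[-1] + "ing"    # map -> mapping (doubling)
    return stem + "ing"


def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"          # try -> tried
    if _ends_cvc(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"


for stem in ("walk", "merge", "try", "map"):
    print(stem, s_form(stem), ing_form(stem), ed_form(stem))
```

The same e-deletion and doubling rules account for the hoping/hopping contrast mentioned elsewhere in the lecture.
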
19
Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat      catch     cut
-s form                       eats     catches   cuts
-ing form                     eating   catching  cutting
Past form                     ate      caught    cut
-ed participle                eaten    caught    cut
20
To love in Spanish
21
Syntax and Morphology
  • Phrase-level agreement
  • Subject-Verb
  • John studies hard (STUDY+3SG)
  • Noun-Adjective
  • Las vacas hermosas
  • Sub-word phrasal structures
  • (example word in its original script; characters lost in this transcript)
  • That+in+book+PL+Poss+1PL
  • Which are in our books

22
Phonology and Morphology
  • Script Limitations
  • Spoken English has 14 vowels
  • heed, hid, hayed, head, had, hoed, hood, who'd, hide,
    how'd, taught, Tut, toy, enough
  • English alphabet has 5
  • Use vowel combinations: far, fair, fare
  • Consonantal doubling (hopping vs. hoping)

23
Computational Morphology
  • Approaches
  • Lexicon only
  • Rules only
  • Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers
  • Systems
  • WordNet's morphy
  • PCKimmo
  • Named after Kimmo Koskenniemi, much work done by
    Lauri Karttunen, Ron Kaplan, and Martin Kay
  • Accurate but complex
  • http://www.sil.org/pckimmo/
  • Two-level morphology
  • Commercial version available from InXight Corp.
  • Background
  • Chapter 3 of Jurafsky and Martin
  • A short history of Two-Level Morphology
  • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

24
Porter Stemmer
  • Discount morphology
  • So not all that accurate
  • Uses a series of cascaded rewrite rules
  • ATIONAL -> ATE
  • (relational -> relate)
  • ING -> ε if stem contains vowel
  • (motoring -> motor)

25
Porter Stemmer
  • Step 4: Derivational Morphology I: Multiple Suffixes
  • (m>0) ATIONAL -> ATE      relational -> relate
  • (m>0) TIONAL  -> TION     conditional -> condition
                              rational -> rational
  • (m>0) ENCI    -> ENCE     valenci -> valence
  • (m>0) ANCI    -> ANCE     hesitanci -> hesitance
  • (m>0) IZER    -> IZE      digitizer -> digitize
  • (m>0) ABLI    -> ABLE     conformabli -> conformable
  • (m>0) ALLI    -> AL       radicalli -> radical
  • (m>0) ENTLI   -> ENT      differentli -> different
  • (m>0) ELI     -> E        vileli -> vile
  • (m>0) OUSLI   -> OUS      analogousli -> analogous
  • (m>0) IZATION -> IZE      vietnamization -> vietnamize
  • (m>0) ATION   -> ATE      predication -> predicate
  • (m>0) ATOR    -> ATE      operator -> operate
  • (m>0) ALISM   -> AL       feudalism -> feudal
  • (m>0) IVENESS -> IVE      decisiveness -> decisive
  • (m>0) FULNESS -> FUL      hopefulness -> hopeful
  • (m>0) OUSNESS -> OUS      callousness -> callous
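
To make the (m>0) condition concrete, here is a minimal sketch of this single step only, not the full Porter stemmer. It assumes Porter's usual definition of the measure m as the number of vowel-consonant sequences in the stem that remains after the suffix is removed. The names measure and step_multiple_suffixes are chosen here for illustration.

```python
# Rules in the order listed on the slide; longer suffixes come before
# shorter ones they contain (e.g. IZATION before ATION).
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
    ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"),
    ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"),
    ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def step_multiple_suffixes(word):
    """Apply the first rule whose suffix matches, if the stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Only the first matching suffix is considered.
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "hopefulness"):
    print(w, "->", step_multiple_suffixes(w))
```

The word "rational" is left alone because removing ATIONAL leaves the stem "r", whose measure is 0.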

26
Porter Stemmer
  • Errors of Omission
  • European Europe
  • analysis analyzes
  • matrices matrix
  • noise noisy
  • explain explanation
  • Errors of Commission
  • organization organ
  • doing doe
  • generalization generic
  • numerical numerous
  • university universe

27
Computational Morphology
  • WORD       STEM (+FEATURES)
  • cats       cat+N+PL
  • cat        cat+N+SG
  • cities     city+N+PL
  • geese      goose+N+PL
  • ducks      (duck+N+PL) or (duck+V+3SG)
  • merging    merge+V+PRES-PART
  • caught     (catch+V+PAST-PART) or (catch+V+PAST)
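
One way to picture this word-to-analysis mapping is a toy analyzer that combines a small stem lexicon, a list of irregular forms, and a couple of suffix rules. Everything in the sketch below (the STEMS and IRREGULAR tables, the analyze function, and the rule set) is a hypothetical illustration sized to the examples on this slide, not a real morphological analyzer.

```python
# Tiny stem lexicon: stem -> set of word classes it can belong to.
STEMS = {"cat": {"N"}, "city": {"N"}, "duck": {"N", "V"}, "merge": {"V"}}

# Irregular surface forms are listed outright.
IRREGULAR = {
    "geese": [("goose", "N", "PL")],
    "caught": [("catch", "V", "PAST-PART"), ("catch", "V", "PAST")],
}


def analyze(word):
    """Return every (stem, class, feature) analysis this toy model can find."""
    if word in IRREGULAR:
        return list(IRREGULAR[word])
    analyses = []
    if "N" in STEMS.get(word, ()):
        analyses.append((word, "N", "SG"))
    # Undo the -s suffix, including the y -> ies spelling change.
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("s"):
        candidates.append(word[:-1])
    for stem in candidates:
        classes = STEMS.get(stem, ())
        if "N" in classes:
            analyses.append((stem, "N", "PL"))
        if "V" in classes:
            analyses.append((stem, "V", "3SG"))
    # Undo -ing, allowing for a deleted final e.
    if word.endswith("ing"):
        for stem in (word[:-3], word[:-3] + "e"):
            if "V" in STEMS.get(stem, ()):
                analyses.append((stem, "V", "PRES-PART"))
    return analyses


for w in ("cats", "cities", "geese", "ducks", "merging", "caught"):
    print(w, analyze(w))
```

An ambiguous form such as "ducks" comes back with two analyses, which a later stage (for example a part-of-speech tagger) would have to choose between.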

28
Lexicon-only Morphology
  • The lexicon lists all surface level and lexical
    level pairs
  • No rules
  • Analysis/Generation is easy
  • Very large for English
  • What about
  • Arabic or
  • Turkish or
  • Chinese?

acclaim        acclaim        N
acclaim        acclaim        V0
acclaimed      acclaim        Ved
acclaimed      acclaim        Ven
acclaiming     acclaim        Ving
acclaims       acclaim        Ns
acclaims       acclaim        Vs
acclamation    acclamation    N
acclamations   acclamation    Ns
acclimate      acclimate      V0
acclimated     acclimate      Ved
acclimated     acclimate      Ven
acclimates     acclimate      Vs
acclimating    acclimate      Ving
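
A lexicon-only system is essentially two lookup tables over such a listing, which is why analysis and generation are easy and why the table grows so large. The sketch below uses a few of the entries shown. The table and function names are illustrative assumptions.

```python
from collections import defaultdict

# (surface form, lexical form, tag) triples taken from the listing.
ENTRIES = [
    ("acclaim", "acclaim", "N"),
    ("acclaim", "acclaim", "V0"),
    ("acclaimed", "acclaim", "Ved"),
    ("acclaimed", "acclaim", "Ven"),
    ("acclaiming", "acclaim", "Ving"),
    ("acclaims", "acclaim", "Ns"),
    ("acclaims", "acclaim", "Vs"),
]

analysis_table = defaultdict(list)   # surface -> [(lemma, tag), ...]
generation_table = {}                # (lemma, tag) -> surface
for surface, lemma, tag in ENTRIES:
    analysis_table[surface].append((lemma, tag))
    generation_table[(lemma, tag)] = surface


def analyze(surface):
    """Analysis is a plain lookup; unknown words get no analysis at all."""
    return analysis_table.get(surface, [])


def generate(lemma, tag):
    """Generation is the inverse lookup."""
    return generation_table.get((lemma, tag))


print(analyze("acclaims"))
print(generate("acclaim", "Ving"))
print(analyze("acclaimer"))
```

The last call returns an empty list: with no rules, any form missing from the lexicon simply cannot be analyzed.
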
29
For Next Week
  • Software status
  • Software on 3 lab machines, more coming
  • Lecture on Monday Sept 13
  • Part of speech tagging
  • For Wed Sept 15
  • Do exercises 1-3 in Tutorial 2 (Tokenizing)
  • Do the following exercises from Tutorial 3
    (Tagging)
  • 1a-h
  • 2, 3, 4, 5a-b
  • Turn them in online
  • (I'll have something available for this by then)