POS Tagging: Introduction

1
POS Tagging: Introduction
  • Heng Ji
  • hengji_at_cs.qc.cuny.edu
  • Feb 2, 2008

Acknowledgement: some slides from Ralph Grishman, Nicolas Nicolov, JM
2
Some Administrative Stuff
  • Assignment 1 due on Feb 17
  • Textbook required for assignments and final exam

3
Outline
  • Parts of speech (POS)
  • Tagsets
  • POS Tagging
  • Rule-based tagging
  • Markup Format
  • Open source Toolkits

4
What is Part-of-Speech (POS)?
  • Generally speaking, word classes (POS):
  • Verb, Noun, Adjective, Adverb, Article, ...
  • We can also include inflection:
  • Verbs: tense, number, ...
  • Nouns: number, proper/common, ...
  • Adjectives: comparative, superlative, ...

5
Parts of Speech
  • 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb,
    article, interjection, pronoun, conjunction, etc
  • Called parts-of-speech, lexical categories, word
    classes, morphological classes, lexical tags...
  • Lots of debate within linguistics about the
    number, nature, and universality of these
  • We'll completely ignore this debate.

6
7 Traditional POS Categories
  • N (noun): chair, bandwidth, pacing
  • V (verb): study, debate, munch
  • ADJ (adjective): purple, tall, ridiculous
  • ADV (adverb): unfortunately, slowly, ...
  • P (preposition): of, by, to
  • PRO (pronoun): I, me, mine
  • DET (determiner): the, a, that, those

7
POS Tagging
  • The process of assigning a part-of-speech or
    lexical class marker to each word in a collection.

WORD     TAG
the      DET
koala    N
put      V
the      DET
keys     N
on       P
the      DET
table    N
8
Penn TreeBank POS Tag Set
  • Penn Treebank: hand-annotated corpus of Wall
    Street Journal text, 1M words
  • 46 tags
  • Some particularities:
  • to/TO not disambiguated
  • Auxiliaries and verbs not distinguished

9
Penn Treebank Tagset
10
Why is POS tagging useful?
  • Speech synthesis
  • How to pronounce "lead"?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
  • Stemming for information retrieval
  • A search for "aardvarks" can also retrieve "aardvark"
  • Parsing, speech recognition, etc.
  • Possessive pronouns (my, your, her) are likely to be
    followed by nouns
  • Personal pronouns (I, you, he) are likely to be
    followed by verbs
  • Need to know if a word is an N or V before you
    can parse
  • Information extraction
  • Finding names, relations, etc.
  • Machine translation

11
Equivalent Problem in Bioinformatics
  • Durbin et al. Biological Sequence Analysis,
    Cambridge University Press.
  • Several applications, e.g. proteins
  • From primary structure: ATCPLELLLD
  • Infer secondary structure: HHHBBBBBC..

12
Why is POS Tagging Useful?
  • First step of a vast number of practical tasks
  • Speech synthesis
  • How to pronounce "lead"?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
  • Parsing
  • Need to know if a word is an N or V before you
    can parse
  • Information extraction
  • Finding names, relations, etc.
  • Machine Translation

13
Open and Closed Classes
  • Closed class: a small, fixed membership
  • Prepositions: of, in, by, ...
  • Auxiliaries: may, can, will, had, been, ...
  • Pronouns: I, you, she, mine, his, them, ...
  • Usually function words (short common words which
    play a role in grammar)
  • Open class: new ones can be created all the time
  • English has 4: nouns, verbs, adjectives, adverbs
  • Many languages have these 4, but not all!

14
Open Class Words
  • Nouns
  • Proper nouns (Boulder, Granby, Eli Manning)
  • English capitalizes these.
  • Common nouns (the rest).
  • Count nouns and mass nouns
  • Count nouns have plurals and get counted: goat/goats,
    one goat, two goats
  • Mass nouns don't get counted (snow, salt, communism;
    *two snows)
  • Adverbs: tend to modify things
  • Unfortunately, John walked home extremely slowly
    yesterday
  • Directional/locative adverbs (here, home,
    downhill)
  • Degree adverbs (extremely, very, somewhat)
  • Manner adverbs (slowly, slinkily, delicately)
  • Verbs
  • In English, verbs have morphological affixes
    (eat/eats/eaten)

15
Closed Class Words
  • Examples
  • prepositions: on, under, over, ...
  • particles: up, down, on, off, ...
  • determiners: a, an, the, ...
  • pronouns: she, who, I, ...
  • conjunctions: and, but, or, ...
  • auxiliary verbs: can, may, should, ...
  • numerals: one, two, three, third, ...

16
Prepositions from CELEX
17
English Particles
18
Conjunctions
19
POS Tagging: Choosing a Tagset
  • There are so many parts of speech and potential
    distinctions we can draw
  • To do POS tagging, we need to choose a standard
    set of tags to work with
  • Could pick very coarse tagsets
  • N, V, Adj, Adv.
  • A more commonly used set is finer-grained: the
    Penn TreeBank tagset, 45 tags
  • PRP, WRB, WP, VBG
  • Even more fine-grained tagsets exist

20
Using the Penn Tagset
  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
    number/NN of/IN other/JJ topics/NNS ./.
  • Prepositions and subordinating conjunctions are
    marked IN (although/IN I/PRP ...)
  • Except the preposition/complementizer "to", which is
    just marked TO.
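
The word/TAG notation used on this slide is easy to read programmatically. The following is a minimal sketch, not part of the original slides: `parse_tagged` is a hypothetical helper name, and it assumes whitespace-separated tokens in which the tag follows the last slash.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so tokens such as './.' still parse.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
tagged = parse_tagged(sentence)
for word, tag in tagged:
    print(f"{word:10s} {tag}")
```

Splitting on the last slash rather than the first is a design choice that keeps punctuation tokens such as `./.` intact.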

21
POS Tagging
  • Words often have more than one POS, e.g. "back":
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS
    tag for a particular instance of a word.

These examples from Dekang Lin
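
One rough way to see this ambiguity is to collect the set of tags each word type receives in a tagged corpus. The sketch below does this on a tiny hand-made corpus built from the "back" examples above. The tags on the surrounding words and the helper name `tags_per_word` are assumptions made for illustration, not data from the slides.

```python
from collections import defaultdict


def tags_per_word(tagged_words):
    """Map each lowercased word type to the set of tags it was seen with."""
    seen = defaultdict(set)
    for word, tag in tagged_words:
        seen[word.lower()].add(tag)
    return seen


# Toy corpus (assumed tags for the context words).
toy_corpus = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
    ("promised", "VBD"), ("to", "TO"), ("back", "VB"),
    ("the", "DT"), ("bill", "NN"),
]

seen = tags_per_word(toy_corpus)
ambiguous = {w: sorted(t) for w, t in seen.items() if len(t) > 1}
share = len(ambiguous) / len(seen)
print(ambiguous)
print(f"{share:.0%} of word types are ambiguous in this toy corpus")
```

On a real corpus the same count would be taken over many more word types; the toy numbers here say nothing about English in general.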
22
How Hard is POS Tagging? Measuring Ambiguity
23
Current Performance
  • How many tags are correct?
  • About 97% currently
  • But baseline is already 90%
  • Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
  • How well do people do?
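
The baseline described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the training sentences are a made-up toy set, the function names are hypothetical, and "NN" is assumed as the noun tag for unknown words. The accuracy it reports on toy data is not comparable to the figures on the slide.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag for every word seen in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, unknown_tag="NN"):
    """Tag each word with its most frequent tag; unknown words become nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


def accuracy(predicted, gold_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(tag == gold for (_, tag), gold in zip(predicted, gold_tags))
    return correct / len(gold_tags)


# Toy training data (assumed, for illustration only).
train = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("the", "DT"), ("bill", "NN"), ("on", "IN"), ("my", "PRP$"), ("back", "NN")],
]
model = train_baseline(train)

words = "She promised to back the bill".split()
gold = ["PRP", "VBD", "TO", "VB", "DT", "NN"]
pred = tag_baseline(words, model)
print(pred)
print(f"accuracy: {accuracy(pred, gold):.2f}")
```

Note that the baseline ignores context entirely, so "back" always receives the same tag no matter where it occurs.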

24
Quick Test: Agreement?
  • the students went to class
  • plays well with others
  • fruit flies like a banana

DT: the, this, that; NN: noun; VB: verb; P: preposition; ADV: adverb
25
Quick Test
  • the students went to class
  • DT NN VB P NN
  • plays well with others
  • VB ADV P NN
  • NN NN P DT
  • fruit flies like a banana
  • NN NN VB DT NN
  • NN VB P DT NN
  • NN NN P DT NN
  • NN VB VB DT NN

26
How to do it? History
(Timeline figure; layout and dates were lost in the transcript.)
Taggers and reported accuracy:
  • Combined methods: 98%
  • Trigram tagger (Kempe): 96%
  • DeRose/Church, efficient HMM, sparse data: 95%
  • Tree-based statistics (Helmut Schmid), rule-based: 96%
  • Transformation-based tagging (Eric Brill), rule-based: 95%
  • Greene and Rubin, rule-based: 70%
  • HMM tagging (CLAWS): 93-95%
  • Neural network: 96%
Corpora and milestones:
  • LOB Corpus tagged
  • Brown Corpus created (EN-US), 1 million words
  • Brown Corpus tagged
  • British National Corpus (tagged by CLAWS)
  • POS tagging separated from other NLP
  • LOB Corpus created (EN-UK), 1 million words
  • Penn Treebank Corpus (WSJ, 4.5M)
27
Two Methods for POS Tagging
  • Rule-based tagging
  • (ENGTWOL)
  • Stochastic
  • Probabilistic sequence models
  • HMM (Hidden Markov Model) tagging
  • MEMMs (Maximum Entropy Markov Models)

28
Rule-Based Tagging
  • Start with a dictionary
  • Assign all possible tags to words from the
    dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word.

29
Rule-based taggers
  • Early POS taggers all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin,
    1971) and the best of the recent ones, ENGTWOL
    (Voutilainen, 1995), are based on a two-stage
    architecture
  • Stage 1: look up word in lexicon to give list of
    potential POSs
  • Stage 2: apply rules which certify or disallow
    tag sequences
  • Rules were originally handwritten; more recently,
    machine learning methods can be used

30
Start With a Dictionary
  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc. for the 100,000 words of English with more
    than 1 tag

31
Assign Every Possible Tag
                          NN
                          RB
       VBN                JJ            VB
PRP    VBD        TO      VB     DT     NN
She    promised   to      back   the    bill

32
Write Rules to Eliminate Tags
  • Eliminate VBN if VBD is an option when VBN|VBD
    follows "<start> PRP"

                          NN
                          RB
                          JJ            VB
PRP    VBD        TO      VB     DT     NN
She    promised   to      back   the    bill

(VBN has been eliminated for "promised".)
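
The three steps on the preceding slides (dictionary lookup, assigning every possible tag, eliminating tags by rule) can be sketched as follows. This is a minimal illustration, not the ENGTWOL implementation: the lexicon holds only the six words from the slide, and the function names are hypothetical.

```python
# Dictionary from the "Start With a Dictionary" slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Step 1: give every word all tags listed for it in the dictionary."""
    return [list(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_start_prp(candidates):
    """Step 2: drop VBN when VBD is also an option and the word
    follows '<start> PRP' (a sentence-initial pronoun)."""
    result = [list(tags) for tags in candidates]
    if len(result) >= 2 and result[0] == ["PRP"]:
        second = result[1]
        if "VBN" in second and "VBD" in second:
            second.remove("VBN")
    return result


words = "She promised to back the bill".split()
candidates = assign_all_tags(words)
pruned = eliminate_vbn_after_start_prp(candidates)
for word, tags in zip(words, pruned):
    print(f"{word:10s} {tags}")
```

After this single rule, "back" and "bill" are still ambiguous; a full rule-based tagger would need further hand-written rules to leave one tag per word.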
33
Stage 1 of ENGTWOL Tagging
  • First stage: run words through FST morphological
    analyzer to get all parts of speech.
  • Example: Pavlov had shown that salivation ...

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SVO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG

34
Stage 2 of ENGTWOL Tagging
  • Second stage: apply NEGATIVE constraints.
  • Example: adverbial "that" rule
  • Eliminates all readings of "that" except the one
    in "It isn't that odd"

Given input: "that"
If
    (+1 A/ADV/QUANT)    if next word is adj/adv/quantifier
    (+2 SENT-LIM)       following which is E-O-S
    (NOT -1 SVOC/A)     and the previous word is not a verb like
                        "consider", which allows adjective
                        complements, as in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
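
The adverbial "that" constraint can be approximated in code. The sketch below is a simplification under several assumptions: readings are reduced to single labels (ADV, PRON, DET, CS, A, ...) rather than full ENGTWOL analyses, the SVOC/A verb class is stood in for by a one-word set, and sentence limits are approximated by end punctuation or end of input. All names are hypothetical.

```python
SVOC_A_VERBS = {"consider"}          # assumed stand-in for the SVOC/A class
A_ADV_QUANT = {"A", "ADV", "QUANT"}  # adjective, adverb, quantifier readings
SENT_LIM = {".", "!", "?"}           # assumed sentence-limit tokens


def adverbial_that_rule(words, readings, i):
    """Return the surviving readings of the token 'that' at position i."""
    # (+1 A/ADV/QUANT): the next word can be an adjective, adverb or quantifier
    next_ok = i + 1 < len(words) and bool(readings[i + 1] & A_ADV_QUANT)
    # (+2 SENT-LIM): the word after that is a sentence limit (or input ends)
    limit_ok = i + 2 >= len(words) or words[i + 2] in SENT_LIM
    # (NOT -1 SVOC/A): the previous word is not a verb like "consider"
    prev_ok = i == 0 or words[i - 1].lower() not in SVOC_A_VERBS
    if next_ok and limit_ok and prev_ok:
        return {r for r in readings[i] if r == "ADV"}   # eliminate non-ADV
    return {r for r in readings[i] if r != "ADV"}       # eliminate ADV


THAT = {"ADV", "PRON", "DET", "CS"}

words1 = ["It", "isn't", "that", "odd", "."]
readings1 = [{"PRON"}, {"V"}, set(THAT), {"A"}, {"PUNCT"}]
print(adverbial_that_rule(words1, readings1, 2))

words2 = ["I", "consider", "that", "odd", "."]
readings2 = [{"PRON"}, {"V"}, set(THAT), {"A"}, {"PUNCT"}]
print(adverbial_that_rule(words2, readings2, 2))
```

In the first sentence only the ADV reading survives; in the second, the preceding "consider" blocks the rule and the ADV reading is the one removed.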

35
Inline Mark-up
  • POS Tagging
  • http://nlp.cs.qc.cuny.edu/wsj_pos.zip
  • Input format:
  • Pierre Vinken, 61 years old, will join
    the board as a nonexecutive director Nov. 29.
  • Output format:
  • Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ
    ,/, will/MD join/VB the/DT board/NN as/IN a/DT
    nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
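
Producing the inline output format from a tagger's result is a one-line join once tokens and tags are aligned. The sketch below assumes the sentence has already been tokenized and tagged; `to_inline` is a hypothetical helper, not part of any toolkit listed in these slides.

```python
def to_inline(tokens, tags):
    """Join aligned tokens and tags into 'word/TAG' inline mark-up."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags))


tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join",
          "the", "board", "as", "a", "nonexecutive", "director", "Nov.", "29", "."]
tags = ["NNP", "NNP", ",", "CD", "NNS", "JJ", ",", "MD", "VB",
        "DT", "NN", "IN", "DT", "JJ", "NN", "NNP", "CD", "."]

line = to_inline(tokens, tags)
print(line)
```

The length check guards against silently dropping tokens, which `zip` would otherwise do when the two lists differ in length.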

36
POS Tagging Tools
  • NYU Prof. Ralph Grishman's HMM POS tagger (in
    Java)
  • http://nlp.cs.qc.cuny.edu/jet.zip
  • http://nlp.cs.qc.cuny.edu/jet_src.zip
  • http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • Demo
  • How it works:
  • Learned HMM: data/pos_hmm.txt
  • Source code: src/jet/HMM/HMMTagger.java

37
POS Tagging Tools
  • Stanford tagger (log-linear tagger):
    http://nlp.stanford.edu/software/tagger.shtml
  • Brill tagger
  • http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  • Usage: tagger LEXICON test BIGRAMS LEXICALRULEFILE
    CONTEXTUALRULEFILE
  • YamCha (SVM)
  • http://chasen.org/~taku/software/yamcha/
  • MXPOST (Maximum Entropy)
  • ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  • More complete list at
  • http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38
NLP Toolkits
  • Uniform CL Annotation Platform
  • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  • Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
  • MinorThird (CMU): http://minorthird.sourceforge.net/
  • NLTK: http://nltk.sourceforge.net/ (Natural
    Language Toolkit, with data sets); demo
  • Information Extraction
  • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • GATE (University of Sheffield IE toolkit):
    http://gate.ac.uk/download/index.html
  • Information Retrieval
  • INDRI (information retrieval toolkit):
    http://www.lemurproject.org/indri/
  • Machine Translation
  • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
  • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
  • MOSES: http://www.statmt.org/moses/
39
Looking Ahead: Next Class
  • Machine Learning for POS Tagging: Hidden Markov
    Model