Title: POS Tagging: Introduction
1POS Tagging Introduction
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- Feb 2, 2008
Acknowledgement: some slides from Ralph Grishman, Nicolas Nicolov, J&M
2Some Administrative Stuff
- Assignment 1 due on Feb 17
- Textbook required for assignments and final exam
3Outline
- Parts of speech (POS)
- Tagsets
- POS Tagging
- Rule-based tagging
- Markup Format
- Open source Toolkits
4What is Part-of-Speech (POS)?
- Generally speaking, Word Classes (POS)
- Verb, Noun, Adjective, Adverb, Article, ...
- We can also include inflection
- Verbs: tense, number, ...
- Nouns: number, proper/common, ...
- Adjectives: comparative, superlative, ...
5Parts of Speech
- 8 (ish) traditional parts of speech
- Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- Called parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
- Lots of debate within linguistics about the number, nature, and universality of these
- We'll completely ignore this debate.
6Seven Traditional POS Categories
- N noun chair, bandwidth, pacing
- V verb study, debate, munch
- ADJ adj purple, tall, ridiculous
- ADV adverb unfortunately, slowly,
- P preposition of, by, to
- PRO pronoun I, me, mine
- DET determiner the, a, that, those
7POS Tagging
- The process of assigning a part-of-speech or
lexical class marker to each word in a collection.
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
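In a program, the result of tagging is often held as a list of (word, tag) pairs. A minimal Python sketch using the koala sentence above; the variable names are illustrative only and are not part of any toolkit mentioned in these slides:

```python
# The tagged sentence from the slide, stored as (word, tag) pairs.
tagged = [
    ("the", "DET"), ("koala", "N"), ("put", "V"),
    ("the", "DET"), ("keys", "N"), ("on", "P"),
    ("the", "DET"), ("table", "N"),
]

# Recover the plain word sequence and the tag sequence separately.
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

print(" ".join(words))
print(" ".join(tags))
```

Keeping word and tag together in one pair makes it hard to lose the alignment between the two sequences.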
8Penn TreeBank POS Tag Set
- Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
- 46 tags
- Some particularities
- to/TO not disambiguated
- Auxiliaries and verbs not distinguished
9Penn Treebank Tagset
10Why is POS tagging useful?
- Speech synthesis
- How to pronounce lead?
- INsult inSULT
- OBject obJECT
- OVERflow overFLOW
- DIScount disCOUNT
- CONtent conTENT
- Stemming for information retrieval
- A search for "aardvarks" can also retrieve "aardvark"
- Parsing, speech recognition, etc.
- Possessive pronouns (my, your, her) likely to be followed by nouns
- Personal pronouns (I, you, he) likely to be followed by verbs
- Need to know if a word is an N or V before you can parse
- Information extraction
- Finding names, relations, etc.
- Machine Translation
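For the speech synthesis case above, the tag can be used as part of the lookup key for pronunciation. A small sketch, assuming a hand-made table (the `STRESS` dictionary and the `pronounce` function are hypothetical names for illustration; capital letters mark the stressed syllable as on the slide):

```python
# Hypothetical pronunciation table keyed on (word, coarse tag).
# Nouns take first-syllable stress, verbs second-syllable stress.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
}


def pronounce(word, tag):
    """Return the stress pattern for a tagged word, or the word
    unchanged when the table has no entry for it."""
    return STRESS.get((word.lower(), tag), word)


print(pronounce("insult", "N"))
print(pronounce("insult", "V"))
```

Without the tag, the lookup would have no way to choose between the two entries for the same spelling.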
11Equivalent Problem in Bioinformatics
- Durbin et al. Biological Sequence Analysis, Cambridge University Press.
- Several applications, e.g. proteins
- From primary structure ATCPLELLLD
- Infer secondary structure HHHBBBBBC..
12Why is POS Tagging Useful?
- First step of a vast number of practical tasks
- Speech synthesis
- How to pronounce lead?
- INsult inSULT
- OBject obJECT
- OVERflow overFLOW
- DIScount disCOUNT
- CONtent conTENT
- Parsing
- Need to know if a word is an N or V before you can parse
- Information extraction
- Finding names, relations, etc.
- Machine Translation
13Open and Closed Classes
- Closed class: a small fixed membership
- Prepositions: of, in, by, ...
- Auxiliaries: may, can, will, had, been, ...
- Pronouns: I, you, she, mine, his, them, ...
- Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created all the time
- English has 4: Nouns, Verbs, Adjectives, Adverbs
- Many languages have these 4, but not all!
14Open Class Words
- Nouns
- Proper nouns (Boulder, Granby, Eli Manning)
- English capitalizes these.
- Common nouns (the rest).
- Count nouns and mass nouns
- Count nouns have plurals and get counted: goat/goats, one goat, two goats
- Mass nouns don't get counted (snow, salt, communism) (*two snows)
- Adverbs tend to modify things
- Unfortunately, John walked home extremely slowly yesterday
- Directional/locative adverbs (here, home, downhill)
- Degree adverbs (extremely, very, somewhat)
- Manner adverbs (slowly, slinkily, delicately)
- Verbs
- In English, have morphological affixes
(eat/eats/eaten)
15Closed Class Words
- Examples
- prepositions: on, under, over, ...
- particles: up, down, on, off, ...
- determiners: a, an, the, ...
- pronouns: she, who, I, ...
- conjunctions: and, but, or, ...
- auxiliary verbs: can, may, should, ...
- numerals: one, two, three, third, ...
16Prepositions from CELEX
17English Particles
18Conjunctions
19POS Tagging: Choosing a Tagset
- There are so many parts of speech and potential distinctions we can draw
- To do POS tagging, we need to choose a standard set of tags to work with
- Could pick very coarse tagsets
- N, V, Adj, Adv.
- More commonly used set is finer grained, the Penn TreeBank tagset, 45 tags
- PRP, WRB, WP, VBG
- Even more fine-grained tagsets exist
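To make the coarse versus fine-grained distinction concrete, a fine-grained Penn Treebank tag can be collapsed into one of the coarse classes. A rough sketch; the prefix-based mapping and the `coarse_tag` name are assumptions made for illustration, not a standard conversion:

```python
def coarse_tag(penn_tag):
    """Collapse a Penn Treebank tag into N, V, Adj, Adv, or Other.

    The prefix mapping is an assumption of this sketch: NN* -> N,
    VB* -> V, JJ* -> Adj, RB* -> Adv, everything else -> Other.
    """
    if penn_tag.startswith("NN"):
        return "N"
    if penn_tag.startswith("VB"):
        return "V"
    if penn_tag.startswith("JJ"):
        return "Adj"
    if penn_tag.startswith("RB"):
        return "Adv"
    return "Other"


for tag in ["NNS", "VBG", "JJ", "RB", "PRP"]:
    print(tag, "->", coarse_tag(tag))
```

Going the other way (coarse to fine) is not possible without more information, which is one reason to annotate with the finer set.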
20Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Prepositions and subordinating conjunctions marked IN ("although/IN I/PRP ...")
- Except the preposition/complementizer "to" is just marked "TO".
21POS Tagging
- Words often have more than one POS, e.g. "back"
- The back door = JJ
- On my back = NN
- Win the voters back = RB
- Promised to back the bill = VB
- The POS tagging problem is to determine the POS
tag for a particular instance of a word.
These examples from Dekang Lin
22How Hard is POS Tagging? Measuring Ambiguity
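One simple way to measure ambiguity is to count how many word types occur with more than one tag in a tagged corpus. A minimal sketch on a toy sample built from the "back" examples; the `ambiguity` function and the sample are illustrative assumptions, and real figures require a full corpus:

```python
from collections import defaultdict


def ambiguity(tagged_tokens):
    """Return (ambiguous types, total types) for (word, tag) pairs.

    A word type counts as ambiguous when it was seen with 2+ tags.
    """
    tags_seen = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_seen[word.lower()].add(tag)
    ambiguous = sum(1 for tags in tags_seen.values() if len(tags) > 1)
    return ambiguous, len(tags_seen)


# Toy sample only; not drawn from a real corpus.
sample = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
]

ambiguous, total = ambiguity(sample)
print(f"{ambiguous} of {total} word types are ambiguous")
```

The same count can be made over tokens instead of types, which usually gives a much higher share because ambiguous words tend to be frequent.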
23Current Performance
- How many tags are correct?
- About 97% currently
- But baseline is already 90%
- Baseline algorithm
- Tag every word with its most frequent tag
- Tag unknown words as nouns
- How well do people do?
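The baseline algorithm above can be sketched in a few lines: count tags per word in training data, keep the most frequent one, and fall back to a noun tag for unknown words. This is a sketch under assumptions: the function names and the toy training data are made up for illustration, and NN is used as the noun tag.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Map each word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, unknown_tag="NN"):
    """Tag each word with its most frequent tag; unknown words get NN."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy training data, invented for this example.
train = [("the", "DT"), ("back", "NN"), ("back", "NN"),
         ("back", "VB"), ("bill", "NN")]
model = train_baseline(train)
print(tag_baseline(["The", "back", "koala"], model))
```

Here "koala" never appears in training, so it is tagged NN by the unknown-word rule.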
24Quick Test Agreement?
- the students went to class
- plays well with others
- fruit flies like a banana
- Tag key: DT = the, this, that; NN = noun; VB = verb; P = preposition; ADV = adverb
25Quick Test
- the students went to class
- DT NN VB P NN
- plays well with others
- VB ADV P NN
- NN NN P DT
- fruit flies like a banana
- NN NN VB DT NN
- NN VB P DT NN
- NN NN P DT NN
- NN VB VB DT NN
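Agreement between two annotators can be measured as the fraction of tokens that received the same tag. A minimal sketch comparing two of the candidate taggings of "fruit flies like a banana" listed above; the `agreement` function is an illustrative helper, and it measures raw agreement only, with no correction for chance:

```python
def agreement(tags_a, tags_b):
    """Return the fraction of positions where two taggings match."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


annotator_1 = "NN NN VB DT NN".split()  # "fruit flies" as a compound noun
annotator_2 = "NN VB P DT NN".split()   # "flies" as the verb

print(agreement(annotator_1, annotator_2))  # 0.6
```

The two taggings differ on "flies" and "like", so they agree on 3 of 5 tokens.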
26How to do it? History
Taggers and reported accuracy
- Combined Methods: 98%
- Trigram Tagger (Kempe): 96%
- DeRose/Church, Efficient HMM, Sparse Data: 95%
- Tree-Based Statistics (Helmut Schmid), Rule Based: 96%
- Transformation Based Tagging (Eric Brill), Rule Based: 95%
- Greene and Rubin, Rule Based: 70%
- HMM Tagging (CLAWS): 93-95%
- Neural Network: 96%
Corpora and milestones
- Brown Corpus Created (EN-US), 1 Million Words
- Brown Corpus Tagged
- LOB Corpus Created (EN-UK), 1 Million Words
- LOB Corpus Tagged
- POS Tagging separated from other NLP
- British National Corpus (tagged by CLAWS)
- Penn Treebank Corpus (WSJ, 4.5M)
27Two Methods for POS Tagging
- Rule-based tagging
- (ENGTWOL)
- Stochastic
- Probabilistic sequence models
- HMM (Hidden Markov Model) tagging
- MEMMs (Maximum Entropy Markov Models)
28Rule-Based Tagging
- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags
- Leaving the correct tag for each word.
29Rule-based taggers
- Early POS taggers all hand-coded
- Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
- Stage 1: look up word in lexicon to give list of potential POSs
- Stage 2: apply rules which certify or disallow tag sequences
- Rules originally handwritten; more recently Machine Learning methods can be used
30Start With a Dictionary
- she: PRP
- promised: VBN, VBD
- to: TO
- back: VB, JJ, RB, NN
- the: DT
- bill: NN, VB
- Etc. for the 100,000 words of English with more than 1 tag
31Assign Every Possible Tag
                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill
32Write Rules to Eliminate Tags
- Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
                   NN
                   RB
                   JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill
(VBN eliminated for "promised")
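The three steps (dictionary, assign every possible tag, eliminate by rule) can be sketched as follows. The lexicon is the one from the slides; the function names and the list-of-lists data layout are assumptions made for this example, and only the single VBN/VBD rule is implemented:

```python
# Dictionary from the slides: every possible tag for each word.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Stage 1: look up every possible tag for each word."""
    return [list(LEXICON[w.lower()]) for w in words]


def drop_vbn_after_initial_prp(candidates):
    """Stage 2 rule: eliminate VBN if VBD is an option when the
    VBN|VBD word follows a sentence-initial PRP."""
    if len(candidates) >= 2 and candidates[0] == ["PRP"]:
        second = candidates[1]
        if "VBN" in second and "VBD" in second:
            second.remove("VBN")
    return candidates


words = "She promised to back the bill".split()
candidates = drop_vbn_after_initial_prp(assign_all_tags(words))
for word, tags in zip(words, candidates):
    print(word, tags)
```

After this one rule, "back" and "bill" are still ambiguous; a full rule-based tagger needs further rules to remove the remaining candidates.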
33Stage 1 of ENGTWOL Tagging
- First Stage: run words through FST morphological analyzer to get all parts of speech.
- Example: Pavlov had shown that salivation ...
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
34Stage 2 of ENGTWOL Tagging
- Second Stage: apply NEGATIVE constraints.
- Example: adverbial "that" rule
- Eliminates all readings of "that" except the one in "It isn't that odd"
Given input: "that"
If
  (+1 A/ADV/QUANT)   if next word is adj/adv/quantifier
  (+2 SENT-LIM)      following which is E-O-S
  (NOT -1 SVOC/A)    and the previous word is not a verb like "consider" which allows adjective complements, as in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
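The adverbial "that" constraint can be sketched as a function over a list of tokens. This is a simplified illustration, not ENGTWOL itself: each token is assumed to be a dict with a `word`, a set of `readings`, and an optional `svoc_a` flag for verbs like "consider"; the reading labels are shortened to ADV, PRON, DET, CS and A; and sentences are assumed to carry no final punctuation token, so the sentence limit is simply the end of the list.

```python
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}


def adverbial_that(sentence, i):
    """Apply the adverbial-that constraint to token i, in place."""
    token = sentence[i]
    if token["word"].lower() != "that":
        return
    # (+1 A/ADV/QUANT): next word has an adj/adv/quantifier reading.
    next_ok = (i + 1 < len(sentence)
               and bool(sentence[i + 1]["readings"] & ADJ_ADV_QUANT))
    # (+2 SENT-LIM): two positions to the right is the sentence end.
    at_end = i + 2 == len(sentence)
    # (NOT -1 SVOC/A): previous word must not be a verb like "consider".
    prev_blocks = i > 0 and sentence[i - 1].get("svoc_a", False)
    if next_ok and at_end and not prev_blocks:
        token["readings"] &= {"ADV"}   # eliminate non-ADV tags
    else:
        token["readings"] -= {"ADV"}   # eliminate ADV


def that_token():
    return {"word": "that", "readings": {"ADV", "PRON", "DET", "CS"}}


isnt = [{"word": "It", "readings": {"PRON"}},
        {"word": "isn't", "readings": {"V"}},
        that_token(),
        {"word": "odd", "readings": {"A"}}]
consider = [{"word": "I", "readings": {"PRON"}},
            {"word": "consider", "readings": {"V"}, "svoc_a": True},
            that_token(),
            {"word": "odd", "readings": {"A"}}]

adverbial_that(isnt, 2)
adverbial_that(consider, 2)
print(isnt[2]["readings"], consider[2]["readings"])
```

In "It isn't that odd" only the ADV reading survives; in "I consider that odd" the ADV reading is the one removed.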
35Inline Mark-up
- POS Tagging
- http://nlp.cs.qc.cuny.edu/wsj_pos.zip
- Input Format
- Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
- Output Format
- Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
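The inline word/TAG format is easy to read and write programmatically. A minimal sketch, assuming whitespace-separated tokens that each carry a tag; the function names are illustrative and not taken from any of the toolkits in these slides:

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    rpartition splits on the last slash, so a token whose word part
    itself contains a slash still gets the right tag.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Join (word, tag) pairs back into the inline format."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


line = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ./."
pairs = parse_tagged(line)
print(pairs[:3])
print(format_tagged(pairs) == line)
```

Splitting on the last slash rather than the first matters for tokens such as fractions or dates whose word part may contain a slash.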
36POS Tagging Tools
- NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
- http://nlp.cs.qc.cuny.edu/jet.zip
- http://nlp.cs.qc.cuny.edu/jet_src.zip
- http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
- Demo
- How it works
- Learned HMM: data/pos_hmm.txt
- Source code: src/jet/HMM/HMMTagger.java
37POS Tagging Tools
- Stanford tagger (Loglinear tagger) http://nlp.stanford.edu/software/tagger.shtml
- Brill tagger
- http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
- tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
- YamCha (SVM)
- http://chasen.org/~taku/software/yamcha/
- MXPOST (Maximum Entropy)
- ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
- More complete list at
- http://www-nlp.stanford.edu/links/statnlp.html#Taggers
38NLP Toolkits
- Uniform CL Annotation Platform
- UIMA (IBM NLP platform) http://incubator.apache.org/uima/svn.html
- Mallet (UMASS) http://mallet.cs.umass.edu/index.php/Main_Page
- MinorThird (CMU) http://minorthird.sourceforge.net/
- NLTK http://nltk.sourceforge.net/ Natural language toolkit, with data sets (Demo)
- Information Extraction
- Jet (NYU IE toolkit) http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
- Gate http://gate.ac.uk/download/index.html University of Sheffield IE toolkit
- Information Retrieval
- INDRI http://www.lemurproject.org/indri/ Information Retrieval toolkit
- Machine Translation
- Compara http://adamastor.linguateca.pt/COMPARA/Welcome.html
- ISI decoder http://www.isi.edu/licensed-sw/rewrite-decoder/
- MOSES http://www.statmt.org/moses/
39Looking Ahead: Next Class
- Machine Learning for POS Tagging: Hidden Markov Model