Title: NLTK Tagging
1 NLTK Tagging
- CS1573 AI Application Development, Spring 2003
- (modified from Steven Bird's notes)
2 Today's Outline
- Administration
- Final Words on Regular Expressions
- Regular Expressions in NLTK
- New Topic: Tagging
- Motivation and Linguistic Background
- NLTK Tutorial: Tagging
- Part-of-Speech Tagging
- The nltk.tagger Module
- A Few Tagging Algorithms
- Some Gory Details
3 Regular Expressions, again
- Python
  - Regular expression syntax
- NLTK uses
  - The regular expression tokenizer
  - A simple regular expression tagging algorithm
4 Regular Expression Tokenizers
- Mimicking the WSTokenizer
- >>> tokenizer = RETokenizer(r'[^\s]+')
- >>> tokenizer.tokenize(example_text)
- ['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]
5 RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+'
- >>> tokenizer = RETokenizer(regexp)
- >>> tokenizer.tokenize(example_text)
- ['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]
- Why is this version better?
6 RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+'
- Why is this version better?
  - Includes punctuation as separate tokens
  - Matches either a sequence of alphanumeric characters (letters and numbers) or a sequence of punctuation characters
- But it still has problems, for example prices like $22.40 (see the next slide)
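To see what this token pattern does, here is a minimal sketch using Python's re module directly (plain re.findall, not NLTK's RETokenizer):

    import re

    example_text = "Hello.  Isn't this fun?"

    # Token pattern: runs of alphanumerics OR runs of punctuation.
    print(re.findall(r"\w+|[^\w\s]+", example_text))
    # ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']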
7 Improved Example
- >>> example_text = 'That poster costs $22.40.'
- >>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
- >>> tokenizer = RETokenizer(regexp)
- >>> tokenizer.tokenize(example_text)
- ['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.40'@[3w], '.'@[4w]]
8 Regular Expression Limitations
- While Regular Languages can model many things, there are still limitations: they give no diagnostic information when an input is rejected, and when the accept condition is ambiguous they commit to all solutions or just one.
9 New Topic
- Now we're going to start looking at tagging, and especially approaches that depend on looking at words in context.
- We'll start with what looks like an artificial task: predicting the next word in a sequence.
- We'll then move to tagging, the process of associating auxiliary information with each token, often for use in later stages of text processing.
10 Word Prediction Example
- From NY Times
- Stocks plunged this ...
11 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest ...
12 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
13 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...
14 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
15 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.
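To make the prediction task concrete, here is a minimal bigram-based sketch in plain Python (illustrative only; these functions are not from NLTK):

    from collections import Counter, defaultdict

    def train_bigrams(words):
        """For each word, count how often each successor follows it."""
        successors = defaultdict(Counter)
        for w1, w2 in zip(words, words[1:]):
            successors[w1][w2] += 1
        return successors

    def predict_next(successors, word):
        """Return the most frequent successor of word, or None if unseen."""
        dist = successors.get(word)
        return dist.most_common(1)[0][0] if dist else None

    model = train_bigrams('stocks plunged this morning despite a cut in interest rates'.split())
    print(predict_next(model, 'this'))  # 'morning'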
16 Format Change
- Move to PDF slides (highlights of Jurafsky and Martin, Chapters 6 and 8)
17 Tagging Overview / Review
- Motivation
  - What is tagging? What does tagging do? Kinds of tagging?
  - Significance of part of speech
- Basics
  - Features and context
  - Brown and Penn Treebank tagsets
- Tagging in NLTK (nltk.tagger module)
- Tagging
  - Algorithms, statistical and rule-based tagging
  - Evaluation
18 Terminology
- Tagging
  - The process of associating labels with each token in a text
- Tags
  - The labels
- Tag Set
  - The collection of tags used for a particular task
19 Example
- Typically a tagged text is a sequence of white-space separated base/tag tokens:
- The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
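A minimal sketch of pulling such base/tag tokens apart in plain Python (illustrative; NLTK's TaggedTokenizer, shown later, does this job):

    def parse_tagged(text):
        """Split a 'base/tag base/tag ...' string into (base, tag) pairs."""
        pairs = []
        for tok in text.split():
            base, _, tag = tok.rpartition('/')  # split on the LAST slash
            pairs.append((base, tag))
        return pairs

    print(parse_tagged("The/at interior/nn ,/, is/bez majestic/jj ./."))
    # [('The', 'at'), ('interior', 'nn'), (',', ','), ('is', 'bez'), ('majestic', 'jj'), ('.', '.')]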
20 What does Tagging do?
- Collapses Distinctions
  - Lexical identity may be discarded
  - e.g. all personal pronouns tagged with PRP
- Introduces Distinctions
  - Ambiguities may be removed
  - e.g. "deal" tagged with NN or VB
  - e.g. "deal" tagged with DEAL1 or DEAL2
- Helps classification and prediction
21 Kinds of Tagging
- Part-of-Speech tagging
  - Grammatical tagging
  - Divides words into categories based on how they can be combined to form sentences (e.g., articles can combine with nouns but not verbs)
- Semantic Sense tagging
  - Sense disambiguation
  - Homonym disambiguation
- Discourse tagging
  - Speech acts (request, inform, greet, etc.)
22 Significance of Parts of Speech
- A word's POS tells us a lot about the word and its neighbors
- Limits the range of meanings ("deal"), pronunciation ("OBject" vs "obJECT"), or both ("wind")
- Helps in stemming
- Limits the range of following words for ASR
- Helps select nouns from a document for IR
- Basis for partial parsing
- Basis for searching for linguistic constructions
- Parsers can build trees directly on the POS tags instead of maintaining a lexicon
23 Features and Contexts
- [Diagram: a window of words w_n-2 w_n-1 w_n w_n+1 with their tags t_n-2 t_n-1 t_n t_n+1; the surrounding words and the preceding tags form the CONTEXT, and the tag t_n to be predicted is the FEATURE]
24 Why there are many tag sets
- Definition of POS tag
  - Semantic, syntactic, morphological
- Tagsets differ in both how they define the tags, and at what level of granularity
- Balancing classification and prediction
  - Introducing more distinctions
    - Better information about context
    - Harder to classify the current token
  - Introducing fewer distinctions
    - Less information about context
    - Less work to do when classifying the current token
25 The Brown Corpus
- The first digital corpus (1961)
  - Francis and Kucera, Brown University
- Contents: 500 texts, each 2000 words long
  - From American books, newspapers, magazines
  - Representing genres
    - Science fiction, romance fiction, press reportage, scientific writing, popular lore
26 Penn Treebank
- First syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees
27 Representing Tags in NLTK
- TaggedType class
- >>> ttype1 = TaggedType('dog', 'NN')
- 'dog'/'NN'
- >>> ttype1.base()
- 'dog'
- >>> ttype1.tag()
- 'NN'
- Tagged tokens
- >>> ttoken = Token(ttype1, Location(5))
- 'dog'/'NN'@[5]
28 Reading Tagged Corpora
- >>> tagged_text_str = open('corpus.txt').read()
- 'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END'
- >>> tokens = TaggedTokenizer().tokenize(tagged_text_str)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- If TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.
29 The TaggerI Interface
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- ['John'@[0w], 'saw'@[1w], 'the'@[2w], 'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w], '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]
- >>> my_tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- The interface defines a single method, tag, which assigns a tag to each token in a list, and returns the resulting list of tagged tokens.
30 Tagging Algorithms
- Default tagger
  - Inspect the word and guess a tag
- Unigram tagger
  - Assign the tag which is most probable for the word in question, based on raw frequency
  - Uses training data
- Bigram tagger, n-gram tagger
- Rule-based taggers, HMM taggers (outside the scope of this class)
31 Default Tagger
- We need something to use for unseen words
  - E.g., guess NNP for a word with an initial capital
- Do regular-expression processing of the words (see the sketch after this slide)
  - Sequence of regular expression tests
  - Assignment of the word to a suitable tag
- If there are no matches
  - Assign the most frequent tag, NN
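A minimal sketch of such a regular-expression default tagger in plain Python (the rule list and tag choices are illustrative, not NLTK's NN_CD_Tagger):

    import re

    # Ordered (pattern, tag) tests; the first match wins.
    DEFAULT_RULES = [
        (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers
        (r'^[A-Z][a-z]*$',       'NNP'),  # initial capital: guess proper noun
        (r'.*ing$',              'VBG'),  # -ing forms: guess gerund
    ]

    def default_tag(word):
        for pattern, tag in DEFAULT_RULES:
            if re.match(pattern, word):
                return tag
        return 'NN'  # no match: fall back to the most frequent tag

    print([default_tag(w) for w in ['John', 'saw', '3', 'running']])
    # ['NNP', 'NN', 'CD', 'VBG']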
32 Finding the most frequent tag
- nltk.probability module
- Count every tag in the tagged text, then take the most frequent:

    freq_dist = FreqDist()
    for ttoken in ttext:
        freq_dist.inc(ttoken.type().tag())
    def_tag = freq_dist.max()
33 A Default Tagger
- >>> tokens = WSTokenizer().tokenize(untag_text_str)
- ['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w], 'bears'@[4w], '.'@[5w]]
- >>> my_tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'NN'@[1w], '3'/'CD'@[2w], 'polar'/'NN'@[3w], 'bears'/'NN'@[4w], '.'/'NN'@[5w]]
- NN_CD_Tagger assigns CD to numbers, otherwise NN
- Poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance
34 Unigram Tagger
- Unigram: a table of frequencies
  - E.g. in a tagged WSJ sample, "deal" is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
  - 90% accuracy
- Counting events:

    freq_dist = CFFreqDist()
    for ttoken in ttext:
        context = ttoken.type().base()
        feature = ttoken.type().tag()
        freq_dist.inc(CFSample(context, feature))

- Finding the most likely tag for a context:

    context_event = ContextEvent(token.type())
    sample = freq_dist.cond_max(context_event)
    tag = sample.feature()
35 Unigram Tagger (continued)
- Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word
- 'train.txt' is a tagged training corpus
- >>> tagged_text_str = open('train.txt').read()
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- >>> tagger = UnigramTagger()
- >>> tagger.train(train_toks)
36 Unigram Tagger (continued)
- Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
37 Unigram Tagger (continued)
- Performance is highly dependent on the quality of its training set
  - Can't be too small
  - Can't be too different from the texts we actually want to tag
- How is this related to the homework that we just did?
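As a supplement, a compact dict-based sketch of the unigram idea in plain Python (illustrative names; the actual NLTK implementation appears in slides 52-55):

    from collections import Counter, defaultdict

    class SimpleUnigramTagger:
        """Tag each word with its most frequent tag in the training data."""
        def __init__(self, default='NN'):
            self._counts = defaultdict(Counter)
            self._default = default

        def train(self, tagged_pairs):
            for word, tag in tagged_pairs:
                self._counts[word][tag] += 1

        def tag(self, words):
            return [(w, self._counts[w].most_common(1)[0][0]
                        if self._counts[w] else self._default)
                    for w in words]

    tagger = SimpleUnigramTagger()
    tagger.train([('deal', 'NN'), ('deal', 'NN'), ('deal', 'VB'), ('the', 'AT')])
    print(tagger.tag(['the', 'deal', 'blah']))
    # [('the', 'AT'), ('deal', 'NN'), ('blah', 'NN')]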
38 Nth Order Tagging
- Bigram: a table of frequencies of pairs
  - Not necessarily adjacent or of the same category
  - What is the most likely tag for w_n, given w_n-1 and t_n-1?
  - What is the context for NLTK?
- N-gram tagger
  - Consider the n-1 previous tags
  - Sparse data problem
    - Accuracy versus coverage tradeoff
    - Backoff
- Throwing away order
  - Put context into a set
39 Nth-Order Tagging (continued)
- In addition to considering the token's type, the context also considers the tags of the n preceding tokens
- The tagger then picks the tag which is most likely for that context
- Different values of n are possible
  - 0th order: unigram tagger
  - 1st order: bigrams
  - 2nd order: trigrams
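For intuition, here is a rough plain-Python sketch of a 1st-order (bigram) tagger, where the context is the pair (previous tag, current word); the class name and structure are illustrative, not NLTK's:

    from collections import Counter, defaultdict

    class SimpleBigramTagger:
        """Context = (previous tag, current word); pick the most frequent tag."""
        def __init__(self):
            self._counts = defaultdict(Counter)

        def train(self, tagged_pairs):
            prev_tag = None  # start-of-text boundary
            for word, tag in tagged_pairs:
                self._counts[(prev_tag, word)][tag] += 1
                prev_tag = tag

        def tag(self, words):
            tagged, prev_tag = [], None
            for w in words:
                dist = self._counts.get((prev_tag, w))
                tag = dist.most_common(1)[0][0] if dist else None  # None = unseen context
                tagged.append((w, tag))
                prev_tag = tag
            return tagged

Note how an unseen context yields None: exactly the sparse-data situation that the backoff combination in slide 42 is designed to handle.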
40 Nth-Order Tagging (continued)
- A tagged training corpus determines the most likely tag for each context
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- >>> tagger = NthOrderTagger(3)  # 3rd order tagger
- >>> tagger.train(train_toks)
41 Nth-Order Tagging (continued)
- Once trained, it can tag untagged corpora
- >>> tokens = WSTokenizer().tokenize(untag_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
42 Combining Taggers
- Use more accurate algorithms when we can, back off to wider coverage when needed.
  - Try tagging the token with the 1st order tagger.
  - If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger.
  - If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.
43 BackoffTagger class
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- Construct the taggers
- >>> tagger1 = NthOrderTagger(1)  # 1st order
- >>> tagger2 = UnigramTagger()   # 0th order
- >>> tagger3 = NN_CD_Tagger()
- Train the taggers
- >>> tagger1.train(train_toks)
- >>> tagger2.train(train_toks)
44 Backoff (continued)
- Combine the taggers (in order, by specificity)
- >>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
- Use the combined tagger
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
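The backoff control flow itself is tiny. A plain-Python sketch, assuming a hypothetical per-word interface tag_word that returns None when a tagger has no opinion:

    def backoff_tag(taggers, words):
        """Ask each tagger in turn; keep the first non-None answer per word."""
        tagged = []
        for w in words:
            tag = None
            for t in taggers:
                tag = t.tag_word(w)  # hypothetical interface, not NLTK's
                if tag is not None:
                    break
            tagged.append((w, tag))
        return tagged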
45 Rule-Based Tagger
- The Linguistic Complaint
  - Where is the linguistic knowledge of a tagger?
  - Just a massive table of numbers
  - Aren't there any linguistic insights that could emerge from the data?
- Could thus use handcrafted sets of rules to tag input sentences, for example: if the input follows a determiner, tag it as a noun.
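A minimal sketch of one such handcrafted rule in plain Python (illustrative; real rule-based taggers use many interacting rules):

    def rule_tag(words):
        """One handcrafted rule: a word right after a determiner is a noun."""
        DETERMINERS = {'the', 'a', 'an'}
        tagged, prev = [], None
        for w in words:
            if prev in DETERMINERS:
                tagged.append((w, 'NN'))
            elif w.lower() in DETERMINERS:
                tagged.append((w, 'AT'))
            else:
                tagged.append((w, None))  # no rule applies
            prev = w.lower()
        return tagged

    print(rule_tag(['The', 'deal', 'closed']))
    # [('The', 'AT'), ('deal', 'NN'), ('closed', None)]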
46 Evaluating a Tagger
- Tagged tokens: the original data
- Untag the data
- Tag the data with your own tagger
- Compare the original and new tags
  - Iterate over the two lists checking for identity and counting
  - Accuracy = fraction correct (see the sketch after this slide)
47 A Look at Tagging Implementations
- It demonstrates how to write classes implementing the interfaces defined by NLTK.
- It provides you with a better understanding of the algorithms and data structures underlying each approach to tagging.
- It gives you a chance to see some of the code used to implement NLTK. The developers have tried hard to ensure that the implementation of every class in NLTK is easy to understand.
48 A Sequential Tagger
- The taggers in this tutorial are implemented as sequential taggers
  - Assigns tags to one token at a time, starting with the first token of the text, and proceeding in sequential order.
  - Decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it.
- To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI))
- The next_tag method (written "next.tag" in the tutorial, a typo) returns the appropriate tag for the next token; each tagger subclass provides its own implementation
49 SequentialTagger.next_tag
- Decides which tag to assign a token, given the list of tagged tokens that precedes it.
- Takes two arguments, a list of tagged tokens preceding the token to be tagged and the token to be tagged, and returns the appropriate tag for that token.

    def next_tag(self, tagged_tokens, next_token):
        assert 0, "next_tag not defined by SequentialTagger subclass"
50 SequentialTagger.tag

    def tag(self, text):
        tagged_text = []
        # Tag each token, in sequential order.
        for token in text:
            # Get the tag for the next token.
            tag = self.next_tag(tagged_text, token)
            # Use the tag to build a tagged token; add it to tagged_text.
            tagged_token = Token(TaggedType(token.type(), tag), token.loc())
            tagged_text.append(tagged_token)
        return tagged_text
51 Example Subclass: NN_CD_Tagger

    class NN_CD_Tagger(SequentialTagger):
        def __init__(self): pass  # empty constructor
        def next_tag(self, tagged_tokens, next_token):
            # Assign 'CD' for numbers, 'NN' for anything else.
            if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
                return 'CD'
            else:
                return 'NN'

- We just define this method; when the tag method is called, the definition given by SequentialTagger will be used.
52 Another Example: UnigramTagger
- class UnigramTagger(TaggerI)  (as declared in the tutorial)
- class UnigramTagger(SequentialTagger)  (in fact it is a sequential tagger)
53 Unigram Tagger Training

    def train(self, tagged_tokens):
        for token in tagged_tokens:
            outcome = token.type().tag()
            context = token.type().base()
            self._freqdist[context].inc(outcome)
54 Unigram Tagger Tagging

    def next_tag(self, tagged_tokens, next_token):
        context = next_token.type()
        return self._freqdist[context].max()

- E.g. access the context and find the most likely outcome:
- >>> self._freqdist['bank'].max()
- 'NN'
55 Unigram Tagger Initialization
- The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution.

    def __init__(self):
        self._freqdist = probability.ConditionalFreqDist()
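For intuition, a conditional frequency distribution is essentially a mapping from conditions (here, word bases) to per-condition frequency distributions. A tiny plain-Python stand-in (illustrative; Counter plays the role of NLTK's FreqDist):

    from collections import Counter, defaultdict

    class TinyConditionalFreqDist:
        """Maps each condition (context) to its own frequency counter."""
        def __init__(self):
            self._dists = defaultdict(Counter)

        def __getitem__(self, condition):
            return self._dists[condition]

    cfd = TinyConditionalFreqDist()
    cfd['bank']['NN'] += 2
    cfd['bank']['VB'] += 1
    print(cfd['bank'].most_common(1)[0][0])  # 'NN'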
56 For Self-Study
- NthOrderTagger Implementation
- BackoffTagger Implementation
57 For Next Time