Title: NLTK Tagging
1 NLTK Tagging
- CS1573 AI Application Development, Spring 2003
- (modified from Steven Bird's notes)
2 Today's Outline
- Administration
- Final Words on Regular Expressions
- Regular Expressions in NLTK
- New Topic: Tagging
- Motivation and Linguistic Background
- NLTK Tutorial: Tagging
- Part-of-Speech Tagging
- The nltk.tagger Module
- A Few Tagging Algorithms
- Some Gory Details
3 Regular Expressions, again
- Python
  - Regular expression syntax
- NLTK uses
  - The regular expression tokenizer
  - A simple regular expression tagging algorithm
4 Regular Expression Tokenizers
- Mimicking the WSTokenizer
- >>> tokenizer = RETokenizer(r'[^\s]+')
- >>> tokenizer.tokenize(example_text)
- ['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]
5 RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+'
- >>> tokenizer = RETokenizer(regexp)
- >>> tokenizer.tokenize(example_text)
- ['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]
- Why is this version better?
6 RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+'
- Why is this version better?
  - Includes punctuation as separate tokens
  - Matches either a sequence of alphanumeric characters (letters and numbers) or a sequence of punctuation characters
- But it still has problems, for example prices like $22.40 (see the next slide)
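To see what this token pattern does, here is a minimal sketch using Python's re module directly (plain re.findall, not NLTK's RETokenizer):

    import re

    example_text = "Hello.  Isn't this fun?"

    # Token pattern: runs of alphanumerics OR runs of punctuation.
    print(re.findall(r"\w+|[^\w\s]+", example_text))
    # ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']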
7 Improved Example
- >>> example_text = 'That poster costs $22.40.'
- >>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
- >>> tokenizer = RETokenizer(regexp)
- >>> tokenizer.tokenize(example_text)
- ['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.40'@[3w], '.'@[4w]]
8 Regular Expression Limitations
- While Regular Languages can model many things, there are still limitations: they give no diagnostic information when an input is rejected, and when the accept condition is ambiguous they commit to all solutions or just one.
9 New Topic
- Now we're going to start looking at tagging, and especially approaches that depend on looking at words in context.
- We'll start with what looks like an artificial task: predicting the next word in a sequence.
- We'll then move to tagging, the process of associating auxiliary information with each token, often for use in later stages of text processing.
10 Word Prediction Example
- From NY Times
- Stocks plunged this ...
11 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest ...
12 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
13 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...
14 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
15 Word Prediction Example
- From NY Times
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.
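To make the prediction task concrete, here is a minimal bigram-based sketch in plain Python (illustrative only; these functions are not from NLTK):

    from collections import Counter, defaultdict

    def train_bigrams(words):
        """For each word, count how often each successor follows it."""
        successors = defaultdict(Counter)
        for w1, w2 in zip(words, words[1:]):
            successors[w1][w2] += 1
        return successors

    def predict_next(successors, word):
        """Return the most frequent successor of word, or None if unseen."""
        dist = successors.get(word)
        return dist.most_common(1)[0][0] if dist else None

    model = train_bigrams('stocks plunged this morning despite a cut in interest rates'.split())
    print(predict_next(model, 'this'))  # 'morning'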
16 Format Change
- Move to PDF slides (highlights of Jurafsky and Martin, Chapters 6 and 8)
17 Tagging Overview / Review
- Motivation
  - What is tagging? What does tagging do? Kinds of tagging?
  - Significance of part of speech
- Basics
  - Features and context
  - Brown and Penn Treebank tagsets
- Tagging in NLTK (nltk.tagger module)
- Tagging
  - Algorithms, statistical and rule-based tagging
  - Evaluation
18 Terminology
- Tagging
  - The process of associating labels with each token in a text
- Tags
  - The labels
- Tag Set
  - The collection of tags used for a particular task
19 Example
- Typically a tagged text is a sequence of white-space separated base/tag tokens:
- The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
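A minimal sketch of pulling such base/tag tokens apart in plain Python (illustrative; NLTK's TaggedTokenizer, shown later, does this job):

    def parse_tagged(text):
        """Split a 'base/tag base/tag ...' string into (base, tag) pairs."""
        pairs = []
        for tok in text.split():
            base, _, tag = tok.rpartition('/')  # split on the LAST slash
            pairs.append((base, tag))
        return pairs

    print(parse_tagged("The/at interior/nn ,/, is/bez majestic/jj ./."))
    # [('The', 'at'), ('interior', 'nn'), (',', ','), ('is', 'bez'), ('majestic', 'jj'), ('.', '.')]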
20 What does Tagging do?
- Collapses Distinctions
  - Lexical identity may be discarded
  - e.g. all personal pronouns tagged with PRP
- Introduces Distinctions
  - Ambiguities may be removed
  - e.g. "deal" tagged with NN or VB
  - e.g. "deal" tagged with DEAL1 or DEAL2
- Helps classification and prediction
21 Kinds of Tagging
- Part-of-Speech tagging
  - Grammatical tagging
  - Divides words into categories based on how they can be combined to form sentences (e.g., articles can combine with nouns but not verbs)
- Semantic Sense tagging
  - Sense disambiguation
  - Homonym disambiguation
- Discourse tagging
  - Speech acts (request, inform, greet, etc.)
22 Significance of Parts of Speech
- A word's POS tells us a lot about the word and its neighbors
- Limits the range of meanings ("deal"), pronunciation ("OBject" vs "obJECT"), or both ("wind")
- Helps in stemming
- Limits the range of following words for ASR
- Helps select nouns from a document for IR
- Basis for partial parsing
- Basis for searching for linguistic constructions
- Parsers can build trees directly on the POS tags instead of maintaining a lexicon
23 Features and Contexts
- [Diagram: a window of words w_n-2 w_n-1 w_n w_n+1 with their tags t_n-2 t_n-1 t_n t_n+1; the surrounding words and the preceding tags form the CONTEXT, and the tag t_n to be predicted is the FEATURE]
24 Why there are many tag sets
- Definition of POS tag
  - Semantic, syntactic, morphological
- Tagsets differ in both how they define the tags, and at what level of granularity
- Balancing classification and prediction
  - Introducing more distinctions
    - Better information about context
    - Harder to classify the current token
  - Introducing fewer distinctions
    - Less information about context
    - Less work to do when classifying the current token
25 The Brown Corpus
- The first digital corpus (1961)
  - Francis and Kucera, Brown University
- Contents: 500 texts, each 2000 words long
  - From American books, newspapers, magazines
  - Representing genres
    - Science fiction, romance fiction, press reportage, scientific writing, popular lore
26 Penn Treebank
- First syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees
27 Representing Tags in NLTK
- TaggedType class
- >>> ttype1 = TaggedType('dog', 'NN')
- 'dog'/'NN'
- >>> ttype1.base()
- 'dog'
- >>> ttype1.tag()
- 'NN'
- Tagged tokens
- >>> ttoken = Token(ttype1, Location(5))
- 'dog'/'NN'@[5]
28 Reading Tagged Corpora
- >>> tagged_text_str = open('corpus.txt').read()
- 'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END'
- >>> tokens = TaggedTokenizer().tokenize(tagged_text_str)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- If TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.
29 The TaggerI Interface
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- ['John'@[0w], 'saw'@[1w], 'the'@[2w], 'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w], '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]
- >>> my_tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- The interface defines a single method, tag, which assigns a tag to each token in a list, and returns the resulting list of tagged tokens.
30 Tagging Algorithms
- Default tagger
  - Inspect the word and guess a tag
- Unigram tagger
  - Assign the tag which is most probable for the word in question, based on raw frequency
  - Uses training data
- Bigram tagger, n-gram tagger
- Rule-based taggers, HMM taggers (outside the scope of this class)
31 Default Tagger
- We need something to use for unseen words
  - E.g., guess NNP for a word with an initial capital
- Do regular-expression processing of the words (see the sketch after this slide)
  - Sequence of regular expression tests
  - Assignment of the word to a suitable tag
- If there are no matches
  - Assign the most frequent tag, NN
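A minimal sketch of such a regular-expression default tagger in plain Python (the rule list and tag choices are illustrative, not NLTK's NN_CD_Tagger):

    import re

    # Ordered (pattern, tag) tests; the first match wins.
    DEFAULT_RULES = [
        (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers
        (r'^[A-Z][a-z]*$',       'NNP'),  # initial capital: guess proper noun
        (r'.*ing$',              'VBG'),  # -ing forms: guess gerund
    ]

    def default_tag(word):
        for pattern, tag in DEFAULT_RULES:
            if re.match(pattern, word):
                return tag
        return 'NN'  # no match: fall back to the most frequent tag

    print([default_tag(w) for w in ['John', 'saw', '3', 'running']])
    # ['NNP', 'NN', 'CD', 'VBG']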
32 Finding the most frequent tag
- nltk.probability module
- Count every tag in the tagged text, then take the most frequent:

    freq_dist = FreqDist()
    for ttoken in ttext:
        freq_dist.inc(ttoken.type().tag())
    def_tag = freq_dist.max()
33 A Default Tagger
- >>> tokens = WSTokenizer().tokenize(untag_text_str)
- ['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w], 'bears'@[4w], '.'@[5w]]
- >>> my_tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'NN'@[1w], '3'/'CD'@[2w], 'polar'/'NN'@[3w], 'bears'/'NN'@[4w], '.'/'NN'@[5w]]
- NN_CD_Tagger assigns CD to numbers, otherwise NN
- Poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance
34 Unigram Tagger
- Unigram: a table of frequencies
  - E.g. in a tagged WSJ sample, "deal" is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
  - 90% accuracy
- Counting events:

    freq_dist = CFFreqDist()
    for ttoken in ttext:
        context = ttoken.type().base()
        feature = ttoken.type().tag()
        freq_dist.inc(CFSample(context, feature))

- Finding the most likely tag for a context:

    context_event = ContextEvent(token.type())
    sample = freq_dist.cond_max(context_event)
    tag = sample.feature()
35 Unigram Tagger (continued)
- Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word
- 'train.txt' is a tagged training corpus
- >>> tagged_text_str = open('train.txt').read()
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- >>> tagger = UnigramTagger()
- >>> tagger.train(train_toks)
36 Unigram Tagger (continued)
- Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
37 Unigram Tagger (continued)
- Performance is highly dependent on the quality of its training set
  - Can't be too small
  - Can't be too different from the texts we actually want to tag
- How is this related to the homework that we just did?
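As a supplement, a compact dict-based sketch of the unigram idea in plain Python (illustrative names; the actual NLTK implementation appears in slides 52-55):

    from collections import Counter, defaultdict

    class SimpleUnigramTagger:
        """Tag each word with its most frequent tag in the training data."""
        def __init__(self, default='NN'):
            self._counts = defaultdict(Counter)
            self._default = default

        def train(self, tagged_pairs):
            for word, tag in tagged_pairs:
                self._counts[word][tag] += 1

        def tag(self, words):
            return [(w, self._counts[w].most_common(1)[0][0]
                        if self._counts[w] else self._default)
                    for w in words]

    tagger = SimpleUnigramTagger()
    tagger.train([('deal', 'NN'), ('deal', 'NN'), ('deal', 'VB'), ('the', 'AT')])
    print(tagger.tag(['the', 'deal', 'blah']))
    # [('the', 'AT'), ('deal', 'NN'), ('blah', 'NN')]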
38 Nth Order Tagging
- Bigram: a table of frequencies of pairs
  - Not necessarily adjacent or of the same category
  - What is the most likely tag for w_n, given w_n-1 and t_n-1?
  - What is the context for NLTK?
- N-gram tagger
  - Consider the n-1 previous tags
  - Sparse data problem
    - Accuracy versus coverage tradeoff
    - Backoff
- Throwing away order
  - Put context into a set
39 Nth-Order Tagging (continued)
- In addition to considering the token's type, the context also considers the tags of the n preceding tokens
- The tagger then picks the tag which is most likely for that context
- Different values of n are possible
  - 0th order: unigram tagger
  - 1st order: bigrams
  - 2nd order: trigrams
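For intuition, here is a rough plain-Python sketch of a 1st-order (bigram) tagger, where the context is the pair (previous tag, current word); the class name and structure are illustrative, not NLTK's:

    from collections import Counter, defaultdict

    class SimpleBigramTagger:
        """Context = (previous tag, current word); pick the most frequent tag."""
        def __init__(self):
            self._counts = defaultdict(Counter)

        def train(self, tagged_pairs):
            prev_tag = None  # start-of-text boundary
            for word, tag in tagged_pairs:
                self._counts[(prev_tag, word)][tag] += 1
                prev_tag = tag

        def tag(self, words):
            tagged, prev_tag = [], None
            for w in words:
                dist = self._counts.get((prev_tag, w))
                tag = dist.most_common(1)[0][0] if dist else None  # None = unseen context
                tagged.append((w, tag))
                prev_tag = tag
            return tagged

Note how an unseen context yields None: exactly the sparse-data situation that the backoff combination in slide 42 is designed to handle.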
40 Nth-Order Tagging (continued)
- A tagged training corpus determines the most likely tag for each context
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- >>> tagger = NthOrderTagger(3)  # 3rd order tagger
- >>> tagger.train(train_toks)
41 Nth-Order Tagging (continued)
- Once trained, it can tag untagged corpora
- >>> tokens = WSTokenizer().tokenize(untag_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
42 Combining Taggers
- Use more accurate algorithms when we can, back off to wider coverage when needed.
  - Try tagging the token with the 1st order tagger.
  - If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger.
  - If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.
43 BackoffTagger class
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
- Construct the taggers
- >>> tagger1 = NthOrderTagger(1)  # 1st order
- >>> tagger2 = UnigramTagger()   # 0th order
- >>> tagger3 = NN_CD_Tagger()
- Train the taggers
- >>> tagger1.train(train_toks)
- >>> tagger2.train(train_toks)
44 Backoff (continued)
- Combine the taggers (in order, by specificity)
- >>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
- Use the combined tagger
- >>> tokens = WSTokenizer().tokenize(untagged_text_str)
- >>> tagger.tag(tokens)
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
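The backoff control flow itself is tiny. A plain-Python sketch, assuming a hypothetical per-word interface tag_word that returns None when a tagger has no opinion:

    def backoff_tag(taggers, words):
        """Ask each tagger in turn; keep the first non-None answer per word."""
        tagged = []
        for w in words:
            tag = None
            for t in taggers:
                tag = t.tag_word(w)  # hypothetical interface, not NLTK's
                if tag is not None:
                    break
            tagged.append((w, tag))
        return tagged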
45 Rule-Based Tagger
- The Linguistic Complaint
  - Where is the linguistic knowledge of a tagger?
  - Just a massive table of numbers
  - Aren't there any linguistic insights that could emerge from the data?
- Could thus use handcrafted sets of rules to tag input sentences, for example: if the input follows a determiner, tag it as a noun.
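A minimal sketch of one such handcrafted rule in plain Python (illustrative; real rule-based taggers use many interacting rules):

    def rule_tag(words):
        """One handcrafted rule: a word right after a determiner is a noun."""
        DETERMINERS = {'the', 'a', 'an'}
        tagged, prev = [], None
        for w in words:
            if prev in DETERMINERS:
                tagged.append((w, 'NN'))
            elif w.lower() in DETERMINERS:
                tagged.append((w, 'AT'))
            else:
                tagged.append((w, None))  # no rule applies
            prev = w.lower()
        return tagged

    print(rule_tag(['The', 'deal', 'closed']))
    # [('The', 'AT'), ('deal', 'NN'), ('closed', None)]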
46 Evaluating a Tagger
- Tagged tokens: the original data
- Untag the data
- Tag the data with your own tagger
- Compare the original and new tags
  - Iterate over the two lists checking for identity and counting
  - Accuracy = fraction correct (see the sketch after this slide)
47 A Look at Tagging Implementations
- It demonstrates how to write classes implementing the interfaces defined by NLTK.
- It provides you with a better understanding of the algorithms and data structures underlying each approach to tagging.
- It gives you a chance to see some of the code used to implement NLTK. The developers have tried hard to ensure that the implementation of every class in NLTK is easy to understand.
48 A Sequential Tagger
- The taggers in this tutorial are implemented as sequential taggers
  - Assigns tags to one token at a time, starting with the first token of the text, and proceeding in sequential order.
  - Decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it.
- To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI))
- The next_tag method (written "next.tag" in the tutorial, a typo) returns the appropriate tag for the next token; each tagger subclass provides its own implementation
49 SequentialTagger.next_tag
- Decides which tag to assign a token, given the list of tagged tokens that precedes it.
- Takes two arguments, a list of tagged tokens preceding the token to be tagged and the token to be tagged, and returns the appropriate tag for that token.

    def next_tag(self, tagged_tokens, next_token):
        assert 0, "next_tag not defined by SequentialTagger subclass"
50 SequentialTagger.tag

    def tag(self, text):
        tagged_text = []
        # Tag each token, in sequential order.
        for token in text:
            # Get the tag for the next token.
            tag = self.next_tag(tagged_text, token)
            # Use the tag to build a tagged token; add it to tagged_text.
            tagged_token = Token(TaggedType(token.type(), tag), token.loc())
            tagged_text.append(tagged_token)
        return tagged_text
51 Example Subclass: NN_CD_Tagger

    class NN_CD_Tagger(SequentialTagger):
        def __init__(self): pass  # empty constructor
        def next_tag(self, tagged_tokens, next_token):
            # Assign 'CD' for numbers, 'NN' for anything else.
            if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
                return 'CD'
            else:
                return 'NN'

- We just define this method; when the tag method is called, the definition given by SequentialTagger will be used.
52 Another Example: UnigramTagger
- class UnigramTagger(TaggerI)  (as declared in the tutorial)
- class UnigramTagger(SequentialTagger)  (in fact it is a sequential tagger)
53 Unigram Tagger Training

    def train(self, tagged_tokens):
        for token in tagged_tokens:
            outcome = token.type().tag()
            context = token.type().base()
            self._freqdist[context].inc(outcome)
54 Unigram Tagger Tagging

    def next_tag(self, tagged_tokens, next_token):
        context = next_token.type()
        return self._freqdist[context].max()

- E.g. access the context and find the most likely outcome:
- >>> self._freqdist['bank'].max()
- 'NN'
55 Unigram Tagger Initialization
- The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution.

    def __init__(self):
        self._freqdist = probability.ConditionalFreqDist()
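For intuition, a conditional frequency distribution is essentially a mapping from conditions (here, word bases) to per-condition frequency distributions. A tiny plain-Python stand-in (illustrative; Counter plays the role of NLTK's FreqDist):

    from collections import Counter, defaultdict

    class TinyConditionalFreqDist:
        """Maps each condition (context) to its own frequency counter."""
        def __init__(self):
            self._dists = defaultdict(Counter)

        def __getitem__(self, condition):
            return self._dists[condition]

    cfd = TinyConditionalFreqDist()
    cfd['bank']['NN'] += 2
    cfd['bank']['VB'] += 1
    print(cfd['bank'].most_common(1)[0][0])  # 'NN'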
56 For Self-Study
- NthOrderTagger Implementation
- BackoffTagger Implementation
57 For Next Time