NLTK Tagging - PowerPoint PPT Presentation


Transcript and Presenter's Notes


1
NLTK Tagging
  • CS1573 AI Application Development, Spring 2003
  • (modified from Steven Bird's notes)

2
Today's Outline
  • Administration
  • Final Words on Regular Expressions
  • Regular Expressions in NLTK
  • New Topic: Tagging
  • Motivation and Linguistic Background
  • NLTK Tutorial: Tagging
  • Part-of-Speech Tagging
  • The nltk.tagger Module
  • A Few Tagging Algorithms
  • Some Gory Details

3
Regular Expressions, again
  • Python
  • Regular expression syntax
  • NLTK uses
  • The regular expression tokenizer
  • A simple regular expression tagging algorithm

4
Regular Expression Tokenizers
  • Mimicking the WSTokenizer
  • >>> tokenizer = RETokenizer(r'[^\s]+')
  • >>> tokenizer.tokenize(example_text)
  • ['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w],
    'fun?'@[3w]]

5
RE Tokenization, continued
  • >>> regexp = r'\w+|[^\w\s]+'
  • '\w+|[^\w\s]+'
  • >>> tokenizer = RETokenizer(regexp)
  • >>> tokenizer.tokenize(example_text)
    ['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w],
    't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]
  • Why is this version better?

6
RE Tokenization, continued
  • >>> regexp = r'\w+|[^\w\s]+'
  • Why is this version better?
  • - includes punctuation as separate tokens
  • - matches either a sequence of alphanumeric
    characters (letters and numbers) or a sequence
    of punctuation characters.
  • But it still has problems. For example?

7
Improved Example
  • >>> example_text = 'That poster costs $22.40.'
  • >>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
    '(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
  • >>> tokenizer = RETokenizer(regexp)
  • >>> tokenizer.tokenize(example_text)
    ['That'@[0w], 'poster'@[1w], 'costs'@[2w],
    '$22.40'@[3w], '.'@[4w]]
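
The tokenizer patterns from the last few slides can be tried with Python's standard re module. This is a minimal sketch, not the NLTK RETokenizer itself: the helper name regexp_tokenize and the three constant names are assumptions made for illustration, and the function returns plain strings rather than tokens with locations.

```python
import re

def regexp_tokenize(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    # finditer + group() yields the whole match even when the pattern has groups
    return [m.group() for m in re.finditer(pattern, text)]

WHITESPACE = r'[^\s]+'                             # runs of non-whitespace
WORD_OR_PUNCT = r'\w+|[^\w\s]+'                    # words, or runs of punctuation
WITH_MONEY = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'      # also keeps dollar amounts whole

text1 = "Hello. Isn't this fun?"
text2 = 'That poster costs $22.40.'

print(regexp_tokenize(WHITESPACE, text1))
print(regexp_tokenize(WORD_OR_PUNCT, text1))
print(regexp_tokenize(WORD_OR_PUNCT, text2))   # splits the price into pieces
print(regexp_tokenize(WITH_MONEY, text2))      # keeps the price as one token
```

Comparing the last two lines of output shows the remaining problem with the simpler pattern: it breaks a price into four separate tokens.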

8
Regular Expression Limitations
  • While regular languages can model many things,
    there are still limitations (no advice when an
    input is rejected; all solutions or just one
    when the accept condition is ambiguous).

9
New Topic
  • Now we're going to start looking at tagging, and
    especially approaches that depend on looking at
    words in context.
  • We'll start with what looks like an artificial
    task: predicting the next word in a sequence.
  • We'll then move to tagging, the process of
    associating auxiliary information with each
    token, often for use in later stages of text
    processing.

10
Word Prediction Example
  • From NY Times
  • Stocks plunged this

11
Word Prediction Example
  • From NY Times
  • Stocks plunged this morning, despite a cut in
    interest

12
Word Prediction Example
  • From NY Times
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall

13
Word Prediction Example
  • From NY Times
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began

14
Word Prediction Example
  • From NY Times
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last

15
Word Prediction Example
  • From NY Times
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last Tuesday's terrorist attacks.
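
One simple way to approach the word prediction task is to count which word most often follows each word in some training text. The sketch below assumes a bigram model built with plain Python dictionaries; the function names and the toy training text are made up for illustration and do not come from the slides.

```python
from collections import Counter, defaultdict

def train_bigrams(words):
    """Count, for each word, which words follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of word, or None if word is unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

# Toy training text (invented for this sketch).
corpus = "interest rates fell as interest rates were cut and interest grew".split()
model = train_bigrams(corpus)

print(predict_next(model, "interest"))   # most common follower in the toy text
print(predict_next(model, "stocks"))     # unseen word, so no prediction
```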

16
Format Change
  • Move to pdf slides (highlights of Jurafsky and
    Martin Chapters 6 and 8)

17
Tagging Overview / Review
  • Motivation
  • What is tagging? What does tagging do? Kinds of
    tagging?
  • Significance of part of speech
  • Basics
  • Features and context
  • Brown and Penn Treebank, tagsets
  • Tagging in NLTK (nltk.tagger module)
  • Tagging
  • Algorithms, statistical and rule-based tagging
  • Evaluation

18
Terminology
  • Tagging
  • The process of associating labels with each token
    in a text
  • Tags
  • The labels
  • Tag Set
  • The collection of tags used for a particular task

19
Example
  • Typically a tagged text is a sequence of
    white-space separated base/tag tokens
  • The/at Pantheon's/np$ interior/nn ,/, still/rb
    in/in its/pp$ original/jj form/nn ,/, is/bez
    truly/ql majestic/jj and/cc an/at
    architectural/jj triumph/nn ./. Its/pp$ rotunda/nn
    forms/vbz a/at perfect/jj circle/nn whose/wp$
    diameter/nn is/bez equal/jj to/in the/at
    height/nn from/in the/at floor/nn to/in the/at
    ceiling/nn ./.
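
Text in this base/tag format is easy to split into pairs. The following is a minimal sketch in plain Python (the function name parse_tagged_text is an assumption, and it assumes every token carries a tag):

```python
def parse_tagged_text(text):
    """Split whitespace-separated 'base/tag' tokens into (base, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST slash, so './.' becomes ('.', '.')
        base, _, tag = item.rpartition('/')
        pairs.append((base, tag))
    return pairs

sample = "The/at rotunda/nn forms/vbz a/at perfect/jj circle/nn ./."
pairs = parse_tagged_text(sample)
print(pairs)
```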

20
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • e.g. all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • e.g. deal tagged with NN or VB
  • e.g. deal tagged with DEAL1 or DEAL2
  • Helps classification and prediction

21
Kinds of Tagging
  • Part-of-Speech tagging
  • Grammatical tagging
  • Divides words into categories based on how they
    can be combined to form sentences (e.g., articles
    can combine with nouns but not verbs)
  • Semantic Sense tagging
  • Sense disambiguation
  • Homonym disambiguation
  • Discourse tagging
  • Speech acts (request, inform, greet, etc.)

22
Significance of Parts of Speech
  • A word's POS tells us a lot about the word and
    its neighbors
  • Limits the range of meanings (deal),
    pronunciation (OBject vs. obJECT) or both (wind)
  • Helps in stemming
  • Limits the range of following words for ASR
  • Helps select nouns from a document for IR
  • Basis for partial parsing
  • Basis for searching for linguistic constructions
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon

23
Features and Contexts
  • [Figure: word positions w_n-2, w_n-1, w_n, w_n+1
    aligned with tag positions t_n-2, t_n-1, t_n,
    t_n+1, with parts labeled CONTEXT and FEATURE]

24
Why there are many tag sets
  • Definition of POS tag
  • Semantic, syntactic, morphological
  • Tagsets differ in both how they define the tags,
    and at what level of granularity
  • Balancing classification and prediction
  • Introducing more distinctions
  • Better information about context
  • Harder to classify current token
  • Introducing fewer distinctions
  • Less information about context
  • Less work to do for classifying current token

25
The Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each 2000 words long
  • From American books, newspapers, magazines
  • Representing genres
  • Science fiction, romance fiction, press reportage,
    scientific writing, popular lore

26
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

27
Representing Tags in NLTK
  • TaggedType class
  • >>> ttype1 = TaggedType('dog', 'NN')
  • 'dog'/'NN'
  • >>> ttype1.base()
  • 'dog'
  • >>> ttype1.tag()
  • 'NN'
  • Tagged tokens
  • >>> ttoken = Token(ttype1, Location(5))
  • 'dog'/'NN'@[5]

28
Reading Tagged Corpora
  • >>> tagged_text_str = open('corpus.txt').read()
  • 'John/NN saw/VB the/AT book/NN on/IN the/AT
    table/NN ./END He/NN sighed/VB ./END'
  • >>> tokens = TaggedTokenizer().tokenize(tagged_text_str)
  • ['John'/'NN'@[0w], 'saw'/'VB'@[1w],
    'the'/'AT'@[2w], 'book'/'NN'@[3w],
    'on'/'IN'@[4w], 'the'/'AT'@[5w],
    'table'/'NN'@[6w], '.'/'END'@[7w],
    'He'/'NN'@[8w], 'sighed'/'VB'@[9w],
    '.'/'END'@[10w]]
  • If TaggedTokenizer encounters a word without a
    tag, it will assign it the default tag None.
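
The behavior described here, including the None default for untagged words, can be imitated in a few lines of plain Python. This is only a sketch under stated assumptions: TaggedToken and tokenize_tagged are hypothetical names, and the word index stands in for NLTK's location objects.

```python
from typing import NamedTuple, Optional

class TaggedToken(NamedTuple):
    base: str            # the word itself
    tag: Optional[str]   # None when the input had no '/tag' suffix
    loc: int             # word index within the text

def tokenize_tagged(text):
    """Turn 'base/tag' text into TaggedToken records, one per word."""
    tokens = []
    for loc, item in enumerate(text.split()):
        if '/' in item:
            base, _, tag = item.rpartition('/')
        else:
            base, tag = item, None   # untagged word gets the default tag None
        tokens.append(TaggedToken(base, tag, loc))
    return tokens

tokens = tokenize_tagged("John/NN saw/VB the book/NN ./END")
for token in tokens:
    print(token)
```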

29
The TaggerI Interface
  • >>> tokens = WSTokenizer().tokenize(untagged_text_str)
    ['John'@[0w], 'saw'@[1w], 'the'@[2w],
    'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w],
    '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]
  • >>> my_tagger.tag(tokens)
  • ['John'/'NN'@[0w], 'saw'/'VB'@[1w],
    'the'/'AT'@[2w], 'book'/'NN'@[3w],
    'on'/'IN'@[4w], 'the'/'AT'@[5w],
    'table'/'NN'@[6w], '.'/'END'@[7w],
    'He'/'NN'@[8w], 'sighed'/'VB'@[9w],
    '.'/'END'@[10w]]
  • The interface defines a single method, tag, which
    assigns a tag to each token in a list, and
    returns the resulting list of tagged tokens.
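
A single-method interface like this can be written in modern Python as an abstract base class. The sketch below is an illustration, not NLTK's TaggerI: the class names Tagger and ConstantTagger are assumptions, and tokens are plain strings paired with tags in tuples.

```python
from abc import ABC, abstractmethod

class Tagger(ABC):
    """Interface: anything with a tag() method that labels a list of tokens."""

    @abstractmethod
    def tag(self, tokens):
        """Return a list of (token, tag) pairs, one per input token."""

class ConstantTagger(Tagger):
    """Trivial implementation: gives every token the same tag."""

    def __init__(self, default_tag):
        self._default_tag = default_tag

    def tag(self, tokens):
        return [(token, self._default_tag) for token in tokens]

my_tagger = ConstantTagger('NN')
tagged = my_tagger.tag(['John', 'saw', 'the', 'book'])
print(tagged)
```

Because tag is abstract, the interface itself cannot be instantiated; only classes that supply tag can.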

30
Tagging Algorithms
  • Default tagger
  • Inspect the word and guess a tag
  • Unigram tagger
  • Assign the tag which is the most probable for the
    word in question, based on raw frequency
  • Uses training data
  • Bigram tagger, n-gram tagger
  • Rule-based taggers, HMM taggers (outside scope of
    this class)

31
Default Tagger
  • We need something to use for unseen words
  • E.g., guess NNP for a word with an initial
    capital
  • Do regular-expression processing of the words
  • Sequence of regular expression tests
  • Assignment of the word to a suitable tag
  • If there are no matches
  • Assign to the most frequent tag, NN
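
The idea of running a sequence of regular expression tests and falling back to NN can be sketched as follows. The two rules shown are illustrative guesses (numbers as CD, capitalized words as NNP); the names RULES, FALLBACK_TAG and default_tag are assumptions, not part of NLTK.

```python
import re

# Ordered (pattern, tag) tests; the first pattern that matches wins.
RULES = [
    (re.compile(r'^[0-9]+(\.[0-9]+)?$'), 'CD'),   # integers and decimals
    (re.compile(r'^[A-Z]'), 'NNP'),               # initial capital letter
]
FALLBACK_TAG = 'NN'   # most frequent tag, used when no rule matches

def default_tag(word):
    """Guess a tag for word by inspecting its spelling only."""
    for pattern, tag in RULES:
        if pattern.search(word):
            return tag
    return FALLBACK_TAG

for word in ['3', '22.40', 'John', 'bears']:
    print(word, default_tag(word))
```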

32
Finding the most frequent tag
  • nltk.probability module
  • freq_dist = FreqDist()   # start with empty counts
    for ttoken in ttext:
        freq_dist.inc(ttoken.tag())
    def_tag = freq_dist.max()
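
The same computation can be done with the standard library alone. This sketch uses collections.Counter in place of the nltk.probability frequency distribution and represents tagged tokens as (word, tag) tuples, both of which are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_tag(tagged_tokens):
    """Return the tag that occurs most often in a list of (word, tag) pairs."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    return counts.most_common(1)[0][0]

ttext = [('John', 'NN'), ('saw', 'VB'), ('the', 'AT'), ('book', 'NN'),
         ('on', 'IN'), ('the', 'AT'), ('table', 'NN')]
def_tag = most_frequent_tag(ttext)
print(def_tag)
```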

33
A Default Tagger
  • >>> tokens = WSTokenizer().tokenize(untag_text_str)
    ['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w],
    'bears'@[4w], '.'@[5w]]
  • >>> my_tagger.tag(tokens)
  • ['John'/'NN'@[0w], 'saw'/'NN'@[1w],
    '3'/'CD'@[2w], 'polar'/'NN'@[3w],
    'bears'/'NN'@[4w], '.'/'NN'@[5w]]
  • NN_CD_Tagger assigns CD to numbers, otherwise NN
  • Poor performance (20-30%) in isolation, but when
    used with other taggers can significantly improve
    performance

34
Unigram Tagger
  • Unigram table of frequencies
  • E.g. in tagged WSJ sample, deal is tagged with
    NN 11 times, with VB 1 time, and with VBP 1 time
  • 90% accuracy
  • Counting events
  • freq_dist = CFFreqDist()
    for ttoken in ttext:
        context = ttoken.type().base()
        feature = ttoken.type().tag()
        freq_dist.inc(CFSample(context, feature))
  • context_event = ContextEvent(token.type())
    sample = freq_dist.cond_max(context_event)
    tag = sample.feature()
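
The counting step amounts to keeping, for each word (the context), a tally of the tags (the feature) it was seen with. A minimal sketch in plain Python, assuming (word, tag) tuples and hypothetical function names, using the counts for "deal" quoted on this slide:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_tokens):
    """Map each word (the context) to a Counter of its tags (the feature)."""
    freq = defaultdict(Counter)
    for word, tag in tagged_tokens:
        freq[word][tag] += 1
    return freq

def unigram_tag(freq, word):
    """Most frequent tag seen for word in training, or None if unseen."""
    if word not in freq:
        return None
    return freq[word].most_common(1)[0][0]

# 'deal' tagged NN 11 times, VB once and VBP once, as in the WSJ sample.
train = [('deal', 'NN')] * 11 + [('deal', 'VB'), ('deal', 'VBP')]
freq = train_unigram(train)

print(unigram_tag(freq, 'deal'))
print(unigram_tag(freq, 'bank'))   # never seen in training
```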

35
Unigram Tagger (continued)
  • Before being used, UnigramTaggers are trained
    using the train method, which uses a tagged
    corpus to determine which tags are most common
    for each word
  • # 'train.txt' is a tagged training corpus
  • >>> tagged_text_str = open('train.txt').read()
  • >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
  • >>> tagger = UnigramTagger()
  • >>> tagger.train(train_toks)

36
Unigram Tagger (continued)
  • Once a UnigramTagger has been trained, the tag
    method can be used to tag untagged corpora
  • >>> tokens = WSTokenizer().tokenize(untagged_text_str)
  • >>> tagger.tag(tokens)
  • ['John'/'NN'@[0w], 'saw'/'VB'@[1w],
    'the'/'AT'@[2w], 'book'/'NN'@[3w],
    'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

37
Unigram Tagger (continued)
  • Performance is highly dependent on the quality of
    its training set
  • Can't be too small
  • Can't be too different from texts we actually
    want to tag
  • How is this related to the homework that we just
    did?

38
Nth Order Tagging
  • Bigram table: frequencies of pairs
  • Not necessarily adjacent or of same category
  • What is the most likely tag for w_n, given w_n-1
    and t_n-1?
  • What is the context for NLTK?
  • N-gram tagger
  • Consider n-1 previous tags
  • Sparse data problem
  • Accuracy versus coverage tradeoff
  • Backoff
  • Throwing away order
  • Put context into a set

39
Nth-Order Tagging (continued)
  • In addition to considering the token's type, the
    context also considers the tags of the n
    preceding tokens
  • The tagger then picks the tag which is most
    likely for that context
  • Different values of n are possible
  • 0th order: unigram tagger
  • 1st order: bigrams
  • 2nd order: trigrams
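
The context described here, the current word together with the tags of the n preceding tokens, can be sketched with a dictionary keyed on that pair. This is an illustrative plain-Python version, not NLTK's NthOrderTagger; the class name NthOrderSketch and the (word, tag) tuple representation are assumptions.

```python
from collections import Counter, defaultdict

class NthOrderSketch:
    def __init__(self, n):
        self._n = n
        self._freq = defaultdict(Counter)

    def _context(self, prev_tags, word):
        # Context = the last n tags assigned so far, plus the current word.
        history = tuple(prev_tags[-self._n:]) if self._n > 0 else ()
        return (history, word)

    def train(self, tagged_sentence):
        tags = []
        for word, tag in tagged_sentence:
            self._freq[self._context(tags, word)][tag] += 1
            tags.append(tag)

    def tag(self, words):
        tags = []
        for word in words:
            context = self._context(tags, word)
            if context in self._freq:
                tags.append(self._freq[context].most_common(1)[0][0])
            else:
                tags.append(None)   # unseen context: the sparse data problem
        return list(zip(words, tags))

train_sentence = [('I', 'PPSS'), ('saw', 'VB'), ('the', 'AT'), ('saw', 'NN')]
tagger = NthOrderSketch(1)          # 1st order: one preceding tag
tagger.train(train_sentence)

print(tagger.tag(['I', 'saw', 'the', 'saw']))   # 'saw' gets two different tags
print(tagger.tag(['the', 'saw']))               # contexts never seen in training
```

The second call shows why higher-order taggers need a fallback: a context that never occurred in training yields no tag at all.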

40
Nth-Order Tagging (continued)
  • Tagged training corpus determines most likely tag
    for each context
  • >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
  • >>> tagger = NthOrderTagger(3)  # 3rd order tagger
  • >>> tagger.train(train_toks)

41
Nth-Order Tagging (continued)
  • Once trained, it can tag untagged corpora
  • >>> tokens = WSTokenizer().tokenize(untag_text_str)
  • >>> tagger.tag(tokens)
  • ['John'/'NN'@[0w], 'saw'/'VB'@[1w],
    'the'/'AT'@[2w], 'book'/'NN'@[3w],
    'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

42
Combining Taggers
  • Use more accurate algorithms when we can, back
    off to wider coverage when needed.
  • Try tagging the token with the 1st order tagger.
  • If the 1st order tagger is unable to find a tag
    for the token, try finding a tag with the 0th
    order tagger.
  • If the 0th order tagger is also unable to find a
    tag, use the NN_CD_Tagger to find a tag.

43
BackoffTagger class
  • >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
  • Construct the taggers
  • >>> tagger1 = NthOrderTagger(1)  # 1st order
  • >>> tagger2 = UnigramTagger()    # 0th order
  • >>> tagger3 = NN_CD_Tagger()
  • Train the taggers
  • >>> tagger1.train(train_toks)
  • >>> tagger2.train(train_toks)

44
Backoff (continued)
  • Combine the taggers (in order, by specificity)
  • >>> tagger = BackoffTagger([tagger1, tagger2,
    tagger3])
  • Use the combined tagger
  • >>> tokens = TaggedTokenizer().tokenize(untagged_text_str)
  • >>> tagger.tag(tokens)
  • ['John'/'NN'@[0w], 'saw'/'VB'@[1w],
    'the'/'AT'@[2w], 'book'/'NN'@[3w],
    'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
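
The backoff idea itself is small: try each tagger in order of specificity and keep the first answer that is not None. The sketch below simplifies taggers to per-word functions (an assumption; the NLTK classes work on token lists), and the lookup table stands in for a trained unigram tagger.

```python
import re

def backoff_tag(word, taggers):
    """Return the first non-None tag produced by taggers, tried in order."""
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None

# A tiny lookup table standing in for a trained unigram tagger.
UNIGRAM_TABLE = {'the': 'AT', 'saw': 'VB', 'book': 'NN'}

def unigram(word):
    return UNIGRAM_TABLE.get(word)     # None for unseen words

def nn_cd(word):
    # Always answers: CD for numbers, NN for anything else.
    return 'CD' if re.match(r'^[0-9]+(\.[0-9]+)?$', word) else 'NN'

words = ['the', 'saw', '3', 'polar', 'bears']
tagged = [(w, backoff_tag(w, [unigram, nn_cd])) for w in words]
print(tagged)
```

Placing the always-answering tagger last guarantees that every word receives some tag.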

45
Rule-Based Tagger
  • The Linguistic Complaint
  • Where is the linguistic knowledge of a tagger?
  • Just a massive table of numbers
  • Aren't there any linguistic insights that could
    emerge from the data?
  • Could thus use handcrafted sets of rules to tag
    input sentences; for example, if a word follows
    a determiner, tag it as a noun.

46
Evaluating a Tagger
  • Tagged tokens: the original data
  • Untag the data
  • Tag the data with your own tagger
  • Compare the original and new tags
  • Iterate over the two lists checking for identity
    and counting
  • Accuracy: fraction correct
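
The evaluation steps above can be sketched as follows, assuming tagged tokens are (word, tag) tuples and using a deliberately weak all-NN tagger as the system under test (both are assumptions made for illustration).

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold = [('John', 'NN'), ('saw', 'VB'), ('the', 'AT'), ('book', 'NN')]
untagged = [word for word, _tag in gold]           # untag the data
predicted = [(word, 'NN') for word in untagged]    # tag with an all-NN tagger
score = accuracy(gold, predicted)
print(score)
```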

47
A Look at Tagging Implementations
  • It demonstrates how to write classes implementing
    the interfaces defined by NLTK.
  • It provides you with a better understanding of
    the algorithms and data structures underlying
    each approach to tagging.
  • It gives you a chance to see some of the code
    used to implement NLTK. The developers have tried
    hard to ensure that the implementation of every
    class in NLTK is easy to understand.

48
A Sequential Tagger
  • The taggers in this tutorial are implemented as
    sequential taggers
  • Assigns tags to one token at a time, starting
    with the first token of the text, and proceeding
    in sequential order.
  • Decides which tag to assign a token on the basis
    of that token, the tokens that precede it, and
    the predicted tags for the tokens that precede
    it.
  • To capture this commonality, we define a common
    base class, SequentialTagger (class
    SequentialTagger(TaggerI))
  • The next_tag method (note the typo "next.tag" in
    the tutorial) returns the appropriate tag for
    the next token; each tagger subclass provides
    its own implementation

49
SequentialTagger.next_tag
  • Decides which tag to assign a token, given the
    list of tagged tokens that precedes it.
  • Takes two arguments: a list of tagged tokens
    preceding the token to be tagged, and the token
    to be tagged. It returns the appropriate tag for
    that token.
  • def next_tag(self, tagged_tokens, next_token):
        assert 0, "next_tag not defined by SequentialTagger subclass"

50
SequentialTagger.tag
  • def tag(self, text):
        tagged_text = []
        # Tag each token, in sequential order.
        for token in text:
            # Get the tag for the next token.
            tag = self.next_tag(tagged_text, token)
            # Use tag to build tagged token, add to tagged_text.
            tagged_token = Token(TaggedType(token.type(), tag), token.loc())
            tagged_text.append(tagged_token)
        return tagged_text

51
Example Subclass NN_CD_Tagger
  • class NN_CD_Tagger(SequentialTagger):
        def __init__(self): pass  # empty constructor
        def next_tag(self, tagged_tokens, next_token):
            # Assign 'CD' for numbers, 'NN' for anything else.
            if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
                return 'CD'
            else:
                return 'NN'
  • We only need to define this method; when the tag
    method is called, the definition given by
    SequentialTagger will be used.
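
For readers without the 2003 version of NLTK, the base class and its subclass can be approximated in plain Python. This is a hedged sketch, not the NLTK source: tokens are plain strings, tagged tokens are (token, tag) tuples, and the subclass is renamed NNCDTagger to mark it as a stand-in.

```python
import re

class SequentialTagger:
    """Tags tokens one at a time, left to right."""

    def next_tag(self, tagged_tokens, next_token):
        raise NotImplementedError("next_tag not defined by subclass")

    def tag(self, text):
        tagged_text = []
        for token in text:
            # The subclass sees everything tagged so far plus the next token.
            tag = self.next_tag(tagged_text, token)
            tagged_text.append((token, tag))
        return tagged_text

class NNCDTagger(SequentialTagger):
    """Assigns 'CD' to numbers and 'NN' to anything else."""

    def next_tag(self, tagged_tokens, next_token):
        if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token):
            return 'CD'
        return 'NN'

result = NNCDTagger().tag(['John', 'saw', '3', 'polar', 'bears', '.'])
print(result)
```

The subclass defines only next_tag; the tag loop is inherited unchanged from the base class.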

52
Another Example UnigramTagger
  • class UnigramTagger(TaggerI)
  • class UnigramTagger(SequentialTagger)

53
Unigram Tagger Training
  • def train(self, tagged_tokens):
        for token in tagged_tokens:
            outcome = token.type().tag()
            context = token.type().base()
            self._freqdist[context].inc(outcome)

54
Unigram Tagger Tagging
  • def next_tag(self, tagged_tokens, next_token):
        context = next_token.type()
        return self._freqdist[context].max()
  • E.g., access a context and find its most likely
    outcome:
    >>> freqdist['bank'].max()
    'NN'

55
Unigram Tagger Initialization
  • The constructor for UnigramTagger simply
    initializes self._freqdist with a new conditional
    frequency distribution.
  • def __init__(self):
        self._freqdist = probability.ConditionalFreqDist()

56
For Self-Study
  • NthOrderTagger Implementation
  • BackoffTagger Implementation

57
For Next Time
  • Chunk Parsing