Lecture 17 Chunking - PowerPoint PPT Presentation

Provided by: mantonm5
1
Lecture 17 Chunking
CSCE 771 Natural Language Processing
  • Topics
  • Chunking
  • Overview of Meaning
  • Readings
  • Text 13.5
  • NLTK book 7.2

July 7, 2009
2
Overview
  • Last Time (Programming)
  • Probabilistic parsing
  • Features and Unification
  • Complexity of Language
  • Meaning representations
  • Projects???
  • Today
  • Chunking
  • Meaning representations
  • Description Logics
  • Projects???
  • Readings
  • Chapter 17.1
  • 771 Website Resources
  • Next Time Semantics

3
Harvesting the Web Homework
  • Write a Python program that opens a URL, reads
    the page, and saves it to a file.
  • Figure out how to find links (write a regular
    expression that recognizes them) and build a
    list of the links on a page.
  • Combine the previous two into a program that
    does a breadth-first search to a given depth,
    starting from a URL.
  • Find a tool or routine that strips the HTML from
    a downloaded page, leaving plain text.
  • Find a tool or routine that converts a PDF file
    into plain text.
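
The first four steps can be sketched with the standard library alone. This is only one possible shape for the homework, not a reference solution: it swaps the regular-expression link finder for html.parser, and the names parse_page, fetch, and crawl are made up for illustration. The fetch_page parameter is an assumption added so the search can be tried on canned pages without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _PageParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (absolute links, plain text) for one page."""
    parser = _PageParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.links]
    return links, " ".join(parser.text_parts)


def fetch(url):
    """Download a page and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch_page=fetch):
    """Breadth-first search from start_url, following links up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}  # url -> plain text, in the order pages were visited
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        links, text = parse_page(html, url)
        pages[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages


# Canned pages standing in for the web (hypothetical URLs).
web = {
    "http://example.test/a": '<p>Page A</p><a href="/b">b</a><a href="c">c</a><script>var x;</script>',
    "http://example.test/b": '<p>Page B</p><a href="/a">back</a><a href="/d">d</a>',
    "http://example.test/c": "<p>Page C</p>",
    "http://example.test/d": "<p>Page D</p>",
}
pages = crawl("http://example.test/a", 1, fetch_page=web.__getitem__)
```

With max_depth set to 1, the search visits page a and the two pages it links to; page d is discovered but not fetched. Saving each entry of pages to a file and converting PDFs are left out of this sketch.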

4
Find a Topic
  • Suggested projects include
  • Reading a collection of research papers and
    identifying references and extracting a reference
    graph.
  • Constructing a similarity measure between
    documents by constructing bigram vectors.
  • Read a collection of documents and build a parse
    tree distribution.
  • Knowledge extraction
  • Plagiarism detection.
  • Page Rank procedures
  • Authorship identification

5
Partial Parsing / Chunking
  • Information extraction typically does not require
    a complete parse.
  • Consider ungrammatical fragments
  • Chunking is the process of identifying and
    classifying the non-overlapping portions of a
    sentence that constitute its basic non-recursive
    phrases
  • E.g. noun phrases, verb phrases, adj. phrases,
    prepositional phrases
  • [NP the morning flight] [PP from] [NP Denver]
    [VP has arrived]

6
Focus on NP
  • [NP the morning flight] from [NP Denver] has
    arrived

7
Finite State Rule Based Chunking
  • Rules developed based on regular expressions for
    the particular application
  • No recursion
  • NP → (DT) NN* NN
  • NP → NNP
  • VP → VB
  • VP → Aux VB
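
One way to try rules like these is to write each one as a regular expression over a string of <TAG> tokens. The sketch below is an illustration, not the textbook's or NLTK's implementation: the function name chunk and the RULES table are invented here, the first NP rule is read as an optional DT followed by one or more NN, and the tags (including Aux) are taken as written on the slide.

```python
import re

# Each rule is a regular expression over a string such as "<DT><NN><NN>".
# Assumption: NP -> (DT) NN* NN means an optional DT, then one or more NN.
RULES = {
    "NP": re.compile(r"(?:<DT>)?(?:<NN>)*<NN>|<NNP>"),
    "VP": re.compile(r"(?:<Aux>)?<VB>"),
}


def chunk(tagged, rules=RULES):
    """Return (label, words) for every non-overlapping rule match."""
    # Build the tag string and remember where each token starts in it.
    offsets, tag_string = [], ""
    for _, tag in tagged:
        offsets.append(len(tag_string))
        tag_string += f"<{tag}>"
    offsets.append(len(tag_string))
    token_at = {offset: i for i, offset in enumerate(offsets)}

    found = []
    for label, rule in rules.items():
        for match in rule.finditer(tag_string):
            # Matches begin with "<" and end with ">", so both ends
            # fall on token boundaries.
            start, end = token_at[match.start()], token_at[match.end()]
            found.append((start, end, label))
    found.sort()
    return [(label, [word for word, _ in tagged[start:end]])
            for start, end, label in found]


sentence = [("the", "DT"), ("morning", "NN"), ("flight", "NN"),
            ("from", "IN"), ("Denver", "NNP"),
            ("has", "Aux"), ("arrived", "VB")]
chunks = chunk(sentence)
```

Because the rules contain no recursion, a single left-to-right pass per rule is enough; tokens that no rule covers, such as "from", are simply left out of the result.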

8
Cascaded Finite State Machines

9
IOB Tagging
  • B: beginning of a chunk
  • I: internal part of a chunk
  • O: outside any chunk
  • Example
    the   morning  flight  from  Denver  has   arrived
    B_NP  I_NP     I_NP    B_PP  B_NP    B_VP  I_VP
  • Focussing on NPs
    the   morning  flight  from  Denver  has   arrived
    B_NP  I_NP     I_NP    O     B_NP    O     O
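
The mapping between bracketed chunks and IOB tags is mechanical in both directions. A minimal sketch follows; the function names to_iob and from_iob and the (label, words) representation are assumptions made for this example, with a label of None standing for words outside any chunk.

```python
def to_iob(chunks):
    """Turn [(label, words), ...] into [(word, tag), ...]."""
    tagged = []
    for label, words in chunks:
        for i, word in enumerate(words):
            if label is None:
                tag = "O"              # outside any chunk
            elif i == 0:
                tag = f"B_{label}"     # first word of the chunk
            else:
                tag = f"I_{label}"     # later word of the same chunk
            tagged.append((word, tag))
    return tagged


def from_iob(tagged):
    """Inverse of to_iob: group tagged words back into chunks."""
    chunks = []
    for word, tag in tagged:
        prefix, _, label = tag.partition("_")
        label = label or None          # "O" carries no label
        continues = (bool(chunks) and chunks[-1][0] == label
                     and prefix != "B")
        if continues:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks


all_phrases = [("NP", ["the", "morning", "flight"]), ("PP", ["from"]),
               ("NP", ["Denver"]), ("VP", ["has", "arrived"])]
np_only = [("NP", ["the", "morning", "flight"]), (None, ["from"]),
           ("NP", ["Denver"]), (None, ["has", "arrived"])]

all_tags = [tag for _, tag in to_iob(all_phrases)]
np_tags = [tag for _, tag in to_iob(np_only)]
```

The B tag is what lets two adjacent chunks of the same type be told apart, which a plain inside/outside marking could not do.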

10
Chunking Systems Evaluation
  • Accuracy of chunking systems can be evaluated
    using techniques from information retrieval
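  • Precision P: correct chunks found / total chunks
    proposed by the system
  • Recall R: correct chunks found / total chunks in
    the gold standard
  • F-measure (van Rijsbergen) combines these into
    one measure:
    F_β = (β² + 1)PR / (β²P + R)
  • where β > 1 favors recall and β < 1 favors
    precision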
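
A small worked example may help here. The sketch below assumes chunks are compared as exact (start, end, label) triples, which is a common convention but is not stated on the slide; the function names and the gold and predicted sets are invented for illustration.

```python
def precision_recall(gold, predicted):
    """Chunk-level precision and recall; a chunk counts only if it matches exactly."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical chunks as (start token, end token, label).
gold = {(0, 3, "NP"), (4, 5, "NP"), (5, 7, "VP")}
predicted = {(0, 3, "NP"), (3, 4, "NP"), (4, 5, "NP"), (6, 7, "VP")}

p, r = precision_recall(gold, predicted)   # 2 of 4 proposed, 2 of 3 gold
f1 = f_measure(p, r)                       # balanced
f2 = f_measure(p, r, beta=2.0)             # leans toward recall
f_half = f_measure(p, r, beta=0.5)         # leans toward precision
```

Here recall (2/3) is higher than precision (1/2), so raising β pulls the score up toward recall and lowering it pulls the score down toward precision.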

11
Information Extraction Chap 7 NLTK Book
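    OrgName            LocationName
    Omnicom            New York
    DDB Needham        New York
    Kaplan Thaler      New York
    BBDO South         Atlanta
    Georgia-Pacific    Atlanta

    >>> locs = [('Omnicom', 'IN', 'New York'), ('DDB Needham', 'IN', 'New York'),
    ...         ('Kaplan Thaler', 'IN', 'New York'), ('BBDO South', 'IN', 'Atlanta'),
    ...         ('Georgia-Pacific', 'IN', 'Atlanta')]
    >>> print([e1 for (e1, rel, e2) in locs if rel == 'IN' and e2 == 'Atlanta'])
    ['BBDO South', 'Georgia-Pacific']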

http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
12
Information Extraction from Text
  • The fourth Wells account moving to another agency
    is the packaged paper-products division of
    Georgia-Pacific Corp., which arrived at Wells
    only last fall. Like Hertz and the History
    Channel, it is also leaving for an Omnicom-owned
    agency, the BBDO South unit of BBDO Worldwide.
    BBDO South in Atlanta, which handles corporate
    advertising for Georgia-Pacific, will assume
    additional duties for brands like Angel Soft
    toilet tissue and Sparkle paper towels, said Ken
    Haldin, a spokesman for Georgia-Pacific in
    Atlanta.

13
Information Extraction Architecture

14
    >>> import nltk
    >>> def ie_preprocess(document):
    ...     sentences = nltk.sent_tokenize(document)                      # split into sentences
    ...     sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into words
    ...     sentences = [nltk.pos_tag(sent) for sent in sentences]        # add POS tags

15
Noun Phrase Chunking using NLTK
    >>> import nltk
    >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
    ...             ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
    ...             ("the", "DT"), ("cat", "NN")]
    >>> grammar = "NP: {<DT>?<JJ>*<NN>}"
    >>> cp = nltk.RegexpParser(grammar)
    >>> result = cp.parse(sentence)
    >>> print(result)
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      barked/VBD
      at/IN
      (NP the/DT cat/NN))
    >>> result.draw()

16
Tag Patterns used in Chunk Grammars
  • Tag pattern: a sequence of part-of-speech tags
    delimited using angle brackets, e.g.
    <DT>?<JJ>*<NN>
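
A tag pattern behaves like a regular expression whose atoms are whole tags rather than characters. The sketch below shows one simplified way to turn a tag pattern into an ordinary Python regular expression; it is not NLTK's own translation, which handles more cases, and the names tag_pattern_to_regex and matches are made up for this example.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Translate e.g. '<DT>?<JJ>*<NN>' into a regex over '<DT><JJ><NN>' strings."""
    def convert(match):
        # Inside a tag, "." must not run past the closing bracket.
        inner = match.group(1).replace(".", "[^<>]")
        # Group the whole tag so that ?, * and + apply to the tag as a unit.
        return f"(?:<(?:{inner})>)"

    return re.sub(r"<([^<>]+)>", convert, tag_pattern.replace(" ", ""))


def matches(tag_pattern, tags):
    """True if the whole tag sequence matches the tag pattern."""
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return re.fullmatch(tag_pattern_to_regex(tag_pattern), tag_string) is not None


np_regex = tag_pattern_to_regex("<DT>?<JJ>*<NN>")
```

For instance, matches("<DT>?<JJ>*<NN>", ["DT", "JJ", "JJ", "NN"]) holds, while a sequence that ends in an adjective does not match because the pattern requires a final NN.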

17
nltk.app.chunkparser()
18
Chunking with Regular Expressions
19
    >>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
    >>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
    >>> cp = nltk.RegexpParser(grammar)
    >>> print(cp.parse(nouns))
    (S (NP money/NN market/NN) fund/NN)

20
    >>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
    >>> brown = nltk.corpus.brown
    >>> for sent in brown.tagged_sents():
    ...     tree = cp.parse(sent)
    ...     for subtree in tree.subtrees():
    ...         if subtree.label() == 'CHUNK': print(subtree)
    ...
    (CHUNK combined/VBN to/TO achieve/VB)
    (CHUNK continue/VB to/TO place/VB)
    (CHUNK serve/VB to/TO protect/VB)
    (CHUNK wanted/VBD to/TO wait/VB)
    ...

21
Chinker.py Figure 7.5
    # Natural Language Toolkit: code_chinker
    import nltk

    grammar = r"""
      NP:
        {<.*>+}          # Chunk everything
        }<VBD|IN>+{      # Chink sequences of VBD and IN
      """
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]
    cp = nltk.RegexpParser(grammar)

22
    >>> print(cp.parse(sentence))
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      barked/VBD
      at/IN
      (NP the/DT cat/NN))
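
Chinking can be pictured as starting with the whole sentence in one chunk and then cutting out the unwanted runs. The sketch below reproduces that effect in plain Python for the case where chinks are defined by a set of tags; the function name chink and the mixed list it returns are assumptions for this example, not the way NLTK represents trees.

```python
from itertools import groupby


def chink(tagged, chink_tags=frozenset({"VBD", "IN"})):
    """Chunk everything, then remove runs of tokens whose tag is a chink tag."""
    result = []
    # groupby splits the sentence into alternating runs of chink / non-chink tokens.
    for is_chink, group in groupby(tagged, key=lambda pair: pair[1] in chink_tags):
        group = list(group)
        if is_chink:
            result.extend(group)            # chinked tokens stay outside any chunk
        else:
            result.append(("NP", group))    # each remaining run becomes one chunk
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
chinked = chink(sentence)
```

The result holds two NP chunks, "the little yellow dog" and "the cat", with "barked" and "at" left between them as plain tokens.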

23
Developing and Evaluating Chunkers
  • chunk.conllstr2tree() builds a tree
    representation from one of these multi-line
    strings

24
Reading IOB Format and the CoNLL 2000 Corpus
25
    >>> text = '''
    ... he PRP B-NP
    ... accepted VBD B-VP
    ... the DT B-NP
    ... position NN I-NP
    ... of IN B-PP
    ... vice NN B-NP
    ... chairman NN I-NP
    ... of IN B-PP
    ... Carlyle NNP B-NP
    ... Group NNP I-NP
    ... , , O
    ... . . O
    ... '''
    >>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
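
To see what reading this format involves, here is a plain-Python sketch that groups the lines of such a string into chunks. It is not how nltk.chunk.conllstr2tree is implemented; the function name conll_to_chunks and the flat list it returns are assumptions for illustration. As with chunk_types in the call above, chunk types that are not requested are treated as outside.

```python
def conll_to_chunks(text, chunk_types=("NP",)):
    """Parse 'word tag IOB' lines into a list of tokens and (label, tokens) chunks."""
    result = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue                       # skip blank lines
        word, tag, iob = line.split()
        prefix, _, label = iob.partition("-")
        last_is_chunk = bool(result) and isinstance(result[-1][1], list)
        if prefix == "O" or label not in chunk_types:
            result.append((word, tag))     # token outside the requested chunks
        elif prefix == "I" and last_is_chunk and result[-1][0] == label:
            result[-1][1].append((word, tag))     # continue the open chunk
        else:
            result.append((label, [(word, tag)])) # B-, or a stray I-, opens a chunk
    return result


sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
. . O
"""
parsed = conll_to_chunks(sample)
```

With only NP requested, "accepted" and "of" come back as plain tokens even though the file marks them as B-VP and B-PP.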

26
    >>> from nltk.corpus import conll2000
    >>> print(conll2000.chunked_sents('train.txt')[99])
    (S
      (PP Over/IN)
      (NP a/DT cup/NN)
      (PP of/IN)
      (NP coffee/NN)
      ,/,
      (NP Mr./NNP Stone/NNP)
      (VP told/VBD)
      (NP his/PRP$ story/NN)
      ./.)

27
    >>> from nltk.corpus import conll2000
    >>> cp = nltk.RegexpParser("")      # empty grammar: creates no chunks
    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> print(cp.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  43.4%
        Precision:      0.0%
        Recall:         0.0%
        F-Measure:      0.0%
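
The gap between IOB accuracy and the other three scores comes from what each one counts. A chunker that finds nothing tags every word O, so it is right on every word that really is outside a chunk, yet it proposes no chunks at all. The sketch below illustrates this on the seven-word example from the IOB slide; the function names are invented here and the numbers are for that one sentence, not for the CoNLL test set.

```python
def iob_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose IOB tag is predicted correctly."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def chunk_spans(tags):
    """Set of (start, end) token spans described by a sequence of IOB tags."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final chunk
        if start is not None and not tag.startswith("I"):
            spans.add((start, i))
            start = None
        if tag.startswith("B"):
            start = i
    return spans


gold = ["B_NP", "I_NP", "I_NP", "O", "B_NP", "O", "O"]
predicted = ["O"] * len(gold)          # what a chunker with no rules produces

accuracy = iob_accuracy(gold, predicted)               # 3 of 7 tags agree
found = chunk_spans(predicted) & chunk_spans(gold)     # no chunks in common
```

Token-level accuracy is therefore a weak baseline figure, and chunk-level precision, recall, and F-measure are the scores to compare between chunkers.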

28
    >>> grammar = r"NP: {<[CDJNP].*>+}"   # chunk runs of tags starting with C, D, J, N or P
    >>> cp = nltk.RegexpParser(grammar)
    >>> print(cp.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  87.7%
        Precision:     70.6%
        Recall:        67.8%
        F-Measure:     69.2%
