Lecture 17 Chunking


Provided by: mantonm5

1
Lecture 17 Chunking
CSCE 771 Natural Language Processing

Topics
Chunking
Overview of Meaning
Readings
Text 13.5
NLTK book 7.2

July 7, 2009
2
Overview

Last Time (Programming)
Probabilistic parsing
Features and Unification
Complexity of Language
Meaning representations
Projects???
Today
Chunking
Meaning representations
Description Logics
Projects???
Readings
Chapter 17.1
771 Website Resources
Next Time Semantics

3
Harvesting the Web Homework

You should write a Python program to open a URL, read the page, and save it to a file.
You should figure out how to find links (write a regular expression that recognizes them) and build a list of them.
Combine the previous two into a program that does a breadth-first search to a certain depth, starting from a URL.
You should find a tool or routine that strips the HTML from a web page, leaving plain text.
You should find a tool or routine to convert a PDF file into plain text.
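
The link-finding and breadth-first steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expression recognizes just double-quoted absolute http(s) links, the names `LINK_RE`, `extract_links`, `crawl`, and `FAKE_WEB` are made up for this sketch, and the page fetcher is passed in as a function so the example runs without network access.

```python
import re
from collections import deque

# Assumption: only double-quoted absolute http(s) links are recognized.
LINK_RE = re.compile(r'href="(https?://[^"#]+)"', re.IGNORECASE)

def extract_links(html):
    """Return the links found in an HTML string, in order, without duplicates."""
    seen, links = set(), []
    for url in LINK_RE.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

def crawl(start_url, fetch, max_depth):
    """Breadth-first search from start_url; fetch(url) must return the page's HTML."""
    visited = {start_url: 0}          # url -> depth at which it was first reached
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = visited[url]
        if depth == max_depth:
            continue                  # pages at the depth limit are not expanded
        for link in extract_links(fetch(url)):
            if link not in visited:
                visited[link] = depth + 1
                queue.append(link)
    return visited

# Offline demonstration on a made-up four-page "web".
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://d.example/">d</a>',
    "http://c.example/": '<a href="http://a.example/">back</a>',
    "http://d.example/": "",
}
reached = crawl("http://a.example/", lambda url: FAKE_WEB.get(url, ""), max_depth=1)
print(reached)
```

For real pages, `fetch` could wrap `urllib.request.urlopen(url).read()` plus decoding; a proper HTML parser is usually more robust than a regular expression for link extraction.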

4
Find a Topic

Suggested projects include
Reading a collection of research papers, identifying references, and extracting a reference graph.
Constructing a similarity measure between documents by constructing bigram vectors.
Reading a collection of documents and building a parse tree distribution.
Knowledge extraction
Plagiarism detection
PageRank procedures
Authorship identification
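
As one possible starting point for the bigram-vector project, the sketch below counts word bigrams and compares two documents by cosine similarity. The choice of cosine similarity, whitespace tokenization, and the function names are assumptions made for illustration; the slide does not prescribe a particular measure.

```python
import math
from collections import Counter

def bigram_vector(text):
    """Count word bigrams in a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def cosine_similarity(a, b):
    """Cosine of the angle between two bigram count vectors."""
    dot = sum(count * b[bigram] for bigram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

v1 = bigram_vector("the morning flight")
v2 = bigram_vector("the morning train")
print(cosine_similarity(v1, v2))
```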

5
Partial Parsing / Chunking

Information extraction typically does not require
a complete parse.
Consider ungrammatical fragments
Chunking is the process of identifying and classifying the non-overlapping portions of a sentence that constitute the basic non-recursive phrases
E.g. noun phrases, verb phrases, adjective phrases, prepositional phrases
[NP the morning flight] [PP from] [NP Denver] [VP has arrived]

6
Focus on NP

[NP the morning flight] from [NP Denver] has arrived

7
Finite State Rule Based Chunking

Rules developed based on regular expressions for
the particular application
No recursion
NP → (DT) NN* NN
NP → NNP
VP → VB
VP → Aux VB
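
A minimal sketch of how such non-recursive rules could be applied with regular expressions over the tag sequence. The rule table, the `<TAG>` encoding, and the function name `rule_chunk` are assumptions made for this illustration, and `Aux` is used as a tag only because the rules above use it.

```python
import re

# Each rule is a chunk label plus a regular expression over "<TAG>" strings.
RULES = [
    ("NP", re.compile(r"(?:<DT>)?(?:<NN>)+")),   # optional determiner, then one or more nouns
    ("NP", re.compile(r"<NNP>")),                # a single proper noun
    ("VP", re.compile(r"(?:<Aux>)?<VB>")),       # optional auxiliary, then a verb
]

def rule_chunk(tagged):
    """Greedy left-to-right chunking of (word, tag) pairs; unmatched words get label 'O'."""
    encoded = ["<" + tag + ">" for _, tag in tagged]
    chunks, i = [], 0
    while i < len(tagged):
        rest = "".join(encoded[i:])
        for label, pattern in RULES:
            match = pattern.match(rest)
            if match:
                n = match.group().count("<")     # number of tags the rule consumed
                chunks.append((label, [word for word, _ in tagged[i:i + n]]))
                i += n
                break
        else:
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks

sentence = [("the", "DT"), ("morning", "NN"), ("flight", "NN"), ("from", "IN"),
            ("Denver", "NNP"), ("has", "Aux"), ("arrived", "VB")]
print(rule_chunk(sentence))
```

Because no rule refers to another chunk label, a single pass suffices; a cascade would feed the output of one such pass into the next.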

8
Cascaded Finite State Machines

9
IOB Tagging

B: beginning of a chunk
I: internal part of a chunk
O: outside any chunk
Example
the   morning  flight  from  Denver  has   arrived
B_NP  I_NP     I_NP    B_PP  B_NP    B_VP  I_VP
Focussing on NPs
the   morning  flight  from  Denver  has   arrived
B_NP  I_NP     I_NP    O     B_NP    O     O
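
The mapping from bracketed chunks to IOB tags can be sketched as below. The input representation (a list of `(label, words)` pairs with `"O"` for words outside any chunk) and the function name are assumptions chosen for this example.

```python
def chunks_to_iob(chunks):
    """Turn [(label, [words])] into [(word, IOB tag)]; label 'O' marks words outside chunks."""
    tagged = []
    for label, words in chunks:
        for k, word in enumerate(words):
            if label == "O":
                tagged.append((word, "O"))
            else:
                prefix = "B_" if k == 0 else "I_"   # first word begins the chunk
                tagged.append((word, prefix + label))
    return tagged

np_only = [("NP", ["the", "morning", "flight"]), ("O", ["from"]),
           ("NP", ["Denver"]), ("O", ["has", "arrived"])]
print(chunks_to_iob(np_only))
```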

10
Chunking Systems Evaluation

Accuracy of chunking systems can be evaluated
using techniques from information retrieval
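Precision: the fraction of chunks proposed by the system that are correct
Recall: the fraction of chunks actually in the text that the system found
F-measure (van Rijsbergen) combines these into one measure, F_β = (β² + 1)PR / (β²P + R),
where β > 1 favors recall and β < 1 favors precision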
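
A small sketch of these measures, assuming chunks are compared as exact `(start, end, label)` spans; that representation and the function names are choices made for this illustration. With β = 1 the F-measure reduces to the harmonic mean of precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); 0.0 when both P and R are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def chunk_scores(gold, guessed, beta=1.0):
    """Precision, recall, and F-measure over two collections of chunk spans."""
    gold, guessed = set(gold), set(guessed)
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall, beta)

gold = {(0, 3, "NP"), (4, 5, "NP")}
guessed = {(0, 3, "NP"), (5, 7, "NP")}
print(chunk_scores(gold, guessed))
```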

11
Information Extraction Chap 7 NLTK Book

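OrgName          LocationName
Omnicom          New York
DDB Needham      New York
Kaplan Thaler    New York
BBDO South       Atlanta
Georgia-Pacific  Atlanta
>>> # the table above as (e1, rel, e2) triples
>>> locs = [('Omnicom', 'IN', 'New York'), ('DDB Needham', 'IN', 'New York'),
...         ('Kaplan Thaler', 'IN', 'New York'), ('BBDO South', 'IN', 'Atlanta'),
...         ('Georgia-Pacific', 'IN', 'Atlanta')]
>>> print([e1 for (e1, rel, e2) in locs if rel == 'IN' and e2 == 'Atlanta'])
['BBDO South', 'Georgia-Pacific']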

http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
12
Information Extraction from Text

The fourth Wells account moving to another agency
is the packaged paper-products division of
Georgia-Pacific Corp., which arrived at Wells
only last fall. Like Hertz and the History
Channel, it is also leaving for an Omnicom-owned
agency, the BBDO South unit of BBDO Worldwide.
BBDO South in Atlanta, which handles corporate
advertising for Georgia-Pacific, will assume
additional duties for brands like Angel Soft
toilet tissue and Sparkle paper towels, said Ken
Haldin, a spokesman for Georgia-Pacific in
Atlanta.

13
Information Extraction Architecture

14

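>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                       # split into sentences
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]   # split into words
...     sentences = [nltk.pos_tag(sent) for sent in sentences]         # add POS tags
...     return sentences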

15
Noun Phrase Chunking using NLTK

>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
...             ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>> result.draw()

16
Tag Patterns used in Chunk Grammars

Tag pattern - a sequence of part-of-speech tags delimited using angle brackets, e.g.
<DT>?<JJ>*<NN>
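
To see why tag patterns behave like regular expressions, the sketch below translates a tag pattern into an ordinary Python regular expression over a string of `<TAG>` tokens. This is a simplified illustration, not NLTK's own translation: the function names are made up here, and tags containing regex metacharacters (such as `PRP$`) are not handled.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Translate e.g. '<DT>?<JJ>*<NN>' into a compiled regex over '<TAG>' strings."""
    def wrap(match):
        # Inside a tag, '.' must not match across the angle brackets.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", wrap, tag_pattern.replace(" ", "")))

def encode(tags):
    """Encode a list of tags as one string, e.g. ['DT', 'NN'] -> '<DT><NN>'."""
    return "".join("<" + tag + ">" for tag in tags)

np_pattern = tag_pattern_to_regex("<DT>?<JJ>*<NN>")
print(bool(np_pattern.fullmatch(encode(["DT", "JJ", "JJ", "NN"]))))
```

Each `<...>` group becomes one unit, so the `?` and `*` operators apply to whole tags rather than to single characters.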

http//nltk.googlecode.com/svn/trunk/doc/book/ch07
.html
17
nltk.app.chunkparser()
18
Chunking with Regular Expressions
19

>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)

20

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
...

21
Chinker.py Figure 7.5

# Natural Language Toolkit: code_chinker
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

22

>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
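
The effect of chinking can also be sketched without NLTK: start from one chunk covering the whole sentence, then cut out every token whose tag is in the chink set. The function name `chink` and the `(label, tokens)` output format are assumptions for this illustration only.

```python
def chink(tagged, chink_tags, label="NP"):
    """Split a fully chunked sentence by removing tokens whose tag is in chink_tags."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:                      # close the chunk built so far
                chunks.append((label, current))
                current = []
            chunks.append(("O", [(word, tag)]))
        else:
            current.append((word, tag))
    if current:
        chunks.append((label, current))
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for label, tokens in chink(sentence, {"VBD", "IN"}):
    print(label, tokens)
```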

23
Developing and Evaluating Chunkers

nltk.chunk.conllstr2tree() builds a tree representation from a multi-line string in CoNLL IOB format (one token per line: word, POS tag, chunk tag)

24
Reading IOB Format and the CoNLL 2000 Corpus
25

>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
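
What reading this format involves can be sketched in plain Python. The function `read_iob` below is a hypothetical stand-in written for illustration, not NLTK's implementation; it returns a flat list of `(label, tokens)` groups instead of a tree and treats chunk types outside `chunk_types` as `O`.

```python
def read_iob(text, chunk_types=("NP",)):
    """Group 'word POS IOB-tag' lines into (label, [(word, pos), ...]) chunks."""
    chunks, current = [], None
    for line in text.strip().splitlines():
        word, pos, iob = line.split()
        kind = iob[2:]                       # 'NP' from 'B-NP'; empty for 'O'
        if iob == "O" or kind not in chunk_types:
            current = None
            chunks.append(("O", [(word, pos)]))
            continue
        if iob.startswith("B") or current is None or current[0] != kind:
            current = (kind, [])             # a B tag (or a stray I tag) opens a new chunk
            chunks.append(current)
        current[1].append((word, pos))
    return chunks

text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
. . O
'''
for label, tokens in read_iob(text):
    print(label, tokens)
```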

26

>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

27

>>> from nltk.corpus import conll2000
>>> cp = nltk.RegexpParser("")
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
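
The combination of a nonzero IOB accuracy with zero precision and recall follows from what an empty grammar does: it tags every word O, which is correct for each word that really is outside a chunk, yet it proposes no chunks at all. A toy illustration on the flight sentence from earlier (the helper names are assumptions made for this sketch):

```python
def iob_accuracy(gold, guessed):
    """Fraction of tokens whose IOB tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, guessed)) / len(gold)

def chunk_spans(tags):
    """Return the set of (start, end) token spans of the chunks in an IOB tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):    # sentinel closes a final chunk
        if start is not None and not tag.startswith("I"):
            spans.add((start, i))
            start = None
        if tag.startswith("B"):
            start = i
    return spans

gold = ["B_NP", "I_NP", "I_NP", "O", "B_NP", "O", "O"]
guessed = ["O"] * len(gold)                          # what a chunker with no rules produces

print(iob_accuracy(gold, guessed))                   # credit for the three O tokens
print(chunk_spans(gold), chunk_spans(guessed))       # two gold chunks, none proposed
```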

28

>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
