Title: Lecture 17 Chunking
1Lecture 17 Chunking
CSCE 771 Natural Language Processing
- Topics
  - Chunking
  - Overview of Meaning
- Readings
  - Text 13.5
  - NLTK book 7.2
July 7, 2009
2Overview
- Last Time (Programming)
  - Probabilistic parsing
  - Features and Unification
  - Complexity of Language
  - Meaning representations
  - Projects???
- Today
  - Chunking
  - Meaning representations
  - Description Logics
  - Projects???
- Readings
  - Chapter 17.1
  - 771 Website Resources
- Next Time: Semantics
3Harvesting the Web Homework
- You should write a Python program to open a URL, read the page, and save it to a file.
- You should figure out how to find links (write a regular expression that recognizes them) and build a list of them.
- Combine the previous two into a program that does a breadth-first search to a given depth, starting from a URL.
- You should find a tool or routine that strips the HTML from the page that was read, leaving plain text.
- You should find a tool or routine to convert a PDF file into plain text.
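One way the first four steps might fit together is sketched below, using only the Python standard library. The function names (`fetch`, `extract_links`, `strip_html`, `bfs_crawl`) and the link pattern are illustrative assumptions, not part of the assignment; the regular expression is deliberately simple and will miss unusual markup.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed, deliberately simple pattern: a quoted href value inside an <a> tag.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url):
    """Open a URL and return the page as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def save_page(text, filename):
    """Save the page text to a file."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(text)


def extract_links(html, base_url):
    """Return absolute URLs for every link the regular expression finds."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of a page with whitespace normalized."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def bfs_crawl(start_url, max_depth, fetch_page=fetch):
    """Breadth-first search from start_url, following links up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that cannot be read
        pages[url] = html
        if depth < max_depth:
            for link in extract_links(html, url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Calling `bfs_crawl(some_url, 2)` would fetch real pages; passing a different `fetch_page` (for instance a dictionary lookup) makes it possible to try the search offline.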
4Find a Topic
- Suggested projects include
  - Reading a collection of research papers, identifying references, and extracting a reference graph.
  - Constructing a similarity measure between documents by constructing bigram vectors.
  - Reading a collection of documents and building a parse tree distribution.
  - Knowledge extraction
  - Plagiarism detection
  - PageRank procedures
  - Authorship identification
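As a rough illustration of the bigram-vector idea, the sketch below counts word bigrams per document and compares the count vectors. Cosine similarity and whitespace tokenization are assumptions made here for brevity; the project description does not prescribe a particular measure or tokenizer.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Count adjacent word pairs; tokens are lowercased whitespace splits."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[bigram] for bigram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "the morning flight has arrived"
doc2 = "the morning flight was late"
print(cosine_similarity(bigram_vector(doc1), bigram_vector(doc2)))
```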
5Partial Parsing / Chunking
- Information extraction typically does not require a complete parse.
- Consider ungrammatical fragments.
- Chunking is the process of identifying and classifying the non-overlapping portions of a sentence that constitute its basic non-recursive phrases.
  - E.g., noun phrases, verb phrases, adjective phrases, prepositional phrases
  - [NP the morning flight] [PP from] [NP Denver] [VP has arrived]
6Focus on NP
- [NP the morning flight] from [NP Denver] has arrived
7Finite State Rule Based Chunking
- Rules are developed from regular expressions for the particular application
- No recursion
- NP → (DT) NN* NN
- NP → NNP
- VP → VB
- VP → Aux VB
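The four rules above could be tried out with NLTK's `RegexpParser`, as in the sketch below. It reads the first NP rule as an optional determiner followed by one or more nouns. Penn Treebank tags have no `Aux` tag, so treating `MD`, `VBZ`, `VBP`, and `VBD` as auxiliaries is an assumption made only for this example, as is the hand-tagged sentence.

```python
import nltk

# Each stage is applied in order; rules inside a stage are tried top to bottom.
grammar = r"""
  NP: {<DT>?<NN>*<NN>}              # NP -> (DT) NN* NN
      {<NNP>}                       # NP -> NNP
  VP: {<MD|VBZ|VBP|VBD>?<VB.*>}     # VP -> VB and VP -> Aux VB (Aux tags assumed)
"""

sentence = [("the", "DT"), ("morning", "NN"), ("flight", "NN"),
            ("from", "IN"), ("Denver", "NNP"),
            ("has", "VBZ"), ("arrived", "VBN")]

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(sentence)
print(tree)

# Collect (label, words) for every chunk below the root.
chunks = [(st.label(), [word for word, tag in st.leaves()])
          for st in tree.subtrees() if st.label() != "S"]
print(chunks)
```

Because no rule refers to another chunk label, the grammar stays non-recursive: "from" is left outside any chunk, and "Denver" becomes its own NP.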
8Cascaded Finite State Machines
9IOB Tagging
- B: beginning of a chunk
- I: internal parts of the chunk
- O: outside any chunk
- Example
  - the morning flight from Denver has arrived
  - B_NP I_NP I_NP B_PP B_NP B_VP I_VP
- Focussing on NPs
  - the morning flight from Denver has arrived
  - B_NP I_NP I_NP O B_NP O O
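The tagging scheme can be made concrete with a few lines of Python. The helper `chunks_to_iob` and the `(label, start, end)` span representation are hypothetical choices for this sketch; they are not taken from the textbook or from NLTK.

```python
def chunks_to_iob(num_tokens, chunks, keep=None):
    """Turn (label, start, end) spans (end exclusive) into one IOB tag per token.

    If keep is given, only chunk labels in keep are tagged; everything else is O.
    """
    tags = ["O"] * num_tokens
    for label, start, end in chunks:
        if keep is not None and label not in keep:
            continue
        tags[start] = "B_" + label            # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I_" + label            # remaining tokens of the chunk
    return tags


words = "the morning flight from Denver has arrived".split()
chunks = [("NP", 0, 3), ("PP", 3, 4), ("NP", 4, 5), ("VP", 5, 7)]

all_tags = chunks_to_iob(len(words), chunks)
np_tags = chunks_to_iob(len(words), chunks, keep={"NP"})
for word, tag_all, tag_np in zip(words, all_tags, np_tags):
    print(word, tag_all, tag_np)
```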
10Chunking Systems Evaluation
- Accuracy of chunking systems can be evaluated using techniques from information retrieval
  - Precision
  - Recall
  - F-measure (van Rijsbergen) combines these into one measure, where β > 1 favors recall and β < 1 favors precision
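For chunking, precision is the number of correct chunks the system produced divided by the total number of chunks it produced, and recall is the number of correct chunks divided by the number of chunks in the gold standard. The F-measure is F_β = (β² + 1)PR / (β²P + R). The sketch below computes all three; representing a chunk as a `(label, start, end)` tuple and the function name `chunk_scores` are assumptions made for the example.

```python
def chunk_scores(gold_chunks, system_chunks, beta=1.0):
    """Return (precision, recall, F_beta) over sets of (label, start, end) chunks.

    A system chunk counts as correct only if label and both boundaries match.
    """
    gold = set(gold_chunks)
    system = set(system_chunks)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


gold = [("NP", 0, 3), ("PP", 3, 4), ("NP", 4, 5), ("VP", 5, 7)]
system = [("NP", 0, 3), ("NP", 4, 5), ("VP", 5, 6)]  # VP boundary is wrong

p, r, f1 = chunk_scores(gold, system)
_, _, f2 = chunk_scores(gold, system, beta=2.0)  # weights recall more heavily
print(p, r, f1, f2)
```

Here precision is higher than recall, so the recall-weighted F with β = 2 comes out lower than the balanced F with β = 1.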
11Information Extraction Chap 7 NLTK Book
OrgName            LocationName
Omnicom            New York
DDB Needham        New York
Kaplan Thaler      New York
BBDO South         Atlanta
Georgia-Pacific    Atlanta
- Query, with the table stored as a list locs of (e1, rel, e2) triples such as ('BBDO South', 'IN', 'Atlanta'):
>>> print([e1 for (e1, rel, e2) in locs if rel == 'IN' and e2 == 'Atlanta'])
['BBDO South', 'Georgia-Pacific']
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
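A self-contained version of the query might look like the following. The variable names `locs` and `orgs_by_location` are illustrative, and the triples are simply the rows of the table with the relation `'IN'` filled in.

```python
from collections import defaultdict

# Each row of the table as an (organization, relation, location) triple.
locs = [
    ("Omnicom", "IN", "New York"),
    ("DDB Needham", "IN", "New York"),
    ("Kaplan Thaler", "IN", "New York"),
    ("BBDO South", "IN", "Atlanta"),
    ("Georgia-Pacific", "IN", "Atlanta"),
]

# Which organizations operate in Atlanta?
query = [e1 for (e1, rel, e2) in locs if rel == "IN" and e2 == "Atlanta"]
print(query)

# Grouping by location answers the same question for every city at once.
orgs_by_location = defaultdict(list)
for e1, rel, e2 in locs:
    if rel == "IN":
        orgs_by_location[e2].append(e1)
print(dict(orgs_by_location))
```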
12Information Extraction from Text
- The fourth Wells account moving to another agency
is the packaged paper-products division of
Georgia-Pacific Corp., which arrived at Wells
only last fall. Like Hertz and the History
Channel, it is also leaving for an Omnicom-owned
agency, the BBDO South unit of BBDO Worldwide.
BBDO South in Atlanta, which handles corporate
advertising for Georgia-Pacific, will assume
additional duties for brands like Angel Soft
toilet tissue and Sparkle paper towels, said Ken
Haldin, a spokesman for Georgia-Pacific in
Atlanta.
13Information Extraction Architecture
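14
>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                      # split into sentences
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into words
...     sentences = [nltk.pos_tag(sent) for sent in sentences]        # add part-of-speech tags
...     return sentences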
15Noun Phrase Chunking using NLTK
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
...             ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>> result.draw()
16Tag Patterns used in Chunk Grammars
- Tag pattern: a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT>?<JJ>*<NN>
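To see why a tag pattern behaves like a regular expression over tags, the sketch below translates a pattern into an ordinary Python regex and runs it over the sentence's tag sequence. The helpers `tag_pattern_to_regex` and `find_chunks` are hypothetical, simplified stand-ins written for this illustration; they are not NLTK's implementation.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Wrap each <...> item in a group so that ?, * and + apply to whole tags."""
    def wrap(match):
        # Inside a tag, '.' may match any character except the angle brackets.
        body = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + body + ")>)"
    return re.sub(r"<([^<>]+)>", wrap, tag_pattern)


def find_chunks(tagged_sentence, tag_pattern):
    """Return the word sequences whose tags match the tag pattern."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged_sentence)
    regex = re.compile(tag_pattern_to_regex(tag_pattern))
    chunks = []
    for match in regex.finditer(tag_string):
        first = tag_string.count("<", 0, match.start())  # index of first token
        length = match.group().count("<")                # number of tokens matched
        if length:
            chunks.append([w for w, _ in tagged_sentence[first:first + length]])
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(tag_pattern_to_regex("<DT>?<JJ>*<NN>"))
print(find_chunks(sentence, "<DT>?<JJ>*<NN>"))
```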
17nltk.app.chunkparser()
18Chunking with Regular Expressions
19
>>> import nltk
>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)
20
>>> import nltk
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
...
21Chinker.py Figure 7.5
# Natural Language Toolkit: code_chinker
import nltk
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
22
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
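The effect of chinking can be imitated without NLTK: start with the whole sentence as one chunk, then cut out every token whose tag is in the chink set, which splits the chunk wherever tokens are removed. The function `chink` below is a hypothetical illustration of that idea only; it does not handle multiple chunk types or the rest of the chunk grammar.

```python
def chink(tagged_sentence, chink_tags):
    """Split one sentence-wide chunk at every token whose tag is in chink_tags."""
    chunks, current = [], []
    for word, tag in tagged_sentence:
        if tag in chink_tags:
            # A chinked token ends the chunk collected so far.
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for chunk in chink(sentence, {"VBD", "IN"}):
    print(" ".join(word for word, tag in chunk))
```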
23Developing and Evaluating Chunkers
- nltk.chunk.conllstr2tree() builds a tree representation from a multi-line string in IOB format (one word, its part-of-speech tag, and its chunk tag per line)
24Reading IOB Format and the CoNLL 2000 Corpus
25
>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
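What the conversion does can be approximated in plain Python. The function `read_iob` below is a hypothetical, simplified stand-in written for illustration (it is not NLTK's implementation): it groups the lines of an IOB string into chunks of the requested types and leaves all other tokens outside any chunk.

```python
def read_iob(text, chunk_types=("NP",)):
    """Parse 'word tag iob' lines into a flat list of tokens and chunks.

    Tokens outside the requested chunk types appear as (word, tag);
    chunks appear as (label, [(word, tag), ...]).
    """
    result = []
    current = None
    for line in text.strip().splitlines():
        word, tag, iob = line.split()
        state, _, label = iob.partition("-")
        if label not in chunk_types:
            state = "O"  # ignore chunk types we were not asked for
        starts_chunk = state == "B" or (
            state == "I" and (current is None or current[0] != label))
        if starts_chunk:
            current = (label, [(word, tag)])
            result.append(current)
        elif state == "I":
            current[1].append((word, tag))
        else:
            current = None
            result.append((word, tag))
    return result


text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
. . O
"""
parsed = read_iob(text)
noun_phrases = [" ".join(word for word, tag in item[1])
                for item in parsed if isinstance(item[1], list)]
print(noun_phrases)
```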
26
>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)
27
>>> import nltk
>>> from nltk.corpus import conll2000
>>> cp = nltk.RegexpParser("")
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
28
>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%