Title: I256: Applied Natural Language Processing
1. I256 Applied Natural Language Processing
Marti Hearst Sept 25, 2006
2. Shallow Parsing
- Break text up into non-overlapping contiguous subsets of tokens.
- Also called chunking, partial parsing, light parsing.
- What is it useful for?
  - Entity recognition
    - people, locations, organizations
  - Studying linguistic patterns
    - gave NP
    - gave up NP in NP
    - gave NP NP
    - gave NP to NP
  - Can ignore complex structure when not relevant
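The verb patterns above can be searched for once a sentence has been chunked. The following is a minimal sketch, assuming a chunked sentence is a list whose items are either (word, tag) pairs or (label, [tokens]) chunks; the names `skeleton`, `verb_frame`, and `FRAMES` are made up for this illustration and are not part of nltk_lite.

```python
import re

# Frames from the slide, ordered most specific first, because the bare
# "gave NP" frame would otherwise match every longer pattern too.
FRAMES = ["gave up NP in NP", "gave NP to NP", "gave NP NP", "gave NP"]

def skeleton(chunked):
    """Replace each chunk by its label and keep other words (lowercased)."""
    parts = []
    for item in chunked:
        if isinstance(item[1], list):      # a chunk: (label, [tokens])
            parts.append(item[0])
        else:                              # a plain token: (word, tag)
            parts.append(item[0].lower())
    return " ".join(parts)

def verb_frame(chunked):
    """Return the first frame found in the sentence skeleton, or None."""
    text = skeleton(chunked)
    for frame in FRAMES:
        if re.search(r"\b" + re.escape(frame) + r"\b", text):
            return frame
    return None

sentence = [
    ("NP", [("Mary", "NNP")]),
    ("gave", "VBD"),
    ("NP", [("the", "DT"), ("book", "NN")]),
    ("to", "TO"),
    ("NP", [("John", "NNP")]),
]
print(skeleton(sentence))
print(verb_frame(sentence))
```

Because the chunker has already collapsed each noun phrase to a single NP symbol, the pattern search does not need to know anything about the internal structure of the phrases.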
3. A Relationship between Segmenting and Labeling
- Tokenization segments the text
- Tagging labels the text
- Shallow parsing does both simultaneously.
4. Chunking vs. Full Syntactic Parsing
- G.K. Chesterton, author of The Man Who Was Thursday
5. Representations for Chunks
- IOB tags
- Inside, outside, and begin
- Why do we need a begin tag?
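One way to see why a begin tag is needed is to decode the tags back into chunk spans. The sketch below assumes bare I, O, and B tags for a single chunk type; `iob_to_spans` is a name invented for this example. Without B, two adjacent chunks cannot be told apart from one long chunk.

```python
def iob_to_spans(tags):
    """Decode a list of 'I'/'O'/'B' tags into (start, end) token spans."""
    spans = []
    start = None                      # start index of the open chunk, if any
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:     # B closes the previous chunk
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:             # chunk running to the end of the text
        spans.append((start, len(tags)))
    return spans

words = ["gave", "Mary", "the", "book"]
with_begin = ["O", "B", "B", "I"]      # [Mary] [the book]
without_begin = ["O", "I", "I", "I"]   # [Mary the book]

print(iob_to_spans(with_begin))
print(iob_to_spans(without_begin))
```

With the B tag the decoder recovers two noun phrases; with only I and O the same words collapse into a single chunk.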
6. Representations for Chunks
- Trees
  - Chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
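The two-level tree can be built directly from IOB-tagged tokens. This is a rough sketch using plain tuples and lists rather than the nltk_lite Tree class; `iob_to_tree` and the tuple layout are assumptions made for illustration.

```python
def iob_to_tree(rows, top="S"):
    """Build (top, children) from (word, pos, iob) rows.

    Each child is either a (word, pos) leaf for a non-chunk token or a
    (label, [leaves]) node for a chunk, so the tree has exactly two levels.
    """
    children = []
    current = None                         # the chunk node being filled
    for word, pos, iob in rows:
        if iob == "O":
            current = None
            children.append((word, pos))
            continue
        prefix, label = iob.split("-", 1)  # e.g. "B-NP" -> ("B", "NP")
        if prefix == "B" or current is None or current[0] != label:
            current = (label, [])
            children.append(current)
        current[1].append((word, pos))
    return (top, children)

rows = [
    ("the", "DT", "B-NP"), ("cat", "NN", "I-NP"),
    ("sat", "VBD", "O"), ("on", "IN", "O"),
    ("the", "DT", "B-NP"), ("mat", "NN", "I-NP"),
]
tree = iob_to_tree(rows)
print(tree)
```

Every token appears exactly once under the top node, either inside a chunk or as a bare leaf, which is what lets the tree span the entire text.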
7. CONLL Collection
- From the Conference on Computational Natural Language Learning (CoNLL) competition in 2000
- Goal: create machine learning methods to improve on the chunking task
8. CONLL Collection
- Data in IOB format from WSJ
  - Word POS-tag IOB-tag
  - Training set: 8936 sentences
  - Test set: 2012 sentences
- Tags from the Brill tagger
  - Penn Treebank tags
- Evaluation measure: F-score
  - $F = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$
- Baseline: select the chunk tag that is most frequently associated with the POS tag, F = 77.07
- Best score in the contest was F = 94.13
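The baseline and the evaluation measure described above can be sketched in a few lines. This is an illustration on toy data, not the contest's scoring script: the function names are invented here, and the F-score assumes that a predicted chunk counts as correct only when its span and type both match a gold chunk exactly.

```python
from collections import Counter, defaultdict

def train_baseline(sentences):
    """Map each POS tag to the chunk tag it most often carries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for _word, pos, chunk_tag in sentence:
            counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def f_score(gold_chunks, predicted_chunks):
    """F = 2 * precision * recall / (precision + recall) over chunk sets."""
    gold, predicted = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy training data in (word, POS tag, IOB tag) form.
train = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "B-VP")],
    [("a", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("dogs", "NN", "B-NP")],
]
model = train_baseline(train)
print(model)

# Chunks as (start, end, type) triples.
gold = [(0, 2, "NP"), (3, 5, "NP"), (5, 6, "VP")]
predicted = [(0, 2, "NP"), (3, 4, "NP")]
print(f_score(gold, predicted))
```

In the toy evaluation one of two predicted chunks is correct (precision 1/2) and one of three gold chunks is found (recall 1/3), so the F-score is 0.4.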
9. nltk_lite and CONLL2000
Note that raw hides the IOB format
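To make the hidden format concrete, here is a small sketch that reads the three-column "Word POS-tag IOB-tag" layout from a string, with a blank line between sentences. It does not use nltk_lite; `read_conll`, `raw_words`, and the sample rows are invented for illustration and are not taken from the corpus.

```python
SAMPLE = """\
The DT B-NP
cat NN I-NP
sat VBD B-VP
on IN B-PP
the DT B-NP
mat NN I-NP
. . O

It PRP B-NP
slept VBD B-VP
. . O
"""

def read_conll(text):
    """Parse 'word pos iob' lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                     # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences

def raw_words(sentence):
    """A 'raw' view: just the words, with POS and IOB tags dropped."""
    return [word for word, _pos, _iob in sentence]

sentences = read_conll(SAMPLE)
print(raw_words(sentences[0]))
print(sentences[0])
```

The first print shows only the words, which is roughly what a raw view exposes; the second shows the POS and IOB columns that sit underneath.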
10. nltk_lite and CONLL2000
11. nltk_lite and CONLL2000
pp() stands for pretty print and applies to the Tree data structure.
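As a rough idea of what pretty printing a chunk tree involves, the sketch below indents a nested tuple structure one level per depth. It is not the nltk_lite pp() method; the `pretty` function and the (label, [children]) layout are assumptions for this example.

```python
def pretty(tree, indent=0):
    """Return an indented, bracketed rendering of a (label, children) tree."""
    label, children = tree
    lines = [" " * indent + "(" + label]
    for child in children:
        if isinstance(child[1], list):           # nested chunk node
            lines.append(pretty(child, indent + 2))
        else:                                    # (word, tag) leaf
            word, tag = child
            lines.append(" " * (indent + 2) + f"{word}/{tag}")
    lines[-1] += ")"                             # close this node
    return "\n".join(lines)

tree = ("S", [("NP", [("the", "DT"), ("cat", "NN")]), ("sat", "VBD")])
print(pretty(tree))
```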
12. nltk_lite chunks in treebank
13. nltk_lite parses in treebank
14. Chunking with Regular Expressions
- This time we write regexes over TAGS rather than words
  - <DT><JJ>?<NN>
  - <NN.*>
  - <JJ|NN>+
- Compile them with parse.ChunkRule()
  - rule = parse.ChunkRule('<DT|NN>+')
  - chunkparser = parse.RegexpChunk([rule], chunk_node='NP')
- Resulting object is of type Tree
  - Top-level node called S
  - Can change this label if you want, in third argument to RegexpChunk
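The idea of matching regular expressions over tags can be sketched without nltk_lite. The code below is an approximation under stated assumptions: each tag is written as `<TAG>`, the tag pattern is translated into an ordinary Python regex, and matches are mapped back to token positions. The names `tag_pattern_to_regex`, `chunk_spans`, and `regexp_chunk` are hypothetical and do not reproduce the library's internals.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern such as '<DT><JJ>?<NN>' into a Python regex."""
    # Group each <...> so that ?, * and + apply to whole tags.
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Keep '.' from matching across tag boundaries.
    pattern = pattern.replace(".", "[^<>]")
    return re.compile(pattern)

def chunk_spans(tags, tag_pattern):
    """Return (start, end) token spans whose tags match the pattern."""
    offsets, pos = {}, 0
    for i, tag in enumerate(tags):
        offsets[pos] = i                 # character offset -> token index
        pos += len(tag) + 2              # +2 for the angle brackets
    offsets[pos] = len(tags)
    tag_string = "".join(f"<{tag}>" for tag in tags)
    regex = tag_pattern_to_regex(tag_pattern)
    return [(offsets[m.start()], offsets[m.end()])
            for m in regex.finditer(tag_string) if m.end() > m.start()]

def regexp_chunk(tagged, tag_pattern, chunk_node="NP", top_node="S"):
    """Build a two-level (top_node, children) tree from tagged tokens."""
    children, i = [], 0
    for start, end in chunk_spans([tag for _, tag in tagged], tag_pattern):
        children.extend(tagged[i:start])                 # non-chunk tokens
        children.append((chunk_node, list(tagged[start:end])))
        i = end
    children.extend(tagged[i:])
    return (top_node, children)

tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(regexp_chunk(tagged, "<DT><JJ>?<NN>"))
```

The regex never sees the words, only the tag string `<DT><JJ><NN><VBD><IN><DT><NN>`, which is why the same rule chunks both "the little cat" and "the mat".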
15. Chunking with Regular Expressions
16. Chunking with Regular Expressions
- Rule application is sensitive to order
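A small experiment can show the order sensitivity. The sketch below assumes that rules are applied one after another and that a later rule may only match tokens that no earlier rule has already placed in a chunk; `to_regex`, `match_spans`, and `apply_rules` are illustrative names, not nltk_lite functions.

```python
import re

def to_regex(tag_pattern):
    """Translate a tag pattern like '<DT><JJ>?<NN>' into a Python regex."""
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return re.compile(pattern.replace(".", "[^<>]"))

def match_spans(tags, tag_pattern):
    """Return (start, end) token spans matching the pattern in tags."""
    offsets, pos = {}, 0
    for i, tag in enumerate(tags):
        offsets[pos] = i
        pos += len(tag) + 2
    offsets[pos] = len(tags)
    text = "".join(f"<{tag}>" for tag in tags)
    return [(offsets[m.start()], offsets[m.end()])
            for m in to_regex(tag_pattern).finditer(text)
            if m.end() > m.start()]

def apply_rules(tags, tag_patterns):
    """Apply rules in order; chunked tokens are hidden from later rules."""
    in_chunk = [False] * len(tags)
    chunks = []
    for tag_pattern in tag_patterns:
        i = 0
        while i < len(tags):
            if in_chunk[i]:
                i += 1
                continue
            j = i
            while j < len(tags) and not in_chunk[j]:
                j += 1                   # tags[i:j] is an unchunked run
            for start, end in match_spans(tags[i:j], tag_pattern):
                chunks.append((i + start, i + end))
                for k in range(i + start, i + end):
                    in_chunk[k] = True
            i = j
    return sorted(chunks)

tags = ["DT", "JJ", "NN"]                # e.g. "the little cat"
print(apply_rules(tags, ["<JJ><NN>", "<DT><JJ>?<NN>"]))
print(apply_rules(tags, ["<DT><JJ>?<NN>", "<JJ><NN>"]))
```

With the narrow rule first, "little cat" is chunked and the determiner is left stranded; with the broader rule first, the whole phrase becomes one chunk.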
17. Chinking
- Specify what does not go into a chunk.
  - Kind of like specifying punctuation as being not alphanumeric and spaces.
- Can be more difficult to think about.
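Chinking can be sketched as the mirror image of chunking: start with the whole sentence as one chunk, then cut out every stretch of tags that matches the chink pattern. The function name `chink_spans` and the approach of matching over a `<TAG>` string are assumptions made for this example rather than the nltk_lite implementation.

```python
import re

def chink_spans(tags, chink_pattern):
    """Return the (start, end) chunks left after removing the chinks."""
    pattern = chink_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    regex = re.compile(pattern.replace(".", "[^<>]"))
    offsets, pos = {}, 0
    for i, tag in enumerate(tags):
        offsets[pos] = i                 # character offset -> token index
        pos += len(tag) + 2
    offsets[pos] = len(tags)
    text = "".join(f"<{tag}>" for tag in tags)

    chunks, start = [], 0                # start of the current chunk
    for m in regex.finditer(text):
        if m.end() == m.start():
            continue                     # ignore empty matches
        chink_start, chink_end = offsets[m.start()], offsets[m.end()]
        if chink_start > start:
            chunks.append((start, chink_start))
        start = chink_end
    if start < len(tags):
        chunks.append((start, len(tags)))
    return chunks

# "the little yellow dog barked at the cat"
tags = ["DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "NN"]
print(chink_spans(tags, "<VBD|IN>+"))
```

Here a single rule naming the verb and preposition tags leaves two noun-phrase chunks behind, without having to list every tag that may occur inside a noun phrase.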
18. Practice Regexp Chunking
- Write rules to be able to produce this kind of chunking
19. Next Time
- Evaluating Shallow Parsing
- Begin Text Summarization