IE by Candidate Classification: Jansche - PowerPoint PPT Presentation

IE by Candidate Classification: Jansche & Abney, Cohen et al William Cohen 1/19/03 – PowerPoint PPT presentation

Learn more at: http://www.cs.cmu.edu

1
IE by Candidate Classification: Jansche & Abney,
Cohen et al.

William Cohen
1/19/03

2
SCAN: Search & Summarization for Audio
Collections (AT&T Labs)
3
(No Transcript)
4
Why IE from personal voicemail?

Unified interface for email, voicemail, fax, ...
requires uniform headers:
Sender, Time, Subject, ...
Headers are key for a uniform interface
Independently, voicemail access is slow:
it is useful to have fast access to important parts of a
message (contact number, caller)

5
Why else to read this paper

Robust information extraction
Generalizing from manual transcripts (i.e.,
human-produced written version of voicemail) to
automatic (ASR) transcripts
Place of hand-coding vs learning in information
extraction
How to break up task
Where and how to use engineering

Candidate Generator -> Candidate phrase -> Learned filter -> Extracted phrase
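A minimal sketch of the propose-and-filter pattern, assuming token-offset spans. The function name `propose_and_filter` and both toy components are hypothetical stand-ins, not the systems from the papers: in practice the generator would be a hand-coded, high-recall component and the filter a learned classifier.

```python
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def propose_and_filter(
    tokens: List[str],
    generate: Callable[[List[str]], Iterable[Span]],
    keep: Callable[[List[str], Span], bool],
) -> List[Span]:
    """Run a high-recall generator, then keep only spans the filter accepts."""
    return [span for span in generate(tokens) if keep(tokens, span)]


def toy_generator(tokens: List[str]) -> Iterable[Span]:
    """Hypothetical generator: propose every 1- or 2-token span."""
    for i in range(len(tokens)):
        for n in (1, 2):
            if i + n <= len(tokens):
                yield (i, i + n)


def toy_filter(tokens: List[str], span: Span) -> bool:
    """Hypothetical filter: keep spans whose first token is capitalized."""
    return tokens[span[0]][:1].isupper()


tokens = "hi there it's Bill and".split()
extracted = propose_and_filter(tokens, toy_generator, toy_filter)
```

The split keeps the engineering (what could possibly be a phrase) separate from the learning (which candidates are actually right).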
6
Voicemail corpus

About 10,000 manually transcribed and annotated
voice messages.
1869 used for evaluation

7
Observation: caller phrases are short and near
the beginning of the message.
8
Caller-phrase extraction

Propose start positions i_1, ..., i_N
Use a learned decision tree to pick the best i
Propose end positions i+j_1, i+j_2, ..., i+j_M
Use a learned decision tree to pick the best j
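The two-stage selection above can be sketched roughly as follows. The scoring functions are hypothetical hand-written stand-ins for the two learned decision trees, and `extract_caller_phrase` is an illustrative name, not one from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def extract_caller_phrase(
    tokens: List[str],
    start_candidates: Sequence[int],
    end_offsets: Sequence[int],
    score_start: Callable[[List[str], int], float],
    score_end: Callable[[List[str], int, int], float],
) -> Optional[Span]:
    """Pick the best start i, then the best end i + j among proposed offsets."""
    if not start_candidates:
        return None
    i = max(start_candidates, key=lambda s: score_start(tokens, s))
    ends = [i + j for j in end_offsets if i + j <= len(tokens)]
    if not ends:
        return None
    end = max(ends, key=lambda e: score_end(tokens, i, e))
    return (i, end)


def toy_score_start(tokens: List[str], i: int) -> float:
    """Stand-in for the learned start classifier (assumption)."""
    return 1.0 if tokens[i].lower() in {"it's", "this"} else 0.0


def toy_score_end(tokens: List[str], i: int, end: int) -> float:
    """Stand-in for the learned end classifier (assumption)."""
    return 1.0 if tokens[end - 1][:1].isupper() else 0.0


tokens = "Hi there it's Bill and".split()
span = extract_caller_phrase(
    tokens, range(len(tokens)), [1, 2, 3], toy_score_start, toy_score_end
)
caller_phrase = " ".join(tokens[span[0]:span[1]])
```

Choosing the start first keeps the second decision small: only M end positions are scored, not every possible substring.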

9
Baseline (HZP, Col log-linear)

IE as tagging
Pr(tag_i | word_i, word_{i-1}, ..., word_{i+1}, ..., tag_{i-1}, ...)
estimated via MAXENT model
Beam search to find best tag sequence given word
sequence
Features of the model are words, word pairs, word
pair + tag trigrams, ...

Hi   there  it's  Bill  and
OUT  OUT    IN    IN    OUT
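To make the tagging view concrete, here is a small sketch that turns an IN/OUT tag sequence back into extracted phrases. The helper name `tags_to_spans` is an assumption made for illustration; the tagger itself (MAXENT plus beam search) is not shown.

```python
from typing import List, Tuple


def tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets, end exclusive, of maximal runs of IN tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "IN" and start is None:
            start = i                      # a new phrase begins here
        elif tag != "IN" and start is not None:
            spans.append((start, i))       # the phrase ended at the previous token
            start = None
    if start is not None:                  # phrase runs to the end of the message
        spans.append((start, len(tags)))
    return spans


tokens = ["Hi", "there", "it's", "Bill", "and"]
tags = ["OUT", "OUT", "IN", "IN", "OUT"]
phrases = [" ".join(tokens[s:e]) for s, e in tags_to_spans(tags)]
```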
10
Performance
11
Observation: caller names are really short and
near the beginning of the message.
12
What about ASR transcripts?
13
Extracting phone numbers

Phase 1: a hand-coded grammar proposes candidate
phone numbers
Not too hard, due to limited vocabulary
Optimize recall (96%), not precision (30%)
Phase 2: a learned decision tree filters
candidates
Use length, position, context, ...
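A rough sketch of the two phases, under loud assumptions: the "grammar" here is just maximal runs of spoken digit words, and `toy_filter` is a hand-written rule standing in for the learned decision tree. Neither is taken from the paper; only the feature names (length, position, context) follow the slide.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def propose_candidates(tokens: List[str], min_len: int = 3) -> List[Span]:
    """Phase 1 (high recall): every maximal run of at least min_len digit words."""
    candidates = []
    start = None
    for i, tok in enumerate(tokens + [""]):   # sentinel closes a trailing run
        if tok.lower() in DIGIT_WORDS:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                candidates.append((start, i))
            start = None
    return candidates


def candidate_features(tokens: List[str], span: Span) -> Dict[str, object]:
    """Length, position and left context of a candidate."""
    start, end = span
    return {
        "length": end - start,
        "position": start / len(tokens),
        "previous_word": tokens[start - 1].lower() if start > 0 else "<start>",
    }


def toy_filter(features: Dict[str, object]) -> bool:
    """Phase 2 stand-in (assumption): keep 7- or 10-digit candidates."""
    return features["length"] in (7, 10)


def to_digits(tokens: List[str], span: Span) -> str:
    return "".join(DIGIT_WORDS[t.lower()] for t in tokens[span[0]:span[1]])


tokens = ("call me at five five five one two one two "
          "or extension four two one").split()
candidates = propose_candidates(tokens)
numbers = [to_digits(tokens, c) for c in candidates
           if toy_filter(candidate_features(tokens, c))]
```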

14
Results
15
Their Conclusions
16
Cohen, Wang, Murphy

Another paper with a similar flavor
IE for a particular task
IE using similar propose-and-filter approach
When and how do you engineer, and when and how do
you use learning?

17
Background: subcellular localization
The most important tool for studying protein
localizations is fluorescence microscopy.
New image processing techniques can automatically
produce a quantitative description of subcellular
localization.
18
Background: subcellular localization
19
Background: subcellular localization
Entrez: a new 376kD Golgi complex outer
membrane protein. SWISS-PROT: INTEGRAL MEMBRANE
PROTEIN. GOLGI MEMBRANE.
Entrez: GPP130 type II Golgi membrane
protein. SWISS-PROT: nothing.
20
Background: subcellular localization

Some other interesting facts:
Primary structure is a poor indicator of
localization
Many possible localizations with image analysis
Tens of thousands of images in open literature

21
Overview of SLIF: image analysis of existing
images from online publications
[Pipeline diagram. Components: Figure Finder, Panel Splitter, Panel Classifier, Scale Finder. Data items: On-line paper, Figure, Image, Fl. Micr. Panel, Micr. Scale.]
22
Overview of SLIF: image analysis of existing
images from online publications
End result: a collection of on-line fluorescence
microscope images, with a quantitative description
of localization.
E.g., we know this figure section shows a
tubulin-like protein, but not which one!
23
Background: overview of SLIF1
24
Background: overview of SLIF1
Figure 1. (A) Single confocal ″-GFP fusion.
Bars, 5 µm. Movement of Coiled Bodies, Vol. 10,
July 1999
Find scalebar and scale measurement
Rescale image of each cell, adjust contrast, and
compute subcellular localization features as if
it were an ordinary microscope image. Of course,
you still don't know what it's an image of
25
Background: overview of SLIF2.0
[Pipeline diagram. Components: Image Pointer Finder, Panel Splitter, Panel Label Matcher, Scope Finder, Panel Classifier, Name Finder, Scale Finder. Data items: Caption, Image, Fl. Micr. Panel, Micr. Scale, Cell Type, Protein Name.]
26
Background: overview of SLIF2.0
Figure 1. (A) Single confocal optical section of
BY-2 cells expressing U2B″-GFP, double labeled
with GFP (left panel) and autoantibody against
p80 coilin (right panel). Three nuclei are shown,
and the bright GFP spots colocalize with bright
foci of anti-coilin labeling. There is some
labeling of the cytoplasm by anti-p80 coilin. (B)
Single confocal optical section of BY-2 cells
expressing U2B″-GFP, double labeled with GFP
(left panel) and 4G3 antibody (right panel).
Three nuclei are shown. Most coiled bodies are
in the nucleoplasm, but occasionally are seen in
the nucleolus (arrows). All coiled bodies that
contain U2B″ also express the U2B″-GFP fusion.
Bars, 5 µm. Movement of Coiled Bodies, Vol. 10,
July 1999, 2299
A new issue: caption understanding. Where are
the entities in the image?
27
Why caption understanding?
- Location proteomics.
- Remove extraneous junk from caption text for ordinary IE, NLP, indexing, ...
- Better text- or content-based image retrieval for scientific images.
28
Identify image pointers: substrings that refer to
parts of the image.
Will focus on text issues, not matching.
29
Identify image pointers: substrings that refer to
parts of the image.
Classify image pointers as citation-style or
bullet-style.
30
Classify image pointers as citation-style or
bullet-style.
31
Compute scopes:
- The scope of a bullet-style image pointer is all words after it, but before the next bullet.
- The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation).
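The bullet-style rule is simple enough to sketch directly. This assumes the bullet-style pointers have already been found and are given as character offsets; the regular expression used to find them below and the shortened caption are illustrative assumptions. The citation-style heuristic is not specified in enough detail on the slide to sketch.

```python
import re
from typing import Dict, List, Tuple


def bullet_scopes(caption: str, bullets: List[Tuple[int, int]]) -> Dict[str, str]:
    """Map each bullet-style pointer to the text after it, up to the next bullet.

    bullets: (start, end) character offsets of the pointers, sorted by start.
    """
    scopes = {}
    for k, (start, end) in enumerate(bullets):
        next_start = bullets[k + 1][0] if k + 1 < len(bullets) else len(caption)
        scopes[caption[start:end]] = caption[end:next_start].strip()
    return scopes


# Shortened, hypothetical caption in the style of the example caption.
caption = ("Figure 1. (A) Cells expressing GFP (left panel). "
           "(B) Cells labeled with 4G3 antibody.")
# Assumption: single capital letters in parentheses are the bullet-style pointers.
bullets = [m.span() for m in re.finditer(r"\([A-Z]\)", caption)]
scopes = bullet_scopes(caption, bullets)
```

Note that "(left panel)" falls inside the scope of "(A)": it is a citation-style pointer nested within a bullet's scope.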
32
Image pointers share all entities in their
scope. Entities are assigned to panels based
on matches of image-pointers to annotations in
panels.
33
Outline

Details on caption understanding
Baseline hand-coded methods
Learning methods
Experimental results

34
Task

Identify image pointers in captions.
Classify image pointers:
bullet-style, citation-style, or NP-style
E.g., "Panels A and C show the ..."
Won't talk about scoping
Will focus first on extracting image
pointers, i.e., binary classification of
substrings: is this an image pointer?
Data: 100 captions from 100 papers, about 600
positive examples.

35
Baseline methods

Labeled 100 sample figure captions.
HANDCODE-1: patterns like (A), (B-E), (c and d),
etc.
HANDCODE-2: all short parenthesized expressions,
plus patterns like "panel A" or "in B-C"

            HC-1   HC-2f   HC-2
Precision   98.5   89.0    74.5
Recall      45.6   54.8    98.0
F1          62.3   67.8    84.6

Some plausible tricks (like filtering HC-2) don't
help much
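The two baselines might look roughly like the regular expressions below. These patterns are guesses written for illustration; the actual HANDCODE-1 and HANDCODE-2 rules are not given on the slide, and the 20-character cutoff for "short" is an assumption.

```python
import re
from typing import List

# One or more panel letters joined by "-", "," or "and": A, B-E, c and d
_LETTERS = r"[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*"

# HANDCODE-1 style: parenthesized panel letters only (high precision)
HANDCODE_1 = re.compile(r"\(" + _LETTERS + r"\)")

# HANDCODE-2 style: any short parenthesized expression ...
SHORT_PAREN = re.compile(r"\([^()]{1,20}\)")
# ... plus unparenthesized references like "panel A" or "in B-C"
PANEL_REF = re.compile(r"\b(?:panels?|in)\s+[A-Z](?:\s*(?:-|,|and)\s*[A-Z])*\b")


def handcode_1(caption: str) -> List[str]:
    return [m.group(0) for m in HANDCODE_1.finditer(caption)]


def handcode_2(caption: str) -> List[str]:
    matches = list(SHORT_PAREN.finditer(caption)) + list(PANEL_REF.finditer(caption))
    matches.sort(key=lambda m: m.start())   # report in caption order
    return [m.group(0) for m in matches]


caption = ("Figure 1. (A) Control cells (left panel). Arrows in B-C mark foci "
           "(see Smith and colleagues, 1999, for details).")
hc1 = handcode_1(caption)
hc2 = handcode_2(caption)
```

The first pattern misses "(left panel)" and "in B-C", which is the recall gap the table shows; the second catches them at the cost of also accepting many short parentheticals that are not image pointers.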
36
How hard is the problem?
Some citation-style image pointers
37
How hard is the problem?
NP-style
non-image pointers
The difficulty of the task suggests using a
learning approach
38
Another use of propose-and-filter
Note that HANDCODE-2 (recall 98%) is a natural
candidate generator. We'll start with
off-the-shelf features.
Candidate Generator -> Candidate phrase -> Learned filter -> Extracted phrase
39
Learning methods: features

Start with named sets of labeled substrings:
Image pointers and tokens (not marked)

Fig. 1. Kinase inactive Plk inhibits Golgi
fragmentation by mitotic cytosol. (A) NRK cells
were grown on coverslips and treated with
2 mM thymidine for 8 to 14 h. Cells were
subsequently permeabilized with digitonin, washed
with 1M KCl-containing buffer, and incubated with
either 7 mg/ml interphase cytosol (IE), 7 mg/ml
mitotic extract (ME), or mitotic extract to
which 20 mg/ml kinase inactive Plk (ME Plk-KD)
was added. After a 60-min incubation at 32°C,
cells were fixed and stained with anti-mannosidase
II antibody to visualize the Golgi apparatus
by fluorescence microscopy. (B) Percentage of
cells with fragmented Golgi after incubation with
mitotic extract (ME) in the absence or the
presence of kinase inactive Plk (ME Plk-KD).
The histogram represents the average of four
independent experiments.
40
Learning methods: features

Start with named sets of labeled substrings:
Image pointers (label = y/n) and tokens
(label = token)
Substrings act as examples and features
To create features, use a little language:
emit( token, before, -1, label ),
emit( token, before, -2, label ),

41
Learning methods: features

emit( token, before, -1, label ),
emit( token, before, -2, label ),

The four arguments of emit, in order:
kind of substring to look for (token)
direction to go (before)
distance to go (-1, -2)
what to emit (substring label, distance in chars
to substring, ...)
Example of an emitted value: inactive
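One way the little language could be interpreted, as a sketch only: the `Sub` record, the `emit` signature, and the feature-string format are all assumptions made for illustration, since the slides show the surface syntax but not the implementation.

```python
import re
from typing import List, NamedTuple, Optional


class Sub(NamedTuple):
    kind: str    # name of the set this substring belongs to, e.g. "token"
    start: int   # character offset where the substring starts
    end: int     # character offset just past the substring
    label: str   # token text for tokens, "y"/"n" for candidate pointers


def emit(candidate: Sub, subs: List[Sub], kind: str, direction: str,
         distance: int, what: str) -> Optional[str]:
    """Walk `distance` substrings of `kind` in `direction` and emit a feature."""
    if direction == "before":
        pool = [s for s in subs if s.kind == kind and s.end <= candidate.start]
        pool.sort(key=lambda s: s.end, reverse=True)   # nearest first
    else:
        pool = [s for s in subs if s.kind == kind and s.start >= candidate.end]
        pool.sort(key=lambda s: s.start)               # nearest first
    steps = abs(distance)
    if steps == 0 or steps > len(pool):
        return None                                    # nothing that far away
    target = pool[steps - 1]
    if what == "label":
        value = target.label
    else:  # "distance": gap in characters between candidate and target
        value = str(candidate.start - target.end if direction == "before"
                    else target.start - candidate.end)
    return f"{kind}.{direction}.{distance}.{what}={value}"


text = "Kinase inactive Plk inhibits Golgi fragmentation"
tokens = [Sub("token", m.start(), m.end(), m.group())
          for m in re.finditer(r"\S+", text)]
candidate = Sub("pointer", 16, 19, "n")  # hypothetical candidate covering "Plk"
features = [emit(candidate, tokens, "token", "before", d, "label")
            for d in (-1, -2, -3)]
```

With the candidate placed on "Plk", the first call emits the label of the token one step before it, which is "inactive".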
42
Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire,
99). Allows real-valued predictions for each
base hypothesis, including a value of zero.
43
Learning methods: boosting rules

Weak learner to find weak hypothesis t:
Split data into Growing and Pruning sets
Let R_t be an empty conjunction
Greedily add conditions to R_t, guided by the Growing
set
Greedily remove conditions from R_t, guided by the
Pruning set
Convert R_t to a weak hypothesis
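A sketch of the last step, following the usual confidence-rated boosting recipe: a rule predicts a real-valued confidence on the examples it covers and abstains (predicts zero) elsewhere. The smoothing constant and all function names are assumptions of this sketch, not details taken from the slides.

```python
import math
from typing import List


def rule_confidence(weights: List[float], labels: List[int],
                    covered: List[bool], eps: float) -> float:
    """Confidence of a rule: half the log-ratio of covered positive vs negative weight."""
    w_pos = sum(w for w, y, c in zip(weights, labels, covered) if c and y == +1)
    w_neg = sum(w for w, y, c in zip(weights, labels, covered) if c and y == -1)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


def weak_hypothesis(is_covered: bool, confidence: float) -> float:
    """Real-valued prediction: the confidence if the rule fires, else zero (abstain)."""
    return confidence if is_covered else 0.0


def reweight(weights: List[float], labels: List[int],
             covered: List[bool], confidence: float) -> List[float]:
    """AdaBoost-style update: w_i * exp(-y_i * h(x_i)), then renormalize."""
    new = [w * math.exp(-y * weak_hypothesis(c, confidence))
           for w, y, c in zip(weights, labels, covered)]
    z = sum(new)
    return [w / z for w in new]


# Toy data: four equally weighted examples; the rule covers the first three.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
covered = [True, True, True, False]
conf = rule_confidence(weights, labels, covered, eps=1.0 / (2 * len(weights)))
new_weights = reweight(weights, labels, covered, conf)
```

After the update, the covered negative example (which the rule gets wrong) carries the most weight, and the uncovered example is changed only by renormalization.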

44
Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
45
Learning methods: BWI

Boosted wrapper induction (BWI) learns to extract
substrings from a document.
Learns three concepts: firstToken(x),
lastToken(x), substringLength(k)
Conditions are tests on tokens before/after x
E.g., tok_{i-2} = from, isNumber(tok_{i+1})
SLIPPER weak learner, no pruning.
Greedy search extends window size by at most L in
each iteration, uses lookahead L, no fixed limit
on window size.
Good results in (Kushmerick and Freitag, 2000)
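One plausible way to combine the three learned concepts at extraction time is to score each substring by the product of its first-token score, last-token score, and length probability. The product rule, the function name, and the toy numbers below are assumptions of this sketch; the slides do not spell out the combination.

```python
from typing import Dict, List, Optional, Tuple


def best_substring(
    first_score: List[float],
    last_score: List[float],
    length_prob: Dict[int, float],
    max_len: int,
) -> Tuple[Optional[Tuple[int, int]], float]:
    """Return the (first, last) token indices (inclusive) with the highest score."""
    best, best_score = None, 0.0
    n = len(first_score)
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            length = j - i + 1
            score = first_score[i] * last_score[j] * length_prob.get(length, 0.0)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Toy scores for a five-token message (made up for illustration).
first_score = [0.1, 0.0, 0.9, 0.2, 0.0]
last_score = [0.0, 0.1, 0.1, 0.8, 0.3]
length_prob = {1: 0.3, 2: 0.6, 3: 0.1}
span, score = best_substring(first_score, last_score, length_prob, max_len=3)
```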

46
Learning methods: ABWI

Almost boosted wrapper induction (ABWI) learns
to extract substrings
Learns to filter candidate substrings (HANDCODE-2)
Conditions are the same tests on tokens near x
E.g., tok_{i-2} = from, isNumber(tok_{i+1})
SLIPPER weak learner, no pruning.
Greedy search extends window size any amount,
uses no lookahead, has fixed limit on window
size.
Optimal window sizes for this problem seem to be
small

47
Learning methods

            HC-1   HC-2f   HC-2   ABWI (W2)
Precision   98.5   89.0    74.5   89.7
Recall      45.6   54.8    98.0   91.0
F1          62.3   67.8    84.6   90.3

Features: W tokens before/after, all tokens
inside.
Learner: 100 rounds of boosting conjunctions of
feature tests
Inspired by BWI (Freitag & Kushmerick)
Implemented with the SLIPPER learner
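As a quick check on the table, F1 is the harmonic mean of precision and recall, and recomputing it from the precision and recall rows reproduces the F1 row to one decimal place. The dictionary layout is just a convenience for this sketch.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
table = {
    "HC-1": (98.5, 45.6),
    "HC-2f": (89.0, 54.8),
    "HC-2": (74.5, 98.0),
    "ABWI (W2)": (89.7, 91.0),
}
f1_scores = {name: round(f1(p, r), 1) for name, (p, r) in table.items()}
```

The harmonic mean punishes imbalance, which is why HC-1, with very high precision but low recall, scores well below HC-2.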

48
Other learning methods

            HC-1   HC-2f   HC-2   ABWI (W2)   ABWI Slipper   ABWI Ripper   ABWI SVM1   ABWI SVM2
Precision   98.5   89.0    74.5   89.7        96.1           88.1          69.0        100.0
Recall      45.6   54.8    98.0   91.0        85.2           87.1          78.0        75.2
F1          62.3   67.8    84.6   90.3        90.3           87.6          73.2        85.6

All learning methods are competitive with
hand-coded methods
49
Additional features

Check if candidate contains certain special
substrings:
Matches color name: labeled color
Matches HANDCODE-1 pattern: labeled handcode1
Matches mm, mg, etc.: labeled measure
Matches 1980, ..., 2003, et al.: labeled citation
Matches top, left, etc.: labeled place
Added sentence boundary substrings
Feature is distance to boundary.
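The special-substring checks could be sketched as below. The word lists and regular expressions are assumptions chosen to match the examples on the slide (color names, mm/mg, years 1980 to 2003, "et al", top/left); the real lists are not given. The sentence-boundary distance feature is not shown.

```python
import re
from typing import Set

COLORS = {"red", "green", "blue", "yellow"}          # assumed color names
UNITS = {"mm", "mg", "ml", "nm"}                     # "mm, mg, etc."
PLACES = {"top", "bottom", "left", "right"}          # "top, left, etc."
CITATION = re.compile(r"\b(?:19[89]\d|200[0-3])\b|\bet al\b")
# Assumed HANDCODE-1 style pattern: (A), (B-E), (c and d)
HANDCODE_1 = re.compile(r"\([A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\)")


def special_labels(candidate: str) -> Set[str]:
    """Return the labels of the special substrings found in a candidate."""
    text = candidate.lower()
    words = set(re.findall(r"[a-z]+", text))
    labels = set()
    if words & COLORS:
        labels.add("color")
    if HANDCODE_1.fullmatch(candidate.strip()):
        labels.add("handcode1")
    if words & UNITS:
        labels.add("measure")
    if CITATION.search(text):
        labels.add("citation")
    if words & PLACES:
        labels.add("place")
    return labels


examples = ["(A)", "(left panel)", "(7 mg/ml)", "(Smith et al., 1999)", "(green)"]
labeled = {c: special_labels(c) for c in examples}
```

Labels such as citation and measure mostly mark candidates that are not image pointers, so they are useful as negative evidence.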

50
Learning with expanded feature set

            HC-1   HC-2f   HC-2   ABWI (W2)   ABWI NA
Precision   98.5   89.0    74.5   89.7        85.9
Recall      45.6   54.8    98.0   91.0        92.2
F1          62.3   67.8    84.6   90.3        89.0

Many new features are inversely correlated with
class (e.g. citation), but ABWI looks only for
positively-correlated patterns.
51
Learning with expanded feature set

            HC-1   HC-2f   HC-2   ABWI (W2)   ABWI NA   SABWI NA
Precision   98.5   89.0    74.5   89.7        85.9      88.6
Recall      45.6   54.8    98.0   91.0        92.2      93.8
F1          62.3   67.8    84.6   90.3        89.0      91.1

SABWI is a symmetric version of ABWI: it can use
rules and/or conditions negatively or positively
correlated with the class
52
(No Transcript)
53
Task

Identify image pointers in captions.
Classify image pointers:
bullet-style, citation-style, or NP-style
Combine these to get a four-class problem:
bullet-style, citation-style, NP-style, or other
No hand-coded baseline methods

54
Four-class extraction results

Method      Error rate
            W2     W3     W5
ABWI        24.6   27.5   26.7
ABWI NA     26.7   22.2   26.7
SABWI NA    24.2   18.2   22.6
55
Further improvement is probable with additional
labeled data
