Title: IE by Candidate Classification: Jansche
1IE by Candidate Classification: Jansche & Abney,
Cohen et al.
2SCAN: Search & Summarization for Audio
Collections (AT&T Labs)
4Why IE from personal voicemail?
- Unified interface for email, voicemail, fax, ... requires uniform headers: Sender, Time, Subject, ...
- Headers are key for a uniform interface
- Independently, voicemail access is slow
- Useful to have fast access to important parts of the message (contact number, caller)
5Why else to read this paper
- Robust information extraction
- Generalizing from manual transcripts (i.e., human-produced written version of voicemail) to automatic (ASR) transcripts
- Place of hand-coding vs. learning in information extraction
- How to break up the task
- Where and how to use engineering
Candidate Generator -> candidate phrase -> Learned filter -> extracted phrase
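The propose-and-filter pipeline can be sketched in a few lines. This is a minimal illustration, not the system from either paper: the generator `all_short_spans` and the filter `capitalized_only` are hypothetical stand-ins for a hand-coded grammar and a learned classifier.

```python
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def propose_and_filter(
    tokens: List[str],
    generate: Callable[[List[str]], Iterable[Span]],
    keep: Callable[[List[str], Span], bool],
) -> List[str]:
    """Run a high-recall generator, then keep only spans the filter accepts."""
    return [" ".join(tokens[s:e]) for s, e in generate(tokens) if keep(tokens, (s, e))]


# Hypothetical generator: every span of one or two tokens (high recall, low precision).
def all_short_spans(tokens, max_len=2):
    for s in range(len(tokens)):
        for e in range(s + 1, min(s + max_len, len(tokens)) + 1):
            yield (s, e)


# Hypothetical filter: a learned classifier would go here.
def capitalized_only(tokens, span):
    s, e = span
    return all(t[0].isupper() for t in tokens[s:e])


extracted = propose_and_filter(
    "hi there it's Bill Smith calling".split(), all_short_spans, capitalized_only
)
```

The generator is tuned for recall and the filter for precision, so each stage can be engineered or learned independently.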
6Voicemail corpus
- About 10,000 manually transcribed and annotated voice messages.
- 1869 used for evaluation
7Observation: caller phrases are short and near
the beginning of the message.
8Caller-phrase extraction
- Propose start positions i_1, ..., i_N
- Use a learned decision tree to pick the best i
- Propose end positions i+j_1, i+j_2, ..., i+j_M
- Use a learned decision tree to pick the best j
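The two-step selection above can be sketched as follows. This is only an illustration under stated assumptions: the scoring functions `score_start` and `score_end` are hypothetical stand-ins for the learned decision trees, and the cue words are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def extract_caller_phrase(
    tokens: Sequence[str],
    starts: List[int],
    lengths: List[int],
    score_start: Callable[[Sequence[str], int], float],
    score_end: Callable[[Sequence[str], int, int], float],
) -> Tuple[int, int]:
    """Pick the best start i, then the best end i + j for that start."""
    i = max(starts, key=lambda s: score_start(tokens, s))
    # Candidate ends are offsets from the chosen start (end is exclusive).
    ends = [i + j for j in lengths if i + j <= len(tokens)]
    end = max(ends, key=lambda e: score_end(tokens, i, e))
    return i, end


# Hypothetical scorers; a real system would use learned classifiers.
def score_start(tokens, s):
    return 1.0 if tokens[s].lower() in {"it's", "this"} else 0.0


def score_end(tokens, i, e):
    boundary = e < len(tokens) and tokens[e].lower() in {"and", "calling"}
    return (1.0 if boundary else 0.0) - 0.1 * (e - i)  # prefer short phrases


tokens = "hi there it's Bill and I'm calling about".split()
start, end = extract_caller_phrase(tokens, [0, 1, 2, 3], [1, 2, 3], score_start, score_end)
caller_phrase = " ".join(tokens[start:end])
```

Choosing the start first keeps the number of end candidates small, which fits the observation that caller phrases are short.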
9Baseline (HZP, Col log-linear)
- IE as tagging
- Pr(tag_i | word_i, word_{i-1}, ..., word_{i+1}, ..., tag_{i-1}, ...) estimated via MAXENT model
- Beam search to find best tag sequence given word sequence
- Features of model are words, word pairs, word pair+tag trigrams, ...
Hi   there  it's  Bill  and
OUT  OUT    IN    IN    OUT
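Once a tagger has labeled each word, the extracted phrases are the maximal runs of IN tags. The sketch below shows only that decoding step, using the example sentence from the slide; the MAXENT model and beam search are not reproduced, and the function name `tagged_spans` is an assumption.

```python
from typing import List


def tagged_spans(words: List[str], tags: List[str]) -> List[str]:
    """Collect maximal runs of IN-tagged words as extracted phrases."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "IN":
            current.append(word)
        elif current:
            # An OUT tag closes the phrase that was being built.
            phrases.append(" ".join(current))
            current = []
    if current:  # phrase that runs to the end of the message
        phrases.append(" ".join(current))
    return phrases


phrases = tagged_spans(
    "Hi there it's Bill and".split(), ["OUT", "OUT", "IN", "IN", "OUT"]
)
```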
10Performance
11Observation: caller names are really short and
near the beginning of the message.
12What about ASR transcripts?
13Extracting phone numbers
- Phase 1: hand-coded grammar proposes candidate phone numbers
- Not too hard, due to limited vocabulary
- Optimize recall (96%) not precision (30%)
- Phase 2: a learned decision tree filters candidates
- Use length, position, context, ...
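A rough sketch of the two phases, under loud assumptions: the "grammar" here is just "any run of at least three spoken digit words", and the filter `keep` is a hand-written length test standing in for the learned decision tree. Neither is taken from the paper.

```python
from typing import List, Tuple

# Assumed vocabulary of spoken digits; the real grammar is not given in the slides.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def propose_numbers(tokens: List[str], min_len: int = 3) -> List[Tuple[int, int]]:
    """Phase 1: propose every maximal run of at least min_len digit words."""
    runs, start = [], None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        if tok.lower() in DIGIT_WORDS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def to_digits(tokens: List[str], span: Tuple[int, int]) -> str:
    return "".join(DIGIT_WORDS[t.lower()] for t in tokens[span[0]:span[1]])


def keep(tokens: List[str], span: Tuple[int, int]) -> bool:
    """Phase 2 stand-in: a learned filter would also use position and context."""
    return span[1] - span[0] in (7, 10)


tokens = "call me at five five five one two three four before three oh five".split()
candidates = propose_numbers(tokens)
numbers = [to_digits(tokens, c) for c in candidates if keep(tokens, c)]
```

The time expression "three oh five" is proposed in phase 1 and rejected in phase 2, which is the intended division of labor: recall first, precision second.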
14Results
15Their Conclusions
16Cohen, Wang, Murphy
- Another paper with a similar flavor
- IE for a particular task
- IE using similar propose-and-filter approach
- When and how do you engineer, and when and how do
you use learning?
17Background: subcellular localization
The most important tool for studying protein
localizations is fluorescence microscopy.
New image processing techniques can automatically
produce a quantitative description of subcellular
localization.
18Background: subcellular localization
19Background: subcellular localization
Entrez: a new 376kD Golgi complex outer membrane protein
SWISSProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE
Entrez: GPP130 type II Golgi membrane protein
SWISSProt: nothing
20Background: subcellular localization
- Some other interesting facts
- Primary structure is a poor indicator of localization
- Many possible localizations with image analysis
- Tens of thousands of images in open literature
21Overview of SLIF: image analysis of existing
images from online publications
[Diagram components: On-line paper, Figure finder, Figure, Image, Panel Splitter, Panel Classifier, Fl. Micr. Panel, Scale Finder, Micr. Scale]
22Overview of SLIF: image analysis of existing
images from online publications
End result: collection of on-line fluorescence
microscope images, with quantitative description
of localization.
E.g., we know this figure section shows a
tubulin-like protein, but not which one!
23Background: overview of SLIF1
24Background: overview of SLIF1
Figure 1. (A) Single confocal 0-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999
Find scalebar and scale measurement
Rescale image of each cell, adjust contrast, and
compute subcellular localization features as if
it were an ordinary microscope image. Of course,
you still don't know what it's an image of.
25Background: overview of SLIF2.0
[Diagram components: Caption, Image, Image Pointer Finder, Panel Splitter, Panel Label Matcher, Scope Finder, Panel Classifier, Name Finder, Scale Finder, Fl. Micr. Panel, Micr. Scale, Cell Type, Protein Name]
26Background: overview of SLIF2.0
Figure 1. (A) Single confocal optical section of
BY-2 cells expressing U2B″-GFP, double labeled
with GFP (left panel) and autoantibody against
p80 coilin (right panel). Three nuclei are shown,
and the bright GFP spots colocalize with bright
foci of anti-coilin labeling. There is some
labeling of the cytoplasm by anti-p80 coilin. (B)
Single confocal optical section of BY-2 cells
expressing U2B″-GFP, double labeled with GFP
(left panel) and 4G3 antibody (right panel).
Three nuclei are shown. Most coiled bodies are
in the nucleoplasm, but occasionally are seen in
the nucleolus (arrows). All coiled bodies that
contain U2B″ also express the U2B″-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999 2299
A new issue: caption understanding. Where are
the entities in the image?
27Why caption understanding?
- Location proteomics.
- Remove extraneous junk from caption text for ordinary IE, NLP, indexing, ...
- Better text- or content-based image retrieval for scientific images.
28Identify image pointers: substrings that refer to
parts of the image
Will focus on text issues, not matching
29Identify image pointers: substrings that refer to
parts of the image
Classify image pointers as citation-style or
bullet-style.
30 Classify image pointers as citation-style or
bullet-style.
31Compute scopes
- The scope of a bullet-style image pointer is all words after it, but before the next bullet.
- The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation).
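The bullet-style rule is simple enough to sketch directly. The code below assumes, for illustration only, that bullet-style pointers look like a single capital letter in parentheses, and it treats every such match as bullet-style; the citation-style heuristic is not attempted. The regex and function name are not from the original system.

```python
import re
from typing import Dict

# Assumed shape of a bullet-style pointer: "(A)", "(B)", ...
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption: str) -> Dict[str, str]:
    """Map each bullet-style pointer to the text up to the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        # Scope ends where the next bullet begins, or at the end of the caption.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes


scopes = bullet_scopes(
    "Figure 1. (A) Cells expressing GFP. (B) Cells labeled with 4G3 antibody."
)
```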
32Image pointers share all entities in their
scope. Entities are assigned to panels based
on matches of image-pointers to annotations in
panels.
33Outline
- Details on caption understanding
- Baseline hand-coded methods
- Learning methods
- Experimental results
34Task
- Identify image pointers in captions.
- Classify image pointers
- bullet-style, citation-style, or NP-style
- E.g., "Panels A and C show the ..."
- Won't talk about scoping
- Will focus first on extracting image pointers, i.e., binary classification of substrings: is this an image pointer?
- Data: 100 captions from 100 papers, about 600 positive examples.
35Baseline methods
- Labeled 100 sample figure captions.
- HANDCODE-1: patterns like (A), (B-E), (c and d), etc.
- HANDCODE-2: all short parenthesized expressions, plus patterns like "panel A" or "in B-C"
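A HANDCODE-1 style matcher might look roughly like the regular expression below. The actual patterns are not given in the slides, so this regex is an assumption that covers only the three example shapes (A), (B-E), and (c and d).

```python
import re
from typing import List

# Assumed pattern: one panel letter, optionally followed by "-", "," or "and"
# and a second panel letter, all inside parentheses.
HANDCODE1 = re.compile(r"\(([A-Za-z])(?:\s*(?:-|,|and)\s*([A-Za-z]))?\)")


def handcode1_pointers(caption: str) -> List[str]:
    """Return the parenthesized substrings that match the assumed pattern."""
    return [m.group(0) for m in HANDCODE1.finditer(caption)]


pointers = handcode1_pointers(
    "(A) Control. (B-E) Treated cells (left panel). "
    "Panels (c and d) show nuclei (arrows)."
)
```

A narrow pattern like this misses pointers such as "(left panel)", which fits the high-precision, low-recall behavior reported for HC-1.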
Method  Precision  Recall  F1
HC-1    98.5       45.6    62.3
HC-2f   89.0       54.8    67.8
HC-2    74.5       98.0    84.6
Some plausible tricks (like filtering HC-2) don't
help much.
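The F1 figures for the baselines follow from the reported precision and recall as their harmonic mean. A quick check, using only the numbers reported for HC-1, HC-2f, and HC-2 (the helper name `f1` is ours):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent here)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported for the hand-coded baselines.
reported = {"HC-1": (98.5, 45.6), "HC-2f": (89.0, 54.8), "HC-2": (74.5, 98.0)}
f1_scores = {name: round(f1(p, r), 1) for name, (p, r) in reported.items()}
```

HC-2 wins on F1 despite much lower precision because the harmonic mean is dominated by the smaller of the two values, and HC-1 and HC-2f both have low recall.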
36How hard is the problem?
Some citation-style image pointers
37How hard is the problem?
NP-style
non-image pointers
The difficulty of the task suggests using a
learning approach
38Another use of propose-and-filter
Note that HANDCODE-2 (recall 98%) is a natural
candidate generator. We'll start with off-the-shelf
features.
Candidate Generator -> candidate phrase -> Learned filter -> extracted phrase
39Learning methods: features
- Start with named sets of labeled substrings
- Image pointers and tokens (not marked)
Fig. 1. Kinase inactive Plk inhibits Golgi
fragmentation by mitotic cytosol. (A) NRK cells
were grown on coverslips and treated with
2 mM thymidine for 8 to 14 h. Cells were
subsequently permeabilized with digitonin, washed
with 1M KCl-containing buffer, and incubated with
either 7 mg/ml interphase cytosol (IE), 7 mg/ml
mitotic extract (ME), or mitotic extract to
which 20 mg/ml kinase inactive Plk (ME + Plk-KD)
was added. After a 60-min incubation at 32°C,
cells were fixed and stained with anti-mannosidase
II antibody to visualize the Golgi apparatus
by fluorescence microscopy. (B) Percentage of
cells with fragmented Golgi after incubation with
mitotic extract (ME) in the absence or the
presence of kinase inactive Plk (ME + Plk-KD).
The histogram represents the average of four
independent experiments.
40Learning methods: features
- Start with named sets of labeled substrings
- Image pointers (label=y/n) and tokens (label=token)
- Substrings act as examples and features
- To create features use a little language
- emit( token, before, -1, label ),
- emit( token, before, -2, label ),
41Learning methods: features
- emit( token, before, -1, label ),
- emit( token, before, -2, label ),
Arguments, in order:
- token: kind of substring to look for
- before: direction to go
- -1, -2: distance to go
- label: what to emit (substring label, distance in chars to substring, ...)
Example: emit "inactive"
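One way to read the emit statements is as feature extractors that walk a fixed number of tokens away from a candidate substring and report what they find. The sketch below is an interpretation, not the actual little language: the function signature, the feature-name format, and the `<none>` placeholder are all assumptions.

```python
from typing import List, Tuple


def emit(tokens: List[str], span: Tuple[int, int], direction: str, distance: int) -> str:
    """Feature for the token `distance` steps before or after the candidate span.

    `span` is (start, end) with end exclusive. Negative distances are used with
    "before" and positive distances with "after", mirroring the slide's -1, -2.
    """
    start, end = span
    idx = start + distance if direction == "before" else end - 1 + distance
    value = tokens[idx] if 0 <= idx < len(tokens) else "<none>"
    return f"token.{direction}.{distance}={value}"


tokens = ["Kinase", "inactive", "Plk", "inhibits"]
candidate = (2, 3)  # the substring "Plk"
features = [emit(tokens, candidate, "before", -1), emit(tokens, candidate, "before", -2)]
```

Under this reading, each candidate substring is an example and the tokens around it supply the features, so the same labeled substrings serve both roles.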
42Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire,
99). Allows real-valued predictions for each
base hypothesis, including a value of zero.
43Learning methods: boosting rules
- Weak learner to find weak hypothesis t
- Split data into Growing and Pruning sets
- Let R_t be an empty conjunction
- Greedily add conditions to R_t guided by Growing set
- Greedily remove conditions from R_t guided by Pruning set
- Convert to weak hypothesis
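The last step, converting a rule into a weak hypothesis, can be illustrated with a confidence-rated rule that predicts a real value on the examples it covers and zero (abstains) elsewhere. The slides do not give the formula; the smoothed log-ratio below is one standard choice from the confidence-rated boosting literature and is used here as an assumption, as are the function names.

```python
import math
from typing import List


def rule_confidence(weights: List[float], labels: List[int], covered: List[bool]) -> float:
    """Confidence of a rule: half the log-ratio of covered positive to negative weight."""
    eps = 1.0 / (2 * len(weights))  # smoothing, avoids division by zero
    w_pos = sum(w for w, y, c in zip(weights, labels, covered) if c and y == +1)
    w_neg = sum(w for w, y, c in zip(weights, labels, covered) if c and y == -1)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


def reweight(weights: List[float], labels: List[int], covered: List[bool], conf: float) -> List[float]:
    """AdaBoost-style update; uncovered examples see a prediction of zero."""
    new = [w * math.exp(-y * conf) if c else w for w, y, c in zip(weights, labels, covered)]
    total = sum(new)
    return [w / total for w in new]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
covered = [True, True, True, False]  # the rule fires on the first three examples
conf = rule_confidence(weights, labels, covered)
new_weights = reweight(weights, labels, covered, conf)
```

After the update, the covered negative example carries more weight than the covered positives, so the next round's rule is pushed toward the mistake this rule made.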
44Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
45Learning methods: BWI
- Boosted wrapper induction (BWI) learns to extract substrings from a document.
- Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
- Conditions are tests on tokens before/after x
- E.g., tok_{i-2} = from, isNumber(tok_{i+1})
- SLIPPER weak learner, no pruning.
- Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size.
- Good results in (Kushmerick and Freitag, 2000)
46Learning methods: ABWI
- Almost boosted wrapper induction (ABWI) learns to extract substrings
- Learns to filter candidate substrings (HANDCODE-2)
- Conditions are the same tests on tokens near x
- E.g., tok_{i-2} = from, isNumber(tok_{i+1})
- SLIPPER weak learner, no pruning.
- Greedy search extends window size any amount, uses no lookahead, has fixed limit on window size.
- Optimal window sizes for this problem seem to be small
47Learning methods
Method      Precision  Recall  F1
HC-1        98.5       45.6    62.3
HC-2f       89.0       54.8    67.8
HC-2        74.5       98.0    84.6
ABWI (W=2)  89.7       91.0    90.3
- Features: W tokens before/after, all tokens inside.
- Learner: 100 rounds of boosting conjunctions of feature tests
- Inspired by BWI (Freitag & Kushmerick)
- Implemented with SLIPPER learner
48Other learning methods
Method        Precision  Recall  F1
HC-1          98.5       45.6    62.3
HC-2f         89.0       54.8    67.8
HC-2          74.5       98.0    84.6
ABWI (W=2)    89.7       91.0    90.3
ABWI Slipper  96.1       85.2    90.3
ABWI Ripper   88.1       87.1    87.6
ABWI SVM1     69.0       78.0    73.2
ABWI SVM2     100.0      75.2    85.6
All learning methods are competitive with
hand-coded methods
49Additional features
- Check if candidate contains certain special substrings
- Matches color name: labeled "color"
- Matches HANDCODE-1 pattern: "handcode1"
- Matches "mm", "mg", etc.: "measure"
- Matches "1980", ..., "2003", "et al": "citation"
- Matches "top", "left", etc.: "place"
- Added sentence boundary substrings
- Feature is distance to boundary.
50Learning with expanded feature set
Method      Precision  Recall  F1
HC-1        98.5       45.6    62.3
HC-2f       89.0       54.8    67.8
HC-2        74.5       98.0    84.6
ABWI (W=2)  89.7       91.0    90.3
ABWI NA     85.9       92.2    89.0
Many new features are inversely correlated with
class (e.g. citation), but ABWI looks only for
positively-correlated patterns.
51Learning with expanded feature set
Method      Precision  Recall  F1
HC-1        98.5       45.6    62.3
HC-2f       89.0       54.8    67.8
HC-2        74.5       98.0    84.6
ABWI (W=2)  89.7       91.0    90.3
ABWI NA     85.9       92.2    89.0
SABWI NA    88.6       93.8    91.1
SABWI is a symmetric version of ABWI: it can use
rules and/or conditions negatively or positively
correlated with the class.
53Task
- Identify image pointers in captions.
- Classify image pointers
- bullet-style, citation-style, or NP-style
- Combine these to get a four-class problem
- bullet-style, citation-style, NP-style, or other
- no hand-coded baseline methods
54Four-class extraction results
Error rate by window size W:
Method    W=2   W=3   W=5
ABWI      24.6  27.5  26.7
ABWI NA   26.7  22.2  26.7
SABWI NA  24.2  18.2  22.6
55Further improvement is probable with additional
labeled data