Title: Text Categorization and Images
1. Text Categorization and Images
Thesis Defense for Carl Sable. Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang
2. Text Categorization
- Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.
- Idea: TC techniques can be applied to image captions or articles to label the corresponding images.
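To make the idea concrete, here is a minimal bag-of-words Naïve Bayes caption classifier, one of the standard TC techniques discussed later. The two training captions are adapted from the Indoor/Outdoor examples on the next slide; everything else (counts, smoothing, the toy test caption) is illustrative, not the thesis system itself.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (words, label) pairs."""
    word_counts = defaultdict(Counter)   # per-category word counts
    cat_counts = Counter()               # documents per category
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        cat_counts[label] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(words, word_counts, cat_counts, vocab):
    """Pick the category with the highest log posterior, add-one smoothed."""
    total_docs = sum(cat_counts.values())
    best, best_score = None, float("-inf")
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / total_docs)
        for w in words:
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

# Toy training data adapted from the captions on slide 3.
train = [
    ("leaders begin their first official meeting in the library".split(), "Indoor"),
    ("the engines of a passenger train lie in the mud at the edge of a marsh".split(), "Outdoor"),
]
model = train_nb(train)
```

With this model, a caption like "a meeting in the library" is labeled Indoor, purely from its words.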
3. Clues for Indoor versus Outdoor: Text (as opposed to visual image features)
Denver Summit of Eight leaders begin their first
official meeting in the Denver Public Library,
June 21.
The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.
4. Contributions
- General
  - An in-depth exploration of the categorization of images based on associated text
  - Incorporating research into Newsblaster
- Novel machine learning (ML) techniques
  - The creation of two novel TC approaches
  - The combination of high-precision/low-recall rules with other systems
- Novel representation
  - The integration of NLP and IR
  - The use of low-level image features
5. Framework
- Collection of Experiments
- Various tasks
- Multiple techniques
- No clear winner for all tasks
- Characteristics of tasks often dictate which techniques work best
- No Free Lunch
6. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
7. Corpus
- Raw data
  - Postings from news-related Usenet newsgroups
  - Over 2000 include embedded captioned images
- Data sets
  - Multiple sets of categories representing various levels of abstraction
  - Mutually exclusive and exhaustive categories
8. Indoor / Outdoor
9. Events Categories
Politics
Struggle
Disaster
Crime
Other
10. Subcategories for Disaster Images
Category F1
Politics 89
Struggle 88
Disaster 97
Crime 90
Other 59
11. Collect Labels to Train Systems
12. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
13. Two Novel ML Approaches
- Density estimation
  - Applied to the results of some other system
  - Often improves performance
  - Always provides probabilistic confidence measures for predictions
- BINS
  - Uses binning to estimate accurate term weights for words with scarce evidence
  - Extremely competitive for two data sets in my corpus
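One simple way to turn another system's raw scores into probabilistic confidence measures is to estimate a score density per predicted category on held-out data and apply Bayes' rule. This sketch uses Gaussian densities and invented score values for illustration; it is not necessarily the exact estimator used in the thesis.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density; sigma floored to avoid division by zero."""
    sigma = max(sigma, 1e-6)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit(scores):
    """Mean and (population) standard deviation of a list of scores."""
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    return mu, math.sqrt(var)

def confidence(score, params_by_cat, priors):
    """P(category | score) via Bayes over per-category score densities."""
    joint = {c: priors[c] * gaussian_pdf(score, *params_by_cat[c]) for c in priors}
    z = sum(joint.values())
    return {c: j / z for c, j in joint.items()}

# Hypothetical held-out scores a base classifier produced for documents of
# each category (illustrative numbers, not corpus data).
params = {"Indoor": fit([2.1, 1.8, 2.4]), "Outdoor": fit([-1.0, -0.5, -1.5])}
priors = {"Indoor": 0.5, "Outdoor": 0.5}
conf = confidence(2.0, params, priors)
```

A new score of 2.0 then maps to a proper probability distribution over categories rather than an uncalibrated number.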
14. BINS System: Naïve Bayes Smoothing
- Binning based on smoothing in the speech recognition literature
- Not enough training data to estimate term weights for words with scarce evidence
- Words with similar statistical features are grouped into a common bin
- Estimate a single weight for each bin
- This weight is assigned to all words in the bin
- Credible estimates even for small (or zero) counts
15. Binning Uses Statistical Features of Words
Intuition        Word        Indoor Category Count  Outdoor Category Count  Quantized IDF
Clearly Indoor   conference  14                     1                       4
Clearly Indoor   bed         1                      0                       8
Clearly Outdoor  plane       0                      9                       5
Clearly Outdoor  earthquake  0                      4                       6
Unclear          speech      2                      2                       6
Unclear          ceremony    3                      8                       5
16. Lambdas = Weights
- First half of training set: Assign words to bins
- Second half of training set: Estimate term weights
17. Binning → Credible Log-Likelihood Ratios
Intuition        Word        λIndoor − λOutdoor  Indoor Category Count  Outdoor Category Count  Quantized IDF
Clearly Indoor   conference   4.84               14                     1                       4
Clearly Indoor   bed          1.35               1                      0                       8
Clearly Outdoor  plane       -2.01               0                      9                       5
Clearly Outdoor  earthquake  -1.00               0                      4                       6
Unclear          speech       0.84               2                      2                       6
Unclear          ceremony    -0.50               3                      8                       5
18. BINS: Robust Version of Naïve Bayes
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
19. Combining Bin Weights and Naïve Bayes Weights
- Idea
  - It might be better to use the Naïve Bayes weight when there is enough evidence for a word
  - Back off to the bin weight otherwise
- BINS allows combinations of weights to be used based on the level of evidence
- How can we automatically determine when to use which weights?
  - Entropy
  - Minimum Squared Error (MSE)
20. Can Provide File to BINS that Specifies How to Combine Weights
- Use only the bin weight for evidence of 0
- Average the bin weight and the NB weight for evidence of 1
- Use only the NB weight for evidence of 2 or more
- Combination fractions can be based on entropy or on MSE
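The combination file above amounts to an evidence-indexed interpolation schedule, which can be sketched as:

```python
def combined_weight(nb_weight, bin_weight, evidence, schedule):
    """Interpolate between the bin weight and the Naive Bayes weight by
    how much evidence (training count) the word has. `schedule` maps an
    evidence level to the fraction of the NB weight to use; evidence
    levels beyond the schedule reuse its last entry."""
    frac = schedule[min(evidence, len(schedule) - 1)]
    return frac * nb_weight + (1 - frac) * bin_weight

# Mirrors the example schedule on the slide: pure bin weight at evidence 0,
# an even average at evidence 1, pure NB weight at evidence 2 or more.
schedule = [0.0, 0.5, 1.0]
```

A schedule derived from entropy or MSE would simply supply different fractions; the mechanism is the same.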
21. Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
22. BINS Performs the Best of All Systems Tested
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
23. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
24. Disaster Image Categories
Affected People
Workers Responding
Other
Wreckage
25. Performance of Standard Systems: Not Very Satisfying
26. Ambiguity for Disaster Images: Workers Responding vs. Affected People
Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.
27. Summary of Observations About Task
Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
- Need to distinguish foreground from background, determine focus of image
- Not all words are important; some are misleading
- Hypothesis: the main subject and verb are particularly useful for this task
- Problematic for bag-of-words approaches
- Need linguistic analysis to determine predicate-argument relationships
28. Experiments with Human Subjects, 4 Conditions: Test Hypothesis that Subject and Verb are Useful Clues
SENT  First sentence of caption: Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
RAND  All words from first sentence in random order: At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire
IDF   Top two TFIDF words: disco rescuers
S-V   Subject and verb: subject rescuers, verb carry
29. Experiments with Human Subjects, Results: Hypothesis that Subject and Verb are Useful Clues
- More words are better than fewer words
  - SENT, RAND > S-V, IDF
- Syntax is important
  - SENT > RAND; S-V > IDF
Condition  Average Time (in seconds)
RAND       68
SENT       34
IDF        22
S-V        20
30. Using Just Two Words (S-V): Almost as Good as All the Words (Bag of Words)
31. Operational NLP Based System
- Extract subjects and verbs from all documents in training set
  - Subjects: 83.9%
  - Verbs: 80.6%
- For each test document
  - Extract subject and verb
  - Compare to those from training set using some method of word-to-word similarity
  - Based on similarities, generate a score for every category
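The scoring step can be sketched as follows. The subject-verb pairs and the exact-match similarity below are illustrative stand-ins; the thesis evaluates several real word-to-word similarity measures, and any of them can be dropped in for `similarity`.

```python
from collections import defaultdict

def score_categories(test_sv, training_svs, similarity):
    """Score each category by the similarity of the test document's
    (subject, verb) pair to the pairs extracted from training documents.
    `similarity` is any word-to-word measure returning a number."""
    scores = defaultdict(float)
    for subj, verb, category in training_svs:
        scores[category] += similarity(test_sv[0], subj) + similarity(test_sv[1], verb)
    return dict(scores)

# A trivially simple similarity (exact match) just to exercise the code.
exact = lambda a, b: 1.0 if a == b else 0.0

# Hypothetical training pairs for the Disaster image data set.
train = [("rescuers", "carry", "Workers Responding"),
         ("victim", "weeps", "Affected People")]
scores = score_categories(("rescuers", "carry"), train, exact)
```

The category with the highest accumulated score is the system's prediction.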
32. The NLP Based System Beats All Others by a Considerable Margin
33. Politics Image Categories
Meeting
Civilians
Announcement
Other
Military
Politician Photographed
34. The NLP Based System is in the Middle of the Pack for the Politics Image Data Set
35. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
36. The Original Premise
- For the Disaster image data set, the performance of the NLP based system still leaves room for improvement
  - NLP based system achieves 65% overall accuracy for the Disaster image data set
  - Humans viewing all words in random order achieve about 75%
  - Humans viewing the full first sentence achieve over 90%
- Main subject and verb are particularly important, but sometimes other words might offer good clues
37. Higinio Guereca carries family photos he retrieved from his mobile home, which was destroyed as a tornado moved through the Central Florida community, early December 27.
38. Selected Indicative Words for the Disaster Image Data Set
Word Indicated Category Total Count (x) Proportion (p)
her Affected People 7 1.0
his Affected People 7 0.86
family Affected People 6 0.83
relatives Affected People 6 1.0
rescue Workers Responding 15 1.0
search Workers Responding 9 1.0
similar Other 2 1.0
soldiers Workers Responding 6 1.0
workers Workers Responding 12 1.0
39. High-Precision/Low-Recall Rules
- If a word w that indicates category c occurs in a document d, then assign d to c
- Every selected indicative word has an associated rule of the above form
- Each rule is very accurate but rarely applicable
- If only rules are used
  - most predictions will be correct (hence, high precision)
  - most instances of most categories will remain unlabeled (hence, low recall)
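Rule application is deliberately simple, which is what makes the rules so precise. A minimal sketch, with the word-to-category mapping drawn from the table on slide 38 (the subset chosen here is illustrative):

```python
# Indicative words from the Disaster image data set (a subset of slide 38).
RULES = {
    "rescue": "Workers Responding",
    "search": "Workers Responding",
    "workers": "Workers Responding",
    "family": "Affected People",
    "relatives": "Affected People",
}

def apply_rules(words, rules=RULES):
    """Return a category if any indicative word fires, else None.
    High precision, low recall: most documents get None."""
    for w in words:
        if w in rules:
            return rules[w]
    return None
```

Most documents contain none of the indicative words and come back unlabeled, which is exactly the low-recall behavior described above.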
40. Combining the High-Precision/Low-Recall Rules with Other Systems
- Two-pass approach
  - Conduct a first pass using the indicative words and the high-precision/low-recall rules
  - For documents that are still unlabeled, fall back to some other system
- Compared to the fall-back system
  - If the rules are more accurate for the documents to which they apply, overall accuracy will improve!
- Intended to improve the NLP based system, but easy to test with other systems as well
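The two-pass approach itself is a few lines once the rules and a fall-back classifier exist; both arguments below are placeholders for whatever systems are being combined.

```python
def two_pass(words, rules_fn, fallback_fn):
    """First pass: high-precision/low-recall rules. Second pass: fall
    back to another classifier for documents the rules leave unlabeled."""
    label = rules_fn(words)
    return label if label is not None else fallback_fn(words)
```

Because the fall-back system is only consulted where the rules abstain, any fall-back classifier can be plugged in unchanged, which is what made it easy to test the rules with every system.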
41. The Rules Improve Every Fall-Back System for the Disaster Image Data Set
42. The Rules Improve 7 of 8 Fall-Back Systems for the Politics Image Data Set
43. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
44. Low-Level Image Features
- Collaboration with Paek and Benitez
  - They have provided me with information, pointers to resources, and code
  - I have reimplemented some of their code
- Color histograms
  - Based on entire images or image regions
  - Can be used as input to machine learning approaches (e.g., kNN, SVMs)
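A color histogram reduces an image to a fixed-length feature vector that kNN or an SVM can consume. This sketch quantizes each RGB channel into a few bins (the bin count is an illustrative choice, not necessarily the one used in the thesis) and normalizes so images of different sizes are comparable.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantized (r, g, b) bin and normalize.
    `pixels` is a flat list of (r, g, b) tuples with values 0-255."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]
```

The same function applies to image regions simply by passing the region's pixels, yielding region-based histograms.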
45. Combining Text and Image Features
- Combining systems has had mixed results in the TC literature, but
  - Most attempts have involved systems that use the same features (bag of words)
  - There is little reason to believe that indicative text is correlated with indicative low-level image features
- Most text based systems are beating the image based systems, but
  - Distance from the optimal hyperplane can be used as a confidence measure for a support vector machine
  - Predictions with high confidence may be more accurate than those of text systems
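One natural way to exploit this is to trust the image-based SVM only when its prediction lies far from the separating hyperplane. The selection rule below is a sketch of that idea; the confidence threshold is a tunable assumption, and the thesis's actual combination scheme may differ in detail.

```python
def combine_predictions(text_pred, image_pred, image_confidence, threshold):
    """Trust the image-based SVM only when its distance from the
    hyperplane (confidence) clears the threshold; otherwise keep the
    usually stronger text-based prediction."""
    return image_pred if image_confidence >= threshold else text_pred
```

Because the image system overrides the text system only on its most confident cases, the combination can only help where the image features are genuinely indicative.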
46. The Combination of Text and Image Beats Text Alone: Most systems show small gains, one has a major improvement
47. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
51. Newsblaster
- A pragmatic showcase for NLP
- My contributions
- Extraction of images and captions from web pages
- Image browsing interface
- Categorization of stories (clusters) and images
- Scripts that allow users to suggest labels for
articles with incorrect predictions
52. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
53. Conclusions
- TC techniques can be used to categorize images
- Many methods exist
- No clear winner for all tasks
- BINS is very competitive
- NLP can lead to substantial improvement, at least for certain tasks
- High-precision/low-recall rules are likely to improve performance for tough tasks
- Image features show promise
- Newsblaster demonstrates pragmatic benefits of my work
54. Future Work
- BINS
- Explore additional binning features
- Explore use of unlabeled data
- NLP and TC
- Improve current system
- Explore additional categories
- Image features
- Explore additional low-level image features
- Explore better methods of combining text and image
55. And Now the Questions