Title: Text Categorization and Images
1. Text Categorization and Images
Thesis Defense for Carl Sable. Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang
2. Text Categorization
- Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.
- Idea: TC techniques can be applied to image captions or articles to label the corresponding images.
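To make the idea concrete, here is a minimal bag-of-words Naïve Bayes caption classifier, one of the standard TC techniques discussed later. The two training captions are adapted from the Indoor/Outdoor examples on the next slide; everything else (counts, smoothing, the toy test caption) is illustrative, not the thesis system itself.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (words, label) pairs."""
    word_counts = defaultdict(Counter)   # per-category word counts
    cat_counts = Counter()               # documents per category
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        cat_counts[label] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(words, word_counts, cat_counts, vocab):
    """Pick the category with the highest log posterior, add-one smoothed."""
    total_docs = sum(cat_counts.values())
    best, best_score = None, float("-inf")
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / total_docs)
        for w in words:
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

# Toy training data adapted from the captions on slide 3.
train = [
    ("leaders begin their first official meeting in the library".split(), "Indoor"),
    ("the engines of a passenger train lie in the mud at the edge of a marsh".split(), "Outdoor"),
]
model = train_nb(train)
```

With this model, a caption like "a meeting in the library" is labeled Indoor, purely from its words.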
3. Clues for Indoor versus Outdoor: Text (as opposed to visual image features)
Denver Summit of Eight leaders begin their first
official meeting in the Denver Public Library,
June 21.
The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.
4. Contributions
- General
  - An in-depth exploration of the categorization of images based on associated text
  - Incorporating research into Newsblaster
- Novel machine learning (ML) techniques
  - The creation of two novel TC approaches
  - The combination of high-precision/low-recall rules with other systems
- Novel representation
  - The integration of NLP and IR
  - The use of low-level image features
5. Framework
- Collection of Experiments
- Various tasks
- Multiple techniques
- No clear winner for all tasks
- Characteristics of tasks often dictate which techniques work best
- No Free Lunch
6. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
7. Corpus
- Raw data
  - Postings from news-related Usenet newsgroups
  - Over 2000 include embedded captioned images
- Data sets
  - Multiple sets of categories representing various levels of abstraction
  - Mutually exclusive and exhaustive categories
8. Indoor / Outdoor
9. Events Categories
Politics
Struggle
Disaster
Crime
Other
10. Subcategories for Disaster Images
Category F1
Politics 89
Struggle 88
Disaster 97
Crime 90
Other 59
11. Collect Labels to Train Systems
12. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
13. Two Novel ML Approaches
- Density estimation
  - Applied to the results of some other system
  - Often improves performance
  - Always provides probabilistic confidence measures for predictions
- BINS
  - Uses binning to estimate accurate term weights for words with scarce evidence
  - Extremely competitive for two data sets in my corpus
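One simple way to turn another system's raw scores into probabilistic confidence measures is to estimate a score density per predicted category on held-out data and apply Bayes' rule. This sketch uses Gaussian densities and invented score values for illustration; it is not necessarily the exact estimator used in the thesis.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density; sigma floored to avoid division by zero."""
    sigma = max(sigma, 1e-6)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit(scores):
    """Mean and (population) standard deviation of a list of scores."""
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    return mu, math.sqrt(var)

def confidence(score, params_by_cat, priors):
    """P(category | score) via Bayes over per-category score densities."""
    joint = {c: priors[c] * gaussian_pdf(score, *params_by_cat[c]) for c in priors}
    z = sum(joint.values())
    return {c: j / z for c, j in joint.items()}

# Hypothetical held-out scores a base classifier produced for documents of
# each category (illustrative numbers, not corpus data).
params = {"Indoor": fit([2.1, 1.8, 2.4]), "Outdoor": fit([-1.0, -0.5, -1.5])}
priors = {"Indoor": 0.5, "Outdoor": 0.5}
conf = confidence(2.0, params, priors)
```

A new score of 2.0 then maps to a proper probability distribution over categories rather than an uncalibrated number.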
14. BINS System: Naïve Bayes Smoothing
- Binning based on smoothing in the speech recognition literature
- Not enough training data to estimate term weights for words with scarce evidence
- Words with similar statistical features are grouped into a common bin
- Estimate a single weight for each bin
- This weight is assigned to all words in the bin
- Credible estimates even for small (or zero) counts
15. Binning Uses Statistical Features of Words
Intuition        Word        Indoor Category Count  Outdoor Category Count  Quantized IDF
Clearly Indoor   conference  14                     1                       4
Clearly Indoor   bed         1                      0                       8
Clearly Outdoor  plane       0                      9                       5
Clearly Outdoor  earthquake  0                      4                       6
Unclear          speech      2                      2                       6
Unclear          ceremony    3                      8                       5
16. Lambdas = Weights
- First half of training set: Assign words to bins
- Second half of training set: Estimate term weights
17. Binning → Credible Log-Likelihood Ratios
Intuition        Word        λIndoor − λOutdoor  Indoor Category Count  Outdoor Category Count  Quantized IDF
Clearly Indoor   conference   4.84               14                     1                       4
Clearly Indoor   bed          1.35               1                      0                       8
Clearly Outdoor  plane       -2.01               0                      9                       5
Clearly Outdoor  earthquake  -1.00               0                      4                       6
Unclear          speech       0.84               2                      2                       6
Unclear          ceremony    -0.50               3                      8                       5
18. BINS: Robust Version of Naïve Bayes
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
19. Combining Bin Weights and Naïve Bayes Weights
- Idea
  - It might be better to use the Naïve Bayes weight when there is enough evidence for a word
  - Back off to the bin weight otherwise
- BINS allows combinations of weights to be used based on the level of evidence
- How can we automatically determine when to use which weights?
  - Entropy
  - Minimum Squared Error (MSE)
20. Can Provide File to BINS that Specifies How to Combine Weights
- Use only the bin weight for evidence of 0
- Average the bin weight and the NB weight for evidence of 1
- Use only the NB weight for evidence of 2 or more
- Combination fractions can be based on entropy or on MSE
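The combination file above amounts to an evidence-indexed interpolation schedule, which can be sketched as:

```python
def combined_weight(nb_weight, bin_weight, evidence, schedule):
    """Interpolate between the bin weight and the Naive Bayes weight by
    how much evidence (training count) the word has. `schedule` maps an
    evidence level to the fraction of the NB weight to use; evidence
    levels beyond the schedule reuse its last entry."""
    frac = schedule[min(evidence, len(schedule) - 1)]
    return frac * nb_weight + (1 - frac) * bin_weight

# Mirrors the example schedule on the slide: pure bin weight at evidence 0,
# an even average at evidence 1, pure NB weight at evidence 2 or more.
schedule = [0.0, 0.5, 1.0]
```

A schedule derived from entropy or MSE would simply supply different fractions; the mechanism is the same.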
21. Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
22. BINS Performs the Best of All Systems Tested
Indoor versus Outdoor
Events: Politics, Struggle, Disaster, Crime, Other
23. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
24. Disaster Image Categories
Affected People
Workers Responding
Other
Wreckage
25. Performance of Standard Systems: Not Very Satisfying
26. Ambiguity for Disaster Images: Workers Responding vs. Affected People
Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.
27. Summary of Observations About Task
Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
- Need to distinguish foreground from background, determine focus of image
- Not all words are important; some are misleading
- Hypothesis: the main subject and verb are particularly useful for this task
- Problematic for bag-of-words approaches
- Need linguistic analysis to determine predicate-argument relationships
28. Experiments with Human Subjects, 4 Conditions: Test Hypothesis that Subject and Verb are Useful Clues
SENT  First sentence of caption: Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
RAND  All words from first sentence in random order: At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire
IDF   Top two TFIDF words: disco rescuers
S-V   Subject and verb: subject rescuers, verb carry
29. Experiments with Human Subjects, Results: Hypothesis that Subject and Verb are Useful Clues
- More words are better than fewer words
  - SENT, RAND > S-V, IDF
- Syntax is important
  - SENT > RAND; S-V > IDF
Condition  Average Time (in seconds)
RAND       68
SENT       34
IDF        22
S-V        20
30. Using Just Two Words (S-V): Almost as Good as All the Words (Bag of Words)
31. Operational NLP Based System
- Extract subjects and verbs from all documents in training set
  - Subjects: 83.9%
  - Verbs: 80.6%
- For each test document
  - Extract subject and verb
  - Compare to those from training set using some method of word-to-word similarity
  - Based on similarities, generate a score for every category
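The scoring step can be sketched as follows. The subject-verb pairs and the exact-match similarity below are illustrative stand-ins; the thesis evaluates several real word-to-word similarity measures, and any of them can be dropped in for `similarity`.

```python
from collections import defaultdict

def score_categories(test_sv, training_svs, similarity):
    """Score each category by the similarity of the test document's
    (subject, verb) pair to the pairs extracted from training documents.
    `similarity` is any word-to-word measure returning a number."""
    scores = defaultdict(float)
    for subj, verb, category in training_svs:
        scores[category] += similarity(test_sv[0], subj) + similarity(test_sv[1], verb)
    return dict(scores)

# A trivially simple similarity (exact match) just to exercise the code.
exact = lambda a, b: 1.0 if a == b else 0.0

# Hypothetical training pairs for the Disaster image data set.
train = [("rescuers", "carry", "Workers Responding"),
         ("victim", "weeps", "Affected People")]
scores = score_categories(("rescuers", "carry"), train, exact)
```

The category with the highest accumulated score is the system's prediction.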
32. The NLP Based System Beats All Others by a Considerable Margin
33. Politics Image Categories
Meeting
Civilians
Announcement
Other
Military
Politician Photographed
34. The NLP Based System is in the Middle of the Pack for the Politics Image Data Set
35. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
36. The Original Premise
- For the Disaster image data set, the performance of the NLP based system still leaves room for improvement
  - NLP based system achieves 65% overall accuracy for the Disaster image data set
  - Humans viewing all words in random order achieve about 75%
  - Humans viewing the full first sentence achieve over 90%
- Main subject and verb are particularly important, but sometimes other words might offer good clues
37. Higinio Guereca carries family photos he retrieved from his mobile home, which was destroyed as a tornado moved through the Central Florida community, early December 27.
38. Selected Indicative Words for the Disaster Image Data Set
Word Indicated Category Total Count (x) Proportion (p)
her Affected People 7 1.0
his Affected People 7 0.86
family Affected People 6 0.83
relatives Affected People 6 1.0
rescue Workers Responding 15 1.0
search Workers Responding 9 1.0
similar Other 2 1.0
soldiers Workers Responding 6 1.0
workers Workers Responding 12 1.0
39. High-Precision/Low-Recall Rules
- If a word w that indicates category c occurs in a document d, then assign d to c
- Every selected indicative word has an associated rule of the above form
- Each rule is very accurate but rarely applicable
- If only rules are used
  - most predictions will be correct (hence, high precision)
  - most instances of most categories will remain unlabeled (hence, low recall)
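Rule application is deliberately simple, which is what makes the rules so precise. A minimal sketch, with the word-to-category mapping drawn from the table on slide 38 (the subset chosen here is illustrative):

```python
# Indicative words from the Disaster image data set (a subset of slide 38).
RULES = {
    "rescue": "Workers Responding",
    "search": "Workers Responding",
    "workers": "Workers Responding",
    "family": "Affected People",
    "relatives": "Affected People",
}

def apply_rules(words, rules=RULES):
    """Return a category if any indicative word fires, else None.
    High precision, low recall: most documents get None."""
    for w in words:
        if w in rules:
            return rules[w]
    return None
```

Most documents contain none of the indicative words and come back unlabeled, which is exactly the low-recall behavior described above.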
40. Combining the High-Precision/Low-Recall Rules with Other Systems
- Two-pass approach
  - Conduct a first pass using the indicative words and the high-precision/low-recall rules
  - For documents that are still unlabeled, fall back to some other system
- Compared to the fall-back system
  - If the rules are more accurate for the documents to which they apply, overall accuracy will improve!
- Intended to improve the NLP based system, but easy to test with other systems as well
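The two-pass approach itself is a few lines once the rules and a fall-back classifier exist; both arguments below are placeholders for whatever systems are being combined.

```python
def two_pass(words, rules_fn, fallback_fn):
    """First pass: high-precision/low-recall rules. Second pass: fall
    back to another classifier for documents the rules leave unlabeled."""
    label = rules_fn(words)
    return label if label is not None else fallback_fn(words)
```

Because the fall-back system is only consulted where the rules abstain, any fall-back classifier can be plugged in unchanged, which is what made it easy to test the rules with every system.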
41. The Rules Improve Every Fall-Back System for the Disaster Image Data Set
42. The Rules Improve 7 of 8 Fall-Back Systems for the Politics Image Data Set
43. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
44. Low-Level Image Features
- Collaboration with Paek and Benitez
  - They have provided me with information, pointers to resources, and code
  - I have reimplemented some of their code
- Color histograms
  - Based on entire images or image regions
  - Can be used as input to machine learning approaches (e.g., kNN, SVMs)
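A color histogram reduces an image to a fixed-length feature vector that kNN or an SVM can consume. This sketch quantizes each RGB channel into a few bins (the bin count is an illustrative choice, not necessarily the one used in the thesis) and normalizes so images of different sizes are comparable.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantized (r, g, b) bin and normalize.
    `pixels` is a flat list of (r, g, b) tuples with values 0-255."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]
```

The same function applies to image regions simply by passing the region's pixels, yielding region-based histograms.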
45. Combining Text and Image Features
- Combining systems has had mixed results in the TC literature, but
  - Most attempts have involved systems that use the same features (bag of words)
  - There is little reason to believe that indicative text is correlated with indicative low-level image features
- Most text based systems are beating the image based systems, but
  - Distance from the optimal hyperplane can be used as a confidence measure for a support vector machine
  - Predictions with high confidence may be more accurate than those of text systems
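One natural way to exploit this is to trust the image-based SVM only when its prediction lies far from the separating hyperplane. The selection rule below is a sketch of that idea; the confidence threshold is a tunable assumption, and the thesis's actual combination scheme may differ in detail.

```python
def combine_predictions(text_pred, image_pred, image_confidence, threshold):
    """Trust the image-based SVM only when its distance from the
    hyperplane (confidence) clears the threshold; otherwise keep the
    usually stronger text-based prediction."""
    return image_pred if image_confidence >= threshold else text_pred
```

Because the image system overrides the text system only on its most confident cases, the combination can only help where the image features are genuinely indicative.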
46. The Combination of Text and Image Beats Text Alone: Most systems show small gains, one has a major improvement
47. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
51. Newsblaster
- A pragmatic showcase for NLP
- My contributions
- Extraction of images and captions from web pages
- Image browsing interface
- Categorization of stories (clusters) and images
- Scripts that allow users to suggest labels for
articles with incorrect predictions
52. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
53. Conclusions
- TC techniques can be used to categorize images
- Many methods exist
- No clear winner for all tasks
- BINS is very competitive
- NLP can lead to substantial improvement, at least for certain tasks
- High-precision/low-recall rules are likely to improve performance for tough tasks
- Image features show promise
- Newsblaster demonstrates pragmatic benefits of my work
54. Future Work
- BINS
- Explore additional binning features
- Explore use of unlabeled data
- NLP and TC
- Improve current system
- Explore additional categories
- Image features
- Explore additional low-level image features
- Explore better methods of combining text and image
55. And Now the Questions