1
Text Categorization and Images
Thesis Defense for Carl Sable. Committee: Kathleen
McKeown, Vasileios Hatzivassiloglou, Shree Nayar,
Kenneth W. Church, Shih-Fu Chang
2
Text Categorization
  • Text categorization (TC) refers to the automatic
    labeling of documents into one or more pre-defined
    categories, using the natural language text
    contained in or associated with each document.
  • Idea: TC techniques can be applied to image
    captions or articles to label the corresponding
    images (a minimal sketch follows this slide).

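The idea can be illustrated with a small, hedged sketch: a generic bag-of-words text classifier applied to captions. The captions, labels, and use of scikit-learn below are illustrative assumptions, not the systems evaluated in this defense.

    # Minimal illustration (assumption): label images by classifying their
    # captions with a generic bag-of-words pipeline; not the thesis systems.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_captions = [
        "Leaders begin their first official meeting in the public library.",
        "The engines of a derailed train lie in the mud at the edge of a marsh.",
    ]
    train_labels = ["Indoor", "Outdoor"]

    classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
    classifier.fit(train_captions, train_labels)
    print(classifier.predict(["Rescuers search the river bank after the crash."]))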
3
Clues for Indoor versus Outdoor: Text (as opposed
to visual image features)
Denver Summit of Eight leaders begin their first
official meeting in the Denver Public Library,
June 21.
The two engines of an Amtrak passenger train lie
in the mud at the edge of a marsh after the train,
bound for Boston from Washington, derailed on the
bank of the Hackensack River, just after crossing
a bridge.
4
Contributions
  • General
    • An in-depth exploration of the categorization of
      images based on associated text
    • Incorporating research into Newsblaster
  • Novel machine learning (ML) techniques
    • The creation of two novel TC approaches
    • The combination of high-precision/low-recall
      rules with other systems
  • Novel representation
    • The integration of NLP and IR
    • The use of low-level image features

5
Framework
  • Collection of Experiments
  • Various tasks
  • Multiple techniques
  • No clear winner for all tasks
  • Characteristics of tasks often dictate which
    techniques work best
  • No Free Lunch

6
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

7
Corpus
  • Raw data
    • Postings from news-related Usenet newsgroups
    • Over 2000 include embedded captioned images
  • Data sets
    • Multiple sets of categories representing various
      levels of abstraction
    • Mutually exclusive and exhaustive categories

8
(Example images: Indoor versus Outdoor)
9
Events Categories (example images): Politics,
Struggle, Disaster, Crime, Other
10
Subcategories for Disaster Images
Category | F1
Politics | 89
Struggle | 88
Disaster | 97
Crime | 90
Other | 59
11
Collect Labels to Train Systems
12
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

13
Two Novel ML Approaches
  • Density estimation
    • Applied to the results of some other system
    • Often improves performance
    • Always provides probabilistic confidence measures
      for predictions
  • BINS
    • Uses binning to estimate accurate term weights
      for words with scarce evidence
    • Extremely competitive for two data sets in my
      corpus

14
BINS System: Naïve Bayes Smoothing
  • Binning based on smoothing in the speech
    recognition literature
  • Not enough training data to estimate term weights
    for words with scarce evidence
  • Words with similar statistical features are
    grouped into a common bin
  • Estimate a single weight for each bin
  • This weight is assigned to all words in the bin
  • Credible estimates even for small (or zero) counts

15
Binning Uses Statistical Features of Words
Intuition | Word | Indoor Category Count | Outdoor Category Count | Quantized IDF
Clearly Indoor | conference | 14 | 1 | 4
Clearly Indoor | bed | 1 | 0 | 8
Clearly Outdoor | plane | 0 | 9 | 5
Clearly Outdoor | earthquake | 0 | 4 | 6
Unclear | speech | 2 | 2 | 6
Unclear | ceremony | 3 | 8 | 5
16
Lambdas (Weights)
  • First half of training set: Assign words to bins
  • Second half of training set: Estimate term
    weights (a sketch of this procedure follows)

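A minimal sketch of the binning idea, under assumptions: words are assigned to bins by their per-category counts and quantized IDF computed on the first half of the training data, and a single smoothed log weight per bin and category is then estimated from the pooled counts of the second half. The exact features, caps, and smoothing used by BINS may differ.

    import math
    from collections import Counter, defaultdict

    def quantized_idf(word, docs):
        """Quantized inverse document frequency: floor(log2(N / df))."""
        df = sum(1 for d in docs if word in d)
        return int(math.log2(len(docs) / df)) if df else 0

    def estimate_bin_weights(first_half, second_half, categories):
        """first_half / second_half: lists of (text, category) pairs.
        Returns a word -> bin mapping and a per-bin, per-category log weight."""
        # Features from the first half: per-category counts and quantized IDF.
        docs = [set(text.lower().split()) for text, _ in first_half]
        counts1 = {c: Counter() for c in categories}
        for text, cat in first_half:
            counts1[cat].update(text.lower().split())
        counts2 = {c: Counter() for c in categories}
        for text, cat in second_half:
            counts2[cat].update(text.lower().split())

        vocab = {w for c in categories for half in (counts1, counts2) for w in half[c]}
        word_bin = {}
        pooled = defaultdict(lambda: {c: 1.0 for c in categories})  # add-one smoothing
        for w in vocab:
            # Cap counts so rare count patterns share bins (an illustrative choice).
            key = (quantized_idf(w, docs),
                   tuple(min(counts1[c][w], 5) for c in categories))
            word_bin[w] = key
            for c in categories:
                pooled[key][c] += counts2[c][w]

        # One weight per bin: log of the bin's (smoothed) share in each category.
        lambdas = {b: {c: math.log(cc[c] / sum(cc.values())) for c in cc}
                   for b, cc in pooled.items()}
        return word_bin, lambdas

A word's weight for a category is then the lambda of its bin, so differences such as λIndoor − λOutdoor behave like the log likelihood ratios shown on the next slide, even for words with zero counts.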
17
Binning → Credible Log Likelihood Ratios
Intuition | Word | λIndoor − λOutdoor | Indoor Category Count | Outdoor Category Count | Quantized IDF
Clearly Indoor | conference | 4.84 | 14 | 1 | 4
Clearly Indoor | bed | 1.35 | 1 | 0 | 8
Clearly Outdoor | plane | -2.01 | 0 | 9 | 5
Clearly Outdoor | earthquake | -1.00 | 0 | 4 | 6
Unclear | speech | 0.84 | 2 | 2 | 6
Unclear | ceremony | -0.50 | 3 | 8 | 5
18
BINS: Robust Version of Naïve Bayes
(Results charts for two data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
19
Combining Bin Weights and Naïve Bayes Weights
  • Idea:
    • It might be better to use the Naïve Bayes weight
      when there is enough evidence for a word
    • Back off to the bin weight otherwise
  • BINS allows combinations of weights to be used
    based on the level of evidence
  • How can we automatically determine when to use
    which weights?
    • Entropy
    • Minimum Squared Error (MSE)

20
Can Provide a File to BINS that Specifies How to
Combine Weights
  • Two example weighting files, one based on Entropy
    and one on MSE: "0 0.25 0.5 0.75 1" and "0 0.5 1"
  • Use only the bin weight for evidence of 0
  • Average the bin weight and NB weight for evidence of 1
  • Use only the NB weight for evidence of 2 or more
    (a sketch of this interpretation follows this slide)
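A sketch of how such a file might be read and applied; the file format (whitespace-separated mixture weights indexed by evidence count) is an assumption for illustration.

    def load_mixture_weights(path):
        """Read whitespace-separated mixture weights, e.g. "0 0.5 1".
        Entry i is the weight on the Naive Bayes estimate for a word with
        i units of evidence; the last entry covers all larger counts."""
        with open(path) as f:
            return [float(x) for x in f.read().split()]

    def combined_weight(nb_weight, bin_weight, evidence, mixture):
        """Interpolate between the bin weight and the Naive Bayes weight."""
        alpha = mixture[min(evidence, len(mixture) - 1)]
        return alpha * nb_weight + (1 - alpha) * bin_weight

    # With "0 0.5 1": evidence 0 -> bin weight only, evidence 1 -> the average,
    # evidence >= 2 -> Naive Bayes weight only.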
21
Appropriately Combining the Bin Weight and the
Naïve Bayes Weight Leads to the Best Performance
Yet
(Results charts for two data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
22
BINS Performs the Best of All Systems Tested
(Results charts for two data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
23
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

24
Disaster Image Categories (example images): Affected
People, Workers Responding, Wreckage, Other
25
Performance of Standard Systems: Not Very
Satisfying
26
Ambiguity for Disaster Images: Workers Responding
vs. Affected People
Philippine rescuers carry a fire victim March 19
who perished in a blaze at a Manila disco.
Hypothetical alternative caption: A fire victim
who perished in a blaze at a Manila disco is
carried by Philippine rescuers March 19.
27
Summary of Observations About Task
Philippine rescuers carry a fire victim March 19
who perished in a blaze at a Manila disco.
  • Need to distinguish foreground from background,
    determine focus of image
  • Not all words are important; some are misleading
  • Hypothesis: the main subject and verb are
    particularly useful for this task
  • Problematic for bag-of-words approaches
  • Need linguistic analysis to determine
    predicate-argument relationships

28
Experiments with Human Subjects, 4 Conditions:
Test Hypothesis that Subject and Verb are Useful Clues
Condition | Description | Example
SENT | First sentence of caption | Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
RAND | All words from first sentence in random order | At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire
IDF | Top two TFIDF words | disco rescuers
S-V | Subject and verb | subject: rescuers, verb: carry
29
Experiments with Human Subjects, Results:
Hypothesis that Subject and Verb are Useful Clues
  • More words are better than fewer words
    • SENT, RAND > S-V, IDF
  • Syntax is important
    • SENT > RAND; S-V > IDF

Condition | Average Time (seconds)
RAND | 68
SENT | 34
IDF | 22
S-V | 20
30
Using Just Two Words (S-V): Almost as Good as All
the Words (Bag of Words)
31
Operational NLP Based System
  • Extract subjects and verbs from all documents in
    training set

Subjects: 83.9%   Verbs: 80.6%
  • For each test document:
    • Extract subject and verb
    • Compare to those from training set using some
      method of word-to-word similarity
    • Based on similarities, generate a score for every
      category (a rough sketch follows this slide)

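A rough sketch of the scoring step, under assumptions: subject/verb pairs are taken as given, the training pairs shown are hypothetical, and the word-to-word similarity here is a trivial exact-match placeholder where the real system could use a richer similarity measure.

    from collections import defaultdict

    def word_similarity(a, b):
        # Placeholder similarity; the actual system's word-to-word
        # similarity measure would be more informative than exact match.
        return 1.0 if a == b else 0.0

    def score_categories(test_sv, training_svs):
        """test_sv: (subject, verb) of the test caption.
        training_svs: list of ((subject, verb), category) pairs from training."""
        scores = defaultdict(float)
        for (subj, verb), category in training_svs:
            scores[category] += word_similarity(test_sv[0], subj)
            scores[category] += word_similarity(test_sv[1], verb)
        return max(scores, key=scores.get) if scores else None

    # Example: "Philippine rescuers carry a fire victim ..." reduces to
    # (subject="rescuers", verb="carry") before scoring.
    training = [(("rescuers", "carry"), "Workers Responding"),
                (("victim", "weeps"), "Affected People")]
    print(score_categories(("rescuers", "search"), training))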
32
The NLP Based System Beats All Others by a
Considerable Margin
33
Politics Image Categories (example images): Meeting,
Announcement, Military, Civilians, Politician
Photographed, Other
34
The NLP Based System is in the Middle of the Pack
for the Politics Image Data Set
35
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

36
The Original Premise
  • For the Disaster image data set, the performance
    of the NLP based system still leaves room for
    improvement
  • NLP based system achieves 65% overall accuracy
    for the Disaster image data set
  • Humans viewing all words in random order achieve
    about 75%
  • Humans viewing full first sentence achieve over
    90%
  • Main subject and verb are particularly important,
    but sometimes other words might offer good clues

37
Higinio Guereca carries family photos he
retrieved from his mobile home which was
destroyed as a tornado moved through the Central
Florida community, early December 27.
38
Selected Indicative Words for the Disaster Image
Data Set
Word | Indicated Category | Total Count (x) | Proportion (p)
her | Affected People | 7 | 1.0
his | Affected People | 7 | 0.86
family | Affected People | 6 | 0.83
relatives | Affected People | 6 | 1.0
rescue | Workers Responding | 15 | 1.0
search | Workers Responding | 9 | 1.0
similar | Other | 2 | 1.0
soldiers | Workers Responding | 6 | 1.0
workers | Workers Responding | 12 | 1.0
39
High-Precision/Low-Recall Rules
  • If a word w that indicates category c occurs in a
    document d, then assign d to c
  • Every selected indicative word has an associated
    rule of the above form
  • Each rule is very accurate but rarely applicable
  • If only rules are used
  • most predictions will be correct (hence, high
    precision)
  • most instances of most categories will remain
    unlabeled (hence, low recall)

40
Combining the High-Precision/Low-Recall Rules
with Other Systems
  • Two-pass approach (see the sketch after this list)
    • Conduct a first pass using the indicative words
      and the high-precision/low-recall rules
    • For documents that are still unlabeled, fall back
      to some other system
  • Compared to the fall-back system:
    • If the rules are more accurate for the documents
      to which they apply, overall accuracy will
      improve!
  • Intended to improve the NLP based system, but
    easy to test with other systems as well

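A minimal sketch of the two-pass combination, using the indicative words from the Disaster image table above; the tokenization and first-match tie-breaking are illustrative assumptions, and fallback_classify stands in for any other system (for example, the NLP based system).

    # Indicative words and their categories, taken from the Disaster image table.
    RULES = {
        "her": "Affected People", "his": "Affected People",
        "family": "Affected People", "relatives": "Affected People",
        "rescue": "Workers Responding", "search": "Workers Responding",
        "soldiers": "Workers Responding", "workers": "Workers Responding",
        "similar": "Other",
    }

    def two_pass_classify(caption, fallback_classify):
        """First pass: high-precision/low-recall rules (first matching word wins).
        Second pass: fall back to some other system when no rule applies."""
        for word in caption.lower().split():
            if word in RULES:
                return RULES[word]
        return fallback_classify(caption)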
41
The Rules Improve Every Fall Back System for the
Disaster Image Data Set
42
The Rules Improve 7 of 8 Fall Back Systems for
the Politics Image Data Set
43
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

44
Low-Level Image Features
  • Collaboration with Paek and Benitez
    • They have provided me with information, pointers
      to resources, and code
    • I have reimplemented some of their code
  • Color histograms
    • Based on entire images or image regions
    • Can be used as input to machine learning
      approaches, e.g., kNN, SVMs (see the sketch below)

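A sketch of the feature extraction step, assuming NumPy image arrays and a generic kNN classifier from scikit-learn; the bin count and the whole-image histogram are illustrative choices, not necessarily the exact features used in the collaboration.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def color_histogram(image, bins=8):
        """Flattened, normalized RGB histogram for an H x W x 3 uint8 image.
        Computed over the whole image; the same idea applies to regions."""
        hist, _ = np.histogramdd(image.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        hist = hist.ravel()
        return hist / hist.sum()

    # Hypothetical usage with labeled training images (arrays) and a kNN model:
    # knn = KNeighborsClassifier(n_neighbors=5)
    # knn.fit([color_histogram(img) for img in train_images], train_labels)
    # prediction = knn.predict([color_histogram(test_image)])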
45
Combining Text and Image Features
  • Combining systems has had mixed results in the TC
    literature, but
    • Most attempts have involved systems that use the
      same features (bag of words)
    • There is little reason to believe that indicative
      text is correlated with indicative low-level
      image features
  • Most text-based systems are beating the image-based
    systems, but
    • Distance from the optimal hyperplane can be used
      as a confidence measure for a support vector
      machine (see the sketch below)
    • Predictions with high confidence may be more
      accurate than those of the text systems

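A sketch of one way to use that confidence when combining predictions, assuming a fitted scikit-learn linear SVM over image features; decision_function stands in for distance from the hyperplane, and the threshold and combination rule are illustrative assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    def combine_predictions(image_svm, image_features, text_prediction,
                            confidence_threshold=1.0):
        """Trust the image-based SVM only when its prediction lies far from the
        separating hyperplane; otherwise keep the text-based prediction."""
        features = np.asarray(image_features).reshape(1, -1)
        scores = image_svm.decision_function(features)
        confidence = float(np.max(np.abs(scores)))  # distance-style confidence
        if confidence >= confidence_threshold:
            return image_svm.predict(features)[0]
        return text_prediction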
46
The Combination of Text and Image Beats Text
Alone: Most systems show small gains, one shows a
major improvement
47
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
Newsblaster
  • A pragmatic showcase for NLP
  • My contributions:
    • Extraction of images and captions from web pages
    • Image browsing interface
    • Categorization of stories (clusters) and images
    • Scripts that allow users to suggest labels for
      articles with incorrect predictions

52
Overview
  1. The Main Idea
  2. Description of Corpus
  3. Novel ML Systems
  4. NLP Based System
  5. High-Precision/Low-Recall Rules
  6. Image Features
  7. Newsblaster
  8. Conclusions and Future Work

53
Conclusions
  • TC techniques can be used to categorize images
  • Many methods exist
  • No clear winner for all tasks
  • BINS is very competitive
  • NLP can lead to substantial improvement, at least
    for certain tasks
  • High-precision/low-recall rules are likely to
    improve performance for tough tasks
  • Image features show promise
  • Newsblaster demonstrates pragmatic benefits of my
    work

54
Future Work
  • BINS
    • Explore additional binning features
    • Explore use of unlabeled data
  • NLP and TC
    • Improve current system
    • Explore additional categories
  • Image features
    • Explore additional low-level image features
    • Explore better methods of combining text and image

55
And Now the Questions