Title: Learning shared representations for object recognition
1. Learning shared representations for object recognition
Antonio Torralba, CSAIL, Massachusetts Institute of Technology
In collaboration with Erik Sudderth, Kevin Murphy, William Freeman, Aude Oliva
2. Collaborators
Erik Sudderth, Berkeley
Kevin Murphy, UBC
Aude Oliva, MIT
William Freeman, MIT
3. Standard approach for object detection
Object detection and recognition is formulated as a classification problem. The image is partitioned into a set of overlapping windows, and a decision is made at each window about whether it contains a target object or not.
[Figure: bag of image patches and the decision boundary. Where are the screens?]
4. Face detection
- Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
- Graded Learning for Object Detection - Fleuret, Geman (1999)
- Robust Real-time Object Detection - Viola, Jones (2001)
- Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
5. The single-class age
6. Multiclass object detection
Using a set of independent binary classifiers is the dominant strategy:
- Viola-Jones extension for dealing with rotations: two cascades for each view
- Schneiderman-Kanade multiclass object detection: one detector for each class
7. Object detection and the head-in-the-coffee-beans problem
8. Head in the coffee beans problem
Can you find the head in this image?
9. Head in the coffee beans problem
Can you find the head in this image?
10. Symptoms of local detectors
False alarms occur in image regions where it is impossible for the target to be present.
11. Failure modes for object presence detection
Low probability of keyboard presence
High probability of keyboard presence
12. The system does not care about the scene, but we do
We know there is a keyboard present in this scene even if we cannot see it clearly.
13. Some symptoms of one-vs-all multiclass approaches
What is the best representation to detect a traffic sign?
A very regular object: template matching will do the job.
Some of these parts cannot be used for anything other than this object.
14. Some symptoms of one-vs-all multiclass approaches
Part-based object representation (looking for meaningful parts)
- M. Weber, M. Welling and P. Perona
These studies try to recover parts that are meaningful. But is this the right thing to do? The derived parts may be too specific, and they are not likely to be useful in a general system.
15. Some symptoms of one-vs-all multiclass approaches
Computational cost grows linearly with Nclasses x Nviews x Nstyles.
16. Green pastures for research in multiclass object detection
- Transfer knowledge between objects
- More efficient representations
- Better generalization
- Discovering commonalities
- Context among objects
- More efficient search
- Robust systems
- Scene understanding
17. Links to datasets
The following tables summarize some of the available datasets for training and testing object detection and recognition algorithms. These lists are far from exhaustive.

Databases for object localization
Dataset | URL | Annotation | Content
CMU/MIT frontal faces | vasc.ri.cmu.edu/idb/html/face/frontal_images ; cbcl.mit.edu/software-datasets/FaceData2.html | Patches | Frontal faces
Graz-02 Database | www.emt.tugraz.at/pinz/data/GRAZ_02/ | Segmentation masks | Bikes, cars, people
UIUC Image Database | l2r.cs.uiuc.edu/cogcomp/Data/Car/ | Bounding boxes | Cars
TU Darmstadt Database | www.vision.ethz.ch/leibe/data/ | Segmentation masks | Motorbikes, cars, cows
LabelMe dataset | people.csail.mit.edu/brussell/research/LabelMe/intro.html | Polygonal boundaries | >500 categories

Databases for object recognition
Dataset | URL | Annotation | Content
Caltech 101 | www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html | Segmentation masks | 101 categories
COIL-100 | www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html | Patches | 100 instances
NORB | www.cs.nyu.edu/ylclab/data/norb-v1.0/ | Bounding boxes | 50 toys

On-line annotation tools
Tool | URL | Annotation | Content
ESP game | www.espgame.org | Global image descriptions | Web images
LabelMe | people.csail.mit.edu/brussell/research/LabelMe/intro.html | Polygonal boundaries | High-resolution images

Collections
Collection | URL | Annotation | Content
PASCAL | http://www.pascal-network.org/challenges/VOC/ | Segmentation, boxes | Various
18. Bryan Russell, Antonio Torralba, Bill Freeman
Google search: LabelMe
19. LabelMe screenshot
20. Some stats
21. Online resources
http://people.csail.mit.edu/torralba/iccv2005/
22. What do we do with many classes?
Styles, lighting conditions, etc.
We need to detect Nclasses x Nviews x Nstyles, in clutter, with lots of variability within classes and across viewpoints.
23. Shared features
- Is learning the 1000th object class easier than learning the first?
- Can we transfer knowledge from one object to another?
- Are the shared properties interesting by themselves?
24. Multitask learning
R. Caruana. Multitask Learning. Machine Learning, 1997.
"MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks. It does this by training tasks in parallel while using a shared representation."
vs.
Sejnowski & Rosenberg 1986; Hinton 1986; Le Cun et al. 1989; Suddarth & Kergosien 1990; Pratt et al. 1991; Sharkey & Sharkey 1992
25. Multitask learning
R. Caruana. Multitask Learning. Machine Learning, 1997.
Primary task: detect doorknobs
Tasks used:
- horizontal location of right door jamb
- width of left door jamb
- width of right door jamb
- horizontal location of left edge of door
- horizontal location of right edge of door
- horizontal location of doorknob
- single or double door
- horizontal location of doorway center
- width of doorway
- horizontal location of left door jamb
26. Sharing invariances
S. Thrun. Is Learning the n-th Thing Any Easier Than Learning the First? NIPS 1996.
Knowledge is transferred between tasks via a learned model of the invariances of the domain: object recognition is invariant to rotation, translation, scaling, lighting, etc. These invariances are common to all object recognition tasks.
[Figure: toy world results, with sharing vs. without sharing.]
27. Sharing transformations
Miller, E., Matsakis, N., and Viola, P. (2000). Learning from one example through shared densities on transforms. In IEEE Computer Vision and Pattern Recognition.
Transformations are shared and can be learned from other tasks.
28. Models of object recognition
I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 1987.
M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
T. Serre, L. Wolf and T. Poggio. Object recognition with features inspired by visual cortex. CVPR 2005.
29. Sharing in constellation models
Pictorial Structures: Fischler & Elschlager, IEEE Trans. Comp. 1973
SVM Detectors: Heisele, Poggio, et al., NIPS 2001
Constellation Model: Fergus, Perona, Zisserman, CVPR 2003
Model-Guided Segmentation: Mori, Ren, Efros, Malik, CVPR 2004
30. Variational EM
Random initialization
Fei-Fei, Fergus, Perona, ICCV 2003
(Attias, Hinton, Beal, etc.)
Slide from Fei-Fei Li
31. Grand piano
Slide from Fei-Fei Li
32. Reusable parts
Krempp, Geman, Amit. Sequential Learning of Reusable Parts for Object Detection. TR 2002.
Goal: look for a vocabulary of edges that reduces the number of features.
[Figure: examples of reused parts; number of features vs. number of classes.]
33. Sharing patches
For a new class, use only features similar to features that were good for other classes.
Proposed dog features
34. Additive models and boosting
- Independent binary classifiers (class 1, class 2, class 3)
- Binary classifiers that share features (class 1, class 2, class 3)
35. Boosting
Boosting fits an additive model by minimizing the exponential loss over the training samples. The exponential loss is a differentiable upper bound on the misclassification error.
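The formulas on this slide appear only as images in the source; in standard boosting notation (the symbols are mine, chosen to match the multi-class cost on slide 42), the additive model and the exponential loss can be written as:

```latex
H(v) = \sum_{m=1}^{M} h_m(v), \qquad
J = \sum_{i=1}^{N} \exp\!\bigl(-z_i \, H(v_i)\bigr), \qquad z_i \in \{-1, +1\}
```

Here each h_m is a weak learner added at round m, and z_i is the label of training sample v_i.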
36. Why boosting?
- A simple algorithm for learning robust classifiers (Freund & Schapire, 1995; Friedman, Hastie, Tibshirani, 1998)
- Provides an efficient algorithm for sparse visual feature selection (Tieu & Viola, 2000; Viola & Jones, 2003)
- Easy to implement; does not require external optimization tools.
37. Weak detectors
Part-based, similar to part-based generative models. We create weak detectors by using parts and voting for the object center location.
Screen model
Car model
These features are used for the detector on the course web site.
38. Weak detectors
- Tieu and Viola, CVPR 2000
- Viola and Jones, ICCV 2001
- Carmichael, Hebert 2004
- Yuille, Snow, Nitzberg, 1998
- Amit, Geman 1998
- Papageorgiou, Poggio, 2000
- Heisele, Serre, Poggio, 2001
- Agarwal, Awan, Roth, 2004
- Schneiderman, Kanade 2004
- ...
39. Weak detectors
First we collect a set of part templates from a set of training objects. Vidal-Naquet, Ullman (2003)
40. Weak detectors
We now define a family of weak detectors as:
Better than chance
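The definition itself is shown only as a figure; as a rough sketch (my assumption of the general recipe: a part template is correlated with the image, shifted to vote for the object center, and thresholded; names and parameters are illustrative), a weak detector of this kind could look like:

```python
# Hypothetical sketch of a part-based weak detector: correlate a part
# template with the image, shift the response so each firing location votes
# for the object center, and threshold. Not the talk's exact formulation.
import numpy as np
from scipy.signal import correlate2d

def weak_detector(image, template, offset, theta, a=1.0, b=-1.0):
    """Vote for the object center from one part template.

    image    2D grayscale array
    template 2D part template cropped from a training object
    offset   (dy, dx) from the part location to the object center
    theta    threshold on the correlation response
    Returns a map equal to a + b where the part fires and b elsewhere.
    """
    response = correlate2d(image, template, mode="same")
    # Shift the response so that each vote lands on the object center.
    response = np.roll(response, shift=offset, axis=(0, 1))
    return a * (response > theta) + b
```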
41. Example: screen detection
[Figure: feature output, thresholded output, and strong classifier response while adding features; final classification with the strong classifier at iteration 200.]
42. Multi-class boosting
We use the exponential multi-class cost function J = \sum_{c} \sum_{i} \exp(-z_i^c H(v_i, c)), where H(v_i, c) is the classifier output for class c and z_i^c \in \{-1, +1\} encodes membership of sample i in class c.
Freund & Schapire, 1995; Friedman, Hastie, Tibshirani, 1998
43. Weak learners are shared
At each boosting round, we add a perturbation, or weak learner, which is shared across some classes. We add the weak classifier that provides the best reduction of the exponential cost.
Freund & Schapire, 1995; Friedman, Hastie, Tibshirani, 1998
44. Summary of our algorithm for finding shared features
- It is an iterative algorithm that adds one feature at each iteration.
- At each iteration, the algorithm selects, from a dictionary of features, the best feature and the set of object classes to which that feature should be applied.
- All the training samples are reweighted to increase the weight of samples for which the previously selected features provided wrong labels.
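A minimal sketch of this loop (my own illustration, not the authors' code; the brute-force search over class subsets below is a simplification, whereas the published algorithm grows the sharing subset greedily) could look like:

```python
# Illustrative shared-feature boosting loop: at each round, pick one feature,
# one threshold, and one subset of classes that share the resulting weak
# learner, choosing the combination that most reduces the multi-class
# exponential cost. Names and parameters are illustrative.
import itertools
import numpy as np

def fit_shared_stump(fired, z, w):
    """GentleBoost-style regression stump a*[v > theta] + b, fitted with weights w."""
    on, off = fired[:, None], (1.0 - fired)[:, None]
    b = (w * z * off).sum() / ((w * off).sum() + 1e-12)
    a = (w * z * on).sum() / ((w * on).sum() + 1e-12) - b
    return a, b

def joint_boost(x, z, n_rounds=20):
    """x: (n, d) feature responses; z: (n, C) class membership labels in {-1, +1}."""
    n, d = x.shape
    C = z.shape[1]
    H = np.zeros((n, C))                              # additive classifier outputs
    for _ in range(n_rounds):
        w = np.exp(-z * H)                            # per-sample, per-class weights
        best = None
        for f in range(d):
            for theta in np.quantile(x[:, f], [0.25, 0.5, 0.75]):
                fired = (x[:, f] > theta).astype(float)
                # Brute-force over class subsets; the published algorithm
                # grows the sharing subset greedily instead.
                for r in range(1, C + 1):
                    for subset in itertools.combinations(range(C), r):
                        cols = list(subset)
                        a, b = fit_shared_stump(fired, z[:, cols], w[:, cols])
                        pred = np.zeros((n, C))
                        pred[:, cols] = (a * fired + b)[:, None]
                        cost = (w * np.exp(-z * pred)).sum()
                        if best is None or cost < best[0]:
                            best = (cost, cols, a, b, fired)
        _, cols, a, b, fired = best
        H[:, cols] += (a * fired + b)[:, None]        # shared update for the chosen classes
    return H
```

The design point mirrored here is that a single weak learner updates several class outputs at once, so classes with few examples can reuse features selected mainly for other classes.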
45. Specific feature
Classes: pedestrian, chair, traffic light, sign, face, and a background class.
Non-shared feature: this feature is too specific to faces.
46. Shared feature
47. Shared vs. specific features
48. Shared vs. specific features
49. How the features are shared across objects
(Features sorted left-to-right from generic to specific.)
Torralba, Murphy, Freeman. CVPR 2004.
50. Red: shared features. Blue: independent features.
Sharing features shows sub-linear scaling of the number of features with the number of objects (for an area under the ROC of 0.9). Results are averaged over 8 training sets and different combinations of objects; error bars show variability.
51. Red: shared features. Blue: independent features.
53. An application of feature sharing: object clustering
Count the number of common features between objects.
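As a rough illustration (pairing the counts with an off-the-shelf hierarchical clustering is my assumption, not the slide's stated procedure), the idea can be turned into code as follows:

```python
# Illustrative only: build an object-similarity matrix by counting shared
# features, then cluster it. The linkage choice is an assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_objects(uses_feature, n_clusters=3):
    """uses_feature: (n_classes, n_features) boolean matrix, True where a class
    uses a given shared feature (e.g. from a shared-feature boosting run)."""
    u = uses_feature.astype(int)
    shared = u @ u.T                          # counts of common features per pair of classes
    dist = shared.max() - shared              # more sharing means smaller distance
    np.fill_diagonal(dist, 0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```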
54. Multi-view object detection: train for object and orientation
Sharing features is a natural approach to view-invariant object detection.
View-invariant features
View-specific features
55. Multi-view object detection
Sharing is not a tree; it also depends on 3D symmetries.
56. Multi-view object detection
Strong learner H response for the car class as a function of the assumed view angle.
57. Generalization as a function of object similarities
[Plots: performance vs. number of training samples per class. Each point in the graphs is the average over the 12 classes.]
58. PASCAL dataset
59. From shared to specific features
Face detection and recognition
60. Hierarchical Topic Models
- Topic models typically use a bag-of-words approximation.
- Learning topics allows transfer of information within a corpus of related documents.
- Mixing proportions capture the distinctive features of particular documents.
[Graphical model for LDA: Pr(topic | doc) and Pr(word | topic), with plates over K topics, N words, and J documents.]
Latent Dirichlet Allocation (LDA): Blei, Ng, Jordan, JMLR 2003
61. Hierarchical Topic Models
Pr(x = word | doc) = sum over topics of Pr(x = word | z = topic) Pr(z = topic | doc)
[Graphical model for LDA: Pr(topic | doc) and Pr(word | topic), with plates over K topics, N words, and J documents.]
Latent Dirichlet Allocation (LDA): Blei, Ng, Jordan, JMLR 2003
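To make the generative reading of this decomposition concrete, here is a hedged sketch of the LDA sampling process (variable names and the use of numpy are my own; the talk only shows the graphical model):

```python
# Sample a toy corpus from the LDA generative model: per-document topic
# proportions from a Dirichlet prior, then a topic and a word per token.
import numpy as np

def sample_corpus(n_docs, n_words, phi, alpha, seed=0):
    """phi: (K, V) array of Pr(word | topic); alpha: (K,) Dirichlet prior on Pr(topic | doc)."""
    rng = np.random.default_rng(seed)
    K, V = phi.shape
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)                           # Pr(topic | doc)
        z = rng.choice(K, size=n_words, p=theta)               # topic assignment per word
        words = np.array([rng.choice(V, p=phi[k]) for k in z]) # word drawn from its topic
        docs.append(words)
    return docs
```

In the visual-word setting of the later slides, the "words" are quantized SIFT descriptors and the "documents" are images.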
62. Hierarchical Topic Models
Some previous work on bag-of-features models:
- Object recognition: Sivic et al., ICCV 2005
- Scene recognition: Fei-Fei et al., CVPR 2005
Latent Dirichlet Allocation (LDA): Blei, Ng, Jordan, JMLR 2003
63. Hierarchical Sharing and Context
E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. ICCV 2005.
- Scenes share objects
- Objects share parts
- Parts share features
64. From images to visual words
Maximally Stable Extremal Regions
Linked Sequences of Canny Edges
Affinely Adapted Harris Corners
- Some invariance to lighting and pose variations
- Dense, multiscale over-segmentation of the image
65. From images to visual words
SIFT descriptors (Lowe, IJCV 2004)
- Normalized histograms of orientation energy
- Compute a 1,000-word dictionary via K-means
- Map each feature to the nearest visual word
Observed data: the appearance of feature i in image j and the 2D position of feature i in image j.
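A minimal sketch of the quantization step described here (SIFT extraction is assumed to come from an external library; function names are illustrative):

```python
# Build a visual-word dictionary with K-means and map descriptors to the
# nearest word. Descriptor extraction itself is assumed given.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_dictionary(descriptors, n_words=1000):
    """descriptors: (n_features, 128) SIFT descriptors pooled over the training images."""
    centers, _ = kmeans2(descriptors.astype(float), n_words, minit="points")
    return centers

def to_visual_words(descriptors, centers):
    """Return the index of the nearest visual word for each descriptor."""
    words, _ = vq(descriptors.astype(float), centers)
    return words
```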
66. Object models
Constellation model
Bag of words
Structured clusters
E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. ICCV 2005.
67. Counting Objects and Parts
How many parts?
68. Generative Model for Objects
69. Graphical Model for Objects
For each of J images, sample a reference position r.
[Graphical model for the single-object part model, with plates over K parts, N features, and J images.]
70. Parametric Object Model
For a fixed reference position, the generative model is equivalent to a finite mixture model: a mixture of K parts, each with a distribution over feature appearance and a distribution over feature locations, combined with mixture weights.
How many parts should we choose?
- Too few reduces model accuracy.
- Too many causes overfitting and poor generalization.
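The mixture formula on this slide is only a figure in the source; in generic notation (the symbols below are my own), the annotated pieces correspond to a finite mixture of the form:

```latex
p(w_i, x_i) \;=\; \sum_{k=1}^{K} \pi_k \, p(w_i \mid \eta_k)\, p(x_i \mid \mu_k, \Lambda_k)
```

where \pi_k are the mixture weights, p(w_i | \eta_k) is part k's distribution over feature appearance w_i, and p(x_i | \mu_k, \Lambda_k) is part k's distribution over feature location x_i.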
71. Dirichlet Process Object Model
- The Dirichlet process allows using an infinite mixture.
- Dirichlet processes define priors over the mixture weights pi_k.
- Some weights are effectively zero, which corresponds to having a finite number of parts (automatically selected from the data).
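One standard way to realize such a prior is the stick-breaking construction; the sketch below (my illustration, not necessarily how the talk's model is implemented) shows how most of the weight concentrates on a few parts:

```python
# Truncated stick-breaking draw of Dirichlet process mixture weights.
import numpy as np

def stick_breaking(alpha, n_sticks=50, seed=0):
    """Draw a truncated approximation of DP mixture weights pi_k."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_sticks)               # fraction broken off the stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining                                   # pi_k (sums to < 1 due to truncation)

pi = stick_breaking(alpha=2.0)
print((pi > 1e-3).sum(), "components carry essentially all of the weight")
```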
72. Dirichlet Process Object Model
[Graphical model for the Dirichlet process object model, with plates over N features and J images.]
73. Decomposing Faces into Parts
[Plots: number of parts vs. number of training images, for 4, 16, and 64 training images.]
74. Multiclass object model
- We want to model N object classes jointly.
- We want an efficient representation.
- We want to transfer between categories.
Furthermore:
- We do not know how many parts to share.
- We do not know how many parts each object should use (each object needs a different number of parts).
75. Learning Shared Parts
- Objects are often locally similar in appearance.
- Discover parts shared across categories.
- Need unsupervised methods for part discovery.
Sharing features in a discriminative framework: Torralba, Murphy, Freeman, CVPR 2004.
76. HDP Object Model
- We learn the number of parts.
- Each object uses a different number of parts.
- The model assumes a known number of object categories.
77. HDP Object Model
There is no context, so the model is happy to create impossible part combinations.
78. HDP Object Model
[Graphical model: a joint model of O objects.]
- A global Dirichlet process learns the number of shared parts.
- The reference position allows a consistent spatial model.
- Objects reuse global parts in different proportions.
- Each part has a location model and an appearance model.
79. Learning HDPs: Gibbs Sampling
[Graphical model annotated with the sampler: the reference position and the part assignments are sampled (the latter via implicit table assignments), while the remaining parameters are integrated out.]
80. Sharing Parts: 16 Categories
- Caltech 101 Dataset (Li & Perona)
- Horses (Borenstein & Ullman)
- Cat and dog faces (Vidal-Naquet & Ullman)
- Bikes from Graz-02 (Opelt & Pinz)
- Google
81. Visualization of Shared Parts
Pr(position | part)
Pr(appearance | part)
82. Visualization of Shared Parts
Pr(position | part)
Pr(appearance | part)
83. Visualization of Part Densities
MDS embedding of Pr(part | object)
84. Detection Task
versus
85. Detection Results
Detection vs. Training Set Size
6 Training Images per Category
86. Recognition Task
versus
87. Recognition Results
6 Training Images per Category
Recognition vs. Training Set Size
88. Context
What do you think are the hidden objects?
1
2
89. Context
What do you think are the hidden objects?
Even without local object models, we can make
reasonable detections!
90. The multiple personalities of a blob
91. The multiple personalities of a blob
Human vision: Biederman; Bar & Ullman; Palmer; ...
92. Context: relationships between objects
First detect simple objects (reliable detectors) that provide strong contextual constraints to the target (screen -> keyboard -> mouse).
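As a toy illustration of this chain (the geometry and numbers below are invented placeholders, not values from the talk), a detected screen can prime the search window for a keyboard:

```python
# Toy object-to-object context: a reliably detected screen constrains where
# the (more expensive) keyboard detector is run. Offsets are placeholders.
import numpy as np

def keyboard_search_region(screen_box, image_shape, rel_dy=0.6, rel_spread=0.4):
    """screen_box: (x, y, w, h) of a detected screen; image_shape: (H, W).
    Returns an (x0, y0, x1, y1) window centered below the screen."""
    x, y, w, h = screen_box
    cx, cy = x + w / 2.0, y + h + rel_dy * h      # expected keyboard center
    half = rel_spread * max(w, h)
    H, W = image_shape
    x0, x1 = np.clip([cx - half, cx + half], 0, W).astype(int)
    y0, y1 = np.clip([cy - half, cy + half], 0, H).astype(int)
    return x0, y0, x1, y1
```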
93. Global context: location priming
How far can we go without object detectors?
- Context features that represent the scene instead of other objects.
- The global features can provide:
  - Object presence
  - Location priming
  - Scale priming
94. Object and global features
First we create a dictionary of scene features and associated object locations.
[Figure: feature maps and associated screen locations.]
Only the vertical position of the object is well constrained by the global features.
95. Object and global features
How to compute the global features
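The computation itself is shown only as a figure; the sketch below is an assumption modeled on gist-style global scene features (oriented band-pass energy pooled over a coarse spatial grid), not necessarily this slide's exact recipe:

```python
# Gist-style global features (assumed recipe): oriented band-pass energy
# averaged over a coarse spatial grid, concatenated into one vector.
import numpy as np
from scipy.ndimage import gaussian_filter

def global_features(image, n_orientations=4, n_scales=2, grid=4):
    """image: 2D grayscale array. Returns a vector of pooled filter energies."""
    image = np.asarray(image, dtype=float)
    feats = []
    for s in range(n_scales):
        # Crude band-pass at this scale: difference of Gaussians.
        band = gaussian_filter(image, 2 ** s) - gaussian_filter(image, 2 ** (s + 1))
        gy, gx = np.gradient(band)
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            energy = (gx * np.cos(theta) + gy * np.sin(theta)) ** 2
            h, w = energy.shape
            for i in range(grid):                 # pool over a grid x grid layout
                for j in range(grid):
                    cell = energy[i * h // grid:(i + 1) * h // grid,
                                  j * w // grid:(j + 1) * w // grid]
                    feats.append(cell.mean())
    return np.array(feats)
```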
96. Car detection with global features
[Plot: features selected by boosting for the car class, as a function of boosting round.]
97. Combining global and local features
[ROC curves for the same total number of features (100 boosting rounds), for car, building, road, screen, keyboard, mouse, and desk, comparing global-and-local against only-local features.]
98. Clustering of objects with local and global feature sharing
Clustering with local features vs. clustering with global and local features.
Objects are similar if they share local features and they appear in the same contexts.
99. Conclusions
- Sharing information at multiple levels leads to reduced computation and better generalization.
- What are the object representations that allow transfer between classes?