The Visual Recognition Machine

Slides: 43

Provided by: sche59

Learn more at: http://iram.cs.berkeley.edu

Transcript and Presenter's Notes

1
The Visual Recognition Machine

Jitendra Malik
University of California at Berkeley

2
From images to objects
Labeled sets: tiger, grass, etc.
3
Recognition
4
Three stages

Segmentation: Images -> Regions
Association: Regions -> Super-regions
Matching: Super-regions -> Prototype views

5
(No Transcript)
6
Three stages

Segmentation: Images -> Regions
Association: Regions -> Super-regions
Matching: Super-regions -> Prototype views

7
Boundaries of image regions are defined by a number
of attributes:

Brightness/color
Texture
Motion
Stereoscopic depth
Familiar configuration

8
Image Segmentation as Graph Partitioning
Build a weighted graph G(V,E) from the image
V: image pixels. E: connections between pairs of
nearby pixels
Partition the graph so that similarity within a group
is large and similarity between groups is small
-- Normalized Cuts [Shi & Malik 97]
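
As an illustration of the graph construction on this slide, here is a minimal Python sketch. The brightness-only Gaussian weight, the `radius` and `sigma` parameters, and the function name `build_affinity` are assumptions made for the example; the slides do not specify the weighting function.

```python
import math

def build_affinity(image, radius=1, sigma=0.1):
    """Return {(i, j): weight} for every ordered pair of nearby pixels.

    Pixels are indexed row-major. The weight is a Gaussian of the
    brightness difference (an assumed similarity measure).
    """
    rows, cols = len(image), len(image[0])
    weights = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0):
                        continue  # no self-edges
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # neighbour falls outside the image
                    j = rr * cols + cc
                    diff = image[r][c] - image[rr][cc]
                    weights[(i, j)] = math.exp(-(diff * diff) / (sigma * sigma))
    return weights

# Tiny 2 x 4 grayscale image: dark left half, bright right half.
image = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
]
W = build_affinity(image)
```

Because each pixel connects only to a fixed-size neighbourhood, the number of stored weights grows linearly with the number of pixels.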
9
Some Terminology for Graph Partitioning

How do we bipartition a graph?

10
Normalized Cut, A measure of dissimilarity

Minimum cut is not appropriate since it favors
cutting small pieces.
Normalized Cut, Ncut
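
The formulas on this slide did not survive in the transcript. For reference, the definitions as given in Shi and Malik's normalized cuts paper are reproduced below, where w(u,v) is the edge weight and A, B partition the vertex set V; the notation is taken from that paper, not recovered from the slide.

```latex
\begin{align*}
\mathrm{cut}(A,B)   &= \sum_{u \in A,\; v \in B} w(u,v), \\
\mathrm{assoc}(A,V) &= \sum_{u \in A,\; t \in V} w(u,t), \\
\mathrm{Ncut}(A,B)  &= \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                     + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}.
\end{align*}
```

Dividing by the total association of each side penalizes partitions that cut off a small, weakly connected piece.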

11
Solving the Normalized Cut problem

Finding the exact discrete solution to Ncut is NP-complete,
even on a regular grid [Papadimitriou 97].
Drawing on spectral graph theory, a good
approximation can be obtained by solving a
generalized eigenvalue problem.

12
Normalized Cut As Generalized Eigenvalue problem

after simplification, we get
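
The result of the simplification is missing from the transcript. The form below is the one given in Shi and Malik's paper, stated here as a reference rather than as a recovery of the slide: W is the weight matrix, D is the diagonal matrix with d(i) = sum_j w(i,j), and y is an indicator-like vector.

```latex
\begin{align*}
\min_{A,B} \mathrm{Ncut}(A,B)
  &= \min_{y} \frac{y^{\top}(D - W)\,y}{y^{\top} D\, y},
  \qquad y_i \in \{1, -b\}, \quad y^{\top} D\,\mathbf{1} = 0, \\
\text{relaxing } y \text{ to real values:} \quad
  (D - W)\,y &= \lambda\, D\, y .
\end{align*}
```

The eigenvector with the second-smallest eigenvalue is then thresholded to obtain the bipartition.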

13
Computational Aspects

Solving the generalized eigensystem
(D - W) y = lambda D y
(D - W) is of size N x N, but it is sparse
with O(N) nonzero entries, where N is the number
of pixels.
Solved using the Lanczos algorithm.
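
To make the eigenvector step concrete, here is a small dependency-free Python sketch. It does not implement Lanczos: as a stand-in it uses deflated power iteration on the normalized matrix D^(-1/2) W D^(-1/2), which has the same eigenvectors (up to scaling by D^(-1/2)) as the generalized problem (D - W) y = lambda D y. The toy graph, the function names, and thresholding at zero are all assumptions made for illustration.

```python
import math

def second_generalized_eigvec(W, iters=500):
    """Approximate the second-smallest eigenvector y of (D - W) y = lambda D y."""
    n = len(W)
    d = [sum(row) for row in W]              # degrees (diagonal of D)
    s = [math.sqrt(x) for x in d]
    # The top eigenvector of M = D^-1/2 W D^-1/2 is D^1/2 * 1 (eigenvalue 1).
    top_norm = math.sqrt(sum(d))
    top = [x / top_norm for x in s]
    z = [float(i + 1) for i in range(n)]     # arbitrary deterministic start
    for _ in range(iters):
        # Deflate: remove the component along the trivial top eigenvector.
        proj = sum(a * b for a, b in zip(z, top))
        z = [a - proj * b for a, b in zip(z, top)]
        # Multiply by (M + I) / 2 so that all eigenvalues are non-negative.
        Mz = [sum(W[i][j] * z[j] / (s[i] * s[j]) for j in range(n))
              for i in range(n)]
        z = [0.5 * (m + a) for m, a in zip(Mz, z)]
        norm = math.sqrt(sum(a * a for a in z))
        z = [a / norm for a in z]
    # Map back to the generalized eigenvector: y = D^-1/2 z.
    return [a / b for a, b in zip(z, s)]

def two_triangles(bridge=0.1):
    """Toy graph: two tightly connected triangles joined by one weak edge."""
    n = 6
    W = [[0.0] * n for _ in range(n)]
    for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
        W[a][b] = W[b][a] = 1.0
    W[2][3] = W[3][2] = bridge
    return W

y = second_generalized_eigvec(two_triangles())
labels = [v > 0 for v in y]   # threshold at zero to get the two groups
```

On this toy graph the sign of `y` separates nodes 0-2 from nodes 3-5. For image-sized problems a sparse Lanczos-type solver would be used instead of the dense loops shown here.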

14
Three stages

Segmentation: Images -> Regions
Association: Regions -> Super-regions
Matching: Super-regions -> Prototype views

15
Association

Number of super-regions of size k in an image with
n regions is approximately (4^k) n / k
For typical images, this ranges between 1000 and
10000
Plausibility ordering could reduce effective
number substantially
Computing time for this stage negligible
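
A quick numeric check of the estimate above, as a hedged sketch. It assumes the expression on the slide reads 4^k * n / k (the exponent appears to have been flattened in the transcript), and the value n = 100 regions is an illustrative assumption, not a figure from the talk.

```python
def approx_super_region_count(n_regions, k):
    """Approximate number of super-regions of size k, assuming 4**k * n / k."""
    return (4 ** k) * n_regions / k

# Illustrative values only: n = 100 regions is an assumption.
counts = {k: round(approx_super_region_count(100, k)) for k in (2, 3, 4)}
print(counts)
```

Under these assumptions, k = 3 and k = 4 give counts inside the 1000 to 10000 range quoted on the slide.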

16
Three stages

Segmentation: Images -> Regions
Association: Regions -> Super-regions
Matching: Super-regions -> Prototype views

17
Matching

Objects are represented by a set of prototypical
views (10 per object)
For each super-region S, calculate probability
that it is an instance of view V
Determine most probable labeling of image into
objects

18
(No Transcript)
19
Matching super-regions to views

Based on color, texture and shape similarity
Color, texture matching is relatively well
understood and fast
Shape matching is difficult because the algorithm
should tolerate pose, illumination and
intra-category variation
Goal: small misclassification error with few
views.

20
Core idea

Find corresponding points on the two shapes and
use those to deform prototype into alignment
Allowing this flexibility reduces number of
prototype views needed

21
(No Transcript)
22
(No Transcript)
23
MNIST Handwritten Digits
24
Digit Prototypes
25
Matching with original and deformed prototypes
(Figure columns: Prototype, Test, Error)
26
Deforming prototypes using thin plate splines
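
The slide names thin plate splines without giving details. The sketch below fits a standard interpolating thin plate spline (no regularization) to a set of point correspondences and returns a warp function; the helper names, the sample points, and the dense Gaussian-elimination solver are assumptions chosen to keep the example self-contained, not the presenter's implementation.

```python
import math

def tps_kernel(r2):
    """Thin plate spline radial basis U = r^2 log(r^2), with U(0) = 0."""
    return 0.0 if r2 == 0.0 else r2 * math.log(r2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_tps(src, dst):
    """Return warp(x, y) such that each src[i] is mapped onto dst[i]."""
    n = len(src)
    # System matrix [[K, P], [P^T, 0]] with K_ij = U(|p_i - p_j|^2), P_i = (1, x_i, y_i).
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            A[i][j] = tps_kernel((xi - xj) ** 2 + (yi - yj) ** 2)
        A[i][n:] = [1.0, xi, yi]
        A[n][i], A[n + 1][i], A[n + 2][i] = 1.0, xi, yi
    # One coefficient vector per output coordinate (x, then y).
    coeffs = [solve(A, [p[axis] for p in dst] + [0.0] * 3) for axis in (0, 1)]

    def warp(x, y):
        out = []
        for c in coeffs:
            val = c[n] + c[n + 1] * x + c[n + 2] * y          # affine part
            val += sum(c[i] * tps_kernel((x - sx) ** 2 + (y - sy) ** 2)
                       for i, (sx, sy) in enumerate(src))     # bending part
            out.append(val)
        return tuple(out)

    return warp

# Hypothetical correspondences: unit-square corners stay fixed, the centre moves.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
dst = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.6, 0.55)]
warp = fit_tps(src, dst)
```

Calling `warp` on every point of a prototype shape deforms it toward the test shape; the control points themselves are mapped exactly onto their targets.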
27
Only 25 deformable templates needed (instead of
60K) to get 5% error
28
COIL Object Database
29
(No Transcript)
30
Computing cost on a Pentium PC

Segmentation: 5 minutes/image
Matching: 0.5 sec/match

31
Cost on a 10^4-node machine

Segmentation: 0.03 sec/image, which is about 30 Hz
(video rate)
Matching: 20K matches/sec at full resolution
(100 points/shape)
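
The parallel figures follow from the single-PC costs if the machine has 10^4 nodes and the speedup is perfectly linear. Both are assumptions of this sketch: the node count is read from the flattened "104" in the transcript, and ideal scaling is not claimed anywhere in the slides.

```python
NODES = 10 ** 4                  # assumed reading of the node count
PENTIUM_SEG_SEC = 5 * 60         # 5 minutes per image on one PC
PENTIUM_MATCH_SEC = 0.5          # 0.5 seconds per match on one PC

# Ideal linear speedup (an assumption): divide the work evenly across nodes.
seg_sec_per_image = PENTIUM_SEG_SEC / NODES
matches_per_sec = NODES / PENTIUM_MATCH_SEC
frames_per_sec = 1 / seg_sec_per_image

print(seg_sec_per_image, matches_per_sec, round(frames_per_sec, 1))
```

This gives 0.03 seconds per image (roughly 33 frames per second, which the slide rounds to 30 Hz) and 20,000 matches per second.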

32
How many prototype views can one match at 1 Hz?

1K candidate super-regions
Consider only 1% of matches at full resolution
(10% pass the color/texture filter, 10% of those
pass the low-resolution shape filter)
If half the time is spent in pruning and half in full
resolution matching, 1000 prototype views can be
matched at 1 Hz.
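
The budget above can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones quoted in the slides (20K full-resolution matches per second, 1K candidate super-regions, and two filters that each keep one candidate in ten).

```python
MATCHES_PER_SEC = 20_000          # full-resolution matching rate
CANDIDATE_SUPER_REGIONS = 1_000
COLOR_TEXTURE_KEEP_ONE_IN = 10    # 10% pass the color/texture filter
LOW_RES_SHAPE_KEEP_ONE_IN = 10    # 10% of those pass the low-resolution shape filter

# Half of each second goes to full-resolution matching.
full_res_matches = MATCHES_PER_SEC // 2
# Each full-resolution match stands for 100 candidate (super-region, view) pairs.
pairs_per_sec = full_res_matches * COLOR_TEXTURE_KEEP_ONE_IN * LOW_RES_SHAPE_KEEP_ONE_IN
views_at_1hz = pairs_per_sec // CANDIDATE_SUPER_REGIONS

print(views_at_1hz)
```

The calculation takes the slide's premise at face value that pruning fits in the other half of each second.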

33
What can one do with matching 1000 views a second?

Worst case: 100 object categories
Best case: depends on how well one can exploit
context, hierarchy and hashing.
Cf. humans can recognize 10-100K objects

34
Memory requirements

10K object categories x 10 views/category x 100 x
100 pixels/view x 1 byte/pixel gives us 1
Gigabyte.
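
The memory estimate is a straightforward product; a short check, with illustrative variable names:

```python
categories = 10_000
views_per_category = 10
pixels_per_view = 100 * 100
bytes_per_pixel = 1

total_bytes = categories * views_per_category * pixels_per_view * bytes_per_pixel
print(total_bytes / 10 ** 9, "GB")
```

The product is 10^9 bytes, matching the 1 gigabyte stated on the slide.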

35
Concluding remarks

Speech in 1985 was in the same state as vision in
2000. The adoption of Hidden Markov Models led to a
decade of research that refined the paradigm for
continuous speech recognition.
The proposed 3-stage framework for recognition
(segmentation, association and matching) could
provide the same focus and coherence to vision
research, leading to general-purpose object
recognition in 10 years.