Title: The Visual Recognition Machine
1 The Visual Recognition Machine
- Jitendra Malik
- University of California at Berkeley
2 From images to objects
Labeled sets: tiger, grass, etc.
3 Recognition
4 Three stages
- Segmentation: Images -> Regions
- Association: Regions -> Super-regions
- Matching: Super-regions -> Prototype views
6 Three stages
- Segmentation: Images -> Regions
- Association: Regions -> Super-regions
- Matching: Super-regions -> Prototype views
7 Boundaries of image regions are defined by a number of attributes
- Brightness/color
- Texture
- Motion
- Stereoscopic depth
- Familiar configuration
8 Image Segmentation as Graph Partitioning
- Build a weighted graph G(V, E) from the image
- V: image pixels
- E: connections between pairs of nearby pixels
- Partition the graph so that similarity within a group is large and similarity between groups is small
-- Normalized Cuts [Shi & Malik 97]
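The graph construction above can be sketched as follows. This is a minimal illustration, assuming grayscale intensities in [0, 1] and a Gaussian similarity on brightness difference and pixel distance; the slide does not give the weight function, and the names `build_graph`, `radius`, `sigma_i` and `sigma_x` are illustrative choices rather than values from the talk.

```python
import math

def build_graph(image, radius=1.5, sigma_i=0.1, sigma_x=2.0):
    """Return {(p, q): weight} for pixel pairs closer than `radius`.

    Pixels are indexed row-major. The Gaussian weight is an assumed
    similarity measure, not one stated on the slide.
    """
    rows, cols = len(image), len(image[0])
    r = int(math.floor(radius))
    weights = {}
    for y in range(rows):
        for x in range(cols):
            p = y * cols + x
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    dist2 = dx * dx + dy * dy
                    if dist2 == 0 or dist2 > radius * radius:
                        continue  # skip the pixel itself and far pixels
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < rows and 0 <= xx < cols):
                        continue  # neighbor falls outside the image
                    q = yy * cols + xx
                    diff = image[y][x] - image[yy][xx]
                    weights[(p, q)] = (math.exp(-diff * diff / sigma_i ** 2)
                                       * math.exp(-dist2 / sigma_x ** 2))
    return weights

# Toy 2x4 image: dark left half, bright right half.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
weights = build_graph(image)
```

Edges inside the dark or bright half get weights close to 1, while edges that cross the brightness boundary get weights close to 0, which is the structure a good partition should exploit.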
9 Some Terminology for Graph Partitioning
- How do we bipartition a graph?
10 Normalized Cut, a measure of dissimilarity
- Minimum cut is not appropriate since it favors cutting small pieces.
- Normalized Cut, Ncut
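The formula on this slide did not survive transcription. The sketch below uses the standard definition from Shi and Malik, Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), on a hypothetical toy graph (two tight triangles, a weak bridge, and one weakly attached outlier) to show why minimum cut favors small pieces while Ncut does not. The graph and helper names are made up for illustration.

```python
def symmetric(edges):
    """Expand undirected edges {(p, q): w} into a directed weight dict."""
    w = {}
    for (p, q), v in edges.items():
        w[(p, q)] = v
        w[(q, p)] = v
    return w

def cut(weights, a, b):
    """Total weight of edges going from set a to set b."""
    return sum(w for (p, q), w in weights.items() if p in a and q in b)

def assoc(weights, a):
    """Total weight of edges from set a to all nodes in the graph."""
    return sum(w for (p, _q), w in weights.items() if p in a)

def ncut(weights, a, b):
    c = cut(weights, a, b)
    return c / assoc(weights, a) + c / assoc(weights, b)

# Hypothetical graph: triangles {0,1,2} and {3,4,5}, bridge 2-3, outlier 6.
weights = symmetric({
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
    (2, 3): 0.3,   # weak bridge between the two clusters
    (5, 6): 0.2,   # weakly attached outlier
})

outlier, rest = {6}, {0, 1, 2, 3, 4, 5}
left, right = {0, 1, 2}, {3, 4, 5, 6}

min_cut_prefers_outlier = cut(weights, outlier, rest) < cut(weights, left, right)
ncut_prefers_balanced = ncut(weights, left, right) < ncut(weights, outlier, rest)
```

Isolating the outlier has the smaller raw cut, but its Ncut is close to 1 because the cut is the outlier's entire association with the graph; the balanced split scores far lower.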
11 Solving the Normalized Cut problem
- The exact discrete solution to Ncut is NP-complete even on a regular grid (Papadimitriou 97)
- Drawing on spectral graph theory, a good approximation can be obtained by solving a generalized eigenvalue problem.
12 Normalized Cut as a Generalized Eigenvalue Problem
- After simplification, we get
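The equation for this slide is missing from the transcript. For reference, the standard form from the Shi and Malik normalized cuts formulation is given below, with W the symmetric matrix of edge weights w(i, j) and D the diagonal matrix with D(i, i) = sum over j of w(i, j). The symbols y, b and lambda follow that formulation and are not taken from the slide itself.

```latex
\begin{align}
\min_{x} \mathrm{Ncut}(x)
  &= \min_{y} \frac{y^{T} (D - W)\, y}{y^{T} D\, y},
  \qquad y_i \in \{1, -b\}, \quad y^{T} D \mathbf{1} = 0, \\
\intertext{and relaxing $y$ to take real values gives the generalized eigensystem}
(D - W)\, y &= \lambda\, D\, y .
\end{align}
```

The eigenvector with the second-smallest eigenvalue is the one used to bipartition the graph; the smallest eigenvalue is 0, with the constant vector as its trivial solution.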
13 Computational Aspects
- Solving for the generalized eigensystem
- (D-W) is of size N x N, but it is sparse with O(N) nonzero entries, where N is the number of pixels.
- Using the Lanczos algorithm.
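To make the role of sparsity concrete, here is a small sketch that finds the second-smallest generalized eigenvector of (D - W) y = lambda D y using only sparse matrix-vector products. It uses power iteration with deflation as a simpler stand-in for Lanczos, so it is not the method named on the slide; the function name `ncut_bipartition` and the toy graph are assumptions for illustration.

```python
import math

def ncut_bipartition(weights, n, iters=500):
    """Approximate the second-smallest eigenvector of (D - W) y = lambda D y.

    Works on z = D^{1/2} y, where the problem becomes finding the
    second-largest eigenvector of M = D^{-1/2} W D^{-1/2}.
    """
    deg = [0.0] * n
    for (p, _q), w in weights.items():
        deg[p] += w
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]

    # Trivial top eigenvector of M is D^{1/2} 1; it is projected out below.
    z0 = [math.sqrt(d) for d in deg]
    norm0 = math.sqrt(sum(v * v for v in z0))
    z0 = [v / norm0 for v in z0]

    def project_and_normalize(z):
        dot = sum(a * b for a, b in zip(z, z0))
        z = [a - dot * b for a, b in zip(z, z0)]
        norm = math.sqrt(sum(a * a for a in z))
        return [a / norm for a in z]

    z = project_and_normalize([float(i) for i in range(n)])  # deterministic start
    for _ in range(iters):
        # One product with (I + M); the cost is proportional to the edge count.
        nxt = list(z)
        for (p, q), w in weights.items():
            nxt[p] += inv_sqrt[p] * w * inv_sqrt[q] * z[q]
        z = project_and_normalize(nxt)
    return [inv_sqrt[i] * z[i] for i in range(n)]

# Hypothetical graph: two triangles joined by one weak edge (2-3).
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
         (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
         (2, 3): 0.1}
weights = {}
for (p, q), w in edges.items():
    weights[(p, q)] = w
    weights[(q, p)] = w

y = ncut_bipartition(weights, 6)
group_a = {i for i, v in enumerate(y) if v > 0}
```

Thresholding the eigenvector at zero separates the two triangles. Each iteration touches every edge once, which is why the O(N) nonzero entries matter; a Lanczos solver relies on the same sparse products but converges in far fewer of them.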
14 Three stages
- Segmentation: Images -> Regions
- Association: Regions -> Super-regions
- Matching: Super-regions -> Prototype views
15 Association
- The number of super-regions of size k in an image with n regions is approximately (4^k) n / k
- For typical images, this ranges between 1,000 and 10,000
- Plausibility ordering could reduce the effective number substantially
- Computing time for this stage is negligible
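As a quick arithmetic check on the count above, the snippet below reads the flattened expression as 4^k * n / k (an assumption about a lost superscript) and evaluates it for a hypothetical image with n = 100 regions. The value of n and the choices of k are illustrative, not figures from the slides.

```python
def super_region_count(n, k):
    """Approximate number of super-regions of size k among n regions,
    reading the slide's expression as 4**k * n / k (assumed)."""
    return 4 ** k * n / k

n = 100  # hypothetical number of regions in one image
count_k3 = super_region_count(n, 3)  # 64 * 100 / 3
count_k4 = super_region_count(n, 4)  # 256 * 100 / 4
```

Under this reading, both values land in the 1,000 to 10,000 range quoted on the slide.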
16 Three stages
- Segmentation: Images -> Regions
- Association: Regions -> Super-regions
- Matching: Super-regions -> Prototype views
17 Matching
- Objects are represented by a set of prototypical views (10 per object)
- For each super-region S, calculate the probability that it is an instance of view V
- Determine the most probable labeling of the image into objects
19 Matching super-regions to views
- Based on color, texture and shape similarity
- Color and texture matching is relatively well understood and fast
- Shape matching is difficult because the algorithm should tolerate pose, illumination and intra-category variation
- Goal: small misclassification error with few views.
20 Core idea
- Find corresponding points on the two shapes and use those to deform the prototype into alignment
- Allowing this flexibility reduces the number of prototype views needed
23 MNIST Handwritten Digits
24 Digit Prototypes
25 Matching with original and deformed prototypes
(Figure columns: Prototype, Test, Error)
26 Deforming prototypes using thin plate splines
27 Only 25 deformable templates needed (instead of 60K) to get 5% error
28 COIL Object Database
30 Computing cost on a Pentium PC
- Segmentation: 5 minutes/image
- Matching: 0.5 sec/match
31 Cost on a 10^4-node machine
- Segmentation: 0.03 sec/image, which is 30 Hz (video rate)
- Matching: 20K matches/sec at full resolution (100 points/shape)
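These figures follow from the single-PC costs if the node count is read as 10^4 and the speedup is assumed to be perfectly linear in the number of nodes. That linearity is an idealization made here for the arithmetic, not a claim from the slides.

```python
NODES = 10 ** 4                    # "104 node" read as 10^4 nodes

seg_single_pc = 5 * 60             # seconds per image on one PC
match_single_pc = 0.5              # seconds per match on one PC

# Idealized linear speedup across all nodes.
seg_parallel = seg_single_pc / NODES          # seconds per image
frame_rate = 1 / seg_parallel                 # images per second
matches_per_sec = NODES / match_single_pc     # full-resolution matches per second
```

The segmentation time comes out at 0.03 sec/image, roughly 33 frames per second, and matching at 20,000 matches per second, in line with the slide.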
32 How many prototype views can one match at 1 Hz?
- 1K candidate super-regions
- Consider only 1% of matches at full resolution (10% pass the color/texture filter, 10% of those pass the low-resolution shape filter)
- If half the time is spent in pruning and half in full-resolution matching, 1000 prototype views can be matched at 1 Hz.
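The estimate can be reproduced step by step, taking the 20K matches/sec throughput from the previous slide and reading the pass rates as 10% and 10%. Variable names are illustrative.

```python
matches_per_sec = 20_000       # full-resolution throughput from the previous slide
candidates = 1_000             # candidate super-regions per image

after_color_texture = candidates // 10          # 10% pass the color/texture filter
after_low_res_shape = after_color_texture // 10  # 10% of those pass the low-res shape filter

# Half of each second goes to pruning, half to full-resolution matching.
full_res_budget = matches_per_sec // 2

# Each prototype view needs one full-resolution match per surviving candidate.
views_per_second = full_res_budget // after_low_res_shape
```

Each prototype view costs 10 full-resolution matches, and 10,000 such matches fit in the half second available, giving 1000 views per second.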
33 What can one do with matching 1000 views a second?
- Worst case: 100 object categories (1000 views at 10 views per object)
- Best case depends on how well one can exploit context, hierarchy and hashing.
- Cf. humans can recognize 10-100K objects
34 Memory requirements
- 10K object categories x 10 views/category x 100 x 100 pixels/view x 1 byte/pixel gives us 1 gigabyte.
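The memory figure is a straight product of the quantities on the slide, with the lost multiplication signs restored:

```python
categories = 10_000
views_per_category = 10
pixels_per_view = 100 * 100
bytes_per_pixel = 1

total_bytes = categories * views_per_category * pixels_per_view * bytes_per_pixel
total_gigabytes = total_bytes / 10 ** 9   # decimal gigabytes
```

This gives 10^9 bytes, that is, 1 gigabyte for the full set of stored prototype views.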
35 Concluding remarks
- Speech in 1985 was in the same state as vision in 2000. The adoption of Hidden Markov Models led to a decade of research that refined the paradigm for continuous speech recognition.
- The proposed three-stage framework for recognition (segmentation, association and matching) could provide the same focus and coherence to vision research, leading to general-purpose object recognition in 10 years.