Title: Object Recognition from Local Scale-Invariant Features
1 Object Recognition from Local Scale-Invariant
Features
- A paper by David G. Lowe
University of British Columbia, Vancouver, B.C., Canada
2 Overview
- SIFT - Scale Invariant Feature Transform
- Object Recognition using SIFT KEYS
3 Introduction
- Definition of object recognition
- The visual perception of familiar objects
- Given an image containing unknown objects, the
problem of object recognition is to find a match
between these objects and a set of known objects
that are available in an appropriate
representation. The problem includes the question
of the objects' poses in the image.
- Representations of objects can exist as
- 3D models - model-based recognition
- images - appearance-based recognition
4 Introduction
- Applications of object recognition for HMI
- Content based image retrieval (web search)
- Interactions with robots
- Vision substitution for blind people
- Personal assistance systems
5 Introduction
- Template matching - an early approach
- Given a template matrix T of the object we are
looking for, we can use the following approach to
detect the presence of the object in a search
image I:
- We move T pixel by pixel over I.
- We create a new image matrix R, in which every
pixel (u,v) is the result of the
cross-correlation between the matrix T and the
matrix I, where T is centered at pixel (u,v) of I.
- Maxima in R indicate the presence of the searched
object.
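The sliding-window procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the function name and the use of *normalized* cross-correlation (subtracting means and dividing by the norms) are my choices:

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image and return a response map R,
    where R[u, v] is the normalized cross-correlation between the
    template and the image patch whose top-left corner is (u, v)."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            patch = image[u:u + th, v:v + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (t * t).sum())
            out[u, v] = (p * t).sum() / denom if denom > 0 else 0.0
    return out

# A bright square hidden in a noisy background; the template is an
# exact copy of it, so the response map peaks at its location.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
img[10:14, 20:24] += 2.0
tmpl = img[10:14, 20:24].copy()
r = match_template(img, tmpl)
u, v = np.unravel_index(r.argmax(), r.shape)
print(int(u), int(v))  # -> 10 20
```

The double loop also makes the "computationally expensive" objection on the next slides concrete: cost grows with image size times template size, per known object.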
6 Introduction
- Template matching - an early approach
7 Introduction
- Template matching - problems
- Templates are images of the whole object → no
possibility to deal with occlusion / background
clutter
- Only invariant to translation
- Computationally expensive
- You must know the objects you are searching for;
otherwise you have to do template matching for
every known object in your database →
computation gets really expensive!
8 Overview
- SIFT - Scale Invariant Feature Transform
- Object Recognition using SIFT KEYS
9 SIFT KEYS
- What is a SIFT KEY?
- A SIFT KEY is an image feature vector that is
fully invariant to image translation, rotation
and scaling. In addition it is partially
invariant to changes in illumination and camera
viewpoint.
- Image regions from which SIFT KEYS are created
10 SIFT KEYS
- Properties of SIFT KEYS
- Locality: features are local, so robust to
occlusion and clutter.
- Distinctiveness: individual features can be
matched to a large database of features.
- Quantity: many features can be generated for even
small objects.
- Efficiency: generation is close to real-time
performance.
- Extensibility: can easily be extended to a wide
range of differing feature types, with each
adding robustness.
- SIFT KEYS provide a good basis for object
recognition
11 Overview
- SIFT - Scale Invariant Feature Transform
- Object Recognition using SIFT KEYS
12 Scale-Space
- Motivation
- Objects are perceived by humans as meaningful
entities at a certain range of scales.
- A discussion of bees and ponds doesn't make
sense at other scales.
- Images taken from "A question of scale - Quarks
to Quasars"; the whole image series is located at
http://www.wordwizz.com/pwrsof10.htm
13 Scale-Space
- Motivation
- The fact that objects have a different appearance
depending on the observation scale has important
implications if one tries to describe them.
- In order to describe objects, information must be
gathered about them. Humans and computers do
this by analysing signals resulting from real
world measurements.
- Analysing signals is done by applying certain
operators to them. The relationship between these
operators' size (resolution) and the size of the
actual structures in the data has a great influence
on the information that can be derived.
- If the size or type of the operator isn't
appropriately chosen, then it can be hard to
interpret the information derived from signal
analysis. Unfortunately there is no obvious way
to determine the scales at which the desired
structures are hidden in the signal.
14 Scale-Space
- Solution
- We represent the signal at all sensible scales!
Types of multi-scale representations are
- wavelets
- image pyramids
- scale-space representation
- The last one is a framework that has been
developed by the computer vision community to
represent image data and its multi-scale nature
at the earliest stages in the chain of visual
processing performed by vision systems.
- Scale-space theory states that the natural
operations to be performed in the visual
front-end are convolutions with Gaussian kernels
and their derivatives.
15 Scale-Space
- What is scale-space representation?
- The scale-space representation of a signal is an
embedding of the original signal into a
one-parameter family of derived signals,
constructed by convolution with a one-parameter
family of Gaussian kernels of increasing width
(the GKF).
- The scale-space representation is created by
convolving the original signal with the
members of the GKF.
- The scale parameter indexes the members of the
GKF and the resulting derived signals.
- Scale-space representation
16 Scale-Space
- Example of scale-space representation
- Increasing level of blur →
17 Overview
- SIFT - Scale Invariant Feature Transform
- Object Recognition using SIFT KEYS
18 SIFT - Scale Invariant Feature Transform
- 1) Detection of scale-space extrema
- The image is searched over all scales and
locations to identify potential interest points
that are invariant to scale and orientation.
- 2) Accurate keypoint localization
- At each candidate location, a detailed model is
fit to determine location and scale. Keypoints
are selected based on measures of their stability.
- 3) Orientation assignment
- One or more orientations are assigned to each
keypoint based on the local image gradient
directions.
- 4) Generation of the image descriptor
- Based on the selected scale, the local image
gradients are transformed into a representation
that allows for significant levels of shape
distortion and illumination change.
19 SIFT - Scale Invariant Feature Transform
- Detection of scale-space extrema
- In order to detect locations that are invariant
to scale changes of the image, an image pyramid is
computed containing the scale-space
representation L of the image I.
- Afterwards the DoG (difference of Gaussian)
pyramid is computed from the scale-space pyramid.
This is done for two reasons:
- DoG is an approximation of LoG (Laplacian of
Gaussian). Mikolajczyk showed that the maxima
and minima of the LoG function produce the most
stable image features.
- The DoG pyramid is very efficient to compute: it
can be computed by subtracting two nearby scales
separated by a constant factor.
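One octave of this construction can be sketched as follows. This is an illustrative NumPy version (my own helper names; the slide does not give code), using s + 3 Gaussian images per octave with adjacent scales separated by the constant factor k = 2^(1/s), as in the paper:

```python
import numpy as np

def blur(img, sigma):
    # Separable Gaussian blur; kernel truncated at ~3 sigma
    r = max(1, int(3 * sigma + 0.5))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    tmp = np.apply_along_axis(np.convolve, 0, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 1, tmp, k, mode='same')

def dog_octave(img, s=3, sigma0=1.6):
    """One octave of the pyramids: s + 3 Gaussian images at scales
    sigma0 * k**i with k = 2**(1/s), and the s + 2 DoG images
    obtained by subtracting adjacent pairs."""
    k = 2.0 ** (1.0 / s)
    gauss = [blur(img, sigma0 * k**i) for i in range(s + 3)]
    dogs = [g2 - g1 for g1, g2 in zip(gauss, gauss[1:])]
    return gauss, dogs

img = np.zeros((48, 48))
img[20:28, 20:28] = 1.0            # a bright square
gauss, dogs = dog_octave(img)
print(len(gauss), len(dogs))       # -> 6 5
```

Each DoG image is just a subtraction of two already-computed blurred images, which is why this pyramid is so cheap compared with evaluating the LoG directly.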
20 SIFT - Scale Invariant Feature Transform
- Detection of scale-space extrema
21 SIFT - Scale Invariant Feature Transform
- Detection of scale-space extrema
- When building the image pyramid, the sampling
frequency in scale and in the image domain must
be properly determined.
- The parameter s regulates the number of
scales per octave.
- σ regulates the amount of initial smoothing
before creating the first octave and therefore
the resolution in the image domain.
22 SIFT - Scale Invariant Feature Transform
- Detection of scale-space extrema
- The detection of the extrema in the DoG pyramid
is achieved by comparing each sample point with
all its neighbours in the current scale and the
scales above and below. It is selected only if it
is smaller than all of them or greater than all
of them.
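The 26-neighbour test can be written compactly. A sketch (my own function name; `dog` is assumed to be a list of same-sized DoG images from one octave):

```python
import numpy as np

def is_extremum(dog, i, y, x):
    """True if DoG sample (i, y, x) is strictly larger or strictly
    smaller than all 26 neighbours: 8 in its own level and 9 in
    each of the levels above and below."""
    val = dog[i][y, x]
    cube = np.stack([lvl[y-1:y+2, x-1:x+2]
                     for lvl in (dog[i-1], dog[i], dog[i+1])])
    others = np.delete(cube.ravel(), 13)   # drop the centre sample itself
    return bool(val > others.max() or val < others.min())

# A single bump in the middle DoG level is detected as a maximum;
# a flat neighbour of it is not.
lo, mid, hi = np.zeros((5, 5)), np.zeros((5, 5)), np.zeros((5, 5))
mid[2, 2] = 1.0
print(is_extremum([lo, mid, hi], 1, 2, 2))  # -> True
print(is_extremum([lo, mid, hi], 1, 2, 1))  # -> False
```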
23 SIFT - Scale Invariant Feature Transform
- Accurate key point localization
- The exact location of a maximum is determined by
fitting a 3D quadratic function to the local
sample points that were detected in step 1. This
is done using a Taylor expansion of the DoG
scale-space function.
- The new extremum is obtained by taking the first
derivative of the quadratic function with respect
to x and setting it to zero.
- D(x, y, σ) and its derivatives are evaluated at the
sample point; x = (x, y, σ)^T is the offset from it.
- D(x) is the quadratic function.
24 SIFT - Scale Invariant Feature Transform
- Accurate key point localization
- If the deviation between the sampled keypoint and
the interpolated keypoint is larger than 0.5 in
any dimension, then the sample point is changed
and the computation is repeated at this point.
Otherwise the deviation is added to the location
of the sample point to get the interpolated
estimate of the extremum.
- After the extrema have been accurately localized,
the following 2 operations are performed at this
stage:
- Unstable extrema that have low contrast are
discarded.
- The DoG function finds edges. Keypoints that
belong to edges aren't well localized along the
edge. This makes them very unstable to small
amounts of noise, and therefore this type of
keypoint is discarded too.
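Setting the derivative of the quadratic model to zero gives a small linear system: for D(x) ≈ D + gᵀx + ½ xᵀHx, the offset of the extremum solves H x̂ = −g. A sketch with an illustrative 1-D check (function name and example are mine):

```python
import numpy as np

def refine_extremum(grad, hessian):
    """Offset x_hat of the extremum of the local quadratic model
    D(x) ~ D + g.T x + 0.5 x.T H x, obtained by setting its
    derivative to zero: solve H x_hat = -g."""
    return np.linalg.solve(np.asarray(hessian, float),
                           -np.asarray(grad, float))

# 1-D check: D(x) = -(x - 0.3)**2 sampled at x = 0 has gradient
# dD/dx = -2(x - 0.3) = 0.6 and second derivative -2, so the
# interpolated extremum lies at offset 0.3 from the sample.
offset = refine_extremum([0.6], [[-2.0]])
print(round(float(offset[0]), 3))  # -> 0.3
```

In SIFT the same solve is done in three dimensions (x, y, σ), with the gradient and Hessian estimated by finite differences of neighbouring DoG samples; an offset above 0.5 in any dimension triggers the re-sampling described above.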
25 SIFT - Scale Invariant Feature Transform
- Accurate key point localization
26 SIFT - Scale Invariant Feature Transform
- Orientation assignment
- Select the image of the Gaussian pyramid L that's
closest to the scale of the selected keypoint.
- For each image sample in that scale, precompute
the gradient magnitude and orientation.
- Build an orientation histogram with 36 bins from
the region around the keypoint in the following
manner: every gradient that is added to the
corresponding bin is weighted by its magnitude
and by a circular Gaussian window with σ = 1.5
times the σ of the corresponding scale.
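The histogram construction can be sketched as follows, assuming precomputed magnitude and orientation patches centred on the keypoint (the function name and the toy input are mine, not from the paper):

```python
import numpy as np

def orientation_histogram(mag, ori, sigma):
    """36-bin (10 degrees each) orientation histogram over a patch
    centred on the keypoint; each gradient votes with its magnitude
    times a circular Gaussian weight of the given sigma."""
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    weight = np.exp(-((yy - cy)**2 + (xx - cx)**2) / (2 * sigma**2))
    bins = (np.degrees(ori) % 360 // 10).astype(int) % 36
    hist = np.zeros(36)
    np.add.at(hist, bins.ravel(), (mag * weight).ravel())
    return hist

mag = np.ones((16, 16))
ori = np.full((16, 16), np.radians(83.0))  # every gradient points at 83 deg
hist = orientation_histogram(mag, ori, 1.5 * 1.6)
print(int(hist.argmax()) * 10)             # -> 80 (the 80-90 deg bin)
```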
27 SIFT - Scale Invariant Feature Transform
- Orientation assignment
- The keypoint's orientation is chosen to be the
peak of the histogram. If there are other local
peaks within 80% of the highest peak → create
additional keypoints.
- As in accurate keypoint localization, a parabola
is fit to the histogram peaks to get a more
precise estimate of the dominant gradient
direction.
28 SIFT - Scale Invariant Feature Transform
- At this stage every keypoint has been assigned an
image location, scale and orientation. These
parameters impose a repeatable local 2D
coordinate system in which the local image
regions are described. The generated descriptors
are invariant to these transformations.
29 SIFT - Scale Invariant Feature Transform
- Generation of the image descriptor
- We want to compute a descriptor that is invariant
to the remaining variations (illumination,
viewpoint)!
- In the primary visual cortex one can find complex
neurons that respond to a gradient at a
particular direction and spatial frequency, but
the location of the gradient on the retina is
allowed to shift over a small receptive field.
- Edelman et al.'s hypothesis: the function of these
neurons allows matching and recognizing 3D objects
from a range of viewpoints. Experiments showed
that matching gradients while allowing for shifts
in their position indeed improves classification
under 3D rotation.
- This is the key idea that is used in descriptor
generation.
30 SIFT - Scale Invariant Feature Transform
- Generation of the image descriptor
- Image gradient magnitudes and orientations at all
levels of the pyramid have been precomputed
(orientation step).
- The gradients are sampled in a small window
around every keypoint with respect to the scale
the keypoint belongs to. → scale invariance
- The gradients are rotated relative to the
keypoint's orientation. → rotation invariance
- Gradient magnitudes are weighted with a Gaussian
weighting function located at the center of the
window, with σ = window size / 2.
→ Avoids sudden changes in the descriptor when the
window position is shifted (samples at the borders
now have a smaller influence).
31 SIFT - Scale Invariant Feature Transform
- Generation of the image descriptor
- Now angle-discretised gradient orientation
histograms are built. The value of an entry in
a histogram is calculated as the sum of all
gradient magnitudes from the corresponding
subwindow whose orientations match the direction
of the entry.
32 SIFT - Scale Invariant Feature Transform
- Generation of the image descriptor - affine
invariance
- The histograms are invariant to positional shifts
of the gradients, as long as the gradients don't
cross the bounds of the window subregions.
- To minimize the effects of crossing between
subregions and discretised angles, the assignment
of a particular gradient magnitude is done by
trilinear interpolation, so affine invariance is
improved.
- The image descriptor is a vector that contains
the values of the gradient orientation histogram
entries.
- In the paper they sample a 16x16 region that is
divided into 4x4 subregions. The gradient
orientation is discretised to angles of 45°, so
each histogram has 8 entries.
→ the resulting image descriptor is a 128-element
vector.
33 SIFT - Scale Invariant Feature Transform
- Generation of the image descriptor - illumination
invariance
- What remains is the question of illumination
invariance:
- A change in image contrast means multiplication
of the gradients by a constant → this is canceled
by vector normalization.
- A brightness change means addition of a constant
value to each pixel → gradient computation is not
affected.
- Non-linear illumination change can strongly
influence the magnitude of certain gradients, but
has almost no influence on their orientations.
Therefore D. Lowe puts a threshold of 0.2 on the
entries of the unit feature vector and then
renormalizes the vector afterwards. The threshold
of 0.2 was experimentally evaluated.
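The normalize-clip-renormalize step is small enough to show directly. A sketch (the function name is mine; the 0.2 threshold is the value from the paper):

```python
import numpy as np

def normalize_descriptor(vec, clip=0.2):
    """Unit-normalize (cancels contrast changes), clip every entry
    at 0.2 (damps gradients inflated by non-linear illumination
    change), then renormalize to unit length."""
    v = np.asarray(vec, float)
    v = v / np.linalg.norm(v)
    v = np.minimum(v, clip)
    return v / np.linalg.norm(v)

d = np.zeros(128)
d[0] = 10.0                    # one gradient dominates the vector
d[1:] = 0.1
out = normalize_descriptor(d)
print(round(float(np.linalg.norm(out)), 6))             # -> 1.0
print(bool(out[0] < d[0] / np.linalg.norm(d)))          # -> True
```

After clipping, the dominant entry's share of the vector shrinks, so no single (possibly illumination-inflated) gradient can dominate the distance between two descriptors.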
34 SIFT - Scale Invariant Feature Transform
- Image descriptor - sensitivity to affine change
35 SIFT - Scale Invariant Feature Transform
- Image descriptor - distinctiveness
36 Overview
- SIFT - Scale invariant feature transform
- Object Recognition using SIFT KEYS
37 Object Recognition using SIFT KEYS
- 1) Keypoint matching
- The keypoints of an image are matched to the
database of keypoints retrieved from training
images using a nearest neighbour algorithm.
- 2) Clustering of matched keys
- Clusters of at least 3 matched features are
identified that agree on an object and its pose.
These are interpreted as the occurrence of an
object.
- 3) Fitting a geometric model
- Each cluster is checked by performing a detailed
geometric fit to the model. The quality of the
fit is used to accept or reject the
interpretation.
- OBJECT RECOGNIZED IN IMAGE
38 Object Recognition using SIFT KEYS
- Keypoint matching - quality of matching
- A keypoint match is defined as the nearest
neighbour (NN) found in the database. The nearest
neighbour is the keypoint whose descriptor has
the minimum Euclidean distance.
- However, an image may contain features that won't
have any correct match in the training database:
- the feature may result from background clutter
- the feature wasn't detected in the training phase
- The second-closest neighbour (SCN) is defined as
the closest feature in the database that is known
to belong to a different object.
- We must find a way to discard these features.
39 Object Recognition using SIFT KEYS
- Keypoint matching - quality of matching
- The quality of a keypoint match is defined by the
ratio between the NN and SCN distances. This
measure performs well because correct matches
must have the NN significantly closer than the
SCN in order to achieve reliable matching.
- All matches with a ratio > 0.8 are discarded.
40 Object Recognition using SIFT KEYS
- Keypoint matching - NN search
- Finding the nearest neighbours in a database is
done by searching. If the database is large,
linear search is not applicable.
- A better approach for searching in
high-dimensional spaces are k-d trees. But k-d
trees too lose their advantage at dimensions > 10.
- Therefore an approximate algorithm from Beis &
Lowe is used, called BBF (Best-Bin-First), which
returns the NN with high probability. BBF is
similar to k-d tree NN search.
- BBF enforces an upper limit on how many bins are
inspected.
- Standard NN search parses the tree according to
the structure that is inherent to the tree after
it has been built. BBF parses the tree in an
order that inspects first the leaf nodes with the
least distance to the query point.
41 Object Recognition using SIFT KEYS
- K-d trees - the data structure for NN search
- The following recursive procedure creates a k-d
tree from a set of k-dimensional points
P = {p1, ..., pn} ⊂ IR^k that are bounded by a
hypercuboid H:
- find the dimension i where P exhibits the
greatest variance
- find the point pm ∈ P whose i-th entry mi is the
median in dimension i
- create a new tree element with (i, mi)
- divide P into P(i < mi) and P(i > mi)
- repeat the procedure with P(i < mi) and P(i > mi)
- This way H is divided recursively into smaller
hypercuboids. The hypercuboids represented by the
leaves of the k-d tree contain the points that
are included in their volume. Therefore they are
now called bins.
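The recursion above maps almost line for line onto code. A sketch using nested dictionaries as tree nodes (my own representation; a real implementation would store points only in the leaves' bins, as here):

```python
import numpy as np

def build_kdtree(points):
    """Recursive k-d tree: split on the dimension of greatest
    variance at the median point; leaves ('bins') keep the points."""
    pts = np.asarray(points, float)
    if len(pts) <= 1:
        return {'leaf': pts}
    i = int(pts.var(axis=0).argmax())   # dimension of greatest variance
    order = pts[:, i].argsort()
    m = len(pts) // 2
    split = float(pts[order[m], i])     # the median entry m_i
    return {'dim': i, 'split': split,
            'left': build_kdtree(pts[order[:m]]),
            'right': build_kdtree(pts[order[m:]])}

tree = build_kdtree([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
print(tree['dim'], tree['split'])  # -> 0 7.0
```

For this point set the x-coordinates vary more than the y-coordinates, so the root splits on dimension 0 at the median x-value 7.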
42 Object Recognition using SIFT KEYS
- K-d trees - Example of a 2-d tree
43 Object Recognition using SIFT KEYS
- K-d trees - Example of a 2-d tree
44 Object Recognition using SIFT KEYS
- K-d trees - Example of a 2-d tree
45 Object Recognition using SIFT KEYS
- K-d trees - Example of a 2-d tree
46 Object Recognition using SIFT KEYS
- K-d trees - Example of a 3-d tree
47 Object Recognition using SIFT KEYS
- Clustering with the Hough transform
- Test images may contain multiple objects that the
system has learned (they can be different ones or
the same object in different poses).
- The ratio between NN and SCN is a good criterion
for discarding false matches arising from
background clutter, but doesn't solve the problem
of matched keypoints that belong to other valid
objects.
- Therefore we need to identify clusters of
features with a consistent interpretation in
terms of an object and its pose.
- The probability of the interpretation represented
by such a cluster is higher, the more features
belong to the cluster.
- Clustering is done with the Hough transform.
48 Object Recognition using SIFT KEYS
- Clustering with the Hough transform
- Imagine that we trained the system with the
images of these strange creatures. The SIFT KEYS
were created and stored in the database. As
mentioned earlier, SIFT KEYS contain the local
coordinate system that underlay the creation of
the image descriptor. Furthermore, for every SIFT
KEY it is known to which object(s) it belongs.
49 Object Recognition using SIFT KEYS
- Clustering with the Hough transform
- Now we want to recognize the creatures in the
following test image.
- Suppose the NN matching has the following
results. We have one false match from the fish!
50 Object Recognition using SIFT KEYS
- Clustering with the Hough transform
- We can do the Hough transform using a hash table,
as we know the coordinate systems of the SIFT
KEYS in the DB as well as in the image.
- Every key votes for the interpretation of an
image region as a known object at a certain
location, scale and orientation (a
transformation).
51 Object Recognition using SIFT KEYS
- Clustering with the Hough transform
- All clusters that collected at least 3 votes
advance to the geometric fitting step.
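The hash-based voting can be sketched as follows. The bin widths (32 pixels, one octave of scale, 30° of orientation) are illustrative values of my choosing, and each match here already carries the pose it predicts; the slide's point is only the mechanism of binning and counting:

```python
import math
from collections import defaultdict

def cluster_votes(matches, loc_bin=32.0, ori_bin=30.0, min_votes=3):
    """Coarse Hough transform over pose space using a hash table.
    Each matched key votes for (object, binned location, binned
    log2-scale, binned orientation); bins with at least min_votes
    survive as pose clusters."""
    bins = defaultdict(list)
    n_ori = int(360 // ori_bin)
    for m in matches:
        key = (m['object'],
               int(m['x'] // loc_bin), int(m['y'] // loc_bin),
               int(round(math.log2(m['scale']))),
               int(m['orientation'] // ori_bin) % n_ori)
        bins[key].append(m)
    return {k: v for k, v in bins.items() if len(v) >= min_votes}

# Four consistent fish votes and one stray crab vote:
matches = ([{'object': 'fish', 'x': 40 + i, 'y': 60 + i,
             'scale': 1.0, 'orientation': 10.0} for i in range(4)]
           + [{'object': 'crab', 'x': 300, 'y': 12,
               'scale': 2.1, 'orientation': 200.0}])
clusters = cluster_votes(matches)
print(len(clusters))  # -> 1 (the lone crab vote is discarded)
```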
52 Object Recognition using SIFT KEYS
- Fitting a geometric model - least-squares
solution
- Each cluster of SIFT KEYS with at least 3 entries
is subject to a verification procedure. With a
least-squares solution we try to find the best
affine parameters that relate the model image in
the DB to the test image.
- The affine transformation of a model point (x,y)^T
to an image point (u,v)^T can be written as
- An affine transformation accounts correctly for
3D rotation of planar surfaces under orthographic
projection. For general 3D objects this is not
the case.
53 Object Recognition using SIFT KEYS
- Fitting a geometric model
- The equation can be reformulated to the form
Ax = b, a linear system whose least-squares
solution can be computed via the normal
equations: x = [A^T A]^(-1) A^T b.
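The stacking of the system and its solution can be sketched directly. The parameter layout (u = m1·x + m2·y + tx, v = m3·x + m4·y + ty) follows the affine model above; the function name and the synthetic check are mine, and `lstsq` is used in place of forming the normal equations explicitly:

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares affine transform [[m1 m2],[m3 m4]] plus
    translation t mapping model points (x, y) to image points
    (u, v): stack two rows of A per correspondence, solve A p = b."""
    A, b = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        A.append([x, y, 0, 0, 1, 0])
        A.append([0, 0, x, y, 0, 1])
        b.extend([u, v])
    p, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float),
                            rcond=None)
    return p[:4].reshape(2, 2), p[4:]

# Recover a known transform: 30 deg rotation, scale 2, shift (5, -3)
th = np.radians(30)
M = 2 * np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], float)
dst = src @ M.T + np.array([5.0, -3.0])
m, t = fit_affine(src, dst)
print(np.allclose(m, M), np.allclose(t, [5.0, -3.0]))  # -> True True
```

Three correspondences give six equations for the six unknowns; with more than three the system is overdetermined, which is exactly why a least-squares solution is used.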
54 Object Recognition using SIFT KEYS
- Fitting a geometric model - iterative process
- Outliers can be removed by checking for agreement
between each image feature and the fitted model.
- If fewer than 3 features remain after discarding
outliers, the match is rejected (the
interpretation associated with the cluster is
considered to be false).
- As outliers are discarded, the least-squares
solution is re-solved with the remaining
features. This process (the three steps above) is
repeated in an iterative manner.
55 Overview
- SIFT - Scale invariant feature transform
- Object Recognition using SIFT KEYS
56 Object Recognition using SIFT KEYS
- Results
- The training images are shown on the left. The
keypoints used for recognition are shown as
squares, with an extra line indicating
orientation. The size of a square indicates the
image region that was used for the construction
of the descriptor.
57 Object Recognition using SIFT KEYS
- An example image where the background is strongly
cluttered. This one may be difficult to recognize
for humans too!
- The viewpoint was rotated by an angle of 30°
compared to the image from which the training
samples were taken.
58 Object Recognition using SIFT KEYS
- Results
- The original size of the image from the first
recognition example is 600x480 pixels; the size
of the second one is 640x315.
- In both cases the time required for the
recognition of all objects is less than 0.3 s on
a 2 GHz Pentium 4 processor.
- In general, textured planar surfaces can be
reliably detected over a rotation in depth of
about 50° in any direction and under almost any
illumination condition (sufficient light must be
provided, no glare).
- For general 3D objects, the range of rotation in
depth diminishes to 30° and illumination change
is more disruptive.
59 Object Recognition using SIFT KEYS
60 Object Recognition using SIFT KEYS
61 Object Recognition using SIFT KEYS
- Further research
- Systematic tests with databases that contain
images representing multiple views /
multiple illuminations
- Extension to color descriptors (Brown & Lowe,
2002)
- Incorporation of feature types other than
gradients, e.g. texture measurements
- Learning features that are suited to recognizing
whole object categories
62 Overview
- SIFT - Scale invariant feature transform
- Object Recognition using SIFT KEYS
63 Literature list
- Lowe, D.G. (2004). Distinctive Image Features
from Scale-Invariant Keypoints. International
Journal of Computer Vision, 60, 2 (2004), pp.
91-110. http://www.cs.ubc.ca/lowe/papers/ijcv04-abs.html
- Lowe, D.G. (1999). Object recognition from local
scale-invariant features. In International
Conference on Computer Vision, Corfu, Greece, pp.
1150-1157. http://www.cs.ubc.ca/spider/lowe/papers/iccv99-abs.html
- Lindeberg, T. (1994). Scale-space theory: A basic
tool for analysing structures at different
scales. Journal of Applied Statistics, 21(2):
224-270. http://www.nada.kth.se/tony/abstracts/Lin94-SI-abstract.html
64 Literature list
- Beis, J. and Lowe, D.G. (1997). Shape indexing
using approximate nearest-neighbour search in
high-dimensional spaces. In Conference on
Computer Vision and Pattern Recognition, Puerto
Rico, pp. 1000-1006. http://www.cs.ubc.ca/spider/lowe/papers/cvpr97-abs.html
- Sample, N., Haines, M., Arnold, M. and Purcell, T.
(2001). Optimizing Search Strategies in k-d
Trees. http://graphics.stanford.edu/tpurcell/pubs/search.pdf
65 The End - Thank you for your attention