Title: Fisher kernels for image representation
1. Fisher kernels for image representation
Generative classification models
- Jakob Verbeek
- December 11, 2009
2. Plan for this course
- Introduction to machine learning
- Clustering techniques
- k-means, Gaussian mixture density
- Gaussian mixture density continued
- Parameter estimation with EM
- Classification techniques 1
- Introduction, generative methods, semi-supervised
- Fisher kernels
- Classification techniques 2
- Discriminative methods, kernels
- Decomposition of images
- Topic models
3. Classification
- Training data consists of inputs, denoted x, and corresponding output class labels, denoted y.
- Goal is to correctly predict the class label corresponding to a test data input.
- Learn a classifier f(x) from the training data that outputs the class label, or a probability over the class labels.
- Example
- Input: image
- Output: category label, e.g. cat vs. no cat
- Classification can be binary (two classes), or over a larger number of classes (multi-class).
- In binary classification we often refer to one class as positive and the other as negative.
- A binary classifier creates a boundary in the input space between the areas assigned to each class.
4. Example of classification
Given training images and their categories
What are the categories of these test images?
5. Discriminative vs. generative methods
- Generative probabilistic methods
- Model the density of inputs x from each class, p(x|y)
- Estimate the class prior probability p(y)
- Use Bayes' rule to infer the distribution over classes given the input
- Discriminative (probabilistic) methods
- Directly estimate the class probability given the input, p(y|x)
- Some methods do not have a probabilistic interpretation:
- e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
6. Generative classification methods
- Generative probabilistic methods
- Model the density of inputs x from each class, p(x|y)
- Estimate the class prior probability p(y)
- Use Bayes' rule to infer the distribution over classes given the input
- Modeling class-conditional densities over the inputs x
- Selection of the model class:
- Parametric models, such as Gaussian (for continuous data) or Bernoulli (for binary data)
- Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis
- Non-parametric models: histograms over one-dimensional or multi-dimensional data, the nearest-neighbor method, the kernel density estimator
- Given the class-conditional models, classification is trivial: just apply Bayes' rule (a sketch follows below)
- Adding new classes can be done by adding a new class-conditional model
- Existing class-conditional models stay as they are
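To make the Bayes-rule step concrete, here is a minimal Python sketch assuming 1-D Gaussian class-conditional densities; the parameter values and function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Minimal sketch: generative classification via Bayes' rule, assuming
# 1-D Gaussian class-conditional densities (all parameters are made up).
def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior(x, means, variances, priors):
    """Return p(y|x) for every class y via Bayes' rule."""
    likelihoods = np.array([gaussian_pdf(x, m, v) for m, v in zip(means, variances)])
    joint = likelihoods * priors          # p(x|y) p(y)
    return joint / joint.sum()            # normalize to obtain p(y|x)

# Two classes with different means and equal priors; classify x = 0.8.
print(posterior(0.8, means=[0.0, 1.0], variances=[1.0, 1.0], priors=np.array([0.5, 0.5])))
```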
7. Histogram methods
- Suppose we
- have N data points
- use a histogram with C cells
- How to set the density level in each cell?
- Maximum (log-)likelihood estimator (a sketch follows below)
- Proportional to the number of points n in the cell
- Inversely proportional to the volume V of the cell
- Problems with the histogram method
- The number of cells scales exponentially with the dimension of the data
- Discontinuous density estimate
- How to choose the cell size?
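A minimal 1-D sketch of the histogram estimator described above, with a density level per cell proportional to the point count and inversely proportional to the cell volume; the toy data and bin edges are illustrative assumptions.

```python
import numpy as np

# Histogram density estimator: per cell, count / (N * cell volume).
def histogram_density(data, edges):
    counts, _ = np.histogram(data, bins=edges)
    volumes = np.diff(edges)                  # cell widths (volumes in 1-D)
    return counts / (len(data) * volumes)     # density level per cell

data = np.random.randn(1000)                  # N = 1000 samples
edges = np.linspace(-4.0, 4.0, 21)            # C = 20 cells
density = histogram_density(data, edges)
print(density.sum() * np.diff(edges)[0])      # close to 1: the estimate integrates to ~1
```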
8. The curse of dimensionality
- The number of bins increases exponentially with the dimensionality of the data.
- Fine division of each dimension: many empty bins
- Rough division of each dimension: poor density model
- A probability distribution over D discrete variables takes at least 2^D values
- At least 2 values for each variable
- The number of cells may be reduced by assuming independence between the components of x: the naïve Bayes model
- The model is "naïve" since it assumes that all variables are independent
- Unrealistic for high-dimensional data, where variables tend to be dependent
- Poor density estimator
- Classification performance can still be good using the derived p(y|x)
9. Example of generative classification
- Hand-written digit classification
- Input: binary 28x28 scanned digit images, collected in a 784-dimensional vector
- Desired output: class label of the image
- Generative model
- Independent Bernoulli model for each class
- One probability per pixel per class
- The maximum likelihood estimator is the average value per pixel per class (a sketch follows below)
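Below is a minimal sketch of this independent-Bernoulli (naïve Bayes) classifier; the toy data, the smoothing constant eps, and the function names are illustrative assumptions.

```python
import numpy as np

# Independent Bernoulli model per class (naive Bayes) for binary digit images.
# X: (N, 784) binary vectors, y: class labels in 0..9.
def fit_bernoulli_model(X, y, n_classes=10, eps=1e-3):
    # ML estimate: per-class, per-pixel probability = average pixel value
    mu = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    mu = np.clip(mu, eps, 1 - eps)                     # avoid log(0)
    prior = np.array([(y == c).mean() for c in range(n_classes)])
    return mu, prior

def classify(X, mu, prior):
    # log p(x|y) + log p(y) under the independent Bernoulli model
    log_lik = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    return np.argmax(log_lik + np.log(prior), axis=1)

# Toy usage with random binary "images" (every class is present).
X = (np.random.rand(100, 784) > 0.5).astype(float)
y = np.tile(np.arange(10), 10)
mu, prior = fit_bernoulli_model(X, y)
print(classify(X[:5], mu, prior))
```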
10. k-nearest-neighbor estimation method
- Idea: fix the number of samples in the cell, and find the right cell size.
- The probability to find a point in a sphere A centered on x with volume v is P = ∫_A p(x') dx'
- A smooth density is approximately constant in a small region, and thus P ≈ p(x) v
- Alternatively, estimate P from the fraction of training data in a sphere around x: P ≈ k / N
- Combine the above to obtain the estimate p(x) ≈ k / (N v)
11. k-nearest-neighbor estimation method
- Method in practice (a sketch follows below)
- Choose k
- For a given x, compute the volume v that contains k samples.
- Estimate the density with p(x) ≈ k / (N v)
- The volume of a sphere with radius r in d dimensions is v = π^(d/2) r^d / Γ(d/2 + 1)
- What effect does k have?
- Data sampled from a mixture of Gaussians is plotted in green
- Larger k: larger region, smoother estimate
- Selection of k
- Leave-one-out cross-validation
- Select the k that maximizes the data log-likelihood
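A minimal sketch of the estimator just described, combining the k-nearest-neighbor radius with the d-dimensional sphere volume; the toy data and names are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

# k-NN density estimate: p(x) ~ k / (N * v), where v is the volume of the
# smallest sphere around x that contains k training points.
def knn_density(x, data, k):
    n, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k - 1]                                      # radius capturing k points
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d  # d-dimensional sphere volume
    return k / (n * volume)

data = np.random.randn(500, 2)                 # samples from a 2-D Gaussian
print(knn_density(np.zeros(2), data, k=10))    # density estimate at the origin
```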
12. k-nearest-neighbor classification rule
- Use k-nearest-neighbor density estimation to find p(x|category)
- Apply Bayes' rule for classification: k-nearest-neighbor classification (a sketch follows below)
- Find the sphere volume v that captures k data points for the estimate p(x) ≈ k / (N v)
- Use the same sphere for the per-class estimates p(x|y) ≈ k_y / (N_y v)
- Estimate the global class priors as p(y) ≈ N_y / N
- Calculate the class posterior distribution p(y|x) ∝ p(x|y) p(y), which reduces to k_y / k
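Below is a minimal sketch of this rule: the estimated posterior for class y is simply the fraction of the k nearest neighbors that carry label y; the toy data and names are illustrative assumptions.

```python
import numpy as np

# k-NN classification: estimate p(y|x) as the fraction k_y / k of the
# k nearest training points that belong to class y.
def knn_classify(x, data, labels, k, n_classes):
    dists = np.linalg.norm(data - x, axis=1)
    neighbors = labels[np.argsort(dists)[:k]]                    # labels of k nearest points
    posterior = np.bincount(neighbors, minlength=n_classes) / k  # k_y / k per class
    return posterior.argmax(), posterior

data = np.random.randn(200, 2)
labels = (data[:, 0] > 0).astype(int)          # toy: two classes split on the first axis
print(knn_classify(np.array([1.0, 0.0]), data, labels, k=15, n_classes=2))
```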
13. k-nearest-neighbor classification rule
- Effect of k on classification boundary
- Larger number of neighbors
- Larger regions
- Smoother class boundaries
14. Kernel density estimation methods
- Consider a simple estimator of the cumulative distribution function: the fraction of data points x_n ≤ x
- Its derivative gives an estimator of the density function, but this is just a set of delta peaks.
- The derivative is defined as p(x) = lim_{h→0} (F(x+h) - F(x-h)) / (2h)
- Consider a non-limiting value of h
- Each data point adds 1/(2hN) in a region of size h on either side of it; the sum of these blocks gives the estimate
15. Kernel density estimation methods
- Can use a function other than the block to obtain a smooth estimator (a sketch follows below).
- A widely used kernel function is the (multivariate) Gaussian
- The contribution decreases smoothly as a function of the distance to the data point.
- Choice of the smoothing parameter
- A larger kernel gives a smoother density estimator
- Use the average distance between samples.
- Use cross-validation.
- The method can be used for multivariate data
- Or in a naïve Bayes model
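A minimal 1-D sketch of kernel density estimation with a Gaussian kernel, p(x) = (1/N) Σ_n N(x; x_n, h²); the data, bandwidth value, and names are illustrative assumptions.

```python
import numpy as np

# Gaussian kernel density estimate in 1-D: average of Gaussians of width h
# centered on the training points.
def kde(x, data, h):
    diffs = (x - data) / h
    kernels = np.exp(-0.5 * diffs ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()

data = np.random.randn(200)                    # training samples
grid = np.linspace(-3, 3, 7)
print([round(kde(x, data, h=0.3), 3) for x in grid])
```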
16. Summary of generative classification methods
- (Semi-)parametric models (e.g. p(data | category) is a Gaussian or a mixture)
- No need to store the data, but possibly too strong assumptions on the data density
- Can lead to a poor fit on the data, and a poor classification result
- Non-parametric models
- Histograms
- Only practical in low-dimensional spaces (< 5 dimensions or so)
- A high-dimensional space leads to many cells, many of which will be empty
- Naïve Bayes modeling in higher-dimensional cases
- k-nearest neighbor and kernel density estimation
- Need to store all training data
- Need to find the nearest neighbors, or the points with non-zero kernel evaluation (costly)
[Figures: histogram, k-NN, and kernel density estimates]
17. Discriminative vs. generative methods
- Generative probabilistic methods
- Model the density of inputs x from each class, p(x|y)
- Estimate the class prior probability p(y)
- Use Bayes' rule to infer the distribution over classes given the input
- Discriminative (probabilistic) methods (next week)
- Directly estimate the class probability given the input, p(y|x)
- Some methods do not have a probabilistic interpretation:
- e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
- Hybrid generative-discriminative models
- Fit a density model to the data
- Use properties of this model as input for a classifier
- Example: Fisher vectors for image representation
18. Clustering for visual vocabulary construction
- Clustering of local image descriptors
- using k-means or a mixture of Gaussians
- Recap of the image representation pipeline (a sketch follows below)
- Extract image regions at various locations and scales
- Compute a descriptor for each region (e.g. SIFT)
- (Soft-)assign each descriptor to the clusters
- Make a histogram for the complete image
- by summing the vector representations of the descriptors
- Input to the image classification method
[Figure: image regions assigned to cluster indexes]
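A minimal sketch of the hard-assignment version of this pipeline, after descriptor extraction: each descriptor is mapped to its nearest cluster center and the image is represented by the normalized word-count histogram; the toy descriptors, vocabulary size, and names are illustrative assumptions.

```python
import numpy as np

# Bag-of-visual-words: assign each local descriptor to its nearest k-means
# center and accumulate a histogram of visual-word counts for the image.
def bow_histogram(descriptors, centers):
    # squared distances between all descriptors (N x D) and centers (K x D)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)                     # index of the nearest visual word
    hist = np.bincount(assignments, minlength=len(centers))
    return hist / hist.sum()                            # normalized word-count histogram

descriptors = np.random.rand(500, 128)                  # toy SIFT-like descriptors
centers = np.random.rand(64, 128)                       # toy visual vocabulary
print(bow_histogram(descriptors, centers)[:5])
```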
19. Fisher vector motivation
- Feature vector quantization is computationally expensive in practice
- Run-time is linear in
- N: number of feature vectors, ~10^3 per image
- D: number of dimensions, ~10^2 (SIFT)
- K: number of clusters, ~10^3 for recognition
- So in total on the order of 10^8 multiplications per image to assign SIFT descriptors to visual words
- We use the histogram of visual word counts
- Can we do this more efficiently?!
- Reading material: "Fisher Kernels on Visual Vocabularies for Image Categorization", F. Perronnin and C. Dance, CVPR'07, Xerox Research Centre Europe, Meylan
20. Fisher vector image representation
- MoG / k-means stores the number of points per cell
- Need many clusters to represent the distribution of descriptors in an image
- But this increases the computational cost
- The Fisher vector adds 1st and 2nd order moments
- More precise description of the regions assigned to each cluster
- Fewer clusters needed for the same accuracy
- Representation is (2D+1) times larger, at the same computational cost
- The terms are already calculated when computing the soft-assignment
- q_nk: soft-assignment of image region n to cluster (Gaussian mixture component) k
21. Image representation using Fisher kernels
- General idea of the Fisher vector representation
- Fit a probabilistic model to the data
- Use the derivative of the data log-likelihood as the data representation, e.g. for classification
- Jaakkola & Haussler, "Exploiting generative models in discriminative classifiers", in Advances in Neural Information Processing Systems 11, 1999.
- Here, we use a mixture of Gaussians to cluster the region descriptors
- Concatenate the derivatives to obtain the data representation
22. Image representation using Fisher kernels
- Extended representation of the image descriptors using a MoG (a sketch follows below)
- Displacement of the descriptor from the center
- Squares of the displacement from the center
- From 1 number per descriptor per cluster, to 1+D+D^2 (D: data dimension)
- A simplified version is obtained when
- using this representation for a linear classifier
- using diagonal covariance matrices, with the variance in each dimension given by the vector v_k
- For a single image region descriptor
- Summed over all descriptors this gives us, per cluster:
- 1: soft count of regions assigned to the cluster
- D: weighted average of the assigned descriptors
- D: weighted variance of the descriptors in all dimensions
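The sketch below computes, per cluster, the three sets of statistics listed above (soft count, weighted mean displacement, weighted second-order term) for a diagonal-covariance MoG and concatenates them into a (2D+1)K-dimensional vector; the exact normalization, the toy inputs, and the names are illustrative assumptions.

```python
import numpy as np

# Unnormalized Fisher-vector statistics for a diagonal-covariance MoG.
# X: (N, D) region descriptors, q: (N, K) soft assignments q_nk,
# means: (K, D) cluster centers, variances: (K, D) diagonal variances.
def fisher_statistics(X, q, means, variances):
    counts = q.sum(axis=0)                                     # K soft counts
    diff = X[:, None, :] - means[None, :, :]                   # (N, K, D) displacements
    mean_stats = (q[:, :, None] * diff / np.sqrt(variances)).sum(axis=0)     # (K, D)
    var_stats = (q[:, :, None] * (diff ** 2 / variances - 1.0)).sum(axis=0)  # (K, D)
    # concatenate into a single (2D+1)*K dimensional image representation
    return np.concatenate([counts, mean_stats.ravel(), var_stats.ravel()])

X = np.random.rand(300, 64)                                # toy descriptors
means = np.random.rand(8, 64)
variances = np.full((8, 64), 0.1)
q = np.random.rand(300, 8)
q /= q.sum(axis=1, keepdims=True)                          # rows sum to one
print(fisher_statistics(X, q, means, variances).shape)     # (8 * (2*64 + 1),) = (1032,)
```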
23. Fisher vector image representation
- MoG / k-means stores the number of points per cell
- Need many clusters to represent the distribution of descriptors in an image
- The Fisher vector adds 1st and 2nd order moments
- More precise description of the regions assigned to each cluster
- Fewer clusters needed for the same accuracy
- Representation is (2D+1) times larger, at the same computational cost
- The terms are already calculated when computing the soft-assignment
- Computational cost is O(NKD): we need the differences between all clusters and all data points
24. Images from the PASCAL VOC categorization task
- Yearly competition for image classification (also object localization, segmentation, and body-part localization)
25. Fisher vector results
- BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific
- MAP: assign the image to the class whose MoG assigns maximum likelihood to the region descriptors
- The other results are based on a linear classifier over the image descriptions
- Similar performance, using 16x fewer Gaussians
- The unsupervised/universal representation works well
26. Plan for this course
- Introduction to machine learning
- Clustering techniques
- k-means, Gaussian mixture density
- Gaussian mixture density continued
- Parameter estimation with EM
- Classification techniques 1
- Introduction, generative methods, semi-supervised
- Reading for next week
- Previous papers (!), nothing new
- Available on the course website: http://lear.inrialpes.fr/verbeek/teaching
- Classification techniques 2
- Discriminative methods, kernels
- Decomposition of images