Title: Machine Learning
1 Machine Learning: Category Representation
- Non-linear classification through the kernel trick
2 Linear discriminant
- Figure: linear decision boundary defined by the weight vector w
3 Motivation of generalized linear functions
- Linear classifiers are attractive because of their simplicity.
- They are computationally efficient.
- Some methods, such as the SVM and the logistic discriminant, do not suffer from an objective function with multiple local optima.
- However, the fact that only linear functions can be implemented limits their use to discriminating (approximately) linearly separable classes.
- An easy way to extend their applicability: extend or replace the original set of features with several non-linear functions of the original ones.
4 The main idea
- The data vector x is replaced by a vector φ(x) of m non-linear function responses.
5 Example of alternative feature space
- Suppose that one class encapsulates the other class.
- A linear classifier does not work very well here.
- Let's map our features to a new space spanned by (x1², √2·x1x2, x2²).
- A circle in the original space is now described by a plane in the new space (see the sketch below).
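A minimal sketch of this idea, assuming the quadratic map φ(x) = (x1², √2·x1x2, x2²) and synthetic "one class inside the other" data (the radii and classifier settings are illustrative, not from the slide):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two concentric classes: inner disk vs. surrounding ring (synthetic data).
rng = np.random.default_rng(0)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(1.5, 2.5, 100)])
angles = rng.uniform(0, 2 * np.pi, 200)
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([np.zeros(100), np.ones(100)])

def phi(X):
    """Quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.stack([X[:, 0] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1],
                     X[:, 1] ** 2], axis=1)

# In the mapped space the circular boundary x1^2 + x2^2 = r^2 is a plane,
# so a plain linear classifier separates the two classes.
clf = LinearSVC(C=1.0).fit(phi(X), y)
print("training accuracy in mapped space:", clf.score(phi(X), y))
```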
6 The kernel function of our example
- Let's compute the inner product in the new feature space we defined: φ(x)·φ(y) = x1²y1² + 2·x1x2·y1y2 + x2²y2² = (x1y1 + x2y2)² = (xᵀy)².
- Thus, we simply square the standard inner product, and do not need to explicitly map the points to the new feature space!
- This becomes useful if the new feature space has very many features.
- The kernel function is a shortcut to compute inner products in feature space, without explicitly mapping the data to that space (a sketch verifying this identity follows).
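A small check of the identity above, comparing the explicit map with the kernel shortcut on two arbitrary points:

```python
import numpy as np

def phi(x):
    """Explicit quadratic map for 2-D x: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Kernel shortcut: the squared standard inner product."""
    return np.dot(x, y) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

# Both routes give the same number, but k(x, y) never builds phi explicitly.
print(np.dot(phi(x), phi(y)))  # inner product after explicit mapping
print(k(x, y))                 # same value via the kernel
```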
7 Example: non-linear support vector machine
- Example where the classes are separable, but not linearly.
- A Gaussian kernel is used.
- Figure: decision surface of the resulting classifier (a sketch of such a model follows).
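A minimal sketch of a non-linear SVM with a Gaussian (RBF) kernel; the two-moons data set and the parameter values are assumptions for illustration, not the slide's own example:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# kernel='rbf' corresponds to k(x, y) = exp(-gamma * ||x - y||^2).
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```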
8 Nonlinear support vector machines
9 Classification
10 The kernel function
- A kernel function k(x,y) computes the inner product between x and y after mapping them to some alternative representation.
- Starting from a new representation in some feature space, we can find the corresponding kernel function as the program:
  - Map x and y to the new feature space.
  - Compute the inner product between the mapped points.
- A function k is positive definite if evaluating k(x,y) on all N² pairs of N arbitrary points always yields an N×N kernel matrix K that is positive definite (see the check sketched below).
- If the kernel computes inner products, then the kernel matrix K is positive definite.
- Mercer's theorem: if k is positive definite, then there exists a feature space in which k computes the inner products.
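A minimal sketch of the positive-definiteness check described above: build the N×N kernel matrix on arbitrary points and inspect its eigenvalues. The Gaussian kernel and the bandwidth value are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))          # N = 20 arbitrary points
K = np.array([[gaussian_kernel(x, y) for y in points] for x in points])

eigvals = np.linalg.eigvalsh(K)            # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())  # non-negative for a valid kernel
```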
11 The kernel trick has many applications
- We used kernels to compute inner products to find non-linear SVMs.
- The main advantage of the kernel trick is that we can use feature spaces with a vast number of derived non-linear features without being confronted with the computational burden of explicitly mapping the data points to this space.
- Some kernels even have an associated feature space with infinitely many dimensions, such as the Gaussian kernel.
- The same trick can also be used for many other inner-product based methods for classification, regression, clustering, and dimension reduction (see the sketch below).
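A brief sketch of the same trick in regression and dimension reduction, using scikit-learn's kernelized estimators; the toy data and parameter values are assumptions:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Kernelized regression and kernelized dimension reduction, same trick as for SVMs.
reg = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-2).fit(X, y)
emb = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
print("regression R^2:", reg.score(X, y), "embedding shape:", emb.shape)
```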
12 Example: Kernel Logistic Discriminant
- Recall the logistic discriminant: the class probability is the sigmoid of a linear score wᵀφ(x).
- The weight vector may be written as a linear combination of the data points, w = Σᵢ αᵢ φ(xᵢ), so the score becomes Σᵢ αᵢ k(xᵢ, x).
- The derivative of the log-likelihood is now taken with respect to the coefficients αᵢ and involves the kernel matrix.
- As before, the empirical expectation should equal the model expectation (a training sketch follows).
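A minimal sketch of a kernel logistic discriminant trained by gradient ascent on the log-likelihood. The Gaussian kernel, the learning rate, and the toy data are assumptions; the slide's own derivation (and any regularisation) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float)   # non-linearly separable labels

def kernel_matrix(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = kernel_matrix(X, X)
alpha = np.zeros(len(X))                  # w = sum_i alpha_i * phi(x_i)

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-K @ alpha))  # model probabilities for the score K @ alpha
    # Gradient of the log-likelihood w.r.t. alpha: kernel-weighted difference
    # between empirical labels and model probabilities.
    alpha += 1e-3 * K @ (y - p)

pred = (K @ alpha > 0).astype(float)
print("training accuracy:", (pred == y).mean())
```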
13 Image categorization with Bags-of-Words
- A range of kernel functions are used in categorization tasks.
- If an interesting image similarity measure can be shown to be a positive definite kernel, then it can be plugged into SVMs or the logistic discriminant.
- Kernels can be combined (see the sketch after this list) by:
  - Summation
  - Products
  - Exponentiation
  - Many other ways
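A small sketch of kernel combination: sums and products of positive definite kernels are again positive definite, so the combinations below remain valid kernels (the base kernels and weights are illustrative):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def gaussian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sum_kernel(x, y):
    return 0.5 * linear_kernel(x, y) + 0.5 * gaussian_kernel(x, y)

def product_kernel(x, y):
    return linear_kernel(x, y) * gaussian_kernel(x, y)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(sum_kernel(x, y), product_kernel(x, y))
```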
14 Chi-square kernel
- One of the most popular and effective kernels for image categorization.
- Measures the similarity of bag-of-words histograms x and y (implementation sketch below).
- See Zhang, Marszalek, Lazebnik & Schmid, Int. Journal of Computer Vision, 2007.
- Performance is further improved by combining chi-square kernels over different types of features:
  - Descriptors: SIFT, HOG, colour, ...
  - Different spatial decompositions of the image
  - Different ways to detect patches: interest points, dense extraction
K(x, y) = exp(−γ χ²(x, y)),  where  χ²(x, y) = Σᵢ (xᵢ − yᵢ)² / (xᵢ + yᵢ)
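A minimal sketch of this kernel on toy histograms; the value of γ and the small ε guarding against empty bins are assumptions. scikit-learn ships a comparable implementation as sklearn.metrics.pairwise.chi2_kernel.

```python
import numpy as np

def chi2_kernel(x, y, gamma=1.0, eps=1e-10):
    """k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    chi2 = np.sum((x - y) ** 2 / (x + y + eps))
    return np.exp(-gamma * chi2)

# Two toy visual-word histograms (normalised to sum to 1).
x = np.array([0.2, 0.5, 0.3, 0.0])
y = np.array([0.1, 0.4, 0.4, 0.1])
print(chi2_kernel(x, y))
```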
15 Pyramid match kernel (Grauman & Darrell, 2005)
- Approximates the optimal partial matching between sets of features.
- The kernel weights the number of new matches at each level i by the difficulty of a match at that level (simplified sketch below).
- Slide credit: Kristen Grauman
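The actual pyramid match kernel builds multi-resolution histograms over feature space; the following is only a simplified 1-D illustration of the level-wise counting and weighting described above, with bin ranges and weights chosen as assumptions:

```python
import numpy as np

def pyramid_match(a, b, levels=4, lo=0.0, hi=1.0):
    """Toy pyramid match score between two sets of 1-D features in [lo, hi]."""
    score, prev = 0.0, 0.0
    for i in range(levels):
        bins = 2 ** (levels - i)                 # level 0 = finest grid
        ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
        hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
        inter = np.minimum(ha, hb).sum()         # matches found at this level
        new = inter - prev                       # matches not already counted
        score += new / (2 ** i)                  # coarser (harder) matches weigh less
        prev = inter
    return score

rng = np.random.default_rng(0)
print(pyramid_match(rng.uniform(size=30), rng.uniform(size=30)))
```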
16 Spatial Pyramid Kernel
- Histograms of visual words in each cell (sketch below).
- Lazebnik, Schmid & Ponce, CVPR 2006
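A sketch of a spatial pyramid representation: the image is split into grids of 1×1, 2×2 and 4×4 cells and a visual-word histogram is built per cell. The `word_map` input (an H×W array of visual-word indices) and the grid levels are assumptions for illustration.

```python
import numpy as np

def spatial_pyramid(word_map, n_words, levels=(1, 2, 4)):
    H, W = word_map.shape
    feats = []
    for g in levels:
        for r in range(g):
            for c in range(g):
                cell = word_map[r * H // g:(r + 1) * H // g,
                                c * W // g:(c + 1) * W // g]
                hist = np.bincount(cell.ravel(), minlength=n_words)
                feats.append(hist / max(hist.sum(), 1))   # per-cell normalisation
    return np.concatenate(feats)

word_map = np.random.default_rng(0).integers(0, 100, size=(64, 64))
print(spatial_pyramid(word_map, n_words=100).shape)   # (1 + 4 + 16) * 100 dimensions
```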
17 Spatial Pyramid Kernel
- Histograms of gradient orientations
- Bosch, Zisserman & Munoz, Int. Conf. on Image and Video Retrieval, 2007