Title: Part 3: Supervised Learning
1. Machine Learning Techniques for Computer Vision
- Part 3: Supervised Learning
Christopher M. Bishop
Microsoft Research Cambridge
ECCV 2004, Prague
2. Overview of Part 3
- Linear models for regression and classification
- Decision theory
- Discriminative versus generative methods
- The curse of dimensionality
- Sparse kernel machines, boosting
- Neural networks
3. Linear Basis Function Models
- Prediction given by a linear combination of basis functions: y(x, w) = Σ_j w_j φ_j(x)
- Example: a polynomial in a single variable x, so that the basis functions are given by φ_j(x) = x^j
4. Least Squares
- Minimize the sum-of-squares error function E(w) = (1/2) Σ_n { y(x_n, w) − t_n }²
5. Least Squares Solution
- Exact closed-form minimizer w = (Φ^T Φ)^{-1} Φ^T t, where t = (t_1, ..., t_N)^T and Φ is the design matrix given by Φ_nj = φ_j(x_n)
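A minimal NumPy sketch of this closed-form fit with a polynomial basis; the function names, the sin(2πx) toy data, and the use of lstsq (rather than an explicit matrix inverse) are illustrative choices, not part of the tutorial.

```python
import numpy as np

def design_matrix(x, degree):
    """Phi[n, j] = phi_j(x_n) = x_n**j for a polynomial basis."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, t, degree):
    """Closed-form minimizer w = (Phi^T Phi)^{-1} Phi^T t (via lstsq for stability)."""
    Phi = design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Example: noisy samples of sin(2*pi*x), as in the curve-fitting illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
print(fit_least_squares(x, t, degree=3))
```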
6. Model Complexity
7. Generalization Error
8. Regularization
- Discourage large coefficient values by adding a penalty term to the error: Ẽ(w) = E(w) + (λ/2) ‖w‖²
- Also called ridge regression, shrinkage, or weight decay
- The regularization coefficient λ now controls the effective model complexity
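The same closed-form fit with the quadratic penalty folded in, as a minimal sketch; Phi is the design matrix from the previous sketch and lam stands for the regularization coefficient λ.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized solution w = (lambda*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```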
9. Regularized M = 9 Polynomial
10. Regularized Parameters
11. Generalization
12. Probability Theory
- Target values are corrupted with noise, which is intrinsically unpredictable from the observed inputs
- The inputs themselves may also be noisy
- The most complete description of the data is the joint distribution p(x, t)
- The parameters of any model are also uncertain: Bayesian probabilities
13. Decision Theory
- Loss function L(t, y(x))
- the loss incurred in choosing y(x) when the truth is t
- Minimize the average, or expected, loss
- Two phases:
- Inference: model the probability distribution (hard)
- Decision: choose the optimal output (easy)
14. Squared Loss
- A common choice for regression is the squared loss L(t, y(x)) = { y(x) − t }²
- The minimum expected loss is given by the conditional average y(x) = E[t | x]
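Written out, the step from the squared loss to the conditional average:

```latex
\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt,
\qquad
\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})}
  = 2 \int \{ y(\mathbf{x}) - t \} \, p(\mathbf{x}, t) \, dt = 0
\;\Longrightarrow\;
y(\mathbf{x}) = \int t \, p(t \mid \mathbf{x}) \, dt = \mathbb{E}[t \mid \mathbf{x}].
```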
15. Squared Loss
16. Classification
- Assign an input vector x to one of two or more classes C_k
- Joint distribution of data and classes: p(x, C_k)
- Any decision rule divides the input space into decision regions R_k separated by decision boundaries
17. Minimum Misclassification Rate
- Simplest loss: minimize the number of misclassifications
- For two classes, p(mistake) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
- Since p(x, C_k) = p(C_k | x) p(x), this says: assign x to the class for which the posterior p(C_k | x) is largest
18. Minimum Misclassification Rate
19. General Loss Matrix for Classification
- The loss in choosing class C_j when the true class is C_k is denoted L_kj
- Expected loss given by E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
- Minimized by choosing the class C_j which minimizes Σ_k L_kj p(C_k | x)
- again, this is trivial once we know p(C_k | x)
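A minimal NumPy sketch of this decision stage: given posteriors from the inference stage and a loss matrix, pick the class with the smallest expected loss. The particular posteriors and loss values below are made up for illustration.

```python
import numpy as np

# p(C_k|x), assumed already obtained from the (hard) inference stage
posteriors = np.array([0.7, 0.2, 0.1])

# L[k, j] = loss of choosing class j when the true class is k (illustrative values:
# here missing class 1 is ten times more costly than the other mistakes)
L = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

expected_loss = posteriors @ L           # entry j is sum_k p(C_k|x) * L[k, j]
decision = int(np.argmin(expected_loss)) # here class 1, despite class 0's larger posterior
print(decision, expected_loss)
```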
20. Generative vs. Discriminative Models
- Generative approach: separately model the class-conditional densities p(x | C_k) and priors p(C_k), then evaluate the posterior probabilities using Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)
- Discriminative approaches:
- model the posterior probabilities p(C_k | x) directly
- or just predict the class label (no inference stage)
21. Generative vs. Discriminative
22. Unlabelled Data
23. Generative Methods
- Pro: relatively straightforward to characterize invariances
- Pro: they can handle partially labelled data
- Con: they waste flexibility modelling variability which is unimportant for classification
- Con: they scale badly with the number of classes and the number of invariant transformations (slow on test data)
24. Discriminative Methods
- Pro: they use the flexibility of the model in relevant regions of input space
- Pro: they can be extremely fast once trained
- Con: they interpolate between training examples, and hence can fail if novel inputs are presented
- Con: they don't easily handle compositionality (e.g. faces can have glasses and/or moustaches)
25. Advantages of Knowing Posterior Probabilities
- No re-training if the loss matrix changes (e.g. screening)
- inference is hard, the decision stage is easy
- Reject option: don't make a decision when the largest posterior probability is less than a threshold (e.g. screening)
- Compensating for skewed class priors (e.g. screening)
- Combining models, e.g. independent measurements
26. Curve Fitting Re-visited
- Probabilistic formulation
- Assume the target data are generated from a deterministic function plus additive Gaussian noise: t = y(x, w) + ε
- Conditional distribution: p(t | x, w, β) = N(t | y(x, w), β⁻¹)
27. Maximum Likelihood
- Training data set {(x_n, t_n)}, n = 1, ..., N
- Likelihood function p(t | X, w, β) = Π_n N(t_n | y(x_n, w), β⁻¹)
- Log likelihood (written out below)
- Maximum likelihood is equivalent to least squares
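Spelling out why maximum likelihood reduces to least squares under the Gaussian noise model:

```latex
\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
  = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2
    + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi),
```

so maximizing over w is the same as minimizing the sum-of-squares error E(w).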
28. Parameter Prior
- Gaussian prior p(w | α) = N(w | 0, α⁻¹ I)
- Log posterior probability ln p(w | t) = ln p(t | w) + ln p(w) + const
- MAP (maximum posterior) is equivalent to regularized least squares with λ = α / β
- Bayesian optimization of λ (the model complexity)
- requires marginalization over w
29. Classification: Two Classes
- Posterior class probability p(C_1 | x) = σ(a), where a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]
- σ(a) = 1 / (1 + exp(−a)) is called the logistic sigmoid function
30. Logistic Regression
- Fit a parameterized model directly
- Target variable t ∈ {0, 1}
- Class probability y = p(C_1 | φ) = σ(w^T φ)
- Log likelihood gives the cross-entropy error E(w) = − Σ_n { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
31. Logistic Regression
- Fixed non-linear basis functions φ(x)
- convex optimization problem
- efficient Newton-Raphson method (IRLS); a sketch follows below
- decision boundaries are linear in φ but non-linear in x
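A minimal NumPy sketch of the IRLS / Newton-Raphson updates for this model; Phi is the design matrix of fixed basis functions, t holds the 0/1 targets, and the small eps term added to the Hessian is an illustrative numerical safeguard of mine, not part of the algorithm as presented.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(Phi, t, n_iter=20, eps=1e-6):
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                 # predicted p(C_1 | phi_n)
        R = y * (1.0 - y)                    # diagonal of the weighting matrix
        grad = Phi.T @ (y - t)               # gradient of the cross-entropy error
        H = Phi.T @ (Phi * R[:, None]) + eps * np.eye(M)   # Hessian Phi^T R Phi
        w = w - np.linalg.solve(H, grad)     # Newton-Raphson step
    return w
```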
32. Basis Functions
33. Classification: More Than Two Classes
- Posterior probability p(C_k | x) = exp(a_k) / Σ_j exp(a_j), where a_k = ln p(x | C_k) p(C_k)
- Called the softmax, or normalized exponential
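A one-function NumPy sketch of the softmax; subtracting the largest activation before exponentiating is a standard numerical-stability device added here, not something from the slides.

```python
import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())      # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))  # posterior probabilities summing to one
```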
34. Question
Why not simply use fixed basis function models
for all pattern recognition problems?
35. A History Lesson: the Perceptron (1957)
36. Perceptron Learning Algorithm
- Perceptron function y(x) = f(w^T φ(x)), with f(a) = +1 for a ≥ 0 and −1 otherwise
- For each mis-classified pattern in turn, update the weights: w ← w + φ(x_n) t_n, where the target values are t_n ∈ {−1, +1}
- Guaranteed to converge in a finite number of steps, if there exists an exact solution
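A minimal NumPy sketch of the learning rule above, with an explicit bias feature φ_0(x) = 1; the data layout and the epoch cap are illustrative.

```python
import numpy as np

def train_perceptron(X, t, max_epochs=100):
    """X: (N, D) inputs, t: targets in {-1, +1}."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # identity basis plus bias
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:                # mis-classified pattern
                w = w + phi_n * t_n                   # perceptron update
                errors += 1
        if errors == 0:                               # converged (separable case)
            break
    return w
```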
37. Perceptron Hardware
38. Perceptrons (1969)
"The perceptron has many features that attract attention: its linearity, its intriguing learning theorem, its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile." (Minsky and Papert, pp. 231-232)
39. Curse of Dimensionality
40. Intrinsic Dimensionality
- Data often lives on a much lower-dimensional manifold
- example: images of a rigid object
- Also, for most problems the outputs are smooth functions of the inputs, so we can use interpolation
41. Adaptive Basis Functions: Strategy 1
- Position the basis functions in regions of input space occupied by the data
- one basis function on each data point
- Select from this set of fixed candidates during training
- Support Vector Machine (SVM)
- Relevance Vector Machine (RVM)
42. Support Vector Machine
- Consider two linearly-separable classes and a linear model y(x) = w^T φ(x) + b
- Maximizing the margin gives a sparse solution
43. Maximum Margin
- Justification from statistical learning theory
- Bayesian marginalization also gives a large margin
- e.g. logistic regression
44. Quadratic Programming
- Extend to a non-linear feature space φ(x)
- Target values t_n ∈ {−1, +1}
- Maximize the dual quadratic form (a convex optimization problem) L̃(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0
45. Overlapping Classes
46. Kernels
- The SVM solution depends only on dot products k(x_n, x_m) = φ(x_n)^T φ(x_m)
- The feature space can be high (even infinite) dimensional
- Kernels must be symmetric and positive definite (Mercer condition)
- Examples:
- polynomial: k(x, x') = (x^T x' + c)^M
- Gaussian: k(x, x') = exp(−‖x − x'‖² / 2σ²)
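As a usage-level sketch (scikit-learn is an assumption here, not a tool used in the tutorial), a Gaussian-kernel SVM on toy two-class data, with the sparsity of the solution visible in the number of support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # class 0 cluster
               rng.normal(3, 1, (50, 2))])     # class 1 cluster
t = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0)                 # Gaussian kernel; C allows class overlap
clf.fit(X, t)
print(len(clf.support_))                       # number of support vectors (sparsity)
```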
47. Example: Face Detection
- Romdhani, Torr, Schölkopf and Blake (2001)
- Cascade of ever more complex (slower) models
- low false-negative rate at each step
- c.f. the boosting hierarchy of Viola and Jones (2001)
48. Face Detection
49. Face Detection
50. Face Detection
51. Face Detection
52. Face Detection
53. Adaboost
- The final classifier is a linear combination of weak classifiers: y(x) = sign( Σ_m α_m y_m(x) )
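A minimal NumPy sketch of AdaBoost with decision stumps as the weak classifiers; the brute-force stump search, the number of rounds, and the small floor on the weighted error are illustrative choices.

```python
import numpy as np

def best_stump(X, t, w):
    """Return (error, feature, threshold, sign) minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != t].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, t, n_rounds=10):
    """Targets t in {-1, +1}; returns the weak classifiers and their weights."""
    w = np.full(len(t), 1.0 / len(t))               # data weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, j, thr, sign = best_stump(X, t, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)     # weight of this weak classifier
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * t * pred)           # emphasize mis-classified points
        w = w / w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Final classifier: sign of a linear combination of the weak classifiers."""
    F = sum(a * s * np.where(X[:, j] > thr, 1, -1)
            for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(F)
```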
54. Simple Features
55. Limitations of the SVM
- Two classes only
- Large number of kernels (in spite of sparsity)
- Kernels must satisfy the Mercer criterion
- Cross-validation needed to set the parameters C (and ε)
- Decisions at the outputs instead of posterior probabilities
56. Multiple Classes
57. Relevance Vector Machine
- Linear model, as for the SVM: y(x) = w^T φ(x)
- Regression: p(t | x, w, β) = N(t | y(x), β⁻¹)
- Classification: p(C_1 | x) = σ(y(x))
58. Relevance Vector Machine
- Gaussian prior for w with one hyper-parameter α_i per weight
- Marginalize over w (written out below)
- sparse solution
- automatic relevance determination
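The marginalization written out, following the standard relevance vector machine formulation:

```latex
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}\!\left( w_i \mid 0, \alpha_i^{-1} \right),
\qquad
p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta)
  = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}.
```

Maximizing this marginal likelihood over the α_i drives many of them to infinity, pruning the corresponding basis functions and giving the sparse solution.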
59. SVM-RVM Comparison
- figure comparing the SVM and RVM solutions
60. RVM Tracking
- Williams, Blake and Cipolla (2003)
61. RVM Tracking
62. Adaptive Basis Functions: Strategy 2
- Neural networks
- Use a small number of efficient planar basis functions
- Adapt the parameters of the basis functions by global optimization of a cost function
63. Neural Networks for Regression
- The simplest model has two layers of adaptive functions: y_k(x, w) = Σ_j w_kj^(2) h( Σ_i w_ji^(1) x_i )
- note: the network diagram is not a probabilistic graphical model
64. Neural Networks for Classification
- For binary classification use logistic sigmoid
- For K-class classification use softmax function
65. General Topologies
66. Error Minimization
- For regression use the sum-of-squares error
- For classification use the cross-entropy error
- Minimize the error function using:
- gradient descent
- conjugate gradients
- quasi-Newton methods
- Requires derivatives of the error function
- efficiently evaluated using error back-propagation: O(W) cost, compared to O(W²) for direct evaluation
67. Error Back-propagation
- Derived from the chain rule for partial derivatives
- Three stages (sketched below):
- Evaluate an error signal δ_k at the output units
- Propagate the signals backwards through the network
- Evaluate the derivatives ∂E/∂w_ji = δ_j z_i
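A minimal NumPy sketch of the three stages for a two-layer regression network trained by plain gradient descent; the hidden-layer size, learning rate, and sin(πx) toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)

H = 5                                         # number of tanh hidden units
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)
eta = 0.05                                    # learning rate

for _ in range(2000):
    # Forward pass
    z = np.tanh(x @ W1 + b1)                  # hidden-unit activations
    y = z @ W2 + b2                           # linear outputs for regression
    # Stage 1: error signals at the output units (sum-of-squares error)
    delta_out = y - t
    # Stage 2: propagate the signals backwards through the network
    delta_hid = (delta_out @ W2.T) * (1.0 - z ** 2)   # tanh'(a) = 1 - tanh(a)^2
    # Stage 3: evaluate derivatives and take a gradient-descent step
    W2 -= eta * z.T @ delta_out / len(x); b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * x.T @ delta_hid / len(x); b1 -= eta * delta_hid.mean(axis=0)
```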
68. Synthetic Data
69. Convolutional Neural Networks
Le Cun et al.
70. Classification
71. Noise Robustness
72. Face Detection
- Osadchy, Miller and LeCun (2003)
73. Summary of Part 3
- Decision theory
- Generative versus discriminative approaches
- Linear models and the curse of dimensionality
- Selecting basis functions
- support vector machine
- relevance vector machine
- Adapting basis functions
- neural networks
74. Suggested Reading
Oxford University Press
75. New Book
- Pattern Recognition and Machine Learning
- Springer (2005)
- 600 pages, hardback, four colour, low price
- Graduate-level text book
- worked solutions to all 250 exercises
- complete lectures on www
- Matlab software and companion text with Ian Nabney
76. Viewgraphs and Papers
- http://research.microsoft.com/cmbishop