Title: Part 3: Supervised Learning
1. Machine Learning Techniques for Computer Vision
- Part 3: Supervised Learning
Christopher M. Bishop
Microsoft Research Cambridge
ECCV 2004, Prague
2. Overview of Part 3
- Linear models for regression and classification
- Decision theory
- Discriminative versus generative methods
- The curse of dimensionality
- Sparse kernel machines, boosting
- Neural networks
3. Linear Basis Function Models
- Prediction given by a linear combination of basis functions: y(x, w) = Σ_j w_j φ_j(x)
- Example: a polynomial in a single variable x, so that the basis functions are given by φ_j(x) = x^j
4. Least Squares
- Minimize the sum-of-squares error function E(w) = (1/2) Σ_n { y(x_n, w) − t_n }²
5. Least Squares Solution
- Exact closed-form minimizer w = (Φ^T Φ)^{-1} Φ^T t, where t = (t_1, ..., t_N)^T and Φ is the design matrix given by Φ_nj = φ_j(x_n)
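A minimal NumPy sketch of this closed-form fit with a polynomial basis; the function names, the sin(2πx) toy data, and the use of lstsq (rather than an explicit matrix inverse) are illustrative choices, not part of the tutorial.

```python
import numpy as np

def design_matrix(x, degree):
    """Phi[n, j] = phi_j(x_n) = x_n**j for a polynomial basis."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, t, degree):
    """Closed-form minimizer w = (Phi^T Phi)^{-1} Phi^T t (via lstsq for stability)."""
    Phi = design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Example: noisy samples of sin(2*pi*x), as in the curve-fitting illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
print(fit_least_squares(x, t, degree=3))
```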
6. Model Complexity
7. Generalization Error
8. Regularization
- Discourage large coefficient values by adding a penalty term to the error: Ẽ(w) = E(w) + (λ/2) ‖w‖²
- Also called ridge regression, shrinkage, or weight decay
- The regularization coefficient λ now controls the effective model complexity
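The same closed-form fit with the quadratic penalty folded in, as a minimal sketch; Phi is the design matrix from the previous sketch and lam stands for the regularization coefficient λ.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized solution w = (lambda*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```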
9. Regularized M = 9 Polynomial
10. Regularized Parameters
11. Generalization
12. Probability Theory
- Target values are corrupted with noise, which is intrinsically unpredictable from the observed inputs
- The inputs themselves may also be noisy
- The most complete description of the data is the joint distribution p(x, t)
- The parameters of any model are also uncertain: Bayesian probabilities
13. Decision Theory
- Loss function L(t, y(x))
- the loss incurred in choosing y(x) when the truth is t
- Minimize the average, or expected, loss
- Two phases:
- Inference: model the probability distribution (hard)
- Decision: choose the optimal output (easy)
14. Squared Loss
- A common choice for regression is the squared loss L(t, y(x)) = { y(x) − t }²
- The minimum expected loss is given by the conditional average y(x) = E[t | x]
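Written out, the step from the squared loss to the conditional average:

```latex
\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt,
\qquad
\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})}
  = 2 \int \{ y(\mathbf{x}) - t \} \, p(\mathbf{x}, t) \, dt = 0
\;\Longrightarrow\;
y(\mathbf{x}) = \int t \, p(t \mid \mathbf{x}) \, dt = \mathbb{E}[t \mid \mathbf{x}].
```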
15. Squared Loss
16. Classification
- Assign an input vector x to one of two or more classes C_k
- Joint distribution of data and classes: p(x, C_k)
- Any decision rule divides the input space into decision regions R_k separated by decision boundaries
17. Minimum Misclassification Rate
- Simplest loss: minimize the number of misclassifications
- For two classes, p(mistake) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
- Since p(x, C_k) = p(C_k | x) p(x), this says: assign x to the class for which the posterior p(C_k | x) is largest
18. Minimum Misclassification Rate
19. General Loss Matrix for Classification
- The loss in choosing class C_j when the true class is C_k is denoted L_kj
- Expected loss given by E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
- Minimized by choosing the class C_j which minimizes Σ_k L_kj p(C_k | x)
- again, this is trivial once we know p(C_k | x)
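A minimal NumPy sketch of this decision stage: given posteriors from the inference stage and a loss matrix, pick the class with the smallest expected loss. The particular posteriors and loss values below are made up for illustration.

```python
import numpy as np

# p(C_k|x), assumed already obtained from the (hard) inference stage
posteriors = np.array([0.7, 0.2, 0.1])

# L[k, j] = loss of choosing class j when the true class is k (illustrative values:
# here missing class 1 is ten times more costly than the other mistakes)
L = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

expected_loss = posteriors @ L           # entry j is sum_k p(C_k|x) * L[k, j]
decision = int(np.argmin(expected_loss)) # here class 1, despite class 0's larger posterior
print(decision, expected_loss)
```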
20. Generative vs. Discriminative Models
- Generative approach: separately model the class-conditional densities p(x | C_k) and priors p(C_k), then evaluate the posterior probabilities using Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)
- Discriminative approaches:
- model the posterior probabilities p(C_k | x) directly
- or just predict the class label (no inference stage)
21. Generative vs. Discriminative
22. Unlabelled Data
23. Generative Methods
- Pro: relatively straightforward to characterize invariances
- Pro: they can handle partially labelled data
- Con: they waste flexibility modelling variability which is unimportant for classification
- Con: they scale badly with the number of classes and the number of invariant transformations (slow on test data)
24. Discriminative Methods
- Pro: they use the flexibility of the model in relevant regions of input space
- Pro: they can be extremely fast once trained
- Con: they interpolate between training examples, and hence can fail if novel inputs are presented
- Con: they don't easily handle compositionality (e.g. faces can have glasses and/or moustaches)
25. Advantages of Knowing Posterior Probabilities
- No re-training if the loss matrix changes (e.g. screening)
- inference is hard, the decision stage is easy
- Reject option: don't make a decision when the largest posterior probability is less than a threshold (e.g. screening)
- Compensating for skewed class priors (e.g. screening)
- Combining models, e.g. independent measurements
26. Curve Fitting Re-visited
- Probabilistic formulation
- Assume the target data are generated from a deterministic function plus additive Gaussian noise: t = y(x, w) + ε
- Conditional distribution: p(t | x, w, β) = N(t | y(x, w), β⁻¹)
27. Maximum Likelihood
- Training data set {(x_n, t_n)}, n = 1, ..., N
- Likelihood function p(t | X, w, β) = Π_n N(t_n | y(x_n, w), β⁻¹)
- Log likelihood (written out below)
- Maximum likelihood is equivalent to least squares
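Spelling out why maximum likelihood reduces to least squares under the Gaussian noise model:

```latex
\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
  = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2
    + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi),
```

so maximizing over w is the same as minimizing the sum-of-squares error E(w).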
28. Parameter Prior
- Gaussian prior p(w | α) = N(w | 0, α⁻¹ I)
- Log posterior probability ln p(w | t) = ln p(t | w) + ln p(w) + const
- MAP (maximum posterior) is equivalent to regularized least squares with λ = α / β
- Bayesian optimization of λ (the model complexity)
- requires marginalization over w
29. Classification: Two Classes
- Posterior class probability p(C_1 | x) = σ(a), where a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]
- σ(a) = 1 / (1 + exp(−a)) is called the logistic sigmoid function
30. Logistic Regression
- Fit a parameterized model directly
- Target variable t ∈ {0, 1}
- Class probability y = p(C_1 | φ) = σ(w^T φ)
- Log likelihood gives the cross-entropy error E(w) = − Σ_n { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
31. Logistic Regression
- Fixed non-linear basis functions φ(x)
- convex optimization problem
- efficient Newton-Raphson method (IRLS); a sketch follows below
- decision boundaries are linear in φ but non-linear in x
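A minimal NumPy sketch of the IRLS / Newton-Raphson updates for this model; Phi is the design matrix of fixed basis functions, t holds the 0/1 targets, and the small eps term added to the Hessian is an illustrative numerical safeguard of mine, not part of the algorithm as presented.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(Phi, t, n_iter=20, eps=1e-6):
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                 # predicted p(C_1 | phi_n)
        R = y * (1.0 - y)                    # diagonal of the weighting matrix
        grad = Phi.T @ (y - t)               # gradient of the cross-entropy error
        H = Phi.T @ (Phi * R[:, None]) + eps * np.eye(M)   # Hessian Phi^T R Phi
        w = w - np.linalg.solve(H, grad)     # Newton-Raphson step
    return w
```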
32. Basis Functions
33. Classification: More Than Two Classes
- Posterior probability p(C_k | x) = exp(a_k) / Σ_j exp(a_j), where a_k = ln p(x | C_k) p(C_k)
- Called the softmax, or normalized exponential
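A one-function NumPy sketch of the softmax; subtracting the largest activation before exponentiating is a standard numerical-stability device added here, not something from the slides.

```python
import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())      # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))  # posterior probabilities summing to one
```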
34. Question
Why not simply use fixed basis function models
for all pattern recognition problems?
35. A History Lesson: the Perceptron (1957)
36. Perceptron Learning Algorithm
- Perceptron function y(x) = f(w^T φ(x)), with f(a) = +1 for a ≥ 0 and −1 otherwise
- For each mis-classified pattern in turn, update the weights: w ← w + φ(x_n) t_n, where the target values are t_n ∈ {−1, +1}
- Guaranteed to converge in a finite number of steps, if there exists an exact solution
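A minimal NumPy sketch of the learning rule above, with an explicit bias feature φ_0(x) = 1; the data layout and the epoch cap are illustrative.

```python
import numpy as np

def train_perceptron(X, t, max_epochs=100):
    """X: (N, D) inputs, t: targets in {-1, +1}."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # identity basis plus bias
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:                # mis-classified pattern
                w = w + phi_n * t_n                   # perceptron update
                errors += 1
        if errors == 0:                               # converged (separable case)
            break
    return w
```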
37. Perceptron Hardware
38. Perceptrons (1969)
"The perceptron has many features that attract attention: its linearity, its intriguing learning theorem, its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile." (Minsky and Papert, pp. 231-232)
39. Curse of Dimensionality
40. Intrinsic Dimensionality
- Data often lives on a much lower-dimensional manifold
- example: images of a rigid object
- Also, for most problems the outputs are smooth functions of the inputs, so we can use interpolation
41. Adaptive Basis Functions: Strategy 1
- Position the basis functions in regions of input space occupied by the data
- one basis function on each data point
- Select from this set of fixed candidates during training
- Support Vector Machine (SVM)
- Relevance Vector Machine (RVM)
42. Support Vector Machine
- Consider two linearly-separable classes and a linear model y(x) = w^T φ(x) + b
- Maximizing the margin gives a sparse solution
43. Maximum Margin
- Justification from statistical learning theory
- Bayesian marginalization also gives a large margin
- e.g. logistic regression
44. Quadratic Programming
- Extend to a non-linear feature space φ(x)
- Target values t_n ∈ {−1, +1}
- Maximize the dual quadratic form (a convex optimization problem) L̃(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0
45. Overlapping Classes
46. Kernels
- The SVM solution depends only on dot products k(x_n, x_m) = φ(x_n)^T φ(x_m)
- The feature space can be high (even infinite) dimensional
- Kernels must be symmetric and positive definite (Mercer condition)
- Examples:
- polynomial: k(x, x') = (x^T x' + c)^M
- Gaussian: k(x, x') = exp(−‖x − x'‖² / 2σ²)
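As a usage-level sketch (scikit-learn is an assumption here, not a tool used in the tutorial), a Gaussian-kernel SVM on toy two-class data, with the sparsity of the solution visible in the number of support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # class 0 cluster
               rng.normal(3, 1, (50, 2))])     # class 1 cluster
t = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0)                 # Gaussian kernel; C allows class overlap
clf.fit(X, t)
print(len(clf.support_))                       # number of support vectors (sparsity)
```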
47. Example: Face Detection
- Romdhani, Torr, Schölkopf and Blake (2001)
- Cascade of ever more complex (slower) models
- low false-negative rate at each step
- c.f. the boosting hierarchy of Viola and Jones (2001)
48. Face Detection
49. Face Detection
50. Face Detection
51. Face Detection
52. Face Detection
53. Adaboost
- The final classifier is a linear combination of weak classifiers: y(x) = sign( Σ_m α_m y_m(x) )
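A minimal NumPy sketch of AdaBoost with decision stumps as the weak classifiers; the brute-force stump search, the number of rounds, and the small floor on the weighted error are illustrative choices.

```python
import numpy as np

def best_stump(X, t, w):
    """Return (error, feature, threshold, sign) minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != t].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, t, n_rounds=10):
    """Targets t in {-1, +1}; returns the weak classifiers and their weights."""
    w = np.full(len(t), 1.0 / len(t))               # data weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, j, thr, sign = best_stump(X, t, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)     # weight of this weak classifier
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * t * pred)           # emphasize mis-classified points
        w = w / w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Final classifier: sign of a linear combination of the weak classifiers."""
    F = sum(a * s * np.where(X[:, j] > thr, 1, -1)
            for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(F)
```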
54. Simple Features
55. Limitations of the SVM
- Two classes only
- Large number of kernels (in spite of sparsity)
- Kernels must satisfy the Mercer criterion
- Cross-validation needed to set the parameters C (and ε)
- Decisions at the outputs instead of posterior probabilities
56. Multiple Classes
57. Relevance Vector Machine
- Linear model, as for the SVM: y(x) = w^T φ(x)
- Regression: p(t | x, w, β) = N(t | y(x), β⁻¹)
- Classification: p(C_1 | x) = σ(y(x))
58. Relevance Vector Machine
- Gaussian prior for w with one hyper-parameter α_i per weight
- Marginalize over w (written out below)
- sparse solution
- automatic relevance determination
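The marginalization written out, following the standard relevance vector machine formulation:

```latex
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}\!\left( w_i \mid 0, \alpha_i^{-1} \right),
\qquad
p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta)
  = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}.
```

Maximizing this marginal likelihood over the α_i drives many of them to infinity, pruning the corresponding basis functions and giving the sparse solution.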
59. SVM-RVM Comparison
- figure comparing the SVM and RVM solutions
60. RVM Tracking
- Williams, Blake and Cipolla (2003)
61. RVM Tracking
62. Adaptive Basis Functions: Strategy 2
- Neural networks
- Use a small number of efficient planar basis functions
- Adapt the parameters of the basis functions by global optimization of a cost function
63. Neural Networks for Regression
- The simplest model has two layers of adaptive functions: y_k(x, w) = Σ_j w_kj^(2) h( Σ_i w_ji^(1) x_i )
- note: the network diagram is not a probabilistic graphical model
64. Neural Networks for Classification
- For binary classification use logistic sigmoid
- For K-class classification use softmax function
65. General Topologies
66. Error Minimization
- For regression use the sum-of-squares error
- For classification use the cross-entropy error
- Minimize the error function using:
- gradient descent
- conjugate gradients
- quasi-Newton methods
- Requires derivatives of the error function
- efficiently evaluated using error back-propagation: O(W) cost, compared to O(W²) for direct evaluation
67. Error Back-propagation
- Derived from the chain rule for partial derivatives
- Three stages (sketched below):
- Evaluate an error signal δ_k at the output units
- Propagate the signals backwards through the network
- Evaluate the derivatives ∂E/∂w_ji = δ_j z_i
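A minimal NumPy sketch of the three stages for a two-layer regression network trained by plain gradient descent; the hidden-layer size, learning rate, and sin(πx) toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)

H = 5                                         # number of tanh hidden units
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)
eta = 0.05                                    # learning rate

for _ in range(2000):
    # Forward pass
    z = np.tanh(x @ W1 + b1)                  # hidden-unit activations
    y = z @ W2 + b2                           # linear outputs for regression
    # Stage 1: error signals at the output units (sum-of-squares error)
    delta_out = y - t
    # Stage 2: propagate the signals backwards through the network
    delta_hid = (delta_out @ W2.T) * (1.0 - z ** 2)   # tanh'(a) = 1 - tanh(a)^2
    # Stage 3: evaluate derivatives and take a gradient-descent step
    W2 -= eta * z.T @ delta_out / len(x); b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * x.T @ delta_hid / len(x); b1 -= eta * delta_hid.mean(axis=0)
```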
68. Synthetic Data
69. Convolutional Neural Networks
Le Cun et al.
70. Classification
71. Noise Robustness
72. Face Detection
- Osadchy, Miller and LeCun (2003)
73. Summary of Part 3
- Decision theory
- Generative versus discriminative approaches
- Linear models and the curse of dimensionality
- Selecting basis functions
- support vector machine
- relevance vector machine
- Adapting basis functions
- neural networks
74. Suggested Reading
Oxford University Press
75. New Book
- Pattern Recognition and Machine Learning
- Springer (2005)
- 600 pages, hardback, four colour, low price
- Graduate-level text book
- worked solutions to all 250 exercises
- complete lectures on www
- Matlab software and companion text with Ian Nabney
76. Viewgraphs and Papers
- http://research.microsoft.com/cmbishop