Title: CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
1. CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
- Linear Discriminant Functions
- Chapter 5 (Duda et al.)
2. Statistical vs. Discriminant Approach
- Parametric/non-parametric density estimation techniques find the decision boundaries by first estimating the probability distribution of the patterns belonging to each class.
- In the discriminant-based approach, the decision boundary is constructed explicitly.
- Knowledge of the form of the probability distribution is not required.
3. Discriminant Approach
- Classification is viewed as learning good
decision boundaries that separate the examples
belonging to different classes in a data set.
4. Discriminant Function Estimation
- Specify a parametric form of the decision boundary (e.g., linear or quadratic).
- Find the best decision boundary of the specified form using a set of training examples.
- This is done by minimizing a criterion function, e.g., the training error (or sample risk).
5. Linear Discriminant Functions
- A linear discriminant function is a linear combination of its components:
- g(x) = w^t x + w0
- where w is the weight vector and w0 is the bias (or threshold weight).
6. Linear Discriminant Functions: two-category case
- Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
- If g(x) = 0, then x is on the decision boundary and can be assigned to either class.
7. Linear Discriminant Functions: two-category case (cont'd)
- If g(x) is linear, the decision boundary is a hyperplane.
- The orientation of the hyperplane is determined by w and its location by w0.
- w is the normal to the hyperplane.
- If w0 = 0, the hyperplane passes through the origin.
8. Interpretation of g(x)
- g(x) provides an algebraic measure of the distance of x from the hyperplane.
- Write x = xp + r (w / ||w||), where xp is the projection of x onto the hyperplane, r is the signed distance, and w / ||w|| specifies the direction of r.
9. Interpretation of g(x) (cont'd)
- Substituting the above expression in g(x) gives g(x) = w^t xp + w0 + r ||w|| = r ||w|| (since xp lies on the hyperplane).
- This gives the distance of x from the hyperplane: r = g(x) / ||w||.
- w0 determines the distance of the hyperplane from the origin: g(0) = w0, so that distance is w0 / ||w|| (see the sketch below).
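- A minimal NumPy sketch of the two-category decision rule and the distance interpretation above; the weight vector w, bias w0, and test point x are made-up illustration values, not from the slides:

```python
import numpy as np

# Illustrative 2-D weights (not from the slides): w is the normal, w0 the bias.
w = np.array([2.0, 1.0])
w0 = -3.0

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0 (boundary points: either class)."""
    return "omega_1" if g(x) > 0 else "omega_2"

def signed_distance(x):
    """r = g(x) / ||w||, the algebraic distance of x from the hyperplane."""
    return g(x) / np.linalg.norm(w)

x = np.array([1.0, 2.0])
print(classify(x), signed_distance(x))   # g(x) = 2 + 2 - 3 = 1 > 0, so omega_1
```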
10. Linear Discriminant Functions: multi-category case
- There are several ways to devise multi-category classifiers using linear discriminant functions:
- One against the rest (i.e., c-1 two-class problems).
11. Linear Discriminant Functions: multi-category case (cont'd)
- One against another (i.e., c(c-1)/2 pairs of
classes)
12. Linear Discriminant Functions: multi-category case (cont'd)
- To avoid the problem of ambiguous regions:
- Define c linear discriminant functions gi(x), i = 1, ..., c.
- Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
- The resulting classifier is called a linear machine.
13. Linear Discriminant Functions: multi-category case (cont'd)
14. Linear Discriminant Functions: multi-category case (cont'd)
- The boundary between two regions Ri and Rj is a portion of the hyperplane given by gi(x) = gj(x), i.e., (wi - wj)^t x + (wi0 - wj0) = 0.
- The decision regions for a linear machine are convex (see the sketch below).
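- A small NumPy sketch of a linear machine; the weight matrix W and biases w0 are arbitrary illustration values (c = 3 classes, d = 2 features):

```python
import numpy as np

# Illustrative linear machine: rows of W are the weight vectors w_i, w0 holds the biases w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])

def linear_machine(x):
    """Assign x to the class omega_i with the largest g_i(x) = w_i^t x + w_i0."""
    g = W @ x + w0
    return int(np.argmax(g))

print(linear_machine(np.array([2.0, 1.0])))   # g = [2, 1, -2.5], so class 0 wins
```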
15. Higher-Order Discriminant Functions
- Can produce more complicated decision boundaries than linear discriminant functions.
- For example, quadratic discriminant functions give hyperquadric decision boundaries.
16. Higher-Order Discriminant Functions (cont'd)
- Generalized discriminant: g(x) = Σ_{i=1..d̂} ai yi(x) = a^t y
- a is a d̂-dimensional weight vector.
- The functions yi(x) are called φ functions.
- The functions yi(x) map points from the d-dimensional x-space to the d̂-dimensional y-space (usually d̂ >> d).
17. Generalized Discriminant Functions
- The resulting discriminant function is not linear in x, but it is linear in y.
- The generalized discriminant separates points in the transformed space by a hyperplane passing through the origin.
18. Generalized Discriminant Functions (cont'd)
- Example: g(x) = a1 + a2 x + a3 x², i.e., the φ functions give y = (1, x, x²)^t.
- This maps a line in x-space to a parabola in y-space.
- The plane a^t y = 0 divides the y-space into two decision regions.
- The corresponding decision regions R1, R2 in x-space are not simply connected! (See the sketch below.)
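- A brief NumPy sketch of this example; the weight vector a below is an arbitrary choice (g(x) = x² - 1) used only to show that the resulting x-space regions split into disconnected pieces:

```python
import numpy as np

def phi(x):
    """Map a scalar x to y = (1, x, x^2)^t; a line in x-space becomes a parabola in y-space."""
    return np.array([1.0, x, x * x])

# Arbitrary weight vector for illustration: g(x) = a^t y = x^2 - 1 (quadratic in x, linear in y).
a = np.array([-1.0, 0.0, 1.0])

for x in [-2.0, 0.0, 2.0]:
    g = a @ phi(x)
    # R1 = {x : |x| > 1} has two disconnected pieces; R2 = {x : |x| < 1} is the middle interval.
    print(x, g, "R1" if g > 0 else "R2")
```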
19. Generalized Discriminant Functions (cont'd)
20. Generalized Discriminant Functions (cont'd)
- Practical issues:
- Computationally intensive.
- Lots of training examples are required to determine a if d̂ is very large (i.e., the curse of dimensionality).
21. Notation: augmented feature/weight vectors
- Augment the feature and weight vectors with the bias term: y = (1, x^t)^t and a = (w0, w^t)^t, both of d+1 dimensions, so that g(x) = w^t x + w0 = a^t y.
- The decision hyperplane passes through the origin in y-space (see the sketch below).
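- A tiny NumPy sketch of the augmentation; w, w0, and x are illustrative values:

```python
import numpy as np

def augment(x):
    """Augmented feature vector y = (1, x^t)^t, with d+1 dimensions."""
    return np.concatenate(([1.0], x))

# Illustrative weights: a = (w0, w^t)^t absorbs the bias, so g(x) = a^t y.
w, w0 = np.array([2.0, 1.0]), -3.0
a = np.concatenate(([w0], w))

x = np.array([1.0, 2.0])
# Same value either way; in y-space the hyperplane a^t y = 0 passes through the origin.
assert np.isclose(a @ augment(x), w @ x + w0)
```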
22. Two-Category, Linearly Separable Case
- Given a linear discriminant function g(x) = a^t y, the goal is to learn the weights using a set of n labeled samples (i.e., examples and their associated classes).
- Classification rule:
- If a^t yi > 0, assign yi to ω1
- else if a^t yi < 0, assign yi to ω2
23. Two-Category, Linearly Separable Case (cont'd)
- Every training sample yi places a constraint on the weight vector a.
- Given n examples, the solution must lie in the intersection of n half-spaces.
(figure: solution region in weight space, axes a1 and a2, for g(x) = a^t y)
24. Two-Category, Linearly Separable Case (cont'd)
- The solution vector is usually not unique!
- Impose constraints to enforce uniqueness (normalized version):
- If yi is in ω2, replace yi by -yi.
- Then find a such that a^t yi > 0 for all i (see the sketch below).
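- A short NumPy sketch of this normalization step; the sample matrix, the label encoding (1 for ω1, 2 for ω2), and the values are all made up for illustration:

```python
import numpy as np

def normalize(Y, labels):
    """Flip the sign of every omega_2 sample so that a solution satisfies a^t y_i > 0 for all i.

    Y: (n, d+1) augmented samples; labels: 1 for omega_1, 2 for omega_2 (assumed encoding)."""
    Y = Y.copy()
    Y[labels == 2] *= -1.0
    return Y

# Tiny made-up example with two augmented 2-D samples.
Y = np.array([[1.0, 2.0, 1.0],    # omega_1 sample
              [1.0, 0.5, 0.2]])   # omega_2 sample
labels = np.array([1, 2])
print(normalize(Y, labels))       # the omega_2 row comes out sign-flipped
```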
25. Two-Category, Linearly Separable Case (cont'd)
- Constrain the margin: find the minimum-length a with a^t yi ≥ b > 0 for all i.
- This moves the solution toward the center of the feasible region.
26. Iterative Optimization
- Define a criterion function J(a) that is minimized if a is a solution vector.
- Minimize J(a) iteratively: a(k+1) = a(k) + η(k) p(k), where p(k) is the search direction and η(k) is the learning rate.
27. Gradient Descent
- Choose the search direction to be the negative gradient: a(k+1) = a(k) - η(k) ∇J(a(k)), where η(k) is the learning rate.
28. Gradient Descent (cont'd)
29. Gradient Descent (cont'd)
- If the learning rate is too large, gradient descent overshoots the minimum and can oscillate or diverge.
30. Gradient Descent (cont'd)
- How should the learning rate η(k) be chosen?
- Second-order Taylor series expansion: J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)), where H is the Hessian of J at a(k).
- Substituting the gradient descent update and minimizing over η gives the optimum learning rate: η(k) = ||∇J||² / (∇J^t H ∇J).
- Note: if J(a) is quadratic, H is constant, so the learning rate is constant! (See the sketch below.)
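- A small NumPy sketch of gradient descent with this optimum learning rate on a quadratic criterion; the matrix H and vector b below are arbitrary illustration values:

```python
import numpy as np

# Illustrative quadratic criterion J(a) = 1/2 a^t H a - b^t a, so grad J = H a - b and H is the Hessian.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(a):
    return H @ a - b

a = np.zeros(2)
for k in range(50):
    g = grad_J(a)
    if np.linalg.norm(g) < 1e-8:
        break
    eta = (g @ g) / (g @ H @ g)    # optimum learning rate ||grad J||^2 / (grad J^t H grad J)
    a = a - eta * g                # gradient descent: a(k+1) = a(k) - eta(k) grad J(a(k))
print(a, np.linalg.solve(H, b))    # converges to the minimizer H^{-1} b
```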
31. Newton's Method
- Update rule: a(k+1) = a(k) - H^{-1} ∇J(a(k)); note that it requires inverting H!
32. Newton's Method (cont'd)
- If the error function is quadratic, Newton's method converges in one step! (See the sketch below.)
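- A one-step NumPy sketch of Newton's method on the same illustrative quadratic criterion used in the gradient descent sketch above:

```python
import numpy as np

# Same illustrative quadratic criterion: J(a) = 1/2 a^t H a - b^t a, grad J = H a - b, Hessian H.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
a = a - np.linalg.solve(H, H @ a - b)   # Newton step: a(k+1) = a(k) - H^{-1} grad J(a(k))
print(a, np.linalg.solve(H, b))         # one step already reaches the minimizer of a quadratic J
```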
33. Comparison: gradient descent vs. Newton's method
34. Perceptron Rule
- Criterion function (normalized version): Jp(a) = Σ_{y ∈ Y(a)} (-a^t y)
- where Y(a) is the set of samples misclassified by a.
- If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.
35. Perceptron Rule (cont'd)
- The gradient of Jp(a) is ∇Jp(a) = Σ_{y ∈ Y(a)} (-y).
- The perceptron update rule is obtained using gradient descent: a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y.
36. Perceptron Rule (cont'd)
- Batch version: at each step, consider all examples misclassified by the current weight vector and add η(k) times their sum to a (see the sketch below).
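- A minimal NumPy sketch of the batch perceptron update; the toy sample matrix Y (already augmented and normalized) and the stopping limits are made up for illustration:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: add eta times the sum of all currently misclassified samples.

    Y holds augmented, normalized samples, so a solution satisfies a^t y > 0 for every row y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]               # the set Y(a) of misclassified samples
        if len(mis) == 0:                 # Y(a) empty -> J_p(a) = 0, a is a solution
            break
        a = a + eta * mis.sum(axis=0)     # a(k+1) = a(k) + eta * sum of misclassified y
    return a

# Tiny linearly separable toy set (augmented and normalized), made up for illustration.
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
a = batch_perceptron(Y)
print(a, (Y @ a > 0).all())
```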
37. Perceptron Rule (cont'd)
- Move the hyperplane so that training samples are
on its positive side.
38. Perceptron Rule (cont'd)
- Single-sample version with fixed increment η(k) = 1: consider one example at a time and, if it is misclassified, update a(k+1) = a(k) + y^k (see the sketch below).
- Perceptron Convergence Theorem: if the training samples are linearly separable, then the sequence of weight vectors produced by the above algorithm will terminate at a solution vector in a finite number of steps.
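- A NumPy sketch of the single-sample fixed-increment rule; the toy sample matrix Y (augmented and normalized) and the pass limit are illustrative:

```python
import numpy as np

def fixed_increment_perceptron(Y, max_passes=100):
    """Single-sample fixed-increment rule (eta(k) = 1): whenever y^k is misclassified, a <- a + y^k.

    Y holds augmented, normalized samples; for linearly separable data the loop terminates."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        errors = 0
        for y in Y:             # consider one example at a time
            if a @ y <= 0:      # misclassified by the current a
                a = a + y
                errors += 1
        if errors == 0:         # a full error-free pass -> solution found
            break
    return a

# Same illustrative toy set as in the batch sketch above.
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
print(fixed_increment_perceptron(Y))
```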
39. Perceptron Rule (cont'd)
- Example: the misclassified examples are considered in the order y2, y3, y1, y3, ...
40. Perceptron Rule (cont'd)
41. Perceptron Rule (cont'd)
- Some direct generalizations:
- Variable increment η(k) and a margin b: update a whenever a(k)^t y^k fails to exceed the margin, i.e., a(k)^t y^k ≤ b.
42. Perceptron Rule (cont'd)
43. Perceptron Rule (cont'd)
44. Relaxation Procedures
- Note that different criterion functions exist.
- One possible choice is Jq(a) = Σ_{y ∈ Y} (a^t y)²
- where Y is again the set of the training samples that are misclassified by a.
- However, there are two problems with this criterion:
- The function is too smooth and can converge to a = 0.
- Jq is dominated by the training samples with the largest magnitude.
45. Relaxation Procedures (cont'd)
- A modified version that avoids the above two problems is Jr(a) = (1/2) Σ_{y ∈ Y} (a^t y - b)² / ||y||²
- Here Y is the set of samples for which a^t y ≤ b.
- Its gradient is given by ∇Jr(a) = Σ_{y ∈ Y} (a^t y - b) y / ||y||² (see the sketch below).
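- A NumPy sketch of a single-sample relaxation rule with margin that follows from this gradient; the margin b, learning rate eta, and toy samples Y are illustrative choices (0 < eta < 2 is the usual range), and the per-sample check uses a strict inequality so samples sitting exactly at the margin count as satisfied:

```python
import numpy as np

def relaxation_with_margin(Y, b=1.0, eta=1.5, max_passes=500):
    """Single-sample relaxation: if a^t y falls short of the margin b, move a toward the margin.

    Update: a <- a + eta * (b - a^t y) * y / ||y||^2, on augmented, normalized samples Y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        if (Y @ a >= b).all():                              # every sample meets the margin
            break
        for y in Y:
            if a @ y < b:                                   # this sample violates the margin
                a = a + eta * (b - a @ y) * y / (y @ y)     # relaxation step along y
    return a

# Illustrative linearly separable toy set (augmented and normalized).
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
a = relaxation_with_margin(Y)
print(a, (Y @ a >= 1.0).all())
```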
46. Relaxation Procedures (cont'd)
47. Relaxation Procedures (cont'd)
48. Relaxation Procedures (cont'd)
49. Relaxation Procedures (cont'd)
50. Minimum Squared Error Procedures
- Minimum squared error and pseudoinverse.
- The problem is to find a weight vector a satisfying Ya = b.
- If we have more equations than unknowns, a is over-determined.
- We want to choose the a that minimizes the sum-of-squared-error criterion function Js(a) = ||Ya - b||².
51. Minimum Squared Error Procedures (cont'd)
- Setting the gradient ∇Js(a) = 2 Y^t (Ya - b) to zero gives a = (Y^t Y)^{-1} Y^t b = Y† b, where Y† = (Y^t Y)^{-1} Y^t is the pseudoinverse of Y.
52. Minimum Squared Error Procedures (cont'd)
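- A closing NumPy sketch of the MSE/pseudoinverse solution; the toy matrix Y and the all-ones margin vector b are illustrative choices:

```python
import numpy as np

# Illustrative over-determined system: Y is the (n, d+1) matrix of augmented, normalized samples,
# b is a vector of positive margins (all ones here, an arbitrary but common choice).
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
b = np.ones(len(Y))

# MSE solution a = Y^+ b minimizes J_s(a) = ||Ya - b||^2; lstsq solves it via the pseudoinverse.
a, *_ = np.linalg.lstsq(Y, b, rcond=None)
print(a)
print(np.allclose(a, np.linalg.pinv(Y) @ b))   # matches the explicit pseudoinverse solution
```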