Title: CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
1. CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
- Linear Discriminant Functions
- Chapter 5 (Duda et al.)
2. Statistical vs. Discriminant Approach
- Parametric/non-parametric density estimation techniques find the decision boundaries by first estimating the probability distribution of the patterns belonging to each class.
- In the discriminant-based approach, the decision boundary is constructed explicitly.
- Knowledge of the form of the probability distribution is not required.
3. Discriminant Approach
- Classification is viewed as learning good
decision boundaries that separate the examples
belonging to different classes in a data set.
4. Discriminant Function Estimation
- Specify a parametric form of the decision boundary (e.g., linear or quadratic).
- Find the best decision boundary of the specified form using a set of training examples.
- This is done by minimizing a criterion function, e.g., the training error (or sample risk).
5. Linear Discriminant Functions
- A linear discriminant function is a linear combination of its components:
- g(x) = w^t x + w0
- where w is the weight vector and w0 is the bias (or threshold weight).
6. Linear Discriminant Functions: two-category case
- Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
- If g(x) = 0, then x is on the decision boundary and can be assigned to either class.
7. Linear Discriminant Functions: two-category case (cont'd)
- If g(x) is linear, the decision boundary is a hyperplane.
- The orientation of the hyperplane is determined by w and its location by w0.
- w is the normal to the hyperplane.
- If w0 = 0, the hyperplane passes through the origin.
8. Interpretation of g(x)
- g(x) provides an algebraic measure of the distance of x from the hyperplane.
- Write x = xp + r (w / ||w||), where xp is the projection of x onto the hyperplane, r is the signed distance, and w / ||w|| specifies the direction of r.
9. Interpretation of g(x) (cont'd)
- Substituting the above expression in g(x) gives g(x) = w^t xp + w0 + r ||w|| = r ||w|| (since xp lies on the hyperplane).
- This gives the distance of x from the hyperplane: r = g(x) / ||w||.
- w0 determines the distance of the hyperplane from the origin: g(0) = w0, so that distance is w0 / ||w|| (see the sketch below).
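- A minimal NumPy sketch of the two-category decision rule and the distance interpretation above; the weight vector w, bias w0, and test point x are made-up illustration values, not from the slides:

```python
import numpy as np

# Illustrative 2-D weights (not from the slides): w is the normal, w0 the bias.
w = np.array([2.0, 1.0])
w0 = -3.0

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0 (boundary points: either class)."""
    return "omega_1" if g(x) > 0 else "omega_2"

def signed_distance(x):
    """r = g(x) / ||w||, the algebraic distance of x from the hyperplane."""
    return g(x) / np.linalg.norm(w)

x = np.array([1.0, 2.0])
print(classify(x), signed_distance(x))   # g(x) = 2 + 2 - 3 = 1 > 0, so omega_1
```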
10. Linear Discriminant Functions: multi-category case
- There are several ways to devise multi-category classifiers using linear discriminant functions:
- One against the rest (i.e., c-1 two-class problems).
11. Linear Discriminant Functions: multi-category case (cont'd)
- One against another (i.e., c(c-1)/2 pairs of
classes)
12. Linear Discriminant Functions: multi-category case (cont'd)
- To avoid the problem of ambiguous regions:
- Define c linear discriminant functions gi(x), i = 1, ..., c.
- Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
- The resulting classifier is called a linear machine.
13. Linear Discriminant Functions: multi-category case (cont'd)
14. Linear Discriminant Functions: multi-category case (cont'd)
- The boundary between two regions Ri and Rj is a portion of the hyperplane given by gi(x) = gj(x), i.e., (wi - wj)^t x + (wi0 - wj0) = 0.
- The decision regions for a linear machine are convex (see the sketch below).
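- A small NumPy sketch of a linear machine; the weight matrix W and biases w0 are arbitrary illustration values (c = 3 classes, d = 2 features):

```python
import numpy as np

# Illustrative linear machine: rows of W are the weight vectors w_i, w0 holds the biases w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])

def linear_machine(x):
    """Assign x to the class omega_i with the largest g_i(x) = w_i^t x + w_i0."""
    g = W @ x + w0
    return int(np.argmax(g))

print(linear_machine(np.array([2.0, 1.0])))   # g = [2, 1, -2.5], so class 0 wins
```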
15. Higher-Order Discriminant Functions
- Can produce more complicated decision boundaries than linear discriminant functions.
- For example, quadratic discriminant functions give hyperquadric decision boundaries.
16. Higher-Order Discriminant Functions (cont'd)
- Generalized discriminant: g(x) = Σ_{i=1..d̂} ai yi(x) = a^t y
- a is a d̂-dimensional weight vector.
- The functions yi(x) are called φ functions.
- The functions yi(x) map points from the d-dimensional x-space to the d̂-dimensional y-space (usually d̂ >> d).
17. Generalized Discriminant Functions
- The resulting discriminant function is not linear in x, but it is linear in y.
- The generalized discriminant separates points in the transformed space by a hyperplane passing through the origin.
18. Generalized Discriminant Functions (cont'd)
- Example: g(x) = a1 + a2 x + a3 x², i.e., the φ functions give y = (1, x, x²)^t.
- This maps a line in x-space to a parabola in y-space.
- The plane a^t y = 0 divides the y-space into two decision regions.
- The corresponding decision regions R1, R2 in x-space are not simply connected! (See the sketch below.)
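- A brief NumPy sketch of this example; the weight vector a below is an arbitrary choice (g(x) = x² - 1) used only to show that the resulting x-space regions split into disconnected pieces:

```python
import numpy as np

def phi(x):
    """Map a scalar x to y = (1, x, x^2)^t; a line in x-space becomes a parabola in y-space."""
    return np.array([1.0, x, x * x])

# Arbitrary weight vector for illustration: g(x) = a^t y = x^2 - 1 (quadratic in x, linear in y).
a = np.array([-1.0, 0.0, 1.0])

for x in [-2.0, 0.0, 2.0]:
    g = a @ phi(x)
    # R1 = {x : |x| > 1} has two disconnected pieces; R2 = {x : |x| < 1} is the middle interval.
    print(x, g, "R1" if g > 0 else "R2")
```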
19. Generalized Discriminant Functions (cont'd)
20. Generalized Discriminant Functions (cont'd)
- Practical issues:
- Computationally intensive.
- Lots of training examples are required to determine a if d̂ is very large (i.e., the curse of dimensionality).
21. Notation: augmented feature/weight vectors
- Augment the feature and weight vectors with the bias term: y = (1, x^t)^t and a = (w0, w^t)^t, both of d+1 dimensions, so that g(x) = w^t x + w0 = a^t y.
- The decision hyperplane passes through the origin in y-space (see the sketch below).
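- A tiny NumPy sketch of the augmentation; w, w0, and x are illustrative values:

```python
import numpy as np

def augment(x):
    """Augmented feature vector y = (1, x^t)^t, with d+1 dimensions."""
    return np.concatenate(([1.0], x))

# Illustrative weights: a = (w0, w^t)^t absorbs the bias, so g(x) = a^t y.
w, w0 = np.array([2.0, 1.0]), -3.0
a = np.concatenate(([w0], w))

x = np.array([1.0, 2.0])
# Same value either way; in y-space the hyperplane a^t y = 0 passes through the origin.
assert np.isclose(a @ augment(x), w @ x + w0)
```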
22. Two-Category, Linearly Separable Case
- Given a linear discriminant function g(x) = a^t y, the goal is to learn the weights using a set of n labeled samples (i.e., examples and their associated classes).
- Classification rule:
- If a^t yi > 0, assign yi to ω1
- else if a^t yi < 0, assign yi to ω2
23. Two-Category, Linearly Separable Case (cont'd)
- Every training sample yi places a constraint on the weight vector a.
- Given n examples, the solution must lie in the intersection of n half-spaces.
(figure: solution region in weight space, axes a1 and a2, for g(x) = a^t y)
24. Two-Category, Linearly Separable Case (cont'd)
- The solution vector is usually not unique!
- Impose constraints to enforce uniqueness (normalized version):
- If yi is in ω2, replace yi by -yi.
- Then find a such that a^t yi > 0 for all i (see the sketch below).
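- A short NumPy sketch of this normalization step; the sample matrix, the label encoding (1 for ω1, 2 for ω2), and the values are all made up for illustration:

```python
import numpy as np

def normalize(Y, labels):
    """Flip the sign of every omega_2 sample so that a solution satisfies a^t y_i > 0 for all i.

    Y: (n, d+1) augmented samples; labels: 1 for omega_1, 2 for omega_2 (assumed encoding)."""
    Y = Y.copy()
    Y[labels == 2] *= -1.0
    return Y

# Tiny made-up example with two augmented 2-D samples.
Y = np.array([[1.0, 2.0, 1.0],    # omega_1 sample
              [1.0, 0.5, 0.2]])   # omega_2 sample
labels = np.array([1, 2])
print(normalize(Y, labels))       # the omega_2 row comes out sign-flipped
```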
25. Two-Category, Linearly Separable Case (cont'd)
- Constrain the margin: find the minimum-length a with a^t yi ≥ b > 0 for all i.
- This moves the solution toward the center of the feasible region.
26. Iterative Optimization
- Define a criterion function J(a) that is minimized if a is a solution vector.
- Minimize J(a) iteratively: a(k+1) = a(k) + η(k) p(k), where p(k) is the search direction and η(k) is the learning rate.
27. Gradient Descent
- Choose the search direction to be the negative gradient: a(k+1) = a(k) - η(k) ∇J(a(k)), where η(k) is the learning rate.
28. Gradient Descent (cont'd)
29. Gradient Descent (cont'd)
- If the learning rate is too large, gradient descent overshoots the minimum and can oscillate or diverge.
30. Gradient Descent (cont'd)
- How should the learning rate η(k) be chosen?
- Second-order Taylor series expansion: J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)), where H is the Hessian of J at a(k).
- Substituting the gradient descent update and minimizing over η gives the optimum learning rate: η(k) = ||∇J||² / (∇J^t H ∇J).
- Note: if J(a) is quadratic, H is constant, so the learning rate is constant! (See the sketch below.)
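- A small NumPy sketch of gradient descent with this optimum learning rate on a quadratic criterion; the matrix H and vector b below are arbitrary illustration values:

```python
import numpy as np

# Illustrative quadratic criterion J(a) = 1/2 a^t H a - b^t a, so grad J = H a - b and H is the Hessian.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(a):
    return H @ a - b

a = np.zeros(2)
for k in range(50):
    g = grad_J(a)
    if np.linalg.norm(g) < 1e-8:
        break
    eta = (g @ g) / (g @ H @ g)    # optimum learning rate ||grad J||^2 / (grad J^t H grad J)
    a = a - eta * g                # gradient descent: a(k+1) = a(k) - eta(k) grad J(a(k))
print(a, np.linalg.solve(H, b))    # converges to the minimizer H^{-1} b
```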
31. Newton's Method
- Update rule: a(k+1) = a(k) - H^{-1} ∇J(a(k)); note that it requires inverting H!
32. Newton's Method (cont'd)
- If the error function is quadratic, Newton's method converges in one step! (See the sketch below.)
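- A one-step NumPy sketch of Newton's method on the same illustrative quadratic criterion used in the gradient descent sketch above:

```python
import numpy as np

# Same illustrative quadratic criterion: J(a) = 1/2 a^t H a - b^t a, grad J = H a - b, Hessian H.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
a = a - np.linalg.solve(H, H @ a - b)   # Newton step: a(k+1) = a(k) - H^{-1} grad J(a(k))
print(a, np.linalg.solve(H, b))         # one step already reaches the minimizer of a quadratic J
```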
33. Comparison: gradient descent vs. Newton's method
34. Perceptron Rule
- Criterion function (normalized version): Jp(a) = Σ_{y ∈ Y(a)} (-a^t y)
- where Y(a) is the set of samples misclassified by a.
- If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.
35. Perceptron Rule (cont'd)
- The gradient of Jp(a) is ∇Jp(a) = Σ_{y ∈ Y(a)} (-y).
- The perceptron update rule is obtained using gradient descent: a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y.
36. Perceptron Rule (cont'd)
- Batch version: at each step, consider all examples misclassified by the current weight vector and add η(k) times their sum to a (see the sketch below).
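- A minimal NumPy sketch of the batch perceptron update; the toy sample matrix Y (already augmented and normalized) and the stopping limits are made up for illustration:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: add eta times the sum of all currently misclassified samples.

    Y holds augmented, normalized samples, so a solution satisfies a^t y > 0 for every row y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]               # the set Y(a) of misclassified samples
        if len(mis) == 0:                 # Y(a) empty -> J_p(a) = 0, a is a solution
            break
        a = a + eta * mis.sum(axis=0)     # a(k+1) = a(k) + eta * sum of misclassified y
    return a

# Tiny linearly separable toy set (augmented and normalized), made up for illustration.
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
a = batch_perceptron(Y)
print(a, (Y @ a > 0).all())
```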
37. Perceptron Rule (cont'd)
- Move the hyperplane so that training samples are
on its positive side.
38. Perceptron Rule (cont'd)
- Single-sample version with fixed increment η(k) = 1: consider one example at a time and, if it is misclassified, update a(k+1) = a(k) + y^k (see the sketch below).
- Perceptron Convergence Theorem: if the training samples are linearly separable, then the sequence of weight vectors produced by the above algorithm will terminate at a solution vector in a finite number of steps.
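- A NumPy sketch of the single-sample fixed-increment rule; the toy sample matrix Y (augmented and normalized) and the pass limit are illustrative:

```python
import numpy as np

def fixed_increment_perceptron(Y, max_passes=100):
    """Single-sample fixed-increment rule (eta(k) = 1): whenever y^k is misclassified, a <- a + y^k.

    Y holds augmented, normalized samples; for linearly separable data the loop terminates."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        errors = 0
        for y in Y:             # consider one example at a time
            if a @ y <= 0:      # misclassified by the current a
                a = a + y
                errors += 1
        if errors == 0:         # a full error-free pass -> solution found
            break
    return a

# Same illustrative toy set as in the batch sketch above.
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
print(fixed_increment_perceptron(Y))
```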
39. Perceptron Rule (cont'd)
- Example: the misclassified examples are considered in the order y2, y3, y1, y3, ...
40. Perceptron Rule (cont'd)
41. Perceptron Rule (cont'd)
- Some direct generalizations:
- Variable increment η(k) and a margin b: update a whenever a(k)^t y^k fails to exceed the margin, i.e., a(k)^t y^k ≤ b.
42. Perceptron Rule (cont'd)
43. Perceptron Rule (cont'd)
44. Relaxation Procedures
- Note that different criterion functions exist.
- One possible choice is Jq(a) = Σ_{y ∈ Y} (a^t y)²
- where Y is again the set of the training samples that are misclassified by a.
- However, there are two problems with this criterion:
- The function is too smooth and can converge to a = 0.
- Jq is dominated by the training samples with the largest magnitude.
45. Relaxation Procedures (cont'd)
- A modified version that avoids the above two problems is Jr(a) = (1/2) Σ_{y ∈ Y} (a^t y - b)² / ||y||²
- Here Y is the set of samples for which a^t y ≤ b.
- Its gradient is given by ∇Jr(a) = Σ_{y ∈ Y} (a^t y - b) y / ||y||² (see the sketch below).
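- A NumPy sketch of a single-sample relaxation rule with margin that follows from this gradient; the margin b, learning rate eta, and toy samples Y are illustrative choices (0 < eta < 2 is the usual range), and the per-sample check uses a strict inequality so samples sitting exactly at the margin count as satisfied:

```python
import numpy as np

def relaxation_with_margin(Y, b=1.0, eta=1.5, max_passes=500):
    """Single-sample relaxation: if a^t y falls short of the margin b, move a toward the margin.

    Update: a <- a + eta * (b - a^t y) * y / ||y||^2, on augmented, normalized samples Y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_passes):
        if (Y @ a >= b).all():                              # every sample meets the margin
            break
        for y in Y:
            if a @ y < b:                                   # this sample violates the margin
                a = a + eta * (b - a @ y) * y / (y @ y)     # relaxation step along y
    return a

# Illustrative linearly separable toy set (augmented and normalized).
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
a = relaxation_with_margin(Y)
print(a, (Y @ a >= 1.0).all())
```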
46. Relaxation Procedures (cont'd)
47. Relaxation Procedures (cont'd)
48. Relaxation Procedures (cont'd)
49. Relaxation Procedures (cont'd)
50. Minimum Squared Error Procedures
- Minimum squared error and pseudoinverse.
- The problem is to find a weight vector a satisfying Ya = b.
- If we have more equations than unknowns, a is over-determined.
- We want to choose the a that minimizes the sum-of-squared-error criterion function Js(a) = ||Ya - b||².
51. Minimum Squared Error Procedures (cont'd)
- Setting the gradient ∇Js(a) = 2 Y^t (Ya - b) to zero gives a = (Y^t Y)^{-1} Y^t b = Y† b, where Y† = (Y^t Y)^{-1} Y^t is the pseudoinverse of Y.
52. Minimum Squared Error Procedures (cont'd)
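- A closing NumPy sketch of the MSE/pseudoinverse solution; the toy matrix Y and the all-ones margin vector b are illustrative choices:

```python
import numpy as np

# Illustrative over-determined system: Y is the (n, d+1) matrix of augmented, normalized samples,
# b is a vector of positive margins (all ones here, an arbitrary but common choice).
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.0,  3.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 0.5, -2.0]])
b = np.ones(len(Y))

# MSE solution a = Y^+ b minimizes J_s(a) = ||Ya - b||^2; lstsq solves it via the pseudoinverse.
a, *_ = np.linalg.lstsq(Y, b, rcond=None)
print(a)
print(np.allclose(a, np.linalg.pinv(Y) @ b))   # matches the explicit pseudoinverse solution
```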