Title: CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
1. CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
- Bayesian Decision Theory
- Chapter 2 (Duda et al.)
2. Bayesian Decision Theory
- Fundamental statistical approach to the problem of pattern classification.
- Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions.
- Each action is associated with a cost or risk.
- The simplest risk is the classification error.
- Design classifiers to recommend actions that minimize some total expected risk.
3. Terminology (using the sea bass / salmon classification example)
- State of nature ω (random variable):
- ω1 for sea bass, ω2 for salmon.
- Probabilities P(ω1) and P(ω2) (priors):
- prior knowledge of how likely it is to get a sea bass or a salmon.
- Probability density function p(x) (evidence):
- how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement).
- Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs pX(x) and pY(y).
4. Terminology (cont'd) (using the sea bass / salmon classification example)
- Conditional probability density p(x/ωj) (likelihood):
- how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj.
- e.g., lightness distributions of the salmon and sea bass populations.
5. Terminology (cont'd) (using the sea bass / salmon classification example)
- Conditional probability P(ωj/x) (posterior):
- the probability that the fish belongs to class ωj given measurement x.
- Note: we will be using an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).
6. Decision Rule Using Priors Only
- Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
- P(error) = min[P(ω1), P(ω2)]
- Favours the most likely class (optimum if no other info is available).
- This rule would be making the same decision all the time!
- Makes sense to use for judging just one fish.
7. Decision Rule Using Conditional pdf
- Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:
  P(ωj/x) = p(x/ωj)P(ωj) / p(x)
- where p(x) = Σj p(x/ωj)P(ωj)
  (scale factor so that the posteriors sum to 1)
- Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently:
- Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2 (see the sketch below).
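A minimal Python sketch of this rule for two classes, assuming hypothetical 1-D Gaussian class-conditional densities and priors (all parameter values are illustrative, not from the course example):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and 1-D Gaussian likelihoods (illustrative values only)
priors = np.array([2/3, 1/3])                 # P(w1), P(w2)
likelihoods = [norm(loc=2.0, scale=1.0),      # p(x/w1)
               norm(loc=4.0, scale=1.0)]      # p(x/w2)

def posteriors(x):
    """Return [P(w1/x), P(w2/x)] via Bayes' rule."""
    joint = np.array([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return joint / joint.sum()                # divide by the evidence p(x)

x = 3.1
post = posteriors(x)
decision = np.argmax(post) + 1                # decide w1 if P(w1/x) > P(w2/x)
print(post, "-> decide class", decision)
```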
8. Decision Rule Using Conditional pdf (cont'd)
9. Probability of Error
- The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2; P(ω2/x) if we decide ω1
- The average probability of error is given by:
  P(error) = ∫ P(error/x) p(x) dx
- The Bayes rule is optimum, that is, it minimizes the average probability of error since:
  P(error/x) = min[P(ω1/x), P(ω2/x)]
10. Where do Probabilities Come From?
- The Bayesian rule is optimal if the pmf or pdf is known.
- There are two competing answers to the above question:
- (1) Relative frequency (objective) approach:
- Probabilities can only come from experiments.
- (2) Bayesian (subjective) approach:
- Probabilities may reflect degrees of belief and can be based on opinion as well as experiments.
11. Example
- Classify cars on the UNR campus as costing more or less than $50K:
- C1: price > $50K
- C2: price < $50K
- Feature x: height of the car
- From Bayes' rule, we know how to compute the posterior probabilities:
  P(Ci/x) = p(x/Ci)P(Ci) / p(x)
- Need to compute p(x/C1), p(x/C2), P(C1), P(C2)
12. Example (cont'd)
- Determine the prior probabilities:
- Collect data: ask drivers how much their car cost and measure its height.
- e.g., 1209 samples: 221 in C1, 988 in C2, giving P(C1) = 221/1209 ≈ 0.18 and P(C2) = 988/1209 ≈ 0.82
13. Example (cont'd)
- Determine the class-conditional probabilities (likelihoods):
- Discretize car height into bins and use a normalized histogram.
14. Example (cont'd)
- Calculate the posterior probability for each bin (a sketch of the whole procedure follows below).
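A minimal Python sketch of this procedure. The height arrays below are hypothetical stand-ins for the surveyed data (only the sample counts match the slide), so the resulting numbers are purely illustrative:

```python
import numpy as np

# Hypothetical height measurements (inches); stand-ins for the collected data
heights_c1 = np.random.default_rng(0).normal(65, 4, 221)   # cars > $50K
heights_c2 = np.random.default_rng(1).normal(58, 4, 988)   # cars < $50K

# Priors from relative frequencies
n1, n2 = len(heights_c1), len(heights_c2)
P_c1, P_c2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Class-conditional likelihoods: normalized histograms over common bins
bins = np.linspace(45, 80, 15)
p_x_c1, _ = np.histogram(heights_c1, bins=bins, density=True)
p_x_c2, _ = np.histogram(heights_c2, bins=bins, density=True)

# Posterior P(C1/x) for each bin via Bayes' rule
evidence = p_x_c1 * P_c1 + p_x_c2 * P_c2
post_c1 = np.divide(p_x_c1 * P_c1, evidence,
                    out=np.zeros_like(evidence), where=evidence > 0)
for lo, hi, p in zip(bins[:-1], bins[1:], post_c1):
    print(f"height in [{lo:.1f}, {hi:.1f}): P(C1/x) = {p:.2f}")
```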
15. A More General Theory
- Use more than one feature.
- Allow more than two categories.
- Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
- Introduce a more general error function:
- a loss function (i.e., associate costs with actions).
16. Terminology
- Features form a vector x
- A finite set of c categories ω1, ω2, ..., ωc
- A finite set of l actions α1, α2, ..., αl
- A loss function λ(αi/ωj):
- the loss incurred for taking action αi when the classification category is ωj
- Bayes' rule using vector notation:
  P(ωj/x) = p(x/ωj)P(ωj) / p(x), with p(x) = Σj p(x/ωj)P(ωj)
17. Expected Loss (Conditional Risk)
- Expected loss (or conditional risk) of taking action αi:
  R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
- The expected loss can be minimized by selecting the action that minimizes the conditional risk (see the sketch below).
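A minimal sketch of computing the conditional risks from a loss matrix and the posteriors; the loss values and posteriors are hypothetical numbers chosen only for illustration:

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i / w_j)
loss = np.array([[0.0, 2.0],    # action a1 (decide w1)
                 [1.0, 0.0]])   # action a2 (decide w2)

post = np.array([0.7, 0.3])     # hypothetical posteriors P(w1/x), P(w2/x)

# Conditional risk R(alpha_i / x) = sum_j loss[i, j] * P(w_j / x)
risks = loss @ post
best_action = np.argmin(risks)  # Bayes decision: take the minimum-risk action
print("risks:", risks, "-> take action a%d" % (best_action + 1))
```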
18. Overall Risk
- Overall risk R:
  R = ∫ R(α(x)/x) p(x) dx
- where α(x) determines which action α1, α2, ..., αl to take for every x (i.e., α(x) is a decision rule).
- To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x.
- This rule yields optimal performance.
19. Bayes Decision Rule
- The Bayes decision rule minimizes R by:
- Computing R(αi/x) for every αi given an x
- Choosing the action αi with the minimum R(αi/x)
- The Bayes risk (i.e., the resulting minimum) is the best performance that can be achieved.
20. Example: Two-category classification
- Two possible actions:
- α1 corresponds to deciding ω1
- α2 corresponds to deciding ω2
- Notation:
- λij = λ(αi/ωj)
- The conditional risks are:
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
21. Example: Two-category classification
- Decide ω1 if R(α1/x) < R(α2/x), i.e., if:
  (λ21 - λ11) P(ω1/x) > (λ12 - λ22) P(ω2/x)
- or: (λ21 - λ11) p(x/ω1) P(ω1) > (λ12 - λ22) p(x/ω2) P(ω2)
- or (i.e., using the likelihood ratio):
  p(x/ω1) / p(x/ω2) > [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)]
  (see the sketch below)
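A minimal sketch of this likelihood-ratio test with a hypothetical loss matrix, priors, and 1-D Gaussian likelihoods (all numbers are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical losses lambda_ij = lambda(alpha_i / w_j) and priors
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.5, 0.5
p1, p2 = norm(2.0, 1.0), norm(4.0, 1.0)      # p(x/w1), p(x/w2)

def decide(x):
    """Decide w1 if the likelihood ratio exceeds the loss-weighted threshold."""
    ratio = p1.pdf(x) / p2.pdf(x)
    threshold = ((l12 - l22) * P2) / ((l21 - l11) * P1)
    return 1 if ratio > threshold else 2

print(decide(2.5), decide(3.5))              # decisions at two sample points
```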
22. Special Case: Zero-One Loss Function
- It assigns the same loss to all errors:
  λ(αi/ωj) = 0 if i = j, 1 if i ≠ j
- The conditional risk corresponding to this loss function is:
  R(αi/x) = Σ(j≠i) P(ωj/x) = 1 - P(ωi/x)
23. Special Case: Zero-One Loss Function (cont'd)
- The decision rule becomes:
  Decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
- or: decide ωi if p(x/ωi)P(ωi) > p(x/ωj)P(ωj) for all j ≠ i
- or (for two categories, using the likelihood ratio): decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1)
- What is the overall risk in this case?
- (answer: the average probability of error)
24. Example
- θa was determined assuming a zero-one loss function.
- θb was determined assuming λ12 > λ21.
- (figure: likelihood ratio and the resulting decision regions for thresholds θa and θb)
25. Minimax Criterion
- Design classifiers that perform well over a range of prior probabilities (e.g., when the prior probabilities are not known exactly).
- Minimize the maximum (i.e., worst) possible overall risk over all values of the priors.
- (R1 and R2 are the decision regions for the given priors)
26. Minimax Criterion (cont'd)
- For fixed decision regions, the overall risk R is linear in P(ω1).
- Idea: find the decision boundary such that the term multiplying P(ω1) is 0; then R is independent of the priors and equals the worst-case risk.
27. Minimax Criterion (cont'd)
- How to find the minimax solution?
- We need to find the prior which maximizes the Bayes risk (i.e., the minimax risk Rmm equals the worst Bayes risk).
- (figure: Bayes error as a function of the prior for a fixed decision boundary and varying priors; find the maximum Bayes error; a zero-one loss function is assumed in this example; a numeric sketch follows below)
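A minimal numeric sketch of the idea, assuming hypothetical 1-D Gaussian class-conditional densities and a zero-one loss: for each prior we compute the Bayes error, and the prior that maximizes it locates the worst Bayes risk used by the minimax criterion.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities (illustrative values only)
p1, p2 = norm(0.0, 1.0), norm(3.0, 1.0)
xs = np.linspace(-8.0, 11.0, 4001)
dx = xs[1] - xs[0]

def bayes_error(P1):
    """Bayes error for prior P(w1) = P1 under a zero-one loss (numeric integral)."""
    P2 = 1.0 - P1
    err_density = np.minimum(p1.pdf(xs) * P1, p2.pdf(xs) * P2)  # min_i p(x/wi) P(wi)
    return float(np.sum(err_density) * dx)

priors = np.linspace(0.01, 0.99, 99)
errors = np.array([bayes_error(P) for P in priors])
worst = priors[int(np.argmax(errors))]
print(f"prior maximizing the Bayes error: P(w1) = {worst:.2f}, "
      f"Bayes error = {errors.max():.4f}")
```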
28. Neyman-Pearson Criterion
- Minimize the overall risk subject to a constraint:
- e.g., do not misclassify more than 1% of salmon as sea bass.
- Adjust the decision boundaries numerically.
- Analytic solutions are possible assuming Gaussian densities.
29. Discriminant Functions
- Functional structure of a general statistical classifier:
- Assign x to ωi if gi(x) > gj(x) for all j ≠ i
- (the gi(x) are the discriminant functions; the classifier computes all of them and picks the max)
30. Discriminants for Bayes Classifier
- Using risks:
- gi(x) = -R(αi/x)
- Using the zero-one loss function (i.e., minimum error rate):
- gi(x) = P(ωi/x)
- Is the choice of gi unique?
- Replacing gi(x) with f(gi(x)), where f(.) is monotonically increasing, does not change the classification results (see the sketch below).
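A minimal sketch, assuming hypothetical Gaussian likelihoods and priors, showing that the discriminant p(x/ωi)P(ωi) and its monotone transform ln p(x/ωi) + ln P(ωi) produce the same decisions:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                       # hypothetical P(wi)
likes = [norm(1.0, 1.0), norm(3.0, 1.5)]            # hypothetical p(x/wi)

def argmax_posterior(x):
    g = np.array([lk.pdf(x) * p for lk, p in zip(likes, priors)])
    return int(np.argmax(g))                        # gi(x) = p(x/wi) P(wi)

def argmax_log(x):
    g = np.array([lk.logpdf(x) + np.log(p) for lk, p in zip(likes, priors)])
    return int(np.argmax(g))                        # f(gi(x)) = ln p(x/wi) + ln P(wi)

for x in (0.5, 2.0, 4.0):
    assert argmax_posterior(x) == argmax_log(x)     # same classification results
print("monotone transform leaves decisions unchanged")
```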
31. Decision Regions and Boundaries
- Decision rules divide the feature space into decision regions R1, R2, ..., Rc.
- The boundaries of the decision regions are the decision boundaries:
  g1(x) = g2(x) at the decision boundaries.
32. Case of Two Categories
- More common to use a single discriminant function (dichotomizer) instead of two: decide ω1 if g(x) > 0, otherwise ω2.
- Examples of dichotomizers:
  g(x) = P(ω1/x) - P(ω2/x)
  g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]
33. Discriminant Function for the Multivariate Gaussian
- Assume the following discriminant function:
  gi(x) = ln p(x/ωi) + ln P(ωi), with p(x/ωi) ~ N(µi, Σi)
- For a d-dimensional Gaussian this gives:
  gi(x) = -(1/2)(x - µi)^T Σi^(-1) (x - µi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(ωi)
  (a sketch follows below)
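A minimal sketch of this discriminant using SciPy's multivariate normal, with hypothetical means, covariances, and priors (illustrative values only):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class parameters (2-D features, two classes)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.array([[1.0, 0.2], [0.2, 1.0]]),
        np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]

def g(i, x):
    """gi(x) = ln p(x/wi) + ln P(wi) for a Gaussian class-conditional density."""
    return multivariate_normal(means[i], covs[i]).logpdf(x) + np.log(priors[i])

x = np.array([1.0, 2.0])
scores = [g(i, x) for i in range(2)]
print("discriminants:", scores, "-> decide class", int(np.argmax(scores)) + 1)
```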
34. Multivariate Gaussian Density: Case I
- Assumption: Σi = σ²I
- Features are statistically independent.
- Each feature has the same variance.
- The discriminant reduces to gi(x) = -||x - µi||² / (2σ²) + ln P(ωi); the ln P(ωi) term favors the a priori more likely category.
35. Multivariate Gaussian Density: Case I (cont'd)
- Expanding the quadratic gives a linear discriminant:
  gi(x) = wi^T x + wi0, where wi = µi/σ² and wi0 = -µi^T µi/(2σ²) + ln P(ωi)
- wi0 is called the threshold or bias.
- Setting gi(x) = gj(x) gives a hyperplane decision boundary:
  w^T (x - x0) = 0, with w = µi - µj and x0 = (µi + µj)/2 - [σ²/||µi - µj||²] ln[P(ωi)/P(ωj)] (µi - µj)
36. Multivariate Gaussian Density: Case I (cont'd)
- Comments about this hyperplane:
- It passes through x0.
- It is orthogonal to the line linking the means.
- What happens when P(ωi) ≠ P(ωj)?
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
- If σ² is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
37. Multivariate Gaussian Density: Case I (cont'd)
38. Multivariate Gaussian Density: Case I (cont'd)
- Minimum distance classifier:
- When P(ωi) is the same for each of the c classes, the ln P(ωi) term can be dropped and x is assigned to the class with the nearest mean (Euclidean distance): gi(x) = -||x - µi||² (see the sketch below).
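A minimal sketch of the minimum-distance classifier under these assumptions, with hypothetical class means (illustrative values only):

```python
import numpy as np

# Hypothetical class means (equal priors, Sigma_i = sigma^2 I assumed)
mus = np.array([[0.0, 0.0],
                [3.0, 0.0],
                [0.0, 3.0]])

def classify(x):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d2 = np.sum((mus - x) ** 2, axis=1)   # squared distances ||x - mu_i||^2
    return int(np.argmin(d2)) + 1         # equivalent to maximizing -||x - mu_i||^2

print(classify(np.array([2.5, 0.5])))     # nearest mean is class 2
```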
39. Multivariate Gaussian Density: Case II
- Assumption: Σi = Σ (all classes share the same covariance matrix); the discriminant is again linear and the decision boundary is a hyperplane.
40. Multivariate Gaussian Density: Case II (cont'd)
- Comments about this hyperplane:
- It passes through x0.
- It is NOT orthogonal to the line linking the means.
- What happens when P(ωi) ≠ P(ωj)?
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
41. Multivariate Gaussian Density: Case II (cont'd)
42. Multivariate Gaussian Density: Case II (cont'd)
- Mahalanobis distance classifier:
- When P(ωi) is the same for each of the c classes, assign x to the class whose mean is nearest in Mahalanobis distance: gi(x) = -(x - µi)^T Σ^(-1) (x - µi) (see the sketch below).
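A minimal sketch of the Mahalanobis-distance classifier, assuming a hypothetical shared covariance matrix and class means (illustrative values only):

```python
import numpy as np

# Hypothetical shared covariance and class means (equal priors assumed)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = np.array([[0.0, 0.0], [3.0, 1.0]])

def classify(x):
    """Assign x to the class with the smallest Mahalanobis distance to its mean."""
    diffs = mus - x
    d2 = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)  # (x-mu)^T Sigma^-1 (x-mu)
    return int(np.argmin(d2)) + 1

print(classify(np.array([1.0, 0.8])))
```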
43. Multivariate Gaussian Density: Case III
- Assumption: Σi is arbitrary (different for each class); the discriminant is quadratic in x.
- The decision boundaries are hyperquadrics:
- e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
44. Multivariate Gaussian Density: Case III (cont'd)
- (figure: example with P(ω1) = P(ω2); disconnected decision regions)
45. Multivariate Gaussian Density: Case III (cont'd)
- (figure: disconnected decision regions and non-linear decision boundaries)
46. Multivariate Gaussian Density: Case III (cont'd)
- More examples (Σi arbitrary)
47. Multivariate Gaussian Density: Case III (cont'd)
48. Example - Case III
- (figure: decision boundary for P(ω1) = P(ω2))
- The boundary does not pass through the midpoint of µ1 and µ2.
49. Error Probabilities and Integrals
- For two categories:
  P(error) = ∫R2 p(x/ω1)P(ω1) dx + ∫R1 p(x/ω2)P(ω2) dx
- The Bayes rule minimizes P(error); the optimum decision boundary using the Bayes rule is x = xB.
50. Error Probabilities and Integrals (cont'd)
- Case of multiple categories:
- Simpler to compute the probability of being correct:
  P(correct) = Σi ∫Ri p(x/ωi)P(ωi) dx
- The Bayes rule maximizes P(correct).
51. Error Bounds for Gaussian Densities
- The full calculation of the error could be difficult.
- Compute upper bounds on the probability of error instead.
- Assume the case of two categories for convenience.
- Chernoff bound:
  P(error) ≤ P(ω1)^β P(ω2)^(1-β) ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx, for 0 ≤ β ≤ 1
52. Error Bounds for Gaussian Densities (cont'd)
- If the class-conditional distributions are Gaussian, then:
  ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx = e^(-k(β))
- k(β) is defined as follows:
  k(β) = [β(1-β)/2] (µ2 - µ1)^T [βΣ1 + (1-β)Σ2]^(-1) (µ2 - µ1) + (1/2) ln( |βΣ1 + (1-β)Σ2| / (|Σ1|^β |Σ2|^(1-β)) )
53. Error Bounds for Gaussian Densities (cont'd)
- The Chernoff bound corresponds to the β that minimizes e^(-k(β)):
- a 1-D optimization, regardless of the dimensionality of the class-conditional densities.
- (figure: e^(-k(β)) vs. β; the bound is loose near the endpoints β = 0 and β = 1 and tight at the minimizing β)
54. Error Bounds for Gaussian Densities (cont'd)
- Bhattacharyya bound:
- The bound is evaluated at β = 0.5:
  P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-k(1/2))
- Easier to compute than the Chernoff bound, but looser.
- The Chernoff and Bhattacharyya bounds will not be tight if the distributions are not Gaussian!
55. Example on Error Bounds
- k(1/2) = 4.06, so P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-4.06) (a computational sketch follows below).
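A minimal sketch of computing k(1/2) and the Bhattacharyya bound for two hypothetical Gaussian classes; the parameter values are illustrative and are not the ones behind the 4.06 figure above:

```python
import numpy as np

# Hypothetical class parameters (illustrative only)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1 = np.array([[2.0, 0.5], [0.5, 2.0]])
S2 = np.array([[1.0, 0.0], [0.0, 1.0]])
P1, P2 = 0.5, 0.5

def k(beta):
    """Chernoff exponent k(beta) for Gaussian class-conditional densities."""
    S = beta * S1 + (1 - beta) * S2
    d = mu2 - mu1
    quad = 0.5 * beta * (1 - beta) * d @ np.linalg.solve(S, d)
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1 - beta)))
    return quad + logdet

bhatt_bound = np.sqrt(P1 * P2) * np.exp(-k(0.5))
print(f"k(1/2) = {k(0.5):.3f}, Bhattacharyya bound on P(error) = {bhatt_bound:.4f}")
```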
56. Receiver Operating Characteristic (ROC) Curve
- Every classifier employs some kind of threshold value.
- Changing the threshold affects the performance of the system.
- ROC curves can help us distinguish between discriminability and decision bias (i.e., the choice of threshold); a sketch of tracing a ROC curve follows below.
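A minimal sketch of tracing a ROC curve by sweeping the decision threshold over hypothetical score distributions for the two classes (all values illustrative):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical score distributions for the two classes
positives = norm(3.0, 1.0)   # scores when the true class is w1
negatives = norm(0.0, 1.0)   # scores when the true class is w2

thresholds = np.linspace(-4, 7, 200)
# Decide w1 when score > threshold:
tpr = 1 - positives.cdf(thresholds)   # true positive rate
fpr = 1 - negatives.cdf(thresholds)   # false positive rate

for t in (0.0, 1.5, 3.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"threshold {t:.1f}: TPR = {tpr[i]:.2f}, FPR = {fpr[i]:.2f}")
# Plotting TPR against FPR over all thresholds traces the ROC curve;
# its shape reflects discriminability, the chosen point reflects decision bias.
```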
57. Example: Person Authentication
- Authenticate a person using biometrics (e.g., a face image).
- There are two possible distributions:
- authentic (A) and impostor (I).
- (figure: the A and I score distributions with a decision threshold; the regions correspond to correct acceptance, correct rejection, false positives, and false negatives)
58. Example: Person Authentication (cont'd)
- Possible cases:
- (1) correct acceptance (true positive)
- X belongs to A, and we decide A
- (2) incorrect acceptance (false positive)
- X belongs to I, and we decide A
- (3) correct rejection (true negative)
- X belongs to I, and we decide I
- (4) incorrect rejection (false negative)
- X belongs to A, and we decide I
59. Error vs. Threshold
- (figure: error rates as a function of the decision threshold)
60. False Negatives vs. Positives
61. Bayes Decision Theory: Case of Discrete Features
- Replace the integrals over densities with sums over probabilities, e.g., replace ∫ p(x/ωj) dx with Σx P(x/ωj).
- Read Section 2.9.
62. Missing Features
- Suppose x = (x1, x2) is a feature vector.
- What can we do when x1 is missing during classification?
- Maybe use the mean value of all x1 measurements?
- But the class whose likelihood is largest at that mean value need not be the correct one!
63. Missing Features (cont'd)
- Suppose x = [xg, xb] (xg: good features, xb: bad/missing features).
- Derive the Bayes rule using the good features by marginalizing over the bad ones:
  P(ωi/xg) = ∫ P(ωi/xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb
  (a numeric sketch follows below)
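A minimal numeric sketch of this marginalization on a discretized grid, with a hypothetical joint Gaussian model p(x1, x2/ωi) (all values illustrative); here x2 plays the role of the missing feature xb:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D Gaussian class models and priors; x = (x1, x2), x2 is missing
models = [multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]]),
          multivariate_normal([2.0, 2.0], [[1.0, -0.4], [-0.4, 1.0]])]
priors = np.array([0.5, 0.5])

x1_obs = 1.2                            # observed good feature
x2_grid = np.linspace(-6, 8, 1000)      # grid for integrating out the bad feature
pts = np.column_stack([np.full_like(x2_grid, x1_obs), x2_grid])

# Marginal joint p(w_i, x1) = integral over x2 of p(x1, x2 / w_i) P(w_i)
joint = np.array([np.sum(m.pdf(pts)) * (x2_grid[1] - x2_grid[0]) * p
                  for m, p in zip(models, priors)])
posterior = joint / joint.sum()         # P(w_i / x1)
print("P(w_i / x1 = %.1f) =" % x1_obs, posterior)
```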
64. Noisy Features
- Suppose x = [xg, xb] (xg: good features, xb: noisy features).
- Suppose the noise is statistically independent and we know the noise model p(xb/xt):
- xb: observed feature values, xt: true feature values.
- Assume statistically independent noise:
- if xt were known, xb would be independent of xg and ωi.
65. Noisy Features (cont'd)
- Use the independence assumption to marginalize over the true values xt.
- What happens when p(xb/xt) is uniform? (The noisy feature carries no information, and the rule reduces to the missing-feature case.)
66. Compound Bayesian Decision Theory
- Sequential compound decision:
- Decide as each fish emerges.
- Compound decision:
- Wait for n fish to emerge.
- Make all n decisions jointly.
67. Bayes Rule for Compound Decisions
- Let Ω = (ω(1), ..., ω(n)) denote the states of nature of the n fish; Ω can take on c^n possible values, so the prior P(Ω) is defined over c^n possible vectors.
- Apply Bayes' rule jointly to the n observations X: P(Ω/X) = p(X/Ω)P(Ω) / p(X).
- (consecutive states ωi are not independent; exploiting this dependence can lead to better performance)