Title: CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
1. CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
- Bayesian Decision Theory
- Chapter 2 (Duda et al.)
2. Bayesian Decision Theory
- Fundamental statistical approach to the problem of pattern classification.
- Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions.
- Each action is associated with a cost or risk.
- The simplest risk is the classification error.
- Design classifiers to recommend actions that minimize some total expected risk.
3. Terminology (using the sea bass/salmon classification example)
- State of nature ω (random variable)
- ω1 for sea bass, ω2 for salmon.
- Probabilities P(ω1) and P(ω2) (priors)
- prior knowledge of how likely it is to get a sea bass or a salmon
- Probability density function p(x) (evidence)
- how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement)
- Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs, pX(x) and pY(y).
4. Terminology (contd) (using the sea bass/salmon classification example)
- Conditional probability density p(x/ωj) (likelihood)
- how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
- e.g., lightness distributions of the salmon and sea bass populations
5. Terminology (contd) (using the sea bass/salmon classification example)
- Conditional probability P(ωj/x) (posterior)
- the probability that the fish belongs to class ωj given measurement x.
- Note: we will use an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).
6. Decision Rule Using Priors Only
- Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
- P(error) = min[P(ω1), P(ω2)]
- Favours the most likely class (optimum if no other information is available).
- This rule makes the same decision every time!
- Makes sense for judging just one fish.
7. Decision Rule Using Conditional pdf
- Using Bayes rule, the posterior probability of category ωj given measurement x is given by
  P(ωj/x) = p(x/ωj) P(ωj) / p(x)
- where p(x) = p(x/ω1) P(ω1) + p(x/ω2) P(ω2) (scale factor, so that the posteriors sum to 1)
- Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
- or, equivalently, decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2); otherwise decide ω2
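The posterior-based rule can be sketched in a few lines. This is only an illustration: the Gaussian lightness models, their parameters, and the priors are assumptions made up for the example, and the function names are hypothetical.

```python
from statistics import NormalDist

def posteriors(x, likelihoods, priors):
    """Return P(w_j/x) for each class via Bayes rule."""
    joint = [lik.pdf(x) * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the scale factor
    return [j / evidence for j in joint]

def decide(x, likelihoods, priors):
    """Index of the class with the largest posterior (0 -> w1, 1 -> w2)."""
    post = posteriors(x, likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)

# Assumed lightness models: sea bass (w1) brighter than salmon (w2).
likelihoods = [NormalDist(mu=7.0, sigma=1.0), NormalDist(mu=4.0, sigma=1.0)]
priors = [2 / 3, 1 / 3]
```

Calling `decide(8.0, likelihoods, priors)` picks ω1 (index 0) under these assumed models, while a dark fish such as `x = 3.0` is assigned to ω2.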
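8. Decision Rule Using Conditional pdf (contd)
9. Probability of Error
- The probability of error is defined as
  P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1
- The average probability of error is given by
  P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx
- The Bayes rule is optimum, that is, it minimizes the average probability of error, since under this rule
  P(error/x) = min[P(ω1/x), P(ω2/x)]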
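As a rough numerical check, the average error of the Bayes rule can be estimated by integrating min[p(x/ω1)P(ω1), p(x/ω2)P(ω2)] over x. The sketch below assumes two 1-D Gaussian class-conditional densities with made-up parameters and uses a simple midpoint rule; `bayes_error` is a hypothetical helper.

```python
from statistics import NormalDist

def bayes_error(lik1, lik2, p1, p2, lo, hi, steps=20000):
    """Midpoint-rule estimate of the integral of min[p(x/w1)P(w1), p(x/w2)P(w2)]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += min(lik1.pdf(x) * p1, lik2.pdf(x) * p2) * dx
    return total

# Assumed class-conditional densities and equal priors.
lik1 = NormalDist(0.0, 1.0)
lik2 = NormalDist(2.0, 1.0)
err = bayes_error(lik1, lik2, 0.5, 0.5, -10.0, 12.0)
```

With equal priors and equal variances the boundary sits midway between the means, so the estimate should agree with the tail probability of a standard normal beyond one standard deviation.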
10. Where do Probabilities Come From?
- The Bayesian rule is optimal if the pmf or pdf is known.
- There are two competing answers to the above question:
- (1) Relative frequency (objective) approach.
- Probabilities can only come from experiments.
- (2) Bayesian (subjective) approach.
- Probabilities may reflect degree of belief and can be based on opinion as well as experiments.
11. Example
- Classify cars on the UNR campus by whether they cost more or less than 50K
- C1: price > 50K
- C2: price < 50K
- Feature x: height of car
- From Bayes rule, we know how to compute the posterior probabilities
  P(Ci/x) = p(x/Ci) P(Ci) / p(x)
- Need to compute p(x/C1), p(x/C2), P(C1), P(C2)
12. Example (contd)
- Determine prior probabilities
- Collect data: ask drivers how much their car cost and measure its height.
- e.g., 1209 samples: #C1 = 221, #C2 = 988
- P(C1) = 221/1209 ≈ 0.183, P(C2) = 988/1209 ≈ 0.817
13. Example (contd)
- Determine class conditional probabilities (likelihood)
- Discretize car height into bins and use the normalized histogram
14. Example (contd)
- Calculate the posterior probability for each bin
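The whole car example (priors from counts, likelihoods from normalized histograms, posteriors per bin) can be sketched as follows. Only the sample counts 221 and 988 come from the slides; the heights are synthetic draws, and the bin layout and function names are assumptions chosen for illustration.

```python
import random

n_c1, n_c2 = 221, 988                      # counts quoted in the slides
total = n_c1 + n_c2
priors = [n_c1 / total, n_c2 / total]      # P(C1), P(C2)

# Synthetic heights in metres (NOT the survey data, which is not available).
random.seed(0)
heights = [
    [random.gauss(1.45, 0.10) for _ in range(n_c1)],
    [random.gauss(1.60, 0.15) for _ in range(n_c2)],
]

LO, WIDTH, N_BINS = 1.0, 0.1, 12           # assumed bin layout

def normalized_histogram(values):
    """Fraction of the values falling in each bin, an estimate of P(bin/Ci)."""
    counts = [0] * N_BINS
    for v in values:
        k = int((v - LO) // WIDTH)
        if 0 <= k < N_BINS:
            counts[k] += 1
    return [c / len(values) for c in counts]

likelihoods = [normalized_histogram(h) for h in heights]

def posterior(bin_index):
    """P(Ci/bin) for both classes, or None if no sample fell in the bin."""
    joint = [likelihoods[i][bin_index] * priors[i] for i in range(2)]
    evidence = sum(joint)
    return [j / evidence for j in joint] if evidence > 0 else None
```

A new car is then classified by locating the bin of its height and picking the class with the larger entry of `posterior(bin_index)`.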
15. A More General Theory
- Use more than one feature.
- Allow more than two categories.
- Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
- Introduce a more general error function
- loss function (i.e., associate costs with actions)
- Features form a vector x
- A finite set of c categories ω1, ω2, ..., ωc
- A finite set of l actions α1, α2, ..., αl
- A loss function λ(αi/ωj)
- the loss incurred for taking action αi when the classification category is ωj
- Bayes rule using vector notation
  P(ωj/x) = p(x/ωj) P(ωj) / p(x), where p(x) = Σj p(x/ωj) P(ωj) (sum over j = 1, ..., c)
17. Expected Loss (Conditional Risk)
- Expected loss (or conditional risk) associated with taking action αi
  R(αi/x) = Σj λ(αi/ωj) P(ωj/x) (sum over j = 1, ..., c)
- The expected loss can be minimized by selecting the action that minimizes the conditional risk.
18. Overall Risk
- Overall risk R
  R = ∫ R(α(x)/x) p(x) dx, where R(α(x)/x) is the conditional risk
- where α(x) determines which action α1, α2, ..., αl to take for every x (i.e., α(x) is a decision rule).
- To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x.
- This rule yields optimal performance.
19. Bayes Decision Rule
- The Bayes decision rule minimizes R by
- Computing R(αi/x) for every αi given an x
- Choosing the action αi with the minimum R(αi/x)
- Bayes risk (i.e., the resulting minimum) is the best performance that can be achieved.
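A minimal sketch of the two steps above, assuming the posteriors are already available. The loss matrix, including the rejection action and its cost of 0.25, is a made-up example, and the function names are hypothetical.

```python
def conditional_risks(loss, posteriors):
    """R(a_i/x) = sum_j loss[i][j] * P(w_j/x) for every action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def bayes_action(loss, posteriors):
    """Index of the action with the minimum conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)

# Assumed loss matrix: rows are actions, columns are true classes.
# Actions 0 and 1 decide w1 and w2; action 2 rejects at a fixed cost of 0.25.
loss = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.25, 0.25],
]
```

With this loss, confident posteriors such as `[0.9, 0.1]` lead to a class decision, while an ambiguous input such as `[0.6, 0.4]` is rejected because rejecting is cheaper than the risk of either class decision.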
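20. Example: Two-category classification
- Two possible actions
- α1 corresponds to deciding ω1
- α2 corresponds to deciding ω2
- Notation
- λij = λ(αi, ωj)
- The conditional risks are
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
21. Example: Two-category classification
- Decision rule: decide ω1 if R(α1/x) < R(α2/x), i.e., if (λ21 - λ11) P(ω1/x) > (λ12 - λ22) P(ω2/x)
- or (i.e., using the likelihood ratio), assuming λ21 > λ11, decide ω1 if
  p(x/ω1) / p(x/ω2) > [(λ12 - λ22) / (λ21 - λ11)] [P(ω2) / P(ω1)]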
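The likelihood-ratio form of the two-category rule compares p(x/ω1)/p(x/ω2) against a threshold built from the losses and priors. The sketch below assumes that standard form, with Gaussian densities and loss values invented for the example; the helper names are hypothetical.

```python
from statistics import NormalDist

def likelihood_ratio_threshold(loss, p1, p2):
    """Threshold [(l12 - l22) / (l21 - l11)] * [P(w2) / P(w1)]; needs l21 > l11."""
    (l11, l12), (l21, l22) = loss
    return (l12 - l22) / (l21 - l11) * (p2 / p1)

def decide_w1(x, lik1, lik2, loss, p1, p2):
    """True if the likelihood ratio p(x/w1) / p(x/w2) exceeds the threshold."""
    return lik1.pdf(x) / lik2.pdf(x) > likelihood_ratio_threshold(loss, p1, p2)

# Assumed densities; zero-one loss versus a loss that penalizes
# deciding w1 when the truth is w2 twice as heavily.
lik1, lik2 = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)
zero_one = [[0.0, 1.0], [1.0, 0.0]]
skewed = [[0.0, 2.0], [1.0, 0.0]]
```

Raising λ12 raises the threshold, so the region where ω1 is chosen shrinks: a point just left of the midpoint is assigned to ω1 under `zero_one` but not under `skewed`.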
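22. Special Case: Zero-One Loss Function
- It assigns the same loss to all errors
  λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j
- The conditional risk corresponding to this loss function is
  R(αi/x) = Σ(j≠i) P(ωj/x) = 1 - P(ωi/x)
23. Special Case: Zero-One Loss Function (contd)
- The decision rule becomes: decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
- What is the overall risk in this case?
- (answer: the average probability of error)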
- θa was determined assuming the zero-one loss function
- θb was determined assuming
(decision regions)
25. Minimax criterion
- Design classifiers that perform well over a range of prior probabilities (e.g., when prior probabilities are not known exactly).
- Minimize the maximum (i.e., worst) possible overall risk for any value of the priors.
(R1 and R2 are the decision regions for given
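26. Minimax criterion (contd)
- R is linear in P(ω1)
- Idea: find the decision boundary for which the term multiplying P(ω1) is zero, so that the risk no longer depends on the priors; the remaining term is the worst case risk
27. Minimax criterion (contd)
- How to find the minimax solution?
- We need to find the prior which maximizes the Bayes risk (i.e., Rmm becomes equal to the worst Bayes risk)
(Figure: Bayes error curve versus P(ω1), assuming a zero-one loss function in this example. For a fixed decision boundary and varying priors the error is a straight line; the minimax solution is at the maximum of the Bayes error curve.)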
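For the zero-one loss, making the risk independent of the priors amounts to choosing the boundary where the two conditional error rates are equal. A minimal sketch under strong assumptions: 1-D Gaussian densities with made-up parameters, and a single-threshold rule that decides ω1 when x < t. Both helper names are hypothetical.

```python
from statistics import NormalDist

def minimax_threshold(lik1, lik2, lo, hi, iters=100):
    """Bisection for t with P(x > t / w1) == P(x < t / w2) (decide w1 when x < t)."""
    def gap(t):
        return (1.0 - lik1.cdf(t)) - lik2.cdf(t)   # decreasing in t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def error(t, lik1, lik2, p1):
    """Zero-one risk of the fixed threshold t when P(w1) = p1; linear in p1."""
    return p1 * (1.0 - lik1.cdf(t)) + (1.0 - p1) * lik2.cdf(t)

# Assumed class-conditional densities.
lik1, lik2 = NormalDist(0.0, 1.0), NormalDist(3.0, 2.0)
t_mm = minimax_threshold(lik1, lik2, lik1.mean, lik2.mean)
```

At `t_mm` the value of `error` is the same for every prior, which is the flat line that touches the Bayes error curve at its maximum.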
28. Neyman-Pearson Criterion
- Minimize the overall risk subject to a constraint
- e.g., do not misclassify more than 1% of salmon as sea bass
- Adjust decision boundaries numerically
- Analytic solutions are possible assuming Gaussian distributions
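In the 1-D Gaussian case the constrained boundary follows directly from the inverse cdf. The sketch below assumes made-up lightness models with equal variances and a rule that decides sea bass when x > t; under those assumptions the smallest threshold that satisfies the constraint also gives the smallest error on the other class. `np_threshold` is a hypothetical helper.

```python
from statistics import NormalDist

salmon = NormalDist(4.0, 1.0)     # assumed lightness model
sea_bass = NormalDist(7.0, 1.0)   # assumed lightness model

def np_threshold(constrained, max_rate):
    """Smallest t with P(x > t / constrained class) <= max_rate."""
    return constrained.inv_cdf(1.0 - max_rate)

t = np_threshold(salmon, 0.01)
salmon_as_bass = 1.0 - salmon.cdf(t)   # constrained error, about 0.01
bass_as_salmon = sea_bass.cdf(t)       # error paid to satisfy the constraint
```

Tightening `max_rate` pushes `t` to the right and increases `bass_as_salmon`, which is the tradeoff the criterion makes explicit.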
29. Discriminant Functions
- Functional structure of a general statistical classifier
- Assign x to ωi if gi(x) > gj(x) for all j ≠ i
- (the gi(x), i = 1, ..., c, are the discriminant functions; the classifier picks the maximum)
30. Discriminants for Bayes Classifier
- Using risks
- gi(x) = -R(αi/x)
- Using zero-one loss function (i.e., min error rate)
- gi(x) = P(ωi/x)
- Is the choice of gi unique?
- Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.
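A quick way to see that the choice of gi is not unique: the posterior, the unnormalized product p(x/ωi)P(ωi), and its logarithm all rank the classes identically. The numbers below are made up for illustration.

```python
import math

def classify(scores):
    """Pick the class whose discriminant value is largest."""
    return max(range(len(scores)), key=scores.__getitem__)

joint = [0.02, 0.15, 0.08]                     # assumed p(x/w_i) P(w_i)
evidence = sum(joint)
g_posterior = [j / evidence for j in joint]    # g_i(x) = P(w_i/x)
g_joint = joint                                # g_i(x) = p(x/w_i) P(w_i)
g_log = [math.log(j) for j in joint]           # g_i(x) = ln p(x/w_i) + ln P(w_i)
```

All three lists give the same result from `classify`, since dividing by a positive constant and taking a logarithm are both monotonically increasing.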
31. Decision Regions and Boundaries
- Decision rules divide the feature space into decision regions R1, R2, ..., Rc
- The boundaries of the decision regions are the decision boundaries.
- g1(x) = g2(x) at the decision boundaries
32. Case of two categories
- More common to use a single discriminant function (dichotomizer) instead of two: g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2
- Examples of dichotomizers
  g(x) = P(ω1/x) - P(ω2/x)
  g(x) = ln[p(x/ω1) / p(x/ω2)] + ln[P(ω1) / P(ω2)]
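33. Discriminant Function for Multivariate Gaussian
- Assume the following discriminant function
  gi(x) = ln p(x/ωi) + ln P(ωi)
- With p(x/ωi) ~ N(μi, Σi) in d dimensions:
  gi(x) = -(1/2) (x - μi)^T Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(ωi)
34. Multivariate Gaussian Density: Case I
- Assumption: Σi = σ²I
- Features are statistically independent
- Each feature has the same variance
  gi(x) = -||x - μi||² / (2σ²) + ln P(ωi)
- The term ln P(ωi) favors the a-priori more likely category
35. Multivariate Gaussian Density: Case I (contd)
- Expanding and dropping the x^T x term (the same for all classes) gives a linear discriminant
  gi(x) = wi^T x + wi0, with wi = μi / σ² and wi0 = -μi^T μi / (2σ²) + ln P(ωi) (threshold or bias)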
36. Multivariate Gaussian Density: Case I (contd)
- Comments about this hyperplane
- It passes through x0
- It is orthogonal to the line linking the means.
- What happens when P(ωi) = P(ωj)? Then x0 is the midpoint between the two means.
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
- If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj)
37. Multivariate Gaussian Density: Case I (contd)
38. Multivariate Gaussian Density: Case I (contd)
- Minimum distance classifier
- When P(ωi) is the same for each of the c classes
  gi(x) = -||x - μi||²
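The Case I discriminant and the point x0 on the separating hyperplane can be sketched with NumPy. The formulas used are the standard ones for Σi = σ²I; the means, σ, and priors are made-up values and the function names are hypothetical.

```python
import numpy as np

def case1_discriminants(x, means, sigma, priors):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i) for every class i."""
    sq_dist = np.sum((means - x) ** 2, axis=1)
    return -sq_dist / (2.0 * sigma ** 2) + np.log(priors)

def case1_boundary_point(mu_i, mu_j, sigma, p_i, p_j):
    """x0 = (mu_i + mu_j)/2 - sigma^2/||mu_i - mu_j||^2 * ln(P_i/P_j) * (mu_i - mu_j)."""
    diff = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma ** 2 / (diff @ diff) * np.log(p_i / p_j) * diff

# Assumed class means in 2-D.
means = np.array([[0.0, 0.0], [4.0, 0.0]])
x0_equal = case1_boundary_point(means[0], means[1], 1.0, 0.5, 0.5)
x0_skewed = case1_boundary_point(means[0], means[1], 1.0, 0.8, 0.2)
```

With equal priors `x0_equal` is the midpoint and the rule reduces to a minimum distance classifier; with P(ω1) = 0.8 the point `x0_skewed` moves toward the second mean, away from the more likely class.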
39. Multivariate Gaussian Density: Case II
- Assumption: Σi = Σ
40. Multivariate Gaussian Density: Case II (contd)
- Comments about this hyperplane
- It passes through x0
- It is NOT orthogonal to the line linking the means.
- What happens when P(ωi) = P(ωj)? Then x0 is the midpoint between the two means.
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
41. Multivariate Gaussian Density: Case II (contd)
42. Multivariate Gaussian Density: Case II (contd)
- Mahalanobis distance classifier
- When P(ωi) is the same for each of the c classes
  gi(x) = -(x - μi)^T Σ^-1 (x - μi)
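A short sketch of the Mahalanobis distance classifier for a shared covariance matrix and equal priors. The means, covariance, and test point are invented so that the Mahalanobis and Euclidean rules disagree; the function names are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu)."""
    d = x - mu
    return float(d @ cov_inv @ d)

def mahalanobis_classify(x, means, cov):
    """Index of the mean nearest to x in Mahalanobis distance (equal priors assumed)."""
    cov_inv = np.linalg.inv(cov)
    return int(np.argmin([mahalanobis_sq(x, mu, cov_inv) for mu in means]))

# Assumed setup: large variance along the first feature, small along the second.
means = np.array([[0.0, 0.0], [3.0, 1.0]])
cov = np.array([[4.0, 0.0], [0.0, 0.25]])
x = np.array([2.5, 0.0])
```

Here `x` is closer to the second mean in Euclidean terms, but the shared covariance makes a deviation along the second feature expensive, so `mahalanobis_classify(x, means, cov)` picks the first class.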
43. Multivariate Gaussian Density: Case III
- Assumption: Σi arbitrary
- The decision boundaries are hyperquadrics, e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids
44. Multivariate Gaussian Density: Case III (contd)
disconnected decision regions
45. Multivariate Gaussian Density: Case III (contd)
disconnected decision regions
non-linear decision boundaries
46. Multivariate Gaussian Density: Case III (contd)
- More examples (Σ arbitrary)
47. Multivariate Gaussian Density: Case III (contd)
48. Example - Case III
decision boundary
boundary does not pass through the midpoint of μ1, μ2
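The general quadratic discriminant can be sketched as below. The 1-D example, with two classes that share a mean but differ in variance, is an assumption chosen to show a disconnected decision region; `gaussian_discriminant` is a hypothetical helper.

```python
import numpy as np

def gaussian_discriminant(x, mu, cov, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 ln|Sigma| + ln P(w_i).

    The constant -(d/2) ln(2 pi) is dropped because it is the same for all classes.
    """
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * d @ np.linalg.solve(cov, d) - 0.5 * logdet + np.log(prior))

# Assumed 1-D classes with the same mean and different variances.
mu = np.array([0.0])
narrow, wide = np.array([[1.0]]), np.array([[9.0]])

def decide(x_value):
    """0 for the narrow class, 1 for the wide class (equal priors assumed)."""
    x = np.array([x_value])
    g = [gaussian_discriminant(x, mu, c, 0.5) for c in (narrow, wide)]
    return int(np.argmax(g))
```

The narrow class wins near the shared mean, while the wide class wins in both tails, so its decision region consists of two separate intervals.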
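49. Error Probabilities and Integrals
- P(error) = P(x in R2, ω1) + P(x in R1, ω2) = ∫R2 p(x/ω1) P(ω1) dx + ∫R1 p(x/ω2) P(ω2) dx
Bayes rule minimizes this error
Optimum using Bayes rule: x = xB
50. Error Probabilities and Integrals (contd)
- Case of multiple categories
- Simpler to compute the probability of being correct
  P(correct) = Σi ∫Ri p(x/ωi) P(ωi) dx (sum over i = 1, ..., c)
Bayes rule maximizes P(correct)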
51. Error Bounds for Gaussian Densities
- Full calculation of the error could be difficult
- Compute upper bounds for the probability of error
- Assume the case of two categories for convenience
- Chernoff bound
  P(error) ≤ P(ω1)^β P(ω2)^(1-β) ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx, for 0 ≤ β ≤ 1
52. Error Bounds for Gaussian Densities
- If the class conditional distributions are Gaussian, then
  ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx = e^(-k(β))
- k(β) is defined as follows
53. Error Bounds for Gaussian Densities (contd)
- The Chernoff bound corresponds to the β that minimizes e^(-k(β))
- (a 1-D optimization, regardless of the dimensionality of the class conditional densities)
(Figure: e^(-k(β)) as a function of β, annotated with two loose bounds and the tight bound at the minimum.)
54. Error Bounds for Gaussian Densities (contd)
- Bhattacharyya bound
- The bound is obtained by setting β = 0.5:
  P(error) ≤ sqrt(P(ω1) P(ω2)) e^(-k(1/2))
- Easier to compute than the Chernoff bound, but looser.
- The Chernoff and Bhattacharyya bounds will not be tight if the distributions are not Gaussian!
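The Bhattacharyya bound for two Gaussian classes can be computed directly from the means and covariances. This sketch uses the standard closed form for k(1/2), namely (1/8)(μ2 - μ1)^T [(Σ1 + Σ2)/2]^-1 (μ2 - μ1) + (1/2) ln(|(Σ1 + Σ2)/2| / sqrt(|Σ1| |Σ2|)); the example parameters are made up and the function name is hypothetical.

```python
import numpy as np

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1, p2):
    """Upper bound sqrt(P1 P2) exp(-k(1/2)) on the two-class Gaussian error."""
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    k = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    k += 0.5 * (np.linalg.slogdet(cov_avg)[1]
                - 0.5 * (np.linalg.slogdet(cov1)[1] + np.linalg.slogdet(cov2)[1]))
    return float(np.sqrt(p1 * p2) * np.exp(-k))

# Assumed 1-D example: unit-variance classes two standard deviations apart.
bound = bhattacharyya_bound(np.array([0.0]), np.array([[1.0]]),
                            np.array([2.0]), np.array([[1.0]]), 0.5, 0.5)
```

For this example the bound is roughly 0.30, while the exact Bayes error is about 0.16, which illustrates that the bound is easy to compute but not tight.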
55. Example on Error Bounds
56. Receiver Operating Characteristic (ROC) Curve
- Every classifier employs some kind of a threshold value.
- Changing the threshold affects the performance of the system.
- ROC curves can help us distinguish between discriminability and decision bias (i.e., choice of threshold)
57. Example: Person Authentication
- Authenticate a person using biometrics (e.g., face image).
- There are two possible distributions
- authentic (A) and impostor (I)
(Figure regions: correct rejection, correct acceptance, false positive, false negative)
58. Example: Person Authentication (contd)
- Possible cases
- (1) correct acceptance (true positive)
- X belongs to A, and we decide A
- (2) incorrect acceptance (false positive)
- X belongs to I, and we decide A
- (3) correct rejection (true negative)
- X belongs to I, and we decide I
- (4) incorrect rejection (false negative)
- X belongs to A, and we decide I
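The four cases translate into rates once a threshold is fixed, and sweeping the threshold traces out an ROC curve. The sketch below assumes that a higher match score means "more likely authentic"; the scores are made up and `roc_points` is a hypothetical helper.

```python
def roc_points(authentic_scores, impostor_scores, thresholds):
    """(false positive rate, true positive rate) for each threshold.

    A sample is accepted as authentic when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in authentic_scores) / len(authentic_scores)
        fpr = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((fpr, tpr))
    return points

authentic = [0.9, 0.8, 0.75, 0.6, 0.4]   # made-up match scores
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]     # made-up match scores
points = roc_points(authentic, impostor, [0.0, 0.45, 0.65, 1.0])
```

A threshold of 0 accepts everyone and a threshold of 1 rejects everyone; intermediate thresholds trade false positives against false negatives.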
59. Error vs Threshold
60. False Negatives vs Positives
61. Bayes Decision Theory: Case of Discrete Features
- Replace ∫ p(x/ωj) dx with Σx P(x/ωj)
- Read section 2.9
62. Missing Features
- Suppose x = (x1, x2) is a feature vector.
- What can we do when x1 is missing during classification?
- Maybe use the mean value of all x1 measurements?
- But the likelihood p(x2/ω2) of the measured x2 value is the largest!
63. Missing Features (contd)
- Suppose x = (xg, xb) (xg: good features, xb: bad features)
- Derive the Bayes rule using the good features
  P(ωi/xg) = ∫ p(ωi, xg, xb) dxb / p(xg) = ∫ P(ωi/x) p(x) dxb / ∫ p(x) dxb
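For Gaussian class-conditional densities the integral over the missing feature has a closed form: integrating x1 out of a 2-D Gaussian leaves a 1-D Gaussian over x2 with the corresponding mean and variance. The sketch below relies on that fact; the class parameters are made up and `posterior_given_x2` is a hypothetical helper.

```python
from statistics import NormalDist

# Assumed 2-D Gaussian classes: (mean vector, covariance matrix, prior).
classes = [
    ((0.0, 0.0), ((1.0, 0.3), (0.3, 1.0)), 0.5),
    ((3.0, 2.0), ((1.0, -0.2), (-0.2, 0.5)), 0.5),
]

def posterior_given_x2(x2):
    """P(w_i/x2) with x1 missing: integrate x1 out of each class density.

    For a Gaussian, integrating out x1 leaves a 1-D Gaussian over x2 with
    mean mu[1] and variance cov[1][1].
    """
    joint = [NormalDist(mu[1], cov[1][1] ** 0.5).pdf(x2) * prior
             for mu, cov, prior in classes]
    evidence = sum(joint)
    return [j / evidence for j in joint]
```

This uses only the good feature x2 and never substitutes a guessed value for x1.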
64. Noisy Features
- Suppose x = (xg, xb) (xg: good features, xb: noisy features)
- Suppose the noise is statistically independent and we know the noise model p(xb/xt)
- xb: observed feature values, xt: true feature values.
- Assume statistically independent noise
- if xt were known, xb would be independent of xg and ωi
65. Noisy Features (contd)
  P(ωi/xg, xb) = ∫ p(ωi, xg, xb, xt) dxt / p(xg, xb)
- use the independence assumption: p(xb/ωi, xg, xt) = p(xb/xt)
  P(ωi/xg, xb) = ∫ P(ωi/x) p(x) p(xb/xt) dxt / ∫ p(x) p(xb/xt) dxt, with x = (xg, xt)
- What happens when p(xb/xt) is uniform?
66. Compound Bayesian Decision Theory
- Sequential compound decision
- Decide as each fish emerges.
- Compound decision
- Wait for n fish to emerge.
- Make all n decisions jointly.
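67. Bayes Rule for Compound Decisions
  P(ω/X) = p(X/ω) P(ω) / p(X), where X = (x1, ..., xn) and ω = (ω(1), ..., ω(n))
c^n possible vectors
c^n possible values
(consecutive states ω(i) are not independent; exploiting this dependence can lead to better performance)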