Title: CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
1. CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
- Bayesian Decision Theory
- Chapter 2 (Duda et al.)
2. Bayesian Decision Theory
- Fundamental statistical approach to the problem of pattern classification.
- Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions.
- Each action is associated with a cost or risk.
- The simplest risk is the classification error.
- Design classifiers to recommend actions that minimize some total expected risk.
3. Terminology (using the sea bass / salmon classification example)
- State of nature ω (random variable):
- ω1 for sea bass, ω2 for salmon.
- Probabilities P(ω1) and P(ω2) (priors):
- prior knowledge of how likely it is to get a sea bass or a salmon.
- Probability density function p(x) (evidence):
- how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement).
- Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs pX(x) and pY(y).
4. Terminology (cont'd) (using the sea bass / salmon classification example)
- Conditional probability density p(x/ωj) (likelihood):
- how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj.
- e.g., lightness distributions of the salmon and sea bass populations.
5. Terminology (cont'd) (using the sea bass / salmon classification example)
- Conditional probability P(ωj/x) (posterior):
- the probability that the fish belongs to class ωj given measurement x.
- Note: we will be using an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).
6. Decision Rule Using Priors Only
- Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
- P(error) = min[P(ω1), P(ω2)]
- Favours the most likely class (optimum if no other info is available).
- This rule would be making the same decision all the time!
- Makes sense to use for judging just one fish.
7. Decision Rule Using Conditional pdf
- Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:
  P(ωj/x) = p(x/ωj)P(ωj) / p(x)
- where p(x) = Σj p(x/ωj)P(ωj)
  (scale factor so that the posteriors sum to 1)
- Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently:
- Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2 (see the sketch below).
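A minimal Python sketch of this rule for two classes, assuming hypothetical 1-D Gaussian class-conditional densities and priors (all parameter values are illustrative, not from the course example):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and 1-D Gaussian likelihoods (illustrative values only)
priors = np.array([2/3, 1/3])                 # P(w1), P(w2)
likelihoods = [norm(loc=2.0, scale=1.0),      # p(x/w1)
               norm(loc=4.0, scale=1.0)]      # p(x/w2)

def posteriors(x):
    """Return [P(w1/x), P(w2/x)] via Bayes' rule."""
    joint = np.array([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return joint / joint.sum()                # divide by the evidence p(x)

x = 3.1
post = posteriors(x)
decision = np.argmax(post) + 1                # decide w1 if P(w1/x) > P(w2/x)
print(post, "-> decide class", decision)
```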
8. Decision Rule Using Conditional pdf (cont'd)
9. Probability of Error
- The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2; P(ω2/x) if we decide ω1
- The average probability of error is given by:
  P(error) = ∫ P(error/x) p(x) dx
- The Bayes rule is optimum, that is, it minimizes the average probability of error since:
  P(error/x) = min[P(ω1/x), P(ω2/x)]
10. Where do Probabilities Come From?
- The Bayesian rule is optimal if the pmf or pdf is known.
- There are two competing answers to the above question:
- (1) Relative frequency (objective) approach:
- Probabilities can only come from experiments.
- (2) Bayesian (subjective) approach:
- Probabilities may reflect degrees of belief and can be based on opinion as well as experiments.
11. Example
- Classify cars on the UNR campus as costing more or less than $50K:
- C1: price > $50K
- C2: price < $50K
- Feature x: height of the car
- From Bayes' rule, we know how to compute the posterior probabilities:
  P(Ci/x) = p(x/Ci)P(Ci) / p(x)
- Need to compute p(x/C1), p(x/C2), P(C1), P(C2)
12. Example (cont'd)
- Determine the prior probabilities:
- Collect data: ask drivers how much their car cost and measure its height.
- e.g., 1209 samples: 221 in C1, 988 in C2, giving P(C1) = 221/1209 ≈ 0.18 and P(C2) = 988/1209 ≈ 0.82
13. Example (cont'd)
- Determine the class-conditional probabilities (likelihoods):
- Discretize car height into bins and use a normalized histogram.
14. Example (cont'd)
- Calculate the posterior probability for each bin (a sketch of the whole procedure follows below).
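A minimal Python sketch of this procedure. The height arrays below are hypothetical stand-ins for the surveyed data (only the sample counts match the slide), so the resulting numbers are purely illustrative:

```python
import numpy as np

# Hypothetical height measurements (inches); stand-ins for the collected data
heights_c1 = np.random.default_rng(0).normal(65, 4, 221)   # cars > $50K
heights_c2 = np.random.default_rng(1).normal(58, 4, 988)   # cars < $50K

# Priors from relative frequencies
n1, n2 = len(heights_c1), len(heights_c2)
P_c1, P_c2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Class-conditional likelihoods: normalized histograms over common bins
bins = np.linspace(45, 80, 15)
p_x_c1, _ = np.histogram(heights_c1, bins=bins, density=True)
p_x_c2, _ = np.histogram(heights_c2, bins=bins, density=True)

# Posterior P(C1/x) for each bin via Bayes' rule
evidence = p_x_c1 * P_c1 + p_x_c2 * P_c2
post_c1 = np.divide(p_x_c1 * P_c1, evidence,
                    out=np.zeros_like(evidence), where=evidence > 0)
for lo, hi, p in zip(bins[:-1], bins[1:], post_c1):
    print(f"height in [{lo:.1f}, {hi:.1f}): P(C1/x) = {p:.2f}")
```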
15. A More General Theory
- Use more than one feature.
- Allow more than two categories.
- Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
- Introduce a more general error function:
- a loss function (i.e., associate costs with actions).
16. Terminology
- Features form a vector x
- A finite set of c categories ω1, ω2, ..., ωc
- A finite set of l actions α1, α2, ..., αl
- A loss function λ(αi/ωj):
- the loss incurred for taking action αi when the classification category is ωj
- Bayes' rule using vector notation:
  P(ωj/x) = p(x/ωj)P(ωj) / p(x), with p(x) = Σj p(x/ωj)P(ωj)
17. Expected Loss (Conditional Risk)
- Expected loss (or conditional risk) of taking action αi:
  R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
- The expected loss can be minimized by selecting the action that minimizes the conditional risk (see the sketch below).
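A minimal sketch of computing the conditional risks from a loss matrix and the posteriors; the loss values and posteriors are hypothetical numbers chosen only for illustration:

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i / w_j)
loss = np.array([[0.0, 2.0],    # action a1 (decide w1)
                 [1.0, 0.0]])   # action a2 (decide w2)

post = np.array([0.7, 0.3])     # hypothetical posteriors P(w1/x), P(w2/x)

# Conditional risk R(alpha_i / x) = sum_j loss[i, j] * P(w_j / x)
risks = loss @ post
best_action = np.argmin(risks)  # Bayes decision: take the minimum-risk action
print("risks:", risks, "-> take action a%d" % (best_action + 1))
```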
18. Overall Risk
- Overall risk R:
  R = ∫ R(α(x)/x) p(x) dx
- where α(x) determines which action α1, α2, ..., αl to take for every x (i.e., α(x) is a decision rule).
- To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x.
- This rule yields optimal performance.
19. Bayes Decision Rule
- The Bayes decision rule minimizes R by:
- Computing R(αi/x) for every αi given an x
- Choosing the action αi with the minimum R(αi/x)
- The Bayes risk (i.e., the resulting minimum) is the best performance that can be achieved.
20. Example: Two-category classification
- Two possible actions:
- α1 corresponds to deciding ω1
- α2 corresponds to deciding ω2
- Notation:
- λij = λ(αi/ωj)
- The conditional risks are:
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
21. Example: Two-category classification
- Decide ω1 if R(α1/x) < R(α2/x), i.e., if:
  (λ21 - λ11) P(ω1/x) > (λ12 - λ22) P(ω2/x)
- or: (λ21 - λ11) p(x/ω1) P(ω1) > (λ12 - λ22) p(x/ω2) P(ω2)
- or (i.e., using the likelihood ratio):
  p(x/ω1) / p(x/ω2) > [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)]
  (see the sketch below)
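A minimal sketch of this likelihood-ratio test with a hypothetical loss matrix, priors, and 1-D Gaussian likelihoods (all numbers are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical losses lambda_ij = lambda(alpha_i / w_j) and priors
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.5, 0.5
p1, p2 = norm(2.0, 1.0), norm(4.0, 1.0)      # p(x/w1), p(x/w2)

def decide(x):
    """Decide w1 if the likelihood ratio exceeds the loss-weighted threshold."""
    ratio = p1.pdf(x) / p2.pdf(x)
    threshold = ((l12 - l22) * P2) / ((l21 - l11) * P1)
    return 1 if ratio > threshold else 2

print(decide(2.5), decide(3.5))              # decisions at two sample points
```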
22. Special Case: Zero-One Loss Function
- It assigns the same loss to all errors:
  λ(αi/ωj) = 0 if i = j, 1 if i ≠ j
- The conditional risk corresponding to this loss function is:
  R(αi/x) = Σ(j≠i) P(ωj/x) = 1 - P(ωi/x)
23. Special Case: Zero-One Loss Function (cont'd)
- The decision rule becomes:
  Decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
- or: decide ωi if p(x/ωi)P(ωi) > p(x/ωj)P(ωj) for all j ≠ i
- or (for two categories, using the likelihood ratio): decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1)
- What is the overall risk in this case?
- (answer: the average probability of error)
24. Example
- θa was determined assuming a zero-one loss function.
- θb was determined assuming λ12 > λ21.
- (figure: likelihood ratio and the resulting decision regions for thresholds θa and θb)
25. Minimax Criterion
- Design classifiers that perform well over a range of prior probabilities (e.g., when the prior probabilities are not known exactly).
- Minimize the maximum (i.e., worst) possible overall risk over all values of the priors.
- (R1 and R2 are the decision regions for the given priors)
26. Minimax Criterion (cont'd)
- For fixed decision regions, the overall risk R is linear in P(ω1).
- Idea: find the decision boundary such that the term multiplying P(ω1) is 0; then R is independent of the priors and equals the worst-case risk.
27. Minimax Criterion (cont'd)
- How to find the minimax solution?
- We need to find the prior which maximizes the Bayes risk (i.e., the minimax risk Rmm equals the worst Bayes risk).
- (figure: Bayes error as a function of the prior for a fixed decision boundary and varying priors; find the maximum Bayes error; a zero-one loss function is assumed in this example; a numeric sketch follows below)
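A minimal numeric sketch of the idea, assuming hypothetical 1-D Gaussian class-conditional densities and a zero-one loss: for each prior we compute the Bayes error, and the prior that maximizes it locates the worst Bayes risk used by the minimax criterion.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities (illustrative values only)
p1, p2 = norm(0.0, 1.0), norm(3.0, 1.0)
xs = np.linspace(-8.0, 11.0, 4001)
dx = xs[1] - xs[0]

def bayes_error(P1):
    """Bayes error for prior P(w1) = P1 under a zero-one loss (numeric integral)."""
    P2 = 1.0 - P1
    err_density = np.minimum(p1.pdf(xs) * P1, p2.pdf(xs) * P2)  # min_i p(x/wi) P(wi)
    return float(np.sum(err_density) * dx)

priors = np.linspace(0.01, 0.99, 99)
errors = np.array([bayes_error(P) for P in priors])
worst = priors[int(np.argmax(errors))]
print(f"prior maximizing the Bayes error: P(w1) = {worst:.2f}, "
      f"Bayes error = {errors.max():.4f}")
```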
28. Neyman-Pearson Criterion
- Minimize the overall risk subject to a constraint:
- e.g., do not misclassify more than 1% of salmon as sea bass.
- Adjust the decision boundaries numerically.
- Analytic solutions are possible assuming Gaussian densities.
29. Discriminant Functions
- Functional structure of a general statistical classifier:
- Assign x to ωi if gi(x) > gj(x) for all j ≠ i
- (the gi(x) are the discriminant functions; the classifier computes all of them and picks the max)
30. Discriminants for Bayes Classifier
- Using risks:
- gi(x) = -R(αi/x)
- Using the zero-one loss function (i.e., minimum error rate):
- gi(x) = P(ωi/x)
- Is the choice of gi unique?
- Replacing gi(x) with f(gi(x)), where f(.) is monotonically increasing, does not change the classification results (see the sketch below).
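A minimal sketch, assuming hypothetical Gaussian likelihoods and priors, showing that the discriminant p(x/ωi)P(ωi) and its monotone transform ln p(x/ωi) + ln P(ωi) produce the same decisions:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                       # hypothetical P(wi)
likes = [norm(1.0, 1.0), norm(3.0, 1.5)]            # hypothetical p(x/wi)

def argmax_posterior(x):
    g = np.array([lk.pdf(x) * p for lk, p in zip(likes, priors)])
    return int(np.argmax(g))                        # gi(x) = p(x/wi) P(wi)

def argmax_log(x):
    g = np.array([lk.logpdf(x) + np.log(p) for lk, p in zip(likes, priors)])
    return int(np.argmax(g))                        # f(gi(x)) = ln p(x/wi) + ln P(wi)

for x in (0.5, 2.0, 4.0):
    assert argmax_posterior(x) == argmax_log(x)     # same classification results
print("monotone transform leaves decisions unchanged")
```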
31. Decision Regions and Boundaries
- Decision rules divide the feature space into decision regions R1, R2, ..., Rc.
- The boundaries of the decision regions are the decision boundaries:
  g1(x) = g2(x) at the decision boundaries.
32. Case of Two Categories
- More common to use a single discriminant function (dichotomizer) instead of two: decide ω1 if g(x) > 0, otherwise ω2.
- Examples of dichotomizers:
  g(x) = P(ω1/x) - P(ω2/x)
  g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]
33. Discriminant Function for the Multivariate Gaussian
- Assume the following discriminant function:
  gi(x) = ln p(x/ωi) + ln P(ωi), with p(x/ωi) ~ N(µi, Σi)
- For a d-dimensional Gaussian this gives:
  gi(x) = -(1/2)(x - µi)^T Σi^(-1) (x - µi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(ωi)
  (a sketch follows below)
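A minimal sketch of this discriminant using SciPy's multivariate normal, with hypothetical means, covariances, and priors (illustrative values only):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class parameters (2-D features, two classes)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.array([[1.0, 0.2], [0.2, 1.0]]),
        np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]

def g(i, x):
    """gi(x) = ln p(x/wi) + ln P(wi) for a Gaussian class-conditional density."""
    return multivariate_normal(means[i], covs[i]).logpdf(x) + np.log(priors[i])

x = np.array([1.0, 2.0])
scores = [g(i, x) for i in range(2)]
print("discriminants:", scores, "-> decide class", int(np.argmax(scores)) + 1)
```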
34. Multivariate Gaussian Density: Case I
- Assumption: Σi = σ²I
- Features are statistically independent.
- Each feature has the same variance.
- The discriminant reduces to gi(x) = -||x - µi||² / (2σ²) + ln P(ωi); the ln P(ωi) term favors the a priori more likely category.
35. Multivariate Gaussian Density: Case I (cont'd)
- Expanding the quadratic gives a linear discriminant:
  gi(x) = wi^T x + wi0, where wi = µi/σ² and wi0 = -µi^T µi/(2σ²) + ln P(ωi)
- wi0 is called the threshold or bias.
- Setting gi(x) = gj(x) gives a hyperplane decision boundary:
  w^T (x - x0) = 0, with w = µi - µj and x0 = (µi + µj)/2 - [σ²/||µi - µj||²] ln[P(ωi)/P(ωj)] (µi - µj)
36. Multivariate Gaussian Density: Case I (cont'd)
- Comments about this hyperplane:
- It passes through x0.
- It is orthogonal to the line linking the means.
- What happens when P(ωi) ≠ P(ωj)?
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
- If σ² is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
37. Multivariate Gaussian Density: Case I (cont'd)
38. Multivariate Gaussian Density: Case I (cont'd)
- Minimum distance classifier:
- When P(ωi) is the same for each of the c classes, the ln P(ωi) term can be dropped and x is assigned to the class with the nearest mean (Euclidean distance): gi(x) = -||x - µi||² (see the sketch below).
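A minimal sketch of the minimum-distance classifier under these assumptions, with hypothetical class means (illustrative values only):

```python
import numpy as np

# Hypothetical class means (equal priors, Sigma_i = sigma^2 I assumed)
mus = np.array([[0.0, 0.0],
                [3.0, 0.0],
                [0.0, 3.0]])

def classify(x):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d2 = np.sum((mus - x) ** 2, axis=1)   # squared distances ||x - mu_i||^2
    return int(np.argmin(d2)) + 1         # equivalent to maximizing -||x - mu_i||^2

print(classify(np.array([2.5, 0.5])))     # nearest mean is class 2
```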
39. Multivariate Gaussian Density: Case II
- Assumption: Σi = Σ (all classes share the same covariance matrix); the discriminant is again linear and the decision boundary is a hyperplane.
40. Multivariate Gaussian Density: Case II (cont'd)
- Comments about this hyperplane:
- It passes through x0.
- It is NOT orthogonal to the line linking the means.
- What happens when P(ωi) ≠ P(ωj)?
- If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
41. Multivariate Gaussian Density: Case II (cont'd)
42. Multivariate Gaussian Density: Case II (cont'd)
- Mahalanobis distance classifier:
- When P(ωi) is the same for each of the c classes, assign x to the class whose mean is nearest in Mahalanobis distance: gi(x) = -(x - µi)^T Σ^(-1) (x - µi) (see the sketch below).
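A minimal sketch of the Mahalanobis-distance classifier, assuming a hypothetical shared covariance matrix and class means (illustrative values only):

```python
import numpy as np

# Hypothetical shared covariance and class means (equal priors assumed)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = np.array([[0.0, 0.0], [3.0, 1.0]])

def classify(x):
    """Assign x to the class with the smallest Mahalanobis distance to its mean."""
    diffs = mus - x
    d2 = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)  # (x-mu)^T Sigma^-1 (x-mu)
    return int(np.argmin(d2)) + 1

print(classify(np.array([1.0, 0.8])))
```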
43. Multivariate Gaussian Density: Case III
- Assumption: Σi is arbitrary (different for each class); the discriminant is quadratic in x.
- The decision boundaries are hyperquadrics:
- e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
44. Multivariate Gaussian Density: Case III (cont'd)
- (figure: example with P(ω1) = P(ω2); disconnected decision regions)
45. Multivariate Gaussian Density: Case III (cont'd)
- (figure: disconnected decision regions and non-linear decision boundaries)
46. Multivariate Gaussian Density: Case III (cont'd)
- More examples (Σi arbitrary)
47. Multivariate Gaussian Density: Case III (cont'd)
48. Example - Case III
- (figure: decision boundary for P(ω1) = P(ω2))
- The boundary does not pass through the midpoint of µ1 and µ2.
49. Error Probabilities and Integrals
- For two categories:
  P(error) = ∫R2 p(x/ω1)P(ω1) dx + ∫R1 p(x/ω2)P(ω2) dx
- The Bayes rule minimizes P(error); the optimum decision boundary using the Bayes rule is x = xB.
50. Error Probabilities and Integrals (cont'd)
- Case of multiple categories:
- Simpler to compute the probability of being correct:
  P(correct) = Σi ∫Ri p(x/ωi)P(ωi) dx
- The Bayes rule maximizes P(correct).
51. Error Bounds for Gaussian Densities
- The full calculation of the error could be difficult.
- Compute upper bounds on the probability of error instead.
- Assume the case of two categories for convenience.
- Chernoff bound:
  P(error) ≤ P(ω1)^β P(ω2)^(1-β) ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx, for 0 ≤ β ≤ 1
52. Error Bounds for Gaussian Densities (cont'd)
- If the class-conditional distributions are Gaussian, then:
  ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx = e^(-k(β))
- k(β) is defined as follows:
  k(β) = [β(1-β)/2] (µ2 - µ1)^T [βΣ1 + (1-β)Σ2]^(-1) (µ2 - µ1) + (1/2) ln( |βΣ1 + (1-β)Σ2| / (|Σ1|^β |Σ2|^(1-β)) )
53. Error Bounds for Gaussian Densities (cont'd)
- The Chernoff bound corresponds to the β that minimizes e^(-k(β)):
- a 1-D optimization, regardless of the dimensionality of the class-conditional densities.
- (figure: e^(-k(β)) vs. β; the bound is loose near the endpoints β = 0 and β = 1 and tight at the minimizing β)
54. Error Bounds for Gaussian Densities (cont'd)
- Bhattacharyya bound:
- The bound is evaluated at β = 0.5:
  P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-k(1/2))
- Easier to compute than the Chernoff bound, but looser.
- The Chernoff and Bhattacharyya bounds will not be tight if the distributions are not Gaussian!
55. Example on Error Bounds
- k(1/2) = 4.06, so P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-4.06) (a computational sketch follows below).
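A minimal sketch of computing k(1/2) and the Bhattacharyya bound for two hypothetical Gaussian classes; the parameter values are illustrative and are not the ones behind the 4.06 figure above:

```python
import numpy as np

# Hypothetical class parameters (illustrative only)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1 = np.array([[2.0, 0.5], [0.5, 2.0]])
S2 = np.array([[1.0, 0.0], [0.0, 1.0]])
P1, P2 = 0.5, 0.5

def k(beta):
    """Chernoff exponent k(beta) for Gaussian class-conditional densities."""
    S = beta * S1 + (1 - beta) * S2
    d = mu2 - mu1
    quad = 0.5 * beta * (1 - beta) * d @ np.linalg.solve(S, d)
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1 - beta)))
    return quad + logdet

bhatt_bound = np.sqrt(P1 * P2) * np.exp(-k(0.5))
print(f"k(1/2) = {k(0.5):.3f}, Bhattacharyya bound on P(error) = {bhatt_bound:.4f}")
```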
56. Receiver Operating Characteristic (ROC) Curve
- Every classifier employs some kind of threshold value.
- Changing the threshold affects the performance of the system.
- ROC curves can help us distinguish between discriminability and decision bias (i.e., the choice of threshold); a sketch of tracing a ROC curve follows below.
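A minimal sketch of tracing a ROC curve by sweeping the decision threshold over hypothetical score distributions for the two classes (all values illustrative):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical score distributions for the two classes
positives = norm(3.0, 1.0)   # scores when the true class is w1
negatives = norm(0.0, 1.0)   # scores when the true class is w2

thresholds = np.linspace(-4, 7, 200)
# Decide w1 when score > threshold:
tpr = 1 - positives.cdf(thresholds)   # true positive rate
fpr = 1 - negatives.cdf(thresholds)   # false positive rate

for t in (0.0, 1.5, 3.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"threshold {t:.1f}: TPR = {tpr[i]:.2f}, FPR = {fpr[i]:.2f}")
# Plotting TPR against FPR over all thresholds traces the ROC curve;
# its shape reflects discriminability, the chosen point reflects decision bias.
```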
57. Example: Person Authentication
- Authenticate a person using biometrics (e.g., a face image).
- There are two possible distributions:
- authentic (A) and impostor (I).
- (figure: the A and I score distributions with a decision threshold; the regions correspond to correct acceptance, correct rejection, false positives, and false negatives)
58. Example: Person Authentication (cont'd)
- Possible cases:
- (1) correct acceptance (true positive)
- X belongs to A, and we decide A
- (2) incorrect acceptance (false positive)
- X belongs to I, and we decide A
- (3) correct rejection (true negative)
- X belongs to I, and we decide I
- (4) incorrect rejection (false negative)
- X belongs to A, and we decide I
59. Error vs. Threshold
- (figure: error rates as a function of the decision threshold)
60. False Negatives vs. Positives
61. Bayes Decision Theory: Case of Discrete Features
- Replace the integrals over densities with sums over probabilities, e.g., replace ∫ p(x/ωj) dx with Σx P(x/ωj).
- Read Section 2.9.
62. Missing Features
- Suppose x = (x1, x2) is a feature vector.
- What can we do when x1 is missing during classification?
- Maybe use the mean value of all x1 measurements?
- But the class whose likelihood is largest at that mean value need not be the correct one!
63. Missing Features (cont'd)
- Suppose x = [xg, xb] (xg: good features, xb: bad/missing features).
- Derive the Bayes rule using the good features by marginalizing over the bad ones:
  P(ωi/xg) = ∫ P(ωi/xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb
  (a numeric sketch follows below)
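A minimal numeric sketch of this marginalization on a discretized grid, with a hypothetical joint Gaussian model p(x1, x2/ωi) (all values illustrative); here x2 plays the role of the missing feature xb:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D Gaussian class models and priors; x = (x1, x2), x2 is missing
models = [multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]]),
          multivariate_normal([2.0, 2.0], [[1.0, -0.4], [-0.4, 1.0]])]
priors = np.array([0.5, 0.5])

x1_obs = 1.2                            # observed good feature
x2_grid = np.linspace(-6, 8, 1000)      # grid for integrating out the bad feature
pts = np.column_stack([np.full_like(x2_grid, x1_obs), x2_grid])

# Marginal joint p(w_i, x1) = integral over x2 of p(x1, x2 / w_i) P(w_i)
joint = np.array([np.sum(m.pdf(pts)) * (x2_grid[1] - x2_grid[0]) * p
                  for m, p in zip(models, priors)])
posterior = joint / joint.sum()         # P(w_i / x1)
print("P(w_i / x1 = %.1f) =" % x1_obs, posterior)
```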
64. Noisy Features
- Suppose x = [xg, xb] (xg: good features, xb: noisy features).
- Suppose the noise is statistically independent and we know the noise model p(xb/xt):
- xb: observed feature values, xt: true feature values.
- Assume statistically independent noise:
- if xt were known, xb would be independent of xg and ωi.
65. Noisy Features (cont'd)
- Use the independence assumption to marginalize over the true values xt.
- What happens when p(xb/xt) is uniform? (The noisy feature carries no information, and the rule reduces to the missing-feature case.)
66. Compound Bayesian Decision Theory
- Sequential compound decision:
- Decide as each fish emerges.
- Compound decision:
- Wait for n fish to emerge.
- Make all n decisions jointly.
67. Bayes Rule for Compound Decisions
- Let Ω = (ω(1), ..., ω(n)) denote the states of nature of the n fish; Ω can take on c^n possible values, so the prior P(Ω) is defined over c^n possible vectors.
- Apply Bayes' rule jointly to the n observations X: P(Ω/X) = p(X/Ω)P(Ω) / p(X).
- (consecutive states ωi are not independent; exploiting this dependence can lead to better performance)