Title: Intro to Pattern Recognition : Bayesian Decision Theory
1 Intro to Pattern Recognition: Bayesian Decision Theory
- 2.1 Introduction
- 2.2 Bayesian Decision Theory: Continuous Features
Materials used in this course were taken from the
textbook Pattern Classification by Duda et al.,
John Wiley & Sons, 2001, with the permission of
the authors and the publisher.
2 Credits and Acknowledgments
- Materials used in this course were taken from the
textbook Pattern Classification by Duda et al.,
John Wiley & Sons, 2001, with the permission of
the authors and the publisher, and also from:
- Other material on the web
- Dr. A. Aydin Atalan, Middle East Technical University, Turkey
- Dr. Djamel Bouchaffra, Oakland University
- Dr. Adam Krzyzak, Concordia University
- Dr. Joseph Picone, Mississippi State University
- Dr. Robi Polikar, Rowan University
- Dr. Stefan A. Robila, University of New Orleans
- Dr. Sargur N. Srihari, State University of New York at Buffalo
- David G. Stork, Stanford University
- Dr. Godfried Toussaint, McGill University
- Dr. Chris Wyatt, Virginia Tech
- Dr. Alan L. Yuille, University of California, Los Angeles
- Dr. Song-Chun Zhu, University of California, Los Angeles
3 TYPICAL APPLICATIONS
GENERALIZATION AND RISK
- Optimal decision surface still a line
- Can we integrate prior knowledge about data,
confidence, or willingness to take risk?
4TYPICAL APPLICATIONS
FEATURES ARE CONFUSABLE
5TYPICAL APPLICATIONS
IMAGE PROCESSING EXAMPLE
6TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
- Length is a poor discriminator
7TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
- Lightness is a better feature than length because
it reduces the misclassification error.
- Can we combine features in such a way that we
improve performance? (Hint: correlation)
8TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
- Treat features as an N-tuple (here, a two-dimensional vector)
- Create a scatter plot
- Draw a line (regression) separating the two classes
10TYPICAL APPLICATIONS
DECISION THEORY
- Can we do better than a linear classifier?
- What is wrong with this decision surface?
(Hint: generalization)
11TYPICAL APPLICATIONS
GENERALIZATION AND RISK
- Why might a smoother decision surface be a better
choice? (Hint: Occam's Razor)
- This course investigates how to find such
optimal decision surfaces and how to provide
system designers with the tools to make
intelligent trade-offs.
12TYPICAL APPLICATIONS
CORRELATION
13TYPICAL APPLICATIONS
SPEECH RECOGNITION
14FEATURE EXTRACTION
15 Application of Pattern Recognition: Speaker Verification
- Basic Principle: This system extracts the uniqueness in
the human voice and creates an individual voice signature.
- Data Collection: Enrollment, Verification
- Feature Extraction: MFCC acoustic features
(Mel Frequency Cepstrum Coefficients)
- Classifier: Pattern Matching (using Likelihood Scores)
- Decision: Accept/Reject
16 Application of Pattern Recognition
- CLASSIFIER USED IN THIS APPLICATION PERFORMS PATTERN MATCHING
- The pattern matching process involves the comparison of a
given set of input feature vectors against the speaker model
for the claimed identity and computing a matching score.
A basic Speaker Verification System
17 2.1 Bayesian Decision Theory
18 Thomas Bayes
- At the time of his death, Rev. Thomas Bayes (1702-1761)
left behind two unpublished essays attempting to determine
the probabilities of causes from observed effects.
Forwarded to the British Royal Society, the essays had
little impact and were soon forgotten.
- When, several years later, the French mathematician
Laplace independently rediscovered a very similar concept,
the English scientists quickly reclaimed ownership of what
is now known as Bayes' theorem.
19 BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
- Bayesian decision theory is a fundamental statistical
approach to the problem of pattern classification.
- Quantify the tradeoffs between various classification
decisions using probability and the costs that accompany
these decisions.
- Assume all relevant probability distributions are known
(later we will learn how to estimate these from data).
- Can we exploit prior knowledge in our fish classification
problem?
- Is the sequence of fish predictable? (statistics)
- Is each class equally probable? (uniform priors)
- What is the cost of an error? (risk, optimization)
20 BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
- The state of nature is prior information
- Model it as a random variable, $\omega$
- $\omega = \omega_1$: the event that the next fish is a sea bass
- Category 1: sea bass; category 2: salmon
- $P(\omega_1)$ = probability of category 1
- $P(\omega_2)$ = probability of category 2
- $P(\omega_1) + P(\omega_2) = 1$
- Exclusivity: $\omega_1$ and $\omega_2$ share no basic events
- Exhaustivity: the union of all outcomes is the sample
space (either $\omega_1$ or $\omega_2$ must occur)
- If all incorrect classifications have an equal cost:
- Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise, decide $\omega_2$
21 BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
- A decision rule with only prior information always
produces the same result and ignores measurements.
- If $P(\omega_1) \gg P(\omega_2)$, we will be correct most of the time.
- Probability of error: $P(E) = \min(P(\omega_1), P(\omega_2))$.
- $p(x|\omega_1)$ and $p(x|\omega_2)$ describe the difference in
lightness between populations of sea bass and salmon.
22 BAYESIAN DECISION THEORY
PROBABILITY FUNCTIONS
- A probability density function is denoted in lowercase
and represents a function of a continuous variable.
- $p_x(x|\omega)$, often abbreviated as $p(x)$, denotes a
probability density function for the random variable $X$.
Note that $p_x(x|\omega)$ and $p_y(y|\omega)$ can be two
different functions.
- $P(x|\omega)$ denotes a probability mass function, and must
obey the following constraints:
  $P(x|\omega) \ge 0$ and $\sum_x P(x|\omega) = 1$
- Probability mass functions are typically used for
discrete random variables, while densities describe
continuous random variables (the latter must be integrated
to obtain probabilities).
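23 BAYESIAN DECISION THEORY
BAYES FORMULA
- Suppose we know both $P(\omega_j)$ and $p(x|\omega_j)$, and we
can measure $x$. How does this influence our decision?
- The joint probability of finding a pattern that is in
category $\omega_j$ and that this pattern has a feature value
of $x$ is
  $p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j)$
- Rearranging terms, we arrive at Bayes formula
  $P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$
where in the case of two categories
  $p(x) = \sum_{j=1}^{2} p(x|\omega_j)\,P(\omega_j)$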
24 BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
- Bayes formula
  $P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$
- can be expressed in words as
  posterior = (likelihood × prior) / evidence
- By measuring $x$, we can convert the prior probability,
$P(\omega_j)$, into a posterior probability, $P(\omega_j|x)$.
- Evidence can be viewed as a scale factor and is often
ignored in optimization applications (e.g., speech
recognition).
25 BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
- For every value of $x$, the posteriors sum to 1.0.
- At $x = 14$, the probability that the pattern is in
category $\omega_2$ is 0.08, and for category $\omega_1$ it is 0.92.
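26 BAYESIAN DECISION THEORY
BAYES DECISION RULE
- Decision rule
- For an observation $x$, decide $\omega_1$ if
$P(\omega_1|x) > P(\omega_2|x)$; otherwise, decide $\omega_2$
- Probability of error
  $P(\text{error}|x) = P(\omega_1|x)$ if we decide $\omega_2$, and $P(\omega_2|x)$ if we decide $\omega_1$
- The average probability of error is given by
  $P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\,dx = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$
- If for every $x$ we ensure that $P(\text{error}|x)$ is as small
as possible, then the integral is as small as possible.
Thus, the Bayes decision rule minimizes $P(\text{error})$.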
27Bayes Decision Rule
28 BAYESIAN DECISION THEORY
EVIDENCE
- The evidence, $p(x)$, is a scale factor that assures the
conditional probabilities sum to 1:
$P(\omega_1|x) + P(\omega_2|x) = 1$
- We can eliminate the scale factor (which appears on both
sides of the decision rule):
- Decide $\omega_1$ if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$; otherwise, decide $\omega_2$
- Special cases
- If $p(x|\omega_1) = p(x|\omega_2)$, $x$ gives us no useful information
- If $P(\omega_1) = P(\omega_2)$, the decision is based entirely on
the likelihood, $p(x|\omega_j)$.
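The two-class rule above can be sketched in a few lines of Python. This is only an illustration: the Gaussian class-conditional densities (borrowing the N(4,1) and N(10,1) lightness models from the sea bass / salmon example later in these slides), the equal priors, and all function names are assumptions, not part of the original material.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used as an assumed class-conditional p(x|w)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, likelihoods, priors):
    """Return P(w_j|x) for every class j via Bayes formula."""
    joint = [lik(x) * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the scale factor
    return [j / evidence for j in joint]

def decide(x, likelihoods, priors):
    """Index of the class maximizing p(x|w_j) P(w_j); the evidence is not needed."""
    scores = [lik(x) * prior for lik, prior in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Assumed models: class 0 ~ N(4, 1), class 1 ~ N(10, 1), equal priors
likelihoods = [lambda x: gaussian_pdf(x, 4.0, 1.0),
               lambda x: gaussian_pdf(x, 10.0, 1.0)]
priors = [0.5, 0.5]

post = posteriors(6.0, likelihoods, priors)
print(post, decide(6.0, likelihoods, priors))
```

Note that `decide` never divides by the evidence, which mirrors the remark that the scale factor can be dropped from the decision rule.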
29 CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
- Generalization of the preceding ideas
- Use of more than one feature (e.g., length and lightness)
- Use of more than two states of nature (e.g., N-way
classification)
- Allowing actions other than a decision on the state of
nature (e.g., rejection: refusing to take an action when
alternatives are close or confidence is low)
- Introduce a loss function, which is more general than
the probability of error (e.g., errors are not equally
costly)
- Let us replace the scalar $x$ by the vector $\mathbf{x}$ in a
d-dimensional Euclidean space, $\mathbb{R}^d$, called the
feature space.
30CONTINUOUS FEATURES
LOSS FUNCTION 1
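- Let $\{\omega_1, \omega_2, \ldots, \omega_c\}$ be the set of $c$ categories
- Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of $a$ possible actions
- Let $\lambda(\alpha_i|\omega_j)$ be the loss incurred for taking
action $\alpha_i$ when the state of nature is $\omega_j$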
31 Examples
- Ex 1: Fish classification
- $X$ is the image of a fish
- $\mathbf{x}$ = (brightness, length, fin, etc.)
- $\omega$ is our belief about what the fish type is:
$\omega \in$ {sea bass, salmon, trout, etc.}
- $\alpha$ is a decision for the fish type, in this case
$\alpha \in$ {sea bass, salmon, trout, manual inspection needed, etc.}
- Ex 2: Medical diagnosis
- $X$ is all the available medical tests and imaging scans
that a doctor can order for a patient
- $\mathbf{x}$ = (blood pressure, glucose level, cough, x-ray, etc.)
- $\omega$ is an illness type:
$\omega \in$ {flu, cold, TB, pneumonia, lung cancer, etc.}
- $\alpha$ is a decision for treatment:
$\alpha \in$ {Tylenol, hospitalize, more tests needed, etc.}
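32 CONTINUOUS FEATURES
LOSS FUNCTION
- Let $\lambda(\alpha_i|\omega_j)$ be the loss incurred for taking
action $\alpha_i$ when the state of nature is $\omega_j$
- The posterior, $P(\omega_j|\mathbf{x})$, can be computed from
Bayes formula
  $P(\omega_j|\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{p(\mathbf{x})}$
where the evidence is
  $p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)\,P(\omega_j)$
- The expected loss from taking action $\alpha_i$ is
  $R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\mathbf{x})$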
33 CONTINUOUS FEATURES
BAYES RISK
- An expected loss is called a risk.
- $R(\alpha_i|\mathbf{x})$ is called the conditional risk.
- A general decision rule is a function $\alpha(\mathbf{x})$ that
tells us which action to take for every possible
observation.
- The overall risk is given by
  $R = \int R(\alpha(\mathbf{x})|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$
- If we choose $\alpha(\mathbf{x})$ so that $R(\alpha(\mathbf{x})|\mathbf{x})$ is as
small as possible for every $\mathbf{x}$, the overall risk will be
minimized.
- Compute the conditional risk for every $\alpha_i$ and select
the action that minimizes $R(\alpha_i|\mathbf{x})$. The resulting
minimum overall risk is denoted $R^*$, and is referred to as
the Bayes risk.
- The Bayes risk is the best performance that can be
achieved.
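As a minimal sketch of the minimum-risk rule described above, the snippet below evaluates the conditional risk of every action from a loss table and a posterior vector, then picks the smallest. The loss values (two classes, three actions including a reject option) and the function names are hypothetical and chosen only for illustration.

```python
def conditional_risks(loss, posteriors):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x), one value per action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
            for row in loss]

def bayes_action(loss, posteriors):
    """Return (index of the minimum-risk action, its conditional risk)."""
    risks = conditional_risks(loss, posteriors)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]

# Hypothetical loss table: rows are actions, columns are states of nature.
# Actions: 0 = choose w1, 1 = choose w2, 2 = reject (do not classify).
loss = [[0.0, 1.0],
        [1.0, 0.0],
        [0.25, 0.25]]

action, risk = bayes_action(loss, [0.6, 0.4])
print(action, risk)
```

With posteriors this close to each other, rejecting is cheaper than committing to either class; with a confident posterior such as `[0.9, 0.1]`, the same function returns action 0.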
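34 CONTINUOUS FEATURES
TWO-CATEGORY CLASSIFICATION
- Let $\alpha_1$ correspond to deciding $\omega_1$, $\alpha_2$ to deciding
$\omega_2$, and $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
- The conditional risk is given by
- $R(\alpha_1|x) = \lambda_{11}P(\omega_1|x) + \lambda_{12}P(\omega_2|x)$
- $R(\alpha_2|x) = \lambda_{21}P(\omega_1|x) + \lambda_{22}P(\omega_2|x)$
- Our decision rule is
- choose $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$; otherwise decide $\omega_2$
- This results in the equivalent rule
- choose $\omega_1$ if $(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$;
otherwise decide $\omega_2$
- If the loss incurred for making an error is greater than
that incurred for being correct, the factors
$(\lambda_{21} - \lambda_{11})$ and $(\lambda_{12} - \lambda_{22})$ are positive, and the
ratio of these factors simply scales the posteriors.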
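35 CONTINUOUS FEATURES
LIKELIHOOD
- By employing Bayes formula, we can replace the posteriors
by the prior probabilities and conditional densities
- choose $\omega_1$ if
$(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)\,P(\omega_2)$;
otherwise decide $\omega_2$
- If $\lambda_{21} - \lambda_{11}$ is positive, our rule becomes: choose $\omega_1$ if
  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$
- If the loss factors are identical, and the prior
probabilities are equal, this reduces to a standard
likelihood ratio: choose $\omega_1$ if
  $\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > 1$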
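The likelihood-ratio form of the rule can be sketched as follows. The helper names `lr_threshold` and `choose_class` are hypothetical, and the layout `loss[i][j]` = loss of action i+1 when the true state is class j+1 is an assumption made for this example.

```python
def lr_threshold(loss, priors):
    """Threshold (l12 - l22) / (l21 - l11) * P(w2) / P(w1) for the ratio test."""
    (l11, l12), (l21, l22) = loss
    if l21 - l11 <= 0:
        raise ValueError("the ratio form assumes l21 - l11 > 0")
    return (l12 - l22) / (l21 - l11) * priors[1] / priors[0]

def choose_class(likelihood_ratio, loss, priors):
    """Return 1 (decide w1) if p(x|w1)/p(x|w2) exceeds the threshold, else 2."""
    return 1 if likelihood_ratio > lr_threshold(loss, priors) else 2

zero_one = [[0.0, 1.0], [1.0, 0.0]]
costly_miss = [[0.0, 1.0], [2.0, 0.0]]   # deciding w2 when w1 is true costs 2

print(lr_threshold(zero_one, [0.5, 0.5]))      # equal losses and priors
print(lr_threshold(costly_miss, [0.5, 0.5]))   # threshold drops, favouring w1
```

Raising the loss for wrongly deciding class 2 lowers the threshold, so weaker evidence for class 1 is enough to choose it.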
37 2.3 Minimum Error Rate Classification
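38 Minimum Error Rate
MINIMUM ERROR RATE
- Consider a symmetrical or zero-one loss function
  $\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $1$ if $i \ne j$, for $i, j = 1, \ldots, c$
- The conditional risk is the average probability of error:
  $R(\alpha_i|\mathbf{x}) = \sum_{j \ne i} P(\omega_j|\mathbf{x}) = 1 - P(\omega_i|\mathbf{x})$
- To minimize error, maximize $P(\omega_i|\mathbf{x})$, also known as
maximum a posteriori decoding (MAP).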
39 Minimum Error Rate
LIKELIHOOD RATIO
- Minimum error rate classification
- choose $\omega_i$ if $P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x})$ for all $j \ne i$
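To see why the zero-one loss turns risk minimization into picking the largest posterior, the short check below computes the conditional risks for an arbitrary, made-up posterior vector and confirms that each risk equals one minus the corresponding posterior. The posterior values and helper names are illustrative assumptions.

```python
def zero_one_loss(c):
    """c x c loss table: 0 on the diagonal (correct), 1 elsewhere (error)."""
    return [[0.0 if i == j else 1.0 for j in range(c)] for i in range(c)]

def conditional_risks(loss, post):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x)."""
    return [sum(l * p for l, p in zip(row, post)) for row in loss]

post = [0.2, 0.5, 0.3]                       # made-up posteriors P(w_j|x)
risks = conditional_risks(zero_one_loss(3), post)

map_class = max(range(3), key=post.__getitem__)        # largest posterior
min_risk_class = min(range(3), key=risks.__getitem__)  # smallest risk
print(risks, map_class, min_risk_class)
```

Both selections agree, which is the MAP rule stated above.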
40 Example
- 3. It is known that 1% of the population suffers from a
particular disease. A blood test has a 97% chance of
identifying the disease in a diseased individual, but
also has a 6% chance of falsely indicating that a healthy
person has the disease.
- a. What is the probability that a random person has a
positive blood test?
- b. If a blood test is positive, what is the probability
that the person has the disease?
- c. If a blood test is negative, what is the probability
that the person does not have the disease?
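41
- $S$ is a Boolean RV indicating whether a person has the
disease. $P(S) = 0.01$, $P(\neg S) = 0.99$.
- $T$ is a Boolean RV indicating the test result ($T$ true
indicates that the test is positive).
- $P(T|S) = 0.97$, $P(\neg T|S) = 0.03$
- $P(T|\neg S) = 0.06$, $P(\neg T|\neg S) = 0.94$
- (a) $P(T) = P(S)P(T|S) + P(\neg S)P(T|\neg S) = 0.01 \times 0.97 + 0.99 \times 0.06 = 0.0691$
- (b) $P(S|T) = P(T|S)P(S)/P(T) = 0.97 \times 0.01 / 0.0691 \approx 0.1403$
- (c) $P(\neg S|\neg T) = P(\neg T|\neg S)P(\neg S)/P(\neg T) = P(\neg T|\neg S)P(\neg S)/(1 - P(T)) = 0.94 \times 0.99/(1 - 0.0691) \approx 0.9997$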
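The three answers can be reproduced with a few lines of arithmetic. This is just a numerical check of the worked solution; the variable names are made up for the example.

```python
# Given quantities from the problem statement
p_s = 0.01               # P(S): prevalence of the disease
p_t_given_s = 0.97       # P(T|S): test is positive for a diseased person
p_t_given_not_s = 0.06   # P(T|not S): false positive rate

# (a) total probability of a positive test
p_t = p_s * p_t_given_s + (1.0 - p_s) * p_t_given_not_s

# (b) Bayes formula: probability of disease given a positive test
p_s_given_t = p_t_given_s * p_s / p_t

# (c) probability of being healthy given a negative test
p_not_s_given_not_t = (1.0 - p_t_given_not_s) * (1.0 - p_s) / (1.0 - p_t)

print(round(p_t, 4), round(p_s_given_t, 4), round(p_not_s_given_not_t, 4))
```

Even with a 97% detection rate, a positive result implies only about a 14% chance of disease, because the 1% prior is so small.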
42
- A physician can take two possible actions after seeing a
patient's test results:
- A1: Decide the patient is sick
- A2: Decide the patient is healthy
- The costs of those actions are
- If the patient is healthy, but the doctor decides he/she
is sick: 20,000
- If the patient is sick, but the doctor decides he/she is
healthy: 100,000
- When the test is positive (correct decisions cost
nothing, so $R(A_1|S) = R(A_2|\neg S) = 0$)
- $R(A_1|T) = R(A_1|S)P(S|T) + R(A_1|\neg S)P(\neg S|T) = R(A_1|\neg S)P(\neg S|T) = 20{,}000 \times 0.8597 = 17{,}194.00$
- $R(A_2|T) = R(A_2|S)P(S|T) + R(A_2|\neg S)P(\neg S|T) = R(A_2|S)P(S|T) = 100{,}000 \times 0.1403 = 14{,}030.00$
43
- A physician can take three possible actions after seeing
a patient's test results:
- Decide the patient is sick
- Decide the patient is healthy
- Send the patient for another test
- The costs of those actions are
- If the patient is healthy, but the doctor decides he/she
is sick: 20,000
- If the patient is sick, but the doctor decides he/she is
healthy: 100,000
- Sending the patient for another test costs 15,000
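44
- When the test is positive
- $R(A_1|T) = R(A_1|S)P(S|T) + R(A_1|\neg S)P(\neg S|T) = R(A_1|\neg S)P(\neg S|T) = 20{,}000 \times 0.8597 = 17{,}194.00$
- $R(A_2|T) = R(A_2|S)P(S|T) + R(A_2|\neg S)P(\neg S|T) = R(A_2|S)P(S|T) = 100{,}000 \times 0.1403 = 14{,}030.00$
- $R(A_3|T) = 15{,}000.00$
- When the test is negative
- $R(A_1|\neg T) = R(A_1|S)P(S|\neg T) + R(A_1|\neg S)P(\neg S|\neg T) = R(A_1|\neg S)P(\neg S|\neg T) = 20{,}000 \times 0.9997 = 19{,}994.00$
- $R(A_2|\neg T) = R(A_2|S)P(S|\neg T) + R(A_2|\neg S)P(\neg S|\neg T) = R(A_2|S)P(S|\neg T) = 100{,}000 \times 0.0003 = 30.00$
- $R(A_3|\neg T) = 15{,}000.00$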
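The conditional risks above can be recomputed and compared programmatically. The sketch below assumes that correct decisions cost nothing and that the retest costs 15,000 regardless of the true state; it uses unrounded posteriors, so its values differ slightly from the rounded figures on the slide. All names are illustrative.

```python
def posterior_sick(test_positive, p_s=0.01, sens=0.97, false_pos=0.06):
    """P(S|T) if the test is positive, P(S|not T) otherwise."""
    if test_positive:
        sick, healthy = sens * p_s, false_pos * (1.0 - p_s)
    else:
        sick, healthy = (1.0 - sens) * p_s, (1.0 - false_pos) * (1.0 - p_s)
    return sick / (sick + healthy)

# action -> (cost if patient is sick, cost if patient is healthy)
LOSS = {
    "A1 decide sick": (0.0, 20_000.0),
    "A2 decide healthy": (100_000.0, 0.0),
    "A3 retest": (15_000.0, 15_000.0),
}

def risks(test_positive):
    """Conditional risk of every action given the test outcome."""
    p = posterior_sick(test_positive)
    return {a: c_sick * p + c_healthy * (1.0 - p)
            for a, (c_sick, c_healthy) in LOSS.items()}

def best_action(test_positive):
    r = risks(test_positive)
    return min(r, key=r.get)

for outcome in (True, False):
    print(outcome, risks(outcome), best_action(outcome))
```

Under these costs, deciding "healthy" has the lowest conditional risk for both test outcomes, because the posterior probability of disease stays low even after a positive test.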
45 Example
- For the sea bass population, the lightness $x$ is a normal
random variable distributed according to $N(4,1)$
- For the salmon population, $x$ is distributed according to
$N(10,1)$
- Select the optimal decision where
- The two fish are equiprobable
- P(sea bass) = 2 × P(salmon)
- The cost of classifying a fish as a salmon when it truly
is a sea bass is 2, and the cost of classifying a fish as
a sea bass when it truly is a salmon is 1.
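One way to work this example is to solve the likelihood-ratio rule for the lightness value at which the decision flips. The sketch below assumes equal variances, zero loss for correct decisions, and (for the third case) equal priors, since the slide does not say which priors accompany the unequal costs; the function name and parameters are hypothetical.

```python
import math

def bass_threshold(p_bass, loss_salmon_when_bass=1.0, loss_bass_when_salmon=1.0,
                   mu_bass=4.0, mu_salmon=10.0, sigma=1.0):
    """Lightness x* such that 'sea bass' is the minimum-risk decision for x < x*.

    Derived from ln[p(x|bass)/p(x|salmon)] > ln(theta) with
    theta = (loss_bass_when_salmon / loss_salmon_when_bass) * P(salmon) / P(bass).
    Assumes equal variances and mu_bass < mu_salmon.
    """
    p_salmon = 1.0 - p_bass
    theta = (loss_bass_when_salmon / loss_salmon_when_bass) * (p_salmon / p_bass)
    midpoint = 0.5 * (mu_bass + mu_salmon)
    return midpoint + sigma ** 2 * math.log(theta) / (mu_bass - mu_salmon)

print(bass_threshold(0.5))              # equiprobable fish
print(bass_threshold(2.0 / 3.0))        # P(sea bass) = 2 x P(salmon)
print(bass_threshold(0.5, 2.0, 1.0))    # unequal costs, equal priors assumed
```

With equal priors and equal costs the boundary sits midway between the two means; a larger sea bass prior, or a larger penalty for missing a sea bass, pushes the boundary toward the salmon mean.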
49 Exercise
Consider a 2-class problem with P(C1) = 2/3, P(C2) = 1/3,
a scalar feature x, and three possible actions a1, a2, a3
defined as
- a1: choose C1
- a2: choose C2
- a3: do not classify
Let the loss matrix $\lambda(a_i|C_j)$ be
      a1    a2    a3
C1    0     1     1/4
C2    1     0     1/4
- And let $p(x|C_1) = (2 - x)/2$, $p(x|C_2) = 1/2$, $0 \le x \le 2$
- Questions
- 1) Which action should be chosen for a pattern x, $0 \le x \le 2$?
- 2) What is the proportion of patterns for which action a3
is performed (i.e., do not classify)?
- 3) Compute the total minimum risk
- 4) If you decide to take action a1 for all x, by how much
does the total risk change?
50 Solution
$p(x) = (5 - 2x)/6$, $P(C_1|x) = (4 - 2x)/(5 - 2x)$,
$P(C_2|x) = 1/(5 - 2x)$, $0 \le x \le 2$
This leads to conditional risks
- $r_1(x) = r(a_1|x) = 0 \cdot P(C_1|x) + 1 \cdot P(C_2|x) = 1/(5 - 2x)$
- $r_2(x) = r(a_2|x) = 1 \cdot P(C_1|x) + 0 \cdot P(C_2|x) = (4 - 2x)/(5 - 2x)$
- $r_3(x) = r(a_3|x) = \tfrac{1}{4} P(C_1|x) + \tfrac{1}{4} P(C_2|x) = 1/4$
Bayes decision rule assigns to each x the action with the
minimum conditional risk. The conditional risks are
sketched in the following figure and the optimal decision
rule is therefore
51
[Figure: conditional risks $r_1(x)$, $r_2(x)$, $r_3(x)$ versus x]
- If $0 \le x \le 0.5$, then action a1: choose C1
- If $0.5 \le x \le 11/6$, then action a3: do not classify
- If $11/6 \le x \le 2$, then action a2: choose C2
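52
- 1) In this particular case, the action "do not classify"
is optimal whenever x is between 1/2 and 11/6.
- 2) The proportion of unclassified patterns is
$\int_{1/2}^{11/6} p(x)\,dx = 16/27 \approx 0.59$. Therefore, the
"do not classify" action is performed for about 60% of the
input patterns.
- 3) Total minimum risk:
  $R^* = \int_0^{1/2} r_1(x)p(x)\,dx + \int_{1/2}^{11/6} r_3(x)p(x)\,dx + \int_{11/6}^{2} r_2(x)p(x)\,dx = \tfrac{1}{12} + \tfrac{4}{27} + \tfrac{1}{216} = \tfrac{17}{72} \approx 0.236$
- 4) If instead of using the Bayes classifier we choose to
take a1 for all x, then the total risk is
  $R = \int_0^2 r_1(x)p(x)\,dx = P(C_2) = 1/3 \approx 0.333$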
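The answers to this exercise can be cross-checked numerically. The sketch below applies the minimum-risk rule on a fine grid over [0, 2] and integrates with the midpoint rule; the grid size and all names are arbitrary choices made for this illustration.

```python
PRIORS = (2.0 / 3.0, 1.0 / 3.0)          # P(C1), P(C2)
# Rows are actions a1, a2, a3; columns are true classes C1, C2.
LOSS = ((0.0, 1.0), (1.0, 0.0), (0.25, 0.25))

def joint_densities(x):
    """p(x|C1) P(C1) and p(x|C2) P(C2) on 0 <= x <= 2."""
    return ((2.0 - x) / 2.0 * PRIORS[0], 0.5 * PRIORS[1])

def integrate(n=20000):
    """Midpoint-rule estimates of Bayes risk, reject proportion, always-a1 risk."""
    h = 2.0 / n
    bayes_risk = reject_mass = always_a1 = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        j1, j2 = joint_densities(x)
        # r_i(x) p(x) for each action; p(x) > 0, so the argmin is unchanged
        weighted = [row[0] * j1 + row[1] * j2 for row in LOSS]
        best = min(range(3), key=weighted.__getitem__)
        bayes_risk += weighted[best] * h
        always_a1 += weighted[0] * h
        if best == 2:
            reject_mass += (j1 + j2) * h
    return bayes_risk, reject_mass, always_a1

bayes_risk, reject_mass, always_a1 = integrate()
print(bayes_risk, reject_mass, always_a1)
```

The estimates should land close to 17/72 for the minimum risk, 16/27 for the rejected proportion, and 1/3 for the always-a1 policy.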
53 Case 1: $\Sigma_i = \sigma^2 I$
GAUSSIAN CLASSIFIERS
- Features are statistically independent, and all features
have the same variance. Distributions are spherical in d
dimensions.
54 GAUSSIAN CLASSIFIERS
THRESHOLD DECODING
- This has a simple geometric interpretation
- When the priors are equal and the support regions are
spherical, the decision boundary lies simply halfway
between the means (Euclidean distance).
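For spherical Gaussians, the textbook discriminant is $g_i(\mathbf{x}) = -\|\mathbf{x} - \mu_i\|^2 / (2\sigma^2) + \ln P(\omega_i)$; the slides do not print it, so treat the formula and the code below as a sketch under that assumption. The means, variance, and priors are invented for illustration.

```python
import math

def discriminant(x, mean, sigma2, prior):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i) for Sigma_i = sigma^2 I."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -sq_dist / (2.0 * sigma2) + math.log(prior)

def classify(x, means, sigma2, priors):
    """Index of the class with the largest discriminant value."""
    scores = [discriminant(x, m, sigma2, p) for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

means = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical class means

# Equal priors: the boundary is the midpoint x1 = 2 (nearest-mean rule)
print(classify((1.9, 0.0), means, 1.0, [0.5, 0.5]),
      classify((2.1, 0.0), means, 1.0, [0.5, 0.5]))

# Unequal priors: the boundary moves away from the more likely class 0
print(classify((2.4, 0.0), means, 1.0, [0.9, 0.1]),
      classify((2.7, 0.0), means, 1.0, [0.9, 0.1]))
```

With equal priors the log-prior terms cancel and the rule reduces to choosing the nearest mean in Euclidean distance.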
55GAUSSIAN CLASSIFIERS
56 GAUSSIAN CLASSIFIERS
Note how the priors shift the boundary away from the
more likely mean!
57GAUSSIAN CLASSIFIERS
58 Case 2: $\Sigma_i = \Sigma$
GAUSSIAN CLASSIFIERS
- Covariance matrices are arbitrary, but equal to each
other for all classes. Features then form hyper-ellipsoidal
clusters of equal size and shape.
- The discriminant function is linear
61 Case 3: $\Sigma_i$ arbitrary
- The covariance matrices are different for each category
- All bets are off! In the two-class case, the decision
boundaries form hyperquadrics.
- (Hyperquadrics are hyperplanes, pairs of hyperplanes,
hyperspheres, hyperellipsoids, hyperparaboloids,
hyperhyperboloids)
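A small two-dimensional sketch shows how unequal covariances produce a curved boundary. It uses the standard Gaussian discriminant $g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$, which is assumed here because the slides do not print it. The two classes (same mean, covariances $I$ and $9I$) are invented for the example.

```python
import math

def quadratic_discriminant(x, mean, cov, prior):
    """Gaussian discriminant for a 2-D class with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # (x - m)^T C^{-1} (x - m), using the closed-form inverse of a 2x2 matrix
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) + math.log(prior)

def classify(x, classes):
    """classes: list of (mean, cov, prior); returns the index of the best class."""
    scores = [quadratic_discriminant(x, m, c, p) for m, c, p in classes]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical classes: a tight cluster and a broad one around the same mean
classes = [((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)), 0.5),
           ((0.0, 0.0), ((9.0, 0.0), (0.0, 9.0)), 0.5)]

for point in [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (-3.0, 0.0)]:
    print(point, classify(point, classes))
```

Points near the origin go to the tight class and points farther out in any direction go to the broad class, so the boundary is a circle (a hypersphere in 2-D) and no single line could separate the two regions.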
63GAUSSIAN CLASSIFIERS
ARBITRARY COVARIANCES