Title: Bayesian Decision Theory (Classification)
1. Bayesian Decision Theory (Classification)
2. Contents
- Introduction
- Generalized Bayesian Decision Rule
- Discriminant Functions
- The Normal Distribution
- Discriminant Functions for the Normal Populations
- Minimax Criterion
- Neyman-Pearson Criterion
3. Bayesian Decision Theory (Classification)
- Introduction

4. What is Bayesian Decision Theory?
- A mathematical foundation for decision making.
- A probabilistic approach to decision making (e.g., classification) that aims to minimize the risk (expected cost).
5. Preliminaries and Notations
- $\omega$: a state of nature
- $P(\omega_i)$: prior probability
- $\mathbf{x}$: feature vector
- $p(\mathbf{x}|\omega_i)$: class-conditional density
- $P(\omega_i|\mathbf{x})$: posterior probability
6. Bayesian Rule
$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)P(\omega_j)$$
7. Decision
The evidence $p(\mathbf{x})$ is merely a scale factor, so it is unimportant in making a decision; only the product $p(\mathbf{x}|\omega_i)P(\omega_i)$ matters.
8. Decision
Decide $\omega_i$ if $P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x})$ for all $j \ne i$.
Equivalently, decide $\omega_i$ if $p(\mathbf{x}|\omega_i)P(\omega_i) > p(\mathbf{x}|\omega_j)P(\omega_j)$ for all $j \ne i$.
- Special cases
  1. $P(\omega_1) = P(\omega_2) = \cdots = P(\omega_c)$: decide by the likelihoods alone.
  2. $p(\mathbf{x}|\omega_1) = p(\mathbf{x}|\omega_2) = \cdots = p(\mathbf{x}|\omega_c)$: decide by the priors alone.
9. Two Categories
Decide $\omega_i$ if $P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x})$ for all $j \ne i$.
Decide $\omega_i$ if $p(\mathbf{x}|\omega_i)P(\omega_i) > p(\mathbf{x}|\omega_j)P(\omega_j)$ for all $j \ne i$.
With $c = 2$ these reduce to:
Decide $\omega_1$ if $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$; otherwise decide $\omega_2$.
Decide $\omega_1$ if $p(\mathbf{x}|\omega_1)P(\omega_1) > p(\mathbf{x}|\omega_2)P(\omega_2)$; otherwise decide $\omega_2$.
- Special cases
  1. $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(\mathbf{x}|\omega_1) > p(\mathbf{x}|\omega_2)$; otherwise decide $\omega_2$.
  2. $p(\mathbf{x}|\omega_1) = p(\mathbf{x}|\omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$.
10. Example
[Figure: two class-conditional densities with $P(\omega_1) = P(\omega_2)$; the decision boundary splits the feature space into regions $R_1$ and $R_2$.]
11. Example
$P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$.
Decide $\omega_1$ if $p(\mathbf{x}|\omega_1)P(\omega_1) > p(\mathbf{x}|\omega_2)P(\omega_2)$; otherwise decide $\omega_2$.
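A minimal sketch of this decision rule in code. The slide fixes only the priors, so the univariate Gaussian class-conditional densities below are assumed purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Priors from the slide's example.
priors = {1: 2/3, 2: 1/3}

# Hypothetical class-conditional densities p(x|w_i); the slide does not
# specify them, so unit-variance Gaussians are assumed here.
likelihoods = {1: norm(loc=0.0, scale=1.0), 2: norm(loc=2.0, scale=1.0)}

def decide(x):
    """Decide w_1 if p(x|w_1)P(w_1) > p(x|w_2)P(w_2); otherwise w_2."""
    s1 = likelihoods[1].pdf(x) * priors[1]
    s2 = likelihoods[2].pdf(x) * priors[2]
    return 1 if s1 > s2 else 2

print(decide(0.5))  # -> 1 (near class 1's mean, and class 1 has the larger prior)
print(decide(2.5))  # -> 2
```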
12. Classification Error
Consider two categories and the rule: decide $\omega_1$ if $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$; otherwise decide $\omega_2$. Then
$$P(\text{error}|\mathbf{x}) = \min\left[P(\omega_1|\mathbf{x}),\, P(\omega_2|\mathbf{x})\right]$$

13. Classification Error
Averaging over $\mathbf{x}$ gives the overall error probability:
$$P(\text{error}) = \int P(\text{error}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
14. Bayesian Decision Theory (Classification)
- Generalized Bayesian Decision Rule
15. The Generalization
- $\Omega = \{\omega_1, \ldots, \omega_c\}$: a set of $c$ states of nature.
- $A = \{\alpha_1, \ldots, \alpha_a\}$: a set of $a$ possible actions.
- $\lambda(\alpha_i|\omega_j)$: the loss incurred for taking action $\alpha_i$ when the true state of nature is $\omega_j$; the loss for a correct decision can be zero.
We want to minimize the expected loss (risk) in making decisions.
16. Conditional Risk
Given $\mathbf{x}$, the expected loss (risk) associated with taking action $\alpha_i$ is
$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\mathbf{x})$$
17. 0/1 Loss Function
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}
\qquad\Rightarrow\qquad
R(\alpha_i|\mathbf{x}) = \sum_{j \ne i} P(\omega_j|\mathbf{x}) = 1 - P(\omega_i|\mathbf{x})$$
With 0/1 loss, minimizing the risk is the same as maximizing the posterior (minimum error rate).
18. Decision
Bayesian decision rule: take the action with the smallest conditional risk,
$$\alpha^* = \arg\min_{i} R(\alpha_i|\mathbf{x})$$
19. Overall Risk
Decision function $\alpha(\mathbf{x})$: a rule assigning an action to each $\mathbf{x}$. The overall risk is
$$R = \int R(\alpha(\mathbf{x})|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
The Bayesian decision rule is the optimal one for minimizing the overall risk; its resulting overall risk is called the Bayes risk.
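Minimizing the overall risk amounts to minimizing the conditional risk at every $\mathbf{x}$. A small sketch of that computation; the loss-matrix values are illustrative, not from the slides:

```python
import numpy as np

# Loss matrix lam[i, j] = loss for taking action a_i when the true state
# is w_j. These values are hypothetical.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

def bayes_action(posteriors):
    """Return the action minimizing the conditional risk
    R(a_i|x) = sum_j lam[i, j] * P(w_j|x)."""
    risks = lam @ posteriors        # vector of R(a_i|x)
    return int(np.argmin(risks)), risks

action, risks = bayes_action(np.array([0.3, 0.7]))
print(action, risks)  # risks = [1.4, 0.3] -> action index 1 (a_2)
```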
20. Two-Category Classification
Write $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$. The conditional risks are
$$R(\alpha_1|\mathbf{x}) = \lambda_{11}P(\omega_1|\mathbf{x}) + \lambda_{12}P(\omega_2|\mathbf{x})$$
$$R(\alpha_2|\mathbf{x}) = \lambda_{21}P(\omega_1|\mathbf{x}) + \lambda_{22}P(\omega_2|\mathbf{x})$$
21. Two-Category Classification
Perform $\alpha_1$ if $R(\alpha_2|\mathbf{x}) > R(\alpha_1|\mathbf{x})$; otherwise perform $\alpha_2$.
22. Two-Category Classification
Substituting the conditional risks, the rule becomes: perform $\alpha_1$ if
$$(\lambda_{21} - \lambda_{11})\,P(\omega_1|\mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|\mathbf{x})$$
Both factors $(\lambda_{21} - \lambda_{11})$ and $(\lambda_{12} - \lambda_{22})$ are positive (a wrong decision costs more than a correct one), so the posterior probabilities are scaled before comparison.
23. Two-Category Classification
By the Bayes rule, $P(\omega_i|\mathbf{x}) \propto p(\mathbf{x}|\omega_i)P(\omega_i)$; the evidence $p(\mathbf{x})$ is irrelevant. Perform $\alpha_1$ if
$$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x}|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x}|\omega_2)P(\omega_2)$$
otherwise perform $\alpha_2$.
24. Two-Category Classification
(This slide will be recalled later.) Rearranging gives a likelihood-ratio test: perform $\alpha_1$ if
$$\underbrace{\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}}_{\text{threshold } \theta}$$
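A sketch of this likelihood-ratio form as code; the densities, priors, and losses below are placeholders to be supplied by the problem at hand:

```python
from scipy.stats import norm

def make_lr_rule(p1, p2, prior1, prior2, lam11, lam12, lam21, lam22):
    """Decide w_1 iff the likelihood ratio p(x|w_1)/p(x|w_2) exceeds
    theta = (lam12 - lam22) P(w_2) / ((lam21 - lam11) P(w_1))."""
    theta = (lam12 - lam22) * prior2 / ((lam21 - lam11) * prior1)
    return lambda x: 1 if p1.pdf(x) / p2.pdf(x) > theta else 2

# Hypothetical densities and losses, for illustration only.
rule = make_lr_rule(norm(0, 1), norm(2, 1), 2/3, 1/3,
                    lam11=0.0, lam12=2.0, lam21=1.0, lam22=0.0)
print(rule(1.2))  # theta = 1 here, and the ratio at x = 1.2 is below it -> 2
```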
25. Bayesian Decision Theory (Classification)
- Discriminant Functions
26. The Multicategory Classification
How to define discriminant functions? The $g_i(\mathbf{x})$'s are called the discriminant functions.
[Figure: a network computing $g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_c(\mathbf{x})$ from the input $\mathbf{x}$ and selecting the maximum.]
Assign $\mathbf{x}$ to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$; a code sketch follows.
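A minimal sketch of this classifier. The three Gaussian class models and priors are hypothetical, and $g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$ (the minimum error-rate choice of the next slide) is assumed:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical three-class Gaussian problem; means, covariances, and
# priors are illustrative stand-ins, not values from the slides.
means  = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs   = [np.eye(2)] * 3
priors = [0.5, 0.3, 0.2]

def classify(x):
    """Assign x to the class with the largest discriminant
    g_i(x) = ln p(x|w_i) + ln P(w_i)."""
    g = [multivariate_normal(m, c).logpdf(x) + np.log(p)
         for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(g))

print(classify(np.array([2.0, 0.5])))  # -> 1 (closest to the second mean)
```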
27. Simple Discriminant Functions
If $f(\cdot)$ is a monotonically increasing function, then the $f(g_i(\cdot))$'s are also discriminant functions.
- Minimum-risk case: $g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
- Minimum error-rate case: $g_i(\mathbf{x}) = P(\omega_i|\mathbf{x})$, or equivalently $g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$
28. Decision Regions
The discriminant functions partition the feature space into decision regions, which are separated by decision boundaries.
[Figure: two-category example of decision regions and their boundary.]
29. Bayesian Decision Theory (Classification)
- The Normal Distribution
30. Basics of Probability
Discrete random variable $X$ (assume integer-valued):
- Probability mass function (pmf): $P(x) = P(X = x)$
- Cumulative distribution function (cdf): $F(x) = P(X \le x) = \sum_{k \le x} P(k)$
Continuous random variable $X$:
- Probability density function (pdf): $p(x)$; note that $p(x)$ is not a probability.
- Cumulative distribution function (cdf): $F(x) = \int_{-\infty}^{x} p(t)\,dt$
31. Expectations
Let $g$ be a function of random variable $X$:
$$E[g(X)] = \sum_{x} g(x)P(x) \ \ \text{(discrete)}, \qquad E[g(X)] = \int g(x)p(x)\,dx \ \ \text{(continuous)}$$
- The $k$th moment: $E[X^k]$
- The 1st moment: $E[X]$ (the mean)
- The $k$th central moment: $E[(X - E[X])^k]$
32. Important Expectations
- Mean: $\mu = E[X]$
- Variance: $\sigma^2 = E[(X - \mu)^2]$
- Fact: $\sigma^2 = E[X^2] - \mu^2$
33. Entropy
The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution:
$$H = -\sum_{x} P(x)\ln P(x) \ \ \text{(discrete)}, \qquad H = -\int p(x)\ln p(x)\,dx \ \ \text{(continuous)}$$
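A quick numerical illustration of the discrete form, on two assumed two-outcome distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum_i p_i ln p_i of a discrete distribution (nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # 0 * ln 0 is taken as 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))       # ln 2 ~ 0.693: maximal uncertainty for 2 outcomes
print(entropy([0.9, 0.1]))       # ~0.325: a more predictable distribution
```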
34. Univariate Gaussian Distribution
$X \sim N(\mu, \sigma^2)$:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$E[X] = \mu$, $\mathrm{Var}[X] = \sigma^2$
- Properties
  - Maximizes the entropy among all distributions with a given mean and variance.
  - Central limit theorem: sums of many independent random variables tend toward a Gaussian.
35. Random Vectors
A $d$-dimensional random vector $\mathbf{X} = (X_1, \ldots, X_d)^T$:
- Vector mean: $\boldsymbol{\mu} = E[\mathbf{X}]$
- Covariance matrix: $\Sigma = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$
36. Multivariate Gaussian Distribution
$\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, a $d$-dimensional random vector:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
$E[\mathbf{X}] = \boldsymbol{\mu}$, $E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T] = \Sigma$
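The density formula can be checked directly against a library implementation; the particular $\boldsymbol{\mu}$ and $\Sigma$ below are arbitrary test values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x:
    (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.0, 0.0])
print(np.isclose(mvn_pdf(x, mu, Sigma),
                 multivariate_normal(mu, Sigma).pdf(x)))  # True
```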
37. Properties of $N(\boldsymbol{\mu}, \Sigma)$
$\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, a $d$-dimensional random vector. Let $\mathbf{Y} = A^T\mathbf{X}$, where $A$ is a $d \times k$ matrix. Then
$$\mathbf{Y} \sim N(A^T\boldsymbol{\mu},\ A^T\Sigma A)$$
38. Properties of $N(\boldsymbol{\mu}, \Sigma)$
[Figure: the same linear-transform property $\mathbf{Y} = A^T\mathbf{X} \sim N(A^T\boldsymbol{\mu},\ A^T\Sigma A)$ illustrated geometrically.]
39. On Parameters of $N(\boldsymbol{\mu}, \Sigma)$
$\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)^T$, $\mu_i = E[X_i]$, and $\Sigma = [\sigma_{ij}]$, $\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$; the diagonal entries $\sigma_{ii}$ are the component variances.
40. More On Covariance Matrix
$\Sigma$ is symmetric and positive semidefinite, so it admits the eigendecomposition
$$\Sigma = \Phi\Lambda\Phi^T$$
- $\Phi$: orthonormal matrix whose columns are the eigenvectors of $\Sigma$.
- $\Lambda$: diagonal matrix of the eigenvalues.
41. Whitening Transform
$\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$; $\mathbf{Y} = A^T\mathbf{X} \sim N(A^T\boldsymbol{\mu},\ A^T\Sigma A)$.
Let $A_w = \Phi\Lambda^{-1/2}$. Then
$$A_w^T \Sigma A_w = \Lambda^{-1/2}\Phi^T(\Phi\Lambda\Phi^T)\Phi\Lambda^{-1/2} = I$$
so the transformed vector $\mathbf{Y} = A_w^T\mathbf{X}$ has identity covariance.
42. Whitening Transform
[Figure: the whitening transform $\mathbf{Y} = A_w^T\mathbf{X}$ decomposed into a projection onto the eigenvector axes ($\Phi^T$) followed by a rescaling ($\Lambda^{-1/2}$); an elliptical Gaussian becomes spherical.]
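A sketch of the whitening transform applied to sampled data, using an assumed covariance; the sample covariance of the transformed points should come out close to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.2], [1.2, 1.0]])   # hypothetical covariance
mu = np.array([2.0, -1.0])

# Eigendecomposition Sigma = Phi Lambda Phi^T (Phi orthonormal, Lambda diagonal).
vals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(vals ** -0.5)            # whitening matrix A_w = Phi Lambda^{-1/2}

# Y = A_w^T X has covariance A_w^T Sigma A_w = I.
X = rng.multivariate_normal(mu, Sigma, size=10000)
Y = X @ A_w                                  # row-vector convention: y^T = x^T A_w
print(np.round(np.cov(Y.T), 2))              # approximately the identity matrix
```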
43. Mahalanobis Distance
$\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$. The squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$ is
$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})$$
The density is constant on surfaces of constant $r^2$.

44. Mahalanobis Distance
[Figure: contours of constant density are hyperellipsoids whose size depends on the value of $r^2$.]
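A short check, under assumed parameters, that the Mahalanobis distance equals the ordinary Euclidean distance after whitening:

```python
import numpy as np

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance r^2 = (x-mu)^T Sigma^{-1} (x-mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2], [1.2, 1.0]])   # hypothetical covariance
x = np.array([2.0, 1.0])

# After whitening, the Mahalanobis distance becomes Euclidean distance.
vals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(vals ** -0.5)
y = A_w.T @ (x - mu)
print(np.isclose(mahalanobis2(x, mu, Sigma), y @ y))  # True
```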
45. Bayesian Decision Theory (Classification)
- Discriminant Functions for the Normal Populations
46. Minimum-Error-Rate Classification
$\mathbf{X}_i \sim N(\boldsymbol{\mu}_i, \Sigma_i)$. Using $g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$:
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
47. Minimum-Error-Rate Classification
Three cases:
- Case 1 ($\Sigma_i = \sigma^2 I$): classes are centered at different means, and their feature components are pairwise independent and have the same variance.
- Case 2 ($\Sigma_i = \Sigma$): classes are centered at different means but have the same covariance matrix.
- Case 3: arbitrary $\Sigma_i$.
48. Case 1: $\Sigma_i = \sigma^2 I$
The terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i| = -\frac{d}{2}\ln\sigma^2$ are the same for every class, hence irrelevant, leaving
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$
49. Case 1: $\Sigma_i = \sigma^2 I$
Expanding the quadratic and dropping the class-independent $\mathbf{x}^T\mathbf{x}$ term gives a linear discriminant:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^2}, \qquad w_{i0} = -\frac{\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i}{2\sigma^2} + \ln P(\omega_i)$$
50. Case 1: $\Sigma_i = \sigma^2 I$
Boundary between $\omega_i$ and $\omega_j$: set $g_i(\mathbf{x}) = g_j(\mathbf{x})$, which can be written as $\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0$.
51. Case 1: $\Sigma_i = \sigma^2 I$
$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
The decision boundary is a hyperplane perpendicular to the line between the means, located somewhere along it. The log-prior term is 0 if $P(\omega_i) = P(\omega_j)$, in which case $\mathbf{x}_0$ is the midpoint between the means.
52. Case 1: $\Sigma_i = \sigma^2 I$
With equal priors, the rule reduces to a minimum-distance classifier (template matching): assign $\mathbf{x}$ to the class with the nearest mean.
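A minimal sketch of the resulting minimum-distance classifier, with hypothetical class means as templates:

```python
import numpy as np

# With Sigma_i = sigma^2 I and equal priors, the Bayes rule reduces to
# assigning x to the nearest class mean (template matching).
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # hypothetical templates

def min_distance_classify(x):
    d2 = np.sum((means - x) ** 2, axis=1)   # squared Euclidean distances
    return int(np.argmin(d2))

print(min_distance_classify(np.array([2.5, 0.4])))  # -> 1
```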
53-55. Case 1: $\Sigma_i = \sigma^2 I$
[Figures and demo illustrating Case 1 decision boundaries.]
56. Case 2: $\Sigma_i = \Sigma$
The class-independent terms are again irrelevant, leaving
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$
The quadratic term is the squared Mahalanobis distance to $\boldsymbol{\mu}_i$. The prior term is irrelevant if $P(\omega_i) = P(\omega_j)$ for all $i, j$; classification then assigns $\mathbf{x}$ to the class whose mean is nearest in Mahalanobis distance.
57. Case 2: $\Sigma_i = \Sigma$
Expanding the quadratic, the $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ term is class-independent (irrelevant), again yielding a linear discriminant:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$
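A sketch of this shared-covariance linear discriminant; the covariance, means, and priors are illustrative:

```python
import numpy as np

# Case 2 (shared covariance): g_i(x) = w_i^T x + w_i0 with
# w_i = Sigma^{-1} mu_i and w_i0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(w_i).
Sigma  = np.array([[2.0, 0.3], [0.3, 1.0]])   # hypothetical shared covariance
means  = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)
ws  = [Sigma_inv @ m for m in means]
w0s = [-0.5 * m @ Sigma_inv @ m + np.log(p) for m, p in zip(means, priors)]

def classify(x):
    g = [w @ x + w0 for w, w0 in zip(ws, w0s)]
    return int(np.argmax(g))

print(classify(np.array([1.5, 0.8])))  # -> 1 (past the midpoint toward mean 2)
```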
58-60. Case 2: $\Sigma_i = \Sigma$
[Figures and demo illustrating Case 2 decision boundaries.]
61. Case 3: $\Sigma_i \ne \Sigma_j$
$$g_i(\mathbf{x}) = \mathbf{x}^T W_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad \mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Only the $\frac{d}{2}\ln 2\pi$ term is irrelevant here. In Cases 1 and 2 the quadratic term $\mathbf{x}^T W_i\mathbf{x}$ was class-independent and dropped out; here it remains, so the decision surfaces are hyperquadrics, e.g.:
- hyperplanes
- hyperspheres
- hyperellipsoids
- hyperhyperboloids
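A minimal sketch of this general quadratic discriminant, parameterized exactly as above; the example means and covariances are hypothetical:

```python
import numpy as np

# Case 3 (arbitrary Sigma_i): g_i(x) = x^T W_i x + w_i^T x + w_i0, with
# W_i = -0.5 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
# w_i0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(w_i).
def make_quadratic_classifier(means, covs, priors):
    params = []
    for mu, S, p in zip(means, covs, priors):
        S_inv = np.linalg.inv(S)
        W  = -0.5 * S_inv
        w  = S_inv @ mu
        w0 = -0.5 * mu @ S_inv @ mu - 0.5 * np.log(np.linalg.det(S)) + np.log(p)
        params.append((W, w, w0))
    def classify(x):
        g = [x @ W @ x + w @ x + w0 for W, w, w0 in params]
        return int(np.argmax(g))
    return classify

clf = make_quadratic_classifier(
    means=[np.array([0.0, 0.0]), np.array([2.0, 2.0])],
    covs=[np.eye(2), np.array([[3.0, 0.0], [0.0, 0.3]])],
    priors=[0.5, 0.5])
print(clf(np.array([2.0, 0.0])))  # -> 0 (class 2's narrow vertical spread hurts it here)
```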
62. Case 3: $\Sigma_i \ne \Sigma_j$
Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance.
63-65. Case 3: $\Sigma_i \ne \Sigma_j$
[Figures and demo illustrating hyperquadric decision boundaries.]
66. Multi-Category Classification
[Figure: decision regions produced by Gaussian discriminants for several classes.]
67. Bayesian Decision Theory (Classification)
- Minimax Criterion
68. Bayesian Decision Rule: Two-Category Classification
Recall the likelihood-ratio test: decide $\omega_1$ if
$$\underbrace{\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}}_{\text{threshold } \theta}$$
The minimax criterion deals with the case where the prior probabilities are unknown.
69. Basic Concept on Minimax
Choose the worst-case prior probabilities (the maximum loss), and then pick the decision rule that minimizes the overall risk under them; that is, minimize the maximum possible overall risk.
70. Overall Risk
With decision regions $R_1$ and $R_2$, the overall risk is
$$R = \int_{R_1}\left[\lambda_{11}P(\omega_1)p(\mathbf{x}|\omega_1) + \lambda_{12}P(\omega_2)p(\mathbf{x}|\omega_2)\right]d\mathbf{x} + \int_{R_2}\left[\lambda_{21}P(\omega_1)p(\mathbf{x}|\omega_1) + \lambda_{22}P(\omega_2)p(\mathbf{x}|\omega_2)\right]d\mathbf{x}$$

71-73. Overall Risk
Substituting $P(\omega_2) = 1 - P(\omega_1)$ and collecting terms in $P(\omega_1)$ rewrites the risk as a linear function of the prior.

74. Overall Risk
$$R\big(P(\omega_1)\big) = a\,P(\omega_1) + b$$
Both coefficients $a$ and $b$ depend on the setting of the decision boundary. This line gives the overall risk for each particular $P(\omega_1)$.
75. Overall Risk
$$R\big(P(\omega_1)\big) = a\,P(\omega_1) + b$$
Setting $a = 0$ gives the minimax solution: the resulting risk $R_{mm}$, the minimax risk, is independent of the value of $P(\omega_i)$.
76. Minimax Risk
$$R_{mm} = \lambda_{22} + (\lambda_{12}-\lambda_{22})\int_{R_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} = \lambda_{11} + (\lambda_{21}-\lambda_{11})\int_{R_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x}$$
77. Error Probability
Using the 0/1 loss function, the overall risk becomes the error probability:
$$P(\text{error}) = P(\omega_1)\int_{R_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x} + P(\omega_2)\int_{R_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x}$$
78. Minimax Error-Probability
Using the 0/1 loss function, the minimax condition $a = 0$ reduces to equalizing the two conditional error probabilities:
$$P(\omega_1|\omega_2) = \int_{R_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} \;=\; \int_{R_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x} = P(\omega_2|\omega_1)$$
where $P(\omega_i|\omega_j)$ denotes the probability of deciding $\omega_i$ when $\omega_j$ is the true state.
79. Minimax Error-Probability
[Figure: the decision boundary placed so that the two error areas $P(\omega_1|\omega_2)$ and $P(\omega_2|\omega_1)$ are equal.]
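The equal-error condition can be solved numerically. A sketch for univariate Gaussian class-conditionals (an assumption; any densities with computable tail probabilities would do):

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Minimax boundary for 0/1 loss: choose the boundary t so that
# P(decide w_2 | w_1) = P(decide w_1 | w_2), making the risk independent
# of the unknown prior P(w_1). Densities below are assumed.
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)

def error_gap(t):
    miss_w1 = p1.sf(t)    # P(x > t | w_1): decide w_2 although w_1 is true
    miss_w2 = p2.cdf(t)   # P(x < t | w_2): decide w_1 although w_2 is true
    return miss_w1 - miss_w2

t_mm = brentq(error_gap, -10, 10)
print(t_mm, p1.sf(t_mm))  # t = 1.0 (by symmetry), equal error ~0.159
```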
80. Minimax Error-Probability
81. Bayesian Decision Theory (Classification)
- Neyman-Pearson Criterion
82. Bayesian Decision Rule: Two-Category Classification
Recall the likelihood-ratio test: decide $\omega_1$ if
$$\underbrace{\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}}_{\text{threshold } \theta}$$
The Neyman-Pearson criterion deals with the case where both the loss functions and the prior probabilities are unknown.
83. Signal Detection Theory
- Signal detection theory evolved from the development of communications and radar equipment in the first half of the last century.
- It migrated to psychology, initially as part of sensation and perception, in the 1950s and '60s, as an attempt to understand some features of human behavior when detecting very faint stimuli that were not being explained by traditional theories of thresholds.
84. The Situation of Interest
- A person is faced with a stimulus (signal) that is very faint or confusing.
- The person must make a decision: is the signal there or not?
- What makes this situation confusing and difficult is the presence of other activity that is similar to the signal. Let us call this activity noise.
85. Example
Noise is present both in the environment and in
the sensory system of the observer. The observer
reacts to the momentary total activation of the
sensory system, which fluctuates from moment to
moment, as well as responding to environmental
stimuli, which may include a signal.
86. Example
- A radiologist is examining a CT scan, looking for evidence of a tumor.
- It is a hard job, because there is always some uncertainty.
- There are four possible outcomes:
  - hit (tumor present and doctor says "yes")
  - miss (tumor present and doctor says "no")
  - false alarm (tumor absent and doctor says "yes")
  - correct rejection (tumor absent and doctor says "no")
Miss and false alarm are the two types of error.
87. The Four Cases
Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision. With $\omega_1$ = noise only and $\omega_2$ = signal present:

| True state \ Decision | Decide $\omega_1$ ("no") | Decide $\omega_2$ ("yes") |
|---|---|---|
| $\omega_1$ (noise) | Correct Rejection $P(\omega_1|\omega_1)$ | False Alarm $P(\omega_2|\omega_1)$ |
| $\omega_2$ (signal) | Miss $P(\omega_1|\omega_2)$ | Hit $P(\omega_2|\omega_2)$ |
88. Decision Making
- Discriminability: how separated the noise and signal distributions are, $d' = |\mu_2 - \mu_1|/\sigma$.
- Criterion: the decision threshold, set based on expectancy (decision bias).
- Hit: $P(\omega_2|\omega_2)$; False Alarm: $P(\omega_2|\omega_1)$.
89. ROC Curve (Receiver Operating Characteristic)
Plot the hit rate $P_H = P(\omega_2|\omega_2)$ against the false-alarm rate $P_{FA} = P(\omega_2|\omega_1)$ as the criterion sweeps over all values; the resulting curve characterizes the detector independently of any particular criterion.
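A sketch of how ROC points arise, assuming unit-variance Gaussian noise and signal distributions separated by $d'$:

```python
import numpy as np
from scipy.stats import norm

# Sweep the criterion t and record (P_FA, P_H) for assumed
# noise N(0,1) and signal N(d', 1) distributions.
d_prime = 2.0
ts = np.linspace(-4, 6, 200)
P_FA = norm(0.0, 1.0).sf(ts)        # false-alarm rate P(x > t | w_1)
P_H  = norm(d_prime, 1.0).sf(ts)    # hit rate         P(x > t | w_2)

# Each (P_FA[i], P_H[i]) pair is one point on the ROC curve; larger d'
# bows the curve toward the upper-left corner.
print(f"PFA={P_FA[100]:.3f}, PH={P_H[100]:.3f}")
```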
90. Neyman-Pearson Criterion
Hit rate $P_H = P(\omega_2|\omega_2)$; false-alarm rate $P_{FA} = P(\omega_2|\omega_1)$. The Neyman-Pearson criterion:
$$\max P_H \quad\text{subject to}\quad P_{FA} \le \alpha$$
This picks one operating point on the ROC curve.
91. Likelihood Ratio Test
Decide $\omega_2$ if
$$L(\mathbf{x}) = \frac{p(\mathbf{x}|\omega_2)}{p(\mathbf{x}|\omega_1)} > T$$
where $T$ is a threshold that meets the $P_{FA}$ constraint ($\le \alpha$). How to determine $T$? Choose $T$ so that $P_{FA} = P(L(\mathbf{x}) > T \mid \omega_1) = \alpha$.
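A sketch of determining $T$ for assumed Gaussian densities; here the likelihood ratio is monotone in $x$, so the constraint can be solved through an equivalent threshold on $x$ itself:

```python
from scipy.stats import norm

# Hypothetical noise/signal densities; the likelihood ratio for
# N(0,1) vs N(2,1) is monotone increasing in x, so thresholding L(x)
# at T is equivalent to thresholding x at some t, with P_FA = P(x > t | w_1).
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)
alpha = 0.05

t = p1.isf(alpha)                      # x-threshold giving P_FA = alpha
T = p2.pdf(t) / p1.pdf(t)              # corresponding likelihood-ratio threshold
P_H = p2.sf(t)                         # hit rate achieved by the NP test
print(t, T, P_H)                       # t ~ 1.645, P_H ~ 0.639
```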
92. Likelihood Ratio Test
[Figure: the threshold under the two densities, and the resulting $P_{FA}$ and $P_H$ areas.]
93. Neyman-Pearson Lemma
Consider the aforementioned rule with $T$ chosen to give $P_{FA}(\phi) = \alpha$. There is no decision rule $\phi'$ such that $P_{FA}(\phi') \le \alpha$ and $P_H(\phi') > P_H(\phi)$.

94-95. Neyman-Pearson Lemma (Proof)
Proof. Write a rule as a function $\phi(\mathbf{x}) \in \{0, 1\}$, where $\phi(\mathbf{x}) = 1$ means "decide $\omega_2$"; the NP rule has $\phi(\mathbf{x}) = 1$ iff $p(\mathbf{x}|\omega_2) > T\,p(\mathbf{x}|\omega_1)$. Let $\phi'$ be a decision rule with $P_{FA}(\phi') \le \alpha$. For every $\mathbf{x}$,
$$\big(\phi(\mathbf{x}) - \phi'(\mathbf{x})\big)\big(p(\mathbf{x}|\omega_2) - T\,p(\mathbf{x}|\omega_1)\big) \ge 0,$$
since wherever the second factor is positive $\phi = 1 \ge \phi'$, and wherever it is negative $\phi = 0 \le \phi'$. Integrating over $\mathbf{x}$ gives
$$P_H(\phi) - P_H(\phi') \;\ge\; T\big(P_{FA}(\phi) - P_{FA}(\phi')\big) \;=\; T\big(\alpha - P_{FA}(\phi')\big) \;\ge\; 0,$$
so $P_H(\phi') \le P_H(\phi)$. $\blacksquare$