Title: Outline
1 Outline
- Bayesian Decision Theory
- Bayes' formula
- Error
- Bayes' Decision Rule
- Loss function and Risk
- Two-Category Classification
- Classifiers, Discriminant Functions, and Decision Surfaces
- Discriminant Functions for the Normal Density
2 Bayesian Decision Theory
- Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
- Decision making when all the probabilistic information is known.
- For given probabilities the decision is optimal.
- When new information is added, it is assimilated in an optimal fashion to improve the decisions.
3 Bayesian Decision Theory cont.
- Fish Example
- Each fish is in one of 2 states: sea bass or salmon
- Let w denote the state of nature
- w = w1 for sea bass
- w = w2 for salmon
4 Bayesian Decision Theory cont.
- The state of nature is unpredictable: w is a variable that must be described probabilistically.
- If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
- Define
- P(w1): a priori probability that the next fish is sea bass
- P(w2): a priori probability that the next fish is salmon
5 Bayesian Decision Theory cont.
- If other types of fish are irrelevant: P(w1) + P(w2) = 1.
- Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, ...)
- Simple decision rule:
- Make a decision without seeing the fish.
- Decide w1 if P(w1) > P(w2), w2 otherwise.
- OK if deciding for one fish.
- If there are several fish, all are assigned to the same class.
6 Bayesian Decision Theory cont.
- In general, we will have some features and more information.
- Feature: lightness measurement x
- Different fish yield different lightness readings (x is a random variable)
7 Bayesian Decision Theory cont.
- Define
- p(x|w1): class-conditional probability density
- The probability density function for x given that the state of nature is w1
- The difference between p(x|w1) and p(x|w2) describes the difference in lightness between sea bass and salmon.
8 Bayesian Decision Theory cont.
Hypothetical class-conditional probability density functions. Density functions are normalized (the area under each curve is 1.0).
9 Bayesian Decision Theory cont.
- Suppose that we know
- the prior probabilities P(w1) and P(w2),
- the conditional densities p(x|w1) and p(x|w2),
- and we measure the lightness of a fish: x.
- What is the category of the fish?
10 Bayes' formula
- P(wj|x) = p(x|wj) P(wj) / p(x),
- where (in the two-category case)
- p(x) = p(x|w1) P(w1) + p(x|w2) P(w2)
- In words: posterior = (likelihood × prior) / evidence
11 Bayes' formula cont.
- p(x|wj) is called the likelihood of wj with respect to x
- (the wj category for which p(x|wj) is large is more "likely" to be the true category)
- p(x) is the evidence
- how frequently we will measure a pattern with feature value x
- a scale factor that guarantees that the posterior probabilities sum to 1
12 Bayes' formula cont.
Posterior probabilities for the particular priors P(w1) = 2/3 and P(w2) = 1/3. At every x the posteriors sum to 1.
13 Error
- For a given x, we can minimize the probability of error by deciding w1 if P(w1|x) > P(w2|x) and w2 otherwise.
14 Bayes' Decision Rule (Minimizes the probability of error)
- Decide w1 if P(w1|x) > P(w2|x), w2 otherwise
- or
- Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2), w2 otherwise
- and
- P(error|x) = min[ P(w1|x), P(w2|x) ]
- (a small numerical sketch of this rule follows below)
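A minimal sketch of the two-category rule, assuming made-up priors and univariate Gaussian class-conditional densities for the lightness feature (all parameter values below are illustrative, not from the slides):

```python
import math

# Made-up univariate Gaussian class-conditional densities for the lightness x.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

priors = {"w1_sea_bass": 2.0 / 3.0, "w2_salmon": 1.0 / 3.0}          # P(w1), P(w2)
likelihood = {"w1_sea_bass": lambda x: gaussian_pdf(x, 11.0, 1.5),   # p(x|w1)
              "w2_salmon":   lambda x: gaussian_pdf(x, 13.0, 1.0)}   # p(x|w2)

def bayes_decision(x):
    evidence = sum(likelihood[w](x) * priors[w] for w in priors)     # p(x)
    posteriors = {w: likelihood[w](x) * priors[w] / evidence for w in priors}
    decision = max(posteriors, key=posteriors.get)                   # maximize P(wj|x)
    p_error = min(posteriors.values())                               # P(error|x)
    return decision, posteriors, p_error

print(bayes_decision(x=12.2))
```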
15 Bayesian Decision Theory: Continuous Features (General Case)
- Formalize the ideas just considered in 4 ways:
- Allow more than one feature
- Replace the scalar x by the feature vector x. The d-dimensional Euclidean space R^d is called the feature space.
- Allow more than 2 states of nature
- Generalize to several classes
- Allow actions other than merely deciding the state of nature
- Possibility of rejection, i.e., of refusing to make a decision in close cases.
- Introduce a general loss function
16 Loss function
- A loss (or cost) function states exactly how costly each action is, and is used to convert a probability determination into a decision.
- Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
17 Formulation
- Let w1, ..., wc be the finite set of c states of nature ("categories").
- Let {α1, ..., αa} be the finite set of a possible actions.
- λ(αi|wj): the loss incurred for taking action αi when the state of nature is wj.
- x: d-dimensional feature vector (random variable)
- p(x|wj): the state-conditional probability density function for x
- (the probability density function for x conditioned on wj being the true state of nature)
- P(wj): prior probability that nature is in state wj.
18 Expected Loss
- Suppose that we observe a particular x and that we contemplate taking action αi.
- If the true state of nature is wj, then the loss is λ(αi|wj).
- Before we have made an observation, the expected loss of action αi is
- Σj λ(αi|wj) P(wj)
19 Conditional Risk
- After the observation, the expected loss, now called the conditional risk, is given by
- R(αi|x) = Σj λ(αi|wj) P(wj|x)
20 Total Risk
- Objective: select the action that minimizes the conditional risk.
- A general decision rule is a function α(x) that says which action to take for every possible observation.
- For every x, the decision function α(x) assumes one of the a values α1, ..., αa.
- The total risk is
- R = ∫ R(α(x)|x) p(x) dx
21 Bayes Decision Rule
- Compute the conditional risk
- R(αi|x) = Σj λ(αi|wj) P(wj|x)
- for i = 1, ..., a.
- Select the action αi for which R(αi|x) is minimum.
- The resulting minimum total risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
- (a small sketch of this computation follows below)
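A minimal sketch of this minimum-risk selection, assuming a made-up 2x2 loss matrix and made-up posteriors (none of these numbers come from the slides):

```python
import numpy as np

# lam[i, j] = lambda(alpha_i | w_j): loss for taking action alpha_i when the true state is w_j.
lam = np.array([[0.0, 10.0],    # alpha_1: decide w1 (costly if the truth is w2)
                [1.0,  0.0]])   # alpha_2: decide w2
posteriors = np.array([0.7, 0.3])           # P(w1|x), P(w2|x) for some observed x

cond_risk = lam @ posteriors                # R(alpha_i|x) = sum_j lam(alpha_i|w_j) P(w_j|x)
best = int(np.argmin(cond_risk))            # Bayes rule: minimize the conditional risk
print(cond_risk, "-> take action alpha_%d" % (best + 1))
```

With these numbers the rule picks α2 even though P(w1|x) is larger, because deciding w1 when the truth is w2 is assumed to be ten times as costly.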
22 Two-Category Classification
- Action α1: deciding that the true state is w1.
- Action α2: deciding that the true state is w2.
- Let λij = λ(αi|wj) be the loss incurred for deciding wi when the true state is wj.
- Decide w1 if R(α1|x) < R(α2|x),
- or if (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x),
- or if (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2),
- and w2 otherwise
23 Two-Category Likelihood Ratio Test
- Under the reasonable assumption that λ21 > λ11 (why?),
- decide w1 if
- p(x|w1) / p(x|w2) > [(λ12 - λ22) / (λ21 - λ11)] · P(w2) / P(w1)
- and w2 otherwise.
- The ratio p(x|w1) / p(x|w2) is called the likelihood ratio.
- We can decide w1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x (see the sketch below).
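A minimal sketch of the likelihood ratio test, reusing the made-up losses, priors, and Gaussian class-conditional densities from the earlier sketches (all values are illustrative assumptions):

```python
import math

lam11, lam12 = 0.0, 10.0        # losses for action "decide w1" when the truth is w1 / w2
lam21, lam22 = 1.0, 0.0         # losses for action "decide w2" when the truth is w1 / w2
P1, P2 = 2.0 / 3.0, 1.0 / 3.0   # priors P(w1), P(w2)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def decide(x):
    ratio = gaussian_pdf(x, 11.0, 1.5) / gaussian_pdf(x, 13.0, 1.0)   # p(x|w1) / p(x|w2)
    T = ((lam12 - lam22) / (lam21 - lam11)) * (P2 / P1)               # threshold, independent of x
    return "w1" if ratio > T else "w2"

print(decide(12.2))
```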
24 Minimum-Error-Rate Classification
- In classification problems, each state is usually associated with one of c different classes.
- Action αi: decision that the true state is wi.
- If action αi is taken and the true state is wj, then the decision is correct if i = j, and in error otherwise.
- The zero-one loss function is defined as
- λ(αi|wj) = 0 if i = j, 1 if i ≠ j, for i, j = 1, ..., c
- All errors are equally costly.
25 Minimum-Error-Rate Classification cont.
- With the zero-one loss, the conditional risk is
- R(αi|x) = Σ_{j≠i} P(wj|x) = 1 - P(wi|x)
- To minimize the average probability of error, we should select the i that maximizes the posterior probability P(wi|x).
- Decide wi if P(wi|x) > P(wj|x) for all j ≠ i
- (same as Bayes' decision rule)
26 Decision Regions
- The likelihood ratio p(x|w1) / p(x|w2) vs. x
- The threshold for the zero-one loss function is
- θa = P(w2) / P(w1)
- With a general loss the threshold is θb = [(λ12 - λ22) / (λ21 - λ11)] · P(w2) / P(w1)
- If we put λ12 > λ21 we shall get θb > θa
27 Classifiers, Discriminant Functions, and Decision Surfaces: The Multi-Category Case
- A pattern classifier can be represented by a set of discriminant functions gi(x), i = 1, ..., c.
- The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
28 Statistical Pattern Classifier
- Statistical pattern classifier (figure)
29 The Bayes Classifier
- A Bayes classifier can be represented in this way:
- For the general case with risks: gi(x) = -R(αi|x)
- For the minimum-error-rate case: gi(x) = P(wi|x)
- If we replace every gi(x) by f(gi(x)), where f(.) is a monotonically increasing function, the resulting classification is unchanged, e.g. any of the following choices gives identical classification results (see the sketch below):
- gi(x) = P(wi|x) = p(x|wi) P(wi) / Σj p(x|wj) P(wj)
- gi(x) = p(x|wi) P(wi)
- gi(x) = ln p(x|wi) + ln P(wi)
30 The Bayes Classifier cont.
- The effect of any decision rule is to divide the feature space into c decision regions R1, ..., Rc.
- If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to wi.
- Decision regions are separated by decision boundaries.
- Decision boundaries are surfaces in the feature space.
31 The Decision Regions
- Two-dimensional, two-category classifier (figure)
32 The Two-Category Case
- Use 2 discriminant functions g1 and g2, and assign x to w1 if g1 > g2.
- Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, otherwise decide w2.
- In the two-category case, two forms are frequently used:
- g(x) = P(w1|x) - P(w2|x)
- g(x) = ln [p(x|w1) / p(x|w2)] + ln [P(w1) / P(w2)]
33 Normal Density - Univariate Case
- Gaussian density with mean μ and standard deviation σ (σ² is named the variance):
- p(x) = [1 / (√(2π) σ)] exp[ -½ ((x - μ) / σ)² ]
- It can be shown that
- E[x] = μ and E[(x - μ)²] = σ²
34 Entropy
- Entropy is given by
- H(p(x)) = -∫ p(x) ln p(x) dx
- and is measured in nats; if log2 is used instead, the unit is the bit.
- The entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution. The normal distribution has the maximum entropy of all distributions having a given mean and variance. As stated by the Central Limit Theorem, the aggregate effect of the sum of a large number of small, i.i.d. random disturbances leads to a Gaussian distribution.
- Because many patterns can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.
35 Normal Density - Multivariate Case
- The general multivariate normal density (MND) in d dimensions is written as
- p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ -½ (x - μ)^t Σ^(-1) (x - μ) ]
- It can be shown that
- E[x] = μ and Σ = E[(x - μ)(x - μ)^t],
- which means, for the components, μi = E[xi] and σij = E[(xi - μi)(xj - μj)].
- The covariance matrix Σ is always symmetric and positive semidefinite (a small evaluation sketch follows below).
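A minimal sketch that evaluates the density above directly from μ and Σ; the numeric values are made-up illustrations.

```python
import numpy as np

def mnd_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    mahalanobis_sq = diff @ np.linalg.inv(Sigma) @ diff       # (x - mu)^t Sigma^-1 (x - mu)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mnd_pdf(np.array([0.5, 0.5]), mu, Sigma))
```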
36 Normal Density - Multivariate Case cont.
- The diagonal elements σii are the variances of the xi, and the off-diagonal elements σij are the covariances of xi and xj.
- If xi and xj are statistically independent, σij = 0. If all σij = 0 (i ≠ j), then p(x) is a product of univariate normal densities.
- Linear combinations of jointly normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, where A is a d-by-k matrix, then p(y) ~ N(A^t μ, A^t Σ A).
- If A is a vector a, then y = a^t x is a scalar, and a^t Σ a is the variance of the projection of x onto a.
37 Whitening transform
- Define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The transformation y = A_w^t x with
- A_w = Φ Λ^(-1/2)
- converts an arbitrary MND into a spherical one with covariance matrix I (see the sketch below).
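A minimal numeric check of the whitening transform on a made-up covariance matrix: after transforming, the covariance becomes the identity.

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])                 # made-up covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)           # columns of Phi: orthonormal eigenvectors of Sigma
A_w = Phi @ np.diag(eigvals ** -0.5)           # whitening matrix A_w = Phi Lambda^(-1/2)

print(np.round(A_w.T @ Sigma @ A_w, 10))       # covariance of y = A_w^t x -> identity matrix
```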
38 Normal Density - Multivariate Case cont.
- The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters.
- Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ.
- The loci of points of constant density are hyperellipsoids.
- The quantity r, where r² = (x - μ)^t Σ^(-1) (x - μ), is called the Mahalanobis distance from x to μ.
- The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
39 Normal Density - Multivariate Case cont.
- The minimum-error-rate classification can be achieved using the discriminant functions
- gi(x) = ln p(x|wi) + ln P(wi)
- or gi(x) = p(x|wi) P(wi).
- If p(x|wi) ~ N(μi, Σi),
- then gi(x) = -½ (x - μi)^t Σi^(-1) (x - μi) - (d/2) ln 2π - ½ ln |Σi| + ln P(wi) (see the sketch below).
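A minimal sketch of this Gaussian discriminant; the class parameters and the query point are made-up values.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    # g_i(x) = -1/2 (x-mu)^t Sigma^-1 (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w_i)
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + np.log(prior))

classes = [  # (mu_i, Sigma_i, P(w_i)) -- illustrative numbers
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.3], [0.3, 1.0]]), 0.5),
]

x = np.array([1.4, 1.0])
scores = [gaussian_discriminant(x, mu, S, P) for mu, S, P in classes]
print("decide w%d" % (int(np.argmax(scores)) + 1))
```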
40 Discriminant Functions for the Normal Density. Case 1: Σi = σ²I
- The features are statistically independent, and each feature has the same variance σ².
- The determinant is |Σi| = σ^(2d),
- and the inverse of Σi is Σi^(-1) = (1/σ²) I.
- Both ½ ln |Σi| and (d/2) ln 2π are independent of i and can be ignored.
41 Case 1 cont.
- This gives gi(x) = -||x - μi||² / (2σ²) + ln P(wi), where ||·|| denotes the Euclidean norm.
- Expanding ||x - μi||² = x^t x - 2 μi^t x + μi^t μi, the quadratic term x^t x is independent of i,
- so we can write gi(x) as a linear discriminant function
- gi(x) = wi^t x + wi0,
- where wi = μi / σ²
- and wi0 = -μi^t μi / (2σ²) + ln P(wi).
42 Case 1 cont.
- wi0 is called the threshold or bias in the ith direction.
- A classifier that uses linear discriminant functions is called a linear machine.
- The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the 2 categories with the highest posterior probabilities.
- For this particular example, setting gi(x) = gj(x) reduces to
- w^t (x - x0) = 0
43 Case 1 cont.
- where w = μi - μj
- and x0 = ½ (μi + μj) - [σ² / ||μi - μj||²] ln [P(wi) / P(wj)] (μi - μj).
- The above equation defines a hyperplane through x0 and orthogonal to w (the line linking the means).
- If P(wi) = P(wj), then x0 is halfway between the means (see the sketch below).
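A minimal numeric check of the Case 1 boundary: build w and x0 from made-up means, σ², and priors, and verify that x0 lies on the decision boundary (the two discriminants agree there).

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])   # made-up class means
sigma2 = 1.5                                              # shared variance sigma^2
P_i, P_j = 0.7, 0.3                                       # made-up priors

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / np.dot(w, w)) * np.log(P_i / P_j) * w

def g(x, mu, prior):
    # Case 1 discriminant: -||x - mu||^2 / (2 sigma^2) + ln P(w)
    return -np.dot(x - mu, x - mu) / (2.0 * sigma2) + np.log(prior)

print(np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j)))     # True: x0 is on the boundary
```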
44 Case 1 cont.
45 Case 1 cont.
- If the covariances of the 2 distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d-1 dimensions, perpendicular to the line linking the means.
- If P(wi) is not equal to P(wj), the point x0 shifts away from the more likely mean.
46 Case 1 cont.
47 Minimum Distance Classifier
- As the priors are changed, the decision boundary shifts.
- If all prior probabilities are the same, the optimum decision rule becomes:
- measure the Euclidean distance ||x - μi|| from x to each of the c mean vectors, and
- assign x to the class of the nearest mean (see the sketch below).
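A minimal sketch of the minimum distance classifier; the class means and the query point are made-up values.

```python
import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [1.0, 4.0]])                        # one row per class mean mu_i (illustrative)

def classify(x):
    distances = np.linalg.norm(means - x, axis=1)     # Euclidean distance ||x - mu_i||
    return int(np.argmin(distances))                  # index of the nearest mean

print("decide w%d" % (classify(np.array([2.0, 2.0])) + 1))
```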
48 Discriminant Functions for the Normal Density. Case 2: Common Covariance Matrices
- Case 2: the covariance matrices for all of the classes are identical but otherwise arbitrary, Σi = Σ.
- Both ½ ln |Σi| and (d/2) ln 2π are independent of i and can be ignored, leaving
- gi(x) = -½ (x - μi)^t Σ^(-1) (x - μi) + ln P(wi)
49 Case 2 cont.
- or equivalently gi(x) = -½ ri² + ln P(wi), where ri² = (x - μi)^t Σ^(-1) (x - μi).
- If all prior probabilities are the same, the optimum decision rule becomes:
- measure the squared Mahalanobis distance ri² = (x - μi)^t Σ^(-1) (x - μi)
- from x to each of the c mean vectors, and
- assign x to the class of the nearest mean (see the sketch below).
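A minimal sketch of the Case 2 rule with equal priors, using a made-up shared covariance matrix and made-up means: assign x to the class whose mean is nearest in squared Mahalanobis distance.

```python
import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 1.0]])                 # made-up class means
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                 # made-up shared covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    diffs = means - x
    r2 = np.einsum("ij,jk,ik->i", diffs, Sigma_inv, diffs)   # (x - mu_i)^t Sigma^-1 (x - mu_i)
    return int(np.argmin(r2))                                # nearest mean in Mahalanobis distance

print("decide w%d" % (classify(np.array([1.5, 0.5])) + 1))
```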
50 Case 2 cont.
- Expanding (x - μi)^t Σ^(-1) (x - μi) and dropping the term x^t Σ^(-1) x, which is independent of i,
- we shall have a linear classifier
- gi(x) = wi^t x + wi0,
- where wi = Σ^(-1) μi and wi0 = -½ μi^t Σ^(-1) μi + ln P(wi).
- The decision boundaries are given by w^t (x - x0) = 0, with w = Σ^(-1) (μi - μj)
- and x0 = ½ (μi + μj) - [ln (P(wi) / P(wj)) / ((μi - μj)^t Σ^(-1) (μi - μj))] (μi - μj).
51 Discriminant Functions for the Normal Density. Case 3: Arbitrary Class-Conditional Distributions
- Case 3: each class has its own arbitrary covariance matrix Σi.
- gi(x) = x^t Wi x + wi^t x + wi0,
- where Wi = -½ Σi^(-1), wi = Σi^(-1) μi,
- and wi0 = -½ μi^t Σi^(-1) μi - ½ ln |Σi| + ln P(wi).
- The decision boundaries are hyperquadrics (a small check follows below).
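A minimal check of the quadratic form above against the direct discriminant ln p(x|wi) + ln P(wi) from the earlier slide; the two differ only by the constant (d/2) ln 2π, which is the same for every class. All numeric values are made up.

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
prior = 0.6

Sigma_inv = np.linalg.inv(Sigma)
W = -0.5 * Sigma_inv                                         # W_i
w = Sigma_inv @ mu                                           # w_i
w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)   # w_i0

x = np.array([0.3, 1.1])
g_quadratic = x @ W @ x + w @ x + w0

d = len(mu)
diff = x - mu
g_direct = (-0.5 * diff @ Sigma_inv @ diff - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior))

print(np.isclose(g_quadratic, g_direct + 0.5 * d * np.log(2.0 * np.pi)))   # True
```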
52 ERROR PROBABILITIES AND INTEGRALS
- Consider the 2-class problem and suppose that the feature space is divided into 2 regions, R1 and R2. There are 2 ways in which a classification error can occur:
- an observation x falls in R2, and the true state is w1;
- an observation x falls in R1, and the true state is w2
- (the corresponding error integral is written out below).
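In LaTeX, the standard two-category decomposition of the error probability implied by these two cases:

```latex
\begin{align*}
P(\text{error}) &= P(x \in \mathcal{R}_2,\ \omega_1) + P(x \in \mathcal{R}_1,\ \omega_2) \\
                &= \int_{\mathcal{R}_2} p(x \mid \omega_1)\, P(\omega_1)\, dx
                 + \int_{\mathcal{R}_1} p(x \mid \omega_2)\, P(\omega_2)\, dx
\end{align*}
```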
53 ERROR PROBABILITIES AND INTEGRALS cont.
54 ERROR PROBABILITIES AND INTEGRALS cont.
- Because the decision boundary x* is chosen arbitrarily, the probability of error is not as small as it might be.
- xB, the Bayes optimal decision boundary, gives the lowest probability of error.
- In the multi-category case, there are more ways to be wrong than to be right, and it is simpler to compute the probability of being correct:
- P(correct) = Σi ∫_{Ri} p(x|wi) P(wi) dx
- This result depends neither on how the feature space is partitioned, nor on the form of the underlying distributions.
- The Bayes classifier maximizes this probability, and no other partitioning can yield a smaller probability of error (a small numeric sketch follows below).