Title: 240-650 Principles of Pattern Recognition
240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/montri
Chapter 2
Statistical Approach to Pattern Recognition
A Simple Example
- Suppose that we are given two classes w1 and w2
- P(w1) = 0.7
- P(w2) = 0.3
- No measurement is given
- Guessing
- What shall we do to recognize a given input?
- What is the best we can do statistically? Why? (A short sketch follows below.)
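As a rough illustration (not part of the original slides), the prior-only rule can be written in a few lines of Python; the class names and priors are the ones given above:

```python
# A minimal sketch: with no measurement, the best we can do is always choose
# the class with the larger prior.  P(w1)=0.7 and P(w2)=0.3 are from the slide.
priors = {"w1": 0.7, "w2": 0.3}

# Bayes decision with no features: pick the most probable class a priori.
decision = max(priors, key=priors.get)      # -> "w1"
error_rate = 1.0 - priors[decision]         # -> 0.3, the smaller prior

print(f"always decide {decision}, expected error rate = {error_rate:.1f}")
```

No other rule can do better on average, because any input assigned to w2 is wrong with probability 0.7.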
A More Complicated Example
- Suppose that we are given two classes
- A single measurement x
- P(w1|x) and P(w2|x) are given graphically
A Bayesian Example
- Suppose that we are given two classes
- A single measurement x
- We are given p(x|w1) and p(x|w2) this time
A Bayesian Example (cont.)
Bayesian Decision Theory
- Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)
- In the case of two categories: p(x) = p(x|w1) P(w1) + p(x|w2) P(w2)
- In English, it can be expressed as: posterior = (likelihood x prior) / evidence
- A small numerical sketch follows below.
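A minimal numerical sketch of the formula, with made-up likelihood values purely for illustration:

```python
import numpy as np

# Illustrative priors and class-conditional likelihoods at some observed x.
priors      = np.array([0.7, 0.3])        # P(w1), P(w2)
likelihoods = np.array([0.05, 0.20])      # p(x|w1), p(x|w2)  (made-up values)

evidence   = np.sum(likelihoods * priors)         # p(x), the scaling factor
posteriors = likelihoods * priors / evidence      # P(wj|x) via Bayes formula

print(posteriors, posteriors.sum())               # posteriors sum to one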
Bayesian Decision Theory (cont.)
- Posterior probability
  - P(wj|x) is the probability of the state of nature being wj given that feature value x has been measured
- Likelihood
  - p(x|wj) is the likelihood of wj with respect to x
- Evidence
  - The evidence p(x) can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one
Bayesian Decision Theory (cont.)
- Whenever we observe a particular x, the probability of error is
  - P(error|x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1
- The average probability of error is given by
  - P(error) = ∫ P(error|x) p(x) dx
Bayesian Decision Theory (cont.)
- Bayes decision rule
  - Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2
- Probability of error
  - P(error|x) = min[P(w1|x), P(w2|x)]
- If we ignore the evidence, the decision rule becomes
  - Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2)
  - Otherwise decide w2
- A sketch of this rule with Gaussian class-conditional densities follows below.
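A small sketch of the rule above, assuming (purely for illustration) 1-D Gaussian class-conditional densities; the means, variances, and priors are hypothetical values, not from the slides:

```python
import numpy as np

# Illustrative parameters for the two classes.
priors = np.array([0.7, 0.3])
means  = np.array([0.0, 2.0])
vars_  = np.array([1.0, 1.0])

def gauss(x, mu, var):
    # Univariate normal density p(x | wi).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def decide(x):
    # Compare p(x|wi) P(wi); the evidence p(x) is a common factor and can be ignored.
    scores = gauss(x, means, vars_) * priors
    return 1 if scores[0] > scores[1] else 2   # decide w1 or w2

print(decide(0.5), decide(1.8))   # -> 1 2
```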
Bayesian Decision Theory -- Continuous Features
- Feature space
  - In general, an input can be represented by a vector x, a point in a d-dimensional Euclidean space Rd
- Loss function
  - The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  - Written as l(ai|wj)
Loss Function
- l(ai|wj) describes the loss incurred for taking action ai when the state of nature is wj
Conditional Risk
- Suppose we observe a particular x
- We take action ai
- If the true state of nature is wj
- By definition we will incur the loss l(ai|wj)
- We can minimize our expected loss by selecting the action that minimizes the conditional risk, R(ai|x)
Bayesian Decision Theory
- Suppose that there are c categories: w1, w2, ..., wc
- Conditional risk: R(ai|x) = sum over j = 1, ..., c of l(ai|wj) P(wj|x)
- Risk is the average expected loss: R = ∫ R(a(x)|x) p(x) dx
Bayesian Decision Theory
- Bayes decision rule
  - For a given x, select the action ai for which the conditional risk R(ai|x) is minimum
- The resulting minimum overall risk is called the Bayes risk, denoted R*, which is the best performance that can be achieved
Two-Category Classification
- Let lij = l(ai|wj)
- Conditional risk
  - R(a1|x) = l11 P(w1|x) + l12 P(w2|x)
  - R(a2|x) = l21 P(w1|x) + l22 P(w2|x)
- Fundamental decision rule
  - Decide w1 if R(a1|x) < R(a2|x) (a small sketch follows below)
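A minimal sketch of the risk-based rule; the loss matrix and the posteriors at the observed x are illustrative values, not from the slides:

```python
import numpy as np

# Illustrative loss matrix: row i = action ai, column j = true state wj.
loss = np.array([[0.0, 2.0],     # l11, l12
                 [1.0, 0.0]])    # l21, l22

posteriors = np.array([0.6, 0.4])     # P(w1|x), P(w2|x) at the observed x

risks  = loss @ posteriors            # R(a1|x), R(a2|x)
action = np.argmin(risks) + 1         # 1 -> decide w1, 2 -> decide w2
print(risks, "take action a%d" % action)
```

Note that with these numbers the rule picks a2 even though w1 is more probable, because deciding w1 when the truth is w2 is twice as costly.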
Two-Category Classification (cont.)
- The decision rule can be written in several ways
- Decide w1 if one of the following is true (these rules are equivalent):
  - (l21 - l11) P(w1|x) > (l12 - l22) P(w2|x)
  - (l21 - l11) p(x|w1) P(w1) > (l12 - l22) p(x|w2) P(w2)
  - Likelihood ratio: p(x|w1) / p(x|w2) > [(l12 - l22) / (l21 - l11)] [P(w2) / P(w1)]
Minimum-Error-Rate Classification
- A special case of the Bayes decision rule with the following zero-one loss function:
  - l(ai|wj) = 0 if i = j, and 1 if i != j
- Assigns no loss to a correct decision
- Assigns unit loss to any error
- All errors are equally costly
Minimum-Error-Rate Classification
- We should select the i that maximizes the posterior probability P(wi|x)
- For minimum error rate
  - Decide wi if P(wi|x) > P(wj|x) for all j != i
Classifiers, Discriminant Functions, and Decision Surfaces
- There are many ways to represent pattern classifiers
- One of the most useful is in terms of a set of discriminant functions gi(x), i = 1, ..., c
- The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j != i
The Multicategory Classifier
Classifiers, Discriminant Functions, and Decision Surfaces
- There are many equivalent discriminant functions
  - i.e., the classification results will be the same even though they are different functions
- For example, if f is a monotonically increasing function, then f(gi(x)) can be used in place of gi(x)
Classifiers, Discriminant Functions, and Decision Surfaces
- Some discriminant functions are easier to understand or to compute than others (see the sketch below)
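As an illustration, the three standard equivalent forms gi(x) = P(wi|x), gi(x) = p(x|wi) P(wi), and gi(x) = ln p(x|wi) + ln P(wi) all pick the same class; the likelihood and prior values below are made up:

```python
import numpy as np

# Illustrative priors and class-conditional likelihoods at the observed x.
priors      = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.01, 0.04, 0.02])

g1 = likelihoods * priors / np.sum(likelihoods * priors)   # gi(x) = P(wi|x)
g2 = likelihoods * priors                                   # gi(x) = p(x|wi) P(wi)
g3 = np.log(likelihoods) + np.log(priors)                   # gi(x) = ln p(x|wi) + ln P(wi)

# All three give the same decision: dropping the evidence and applying the
# monotone ln(.) do not change the argmax.
print(np.argmax(g1), np.argmax(g2), np.argmax(g3))
```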
Decision Regions
- The effect of any decision rule is to divide the feature space into c decision regions R1, ..., Rc
- The regions are separated by decision boundaries, where ties occur among the largest discriminant functions
Decision Regions (cont.)
Two-Category Case (Dichotomizer)
- The two-category case is a special case
- Instead of two discriminant functions, a single one can be used: g(x) = g1(x) - g2(x)
  - Decide w1 if g(x) > 0; otherwise decide w2
The Normal Density
- Univariate Gaussian density: p(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
- Mean: μ = E[x]
- Variance: σ² = E[(x - μ)²]
The Normal Density
- Central Limit Theorem
  - The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
- The Gaussian is often a good model for the actual probability distribution
The Multivariate Normal Density
- Multivariate density (in d dimensions):
  p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(-1/2 (x - μ)^T Σ^(-1) (x - μ))
- Abbreviation: p(x) ~ N(μ, Σ)
The Multivariate Normal Density
- Mean: μ = E[x]
- Covariance matrix: Σ = E[(x - μ)(x - μ)^T]
- The ij-th component of Σ: σij = E[(xi - μi)(xj - μj)]
- A small numerical sketch follows below.
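A minimal sketch of evaluating the multivariate normal density with NumPy; the mean vector and covariance matrix are illustrative values:

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density p(x) ~ N(mu, sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Illustrative parameters (not from the slides).
mu    = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_density(np.array([0.5, 1.5]), mu, sigma))
```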
Statistical Independence
- If xi and xj are statistically independent, then σij = 0
- The covariance matrix becomes a diagonal matrix where all off-diagonal elements are zero
Whitening Transform
- Aw = Φ Λ^(-1/2), where
  - Λ is the diagonal matrix of the corresponding eigenvalues of Σ
  - Φ is the matrix whose columns are the orthonormal eigenvectors of Σ
- The transformed data y = Aw^T x has identity covariance (a sketch follows below)
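A small sketch of the whitening transform via an eigendecomposition; the covariance matrix below is an arbitrary example:

```python
import numpy as np

# Illustrative covariance matrix.
sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(sigma)     # Lambda (as a vector) and Phi
A_w = eigvecs @ np.diag(eigvals ** -0.5)     # whitening matrix Aw = Phi Lambda^(-1/2)

# Applying y = Aw^T x turns the covariance into the identity matrix.
whitened_cov = A_w.T @ sigma @ A_w
print(np.round(whitened_cov, 6))             # ~ identity
```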
Squared Mahalanobis Distance from x to μ
- r² = (x - μ)^T Σ^(-1) (x - μ)
- Contours of constant density are hyperellipsoids of constant Mahalanobis distance
- The principal axes of the hyperellipsoids are given by the eigenvectors of Σ; the lengths of the axes are determined by the eigenvalues of Σ
- A small sketch follows below.
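A minimal sketch of the squared Mahalanobis distance; μ and Σ below are arbitrary example values:

```python
import numpy as np

# Illustrative mean and covariance.
mu    = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mahalanobis_sq(x, mu, sigma):
    # r^2 = (x - mu)^T Sigma^{-1} (x - mu)
    diff = x - mu
    return diff @ np.linalg.inv(sigma) @ diff

print(mahalanobis_sq(np.array([2.0, 3.0]), mu, sigma))
```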
Discriminant Functions for the Normal Density
- Minimum-error-rate classifier: gi(x) = ln p(x|wi) + ln P(wi)
- If the densities are multivariate normal, i.e., if p(x|wi) ~ N(μi, Σi)
- Then we have gi(x) = -1/2 (x - μi)^T Σi^(-1) (x - μi) - (d/2) ln 2π - 1/2 ln|Σi| + ln P(wi)
Discriminant Functions for the Normal Density
- Case 1: Σi = σ²I
  - Features are statistically independent and each feature has the same variance σ²
  - gi(x) = -||x - μi||² / (2σ²) + ln P(wi)
  - Where ||·|| denotes the Euclidean norm
Case 1: Σi = σ²I
Linear Discriminant Function
- It is not necessary to compute distances
- Expanding the quadratic form ||x - μi||² = x^T x - 2 μi^T x + μi^T μi yields
  gi(x) = -(1 / (2σ²)) [x^T x - 2 μi^T x + μi^T μi] + ln P(wi)
- The term x^T x is the same for all i and can be dropped
- We have the following linear discriminant function: gi(x) = wi^T x + wi0
Linear Discriminant Function
- gi(x) = wi^T x + wi0, where wi = μi / σ² and wi0 = -μi^T μi / (2σ²) + ln P(wi)
- wi0 is the threshold or bias for the i-th category
- A small sketch follows below.
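A minimal sketch of the Case 1 linear machine; the means, variance, and priors are illustrative values, not from the slides:

```python
import numpy as np

# Case 1 (Sigma_i = sigma^2 I):
#   gi(x) = wi^T x + wi0,  wi = mu_i / sigma^2,
#   wi0 = -mu_i^T mu_i / (2 sigma^2) + ln P(wi)
means  = np.array([[0.0, 0.0],
                   [3.0, 3.0]])      # one class mean per row (illustrative)
sigma2 = 1.0
priors = np.array([0.6, 0.4])

W  = means / sigma2                                             # one wi per row
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0          # linear discriminant for each class
    return np.argmax(g)     # 0 -> w1, 1 -> w2

print(classify(np.array([1.0, 1.0])), classify(np.array([2.5, 2.0])))   # -> 0 1
```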
Linear Machine
- A classifier that uses linear discriminant functions is called a linear machine
- Its decision surfaces are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities
- For our case this equation can be written as w^T (x - x0) = 0
Linear Machine
- Where w = μi - μj
- And x0 = 1/2 (μi + μj) - [σ² / ||μi - μj||²] ln[P(wi) / P(wj)] (μi - μj)
- If P(wi) = P(wj), then the second term vanishes
- In that case it is called a minimum-distance classifier
Priors change → decision boundaries shift
Case 2: Σi = Σ
- Covariance matrices for all of the classes are identical but otherwise arbitrary
- The cluster for the i-th class is centered about μi
- Discriminant function: gi(x) = -1/2 (x - μi)^T Σ^(-1) (x - μi) + ln P(wi)
- The ln P(wi) term can be ignored if the prior probabilities are the same for all classes
Case 2: Discriminant Function
For the 2-Category Case
- If Ri and Rj are contiguous, the boundary between them has the equation w^T (x - x0) = 0
- where w = Σ^(-1) (μi - μj)
- and x0 = 1/2 (μi + μj) - [ln(P(wi) / P(wj)) / ((μi - μj)^T Σ^(-1) (μi - μj))] (μi - μj)
- A sketch of this case follows below.
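A minimal sketch of the Case 2 (shared covariance) linear discriminant; all parameter values are illustrative:

```python
import numpy as np

# Case 2 (shared Sigma):
#   gi(x) = wi^T x + wi0 with wi = Sigma^{-1} mu_i,
#   wi0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(wi)
means  = np.array([[0.0, 0.0],
                   [2.0, 1.0]])      # illustrative class means
sigma  = np.array([[1.5, 0.3],
                   [0.3, 1.0]])      # illustrative shared covariance
priors = np.array([0.5, 0.5])

sigma_inv = np.linalg.inv(sigma)
W  = means @ sigma_inv                                          # wi as rows
w0 = -0.5 * np.sum((means @ sigma_inv) * means, axis=1) + np.log(priors)

def classify(x):
    return np.argmax(W @ x + w0)

print(classify(np.array([0.2, 0.1])), classify(np.array([1.8, 0.9])))   # -> 0 1
```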
Case 3: Σi Arbitrary
- In general, the covariance matrices are different for each category
- The only term that can be dropped from gi(x) is the (d/2) ln 2π term
Case 3: Σi Arbitrary
- The discriminant functions are gi(x) = x^T Wi x + wi^T x + wi0
- Where Wi = -1/2 Σi^(-1) and wi = Σi^(-1) μi
- and wi0 = -1/2 μi^T Σi^(-1) μi - 1/2 ln|Σi| + ln P(wi)
- A small sketch follows below.
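A minimal sketch of the Case 3 quadratic discriminant; the means, covariance matrices, and priors are illustrative:

```python
import numpy as np

# Case 3 (arbitrary Sigma_i):
#   gi(x) = x^T Wi x + wi^T x + wi0
#   Wi = -0.5 Sigma_i^{-1},  wi = Sigma_i^{-1} mu_i,
#   wi0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(wi)
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs   = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 1.5]])]
priors = [0.5, 0.5]

def g(x, mu, cov, prior):
    cov_inv = np.linalg.inv(cov)
    Wi  = -0.5 * cov_inv
    wi  = cov_inv @ mu
    wi0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    scores = [g(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

print(classify(np.array([0.5, 0.5])), classify(np.array([2.2, 1.8])))   # -> 0 1
```

Because each class keeps its own covariance, the discriminant is quadratic in x, which is why the decision surfaces become hyperquadrics rather than hyperplanes.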
Two-Category Case
- The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, ...)
Example