Title: 240-650 Principles of Pattern Recognition
240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/montri
Chapter 2
Statistical Approach to Pattern Recognition
A Simple Example
- Suppose that we are given two classes w1 and w2
- P(w1) = 0.7
- P(w2) = 0.3
- No measurement is given
- Guessing
- What shall we do to recognize a given input?
- What is the best we can do statistically? Why? (A short sketch follows below.)
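As a rough illustration (not part of the original slides), the prior-only rule can be written in a few lines of Python; the class names and priors are the ones given above:

```python
# A minimal sketch: with no measurement, the best we can do is always choose
# the class with the larger prior.  P(w1)=0.7 and P(w2)=0.3 are from the slide.
priors = {"w1": 0.7, "w2": 0.3}

# Bayes decision with no features: pick the most probable class a priori.
decision = max(priors, key=priors.get)      # -> "w1"
error_rate = 1.0 - priors[decision]         # -> 0.3, the smaller prior

print(f"always decide {decision}, expected error rate = {error_rate:.1f}")
```

No other rule can do better on average, because any input assigned to w2 is wrong with probability 0.7.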
A More Complicated Example
- Suppose that we are given two classes
- A single measurement x
- P(w1|x) and P(w2|x) are given graphically
A Bayesian Example
- Suppose that we are given two classes
- A single measurement x
- We are given p(x|w1) and p(x|w2) this time
A Bayesian Example (cont.)
Bayesian Decision Theory
- Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)
- In the case of two categories: p(x) = p(x|w1) P(w1) + p(x|w2) P(w2)
- In English, it can be expressed as: posterior = (likelihood x prior) / evidence
- A small numerical sketch follows below.
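A minimal numerical sketch of the formula, with made-up likelihood values purely for illustration:

```python
import numpy as np

# Illustrative priors and class-conditional likelihoods at some observed x.
priors      = np.array([0.7, 0.3])        # P(w1), P(w2)
likelihoods = np.array([0.05, 0.20])      # p(x|w1), p(x|w2)  (made-up values)

evidence   = np.sum(likelihoods * priors)         # p(x), the scaling factor
posteriors = likelihoods * priors / evidence      # P(wj|x) via Bayes formula

print(posteriors, posteriors.sum())               # posteriors sum to one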
Bayesian Decision Theory (cont.)
- Posterior probability
  - P(wj|x) is the probability of the state of nature being wj given that feature value x has been measured
- Likelihood
  - p(x|wj) is the likelihood of wj with respect to x
- Evidence
  - The evidence p(x) can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one
Bayesian Decision Theory (cont.)
- Whenever we observe a particular x, the probability of error is
  - P(error|x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1
- The average probability of error is given by
  - P(error) = ∫ P(error|x) p(x) dx
Bayesian Decision Theory (cont.)
- Bayes decision rule
  - Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2
- Probability of error
  - P(error|x) = min[P(w1|x), P(w2|x)]
- If we ignore the evidence, the decision rule becomes
  - Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2)
  - Otherwise decide w2
- A sketch of this rule with Gaussian class-conditional densities follows below.
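A small sketch of the rule above, assuming (purely for illustration) 1-D Gaussian class-conditional densities; the means, variances, and priors are hypothetical values, not from the slides:

```python
import numpy as np

# Illustrative parameters for the two classes.
priors = np.array([0.7, 0.3])
means  = np.array([0.0, 2.0])
vars_  = np.array([1.0, 1.0])

def gauss(x, mu, var):
    # Univariate normal density p(x | wi).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def decide(x):
    # Compare p(x|wi) P(wi); the evidence p(x) is a common factor and can be ignored.
    scores = gauss(x, means, vars_) * priors
    return 1 if scores[0] > scores[1] else 2   # decide w1 or w2

print(decide(0.5), decide(1.8))   # -> 1 2
```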
Bayesian Decision Theory -- Continuous Features
- Feature space
  - In general, an input can be represented by a vector x, a point in a d-dimensional Euclidean space Rd
- Loss function
  - The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  - Written as l(ai|wj)
Loss Function
- l(ai|wj) describes the loss incurred for taking action ai when the state of nature is wj
Conditional Risk
- Suppose we observe a particular x
- We take action ai
- If the true state of nature is wj
- By definition we will incur the loss l(ai|wj)
- We can minimize our expected loss by selecting the action that minimizes the conditional risk, R(ai|x)
Bayesian Decision Theory
- Suppose that there are c categories: w1, w2, ..., wc
- Conditional risk: R(ai|x) = sum over j = 1, ..., c of l(ai|wj) P(wj|x)
- Risk is the average expected loss: R = ∫ R(a(x)|x) p(x) dx
Bayesian Decision Theory
- Bayes decision rule
  - For a given x, select the action ai for which the conditional risk R(ai|x) is minimum
- The resulting minimum overall risk is called the Bayes risk, denoted R*, which is the best performance that can be achieved
Two-Category Classification
- Let lij = l(ai|wj)
- Conditional risk
  - R(a1|x) = l11 P(w1|x) + l12 P(w2|x)
  - R(a2|x) = l21 P(w1|x) + l22 P(w2|x)
- Fundamental decision rule
  - Decide w1 if R(a1|x) < R(a2|x) (a small sketch follows below)
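A minimal sketch of the risk-based rule; the loss matrix and the posteriors at the observed x are illustrative values, not from the slides:

```python
import numpy as np

# Illustrative loss matrix: row i = action ai, column j = true state wj.
loss = np.array([[0.0, 2.0],     # l11, l12
                 [1.0, 0.0]])    # l21, l22

posteriors = np.array([0.6, 0.4])     # P(w1|x), P(w2|x) at the observed x

risks  = loss @ posteriors            # R(a1|x), R(a2|x)
action = np.argmin(risks) + 1         # 1 -> decide w1, 2 -> decide w2
print(risks, "take action a%d" % action)
```

Note that with these numbers the rule picks a2 even though w1 is more probable, because deciding w1 when the truth is w2 is twice as costly.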
Two-Category Classification (cont.)
- The decision rule can be written in several ways
- Decide w1 if one of the following is true (these rules are equivalent):
  - (l21 - l11) P(w1|x) > (l12 - l22) P(w2|x)
  - (l21 - l11) p(x|w1) P(w1) > (l12 - l22) p(x|w2) P(w2)
  - Likelihood ratio: p(x|w1) / p(x|w2) > [(l12 - l22) / (l21 - l11)] [P(w2) / P(w1)]
Minimum-Error-Rate Classification
- A special case of the Bayes decision rule with the following zero-one loss function:
  - l(ai|wj) = 0 if i = j, and 1 if i != j
- Assigns no loss to a correct decision
- Assigns unit loss to any error
- All errors are equally costly
Minimum-Error-Rate Classification
- We should select the i that maximizes the posterior probability P(wi|x)
- For minimum error rate
  - Decide wi if P(wi|x) > P(wj|x) for all j != i
Classifiers, Discriminant Functions, and Decision Surfaces
- There are many ways to represent pattern classifiers
- One of the most useful is in terms of a set of discriminant functions gi(x), i = 1, ..., c
- The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j != i
The Multicategory Classifier
Classifiers, Discriminant Functions, and Decision Surfaces
- There are many equivalent discriminant functions
  - i.e., the classification results will be the same even though they are different functions
- For example, if f is a monotonically increasing function, then f(gi(x)) can be used in place of gi(x)
Classifiers, Discriminant Functions, and Decision Surfaces
- Some discriminant functions are easier to understand or to compute than others (see the sketch below)
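As an illustration, the three standard equivalent forms gi(x) = P(wi|x), gi(x) = p(x|wi) P(wi), and gi(x) = ln p(x|wi) + ln P(wi) all pick the same class; the likelihood and prior values below are made up:

```python
import numpy as np

# Illustrative priors and class-conditional likelihoods at the observed x.
priors      = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.01, 0.04, 0.02])

g1 = likelihoods * priors / np.sum(likelihoods * priors)   # gi(x) = P(wi|x)
g2 = likelihoods * priors                                   # gi(x) = p(x|wi) P(wi)
g3 = np.log(likelihoods) + np.log(priors)                   # gi(x) = ln p(x|wi) + ln P(wi)

# All three give the same decision: dropping the evidence and applying the
# monotone ln(.) do not change the argmax.
print(np.argmax(g1), np.argmax(g2), np.argmax(g3))
```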
Decision Regions
- The effect of any decision rule is to divide the feature space into c decision regions R1, ..., Rc
- The regions are separated by decision boundaries, where ties occur among the largest discriminant functions
Decision Regions (cont.)
Two-Category Case (Dichotomizer)
- The two-category case is a special case
- Instead of two discriminant functions, a single one can be used: g(x) = g1(x) - g2(x)
  - Decide w1 if g(x) > 0; otherwise decide w2
The Normal Density
- Univariate Gaussian density: p(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
- Mean: μ = E[x]
- Variance: σ² = E[(x - μ)²]
The Normal Density
- Central Limit Theorem
  - The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
- The Gaussian is often a good model for the actual probability distribution
The Multivariate Normal Density
- Multivariate density (in d dimensions):
  p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(-1/2 (x - μ)^T Σ^(-1) (x - μ))
- Abbreviation: p(x) ~ N(μ, Σ)
The Multivariate Normal Density
- Mean: μ = E[x]
- Covariance matrix: Σ = E[(x - μ)(x - μ)^T]
- The ij-th component of Σ: σij = E[(xi - μi)(xj - μj)]
- A small numerical sketch follows below.
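A minimal sketch of evaluating the multivariate normal density with NumPy; the mean vector and covariance matrix are illustrative values:

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density p(x) ~ N(mu, sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Illustrative parameters (not from the slides).
mu    = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_density(np.array([0.5, 1.5]), mu, sigma))
```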
Statistical Independence
- If xi and xj are statistically independent, then σij = 0
- The covariance matrix becomes a diagonal matrix where all off-diagonal elements are zero
Whitening Transform
- Aw = Φ Λ^(-1/2), where
  - Λ is the diagonal matrix of the corresponding eigenvalues of Σ
  - Φ is the matrix whose columns are the orthonormal eigenvectors of Σ
- The transformed data y = Aw^T x has identity covariance (a sketch follows below)
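A small sketch of the whitening transform via an eigendecomposition; the covariance matrix below is an arbitrary example:

```python
import numpy as np

# Illustrative covariance matrix.
sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(sigma)     # Lambda (as a vector) and Phi
A_w = eigvecs @ np.diag(eigvals ** -0.5)     # whitening matrix Aw = Phi Lambda^(-1/2)

# Applying y = Aw^T x turns the covariance into the identity matrix.
whitened_cov = A_w.T @ sigma @ A_w
print(np.round(whitened_cov, 6))             # ~ identity
```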
Squared Mahalanobis Distance from x to μ
- r² = (x - μ)^T Σ^(-1) (x - μ)
- Contours of constant density are hyperellipsoids of constant Mahalanobis distance
- The principal axes of the hyperellipsoids are given by the eigenvectors of Σ; the lengths of the axes are determined by the eigenvalues of Σ
- A small sketch follows below.
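A minimal sketch of the squared Mahalanobis distance; μ and Σ below are arbitrary example values:

```python
import numpy as np

# Illustrative mean and covariance.
mu    = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mahalanobis_sq(x, mu, sigma):
    # r^2 = (x - mu)^T Sigma^{-1} (x - mu)
    diff = x - mu
    return diff @ np.linalg.inv(sigma) @ diff

print(mahalanobis_sq(np.array([2.0, 3.0]), mu, sigma))
```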
Discriminant Functions for the Normal Density
- Minimum-error-rate classifier: gi(x) = ln p(x|wi) + ln P(wi)
- If the densities are multivariate normal, i.e., if p(x|wi) ~ N(μi, Σi)
- Then we have gi(x) = -1/2 (x - μi)^T Σi^(-1) (x - μi) - (d/2) ln 2π - 1/2 ln|Σi| + ln P(wi)
Discriminant Functions for the Normal Density
- Case 1: Σi = σ²I
  - Features are statistically independent and each feature has the same variance σ²
  - gi(x) = -||x - μi||² / (2σ²) + ln P(wi)
  - Where ||·|| denotes the Euclidean norm
Case 1: Σi = σ²I
Linear Discriminant Function
- It is not necessary to compute distances
- Expanding the quadratic form ||x - μi||² = x^T x - 2 μi^T x + μi^T μi yields
  gi(x) = -(1 / (2σ²)) [x^T x - 2 μi^T x + μi^T μi] + ln P(wi)
- The term x^T x is the same for all i and can be dropped
- We have the following linear discriminant function: gi(x) = wi^T x + wi0
Linear Discriminant Function
- gi(x) = wi^T x + wi0, where wi = μi / σ² and wi0 = -μi^T μi / (2σ²) + ln P(wi)
- wi0 is the threshold or bias for the i-th category
- A small sketch follows below.
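A minimal sketch of the Case 1 linear machine; the means, variance, and priors are illustrative values, not from the slides:

```python
import numpy as np

# Case 1 (Sigma_i = sigma^2 I):
#   gi(x) = wi^T x + wi0,  wi = mu_i / sigma^2,
#   wi0 = -mu_i^T mu_i / (2 sigma^2) + ln P(wi)
means  = np.array([[0.0, 0.0],
                   [3.0, 3.0]])      # one class mean per row (illustrative)
sigma2 = 1.0
priors = np.array([0.6, 0.4])

W  = means / sigma2                                             # one wi per row
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0          # linear discriminant for each class
    return np.argmax(g)     # 0 -> w1, 1 -> w2

print(classify(np.array([1.0, 1.0])), classify(np.array([2.5, 2.0])))   # -> 0 1
```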
Linear Machine
- A classifier that uses linear discriminant functions is called a linear machine
- Its decision surfaces are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities
- For our case this equation can be written as w^T (x - x0) = 0
Linear Machine
- Where w = μi - μj
- And x0 = 1/2 (μi + μj) - [σ² / ||μi - μj||²] ln[P(wi) / P(wj)] (μi - μj)
- If P(wi) = P(wj), then the second term vanishes
- In that case it is called a minimum-distance classifier
Priors change → decision boundaries shift
Case 2: Σi = Σ
- Covariance matrices for all of the classes are identical but otherwise arbitrary
- The cluster for the i-th class is centered about μi
- Discriminant function: gi(x) = -1/2 (x - μi)^T Σ^(-1) (x - μi) + ln P(wi)
- The ln P(wi) term can be ignored if the prior probabilities are the same for all classes
Case 2: Discriminant Function
For the 2-Category Case
- If Ri and Rj are contiguous, the boundary between them has the equation w^T (x - x0) = 0
- where w = Σ^(-1) (μi - μj)
- and x0 = 1/2 (μi + μj) - [ln(P(wi) / P(wj)) / ((μi - μj)^T Σ^(-1) (μi - μj))] (μi - μj)
- A sketch of this case follows below.
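A minimal sketch of the Case 2 (shared covariance) linear discriminant; all parameter values are illustrative:

```python
import numpy as np

# Case 2 (shared Sigma):
#   gi(x) = wi^T x + wi0 with wi = Sigma^{-1} mu_i,
#   wi0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(wi)
means  = np.array([[0.0, 0.0],
                   [2.0, 1.0]])      # illustrative class means
sigma  = np.array([[1.5, 0.3],
                   [0.3, 1.0]])      # illustrative shared covariance
priors = np.array([0.5, 0.5])

sigma_inv = np.linalg.inv(sigma)
W  = means @ sigma_inv                                          # wi as rows
w0 = -0.5 * np.sum((means @ sigma_inv) * means, axis=1) + np.log(priors)

def classify(x):
    return np.argmax(W @ x + w0)

print(classify(np.array([0.2, 0.1])), classify(np.array([1.8, 0.9])))   # -> 0 1
```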
Case 3: Σi Arbitrary
- In general, the covariance matrices are different for each category
- The only term that can be dropped from gi(x) is the (d/2) ln 2π term
Case 3: Σi Arbitrary
- The discriminant functions are gi(x) = x^T Wi x + wi^T x + wi0
- Where Wi = -1/2 Σi^(-1) and wi = Σi^(-1) μi
- and wi0 = -1/2 μi^T Σi^(-1) μi - 1/2 ln|Σi| + ln P(wi)
- A small sketch follows below.
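A minimal sketch of the Case 3 quadratic discriminant; the means, covariance matrices, and priors are illustrative:

```python
import numpy as np

# Case 3 (arbitrary Sigma_i):
#   gi(x) = x^T Wi x + wi^T x + wi0
#   Wi = -0.5 Sigma_i^{-1},  wi = Sigma_i^{-1} mu_i,
#   wi0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(wi)
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs   = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 1.5]])]
priors = [0.5, 0.5]

def g(x, mu, cov, prior):
    cov_inv = np.linalg.inv(cov)
    Wi  = -0.5 * cov_inv
    wi  = cov_inv @ mu
    wi0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    scores = [g(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

print(classify(np.array([0.5, 0.5])), classify(np.array([2.2, 1.8])))   # -> 0 1
```

Because each class keeps its own covariance, the discriminant is quadratic in x, which is why the decision surfaces become hyperquadrics rather than hyperplanes.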
Two-Category Case
- The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, ...)
Example