1
CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
  • Bayesian Decision Theory
  • Chapter 2 (Duda et al.)

2
Bayesian Decision Theory
  • Fundamental statistical approach to the problem of pattern classification.
  • Quantifies the tradeoffs between various
    classification decisions using probabilities and
    the costs associated with such decisions.
  • Each action is associated with a cost or risk.
  • The simplest risk is the classification error.
  • Design classifiers to recommend actions that
    minimize some total expected risk.

3
Terminology (using the sea bass / salmon classification example)
  • State of nature ω (a random variable)
  • ω1 for sea bass, ω2 for salmon.
  • Probabilities P(ω1) and P(ω2) (priors)
  • prior knowledge of how likely we are to get a sea bass or a salmon
  • Probability density function p(x) (evidence)
  • how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement)
  • Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs pX(x) and pY(y)

4
Terminology (cont'd) (using the sea bass / salmon classification example)
  • Conditional probability density p(x/ωj) (likelihood)
  • how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj

e.g., lightness distributions between
salmon/sea-bass populations
5
Terminology (cont'd) (using the sea bass / salmon classification example)
  • Conditional probability P(ωj/x) (posterior)
  • the probability that the fish belongs to class ωj given measurement x.
  • Note: we will be using an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).

6
Decision Rule Using Priors Only
  • Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
  • P(error) = min[P(ω1), P(ω2)]
  • Favors the most likely class (optimum if no other info is available).
  • This rule would be making the same decision all the time!
  • Makes sense to use it for judging just one fish.

7
Decision Rule Using Conditional pdf
  • Using Bayes rule, the posterior probability of category ωj given measurement x is given by
    P(ωj/x) = p(x/ωj)P(ωj) / p(x)
  • where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) (a scale factor that makes the posteriors sum to 1)
  • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently
  • Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2

8
Decision Rule Using Conditional pdf (cont'd)

9
Probability of Error
  • The probability of error is defined as
    P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1
  • The average probability of error is given by
    P(error) = ∫ P(error/x) p(x) dx
  • The Bayes rule is optimum, that is, it minimizes the average probability of error, since
  • P(error/x) = min[P(ω1/x), P(ω2/x)]

10
Where do Probabilities Come From?
  • The Bayesian rule is optimal if the pmf or pdf is
    known.
  • There are two competing answers to the above question:
  • (1) Relative frequency (objective) approach.
  • Probabilities can only come from experiments.
  • (2) Bayesian (subjective) approach.
  • Probabilities may reflect degree of belief and
    can be based on opinion as well as experiments.

11
Example
  • Classify cars on the UNR campus as costing more or less than $50K:
  • C1: price > $50K
  • C2: price < $50K
  • Feature x: height of the car
  • From Bayes rule, we know how to compute the posterior probabilities:
  • Need to compute p(x/C1), p(x/C2), P(C1), P(C2)

12
Example (cont'd)
  • Determine the prior probabilities:
  • Collect data: ask drivers how much their car cost and measure its height.
  • e.g., 1209 samples: 221 in C1, 988 in C2, so P(C1) = 221/1209 ≈ 0.183 and P(C2) = 988/1209 ≈ 0.817

13
Example (cont'd)
  • Determine class conditional probabilities
    (likelihood)
  • Discretize car height into bins and use
    normalized histogram

14
Example (cont'd)
  • Calculate the posterior probability for each bin (a code sketch follows below)
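A minimal code sketch of the histogram-based estimates described on slides 12-14, assuming Python with NumPy; the bin edges and the synthetic height measurements are illustrative, and only the class sizes (221 and 988) come from the slides.

import numpy as np

# Hypothetical car-height measurements (meters); in practice these would be
# the 221 samples of C1 (price > $50K) and the 988 samples of C2 (price < $50K).
heights_c1 = np.random.default_rng(0).normal(1.65, 0.15, 221)
heights_c2 = np.random.default_rng(1).normal(1.45, 0.12, 988)

# Priors from the relative class frequencies (slide 12).
n1, n2 = len(heights_c1), len(heights_c2)
p_c1, p_c2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Likelihoods p(x/Ci): normalized histograms over common bins (slide 13).
bins = np.linspace(1.0, 2.2, 13)
lik_c1, _ = np.histogram(heights_c1, bins=bins, density=True)
lik_c2, _ = np.histogram(heights_c2, bins=bins, density=True)

# Posteriors per bin via Bayes rule (slide 14); the evidence p(x) normalizes them.
evidence = lik_c1 * p_c1 + lik_c2 * p_c2
post_c1 = np.divide(lik_c1 * p_c1, evidence, out=np.zeros_like(evidence),
                    where=evidence > 0)
print(np.round(post_c1, 2))  # probability that a car in each height bin costs > $50K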

15
A More General Theory
  • Use more than one feature.
  • Allow more than two categories.
  • Allow actions other than classifying the input to one of the possible categories (e.g., rejection).
  • Introduce a more general error function:
  • a loss function (i.e., associate costs with actions)

16
Terminology
  • Features form a vector x
  • A finite set of c categories ω1, ω2, ..., ωc
  • A finite set of l actions α1, α2, ..., αl
  • A loss function λ(αi/ωj):
  • the loss incurred for taking action αi when the classification category is ωj
  • Bayes rule using vector notation:
    P(ωj/x) = p(x/ωj)P(ωj) / p(x)

17
Expected Loss (Conditional Risk)
  • Expected loss (or conditional risk) of taking action αi:
    R(αi/x) = Σj λ(αi/ωj) P(ωj/x)   (summing over j = 1, ..., c)
  • The expected loss can be minimized by selecting the action that minimizes the conditional risk.

18
Overall Risk
  • Overall risk R:
    R = ∫ R(α(x)/x) p(x) dx   (the integrand is the conditional risk of the chosen action)
  • where α(x) determines which action α1, α2, ..., αl to take for every x (i.e., α(x) is a decision rule).
  • To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x.
  • This rule yields optimal performance.

19
Bayes Decision Rule
  • The Bayes decision rule minimizes R by:
  • Computing R(αi/x) for every αi given an x
  • Choosing the action αi with the minimum R(αi/x)
  • The Bayes risk (i.e., the resulting minimum) is the best performance that can be achieved (a code sketch of the rule follows below).
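A minimal sketch of the minimum-conditional-risk rule above, assuming the posteriors P(ωj/x) for one observation have already been computed; the loss-matrix values and the posteriors are illustrative, not from the slides.

import numpy as np

# Illustrative loss matrix: loss[i, j] = λ(αi/ωj), the cost of taking action αi
# when the true category is ωj (zero loss on the diagonal, i.e., correct decisions).
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])

# Posteriors P(ω1/x), P(ω2/x) for one measurement x (assumed already computed).
posteriors = np.array([0.3, 0.7])

# Conditional risk of each action: R(αi/x) = Σj λ(αi/ωj) P(ωj/x).
cond_risk = loss @ posteriors

# Bayes decision rule: take the action with the minimum conditional risk.
best_action = int(np.argmin(cond_risk))
print(cond_risk, "-> take action", best_action + 1)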

20
Example: Two-category classification
  • Two possible actions:
  • α1 corresponds to deciding ω1
  • α2 corresponds to deciding ω2
  • Notation: λij = λ(αi/ωj)
  • The conditional risks are:
    R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
    R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)

21
Example: Two-category classification (cont'd)
  • Decision rule: decide ω1 if R(α1/x) < R(α2/x); otherwise decide ω2
  • or, equivalently, decide ω1 if (λ21 - λ11) P(ω1/x) > (λ12 - λ22) P(ω2/x)
  • or (i.e., using the likelihood ratio), decide ω1 if
    p(x/ω1) / p(x/ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]
22
Special Case: Zero-One Loss Function
  • It assigns the same loss to all errors:
    λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j
  • The conditional risk corresponding to this loss function:
    R(αi/x) = Σj≠i P(ωj/x) = 1 - P(ωi/x)

23
Special Case: Zero-One Loss Function (cont'd)
  • The decision rule becomes: decide ω1 if P(ω1/x) > P(ω2/x)
  • or, equivalently, decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2)
  • or, using the likelihood ratio, decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1)
  • What is the overall risk in this case?
  • (answer: the average probability of error)
24
Example
  • θa was determined assuming a zero-one loss function
  • θb was determined assuming a different (non-zero-one) loss function

(decision regions)
25
Minimax Criterion
  • Design classifiers that perform well over a range of prior probabilities (e.g., when the prior probabilities are not known exactly).
  • Minimize the maximum (i.e., worst) possible overall risk over all values of the priors.

(R1 and R2 are the decision regions for given
priors)
26
Minimax Criterion (cont'd)
  • R is linear in P(ω1).
  • Idea: find the decision boundary such that the term multiplying P(ω1) becomes 0; R is then independent of the priors and equal to the worst-case risk.
27
Minimax Criterion (cont'd)
  • How do we find the minimax solution?
  • We need to find the prior which maximizes the Bayes risk (i.e., Rmm becomes equal to the worst Bayes risk).
  • (figure: Bayes error curve as a function of the prior, for a fixed decision boundary and varying priors; find the maximum Bayes error; a zero-one loss function is assumed in this example; a numerical sketch follows below)
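A minimal numerical sketch of this search, assuming two 1-D Gaussian class-conditional densities and a zero-one loss; all parameter values are illustrative, and the Bayes error is evaluated on a grid of candidate priors.

import numpy as np
from scipy.stats import norm

# Assumed 1-D Gaussian class-conditional densities (illustrative parameters).
p1 = norm(loc=0.0, scale=1.0).pdf   # p(x/ω1)
p2 = norm(loc=2.0, scale=1.0).pdf   # p(x/ω2)

x = np.linspace(-8.0, 10.0, 20001)
dx = x[1] - x[0]

def bayes_error(prior1):
    """Bayes error (zero-one loss) for a given prior P(ω1)."""
    joint1 = p1(x) * prior1
    joint2 = p2(x) * (1.0 - prior1)
    # At each x the Bayes rule errs with the smaller of the two joint densities.
    return np.sum(np.minimum(joint1, joint2)) * dx

# Sweep the prior; the minimax solution corresponds to the prior that
# maximizes the Bayes error (the peak of the Bayes error curve).
priors = np.linspace(0.01, 0.99, 99)
errors = [bayes_error(p) for p in priors]
worst = int(np.argmax(errors))
print("worst-case prior P(ω1) ≈", priors[worst], " max Bayes error ≈", errors[worst])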
28
Neyman-Pearson Criterion
  • Minimize the overall risk subject to a constraint
  • e.g., do not misclassify more than 1% of salmon as sea bass
  • Adjust decision boundaries numerically
  • Analytic solutions are possible assuming Gaussian
    densities.

29
Discriminant Functions
  • Functional structure of a general statistical classifier:
  • Assign x to ωi if gi(x) > gj(x) for all j ≠ i
    (the gi(x) are the discriminant functions; the classifier picks the maximum)
30
Discriminants for Bayes Classifier
  • Using risks:
  • gi(x) = -R(αi/x)
  • Using the zero-one loss function (i.e., minimum error rate):
  • gi(x) = P(ωi/x)
  • Is the choice of gi unique?
  • Replacing gi(x) with f(gi(x)), where f(.) is monotonically increasing, does not change the classification results.

31
Decision Regions and Boundaries
  • Decision rules divide the feature space into decision regions R1, R2, ..., Rc
  • The boundaries of the decision regions are the
    decision boundaries.

g1(x) = g2(x) at the decision boundaries
32
Case of two categories
  • It is more common to use a single discriminant function (dichotomizer) instead of two:
    g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0
  • Examples of dichotomizers:
    g(x) = P(ω1/x) - P(ω2/x)
    g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]

33
Discriminant Function for Multivariate Gaussian
  • Assume the following discriminant function:
    gi(x) = ln p(x/ωi) + ln P(ωi)
  • where each class-conditional density p(x/ωi) is a multivariate Gaussian N(μi, Σi)
34
Multivariate Gaussian Density: Case I
  • Assumption: Σi = σ²I
  • Features are statistically independent
  • Each feature has the same variance
    gi(x) = -||x - μi||² / (2σ²) + ln P(ωi)
    (the ln P(ωi) term favors the a-priori more likely category)
35
Multivariate Gaussian Density: Case I (cont'd)
  • Expanding gi(x) gives a linear discriminant function:
    gi(x) = wi^T x + wi0, with wi = μi/σ² and wi0 = -(μi^T μi)/(2σ²) + ln P(ωi)
    (wi0 is the threshold or bias)
36
Multivariate Gaussian Density: Case I (cont'd)
  • Comments about this hyperplane:
  • It passes through x0.
  • It is orthogonal to the line linking the means.
  • What happens when P(ωi) = P(ωj)? (x0 is then the midpoint between the means)
  • If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
  • If σ² is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).

37
Multivariate Gaussian Density: Case I (cont'd)

38
Multivariate Gaussian Density: Case I (cont'd)
  • Minimum distance classifier:
  • When P(ωi) is the same for each of the c classes, assign x to the class with the nearest mean (a code sketch follows below).
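A minimal sketch of the minimum distance classifier for Case I (equal priors, Σi = σ²I); the class means and the test point are illustrative values.

import numpy as np

# Illustrative class means (one row per class) and a test feature vector.
means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [1.0, 4.0]])
x = np.array([2.5, 1.5])

# With equal priors and Σi = σ²I, the Bayes rule reduces to assigning x to the
# class whose mean is closest in Euclidean distance (minimum distance classifier).
dists = np.linalg.norm(means - x, axis=1)
print("decide ω%d" % (int(np.argmin(dists)) + 1))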

39
Multivariate Gaussian Density: Case II
  • Assumption: Σi = Σ (all classes share the same covariance matrix)

40
Multivariate Gaussian Density: Case II (cont'd)
  • Comments about this hyperplane:
  • It passes through x0.
  • It is NOT orthogonal to the line linking the means.
  • What happens when P(ωi) = P(ωj)?
  • If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.

41
Multivariate Gaussian Density: Case II (cont'd)

42
Multivariate Gaussian Density: Case II (cont'd)
  • Mahalanobis distance classifier:
  • When P(ωi) is the same for each of the c classes, assign x to the class with the smallest squared Mahalanobis distance (x - μi)^T Σ^(-1) (x - μi).

43
Multivariate Gaussian Density: Case III
  • Assumption: Σi arbitrary
  • The decision boundaries are hyperquadrics,
    e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
    (a code sketch of the corresponding quadratic discriminant follows below)
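A minimal sketch of the general Gaussian discriminant gi(x) = ln p(x/ωi) + ln P(ωi) for arbitrary Σi; the means, covariances, and priors below are illustrative values.

import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """gi(x) = -1/2 (x-μi)^T Σi^(-1) (x-μi) - d/2 ln(2π) - 1/2 ln|Σi| + ln P(ωi)."""
    d = len(mean)
    diff = x - mean
    maha = diff @ np.linalg.inv(cov) @ diff
    return (-0.5 * maha - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))

# Illustrative two-class problem with different (arbitrary) covariance matrices.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]), 0.6),  # ω1
    (np.array([2.0, 2.0]), np.array([[0.5, 0.0], [0.0, 0.5]]), 0.4),  # ω2
]

x = np.array([1.2, 1.0])
scores = [gaussian_discriminant(x, m, c, p) for (m, c, p) in params]
print("decide ω%d" % (int(np.argmax(scores)) + 1))  # pick the maximum gi(x)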
44
Multivariate Gaussian Density: Case III (cont'd)
(figure: example with P(ω1) = P(ω2); the decision regions are disconnected)
45
Multivariate Gaussian Density: Case III (cont'd)
(figure: disconnected decision regions and non-linear decision boundaries)
46
Multivariate Gaussian Density: Case III (cont'd)
  • More examples (Σi arbitrary)

47
Multivariate Gaussian Density: Case III (cont'd)
  • A four category example

48
Example - Case III

(figure: decision boundary for P(ω1) = P(ω2); the boundary does not pass through the midpoint of μ1 and μ2)
49
Error Probabilities and Integrals
  • Case of two categories (see the expression below)
  • The Bayes rule minimizes P(error); the optimum boundary is obtained using the Bayes rule: x* = xB
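For reference, the two-category error probability referred to above, written in the standard form (R1 and R2 are the decision regions):

P(\mathrm{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2)
                  = \int_{R_2} p(x/\omega_1) P(\omega_1)\, dx + \int_{R_1} p(x/\omega_2) P(\omega_2)\, dx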
50
Error Probabilities and Integrals (cont'd)
  • Case of multiple categories:
  • it is simpler to compute the probability of being correct (see the expression below)
  • The Bayes rule maximizes P(correct).
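The multi-category probability of being correct, in its standard form (reconstruction of the slide's missing equation):

P(\mathrm{correct}) = \sum_{i=1}^{c} P(x \in R_i, \omega_i)
                    = \sum_{i=1}^{c} \int_{R_i} p(x/\omega_i) P(\omega_i)\, dx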
51
Error Bounds for Gaussian Densities
  • A full calculation of the error can be difficult.
  • Instead, compute upper bounds on the probability of error.
  • Assume the case of two categories for convenience.
  • Chernoff bound:
    P(error) ≤ P(ω1)^β P(ω2)^(1-β) ∫ p(x/ω1)^β p(x/ω2)^(1-β) dx,  for 0 ≤ β ≤ 1


52
Error Bounds for Gaussian Densities
  • If the class-conditional distributions are Gaussian, then
    P(error) ≤ P(ω1)^β P(ω2)^(1-β) e^(-k(β))
  • where k(β) is defined as follows:
    k(β) = [β(1-β)/2] (μ2 - μ1)^T [βΣ1 + (1-β)Σ2]^(-1) (μ2 - μ1) + (1/2) ln( |βΣ1 + (1-β)Σ2| / (|Σ1|^β |Σ2|^(1-β)) )

53
Error Bounds for Gaussian Densities (cont'd)
  • The Chernoff bound corresponds to the β that minimizes e^(-k(β)).
  • This is a 1-D optimization, regardless of the dimensionality of the class-conditional densities.
  • (figure: the bound as a function of β is loose near β = 0 and β = 1 and tight at the minimizing β)
54
Error Bounds for Gaussian Densities (cont'd)
  • Bhattacharyya bound:
  • The bound is evaluated at β = 0.5:
    P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-k(1/2))
  • Easier to compute than the Chernoff bound, but looser.
  • The Chernoff and Bhattacharyya bounds will not be tight if the distributions are not Gaussian!
  • (a code sketch of both bounds follows below)
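A minimal sketch of these bounds for two Gaussian class-conditional densities; the means, covariances, and priors are illustrative values, and the Chernoff bound is found by a simple grid search over β.

import numpy as np

def k(beta, mu1, mu2, cov1, cov2):
    """k(β) for Gaussian class-conditional densities (Duda et al., Ch. 2)."""
    diff = mu2 - mu1
    cov = beta * cov1 + (1 - beta) * cov2
    term1 = 0.5 * beta * (1 - beta) * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         (np.linalg.det(cov1) ** beta * np.linalg.det(cov2) ** (1 - beta)))
    return term1 + term2

# Illustrative parameters for the two classes.
mu1, cov1 = np.array([0.0, 0.0]), np.eye(2)
mu2, cov2 = np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
p1, p2 = 0.5, 0.5

# Bhattacharyya bound: β fixed at 1/2.
bhat = np.sqrt(p1 * p2) * np.exp(-k(0.5, mu1, mu2, cov1, cov2))

# Chernoff bound: minimize over β with a simple grid search (1-D optimization).
betas = np.linspace(0.01, 0.99, 99)
chern = min(p1 ** b * p2 ** (1 - b) * np.exp(-k(b, mu1, mu2, cov1, cov2)) for b in betas)

print("Bhattacharyya bound:", bhat, " Chernoff bound:", chern)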

55
Example on Error Bounds

k(1/2) = 4.06
56
Receiver Operating Characteristic (ROC) Curve
  • Every classifier employs some kind of a threshold
    value.
  • Changing the threshold affects the performance of
    the system.
  • ROC curves can help us distinguish between
    discriminability and decision bias (i.e., choice
    of threshold)

57
Example: Person Authentication
  • Authenticate a person using biometrics (e.g.,
    face image).
  • There are two possible distributions
  • authentic (A) and impostor (I)

(figure: overlapping score distributions for the authentic (A) and impostor (I) classes; the threshold determines the regions of correct acceptance, correct rejection, and false positives/negatives)
58
Example: Person Authentication (cont'd)
  • Possible cases (an ROC sketch based on these follows below):
  • (1) correct acceptance (true positive)
  • X belongs to A, and we decide A
  • (2) incorrect acceptance (false positive)
  • X belongs to I, and we decide A
  • (3) correct rejection (true negative)
  • X belongs to I, and we decide I
  • (4) incorrect rejection (false negative)
  • X belongs to A, and we decide I
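A minimal sketch that turns these four outcomes into an ROC curve, assuming 1-D Gaussian score distributions for the authentic (A) and impostor (I) classes; the distribution parameters are illustrative.

import numpy as np
from scipy.stats import norm

# Assumed match-score distributions: authentic (A) scores tend to be higher.
authentic = norm(loc=2.0, scale=1.0)   # p(x/A)
impostor = norm(loc=0.0, scale=1.0)    # p(x/I)

# Sweep the decision threshold: accept (decide A) whenever the score exceeds it.
thresholds = np.linspace(-4.0, 6.0, 101)
tpr = authentic.sf(thresholds)   # correct acceptance rate, P(x > t / A)
fpr = impostor.sf(thresholds)    # false positive rate,     P(x > t / I)

# Each threshold gives one (FPR, TPR) point; together they trace the ROC curve.
for t in (0.0, 1.0, 2.0):
    i = np.argmin(np.abs(thresholds - t))
    print(f"threshold {t:+.1f}: FPR = {fpr[i]:.3f}, TPR = {tpr[i]:.3f}")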

59
Error vs Threshold

60
False Negatives vs Positives

61
Bayes Decision Theory: Case of Discrete Features
  • Replace the integrals over the densities p(x/ωj) with sums over the probabilities P(x/ωj) (x now takes discrete values).
  • Read section 2.9

62
Missing Features
  • Suppose x = (x1, x2) is a feature vector.
  • What can we do when x1 is missing during classification?
  • Maybe use the mean value of all x1 measurements?
  • But the class with the largest posterior at the mean value need not be the correct choice; marginalizing over the missing feature can favor a different class!

63
Missing Features (cont'd)
  • Suppose x = [xg, xb] (xg: good features, xb: bad/missing features)
  • Derive the Bayes rule using only the good features (see the expression below):
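A reconstruction of the marginalization referred to above, in its standard form (the bad features xb are integrated out):

P(\omega_i/x_g) = \frac{\int p(\omega_i, x_g, x_b)\, dx_b}{p(x_g)}
                = \frac{\int P(\omega_i/x_g, x_b)\, p(x_g, x_b)\, dx_b}{\int p(x_g, x_b)\, dx_b}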

64
Noisy Features
  • Suppose x = [xg, xb] (xg: good features, xb: noisy features)
  • Suppose the noise is statistically independent and we know the noise model p(xb/xt)
  • xb: observed feature values, xt: true feature values.
  • Assume statistically independent noise:
  • if xt were known, xb would be independent of xg and ωi

65
Noisy Features (cont'd)
  • Use the independence assumption to integrate out the true values xt (see the expression below).
  • What happens when p(xb/xt) is uniform?
  • (the noisy-feature case then reduces to the missing-feature case)
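A reconstruction of the corresponding rule, in its standard form, using the independence assumption p(xb/xt, xg, ωi) = p(xb/xt):

P(\omega_i/x_g, x_b) = \frac{\int P(\omega_i/x_g, x_t)\, p(x_g, x_t)\, p(x_b/x_t)\, dx_t}
                             {\int p(x_g, x_t)\, p(x_b/x_t)\, dx_t}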

66
Compound Bayesian Decision Theory
  • Sequential compound decision
  • Decide as each fish emerges.
  • Compound decision
  • Wait for n fish to emerge.
  • Make all n decisions jointly.

67
Bayes Rule for Compound Decisions
  • The state of nature is now a vector Ω = (ω(1), ..., ω(n)), which can take one of c^n possible values (c^n possible vectors); the Bayes rule is applied to Ω as a whole (see below).
  • (consecutive states ωi are not independent, which can lead to better performance)
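A reconstruction of the compound Bayes rule in its standard form, with X = (x1, ..., xn) denoting the n observations:

P(\Omega/X) = \frac{p(X/\Omega)\, P(\Omega)}{p(X)}, \qquad
p(X) = \sum_{\Omega} p(X/\Omega)\, P(\Omega)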