Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Bayesian Decision Theory
  • Bayes' formula
  • Error
  • Bayes' Decision Rule
  • Loss function and Risk
  • Two-Category Classification
  • Classifiers, Discriminant Functions, and Decision
    Surfaces
  • Discriminant Functions for the Normal Density

2
Bayesian Decision Theory
  • Bayesian decision theory is a fundamental
    statistical approach to the problem of pattern
    classification.
  • It addresses decision making when all the relevant probabilistic
    information is known.
  • For the given probabilities, the resulting decision is optimal.
  • When new information is added, it is assimilated in an optimal
    fashion to improve the decision.

3
Bayesian Decision Theory cont.
  • Fish example
  • Each fish is in one of two states: sea bass or salmon.
  • Let w denote the state of nature:
  •  w = w1 for sea bass
  •  w = w2 for salmon

4

Bayesian Decision Theory cont.
  • The state of nature is unpredictable: w is a variable that must be
    described probabilistically.
  • If the catch produced as much salmon as sea bass, the next fish is
    equally likely to be sea bass or salmon.
  • Define:
  •  P(w1) = a priori probability that the next fish is sea bass
  •  P(w2) = a priori probability that the next fish is salmon

5

Bayesian Decision Theory cont.
  • If other types of fish are irrelevant: P(w1) + P(w2) = 1.
  • Prior probabilities reflect our prior knowledge
    (e.g., time of year, fishing area, ...).
  • Simple decision rule: make a decision without seeing the fish.
  • Decide w1 if P(w1) > P(w2); w2 otherwise.
  • This is fine if we only have to decide for one fish,
  • but if there are several fish, they are all assigned to the same class.

6

Bayesian Decision Theory cont.
  • In general, we will have some features that give us more information.
  • Feature: the lightness measurement x.
  • Different fish yield different lightness readings
    (x is a random variable)

7

Bayesian Decision Theory cont.
  • Define:
  • p(x|w1) = class-conditional probability density:
  • the probability density function for x given that the state of
    nature is w1.
  • The difference between p(x|w1) and p(x|w2) describes the difference
    in lightness between sea bass and salmon.

8
Bayesian Decision Theory cont.

Figure: hypothetical class-conditional probability density functions
p(x|w1) and p(x|w2). The densities are normalized
(the area under each curve is 1.0).
9

Bayesian Decision Theory cont.
  • Suppose that we know:
  • the prior probabilities P(w1) and P(w2),
  • the class-conditional densities p(x|w1) and p(x|w2),
  • and we measure the lightness of a fish: x.
  • What is the category of the fish?

10
Bayes' formula
  • P(wj|x) = p(x|wj) P(wj) / p(x),
  • where p(x) = Σj p(x|wj) P(wj), summing over all c categories.
    A small numerical sketch follows below.
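As a minimal illustration of the formula above (a sketch, not part of the original slides; the likelihood values are made-up numbers, and the priors 2/3 and 1/3 are taken from the later example):

```python
# Minimal sketch of Bayes' formula for the two-category fish example.
# The numeric likelihood values below are made-up for illustration.
priors = {"sea_bass": 2/3, "salmon": 1/3}          # P(w1), P(w2)
likelihoods = {"sea_bass": 0.40, "salmon": 0.75}   # p(x|w1), p(x|w2) at the observed x

# Evidence p(x) = sum_j p(x|wj) P(wj)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Posteriors P(wj|x) = p(x|wj) P(wj) / p(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}
print(posteriors)  # the two values sum to 1
```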

11
Bayes' formula cont.
  • p(x|wj) is called the likelihood of wj with respect to x
  • (the wj category for which p(x|wj) is large is more "likely" to be
    the true category).
  • p(x) is the evidence:
  • how frequently we will measure a pattern with feature value x.
  • It is a scale factor that guarantees that the posterior
    probabilities sum to 1.

12
Bayes' formula cont.
Figure: posterior probabilities for the particular priors P(w1) = 2/3
and P(w2) = 1/3. At every x the posteriors sum to 1.
13
Error
  • For a given x, we can minimize the probability of error
  • by deciding w1 if P(w1|x) > P(w2|x),
  • and w2 otherwise.

14
Bayes' Decision Rule (Minimizes the probability
of error) 
  • w1 if P(w1|x) > P(w2|x),
  • w2 otherwise;
  • or
  • w1 if p(x|w1) P(w1) > p(x|w2) P(w2),
  • w2 otherwise;
  • and
  • P(error|x) = min[P(w1|x), P(w2|x)]. A small sketch follows below.
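Continuing the previous sketch (the `posteriors` dictionary is the one computed above; the variable names are mine, not the slides'):

```python
# Decide the category with the largest posterior; the conditional
# probability of error is the smaller posterior.
decision = max(posteriors, key=posteriors.get)
p_error = min(posteriors.values())   # P(error|x) = min[P(w1|x), P(w2|x)]
print(decision, p_error)
```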

15
Bayesian Decision Theory Continuous Features
General Case
  • Formalize the ideas just considered in four ways:
  • Allow more than one feature: replace the scalar x by the feature
    vector x. The d-dimensional Euclidean space R^d is called the
    feature space.
  • Allow more than two states of nature: generalize to several classes.
  • Allow actions other than merely deciding the state of nature:
    this gives the possibility of rejection, i.e., of refusing to make
    a decision in close cases.
  • Introduce a general loss function.

16
Loss function
  • The loss (or cost) function states exactly how costly each action
    is, and is used to convert a probability determination into a
    decision. Loss functions let us treat situations in which some
    kinds of classification mistakes are more costly than others.

17
Formulation
  • Let {w1, ..., wc} be the finite set of c states of nature
    ("categories").
  • Let {α1, ..., αa} be the finite set of a possible actions.
  • The loss function λ(αi|wj) gives the loss incurred for taking
    action αi when the state of nature is wj.
  • x = the d-dimensional feature vector (a random variable).
  • p(x|wj) = the state-conditional probability density function for x
    (the probability density function for x conditioned on wj being the
    true state of nature).
  • P(wj) = the prior probability that nature is in state wj.

18
Expected Loss
  • Suppose that we observe a particular x and that we contemplate
    taking action αi.
  • If the true state of nature is wj, then the loss incurred is λ(αi|wj).
  • Before we have made an observation, the expected loss for action αi
    is E[λ(αi|w)] = Σj λ(αi|wj) P(wj).

19
Conditional Risk
  • After the observation, the expected loss, now called the
    conditional risk, is given by R(αi|x) = Σj λ(αi|wj) P(wj|x).

20
Total Risk
  • Objective: select the action that minimizes the conditional risk.
  • A general decision rule is a function α(x).
  • For every x, the decision function α(x) assumes one of the a values
    α1, ..., αa.
  • The total risk is R = ∫ R(α(x)|x) p(x) dx.

21
Bayes Decision Rule
  • Compute the conditional risk R(αi|x) for i = 1, ..., a.
  • Select the action αi for which R(αi|x) is minimum.
  • The resulting minimum total risk is called the Bayes risk, denoted
    R, and is the best performance that can be achieved. A small
    numerical sketch follows below.
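A small sketch of the minimum-risk rule described above (the loss matrix and posterior values are illustrative assumptions, not from the slides):

```python
import numpy as np

# lam[i, j] = loss for taking action alpha_i when the true state is w_j
lam = np.array([[0.0, 2.0],    # decide w1: no loss if w1, costly if w2
                [1.0, 0.0]])   # decide w2: costly if w1, no loss if w2
post = np.array([0.6, 0.4])    # P(w1|x), P(w2|x) -- illustrative values

# Conditional risk R(alpha_i|x) = sum_j lam[i, j] * P(wj|x)
R = lam @ post
best_action = int(np.argmin(R))   # Bayes decision: minimize the conditional risk
print(R, best_action)
```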


22
Two-Category Classification
  • Action α1: deciding that the true state is w1.
  • Action α2: deciding that the true state is w2.
  • Let λij = λ(αi|wj) be the loss incurred for deciding wi when the
    true state is wj.
  • Decide w1 if R(α1|x) < R(α2|x),
  • or if (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x),
  • or if (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2),
  • and w2 otherwise.

23
Two-Category Likelihood Ratio Test
  • Under the reasonable assumption that λ21 > λ11 (why?),
  • decide w1 if p(x|w1) / p(x|w2) > [(λ12 - λ22) / (λ21 - λ11)] · P(w2) / P(w1),
  • and w2 otherwise.
  • The ratio p(x|w1) / p(x|w2) is called the likelihood ratio. We can
    decide w1 if the likelihood ratio exceeds a threshold T whose value
    is independent of the observation x. A sketch follows below.
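A sketch of the likelihood-ratio test with the threshold T defined above (the losses, priors, and density values are illustrative assumptions):

```python
# Likelihood-ratio test: decide w1 if p(x|w1)/p(x|w2) > T,
# with T = [(lam12 - lam22) / (lam21 - lam11)] * P(w2) / P(w1).
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0   # illustrative losses
P1, P2 = 2/3, 1/3                                  # priors
p_x_w1, p_x_w2 = 0.40, 0.75                        # densities at the observed x

T = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)
decision = "w1" if p_x_w1 / p_x_w2 > T else "w2"
print(T, decision)
```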


24
Minimum-Error-Rate Classification
  • In classification problems, each state of nature is usually
    associated with one of c different classes.
  • Action αi: deciding that the true state is wi.
  • If action αi is taken and the true state is wj, then the decision
    is correct if i = j, and in error otherwise.
  • The zero-one loss function is defined as λ(αi|wj) = 0 if i = j,
    and 1 if i ≠ j, for i, j = 1, ..., c.
  • All errors are equally costly.

25
Minimum-Error-Rate Classification cont.
  • The conditional risk is R(αi|x) = Σ(j≠i) P(wj|x) = 1 - P(wi|x).
  • To minimize the average probability of error, we should select the
    i that maximizes the posterior probability P(wi|x).
  • Decide wi if P(wi|x) > P(wj|x) for all j ≠ i.
  • (This is the same as Bayes' decision rule.)


26
Decision Regions
  • The likelihood ratio p(x|w1) / p(x|w2) plotted vs. x.
  • The threshold θa corresponds to the zero-one loss function.
  • If we set λ12 > λ21, we obtain θb > θa.


27
Classifiers, Discriminant Functions, and
Decision SurfacesThe Multi-Category Case
  • A pattern classifier can be represented by a set of discriminant
    functions gi(x), i = 1, ..., c.
  • The classifier assigns a feature vector x to class wi
  • if gi(x) > gj(x) for all j ≠ i.

28
Statistical Pattern Classifier
  • Figure: a statistical pattern classifier.

29
The Bayes Classifier
  • A Bayes classifier can be represented in this way:
  • For the general case with risks: gi(x) = -R(αi|x).
  • For the minimum error-rate case: gi(x) = P(wi|x).
  • If we replace every gi(x) by f(gi(x)), where f(·) is a
    monotonically increasing function, the resulting classification is
    unchanged; e.g., any of the following choices gives identical
    classification results: gi(x) = P(wi|x), gi(x) = p(x|wi) P(wi),
    or gi(x) = ln p(x|wi) + ln P(wi).

30
The Bayes Classifier cont.
  • The effect of any decision rule is to divide the feature space
    into c decision regions, R1, ..., Rc.
  • If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned
    to wi.
  • Decision regions are separated by decision
    boundaries.
  • Decision boundaries are surfaces in the feature
    space.

31
The Decision Regions
  • Figure: a two-dimensional, two-category classifier.

32
The Two-Category Case
  • Use two discriminant functions, g1 and g2, and assign x to w1 if
    g1(x) > g2(x).
  • Alternative: define a single discriminant function
    g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, and otherwise decide w2.
  • In the two-category case, two forms are frequently used:
    g(x) = P(w1|x) - P(w2|x), and
    g(x) = ln[p(x|w1) / p(x|w2)] + ln[P(w1) / P(w2)].

33
Normal Density - Univariate Case
  • The Gaussian density with mean μ and standard deviation σ
    (σ² is called the variance) is
    p(x) = (1 / (√(2π) σ)) exp[-(x - μ)² / (2σ²)].
  • It can be shown that E[x] = μ and E[(x - μ)²] = σ².
    A sketch of the density follows below.
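A small sketch of the univariate density written out above (plain Python, no dependencies; the test value is arbitrary):

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

print(normal_pdf(0.0, 0.0, 1.0))  # peak of the standard normal, ~0.3989
```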

34
Entropy
  • The entropy is given by H(p(x)) = -∫ p(x) ln p(x) dx and is
    measured in nats; if log2 is used instead, the unit is the bit.
  • The entropy measures the fundamental uncertainty in the values of
    points selected randomly from a distribution. The normal
    distribution has the maximum entropy of all distributions having a
    given mean and variance. As stated by the Central Limit Theorem,
    the aggregate effect of the sum of a large number of small, i.i.d.
    random disturbances leads to a Gaussian distribution.
  • Because many patterns can be viewed as some
    ideal or prototype pattern corrupted by a large
    number of random processes, the Gaussian is often
    a good model for the actual probability
    distribution.

35
Normal Density - Multivariate Case
  • The general multivariate normal density (MND) in d dimensions is
    written as
    p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[-(1/2)(x - μ)^t Σ^(-1) (x - μ)].
  • It can be shown that μ = E[x] and Σ = E[(x - μ)(x - μ)^t], which
    means for the components: μi = E[xi] and σij = E[(xi - μi)(xj - μj)].
  • The covariance matrix Σ is always symmetric and positive
    semidefinite. A sketch follows below.
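A sketch of the multivariate density above with NumPy (the 2-D mean, covariance, and test point are made-up examples):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """General multivariate normal density in d dimensions."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^t Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```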

36
Normal Density - Multivariate Case cont.
  • The diagonal elements σii are the variances of the xi, and the
    off-diagonal elements σij are the covariances of xi and xj.
  • If xi and xj are statistically independent, then σij = 0. If all
    σij = 0 for i ≠ j, then p(x) is a product of univariate normal
    densities.
  • Linear combinations of jointly normally distributed random
    variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x,
  • where A is a d-by-k matrix, then p(y) ~ N(A^t μ, A^t Σ A).
  • If A is a vector a, then y = a^t x is a scalar, and a^t Σ a is the
    variance of the projection of x onto a.

37
Whitening transform
  • Define Φ to be the matrix whose columns are the orthonormal
    eigenvectors of Σ, and Λ the diagonal matrix of the corresponding
    eigenvalues. The transformation A_w = Φ Λ^(-1/2)
  • converts an arbitrary MND into a spherical one with covariance
    matrix I. A sketch follows below.
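A sketch of the whitening transform A_w = Φ Λ^(-1/2) described above (NumPy; the covariance matrix is an assumed example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Phi: columns are the orthonormal eigenvectors of Sigma; eigvals: the eigenvalues.
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)       # whitening matrix

# Transformed covariance A_w^t Sigma A_w should be (numerically close to) the identity.
print(A_w.T @ Sigma @ A_w)
```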

38
Normal Density - Multivariate Case cont.
  • The multivariate normal density is completely specified by
    d + d(d+1)/2 parameters: the elements of μ and Σ. Samples drawn
    from an MND fall in a cluster whose center is determined by μ and
    whose shape is determined by Σ.
  • The loci of points of constant density are hyperellipsoids on
    which the quadratic form r² = (x - μ)^t Σ^(-1) (x - μ) is constant.
  • The quantity r is called the Mahalanobis distance from x to μ.
  • The principal axes of the hyperellipsoids are given by the
    eigenvectors of Σ. A sketch follows below.
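A small sketch of the squared Mahalanobis distance r² = (x - μ)^t Σ^(-1) (x - μ) (NumPy; example numbers only):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance from x to mu under covariance Sigma."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([2.0, 1.0]), mu, Sigma))
```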

39
Normal Density - Multivariate Case cont.
  • The minimum-error-rate classification can be achieved using the
    discriminant functions gi(x) = P(wi|x),
  • or equivalently gi(x) = ln p(x|wi) + ln P(wi).
  • If p(x|wi) ~ N(μi, Σi),
  • then gi(x) = -(1/2)(x - μi)^t Σi^(-1) (x - μi)
    - (d/2) ln 2π - (1/2) ln |Σi| + ln P(wi).

40
Discriminant Functions for the Normal Density
Case1. Σi = σ²I
  • The features are statistically independent, and each feature has
    the same variance σ².
  • The determinant is |Σi| = σ^(2d),
  • and the inverse of Σi is Σi^(-1) = (1/σ²) I.
  • Both |Σi| and the (d/2) ln 2π term are independent of i and can be
    ignored.

41
Case1 cont.
  • This leaves gi(x) = -||x - μi||² / (2σ²) + ln P(wi),
    where ||·|| denotes the Euclidean norm.
  • Expanding the quadratic form, the term x^t x is independent of i
    and can be dropped,
  • or, as a linear discriminant function, gi(x) = wi^t x + wi0,
  • where wi = μi / σ² and wi0 = -μi^t μi / (2σ²) + ln P(wi).

42
Case1 cont.
  • wi0 is called the threshold or bias in the ith direction.
  • A classifier that uses linear discriminant functions
  • is called a linear machine.
  • The decision surfaces of a linear machine are
  • pieces of hyperplanes defined by the linear
  • equations gi(x) = gj(x) for the 2 categories with
  • the highest posterior probabilities.
  • For this particular case, setting gi(x) = gj(x)
  • reduces to w^t (x - x0) = 0,

43
Case1 cont.
  • where w = μi - μj and
    x0 = (1/2)(μi + μj) - [σ² / ||μi - μj||²] ln[P(wi) / P(wj)] (μi - μj).
  • The above equation defines a hyperplane through x0 and orthogonal
    to w (the line linking the means).
  • If P(wi) = P(wj), then x0 is halfway between the means.
    A numerical sketch follows below.
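A sketch of the case-1 boundary, computing w and x0 for two classes (σ², the means, the priors, and the test point are all assumed example values):

```python
import numpy as np

mu_i, mu_j = np.array([2.0, 0.0]), np.array([0.0, 0.0])   # class means (examples)
P_i, P_j = 0.6, 0.4                                        # priors (examples)
sigma2 = 1.0                                               # common variance sigma^2

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / np.dot(w, w)) * np.log(P_i / P_j) * w

# The boundary is the hyperplane w^t (x - x0) = 0; decide w_i on the side where it is > 0.
x = np.array([1.0, 0.5])
print("decide w_i" if w @ (x - x0) > 0 else "decide w_j")
```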

44
Case1 cont.
  • Figure: the 1-D and 2-D cases.

45
Case1 cont.
  • If the covariances of two distributions are equal and proportional
    to the identity matrix, then the distributions are spherical in
    d dimensions, and the boundary is a generalized hyperplane of
    d - 1 dimensions, perpendicular to the line linking the means.
  • If P(wi) is not equal to P(wj), the point x0 shifts away from the
    more likely mean.

46
Case1 cont.

  • Figure: the 1-D case.
47
Minimum Distance Classifier
  • As the priors are changed, the decision boundary shifts.
  • If all prior probabilities are the same, the optimum decision rule
    becomes:
  • measure the Euclidean distance ||x - μi|| from x to each of the c
    mean vectors,
  • and assign x to the class of the nearest mean. A sketch follows below.
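A sketch of the minimum-distance rule for equal priors (the class means and the test point are assumed examples):

```python
import numpy as np

means = np.array([[0.0, 0.0],      # mu_1
                  [3.0, 1.0],      # mu_2
                  [1.0, 4.0]])     # mu_3  (illustrative class means)
x = np.array([2.5, 1.5])

# Assign x to the class whose mean is nearest in Euclidean distance.
dists = np.linalg.norm(means - x, axis=1)
print("decide class", int(np.argmin(dists)) + 1)
```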

48
Discriminant Functions for the Normal Density
Case2. Common Covariance Matrices
  • Case 2: Σi = Σ.
  • The covariance matrices for all of the classes are identical but
    otherwise arbitrary.
  • Both |Σi| and the (d/2) ln 2π term are independent of i and can be
    ignored, leaving gi(x) = -(1/2)(x - μi)^t Σ^(-1) (x - μi) + ln P(wi).

49
Case2 cont.
  • If all prior probabilities are the same, the optimum decision rule
    becomes:
  • measure the squared Mahalanobis distance (x - μi)^t Σ^(-1) (x - μi)
  • from x to each of the c mean vectors,
  • and assign x to the class of the nearest mean.

50
Case2 cont.
  • Expanding the quadratic form (x - μi)^t Σ^(-1) (x - μi) and
    dropping the term x^t Σ^(-1) x, which is independent of i,
  • we obtain a linear classifier gi(x) = wi^t x + wi0,
  • where wi = Σ^(-1) μi and wi0 = -(1/2) μi^t Σ^(-1) μi + ln P(wi).
  • The decision boundaries are again hyperplanes, w^t (x - x0) = 0
    with w = Σ^(-1)(μi - μj); they are generally not orthogonal to the
    line linking the means. A sketch follows below.
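A sketch of the case-2 linear discriminant gi(x) = wi^t x + wi0 with a shared covariance matrix (all numbers are assumed examples):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # common covariance (example)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class means (examples)
priors = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

def g(i, x):
    """Linear discriminant for class i: w_i^t x + w_i0."""
    w_i = Sigma_inv @ means[i]
    w_i0 = -0.5 * means[i] @ Sigma_inv @ means[i] + np.log(priors[i])
    return w_i @ x + w_i0

x = np.array([1.5, 0.0])
print("decide class", max(range(len(means)), key=lambda i: g(i, x)) + 1)
```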

51
Discriminant Functions for the Normal Density
Case3. Arbitrary Class-Conditional Distributions
  • Case 3: Σi = arbitrary.
  • The discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0,
  • where Wi = -(1/2) Σi^(-1), wi = Σi^(-1) μi, and
    wi0 = -(1/2) μi^t Σi^(-1) μi - (1/2) ln |Σi| + ln P(wi).
  • The decision boundaries are hyperquadrics.

52
ERROR PROBABILITIES AND INTEGRALS
  • Consider the two-class problem and suppose that the feature space
    is divided into two regions, R1 and R2. There are two ways in which
    a classification error can occur:
  • An observation x falls in R2, and the true state is w1.
  • An observation x falls in R1, and the true state is w2.
    The corresponding integrals are reconstructed below.
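The error probability implied by these two cases can be written as the standard integrals below (a hedged reconstruction, since the original equation image is missing):

```latex
P(\text{error}) = P(x \in R_2,\ \omega_1) + P(x \in R_1,\ \omega_2)
               = \int_{R_2} p(x \mid \omega_1)\, P(\omega_1)\, dx
               + \int_{R_1} p(x \mid \omega_2)\, P(\omega_2)\, dx
```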

53
ERROR PROBABILITIES AND INTEGRALS cont.
54
ERROR PROBABILITIES AND INTEGRALS cont.
  • Because the decision point x is chosen arbitrarily here, the
    probability of error is not as small as it might be.
  • xB, the Bayes optimal decision boundary, gives the lowest
    probability of error.
  • In the multi-category case, there are more ways to be wrong than
    to be right, and it is simpler to compute the probability of being
    correct: P(correct) = Σ(i=1..c) ∫(Ri) p(x|wi) P(wi) dx.
  • This result depends neither on how the feature space is
    partitioned, nor on the form of the underlying distributions.
  • The Bayes classifier maximizes this probability, and no other
    partitioning can yield a smaller probability of error.