Lecture 5: Non-Parametric Estimation for Supervised Learning (Parzen Windows, KNN)

1
Lecture 5: Non-Parametric Estimation for
Supervised Learning (Parzen Windows, KNN)

2
Outline
  • Introduction
  • Density Estimation
  • Parzen Windows Estimation
  • Probabilistic Neural Network based on Parzen
    Window
  • K Nearest Neighbor Estimation
  • Nearest Neighbor for Classification
  • 1NN
  • KNN

3
Introduction
  • All classical parametric densities are unimodal
    (have a single peak), whereas many practical
    problems involve multi-modal densities
  • Nonparametric procedures can be used with
    arbitrary distributions and without the
    assumption that the forms of the underlying
    densities are known
  • There are two types of nonparametric methods:
  • Estimating the class-conditional density p(x | ωj)
  • Estimating the a-posteriori probability P(ωj | x)
    directly
  • Both amount to density estimation from samples,
    i.e. learning a density function from discrete samples

4
Density Estimation
  • Basic idea: given samples, estimate the class-conditional
    densities, i.e. recover a continuous density function
    from discrete samples (see the derivation sketch below)
  • p(x) is continuous
  • p(x) is approximately constant within the small region R
  • V is the volume enclosed by R
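
A worked version of this reasoning (a sketch following the standard argument in Chapter 4 of Duda, Hart, Stork; not reproduced from the slides themselves):

```latex
% Probability mass of the small region R, with p(x) approximately constant over R:
P \;=\; \int_{R} p(x')\,dx' \;\approx\; p(x)\,V
% If k of the n samples fall inside R, the fraction k/n estimates P, hence
p(x) \;\approx\; \frac{k/n}{V}
```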

5
  • How do we choose the right volume for density estimation?
  • A volume that is too big or too small is not good
    for density estimation
  • The choice depends on the availability of data samples
  • There are two popular methods for choosing volumes:
  • Fix the volume size (Parzen windows)
  • Fix the number of samples that fall in the volume
    (KNN); the volume is then data dependent

6
(No Transcript)
7
  • The volume V needs to approach 0 anyway if we
    want to use this estimate
  • Practically, V cannot be allowed to become too small,
    since the number of samples is always limited
  • One has to accept a certain amount of variance
    in the ratio k/n
  • Theoretically, if an unlimited number of samples
    were available, we could circumvent this difficulty
  • To estimate the density at x, we form a sequence
    of regions R1, R2, ... containing x: the first region
    contains one sample, the second two samples, and so on
  • Let Vn be the volume of Rn, kn the number of samples
    falling in Rn, and pn(x) the nth estimate of p(x):
  • pn(x) = (kn/n)/Vn    (7)    (a numeric sketch follows)
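
As a minimal numeric sketch of equation (7) (illustrative names, not from the slides): count how many of n one-dimensional samples fall inside an interval of width h centered at x and divide by n·h.

```python
import numpy as np

def density_estimate(x, samples, h):
    """Estimate p(x) as (k/n)/V with a fixed interval of width h centered at x."""
    n = len(samples)
    k = np.sum(np.abs(samples - x) < h / 2)  # samples falling inside the region R
    V = h                                    # volume of R in one dimension
    return (k / n) / V

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=1000)    # samples drawn from N(0, 1)
print(density_estimate(0.0, samples, h=0.5)) # close to 1/sqrt(2*pi) ~ 0.40
```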

8
  • Three necessary conditions must hold if we want
    pn(x) to converge to p(x):
  • lim Vn = 0, lim kn = ∞, and lim kn/n = 0 (as n → ∞)
  • There are two different ways of obtaining
    sequences of regions that satisfy these conditions
  • (a) Shrink an initial region, with Vn = 1/√n, and
    show that pn(x) converges to p(x).
    This is called the Parzen-window estimation method
  • (b) Specify kn as some function of n, such as
    kn = √n; the volume Vn is grown until it encloses
    kn neighbors of x.
    This is called the kn-nearest-neighbor estimation method

9
  • Conditions for convergence
  • The fraction k/(nV) is a space-averaged value of p(x);
    the true p(x) is obtained only if V approaches zero
  • If V shrinks to zero with a limited number of samples,
    eventually no samples are included in R and the estimate
    is p(x) ≈ 0: an uninteresting case!
  • If one or more samples happen to coincide with x,
    the estimate diverges: also an uninteresting case!

10
Parzen Windows Estimation
  • The Parzen-window approach to density estimation
    assumes that the region Rn is a d-dimensional
    hypercube with edge length hn (so Vn = hn^d)
  • φ((x - xi)/hn) is a unit window function: it equals 1
    when xi falls inside the hypercube of edge hn centered
    at x, and 0 otherwise
  • hn controls the kernel width: a smaller hn requires
    more samples, while a larger hn produces a smoother
    density function

11
  • The number of samples in this hypercube is

      kn = Σ(i=1..n) φ((x - xi)/hn)                 (10)

  • By substituting kn into equation (7), we obtain the
    following estimate:

      pn(x) = (1/n) Σ(i=1..n) (1/Vn) φ((x - xi)/hn) (11)

  • pn(x) estimates p(x) as an average of functions of x
    and the samples xi (i = 1, ..., n). These window
    functions φ can be quite general! (A code sketch follows.)
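
A minimal sketch of equation (11) using the hypercube window described on the previous slide (the function names are illustrative, and the hypercube is only one possible choice of φ):

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if u lies inside the unit hypercube centered at the origin, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n), with V_n = h^d."""
    n, d = samples.shape
    V = h ** d
    contributions = hypercube_window((x - samples) / h)
    return contributions.sum() / (n * V)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(500, 2))        # 2-D samples from a standard normal
print(parzen_estimate(np.zeros(2), data, h=0.8))  # estimate of p(0,0), roughly 1/(2*pi)
```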

12
Example 1: Parzen-Window Estimation for a Normal
Density p(x) ~ N(0,1)
  • Use the window function φ(u) = (1/√(2π)) exp(-u²/2)
  • and hn = h1/√n, where h1 is a user-chosen parameter (n > 1)
  • The estimate pn(x) = (1/n) Σi (1/hn) φ((x - xi)/hn)
    is an average of normal densities centered at the samples xi
  • n is the number of samples used for density estimation
  • The more samples used, the better the estimate
  • A small window width h1 sharpens the density estimate,
    but requires more samples (see the sketch below)
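
A sketch of Example 1 (Gaussian window with hn = h1/√n); the sample sizes and evaluation points below are illustrative:

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """Average of normal densities centered at the samples, width h_n = h1/sqrt(n)."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / hn            # shape (n, len(x))
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=0) / hn               # p_n(x) = (1/n) * sum_i phi(u_i) / h_n

rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 7)
for n in (1, 10, 100, 1000):                   # more samples -> better estimate
    samples = rng.normal(0.0, 1.0, size=n)
    print(n, np.round(gaussian_parzen(xs, samples, h1=1.0), 3))
```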

13
  • For n = 1 and h1 = 1, the estimate is a single window
    centered on the one sample
  • High bias due to small n
  • For n = 10 and h1 = 0.1, the contributions of the
    individual samples are clearly observable (see the
    figures on the next page)

14
(No Transcript)
15
  • Analogous results are also obtained in two
    dimensions

16
(No Transcript)
17
Example 2: Density estimation for a mixture of a
uniform and a triangle density
  • Case where p(x) = λ1·U(a,b) + λ2·T(c,d), a mixture
    with coefficients λ1 and λ2 (the true density is
    unknown to the estimator)

18
(No Transcript)
19
Parzen-Window Estimation for Classification
  • Classification example
  • We estimate the densities for each category and
    classify a test point by the label corresponding
    to the maximum posterior
  • The decision region of a Parzen-window classifier
    depends on the choice of window function, as
    illustrated in the following figure (a classifier
    sketch follows below)
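
A minimal sketch of this classification scheme, assuming a Gaussian window for the per-class Parzen estimates (class names, parameters, and data here are illustrative):

```python
import numpy as np

def gaussian_window_density(x, samples, h):
    """Parzen estimate of p(x | class) with a Gaussian window of width h."""
    d = samples.shape[1]
    u = (x - samples) / h
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.mean() / h ** d

def parzen_classify(x, class_samples, h):
    """Pick the class maximizing prior * class-conditional density."""
    n_total = sum(len(s) for s in class_samples.values())
    scores = {c: (len(s) / n_total) * gaussian_window_density(x, s, h)
              for c, s in class_samples.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(3)
train = {"w1": rng.normal(0.0, 1.0, size=(100, 2)),   # class omega_1 around (0, 0)
         "w2": rng.normal(3.0, 1.0, size=(100, 2))}   # class omega_2 around (3, 3)
print(parzen_classify(np.array([0.2, -0.1]), train, h=0.5))   # expected: "w1"
```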

20
(No Transcript)
21
Probabilistic Neural Networks
  • The PNN is based on Parzen estimation
  • Inputs are d-dimensional feature vectors
  • n training patterns
  • c classes
  • Three layers: input, (training) pattern, and
    category (output)

22
Training the network
  • Normalize each pattern x of the training set to
    unit length
  • Place the first training pattern on the input units
  • Set the weights linking the input units and the
    first pattern unit such that w1 = x1
  • Make a single connection from the first pattern
    unit to the category unit corresponding to the
    known class of that pattern
  • Repeat the process for all remaining training
    patterns, setting the weights such that wk = xk
    (k = 1, 2, ..., n)

23
Testing the network
  • Normalize the test pattern x and place it on the
    input units
  • Each pattern unit computes the inner product of its
    weight vector with x to yield its net activation, and
    emits a nonlinear function of that activation
  • Each output unit sums the contributions from all
    pattern units connected to it
  • Classify by selecting the category ωj with the
    maximum value of Pn(x | ωj) (j = 1, ..., c)
    (see the sketch below)
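
A minimal end-to-end sketch of the training and testing steps above, assuming the common choice of pattern-unit activation exp((net - 1)/σ²), which corresponds to a Gaussian Parzen window once patterns are normalized to unit length (σ and all names are illustrative):

```python
import numpy as np

def normalize(x):
    """Scale each pattern to unit length (required before training and testing)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pnn_train(patterns, labels):
    """Training just stores the normalized patterns as weights, one pattern unit each."""
    return normalize(patterns), np.asarray(labels)

def pnn_classify(x, weights, labels, sigma=0.5):
    """Each pattern unit emits exp((w.x - 1)/sigma^2); category units sum and vote."""
    net = weights @ normalize(x)                   # inner products (net activations)
    activations = np.exp((net - 1.0) / sigma ** 2)
    classes = np.unique(labels)
    sums = [activations[labels == c].sum() for c in classes]
    return classes[int(np.argmax(sums))]

w, y = pnn_train(np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]), ["w1", "w1", "w2"])
print(pnn_classify(np.array([0.8, 0.3]), w, y))    # expected: "w1"
```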

24
PNN summary
  • Advantages
  • Fast training and classification
  • Easy to add more training samples by adding more
    pattern nodes
  • Good for online applications
  • Much simpler than a backpropagation neural network
  • Disadvantages
  • High memory requirements when many training samples
    are used

25
K-Nearest Neighbor Estimation (KNN)
  • Goal: a solution to the problem of the unknown best
    window function
  • Let the cell volume be a function of the training data
  • Center a cell about x and let it grow until it
    captures kn samples (kn = f(n))
  • These kn samples are called the kn nearest neighbors of x
  • Two possibilities can occur:
  • If the density is high near x, the cell will be
    small, which provides good resolution
  • If the density is low, the cell will grow large,
    stopping only once it reaches regions of higher density
  • We can obtain a family of estimates by setting
    kn = k1√n and choosing different values for k1
    (see the sketch below)
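
A minimal sketch of the kn-nearest-neighbor estimate with kn = k1√n in one dimension; taking the volume as twice the distance to the kn-th nearest neighbor is one common convention, assumed here rather than taken from the slides:

```python
import numpy as np

def knn_density(x, samples, k1=1.0):
    """k_n-NN estimate: grow an interval around x until it holds k_n = k1*sqrt(n) samples."""
    n = len(samples)
    kn = max(1, int(round(k1 * np.sqrt(n))))
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[kn - 1]          # interval just wide enough to contain kn samples
    return (kn / n) / V

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=2000)
print(knn_density(0.0, samples))     # close to 1/sqrt(2*pi) ~ 0.40
```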

26
(No Transcript)
27
K-NN for Classification
  • Goal: estimate P(ωi | x) from a set of n labeled samples
  • Place a cell of volume V around x and let it capture
    k samples
  • If ki samples among the k turn out to be labeled ωi,
    the joint estimate is
  • pn(x, ωi) = (ki/n)/V
  • An estimate for Pn(ωi | x) is then
  • Pn(ωi | x) = pn(x, ωi) / Σj pn(x, ωj) = ki/k

28
  • ki/k is the fraction of the samples within the
    cell that are labeled ωi
  • For minimum error rate, the most frequently
    represented category within the cell is selected
  • If k is large and the cell is sufficiently small,
    the performance approaches the best possible

29
The 1-NN (Nearest Neighbor) Classifier
  • Let Dn = {x1, x2, ..., xn} be a set of n labeled
    prototypes
  • Let x′ ∈ Dn be the closest prototype to a test
    point x; the nearest-neighbor rule for classifying x
    is to assign it the label associated with x′
  • The nearest-neighbor rule leads to an error rate
    greater than the minimum possible, the Bayes rate
  • If the number of prototypes is large (unlimited),
    the error rate of the nearest-neighbor classifier is
    never worse than twice the Bayes rate (this can be
    proved!)
  • As n → ∞, it is always possible to find an x′
    sufficiently close to x that P(ωi | x′) ≈ P(ωi | x)

30
(No Transcript)
31
The KNN Rule
  • Goal: classify x by assigning it the label most
    frequently represented among its k nearest samples,
    i.e. use a voting scheme

32
  • Example
  • k = 3 (an odd value) and x = (0.10, 0.25)^t
  • The closest vectors to x, with their labels, are
  • (0.10, 0.28; ω2), (0.12, 0.20; ω2), (0.15, 0.35; ω1)
  • The voting scheme assigns the label ω2 to x,
    since ω2 is the most frequently represented
    (see the sketch below)
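
A sketch of the voting rule applied to the example above (1-NN is the special case k = 1); the label strings are illustrative stand-ins for ω1 and ω2:

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequently represented among its k nearest prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# The three closest prototypes from the example, with their labels
protos = np.array([[0.10, 0.28], [0.12, 0.20], [0.15, 0.35]])
labels = ["w2", "w2", "w1"]
print(knn_classify(np.array([0.10, 0.25]), protos, labels, k=3))   # expected: "w2"
```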

33
More on K-NN
  • The simplest classifier, often used as a baseline
    for performance comparison with more sophisticated
    classifiers
  • High computation cost, especially when the number
    of samples is large
  • Only became practical in the 1980s
  • Methods to improve efficiency:
  • NN editing
  • Vector quantization (VQ), developed in the early 1990s

34
Summary
  • Advantages of Parzen-window density estimation:
  • No assumption about the underlying distribution
  • A fully general density estimator
  • Based only on samples
  • High accuracy if enough samples are available
  • Disadvantages:
  • Requires many samples
  • High computation cost
  • Curse of dimensionality
  • How to choose the best window function?
  • KNN (k-nearest-neighbor) estimation addresses the
    last point by letting the data determine the cell size

35
Reading
  • Chapter 4, Pattern Classification by Duda, Hart,
    Stork, 2001, Sections 4.1-4.5