Lecture 5: Non-Parametric Estimation for Supervised Learning (Parzen Windows, KNN)

1
Lecture 5: Non-Parametric Estimation for
Supervised Learning (Parzen Windows, KNN)

2
Outline
  • Introduction
  • Density Estimation
  • Parzen Windows Estimation
  • Probabilistic Neural Network based on Parzen
    Window
  • K Nearest Neighbor Estimation
  • Nearest Neighbor for Classification
  • 1NN
  • KNN

3
Introduction
  • All classical parametric densities are unimodal
    (have a single peak), whereas many practical
    problems involve multi-modal densities
  • Nonparametric procedures can be used with
    arbitrary distributions and without the
    assumption that the forms of the underlying
    densities are known
  • There are two types of nonparametric methods:
  • Estimating the class-conditional density p(x | ωj)
  • Estimating the a-posteriori probability P(ωj | x)
    directly
  • Both amount to density estimation from samples,
    i.e. learning a density function from discrete samples

4
Density Estimation
  • Basic idea: given samples, estimate the class-conditional
    densities, i.e. recover a continuous density function
    from discrete samples (see the derivation sketch below)
  • p(x) is continuous
  • p(x) is approximately constant within the small region R
  • V is the volume enclosed by R
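
A worked version of this reasoning (a sketch following the standard argument in Chapter 4 of Duda, Hart, Stork; not reproduced from the slides themselves):

```latex
% Probability mass of the small region R, with p(x) approximately constant over R:
P \;=\; \int_{R} p(x')\,dx' \;\approx\; p(x)\,V
% If k of the n samples fall inside R, the fraction k/n estimates P, hence
p(x) \;\approx\; \frac{k/n}{V}
```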

5
  • How do we choose the right volume for density estimation?
  • A volume that is too big or too small is not good
    for density estimation
  • The choice depends on the availability of data samples
  • There are two popular methods for choosing volumes:
  • Fix the volume size (Parzen windows)
  • Fix the number of samples that fall in the volume
    (KNN); the volume is then data dependent

6
(No Transcript)
7
  • The volume V needs to approach 0 anyway if we
    want to use this estimate
  • Practically, V cannot be allowed to become too small,
    since the number of samples is always limited
  • One has to accept a certain amount of variance
    in the ratio k/n
  • Theoretically, if an unlimited number of samples
    were available, we could circumvent this difficulty
  • To estimate the density at x, we form a sequence
    of regions R1, R2, ... containing x: the first region
    contains one sample, the second two samples, and so on
  • Let Vn be the volume of Rn, kn the number of samples
    falling in Rn, and pn(x) the nth estimate of p(x):
  • pn(x) = (kn/n)/Vn    (7)    (a numeric sketch follows)
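
As a minimal numeric sketch of equation (7) (illustrative names, not from the slides): count how many of n one-dimensional samples fall inside an interval of width h centered at x and divide by n·h.

```python
import numpy as np

def density_estimate(x, samples, h):
    """Estimate p(x) as (k/n)/V with a fixed interval of width h centered at x."""
    n = len(samples)
    k = np.sum(np.abs(samples - x) < h / 2)  # samples falling inside the region R
    V = h                                    # volume of R in one dimension
    return (k / n) / V

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=1000)    # samples drawn from N(0, 1)
print(density_estimate(0.0, samples, h=0.5)) # close to 1/sqrt(2*pi) ~ 0.40
```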

8
  • Three necessary conditions must hold if we want
    pn(x) to converge to p(x):
  • lim Vn = 0, lim kn = ∞, and lim kn/n = 0 (as n → ∞)
  • There are two different ways of obtaining
    sequences of regions that satisfy these conditions
  • (a) Shrink an initial region, with Vn = 1/√n, and
    show that pn(x) converges to p(x).
    This is called the Parzen-window estimation method
  • (b) Specify kn as some function of n, such as
    kn = √n; the volume Vn is grown until it encloses
    kn neighbors of x.
    This is called the kn-nearest-neighbor estimation method

9
  • Conditions for convergence
  • The fraction k/(nV) is a space-averaged value of p(x);
    the true p(x) is obtained only if V approaches zero
  • If V shrinks to zero with a limited number of samples,
    eventually no samples are included in R and the estimate
    is p(x) ≈ 0: an uninteresting case!
  • If one or more samples happen to coincide with x,
    the estimate diverges: also an uninteresting case!

10
Parzen Windows Estimation
  • The Parzen-window approach to density estimation
    assumes that the region Rn is a d-dimensional
    hypercube with edge length hn (so Vn = hn^d)
  • φ((x - xi)/hn) is a unit window function: it equals 1
    when xi falls inside the hypercube of edge hn centered
    at x, and 0 otherwise
  • hn controls the kernel width: a smaller hn requires
    more samples, while a larger hn produces a smoother
    density function

11
  • The number of samples in this hypercube is

      kn = Σ(i=1..n) φ((x - xi)/hn)                 (10)

  • By substituting kn into equation (7), we obtain the
    following estimate:

      pn(x) = (1/n) Σ(i=1..n) (1/Vn) φ((x - xi)/hn) (11)

  • pn(x) estimates p(x) as an average of functions of x
    and the samples xi (i = 1, ..., n). These window
    functions φ can be quite general! (A code sketch follows.)
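
A minimal sketch of equation (11) using the hypercube window described on the previous slide (the function names are illustrative, and the hypercube is only one possible choice of φ):

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if u lies inside the unit hypercube centered at the origin, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n), with V_n = h^d."""
    n, d = samples.shape
    V = h ** d
    contributions = hypercube_window((x - samples) / h)
    return contributions.sum() / (n * V)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(500, 2))        # 2-D samples from a standard normal
print(parzen_estimate(np.zeros(2), data, h=0.8))  # estimate of p(0,0), roughly 1/(2*pi)
```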

12
Example 1: Parzen-Window Estimation for a Normal
Density p(x) ~ N(0,1)
  • Use the window function φ(u) = (1/√(2π)) exp(-u²/2)
  • and hn = h1/√n, where h1 is a user-chosen parameter (n > 1)
  • The estimate pn(x) = (1/n) Σi (1/hn) φ((x - xi)/hn)
    is an average of normal densities centered at the samples xi
  • n is the number of samples used for density estimation
  • The more samples used, the better the estimate
  • A small window width h1 sharpens the density estimate,
    but requires more samples (see the sketch below)
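
A sketch of Example 1 (Gaussian window with hn = h1/√n); the sample sizes and evaluation points below are illustrative:

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """Average of normal densities centered at the samples, width h_n = h1/sqrt(n)."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / hn            # shape (n, len(x))
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=0) / hn               # p_n(x) = (1/n) * sum_i phi(u_i) / h_n

rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 7)
for n in (1, 10, 100, 1000):                   # more samples -> better estimate
    samples = rng.normal(0.0, 1.0, size=n)
    print(n, np.round(gaussian_parzen(xs, samples, h1=1.0), 3))
```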

13
  • For n = 1 and h1 = 1, the estimate is a single window
    centered on the one sample
  • High bias due to small n
  • For n = 10 and h1 = 0.1, the contributions of the
    individual samples are clearly observable (see the
    figures on the next page)

14
(No Transcript)
15
  • Analogous results are also obtained in two
    dimensions

16
(No Transcript)
17
Example 2: Density estimation for a mixture of a
uniform and a triangle density
  • Case where p(x) = λ1·U(a,b) + λ2·T(c,d), a mixture
    with coefficients λ1 and λ2 (the true density is
    unknown to the estimator)

18
(No Transcript)
19
Parzen-Window Estimation for Classification
  • Classification example
  • We estimate the densities for each category and
    classify a test point by the label corresponding
    to the maximum posterior
  • The decision region of a Parzen-window classifier
    depends on the choice of window function, as
    illustrated in the following figure (a classifier
    sketch follows below)
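
A minimal sketch of this classification scheme, assuming a Gaussian window for the per-class Parzen estimates (class names, parameters, and data here are illustrative):

```python
import numpy as np

def gaussian_window_density(x, samples, h):
    """Parzen estimate of p(x | class) with a Gaussian window of width h."""
    d = samples.shape[1]
    u = (x - samples) / h
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.mean() / h ** d

def parzen_classify(x, class_samples, h):
    """Pick the class maximizing prior * class-conditional density."""
    n_total = sum(len(s) for s in class_samples.values())
    scores = {c: (len(s) / n_total) * gaussian_window_density(x, s, h)
              for c, s in class_samples.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(3)
train = {"w1": rng.normal(0.0, 1.0, size=(100, 2)),   # class omega_1 around (0, 0)
         "w2": rng.normal(3.0, 1.0, size=(100, 2))}   # class omega_2 around (3, 3)
print(parzen_classify(np.array([0.2, -0.1]), train, h=0.5))   # expected: "w1"
```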

20
(No Transcript)
21
Probabilistic Neural Networks
  • The PNN is based on Parzen estimation
  • Inputs are d-dimensional feature vectors
  • n training patterns
  • c classes
  • Three layers: input, (training) pattern, and
    category (output)

22
Training the network
  • Normalize each pattern x of the training set to
    unit length
  • Place the first training pattern on the input units
  • Set the weights linking the input units and the
    first pattern unit such that w1 = x1
  • Make a single connection from the first pattern
    unit to the category unit corresponding to the
    known class of that pattern
  • Repeat the process for all remaining training
    patterns, setting the weights such that wk = xk
    (k = 1, 2, ..., n)

23
Testing the network
  • Normalize the test pattern x and place it on the
    input units
  • Each pattern unit computes the inner product of its
    weight vector with x to yield its net activation, and
    emits a nonlinear function of that activation
  • Each output unit sums the contributions from all
    pattern units connected to it
  • Classify by selecting the category ωj with the
    maximum value of Pn(x | ωj) (j = 1, ..., c)
    (see the sketch below)
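
A minimal end-to-end sketch of the training and testing steps above, assuming the common choice of pattern-unit activation exp((net - 1)/σ²), which corresponds to a Gaussian Parzen window once patterns are normalized to unit length (σ and all names are illustrative):

```python
import numpy as np

def normalize(x):
    """Scale each pattern to unit length (required before training and testing)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pnn_train(patterns, labels):
    """Training just stores the normalized patterns as weights, one pattern unit each."""
    return normalize(patterns), np.asarray(labels)

def pnn_classify(x, weights, labels, sigma=0.5):
    """Each pattern unit emits exp((w.x - 1)/sigma^2); category units sum and vote."""
    net = weights @ normalize(x)                   # inner products (net activations)
    activations = np.exp((net - 1.0) / sigma ** 2)
    classes = np.unique(labels)
    sums = [activations[labels == c].sum() for c in classes]
    return classes[int(np.argmax(sums))]

w, y = pnn_train(np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]), ["w1", "w1", "w2"])
print(pnn_classify(np.array([0.8, 0.3]), w, y))    # expected: "w1"
```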

24
PNN summary
  • Advantages
  • Fast training and classification
  • Easy to add more training samples by adding more
    pattern nodes
  • Good for online applications
  • Much simpler than a backpropagation neural network
  • Disadvantages
  • High memory requirements when many training samples
    are used

25
K-Nearest Neighbor Estimation (KNN)
  • Goal: a solution to the problem of the unknown best
    window function
  • Let the cell volume be a function of the training data
  • Center a cell about x and let it grow until it
    captures kn samples (kn = f(n))
  • These kn samples are called the kn nearest neighbors of x
  • Two possibilities can occur:
  • If the density is high near x, the cell will be
    small, which provides good resolution
  • If the density is low, the cell will grow large,
    stopping only once it reaches regions of higher density
  • We can obtain a family of estimates by setting
    kn = k1√n and choosing different values for k1
    (see the sketch below)
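
A minimal sketch of the kn-nearest-neighbor estimate with kn = k1√n in one dimension; taking the volume as twice the distance to the kn-th nearest neighbor is one common convention, assumed here rather than taken from the slides:

```python
import numpy as np

def knn_density(x, samples, k1=1.0):
    """k_n-NN estimate: grow an interval around x until it holds k_n = k1*sqrt(n) samples."""
    n = len(samples)
    kn = max(1, int(round(k1 * np.sqrt(n))))
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[kn - 1]          # interval just wide enough to contain kn samples
    return (kn / n) / V

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=2000)
print(knn_density(0.0, samples))     # close to 1/sqrt(2*pi) ~ 0.40
```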

26
(No Transcript)
27
K-NN for Classification
  • Goal: estimate P(ωi | x) from a set of n labeled samples
  • Place a cell of volume V around x and let it capture
    k samples
  • If ki samples among the k turn out to be labeled ωi,
    the joint estimate is
  • pn(x, ωi) = (ki/n)/V
  • An estimate for Pn(ωi | x) is then
  • Pn(ωi | x) = pn(x, ωi) / Σj pn(x, ωj) = ki/k

28
  • ki/k is the fraction of the samples within the
    cell that are labeled ωi
  • For minimum error rate, the most frequently
    represented category within the cell is selected
  • If k is large and the cell is sufficiently small,
    the performance approaches the best possible

29
The 1-NN (Nearest Neighbor) Classifier
  • Let Dn = {x1, x2, ..., xn} be a set of n labeled
    prototypes
  • Let x′ ∈ Dn be the closest prototype to a test
    point x; the nearest-neighbor rule for classifying x
    is to assign it the label associated with x′
  • The nearest-neighbor rule leads to an error rate
    greater than the minimum possible, the Bayes rate
  • If the number of prototypes is large (unlimited),
    the error rate of the nearest-neighbor classifier is
    never worse than twice the Bayes rate (this can be
    proved!)
  • As n → ∞, it is always possible to find an x′
    sufficiently close to x that P(ωi | x′) ≈ P(ωi | x)

30
(No Transcript)
31
The KNN Rule
  • Goal: classify x by assigning it the label most
    frequently represented among its k nearest samples,
    i.e. use a voting scheme

32
  • Example
  • k = 3 (an odd value) and x = (0.10, 0.25)^t
  • The closest vectors to x, with their labels, are
  • (0.10, 0.28; ω2), (0.12, 0.20; ω2), (0.15, 0.35; ω1)
  • The voting scheme assigns the label ω2 to x,
    since ω2 is the most frequently represented
    (see the sketch below)
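
A sketch of the voting rule applied to the example above (1-NN is the special case k = 1); the label strings are illustrative stand-ins for ω1 and ω2:

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequently represented among its k nearest prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# The three closest prototypes from the example, with their labels
protos = np.array([[0.10, 0.28], [0.12, 0.20], [0.15, 0.35]])
labels = ["w2", "w2", "w1"]
print(knn_classify(np.array([0.10, 0.25]), protos, labels, k=3))   # expected: "w2"
```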

33
More on K-NN
  • The simplest classifier, often used as a baseline
    for performance comparison with more sophisticated
    classifiers
  • High computation cost, especially when the number
    of samples is large
  • Only became practical in the 1980s
  • Methods to improve efficiency:
  • NN editing
  • Vector quantization (VQ), developed in the early 1990s

34
Summary
  • Advantages of Parzen-window density estimation:
  • No assumption about the underlying distribution
  • A fully general density estimator
  • Based only on samples
  • High accuracy if enough samples are available
  • Disadvantages:
  • Requires many samples
  • High computation cost
  • Curse of dimensionality
  • How to choose the best window function?
  • KNN (k-nearest-neighbor) estimation addresses the
    last point by letting the data determine the cell size

35
Reading
  • Chapter 4, Pattern Classification by Duda, Hart,
    Stork, 2001, Sections 4.1-4.5