Title: Nearest Neighbor Methods
1 Nearest Neighbor Methods
- Similar outputs for similar inputs
2 Nearest neighbor classification - The statistical approach
- Assume you have known classifications of a set of exemplars
- NN classification assumes that the best guess at the classification of an unknown item is the classification of its nearest neighbors (those known items that are most similar to the unknown item).
- k-NN methods use the k nearest neighbors for that guess.
- For example, choose the most common class of its neighbors
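A minimal sketch (my illustration, not from the slides) of k-NN majority-vote classification; the toy exemplars, labels, and choice of k are made up:

```python
import numpy as np
from collections import Counter

def knn_classify(exemplars, labels, query, k=3):
    """Return the most common class among the k exemplars closest to `query`."""
    distances = np.linalg.norm(exemplars - query, axis=1)  # Euclidean distance to each exemplar
    nearest = np.argsort(distances)[:k]                    # indices of the k nearest neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy data: two clusters with known classifications.
exemplars = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(exemplars, labels, np.array([0.8, 0.9]), k=3))  # -> "B"
```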
3 Issues
- How large to make the neighborhood?
- Which neighbors are nearest?
- In other words, how do we measure distance?
- How to apply to continuous outputs?
4 Example
5 Example
6 Number of neighbors
- Big neighborhoods produce smoother categorization boundaries.
- Small neighborhoods can produce overfitting.
- But, small neighborhoods can make finer distinctions
- Solution - cross-validate to find the right size
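A hedged sketch of choosing k by leave-one-out cross-validation (the candidate k values and helper names are my own): each exemplar is classified using the remaining exemplars, and the k with the highest accuracy wins.

```python
import numpy as np
from collections import Counter

def loo_accuracy(exemplars, labels, k):
    """Leave-one-out accuracy of k-NN on the training exemplars."""
    correct = 0
    for i in range(len(exemplars)):
        train_x = np.delete(exemplars, i, axis=0)     # hold out exemplar i
        train_y = labels[:i] + labels[i + 1:]
        d = np.linalg.norm(train_x - exemplars[i], axis=1)
        nearest = np.argsort(d)[:k]
        guess = Counter(train_y[j] for j in nearest).most_common(1)[0][0]
        correct += (guess == labels[i])
    return correct / len(exemplars)

def choose_k(exemplars, labels, candidates=(1, 3, 5)):
    """Pick the neighborhood size with the best leave-one-out accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(exemplars, labels, k))
```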
7 How to measure distance?
- Various metrics
- City-block
- Euclidean
- Squared Euclidean
- Chebychev
- Will discuss some of these later with RBFs
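Hedged one-line implementations of the metrics listed above; the function names are mine.

```python
import numpy as np

def city_block(a, b):         # sum of absolute differences (L1)
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def euclidean(a, b):          # square root of the sum of squared differences (L2)
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def squared_euclidean(a, b):  # Euclidean without the square root
    return np.sum((np.asarray(a) - np.asarray(b)) ** 2)

def chebychev(a, b):          # largest single-dimension difference (L-infinity)
    return np.max(np.abs(np.asarray(a) - np.asarray(b)))
```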
8 How to apply the nearest neighbor method to continuous outputs?
- Simple method - average the outputs of the k nearest neighbors
- Complex method - weighted average of the outputs of the k nearest neighbors
- Weighted based on the distance of each neighbor
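A sketch (not from the lecture) of both methods for continuous outputs: a plain average and a distance-weighted average of the k nearest neighbors' outputs.

```python
import numpy as np

def knn_regress(exemplars, outputs, query, k=3, weighted=False):
    """Predict a continuous output from the k nearest exemplars."""
    outputs = np.asarray(outputs, float)
    d = np.linalg.norm(exemplars - query, axis=1)
    nearest = np.argsort(d)[:k]
    if not weighted:
        return np.mean(outputs[nearest])              # simple method: plain average
    w = 1.0 / (d[nearest] + 1e-12)                    # closer neighbors count more
    return np.sum(w * outputs[nearest]) / np.sum(w)   # complex method: weighted average
```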
9 Advantages and Disadvantages
- Advantage
- Nonparametric - assumes nothing about shape of boundaries in classification.
- Disadvantages
- Does not incorporate domain knowledge (if it exists).
- Irrelevant features have a negative impact on the distance metric (the curse of dimensionality).
- Exceptions or errors in the training set can have too much influence on the fit.
- Computation-intensive to compute all distances (good algorithms exist, though).
- Memory-intensive - must store the original exemplars.
10 Hybrid networks
- Issue
- Time to train
- Backpropagation - the problem with networks with hidden units is that learning the right hidden unit representation takes a long time.
- Note - the previous discussion of nearest-neighbor methods becomes relevant in a particular type of hybrid network - RBF nets
11 Counterpropagation networks
- A form of hybrid network
- Includes supervised and unsupervised learning
- Hidden unit representations are formed by unsupervised learning.
- Mapping hidden unit representations to network output is accomplished using supervised learning
- A single layer can use the delta rule
- Multiple layers could use backprop
- (At least one layer was made faster using unsupervised learning!)
- The unsupervised portion of the network attempts to find the structure in the input.
- It's a preprocessor
- Could use competitive learning, or other forms
- Some hybrid networks (including some RBFs) don't use unsupervised learning at all - the mapping from input to hidden unit representations is predetermined.
12 Disadvantage of the hybrid network approach
- The hidden unit representation is independent of the task to be learned
- Its structure is not optimized for the task
- Similar inputs are mapped to similar hidden unit representations - this could produce problems for networks that must make fine distinctions.
13 Basis function networks - Details
- Consider a network with a set of m hidden units, each of which has its own transfer function, hj, for how it reacts to its input
- These m functions are the basis functions of the network.
- The weights are learned using the delta rule
- How you define these basis functions determines the behavior of the network.
14 Example 1
- 2 hidden units, one input x, one output y.
- h1 = 1, h2 = x
- What does this compute?
- Simple linear regression
- Weight for h1 is the intercept
- Weight for h2 is the slope
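A minimal sketch of Example 1 as a basis-function network trained with the delta rule; the learning rate and toy data are assumptions of mine.

```python
import numpy as np

basis = [lambda x: 1.0, lambda x: x]              # h1 = 1, h2 = x

def train_delta(xs, ys, epochs=2000, lr=0.01):
    """Learn one weight per basis function with the delta (LMS) rule."""
    w = np.zeros(len(basis))
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h = np.array([b(x) for b in basis])   # hidden-unit activations
            error = y - np.dot(w, h)              # delta rule: w += lr * error * h
            w += lr * error * h
    return w

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 + 0.5 * xs                               # true intercept 2.0, slope 0.5
print(train_delta(xs, ys))                        # weights approach [2.0, 0.5]
```

Swapping in the basis sets from Examples 2-4 (adding x^2, more input dimensions, or sigmoids) changes only the `basis` list; the delta-rule training loop stays the same.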
15 Example 2
- 3 hidden units, one input x, one output y.
- h1 = 1, h2 = x, h3 = x^2
- A form of polynomial regression - fitting a quadratic
16 Example 3
- m hidden units, m - 1 inputs X, one output y.
- h1 = X1, h2 = X2, ..., hm-1 = Xm-1, hm = 1
- Multiple regression
17 Aside - unlearned basis functions
- There is no learning of which functions are the most appropriate in these examples
- The models that I'll be discussing (e.g., ALCOVE) are almost exclusively static models (no unsupervised learning)
- This means that it's important to choose a good set of basis functions to begin with
- It also means that the learning is essentially single layer - FAST learning.
18 Example 4
- The basis functions can also involve sigmoids (aka squashing or logistic functions)
- This is a simple, single-layer neural network
19 Radial basis functions are a special class of hidden unit functions
- Their response decreases monotonically with distance from a central point
- This functions like a receptive field for each hidden unit
- Each hidden unit has a center
- The center is the input pattern that maximally activates the unit.
- Activity level decreases as the input pattern grows increasingly dissimilar to the hidden unit's center.
20 RBF Parameters
- The locations of the centers for each unit (hj), the shape of the RBFs, and the width of the RBFs are all parameters of the model.
- These parameters are fixed in advance if there is no unsupervised learning.
- One or more of these parameters can change by using unsupervised learning methods.
- The precise shape of the function depends on a distance/similarity metric - how close is the input pattern to the hidden unit's desired input pattern?
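A hedged sketch of a single RBF hidden unit with a center and a width; the exponential fall-off matches the similarity function introduced below (with width playing the role of 1/c), and the numbers are made up.

```python
import numpy as np

def rbf_activation(input_pattern, center, width=1.0):
    """Activity is maximal at the center and decreases with distance from it."""
    distance = np.linalg.norm(np.asarray(input_pattern) - np.asarray(center))
    return np.exp(-distance / width)   # width controls how quickly activity falls off

print(rbf_activation([0.0, 0.0], center=[0.0, 0.0]))  # 1.0 at the center
print(rbf_activation([1.0, 1.0], center=[0.0, 0.0]))  # smaller for a dissimilar input
```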
21 Distance/similarity can be measured in many ways
- Euclidean
- City-block
- More generally - the Minkowski metric
- The Minkowski metric is readily extended to many input dimensions.
22 Minkowski metric
- d(hj, a) = [ sum over i of |hji - ai|^r ]^(1/r)
- When r = 1, city block; when r = 2, Euclidean
23 Example of city block
- For r = 1
- hj = (0,0), a = (1,1)
- For i = 1, |0 - 1|^1 = 1
- For i = 2, |0 - 1|^1 = 1
- Add those two and raise to the power 1/1: (1 + 1)^1 = 2
24 Example - Euclidean distance
- For r = 2
- hj = (0,0), a = (1,1)
- For i = 1, |0 - 1|^2 = 1
- For i = 2, |0 - 1|^2 = 1
- Add those two and raise to the 1/2 power (square root): sqrt(1 + 1) = 1.414
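A short sketch of the Minkowski metric that reproduces the two worked examples above; the function name is mine.

```python
import numpy as np

def minkowski(h, a, r):
    """Sum |h_i - a_i|^r over dimensions, then take the 1/r power."""
    h, a = np.asarray(h, float), np.asarray(a, float)
    return np.sum(np.abs(h - a) ** r) ** (1.0 / r)

print(minkowski([0, 0], [1, 1], r=1))   # 2.0   (city block)
print(minkowski([0, 0], [1, 1], r=2))   # 1.414 (Euclidean)
```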
25 Turning distance into similarity
- Desired properties
- Similarity = f(distance).
- Maximal similarity when distance = 0.
- A function with these properties:
- sij = e^(-c * dij)
- When dij = 0, sij = 1
- When dij = 1 and c = 1, sij = e^(-1) = 1/e
- This function defines the generalization gradient
- c defines the width of the receptive field
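A small sketch of the distance-to-similarity mapping s = e^(-c * d), checking the two values worked out above.

```python
import numpy as np

def similarity(distance, c=1.0):
    """Exponential generalization gradient: 1.0 at distance 0, falling toward 0."""
    return np.exp(-c * distance)

print(similarity(0.0))         # 1.0
print(similarity(1.0, c=1.0))  # about 0.368 (= 1/e)
```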
26 Shape of receptive fields
27 RBF issues - conceptual
- RBF networks pave the representational space with a set of hidden units/receptive fields, each of which is maximally active for a particular input pattern.
- When the receptive fields overlap, you have a distributed representation (any given unit participates in representing multiple input patterns)
- When the receptive fields do not overlap, you have a highly localized representation
- The degree of overlap is a modifiable parameter and allows you to decrease the localization of your hidden unit representations
28 Catastrophic retroactive interference
- Networks with sigmoids divide up representational space into halves
- This produces a highly distributed representation in which each unit participates in most internal representations
- RBFs have more localized receptive fields.
- Interference occurs only when new patterns are very similar to ones with which the network has already been trained.
29 Design issue - Where to put the RBF centers?
- Could be random or evenly distributed.
- Some models (e.g., ALCOVE) place the RBF centers where the training exemplars are.
30 Design issue - How to move centers?
- Vector quantization approaches
- Example - winner take all
- The closest hidden unit moves towards the input pattern
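A hedged sketch of one common winner-take-all update (the learning rate is an assumption of mine): only the closest center moves toward the presented input pattern.

```python
import numpy as np

def winner_take_all_update(centers, input_pattern, lr=0.1):
    """Move the single closest center a fraction lr of the way toward the input."""
    centers = np.asarray(centers, float)
    x = np.asarray(input_pattern, float)
    winner = np.argmin(np.linalg.norm(centers - x, axis=1))   # closest center wins
    centers[winner] += lr * (x - centers[winner])              # move it toward the input
    return centers
```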
31 Design issue - How to determine the width of the receptive fields?
- Usually ad hoc.
- Commonly chosen based on the average distance between training exemplars.
- Some learning algorithms have been proposed (see Moody and Darken's work)
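A sketch of the ad hoc width choice mentioned above - setting every receptive field width to the average pairwise distance between training exemplars.

```python
import numpy as np

def average_pairwise_width(exemplars):
    """Average Euclidean distance over all pairs of training exemplars."""
    exemplars = np.asarray(exemplars, float)
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(exemplars)
             for b in exemplars[i + 1:]]
    return np.mean(dists)
```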
32 Comparison of RBF to nearest neighbor methods
- Nearest neighbor methods are exemplar-based
- RBFs are prototype-based
- The network doesn't require storage of every training exemplar - it summarizes the data.
- RBF and nearest neighbor are nearly identical when RBFs are located at each and every training exemplar.
33 Comparison of RBF to MLP
- Behavior of hidden units
- MLP hidden units use a weighted linear summation of the input transformed by a sigmoid (can be Gaussian, though)
- RBF hidden units use the distance to a prototype vector followed by transformation by a localized function
- Local vs. distributed
- MLP hidden units form a strongly distributed representation (exception - Gaussian activation functions)
- RBF hidden unit representations are more localized
- Learning
- MLP - all of the parameters (weights) are determined at the same time as part of a single global training strategy
- RBF - training of the hidden unit representations is decoupled from that of the output units.
34 Advantages of RBFs
- Learn quickly
- Can use unsupervised learning to learn the hidden unit representations
- Are not subject to catastrophic retroactive interference
- Are not overly sensitive to linear boundaries (Kruschke, 1993)
- Example - the XOR problem
- The receptive field notion is more biological
35 Disadvantages of RBFs
- Hidden unit representations are general, not specific to the problem to be learned
- Makes it difficult to map similar input representations to different outputs.
- May not quite achieve the accuracy of a backprop network, but it gets close quickly!
- Extrapolation - localized receptive fields respond weakly outside the region covered by the training exemplars.
36 Kruschke's ALCOVE model
- Attention Learning COVEring map
- A connectionist implementation of Nosofsky's GCM
- Category learning model
- An RBF network that uses the Minkowski metric
- Includes a c parameter, which he calls specificity
- This parameter determines the fixed width of the RBFs
37 Attention learning
- Includes attentional weights and attention learning to determine those attentional weights
- Attentional weights are learned via backpropagation
- Makes for slow learning of attention weights
- Recent Kruschke models acquire attention weights without using backprop.
- Attentional weights serve to stretch the representational space
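A minimal sketch of an ALCOVE-style hidden-unit activation, assuming the common r = 1 (city-block) form described in Kruschke (1992); the variable names are mine. Larger attention weights stretch the corresponding dimension, so mismatches on that dimension reduce similarity more.

```python
import numpy as np

def alcove_hidden_activation(stimulus, exemplar, attention, c=1.0):
    """Attention-weighted city-block distance to the exemplar, passed through
    the exponential similarity function with specificity c."""
    stimulus, exemplar, attention = map(np.asarray, (stimulus, exemplar, attention))
    weighted_distance = np.sum(attention * np.abs(exemplar - stimulus))
    return np.exp(-c * weighted_distance)

# With more attention on dimension 0, a mismatch there hurts similarity more.
print(alcove_hidden_activation([1, 0], [0, 0], attention=[2.0, 0.5]))  # exp(-2.0)
print(alcove_hidden_activation([0, 1], [0, 0], attention=[2.0, 0.5]))  # exp(-0.5)
```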
38 Other applications of RBFs to psychology
- Function learning
- DeLosh, Busemeyer, and McDaniel's EXAM (extrapolation-association model)
- Object recognition
- Edelman's model
- Recognizing handwriting
- Lee, Yuchun (1991)
- Compares k nearest-neighbor, radial-basis function, and backpropagation neural networks.
- Sonar discrimination by dolphins
- Au, Andersen, Rasmussen, and Roitblat (1995)