Title: Nearest Neighbor Methods
1 Nearest Neighbor Methods
- Similar outputs for similar inputs
2 Nearest neighbor classification - The statistical approach
- Assume you have known classifications of a set of exemplars
- NN classification assumes that the best guess at the classification of an unknown item is the classification of its nearest neighbors (those known items that are most similar to the unknown item).
- k-NN methods use the k nearest neighbors for that guess.
- For example, choose the most common class of its neighbors
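A minimal sketch (my illustration, not from the slides) of k-NN majority-vote classification; the toy exemplars, labels, and choice of k are made up:

```python
import numpy as np
from collections import Counter

def knn_classify(exemplars, labels, query, k=3):
    """Return the most common class among the k exemplars closest to `query`."""
    distances = np.linalg.norm(exemplars - query, axis=1)  # Euclidean distance to each exemplar
    nearest = np.argsort(distances)[:k]                    # indices of the k nearest neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy data: two clusters with known classifications.
exemplars = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(exemplars, labels, np.array([0.8, 0.9]), k=3))  # -> "B"
```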
3 Issues
- How large to make the neighborhood?
- Which neighbors are nearest?
- In other words, how do we measure distance?
- How to apply to continuous outputs?
4 Example
5 Example
6 Number of neighbors
- Big neighborhoods produce smoother categorization boundaries.
- Small neighborhoods can produce overfitting.
- But, small neighborhoods can make finer distinctions
- Solution - cross-validate to find the right size
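A hedged sketch of choosing k by leave-one-out cross-validation (the candidate k values and helper names are my own): each exemplar is classified using the remaining exemplars, and the k with the highest accuracy wins.

```python
import numpy as np
from collections import Counter

def loo_accuracy(exemplars, labels, k):
    """Leave-one-out accuracy of k-NN on the training exemplars."""
    correct = 0
    for i in range(len(exemplars)):
        train_x = np.delete(exemplars, i, axis=0)     # hold out exemplar i
        train_y = labels[:i] + labels[i + 1:]
        d = np.linalg.norm(train_x - exemplars[i], axis=1)
        nearest = np.argsort(d)[:k]
        guess = Counter(train_y[j] for j in nearest).most_common(1)[0][0]
        correct += (guess == labels[i])
    return correct / len(exemplars)

def choose_k(exemplars, labels, candidates=(1, 3, 5)):
    """Pick the neighborhood size with the best leave-one-out accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(exemplars, labels, k))
```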
7 How to measure distance?
- Various metrics
- City-block
- Euclidean
- Squared Euclidean
- Chebychev
- Will discuss some of these later with RBFs
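Hedged one-line implementations of the metrics listed above; the function names are mine.

```python
import numpy as np

def city_block(a, b):         # sum of absolute differences (L1)
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def euclidean(a, b):          # square root of the sum of squared differences (L2)
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def squared_euclidean(a, b):  # Euclidean without the square root
    return np.sum((np.asarray(a) - np.asarray(b)) ** 2)

def chebychev(a, b):          # largest single-dimension difference (L-infinity)
    return np.max(np.abs(np.asarray(a) - np.asarray(b)))
```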
8 How to apply the nearest neighbor method to continuous outputs?
- Simple method - average the outputs of the k nearest neighbors
- Complex method - weighted average of the outputs of the k nearest neighbors
- Weighted based on the distance of each neighbor
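A sketch (not from the lecture) of both methods for continuous outputs: a plain average and a distance-weighted average of the k nearest neighbors' outputs.

```python
import numpy as np

def knn_regress(exemplars, outputs, query, k=3, weighted=False):
    """Predict a continuous output from the k nearest exemplars."""
    outputs = np.asarray(outputs, float)
    d = np.linalg.norm(exemplars - query, axis=1)
    nearest = np.argsort(d)[:k]
    if not weighted:
        return np.mean(outputs[nearest])              # simple method: plain average
    w = 1.0 / (d[nearest] + 1e-12)                    # closer neighbors count more
    return np.sum(w * outputs[nearest]) / np.sum(w)   # complex method: weighted average
```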
9 Advantages and Disadvantages
- Advantage
- Nonparametric - assumes nothing about shape of boundaries in classification.
- Disadvantages
- Does not incorporate domain knowledge (if it exists).
- Irrelevant features have a negative impact on the distance metric (the curse of dimensionality).
- Exceptions or errors in the training set can have too much influence on the fit.
- Computation-intensive to compute all distances (good algorithms exist, though).
- Memory-intensive - must store the original exemplars.
10 Hybrid networks
- Issue
- Time to train
- Backpropagation - the problem with networks with hidden units is that learning the right hidden unit representation takes a long time.
- Note - the previous discussion of nearest-neighbor methods becomes relevant in a particular type of hybrid network - RBF nets
11 Counterpropagation networks
- A form of hybrid network
- Includes supervised and unsupervised learning
- Hidden unit representations are formed by unsupervised learning.
- Mapping hidden unit representations to network output is accomplished using supervised learning
- A single layer can use the delta rule
- Multiple layers could use backprop
- (At least one layer was made faster using unsupervised learning!)
- The unsupervised portion of the network attempts to find the structure in the input.
- It's a preprocessor
- Could use competitive learning, or other forms
- Some hybrid networks (including some RBFs) don't use unsupervised learning at all - the mapping from input to hidden unit representations is predetermined.
12 Disadvantage of the hybrid network approach
- The hidden unit representation is independent of the task to be learned
- Its structure is not optimized for the task
- Similar inputs are mapped to similar hidden unit representations - this could produce problems for networks that must make fine distinctions.
13 Basis function networks - Details
- Consider a network with a set of m hidden units, each of which has its own transfer function, hj, for how it reacts to its input
- These m functions are the basis functions of the network.
- The weights are learned using the delta rule
- How you define these basis functions determines the behavior of the network.
14 Example 1
- 2 hidden units, one input x, one output y.
- h1 = 1, h2 = x
- What does this compute?
- Simple linear regression
- Weight for h1 is the intercept
- Weight for h2 is the slope
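A minimal sketch of Example 1 as a basis-function network trained with the delta rule; the learning rate and toy data are assumptions of mine.

```python
import numpy as np

basis = [lambda x: 1.0, lambda x: x]              # h1 = 1, h2 = x

def train_delta(xs, ys, epochs=2000, lr=0.01):
    """Learn one weight per basis function with the delta (LMS) rule."""
    w = np.zeros(len(basis))
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h = np.array([b(x) for b in basis])   # hidden-unit activations
            error = y - np.dot(w, h)              # delta rule: w += lr * error * h
            w += lr * error * h
    return w

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 + 0.5 * xs                               # true intercept 2.0, slope 0.5
print(train_delta(xs, ys))                        # weights approach [2.0, 0.5]
```

Swapping in the basis sets from Examples 2-4 (adding x^2, more input dimensions, or sigmoids) changes only the `basis` list; the delta-rule training loop stays the same.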
15 Example 2
- 3 hidden units, one input x, one output y.
- h1 = 1, h2 = x, h3 = x^2
- A form of polynomial regression - fitting a quadratic
16 Example 3
- m hidden units, m - 1 inputs X, one output y.
- h1 = X1, h2 = X2, ..., hm-1 = Xm-1, hm = 1
- Multiple regression
17 Aside - unlearned basis functions
- There is no learning of which functions are the most appropriate in these examples
- The models that I'll be discussing (e.g., ALCOVE) are almost exclusively static models (no unsupervised learning)
- This means that it's important to choose a good set of basis functions to begin with
- It also means that the learning is essentially single layer - FAST learning.
18 Example 4
- The basis functions can also involve sigmoids (aka squashing or logistic functions)
- This is a simple, single-layer neural network
19 Radial basis functions are a special class of hidden unit functions
- Their response decreases monotonically with distance from a central point
- This functions like a receptive field for each hidden unit
- Each hidden unit has a center
- The center is the input pattern that maximally activates the unit.
- Activity level decreases as the input pattern grows increasingly dissimilar to the hidden unit's center.
20 RBF Parameters
- The locations of the centers for each unit (hj), the shape of the RBFs, and the width of the RBFs are all parameters of the model.
- These parameters are fixed in advance if there is no unsupervised learning.
- One or more of these parameters can change by using unsupervised learning methods.
- The precise shape of the function depends on a distance/similarity metric - how close is the input pattern to the hidden unit's desired input pattern?
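A hedged sketch of a single RBF hidden unit with a center and a width; the exponential fall-off matches the similarity function introduced below (with width playing the role of 1/c), and the numbers are made up.

```python
import numpy as np

def rbf_activation(input_pattern, center, width=1.0):
    """Activity is maximal at the center and decreases with distance from it."""
    distance = np.linalg.norm(np.asarray(input_pattern) - np.asarray(center))
    return np.exp(-distance / width)   # width controls how quickly activity falls off

print(rbf_activation([0.0, 0.0], center=[0.0, 0.0]))  # 1.0 at the center
print(rbf_activation([1.0, 1.0], center=[0.0, 0.0]))  # smaller for a dissimilar input
```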
21 Distance/similarity can be measured in many ways
- Euclidean
- City-block
- More generally - the Minkowski metric
- The Minkowski metric is readily extended to many input dimensions.
22 Minkowski metric
- d(hj, a) = [ sum over i of |hji - ai|^r ]^(1/r)
- When r = 1, city block; when r = 2, Euclidean
23 Example of city block
- For r = 1
- hj = (0,0), a = (1,1)
- For i = 1, |0 - 1|^1 = 1
- For i = 2, |0 - 1|^1 = 1
- Add those two and raise to the power 1/1: (1 + 1)^1 = 2
24 Example - Euclidean distance
- For r = 2
- hj = (0,0), a = (1,1)
- For i = 1, |0 - 1|^2 = 1
- For i = 2, |0 - 1|^2 = 1
- Add those two and raise to the 1/2 power (square root): sqrt(1 + 1) = 1.414
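A short sketch of the Minkowski metric that reproduces the two worked examples above; the function name is mine.

```python
import numpy as np

def minkowski(h, a, r):
    """Sum |h_i - a_i|^r over dimensions, then take the 1/r power."""
    h, a = np.asarray(h, float), np.asarray(a, float)
    return np.sum(np.abs(h - a) ** r) ** (1.0 / r)

print(minkowski([0, 0], [1, 1], r=1))   # 2.0   (city block)
print(minkowski([0, 0], [1, 1], r=2))   # 1.414 (Euclidean)
```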
25 Turning distance into similarity
- Desired properties
- Similarity = f(distance).
- Maximal similarity when distance = 0.
- A function with these properties:
- sij = e^(-c * dij)
- When dij = 0, sij = 1
- When dij = 1 and c = 1, sij = e^(-1) = 1/e
- This function defines the generalization gradient
- c defines the width of the receptive field
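A small sketch of the distance-to-similarity mapping s = e^(-c * d), checking the two values worked out above.

```python
import numpy as np

def similarity(distance, c=1.0):
    """Exponential generalization gradient: 1.0 at distance 0, falling toward 0."""
    return np.exp(-c * distance)

print(similarity(0.0))         # 1.0
print(similarity(1.0, c=1.0))  # about 0.368 (= 1/e)
```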
26 Shape of receptive fields
27 RBF issues - conceptual
- RBF networks pave the representational space with a set of hidden units/receptive fields, each of which is maximally active for a particular input pattern.
- When the receptive fields overlap, you have a distributed representation (any given unit participates in representing multiple input patterns)
- When the receptive fields do not overlap, you have a highly localized representation
- The degree of overlap is a modifiable parameter and allows you to decrease the localization of your hidden unit representations
28 Catastrophic retroactive interference
- Networks with sigmoids divide up representational space into halves
- This produces a highly distributed representation in which each unit participates in most internal representations
- RBFs have more localized receptive fields.
- Interference occurs only when new patterns are very similar to ones with which the network has already been trained.
29 Design issue - Where to put the RBF centers?
- Could be random or evenly distributed.
- Some models (e.g., ALCOVE) place the RBF centers where the training exemplars are.
30 Design issue - How to move centers?
- Vector quantization approaches
- Example - winner take all
- The closest hidden unit moves towards the input pattern
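A hedged sketch of one common winner-take-all update (the learning rate is an assumption of mine): only the closest center moves toward the presented input pattern.

```python
import numpy as np

def winner_take_all_update(centers, input_pattern, lr=0.1):
    """Move the single closest center a fraction lr of the way toward the input."""
    centers = np.asarray(centers, float)
    x = np.asarray(input_pattern, float)
    winner = np.argmin(np.linalg.norm(centers - x, axis=1))   # closest center wins
    centers[winner] += lr * (x - centers[winner])              # move it toward the input
    return centers
```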
31 Design issue - How to determine the width of the receptive fields?
- Usually ad hoc.
- Commonly chosen based on the average distance between training exemplars.
- Some learning algorithms have been proposed (see Moody and Darken's work)
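A sketch of the ad hoc width choice mentioned above - setting every receptive field width to the average pairwise distance between training exemplars.

```python
import numpy as np

def average_pairwise_width(exemplars):
    """Average Euclidean distance over all pairs of training exemplars."""
    exemplars = np.asarray(exemplars, float)
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(exemplars)
             for b in exemplars[i + 1:]]
    return np.mean(dists)
```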
32 Comparison of RBF to nearest neighbor methods
- Nearest neighbor methods are exemplar-based
- RBFs are prototype-based
- The network doesn't require storage of every training exemplar - it summarizes the data.
- RBF and nearest neighbor are nearly identical when RBFs are located at each and every training exemplar.
33 Comparison of RBF to MLP
- Behavior of hidden units
- MLP hidden units use a weighted linear summation of the input transformed by a sigmoid (can be Gaussian, though)
- RBF hidden units use the distance to a prototype vector followed by transformation by a localized function
- Local vs. distributed
- MLP hidden units form a strongly distributed representation (exception - Gaussian activation functions)
- RBF hidden unit representations are more localized
- Learning
- MLP - all of the parameters (weights) are determined at the same time as part of a single global training strategy
- RBF - training of the hidden unit representations is decoupled from that of the output units.
34 Advantages of RBFs
- Learn quickly
- Can use unsupervised learning to learn the hidden unit representations
- Are not subject to catastrophic retroactive interference
- Are not overly sensitive to linear boundaries (Kruschke, 1993)
- Example - the XOR problem
- The receptive field notion is more biological
35 Disadvantages of RBFs
- Hidden unit representations are general, not specific to the problem to be learned
- Makes it difficult to map similar input representations to different outputs.
- May not quite achieve the accuracy of a backprop network, but it gets close quickly!
- Extrapolation - localized receptive fields respond weakly outside the region covered by the training exemplars.
36 Kruschke's ALCOVE model
- Attention Learning COVEring map
- A connectionist implementation of Nosofsky's GCM
- Category learning model
- An RBF network that uses the Minkowski metric
- Includes a c parameter, which he calls specificity
- This parameter determines the fixed width of the RBFs
37 Attention learning
- Includes attentional weights and attention learning to determine those attentional weights
- Attentional weights are learned via backpropagation
- Makes for slow learning of attention weights
- Recent Kruschke models acquire attention weights without using backprop.
- Attentional weights serve to stretch the representational space
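A minimal sketch of an ALCOVE-style hidden-unit activation, assuming the common r = 1 (city-block) form described in Kruschke (1992); the variable names are mine. Larger attention weights stretch the corresponding dimension, so mismatches on that dimension reduce similarity more.

```python
import numpy as np

def alcove_hidden_activation(stimulus, exemplar, attention, c=1.0):
    """Attention-weighted city-block distance to the exemplar, passed through
    the exponential similarity function with specificity c."""
    stimulus, exemplar, attention = map(np.asarray, (stimulus, exemplar, attention))
    weighted_distance = np.sum(attention * np.abs(exemplar - stimulus))
    return np.exp(-c * weighted_distance)

# With more attention on dimension 0, a mismatch there hurts similarity more.
print(alcove_hidden_activation([1, 0], [0, 0], attention=[2.0, 0.5]))  # exp(-2.0)
print(alcove_hidden_activation([0, 1], [0, 0], attention=[2.0, 0.5]))  # exp(-0.5)
```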
38 Other applications of RBFs to psychology
- Function learning
- DeLosh, Busemeyer, and McDaniel's EXAM (extrapolation-association model)
- Object recognition
- Edelman's model
- Recognizing handwriting
- Lee, Yuchun (1991)
- Compares k nearest-neighbor, radial-basis function, and backpropagation neural networks.
- Sonar discrimination by dolphins
- Au, Andersen, Rasmussen, and Roitblat (1995)