Title: Modular Neural Networks II
1. Modular Neural Networks II
Presented by David Brydon, Karl Martens, David Pereira
CPSC 533 - Artificial Intelligence, Winter 2000
Instructor: C. Jacob
Date: 16 March 2000
2. Presentation Agenda
- A Reiteration Of Modular Neural Networks
- Hybrid Neural Networks
- Maximum Entropy
- Counterpropagation Networks
- Spline Networks
- Radial Basis Functions
- Note: The information contained in this presentation has been obtained from Neural Networks: A Systematic Introduction by R. Rojas.
3. A Reiteration of Modular Neural Networks
There are many different types of neural networks - linear, recurrent, supervised, unsupervised, self-organizing, etc. Each of these neural networks has a different theoretical and practical approach. However, these different models can be combined. How? Each of the aforementioned neural networks can be transformed into a module that can be freely intermixed with modules of other types of neural networks. Thus, we have Modular Neural Networks.
4. A Reiteration of Modular Neural Networks
- But WHY do we have Modular Neural Network Systems?
- To Reduce Model Complexity
- To Incorporate Knowledge
- To Fuse Data and Predict Averages
- To Combine Techniques
- To Learn Different Tasks Simultaneously
- To Incrementally Increase Robustness
- To Emulate Its Biological Counterpart
5. Hybrid Neural Networks
- A very well-known and promising family of architectures was developed by Stephen Grossberg.
- It is called ART - Adaptive Resonance Theory.
- It is closer to the biological paradigm than feed-forward networks or standard associative memories.
- The dynamics of the network resemble learning in humans.
- One-shot learning can be recreated with this model.
- There are three different architectures in this family:
- ART-1: uses Boolean values
- ART-2: uses real values
- ART-3: uses differential equations
6. Hybrid Neural Networks
Each category in the input space is represented by a vector. The ART networks classify a stochastic series of vectors into clusters. All vectors located inside the cone around a weight vector are considered members of that specific cluster. Each unit fires only for vectors located inside its associated cone of radius r. The value of r is inversely proportional to the attention parameter of the unit: a large attention parameter (small r) gives a fine classification of the input space, while a small attention parameter (large r) gives a coarse classification.
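As an illustration, here is a minimal sketch (not from the slides) of such a cone-membership test for real-valued vectors, using the angle between the input and a unit's weight vector; the function name and the use of cosine similarity are assumptions made for this example:

import numpy as np

def in_attention_cone(x, w, r):
    """Return True if input x lies inside the cone of angular radius r
    (in radians) around weight vector w, i.e. the unit would fire."""
    cos_angle = np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return angle <= r

A large attention parameter corresponds to a small r (fine clusters), a small attention parameter to a large r (coarse clusters).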
7. Hybrid Neural Networks
Fig. 1. Vector clusters and attention
parameters
8. Hybrid Neural Networks
- Once the weight vectors have been found, the network computes whether new data can or cannot be classified by the existing clusters.
- If not, a new cluster is created with a new associated weight vector.
- ART networks have two major advantages:
- Plasticity: the network can always react to unknown inputs (by creating a new cluster with a new weight vector if the given input cannot be classified by existing clusters).
- Stability: existing clusters are not deleted by the introduction of new inputs (new clusters will just be created in addition to the old ones).
- However, enough potential weight vectors must be provided.
9. Hybrid Neural Networks
Fig. 2. The ART-1 Architecture
10. Hybrid Neural Networks
The Structure of ART-1 (Part 1 of 2)
There are two basic layers of computing units. Layer F1 receives binary input vectors from the input sites. As soon as an input vector arrives, it is passed to layer F1 and from there to layer F2. Layer F2 contains elements which fire according to the winner-takes-all method (only the element receiving the maximal scalar product of its weight vector and the input vector fires). When a unit in layer F2 has fired, the negative weight turns off the attention unit. Also, the winning unit in layer F2 sends back a 1 through the connections between layers F2 and F1. Now each unit in layer F1 receives as input the corresponding component of the input vector x and of the weight vector w.
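A minimal sketch (illustrative names, not from the slides) of the winner-takes-all selection in layer F2, which simply picks the unit whose weight vector has the maximal scalar product with the input:

import numpy as np

def f2_winner(W, x):
    """W holds one weight vector per F2 unit (as rows); x is the binary
    input vector. Returns the index of the unit with the maximal scalar
    product w . x, i.e. the only F2 unit that fires."""
    return int(np.argmax(W @ x))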
11. Hybrid Neural Networks
The Structure of ART-1 (Part 2 of 2)
The i-th F1 unit compares xi with wi and outputs the product xi·wi. The reset unit receives this information and also the components of x, weighted by p, the attention parameter, so that its own computation is

p·(x1 + x2 + ... + xn) - x·w <= 0,

which is the same as

(x·w) / (x1 + x2 + ... + xn) >= p.

The reset unit fires only if the input lies outside the attention cone of the winning unit. In that case a reset signal is sent to layer F2, but only the winning unit is inhibited. This in turn activates the attention unit and a new round of computation begins. Hence, there is resonance.
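A minimal sketch (not from the slides; names are illustrative) of this vigilance test for binary vectors:

import numpy as np

def reset_fires(x, w, p):
    """ART-1 reset test: the reset unit fires when the winning unit's
    weight vector w does not match input x well enough, i.e. when
    (x . w) / (x1 + ... + xn) < p."""
    return np.dot(x, w) < p * np.sum(x)

# Example: x = [1, 0, 1, 1], winning w = [1, 0, 0, 1], p = 0.8:
# x . w = 2, sum(x) = 3 and 2/3 < 0.8, so the reset unit fires and the
# search continues with the winning unit inhibited.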
12. Hybrid Neural Networks
The Structure of ART-1 (Some Final Details)
The weight vectors in layer F2 are initialized with all components equal to 1, and p is selected to satisfy 0 < p < 1. This ensures that eventually an unused vector will be recruited to represent a new cluster. The selected weight vector w is updated by pulling it in the direction of x. This is done in ART-1 by turning off all components in w which are zero in x. The purpose of the reset signal is to inhibit all units that do not resonate with the input. A unit in layer F2 which is still unused can then be selected for the new cluster containing x. In this way, sufficiently different input data can create a new cluster. By modifying the value of the attention parameter p, we can control the number of clusters and how wide they are.
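A minimal sketch (illustrative, not the slides' notation) of this ART-1 weight update, which keeps only the components of w that are also 1 in x:

import numpy as np

def update_weights(w, x):
    """ART-1 update: turn off every component of w that is zero in x.
    For binary vectors this is a component-wise AND, pulling w toward x."""
    return np.minimum(w, x)

# Weight vectors start as all ones, so a still-unused unit matches any
# input and can be recruited for a new cluster, e.g.:
# update_weights(np.ones(4), np.array([1, 0, 1, 1])) -> [1, 0, 1, 1]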
13. Hybrid Neural Networks
The Structure of ART-2 and ART-3
ART-2 uses vectors that have real-valued components instead of Boolean components. The dynamics of the ART-2 and ART-3 models are governed by differential equations. However, computer simulations consume too much time. Consequently, implementations using analog hardware or a combination of optical and electronic elements are more suited to this kind of model.
14. Hybrid Neural Networks
Maximum entropy
So what's the problem with ART? It tries to build clusters of the same size, independently of the distribution of the data. So, is there a better solution? Yes: allow the clusters to have varying radii, with a technique called the Maximum Entropy Method.
What is entropy? The entropy H of a data set of N points assigned to k different clusters c1, c2, ..., ck is given by

H = -( p(c1)·log p(c1) + p(c2)·log p(c2) + ... + p(ck)·log p(ck) )

where p(ci) denotes the probability of hitting the i-th cluster when an element of the data set is picked at random. Since the probabilities add up to 1, the clustering that maximizes the entropy is the one for which all cluster probabilities are identical. This means that the clusters will tend to cover the same number of points.
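A minimal sketch (illustrative names) of computing this entropy from cluster membership counts:

import numpy as np

def clustering_entropy(counts):
    """counts[i] = number of data points assigned to cluster ci.
    Returns H = -sum p(ci) * log p(ci), which is maximal when all the
    p(ci) are equal."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))

# Equal-sized clusters maximize H: clustering_entropy([5, 5, 5, 5])
# is larger than clustering_entropy([17, 1, 1, 1]).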
15. Hybrid Neural Networks
Maximum entropy
However, there is still a problem whenever the number of elements of each class in the data set is different. Consider the case of unlabeled speech data: some phonemes are more frequent than others, and if a maximum entropy method is used, the boundaries between clusters will deviate from the natural solution and classify some data erroneously. So how do we solve this problem? With the Bootstrapped Iterative Algorithm:
- cluster: Compute a maximum entropy clustering with the training data. Label the original data according to this clustering.
- select: Build a new training set by selecting from each class the same number of points (random selection with replacement). Go to the previous step.
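A minimal sketch of this bootstrapped loop, assuming some routine max_entropy_clustering() is available; its name and interface, and the other helper names, are assumptions made for this example:

import numpy as np

def bootstrap_iterate(data, max_entropy_clustering, n_iters=10):
    """Alternate between clustering and resampling a class-balanced
    training set (random selection with replacement)."""
    training = data
    labels = None
    for _ in range(n_iters):
        # cluster: maximum entropy clustering of the current training set,
        # then label the ORIGINAL data according to it
        model = max_entropy_clustering(training)
        labels = model.predict(data)
        # select: same number of points from each class, with replacement
        classes = np.unique(labels)
        n = min(np.sum(labels == c) for c in classes)
        parts = [np.random.choice(np.where(labels == c)[0], size=n, replace=True)
                 for c in classes]
        training = data[np.concatenate(parts)]
    return labels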
16. Hybrid Neural Networks
Counterpropagation network
Are there any other hybrid network models? Yes, the counterpropagation network, as proposed by Hecht-Nielsen. So what are counterpropagation networks designed for? To approximate a continuous mapping f and its inverse f^-1. A counterpropagation network consists of an n-dimensional input vector which is fed to a hidden layer consisting of h cluster units. The output is generated by a single linear associator unit. The weights in the network are adjusted using supervised learning. The above network can successfully approximate functions of the form f: Rn -> R.
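A minimal sketch of the forward pass of this simplified, single-output network (names are illustrative): only the winning cluster unit fires, so the linear associator output reduces to that unit's weight zi:

import numpy as np

def counterprop_forward(x, W, z):
    """W: h cluster weight vectors (rows); z: one output weight per
    cluster unit. The cluster unit closest to x fires alone, and the
    single linear associator outputs its weight z[winner]."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    return float(z[winner])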
17. Hybrid Neural Networks
Fig. 3. Simplified counterpropagation network
18. Hybrid Neural Networks
- Counterpropagation network
- The training phase is completed in two parts:
- Training of the hidden layer into a clustering of the input space that corresponds to an n-dimensional Voronoi tiling. The hidden layer's output needs to be controlled so that only the element with the highest activation fires.
- The zi weights are then adjusted to represent the value of the approximation for the cluster region (see the sketch after this list).
- This network can be extended to handle multiple output units.
19. Hybrid Neural Networks
- Fig. 4 Function approximation with a
counterpropagation network.
20. Hybrid Neural Networks
- Spline networks
- Can the approximation created by a counterpropagation network be improved on? Yes.
- In the counterpropagation network the Voronoi tiling is composed of a series of horizontal tiles, each of which represents an average of the function in that region.
- The spline network solves this problem by extending the hidden layer in the counterpropagation network. Each unit is paired with a linear associator; the cluster unit is used to inhibit or activate the linear associator, which is connected to all inputs (see the sketch after this list).
- This modification allows the resulting set of tiles to be oriented differently with respect to each other, creating an approximation with a smaller quadratic error and a better solution to the problem.
- Training proceeds as before, except that the newly added linear associators are trained using backpropagation.
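A minimal sketch of evaluating such a network (illustrative names): the winning cluster unit activates its paired linear associator, which computes a local linear fit over all inputs, giving a tilted rather than horizontal tile:

import numpy as np

def spline_forward(x, W, A, b):
    """W: cluster weight vectors (rows); A, b: weight vector and bias of
    the linear associator paired with each cluster unit. The winner's
    associator alone produces the output."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    return float(A[winner] @ x + b[winner])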
21. Hybrid Neural Networks
- Fig. 5 Function approximation with linear
associators
22. Hybrid Neural Networks
- Radial basis functions
- Has a similar structure to that of the counterpropagation network. The difference is that the activation function used for each unit is Gaussian instead of sigmoidal (see the sketch below).
- The Gaussian approach uses locally concentrated functions.
- The sigmoidal approach uses a smooth step function.
- Which is better depends on the specific problem at hand. If the function to be approximated is a smooth step, the Gaussian approach requires more units, whereas if the function is Gaussian-like, the sigmoidal approach requires more units.
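A minimal sketch (illustrative names) of a radial basis function network output, where each hidden unit responds with a Gaussian of the distance between the input and its centre:

import numpy as np

def rbf_forward(x, centers, widths, z):
    """Output = weighted sum of Gaussian activations, one per hidden
    unit. Each unit is locally concentrated around its centre."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    activations = np.exp(-d2 / (2.0 * widths ** 2))
    return float(activations @ z)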