Title: RECENT DEVELOPMENTS IN MULTILAYER PERCEPTRON NEURAL NETWORKS
1 RECENT DEVELOPMENTS IN MULTILAYER PERCEPTRON
NEURAL NETWORKS
- Walter H. Delashmit
- Lockheed Martin Missiles and Fire Control
- Dallas, TX 75265
- walter.delashmit_at_lmco.com
- walter.delashmit_at_verizon.net
- Michael T. Manry
- The University of Texas at Arlington
- Arlington, TX 76010
- manry_at_uta.edu
- Memphis Area Engineering and Science Conference 2005
- May 11, 2005
2 Outline of Presentation
- Review of Multilayer Perceptron Neural Networks
- Network Initial Types and Training Problems
- Common Starting Point Initialized Networks
- Dependently Initialized Networks
- Separating Mean Processing
- Summary
3 Review of Multilayer Perceptron Neural Networks
4 Typical 3-Layer MLP
5 MLP Performance Equations
Mean Square Error (MSE)
Output
Net Function
(standard forms of these three equations are sketched below)
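The equation images on this slide did not survive conversion. A hedged reconstruction of the three quantities, using notation common for this type of MLP (x_p(i): inputs, O_p(j): hidden unit outputs, y_p(k): network outputs, t_p(k): desired outputs, N_v: training patterns, M: outputs; the exact symbols on the original slide may differ):

% Net function of hidden unit j for pattern p, with threshold theta(j)
n_p(j) = \theta(j) + \sum_{i=1}^{N} w(j,i)\, x_p(i)

% Output (activation) of hidden unit j: sigmoid of the net function
O_p(j) = f\big(n_p(j)\big) = \frac{1}{1 + e^{-n_p(j)}}

% Mean square error over N_v training patterns and M outputs
E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{k=1}^{M} \big[t_p(k) - y_p(k)\big]^2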
6 Net Control
- Scales and shifts all net functions so that they do not generate small gradients and so that large inputs do not mask the potential effects of small inputs (a minimal sketch of this scaling follows)
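A minimal sketch of this kind of net-function scaling in Python, assuming sigmoid hidden units and arbitrary target mean and standard deviation values (the slide does not specify the actual net control parameters):

import numpy as np

def net_control(W, b, X, target_mean=0.5, target_std=1.0):
    """Scale and shift each hidden unit's weights and threshold so its net
    function over the training inputs has a chosen mean and standard
    deviation, avoiding tiny sigmoid gradients and keeping large inputs
    from masking small ones. Target values are assumptions."""
    net = X @ W.T + b                      # net functions over all patterns
    std = net.std(axis=0) + 1e-12          # per-unit spread of the net function
    scale = target_std / std               # per-unit scale factor
    W_new = W * scale[:, None]             # scaled input weights
    b_new = target_mean + scale * (b - net.mean(axis=0))  # shifted thresholds
    return W_new, b_new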
7 Neural Network Training Algorithms
- Backpropagation Training
- Output Weight Optimization - Hidden Weight Optimization (OWO-HWO)
- Full Conjugate Gradient
8 Output Weight Optimization - Hidden Weight Optimization (OWO-HWO)
- Used in this development
- Linear equations are used to solve for the output weights in OWO (a minimal sketch of this step follows)
- Separate error functions for each hidden unit are used, and multiple sets of linear equations are solved to determine the weights connecting to the hidden units in HWO
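A minimal sketch of the OWO step in Python, assuming a single hidden layer of sigmoid units with bypass (input-to-output) connections; the equation-solving routine here is ordinary least squares rather than necessarily the one used in the dissertation, and the HWO step is not shown:

import numpy as np

def output_weight_optimization(X, T, W_hidden, b_hidden):
    """Solve for the output weights by linear least squares (OWO step).
    X : (Nv, N) input patterns, T : (Nv, M) desired outputs.
    W_hidden, b_hidden : hidden-layer weights and thresholds (held fixed)."""
    hidden = 1.0 / (1.0 + np.exp(-(X @ W_hidden.T + b_hidden)))  # sigmoid activations
    basis = np.hstack([np.ones((X.shape[0], 1)), X, hidden])     # [bias | inputs | hidden]
    W_out, *_ = np.linalg.lstsq(basis, T, rcond=None)            # linear equations for output weights
    return W_out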
9 Network Initial Types and Training Problems
10 Problem Definition
- Assume that a set of MLPs of different sizes is to be designed for a given training data set
- Let SNh be the set of all MLPs for that training data having Nh hidden units, and let Eint(Nh) denote the corresponding training error of an initial network that belongs to SNh
- Let Ef(Nh) denote the corresponding training error of a well-trained network
- Let Nhmax denote the maximum number of hidden units for which networks are to be designed
- Goal: choose a set of initial networks from S0, S1, S2, ... such that Eint(0) ≥ Eint(1) ≥ Eint(2) ≥ ... ≥ Eint(Nhmax), and train the networks to minimize Ef(Nh) such that Ef(0) ≥ Ef(1) ≥ Ef(2) ≥ ... ≥ Ef(Nhmax)
- Axiom 3.1: If Ef(Nh) ≥ Ef(Nh-1), then the network having Nh hidden units is useless, since the training resulted in a larger, more complex network with a larger or the same training error.
11 Network Design Methodologies
- Design Methodology One (DM-1): A well-organized researcher may design a set of different size networks in an orderly fashion, each with one or more hidden units more than the previous network
- Thorough design approach
- May take a longer time to design
- Allows a trade-off between network performance and size to be achieved
- Design Methodology Two (DM-2): A researcher may design different size networks in no particular order
- May be pursued quickly for only a few networks
- Possible that the design could be significantly improved with a bit more attention to network design
12 Three Types of Networks Defined
- Randomly Initialized (RI) Networks: No members of this set of networks have any initial weights and thresholds in common. Practically, this means that the initial random number seeds (IRNS) are widely separated. Useful when the goal is to quickly design one or more networks of the same or different sizes whose weights are statistically independent of each other. Can be designed using DM-1 or DM-2.
- Common Starting Points Initialized (CSPI) Networks: When a set of networks is CSPI, each one starts with the same IRNS. These networks are useful when it is desired to compare the performance of networks that have the same IRNS as the starting point. Can be designed using DM-1 or DM-2.
- Dependently Initialized (DI) Networks: A series of networks is designed, with each subsequent network having one or more hidden units more than the previous network. Larger networks are initialized using the final weights and thresholds from training a smaller network as the values of the common weights and thresholds. DI networks are useful when the goal is a thorough analysis of network performance versus size, and are most relevant to DM-1.
13 Network Properties
- Theorem 3.1: If two initial RI networks (1) are the same size, (2) have the same training data set, and (3) the training data set has more than one unique input vector, then the hidden unit basis functions are different for the two networks.
- Theorem 3.2: If two CSPI networks (1) are the same size and (2) use the same algorithm for processing random numbers into weights, then they are identical.
- Corollary 3.2: If two initial CSPI networks are the same size and use the same algorithm for processing random numbers into weights, then they have all basis functions in common.
14 Problems with MLP Training
- Non-monotonic Ef(Nh)
- No standard way to initialize and train additional hidden units
- Net control parameters are arbitrary
- No procedure to initialize and train DI networks
- Interference between the network's linear and nonlinear components
15 Mapping Error Examples
16 Tasks Performed in this Research
- Analysis of RI networks
- Improved initialization in CSPI networks
- Improved initialization of new hidden units in DI networks
- Analysis of separating mean training approaches
17 CSPI and CSPI-SWI Networks
- Improvement to RI networks
- Each CSPI network starts with the same IRNS
- Extended to CSPI-SWI (Structured Weight Initialization) networks
- Every hidden unit of the larger network has the same initial weights and threshold values as the corresponding units of the smaller network
- Input-to-output weights and thresholds are also identical
- Theorem 5.1: If two CSPI networks are designed with structured weight initialization, the common subset of the hidden unit basis functions is identical.
- Corollary 5.1: If two CSPI networks are designed using structured weight initialization, the only initial basis functions that are not the same are the hidden unit basis functions for the additional hidden units in the larger network.
- A detailed flow chart for CSPI-SWI initialization is given in the dissertation (a minimal sketch of the idea follows)
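A minimal sketch of the structured weight initialization idea in Python. The uniform weight range and the unit-by-unit draw order are assumptions rather than the dissertation's exact flow chart; the point is that two CSPI-SWI networks built from the same IRNS share the initial weights of their common hidden units:

import numpy as np

def cspi_swi_init(n_inputs, n_hidden, seed):
    """Draw each hidden unit's weights and threshold in a fixed order from a
    generator seeded with the common IRNS, so a smaller network's units are
    a prefix of a larger network's units."""
    rng = np.random.RandomState(seed)            # common IRNS
    W = np.empty((n_hidden, n_inputs))
    b = np.empty(n_hidden)
    for j in range(n_hidden):                    # unit-by-unit draws keep the
        W[j] = rng.uniform(-1.0, 1.0, n_inputs)  # common units identical across
        b[j] = rng.uniform(-1.0, 1.0)            # networks of different sizes
    return W, b

# The first 5 hidden units of a 10-unit network match a 5-unit network exactly,
# illustrating the common-basis-function property of Corollary 5.1.
W5, b5 = cspi_swi_init(8, 5, seed=42)
W10, b10 = cspi_swi_init(8, 10, seed=42)
assert np.allclose(W5, W10[:5]) and np.allclose(b5, b10[:5])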
18 CSPI-SWI Examples
- Example results for the fm and twod data sets (figures)
19 DI Network Development and Evaluation
- Improvement over RI, CSPI and CSPI-SWI networks
- The common subset of the initial weights and thresholds for the larger network is initialized with the final weights and thresholds from a previously well-trained smaller network (a minimal sketch follows)
- Designed with DM-1
- Single network designs → networks are implementable
- After training, testing is feasible on a different data set
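A minimal sketch of dependent initialization in Python. The treatment of the new hidden units (small random input weights, zero output weights) is an assumption, since the slide only specifies how the common weights and thresholds are carried over:

import numpy as np

def di_initialize(W_small, b_small, Wout_small, n_new, rng):
    """Build a larger MLP's initial weights from a trained smaller one.
    The common hidden units keep the smaller network's final weights and
    thresholds; only the n_new added units get fresh values."""
    n_in = W_small.shape[1]
    n_out = Wout_small.shape[0]
    W = np.vstack([W_small, rng.uniform(-1, 1, (n_new, n_in))])   # copy + new units
    b = np.concatenate([b_small, rng.uniform(-1, 1, n_new)])
    Wout = np.hstack([Wout_small, np.zeros((n_out, n_new))])      # new units start inactive
    return W, b, Wout

With zero output weights on the added units, the sketch's larger network initially computes the same mapping as the trained smaller network.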
20 Basic DI Network Flowgraph
21 Properties of DI Networks
- Eint(Nh) < Eint(Nh-p)
- The Ef(Nh) curve is monotonic non-increasing (i.e., Ef(Nh) ≤ Ef(Nh-p))
- Eint(Nh) ≈ Ef(Nh-p)
22 Performance Results for DI Networks with Fixed Iterations
- Results for the fm, twod, F24, and F17 data sets (figures)
23 RI Network and DI Network Comparison
- (1) DI network: standard DI network design for Nh hidden units
- (2) RI type 1: RI networks were designed using a single network for each value of Nh, and every network of size Nh was trained using the same value of Niter that the corresponding DI network was trained with
- (3) RI type 2: RI networks were designed using a single network for each value of Nh, and every network was trained using the total number of iterations Niter used for the entire sequence of DI networks; this can be expressed by the sum given below. This results in the RI type 2 network actually having a larger value of Niter than the DI network.
24 RI Network and DI Network Comparison Results
- Results for the fm and twod data sets (figures)
25 Separating Mean Processing Techniques
- Bottom-Up Separating Mean
- Top-Down Separating Mean
26 Bottom-Up Separating Mean
- Processing flow: generate linear mapping results → generate new desired output vector → train the MLP using the new data
- Basic Idea
- A linear mapping is removed from the training data (a minimal sketch follows).
- The nonlinear fit to the resulting data may perform better.
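A minimal sketch of the bottom-up idea in Python: fit a linear mapping from inputs to desired outputs, subtract it to form the new desired output vector, and train the MLP on the residual (the OWO-HWO training itself is not shown):

import numpy as np

def bottom_up_separating_mean(X, T):
    """Remove a linear mapping from the training data.
    X : (Nv, N) inputs, T : (Nv, M) desired outputs."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])          # augment inputs with a bias column
    W_lin, *_ = np.linalg.lstsq(Xa, T, rcond=None)         # linear mapping results
    T_new = T - Xa @ W_lin                                 # new desired output vector
    return W_lin, T_new                                    # MLP is then trained on (X, T_new)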
27 Bottom-Up Separating Mean Results
- Results for the fm, power12, and single2 data sets (figures)
28 Top-Down Separating Mean
- Basic Idea
- If we know which subsets of inputs and outputs have the same means in Signal Models 2 and 3, we can estimate and remove these means (a minimal sketch follows).
- Network performance is more robust.
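A minimal sketch of the top-down mean-removal step in Python, assuming a given subset of inputs and outputs that share a common per-pattern mean; Signal Models 2 and 3 and the dissertation's technique for deciding which inputs and outputs are similar are not reproduced here:

import numpy as np

def top_down_remove_means(X, T, in_idx, out_idx):
    """Estimate a per-pattern mean from the selected inputs and remove it
    from both the selected inputs and outputs, so the MLP trains on
    mean-removed data. in_idx/out_idx are hypothetical index lists."""
    m = X[:, in_idx].mean(axis=1, keepdims=True)   # per-pattern mean estimate
    Xr, Tr = X.copy(), T.copy()
    Xr[:, in_idx] -= m
    Tr[:, out_idx] -= m
    return Xr, Tr, m                               # m is added back to the outputs after the MLP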
29 Separating Mean Results
- Results for the power12 data set (figure)
30 Conclusions
- On average, CSPI-SWI networks produce MSE versus Nh curves that are closer to monotonic non-increasing than those of RI networks
- MSE versus Nh curves are always monotonic non-increasing for DI networks
- DI network training was improved by calculating the number of training iterations and limiting the amount of training applied to previously trained units
- DI networks always produce more consistent MSE versus Nh curves than RI, CSPI and CSPI-SWI networks
- Separating mean processing using both bottom-up and top-down architectures often produces improved performance
- A new technique was developed to determine which inputs and outputs are similar, for use in top-down separating mean processing