Title: Outline Lecture 1
EE459 Neural Networks: How to design a good-performance NN?
Kasin Prakobwaitayakit, Department of Electrical Engineering, Chiangmai University
Glossary
- Pattern: a complete set of data inputs and outputs that provides a snapshot of the system being modelled; also called examples or cases
- Feature: an identifying characteristic in the data that the model will ideally capture
- Domain: a set of boundaries that define the range of expected/observed data for a particular model or problem
Backpropagation Modelling Heuristics
- Selection of model inputs and outputs
- Defining the model domain
- Data pre-processing
- Selection of training/testing cases
- Data scaling
- Number of hidden layers
- Number of neurons
- Activation function selection
- Initial weight values
- Learning rate and momentum
- Presentation of patterns
- Stopping criteria for training
- Improving model performance
Selection of Model Inputs and Outputs
- Start with a set of inputs that are KNOWN to affect the process
- Then add, one at a time, other inputs that are suspected of having a relationship with the process
- Eliminate input variables that are redundant (high covariance)
- Eliminate data patterns that do not contribute to training (no new information)
- Identify and eliminate data patterns that have the same inputs but different outputs
Defining the Model Domain
- ANNs should be confined to a limited domain
  - develop separate models for contradictory areas of the domain
- Build models that predict a single output
  - link multiple models together, if required
Data Pre-processing
- Any time the dynamic range of an input spans more than a few orders of magnitude, a logarithmic transformation should be applied
- Transformations may also be useful for reconditioning input data when an ANN has trouble converging
- Non-numerical inputs need to be codified
- Both steps are sketched below
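A minimal sketch of the two pre-processing ideas above: log-transforming a wide-ranging input and one-hot codifying a non-numerical input. The helper names (log_transform, one_hot) and the example values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def log_transform(x):
    """Compress an input whose dynamic range spans several orders of
    magnitude; shifted so the argument of the log stays positive."""
    x = np.asarray(x, dtype=float)
    return np.log10(x - x.min() + 1.0)

def one_hot(categories):
    """Codify a non-numerical input as one-of-N binary columns."""
    labels = sorted(set(categories))
    return np.array([[1.0 if c == lab else 0.0 for lab in labels]
                     for c in categories])

flow = [0.3, 12.0, 850.0, 40000.0]            # spans ~5 orders of magnitude
print(log_transform(flow))
print(one_hot(["low", "high", "medium", "high"]))
```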
Selection of Training/Testing Cases
- Training cases should be representative of the problem domain
- For good generalization capacity, the training set must be complete
  - every important variable must be measured
- The data set is randomly divided into training and testing sets, in a 70:30 ratio for example
- Rules of thumb (see the sketch after this list)
  - the number of training patterns should be at least 5 times the number of nodes in the network
  - the number of training cases should be roughly the number of weights times the inverse of the accuracy parameter ε (ε = 0.9 means 90% prediction accuracy is required)
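A minimal sketch of the points above, assuming the patterns are held in NumPy arrays X (inputs) and y (outputs): a random 70:30 training/testing split, followed by the two rules of thumb worked through with made-up network sizes. The function name, the seed, and the example counts are assumptions for illustration.

```python
import numpy as np

def split_patterns(X, y, train_fraction=0.7, seed=0):
    """Randomly divide the data set into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))               # shuffle pattern order
    n_train = int(train_fraction * len(X))      # e.g. 70% for training
    train, test = idx[:n_train], idx[n_train:]
    return X[train], y[train], X[test], y[test]

# Rules of thumb, with illustrative network sizes
n_nodes, n_weights, accuracy = 25, 180, 0.9
print(5 * n_nodes)                   # at least 5 x number of nodes -> 125 patterns
print(round(n_weights / accuracy))   # weights x (1 / accuracy)     -> 200 patterns
```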
Data Scaling
- Output variables should be scaled into the 0.1 to 0.9 range (sketched below)
  - this avoids operating in the saturation range of the sigmoid function
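A short sketch of the scaling above: a linear map of an output variable into the 0.1 to 0.9 range, plus its inverse for converting predictions back to the original units. The function names are illustrative.

```python
import numpy as np

def scale_01_09(y):
    """Map y linearly into [0.1, 0.9] to stay clear of sigmoid saturation."""
    y = np.asarray(y, dtype=float)
    return 0.1 + 0.8 * (y - y.min()) / (y.max() - y.min())

def unscale_01_09(y_scaled, y_min, y_max):
    """Invert the scaling to recover values in the original units."""
    return y_min + (np.asarray(y_scaled) - 0.1) * (y_max - y_min) / 0.8
```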
Number of Hidden Layers
- Increasing the number of hidden layers increases both the time and the number of patterns (examples) required for training
- For most problems, one hidden layer will suffice
  - problems can always be solved with two
- Multiple slabs (Ward Nets) supposedly increase processing power
  - each slab (group of neurons) acts as a detector for one or more input features
Number of Neurons
- One neuron in the input layer for each input
- One neuron in the output layer for each output
- The proper number of hidden neurons is often determined experimentally
  - too few: poor ability to capture features
  - too many: poor ability to generalize (the ANN simply memorizes the training data)
- Various rules of thumb have been reported, e.g. 0.75N or 2N + 1, where N is the number of inputs
- In general, more weights introduce more local minima into the error surface
- Flat regions in the error surface can mislead gradient-based error minimization methods such as backpropagation
- Start with a small network and then add connections as needed
  - this avoids convergence problems as networks grow too large
- The optimum ratio of hidden neurons in the first hidden layer to those in the second is 3:1
Activation Function Selection
- Sigmoidal (logistic) activation functions are the most widely used
- Thresholding functions are only useful for categorical outputs
- Both are sketched below
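A minimal sketch of the two activation functions mentioned above; not tied to any particular library.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation: smooth, differentiable, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def threshold(x):
    """Hard-limiting activation: only useful for categorical outputs."""
    return np.where(np.asarray(x, dtype=float) >= 0.0, 1.0, 0.0)
```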
Initial Weight Values
- Weights need to be randomized initially
  - if all weights are set to the same number, the GDR (generalized delta rule) would never be able to leave the starting point
- The backpropagation algorithm may also have difficulty if the connection weights are prematurely saturated (> 0.9)
- A minimal initialization sketch follows
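A minimal initialization sketch for one fully connected layer. The ±0.5 range and the function name are assumptions; the lecture only requires the initial weights to be small and randomized.

```python
import numpy as np

def init_weights(n_in, n_out, seed=0):
    """Small random weights break the symmetry that would trap the GDR
    and keep the sigmoid units out of their saturated region."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.5, 0.5, size=(n_in, n_out))
```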
Learning Rate and Momentum
- The learning rate (η) affects the speed of convergence
  - if it is large (> 0.5), the weights are changed more drastically, but this may cause the optimum to be overshot
  - if it is small (< 0.2), the weights are changed in smaller increments, so the system converges more slowly but with little oscillation
- The best training occurs when the learning rate is as large as possible without leading to oscillation
- The learning rate can be increased as learning progresses, or a momentum term can be added, to improve network learning speed
- The momentum factor (α) dampens out the oscillations caused by increasing the learning rate
  - a momentum of 0.9 allows higher learning rates
  - the resulting weight update is sketched below
Presenting Patterns to the ANN
- Present patterns in a random fashion (sketched below)
- If input patterns can be easily classified, do not train the ANN on all patterns in a class in succession
  - the ANN will forget information as it moves from class to class
- Shaping can be used to improve network training
  - start with a very small, cohesive data set, then add patterns with greater deviations from the variable means as training progresses
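A sketch of random pattern presentation: reshuffle the training set every epoch so the network never sees all patterns of one class in succession. The epoch and pattern counts are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patterns, n_epochs = 10, 3
for epoch in range(n_epochs):
    for i in rng.permutation(n_patterns):   # new random order each epoch
        pass                                # present pattern i to the ANN here
```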
Stopping Criteria for Training
- Training should be stopped when one of the following conditions is met (see the sketch after this list)
  - the testing error is sufficiently small
  - the testing error begins to increase
  - a set number of iterations has passed
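A hedged sketch of a training loop that applies the three stopping criteria above. `train_one_epoch` and `test_error` are hypothetical callables supplied by the surrounding code; they are not part of the lecture.

```python
def train_until_stopped(train_one_epoch, test_error,
                        max_epochs=1000, target_error=0.01):
    """Stop when the testing error is small enough, begins to increase,
    or the iteration budget is exhausted."""
    best_error = float("inf")
    for epoch in range(max_epochs):          # criterion 3: fixed iteration count
        train_one_epoch()
        err = test_error()
        if err <= target_error:              # criterion 1: error small enough
            break
        if err > best_error:                 # criterion 2: error starts rising
            break
        best_error = err
    return best_error
```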
Improving Model Performance
- Re-initialize the network weights to a new set of random values and re-train the model
- Adjust the learning rate and momentum
- Modify the stopping criteria
- Prune network weights
- Use genetic algorithms to adjust the network topology
- Add noise to the training cases to decrease the chance of memorization (sketched below)
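A minimal sketch of the last item: adding Gaussian noise to the training inputs to discourage memorization. The 1% noise level is an assumption, not a value from the lecture.

```python
import numpy as np

def add_noise(X, relative_std=0.01, seed=0):
    """Perturb each input column by noise proportional to its spread."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, relative_std * X.std(axis=0), size=X.shape)
```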
ANN Modelling Approach
Needs and Suitability Assessment
- What are the modelling needs?
- Is the ANN technique suitable to meet these needs?
- Can the following be met?
  - data requirements
  - software requirements
  - hardware requirements
  - personnel requirements
Data Collection and Analysis
- Successful models require careful attention to data details
- Recall: relevant historical data is a key requirement
- For data collection, investigate
  - data availability
  - parameters, time frame, frequency, format
  - QA/QC protocols
  - data reliability
  - process changes
- Data requirements and guidelines
  - data for each of the parameters must be available
  - at least one full cycle of data must be available
  - appropriate QA/QC protocols must be in place
  - data collected prior to major process changes should generally not be used
- Data analysis involves
  - data characterization
  - a complete statistical analysis
- Data characterization for each parameter
  - qualitative assessment of hourly, daily, and seasonal trends (graphical examination of the data)
  - time-series analyses may be warranted
- Statistical analysis for each parameter (a sketch follows this list)
  - measures of central tendency: mean, median, mode
  - measures of variability: standard deviation, variance
  - percentile analyses
  - identification of outliers, erroneous entries, and non-entries
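A minimal sketch of the per-parameter statistical analysis, using pandas. The column-wise summary and the 3-standard-deviation outlier rule are illustrative assumptions.

```python
import pandas as pd

def characterize(series: pd.Series) -> dict:
    """Central tendency, variability, percentiles, and simple screening
    for outliers and non-entries for one parameter."""
    summary = {
        "mean": series.mean(), "median": series.median(),
        "mode": series.mode().iloc[0],
        "std": series.std(), "variance": series.var(),
        "p5": series.quantile(0.05), "p95": series.quantile(0.95),
        "non_entries": int(series.isna().sum()),
    }
    z = (series - summary["mean"]).abs() / summary["std"]
    summary["outliers"] = int((z > 3).sum())
    return summary
```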
Application of a Model-Building Protocol
- There is no accepted best method of developing ANN models
- An infinite number of distinct architectures is possible
- A protocol is used to reduce the number of architectures that are evaluated
- A sample five-step protocol
- Step 1: Selection of model inputs and outputs
  - Why it's important
    - ANN models are based on process inputs and outputs
  - How it's done
    - first, select the model output; the best models have only one output parameter
    - next, select model inputs from the available input parameters; selection is based on data availability, the literature, and expert knowledge
- Step 2: Selection and organization of data patterns
  - Why it's important
    - ANN models are only as good as the data used
    - separate, independent data sets are required to test and validate the model
  - How it's done
    - examine each data pattern for erroneous entries, outliers, and blank entries
    - delete questionable data patterns
    - sort and divide the data into training, testing, and production sets
    - perform a statistical analysis on each of the three data sets
- Step 3: Determination of architecture characteristics
  - Why it's important
    - each modelling scenario has an optimal architecture
  - How it's done
    - initially, hold many factors at software defaults or at pre-determined values
    - use modelling heuristics from the literature along with expert knowledge
    - determine the number of hidden-layer neurons
    - compare results across different runs
- Step 4: Evaluation of model stability
  - Why it's important
    - ensures that model results are independent of how the data were sorted
  - How it's done
    - build new training, testing, and production sets from the original database
    - re-train the models on the new data sets
    - compare results with the initial runs
- Step 5: Model fine-tuning
  - Why it's important
    - some models require minor improvements in order to meet process operating criteria
  - How it's done
    - modelling parameters previously held constant can be varied to improve model results
    - the fine-tuning methodology is typically researcher-specific
Evaluating Model Performance
- In many situations, more than one good model can be developed
- The best model is the one that both
  - meets the modelling needs initially identified
  - offers the smallest prediction errors
- We therefore need to be able to evaluate model performance
- Prediction errors can be assessed graphically
  - a visual representation of missed predictions
- Prediction errors can also be assessed using statistics (sketched below)
  - absolute measures of error
    - mean absolute error (MAE)
    - maximum absolute error
  - relative measures of error
    - mean absolute percent error (MAPE)
  - coefficients of correlation
    - coefficient of correlation (r)
    - coefficient of multiple correlation (R)
  - coefficients of determination
    - coefficient of determination (r²)
    - coefficient of multiple determination (R²)
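A sketch of the error measures listed above, assuming NumPy arrays of actual and predicted output values; for a single-output model the coefficient of determination is taken here as the square of the correlation coefficient.

```python
import numpy as np

def error_measures(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    residuals = actual - predicted
    r = np.corrcoef(actual, predicted)[0, 1]
    return {
        "MAE": np.mean(np.abs(residuals)),              # mean absolute error
        "max_abs_error": np.max(np.abs(residuals)),
        "MAPE": 100.0 * np.mean(np.abs(residuals / actual)),
        "r": r,                                         # correlation coefficient
        "r2": r ** 2,                                   # determination
    }
```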
- Model residuals should also be studied (see the plotting sketch below)
  - residual = prediction error (actual minus predicted value)
  - residuals should
    - be Normally distributed with a mean of 0
      - plot a histogram of the residuals
    - be independent
      - plot the residuals (in time order, if applicable); the plot should be free of obvious trends
    - have constant variance
      - plot the residuals against the predicted values; the plot should not show spreading, converging, or other trends
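A hedged matplotlib sketch of the three residual checks above: a histogram (normality, mean near 0), a time-order plot (independence), and a residual-versus-predicted plot (constant variance).

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plots(actual, predicted):
    residuals = np.asarray(actual, float) - np.asarray(predicted, float)
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].hist(residuals, bins=20)             # should look Normal, mean ~ 0
    axes[1].plot(residuals, marker=".")          # should be free of trends
    axes[2].scatter(predicted, residuals, s=10)  # spread should stay constant
    for ax, title in zip(axes, ["Histogram", "Time order", "vs. predicted"]):
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```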
Model Evaluation Using Real-time Data
- Need to consider
  - changes in the frequency of data collection
  - changes in the methodology of data collection
  - the existence of QA/QC protocols to detect erroneous data
- The evaluation can take the form of
  - simulated real-time testing on a stand-alone PC
  - online testing in real time
- Methodology
  - select the time frame of the test
  - port the data to the developed models
  - process each data pattern and record the results
  - compare the model-predicted values to the actual values
  - determine the prediction errors