Title: Analysis of Classification-based Error Functions
1. Analysis of Classification-based Error Functions
- Mike Rimer
- Dr. Tony Martinez
- BYU Computer Science Dept.
- 18 March 2006
2. Overview
- Machine learning
- Teaching artificial neural networks with an error function
- Problems with conventional error functions
- Classification-based (CB) algorithms
- Experimental results
- Conclusion and future work
3. Machine Learning
- Goal: automating learning of problem domains
- Given a training sample from a problem domain, induce a correct solution-hypothesis over the entire problem population
- The learning model is often used as a black box
4. Teaching ANNs with an Error Function
- Used to train a multi-layer perceptron (MLP)
- Guides the gradient descent learning procedure toward an optimal state
- Conventional error metrics are sum-squared error (SSE) and cross entropy (CE)
- SSE is suited to function approximation
- CE is aimed at classification problems
- CB error functions [Rimer & Martinez 06] work better for classification
5. SSE, CE
- Both attempt to approximate 0-1 targets in order to represent making a decision
(Figure: target vector for a pattern labeled as class 2)
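A minimal numeric sketch of what these two metrics do to a single pattern with a one-hot (0-1) target; the four-class output vector below is hypothetical:

```python
import numpy as np

# Hypothetical 4-class pattern whose label is class 2 (0-indexed here),
# so the 0-1 target puts a 1 on that class and 0 everywhere else.
target = np.array([0.0, 0.0, 1.0, 0.0])
output = np.array([0.20, 0.10, 0.60, 0.10])   # hypothetical network outputs

sse = np.sum((target - output) ** 2)            # sum-squared error = 0.22
ce = -np.sum(target * np.log(output + 1e-12))   # cross entropy ~ 0.51

# The pattern is already classified correctly (class 2 has the highest output),
# yet both error measures keep pushing the outputs toward exact 0/1 values.
print(sse, ce)
```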
6. Issues with approximating hard targets
- Requires weights to be large to achieve optimality
- Leads to premature weight saturation
- Weight decay, etc., can improve the situation (sketched below)
- Learns areas of the problem space unevenly and at different times during training
- Makes global learning problematic
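As a concrete illustration of the weight-decay remedy mentioned above (the notation and coefficient value are assumptions, not from the slides), the gradient step simply adds a penalty proportional to each weight, discouraging weights from growing without bound:

```python
def sgd_step_with_weight_decay(w, grad, lr=0.1, decay=1e-4):
    """One stochastic-gradient update with L2 weight decay (sketch).

    w     : current weight value
    grad  : dE/dw from backpropagation for this pattern
    decay : weight-decay coefficient (hypothetical value)
    """
    return w - lr * (grad + decay * w)
```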
7. Classification-based Error Functions
- Designed to more closely match the goal of learning a classification task (i.e. correct classifications, not low error on 0-1 targets), avoiding premature weight saturation and discouraging overfitting
- CB1 [Rimer & Martinez 02, 06]
- CB2 [Rimer & Martinez 04]
- CB3 (submitted to ICML 06)
8. CB1
- Only backpropagates error on misclassified
training patterns
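A minimal sketch in the spirit of CB1 (not the authors' exact formulation): a pattern contributes an error signal only when some competing output ties or beats the target class's output.

```python
import numpy as np

def cb1_error_signal(outputs, target_idx):
    """CB1-style per-pattern error signal (illustrative sketch only).

    Correctly classified patterns return an all-zero signal, so nothing is
    backpropagated for them; misclassified patterns receive error only on
    the target node and on the competing nodes that tie or beat it.
    """
    error = np.zeros_like(outputs)
    competitors = np.arange(len(outputs)) != target_idx
    violating = competitors & (outputs >= outputs[target_idx])
    if not violating.any():
        return error                      # correct: no error is backpropagated
    # Hypothetical signal: raise the target output, lower the violators.
    error[target_idx] = outputs[violating].max() - outputs[target_idx]
    error[violating] = outputs[target_idx] - outputs[violating]
    return error
```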
9. CB2
- Adds a confidence margin, µ, that is increased
globally as training progresses
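A sketch of how a CB2-style margin could be checked; the linear schedule for µ below is an assumption for illustration, not the paper's actual schedule.

```python
import numpy as np

def cb2_needs_error(outputs, target_idx, mu):
    """True if the pattern should still receive an error signal: the target
    output must beat every competing output by at least the margin mu."""
    competitors = np.delete(outputs, target_idx)
    return outputs[target_idx] < competitors.max() + mu

def margin_at(epoch, num_epochs, mu_max=0.5):
    """Hypothetical global schedule: grow mu linearly as training progresses."""
    return mu_max * epoch / num_epochs
```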
10. CB3
- Learns a confidence Ci for each training pattern i as training progresses
- Patterns often misclassified have low confidence
- Patterns consistently classified correctly gain confidence
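A small sketch of a per-pattern confidence update in the CB3 spirit; the step size, clipping range, and neutral starting value are assumptions.

```python
def update_confidence(confidence, classified_correctly, step=0.01):
    """Nudge pattern i's confidence C_i up after a correct classification and
    down after a misclassification, keeping it within [0, 1] (sketch)."""
    if classified_correctly:
        return min(1.0, confidence + step)
    return max(0.0, confidence - step)

# Usage sketch: confidences start neutral and drift with training history.
num_patterns = 100                                   # hypothetical training-set size
confidences = {i: 0.5 for i in range(num_patterns)}
```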
11. Neural Network Training
- Influenced by
- Initial parameter (weight) settings
- Pattern presentation order (stochastic training)
- Learning rate
- Number of hidden nodes
- Goal of training
- High generalization
- Low bias and variance
12. Experiments
- Empirical comparison of six error functions
- SSE, CE, CE w/ WD, CB1-3
- Used eleven benchmark problems from the UC Irvine Machine Learning Repository
- ann, balance, bcw, derm, ecoli, iono, iris, musk2, pima, sonar, wine
- Testing performed using stratified 10-fold cross-validation
- Model selection by hold-out set
- Results were averaged over ten tests
- Learning rate 0.1, momentum 0.7
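The slides do not show the training code; the sketch below only reproduces the evaluation protocol (stratified 10-fold cross-validation, learning rate 0.1, momentum 0.7) with scikit-learn's stock MLP on the iris dataset, since the CB error functions themselves are not part of that library. The hidden-layer size and iteration cap are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)   # iris is one of the eleven UCI benchmarks

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    # Stochastic gradient descent with the slide's settings: LR 0.1, momentum 0.7.
    clf = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                        learning_rate_init=0.1, momentum=0.7, max_iter=500)
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over 10 folds: {np.mean(accuracies):.3f}")
```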
13. Classifier output difference (COD)
- Evaluation of behavioral difference of two
hypotheses (e.g. classifiers)
COD(f, g) = (1/|T|) Σ_{x ∈ T} I(f(x) ≠ g(x)), where T is the test set and I is the identity (characteristic) function
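Under that definition, COD reduces to the fraction of test patterns on which two trained classifiers disagree; a minimal sketch:

```python
import numpy as np

def classifier_output_difference(preds_a, preds_b):
    """COD between two hypotheses: the proportion of test-set patterns on
    which their predicted class labels differ."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Two classifiers disagreeing on 1 of 5 test patterns -> COD = 0.2
print(classifier_output_difference([0, 1, 2, 1, 0], [0, 1, 2, 2, 0]))
```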
14. Robustness to initial network weights
- Averaged over 30 random runs across all datasets
Algorithm  Test acc (%)  St Dev  Epochs
CB3 93.468 4.7792 200.67
CB2 92.839 4.0800 366.69
CB1 92.828 5.3290 514.14
CE 92.789 5.3937 319.57
CE w/ WD 92.251 5.4735 197.24
SSE 91.951 5.6131 774.70
15. Robustness to initial network weights
Algorithm Test error COD
CB3 0.0653 0.0221
CB2 0.0716 0.0274
CB1 0.0717 0.0244
CE 0.0721 0.0248
CE w/ WD 0.0774 0.0255
SSE 0.0804 0.0368
16. Robustness to pattern presentation order
- Averaged over 30 random runs across all datasets
Algorithm  Test acc (%)  St Dev  Epochs
CB3 93.446 5.0409 200.46
CB2 92.641 5.4197 402.52
CB1 92.542 5.473 560.09
CE 92.290 5.6020 329.65
CE w/ WD 91.818 5.6278 221.21
SSE 91.817 5.6653 593.30
17. Robustness to pattern presentation order
Algorithm Test error COD
CB3 0.0655 0.0259
CB2 0.0736 0.0302
CB1 0.0746 0.0282
CE 0.0771 0.0329
CE w/ WD 0.0818 0.0338
SSE 0.0818 0.0344
18. Robustness to learning rate
- Average over learning rates varied from 0.01 to 0.3
Algorithm  Test acc (%)  St Dev  Epochs
CB3 93.175 3.514 334.8
CB2 92.285 3.437 617.8
SSE 92.211 3.449 525.7
CB1 91.908 3.880 505.4
CE 91.629 3.813 466.2
CE w/ WD 91.330 3.845 234.6
19. Robustness to learning rate
20. Robustness to number of hidden nodes
- Average of varying the number of nodes in the
hidden layer from 1 - 30
Algorithm  Test acc (%)  St Dev  Epochs
CB3 93.026 3.397 303.9
CB1 92.291 3.610 381.0
CB2 92.136 3.410 609.4
SSE 92.066 3.402 623.1
CE 91.956 3.563 397.0
CE w/ WD 91.74 3.493 190.6
21. Robustness to number of hidden nodes
22. Conclusion
- CB1-3 are generally more robust than SSE, CE, and CE w/ WD with respect to:
- Initial weight settings
- Pattern presentation order
- Pattern variance
- Learning rate
- Number of hidden nodes
- CB3 is the most robust, giving the most consistent results
23. Questions?