Title: Efficient Neural Network Training Using Subsets of Very Large Datasets
1. Efficient Neural Network Training Using Subsets of Very Large Datasets
- Srinivas Vadrevu
- University of Minnesota Duluth
2. Overview
- Very large amounts of data
- Knowledge Discovery in Databases (KDD)
- Machine Learning (ML) algorithms
- Neural Networks in KDD
- Training with memory-sized subsets
3. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
4. Classification Tasks: An Example
- (Figure: sample positive and negative example shapes)
- Concept: determine whether an example is positive or negative
- Input features: Color, Sides, Corner
- Output feature: positive/negative
- Example: Color = red, Sides = 8, Corner = sharp
5. Learning
- Supervised
  - Teacher-labeled data
- Unsupervised
  - No teacher labels
- Neural Networks in classification
  - Supervised
  - Accurate
  - Efficient
  - Robust to noise
6. Knowledge Discovery in Databases (KDD)
- Data, Data, Data!!!
- Single-pass learning
  - Read data into memory once
  - Disk references costly
  - In-memory processing cheap
- Our approach
  - Divide the dataset into memory-sized subsets
  - Train the ANN successively with each subset
7. Basic Idea
8. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
9. A Typical Feed-forward Neural Network
- (Diagram: input units for the Color, Sides, and Corner features of an example, plus a bias unit fixed at 1; weighted connections lead through hidden units to an output unit labeled positive/negative. Activation flows forward through the network, error flows backward.)
10. Learning in Neural Networks
- Activation
  - Input units are set by the input features
  - For every other unit k, determine its net input:
    net_k = Σ_{j ∈ LinkedTo(k)} w_{j→k} · a_j
  - Then calculate the activation a_k from net_k, e.g., with the sigmoid function (see the sketch below)
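A minimal Python sketch of this computation; the `weights` and `activation` structures are illustrative stand-ins for the network's actual representation:

```python
import math

def sigmoid(x):
    """Logistic activation function g(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_activation(k, linked_to, weights, activation):
    """Compute a_k = g(net_k), where net_k is the weighted sum of the
    activations of the units j that feed into unit k."""
    net_k = sum(weights[(j, k)] * activation[j] for j in linked_to[k])
    return sigmoid(net_k)
```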
11. Backpropagation
- Learning in multi-layer feed-forward ANNs
- For each example:
  - Propagate activation forward
  - Propagate error backward:
    - Compute error for the output units
    - Backpropagate error to the hidden units
  - Update weights with gradient descent (see the sketch below)
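As an illustration of one such update, here is a minimal numpy sketch for a single-hidden-layer sigmoid network; the matrix names and the omission of bias units are simplifications for this example, not the network used in the experiments:

```python
import numpy as np

def backprop_step(x, target, W_ih, W_ho, eta=0.1):
    """One stochastic backpropagation update (squared error, sigmoid units).
    W_ih: input-to-hidden weights, W_ho: hidden-to-output weights."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Propagate activation forward.
    hidden = sig(W_ih @ x)
    output = sig(W_ho @ hidden)

    # Compute error at the outputs, then backpropagate it to the hidden units.
    delta_o = (target - output) * output * (1 - output)
    delta_h = hidden * (1 - hidden) * (W_ho.T @ delta_o)

    # Update weights with gradient descent.
    W_ho += eta * np.outer(delta_o, hidden)
    W_ih += eta * np.outer(delta_h, x)
    return W_ih, W_ho
```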
12. Other Ideas
- Backprop requires many epochs to converge
- Some ideas to overcome this:
  - Stochastic learning
    - Update weights after each training example
  - Momentum (see the sketch below)
    - Add a fraction of the previous update to the current update
    - Faster convergence
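A one-weight sketch of the momentum rule; the names are illustrative:

```python
def momentum_update(weight, grad, prev_delta, eta=0.1, mu=0.9):
    """Gradient-descent step with momentum: the new change is the usual
    step (-eta * grad) plus a fraction mu of the previous change."""
    delta = -eta * grad + mu * prev_delta
    return weight + delta, delta
```

With mu = 0 this reduces to plain stochastic gradient descent; applied after every training example, it combines both ideas on this slide.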
13. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
14. Several Methods to Speed up Learning in Neural Networks
- QuickProp (Fahlman, 1988)
- RProp (Riedmiller and Braun, 1993)
- Dynamic adaptation of the learning rate and momentum (Salomon and van Hemmen, 1996)
- Exploring the error surface (Schmidhuber, 1989)
- Redefining the error function (Balakrishnan and Honavar, 1992)
15. RProp
- Resilient Propagation
- Variant of backpropagation
- Examines only the sign of the partial derivative of the error for each weight
- Computes a per-weight update amount (see the sketch below):
  - If the sign changes, the previous update is retracted and the update amount is decreased
  - Otherwise, the update amount is slightly increased
- May converge more quickly than backprop
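A per-weight sketch of this rule, using the commonly cited default factors (1.2 to grow, 0.5 to shrink) and the weight-backtracking variant; an illustration, not necessarily the exact implementation used in the experiments:

```python
import math

def rprop_update(w, grad, prev_grad, step, prev_delta_w,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One RProp update for a single weight w, given the current and previous
    partial derivatives of the error and the current per-weight step size."""
    if grad * prev_grad > 0:
        # Sign unchanged: slightly increase the update amount.
        step = min(step * eta_plus, step_max)
        delta_w = -math.copysign(step, grad)
        w += delta_w
    elif grad * prev_grad < 0:
        # Sign changed: decrease the update amount and retract the last step.
        step = max(step * eta_minus, step_min)
        w -= prev_delta_w
        delta_w = 0.0
        grad = 0.0  # avoids a second reversal on the next step
    else:
        # One derivative is zero: take a plain signed step.
        delta_w = -math.copysign(step, grad) if grad != 0 else 0.0
        w += delta_w
    return w, step, delta_w, grad
```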
16. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
17. Training with Subsets
- Catlett (1991) trained with subsets of data
  - Classifiers built from subsets were generally inferior
  - Investigated other sampling methods (e.g., stratified sampling)
18. Other Ideas in Training with Subsets
- Breiman, 1999
  - Ensemble of classifiers trained on subsets of the data
- Street and Kim, 2001
  - Similar to Breiman, but decides whether to add each classifier to the ensemble
- Training on subsets of the input features
19. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
20. Basic Idea
21. NN(Subset) Algorithm
- P = number of pages of memory available
- G = (number of data pages) / P (the number of groups)
- Initialize the ANN
- For each of the G partitions (see the sketch below):
  - Randomly select (without replacement) P data pages
  - Train the ANN for N epochs on the current subset
- Output the resulting ANN
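A minimal sketch of this loop; `init_ann` and `train_epochs` are hypothetical stand-ins for the actual network construction and training routines:

```python
import random

def nn_subset(data_pages, P, N, init_ann, train_epochs):
    """Train one ANN successively on memory-sized groups of data pages."""
    ann = init_ann()
    pages = list(data_pages)
    random.shuffle(pages)                  # pages are then drawn without replacement
    G = (len(pages) + P - 1) // P          # number of memory-sized groups
    for g in range(G):
        subset = pages[g * P:(g + 1) * P]  # the current memory-sized subset
        ann = train_epochs(ann, subset, epochs=N)
    return ann
```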
22. NNGrow(Subset) Algorithm
- P = number of pages of memory available
- G = (number of data pages) / P (the number of groups)
- Initialize the ANN
- For each of the G partitions (see the sketch below):
  - Randomly select P data pages
  - Train the ANN for N epochs on the current subset
  - If this is not the last partition:
    - Lower the learning rate of the current weights
    - Add one or more hidden units that use the standard learning rate
- Output the resulting ANN
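The same loop with the growing step; `lower_learning_rate` and `add_hidden_units` are hypothetical helpers standing in for those two operations:

```python
import random

def nngrow_subset(data_pages, P, N, init_ann, train_epochs,
                  lower_learning_rate, add_hidden_units):
    """NN(Subset) plus network growth between successive subsets."""
    ann = init_ann()
    pages = list(data_pages)
    random.shuffle(pages)
    G = (len(pages) + P - 1) // P
    for g in range(G):
        subset = pages[g * P:(g + 1) * P]
        ann = train_epochs(ann, subset, epochs=N)
        if g < G - 1:                        # not the last partition
            lower_learning_rate(ann)         # existing weights now learn more slowly
            add_hidden_units(ann, count=1)   # new units use the standard learning rate
    return ann
```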
23. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
24. Datasets
25. Error (letter-recognition)
26. Error (splice)
27. Discussion
- Different mechanisms perform well on different datasets
- Decision tree learning generally effective
- ANNs perform well, generally better with a larger number of hidden units
- Naïve Bayes and K-Nearest Neighbor perform well on some problems and poorly on others
28. Convergence Results (letter)
29. Convergence Results (adult)
30. Discussion
- ANNs often converge quickly (in fewer than 10 epochs)
- Accuracy after a few epochs is comparable to the final accuracy
- Question: is it possible to adjust learning to always achieve good accuracy quickly?
31. Varying the Number of Hidden Units (letter)
32. Varying the Number of Hidden Units (splice)
33. Discussion
- Determining the network topology is difficult
- For larger numbers of hidden units, convergence may be delayed
- More hidden units generally produce lower error
34. Varying the Learning Rate (letter)
35. Varying the Learning Rate (splice)
36. Varying the Momentum (shuttle)
37. Varying the Momentum (splice)
38. Discussion
- Choosing a single learning rate that works for all problems is impossible
- For momentum close to one, the learner often does not learn
- But higher momentum may produce lower error rates
- Varying the learning rate and momentum can produce faster results
- No single value of either parameter is effective for all problems
39. NN(Subset) Results
40. NN(Subset) Results
41. NN(Subset) vs. NN(Baseline)
42. NN(Subset) vs. NN(Baseline)
43. NNGrow(Subset) Results
44. Subset vs. Baseline (msweb)
45. Subset vs. Baseline (letter)
46. Subset vs. Baseline (shuttle)
47. Subset vs. Baseline (splice)
48. Discussion
- NN(Subset) and NN(Baseline) results are comparable
- NNGrow(Subset) often produces lower error
- Error of the subset methods is often comparable to an ANN trained on the entire dataset
49. Summary and Conclusions
- ANNs converge quickly (often in fewer than 10 epochs)
- It is difficult to reduce training time by altering the network topology or learning parameters
- NN(Subset) and NNGrow(Subset) often produce results comparable to the baseline
- NNGrow(Subset) often produces lower error than NN(Subset)
50. Outline of Talk
- Motivation
- Background
  - Artificial Neural Networks (ANNs)
  - Speeding up ANN training
  - Training with Subsets
- Our Idea: Subset Training
- Experimental Results
- Other Related Work
- Future Work
- Conclusions
51. Breiman's Method
- Select subsets of the data
- Build a new classifier on each subset
- Aggregate it with the previous classifiers
- Compare the error after adding the classifier
- Repeat as long as the error decreases (see the sketch below)
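A rough sketch of this loop as described above; the helpers (`sample_subset`, `train_classifier`, `ensemble_error`) are hypothetical stand-ins, not Breiman's exact procedure:

```python
def breiman_style_ensemble(data, subset_size, sample_subset,
                           train_classifier, ensemble_error):
    """Grow an ensemble from data subsets as long as its error keeps decreasing."""
    ensemble = []
    best_error = float("inf")
    while True:
        subset = sample_subset(data, subset_size)    # select a subset of the data
        candidate = ensemble + [train_classifier(subset)]
        error = ensemble_error(candidate, data)      # error after adding the classifier
        if error >= best_error:                      # stop once the error no longer decreases
            break
        ensemble, best_error = candidate, error
    return ensemble
```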
52. Cascade Correlation (Fahlman and Lebiere, 1990)
- Start with a perceptron (a network with no hidden units)
- Add hidden units until the error plateaus
- Freeze the older weights in the network
- The network topology is not predetermined
53. QuickProp
- Tries to adapt each weight in the network to its optimum value
- Models the error surface as a parabola, using the gradient in the current and previous steps and the previous weight change
- The new weight for the current step is the minimum point of the parabola (see the sketch below)
- Not guaranteed to converge
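An illustrative per-weight version of this update; the growth limit and the gradient-descent fallback are common simplifications, not necessarily Fahlman's exact formulation:

```python
import math

def quickprop_update(grad, prev_grad, prev_delta_w, eta=0.1, mu_max=1.75):
    """Fit a parabola through the current and previous error gradients for
    one weight and step toward its minimum."""
    if prev_delta_w != 0.0 and prev_grad != grad:
        # Minimum of the parabola defined by the two gradient measurements.
        delta_w = (grad / (prev_grad - grad)) * prev_delta_w
        # Limit how much larger the new step may be than the previous one.
        max_step = mu_max * abs(prev_delta_w)
        if abs(delta_w) > max_step:
            delta_w = math.copysign(max_step, delta_w)
    else:
        delta_w = -eta * grad  # fall back to a plain gradient-descent step
    return delta_w
```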
54. Future Work
- Combine other training methods with our approach
- Use the data to determine when to stop learning
  - Use the next subset as a validation set
- An approach similar to Breiman's method
  - Use overlapping subsets
- Use the data to select learning parameters
  - Use subsets as validation sets and alter the network topology and parameters
55. Conclusions
- ANNs often converge quickly
- It is difficult to reduce training time by altering the topology and learning parameters
- Idea: train on large datasets by looking at memory-sized subsets of the data
- The network can be built in one pass over the data, making the approach applicable to KDD
56. Acknowledgements
- I am grateful to my advisor, Dr. Rich Maclin, for providing me the opportunity to work with him and for his valuable guidance
- I also thank Dr. Taek Kwon and Dr. Tim Colburn for their cooperation