Title: Real World Machine Learning
1Zeus: Genetic Algorithms and Support Vector Machines for Time Series Classification
- Damian Eads (1,2), Daniel Hill (2), Sean Davis (1), Simon Perkins (1), Junshui Ma (1), Reid Porter (1), and James Theiler (1)
(1) Nonproliferation and International Security Division, Los Alamos National Laboratory, MS D436, Los Alamos, NM 87545
(2) Department of Computer Science, Rochester Institute of Technology, 102 Lomb Memorial Drive, Rochester, NY 14623
2What is Zeus?
- A software system that generates time series classifiers.
- Principal application: classifying lightning events.
- Named after the supreme ruler of Mount Olympus.
3FORTE Satellite
- Equipped with a suite of optical and radio-frequency (RF) instruments.
- Collected 3 million lightning events in 22-MHz subbands over its lifetime.
4Purpose
- Develop a more sophisticated weather monitoring system.
- Improve our understanding of storm evolution.
- Explore the concept of feature selection for Support Vector Machines.
5Data Acquisition
Ground Station
Transmission
Triggering Event
Preprocessing
Zeus Classifier Generator
6Preprocessing
- Load Very High Frequency (VHF) Data
- Derive Spectrogram via Fourier Transform
- Produce Power Density Time Series
Example CG Event
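The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the Zeus implementation: the frame length, the hop size, and the choice to collapse each spectrogram column into one value by summing power across frequency bins are all assumptions, and the function names are hypothetical.

```python
import cmath
import math

def dft_power(frame):
    """Power |X_k|^2 for the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        powers.append(abs(acc) ** 2)
    return powers

def spectrogram(signal, frame_len, hop):
    """One power spectrum per frame; frames start every `hop` samples."""
    return [dft_power(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]

def power_density_series(spec):
    """Assumed reduction: total power across all frequency bins of each frame."""
    return [sum(row) for row in spec]

# Toy stand-in for VHF data: silence followed by a short oscillating burst.
signal = [0.0] * 32 + [math.sin(2 * math.pi * 0.25 * t) for t in range(32)]
series = power_density_series(spectrogram(signal, frame_len=8, hop=8))
```

The resulting `series` is flat during the silent part and rises once the burst begins, which is the kind of one-dimensional power profile a time series classifier can work on.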
7Classes of Lightning
- Cloud-to-Ground
  - Positive Initial Return Stroke (CG)
  - Negative Initial Return Stroke (IR)
  - Subsequent Negative Initial Return Stroke (SR)
- Intra-Cloud
  - Impulsive Event (I)
  - Trans-ionospheric Pulse Pair (TIPP/I2)
  - Gradual Intra-Cloud Stroke (KM)
- Off Record (O)
8Examples of Power Densities
9Zeus Software System
- Implemented in C.
- Uses the libsvm Support Vector Machine Library.
- Runs on an Intel processor-based Linux Workstation.
- Performance measurement code written in C, MATLAB, and bash.
10Front/Back-end Architecture
Time Series
Zeus Classifier Generator
Genetic Algorithm
Feature Extraction
Front-end Stochastic Search
Classification
Support Vector Machine
Back-end Classification
Time Series Classifier
11Classifier Architecture
Zeus Classifier Generator
Time Series
TIME SERIES CLASSIFIER
FEATURE EXTRACTOR
FEATURE SET
MODEL
Prediction Result
12Support Vector Machine
- Projects the n-dimensional feature space into a higher-dimensional space.
- Uses a non-linear mapping defined by a kernel function.
- Maximizes the margin.
- See Vladimir Vapnik's The Nature of Statistical Learning Theory for more information.
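To make the kernel idea concrete, here is a small sketch showing that a kernel function computes an inner product in a higher-dimensional space without building that space explicitly. The degree-2 polynomial kernel is chosen only because its explicit mapping is easy to write down; the slides do not say which kernel Zeus used, so this choice is an assumption.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel on 2-D inputs: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit non-linear mapping from 2-D into 3-D matching poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Ordinary inner product."""
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, y)     # computed in the original 2-D space
k_explicit = dot(phi(x), phi(y))    # computed in the mapped 3-D space
```

Both values agree (up to floating-point rounding), which is why an SVM can search for a maximum-margin separator in the mapped space while only ever evaluating the kernel.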
13Genetic Algorithm
- A. Produce Initial Population
- B. Evaluate Chromosomes
- C. Perform Selection
- D. Perform Sexual Recombination of parents to produce new population.
- E. Based on a probability, perform mutation.
- F. If the stopping criterion is not met, go to step B.
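Steps A through F can be sketched as a short loop. This is a generic illustration rather than the Zeus code: chromosomes here are plain bit lists instead of operator trees, selection is a two-way tournament, recombination is uniform, and the stopping criterion is a fixed number of generations. All of those are assumptions; the population size of 15 and the 50 generations mirror the settings listed on the results slides.

```python
import random

def run_ga(fitness, n_genes, pop_size=15, generations=50, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    # A. Produce initial population (random bit strings in this toy version).
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):                 # F. stop after a fixed number of generations
        scores = [fitness(c) for c in pop]       # B. evaluate chromosomes

        def select():                            # C. selection (tournament of two)
            i, j = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[i] if scores[i] >= scores[j] else pop[j]

        new_pop = []
        for _ in range(pop_size):
            mother, father = select(), select()
            # D. recombination: each gene comes from a randomly chosen parent.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # E. mutation: flip each gene with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)    # remember the best chromosome seen so far
    return best, pop

# Toy fitness: number of ones in the chromosome.
best, population = run_ga(fitness=sum, n_genes=20)
```

In Zeus the fitness call would be the expensive part, since evaluating a chromosome means extracting features and training a classifier.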
14Chromosome
- Composed of primitive statistical, arithmetic, and signal processing operators.
- Each gene (or algorithm) is represented as a tree, accepts both scalar and series input, and outputs scalar features.
- The chromosome produces a feature vector set.
15Chromosome Representation
Time Series
Genes
Chromosome
y
Feature Vector
x1
xi
xn
16Example Chromosome
(define-feature-selector
  '((ratio-3 s1 '(0))
    (skew (gs (chunk s1 '(0.33 0.5)) '(15)))
    (int-t s1 '(0.73 0.98))
    (sum s1)
    (kurt s1)
    (kurt (drv (lcomb s1 (drv s1))))
    (skew s1)
    (max (drv (gs s1)))
    (/ (int-t s1) (sum (drv s1)))
    (ratio-3 s1 '(4))))
Interpretation of the first two features: The first feature is the ratio of the average power over the first 266 microseconds of the signal to the average power over the last 266 microseconds. The second feature is the skewness of the smoothed power density from 266 to 400 microseconds.
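A chromosome like the one above can be read as a list of expression trees, each evaluated against the time series s1 to yield one scalar feature. The sketch below is a hypothetical mini-evaluator covering only a few operators whose meaning is fairly clear from their names. Treating `drv` as a first difference and `chunk` as a subseries selected by fractions of the series length are assumptions, and the three genes are illustrative rather than taken from the slide.

```python
def skewness(xs):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

OPS = {
    "sum": lambda s: sum(s),
    "max": lambda s: max(s),
    "skew": skewness,
    "drv": lambda s: [b - a for a, b in zip(s, s[1:])],                      # assumed: first difference
    "chunk": lambda s, f: s[int(f[0] * len(s)):int(f[1] * len(s))],          # assumed: fractional subseries
    "/": lambda a, b: a / b,
}

def evaluate(gene, series):
    """Recursively evaluate one gene (a nested tuple) against the series."""
    if gene == "s1":
        return series
    if not isinstance(gene, tuple) or gene[0] not in OPS:
        return gene                      # literal parameter such as (0.33, 0.5)
    op, *args = gene
    return OPS[op](*(evaluate(a, series) for a in args))

chromosome = [
    ("sum", "s1"),
    ("skew", ("chunk", "s1", (0.33, 0.5))),
    ("/", ("max", "s1"), ("sum", ("drv", "s1"))),
]
s1 = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
features = [evaluate(g, s1) for g in chromosome]
```

Each gene contributes one entry, so a chromosome with ten genes yields a ten-element feature vector for every time series.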
17Primitive Operators
- Minimum
- Maximum
- Ratio of Means
- Add, Subtract, Multiply, Divide
- Subseries
- Subsampling
- Derivative Approximation
- Convolution Filtering
- Mean
- Standard Deviation
- Variance
- Skewness
- Kurtosis
- Integral
- Sum
- Linear Combination
18Crossover Operators
- Uniform
- Single-point
- GP (Genetic Programming) Crossover
19Uniform Crossover
Procedure: For each gene, randomly select a parent and place the corresponding gene into the child.
Mother
Father
Child
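A minimal sketch of uniform crossover, assuming both parents are equal-length lists of genes (the function name is hypothetical):

```python
import random

def uniform_crossover(mother, father, rng):
    """For each gene position, copy the gene from a randomly chosen parent."""
    return [rng.choice((m, f)) for m, f in zip(mother, father)]

mother = ["m0", "m1", "m2", "m3", "m4"]
father = ["f0", "f1", "f2", "f3", "f4"]
child = uniform_crossover(mother, father, random.Random(0))
```

Every position in `child` holds either the mother's or the father's gene for that same position, so gene order is preserved.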
20Single-Point Crossover
Procedure: Select a cut point. Place the mother's genes in the child before the cut point. Place the father's genes after the cut point.
Mother
Cut Point
Father
Child
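Single-point crossover reduces to two list slices. The sketch below assumes equal-length parents and takes the cut point as an argument; in practice it would usually be drawn at random.

```python
def single_point_crossover(mother, father, cut):
    """Mother's genes before the cut point, father's genes from the cut point on."""
    return mother[:cut] + father[cut:]

mother = ["m0", "m1", "m2", "m3"]
father = ["f0", "f1", "f2", "f3"]
child = single_point_crossover(mother, father, cut=1)
# child is ["m0", "f1", "f2", "f3"]
```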
21GP Crossover
- For each gene, select a compatible branch from
each parent, and swap them.
Mother Gene
Father Gene
Child Gene
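GP crossover works on the tree inside a gene rather than on whole genes. The sketch below is one possible reading: genes are nested lists, and two branches count as "compatible" when both return a series or both return a scalar. That compatibility rule, the set of series-returning operators, and the helper names are assumptions made for illustration.

```python
import copy
import random

SERIES_OPS = {"drv", "gs", "chunk"}   # assumed to return a series

def returns_series(node):
    return node == "s1" or (isinstance(node, list) and node[0] in SERIES_OPS)

def paths(node, prefix=()):
    """Yield the index path to every node below the root."""
    if isinstance(node, list):
        for i, child in enumerate(node[1:], start=1):
            yield prefix + (i,)
            yield from paths(child, prefix + (i,))

def get(node, path):
    for i in path:
        node = node[i]
    return node

def put(node, path, value):
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = value

def gp_crossover(mother, father, rng):
    """Swap one randomly chosen pair of compatible branches between copies of the parents."""
    m, f = copy.deepcopy(mother), copy.deepcopy(father)
    pairs = [(p, q) for p in paths(m) for q in paths(f)
             if returns_series(get(m, p)) == returns_series(get(f, q))]
    if pairs:
        p, q = rng.choice(pairs)
        a, b = get(m, p), get(f, q)
        put(m, p, b)
        put(f, q, a)
    return m, f

mother = ["skew", ["drv", "s1"]]
father = ["max", ["gs", ["chunk", "s1"]]]
child_a, child_b = gp_crossover(mother, father, random.Random(0))
```

Because only type-compatible branches are exchanged, each child remains a valid tree that still outputs a scalar at its root.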
22Mutation
- Algorithm Randomization: completely randomize a specific gene.
- Hoisting: select a cut point and a grab point. Delete the node at the cut point and its descendants, and insert the subtree from the grab point at the cut point.
Cut Point
Grab Point
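Hoisting can be sketched on a nested-list gene as follows. The sketch assumes the grab point lies inside the subtree rooted at the cut point, so the grab path is given relative to the cut point; the slide does not state this explicitly, and the function name is hypothetical.

```python
import copy

def hoist(gene, cut, grab):
    """Replace the subtree at path `cut` with its own descendant at relative path `grab`."""
    gene = copy.deepcopy(gene)
    kept = gene
    for i in cut:            # walk down to the cut point
        kept = kept[i]
    for i in grab:           # continue down to the grab point
        kept = kept[i]
    if not cut:              # cutting at the root: the grabbed subtree becomes the gene
        return kept
    parent = gene
    for i in cut[:-1]:
        parent = parent[i]
    parent[cut[-1]] = kept   # the cut node and its other descendants are dropped
    return gene

gene = ["kurt", ["drv", ["lcomb", "s1", ["drv", "s1"]]]]
mutated = hoist(gene, cut=(1,), grab=(1, 2))
# mutated is ["kurt", ["drv", "s1"]]
```

Hoisting always shrinks or preserves tree size, which helps keep genes from growing without bound over many generations.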
23Fitness Evaluation
- In-sample Classification Rate: simply calculates the in-sample classification rate.
- 10-Fold Cross Validation Score: provides an estimate of how well a chromosome will perform on unseen (out-of-sample) data.
24Fitness: In-sample Rate
[Diagram: Chromosome, Feature Extractor, Processed Set, Training Set, Train SVM, Model, Classifier, Result Set, Score; the feature extractor reduces 3181 features to 10 features.]
- Steps
- Run Feature Extractor
- Produce Training Set
- Train SVM
- Produce Model
- Run Classifier
- Produce Result Set
- Calculate Score
25Fitness: N-Fold Cross Validation
[Diagram: Chromosome, Feature Extractor, Processed Set, Training Set, Testing Partitions, Model, Classifier, Result Set, Score; the feature extractor reduces 3181 features to 10 features. A "Finished?" decision loops back on No and proceeds to Score on Yes.]
- Steps
- A. Run Feature Extractor
- B. Produce Training Set
- C. Produce Testing Partitions
- D. Train on Complement
- E. Produce Model
- F. Predict Labels of Test Set
- G. Score if finished; otherwise, go to step D.
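The cross-validation fitness can be sketched as below. To stay self-contained, a nearest-centroid classifier stands in for the SVM (Zeus itself uses libsvm); the stand-in classifier, the strided fold assignment, and the toy data and feature extractor are all assumptions made for illustration.

```python
def train_centroids(X, y):
    """Stand-in for SVM training: one mean feature vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(model, x):
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))

def cv_fitness(extract, series_list, labels, k=10):
    X = [extract(s) for s in series_list]                     # run feature extractor
    folds = [list(range(i, len(X), k)) for i in range(k)]     # testing partitions
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = train_centroids([X[i] for i in train_idx],    # train on complement
                                [labels[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == labels[i] for i in test_idx)
    return correct / len(X)                                   # score over all partitions

# Toy data: ten low-power and ten high-power series.
series_list = ([[0.1 * j, 1, 0, 1] for j in range(10)] +
               [[10 + 0.1 * j, 11, 10, 11] for j in range(10)])
labels = ["low"] * 10 + ["high"] * 10
score = cv_fitness(lambda s: [sum(s) / len(s), max(s)], series_list, labels)
```

Because every sample is scored by a model that never saw it during training, the result estimates out-of-sample performance rather than rewarding chromosomes that merely memorize the training set.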
26Performance Testing
- Another layer of cross-validation is needed.
- Equally sized testing partitions are created.
- An entire Zeus run is performed on each testing partition complement.
- After 10 runs are complete, a final 10-fold cross validation score is calculated.
- If in-sample is the fitness criterion, 90% of the training data is used to train an SVM; otherwise 81% of the training data is used.
- Tested against a raw SVM without feature selection.
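The 90% and 81% figures follow from nesting two layers of 10-fold cross-validation: the outer layer holds out one tenth, and the inner (fitness) layer holds out one tenth of what remains. A small sketch of that arithmetic, with a hypothetical function name:

```python
from fractions import Fraction

def training_fraction(fitness, outer_folds=10, inner_folds=10):
    """Fraction of all data an SVM is trained on during fitness evaluation."""
    outer = Fraction(outer_folds - 1, outer_folds)        # complement of one outer partition
    if fitness == "in-sample":
        return outer                                      # no inner hold-out
    return outer * Fraction(inner_folds - 1, inner_folds) # inner cross-validation hold-out

in_sample = training_fraction("in-sample")   # 9/10
cross_val = training_fraction("cv")          # 81/100
```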
27Results
- 10 Features, In-sample for Fitness, 50
Generations, Pop. size of 15
28Results
- 10 Features, 10-Fold CV for Fitness, 50
Generations, Pop. size of 15
29Results
- 49 Features, In-sample for Fitness, 50
Generations, Pop. size of 15
30Results
- 49 Features, 10-Fold CV for Fitness, 50
Generations, Pop. size of 15
31Results
- 3181 Features, Raw SVM without Zeus
32Result Summary
33Conclusions
- The 10-fold cross validation fitness function led to higher scores for the outer layer of validation.
- The raw SVM outperformed Zeus by 3.95%, but with significantly more features.
- Fewer features can reduce the strain on satellite resources such as bandwidth and payload storage.
- The raw SVM's parameters may have been over-optimized.
34Future Work
- Implement better primitive operators.
- Use another stochastic search technique to select SVM parameters.
- Facilitate control structures and function definitions.
- Add more lightning data to the database (currently 143 samples).
- Explore other data sets.
35Acknowledgements
- Special thanks to the FORTE Project Leader, Abe Jacobson, and the ISIS (Intelligent Searching of Images and Signals) Team; without their support this work would not have been possible.
- This work was supported by Laboratory Directed Research and Development Directed Research (LDRD/DR) funding as well as by funding from various government agencies.
36Questions
?