Title: Anomaly Detection Through a Bayesian SVM
1. Anomaly Detection Through a Bayesian SVM
- Vasilis A. Sotiris
- AMSC 664 Final Presentation
- May 6th 2008
- Advisor: Dr. Michael Pecht
- University of Maryland
- College Park, MD 20783
2. Objectives
- Develop an algorithm to detect anomalies in electronic systems (large multivariate datasets)
- Perform detection in the absence of negative class data (one-class classification)
- Predict future system performance
- Develop an application toolbox, CALCEsvm, to implement a proof of concept on simulated and real data
  - Simulated degradation
  - Lockheed Martin data set
3. Motivation
- With the increasing functional complexity of on-board autonomous systems, there is an increasing demand for system-level
  - health assessment,
  - fault diagnostics, and
  - failure prognostics
- This is of special importance for analyzing intermittent failures, some of the most common failure modes in today's electronics
- There is a need for efficient and reliable prognostics for electronic systems using algorithms that can
  - fuse sensor data,
  - discriminate false alarms from actual failures,
  - correlate faults with relevant system events, and
  - reduce redundant processing elements, which are subject to common-mode failures
4. Algorithm Objectives
- Develop a machine learning approach to
  - detect anomalies in large multivariate systems
  - detect anomalies in the absence of reliable failure data
- Mitigate false alarms and intermittent faults and failures
- Predict future system performance
[Figure: x1-x2 plane showing the distribution of training data, the distribution of fault/failure data, and the fault space]
5. Data Setup
- Data is collected at times T_i from a multivariate distribution of random variables x_1i, ..., x_mi
- The x's are the system covariates
- The X_i are independent random vectors
- Class label y ∈ {-1, 1}
- Estimate the class probability p(class | X) given X
6. Data Decomposition (Models)
- Extract features from the data by constructing lower-dimensional models
- X: training data, X ∈ R^(n×m)
- Singular Value Decomposition (SVD)
- With H, project the data onto the model (M) and residual (R) models (a sketch of this decomposition follows below)
- k: number of principal components (k = 2)
- x_M: the projection of x onto the model space M
- x_R: the projection of x onto the residual space R
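A minimal sketch of this decomposition, assuming H acts as the projection matrix onto the span of the first k right singular vectors (the presentation does not spell out the exact construction):

```python
# Sketch of the SVD-based decomposition described above; the exact role of H
# in CALCEsvm is assumed here to be projection onto the first k principal
# directions of the training data.
import numpy as np

def svd_decompose(X, k=2):
    """Split the rows of X (n x m) into model-space and residual-space parts."""
    mu = X.mean(axis=0)                 # center so directions describe variation
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # m x k basis for the model space M
    H = V_k @ V_k.T                     # assumed projection matrix onto M
    X_M = Xc @ H                        # projection onto the model space M
    X_R = Xc - X_M                      # remainder lives in the residual space R
    return X_M, X_R, mu, V_k

# Example with simulated correlated covariates:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_M, X_R, mu, V_k = svd_decompose(X, k=2)
```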
7. Two-Class Support Vector Machines
[Figure: mapping Φ from the input space to the feature space, where the linear solution D(x) is found, and back to the input space]
- Given nonlinearly separable labeled data x_i with labels y_i ∈ {1, -1}
- Solve a linear optimization problem to find w and b in the feature space
- Form a nonlinear decision function by mapping back to the input space
- The result is a decision boundary on the given training set that can be used to classify new observations
8. Two-Class Support Vector Machines
- Interested in the function that best separates two classes of data
- The margin M = 2/||w|| can be maximized by minimizing ||w||
- The learning problem is stated as a constrained minimization of ||w||, subject to every training point being classified correctly with margin at least 1 (the standard form is sketched below)
- The classifier function D(x) is constructed with the appropriate w and b (b sets the distance from the origin to D(x) = 0)
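For reference, the standard hard-margin primal problem from the SVM literature, consistent with the margin M = 2/||w|| quoted above (the slide's own equation image is not reproduced in the text):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
y_i\left(w^{\mathsf T} x_i + b\right) \ge 1,\quad i = 1,\dots,n,
\qquad\text{with}\qquad
D(x) = w^{\mathsf T} x + b .
```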
9. Two-Class Support Vector Machines
- Lagrangian function L_P
- Instead of minimizing L_P w.r.t. w and b, minimize the dual L_D w.r.t. a (a numerical sketch follows below)
- where H is the Hessian matrix, H_ij = y_i y_j x_i^T x_j
- a = (a_1, ..., a_n)
- and p is a unit vector
- KKT conditions
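A minimal numerical sketch of this dual problem (my own illustration, not the CALCEsvm code), written as the quadratic program min (1/2) a^T H a - p^T a with the constraints implied by the KKT conditions; the upper bound C on a_i is an assumed soft-margin choice:

```python
# Dual SVM sketch: minimize (1/2) a^T H a - p^T a with H_ij = y_i y_j x_i^T x_j,
# subject to sum_i a_i y_i = 0 and 0 <= a_i <= C (C is an assumed bound).
import numpy as np
from scipy.optimize import minimize

def fit_linear_svm_dual(X, y, C=10.0):
    n = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T    # H_ij = y_i y_j x_i^T x_j
    p = np.ones(n)                               # the "unit vector" p of the slide

    objective = lambda a: 0.5 * a @ H @ a - p @ a
    grad = lambda a: H @ a - p

    res = minimize(objective, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    a = res.x
    w = ((a * y)[:, None] * X).sum(axis=0)       # w = sum_i a_i y_i x_i (KKT)
    sv = (a > 1e-6) & (a < C - 1e-6)             # margin support vectors
    if not np.any(sv):
        sv = a > 1e-6
    b = np.mean(y[sv] - X[sv] @ w)               # b from KKT complementarity
    return w, b, a

# D(x) = w^T x + b classifies a new observation by its sign.
```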
10. Two-Class Support Vector Machines
- In the nonlinear case, use a kernel function Φ centered at each x (a kernel sketch follows the figure below)
- Form the same optimization problem, where the inner products in H are replaced by kernel evaluations
- Argument: the resulting function D(x) is the best classifier for the given training set
[Figure: x1-x2 plane showing the decision boundary D(x) = 0, the margins D(x) = -1 and D(x) = 1, the support vectors, the distribution of training data, and the distribution of fault/failure data]
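A sketch of the kernelized version, assuming a Gaussian (RBF) kernel for the mapping Φ (the presentation does not state which kernel CALCEsvm uses):

```python
# The same dual QP is solved, but the linear Hessian H_ij = y_i y_j x_i^T x_j
# is replaced by H_ij = y_i y_j K(x_i, x_j) built from kernel evaluations.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_hessian(X, y, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return (y[:, None] * y[None, :]) * K     # H_ij = y_i y_j K(x_i, x_j)

# With the dual variables a_i, the nonlinear decision function becomes
# D(x) = sum_i a_i y_i K(x_i, x) + b, evaluated on new observations.
```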
11. Bayesian Interpretation of D(x)
- The classification y ∈ {-1, 1} for any x is equivalent to asking whether p(Y = 1 | X = x) is greater or less than p(Y = -1 | X = x)
- An optimal classifier y_MAP maximizes the conditional probability
- D(x) is obtained from the quadratic optimization problem
- It can be shown that D(x) is the maximum a posteriori (MAP) solution to P(Y = y | X = x), i.e., P(class | data), and is therefore the optimal classifier of the given two classes
- y = +1 if D(x) ≥ 0; y = -1 if D(x) < 0
12. One-Class Training
- In the absence of negative class data (fault or failure information), a one-class classification approach is used
- X = (X1, X2): bivariate distribution
- Likelihood of the positive class: L = p(X = x_i | y = 1)
- Class label y ∈ {-1, 1}
- Use the margin of this likelihood to construct the negative class
[Figure: likelihood surface L over the (X1, X2) plane]
13. Nonparametric Likelihood Estimation
- If the probability that any data point x_i falls into the k-th bin is r, then the probability of a set of data x_1, ..., x_m falling into the k-th bin is given by a binomial distribution
- Total sample size: n
- Number of samples in the k-th bin: m
- Region defined by the bin: R
- MLE of r
- Density estimate (standard forms are given below)
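For reference, the standard nonparametric estimates these bullets point to (textbook forms, since the slide's equation images are not reproduced):

```latex
P(m \mid n, r) = \binom{n}{m}\, r^{m} (1 - r)^{\,n - m},
\qquad
\hat r = \frac{m}{n},
\qquad
\hat p(x) \approx \frac{m}{n\,V},
```

where V is the volume of the bin R.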
14. Likelihood Estimation with a Gaussian Kernel φ
- V: the volume of R
- For a uniform kernel, m is the number of data points falling in R
- Kernel function φ
- Points x_i that are close to the sample point x receive higher weight
- The resulting density f_φ(x) is smooth
- The bandwidth h is selected according to a nearest-neighbor algorithm (see the sketch below)
- Each bin R contains k_n data points
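A minimal sketch of such an estimator, assuming the bandwidth at each evaluation point is the distance to its k_n-th nearest training point (one common nearest-neighbor choice; the presentation's exact rule is not given):

```python
# Gaussian kernel density estimate with a nearest-neighbor bandwidth: nearby
# training points receive higher weight, and the result f_phi(x) is smooth.
import numpy as np

def knn_bandwidth(x, X_train, k_n=10):
    d = np.sort(np.linalg.norm(X_train - x, axis=1))
    return max(d[min(k_n, len(d) - 1)], 1e-12)   # distance to the k_n-th neighbor

def gaussian_kde(x, X_train, k_n=10):
    """Smooth density estimate f_phi(x) at a single evaluation point x."""
    n, m = X_train.shape
    h = knn_bandwidth(x, X_train, k_n)
    u = (X_train - x) / h
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi) ** (m / 2))
    return phi.sum() / (n * h**m)
```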
15. Estimate of the Negative Class
- The negative class is estimated based on the likelihood of the positive class (training data)
- A threshold t is used to estimate the likelihood ratio of positive to negative class probability for the given training data (one possible realization is sketched below)
- A 1D cross-section of the density illustrates the idea of the threshold ratio
[Figure: 1D cross-section of the density, with the threshold separating the positive (high-likelihood) region from the negative (low-likelihood) region]
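One possible realization of this step (an assumption for illustration; the thesis may construct the negative class differently): draw candidate points around the training data and keep those whose estimated positive-class likelihood falls below the threshold t.

```python
# Candidates in the low-likelihood margin of the estimated density are kept
# as the artificial negative class. The sampling box and its scale factor are
# illustrative choices, not taken from the presentation.
import numpy as np

def estimate_negative_class(X_train, density, t, n_candidates=2000,
                            scale=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = hi - lo
    # Sample candidate points from a box slightly larger than the training data.
    cand = rng.uniform(lo - scale * span, hi + scale * span,
                       size=(n_candidates, X_train.shape[1]))
    dens = np.array([density(c, X_train) for c in cand])
    return cand[dens < t]          # low-likelihood points form the negative class

# Usage with the KDE sketch above: X_neg = estimate_negative_class(X_train, gaussian_kde, t=1e-3)
```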
16. D(x) as a Sufficient Statistic
- D(x) can be used as a sufficient statistic to classify a data point x
- Argument: since D(x) is the optimal classifier, the posterior class probabilities are related to the data's distance to D(x) = 0
- These probabilities can be modeled by a logistic distribution centered at D(x) = 0
17. Posterior Class Probability
- The positive posterior class probability is given by a logistic distribution
- Use D(x) as the sufficient statistic for the classification of x_i by replacing a_i with D(x_i)
- Simplify
- Obtain MLEs for the parameters A and B (a fitting sketch follows below)
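A sketch of the assumed logistic (Platt-style) model and its maximum likelihood fit for A and B, using the decision values D(x_i) as the sufficient statistic:

```python
# Assumed form: p(y = +1 | x) = 1 / (1 + exp(A * D(x) + B)), with A and B fit
# by maximum likelihood on the training decision values D(x_i).
import numpy as np
from scipy.optimize import minimize

def fit_platt(D_vals, y, A0=-1.0, B0=0.0):
    """MLE of (A, B); y in {-1, +1}, D_vals are the decision values D(x_i)."""
    t = (y + 1) / 2                                  # targets in {0, 1}
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * D_vals + B))     # p(y = +1 | x)
        eps = 1e-12
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    res = minimize(nll, [A0, B0], method="Nelder-Mead")
    return res.x                                     # A_MLE, B_MLE

def posterior_positive(D_vals, A, B):
    return 1.0 / (1.0 + np.exp(A * D_vals + B))
```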
18. Joint Probability Model
- Interested in P = P(Y | X_M, X_R), the joint probability of classification given the two models
- X_M: model space M
- X_R: residual space R
- Assume X_M and X_R are independent
- After some algebra, obtain the joint positive and negative posterior class probabilities P(+) and P(-) (one illustrative combination is given below)
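One standard way to combine the two per-model posteriors under the independence assumption and equal class priors (shown for illustration; the slide's own algebraic result is not reproduced in the text):

```latex
P(+) = P\!\left(y = +1 \mid x_M, x_R\right)
     = \frac{p_M\, p_R}{\,p_M\, p_R + (1 - p_M)(1 - p_R)\,},
\qquad
P(-) = 1 - P(+),
```

where p_M and p_R are the positive posterior probabilities obtained from the model space and residual space, respectively.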
19. Case Studies
- Simulated degradation
- Lockheed Martin Dataset
20. Case Study I: Simulated Degradation
- Given
  - Simulated correlated data
  - X1: gamma, X2: Student's t, X3: beta
- Degradation modeling
  - A period of healthy data
  - Three successive periods of increasingly larger changes in the mean of each parameter
- Expecting the posterior classification probability to reflect these four periods accordingly
  - The first period with a probability close to 1
  - A decreasing trend over the three successive periods
[Figure: simulated parameter x1 plotted against observation index]
21. Case Study I Results: Simulated Degradation
- Results: a plot of the joint positive classification probability
[Figure: joint positive classification probability over time, showing the four periods P1, P2, P3, and P4]
22. Case Study II: Lockheed Martin Data (Known Faulty Periods)
- Given: data set from Lockheed Martin
- Type of data: server data, unknown parameters
- Multivariate: 22 parameters, 2741 observations
- Healthy period (T): observations 0 - 800
- Fault periods: observations F1 = 912 - 1040, F2 = 1092 - 1106, F3 = 1593 - 1651
- Training data constructed from a sample of period T, with size n = 140
- Goal
  - Detect the onset of the known faulty periods without knowledge of unhealthy system characteristics
23. Case Study II: Results
[Figure: classification results showing the healthy period T (observations through 800) and the fault periods F1 (beginning near observation 912) and F2]
24. Comparison Metrics of Code Accuracy (LibSVM vs. CALCEsvm)
- An established and commercially used C SVM software package (LibSVM) was used to test the accuracy of the code
- LibSVM features used: two-class SVM
  - LibSVM does not include classification probabilities for one-class SVM
- Input to LibSVM
  - Positive class: the same training data
  - Negative class: the negative class data estimated by CALCEsvm
- Metric: detection accuracy
  - The count of correct classifications based on two categories
    - Classification label y
    - Correct classification probability estimate
25. Detection Accuracy, LibSVM vs. CALCEsvm (Case Study 1: Degradation Simulation)
- Description of test
  - Period 1 should be captured with a positive-class probability estimate between 80 and 100 percent
  - Period 2: equivalently, between 70 and 85 percent
  - Period 3: between 30 and 70 percent
  - Period 4: between 0 and 40 percent
- Based on just the class index, the detection accuracy of both algorithms was almost identical
- Based on the ranges of probabilities, LibSVM performs better in determining the early stages where the system is healthy, but performs worse than CALCEsvm in detecting degradation
[Figure: probability estimates for periods P1-P4 from LibSVM and CALCEsvm]
26. Detection Accuracy, LibSVM vs. CALCEsvm (Case Study 2: Lockheed Data)
- Description of test
  - The acceptable probability estimate for a correct positive classification should lie between 80 and 100 percent
  - Similarly, the acceptable probability estimate for a negative classification should not exceed 40 percent
- Based on the class index, LibSVM and CALCEsvm perform almost identically, with a small performance improvement for CALCEsvm
- Based on acceptable probability estimates:
  - LibSVM
    - does a poor job of identifying the healthy state between successive faulty periods
    - has much better performance at detecting the anomalies
  - CALCEsvm
    - appears to perform much better overall, correctly identifying the faulty and healthy periods in the data both by class index and by acceptable probability ranges
27. Summary
- For the given data, and on some additional data sets, the CALCEsvm algorithm has accomplished its objectives
  - Detected the time events of known anomalies
  - Identified trends of degradation
- A first-hand comparison of its performance accuracy to LibSVM is good!
28. Backups
29. Dual Form of the Lagrangian Function
- Dual form of the Lagrangian function for the optimization problem, expressed in the L_D space and obtained through the KKT conditions
- subject to the dual constraints (a standard form is given below)
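The textbook dual form, consistent with the Hessian H_ij = y_i y_j x_i^T x_j defined on slide 9 (the slide's own equation image is not reproduced here):

```latex
\max_{a}\; L_D = \sum_{i=1}^{n} a_i
   - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\, y_i y_j\, x_i^{\mathsf T} x_j
\quad \text{subject to} \quad
\sum_{i=1}^{n} a_i y_i = 0, \qquad a_i \ge 0 .
```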
30. Karush-Kuhn-Tucker (KKT) Conditions
- An optimal solution (w*, b*, a*) exists if and only if the KKT conditions are satisfied. In other words, the KKT conditions are necessary and sufficient to solve for w, b, and a in a convex problem (the textbook form is given below)
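For reference, the textbook KKT conditions for this primal/dual pair (not copied from the slide's equation image):

```latex
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i a_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i a_i y_i = 0,
\\[4pt]
a_i \ge 0, \qquad
y_i\left(w^{\mathsf T} x_i + b\right) - 1 \ge 0, \qquad
a_i\left[\, y_i\left(w^{\mathsf T} x_i + b\right) - 1 \,\right] = 0 .
```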
31. Posterior Class Probability
- Interested in finding the maximum likelihood estimates of the parameters A and B
- The classification probability of a set of test data X = {x_1, ..., x_k} into c ∈ {1, 0} is given by a product Bernoulli distribution (the standard form is given below)
- where p_i is the probability of classification when c = 1 (y = 1), and 1 - p_i is the probability of classification when c = 0 (which refers to class y = -1)
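The product Bernoulli likelihood described above, written in its standard Platt-scaling form (the slide's equation image is not reproduced, so this is the textbook expression):

```latex
L(A, B) = \prod_{i=1}^{k} p_i^{\,c_i}\,(1 - p_i)^{\,1 - c_i},
\qquad
p_i = \frac{1}{1 + \exp\!\bigl(A\,D(x_i) + B\bigr)} .
```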
32. Posterior Class Probability
- Maximize the likelihood of the correct classification y for each x_i (MLE)
- Determine the parameters A_MLE and B_MLE from the maximum likelihood equation (above)
- Use A_MLE and B_MLE to compute p_i,MLE
- where p_i,MLE is
  - the maximum likelihood estimator of the posterior class probability p_i (due to the invariance property of the MLE)
  - the best estimate of the classification probability of each x_i
- Currently implemented is