Title: Anomaly Detection Through a Bayesian SVM
1. Anomaly Detection Through a Bayesian SVM
- Vasilis A. Sotiris
- AMSC 664 Final Presentation
- May 6th 2008
- Advisor: Dr. Michael Pecht
- University of Maryland
- College Park, MD 20783
2. Objectives
- Develop an algorithm to detect anomalies in electronic systems (large multivariate datasets)
- Perform detection in the absence of negative class data (one-class classification)
- Predict future system performance
- Develop an application toolbox, CALCEsvm, to implement a proof of concept on simulated and real data
  - Simulated degradation
  - Lockheed Martin data set
3. Motivation
- With the increasing functional complexity of on-board autonomous systems, there is an increasing demand for system-level
  - health assessment,
  - fault diagnostics, and
  - failure prognostics
- This is of special importance for analyzing intermittent failures, some of the most common failure modes in today's electronics
- There is a need for efficient and reliable prognostics for electronic systems using algorithms that can
  - fuse sensor data,
  - discriminate false alarms from actual failures,
  - correlate faults with relevant system events, and
  - reduce redundant processing elements, which are subject to common-mode failures
4. Algorithm Objectives
- Develop a machine learning approach to
  - detect anomalies in large multivariate systems
  - detect anomalies in the absence of reliable failure data
- Mitigate false alarms and intermittent faults and failures
- Predict future system performance
[Figure: x1-x2 plane showing the distribution of training data, the distribution of fault/failure data, and the fault space]
5. Data Setup
- Data is collected at times T_i from a multivariate distribution of random variables x_1i, ..., x_mi
- The x's are the system covariates
- The X_i are independent random vectors
- Class label y ∈ {-1, 1}
- Estimate the class probability p(class | X) given X
6. Data Decomposition (Models)
- Extract features from the data by constructing lower-dimensional models
- X: training data, X ∈ R^(n×m)
- Singular Value Decomposition (SVD)
- With H, project the data onto the model (M) and residual (R) models (a sketch of this decomposition follows below)
- k: number of principal components (k = 2)
- x_M: the projection of x onto the model space M
- x_R: the projection of x onto the residual space R
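A minimal sketch of this decomposition, assuming H acts as the projection matrix onto the span of the first k right singular vectors (the presentation does not spell out the exact construction):

```python
# Sketch of the SVD-based decomposition described above; the exact role of H
# in CALCEsvm is assumed here to be projection onto the first k principal
# directions of the training data.
import numpy as np

def svd_decompose(X, k=2):
    """Split the rows of X (n x m) into model-space and residual-space parts."""
    mu = X.mean(axis=0)                 # center so directions describe variation
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # m x k basis for the model space M
    H = V_k @ V_k.T                     # assumed projection matrix onto M
    X_M = Xc @ H                        # projection onto the model space M
    X_R = Xc - X_M                      # remainder lives in the residual space R
    return X_M, X_R, mu, V_k

# Example with simulated correlated covariates:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_M, X_R, mu, V_k = svd_decompose(X, k=2)
```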
7. Two-Class Support Vector Machines
[Figure: mapping Φ from the input space to the feature space, where the linear solution D(x) is found, and back to the input space]
- Given nonlinearly separable labeled data x_i with labels y_i ∈ {1, -1}
- Solve a linear optimization problem to find w and b in the feature space
- Form a nonlinear decision function by mapping back to the input space
- The result is a decision boundary on the given training set that can be used to classify new observations
8. Two-Class Support Vector Machines
- Interested in the function that best separates two classes of data
- The margin M = 2/||w|| can be maximized by minimizing ||w||
- The learning problem is stated as a constrained minimization of ||w||, subject to every training point being classified correctly with margin at least 1 (the standard form is sketched below)
- The classifier function D(x) is constructed with the appropriate w and b (b sets the distance from the origin to D(x) = 0)
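For reference, the standard hard-margin primal problem from the SVM literature, consistent with the margin M = 2/||w|| quoted above (the slide's own equation image is not reproduced in the text):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
y_i\left(w^{\mathsf T} x_i + b\right) \ge 1,\quad i = 1,\dots,n,
\qquad\text{with}\qquad
D(x) = w^{\mathsf T} x + b .
```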
9. Two-Class Support Vector Machines
- Lagrangian function L_P
- Instead of minimizing L_P w.r.t. w and b, minimize the dual L_D w.r.t. a (a numerical sketch follows below)
- where H is the Hessian matrix, H_ij = y_i y_j x_i^T x_j
- a = (a_1, ..., a_n)
- and p is a unit vector
- KKT conditions
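A minimal numerical sketch of this dual problem (my own illustration, not the CALCEsvm code), written as the quadratic program min (1/2) a^T H a - p^T a with the constraints implied by the KKT conditions; the upper bound C on a_i is an assumed soft-margin choice:

```python
# Dual SVM sketch: minimize (1/2) a^T H a - p^T a with H_ij = y_i y_j x_i^T x_j,
# subject to sum_i a_i y_i = 0 and 0 <= a_i <= C (C is an assumed bound).
import numpy as np
from scipy.optimize import minimize

def fit_linear_svm_dual(X, y, C=10.0):
    n = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T    # H_ij = y_i y_j x_i^T x_j
    p = np.ones(n)                               # the "unit vector" p of the slide

    objective = lambda a: 0.5 * a @ H @ a - p @ a
    grad = lambda a: H @ a - p

    res = minimize(objective, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    a = res.x
    w = ((a * y)[:, None] * X).sum(axis=0)       # w = sum_i a_i y_i x_i (KKT)
    sv = (a > 1e-6) & (a < C - 1e-6)             # margin support vectors
    if not np.any(sv):
        sv = a > 1e-6
    b = np.mean(y[sv] - X[sv] @ w)               # b from KKT complementarity
    return w, b, a

# D(x) = w^T x + b classifies a new observation by its sign.
```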
10. Two-Class Support Vector Machines
- In the nonlinear case, use a kernel function Φ centered at each x (a kernel sketch follows the figure below)
- Form the same optimization problem, where the inner products in H are replaced by kernel evaluations
- Argument: the resulting function D(x) is the best classifier for the given training set
[Figure: x1-x2 plane showing the decision boundary D(x) = 0, the margins D(x) = -1 and D(x) = 1, the support vectors, the distribution of training data, and the distribution of fault/failure data]
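A sketch of the kernelized version, assuming a Gaussian (RBF) kernel for the mapping Φ (the presentation does not state which kernel CALCEsvm uses):

```python
# The same dual QP is solved, but the linear Hessian H_ij = y_i y_j x_i^T x_j
# is replaced by H_ij = y_i y_j K(x_i, x_j) built from kernel evaluations.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_hessian(X, y, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return (y[:, None] * y[None, :]) * K     # H_ij = y_i y_j K(x_i, x_j)

# With the dual variables a_i, the nonlinear decision function becomes
# D(x) = sum_i a_i y_i K(x_i, x) + b, evaluated on new observations.
```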
11. Bayesian Interpretation of D(x)
- The classification y ∈ {-1, 1} for any x is equivalent to asking whether p(Y = 1 | X = x) is greater or less than p(Y = -1 | X = x)
- An optimal classifier y_MAP maximizes the conditional probability
- D(x) is obtained from the quadratic optimization problem
- It can be shown that D(x) is the maximum a posteriori (MAP) solution to P(Y = y | X = x), i.e., P(class | data), and is therefore the optimal classifier of the given two classes
- y = +1 if D(x) ≥ 0; y = -1 if D(x) < 0
12. One-Class Training
- In the absence of negative class data (fault or failure information), a one-class classification approach is used
- X = (X1, X2): bivariate distribution
- Likelihood of the positive class: L = p(X = x_i | y = 1)
- Class label y ∈ {-1, 1}
- Use the margin of this likelihood to construct the negative class
[Figure: likelihood surface L over the (X1, X2) plane]
13. Nonparametric Likelihood Estimation
- If the probability that any data point x_i falls into the k-th bin is r, then the probability of a set of data x_1, ..., x_m falling into the k-th bin is given by a binomial distribution
- Total sample size: n
- Number of samples in the k-th bin: m
- Region defined by the bin: R
- MLE of r
- Density estimate (standard forms are given below)
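For reference, the standard nonparametric estimates these bullets point to (textbook forms, since the slide's equation images are not reproduced):

```latex
P(m \mid n, r) = \binom{n}{m}\, r^{m} (1 - r)^{\,n - m},
\qquad
\hat r = \frac{m}{n},
\qquad
\hat p(x) \approx \frac{m}{n\,V},
```

where V is the volume of the bin R.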
14. Likelihood Estimation with a Gaussian Kernel φ
- V: the volume of R
- For a uniform kernel, m is the number of data points falling in R
- Kernel function φ
- Points x_i that are close to the sample point x receive higher weight
- The resulting density f_φ(x) is smooth
- The bandwidth h is selected according to a nearest-neighbor algorithm (see the sketch below)
- Each bin R contains k_n data points
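A minimal sketch of such an estimator, assuming the bandwidth at each evaluation point is the distance to its k_n-th nearest training point (one common nearest-neighbor choice; the presentation's exact rule is not given):

```python
# Gaussian kernel density estimate with a nearest-neighbor bandwidth: nearby
# training points receive higher weight, and the result f_phi(x) is smooth.
import numpy as np

def knn_bandwidth(x, X_train, k_n=10):
    d = np.sort(np.linalg.norm(X_train - x, axis=1))
    return max(d[min(k_n, len(d) - 1)], 1e-12)   # distance to the k_n-th neighbor

def gaussian_kde(x, X_train, k_n=10):
    """Smooth density estimate f_phi(x) at a single evaluation point x."""
    n, m = X_train.shape
    h = knn_bandwidth(x, X_train, k_n)
    u = (X_train - x) / h
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi) ** (m / 2))
    return phi.sum() / (n * h**m)
```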
15. Estimate of the Negative Class
- The negative class is estimated based on the likelihood of the positive class (training data)
- A threshold t is used to estimate the likelihood ratio of positive to negative class probability for the given training data (one possible realization is sketched below)
- A 1D cross-section of the density illustrates the idea of the threshold ratio
[Figure: 1D cross-section of the density, with the threshold separating the positive (high-likelihood) region from the negative (low-likelihood) region]
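One possible realization of this step (an assumption for illustration; the thesis may construct the negative class differently): draw candidate points around the training data and keep those whose estimated positive-class likelihood falls below the threshold t.

```python
# Candidates in the low-likelihood margin of the estimated density are kept
# as the artificial negative class. The sampling box and its scale factor are
# illustrative choices, not taken from the presentation.
import numpy as np

def estimate_negative_class(X_train, density, t, n_candidates=2000,
                            scale=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = hi - lo
    # Sample candidate points from a box slightly larger than the training data.
    cand = rng.uniform(lo - scale * span, hi + scale * span,
                       size=(n_candidates, X_train.shape[1]))
    dens = np.array([density(c, X_train) for c in cand])
    return cand[dens < t]          # low-likelihood points form the negative class

# Usage with the KDE sketch above: X_neg = estimate_negative_class(X_train, gaussian_kde, t=1e-3)
```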
16. D(x) as a Sufficient Statistic
- D(x) can be used as a sufficient statistic to classify a data point x
- Argument: since D(x) is the optimal classifier, the posterior class probabilities are related to the data's distance to D(x) = 0
- These probabilities can be modeled by a logistic distribution centered at D(x) = 0
17. Posterior Class Probability
- The positive posterior class probability is given by a logistic distribution
- Use D(x) as the sufficient statistic for the classification of x_i by replacing a_i with D(x_i)
- Simplify
- Obtain MLEs for the parameters A and B (a fitting sketch follows below)
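A sketch of the assumed logistic (Platt-style) model and its maximum likelihood fit for A and B, using the decision values D(x_i) as the sufficient statistic:

```python
# Assumed form: p(y = +1 | x) = 1 / (1 + exp(A * D(x) + B)), with A and B fit
# by maximum likelihood on the training decision values D(x_i).
import numpy as np
from scipy.optimize import minimize

def fit_platt(D_vals, y, A0=-1.0, B0=0.0):
    """MLE of (A, B); y in {-1, +1}, D_vals are the decision values D(x_i)."""
    t = (y + 1) / 2                                  # targets in {0, 1}
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * D_vals + B))     # p(y = +1 | x)
        eps = 1e-12
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    res = minimize(nll, [A0, B0], method="Nelder-Mead")
    return res.x                                     # A_MLE, B_MLE

def posterior_positive(D_vals, A, B):
    return 1.0 / (1.0 + np.exp(A * D_vals + B))
```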
18. Joint Probability Model
- Interested in P = P(Y | X_M, X_R), the joint probability of classification given the two models
- X_M: model space M
- X_R: residual space R
- Assume X_M and X_R are independent
- After some algebra, obtain the joint positive and negative posterior class probabilities P(+) and P(-) (one illustrative combination is given below)
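One standard way to combine the two per-model posteriors under the independence assumption and equal class priors (shown for illustration; the slide's own algebraic result is not reproduced in the text):

```latex
P(+) = P\!\left(y = +1 \mid x_M, x_R\right)
     = \frac{p_M\, p_R}{\,p_M\, p_R + (1 - p_M)(1 - p_R)\,},
\qquad
P(-) = 1 - P(+),
```

where p_M and p_R are the positive posterior probabilities obtained from the model space and residual space, respectively.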
19. Case Studies
- Simulated degradation
- Lockheed Martin Dataset
20. Case Study I: Simulated Degradation
- Given
  - Simulated correlated data
  - X1: gamma, X2: Student's t, X3: beta
- Degradation modeling
  - A period of healthy data
  - Three successive periods of increasingly larger changes in the mean of each parameter
- Expecting the posterior classification probability to reflect these four periods accordingly
  - The first period with a probability close to 1
  - A decreasing trend over the three successive periods
[Figure: simulated parameter x1 plotted against observation index]
21. Case Study I Results: Simulated Degradation
- Results: a plot of the joint positive classification probability
[Figure: joint positive classification probability over time, showing the four periods P1, P2, P3, and P4]
22. Case Study II: Lockheed Martin Data (Known Faulty Periods)
- Given: data set from Lockheed Martin
- Type of data: server data, unknown parameters
- Multivariate: 22 parameters, 2741 observations
- Healthy period (T): observations 0 - 800
- Fault periods: observations F1 = 912 - 1040, F2 = 1092 - 1106, F3 = 1593 - 1651
- Training data constructed from a sample of period T, with size n = 140
- Goal
  - Detect the onset of the known faulty periods without knowledge of unhealthy system characteristics
23. Case Study II: Results
[Figure: classification results showing the healthy period T (observations through 800) and the fault periods F1 (beginning near observation 912) and F2]
24. Comparison Metrics of Code Accuracy (LibSVM vs. CALCEsvm)
- An established and commercially used C SVM software package (LibSVM) was used to test the accuracy of the code
- LibSVM features used: two-class SVM
  - LibSVM does not include classification probabilities for one-class SVM
- Input to LibSVM
  - Positive class: the same training data
  - Negative class: the negative class data estimated by CALCEsvm
- Metric: detection accuracy
  - The count of correct classifications based on two categories
    - Classification label y
    - Correct classification probability estimate
25. Detection Accuracy, LibSVM vs. CALCEsvm (Case Study 1: Degradation Simulation)
- Description of test
  - Period 1 should be captured with a positive-class probability estimate between 80 and 100 percent
  - Period 2: equivalently, between 70 and 85 percent
  - Period 3: between 30 and 70 percent
  - Period 4: between 0 and 40 percent
- Based on just the class index, the detection accuracy of both algorithms was almost identical
- Based on the ranges of probabilities, LibSVM performs better in determining the early stages where the system is healthy, but performs worse than CALCEsvm in detecting degradation
[Figure: probability estimates for periods P1-P4 from LibSVM and CALCEsvm]
26. Detection Accuracy, LibSVM vs. CALCEsvm (Case Study 2: Lockheed Data)
- Description of test
  - The acceptable probability estimate for a correct positive classification should lie between 80 and 100 percent
  - Similarly, the acceptable probability estimate for a negative classification should not exceed 40 percent
- Based on the class index, LibSVM and CALCEsvm perform almost identically, with a small performance improvement for CALCEsvm
- Based on acceptable probability estimates:
  - LibSVM
    - does a poor job of identifying the healthy state between successive faulty periods
    - has much better performance at detecting the anomalies
  - CALCEsvm
    - appears to perform much better overall, correctly identifying the faulty and healthy periods in the data both by class index and by acceptable probability ranges
27. Summary
- For the given data, and on some additional data sets, the CALCEsvm algorithm has accomplished its objectives
  - Detected the time events of known anomalies
  - Identified trends of degradation
- A first-hand comparison of its performance accuracy to LibSVM is good!
28. Backups
29. Dual Form of the Lagrangian Function
- Dual form of the Lagrangian function for the optimization problem, expressed in the L_D space and obtained through the KKT conditions
- subject to the dual constraints (a standard form is given below)
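The textbook dual form, consistent with the Hessian H_ij = y_i y_j x_i^T x_j defined on slide 9 (the slide's own equation image is not reproduced here):

```latex
\max_{a}\; L_D = \sum_{i=1}^{n} a_i
   - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\, y_i y_j\, x_i^{\mathsf T} x_j
\quad \text{subject to} \quad
\sum_{i=1}^{n} a_i y_i = 0, \qquad a_i \ge 0 .
```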
30. Karush-Kuhn-Tucker (KKT) Conditions
- An optimal solution (w*, b*, a*) exists if and only if the KKT conditions are satisfied. In other words, the KKT conditions are necessary and sufficient to solve for w, b, and a in a convex problem (the textbook form is given below)
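For reference, the textbook KKT conditions for this primal/dual pair (not copied from the slide's equation image):

```latex
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i a_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i a_i y_i = 0,
\\[4pt]
a_i \ge 0, \qquad
y_i\left(w^{\mathsf T} x_i + b\right) - 1 \ge 0, \qquad
a_i\left[\, y_i\left(w^{\mathsf T} x_i + b\right) - 1 \,\right] = 0 .
```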
31. Posterior Class Probability
- Interested in finding the maximum likelihood estimates of the parameters A and B
- The classification probability of a set of test data X = {x_1, ..., x_k} into c ∈ {1, 0} is given by a product Bernoulli distribution (the standard form is given below)
- where p_i is the probability of classification when c = 1 (y = 1), and 1 - p_i is the probability of classification when c = 0 (which refers to class y = -1)
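The product Bernoulli likelihood described above, written in its standard Platt-scaling form (the slide's equation image is not reproduced, so this is the textbook expression):

```latex
L(A, B) = \prod_{i=1}^{k} p_i^{\,c_i}\,(1 - p_i)^{\,1 - c_i},
\qquad
p_i = \frac{1}{1 + \exp\!\bigl(A\,D(x_i) + B\bigr)} .
```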
32. Posterior Class Probability
- Maximize the likelihood of the correct classification y for each x_i (MLE)
- Determine the parameters A_MLE and B_MLE from the maximum likelihood equation (above)
- Use A_MLE and B_MLE to compute p_i,MLE
- where p_i,MLE is
  - the maximum likelihood estimator of the posterior class probability p_i (due to the invariance property of the MLE)
  - the best estimate of the classification probability of each x_i
- Currently implemented is