1
CSE 881: Data Mining
  • Lecture 22: Anomaly Detection

2
Anomaly/Outlier Detection
  • What are anomalies/outliers?
  • Data points whose characteristics are
    considerably different from the remainder of the
    data
  • Applications
  • Credit card fraud detection
  • Telecommunication fraud detection
  • Network intrusion detection
  • Fault detection

3
Examples of Anomalies
  • Data from different classes
  • An object may be different from other objects
    because it is of a different type or class
  • Natural (random) variation in data
  • Many data sets can be modeled by statistical
    distributions (e.g., Gaussian distribution)
  • Probability of an object decreases rapidly as its
    distance from the center of the distribution
    increases
  • Chebyshev inequality
  • Data measurement or collection errors

4
Importance of Anomaly Detection
  • Ozone Depletion History
  • In 1985 three researchers (Farman, Gardiner and
    Shanklin) were puzzled by data gathered by the
    British Antarctic Survey showing that ozone
    levels for Antarctica had dropped 10% below
    normal levels
  • Why did the Nimbus 7 satellite, which had
    instruments aboard for recording ozone levels,
    not record similarly low ozone concentrations?
  • The ozone concentrations recorded by the
    satellite were so low they were being treated as
    outliers by a computer program and discarded!

Sources http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
5
Anomalies
  • General characteristics
  • Rare occurrence
  • Deviant behavior compared to the majority of the
    data
  • Distribution
  • Anomalies due to natural variation tend to
    follow a uniform distribution
  • Anomalies from a different class may themselves
    be clustered

6
Anomaly Detection
  • Challenges
  • Method is (mostly) unsupervised
  • Validation can be quite challenging (just like
    for clustering)
  • Small number of anomalies
  • Like finding a needle in a haystack

7
Anomaly Detection Schemes
  • General Steps
  • Build a profile of the normal behavior
  • Profile can be patterns or summary statistics for
    the normal population
  • Use the normal profile to detect anomalies
  • Anomalies are observations whose
    characteristics differ significantly from the
    normal profile
  • Types of anomaly detection schemes
  • Graphical
  • Statistical-based
  • Distance-based

8
Graphical Approaches
  • Boxplot (1-D), Scatter plot (2-D), Spin plot
    (3-D)
  • Limitations
  • Time consuming
  • Subjective

9
Convex Hull Method
  • Extreme points are assumed to be outliers
  • Use the convex hull method to detect extreme
    values (see the sketch below)
  • What if the outlier occurs in the middle of the
    data?
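A minimal sketch of this idea in Python; scipy and the synthetic data are illustrative choices, not part of the lecture.

```python
# Hedged sketch: flag points on the convex hull as extreme-value candidates.
# Assumes 2-D numeric data; scipy is an illustrative tool choice.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # synthetic 2-D data
hull = ConvexHull(X)
extreme = hull.vertices         # indices of the points on the hull boundary
print("Extreme-value candidates:", extreme)
```

As the slide's question suggests, an outlier sitting in the middle of the data never lands on the hull, so this method misses it.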

10
Statistical Approaches
  • Assume a parametric model describing the
    distribution of the data (e.g., normal
    distribution)
  • Apply a statistical test that depends on
  • Data distribution
  • Parameters of the distribution (e.g., mean,
    variance)
  • Number of expected outliers (confidence limit)

11
Grubbs' Test
  • Detects outliers in univariate data
  • Assumes the data come from a normal distribution
  • Detects one outlier at a time, removes the
    outlier, and repeats
  • H0: There is no outlier in the data
  • HA: There is at least one outlier
  • Grubbs' test statistic
    G = max_i |X_i - Xbar| / s
  • Reject H0 if
    G > ((N-1)/sqrt(N)) * sqrt( t^2 / (N-2+t^2) )
    where t is the critical value of the
    t-distribution with N-2 degrees of freedom at
    significance level alpha/(2N)
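A minimal sketch of one round of the test, assuming approximately normal data; the helper name grubbs_round and the sample values are illustrative.

```python
# Hedged sketch of one round of Grubbs' test (two-sided); assumes the
# data are approximately normal. alpha is the significance level.
import numpy as np
from scipy import stats

def grubbs_round(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))       # most extreme point
    G = abs(x[idx] - mean) / sd                  # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)  # t critical value
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if G > G_crit else None           # index of the outlier, if any

# Detect one outlier at a time, remove it, and repeat (as the slide says).
print(grubbs_round([5.1, 4.9, 5.0, 5.2, 9.8]))   # -> 4 (the value 9.8)
```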

12
Statistical-based Likelihood Approach
  • Assume the data set D consists of samples from a
    mixture of two probability distributions
  • M (majority distribution)
  • A (anomalous distribution)
  • General Approach
  • Initially, assume all the data points belong to M
  • Let L_t(D) be the log likelihood of D at time t
  • Choose a point x_t that belongs to M and move it
    to A
  • Let L_{t+1}(D) be the new log likelihood
  • Compute the difference Delta = L_t(D) - L_{t+1}(D)
  • If Delta > c (some threshold), then x_t is
    declared an anomaly and is moved permanently
    from M to A

13
Statistical-based Likelihood Approach
  • Data distribution D = (1 - lambda) M + lambda A
  • M is a probability distribution estimated from
    the data
  • Can be based on any modeling method (naïve Bayes,
    maximum entropy, etc.)
  • A is often assumed to be a uniform distribution
  • Log likelihood at time t
    LL_t(D) = |M_t| log(1 - lambda)
              + sum_{x in M_t} log P_M(x)
              + |A_t| log lambda
              + sum_{x in A_t} log P_A(x)
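A hedged sketch of the procedure from the last two slides on 1-D data, assuming M is a Gaussian refit to the current normal set and A is uniform over the data range; lambda_, c, and the data are illustrative. The test is written as the log likelihood improving by more than c when x_t moves to A, the operative form of the Delta comparison above (following Eskin's original formulation).

```python
# Hedged sketch on 1-D data: M is a Gaussian refit to the current normal
# set, A is uniform over the data range. lambda_ and c are illustrative.
import numpy as np
from scipy import stats

def likelihood_anomalies(x, lambda_=0.05, c=1.0):
    x = np.asarray(x, dtype=float)
    M, A = set(range(len(x))), set()
    log_pA = -np.log(x.max() - x.min())          # uniform anomalous density

    def ll(M, A):
        xm = x[sorted(M)]
        ll_M = len(M) * np.log(1 - lambda_) + \
               stats.norm.logpdf(xm, xm.mean(), xm.std()).sum()
        ll_A = len(A) * (np.log(lambda_) + log_pA)
        return ll_M + ll_A

    for t in sorted(M):
        L_t = ll(M, A)
        L_t1 = ll(M - {t}, A | {t})              # move x_t from M to A
        if L_t1 - L_t > c:                       # likelihood improves enough
            M.remove(t); A.add(t)                # declare x_t an anomaly
    return sorted(A)

rng = np.random.default_rng(0)
data = np.r_[rng.normal(0, 1, 50), 10.0]         # one planted anomaly
print(likelihood_anomalies(data))                # -> [50]
```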

14
Limitations of Statistical Approaches
  • Most of the tests are for a single attribute
  • In many cases, the data distribution may not be
    known
  • For high dimensional data, it may be difficult to
    estimate the true distribution

15
Distance-based Approaches
  • Data is represented as a vector of features
  • Three approaches
  • Nearest-neighbor based
  • Density-based
  • Clustering-based

16
Nearest-Neighbor Based Approach
  • Approach
  • Compute the distance between every pair of data
    points
  • There are various ways to define outliers
  • Data points with fewer than p points within a
    neighborhood of radius D
  • Data points whose distance to the kth nearest
    neighbor is among the highest (a scoring sketch
    follows this list)
  • Data points whose average distance to the k
    nearest neighbors is among the highest
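A minimal sketch of the kth-nearest-neighbor distance score from the list above, assuming numeric feature vectors; scikit-learn and the planted outlier are illustrative.

```python
# Minimal sketch of the kth-nearest-neighbor distance score; assumes
# numeric feature vectors. scikit-learn is an illustrative tool choice.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips self
    dist, _ = nbrs.kneighbors(X)
    return dist[:, -1]                # distance to the kth nearest neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # planted outlier
scores = knn_outlier_scores(X)
print("Most outlying index:", int(np.argmax(scores)))     # -> 100
```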

17
Outliers in Lower Dimensional Projection
  • In high-dimensional space, data is sparse and
    notion of proximity becomes meaningless
  • Every point is an almost equally good outlier
    from the perspective of proximity-based
    definitions
  • Lower-dimensional projection methods
  • A point is an outlier if in some lower
    dimensional projection, it is present in a local
    region of abnormally low density

18
Outliers in Lower Dimensional Projection
  • Divide each attribute into phi equal-depth
    intervals
  • Each interval contains a fraction f = 1/phi of
    the records
  • Consider a k-dimensional cube created by picking
    grid ranges from k different dimensions
  • If attributes are independent, we expect the
    region to contain a fraction f^k of the records
  • If there are N points, we can measure the
    sparsity of a cube D as
    S(D) = (n(D) - N f^k) / sqrt(N f^k (1 - f^k))
    where n(D) is the number of points in D
  • Negative sparsity indicates the cube contains
    fewer points than expected

19
Example
  • N = 100, phi = 5, f = 1/5 = 0.2, N f^2 = 4
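A small worked check of these numbers in Python, using the sparsity formula reconstructed on the previous slide; the observed counts n(D) are illustrative.

```python
# Worked check of the example's numbers, using the sparsity coefficient
# S(D) = (n(D) - N f^k) / sqrt(N f^k (1 - f^k)) reconstructed above.
import math

N, phi, k = 100, 5, 2
f = 1.0 / phi                  # fraction per equi-depth interval
expected = N * f**k            # 100 * 0.2^2 = 4 points expected per cube
for n_D in (0, 2, 4, 10):      # illustrative observed counts in a cube
    S = (n_D - expected) / math.sqrt(expected * (1 - f**k))
    print(f"n(D) = {n_D:2d}  sparsity = {S:+.2f}")  # negative = sparse
```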

20
Density-based LOF approach
  • For each point, compute the density of its local
    neighborhood
  • Compute local outlier factor (LOF) of a sample p
    as the average of the ratios of the density of
    sample p and the density of its nearest neighbors
  • Outliers are points with the largest LOF values

In the NN approach, p2 is not considered an
outlier, while the LOF approach finds both p1 and
p2 as outliers
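scikit-learn ships a LocalOutlierFactor estimator implementing this idea; a hedged usage sketch with illustrative clusters and parameters:

```python
# Hedged usage sketch: scikit-learn's LocalOutlierFactor implements the
# LOF idea above. The clusters and parameters are illustrative.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # dense cluster
               rng.normal(4, 1.5, size=(50, 2)),   # sparse cluster
               [[2.0, 2.0]]])                      # isolated point
lof = LocalOutlierFactor(n_neighbors=10).fit(X)
scores = -lof.negative_outlier_factor_             # larger = more outlying
print("Top outlier indices:", np.argsort(scores)[-3:])
```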
21
Clustering-Based
  • Basic idea
  • Cluster the data into groups of different density
  • Choose points in small cluster as candidate
    outliers
  • Compute the distance between candidate points and
    non-candidate clusters.
  • If candidate points are far from all other
    non-candidate points, they are outliers
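A minimal sketch of this recipe, assuming k-means as the clustering step; the helper name cluster_outliers and the thresholds small and far are illustrative choices, not prescribed by the lecture.

```python
# Minimal sketch of the clustering-based recipe, assuming k-means as the
# clustering step; the thresholds small and far are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def cluster_outliers(X, n_clusters=5, small=0.05, far=3.0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)
    small_ids = np.where(sizes < small * len(X))[0]   # candidate clusters
    big_ids = np.setdiff1d(np.arange(n_clusters), small_ids)
    if big_ids.size == 0:
        return []
    outliers = []
    for i in np.where(np.isin(labels, small_ids))[0]:
        d = np.linalg.norm(centers[big_ids] - X[i], axis=1)
        if d.min() > far:            # far from every non-candidate cluster
            outliers.append(int(i))
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (95, 2)), rng.normal(8, 0.1, (3, 2))])
print(cluster_outliers(X))           # -> the three far-away points
```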

22
One-Class SVM
  • Based on support vector clustering
  • Extension of SVM approach to clustering
  • Two key ideas in SVM
  • It uses the maximal margin principle to find the
    linear separating hyperplane
  • For nonlinearly separable data, it uses a kernel
    function to project the data into higher
    dimensional space
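A hedged usage sketch of scikit-learn's OneClassSVM, which combines both ideas (margin-based separation plus a kernel); the parameters gamma and nu and the data are illustrative.

```python
# Hedged usage sketch of scikit-learn's OneClassSVM with an RBF kernel;
# nu bounds the fraction of training points treated as outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[5.0, 5.0]]])
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
pred = ocsvm.predict(X)              # +1 = inlier, -1 = outlier
print("Outlier indices:", np.where(pred == -1)[0])
```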

23
Support Vector Machine (Idea 1)
  • Maximal margin principle

24
Support Vector Machine (Idea 2)
(Figure data mapped from the original space to a
high-dimensional feature space)
25
Support Vector Clustering
What is the corresponding maximum margin
principle?
(Figure clustered data in the original space)
26
Support Vector Clustering
  • In SVM
  • Start with the simplest case first, then make the
    problem more complex
  • Simplest case: linearly separable data
  • Apply same idea to clustering
  • What is the simplest case?
  • All the points belong to a single cluster
  • The cluster is globular (spherical)

27
Support Vector Clustering
SVM choose the hyperplane with the largest margin
SVC choose the sphere with the smallest radius
28
Support Vector Clustering
  • Let R be the radius of the sphere
  • Goal is to
    minimize R^2
  • subject to
    ||x_i - a||^2 <= R^2 for all i
  • where
  • a is the center of the sphere
29
Support Vector Clustering
  • Objective function (Lagrangian)
    L = R^2 - sum_i beta_i (R^2 - ||x_i - a||^2)
  • where the beta_i's are the Lagrange multipliers
  • Subject to
  • beta_i >= 0

30
Support Vector Clustering
  • Objective function (dual form)
    W = sum_i beta_i (x_i . x_i)
        - sum_{i,j} beta_i beta_j (x_i . x_j)
  • Find the beta_i's that maximize the expression
    s.t. sum_i beta_i = 1 and beta_i >= 0

31
Support Vector Clustering
  • Since beta_i (R^2 - ||x_i - a||^2) = 0 at the
    optimum (complementary slackness)
  • If x_i is located in the interior of the sphere,
    then beta_i = 0
  • If x_i is located on the surface of the sphere,
    then beta_i > 0
  • Support vectors are the data points located on
    the cluster boundary

32
Outliers
  • Outliers are considered data points located
    outside the sphere
  • Let xi_i be the error (slack) for x_i
  • Goal is to
    minimize R^2 + C sum_i xi_i
  • subject to
    ||x_i - a||^2 <= R^2 + xi_i,  xi_i >= 0
33
Outliers
  • Lagrangian
    L = R^2 + C sum_i xi_i
        - sum_i beta_i (R^2 + xi_i - ||x_i - a||^2)
        - sum_i mu_i xi_i
  • Subject to
    beta_i >= 0, mu_i >= 0

34
Outliers
  • Dual form
    W = sum_i beta_i (x_i . x_i)
        - sum_{i,j} beta_i beta_j (x_i . x_j)
    s.t. sum_i beta_i = 1, 0 <= beta_i <= C
  • Same as the previous (no outlier) case, except
    each beta_i is now bounded above by C

35
Outliers
  • Since beta_i (R^2 + xi_i - ||x_i - a||^2) = 0
    at the optimum
  • If x_i is located in the interior of the sphere,
    then beta_i = 0
  • If x_i is located on the surface of the sphere,
    then 0 < beta_i < C
  • Such points are called the support vectors
  • If x_i is located outside of the sphere, then
    beta_i = C
  • Such points are called the bounded support
    vectors

36
Irregular Shaped Clusters
  • What if the cluster has an irregular shape in
    the original space?
  • Instead of using a very large sphere, or a
    sphere with large errors (sum_i xi_i), project
    the data into a higher-dimensional space (kernel
    trick)
37
Irregular Shaped Clusters
  • Objective function (dual form)
    W = sum_i beta_i K(x_i, x_i)
        - sum_{i,j} beta_i beta_j K(x_i, x_j)
  • Kernel trick
  • Use a kernel function K(x_i, x_j) in place of
    Phi(x_i) . Phi(x_j)
  • Typical kernel function
  • Gaussian K(x_i, x_j) = exp(-q ||x_i - x_j||^2),
    sketched below
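A small sketch of the kernel trick in this dual, assuming the Gaussian kernel above; the width q and the sample points are illustrative.

```python
# Sketch of the kernel trick: the Gaussian kernel replaces every dot
# product Phi(x_i).Phi(x_j) in the dual. q is an illustrative width.
import numpy as np

def gaussian_kernel(X, q=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x_i - x_j||^2
    return np.exp(-q * sq)

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
K = gaussian_kernel(X, q=0.5)
# In W = sum_i b_i K(x_i, x_i) - sum_ij b_i b_j K(x_i, x_j), each entry
# K[i, j] stands in for the dot product in the feature space.
print(np.round(K, 3))                # nearby points -> K close to 1
```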

38
References
  • Support Vector Clustering, by Ben-Hur, Horn,
    Siegelmann, and Vapnik (Journal of Machine
    Learning Research, 2001)
  • http://citeseer.ist.psu.edu/hur01support.html
  • Cone Cluster Labeling for Support Vector
    Clustering, by Lee and Daniels (in Proc. of SIAM
    Intl Conf on Data Mining, 2006)
  • http://www.siam.org/meetings/sdm06/proceedings/046lees.pdf

39
Graph-based Method
  • Represent the data as a graph
  • Objects ? nodes
  • Similarity ? edges
  • Apply graph-based method to determine outliers

40
Graph-based Method
Find the most outlying node in the graph, i.e.,
the opposite of finding the most central node
41
Graph-based Method
  • Many measures of node centrality (illustrated
    in the sketch after this list)
  • Degree
  • Closeness
    C(u) = 1 / sum_n d(u,n)
  • where d(u,n) is the geodesic distance between u
    and n
  • Geodesic distance is the shortest-path distance
  • Betweenness
    B(n) = sum_{j<k} g_jk(n) / g_jk
  • where g_jk(n) is the number of geodesic paths
    from j to k that pass through n, and g_jk is the
    total number of geodesic paths from j to k
  • Random walk method
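A hedged illustration of these measures using networkx (an illustrative tool choice, not part of the lecture); the least central node is a natural outlier candidate.

```python
# Hedged illustration with networkx: the least central node under each
# measure is a natural outlier candidate.
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])  # node 5 dangles
for name, cent in [("degree", nx.degree_centrality(G)),
                   ("closeness", nx.closeness_centrality(G)),
                   ("betweenness", nx.betweenness_centrality(G))]:
    least = min(cent, key=cent.get)
    print(f"{name:12s} least central node: {least}")
```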

42
Random Walk Method
  • Random walk model
  • Randomly pick a starting node, s
  • Randomly choose a neighboring node linked to s.
    Set current node s to be the neighboring node.
  • Repeat step 2
  • Compute the probability that you will reach a
    particular node in the graph
  • The higher the probability, the more central
    the node is.

43
Random Walk Method
  • Goal: Find the stationary distribution c
  • Vector c holds the probability value for each
    object
  • Initially, set c(i) = 1/N (for all i = 1, ..., N)
  • Let S be the adjacency matrix of the graph
  • Normalize the rows so that S(i,j) becomes a
    transition probability
  • Iteratively compute
    c = (d/N) 1 + (1 - d) S^T c
  • until c converges to a stationary distribution
  • To ensure convergence, use a damping factor d,
    as in the update above (see the sketch below)
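A minimal sketch of the damped iteration, assuming the update reconstructed above; the adjacency matrix A and the damping factor value are illustrative.

```python
# Minimal sketch of the damped iteration c = (d/N) 1 + (1 - d) S^T c;
# the adjacency matrix and damping factor d are illustrative.
import numpy as np

def stationary_distribution(A, d=0.1, tol=1e-9, max_iter=1000):
    S = A / A.sum(axis=1, keepdims=True)   # rows -> transition probabilities
    N = S.shape[0]
    c = np.full(N, 1.0 / N)                # start from the uniform vector
    for _ in range(max_iter):
        c_new = d / N + (1 - d) * S.T @ c  # damped random-walk update
        if np.abs(c_new - c).sum() < tol:
            break
        c = c_new
    return c_new

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(stationary_distribution(A))          # smallest entry = most anomalous
```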

44
Random Walk Method
  • Applications
  • Web search (PageRank algorithm used by Google)
  • Text summarization
  • Keyword extraction

45
Random Walk for Anomaly Detection
  • Assess the centrality or importance of individual
    objects

(Figure for closely related data, e.g., documents
returned by PageRank, the most central nodes are
highly relevant web pages; for data containing
anomalies, the least central nodes are the
anomalies)
46
Example
  • Sample dataset

Object   Connectivity   Rank
1        0.0835          2
2        0.0764          1
3        0.0930          5
4        0.0922          4
5        0.0914          3
6        0.0940          9
7        0.0936          7
8        0.0930          6
9        0.0942         10
10       0.0942         11
11       0.0939          8
(rank 1 lowest connectivity, i.e., most anomalous)
  • Model parameter tuning
  • damping factor = 0.1
  • Converges after 112 steps