1
Mining for Structural Anomalies in Graph-Based Data
  • William Eberle
  • Department of Computer Science and Engineering
  • University of Texas at Arlington

2
Overview
  • The Big Picture
  • Definition of Graph-Based Anomaly
  • Algorithms
      • Information Theoretic Algorithm
      • Probabilistic Algorithm
      • Maximum Partial Substructure Algorithm
  • Results
      • Synthetic
      • Real-world
  • Conclusions and Future Work

3
The Big Picture - Fraud
  • Fraud
  • Financial
  • Stolen credit card, identity theft, etc.
  • Estimated U.S. credit card losses for 2007 at
    $3.2 billion.
  • Money laundering
  • Telecommunications

4
The Big Picture - Homeland Security
  • Homeland Security
  • Border protection, drug smuggling, weapons of
    mass destruction, etc.
  • Terrorist networks

5
Motivation
  • By representing data as a graph, we can mine for
    structural patterns that are not apparent in data
    represented in traditional databases.
  • Very little research has been done in the area of
    graph-based data mining for anomaly detection.
  • Existing approaches limited to specific anomalous
    situations or domains.
  • Using a unique definition of an anomaly, we have
    generated three novel algorithms for detecting
    all types of anomalies.

6
Definition of Graph-Based Anomaly
  • "The more successful money-laundering apparatus
    is in imitating the patterns and behavior of
    legitimate transactions, the less the likelihood
    of it being exposed." (United Nations Office on
    Drugs and Crime)
  • Related to fraud detection.
  • Definition: Given a graph G with a normative
    substructure S, a substructure S' is considered
    anomalous if the difference d between S and S'
    satisfies 0 < d < X, where X is a user-defined
    threshold and d is a measure of the unexpected
    structural difference between two sub-graphs of
    a graph.
  • The basis for this definition is the assumption
    that someone committing fraud will attempt to
    imitate legitimate transactions as closely as
    possible.
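
The definition above can be illustrated with a minimal sketch. It assumes a substructure is stored as a set of labeled edges and that d is simply the number of edges present in one substructure but not the other. That measure is a stand-in chosen for brevity, not the difference measure used in the talk, and the names structural_difference, is_anomalous, and the example labels are hypothetical.

```python
def structural_difference(normative, candidate):
    """Count labeled edges that appear in exactly one of the two substructures."""
    return len(normative ^ candidate)


def is_anomalous(normative, candidate, threshold):
    """Apply the test 0 < d < X, with X given by `threshold`."""
    d = structural_difference(normative, candidate)
    return 0 < d < threshold


# Hypothetical transaction pattern: (source label, edge label, target label).
normative = frozenset({
    ("account", "transfers_to", "account"),
    ("account", "owned_by", "person"),
    ("person", "lives_at", "address"),
})
# One edge relabeled: one edge removed plus one edge added, so d = 2.
near_match = frozenset({
    ("account", "transfers_to", "account"),
    ("account", "owned_by", "person"),
    ("person", "lives_at", "po_box"),
})
# Shares nothing with the normative pattern, so d = 3 + 1 = 4.
unrelated = frozenset({("router", "connects_to", "router")})

for name, candidate in [("identical", normative),
                        ("near_match", near_match),
                        ("unrelated", unrelated)]:
    d = structural_difference(normative, candidate)
    print(name, d, is_anomalous(normative, candidate, threshold=3))
```

With a threshold of 3, only the near match is flagged: the identical copy has d = 0, and the unrelated substructure differs too much to count as an imitation of the normative pattern.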

7
Assumptions
  • Assumption 1: The majority of a graph consists
    of a normative pattern, and no more than X% of
    the normative pattern is altered in the case of
    an anomaly.
  • Assumption 2: Anomalies consist of one or more
    modifications, insertions or deletions.
  • Assumption 3: The normative pattern is connected.

8
Approach
  • Unsupervised.
  • Graph-based Anomaly Detection (GBAD)
  • Determine the normative pattern using the SUBDUE
    minimum description length heuristic, which
    minimizes the following:
  • M(S,G) = DL(G|S) + DL(S)
  • where G is the entire graph, S is the
    substructure, DL(G|S) is the description length
    of G after compressing it with S, and DL(S) is
    the description length of the substructure.
  • Three algorithms, one for each of the different
    anomaly categories: GBAD-MDL, GBAD-P and
    GBAD-MPS.
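
As a worked illustration of the minimum description length heuristic, the sketch below scores two candidate substructures. It assumes a crude proxy for description length (number of vertices plus number of edges) instead of a real bit-level encoding, and it assumes that compressing G with S collapses each non-overlapping instance of S into a single vertex. The function names and the example counts are hypothetical.

```python
def description_length(num_vertices, num_edges):
    """Crude stand-in for DL: one unit per vertex and per edge (an assumption)."""
    return num_vertices + num_edges


def mdl_score(graph_size, sub_size, instances):
    """Return DL(G|S) + DL(S) under the simplified size measure above."""
    graph_vertices, graph_edges = graph_size
    sub_vertices, sub_edges = sub_size
    # Each instance collapses to one vertex, and its internal edges disappear.
    compressed_vertices = graph_vertices - instances * (sub_vertices - 1)
    compressed_edges = graph_edges - instances * sub_edges
    return (description_length(compressed_vertices, compressed_edges)
            + description_length(sub_vertices, sub_edges))


graph_size = (20, 24)  # hypothetical graph: 20 vertices, 24 edges
candidates = {
    "triangle": ((3, 3), 5),     # 3 vertices, 3 edges, 5 instances
    "single_edge": ((2, 1), 6),  # 2 vertices, 1 edge, 6 instances
}
scores = {name: mdl_score(graph_size, size, count)
          for name, (size, count) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Under these assumptions the triangle scores 25 and the single edge scores 35, against an uncompressed size of 44, so the triangle would be chosen as the normative pattern.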

9
Information Theoretic Approach
  • GBAD-MDL.
  • Uses the minimum description length principle to
    discover the normative pattern, then examines all
    instances of that pattern that are within a
    user-specified threshold of change to find the
    one that compresses it the most without
    compressing all of it.
  • Example: GBAD-MDL discovers the following
    anomalous substructure (figure not included in
    the transcript).
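
The information theoretic approach can be sketched roughly as follows. The sketch assumes instances are stored as frozensets of edges, that the cost of change is the number of edges differing from the normative pattern, and that candidates are ranked by cost multiplied by frequency so that small, rare deviations come first. The ranking rule and the name gbad_mdl_sketch are assumptions of this illustration, not the GBAD implementation.

```python
from collections import Counter


def change_cost(normative, instance):
    """Number of edges that differ between the instance and the normative pattern."""
    return len(normative ^ instance)


def gbad_mdl_sketch(normative, instances, threshold):
    """Rank near-matches of the normative pattern, most anomalous first."""
    near = [inst for inst in instances
            if 0 < change_cost(normative, inst) < threshold]
    frequency = Counter(near)
    # Assumed score: cost of change times how often that variant occurs.
    scored = {inst: change_cost(normative, inst) * count
              for inst, count in frequency.items()}
    return sorted(scored.items(), key=lambda item: item[1])


normative = frozenset({("A", "B"), ("B", "C"), ("C", "A")})
variant_rare = frozenset({("A", "B"), ("B", "C"), ("C", "D")})
variant_common = frozenset({("A", "B"), ("B", "C"), ("C", "E")})
instances = [normative] * 4 + [variant_rare] + [variant_common] * 2

ranked = gbad_mdl_sketch(normative, instances, threshold=3)
for instance, score in ranked:
    print(sorted(instance), score)
```

Exact copies of the normative pattern have cost 0 and are never reported. Of the two variants, the one that occurs only once ranks first.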

10
Probabilistic Approach
  • GBAD-P
  • Uses MDL evaluation to discover the best
    substructure in the graph, then examines all
    extensions to the normative pattern, looking for
    the extensions with the lowest probability.
  • Example
  • GBAD-P returns after 2 iterations
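
The core of the probabilistic approach, choosing the least likely extension, can be sketched as below. The sketch assumes the one-edge extensions of every instance of the normative pattern have already been collected into a list, and it estimates each extension's probability as its relative frequency. The name rarest_extension and the port labels are hypothetical.

```python
from collections import Counter


def rarest_extension(extensions):
    """Return the extension with the lowest relative frequency and that frequency."""
    counts = Counter(extensions)
    total = sum(counts.values())
    probabilities = {ext: count / total for ext, count in counts.items()}
    rarest = min(probabilities, key=probabilities.get)
    return rarest, probabilities[rarest]


# Hypothetical extensions observed on ten instances of a normative pattern.
extensions = ([("ship", "docks_at", "port_A")] * 9
              + [("ship", "docks_at", "port_B")])

extension, probability = rarest_extension(extensions)
print(extension, probability)
```

Here nine of the ten instances extend to port_A, so the single extension to port_B has the lowest probability and would be reported as the anomalous insertion.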

11
Maximum Partial Substructure Approach
  • GBAD-MPS
  • Uses the MDL approach to discover the best
    substructure in a graph, then examines all of the
    instances of ancestral substructures that are
    missing various edges and vertices.
  • Example: GBAD-MPS discovers the following
    (figure not included in the transcript).
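
A minimal sketch of the maximum partial substructure idea follows. It assumes instances are frozensets of edges and treats any instance that is a proper subset of the normative pattern as a partial match, ranked by how few edges are missing. The name gbad_mps_sketch, the ranking rule, and the shipment labels are assumptions made for illustration.

```python
def gbad_mps_sketch(normative, instances, threshold):
    """Return (instance, missing edges) pairs for partial matches, closest first."""
    partial = []
    for instance in instances:
        if instance < normative:  # proper subset: something is missing
            missing = normative - instance
            if len(missing) < threshold:
                partial.append((instance, missing))
    return sorted(partial, key=lambda pair: len(pair[1]))


normative = frozenset({
    ("shipment", "has", "importer"),
    ("shipment", "has", "port"),
    ("shipment", "has", "financial_info"),
})
missing_one = frozenset({
    ("shipment", "has", "importer"),
    ("shipment", "has", "port"),
})
missing_two = frozenset({("shipment", "has", "importer")})
# Contains an extra edge, so it is not a partial match of the normative pattern.
with_extra = normative | {("shipment", "has", "extra_port")}

results = gbad_mps_sketch(
    normative, [normative, missing_one, missing_two, with_extra], threshold=3)
for instance, missing in results:
    print(sorted(missing))
```

The instance missing only the financial information ranks first, since it is the closest partial match. The complete instance and the one with an extra edge are not reported, because insertions are the concern of GBAD-P.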

12
Synthetic Results
  • Created synthetic graphs of varying sizes with
    anomalies of varying sizes.
  • GBAD-MDL discovers all anomalies where the
    anomalous instance contains the smallest
    modification among all same-sized substructures;
    false positives increase as the amount of change
    (noise) increases.
  • Similarly, all anomalies are discovered on graphs
    of all sizes with anomalous insertions (using
    GBAD-P) and deletions (using GBAD-MPS) of all
    sizes.
  • The GBAD-MPS algorithm is less susceptible to
    noise in synthetically generated graphs because
    noise is additional information, rather than
    information to be removed.

13
Advantages/Disadvantages
  • Advantages of these algorithms
      • 100% discovery rate for each algorithm on all
        of the graphs when the normative pattern is
        discovered and anomalies are smaller
        deviations than noise.
      • No anomaly, in a graph of any size, went
        undetected by all three approaches.
      • They do not just return the pattern of the
        anomaly; they also return the actual
        anomalous instances within the data.
  • Disadvantages of these algorithms
      • Each is focused on a specific kind of anomaly
        (modifications, insertions or deletions), so
        they would need to be used in conjunction
        with each other.
      • Running time of GBAD-MDL could be prohibitive
        when the graph is large.

14
Real-World Data - Cargo Shipments
  • Customs and Border Protection (CBP) agency
    shipping manifests in a system called PIERS.
  • Portion of the actual graph generated from the
    PIERS data.
  • Example Scenario
      • Marijuana seized at a port in Florida (U.S.
        Customs Service, 2000).
      • The smuggler did not disclose some financial
        information, and the ship traversed an extra
        port.
      • GBAD-P discovers the extra traversed port;
        GBAD-MPS discovers the missing financial
        information.

15
Other Results
  • More Cargo Examples
      • Transshipment: a false declaration or
        information given in order to circumvent
        existing trade laws for the purpose of
        avoiding quotas, embargoes or prohibitions,
        or to obtain preferential duty treatment.
      • For this test, the country of origin on one
        of the shipments was randomly changed to
        CANADA.
      • The GBAD-MDL algorithm discovered the
        anomalous instance.
      • Eberle and Holder, Intelligent Data Analysis
        Journal, 2007.
  • 1999 KDD Cup Network Intrusion Data
      • Labeled TCP packets.
      • Testing and training data sets.
      • 100% of attacks are discovered with the
        GBAD-MDL algorithm; the discovery rate is
        55.8% with GBAD-P and 47.8% with GBAD-MPS.

16
Conclusions
  • The purpose of this work was to present an
    approach for discovering the three possible graph
    anomalies: modifications, insertions and
    deletions.
  • Using a practical definition of fraud, we
    designed algorithms to specifically handle the
    scenario where the anomalies are small deviations
    to a normative pattern.
  • Synthetic tests verified each of the algorithms
    on graphs and anomalies of varying sizes, with
    results showing very high detection rates and
    minimal false positives.
  • Tests using real-world cargo data and actual
    fraud scenarios injected into the data set
    indicated 100% accuracy with no false positives.
  • Tests on network intrusions using the GBAD-MDL
    algorithm achieve a 100% discovery rate with
    minimal false positives.

17
Future Work
  • Additional research using these algorithms
      • Further experimentation as to the effect of
        different patterns.
      • Investigate other pattern evaluation
        techniques besides MDL.
      • Implement other probabilistic measurements.
      • Use other real-world domains such as FINCEN
        data, telecommunication call records and the
        Enron e-mail data set.
  • Long-term research
      • Other compression techniques.
      • Investigate other information theoretic
        approaches, like Kolmogorov Complexity.
      • Graph partitioning for isolating anomalous
        graph substructures.
      • Other data mining and machine learning
        techniques.
      • Develop knowledge discovery platform for
        multiple, cross-discipline domains.

18
Contact Information
  • William (Bill) Eberle
  • Computer Science and Engineering
  • University of Texas at Arlington
  • Box 19015
  • Arlington, TX 76019-0015
  • E-mail: eberle@cse.uta.edu
  • Web-site: http://cseweb.uta.edu/eberle/