Parallel Shared Nearest Neighbor Clustering

1
Parallel Shared Nearest Neighbor Clustering
  • Ben Mayer, Eric Eilertson, Levent Ertoz, Vipin
    Kumar
  • AHPCRC
  • bmayer_at_cs.umn.edu

2
Applications
  • Serial code is memory- and compute-time-limited:
    O(n^2)
  • Currently the parallel code is used in
    Network Intrusion Detection (MINDS) to analyze
    long-term data.
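
To make the O(n^2) limit concrete, the following minimal sketch compares the memory for a full dense similarity matrix with the memory for keeping only k neighbors per record. The 800K record count comes from the use-case slide; the entry sizes and the value k = 100 are assumptions chosen for illustration and are not taken from the slides.

```python
def full_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to hold a dense n x n similarity matrix."""
    return n * n * bytes_per_entry


def knn_lists_bytes(n, k, bytes_per_pair=8):
    """Bytes needed to keep only k (index, similarity) pairs per record."""
    return n * k * bytes_per_pair


n = 800_000  # record count mentioned in the use-case slide
k = 100      # assumed neighbor-list length (hypothetical, not from the slides)

print(full_matrix_bytes(n) / 1e12, "TB for the full matrix")
print(knn_lists_bytes(n, k) / 1e9, "GB for the k-nearest lists")
```

Under these assumptions the full matrix needs about 2.56 TB, while the neighbor lists need well under 1 GB, which is why keeping only the top-k items per record matters at this scale.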

3
Additional Applications
  • NASA Earth Science Data
  • Used to find spatial patterns in climate data
  • Can use any similarity measure that can be coded
    in or linked to C. This makes the code
    applicable to a wide range of problems.
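
The idea of a pluggable similarity measure can be sketched as follows. The real code takes measures written in or linked to C; this Python version only illustrates the pattern of passing the measure in as a function. The names `cosine`, `neg_euclidean`, and `most_similar` are hypothetical and do not come from the PSNN code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def neg_euclidean(a, b):
    """Negated Euclidean distance, so that larger means more similar."""
    return -math.dist(a, b)


def most_similar(query, records, similarity):
    """Index of the record most similar to query under the given measure."""
    return max(range(len(records)), key=lambda i: similarity(query, records[i]))


records = [(1.0, 0.0), (10.0, 10.0), (0.0, 1.0)]
query = (3.0, 2.0)

# The same search code gives different answers under different measures.
print(most_similar(query, records, cosine))
print(most_similar(query, records, neg_euclidean))
```

Because the search only calls `similarity(a, b)`, swapping the measure does not require touching the rest of the code.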

4
Comparison
  • We have a serial code.
  • No parallel codes found.
  • Enables scaling the problem size up (looking at
    larger data sets) or reducing computation time
    (same data, faster results).

5
Serial vs Parallel code features
  • Serial code
      • Can work with sparse or dense data sets
      • Only does symmetric measures (computes half of
        the matrix)
      • Many more options (similarity measures, program
        control, etc.)
      • Highly optimized
      • Mature code, well tested and correct to our
        knowledge
  • Parallel code
      • Only dense data sets
      • Only does asymmetric measures (computes the entire
        similarity matrix, 2x the work)
      • No options (other than the file names to use)
      • Not very optimized
      • Very new code; it is functional and scalable,
        giving a good base to build upon. It has been
        verified against the serial code.
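
The "2x" figure follows from counting similarity evaluations. A minimal sketch, assuming self-similarity is not computed (the function names are illustrative only):

```python
def symmetric_evaluations(n):
    """Unordered pairs i < j: enough when sim(i, j) == sim(j, i)."""
    return n * (n - 1) // 2


def asymmetric_evaluations(n):
    """Ordered pairs i != j: needed when sim(i, j) may differ from sim(j, i)."""
    return n * (n - 1)


n = 1000
print(symmetric_evaluations(n), asymmetric_evaluations(n))
```

For any n, the asymmetric count is exactly twice the symmetric count, which matches the half-matrix versus full-matrix distinction above.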

6
Algorithm Overview
  • Read and distribute data one chunk at a time
  • Loop (to calculate the similarity matrix)
      • Calculate similarity for the data currently on
        each processor
      • On each processor, keep the k most similar items
        seen across all iterations of the loop
      • Shift data
  • Collect each set of top-k items on the root processor
  • Run the clustering algorithm
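
The steps above can be sketched in a single Python process that simulates the processors as list entries. This is an illustration under stated assumptions, not the PSNN implementation: the real code is C with MPI, where the shift step could be done with a call such as `MPI_Sendrecv_replace` around a ring. The ring ordering, the bounded-heap bookkeeping, and the names `ring_top_k` and `shared_neighbors` are all assumptions. The slides do not describe the clustering step, so `shared_neighbors` only shows the general shared-nearest-neighbor idea of counting common neighbors.

```python
import heapq


def ring_top_k(chunks, k, similarity):
    """Simulate the shift loop for len(chunks) ranks in one process.

    chunks[r] is a list of (global_index, record) pairs owned by rank r.
    Returns a dict mapping each global index to its k most similar
    (similarity, neighbor_index) pairs, best first.
    """
    p = len(chunks)
    visiting = list(chunks)  # each rank starts by comparing against its own chunk
    # One bounded min-heap of candidate neighbors per local record.
    heaps = [{idx: [] for idx, _ in chunk} for chunk in chunks]

    for _ in range(p):
        for rank in range(p):
            for i, rec_i in chunks[rank]:
                heap = heaps[rank][i]
                for j, rec_j in visiting[rank]:
                    if i == j:
                        continue  # skip self-similarity
                    item = (similarity(rec_i, rec_j), j)
                    if len(heap) < k:
                        heapq.heappush(heap, item)
                    elif item > heap[0]:
                        heapq.heapreplace(heap, item)  # evict the weakest candidate
        # Shift: every rank hands its visiting chunk to the next rank in the ring.
        visiting = [visiting[(rank - 1) % p] for rank in range(p)]

    # "Collect to root": merge the per-rank results into one table.
    return {i: sorted(heap, reverse=True)
            for rank_heaps in heaps
            for i, heap in rank_heaps.items()}


def shared_neighbors(neighbors, a, b):
    """Number of neighbors that records a and b have in common."""
    set_a = {j for _, j in neighbors[a]}
    set_b = {j for _, j in neighbors[b]}
    return len(set_a & set_b)


# Three simulated ranks, two one-dimensional records each.
chunks = [[(0, 0.0), (1, 1.0)], [(2, 2.0), (3, 50.0)], [(4, 51.0), (5, 52.0)]]
neighbors = ring_top_k(chunks, k=2, similarity=lambda a, b: -abs(a - b))
print(neighbors[0])
print(shared_neighbors(neighbors, 0, 2), shared_neighbors(neighbors, 0, 3))
```

After p shifts every rank has compared its local records against every chunk, yet no rank ever holds more than its own chunk, one visiting chunk, and k candidates per local record.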

7
Use cases
  • The parallel code was used to process 800K
    records of AHPCRC network data.
  • Also used to process a large Army network data
    set.
  • These runs produced interesting patterns with
    near-linear speedup.

8
Obtaining PSNN
  • Currently finishing testing and tuning
  • Should be available very soon
  • Contact bmayer_at_cs.umn.edu for more information

9
Platforms
  • Should compile and run with few or no
    modifications on any platform with C and MPI
  • Successfully tested on Cray T3E, Cray X1, and an
    Intel cluster

10
References
  • Book Chapters
  • Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P.,
    Srivastava, J., Kumar, V., Dokas, P. The MINDS -
    Minnesota Intrusion Detection System, accepted
    for the book "Next Generation Data Mining".
  • Conference papers
  • Ertoz, L., Lazarevic, A., Eilertson, E.,
    Tan, P., Dokas, P., Kumar, V.,
    Srivastava, J. Protecting Against Cyber Threats
    in Networked Information Systems, SPIE Annual
    Symposium on AeroSense, Battlespace Digitization
    and Network Centric Systems III, April 2003,
    Orlando, FL.
  • Lazarevic, A., Ertoz, L., Ozgur, A., Srivastava,
    J., Kumar, V. A Comparative Study of Anomaly
    Detection Schemes in Network Intrusion Detection,
    Proceedings of Third SIAM Conference on Data
    Mining, San Francisco, May 2003.
  • Dokas, P., Ertoz, L., Kumar, V., Lazarevic, A.,
    Srivastava, J., Tan, P. Data Mining for Network
    Intrusion Detection, Proc. NSF Workshop on Next
    Generation Data Mining, Baltimore, MD, November
    2002.
  • Lazarevic, A., Dokas, P., Ertoz, L., Kumar, V.,
    Srivastava, J., Tan, P. Cyber Threat Analysis -
    A Key Enabling Technology for the Objective Force
    (A Case Study in Network Intrusion Detection),
    Proceedings 23rd Army Science Conference,
    Orlando, FL, December 2002.
  • Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P.,
    Dokas, P., Srivastava, J., Kumar, V. Detection
    and Summarization of Novel Network Attacks Using
    Data Mining, Technical Report, 2003.