Data Mining Approach for Network Intrusion Detection - PowerPoint PPT Presentation

Data Mining Approach for Network Intrusion Detection Zhen Zhang Advisor: Dr. Chung-E Wang 04/24/2002 Department of Computer Science California State University ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Data Mining Approach for Network Intrusion Detection


1
Data Mining Approach for Network Intrusion
Detection
  • Zhen Zhang
  • Advisor: Dr. Chung-E Wang
  • 04/24/2002
  • Department of Computer Science
  • California State University, Sacramento

2
Outline
  • Background
  • Intrusion Detection: promises and challenges
  • Data Mining in IDS: how can it help
  • Motivation
  • Approaches, tasks, problems and my contributions
  • Results
  • Conclusion and future work

3
Intrusion Detection- Building a Secure Network
  • Primary assumptions
  • System activities are observable
  • Normal and intrusive activities have distinct
    evidence
  • Main techniques
  • Misuse detection: patterns of well-known attacks
  • Anomaly detection: deviation from normal usage

4
Data Mining in IDS
  • Shortfalls of current IDS (mostly misuse
    detection)
  • Variants: intrusions change easily and
    frequently.
  • False positives: false alarms make real
    intrusions difficult to pick out.
  • False negatives: attacks for which there are no
    known signatures go undetected.
  • Data overload: the amount of data grows rapidly.

5
What is Data Mining
  • Data Mining
  • Take data and pull from it patterns or
    deviations.
  • Many different types of algorithms
  • Decision Tree, Link analysis, Clustering,
    Association, Rule abduction, Deviation Analysis,
    and Sequence analysis.
  • Software and Tools
  • MS SQL Server 2000
  • Ripper and many others

6
How can Data Mining help
  • Variants
  • Anomaly detection is not greatly concerned with
    variants in exploit code.
  • False positives
  • Identify recurring sequences of alarms to help
    recognize valid network activity.
  • False negatives
  • Attacks for which signatures have not been
    developed might be detected.
  • Data overload
  • Data mining plays a vital role.

7
Summary of my work
  • Identify objective
  • Distinguish network attacks from normal traffic
  • New area, several research projects, no
    commercial products
  • Focus on the principle and basic implementation
    of concepts
  • Data Collection
  • Data Pre-processing on tcpdump dataset
  • Apply data mining on processed data
  • Investigate results
  • Software packages used: Visual Basic, Microsoft
    SQL Server 2000 with Analysis Server, Tcpdump

8
Data Collection
  • Tcpdump data (http://iris.cs.uml.edu:8080/)
  • Tcpdump was run on the gateway to capture
    traffic between the LAN and external networks,
    plus broadcast packets within the LAN
  • Headers only, no user data
  • Filters kept only TCP and UDP packets
  • A baseline and 4 simulated attacks

9
TCPDUMP data format
  • TCP packet
  • Time stamp
  • Source IP address
  • Source port
  • Destination IP address
  • Destination port
  • Flags (SYN, FIN, PUSH, RST, or .)
  • Data sequence number of this packet
  • Data sequence number of the data expected in
    return
  • Number of bytes of receive buffer space available
  • Indication of whether or not the data is urgent
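
A minimal sketch of how one TCP line could be split into the fields listed above. The line layout in the regular expression follows the classic tcpdump text output, but that is an assumption, since the slides do not show the capture format. `TcpPacket`, `parse_tcp_line` and the sample line in the usage note are hypothetical names and data introduced for illustration only.

```python
import re
from typing import NamedTuple, Optional


class TcpPacket(NamedTuple):
    timestamp: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    flags: str              # combination of S, F, P, R, or "."
    seq: Optional[str]      # "first:last(nbytes)" as printed, if present
    ack: Optional[int]      # sequence number expected in return
    window: Optional[int]   # receive buffer space available
    urgent: bool            # whether an "urg" field was printed


# Assumed layout: "<time> <ip>.<port> > <ip>.<port>: <flags> [seq] [ack N] [win N] [urg N]"
TCP_LINE = re.compile(
    r"^(?P<ts>\S+) "
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+) > "
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+): "
    r"(?P<flags>[SFPR.]+)"
    r"(?: (?P<seq>\d+:\d+\(\d+\)))?"
    r"(?: ack (?P<ack>\d+))?"
    r"(?: win (?P<win>\d+))?"
    r"(?P<urg> urg \d+)?"
)


def parse_tcp_line(line: str) -> Optional[TcpPacket]:
    """Return the parsed packet, or None if the line does not fit the assumed layout."""
    m = TCP_LINE.match(line.strip())
    if m is None:
        return None
    return TcpPacket(
        timestamp=m.group("ts"),
        src_ip=m.group("src_ip"),
        src_port=int(m.group("src_port")),
        dst_ip=m.group("dst_ip"),
        dst_port=int(m.group("dst_port")),
        flags=m.group("flags"),
        seq=m.group("seq"),
        ack=int(m.group("ack")) if m.group("ack") else None,
        window=int(m.group("win")) if m.group("win") else None,
        urgent=m.group("urg") is not None,
    )
```

For example, `parse_tcp_line("10:15:30.123456 10.0.0.5.1025 > 10.0.0.9.80: S 1415531521:1415531521(0) win 4096")` yields a packet with source port 1025, flag `S` and window 4096. Lines that do not match return `None`, so they can be counted or logged rather than silently mis-parsed.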

10
Tcpdump data format
  • UDP packet
  • Time stamp
  • Source IP address
  • Source port
  • Destination IP address
  • Destination port
  • Length of the packet
  • Example data

11
Example tcpdump data
12
Data Pre-processing: 80-90% of the work
  • Packet level information to connection level
  • Group by same source/destination IP/Port
  • Use flags, acks to determine status of the
    connection
  • SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS,
    SH, SHR, OOS1, OOS2
  • Record start time, duration, protocol
  • Calculate bytes in, bytes out, resent rate
  • UDP is connectionless, so simply treat each
    packet as a connection
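
The packet-to-connection step above can be sketched roughly as follows. This is not the author's Visual Basic implementation: the `Packet` and `Connection` types are hypothetical, only three of the listed statuses (SF, REJ, S0) are derived, everything else is lumped into a placeholder `OTHER`, and the resent rate is left out.

```python
from typing import List, NamedTuple, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class Packet(NamedTuple):
    ts: float        # time stamp in seconds
    src: Endpoint
    dst: Endpoint
    flags: str       # e.g. "S", "S.", "P.", "F.", "R."
    length: int      # payload bytes


class Connection(NamedTuple):
    start: float
    duration: float
    orig: Endpoint   # sender of the first packet
    resp: Endpoint
    bytes_out: int   # originator -> responder
    bytes_in: int    # responder -> originator
    status: str


def build_connections(packets: List[Packet]) -> List[Connection]:
    # Group both directions of a conversation under one key.
    groups = {}
    for p in sorted(packets, key=lambda p: p.ts):
        groups.setdefault(frozenset((p.src, p.dst)), []).append(p)

    conns = []
    for pkts in groups.values():
        orig, resp = pkts[0].src, pkts[0].dst
        out = [p for p in pkts if p.src == orig]
        back = [p for p in pkts if p.src == resp]
        syn_sent = any("S" in p.flags for p in out)
        syn_answered = any("S" in p.flags for p in back)
        rst_back = any("R" in p.flags for p in back)
        fin_seen = any("F" in p.flags for p in pkts)

        if syn_sent and syn_answered and fin_seen:
            status = "SF"      # normal establishment and termination
        elif syn_sent and rst_back and not syn_answered:
            status = "REJ"     # SYN met by RST
        elif syn_sent and not back:
            status = "S0"      # SYN sent, nothing came back
        else:
            status = "OTHER"   # placeholder for the remaining statuses

        conns.append(Connection(
            start=pkts[0].ts,
            duration=pkts[-1].ts - pkts[0].ts,
            orig=orig,
            resp=resp,
            bytes_out=sum(p.length for p in out),
            bytes_in=sum(p.length for p in back),
            status=status,
        ))
    return conns
```

Grouping on an unordered pair of endpoints keeps request and reply packets in the same connection. A real implementation would also need to separate successive connections that reuse the same address and port pair, which this sketch does not attempt.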

13
First round of processing
Intrinsic Features
14
Establish more information
Count_per_dest: # of connections to this destination IP
REJ_count_per_dest: # of connections that get the flag REJ
S01_count_per_dest: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
Diff_Services_per_dest: # of unique services
Diff_Service_Rate: Diff_Services / Count
  • Same Destination Temporal and Statistical
    Attributes (last 2 seconds)

15
Establish more information
Count_per_service: # of connections to this type of service
REJ_count_per_service: # of connections that get the flag REJ (SYN met by RST)
S01_count_per_service: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
Diff_Hosts_per_service: # of unique destination hosts
Diff_Hosts_Rate: Diff_Hosts / Count
  • Same Service Temporal and Statistical Attributes
    (last 2 seconds)
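
The same-destination and same-service attributes above share one shape, so a single windowed function can sketch both. This assumes connections are sorted by start time and that the current connection counts toward its own window (which keeps `Count` at 1 or more). `Conn` and `window_features` are hypothetical names, not part of the original work.

```python
from typing import Callable, Dict, List, NamedTuple


class Conn(NamedTuple):
    start: float    # start time in seconds
    dst_ip: str
    service: str    # e.g. a service name or destination port
    status: str     # connection status such as "SF", "REJ", "S0", "S1"


def window_features(
    conns: List[Conn],
    i: int,
    key: Callable[[Conn], str],
    other: Callable[[Conn], str],
    window: float = 2.0,
) -> Dict[str, float]:
    """Statistics over connections up to conns[i] that started within
    `window` seconds of it and share its key (destination or service)."""
    cur = conns[i]
    recent = [
        c for c in conns[: i + 1]
        if cur.start - c.start <= window and key(c) == key(cur)
    ]
    count = len(recent)
    diff = len({other(c) for c in recent})
    return {
        "count": count,
        "rej_count": sum(c.status == "REJ" for c in recent),
        "s01_count": sum(c.status in ("S0", "S1") for c in recent),
        "diff": diff,
        "diff_rate": diff / count,
    }


def per_dest(conns: List[Conn], i: int) -> Dict[str, float]:
    # Same destination: count distinct services.
    return window_features(conns, i, lambda c: c.dst_ip, lambda c: c.service)


def per_service(conns: List[Conn], i: int) -> Dict[str, float]:
    # Same service: count distinct destination hosts.
    return window_features(conns, i, lambda c: c.service, lambda c: c.dst_ip)
```

The linear scan over `conns[: i + 1]` is kept for clarity. On a dataset with hundreds of thousands of records a sliding window (for example a deque that drops entries older than the window) would avoid rescanning old connections.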

16
Second round of processing
Same Destination Temporal and Statistical
Attributes
17
Final round of processing
  • Final, but important
  • Reduce data amount
  • Remove noise or trivial information
  • Reorganize data, add new features if
    necessary
  • Challenges
  • Hard to tell which data to reduce/remove
  • Requires tremendous domain knowledge
  • Need experiments and adjustments

18
Data Mining
  • Decision Tree Algorithm
  • Microsoft SQL Server 2000 Analysis Server
  • Steps
  • 80% of the baseline (normal) dataset as training data
  • Use the remaining 20% as validation data, compute
    misclassification.
  • 20% of each of the four intrusion datasets as
    prediction data, compute misclassification.
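
The evaluation procedure in the steps above can be sketched as follows. The decision tree itself was built in SQL Server 2000 Analysis Server and is not reproduced here: `train_majority_by_service` is a deliberately trivial stand-in predictor (most frequent final state per service), and the record layout `(service, final_state)` is an assumption made only for this illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (service, final_state), an assumed layout


def split_80_20(records: List[Record], seed: int = 0):
    """Shuffle reproducibly, then return (80% training, 20% held out)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def train_majority_by_service(train: List[Record]) -> Callable[[str], str]:
    """Placeholder model, NOT a decision tree: predicts the most common
    final state seen for a service, falling back to the overall mode."""
    by_service = defaultdict(Counter)
    overall = Counter()
    for service, state in train:
        by_service[service][state] += 1
        overall[state] += 1
    default = overall.most_common(1)[0][0]
    table = {s: c.most_common(1)[0][0] for s, c in by_service.items()}
    return lambda service: table.get(service, default)


def misclassification(predict: Callable[[str], str], records: List[Record]):
    """Return (number of wrong predictions, number of records)."""
    wrong = sum(predict(service) != state for service, state in records)
    return wrong, len(records)
```

A model trained only on baseline traffic learns what normal final states look like, so a higher misclassification count on an intrusion sample than on the held-out baseline is the signal of interest. Swapping in a real decision tree only requires replacing the training function.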

19
Dependency Network
20
Decision Tree
21
Apply Data Mining Model to Validate/Predict
22
Results
Misclassification (by final state)
Dataset      Misclassified/Total   Rate
Normal       149/1510              9.86%
Intrusion1   443/2324              19.06%
Intrusion2   376/1968              19.10%
Intrusion3   386/2011              19.19%
Intrusion4   437/2298              19.01%
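
The rates above follow directly from the counts. The short check below recomputes them with integer arithmetic. The listed figures match the ratios truncated to two decimals rather than rounded (rounding would give 9.87, 19.11 and 19.02 for three of the rows). That is an inference from the arithmetic, not something the slides state.

```python
# Counts copied from the results slide: (misclassified, total).
RESULTS = {
    "Normal": (149, 1510),
    "Intrusion1": (443, 2324),
    "Intrusion2": (376, 1968),
    "Intrusion3": (386, 2011),
    "Intrusion4": (437, 2298),
}


def truncated_percent(errors: int, total: int) -> str:
    """Misclassification rate as a percentage, truncated to two decimals."""
    hundredths = errors * 10000 // total   # integer math avoids float rounding
    return f"{hundredths // 100}.{hundredths % 100:02d}"


for name, (errors, total) in RESULTS.items():
    print(f"{name:<11}{errors}/{total}  {truncated_percent(errors, total)}%")
```

The four intrusion datasets misclassify at roughly twice the rate of the held-out normal data, which is the gap the conclusion refers to.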
23
Conclusion and future improvement
  • Accuracy
  • Preliminary experiments using data mining on the
    tcpdump data showed promising results
  • Accuracy depends on sufficient training data and
    the right feature set.
  • Performance
  • 6 hours on one dataset (628775 records)
  • Size of time window
  • 2 seconds or larger?
  • Automated process
  • Call MSSQL DM and DTS procedures within VB
  • Real-time monitor and alarm

24
References
  • Intrusion Detection, Rebecca Gurley Bace,
    Macmillan Technical Publishing, 2000
  • Data Mining: Concepts and Techniques, Jiawei Han
    and Micheline Kamber, Morgan Kaufmann Publishers, 2001
  • Data Mining with Microsoft SQL Server 2000,
    Claude Seidman, Microsoft Press, 2001
  • http://www.cs.columbia.edu/sal/hpapers/USENIX/usenix.html
  • http://iris.cs.uml.edu:8080/network.html
  • http://www-nrg.ee.lbl.gov/. Network Research
    Group (NRG) of the Information and Computing
    Sciences Division (ICSD) at Lawrence Berkeley
    National Laboratory (LBNL) in Berkeley,
    California.

25
Thank You!