Title: Data Mining Approach for Network Intrusion Detection
1Data Mining Approach for Network Intrusion Detection
- Zhen Zhang
- Advisor: Dr. Chung-E Wang
- 04/24/2002
- Department of Computer Science
- California State University, Sacramento
2Outline
- Background
- Intrusion Detection: promises and challenges
- Data Mining in IDS: how can it help
- Motivation
- Approaches, tasks, problems and my contributions
- Results
- Conclusion and future work
3Intrusion Detection: Building a Secure Network
- Primary assumptions
- System activities are observable
- Normal and intrusive activities have distinct evidence
- Main techniques
- Misuse detection: patterns of well-known attacks
- Anomaly detection: deviation from normal usage
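The two techniques can be contrasted with a minimal sketch. The signature strings, the baseline values and the threshold of three standard deviations are illustrative assumptions and do not come from the original work.

```python
from statistics import mean, stdev

# Hypothetical signatures of well-known attacks (illustrative only).
KNOWN_SIGNATURES = {"/cgi-bin/phf", "/etc/passwd"}

def misuse_detect(payload):
    """Misuse detection: flag input that matches a known attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

def anomaly_detect(baseline, observed, k=3.0):
    """Anomaly detection: flag a value that deviates from the mean of
    normal (baseline) usage by more than k standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return abs(observed - mu) > k * sigma

# Made-up baseline: connections per second observed during normal usage.
normal_rates = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0]

print(misuse_detect("GET /cgi-bin/phf?Qalias=x"))   # matches a signature
print(anomaly_detect(normal_rates, 11.5))           # close to the baseline
print(anomaly_detect(normal_rates, 80.0))           # far from the baseline
```

Misuse detection only finds what is already in the signature set, while anomaly detection needs a trustworthy baseline of normal usage.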
4Data Mining in IDS
- Shortfalls with current IDS (mostly misuse detection)
- Variants: Intrusions change easily and frequently.
- False positives: False alarms make it difficult to pick out real intrusions.
- False negatives: Attacks for which there are no known signatures go undetected.
- Data overload: The amount of data grows rapidly.
5What is Data Mining
- Data Mining
- Take data and pull from it patterns or deviations.
- Many different types of algorithms
- Decision Tree, Link analysis, Clustering, Association, Rule abduction, Deviation Analysis, and Sequence analysis.
- Software and Tools
- MS SQL Server 2000
- Ripper and many others
6How can Data Mining help
- Variants
- Anomaly detection does not rely on signatures, so variants of an exploit code are of little concern.
- False positives
- Identify recurring sequences of alarms to help recognize valid network activity.
- False negatives
- Attacks for which signatures have not been developed might be detected.
- Data overload
- Data mining plays a vital role.
7Summary of my work
- Identify objective
- Distinguish network attacks from normal traffic
- New area, several research projects, no commercial products
- Focus on the principle and basic implementation of concepts
- Data Collection
- Data Pre-processing on tcpdump dataset
- Apply data mining on processed data
- Investigate results
- Software packages used: Visual Basic, Microsoft SQL Server 2000 with Analysis Server, Tcpdump
8Data Collection
- Tcpdump data (http://iris.cs.uml.edu:8080/)
- Tcpdump was executed on the gateway to capture the traffic between the LAN and external networks, and broadcast packets within the LAN
- Only headers, no user data
- Filters were used to keep only TCP and UDP packets
- Baseline and 4 simulated attacks
9TCPDUMP data format
- TCP packet
- Time stamp
- Source IP address
- Source port
- Destination IP address
- Destination port
- Flags (SYN, FIN, PUSH, RST, or .)
- Data sequence number of this packet
- Data sequence number of the data expected in return
- Number of bytes of receive buffer space available
- Indication of whether or not the data is urgent
10Tcpdump data format
- UDP packet
- Time stamp
- Source IP address
- Source port
- Destination IP address
- Destination port
- Length of the packet
- Example data
11Example tcpdump data
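The original example data is not reproduced here. As a sketch of how the header fields listed on the two previous slides could be extracted, the parser below assumes the classic one-line text output of tcpdump run with numeric addresses. The line layout, the regular expressions, the `parse_line` helper and the sample lines are assumptions for illustration, and real tcpdump output varies with version and options.

```python
import re

# Assumed layout of one text line of classic tcpdump output (numeric
# addresses); real output differs between versions and options.
ENDPOINTS = (
    r"^(?P<ts>\S+) "
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+) > "
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+): "
)
UDP_RE = re.compile(ENDPOINTS + r"udp (?P<length>\d+)")
TCP_RE = re.compile(
    ENDPOINTS
    + r"(?P<flags>[SFPR.]+)"                        # SYN, FIN, PUSH, RST or "."
    + r"(?: (?P<seq>\d+):\d+\((?P<length>\d+)\))?"  # sequence range and byte count
    + r"(?: ack (?P<ack>\d+))?"                     # sequence number expected in return
    + r"(?: win (?P<win>\d+))?"                     # receive buffer space available
    + r"(?: urg (?P<urg>\d+))?"                     # present only for urgent data
)

def _to_int(value):
    return int(value) if value is not None else None

def parse_line(line):
    """Return a dict of header fields, or None if the line is not recognized."""
    for proto, pattern in (("udp", UDP_RE), ("tcp", TCP_RE)):
        m = pattern.match(line)
        if m is None:
            continue
        g = m.groupdict()
        pkt = {
            "proto": proto,
            "ts": g["ts"],
            "src": (g["src_ip"], int(g["src_port"])),
            "dst": (g["dst_ip"], int(g["dst_port"])),
            "length": _to_int(g["length"]),
        }
        if proto == "tcp":
            pkt.update(
                flags=g["flags"],
                seq=_to_int(g["seq"]),
                ack=_to_int(g["ack"]),
                win=_to_int(g["win"]),
                urgent=g["urg"] is not None,
            )
        return pkt
    return None

# Made-up sample lines, not taken from the dataset.
tcp_line = "10:35:41.504694 192.168.1.10.1024 > 10.0.0.5.80: S 1382726990:1382726990(0) win 4096"
udp_line = "10:35:42.100000 192.168.1.10.1025 > 10.0.0.5.53: udp 36"
print(parse_line(tcp_line))
print(parse_line(udp_line))
```

Returning `None` for unrecognized lines lets a pre-processing pass count and inspect what it skipped instead of failing on the first odd line.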
12Data Pre-processing: 80-90% of the work
- Convert packet-level information to connection level
- Group by same source/destination IP/port
- Use flags and ACKs to determine the status of the connection
- SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH, SHR, OOS1, OOS2
- Record start time, duration, protocol
- Calculate bytes in, bytes out, resent rate
- UDP is connectionless, so simply treat each packet as a connection
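The packet-to-connection step can be sketched as follows. This is a simplified illustration, not the Visual Basic implementation used in the project: the packet dictionaries, the `summarize_connections` and `classify` helpers and the `OTHER` placeholder are assumptions, only three of the listed states (SF, REJ, S0) are modeled, and the resent rate is not computed.

```python
def classify(conn):
    """Map the flags seen in each direction to a coarse final state."""
    orig, resp = conn["orig_flags"], conn["resp_flags"]
    if "S" in orig and "R" in resp and "S" not in resp:
        return "REJ"    # SYN answered by RST
    if "S" in orig and not resp:
        return "S0"     # SYN sent, no reply seen
    if "S" in orig and "S" in resp and "F" in orig and "F" in resp:
        return "SF"     # normal establishment and termination
    return "OTHER"      # placeholder for the states not modeled here

def summarize_connections(packets):
    """Group packets by endpoint pair and build one record per connection.

    Each packet is a dict with keys ts (seconds), src and dst ((ip, port)
    tuples), flags (str) and length (bytes of data).
    """
    conns = {}
    for p in sorted(packets, key=lambda p: p["ts"]):
        forward, reverse = (p["src"], p["dst"]), (p["dst"], p["src"])
        if reverse in conns:
            conn, direction = conns[reverse], "resp"
        else:
            # The first packet seen between two endpoints defines the originator.
            conn = conns.setdefault(forward, {
                "start": p["ts"], "end": p["ts"],
                "bytes_out": 0, "bytes_in": 0,
                "orig_flags": "", "resp_flags": "",
            })
            direction = "orig"
        conn["end"] = p["ts"]
        conn[direction + "_flags"] += p["flags"]
        conn["bytes_out" if direction == "orig" else "bytes_in"] += p["length"]
    for conn in conns.values():
        conn["duration"] = conn["end"] - conn["start"]
        conn["state"] = classify(conn)
    return conns

# Made-up packets: one complete connection and one unanswered SYN.
a, b, c = ("10.0.0.1", 1024), ("10.0.0.2", 80), ("10.0.0.3", 2000)
packets = [
    {"ts": 0.0, "src": a, "dst": b, "flags": "S", "length": 0},
    {"ts": 0.1, "src": b, "dst": a, "flags": "S", "length": 0},
    {"ts": 0.3, "src": a, "dst": b, "flags": "P", "length": 100},
    {"ts": 0.5, "src": a, "dst": b, "flags": "F", "length": 0},
    {"ts": 0.6, "src": b, "dst": a, "flags": "F", "length": 0},
    {"ts": 1.0, "src": c, "dst": b, "flags": "S", "length": 0},
]
for key, conn in summarize_connections(packets).items():
    print(key, conn["state"], conn["bytes_out"], conn["bytes_in"])
```

A fuller version would also split connections when a port pair is reused and would cover the remaining states.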
13First round of processing
Intrinsic Features
14Establish more information
- Same Destination Temporal and Statistical Attributes (last 2 seconds)
Count_per_dest: number of connections to this destination IP
REJ_count_per_dest: number of connections that get the flag REJ
S01_count_per_dest: number of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
Diff_Services_per_dest: number of unique services
Diff_Service_Rate: Diff_Services / Count
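A minimal sketch of how the same-destination attributes could be computed over a 2-second window. The record layout and the `same_dest_features` helper are assumptions, and the window is assumed to include the current connection, so the count is never zero. The scan is quadratic for clarity; a dataset of several hundred thousand records would need a sliding window instead.

```python
def same_dest_features(conns, window=2.0):
    """Compute per-destination attributes for each connection.

    conns: list of dicts sorted by 'start', each with keys
    start (seconds), dst_ip, service and state.
    """
    features = []
    for i, cur in enumerate(conns):
        # Connections to the same destination within the last `window`
        # seconds, the current one included.
        recent = [
            c for c in conns[: i + 1]
            if cur["start"] - c["start"] <= window and c["dst_ip"] == cur["dst_ip"]
        ]
        count = len(recent)
        services = {c["service"] for c in recent}
        features.append({
            "Count_per_dest": count,
            "REJ_count_per_dest": sum(c["state"] == "REJ" for c in recent),
            "S01_count_per_dest": sum(c["state"] in ("S0", "S1") for c in recent),
            "Diff_Services_per_dest": len(services),
            "Diff_Service_Rate": len(services) / count,
        })
    return features

# Made-up connection records.
conns = [
    {"start": 0.0, "dst_ip": "10.0.0.5", "service": "http", "state": "SF"},
    {"start": 0.5, "dst_ip": "10.0.0.5", "service": "ftp", "state": "REJ"},
    {"start": 1.0, "dst_ip": "10.0.0.9", "service": "http", "state": "S0"},
    {"start": 1.5, "dst_ip": "10.0.0.5", "service": "http", "state": "S0"},
    {"start": 4.0, "dst_ip": "10.0.0.5", "service": "http", "state": "SF"},
]
for row in same_dest_features(conns):
    print(row)
```

The same-service attributes follow the same pattern with the roles of destination and service swapped.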
15Establish more information
- Same Service Temporal and Statistical Attributes (last 2 seconds)
Count_per_service: number of connections to this type of service
REJ_count_per_service: number of connections that get the flag REJ (SYN met by RST)
S01_count_per_service: number of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
Diff_Hosts_per_service: number of unique destination hosts
Diff_Hosts_Rate: Diff_Hosts / Count
16Second round of processing
Same Destination Temporal and Statistical
Attributes
17Final round of processing
- Final, but important
- Reduce data amount
- Remove noise or trivial information
- Reorganize data, add new features if necessary
- Challenges
- Hard to tell which data to reduce/remove
- Requires tremendous domain knowledge
- Need experiments and adjustments
18Data Mining
- Decision Tree Algorithm
- Microsoft SQL Server 2000 Analysis Server
- Steps
- 80% of the baseline (normal) dataset as training data
- Use the remaining 20% as validation data, compute misclassification.
- 20% of each of the four intrusion datasets as prediction data, compute misclassification.
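The steps above can be sketched in a few lines. This is not the Microsoft decision tree algorithm of Analysis Server: a one-level tree (a stump that picks the single best attribute) stands in for it, the predicted attribute is assumed to be the connection's final state, and the records are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def split(records, train_frac=0.8, seed=0):
    """Shuffle and split records into training and validation parts."""
    rows = records[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def train_stump(rows, features, target):
    """One-level decision tree: choose the attribute whose per-value
    majority class makes the fewest errors on the training data."""
    default = Counter(r[target] for r in rows).most_common(1)[0][0]
    best = None
    for f in features:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[f]][r[target]] += 1
        rule = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(rule[r[f]] != r[target] for r in rows)
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    _, feature, rule = best
    return feature, rule, default

def misclassification(model, rows, target):
    """Fraction of rows whose predicted target differs from the actual one."""
    feature, rule, default = model
    wrong = sum(rule.get(r[feature], default) != r[target] for r in rows)
    return wrong / len(rows)

# Made-up baseline: the final state is fully determined by the service.
baseline = [
    {"proto": "tcp", "service": s, "state": "SF" if s == "http" else "REJ"}
    for s in ["http", "smtp"] * 50
]
# Made-up intrusion data: http connections that never complete.
intrusion = [{"proto": "tcp", "service": "http", "state": "S0"} for _ in range(20)]

train, valid = split(baseline)
model = train_stump(train, ["proto", "service"], "state")
print(misclassification(model, valid, "state"))
print(misclassification(model, intrusion, "state"))
```

Because the model is trained only on normal traffic, a clearly higher misclassification rate on a dataset is the signal that it deviates from normal behavior.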
19Dependency Network
20Decision Tree
21Apply Data Mining Model to Validate/Predict
22Results
Misclassification (by final state)
Dataset      Misclassified/Total   Rate
Normal       149/1510              9.86%
Intrusion1   443/2324              19.06%
Intrusion2   376/1968              19.10%
Intrusion3   386/2011              19.19%
Intrusion4   437/2298              19.01%
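Each rate is the misclassified count divided by the total, expressed in percent. The listed figures match truncation to two decimals rather than rounding, which the short check below reproduces. The `truncated_percent` helper is a name introduced here for illustration.

```python
def truncated_percent(wrong, total):
    """Misclassification rate in percent, truncated (not rounded) to two decimals."""
    hundredths = wrong * 10000 // total   # integer arithmetic avoids float error
    return f"{hundredths // 100}.{hundredths % 100:02d}"

results = {
    "Normal": (149, 1510),
    "Intrusion1": (443, 2324),
    "Intrusion2": (376, 1968),
    "Intrusion3": (386, 2011),
    "Intrusion4": (437, 2298),
}
for name, (wrong, total) in results.items():
    print(f"{name}: {truncated_percent(wrong, total)}%")
```

The intrusion datasets are misclassified at roughly twice the rate of the normal validation data.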
23Conclusion and future improvement
- Accuracy
- Preliminary experiments using DM on the tcpdump data showed promising results
- Depends on sufficient training data and the right feature set.
- Performance
- 6 hours on one dataset (628775 records)
- Size of time window
- 2 seconds or larger?
- Automated process
- Call MSSQL DM and DTS procedures within VB
- Real-time monitor and alarm
24References
- Intrusion Detection, Rebecca Gurley Bace, Macmillan Technical Publishing, 2000
- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2001
- Data Mining with Microsoft SQL Server 2000, Claude Seidman, Microsoft Press, 2001
- http://www.cs.columbia.edu/sal/hpapers/USENIX/usenix.html
- http://iris.cs.uml.edu:8080/network.html
- http://www-nrg.ee.lbl.gov/. Network Research Group (NRG) of the Information and Computing Sciences Division (ICSD) at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, California.
25Thank You!