Title: Data Mining
1Data Mining Intrusion Detection
- Shan Bai
- Instructor: Dr. Yingshu Li
- CSC 8712, Spring 08
2Outline
- Intrusion Detection
- Data Mining
- Data Mining in Intrusion Detection
- Reference
3What is an intrusion?
- An intrusion can be defined as any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource.
[Figure: Incidents reported to the Computer Emergency Response Team/Coordination Center]
[Figure: Spread of the SQL Slammer worm 10 minutes after its deployment]
4Intrusion Examples
- Trojan horse / worm
- Address spoofing: a malicious user uses a fake IP address to send malicious packets to a target.
- DOS: denial-of-service
- R2L: unauthorized access from a remote machine, e.g. guessing a password
- U2R: unauthorized access to local super user (root) privileges, e.g. various "buffer overflow" attacks
- Probing: surveillance and other probing, e.g. port scanning
- Many others
5Intrusion Detection System (IDS)
- Intrusion Detection System: a combination of software and hardware that attempts to perform intrusion detection and raises an alarm when a possible intrusion happens.
6IDS Categories
- Intrusion detection systems are split into two groups:
- Anomaly detection systems: identify malicious traffic based on deviations from established normal network behavior.
- Misuse detection systems: identify intrusions based on a known pattern (signature) of the malicious activity.
7Anomaly Detection
[Figure: activity measures, with a probable intrusion marked]
- Baseline the normal traffic and then look for things that are out of the norm.
- Relatively high false positive rate: anomalies can just be new normal activities.
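The baseline idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the activity measure (connections per minute), the sample values, and the 3-standard-deviation threshold are all hypothetical and do not come from any particular IDS.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize attack-free activity as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that lies more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical activity measure: connections per minute during normal operation.
normal_traffic = [98, 102, 101, 97, 100, 103, 99, 100]
baseline = build_baseline(normal_traffic)

print(is_anomalous(101, baseline))  # close to the baseline
print(is_anomalous(450, baseline))  # far outside the baseline
```

A new but legitimate workload would also exceed the threshold, which is exactly the false positive problem noted on this slide.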
8Misuse Detection
- Example: if (src_ip == dst_ip) then "land attack"
- Look for known indicators: ICMP scans, port scans, connection attempts; CPU, RAM, and I/O utilization; file system activity, modification of system files, permission modifications
- Can't detect new attacks
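A signature check like the land attack rule can be written as a table of named predicates. The sketch below is a minimal illustration; the `Packet` fields and the `SIGNATURES` table are hypothetical names chosen for this example, not the format of any real IDS.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int

# Each signature is a name plus a predicate over a packet.
SIGNATURES = {
    # Land attack: source and destination address are identical.
    "land attack": lambda p: p.src_ip == p.dst_ip,
}

def match_signatures(packet):
    """Return the names of all known signatures the packet matches."""
    return [name for name, rule in SIGNATURES.items() if rule(packet)]

print(match_signatures(Packet("10.0.0.5", "10.0.0.5", 80)))
print(match_signatures(Packet("10.0.0.5", "10.0.0.9", 80)))
```

An attack with no entry in the table returns an empty list, which is why misuse detection cannot catch new attacks.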
9Goal of Intrusion Detection Systems (IDS)
- To detect an intrusion as it happens and be able to respond to it.
- False positives: a false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion.
- Too many false positives: the user will quit monitoring the IDS because of noise.
- False negatives: a false negative is a situation where an intrusion is really happening, but the IDS doesn't catch it.
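The two error types can be measured once alarms are compared with ground truth. The following sketch assumes a small, made-up list of events; the function name `error_rates` and the data are illustrative only.

```python
def error_rates(alerts, truths):
    """Return (false_positive_rate, false_negative_rate).

    alerts[i] is True when the IDS raised an alarm for event i;
    truths[i] is True when event i really was an intrusion.
    """
    fp = sum(a and not t for a, t in zip(alerts, truths))
    fn = sum(t and not a for a, t in zip(alerts, truths))
    negatives = sum(not t for t in truths)
    positives = sum(truths)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# Hypothetical outcome for six events.
alerts = [True, True, False, False, True, False]
truths = [True, False, False, True, False, False]
fpr, fnr = error_rates(alerts, truths)
print(fpr, fnr)
```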
11Why do we need Data Mining?
- Despite the enormous amount of data, particular events of interest are still quite rare; their frequency ranges from 0.1% to less than 10%.
- "We are drowning in data, but starving for knowledge!"
12Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): the whole process of finding useful information and patterns in data.
- Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.
- Data mining is the core of the knowledge discovery process.
13KDD Process
- Selection: obtain data from various sources.
- Preprocessing: cleanse data.
- Transformation: convert to common format; transform to new format.
- Data Mining: obtain desired results.
- Interpretation/Evaluation: present results to user in meaningful manner.
14Data Mining: A KDD Process
- Data mining: core of knowledge discovery process
[Figure: KDD pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
15Typical Data Mining Architecture
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning, data integration, and filtering; Databases and Data Warehouse; with a Knowledge-base alongside]
17Network intrusion detection
- The number of intrusions on the network is typically a very small fraction of the total network traffic
18Why Can Data Mining Help?
- Learn from traffic data
- Supervised learning: learn precise models from past intrusions
- Unsupervised learning: identify suspicious activities
- Maintain models on dynamic data
- Correlation of suspicious events across network sites
- Helps detect sophisticated attacks not identifiable by single-site analyses
- Analysis of long-term data (months/years)
- Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
19Intrusion Detection
- Traditional intrusion detection system (IDS) tools (e.g. SNORT) are based on signatures of known attacks
- Limitations
- Signature database has to be manually revised for each new type of discovered intrusion
- They cannot detect emerging cyber threats
- Substantial latency in deployment of newly created signatures across the computer system
20Data Mining for Intrusion Detection: Techniques and Applications
- Frequent pattern mining
- Classification
- Clustering
- Mining data streams
21Frequent pattern mining
- Patterns that occur frequently in a database
- Mining frequent patterns: finding regularities
- Process of mining frequent patterns for intrusion detection
- Phase I: mine a repository of normal frequent itemsets for attack-free data
- Phase II: find frequent itemsets in the last n connections and compare the patterns to the normal profile
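The two phases can be sketched as a set difference between frequent itemsets. The example below is a minimal illustration, assuming connections are encoded as sets of hypothetical `attribute=value` strings and counting itemsets by brute force, which is only practical for tiny inputs.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(connections, min_support, max_size=2):
    """Brute-force count of all itemsets up to max_size items."""
    counts = Counter()
    for conn in connections:
        items = sorted(conn)
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    threshold = min_support * len(connections)
    return {itemset for itemset, n in counts.items() if n >= threshold}

# Phase I: attack-free data gives the normal profile.
attack_free = [
    {"proto=tcp", "port=80"},
    {"proto=tcp", "port=80"},
    {"proto=tcp", "port=443"},
    {"proto=udp", "port=53"},
]
normal_profile = frequent_itemsets(attack_free, min_support=0.5)

# Phase II: frequent itemsets in the last n connections, minus the profile.
recent = [
    {"proto=tcp", "port=1863"},
    {"proto=tcp", "port=1863"},
    {"proto=tcp", "port=80"},
]
suspicious = frequent_itemsets(recent, min_support=0.5) - normal_profile
print(suspicious)
```

Only patterns that are frequent now but absent from the normal profile remain, here the ones involving port 1863.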
22Frequent pattern mining
- Apriori
- Any subset of a frequent itemset must also be frequent (an anti-monotone property)
- A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
- No superset of any infrequent itemset should be generated or tested
- Many item combinations can be pruned
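The pruning rule above is the core of Apriori's level-wise search. The following is a compact sketch, not an optimized implementation; the transactions and the `min_count` threshold are made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset seen in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count the candidates of size k and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts"},
    {"milk", "nuts"},
]
frequent = apriori(transactions, min_count=2)
print(frequent[frozenset({"beer", "diaper", "nuts"})])
```

Because `milk` is infrequent on its own, no itemset containing it is ever generated or counted.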
23Sequential Pattern Analysis
- Models sequence patterns
- (Temporal) order is important in many situations
- Time-series databases and sequence databases
- Frequent patterns → (frequent) sequential patterns
- Sequential patterns for intrusion detection
- Capture the signatures for attacks in a series of packets
24Sequential Pattern Mining
- Given a set of sequences, find the complete set
of frequent subsequences
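The building block of sequential pattern mining is counting how many sequences contain a candidate as an ordered, not necessarily contiguous, subsequence. The sketch below shows only that support count, not a full miner; the event names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern's events appear in sequence in the same order."""
    remaining = iter(sequence)
    # Each `in` consumes the iterator up to the match, which enforces order.
    return all(event in remaining for event in pattern)

def support(pattern, sequences):
    """Number of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences)

sequences = [
    ["syn", "login_fail", "login_fail", "login_ok"],
    ["syn", "login_ok"],
    ["login_fail", "syn", "login_fail", "login_ok"],
]
print(support(["login_fail", "login_ok"], sequences))
print(support(["login_ok", "syn"], sequences))
```

A pattern is frequent when its support reaches a chosen minimum; a complete miner would enumerate candidates level by level, much as Apriori does for itemsets.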
25Apriori Property in Sequences
26Classification: A Two-Step Process
- Model construction: describe a set of predetermined classes
- Training dataset: tuples for model construction
- Each tuple/sample belongs to a predefined class
- Classification rules, decision trees, or math formulae
- Model application: classify unseen objects
- Estimate accuracy of the model using an independent test set
- Acceptable accuracy → apply the model to classify data tuples with unknown class labels
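The two steps can be illustrated with the simplest possible model, a single threshold on one numeric feature. Everything here is an assumption made for the example: the feature (failed logins per minute), the labels, and the data values.

```python
def train_stump(samples):
    """Step 1: pick the threshold that classifies the training set best.

    Each sample is (feature_value, is_intrusion).
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in samples}):
        correct = sum((x >= threshold) == label for x, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, samples):
    """Step 2: estimate accuracy on an independent test set."""
    return sum((x >= threshold) == label for x, label in samples) / len(samples)

# Hypothetical feature: failed logins per minute; label True means intrusive.
train = [(2, False), (3, False), (5, False), (40, True), (55, True), (60, True)]
test = [(1, False), (4, False), (50, True), (6, True)]

threshold = train_stump(train)
print(threshold, accuracy(threshold, test))
```

If the test accuracy is acceptable, the same threshold is then applied to tuples whose class is unknown.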
27Classification
28Classification: Decision Tree
- A node in the tree: a test of some attribute
- A branch: a possible value of the attribute
- Classification
- Start at the root
- Test the attribute
- Move down the tree branch
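The traversal described above can be sketched with a tree stored as nested dictionaries. The tree below is hand-written and hypothetical; in practice it would be learned from training data.

```python
# Internal nodes test an attribute; leaves are class labels (plain strings).
TREE = {
    "attribute": "protocol",
    "branches": {
        "icmp": "normal",
        "tcp": {
            "attribute": "flag",
            "branches": {"SF": "normal", "S0": "intrusion"},
        },
    },
}

def classify(node, record):
    """Start at the root, test the attribute, and follow the matching branch."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(TREE, {"protocol": "tcp", "flag": "S0"}))
print(classify(TREE, {"protocol": "icmp", "flag": "SF"}))
```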
29Neural classification: HIDE
- "A hierarchical network intrusion detection system using statistical processing and neural network classification" by Zheng et al.
- Five major components
- Probes: collect traffic data
- Event preprocessor: preprocesses traffic data and feeds the statistical model
- Statistical processor: maintains a model for normal activities and generates vectors for new events
- Neural network: classifies the vectors of new events
- Post processor: generates reports
30Clustering
- What Is Clustering?
- Group data into clusters
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
- Unsupervised learning: no predefined classes
31Clustering
- What Is A Good Clustering?
- High intra-class similarity and low inter-class similarity
- Depending on the similarity measure
- The ability to discover some or all of the hidden patterns
32Clustering
- Clustering Approaches
- Partitioning algorithms
- Partition the objects into k clusters
- Iteratively reallocate objects to improve the clustering
- Hierarchy algorithms
- Agglomerative: each object is a cluster, merge clusters to form larger ones
- Divisive: all objects are in a cluster, split it up into smaller clusters
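As one concrete partitioning algorithm, k-means alternates between assigning objects to the nearest center and recomputing the centers. The sketch below works on one-dimensional values to stay short; the data, the initial centers, and the fixed iteration count are illustrative assumptions.

```python
def kmeans(points, centers, iterations=10):
    """Partition 1-D points into len(centers) clusters."""
    clusters = []
    for _ in range(iterations):
        # Reallocate every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical measure, e.g. packets per connection.
points = [1.0, 2.0, 3.0, 50.0, 52.0, 54.0]
centers, clusters = kmeans(points, centers=[1.0, 2.0])
print(centers)
print(clusters)
```

Even with a poor initial choice of centers, the reallocation step separates the small values from the large ones after a few iterations.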
33Clustering
34Mining Data Streams for Intrusion Detection
- Maintaining profiles of normal activities
- The profiles of normal activities may drift
- Identifying novel attacks
- Identifying clusters and outliers in traffic data streams
- Reduce the future alarm load by writing filtering rules that automatically discard well-understood false positives
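One way to maintain a profile that can drift is an exponentially weighted mean and variance, updated one observation at a time. This is a sketch under stated assumptions, not a method taken from the systems discussed here: the class name, the smoothing factor `alpha`, the `tolerance`, and the warm-up length are all hypothetical.

```python
class StreamingProfile:
    """Profile of normal activity that drifts slowly with the stream."""

    def __init__(self, alpha=0.1, tolerance=3.0, warmup=5):
        self.alpha = alpha          # weight given to each new observation
        self.tolerance = tolerance  # allowed deviation, in standard deviations
        self.warmup = warmup        # observations absorbed before flagging
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def update(self, value):
        """Return True for an outlier; otherwise fold the value into the profile."""
        if self.mean is None:
            self.mean = float(value)
            self.seen = 1
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        if self.seen >= self.warmup and abs(deviation) > self.tolerance * std:
            return True  # outliers are reported and do not shift the profile
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return False

profile = StreamingProfile()
stream = [100, 102, 98, 101, 99, 100, 500]
flags = [profile.update(v) for v in stream]
print(flags)
```

Because outliers are excluded from the update, a burst of attack traffic does not get absorbed into the notion of normal, while gradual change still is.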
35Data Mining for Intrusion Detection
- Misuse detection
- Predictive models are built from labeled data sets (instances are labeled as normal or intrusive)
- These models can be more sophisticated and precise than manually created signatures
- Recent research, e.g. JAM (Java Agents for Metalearning)
37JAM (Java Agents for Metalearning)
- JAM (developed at Columbia University) uses data mining techniques to discover patterns of intrusions. It then applies a meta-learning classifier to learn the signature of attacks.
- The association rules algorithm determines relationships between fields in the audit trail records, and the frequent episodes algorithm models sequential patterns of audit events. Features are then extracted from both algorithms and used to compute models of intrusion behavior.
- The classifiers build the signature of attacks. Thus, data mining in JAM builds a misuse detection model.
- Classifiers in JAM are generated by running a rule learning program on training data of system usage. After training, the resulting classification rules are used to recognize anomalies and detect known intrusions.
- The system has been tested with data from Sendmail-based attacks, and with network attacks using tcpdump data.
38Data Mining for Intrusion Detection
- Anomaly detection
- Identifies anomalies as deviations from normal behavior
- E.g. ADAM (Audit Data Analysis and Mining) and MINDS (MINnesota INtrusion Detection System)
40ADAM: Audit Data Analysis and Mining
- Detecting Intrusion by Data Mining
- Combination of association rules and classification rules
- First, ADAM collects known frequent datasets (an off-line algorithm)
- Second, ADAM runs an online algorithm
- Finds the latest frequent connection records
- Compares them with the known mined data
- Discards those that seem to be normal
- Forwards suspicious ones to the classifier
- The trained classifier then classifies the suspicious data as one of the following
- Known type of attack
- Unknown type of attack
- False alarm
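The online stage described above can be sketched as a filter followed by a three-way labeling step. This is not ADAM's actual code: the lookup tables stand in for its trained classifier, and every itemset and attack name below is made up for illustration.

```python
def online_filter(recent_itemsets, normal_profile):
    """Discard itemsets that also appear in the attack-free profile."""
    return [s for s in recent_itemsets if s not in normal_profile]

def label(itemset, known_attacks, false_alarms):
    """Stand-in for the trained classifier: simple table lookups."""
    if itemset in known_attacks:
        return "known attack: " + known_attacks[itemset]
    if itemset in false_alarms:
        return "false alarm"
    return "unknown attack"

# Hypothetical data for all three tables.
normal_profile = {frozenset({"proto=tcp", "port=80"})}
known_attacks = {frozenset({"port=27374", "flag=SYN"}): "SubSeven scan"}
false_alarms = {frozenset({"port=6667"})}

recent = [
    frozenset({"proto=tcp", "port=80"}),
    frozenset({"port=27374", "flag=SYN"}),
    frozenset({"port=6667"}),
    frozenset({"port=9999"}),
]
suspicious = online_filter(recent, normal_profile)
labels = [label(s, known_attacks, false_alarms) for s in suspicious]
print(labels)
```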
41ADAM Detecting Intrusion by Data Mining
42ADAM Audit Data Analysis and Mining
- ADAM has two phases in its model
- 1st phase: train the classifier
- Offline process
- Takes place only once
- Before the main experiment
- 2nd phase: use the trained classifier
- The trained classifier is then used to detect anomalies
- Online process
43The MINDS Project
- MINDS: MINnesota INtrusion Detection System
- Learning from Rare Class: building rare class prediction models
- Anomaly/outlier detection
- Summarization of attacks using association pattern analysis
Rules Discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}
44MINDS - Learning from Rare Class
- Problem: building models for rare network attacks (mining needle in a haystack)
- Standard data mining models are not suitable for rare classes
- Models must be able to handle skewed class distributions
- Learning from data streams: intrusions are sequences of events
45MINDS - Anomaly Detection
- Detect novel attacks/intrusions by identifying them as deviations from normal, i.e. anomalous behavior
- Identify normal behavior
- Construct useful set of features
- Define similarity function
- Use outlier detection algorithm
- Nearest neighbor approach
- Density based schemes
- Unsupervised Support Vector Machines (SVM)
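The nearest neighbor approach can be sketched by scoring each connection with the distance to its k-th nearest neighbor: the larger the score, the more isolated the point. This is a simplified stand-in, not the algorithm MINDS uses; the two-dimensional feature vectors and `k=2` are assumptions for the example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores

# Hypothetical feature vectors, e.g. (packets, kilobytes) per connection.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
scores = knn_outlier_scores(points)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

Euclidean distance plays the role of the similarity function here; real traffic features would need scaling before distances are meaningful.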
46Experimental Evaluation
- Publicly available data set
- DARPA 1998 Intrusion Detection Evaluation Data Set: prepared and managed by MIT Lincoln Lab; includes a wide variety of intrusions simulated in a military network environment
- Real network data from
- University of Minnesota
- Anomaly detection is applied
- 4 times a day
- 10 minutes time window
[Figure: MINDS pipeline, with labels: network; net-flow data using CISCO routers; 10 minutes cycle, 2 million connections; data preprocessing; MINDS anomaly detection; anomaly scores; association pattern analysis; open source signature-based network IDS (www.snort.org)]
47 MINDS - Framework for Mining Associations
[Figure: the Anomaly Detection System produces ranked connections labeled attack or normal; the Discriminating Association Pattern Generator (MINDS association analysis module) derives rules such as "R1: TCP, DstPort=1863 → Attack" and "R100: TCP, DstPort=80 → Normal", which update the Knowledge Base]
- Build normal profile
- Study changes in normal behavior
- Create attack summary
- Detect misuse behavior
- Understand nature of the attack
48Discovered Real-life Association Patterns
- Rule 1: SrcIP=XXXX, DstPort=80, Protocol=TCP, Flag=SYN, NoPackets=3, NoBytes=120...180 (c1=256, c2=1)
- Rule 2: SrcIP=XXXX, DstIP=YYYY, DstPort=80, Protocol=TCP, Flag=SYN, NoPackets=3, NoBytes=120...180 (c1=177, c2=0)
- At first glance, Rule 1 appears to describe a Web scan
- Rule 2 indicates an attack on a specific machine
- Both rules together indicate that a scan is performed first, followed by an attack on a specific machine identified as vulnerable by the attacker
49Discovered Real-life Association Patterns
- DstIP=ZZZZ, DstPort=8888, Protocol=TCP (c1=369, c2=0)
- DstIP=ZZZZ, DstPort=8888, Protocol=TCP, Flag=SYN (c1=291, c2=0)
- This pattern indicates an anomalously high number of TCP connections on port 8888 involving machine ZZZZ
- Follow-up analysis of connections covered by the pattern indicates that this could be a machine running a variation of the Kazaa file-sharing protocol
- Having an unauthorized application increases the vulnerability of the system
50Discovered Real-life Association Patterns(ctd)
- SrcIP=XXXX, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=4, NoBytes=189...200 (c1=582, c2=2)
- SrcIP=XXXX, DstPort=12345, NoPackets=4, NoBytes=189...200 (c1=580, c2=3)
- SrcIP=YYYY, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=3, NoBytes=144 (c1=694, c2=3)
- This pattern indicates a large number of scans on ports 27374 (which is a signature for the SubSeven worm) and 12345 (which is a signature for the NetBus worm)
- Further analysis showed that no fewer than five machines were scanning for one or both of these ports in any time window
51Discovered Real-life Association Patterns(ctd)
- DstPort=6667, Protocol=TCP (c1=254, c2=1)
- This pattern indicates an unusually large number of connections on port 6667 detected by the anomaly detector
- Port 6667 is where IRC (Internet Relay Chat) is typically run
- Further analysis reveals that there are many small packets from/to various IRC servers around the world
- Although IRC traffic is not unusual, the fact that it is flagged as anomalous is interesting
- This might indicate that the IRC server has been taken down (by a DOS attack for example) or it is a rogue IRC server (it could be involved in some hacking activity)
52Discovered Real-life Association Patterns(ctd)
- DstPort=1863, Protocol=TCP, Flag=0, NoPackets=1, NoBytes<139 (c1=498, c2=6)
- DstPort=1863, Protocol=TCP, Flag=0 (c1=587, c2=6)
- DstPort=1863, Protocol=TCP (c1=606, c2=8)
- This pattern indicates a large number of anomalous TCP connections on port 1863
- Further analysis reveals that the remote IP block is owned by Hotmail
- Flag=0 is unusual for TCP traffic
53 MINDS Conclusion
- Data mining based algorithms are capable of detecting intrusions that cannot be detected by state-of-the-art signature based methods
- SNORT has static knowledge manually updated by human analysts
- MINDS anomaly detection algorithms are adaptive in nature
- MINDS anomaly detection algorithms can also be effective in detecting anomalous behavior originating from a compromised or infected machine
54IDS Using both Misuse and Anomaly Detection: RIDS-100
- RIDS (Rising Intrusion Detection System) is provided by Rising Tech, a leader in antivirus and content security software and services in China.
- The company is a leading provider of client, gateway and server security solutions for virus protection, firewall and intrusion detection technologies and security services to enterprises and service providers around China.
- RIDS makes use of both intrusion detection techniques, misuse and anomaly detection.
- A distance-based outlier detection algorithm is used to detect deviational behavior among collected network data.
- For misuse detection, it has a very large set of collected data patterns that can be matched against scanned network data.
- This large set of data patterns is scanned using a data mining classification algorithm, the decision tree.
- http://www.rising-global.com/
55A cooperative anomaly and intrusion detection system (CAIDS)
- Built with a network-based intrusion detection system (NIDS) and an anomaly detection system (ADS) operating interactively through a signature generator.
56A cooperative anomaly and intrusion detection system (CAIDS)
- A frequent episode rule (FER) is generated out of a collection of frequent episodes. The FER is defined over episode sequences with multiple connection events.
- For example, we envision a window where we observe a 3-event sequence: E, D, and F. An FER is generated as E → D, F
- with confidence level freq(a ∪ b) / freq(b) = 0.8,
- where a represents the event E on the LHS and b corresponds to the two events D and F on the RHS of the rule.
- If b occurs with a frequency of 5% and the joint event a and b has a 4% chance to occur, there is a (0.04/0.05) = 80% chance that D and F will follow in the same window.
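The arithmetic in this example can be reproduced by counting windows. The sketch below follows the confidence definition exactly as stated on this slide, freq(a ∪ b) / freq(b); the helper names and the 100 synthetic windows (event G is a hypothetical filler) are assumptions made for illustration.

```python
def occurs_in_order(events, window):
    """True if the events appear in the window in the given order."""
    remaining = iter(window)
    return all(e in remaining for e in events)

def fer_stats(windows, lhs, rhs):
    """Return (confidence, support) for the rule lhs -> rhs.

    Confidence follows the slide: freq(a U b) / freq(b), with b the RHS.
    """
    joint = sum(occurs_in_order(lhs + rhs, w) for w in windows)
    rhs_only = sum(occurs_in_order(rhs, w) for w in windows)
    confidence = joint / rhs_only if rhs_only else 0.0
    support = joint / len(windows)
    return confidence, support

# 100 synthetic windows: b (D then F) occurs in 5, a and b together in 4.
windows = [["E", "D", "F"]] * 4 + [["D", "F"]] * 1 + [["G"]] * 95
confidence, support = fer_stats(windows, lhs=["E"], rhs=["D", "F"])
print(confidence, support)
```

Dividing the 4 joint windows by the 5 windows containing b gives the 0.8 confidence level of the slide's example.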
57A cooperative anomaly and intrusion detection system (CAIDS)
- In practice, the event E could be an authentication service characterized by two attributes: (service = authentication, flag = SF).
- The events D, F may be two sequential smtp requests denoted by (service = smtp).
- Thus we can derive an FER with a confidence level of c = 80% that two smtp services will follow the authentication service within a window w = 2 sec. The three joint traffic events account for a support level s = 10% out of all the network connections being evaluated. This FER is formally stated as follows:
(service = authentication) → (service = smtp), (service = smtp) (0.8, 0.1, 2 sec)    (1)
58A cooperative anomaly and intrusion detection system (CAIDS)
- An association rule is aimed at finding interesting intra-relationships inside a single connection record.
- In general, an FER is specified by the following expression:
L1, L2, ..., Ln → R1, ..., Rm (c, s, window)    (2)
- Li (1 ≤ i ≤ n) and Rj (1 ≤ j ≤ m) are ordered traffic connection events.
- We call L1, L2, ..., Ln the LHS episode and R1, ..., Rm the RHS of the episode rule.
59A cooperative anomaly and intrusion detection system (CAIDS)
[Figure: Architecture of the CAIDS simulator built with a 2,000-signature Snort and an anomaly detection subsystem (ADS) with 60 FERs after 2 weeks of rule training over the Lincoln Lab IDS evaluation dataset]
60Conclusion
- In this report we have studied the basic concepts and some classic system models in this area, such as ADAM and MINDS.
- We summarized these system models, their technologies, and their validation methods.
- We hope to give an overview of current developments in this area and of how data mining is evolving into the field of network intrusion detection.
61Reference
- DARPA 1998 data set
- A cleansed set in KDDCup99
- DARPA 1999 data set is also available
- http://www.ll.mit.edu/IST/ideval/data/data_index.html
- Daniel Barbara, Julia Couto, Sushil Jajodia, Leonard Popyack, Ningning Wu, "ADAM: Detecting Intrusions by Data Mining", Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001
- Zhang, J. and Zulkernine, M. 2006. "A Hybrid Network Intrusion Detection Technique Using Random Forests". In Proceedings of the First International Conference on Availability, Reliability and Security (April 20-22, 2006).
- W. Lee et al., "A data mining framework for building intrusion detection models". In Information and System Security, Vol. 3, No. 4, 2000.
- Ertoz, L. et al., "MINDS - Minnesota Intrusion Detection System", Next Generation Data Mining, Chapter 3, 2004
- Lu, C.-T., Boedihardjo, A.P., Manalwar, P., "Exploiting efficient data mining techniques to enhance intrusion detection systems", IEEE International Conference on Information Reuse and Integration (IRI-2005), 15-17 Aug. 2005, pp. 512-517
- Sal Stolfo, Andreas Prodromidis, Shelley Tselepis, Wenke Lee, Dave Fan, and Phil Chan (Honorable mention (runner-up) for Best Paper Award in Applied Research Category). In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD '97), Newport Beach, CA, August 1997
62Questions Comments