Title: Content-based Detection System using Clustering
1Content-based Detection System using Clustering
- Bracha Shapira
- Elovici Y., Last M., Kandell A., Zaafrany O., Fridman M.
E-mail: bshapira_at_bgu.ac.il
NATO ARW Ben-Gurion University, June 2007
2Agenda
- Overview
- Performance Measures
- Research questions
- Experimental Environment
- Initial Results
- Challenges
3Goals
- Develop a model for detecting terrorist activities in non-terror environments based on the network traffic content
- On-line detection
- Detection should be based on passive eavesdropping on the network
- Detection true-positive and false-positive rates similar to those of IDS that are based on anomaly detection
4Assumption
- A group of users in some defined environment would have typical interests
- Terror groups that share an ideology would have typical interests
5Basic Idea
- A terrorist (or a new supporter) would be abnormal in the typical environment
- His/her interests would resemble typical terror behavior
6Detection environment (example)
[Diagram: Terrorist-Related Site, ATDS, Web, University Campus]
7Content-Based Methodology Learning Phase
[Diagram: Network, Sniffer, OHT (Online-HTML Tracer), Configuration Data, Filter, Vector Generator, Representation and Storage, Vectors of Users' Transactions]
8Sniffer collects group-related sites
9Filter
- Take out pages containing non-textual information
and pages in unsupported languages
10Convert each page to a representative vector of term weights
- Count all occurrences of meaningful terms on the page using search-engine techniques
- Normalize the weight of each term (vector space model)
- The vector should represent the page as accurately as possible
Example term counts: Bomb - 8, Suicide - 4, War - 2, Food - 5, ...
Example normalized vector: (0.2, 0.1, 0.05, 0.25, 0.12)
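The conversion described on this slide can be sketched as follows. This is a minimal illustration, not the system's actual vector generator: the stop-word list, the fixed vocabulary, and the choice to normalize by the sum of counts are all assumptions (the slide only says the weights are normalized).

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on"}


def term_counts(text):
    """Count occurrences of meaningful (non-stop-word) terms in a page's text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def normalize(counts, vocabulary):
    """Project raw counts onto a fixed vocabulary and scale them to sum to 1."""
    total = sum(counts.get(term, 0) for term in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts.get(term, 0) / total for term in vocabulary]


# Assumed vocabulary, borrowed from the terms named on the slide.
vocabulary = ["bomb", "suicide", "war", "food"]
page = "Food and war: the war on food prices"
vector = normalize(term_counts(page), vocabulary)
```

Terms outside the vocabulary (here "prices") are ignored, so every page maps to a vector of the same length and pages can be compared directly.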
11Content-Based Methodology Learning Phase (Continued)
[Diagram: Vectors of Users' Transactions, Clustering, Cluster 1 (Vectors) ... Cluster n (Vectors), Normal User Behavior Computation, Group-Representor, Normal User Behavior]
12Apply clustering on vectors to find common interests of the group
- Clustering is a statistical method for finding groups of similar objects according to their properties
13Cluster Centroid Computation
- $C_j$ - vector representation (centroid) of cluster j
- $D_j$ - number of vectors in cluster j
- $v_i$ - one vector representation
- $C_j = \frac{1}{D_j} \sum_{i=1}^{D_j} v_i$
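The clustering and centroid steps can be sketched together. The slides do not name the clustering algorithm or the similarity measure, so the k-means loop, the cosine similarity, and the evenly spaced seeding below are assumptions chosen for illustration; only the centroid as the mean of a cluster's vectors follows from the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, max_iterations=50):
    """Return (labels, centroids) after a simple k-means run."""
    n = len(vectors)
    # Naive deterministic seeding: k evenly spaced input vectors.
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assign each vector to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = centroid(members)
    return labels, centroids


pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(pages, k=2)
```

The resulting centroids play the role of the group's normal-behavior profile: one representative vector per cluster of interests.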
14Content-Based Methodology Detection Phase
[Diagram: Network, Sniffer, OHT (Online-HTML Tracer), Configuration Data, Filtering, Vector Generator, Representation, Normal User Behavior, Detector, Threshold (tr), Detection, Alarm]
15OHT- Online-HTML Tracer
16OHT- Online-HTML Tracer (cont.)
- Input: IP packets
- Sniffer: records the stream of packets
- HTTP Filter: disregards non-textual sequences
- HTML Reconstruction: reconstructs HTML pages
- Output: HTML page
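The filter and reconstruction stages can be illustrated with a small sketch. It assumes the packet stream has already been reassembled into one complete, uncompressed, non-chunked HTTP response; the function name `extract_html` and the use of the Content-Type header to decide what counts as textual are assumptions, not details given in the slides.

```python
def extract_html(raw_response: bytes):
    """Return the HTML body of a reassembled HTTP response, or None if it is not textual HTML."""
    head, separator, body = raw_response.partition(b"\r\n\r\n")
    if not separator:
        return None  # headers never terminated: incomplete response

    # Parse header lines, skipping the status line.
    headers = {}
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Disregard non-textual content such as images or binaries.
    if not headers.get("content-type", "").lower().startswith("text/html"):
        return None
    return body.decode("utf-8", errors="replace")


html_page = extract_html(
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n<html>hi</html>"
)
image = extract_html(
    b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG"
)
```

A production tracer would also need TCP stream reassembly, chunked transfer decoding, and decompression, all of which are left out here.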
17Detection
Representation
Group vectors by IP
Normal User Profile
Each queue holds one normal user's accesses
18Detection Algorithm Parameters
- The size of the sub-queue for each IP
- Alarm threshold values
  - The ratio between the number of accesses detected as abnormal and the size of the sub-queue
  - Alarm threshold (similarity)
- Number of clusters representing the typical profile of the monitored users
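One way these parameters could fit together is sketched below. It is an illustration under stated assumptions, not the presented algorithm: cosine similarity as the similarity measure, a bounded FIFO sub-queue per IP, and waiting for a full sub-queue before raising an alarm are all choices made here, and the class and parameter names are hypothetical.

```python
import math
from collections import defaultdict, deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class Detector:
    def __init__(self, centroids, similarity_threshold, queue_size, alarm_ratio):
        self.centroids = centroids                        # normal-behavior profile
        self.similarity_threshold = similarity_threshold  # similarity threshold
        self.alarm_ratio = alarm_ratio                    # abnormal accesses / sub-queue size
        # One bounded FIFO sub-queue of abnormal flags per IP address.
        self.queues = defaultdict(lambda: deque(maxlen=queue_size))

    def is_abnormal(self, vector):
        """An access is abnormal if it is not similar enough to any cluster centroid."""
        best = max(cosine(vector, c) for c in self.centroids)
        return best < self.similarity_threshold

    def observe(self, ip, vector):
        """Record one access; return True if an alarm should be raised for this IP."""
        queue = self.queues[ip]
        queue.append(self.is_abnormal(vector))
        if len(queue) < queue.maxlen:
            return False  # assumption: no alarm until the sub-queue is full
        return sum(queue) / len(queue) >= self.alarm_ratio


detector = Detector([[1.0, 0.0]], similarity_threshold=0.5, queue_size=2, alarm_ratio=1.0)
first = detector.observe("10.0.0.1", [1.0, 0.0])   # normal access, queue not full
second = detector.observe("10.0.0.1", [0.0, 1.0])  # abnormal, ratio 1/2
third = detector.observe("10.0.0.1", [0.0, 1.0])   # abnormal, ratio 2/2
other = detector.observe("10.0.0.2", [0.0, 1.0])   # separate IP, queue not full
```

A larger sub-queue smooths out isolated abnormal accesses at the cost of slower detection, which is the trade-off the queue-size and alarm-ratio parameters control.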
19Performance Measures
- Message loss rate (pages correctly captured)
- Detection (True Positive) Rate
  - Positives correctly classified / total number of positives
- False Alarm (False Positive) Rate
  - Negatives incorrectly classified / total number of negatives
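The two rates defined above can be computed as in the following sketch. The function name and the boolean encoding (True for a truly suspicious user, True for a raised alarm) are assumptions, and the sample data is made up for illustration.

```python
def detection_rates(actual, predicted):
    """Return (true_positive_rate, false_positive_rate) for boolean label lists."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    tp_rate = true_positives / positives if positives else 0.0
    fp_rate = false_positives / negatives if negatives else 0.0
    return tp_rate, fp_rate


# Made-up example: 2 suspicious users and 4 normal users.
actual = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
tp_rate, fp_rate = detection_rates(actual, predicted)
```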
20OHT Performance Evaluation
- Access to a given list of 100 URLs that include textual HTML
- Simulated page requests sent from 38 PCs
- Each run, if ideally performed, would result in 3800 reconstructed HTML pages
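The expected page count and the resulting loss rate follow from simple arithmetic, sketched here. The captured count of 3610 is a made-up figure used only to show the calculation; it is not a result from the evaluation.

```python
URLS_PER_RUN = 100  # from the slide
PCS = 38            # from the slide

expected_pages = URLS_PER_RUN * PCS  # pages an ideal run would reconstruct


def message_loss_rate(captured, expected):
    """Fraction of expected pages that were not reconstructed."""
    return (expected - captured) / expected


# Hypothetical run in which 3610 pages were reconstructed.
loss = message_loss_rate(3610, expected_pages)
```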
21Experimental Environment
- A small network of 38 computers with constant IP addresses
- All computers access the web through the same switch
- The switch was programmed to send all the packets to one port
- About 170,000 web transactions (page views) were recorded over 24 days for the normal group
- We simulated suspicious behavior by accessing terror-related web sites (582)
22TP and FP as a function of the number of clusters
23Conclusions
- No. of clusters affects results
- Cannot be generalized
- Must be calibrated
24TP and FP for an alarm threshold of 100
25TP and FP for an alarm threshold of 50
26TP and FP for queue size 2
27TP and FP for queue size 32
28Conclusions
- Queue size and abnormal ratio affect results
- Cannot be generalized
- Need to be calibrated
29Improvements and challenges
- Large scale evaluations
- Multilingual (Arabic, ...)
- More effective representation of pages
- Graphs
- Context-aware representations
- Optimal number of clusters
- Analyzing views of non-textual information
- Example
30Problems
- False positive
- Hiding behind a NAT
- Deployment or De-NAT
- Privacy