Boundary Detection in Tokenizing Network Application Payload for Anomaly Detection

Slides: 29
Provided by: rvar2
Learn more at: https://cs.fit.edu

1
Boundary Detection in Tokenizing Network
Application Payload for Anomaly Detection
  • Rachna Vargiya and Philip Chan
  • Department of Computer Sciences
  • Florida Institute of Technology

2
Motivation
  • Existing anomaly detection techniques rely on
    information derived only from the packet headers
  • More sophisticated attacks involve the
    application payload
  • Example: Code Red II worm
      • GET /default.ida?NNNNNNNNN
  • Parsing the payload is required!
  • Problems in hand-coded parsing:
      • Large number of application protocols
      • Frequent introduction of new protocols

3
Problem Statement
  • To parse application payload into tokens without
    explicit knowledge of the application protocols
  • These tokens are later used as features for
    anomaly detection

4
Related work
  • Pattern Detection - Important Tokens
      • Fixed Length
          • Forrest et al. (1998)
      • Variable Length
          • Wespi et al. (2000)
          • Jiang et al. (2002)
  • Boundary Detection - All Tokens
      • VOTING EXPERTS by Cohen et al. (2002)
          • Boundary Entropy
          • Frequency
          • Binary Votes

5
Approach
  • Boundary Finding Algorithms
  • Boundary Entropy
  • Frequency
  • Augmented Expected Mutual Information
  • Minimum Description Length
  • Approach is domain independent (no prior domain
    knowledge)

6
Combining Boundary Finding Algorithms
  • Combination of all or a subset (e.g. Frequency +
    Minimum Description Length) of techniques
  • Each algorithm can cast multiple votes, depending
    on confidence measure

7
Boundary Entropy (Cohen et al.)
  • Entropy at the end of each possible window is
    calculated
  • High entropy means more variation

[Figure: window w over "Itisarainyday"; x is the byte following the current window]

8
Voting using Boundary Entropy

[Figure: entropy graph for "Itisarainyday" (presenter's note: change graph to discrete bars)]

  • Entropy in meaningful tokens starts with a high
    value, drops, and peaks at the end
  • Vote for positions with the peak entropy
  • Threshold suppresses votes for low entropy values
  • Threshold = average BE
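
The boundary-entropy vote can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function names, the fixed window length `win`, and the "local peak above the average BE" rule are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def boundary_entropy_table(text, max_len=3):
    """Entropy of the byte x that follows each window w of 1..max_len bytes."""
    followers = defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(text) - n):
            followers[text[i:i + n]][text[i + n]] += 1
    table = {}
    for w, counts in followers.items():
        total = sum(counts.values())
        table[w] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return table


def entropy_votes(text, table, win=3):
    """Vote for positions where BE is a local peak above the average BE.

    Position p denotes a boundary between text[p - 1] and text[p].
    """
    positions = range(1, len(text))
    be = {p: table.get(text[max(0, p - win):p], 0.0) for p in positions}
    if not be:
        return []
    threshold = sum(be.values()) / len(be)  # threshold = average BE
    votes = []
    for p in positions:
        left = be.get(p - 1, float("-inf"))
        right = be.get(p + 1, float("-inf"))
        if be[p] > threshold and be[p] >= left and be[p] >= right:
            votes.append(p)
    return votes


if __name__ == "__main__":
    sample = "Itisarainyday" * 3  # toy corpus; far too small for real statistics
    print(entropy_votes(sample, boundary_entropy_table(sample)))
```

The table must be built with `max_len` at least as large as `win`, otherwise longer windows are looked up but never found.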

9
Frequency (Cohen et al.)
  • The most frequent tokens are assumed to be
    meaningful tokens
  • Frequencies of tokens with length 1, 2, 3, ..., 6
  • Shorter tokens are inherently more frequent than
    longer tokens
  • Normalize frequencies for tokens of the same
    length using standard deviation
  • Boundaries are assigned at the end of the most
    frequent token in the window

[Figure: window "Itis" over "Itisarainyday"]
Frequency in window: (1) "I" 3, (2) "It" 5, (3) "Iti" 2, (4) "Itis" 3
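
A minimal sketch of the frequency expert, assuming frequencies are normalized per token length as z-scores (standard deviations from the mean). The function names and the tie-breaking rule (the shortest token wins) are illustrative choices, not taken from the slides.

```python
from collections import Counter
from statistics import mean, pstdev


def normalized_frequencies(text, max_len=6):
    """Z-score of each token's count among all tokens of the same length."""
    z = {}
    for n in range(1, max_len + 1):
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        if not counts:
            continue
        mu = mean(counts.values())
        sigma = pstdev(counts.values())
        for tok, c in counts.items():
            z[tok] = (c - mu) / sigma if sigma > 0 else 0.0
    return z


def frequency_vote(text, start, z, max_len=6):
    """Boundary position at the end of the most frequent token starting at `start`."""
    longest = min(max_len, len(text) - start)
    best_len = max(
        range(1, longest + 1),
        key=lambda n: z.get(text[start:start + n], float("-inf")),
    )
    return start + best_len
```

Normalizing within each length is what lets a 2-byte token compete fairly with a 1-byte token, which would otherwise always have the higher raw count.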
10
Mutual Information (MI)
  • Mutual Information given by
  • Gives the reduction in uncertainty about the
    presence of event b given event a
  • MI does not incorporate the counter evidence when
    a occurs without b and vice versa

11
Augmented Expected Mutual Information (AEMI)
  • AEMI sums the supporting evidence and
    subtracts the counter evidence
  • For each window, the location with the minimum
    AEMI value suggests a boundary

[Figure: "Itisarainyday" with adjacent segments a and b marked]
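
The equations for MI and AEMI are not preserved in this transcript. The sketch below assumes a common formulation: MI(a, b) = lg(P(a, b) / (P(a) P(b))), and AEMI as the supporting term P(a, b) lg(P(a, b) / (P(a) P(b))) minus the two analogous counter-evidence terms for (a, not b) and (not a, b). The probability estimates (substring counts divided by the text length) and all function names are assumptions for illustration only.

```python
import math


def _term(p_joint, p_x, p_y):
    """p_joint * lg(p_joint / (p_x * p_y)), taken as 0 when a probability is 0."""
    if p_joint <= 0 or p_x <= 0 or p_y <= 0:
        return 0.0
    return p_joint * math.log2(p_joint / (p_x * p_y))


def mi(p_ab, p_a, p_b):
    """Mutual information of events a and b (assumed pointwise form)."""
    return math.log2(p_ab / (p_a * p_b))


def aemi(p_ab, p_a, p_b):
    """Supporting evidence minus counter evidence (a without b, b without a)."""
    support = _term(p_ab, p_a, p_b)
    counter = _term(p_a - p_ab, p_a, 1 - p_b) + _term(p_b - p_ab, 1 - p_a, p_b)
    return support - counter


def occurrences(text, s):
    """Number of (possibly overlapping) occurrences of s in text."""
    return sum(text.startswith(s, i) for i in range(len(text)))


def min_aemi_split(text, window):
    """Split point k (a = window[:k], b = window[k:]) with the minimum AEMI."""
    n = len(text)

    def score(k):
        a, b = window[:k], window[k:]
        return aemi(occurrences(text, a + b) / n,
                    occurrences(text, a) / n,
                    occurrences(text, b) / n)

    return min(range(1, len(window)), key=score)
```

A low AEMI between a and b means the two segments do not predict each other well, which is why the minimum within a window is treated as the likely boundary.
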
12
Minimum Description Length (MDL)
  • Shorter code assigned to frequent tokens to
    minimize the overall coding length
  • Boundary yielding shortest coding length is
    assigned votes
  • Coding length per byte, where
      • -lg P(t_i): number of bits to encode t_i
      • |t_i|: length of t_i

[Figure: "Itisarainyday" split into t_left and t_right]
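
The coding-length equation itself is missing from the transcript. One plausible reading, assumed here, is the total bits for the two tokens divided by their total length: (-lg P(t_left) - lg P(t_right)) / (|t_left| + |t_right|). The probability estimate (relative frequency among substrings of the same length) and the function names are illustrative assumptions.

```python
import math


def occurrences(text, s):
    """Number of (possibly overlapping) occurrences of s in text."""
    return sum(text.startswith(s, i) for i in range(len(text)))


def coding_length_per_byte(text, left, right):
    """Bits needed to encode t_left and t_right, per byte covered."""
    bits = 0.0
    for t in (left, right):
        p = occurrences(text, t) / (len(text) - len(t) + 1)
        if p == 0:
            return float("inf")  # token never seen: treat as infinitely costly
        bits += -math.log2(p)
    return bits / (len(left) + len(right))


def mdl_split(text, window):
    """Split point k giving the shortest coding length per byte."""
    return min(
        range(1, len(window)),
        key=lambda k: coding_length_per_byte(text, window[:k], window[k:]),
    )
```

For the toy text "abab", splitting the window "abab" in the middle wins because both halves are the frequent token "ab".
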
13
Normalize scores of each algorithm
  • Each algorithm produces a list of scores
  • Since the number of votes is proportional to the
    score, the scores must be normalized
  • Each score is replaced by the number of standard
    deviations that the score is away from the mean
    value

14
Normalize votes of each algorithm
  • Algorithms produce a list of votes depending on the
    scores
  • Make sure each algorithm votes with the same
    weight.
  • Number of votes is replaced by the number of
    standard deviations from the mean value
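
Both normalization steps replace a value by its distance from the mean in standard deviations. A minimal sketch, assuming the population standard deviation and mapping a constant list to zeros (neither detail is specified on the slides):

```python
from statistics import mean, pstdev


def normalize(values):
    """Replace each value by the number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # all values equal: nothing stands out
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    scores = [0.2, 1.4, 0.3, 0.9]  # hypothetical scores s1..s4
    print(normalize(scores))       # normalized scores ns1..ns4
```

The same function can be applied first to an algorithm's scores and then to its votes, so that no single algorithm dominates the combined vote.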

15
Normalizing Scores and Votes
[Figure: for the string "Itis", each algorithm's scores (s1 to s4) are normalized (ns1 to ns4) and turned into votes (v1 to v4); the votes are normalized (nv) and summed into the combined normalized votes]

16
Combined Approach with Weighted Voting
  • A list of votes from all the experts is gathered
  • For each boundary, the final votes are summed
  • A boundary is placed at a position if the votes
    at the position exceed the threshold
  • Threshold = average number of votes
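
The combination step can be sketched as below. The input layout (one list of already-normalized votes per expert, indexed by position) and the helper `split_at` are assumptions made for the example.

```python
def combine_votes(vote_lists):
    """Sum the experts' votes per position; keep positions above the average."""
    totals = [sum(column) for column in zip(*vote_lists)]
    threshold = sum(totals) / len(totals)  # threshold = average number of votes
    return [pos for pos, votes in enumerate(totals) if votes > threshold]


def split_at(text, boundaries):
    """Cut text before each boundary position and drop empty tokens."""
    cuts = [0] + sorted(boundaries) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


if __name__ == "__main__":
    # Hypothetical boundary positions, chosen by hand for illustration.
    print(split_at("Itisarainyday", [2, 4, 5, 10]))
```
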

17
Evaluation Criteria
  • Evaluation A: % of space-separated words
    retrieved
  • Evaluation B: % of keywords in the protocol
    specification that were retrieved
  • Evaluation C: entropy of the tokens in the output
    file (the lower the better)
  • Evaluation D: number of detected attacks in
    network traffic
  • A and B apply only to text-based protocols
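
Evaluations A and C might be computed along these lines. The exact definitions are not given on the slide, so this sketch assumes that a word counts as retrieved when it appears verbatim as an output token, and that entropy is taken over the distribution of output tokens; the function names are hypothetical.

```python
import math
from collections import Counter


def percent_words_recovered(payload, tokens):
    """Evaluation A (assumed form): % of space-separated words found as tokens."""
    words = payload.split()
    token_set = set(tokens)
    found = sum(1 for w in words if w in token_set)
    return 100.0 * found / len(words) if words else 0.0


def token_entropy(tokens):
    """Evaluation C (assumed form): entropy in bits of the token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Evaluation B would follow the same pattern as A, with the word list taken from the keywords of the protocol specification instead of the payload.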

18
Anomaly Detection Algorithm: LERAD (Mahoney and
Chan)
  • LERAD forms rules based on 23 attributes
  • First 15 attributes from packet header
  • Next 8 attributes from the payload
  • Example rule:
      • If port = 80 then word1 = GET
  • Original payload attributes: space-separated
    tokens
  • Our payload attributes: boundary-separated tokens
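
To make the payload attributes concrete, the sketch below fills eight attributes with the first eight tokens and checks one hand-written rule. It is not LERAD: LERAD learns its rules from attack-free training data, whereas the rule here is hard-coded, and the names `payload_attributes` and `violates` are hypothetical.

```python
def payload_attributes(tokens, n=8):
    """Map the first n tokens to attributes word1..wordn, padding with ''."""
    head = list(tokens[:n])
    padded = head + [""] * (n - len(head))
    return {f"word{i + 1}": tok for i, tok in enumerate(padded)}


def violates(record, allowed_word1=frozenset({"GET"})):
    """True if the record breaks the rule: if port = 80 then word1 = GET."""
    return record["port"] == 80 and record["word1"] not in allowed_word1


if __name__ == "__main__":
    record = {"port": 80, **payload_attributes(["GET", " /", "index.html"])}
    print(violates(record))
```

Swapping space-separated tokens for boundary-separated tokens only changes what is passed in as `tokens`; the attribute layout stays the same.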

19
Experimental Data
  • 1999 DARPA Intrusion Detection Evaluation Data
    Set
      • Week 3: attack-free (training) data
      • Weeks 4, 5: attack-containing (test) data
  • Evaluations A, B, C (known boundaries): Week 3
      • Trained on days 1 - 4
      • Tested on days 5 - 7
      • Prevents gaining knowledge from Weeks 4 and 5
  • Evaluation D (detected attacks)
      • Trained on Week 3
      • Tested on Weeks 4 and 5

20
Evaluation A: % of Space-Separated Tokens Recovered

Method            Port 25  Port 80  Port 21  Port 79   Avg
Freq+MDL               52       26       21       81  45.0
Frequency              15       16       13       99  36.0
BE+AEMI+MDL+Freq       21       14        5       12  13.0
AEMI                    5        9        4       32  12.5
MDL                     6        7        3       25  10.3
BE                      3        3        1        9   4.0

21
Evaluation B: % of Keywords in RFCs Recovered

Method            Port 25  Port 80  Port 21   Avg
Freq+MDL               40       36       59  45.0
Frequency              31       28       40  33.0
BE+AEMI+MDL+Freq       12       13       21  15.3
AEMI                    9        5        2   5.3
MDL                     7        6        1   4.7
BE                      3        2        2   2.3

22
Evaluation C: Entropy of Output (Lower is Better),
average across 6 ports

Method            Average Value
Frequency                   5.0
MDL                        5.03
Freq+MDL                   5.06
BE                         5.25
BE+AEMI+Freq+MDL           5.56
AEMI                       6.38

23
Ranking of Algorithms

Method            Evaluation A  Evaluation B  Evaluation C
Freq+MDL                     1             1             3
Frequency                    2             2             1
BE+AEMI+MDL+Freq             3             3             5
AEMI                         4             4             6
MDL                          5             5             2
BE                           6             6             4

24
Detection Rate: Space-Separated vs. Boundary-Separated
(Freq+MDL)

                     10 FP/day           100 FP/day
Port                Space  Boundary     Space  Boundary
20                      2         2         4         5
21                     14        16        14        17
22                      3         3         3         3
23                     13        14        13        14
25                     15        16        16        16
79                      3         3         3         3
80                     10        10        11        13
113                     2         2         2         2
Overall                59        62        63        68
Improvement (%)        --         5        --         8
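
The improvement row appears consistent with the relative increase in overall detections when moving from space-separated to boundary-separated tokens. A small check, assuming that interpretation (the function name is illustrative):

```python
def improvement_percent(space_detected, boundary_detected):
    """Relative increase in detected attacks, in percent."""
    return 100.0 * (boundary_detected - space_detected) / space_detected


if __name__ == "__main__":
    print(round(improvement_percent(59, 62)))  # overall at 10 FP/day
    print(round(improvement_percent(63, 68)))  # overall at 100 FP/day
```
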
25
Summary of Contributions
  • Used payload information, while most IDSs
    concentrate on header information
  • Proposed AEMI and MDL for boundary detection
  • Combined all algorithms and subsets of them
  • Used weighted voting to indicate confidence
  • Proposed techniques find boundaries better than
    spaces
  • Achieved higher detection rates in an anomaly
    detection system

26
Future Work
  • Further evaluation on other ports
  • Pick more useful tokens instead of the first 8
  • DARPA data set is partially synthetic; further
    evaluation on real traffic is needed
  • Evaluation with other anomaly detection algorithms

27
  • Thank you

28
Experimental Results
  • Table 4.3.4: Results from Additional Ports for
    Freq+MDL and ALL

        Evaluation A      Evaluation B      Evaluation C
        Words Found       Keywords Found    Entropy
Port    Freq+MDL   ALL    Freq+MDL   ALL    Freq+MDL   ALL
23            13     7           5     3        7.88  8.08
115           43    20           -     -        4.45  5.18
515           38    14           -     -        7.66  7.27