Network Security

About This Presentation

Title:

Network Security

Slides: 126

Provided by: edut1550

Transcript and Presenter's Notes

Title: Network Security

1
???? (Network Security)
????? ??????????/??????? E-mail
nfhuang_at_cs.nthu.edu.tw
2
Agenda

Introduction of Network Security
Content Inspection Technologies
Pattern Matching Algorithms
Flow Classification by Stateful Mechanism
Machine Learning Based Application Identification
Technologies
Network Security Research Topics
Conclusions

3
-- ?????? --

2000/3????DDos???????,??Yahoo?Amazon?CNN?eBay
???????
2001/7Amazon.com ??? Bibliofind ?????????????
2002 ??????
2003/1 SQL Slammer ??
2003/4 ??????????
2003/8 Blaster ??????
2003/9 SoBig ??????
2003/9 ??????
2004/3 Netsky ??????
2004/4 Sasser ??????
2005/5 ?????????????
2005/6 ????????????????????

4
???????

??????????,????????,??????,??????,???????
?????????????,??????????????????
???????????????????????????????
????????????????
???????????????

5
????????
6
????????
7
????????
8
????????
Policy

??????
???????
???????

9
??????

Denial of Service (DoS), Distributed Denial of
Service (DDoS)
Network Invasion
Network Scanning
Network Sniffing
Trojan Horse and Backdoors
Worm

10
(1) DoS/DDoS

Prevents other users from using a network
connection, or disables a server or service, e.g.
Smurf and Fraggle attacks, Land,
Teardrop, NewTear, Bonk, Boink, SYN
flooding, Ping of Death, IGMP Nuke, buffer
overflow.
Caused by a protocol fault or a program fault.
Damages availability.

11
????? DoS ??

Ping Flooding
??????? ICMP echo ???????,????????
Ping of Death
??????? 65,536???? ICMP echo ???????,???????????
(TCP/IP ??????)?
UDP flooding (Chargen)
???????? UDP ???????????????(Port 19, Character
Generator),????????????????UDP??,????????

12
????? DoS ??

Smurf Attack
????????????????????? ICMP echo
??,??????????????????????????????? ICMP reply
?????,?????????,????????????????
SYN flooding
???????????? SYN ??(????TCP??)?????????,??????????
???????????????? SYN-ACK ???????,?????????????????
??????????? TCP ??,??????????????

13
Smurf attack (DoS)

Dangerous attacks
Network-based, fills access pipes
Uses ICMP echo/reply (smurf) or UDP echo
(fraggle) packets with broadcast networks to
multiply traffic
Requires the ability to send spoofed packets
Abuses bounce-sites to attack victims
Traffic multiplied by a factor of 50 to 200
Low-bandwidth source can kill high-bandwidth
connections
Similar to ping flooding, UDP flooding but more
dangerous due to traffic multiplication

14
Smurf Attack (contd)
15
SYN flooding Attack (DoS)

The goal is to deny access to a TCP service
running on a host.
The attacker creates a number of half-open TCP
connections that fill up the host's listen
queue, so the host stops accepting connections.
Requires that the TCP service on the victim be
open to incoming connections.
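
A toy model can make the listen-queue exhaustion concrete. The sketch below is not a real TCP stack: the class `ListenQueue`, its method names, and the backlog size of 4 are all assumptions chosen for illustration. It only shows that half-open entries left behind by spoofed SYNs keep a legitimate SYN from getting a slot.

```python
from collections import OrderedDict


class ListenQueue:
    """Toy model of a TCP listen (backlog) queue; hypothetical, not a real stack."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.half_open = OrderedDict()  # (src_ip, src_port) -> connection state

    def on_syn(self, src):
        """Handle a SYN; return True if a SYN-ACK would be sent back."""
        if src in self.half_open:
            return True  # retransmitted SYN, slot already reserved
        if len(self.half_open) >= self.backlog:
            return False  # queue full: the SYN is dropped
        self.half_open[src] = "SYN_RCVD"
        return True

    def on_ack(self, src):
        """Handle the final ACK; return True if the handshake completes."""
        return self.half_open.pop(src, None) is not None


queue = ListenQueue(backlog=4)
# Spoofed sources never answer the SYN-ACK, so their slots stay occupied.
for i in range(4):
    queue.on_syn(("10.0.0.%d" % i, 1024 + i))
legit_accepted = queue.on_syn(("192.0.2.7", 40000))
```

In this model the only ways to free a slot are a completed handshake or, in a real stack, a timeout, which is why the attack works with a low packet rate.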

16
SYN flooding (contd)
Spoofed SYN
ACK to spoofed address

Attacker
Victim
The Innocents
17
DDoS Attack
Attacker
Handler
Handler
Handler
Agent
Agent
Agent
Agent
Agent
Agent
Agent
Control messages: may be encrypted or hidden in
normal packets.
Victim
Spoofed packets.
18
DDoS Attack

????????????????????????????
?? Yahoo.com,Amazon.com,CNN.com,buy.com?
ebay.com??????DDoS??

19
DDoS ????

DDOS ????????
Trin00 (?????)
Tribe Flood Network(TFN) (?????)
TFN2K
Stacheldraht
Trin00
Trin00 ????????????,??????,?????? Trin00 Daemon
??????????? UDP ??(?????????),????????????????????
?????? ICMP port unreachable ??,????????????????
TFN
????? Trin00 ??, ? TFN????????????? SYN flood?UDP
flood?ICMP flood??Smurf ???????? TFN
????????????????,??????????????????

20
(2) Network Invasion

The goal is to get into the target system and
obtain information:
Account usernames, passwords
Source code, business-critical information
Usually caused by improper configuration or
privilege settings, or by a program fault.
Network invasions are diverse; knowledge of
attack patterns may help detection, but it is
quite hard to detect all attacks.

21
Example of network invasion: IIS Unicode
directory traversal

For IIS 5.0 on Windows 2000 without the relevant
security patch, a simple URL string such as
http://address.of.iis5.system/scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir+c:\
lists the contents of the root directory.

22
(3) Network Scanning

The goal is generally to find openings and learn
the topology of the victim's network:
The names and addresses of hosts and network
devices.
The open services.
Usually uses techniques such as ICMP scanning,
Xmas scan, SYN-FIN scan, and SNMP scan.
Nmap is a powerful automated tool for this.

23
(4) Sniffing

The goal is generally to obtain the content of
communications:
Account usernames, passwords, mail accounts
Network topology
Usually a program that places an Ethernet adapter
into promiscuous mode and saves information for
later retrieval.
Hosts running the sniffer program (e.g. NetBus)
have often been compromised using host attack
methods.

24
(5) Backdoor and Trojan horse

Usually, a backdoor or Trojan horse is the
consequence of an invasion or of a hostile
program.
It may open a private communication channel and
wait for remote commands.
Available toolkits:
Subseven,
BirdSpy,
Dragger
They can be detected by monitoring known control
channel activity, but not with 100% precision.

25
(6) Worm

The chief aim of a worm is to propagate and
survive.
It takes advantage of system vulnerabilities to
infect a host and then tries to infect any
reachable targets.
It may reduce system productivity, leave back
doors, steal confidential information, and so
on.

26
P2P/IM ????

P2P (Peer-to-Peer) ????
IM (Instant Messenger) ???
Spyware ????
Adware ????
Tunneling ????

27
P2P A new paradigm

Bottleneck of Server
Powerful PC
Flexible, efficient information sharing
P2P changes the way of Web (Internet)

28
P2P???????????

P2P ???????????,??????????,?? SoftEther ?
Skype??????,????,????,?????????
P2P ????????,??
??????????
?????????
??????
?????
????????
??????????
??????,?????

29
Famous P2P Examples

BitTorrent
eZpeer
Kuro
eDonkey
eMule
MLdonkey
Gnutella
Kazaa/Morpheus

Shareaza
Direct-connect
Gnutella
Soulseek
Opennap
Worklink
Opennext
Jelawat
PP???

SoftEther
iMESH
MIB
WinMix
WinMule
Skype

30
Instant Messenger (IM)

MSN
Yahoo Messenger
ICQ
YamQQ
AIM (AOL IM)

31
????????

Firewall (Layer-4)
VPN and SSL VPN
PKI
IDS/IPS
Defense-in-Depth
Application Firewall (Layer-7)
UTM (Unified Threat Management)
NAC (Network Access Control)

32
??????Intrusion Detection System (IDS)
????????Intrusion Detection and Prevention
System (IPS/IDP)
33
Intrusion Detection System

Intrusion Detection System: a computer system
that attempts to detect any set of actions that
try to compromise the integrity, confidentiality,
or availability of a resource.
An IDS has much more knowledge and many more
fine-grained detection functions than a common
firewall. (Remember that the main function of a
firewall is access control.)

34
IDS Types

Host based vs. Network based.
Misuse detection vs. Anomaly detection
Active vs. Passive
Centralized vs. Distributed

35
Host-based vs. Network-based IDS

Host-based IDS: installed on the target host as
a monitoring service. It checks system activity,
user privileges, and user behavior.
Network-based IDS: installed on a network node,
usually in promiscuous mode so that it can listen
to all passing traffic. It checks network traffic
and interactions between nodes.

36
Misuse detection vs. Anomaly detection IDS

Misuse detection (signature-based): based on the
assumption that intrusion attempts can be
characterized by comparing user activities
against a database of known attacks.
Anomaly detection (statistical-based): identifies
abusive behavior by noting and analyzing audit
data that deviates from a predicted norm.

37
Active IDS vs. Passive IDS

Active IDS: can participate in the system. It
not only observes events but also takes part in
the necessary operations. Also called IPS or IDP
(Intrusion Detection and Prevention System).
Passive IDS: works as a monitor or bystander.

38
Active IDS vs. Passive IDS
39
Centralized IDS vs. Distributed IDS

Centralized: the sensors are managed by a single
analyzer or manager.
Distributed: the sensors are managed by multiple
automated analyzers or managers, and the
analyzers and managers can communicate with each
other.

40
Comparison between a Firewall and a Network-based
Active IDS

Same
Can't protect against insider-to-insider attacks.
Can't protect against connections that don't go
through it.
Can do ACL and filtering. (For Active IDS)
Different
An IDS has the ability to detect new threats.
An IDS focuses on intrusion, while a firewall
focuses on access control and privacy.
A firewall uses the address as the passport,
while an IDS performs many more checks.

41
The Challenge of IDS

Speed limitation: a NIDS cannot keep pace with
the network speed. (A NIDS needs to check more
fields of a packet than a firewall does.)
The inability to see all the traffic: switched
Ethernet is being widely deployed.
Fail-open vs. fail-closed architecture: when a
fail-open NIDS fails, often without notifying the
central console, it leaves the network open. A
fail-closed design means the network is out of
service until the NIDS is brought back online.

42
IDS False Alarms
43
Content Inspection Technologies
44
A Generic Layer-7 Engine

Packet Normalizer
Ensures the integrity of incoming packets
Eliminates ambiguity
Decodes URI strings if necessary
Pattern-Matching Engine
Policy Engine
Gathers information from the pattern-matching
engine and issues the verdict to allow or drop
the packets

45
Packet Normalizer

Integrity Checking
IP Fragment Reassembly
TCP Segment Reassembly
TCP segments may arrive out of order
SEQ outside the window size
Segment overlapping
URI Decode
URI hex code obfuscation (a -> %61)
URI unicode/UTF-8 obfuscation
Self-referential directory obfuscation
(/././././ -> /)
Directory obfuscation (/abc/a/../a/../a/ ->
/abc/a)
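
The URI decoding step can be sketched with the Python standard library. This is a minimal illustration, not the normalizer of any particular product: the function name `normalize_uri_path` and the `max_rounds` parameter are assumptions, and decoding more than once is a design choice that should mirror how the protected server actually behaves.

```python
import posixpath
from urllib.parse import unquote


def normalize_uri_path(raw_path, max_rounds=3):
    """Decode percent-escapes, then collapse "./", "//" and "dir/../" segments."""
    path = raw_path
    for _ in range(max_rounds):
        decoded = unquote(path)
        if decoded == path:
            break  # nothing left to decode
        path = decoded
    # normpath removes self-references and resolves parent references.
    return posixpath.normpath(path)
```

For example, `normalize_uri_path("/abc/a/../a/../a/")` yields `/abc/a`, so a signature written against the canonical path still applies. Overlong or otherwise invalid UTF-8 sequences are not mapped back to `/` by this sketch and would need separate handling.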

46
Pattern-Matching Engine

The most computation-intensive task in packet
processing. Normally the PM engine needs to
process every single byte of the packet payload.
In Snort, the PM routine accounts for 31% of the
total execution time.

47
Pattern Matching is Expensive!

50 instructions per 1500-byte packet

30 instructions per byte, i.e. 45K instructions
per 1500-byte packet

Source: Intel Corp.
48
Content Inspection Technologies

Pattern-Matching Algorithms
Software Based
Boyer-Moore
Aho-Corasick (AC)
Wu-Manber
Hardware Based
Bloom-Filter
Reconfigure Hardware (FSM)
TCAM-based

49
Pattern Matching Problem Definition

Given an input text $T = t_0, t_1, \dots, t_n$ and a
finite set of strings $P = \{P_1, P_2, \dots, P_r\}$,
the string matching problem involves locating and
identifying every substring of $T$ that is
identical to some $P_j$, $1 \le j \le r$; that is,
$P_j[i] = t_{s+i}$ for $0 \le i \le m-1$, where $m$
is the length of $P_j$ and $s$ is the match
position. This can also be written as
$P_j = t_s \dots t_{s+m-1}$.

Text:    G C A T C G C A G A G A G T A T A C A G T A A G
Pattern: G C A G A G A G
50
Aho-Corasick (AC) Algorithm

AC is a classic solution to exact set matching.
It runs in O(n + m + z) time, where n is the
length of the text T, m is the total length of
the patterns, and z is the number of pattern
occurrences in T.
AC is based on a refinement of a keyword tree.
AC is a deterministic algorithm: the scanning
cost per input character does not depend on the
number of patterns.

51
An Example of AC Algorithm

Example: P = {ab, ba, babb, bb}

52
An example of AC Algorithm
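[Figure: AC automaton for the patterns {he, she,
his, hers}, with states 0 to 9. The root (state 0)
loops on any character other than h or s. Outputs:
state 2 "he", state 5 "he" and "she", state 7
"his", state 9 "hers". Dashed lines are failure
transitions; those not shown lead to the root.]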
53
An example of AC Algorithm
[Figure: the automaton scanning the text below one
character at a time and reporting "Got a Match!"
for "his".]
Text: h e i s h i s
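
The example can be reproduced with a compact Aho-Corasick sketch. It uses the four patterns of the example (he, she, his, hers) and the text "heishis". The function names `build_automaton` and `search` are illustrative, and the state numbers it generates follow insertion order, so they need not coincide with the numbering in the figure.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    output = [[]]    # patterns recognized on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first traversal: shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
found = search("heishis", automaton)
```

Each input character is consumed exactly once; failure links are followed only when the current state has no transition for it, which is what keeps the scan cost independent of the number of patterns.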
54
Reconfigure Hardware (FSM)

Implement the AC FSM in the configurable Logic
Elements (LEs) of an FPGA.
Achieves multi-gigabit performance (depending on
the FPGA model).
A powerful FPGA is needed to accommodate
thousands of patterns, so this approach is not
practical or viable in the commercial market.

55
FPGA-based pattern matching

FPGA-based

56
Bloom Filter

Given a string X, the Bloom filter computes k
hash functions on it, producing k hash values
ranging from 1 to m, and sets the corresponding
bits in an m-bit vector. The same procedure is
repeated for all members of the pattern set.
The input text is checked by generating k hash
values in the same way. If at least one of these
k bits is not set, the string cannot be a member
of the pattern set.
Patterns of length n are grouped into Bloom
filter Bn.

57
Bloom Filter (Cont.)

False positive rate $f$
Minimum $f = (1/2)^k$, when $m = (k \times n) / \ln 2$
So, total space: sum(Bi) = m x (w - 1)
If k = 1: n = 2048, m = 3072 bits
          n = 3072, m = 4608 bits
If k = 4, f = 0.0625
   k = 5, f = 0.0313
   k = 6, f = 0.0156

k hash functions: H1, H2, ..., Hk
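
A small software Bloom filter shows the set and test steps and the sizing relation. This is a sketch only: the class and helper names are hypothetical, and deriving the k hash functions from salted SHA-256 is an assumption made for convenience, whereas hardware designs use much cheaper hashes.

```python
import hashlib
import math


class BloomFilter:
    """m-bit Bloom filter with k hash functions derived from salted SHA-256."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def optimal_bits(n_patterns, k_hashes):
    """Bits needed so that k hashes minimize f: m = k * n / ln 2."""
    return math.ceil(k_hashes * n_patterns / math.log(2))


def min_false_positive(k_hashes):
    """Minimum false positive rate at the optimal size: f = (1/2)^k."""
    return 0.5 ** k_hashes


bloom = BloomFilter(m_bits=optimal_bits(4, 4), k_hashes=4)
for pattern in (b"hers", b"his", b"she", b"he"):
    bloom.add(pattern)
```

A positive answer still has to be confirmed by an exact comparison, because the filter can report false positives but never false negatives.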
58
TCAM fundamental

TCAM stores data with three logic values: 0,
1, and X (don't care).
Multiple match modes are needed.
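
Ternary matching can be emulated in software to show what a TCAM lookup computes. The functions below are a hypothetical sketch: entries are written as strings over 0, 1 and X, and a real TCAM compares all entries in parallel rather than in a loop. Two match modes are shown, first match (priority by position) and all matches.

```python
def ternary_match(entry, key):
    """True if every non-X position of the entry equals the key bit."""
    if len(entry) != len(key):
        return False
    return all(e == "X" or e == k for e, k in zip(entry, key))


def tcam_first_match(table, key):
    """Index of the first (highest-priority) matching entry, or None."""
    for index, entry in enumerate(table):
        if ternary_match(entry, key):
            return index
    return None


def tcam_all_matches(table, key):
    """Indexes of every matching entry, in priority order."""
    return [i for i, entry in enumerate(table) if ternary_match(entry, key)]


table = ["1010", "10XX", "XXXX"]
```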

59
Policy Engine

Collects the matching events from the
Pattern-Matching Engine.
Clarifies the relationship between matched
patterns:
Ordered: a policy may consist of more than one
pattern, and the patterns should be matched in
order.
Offset, Depth: the matched position should be
within a certain range or location.
Distance, Within: the distance between two
matched patterns should also be taken into
consideration.
Traces application states:
Some applications are difficult to identify by
using only one signature (e.g. P2P). The Policy
Engine needs to track the connection state, as in
the following diagram.

[Figure: connection state diagram with states S0,
S1, S2, S3 and transitions labelled Msg Exchange,
Request File, and Data Exchange.]
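
The ordering and position constraints can be illustrated with a simplified checker. It is loosely modelled on the offset/depth and distance/within ideas named above, but the function `match_policy` and its dictionary keys are assumptions for this sketch. It takes the first occurrence of each pattern and does not backtrack, which a production engine would need to do.

```python
def match_policy(payload, contents):
    """Check an ordered list of content constraints against a payload.

    Each item is a dict with "pattern" (bytes) and optional keys:
    "offset"/"depth" give an absolute search window, while
    "distance"/"within" give a window relative to the previous match end.
    """
    prev_end = 0
    for item in contents:
        pattern = item["pattern"]
        if "distance" in item or "within" in item:
            start = prev_end + item.get("distance", 0)
            end = start + item["within"] if "within" in item else len(payload)
        else:
            start = item.get("offset", 0)
            end = start + item["depth"] if "depth" in item else len(payload)
        pos = payload.find(pattern, start, min(end, len(payload)))
        if pos < 0:
            return False  # pattern missing, or outside its allowed window
        prev_end = pos + len(pattern)
    return True


payload = b"GET /scripts/cmd.exe HTTP/1.0"
rule = [
    {"pattern": b"GET", "offset": 0, "depth": 3},
    {"pattern": b"cmd.exe", "distance": 1, "within": 20},
]
```

Tracking application states across packets would add a per-connection state variable on top of such per-packet checks.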
60
Fast Pattern Matching Algorithms

A Pattern Matching Coprocessor for Deep and Large
Signature Set in Network Security System (IEEE
GLOBECOM 2005)
Hierarchical Matching Algorithm (HMA) for
Intrusion Detection Systems (IEEE GLOBECOM2005)
A Time and Memory Efficient String Matching
Algorithm for Intrusion Detection Systems, (IEEE
GLOBECOM 2006)
A non-Computation Intensive Pre-filter for String
Pattern Matching in Network Intrusion Detection
Systems, (IEEE GLOBECOM 2006)
Smart Architecture for High-speed Intrusion
Detection and Prevention Systems, International
Conference on Cryptology and Network Security
(CANS 2006, acceptance rate < 18%).
A Deterministic Cost-effective String Matching
Algorithm for Network Intrusion Detection
Systems, (IEEE ICC2007).
A Novel Algorithm and Architecture for High Speed
Pattern Matching in Resource-limited Silicon
Solution, (IEEE ICC2007)
Flow Digest: A State Synchronization Scheme for
Stateful High Availability, (IEEE ICC2007).
Performing Packet Content Inspection by Longest
Prefix Matching Technology, (IEEE GLOBECOM2007).

61
Security SoC

BroadWeb Security SoC
ARM922 RISC CPU (250Mhz)
Hardware NAT (400Mbps)
Hardware Content Inspection Engine (40Mbps)
Two 10/100/1000 RJ-45 Ports
Embedded-Linux
NSS and ICSA approved IPS signature database
IPS/Anti-virus functions
IM/P2P Management
Turn-key solution (ASIC + Software module)
1-tier Customers

62
Security SoC (Cont.)

BroadWeb Security SoC (2nd Generation)
ARM926EJ RISC CPU (300Mhz)
Intelligent Hardware NAT (1Gbps)
Hardware Content Inspection Engine (100Mbps)
Embedded GbE Smart Switch and 4-port GPHY core
NSS and ICSA approved IPS Technology
IPS/Anti-virus functions
IM/P2P Management
Turn-key solution (ASIC + Software module)
1-tier Customers

63
Cisco/Linksys Wireless Security Router

IEEE 802.11n 108 Mbps EWC Wireless LAN
IPS protection and IM/P2P management
Firewall/VPN/Routing
Gigabit Ethernet x 5

64
State Machine Based Technologies
65
The FA Example FTP
State Machine Based Technologies
66
The FAs of BitTorrent protocols.
67
The FAs of Yahoo Messenger protocol.
68
We can identify and manage Over 60 Applications

IM
MSN, Yahoo Messenger, AIM, QQ, Google Talk, TM,
ICQ, iChat, MIRC, Odigo, Rediff, Gadu-Gadu
Web-IM
Meebo.com, eBuddy.com, iLoveIM.com, MSN, AIM,
Yahoo, ICQ
P2P
eDonkey, BitTorrent, Gnutella, Foxy, FastTrack,
Vagaa, Winny, BitComet, DirectConnect, PiGo,
PP365, WinMX, POCO, iMesh, ClubBox
Streaming-Media
QQLive, Podcast Bar, PPLive, RealPlayer, Windows
Media Player, iTunes, WinAMP, Player 365,
QuickTime, FlashMedia Video, TVAnts

Webmail
Yahoo, Hotmail, Gmail
VoIP
Skype (3.6)
File Transfer
FTP, Web File Transfer, Thunder, GetRight,
FlashGet
VPN
VNN, SoftEther, Hamachi, TinyVPN, PacketiX,
HTTP-Tunnel, Tor, Ping-Tunnel
Terminal Control
VNC, PCAnywhere
Online Game
QQGame, OurGame, Cga.com.cn, QQFO

69
Machine Learning Based Technologies
70
Application Traffic identification

Traffic identification (or traffic classification)
has received much attention in recent years
because:
The introduction of P2P applications greatly
affects network management.
The port number is no longer a reliable or
efficient discriminator for identifying this
prevalent traffic.
How about string matching? Accurate! But:
It cannot identify encrypted traffic.
Protocol signatures are costly to maintain
manually.
Matching strings in very high-speed networks is
costly.
The privacy issue is still under debate.

71
How to resolve the problem?

Heuristic methods (2004-2005)
Rules can be constructed from intrinsically
different behavior.
E.g. (# of dest IPs) = (# of dest ports) -> the
host is running P2P.
Used to differentiate P2P from non-P2P traffic.
Machine learning based techniques (2004 onward)
Construct statistical signatures for different
categories/application protocols.
Most machine learning techniques are applied
directly to construct traffic signatures.

72
The Milestone of Researches on Application
Traffic Identification

Before 2003: string matching and port number.
2003-2005:
Heuristics
Machine learning methods.
2006: machine learning methods for real-time
traffic classification.
Sizes and directions of the first k data packets
of a TCP connection.
Stage-based classification (statistical data in
each stage)

73
Different Objects of Application Traffic
Identification

At different levels
Category level or QoS class (bulk data transfer
such as FTP and P2P, interactive, mail, web,
streaming)
Protocol level (Kazaa, eMule/eDonkey, BitTorrent,
MSN, FTP, POP3, SMTP, HTTP, Skype, Winny,
Share, ...)
Behavior level (FTP control, FTP data, MSN file
transfer, MSN message chatting, MSN VoIP, Skype
chatting, Skype VoIP, Skype file transfer, Skype
video conference, ...)
All existing research focuses on classification
at the protocol or category level.
Application field
Offline: traffic trend analysis.
Online: traffic shaping, traffic engineering,
security management.

74
The Classes of Applied Machine Learning Algorithms

Supervised machine learning
The model of traffic characteristics is
constructed from training instances with
previously defined class labels.
Unsupervised machine learning (clustering)
The model of traffic characteristics is
constructed from training instances without
previously defined class labels.
However, the training sets used by both kinds of
existing work include pre-classified labels,
because each cluster may contain several
different classes/protocols and must be mapped to
one of them.

75
The Discriminators (Attributes)

The key issues for machine-learning based traffic
identification are:
What are the most distinguishing characteristics
(attributes/discriminators)?
How can the high cost of training be reduced?
Different discriminators
From the L3/L4 layers: packet inter-arrival time,
total packet size, number of packets, etc.
Combinations of L3/L4 attributes from different
perspectives, e.g. upload/download size ratio.

76
The Milestone of Researches (Applying Machine
Learning techniques)

2003-2004
Matthew Roughan, IMC '04: Class-of-Service
Mapping for QoS.
2005
Sebastian Zander: Automated Traffic
Classification.
Andrew W. Moore: Using Bayesian Analysis
Techniques.
2006
Sebastian Zander: Internet Archeology:
Estimating Individual Application Trends in
Incomplete Historic Traffic Traces.
Laurent Bernaille: Traffic Classification on the
Fly (first 5 packets of TCP with k-means
clustering).
Jeffrey Erman: Internet Traffic Identification
using Machine Learning (k-means, EM clustering).

77
The Milestone of Researches (Applying Machine
Learning techniques)

2006 (cont.)
Laurent Bernaille: Early Application
Identification (first 4 packets of TCP with
k-means, GMM, and HMM clustering).
2007: real-time methods
Zhu Li: Accurate Classification of the Internet
Traffic Based on the SVM Method (TCP and UDP
flow classification).
Laurent Bernaille: Early Recognition of
Encrypted Applications (first 3 packets of TCP
with GMM clustering).
Jeffrey Erman: Semi-Supervised Network Traffic
Classification (stage-based classification).

78
Class-of-Service Mapping for QoS: A Statistical
Signature-based Approach to IP Traffic
Classification
ACM SIGCOMM Internet Measurement Conference
(IMC '04)

Matthew Roughan (1), Subhabrata Sen (2), Oliver
Spatscheck (2), Nick Duffield (2)
(1) School of Mathematical Sciences, University
of Adelaide, Australia
(2) AT&T Labs-Research, Florham Park, NJ, USA

79
Introduction

Before this paper
Earlier research tried to find models for
traditional protocols (FTP, web, mail).
Most research on traffic characteristics
modeling for P2P and IM consisted of case
studies.
Features
This paper studied the requirements and proposed
a QoS framework, at the QoS class level, for
traffic that consists of both traditional and
novel P2P/IM applications.
Classification is based on the statistics of
particular applications, which are used to form
signatures.

80
Ideas

The statistical attributes are aggregated by
server port and by server IP address,
separately.
Machine learning techniques are used to construct
the mapping from each server port
aggregate/server IP aggregate to a QoS class:
Nearest Neighbor (NN)
Linear Discriminant Analysis (LDA)
Then, the port number of an aggregate that
belongs to a particular QoS class can form one
rule.
Disadvantage: applications that require different
QoS might use the same server port number (e.g.
P2P).

81
Nearest Neighbor

To classify a data point x, find its nearest
neighbor!
Points with the same properties should lie close
together.
The class of the nearest neighbor is assigned to
the data point x.
K-Nearest Neighbor
Find the k nearest neighbors and let them vote.

More information:
http://neural.cs.nthu.edu.tw/jang/books/dcpr/4.2-knnr.asp?title=4-2 K-nearest-neighbor Rule
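
The voting rule fits in a few lines. The sketch below assumes Euclidean distance and invented two-feature training points (average packet size in bytes, flow duration in seconds) labelled with made-up QoS classes; none of these values come from the paper.

```python
import math
from collections import Counter


def knn_classify(training, x, k=1):
    """training: list of (feature_vector, label); returns the majority label
    among the k training points closest to x (Euclidean distance)."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (avg packet size, duration) -> QoS class.
training = [
    ((100, 0.5), "interactive"),
    ((120, 0.4), "interactive"),
    ((1400, 30.0), "bulk"),
    ((1350, 25.0), "bulk"),
    ((1300, 28.0), "bulk"),
]
```

In practice the features would be scaled first, because an unscaled attribute with a large range (here the packet size) dominates the distance.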
82
Linear Discriminant Analysis

To find a good projection of the original
points.
Linear discriminant analysis finds a linear
transformation ("discriminant function") of the
two predictors, X and Y, that yields a new set of
transformed values that provides a more accurate
discrimination than either predictor alone:
Transformed Target = C1*X + C2*Y

2 features
3 features
More information: http://www.dtreg.com/lda.htm
http://neural.cs.nthu.edu.tw/jang/books/dcpr/index.asp
83
Evaluation Example

Attributes for this evaluation: the average
packet size, flow duration, bytes per flow,
packets per flow, and Root Mean Square (RMS)
packet size.

84
Internet Traffic Classification Using Bayesian
Analysis Techniques
ACM SIGMETRICS '05

Andrew W. Moore1, Denis Zuev2
1University of Cambridge
2University of Oxford

85
Introduction

Features
Only TCP flows are considered.
Category-level classification.
Supervised machine learning
Naïve Bayesian algorithm.
Uniquely, uses data that has been hand-classified
(based upon flow content) into one of a number of
categories.
Feature selection was applied to improve the
accuracy.

86
Ideas

Discriminators
About 248 discriminators for each flow.
E.g. packet inter-arrival time (mean, variance,
...), payload size (mean, variance, ...),
Fourier transform of the packet inter-arrival
time, TTL value, flow duration, TCP port, etc.
Naïve Bayesian classifier
For a flow with known statistical attributes,
which class is the most likely?
Find the class with the maximum probability
Pr(Ci | X)
Ci is the i-th class
X is the attribute vector of the flow to be
classified.
Only about 65% flow-level accuracy was
achieved.
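
The maximization of Pr(Ci | X) can be sketched with a Gaussian Naïve Bayes classifier in plain Python. The function names, the two features and the sample values are invented for illustration; the paper works with about 248 discriminators and hand-classified traces.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples):
    """samples: list of (feature_vector, label).
    Returns {label: (prior, [(mean, variance) per feature])}."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(samples)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def classify(model, x):
    """Return the class maximizing log Pr(Ci) + sum_j log Pr(x_j | Ci)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical flows: (mean payload size, mean inter-arrival time).
samples = [
    ((60.0, 0.20), "interactive"), ((80.0, 0.30), "interactive"),
    ((70.0, 0.25), "interactive"), ((1400.0, 20.0), "bulk"),
    ((1300.0, 25.0), "bulk"), ((1450.0, 22.0), "bulk"),
]
model = train_gaussian_nb(samples)
```

Since Pr(X) is the same for every class, comparing Pr(Ci) times the product of the per-feature likelihoods is enough; logarithms are used to avoid numerical underflow.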

87
Ideas(cont.)

Improvement
Naïve Bayes with kernel estimation.
Kernel estimation was used instead of the
Gaussian distribution model assumed by Naïve
Bayes.
Discriminator selection and dimension reduction.
The accuracy was improved up to 95%.
Disadvantages
The discriminators are available only after the
flow is closed.
Only TCP flows are considered for classification.
Network management might need finer classes
(protocol level or behavior level).

88
Evaluation with training and test sets taken from
traffic at different times
FCBF: Fast Correlation-Based Filter
89
Traffic Classification on the Fly
ACM SIGCOMM Computer Communication Review, Volume
36, Issue 2, April 2006

Laurent Bernaille, Renata Teixeira, Ismael
Akodkenou, Augustin Soule, Kave Salamatian
LIP6, Université Pierre et Marie Curie,
Thomson Paris Lab
Paris, France

90
Introduction

Features
The first paper focused on real-time flow-level
application classification.
Approximately models the L7 protocol handshake.
Protocol-level classification.
Unsupervised machine learning.
K-means clustering (50 clusters work best).
Protocol assignment for each cluster: the
protocol with the largest proportion dominates
the cluster.
Discriminators: the sizes (payload) and
directions of the first q data packets of each
TCP connection.
q = 5 works best, e.g. (300, -200, 100, 200, -400)
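
The discriminator vector can be built as follows. The sketch assumes that packets arrive as (payload length, from-client flag) pairs and that the sign encodes direction, positive for client to server; both the input format and the sign convention are assumptions, since the slide only shows the resulting vector.

```python
def first_q_features(packets, q=5):
    """packets: iterable of (payload_len, from_client) in arrival order.

    Returns the signed payload sizes of the first q data packets, or None
    if the flow has fewer than q data packets.
    """
    features = []
    for payload_len, from_client in packets:
        if payload_len == 0:
            continue  # handshake segments and pure ACKs carry no data
        features.append(payload_len if from_client else -payload_len)
        if len(features) == q:
            return tuple(features)
    return None


# Hypothetical TCP connection: three-way handshake, then data packets.
flow = [(0, True), (0, False), (0, True), (300, True), (200, False),
        (100, True), (200, True), (400, False), (1460, False)]
vector = first_q_features(flow)
```

Flows with fewer than q data packets yield no vector, which is one reason short flows are hard for this family of methods.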

91
K-means Clustering

For a given number of clusters k, iteratively
find the k cluster centers and partition all the
points into these k clusters, until the nearest
center of every point no longer changes.
Each data point is expressed as a vector, and
Euclidean distance is the most common distance
function.
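
The iteration can be written as a short sketch. To keep the example deterministic the initial centers are passed in explicitly, which is an assumption of this sketch; implementations usually pick them at random or with a seeding heuristic.

```python
import math


def kmeans(points, initial_centers, max_iters=100):
    """Alternate assignment and center update until assignments stop changing.

    Returns (centers, assignment), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed its nearest center
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, assignment = kmeans(points, initial_centers=[(0, 0), (10, 10)])
```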

92
Evaluation Result

An average accuracy above 80% can be achieved.
Disadvantages
Only TCP connections are considered.
Protocol assignment results in classification
starvation:
Protocols that don't dominate any cluster will
always be classified as another protocol.

93
Early Application Identification
ACM CoNEXT '06 (International Conference on
Emerging Networking Experiments and
Technologies), December 2006

Laurent Bernaille, R. Teixeira and K. Salamatian,
Université Pierre et Marie Curie, LIP6, CNRS
Paris, France

94
Introduction

Features
Three unsupervised machine learning (clustering)
algorithms were used to evaluate cluster
assignment accuracy and protocol labeling
accuracy:
K-means
Gaussian Mixture Models (GMM) on a Euclidean
space
Spectral clustering on Hidden Markov Models (HMM,
in order to consider the order of packets)
Discriminators: size and direction of the first P
data packets.
To deal with the starvation problem in each
group, a labeling heuristic based on the
standard server port number (e.g. 25 for SMTP,
110 for POP3) is used to classify protocols in
each cluster group.
Focuses only on TCP flows.
A wireless traffic trace is included in the
evaluation.

95
Discriminators

Discussion about the discriminators
The size and direction of each packet add more
information for distinguishing applications than
arrival-time related metrics.
The range of packet sizes for each application is
similar across traces.
These models can therefore be used to classify
the same set of applications on another network.
P = 4 packets for the three clustering methods.
Number of clusters
Kh = 30 for HMM,
Kk = 40 for K-Means and
Kg = 45 for GMM.

96
Packet size is a better attribute
97
On-line Classification
98
Labeling
Set of standard server ports:
std(S) = {FTP, SSH, SMTP, HTTP, POP3, NNTP, HTTPS,
POP3S}.
99
Labeling Accuracy
100
Features

Pros
Easy, fast, and simple!
Uses only the payload size and packet direction
of the first P data packets.
Unsupervised training -> automatic learning
mechanism.
Cons
According to Jeffrey Erman's HP technical report,
the approach is unsuccessful at classifying
application types with variable-length packets in
their protocol handshakes, such as Gnutella. The
report also notes that neither of these studies
assesses the byte accuracy of its approach, which
makes direct comparison with Erman's work
difficult.

101
Features

Cons
Only TCP flows are included for classification.
According to the description of the traces, a
non-negligible fraction of flows contain fewer
than 4 data packets!
Also, the control flow might prevent the
identification system from classifying detailed
protocol behavior.
Classification starvation still exists for
protocols that don't use a standard port.

102
Early Recognition of Encrypted Applications
Passive and Active Measurement Conference (PAM
2007), April 5-6, 2007

Laurent Bernaille, Renata Teixeira
Université Pierre et Marie Curie - LIP6-CNRS
Paris, France

103
Introduction

Features
Classification of SSL-encrypted protocols.
Two stages: SSL detection + protocol
identification.
First 3 packets and 35 clusters for the Gaussian
Mixture Model.
Size of the original packet
The most accurate method is to look up the
encryption method in the handshake packets and
transform the size of application packets
accordingly.
For the five most common ciphers this method is
overkill, because the size increase varies only
from 21 to 33 bytes.
Simple heuristic: subtract 21 from the size of
the encrypted packet, regardless of the cipher.
Extending the Cluster+Port labeling heuristic
SSL-specific ports: 443 for HTTPS, 993 for IMAPS
and 995 for POP3S.

104
(No Transcript)
105
Accurate Classification of the Internet Traffic
Based on the SVM Method
IEEE ICC 2007

Zhu Li (1), Ruixi Yuan (1), and Xiaohong Guan (1, 2)
(1) Center for Intelligent and Networked Systems
(CFINS), Tsinghua University, Beijing 100084,
China
(2) SKLMS Lab and MOE Key Lab for Intelligent
Networks and Network Security, Xi'an Jiaotong
University, Xi'an 710049, China

106
Introduction

Features
Category-level classification.
Supervised machine learning.
Support Vector Machine.
Feature selection (discriminator selection) is
employed to select the best set of attributes.
Both TCP and UDP are considered.
Discriminators: statistical data of flows.
Disadvantage: the discriminators are available
only after the flow has finished communicating.

107
Feature Selection

Sequential forward selection
Begin with 0 features chosen; sequentially append
the 1 feature that gives the best classification
result.
Plus-m-minus-r algorithm
Begin with 0 features chosen; each time, append m
features to the chosen ones and then pop r
features from them (m > r).
Plus-2-minus-1 was used in this paper.
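
Both procedures can be expressed as one greedy routine. In this sketch `score` stands for whatever evaluation is used for a feature subset (for example cross-validated accuracy); the additive toy score, the feature names and the `target_size` stopping rule are assumptions made for illustration. Sequential forward selection is the special case m = 1, r = 0.

```python
def plus_m_minus_r(features, score, m=2, r=1, target_size=None):
    """Greedy plus-m-minus-r selection; score(subset) is higher for better subsets."""
    if not m > r >= 0:
        raise ValueError("require m > r >= 0")
    if target_size is None:
        target = len(features)
    else:
        target = min(target_size, len(features))
    chosen = []
    while len(chosen) < target:
        for _ in range(m):  # forward steps: add the most helpful feature
            if len(chosen) >= target:
                break
            remaining = [f for f in features if f not in chosen]
            best = max(remaining, key=lambda f: score(chosen + [f]))
            chosen.append(best)
        if len(chosen) >= target:
            break
        for _ in range(r):  # backward steps: drop the least useful feature
            worst = max(chosen,
                        key=lambda f: score([c for c in chosen if c != f]))
            chosen.remove(worst)
    return chosen


# Toy additive score with hypothetical per-feature contributions.
weights = {"pkt_size": 0.5, "duration": 0.3, "ttl": 0.01, "port": 0.2}


def toy_score(subset):
    return sum(weights[f] for f in subset)


selected = plus_m_minus_r(list(weights), toy_score, m=2, r=1, target_size=3)
```

With a real classifier the score is not additive, which is exactly when the backward step pays off: a feature added early can become redundant once better ones are present.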

108
Feature Selection (Cont.)
109
Accuracy After Feature selection

For the data sample set with respect to original
proportion in the traffic

110
Offline/Realtime Traffic Classification Using
Semi-Supervised Learning
HP Technical Report, July 13, 2007. Presented at
Performance 2007, 2-5 October 2007, Cologne,
Germany, and published in the Performance
Evaluation journal (special issue for the
Proceedings of IFIP Performance 2007)

Jeffrey Erman, Anirban Mahanti, Martin Arlitt,
Ira Cohen, Carey Williamson
Enterprise Systems and Software Laboratory
HP Laboratories Palo Alto

111
Introduction

Features
Semi-supervised learning techniques
Allow classifiers to be designed from training
data that consists of only a few labeled and many
unlabeled flows.
Both high byte accuracy and high flow accuracy
(i.e., > 90%).
Traffic is examined over an extended period of
time to assess the longevity of the classifiers.
Focused on TCP only.
It would likely be advantageous to have a
separate classifier for the non-TCP traffic
(future work).
Considers which elements go into the training
set.
Elephant vs. mice flows
In order to obtain higher byte accuracy.

112
Introduction

Semi-supervised Learning
Hypothesis: if a few flows are labeled in each
cluster, we have a reasonable basis for creating
the mapping from clusters to application types.
Step 1: clustering with K-Means
Step 2: mapping from the clusters to the q
different known applications (Y) according to
the fraction of labeled application flows within
each cluster.
Clusters are left unlabeled if they contain no
labeled flows.
The unlabeled clusters are used to represent new
or unknown applications.
For most experiments, the number of clusters is
K = 400.
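
Step 2 can be sketched as a majority vote over the labeled flows in each cluster. The function name, the "unknown" marker and the sample data are illustrative assumptions; the cluster ids would come from the K-Means step.

```python
from collections import Counter, defaultdict


def map_clusters_to_applications(assignments, labels):
    """assignments[i]: cluster id of flow i; labels[i]: application or None.

    Each cluster is mapped to the application with the largest share of its
    labeled flows; clusters without labeled flows are marked "unknown".
    """
    counts = defaultdict(Counter)
    for cluster, label in zip(assignments, labels):
        if label is not None:
            counts[cluster][label] += 1
    mapping = {}
    for cluster in set(assignments):
        if counts[cluster]:
            mapping[cluster] = counts[cluster].most_common(1)[0][0]
        else:
            mapping[cluster] = "unknown"
    return mapping


# Hypothetical result of clustering seven flows, three of them labeled.
assignments = [0, 0, 0, 1, 1, 2, 2]
labels = ["http", None, "http", "p2p", None, None, None]
mapping = map_clusters_to_applications(assignments, labels)
```

A new flow is then classified by finding its nearest cluster center and looking up that cluster in the mapping.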

113
Discriminators

11 Discriminators (After feature selection from
25 discriminators)
Total number of packets.
Average packet size.
Total bytes.
Total header (transport plus network layer)
bytes.
Number of caller to callee packets.
Total caller to callee bytes.
Total caller to callee payload bytes.
Total caller to callee header bytes.
Number of callee to caller Packets.
Total callee to caller payload bytes.
Total callee to caller header bytes.

114
On-line Classification

Online classification
Layered classification system.
A packet milestone is reached when the total
number of packets a flow has sent or received
(SYN/SYN-ACK packets included) reaches a specific
value.
Each layer is an independent model that
classifies ongoing flows into one of the many
class types using the flow statistics available
at the chosen milestone.
Each milestone's classification model is trained
using flows that have reached that specific
packet milestone.
A flow is reclassified whenever an upper layer is
reached.
When a flow is reclassified, any previously
assigned labels are disregarded.

115
Byte Accuracy
April 13, 9 am trace
78% of the flows had correct labels after
classification
116
Features

Pros
The semi-supervised mechanism reduces the cost of
preparing a large training data set.
Considers sampling techniques for forming the
training set.
Cons
Only TCP flows are included.
Is an exponential packet milestone suitable for
real-time classification?

117
A High Accurate Machine-Learning Algorithm for
Identifying Application Traffic in Early Stage

Nen-Fu Huang, Gin-Yuan Jai, and Han-Chieh
Chao (1)
Department of Computer Science, National Tsing
Hua University, Taiwan
(1) Department of Electronics, National Ilan
University, Taiwan

118
Classification in Early Stage

Obtain the characteristics of the protocol
handshake of each flow from an L7 perspective.
Flow ID: tuple (sip, sport, dip, dport, protocol)
Statistical information on each flow during the
first k rounds:
Elapsed time, transmitted size, throughput,
response time, inter-arrival time.

119
(No Transcript)
120
Rule-based Machine Learning

Rule-based ML (supervised machine learning)
The generated rules suit the intrinsic
architecture of firewalls and IDS/IPS.
Rules generated by an ML algorithm provide
information for understanding the potential
characteristics of application protocols.
One Rule, PART, Ripple Down, DecisionTable,
ConjunctiveRule, Ripper

121
Experiment Architecture
122
Accuracy Comparison with Respect to Sample Set
L. Bernaille 2006
123
Accuracy Comparison with Respect to Sample
Set (cont.)
Zhu Li ICC2007
124
Accuracy After Discriminators Selection
125
Conclusions

Machine learning based techniques for identifying
network applications are becoming more and more
important.
Focus on the real-time, protocol-level
requirements of application traffic
classification.
No common traffic traces exist for comparing
performance against the same baseline.
Expensive training is still a problem.
Identifying encrypted traffic (e.g. Skype, Winny,
encrypted BT) is a new challenge.
Identifying detailed behaviors of encrypted
traffic is an even bigger challenge.