Title: (Network%20Security)
1???? (Network Security)
????? ??????????/??????? E-mail
nfhuang_at_cs.nthu.edu.tw
2Agenda
- Introduction of Network Security
- Content Inspection Technologies
- Pattern Matching Algorithms
- Flow Classification by Stateful Mechanism
- Machine Learning Based Application Identification
Technologies - Network Security Research Topics
- Conclusions
3 -- ?????? --
- 2000/3????DDos???????,??Yahoo?Amazon?CNN?eBay
??????? - 2001/7Amazon.com ??? Bibliofind ?????????????
- 2002 ??????
- 2003/1 SQL Slammer ??
- 2003/4 ??????????
- 2003/8 Blaster ??????
- 2003/9 SoBig ??????
- 2003/9 ??????
- 2004/3 Netsky ??????
- 2004/4 Sasser ??????
- 2005/5 ?????????????
- 2005/6 ????????????????????
4???????
- ??????????,????????,??????,??????,???????
- ?????????????,??????????????????
- ???????????????????????????????
- ????????????????
- ???????????????
5????????
6????????
7????????
8????????
Policy
9??????
- Denial of Service (DoS), Distributed Denial of
Service (DDoS) - Network Invasion
- Network Scanning
- Network Sniffing
- Torjan Horse and Backdoors
- Worm
10(1) DoS/DDoS
- Prevent another user from using network
connection, or disable server or services e.g.
Smurf and Fraggle attacks, Land,
Teardrop, NewTear, Bonk, Boink, SYN
flooding, Ping of death, IGMP Nuke, buffer
overflow. - Caused by protocol fault or program fault.
- It damages the Availability.
11????? DoS ??
- Ping Flooding
- ??????? ICMP echo ???????,????????
- Ping of Death
- ??????? 65,536???? ICMP echo ???????,???????????
(TCP/IP ??????)? - UDP flooding (Chargen)
- ???????? UDP ???????????????(Port 19, Character
Generator),????????????????UDP??,????????
12????? DoS ??
- Smurf Attack
- ????????????????????? ICMP echo
??,??????????????????????????????? ICMP reply
?????,?????????,???????????????? - SYN flooding
- ???????????? SYN ??(????TCP??)?????????,??????????
???????????????? SYN-ACK ???????,?????????????????
??????????? TCP ??,??????????????
13Smurf attack (DoS)
- Dangerous attacks
- Network-based, fills access pipes
- Uses ICMP echo/reply (smurf) or UDP echo
(fraggle) packets with broadcast networks to
multiply traffic - Requires the ability to send spoofed packets
- Abuses bounce-sites to attack victims
- Traffic multiplied by a factor of 50 to 200
- Low-bandwidth source can kill high-bandwidth
connections - Similar to ping flooding, UDP flooding but more
dangerous due to traffic multiplication
14Smurf Attack (contd)
15SYN flooding Attack (DoS)
- Goal is to deny access to a TCP service running
on a host. - Creates a number of half-open TCP connections
which fill up a hosts listen queue host stops
accepting connections. - Requires the TCP service be open to connections
from the victim.
16SYN flooding (contd)
Spoofed SYN
ACK to spoofed address
Attacker
Victim
The Innocents
17DDoS Attack
Attacker
Handler
Handler
Handler
Agent
Agent
Agent
Agent
Agent
Agent
Agent
Control message Maybe encrypted or hidden in
normal packets.
Victim
Spoofed packets.
18DDoS Attack
- ????????????????????????????
- ?? Yahoo.com,Amazon.com,CNN.com,buy.com?
ebay.com??????DDoS??
19DDoS ????
- DDOS ????????
- Trin00 (?????)
- Tribe Flood Network(TFN) (?????)
- TFN2K
- Stacheldraht
- Trin00
- Trin00 ????????????,??????,?????? Trin00 Daemon
??????????? UDP ??(?????????),????????????????????
?????? ICMP port unreachable ??,???????????????? - TFN
- ????? Trin00 ??, ? TFN????????????? SYN flood?UDP
flood?ICMP flood??Smurf ???????? TFN
????????????????,??????????????????
20(2) Network Invasion
- Goal is to get into the target system and obtain
information - Account usernames, passwords
- Source code, business critical information
- Usually caused by improper configurations or
privilege setting, or program fault. - Network invasion is diverse and various,
knowledge about attack pattern may help to
detect, but it is quite hard to detect all
attacks.
21Example of network invasion IIS unicode buffer
overflow
- For IIS 5.0 on windows 2000 without this
security patch, a simple URL string
http//address.of.iis5.system/scripts/..c11c../w
innt/system32/cmd.exe?/cdirc\ will show the
information of root directory.
22(3) Network Scanning
- Goal is generally to obtain the chance, the
topology of victims network. - The name and the address of hosts and network
devices. - The opened services.
- Usually uses technique of ICMP scanning, Xmas
scan, SYN-FIN scan, SNMP scan. - There is an automatic and powerful tool Nmap.
23(4) Sniffing
- Goal is generally to obtain the content of
communication - Account usernames, passwords, mail account
- Network Topology
- Usually a program placing an Ethernet adapter
into promiscuous mode and saving information for
retrieval later - Hosts running the sniffer program (e.g. NetBus)
is often compromised using host attack methods.
24(5) Backdoor and Torjan horse
- Usually, the backdoor and torjan horse is the
consequences of invasion or hostile programs. - It may open a private communication channel and
wait for remote commands. - Available toolkits
- Subseven,
- BirdSpy,
- Dragger
- It can be detected by monitoring known control
channel activities, but not with 100 precision.
25(6) Worm
- The chief intention of worm is to propagate and
survive. - It takes advantages of system vulnerabilities to
infect and then tries to infect any possible
targets. - It may decrease the production of system, leave
back doors, steal confidential information and so
on.
26P2P/IM ????
- P2P (Peer-to-Peer) ????
- IM (Instant Messenger) ???
- Spyware ????
- Adware ????
- Tunneling ????
27P2P A new paradigm
- Bottleneck of Server
- Powerful PC
- Flexible, efficient information sharing
- P2P changes the way of Web (Internet)
28P2P???????????
- P2P ???????????,??????????,?? SoftEther ?
Skype??????,????,????,????????? - P2P ????????,??
- ??????????
- ?????????
- ??????
- ?????
- ????????
- ??????????
- ??????,?????
29Famous P2P Examples
- BitTorrent
- eZpeer
- Kuro
- eDonkey
- eMule
- MLdonkey
- Gnutella
- Kazaa/Morpheus
- Shareaza
- Direct-connect
- Gnutella
- Soulseek
- Opennap
- Worklink
- Opennext
- Jelawat
- PP???
- SoftEther
- iMESH
- MIB
- WinMix
- WinMule
- Skype
30Instant Messenger (IM)
- MSN
- Yahoo Messenger
- ICQ
- YamQQ
- AIM (AOL IM)
31????????
- Firewall (Layer-4)
- VPN ? SSL VPN
- PKI
- IDS/IPS
- Defense-in-Depth
- Application Firewall (Layer-7)
- UTM (Unified Threat Management)
- NAC (Network Access Control)
32??????Intrusion Detection System (IDS)
????????Intrusion Detection and Prevention
System (IPS/IDP)
33Intrusion Detection System
- Intrusion Detection System a computer system
that attempts to detect any set of actions that
try to compromise the integrity, confidentiality,
or availability of a resource. - An IDS has much more knowledge and many delicate
detection functions than common firewalls.
(Remember that, the main function of a firewall
is to do access control).
34IDS Types
- Host based vs. Network based.
- Misused detection vs. Anomaly detection
- Active vs. Passive
- Centralized vs. Distributed
35Host based Network based IDS
- Host based IDS installed on target host as a
monitor service. It checks system activity, user
privilege, user behavior. - Network based IDS installed on network node,
usually in promiscuous mode to listen all passing
traffic. It checks network traffic, nodes
interactions.
36Misused detection Anomaly detection IDS
- Misused detection (signature-based) based on the
assumption that intrusion attempts can be
characterized by the comparison of user
activities against a database of known attacks. - Anomaly detection (statistical-based) identify
abusive behavior by noting and analyzing audit
data that deviates from a predicted norm.
37Active IDS vs. Passive IDS
- Active IDS an participate in the system. Not
only observe the events, but also involve in the
necessary operation. Also called IPS or IDP
(Intrusion Detection and Prevention System) - Passive IDS work on a monitor or bystander
basis.
38Active IDS v.s. Passive IDS
39Centralized IDS v.s. Distributed IDS
- Centralized The sensors are managed by a single
analyzer or manager. - Distributed The sensors are managed by multiple
automated analyzers or managers. And among
analyzers and managers, they can communicate to
each other.
40Comparison between Firewall and Network based
active IDS
- Same
- Cant protect insider to insider attack.
- Cant protect against connections that dont go
through. - Can do ACL and filtering. (For Active IDS)
- Different
- IDS has the ability to detect new threats.
- IDS focuses on intrusion while Firewall focuses
on access control and privacy. - Firewalls use address as the passport while IDS
will do much more checks.
41The Challenge of IDS
- Speed limitation NIDS cannot keep pace with the
network speed. (NIDS need to check more fields of
a packet than a firewall does.) - The inability to see all the traffic The
switched Ethernet is getting largely deployed. - Fail-open/fail-close architecture when a NIDS
fails often without notification of the problem
to the central console., leave the network as an
open one. A fail-closed methodology means the
network is out of service until the NIDS is
brought back on-line.
42IDS False Alarms
43Content Inspection Technologies
44A Generic Layer-7 Engine
- Packet Normalizer
- Makes sure the integrity of incoming packets
- Eliminates the ambiguity
- Decodes URI strings if necessary
- Pattern-Matching Engine
- Policy Engine
- Gather information from pattern-matching engine
and issue the verdict to allow/drop the packets
45Packet Normalizer
- Integrity Checking
- IP Fragment Reassemble
- TCP Segment Reassemble
- TCP Segments may come out-of-order
- SEQ out of window size
- Segment Overlapping
- URI Decode
- URI hex code obfuscation (a 61)
- URI unicode/UTF-8 obfuscation
- self-referential directories obfuscation
(/././././ /) - directories obfuscation (/abc/a/../a/../a/
/abc/a)
46Pattern-Matching Engine
- The most computation-intensive task in packet
processing. Normally the PM engine needs to
process every single byte in packet payload. - In Snort, the PM routine accounts for 31 of the
total execution time
47Pattern Matching is Expensive!
- 50 Instructions/ 1500 Byte packet
- 30 Instructions/ Byte. 45K Instructions/1500
Byte packet
Source Intel Corp.
48Content Inspection Technologies
- Pattern-Matching Algorithms
- Software Based
- Boyer-Moore
- Aho-Corasick (AC)
- Wu-Manber
- Hardware Based
- Bloom-Filter
- Reconfigure Hardware (FSM)
- TCAM-based
49Pattern Matching Problem Definition
- Given an input text T t0, t1, , tn ,and a
finite set of strings P P1, P2, , Pr, the
string matching problem involves locating and
identifying the substring of T which is identical
to Pj , 1? j? r, where - tsi , 0? i? m-1. And this equation can
be also denoted as - tstsm-1
Text
G C A T C G C A G A G A G T A T A C A G T A A G
G C A G A G A G
50Aho-Corasick (AC) Algorithm
- AC is a classic solution to exact set matching.
It works in time O(n m z) where z is number
of patterns occurrences in T. - AC is based on a refinement of a keyword tree.
- AC is a deterministic algorithm. That is, the
performance is independent of the number of
patterns.
51An Example of AC Algorithm
- Example P ab, ba, babb, bb
52An example of AC Algorithm
!h,s
he
h
e
Patterns hers his she
h
r
s
1
0
2
8
9
hers
i
his
s
s
6
7
he, she
h
e
3
4
5
s
sh
Dashed fail transitions those not shown leads
to the root
53An example of AC Algorithm
i
h
e
s
Got a Match!
h
i
s
Text h e i s h i s
54Reconfigure Hardware (FSM)
- Implement the AC FSM in configurable Logic
Elements (LEs) of FPGA. - Achieve multiple gigabit performance. (Depends on
the FPGA model) - A powerful FPGA is necessary to accommodate
thousands of patterns, so that its not practical
and visible in commercial market.
55FPGA-based pattern matching
56Bloom Filter
- Given a string X, the Bloom filter computes k
hash functions on it producing k hash values
ranging from 1 to m. The same procedure is
repeated for all the members of the pattern set. - The input text is verified by generating k hash
values in the same way. If at least one of these
k bits is found not set then the string is
declared to be impossible to match. - Patterns in Length n are grouped into Bn.
57Bloom Filter (Cont.)
- False positive
- Mim f (0.5)K, while m (k x n) / Ln2
- So, total space, sum(Bi) m x (w - 1)
- if k 1, n 2048, m 3072 bits
- k 1, n 3072, m 4608 bits
- if k 4, f 0.0625
- k 5, f 0.0313
- k 6, f 0.0156
K Hash functions H1, H2, , Hk
58TCAM fundamental
- TCAM stores data with three logic values 0,
1, X (dont care) - Multiple match modes are needed.
59Policy Engine
- Collect the matching events from Pattern-Matching
Engine. - Clarify the relationship between matched
patterns - Ordered A policy may consists more than one
pattern and should be matched in order. - Offset, Depth The matched position should be
within a certain range or location. - Distance, Within The distance between two
matched patterns should be taken into
consideration also. - Trace Application States
- Some applications are difficult to identify by
using only one signature (e.g. P2P). Policy
Engine needs to track the connection state like
the following diagram
Msg Exchange
Data Exchange
Request File
S1
S0
S2
S3
60Fast Pattern Matching Algorithms
- A Pattern Matching Coprocessor for Deep and Large
Signature Set in Network Security System (IEEE
GLOBECOM 2005) - Hierarchical Matching Algorithm (HMA) for
Intrusion Detection Systems (IEEE GLOBECOM2005) - A Time and Memory Efficient String Matching
Algorithm for Intrusion Detection Systems, (IEEE
GLOBECOM 2006) - A non-Computation Intensive Pre-filter for String
Pattern Matching in Network Intrusion Detection
Systems, (IEEE GLOBECOM 2006) - Smart Architecture for High-speed Intrusion
Detection and Prevention Systems, International
Conference on Cryptology and Network Security
(CANS 2006, Acceptance rate lt 18). - A Deterministic Cost-effective String Matching
Algorithm for Network Intrusion Detection
Systems, (IEEE ICC2007). - A Novel Algorithm and Architecture for High Speed
Pattern Matching in Resource-limited Silicon
Solution, (IEEE ICC2007) - Flow Digest A State Synchronization Scheme for
Stateful High Availability, (IEEE ICC2007). - Performing Packet Content Inspection by Longest
Prefix Matching Technology, (IEEE GLOBECOM2007).
61Security SoC
- BroadWeb Security SoC
- ARM922 RISC CPU (250Mhz)
- Hardware NAT (400Mbps)
- Hardware Content Inspection Engine (40Mbps)
- Two 10/100/1000 RJ-45 Ports
- Embedded-Linux
- NSS and ICSA approved IPS signature database
- IPS/Anti-virus functions
- IM/P2P Management
- Turn-key solution (ASIC Software module)
- 1-tier Customers
62 Security SoC (Cont.)
- BroadWeb Security SoC (2nd Generation)
- ARM926EJ RISC CPU (300Mhz)
- Intelligent Hardware NAT (1Gbps)
- Hardware Content Inspection Engine (100Mbps)
- Embedded GbE Smart Switch and 4-port GPHY core
- NSS and ICSA approved IPS Technology
- IPS/Anti-virus functions
- IM/P2P Management
- Turn-key solution (ASICSoftware module)
- 1-tier Customers
63Cisco/Linksys Wireless Security Router
- IEEE 802.11n 108 Mbps EWC Wireless LAN
- IPS protection and IM/P2P management
- Firewall/VPN/Routing
- Gigabit Ethernet x 5
64State Machine Based Technologies
65The FA Example FTP
State Machine Based Technologies
66The FAs of BitTorrent protocols.
67The FAs of Yahoo Messenger protocol.
68We can identify and manage Over 60 Applications
- IM
- MSN, Yahoo Messanger, AIM, QQ, Google Talk, TM,
ICQ, iChat, MIRC, Odigo, Rediff, Gadu-Gadu - Web-IM
- Meebo.com, eBuddy.com, iLoveIM.com, MSN, AIM,
Yahoo, ICQ - P2P
- eDonkey, BitTorrent, Gnutella, Foxy, FastTrack,
Vagaa, Winny, BitComet, DirectConnect, PiGo,
PP365, WInMX, POCO, iMesh, ClubBox - Streaming-Media
- QQLive, Podcast Bar, PPLive, RealPlayer, Window
Media Player, iTunes, WinAMP, Player 365,
QuickTime, FlashMedia Video, TVAnts
- Webmail
- Yahoo, Hotmail, Gmail
- VoIP
- Skype (3.6)
- File Transfer
- FTP, Web File Transfer, Thunder, GetRight,
FlashGet - VPN
- VNN, SpftEther, Hamachi, TinyVPN, PacketiX,
HTTP-Tunnel, Tor, Ping-Tunnel - Terminal Control
- VNC, PCAnywhere
- Online Game
- QQGame, OurGame, Cga.com.cn, QQFO
69Machine Learning Based Technologies
70Application Traffic identification
- Traffic identification(or traffic classification)
issues are focused in recently years since - The introducing of P2P application greatly
impacts the network management task. - Port number is not the best and efficient
discriminator to identify these prevalent
traffics. - How about string matching method? Accurate! But
- It cannot identify the encrypted traffic.
- High cost on manually maintenance work for
protocol signatures. - High cost to match string in very high speed
network. - Privacy issue is under debating.
71How to resolve the problem?
- Heuristics methods(20042005)
- Based on some intrinsically different behavior,
some rule can be constructed. - E.g. dest ip of dest port ? the host is
running P2P. - To differentiate P2P or non-P2P traffic.
- Machine learning based techniques(2004 )
- To construct the statistical signatures for
different categories/application protocols. - Most machine learning techniques are directly
employed to construct traffic signature.
72The Milestone of Researches on Application
Traffic Identification
- Before 2003 String matching and port number.
- 20032005
- Heuristics
- Machine learning method.
- 2006 Machine learning method for real-time
based traffic classification. - First k data packet sizes and direction of TCP
connection. - Stage-based classification(Statistical data in
each stage)
73Different Objects of Application Traffic
Identification
- At different levels
- Category level or QoS class (Bulk data transfer -
FTPP2P, interactive, mail, web, streaming) - Protocol level (Kazza, eMule/eDonkey, Bittorrent,
MSN, FTP, POP3, SMTP, HTTP, Skype, Winny,
Share,.) - Behavior level (FTP control, FTP data, MSN file
transfer, MSN message chatting, MSN voip, Skype
Chatting, Skype voip, Skype File transfer, Skype
Video conference,) - All existing researches focus on classification
in protocol or category level. - Application field
- Offline based traffic trend analysis.
- Online based traffic shaping, traffic
engineering, security management.
74The Classes of Applied Machine Learning Algorithms
- Supervised-Machine learning
- The model of traffic characteristics is
constructed from the training instances with
previously defined class label. - Unsupervised-Machine learning (Clustering)
- The model of traffic characteristics is
constructed from the training instances without
previously defined class label. - However, all the existing training set employed
by both include pre-classified label. - Because each cluster would contain several
different classes/protocols.
75The Discriminators (Attributes)
- The key issues for machine-learning based traffic
identification are - What are the most distinguishable characteristics
(attributes/discriminators)? - How to remove the expensive cost on training?
- Different discriminators
- From L3/L4 layerpacket inter-arrival time, total
packet size, number of packets,,etc. - Combination of L3/L4 attributes with different
perspectives. e.g. upload/download size ratio.
76The Milestone of Researches (Applying Machine
Learning techniques)
- 20032004
- Matthew Roughan, IMC04 Class-of-Service
Mapping for QoS. - 2005
- Sebastian Zander Automated Traffic
Classification. - Andrew W. Moore Using Bayesian Analysis
Techniques. - 2006
- Sebastian Zander Internet Archeology
Estimating Individual Application Trends in
Incomplete Historic Traffic Traces. - Laurent Bernaille Traffic classification on the
fly. (first 5 packets of TCP with k-means
clustering). - Jeffrey Erman Internet Traffic Identification
using Machine Learning (k-means, EM clustering).
77The Milestone of Researches (Applying Machine
Learning techniques)
- 2006 (cont.)
- Laurent Bernaille Early Application
Identification.(first 4 packets of TCP with
k-means, GMM , and HMM clustering) - 2007 Real time based methods
- Zhu Li Accurate Classification of the Internet
Traffic Based on the SVM Method. (TCP and UDP
flow classification) - Laurent Bernaille Early Recognition of
Encrypted Application. (first 3 packets of TCP
with GMM clustering) - Jeffrey Erman Semi-Supervised Network Traffic
Classification. (Stage-based classification)
78Class-of-Service Mapping for QoS A Statistical
Signature-based Approach to IP TrafficACM
SIGCOMM Internet Measurement Conference (IMC '04)
- Matthew Roughan1, Subhabrata Sen2, Oliver
Spatscheck2, Nick Duffield2 - 1School of Mathematical Sciences, University of
Adelaide, Australia - 2ATT Labs Research, Florham Park, NJ, USA
79Introduction
- Before this paper
- Traditional researches tried to find the model
for traditional protocol (FTP, web, mail). - Most researches of traffic characteristics
modeling which focus on P2P and IM are case
studies. - Features
- This paper studied the requirements and proposed
a framework of QoS for traffic which consists of
traditional and novel P2P/IM application in QoS
class level. - Classification is based on utilizing the
statistics of particular applications in order to
form signatures.
80Ideas
- The statistical attributes are aggregated with
respect to Server ports and Server IP addresses,
separately. - Employing machine learning techniques to
construct the mapping from Server port
aggregation/Server IP aggregation to different
QoS classes. - Nearest Neighbor(NN)
- Linear Discriminant Analysis(LDA)
- Then, the port number of aggregation that belongs
to particular QoS class can form one rule. - Disadvantage Applications that require different
QoS might use the same server port number.(e.g.
P2P)
81Nearest Neighbor
- To classify a data point x, lets find the
nearest neighbor! - The points with same property should be closely.
- The class of the nearest neighbor will be
assigned to the data point x. - K- Nearest Neighbor
- To find the k nearest neighbors and let them
vote.
More information http//neural.cs.nthu.edu.tw/ja
ng/books/dcpr/4.2-knnr.asp?title4-2
K-nearest-neighbor Rule
82Linear Discriminant Analysis
- To find the good projection for original
points. - Linear discriminant analysis finds a linear
transformation ("discriminant function") of the
two predictors, X and Y, that yields a new set of
transformed values that provides a more accurate
discrimination than either predictor alone
Transformed Target C1X C2Y
2 features
3 features
More information http//www.dtreg.com/lda.htm ht
tp//neural.cs.nthu.edu.tw/jang/books/dcpr/index.a
sp
83Evaluation Example
- Attributes for this evaluation the average
packet size, flow duration, bytes per flow,
packets per flow, and Root Mean Square (RMS)
packet size.
84Internet Traffic Classification Using Bayesian
Analysis TechniquesACM SIGMETRICS'05
- Andrew W. Moore1, Denis Zuev2
- 1University of Cambridge
- 2University of Oxford
85Introduction
- Features
- Only TCP flows are considered.
- Category-level classification.
- Supervised-machine-learning
- Naïve Bayesian algorithm (?????).
- Uniquely use data that has been hand-classified
(based upon flow content) to one of a number of
categories. - Feature selection was applied to improved the
accuracy.
86Ideas
- Discriminators
- About 248 discriminators of each flow.
- E.g. Packet inter-arrival time (mean, variance, .
. . ), Payload size (mean, variance, . . . ),
Fourier Transform of the packet inter-arrival
time, TTL value, Flow duration, TCP Portetc. - Naïve Bayesian classifier
- For a flow with known statistical attributes,
which class is most likely happened? - To find the maximum probability Pr(Ci X)
- Ci is i-th class
- X is the attributes of flow which will be
classified. - Only about 65 accuracy on flow level was
achieved.
87Ideas(cont.)
- Improvement
- Naïve Bayes Kernel estimation method.
- Kernel estimation was used instead of Gaussian
distribution model assumed by Naïve Bayesian. - Discriminator selection and dimension reduction.
- The accuracy was improved upto 95
- Disadvantages
- All the discriminators are available after the
flow is closed. - Only TCP flows are considered for classification.
- Network management might need more finer classes
(protocol level or behavior level).
88Evaluation for Train and Test sets from traffic
of different time
FCBF Fast Correlation-Based Filter
89Traffic Classification on the FlyACM SIGCOMM
Computer Communication Review Journal, Volume 36
, Issue 2, 200604
- Laurent Bernaille, Renata Teixeira, Ismael
Akodkenou, Augustin Soule, Kave Salamatian - LIP6, Universit e Pierre et Marie Curie,
Thomson Paris Lab - Paris, FRANCE
90Introduction
- Features
- The first paper focused on real-time flow-level
application classification. - To approximately model the L7 protocol
handshaking. - Protocol level classification.
- Unsupervised machine learning.
- K-means clustering. (50 clusters are the best)
- Protocol assignment for each cluster, the
protocol of the largest proportion dominates the
cluster. - Discriminators the first q data packet sizes
(payload) and direction of each TCP connection. - q 5 is the best. (300, -200, 100, 200, -400)
91K-means Clustering
- For given number of clusters k, to iteratively
find k centers of these k clusters and
partition all the points into these k clusters
until the nearest center does not change. - Each data point is expressed as a vector, and
Euclidean distance is the most common distance
computation function.
92Evaluation Result
- Above 80 average accuracy can be achieved.
- Disadvantages
- Only TCP connections are considered.
- Protocol assignment will result in classification
starvation. - The protocols which dont dominate any cluster
will be always classified as other protocol.
93Early Application Identification 200612-ACM
Conf-CONEXT06(International Conference On
Emerging Networking Experiments And Technologies)
- Laurent Bernaille, R. Teixeira and K. Salamatian,
- Universit e Pierre et Marie Curie LIP6, CNRS
- Paris, France
94Introduction
- Features
- Three unsupervised machine learning (clustering)
algorithms were used to evaluate cluster
assignment accuracy and protocol labeling
accuracy. - K-means
- Gaussian Mixture Models (GMM) on an Euclidean
space - Spectral clustering on Hidden Markov Models (HMM,
in order to consider order of packets) - Discriminators size and direction of first P
data packets. - To deal with the starvation problem in each
group, a labeling heuristic method based on
standard server port number (e.g. 25 for SMTP,
110 for POP3) is used to classify protocols in
each cluster group. - Only focus on TCP flows.
- Wireless traffic trace has been included for
evaluation.
95Discriminators
- Discussion about the discriminators
- The size and direction of each packet adds more
information to distinguish applications than
arrival time related metrics. - The range of packet sizes for each application is
similar across traces. - These models can be used to classify the same set
of applications at another network. - P 4 packets for the three clustering methods.
- Clustering number
- Kh 30 for HMM,
- Kk 40 for K-Means and
- Kg 45 for GMM.
96Packet size is a better attribute
97On-line Classification
98 Labeling
set of standard server ports
std(S) FTP, SSH, SMTP, HTTP, POP3, NNTP, HTTPS,
POP3S.
99Labeling Accuracy
100Features
- Pros
- Easy, fast, and simple!
- Payload size and packet direction of first P data
packets. - Unsupervised training ? automatic learning
mechanism. - Cons
- In Jeffrey Erman HP TR is unsuccessful
classifying application types with
variable-length packets in their protocol
handshakes such as Gnutella. Neither of these
studies access the byte accuracy of their
approaches which makes direct comparisons to our
work difficult.
101Features
- Cons
- Only TCP are included for classification.
- According to the description of traces, there are
un-ignorable fraction of flows which contain less
than 4 data packets! - And, the control flow might prevent the
identification system from classifying detailed
protocol behavior. - Classification starvation is still exist for
protocols which dont use standard port.
102Early Recognition of Encrypted Applications20070
405-0406Passive and Active Measurement
Conference (PAM 2007)
- Laurent Bernaille, Renata Teixeira
- Universite Pierre et Marie Curie - LIP6-CNRS
- Paris, France
103Introduction
- Features
- The classification of SSL-encrypted protocols.
- Two stagesSSL detection Protocol
identification. - First 3 packets and 35 clusters for Gaussian
Mixture Model. - Size of original packet
- Most accurate method is to look up the encryption
method in the handshake packets and transform the
size of application packets accordingly. - For the five most common ciphers this method is
overkill because the increase varies from 21 to
33 bytes. - Simple heuristic subtract 21 from the size of
the encrypted packet regardless of the cipher. - Extending the ClusterPort labeling heuristic
- SSL-specific ports 443 for HTTPS, 993 for IMAPS
and 995 for POP3S.
104(No Transcript)
105Accurate Classification of the Internet Traffic
Based on the SVM MethodIEEE ICC 2007
- Zhu Li1, Ruixi Yuan1, and Xiaohong Guan1, 2
- 1Center for Intelligent and Networked Systems
(CFINS) Tsinghua University, Beijing 100084 ,
China - 2SKLMS Lab and MOE Key Lab for Intelligent
Networks and Network Security Xian Jiatong
University, Xian 710049, China
106Introduction
- Features
- Category level classification.
- Supervised-machine learning.
- Support Vector Machine.
- Feature selection (Discriminator selection) is
employed to select the best set of attributes. - Both TCP and UDP are considered.
- Discriminators Statistical data of flows.
- Disadvantages the discriminators are available
after the flow has finished the communication.
107Feature Selection
- Sequential forward selection
- Begin with 0 feature chosen sequentially append
1 feature which can arrive at the best
classification result. - Plus-m-minus-r algorithm
- Begin with 0 feature chosen sequentially append
m features into chosen ones and pop r features
from them (mgtr) each time. - Plus-2-minus-1 was used in this paper.
108Feature Selection (Cont.)
109Accuracy After Feature selection
- For the data sample set with respect to original
proportion in the traffic
110Offline/Realtime Traffic Classification Using
Semi-Supervised Learning20070713-Technique
Report-HPPresented at Performance 2007, 2-5
October 2007, Cologne, Germany, and published in
Performance Evaluation journal(special issue on
Performance 2007 for the Proceedings of IFIP
Performance 2007)
- Jeffrey Erman, Anirban Mahanti, Martin Arlitt,
Ira Cohen, Carey Williamson - Enterprise Systems and Software Laboratory
- HP Laboratories Palo Alto
111Introduction
- Features
- Semi-supervised learning techniques
- Allows classifiers to be designed from training
data that consists of only a few labeled and many
unlabeled flows. - Both high byte accuracy and flow accuracy (i.e.,
gt 90). - To examine traffic over an extended period of
time, to assess the longevity of the classifiers. - Focused on TCP only.
- It would likely be advantageous to have a
separate classier for the non-TCP traffic.(future
work). - Consideration about the elements in training set.
- Elephant vs. Mice Flows
- In order to obtain higher byte accuracy.
112Introduction
- Semi-supervised Learning
- Hypothesis few flows are labeled in each
cluster, we have a reasonable basis for creating
the clusters to application type mapping. - Step1 Clustering K-Means
- Step 2 Mapping from the clusters to the
different known q applications (Y) according to
the fraction of labeled application flows within
the cluster. - The clusters are unlabeled if they have no
labeled flows. - Use the unlabeled clusters to represent new or
unknown applications. - For most experiments, the number of clusters K
400.
113Discriminators
- 11 Discriminators (After feature selection from
25 discriminators) - Total number of packets.
- Average packet size.
- Total bytes.
- Total header (transport plus network layer)
bytes. - Number of caller to callee packets.
- Total caller to callee bytes.
- Total caller to callee payload bytes.
- Total caller to callee header bytes.
- Number of callee to caller Packets.
- Total callee to caller payload bytes.
- Total callee to caller header bytes.
114On-line Classification
- Online classification
- Layered classification system.
- A packet milestone is reached when the count of
the total number of packets a flow (SYN/SYNACK
packets are included) has sent or received
reaches a specific value. - Each layer is an independent model that
classifies ongoing flows into one of the many
class types using the flow statistics available
at the chosen milestone. - Each milestone's classification model is trained
using flows that have reached each specific
packet milestone. - Reclassifying whenever a upper layer is reached
- When a flow is reclassified, any previously
assigned labels are disregarded.
115Byte Accuracy
April 13, 9 am trace
78 of the flows had correct labels after
classification
116Features
- Pros
- Semi-supervised mechanism reduces the cost to
prepare large training data set. - Considering sampling techniques to form the
training set. - Cons
- Only TCP are included.
- Is exponential packet milestone suitable for
real-time classification?
117A High Accurate Machine-Learning Algorithm for
Identifying Application Traffic in Early Stage
- Nen-Fu Huang , Gin-Yuan Jai, and Han-Chieh
Chao1 - Department of Computer Science, National Tsing
Hua University, Taiwan - Department of Electronics, National Ilan
University, Taiwan
118Classification in Early Stage
- To get characteristics of protocol handshaking
for each flow in L7 perspective. - Flow idtuple (sip, sport, dip, dport, protocol)
- Statistical information of each flow at first k
rounds. - Elapsed time, transmitted size, throughput,
response time, inter-arrival time.
119(No Transcript)
120Rule-based Machine Learning
- Rule-based ML (Supervised machine learning)
- Rules generated are suitable for intrinsic
architecture of firewall and IDS/IPS. - Rules generated by ML algorithm provide
information to understand potential
characteristics of application protocols - One Rule, PART, Ripple down, DecisionTable,
ConjunctiveRule, Ripper
121Experiment Architecture
122Accuracy Comparison with Respective to Sample Set
L. Bernaille 2006
123Accuracy Comparison with Respective to Sample
Set(cont.)
Zhu Li ICC2007
124Accuracy After Discriminators Selection
125Conclusions
- Machine learning based techniques to identify the
Network Applications are more and more important. - Focus on real-time based, protocol level
requirement of application traffic
classification. - No existing common traffic traces provided for
comparing the performance in the same base line. - Expensive training is still a problem.
- Identifying encrypted traffic (e.g. Skype, Winny,
Encrypted BT) is a new challenge. - Identifying detailed behaviors of encrypted
traffic is even a big challenge.