Title: Lucent Technologies
1. Learning Sequential Models for Detecting Anomalous Protocol Usage (work in progress)
- Lloyd Greenwald, Lucent Bell Labs
2. Machine Learning Algorithms for Surveillance and Event Detection
- Surveillance
- Network traffic
- Event Detection
- Unknown vulnerability exploits using sequences of messages
- Machine Learning Algorithms
- Learning Markov models to capture recent sequential protocol usage
3. NIDS Monitors Traffic and Detects Events That Violate Security Policy
(from Bro user manual)
4. Example Attack Sequence: NIDS Evasion Attack
- Fake missing packet (to cause buffering)
- Send two interspersed sequences for same connection
- Even with same TTLs, there is ambiguity in how end systems will re-create the sequence
(from Handley et al. 01)
5. Example Attack: Multi-Step
- Apache/mod_ssl worm (aka Slapper)
- Probe/scan target for vulnerability by sending HTTP GET request on TCP port 80 that violates the HTTP 1.1 standard
- Response identifies server as Apache
- Exploit for SSLv2-enabled OpenSSL 0.9.6d vulnerability sent to TCP port 443
- Target sends traffic back to attacker on UDP port 2002
- Target begins scanning for other vulnerable hosts
6. Technical Approach
- Automatically build sequential models of recent protocol usage
- Analyze models for common and uncommon sequences
- Proactively exercise protocol implementation with uncommon sequences sampled from models
- Reactively detect uncommon sequences
- Build new defense policies for NIDS
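The reactive detection step can be sketched as scoring a sequence against a learned bigram model and flagging it when its average log-probability is low. This is only an illustration under stated assumptions: the event names, the add-one smoothing, and the threshold value are hypothetical and are not taken from the talk.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sequence marker

def train_bigram(sequences):
    """Count transitions prev -> cur over all training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = START
        for event in seq:
            counts[prev][event] += 1
            prev = event
    return counts

def avg_log_prob(counts, seq, vocab_size):
    """Average log-probability per event, using add-one smoothing (an assumption)."""
    total, prev = 0.0, START
    for event in seq:
        row = counts.get(prev, Counter())
        p = (row[event] + 1) / (sum(row.values()) + vocab_size)
        total += math.log(p)
        prev = event
    return total / len(seq)

# Hypothetical event sequences standing in for recently observed protocol usage.
training = [["SYN", "SYN-ACK", "ACK", "DATA", "FIN"]] * 20
vocab = {e for s in training for e in s} | {"RST"}
model = train_bigram(training)

common = avg_log_prob(model, ["SYN", "SYN-ACK", "ACK", "DATA", "FIN"], len(vocab))
uncommon = avg_log_prob(model, ["SYN", "DATA", "RST", "DATA", "FIN"], len(vocab))

THRESHOLD = -1.0  # hypothetical cut-off; would need tuning on real traffic
is_uncommon = uncommon < THRESHOLD
```

Averaging per event keeps long sequences from being flagged merely for their length; whether that is the right normalization for protocol traffic is an open design choice.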
7. Prior Work: Machine Learning Algorithms for Automated Test Case Generation
- Surveillance
- Web logs
- Event Detection
- Exercise errors in web applications
- Machine Learning Algorithms
- Learning Markov models to capture recent
sequential web application usage
8. Prior Work: Automated Test Case Generation
- Leverage dynamic user information to automatically generate NEW test cases for web applications.
Session Data
Key contribution 1: sequential statistical models built using machine learning techniques.
Key contribution 2: flexible test case generation exploiting probabilistic sampling methods.
9. Web Application Studied
- Front end: JSP
- Back end: MySQL
- 10K lines of code, 118 methods, 12 classes
- 123 user sessions (sequential application usage extracted from web log)
- Question: Can we build models that can be used to generate new, valid user sessions?
10. Building Markov Models From Web Logs
- Extract User Sessions from Web Log
- 12.3.40.65 GET index.jsp
- 12.3.40.65 GET login.jsp
- 12.3.40.65 GET /apps/bookstore/reg.jsp?member_login=hello&member_password=world&member_password2=world
- 12.3.40.65 GET myinfo.jsp
- Control Model: possible sequences of URLs that are visited
- Data Model: possible sets of parameter values (name-value pairs)
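A minimal sketch of the session-extraction step, assuming the simplified "IP METHOD URL" line format shown above and assuming that a session is simply all requests from one client IP (a real log would likely also need timestamps and timeouts). The log lines and the query-string separators are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

# Hypothetical log lines in the simplified "IP METHOD URL" form.
log_lines = [
    "12.3.40.65 GET index.jsp",
    "12.3.40.65 GET login.jsp",
    "12.3.40.65 GET /apps/bookstore/reg.jsp?member_login=hello&member_password=world",
    "12.3.40.65 GET myinfo.jsp",
    "98.7.6.5 GET index.jsp",
]

def extract_sessions(lines):
    """Group requests by client IP, keeping request order.

    Each request becomes (page, params): the page sequence feeds the
    control model and the name-value pairs feed the data model.
    """
    sessions = defaultdict(list)
    for line in lines:
        ip, _method, url = line.split(maxsplit=2)
        parts = urlsplit(url)
        page = parts.path.rsplit("/", 1)[-1]   # e.g. "reg.jsp"
        params = dict(parse_qsl(parts.query))  # name-value pairs
        sessions[ip].append((page, params))
    return dict(sessions)

sessions = extract_sessions(log_lines)
```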
11. Control Models
- unigram: Probability of a user visiting a given page, independent of previous page
- P(currentPage = X)
[Figure: unigram model over the pages default, search, book detail, register; probabilities shown: 0.10, 0.20, 0.65, 0.05]
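A unigram control model can be estimated as relative page frequencies. The sketch below uses made-up sessions, not data from the study, and the page names are only borrowed from the figure labels.

```python
from collections import Counter

# Hypothetical user sessions (sequences of page names).
sessions = [
    ["default", "search", "bookDetail"],
    ["default", "register"],
    ["default", "search", "search", "bookDetail"],
]

def unigram_model(sessions):
    """Estimate P(currentPage = X) as the relative frequency of each page."""
    counts = Counter(page for session in sessions for page in session)
    total = sum(counts.values())
    return {page: n / total for page, n in counts.items()}

unigram = unigram_model(sessions)
```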
12. Control Models
- bigram: Conditional probability of a user visiting a page, given the previous page
- P(currentPage = X | lastPage = Y)
[Figure: bigram model over the pages default, search, book detail, register; one transition probability shown: 0.45]
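For the bigram case, the conditional probabilities can be estimated by counting page-to-page transitions and normalizing each row. The sessions and the start marker below are hypothetical.

```python
from collections import Counter, defaultdict

START = "<start>"  # hypothetical marker for the beginning of a session

# Hypothetical user sessions (sequences of page names).
sessions = [
    ["default", "search", "bookDetail"],
    ["default", "register"],
    ["default", "search", "search", "bookDetail"],
]

def bigram_model(sessions):
    """Estimate P(currentPage = X | lastPage = Y) from transition counts."""
    counts = defaultdict(Counter)
    for session in sessions:
        last = START
        for page in session:
            counts[last][page] += 1
            last = page
    return {
        last: {page: n / sum(row.values()) for page, n in row.items()}
        for last, row in counts.items()
    }

bigram = bigram_model(sessions)
```

Each row of `bigram` is a probability distribution over the next page, so the same table can be used both for scoring observed sessions and for sampling new ones.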
13. Control Models
- trigram: Conditional probability of a user visiting a page, given the previous two pages
- P(currentPage = X | lastPage1 = Y1, lastPage2 = Y2)
[Figure: trigram model over the pages default, search, book detail, register; transition probabilities shown: 0.30, 0.05, 0.10, 0.55]
14. Reliability vs. Discrimination
- Greater discrimination (more context): moving from unigram to bigram to trigram
- Greater reliability (more training data): moving from trigram to bigram to unigram
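The trade-off can be made concrete by counting, for each model order, how many distinct histories (contexts) appear and how many observations each one gets on average. The sessions below are hypothetical; the point is only that longer histories split the same data across more contexts.

```python
from collections import Counter

START = "<start>"  # hypothetical padding symbol for the start of a session

# Hypothetical user sessions (sequences of page names).
sessions = [
    ["default", "search", "bookDetail"],
    ["default", "register"],
    ["default", "search", "search", "bookDetail"],
]

def context_counts(sessions, order):
    """Count how often each history of length order-1 is observed."""
    contexts = Counter()
    for session in sessions:
        padded = [START] * (order - 1) + session
        for i in range(order - 1, len(padded)):
            contexts[tuple(padded[i - order + 1:i])] += 1
    return contexts

summary = {}
for order, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    c = context_counts(sessions, order)
    # (number of distinct contexts, average observations per context)
    summary[name] = (len(c), sum(c.values()) / len(c))
    print(name, summary[name])
```

On this toy data the unigram model has a single context with all nine observations, while the trigram model spreads the same nine observations over four contexts.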
15. Data Models
- simple: P(values = X | currentPage = Y)
- Example with an important parameter: Books.do?category=3, then BookDetail.do?category=3&itemId=8
- advanced: P(values = X | lastPageImportantParams = Y1, currentPage = Y2)
16. Simple Data Model
Page1: http://decide.cs/bookstore/BookDetail.do?itemId=18
Page2: http://decide.cs/bookstore/AddOrder.do?quantity=99&itemId=36
17. Advanced Data Model
Page1: http://decide.cs/bookstore/BookDetail.do?itemId=18
Page2: http://decide.cs/bookstore/AddOrder.do?quantity=1&itemId=18
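One way to read the two examples: the simple data model conditions parameter values only on the current page, so a generated itemId need not match the item just viewed, while the advanced model also conditions on an important parameter of the previous page. The sketch below illustrates that reading with made-up observations; the tuple layout and values are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical observations:
# (itemId seen on the last page, current page, itemId submitted on the current page)
observations = [
    ("18", "AddOrder.do", "18"),
    ("18", "AddOrder.do", "18"),
    ("36", "AddOrder.do", "36"),
    ("36", "AddOrder.do", "36"),
]

def normalize(counter):
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {value: n / total for value, n in counter.items()}

simple = defaultdict(Counter)    # keyed by current page only
advanced = defaultdict(Counter)  # keyed by (last page's important param, current page)
for last_item, page, item in observations:
    simple[page][item] += 1
    advanced[(last_item, page)][item] += 1

# Simple model: either itemId may be sampled for AddOrder.do.
simple_dist = normalize(simple["AddOrder.do"])
# Advanced model: after viewing item 18, only itemId 18 is sampled.
advanced_dist = normalize(advanced[("18", "AddOrder.do")])
```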
18. Generating Test Cases by Combining Control and Data Models
- Generate arbitrary queries about user sessions and use these queries to build test cases
- What are the k most likely user sessions?
- What are the k least likely user sessions?
- Generate k user sessions randomly, according to the distribution represented in a web log.
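Two of these queries can be sketched against a bigram control model: random generation by walking the chain, and the k most likely sessions by best-first search over session prefixes. The model probabilities, page names, and start/end markers are hypothetical, and the search is one possible technique rather than the method used in the prior work.

```python
import heapq
import math
import random

START = "<start>"  # hypothetical start-of-session marker
END = "<end>"      # hypothetical end-of-session marker

# Hypothetical bigram control model: model[last][next] = probability.
model = {
    START: {"default": 1.0},
    "default": {"search": 0.7, "register": 0.3},
    "search": {"bookDetail": 0.6, END: 0.4},
    "register": {END: 1.0},
    "bookDetail": {END: 1.0},
}

def sample_session(model, rng):
    """Draw one session by walking the chain until END is reached."""
    session, last = [], START
    while True:
        pages, probs = zip(*model[last].items())
        last = rng.choices(pages, weights=probs)[0]
        if last == END:
            return session
        session.append(last)

def k_most_likely(model, k):
    """Best-first search over prefixes, ordered by negative log-probability."""
    heap = [(0.0, [START])]
    results = []
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        if path[-1] == END:
            results.append((math.exp(-cost), path[1:-1]))
            continue
        for page, p in model[path[-1]].items():
            heapq.heappush(heap, (cost - math.log(p), path + [page]))
    return results

top2 = k_most_likely(model, 2)
random_session = sample_session(model, random.Random(0))
```

Because every step adds a non-negative cost, complete sessions come off the heap in order of decreasing probability. The k least likely sessions need an extra bound on session length whenever the model contains loops, since longer and longer sessions can have arbitrarily small probability.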
19. Can our models be used to generate valid user sessions?
20. Network Protocol Modeling Challenges
- Using live network data instead of logs
- Access to reconstructed traffic in both directions
- Can build models using data from multiple machines (instead of web log from single server)
- What are we generating?
- Sequences of packets
- Sequence of high-level events that can be turned into packets
- What is a user session?
- Single connection
- Cluster connections from subset of 5-tuple (srcIP, dstIP, srcPort, dstPort, Protocol)
- What are control and data models?
- Can we generate valid new sequences?
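The two candidate definitions of a session can be sketched as different grouping keys over the same packets: the full 5-tuple gives one session per connection, while a subset (here the unordered host pair) clusters related connections. The packet records and event labels are hypothetical, loosely following the multi-step example's ports.

```python
from collections import defaultdict, namedtuple

Packet = namedtuple("Packet", "srcIP dstIP srcPort dstPort protocol event")

# Hypothetical packets with made-up addresses and event labels.
packets = [
    Packet("10.0.0.1", "10.0.0.9", 40001, 80, "tcp", "HTTP GET probe"),
    Packet("10.0.0.1", "10.0.0.9", 40002, 443, "tcp", "SSLv2 exploit"),
    Packet("10.0.0.9", "10.0.0.1", 2002, 2002, "udp", "UDP callback"),
    Packet("10.0.0.2", "10.0.0.9", 40003, 80, "tcp", "HTTP GET probe"),
]

def group_sessions(packets, key):
    """Group packet events into sessions using the given key function."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[key(pkt)].append(pkt.event)
    return dict(sessions)

# Session = single connection (full 5-tuple).
per_connection = group_sessions(
    packets, lambda p: (p.srcIP, p.dstIP, p.srcPort, p.dstPort, p.protocol)
)
# Session = all traffic between two hosts, in either direction.
per_host_pair = group_sessions(
    packets, lambda p: tuple(sorted((p.srcIP, p.dstIP)))
)
```

With per-connection sessions the three steps between the first two hosts fall into three separate one-event sequences; grouping by host pair keeps them together as one sequence.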
21. Building Sequential Model to Discover NIDS Evasion Attack
- Control model: sequence numbers
- Data model: TTLs and payload
- How hard is it to discover that this pattern is uncommon?
(from Handley et al. 01)
22. Discussion
- Are Markov models sufficient for this task? Too propositional?
- Are data models too sparse? Are state spaces too large?
- How hard is anomaly detection in this framework? What is a good definition for uncommon traffic that doesn't produce many false positives or false negatives? What about emerging new usage patterns? How to avoid training attacks?
- How much protocol knowledge to use in building models?
- Can signature matching events be used in data model?
- Besides generating sequences, what other analyses can we perform? Entropy of models to determine level of history-dependence in traffic?
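One possible form of the entropy analysis: compare the entropy of the next event with no history against its entropy given the previous event. A large drop suggests the traffic is strongly history-dependent. The event sequences below are hypothetical and the functions are an illustrative sketch, not an analysis from the talk.

```python
import math
from collections import Counter, defaultdict

# Hypothetical event sequences with a strictly alternating pattern.
sessions = [
    ["A", "B", "A", "B", "A", "B"],
    ["A", "B", "A", "B"],
]

def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counter.values())
    return -sum(n / total * math.log2(n / total) for n in counter.values())

def unigram_entropy(sessions):
    """H(current): entropy of events ignoring history."""
    return entropy(Counter(e for s in sessions for e in s))

def conditional_entropy(sessions):
    """H(current | previous), weighting each context by how often it occurs."""
    rows = defaultdict(Counter)
    for s in sessions:
        for prev, cur in zip(s, s[1:]):
            rows[prev][cur] += 1
    total = sum(sum(r.values()) for r in rows.values())
    return sum(sum(r.values()) / total * entropy(r) for r in rows.values())

h0 = unigram_entropy(sessions)      # 1 bit: A and B are equally frequent
h1 = conditional_entropy(sessions)  # 0 bits: the previous event determines the next
```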
23. Related Work
- Host-based and Network-based Intrusion Detection Systems (NIDS)
- Signature-based anomaly detection: manual analysis
- Packet-based or with context: detect known vulnerabilities and behaviors
- Formal verification of protocols: requires extensive protocol knowledge; does not account for implementation variations
- Scrubbers and Normalizers: remove TCP/IP ambiguities; do not account for application-layer ambiguities and must make tradeoffs concerning removing ambiguities that change semantics or lead to performance loss
- Fuzzing/Fault-injection: random generation of inputs for vulnerability detection; generates invalid sequences