Title: RAMSES Regeneration And iMmunity SErviceS: A Cognitive Immune System
1 RAMSES (Regeneration And iMmunity SErviceS): A Cognitive Immune System
Self Regenerative Systems 18 December 2007
- Mark Cornwell
- James Just
- Nathan Li
- Robert Schrag
- Global InfoTek, Inc
R. Sekar, Stony Brook University
2 Outline
- Overview
- Efficient content-based taint identification
- Syntax and taint-aware policies
- Memory attack detection and response
- Testing
- Red Team suggestions
- Questions
- Demo
3 RAMSES Attack Context
- Attack target: program mediating access to protected resources/services
- Attack approach: use maliciously crafted input to exert unintended control over protected resource operations
- Resource or service uses
  - Well-defined APIs to access
    - OS resources
    - Command interpreters
    - Database servers
    - Transaction servers, ...
  - Internal interfaces
    - Data structures and functions within program
    - Used by program components to talk to each other
4 Example 1: SquirrelMail Command Injection
Input interface: sendto=nobody; rm -rf *
Program:
$send_to_list = $_GET['sendto'];
$command = "gpg -r $send_to_list 2>&1";
popen($command);
Output interface: $command = "gpg -r nobody; rm -rf * 2>&1"
Attack: removes all removable files in the web server document tree
5 Example 2: phpBB SQL Injection
Input interface: topic=-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3
Program:
$topic_id = $_GET['topic'];
$sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id";
sql_query($sql);
Output interface: $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
Attack: steal another user's password
6 Attack Space of Interest (CVE 2006)
Generalized Injection Attacks
7 Detection Approach
- Attack: use maliciously crafted input to exert unintended control over output operations
- Detect exertion of control
  - Based on taint: degree to which output depends on input
- Detect if control is intended
  - Requires policies (or training)
  - Application-independent policies are preferable
8 RAMSES Goals and Approach
- Taint analysis: develop efficient and non-invasive alternatives
  - Analyze observed inputs and outputs
  - Needs no modifications to program
  - Language-neutral
  - Leverage learning to speed up analysis
- Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs
  - Structure-aware policies: leverage interplay between taint and structural changes to output requests
  - Use Address-Space Randomization (ASR) for memory corruption
  - ASR: efficient, in-band, positive tainting for pointer-valued data
- Immunization: filter out future attack instances
  - Output filters: drop output requests that violate taint-based policies
  - Input filters: project policies on outputs to those on inputs
  - Relies on learning relationships between input and output fields
  - Network-deployable
9 Efficient Content-Based Taint Identification
10 Steps
- Develop efficient algorithms for inferring flow of input data into outputs
  - Compare input and output values
  - Allow for parts of input to flow into parts of output
  - Tolerate some changes to input
    - Changes such as space removal, quoting, escaping, and case-folding are common in string-based interfaces
  - Based on approximate substring matching
- Leverage learning to speed up taint inference
  - Even the efficient content-matching algorithms are too expensive to run on every input/output
  - Same learning techniques can be used for detecting attacks using anomaly detection
11 Weighted Substring Edit Distance Algorithm
- Maintain a matrix $D[i][j]$ of minimum edit distance between $p_{1..i}$ and $s_{1..j}$
- $D[i][j] = \min\{\, D[i-1][j-1] + \mathrm{SubstCost}(p_i, s_j),\; D[i-1][j] + \mathrm{DeleteCost}(p_i),\; D[i][j-1] + \mathrm{InsertCost}(s_j) \,\}$
- $D[0][j] = 0$ (no cost for omitting any prefix of $s$)
- $D[i][0] = \mathrm{DeleteCost}(p_1) + \cdots + \mathrm{DeleteCost}(p_i)$
- Matches can be reconstructed from the $D$ matrix
- Quadratic time and space complexity
  - Uses $O(|p| \cdot |s|)$ memory and time
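The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name, the unit default costs, and the extra `start` matrix (used here in place of reconstructing the match by backtracking through $D$) are assumptions made for the example.

```python
def substring_edit_distance(p, s, subst_cost=None, delete_cost=None, insert_cost=None):
    """Return (cost, start, end) for the substring s[start:end] closest to p."""
    # Default to unit costs; callers may pass weighted cost functions instead.
    subst = subst_cost or (lambda a, b: 0 if a == b else 1)
    delete = delete_cost or (lambda a: 1)
    insert = insert_cost or (lambda b: 1)
    n, m = len(p), len(s)
    # D[i][j]: minimum cost of editing p[:i] into some substring of s ending at j.
    # Row 0 stays all zeros: omitting any prefix of s is free.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in s where that best substring begins.
    start = [list(range(m + 1)) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delete(p[i - 1])
        start[i][0] = 0
        for j in range(1, m + 1):
            options = [
                (D[i - 1][j - 1] + subst(p[i - 1], s[j - 1]), start[i - 1][j - 1]),
                (D[i - 1][j] + delete(p[i - 1]), start[i - 1][j]),
                (D[i][j - 1] + insert(s[j - 1]), start[i][j - 1]),
            ]
            D[i][j], start[i][j] = min(options, key=lambda o: o[0])
    # The best match may end anywhere in s, so scan the last row.
    end = min(range(m + 1), key=lambda j: D[n][j])
    return D[n][end], start[n][end], end
```

For instance, matching the input `nobody; rm -rf` against the output `gpg -r nobody; rm -rf 2>&1` yields cost 0 for the substring at offsets 7 to 21, which is the region that would be marked tainted. Passing a `subst_cost` that charges nothing for case changes, or a `delete_cost` that charges nothing for spaces, is one way to tolerate the transformations listed on the previous slide.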
12 Improving performance
- Quadratic complexity algorithms can be too expensive for large $s$, e.g., HTML outputs
  - Storage requirements are even more problematic
- Solution: use a linear-time coarse filtering algorithm
  - Approximate $D$ by $FD$, defined on substrings of $s$ of length $|p|$
  - Let $P$ (and $S$) denote the multiset of characters in $p$ (resp., $s$)
  - $FD(p, s) = \min(|P - S|, |S - P|)$
  - Slide a window of size $|p|$ over $s$, compute $FD$ incrementally
  - Prove: $D(p, r) < t \Rightarrow FD(p, r) < t$ for all substrings $r$ of $s$
  - Result: $O(|p|^2)$ space and time complexity in practice
- Implementation results
  - Typically 30x improvement in speed
  - 200x to 1000x reduction in space
- Preliminary performance measurements: 40MB/sec
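The sliding-window filter can be sketched as below. The function name and threshold parameter `t` are illustrative assumptions. Because every window has length $|p|$, $|P - R|$ and $|R - P|$ are equal, so the sketch tracks only $|P - R|$. With unit edit costs each edit changes the multiset difference by at most one, so a window that fails this test cannot contain a match with $D < t$; only surviving windows need the quadratic algorithm.

```python
from collections import Counter

def coarse_filter(p, s, t):
    """Return start offsets of the length-|p| windows r of s with FD(p, r) < t."""
    n = len(p)
    if n == 0 or len(s) < n:
        return []
    need = Counter(p)        # multiset P
    window = Counter(s[:n])  # multiset R of the current window
    # fd = |P - R|: characters of p that have no counterpart in the window.
    fd = sum((need - window).values())
    hits = [0] if fd < t else []
    for i in range(1, len(s) - n + 1):
        out_ch, in_ch = s[i - 1], s[i + n - 1]
        # Dropping out_ch creates a new deficit unless the window had a surplus of it.
        if window[out_ch] <= need[out_ch]:
            fd += 1
        window[out_ch] -= 1
        # Adding in_ch repairs a deficit only if the window was short of it.
        if window[in_ch] < need[in_ch]:
            fd -= 1
        window[in_ch] += 1
        if fd < t:
            hits.append(i)
    return hits
```

Each step updates two counters, so the pass over $s$ is linear in $|s|$ and needs memory proportional to the alphabet actually seen rather than to $|p| \cdot |s|$.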
13 Efficient online operation
- Weighted edit-distance algorithms are still too expensive if applied to every input/output
  - Need to run for every input parameter and output
- Key idea
  - Use learning to construct a classifier for outputs
  - Each class consists of similarly tainted outputs
  - Taint identified quickly, once the class is known
- Classifying strings is difficult
  - Our technique operates on parse trees of output
  - For ease of development, generality, and tolerance to syntax errors, we use a rough parser
  - Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions
14 Decision Tree Construction
- Examines the nodes of syntax tree in some order
  - The order of examination is a function of the set of syntax trees
- Chooses nodes that are present in all candidate syntax trees
- Avoids tests on tainted data, as they can vary
- Avoids tests that don't provide significant degree of discrimination
  - Similar-valued fields will be collected together and generalized, instead of storing individual values
- Incorporates a notion of suitability for each field or subtree in the syntax tree
  - Takes into account approximations made in parsing
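The idea of classifying outputs so that taint can be looked up rather than recomputed can be illustrated with a much simpler stand-in. The sketch below deliberately swaps the decision tree over parse-tree nodes for a flat dictionary keyed on a rough token skeleton; the tokenizer regex, the `<str>`/`<num>` placeholders, and the class and method names are all assumptions for illustration. It does mirror two points from the slide: literal values (which are likely to be tainted and to vary) are not tested, and similar-valued fields are generalized rather than stored individually.

```python
import re

# Rough tokenizer: quoted strings, numbers, (optionally dotted) words, single symbols.
TOKEN = re.compile(r"'[^']*'|\d+|\w+(?:\.\w+)?|\S")

def skeleton(sql):
    """Collapse literals to placeholders so outputs differing only in data share a class."""
    out = []
    for tok in TOKEN.findall(sql):
        if tok.startswith("'"):
            out.append("<str>")
        elif tok.isdigit():
            out.append("<num>")
        else:
            out.append(tok.upper())
    return tuple(out)

class TaintClassifier:
    def __init__(self):
        # skeleton -> set of token positions that were observed to be tainted
        self.classes = {}

    def learn(self, sql, tainted_positions):
        """Record the taint found by the (expensive) matching step for this class."""
        self.classes.setdefault(skeleton(sql), set()).update(tainted_positions)

    def classify(self, sql):
        """Return learned tainted positions, or None if the class is unseen."""
        return self.classes.get(skeleton(sql))
```

After learning that token 7 of `SELECT * FROM phpbb_themes WHERE themes_id=1` is tainted, a later query with `themes_id=42` falls into the same class and the taint position is returned without running edit distance. An unseen skeleton returns `None`, which would signal a fall back to full matching.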
15 Example of a Decision Tree
- 1. SELECT * FROM phpbb_config
- 2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='a3523d78160efdafe63d8db1ce5cb0ba' AND u.user_id=s.session_user_id
- 3. SELECT * FROM phpbb_themes WHERE themes_id=1
- 4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
- 5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order
- switch (1)
  - case ROOT: switch (1.1)
    - case CMD: switch (1.1.2)
      - case c: FINAL @1.1.1 SELECT @1.1.3 .cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
      - case u: FINAL @1.1.1 SELECT @1.1.3 .*,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='a3523d78160efdafe63d8db1ce5cb0ba' AND u.user_id=s.session_user_id
      - case *: FINAL @1.1.1 SELECT @1.1.3 FROM phpbb_??????
16 Implementation Status and Next Steps
- Rough parsers implemented for
  - HTML/XML
  - Shell-like languages (including Perl/PHP)
  - SQL
- Preliminary performance measurements
  - Construction of decision trees: 3MB/sec
  - Classification only: 15MB/sec
  - Significant improvements expected with some performance tuning
- Next steps
  - Develop better clustering/classification algorithms based on tree edit-distance
  - Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees
17 Syntax and taint-aware policies
18 Overview of Policies
- Leverage structure+taint to simplify/generalize policy
- Policy structure mirrors that of parse trees
  - And-Or trees with cycles
  - Can specify constraints on values (using regular expressions) and taint associated with a parse tree node
- Most attacks detected using one basic policy
  - Controlling commands vs command parameters
  - Controlling pointers vs data
19 Controlling commands vs parameters
- Observation: parameters don't alter syntactic structure of victim's requests
- Policy: structure of parse tree for victim's request should not be controlled by untrusted input (tainted data)
- Alternate formulation: tainted data shouldn't span multiple fields or tokens in victim's request
20 Policy prohibiting structure changes
- Define structure change without using a reference
  - Avoids need for training and associated FP issues
- Policy 1
  - Tainted data cannot span multiple nodes
  - For binary data, it should not span multiple fields
- Policy 2
  - Tainted data cannot straddle multiple subtrees
  - Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted
  - Tainted data overflowed beyond the end of one subtree and resulted in a second subtree
- Both policies can be further refined to constrain the node types and children subtrees of the nodes
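A token-level approximation of Policy 1 can be sketched as follows. The regex tokenizer stands in for the rough parser, and `output.index(user_input)` stands in for taint inference; both, along with all function names, are simplifying assumptions for illustration only.

```python
import re

# Rough SQL tokenizer (assumption): quoted strings, signed integers, words, single symbols.
TOKEN = re.compile(r"'(?:[^'\\]|\\.)*'|-?\d+|\w+|\S")

def tokens_with_spans(text):
    """Return (token, start, end) triples for every token in text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]

def violates_policy1(output, taint_start, taint_end):
    """True if the tainted region [taint_start, taint_end) overlaps more than one token."""
    touched = [tok for tok, a, b in tokens_with_spans(output)
               if a < taint_end and taint_start < b]
    return len(touched) > 1

def check(template, user_input):
    """Build the output request and test it; exact substring search replaces taint inference."""
    output = template.format(user_input)
    start = output.index(user_input)
    return violates_policy1(output, start, start + len(user_input))
```

With the template `SELECT p.post_id FROM posts p WHERE p.topic_id = {}`, the input `42` stays inside a single token and passes, while the `-1 UNION SELECT ...` input from Example 2 spreads over many tokens and is flagged. No reference query or training is needed, which is the point made in the first bullet of this slide.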
21 Commands vs parameters: Example 2
- Memory corruption attack overflowing stack buffer
- For binary data, we talk about message fields rather than parse trees
- Violation: tainted data spans multiple stack fields
- Heap overflows involve tainted data spanning across multiple heap blocks
22 Attacks Detected by "No structure change" Policy
- Various forms of script or command injection
- SQL injection
- XPath injection
- Format string attacks
- HTTP response splitting
- Log injection
- Stack overflow and heap overflow
23 Application-specific policies
- Not all attacks have the flavor of command injection
- Develop application-specific policies to detect such attacks
  - Policy 3: Cross-site scripting: no tainted scripts in HTML data
  - Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree
- Other examples
  - Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
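Policy 4 can be illustrated with a small lexical check. The function name and the choice of POSIX path semantics are assumptions; the check is purely textual and does not resolve symbolic links, so it is a sketch of the policy's intent rather than a complete defense.

```python
import posixpath

def violates_policy4(doc_root, tainted_name):
    """True if a tainted file name resolves (lexically) outside doc_root.

    doc_root is assumed to be an absolute POSIX path.
    """
    root = posixpath.normpath(doc_root)
    # join() discards root when tainted_name is absolute, which is the desired behavior here.
    target = posixpath.normpath(posixpath.join(root, tainted_name))
    # commonpath compares whole path components, so /var/www/docs-evil does not count as inside.
    return posixpath.commonpath([root, target]) != root
```

For a document root of `/var/www/docs`, the name `img/logo.png` is accepted, while `../../etc/passwd` and `/etc/passwd` are both flagged.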
24 Implementation status
- Four test applications
  - phpBB
  - SquirrelMail
  - PHP/XMLRPC
  - WebGoat (J2EE)
- Detects following attacks without FPs
  - Command injection (Policies 1, 2, 5)
  - SQL injection (1, 2, 5)
  - XSS (3)
  - HTTP Response splitting (2)
  - Path traversal (4)
- Memory corruption detected using ASR
- Should be able to detect many other attacks easily
  - XPath injection (1, 2), format-string (1, 2), log injection (1, 2)
25 Memory Attack Discussion
26 Memory Error Based Remote Attack
- Attacker's goal
  - Overwrite target of interest to take over instruction execution
- Attacker's approach
  - Propagate attacker-controlled input to target of interest
  - Violate certain structural constraints in the propagation process
27 Stack Frame Structural Violation
[Figure: stack layout from high to low addresses showing the stack frames of A, B, and C. Each frame holds function arguments, return address, previous stack frame pointer, and local variables; Exception Registration Records also appear on the stack. EBP, ESP, and FS:[0] point into C's frame, with FS:[0] at its Exception Registration Record.]
28 Heap Block Structural Violation
[Figure: Windows free heap block header structure with fields Size, Previous Size, Segment Index, Flags, Unused, Tag Index, FLink, BLink]
- Happens when removing free block from doubly linked list
- Ability to write 4 bytes into any address, usually a well-known address such as a function pointer, return address, SEH, etc.
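Why unlinking a corrupted free block gives a 4-byte write to an arbitrary address can be shown with a toy model. Memory is simulated as a dictionary from address to 32-bit value, the header is reduced to FLink at offset 0 and BLink at offset 4, and all addresses are hypothetical values chosen for the example.

```python
def unlink(memory, block):
    """Remove a free block from a doubly linked list with no sanity checks.

    memory maps address -> 32-bit value; FLink sits at block+0 and BLink at
    block+4 (a simplified layout for illustration).
    """
    flink = memory[block]
    blink = memory[block + 4]
    memory[blink] = flink      # BLink->FLink = FLink: the attacker-chosen write
    memory[flink + 4] = blink  # FLink->BLink = BLink: side-effect write

TARGET = 0x0012FFB0     # hypothetical address holding a function pointer
SHELLCODE = 0x00B0F000  # hypothetical address of attacker-supplied code
BLOCK = 0x00A00000      # hypothetical free block whose header was overflowed

# The overflow replaced FLink with the value to write and BLink with the address to write.
memory = {TARGET: 0x00401000, BLOCK: SHELLCODE, BLOCK + 4: TARGET}
unlink(memory, BLOCK)
```

After the call, `memory[TARGET]` holds `SHELLCODE`. If ASR has moved the real target, the address stored in BLink is likely no longer valid and the first write faults, which is the crash that the subsequent analysis slides rely on.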
29 ASLR and Crash Analysis
- ASLR randomizes the addresses of targets of interest
  - Memory attack using the original address will miss and cause a crash (exception)
- Crash analysis tracks back to the vulnerability, which enables accurate signature generation
  - Structural information usually retrievable at runtime, thanks to enhanced debugging technology
- Crash analysis aided with JIT (Just-In-Time Tracing)
  - JIT triggered at certain events
    - Suspicious network inputs, e.g. sensitive JMP address
  - Attach/detach JIT monitor at event of interest
  - Memory can be dumped at the right granularity, with log info from a few KB to 2GB
30 Crash Root Cause Analysis
[Figure: root cause analysis flow]
- Inputs: exception record/context; faulting thread, instructions, and registers; stack trace, heap, modules, and symbols
- Stack corruption
  - Read access violation: bad EIP (corrupted return address or SEH)
  - Read access violation: bad dereference (corrupted local variables/passed parameters)
- Heap corruption
  - Write access violation (address to write, value to write)
31 Stack-based Overflow Analysis
- Target-driven analysis
  - The goal of the attack string is to overwrite a target of interest on the stack, e.g., return address, SEH handler
  - Start matching target values from crash dump to input, like EIP, EBP and SEH handler
  - More efficient than pattern matching in the whole address space
- If any targets are matched in input, expand in both directions to find the LCS
  - A match usually indicates the input size needed to overflow certain targets
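The target-driven matching step might look like the sketch below. It assumes 32-bit little-endian target values and uses a greedy contiguous extension around the seed match as a simplification of the LCS step; the function names and parameters are illustrative.

```python
import struct

def find_target_in_input(target_value, data):
    """Return offsets where a 32-bit little-endian target value occurs in the input."""
    needle = struct.pack("<I", target_value)
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits

def expand_match(data, d_off, dump, m_off, length):
    """Grow a seed match of `length` bytes in both directions while bytes agree.

    d_off is the seed offset in the input, m_off the seed offset in the dump.
    Returns the (start, end) offsets of the matched region within the input.
    """
    lo = 0
    while (d_off - lo > 0 and m_off - lo > 0
           and data[d_off - lo - 1] == dump[m_off - lo - 1]):
        lo += 1
    hi = length
    while (d_off + hi < len(data) and m_off + hi < len(dump)
           and data[d_off + hi] == dump[m_off + hi]):
        hi += 1
    return d_off - lo, d_off + hi
```

Searching only for the handful of target values recovered from the crash (EIP, EBP, SEH handler) keeps the search small, and the offset of the hit within the input gives the number of bytes that precede the overwritten target.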
32 SEH Overflow and Analysis
- A unique approach for Windows exploits
  - SEH stands for Structured Exception Handler
  - Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in the record
- More reliable and powerful than overwriting the return address
  - More JMP addresses to use (pop/pop/ret)
  - An exception (accidental/intentional) is desired
  - Can bypass /GS buffer check
- SEH crash analysis
  - Catch the first exception as well as the second one (caused by ASR)
  - Locate the SEH chain head from first dump, usually overwritten by input
  - Usually first exception is enough; second exception can be used for confirmation
33 Heap Overflow Analysis
- How to analyze a heap overflow attack?
  - Exploit happens in free block's unlink
  - Multiple ways to trigger
- Write access violation with ASR
  - With overwriting at an invalid address
- Overwrite 4-byte value at arbitrary address
  - Targets of interest include return address, SEH, PEB and UEF
- Exploit contains the pair (Address To Write, Value To Write)
  - Appears in the overflowed heap blocks
  - Usually contained in registers
  - Should be provided from input by attacker
- Match found in synthetic heap exploits
- The value pairs need to be at a fixed offset
  - For a given heap overflow vulnerability
  - To enable overwriting the right address with the right value
34 Case Studies
35 Case Study: RPC DCOM
- Step 1: Exception Analysis
  - FAULTING_IP: 18759f
  - ExceptionCode: c0000005 (Access violation)
  - Attempt to read from address 0018759f
  - PROCESS_NAME: svchost.exe
  - FAULTING_THREAD: 00000290
  - PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION
- Step 2: Target Input correlation
  - StackBase: 0x6c0000, StackLimit: 0x6bc000, Size: 0x4000
  - Begin analyze on Target Overwrite and Input Correlation
  - Analyze crash EIP
  - Find EIP pattern at socket input
  - Bytes size to overwrite EIP: 128
  - Analyze crash EIP done!
  - Analyze SEH
  - Find SEH byte at socket input
  - Bytes size to overwrite SEH handler: 1588
  - Analyze SEH done!
36 Signature Generation
- Signature generation
  - Signature captures the vulnerability characteristics
  - Minimum size to overwrite certain target(s)
- Use contexts to reduce false positives
  - Using incoming input calling stack
    - Stack offset can uniquely identify the context
  - Using incoming input semantic context
    - Message format like HTTP URL/parameter
    - Binary message field
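A length-plus-context signature of the kind described above could be represented as in this sketch. The class name, the field names, and the use of a message field name as the context are assumptions for illustration; the threshold of 132 is an example value (a hypothetical 128-byte offset plus a 4-byte target), not a figure from the case study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OverflowSignature:
    context: str     # where the input arrives, e.g. a message field name (assumed)
    min_length: int  # smallest input size that reaches the target

    def matches(self, field, value):
        """Flag inputs that arrive in the recorded context and are long enough to overflow."""
        return field == self.context and len(value) >= self.min_length

# Example: a signature for a hypothetical "path" field.
sig = OverflowSignature(context="path", min_length=132)
```

Here `sig.matches("path", b"A" * 200)` is true, while a short value in the same field or a long value in a different field is not matched. Tying the length test to a context is what keeps long but benign inputs elsewhere in the protocol from being dropped.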
37 Components Implementation
[Figure: component architecture. Numbered steps 1 to 5 connect the protected application, its crash (exception), the generated crash dump, the provided input history, the analysis, and the resulting signature and feedback. The infrastructure (save crash dump, extract relevant info, search/match, disassemble) uses the Windows Debug Engine.]
- RAMSES Crash Monitor
  - Catches only exceptions of interest
  - Snapshots for a given period
  - Self healer
- RAMSES Crash Analyzer
  - Fault type detection
  - Security-oriented analysis
  - Feedback
- A crash dump provides the same interface as a live process, so the Crash Analyzer does not have to work on a saved crash dump file.
38 Testing
39 Test Attacks and Applications
- Baseline Applications
  - phpBB (PHP)
  - SquirrelMail (PHP)
  - WebGoat (Java)
  - hMailServer (C++)
- Many sublanguages: SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path
40 Possible Testbed Configurations
41 Traffic Generation
- Purpose
  - Coverage of legitimate structural variation in monitored structures
    - SQL, command strings, call parameters
  - Stress of log complexity for practicality
    - Multiple users, multiple sessions
- Performance measurements
  - Program performance metrics
  - Quantify performance impact
42 Traffic Generation to Web Sites
- Approaches
  - Simple record/playback (basic)
    - With minor substitutions (cookies, IPs)
    - Shell scripts, netcat, MaxQ (Jython-based)
  - Custom DOM/Ajax scripting (learning)
    - Can access dynamically generated browser content after (or during) client-side script evaluation
    - Automated site crawls of URLs
    - Automated form contents (site-specific metadata)
- COTS tools
  - Load testing and metrics
44 Red Team Suggestions
45 Suggested Red Team ROEs
- Initial telecons held in Fall
- Claim: RAMSES will defeat most generalized injection attacks on protected applications
- Red Team should target our current and planned applications rather than new ones (unless new application, sample attacks and complete traffic generator can be provided to RAMSES far enough in advance for learning and testing)
- Remote network access to the targeted application
- Attack designated application suite
- Required instrumentation: yet to be determined
- Red Team exercise start: 15 April or later
46 RAMSES Project Schedule
[Figure: Gantt chart by quarter from CY06 through CY09, with a "Prototypes" legend entry, markers 1, 2, 3, a Red Team Exercise milestone, and a "Today: 11 September 2007" marker]
- Baseline Tasks: 1. Refine RAMSES Requirements; 2. Design RAMSES; 3. Develop Components; 4. Integrate System; 5. Analyze Test RAMSES; 6. Coordinate Rept
- Optional Tasks: O.3 Cross-Area Exper
47 Next Steps
48 Plans
- Develop input filters from output policies
- Extend memory error analyzer
- Demonstrate RAMSES on more applications and attack types
  - Native C/C++ app (most likely app is hMailServer)
  - Java
- Integrate components
- Performance and false positive testing
- Red Team exercise
49 Questions?
50 Backup
51 Tokenizing and Parsing
- Focus on rough parsing that reveals approximate structure, but not necessarily all the details
  - Accurate parsers are time-consuming to write
  - More important: they may not gracefully handle errors (common in HTML) or language extensions and variations (different shells, different flavors of SQL)
- Implemented using Flex/Bison
  - Currently done for SQL and shell command languages
  - Parse into a sequence of statements, each statement consisting of a command name and parameters
  - Incorporates a notion of confidence to deal with complex language features, e.g., variable substitutions in shell
- Modest effort for adding additional languages, but substantially simplifies subsequent learning tasks
  - Don't anticipate significant additions to this language list (other than HTML/XML)
52 Taint inference vs taint-tracking
- Disadvantages of learning
  - False negatives if inputs transformed before use
    - Low likelihood for most web apps
  - False positives due to coincidence
    - Mitigated using statistical information
  - Plan to evaluate these experimentally
- Benefits of learning
  - Low performance overhead
  - Some significant implicit flows handled without incurring high false positives
  - Can address multi-step attacks where tainted data is first stored in a file/database before use
  - More generally, in dealing with information flow that crosses module boundaries
53 Attack Coverage 2004
[Figure: chart of CVE vulnerabilities (Ver. 20040901); labeled regions: generalized injection attacks; stack-smashing, heap overflow, integer overflow, and data attacks]
54 RAMSES System Concept
[Figure: protected system comprising Web Server (IIS/Apache), Web App (PHP/ASP), SQL Database (MySQL), Network/App Firewall (e.g. mod_security), OS DLLs, Application DLLs, and Network DLLs]
- Key research problems
  - Learn taint propagation
    - Identify tainted components in output, generate filtering criteria
  - Learn input/output transformation
    - Use transformation to project output filters to input
55 Advantages of RAMSES Filters
- Filters easily sharable
  - Complements Application Community focus on end user applications
- Filters are human readable
- Filter generation algorithms can be enhanced to address privacy concerns wrt sharing
56 Filter types
- Filter Criteria
  - Correlative filters
    - Equality-based filter
    - Structure-based filter
    - Statistical filter
  - Causal filters
    - Filtering criteria derived from attack detection criteria (policy or anomaly)
- Filter Location
  - Input filter
    - Easier to deploy but harder to synthesize
  - Output filter (precedes sensitive operation)
    - Easier to synthesize than input filter, but deployment needs deeper instrumentation
    - May be too late for some attacks (memory corruption)
Note: All filters evaluated using large number of benign samples and ?1 attack sample