Title: String Analysis for Dependable Input Validation
1String Analysis for Dependable Input Validation
- Tevfik Bultan
- Verification Lab
- Department of Computer Science
- University of California, Santa Barbara
- bultan@cs.ucsb.edu, http://www.cs.ucsb.edu/vlab
2VLab String Analysis Publications
- With my students Fang Yu and Muath Alkhalaf
- Verifying Client-Side Input Validation Functions Using String Analysis [ICSE'12]
- Patching Vulnerabilities with Sanitization Synthesis [ICSE'11]
- Relational String Verification Using Multi-Track Automata [IJFCS'11, CIAA'10]
- String Abstractions for String Verification [SPIN'11]
- Stranger: An Automata-based String Analysis Tool for PHP [TACAS'10]
- Generating Vulnerability Signatures for String Manipulating Programs Using Automata-based Forward and Backward Symbolic Analyses [ASE'09]
- Symbolic String Verification: Combining String Analysis and Size Analysis [TACAS'09]
- Symbolic String Verification: An Automata-based Approach [SPIN'08]
3Web Software Everywhere
- Commerce, entertainment, social interaction
- We will rely on web apps more in the future
- Web apps + cloud will make desktop apps obsolete
7Road Block Dependability
- Web applications are not dependable!
- As web applications are becoming increasingly dominant
- and as their use in safety critical areas is increasing
- their dependability is becoming a critical issue
- Web applications are especially notorious for security vulnerabilities
- Their global accessibility makes them a target for many malicious users
8Vulnerabilities in Web Applications
- There are many well-known security vulnerabilities that exist in many web applications. Here are some examples:
- Malicious file execution: a malicious user causes the server to execute malicious code
- SQL injection: a malicious user executes SQL commands on the back-end database by providing specially formatted input
- Cross-site scripting (XSS): an attacker causes a malicious script to execute in a user's browser
- These vulnerabilities are typically due to
- errors in user input validation or
- lack of user input validation
9Why Is Input Validation Error-prone?
- Extensive string manipulation
- Web applications use extensive string manipulation
- To construct HTML pages, to construct database queries in SQL, etc.
- The user input comes in string form and must be validated and sanitized before it can be used
- This requires the use of complex string manipulation functions such as string-replace
- String manipulation is error prone
10String Related Vulnerabilities
- String related web application vulnerabilities as
a percentage of all vulnerabilities (reported by
CVE)
- OWASP Top 10 in 2007
- Cross Site Scripting
- Injection Flaws
- OWASP Top 10 in 2010
- Injection Flaws
- Cross Site Scripting
11String Related Vulnerabilities
- String related web application vulnerabilities occur when
- a sensitive function is passed a malicious string input from the user
- This input contains an attack
- It is not properly sanitized before it reaches the sensitive function
- String analysis: discover these vulnerabilities automatically
12XSS Vulnerability
- A PHP Example
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- The echo statement in line 4 is a sensitive function
- It contains a Cross Site Scripting (XSS) vulnerability
<script ...
13Is It Vulnerable?
- A simple taint analysis can report this segment vulnerable using taint propagation
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- echo is tainted → script is vulnerable
tainted
14How to Fix it?
- To fix the vulnerability we added a sanitization routine at line s
- Taint analysis will assume that $www is untainted and report that the segment is NOT vulnerable
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
tainted
untainted
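The false negative described on this slide can be reproduced with a toy taint propagation. This is a minimal sketch, not the analysis of any particular tool: the mini-IR (`PROGRAM`), the operation names, and the `trust_sanitizers` switch are all hypothetical and exist only for illustration.

```python
# Toy taint propagation over straight-line assignments (hypothetical mini-IR).
# Each statement is (target, operation, argument variables).
PROGRAM = [
    ("www", "input", []),                       # $www = $_GET["www"]
    ("l_otherinfo", "const", []),               # $l_otherinfo = "URL"
    ("www", "sanitize", ["www"]),               # $www = ereg_replace(..., $www)
    ("out", "concat", ["l_otherinfo", "www"]),  # argument of echo
]


def tainted_vars(program, trust_sanitizers=True):
    """Return the set of variables that are tainted after the last statement."""
    tainted = set()
    for target, op, args in program:
        if op == "input":
            is_tainted = True
        elif op == "sanitize" and trust_sanitizers:
            # A plain taint analysis assumes every sanitizer is correct.
            is_tainted = False
        else:
            is_tainted = any(arg in tainted for arg in args)
        if is_tainted:
            tainted.add(target)
        else:
            tainted.discard(target)
    return tainted
```

With `trust_sanitizers=True` the sink argument `out` is reported clean regardless of what the sanitizer actually removes, which is exactly why a deeper string analysis is needed.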
15Is It Really Sanitized?
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
<script ...
16Sanitization Routines are Erroneous
- The sanitization statement is not correct!
- ereg_replace("[^A-Za-z0-9 .-@://]", "", $www)
- Removes all characters that are not in { A-Za-z0-9 .-@/ }
- .-@ denotes all characters between "." and "@" (including "<" and ">")
- ".-@" should be ".\-@"
- This example is from a buggy sanitization routine used in MyEasyMarket-4.1 (line 218 in file trans.php)
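The range bug can be checked directly. The sketch below uses Python's `re` module as a stand-in for PHP's `ereg_replace`; that substitution is an assumption (the two engines differ in general, but both treat an unescaped `.-@` inside a character class as a range). The helper names `sanitize` and `survivors` are illustrative only.

```python
import re

BUGGY = re.compile(r"[^A-Za-z0-9 .-@://]")   # ".-@" is a range from "." to "@"
FIXED = re.compile(r"[^A-Za-z0-9 .\-@://]")  # "\-" is a literal hyphen


def sanitize(pattern, text):
    """Delete every character that the negated character class matches."""
    return pattern.sub("", text)


def survivors(pattern):
    """Printable ASCII punctuation that the sanitizer leaves in place."""
    return "".join(
        chr(code)
        for code in range(33, 127)
        if not chr(code).isalnum() and sanitize(pattern, chr(code))
    )


print(survivors(BUGGY))  # ./:;<=>?@
print(survivors(FIXED))  # -./:@
```

The buggy class lets `<` and `>` through, so `sanitize(BUGGY, "<script>")` returns the input unchanged, while the fixed class strips both angle brackets.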
17String Analysis
- String analysis determines all possible values that a string expression can take during any program execution
- Using string analysis we can identify all possible input values of the sensitive functions
- Then we can check if inputs of sensitive functions can contain attack strings
- How can we characterize attack strings?
- Use regular expressions to specify the attack patterns
- Attack pattern for XSS: Σ*<scriptΣ*
18Vulnerabilities can be Tricky
- Input <!scrip!t ...> does not match the attack pattern
- but it matches the vulnerability signature
- and it can cause an attack
- 1: <?php
- 2: $www = "<!scrip!t ...>";
- 3: $l_otherinfo = "URL";
- s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", "<!scrip!t ...>");
- 4: echo "<td>" . $l_otherinfo . ": " . "<script ...>" . "</td>";
- 5: ?>
19String Analysis
- If string analysis determines that the intersection of the attack pattern and possible inputs of the sensitive function is empty
- then we can conclude that the program is secure
- If the intersection is not empty, then we can again use string analysis to generate a vulnerability signature
- characterizes all malicious inputs
- Given Σ*<scriptΣ* as an attack pattern
- The vulnerability signature for $_GET["www"] is
- Σ*<α*sα*cα*rα*iα*pα*tΣ*
- where α ∉ { A-Za-z0-9 .-@/ }
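The relation between the attack pattern and the vulnerability signature can be illustrated with ordinary regular expressions. This is a sketch under assumptions: Python's `re` stands in for the automata, `re.search` plays the role of the surrounding Σ*, and the names `REMOVED`, `SIGNATURE`, and `reaches_sink` are made up for the example.

```python
import re

# α: any character that the buggy sanitizer deletes.
REMOVED = r"[^A-Za-z0-9 .-@://]"

# Attack pattern Σ*<scriptΣ* (search semantics supply the Σ* on both sides).
ATTACK = re.compile(r"<script")

# Signature Σ*<α*sα*cα*rα*iα*pα*tΣ*: "<script" with deletable noise in between.
SIGNATURE = re.compile("<" + "".join(REMOVED + "*" + ch for ch in "script"))


def reaches_sink(user_input):
    """Value of $www after the sanitization statement at line s."""
    return re.sub(REMOVED, "", user_input)


tricky = "<!scrip!t ...>"
print(bool(ATTACK.search(tricky)))                # False: not an attack string itself
print(bool(SIGNATURE.search(tricky)))             # True: matches the signature
print(bool(ATTACK.search(reaches_sink(tricky))))  # True: becomes an attack at the sink
```

An input matches the signature exactly when the sanitized value at the sink matches the attack pattern, which is why the signature, not the attack pattern, is the right filter for user input.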
20Automata-based String Analysis
- Finite State Automata can be used to characterize sets of string values
- We use automata based string analysis
- Associate each string expression in the program with an automaton
- The automaton accepts an over approximation of all possible values that the string expression can take during program execution
- Using this automata representation we symbolically execute the program, only paying attention to string manipulation operations
21Input Validation Verification Stages
[Figure: verification stages. Application/Scripts → Parser/Taint Analysis → (Tainted) Dependency Graphs → Vulnerability Analysis (with Attack Patterns) → Reachable Attack Strings → Signature Generation → Vulnerability Signature → Patch Synthesis → Sanitization Statements]
22Combining Forward Backward Analyses
- Convert PHP programs to dependency graphs
- Combine symbolic forward and backward reachability analyses
- Forward analysis
- Assume that the user input can be any string
- Propagate this information on the dependency graph
- When a sensitive function is reached, intersect with attack pattern
- Backward analysis
- If the intersection is not empty, propagate the result backwards to identify which inputs can cause an attack
[Figure: PHP Program → Front End → Forward Analysis → Backward Analysis → Vulnerability Signatures, with Attack patterns as an additional input to the analysis]
23Dependency Graphs
- Given a PHP program, first construct the dependency graph
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 5: echo $l_otherinfo . ": " . $www;
- 6: ?>
[Figure: Dependency Graph. Nodes, each labeled with its source line: $_GET["www"], 2; $www, 2; "URL", 3; $l_otherinfo, 3; [^A-Za-z0-9 .-@://], 4; "", 4; preg_replace, 4; $www, 4; ": ", 5; str_concat, 5; str_concat, 5; echo, 5]
24Forward Analysis
- Using the dependency graph we conduct vulnerability analysis
- Automata-based forward symbolic analysis that identifies the possible values of each node
- Each node in the dependency graph is associated with a DFA
- DFA accepts an over-approximation of the string values that the string expression represented by that node can take at runtime
- The DFAs for the input nodes accept Σ*
- Intersecting the DFA for the sink nodes with the DFA for the attack pattern identifies the vulnerabilities
- Uses post-image computations of string operations
- postConcat(M1, M2)
- returns M, where M = M1.M2
- postReplace(M1, M2, M3)
- returns M, where M = replace(M1, M2, M3)
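The intersection step at the sink can be sketched with an explicit product construction. This is an illustrative toy, not the MONA-based symbolic representation used by the tool: DFAs are plain Python tuples `(start, accepting, delta)`, the alphabet is restricted to printable ASCII, and only the sanitized `$www` node (not the full `URL: ` prefix) is modeled. All helper names are assumptions.

```python
import re
from collections import deque

ALPHABET = [chr(code) for code in range(32, 127)]  # printable ASCII only (assumption)


def star_of_class(char_class):
    """DFA for [char_class]*: one accepting state, no move on other characters."""
    allowed = {ch for ch in ALPHABET if re.fullmatch(char_class, ch)}

    def delta(state, ch):
        return 0 if ch in allowed else None  # None = dead state

    return (0, {0}, delta)


def contains(symbol):
    """DFA for the attack pattern Σ* symbol Σ*."""
    def delta(state, ch):
        return 1 if state == 1 or ch == symbol else 0

    return (0, {1}, delta)


def intersection_witness(left, right):
    """Breadth-first search of the product DFA; shortest common word or None."""
    (start_l, final_l, delta_l), (start_r, final_r, delta_r) = left, right
    start = (start_l, start_r)
    queue, seen = deque([(start, "")]), {start}
    while queue:
        (p, q), word = queue.popleft()
        if p in final_l and q in final_r:
            return word
        for ch in ALPHABET:
            nxt = (delta_l(p, ch), delta_r(q, ch))
            if None in nxt or nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, word + ch))
    return None


buggy_sink = star_of_class(r"[A-Za-z0-9 .-@://]")   # values that survive the buggy sanitizer
fixed_sink = star_of_class(r"[A-Za-z0-9 .\-@://]")  # values that survive the corrected one
attack = contains("<")

print(intersection_witness(buggy_sink, attack))  # <
print(intersection_witness(fixed_sink, attack))  # None
```

A non-empty intersection yields a concrete witness string, while `None` corresponds to the "intersection is empty, program is secure" case.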
25Forward Analysis
[Figure: forward analysis on the dependency graph with attack pattern Σ*<Σ*. Forward results per node: $_GET["www"], 2 and $www, 2: Σ*; "", 4: ε; [^A-Za-z0-9 .-@://], 4: [^A-Za-z0-9 .-@/]; "URL", 3 and $l_otherinfo, 3: URL; ": ", 5: ": "; preg_replace, 4 and $www, 4: [A-Za-z0-9 .-@/]*; first str_concat, 5: "URL: "; second str_concat, 5 and echo, 5: URL: [A-Za-z0-9 .-@/]*. At the sink: L(Σ*<Σ*) ∩ L(URL: [A-Za-z0-9 .-@/]*) = L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*) ≠ Ø]
26Resulting Automaton
[Figure: resulting DFA, with edges labeled U, R, L, Space, [A-Za-z0-9 .-;=-@/], < and [A-Za-z0-9 .-@/], accepting URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*]
27Symbolic Automata Representation
- We use the MONA DFA Package for automata manipulation
- [Klarlund and Møller, 2001]
- Compact Representation
- Canonical form and
- Shared BDD nodes
- Efficient MBDD Manipulations
- Union, Intersection, and Emptiness Checking
- Projection and Minimization
- Cannot Handle Nondeterminism
- We used dummy bits to encode nondeterminism
28Symbolic Automata Representation
Symbolic DFA representation
Explicit DFA representation
29Widening
- String verification problem is undecidable
- The forward fixpoint computation is not guaranteed to converge in the presence of loops and recursion
- We compute a sound approximation
- During fixpoint we compute an over approximation of the least fixpoint that corresponds to the reachable states
- We use an automata based widening operation to over-approximate the fixpoint
- Widening operation over-approximates the union operations and accelerates the convergence of the fixpoint computation
30Widening
- Given a loop such as
- 1: <?php
- 2: $var = "head";
- 3: while (. . .) {
- 4: $var = $var . "tail";
- 5: }
- 6: echo $var;
- 7: ?>
- Our forward analysis with widening would compute that the value of the variable $var in line 6 is (head)(tail)*
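Why widening is needed for this loop can be seen by iterating the concrete semantics. The sketch below does not implement the automata-based widening operator; it only shows, under that stated limitation, that the exact iterates keep growing while the widened result (head)(tail)* already covers all of them. The helper names are illustrative.

```python
import re


def iterates(steps):
    """Concrete fixpoint iteration: S0 = {"head"}, S(k+1) = S(k) ∪ {s + "tail"}."""
    current = {"head"}
    history = [set(current)]
    for _ in range(steps):
        current = current | {s + "tail" for s in current}
        history.append(set(current))
    return history


# Result the slide attributes to forward analysis with widening.
WIDENED = re.compile(r"(head)(tail)*")


def covered(strings):
    """True if every string lies in the language of the widened result."""
    return all(WIDENED.fullmatch(s) for s in strings)


history = iterates(5)
print([len(s) for s in history])  # [1, 2, 3, 4, 5, 6]: never stabilizes
print(covered(history[-1]))       # True: the widened result is a sound over-approximation
```

Each iteration adds one new string, so the exact computation never reaches a fixpoint, whereas the widened language is stable under another loop iteration.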
31 Backward Analysis
- A vulnerability signature is a characterization of all malicious inputs that can be used to generate attack strings
- We identify vulnerability signatures using an automata-based backward symbolic analysis starting from the sink node
- Pre-image computations on string operations
- preConcatPrefix(M, M2)
- returns M1, where M = M1.M2
- preConcatSuffix(M, M1)
- returns M2, where M = M1.M2
- preReplace(M, M2, M3)
- returns M1, where M = replace(M1, M2, M3)
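The idea of a pre-image can be illustrated by brute force on bounded inputs. This is not the symbolic pre-image computation on automata; it is a sketch that enumerates every input up to a small length over a tiny alphabet (both assumptions), pushes each through the running example, and collects the inputs whose sink value matches the attack pattern Σ*<Σ*.

```python
import re
from itertools import product

ALPHABET = "a<!"  # tiny illustrative alphabet (assumption)
SANITIZER = re.compile(r"[^A-Za-z0-9 .-@://]")


def sink_value(user_input):
    """String passed to echo in the running example."""
    return "URL: " + SANITIZER.sub("", user_input)


def bounded_preimage(max_len):
    """All inputs up to max_len whose sink value matches Σ*<Σ*."""
    bad = set()
    for length in range(max_len + 1):
        for letters in product(ALPHABET, repeat=length):
            word = "".join(letters)
            if "<" in sink_value(word):
                bad.add(word)
    return bad


malicious = bounded_preimage(4)
# Every malicious input matches [^<]*<Σ*, the signature of the running example.
print(all(re.fullmatch(r"[^<]*<.*", w) for w in malicious))  # True
print(len(malicious))                                         # 90
```

On this bounded universe the pre-image is exactly the set of inputs containing `<`, which agrees with a signature of the form [^<]*<Σ*.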
32Backward Analysis
[Figure: backward analysis on the dependency graph (nodes 3, 6, 10, 11 and 12 are marked). Forward results are as in the forward analysis. Backward results, starting from the sink: echo, 5 and the second str_concat, 5: URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*; $www, 4 and preg_replace, 4: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*; $www, 2 and $_GET["www"], 2: [^<]*<Σ*; all remaining nodes: do not care. Vulnerability Signature: [^<]*<Σ*]
33Vulnerability Signature Automaton
[Figure: vulnerability signature automaton accepting [^<]*<Σ*; edge labels in the figure include [^<], <, Σ and Non-ASCII]
34Vulnerability Signatures
- The vulnerability signature is the result of the input node, which includes all possible malicious inputs
- An input that does not match this signature cannot exploit the vulnerability
- After generating the vulnerability signature
- Can we generate a patch based on the vulnerability signature?
- The vulnerability signature automaton for the running example
[Figure: automaton accepting [^<]*<Σ*]
35Patches from Vulnerability Signatures
- Main idea
- Given a vulnerability signature automaton, find a cut that separates initial and accepting states
- Remove the characters in the cut from the user input to sanitize
- This means that if we just delete "<" from the user input, then the vulnerability can be removed
[Figure: signature automaton accepting [^<]*<Σ*; the min-cut is { < }]
36Patches from Vulnerability Signatures
- Ideally, we want to modify the input (as little as possible) so that it does not match the vulnerability signature
- Given a DFA, an alphabet cut is
- a set of characters such that, after removing the edges associated with the characters in the set, the modified DFA does not accept any non-empty string
- Finding a minimal alphabet cut of a DFA is an NP-hard problem (one can reduce the vertex cover problem to this problem)
- We use a min-cut algorithm instead
- The set of characters that are associated with the edges of the min cut is an alphabet cut
- but not necessarily the minimum alphabet cut
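The definition of an alphabet cut can be made concrete on the signature automaton of the running example. Note the swap in technique: the sketch below does an exhaustive search over subsets of a three-letter alphabet, which is only feasible because the example is tiny; it is not the min-cut heuristic the slides describe. The automaton encoding and function names are assumptions.

```python
from itertools import combinations

# Signature automaton for [^<]*<Σ* over a tiny alphabet (assumption).
ALPHABET = "ab<"
START, ACCEPTING = 0, {1}
EDGES = {
    (0, "a"): 0, (0, "b"): 0, (0, "<"): 1,
    (1, "a"): 1, (1, "b"): 1, (1, "<"): 1,
}


def accepts_nonempty(removed):
    """True if an accepting state is reachable by one or more edges avoiding `removed`."""
    frontier, seen = [START], set()
    while frontier:
        state = frontier.pop()
        for ch in ALPHABET:
            nxt = EDGES.get((state, ch))
            if ch in removed or nxt is None:
                continue
            if nxt in ACCEPTING:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


def smallest_alphabet_cut():
    """Exhaustive search, smallest subsets first (exponential in general)."""
    for size in range(len(ALPHABET) + 1):
        for cut in combinations(ALPHABET, size):
            if not accepts_nonempty(set(cut)):
                return set(cut)


print(smallest_alphabet_cut())  # {'<'}
```

Deleting `<` from the input is therefore enough to keep it out of the signature language, matching the cut shown for the running example.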
37Automatically Generated Patch
- Automatically generated patch will make sure that no string that matches the attack pattern reaches the sensitive function
- <?php
- if (preg_match('/[^<]*<.*/', $_GET["www"]))
- $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
- $www = $_GET["www"];
- $l_otherinfo = "URL";
- $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- ?>
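The effect of the generated patch can be checked with a Python transliteration of the patched segment. This is a sketch under assumptions: Python's `re` replaces the PHP regex functions, the signature is taken to be [^<]*<Σ*, and `sink_argument` is a hypothetical helper that returns the value of `$www` reaching `echo`.

```python
import re

SIGNATURE = re.compile(r"[^<]*<.*", re.DOTALL)   # vulnerability signature
SANITIZER = re.compile(r"[^A-Za-z0-9 .-@://]")   # original (buggy) sanitizer


def sink_argument(www):
    """Value of $www that reaches echo in the patched program."""
    if SIGNATURE.fullmatch(www):      # patch: input matches the signature
        www = www.replace("<", "")    # delete the alphabet cut { < }
    return SANITIZER.sub("", www)     # original sanitization still runs


print(sink_argument("<!scrip!t ...>"))  # script ...>
print(sink_argument("hello"))           # hello
```

The patch runs before the original, still buggy, sanitizer, so no value containing `<` can reach the sink even though the sanitizer itself is left unchanged.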
38Experiments
- We evaluated our approach on five vulnerabilities from three open source web applications
- (1) MyEasyMarket-4.1: A shopping cart program
- (2) BloggIT-1.0: A blog engine
- (3) proManager-0.72: A project management system
- We used the following XSS attack pattern
- Σ*<scriptΣ*
39Forward Analysis Results
- The dependency graphs of these benchmarks are simplified based on the sinks
- Unrelated parts are removed using slicing
Input: nodes, edges, sinks, inputs. Results: Time (s), Mem (kb), states/bdds.
nodes | edges | sinks | inputs | Time (s) | Mem (kb) | states/bdds
21 | 20 | 1 | 1 | 0.08 | 2599 | 23/219
29 | 29 | 1 | 1 | 0.53 | 13633 | 48/495
25 | 25 | 1 | 2 | 0.12 | 1955 | 125/1200
23 | 22 | 1 | 1 | 0.12 | 4022 | 133/1222
25 | 25 | 1 | 1 | 0.12 | 3387 | 125/1200
40Backward Analysis Results
- We use the backward analysis to generate the vulnerability signatures
- Backward analysis starts from the vulnerable sinks identified during forward analysis
Input: nodes, edges, sinks, inputs. Results: Time (s), Mem (kb), states/bdds.
nodes | edges | sinks | inputs | Time (s) | Mem (kb) | states/bdds
21 | 20 | 1 | 1 | 0.46 | 2963 | 9/199
29 | 29 | 1 | 1 | 41.03 | 1859767 | 811/8389
25 | 25 | 1 | 2 | 2.35 | 5673 | 20/302, 20/302
23 | 22 | 1 | 1 | 2.33 | 32035 | 91/1127
25 | 25 | 1 | 1 | 5.02 | 14958 | 20/302
41Alphabet Cuts
- We generate alphabet cuts from the vulnerability signatures using a min-cut algorithm
- Problem: When there are two user inputs the patch will block everything and delete everything
- Overlooks the relations among input variables (e.g., the concatenation of two inputs contains <SCRIPT)
Input: nodes, edges, sinks, inputs. Results: Alphabet Cut.
nodes | edges | sinks | inputs | Alphabet Cut
21 | 20 | 1 | 1 | <
29 | 29 | 1 | 1 | S,,
25 | 25 | 1 | 2 | Σ, Σ (vulnerability signature depends on two inputs)
23 | 22 | 1 | 1 | <,,
25 | 25 | 1 | 1 | <,,
42Relational String Analysis
- Instead of using multiple single-track DFAs use one multi-track DFA
- Each track represents the values of one string variable
- Using multi-track DFAs
- Identifies the relations among string variables
- Generates relational vulnerability signatures for multiple user inputs of a vulnerable application
- Improves the precision of the path-sensitive analysis
- Proves properties that depend on relations among string variables, e.g., $file = $usr.txt
43Multi-track Automata
- Let X (the first track), Y (the second track), be two string variables
- λ is a padding symbol
- A multi-track automaton that encodes X = Y.txt
[Figure: two-track automaton with a self-loop on (a,a), (b,b), followed by the transitions (.,λ), (t,λ), (x,λ), (t,λ)]
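How a two-track automaton reads a pair of strings can be sketched directly. The code below is an illustration, not the symbolic multi-track DFA of the tool: `align` pads the shorter string on the right with λ, and `accepts_x_eq_y_txt` simulates a small hand-written automaton for X = Y.txt. Both function names and the state encoding are assumptions.

```python
PAD = "λ"  # padding symbol


def align(x, y):
    """Pair two strings letter by letter, right-padding the shorter one with λ."""
    width = max(len(x), len(y))
    return list(zip(x.ljust(width, PAD), y.ljust(width, PAD)))


def accepts_x_eq_y_txt(x, y):
    """Simulate a 2-track DFA for X = Y.txt on the aligned word."""
    suffix = ".txt"
    state = 0  # 0: still copying Y; k > 0: k suffix letters already read
    for cx, cy in align(x, y):
        if state == 0 and cy != PAD:
            if cx != cy:          # loop transitions (a,a), (b,b), ...
                return False
        elif cy == PAD and state < len(suffix) and cx == suffix[state]:
            state += 1            # transitions (.,λ), (t,λ), (x,λ), (t,λ)
        else:
            return False
    return state == len(suffix)


print(align("ab.txt", "ab"))
print(accepts_x_eq_y_txt("ab.txt", "ab"))  # True
print(accepts_x_eq_y_txt("ab.txt", "ac"))  # False
```

Because padding appears only at the end of a track, the automaton can compare the two variables position by position, which is what lets it capture the relation between X and Y rather than their values in isolation.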
44Relational Vulnerability Signature
- We perform forward analysis using multi-track automata to generate relational vulnerability signatures
- Each track represents one user input
- An auxiliary track represents the values of the current node
- We intersect the auxiliary track with the attack pattern upon termination
45Relational Vulnerability Signature
- Consider a simple example having multiple user inputs
- <?php
- 1: $www = $_GET["www"];
- 2: $url = $_GET["url"];
- 3: echo $url . $www;
- ?>
- Let the attack pattern be Σ* < Σ*
46Relational Vulnerability Signature
- A multi-track automaton: ($url, $www, aux)
- Identifies the fact that the concatenation of two inputs contains <
[Figure: three-track automaton. Transitions that read $url have the form (a,λ,a), (b,λ,b), ..., (<,λ,<); transitions that read $www have the form (λ,a,a), (λ,b,b), ..., (λ,<,<)]
47Relational Vulnerability Signature
- Project away the auxiliary variable
- Find the min-cut
- This min-cut identifies the alphabet cuts { < } for the first track ($url) and { < } for the second track ($www)
[Figure: two-track automaton after projection. Transitions that read $url have the form (a,λ), (b,λ), ..., (<,λ); transitions that read $www have the form (λ,a), (λ,b), ..., (λ,<). The min-cut is { < }, { < }]
48Patch for Multiple Inputs
- Patch: If the inputs match the signature, delete its alphabet cut
- <?php
- if (preg_match('/[^<]*<.*/', $_GET["url"] . $_GET["www"])) {
- $_GET["url"] = preg_replace('/</', "", $_GET["url"]);
- $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
- }
- 1: $www = $_GET["www"];
- 2: $url = $_GET["url"];
- 3: echo $url . $www;
- ?>
49Technical Issues
- To conduct relational string analysis, we need to compute intersection of multi-track automata
- Aligned multi-track automata are closed under intersection
- λs are right justified in all tracks, e.g., abλλ instead of aλbλ
- However, there exist unaligned multi-track automata that are not describable by aligned ones
- We propose an alignment algorithm that constructs aligned automata which over or under approximate unaligned ones
50Other Technical Issues
- Modeling Word Equations
- Intractability of X = cZ
- The number of states of the corresponding aligned multi-track DFA is exponential in the length of c
- Irregularity of X = YZ
- X = YZ is not describable by an aligned multi-track automaton
- We propose a conservative analysis
- We construct multi-track automata that over or under-approximate the word equations
51Composite Analysis
- What I have talked about so far focuses only on string contents
- It does not handle constraints on string lengths
- It cannot handle comparisons among integer variables and string lengths
- We extended our string analysis techniques to analyze systems that have unbounded string and integer variables
- We proposed a composite static analysis approach that combines string analysis and size analysis
52Size Analysis
- Size Analysis: The goal of size analysis is to provide properties about string lengths
- It can be used to discover buffer overflow vulnerabilities
- Integer Analysis: At each program point, statically compute the possible states of the values of all integer variables.
- These infinite states are symbolically over-approximated as linear arithmetic constraints that can be represented as an arithmetic automaton
- Integer analysis can be used to perform size analysis by representing lengths of string variables as integer variables.
53An Example
- Consider the following segment
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: $www = ereg_replace("[^A-Za-z0-9 ./-@://]", "", $www);
- 5: if (strlen($www) < $limit)
- 6: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 7: ?>
- If we perform size analysis solely, after line 4, we do not know the length of $www
- If we perform string analysis solely, at line 5, we cannot check/enforce the branch condition.
54Composite Analysis
- We need a composite analysis that combines string analysis with size analysis.
- Challenge: How to transfer information between string automata and arithmetic automata?
- A string automaton is a single-track DFA that accepts a regular language; the lengths of the strings in that language form a semi-linear set
- For example: {4, 6} ∪ {2 + 3k | k ≥ 0}
- The unary encoding of a semi-linear set is uniquely identified by a unary automaton
- The unary automaton can be constructed by replacing the alphabet of a string automaton with a unary alphabet
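The unary construction can be sketched by forgetting the letters on a string automaton and tracking which states are reachable after n steps. This is a bounded illustration only: it lists accepted lengths up to a limit rather than building the unary automaton, and the example language ab(cab)* and the function name are assumptions chosen so that the lengths form the set {2 + 3k | k ≥ 0}.

```python
def accepted_lengths(edges, start, accepting, bound):
    """Lengths n <= bound accepted once every letter is replaced by one unary symbol."""
    successors = {}
    for src, _letter, dst in edges:  # the letter is deliberately ignored
        successors.setdefault(src, set()).add(dst)
    current, lengths = {start}, []
    for n in range(bound + 1):
        if current & accepting:
            lengths.append(n)
        current = {dst for src in current for dst in successors.get(src, ())}
    return lengths


# String automaton for ab(cab)* (hypothetical example language).
EDGES = [(0, "a", 1), (1, "b", 2), (2, "c", 0)]

print(accepted_lengths(EDGES, start=0, accepting={2}, bound=12))  # [2, 5, 8, 11]
```

Dropping the letters can make the automaton nondeterministic in general, which is why the sketch tracks a set of current states instead of a single one.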
55Arithmetic Automata
- An arithmetic automaton is a multi-track DFA, where each track represents the value of one variable over a binary alphabet
- If the language of an arithmetic automaton satisfies a Presburger formula, the value of each variable forms a semi-linear set
- The semi-linear set is accepted by the binary automaton that projects away all other tracks from the arithmetic automaton
56Connecting the Dots
- We developed novel algorithms to convert unary automata to binary automata and vice versa
- Using these conversion algorithms we can conduct a composite analysis that subsumes size analysis and string analysis
[Figure: String Automata, Unary Length Automata, Binary Length Automata, Arithmetic Automata, connected in a chain by the conversion algorithms]
57Stranger A String Analysis Tool
Stranger is available at www.cs.ucsb.edu/vlab/stranger
- Uses Pixy [Jovanovic et al., 2006] as a PHP front end
- Uses MONA [Klarlund and Møller, 2001] automata package for automata manipulation
[Figure: Stranger architecture. Inputs: PHP program, Attack patterns. Pixy Front End: Parser, CFG, Dependency Analyzer, producing Dependency Graphs. Symbolic String Analysis: String Analyzer, which issues String/Automata Operations on Stranger Automata through the Automata Based String Manipulation Library; the library manipulates DFAs with the MONA Automata Package. Outputs: Vulnerability Signatures + Patches]
58Case Study
- Schoolmate 1.5.4
- Number of PHP files: 63
- Lines of code: 8181
- Forward Analysis results
Time | Memory | Number of XSS sensitive sinks | Number of XSS Vulnerabilities
22 minutes | 281 MB | 898 | 153
- After manual inspection we found the following
Actual Vulnerabilities | False Positives
105 | 48
59Case Study False Positives
- Why false positives?
- Path insensitivity: 39
- Path to vulnerable program point is not feasible
- Un-modeled built in PHP functions: 6
- Unfound user written functions: 3
- PHP programs have more than one execution entry point
- We can remove all these false positives by extending our analysis to a path sensitive analysis and modeling more PHP functions
60Case Study - Sanitization
- We patched all actual vulnerabilities by adding sanitization routines
- We ran Stranger the second time
- Stranger proved that our patches are correct with respect to the attack pattern we are using
61Related Work String Analysis
- String analysis based on context free grammars [Christensen et al., SAS'03; Minamide, WWW'05]
- String analysis based on symbolic/concolic execution [Bjorner et al., TACAS'09]
- Bounded string analysis [Kiezun et al., ISSTA'09]
- Automata based string analysis [Xiang et al., COMPSAC'07; Shannon et al., MUTATION'07]
- Application of string analysis to web applications [Wassermann and Su, PLDI'07, ICSE'08; Halfond and Orso, ASE'05, ICSE'06]
62Related Work
- Size Analysis
- Size analysis [Hughes et al., POPL'96; Chin et al., ICSE'05; Yu et al., FSE'07; Yang et al., CAV'08]
- Composite analysis [Bultan et al., TOSEM'00; Xu et al., ISSTA'08; Gulwani et al., POPL'08; Halbwachs et al., PLDI'08]
- Vulnerability Signature Generation
- Test input/Attack generation [Wassermann et al., ISSTA'08; Kiezun et al., ICSE'09]
- Vulnerability signature generation [Brumley et al., SP'06; Brumley et al., CSF'07; Costa et al., SOSP'07]
63THE END