Title: Dynamic Analysis Applications
1Dynamic Analysis Applications
2Primitives
- Tracing
- Profiling
- Checkpointing and replay
- Slicing (dependence detection)
- Indexing
- Delta-debugging (input reduction)
3Four Sample Applications
- Malware unpacking
- Profiling for concurrency
- Taint analysis
- Input structure reverse engineering
4Malware Unpacking
- Malware often features an encrypted code body and a decryption engine. Given an executable with an embedded piece of malicious code, the goal is to recover the plain text of the malware code body.
- Malware > goodware.
- Symantec's daily report contains a few thousand malware samples.
- 70% of malware is packed.
- Signature-based malware detection is still the most effective technique (0.01 false positive rate). It requires unpacking.
5A Packed Malware Binary
- A binary is packed if some portion of its code is
not present until runtime
[Figure: address-space layouts of the original binary and the packed binary, each with its entry point marked. In the packed binary, the entry point leads to the anti-debugger code and the unpacking loop; a final JUMP transfers control to the unpacked code.]
- Payload program is mostly unchanged
- Packed code initially compressed or encrypted
- Anti-debugger code
  - Timing checks of various granularities
  - Control flow obfuscation
- Unpacking loop
    .loop  lea   eax, 0x4a0000
           lea   ebx, 0x401000
           load  ecx, ptr r1
           xor   ecx, 0xffffff
           store ptr[ecx], r2
           ...
           jnz   .x
           call  ptr[edi]
    .x     add   eax, 4
           add   ebx, 4
           cmp   eax, 0x4a1f88
           jnz   .loop
           jmp   0x401000
- JUMP: control transfer to unpacked code
The slide is from Kevin Roundy ("Packed Binary Analysis with Dyninst").
6Unpacking by Tracing
- The most prominent feature of packed malware is the control flow transfer to a dynamically generated region.
    for (i ...)
        B[i] = A[i] XOR key
    goto B[0]
- Collect the memory access trace
  - Upon execution of an instruction PC that writes value V to address X
    - Create one trace entry <PC, X, V>.
    - Hashmap[X] = PC
  - Upon execution of a control flow transfer instruction to X
    - Test if Hashmap[X] is defined. If so, the program is executing a dynamically generated instruction.
    - The decryption loop is identified by PC = Hashmap[X]
7Unpacking by Tracing (continued)
- After the identification of the decryption
instruction PC, search through the trace to
collect the sequence of values that are written
by PC, which is often the plain text body of the
malware.
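The tracing procedure on these two slides can be sketched in Python. This is a minimal illustration, not a real tracer: the event tuples and the names `find_unpacked_code` and `simulate_packed_program` are hypothetical, and the event stream is hand-built to mimic the XOR loop rather than collected from an instrumented binary.

```python
def find_unpacked_code(events):
    """Scan a trace of ("write", pc, addr, value) and ("jump", pc, target) events.

    Returns (decryption_pc, values_written_by_that_pc), or None if no control
    transfer ever targets an address that was written during the run.
    """
    trace = []          # trace entries <PC, X, V>
    last_writer = {}    # Hashmap[X] = PC
    for event in events:
        if event[0] == "write":
            _, pc, addr, value = event
            trace.append((pc, addr, value))
            last_writer[addr] = pc
        elif event[0] == "jump":
            _, pc, target = event
            if target in last_writer:
                # Control reached a dynamically generated instruction.
                decrypt_pc = last_writer[target]
                body = [v for (p, _, v) in trace if p == decrypt_pc]
                return decrypt_pc, body
    return None


def simulate_packed_program(encrypted, key, base=0x401000):
    """Hypothetical event stream for: for i: B[i] = A[i] XOR key; goto B[0]."""
    events = []
    for i, byte in enumerate(encrypted):
        events.append(("write", 2, base + i, byte ^ key))  # statement 2 writes B[i]
    events.append(("jump", 4, base))                       # statement 4 jumps to B[0]
    return events
```

Feeding the simulated trace of an XOR-encrypted byte sequence into `find_unpacked_code` should return the PC of the decrypting store together with the decrypted bytes. A single-level packer is assumed; page-by-page or multi-level unpacking would need the search to continue past the first hit.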
8More on Malware
- Unpacking (decryption) occurs page by page.
  - Unpack one page, execute that page, trap the execution, unpack another page to the same buffer space, and so on.
  - Does our tracing technique still work?
- Use multiple packers.
  - The plain text can only be reached after multiple levels of unpacking.
- Anti-tracing techniques
  - Detecting obvious slow-down of its own execution.
- Quoted from Symantec
  - "We know dynamic analysis is the future of AV because of packing and obfuscation, but the problem is to be able to run it and afford running it."
9Profiling Parallelism
- A recent trend to parallelize a sequential
program is to spawn a method call as a separate
thread.
[Figure: a sequential call foo(), where foo's body runs before foo's continuation, next to an asynchronous foo(), where foo's body and foo's continuation run in parallel.]
10
- Devise a profiler that identifies method calls that are amenable to such parallelization.
- A naïve solution: collect dependence traces of the form <PCuse, PCdef>. If PCdef is inside a dynamic method call C and PCuse is in the continuation of C, then the method is not amenable to asynchronous invocation.
- The problems
  - It is unlikely that a value written in a method call is never used later (observed by all three proposals I received).
  - Do we care if the control flow (time) distance between the definition and the use is so long that the conflicting dependence is easily respected even though the call is spawned as a thread? (observed by one proposal)
  - Nested functions and repeated functions.
11Dependence Filtering
- C: a method invocation
- Tdur: the duration of the method invocation
- Tdep: the distance between the def and the use involved in a dependence
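The slide defines Tdur and Tdep but the filtering rule itself did not survive in the transcript. A minimal sketch under one plausible assumption: a dependence from inside call C to its continuation only blocks asynchronous invocation when Tdep <= Tdur. The reasoning is that if the body and the continuation start together and run at their sequential pace, a definition d time units into the body and a use u time units into the continuation are Tdep = (Tdur - d) + u apart sequentially, so Tdep > Tdur exactly when u > d, i.e. when the use still follows the definition. The function names and the profile layout are hypothetical.

```python
def conflicts_when_async(t_dur, t_dep):
    """True if the use could run before the definition once the call is spawned.

    Assumed rule (not stated on the slide): the dependence is only a problem
    when the def-use distance does not exceed the call's duration.
    """
    return t_dep <= t_dur


def amenable_calls(profile):
    """profile maps a call id to (t_dur, [t_dep, ...]).

    Each t_dep belongs to a dependence whose definition is inside the call and
    whose use is in the call's continuation. Returns the sorted ids of calls
    with no conflicting dependence.
    """
    return sorted(
        call
        for call, (t_dur, t_deps) in profile.items()
        if not any(conflicts_when_async(t_dur, t_dep) for t_dep in t_deps)
    )
```

With this filter, a call whose written values are only consumed long after it returns stays a candidate, which addresses the first two problems listed on the previous slide.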
12Nesting and Repetition Problem
 1. void A ( )
 2.   while (...)
 3.     s1
 4.
 5. void B ( )
 6.   A ( )
 7.   s2
 8.
 9.
10. void main ( )
11.   A ( )
12.   B ( )

Execution trace E: 11 2 3 2 3 2 12 6 2 3 2 7

Rmain → 11 RA 12 RB
RA → 2 R2
R2 → 3 2 R2 | ε
RB → 6 RA 7

2 → 2, two cases
2 → 7, two cases
13The Solution
- Maintain execution indices during the run.
- Dependences are detected in the form of <IDXuse, IDXdef>
- Upon the detection of a dependence
  - Traverse backwards along the index path of the definition point (from the leaf, i.e., the definition PC, to the root) until a common ancestor of both the definition and the use point is reached.
  - Along the traversal, the profile of every intermediate node corresponding to a method call is updated.
- Given the profile, transform the program accordingly.
  - Using Java futures (by one proposal).
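The traversal step can be sketched as follows. This is a simplified illustration, not the lecture's implementation: an execution index is modeled as a tuple of `(kind, label)` nodes from the root to the leaf, the labels such as `"A@6"` are hypothetical names for call instances, and the profile simply keeps the smallest def-use distance seen per call. Real execution indices would also carry loop iterations and predicates.

```python
def common_prefix_len(a, b):
    """Number of leading nodes shared by two index paths."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def record_dependence(profile, idx_use, idx_def, t_dep):
    """Update the profile of every method call between the definition and the
    common ancestor of the definition and the use.

    profile maps a call label to the smallest t_dep observed so far.
    """
    shared = common_prefix_len(idx_use, idx_def)
    # Walk the definition's index path from the leaf back toward the root,
    # stopping once the common ancestor is reached.
    for kind, label in reversed(idx_def[shared:]):
        if kind == "call":
            profile[label] = min(profile.get(label, t_dep), t_dep)
    return profile
```

For the program on the previous slide, a definition at statement 2 inside the call to A at line 6 and a use at statement 7 in B share the ancestors main and B. Only the inner call to A is charged with the dependence, so B itself is not penalized for a dependence that stays within its own body.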
14Taint Analysis
- Inputs from untrusted sources, such as network packets or even files, are tainted, meaning they are not trusted. Taint bits are propagated through dependences during execution, so a variable is not trusted if tainted input affected its value.
- A conditional or unconditional jump to a tainted location is considered a security violation.
- It can also be adapted to detect information leaks.
15Implementing Taint Analysis Using DP Primitives
- Build dynamic dependence graph
- See if the entry points of untrusted values are
in the dynamic slices of output values.
- Example: a buffer overflow exploit
    void (*F) ();
    char A[2];
    ...
    read(B, 256);
    i = 2;
    A[i] = B[i];
    ...
    (*F) ();
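A minimal sketch of the slice-based check applied to the exploit above. Everything here is an assumption made for illustration: memory is modeled as integer addresses with F placed directly after the two-element array A, only data dependences are recorded (no control dependences), and `DependenceGraph` and `call_target_is_tainted` are hypothetical names.

```python
class DependenceGraph:
    """Dynamic data-dependence graph over memory definitions."""

    def __init__(self):
        self.deps = []       # deps[n]: nodes that dynamic definition n depends on
        self.is_input = []   # is_input[n]: True if n is an untrusted entry point
        self.last_def = {}   # address -> node of its most recent definition

    def define(self, addr, src_addrs=(), untrusted=False):
        """Record a dynamic definition of addr computed from src_addrs."""
        node = len(self.deps)
        self.deps.append([self.last_def[s] for s in src_addrs if s in self.last_def])
        self.is_input.append(untrusted)
        self.last_def[addr] = node
        return node

    def slice_has_input(self, addr):
        """True if an untrusted entry point is in the dynamic slice of addr."""
        if addr not in self.last_def:
            return False
        stack, seen = [self.last_def[addr]], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if self.is_input[node]:
                return True
            stack.extend(self.deps[node])
        return False


def call_target_is_tainted(i, input_len=3):
    """Replay the example with index i and check F just before (*F)()."""
    graph = DependenceGraph()
    A, F, B = 0, 2, 100              # assumed layout: F sits right after A[0..1]
    graph.define(F)                  # F initially holds a trusted address
    for k in range(input_len):       # read(B, 256): every byte of B is untrusted
        graph.define(B + k, untrusted=True)
    graph.define(A + i, [B + i])     # A[i] = B[i]
    return graph.slice_has_input(F)
```

With `i = 1` the write stays inside A and the slice of F contains no input; with `i = 2` the write lands on F, so the input byte enters the slice of the call target and the indirect call would be flagged.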
16Reverse Engineering Input Syntactic Structure
- Focus on syntactic structure
  - Inputs follow a certain grammar
  - Derive the derivation tree (AST) for a given input
- Motivation
  - Test generation
  - Network protocol analysis
  - Delta debugging
- Basic Idea
  - Trace the use points of input values
  - Build the index tree for the execution, annotate the index tree at input use points with the values used.
  - The annotated tree serves as the input syntactic tree.
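The basic idea can be illustrated with a toy parser whose calls are instrumented by hand. This is only a sketch under stated assumptions: the index tree tracks function invocations only, input uses are recorded where a character is consumed, and `IndexTree`, `parse_list`, and `derive_tree` are hypothetical names. A real tool would obtain the same information from an execution trace rather than from manual annotations.

```python
class IndexTree:
    """Index tree of an execution: nested lists of [label, child, child, ...]."""

    def __init__(self):
        self.root = ["root"]
        self.stack = [self.root]

    def enter(self, label):
        """A new invocation starts: add a child node and descend into it."""
        node = [label]
        self.stack[-1].append(node)
        self.stack.append(node)

    def leave(self):
        """The current invocation returns."""
        self.stack.pop()

    def use(self, value):
        """Annotate the current node with an input value used at this point."""
        self.stack[-1].append(value)


def parse_list(text, pos, tree):
    """Parse '(' item* ')' where an item is a single letter or a nested list."""
    tree.enter("list")
    tree.use(text[pos])              # consume '('
    pos += 1
    while text[pos] != ")":
        if text[pos] == "(":
            pos = parse_list(text, pos, tree)
        else:
            tree.enter("atom")
            tree.use(text[pos])      # consume one letter
            tree.leave()
            pos += 1
    tree.use(text[pos])              # consume ')'
    tree.leave()
    return pos + 1


def derive_tree(text):
    """Run the instrumented parser and return the annotated index tree."""
    tree = IndexTree()
    parse_list(text, 0, tree)
    return tree.root
```

For the input `(a(b))`, the annotated tree nests one `list` node inside another, mirroring the bracket structure of the input even though the analysis never sees the grammar itself.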
17Input Grammar and AST
18The Implementation Related To Parsing
19Execution Traces and the Index Tree
21Open problems
- AST → grammar
  - Grammars are much more useful
  - Reject malicious input from the beginning
  - Facilitate protocol understanding
- Single input only
  - How to fuse
  - How to mine the grammar from multiple ASTs?
- What about the semantic part?
  - Keyword, length?
  - They are indeed all constraints!