Title: Paradyn/Dyninst Binary Analysis Session
1 Paradyn/Dyninst Binary Analysis Session
- Stripped Binaries
- Obfuscated Binary Code
- Undetectable Transformation and Instrumentation
Paradyn / Condor Week, Madison, Wisconsin, April 29 - May 2, 2008
2 The World of Dyninst
instrumentation
analysis
debugging
The binary!
Code in known functions
3 Modern Binary Challenges
Missing symbols
gaps
Could be anything in there
4 Modern Binary Challenges
Packed code
New code appears at runtime
5 Modern Binary Challenges
Introspective or self-modifying code
Makes instrumenting very difficult
6 The Next 90 Minutes
7 Learning to Analyze Stripped Binary Code
Nathan Rosenblum, Paradyn Project
Paradyn / Condor Week, Madison, Wisconsin, April 29 - May 2, 2008
8 Code is Hard to Find
9 Code is Hard to Find
... but Dyninst knows how
<< push ebp; mov esp, ebp >>
7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70
6c 65 20 69 55 85 e5 6f 67 75 73 2e 2e 2e 7a 01
00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65
7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70
6c 65 20 69 55 85 e5 6f 67 75 73 2e 2e 2e 7a 01
00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65
7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70
6c 65 20 69 push ebp; mov esp, ebp; push ebx ...
examine gaps
scan for patterns
recover code
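The three steps above (examine gaps, scan for patterns, recover code) can be sketched roughly as follows. This is a minimal illustration and not Dyninst's implementation: the function names, the byte buffer, and the list of known function ranges are all hypothetical. The pattern bytes `55 89 e5` are the standard 32-bit x86 encoding of `push %ebp; mov %esp,%ebp`.

```python
# Hypothetical sketch: find gaps between known functions and scan them
# for a fixed function-prologue byte pattern.
PROLOGUE = bytes([0x55, 0x89, 0xE5])  # push %ebp; mov %esp,%ebp


def find_gaps(code_size, known_functions):
    """Return (start, end) ranges not covered by any known function."""
    gaps = []
    cursor = 0
    for start, end in sorted(known_functions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < code_size:
        gaps.append((cursor, code_size))
    return gaps


def scan_gap(code, start, end, pattern=PROLOGUE):
    """Return every offset in [start, end) where the pattern begins."""
    hits = []
    pos = code.find(pattern, start, end)
    while pos != -1:
        hits.append(pos)
        pos = code.find(pattern, pos + 1, end)
    return hits


# Toy code segment: one known function, then string data, then an
# undiscovered function hiding in the gap.
code = b"\x7a\x01\x00" + PROLOGUE + b"\x53\xc3" + b"this example" + PROLOGUE + b"\xc3"
known = [(3, 8)]

candidates = []
for gap_start, gap_end in find_gaps(len(code), known):
    candidates.extend(scan_gap(code, gap_start, gap_end))
print(candidates)
```

Each hit is only a candidate entry point; a real parser would still have to disassemble from that offset to recover the code and reject false matches inside data.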
10 Digression: Evaluating Code Parsers
Precision: better confidence in results
Recall: find more code
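For reference, both measures can be computed from two sets of addresses: the entry points a parser reports and the true entry points (for example, taken from symbols before stripping). The sketch below uses the standard definitions; the function name and the sample addresses are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two sets of entry-point addresses."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: what fraction of reported entry points are real.
    precision = true_positives / len(predicted) if predicted else 1.0
    # Recall: what fraction of real entry points were reported.
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


reported = {0x10, 0x20, 0x30, 0x40}   # hypothetical parser output
truth = {0x10, 0x20, 0x50}            # hypothetical ground truth
p, r = precision_recall(reported, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```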
11 Dyninst Finds Gap Code Well
GCC-compiled binaries: 0.97 precision, 0.98 recall
12 ... or not so well
Intel CC-compiled binaries: 0.67 precision, 0.16 recall
13 Why is Gap Parsing Hard?
Code Segment
Gap contents may vary
String data
- Dialog Constants
- Import names
- Other strings
14 Why is Gap Parsing Hard?
15 Why is Gap Parsing Hard?
16 No flexibility for additional instructions
<< push ebp; mov esp, ebp >>
Ignores preceding information
Rigid, hand-tuned, compiler-specific
17 Learning to Recognize Functions
Goal: automatically model binary code
We need:
- Features to represent functions
- Learning system to choose best features
- A way to use the system when parsing
18 Idioms
<< push ebp; mov esp,ebp >>
<< push ebp; mov esp,ebp >>
<< mov 0x8(ebp),eax >>
PRE << ret; nop >>
A function starts after a PRE idiom
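One way to picture idioms as features is the sketch below. It is an illustration only: the `Idiom` class, the `*` wildcard, and the use of instruction indices instead of byte offsets are assumptions made here, not details taken from the slides. An ordinary idiom is matched starting at the candidate entry point, and a PRE idiom is matched against the instructions immediately before it.

```python
from dataclasses import dataclass

WILDCARD = "*"  # hypothetical marker: matches any single instruction


@dataclass(frozen=True)
class Idiom:
    instructions: tuple   # e.g. ("push ebp", "mov esp,ebp")
    prefix: bool = False  # True for PRE idioms (matched before the candidate)


def matches(idiom, insns, candidate):
    """Does the idiom match at instruction index `candidate`?"""
    n = len(idiom.instructions)
    start = candidate - n if idiom.prefix else candidate
    if start < 0 or start + n > len(insns):
        return False
    window = insns[start:start + n]
    return all(pattern == WILDCARD or pattern == actual
               for pattern, actual in zip(idiom.instructions, window))


def feature_vector(idioms, insns, candidate):
    """One 0/1 feature per idiom for a candidate entry point."""
    return [int(matches(idiom, insns, candidate)) for idiom in idioms]


insns = ["ret", "nop", "push ebp", "mov esp,ebp", "mov 0x8(ebp),eax"]
idioms = [
    Idiom(("push ebp", "mov esp,ebp")),
    Idiom(("mov 0x8(ebp),eax",)),
    Idiom(("ret", "nop"), prefix=True),
]
print(feature_vector(idioms, insns, 2))
```

A learned model would then weight these binary features rather than rely on a single hard-coded pattern.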
19 A Problem of Scale
How do we choose the best idioms?
There are tens of thousands; we only want the best few!
Needles in an... um, idiom stack?
20 Distributed Feature Selection
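The slide does not say which selection procedure is used. As one plausible illustration, greedy forward selection adds, in each round, the single feature that most improves a score and stops when nothing helps. Everything in this sketch is an assumption: the toy classifier (predict "entry point" if any selected idiom fires), accuracy as the score, and the sample data. The per-candidate evaluations within a round are independent of each other, which is what would make such a search easy to distribute across many machines.

```python
def accuracy(selected, examples):
    """Toy classifier: predict True if any selected feature fires."""
    correct = 0
    for vector, label in examples:
        predicted = any(vector[i] for i in selected)
        correct += (predicted == label)
    return correct / len(examples)


def forward_select(examples, num_features, max_features):
    """Greedily pick features while each addition improves accuracy."""
    selected = []
    best_score = accuracy(selected, examples)
    while len(selected) < max_features:
        candidates = [f for f in range(num_features) if f not in selected]
        if not candidates:
            break
        # Each evaluation below is independent of the others, so this is
        # the step that could be run in parallel.
        scored = [(accuracy(selected + [f], examples), f) for f in candidates]
        score, feature = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best_score:
            break
        selected.append(feature)
        best_score = score
    return selected, best_score


# (feature vector, is_real_entry_point); feature 2 is deliberately noisy.
examples = [
    ([1, 0, 1], True),
    ([1, 0, 0], True),
    ([0, 1, 0], True),
    ([0, 0, 1], False),
    ([0, 0, 0], False),
    ([1, 0, 0], True),
]
chosen, final_score = forward_select(examples, num_features=3, max_features=3)
print(chosen, final_score)
```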
21 Structural Complications
Candidate function
CALL
22 Structural Complications
23 Structural Complications
24 Research
Call/conflict features for pairs
Conditional Random Fields (CRFs)
Label each candidate function entry point (FEP)
Infers probability of joint labeling
Skipping lots of math!
Idiom features for single nodes
Greedy Approximation
highest confidence
idiom score
call propagation
conflict elimination
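The greedy approximation listed above can be read as the loop sketched below. This is an interpretation of the four bullet points, not the published algorithm: the data structures (`scores`, `calls`, `conflicts`), the threshold, and the exact ordering of the steps are assumptions made for illustration.

```python
def greedy_label(scores, calls, conflicts, threshold=0.5):
    """
    scores:    candidate FEP -> idiom score in [0, 1]
    calls:     candidate FEP -> set of call targets found if it is parsed
    conflicts: candidate FEP -> set of candidates that overlap it
    Returns the set of candidates accepted as function entry points.
    """
    accepted, rejected = set(), set()
    # Visit candidates from highest to lowest idiom score.
    for candidate in sorted(scores, key=scores.get, reverse=True):
        if candidate in accepted or candidate in rejected:
            continue
        if scores[candidate] < threshold:
            break  # everything after this scores lower still
        stack = [candidate]
        while stack:
            fep = stack.pop()
            if fep in accepted or fep in rejected:
                continue
            accepted.add(fep)
            # Conflict elimination: overlapping candidates cannot also be real.
            for other in conflicts.get(fep, ()):
                if other not in accepted:
                    rejected.add(other)
            # Call propagation: targets called by accepted code are accepted too.
            stack.extend(calls.get(fep, ()))
    return accepted


scores = {0x100: 0.95, 0x103: 0.60, 0x200: 0.10, 0x300: 0.40}
calls = {0x100: {0x200}}
conflicts = {0x100: {0x103}, 0x103: {0x100}}
result = greedy_label(scores, calls, conflicts)
print(sorted(hex(a) for a in result))
```

In this toy input, 0x200 is accepted despite its low idiom score because an accepted function calls it, and 0x103 is dropped because it overlaps 0x100.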
25 Experimentation
- GNU C Compiler
  - Simple, regular function preamble
- Intel C Compiler
  - Most variation in entry points; highly optimized
- MS Visual Studio
  - High variation in function entry point idioms
26 Testing
Comparison of three binary analysis tools
- Original Dyninst
  - Scans for common entry preamble
- Dyninst w/ Model
  - Model replaces entry preamble heuristic
- IDA Pro Disassembler
  - Scans for common entry preamble
  - List of Library Fingerprints (Windows)
27 More Testing
Classifier tuned to any point on curve
ICC binaries are the hardest
[Figure panels: Visual Studio, Intel C Compiler]
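Tuning a classifier to a point on a curve usually means choosing a score threshold. The sketch below assumes the curve trades precision against recall, which the slide text does not state explicitly, and uses invented scores. Raising the threshold reports fewer candidates with higher confidence; lowering it finds more code at the cost of more false positives.

```python
def sweep(scored, thresholds):
    """scored: list of (classifier score, is_real_entry_point) pairs."""
    total_real = sum(1 for _, real in scored if real)
    points = []
    for t in thresholds:
        reported = [real for score, real in scored if score >= t]
        true_positives = sum(reported)
        precision = true_positives / len(reported) if reported else 1.0
        recall = true_positives / total_real if total_real else 1.0
        points.append((t, precision, recall))
    return points


scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
points = sweep(scored, [0.3, 0.5, 0.85])
for t, precision, recall in points:
    print(f"threshold={t:.2f} precision={precision:.2f} recall={recall:.2f}")
```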
28 One Final Issue
Which (compiler) model should we apply?
Reverse the Problem
29 Productization
Analysis mini tool
Annotated with SymtabAPI
Looks like normal binary
Optional gap parsing enabled at runtime
30 References
Rosenblum, Zhu, Miller, Hunt. Learning to Analyze Binary Computer Code. Proceedings of the 23rd Conference on Artificial Intelligence (AAAI '08). July, 2008.
www.paradyn.org/html/publications-by-year.html