Title: Recovery of Variables and Heap Structure in x86 Executables
1. Recovery of Variables and Heap Structure in x86 Executables
- Gogul Balakrishnan
- Thomas Reps
- University of Wisconsin
2. Overview
- Introduction
- Challenges
- Background
- Recovering A-locs via Iteration
- An Abstraction for Heap-Allocated Storage
- Experiments
3. Introduction
- The Need for Analyzing Executables
- What You See Is Not What You eXecute
- Many Obstacles in Analyzing Executables
- Data Objects are Not Easily Identifiable.
- Absence of Symbol-Table and Debugging Information
- Determining the Memory Addresses of Data Objects
- Difficult to Track the Flow of Data through Memory
- Challenging to Get Useful Information about the Heap
e.g., memset(password, '\0', len); free(password);
4. Challenges (1/3)
- Recovering Variable-like Entities
- The Layout of Memory is Known at Compile Time or Assembly Time (IDAPro Approach)
- To Recover y, the Set of Values that eax Holds at Instruction 5 Needs to be Determined.
Source:
    void main() {
        int x, y;
        x = 1;
        y = 2;
        return;
    }
Executable:
    proc main
    1  mov ebp, esp
    2  sub esp, 8
    3  mov [ebp-8], 1
    4  mov eax, ebp
    5  mov [eax-4], 2
    6  add esp, 8
    7  retn
5. Challenges (2/3)
- Granularity of Recovered Variable-like Entities
- Affects the Complexity and Accuracy of Subsequent Analyses
- The Structure of Heap-Allocated Objects
- Only the Size of the Allocated Block is Known.
- Using an Abstraction-Refinement Algorithm
6. Challenges (3/3)
- Resolving Virtual-Function Calls
- With the one-variable-per-malloc-site abstraction, a Definite Link between the Object and the Virtual-Function Table is Never Established (Weak Update).
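The difference between a strong and a weak update can be illustrated with a short Python sketch. It is only an illustration: the function names, the `site1.vptr` location, and the `&vtable_A` value are hypothetical and do not come from the slides.

```python
# Hypothetical abstract store: maps an abstract location to the set of
# values it may hold. "uninit" stands for whatever malloc left in the block.

def strong_update(store, loc, values):
    """Overwrite: sound only when loc stands for exactly one concrete cell."""
    new_store = dict(store)
    new_store[loc] = set(values)
    return new_store


def weak_update(store, loc, values):
    """Accumulate: required when loc summarizes many concrete cells."""
    new_store = dict(store)
    new_store[loc] = set(store.get(loc, set())) | set(values)
    return new_store


# One summary node per malloc site: the vtable-pointer field of every
# object allocated at the site shares a single abstract location.
summary = {"site1.vptr": {"uninit"}}
after_weak = weak_update(summary, "site1.vptr", {"&vtable_A"})
after_strong = strong_update(summary, "site1.vptr", {"&vtable_A"})
```

After the weak update the field may still hold the uninitialized value, so an indirect call through it cannot be resolved to a single table; a strong update would leave only `&vtable_A`.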
7. Background (1/6)
- Abstract Locations (A-locs)
- Memory Region
- A Set of Disjoint Memory Areas
- Represents a Group of Locations that have Similar Runtime Properties
- Abstract Locations
- Locations between Two Addresses/Offsets in a Memory-Region
- Addresses and Offsets are Statically Determined
8. Background (2/6)
- Abstract Locations (contd)
    proc main
    0   mov ebp,esp
    1   sub esp,40
    2   mov ecx,0
    3   lea eax,[ebp-40]
    L1: mov [eax],1
    5   mov [eax+4],2
    6   add eax,8
    7   inc ecx
    8   cmp ecx,5
    9   jl L1
    10  mov eax,[ebp-36]
    11  add esp,40
    12  retn
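As a rough sketch of how a-locs follow from statically determined offsets, the Python below cuts a procedure's frame at every offset that appears literally in the listing. The function name and the `var_<n>` naming scheme are assumptions made for illustration, not something the slides define.

```python
def alocs_from_offsets(offsets, region_lo, region_hi):
    """Split the byte range [region_lo, region_hi) at each known offset.

    Returns (name, start_offset, size_in_bytes) tuples. The var_<n> names
    are an assumption that mimics disassembler-style labels.
    """
    inside = (o for o in offsets if region_lo <= o < region_hi)
    cuts = sorted({region_lo, *inside})
    alocs = []
    for start, end in zip(cuts, cuts[1:] + [region_hi]):
        alocs.append((f"var_{abs(start)}", start, end - start))
    return alocs


# Offsets that appear literally in the listing: [ebp-40] and [ebp-36].
# The frame spans offsets -40 .. -1 because of "sub esp,40".
locals_ = alocs_from_offsets([-40, -36], -40, 0)
```

Under these assumptions the frame yields a 4-byte a-loc at offset -40 and a single 36-byte a-loc at offset -36, which hides the array-of-structures layout of the remaining bytes.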
9. Background (3/6)
- Value-Set Analysis (VSA)
- Combined Numeric Analysis and Pointer Analysis
- Over-Approximation of the Values that Each a-loc Holds at Each Program Point
- Value-Set
- The Set of Addresses and Numeric Values
- N-tuple of Strided Intervals of the Form s[l, u], where N is the Number of Memory-Regions
- (Global Region, Procedure Region, ...)
- (1[0, 9], ⊥) versus (⊥, 8[-40, -8])
e.g., 8[-40, -8] = {-40, -32, -24, -16, -8}
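A strided interval s[l, u] can be modeled as a small Python class. This is a minimal sketch for illustration; the class name, its methods, and the use of `None` for a region with no addresses are assumptions rather than part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedInterval:
    stride: int  # s: distance between consecutive values (0 for a singleton)
    lo: int      # l: smallest value
    hi: int      # u: largest value

    def concretize(self):
        """List every integer the strided interval represents."""
        if self.stride == 0:
            return [self.lo]
        return list(range(self.lo, self.hi + 1, self.stride))

    def contains(self, value):
        """Check membership without enumerating the whole set."""
        if self.stride == 0:
            return value == self.lo
        in_bounds = self.lo <= value <= self.hi
        return in_bounds and (value - self.lo) % self.stride == 0

    def __str__(self):
        return f"{self.stride}[{self.lo}, {self.hi}]"


# A value-set with one entry per memory-region. None marks a region in
# which the value denotes no address (an assumption of this sketch).
eax_value_set = {"Global": None, "AR_main": StridedInterval(8, -40, -8)}
offsets = eax_value_set["AR_main"].concretize()
```

Here `offsets` enumerates the five stack offsets from the example, while an offset such as -36 is correctly excluded because it does not lie on the stride.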
10. Background (4/6)
- Value-Set Analysis (contd)
- The Value-Set of eax at L1
- (⊥, 8[-40, -8])
- eax Holds the Offsets -40, -32, -24, -16, -8
- Starting Addresses of Field x of the Elements of p
Executable:
    proc main
    0   mov ebp,esp
    1   sub esp,40
    2   mov ecx,0
    3   lea eax,[ebp-40]
    L1: mov [eax],1
    5   mov [eax+4],2
    6   add eax,8
    7   inc ecx
    8   cmp ecx,5
    9   jl L1
    10  mov eax,[ebp-36]
    11  add esp,40
    12  retn
Source:
    typedef struct {
        int x, y;
    } Point;

    int main() {
        int i;
        Point p[5];
        for (i = 0; i < 5; i++) {
            p[i].x = 1;
            p[i].y = 2;
        }
        return p[0].y;
    }
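The offsets that eax takes at L1 can be checked by running the loop concretely and then summarizing the observed values as a strided interval. Note that this is plain concrete execution followed by abstraction, not VSA itself, which computes the result by abstract interpretation; the function names are hypothetical.

```python
from math import gcd


def eax_values_at_L1():
    """Run the loop concretely, tracking eax as an offset from ebp."""
    seen = []
    ecx = 0              # 2:  mov ecx,0
    eax = -40            # 3:  lea eax,[ebp-40]
    while True:
        seen.append(eax)     # L1: mov [eax],1 executes with this eax
        eax += 8             # 6:  add eax,8
        ecx += 1             # 7:  inc ecx
        if not ecx < 5:      # 8-9: cmp ecx,5 / jl L1
            break
    return seen


def to_strided_interval(values):
    """Return (stride, lo, hi) covering all values with the largest stride."""
    lo, hi = min(values), max(values)
    stride = 0
    for v in values:
        stride = gcd(stride, v - lo)
    return stride, lo, hi


observed = eax_values_at_L1()
summary = to_strided_interval(observed)
```

The summary agrees with the value-set component 8[-40, -8] for the procedure's memory-region.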
11. Background (5/6)
- Aggregate Structure Identification (ASI)
- Can Distinguish between Accesses to Different Parts of the Same Aggregate
- Aggregate is Broken Up into Smaller Parts (Atoms)
- Data-Access Constraint Language (DAC)
- Specifies the Data-Access Patterns in the Program
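The idea of breaking an aggregate into atoms can be sketched as cutting a byte range at every boundary of an observed access. This is a deliberately simplified illustration with a hypothetical function name; the actual ASI algorithm works on data-access constraints and is not reproduced here.

```python
def split_into_atoms(size, accesses):
    """Cut bytes 0..size-1 at every boundary of an accessed range.

    accesses: iterable of inclusive (lo, hi) byte ranges.
    Returns the resulting atoms as inclusive (lo, hi) ranges.
    """
    cuts = {0, size}
    for lo, hi in accesses:
        cuts.add(lo)
        cuts.add(hi + 1)
    bounds = sorted(c for c in cuts if 0 <= c <= size)
    return [(a, b - 1) for a, b in zip(bounds, bounds[1:])]


# An 8-byte aggregate accessed as two 4-byte fields.
point_atoms = split_into_atoms(8, [(0, 3), (4, 7)])

# Overlapping accesses force smaller atoms.
overlap_atoms = split_into_atoms(12, [(0, 3), (2, 5)])
```

Bytes that are never accessed separately stay together in one atom, which keeps the recovered structure as coarse as the access patterns allow.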
12. Background (6/6)
- Aggregate Structure Identification (contd)
- Data-Access Constraint Language (DAC)
- DataRef[l:u] refers to bytes l through u in DataRef
- DataRef\n: n is the number of elements
e.g., P[0:11]\3 refers to P[0:3], P[4:7], or P[8:11]
- ASI DAG
[Figure: ASI DAG; visible label: return_main]
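The array form of a data reference can be expanded mechanically. The sketch below, with a hypothetical helper name, lists the element ranges that a reference of n equal-sized elements over bytes lo through hi may denote.

```python
def array_elements(lo, hi, n):
    """Element ranges of a reference that views bytes lo..hi as n elements.

    Returns inclusive (lo, hi) byte ranges, one per element.
    """
    total = hi - lo + 1
    if n <= 0 or total % n != 0:
        raise ValueError("range does not divide into n equal elements")
    size = total // n
    return [(lo + i * size, lo + (i + 1) * size - 1) for i in range(n)]


# Bytes 0 through 11 viewed as an array of 3 elements.
elements = array_elements(0, 11, 3)
```

Each tuple is one candidate sub-range; the reference stands for any one of them rather than for all of them at once.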
13. Recovering A-locs via Iteration
- Limitations of VSA
- Each a-loc Can Only Represent a Contiguous Sequence of Memory Locations
- Cannot Detect Internal Substructure
- Basic Idea
- VSA is used to obtain the memory-access patterns in the executable
- ASI is used as a heuristic to determine a set of a-locs from the memory-access patterns recovered by VSA.
14. Recovering A-locs via Iteration
- Generating Data-Access Constraints from Value-Sets
Input: (r, s[l, u], length)
Output: (ASI Ref, Boolean)
AR_main[-40:-33][0:7]
AR_main[-32:-25][0:7]
AR_main[-24:-17][0:7]
AR_main[-16:-9][0:7]
AR_main[-8:-1][0:7]
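One way such a conversion could work is sketched below. This is an assumption-laden illustration and not the algorithm from the slides: the function name `si2asi`, the rule that the element size equals the stride, and the fallback to a single covering range are all choices made for this example.

```python
def si2asi(r, si, length):
    """Return (asi_ref, exact) for `length`-byte accesses at offsets si.

    r is the name of a region or a-loc; si is a (stride, lo, hi) tuple.
    exact is False when the reference over-approximates the accesses.
    """
    stride, lo, hi = si
    if stride == 0 or lo == hi:
        # Singleton: one fixed byte range.
        return f"{r}[{lo}:{lo + length - 1}]", True
    if stride >= length:
        # Array of stride-sized elements; each access touches the
        # first `length` bytes of one element.
        count = (hi - lo) // stride + 1
        ref = f"{r}[{lo}:{hi + stride - 1}]\\{count}[0:{length - 1}]"
        return ref, True
    # Accesses overlap: fall back to one range covering every byte touched.
    return f"{r}[{lo}:{hi + length - 1}]", False


# 4-byte writes through eax = 8[-40, -8] in the region AR_main.
ref, exact = si2asi("AR_main", (8, -40, -8), 4)
```

In this sketch the Boolean result tells the caller whether the reference describes the accesses precisely or only bounds them.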
15. Recovering A-locs via Iteration
- Generating Data-Access Constraints from Value-Sets
<Algorithm 2>
if (s1[l1, u1] or s2[l2, u2] is a singleton) then
    return SI2ASI(r, s1[l1, u1] + s2[l2, u2], length)
end if
if s1 >= (u2 - l2 + length) then
    baseSI <- s1[l1, u1]
    indexSI <- s2[l2, u2]
else if s2 >= (u1 - l1 + length) then
    baseSI <- s2[l2, u2]
    indexSI <- s1[l1, u1]
else
    return SI2ASI(r, s1[l1, u1] + s2[l2, u2], length)
end if
<baseRef, exactRef> <- SI2ASI(r, baseSI, stride(baseSI))
if exactRef is false then
    return SI2ASI(r, s1[l1, u1] + s2[l2, u2], length)
else
    return concat(baseRef, SI2ASI("", indexSI, length))
end if
Here + denotes addition of strided intervals.
Determine base register: the two stride tests select which operand is the base and which is the index.
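The base/index selection can be traced in Python. The sketch below is an illustration under stated assumptions: `si2asi` is a simplified stand-in for the conversion routine, `si_add` ignores machine-word overflow, and the exactness flag returned for the concatenated reference is a guess, since the slide does not say how it is computed.

```python
from math import gcd


def si2asi(r, si, length):
    """Simplified stand-in: (stride, lo, hi) plus length to (ref, exact)."""
    stride, lo, hi = si
    if stride == 0 or lo == hi:
        return f"{r}[{lo}:{lo + length - 1}]", True
    if stride >= length:
        count = (hi - lo) // stride + 1
        ref = f"{r}[{lo}:{hi + stride - 1}]\\{count}[0:{length - 1}]"
        return ref, True
    return f"{r}[{lo}:{hi + length - 1}]", False


def si_add(a, b):
    """Strided-interval addition, ignoring overflow."""
    return gcd(a[0], b[0]), a[1] + b[1], a[2] + b[2]


def sum_to_asi(r, a, b, length):
    """Reference for `length`-byte accesses at the offsets a + b."""
    (s1, l1, u1), (s2, l2, u2) = a, b
    if l1 == u1 or l2 == u2:
        # One operand is a constant: no base/index distinction needed.
        return si2asi(r, si_add(a, b), length)
    if s1 >= (u2 - l2 + length):
        base, index = a, b
    elif s2 >= (u1 - l1 + length):
        base, index = b, a
    else:
        return si2asi(r, si_add(a, b), length)
    # Elements of the outer array are as large as the base stride.
    base_ref, exact = si2asi(r, base, base[0])
    if not exact:
        return si2asi(r, si_add(a, b), length)
    index_ref, index_exact = si2asi("", index, length)
    return base_ref + index_ref, index_exact


# Base walks 8-byte elements; index selects one of two 4-byte fields.
nested = sum_to_asi("AR_main", (8, -40, -8), (4, 0, 4), 4)
```

The operand whose stride is large enough to contain every access made through the other operand is treated as the base, so the result describes an array whose elements have internal structure.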
16. Recovering A-locs via Iteration
- Interpreting Indirect Memory-References
- Lookup Algorithm
- NodeDesc: <name, length>
- name: the name associated with the ASI tree node
- length: the length of that node
- NodeDescList: An Ordered List of NodeDesc, e.g., [nd1, nd2, ..., ndn]
- Three Operations
17. Recovering A-locs via Iteration
- Lookup Algorithm Examples
18. An Abstraction for Heap-Allocated Storage
- Previous Abstraction
- All of the nodes allocated at a given allocation site s are folded together into a single summary node n_s.
- Recency Abstraction
- Allows VSA and ASI to Recover Information about Virtual-Function Tables
- Uses Two Memory-Regions per Allocation Site s
- MRAB[s]: Most Recently Allocated Block
- NMRAB[s]: Non-Most Recently Allocated Blocks
- count: How Many Concrete Blocks the Memory-Region Represents (MRAB[s].count, NMRAB[s].count)
- SmallRange = {[0, 0], [0, 1], [1, 1], [0, ∞], [1, ∞], [2, ∞]}
- size: Over-Approximation of the Size of the Block (MRAB[s].size, NMRAB[s].size)
19. An Abstraction for Heap-Allocated Storage
- Operation
- AbsEnvs: MRAB[s]/NMRAB[s] → <count, size, alocEnv>
- AlocEnv: a-loc → ValueSet
- Allocation site s transforms absEnv to absEnv'
- absEnv'(MRAB[s]) = <[0, 1], size, a-loc.Value-Set>
- absEnv'(NMRAB[s]).count = absEnv(NMRAB[s]).count + absEnv(MRAB[s]).count
- absEnv'(NMRAB[s]).size = absEnv(NMRAB[s]).size ⊔ absEnv(MRAB[s]).size
- absEnv'(NMRAB[s]).alocEnv = absEnv(NMRAB[s]).alocEnv ⊔ absEnv(MRAB[s]).alocEnv
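The allocation-site transformer can be sketched in Python as follows. Several details are assumptions made for this illustration: counts are (lo, hi) pairs rounded back into SmallRange after addition, sizes are plain (lo, hi) intervals joined by taking their hull, a-loc environments are dictionaries of value sets joined by pointwise union, and a region that represents no block yet is stored as `None`.

```python
INF = float("inf")


def sr_add(a, b):
    """Add two counts and round the result back into SmallRange."""
    lo = min(a[0] + b[0], 2)
    hi = a[1] + b[1]
    if hi >= 2:
        hi = INF
    return lo, hi


def fold(nmrab, mrab):
    """Fold the old MRAB into NMRAB; None means no blocks so far."""
    if nmrab is None:
        return mrab
    if mrab is None:
        return nmrab
    keys = nmrab["aloc_env"].keys() | mrab["aloc_env"].keys()
    return {
        "count": sr_add(nmrab["count"], mrab["count"]),
        "size": (min(nmrab["size"][0], mrab["size"][0]),
                 max(nmrab["size"][1], mrab["size"][1])),
        "aloc_env": {k: nmrab["aloc_env"].get(k, set())
                        | mrab["aloc_env"].get(k, set()) for k in keys},
    }


def allocate(abs_env, site, size):
    """Transformer for an allocation of `size` (lo, hi) bytes at `site`."""
    new_env = dict(abs_env)
    new_env[("NMRAB", site)] = fold(abs_env.get(("NMRAB", site)),
                                    abs_env.get(("MRAB", site)))
    # Fresh block: count [0, 1] as on the slide, nothing known about contents.
    new_env[("MRAB", site)] = {"count": (0, 1), "size": size, "aloc_env": {}}
    return new_env


env = {}
for block_size in [(8, 8), (8, 8), (16, 16)]:
    env = allocate(env, "s", block_size)
```

After three allocations at the same site, MRAB still stands for at most one block, so writes to it can be treated as strong updates, while NMRAB accumulates the older blocks with a count that has widened to an unbounded range.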
20. An Abstraction for Heap-Allocated Storage
21. Experiments
22. Experiments
- Results of Virtual-Function Call Resolution
23. Experiments
- Results of A-loc Identification
- Comparing the Results of the Algorithm with Debugging Information
The structure of 87% of the local variables is correct.
24. Experiments
- Results of A-loc Identification
The structure of 72% of the objects in the heap is correct.
25. Q&A