Title: Intelligent Detection of Malicious Script Code
1. Intelligent Detection of Malicious Script Code
- CS194, 2007-08
- Benson Luk
- Eyal Reuveni
- Kamron Farrokh
- Advisor: Adnan Darwiche
- Sponsored by Symantec
2. Outline for Project
- Phase I: Setup
  - Set up machine for testing environment
  - Ensure that whitelist is clean
- Phase II: Crawling
  - Modify crawler to output only necessary data. This means:
    - Grab only necessary information from web-crawling results
    - Listen in on Internet Explorer's JavaScript interpreter and output relevant behavior
- Phase III: Database
  - Research and develop an effective structure for storing data, and link it to the webcrawler
- Phase IV: Analysis
  - Research trends for normalcy and investigate possible heuristics
3. Approach to Project
- First Quarter: Infrastructure
- Second Quarter: Data Gathering
- Third Quarter: Data Analysis
- (Note: some overlap between quarters)
4. Infrastructure
- Internet Explorer 7, Windows XP SP2 Professional
  - Main testing environment
- Norton Antivirus
  - Protects against malicious files and scripts
  - Can access logs to determine which sites launched attacks
  - Integrated into automated site visiting
5. Infrastructure
- CanaryCallback.dll
  - Plug-in for Internet Explorer
  - Able to access most data received by the low-level JavaScript interpreter:
    - The function being called (DISPID)
    - The class that the function belongs to (GUID)
    - The list of types and values of parameters passed into the function. Examples:
      - VT_I4: 4-byte integer
      - VT_BSTR: byte string
      - VT_DISPATCH: object
- A large part of the first and second quarters was spent programming, debugging, and maintaining the functions that handle this data:
  - Functions to grab data types
  - Functions to parse data values (some stored in bitstreams)
  - Functions to output data to file
  - If types did not have an obvious output format (e.g. VT_DISPATCH), we had to create one that would accurately represent as many components of the data as possible
6. Infrastructure
- Python
  - Scripting language
  - Designed to handle parsing with ease
  - The infrastructure script performed three tasks:
    - Launch Internet Explorer (using the cPAMIE engine), load a website, then close Internet Explorer
    - Access and parse Norton's web attack logs for any attacks launched by the website
    - Sort script data from the CanaryCallback DLL based on DLL data and attack logs (Was there an attack? Did any scripts run? Etc.)
- Heritrix
  - Open-source webcrawler with high customizability
  - Can run specific crawls that target a set of domains, and output minimal information
  - Uses HTTP requests; does not render crawled sites
  - The purpose is to gather as many URLs with scripts as possible for a large sample base
7. Infrastructure: Crawler
- Step 0: URL queue is seeded with the domain list
- Step 1: Crawler grabs a URL from the queue
- Step 2: Crawler grabs the source from the URL (WWW)
- Step 3: Append URLs to the log data and URL queue iff they satisfy our set of rules (Heritrix raw data)
- Step 4: Python parser gets rid of excess data, leaving only URL information for each site, and outputs it to a new file (Heritrix parsed data)
- Repeat steps 1-4 until the crawl limit is reached.
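The Step 0-4 loop can be sketched as follows. This is an illustrative queue-and-filter skeleton only, with a stubbed `fetch_links` standing in for Heritrix's HTTP fetching and a trivial same-domain rule; none of these names come from Heritrix itself.

```python
# Minimal sketch of the crawl loop: seed a queue, pull URLs, collect links
# that pass the filter rule, and log only the URL information.
from collections import deque

PAGES = {  # stand-in for the live web, for demonstration
    "http://seed.example/": ["http://seed.example/a", "http://other.example/"],
    "http://seed.example/a": ["http://seed.example/b"],
    "http://seed.example/b": [],
}

def fetch_links(url):
    return PAGES.get(url, [])

def crawl(seeds, limit, rule):
    queue = deque(seeds)              # Step 0: seed the URL queue
    seen, log = set(seeds), []
    while queue and len(log) < limit:
        url = queue.popleft()         # Step 1: grab URL from queue
        log.append(url)               # Step 4: keep only URL info
        for link in fetch_links(url): # Step 2: grab source from URL
            if rule(link) and link not in seen:  # Step 3: apply rules
                seen.add(link)
                queue.append(link)
    return log

urls = crawl(["http://seed.example/"], limit=10,
             rule=lambda u: u.startswith("http://seed.example/"))
```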
8. Infrastructure: Gatherer
- Step 1: Python script grabs a site from the crawl data (Heritrix parsed data)
- Step 2: cPAMIE component loads Internet Explorer 7 and sends it to the specified site
- Step 3: IE7's JavaScript interpreter outputs all DLL data to a file (CanaryCallback data)
- Step 4: IE7 informs PAMIE that it is finished; Python kills IE7
- Step 5: Python analyzes the callback data and Norton Antivirus logs to decide whether a site is clean, dirty, or has no scripts
- Step 6: Python outputs sorted and formatted data to relevant files for future analysis (formatted output)
- Repeat steps 1-6 until the URL list is exhausted.
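The Step 5 decision can be sketched as a small labeling function. The three labels come from the slide; the function signature and inputs (a script-call count and an attack flag) are illustrative assumptions about how the controller might summarize the two log sources.

```python
# Sketch of the Step 5 triage: combine CanaryCallback activity with the
# Norton attack log to label each visited site.

def classify_site(script_calls, attack_logged):
    """Label a site from its callback data and antivirus log."""
    if script_calls == 0:
        return "no scripts"          # nothing for the interpreter to report
    return "dirty" if attack_logged else "clean"
```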
9. Data gathering
- Heritrix crawls
  - First crawl: 5 seeds, depth 5
    - 5 million sites found
  - Second crawl: 10 seeds, depth 3
    - 3 million sites found
  - Third crawl: 200 seeds, depth 1
    - 18,500 sites found
  - Fourth crawl: 200 seeds, depth 2
    - 3 million sites found
  - The first two crawls produced data that was biased towards large, interlinked sites; the last two broad crawls were run to remedy this.
- CanaryCallback gathering
  - For the first and second crawls, a chosen set of roughly 1,000 sites was run through the gatherer component.
  - For the third crawl, all 18,500 sites were processed by the gatherer.
  - For the fourth crawl, several tasks were performed:
    - 20,000 sites were processed by the gatherer
    - In mid-May, the same 1,000 sites were processed 28 times (about 4 times per day) from May 7 to May 13
10. Data analysis setup
- CanaryCallback data analysis
  - Main choice for parsing data was the Python scripting language
    - Too much data for MS Access or even MySQL
  - Python scripts were developed to facilitate analysis in a manner similar to SQL:
    - Scripts to aggregate data sets and frequencies
    - Scripts to calculate various metrics of data sets, such as:
      - Smallest data point
      - Largest data point
      - Average data point
      - Variance of data points
      - Total data points
      - Sum of data points
    - Scripts to output to file in Excel spreadsheet format (CSV) for deeper analysis
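The metric scripts above can be sketched with the standard library alone. The six metrics match the slide's list; the function names and the one-row-per-set CSV layout are illustrative choices, not the project's actual scripts.

```python
# Sketch of the SQL-like metric scripts: compute the six per-set metrics
# listed above, then emit one CSV row per data set for spreadsheet analysis.
import csv
import io

def metrics(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    return {"min": min(values), "max": max(values), "avg": mean,
            "var": var, "count": n, "sum": sum(values)}

def to_csv(name, m):
    buf = io.StringIO()
    csv.writer(buf).writerow(
        [name, m["min"], m["max"], m["avg"], m["var"], m["count"], m["sum"]])
    return buf.getvalue()

m = metrics([2, 4, 4, 4, 5, 5, 7, 9])
```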
11. Individual data analysis
- The third quarter and last half of the second quarter were spent focusing on as wide a range of data as possible
- To accomplish this, our group split up, and each member pursued a different line of research individually
- Individual presentations will follow:
  - Eyal: Activity categorization
  - Benson: Integer argument trend analysis
  - Kamron: Byte string argument trend analysis
12. Activity Categorization
13. Activity Analysis
- There is an obvious connection between a function and the site using it
- Is it possible to quantify this relationship, and establish whether certain functions are used in a specific kind of site?
- Characterize a site based on how active it is, i.e., how many function calls are made while the site is loaded
- Does there exist a pattern in the data that can distinguish an abnormal usage of any function based on the characteristics of the site?
14. Site Function Usage Statistics
- Total number of sites: 14,848
- Average function calls per site: 5,777
- Average function calls per function: 1,984
- Standard deviation of function calls per function: 25,493
- Standard deviation of function calls per site: 14,181
  - Minus outliers: none
- Three standard deviations below: 0
- Two standard deviations below: 0
- One standard deviation below: 12,086
- One standard deviation above: 1,633
- Two standard deviations above: 510
- Three standard deviations above: 296
- Normal distribution outliers: 323
- Median: 1,456
- First quartile: 438
- Third quartile: 4,029
- Interquartile range: 3,591
  - Minus outliers: none
- Lower whisker starts at: 0
- Upper whisker ends at: 9,365
- Box-and-whisker outliers: 2,048
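The box-and-whisker figures above can be sketched as follows. The 1.5×IQR whisker rule is a common convention assumed here (the deck does not state which rule it used), and the nearest-index percentile is a simplification for illustration.

```python
# Illustrative sketch of the box-and-whisker statistics: quartiles, IQR,
# whiskers at 1.5 * IQR (an assumed convention), and outliers beyond them.

def quartiles(values):
    s = sorted(values)
    def pct(p):  # simple nearest-index percentile, for illustration
        i = max(0, min(len(s) - 1, round(p * (len(s) - 1))))
        return s[i]
    return pct(0.25), pct(0.5), pct(0.75)

def box_outliers(values):
    q1, _med, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```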
15. Correlation analysis
- Related each function to the site calling it using the number of function calls on that site
- Each tuple consisted of the number of times a function was called at a particular site, and the total number of function calls made at that site
- The correlation between the variables in the tuple was computed for each individual function
- Many functions were not common, so not enough data was available to make a conclusion about them
- For the functions that had enough (over 100) sites calling them, the correlation values were between 0.004 and -0.01, showing no correlation between the function and the script activity of the site calling it
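The per-function correlation can be sketched with Pearson's r over the (function calls at site, total calls at site) tuples. The formula is standard; the sample tuples below are made-up values for illustration, not project data.

```python
# Sketch of the per-function correlation: Pearson's r between how often one
# function is called at each site and that site's total script activity.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# per-site (function calls, total calls) tuples for one hypothetical function
tuples = [(1, 10), (2, 500), (1, 9000), (3, 40), (2, 7000)]
r = pearson([t[0] for t in tuples], [t[1] for t in tuples])
```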
16. Function Usage Amount
- An interesting trend arose when analyzing the correlation data
- There are functions that are called hundreds or thousands of times
- Despite this, sites seem to call a specific function only a couple of times.
- Example:
  - GUID 3050f3fd-98b5-11cf-bb82-00aa00bdec0b, DISPID 1
  - Called 346 times; in only 11 sites is it called more than 3 times (3.2%)
17. (No Transcript)
18. Categorization Approach
- Since no correlation was found, another approach was taken
- According to trends in the script activity data, divide the sites into distinct categories
- Examine the function behavior in each category, as opposed to individual sites
- Three categories were chosen, split roughly at the median and the end of the third quartile
  - This gave one category 50% of the data, while the other two each had 25% of the data
  - An attempt to avoid bias toward the extremely script-heavy sites
19. Categorization Heuristic
- A heuristic was developed to determine whether a function would be more likely to appear in a certain category:
- F = [(avgl - avgsite)(L - avgfunc) + (avgm - avgsite)(M - avgfunc) + (avgh - avgsite)(H - avgfunc)] / 3
- avgl, avgm, and avgh are the average number of function calls per category (542, 2,882, and 22,745 respectively)
- avgsite is the overall average number of function calls per site (5,777)
- avgfunc is the average number of function calls per function (1,984)
- L, M, and H are the specific number of times the function was called in the low, medium, and high categories
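The heuristic can be sketched directly from the slide's constants. Note one assumption: the slide's formula lost its operators in transcription, so the sum of the three (category - site average) x (count - function average) products is a reconstruction, not a confirmed reading.

```python
# Sketch of the categorization heuristic F using the per-category and overall
# averages given on this slide. The sum-of-products form is an assumption.

AVG_L, AVG_M, AVG_H = 542, 2882, 22745   # avg function calls per category
AVG_SITE, AVG_FUNC = 5777, 1984          # overall per-site / per-function avg

def heuristic(L, M, H):
    """F for one function, from its call counts in the low/mid/high bins."""
    return ((AVG_L - AVG_SITE) * (L - AVG_FUNC)
            + (AVG_M - AVG_SITE) * (M - AVG_FUNC)
            + (AVG_H - AVG_SITE) * (H - AVG_FUNC)) / 3
```

With this form, a function called exactly the average number of times in every category scores F = 0, and heavy usage in the high-activity category pushes F upward (its category average sits above the site average).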
20. Statistical Variation Among Categories
- The heuristic separated the functions into three distinct sections
- Along the higher values were mostly functions that had few arguments supplied
- In the middle, whole objects were represented (a GUID, and all of its related function calls)
- At the lowest (negative) values were functions that were commonly called with arguments
21. Argument Distributions
- A further analysis was done on whether there exists a difference in the behavior of a function in the separate categories
- The distributions of BSTR (byte string) lengths and I4 (4-byte integer) values were considered
- Several functions were examined, but this specific one (referred to as "Second", as it had the second-highest heuristic value) is exemplary of the trends noticed
- The argument type frequency of Second:
    Category   0-arg      I4   BSTR    DISPATCH   NULL   BOOL
    LOW           20,713   0    2,634         14      0      0
    MID          170,861   0    9,888          1      0      0
    HIGH       1,215,964   0    9,447         19      0      0
22. (No Transcript)
23. (No Transcript)
24. (No Transcript)
25. Conclusions of Approach
- The trend seen is that there is no major statistical difference in the argument value distribution among the categories, but there are distinct characteristic differences
- Functions that appear more commonly in less-active sites tend to have arguments supplied to them
- No general correlation exists between functions and how active the sites calling them are
- There may exist correlation in some other characteristic, however
26. Integer analysis
27. Functions through Three Sets
- Looked through 3 of the runs:
  - 5 seeds, depth 5: 1,324 sites
  - 10 seeds, depth 3: 1,184 sites
  - 200 seeds, depth 1: 15,790 sites
- Picked the three most common functions with integer arguments from the first run to analyze
- Goal: look for consistency in function behavior across differing sets of sites
28. Functions through Three Sets
- In all three data sets, the values of the argument had a very large range, from 0 to the millions or billions
- Distributions did not stay consistent through the sets; each had different commonly occurring values
29. Functions through Three Sets
- Similar pattern in all 3 sets
- Low values were used
- Numbers near 0 were most common; occurrences drop off as values get larger
30. Functions through Three Sets
- Values range from 0 into the hundreds
- The second data set did not have enough data
- Similar common numbers in both remaining sets: 3, 300, and 728
31. Patterns in DISPID Usage
- Looked at what DISPIDs were used, without regard to the GUIDs of the calling classes
- DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
- Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
- 205 of the functions did not fall within these groups, and instead were one of 6 other numbers
- Within each of the four ranges, occurrences at specific numbers formed patterns
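The four-range grouping can be sketched as a bucketing pass. The range bounds are taken from the following slides; the `bucket` function, range names, and sample DISPIDs are illustrative.

```python
# Sketch of bucketing DISPIDs into the four ranges reported on the next
# slides and counting occurrences per range (bounds from the deck).

RANGES = {
    "first":  (3_000_000, 3_001_286),
    "second": (0, 2_313),
    "third":  (-2_147_417_109, -2_147_411_105),
    "fourth": (10_001, 10_087),
}

def bucket(dispid):
    for name, (lo, hi) in RANGES.items():
        if lo <= dispid <= hi:
            return name
    return "other"

counts = {}
for d in [3_000_001, 0, 1_103, 10_001, -2_147_417_000, 5_000_000]:
    counts[bucket(d)] = counts.get(bucket(d), 0) + 1
```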
32. DISPID Usage: First Range
- The most common range for DISPIDs: 3,000,000-3,001,286
- 490,201 functions, about 66%
- 1,067 out of 1,286 different numbers used
- Numbers nearer to 3 million were most common; higher numbers were used less
33. DISPID Usage: Second Range
- Second most common range for DISPIDs: 0-2,313
- 164,224 functions, about 22%
- 39 numbers in this range were used
- 0 and 1,103 were the most common
- Numbers clumped around 5 groups: 0-9, 127-154, 1002-1168, 1500-1504, and 2001-2015, with 2313 being an exception
34. DISPID Usage: Third Range
- Third range for DISPIDs: -2,147,417,109 to -2,147,411,105
- 50,541 functions, about 7%
- 55 numbers in this range were used
- Most occurrences were around numbers ending in round thousands
35. DISPID Usage: Fourth Range
- Fourth range for DISPIDs: 10,001-10,087
- 38,099 functions, about 5%
- 75 numbers out of the range were used
- Uniquely used by GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b
- DISPIDs 10,001-10,007 are the most common
36. Patterns in DISPID Usage
- Looked at what DISPIDs were used, without regard to the GUIDs of the calling classes
- DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
- Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
- Within each of the four ranges, occurrences at specific numbers formed patterns
37. Functions with Multiple Integers
- Looked for patterns in the relations among the integer arguments of functions taking multiple arguments
- Not very many functions fell into this category
- One took two arguments; the first was always 0
- One took two arguments that were always equal; arguments ranged from (1,1) to (31,31) and (1908,1908)
  - All came from 2 signup pages on a particular website
- Two took two differing arguments; no relation between the arguments could be found
- Other functions did not have a large enough sample size
38. Functions with Multiple Integers
- The function itself had consistent patterns in the values it took: 95% of arguments were (1,1) or (3,2)
- No consistent relations between arguments
39. Function Pairs
- Examined:
  - GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b
  - DISPIDs 10001-10062
- Out of 38,099 occurrences, 3,595 were followed by:
  - GUID c59c6b12-f6c1-11cf-8835-00a0c911e8b2
  - DISPID 0
- The second function had no independent occurrences
- Similar arguments:
  - The first function took a variety of numbers and types of arguments
  - The second function always took a DISPATCH argument, followed by the same arguments as the first function
40. Conclusions of Approach
- Function arguments through sets
  - There seem to be consistent patterns in certain functions
  - Range, values taken, common values, value distribution
- DISPID usage
  - 4 ranges with very few exceptions
  - Common subranges or distribution patterns within each range
- Multiple arguments
  - Uncommon type of function
  - No noticeable relations in arguments
- Function pairs
  - Dependent functions have clear patterns
    - Function position
    - Argument types and values
  - Only one example; do more exist?
41. Byte string analysis
42. Byte String Analysis
- Buffer overflows are a common method of exploiting a targeted system
- One method: create a very long string to break boundary checking, then append shellcode at the end to inject into the assembly code
- We are interested in the length of BSTR objects fed into given functions
- For any given API, what is considered a normal string length?
43. Class-based analysis
- Initial analyses were done on a class-by-class basis
- Samples were grouped together and analyzed according to GUID
- Byte strings are typically very small
  - More than 70% of the commonly called JavaScript classes typically received byte strings of less than length 20 (39 out of 55 functions from this crawl)
  - Less than 10% of these ever receive a string greater than 5,000 characters in length (4 out of 55 functions from this crawl)
44. Class-based analysis
- Analysis of individual classes shows the same trend toward smaller strings
- However, analyzing based on classes groups the byte strings of all class functions together, which results in inaccuracy and lost information
45. Parameter-based analysis
- The second analysis split samples into the individual arguments of the unique functions of each class
- Given a sample set with values in the interval (a, b), with average µ and standard deviation s, we expect values to largely lie within the interval (µ - s, µ + s)
- We also expect (µ - s, µ + s) to be smaller than (a, b)
- The smaller (µ - s, µ + s) is in proportion to (a, b), the more well-defined our sample set becomes
46. Parameter-based analysis
- Length of expected interval: 2s
- Length of entire interval: n = b - a + 1
- 2s/n represents the ratio of the expected interval to the entire interval
- Since 2s ≤ n, we have 0 ≤ 2s/n ≤ 1
- When 2s/n = 0, s = 0 and all values in the data set are equal
- When 2s/n = 1, s = n/2 and all values in the data set equal either a or b
- As 2s/n goes from 0 to 1, the shape of the graph begins to shift
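The ratio can be sketched directly from these definitions. This is a minimal illustration using the population standard deviation; the `ratio` function name and sample data are assumptions for the example.

```python
# Sketch of the 2s/n well-definedness ratio: twice the population standard
# deviation of the observed byte-string lengths, over the full interval
# length n = b - a + 1.
import math

def ratio(lengths):
    n_vals = len(lengths)
    mean = sum(lengths) / n_vals
    s = math.sqrt(sum((v - mean) ** 2 for v in lengths) / n_vals)
    interval = max(lengths) - min(lengths) + 1   # n = b - a + 1
    return 2 * s / interval
```

A degenerate sample where every length is equal gives ratio 0; a sample split between two far-apart lengths pushes the ratio toward 1.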
47. (No Transcript)
48. (No Transcript)
49. (No Transcript)
50. (No Transcript)
51. (No Transcript)
52. (No Transcript)
53. (No Transcript)
54.
- When the ratio is 0, the number of strings is typically low
- Otherwise, the ratio increases as the number of strings decreases
- The function arguments with the smallest non-zero ratio are the most well-defined
55. (No Transcript)
56. Analysis with pruning
- Only function arguments that see 9 or fewer strings are removed; however:
  - Most zero-ratio functions are pruned (2,607 to 731)
  - Many functions with ratio > 0.5 are pruned (1,540 to 883)
  - Functions with ratio < 0.5 are affected minimally (1,442 to 1,332)
57. Analysis with pruning
- Only function arguments that see 99 or fewer strings are removed; however:
  - Almost all zero-ratio functions are pruned (731 to 232)
  - Almost all functions with ratio > 0.5 are pruned (883 to 266)
  - Only some functions with ratio < 0.5 are affected (1,332 to 979)
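The pruning step can be sketched as a simple sample-size filter over the per-argument ratios. The data shapes and names here are illustrative; only the thresholds (keep arguments with at least 10, then at least 100, observed strings) come from the slides.

```python
# Sketch of the pruning step: drop function arguments whose observed string
# count falls at or below the threshold, keeping the survivors' ratios.

def prune(ratio_by_func, count_by_func, min_count):
    return {f: r for f, r in ratio_by_func.items()
            if count_by_func.get(f, 0) >= min_count}

# hypothetical per-argument ratios and sample counts
ratios = {"f1": 0.0, "f2": 0.62, "f3": 0.21}
counts = {"f1": 4, "f2": 150, "f3": 37}
kept = prune(ratios, counts, min_count=10)   # "9 or fewer" removed
```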
58. Analysis with pruning
As a function is seen in the wild more
frequently, the byte string lengths it takes in
begin to fall into specific intervals. Functions
with substantial evidence are well-defined in the
lengths of byte strings they tend to receive!
59. Comparing with malicious data
- Symantec provided us with test samples used for Canary testing
  - These samples trigger browser exploits but do not inject actual shellcode
  - The worst thing they can do is crash the browser
- Malicious samples fell into one of three categories:
  - Bad BSTR
  - Bad I4
  - Bad DISPATCH (object)
- Example: MSIE Popup Window Address Bar Spoofing Weakness
- Callback data
  - Compare with data from the May crawl
  - 491 strings seen over the 20,416 websites visited during that crawl
    - Smallest: 70
    - Largest: 80
    - Average: 76.32
    - Standard deviation: 2.33
    - Expected interval: (73.99, 78.65)
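The expected-interval comparison can be sketched as follows, restating the slide's statistics as constants. The anomaly check (flag a length outside mean ± one standard deviation) is the one-sigma interval the deck uses; the function names are illustrative.

```python
# Sketch of the expected-interval check: from the observed lengths' mean and
# standard deviation, flag a new byte-string length outside (mean - s, mean + s).

def expected_interval(mean, std):
    return (mean - std, mean + std)

def is_suspicious(length, mean, std):
    lo, hi = expected_interval(mean, std)
    return not (lo <= length <= hi)

# statistics for this function's 491 observed strings, as on the slide
lo, hi = expected_interval(76.32, 2.33)
```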
60. Trend volatility
- How does web activity change over time?
- 28 crawls of 1,000 sites were performed from May 7 to May 13 to investigate this
- Each crawl differs by several hundred thousand DLL calls
- The number of sites with actual scripts changes
61. Trend volatility
- These runs were done 5.5 hours apart
- Change is very slight
- Zero-ratio functions increase
- High-ratio functions decrease
62. Trend volatility
- These runs were done 1 day apart
- Change is also very slight
- Zero-ratio functions decrease
- Mid-ratio functions (R ≈ 0.5) increase
63. Trend volatility
- These runs were done 6 days apart
- Change is a little more apparent
- Zero-ratio functions decrease
- Mid-ratio functions (R ≈ 0.5) increase
64. Trend volatility
- The state of JavaScript activity on the Web is constantly changing
- Changes are somewhat unpredictable (and entirely dependent on the decisions of webmasters)
- These changes are not major in the long run; however, they still exist and need to be addressed
65. Conclusions of Approach
- Substantial evidence in favor of existing trends for byte string arguments
- This approach can be adapted to anything that can be quantified as a number
- Changes in the state of the web will require any heuristic developed to have at least a basic learning capability
- We plan to continue research over the summer