Title: Intelligent Detection of Malicious Script Code
1. Intelligent Detection of Malicious Script Code
- CS194, 2007-08
- Benson Luk
- Eyal Reuveni
- Kamron Farrokh
- Advisor: Adnan Darwiche
- Sponsored by Symantec
2. Outline for Project
- Phase I: Setup
  - Set up machine for testing environment
  - Ensure that whitelist is clean
- Phase II: Crawling
  - Modify crawler to output only necessary data. This means:
    - Grab only necessary information from web-crawling results
    - Listen in on Internet Explorer's JavaScript interpreter and output relevant behavior
- Phase III: Database
  - Research and develop an effective structure for storing data, and link it to the webcrawler
- Phase IV: Analysis
  - Research trends for normalcy and investigate possible heuristics
3. Approach to Project
- First Quarter: Infrastructure
- Second Quarter: Data Gathering
- Third Quarter: Data Analysis
- (Note: some overlap between quarters)
4. Infrastructure
- Internet Explorer 7, Windows XP SP2 Professional
  - Main testing environment
- Norton Antivirus
  - Protects against malicious files and scripts
  - Can access logs to determine which sites launched attacks
  - Integrated into automated site visiting
5. Infrastructure
- CanaryCallback.dll
  - Plug-in for Internet Explorer
  - Able to access most data received by the low-level JavaScript interpreter:
    - The function being called (DISPID)
    - The class that the function belongs to (GUID)
    - The list of types and values of parameters passed into the function. Examples:
      - VT_I4: 4-byte integer
      - VT_BSTR: byte string
      - VT_DISPATCH: object
- A large part of the first and second quarters was spent programming, debugging, and maintaining the functions that handle this data:
  - Functions to grab data types
  - Functions to parse data values (some stored in bitstreams)
  - Functions to output data to file
  - If types did not have an obvious output format (e.g. VT_DISPATCH), we had to create one that would accurately represent as many components of the data as possible
6. Infrastructure
- Python
  - Scripting language
  - Designed to handle parsing with ease
  - The infrastructure script performed three tasks:
    - Launch Internet Explorer (using the cPAMIE engine), load a website, then close Internet Explorer
    - Access and parse Norton's web attack logs for any attacks launched by the website
    - Sort script data from the CanaryCallback DLL based on DLL data and attack logs (Was there an attack? Did any scripts run? Etc.)
- Heritrix
  - Open-source webcrawler with high customizability
  - Can run specific crawls that target a set of domains, and output minimal information
  - Uses HTTP requests; does not render crawled sites
  - The purpose is to gather as many URLs with scripts as possible for a large sample base
7. Infrastructure: Crawler
- Step 0: URL queue is seeded with the domain list
- Step 1: Crawler grabs a URL from the queue
- Step 2: Crawler grabs the source from the URL (WWW)
- Step 3: Append URLs to the log data and URL queue iff they satisfy our set of rules (Heritrix raw data)
- Step 4: Python parser gets rid of excess data, leaving only URL information for each site, and outputs it to a new file (Heritrix parsed data)
- Repeat steps 1-4 until the crawl limit is reached.
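The Step 0-4 loop can be sketched as follows. This is an illustrative queue-and-filter skeleton only, with a stubbed `fetch_links` standing in for Heritrix's HTTP fetching and a trivial same-domain rule; none of these names come from Heritrix itself.

```python
# Minimal sketch of the crawl loop: seed a queue, pull URLs, collect links
# that pass the filter rule, and log only the URL information.
from collections import deque

PAGES = {  # stand-in for the live web, for demonstration
    "http://seed.example/": ["http://seed.example/a", "http://other.example/"],
    "http://seed.example/a": ["http://seed.example/b"],
    "http://seed.example/b": [],
}

def fetch_links(url):
    return PAGES.get(url, [])

def crawl(seeds, limit, rule):
    queue = deque(seeds)              # Step 0: seed the URL queue
    seen, log = set(seeds), []
    while queue and len(log) < limit:
        url = queue.popleft()         # Step 1: grab URL from queue
        log.append(url)               # Step 4: keep only URL info
        for link in fetch_links(url): # Step 2: grab source from URL
            if rule(link) and link not in seen:  # Step 3: apply rules
                seen.add(link)
                queue.append(link)
    return log

urls = crawl(["http://seed.example/"], limit=10,
             rule=lambda u: u.startswith("http://seed.example/"))
```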
8. Infrastructure: Gatherer
- Step 1: Python script grabs a site from the crawl data (Heritrix parsed data)
- Step 2: cPAMIE component loads Internet Explorer 7 and sends it to the specified site
- Step 3: IE7's JavaScript interpreter outputs all DLL data to a file (CanaryCallback data)
- Step 4: IE7 informs PAMIE that it is finished; Python kills IE7
- Step 5: Python analyzes the callback data and Norton Antivirus logs to decide whether a site is clean, dirty, or has no scripts
- Step 6: Python outputs sorted and formatted data to relevant files for future analysis (formatted output)
- Repeat steps 1-6 until the URL list is exhausted.
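The Step 5 decision can be sketched as a small labeling function. The three labels come from the slide; the function signature and inputs (a script-call count and an attack flag) are illustrative assumptions about how the controller might summarize the two log sources.

```python
# Sketch of the Step 5 triage: combine CanaryCallback activity with the
# Norton attack log to label each visited site.

def classify_site(script_calls, attack_logged):
    """Label a site from its callback data and antivirus log."""
    if script_calls == 0:
        return "no scripts"          # nothing for the interpreter to report
    return "dirty" if attack_logged else "clean"
```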
9. Data gathering
- Heritrix crawls
  - First crawl: 5 seeds, depth 5
    - 5 million sites found
  - Second crawl: 10 seeds, depth 3
    - 3 million sites found
  - Third crawl: 200 seeds, depth 1
    - 18,500 sites found
  - Fourth crawl: 200 seeds, depth 2
    - 3 million sites found
  - The first two crawls produced data that was biased towards large, interlinked sites; the last two broad crawls were run to remedy this.
- CanaryCallback gathering
  - For the first and second crawls, a chosen set of roughly 1,000 sites was run through the gatherer component.
  - For the third crawl, all 18,500 sites were processed by the gatherer.
  - For the fourth crawl, several tasks were performed:
    - 20,000 sites were processed by the gatherer
    - In mid-May, the same 1,000 sites were processed 28 times (about 4 times per day) from May 7 to May 13
10. Data analysis setup
- CanaryCallback data analysis
  - Main choice for parsing data was the Python scripting language
    - Too much data for MS Access or even MySQL
  - Python scripts were developed to facilitate analysis in a manner similar to SQL:
    - Scripts to aggregate data sets and frequencies
    - Scripts to calculate various metrics of data sets, such as:
      - Smallest data point
      - Largest data point
      - Average data point
      - Variance of data points
      - Total data points
      - Sum of data points
    - Scripts to output to file in Excel spreadsheet format (CSV) for deeper analysis
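The metric scripts above can be sketched with the standard library alone. The six metrics match the slide's list; the function names and the one-row-per-set CSV layout are illustrative choices, not the project's actual scripts.

```python
# Sketch of the SQL-like metric scripts: compute the six per-set metrics
# listed above, then emit one CSV row per data set for spreadsheet analysis.
import csv
import io

def metrics(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    return {"min": min(values), "max": max(values), "avg": mean,
            "var": var, "count": n, "sum": sum(values)}

def to_csv(name, m):
    buf = io.StringIO()
    csv.writer(buf).writerow(
        [name, m["min"], m["max"], m["avg"], m["var"], m["count"], m["sum"]])
    return buf.getvalue()

m = metrics([2, 4, 4, 4, 5, 5, 7, 9])
```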
11. Individual data analysis
- The third quarter and last half of the second quarter were spent focusing on as wide a range of data as possible
- To accomplish this, our group split up, and each member pursued a different line of research individually
- Individual presentations will follow:
  - Eyal: Activity categorization
  - Benson: Integer argument trend analysis
  - Kamron: Byte string argument trend analysis
12. Activity Categorization
13. Activity Analysis
- There is an obvious connection between a function and the site using it
- Is it possible to quantify this relationship, and establish whether certain functions are used in a specific kind of site?
- Characterize a site based on how active it is, i.e., how many function calls are made while the site is loaded
- Does there exist a pattern in the data that can distinguish an abnormal usage of any function based on the characteristics of the site?
14. Site Function Usage Statistics
- Total number of sites: 14,848
- Average function calls per site: 5,777
- Average function calls per function: 1,984
- Standard deviation of function calls per function: 25,493
- Standard deviation of function calls per site: 14,181
  - Minus outliers: none
- Three standard deviations below: 0
- Two standard deviations below: 0
- One standard deviation below: 12,086
- One standard deviation above: 1,633
- Two standard deviations above: 510
- Three standard deviations above: 296
- Normal distribution outliers: 323
- Median: 1,456
- First quartile: 438
- Third quartile: 4,029
- Interquartile range: 3,591
  - Minus outliers: none
- Lower whisker starts at: 0
- Upper whisker ends at: 9,365
- Box-and-whisker outliers: 2,048
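The box-and-whisker figures above can be sketched as follows. The 1.5×IQR whisker rule is a common convention assumed here (the deck does not state which rule it used), and the nearest-index percentile is a simplification for illustration.

```python
# Illustrative sketch of the box-and-whisker statistics: quartiles, IQR,
# whiskers at 1.5 * IQR (an assumed convention), and outliers beyond them.

def quartiles(values):
    s = sorted(values)
    def pct(p):  # simple nearest-index percentile, for illustration
        i = max(0, min(len(s) - 1, round(p * (len(s) - 1))))
        return s[i]
    return pct(0.25), pct(0.5), pct(0.75)

def box_outliers(values):
    q1, _med, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```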
15. Correlation analysis
- Related each function to the site calling it using the number of function calls on that site
- Each tuple consisted of the number of times a function was called at a particular site, and the total number of function calls made at that site
- The correlation between the variables in the tuple was computed for each individual function
- Many functions were not common, so not enough data was available to make a conclusion about them
- For the functions that had enough (over 100) sites calling them, the correlation values were between 0.004 and -0.01, showing no correlation between the function and the script activity of the site calling it
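The per-function correlation can be sketched with Pearson's r over the (function calls at site, total calls at site) tuples. The formula is standard; the sample tuples below are made-up values for illustration, not project data.

```python
# Sketch of the per-function correlation: Pearson's r between how often one
# function is called at each site and that site's total script activity.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# per-site (function calls, total calls) tuples for one hypothetical function
tuples = [(1, 10), (2, 500), (1, 9000), (3, 40), (2, 7000)]
r = pearson([t[0] for t in tuples], [t[1] for t in tuples])
```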
16. Function Usage Amount
- An interesting trend arose when analyzing the correlation data
- There are functions that are called hundreds or thousands of times
- Despite this, sites seem to call a specific function only a couple of times.
- Example:
  - GUID 3050f3fd-98b5-11cf-bb82-00aa00bdec0b, DISPID 1
  - Called 346 times; in only 11 sites is it called more than 3 times (3.2%)
17. (No Transcript)
18. Categorization Approach
- Since no correlation was found, another approach was taken
- According to trends in the script activity data, divide the sites into distinct categories
- Examine the function behavior in each category, as opposed to individual sites
- Three categories were chosen, split roughly at the median and the end of the third quartile
  - This gave one category 50% of the data, while the other two each had 25% of the data
  - An attempt to avoid bias toward the extremely script-heavy sites
19. Categorization Heuristic
- A heuristic was developed to determine whether a function would be more likely to appear in a certain category:
- F = [(avgl - avgsite)(L - avgfunc) + (avgm - avgsite)(M - avgfunc) + (avgh - avgsite)(H - avgfunc)] / 3
- avgl, avgm, and avgh are the average number of function calls per category (542, 2,882, and 22,745 respectively)
- avgsite is the overall average number of function calls per site (5,777)
- avgfunc is the average number of function calls per function (1,984)
- L, M, and H are the specific number of times the function was called in the low, medium, and high categories
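The heuristic can be sketched directly from the slide's constants. Note one assumption: the slide's formula lost its operators in transcription, so the sum of the three (category - site average) x (count - function average) products is a reconstruction, not a confirmed reading.

```python
# Sketch of the categorization heuristic F using the per-category and overall
# averages given on this slide. The sum-of-products form is an assumption.

AVG_L, AVG_M, AVG_H = 542, 2882, 22745   # avg function calls per category
AVG_SITE, AVG_FUNC = 5777, 1984          # overall per-site / per-function avg

def heuristic(L, M, H):
    """F for one function, from its call counts in the low/mid/high bins."""
    return ((AVG_L - AVG_SITE) * (L - AVG_FUNC)
            + (AVG_M - AVG_SITE) * (M - AVG_FUNC)
            + (AVG_H - AVG_SITE) * (H - AVG_FUNC)) / 3
```

With this form, a function called exactly the average number of times in every category scores F = 0, and heavy usage in the high-activity category pushes F upward (its category average sits above the site average).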
20. Statistical Variation Among Categories
- The heuristic separated the functions into three distinct sections
- Along the higher values were mostly functions that had few arguments supplied
- In the middle, whole objects were represented (a GUID, and all of its related function calls)
- At the lowest (negative) values were functions that were commonly called with arguments
21. Argument Distributions
- A further analysis was done on whether there exists a difference in the behavior of a function in the separate categories
- The distributions of BSTR (byte string) lengths and I4 (4-byte integer) values were considered
- Several functions were examined, but this specific one (referred to as "Second", as it had the second-highest heuristic value) is exemplary of the trends noticed
- The argument type frequency of Second:
    Category   0-arg      I4   BSTR    DISPATCH   NULL   BOOL
    LOW           20,713   0    2,634         14      0      0
    MID          170,861   0    9,888          1      0      0
    HIGH       1,215,964   0    9,447         19      0      0
22. (No Transcript)
23. (No Transcript)
24. (No Transcript)
25. Conclusions of Approach
- The trend seen is that there is no major statistical difference in the argument value distribution among the categories, but there are distinct characteristic differences
- Functions that appear more commonly in less-active sites tend to have arguments supplied to them
- No general correlation exists between functions and how active the sites calling them are
- There may exist correlation in some other characteristic, however
26. Integer analysis
27. Functions through Three Sets
- Looked through 3 of the runs:
  - 5 seeds, depth 5: 1,324 sites
  - 10 seeds, depth 3: 1,184 sites
  - 200 seeds, depth 1: 15,790 sites
- Picked the three most common functions with integer arguments from the first run to analyze
- Goal: look for consistency in function behavior across differing sets of sites
28. Functions through Three Sets
- In all three data sets, the values of the argument had a very large range, from 0 to the millions or billions
- Distributions did not stay consistent through the sets; each had different commonly occurring values
29. Functions through Three Sets
- Similar pattern in all 3 sets
- Low values were used
- Numbers near 0 were most common; occurrences drop off as values get larger
30. Functions through Three Sets
- Values range from 0 into the hundreds
- The second data set did not have enough data
- Similar common numbers in both remaining sets: 3, 300, and 728
31. Patterns in DISPID Usage
- Looked at what DISPIDs were used, without regard to the GUIDs of the calling classes
- DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
- Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
- 205 of the functions did not fall within these groups, and instead were one of 6 other numbers
- Within each of the four ranges, occurrences at specific numbers formed patterns
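The four-range grouping can be sketched as a bucketing pass. The range bounds are taken from the following slides; the `bucket` function, range names, and sample DISPIDs are illustrative.

```python
# Sketch of bucketing DISPIDs into the four ranges reported on the next
# slides and counting occurrences per range (bounds from the deck).

RANGES = {
    "first":  (3_000_000, 3_001_286),
    "second": (0, 2_313),
    "third":  (-2_147_417_109, -2_147_411_105),
    "fourth": (10_001, 10_087),
}

def bucket(dispid):
    for name, (lo, hi) in RANGES.items():
        if lo <= dispid <= hi:
            return name
    return "other"

counts = {}
for d in [3_000_001, 0, 1_103, 10_001, -2_147_417_000, 5_000_000]:
    counts[bucket(d)] = counts.get(bucket(d), 0) + 1
```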
32. DISPID Usage: First Range
- The most common range for DISPIDs: 3,000,000-3,001,286
- 490,201 functions, about 66%
- 1,067 out of 1,286 different numbers used
- Numbers nearer to 3 million were most common; higher numbers were used less
33. DISPID Usage: Second Range
- Second most common range for DISPIDs: 0-2,313
- 164,224 functions, about 22%
- 39 numbers in this range were used
- 0 and 1,103 were the most common
- Numbers clumped around 5 groups: 0-9, 127-154, 1002-1168, 1500-1504, and 2001-2015, with 2313 being an exception
34. DISPID Usage: Third Range
- Third range for DISPIDs: -2,147,417,109 to -2,147,411,105
- 50,541 functions, about 7%
- 55 numbers in this range were used
- Most occurrences were around numbers ending in round thousands
35. DISPID Usage: Fourth Range
- Fourth range for DISPIDs: 10,001-10,087
- 38,099 functions, about 5%
- 75 numbers out of the range were used
- Uniquely used by GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b
- DISPIDs 10,001-10,007 are the most common
36. Patterns in DISPID Usage
- Looked at what DISPIDs were used, without regard to the GUIDs of the calling classes
- DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
- Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
- Within each of the four ranges, occurrences at specific numbers formed patterns
37. Functions with Multiple Integers
- Looked for patterns in the relations among the integer arguments of functions taking multiple arguments
- Not very many functions fell into this category
- One took two arguments; the first was always 0
- One took two arguments that were always equal; arguments ranged from (1,1) to (31,31) and (1908,1908)
  - All came from 2 signup pages on a particular website
- Two took two differing arguments; no relation between the arguments could be found
- Other functions did not have a large enough sample size
38. Functions with Multiple Integers
- The function itself had consistent patterns in the values it took: 95% of arguments were (1,1) or (3,2)
- No consistent relations between arguments
39. Function Pairs
- Examined:
  - GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b
  - DISPIDs 10001-10062
- Out of 38,099 occurrences, 3,595 were followed by:
  - GUID c59c6b12-f6c1-11cf-8835-00a0c911e8b2
  - DISPID 0
- The second function had no independent occurrences
- Similar arguments:
  - The first function took a variety of numbers and types of arguments
  - The second function always took a DISPATCH argument, followed by the same arguments as the first function
40. Conclusions of Approach
- Function arguments through sets
  - There seem to be consistent patterns in certain functions
  - Range, values taken, common values, value distribution
- DISPID usage
  - 4 ranges with very few exceptions
  - Common subranges or distribution patterns within each range
- Multiple arguments
  - Uncommon type of function
  - No noticeable relations in arguments
- Function pairs
  - Dependent functions have clear patterns
    - Function position
    - Argument types and values
  - Only one example; do more exist?
41. Byte string analysis
42. Byte String Analysis
- Buffer overflows are a common method of exploiting a targeted system
- One method: create a very long string to break boundary checking, then append shellcode at the end to inject into the assembly code
- We are interested in the length of BSTR objects fed into given functions
- For any given API, what is considered a normal string length?
43. Class-based analysis
- Initial analyses were done on a class-by-class basis
- Samples were grouped together and analyzed according to GUID
- Byte strings are typically very small
  - More than 70% of the commonly called JavaScript classes typically received byte strings of less than length 20 (39 out of 55 functions from this crawl)
  - Less than 10% of these ever receive a string greater than 5,000 characters in length (4 out of 55 functions from this crawl)
44. Class-based analysis
- Analysis of individual classes shows the same trend toward smaller strings
- However, analyzing based on classes groups the byte strings of all class functions together, which results in inaccuracy and lost information
45. Parameter-based analysis
- The second analysis split samples into the individual arguments of the unique functions of each class
- Given a sample set with values in the interval (a, b), with average µ and standard deviation s, we expect values to largely lie within the interval (µ - s, µ + s)
- We also expect (µ - s, µ + s) to be smaller than (a, b)
- The smaller (µ - s, µ + s) is in proportion to (a, b), the more well-defined our sample set becomes
46. Parameter-based analysis
- Length of expected interval: 2s
- Length of entire interval: n = b - a + 1
- 2s/n represents the ratio of the expected interval to the entire interval
- Since 2s ≤ n, we have 0 ≤ 2s/n ≤ 1
- When 2s/n = 0, s = 0 and all values in the data set are equal
- When 2s/n = 1, s = n/2 and all values in the data set equal either a or b
- As 2s/n goes from 0 to 1, the shape of the graph begins to shift
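The ratio can be sketched directly from these definitions. This is a minimal illustration using the population standard deviation; the `ratio` function name and sample data are assumptions for the example.

```python
# Sketch of the 2s/n well-definedness ratio: twice the population standard
# deviation of the observed byte-string lengths, over the full interval
# length n = b - a + 1.
import math

def ratio(lengths):
    n_vals = len(lengths)
    mean = sum(lengths) / n_vals
    s = math.sqrt(sum((v - mean) ** 2 for v in lengths) / n_vals)
    interval = max(lengths) - min(lengths) + 1   # n = b - a + 1
    return 2 * s / interval
```

A degenerate sample where every length is equal gives ratio 0; a sample split between two far-apart lengths pushes the ratio toward 1.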
47. (No Transcript)
48. (No Transcript)
49. (No Transcript)
50. (No Transcript)
51. (No Transcript)
52. (No Transcript)
53. (No Transcript)
54.
- When the ratio is 0, the number of strings is typically low
- Otherwise, the ratio increases as the number of strings decreases
- The function arguments with the smallest non-zero ratio are the most well-defined
55. (No Transcript)
56. Analysis with pruning
- Only function arguments that see 9 or fewer strings are removed; however:
  - Most zero-ratio functions are pruned (2,607 to 731)
  - Many functions with ratio > 0.5 are pruned (1,540 to 883)
  - Functions with ratio < 0.5 are affected minimally (1,442 to 1,332)
57. Analysis with pruning
- Only function arguments that see 99 or fewer strings are removed; however:
  - Almost all zero-ratio functions are pruned (731 to 232)
  - Almost all functions with ratio > 0.5 are pruned (883 to 266)
  - Only some functions with ratio < 0.5 are affected (1,332 to 979)
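The pruning step can be sketched as a simple sample-size filter over the per-argument ratios. The data shapes and names here are illustrative; only the thresholds (keep arguments with at least 10, then at least 100, observed strings) come from the slides.

```python
# Sketch of the pruning step: drop function arguments whose observed string
# count falls at or below the threshold, keeping the survivors' ratios.

def prune(ratio_by_func, count_by_func, min_count):
    return {f: r for f, r in ratio_by_func.items()
            if count_by_func.get(f, 0) >= min_count}

# hypothetical per-argument ratios and sample counts
ratios = {"f1": 0.0, "f2": 0.62, "f3": 0.21}
counts = {"f1": 4, "f2": 150, "f3": 37}
kept = prune(ratios, counts, min_count=10)   # "9 or fewer" removed
```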
58. Analysis with pruning
As a function is seen in the wild more
frequently, the byte string lengths it takes in
begin to fall into specific intervals. Functions
with substantial evidence are well-defined in the
lengths of byte strings they tend to receive!
59. Comparing with malicious data
- Symantec provided us with test samples used for Canary testing
  - These samples trigger browser exploits but do not inject actual shellcode
  - The worst thing they can do is crash the browser
- Malicious samples fell into one of three categories:
  - Bad BSTR
  - Bad I4
  - Bad DISPATCH (object)
- Example: MSIE Popup Window Address Bar Spoofing Weakness
- Callback data
  - Compare with data from the May crawl
  - 491 strings seen over the 20,416 websites visited during that crawl
    - Smallest: 70
    - Largest: 80
    - Average: 76.32
    - Standard deviation: 2.33
    - Expected interval: (73.99, 78.65)
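The expected-interval comparison can be sketched as follows, restating the slide's statistics as constants. The anomaly check (flag a length outside mean ± one standard deviation) is the one-sigma interval the deck uses; the function names are illustrative.

```python
# Sketch of the expected-interval check: from the observed lengths' mean and
# standard deviation, flag a new byte-string length outside (mean - s, mean + s).

def expected_interval(mean, std):
    return (mean - std, mean + std)

def is_suspicious(length, mean, std):
    lo, hi = expected_interval(mean, std)
    return not (lo <= length <= hi)

# statistics for this function's 491 observed strings, as on the slide
lo, hi = expected_interval(76.32, 2.33)
```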
60. Trend volatility
- How does web activity change over time?
- 28 crawls of 1,000 sites were performed from May 7 to May 13 to investigate this
- Each crawl differs by several hundred thousand DLL calls
- The number of sites with actual scripts changes
61. Trend volatility
- These runs were done 5.5 hours apart
- Change is very slight
- Zero-ratio functions increase
- High-ratio functions decrease
62. Trend volatility
- These runs were done 1 day apart
- Change is also very slight
- Zero-ratio functions decrease
- Mid-ratio functions (R ≈ 0.5) increase
63. Trend volatility
- These runs were done 6 days apart
- Change is a little more apparent
- Zero-ratio functions decrease
- Mid-ratio functions (R ≈ 0.5) increase
64. Trend volatility
- The state of JavaScript activity on the Web is constantly changing
- Changes are somewhat unpredictable (and entirely dependent on the decisions of webmasters)
- These changes are not major in the long run; however, they still exist and need to be addressed
65. Conclusions of Approach
- Substantial evidence in favor of existing trends for byte string arguments
- This approach can be adapted to anything that can be quantified as a number
- Changes in the state of the web will require any heuristic developed to have at least a basic learning capability
- We plan to continue research over the summer