Title: Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover
1Detecting Phishing Web Pages with
Visual Similarity Assessment Based on
Earth Mover's Distance (EMD)
- Speaker
- Po-Jiu Wang
- Institute of Information Science, Academia Sinica
- Author
- Anthony Y. Fu
- Department of Computer Science, City University of Hong Kong
- IEEE 2006
2Outline
- What is phishing
- Various phishing techniques
- Previous anti-phishing works
- Evaluating webpage distance with EMD
- What is EMD, and its advantage
- Color and its coordinate distance with EMD
- Conclusion and tentative work to do
3What is phishing
- Phishing is a criminal trick for stealing personal
information by luring people into visiting a
fake webpage.
- How are people lured?
- Phishing emails, BBS posts, chat rooms, etc.
- Spoofed free gifts, identity confirmation requests, etc.
4Various phishing techniques
- The most straightforward way for a phisher to
deceive people is to make links and webpages
look similar to the real ones.
5Various phishing techniques (Link based phishing
obfuscation)
- Link-based phishing obfuscation can be
carried out in the four ways below.
- Adding a suffix to the domain name of the URL.
- E.g., changing www.citybank.com to
www.citybank.com.us.ebanking
- Using an actual link that differs from the visible link.
- E.g., the HTML line <a href="http://www.citibank.com.us.ebanking">www.citibank.com</a>
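The visible-link versus actual-link trick can be checked mechanically. The sketch below is illustrative only and is not part of the presented method: it uses Python's standard `html.parser`, and the names `LinkCollector`, `host_of`, and `mismatched_links` are hypothetical. It flags anchors whose visible text looks like a host name that differs from the host in `href`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible_text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(text):
    """Return the lowercase host of a URL-like string, or '' if there is none."""
    if "://" not in text:
        text = "http://" + text
    return (urlparse(text).hostname or "").lower()


def mismatched_links(html):
    """Return (href, text) pairs where the text looks like a host that differs from the href host."""
    parser = LinkCollector()
    parser.feed(html)
    suspicious = []
    for href, text in parser.links:
        shown, actual = host_of(text), host_of(href)
        # Only compare when the visible text itself looks like a host name.
        if "." in shown and " " not in text and shown != actual:
            suspicious.append((href, text))
    return suspicious
```

Applied to the HTML line above, `mismatched_links` reports the anchor, because the visible host `www.citibank.com` differs from the actual host `www.citibank.com.us.ebanking`.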
6Various Phishing Techniques (Link based phishing
obfuscation 1)
- Using a bug in the real website to redirect to other
webpages.
- E.g., a bug in the eBay website:
http://cgi.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=PHISHINGLINK can
direct you to any specified PHISHINGLINK.
- Replacing characters in the real link with similar-looking ones.
- E.g., replacing each I (uppercase i) with l
(lowercase L) or 1 (the digit one),
such as WWW.CITIBANK.COM to WWW.C1TlBANK.COM.
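The look-alike character trick can be illustrated by mapping confusable characters to a common form before comparing host names. This is a minimal sketch, not part of the presented method; the `CONFUSABLES` table is a tiny hypothetical sample, and `skeleton` and `is_lookalike` are illustrative names.

```python
# Hypothetical, deliberately tiny confusable table; a real one would be far larger.
CONFUSABLES = {"1": "i", "l": "i", "0": "o"}


def skeleton(host):
    """Lowercase the host and collapse look-alike characters to one representative."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in host.lower())


def is_lookalike(candidate, protected):
    """True if candidate differs from protected but is identical after collapsing look-alikes."""
    return (candidate.lower() != protected.lower()
            and skeleton(candidate) == skeleton(protected))
```

With this table, `is_lookalike("WWW.C1TlBANK.COM", "WWW.CITIBANK.COM")` is true, while the genuine host compared with itself is not flagged.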
7Various Phishing Techniques (webpage based
obfuscation)
- Webpage-based obfuscation can be carried out
in the three basic ways below.
- Using a webpage downloaded from the real website to
make the phishing webpage look and behave
exactly like the real one.
8Various Phishing Techniques(webpage based
obfuscation 1)
- Using a script or browser add-in to cover
the address bar so that users believe they
have entered the correct website.
- Using visual content (e.g., images,
Flash, video) rather than HTML to avoid
HTML-based phishing detection.
9Previous Anti-Phishing Works
- Anti-Spamming
- Phishing email is spam. Phishers harvest email
addresses and broadcast to the potential victims.
- Human aided
- Banks employ a group of people to monitor
phishing activities, e.g., HSBC.
10Previous Anti-Phishing Works (1)
- Duplicate document detection approaches, which
focus on plain text documents and use pure text
features in similarity measure.
11Motivation
- Phishing web pages usually have high visual
similarity to the real web pages.
- An effective approach called image-based EMD is
proposed to calculate the visual similarity of
web pages.
12Evaluating webpage distance with EMD
- EMD stands for Earth Mover's Distance; it is based on
the well-known transportation problem.
- Suppose we have m producers
- $P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\}$
- and n consumers
- $C = \{(c_1, w_{c_1}), (c_2, w_{c_2}), \ldots, (c_n, w_{c_n})\}$
- A distance matrix $D = [d_{ij}]$ is given
13Evaluating webpage distance with EMD
(transportation fee)
- The task is to find a flow matrix $F = [f_{ij}]$ whose
entries indicate the amount of product
to be moved from each producer to each consumer.
14Evaluating webpage distance with EMD (total cost
of transportation fee)
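- The total transportation cost can be
represented as
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$$
subject to (ST)
$$f_{ij} \ge 0, \quad \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad \sum_{i=1}^{m} f_{ij} \le w_{c_j}, \quad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\Big(\sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{c_j}\Big)$$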
15Evaluating webpage distance with EMD (final
equation of EMD)
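- The EMD can be represented as the minimum total transportation cost normalized by the total flow:
$$\mathrm{EMD}(P, C) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$
where the flow matrix is the one that minimizes the total cost.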
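To make the quantities concrete, the sketch below assumes the standard transportation-problem form of EMD: total cost equal to the sum of flow times distance, normalized by the total flow. It only checks a given flow against the constraints and evaluates the normalized cost; it does not search for the optimal flow, which in practice requires a linear-programming or min-cost-flow solver. The function names are hypothetical.

```python
def total_cost(flow, dist):
    """Sum of f_ij * d_ij over all producer/consumer pairs."""
    return sum(f * d
               for flow_row, dist_row in zip(flow, dist)
               for f, d in zip(flow_row, dist_row))


def is_feasible(flow, wp, wc, tol=1e-9):
    """Check the transportation constraints for producer weights wp and consumer weights wc."""
    m, n = len(wp), len(wc)
    if any(f < -tol for row in flow for f in row):
        return False                                   # flows must be non-negative
    if any(sum(flow[i]) > wp[i] + tol for i in range(m)):
        return False                                   # a producer cannot ship more than it has
    if any(sum(flow[i][j] for i in range(m)) > wc[j] + tol for j in range(n)):
        return False                                   # a consumer cannot receive more than it needs
    moved = sum(map(sum, flow))
    return abs(moved - min(sum(wp), sum(wc))) <= tol   # move as much as possible


def emd_from_flow(flow, dist):
    """Normalized cost of a flow; equals the EMD only when the flow is optimal."""
    moved = sum(map(sum, flow))
    if moved == 0:
        raise ValueError("total flow is zero")
    return total_cost(flow, dist) / moved


# Two producers, two consumers; this flow moves only 0.1 units across distance 1.
wp, wc = [0.4, 0.6], [0.5, 0.5]
dist = [[0.0, 1.0], [1.0, 0.0]]
flow = [[0.4, 0.0], [0.1, 0.5]]
```

For this small example the flow shown is optimal, since only the 0.1 surplus of the second producer has to travel, so `emd_from_flow(flow, dist)` gives the EMD of the two distributions.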
16Advantage of EMD
- Represents problems involving multi-featured
signatures
- Allows for partial matches in a very natural way
- Fits cognitive distance evaluation
17Color and its coordinate distance with EMD
(Preprocess image data)
- Preprocess image data
- Compress the images to 10 × 10 pixels
- Experiments show that the calculation time can be
greatly reduced through image size compression
without reducing the precision and recall
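The compression step can be sketched as a simple downsampling of the page image. The resampling method is not stated here, so nearest-neighbour sampling is an assumption, as is the representation of an image as a list of rows of pixel tuples; `resize_nearest` is a hypothetical name.

```python
def resize_nearest(pixels, w=10, h=10):
    """Downsample a 2D grid of pixels (list of rows) to w x h by nearest-neighbour sampling."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // h][x * src_w // w] for x in range(w)]
        for y in range(h)
    ]
```

A smaller target size means fewer pixels and therefore fewer signature features to feed into the EMD computation, which is where the time saving comes from.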
18The calculation of the distance of pixel color
and coordinate
- Get the signatures of webpage1 and webpage2 using
pixel color and coordinates
- Calculate $D = [d_{ij}]$:
- $d_{ij}$ = Distance(Color(pixel_i), Color(pixel_j), Coordinate(pixel_i), Coordinate(pixel_j))
- EMDColorAndCoordinate = EMDDist(Signature1, Signature2, D)
19The improved color space
- The color of each pixel in the resized images is
represented using the ARGB (alpha, red, green,
and blue) scheme with 4 bytes (32 bits).
- A degraded color space, controlled by a Color Degrading
Factor (CDF), is needed.
Thus, the size of the degraded color space is $(2^8/\mathrm{CDF})^4$.
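A minimal sketch of color degrading, assuming each 8-bit ARGB channel is mapped to a coarser level by integer division by CDF (the exact mapping is not stated here, so this is an assumption). The names `degrade` and `degraded_space_size` are hypothetical.

```python
def degrade(argb, cdf=32):
    """Map an (alpha, red, green, blue) tuple of 0..255 values to its degraded color."""
    return tuple(channel // cdf for channel in argb)


def degraded_space_size(cdf=32):
    """Number of distinct degraded colors: (2^8 / CDF)^4."""
    return (2 ** 8 // cdf) ** 4
```

With CDF = 32, each channel has 8 levels, so the degraded space holds 8^4 = 4096 colors instead of 2^32.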
20The centroid of degraded color space
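- The centroid of each degraded color is calculated
using
$$C_{dc} = \frac{1}{N_{dc}} \sum_{i=1}^{N_{dc}} C_{dc,i}$$
where $C_{dc,i}$ is the coordinates of the ith pixel that has
degraded color dc, $C_{dc}$ is the centroid of degraded color dc,
and $N_{dc}$ is the total number of pixels that have degraded
color dc.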
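The centroid computation can be sketched as a single pass over the resized image that groups pixel coordinates by degraded color. Assumptions in this sketch: the image is a list of rows of ARGB tuples, degrading is integer division by CDF, and each feature's weight is the fraction of pixels with that color. The name `signature` is hypothetical.

```python
from collections import defaultdict


def signature(pixels, cdf=32):
    """Return a list of (degraded_color, centroid, weight) features for an image."""
    sums = defaultdict(lambda: [0, 0, 0])        # per color: sum of x, sum of y, pixel count
    for y, row in enumerate(pixels):
        for x, argb in enumerate(row):
            dc = tuple(ch // cdf for ch in argb)  # assumed degrading rule
            acc = sums[dc]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    total = sum(acc[2] for acc in sums.values())
    return [(dc, (sx / n, sy / n), n / total)     # centroid = mean coordinate of the color
            for dc, (sx, sy, n) in sums.items()]
```

Each feature pairs a degraded color with the mean position of the pixels that carry it, which is the information the color and centroid distances compare.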
21Computing visual similarity from EMD
- First, the normalized Euclidean distance between the
degraded ARGB colors is calculated, and then the
normalized Euclidean distance between the centroids is
calculated.
22The maximum color distance
- Suppose each feature consists of a degraded ARGB color
and the centroid of that color. The maximum color
distance MDcolor is the largest possible Euclidean
distance between two degraded colors.
23The normalized color distance
- The normalized color distance NDcolor is defined
as the Euclidean distance between the two degraded colors divided by the maximum color distance.
24The normalized centroid distance
- The maximum centroid distance MDcentroid is $\sqrt{w^2 + h^2}$,
where w and h are the width and height of the
resized images, respectively.
- The normalized centroid distance NDcentroid is defined as the
Euclidean distance between the two centroids divided by MDcentroid.
25Final equation of EMD
- The two distances are added up with weights p and
q, respectively, to form the feature distance
p · NDcolor + q · NDcentroid,
where p + q = 1.
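The weighted feature distance can be sketched as follows. Assumptions in this sketch: a feature is a (degraded color, centroid) pair, each degraded channel takes integer levels from 0 to 2^8/CDF − 1 (which fixes the maximum color distance used here), and the maximum centroid distance is the image diagonal. The name `feature_distance` is hypothetical; `math.dist` needs Python 3.8 or later.

```python
import math


def feature_distance(f1, f2, w=100, h=100, cdf=32, p=0.5, q=0.5):
    """Weighted sum of normalized color distance and normalized centroid distance."""
    (color1, centroid1), (color2, centroid2) = f1, f2
    top = 2 ** 8 // cdf - 1                            # highest degraded level (assumption)
    md_color = math.dist((0,) * 4, (top,) * 4)         # largest possible color distance
    md_centroid = math.hypot(w, h)                     # image diagonal
    nd_color = math.dist(color1, color2) / md_color
    nd_centroid = math.dist(centroid1, centroid2) / md_centroid
    return p * nd_color + q * nd_centroid
```

Because both terms are normalized to [0, 1] and p + q = 1, the feature distance also stays in [0, 1]; these values fill the distance matrix D passed to the EMD.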
26Computing EMD-based visual similarity of two
images
- The visual similarity of two images is computed from their EMD, using a parameter that acts
as the amplifier of visual similarity.
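A minimal sketch of turning an EMD value into a similarity score. It assumes the similarity takes the form 1 − EMD^α with the amplifier α as the exponent, and that the EMD lies in [0, 1] because the feature distances are normalized. This form and the names `visual_similarity` and `amplifier` are assumptions for illustration, not a formula stated here.

```python
def visual_similarity(emd, amplifier=0.5):
    """Map an EMD in [0, 1] to a similarity in [0, 1]; the functional form is an assumption."""
    if not 0.0 <= emd <= 1.0:
        raise ValueError("EMD is expected to be normalized to [0, 1]")
    if amplifier <= 0:
        raise ValueError("amplifier must be positive")
    return 1.0 - emd ** amplifier
```

Under this assumed form, an amplifier below 1 spreads out small EMD values, so near-identical pages are separated more clearly from merely similar ones.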
27An improved adjusted threshold for classification
- A page-specific threshold for each protected web
page is used to classify a web page as a
phishing web page or a normal one.
- $t_i$ denotes the
threshold of the ith protected web page.
28Two types of misclassifications
- False alarm
- The visual similarity is larger than or equal to
t but, in fact, the web page is not a phishing
web page (false positive).
- Missing
- The visual similarity is less than t but, in
fact, the web page is a phishing one (false
negative).
- VSSi is associated with two accessory parameters: the
number of false alarms and the number of false negatives.
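The two kinds of misclassification can be counted directly from labeled similarity scores. This is an illustrative sketch with a hypothetical function name; it follows the definitions above, where a similarity of at least t is treated as a phishing verdict.

```python
def count_errors(similarities, is_phishing, t):
    """Return (false_alarms, misses) for one protected page at threshold t."""
    pairs = list(zip(similarities, is_phishing))
    false_alarms = sum(1 for vs, phishing in pairs if vs >= t and not phishing)
    misses = sum(1 for vs, phishing in pairs if vs < t and phishing)
    return false_alarms, misses
```

Raising t trades false alarms for misses and lowering it does the opposite, which is why the threshold is chosen per protected page.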
29The way to classify phishing page
- When a suspected web page arrives, its visual
similarity to each protected web page is computed, forming a visual similarity vector,
and the classification result is obtained by comparing this vector with the thresholds.
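The classification step can be sketched as an element-wise comparison of the similarity vector with the per-page thresholds. The rule assumed here is that a page is reported as phishing when its similarity to at least one protected page reaches that page's threshold; the function name and the sample values are hypothetical.

```python
def classify(similarities, thresholds):
    """Return indices of protected pages whose threshold is reached; non-empty means phishing."""
    if len(similarities) != len(thresholds):
        raise ValueError("one similarity and one threshold are needed per protected page")
    return [i for i, (vs, t) in enumerate(zip(similarities, thresholds)) if vs >= t]


# Illustrative values only: similarities of one suspected page to three protected pages.
suspected = [0.31, 0.96, 0.12]
thresholds = [0.85, 0.94, 0.74]
targets = classify(suspected, thresholds)   # indices of the protected pages it imitates
```

Returning the matching indices rather than a bare yes/no also tells which protected page the suspected page appears to imitate.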
30Experiment configuration of phishing detection
performance
- 10,272 homepages are selected from the web.
- 9 phishing web pages targeting 8 real
protected web pages are collected.
- The 10,272 + 9 = 10,281 web pages are mixed together to form
the Suspected Webpage Set.
- 1,000 web pages are randomly selected from the 10,272
and combined with the 9 phishing web pages to
form the Training Webpage Set.
31Train a threshold vector
- We use the Training Webpage Set to train a threshold
vector
Protected Webpage                  Threshold (T)
real-Bank of Oklahoma - Online     0.8469
real-ebay1                         0.9434
real-eBay2                         0.9493
real-ICBC(Asia)                    0.7385
real-Key Bank                      0.9323
real-us bank                       0.9573
real-Washington Mutual             0.8541
real-Wells Fargo Sign On           0.9255
32Classification precision, phishing recall, and
false alarm list( 0, 9281 Suspected Web
Pages)
33Classification precision, phishing recall, and
false alarm list( 0.005, 9281 Suspected Web
Pages)
Reduces the possibility of false negatives!
34Phishing detection performance of image-based EMD
There are 65 false alarms
35Phishing detection performance of HTML/DOM-based
EMD
There are 849 false alarms
36Phishing detection performance of similarity
assessment-based EMD
There are 697 false alarms
37Experiment results
- The threshold vector is used to classify a
suspected webpage.
- In order to reduce the possibility of false negatives,
there is a necessary sacrifice needed under
- The parameters are set empirically by tuning in our experiments: w = h = 100,
amplifier = 0.5, Ss = 20, p = q = 0.5, and CDF = 32.
38The number of ground truth web pages for each
protected web page
39The configuration of tuning the parameters
- Take Nsample as the
sample number for each protected web page.
- If a web page among the Nsample collected web pages
is in the corresponding ground truth group, it is
counted as a correctly detected similar web page.
40Tuning the parameters (w and h)
- We have four configuration options (wh 10,
,100, and ) to tune w and h.
41Tuning the parameters (p and q)
- 11 configuration options ((p, q) = (0, 1), (0.1, 0.9),
(0.2, 0.8), ..., (0.9, 0.1), (1, 0)) are used
to tune p and q.
42Tuning the parameters (sample color number)
- Six configuration options (Ss = 5, 10, 15, 20,
25, and 30) are used to tune Ss.
43Tuning the parameters (CDF)
- Eight configuration options (CDF = 8, 16, 24, 32,
40, 48, 56, and 64) are used to tune CDF.
44The architecture of the built anti-phishing system
45Conclusions
- This approach works at the pixel level of web
pages rather than at the text level.
- Experiments show that our method can achieve
satisfactory classification precision and phishing
recall.
- The time efficiency of computation is also
acceptable for online phishing detection.
46Tentative works
- Continue with more phishing examples and even
larger-scale datasets.
- The method cannot detect phishing pages that are not
visually similar to the real ones.
- Keep working on developing a client-side
application.
47Thanks for your attention.