Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)

1
Detecting Phishing Web Pages with
Visual Similarity Assessment Based on
Earth Mover's Distance (EMD)
  • Speaker
  • Po-Jiu Wang
  • Institute of Information Science Academia Sinica
  • Author
  • Anthony Y. Fu
  • Department of Computer Science, City University
    of Hong Kong
  • IEEE 2006

2
Outline
  • What is phishing
  • Various phishing techniques
  • Previous anti-phishing works
  • Evaluating webpage distance with EMD
  • What is EMD, and its advantage
  • Color and its coordinate distance with EMD
  • Conclusion and tentative work to do

3
What is phishing
  • Phishing is a criminal trick for stealing personal
    information by luring people into visiting a
    fake webpage.
  • How are people lured?
  • Through phishing email, BBS, chat rooms, etc.
  • By spoofing free gifts, identity confirmation, etc.

4
Various phishing techniques
  • The most straightforward way for a phisher to
    spoof people is to make the appearance of webpage
    links and webpages similar to the real ones.

5
Various phishing techniques (Link based phishing
obfuscation)
  • Link-based phishing obfuscation can be
    carried out in the four ways below.
  • Adding a suffix to the domain name of the URL.
  • E.g., revise www.citibank.com to
    www.citibank.com.us.ebanking
  • Using an actual link different from the visible link.
  • E.g., the HTML line
    <a href="http://www.citibank.com.us.ebanking">www.citibank.com</a>
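The visible-link trick above can be checked mechanically. The following is a minimal sketch, not part of the presented method: it uses Python's standard `html.parser` to collect anchors and reports those whose visible text names a host other than the real target. The names `LinkCollector` and `mismatched_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return (visible text, real host) for links that display another host."""
    collector = LinkCollector()
    collector.feed(html)
    suspicious = []
    for href, text in collector.links:
        target_host = (urlparse(href).hostname or "").lower()
        # Parse the visible text as a URL; add "//" so a bare host is recognised.
        shown = text if "//" in text else "//" + text
        shown_host = (urlparse(shown).hostname or "").lower()
        # Only text that looks like a host name (contains a dot) is compared.
        if "." in shown_host and shown_host != target_host:
            suspicious.append((text, target_host))
    return suspicious
```

Anchors whose text is ordinary wording (no dot in the parsed host) are ignored, so the check only fires when the text itself claims to be an address.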

6
Various Phishing Techniques (Link based phishing
obfuscation 1)
  • Exploiting a bug in the real website to redirect to other
    webpages.
  • E.g., the bug in the eBay website
    http://cgi.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=PHISHINGLINK
    can direct you to any specified PHISHINGLINK.
  • Replacing characters in the real link with
    similar-looking ones.
  • E.g., replace I (uppercase i) with l
    (lowercase L) or 1 (Arabic numeral one),
    such as WWW.CITIBANK.COM to WWW.C1TlBANK.COM.
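The character-replacement trick can be illustrated by folding look-alike characters onto one canonical letter before comparing domains. This is only a sketch: the `CONFUSABLE` map is a hypothetical, deliberately tiny table, and `skeleton` and `looks_like` are illustrative names.

```python
# Hypothetical look-alike table: each character folds to one canonical letter.
CONFUSABLE = str.maketrans({"1": "i", "l": "i", "0": "o"})


def skeleton(domain):
    """Lower-case a domain and fold look-alike characters together."""
    return domain.lower().translate(CONFUSABLE)


def looks_like(candidate, protected):
    """True if candidate differs from protected only by look-alike characters."""
    return (candidate.lower() != protected.lower()
            and skeleton(candidate) == skeleton(protected))
```

For example, `looks_like("WWW.C1TlBANK.COM", "WWW.CITIBANK.COM")` is true, while the genuine domain compared with itself is not flagged.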

7
Various Phishing Techniques (webpage based
obfuscation)
  • Webpage-based obfuscation can be carried out
    in the three basic ways below.
  • Using a webpage downloaded from the real website to
    make the phishing webpage appear and react
    exactly the same as the real one.

8
Various Phishing Techniques (webpage based
obfuscation 1)
  • Using a script or browser add-in to cover
    the address bar, so that users believe they
    have entered the correct website.
  • Using visual content (e.g., images,
    Flash, video, etc.) rather than HTML to avoid
    HTML-based phishing detection.

9
Previous Anti-Phishing Works
  • Anti-spamming
  • Phishing email is spam. Phishers harvest email
    addresses and broadcast to the potential victims.
  • Human-aided
  • Banks employ a group of people to monitor
    phishing activities, e.g., HSBC.

10
Previous Anti-Phishing Works (1)
  • Duplicate document detection approaches, which
    focus on plain text documents and use pure text
    features in similarity measure.

11
Motivation
  • Phishing Web pages always have high visual
    similarity with the real Web pages.
  • An effective approach called image-based EMD is
    proposed to calculate the visual similarity of
    Web pages.

12
Evaluating webpage distance with EMD
  • EMD is the Earth Mover's Distance; it is based on
    the well-known transportation problem.
  • Suppose we have m producers
  • $P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\}$
  • and n consumers
  • $C = \{(c_1, w_{c_1}), (c_2, w_{c_2}), \ldots, (c_n, w_{c_n})\}$
  • A distance matrix $D = [d_{ij}]$ is given

13
Evaluating webpage distance with EMD
(transportation fee)
  • The task is to find a flow matrix $F = [f_{ij}]$, whose
    entries indicate the amount of product
    to be moved from producer i to consumer j.

14
Evaluating webpage distance with EMD (total cost
of transportation fee)
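  • The total transportation cost can be
    represented as

$$\mathrm{WORK}(P, C, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$$

subject to

$$f_{ij} \ge 0, \quad \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad \sum_{i=1}^{m} f_{ij} \le w_{c_j}, \quad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\Big(\sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{c_j}\Big)$$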
15
Evaluating webpage distance with EMD (final
equation of EMD)
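  • With F the flow that minimizes the total cost, the
    EMD can be represented as

$$\mathrm{EMD}(P, C) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$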
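To make the transportation formulation concrete, here is a minimal sketch that computes the EMD as a min-cost flow using successive cheapest augmenting paths (Bellman-Ford). It assumes integer weights (for example pixel counts) and is not the solver used by the authors; the function name `emd` and its argument layout are illustrative.

```python
def emd(producers, consumers, dist):
    """Earth Mover's Distance between two integer-weighted signatures.

    producers, consumers: lists of non-negative integer weights.
    dist[i][j]: ground distance from producer i to consumer j.
    """
    m, n = len(producers), len(consumers)
    source, sink = m + n, m + n + 1
    graph = [[] for _ in range(m + n + 2)]  # adjacency lists of edge records

    def add_edge(u, v, cap, cost):
        # Edge record: [target, remaining capacity, cost, index of reverse edge].
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i, w in enumerate(producers):
        add_edge(source, i, w, 0.0)
    for j, w in enumerate(consumers):
        add_edge(m + j, sink, w, 0.0)
    for i in range(m):
        for j in range(n):
            add_edge(i, m + j, min(producers[i], consumers[j]), dist[i][j])

    total_flow, total_cost = 0, 0.0
    while True:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        best = [float("inf")] * len(graph)
        best[source] = 0.0
        parent = [None] * len(graph)
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if best[u] == float("inf"):
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and best[u] + cost < best[v] - 1e-12:
                        best[v] = best[u] + cost
                        parent[v] = (u, k)
                        changed = True
            if not changed:
                break
        if parent[sink] is None:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            u, k = parent[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = parent[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * best[sink]
    # EMD is the minimal total cost normalised by the total flow.
    return total_cost / total_flow if total_flow else 0.0
```

Because the total flow equals the smaller of the two total weights, the sketch also covers the partial-match case in which the signatures carry different total weight.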

16
Advantage of EMD
  • Represent problems involving multi-featured
    signatures
  • Allow for partial matches in a very natural way
  • Fit for cognitive distance evaluation

17
Color and its coordinate distance with EMD
(Preprocess image data)
  • Preprocess image data
  • Compress the images to 10 × 10 pixels
  • Experiments show that the calculation time can be
    heavily reduced through image size compression
    without reducing the precision and recall
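As an illustration of the size compression step, the sketch below shrinks a pixel grid by averaging rectangular blocks. The slides do not say which resampling method the authors use, so block averaging is an assumption here, and `downsample` is a hypothetical helper operating on a list of rows of (A, R, G, B) tuples.

```python
def downsample(pixels, out_w, out_h):
    """Shrink a grid of (A, R, G, B) tuples to out_w x out_h by block averaging.

    Assumes the input is at least out_w pixels wide and out_h pixels tall.
    """
    in_h, in_w = len(pixels), len(pixels[0])
    result = []
    for oy in range(out_h):
        y0, y1 = oy * in_h // out_h, (oy + 1) * in_h // out_h
        row = []
        for ox in range(out_w):
            x0, x1 = ox * in_w // out_w, (ox + 1) * in_w // out_w
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Average each of the four channels over the block (integer result).
            row.append(tuple(sum(px[c] for px in block) // len(block)
                             for c in range(4)))
        result.append(row)
    return result
```

A smaller grid means fewer pixels and therefore fewer distinct colors to feed into the distance computation, which is where the time saving comes from.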

18
The calculation of the distance of pixel color
and coordinate
  • Get the signatures of webpage1 and webpage2 using
    pixel color and coordinate
  • Calculate $D = [d_{ij}]$.
  • $d_{ij}$ = Distance(Color(pixel_i), Color(pixel_j),
    Coordinate(pixel_i), Coordinate(pixel_j))
  • EMDColorAndCoordinate =
    EMDDist(Signature1, Signature2, D)

19
The improved color space
  • The color of each pixel in the resized images is
    represented using the ARGB (alpha, red, green,
    and blue) scheme with 4 bytes (32 bits).
  • A degraded color space, controlled by the Color
    Degrading Factor (CDF), is needed.

Thus, the size of the degraded color space is $(2^8/\mathrm{CDF})^4$.
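A minimal sketch of color degrading, assuming each 8-bit channel is quantized by integer division by the CDF and that the CDF divides 256 evenly. The exact quantization rule is not shown on the slide, so `degrade` and `degraded_space_size` are illustrative only.

```python
def degrade(argb, cdf=32):
    """Map an (A, R, G, B) tuple of 0..255 values to its degraded color."""
    # Assumption: each channel is quantized by integer division by the CDF.
    return tuple(channel // cdf for channel in argb)


def degraded_space_size(cdf=32):
    """Number of distinct degraded colors, (2^8 / CDF)^4."""
    return (2 ** 8 // cdf) ** 4
```

With CDF = 32, each channel keeps 8 levels, so the 32-bit color space shrinks to 8^4 = 4096 degraded colors.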
20
The centroid of degraded color space
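  • The centroid of each degraded color is calculated
    using

$$C_{dc} = \frac{\sum_{i=1}^{N_{dc}} x_{dc,i}}{N_{dc}}$$

where $x_{dc,i}$ is the coordinates of the ith pixel that has
degraded color dc, $C_{dc}$ is the centroid of degraded color dc,
and $N_{dc}$ is the total number of pixels that have degraded
color dc.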
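The centroid computation can be sketched as one pass over the pixel grid that accumulates coordinate sums per degraded color. Two assumptions are made here: each feature's weight is its pixel count, and only the `top` most frequent colors are kept. The name `signature` and the output layout are hypothetical.

```python
from collections import defaultdict


def signature(pixels, cdf=32, top=20):
    """Return [(degraded_color, (cx, cy), pixel_count)], most frequent first.

    pixels: list of rows, each row a list of (A, R, G, B) tuples.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # color -> [sum of x, sum of y, count]
    for y, row in enumerate(pixels):
        for x, argb in enumerate(row):
            dc = tuple(channel // cdf for channel in argb)
            acc = sums[dc]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    # Centroid of a degraded color = mean coordinate of its pixels.
    features = [(dc, (sx / cnt, sy / cnt), cnt)
                for dc, (sx, sy, cnt) in sums.items()]
    features.sort(key=lambda f: f[2], reverse=True)
    return features[:top]
```

Each returned tuple pairs a degraded color with the centroid of its pixels, the two ingredients the distance computation needs.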
21
Computing visual similarity from EMD
  • First, the normalized Euclidean distance of the
    degraded ARGB colors is calculated, and then the
    normalized Euclidean distance of the centroids is
    calculated.

22
The maximum color distance
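  • Suppose feature i and feature j each consist of a
    degraded ARGB color and the centroid of that color.
  • The maximum color distance $MD_{color}$ is the largest
    possible distance between two degraded colors.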

23
The normalized color distance
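  • The normalized color distance $ND_{color}$ is defined
    as the Euclidean distance between the two degraded
    colors divided by the maximum color distance:

$$ND_{color} = \frac{\lVert dc_i - dc_j \rVert}{MD_{color}}$$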

24
The normalized centroid distance
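  • The maximum centroid distance is
    $MD_{centroid} = \sqrt{w^2 + h^2}$,
  • where w and h are the width and height of the
    resized images, respectively. The normalized
    centroid distance $ND_{centroid}$ is defined as

$$ND_{centroid} = \frac{\lVert C_{dc_i} - C_{dc_j} \rVert}{MD_{centroid}}$$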

25
Final equation of EMD
  • The two distances are added up with weights p and
    q, respectively, to form the feature distance,
    where p + q = 1:

$$d_{ij} = p \cdot ND_{color} + q \cdot ND_{centroid}$$
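The weighted feature distance can be sketched as follows. Two points are assumptions rather than facts from the slides: the maximum color distance is taken as the distance between the all-zero and all-maximum degraded colors, and the maximum centroid distance as the image diagonal. `feature_distance` is a hypothetical name.

```python
import math


def feature_distance(f1, f2, w=100, h=100, cdf=32, p=0.5, q=0.5):
    """Weighted sum of normalized color and centroid distances (p + q = 1).

    f1, f2: (degraded_color, centroid) pairs, e.g. ((7, 0, 0, 0), (12.5, 40.0)).
    """
    (dc1, c1), (dc2, c2) = f1, f2
    levels = 256 // cdf
    # Assumption: each degraded channel spans 0 .. levels - 1.
    md_color = math.sqrt(4 * (levels - 1) ** 2)
    # Assumption: the farthest two centroids can be apart is the diagonal.
    md_centroid = math.hypot(w, h)
    nd_color = math.dist(dc1, dc2) / md_color
    nd_centroid = math.dist(c1, c2) / md_centroid
    return p * nd_color + q * nd_centroid
```

Both normalized terms lie in [0, 1], so with p + q = 1 the feature distance also stays in [0, 1].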

26
Computing EMD-based visual similarity of two
images

$$VS = 1 - \mathrm{EMD}^{\alpha}$$

where $\alpha$ is the amplifier of visual similarity
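A minimal sketch of turning an EMD value into a similarity score. It assumes the EMD is already normalized to [0, 1] and that the amplifier acts as an exponent on it, so that identical pages score 1 and maximally distant pages score 0. The function name and the default `alpha=0.5` are illustrative.

```python
def visual_similarity(emd_value, alpha=0.5):
    """Similarity in [0, 1] from a normalized EMD; alpha is the amplifier."""
    # Assumption: similarity = 1 - EMD ** alpha, with EMD in [0, 1].
    return 1.0 - emd_value ** alpha
```

With an amplifier below 1, small EMD values are stretched apart, which makes near-identical pages easier to separate from merely similar ones.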
27
An improved adjusted threshold for classification

  • A special threshold for each given protected web
    page is used to classify a web page as either a
    phishing web page or a normal one.

$t_i$ denotes the
threshold of the ith protected Web page

28
Two types of misclassifications
  • False alarm
  • The visual similarity is larger than or equal to
    t but, in fact, the web page is not a phishing
    Web page (false positive).
  • Missing
  • The visual similarity is less than t but, in
    fact, the web page is a phishing one (false
    negative).

VSSi correlates to two accessory parameters: the
number of false alarms and the number of false negatives
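The two error types can be counted directly from labeled similarities. This is a small sketch with a hypothetical helper name, `count_errors`, following the definitions above: a false alarm is a normal page at or above the threshold, and a miss is a phishing page below it.

```python
def count_errors(similarities, is_phishing, t):
    """Return (false_alarms, missing) for threshold t.

    similarities: visual similarity of each page to one protected page.
    is_phishing: matching list of booleans (ground truth).
    """
    pairs = list(zip(similarities, is_phishing))
    false_alarms = sum(1 for s, phish in pairs if s >= t and not phish)
    missing = sum(1 for s, phish in pairs if s < t and phish)
    return false_alarms, missing
```

Raising the threshold trades false alarms for misses, which is why a separate threshold per protected page is useful.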
29
The way to classify phishing page
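  • When a suspected web page arrives, its visual
    similarity to each protected web page is computed,
    giving a visual similarity vector.
  • The classification result is obtained by comparing
    each similarity with the threshold of the
    corresponding protected web page.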
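One way to read the classification step is sketched below. It assumes a page is flagged as phishing when its similarity to any protected page reaches that page's threshold, and that the reported target is the qualifying page with the highest similarity. The exact decision equation is not preserved in this transcript, so `classify` is illustrative only.

```python
def classify(similarity_vector, thresholds):
    """Return the index of the imitated protected page, or None if normal."""
    hits = [i for i, (vs, t) in enumerate(zip(similarity_vector, thresholds))
            if vs >= t]
    if not hits:
        return None
    # Assumption: report the protected page the suspect resembles most.
    return max(hits, key=lambda i: similarity_vector[i])
```

For example, `classify([0.90, 0.50], [0.8469, 0.9434])` returns 0, while `classify([0.50, 0.94], [0.8469, 0.9434])` returns None because neither similarity reaches its threshold.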

30
Experiment configuration of phishing detection
performance
  • 10,272 homepages are selected from the web.
  • 9 phishing web pages are used, which target 8 real
    protected web pages.
  • The 10,272 + 9 web pages are mixed together to form
    the Suspected Webpage Set.
  • 1,000 web pages are randomly selected from the 10,272
    and combined with the 9 phishing web pages to
    form the Training Webpage Set.

31
Train a threshold vector
  • We use the Training Webpage Set to train a threshold
    vector

Protected Webpage                 Threshold (T)
real-Bank of Oklahoma - Online    0.8469
real-ebay1                        0.9434
real-eBay2                        0.9493
real-ICBC(Asia)                   0.7385
real-Key Bank                     0.9323
real-us bank                      0.9573
real-Washington Mutual            0.8541
real-Wells Fargo Sign On          0.9255
32
Classification precision, phishing recall, and
false alarm list( 0, 9281 Suspected Web
Pages)
33
Classification precision, phishing recall, and
false alarm list( 0.005, 9281 Suspected Web
Pages)
Reduce false negative possibilities !!
34
Phishing detection performance of image-based EMD
There are 65 false alarms
35
Phishing detection performance of HTML/DOM-based
EMD
There are 849 false alarms
36
Phishing detection performance of similarity
assessment-based EMD
There are 697 false alarms
37
Experiment results
  • The threshold vector is used to classify a
    suspected webpage.
  • In order to reduce false negative possibilities,
    there is a necessary sacrifice needed under
  • The parameters are set empirically by tuning:
    w = h = 100, α = 0.5 (the amplifier), Ss = 20,
    p = q = 0.5, and CDF = 32.

38
The number of ground truth web page for each
protected web page
39
The configuration of tuning the parameters
  • Take Nsample as the
    sample number for each protected web page
  • If a web page among the Nsample collected web pages
    is in the corresponding ground truth group, it is
    counted as a correctly detected similar web page.

40
Tuning the parameters (w and h)
  • We have four configuration options (w = h = 10,
    [...], 100, and [...]) to tune w and h.

41
Tuning the parameters (p and q)
  • 11 configuration options ((p, q) = (0, 1), (0.1, 0.9),
    (0.2, 0.8), ..., (0.9, 0.1), (1, 0)) are used
    to tune p and q.

42
Tuning the parameters (sample color number)
  • Six configuration options (Ss = 5, 10, 15, 20,
    25, and 30) are used to tune Ss.

43
Tuning the parameters (CDF)
  • Eight configuration options (CDF = 8, 16, 24, 32,
    40, 48, 56, and 64) are used to tune CDF.

44
The architecture of the built anti-phishing system
45
Conclusions
  • This approach works at the pixel level of Web
    pages rather than at the text level.
  • Experiments show that our method can achieve
    satisfactory classification precision and phishing
    recall.
  • The time efficiency of computation is also
    acceptable for online phishing detection.

46
Tentative works
  • Continue with more phishing examples and even
    larger-scale datasets.
  • The method cannot detect phishing pages that are not
    visually similar to the protected pages.
  • Keep working on developing a client-side
    application.

47
Thanks for your attention.