Improving Spam Detection Based on Structural Similarity

1
Improving Spam Detection Based on Structural Similarity
  • Authors: Luiz Henrique Gomes, Fernando Castro,
    Rodrigo Almeida, Luis Bettencourt, Virgílio
    Almeida
  • Published in: Steps to Reducing Unwanted Traffic
    on the Internet Workshop, 2005
  • Presenter: Danzhou Liu

2
Introduction
  • Volume of spam traffic is increasing sharply:
    83% of all incoming e-mails in 2005 vs. 24% in
    January 2003
  • Current detection techniques are not fully
    successful
  • Spammers evade detection by frequently changing
    the e-mail characteristics traditionally used
    for detection/filtering, such as e-mail content
    and subjects
  • False positives (legitimate e-mails
    misclassified as spam) carry a high cost for
    end users
  • Goal: improve spam detection by reducing the
    number of false positives

3
The Proposed Algorithm
  • Assumption: contact lists change less frequently
    than other characteristics
  • The set of recipients targeted by a sender is
    likely to remain stable for longer periods than
    e-mail content and subjects
  • Exploit structural relationships between senders
    and recipients
  • Senders and recipients are clustered based on
    the similarity of their contact lists
  • Historical information about each cluster is
    used for classification

4
Modeling Similarity Among Senders and Recipients
  • Vectorial representation of an e-mail sender

[Figure: senders S0, S1, S2, S3 and recipients r0, r1, r2]

5
Modeling Similarity Among Senders and Recipients
  • Vectorial representation of an e-mail sender
  • Vectorial representation of a sender cluster
  • Similarity between a sender and a sender cluster
  • Note: similar representations are used for
    recipients
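One plausible reading of these representations can be sketched in Python. The sketch assumes that a sender is a binary vector over the recipients it has contacted, that a cluster vector is the sum of its members' vectors, and that similarity is the cosine of the angle between the two. These choices, and every function name below, are illustrative assumptions rather than definitions taken from the slides.

```python
import math
from collections import Counter

def sender_vector(recipients):
    """Sparse binary vector: 1 for every recipient the sender has contacted."""
    return Counter(dict.fromkeys(recipients, 1))

def cluster_vector(member_vectors):
    """Cluster vector as the element-wise sum of member vectors (assumption)."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # Counter.update adds counts
    return total

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (assumed similarity)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

s0 = sender_vector(["r0", "r1"])
s1 = sender_vector(["r1", "r2"])
cluster = cluster_vector([s0, s1])
print(cosine_similarity(s0, cluster))
```

Swapping the roles (a recipient as a vector over the senders that wrote to it) would give the recipient-side representation under the same assumptions.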

6
The Proposed Algorithm
7
The Proposed Algorithm
For any incoming e-mail:
1. Classify the incoming e-mail with the auxiliary
   spam detection method
2. Find the similarity between the sender and each
   sender cluster
3. Add the sender to the cluster that is most
   similar to it, as long as the similarity > t
4. Calculate PS, the spam probability of the
   sender's cluster (based on the auxiliary
   classification of this e-mail and of previous
   e-mails sent by the cluster)
5. Repeat the process with the recipients and
   calculate PR (the average spam probability of
   all the recipients' clusters)
6. Finally, determine whether the incoming e-mail
   is spam based on PS and PR

[Figure: an incoming e-mail (From, To) passes through the auxiliary detection method; the sender is compared with the sender clusters (example similarities 0.9, 0.5, 0.2, 0.3; example PS = 0.8) before the spam decision]
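Steps 2 to 5 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes cosine similarity over binary contact vectors, assigns each user to a cluster only the first time the user is seen, and treats a recipient's contact list as the set of senders that wrote to it. The names `ClusterIndex` and `process` are hypothetical, and the auxiliary verdict (step 1) is passed in as a boolean.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    def __init__(self):
        self.vector = Counter()  # sum of the members' binary contact vectors
        self.spam = 0            # e-mails the auxiliary method flagged as spam
        self.total = 0           # all e-mails attributed to this cluster

    def spam_probability(self):
        return self.spam / self.total if self.total else 0.0

class ClusterIndex:
    """Clusters for one side (senders or recipients); t is the threshold."""
    def __init__(self, t):
        self.t = t
        self.clusters = []
        self.contacts = {}    # user -> set of contacts seen so far
        self.membership = {}  # user -> Cluster

    def update(self, user, contacts):
        seen = self.contacts.setdefault(user, set())
        fresh = set(contacts) - seen
        seen |= fresh
        cluster = self.membership.get(user)
        if cluster is None:  # steps 2-3: pick the most similar cluster
            vec = Counter(dict.fromkeys(seen, 1))
            best = max(self.clusters,
                       key=lambda c: cosine(vec, c.vector), default=None)
            if best is not None and cosine(vec, best.vector) > self.t:
                cluster = best
            else:
                cluster = Cluster()
                self.clusters.append(cluster)
            self.membership[user] = cluster
        for c in fresh:  # keep the cluster vector a sum of binary vectors
            cluster.vector[c] += 1
        return cluster

def process(sender, recipients, aux_is_spam, senders, receivers):
    """Return (PS, PR) for one e-mail, updating cluster histories."""
    sc = senders.update(sender, recipients)
    sc.total += 1
    sc.spam += int(aux_is_spam)
    # Distinct recipient clusters, so each is counted once per e-mail.
    rcs = {receivers.update(r, [sender]) for r in recipients}
    for rc in rcs:
        rc.total += 1
        rc.spam += int(aux_is_spam)
    pr = sum(rc.spam_probability() for rc in rcs) / len(rcs) if rcs else 0.0
    return sc.spam_probability(), pr

senders, receivers = ClusterIndex(t=0.5), ClusterIndex(t=0.5)
print(process("spammer@x", ["a", "b"], True, senders, receivers))
```

Step 6, the final decision from PS and PR, is left out here because it depends on how the two probabilities are combined.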
8
The Proposed Algorithm
[Figure: flow of an incoming e-mail (From, To) through the auxiliary detection method, the sender clusters, and the recipient clusters, ending in the spam decision]
9
Classification
  • Classify the e-mail as spam if the point (PS, PR)
    falls in the blue area
  • Classify the e-mail as legitimate if the point
    (PS, PR) falls in the green area
  • Implemented by computing the Spam Rank

[Figure: plot of PR(m) against PS(m) divided into Spam and Legitimate regions; labeled values 0.5 and 0.8]
10
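Spam Rank Computation
  • The Spam Rank vector is the point (PS(m), PR(m))
  • The Spam Rank (SR) is the norm of the projection
    of this vector onto the diagonal
  • If SR > ω, classify the e-mail as spam
  • Else, if SR < 1 - ω, classify it as legitimate
  • Otherwise, use the classification reported by
    the auxiliary algorithm

[Figure: plot of PR(m) against PS(m) showing the Spam and Legitimate regions, the thresholds ω and 1 - ω along the diagonal, and SR; labeled values 0.5 and 0.8]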
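A minimal sketch of this decision rule follows. It assumes the projection is divided by the diagonal's length (the square root of 2) so that SR lies in [0, 1] and can be compared with the thresholds; that normalization is an assumption, not something the slide states. The threshold is named `omega` in the sketch, and the default of 0.85 is the value reported in the experiments.

```python
import math

def spam_rank(ps, pr):
    """Projection of (ps, pr) onto the diagonal, scaled into [0, 1]."""
    # Dot product with the unit diagonal vector (1, 1) / sqrt(2).
    projection = (ps + pr) / math.sqrt(2)
    # Assumed normalization: divide by the diagonal length of the unit square.
    return projection / math.sqrt(2)

def classify(ps, pr, aux_is_spam, omega=0.85):
    """Use SR when it is decisive; otherwise keep the auxiliary verdict."""
    sr = spam_rank(ps, pr)
    if sr > omega:
        return "spam"
    if sr < 1 - omega:
        return "legitimate"
    return "spam" if aux_is_spam else "legitimate"

print(classify(1.0, 0.9, aux_is_spam=False))
print(classify(0.1, 0.1, aux_is_spam=True))
print(classify(0.8, 0.5, aux_is_spam=True))
```

Under this normalization SR reduces to the mean of PS and PR, so a larger `omega` widens the middle band in which the auxiliary classification is kept.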
11
Experiments
  • Experimental data set
  • Auxiliary spam detection method: SpamAssassin

12
Selecting the Similarity Threshold t
  • Number of sender/recipient clusters is roughly
    stable at t = 0.5, so t = 0.5 is used in the
    experiments

13
Effectiveness of Spam Rank
[Figure: two panels, (a) bin size = 0.25 and (b) bin size = 0.10]
  • Clusters with high PS/PR send/receive a large
    number of spam e-mails
  • Clusters with low PS/PR send/receive a large
    number of legitimate e-mails

14
E-mail Classification
  • A higher ω means that fewer e-mails can be
    classified
  • Tradeoff between the total number of e-mails
    that are classified and agreement with the
    classification provided by the original
    classifier

15
Accuracy of Spam Detection
 
  • t = 0.5, ω = 0.85

Algorithm                  % of misclassifications (false positives)
Original classification    60.33
Proposed algorithm         39.67

  • 879 e-mails were manually analyzed for possible
    false positives
 
16
Strengths of the Paper
  • New spam detection algorithm that exploits
    structural similarities of senders and recipients
  • Clustering senders/recipients based on contact
    lists
  • Using historical information of each cluster can
    improve accuracy of existing detection algorithms
  • Reduces the number of false positives produced
    by SpamAssassin

17
Weaknesses of the Paper
  • Runs slowly because high-dimensional vectors
    lead to computational overhead
  • The assumption may not always hold
  • Cannot handle forged e-mail addresses
  • The false-negative rate is possibly high

18
Suggested Improvements to the Paper
  • Dimension reduction for clustering
  • Majority voting by combining several spam
    detection techniques to reduce both false
    positives and false negatives
  • Consider spam probability of a sender-recipient
    pair
  • Further evaluation with logs covering a longer
    period (more than the eight days of logs used)
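The majority-voting suggestion can be illustrated with a short sketch. The detectors below are hypothetical toy callables that return True for spam; they do not correspond to any real filter, and treating a tie as legitimate is an assumption chosen here to favor fewer false positives.

```python
def majority_vote(email, detectors):
    """Flag the e-mail as spam only if more than half the detectors agree."""
    votes = [bool(detect(email)) for detect in detectors]
    return sum(votes) * 2 > len(votes)

# Hypothetical toy detectors, for illustration only.
def keyword_detector(email):
    return "free money" in email["body"].lower()

def many_recipients_detector(email):
    return len(email["to"]) > 50

def unknown_sender_detector(email):
    return email["from"] not in email.get("known_senders", set())

detectors = [keyword_detector, many_recipients_detector, unknown_sender_detector]
email = {"from": "x@example.com", "to": ["a@example.com"],
         "body": "FREE MONEY inside"}
print(majority_vote(email, detectors))
```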

19
Q&A