Improving Spam Detection Based on Structural Similarity - PowerPoint PPT Presentation

Description:

Title: CDA6938 Last modified by: Danzhou Liu Created Date: 3/10/2003 5:32:11 AM Document presentation format: On-screen Show Other titles: Garamond Georgia Times New ... – PowerPoint PPT presentation

Slides: 20

Provided by: ucf74

Learn more at: http://www.cs.ucf.edu

Transcript and Presenter's Notes

Title: Improving Spam Detection Based on Structural Similarity

1
Improving Spam Detection Based on Structural Similarity

Authors: Luiz Henrique Gomes, Fernando Castro, Rodrigo Almeida, Luis Bettencourt, Virgílio Almeida
Published in: Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005
Presenter: Danzhou Liu

2
Introduction

The volume of spam traffic is increasing sharply
83% of all incoming e-mails in 2005 vs. 24% in January 2003
Current detection techniques are not fully successful
Spammers evade detection by frequently changing the e-mail characteristics traditionally used for detection/filtering
E-mail content and subjects
False positives: legitimate e-mails are misclassified as spam
High cost to end users
Goal
Improve spam detection by reducing the number of false positives

3
The Proposed Algorithm

Assumption: contact lists change less frequently than other characteristics
The set of recipients targeted by a sender is likely to remain stable for longer periods than e-mail content and subjects
Exploit structural relationships between senders
and recipients
Senders and recipients are clustered based on
similarity of their contact lists
Historical information is used

4
Modeling Similarity Among Senders and Recipients

Vectorial representation of an e-mail sender

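[Figure: example with senders S0 to S3 and recipients r0 to r2, with the corresponding vector equations; not preserved in this transcript. Recipients are represented similarly.]
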
5
Modeling Similarity Among Senders and Recipients

Vectorial representation of an e-mail sender
Vectorial representation of a sender cluster
Similarity between a sender and a sender cluster
Note: similar representations are used for recipients
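
A minimal Python sketch of the three representations listed on this slide. The slide's equations are not preserved in the transcript, so the concrete choices here are assumptions: a sender is a binary vector over the recipients it has written to, a cluster vector is the sum of its members' vectors, and similarity is the cosine of the angle between two vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def sender_vector(recipients):
    """Binary vector of a sender: 1 for every recipient it has written to (assumed form)."""
    return Counter({r: 1 for r in set(recipients)})


def cluster_vector(member_vectors):
    """Cluster vector, assumed here to be the sum of its members' vectors."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # Counter.update adds counts
    return total


def cosine_similarity(a, b):
    """Cosine between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two senders that share one recipient (r1) out of two each.
s0 = sender_vector(["r0", "r1"])
s1 = sender_vector(["r1", "r2"])
cluster = cluster_vector([s0, s1])
print(cosine_similarity(s0, s1))
print(cosine_similarity(s0, cluster))
```

The same functions would apply to recipients by swapping roles: a recipient's vector is indexed by the senders that have written to it.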

6
The Proposed Algorithm
7
The Proposed Algorithm
For any incoming e-mail (From, To):
1. Classify the incoming e-mail with the auxiliary spam detection method
2. Find the similarity between the sender and each sender cluster
3. Add the sender to the cluster that is most similar to it, as long as the similarity > t
4. Calculate PS, the spam probability of the sender's cluster (based on the auxiliary classification of this e-mail and of previous e-mails sent by the cluster)
5. Repeat the process with the recipients and calculate PR (the average spam probability of all the recipients' clusters)
6. Finally, determine whether the incoming e-mail is spam based on PS and PR
[Figure: the sender is compared with the sender clusters (example similarities 0.9, 0.5, 0.2 and 0.3) and joins the most similar one, whose PS is 0.8 in the example; the auxiliary detection method and the clusters feed the final "Spam?" decision.]
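
The sender side of the steps above (steps 2 to 4) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: similarity is taken to be cosine similarity, a new cluster is started when no existing cluster exceeds the threshold t, and the cluster vector is simply updated with the contacts of each e-mail. The names `Cluster` and `assign` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    def __init__(self):
        self.vector = {}  # summed contact vectors seen for this cluster
        self.spam = 0     # e-mails the auxiliary method flagged as spam
        self.total = 0    # all e-mails seen for this cluster

    def add(self, contacts, is_spam):
        for c in contacts:
            self.vector[c] = self.vector.get(c, 0) + 1
        self.spam += int(is_spam)
        self.total += 1

    def spam_probability(self):
        return self.spam / self.total if self.total else 0.0


def assign(clusters, contacts, is_spam, t=0.5):
    """Steps 2-4 for one side: find the closest cluster, join it, return its spam probability."""
    vec = {c: 1 for c in contacts}
    best = max(clusters, key=lambda cl: cosine(vec, cl.vector), default=None)
    if best is None or cosine(vec, best.vector) <= t:
        best = Cluster()  # assumption: no cluster is similar enough, so start a new one
        clusters.append(best)
    best.add(contacts, is_spam)
    return best.spam_probability()


sender_clusters = []
ps_first = assign(sender_clusters, {"r0", "r1"}, is_spam=True)
ps_second = assign(sender_clusters, {"r0", "r1"}, is_spam=False)
ps_other = assign(sender_clusters, {"r9"}, is_spam=False)
print(ps_first, ps_second, ps_other, len(sender_clusters))
```

For step 5, the same routine would be run once per recipient (with the senders as contacts) and the resulting probabilities averaged to obtain PR.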
8
The Proposed Algorithm
[Figure: for any incoming e-mail (From, To), the sender is matched against the sender clusters and the recipients against the recipient clusters; together with the auxiliary detection method, these feed the final "Spam?" decision.]
9
Classification

Classify the e-mail as spam if the point (PS, PR)
falls in the blue area
Classify the e-mail as legitimate if the point
(PS, PR) falls in the green area
Implemented by computing Spam Rank

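[Figure: plane with axes PS(m) and PR(m), divided into a Spam region and a Legitimate region; the values 0.8 on PS(m) and 0.5 on PR(m) are marked.]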
10
Spam Rank Computation
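The Spam Rank vector is the vector to the point (PS(m), PR(m))
The Spam Rank (SR) is the norm of the projection of this vector onto the diagonal
If SR > ω (the Spam Rank threshold), classify the e-mail as spam
Else, if SR < 1 − ω, classify it as legitimate
Otherwise, use the classification reported by the auxiliary algorithm
[Figure: plane with axes PS(m) and PR(m); the thresholds ω and 1 − ω are marked along the diagonal, separating the Spam and Legitimate regions; the values 0.8 on PS(m) and 0.5 on PR(m) are marked, with their projection SR.]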
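
A small Python sketch of the rule on this slide. The slide's equations are not preserved in the transcript, so two points are assumptions: the projected length is divided by sqrt(2) so that SR lies in [0, 1] (which makes it comparable to a threshold and to one minus that threshold), and the threshold, called `omega` here, defaults to the 0.85 quoted later in the talk. The function names are hypothetical.

```python
import math


def spam_rank(ps, pr):
    """Length of the projection of (ps, pr) onto the diagonal, scaled into [0, 1]."""
    projection = (ps + pr) / math.sqrt(2)  # dot product with the unit diagonal (1, 1)/sqrt(2)
    return projection / math.sqrt(2)       # assumed normalisation: the point (1, 1) maps to 1


def classify(ps, pr, auxiliary_label, omega=0.85):
    """Three-way rule: spam, legitimate, or fall back to the auxiliary method's label."""
    sr = spam_rank(ps, pr)
    if sr > omega:
        return "spam"
    if sr < 1 - omega:
        return "legitimate"
    return auxiliary_label


print(spam_rank(0.8, 0.5))
print(classify(0.8, 0.5, auxiliary_label="spam"))
print(classify(0.95, 0.9, auxiliary_label="legitimate"))
print(classify(0.1, 0.05, auxiliary_label="spam"))
```

With this normalisation SR is simply the mean of PS and PR, so the example point (0.8, 0.5) falls in the middle band and keeps the auxiliary label.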
11
Experiments

Experimental Data Set
Auxiliary spam detection method: SpamAssassin

12
Selecting the Similarity Threshold t

Number of sender/recipient clusters is roughly
stable
for t 0.5; therefore t = 0.5 is used in the experiments

13
Effectiveness of Spam Rank
[Figure panels: (a) bin size 0.25; (b) bin size 0.10]

Clusters with high PS/PR send/receive a large number of spam e-mails
Clusters with low PS/PR send/receive a large number of legitimate e-mails

14
E-mail Classification

A higher threshold ω means that fewer e-mails can be classified by Spam Rank
Tradeoff between the total number of e-mails that are classified and the agreement with the classification provided by the original classifier
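
The first half of this tradeoff can be illustrated with a short Python sketch: as the Spam Rank threshold (called `omega` here) grows, fewer e-mails have a Spam Rank above it or below one minus it, so more are left to the original classifier. The Spam Rank values below are made up for illustration and are not data from the paper; `decided_fraction` is a hypothetical helper.

```python
def decided_fraction(spam_ranks, omega):
    """Share of e-mails whose Spam Rank is above omega or below 1 - omega."""
    decided = [sr for sr in spam_ranks if sr > omega or sr < 1 - omega]
    return len(decided) / len(spam_ranks)


# Made-up Spam Rank values, for illustration only.
sample = [0.02, 0.10, 0.30, 0.50, 0.70, 0.90, 0.97]

for omega in (0.60, 0.85, 0.95):
    print(omega, decided_fraction(sample, omega))
```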

15
Accuracy of Spam Detection

t = 0.5, ω = 0.85

Algorithm                  % of misclassifications (false positives)
Original classification    60.33
The proposed algorithm     39.67

879 e-mails were manually analyzed for possible
false positives

16
Strengths of the Paper

New spam detection algorithm that exploits
structural similarities of senders and recipients
Clustering senders/recipients based on contact
lists
Using the historical information of each cluster can improve the accuracy of existing detection algorithms
Reduces the number of false positives produced by SpamAssassin

17
Weaknesses of the Paper

Runs slowly because the high-dimensional vectors lead to computational overhead
The assumption may not always hold
Cannot handle forged e-mail addresses
The false-negative rate is possibly high

18
Improvements of the Paper

Dimension reduction for clustering
Majority voting by combining several spam
detection techniques to reduce both false
positives and false negatives
Consider spam probability of a sender-recipient
pair
Further evaluation with logs covering a longer period (more than the eight days of logs used)

19
Q&A
