Title: Improving Spam Detection Based on Structural Similarity
1 Improving Spam Detection Based on Structural Similarity
- Authors: Luiz Henrique Gomes, Fernando Castro, Rodrigo Almeida, Luis Bettencourt, Virgílio Almeida
- Published in: Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005
- Presenter: Danzhou Liu
2 Introduction
- Volume of spam traffic is increasing sharply
  - 83% of all incoming e-mails in 2005 vs. 24% in January 2003
- Current detection techniques are not fully successful
  - Spammers evade detection by frequently changing the e-mail characteristics traditionally used for detection/filtering
    - E-mail content and subjects
  - False positives: legitimate e-mails are misclassified as spam
    - High cost to end-users
- Goals
  - Improve spam detection by reducing the number of false positives
3 The Proposed Algorithm
- Assumption: contact lists change less frequently than other characteristics
  - The set of recipients targeted by a sender is likely to remain stable for longer periods than e-mail content and subjects
- Exploit structural relationships between senders and recipients
  - Senders and recipients are clustered based on the similarity of their contact lists
- Historical information is used
4 Modeling Similarity Among Senders and Recipients
- Vectorial representation of an e-mail sender
[Figure: example with senders S0, S1, S2, S3 and recipients r0, r1, r2]
5 Modeling Similarity Among Senders and Recipients
- Vectorial representation of an e-mail sender
- Vectorial representation of a sender cluster
- Similarity between a sender and a sender cluster
- Note: similar representations are used for recipients
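The slide names three building blocks without giving their formulas here. As a minimal sketch, assume a sender is a binary vector over recipients (1 for each address the sender has written to), a cluster is the element-wise sum of its members' vectors, and similarity is the cosine of the angle between the two vectors. The cosine measure and the names `sender_vector`, `cluster_vector` and `cosine_similarity` are illustrative assumptions, not definitions taken from the slides.

```python
import math
from collections import Counter

def sender_vector(recipients):
    """Assumed sender representation: 1 for every recipient the sender wrote to."""
    return {r: 1 for r in recipients}

def cluster_vector(member_vectors):
    """Assumed cluster representation: element-wise sum of the member vectors."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # Counter.update adds the counts of a mapping
    return dict(total)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical contact lists, reusing the S/r labels from the figure.
s0 = sender_vector(["r0", "r1"])
s1 = sender_vector(["r1", "r2"])
s2 = sender_vector(["r0", "r1"])
cluster = cluster_vector([s0, s2])

print(cosine_similarity(s0, s2))       # identical contact lists
print(cosine_similarity(s1, cluster))  # partial overlap with the cluster
```

The same functions would apply to recipients by swapping roles, so that a recipient is described by the set of senders that write to it.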
6 The Proposed Algorithm
7 The Proposed Algorithm
[Figure: sender clusters with example similarities to the sender (sim = 0.9, 0.5, 0.2, 0.3) and an example cluster probability PS = 0.8; the auxiliary detection method feeds the final "Spam?" decision]
For any incoming e-mail (From, To):
1. Classify the incoming e-mail with an auxiliary spam detection method
2. Find the similarity between the sender and each sender cluster
3. Add the sender to the cluster that is most similar to it, as long as the similarity > t
4. Calculate PS, the spam probability of the sender's cluster (based on the auxiliary classification of this e-mail and of previous e-mails sent by the cluster)
5. Repeat the process with the recipients and calculate PR (the average spam probability of all of the recipients' clusters)
6. Finally, determine whether the incoming e-mail is spam based on PS and PR
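Steps 2 to 5 can be sketched roughly as follows. This is an illustration under several assumptions that the slides do not state: similarity is cosine similarity, a sender that matches no cluster above t starts a new cluster, each call treats its vector as a new cluster member, and a cluster's spam probability is the fraction of its e-mails that the auxiliary method labeled as spam. The class `Cluster` and the functions `assign`, `sender_probability` and `recipient_probability` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    """Hypothetical cluster record: summed contact vector plus spam history."""
    def __init__(self):
        self.vector = {}  # sum of the members' contact vectors
        self.spam = 0     # e-mails the auxiliary method labeled as spam
        self.total = 0    # all e-mails attributed to this cluster

    def add_member(self, vec):
        for k, v in vec.items():
            self.vector[k] = self.vector.get(k, 0) + v

    def record(self, is_spam):
        self.total += 1
        self.spam += int(is_spam)

    def spam_probability(self):
        return self.spam / self.total if self.total else 0.0

def assign(vec, clusters, t):
    """Steps 2-3: join the most similar cluster if similarity > t.
    Starting a new cluster otherwise is an assumption of this sketch."""
    best = max(clusters, key=lambda c: cosine(vec, c.vector), default=None)
    if best is None or cosine(vec, best.vector) <= t:
        best = Cluster()
        clusters.append(best)
    best.add_member(vec)
    return best

def sender_probability(sender_vec, aux_is_spam, clusters, t=0.5):
    """Step 4: PS is the spam fraction of the sender's cluster, including this e-mail."""
    cluster = assign(sender_vec, clusters, t)
    cluster.record(aux_is_spam)
    return cluster.spam_probability()

def recipient_probability(recipient_vecs, aux_is_spam, clusters, t=0.5):
    """Step 5: PR is the average spam probability over the recipients' clusters."""
    probs = []
    for vec in recipient_vecs:
        cluster = assign(vec, clusters, t)
        cluster.record(aux_is_spam)
        probs.append(cluster.spam_probability())
    return sum(probs) / len(probs) if probs else 0.0

# Two e-mails from senders with the same contact list; the auxiliary method
# labels the first as spam and the second as legitimate.
sender_clusters = []
ps_first = sender_probability({"r0": 1, "r1": 1}, True, sender_clusters)
ps_second = sender_probability({"r0": 1, "r1": 1}, False, sender_clusters)
print(ps_first, ps_second, len(sender_clusters))
```

Keeping only two counters per cluster is one simple way to carry the historical information forward; other ways of summarizing a cluster's history would fit the same outline.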
8 The Proposed Algorithm
[Figure: for any incoming e-mail (From, To), the sender is matched against the sender clusters and the recipients against the recipient clusters; together with the auxiliary detection method this leads to the "Spam?" decision]
9 Classification
- Classify the e-mail as spam if the point (PS, PR) falls in the blue area
- Classify the e-mail as legitimate if the point (PS, PR) falls in the green area
- Implemented by computing the Spam Rank
[Figure: plane with axes PS(m) and PR(m), regions labeled Spam and Legitimate, marked values 0.8 and 0.5]
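10 Spam Rank Computation
- The Spam Rank vector is (PS(m), PR(m))
- The Spam Rank (SR) is the norm of the projection of this vector onto the diagonal
- If SR > ω, classify the e-mail as spam
- Else, if SR < 1 - ω, classify it as legitimate
- Otherwise, use the classification reported by the auxiliary algorithm
[Figure: plane with axes PS(m) and PR(m) showing SR along the diagonal, the thresholds ω and 1 - ω, the Spam and Legitimate regions, and marked values 0.8 and 0.5]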
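The rule on this slide can be sketched as below. The sketch assumes the Spam Rank is scaled so that it lies between 0 and 1, which makes it equal to the average of PS and PR; that scaling, the threshold name `omega`, the value 0.85 and the function names are assumptions made for illustration.

```python
import math

def spam_rank(ps, pr):
    """Length of the projection of (ps, pr) onto the diagonal, scaled into [0, 1]."""
    projection = (ps + pr) / math.sqrt(2)  # dot product with the unit diagonal (1, 1)/sqrt(2)
    return projection / math.sqrt(2)       # assumed scaling: the point (1, 1) gives SR = 1

def classify(ps, pr, aux_is_spam, omega):
    """Three-way rule: spam above omega, legitimate below 1 - omega,
    otherwise fall back to the auxiliary classification."""
    sr = spam_rank(ps, pr)
    if sr > omega:
        return "spam"
    if sr < 1 - omega:
        return "legitimate"
    return "spam" if aux_is_spam else "legitimate"

# omega = 0.85 is a hypothetical threshold chosen only for this example.
print(classify(0.9, 0.95, aux_is_spam=False, omega=0.85))  # SR is above omega
print(classify(0.8, 0.5, aux_is_spam=False, omega=0.85))   # falls back to the auxiliary label
```

For the two regions not to overlap, the threshold has to be larger than 0.5; with a threshold close to 1, most e-mails land in the middle band and keep the auxiliary classification.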
11 Experiments
- Experimental Data Set
- Auxiliary Spam Detection Method
  - SpamAssassin
12 Selecting the Similarity Threshold t
- Number of sender/recipient clusters is roughly stable around t = 0.5
  - Therefore, t = 0.5 is used in the experiments
13 Effectiveness of Spam Rank
[Figure: two panels with bin sizes 0.25 and 0.10]
- Clusters with high PS/PR send/receive a large number of spam e-mails
- Clusters with low PS/PR send/receive a large number of legitimate e-mails
14 E-mail Classification
- A higher ω means that a smaller number of e-mails can be classified
- Tradeoff between the total number of e-mails that are classified and the agreement with the classification provided by the original classifier algorithm
15 Accuracy of Spam Detection
Algorithm                  % of Misclassification (false positives)
Original classification    60.33
The proposed algorithm     39.67
- 879 e-mails were manually analyzed for possible false positives
16 Strengths of the Paper
- New spam detection algorithm that exploits structural similarities of senders and recipients
- Clustering senders/recipients based on contact lists
- Using historical information of each cluster can improve the accuracy of existing detection algorithms
- Reduces the number of false positives caused by SpamAssassin
17 Weaknesses of the Paper
- Runs slowly because the high-dimensional vectors lead to computational overhead
- The assumption may not always hold
- Cannot handle forged e-mail addresses
- The false negative rate is possibly high
18 Improvements of the Paper
- Dimension reduction for clustering
- Majority voting by combining several spam detection techniques to reduce both false positives and false negatives
- Consider the spam probability of a sender-recipient pair
- Further evaluation with logs covering a longer period (more than eight-day logs)
19 Q&A