Improving Digest-Based Collaborative Spam Detection

Provided by: pcsy4

Learn more at: https://projects.csail.mit.edu

1
Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.
2
Talk content

Digest-based filtering: global picture overview
Understanding HOW digests WORK: Open Digest Paper [1] (very positive results/conclusions, cited and referred to a lot!)
Understanding it better: our re-evaluation of Open Digest Paper results (different conclusions!)
Our alternative digests: results IMPROVE a lot, understanding WHY
Understanding the why => further improvements possible (negative selection)
Conclusions

[1] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, "An Open Digest-based Technique for Spam Detection", in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.
3
Two main collaborative spam detection approaches
1) White-listing using social networks (users linked by relationships)
   Example: PGP graph of certificates
2) Bulky content detection using digests (users share recent digests)
   Examples: DCC, Vipul's Razor, Commtouch
Implementations (in both cases): centralized or decentralized, open or proprietary
This talk (paper): digests approach for bulky content detection
4
A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)

[Diagram: a spammer (sends in bulk) reaches n x millions of mail users through n x 10 000 mail servers; the mail servers send "Query: digest" to 250 DCC servers and receive "Reply: counter" (n=3)]

Strengths: fast response
Drawbacks: not precise (FP problems), limited obfuscation resistance

Reproducible evaluation of digest efficiency: Open Digest Paper
5
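Producing digests: Nilsimsa similarity hashing, as explained in OD-paper

[Figure: an e-mail, L characters long (e.g. "Cheapest vac... Best Regards, John"), is scanned with a sliding window of N=5 characters. Trigrams taken from the window are hashed (Hash: 30^3 -> 2^8), and each hash value increments one position of an accumulator indexed 0..255. The accumulator is then converted into the digest, one bit per position 0..255.]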
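
As a rough illustration of the pipeline in the figure, the following Python sketch builds a 256-bit digest from a sliding window of N=5 characters. It is not the real Nilsimsa algorithm: the choice of trigrams, the MD5-based stand-in hash, and the "above the mean" rule for setting bits are assumptions made for illustration, and `sketch_digest` and `bucket` are hypothetical names.

```python
import hashlib
from itertools import combinations

WINDOW = 5      # N = 5 characters, as on the slide
BUCKETS = 256   # accumulator positions 0..255


def bucket(trigram):
    """Stand-in hash: first byte of MD5 (assumption, not Nilsimsa's own hash)."""
    return hashlib.md5(trigram.encode("utf-8")).digest()[0]


def sketch_digest(text):
    """Return a 256-character string of '0'/'1' built from trigram counts."""
    accumulator = [0] * BUCKETS
    for end in range(WINDOW, len(text) + 1):
        window = text[end - WINDOW:end]
        newest = window[-1]
        # Assumed trigram choice: two earlier characters plus the newest one.
        for first, second in combinations(window[:-1], 2):
            accumulator[bucket(first + second + newest)] += 1
    mean = sum(accumulator) / BUCKETS
    # Assumed thresholding: a bit is 1 where the count exceeds the mean.
    return "".join("1" if count > mean else "0" for count in accumulator)


digest = sketch_digest("Cheapest vacations ever! ... Best Regards, John")
print(len(digest))
```

Because every trigram only touches one accumulator position, a small local change to the text changes only a few counts, which is what makes this family of hashes similarity-preserving.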

Digest is a binary string of 256 bits
Definition: the Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128.
Identical emails => NCV=128, unrelated emails => NCV close to 0.

More similar emails => more similar digests => higher NCV
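
The NCV definition translates directly into a few lines of Python. This is a minimal sketch that assumes digests are represented as 256-character strings of '0' and '1'; the function name `ncv` is chosen here for illustration.

```python
def ncv(digest_a, digest_b):
    """Nilsimsa Compare Value: number of equal bit positions, minus 128."""
    if len(digest_a) != 256 or len(digest_b) != 256:
        raise ValueError("digests must be 256 bits long")
    equal_bits = sum(1 for x, y in zip(digest_a, digest_b) if x == y)
    return equal_bits - 128


zeros = "0" * 256
ones = "1" * 256
half = "0" * 128 + "1" * 128

print(ncv(zeros, zeros))  # 128: identical digests
print(ncv(zeros, half))   # 0: half of the bits agree
print(ncv(zeros, ones))   # -128: no bit agrees
```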
6
Open Digest paper experiments and results

Evaluation <= experiment
spam bulk detection <= detection of similarity between two emails from the same spam bulk
ham miss-detection <= miss-detection of similarity between unrelated emails

Bulk detection experiment (repeated many times, to get statistics)
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (e.g. 01011010, 01101011) -> evaluate similarity
Evaluate similarity: NCV value (integer), compared to Threshold=54, gives the matching indicator (0/1)

[Plot: OD-paper result for adding random text obfuscation]

OD-paper only evaluates (talks about) the average NCV
OD-paper conclusion: average NCV > Threshold => bulk detection is resistant to strong obfuscation by the spammer
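
The difference between the average NCV and the per-pair matching indicator can be made concrete with a small sketch. The NCV values below are made up for illustration (they are not results from the paper or the talk), and the strict `>` comparison against the threshold of 54 is an assumption.

```python
THRESHOLD = 54


def summarize(ncv_values, threshold=THRESHOLD):
    """Return (average NCV, fraction of pairs whose NCV exceeds the threshold)."""
    indicators = [1 if value > threshold else 0 for value in ncv_values]
    average_ncv = sum(ncv_values) / len(ncv_values)
    matching_fraction = sum(indicators) / len(indicators)
    return average_ncv, matching_fraction


# Hypothetical NCV values from six repetitions of the experiment.
ncv_values = [120, 110, 100, 20, 10, 0]
average_ncv, matching_fraction = summarize(ncv_values)
print(average_ncv)        # 60.0, above the threshold
print(matching_fraction)  # 0.5, yet only half of the pairs match
```

With these made-up numbers the average sits above the threshold while half of the pairs go undetected, which is why looking at the average alone can be misleading.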
7
Open Digest paper experiments and results
(cont.)
Ham miss-detection experiment
Ham and spam corpus: n1=2500, n2=2500 emails
For each pair of unrelated emails: compute digests (e.g. 10011010, 01110011) -> evaluate similarity
Evaluate similarity: NCV values (integer), compared to Threshold=54, give the matching indicators (0/1)

OD-paper result: no matching (miss-detection) case is observed

OD-paper conclusion: miss-detection of good emails must be very low; approximating the miss-detection probability by use of the Binomial distribution supports the observed result
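
One way to read the Binomial approximation mentioned in the OD-paper conclusion is sketched below. It assumes that, for two unrelated emails, each of the 256 digest bits agrees independently with probability 0.5, and that a match means NCV > 54. Both are assumptions made for this sketch; the slides do not give the exact model.

```python
from math import comb

DIGEST_BITS = 256
THRESHOLD = 54


def binomial_match_probability(threshold=THRESHOLD, bits=DIGEST_BITS, p_equal=0.5):
    """P(NCV > threshold) when the number of equal bits is Binomial(bits, p_equal)."""
    # NCV = equal_bits - bits/2, so NCV > threshold means
    # equal_bits >= threshold + bits/2 + 1.
    min_equal = threshold + bits // 2 + 1
    return sum(
        comb(bits, k) * p_equal ** k * (1 - p_equal) ** (bits - k)
        for k in range(min_equal, bits + 1)
    )


print(binomial_match_probability())
```

Under these assumptions the probability is tiny, which agrees with the OD-paper observation of no matches. The later slides (shifted and wide NCV histograms) indicate that real unrelated emails do not follow this idealized model.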
8
Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics)
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (e.g. 01011010, 01101011) -> evaluate similarity
Evaluate similarity: NCV value (integer), compared to Threshold=54, gives the matching indicator (0/1)

[Plot: average NCV versus obfuscation ratio] The OD-paper result is well reproduced (blue dotted line)

OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold
9
Understanding better what happens
Compare X to database (generic experiment)
X is taken from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1)
[Diagram: X; n1 and n2 emails; Spam Corpus (Obfuscation 2) -> select at random -> compute digest (e.g. 01011010)]
Database DB of spam and ham digests (represents previous digest queries)
Compare X to each digest from DB: NCV values (integer), compared to Threshold=54, give the matching indicators (0/1)
We look at more metrics:
Probability of email-to-email matching
Max(NCV) average
NCV histogram
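
The three metrics can be sketched as follows. The sketch assumes digests are 256-character bit strings, that "probability of email-to-email matching" is the fraction of (query, database) pairs with NCV above the threshold, and that "Max(NCV) average" is the mean over queries of the largest NCV found in the database. These readings and all function names are assumptions.

```python
from collections import Counter

THRESHOLD = 54


def ncv(digest_a, digest_b):
    """Number of equal bit positions, minus 128."""
    return sum(1 for x, y in zip(digest_a, digest_b) if x == y) - 128


def db_metrics(query_digests, db_digests, threshold=THRESHOLD):
    """Return (pair matching probability, mean of max NCV, NCV histogram)."""
    all_values = []
    max_values = []
    for query in query_digests:
        values = [ncv(query, stored) for stored in db_digests]
        all_values.extend(values)
        max_values.append(max(values))
    matching_probability = sum(v > threshold for v in all_values) / len(all_values)
    mean_max_ncv = sum(max_values) / len(max_values)
    histogram = Counter(all_values)
    return matching_probability, mean_max_ncv, histogram


# Toy digests, for illustration only.
db = ["0" * 256, "1" * 256]
queries = ["0" * 256, "0" * 128 + "1" * 128]
probability, mean_max, histogram = db_metrics(queries, db)
print(probability)  # 0.25
print(mean_max)     # 64.0
```

In the toy data the mean of Max(NCV) is 64.0, above the threshold, although one of the two queries matches nothing. The histogram keeps that information, which is the motivation for looking beyond averages.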
10
SPAM DB experiment results
Mean Max(NCV) value not informative
Effect of obfuscation changes gracefully: the spammer may gain by additional obfuscation.
11
SPAM DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for bulk detection
12
SPAM DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are no longer useful!
13
HAM DB experiment results
Mean Max(NCV) value not informative
Miss-detection probability still too high for
practical use
14
HAM DB, NCV histograms: effect of obfuscation
Spam obfuscation does not impact miss-detection of good emails.
Shifted and wide histogram phenomena => high false positives explained
15
Alternative digests
Sampling strings: fixed length, random positions (one digest per sampled string, e.g. 01101011, 10111011, 00101010)
Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases)
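
A minimal sketch of the two steps, sampling and matching, is given below. How many strings are sampled, their length, and how each sampled string is turned into a digest are not specified on the slide, so `sample_strings` takes them as parameters and the matching function works on digests that have already been computed. All names are hypothetical.

```python
import random


def sample_strings(text, length, count, rng):
    """Sample `count` substrings of fixed `length` at random positions."""
    if len(text) <= length:
        return [text]
    last_start = len(text) - length
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [text[start:start + length] for start in starts]


def ncv(digest_a, digest_b):
    """Number of equal bit positions, minus 128."""
    return sum(1 for x, y in zip(digest_a, digest_b) if x == y) - 128


def email_to_email_ncv(digests_a, digests_b):
    """Max NCV over all pairs of digests, one from each email."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)


rng = random.Random(0)
print(sample_strings("Cheapest vacations ever! Best Regards, John", 10, 3, rng))

# Toy digests: the two emails share one identical part.
email_1 = ["0" * 256, "01" * 128]
email_2 = ["1" * 256, "01" * 128]
print(email_to_email_ncv(email_1, email_2))  # 128
```

Taking the maximum means that one well-preserved fragment is enough to link two emails, even if the rest of the text has been obfuscated.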
16
SPAM DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to obfuscation...
17
SPAM DB (alt. digests): effect of obfuscation
... and we can see why.
18
HAM DB experiment results (alt. digests)

Miss-detection probability still too high

19
HAM DB (alt. digests): effect of obfuscation
What can be done to decrease ham miss-detection?
20
Alternative digests open new possibilities
New email -> digest(s) -> negative selection against a database of good digests -> digests that do not match -> compare to collaborative database of digests (DB)
The last step (comparison to DB) is the same as without negative selection
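
The negative selection step can be sketched as a filter placed in front of the usual database comparison. The sketch assumes digests are 256-character bit strings and that the same threshold of 54 is used for both steps; the slides do not state the threshold used for negative selection, and the function names are hypothetical.

```python
THRESHOLD = 54


def ncv(digest_a, digest_b):
    """Number of equal bit positions, minus 128."""
    return sum(1 for x, y in zip(digest_a, digest_b) if x == y) - 128


def negative_selection(email_digests, good_digests, threshold=THRESHOLD):
    """Keep only the digests that do not match any known good digest."""
    return [
        digest for digest in email_digests
        if all(ncv(digest, good) <= threshold for good in good_digests)
    ]


def matches_collaborative_db(email_digests, good_digests, db_digests,
                             threshold=THRESHOLD):
    """Compare the surviving digests to the collaborative database."""
    kept = negative_selection(email_digests, good_digests, threshold)
    return any(ncv(d, stored) > threshold for d in kept for stored in db_digests)


# Toy digests: the first one looks like a common good phrase.
good = ["0" * 256]
email = ["0" * 256, "1" * 256]
print(negative_selection(email, good) == ["1" * 256])        # True
print(matches_collaborative_db(email, good, ["0" * 256]))    # False
print(matches_collaborative_db(email, good, ["1" * 256]))    # True
```

In the toy example the digest resembling good content is discarded before the lookup, so it can no longer cause a match against the collaborative database.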
21
Effect of negative selection on miss-detection of
ham
22
Conclusions