Title: Improving Digest-Based Collaborative Spam Detection
1 Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.
2 Talk content
- Digest-based filtering: global picture overview
- Understanding HOW digests WORK
  - Open Digest Paper [1] (very positive results/conclusions, cited and referred to a lot!)
- Understanding it better
  - Our re-evaluation of the Open Digest Paper results (different conclusions!)
- Our alternative digests
  - Results IMPROVE a lot, understanding WHY
- Understanding the why => further improvements possible
  - (Negative selection)
- Conclusions
[1] "An Open Digest-based Technique for Spam Detection", E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.
3 Two main collaborative spam detection approaches
1) White-listing using social networks
   Example: PGP graph of certificates
2) Bulky content detection using digests
   Examples: DCC, Vipul's Razor, Commtouch
[Figure: two diagrams of users (User 1, User 2, User 3, ... User n), one labeled "relationships", the other labeled "digests" / "Recent digests"]
Implementations (in both cases): centralized or decentralized, open or proprietary
This talk (paper): digests approach for bulky content detection
4 A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
[Figure: 250 DCC servers; n x 10 000 mail servers; n millions of mail users; a mail server sends "Query: digest" and receives "Reply: counter" (n=3 in the figure); the spammer sends in bulk]
- Strengths/drawbacks
  - fast response
  - not precise (FP problems)
  - limited obfuscation resistance
Reproducible evaluation of digests' efficiency: Open Digest Paper
5 Producing Digests: Nilsimsa similarity hashing, as explained in the OD-paper
[Figure: an e-mail, L characters long ("Cheapest vac... Best Regards, John"), is scanned with a sliding window of N=5 characters (e.g. "Cheap"); each window yields trigrams 1, 2, ... 8; each trigram is hashed (Hash: 30^3 -> 2^8, e.g. to 00001111) and adds 1 to the corresponding accumulator position (0 ... 15 ... 255); the accumulator is then turned into the digest bits (0 ... 15 ... 255)]
- Digest is a binary string of 256 bits
- Definition: the Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128.
- Identical emails => NCV = 128; unrelated emails => NCV close to 0.
More similar emails => more similar digests => higher NCV
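The digest construction and the NCV definition above can be sketched in Python. This is a simplified illustration, not the Nilsimsa reference code. Only the 5-character window, the 256-position accumulator and the NCV definition come from the slide. The trigram hash (first byte of MD5), the six trigrams taken per window (the slide shows eight), and the rule that sets a digest bit when a counter exceeds the mean count are assumptions made for this sketch, and `toy_digest` is a hypothetical name.

```python
import hashlib
from itertools import combinations

WINDOW = 5          # N = 5 characters, as on the slide
DIGEST_BITS = 256   # accumulator size and digest length


def trigram_bucket(trigram: str) -> int:
    # Assumption: stand-in hash, first byte of MD5, giving a value in 0..255.
    return hashlib.md5(trigram.encode("utf-8")).digest()[0]


def toy_digest(text: str) -> int:
    """Return a 256-bit similarity digest of `text` as a Python int."""
    acc = [0] * DIGEST_BITS
    total = 0
    for i in range(len(text) - WINDOW + 1):
        window = text[i:i + WINDOW]
        # Simplification: pair the newest character with every in-order pair
        # of the four older ones (6 trigrams per window, not the slide's 8).
        for a, b in combinations(window[:-1], 2):
            acc[trigram_bucket(a + b + window[-1])] += 1
            total += 1
    threshold = total / DIGEST_BITS  # assumption: bit is set above the mean count
    digest = 0
    for pos, count in enumerate(acc):
        if count > threshold:
            digest |= 1 << pos
    return digest


def ncv(d1: int, d2: int) -> int:
    """Nilsimsa Compare Value: number of equal bit positions, minus 128."""
    equal_bits = DIGEST_BITS - bin(d1 ^ d2).count("1")
    return equal_bits - 128
```

With this definition, `ncv(d, d)` is 128 for any digest `d`, and two digests that differ in every bit give -128.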
6 Open Digest paper experiments and results
- Evaluation <= experiment
  - spam bulk detection <= detection of similarity between two emails from the same spam bulk
  - ham miss-detection <= miss-detection of similarity between unrelated emails
Bulk detection experiment (repeated many times, to get statistics):
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (e.g. 01011010, 01101011) -> evaluate similarity -> NCV value (integer) -> threshold = 54 -> matching indicator (0/1)
[Plot: OD-paper result for "adding random text" obfuscation]
- OD-paper only evaluates (talks about) the average NCV
OD-paper conclusion: average NCV > threshold => bulk detection is resistant to strong obfuscation by the spammer
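The bulk detection experiment can be sketched as follows. The digest is a stand-in (trigram counts hashed with MD5), not real Nilsimsa. The obfuscation model (random lowercase text appended, with length `ratio` times the original length) and the rule "match when NCV > threshold" are assumptions for illustration; the slides do not spell out how the OD-paper defines the obfuscation ratio. All function names are hypothetical.

```python
import hashlib
import random
import string


def toy_digest(text: str) -> int:
    """Stand-in 256-bit similarity digest (assumption, not real Nilsimsa)."""
    acc = [0] * 256
    for i in range(len(text) - 2):
        acc[hashlib.md5(text[i:i + 3].encode("utf-8")).digest()[0]] += 1
    threshold = max(len(text) - 2, 0) / 256
    return sum(1 << pos for pos, count in enumerate(acc) if count > threshold)


def ncv(d1: int, d2: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(d1 ^ d2).count("1")


def obfuscate(text: str, ratio: float, rng: random.Random) -> str:
    """Assumed obfuscation: append random text of length ratio * len(text)."""
    n = int(len(text) * ratio)
    junk = "".join(rng.choice(string.ascii_lowercase + " ") for _ in range(n))
    return text + " " + junk


def bulk_detection_trial(spam, ratio, rng, threshold=54):
    """Obfuscate two copies of one spam and compare their digests."""
    copy1 = obfuscate(spam, ratio, rng)
    copy2 = obfuscate(spam, ratio, rng)
    value = ncv(toy_digest(copy1), toy_digest(copy2))
    return value, int(value > threshold)  # NCV value, matching indicator (0/1)


def run_experiment(corpus, ratio, trials=200, seed=0):
    """Repeat the trial and return (average NCV, fraction of matches)."""
    rng = random.Random(seed)
    results = [bulk_detection_trial(rng.choice(corpus), ratio, rng)
               for _ in range(trials)]
    mean_ncv = sum(v for v, _ in results) / trials
    match_rate = sum(m for _, m in results) / trials
    return mean_ncv, match_rate
```

Returning the match rate next to the average NCV keeps both views of the same trials available, so the average alone does not have to carry the conclusion.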
7 Open Digest paper experiments and results (cont.)
Ham miss-detection experiment:
Ham and spam corpus -> for each pair of unrelated emails -> compute digests (e.g. 10011010, 01110011) -> evaluate similarity -> NCV values (integer) -> threshold = 54 -> matching indicators (0/1)
- OD-paper result
  - n1 = 2500, n2 = 2500 emails
  - no matching (miss-detection) case is observed
- OD-paper conclusion
  - Miss-detection of good emails must be very low
  - approximating the miss-detection probability by use of the Binomial distribution supports the observed result
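A Binomial approximation of this kind can be worked through in a few lines. The sketch assumes that each of the 256 bit positions of two unrelated digests agrees independently with probability 1/2, and that a match means NCV > 54; both are assumptions of this example, since the slides do not give the OD-paper's exact model. Treating n1 x n2 as the number of compared pairs is likewise an assumption.

```python
from math import comb

DIGEST_BITS = 256
THRESHOLD = 54


def binomial_match_probability(threshold: int = THRESHOLD,
                               bits: int = DIGEST_BITS) -> float:
    """P(NCV > threshold) if equal-bit count ~ Binomial(bits, 1/2)."""
    # NCV = equal_bits - bits/2, so NCV > threshold means
    # equal_bits >= threshold + bits/2 + 1.
    min_equal = threshold + bits // 2 + 1
    favourable = sum(comb(bits, k) for k in range(min_equal, bits + 1))
    return favourable / 2 ** bits


p = binomial_match_probability()
pairs = 2500 * 2500            # assumption: every n1 x n2 pair is compared
expected_matches = p * pairs   # expected number of miss-detections under the model
```

Under this independence assumption the expected number of matches among the pairs is far below one, which agrees with observing no match. The model says nothing about digests whose bits are correlated.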
8 Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in the OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics):
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (e.g. 01011010, 01101011) -> evaluate similarity -> NCV value (integer) -> threshold = 54 -> matching indicator (0/1)
- OD-paper result is well recovered (blue dotted line)
OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold
9 Understanding better what happens
Compare X to Database (generic experiment)
[Figure: X is selected at random from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1), and its digest is computed (e.g. 01011010). The database DB of spam and ham digests (represents previous digest queries) includes digests from Spam Corpus (Obfuscation 2); figure labels: n1, n2. The digest of X is compared to each digest from DB -> NCV values (integer) -> threshold = 54 -> matching indicators (0/1)]
We look at more metrics:
- Probability of email-to-email matching
- Max(NCV) average
- NCV histogram
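The three metrics can be computed from the same set of comparisons, as in the sketch below. It takes digests as 256-bit Python ints. It reads "probability of email-to-email matching" as the fraction of (query, DB entry) pairs with NCV above the threshold, which is an interpretation made for this example; `compare_to_db` is a hypothetical name.

```python
from collections import Counter


def ncv(d1: int, d2: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(d1 ^ d2).count("1")


def compare_to_db(query_digests, db_digests, threshold=54):
    """Compare every query digest to every DB digest and collect metrics."""
    all_ncv = []   # every NCV value, for the histogram
    max_ncv = []   # best NCV per query, for the Max(NCV) average
    matched = 0    # number of pairs with NCV above the threshold
    for q in query_digests:
        values = [ncv(q, d) for d in db_digests]
        all_ncv.extend(values)
        max_ncv.append(max(values))
        matched += sum(v > threshold for v in values)
    return {
        "match_probability": matched / len(all_ncv),
        "mean_max_ncv": sum(max_ncv) / len(max_ncv),
        "histogram": Counter(all_ncv),
    }
```

The function expects at least one query digest and a non-empty database. The histogram keeps the full NCV distribution, which the two summary numbers hide.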
10 SPAM DB experiment results
Mean Max(NCV) value: not informative
Effect of obfuscation changes gracefully. Spammer may gain by additional obfuscation.
11 SPAM DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for bulk detection
12 SPAM DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are rendered useless!
13 HAM DB experiment results
Mean Max(NCV) value: not informative
Miss-detection probability still too high for practical use
14 HAM DB, NCV histograms: effect of obfuscation
Spam obfuscation does not impact miss-detection of good emails.
Shifted and wide histograms phenomena => high false positives explained
15 Alternative digests
Sampling strings: fixed length, random positions
[Figure: several sampled strings, each with its own digest, e.g. 01101011, 10111011, 00101010]
Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases)
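A minimal sketch of this idea: sample fixed-length strings at random positions, digest each one, and score two emails by the maximum NCV over all digest pairs. The digest is a stand-in (trigram counts hashed with MD5), and the parameters `n_samples` and `sample_len` are hypothetical; the slides do not give the values used by the authors.

```python
import hashlib
import random


def toy_digest(text: str) -> int:
    """Stand-in 256-bit similarity digest (assumption, not real Nilsimsa)."""
    acc = [0] * 256
    for i in range(len(text) - 2):
        acc[hashlib.md5(text[i:i + 3].encode("utf-8")).digest()[0]] += 1
    threshold = max(len(text) - 2, 0) / 256
    return sum(1 << pos for pos, count in enumerate(acc) if count > threshold)


def ncv(d1: int, d2: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(d1 ^ d2).count("1")


def sample_digests(email, n_samples, sample_len, rng):
    """Digest n_samples substrings of fixed length taken at random positions."""
    if len(email) <= sample_len:
        return [toy_digest(email)]  # short email: one digest of the whole text
    starts = [rng.randrange(len(email) - sample_len + 1)
              for _ in range(n_samples)]
    return [toy_digest(email[s:s + sample_len]) for s in starts]


def email_similarity(digests_a, digests_b) -> int:
    """Email-to-email score: max NCV over all pairs of digests."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)
```

Because the score is a maximum over pairs, one well-preserved sampled string is enough to produce a high score, whatever else was added to the email.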
16 SPAM DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to obfuscation...
17 SPAM DB (alt. digests): effect of obfuscation
... and we can see why it is like that.
18 HAM DB experiment results (alt. digests)
- Miss-detection probability still too high
19 HAM DB (alt. digests): effect of obfuscation
What can be done to decrease ham miss-detection?
20 Alternative digests open new possibilities
[Figure: new email -> digest(s) -> negative selection against a database of good digests -> digests that do not match -> compare to collaborative database of digests (DB)]
Compare to collaborative database of digests (DB): this part is the same as without negative selection
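The negative selection step can be sketched as a filter placed before the usual database comparison. The sketch assumes digests are 256-bit Python ints, that a digest "matches" a good digest when NCV > threshold, and that the same threshold of 54 is used in both stages. The slides do not state the threshold for negative selection, so that choice and the function names are hypothetical.

```python
def ncv(d1: int, d2: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(d1 ^ d2).count("1")


def negative_selection(new_digests, good_digests, threshold=54):
    """Keep only the digests that match none of the known-good digests."""
    return [d for d in new_digests
            if all(ncv(d, g) <= threshold for g in good_digests)]


def max_ncv_after_negative_selection(new_digests, good_digests,
                                     collaborative_db, threshold=54):
    """Compare surviving digests to the collaborative DB, as without the filter.

    Returns the best NCV found, or None when no digest survives.
    """
    survivors = negative_selection(new_digests, good_digests, threshold)
    scores = [ncv(d, c) for d in survivors for c in collaborative_db]
    return max(scores) if scores else None
```

Digests that resemble known-good content never reach the collaborative comparison, so they cannot cause a match there.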
21 Effect of negative selection on miss-detection of ham
22 Conclusions
- Use of proper metrics is crucial for proper conclusions from experiments.
- Alternative digests provide much better results, and by use of NCV histograms we understand why.
- Use of proper metrics is crucial for understanding what happens and for understanding how to fix the problems.