Improving Digest-Based Collaborative Spam Detection

1
Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.
2
Talk content
  • Digest-based filtering: global picture
    overview
  • Understanding HOW digests WORK: Open
    Digest Paper [1]
  • (Very positive results/conclusions, cited
    and referred to a lot!)
  • Understanding it better: our re-evaluation of
    Open Digest Paper results
  • (Different conclusions!)
  • Our alternative digests: results IMPROVE a
    lot, understanding WHY
  • Understanding the why => further improvements
    possible
  • (Negative selection)
  • Conclusions

[1] "An Open Digest-based Technique for Spam
Detection", E. Damiani, S. De Capitani di
Vimercati, S. Paraboschi, P. Samarati, in Proc.
of the 2004 International Workshop on Security
in Parallel and Distributed Systems, San
Francisco, CA, USA, September 15-17, 2004.
3
Two main collaborative spam detection approaches
1) White-listing using Social Networks
   Example: PGP graph of certificates
2) Bulky Content Detection using Digests
   Examples: DCC, Vipul's Razor, Commtouch
[Diagrams: users (User 1, User 2, User 3, ..., User n) linked by relationships (approach 1) or sharing recent digests (approach 2)]
Implementations (in both cases): centralized or
decentralized, open or proprietary
This talk (paper): digests approach for bulky
content detection
4
A Real Digest-Based System: DCC (Distributed
Checksum Clearinghouse)
[Diagram: Spammer (sends in bulk); n millions of mail users; n 10 000 mail servers; 250 DCC servers. Query: digest. Reply: counter (n=3).]
  • Strengths/drawbacks
  • + fast response
  • - not precise (FP problems)
  • - limited obfuscation resistance

Reproducible evaluation of digest efficiency:
Open Digest Paper
5
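Producing Digests: Nilsimsa similarity hashing, as
explained in OD-paper
[Figure: an N=5 character sliding window moves over an e-mail that is L characters long (example text: "Cheapest vac... Best Regards, John"). The trigrams taken from the window (labelled 1, 2, ..., 8) are each passed through Hash() (Hash: 30^3 -> 2^8). Each hash value (e.g. 00001111 = 15) selects one of the accumulator positions 0..255, which is incremented by 1. The accumulator is then converted into the digest, one bit per position 0..255.]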
  • Digest is a binary string of 256 bits.
  • Definition: Nilsimsa Compare Value (NCV) between
    two digests is equal to the number of bits at
    corresponding positions that are equal, minus 128.
  • Identical emails => NCV=128, unrelated emails =>
    NCV close to 0.
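
The NCV definition above can be sketched in a few lines. This is a minimal illustration, assuming each 256-bit digest is stored as a Python integer; the function name `ncv` and that representation are choices made here, not something the slides specify.

```python
def ncv(digest_a: int, digest_b: int, bits: int = 256) -> int:
    """Nilsimsa Compare Value: equal bit positions minus bits/2."""
    mask = (1 << bits) - 1
    # XOR marks the positions where the two digests differ.
    differing = bin((digest_a ^ digest_b) & mask).count("1")
    equal = bits - differing
    return equal - bits // 2


all_ones = (1 << 256) - 1
half_ones = (1 << 128) - 1

print(ncv(all_ones, all_ones))   # identical digests
print(ncv(0, half_ones))         # digests agreeing on exactly half the bits
print(ncv(0, all_ones))          # digests differing in every bit
```

With this definition the value ranges from -128 (every bit differs) to 128 (every bit agrees).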

More similar emails => more similar digests =>
higher NCV
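
The digest construction on this slide (sliding window, hashed trigrams, 256-position accumulator, one bit per position) can be illustrated with a simplified sketch. It is not the real Nilsimsa algorithm: the trigram selection, the use of SHA-256 as the hash, the "above the mean count" rule for setting bits, and the name `toy_digest` are all assumptions made for illustration, and the sample text is made up.

```python
import hashlib
from itertools import combinations

WINDOW = 5     # sliding-window length N from the slide
BUCKETS = 256  # accumulator positions, one per digest bit


def toy_digest(text: str) -> int:
    """Simplified Nilsimsa-style digest, returned as a 256-bit integer."""
    acc = [0] * BUCKETS
    for start in range(len(text) - WINDOW + 1):
        window = text[start:start + WINDOW]
        # Assumption: use every in-order 3-character selection of the window.
        for trigram in combinations(window, 3):
            # Assumption: the first byte of SHA-256 stands in for Nilsimsa's hash.
            data = "".join(trigram).encode("utf-8")
            bucket = hashlib.sha256(data).digest()[0]
            acc[bucket] += 1
    mean = sum(acc) / BUCKETS
    # Assumption: bit i is set when accumulator position i is above the mean.
    return sum(1 << i for i, count in enumerate(acc) if count > mean)


digest = toy_digest("Cheap offers today only. Best Regards, John")
print(f"{digest:064x}")
```

Because small edits to the text change only a few windows, most accumulator counts stay the same, which is the property that makes such digests usable for similarity comparison.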
6
Open Digest paper experiments and results
  • Evaluation <= experiment
  • spam bulk detection <= detection of similarity
    between two emails from the same spam bulk
  • ham miss-detection <= miss-detection of
    similarity between unrelated emails

Bulk detection experiment (repeated many times, to get statistics):
Spam Corpus -> Select at random -> Obfuscate (2 copies) -> Compute digests (e.g. 01011010, 01101011) -> Evaluate similarity (Threshold=54) -> NCV value (integer), Matching indicator (0/1)
OD-paper result for "adding random text"
obfuscation:
  • OD-paper only evaluates (talks about) the
    average NCV

OD-paper conclusion: Average NCV > Threshold =>
bulk detection resistant to strong obfuscation by
spammer
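
The bulk detection experiment can be sketched as a small simulation loop. Everything specific here is an assumption for illustration: `toy_digest` is a stand-in for Nilsimsa, "obfuscation ratio" is taken to mean the length of appended random text relative to the original, a match is taken to mean NCV strictly above the threshold, and the corpus strings are made up.

```python
import hashlib
import random
import string

THRESHOLD = 54  # NCV threshold from the slide


def toy_digest(text: str) -> int:
    """Stand-in for Nilsimsa (assumption): hashed character trigrams."""
    acc = [0] * 256
    for i in range(len(text) - 2):
        acc[hashlib.sha256(text[i:i + 3].encode("utf-8")).digest()[0]] += 1
    mean = sum(acc) / 256
    return sum(1 << b for b, count in enumerate(acc) if count > mean)


def ncv(a: int, b: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(a ^ b).count("1")


def obfuscate(text: str, ratio: float, rng: random.Random) -> str:
    """Append random text whose length is `ratio` times the original."""
    alphabet = string.ascii_lowercase + " "
    extra = int(len(text) * ratio)
    return text + "".join(rng.choice(alphabet) for _ in range(extra))


def bulk_detection(corpus, ratio, trials, seed=0):
    """Return (average NCV, fraction of trials with NCV above threshold)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        email = rng.choice(corpus)            # select at random
        copy_a = obfuscate(email, ratio, rng)  # obfuscate (2 copies)
        copy_b = obfuscate(email, ratio, rng)
        values.append(ncv(toy_digest(copy_a), toy_digest(copy_b)))
    mean_ncv = sum(values) / len(values)
    match_rate = sum(v > THRESHOLD for v in values) / len(values)
    return mean_ncv, match_rate


corpus = ["Buy cheap watches now, limited offer!", "You won a prize, click here"]
print(bulk_detection(corpus, ratio=0.5, trials=200))
```

Returning the fraction of matching trials next to the average NCV makes it visible when an average above the threshold hides a sizeable share of undetected pairs.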
7
Open Digest paper experiments and results
(cont.)
Ham miss-detection experiment:
Ham and Spam Corpus -> For each pair of unrelated emails -> Compute digests (e.g. 10011010, 01110011) -> Evaluate similarity (Threshold=54) -> NCV values (integer), Matching indicators (0/1)
  • OD-paper result
  • n1=2500, n2=2500 emails
  • no matching (miss-detection) case is observed

  • OD-paper conclusion
  • Miss-detection of good emails must be very low
  • approximating miss-detection probability by use
    of Binomial distribution supports the observed
    result
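
The Binomial approximation mentioned above can be worked through numerically. The sketch assumes the idealised model in which each of the 256 digest bits of two unrelated emails agrees independently with probability 1/2, and that a match means NCV strictly above the threshold; both are assumptions, and the slides do not give the exact formula used in the OD-paper.

```python
from math import comb


def binomial_match_probability(threshold: int = 54, bits: int = 256) -> float:
    """P(NCV > threshold) if each bit agrees independently with prob. 1/2."""
    # NCV = equal_bits - bits/2, so NCV > threshold needs this many equal bits.
    min_equal_bits = bits // 2 + threshold + 1
    favourable = sum(comb(bits, k) for k in range(min_equal_bits, bits + 1))
    return favourable / 2 ** bits


p = binomial_match_probability()
# Assumption: all n1 x n2 pairs of the two 2500-email sets are compared.
expected_matches = 2500 * 2500 * p
print(p, expected_matches)
```

Under this model the expected number of matching pairs is far below one, which is consistent with observing no match. The model says nothing about real ham, whose digest bits need not be independent or unbiased.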
8
Extending OD-paper experiments: spam bulk
detection
Bulk detection experiment, identical to the one in OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics):
Spam Corpus -> Select at random -> Obfuscate (2 copies) -> Compute digests (e.g. 01011010, 01101011) -> Evaluate similarity (Threshold=54) -> NCV value (integer), Matching indicator (0/1)
  • OD-paper result is well reproduced (blue dotted
    line)

OD-paper conclusion does not hold! Even an only
slightly higher obfuscation ratio brings the
average NCV below the threshold.
9
Understanding better what happens
Compare X to Database (generic experiment)
[Diagram: X is selected at random from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1). The digest of X (e.g. 01011010) is computed and compared to each digest from DB, with Threshold=54, giving NCV values (integer) and matching indicators (0/1). Database DB of spam and ham digests (represents previous digest queries); further labels: n1, n2, Spam Corpus (Obfuscation 2).]
We look at more metrics:
  • Probability of email-to-email matching
  • Max(NCV) average
  • NCV histogram
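
The three metrics can be computed together in one pass over the query digests and the database. This is a sketch under assumptions: digests are 256-bit integers, a match means NCV strictly above the threshold, and "probability of email-to-email matching" is estimated as the fraction of (X, DB entry) pairs that match. The function name `compare_to_db` is hypothetical.

```python
from collections import Counter

THRESHOLD = 54  # NCV threshold from the slide


def ncv(a: int, b: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(a ^ b).count("1")


def compare_to_db(queries, db):
    """Return (match probability, average Max(NCV), NCV histogram)."""
    histogram = Counter()
    matches = 0
    pairs = 0
    max_values = []
    for x in queries:
        values = [ncv(x, d) for d in db]   # compare X to each digest in DB
        histogram.update(values)
        matches += sum(v > THRESHOLD for v in values)
        pairs += len(values)
        max_values.append(max(values))
    return matches / pairs, sum(max_values) / len(max_values), histogram


all_ones = (1 << 256) - 1
half_ones = (1 << 128) - 1
prob, mean_max, hist = compare_to_db([0], [0, half_ones, all_ones])
print(prob, mean_max, sorted(hist.items()))
```

The histogram keeps the full distribution of NCV values, which an average or a maximum alone would hide.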
10
SPAM DB experiment results
Mean Max(NCV) value not informative
Effect of obfuscation changes gracefully: spammer
may gain by additional obfuscation.
11
SPAM DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for
bulk detection
12
SPAM DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are
rendered useless!
13
HAM DB experiment results
Mean Max(NCV) value not informative
Miss-detection probability still too high for
practical use
14
HAM DB, NCV histograms: effect of obfuscation
Spam obfuscation does not impact miss-detection
of good emails.
Shifted and wide histograms phenomena => high
false positives explained
15
Alternative digests
Sampling strings: fixed length, random positions
(one digest per sampled string, e.g. 01101011, 10111011, 00101010)
Email-to-email matching: max NCV over pairs of
digests (finds how similar the most similar
parts are, e.g. spammy phrases)
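
The alternative digests can be sketched as follows: several fixed-length strings are sampled at random positions, each gets its own digest, and two emails are scored by the best-matching pair of digests. The sample length, the number of samples, the stand-in `toy_digest`, and the function names are hypothetical; the slides do not give these details.

```python
import hashlib
import random


def toy_digest(text: str) -> int:
    """Stand-in for Nilsimsa (assumption): hashed character trigrams."""
    acc = [0] * 256
    for i in range(len(text) - 2):
        acc[hashlib.sha256(text[i:i + 3].encode("utf-8")).digest()[0]] += 1
    mean = sum(acc) / 256
    return sum(1 << b for b, count in enumerate(acc) if count > mean)


def ncv(a: int, b: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(a ^ b).count("1")


def sample_digests(email: str, sample_len: int, n_samples: int, rng: random.Random):
    """Digest `n_samples` strings of fixed length taken at random positions."""
    last_start = max(len(email) - sample_len, 0)
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [toy_digest(email[s:s + sample_len]) for s in starts]


def email_similarity(digests_a, digests_b) -> int:
    """Email-to-email score: the highest NCV over all pairs of digests."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)


rng = random.Random(0)
spam_a = "random filler text one ... buy cheap pills now from our store"
spam_b = "buy cheap pills now from our store ... unrelated padding words"
score = email_similarity(sample_digests(spam_a, 30, 8, rng),
                         sample_digests(spam_b, 30, 8, rng))
print(score)
```

Taking the maximum means a single shared passage can produce a high score even when the rest of the two emails differs.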
16
SPAM DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to
obfuscation...
17
SPAM DB (alt. digests): effect of obfuscation
and we can see why it is like that.
18
HAM DB experiment results (alt. digests)
  • Miss-detection probability still too high

19
HAM DB (alt. digests): effect of obfuscation
What can be done to decrease ham miss-detection?
20
Alternative digests open new possibilities
[Diagram: New email -> digest(s) -> Negative selection (against database of good digests) -> digests that do not match -> Compare to collaborative database of digests (DB)]
The comparison to DB is the same as without negative
selection.
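
The negative selection step can be sketched as a filter placed in front of the usual database comparison. The sketch assumes that the same NCV threshold is used for both steps and that an email is flagged when any surviving digest matches any entry of the collaborative database; the slides state neither detail, and the function names are hypothetical.

```python
THRESHOLD = 54  # assumption: same threshold for both comparison steps


def ncv(a: int, b: int) -> int:
    """Equal bit positions minus 128, for 256-bit digests."""
    return 128 - bin(a ^ b).count("1")


def negative_selection(new_digests, good_digests, threshold=THRESHOLD):
    """Keep only digests that do not match any digest in the good database."""
    return [d for d in new_digests
            if all(ncv(d, g) <= threshold for g in good_digests)]


def matches_collaborative_db(new_digests, good_digests, collaborative_db,
                             threshold=THRESHOLD):
    """Compare the surviving digests to the collaborative database (DB)."""
    survivors = negative_selection(new_digests, good_digests, threshold)
    return any(ncv(d, c) > threshold
               for d in survivors for c in collaborative_db)


all_ones = (1 << 256) - 1
good = [0]                  # database of good digests
email = [0, all_ones]       # digests of a new email
print(negative_selection(email, good))
print(matches_collaborative_db(email, good, [all_ones]))
print(matches_collaborative_db(email, good, [0]))
```

In the last call the only digest that would have matched the collaborative database is the one removed by negative selection, so the email is not flagged.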
21
Effect of negative selection on miss-detection of
ham
22
Conclusions
  • Use of proper metrics is crucial for proper
    conclusions from experiments.
  • Alternative digests provide much better results,
    and by use of NCV histograms we understand why.
  • Use of proper metrics is crucial for understanding
    what happens and for understanding how to fix
    the problems.