Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

1
Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
  • Steve Hookway
  • 11/17/05

2
Motivation
  • "Nigritude ultramarine" (black and blue): a search engine optimization competition
  • Identify SPAM pages and discount them in ranking
  • Which techniques work best and will they last?

3
SPAM vs Ham
  • Spam
    • Link Farms
    • Link Exchange Services
    • Guestbooks
  • Ham
    • Dmoz

4
BadRank
  • Google may make use of BadRank (BR)
  • Interleave crawling and PageRank updating
  • When updating PageRank, BR and a blacklist are considered
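
The BadRank idea can be sketched in a few lines. Google has never confirmed BadRank; the formula below is one commonly circulated conjecture, BR(A) = E(A)(1 - d) + d * sum(BR(T) / C(T)), where T ranges over the pages A links to, C(T) is the number of inbound links of T, and E(A) is a seed penalty for known bad pages. The function name, the damping value, and the toy graph are assumptions made for illustration only.

```python
def bad_rank(out_links, seeds, d=0.85, iterations=50):
    """Iterate the conjectured BadRank formula on a small link graph.

    out_links: dict mapping each page to the list of pages it links to.
    seeds:     dict mapping known bad pages to their seed penalty E(A).
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # C(T): number of inbound links of each page T.
    in_count = {p: 0 for p in pages}
    for targets in out_links.values():
        for t in targets:
            in_count[t] += 1

    br = {p: 0.0 for p in pages}
    for _ in range(iterations):
        # Badness flows backwards: a page inherits it from the pages it links to.
        br = {
            p: seeds.get(p, 0.0) * (1 - d)
            + d * sum(br[t] / in_count[t] for t in out_links.get(p, ()))
            for p in pages
        }
    return br


# Hypothetical toy graph: "spammer" links to a seeded link farm.
graph = {"farm": [], "spammer": ["farm"], "innocent": ["news"], "news": []}
scores = bad_rank(graph, seeds={"farm": 1.0})
```

In this toy graph only the seeded farm and the page linking to it end up with a non-zero score; how such a score would be combined with PageRank or a blacklist is not specified on the slide.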

5
Representation
  • Each page represented by 89 features plus a TF-IDF vector
  • Three-block approach
    • Content based
      • Term frequency, inverse document frequency
    • Features based on each page and aggregated
    • Features based collectively
  • Labeled samples created
    • Ham: Dmoz
    • SPAM: manually identified
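
A minimal sketch of the combined representation, assuming whitespace tokenization and the plain tf * log(N / df) weighting (the slides do not say which TF-IDF variant was used). The two link features per page are hypothetical stand-ins for the 89 features, which are not listed here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per non-empty document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors


pages = ["cheap pills cheap links", "open directory of links"]
# Hypothetical link-based features, e.g. [in-link count, fraction of reciprocal links].
link_features = [[120.0, 0.9], [8.0, 0.1]]

vocab, text_vecs = tfidf_vectors(pages)
# Combined representation: link features followed by the TF-IDF vector.
combined = [links + text for links, text in zip(link_features, text_vecs)]
```

A term that occurs on every page (here "links") gets weight zero, so only terms that separate the pages contribute to the text part of the vector.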

6
Experimental Results
  • The TF-IDF vector is the most discriminative feature
  • Using the combined representation is always
    better than using only the link based features

8
Robustness
  • Adversary obfuscates an increasing number of
    attributes
  • Purely text-based classifier is immediately
    useless
  • Combined classifier deteriorates more slowly
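
The mechanism behind these bullets can be illustrated with a toy linear scorer. This is not the experiment from the talk: the attribute names, weights, and values are invented, and the adversary is assumed to neutralize the attributes that contribute most to the score first.

```python
def score(weights, features):
    """Linear spam score: weighted sum over the attributes the classifier uses."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def obfuscate(features, weights, k):
    """Return a copy of features with the k largest score contributions zeroed."""
    ranked = sorted(
        features,
        key=lambda name: weights.get(name, 0.0) * features[name],
        reverse=True,
    )
    hidden = set(ranked[:k])
    return {name: (0.0 if name in hidden else value) for name, value in features.items()}


# Made-up weights: one text attribute, two link attributes.
text_weights = {"tfidf_pills": 2.0}
combined_weights = {"tfidf_pills": 2.0, "links_from_farms": 1.0, "guestbook_links": 1.0}
spam_page = {"tfidf_pills": 1.0, "links_from_farms": 1.0, "guestbook_links": 1.0}

results = []
for k in range(4):
    hidden_page = obfuscate(spam_page, combined_weights, k)
    results.append((k, score(text_weights, hidden_page), score(combined_weights, hidden_page)))
```

Each tuple in `results` holds the number of obfuscated attributes, the text-only score, and the combined score. In this toy setup the text-only score drops to zero after a single obfuscated attribute, while the combined score loses its evidence one attribute at a time.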

9
Open Problems
  • Collective Classification
    • Dealing with a large dataset
  • Game Theory
  • Google Bombing
    • Deciding validity of references
  • Click Spam
    • Stateless protocol provides no info on client

10
Conclusion
  • Classify instances of SPAM
  • Modify PageRank
  • Purely text-based classifier is easy to break
  • Need to consider a variety of features