Social media spam - PowerPoint PPT Presentation

About This Presentation
Title:

Social media spam


Slides: 26
Provided by: moira4
Category:
Tags: media | social | spam


Transcript and Presenter's Notes

Title: Social media spam


1
Social media spam
  • Moira Burke

2
Varieties of social media spam
  • Email spam/phishing
  • Forum and newsgroup spam messages

3
Varieties of social media spam
  • Email spam/phishing
  • Forum and newsgroup spam messages
  • Comment spam, including
    • trackbacks
    • wall/shoutbox posts

4
Varieties of social media spam
  • Email spam/phishing
  • Forum and newsgroup spam messages
  • Comment spam
  • Spam profiles on SNS

5
Varieties of social media spam
  • Email spam/phishing
  • Forum and newsgroup spam messages
  • Comment spam
  • Spam profiles on SNS
  • Wiki spam, including
    • Spam links
    • Promotional pages
    • User-to-user canvassing
  • Spings
  • YouTube spam videos

6
Goals of social media spam
  • Increase PageRank (Spamdexing)
  • So don't need to fool human readers, only
    crawlers
  • Blog search engines rank by recency, not
    relevancy
  • Attract target demographic
  • Especially at Facebook and MySpace
  • Phishing (e.g. MySpace profiles raising money
    for charity)

7
Approaches to spam control
  • Human-moderated
  • Human moderation of comments
  • Distributed human moderation (e.g. Wikipedia)
  • Posting by people in your friend network (e.g.
    Facebook)

8
Approaches to spam control
  • Automatic
  • Captchas
  • Content-based filters (keywords)
  • Network-based filters (network shape or poster
    reputation)
  • White or blacklists
  • Preventing HTML in the comments
  • Throttling comment rate
  • Link markup (rel="nofollow")
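
As an illustration of the "throttling comment rate" item, here is a minimal sliding-window sketch. The class name `CommentThrottle`, the limit of 3 comments, and the 60-second window are assumptions made for this example; the slides do not specify any particular policy.

```python
from collections import defaultdict, deque


class CommentThrottle:
    """Hypothetical helper: allow at most `max_comments` per poster
    within any `window_seconds` span (values here are illustrative)."""

    def __init__(self, max_comments=3, window_seconds=60.0):
        self.max_comments = max_comments
        self.window_seconds = window_seconds
        # poster id -> timestamps of recently accepted comments
        self._recent = defaultdict(deque)

    def allow(self, poster, now):
        """Return True and record the comment if the poster is under the limit."""
        stamps = self._recent[poster]
        # Forget comments that have fallen out of the sliding window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_comments:
            return False
        stamps.append(now)
        return True
```

A caller would pass the current time (for example `time.time()`) as `now` and reject or queue the comment when `allow` returns False.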

9
The rel="nofollow" attribute
  • http://en.wikipedia.org/wiki/Spam_in_blogs#rel.3D.
    22nofollow.22
  • Endorsed by major blog platforms in Jan 2005
  • Slashdot links left by recently created
    accounts
  • Wikipedia references/external links section
  • Problems
  • Can lead site owners to gorge on PageRank
  • Reduces value of legitimate comments
  • Doesn't stop spam comments, just spamdexing
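
A rough sketch of how a blog platform might mark comment links: the function below adds rel="nofollow" to anchor tags that have no rel attribute. The function name `add_nofollow` and the regex-based approach are assumptions for illustration only; a production system would more likely use a real HTML parser or sanitizer.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
_ANCHOR = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Detects an existing rel attribute inside the captured attributes.
_REL = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_nofollow(comment_html):
    """Add rel="nofollow" to every <a> tag that lacks a rel attribute."""

    def fix(match):
        attrs = match.group(1)
        if _REL.search(attrs):
            return match.group(0)  # leave tags that already carry rel untouched
        return '<a{} rel="nofollow">'.format(attrs)

    return _ANCHOR.sub(fix, comment_html)
```

As the slide notes, this only removes the search-ranking incentive; the comment itself is still posted and still visible to readers.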

10
Blocking blog spam with language model
disagreement (WWW2005)
  • Gilad Mishne, David Carmel, Ronny Lempel

11
Overview
  • Compare language in
  • Blog post
  • Comment text
  • Page being linked to

12
Language models
  • Calculate Kullback-Leibler divergence between
    post and comment or linked page language
    models
  • Maximum likelihood models smoothed with
    distribution of words on the Internet
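
To make the comparison concrete, here is a small sketch of a KL-divergence computation between two smoothed unigram models. It assumes whitespace tokenization, linear interpolation with weight `lam`, and a tiny `background` dictionary standing in for Internet-wide word frequencies; none of these specifics come from the slides.

```python
import math
from collections import Counter


def ml_model(text):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def smoothed(model, background, word, lam=0.9, floor=1e-6):
    """Interpolate the text's model with the background model.
    `floor` (an assumption) keeps unseen words from getting probability 0."""
    return lam * model.get(word, 0.0) + (1.0 - lam) * background.get(word, floor)


def kl_divergence(p_text, q_text, background, lam=0.9):
    """KL(p || q) = sum over words of p(w) * log(p(w) / q(w))."""
    p, q = ml_model(p_text), ml_model(q_text)
    vocab = set(p) | set(q) | set(background)
    total = 0.0
    for w in vocab:
        pw = smoothed(p, background, w, lam)
        qw = smoothed(q, background, w, lam)
        total += pw * math.log(pw / qw)
    return total
```

With a background model that covers the vocabulary, an off-topic comment should score a larger divergence from the post than an on-topic one.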

13
Spam classification
  • Assume KL-divergence scores drawn from one of two
    distributions: spam and legitimate
  • Set threshold as vertical separator between
    distributions
  • Moving left (lower) reduces false negatives
    (unidentified spam)
  • Moving right (greater) reduces false positives
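
The threshold trade-off can be sketched as follows, assuming (consistently with the bullets above) that a comment is flagged as spam when its divergence exceeds the threshold. The scores and labels are invented toy values, and the function names are hypothetical.

```python
def classify(divergences, threshold):
    """Flag a comment as spam when its divergence is above the threshold."""
    return [d > threshold for d in divergences]


def error_counts(predicted_spam, is_spam):
    """Return (false positives, false negatives)."""
    fp = sum(1 for p, t in zip(predicted_spam, is_spam) if p and not t)
    fn = sum(1 for p, t in zip(predicted_spam, is_spam) if not p and t)
    return fp, fn


# Toy data (made up): divergence scores and whether each comment is really spam.
scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
labels = [False, False, True, False, True, True]

for threshold in (1.25, 1.75, 2.25):
    fp, fn = error_counts(classify(scores, threshold), labels)
    print(threshold, "false positives:", fp, "false negatives:", fn)
```

On this toy data the lowest threshold leaves no false negatives and the highest leaves no false positives, which is the trade-off the slide describes.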

14
Pros and cons of method
  • Pros
  • No training
  • No hard-coded rule sets that need updating
  • Doesn't require full web connectivity (unlike
    network analysis)
  • Can be deployed retrospectively
  • Hard for spammer to choose comment language
    similar to both the blog and the spam site
  • Cons
  • Spammers can just copy blog text (but detectable
    by search engine when they do it on multiple
    sites)
  • Doesn't work well on short posts unless language
    model includes other out-link page text
    (introduces model drift)

15
Experiment
  • 50 random blog posts with 1,024 comments
  • Human coded spam (68% spam, 32% clean)
  • Varied threshold multiplier from 0.75 to 1.25
  • Best performance
  • Threshold multiplier 1.1
  • 83% accuracy (8.5% false positives, 8.5% false
    negatives)
  • Misclassified comments were usually short
  • Expanding language model of posts to include
    linked-to pages reduced overall performance by
    2-5%, but helped with shorter posts
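
A sweep over threshold multipliers like the one described above might look like the sketch below. The base threshold, the scores, and the labels are invented toy values (not the experiment's data), and how the base threshold is obtained is left outside the sketch.

```python
def accuracy(scores, is_spam, threshold):
    """Fraction of comments classified correctly at this threshold."""
    correct = sum(1 for s, t in zip(scores, is_spam) if (s > threshold) == t)
    return correct / len(scores)


def sweep(scores, is_spam, base_threshold, multipliers):
    """Return (multiplier, accuracy) for the best-scoring multiplier."""
    results = [(m, accuracy(scores, is_spam, base_threshold * m))
               for m in multipliers]
    return max(results, key=lambda pair: pair[1])


# Toy data (made up for illustration only).
toy_scores = [0.4, 1.1, 1.6, 1.9, 2.1, 2.3, 2.9, 3.4]
toy_labels = [False, False, False, False, False, True, True, True]

best = sweep(toy_scores, toy_labels, base_threshold=2.0,
             multipliers=[0.75, 0.9, 1.0, 1.1, 1.25])
print(best)
```

Reporting false positives and false negatives alongside accuracy, as the slide does, shows which kind of error each multiplier trades away.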

16
Is Britney Spears Spam? (CEAS2007)
  • Aaron Zinman, Judith Donath

17
Spam in social networking systems (SNS)
  • Ambiguous profiles - should they be friended?
  • Hard to vet hundreds of friend requests
  • Many requests are content-less (name only)
  • Users have different preferences (some want
    Britney PR)
  • Instead of automatically filtering
  • Help user make informed choice
  • Make network and profile features more salient

18
Goals
  • Long-term
  • Build a people-oriented reasoning AI engine
    that matches users' mental models of who to
    friend
  • That guy central to the punk rock scene
  • Someone who shares/passes similar media as I do
  • Near-term
  • Present lower-level feature bundles
  • Someone who sends more movie clips than receives
  • Someone with little public information

19
Why other approaches don't work in SNS
  • Only trusting friends of friends: Trust changes
    over time and for different purposes, and can't
    be confidently evaluated several hops away.
  • Using network clustering components: Works for
    classic spam, but not for borderline cases (like
    undesirable friend posting political spam).
    Doesn't match mental model of users.

20
Method
  • Harvested 800 MySpace profiles plus top friends
  • Hand-scored on 5pt scales for
  • sociability (s)
  • # of personal comments
  • customized graphics
  • other normal social activity
  • promotionality (p)
  • amount of material meant to influence others
  • Half had p1 half had p1.

21
Four user prototypes

22
Network and profile features
  • Profile features (relatively cheap to fake)
  • Which sections are filled out
  • Does it have a picture
  • Does it have school info
  • # of "I" words
  • Comments
  • "thanks"
  • Network features (relatively hard to fake)
  • # / % independent images in comments
  • # / % independent links
  • Avg. # of posters who use the same links/images as
    us
  • # of comments
  • Didn't include comment timestamps
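
A few of the profile features listed above could be extracted roughly as follows. The dictionary layout of `profile`, the key names, and the set of words counted as "I" words are all assumptions made for this sketch; the slides do not describe the actual data format.

```python
import re

# Assumed list of first-person words; the original feature definition may differ.
I_WORDS = {"i", "i'm", "i've", "i'll", "i'd", "me", "my", "mine", "myself"}


def profile_features(profile):
    """Turn a (hypothetical) profile dict into a flat feature dict."""
    words = re.findall(r"[a-z']+", profile.get("about", "").lower())
    comments = profile.get("comments", [])
    return {
        "sections_filled": sum(1 for v in profile.get("sections", {}).values() if v),
        "has_picture": bool(profile.get("picture")),
        "has_school": bool(profile.get("school")),
        "i_words": sum(1 for w in words if w in I_WORDS),
        "thanks_comments": sum(1 for c in comments if "thanks" in c.lower()),
        "num_comments": len(comments),
    }


example = {
    "about": "I love music and my band. I'm on tour.",
    "sections": {"music": "punk", "movies": ""},
    "picture": "me.jpg",
    "school": None,
    "comments": ["Thanks for the add!", "nice page"],
}
print(profile_features(example))
```

Network features such as shared links or images across posters would need data from other profiles and are not covered by this sketch.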

23
Machine learning
  • Tried four algorithms
  • Naïve Bayes
  • Neural networks
  • Linear regression
  • KNN
  • 40 features
  • Network/comment-based
  • Profile-based
  • Mixed
  • Used PCA to reduce feature space, but didn't
    help
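
Of the four algorithms, KNN is the simplest to sketch. The version below uses Euclidean distance and majority vote with k=3; the feature vectors and labels are toy values, and nothing here reflects the authors' actual implementation or settings.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature training set (made up for illustration).
train = [
    ((0.0, 0.0), "social"), ((0.1, 0.2), "social"), ((0.2, 0.1), "social"),
    ((1.0, 1.0), "promo"), ((0.9, 1.1), "promo"), ((1.1, 0.9), "promo"),
]
print(knn_predict(train, (0.15, 0.1)))
print(knn_predict(train, (1.0, 0.95)))
```

`math.dist` requires Python 3.8 or later; with 40 features, scaling each feature to a comparable range before computing distances would usually matter.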

24
Results
  • Poor performance (30-50% accuracy) classifying s
    and p on 1 to 5 scale
  • Better (90%) using binary threshold at 4
  • Best model used network and profile features,
    though only marginally better than profile-only
  • Authors suggest that profile features are more easily
    faked; in the spam arms race, network features will be
    more important in the future.
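
One way to see why a binary threshold is easier than exact 1-to-5 prediction: near-miss predictions count as errors on the full scale but often land on the correct side of the cut. The sketch below assumes "threshold at 4" means scores of 4 or higher are positive; the scores are invented.

```python
def exact_accuracy(predicted, actual):
    """Fraction of predictions that match the 1-5 score exactly."""
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


def binarize(scores, threshold=4):
    """Assumed reading of the slide: scores >= threshold count as positive."""
    return [s >= threshold for s in scores]


# Toy scores (made up): predictions are often off by one point.
actual = [1, 2, 5, 4, 3]
predicted = [2, 2, 4, 5, 3]

print(exact_accuracy(predicted, actual))
print(exact_accuracy(binarize(predicted), binarize(actual)))
```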

25
Is spam dashboard a good idea?
  • Until we have reliable agents using a fine-tuned
    subjective cognitive model of the user, a better
    approach is to expose the end-user to a
    digestible form of the raw features and let them
    decide how to proceed.
  • Next: Content- and link-analysis methods for spam
    detection . . .