Web Spam - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Web Spam


1
Web Spam
  • PengBo
  • Dec 8, 2008

2
Today's Outline
  • Web Spam
  • The Spammer's Toolbox
  • Link Farm
  • Combating Web Spam
  • TrustRank

3
Web Spam
4
Story of the 2BigFeet
  • From 1995 to 2000, tens of thousands of business
    owners took out small-business loans to open
    storefronts on the Web. Neil Moncrief was one of
    them.
  • $40,000 a month in 2003, 95% of it from search
    engine referrals.
  • November 14,2003, the phone stopped ringing and
    the orders stopped coming in.
  • OH! Google Dance
  • But

5
Economic Considerations
  • ????????Web?gateway.
  • ??Search Engine (SE) ???????????.
  • ????????????????????.
  • ????????????????.
  • e.g., e-commerce sites.
  • advertising-driven sites

6
Ways to Increase SE Referrals
  • ??keyword-based ??
  • ?????????
  • Genuinely better content?
  • Or game the system?
  • Search Engine Optimization is a thriving
    business
  • Some SEOs are ethical
  • Some are not

7
What Is Web Spam?
  • Spamming: any deliberate action taken solely
    to boost a web page's position in search engine
    results, incommensurate with the page's real value
  • Spam: web pages that are the result of spamming
  • Approximately 10-15% of web pages are spam

8
Why Web Spam Is Bad
We appreciate your taking the time to help us
improve our service for your fellow users around
the world. By helping us eliminate spam, you're
saving millions of people time, effort and
energy.
9
Detecting Web Spam
  • Spam detection: a classification problem
  • ????,??????/?????spam?
  • But what are the salient features?
  • ????spamming???????
  • Finding the right features is alchemy, not
    science
  • Spammers????????? its an arms race!

10
The Spammer's Toolbox
11
Techniques Taxonomy
12
Techniques / Boosting / Term
13
Term Spamming-What?
14
Weaving
Original text:
Remember not only to say the right thing in the
right place, but far more difficult still, to
leave unsaid the wrong thing at the tempting
moment. Benjamin Franklin, US author, diplomat,
inventor, physicist, politician, printer (1706
- 1790)
With spam terms woven in:
Remember not only airfare to say the right
plane tickets thing in the right place, but far
cheap travel more difficult still, to leave hotel
rooms unsaid the wrong thing at vacation the
tempting moment. Benjamin Franklin, US author,
diplomat, inventor, physicist, politician,
printer (1706 - 1790)
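
Weaving can be illustrated with a minimal sketch. The function name `weave` and the index-to-phrase mapping are assumptions made for this example, not part of the slides; the word positions were chosen so that the output matches the woven quote above.

```python
def weave(text, insertions):
    """Return `text` with spam phrases woven in.

    `insertions` maps a word index in `text` to the phrase placed
    immediately before that word (hypothetical interface).
    """
    woven = []
    for index, word in enumerate(text.split()):
        if index in insertions:
            woven.append(insertions[index])
        woven.append(word)
    return " ".join(woven)


QUOTE = ("Remember not only to say the right thing in the right place, "
         "but far more difficult still, to leave unsaid the wrong thing "
         "at the tempting moment.")

# Word positions chosen to reproduce the woven text on the slide.
SPAM_TERMS = {3: "airfare", 7: "plane tickets", 14: "cheap travel",
              19: "hotel rooms", 24: "vacation"}

if __name__ == "__main__":
    print(weave(QUOTE, SPAM_TERMS))
```

The legitimate sentence stays readable while the page also matches travel-related queries, which is what makes weaving harder to spot than plain repetition.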
15
Techniques / Boosting / Term
  • Repetition: repetition repetition repetition repetition
    repetition repetition
  • Dumping: dumortierite dumose dumous dump dumpage dumper
    dumpily dumpiness dumping dumpish dumpishly
  • Weaving: work in weaving three-women teams is an ancient
    textile art on looms
  • Phrase stitching: please refrain from using the phrase
    stitching wounds located on the lower limbs

16
Techniques / Boosting / Link
Nice story. Read about my inoonlinever.comLas Vegas casino trip.
17
Google Bomb
  • "Google bombing" project organized by George
    Johnston back at the end of October 2003

18
Techniques / Hiding
19
Techniques / Hiding
  • Content hiding
  • Cloaking
  • Identify web crawlers
  • Serve a different version of the page

GET /db_pages/members.html HTTP/1.0
Host: infolab.stanford.edu
User-Agent: AVSearch-3.0(AltaVista/AVC)
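
The two cloaking steps (identify the crawler, then serve a different version) can be sketched as follows. This is only an illustration of the mechanism: the token list and function names are hypothetical, and real crawlers are also recognized by other signals such as IP address.

```python
# Hypothetical substrings that mark a crawler in the User-Agent header.
CRAWLER_TOKENS = ("avsearch", "googlebot", "crawler", "spider")


def is_crawler(user_agent):
    """Guess whether a request comes from a search engine crawler."""
    lowered = user_agent.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)


def select_variant(headers):
    """Return which version of the page a cloaking server would serve."""
    if is_crawler(headers.get("User-Agent", "")):
        return "crawler version"
    return "browser version"
```

For the request shown above, `select_variant({"User-Agent": "AVSearch-3.0(AltaVista/AVC)"})` picks the crawler version, while an ordinary browser header gets the browser version.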
20
Link Farm
21
Now, focus on link farms
  • Link spamming: inflating the rank of a page by
    creating boosting links to it
  • From unaffiliated sites (e.g. blogs, guest books,
    web forums, etc.)
  • From partner sites: link exchanges
  • From own sites: link farms

22
Link Exchanges
  • Reciprocal link: a mutual link between two objects
    (typically two websites) to ensure mutual traffic.
  • Three-way linking
  • Makes the links look more "natural" to search engines.
  • siteA -> siteB -> siteC -> siteA
  • Automated Linking
  • automatic link exchange services

Quick and lazy schemes are increasingly worth
less and less, so avoid them at all costs.
Here's a good rule of thumb for SEO: if it's
automated, super-easy, or super-fast, avoid
it. ("Why Link Exchanges are a Bad Idea" by Scott
Allen)
23
PageRank Algorithm I
  • Basic idea [BP98]: given the web graph $G = (V, E)$,
    we define a regular Markov chain $M$ on this web
    graph. The PageRank vector is the stationary
    distribution of $M$.
  • Step 1: suppose $A$ is the adjacency matrix.
  • Example
  • Step 2: normalize matrix $A$ to get $P$.
  • Example
  • We call a page without outgoing links a dangling page.

24
PageRank Algorithm II
  • Step 3: $\bar{P}$ can be created by replacing the rows of $0^T$
    in $P$ with a probability vector
  • Example
  • Step 4: build the transition matrix $M$ of the Markov chain
    from $\bar{P}$ and the damping factor $\alpha$
  • Example

25
PageRank Algorithm III
  • $M$ is regular.
  • Theorem 1 [KS60]: for a finite-state regular
    Markov chain $P$
  • There exists a unique stationary distribution
  • PageRank is the stationary distribution $\pi$ of the
    Markov chain $M$, i.e. $\pi^T M = \pi^T$ and $\sum_i \pi_i = 1$
  • In the previous example, the PageRank vector
    is $(0.292, 0.292, 0.416)$

26
PageRank
  • PageRank in one equation
  • $PR(p) = \alpha \cdot (M \cdot PR)(p) + (1 - \alpha) \cdot V_p$
  • $M$ is the adjacency matrix of the Web graph.
  • $\alpha$ is the damping factor (usually 0.85).
  • In the fair (uniform) case, $V_p = 1/N$ ($N$ = number of pages
    in the Web).
  • $V$ is the personalization vector.
  • What happens if a page $p$ has no outgoing links?
  • $\alpha$ of its PR is lost, and all the PR will be lost
    eventually.
  • Solution: normalize rows of $M$
    (i.e. insert links to every other page).
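
The one-equation form can be evaluated by repeated substitution (power iteration). The sketch below is a minimal pure-Python version under stated assumptions: the personalization vector is uniform, every link target also appears as a key of `links`, and a dangling page is treated as linking to every page including itself, which is a common simplification of the fix on the slide. The function name and the graph format are made up for this example.

```python
def pagerank(links, alpha=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    alpha: damping factor.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Teleport term (1 - alpha) * V_p with uniform V_p = 1/N.
        new_rank = {page: (1.0 - alpha) / n for page in pages}
        for page in pages:
            # A dangling page spreads its rank over all pages.
            targets = links[page] or pages
            share = alpha * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```

With the dangling-page fix the scores keep summing to 1 at every iteration, so no rank leaks out of the graph.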

27
Aggregate Page Rank
  • Total page rank is affected by
  • Number of pages
  • Incoming Links
  • Outgoing Links
  • Dangling Nodes
  • Topologies that
  • Use as many pages as possible
  • minimize outgoing links
  • minimize dangling nodes

[Figure: a web site with its incoming links and outgoing links]
28
Chain topology (more is better)
[Figure: chain topologies between an incoming-link page I and an outgoing-link page O. One page (a): PR(Web Site) = 0.34. Two pages (a, b): PR(Web Site) = 0.21 + 0.29 = 0.50. Six pages (a to f): PR(Web Site) = 0.77.]
29
Ring topology
[Figure: ring topology. Six pages (a to f) linked in a ring between I and O: PR(Web Site) = 0.86.]
30
Clique topology
[Figure: clique topology. Six fully interlinked pages (a to f) between I and O: PR(Web Site) = 0.93.]
31
Increasing Page Rank of a single target page
  • Complicated structures do not help
  • Chain, ring and clique spread the page rank across
    every node in the website
  • Then

32
Star topology
[Figure: star topology. Pages b to f all link to the target page a, between I and O: PR(a) = 0.43.]
33
Link Farm Model
34
How to Do It?
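  • From the spammer's point of view, there are three kinds of pages
  • Own pages
  • Completely controlled by the spammer
  • May span multiple domain names
  • Accessible pages
  • e.g., web log comments pages
  • The spammer can post links to his own pages
  • Inaccessible pages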

35
Link Spam Farms
  • Spammer's goal
  • Maximize the PageRank of the target page t
  • Techniques
  • Get as many links as possible from accessible pages to t
  • Build a link farm to get a page rank multiplier effect

36
Organization
One of the most common and effective
organizations for a link farm
37
A Simple Farm Model
  • Suppose the rank contributed by accessible pages is $\lambda$
  • Let the page rank of the target page be $y$, and $N$ = number
    of all pages
  • Rank of each of the $k$ farm pages = $(1-c)/N$, where $c$ =
    damping factor
  • $y = \lambda + kc(1-c)/N + (1-c)/N = \lambda + (1-c)(ck+1)/N$
  • No multiplier effect for acquired page rank
  • By making $k$ large, we can make $y$ as large as we
    want

38
An Optimal Model
  • Model: suppose the rank contributed by accessible
    pages is $\lambda$
  • Let the page rank of the target page be $y$, and $N$ = number
    of all pages
  • Rank of each of the $k$ farm pages = $cy/k + (1-c)/N$
  • $y = \lambda + ck\,[cy/k + (1-c)/N] + (1-c)/N$
  • $\;\; = \lambda + c^2 y + c(1-c)k/N + (1-c)/N$
  • $y = \lambda/(1-c^2) + (ck+1)/(N + cN)$. For $c = 0.85$,
    $1/(1-c^2) = 3.6$
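
The two farm models can be compared numerically. The sketch below is illustrative only: the function and parameter names (`accessible_rank` for the rank contributed by accessible pages, `k` farm pages, `n` pages in total, damping factor `c`) are assumptions of this example. The last function re-derives the optimal-model value by iterating the recurrence instead of using the closed form.

```python
def simple_farm_rank(accessible_rank, k, n, c=0.85):
    """Target rank when k farm pages link to the target and it links nowhere."""
    return accessible_rank + (1 - c) * (c * k + 1) / n


def optimal_farm_rank(accessible_rank, k, n, c=0.85):
    """Target rank when the target also links back to all k farm pages."""
    return accessible_rank / (1 - c ** 2) + (c * k + 1) / ((1 + c) * n)


def iterate_optimal_farm(accessible_rank, k, n, c=0.85, steps=500):
    """Fixed point of y = rank_in + c*k*(c*y/k + (1-c)/n) + (1-c)/n."""
    y = 0.0
    for _ in range(steps):
        farm_page = c * y / k + (1 - c) / n
        y = accessible_rank + c * k * farm_page + (1 - c) / n
    return y


if __name__ == "__main__":
    print(simple_farm_rank(0.001, 1000, 1_000_000))
    print(optimal_farm_rank(0.001, 1000, 1_000_000))
```

The link back from the target to the farm is what creates the multiplier: rank acquired from accessible pages is amplified by 1/(1 - c^2), about 3.6 for c = 0.85.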

39
Comparison
40
A Theorem for Link Farm
  • The PageRank score of the target is maximal if and
    only if
  • All boosting pages (farm pages) point to the target
  • There are no links among the boosting pages
  • The target points to some or all of the boosting pages
  • All hijacked links point to the target

Really Optimal?
41
Optimal Farm
Lesson 1: Short loop(s) increase target
PageRank
42
Two Farms
  • Alliances: interconnected farms
  • Single spammer, several target pages/farms
  • Multiple spammers

What happens if you and I team up?
43
Two Farms
  • We can do this
  • but it wouldn't help: target scores balance out
  • $p_0 + q_0 = (c(m+k)+2)/(N(1+c))$, so $p_0 = q_0 \approx d(m+k)/2$,
    where $d = c/(N(1+c))$

44
Two Farms
  • However, we can do this
  • Remove the links to boosting pages
  • and both targets' scores increase
  • For $k = m$, we have a 1.85x increase
  • $p_0 = q_0 = \lambda/(1-c) + (ck+1)/N$. So the gain is $1 + c = 1.85$,
    and the multiplier on $\lambda$ becomes $1.85 \times 3.6 = 6.7$

Lesson 2: Target pages should only link to other
targets
Lesson 3: In an alliance of two, both
participants win
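
The 1.85x figure can be checked with a small calculation. This is a sketch under stated assumptions: no rank arrives from accessible pages, each boosting page links only to its own target, and the two targets link only to each other; all function names are made up for the example. Solving the two linear equations $p_0 = c\,q_0 + a$ and $q_0 = c\,p_0 + b$ gives the closed form used below.

```python
def single_farm_target(k, n, c=0.85):
    """Target rank in one optimal farm with k boosting pages, no leakage."""
    return (c * k + 1) / ((1 + c) * n)


def alliance_targets(k, m, n, c=0.85):
    """Target ranks (p0, q0) when the two targets link only to each other."""
    a = (1 - c) * (c * k + 1) / n   # inflow to p0 from its k boosting pages
    b = (1 - c) * (c * m + 1) / n   # inflow to q0 from its m boosting pages
    p0 = (a + c * b) / (1 - c ** 2)
    q0 = (b + c * a) / (1 - c ** 2)
    return p0, q0


if __name__ == "__main__":
    p0, q0 = alliance_targets(1000, 1000, 1_000_000)
    print(p0 / single_farm_target(1000, 1_000_000))
```

For unequal farm sizes both targets still end up above their single-farm scores, which is the content of Lesson 3.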
45
Larger Alliances
  • Extremes
  • Ring core
  • Completely connected core

46
Larger Alliances
  • Target scores for ring/complete cores
  • 10 farms of sizes 1000, 2000, ..., 10000

Lesson 4: Larger alliances need to be stable to
keep all participants happy
Problem: farm 10 loses in a ring
47
Features Identifying Link Spam
  • ???low-ranked pages????
  • ????????????
  • ??????affiliated pages??
  • Same web site same domain
  • Same IP address
  • Same owner (according to WHOIS record)
  • linking pages?machine-generated
  • ??

48
Combating Web Spam
49
Combating Web Spam
  • Statistical Detection
  • Comment Spam Detection
  • Detecting Cloaking and Redirection
  • Secrecy
  • Content Based Detection
  • Graph Based Detection

50
Statistical Detection
  • Fetterly et al. (2004) provide a list of
    attributes that are often present in spam pages,
    for example,
  • Large numbers of hostnames resolving to a single
    IP address
  • Large sets of pages with little variance in
    content
  • Disproportionately high ratio of incoming to
    outgoing links to a page
  • If detection by statistics gains popularity, we
    believe some of these predictors will become
    obsolete as spammers adapt.
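
The first attribute in the list (many hostnames resolving to one IP address) is easy to sketch. The code below is a minimal illustration, not the method of Fetterly et al.; the function names, the input format and the threshold are assumptions made for this example.

```python
from collections import defaultdict


def hosts_per_ip(resolutions):
    """Group hostnames by the IP address they resolve to.

    resolutions: iterable of (hostname, ip_address) pairs.
    """
    by_ip = defaultdict(set)
    for hostname, ip_address in resolutions:
        by_ip[ip_address].add(hostname)
    return by_ip


def suspicious_ips(resolutions, threshold=3):
    """IPs serving at least `threshold` hostnames (illustrative cut-off)."""
    grouped = hosts_per_ip(resolutions)
    return sorted(ip for ip, hosts in grouped.items()
                  if len(hosts) >= threshold)
```

Shared hosting also puts many legitimate hostnames on one address, so a count like this is a signal to combine with others rather than a verdict.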

51
Comment Spam Detection
  • Mishne et al. (2005) propose identifying comment
    spam by comparing the language models in the
    commented page and the comment itself.
  • A language model models the probability of words
    occurring in a given text
  • Their results look promising, though their test
    collection is small.
  • They note that most of the incorrect
    classifications occur with very short comments.
    As a potential solution to this, they suggest
    appending the linked page to the comment before
    language model generation.
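
A language-model comparison of this kind can be sketched with unigram models and KL divergence. This is a simplified illustration: add-one smoothing over the joint vocabulary is swapped in here for brevity and is an assumption of this example, as are all function names and the sample texts.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, smoothing=1.0):
    """Smoothed probability of each vocabulary word in `text`."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two models over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def disagreement(post, comment):
    """How far the comment's language model is from the post's."""
    vocabulary = set(post.lower().split()) | set(comment.lower().split())
    return kl_divergence(unigram_model(comment, vocabulary),
                         unigram_model(post, vocabulary))


POST = "pagerank measures the importance of web pages using links"
ON_TOPIC = "pagerank uses links between web pages"
SPAM = "cheap casino pills buy cheap casino pills now"
```

Here `disagreement(POST, SPAM)` comes out larger than `disagreement(POST, ON_TOPIC)`. Very short comments give unreliable models, which matches the weakness noted above.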

52
Detecting Cloaking and Redirection
  • Wu and Davison (2005) compare the pages returned
    by four separate crawls (two reporting to the
    server as a common web browser, two reporting as
    a search crawler) using a thresholded difference
    calculation that takes into account unique term
    and link differences between the pages.
  • Actually, it is very difficult to separate
    malicious and acceptable cloaking.
  • You can do more!
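
The four-crawl comparison can be sketched as follows. This is a simplified illustration in the spirit of the description above, not the authors' exact calculation: it only counts term differences (link differences are left out), and the decision rule and threshold are assumptions of this example.

```python
def term_difference(page_a, page_b):
    """Number of terms that appear in exactly one of the two pages."""
    return len(set(page_a.split()) ^ set(page_b.split()))


def looks_cloaked(browser_1, browser_2, crawler_1, crawler_2, threshold=0):
    """Flag a URL whose crawler copies differ from its browser copies by
    more than the page changes between two fetches by the same agent."""
    same_agent = max(term_difference(browser_1, browser_2),
                     term_difference(crawler_1, crawler_2))
    cross_agent = min(term_difference(browser_1, crawler_1),
                      term_difference(browser_2, crawler_2))
    return cross_agent > same_agent + threshold
```

Fetching each version twice is what separates ordinary dynamic content, such as rotating ads, from content that changes with the user agent.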

53
Secrecy
  • Search engines do not disclose the full details
    of their ranking algorithms, in order to protect
    their business and to prevent easy exploitation.
  • Security by secrecy is a useful way to slow the
    progress of attack, but it is not a solution.

54
Content Based Detection
  • Features Identifying Synthetic Content
  • Average word length
  • The mean word length for English prose is about 5
    characters
  • Word frequency distribution
  • Certain words (the, a, ...) appear more often
    than others
  • N-gram frequency distribution
  • Some words are more likely to occur next to each
    other than others
  • Grammatical well-formedness
  • Ntoulas et al. (2006) introduce a number of
    heuristic methods for detecting content-based
    spam and combine these methods to create a highly
    accurate classifier. Their classifier can
    correctly identify 86.2% of all spam pages.

55
Heuristic Methods
  • Number of words in the page
  • Number of words in the page title
  • Average length of words
  • Amount of anchor text
  • Fraction of visible content
  • Compressibility
  • Fraction of page drawn from globally popular
    words
  • Fraction of globally popular words
  • Independent n-gram likelihoods
  • Conditional n-gram likelihoods
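
A few of these heuristics are simple to compute. The sketch below is illustrative only: it uses `zlib` from the Python standard library for compressibility, and the function names and the set of popular words passed in are assumptions of this example.

```python
import zlib


def average_word_length(text):
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compression_ratio(text):
    """Uncompressed size divided by compressed size (higher = more redundant)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


def popular_word_fraction(text, popular_words):
    """Fraction of the page's words that are in a given popular-word set."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in popular_words) / len(words)
```

A page built by repeating the same phrases compresses far better than ordinary prose, so an unusually high ratio is one feature a classifier can use alongside the others.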

56
Average Length of Words
  • For an average word length of 4 to 6, the prevalence
    of spam is 10 to 20%
  • 50% of the pages with an average word length of 8 are
    spam
  • Every sample page with an average word length of
    ten is spam

57
Fraction of Visible Content
  • Many spam pages contain more visible content

58
Graph Based Detection
  • Gyöngyi et al. (2004) suggest a reputation
    propagation method to identify trustworthy pages.
  • Wu and Davison (2005) suggest a technique to
    detect and penalize link farms.
  • Metaxas and DeStefano (2005) liken web spam to
    social propaganda, and use techniques from social
    science. Their algorithm requires a human to find
    an untrustworthy seed page.

59
Webmaster Guidelines
  • Quality guidelines - basic principles
  • Don't participate in link schemes designed to
    increase your site's ranking or PageRank.
  • In particular, avoid links to web spammers or
    "bad neighborhoods" on the web, as your own
    ranking may be affected adversely by those links.

60
TrustRank Idea
Approximate isolation of good pages: good pages
seldom point to spam
61
TrustRank Idea
  • ?good pages? spam pages?????
  • ?very good pages????
  • ?known good pages?????
  • ???????????ranking

62
Step 1: Seed Selection
  • Basic principle: approximate isolation
  • It is rare for a good page to point to a bad
    (spam) page
  • ?Web?????seed pages????
  • ?????Expensive and indispensable task
  • ?????????,??????
  • High outdegree
  • High inverse PageRank
  • High PageRank

63
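Step 2: Trust Propagation
  • Call the seed pages identified as good pages the trusted pages, and set their trust to 1
  • Propagate trust through the links
  • Each page gets a trust value between 0 and 1
  • Use a trust threshold and mark the pages below it as spam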

64
Simple Score Propagation
65
Score Splitting
66
Trust Attenuation
67
Now: TrustRank Algorithm
  • Function TrustRank
  • Input: T transition matrix, N number of
    pages, L limit of oracle invocations,
  • α_B decay factor for biased PageRank, M_B
    number of biased PageRank iterations
  • Output: t* TrustRank scores
  • Begin
  • s = SelectSeed(...)
  • σ = Rank({1, ..., N}, s)
  • d = 0_N
  • for i = 1 to L do
  • if O(σ(i)) == 1 then d(σ(i)) = 1   (select good seeds)
  • d = d / |d|   (normalize static score distribution vector)
  • t* = d
  • for i = 1 to M_B do t* = α_B · T · t* + (1 − α_B) · d   (compute TrustRank scores)
  • return t*
  • End
68
Example
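s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Ordering: [2, 4, 5, 1, 3, 6, 7]
Assume L = 3: selected seed set {2, 4, 5}
d = [0, 1/2, 0, 1/2, 0, 0, 0] (pages 2 and 4 are judged good, page 5 is not)
TrustRank scores: [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]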
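
The seed-selection and propagation steps can be sketched in Python. This is a minimal sketch under stated assumptions: pages are 0-indexed (page 2 on the slide is index 1), the oracle is a plain function, `transition[p][q]` is taken to be 1/outdegree(q) when q links to p, and the default decay and iteration count are illustrative. The link graph of the slide example is not in the transcript, so only the seed vector d is reproduced; the oracle marking pages 1 to 4 as good is an assumption consistent with that d.

```python
def select_good_seeds(seed_scores, oracle, limit):
    """Rank pages by seed score, ask the oracle about the top `limit`
    pages, and return the normalized static score vector d."""
    order = sorted(range(len(seed_scores)), key=lambda i: -seed_scores[i])
    d = [0.0] * len(seed_scores)
    for page in order[:limit]:
        if oracle(page):
            d[page] = 1.0
    total = sum(d)
    return [x / total for x in d] if total else d


def trust_rank(transition, d, decay=0.85, iterations=20):
    """Biased PageRank: t = decay * T t + (1 - decay) * d, starting at d."""
    n = len(d)
    t = list(d)
    for _ in range(iterations):
        t = [decay * sum(transition[p][q] * t[q] for q in range(n))
             + (1 - decay) * d[p] for p in range(n)]
    return t


SEED_SCORES = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
GOOD_PAGES = {0, 1, 2, 3}  # assumed oracle answers (pages 1-4 on the slide)

if __name__ == "__main__":
    print(select_good_seeds(SEED_SCORES, lambda p: p in GOOD_PAGES, limit=3))
```

On a hypothetical three-page chain 0 -> 1 -> 2 with page 0 as the only seed, the scores decrease along the chain, which is the trust attenuation described earlier.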
69
The last story
70
SEO contest
  • An SEO contest is a prize competition that challenges
    search engine optimization practitioners to rank
    highly in the major search engines.
  • The "nigritude ultramarine" competition by
    SearchGuild is widely acclaimed as the mother of
    all SEO contests in the English-speaking world.
  • Dates: May 7, 2004 to July 7, 2004
  • Keyword: nigritude ultramarine
  • Prize: iPod, flat panel LCD screen and various
    bonus prizes

71
Optimization Tricks
72
The winner
  • Anil Dash: an early and influential blogger who
    began his weblog in 1999.
  • Dash stated that his goal in entering the contest
    was to "prove that real content trumps all the
    shady optimization tricks that someone can figure
    out"
  • Another competitor took this idea further and
    wrote the Nigritude Ultramarine FAQ, which
    placed sixth overall, won the "Judge's Choice"
    award, and remains a valuable source of
    information about the competition.

73
Summary
  • Web Spam
  • It's an arms race
  • The Spammer's Toolbox
  • Link Farm
  • Detecting Web Spam
  • TrustRank

74
Resources
  • Web spam detection dataset: WEBSPAM-UK2006
  • AIRWeb: Adversarial Information Retrieval on the
    Web
  • search engine spam and optimization,
  • crawling the web without detection,
  • link-bombing (a.k.a. Google-bombing),
  • comment spam, referrer spam,
  • blog spam (splogs),
  • malicious tagging,
  • reverse engineering of ranking algorithms,
  • advertisement blocking, and
  • web content filtering.

75
References
  • Zoltán Gyöngyi, Pavel Berkhin, Hector
    Garcia-Molina, Jan Pedersen. Link Spam Detection
    Based on Mass Estimation. VLDB 2006, September
    12-15, 2006, Seoul, Korea.
  • A. Ntoulas, M. Najork, and M. Manasse.
    Detecting spam web pages through content analysis.
    In Proceedings of the World Wide Web conference,
    Edinburgh, Scotland, 2006.
  • Zoltán Gyöngyi, Hector Garcia-Molina. Link Spam
    Alliances. 31st International Conference on Very
    Large Data Bases (VLDB), Trondheim, Norway, 2005.
  • Zoltán Gyöngyi, Hector Garcia-Molina, Jan
    Pedersen. Combating Web Spam with TrustRank. 30th
    International Conference on Very Large Data Bases
    (VLDB), Toronto, Canada, 2004.
  • D. Fetterly, M. Manasse, and M. Najork. Spam,
    damn spam, and statistics. In Seventh
    WebDB Workshop, 2004.

76
References
  • A. A. Benczur, K. Csalogany, T. Sarlos, and M.
    Uher. SpamRank: fully automatic link spam
    detection. In Proceedings of the First
    International Workshop on Adversarial Information
    Retrieval on the Web, Chiba, Japan, May 2005.
  • Mishne, G., D. Carmel, et al. (2005). Blocking
    Blog Spam with Language Model Disagreement.
    Proceedings of the 1st International Workshop on
    Adversarial Information Retrieval on the Web
    (AIRWeb).
  • Wu, B. and B. Davison (2005). Cloaking and
    Redirection: A Preliminary Study. Proceedings of
    the 1st International Workshop on Adversarial
    Information Retrieval on the Web (AIRWeb).
  • Wu, B. and B. Davison (2005). Identifying link
    farm spam pages. In Proc. of WWW, Chiba, Japan,
    May 2005.
  • P. T. Metaxas and J. DeStefano (2005). Web Spam,
    Propaganda and Trust. Proceedings of the 1st
    International Workshop on Adversarial Information
    Retrieval on the Web (AIRWeb).

77
Readings
  • [1] G. Koutrika, F. A. Effendi, Z. Gyöngyi, P.
    Heymann, and H. Garcia-Molina, "Combating spam in
    tagging systems," in Proceedings of the 3rd
    International Workshop on Adversarial Information
    Retrieval on the Web. Banff, Alberta, Canada:
    ACM, 2007.

78
Thank You!
  • Q&A

79
What's the meaning of spam?
  • www.dict.cn

80
street spam