Title: Web Spam
1Web Spam
2Today's Outline
- Web Spam
- The Spammer's Toolbox
- Link Farm
- Combating Web Spam
- TrustRank
3Web Spam
4Story of the 2BigFeet
- From 1995 to 2000, tens of thousands of business owners took out small-business loans to open storefronts on the Web. Neil Moncrief was one of them.
- 40,000 a month in 2003, 95% from search engine referrals.
- On November 14, 2003, the phone stopped ringing and the orders stopped coming in.
- Oh! The Google Dance.
- But
5Economic Considerations
- Search has become the default gateway to the Web.
- ??Search Engine (SE) ???????????.
- ????????????????????.
- ????????????????.
- e.g., e-commerce sites.
- advertising-driven sites
6Ways to Increase SE Referrals
- ??keyword-based ??
- ?????????
- Genuinely better content?
- Or game the system?
- Search Engine Optimization (SEO) is a thriving business
- Some SEOs are ethical
- Some are not
7What Is Web Spam?
- Spamming: any deliberate action taken solely to boost a web page's position in search engine results, incommensurate with the page's real value
- Spam: web pages that are the result of spamming
- Approximately 10-15% of web pages are spam
8Why Web Spam Is Bad
We appreciate your taking the time to help us
improve our service for your fellow users around
the world. By helping us eliminate spam, you're
saving millions of people time, effort and
energy.
9Detecting Web Spam
- Spam detection is a classification problem
- ????,??????/?????spam?
- But what are the salient features?
- ????spamming???????
- Finding the right features is alchemy, not science
- Spammers keep adapting: it's an arms race!
10The Spammer's Toolbox
11Techniques Taxonomy
12Techniques / Boosting / Term
13Term Spamming-What?
14Weaving
Remember not only to say the right thing in the right place, but far more difficult still, to leave unsaid the wrong thing at the tempting moment.
Benjamin Franklin, US author, diplomat, inventor, physicist, politician, printer (1706-1790)
Remember not only airfare to say the right plane tickets thing in the right place, but far cheap travel more difficult still, to leave hotel rooms unsaid the wrong thing at vacation the tempting moment.
Benjamin Franklin, US author, diplomat, inventor, physicist, politician, printer (1706-1790)
15Techniques / Boosting / Term
- Repetition: repetition repetition repetition repetition repetition repetition
- Dumping: dumortierite dumose dumous dump dumpage dumper dumpily dumpiness dumping dumpish dumpishly
- Weaving: work in weaving three-women teams is an ancient textile art on looms
- Phrase stitching: please refrain from using the phrase stitching wounds located on the lower limbs
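As a minimal sketch of the weaving technique illustrated above, the following Python snippet scatters spam terms at random positions inside legitimate text. The function name `weave`, the fixed random seed, and the sample terms are illustrative assumptions, not part of any real tool.

```python
import random


def weave(text, spam_terms, seed=0):
    """Insert each spam term at a random word boundary of `text`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    words = text.split()
    for term in spam_terms:
        position = rng.randint(0, len(words))  # both ends are inclusive
        words.insert(position, term)
    return " ".join(words)


quote = "Remember not only to say the right thing in the right place"
spam_terms = ["airfare", "plane tickets", "cheap travel"]
woven = weave(quote, spam_terms)
print(woven)
```

The original words keep their order, which is why a woven page can still read as half-legitimate to a naive term-based ranker.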
16Techniques / Boosting / Link
Nice story. Read about my inoonlinever.comLas Vegas casino trip.
17Google Bomb
- "Google bombing" project organized by George
Johnston back at the end of October 2003
18Techniques / Hiding
19Techniques / Hiding
- Content hiding
- Cloaking
- Identify web crawlers
- Serve a different version of the page
GET /db_pages/members.html HTTP/1.0
Host: infolab.stanford.edu
User-Agent: AVSearch-3.0(AltaVista/AVC)
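The two cloaking steps (identify the crawler, then serve a different version) can be sketched as follows. This is a hedged illustration only: the marker list `CRAWLER_MARKERS` and the helper names are assumptions, and real cloaking may also rely on crawler IP addresses rather than the User-Agent header alone.

```python
CRAWLER_MARKERS = ("avsearch", "googlebot", "slurp")  # illustrative list


def parse_headers(raw_request):
    """Return the header fields of a raw HTTP request as a dict."""
    headers = {}
    for line in raw_request.splitlines()[1:]:  # skip the request line
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return headers


def is_crawler(raw_request):
    """Guess from the User-Agent header whether a crawler is asking."""
    agent = parse_headers(raw_request).get("user-agent", "").lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)


def choose_page(raw_request):
    """Pick which version of the page a cloaking server would return."""
    return "crawler version" if is_crawler(raw_request) else "browser version"


request = (
    "GET /db_pages/members.html HTTP/1.0\r\n"
    "Host: infolab.stanford.edu\r\n"
    "User-Agent: AVSearch-3.0(AltaVista/AVC)\r\n"
)
print(choose_page(request))
```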
20Link Farm
21Now, focus on link farm
- Link spamming: inflating the rank of a page by creating boosting links to it
- From unaffiliated sites (e.g., blogs, guest books, web forums, etc.)
- From partner sites: link exchanges
- From own sites: link farms
22Link Exchanges
- Reciprocal link: a mutual link between two objects, commonly two websites, to ensure mutual traffic.
- Three-way linking
- An attempt to create links that look more "natural" to search engines.
- siteA → siteB → siteC → siteA
- Automated Linking
- automatic link exchange services
"Quick and lazy schemes are increasingly worth less and less, so avoid them at all costs. Here's a good rule of thumb for SEO: if it's automated, super-easy, or super-fast, avoid it."
("Why Link Exchanges are a Bad Idea" by Scott Allen)
23PageRank Algorithm I
- Basic idea (BP98): given the web graph $G = (V, E)$, we define a regular Markov chain $M$ on this web graph. The PageRank vector is the stationary distribution of $M$.
- Step 1: suppose $A$ is the adjacency matrix.
- Example
- Step 2: normalize the rows of $A$ to get $P$.
- Example
- We call a page without outgoing links a dangling page.
24PageRank Algorithm II
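- Step 3: $\bar{P}$ is created by replacing each all-zero row $0^T$ of $P$ with a probability vector $v^T$ (uniform, $v_p = 1/N$, in the fair case).
- Example
- Step 4: transition matrix of the Markov chain: $M = \alpha \bar{P} + (1-\alpha)\, e\, v^T$, where $\alpha$ is the damping factor and $e$ is the all-ones vector.
- Example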
25PageRank Algorithm III
- $M$ is regular.
- Theorem 1 (KS60): for a finite-state regular Markov chain $P$, there exists a unique stationary distribution $\pi$.
- PageRank is the stationary distribution of the Markov chain $M$, i.e., $\pi^T M = \pi^T$ and $\pi^T e = 1$.
- In the previous example, the PageRank vector is $(0.292, 0.292, 0.416)$.
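The four steps can be sketched as a power iteration in plain Python. The three-page graph below is an assumption, since the slide's example figure did not survive: pages 1 and 2 link to each other and to page 3, and page 3 is dangling. With a damping factor of 0.85 and uniform jumps, this graph happens to give the vector quoted above.

```python
def pagerank(links, alpha=0.85, iterations=100):
    """Power iteration; rank of dangling pages is spread uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Step 3: a dangling page behaves as if it linked to every page.
        dangling = sum(rank[p] for p in pages if not links[p])
        # Step 4: random jump plus the share inherited from dangling pages.
        new = {p: (1 - alpha) / n + alpha * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                # Step 2: each out-link carries an equal share of p's rank.
                new[q] += alpha * rank[p] / len(links[p])
        rank = new
    return rank


# Step 1: adjacency lists of the assumed example graph.
links = {1: [2, 3], 2: [1, 3], 3: []}
ranks = pagerank(links)
print([round(ranks[p], 3) for p in (1, 2, 3)])  # [0.292, 0.292, 0.416]
```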
26PageRank
- PageRank in one equation:
- $PR(p) = \alpha\,(M \cdot PR)(p) + (1-\alpha)\,V_p$
- $M$ is the adjacency matrix of the Web graph.
- $\alpha$ is the damping factor (usually 0.85).
- $V$ is the personalization vector.
- In case of fairness, $V_p = 1/N$ ($N$ = number of pages in the Web).
- What happens if a page $p$ has no outgoing links?
- $\alpha$ of its PR is lost, so all the PR will be lost eventually.
- Solution: normalize the rows of $M$ (i.e., insert links to every other page).
27Aggregate Page Rank
- Total page rank is affected by
- Number of pages
- Incoming Links
- Outgoing Links
- Dangling Nodes
- Topologies that
- Use as many pages as possible
- minimize outgoing links
- minimize dangling nodes
[Figure: a WEB-SITE box with its incoming links and outgoing links]
28Chain topology (more is better)
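[Figure: chain topologies between an incoming page I and an outgoing page O.
One page (I → a → O): I = 0.18, a = 0.34, O = 0.47; PR(Web Site) = 0.34.
Two pages (I → a → b → O): I = 0.11, a = 0.21, b = 0.29, O = 0.37; PR(Web Site) = 0.21 + 0.29 = 0.50.
Six pages (a to f): PR(Web Site) = 0.77.]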
29Ring topology
[Figure: ring topology. Baseline I → a → O: I = 0.18, a = 0.34, O = 0.47. Six pages (a to f) linked in a ring: PR(Web Site) = 0.86.]
30Clique topology
[Figure: clique topology. Baseline I → a → O: I = 0.18, a = 0.34, O = 0.47. Six fully interlinked pages (a to f): PR(Web Site) = 0.93.]
31Increasing Page Rank of a single target page
- Complicated structures do not help
- Chain, ring, and clique waste PageRank by spreading it among every node in the web site
- Then
32Star topology
[Figure: star topology. Baseline I → a → O: I = 0.18, a = 0.34, O = 0.47. Star with target a and pages b to f: I = 0.03, O = 0.09, b to f = 0.09 each; PR(a) = 0.43.]
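The topology figures can be checked numerically with a small sketch. It assumes a damping factor of 0.85, a dangling page O whose rank is spread uniformly, and, for the star, that the target a links to b through f and to O while b through f link back to a. The star wiring is an assumption because the figure itself did not survive, but under these assumptions the computed values agree with the numbers quoted for the chain and the star.

```python
def pagerank(links, alpha=0.85, iterations=200):
    """Power iteration; rank of dangling pages is spread uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {p: (1 - alpha) / n + alpha * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += alpha * rank[p] / len(links[p])
        rank = new
    return rank


def chain(n_pages):
    """Build I -> p1 -> ... -> pn -> O, with O dangling."""
    names = ["I"] + [f"p{i}" for i in range(1, n_pages + 1)] + ["O"]
    links = {name: [nxt] for name, nxt in zip(names, names[1:])}
    links["O"] = []
    return links


def site_rank(ranks):
    """Aggregate PageRank of the site, excluding I and O."""
    return sum(r for page, r in ranks.items() if page not in ("I", "O"))


star = {"I": ["a"], "a": ["b", "c", "d", "e", "f", "O"], "O": []}
for leaf in "bcdef":
    star[leaf] = ["a"]  # every boosting page points back to the target

chain_1 = site_rank(pagerank(chain(1)))
chain_6 = site_rank(pagerank(chain(6)))
star_target = pagerank(star)["a"]
print(round(chain_1, 2), round(chain_6, 2), round(star_target, 2))
```

Adding pages raises the aggregate rank of a chain, while the star concentrates rank on the single target.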
33Link Farm Model
34How to Do It?
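- From the spammer's point of view, there are three kinds of pages
- Own pages
- Completely controlled by the spammer
- May span multiple domain names
- Accessible pages
- e.g., web log comments pages
- The spammer can post links to his own pages
- Inaccessible pages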
35Link Spam Farms
- Spammer's goal
- Maximize the PageRank of the target page t
- Techniques
- Get as many links as possible from accessible pages to t
- Construct a link farm to get a PageRank multiplier effect
36Organization
One of the most common and effective
organizations for a link farm
37A Simple Farm Model
- Suppose the rank contributed by accessible pages is $\lambda$
- Let $y$ be the PageRank of the target page, $N$ the number of all pages, and $k$ the number of farm pages
- Rank of each farm page: $(1-c)/N$, where $c$ is the damping factor
- $y = \lambda + kc(1-c)/N + (1-c)/N = \lambda + (1-c)(ck+1)/N$
- No multiplier effect for acquired PageRank
- By making $k$ large, we can make $y$ as large as we want
38An Optimal Model
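- Model: suppose the rank contributed by accessible pages is $\lambda$
- Let $y$ be the PageRank of the target page, $N$ the number of all pages, and $k$ the number of farm pages
- Rank of each farm page: $cy/k + (1-c)/N$
- $y = \lambda + ck\,[cy/k + (1-c)/N] + (1-c)/N$
- $\;\; = \lambda + c^2 y + c(1-c)k/N + (1-c)/N$
- $y = \lambda/(1-c^2) + (ck+1)/(N + cN)$. For $c = 0.85$, $1/(1-c^2) \approx 3.6$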
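A short numeric check of the two farm models, as a sketch under the slides' own simplified rank equations. The variable `lam` stands for the rank contributed by accessible pages, and the sample values of `lam`, `k`, and `n` are arbitrary assumptions.

```python
def simple_farm(lam, k, n, c=0.85):
    """Target rank when farm pages get no links back from the target."""
    return lam + (1 - c) * (c * k + 1) / n


def optimal_farm(lam, k, n, c=0.85):
    """Closed form when the target also links back to its k farm pages."""
    return lam / (1 - c ** 2) + (c * k + 1) / (n + c * n)


def optimal_farm_iterative(lam, k, n, c=0.85, iterations=200):
    """Solve y = lam + c*k*(c*y/k + (1-c)/n) + (1-c)/n by iteration."""
    y = 0.0
    for _ in range(iterations):
        farm_page = c * y / k + (1 - c) / n
        y = lam + c * k * farm_page + (1 - c) / n
    return y


lam, k, n = 0.001, 1000, 1_000_000
multiplier = 1 / (1 - 0.85 ** 2)
print(round(multiplier, 1))  # 3.6
print(simple_farm(lam, k, n), optimal_farm(lam, k, n))
```

The iterative solver converges to the closed form because each round shrinks the error by a factor of $c^2$.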
39Comparison
40A Theorem for Link Farm
- The PageRank score of the target is maximal if and only if
- All boosting pages (farm pages) point directly to the target
- Boosting pages do not link to each other
- The target points to some or all boosting pages
- All hijacked links point to the target
Really Optimal?
41Optimal Farm
Lesson 1: short loop(s) increase target PageRank
42Two Farms
- Alliances interconnected farms
- Single spammer, several target pages/farms
- Multiple spammers
What happens if you and I team up?
43Two Farms
- We can do this
- but it wouldn't help: target scores balance out
- $p_0 = q_0 = \frac{c(m+k)}{2N(1+c)}$, so $p_0 = q_0 = d(m+k)/2$, where $d = \frac{c}{N(1+c)}$
44Two Farms
- However, we can do this
- Remove the links to boosting pages
- and both targets scores increase
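- For $k = m$, we have a 1.85x increase
- $p_0 = q_0 = \lambda/(1-c) + (ck+1)/N$, where $\lambda$ is the rank contributed by accessible pages. So the gain is $1 + c = 1.85$, and $1.85 \times 3.6 \approx 6.7$
Lesson 2: target pages should only link to other targets
Lesson 3: in an alliance of two, both participants win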
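The 1.85x gain can be reproduced with a small sketch under the slides' simplified rank equations. It assumes two farms of equal size `k`, each receiving the same rank `lam` from accessible pages, and targets that link only to each other, so every boosting page keeps just the base rank $(1-c)/N$. The sample values are arbitrary.

```python
def single_farm(lam, k, n, c=0.85):
    """Optimal stand-alone farm: the target links back to its farm pages."""
    return lam / (1 - c ** 2) + (c * k + 1) / (n * (1 + c))


def allied_farm(lam, k, n, c=0.85, iterations=500):
    """Two equal farms whose targets link only to each other."""
    p = q = 0.0
    for _ in range(iterations):
        # Rank a target collects from outside links, its k boosting
        # pages, and the random jump.
        boost = lam + c * k * (1 - c) / n + (1 - c) / n
        p, q = boost + c * q, boost + c * p
    return p


lam, k, n = 0.001, 1000, 1_000_000
gain = allied_farm(lam, k, n) / single_farm(lam, k, n)
print(round(gain, 2))  # 1.85
```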
45Larger Alliances
- Completely connected core
46Larger Alliances
- Target scores for ring/complete cores
- 10 farms of sizes 1000, 2000, ..., 10000
Lesson 4: larger alliances need to be stable to keep all participants happy
Problem: farm 10 loses in a ring
47Features Identifying Link Spam
- Large number of links from low-ranked pages
- ????????????
- Links mostly from affiliated pages
- Same web site, same domain
- Same IP address
- Same owner (according to WHOIS record)
- Linking pages are machine-generated
- etc.
48Combating Web Spam
49Combating Web Spam
- Statistical Detection
- Comment Spam Detection
- Detecting Cloaking and Redirection
- Secrecy
- Content Based Detection
- Graph Based Detection
50Statistical Detection
- Fetterly et al. (2004) provide a list of attributes that are often present in spam pages, for example:
- Large numbers of hostnames resolving to a single IP address
- Large sets of pages with little variance in content
- Disproportionately high ratio of incoming to outgoing links to a page
- If detection by statistics gains popularity, we believe some of these predictors will become obsolete as spammers adapt.
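The first attribute in the list above can be sketched as a simple grouping step. The function name `crowded_ips`, the `min_hosts` cutoff, and the sample hostnames and addresses are hypothetical; in practice the cutoff would have to be far larger and tuned on real data.

```python
from collections import defaultdict


def crowded_ips(host_to_ip, min_hosts=3):
    """Return the IP addresses that at least `min_hosts` hostnames share."""
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host)
    return {
        ip: sorted(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) >= min_hosts
    }


sample = {
    "cheap-travel.example": "192.0.2.10",
    "cheap-hotels.example": "192.0.2.10",
    "cheap-flights.example": "192.0.2.10",
    "library.example": "192.0.2.20",
}
suspicious = crowded_ips(sample)
print(suspicious)
```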
51Comment Spam Detection
- Mishne et al. (2005) propose identifying comment spam by comparing the language models in the commented page and the comment itself.
- A language model models the probability of words occurring in a given text
- Their results look promising, though their test collection is small.
- They note that most of the incorrect classifications occur with very short comments. As a potential solution to this, they suggest appending the linked page to the comment before language model generation.
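A minimal sketch of language model disagreement follows. It assumes unigram models with add-one smoothing and the KL divergence as the distance; both are simplifications chosen for illustration, and the sample texts are invented.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary):
    """Add-one smoothed word probabilities over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}


def kl_divergence(comment, post):
    """KL(comment || post) between the two smoothed unigram models."""
    vocabulary = set(comment.lower().split()) | set(post.lower().split())
    p = unigram_model(comment, vocabulary)
    q = unigram_model(post, vocabulary)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


post = "the pagerank algorithm ranks web pages using the link graph of the web"
on_topic = "pagerank ranks pages using the web graph"
spam = "cheap casino poker bonus casino online"

print(kl_divergence(on_topic, post) < kl_divergence(spam, post))  # True
```

A comment whose divergence from the post exceeds some threshold would be flagged. Very short comments give unreliable estimates, which matches the weakness noted above.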
52Detecting Cloaking and Redirection
- Wu and Davison (2005) compare the pages returned by four separate crawls (two reporting to the server as a common web browser, two reporting as a search crawler) using a thresholded difference calculation that takes into account unique term and link differences between the pages.
- In practice, it is very difficult to separate malicious and acceptable cloaking.
- You can do more!
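The four-crawl comparison can be sketched as below. This is a simplified heuristic written for illustration, not the exact calculation from the paper: it only counts term differences (link differences are left out), and the function names and the `threshold` default are assumptions.

```python
def term_difference(page_a, page_b):
    """Number of terms that appear in exactly one of the two pages."""
    return len(set(page_a.split()) ^ set(page_b.split()))


def looks_cloaked(browser_1, browser_2, crawler_1, crawler_2, threshold=0):
    """Flag a URL when browser and crawler copies differ by more than
    the copies fetched with the same identity differ from each other."""
    noise = max(
        term_difference(browser_1, browser_2),
        term_difference(crawler_1, crawler_2),
    )
    cross = min(
        term_difference(browser_1, crawler_1),
        term_difference(browser_2, crawler_2),
    )
    return cross - noise > threshold


cloaked = looks_cloaked(
    "welcome to our shop today",
    "welcome to our shop now",
    "cheap airfare hotel rooms vacation deals",
    "cheap airfare hotel rooms vacation deals",
)
honest = looks_cloaked(
    "news at 10:01", "news at 10:02", "news at 10:03", "news at 10:04"
)
print(cloaked, honest)  # True False
```

Fetching each identity twice lets ordinary dynamic content (timestamps, rotating ads) be discounted as noise.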
53Secrecy
- Search engines do not disclose the full details of their ranking algorithms, in order to protect their business and to prevent easy exploitation.
- Security by secrecy is a useful way to slow the progress of attack, but it is not a solution.
54Content Based Detection
- Features Identifying Synthetic Content
- Average word length
- The mean word length for English prose is about 5 characters
- Word frequency distribution
- Certain words (the, a, ...) appear more often than others
- N-gram frequency distribution
- Some words are more likely to occur next to each other than others
- Grammatical well-formedness
- Ntoulas et al. (2006) introduce a number of heuristic methods for detecting content-based spam and combine these methods to create a highly accurate classifier. Their classifier can correctly identify 86.2% of all spam pages.
55Heuristic Methods
- Number of words in the page
- Number of words in the page title
- Average length of words
- Amount of anchor text
- Fraction of visible content
- Compressibility
- Fraction of page drawn from globally popular words
- Fraction of globally popular words
- Independent n-gram likelihoods
- Conditional n-gram likelihoods
56Average Length of Words
- For average word lengths of 4-6, the prevalence of spam is 10-20%
- 50% of the pages with an average word length of 8 are spam
- Every sample page with an average word length of ten is spam
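Two of the heuristics listed earlier, average word length and compressibility, are easy to compute. The sketch below assumes whitespace tokenization and uses `zlib` as the compressor; both choices are illustrative assumptions rather than the exact setup of the cited study.

```python
import zlib


def average_word_length(text):
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compression_ratio(text):
    """Uncompressed size divided by compressed size (higher = more repetitive)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


repetitive = "cheap travel " * 200
prose = "Search engines do not disclose the full details of their ranking algorithms."

print(average_word_length("the cat sat"))  # 3.0
print(compression_ratio(repetitive) > compression_ratio(prose))  # True
```

Feature values like these would be fed to a classifier together with the other heuristics, not used as stand-alone spam tests.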
57Fraction of Visible Content
- Many spam pages contain more visible content
58Graph Based Detection
- Gyöngyi et al. (2004) suggest a reputation propagation method to identify trustworthy pages.
- Wu and Davison (2005) suggest a technique to detect and penalize link farms.
- Metaxas and DeStefano (2005) liken web spam to social propaganda, and use techniques from social science. Their algorithm requires a human to find an untrustworthy seed page.
59Webmaster Guidelines
- Quality guidelines: basic principles
- Don't participate in link schemes designed to increase your site's ranking or PageRank.
- In particular, avoid links to web spammers or "bad neighborhoods" on the web, as your own ranking may be affected adversely by those links.
60TrustRank Idea
Approximate isolation of good pages: good pages seldom point to spam
61TrustRank Idea
- ?good pages? spam pages?????
- ?very good pages????
- ?known good pages?????
- ???????????ranking
62Step 1 Seed Select
- Basic principle: approximate isolation
- It is rare for a good page to point to a bad (spam) page
- Select a set of seed pages from the Web
- Labeling the seeds by hand is an expensive and indispensable task
- ?????????,??????
- High outdegree
- High inverse PageRank
- High PageRank
63Step 2 Trust Propagation
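- Call the seed pages identified as good pages the trusted pages, and set their trust to 1
- Propagate trust through the links
- Each page gets a trust value between 0 and 1
- Pages below the trust threshold are marked as spam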
64Simple Score Propagation
65Score Splitting
66Trust Attenuation
67Now TrustRank Algorithm
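- function TrustRank
- Input: $T$ transition matrix, $N$ number of pages, $L$ limit of oracle invocations, $\alpha_B$ decay factor for biased PageRank, $M_B$ number of biased PageRank iterations
- Output: $t^*$ TrustRank scores
- begin
- $s$ = SelectSeed(...)
- $\sigma$ = Rank({1, ..., N}, $s$)
- Select good seeds: $d = 0_N$
- for $i = 1$ to $L$ do
- if $O(\sigma(i)) = 1$ then $d(\sigma(i)) = 1$
- Normalize static score distribution vector: $d = d / |d|$
- Compute TrustRank scores: $t^* = d$
- for $i = 1$ to $M_B$ do $t^* = \alpha_B \cdot T \cdot t^* + (1 - \alpha_B) \cdot d$
- return $t^*$
- end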
68Example
Seed-desirability scores s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Order = [2, 4, 5, 1, 3, 6, 7]
Assume L = 3: selected seed set = {2, 4, 5}
d = [0, 1/2, 0, 1/2, 0, 0, 0]
TrustRank scores t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
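The propagation part of the algorithm can be sketched in a few lines of Python. The sketch assumes the good seeds are already known (seed selection and the oracle are skipped), and the five-page graph is hypothetical: three good pages `g1` to `g3` and two spam pages `s1`, `s2` that link to each other and to a good page.

```python
def trustrank(links, good_seeds, alpha=0.85, iterations=20):
    """Biased PageRank whose random jumps land only on the good seeds."""
    pages = list(links)
    # Normalized static score distribution: uniform over the good seeds.
    d = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(d)
    for _ in range(iterations):
        new = {p: (1 - alpha) * d[p] for p in pages}
        for p in pages:
            for q in links[p]:
                # Trust is split evenly among the out-links of p.
                new[q] += alpha * trust[p] / len(links[p])
        trust = new
    return trust


links = {
    "g1": ["g2", "g3"],
    "g2": ["g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],  # spam may link to good pages
    "s2": ["s1"],
}
scores = trustrank(links, good_seeds={"g1", "g2"})
print({page: round(score, 3) for page, score in scores.items()})
```

Because no good page points to `s1` or `s2`, they receive no trust at all, while the non-seed page `g3` inherits trust through its in-links. This is the approximate isolation principle at work.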
69The last story
70SEO contest
- An SEO contest is a prized activity that challenges search engine optimization practitioners to rank themselves among the major search engines.
- The "nigritude ultramarine" competition by SearchGuild is widely acclaimed as the mother of all SEO contests in the English world.
- Dates: May 7, 2004 to July 7, 2004
- Keyword: nigritude ultramarine
- Prize: iPod, flat panel LCD screen, and various bonus prizes
71Optimization Tricks
72The winner
- Anil Dash: an early and influential blogger who began his weblog in 1999.
- Dash stated that his goal in entering the contest was to "prove that real content trumps all the shady optimization tricks that someone can figure out"
- Another competitor took this idea further and wrote the Nigritude Ultramarine FAQ, which
- placed sixth overall and won the "Judge's Choice" award
- remains a valuable source of information about the competition.
73Summary
- Web Spam
- It's an arms race
- The Spammer's Toolbox
- Link Farm
- Detecting Web Spam
- TrustRank
74Resources
- Web spam detection dataset: WEBSPAM-UK2006
- AIRWeb: Adversarial Information Retrieval on the Web
- search engine spam and optimization,
- crawling the web without detection,
- link-bombing (a.k.a. Google-bombing),
- comment spam, referrer spam,
- blog spam (splogs),
- malicious tagging,
- reverse engineering of ranking algorithms,
- advertisement blocking, and
- web content filtering.
75References
- Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen. Link Spam Detection Based on Mass Estimation. VLDB 2006, September 12-15, 2006, Seoul, Korea.
- A. Ntoulas, M. Najork, and M. Manasse. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, Edinburgh, Scotland, 2006.
- Zoltán Gyöngyi, Hector Garcia-Molina. Link Spam Alliances. 31st International Conference on Very Large Data Bases (VLDB), Trondheim, Norway, 2005.
- Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen. Combating Web Spam with TrustRank. 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
- D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics. In Seventh WebDB Workshop, 2004.
76References
- A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, May 2005.
- Mishne, G., D. Carmel, et al. (2005). Blocking Blog Spam with Language Model Disagreement. Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
- Wu, B. and B. Davison (2005). Cloaking and Redirection: A Preliminary Study. Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
- Wu, B. and B. Davison (2005). Identifying link farm spam pages. In Proc. of WWW, Chiba, Japan, May 2005.
- P. T. Metaxas and J. DeStefano (2005). Web Spam, Propaganda and Trust. Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
77Readings
- [1] G. Koutrika, F. A. Effendi, Z. Gyöngyi, P. Heymann, and H. Garcia-Molina, "Combating spam in tagging systems," in Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web. Banff, Alberta, Canada: ACM, 2007.
78Thank You!
79What's the meaning of spam?
80street spam