Title: Combating Web Spam with TrustRank
1Combating Web Spam with TrustRank
- Authors: Zoltan Gyongyi, Hector Garcia-Molina, Jan Pedersen
- Presented by Jing Huang
2What is Web Spam?
- Definition: in this paper, the term web spam refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines.
- What is behind it
- a higher ranking than the pages deserve
- an unjustifiably favorable ranking with respect to the pages' true value
- unethical web page positioning
- It is a problem, not only for search engines
- Primarily for users
- As well as for content providers
- It is first a social problem, then a technical one
- Reference: P. Takis Metaxas, Web Spam, Propaganda and Trust
3How to Spam?
- Add keywords to confuse the search engine about a page's relevance
- e.g.
- SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA
SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN
ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE
MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY
IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA
SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE
PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY
BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ
CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON
GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE
MACPHERSON KATE MOSS CAROL ALT TYRA BANKS
FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN
MULDER VALERIA MAZZA SHALOM HARLOW AMBER
- Create a large number of bogus web pages, all pointing to a single target page
- Reference: P. Takis Metaxas, Web Spam, Propaganda and Trust
4One Solution: Combating Web Spam with TrustRank
- Content
- Preliminaries
- Assessing Trust
- Computing Trust
- Selecting Seeds
- Experiments
- Results
5Preliminaries
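- Web model: the web is a directed graph of pages (nodes) and hyperlinks (edges)
- Transition matrix T: $T(p, q) = 1/\omega(q)$ if page q links to page p, and 0 otherwise, where $\omega(q)$ is the outdegree of q
[Figure: example web graph, with one page labeled indegree 1, outdegree 2]
- Unreferenced page: a page with no inlinks
- Non-referencing page: a page with no outlinks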
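As a minimal sketch of the web model, the transition matrix can be built from a list of links. The function name `transition_matrix`, the 0-based page indices, and the four-page graph are illustrative assumptions, not taken from the slides; the entry rule assumed here is T[p][q] = 1/outdegree(q) when q links to p.

```python
def transition_matrix(n, edges):
    """Build T as a list of rows, with T[p][q] = 1/outdeg(q) if q links to p, else 0.

    Assumes `edges` is a list of (source, target) pairs without duplicates.
    """
    outdeg = [0] * n
    for q, _ in edges:
        outdeg[q] += 1
    T = [[0.0] * n for _ in range(n)]
    for q, p in edges:
        T[p][q] = 1.0 / outdeg[q]
    return T


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 has no links at all.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
T = transition_matrix(4, edges)
```

Each column of a page with outlinks sums to 1, while the column of a non-referencing page (page 3 here) is all zeros.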
6Overview of PageRank
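- PageRank is based on mutual reinforcement between pages: the importance of a page influences, and is influenced by, the importance of other pages.
- The PageRank score r(p) of a page p is defined as
$r(p) = \alpha \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \frac{1}{N}$
where E is the set of links, $\omega(q)$ is the outdegree of q, and N is the number of pages
- Biased PageRank version
$\mathbf{r} = \alpha \, T \, \mathbf{r} + (1 - \alpha) \, \mathbf{d}$
- $\alpha$: decay factor
- $\mathbf{d}$: static score distribution vector of arbitrary non-negative entries summing to one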
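The iteration behind PageRank and its biased version can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names, the default decay factor of 0.85, the iteration counts, and the three-page cycle are all chosen for the example, and pages without outlinks simply leak score rather than being handled specially.

```python
def biased_pagerank(n, edges, d, alpha=0.85, iterations=20):
    """Iterate r = alpha * T * r + (1 - alpha) * d, starting from r = d."""
    outdeg = [0] * n
    for q, _ in edges:
        outdeg[q] += 1
    r = list(d)
    for _ in range(iterations):
        # Static part of the score, taken from the distribution vector d.
        nxt = [(1.0 - alpha) * d[p] for p in range(n)]
        # Each page q passes an equal share of its score along every outlink.
        for q, p in edges:
            nxt[p] += alpha * r[q] / outdeg[q]
        r = nxt
    return r


def pagerank(n, edges, alpha=0.85, iterations=20):
    """Regular PageRank is the special case with a uniform vector d."""
    return biased_pagerank(n, edges, [1.0 / n] * n, alpha, iterations)


# Hypothetical three-page cycle 0 -> 1 -> 2 -> 0.
cycle = [(0, 1), (1, 2), (2, 0)]
uniform = pagerank(3, cycle)
biased = biased_pagerank(3, cycle, [1.0, 0.0, 0.0], iterations=100)
```

On the cycle, the uniform version gives every page the same score, while biasing d toward page 0 makes the scores fall off with distance from page 0.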
7Assessing Trust
- Oracle Function
- O(p) = 0 if p is bad
- O(p) = 1 if p is good
- This binary function models a human expert checking a page
- Example
8Trust Functions
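- Definition: a trust function T yields a value between 0 (bad) and 1 (good) for each page
- Ideal Trust Property: $T(p) = \Pr[O(p) = 1]$
- Ordered Trust Property: $T(p) < T(q) \Leftrightarrow \Pr[O(p) = 1] < \Pr[O(q) = 1]$ and $T(p) = T(q) \Leftrightarrow \Pr[O(p) = 1] = \Pr[O(q) = 1]$
- Threshold Trust Property: $T(p) > \delta \Leftrightarrow O(p) = 1$
If a page p receives a score above the threshold $\delta$, then p is good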
9Computing Trust
[Figure: web graph]
10Computing Trust Example
Set L = 3 (at most three oracle invocations)
11Trust Propagation
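- Assumption: good pages point only to other good pages
- Assign a score of 1 to all pages that are reachable from a good page in S in M or fewer steps
- M-Step Trust Function: $T_M(p) = O(p)$ if $p \in S$; 1 if $p \notin S$ and some good seed reaches p in at most M steps; 1/2 otherwise
- Example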
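The M-step propagation described above can be sketched as a breadth-first search from the good seeds. The function name `m_step_trust`, the dictionary-based oracle, and the five-page chain are assumptions made for illustration: seeds keep their oracle value, pages within M steps of a good seed get 1, and everything else stays at 1/2.

```python
from collections import deque


def m_step_trust(n, edges, seeds, oracle, M):
    """Return M-step trust scores; `oracle` maps each seed to 0 (bad) or 1 (good)."""
    out = [[] for _ in range(n)]
    for q, p in edges:
        out[q].append(p)

    good_seeds = [s for s in seeds if oracle[s] == 1]
    dist = {s: 0 for s in good_seeds}   # steps from the nearest good seed
    queue = deque(good_seeds)
    while queue:
        q = queue.popleft()
        if dist[q] == M:                # do not walk further than M steps
            continue
        for p in out[q]:
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)

    trust = [0.5] * n                   # unknown pages
    for p in dist:
        trust[p] = 1.0                  # reachable from a good seed
    for s in seeds:
        trust[s] = float(oracle[s])     # seeds always keep the oracle value
    return trust


# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4 with a single good seed at page 0.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
scores = m_step_trust(5, chain, seeds=[0], oracle={0: 1}, M=2)
```

With M = 2, pages 1 and 2 inherit a score of 1 from the seed, while pages 3 and 4 remain at 1/2.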
12Trust Attenuation
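- Trust dampening: trust is reduced with every step away from the good seeds
- Trust splitting: the trust of a page is divided equally among its outlinks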
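A single propagation step that combines both ideas might look like the sketch below. The function name `propagate_step`, the dampening factor `beta`, and the small graph are illustrative assumptions; splitting is modeled as dividing by the outdegree and dampening as multiplying by `beta`.

```python
def propagate_step(n, edges, trust, beta=0.85):
    """Return the trust each page receives in one step of dampened, split propagation."""
    outdeg = [0] * n
    for q, _ in edges:
        outdeg[q] += 1
    received = [0.0] * n
    for q, p in edges:
        # Splitting: q's trust is shared equally among its outlinks.
        # Dampening: the share is scaled down by beta.
        received[p] += beta * trust[q] / outdeg[q]
    return received


# Hypothetical graph: seed page 0 links to pages 1 and 2; page 1 links to page 3.
links = [(0, 1), (0, 2), (1, 3)]
step1 = propagate_step(4, links, [1.0, 0.0, 0.0, 0.0], beta=0.5)
step2 = propagate_step(4, links, step1, beta=0.5)
```

With `beta = 0.5`, pages 1 and 2 each receive 0.25 after one step, and page 3 receives 0.125 after two steps.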
13TrustRank Algorithm
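- Function TrustRank
- Input: T transition matrix, N number of pages, L limit of oracle invocations, α_B decay factor for biased PageRank, M_B number of biased PageRank iterations
- Output: t* TrustRank scores
- Begin
- s = SelectSeed() (evaluate the seed desirability of pages)
- σ = Rank({1, ..., N}, s) (order pages by decreasing desirability)
- d = 0_N (select good seeds)
- for i = 1 to L do
- if O(σ(i)) == 1 then d(σ(i)) = 1
- d = d / |d| (normalize static score distribution vector)
- t* = d (compute TrustRank scores)
- for i = 1 to M_B do
- t* = α_B · T · t* + (1 − α_B) · d
- return t*
- End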
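14TrustRank Example
Seed desirability s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Order σ = [2, 4, 5, 1, 3, 6, 7]
Assume L = 3: selected seed set {2, 4, 5}
d = [0, 1/2, 0, 1/2, 0, 0, 0] (page 5 is bad, so only pages 2 and 4 receive a static score)
TrustRank scores t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]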
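The algorithm and example above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the defaults `alpha_b=0.85` and `m_b=20`, and the link structure `edges` are hypothetical. Only the seed-desirability scores and the fact that page 5 is bad come from the example, so the resulting scores are not expected to match the slide.

```python
def good_seed_vector(seed_scores, oracle, L):
    """Return the normalized static score vector d built from the good seeds."""
    n = len(seed_scores)
    # Rank pages by decreasing seed desirability (ties keep index order).
    order = sorted(range(n), key=lambda p: -seed_scores[p])
    d = [0.0] * n
    for p in order[:L]:                 # at most L oracle invocations
        if oracle(p) == 1:
            d[p] = 1.0
    total = sum(d)
    if total == 0:
        raise ValueError("no good seed among the top L pages")
    return [x / total for x in d]


def trustrank(n, edges, d, alpha_b=0.85, m_b=20):
    """Run m_b biased PageRank iterations: t = alpha_b * T * t + (1 - alpha_b) * d."""
    outdeg = [0] * n
    for q, _ in edges:
        outdeg[q] += 1
    t = list(d)
    for _ in range(m_b):
        nxt = [(1.0 - alpha_b) * d[p] for p in range(n)]
        for q, p in edges:
            nxt[p] += alpha_b * t[q] / outdeg[q]
        t = nxt
    return t


# Scores from the example; pages 1..7 become indices 0..6.
s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
bad_pages = {4}                         # page 5 is bad; other pages are never checked here
oracle = lambda p: 0 if p in bad_pages else 1

# Hypothetical links, made up for this sketch (0-based indices).
edges = [(0, 1), (1, 2), (1, 3), (2, 1), (3, 4), (4, 5), (4, 6)]

d = good_seed_vector(s, oracle, L=3)
t = trustrank(7, edges, d)
```

The computed `d` reproduces the vector from the example: the top three pages by desirability are 2, 4 and 5, and only the two good ones share the static score.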
15Selecting Seeds
- High PageRank
- Inverse PageRank
- Coverage is important
- Build the seed set from those pages that point to many pages, which in turn point to many pages, and so on
- The only difference from PageRank is that this method is based on outlinks rather than inlinks
- Example
L = 2
Seed set S = {1, 2}
s = [0.05, 0.05, 0.04, 0.02, 0.02, 0.02, 0.02]
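One way to sketch inverse PageRank is to reverse every link and run the ordinary PageRank iteration on the reversed graph, so that score flows from a page back to the pages that point to it. The function names, the defaults, and the five-page graph below are illustrative assumptions.

```python
def inverse_pagerank(n, edges, alpha=0.85, iterations=20):
    """PageRank on the reversed graph: pages with far-reaching outlinks score high."""
    reversed_edges = [(p, q) for q, p in edges]
    outdeg = [0] * n                    # outdegree in the reversed graph
    for q, _ in reversed_edges:
        outdeg[q] += 1
    s = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for q, p in reversed_edges:
            nxt[p] += alpha * s[q] / outdeg[q]
        s = nxt
    return s


def select_seeds(scores, L):
    """Pick the L pages with the highest seed-desirability score."""
    return sorted(range(len(scores)), key=lambda p: -scores[p])[:L]


# Hypothetical graph: page 0 links to pages 1, 2 and 3; page 1 links to page 4.
web = [(0, 1), (0, 2), (0, 3), (1, 4)]
desirability = inverse_pagerank(5, web)
seeds = select_seeds(desirability, 2)
```

Page 0 reaches every other page, directly or through page 1, so it comes out as the most desirable seed, followed by page 1.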
16Evaluation Metrics(1)
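- Pairwise orderedness (related to the ordered trust property)
$\mathrm{pairord}(T, O, P) = \frac{|P| - \sum_{(p,q) \in P} I(T, O, p, q)}{|P|}$
where $I(T, O, p, q) = 1$ if the trust scores order p and q inconsistently with the oracle ($T(p) \ge T(q)$ and $O(p) < O(q)$, or $T(p) \le T(q)$ and $O(p) > O(q)$), and 0 otherwise
P is a set of ordered pairs of pages (p, q), p ≠ q, from the sample X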
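Pairwise orderedness can be sketched as the fraction of pairs that the trust function does not misorder. The sketch assumes a pair counts as a violation when the page with the lower oracle value has an equal or higher trust score; the function name and the four-page sample are made up for illustration.

```python
from itertools import permutations


def pairord(trust, oracle, pairs):
    """Fraction of ordered pairs (p, q) whose trust ordering agrees with the oracle."""
    violations = 0
    for p, q in pairs:
        if trust[p] >= trust[q] and oracle[p] < oracle[q]:
            violations += 1
        elif trust[p] <= trust[q] and oracle[p] > oracle[q]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)


# Hypothetical sample X of four pages: pages 0 and 1 are good, pages 2 and 3 are bad.
oracle_values = [1, 1, 0, 0]
trust_scores = [0.9, 0.4, 0.6, 0.1]
all_pairs = list(permutations(range(4), 2))   # every ordered pair with p != q
score = pairord(trust_scores, oracle_values, all_pairs)
```

Here good page 1 is ranked below bad page 2, which produces two violating ordered pairs out of twelve.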
17Evaluation Metrics(2)
- Precision
- the fraction of good pages among all pages in X that have a trust score above the threshold δ
- Recall
- the ratio between the number of good pages with a trust score above δ and the total number of good pages in X
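Both metrics can be computed directly from the trust scores and oracle values of the sample. The function name, the threshold value, and the sample data below are illustrative assumptions.

```python
def precision_recall(trust, oracle, delta):
    """Return (precision, recall) for pages whose trust score is above delta."""
    above = [p for p in range(len(trust)) if trust[p] > delta]
    good_above = [p for p in above if oracle[p] == 1]
    total_good = sum(1 for o in oracle if o == 1)
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / total_good if total_good else 0.0
    return precision, recall


# Hypothetical sample X: pages 0 and 1 are good, pages 2 and 3 are bad.
sample_oracle = [1, 1, 0, 0]
sample_trust = [0.9, 0.4, 0.6, 0.1]
precision, recall = precision_recall(sample_trust, sample_oracle, delta=0.5)
```

With δ = 0.5, pages 0 and 2 are above the threshold and only page 0 is good, so precision and recall are both 0.5.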
18Experiments
- Data set: pages grouped into 31,003,946 sites using a proprietary algorithm from AltaVista
- Seed set: inverse PageRank
- Manually evaluated the top 1250 seeds
- Got 178 sites as good seeds
- Evaluation sample
- Used 748 of the sample sites to evaluate TrustRank
- Reputable: 563; Web organization: 37; Advertisement: 13; Spam: 135
19Results
- TrustRank
- 178 good seeds
- PageRank
- Use the PageRank of site a as the value of T(a)
- Ignorant Trust
- All sites are assigned an ignorant trust score of ½, except for the 1250 seeds
20Results
Good Sites in PageRank buckets
Good Sites in TrustRank buckets
Bad Sites in PageRank buckets
Bad Sites in TrustRank buckets
21Results
22Results
23Beyond TrustRank
- Big sites monopolize the search results
- e.g. wiki sites
- Reputation is not the same as relevance
- How to stop other types of spam?
- Image
- Based on the content of the page where the image appears
- Video
- Music