Combating Web Spam with TrustRank - PowerPoint PPT Presentation

Slides: 26
Provided by: Jing57
Transcript and Presenter's Notes


1
Combating Web Spam with TrustRank
  • Authors: Zoltán Gyöngyi, Hector Garcia-Molina,
    Jan Pedersen
  • Presented by: Jing Huang

2
What is Web Spam?
  • Definition in this paper: the term "web spam"
    refers to hyperlinked pages on the World Wide Web
    that are created with the intention of misleading
    search engines.
  • What is behind it:
  • a higher ranking than the pages deserve
  • unjustifiably favorable ranking with respect to
    the pages' true value
  • unethical web page positioning
  • It is a problem not only for search engines:
  • primarily for users
  • as well as for content providers
  • It is first a social problem, then a technical
    one
  • Reference: P. Takis Metaxas, "Web Spam, Propaganda
    and Trust"

3
How to Spam?
  • Add keywords to confuse the page's relevance
  • e.g.:
  • SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA
    SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN
    ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE
    MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY
    IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA
    SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE
    PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY
    BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ
    CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON
    GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE
    MACPHERSON KATE MOSS CAROL ALT TYRA BANKS
    FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN
    MULDER VALERIA MAZZA SHALOM HARLOW AMBER
  • Create a large number of bogus web pages,
    all pointing to a single target page
  • Reference: P. Takis Metaxas, "Web Spam, Propaganda
    and Trust"

4
One solution: Combating Web Spam
with TrustRank
  • Contents
  • Preliminaries
  • Assessing Trust
  • Computing Trust
  • Selecting Seeds
  • Experiments
  • Results

5
Preliminaries
  • Web Model
  • transition matrix T

[Figure: example web graph. Labels: indegree 1, outdegree 2; unreferenced page (no inlinks); non-referencing page (no outlinks)]
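
The web model on this slide can be illustrated with a small sketch. It assumes the convention used in the TrustRank paper, where T(p, q) = 1/outdegree(q) if q links to p, and the inverse matrix U(p, q) = 1/indegree(q) if p links to q. The function name, the edge-list format, and the three-page graph are hypothetical and chosen only for illustration.

```python
def transition_matrices(n, edges):
    """Build T and U for pages 0..n-1 from directed links (q, p), meaning q -> p.

    Assumes there are no duplicate links in `edges`.
    """
    outdeg = [0] * n
    indeg = [0] * n
    for q, p in edges:
        outdeg[q] += 1
        indeg[p] += 1
    T = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for q, p in edges:
        T[p][q] = 1.0 / outdeg[q]  # q spreads its score over its outlinks
        U[q][p] = 1.0 / indeg[p]   # inverse graph: p spreads over its inlinks
    return T, U


# Hypothetical graph: page 0 is unreferenced, page 2 is non-referencing.
edges = [(0, 1), (0, 2), (1, 2)]
T, U = transition_matrices(3, edges)
```

Each column of T that belongs to a page with outlinks sums to 1, which is what lets the matrix act as a score-propagation step in PageRank.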
6
Overview of PageRank
  • PageRank is based on mutual reinforcement
    between pages: the importance of a certain page
    influences, and is influenced by, the
    importance of some other pages.
  • The PageRank score r(p) of a page p is defined
    as
  • Biased PageRank version

Decay factor
Static score distribution vector of arbitrary
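
For reference, the PageRank definitions can be written as below. This is a reconstruction in the notation of the TrustRank paper rather than a copy of the slide: α is the decay factor, ω(q) the outdegree of page q, E the set of links, N the number of pages, T the transition matrix, and d the static score distribution vector.

```latex
\begin{align}
  r(p) &= \alpha \sum_{q : (q,p) \in E} \frac{r(q)}{\omega(q)}
          + (1 - \alpha)\,\frac{1}{N}
          && \text{(PageRank score of page } p\text{)} \\
  \mathbf{r} &= \alpha\,\mathbf{T}\,\mathbf{r}
          + (1 - \alpha)\,\frac{1}{N}\,\mathbf{1}_N
          && \text{(matrix form)} \\
  \mathbf{r} &= \alpha\,\mathbf{T}\,\mathbf{r}
          + (1 - \alpha)\,\mathbf{d}
          && \text{(biased PageRank)}
\end{align}
```

The biased version differs only in replacing the uniform vector with d, which is what TrustRank later uses to inject trust at the seed pages.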
7
Assessing Trust
  • Oracle function
  • O(p) = 0 if p is bad
  • O(p) = 1 if p is good
  • This binary function models a human expert
    checking a page
  • Example

8
Trust Functions
  • Definition: a trust function T yields a
    value in the range between 0 (bad) and 1 (good)
  • Ideal Trust Property
  • Ordered Trust Property
  • Threshold Trust Property

Threshold trust property: if a page p receives a score above the threshold δ,
then p is good
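
The three properties can be stated formally as below. This is a sketch in the notation of the TrustRank paper, with O the oracle function and δ the threshold; the exact form shown on the slide is not preserved in the transcript.

```latex
\begin{align}
  \text{Ideal:} \quad & T(p) = \Pr[O(p) = 1] \\
  \text{Ordered:} \quad & T(p) < T(q) \iff \Pr[O(p) = 1] < \Pr[O(q) = 1], \\
                        & T(p) = T(q) \iff \Pr[O(p) = 1] = \Pr[O(q) = 1] \\
  \text{Threshold:} \quad & T(p) > \delta \iff O(p) = 1
\end{align}
```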
9
Computing Trust
Web graph
  • Ignorant Trust Function

10
Computing Trust Example
Set L = 3
11
Trust Propagation
  • Assumption: good pages point only to other good
    pages. Assign a score of 1 to all pages that are
    reachable from a good page in S in M or fewer steps.
  • M-step trust function
  • Example
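
The M-step trust function can be sketched as a breadth-first search from the good seeds. This is a minimal illustration, assuming seed pages get their oracle value, pages reachable from a good seed within M steps get 1, and all other pages get 1/2; with M = 0 it reduces to the ignorant trust function. The function name, the chain graph, and the oracle are hypothetical.

```python
from collections import deque


def m_step_trust(n, edges, seeds, oracle, m):
    """Trust scores for pages 0..n-1; edges are (q, p) pairs meaning q -> p."""
    out = {p: [] for p in range(n)}
    for q, p in edges:
        out[q].append(p)

    good_seeds = [p for p in seeds if oracle(p) == 1]
    dist = {p: 0 for p in good_seeds}  # shortest distance from any good seed
    queue = deque(good_seeds)
    while queue:
        q = queue.popleft()
        if dist[q] == m:
            continue  # do not expand beyond m steps
        for p in out[q]:
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)

    trust = [0.5] * n                # default: no information
    for p in dist:
        trust[p] = 1.0               # reachable from a good seed within m steps
    for p in seeds:
        trust[p] = float(oracle(p))  # seeds keep their oracle value
    return trust


# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4, where pages 3 and 4 are bad.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
is_good = lambda p: 1 if p in {0, 1, 2} else 0
trust_m0 = m_step_trust(5, chain, [0], is_good, 0)
trust_m2 = m_step_trust(5, chain, [0], is_good, 2)
trust_m3 = m_step_trust(5, chain, [0], is_good, 3)
```

In this example, M = 3 already assigns full trust to the bad page 3, because a good page links to it. That is the weakness that trust attenuation is meant to reduce.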

12
Trust Attenuation
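  • Trust dampening: the score is reduced with each
    step away from the good seeds
  • Trust splitting: a page's trust is divided among
    the pages it links to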

13
TrustRank Algorithm
  • Function TrustRank
  • Input: T - transition matrix, N - number of
    pages, L - limit of oracle invocations,
    α_B - decay factor for biased PageRank,
    M_B - number of biased PageRank iterations
  • Output: t* - TrustRank scores
  • begin
  • s = SelectSeed()
  • σ = Rank({1, ..., N}, s)
  • d = 0_N   (select good seeds)
  • for i = 1 to L do
  •     if O(σ(i)) == 1 then d(σ(i)) = 1
  • d = d / |d|   (normalize static score distribution vector)
  • t* = d   (compute TrustRank scores)
  • for i = 1 to M_B do
  •     t* = α_B · T · t* + (1 − α_B) · d
  • return t*
  • end
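
The algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the seed-desirability scores s are taken as given (the SelectSeed step is not modeled), matrices are plain nested lists, and the four-page graph, the scores, and the oracle are hypothetical. The defaults alpha = 0.85 and iters = 20 are illustrative choices.

```python
def transition_matrix(n, edges):
    """T[p][q] = 1/outdegree(q) if q links to p; edges are (q, p) pairs."""
    outdeg = [0] * n
    for q, _ in edges:
        outdeg[q] += 1
    T = [[0.0] * n for _ in range(n)]
    for q, p in edges:
        T[p][q] = 1.0 / outdeg[q]
    return T


def trustrank(T, s, oracle, L, alpha=0.85, iters=20):
    n = len(T)
    # Rank pages by decreasing seed desirability.
    order = sorted(range(n), key=lambda p: s[p], reverse=True)
    # Invoke the oracle on the top L pages and keep only the good ones.
    d = [0.0] * n
    for p in order[:L]:
        if oracle(p) == 1:
            d[p] = 1.0
    total = sum(d)
    if total == 0:
        raise ValueError("no good seed among the top L pages")
    d = [x / total for x in d]  # normalize so that the entries sum to 1
    # Biased PageRank iterations, starting from d.
    t = d[:]
    for _ in range(iters):
        t = [alpha * sum(T[p][q] * t[q] for q in range(n)) + (1 - alpha) * d[p]
             for p in range(n)]
    return t


# Hypothetical graph: pages 0, 1, 2 are good; page 3 is bad.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
T = transition_matrix(4, edges)
s = [0.4, 0.3, 0.2, 0.1]  # assumed seed-desirability scores
t = trustrank(T, s, lambda p: 1 if p < 3 else 0, L=2)
```

Page 3 still receives some trust through the link from page 2, but less than page 2 itself, because the decay factor attenuates the score at each step.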
14
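s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Ordering σ = [2, 4, 5, 1, 3, 6, 7]
Assume L = 3; selected seed set {2, 4, 5}
d = [0, 1/2, 0, 1/2, 0, 0, 0] (the oracle rejects page 5, so only pages 2 and 4 get weight)
t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]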
15
Selecting Seeds
  • High PageRank
  • Inverse PageRank
  • Coverage is important
  • Build the seed set from those pages which point
    to many pages that in turn point to many pages
    and so on.
  • The only difference from PageRank is that this
    method is based on outlinks instead of inlinks
  • Example

L = 2
S = {1, 2}
s = [0.05, 0.05, 0.04, 0.02, 0.02, 0.02, 0.02]
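
Inverse PageRank can be sketched as ordinary PageRank run on the graph with all links reversed, so that a page scores highly when it points to many pages that in turn point to many pages. The function name, the uniform start vector, and the five-page graph below are assumptions made for illustration; alpha = 0.85 and 20 iterations are illustrative defaults.

```python
def inverse_pagerank(n, edges, alpha=0.85, iters=20):
    """Seed-desirability scores; edges are (q, p) pairs meaning q -> p."""
    indeg = [0] * n
    for _, p in edges:
        indeg[p] += 1
    s = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - alpha) / n] * n
        for q, p in edges:
            # Score flows backwards along the link, from p to q,
            # split over the inlinks of p.
            nxt[q] += alpha * s[p] / indeg[p]
        s = nxt
    return s


# Hypothetical graph: page 0 links to 1, 2, 3, and page 1 links to 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 4)]
scores = inverse_pagerank(5, edges)
seed_order = sorted(range(5), key=lambda p: scores[p], reverse=True)
```

Here page 0 comes first in the ordering because trust placed on it can reach every other page, followed by page 1.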
16
Evaluation Metrics(1)
  • Pairwise orderedness (related to the ordered trust
    property)

P is a set of ordered pairs of pages (p, q), p ≠ q,
from the sample X
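
Pairwise orderedness can be sketched as the fraction of pairs in P for which the trust function does not contradict the oracle. The sketch assumes the definition from the TrustRank paper: a pair is a violation if the page with the lower oracle value receives an equal or higher trust score. The page names and scores are hypothetical.

```python
from itertools import permutations


def pairord(trust, oracle, pairs):
    """Fraction of ordered pairs (p, q) that the trust scores order correctly."""
    violations = 0
    for p, q in pairs:
        if trust[p] >= trust[q] and oracle[p] < oracle[q]:
            violations += 1
        elif trust[p] <= trust[q] and oracle[p] > oracle[q]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)


oracle = {"a": 1, "b": 0, "c": 0}               # only page a is good
pairs = list(permutations(oracle, 2))           # all ordered pairs with p != q
consistent = pairord({"a": 0.9, "b": 0.2, "c": 0.6}, oracle, pairs)
one_mistake = pairord({"a": 0.9, "b": 0.95, "c": 0.6}, oracle, pairs)
```

In the second call, the bad page b outranks the good page a, which is counted once for (a, b) and once for (b, a).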
17
Evaluation Metrics(2)
  • Precision
  • the fraction of good pages among all pages in X
    that have a trust score above the threshold δ
  • Recall
  • the ratio between the number of good pages with a
    trust score above δ and the total number of
    good pages in X
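
Both metrics can be computed directly from the trust scores and oracle values of the sample. The following is a minimal sketch with hypothetical page names, scores, and threshold; returning 0.0 when a denominator is empty is a choice made here, not something the slides specify.

```python
def precision_recall(trust, oracle, delta):
    """Precision and recall of the rule 'trust score above delta means good'."""
    above = [p for p in trust if trust[p] > delta]
    good_above = [p for p in above if oracle[p] == 1]
    good = [p for p in trust if oracle[p] == 1]
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / len(good) if good else 0.0
    return precision, recall


trust = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
oracle = {"a": 1, "b": 0, "c": 1, "d": 0}
precision, recall = precision_recall(trust, oracle, delta=0.5)
```

With δ = 0.5, pages a and b are above the threshold but only a is good, and the good page c falls below it, so both precision and recall are 1/2.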

18
Experiments
  • Data set: 31,003,946 sites, grouped using a
    proprietary algorithm from AltaVista.
  • Seed set: inverse PageRank
  • Manually evaluated the top 1250 seeds
  • Obtained 178 sites as good seeds
  • Evaluation sample
  • Used 748 of the sample sites to evaluate
    TrustRank
  • Reputable: 563; web organization: 37;
    advertisement: 13; spam: 135

19
Results
  • TrustRank
  • 178 good seeds
  • PageRank
  • Use the PageRank of site a as the value of T(a)
  • Ignorant Trust
  • All sites are assigned an ignorant trust score of ½,
    except for the 1250 seeds

20
Results
  • PageRank vs. TrustRank

Good Sites in PageRank buckets
Good Sites in TrustRank buckets
Bad Sites in PageRank buckets
Bad Sites in TrustRank buckets
21
Results
  • Pairwise Orderedness

22
Results
  • Precision and Recall

23
Open issues after TrustRank
  • Big sites monopolize the search results
  • e.g. wiki
  • Reputable does not mean relevant
  • How to stop other types of spam
  • Image
  • Based on the content of the page in which the
    image appears
  • Video
  • Music
