Forschungszentrum L3S

1
Site Level Noise Removal for Search Engines
Written by Andre Carvalho, Paul Alexandru Chirita,
Edleno de Moura, Pavel Calado, Wolfgang Nejdl
2
Outline
  • Motivation
  • Noisy Web Structures at the Site Level
  • Mutual Site Reinforcement Relations
  • Abnormal Site Support
  • Site Level Link Alliances
  • Experiments
  • Conclusions

4
Motivation
  • Web spam has become an industry
  • All previous approaches to link spam detection
    use various page level algorithms
  • Spammers are now creating more complex PageRank
    boosting graphs that span several sites
  • There is more than just spam
  • Noisy links are hyperlinks created for any
    purpose other than expressing a vote for the
    target page, for example
      • Each branch of a company has its own site, and
        these sites are interconnected by a
        navigational structure
      • Replicated sites point towards the newer
        replica

6
Mutual Site Reinforcement
  • Characterized by many link exchanges between two
    or more sites (see figure)
  • Several counting approaches are possible
      • Only the bidirectional link exchanges (3)
      • Link density, i.e., all links between the sites,
        counted unidirectionally (9)
      • Minimum number of inter-site links S → S' and
        S' → S (4, work in progress)

7
Mutual Site Reinforcement and Ranking
  • Detection algorithm
      • For each page p from site S, take all pages q
        from a site S' other than S that are
        out-neighbors of p
      • If p is also an out-neighbor of q, then increase
        BMSR (bidirectional link)
      • Always increase UMSR (unidirectional link)
  • Inclusion in PageRank is then straightforward
      • If the number of link exchanges of either type
        exceeds a given threshold, then remove all links
        between S and S'
      • Another solution would be to downgrade these
        links instead (not tested)
  • Thresholds
      • 250 for UMSR (tried several values between 10
        and 300)
      • 2 for BMSR (tried up to 4), which shows that
        incorrect votes are usually pair-wise
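
The detection and removal steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of the graph as (source URL, target URL) pairs, and the reading of "exceeds" as a strict comparison are all assumptions. Sites are taken to be host names, as stated later in the evaluation setup.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Site = host name part of the URL (assumed site definition)."""
    return urlsplit(url).hostname


def msr_counts(links):
    """Count UMSR and BMSR link exchanges for every pair of distinct sites.

    `links` is an iterable of (source_url, target_url) page-level links.
    Returns two dicts keyed by frozenset({site_a, site_b}).
    """
    link_set = set(links)
    umsr = defaultdict(int)
    bmsr = defaultdict(int)
    for p, q in link_set:
        s, t = site_of(p), site_of(q)
        if s == t:
            continue  # links inside one site are not exchanges
        pair = frozenset((s, t))
        umsr[pair] += 1  # every inter-site link counts (link density)
        # Count each reciprocal page pair once, from its smaller endpoint.
        if p < q and (q, p) in link_set:
            bmsr[pair] += 1
    return dict(umsr), dict(bmsr)


def remove_msr_links(links, umsr_threshold=250, bmsr_threshold=2):
    """Drop all links between site pairs whose exchange count exceeds a threshold."""
    links = list(links)
    umsr, bmsr = msr_counts(links)
    noisy = {pair for pair, n in umsr.items() if n > umsr_threshold}
    noisy |= {pair for pair, n in bmsr.items() if n > bmsr_threshold}
    return [(p, q) for p, q in links
            if frozenset((site_of(p), site_of(q))) not in noisy]
```

The filtered link list would then be passed to an ordinary PageRank computation; the default thresholds are the values reported on the slide.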

9
Abnormal Site Support
  • Based on the idea that
      • For a site S, there should not be another site
        S' whose number of links towards S is above a
        certain percentage of the total number of links
        S receives overall
      • Best threshold was 2% (tried values ranging
        from 0.5% up to 20%)
  • Algorithm
      • For all pairs of sites (S, S'), calculate
        Support = Links(S' → S) / In-links(S)
      • If Support > ε (the threshold), then remove all
        links S' → S
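
A minimal Python sketch of this support test is given below. It is illustrative only: the function names and the edge-list input are assumptions, and so is the choice to count only links arriving from other sites in In-links(S), which the slide does not specify.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Site = host name part of the URL (assumed site definition)."""
    return urlsplit(url).hostname


def remove_abnormal_support(links, threshold=0.02):
    """Remove all links S' -> S whenever S' supplies too large a share of S's in-links.

    `links` is an iterable of (source_url, target_url) page-level links.
    `threshold` is a fraction, so 0.02 corresponds to 2 percent.
    """
    links = list(links)
    inter = Counter()     # (source_site, target_site) -> number of links
    in_links = Counter()  # target_site -> links received from other sites
    for p, q in links:
        s, t = site_of(p), site_of(q)
        if s != t:
            inter[(s, t)] += 1
            in_links[t] += 1
    # Support(S', S) = Links(S' -> S) / In-links(S)
    noisy = {(s, t) for (s, t), n in inter.items()
             if n / in_links[t] > threshold}
    return [(p, q) for p, q in links
            if (site_of(p), site_of(q)) not in noisy]
```

Under this reading, a low threshold is only meaningful for sites with many in-links, since a site with few in-linking sites gives each of them a large share by construction.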

11
Site Level Link Alliances
  • Most complex and costly structures
  • If a page p has in-links from pages
    i1, i2, ..., in, and these latter pages are
    highly connected, then they are suspected of
    being part of a structure which could deceive
    popularity ranking algorithms
  • It is a site level approach only in the sense
    that the in-linking pages usually belong to
    different sites
  • Solution
      • Each page is downgraded according to its
        susceptivity of being part of such a malicious
        structure

12
Site Level Link Alliances and Susceptivity
  • First step involves computing the Maliciousness
    Susceptivity of each page p
  • Let Tot be the number of out-links of all pages q
    linking to p,
  • and let TotIn be the number of those out-links
    that point to some other page linking to p
  • Susceptivity is then the ratio between TotIn and
    Tot
  • For p this is 22 / 26
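
The ratio defined above can be sketched in Python as follows. This is one possible reading of the slide: the function name and the adjacency-dict input are assumptions, and links from an in-neighbor to p itself are counted in Tot but not in TotIn.

```python
def susceptivity(p, out_links):
    """Maliciousness susceptivity of page p as TotIn / Tot.

    `out_links` maps each page to the set of pages it links to.
    """
    # Pages q that link to p (self-links are ignored).
    in_neighbors = {q for q, targets in out_links.items()
                    if p in targets and q != p}
    # Tot: all out-links of the pages linking to p.
    tot = sum(len(out_links[q]) for q in in_neighbors)
    # TotIn: those out-links that point to another page linking to p.
    tot_in = sum(1 for q in in_neighbors
                 for r in out_links[q]
                 if r in in_neighbors and r != q)
    return tot_in / tot if tot else 0.0
```

A value close to 1 means that the pages pointing to p mostly point to each other, which is the pattern the slide describes as suspicious.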

13
Site Level Link Alliances and Ranking
  • Simply removing all links to pages with a high
    susceptivity would be too harsh on them (they
    would be practically classified as spam)
  • Use a downgrading method instead

15
Experimental Setup: Dataset
  • TodoBR search engine, http://www.todobr.com.br/
  • About 12 M pages, query log with about 11 M
    entries
  • Extracted queries from the log, as follows
      • Bookmark queries, seeking a specific page
        (100)
      • Topic queries, seeking information on a given
        topic (60)
  • Each set of queries contained two equal subsets
      • Popular queries, i.e., the most frequent ones:
        show how many of the top sites have been
        downgraded by our algorithms
      • Randomly selected queries: show the impact on
        pages with a low PageRank

16
Experimental Setup: Evaluation
  • 14 undergraduate and graduate CS students
  • Bookmark queries evaluated using MRR and MEANPOS,
    i.e., the average position of the first relevant
    answer (more sensitive to low-ranked relevant
    results, e.g., places 15 and 40)
  • Topic queries used P@5, P@10, and MAP; also,
    results were categorized into non-relevant (0),
    relevant (1), and highly relevant (2)
  • Web sites simply defined as the host name part of
    the URL
  • Fine-tuned the thresholds using the MRR
    experiments
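
For reference, the evaluation measures named above can be sketched in Python as follows. The function names are illustrative, and two choices are assumptions rather than details from the slides: a result counts as relevant when its grade is above 0, and queries with no relevant answer are skipped by MEANPOS but contribute 0 to MRR.

```python
def first_relevant_position(grades):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, grade in enumerate(grades, start=1):
        if grade > 0:
            return rank
    return None


def mrr(runs):
    """Mean Reciprocal Rank over a list of ranked result lists."""
    total = 0.0
    for grades in runs:
        pos = first_relevant_position(grades)
        total += 1.0 / pos if pos else 0.0
    return total / len(runs)


def mean_pos(runs):
    """MEANPOS: average position of the first relevant answer."""
    positions = [pos for pos in map(first_relevant_position, runs) if pos]
    return sum(positions) / len(positions)


def precision_at_k(grades, k):
    """P@k: fraction of the top k results that are relevant."""
    return sum(1 for grade in grades[:k] if grade > 0) / k
```

Because MEANPOS averages raw positions, a first relevant answer at place 40 moves it far more than it moves MRR, which matches the slide's remark about low-ranked results.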

17
Overall Results, Bookmark Queries
18
Overall Results, Topic Queries
19
Practical Issues
  • Amount of links removed (where applicable)
  • SLLA + SLAbS could thus be preferred to increase
    computation speed, as BMSR does not remove too
    many links
  • Let V be the number of nodes, E the number of
    edges, M the average out-degree, and P the
    average in-degree. Then, the complexities are
      • O(E) for UMSR and O(V M log M) for BMSR
      • O(V P) for SLAbS
      • O(V M² log M) for SLLA
  • All algorithms are trivially parallelizable

21
Conclusions and Further Work
  • First approach to detect link noise at the site
    level
  • Three algorithms
      • Mutual Site Reinforcement Relations
      • Site Level Abnormal Support
      • Site Level Link Alliances
  • Improvement of
      • 26.98% in MRR for popular bookmark queries,
      • 20.92% for randomly selected bookmark queries,
      • and up to 59.16% in Mean Average Precision for
        topic queries
  • Identified up to 16.7% of the links from our
    collection as noisy
  • To do
      • Try more complex site definitions, as well as
        new algorithms
      • Automatic parameter tuning (hard)

22
Thank You!