Title: Forschungszentrum L3S
1 Site Level Noise Removal for Search Engines
Written by Andre Carvalho, Paul Alexandru Chirita, Edleno de Moura, Pavel Calado, Wolfgang Nejdl
2 Outline
- Motivation
- Noisy Web Structures at the Site Level
- Mutual Site Reinforcement Relations
- Abnormal Site Support
- Site Level Link Alliances
- Experiments
- Conclusions
4 Motivation
- Web spam has become an industry
  - All previous approaches to link spam detection use various page level algorithms
  - Spammers are now creating more complex PageRank boosting graphs, across several sites
- There is more than just spam
  - Noisy links are hyperlinks created for any purpose other than expressing a vote for the target page
  - Each branch of a company has its own site, these sites being interconnected by a navigational structure
  - Replicated sites pointing towards the newer replica
6 Mutual Site Reinforcement
- Characterized by many link exchanges between two or more sites (see figure)
- Several counting approaches are possible
  - Only the bidirectional link exchanges (3 in the figure)
  - Link density, i.e., all links between sites, as in a unidirectional counting (9 in the figure)
  - Minimum number of inter-site links S → S' and S' → S (4 in the figure, work in progress)
7 Mutual Site Reinforcement and Ranking
- Detection algorithm
  - For each page p of site S, take every out-neighbor q of p that belongs to a site S' other than S
  - If p is also an out-neighbor of q, then increase BMSR (bidirectional link)
  - Increase UMSR always (unidirectional link)
- Inclusion in PageRank is then straightforward
  - If the amount of link exchanges of either type exceeds a given threshold, then remove all links between S and S'
  - Another solution would be to downgrade these links instead (not tested)
- Thresholds
  - 250 for UMSR (tried several values between 10 and 300)
  - 2 for BMSR (tried up to 4), which shows that incorrect votes are usually pair-wise
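The detection and removal steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, sites are taken to be URL host names, each reciprocal page pair is assumed to add one to BMSR, and every inter-site link in either direction is assumed to add one to UMSR.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Site = host name part of the URL (assumed site definition)."""
    return urlsplit(url).hostname


def msr_counts(links):
    """Return (umsr, bmsr) counters keyed by unordered site pair.

    links: iterable of (source_url, target_url) page-level hyperlinks.
    """
    link_set = set(links)
    umsr = defaultdict(int)
    bmsr = defaultdict(int)
    for p, q in link_set:
        sp, sq = site_of(p), site_of(q)
        if sp == sq:
            continue  # intra-site links are not link exchanges
        pair = frozenset((sp, sq))
        umsr[pair] += 1  # every inter-site link counts
        # Assumption: count each reciprocal page pair once, not twice.
        if p < q and (q, p) in link_set:
            bmsr[pair] += 1
    return umsr, bmsr


def remove_msr_links(links, umsr_threshold=250, bmsr_threshold=2):
    """Drop all links between two sites whose exchanges exceed a threshold."""
    links = list(links)
    umsr, bmsr = msr_counts(links)
    kept = []
    for p, q in links:
        pair = frozenset((site_of(p), site_of(q)))
        noisy = (umsr.get(pair, 0) > umsr_threshold
                 or bmsr.get(pair, 0) > bmsr_threshold)
        if not noisy:
            kept.append((p, q))
    return kept
```

The filtered link list would then be passed to an ordinary PageRank computation. The default thresholds are the values reported on the slide.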
9 Abnormal Site Support
- Based on the idea that
  - For a site S, there should not be another site S' whose number of links towards S is above a certain percentage of the total number of links S receives overall
  - Best threshold was 2% (tried values ranging from 0.5% up to 20%)
- Algorithm
  - For all pairs of sites (S, S'), calculate
    - Support = Links(S' → S) / In-links(S)
  - If Support > e (the threshold), then remove all links S' → S
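The support ratio can be computed as in the short sketch below. The function name is hypothetical, and two points are assumptions: In-links(S) is taken to count inter-site links only, and the threshold is read as a fraction (0.02 for 2 percent).

```python
from collections import Counter


def slabs_filter(site_links, threshold=0.02):
    """Remove links from any site that supplies too large a share of
    another site's in-links.

    site_links: iterable of (source_site, target_site), one entry per
    page-level hyperlink, already mapped to site names.
    """
    # Assumption: only inter-site links count towards In-links(S).
    inter = [(s, t) for s, t in site_links if s != t]
    in_links = Counter(t for _, t in inter)   # In-links(S)
    pair_links = Counter(inter)               # Links(S' -> S)
    return [(s, t) for s, t in inter
            if pair_links[(s, t)] / in_links[t] <= threshold]
```

For example, a site that receives one link from each of 100 sites and 10 links from a single further site keeps the 100 single links and loses the 10, since 10/110 is above the threshold.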
11 Site Level Link Alliances
- Most complex and costly structures
  - If a page p has in-links from pages i1, i2, ..., in, and these latter pages are highly connected,
  - then they are suspected of being part of a structure which could deceive popularity ranking algorithms
  - It is a site level approach only in the sense that the in-linking pages usually belong to different sites
- Solution
  - Each page is downgraded according to its susceptivity of being part of such a malicious structure
12 Site Level Link Alliances and Susceptivity
- First step involves computing the maliciousness susceptivity of each page p
  - Let Tot be the number of out-links of all pages q linking to p
  - and TotIn be the number of those out-links that point to some other page linking to p
  - Susceptivity is then the ratio between TotIn and Tot
  - For the page p in the slide's example, this is 22 / 26
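A minimal sketch of the susceptivity ratio, assuming the link graph is available as a dictionary from each page to the set of pages it links to. The function name and the graph representation are illustrative choices, and the link q → p itself is assumed to count towards Tot.

```python
def susceptivity(p, out_links):
    """Return TotIn / Tot for page p.

    out_links: dict mapping each page to the set of pages it links to.
    """
    # Pages q that link to p.
    in_neighbors = {q for q, targets in out_links.items()
                    if p in targets and q != p}
    # Tot: all out-links of the pages linking to p.
    tot = sum(len(out_links[q]) for q in in_neighbors)
    # TotIn: those out-links that point to another page linking to p.
    tot_in = sum(len((out_links[q] & in_neighbors) - {q})
                 for q in in_neighbors)
    return tot_in / tot if tot else 0.0


graph = {
    "a": {"p", "b", "c"},
    "b": {"p", "a"},
    "c": {"p", "x"},
    "p": set(),
    "x": set(),
}
score = susceptivity("p", graph)  # Tot = 7, TotIn = 3
```

Here pages a, b and c link to p with 7 out-links in total, of which 3 point to one another, so the ratio is 3/7.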
13 Site Level Link Alliances and Ranking
- Simply removing all links to pages with a high susceptivity would be too harsh on them (they would be practically classified as spam)
- Use a downgrading method instead
15 Experimental Setup: Dataset
- TodoBR search engine, http://www.todobr.com.br/
  - About 12 M pages, query log with about 11 M entries
- Extracted queries from the log, as follows
  - Bookmark queries, seeking a specific page (100)
  - Topic queries, seeking information on a given topic (60)
- Each set of queries contained two equal subsets
  - Popular queries, i.e., the most frequent: show how many of the top sites have been downgraded by our algorithms
  - Randomly selected queries: show the impact on pages with a low PageRank
16 Experimental Setup: Evaluation
- 14 undergraduate and graduate CS students
- Bookmark queries evaluated using MRR
  - and MEANPOS, i.e., the average position of the first relevant answer
  - (more sensitive to low-rank relevant results, e.g., places 15 and 40)
- Topic queries used P@5, P@10, and MAP; also, results were categorized into non-relevant (0), relevant (1), and highly relevant (2)
- Web sites simply defined as the host name part of the URL
- Fine-tuned the thresholds using the MRR experiments
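For reference, the evaluation measures named above can be written down as in the sketch below. It uses the standard textbook definitions with binary relevance. The function names are illustrative, and skipping queries with no relevant answer in MEANPOS is an assumption, since the slides do not say how such queries were handled.

```python
def first_relevant_position(ranking, relevant):
    """1-based position of the first relevant result, or None."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return position
    return None


def mrr(runs):
    """Mean reciprocal rank over (ranking, relevant_set) pairs."""
    total = 0.0
    for ranking, relevant in runs:
        pos = first_relevant_position(ranking, relevant)
        total += 1.0 / pos if pos else 0.0
    return total / len(runs)


def meanpos(runs):
    """Average position of the first relevant answer.

    Assumption: queries with no relevant answer are skipped.
    """
    positions = [first_relevant_position(r, rel) for r, rel in runs]
    found = [pos for pos in positions if pos is not None]
    return sum(found) / len(found)


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant result."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranking, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A first relevant answer at place 40 instead of 15 changes the reciprocal rank only from about 0.067 to 0.025, but changes the position by 25, which is why MEANPOS reacts more strongly to low-ranked relevant results.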
17 Overall Results, Bookmark Queries
18 Overall Results, Topic Queries
19 Practical Issues
- Amount of links removed (where applicable)
  - SLLA + SLAbS could thus be preferred to increase computation speed, as BMSR does not remove too many links
- Let V be the number of nodes, E the number of edges, M the average out-degree, and P the average in-degree. Then, the complexities are
  - O(E) for UMSR and O(V·M·log(M)) for BMSR
  - O(V·P) for SLAbS
  - O(V·M²·log(M)) for SLLA
- All algorithms are trivially parallelizable
21 Conclusions and Further Work
- First approach to detect link noise at the site level
- Three algorithms
  - Mutual Site Reinforcement Relations
  - Site Level Abnormal Support
  - Site Level Link Alliances
- Improvement of
  - 26.98% in MRR for popular bookmark queries,
  - 20.92% for randomly selected bookmark queries,
  - and up to 59.16% in Mean Average Precision for topic queries
- Identified up to 16.7% of the links from our collection as noisy
- To do
  - Try more complex site definitions, as well as new algorithms
  - Automatic parameter tuning (hard)
22 Thank You!