Title: Link Analysis, Web Structure Mining and Web communities
An overview of methods and results
Contents
- PageRank: ranking all pages
- HITS: ranking topic-relevant pages
- Community Identification Algorithm: identifying
web communities
1. PageRank
- Assumptions:
- A page with many links to it is more likely to be
useful than one with few links to it
- The links from a page that is itself the target
of many links are likely to be particularly
important
Example
[Figure: link graph with two labelled pages, X and Y]
X seems to be the most important page since 2
important pages link to it
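Simple voting model: round 1
[Figure: four-node link graph; votes per node: 1, 1, 1, 1]
Simple voting model: round 2
[Figure: four-node link graph; votes per node: 1.5, 1, 0, 1.5]
Simple voting model: round 3
[Figure: four-node link graph; votes per node: 2, 0, 0, 2]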
Revised voting model: round 1
[Figure: four-node link graph; votes per node: 1, 1, 1, 1]
- Allocate 1 vote to each node after each voting
round
- Remove votes from leaf nodes
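Revised voting model: round 2
[Figure: four-node link graph; votes per node: 1.5, 2, 1, 1.5]
Revised voting model: round 3
[Figure: four-node link graph; votes per node: 2, 2, 1, 2]
The middle node has only one link to it, but the
node that links to it does not share its votes with
other nodes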
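The two voting models can be replayed in a few lines of Python. This is a minimal sketch, not the slides' own code: the graph (A links to B, B links to C and D) is an assumption inferred from the vote counts shown in the figures, and the function names are illustrative.

```python
# Assumed graph, inferred from the vote counts on the slides:
# A -> B, B -> C, B -> D. C and D are leaf nodes (no outgoing links).
LINKS = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}


def simple_round(votes, links):
    """Simple model: each node splits its votes among its targets;
    a leaf node has nobody to vote for, so it keeps its votes."""
    new_votes = {node: 0.0 for node in links}
    for node, targets in links.items():
        if targets:
            for target in targets:
                new_votes[target] += votes[node] / len(targets)
        else:
            new_votes[node] += votes[node]
    return new_votes


def revised_round(votes, links):
    """Revised model: every node is allocated 1 fresh vote each round,
    and the votes held by leaf nodes are removed instead of kept."""
    new_votes = {node: 1.0 for node in links}
    for node, targets in links.items():
        for target in targets:
            new_votes[target] += votes[node] / len(targets)
    return new_votes


if __name__ == "__main__":
    votes = {node: 1.0 for node in LINKS}
    for round_number in (2, 3):
        votes = revised_round(votes, LINKS)
        print("revised model, round", round_number, votes)
```

Starting from one vote per node, two calls to either function give the round 2 and round 3 values in the figures, under the assumed graph.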
Revised voting model: cycling problem
[Figure: three-node link graph; votes per node: 1, 1, 1]
PageRank
- Use a proportion of each vote, redistribute the rest
- If the proportion is less than 1 then no cycling
will occur
- Voting can also be performed by a matrix
- Find the votes from the principal left eigenvector
of the matrix
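The matrix view can be sketched as follows. This is a hedged illustration rather than the method of any particular search engine. It assumes the four-node graph A -> B, B -> C, B -> D, a proportion of 0.8, and that a leaf node spreads all of its votes evenly over every node. The names `transition_matrix`, `left_multiply` and `principal_left_eigenvector` are made up for this example.

```python
def transition_matrix(links, order, keep=0.8):
    """Row i holds the fraction of node i's votes that reaches each node."""
    n = len(order)
    index = {name: i for i, name in enumerate(order)}
    matrix = []
    for name in order:
        targets = links[name]
        if targets:
            # (1 - keep) of the vote is spread evenly over all nodes ...
            row = [(1.0 - keep) / n] * n
            # ... and the kept proportion is split among the link targets.
            for target in targets:
                row[index[target]] += keep / len(targets)
        else:
            # Assumption: a leaf node spreads its whole vote evenly.
            row = [1.0 / n] * n
        matrix.append(row)
    return matrix


def left_multiply(vector, matrix):
    """Compute the row vector v * M, i.e. one round of voting."""
    n = len(vector)
    return [sum(vector[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def principal_left_eigenvector(matrix, total, tol=1e-12, max_iter=1000):
    """Repeat the voting round until the votes stop changing."""
    n = len(matrix)
    votes = [total / n] * n
    for _ in range(max_iter):
        new_votes = left_multiply(votes, matrix)
        if max(abs(a - b) for a, b in zip(votes, new_votes)) < tol:
            return new_votes
        votes = new_votes
    return votes


if __name__ == "__main__":
    links = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}
    m = transition_matrix(links, ["A", "B", "C", "D"])
    print(principal_left_eigenvector(m, total=4.0))
```

Every row of the matrix sums to 1, so the total number of votes is preserved, and the stable vote vector v satisfies v = vM. That is what makes it a left eigenvector with eigenvalue 1.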
PageRank round 1, giving votes
[Figure: four-node link graph; each node holds 1 vote; votes passed along the links: 0.8, 0.4, 0.4]
- 4 votes in the system: allocate 80% of each vote,
redistribute 20% of each, plus the lost votes
from leaf nodes = 2.4 votes. Redistribute 2.4/4 =
0.6 extra votes to each node
PageRank round 2, receiving votes
[Figure: four-node link graph; votes per node after receiving: 0.6 + 0.4 = 1 (two nodes), 0.6 + 0.8 = 1.4, 0.6]
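PageRank round 3, giving votes
[Figure: four-node link graph; nodes hold 1, 1.4, 0.6 and 1 votes; votes passed along the links: 1.4 x 0.8 / 2 = 0.56 (twice), 0.6 x 0.8 = 0.48]
Lost votes: 0.6 x 0.2 + 1.4 x 0.2 + 1 + 1 = 2.4.
Redistribute 2.4/4 = 0.6 votes to each node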
PageRank round 3, receiving votes
[Figure: four-node link graph; votes per node after receiving: 0.6 + 0.56 = 1.16 (two nodes), 0.6 + 0.48 = 1.08, 0.6]
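The arithmetic of the giving and receiving rounds can be checked with a short script. This is a sketch under two assumptions: the graph is A -> B, B -> C, B -> D (inferred from the figures), and `pagerank_round` is an illustrative name, not part of any library.

```python
# Assumed graph, inferred from the figures: A -> B, B -> C, B -> D.
LINKS = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}


def pagerank_round(votes, links, keep=0.8):
    """One giving-and-receiving round.

    Non-leaf nodes pass `keep` of their votes along their links and
    lose the rest; leaf nodes lose all of their votes. The lost votes
    are then shared equally among all nodes.
    """
    received = {node: 0.0 for node in links}
    lost = 0.0
    for node, targets in links.items():
        if targets:
            for target in targets:
                received[target] += votes[node] * keep / len(targets)
            lost += votes[node] * (1.0 - keep)
        else:
            lost += votes[node]
    extra = lost / len(links)
    new_votes = {node: received[node] + extra for node in links}
    return new_votes, lost


if __name__ == "__main__":
    votes = {node: 1.0 for node in LINKS}
    for _ in range(2):
        votes, lost = pagerank_round(votes, LINKS)
        print("lost:", round(lost, 2), {n: round(v, 2) for n, v in votes.items()})
```

Under these assumptions both rounds lose 2.4 votes, so each node receives 0.6 extra votes, matching the totals in the figures.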
PageRank Summary
- The pages that get the highest PageRank are those
that are linked to by many pages or by important
pages
- Spammers try to exploit this by creating dummy
sites to link to their main sites
2. Kleinberg's HITS
- Uses link structures, but also uses page
content, to identify pages that are useful for a
coherent topic on the web
- An Authority is a page that is linked to by many
other pages from the same topic
- A Hub is a page that links to many pages from the
same topic
Hubs and authorities
[Figure: link graph with one authority (A) and two hubs (H)]
The HITS algorithm
- Another iterative algorithm
- Each page has a hub value and an authority value
- Unlike PageRank, it is topic-specific, and needs
to be recomputed for each user query
HITS 1: Finding the Base Set (simplified version)
- Use a text-based query to obtain the top t
matching pages
- Add all pages linked to or linking to the
matching pages
- Remove all links between pages within the same
site
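The three steps above can be sketched on an in-memory list of links. Everything here is illustrative: `build_base_set` is a hypothetical helper, the text query is assumed to have already produced the top t pages, and "same site" is approximated by "same hostname", which is only one possible reading.

```python
from urllib.parse import urlsplit


def build_base_set(matching_pages, links):
    """Return (pages, links) for the base set.

    matching_pages: URLs returned by the text query (the top t pages).
    links: iterable of (source_url, target_url) pairs.
    """
    links = list(links)
    matching = set(matching_pages)

    # Add every page that links to, or is linked from, a matching page.
    pages = set(matching)
    for source, target in links:
        if source in matching or target in matching:
            pages.add(source)
            pages.add(target)

    # Keep links between base-set pages, dropping links within one site.
    # Assumption: two pages are on the same site if their hostnames match.
    kept = [
        (source, target)
        for source, target in links
        if source in pages
        and target in pages
        and urlsplit(source).hostname != urlsplit(target).hostname
    ]
    return pages, kept
```

Links within a site are removed because they often serve navigation rather than endorsement, so they would inflate hub and authority values.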
HITS 2: Computing Hubs and Authorities
(simplified)
- (Initialising) Assign each page an equal
authority weight x and an equal hub weight y
- (Iterating 1) Add the hub weight of each page to
the authority value of each page it links to
- (Iterating 2) Add the authority weight of each
page to the hub value of each page that links to it
- Normalise and repeat until stable
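[Figure: four-page example, initial weights; every page has Hub 0.45, Authority 0.45]
[Figure: the same four pages after one round of updates; (Hub, Authority) per page: (0.45, 0.9), (1.35, 0.9), (0.9, 0.45), (0.45, 0.9)]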
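The simplified update can be sketched as below. Two points are assumptions. First, the graph A -> B, B -> C, B -> D is inferred from the hub and authority values in the four-page example. Second, the additive update follows the simplified description in these slides. Kleinberg's original formulation replaces each weight with the sum over its neighbours instead of adding to the old value. All function names are illustrative.

```python
import math

# Assumed graph, inferred from the example values: A -> B, B -> C, B -> D.
LINKS = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}


def hits_step(hub, auth, links):
    """One round of the simplified update, using the previous round's values."""
    new_hub = dict(hub)
    new_auth = dict(auth)
    for source, targets in links.items():
        for target in targets:
            new_auth[target] += hub[source]  # hub weight flows to the linked page
            new_hub[source] += auth[target]  # authority weight flows back to the linker
    return new_hub, new_auth


def normalise(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {name: value / norm for name, value in weights.items()}


def hits(links, rounds=50):
    """Run a fixed number of rounds; a stability check could replace this."""
    hub = normalise({name: 1.0 for name in links})
    auth = normalise({name: 1.0 for name in links})
    for _ in range(rounds):
        hub, auth = hits_step(hub, auth, links)
        hub, auth = normalise(hub), normalise(auth)
    return hub, auth


if __name__ == "__main__":
    hub, auth = hits(LINKS)
    print("hubs:", hub)
    print("authorities:", auth)
```

On this graph B ends up as the strongest hub and C and D as the strongest authorities, because B is the only page that links to more than one page.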
HITS 3: Computing Ranks
- Use Hub and Authority values but return a mixture
of the top hub values and top authority values
- This should avoid pages that would rank highly
for general reasons but are not authoritative for
the topic
3. The Community Identification Algorithm
- The Community Identification Algorithm operates
on the link structure of the Web alone
- It identifies communities: collections of pages
that tend to link amongst each other but do not
tend to link to pages outside of the community
- This is useful for topic identification
The CIA algorithm
- Is based upon the maximum flow algorithm
- Starts with one or more relevant pages (the seed
set)
- Partitions the web into two groups:
- Pages within the community of the seed set
- Pages outside of that community
- Works by creating an artificial network
- Is very complicated!
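A toy version of the idea can be written with a textbook maximum flow routine. This is a minimal sketch, not the published algorithm in full. It assumes an undirected link graph in which each link has capacity k. It adds an artificial source joined to the seed pages with unlimited capacity and an artificial sink joined to every non-seed page with capacity 1. The community is taken to be the pages still reachable from the source once the flow is saturated. The node names `_source` and `_sink` and the function name are hypothetical.

```python
from collections import defaultdict, deque


def max_flow_community(edges, seeds, k):
    """Return (community, flow) for an undirected graph given as (u, v) pairs."""
    source, sink = "_source", "_sink"  # artificial nodes, assumed unused as page names
    cap = defaultdict(lambda: defaultdict(int))
    pages = set()
    for u, v in edges:
        cap[u][v] += k
        cap[v][u] += k
        pages.update((u, v))
    for seed in seeds:
        cap[source][seed] = float("inf")
    for page in pages:
        if page not in seeds:
            cap[page][sink] = 1

    def reachable_from_source():
        # Breadth-first search along pipes that still have spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, spare in cap[u].items():
                if spare > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break
        # Find the narrowest pipe on the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Pages on the source side of the cut through the full pipes.
    community = set(reachable_from_source()) - {source}
    return community, flow


if __name__ == "__main__":
    triangle = [("a", "b"), ("a", "c"), ("b", "c")]
    bridge = [("c", "d")]
    clique = [("d", "e"), ("d", "f"), ("d", "g"), ("e", "f"), ("e", "g"), ("f", "g")]
    print(max_flow_community(triangle + bridge + clique, {"a"}, k=3))
```

In this toy graph the value of k controls how far the community spreads: a small k keeps only the seed, a moderate k keeps the seed's triangle, and a large k takes in the whole graph.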
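CIA Example: what is the community around the
given node?
[Figure: link graph with the initial community marked]
How much water can flow into the well for any
value of k?
[Figure: artificial network from the water tank to the well, with the initial community marked; pipe capacities shown: k, k, k and 1, 1, 1, 1]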
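How much water can flow into the well for k = 1?
[Figure: network from the water tank to the well; labels shown: 1, 1/1, 2, 1, 1, 1/1, 1, 1/1]
Cut through full pipes
How much water can flow into the well for k = 2?
[Figure: network from the water tank to the well; labels shown: 0/2, 2/2, 3, 1/2, 0/1, 1/1, 0/1, 1/1]
Cut through full pipes
How much water can flow into the well for k = 4?
[Figure: network from the water tank to the well; labels shown: 1/4, 3/4, 4, 1/4, 1/1, 1/1, 1/1, 1/1]
Cut through full pipes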
4. Link Algorithms - Overview
- The success of HITS and PageRank indicates the
importance of links as a new information source
- More needs to be known about patterns of linking
- But there is still little hard evidence that link
approaches work well for Web IR: academic papers
report unscientific experiments or inconclusive
results
References
- Brin, S., Page, L. (1998). The anatomy of a
large scale hypertextual Web search engine.
Computer Networks and ISDN Systems, 30(1-7),
107-117.
- Kleinberg, J. (1999). Authoritative sources in a
hyperlinked environment. Journal of the ACM,
46(5), 604-632.
- Flake, G. W., Lawrence, S., Giles, C. L.,
Coetzee, F. M. (2002). Self-organization and
identification of Web communities. IEEE Computer,
35, 66-71.