Title: Hyperlink Analysis
1Hyperlink Analysis
2Overview of This Talk
- Introduction to Hyperlink Analysis
- Classification of Hyperlink Analysis
- Two sub-topics
- Measures and Metrics
- Interesting Web Structures
3Definition of Hyperlink Analysis
- Hyperlink Analysis can be defined as an area of
Web Information Retrieval using the hyperlink
structure of the Web.
4Motivation
- Hyperlinks serve two main purposes.
- Pure navigation.
- Pointing to pages with authority on the same topic
as the page containing the link.
- This can be used to retrieve useful information
from the Web.
- a set of ideas or statements supporting a
topic
5What Information Can Be Retrieved?
- Quality of Web page.
- - The authority of a page on a topic.
- - Ranking of Web pages.
- Interesting Web structures.
- - Graph patterns such as co-citation, social choice,
complete bipartite graphs, etc.
- Web page classification.
- - Classifying Web pages according to various
topics.
6What Information Can Be Retrieved? (Cont)
- Which pages to crawl.
- - Deciding which Web pages to add to the
collection of Web pages.
- Finding related pages.
- - Given one relevant page, find all related
pages.
- Detection of duplicated pages.
- - Detection of near-mirror sites to eliminate
duplication.
7Classification of Hyperlink Analysis Research
- Hyperlink Analysis
- - Measures and Metrics
- - Interesting Web Structures
- - Web Page Classification
- - Web Search
(Still needs to be refined. Suggestions welcome.)
8Measures/metrics
- Standards for measuring properties of a page or a
Web structure.
- Quality of a page.
- Distance between pages.
- Web page reputation.
9PageRank Citation Ranking [1]
- Aim
- - Ranking metric for hypertext documents.
- Approach
- - A page has a high rank if the sum of the ranks of
its backlinks is high.
10Authoritative Sources in a Hyperlinked Environment [3]
- Aim
- - Determining the relative authority of pages.
- Approach
- - A good authority page is one pointed to by many
good hubs.
- - A good hub page is one that points to many good
authorities.
- Results
- - Efficient when the query topic is sufficiently
broad.
- Benefits
- - Locating dense bipartite communities.
11Does Authority Mean Quality? [4]
- Aim.
- - Are any metrics we compute for Web documents good
predictors of document quality?
- Approach.
- - Do experts agree in their quality judgments?
- - Are different link-based metrics different?
- - Metrics compared: in-degree, PageRank and Authority.
- - Can we predict human quality judgments?
- - Compute correlations between each pair of
metrics and also compare them with expert judgments.
12Does Authority Mean Quality? [4]
- Results.
- - Experts agree on the nature of quality within a
topic.
- - No significant difference between link-based
metrics.
- - In-degree performed as well as PageRank and Authority.
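The comparison described in this study, computing correlations between pairs of link-based metrics and against expert judgments, can be illustrated with a small sketch. The slides do not say which correlation coefficient was used, so Spearman rank correlation is an assumption here, and the graph, the function names, and the `expert_scores` values are all hypothetical.

```python
import math

def in_degree(links):
    """Count incoming links per page; `links` maps a page to the pages it links to."""
    counts = {p: 0 for p in links}
    for targets in links.values():
        for p in targets:
            counts[p] = counts.get(p, 0) + 1
    return counts

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical toy graph and made-up expert scores, for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
degrees = in_degree(links)
pages = sorted(degrees)
expert_scores = {"a": 3.0, "b": 2.0, "c": 5.0, "d": 1.0}
rho = spearman([degrees[p] for p in pages], [expert_scores[p] for p in pages])
```

A value of `rho` close to 1 would mean the in-degree ranking orders pages much as the (hypothetical) experts do.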
13Web Page Reputations [5]
- Aim.
- - Input: a URL. Output: a ranked set of topics for
which the page has a reputation.
- Approach.
- - A page can acquire a high reputation on a topic
because the page is pointed to by many pages on
that topic, or because the page is pointed to by
some high-reputation pages on that topic.
- - A page is deemed an authority on a topic if it is
pointed to by good hubs on the topic, and a good
hub is one that points to good authorities.
14One-level Influence Propagation
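- The reputation of page p on topic t is the
probability that a random surfer looking for
topic t will visit page p.
- At each step
- - with probability d > 0, jump to a random page that
contains term t, or
- - with probability (1 - d), follow a random link from
the current page.
- With N_t the number of pages containing term t and
O(q) the number of outgoing links of page q:
$$R(p,t) = \begin{cases} \dfrac{d}{N_t} + (1-d)\sum_{q \to p} \dfrac{R(q,t)}{O(q)} & \text{if term } t \text{ appears in page } p \\[1ex] (1-d)\sum_{q \to p} \dfrac{R(q,t)}{O(q)} & \text{otherwise} \end{cases}$$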
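The random-surfer description on this slide can be sketched as a fixed-point iteration. This is a minimal sketch under stated assumptions, not the implementation from [5]: the jump is assumed to land uniformly on pages that contain the term, and the `links` dictionary, the function name `one_level_reputation`, the default `d = 0.15`, and the iteration count are all chosen for illustration.

```python
def one_level_reputation(links, pages_with_term, d=0.15, iterations=50):
    """Approximate the one-level reputation of every page on a single term.

    links: dict mapping each page to the list of pages it links to.
    pages_with_term: set of pages in which the term appears (assumed non-empty).
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n_t = len(pages_with_term)
    # Start the surfer on a uniformly chosen page that contains the term.
    rank = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in pages}
    for _ in range(iterations):
        # Jump term: only pages containing the term can be jump targets.
        new_rank = {p: (d / n_t if p in pages_with_term else 0.0) for p in pages}
        # Link-following term: each page passes (1 - d) of its rank to its targets.
        # Pages without outgoing links simply drop that share in this sketch.
        for q, targets in links.items():
            if targets:
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph; the term appears in pages "a" and "b".
example = one_level_reputation({"a": ["c"], "b": ["c"], "c": ["a"]}, {"a", "b"})
```

In this toy graph page "c" does not contain the term, yet it still earns a reputation because both term pages link to it.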
15Two Level Influence Propagation
- At each step
- - with probability d > 0, jump to a random page that
contains term t, or
- - with probability (1 - d), follow a random link
forward/backward from the current page,
alternating directions.
- The authority reputation of a page p on a topic t is
the probability that a random surfer looking for
topic t makes a forward visit to page p.
- The hub reputation of a page p on a topic t is the
probability that a random surfer looking for
topic t makes a backward visit to page p.
16Two Level Influence Propagation
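A(p,t): probability of a forward visit to page p
when searching for term t (the authority rank of
page p on term t).
$$A(p,t) = \begin{cases} \dfrac{d}{2N_t} + (1-d)\sum_{q \to p} \dfrac{H(q,t)}{O(q)} & \text{if term } t \text{ appears in page } p \\[1ex] (1-d)\sum_{q \to p} \dfrac{H(q,t)}{O(q)} & \text{otherwise} \end{cases}$$
H(p,t): probability of a backward visit to page
p when searching for term t (the hub rank of page p
on term t).
$$H(p,t) = \begin{cases} \dfrac{d}{2N_t} + (1-d)\sum_{p \to q} \dfrac{A(q,t)}{I(q)} & \text{if term } t \text{ appears in page } p \\[1ex] (1-d)\sum_{p \to q} \dfrac{A(q,t)}{I(q)} & \text{otherwise} \end{cases}$$
Here N_t is the number of pages containing term t, O(q) is the
number of outgoing links of page q, and I(q) is the number of
incoming links of page q.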
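The two-level surfer, who alternates between following links forward and backward, can be sketched in the same style. This is an illustrative sketch rather than the method of [5] as published: it assumes that a jump is equally likely to count as a forward or a backward visit (hence the factor 2 in the jump term), and the function name, the `links` dictionary, and the defaults are hypothetical.

```python
def two_level_reputation(links, pages_with_term, d=0.15, iterations=50):
    """Return (authority, hub) reputation dicts for a single term.

    links: dict mapping each page to the list of pages it links to.
    pages_with_term: set of pages in which the term appears (assumed non-empty).
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    out_links = {p: list(links.get(p, [])) for p in pages}
    in_links = {p: [] for p in pages}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append(q)
    n_t = len(pages_with_term)
    # Assumption: half of the jump probability feeds each visit direction.
    jump = {p: (d / (2 * n_t) if p in pages_with_term else 0.0) for p in pages}
    start = {p: (1.0 / (2 * n_t) if p in pages_with_term else 0.0) for p in pages}
    authority, hub = dict(start), dict(start)
    for _ in range(iterations):
        new_authority, new_hub = dict(jump), dict(jump)
        for q in pages:
            # Forward visit to p: the surfer was on q (backward visit) and follows q -> p.
            for p in out_links[q]:
                new_authority[p] += (1 - d) * hub[q] / len(out_links[q])
            # Backward visit to p: the surfer was on q (forward visit) and walks p -> q in reverse.
            for p in in_links[q]:
                new_hub[p] += (1 - d) * authority[q] / len(in_links[q])
        authority, hub = new_authority, new_hub
    return authority, hub

# Hypothetical graph: two hub-like pages pointing at two authority-like pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hubs = two_level_reputation(toy_links, {"h1", "h2", "a1", "a2"})
```

In the toy graph, "a1" receives links from both hubs and so ends up with the highest authority reputation, while "h1" links to both authorities and ends up with the highest hub reputation.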
17Factors Affecting Page Reputation
- How well a topic is represented.
- How well pages on a topic are connected.
18Link Analysis and Stability [6]
- Aim.
- - When to expect stable rankings under small
perturbations to hyperlink patterns.
- Approach.
- - The eigengap directly affects the stability of
eigenvectors in the HITS algorithm.
- - Coupled Markov chain theory(?).
- - As long as the perturbed Web pages did not have high
overall PageRank scores, the perturbed
PageRank scores will not be far from the
original.
- Result.
- - HITS: unstable. PageRank: stable.
19Stable Algorithms [7]
- Aim
- - Stable link analysis methods.
- Approach
- - Randomized HITS: merging the hubs-and-authorities
notion with the reset mechanism from PageRank.
- - Subspace HITS: combining multiple eigenvectors from
HITS to yield aggregate authority scores.
- Results
- - Both approaches are more stable than HITS; the latter is a
little worse than PageRank.
20Average Clicks [8]
- Aim.
- - A new definition of the distance between two pages.
- Approach.
- - Based on the probability of clicking a link during
random surfing.
- Benefit.
- - A good justification of practical search for
fetching neighboring pages.
- Result.
- - Distance by average clicks seems to fit
intuition well.
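A click-probability distance of this kind can be sketched as a shortest-path search. The slide gives no formula, so everything specific here is an assumption: the length of a link leaving page p is taken to be -log_n(alpha / outdegree(p)), where alpha / outdegree(p) is the probability that a random surfer clicks that particular link, and `alpha = 0.85` and `base = 7` are assumed defaults rather than values stated in the slides. The function name and graph are hypothetical.

```python
import heapq
import math

def average_clicks_distance(links, source, target, alpha=0.85, base=7):
    """Shortest-path distance where each link costs -log_base(alpha / outdegree).

    links: dict mapping each page to the list of pages it links to.
    Returns math.inf when target cannot be reached from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d_p, p = heapq.heappop(heap)
        if p == target:
            return d_p
        if d_p > dist.get(p, math.inf):
            continue  # stale heap entry
        targets = links.get(p, [])
        if not targets:
            continue
        # A link is "longer" when the page offers many links to choose from.
        length = -math.log(alpha / len(targets), base)
        for q in targets:
            if d_p + length < dist.get(q, math.inf):
                dist[q] = d_p + length
                heapq.heappush(heap, (dist[q], q))
    return math.inf

# Hypothetical graph: a portal with seven links and a page with a single link.
toy = {"portal": ["p1", "p2", "p3", "p4", "p5", "p6", "p7"], "p1": ["deep"]}
distance = average_clicks_distance(toy, "portal", "deep")
```

Under these assumptions, with `alpha = 1` a link on a page with exactly `base` links has length 1, which is one way to read "one average click".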
21Interesting Web Structure
- Analyzing interesting graph patterns or Web
structures.
- Helpful in identification of Web communities.
22Interesting Web Structures [11]
- Mutual Reinforcement
- Social Choice
- Co-Citation
- Transitive Endorsement
23Interesting Web Structures [11]
- Directed complete bipartite graph
- NK-clan with N = 2, K = 10
- An NK-clan is a set of K nodes in which there is a
path of length N or less (ignoring edge directions)
between every pair of nodes.
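The NK-clan definition can be checked mechanically with a breadth-first search. The sketch below is one reading of the definition, not code from [11]: it assumes the connecting paths must stay inside the candidate set, and the function name `is_nk_clan` and the `links` dictionary are hypothetical.

```python
from collections import deque

def is_nk_clan(links, nodes, n):
    """Check whether `nodes` forms an NK-clan with K = len(nodes) and N = n.

    links: dict mapping each page to the list of pages it links to.
    Edge directions are ignored; paths are restricted to `nodes` (an assumption).
    """
    nodes = set(nodes)
    # Build an undirected adjacency map limited to the candidate set.
    neighbors = {v: set() for v in nodes}
    for p, targets in links.items():
        for q in targets:
            if p in nodes and q in nodes and p != q:
                neighbors[p].add(q)
                neighbors[q].add(p)

    def reachable_within_n(start):
        depth = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if depth[v] == n:
                continue  # do not expand beyond path length n
            for w in neighbors[v]:
                if w not in depth:
                    depth[w] = depth[v] + 1
                    queue.append(w)
        return set(depth)

    return all(reachable_within_n(v) == nodes for v in nodes)

# Hypothetical examples: a star (every pair within two hops) and a chain.
star = {"c": ["a", "b", "d"]}
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
star_is_clan = is_nk_clan(star, {"a", "b", "c", "d"}, 2)
chain_is_clan = is_nk_clan(chain, {"a", "b", "c", "d"}, 2)
```

The star passes for N = 2 because every pair of leaves is joined through the center, while the chain fails because its end nodes are three links apart.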
24Interesting Web Structures [11]
25Interesting Web Structures
26Friends and Neighbors [9]
- Aim.
- - Techniques to mine information in order to
predict relationships between individuals.
- Approach.
- - Similarity measured by analyzing text, in-links,
out-links and mailing lists.
- Result.
- - In-links were good predictors.
27References
- [1] S. Brin and L. Page (1998), The PageRank
Citation Ranking: Bringing Order to the Web.
Technical report, available at
http://www-db.stanford.edu/backrub/pageranksub.ps, January 1998.
- [2] T. Haveliwala (1999), Efficient Computation of
PageRank. Technical report, Stanford
University, CA.
- [3] J.M. Kleinberg (1998), Authoritative Sources
in a Hyperlinked Environment.
28References
- [4] B. Amento, L. Terveen, and W. Hill (2000),
Does "Authority" Mean Quality? Predicting Expert
Quality Ratings of Web Documents, ACM, 2000.
- [5] D. Rafiei and A.O. Mendelzon (2000), What is
this Page Known for? Computing Web Page
Reputations, Proceedings of the Ninth International
WWW Conference.
29References(contd)
- [6] A. Y. Ng, A. X. Zheng, and M. I.
Jordan (2001), Link Analysis, Eigenvectors and
Stability, IJCAI-01.
- [7] A. Y. Ng, A. X. Zheng, and M. I.
Jordan (2001), Stable Algorithms for Link
Analysis, Proc. 24th International Conference on
Research and Development in Information Retrieval
(SIGIR), 2001.
- [8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001),
Average-clicks: A New Measure of Distance on the
WWW, WI-2001, 2001.
30References(contd)
- [9] L. A. Adamic and E. Adar (2000), Friends and
Neighbors on the Web, Xerox Palo Alto Research
Center, Palo Alto, CA 94304.
- [10] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P.
Tsaparas (2000), Finding Authorities and Hubs
From Link Structures on the World Wide Web, WWW10
Proceedings.
31References (contd)
- [11] Kemal Efe, Vijay Raghavan, C. Henry Chu,
Adrienne L. Broadwater, Levent Bolelli, and Seyda
Ertekin (2000), The Shape of the Web and Its
Implications for Searching the Web,
International Conference on Advances in
Infrastructure for Electronic Business, Science,
and Education on the Internet, proceedings at
http://www.ssgrr.it/en/ssgrr2000/proceedings.htm,
Rome, Italy, Jul.-Aug. 2000.
- [12] Monika Henzinger, Link Analysis in Web
Information Retrieval, ICDE Bulletin, Sept 2000,
Vol. 23, No. 3.
32PageRank Approach
- PageRank of a page p:
$$PR(p) = \frac{d}{n} + (1-d)\sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{outdegree}(q)}$$
- d is the damping factor (the probability that a
page is chosen uniformly at random from all pages).
- n is the number of nodes in graph G.
- outdegree(q) is the number of edges leaving
page q.
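The definitions on this slide (d as the probability of jumping to a uniformly random page, n as the number of nodes, and outdegree(q) as the number of edges leaving q) translate into a short power iteration. This is a minimal sketch and not the implementation from [1]: the function name, the `links` dictionary, the default `d = 0.15`, and the fixed iteration count are assumptions, and pages without outgoing links are handled in the simplest possible way.

```python
def pagerank(links, d=0.15, iterations=50):
    """Approximate PageRank by repeated application of the update rule.

    links: dict mapping each page q to the list of pages q links to.
    d: probability of jumping to a page chosen uniformly at random.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share d / n.
        new_rank = {p: d / n for p in pages}
        # Each page q splits (1 - d) of its rank evenly over its outgoing edges.
        # Pages without outgoing links drop that share in this sketch.
        for q, targets in links.items():
            if targets:
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In the toy graph, "c" is linked from both "a" and "b" and therefore ranks above "b", which has a single backlink.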
33HITS Approach
- Let z denote the vector (1, 1, 1, ..., 1).
- Initially set x ← z, y ← z.
- For i = 1, 2, 3, ...
- - Apply the I operation.
- - Apply the O operation.
- - Normalize x and y.
- The sequence of (x, y) pairs produced converges
to a limit (x*, y*).
- Return (x*, y*) as the authority and hub
weights.
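The iteration on this slide can be sketched in a few lines. In Kleinberg's formulation the I operation sets each authority weight x_p to the sum of the hub weights y_q of pages q that link to p, and the O operation sets each hub weight y_p to the sum of the authority weights x_q of pages q that p links to. The function names, the `links` dictionary, the fixed iteration count, and the Euclidean normalization are choices made for this sketch.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1 (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}

def hits(links, iterations=50):
    """Return (authority, hub) weight dicts for the graph in `links`.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    x = {p: 1.0 for p in pages}  # authority weights, initialized to z
    y = {p: 1.0 for p in pages}  # hub weights, initialized to z
    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_x[p] += y[q]
        # O operation: hub of p = sum of the new authority weights of pages p links to.
        new_y = {p: sum(new_x[q] for q in links.get(p, [])) for p in pages}
        x, y = normalize(new_x), normalize(new_y)
    return x, y

# Hypothetical graph: two hub-like pages pointing at two authority-like pages.
authority, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In the toy graph, "a1" is pointed to by both hubs and gets the largest authority weight, and "h1" points to both authorities and gets the largest hub weight.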
34Friends and Neighbors
- Items that are unique to a few users are weighted
more than commonly occurring items.
- 2 people mention an item: weight = 1/log(2) ≈ 1.4
- 5 people mention an item: weight = 1/log(5) ≈ 0.62
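The weighting on this slide can be turned into a similarity score by summing the weights of the items two users share. The sketch below uses the natural logarithm, which reproduces the slide's values of about 1.4 and 0.62. The function name, the `items_by_user` dictionary, and the example users are hypothetical.

```python
import math

def similarity(items_by_user, a, b):
    """Sum 1 / log(frequency) over the items that users a and b have in common.

    items_by_user: dict mapping each user to the collection of items they mention.
    A shared item is mentioned by at least two users, so log(frequency) > 0.
    """
    # Count how many users mention each item.
    counts = {}
    for items in items_by_user.values():
        for item in set(items):
            counts[item] = counts.get(item, 0) + 1
    shared = set(items_by_user[a]) & set(items_by_user[b])
    return sum(1.0 / math.log(counts[item]) for item in shared)

# Hypothetical data: "x" is mentioned by 2 users, "y" by 5 users.
users = {
    "alice": {"x", "y"},
    "bob": {"x", "y"},
    "carol": {"y"},
    "dave": {"y"},
    "erin": {"y"},
}
score = similarity(users, "alice", "bob")  # 1/log(2) + 1/log(5)
```

The rare item "x" contributes more than twice as much to the score as the common item "y", which is the effect the weighting is meant to achieve.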