Title: Hyperlink Analysis
1Hyperlink Analysis
2Overview of This Talk
- Introduction to Hyperlink Analysis
- Classification of Hyperlink Analysis
- Two sub-topics
- Measures and Metrics
- Interesting Web Structures
3Definition of Hyperlink Analysis
- Hyperlink Analysis can be defined as an area of
Web Information Retrieval using the hyperlink
structure of the Web.
- Hyperlinks serve two main purposes.
- Pure navigation.
- Pointing to pages with authority on the same topic
as the page containing the link.
- This can be used to retrieve useful information
from the web.
- a set of ideas or statements supporting a
5What Information Can Be Retrieved?
- Quality of a Web page.
- The authority of a page on a topic.
- Ranking of Web pages.
- Interesting Web structures.
- Graph patterns such as co-citation, social choice,
complete bipartite graphs, etc.
- Web page classification.
- Classifying web pages according to various
6What Information Can Be Retrieved? (Cont.)
- Which pages to crawl.
- Deciding which web pages to add to the
collection of web pages.
- Finding related pages.
- Given one relevant page, find all related
pages.
- Detection of duplicated pages.
- Detection of near-mirror sites to eliminate
7Classification of Hyperlink Analysis Research
Measures and Metrics
Interesting Web Structures
Hyperlink Analysis
Web Page Classification
Web Search
(Still needs to be refined. Suggestions Welcome)
- Standards for measuring properties of a page or a
web structure.
- Quality of a page.
- Distance between pages.
- Web Page Reputation.
9PageRank Citation Ranking [1]
- Aim
- Ranking Metric for Hypertext Documents
- Approach
- Page has a high rank if the sum of the ranks of
its backlinks is high
10Authoritative Sources in a Hyperlinked Environment [3]
- Aim
- Determining the relative authority of pages
- Approach
- A good authority page is one pointed to by many
good hubs
- A good hub page is one that points to many good
authorities
- Results
- Efficient when the query topic is sufficiently
broad
- Benefits
- Locating dense bipartite communities
11Does Authority Mean Quality? [4]
- Aim.
- Are any metrics we compute for Web documents good
predictors of document quality?
- Approach.
- Do experts agree in their quality judgments?
- Are different link-based metrics different?
- In-degree, PageRank and Authority.
- Can we predict human quality judgments?
- Compute correlations between each pair of
metrics and also compare them with expert judgments.
12Does Authority Mean Quality? [4]
- Results.
- Experts agree on the nature of quality within a
topic.
- No significant difference between link-based
metrics.
- In-degree performed as well as PageRank and Authority.
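The pairwise comparison of link-based metrics described above can be sketched as follows. This is only an illustration, not the study's procedure: the slides do not say which correlation measure was used, so Spearman rank correlation is assumed here, and the values in `indegree` and `pagerank_scores` are made-up scores for four hypothetical pages.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either metric is constant.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Made-up scores for four hypothetical pages (assumption, not study data).
indegree = [3, 1, 2, 0]
pagerank_scores = [0.4, 0.1, 0.3, 0.2]
rho = spearman(indegree, pagerank_scores)
```

A value of `rho` close to 1 would indicate that the two metrics order the pages almost identically, which is the kind of agreement the results above report.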
13Web Page Reputations [5]
- Aim.
- Input: a URL. Output: a ranked set of topics for
which the page has a reputation.
- Approach.
- A page can acquire a high reputation on a topic
because the page is pointed to by many pages on
that topic, or because the page is pointed to by
some high-reputation pages on that topic.
- A page is deemed an authority on the topic if it is
pointed to by good hubs on the topic, and a good
hub is one that points to good authorities.
14One-level Influence Propagation
- The reputation of a page p on a topic t is the
probability that a random surfer looking for
topic t will visit page p
- At each step
- with probability d > 0 jump to a random page, or
- with probability (1-d) follow a random link from
the current page
if term t appears in page p
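The random-surfer process above can be approximated by simple iteration. The sketch below is a hedged illustration rather than the formula from the slide, which did not survive: it assumes, based on the condition "if term t appears in page p", that the random jump lands uniformly on pages containing term t. The function name, the default `d=0.15`, the iteration count, and the toy graph are all hypothetical.

```python
def one_level_reputation(graph, pages_with_term, d=0.15, iterations=100):
    """Estimate the visit probability of each page for one topic term.

    graph maps a page to the list of pages it links to.
    pages_with_term is the set of pages that contain the term.
    """
    pages = sorted(set(graph) | {q for targets in graph.values() for q in targets})
    on_topic = [p for p in pages if p in pages_with_term]
    if not on_topic:
        raise ValueError("no page contains the term")

    # Start the surfer on a uniformly chosen on-topic page.
    rep = {p: 0.0 for p in pages}
    for p in on_topic:
        rep[p] = 1.0 / len(on_topic)

    for _ in range(iterations):
        new_rep = {p: 0.0 for p in pages}
        # Assumption: the jump (probability d) targets on-topic pages only.
        for p in on_topic:
            new_rep[p] += d / len(on_topic)
        # With probability (1 - d) follow a random out-link.
        for q in pages:
            targets = graph.get(q, [])
            for p in targets:
                new_rep[p] += (1 - d) * rep[q] / len(targets)
        rep = new_rep
    return rep


# Toy graph: "t1" and "t2" contain the term, "p" and "other" do not.
toy_graph = {"t1": ["p"], "t2": ["p"], "p": ["t1"], "other": ["p"]}
reputation = one_level_reputation(toy_graph, {"t1", "t2"})
```

Pages without out-links simply lose their probability mass in this sketch; a fuller treatment would have to decide how to handle them.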
15Two Level Influence Propagation
- with probability d > 0 jump to a random page that
contains term t
- with probability (1-d) follow a random link
forward/backward from the current page,
alternating directions
- The authority reputation of a page p on a topic t is
the probability that a random surfer looking for
topic t makes a forward visit to page p
- The hub reputation of a page p on a topic t is the
probability that a random surfer looking for
topic t makes a backward visit to page p
16Two Level Influence Propagation
A(p,t) = probability of a forward visit to page p
when searching for term t = authority rank of
page p on term t
if term t appears in page p
H(p,t) = probability of a backward visit to page
p when searching for term t = hub rank of page p
on term t
if term t appears in page p
17Factors Affecting Page Reputation
- How well a topic is represented.
- How well pages on a topic are connected.
18Link Analysis and Stability [6]
- Aim.
- When to expect stable rankings under small
perturbations to hyperlink patterns.
- Approach.
- The eigengap directly affects the stability of
eigenvectors in the HITS algorithm.
- Coupled Markov Chain Theory(?).
- So long as the perturbed web pages did not have high
overall PageRank scores, the perturbed
PageRank scores will not be far from the
original.
- Result.
- HITS: unstable. PageRank: stable.
19Stable Algorithms [7]
- Aim
- Stable link analysis methods
- Approach
- Randomized HITS
- Merging the hubs-and-authorities notion with the reset
mechanism from PageRank
- Subspace HITS
- Combining multiple eigenvectors from HITS to
yield aggregate authority scores
- Results
- Both approaches are more stable than HITS; the latter
is a little worse than PageRank
20Average Clicks [8]
- Aim.
- A new definition of distance between two pages.
- Approach.
- Based on the probability of clicking a link during
random surfing.
- Benefit.
- A good justification of practical search for
fetching neighboring pages.
- Result.
- Distance by average clicks seems to fit well
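A click-probability distance of this kind can be sketched as a shortest-path computation. The slides do not give the formula, so the sketch assumes that a link on a page with `outdegree` links has length -log_n(alpha / outdegree), where `alpha` is a damping factor and `n` is a typical number of links per page; the defaults `alpha=0.85` and `n=7`, the function names, and the toy graph are assumptions for illustration only.

```python
import heapq
import math


def link_length(outdegree, alpha=0.85, n=7):
    """Assumed length of one link: -log_n of the probability of clicking it."""
    return -math.log(alpha / outdegree, n)


def average_clicks(graph, source, target, alpha=0.85, n=7):
    """Shortest-path distance from source to target (Dijkstra's algorithm).

    graph maps a page to the list of pages it links to.
    Returns math.inf when target cannot be reached.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, page = heapq.heappop(heap)
        if page == target:
            return dist
        if dist > best.get(page, math.inf):
            continue  # stale heap entry
        links = graph.get(page, [])
        if not links:
            continue
        step = link_length(len(links), alpha, n)
        for nxt in links:
            candidate = dist + step
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return math.inf


# Toy graph: "a" and "b" each have seven out-links.
toy_graph = {
    "a": ["b"] + ["x%d" % i for i in range(6)],
    "b": ["c"] + ["y%d" % i for i in range(6)],
}
distance = average_clicks(toy_graph, "a", "c", alpha=1.0, n=7)
```

With `alpha=1.0` and seven links per page, each click counts as roughly one unit, so two clicks give a distance of about 2. Link lengths are never negative under these assumptions, which is what makes Dijkstra's algorithm applicable.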
21Interesting Web Structure
- Analyzing interesting graph patterns or Web
structures.
- Helpful in identification of Web communities.
22Interesting Web Structures [11]
Mutual Reinforcement
Social Choice
Transitive Endorsement
23Interesting Web Structures [11]
Directed complete bipartite graph
NK-clan with N = 2, K = 10
An NK-clan is a set of K nodes in which there is a
path of length N or less (ignoring edge directions)
between every pair of nodes
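The NK-clan definition above can be checked with a breadth-first search from every member. This is a minimal sketch under one assumption the slide leaves open: paths are only allowed to pass through nodes of the candidate set itself. The function name and the toy edge list are hypothetical.

```python
from collections import deque


def is_nk_clan(edges, members, n):
    """Return True if every pair of members is joined by a path of length <= n.

    edges is an iterable of (source, target) pairs; directions are ignored.
    Assumption: paths may only use nodes inside `members`.
    """
    members = set(members)
    neighbors = {m: set() for m in members}
    for u, v in edges:
        if u in members and v in members and u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    for start in members:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == n:
                continue  # do not expand beyond path length n
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        if len(dist) < len(members):
            return False  # some member is farther than n from start
    return True


# Toy star: every page is linked with "hub", in either direction.
star_edges = [("hub", "a"), ("hub", "b"), ("c", "hub")]
star_nodes = ["hub", "a", "b", "c"]
star_is_clan = is_nk_clan(star_edges, star_nodes, n=2)
```

Here K is simply `len(members)`, so the star forms an NK-clan with N = 2 and K = 4 under the stated assumption.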
26Friends and Neighbors [9]
- Aim.
- Techniques to mine information in order to
predict relationships between individuals.
- Approach.
- Similarity measured by analyzing text, in-links,
out-links and mailing lists.
- Result.
- In-links were good predictors.
References
- [1] S. Brin and L. Page (1998), The PageRank
Citation Ranking: Bringing Order to the Web.
Technical report, available at
http://www-db.stanford.edu/backrub/pageranksub.ps,
January 1998.
- [2] T. Haveliwala (1999), Efficient Computation of
PageRank. Technical report, Stanford
University, CA.
- [3] J. M. Kleinberg (1998), Authoritative Sources
in a Hyperlinked Environment.
- [4] B. Amento, L. Terveen, and W. Hill (2000),
Does "Authority" Mean Quality? Predicting Expert
Quality Ratings of Web Documents (ACM 2000).
- [5] D. Rafiei and A. O. Mendelzon (2000), What is
this Page Known for? Computing Web Page
Reputations. Proceedings of the Ninth International
WWW Conference.
- [6] A. Y. Ng, A. X. Zheng, and M. I.
Jordan (2001), Link Analysis, Eigenvectors and
Stability. IJCAI-01.
- [7] A. Y. Ng, A. X. Zheng, and M. I.
Jordan (2001), Stable Algorithms for Link
Analysis. Proc. 24th International Conference on
Research and Development in Information Retrieval
(SIGIR), 2001.
- [8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001),
Average-clicks: A New Measure of Distance on the
WWW. WI-2001, 2001.
- [9] L. A. Adamic and E. Adar (2000), Friends and
Neighbors on the Web. Xerox Palo Alto Research
Center, Palo Alto, CA 94304.
- [10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P.
Tsaparas (2000), Finding Authorities and Hubs
from Link Structures on the World Wide Web. WWW10.
31References (contd)
- [11] Kemal Efe, Vijay Raghavan, C. Henry Chu,
Adrienne L. Broadwater, Levent Bolelli, and Seyda
Ertekin (2000), The Shape of the Web and Its
Implications for Searching the Web.
International Conference on Advances in
Infrastructure for Electronic Business, Science,
and Education on the Internet, Proceedings at
Rome, Italy, Jul.-Aug. 2000.
- [12] Monika Henzinger, Link Analysis in Web
Information Retrieval. ICDE Bulletin, Sept. 2000,
Vol. 23, No. 3.
32PageRank Approach
- PageRank of a page p.
- d is the damping factor (or the probability that a
page is chosen uniformly at random from all
pages).
- n is the number of nodes in graph G.
- outdegree(q) is the number of edges leaving
page q.
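The quantities defined above can be turned into a short power-iteration sketch. The formula itself is missing from the slide, so this assumes the usual form PR(p) = d/n + (1-d) * sum over pages q linking to p of PR(q)/outdegree(q), which matches the definitions of d, n and outdegree(q) given here. The default `d=0.15`, the uniform redistribution of rank from pages without out-links, and the toy graph are assumptions for illustration.

```python
def pagerank(graph, d=0.15, iterations=100):
    """Iteratively compute PageRank.

    graph maps a page to the list of pages it links to.
    d is the probability of jumping to a uniformly random page.
    """
    pages = sorted(set(graph) | {q for targets in graph.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}
        for q in pages:
            targets = graph.get(q, [])
            if targets:
                # Each out-link of q passes on an equal share of q's rank.
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[q] / n
        rank = new_rank
    return rank


# Toy graph: "c" is linked from both "a" and "b".
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In the toy graph, "c" ends up with the highest rank because it collects rank from both other pages, in line with the idea that a page ranks high when the ranks of its backlinks sum to a high value.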
33HITS Approach
- Let z denote the vector (1, 1, 1, ..., 1).
- Initially set x ← z, y ← z.
- For i = 1, 2, 3, ...
- Apply the I operation.
- Apply the O operation.
- Normalize x and y.
- The sequence of (x, y) pairs produced converges
to a limit (x*, y*).
- Return (x*, y*) as the authority and hub
weights.
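The iteration above can be sketched in a few lines. The slide does not define the two operations, so the sketch assumes the standard reading: the I operation sets each authority weight x[p] to the sum of the hub weights of the pages pointing to p, and the O operation sets each hub weight y[p] to the sum of the authority weights of the pages p points to. The fixed iteration count, the normalization to unit length, and the toy graph are illustrative choices.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(graph, iterations=50):
    """Return (authority, hub) weights for a graph of page -> linked pages."""
    pages = sorted(set(graph) | {q for targets in graph.values() for q in targets})
    x = {p: 1.0 for p in pages}  # authority weights, x <- z
    y = {p: 1.0 for p in pages}  # hub weights, y <- z

    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        x = {p: 0.0 for p in pages}
        for q, targets in graph.items():
            for p in targets:
                x[p] += y[q]
        # O operation: hub weight of p = sum of authority weights it links to.
        y = {p: sum(x[q] for q in graph.get(p, [])) for p in pages}
        x, y = normalize(x), normalize(y)
    return x, y


# Toy graph: two hub pages pointing at two candidate authorities.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
authority, hub = hits(toy_graph)
```

In the toy graph, "a1" receives links from both hubs and so gets the larger authority weight, while "h1" points to both authorities and gets the larger hub weight.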
34Friends and Neighbors
- Items that are unique to a few users are weighted
more than commonly occurring items
- 2 people mention item: weight = 1/log(2) ≈ 1.4
- 5 people mention item: weight = 1/log(5) ≈ 0.62
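The weighting above is consistent with the natural logarithm, and a similarity score between two people can be sketched by summing the weights of the items they share. The function names, the `mention_counts` table, and the example items are hypothetical; only the 1/log(count) weight comes from the slide.

```python
import math


def item_weight(num_people):
    """Weight of an item mentioned by num_people users (needs num_people >= 2)."""
    return 1.0 / math.log(num_people)


def similarity(items_a, items_b, mention_counts):
    """Sum the weights of the items two people have in common."""
    shared = set(items_a) & set(items_b)
    return sum(item_weight(mention_counts[item]) for item in shared)


# Hypothetical data: how many people mention each item overall.
mention_counts = {"rare-link": 2, "popular-link": 5, "other-link": 3}
alice = ["rare-link", "popular-link"]
bob = ["rare-link", "popular-link", "other-link"]
score = similarity(alice, bob, mention_counts)
```

A shared item always has at least two mentions, so the logarithm is never zero for items that actually contribute to the score.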