Hyperlink Analysis - PowerPoint PPT Presentation

Slides: 35
Provided by: prasanna4

1
Hyperlink Analysis
  • A Survey
  • (In Progress)

2
Overview of This Talk
  • Introduction to Hyperlink Analysis
  • Classification of Hyperlink Analysis
  • Two sub-topics
  • Measures and Metrics
  • Interesting Web Structures

3
Definition of Hyperlink Analysis
  • Hyperlink Analysis can be defined as an area of
    Web Information Retrieval using the hyperlink
    structure of the Web.

4
Motivation
  • Hyperlinks serve two main purposes.
  • Pure Navigation.
  • Pointing to pages with authority on the same
    topic as the page containing the link.
  • This can be used to retrieve useful information
    from the web.

- a set of ideas or statements supporting a topic
5
What Information Can Be Retrieved?
  • Quality of a Web page.
  • - The authority of a page on a topic.
  • - Ranking of Web pages.
  • Interesting Web Structures.
  • Graph patterns like Co-citation, Social choice,
    Complete bipartite graphs etc.
  • Web Page Classification.
  • - Classifying web pages according to various
    topics.

6
What Information Can Be Retrieved? (Cont)
  • Which pages to crawl.
  • - Deciding which web pages to add to the
    collection of web pages.
  • Finding Related Pages.
  • - Given one relevant page, find all related
    pages.
  • Detection of duplicated pages.
  • - Detection of near-mirror sites to eliminate
    duplication.

7
Classification of Hyperlink Analysis Research
  • Hyperlink Analysis
  • - Measures and Metrics
  • - Interesting Web Structures
  • - Web Page Classification
  • - Web Search
  • (Still needs to be refined. Suggestions welcome.)
8
Measures/metrics
  • Standards for measuring properties of a page or a
    web structure.
  • Quality of a page.
  • Distance between pages.
  • Web Page Reputation.

9
PageRank Citation Ranking [1]
  • Aim
  • Ranking Metric for Hypertext Documents
  • Approach
  • Page has a high rank if the sum of the ranks of
    its backlinks is high

10
Authoritative Sources in a Hyperlinked Environment [3]
  • Aim
  • Determining relative authority of pages
  • Approach
  • Good authority page is one pointed to by many
    good hubs
  • Good hub page is one that points to many good
    authorities
  • Results
  • Efficient when query topic is sufficiently
    broad
  • Benefits
  • Locating dense bipartite communities

11
Does Authority Mean Quality? [4]
  • Aim.
  • Are any metrics we compute for Web documents good
    predictors of document quality?
  • Approach.
  • Do experts agree in their quality judgments?
  • Are different link-based metrics different?
  • Indegree, PageRank and Authority.
  • Can we predict human quality judgments?
  • Compute correlations between each pair of
    metrics and also compare them with expert
    judgment.
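
The correlation step can be sketched as follows. This is a minimal illustration that assumes a Spearman rank correlation; the slides do not say which coefficient the study used, and the scores below are made-up toy values, not results from the paper.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical scores for five pages, for illustration only.
in_degree = [40, 12, 7, 3, 1]
expert_rating = [6.5, 5.0, 5.5, 2.0, 1.0]
rho = spearman(in_degree, expert_rating)
```

The same `spearman` call can be applied to each pair of link-based metrics and to each metric against the expert ratings.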

12
Does Authority Mean Quality? [4]
  • Results.
  • Experts agree on the nature of quality within a
    topic.
  • No significant difference between link-based
    metrics.
  • In-degree performed as well as PageRank and
    Authority.

13
Web Page Reputations [5]
  • Aim.
  • Input: a URL. Output: a ranked set of topics for
    which the page has a reputation.
  • Approach.
  • A page can acquire a high reputation on a topic
    because the page is pointed to by many pages on
    that topic, or because the page is pointed to by
    some high reputation pages on that topic.
  • A page is deemed an authority on the topic if it
    is pointed to by good hubs on the topic, and a
    good hub is one that points to good authorities.

14
One-level Influence Propagation
  • Reputation of the page p on a topic t is the
    probability that the random surfer looking for
    topic t will visit page p
  • At each step
  • with probability d > 0 jump to a random page, or
  • with probability (1-d) follow a random link from
    the current page

[Equation: the reputation of page p on topic t, defined by cases: one case if term t appears in page p, another otherwise.]
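
The random walk on this slide can be sketched as an iteration. This is a hedged sketch, not the authors' exact formula: it assumes the random jump lands uniformly on pages that contain term t (which is what the two cases "term t appears in page p" and "otherwise" suggest), and the function name, the default d = 0.15 and the toy graph are all hypothetical.

```python
def one_level_reputation(graph, pages_with_term, d=0.15, iterations=50):
    """Approximate the probability that a surfer looking for term t visits each page.

    graph maps each page to the list of pages it links to.
    pages_with_term is the set of pages that contain term t.
    """
    n_t = len(pages_with_term)
    if n_t == 0:
        raise ValueError("no page contains the term")
    pages = list(graph)
    # Assumption: start the surfer on a random page that contains the term.
    rep = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in pages}
    for _ in range(iterations):
        # Jump term: d / n_t if the term appears in p, 0 otherwise.
        new_rep = {p: (d / n_t if p in pages_with_term else 0.0) for p in pages}
        for q, targets in graph.items():
            if targets:
                # With probability (1 - d), follow a random link out of q.
                share = (1.0 - d) * rep[q] / len(targets)
                for p in targets:
                    new_rep[p] += share
        rep = new_rep
    return rep

# Hypothetical toy web: t1 and t2 contain the term and both link to p.
toy_web = {"t1": ["p"], "t2": ["p"], "p": ["q"], "q": ["p"], "z": ["t1"]}
reputation = one_level_reputation(toy_web, {"t1", "t2"})
```

In this toy graph, page z is never reached (nothing links to it and it does not contain the term), while p collects reputation from both term pages. Pages without out-links simply lose their probability mass in this sketch.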
15
Two Level Influence Propagation
  • with probability d > 0 jump to a random page that
    contains term t
  • with probability (1-d) follow a random link
    forward/backward from the current page,
    alternating directions
  • Authority Reputation of a page p on a topic t is
    the probability that a random surfer looking for
    a topic t makes a forward visit to the page p
  • Hub Reputation of a page p on a topic t is the
    probability that a random surfer looking for a
    topic t makes a backward visit to the page p

16
Two Level Influence Propagation
  • A(p,t): probability of a forward visit to page p
    when searching for term t; the authority rank of
    page p on term t.
  • [Equation for A(p,t), defined by cases: if term
    t appears in page p, and otherwise.]
  • H(p,t): probability of a backward visit to page
    p when searching for term t; the hub rank of
    page p on term t.
  • [Equation for H(p,t), defined by cases: if term
    t appears in page p, and otherwise.]
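
A rough sketch of the alternating forward/backward walk follows. It is an illustration under stated assumptions rather than the paper's exact equations: the jump mass d is split evenly between A and H (d / (2 * N_t) for each page containing the term, where N_t is the number of such pages), and the function name, default d and toy graph are hypothetical.

```python
def two_level_reputation(graph, pages_with_term, d=0.15, iterations=50):
    """Return (authority, hub) reputation dicts for a term.

    graph maps each page to the list of pages it links to.
    """
    n_t = len(pages_with_term)
    if n_t == 0:
        raise ValueError("no page contains the term")
    pages = list(graph)
    in_links = {p: [] for p in pages}
    for q, targets in graph.items():
        for p in targets:
            in_links[p].append(q)
    # Assumed jump term: d / (2 * n_t) if the term appears in p, 0 otherwise.
    jump = {p: (d / (2.0 * n_t) if p in pages_with_term else 0.0) for p in pages}
    authority, hub = dict(jump), dict(jump)
    for _ in range(iterations):
        new_authority, new_hub = dict(jump), dict(jump)
        for q in pages:
            # Forward step: a surfer who reached q backward follows an out-link.
            for p in graph[q]:
                new_authority[p] += (1.0 - d) * hub[q] / len(graph[q])
            # Backward step: a surfer who reached q forward follows an in-link back.
            for p in in_links[q]:
                new_hub[p] += (1.0 - d) * authority[q] / len(in_links[q])
        authority, hub = new_authority, new_hub
    return authority, hub

# Hypothetical toy web: two hub-like pages pointing at one page; all contain the term.
toy_web = {"h1": ["a"], "h2": ["a"], "a": []}
auth_rep, hub_rep = two_level_reputation(toy_web, {"h1", "h2", "a"})
```

Here page a ends up with the highest authority reputation, and h1 and h2 share the highest hub reputation.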
17
Factors Affecting Page Reputation
  • How well a topic is represented.
  • How well pages on a topic are connected.

18
Link Analysis and Stability [6]
  • Aim.
  • When to expect stable rankings under small
    perturbations to hyperlink patterns.
  • Approach.
  • The eigengap directly affects the stability of
    the eigenvectors in the HITS algorithm.
  • Coupled Markov Chain Theory(?).
  • As long as the perturbed web pages do not have
    high overall PageRank scores, the perturbed
    PageRank scores will not be far from the
    original.
  • Result.
  • HITS: unstable. PageRank: stable.

19
Stable Algorithms [7]
  • Aim
  • Stable Link Analysis Methods
  • Approach
  • Randomized HITS
  • Merging Hubs and Authorities notion with reset
    mechanism from PageRank
  • Subspace HITS
  • Combining multiple eigenvectors from HITS to
    yield aggregate authority scores
  • Results
  • Both approaches are more stable than HITS; the
    latter is a little worse than PageRank

20
Average Clicks [8]
  • Aim.
  • A new definition of distance between two pages.
  • Approach.
  • Based on the probability of clicking a link
    during random surfing.
  • Benefit.
  • A good justification of practical search for
    fetching neighboring pages.
  • Result.
  • Distance by average clicks seems to fit well
    intuitively.
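
One way to turn click probabilities into a distance is sketched below. It assumes the length of a link on page p is -log_n(alpha / outdegree(p)), so that one "average click" is a click among n links, and that the distance between two pages is the shortest such path. The formula, the constants alpha = 0.85 and n = 7, and the function names are assumptions of this sketch; the slides do not give them.

```python
import heapq
import math

def link_length(out_degree, alpha=0.85, n=7.0):
    """Assumed length of one link on a page with the given out-degree."""
    return -math.log(alpha / out_degree, n)

def average_clicks(graph, source, target, alpha=0.85, n=7.0):
    """Shortest-path distance from source to target using Dijkstra's algorithm.

    graph maps each page to the list of pages it links to.
    Returns math.inf if target cannot be reached.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, page = heapq.heappop(heap)
        if page == target:
            return cost
        if cost > dist.get(page, math.inf):
            continue  # stale queue entry
        targets = graph.get(page, [])
        for nxt in targets:
            new_cost = cost + link_length(len(targets), alpha, n)
            if new_cost < dist.get(nxt, math.inf):
                dist[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return math.inf

# Hypothetical toy web.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
clicks_a_to_d = average_clicks(toy_web, "a", "d")
```

Under these assumptions a link on a page with many out-links is "longer" than a link on a page with few, because each individual link is less likely to be clicked.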

21
Interesting Web Structure
  • Analyzing interesting graph patterns or Web
    Structures.
  • Helpful in identification of Web Communities.

22
Interesting Web Structures [11]
  • Mutual Reinforcement
  • Social Choice
  • Co-Citation
  • Transitive Endorsement
23
Interesting Web Structures [11]
  • Directed complete bipartite graph
  • NK-clan with N = 2, K = 10
  • An NK-clan is a set of K nodes in which there is
    a path of length N or less (ignoring edge
    directions) between every pair of nodes.
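
The NK-clan definition can be checked with a breadth-first search over the undirected version of the link graph. This is a minimal sketch; it assumes that a connecting path may pass through nodes outside the candidate set, which the definition above does not settle, and the function names and toy graph are hypothetical.

```python
from collections import deque

def undirected_neighbors(graph):
    """Build neighbor sets that ignore edge directions."""
    nbrs = {p: set() for p in graph}
    for p, targets in graph.items():
        for q in targets:
            nbrs[p].add(q)
            nbrs.setdefault(q, set()).add(p)
    return nbrs

def distances_within(nbrs, start, limit):
    """Breadth-first search from start, exploring paths of at most `limit` edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == limit:
            continue
        for nxt in nbrs[page]:
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def is_nk_clan(graph, nodes, n):
    """True if every pair in `nodes` is joined by a path of length n or less."""
    nbrs = undirected_neighbors(graph)
    nodes = list(nodes)
    for i, u in enumerate(nodes):
        reachable = distances_within(nbrs, u, n)
        if any(v not in reachable for v in nodes[i + 1:]):
            return False
    return True

# Hypothetical toy web: one page linking to three others.
toy_web = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": []}
```

For this toy graph, `is_nk_clan(toy_web, ["a", "b", "c"], 2)` holds because every pair is connected through `hub`, while the same set fails for N = 1.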
24
Interesting Web Structures [11]
25
Interesting Web Structures
  • Web Communities

26
Friends and Neighbors [9]
  • Aim.
  • Techniques to mine information in order to
    predict relationships between individuals.
  • Approach.
  • Similarity measured by analyzing text, in-links,
    out-links and mailing lists.
  • Result.
  • In-links were good predictors.

27
References
  • [1] S. Brin and L. Page (1998), The PageRank
    Citation Ranking: Bringing Order to the Web.
    Technical report, available at
    http://www-db.stanford.edu/backrub/pageranksub.ps,
    January 1998.
  • [2] T. Haveliwala (1999), Efficient Computation
    of PageRank. Technical report, Stanford
    University, CA.
  • [3] J.M. Kleinberg (1998), Authoritative Sources
    in a Hyperlinked Environment.

28
References
  • [4] B. Amento, L. Terveen, and W. Hill (2000),
    Does "Authority" Mean Quality? Predicting Expert
    Quality Ratings of Web Documents. ACM, 2000.
  • [5] D. Rafiei and A.O. Mendelzon (2000), What is
    this Page Known for? Computing Web Page
    Reputations. Proceedings of the Ninth
    International WWW Conference.

29
References (contd)
  • [6] A. Y. Ng, A. X. Zheng, and M. I. Jordan
    (2001), Link Analysis, Eigenvectors and
    Stability. IJCAI-01.
  • [7] A. Y. Ng, A. X. Zheng, and M. I. Jordan
    (2001), Stable Algorithms for Link Analysis.
    Proc. 24th International Conference on Research
    and Development in Information Retrieval
    (SIGIR), 2001.
  • [8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001),
    Average-clicks: A New Measure of Distance on the
    WWW. WI-2001, 2001.

30
References (contd)
  • [9] L. A. Adamic and E. Adar (2000), Friends and
    Neighbors on the Web. Xerox Palo Alto Research
    Center, Palo Alto, CA 94304.
  • [10] A. Borodin, G.O. Roberts, J.S. Rosenthal, and
    P. Tsaparas (2000), Finding Authorities and Hubs
    From Link Structures on the World Wide Web.
    WWW10 Proceedings.

31
References (contd)
  • [11] Kemal Efe, Vijay Raghavan, C. Henry Chu,
    Adrienne L. Broadwater, Levent Bolelli, and Seyda
    Ertekin (2000), The Shape of the Web and Its
    Implications for Searching the Web.
    International Conference on Advances in
    Infrastructure for Electronic Business, Science,
    and Education on the Internet, proceedings at
    http://www.ssgrr.it/en/ssgrr2000/proceedings.htm,
    Rome, Italy, Jul.-Aug. 2000.
  • [12] Monika Henzinger, Link Analysis in Web
    Information Retrieval. ICDE Bulletin, Sept 2000,
    Vol. 23, No. 3.

32
PageRank Approach
  • PageRank of a page p:
    PR(p) = d/n + (1 - d) \sum_{(q,p) \in G} PR(q) / outdegree(q)
  • d is the damping factor (the probability that a
    page is chosen uniformly at random from all
    pages).
  • n is the number of nodes in graph G.
  • outdegree(q) is the number of edges leaving
    page q.
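
A small iterative sketch of this computation follows. It assumes the usual damped update, in which every page receives d / n from the random jump and each page q passes (1 - d) * PR(q) / outdegree(q) to the pages it links to. The default d = 0.15, the handling of pages without out-links, and the toy graph are assumptions made for illustration.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Iteratively approximate PageRank.

    graph maps each page to the list of pages it links to; every linked
    page must also appear as a key. d is the probability of jumping to a
    page chosen uniformly at random.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}
        for q, targets in graph.items():
            if targets:
                # q shares its rank equally among the pages it links to.
                share = (1.0 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = (1.0 - d) * rank[q] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy web of four pages.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph page c, which has the most backlinks, ends up with the highest rank, and page d, which has none, keeps only the d / n jump share.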

33
HITS Approach
  • Let z denote the vector (1, 1, 1, ..., 1).
  • Initially set x ← z, y ← z.
  • For i = 1, 2, 3, ...
  • Apply the I operation.
  • Apply the O operation.
  • Normalize x and y.
  • The sequence of (x, y) pairs produced converges
    to a limit (x*, y*).
  • Return (x*, y*) as the authority and hub
    weights.
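
The iteration can be sketched in a few lines. Here the I operation sets each authority weight x[p] to the sum of the hub weights of the pages linking to p, and the O operation sets each hub weight y[p] to the sum of the authority weights of the pages p links to. The fixed iteration count, the function name and the toy graph are assumptions of this sketch.

```python
import math

def hits(graph, iterations=50):
    """Return (authority, hub) weight dicts.

    graph maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(graph)
    y = {p: 1.0 for p in pages}  # hub weights, starting from z = (1, ..., 1)
    x = {p: 1.0 for p in pages}  # authority weights, starting from z
    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        x = {p: 0.0 for p in pages}
        for q, targets in graph.items():
            for p in targets:
                x[p] += y[q]
        # O operation: hub weight of p = sum of authority weights p links to.
        y = {p: sum(x[q] for q in graph[p]) for p in pages}
        # Normalize both vectors so their squares sum to 1.
        for vec in (x, y):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return x, y

# Hypothetical toy web: three hub-like pages and two authority-like pages.
toy_web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"], "a1": [], "a2": []}
authority, hub = hits(toy_web)
```

In the toy graph, a2 is pointed to by all three hubs and receives the top authority weight, while h1 and h2 tie for the top hub weight.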

34
Friends and Neighbors
  • Predicting Friendship
  • Items that are unique to few users are weighted
    more than commonly occurring items
  • 2 people mention item: weight = 1/log(2) ≈ 1.4
  • 5 people mention item: weight = 1/log(5) ≈ 0.62
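
The weighting can be turned into a pairwise similarity score by summing the weights of the items two people share. This is a minimal sketch; the weights on the slide match the natural logarithm, and the profiles and function names below are made up for illustration.

```python
import math
from collections import Counter

def item_weight(num_people):
    """Weight of an item mentioned by num_people users (natural log)."""
    return 1.0 / math.log(num_people)

def similarity(items_a, items_b, item_counts):
    """Sum the weights of the items shared by two users.

    A shared item is mentioned by at least two people, so log() is never 0.
    """
    shared = set(items_a) & set(items_b)
    return sum(item_weight(item_counts[item]) for item in shared)

# Hypothetical profiles: the items each person mentions.
profiles = {
    "ann": {"chess", "jazz"},
    "bob": {"chess", "jazz", "tennis"},
    "cat": {"jazz"},
    "dan": {"jazz"},
    "eve": {"jazz", "tennis"},
}
item_counts = Counter(item for items in profiles.values() for item in items)

ann_bob = similarity(profiles["ann"], profiles["bob"], item_counts)
ann_cat = similarity(profiles["ann"], profiles["cat"], item_counts)
```

Ann and Bob share both a rare item (chess, mentioned by 2 people) and a common one (jazz, mentioned by 5), so they score higher than Ann and Cat, who share only the common item.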