Authoritative Sources in a Hyperlinked Environment - PowerPoint PPT Presentation

Learn more at: https://www2.cs.uh.edu

1
Authoritative Sources in a Hyperlinked Environment
  • Jon M. Kleinberg
  • Presentation by Julian Zinn

2
Searching the Web
  • Goal: find pages relevant to a query.
  • The basic text-based search algorithms retrieve
    pages that contain the query keywords.
  • Improved searching algorithms can examine the
    link structure of the web to learn about the
    contents of web pages.
  • This paper introduces an algorithm for
    identifying authoritative pages and hub pages.

3
Overview
  • Issues in Searching
  • Algorithm Overview
  • Iterative Algorithm
  • Wrap-up

4
Types of Queries
  • Specific queries: information about the topic is
    scarce.
  • Broad-topic queries: information about the topic
    is overabundant. We want to return the most
    authoritative pages.
  • Similar-page queries: find pages that are like
    a given page.
  • This paper examines broad-topic queries.

5
Complications with Text-based Search
  • An authoritative page for a query may not contain
    the query terms.
  • Example: www.uh.edu contains neither "University"
    nor "Houston", and has "UH" only six times.
  • Text may be in the form of images or flash
    animations.
  • A page might not be self-descriptive.
  • Example: Honda does not describe itself as an
    automobile manufacturer, and Google does not
    describe itself as a search engine.

6
Examining Link Structure
  • The creator of a page p, by including a link to a
    page q, confers authority in some way to page q.
  • How can we exploit this latent human judgment
    information?
  • Pitfall: many links, such as navigational links
    and advertisement links, do not confer authority.

7
Exploiting Link Structure 1
  • An authoritative page must be popular.
  • So, of all pages that contain the query terms,
    return those with the highest in-degree.
  • Pitfall: this still misses authoritative pages
    that do not contain the query terms.
  • Pitfall: universally popular pages (like
    www.yahoo.com) will be considered highly
    authoritative for any query terms they contain.
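
The in-degree heuristic above can be sketched in a few lines. This is only an illustration: the function name, the link pairs, and the page names are hypothetical, and the candidate set is assumed to have come from an earlier text-based search.

```python
from collections import Counter

def rank_by_in_degree(links, candidates, top_k=3):
    """Rank candidate pages by how many distinct pages link to them.

    links: iterable of (source, target) pairs (hypothetical crawl data).
    candidates: set of pages that contain the query terms.
    """
    in_degree = Counter()
    for source, target in set(links):  # de-duplicate repeated links
        if target in candidates and source != target:
            in_degree[target] += 1
    # Sort by descending in-degree, breaking ties alphabetically.
    ranked = sorted(candidates, key=lambda p: (-in_degree[p], p))
    return ranked[:top_k]

# Hypothetical data: "portal" is linked from everywhere, whatever the topic.
links = [
    ("a", "portal"), ("b", "portal"), ("c", "portal"),
    ("a", "maker"), ("b", "maker"),
    ("c", "blog"),
]
candidates = {"portal", "maker", "blog"}
print(rank_by_in_degree(links, candidates))
```

In this toy data the universally popular "portal" page ranks first even if it is not the best source for the topic, which is the second pitfall listed above.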

8
Exploiting Link Structure 2
  • Authoritative sources often do not link to other
    authoritative sources.
  • Examples: Toyota does not link to Honda, and
    Google does not link to Teoma.
  • Other pages, which we call hub pages, link to
    multiple authoritative sources.
  • Example: auto enthusiast websites link to
    multiple manufacturers' websites.
  • The authoritative pages for a query share many
    hub pages.

9
Overview
  • Issues in Searching
  • Algorithm Overview
  • Iterative Algorithm
  • Wrap-up

10
Algorithm Overview
  • For a query σ, start with a text-based search to
    generate an initial root set R_σ.
  • Enlarge the root set to a base set S_σ.
  • Identify authoritative pages and hub pages in S_σ.
  • Return the most authoritative pages in S_σ.

11
Desiderata for S_σ
  • S_σ should be:
      1. Relatively small.
      2. Rich in relevant pages.
      3. Inclusive of most (or many) of the strongest
         authorities.
  • R_σ will satisfy 1 and 2, but not 3.
  • Even the set of all pages that contain the query
    terms may not satisfy 3.

12
Enlarging R_σ to S_σ
  • Pages in R_σ may not be authoritative, but most
    authoritative pages are probably pointed to by at
    least one member of R_σ.
  • Pages in R_σ may not point to each other.
  • Let S_σ = R_σ ∪ {all pages pointed to by pages of
    R_σ} ∪ {some pages that point to pages of R_σ}.
  • Use a heuristic to avoid navigation links.
  • Kleinberg's experiments had |R_σ| ≈ 200 and
    |S_σ| ≈ 1000 to 5000.
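
A minimal sketch of the root-set expansion described above. The function name, the dictionary-based link tables, and the cap `d` (standing in for "some pages that point to" a root page) are assumptions made for illustration. The heuristic for skipping navigation links is not modeled.

```python
def build_base_set(root_set, out_links, in_links, d=2):
    """Expand a root set into a base set.

    out_links[p]: pages that p points to (hypothetical link table).
    in_links[p]:  pages that point to p (hypothetical link table).
    d: assumed cap on how many in-linking pages are added per root page.
    """
    base = set(root_set)
    for page in root_set:
        # Add every page that this root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d pages that point to this root page.
        # Sorting is only there to make the sketch repeatable.
        pointing_in = sorted(in_links.get(page, ()))
        base.update(pointing_in[:d])
    return base

# Hypothetical link data for two root pages.
out_links = {"r1": {"a1", "a2"}, "r2": {"a2"}}
in_links = {"r1": {"h1", "h2", "h3"}, "r2": {"h1"}}
base = build_base_set({"r1", "r2"}, out_links, in_links, d=2)
print(sorted(base))
```

Capping the in-links keeps the base set small even when a root page is pointed to by a very large number of pages.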

13
Identifying Hubs and Authorities
  • Our set S_σ still has the problem of
    non-authoritative pages of high in-degree.
  • The authoritative pages are the popular pages
    that have a large overlap in the sets of pages
    that point to them.
  • The hub pages are the pages that point to many of
    the authoritative pages.

14
Hubs and Authorities Picture
[Figure: diagram labeled "hubs", "authorities", and "unrelated page of large in-degree"]

15
Mutually Reinforcing Relationship
  • Good hubs point to many good authorities.
  • Good authorities are pointed to by many good
    hubs.
  • This circular definition calls for an iterative
    algorithm.

16
Overview
  • Issues in Searching
  • Algorithm Overview
  • Iterative Algorithm
  • Wrap-up

17
Iterative Algorithm 1
  • For each page p, we associate a non-negative
    authority weight x(p) and a non-negative hub
    weight y(p).
  • Values are normalized so that the squares of each
    kind of weight sum to 1.
  • Larger values indicate better pages.

18
Iterative Algorithm 2
  • If p points to many pages with large x-values,
    then p receives a large y-value.
  • If p is pointed to by many pages with large
    y-values, then p receives a large x-value.
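
One way to write these two rules as equations, assuming E denotes the set of links (a symbol not used on the slides), so that (q, p) in E means q points to p:

```latex
% Authority update: sum the hub weights of the pages that point to p.
x(p) \leftarrow \sum_{q \,:\, (q,p) \in E} y(q)

% Hub update: sum the authority weights of the pages that p points to.
y(p) \leftarrow \sum_{q \,:\, (p,q) \in E} x(q)
```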

19
Iterative Algorithm 3
  • We iterate and renormalize until values converge.
  • Therefore, we need to prove convergence.
  • The algorithm is a discrete-time process and can
    be written as repeated multiplications of
    matrices and vectors.
  • A result of linear algebra guarantees convergence
    of X and Y to the principal eigenvectors of
    M^T M and M M^T, where M is the adjacency matrix
    of the link graph.
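
A small sketch of the iterate-and-renormalize loop, written with plain dictionaries rather than matrices. The function names, the fixed iteration count, and the example graph are assumptions for illustration. Letting the hub update use the freshly computed authority weights is one possible ordering, not necessarily the one used on the slides.

```python
import math

def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: (v / norm if norm else 0.0) for p, v in weights.items()}

def hits(pages, links, iterations=20):
    """Return (authority, hub) weights after a fixed number of rounds.

    links: list of (source, target) pairs over `pages` (hypothetical data).
    """
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(iterations):
        # Authority update: sum hub weights of pages pointing to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in links:
            new_x[p] += y[q]
        # Hub update: sum (new) authority weights of pages p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in links:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y

# Hypothetical graph: two hubs share a1 and a2; only h1 links to a3.
pages = ["h1", "h2", "a1", "a2", "a3"]
links = [("h1", "a1"), ("h1", "a2"), ("h1", "a3"),
         ("h2", "a1"), ("h2", "a2")]
authority, hub = hits(pages, links)
print(sorted(pages, key=lambda p: -authority[p])[:2])
```

In this toy graph the pages shared by both hubs (a1 and a2) end up with the largest authority weights, and h1, which points to all three authorities, ends up as the stronger hub.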

20
Example Mini Web
     


[Figure: a mini web of three pages labeled X, Y, and Z]

A_i = M^T H_{i-1}
H_i = M A_{i-1}

21
Example
 

[Table: values for pages X, Y, and Z at iterations 0, 1, 2, and 3]

22
Overview
  • Issues in Searching
  • Algorithm Overview
  • Iterative Algorithm
  • Example
  • Wrap-up

23
Notes to Consider
  • In general, we don't need to iterate to
    convergence.
  • Paper contains a list of good results for various
    queries.
  • After initial text-based search, the text was
    ignored in favor of the link structure.

24
Related Areas
  • Similar-page queries.
  • Connections with:
      • Social networks
      • Bibliometrics (citations)
      • Stand-alone hypertext environments
      • Clustering of link structures
  • Multiple sets of hubs and authorities
  • Diffusion and generalization

25
Conclusion
  • Influential paper with many citations.
  • Published at the same time as the Google
    PageRank algorithm.
  • HITS: Hyperlink-Induced Topic Search
  • Clever (IBM)
  • Basis of Teoma search engine algorithm.

26
References
  • Kleinberg, Jon. Authoritative Sources in a
    Hyperlinked Environment. Journal of the ACM, Vol.
    46, No. 5, September 1999, pp. 604-632.
  • The mini-web example comes from:
    http://www.cs.fiu.edu/vagelis/presentations/RandomWalks.ppt

27
The End