Link ranking II: Hubs, Authorities, and more

Provided by: raymie

1
Link ranking II: Hubs, Authorities, and more
2
Last time: PageRank etc.
  • PageRank and related random walks give a
    query-independent ranking of the entire Web
  • These random walks are ergodic Markov
    processes; the ranks are their stationary
    probabilities, found by an eigenvector computation
  • Combining such ranks into an overall
    query-specific ranking is tricky: it is the "secret
    sauce" of search engines

3
Today: Hubs/Authorities etc.
  • Kleinberg's Hubs/Authorities and related
    schemes give query-specific rankings of small
    subgraphs of the Web
  • Used for
      • Ranking broad-topic queries,
        aka topic distillation
          • Solves the problem of abundance
      • Finding similar pages
          • Solves the problem of scarcity (assuming you have an
            initial archetype)

4
Hubs & Authorities
  • A good Hub points to good Authorities
  • A good Authority is pointed to by good Hubs
  • Mutual reinforcement
  • Draw picture of hubs pointing to search engines.
    Point out that the phrase "search engine" typically
    does not appear on the home pages of search engines!

5
Basic idea: topic distillation
  • Compute the root set by issuing a query to a search
    engine (200 pages)
  • Compute the base set by adding to the root set the pages
    pointing to and from it (again, this can typically be
    done through a search engine) (1-5K pages, small
    by Web standards)
  • Draw picture again.
  • Run the iterative algorithm to identify hubs and
    authorities
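
The root-set-to-base-set expansion above can be sketched roughly as follows. This is only an illustration: `out_links` and `in_links` are hypothetical in-memory dictionaries standing in for the link queries a search engine would answer, and `max_in_per_page` is an assumed cap (not from the slides) that keeps very popular pages from flooding the base set.

```python
def build_base_set(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a root set into a base set.

    out_links[p] / in_links[p] are hypothetical dicts mapping page p to the
    pages it links to / is linked from. max_in_per_page is an assumed cap.
    """
    base = set(root_set)
    for page in root_set:
        # Pages the root page points to.
        base.update(out_links.get(page, ()))
        # Pages pointing to the root page, capped per page.
        base.update(sorted(in_links.get(page, ()))[:max_in_per_page])
    return base


# Toy link data (made up for illustration).
out_links = {"r1": ["a", "b"], "r2": ["b"], "h": ["r1", "r2"]}
in_links = {"r1": ["h"], "r2": ["h"], "a": ["r1"], "b": ["r1", "r2"]}
base = build_base_set(["r1", "r2"], out_links, in_links)
```

Here the base set contains the two root pages, the pages they link to (`a`, `b`), and the page linking to them (`h`).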

6
Basic idea: similar pages
  • Root set is a page (or a few pages) of interest
  • Base set is the root set plus the pages pointing to it, plus
    the pages they point to, plus the pages pointing to
    them, plus ...
  • Run the iterative algorithm

7
Computing H&A scores
  • Let h(x), a(x) be the hub/authority scores of page x
  • Initialize h(x) = a(x) = 1 for all x
  • Iterate until convergence
      • h(x) = sum of a(y) for all y pointed to by x
      • a(x) = sum of h(y) for all y pointing to x
      • Normalize h and a s.t. their L2-norm is 1
  • In practice, 5 iterations are enough for convergence
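
A minimal sketch of this iteration in plain Python, assuming the base set is given as a list of `(x, y)` link pairs. The function names and the toy graph are illustrative, and the update order (hubs first, then authorities from the fresh hub scores) is one reasonable reading of the slide.

```python
import math


def normalize(scores):
    """Scale a score dict so that its L2-norm is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(edges, iterations=5):
    """edges: iterable of (x, y) pairs meaning page x links to page y."""
    edges = list(edges)
    pages = {p for edge in edges for p in edge}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # h(x) = sum of a(y) for all y pointed to by x
        new_hub = dict.fromkeys(pages, 0.0)
        for x, y in edges:
            new_hub[x] += auth[y]
        hub = normalize(new_hub)
        # a(x) = sum of h(y) for all y pointing to x
        new_auth = dict.fromkeys(pages, 0.0)
        for x, y in edges:
            new_auth[y] += hub[x]
        auth = normalize(new_auth)
    return hub, auth


# Toy graph: three hub-like pages, two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph `a1` is pointed to by every hub and `h1` points to both authorities, so they come out on top.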

8
More eigenvectors
  • Let W be the adjacency matrix of the base set
      • W_ij = 1 if there is an edge i -> j, 0 otherwise
  • Rewrite the iteration in matrix form (a, h now
    vectors)
      • h = W a, a = W^T h (W^T is the transpose of W)
      • => h = W W^T h, a = W^T W a (up to normalization)
  • h is the principal eigenvector of W W^T;
    a is the principal eigenvector of W^T W
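
As a small check of the matrix form, the sketch below runs power iteration on W W^T and W^T W for a toy 4-page graph and confirms that W a points in the same direction as h. The helper names and the graph are made up for illustration; plain lists are used instead of a linear-algebra library to keep the example self-contained.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def unit(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def principal_eigenvector(m, iterations=100):
    """Power iteration starting from the all-ones vector."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        v = unit(matvec(m, v))
    return v


# W_ij = 1 if page i links to page j: page 0 links to 2 and 3, page 1 to 2.
W = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
Wt = transpose(W)
h = principal_eigenvector(matmul(W, Wt))   # hub scores
a = principal_eigenvector(matmul(Wt, W))   # authority scores
h_from_a = unit(matvec(W, a))              # h = W a, renormalized
```

Pages 0 and 1 get nonzero hub scores and pages 2 and 3 get nonzero authority scores, matching their roles in the toy graph.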

9
Bibliometrics
  • A = W^T W is the co-citation matrix [Small 73]; A_ij is the
    number of papers which jointly cite i and j
  • H = W W^T is the bibliographic coupling matrix [Kessler
    63]; H_ij is the number of papers jointly cited by i
    and j
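
To make the two definitions concrete, here is a small sketch that computes both matrices entry by entry for a made-up citation graph of four papers. The function names are illustrative only.

```python
def cocitation(W):
    """A = W^T W: A[i][j] counts papers k that cite both i and j."""
    n = len(W)
    return [[sum(W[k][i] * W[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def coupling(W):
    """H = W W^T: H[i][j] counts papers k cited by both i and j."""
    n = len(W)
    return [[sum(W[i][k] * W[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# W[i][j] = 1 if paper i cites paper j (toy data).
W = [
    [0, 0, 1, 1],  # paper 0 cites papers 2 and 3
    [0, 0, 1, 1],  # paper 1 cites papers 2 and 3
    [0, 0, 0, 1],  # paper 2 cites paper 3
    [0, 0, 0, 0],  # paper 3 cites nothing
]
A = cocitation(W)
H = coupling(W)
```

Papers 2 and 3 are cited together by papers 0 and 1, so `A[2][3]` is 2; papers 0 and 1 share two references, so `H[0][1]` is 2. The diagonal entries are the in-degrees (for A) and out-degrees (for H).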

10
Problem: topic drift
  • Many pages have "Best if viewed with <a
    href="..microsoft..">Internet Explorer</a>"
  • As a result, the IE download page becomes an authority
    for everything!

11
Adding weights
  • Iterate until convergence
      • h(x) = sum of w(x,y) a(y) for all y pointed to by x
      • a(x) = sum of w(y,x) h(y) for all y pointing to x
      • Normalize h and a s.t. their L2-norm is 1
  • Where w(x,y) is a weight on the edge from x to y
      • Could be the number of query terms in the anchor text
      • Could be the similarity of pages x and y
      • Could be the relevance of y to the topic (really is
        w(y), i.e., independent of x)
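
A minimal sketch of the weighted iteration, assuming each link from x to y carries a single weight that is used in both the hub and the authority update. The weight values below are invented to mimic the topic-drift case: links to a hypothetical `ie` page get a low weight, links to an on-topic page get weight 1.

```python
import math


def normalize(scores):
    """Scale a score dict so that its L2-norm is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def weighted_hits(weights, iterations=5):
    """weights: dict mapping a link (x, y), x pointing to y, to its weight."""
    pages = {p for edge in weights for p in edge}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Hub update: weighted sum over the pages x points to.
        new_hub = dict.fromkeys(pages, 0.0)
        for (x, y), w in weights.items():
            new_hub[x] += w * auth[y]
        hub = normalize(new_hub)
        # Authority update: weighted sum over the pages pointing to y.
        new_auth = dict.fromkeys(pages, 0.0)
        for (x, y), w in weights.items():
            new_auth[y] += w * hub[x]
        auth = normalize(new_auth)
    return hub, auth


# Made-up weights: "ie" has more in-links but each one is weakly relevant.
weights = {
    ("p1", "topic"): 1.0,
    ("p2", "topic"): 1.0,
    ("p1", "ie"): 0.1,
    ("p2", "ie"): 0.1,
    ("p3", "ie"): 0.1,
}
hub, auth = weighted_hits(weights)
```

Although `ie` has three in-links against two for `topic`, the low weights keep it from dominating the authority ranking.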

12
SALSA
  • Interesting paper making the following
    suggestions
      • Use in/outdegree within the base set to rank!
      • Immune from the TKC (tightly knit community) effect
        (draw java/earthweb)
  • Why does it work? Because all the hard work is
    in (a) the work done by the search engine in finding the
    base set and (b) the link pruning
13
More issues
  • Ignoring non-informative links
      • Intra-host links (considered navigational)
      • "Download IE here", "generated by tex2html"
  • Dealing with lots of pages on one host pointing to the
    same page on another (or the dual)
  • Hubs/Authorities has spawned a large literature!
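
One crude way to apply the link-pruning ideas above is sketched below. It drops intra-host links and keeps only the first link from any given host to a given target page. That second rule is an assumption made for illustration (the slides do not say how to handle such links, and down-weighting them is another option). The URLs are hypothetical.

```python
from urllib.parse import urlsplit


def prune_links(edges):
    """Drop intra-host links and collapse repeated host-to-page links."""
    kept = []
    seen = set()
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host == dst_host:
            continue  # intra-host link, treated as navigational
        key = (src_host, dst)
        if key in seen:
            continue  # another page on the same host already links here
        seen.add(key)
        kept.append((src, dst))
    return kept


edges = [
    ("http://a.example/1", "http://a.example/2"),  # intra-host
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),  # same host, same target
    ("http://c.example/1", "http://b.example/x"),
]
kept = prune_links(edges)
```

Only the links from `a.example/1` and `c.example/1` to `b.example/x` survive.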