Romil Jain - PowerPoint PPT Presentation

Slides: 28
Provided by: cseY

Transcript and Presenter's Notes


1
Introduction to Google PageRank Algorithm
- Romil Jain romilj_at_cse.yorku.ca
2
World Wide Web
  • The WWW is huge. Approximate estimates [1]:
    • 50 million active web sites
    • 25 billion web pages
    • 1 billion users
  • There are a large number of search engines too [2]:
    • At least 3,105 search engines

3
Anatomy of a Search Engine
[Figure: search engine architecture; surviving labels: User Query, WWW]
4
Ranking Module
  • The key is to find the pages that the user desires
  • Takes a set of relevant web pages and ranks them
  • Rank is generally a function of:
    • Content Score
    • Popularity Score (the focus of this talk)
  • E.g. What are some good Indian restaurants in Toronto?

5
Ranking Web Pages by Popularity
  • The PageRank algorithm, introduced by Sergey Brin and
    Larry Page in 1998 [1]
  • Exploits the link structure of the web to
    compute popularity

6
Ranking by Popularity (contd)
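$r(P_i) = \sum_{P_j \in B_i} \frac{r(P_j)}{|P_j|}$
where $B_i$ is the set of pages linking to $P_i$ and $|P_j|$ is the number of out-links of $P_j$.
  • But the $r(P_j)$ are unknown!
  • So use an iterative procedure
  • $r_0(P_j) = 1/n$, where $n$ is the number of web pages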
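
The iterative procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual iteration $r_{k+1}(P_i) = \sum_{P_j \in B_i} r_k(P_j)/|P_j|$; the three-page graph `out_links` and the function name `iterate_ranks` are hypothetical and not taken from the slides.

```python
# Hypothetical link graph: page -> pages it links to (not taken from the slides).
out_links = {
    1: [2, 3],
    2: [3],
    3: [1],
}

def iterate_ranks(out_links, steps):
    """Repeat r_{k+1}(P_i) = sum over P_j in B_i of r_k(P_j) / |P_j|, starting from 1/n."""
    n = len(out_links)
    ranks = {page: 1.0 / n for page in out_links}  # r_0(P_j) = 1/n
    for _ in range(steps):
        new_ranks = {page: 0.0 for page in out_links}
        for page_j, targets in out_links.items():
            share = ranks[page_j] / len(targets)  # r_k(P_j) / |P_j|
            for page_i in targets:
                new_ranks[page_i] += share  # P_j is in B_i
        ranks = new_ranks
    return ranks

ranks = iterate_ranks(out_links, steps=50)
```

Every page in this toy graph has at least one out-link, so the total rank stays at 1 throughout; pages without out-links need separate handling.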

7
Example
[Figure: example web graph with six pages, numbered 1 to 6]
8
Matrix Notation
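$r_0(P_j) = 1/n$
[Figure: example web graph with six pages, numbered 1 to 6]
$\pi^{(k+1)T} = \pi^{(k)T} H$, where $\pi^{(k)T}$ is the PageRank vector after the
$k$th iteration and $\pi^{(0)T} = \frac{1}{n} e^T$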
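
As a rough sketch of the matrix form, the snippet below builds $H$ with $H_{ij} = 1/|P_i|$ whenever page $i$ links to page $j$ and applies one step of $\pi^{(k+1)T} = \pi^{(k)T} H$. The six-page graph is a hypothetical example (0-indexed, and not necessarily the one drawn on the slide), and `build_H` and `vec_mat` are illustrative names.

```python
def build_H(out_links, n):
    """Row i of H holds 1/|P_i| for every page j that page i links to."""
    H = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            H[i][j] = 1.0 / len(targets)
    return H

def vec_mat(pi, M):
    """Row vector times matrix: (pi^T M)_j = sum_i pi_i * M[i][j]."""
    n = len(pi)
    return [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 6-page graph, 0-indexed; page 1 has no out-links.
out_links = {0: [1, 2], 1: [], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}
n = 6
H = build_H(out_links, n)
pi0 = [1.0 / n] * n   # pi^(0)T = (1/n) e^T
pi1 = vec_mat(pi0, H) # pi^(1)T = pi^(0)T H
```

In this example `pi1` sums to 5/6 rather than 1, because the rank held by the page without out-links is not passed on to any page.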
9
Nice (?) Properties of H
$\pi^{(k+1)T} = \pi^{(k)T} H$
  • Sparse $n \times n$ matrix
  • Less storage space (25 billion web pages!)
  • Each iteration requires $O(\mathrm{nnz}(H))$ computations.
    $H$ has about $10n$ nonzeros, so $O(n)$ computations.
  • Note that a dense matrix would require $O(n^2)$
    computations
  • The dangling nodes create zero rows in $H$. All other
    rows sum to 1. Thus $H$ is a substochastic
    matrix
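
To make the cost argument concrete, here is a small sketch that stores only the nonzero entries of $H$ (as out-link lists) and counts the multiplications in one iteration. The graph and the name `sparse_step` are hypothetical.

```python
# Hypothetical graph stored sparsely: only the non-zero entries of H are kept.
out_links = {0: [1, 2], 1: [], 2: [0, 1, 4], 3: [4, 5], 4: [3, 5], 5: [3]}
n = len(out_links)

def sparse_step(pi, out_links):
    """One step of pi^T H touching only non-zero entries; returns (new_pi, multiply_count)."""
    new_pi = [0.0] * len(pi)
    work = 0
    for i, targets in out_links.items():
        for j in targets:
            new_pi[j] += pi[i] * (1.0 / len(targets))  # H[i][j] = 1/|P_i|
            work += 1
    return new_pi, work

nnz = sum(len(targets) for targets in out_links.values())
# Row sums of H: 1 for pages with out-links, 0 for dangling pages.
row_sums = [sum(1.0 / len(t) for _ in t) for t in out_links.values()]
new_pi, work = sparse_step([1.0 / n] * n, out_links)
```

Here `work` equals `nnz` (10 multiplications), whereas a dense product over the same six pages would perform `n * n` = 36. The single zero in `row_sums` is what makes this $H$ substochastic rather than stochastic.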

10
Issues with Iterative Process
$\pi^{(k+1)T} = \pi^{(k)T} H$
  • Will it converge or continue indefinitely?
  • What properties of $H$ will ensure convergence?
  • Does convergence depend on $\pi^{(0)T}$?
  • How long will it take to converge, i.e. for what $k$ is
    the fixed point reached?
  • Does a converged $\pi^T$ give useful page ranks?

All these questions can be answered using the theory
of Markov chains and stochastic matrices
11
Stochastic Matrix
  • A stochastic matrix $S$ is
    • an $n \times n$ matrix with each row sum equal to 1
    • with $0 \le s_{ij} \le 1$ for each entry $s_{ij}$

[Figure: Markov chain for a random surfer, with its transition probability matrix]
12
Power of Stochastic Matrix
If we start from C, what is the probability that
we will reach B in 2 steps?
$P(C \to B, 2) = P(C \to A)P(A \to B) + P(C \to B)P(B \to B) + P(C \to C)P(C \to B)$
This is the (C, B) entry of $S^2$.
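
The two-step probability can be checked numerically. The sketch below uses a hypothetical three-state transition matrix (the slide's actual probabilities are not recoverable from the transcript) and compares the hand-expanded sum with the (C, B) entry of the squared matrix.

```python
states = ["A", "B", "C"]
# Hypothetical transition probabilities; each row sums to 1.
S = [
    [0.5, 0.25, 0.25],  # from A
    [0.5, 0.0, 0.5],    # from B
    [0.25, 0.25, 0.5],  # from C
]

def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

a, b, c = (states.index(s) for s in "ABC")
# P(C->A)P(A->B) + P(C->B)P(B->B) + P(C->C)P(C->B)
two_step = S[c][a] * S[a][b] + S[c][b] * S[b][b] + S[c][c] * S[c][b]
S2 = mat_mul(S, S)
```

With these assumed numbers both `two_step` and `S2[c][b]` come out to 0.1875.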
13
Power Convergence
In 3, 4, 5, 6, 7 steps?
14
State Vector Transition
If $x^T$ is a probability distribution vector over the
states, then $x^{(k+1)T} = x^{(k)T} S$.
This is similar to $\pi^{(k+1)T} = \pi^{(k)T} H$, except that $H$ is
not stochastic!
15
State Vector Convergence
$x^{(n+1)T} = x^{(n)T} S$
If we start with $x^{(0)T}$, then
$\lim_{n \to \infty} x^{(n)T} = x^{(0)T} \lim_{n \to \infty} S^n = x^{(0)T} S^{\infty} = x^T$
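
A short sketch of this convergence, again on a hypothetical three-state stochastic matrix: the state vector is pushed through $x^{(n+1)T} = x^{(n)T} S$ from two different starting distributions, and both runs end up at the same limit. The helper names `step` and `run` are illustrative.

```python
# Hypothetical stochastic matrix (rows sum to 1).
S = [
    [0.5, 0.25, 0.25],
    [0.5, 0.0, 0.5],
    [0.25, 0.25, 0.5],
]

def step(x, S):
    """x^(n+1)T = x^(n)T S."""
    n = len(x)
    return [sum(x[i] * S[i][j] for i in range(n)) for j in range(n)]

def run(x, S, steps):
    """Apply the transition `steps` times."""
    for _ in range(steps):
        x = step(x, S)
    return x

from_first = run([1.0, 0.0, 0.0], S, 60)  # start surely in the first state
from_last = run([0.0, 0.0, 1.0], S, 60)   # start surely in the last state
```

For this particular matrix both runs settle at approximately (0.4, 0.2, 0.4), which satisfies $x^T = x^T S$.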
16
H is not stochastic!
$\pi^{(k+1)T} = \pi^{(k)T} H$
17
Adjustment 1 to H
A random surfer can randomly jump to any page
after he encounters a dangling node
$S = H + a (\frac{1}{n} e^T)$
$a$ is called the dangling node vector: $a_i = 1$ if
page $i$ is dangling, otherwise 0.
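
A minimal sketch of this adjustment, assuming $H$ is given as a list of rows: every all-zero row is replaced by the uniform row $\frac{1}{n} e^T$, which is exactly what adding $a (\frac{1}{n} e^T)$ does. The matrix and the function name `dangling_fix` are hypothetical.

```python
def dangling_fix(H):
    """Return (S, a) with S = H + a (1/n) e^T, where a marks the all-zero rows of H."""
    n = len(H)
    a = [0 if any(row) else 1 for row in H]  # dangling node vector
    S = [[H[i][j] + a[i] / n for j in range(n)] for i in range(n)]
    return S, a

# Hypothetical substochastic H; the middle page is dangling.
H = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
S, a = dangling_fix(H)
```

After the fix every row of `S` sums to 1, so `S` is stochastic, while rows of non-dangling pages are left unchanged.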
18
Adjustment 2 to H
$\pi^{(k+1)T} = \pi^{(k)T} S$
$0 < s_{ij} < 1$ is not true for $S$!
A random surfer can randomly teleport to any
page irrespective of the current page.
19
Finally we have G!
$G = \alpha S + (1 - \alpha) E$, $0 \le \alpha \le 1$
$\pi^{(k+1)T} = \pi^{(k)T} G$
  • $G$ is stochastic
  • $0 < g_{ij} < 1$ is true for $G$

Therefore the above equation converges for any
$\pi^{(0)T}$
But now $G$ is no longer sparse. In fact it is
completely dense!
20
Fortunately
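$\pi^{(k+1)T} = \pi^{(k)T} G$
$G = \alpha S + (1 - \alpha) E$
$\;= \alpha S + (1 - \alpha) \frac{1}{n} e e^T$
$\;= \alpha (H + \frac{1}{n} a e^T) + (1 - \alpha) \frac{1}{n} e e^T$
$\;= \alpha H + (\alpha a + (1 - \alpha) e) \frac{1}{n} e^T$
Therefore
$\pi^{(k+1)T} = \pi^{(k)T} G$
$\;= \alpha \pi^{(k)T} H + (\alpha \pi^{(k)T} a + (1 - \alpha) \pi^{(k)T} e) \frac{1}{n} e^T$
$\;= \alpha \pi^{(k)T} H + (\alpha \pi^{(k)T} a + (1 - \alpha)) \frac{1}{n} e^T$   (*)
where the last step uses $\pi^{(k)T} e = 1$.
Now the vector multiplications are done on the extremely
sparse $H$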
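
The equivalence between the dense product $\pi^{(k)T} G$ and the sparse update (*) can be checked on a toy example. In this sketch `dense_G` and `sparse_update` are hypothetical helper names, and $H$ is held as a small list of rows only for brevity; a real implementation would keep just its nonzero entries.

```python
def dense_G(H, alpha):
    """G = alpha (H + (1/n) a e^T) + (1 - alpha) (1/n) e e^T, built explicitly."""
    n = len(H)
    a = [0.0 if any(row) else 1.0 for row in H]
    return [[alpha * (H[i][j] + a[i] / n) + (1 - alpha) / n for j in range(n)]
            for i in range(n)]

def sparse_update(pi, H, alpha):
    """pi^T G computed as alpha pi^T H + (alpha pi^T a + 1 - alpha) (1/n) e^T."""
    n = len(H)
    new_pi = [0.0] * n
    dangling_mass = 0.0  # accumulates pi^T a
    for i, row in enumerate(H):
        if not any(row):
            dangling_mass += pi[i]
            continue
        for j, h in enumerate(row):
            if h:  # skip zero entries of H
                new_pi[j] += alpha * pi[i] * h
    shift = (alpha * dangling_mass + 1 - alpha) / n
    return [x + shift for x in new_pi]

# Hypothetical H with one dangling page; pi must sum to 1 for (*) to hold.
H = [[0.0, 0.5, 0.5], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
alpha, n = 0.85, 3
pi = [1.0 / n] * n
G = dense_G(H, alpha)
dense_result = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
sparse_result = sparse_update(pi, H, alpha)
```

Both results agree up to floating-point rounding, but the sparse version never forms the dense matrix $G$.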
21
Importance of $\alpha$
$\pi^{(k+1)T} = \pi^{(k)T} G$
  • $G = \alpha S + (1 - \alpha) E$, $0 \le \alpha \le 1$
  • $\pi^{(k+1)T} = \pi^{(k)T} G$
  • What $\alpha$ must be chosen?
  • It can be shown that the rate of convergence is the
    rate at which
  • $\alpha^k \to 0$
  • $\alpha \to 0$: $\pi^T$ converges immediately, but completely
    unrealistic!
  • $\alpha \to 1$: $\pi^T$ may never converge, again unrealistic!
  • We want $\alpha$ to be as close as possible to 1

22
$\alpha = 0.85$ Saves the Day
$\pi^{(k+1)T} = \pi^{(k)T} G$
  • $G = \alpha S + (1 - \alpha) E$, $0 \le \alpha \le 1$
  • Brin and Page initially chose $\alpha = 0.85$, and this is
    still the value used by Google
  • Takes about 50 iterations (3 days) to converge
    sufficiently
  • Accuracy is $\alpha^{50} = 0.85^{50} \approx 0.000296$, which is
    sufficient for Google's needs
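
The arithmetic behind these figures is easy to reproduce. The sketch below takes the slides' claim that the error shrinks like $\alpha^k$ and asks how many iterations are needed to get below a given tolerance; `iterations_needed` is a hypothetical helper name and the tolerance values are illustrative.

```python
import math

def iterations_needed(alpha, tol):
    """Smallest k with alpha**k <= tol, assuming the error decays like alpha**k."""
    return math.ceil(math.log(tol) / math.log(alpha))

accuracy_after_50 = 0.85 ** 50            # about 0.000296
k_085 = iterations_needed(0.85, 1e-3)     # iterations for alpha = 0.85
k_099 = iterations_needed(0.99, 1e-3)     # far more iterations for alpha = 0.99
```

Moving $\alpha$ from 0.85 to 0.99 raises the iteration count for the same tolerance by more than an order of magnitude, which illustrates the trade-off between staying close to the link structure and converging quickly.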

23
Importance of Teleportation Matrix E
$\pi^{(k+1)T} = \pi^{(k)T} G$
$G = \alpha S + (1 - \alpha) E$
Initially we had $E = \frac{1}{n} e e^T$. This means that a random surfer can teleport
to any web page with equal probability $1/n$.
Instead of $\frac{1}{n} e e^T$, use $e v^T$, where $v^T$ is the
personalization or teleportation vector.
$v^T$ is used to counteract link farms (like
SearchKing.com)
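
As an illustration of how $v^T$ changes the outcome, the sketch below iterates $\pi^{(k+1)T} = \pi^{(k)T} (\alpha S + (1 - \alpha) e v^T)$ on a hypothetical three-page stochastic matrix, once with a uniform $v^T$ and once with all teleportation directed at one page. The names `google_step` and `pagerank` are illustrative.

```python
def google_step(pi, S, alpha, v):
    """One step of pi^T G with G = alpha S + (1 - alpha) e v^T (pi must sum to 1)."""
    n = len(pi)
    # pi^T e v^T = v^T because pi sums to 1, so teleportation simply adds (1 - alpha) v_j.
    return [alpha * sum(pi[i] * S[i][j] for i in range(n)) + (1 - alpha) * v[j]
            for j in range(n)]

def pagerank(S, alpha, v, steps=100):
    """Iterate from the uniform vector for a fixed number of steps."""
    n = len(S)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = google_step(pi, S, alpha, v)
    return pi

# Hypothetical stochastic matrix S (dangling rows already replaced by 1/n).
S = [
    [0.0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
    [1.0, 0.0, 0.0],
]
uniform = pagerank(S, 0.85, [1 / 3, 1 / 3, 1 / 3])
biased = pagerank(S, 0.85, [0.0, 1.0, 0.0])  # teleport only to the middle page
```

Directing teleportation at the middle page raises its score relative to the uniform case, which is the lever a search engine can use to shift rank toward trusted pages and away from manipulated ones.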
24
Issue: Sensitivity of PageRank
$\pi^{(k+1)T} = \pi^{(k)T} G$
It can be shown that
$\frac{d\pi^{(k)T}}{d\alpha} \le \frac{1}{1 - \alpha}$
As $\alpha \to 1$, $\frac{1}{1 - \alpha} \to \infty$. So PageRank is quite
sensitive to small changes in the web. Google
computes PageRank from scratch every month!
Can we compute $\pi_{i+1}$ from $\pi_i$ without computing
$\pi_{i+1}$ from scratch?
25
Issue: PageRank is Query Independent!
$\pi^{(k+1)T} = \pi^{(k)T} G$
  • PageRank is pre-computed.
  • This means that being better linked matters more
    than containing the search terms
  • This is significant because a badly linked page
    might be popular within the community of pages
    on the same topic

A rosy idea: is it feasible to compute PageRank
after the relevant documents have been retrieved?
26
Issue: PageRank is Dead!
$\pi^{(k+1)T} = \pi^{(k)T} G$
Not for now, but it is susceptible to a lot of
damage
  • PageRank is based upon an ideal democratic
    structure of the web
  • But hackers, spammers and SEOs know enough
    about Google to skew the rankings
  • Typical examples are link farms and Google
    bombs.
  • Bloggers created a bomb where, if you typed
    "miserable failure", Google would take you to
    www.whitehouse.gov!

How can we detect and fight rank skewing?
27
References
  • The size of the World Wide Web, May 2007.
    http://www.pandia.com/sew/383-web-size.html
  • Search Engines Worldwide, Jan 2003.
    http://home.inter.net/takakuwa/search/search.html
  • Langville and Meyer. Google's PageRank and
    Beyond. Princeton University Press, 2006.
  • Brin and Page. The Anatomy of a Large-scale
    Hypertextual Web Search Engine. Computer Networks
    and ISDN Systems, 1998.