1. Introduction to Google PageRank Algorithm
- Romil Jain (romilj_at_cse.yorku.ca)
2. World Wide Web
- The WWW is HUGE. Approximate estimates [1]:
- 50 million active web sites
- 25 billion web pages
- 1 billion users
- There are a large number of search engines too
- At least 3,105 search engines [2]
3. Anatomy of a Search Engine
[Figure: anatomy of a search engine, tracing a user query through the engine's modules to the WWW]
4. Ranking Module
- The key is to find those pages that the user desires
- Takes a set of relevant web pages and ranks them
- Rank is generally a function of
- Content Score
- Popularity Score (the focus of this talk)
- E.g. "What are some good Indian restaurants in Toronto?"
5. Ranking Web Pages by Popularity
- The PageRank algorithm, introduced by Sergey Brin and Larry Page in 1998 [4]
- Exploits the linked structure of the web to compute popularity
6. Ranking by Popularity (contd.)
r(P_i) = Σ_{P_j ∈ B_i} r(P_j) / |P_j|
where B_i is the set of pages linking to P_i and |P_j| is the number of outlinks of P_j
- But the r(P_j) are unknown!
- So use an iterative procedure (sketched below)
- r_0(P_j) = 1/n, where n is the number of web pages
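A minimal Python sketch of this iterative procedure, assuming a tiny hypothetical 4-page web (the link structure below is illustrative, not taken from the talk):

    # Tiny hypothetical web: page -> pages it links to
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    n = len(links)
    rank = {p: 1.0 / n for p in links}          # r_0(P_j) = 1/n

    for _ in range(20):                         # r_{k+1}(P_i) = sum over P_j in B_i of r_k(P_j)/|P_j|
        new_rank = {p: 0.0 for p in links}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # each page splits its current rank over its outlinks
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(rank)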
7. Example
[Figure: a small example web of 6 pages (nodes 1-6) connected by directed hyperlinks]
8. Matrix Notation
r_0(P_j) = 1/n
[Figure: the same 6-page example web, written as its hyperlink matrix H]
π^(k+1)T = π^(k)T H, where π^(k)T is the PageRank vector after the kth iteration and π^(0)T = (1/n) e^T
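The same iteration in matrix form, as a small numpy sketch; the 4-page hyperlink matrix H below is an assumption for illustration (it is not the 6-page example in the figure):

    import numpy as np

    # Row i of H spreads page i's rank evenly over its outlinks.
    H = np.array([
        [0.0, 0.5, 0.5, 0.0],   # page 1 links to pages 2 and 3
        [0.0, 0.0, 1.0, 0.0],   # page 2 links to page 3
        [1.0, 0.0, 0.0, 0.0],   # page 3 links to page 1
        [0.0, 0.0, 1.0, 0.0],   # page 4 links to page 3
    ])
    n = H.shape[0]
    pi = np.full(n, 1.0 / n)    # pi^(0)T = (1/n) e^T

    for _ in range(50):
        pi = pi @ H             # pi^(k+1)T = pi^(k)T H

    print(pi)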
9. Nice (?) Properties of H
π^(k+1)T = π^(k)T H
- H is very sparse, so it needs comparatively little storage space (25 billion web pages!)
- Each iteration requires Θ(nnz(H)) computations. H has about 10n nonzeros, so Θ(n) computations per iteration
- Note that a dense matrix would require Θ(n²) computations
- The dangling nodes create 0 rows in H. All other rows sum to 1. Thus H is a substochastic matrix
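A small sketch of these properties, assuming scipy is available; the 4-page H (with one dangling page) is illustrative only:

    import numpy as np
    from scipy.sparse import csr_matrix

    H = csr_matrix(np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page: no outlinks, so an all-zero row
    ]))
    print(H.nnz)                             # each iteration costs Theta(nnz(H)) operations
    row_sums = np.asarray(H.sum(axis=1)).ravel()
    print(row_sums)                          # [1. 1. 1. 0.]: rows sum to 1 or 0, so H is substochastic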
10. Issues with the Iterative Process
π^(k+1)T = π^(k)T H
- Will it converge or continue indefinitely?
- What properties of H will ensure convergence?
- Does convergence depend on π^(0)T?
- How long will it take to converge, i.e. at what k is the fixed point reached?
- Does a converged π^T give useful page ranks?
All these questions can be answered using the theory of Markov Chains and Stochastic Matrices
11. Stochastic Matrix
Markov Chain for a Random Surfer
- A stochastic matrix S is
- an n × n matrix with each row sum equal to 1
- with 0 ≤ s_ij ≤ 1 for every entry s_ij
Such a matrix is also called a Transition Probability Matrix
12. Power of a Stochastic Matrix
If we start from C, what is the probability that we will reach B in 2 steps?
P(C→B in 2 steps) = P(C→A)P(A→B) + P(C→B)P(B→B) + P(C→C)P(C→B)
This is exactly the (C, B) entry of S².
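A quick check of this in Python, using an assumed 3-state transition matrix for states A, B, C (the slide's actual figure is not reproduced here):

    import numpy as np

    S = np.array([
        [0.2, 0.5, 0.3],   # transitions from A
        [0.1, 0.6, 0.3],   # transitions from B
        [0.4, 0.4, 0.2],   # transitions from C
    ])
    A, B, C = 0, 1, 2

    two_step = S @ S                        # (S^2)[i, j] = P(i -> j in 2 steps)
    by_hand = S[C, A]*S[A, B] + S[C, B]*S[B, B] + S[C, C]*S[C, B]
    print(two_step[C, B], by_hand)          # identical: S^2 sums over all intermediate states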
13. Power Convergence
In 3, 4, 5, 6, 7 steps?
14. State Vector Transition
If x^T is a stochastic probability distribution vector over the states, then
x^(k+1)T = x^(k)T S
This is similar to π^(k+1)T = π^(k)T H, except that H is not stochastic!
15. State Vector Convergence
x^(n+1)T = x^(n)T S
If we start with x^(0)T, then
lim_{n→∞} x^(n)T = x^(0)T lim_{n→∞} S^n = x^(0)T S^∞ = x^T
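A minimal sketch of this convergence, reusing an assumed 3-state matrix: two very different starting distributions end up at the same stationary vector.

    import numpy as np

    S = np.array([
        [0.2, 0.5, 0.3],
        [0.1, 0.6, 0.3],
        [0.4, 0.4, 0.2],
    ])

    def iterate(x, steps=100):
        for _ in range(steps):
            x = x @ S          # x^(k+1)T = x^(k)T S
        return x

    print(iterate(np.array([1.0, 0.0, 0.0])))   # start surely in state A
    print(iterate(np.array([0.0, 0.0, 1.0])))   # start surely in state C
    # Both print (approximately) the same stationary vector x^T, independent of x^(0)T.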
16. H is not stochastic!
π^(k+1)T = π^(k)T H
17. Adjustment 1 to H
A random surfer can randomly jump to any page after he encounters a dangling node
S = H + a (1/n e^T)
a is called the dangling node vector: a_i = 1 if page i is dangling, 0 otherwise.
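A numpy sketch of this adjustment, again on an illustrative 4-page H with one dangling page:

    import numpy as np

    H = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page: all-zero row
    ])
    n = H.shape[0]
    a = (H.sum(axis=1) == 0).astype(float)    # dangling node vector: a_i = 1 iff row i is all zeros
    S = H + np.outer(a, np.full(n, 1.0 / n))  # S = H + a (1/n e^T)
    print(S.sum(axis=1))                      # every row now sums to 1, so S is stochastic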
18. Adjustment 2 to H
π^(k+1)T = π^(k)T S
0 < s_ij < 1 is not true for S!
A random surfer can randomly teleport to any page, irrespective of the current page.
19. Finally we have G!
G = αS + (1 - α)E,  0 ≤ α ≤ 1
π^(k+1)T = π^(k)T G
- G is stochastic
- 0 < g_ij < 1 is true for G
Therefore the above iteration converges for any π^(0)T
But now G is no longer sparse, sadly. In fact it is completely dense!
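A small sketch that builds G for an illustrative 4-page stochastic S (with E the uniform matrix (1/n) ee^T) and checks the two properties above:

    import numpy as np

    def google_matrix(S, alpha=0.85):
        n = S.shape[0]
        E = np.full((n, n), 1.0 / n)          # uniform teleportation matrix E = (1/n) e e^T
        return alpha * S + (1 - alpha) * E

    S = np.array([
        [0.0,  0.5,  0.5,  0.0],
        [0.0,  0.0,  1.0,  0.0],
        [1.0,  0.0,  0.0,  0.0],
        [0.25, 0.25, 0.25, 0.25],             # the patched dangling page from Adjustment 1
    ])
    G = google_matrix(S)
    print(G.sum(axis=1))     # rows sum to 1: G is stochastic
    print((G > 0).all())     # True: every g_ij is strictly positive, and G is completely dense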
20. Fortunately
π^(k+1)T = π^(k)T G
G = αS + (1 - α)E
  = αS + (1 - α) (1/n) ee^T
  = α(H + (1/n) ae^T) + (1 - α) (1/n) ee^T
  = αH + (αa + (1 - α)e) (1/n) e^T
Therefore
π^(k+1)T = π^(k)T G
         = α π^(k)T H + (α π^(k)T a + (1 - α) π^(k)T e) (1/n) e^T
         = α π^(k)T H + (α π^(k)T a + (1 - α)) (1/n) e^T          (*)
since π^(k)T e = 1.
Now the vector multiplications are done on the extremely sparse H (sketch below)
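A sketch of the sparse update (*): G is never formed explicitly, and only the sparse H and the dangling vector a are touched. The 4-page H is illustrative only.

    import numpy as np
    from scipy.sparse import csr_matrix

    H_dense = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page
    ])
    a = (H_dense.sum(axis=1) == 0).astype(float)
    H = csr_matrix(H_dense)
    n, alpha = H.shape[0], 0.85

    pi = np.full(n, 1.0 / n)
    for _ in range(50):
        # equation (*): alpha pi^T H + (alpha pi^T a + 1 - alpha) (1/n) e^T
        # H.T @ pi computes the row vector pi^T H (as a 1-D array);
        # the scalar term broadcasts, which is the same as adding it times (1/n) e^T.
        pi = alpha * (H.T @ pi) + (alpha * pi.dot(a) + (1 - alpha)) / n
    print(pi, pi.sum())          # pi stays a probability vector (its entries sum to 1)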
21. Importance of α
π^(k+1)T = π^(k)T G
- G = αS + (1 - α)E,  0 ≤ α ≤ 1
- π^(k+1)T = π^(k)T G
- What α must be chosen?
- It can be shown that the rate of convergence is the rate at which α^k → 0
- If α ≈ 0, π^T converges almost immediately, but this is completely unrealistic!
- If α ≈ 1, π^T may never converge; again unrealistic!
- Still, we want α to be as close to 1 as possible, so that the real link structure of the web dominates
22. α = 0.85 Saves the Day
π^(k+1)T = π^(k)T G
- G = αS + (1 - α)E,  0 ≤ α ≤ 1
- Brin and Page initially chose α = 0.85, and this is still the value used by Google
- It takes about 50 iterations (roughly 3 days) to converge sufficiently
- The accuracy is α^50 = 0.85^50 ≈ 0.000296, which is sufficient for Google's needs
23. Importance of the Teleportation Matrix E
π^(k+1)T = π^(k)T G
G = αS + (1 - α)E. Initially we had E = (1/n) ee^T. This means that a random surfer can teleport to any web page with equal probability 1/n.
Instead of (1/n) ee^T, use ev^T, where v^T is the personalization or teleportation vector.
v^T is used to counteract link farms (like SearchKing.com).
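A sketch of personalized teleportation: the uniform (1/n) e^T in update (*) is replaced by v^T. The 4-page H and the particular v^T below are assumptions for illustration.

    import numpy as np

    H = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page
    ])
    a = (H.sum(axis=1) == 0).astype(float)
    n, alpha = H.shape[0], 0.85
    v = np.array([0.7, 0.1, 0.1, 0.1])   # teleport mostly to page 1; v^T must sum to 1

    pi = np.full(n, 1.0 / n)
    for _ in range(100):
        pi = alpha * (pi @ H) + (alpha * pi.dot(a) + (1 - alpha)) * v   # v^T replaces (1/n) e^T
    print(pi)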
24. Issue: Sensitivity of PageRank
π^(k+1)T = π^(k)T G
It can be shown that
‖ dπ(α)^T / dα ‖ ≤ 1 / (1 - α)
As α → 1, 1/(1 - α) → ∞. So PageRank is quite sensitive to small changes in the web. Google computes PageRank from scratch every month!
Can we compute π_{i+1} from π_i without computing π_{i+1} from scratch?
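For concreteness, the size of the bound 1/(1 - α) for a few values of α (values other than 0.85 are illustrative):

    for alpha in (0.85, 0.95, 0.99):
        print(alpha, 1 / (1 - alpha))   # 6.67, 20, 100: the sensitivity bound blows up as alpha -> 1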
25. Issue: PageRank is Query Independent!
π^(k+1)T = π^(k)T G
- PageRank is pre-computed.
- This means that being better linked is more important than containing the search terms
- This is significant because a badly linked page might still be popular within the community of pages on the same topic
A rosy idea: Is it feasible to compute PageRank after the relevant documents have been retrieved?
26. Issue: PageRank is Dead!
π^(k+1)T = π^(k)T G
Not for now, but it is susceptible to a lot of damage
- PageRank is based upon an ideal, democratic structure of the web
- But hackers, spammers, and SEOs know too much about how Google works, and can skew the rankings
- Typical examples are Link Farms and Google Bombs
- Bloggers created a bomb where, if you typed "miserable failure", Google would take you to www.whitehouse.gov!
How can we detect and fight rank skewing?
27. References
1. The Size of the World Wide Web, May 2007. http://www.pandia.com/sew/383-web-size.html
2. Search Engines Worldwide, Jan 2003. http://home.inter.net/takakuwa/search/search.html
3. Langville and Meyer. Google's PageRank and Beyond. Princeton University Press, 2006.
4. Brin and Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.