1. Introduction to Google PageRank Algorithm
- Romil Jain (romilj_at_cse.yorku.ca)
2. World Wide Web
- The WWW is HUGE. Approximate estimates [1]:
- 50 million active web sites
- 25 billion web pages
- 1 billion users
- There are a large number of search engines too
- At least 3,105 search engines [2]
3. Anatomy of a Search Engine
[Figure: anatomy of a search engine, tracing a user query through the engine's modules to the WWW]
4. Ranking Module
- The key is to find those pages that the user desires
- Takes a set of relevant web pages and ranks them
- Rank is generally a function of
- Content Score
- Popularity Score (the focus of this talk)
- E.g. "What are some good Indian restaurants in Toronto?"
5. Ranking Web Pages by Popularity
- The PageRank algorithm, introduced by Sergey Brin and Larry Page in 1998 [4]
- Exploits the linked structure of the web to compute popularity
6. Ranking by Popularity (contd.)
r(P_i) = Σ_{P_j ∈ B_i} r(P_j) / |P_j|
where B_i is the set of pages linking to P_i and |P_j| is the number of outlinks of P_j
- But the r(P_j) are unknown!
- So use an iterative procedure (sketched below)
- r_0(P_j) = 1/n, where n is the number of web pages
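A minimal Python sketch of this iterative procedure, assuming a tiny hypothetical 4-page web (the link structure below is illustrative, not taken from the talk):

    # Tiny hypothetical web: page -> pages it links to
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    n = len(links)
    rank = {p: 1.0 / n for p in links}          # r_0(P_j) = 1/n

    for _ in range(20):                         # r_{k+1}(P_i) = sum over P_j in B_i of r_k(P_j)/|P_j|
        new_rank = {p: 0.0 for p in links}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # each page splits its current rank over its outlinks
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(rank)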
7. Example
[Figure: a small example web of 6 pages (nodes 1-6) connected by directed hyperlinks]
8. Matrix Notation
r_0(P_j) = 1/n
[Figure: the same 6-page example web, written as its hyperlink matrix H]
π^(k+1)T = π^(k)T H, where π^(k)T is the PageRank vector after the kth iteration and π^(0)T = (1/n) e^T
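The same iteration in matrix form, as a small numpy sketch; the 4-page hyperlink matrix H below is an assumption for illustration (it is not the 6-page example in the figure):

    import numpy as np

    # Row i of H spreads page i's rank evenly over its outlinks.
    H = np.array([
        [0.0, 0.5, 0.5, 0.0],   # page 1 links to pages 2 and 3
        [0.0, 0.0, 1.0, 0.0],   # page 2 links to page 3
        [1.0, 0.0, 0.0, 0.0],   # page 3 links to page 1
        [0.0, 0.0, 1.0, 0.0],   # page 4 links to page 3
    ])
    n = H.shape[0]
    pi = np.full(n, 1.0 / n)    # pi^(0)T = (1/n) e^T

    for _ in range(50):
        pi = pi @ H             # pi^(k+1)T = pi^(k)T H

    print(pi)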
9. Nice (?) Properties of H
π^(k+1)T = π^(k)T H
- H is very sparse, so it needs comparatively little storage space (25 billion web pages!)
- Each iteration requires Θ(nnz(H)) computations. H has about 10n nonzeros, so Θ(n) computations per iteration
- Note that a dense matrix would require Θ(n²) computations
- The dangling nodes create 0 rows in H. All other rows sum to 1. Thus H is a substochastic matrix
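A small sketch of these properties, assuming scipy is available; the 4-page H (with one dangling page) is illustrative only:

    import numpy as np
    from scipy.sparse import csr_matrix

    H = csr_matrix(np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page: no outlinks, so an all-zero row
    ]))
    print(H.nnz)                             # each iteration costs Theta(nnz(H)) operations
    row_sums = np.asarray(H.sum(axis=1)).ravel()
    print(row_sums)                          # [1. 1. 1. 0.]: rows sum to 1 or 0, so H is substochastic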
10. Issues with the Iterative Process
π^(k+1)T = π^(k)T H
- Will it converge or continue indefinitely?
- What properties of H will ensure convergence?
- Does convergence depend on π^(0)T?
- How long will it take to converge, i.e. at what k is the fixed point reached?
- Does a converged π^T give useful page ranks?
All these questions can be answered using the theory of Markov Chains and Stochastic Matrices
11. Stochastic Matrix
Markov Chain for a Random Surfer
- A stochastic matrix S is
- an n × n matrix with each row sum equal to 1
- with 0 ≤ s_ij ≤ 1 for every entry s_ij
Such a matrix is also called a Transition Probability Matrix
12. Power of a Stochastic Matrix
If we start from C, what is the probability that we will reach B in 2 steps?
P(C→B in 2 steps) = P(C→A)P(A→B) + P(C→B)P(B→B) + P(C→C)P(C→B)
This is exactly the (C, B) entry of S².
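A quick check of this in Python, using an assumed 3-state transition matrix for states A, B, C (the slide's actual figure is not reproduced here):

    import numpy as np

    S = np.array([
        [0.2, 0.5, 0.3],   # transitions from A
        [0.1, 0.6, 0.3],   # transitions from B
        [0.4, 0.4, 0.2],   # transitions from C
    ])
    A, B, C = 0, 1, 2

    two_step = S @ S                        # (S^2)[i, j] = P(i -> j in 2 steps)
    by_hand = S[C, A]*S[A, B] + S[C, B]*S[B, B] + S[C, C]*S[C, B]
    print(two_step[C, B], by_hand)          # identical: S^2 sums over all intermediate states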
13. Power Convergence
In 3, 4, 5, 6, 7 steps?
14. State Vector Transition
If x^T is a stochastic probability distribution vector over the states, then
x^(k+1)T = x^(k)T S
This is similar to π^(k+1)T = π^(k)T H, except that H is not stochastic!
15. State Vector Convergence
x^(n+1)T = x^(n)T S
If we start with x^(0)T, then
lim_{n→∞} x^(n)T = x^(0)T lim_{n→∞} S^n = x^(0)T S^∞ = x^T
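A minimal sketch of this convergence, reusing an assumed 3-state matrix: two very different starting distributions end up at the same stationary vector.

    import numpy as np

    S = np.array([
        [0.2, 0.5, 0.3],
        [0.1, 0.6, 0.3],
        [0.4, 0.4, 0.2],
    ])

    def iterate(x, steps=100):
        for _ in range(steps):
            x = x @ S          # x^(k+1)T = x^(k)T S
        return x

    print(iterate(np.array([1.0, 0.0, 0.0])))   # start surely in state A
    print(iterate(np.array([0.0, 0.0, 1.0])))   # start surely in state C
    # Both print (approximately) the same stationary vector x^T, independent of x^(0)T.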
16. H is not stochastic!
π^(k+1)T = π^(k)T H
17. Adjustment 1 to H
A random surfer can randomly jump to any page after he encounters a dangling node
S = H + a (1/n e^T)
a is called the dangling node vector: a_i = 1 if page i is dangling, 0 otherwise.
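A numpy sketch of this adjustment, again on an illustrative 4-page H with one dangling page:

    import numpy as np

    H = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page: all-zero row
    ])
    n = H.shape[0]
    a = (H.sum(axis=1) == 0).astype(float)    # dangling node vector: a_i = 1 iff row i is all zeros
    S = H + np.outer(a, np.full(n, 1.0 / n))  # S = H + a (1/n e^T)
    print(S.sum(axis=1))                      # every row now sums to 1, so S is stochastic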
18. Adjustment 2 to H
π^(k+1)T = π^(k)T S
0 < s_ij < 1 is not true for S!
A random surfer can randomly teleport to any page, irrespective of the current page.
19. Finally we have G!
G = αS + (1 - α)E,  0 ≤ α ≤ 1
π^(k+1)T = π^(k)T G
- G is stochastic
- 0 < g_ij < 1 is true for G
Therefore the above iteration converges for any π^(0)T
But now G is no longer sparse, sadly. In fact it is completely dense!
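A small sketch that builds G for an illustrative 4-page stochastic S (with E the uniform matrix (1/n) ee^T) and checks the two properties above:

    import numpy as np

    def google_matrix(S, alpha=0.85):
        n = S.shape[0]
        E = np.full((n, n), 1.0 / n)          # uniform teleportation matrix E = (1/n) e e^T
        return alpha * S + (1 - alpha) * E

    S = np.array([
        [0.0,  0.5,  0.5,  0.0],
        [0.0,  0.0,  1.0,  0.0],
        [1.0,  0.0,  0.0,  0.0],
        [0.25, 0.25, 0.25, 0.25],             # the patched dangling page from Adjustment 1
    ])
    G = google_matrix(S)
    print(G.sum(axis=1))     # rows sum to 1: G is stochastic
    print((G > 0).all())     # True: every g_ij is strictly positive, and G is completely dense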
20. Fortunately
π^(k+1)T = π^(k)T G
G = αS + (1 - α)E
  = αS + (1 - α) (1/n) ee^T
  = α(H + (1/n) ae^T) + (1 - α) (1/n) ee^T
  = αH + (αa + (1 - α)e) (1/n) e^T
Therefore
π^(k+1)T = π^(k)T G
         = α π^(k)T H + (α π^(k)T a + (1 - α) π^(k)T e) (1/n) e^T
         = α π^(k)T H + (α π^(k)T a + (1 - α)) (1/n) e^T          (*)
since π^(k)T e = 1.
Now the vector multiplications are done on the extremely sparse H (sketch below)
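A sketch of the sparse update (*): G is never formed explicitly, and only the sparse H and the dangling vector a are touched. The 4-page H is illustrative only.

    import numpy as np
    from scipy.sparse import csr_matrix

    H_dense = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page
    ])
    a = (H_dense.sum(axis=1) == 0).astype(float)
    H = csr_matrix(H_dense)
    n, alpha = H.shape[0], 0.85

    pi = np.full(n, 1.0 / n)
    for _ in range(50):
        # equation (*): alpha pi^T H + (alpha pi^T a + 1 - alpha) (1/n) e^T
        # H.T @ pi computes the row vector pi^T H (as a 1-D array);
        # the scalar term broadcasts, which is the same as adding it times (1/n) e^T.
        pi = alpha * (H.T @ pi) + (alpha * pi.dot(a) + (1 - alpha)) / n
    print(pi, pi.sum())          # pi stays a probability vector (its entries sum to 1)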
21. Importance of α
π^(k+1)T = π^(k)T G
- G = αS + (1 - α)E,  0 ≤ α ≤ 1
- π^(k+1)T = π^(k)T G
- What α must be chosen?
- It can be shown that the rate of convergence is the rate at which α^k → 0
- If α ≈ 0, π^T converges almost immediately, but this is completely unrealistic!
- If α ≈ 1, π^T may never converge; again unrealistic!
- Still, we want α to be as close to 1 as possible, so that the real link structure of the web dominates
22. α = 0.85 Saves the Day
π^(k+1)T = π^(k)T G
- G = αS + (1 - α)E,  0 ≤ α ≤ 1
- Brin and Page initially chose α = 0.85, and this is still the value used by Google
- It takes about 50 iterations (roughly 3 days) to converge sufficiently
- The accuracy is α^50 = 0.85^50 ≈ 0.000296, which is sufficient for Google's needs
23. Importance of the Teleportation Matrix E
π^(k+1)T = π^(k)T G
G = αS + (1 - α)E. Initially we had E = (1/n) ee^T. This means that a random surfer can teleport to any web page with equal probability 1/n.
Instead of (1/n) ee^T, use ev^T, where v^T is the personalization or teleportation vector.
v^T is used to counteract link farms (like SearchKing.com).
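A sketch of personalized teleportation: the uniform (1/n) e^T in update (*) is replaced by v^T. The 4-page H and the particular v^T below are assumptions for illustration.

    import numpy as np

    H = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],   # dangling page
    ])
    a = (H.sum(axis=1) == 0).astype(float)
    n, alpha = H.shape[0], 0.85
    v = np.array([0.7, 0.1, 0.1, 0.1])   # teleport mostly to page 1; v^T must sum to 1

    pi = np.full(n, 1.0 / n)
    for _ in range(100):
        pi = alpha * (pi @ H) + (alpha * pi.dot(a) + (1 - alpha)) * v   # v^T replaces (1/n) e^T
    print(pi)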
24. Issue: Sensitivity of PageRank
π^(k+1)T = π^(k)T G
It can be shown that
‖ dπ(α)^T / dα ‖ ≤ 1 / (1 - α)
As α → 1, 1/(1 - α) → ∞. So PageRank is quite sensitive to small changes in the web. Google computes PageRank from scratch every month!
Can we compute π_{i+1} from π_i without computing π_{i+1} from scratch?
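For concreteness, the size of the bound 1/(1 - α) for a few values of α (values other than 0.85 are illustrative):

    for alpha in (0.85, 0.95, 0.99):
        print(alpha, 1 / (1 - alpha))   # 6.67, 20, 100: the sensitivity bound blows up as alpha -> 1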
25. Issue: PageRank is Query Independent!
π^(k+1)T = π^(k)T G
- PageRank is pre-computed.
- This means that being better linked is more important than containing the search terms
- This is significant because a badly linked page might still be popular within the community of pages on the same topic
A rosy idea: Is it feasible to compute PageRank after the relevant documents have been retrieved?
26. Issue: PageRank is Dead!
π^(k+1)T = π^(k)T G
Not for now, but it is susceptible to a lot of damage
- PageRank is based upon an ideal, democratic structure of the web
- But hackers, spammers, and SEOs know too much about how Google works, and can skew the rankings
- Typical examples are Link Farms and Google Bombs
- Bloggers created a bomb where, if you typed "miserable failure", Google would take you to www.whitehouse.gov!
How can we detect and fight rank skewing?
27. References
1. The Size of the World Wide Web, May 2007. http://www.pandia.com/sew/383-web-size.html
2. Search Engines Worldwide, Jan 2003. http://home.inter.net/takakuwa/search/search.html
3. Langville and Meyer. Google's PageRank and Beyond. Princeton University Press, 2006.
4. Brin and Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.