Title: Algorithms for Large Data Sets
1. Algorithms for Large Data Sets
Lecture 4: April 9, 2006
http://www.ee.technion.ac.il/courses/049011
2. Crash Course in Algebra and Markov Chains
3. Ranking Algorithms
4. PageRank, Attempt 1
- Additional conditions:
- r is non-negative: r ≥ 0
- r is normalized: ||r||1 = 1
- B: normalized adjacency matrix
- Then: r·B = r
- That is, r is a non-negative normalized left eigenvector of B with eigenvalue 1
5. PageRank, Attempt 1
- A solution exists only if B has 1 as an eigenvalue
- Problem: B may not have 1 as an eigenvalue, because some of its rows are all 0
- Example: a graph containing a sink (a page with no outlinks)
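The sink problem can be seen concretely. Below is a minimal sketch in pure Python with a hypothetical 3-node graph (not from the slides): page 2 has no outlinks, so its row in B is all zeros, and one step of r → r·B loses rank mass.

```python
# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is a sink.
def normalized_adjacency(links, n):
    """B[p][q] = 1/outdeg(p) if p links to q, else 0."""
    B = [[0.0] * n for _ in range(n)]
    for p, q in links:
        B[p][q] = 1.0  # mark the link
    for p in range(n):
        deg = sum(B[p])
        if deg > 0:
            B[p] = [x / deg for x in B[p]]  # normalize the row
    return B

B = normalized_adjacency([(0, 1), (0, 2), (1, 2)], 3)

# The sink's row is all zeros, so B is not row-stochastic:
print(B[2])  # [0.0, 0.0, 0.0]

# One step of r -> r·B leaks the mass sitting on the sink:
r = [1 / 3, 1 / 3, 1 / 3]
r_next = [sum(r[p] * B[p][q] for p in range(3)) for q in range(3)]
print(sum(r_next))  # total mass < 1, so eigenvalue 1 is lost
```

Because total mass shrinks at every step, no normalized fixed point r = r·B can exist for this graph.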
6. PageRank, Attempt 2
- ℓ: normalization constant
- Then: ℓ·r·B = r
- That is, r is a non-negative normalized left eigenvector of B with eigenvalue 1/ℓ
7. PageRank, Attempt 2
- Any nonzero eigenvalue λ of B may give a solution:
- ℓ = 1/λ
- r: any non-negative normalized left eigenvector of B with eigenvalue λ
- Which solution to pick?
- Pick a principal eigenvector (i.e., one corresponding to the maximal λ)
- How to find a solution?
- Power iterations
8. PageRank, Attempt 2
- Problem 1: The maximal eigenvalue may have multiplicity > 1
- Several possible solutions; happens, for example, when the graph is disconnected
- Problem 2: Rank accumulates at sinks
- Only sinks, or nodes from which no sink is reachable, can have nonzero rank mass
9. PageRank, Final Definition
- e: rank source vector
- Standard setting: e(p) = α/n for all p (α < 1)
- 1: the all-1s vector
- Then: ℓ·r·(B + 1eᵀ) = r
- Note that r·(B + 1eᵀ) = r·B + eᵀ, since r·1 = ||r||1 = 1
- That is, r is a non-negative normalized left eigenvector of (B + 1eᵀ) with eigenvalue 1/ℓ
10. PageRank, Final Definition
- Any nonzero eigenvalue of (B + 1eᵀ) may give a solution
- Pick r to be a principal left eigenvector of (B + 1eᵀ)
- Will show:
- The principal eigenvalue has multiplicity 1, for any graph
- There exists a non-negative left eigenvector
- Hence, PageRank always exists and is uniquely defined
- Due to the rank source vector, rank no longer accumulates at sinks
11. An Alternative View of PageRank: The Random Surfer Model
- When visiting a page p, a random surfer:
- With probability 1 - d, selects a random outlink p → q and goes to visit q (focused browsing)
- With probability d, jumps to a random web page q (loss of interest)
- If p has no outlinks, assume it has a self loop
- P: probability transition matrix
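The surfer process can be simulated directly; in the limit, visit frequencies approximate the stationary distribution. A minimal sketch in pure Python, where the 4-node graph, d = 0.15, the seed, and the step count are all hypothetical choices:

```python
import random

random.seed(0)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # hypothetical web graph
n, d = 4, 0.15
visits = [0] * n

page = 0
steps = 200_000
for _ in range(steps):
    visits[page] += 1
    if random.random() < d:
        page = random.randrange(n)            # loss of interest: random jump
    elif links[page]:
        page = random.choice(links[page])     # focused browsing: random outlink
    # else: a sink with a self loop -> stay on the same page

freq = [v / steps for v in visits]
print(freq)  # visit frequencies approximate the PageRank vector r
```

Page 2, which collects the most inlinks, ends up with the highest visit frequency.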
12. PageRank = Random Surfer Model
- Suppose the rank source vector is uniform: e = (α/n)·1 with α = d/(1 - d)
- Then: P = (1 - d)·(B + 1eᵀ)
- Since P is a positive scalar multiple of (B + 1eᵀ), the two matrices have the same left eigenvectors
- Therefore, r is a principal left eigenvector of (B + 1eᵀ) if and only if it is a principal left eigenvector of P
13. PageRank & Markov Chains
- The PageRank vector is the normalized principal left eigenvector of (B + 1eᵀ)
- Hence, the PageRank vector is also a principal left eigenvector of P
- Conclusion: PageRank is the unique stationary distribution of the random surfer Markov Chain
- PageRank(p) = r(p) = probability of the random surfer visiting page p, in the limit
- Note: the random jump guarantees the Markov Chain is ergodic
14. PageRank Computation
- Power iterations: repeatedly apply r ← r·P, keeping r normalized
- In practice, about 50 iterations suffice
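The computation can be sketched in pure Python. The example graph is hypothetical, and d = 0.15 and 50 iterations follow the standard settings mentioned in the slides:

```python
def pagerank(links, n, d=0.15, iters=50):
    """Power iteration on the random-surfer transition matrix P.

    links[p] = list of pages p links to; sinks get a self loop."""
    r = [1.0 / n] * n
    for _ in range(iters):
        nxt = [d / n] * n                    # random-jump mass, spread uniformly
        for p in range(n):
            out = links.get(p, [])
            if out:
                share = (1 - d) * r[p] / len(out)
                for q in out:                # focused-browsing mass
                    nxt[q] += share
            else:
                nxt[p] += (1 - d) * r[p]     # sink: self loop keeps the mass
        r = nxt
    return r

r = pagerank({0: [1, 2], 1: [2], 2: [0], 3: [2]}, 4)
print(r)  # sums to 1; page 2, with the most inlinks, ranks highest
```

Each iteration preserves total mass (d from the jump term plus (1 - d) redistributed along links), so no renormalization step is needed here.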
15. HITS: Hubs and Authorities [Kleinberg, 1997]
- HITS = Hyperlink Induced Topic Search
- Main principle: every page p is associated with two scores:
- Authority score: how authoritative a page is about the query's topic
- Ex: query "IR"; authorities: scientific IR papers
- Ex: query "automobile manufacturers"; authorities: Mazda, Toyota, and GM web sites
- Hub score: how good the page is as a resource list about the query's topic
- Ex: query "IR"; hubs: surveys and books about IR
- Ex: query "automobile manufacturers"; hubs: KBB, car link lists
16. Mutual Reinforcement
- HITS principles:
- p is a good authority if it is linked to by many good hubs
- p is a good hub if it points to many good authorities
17. HITS: Algebraic Form
- a: authority vector
- h: hub vector
- A: adjacency matrix
- Mutual reinforcement: a ∝ Aᵀh and h ∝ A·a
- Hence, a is a principal eigenvector of AᵀA
- And h is a principal eigenvector of AAᵀ
- Need to deal with the same issues as in PageRank
18. Co-Citation and Bibliographic Coupling
- AᵀA: co-citation matrix
- (AᵀA)p,q = # of pages that link to both p and q
- Thus, authority scores propagate through co-citation
- AAᵀ: bibliographic coupling matrix
- (AAᵀ)p,q = # of pages that both p and q link to
- Thus, hub scores propagate through bibliographic coupling
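These counting identities are easy to verify on a toy graph. A minimal sketch in pure Python, with a hypothetical 4-page graph in which pages 0 and 1 both link to pages 2 and 3:

```python
def matmul(X, Y):
    """Plain matrix product of lists-of-lists."""
    n, m, k = len(X), len(Y[0]), len(Y)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(X):
    return [list(col) for col in zip(*X)]

A = [[0, 0, 1, 1],   # page 0 links to 2 and 3
     [0, 0, 1, 1],   # page 1 links to 2 and 3
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

cocitation = matmul(transpose(A), A)  # (A^T A)[p][q] = # pages linking to both p and q
coupling = matmul(A, transpose(A))    # (A A^T)[p][q] = # pages both p and q link to

print(cocitation[2][3])  # 2: pages 0 and 1 each link to both 2 and 3
print(coupling[0][1])    # 2: pages 0 and 1 both link to 2 and 3
```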
19. HITS Computation
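The mutual-reinforcement iterations can be sketched in pure Python; the graph and the iteration count are hypothetical choices, and L1 normalization stands in for whichever norm the original slides used:

```python
def hits(links, n, iters=50):
    """links[p] = pages p points to; returns (authority, hub) vectors."""
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # authority score of q: sum of hub scores of pages linking to q
        a_new = [0.0] * n
        for p, outs in links.items():
            for q in outs:
                a_new[q] += h[p]
        # hub score of p: sum of authority scores of pages p points to
        h_new = [sum(a_new[q] for q in links.get(p, [])) for p in range(n)]
        # normalize to unit L1 norm to keep the values bounded
        sa, sh = sum(a_new), sum(h_new)
        a = [x / sa for x in a_new] if sa else a_new
        h = [x / sh for x in h_new] if sh else h_new
    return a, h

a, h = hits({0: [2, 3], 1: [2, 3], 2: [], 3: []}, 4)
# pages 2 and 3 collect the authority mass; pages 0 and 1 are the hubs
```

These updates are exactly power iterations on AᵀA (for a) and AAᵀ (for h), interleaved.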
20. Principal Eigenvector Computation
- E: an n × n matrix
- λ1 > λ2 > … > λn: eigenvalues of E
- v1, …, vn: corresponding eigenvectors
- Eigenvectors are linearly independent
- Input:
- The matrix E
- The principal eigenvalue λ1
- A unit vector u, which is not orthogonal to v1
- Goal: compute v1
21. The Power Method
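A minimal sketch of the method in pure Python: iterate w ← w·E from the start vector u, renormalizing each step to avoid overflow. The example matrix and the 100-iteration budget are hypothetical:

```python
def power_method(E, u, iters=100):
    """Iterate w <- w·E from start vector u.

    Converges (up to scaling) to the principal eigenvector v1 when
    |lambda1| > |lambda2| and u is not orthogonal to v1."""
    n = len(E)
    w = list(u)
    for _ in range(iters):
        w = [sum(w[i] * E[i][j] for i in range(n)) for j in range(n)]
        norm = max(abs(x) for x in w)   # renormalize to avoid overflow
        w = [x / norm for x in w]
    return w

# Diagonal example: eigenvalues 3 and 1, principal eigenvector (1, 0)
E = [[3.0, 0.0],
     [0.0, 1.0]]
w = power_method(E, [1.0, 1.0])
print(w)  # close to [1.0, 0.0]
```

On this diagonal matrix the second component shrinks by a factor of λ2/λ1 = 1/3 per iteration, which is exactly the convergence rate stated on the next slide.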
22. Why Does It Work?
- Theorem: as t → ∞, wt/λ1ᵗ → c·v1 (c is a constant)
- Convergence rate: proportional to (λ2/λ1)ᵗ
- The larger the spectral gap λ1 - λ2, the faster the convergence
23. End of Lecture 4