Title: Algorithms (wait, Math?) Everywhere
1Algorithms (wait, Math?) Everywhere
- Gerald Kruse, PhD.
- John '54 and Irene '58 Dale Professor of MA, CS and IT
- Interim Assistant Provost 2013-14
- Juniata College
- Huntingdon, PA
- kruse_at_juniata.edu
- http://faculty.juniata.edu/kruse
2Some Context / Confessions
- Prepare to be underwhelmed. I can't return the hour or so you spend here.
- I am impressed by the elegance of the algorithms I will present today, and I will probably try too hard to explain the underlying math (but it's so cool).
- We like and depend on many automated processes; we just have issues implementing or interacting with them.
- But when we understand an algorithm, we can manipulate it (my CS 315 students "Google Bombed" Juniata in a good way).
- Are we really surprised to learn that a Google search isn't free?
3What movie should we pick?
$1,000,000 to the first algorithm that was 10% better than Netflix's original algorithm
5The first 8% improvement was easy
"Just A Guy In A Garage": psychiatrist father and hacker daughter team
6The first 8% improvement was easy
Team from Bell Labs ended up winning
7Here's an interesting billboard, from a few years ago in Silicon Valley
8First 70 digits of e:
2.718281828459045235360287471352662497757247093699959574966967627724077
11What happened for those who found the answer?
- The answer is 7427466391
- Those who typed in the URL, http://7427466391.com, ended up getting another puzzle. Solving that led them to a page with a job application for Google!
13(1) Just what does it take to solve that problem?
Calculations (most probably on a computer), knowledge of number theory, and a general aptitude for and interest in problem solving.
First Question
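The billboard image is not part of this transcript. Assuming it posed the widely reported puzzle, "the first 10-digit prime found in consecutive digits of e", a minimal Python sketch of the calculation could look like the following. The function names and the choice of scanning 200 digits are illustrative assumptions, not part of the talk.

```python
from decimal import Decimal, getcontext


def digits_of_e(count):
    """Return the first `count` digits of e as a string, without the decimal point."""
    getcontext().prec = count + 20  # guard digits so rounding cannot touch the part we keep
    return str(Decimal(1).exp()).replace(".", "")[:count]


def is_prime(n):
    """Trial division; fast enough for 10-digit candidates."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_prime_window(digits, width=10):
    """Slide a window of `width` digits and return the first prime window, or None."""
    for start in range(len(digits) - width + 1):
        window = digits[start:start + width]
        # A window with a leading 0 is not a genuine 10-digit number, so skip it.
        if window[0] != "0" and is_prime(int(window)):
            return window
    return None


answer = first_prime_window(digits_of_e(200))
print(answer)
```

Under that assumption, the search should land on the same 10-digit answer the slides quote. The number theory enters through the primality test; the "calculation on a computer" is the sliding window over the digits.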
16(2) Why does Google want to hire people who know how to find that number, and what does it have to do with a search engine?
Hmmm... Google wants you to choose it for your web searches. Maybe their algorithms are mathematically based?
Second Question
17 18- Results in an early paper from Page, Brin, et al. while in graduate school
19Search Engines: We've all used them, but what is under the hood?
- Crawl the web and locate all public pages
- Index the crawled data so it can be searched
- Rank the pages for more effective searching (the math part of this talk)
- Each word which is searched on is linked with a list of pages (just URLs) which contain it. The pages with the highest rank are returned first.
- Can't get a snapshot of the web at a particular instant
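The index-then-rank idea on this slide can be sketched in a few lines of Python. The URLs, page text, and rank values below are made up for illustration, and the function names are hypothetical; a real search engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical crawled pages and made-up rank values, for illustration only.
pages = {
    "http://example.com/a": "eigenvector methods for ranking web pages",
    "http://example.com/b": "ranking movies with algorithms",
    "http://example.com/c": "web search and ranking algorithms",
}
rank = {
    "http://example.com/a": 0.4,
    "http://example.com/b": 0.2,
    "http://example.com/c": 0.4,
}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, rank, word):
    """Return the URLs containing `word`, highest rank first (ties broken by URL)."""
    return sorted(index.get(word.lower(), set()), key=lambda url: (-rank[url], url))


index = build_index(pages)
print(search(index, rank, "algorithms"))
```

The index only answers "which pages contain this word"; the order in which they come back is decided entirely by the separate rank values, which is where the math in the rest of the talk comes in.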
20Note: Google's PageRank uses the link structure (crowdsourcing) of the World Wide Web to determine a page's rank; it doesn't grade the content of a page.
23PageRank is NOT a simple citation index
Which is the more popular page below, A or B?
What if the links to A were from unpopular pages, and the one link to B was from www.yahoo.com? (High School)
A
B
- NOTE
- Rankings based on citation index would be very easy to manipulate
- PageRank has evolved to be a minor part of Google's search results.
26Intuitively, PageRank is analogous to popularity
- The web as a graph: each page is a vertex, each hyperlink a directed edge.
- A page is popular if a few very popular pages point (via hyperlinks) to it.
- A page could be popular if many not-necessarily popular pages point (via hyperlinks) to it.
Page A
Page B
Page C
Which of these three would have the highest page rank?
27So what is the mathematical definition of PageRank?
- In particular, a page's rank is equal to the sum of the ranks of all the pages pointing to it.
- Note the scaling of each page rank: a page's rank is divided by its number of outgoing links before it is added to the sum.
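The slide's own formula is not reproduced in this transcript. One common way to write the definition, using the notation of Bryan and Leise [2] (an assumption here), is:

```latex
% x_k : rank of page k
% L_k : set of pages that link to page k
% n_j : number of outgoing links on page j
x_k = \sum_{j \in L_k} \frac{x_j}{n_j}
```

The division by $n_j$ is the scaling: each page splits its "vote" evenly among the pages it links to.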
28Writing out the equation for each web-page in our
example gives
Page A
Page B
Page C
31Even though this is a circular definition, we can calculate the ranks.
Re-write the system of equations as a Matrix-Vector product.
The PageRank vector is simply an eigenvector of the coefficient matrix, with eigenvalue $\lambda = 1$.
32Wait, what's an eigenvector?
33PageRank 0.4
PageRank 0.2
Page A
Page B
Page C
PageRank 0.4
Note: we choose the eigenvector with entries that sum to 1.
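The link structure of the three-page example is not recoverable from this transcript. As a sketch, assume the hypothetical links A → B, A → C, B → C, and C → A, which happen to reproduce the 0.4, 0.2, 0.4 values shown on the slide. Repeatedly applying the rank equations (power iteration) then settles on the eigenvector whose entries sum to 1.

```python
# Hypothetical link structure (an assumption, chosen to match the ranks on the slide).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def page_rank(links, iterations=200):
    """Power iteration on the simple formula: each page splits its rank among its out-links."""
    pages = sorted(links)
    x = {p: 1.0 / len(pages) for p in pages}  # start with equal ranks that sum to 1
    for _ in range(iterations):
        new_x = {p: 0.0 for p in pages}
        for page, targets in links.items():
            share = x[page] / len(targets)  # the scaling by number of outgoing links
            for target in targets:
                new_x[target] += share
        x = new_x
    return x


ranks = page_rank(links)
print({page: round(value, 3) for page, value in ranks.items()})
```

Because every page passes on all of its rank at each step, the total stays at 1 throughout, so no separate normalization is needed in this small example.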
34Implementation Details
- Billions of web-pages would make a huge matrix
- The matrix (in theory) is column-stochastic, which allows for iterative calculation
- Previous PageRank is used as an initial guess
- Random-Surfer term handles computational difficulties associated with a disconnected graph
35Wait, what else gets searched?
40- Attempts to Manipulate Search Results
- Via a Google Bomb
41- Liberals vs. Conservatives!
- In 2007, Google addressed Google Bombs; too many people thought the results were intentional and not merely a function of the structure of the web
42 43- At Juniata, CS 315 is my Analysis and
Algorithms course
44- Try a search in Google on PigeonRank.
- What types of sites would Google NOT give good results on?
- PageRank has been deprecated. Google is continuously trying new ranking algorithms.
45- A rules approach: filter out all messages with things like "Dear Friend" or "Click".
- The first 80% is captured easily, with few false-positives.
- But the last few percent (remember Netflix) will be difficult to catch, the rules will produce many more false-positives, and the SPAMMers can adapt.
- A statistical approach, called a Bayesian filter, is much more effective.
- It learns from a given set of SPAM and non-SPAM emails, automatically counting the frequency of words.
- Some words are incriminating, like "Madam"; others almost guarantee the email is non-SPAM, like "describe" or "example".
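As a rough illustration of the statistical approach, here is a minimal sketch of one common variant, a naive Bayes classifier with add-one smoothing. It is not necessarily the filter the slide has in mind, and the six training messages are invented for the example.

```python
import math
from collections import Counter

# Tiny made-up training set, for illustration only.
spam = ["dear friend click here", "madam click now", "dear madam free offer"]
ham = ["please describe the example", "for example see the notes", "describe your algorithm"]


def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    return Counter(word for message in messages for word in message.lower().split())


spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocabulary = set(spam_counts) | set(ham_counts)


def log_likelihood(words, counts):
    """Sum of log P(word | class), with add-one smoothing so unseen words are not fatal."""
    total = sum(counts.values()) + len(vocabulary)
    return sum(math.log((counts[w] + 1) / total) for w in words if w in vocabulary)


def spam_probability(message):
    """Posterior P(spam | message) under the naive word-independence assumption."""
    words = message.lower().split()
    prior_spam = len(spam) / (len(spam) + len(ham))
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_counts)
    log_ham = math.log(1 - prior_spam) + log_likelihood(words, ham_counts)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


print(spam_probability("dear madam click"))      # close to 1: incriminating words
print(spam_probability("describe the example"))  # close to 0: words typical of non-SPAM
```

Nothing here is a hand-written rule: retraining on new SPAM automatically shifts the word frequencies, which is why this approach adapts better than fixed rules.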
46[1] S. Brin, L. Page, et al., "The PageRank Citation Ranking: Bringing Order to the Web," http://dbpubs.stanford.edu/pub/1999-66, Stanford Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, "The $25,000,000,000 Eigenvector: The Linear Algebra behind Google," SIAM Review, 48 (2006), pp. 569-581.
[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole, Boston, MA, 2005.
[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole, Boston, MA, 2005.
47Any Questions?
- Slides available at http://faculty.juniata.edu/kruse
48The following slides give some of the more
in-depth mathematics behind Google
50A Graphical Interpretation of a 2-Dimensional Eigenvector
http://cnx.org/content/m10736/latest/
If we have some 2-D vector $x$ and some 2 x 2 matrix $A$, generally their product, $Ax = b$, will result in a new vector, $b$, which points in a different direction and has a different length than $x$. But if the vector ($v$ in the image at the left) is an eigenvector of $A$, then $Av$ will give a vector in the same direction as $v$, just scaled to a different length, by $\lambda$. Note that $\lambda$ is called an eigenvalue of $A$.
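A small numerical check makes the picture concrete. The 2 x 2 matrix below is a hypothetical example (not the one in the slide's image), chosen so the eigenvector is easy to verify by hand.

```python
# Hypothetical 2 x 2 matrix, chosen so its eigenvectors are easy to verify by hand.
A = [[2.0, 1.0],
     [1.0, 2.0]]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def is_parallel(u, v, tol=1e-12):
    """Two 2-D vectors are parallel when their 2-D cross product is zero."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol


x = [1.0, 0.0]  # a generic vector
v = [1.0, 1.0]  # an eigenvector of A with eigenvalue 3

print(mat_vec(A, x), is_parallel(x, mat_vec(A, x)))  # direction changes
print(mat_vec(A, v), is_parallel(v, mat_vec(A, v)))  # same direction, scaled by 3
```

Here $Ax = (2, 1)$ points somewhere new, while $Av = (3, 3) = 3v$ stays on the same line as $v$.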
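51Note that the coefficient matrix is column-stochastic
Every column-stochastic matrix has 1 as an eigenvalue. The ranking this eigenvalue gives is usable as long as there are no dangling nodes (which break column-stochasticity) and the graph is connected (a disconnected graph can give non-unique rankings).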
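The claim that a column-stochastic matrix always has 1 as an eigenvalue follows from a short argument, sketched here in LaTeX. The symbol $\mathbf{1}$ for the all-ones vector is notation introduced for this note.

```latex
% A is n x n and column-stochastic: nonnegative entries, each column sums to 1.
% Let \mathbf{1} be the column vector of all ones.
\sum_{i=1}^{n} A_{ij} = 1 \ \text{ for every } j
\quad\Longleftrightarrow\quad
A^{T}\mathbf{1} = \mathbf{1}
% So 1 is an eigenvalue of A^T. A and A^T share a characteristic polynomial,
\det(A - \lambda I) = \det(A^{T} - \lambda I)
% hence 1 is also an eigenvalue of A.
```

The argument says nothing about uniqueness of the eigenvector, which is where connectivity of the graph matters.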
52Dangling Nodes have no outgoing links
In this example, Page C is a dangling node. Note that its associated column in the coefficient matrix is all 0. Matrices like these are called column-substochastic.
Page A
Page C
Page B
In Page, Brin, et al. [1], they suggest dangling nodes most likely would occur from pages which haven't been crawled yet, and so they simply remove them from the system until all the PageRanks are calculated.
It is interesting to note that a column-substochastic matrix does have a positive eigenvalue and a corresponding eigenvector with non-negative entries, which is called the Perron eigenvector, as detailed in Bryan and Leise [2].
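To see the all-zero column directly, the sketch below builds the coefficient matrix for a hypothetical three-page web in which Page C has no outgoing links. The links among A and B are an assumption, since the slide's figure is not part of this transcript.

```python
# Hypothetical three-page web in which Page C has no outgoing links (an assumed structure).
links = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
pages = sorted(links)


def link_matrix(links, pages):
    """Entry [i][j] is 1/(out-links of page j) if page j links to page i, else 0."""
    position = {page: i for i, page in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in pages]
    for source, targets in links.items():
        for target in targets:
            matrix[position[target]][position[source]] = 1.0 / len(targets)
    return matrix


A = link_matrix(links, pages)
column_sums = [sum(row[j] for row in A) for j in range(len(pages))]
dangling = [page for page, total in zip(pages, column_sums) if total == 0.0]

print(column_sums)  # the column for a dangling node sums to 0 instead of 1
print(dangling)
```

A column that sums to 0 rather than 1 is exactly what makes the matrix column-substochastic instead of column-stochastic.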
53A disconnected graph could lead to non-unique rankings
Notice the block diagonal structure of the coefficient matrix. Note: re-ordering via permutation doesn't change the ranking, as in [2].
Page C
Page A
Page E
Page D
Page B
In this example, the eigenspace associated with eigenvalue $\lambda = 1$ is two-dimensional. Which eigenvector should be used for ranking?
54Add a random-surfer term to the simple PageRank formula.
Let $S$ be an n x n matrix with all entries $1/n$. $S$ is column-stochastic, and we consider the matrix $M = (1-m)A + mS$, which is a weighted average of $A$ and $S$.
- This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink. Originally, $m = 0.15$ in Google, according to [2].
- $Mx = x$ can also be written as $x = (1-m)Ax + ms$.
Important Note: We will use this formulation with $A$ when computing $x$, and $s$ is a column vector with all entries $1/n$, where $Sx = s$ if $\sum_i x_i = 1$.
55$M$ for our previous disconnected graph, with $m = 0.15$
Page C
Page A
Page E
Page D
Page B
The eigenspace associated with $\lambda = 1$ is one-dimensional, and the normalized eigenvector is
So the addition of the random surfer term permits comparison between pages in different subwebs.
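The effect of the random-surfer term can be checked on a small case. The five-page link structure below is a hypothetical disconnected web (the slide's own figure and matrix are not in this transcript); the value 0.15 for the weight is the one quoted in the talk.

```python
# Hypothetical disconnected web (an assumption; the slide's figure is not in this transcript):
# pages A and B link only to each other, C and D link to each other, E links to C and D.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
pages = sorted(links)
n = len(pages)
m = 0.15  # random-surfer weight quoted in the talk

# Build the link matrix A (column j holds 1/outdegree for each page that page j links to).
position = {page: i for i, page in enumerate(pages)}
A = [[0.0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        A[position[target]][position[source]] = 1.0 / len(targets)

# M = (1 - m) A + m S, where every entry of S is 1/n.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

# Power iteration: x <- M x, starting from equal ranks that sum to 1.
x = [1.0 / n] * n
for _ in range(300):
    x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

print({page: round(value, 3) for page, value in zip(pages, x)})
```

For this assumed graph the iteration settles on a single ranking (0.2 for A and B, 0.285 for C and D, 0.03 for E), so pages in the two separate subwebs become directly comparable.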
56Iterative Calculation
By many estimates, the web currently contains at least 8 billion pages. How does Google compute an eigenvector for something this large?
One possibility is the power method.
In [2], it is shown that every positive (all entries are > 0) column-stochastic matrix $M$ has a unique vector $q$ with positive components such that $Mq = q$, with $\|q\|_1 = 1$, and it can be computed as $q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.
57Iterative Calculation continued
Rather than calculating the powers of $M$ directly, we could use the iteration $x_k = M x_{k-1}$. Since $M$ is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation. As we mentioned previously, Google uses the equivalent expression $x_k = (1-m)A x_{k-1} + ms$ in the computation. These products can be calculated without explicitly creating the huge coefficient matrix, since $A$ contains mostly 0s. The iteration is guaranteed to converge, and it will converge more quickly with a better first guess, so the previous PageRank vector is used as the initial vector.
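A minimal sketch of that matrix-free iteration follows. It stores only each page's list of outgoing links, so the work per step is proportional to the number of links rather than to n squared. The function name, tolerance, and five-page test web are assumptions made for illustration; this is not Google's implementation.

```python
def page_rank(links, m=0.15, tol=1e-12, initial=None, max_iterations=1000):
    """Iterate x <- (1 - m) A x + m s using only the link lists, never the n x n matrix.

    `links` maps each page to the pages it links to; every page is assumed to have
    at least one outgoing link (no dangling nodes). Returns (ranks, steps taken).
    """
    pages = list(links)
    n = len(pages)
    x = dict(initial) if initial else {page: 1.0 / n for page in pages}
    for step in range(1, max_iterations + 1):
        new_x = {page: m / n for page in pages}        # the m * s term
        for source, targets in links.items():          # the (1 - m) * A x term
            share = (1 - m) * x[source] / len(targets)
            for target in targets:
                new_x[target] += share
        change = sum(abs(new_x[page] - x[page]) for page in pages)
        x = new_x
        if change < tol:
            break
    return x, step


# Hypothetical five-page web, for illustration only.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
ranks, steps = page_rank(links)
print({page: round(value, 3) for page, value in ranks.items()}, steps)

# Restarting from the previous ranks (a better first guess) needs far fewer steps.
ranks_again, steps_again = page_rank(links, initial=ranks)
print(steps_again)
```

The second call mirrors the remark about the initial vector: when the web has changed little since the last computation, starting from the old ranks leaves very little work to do.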
58This gives a regular matrix
- In matrix notation we have $x = (1-m)Ax + ms$
- Since $Sx = s$ when the entries of $x$ sum to 1, we can rewrite this as $x = \left((1-m)A + mS\right)x = Mx$
- The new coefficient matrix is regular, so we can calculate the eigenvector iteratively.
- This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.