Title: Web Intelligence Search and Ranking
1. Web Intelligence: Search and Ranking

2. Today
- The anatomy of search engines (read it yourself)
- The key design goal(s) for search engines
- Why Google is good: PageRank and anchor text
- Coursework 3 (last slide)
3. The Anatomy of a Search Engine
A classic paper from the founders of Google, available at the course site.
- Challenges faced by a WWW search system
- Design goals
- Google's key ideas (for improved search/relevance)
- System design of Google
4. The one-slide guide to search engines
- A search engine's back end is simply an index of the pages on the web, in precisely the same way that a book's index is an index of the pages in the book.
- In a book, to find the pages that discuss apples, you look up "apples" in the index and get page numbers. On the WWW, you look up "apples" and get URLs.
- To give more appropriate URLs, web indexes are more sophisticated, but it's still the same idea.
- So if you want to start a search engine company, you need:
  - to build and maintain an index of the web (how?)
  - to provide an interface (of course)
  - to have a routine that takes a search query and, making good use of your index, finds the most relevant URLs.
5. Growth of the WWW
From '93 to '96: http://www.mit.edu/people/mkgray/growth/
In 2005: http://www.cs.uiowa.edu/asignori/web-size/
The indexable web in Jan 2005: 11.5 billion pages.
Coverage: Google 76.16%, MSN Beta 61.90%, Ask/Teoma 57.62%, Yahoo! 69.32%.
Based on 438,141 queries in 75 different languages.
6-9. How can we make search better?
Reduce (ideally eliminate) results that are irrelevant to our requirements. This started to be a problem in 1997, when the size of the WWW at the time (c. 100,000,000 pages) was such that relevant responses to queries were often swamped by irrelevant ones.
The number of potential pages to list against a query is always growing, but people's ability to filter them is static: you don't want to look at more than a few tens of documents. So, search engines need to continually get better at ranking.
10-11. Two key metrics
Suppose you enter a query into a search engine. Suppose that there are precisely 20 documents on the WWW that are fully relevant to your query. Suppose that the search engine returns a list of 10 documents; 2 of these are relevant, the rest are not.
Precision: the percentage of retrieved documents that are relevant to the query.
Recall: the percentage of the truly relevant documents that have been retrieved by the search engine.
What are Precision and Recall in this case?
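The question above can be answered with a short sketch (a minimal illustration; the counts are the ones from the scenario on this slide):

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Precision and recall, expressed as percentages."""
    precision = 100.0 * retrieved_relevant / retrieved_total
    recall = 100.0 * retrieved_relevant / relevant_total
    return precision, recall

# The slide's scenario: 10 documents returned, 2 of them relevant,
# and 20 truly relevant documents exist on the whole WWW.
p, r = precision_recall(retrieved_relevant=2, retrieved_total=10, relevant_total=20)
print(f"Precision = {p:.0f}%, Recall = {r:.0f}%")  # Precision = 20%, Recall = 10%
```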
12-15. Precision and Recall: Speculations
- Is it possible, and/or desirable, to design a search engine so that it achieves 100% Recall?
- Is it possible, and/or desirable, to design a search engine so that it achieves 100% Precision?
- Which is better:
  - 10% precision, 0.001% recall?
  - 0.001% precision, 10% recall?
16. Search engine design goals
High precision is the number one priority for a good search engine. Fast response is very important. Scalability is very important: as the WWW continues to grow, precision and response time should not significantly degrade.
17. Comparing Search Engines
The main area in which search engines differ is their precision; this is where most research effort is expended. The key research question always has been, and shall remain: how can we automatically estimate the relevance and usefulness of a web document, given a particular search query?
18-20. Basic notes on queries
Suppose your search query is "flower shops in Edinburgh".
- If a document contains all of these words, maybe several times, does that mean it is relevant?
- If a document contains this precise phrase, does that mean it is relevant?
- If the answer to these questions is no, or not necessarily, then how can we work out which documents are relevant?
21-22. Google's key points w.r.t. relevance
PageRank: a way of calculating a document's overall importance. Important documents containing the keywords are more likely to be relevant than unimportant documents that also contain the keywords. PageRank is based very much on the link graph of the web.
Anchor text: as well as attaching extra weight to query words that appear in headings and similar, Google associates the text of a hyperlink (the anchor text) with the document it points to. So, for example, if a page contains "edinburgh flower shop", that doesn't mean it is relevant to the query. But if the text of a link to that page contains "edinburgh flower shop", then it is far more likely to be relevant.
23. Google's page-ranking method
Problem: given maybe hundreds of pages that contain the words in a search query, in what order should these be displayed to the user?
PageRank was the method that made Google different from (and better than) the other search engines of the time. It makes use of the directed network defined by links between pages.
24. The PageRank Calculation
To find the PageRank of page A, PR(A):
Suppose that pages T1, T2, ..., Tn all point to A (so A has n inlinks). Let C(X) be the number of outlinks from a page X, and let d be a damping factor (set perhaps at 0.85). Then:
PR(A) = (1 - d) + d( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )
25. PageRank Notes
- The PR of A is worked out in terms of the PRs of all of the pages that link to it. Although simple to define, it is not simple to calculate; it requires some slightly advanced mathematics.
- Technically, the PR of a page corresponds to an entry in the principal eigenvector of the link matrix.
- Note that PR gets a positive contribution from:
  - inlinks from highly ranked pages that have low numbers of outlinks;
  - having many inlinks.
PR(A) = (1 - d) + d( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )
26. PageRank: Very Simple Example
Set d = 0.8; when C(page) = 0, set C(page) = 1.
[Diagram: six pages A, B, C, D, E, F. The links, as implied by the calculations below: A points to B, C, D and E; B and C each point to D and F; D points to E.]
PR(A) = 0.2
PR(B) = 0.2 + 0.8(PR(A)/4) = 0.24
PR(C) = 0.2 + 0.8(PR(A)/4) = 0.24
PR(D) = 0.2 + 0.8(PR(A)/4 + PR(B)/2 + PR(C)/2) = 0.432
PR(E) = 0.2 + 0.8(PR(A)/4 + PR(D)/1) = 0.5856
PR(F) = 0.2 + 0.8(PR(B)/2 + PR(C)/2) = 0.392
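The worked example can be reproduced in a few lines. This is a sketch, not production PageRank: the link set is the one implied by the calculations above, and because this toy graph has no cycles the scores can be computed in a single pass in dependency order rather than by iteration:

```python
# Toy graph from the slide: page -> list of pages it links to.
links = {
    "A": ["B", "C", "D", "E"],
    "B": ["D", "F"],
    "C": ["D", "F"],
    "D": ["E"],
    "E": [],
    "F": [],
}
d = 0.8  # damping factor, as set on the slide

def outdegree(page):
    # C(page); the slide sets C = 1 when a page has no outlinks
    return max(len(links[page]), 1)

# The graph is acyclic, so evaluating pages in the order
# A, B, C, D, E, F resolves every dependency in one pass.
pr = {}
for page in ["A", "B", "C", "D", "E", "F"]:
    inlinks = [t for t in links if page in links[t]]
    pr[page] = (1 - d) + d * sum(pr[t] / outdegree(t) for t in inlinks)

for page, score in pr.items():
    print(f"PR({page}) = {score:.4f}")
```

On a real web graph, which is full of cycles, the same formula is instead applied repeatedly to all pages until the values converge (power iteration).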
27. PageRank Notes II
The random surfer model: imagine randomly surfing the web, clicking links at random, and never clicking Back. Occasionally, you click a "random page" button and start again. PageRank corresponds to the probability that a random surfer will find the page; (1 - d) is the baseline contribution for pages not linked to from anywhere.
Important pages are likely to have very many inlinks, or maybe just a few inlinks but each of those from important pages.
There is much more detail to PageRank, but we have seen the basic ideas.
28. The Google paper is essential reading rather than lecture material, but if there's time, here are some slides anyway.
29. Inside Google
30. High-level anatomy
Several web crawlers on different servers continually crawl the WWW searching for new pages and links, to help Google maintain its view of the WWW's link structure. Pages are sent to the storeserver, which compresses and stores them in the repository. Every unique WWW page is given a unique docID. The Links database is just a collection of pairs of docIDs; PageRanks are calculated using this.
31. The indexer
A key component that takes a WWW page from the repository and stores information about it in a way that supports fast search. The indexer parses the document (recognising tags, headings, links, text, etc.) and converts it into a set of hits, where each hit is characterised by its type (plain (ordinary text) or fancy (within a tag)), its wordID, its position in the document, and some other info. The hits are distributed among the barrels.
32. The Barrels
The barrels contain the main index that supports the front end of the search process. When you type a query into Google, info in the barrels is used to identify which documents actually contain the query words. Most barrels contain a straightforward index, like this:
wordID: docID: hit, hit, hit; docID: hit, hit, hit; ...
wordID: docID: hit, hit, hit; docID: hit, hit, hit; ...
i.e. when you look up a particular word in a barrel, the entry for that word is a list of docIDs for that word (the documents on the WWW that contain the word), with a list of hits for that word in each one.
33. Barrel Contents: example
E.g. these may be partial entries for two successive words in a barrel:
potato: doc12852: 3: 101, 178, 2009; doc12990: 1: 809
quake: doc07828: 1: 16; doc10023: 4: 3, 11, 12, 678
When you search on "quake" in Google, this barrel tells Google all of the documents on the WWW (that it knows about so far) in which the word "quake" occurs. At the beginning of the entry, we see that "quake" occurs in document doc07828, just once, as the 16th word in that document. It also occurs in doc10023, 4 times, in positions 3, 11, 12 and 678; and so on.
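The barrel entries above behave like a small inverted index. Here is a toy in-memory model of them (the dict layout is an illustration, not Google's actual on-disk format):

```python
# A toy "barrel": word -> {docID: [positions where the word occurs]}.
# The entries mirror the slide's example for "potato" and "quake".
barrel = {
    "potato": {"doc12852": [101, 178, 2009], "doc12990": [809]},
    "quake": {"doc07828": [16], "doc10023": [3, 11, 12, 678]},
}

def lookup(word):
    """Return the doclist for a word: (docID, hit positions) pairs."""
    return sorted(barrel.get(word, {}).items())

for doc_id, positions in lookup("quake"):
    print(f"{doc_id}: occurs {len(positions)} time(s) at {positions}")
```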
34. Anchor Text; Types of Barrels
Suppose "see here for potato info" is a hyperlink in doc1 that points to doc2. Google considers the hyperlink text to be contained in doc2 for search/retrieval purposes.
Short barrels contain a reduced version of the index, in which the hitlists refer only to anchor text and other special text such as headings. The full barrels contain the entire index (including what is in the short barrels).
The Lexicon contains each word (14,000,000 words) recognised by Google, and a pointer into the short and full barrels that contain the doclists for that word.
35. Google Query Evaluation
1. The query is parsed (removing common words etc.).
2. Words are converted into wordIDs.
3. For each word in the query, find the beginning of its doclist in the short barrels.
4. Scan the doclists until a document D is found that matches all the search wordIDs.
5. Compute the rank of D for the query (this is PageRank combined with other things).
6. If we are in the short barrels but have reached the end of any one of the doclists, then move into the full barrels, positioning ourselves at the beginning of each doclist, and go to 4.
7. If we are not at the end of any doclist, go to step 4. Otherwise, sort by rank all the documents that have matched all wordIDs, and show them to the user.
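At their core, steps 3-4 and 7 amount to intersecting doclists and sorting the matches by rank. A minimal sketch of that core (the index contents and rank values are made-up illustrations, and real barrels are scanned incrementally rather than held as sets):

```python
# Toy index: word -> set of docIDs containing it, plus an
# illustrative precomputed rank per document (higher = better).
index = {
    "flower": {"doc1", "doc2", "doc5"},
    "shops": {"doc2", "doc3", "doc5"},
    "edinburgh": {"doc2", "doc5", "doc7"},
}
rank = {"doc1": 0.31, "doc2": 0.58, "doc3": 0.12, "doc5": 0.44, "doc7": 0.20}

def evaluate(query_words):
    """Return docIDs containing ALL query words, best rank first."""
    doclists = [index.get(w, set()) for w in query_words]
    matches = set.intersection(*doclists) if doclists else set()
    return sorted(matches, key=lambda d: rank[d], reverse=True)

print(evaluate(["flower", "shops", "edinburgh"]))  # ['doc2', 'doc5']
```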
36. Coursework 3
Search Engine Optimization (SEO) is the task of designing a web page in such a way that it is more likely to be highly ranked (i.e. to appear high on the list) in the set of pages returned after suitable queries to a search engine. E.g. if you have an online presence via which you sell PacMan-style games software, you would wish to do SEO so that, when a user searches Google with the query "pacman games", your page is high on the list. You will find lots of information on the WWW about SEO.
Write a statement of fewer than 250 words (the word limit is strict) which explains how to do effective SEO without spending any money (e.g. without paying for Google AdWords). In addition, list up to 5 URLs that were key to your research.
Marking: Content 70% / Clarity 30%.
As usual, email a PDF to dwcorne_at_gmail.com.