Title: Conducting a Web Search: Problems
1. Conducting a Web Search: Problems & Algorithms
2. Motivation
- Web as a distributed knowledge base
- Utilization of this knowledge base may be improved through efficient search utilities
3. Basic Functionality of a Web Search Engine (I)
- Main functional components
  - crawler module, page repository, query engine
- Algorithmically interesting (behind-the-scenes) components
  - crawler control, indexer/collection analysis, and ranking modules
4. Basic Functionality of a Web Search Engine (II)
- Crawler module (many running in parallel; a minimal crawl loop is sketched below)
  - takes an initial set of URLs
  - retrieves and caches the pages
  - extracts URLs from the retrieved pages
- Page repository
  - stores the cached collection, which is used to create indexes as new pages are retrieved and processed
- Query engine
  - processes a query
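A minimal sketch of the crawl loop described above, in Python. The helpers fetch_page and extract_urls are hypothetical stand-ins (not named on the slides) for the retrieval and URL-extraction steps:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
        # Breadth-first crawl: fetch, cache, and queue newly discovered URLs.
        frontier = deque(seed_urls)   # URLs waiting to be visited
        seen = set(seed_urls)         # avoid re-queuing the same URL
        repository = {}               # page repository: URL -> cached page
        while frontier and len(repository) < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)    # assumed to return page text, or None on failure
            if page is None:
                continue
            repository[url] = page    # cache the retrieved page
            for link in extract_urls(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return repository

A real crawler control module would reorder the frontier by priority rather than visiting URLs first-in-first-out, as discussed on a later slide.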
5. Basic Functionality of a Web Search Engine (III)
- Indexer module (a minimal sketch follows below)
  - builds a content index (inverted index)
    - for each term, a sorted list of locations where it appears, where a location is a tuple of:
      - document URL
      - offset within the document
      - weight of the given occurrence (e.g. occurrences in headings and titles may be assigned higher weights)
  - builds a link index
    - in-links and out-links for each URL, stored in an adjacency matrix
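A minimal sketch of building the content index from the cached repository. For simplicity it assumes whitespace tokenization and a uniform occurrence weight of 1.0; a real indexer would weight occurrences by context (headings, titles):

    from collections import defaultdict

    def build_inverted_index(repository):
        # For each term, collect a list of (URL, offset, weight) locations.
        index = defaultdict(list)
        for url, text in repository.items():
            for offset, term in enumerate(text.lower().split()):
                index[term].append((url, offset, 1.0))  # uniform weight assumed
        return dict(index)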
6. Basic Functionality of a Web Search Engine (IV)
- Collection analysis module
  - creates specialized utility indexes
    - e.g. indexes based on global page ranking algorithms
  - maintains a lexicon of terms and term-level statistics
    - e.g. the total number of documents in which a term occurs, etc.
- Ranking module
  - assigns a rank to each document relative to a specific query, using the utility indexes created by the collection analysis module
7. Basic Functionality of a Web Search Engine (V)
- Crawler control module
  - prioritizes retrieved URLs to determine the order in which they should be visited by the crawler
  - uses utility indexes created by the collection analysis module
  - may use historical data on queries received by the query engine
  - may use heuristics-based URL analysis (e.g. prefer URLs with fewer slashes)
  - may use anchor text analysis and location
  - may use global ranking based on link structure analysis, etc.
8. Some of the Issues
- Refresh strategy, and the metrics determining it
  - Freshness of the collection
    - the % of cached pages that are up to date
  - Age of the collection
    - the average age of the cached pages; the age of a local copy of a single page is, e.g., the time elapsed since it was last current
  - choosing to visit only the more frequently updated pages will lead to the age of the collection growing indefinitely
- Scalability of all of these techniques
9. Remainder of the Talk: Outline
- Structure of the Web graph
  - the actual structure of the web graph should influence crawling and caching strategies, as well as the design of link-based ranking algorithms
- Relevance ranking algorithms
  - used by the query engine and by the crawler control module
  - link-based algorithms, such as PageRank and HITS
  - content-based algorithms, such as TFIDF and Latent Semantic Indexing
10. Modeling the Web Graph (I)
- Web as a directed graph
  - each page is a vertex, each hypertext link is an edge
  - each link may or may not be considered weighted, based on where the anchor text occurs, how many times a given link occurs on a page, etc.
- Bow-tie link structure
  - ~28% of pages constitute the strongly connected core
  - ~44% fall onto the sides of the bow-tie: ~22% can be reached from the core but not vice versa, and ~22% can reach the core but cannot be reached from it
11. Modeling the Web Graph (II)
- Web as a probabilistic graph
  - the web graph is constantly modified; both nodes and edges are added and removed
- Traditional (Erdős–Rényi) random graph model G(n, p)
  - n nodes; p is the probability that a given edge exists
  - the number of in-links follows a standard distribution, e.g. binomial:
    - Prob(in-degree of a node = k) = C(n-1, k) p^k (1-p)^(n-1-k)
  - used to model sparse random graphs
12. Modeling the Web Graph (III)
- Web graph properties (empirical)
  - evolving nature of the web graph
  - disjoint bipartite cliques
    - two subsets, with i and j nodes; each node in the first subset is connected to each node in the second (a total of i*j edges)
  - the distributions of in- and out-degrees follow a power law:
    - Prob(in-degree of a node = k) is proportional to 1 / k^beta
    - experimentally, beta ~ 2.1 for in-degrees
13. Evolving Graph Models (Kumar et al.)
- Graph models with stochastic copying
  - new vertices and new edges are added to the graph at discrete intervals of time
  - allow dependencies between edges
    - some vertices choose outgoing edges at random, independently
    - others replicate existing linkage patterns by copying edges from a randomly chosen vertex
- Two evolving graph models
  - linear growth model (a constant number of nodes is added at each interval)
  - exponential growth model (the current number of vertices is multiplied by a constant factor)
14. Evolving Graph Models (II)
- Defining G_t = (V_t, E_t), the state of the graph at time t
  - f_v(V_t, t) returns the set of vertices added to the graph at time t+1
  - f_e(G_t, t) returns the set of edges added to the graph at time t+1
  - V_{t+1} = V_t ∪ f_v(V_t, t)
  - E_{t+1} = E_t ∪ f_e(G_t, t)
- Edge selection (a simulation sketch follows below)
  - new edges may lead from new vertices to old vertices, or be added between old vertices
  - the origin and destination of each edge are chosen randomly
  - the destination selection method (replicated or random destination) is itself chosen randomly
15. Evolving Graph Models (III)
- Evaluation
  - very rough assumptions
    - a constant number of edges is assumed for each added vertex
    - no deletion, etc.
  - creating a new site generates many links between the new nodes within that site; since no edges are added between new nodes in these models, the assumptions appear to collapse each new site into a single vertex
  - Claim: these models show the desired properties
    - the power law distribution of in- and out-degrees
    - the presence of directed bipartite cliques
16. Link-Based Relevance Ranking: PageRank (I)
- Goal: assign a global rank to each page on the Web
- Basic idea
  - Each page's rank is a sum, over all pages that point to it (referrers), of the rank of each referrer divided by the out-degree of that referrer
17. Link-Based Relevance Ranking: PageRank (II)
- Description
  - For a total of n web pages, the goal is to obtain a rank vector
    - r = <r1, r2, ..., rn>, where ri is the rank of page i
  - Consider the n x n matrix A, with elements
    - a_ij = 1 / outdegree(i) if page i points to page j
    - a_ij = 0 otherwise
    - a_ij is the rank contribution of i to j
  - By our definition of the rank vector, we must then have
    - r = A^T r
  - r is the eigenvector of the matrix A^T corresponding to the eigenvalue 1
18. Link-Based Relevance Ranking: PageRank (III)
- Mathematical apparatus
  - If the graph is strongly connected (every node reachable from every node), the eigenvector r for the eigenvalue 1 of such an adjacency matrix is uniquely defined
  - The principal eigenvector of a matrix (here, the one corresponding to the eigenvalue 1) can be computed using the power iteration method (sketched below):
    - initialize a vector s with random values
    - apply the given linear transformation repeatedly until it converges to the principal eigenvector:
      - r = A^T s
      - r = r / ||r||, where ||.|| is the vector length (normalization)
      - ||r - s|| < epsilon (stop condition; otherwise set s = r and repeat)
    - essentially, r = A^T (A^T (... (A^T s))), with normalization at each step
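A minimal sketch of the power iteration above, using numpy and a dense matrix (fine for illustration, far too small-scale for the real Web):

    import numpy as np

    def pagerank_power_iteration(A, eps=1e-8, max_iter=100):
        # A[i, j] = 1/outdegree(i) if page i points to page j, else 0.
        n = A.shape[0]
        s = np.random.rand(n)
        s /= np.linalg.norm(s)        # start from a random unit vector
        for _ in range(max_iter):
            r = A.T @ s               # r = A^T s
            r /= np.linalg.norm(r)    # normalization
            if np.linalg.norm(r - s) < eps:
                break                 # stop condition ||r - s|| < epsilon
            s = r
        return r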
19. Link-Based Relevance Ranking: PageRank (IV)
- Mathematical apparatus (cont'd)
  - The PageRank vector r, as defined above, is proportional to the stationary probability distribution of the random walk on the graph
    - traverse the graph, choosing at random which link to follow at each vertex
  - The power iteration method is guaranteed to converge only if the graph is aperiodic (i.e. there are no two cycles such that the length of one is proportional to the length of the other)
  - The speed of convergence of power iteration depends on the eigenvalue gap (the difference between the two largest eigenvalues)
20. Link-Based Relevance Ranking: PageRank (V)
- Practical application of the algorithm
  - The actual algorithm is merely the power iteration method applied to the matrix A^T to obtain r
  - In practice, there are problems with the assumptions needed for this algorithm to work:
    - the Web graph is NOT guaranteed to be aperiodic, so the stationary distribution might not be reached
    - the Web is NOT strongly connected; there are pages with no outward links at all
    - slight modifications to the formula for ri take care of that (one common such modification is sketched below)
  - We don't really need the actual rank of each page; we just need to sort the pages correctly
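The slide does not spell the modifications out; the standard one, from Brin and Page's original formulation, is a damping factor d: with probability d the random surfer follows a link, otherwise it jumps to a uniformly random page. A sketch:

    import numpy as np

    def pagerank_damped(A, d=0.85, eps=1e-8, max_iter=100):
        # Damped power iteration. Dangling pages (all-zero rows of A) would
        # additionally need their rank redistributed; that is omitted here.
        n = A.shape[0]
        r = np.ones(n) / n
        for _ in range(max_iter):
            r_next = (1 - d) / n + d * (A.T @ r)
            if np.linalg.norm(r_next - r, 1) < eps:
                return r_next
            r = r_next
        return r

The random jump makes the effective transition matrix aperiodic and irreducible, which restores the convergence guarantees discussed on the previous slides.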
21. Link-Based Relevance Ranking: HITS (Hypertext-Induced Topic Search)
- Goal: rank all pages with respect to a given query (obtain both a hub and an authority score for each page)
- Motivation
  - Two types of web pages: hubs (pages with large collections of links: web directories, link lists, etc.) and authorities (pages well referred to by other pages)
  - Each page gets two scores, a hub score and an authority score
  - A good authority has a high in-degree and is pointed to by many good hubs
  - A good hub has a high out-degree and points to many good authorities
  - Consider the sites of Toyota and Honda: though they will not point to each other, good hubs would point to both
22. Link-Based Relevance Ranking: HITS (II)
- Basic idea
  - The authority score of a page is obtained by summing up the hub scores of the pages that point to it
  - The hub score of a page is obtained by summing up the authority scores of the pages it points to
  - At query time, a small subgraph of the Web graph is identified, and link analysis is run on it
23. Link-Based Relevance Ranking: HITS (III)
- Description
  - Selecting a limited subset of pages
    - The query string determines the initial root set of pages
      - up to t pages containing the same terms as the query string
    - The root set is expanded
      - by all pages linked from the root set
      - by up to d pages pointing to each page in the root set
      - the limit d prevents an over-popular page in the root set (one to which everybody points) from forcing you to add a large portion of the Web graph to your set
24. Link-Based Relevance Ranking: HITS (IV)
- Description (cont'd)
  - Link analysis (a sketch of the iteration follows below)
    - here we wish to obtain two rank vectors: a = <a1, a2, ..., an> and h = <h1, h2, ..., hn>
    - we obtain them using the following iteration method:
      - initialize both vectors to random values
      - for each page i, set the authority score ai equal to the sum of the hub scores of all pages within the subgraph that refer to it (referrers(i))
      - for each page i, set the hub score hi equal to the sum of the authority scores of all pages within the subgraph that it points to (referred(i))
      - normalize the resulting vectors a and h to length 1 (divide each ai by ||a|| and each hi by ||h||)
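A minimal sketch of the HITS iteration on the focused subgraph, assuming numpy and a 0/1 adjacency matrix with A[i, j] = 1 iff page i points to page j:

    import numpy as np

    def hits(A, eps=1e-8, max_iter=100):
        n = A.shape[0]
        h = np.random.rand(n)                 # hub scores
        a = np.random.rand(n)                 # authority scores
        for _ in range(max_iter):
            a_new = A.T @ h                   # authority: sum of referrers' hub scores
            h_new = A @ a_new                 # hub: sum of authorities it points to
            a_new /= np.linalg.norm(a_new)    # normalize to length 1
            h_new /= np.linalg.norm(h_new)
            done = (np.linalg.norm(a_new - a) < eps
                    and np.linalg.norm(h_new - h) < eps)
            a, h = a_new, h_new
            if done:
                break
        return h, a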
25. Link-Based Relevance Ranking: HITS (V)
- Link analysis (cont'd)
  - consider the adjacency matrix A of our focused subgraph; one iteration is
    - h = A a
    - a = A^T h
  - normalize:
    - hi = hi / ||h||, ai = ai / ||a||
  - at convergence:
    - h_norm = c1 (A A^T) h_norm
    - a_norm = c2 (A^T A) a_norm
  - thus, vectors h and a are the principal eigenvectors of the matrices A A^T and A^T A, respectively, and we are essentially using the power iteration method (which, as we know, will converge)
26. Content-Based Relevance Ranking: TFIDF (I)
- Goal: rank all pages with respect to a given query (compute the similarity between the query and each document)
- Background: this is a traditional IR technique, used on collections of documents since the 1960s, originally proposed by Gerard Salton
- Basic idea
  - use the vector-space model to represent documents
  - compute the similarity score using a distance metric
27. Content-Based Relevance Ranking: TFIDF (II)
- Description
  - Each document is represented as a vector <w1, w2, ..., wk>
    - k is the total number of terms (lemmas) in the document collection
    - wi is the weight of the i-th term; it depends on the number of occurrences of this term in this document
  - A query is thought of as just another document
  - There are different schemes for computing term weights
    - the choice of a particular scheme is usually empirically motivated
    - TFIDF is the most common one
28. Content-Based Relevance Ranking: TFIDF (III)
- Description (cont'd)
  - TFIDF weighting scheme
    - wi = (term frequency)i * (inverse document frequency)i
    - (term frequency)i = # of times term i occurs in a document
    - (inverse document frequency)i = log(N / ni), where
      - N is the total # of documents in the collection
      - ni is the number of documents in which term i occurs
    - since N is usually large, N / ni is squashed with log:
      - 1 <= N / ni <= N
      - 0 <= log(N / ni) <= log N
      - the lowest weight of 0 is assigned to terms that occur in all documents (ni = N)
29. Content-Based Relevance Ranking: TFIDF (IV)
- Description (cont'd)
  - Distance metric (a sketch follows below)
    - the cosine of the angle between the two vectors
    - obtained using the scalar product: sim(d, q) = (d . q) / (||d|| ||q||)
    - this similarity score is independent of the size of each document
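A minimal sketch of TFIDF weighting and the cosine metric, with documents as sparse term-weight dicts; doc_freq and n_docs are assumed to come from the collection-analysis lexicon described earlier:

    import math
    from collections import Counter

    def tfidf_vector(doc_terms, doc_freq, n_docs):
        # doc_terms: list of terms in one document;
        # doc_freq[t]: number of documents containing term t.
        tf = Counter(doc_terms)
        return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

    def cosine_similarity(u, v):
        # Cosine of the angle between two sparse vectors (dicts).
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0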
30. Content-Based Relevance Ranking: TFIDF (V)
- Practical application of the algorithm
  - Since a query is frequently very short (2.3 words on average), raw term frequency is not useful
  - Augmented TFIDF is used for weighting query terms (sketched below):
    - wi = (0.5 + 0.5 * tfi / max tf) * idfi
    - max tf is the frequency of the most frequent term in the query
    - for terms not found in the query, the weight would be 0.5 * idfi
    - for most terms found in the query, the weight would be close to 1 * idfi
31. Content-Based Relevance Ranking: Latent Semantic Indexing
- Goal: rank all documents with respect to a given query
- Background: this is also a technique developed for traditional IR with static document collections (introduced in 1990)
- Basic idea
  - Construct a term x document matrix, using a vector representation similar to the one used in TFIDF
    - matrix A (m terms x n documents), of rank r
    - A is typically sparse
  - Using SVD (Singular Value Decomposition), obtain a rank-k approximation to A
    - matrix Ak (m terms x n documents)
    - similar to the least squares method of fitting a line to a set of points
32. Content-Based Relevance Ranking: LSI (II)
- Mathematical apparatus
  - rank(A)
    - the number of linearly independent columns of the m x n matrix A
    - viewed as a linear transform A: R^n -> R^m of rank r, it maps the basis vectors of the pre-image into r linearly independent basis vectors of the image
  - Singular Value Decomposition
    - matrix A can be represented as A = U S V^T, where
      - the columns of U and V are the eigenvectors of A A^T and A^T A, respectively (the left and right singular vectors of A)
      - U and V are orthogonal (V V^T = I), and S is a diagonal matrix
      - S = diag(s1, ..., sn), where the si are the nonnegative square roots of the eigenvalues of A A^T,
      - and s1 >= s2 >= ... >= sr > s_{r+1} = ... = sn = 0
33. Content-Based Relevance Ranking: LSI (III)
- Mathematical apparatus (cont'd)
  - the rank-k approximation keeps only the k largest singular values:
    - Ak = U_{m x k} S_{k x k} (V_{n x k})^T
  - among all rank-k matrices, Ak minimizes the approximation error ||A - Ak||_F^2
34. Content-Based Relevance Ranking: LSI (IV)
- Practical application of the algorithm (a sketch follows below)
  - The SVD computation on the term x document matrix is performed in advance, not at query processing time
  - Each document is represented as a column of the Ak matrix
  - A scalar product-based metric is used for distances between document vectors
  - Query vector Aq = <a_1q, a_2q, ..., a_mq>
    - a pseudo-document, added to Ak post factum
    - vq = Aq^T U S^{-1}
  - the values of Ak^T Ak give the scalar products of the document vectors
    - Ak^T Ak = V S^2 V^T
35. References
- Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan (2001). Searching the Web.
- Kleinberg, J. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM.
- Kumar, Raghavan, Rajagopalan, & Sivakumar (2000). Stochastic Models for the Web Graph. IEEE.
- Berry, Dumais, & O'Brien (1994). Using Linear Algebra for Intelligent Information Retrieval.
- Deerwester, Dumais, Furnas, Landauer, & Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.
- Manning & Schütze (1999). Foundations of Statistical Natural Language Processing.
37. Content-Based Relevance Ranking: LSI (III, cont'd)
- Mathematical apparatus (cont'd): derivation behind the fold-in formula vq = Aq^T U S^{-1} (with A ~ Ak)
  - A = U S V^T
  - A^T = V S^T U^T = V S U^T
  - A^T (U^T)^{-1} = V S; since U is orthogonal, this is A^T U = V S
  - A^T U S^{-1} = V
  - each document (a column of A) thus maps to a row of V, and a query Aq folded in as a pseudo-document maps to vq = Aq^T U S^{-1}