Title: Conducting a Web Search: Problems
1. Conducting a Web Search: Problems & Algorithms
2. Motivation
- Web as a distributed knowledge base
- Utilization of this knowledge base may be improved through efficient search utilities
3. Basic Functionality of a Web Search Engine (I)
- Main functional components
  - crawler module, page repository, query engine
- Algorithmically interesting (behind-the-scenes) components
  - crawler control, indexer/collection analysis, and ranking modules
4. Basic Functionality of a Web Search Engine (II)
- Crawler module (many running in parallel; a minimal crawl loop is sketched below)
  - takes an initial set of URLs
  - retrieves and caches the pages
  - extracts URLs from the retrieved pages
- Page repository
  - stores the cached collection, which is used to create indexes as new pages are retrieved and processed
- Query engine
  - processes a query
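A minimal sketch of the crawl loop described above, in Python. The helpers fetch_page and extract_urls are hypothetical stand-ins (not named on the slides) for the retrieval and URL-extraction steps:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
        # Breadth-first crawl: fetch, cache, and queue newly discovered URLs.
        frontier = deque(seed_urls)   # URLs waiting to be visited
        seen = set(seed_urls)         # avoid re-queuing the same URL
        repository = {}               # page repository: URL -> cached page
        while frontier and len(repository) < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)    # assumed to return page text, or None on failure
            if page is None:
                continue
            repository[url] = page    # cache the retrieved page
            for link in extract_urls(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return repository

A real crawler control module would reorder the frontier by priority rather than visiting URLs first-in-first-out, as discussed on a later slide.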
5. Basic Functionality of a Web Search Engine (III)
- Indexer module (a minimal sketch follows below)
  - builds a content index (inverted index)
    - for each term, a sorted list of locations where it appears, where a location is a tuple of:
      - document URL
      - offset within the document
      - weight of the given occurrence (e.g. occurrences in headings and titles may be assigned higher weights)
  - builds a link index
    - in-links and out-links for each URL, stored in an adjacency matrix
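A minimal sketch of building the content index from the cached repository. For simplicity it assumes whitespace tokenization and a uniform occurrence weight of 1.0; a real indexer would weight occurrences by context (headings, titles):

    from collections import defaultdict

    def build_inverted_index(repository):
        # For each term, collect a list of (URL, offset, weight) locations.
        index = defaultdict(list)
        for url, text in repository.items():
            for offset, term in enumerate(text.lower().split()):
                index[term].append((url, offset, 1.0))  # uniform weight assumed
        return dict(index)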
6. Basic Functionality of a Web Search Engine (IV)
- Collection analysis module
  - creates specialized utility indexes
    - e.g. indexes based on global page ranking algorithms
  - maintains a lexicon of terms and term-level statistics
    - e.g. the total number of documents in which a term occurs, etc.
- Ranking module
  - assigns a rank to each document relative to a specific query, using the utility indexes created by the collection analysis module
7. Basic Functionality of a Web Search Engine (V)
- Crawler control module
  - prioritizes retrieved URLs to determine the order in which they should be visited by the crawler
  - uses utility indexes created by the collection analysis module
  - may use historical data on queries received by the query engine
  - may use heuristics-based URL analysis (e.g. prefer URLs with fewer slashes)
  - may use anchor text analysis and location
  - may use global ranking based on link structure analysis, etc.
8. Some of the Issues
- Refresh strategy, and the metrics determining it
  - Freshness of the collection
    - the % of cached pages that are up to date
  - Age of the collection
    - the average age of the cached pages; the age of a local copy of a single page is, e.g., the time elapsed since it was last current
  - choosing to visit only the more frequently updated pages will lead to the age of the collection growing indefinitely
- Scalability of all of these techniques
9. Remainder of the Talk: Outline
- Structure of the Web graph
  - the actual structure of the web graph should influence crawling and caching strategies, as well as the design of link-based ranking algorithms
- Relevance ranking algorithms
  - used by the query engine and by the crawler control module
  - link-based algorithms, such as PageRank and HITS
  - content-based algorithms, such as TFIDF and Latent Semantic Indexing
10. Modeling the Web Graph (I)
- Web as a directed graph
  - each page is a vertex, each hypertext link is an edge
  - each link may or may not be considered weighted, based on where the anchor text occurs, how many times a given link occurs on a page, etc.
- Bow-tie link structure
  - ~28% of pages constitute the strongly connected core
  - ~44% fall onto the sides of the bow-tie: ~22% can be reached from the core but not vice versa, and ~22% can reach the core but cannot be reached from it
11. Modeling the Web Graph (II)
- Web as a probabilistic graph
  - the web graph is constantly modified; both nodes and edges are added and removed
- Traditional (Erdős–Rényi) random graph model G(n, p)
  - n nodes; p is the probability that a given edge exists
  - the number of in-links follows a standard distribution, e.g. binomial:
    - Prob(in-degree of a node = k) = C(n-1, k) p^k (1-p)^(n-1-k)
  - used to model sparse random graphs
12. Modeling the Web Graph (III)
- Web graph properties (empirical)
  - evolving nature of the web graph
  - disjoint bipartite cliques
    - two subsets, with i and j nodes; each node in the first subset is connected to each node in the second (a total of i*j edges)
  - the distributions of in- and out-degrees follow a power law:
    - Prob(in-degree of a node = k) is proportional to 1 / k^beta
    - experimentally, beta ~ 2.1 for in-degrees
13. Evolving Graph Models (Kumar et al.)
- Graph models with stochastic copying
  - new vertices and new edges are added to the graph at discrete intervals of time
  - allow dependencies between edges
    - some vertices choose outgoing edges at random, independently
    - others replicate existing linkage patterns by copying edges from a randomly chosen vertex
- Two evolving graph models
  - linear growth model (a constant number of nodes is added at each interval)
  - exponential growth model (the current number of vertices is multiplied by a constant factor)
14. Evolving Graph Models (II)
- Defining G_t = (V_t, E_t), the state of the graph at time t
  - f_v(V_t, t) returns the set of vertices added to the graph at time t+1
  - f_e(G_t, t) returns the set of edges added to the graph at time t+1
  - V_{t+1} = V_t ∪ f_v(V_t, t)
  - E_{t+1} = E_t ∪ f_e(G_t, t)
- Edge selection (a simulation sketch follows below)
  - new edges may lead from new vertices to old vertices, or be added between old vertices
  - the origin and destination of each edge are chosen randomly
  - the destination selection method (replicated or random destination) is itself chosen randomly
15. Evolving Graph Models (III)
- Evaluation
  - very rough assumptions
    - a constant number of edges is assumed for each added vertex
    - no deletion, etc.
  - creating a new site generates many links between the new nodes within that site; since no edges are added between new nodes in these models, the assumptions appear to collapse each new site into a single vertex
  - Claim: these models show the desired properties
    - the power law distribution of in- and out-degrees
    - the presence of directed bipartite cliques
16. Link-Based Relevance Ranking: PageRank (I)
- Goal: assign a global rank to each page on the Web
- Basic idea
  - Each page's rank is a sum, over all pages that point to it (referrers), of the rank of each referrer divided by the out-degree of that referrer
17. Link-Based Relevance Ranking: PageRank (II)
- Description
  - For a total of n web pages, the goal is to obtain a rank vector
    - r = <r1, r2, ..., rn>, where ri is the rank of page i
  - Consider the n x n matrix A, with elements
    - a_ij = 1 / outdegree(i) if page i points to page j
    - a_ij = 0 otherwise
    - a_ij is the rank contribution of i to j
  - By our definition of the rank vector, we must then have
    - r = A^T r
  - r is the eigenvector of the matrix A^T corresponding to the eigenvalue 1
18. Link-Based Relevance Ranking: PageRank (III)
- Mathematical apparatus
  - If the graph is strongly connected (every node reachable from every node), the eigenvector r for the eigenvalue 1 of such an adjacency matrix is uniquely defined
  - The principal eigenvector of a matrix (here, the one corresponding to the eigenvalue 1) can be computed using the power iteration method (sketched below):
    - initialize a vector s with random values
    - apply the given linear transformation repeatedly until it converges to the principal eigenvector:
      - r = A^T s
      - r = r / ||r||, where ||.|| is the vector length (normalization)
      - ||r - s|| < epsilon (stop condition; otherwise set s = r and repeat)
    - essentially, r = A^T (A^T (... (A^T s))), with normalization at each step
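A minimal sketch of the power iteration above, using numpy and a dense matrix (fine for illustration, far too small-scale for the real Web):

    import numpy as np

    def pagerank_power_iteration(A, eps=1e-8, max_iter=100):
        # A[i, j] = 1/outdegree(i) if page i points to page j, else 0.
        n = A.shape[0]
        s = np.random.rand(n)
        s /= np.linalg.norm(s)        # start from a random unit vector
        for _ in range(max_iter):
            r = A.T @ s               # r = A^T s
            r /= np.linalg.norm(r)    # normalization
            if np.linalg.norm(r - s) < eps:
                break                 # stop condition ||r - s|| < epsilon
            s = r
        return r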
19. Link-Based Relevance Ranking: PageRank (IV)
- Mathematical apparatus (cont'd)
  - The PageRank vector r, as defined above, is proportional to the stationary probability distribution of the random walk on the graph
    - traverse the graph, choosing at random which link to follow at each vertex
  - The power iteration method is guaranteed to converge only if the graph is aperiodic (i.e. there are no two cycles such that the length of one is proportional to the length of the other)
  - The speed of convergence of power iteration depends on the eigenvalue gap (the difference between the two largest eigenvalues)
20. Link-Based Relevance Ranking: PageRank (V)
- Practical application of the algorithm
  - The actual algorithm is merely the power iteration method applied to the matrix A^T to obtain r
  - In practice, there are problems with the assumptions needed for this algorithm to work:
    - the Web graph is NOT guaranteed to be aperiodic, so the stationary distribution might not be reached
    - the Web is NOT strongly connected; there are pages with no outward links at all
    - slight modifications to the formula for ri take care of that (one common such modification is sketched below)
  - We don't really need the actual rank of each page; we just need to sort the pages correctly
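The slide does not spell the modifications out; the standard one, from Brin and Page's original formulation, is a damping factor d: with probability d the random surfer follows a link, otherwise it jumps to a uniformly random page. A sketch:

    import numpy as np

    def pagerank_damped(A, d=0.85, eps=1e-8, max_iter=100):
        # Damped power iteration. Dangling pages (all-zero rows of A) would
        # additionally need their rank redistributed; that is omitted here.
        n = A.shape[0]
        r = np.ones(n) / n
        for _ in range(max_iter):
            r_next = (1 - d) / n + d * (A.T @ r)
            if np.linalg.norm(r_next - r, 1) < eps:
                return r_next
            r = r_next
        return r

The random jump makes the effective transition matrix aperiodic and irreducible, which restores the convergence guarantees discussed on the previous slides.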
21. Link-Based Relevance Ranking: HITS (Hypertext-Induced Topic Search)
- Goal: rank all pages with respect to a given query (obtain both a hub and an authority score for each page)
- Motivation
  - Two types of web pages: hubs (pages with large collections of links: web directories, link lists, etc.) and authorities (pages well referred to by other pages)
  - Each page gets two scores, a hub score and an authority score
  - A good authority has a high in-degree and is pointed to by many good hubs
  - A good hub has a high out-degree and points to many good authorities
  - Consider the sites of Toyota and Honda: though they will not point to each other, good hubs would point to both
22. Link-Based Relevance Ranking: HITS (II)
- Basic idea
  - The authority score of a page is obtained by summing up the hub scores of the pages that point to it
  - The hub score of a page is obtained by summing up the authority scores of the pages it points to
  - At query time, a small subgraph of the Web graph is identified, and link analysis is run on it
23. Link-Based Relevance Ranking: HITS (III)
- Description
  - Selecting a limited subset of pages
    - The query string determines the initial root set of pages
      - up to t pages containing the same terms as the query string
    - The root set is expanded
      - by all pages linked from the root set
      - by up to d pages pointing to each page in the root set
      - the limit d prevents an over-popular page in the root set (one to which everybody points) from forcing you to add a large portion of the Web graph to your set
24. Link-Based Relevance Ranking: HITS (IV)
- Description (cont'd)
  - Link analysis (a sketch of the iteration follows below)
    - here we wish to obtain two rank vectors: a = <a1, a2, ..., an> and h = <h1, h2, ..., hn>
    - we obtain them using the following iteration method:
      - initialize both vectors to random values
      - for each page i, set the authority score ai equal to the sum of the hub scores of all pages within the subgraph that refer to it (referrers(i))
      - for each page i, set the hub score hi equal to the sum of the authority scores of all pages within the subgraph that it points to (referred(i))
      - normalize the resulting vectors a and h to length 1 (divide each ai by ||a|| and each hi by ||h||)
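A minimal sketch of the HITS iteration on the focused subgraph, assuming numpy and a 0/1 adjacency matrix with A[i, j] = 1 iff page i points to page j:

    import numpy as np

    def hits(A, eps=1e-8, max_iter=100):
        n = A.shape[0]
        h = np.random.rand(n)                 # hub scores
        a = np.random.rand(n)                 # authority scores
        for _ in range(max_iter):
            a_new = A.T @ h                   # authority: sum of referrers' hub scores
            h_new = A @ a_new                 # hub: sum of authorities it points to
            a_new /= np.linalg.norm(a_new)    # normalize to length 1
            h_new /= np.linalg.norm(h_new)
            done = (np.linalg.norm(a_new - a) < eps
                    and np.linalg.norm(h_new - h) < eps)
            a, h = a_new, h_new
            if done:
                break
        return h, a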
25. Link-Based Relevance Ranking: HITS (V)
- Link analysis (cont'd)
  - consider the adjacency matrix A of our focused subgraph; one iteration is
    - h = A a
    - a = A^T h
  - normalize:
    - hi = hi / ||h||, ai = ai / ||a||
  - at convergence:
    - h_norm = c1 (A A^T) h_norm
    - a_norm = c2 (A^T A) a_norm
  - thus, vectors h and a are the principal eigenvectors of the matrices A A^T and A^T A, respectively, and we are essentially using the power iteration method (which, as we know, will converge)
26. Content-Based Relevance Ranking: TFIDF (I)
- Goal: rank all pages with respect to a given query (compute the similarity between the query and each document)
- Background: this is a traditional IR technique, used on collections of documents since the 1960s, originally proposed by Gerard Salton
- Basic idea
  - use the vector-space model to represent documents
  - compute the similarity score using a distance metric
27. Content-Based Relevance Ranking: TFIDF (II)
- Description
  - Each document is represented as a vector <w1, w2, ..., wk>
    - k is the total number of terms (lemmas) in the document collection
    - wi is the weight of the i-th term; it depends on the number of occurrences of this term in this document
  - A query is thought of as just another document
  - There are different schemes for computing term weights
    - the choice of a particular scheme is usually empirically motivated
    - TFIDF is the most common one
28. Content-Based Relevance Ranking: TFIDF (III)
- Description (cont'd)
  - TFIDF weighting scheme
    - wi = (term frequency)i * (inverse document frequency)i
    - (term frequency)i = # of times term i occurs in a document
    - (inverse document frequency)i = log(N / ni), where
      - N is the total # of documents in the collection
      - ni is the number of documents in which term i occurs
    - since N is usually large, N / ni is squashed with log:
      - 1 <= N / ni <= N
      - 0 <= log(N / ni) <= log N
      - the lowest weight of 0 is assigned to terms that occur in all documents (ni = N)
29. Content-Based Relevance Ranking: TFIDF (IV)
- Description (cont'd)
  - Distance metric (a sketch follows below)
    - the cosine of the angle between the two vectors
    - obtained using the scalar product: sim(d, q) = (d . q) / (||d|| ||q||)
    - this similarity score is independent of the size of each document
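A minimal sketch of TFIDF weighting and the cosine metric, with documents as sparse term-weight dicts; doc_freq and n_docs are assumed to come from the collection-analysis lexicon described earlier:

    import math
    from collections import Counter

    def tfidf_vector(doc_terms, doc_freq, n_docs):
        # doc_terms: list of terms in one document;
        # doc_freq[t]: number of documents containing term t.
        tf = Counter(doc_terms)
        return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

    def cosine_similarity(u, v):
        # Cosine of the angle between two sparse vectors (dicts).
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0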
30. Content-Based Relevance Ranking: TFIDF (V)
- Practical application of the algorithm
  - Since a query is frequently very short (2.3 words on average), raw term frequency is not useful
  - Augmented TFIDF is used for weighting query terms (sketched below):
    - wi = (0.5 + 0.5 * tfi / max tf) * idfi
    - max tf is the frequency of the most frequent term in the query
    - for terms not found in the query, the weight would be 0.5 * idfi
    - for most terms found in the query, the weight would be close to 1 * idfi
31. Content-Based Relevance Ranking: Latent Semantic Indexing
- Goal: rank all documents with respect to a given query
- Background: this is also a technique developed for traditional IR with static document collections (introduced in 1990)
- Basic idea
  - Construct a term x document matrix, using a vector representation similar to the one used in TFIDF
    - matrix A (m terms x n documents), of rank r
    - A is typically sparse
  - Using SVD (Singular Value Decomposition), obtain a rank-k approximation to A
    - matrix Ak (m terms x n documents)
    - similar to the least squares method of fitting a line to a set of points
32. Content-Based Relevance Ranking: LSI (II)
- Mathematical apparatus
  - rank(A)
    - the number of linearly independent columns of the m x n matrix A
    - viewed as a linear transform A: R^n -> R^m of rank r, it maps the basis vectors of the pre-image into r linearly independent basis vectors of the image
  - Singular Value Decomposition
    - matrix A can be represented as A = U S V^T, where
      - the columns of U and V are the eigenvectors of A A^T and A^T A, respectively (the left and right singular vectors of A)
      - U and V are orthogonal (V V^T = I), and S is a diagonal matrix
      - S = diag(s1, ..., sn), where the si are the nonnegative square roots of the eigenvalues of A A^T,
      - and s1 >= s2 >= ... >= sr > s_{r+1} = ... = sn = 0
33. Content-Based Relevance Ranking: LSI (III)
- Mathematical apparatus (cont'd)
  - the rank-k approximation keeps only the k largest singular values:
    - Ak = U_{m x k} S_{k x k} (V_{n x k})^T
  - among all rank-k matrices, Ak minimizes the approximation error ||A - Ak||_F^2
34. Content-Based Relevance Ranking: LSI (IV)
- Practical application of the algorithm (a sketch follows below)
  - The SVD computation on the term x document matrix is performed in advance, not at query processing time
  - Each document is represented as a column of the Ak matrix
  - A scalar product-based metric is used for distances between document vectors
  - Query vector Aq = <a_1q, a_2q, ..., a_mq>
    - a pseudo-document, added to Ak post factum
    - vq = Aq^T U S^{-1}
  - the values of Ak^T Ak give the scalar products of the document vectors
    - Ak^T Ak = V S^2 V^T
35. References
- Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan (2001). Searching the Web.
- Kleinberg, J. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM.
- Kumar, Raghavan, Rajagopalan, & Sivakumar (2000). Stochastic Models for the Web Graph. IEEE.
- Berry, Dumais, & O'Brien (1994). Using Linear Algebra for Intelligent Information Retrieval.
- Deerwester, Dumais, Furnas, Landauer, & Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.
- Manning & Schütze (1999). Foundations of Statistical Natural Language Processing.
37. Content-Based Relevance Ranking: LSI (III, cont'd)
- Mathematical apparatus (cont'd): derivation behind the fold-in formula vq = Aq^T U S^{-1} (with A ~ Ak)
  - A = U S V^T
  - A^T = V S^T U^T = V S U^T
  - A^T (U^T)^{-1} = V S; since U is orthogonal, this is A^T U = V S
  - A^T U S^{-1} = V
  - each document (a column of A) thus maps to a row of V, and a query Aq folded in as a pseudo-document maps to vq = Aq^T U S^{-1}