Title: Conducting a Web Search: Problems


1
Conducting a Web Search: Problems & Algorithms
  • Anna Rumshisky

2
Motivation
  • Web as a distributed knowledge base
  • Utilization of this knowledge base may be
    improved through efficient search utilities

3
Basic Functionality of a Web Search Engine (I)
  • main functionality components
  • crawler module, page repository, query engine
  • algorithmically interesting (behind the scenes)
    components
  • crawler control, indexer/collection analysis and
    ranking modules

4
Basic Functionality of a Web Search Engine (II)
  • Crawler module (many running in parallel)
  • takes an initial set of URLs
  • retrieves and caches the pages
  • extracts URLs from the retrieved pages
  • Page repository
  • the cached collection is stored and used to
    create indexes as new pages are retrieved and
    processed
  • Query engine
  • processes a query
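
A minimal sketch of this crawler loop, assuming a hypothetical seed URL and a crude regex-based link extractor; a real crawler would add politeness delays, robots.txt handling, and the crawler-control priority queue described later:

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF = re.compile(r'href="([^"#]+)"')  # crude link extraction, enough for a sketch

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)        # URLs waiting to be visited
    repository = {}                    # page repository: URL -> cached HTML
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:
            continue                   # already retrieved and cached
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                   # skip unreachable or malformed URLs
        repository[url] = html         # cache the retrieved page
        for link in HREF.findall(html):
            frontier.append(urljoin(url, link))  # extracted URLs feed the frontier
    return repository

# e.g. crawl(["https://example.com/"])  # hypothetical seed
```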

5
Basic Functionality of a Web Search Engine (III)
  • Indexer module
  • builds a content index (inverted index)
  • for each term, a sorted list of locations where
    it appears
  • where location is a tuple
  • document URL
  • offset within the document
  • weight of a given occurrence (e.g. occurrences in
    headings and titles may be assigned higher
    weights)
  • builds a link index
  • in-links and out-links for each URL stored in an
    adjacency matrix
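
A minimal sketch of such an inverted index; the input format (pre-tokenized pages with a heading flag) and the 2.0 heading weight are assumptions of the sketch, not from the talk:

```python
from collections import defaultdict

def build_index(pages):
    """pages: URL -> list of (term, in_heading) pairs, already tokenized."""
    index = defaultdict(list)          # term -> sorted list of location tuples
    for url, terms in pages.items():
        for offset, (term, in_heading) in enumerate(terms):
            weight = 2.0 if in_heading else 1.0  # heading/title occurrences weigh more
            index[term].append((url, offset, weight))
    for postings in index.values():
        postings.sort()                # keep each term's location list sorted
    return index

pages = {"http://a.example": [("web", True), ("search", False), ("web", False)]}
print(build_index(pages)["web"])
# [('http://a.example', 0, 2.0), ('http://a.example', 2, 1.0)]
```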

6
Basic Functionality of a Web Search Engine (IV)
  • Collection analysis
  • creates specialized utility indexes
  • e.g. indexes based on global page ranking
    algorithms
  • maintains a lexicon of terms and term-level
    statistics
  • e.g. total number of documents in which a term
    occurs, etc.
  • Ranking module
  • assigns rank to each document relative to a
    specific query, using utility indexes created by
    the collection analysis module

7
Basic Functionality of a Web Search Engine (V)
  • Crawler control module
  • prioritizes retrieved URLs to determine the order
    in which they should be visited by the crawler
  • uses utility indexes created by the collection
    analysis module
  • may use historical data on queries received by
    the query engine
  • may use heuristics-based URL analysis (e.g.
    prefer URLs with fewer slashes)
  • may use anchor text analysis and location
  • may use global ranking based on link structure
    analysis, etc.

8
Some of the Issues
  • Refresh strategy - metrics determining the
    refresh strategy
  • Freshness of the collection
  • percentage of the pages that are up to date
  • Age of the collection
  • average age of the cached pages, where the age of
    a local copy of a single page is, e.g., the time
    elapsed since it was last current
  • choosing to visit only the more frequently
    updated pages will lead to the age of the
    collection growing indefinitely
  • Scalability of all techniques
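
A minimal sketch of these two metrics, under the assumption that each cache entry records a fetch timestamp and the time the live page last changed (hypothetical fields):

```python
def freshness(cache):
    """Fraction of cached copies that are still up to date."""
    fresh = sum(1 for c in cache if c["fetched_at"] >= c["last_changed"])
    return fresh / len(cache)

def age(cache, now):
    """Average time since each cached copy was last current (0 if still fresh)."""
    ages = [now - c["last_changed"] if c["fetched_at"] < c["last_changed"] else 0
            for c in cache]
    return sum(ages) / len(cache)

cache = [{"fetched_at": 100, "last_changed": 90},    # copy still current
         {"fetched_at": 100, "last_changed": 150}]   # page changed at t=150
print(freshness(cache), age(cache, now=200))         # 0.5 25.0
```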

9
Remainder of the Talk - Outline
  • Structure of the Web Graph
  • the actual structure of the web graph should
    influence crawling and caching strategies
  • and the design of link-based ranking algorithms
  • Relevance Ranking Algorithms
  • used by query engine and by crawler control
    modules
  • link-based algorithms, such as PageRank, HITS
  • content-based algorithms, such as TFIDF, Latent
    Semantic Indexing

10
Modeling the Web Graph (I)
  • Web as a directed graph
  • each page is a vertex, each hypertext link is an
    edge
  • each link may or may not be weighted, based on
  • where the anchor text occurs
  • how many times a given link occurs on a page,
    etc.
  • Bow-tie link structure
  • 28% constitute the strongly connected core
  • 44% fall onto the sides of the bow-tie that can
    be reached from the core, but not vice versa
  • 22% can reach the core, but cannot be reached
    from it

11
Modeling the Web Graph (II)
  • Web as a probabilistic graph
  • web graph is constantly modified, both nodes and
    edges are added and removed
  • Traditional (Erdős-Rényi) random graph model
    G(n,p)
  • n nodes, p is the probability that a given edge
    exists
  • the number of in-links follows some standard
    distribution, e.g. binomial
  • Prob(in-degree of a node = k) = C(n-1, k) p^k
    (1-p)^(n-1-k)
  • used to model sparse random graphs

12
Modeling the Web Graph (III)
  • Web graph properties (empirically)
  • evolving nature of web graph
  • disjoint bipartite cliques
  • two subsets, with i and j nodes; each node in the
    first subset is connected to each node in the
    second (a total of i*j edges)
  • distribution of in- and out-degrees follows the
    power law
  • Prob(in-degree of a node = k) ~ 1/k^beta
  • experimentally, beta is approximately 2.1

13
Evolving Graph Models (Kumar et al.)
  • Graph models with stochastic copying
  • new vertices and new edges added to the graph at
    discrete intervals of time
  • allow dependencies between edges
  • some vertices choose outgoing edges at random,
    independently
  • others replicate existing linkage patterns by
    copying edges from a randomly chosen vertex
  • two evolving graph models
  • linear growth model (a constant number of nodes
    added at each interval)
  • exponential growth (current number of vertices
    multiplied by a constant factor)

14
Evolving Graph Models (II)
  • Defining Gt = (Vt, Et) - the state of the graph
    at time t
  • fv(Vt, t) - returns the set of vertices added to
    the graph at time t+1
  • fe(Gt, t) - returns the set of edges added to the
    graph at time t+1
  • Vt+1 = Vt U fv(Vt, t)
  • Et+1 = Et U fe(Gt, t)
  • Edge selection
  • new edges may lead from new vertices to old
    vertices, or be added between old vertices
  • origin and destination for each edge are chosen
    randomly
  • destination selection method (replicated or
    random destination) is chosen randomly

15
Evolving Graph Models (III)
  • Evaluation
  • Very rough assumptions
  • a constant number of edges assumed for each added
    vertex
  • no deletion, etc.
  • creating a new site generates many links between
    the new nodes in that site; since no edges are
    added between new nodes, their assumptions appear
    to collapse each new site into a single vertex
  • Claim: these models show the desired properties
  • the power law distribution for in- and
    out-degrees
  • the presence of directed bipartite cliques

16
Link-Based Relevance Ranking PageRank (I)
  • Goal: Assign a global rank to each page on the
    Web
  • Basic idea
  • Each page's rank is the sum, over all pages that
    point to it (referrers), of the rank of each
    referrer divided by the out-degree of that
    referrer

17
Link-Based Relevance Ranking PageRank (II)
  • Description
  • For a total of n web pages, the goal is to obtain
    a rank vector
  • r = <r1, r2, ..., rn>, where ri is the rank of
    page i
  • Consider the n x n matrix A with the elements
  • ai,j = 1/outdegree(i) if page i points to page j
  • ai,j = 0 otherwise
  • ai,j is the rank contribution of i to j
  • By our definition of the rank vector, we must
    then have
  • r = AT r
  • i.e., r is the eigenvector of matrix AT
    corresponding to the eigenvalue 1

18
Link-Based Relevance Ranking PageRank (III)
  • Mathematical apparatus
  • If the graph is strongly connected (every node
    reachable from every node), the eigenvector r for
    the eigenvalue 1 of such an adjacency matrix is
    uniquely defined
  • The principal eigenvector of a matrix
    (corresponding to the eigenvalue 1) can be
    computed using the power iteration method
  • initialize a vector s with random values
  • apply the linear transformation repeatedly, until
    it converges to the principal eigenvector
  • r = AT s
  • r = r / ||r||, where ||r|| is the vector length
    (normalization)
  • ||r - s|| < epsilon (stop condition; otherwise
    set s = r and repeat)
  • essentially, r = AT (AT ... (AT s)) - with
    normalization at each step
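
A minimal sketch of this power iteration, assuming a small strongly connected, aperiodic graph given as out-link lists (the example graph is hypothetical):

```python
import numpy as np

def pagerank(out_links, eps=1e-8):
    """out_links[i]: list of pages that page i points to."""
    n = len(out_links)
    A = np.zeros((n, n))
    for i, dests in enumerate(out_links):
        for j in dests:
            A[i, j] = 1.0 / len(dests)        # a_ij = 1/outdegree(i) if i points to j
    r = np.random.rand(n)
    while True:
        s = r
        r = A.T @ s                           # r = AT s
        r = r / np.linalg.norm(r)             # normalize to unit length
        if np.linalg.norm(r - s) < eps:       # stop condition ||r - s|| < epsilon
            return r

# 0 -> 1 -> 2 -> 0, plus 2 -> 1: strongly connected and aperiodic
print(pagerank([[1], [2], [0, 1]]))
```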

19
Link-Based Relevance Ranking PageRank (IV)
  • Mathematical apparatus (contd)
  • PageRank vector r, as defined above, is
    proportional to the stationary probability
    distribution of the random walk on a graph
  • traverse the graph choosing at random which link
    to follow at each vertex
  • The power iteration method is guaranteed to
    converge only if the graph is aperiodic (i.e. the
    lengths of its cycles have no common divisor
    greater than 1)
  • The speed of convergence of the power iteration
    depends on the eigenvalue gap (difference between
    the two largest eigenvalues)

20
Link-Based Relevance Ranking PageRank (V)
  • Practical application of the algorithm
  • The actual algorithm is merely applying the power
    iteration method to the matrix AT to obtain r
  • In practice, there are problems with the
    assumptions needed for this algorithm to work
  • the Web graph is NOT guaranteed to be aperiodic,
    so stationary distribution might not be reached
  • the Web is NOT strongly connected - there are
    pages with no outward links at all
  • slight modifications to the formula for ri take
    care of that
  • We don't really need the actual rank of each
    page, we just need to sort the pages correctly
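
The slides do not spell out the modification; the standard fix from the original PageRank formulation is a random-jump ("damping") term, sketched here under that assumption:

```python
import numpy as np

def damped_pagerank(A, d=0.85, eps=1e-8):
    """A: n x n matrix with a_ij = 1/outdegree(i) if i points to j, else 0."""
    n = A.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        s = r
        # with probability d follow a link, otherwise jump to a random page;
        # the jump term makes the walk aperiodic and handles dead-end pages
        r = d * (A.T @ s) + (1.0 - d) / n
        if np.linalg.norm(r - s) < eps:
            return r
```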

21
Link-Based Relevance Ranking HITS
(Hypertext-Induced Topic Search)
  • Goal: Rank all pages with respect to a given
    query (obtain both a hub and an authority score
    for each page)
  • Motivation
  • Two types of web pages: hubs (pages with large
    collections of links - web directories, link
    lists, etc.) and authorities (pages well referred
    to by other pages)
  • Each page gets two scores, a hub score and an
    authority score
  • A good authority has a high in-degree and is
    pointed to by many good hubs
  • A good hub has a high out-degree and points to
    many good authorities
  • Consider the sites of Toyota and Honda: though
    they will not point to each other, good hubs
    would point to both

22
Link-Based Relevance Ranking HITS (II)
  • Basic idea
  • An authority score of a page is obtained by
    summing up the hub scores of pages that point to
    it
  • A hub score of a page is obtained by summing up
    the authority scores of pages it points to
  • At query time, a small subgraph of the Web graph
    is identified, and a link analysis is run on it

23
Link-Based Relevance Ranking HITS (III)
  • Description
  • Selecting a limited subset of pages
  • The query string determines the initial root set
    of pages
  • up to t pages containing the same terms as the
    query string
  • Root set is expanded
  • by all pages linked from the root set
  • by up to d pages pointing to each page in the
    root set
  • this is to prevent an over-popular page in the
    root set - to which everybody points - from
    forcing you to add a large portion of the Web
    graph to your set

24
Link-Based Relevance Ranking HITS (IV)
  • Description (contd)
  • Link analysis
  • here we wish to obtain two rank vectors, a and h
  • a = <a1, a2, ..., an> and h = <h1, h2, ..., hn>
  • we obtain them using the following iteration
    method
  • initialize both vectors to random values
  • for each page i, set the authority score ai equal
    to the sum of the hub scores of all pages within
    the subgraph that refer to it (referrers(i))
  • for each page i, set the hub score hi equal to
    the sum of the authority scores of all pages
    within the subgraph that it points to
    (referred(i))
  • normalize the resulting vectors a and h to have
    length 1 (divide each ai by ||a|| and each hi by
    ||h||)
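
A minimal sketch of this iteration, assuming the focused subgraph is supplied as a 0/1 adjacency matrix:

```python
import numpy as np

def hits(A, eps=1e-8):
    """A[i, j] = 1 if page i points to page j, over the focused subgraph."""
    n = A.shape[0]
    a = np.random.rand(n)                     # authority scores
    h = np.random.rand(n)                     # hub scores
    while True:
        a_new = A.T @ h                       # ai = sum of hub scores of referrers(i)
        h_new = A @ a_new                     # hi = sum of authority scores of referred(i)
        a_new /= np.linalg.norm(a_new)        # normalize both vectors to length 1
        h_new /= np.linalg.norm(h_new)
        if np.linalg.norm(a_new - a) < eps and np.linalg.norm(h_new - h) < eps:
            return h_new, a_new
        a, h = a_new, h_new
```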

25
Link-Based Relevance Ranking HITS (V)
  • Link analysis (contd)
  • consider the adjacency matrix A for our focused
    subgraph
  • h = A a
  • a = AT h
  • Normalize
  • hi = hi/||h||, ai = ai/||a||
  • at convergence, hnorm = c1 (A AT) hnorm
  • and anorm = c2 (AT A) anorm
  • thus, vectors h and a are the principal
    eigenvectors of the matrices A AT and AT A,
    respectively, and we are essentially using the
    power iteration method (which, as we know, will
    converge)

26
Content-Based Relevance Ranking TFIDF (I)
  • Goal: Rank all pages with respect to a given
    query (compute the similarity between the query
    and each document)
  • Background: This is a traditional IR technique,
    used on collections of documents since the 1960s
    and originally proposed by Gerard Salton
  • Basic idea
  • Using vector-space model to represent documents
  • Compute the similarity score using a distance
    metric

27
Content-Based Relevance Ranking TFIDF (II)
  • Description
  • Each document is represented as a vector <w1, w2,
    ..., wk>
  • k is the total number of terms (lemmas) in the
    document collection
  • wi is the weight of the i-th term; it depends on
    the number of occurrences of this term in this
    document
  • A query is treated as just another document
  • There are different schemes for computing term
    weights
  • the choice of a particular scheme is usually
    empirically motivated
  • TFIDF is the most common one

28
Content-Based Relevance Ranking TFIDF (III)
  • Description (contd)
  • TFIDF weighting scheme
  • wi = term frequencyi * inverse document
    frequencyi
  • term frequencyi = number of times the term i
    occurs in a document
  • inverse document frequencyi = log (N / ni),
    where
  • N is the total number of documents in the
    collection
  • ni is the number of documents in which term i
    occurs
  • since N is usually large, N / ni is squashed
    with log
  • 1 <= N / ni <= N
  • 0 <= log (N / ni) <= log N
  • the lowest weight of 0 is assigned to terms that
    occur in all documents (log(N/N) = 0)
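
A minimal sketch of this weighting scheme on a hypothetical in-memory collection, where each document is a list of lemmas:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    N = len(docs)
    df = Counter()                            # ni: number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                     # term frequency within this document
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = [["web", "search", "web"], ["graph", "search"], ["web", "graph"]]
print(tfidf_vectors(docs)[0])                 # {'web': 0.81, 'search': 0.41} (rounded)
```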

29
Content-Based Relevance Ranking TFIDF (IV)
  • Description (contd)
  • Distance metric
  • cosine of the angle between the two vectors
  • obtained using scalar product
  • this similarity score is independent of the size
    of each document
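
A minimal sketch of the cosine metric, operating on the sparse weight dictionaries produced by the TFIDF sketch above:

```python
import math

def cosine(u, v):
    """u, v: sparse term -> weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())  # scalar product
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)            # length-normalized, so size-independent
```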

30
Content-Based Relevance Ranking TFIDF (V)
  • Practical application of the algorithm
  • Since the query is frequently very short (2.3
    words on average), raw term frequency is not
    useful
  • Augmented TFIDF is used for weighting query
    terms
  • wi = (0.5 + 0.5 * tfi / max tf) * idfi
  • max tf is the frequency of the most frequent
    term in the query
  • for terms not found in the query, the weight
    would be
  • 0.5 * idfi
  • for most terms found in the query, the weight
    would be
  • 1 * idfi (since in a short query most terms occur
    once, so tfi = max tf)
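
A minimal sketch of the augmented query weighting; the idf values are assumed to come from collection statistics such as the document frequencies in the TFIDF sketch:

```python
from collections import Counter

def query_weights(query_terms, idf):
    """idf: term -> inverse document frequency from the collection."""
    tf = Counter(query_terms)
    max_tf = max(tf.values())                 # frequency of the most frequent query term
    return {t: (0.5 + 0.5 * tf[t] / max_tf) * idf.get(t, 0.0) for t in tf}

# for a typical short query every term occurs once, so each weight is 1 * idf
print(query_weights(["web", "search"], {"web": 0.4, "search": 0.4}))
```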

31
Content-Based Relevance Ranking Latent Semantic
Indexing
  • Goal: Rank all documents with respect to a given
    query
  • Background: This is also a technique developed
    for traditional IR with static document
    collections (introduced in 1990)
  • Basic idea
  • Construct a term x document matrix, using a
    vector representation similar to the one used in
    TFIDF
  • Matrix A of size m x n (m terms, n documents),
    with rank r
  • A is typically sparse
  • Using SVD (Singular Value Decomposition), obtain
    a rank-k approximation to A
  • Matrix Ak is also m x n (m terms, n documents),
    but of rank k < r
  • similar to the least squares method of fitting a
    line to a set of points

32
Content-Based Relevance Ranking LSI (II)
  • Mathematical apparatus
  • rank(A) = the number of linearly independent
    columns of A, viewing A (m x n) as a linear map
    Rn -> Rm
  • a linear transform of rank r maps the basis
    vectors of the pre-image onto r linearly
    independent vectors of the image
  • Singular Value Decomposition
  • Matrix A can be represented as
  • A = U S VT, where
  • columns of U and V are the left and right
    singular vectors (eigenvectors of A AT and AT A,
    respectively)
  • U and V are orthogonal (V VT = I), and S is a
    diagonal matrix
  • S = diag(s1, ..., sn), where the si are the
    nonnegative square roots of the eigenvalues of
    A AT
  • and s1 >= s2 >= ... >= sr > sr+1 = ... = sn = 0

33
Content-Based Relevance Ranking LSI (III)
  • Mathematical apparatus (contd)
  • Rank-k approximation Ak = U(m x k) S(k x k)
    VT(k x n), keeping only the k largest singular
    values
  • among all rank-k matrices, Ak minimizes the
    approximation error ||A - Ak||F^2 (squared
    Frobenius norm)
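
A minimal sketch of the rank-k approximation using numpy's SVD routine; the toy term x document matrix is hypothetical:

```python
import numpy as np

def rank_k_approx(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U S VT
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # keep the k largest singular values

A = np.array([[1., 0., 1.],                            # 4 terms x 3 documents
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
Ak = rank_k_approx(A, k=2)
print(np.linalg.norm(A - Ak, "fro") ** 2)              # ||A - Ak||F^2, minimal over rank 2
```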

34
Content-Based Relevance Ranking LSI (IV)
  • Practical application of the algorithm
  • The SVD computation on the term x document matrix
    is performed in advance, not at query processing
    time
  • Each document is represented as a column of the
    Ak matrix
  • A scalar product-based metric for distances
    between document vectors is used
  • Query vector Aq = <a1q, a2q, ..., amq>
  • a pseudo-document, added to Ak post factum
  • Vq = AqT U S^-1
  • the values of AkT Ak give the scalar products of
    the document vectors
  • AkT Ak = V S^2 VT
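
A minimal sketch of folding a query into the reduced space via Vq = AqT U S^-1; the matrix and query vector are hypothetical toy data:

```python
import numpy as np

A = np.array([[1., 0., 1.],                   # toy term x document matrix
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U S VT

q = np.array([1., 0., 1., 0.])                # query term counts (a pseudo-document)
vq = q @ U[:, :k] @ np.diag(1.0 / s[:k])      # Vq = AqT Uk Sk^-1

doc_vectors = Vt[:k, :].T                     # each row: one document in the LSI space
print(doc_vectors @ vq)                       # scalar-product similarity per document
```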

35
References
  • Arasu, A., Cho, J., Garcia-Molina, H., Paepcke,
    A., and Raghavan, S. (2001). Searching the Web.
  • Kleinberg, J. (1999). Authoritative Sources in a
    Hyperlinked Environment. Journal of the ACM.
  • Kumar, R., Raghavan, P., Rajagopalan, S., and
    Sivakumar, D. (2000). Stochastic Models for the
    Web Graph. IEEE.
  • Berry, M., Dumais, S., and O'Brien, G. (1994).
    Using Linear Algebra for Intelligent Information
    Retrieval.
  • Deerwester, S., Dumais, S., Furnas, G., Landauer,
    T., and Harshman, R. (1990). Indexing by Latent
    Semantic Analysis. Journal of the American
    Society for Information Science.
  • Manning, C., and Schütze, H. (1999). Foundations
    of Statistical Natural Language Processing.

36
(No Transcript)
37
Content-Based Relevance Ranking LSI (III)
  • Mathematical apparatus (contd) - deriving the
    query-folding formula Vq = AqT U S^-1
  • A = U S VT
  • AT = V ST UT = V S UT (S is diagonal)
  • AT (UT)^-1 = V S (where (UT)^-1 = U, since U is
    orthogonal)
  • AT (UT)^-1 S^-1 = V
  • hence a query vector Aq is mapped into the
    reduced space as Vq = AqT U S^-1