The Web as a Graph: Measurements, Models, and Methods


1
The Web as a Graph: Measurements, Models, and
Methods
  • Presented by Rob Abernethy
  • and Saravanan Palanisamy

2
Outline
  • Impact on Web Search Engines
  • HITS Algorithm
  • Trawling Algorithm
  • Measurements
  • Models

3
Google Relationship
  • Recall the PageRank system
  • The number of citations to the page in question
    determines that page's rank
  • Uses a link-based graph to determine a page's
    relevance to a topic
  • The Google paper left out many details relating to
    this graph (structure, creation, etc.)

4
Definitions
  • Authoritative page: a page that the web community
    deems important to a specific topic, based on the
    large number of citations (links) to it found in
    other pages
  • Hub page: a page that contains citations to all the
    important pages focused on a particular topic
  • In-degree: the number of links pointing to a page
  • Out-degree: the number of links coming out of a page

5
HITS Algorithm
  • Input is the World Wide Web
  • Output is a sub-graph that (hopefully) is
    relatively small, is rich in relevant pages, and
    contains most of the authoritative and hub pages
  • Two main steps: a sampling step and a
    weight-propagation step

6
Sampling Step
  • Create a root set via a regular search engine
  • Root set consists of about 200 pages
  • Root set is assumed to contain some hub and
    authoritative pages
  • Expand root set to base set by tracking links
  • Base set consists of about 1000-3000 pages
  • Base set should contain a large number of hub and
    authoritative pages

7
Sampling Step (cont.)
  • Using the base set, induce a sub-graph
  • Delete links in the sub-graph between pages in
    the same web site
  • These links are assumed to be for navigational
    purposes and could bias a page's in-degree or
    out-degree
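
The sampling step above can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the function names are hypothetical, the expansion follows links in both directions (the slides only say "tracking links"), and "same web site" is approximated by comparing URL hostnames.

```python
from urllib.parse import urlparse


def expand_root_set(root, out_links, in_links):
    """Grow the root set by one link step (assumption: both directions)."""
    base = set(root)
    for page in root:
        base.update(out_links.get(page, ()))  # pages the root page cites
        base.update(in_links.get(page, ()))   # pages that cite the root page
    return base


def induced_cross_site_edges(base, out_links):
    """Edges of the sub-graph induced by `base`, without same-site links."""
    edges = set()
    for src in base:
        for dst in out_links.get(src, ()):
            same_site = urlparse(src).netloc == urlparse(dst).netloc
            if dst in base and not same_site:
                edges.add((src, dst))
    return edges


# Tiny hypothetical link tables (made-up URLs, for illustration only).
out_links = {
    "http://a.com/1": ["http://a.com/2", "http://b.com/x"],
    "http://c.com/h": ["http://a.com/1"],
}
in_links = {"http://a.com/1": ["http://c.com/h"]}

base = expand_root_set({"http://a.com/1"}, out_links, in_links)
edges = induced_cross_site_edges(base, out_links)
```

In this toy example the navigational link from a.com/1 to a.com/2 is dropped, while the two cross-site links are kept.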

8
Weight-propagation Step
  • Associate non-negative hub and authority weights
    with every page p in the sub-graph
  • Authority weight: x_p
  • Hub weight: y_p
  • Pages with a relatively high x_p will be viewed as
    authoritative pages
  • Pages with a relatively high y_p will be viewed as
    hub pages

9
Weight-propagation (cont.)
  • Note that the relative weights of the pages are
    used; there is no universal score for
    determining an authoritative or hub page
  • Also note that all pages receive the same initial
    weight; there is no a priori estimation

10
Weight-propagation (cont.)
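  • Authoritative update rule: x_p = sum of y_q over
    all pages q that link to p
  • Hub update rule: y_p = sum of x_q over all pages q
    that p links to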
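
The weight-propagation step can be sketched as follows, using the standard HITS update rules: a page's authority weight becomes the sum of the hub weights of the pages linking to it, and its hub weight becomes the sum of the authority weights of the pages it links to. The function names, the fixed iteration count, and the normalization after each round are assumptions for illustration; the slides do not specify them.

```python
import math


def normalize(weights):
    """Scale weights to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {page: v / norm for page, v in weights.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) weights for a list of (src, dst) links."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}  # every page starts with the same weight
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub weights of pages that link to p.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum authority weights of pages that p links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub-like pages citing two other pages.
auth, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

Only the relative order of the weights matters here: a1, cited by both hubs, ends up with the largest authority weight, and h1, which cites both pages, with the largest hub weight.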

11
HITS Algorithm
  • The sub-graph induced by the base set along with
    the weights computed in weight-propagation step
    can be used to deliver a list of authoritative
    and hub pages
  • After the initial query to obtain the root set,
    HITS relies entirely on link structure and
    ignores a web page's textual content

12
HITS Modifications
  • Weight the value of a link based on:
  • Content
  • Domain
  • Date Modified

13
Trawling the Web for cyber-communities
  • Seeks to enumerate all topics (unlike HITS)
  • A bipartite core C_{i,j} is a graph on i+j nodes
    that contains at least one K_{i,j} as a subgraph.

14
Trawling (contd.)
  • Identify a large fraction of cyber-communities by
    enumerating all the bipartite cores in the web
    (trawling).
  • Problems with the naïve search algorithm:
  • Size of search space is too large
  • Requires random access to edges in the graph,
    which requires most of the graph to be in memory

15
Elimination-generation paradigm
  • The algorithm performs a number of sequential
    passes over the Web graph.
  • Passes over the data are interleaved with sort
    operations which change the order in which the
    data is scanned.
  • Elimination operations and generation operations
    are interleaved during each pass.
  • After each elimination/generation operation, nodes
    have fewer neighbors, which creates new
    opportunities during the next pass

16
Elimination - Generation
  • Elimination filters:
  • Nodes with in-degree 3 or smaller cannot
    participate on the right side of a C_{4,4}.
  • Nodes with out-degree 3 or smaller cannot
    participate on the left side of a C_{4,4}.
  • Generation filter: a node of in-degree 4 can
    belong to a C_{4,4} iff the 4 nodes that point to
    it have a neighborhood intersection of size at
    least 4.
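
The two filters can be sketched on an in-memory edge set. This is an illustration only: the real algorithm works in sequential passes over sorted data on disk, and the function names and the (src, dst) tuple representation are assumptions.

```python
from collections import Counter


def eliminate(edges, i=4, j=4):
    """Repeatedly drop edges that cannot belong to any C_{i,j}.

    A left-side node needs out-degree >= j and a right-side node needs
    in-degree >= i, so an edge survives only if both of its ends qualify.
    """
    edges = set(edges)
    while True:
        out_deg = Counter(src for src, _ in edges)
        in_deg = Counter(dst for _, dst in edges)
        kept = {(s, d) for s, d in edges
                if out_deg[s] >= j and in_deg[d] >= i}
        if kept == edges:
            return kept
        edges = kept


def generation_test(v, edges, i=4, j=4):
    """For a node v of in-degree exactly i, return the core or None."""
    preds = {s for s, d in edges if d == v}
    if len(preds) != i:
        return None
    # Intersect the out-neighborhoods of the i nodes pointing to v.
    common = set.intersection(
        *({d for s, d in edges if s == p} for p in preds))
    return (preds, common) if len(common) >= j else None


# Hypothetical example: a K_{4,4} plus two stray edges.
left = [f"L{n}" for n in range(4)]
right = [f"R{n}" for n in range(4)]
graph = {(l, r) for l in left for r in right} | {("x", "R0"), ("L0", "y")}

pruned = eliminate(graph)
core = generation_test("R0", pruned)
```

Here the stray edges are removed by the elimination filter, after which R0 has in-degree exactly 4 and the generation test returns the four left and four right nodes.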

17
Why is elimination-generation fast?
  • The in/out-degree of every node drops during each
    phase
  • In each generation test on a node u, we either
    eliminate u from further consideration or output
    the subgraph containing u. Hence the algorithm is
    linear in the number of nodes.
  • Elimination phases eliminate most nodes in the
    web graph

18
Measurements
  • In-degree distribution
  • Out-degree distribution

19
Degree distribution
  • Probability that a node has in/out degree i is
    proportional to 1/i^γ, where γ ≈ 2
  • This is a Zipfian distribution, which does not
    arise in a model such as G_{n,p} (Poisson or
    binomial distribution).
  • Number of bipartite cores
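
To see why a Zipfian degree distribution differs so sharply from what a G_{n,p}-style model produces, the sketch below compares the probability of degree 100 under a power law with exponent 2 and under a Poisson distribution. The mean degree of 5 and the truncation at degree 10,000 are arbitrary illustrative choices, not measurements.

```python
import math


def power_law_pmf(i, exponent=2.0, max_degree=10_000):
    """P(degree = i) proportional to 1/i^exponent, truncated at max_degree."""
    normalizer = sum(k ** -exponent for k in range(1, max_degree + 1))
    return i ** -exponent / normalizer


def poisson_pmf(i, mean):
    """P(degree = i) for a Poisson distribution, computed in log space."""
    return math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))


p_zipf = power_law_pmf(100)
p_poisson = poisson_pmf(100, mean=5.0)
```

Under the power law a degree-100 node is rare but plausible, while under the Poisson model it is so improbable that it would essentially never appear, which is why heavy-tailed degree counts rule out such a model.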

20
Connectivity of local subgraphs
  • u and v are biconnected if there is no third node
    w so that w lies on all u-v paths.
  • The undirected graph G_u has a giant biconnected
    component, with small pieces connected around it
  • u and v are strongly connected if each can reach
    the other by a directed path
  • For many of the graphs the largest strongly
    connected component has size less than 20
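
The strong-connectivity definition above can be checked directly with two reachability searches. This is a minimal sketch with hypothetical function names; it is not an efficient way to compute whole components on a large graph.

```python
from collections import defaultdict, deque


def reachable(start, edges):
    """Set of nodes reachable from `start` along directed edges (BFS)."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def strongly_connected(u, v, edges):
    """True if u can reach v and v can reach u by directed paths."""
    return v in reachable(u, edges) and u in reachable(v, edges)


# Hypothetical example: a 3-cycle a -> b -> c -> a with a dangling node d.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
```

In this example a and c are strongly connected, but a and d are not, because d has no path back to a.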

21
Alternating connectivity
  • A sequence P of edges is an alternating path from
    u to v if:
  • P is a path in G_u with endpoints u and v
  • Orientations of the edges of P strictly alternate
    in G
  • Symmetric but not transitive
  • u -> a <- b -> v -> c <- d -> w
  • Nearly an equivalence: if u is related to three
    nodes a, b, c, then at least one pair among
    a, b, c is related (a claw-free relation)
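
The alternating-path condition can be tested mechanically. The sketch below uses the example from the slide; the function name is hypothetical, and it assumes no pair of nodes is linked in both directions.

```python
def is_alternating_path(nodes, edges):
    """True if consecutive nodes are linked and edge directions alternate."""
    edge_set = set(edges)
    directions = []
    for a, b in zip(nodes, nodes[1:]):
        if (a, b) in edge_set:
            directions.append(1)    # edge points along the walk
        elif (b, a) in edge_set:
            directions.append(-1)   # edge points against the walk
        else:
            return False            # not a path in the undirected graph
    return all(x != y for x, y in zip(directions, directions[1:]))


# Edges of the slide's example: u -> a <- b -> v -> c <- d -> w
example = [("u", "a"), ("b", "a"), ("b", "v"),
           ("v", "c"), ("d", "c"), ("d", "w")]

u_to_v = is_alternating_path(["u", "a", "b", "v"], example)
v_to_w = is_alternating_path(["v", "c", "d", "w"], example)
u_to_w = is_alternating_path(["u", "a", "b", "v", "c", "d", "w"], example)
```

The path from u to v alternates and so does the path from v to w, but joining them does not, because b -> v and v -> c point the same way. This is the sense in which the relation is not transitive.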

22
Model
  • Reasons:
  • Allows us to model properties of the web graph:
    degrees, distribution of C_{i,j}'s
  • Predict the behavior of algorithms on the web
  • Explain structural properties of today's web and
    predict future ones
  • Intuition:
  • Some page creators link to other sites without
    regard to topic
  • Most page creators link to pages on the same
    topics

23
A class of random graph models
  • Characterized by four stochastic processes: node
    creation, edge creation, node deletion, and edge
    deletion
  • Node creation: at each step, create a node with
    some probability
  • Node deletion: at each step, delete a node with
    some probability, along with its incident edges

24
Edge creation and deletion
  • At each step we sample a probability distribution
    to determine a node v to add edges out of, and a
    number of edges k that will be added.
  • With probability α we add k edges from v to nodes
    chosen independently and randomly.
  • With probability 1 - α we copy k edges from a
    randomly chosen node to v.
  • Edge deletion is similar
  • Under this model the probability that a node has
    in-degree i converges to i^(-1/(1 - α)).
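
The edge-creation process can be simulated in a simplified form. The sketch below makes several assumptions beyond the slides: one node is created per step, the new node is always the node v that receives out-edges, k is fixed, there is no deletion, and the first node starts with self-loops so that there is something to copy. The names `copy_model` and `in_degrees` are hypothetical.

```python
import random
from collections import Counter


def copy_model(steps, alpha, k=1, seed=0):
    """Simplified copying model; returns {node: list of link targets}."""
    rng = random.Random(seed)
    out_links = {0: [0] * k}  # seed node with self-loops (assumption)
    for v in range(1, steps):
        if rng.random() < alpha:
            # Random linking: k targets chosen independently at random.
            targets = [rng.randrange(v) for _ in range(k)]
        else:
            # Copying: reuse the k out-links of a randomly chosen node.
            prototype = rng.randrange(v)
            targets = list(out_links[prototype])
        out_links[v] = targets
    return out_links


def in_degrees(out_links):
    """Count how many links point at each node."""
    return Counter(t for targets in out_links.values() for t in targets)


graph = copy_model(steps=2000, alpha=0.5)
degrees = in_degrees(graph)
```

Plotting the counts in `degrees` on log-log axes is one way to look for the heavy tail the model is meant to produce. Copying makes already-popular targets more likely to gain further links, which matches the intuition that most page creators link to pages on the same topics.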