The Web as a Graph: Measurements, Models, and Methods


Provided by: robertdab

1
The Web as a Graph: Measurements, Models, and
Methods

Presented by Rob Abernethy
and Saravanan Palanisamy

2
Outline

Impact on Web Search Engines
HITS Algorithm
Trawling Algorithm
Measurements
Models

3
Google Relationship

Recall the PageRank system
The number of citations to the page in question
determines that page's rank
Uses a link-based graph to determine a page's
relevance to a topic
Google paper left out lots of details relating to
this graph (structure, creation, etc.)

4
Definitions

Authoritative Page
A page that is deemed important to a specific
topic by the web community based on the large
number of citations (links) to it found in other
pages
Hub Page
A page that contains citations to all the
important pages focused on a particular topic
In-degree
The number of links pointing to a page
Out-degree
The number of links coming out of a page
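
A minimal sketch of the two degree definitions, assuming the link graph is given as a list of (source, target) pairs. The function name `degrees` and the sample links are illustrative, not from the original slides.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1  # one more link coming out of src
        in_deg[dst] += 1   # one more link pointing to dst
    return dict(in_deg), dict(out_deg)


# Hypothetical three-link graph: pages a and b both cite c, and c cites d.
links = [("a", "c"), ("b", "c"), ("c", "d")]
in_deg, out_deg = degrees(links)
```

Pages that never appear as a target (or source) are simply absent from the corresponding dict, which is equivalent to degree 0.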

5
HITS Algorithm

Input is the World Wide Web
Output is a sub-graph (hopefully) having these
properties
Relatively small
Rich in relevant pages
Contains most of the authoritative and hub pages
Two main steps
Sampling Step
Weight-propagation Step

6
Sampling Step

Create a root set via a regular search engine
Root set consists of about 200 pages
Root set is assumed to contain some hub and
authoritative pages
Expand root set to base set by tracking links
Base set consists of about 1000-3000 pages
Base set should contain a large number of hub and
authoritative pages

7
Sampling Step (cont.)

Using the base set, induce a sub-graph
Delete links in the sub-graph between pages in
the same web site
These links are assumed to be for navigational
purposes and could bias a page's
in-degree or out-degree
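
The sampling step can be sketched as follows. This is an illustration under assumptions, not the presenters' implementation: the crawl is assumed to be available as (source URL, target URL) pairs, "tracking links" is taken to mean following links in both directions from the root set, and "same web site" is approximated by comparing host names. The name `build_base_subgraph` is hypothetical.

```python
from urllib.parse import urlparse


def build_base_subgraph(root_set, links):
    """Expand a root set to a base set and induce the filtered sub-graph.

    links: iterable of (source_url, target_url) pairs (assumed crawl data).
    """
    links = list(links)
    root = set(root_set)
    base = set(root)
    for src, dst in links:
        if src in root:   # page pointed to by a root page
            base.add(dst)
        if dst in root:   # page pointing to a root page
            base.add(src)

    def site(url):
        # Assumption: two pages are on the same site if their hosts match.
        return urlparse(url).netloc

    # Keep links between base-set pages, dropping intra-site (navigational) links.
    edges = {(s, d) for s, d in links
             if s in base and d in base and site(s) != site(d)}
    return base, edges
```

A real system would also cap how many in-linking pages are added per root page; that detail is omitted here.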

8
Weight-propagation Step

Associate non-negative hub and authority weights
with all pages in the sub-graph
Authority weight x_p
Hub weight y_p
Pages with a relatively high x_p will be viewed as
authoritative pages
Pages with a relatively high y_p will be viewed as
hub pages

9
Weight-propagation (cont.)

Note that the relative weights of the pages are
used; there is no universal score for
determining an authoritative or hub page
Also note that all pages receive the same initial
weight; there is no a priori estimation

10
Weight-propagation (cont.)

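Authority update rule: x_p = \sum_{q : q \to p} y_q (sum the hub weights of all pages q that link to p)
Hub update rule: y_p = \sum_{q : p \to q} x_q (sum the authority weights of all pages q that p links to)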
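
A minimal sketch of the weight-propagation loop, assuming the standard HITS updates: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to. The per-round normalization and the fixed iteration count are assumptions added to keep the weights bounded; the names `hits` and `normalize` are illustrative.

```python
import math


def normalize(weights):
    """Scale weights so their squares sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {n: w / norm for n, w in weights.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) weight dicts for a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}  # all pages start with the same weight
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub weights of pages pointing to each page.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum (new) authority weights of pages each page points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub
```

Only the relative order of the resulting weights matters: sorting pages by `auth` or `hub` gives the candidate authority and hub lists.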

11
HITS Algorithm

The sub-graph induced by the base set along with
the weights computed in weight-propagation step
can be used to deliver a list of authoritative
and hub pages
After the initial query to obtain the root set, HITS is
based entirely on link structure and
ignores a web page's textual content

12
HITS Modifications

Weight the value of a link based on
Content
Domain
Date Modified

13
Trawling the Web for cyber-communities

Seeks to enumerate all topics (unlike HITS)
A bipartite core C_{i,j} is a graph on i+j nodes
that contains at least one K_{i,j} as a subgraph.
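
To make the definition concrete, here is a small check that a candidate group of pages contains a complete bipartite graph, i.e. every page on one side links to every page on the other. The terms `fans` (the linking side) and `centers` (the linked-to side), and the function name, are labels chosen for this sketch.

```python
def is_bipartite_core(edges, fans, centers):
    """True if every fan links to every center in the given link list."""
    edge_set = set(edges)
    return all((f, c) in edge_set for f in fans for c in centers)


# Hypothetical example: two fan pages that both link to the same three centers.
links = [(f, c) for f in ("f1", "f2") for c in ("c1", "c2", "c3")]
found = is_bipartite_core(links, ["f1", "f2"], ["c1", "c2", "c3"])
```

Extra links among the same pages do not matter; the definition only requires that the complete bipartite pattern is present.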

14
Trawling (contd.)

Identify a large fraction of cyber-communities by
enumerating all the bipartite cores in the web;
this process is called trawling.
Problems of naïve search algorithm
Size of search space is too large
Requires random access to edges in graph, which
requires most of the graph in memory

15
Elimination-generation paradigm

The algorithm performs a number of sequential
passes over the Web graph.
Passes over the data are interleaved with sort
operations which change the order in which the
data is scanned.
Elimination operations and generation operations
are interleaved during each pass.
After elimination/generation there are fewer
neighbors and new opportunities during next pass

16
Elimination - Generation

Elimination filters
Nodes with in-degree 3 or smaller can't
participate on the right side of a C_{4,4}.
Nodes with out-degree 3 or smaller can't
participate on the left side of a C_{4,4}.
Generation filter - A node of in-degree 4 can
belong to a C_{4,4} iff the 4 nodes that point to it
have a neighborhood intersection of size at least
4.
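
The two filters can be sketched in memory as below. This is only an illustration of the pruning logic: the real algorithm works with sequential passes and sorts over data on disk, which this sketch does not attempt. The function names and the default i = j = 4 are choices made for the example.

```python
def eliminate(edges, i=4, j=4):
    """Repeatedly drop links whose endpoints cannot be part of a C_{i,j}."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for src, dst in edges:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        # A left node needs out-degree >= j, a right node needs in-degree >= i.
        kept = {(s, d) for s, d in edges
                if out_deg[s] >= j and in_deg[d] >= i}
        if kept == edges:  # nothing more to prune
            return edges
        edges = kept


def generation_test(edges, node, i=4, j=4):
    """For a node of in-degree exactly i, return the common out-neighborhood
    of the i pages pointing to it if it has size >= j, else None."""
    out_nbrs, fans = {}, set()
    for src, dst in edges:
        out_nbrs.setdefault(src, set()).add(dst)
        if dst == node:
            fans.add(src)
    if len(fans) != i:
        raise ValueError("generation test applies only to nodes of in-degree i")
    common = set.intersection(*(out_nbrs[f] for f in fans))
    return common if len(common) >= j else None
```

Running `eliminate` first shrinks the graph, so more nodes end up with in-degree exactly i and become eligible for `generation_test` on the next pass.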

17
Why is elimination-generation fast?

The in/out-degree of every node drops during each
phase
In each generation test on a node u, we either
eliminate u from further consideration, or we output the
subgraph containing u. Hence the algorithm is
linear in the number of nodes.
Elimination phases eliminate most nodes in the
web graph

18
Measurements

In-degree distribution
Out-degree distribution

19
Degree distribution

Probability that a node has in/out degree i is
proportional to 1/i^α, where α ≈ 2
This is a Zipfian distribution, which does not
arise in a model such as G_{n,p} (Poisson or
binomial distribution).
Number of bipartite cores
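
A rough way to check for this kind of distribution is to fit a line to the degree histogram on log-log axes: if the frequency of degree i is proportional to 1/i raised to some exponent, the slope of that line is minus the exponent. The sketch below uses plain least squares, which is a crude estimator chosen for simplicity; the function name `loglog_slope` is illustrative.

```python
import math
from collections import Counter


def loglog_slope(degrees):
    """Least-squares slope of log(count) against log(degree).

    Needs at least two distinct positive degrees in the input.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

On degrees drawn from a binomial or Poisson model the histogram is not a straight line on log-log axes, so a poor linear fit is itself a signal that the Zipfian description does not apply.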

20
Connectivity of local subgraphs

u and v are biconnected if there is no third node
w so that w lies on all u-v paths.
The undirected graph G_u has a giant biconnected
component, with small pieces connected around it
u and v are strongly connected if each can reach
the other by a directed path
For many of the graphs the largest strongly
connected component has size less than 20

21
Alternating connectivity

Sequence P of edges is an alternating-path from u
to v if
P is a path in G_u with endpoints u and v
Orientations of the edges of P strictly alternate
in G
Symmetric but not transitive
u -> a <- b -> v -> c <- d -> w
Nearly an equivalence relation: if u is related to three
nodes a, b, c, then at least one pair among a, b, c
is related (a claw-free relation)

22
Model

Reasons
Allows us to model properties of the web graph:
degrees, distribution of C_{i,j}'s
Predict the behavior of algorithms on the web
Understand structural properties of today's web and predict
future ones
Intuition
Some page creators link to other sites without
regard to topics
Most page creators link to pages with same topics

23
A class of random graph models

Characterized by four stochastic processes: node
and edge creation, and node and edge deletion
Node creation: At each step, create a node with
a given probability
Node deletion: At each step, delete a node, and also
its incident edges, with a given probability

24
Edge creation and deletion

At each step we sample a probability distribution
to determine a node v to add edges out of, and a
number of edges k that will be added.
With probability β we add k edges from v to nodes
chosen independently and uniformly at random.
With probability 1 - β we copy k edges from a
randomly chosen node to v.
Edge deletion is similar
Under this model the probability that a node has
in-degree i converges to i^{-1/(1-β)}.
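
A simplified simulation of the edge-creation process, as a sketch under stated assumptions: there are no deletions, the node v that receives new edges is always the newly created node, k is fixed, and the probability of choosing targets uniformly at random is passed in as `beta`. The function name `copying_model` and the small seed graph are illustrative.

```python
import random


def copying_model(steps, k=3, beta=0.5, seed=0):
    """Grow a graph by random linking (prob. beta) or edge copying (prob. 1 - beta).

    Returns a dict mapping each node to the list of its k out-link targets.
    """
    rng = random.Random(seed)
    # Seed graph (assumption): k + 1 nodes, each linking to the other k.
    out_edges = {v: [u for u in range(k + 1) if u != v] for v in range(k + 1)}
    for v in range(k + 1, k + 1 + steps):
        if rng.random() < beta:
            # Link to k existing nodes chosen independently and uniformly.
            targets = [rng.randrange(v) for _ in range(k)]
        else:
            # Copy the k out-links of a randomly chosen existing node.
            prototype = rng.randrange(v)
            targets = list(out_edges[prototype])
        out_edges[v] = targets
    return out_edges
```

Copying is what lets already-popular pages gain links faster: a page with many in-links is more likely to appear among the links of the node being copied.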