1
The Web as a Graph, Models and Algorithms
  • Sridhar Rajagopalan
  • IBM Almaden Research Center

2
Graphs in Computer Science: How Do They Help?
  • What do they model?
  • An abstraction in Core CS.
  • Examples: VLSI Circuits, Communication Networks,
    Logical flow of a computer program, Data
    structures.
  • An abstraction for data and relationships.
  • Examples: The Web, Social Networks, Flows and
    Flow Networks, Biological Data, Taxonomies,
    Citations, Explicit relations within a DB system.
  • What aspects are studied?
  • Algorithms, Data Structures and Complexity
    Theory.
  • Characterization and Modeling of Graphs.
  • Implementations of Graph Algorithms in Specific
    contexts.
  • What is new?
  • Some graphs are getting very very large.
  • Several Web crawls have over 2 Billion pages
    (nodes) and 10 times as many edges.
  • Some social networks (Telephone call graphs, for
    instance) are huge.
  • Sensors and small devices will output enormous
    amounts of data.

3
Why and how does scale change the game?
  • It is not about an instance, it is about THE
    instance.
  • Systems and software are designed from scratch to
    solve one (or a few) instances of the problem.
  • Properties of the instance are important.
    Generality and genericity of the code base are
    not critical.
  • An answer to a single instance can be worth a
    lot.
  • A new kind of algorithm and system design, which
    is very engineering oriented.
  • Non-viability of the random access model.
  • Hardware and software co-design.
  • Time to market considerations encourage careful
    approximations.
  • Statistical approaches are valuable.
  • Do not need the complete answer, even a partial
    high quality result can be very valuable.

4
What does a system and programming model for
processing large data sets have to do?
  • System and programming model for processing large
    graphs.
  • Exceptions will occur.
  • Components (both hard and soft) will fail.
  • Data structures will exceed available memory.
  • Aware of statistical issues.
  • Approximate or incomplete results are usually
    good enough.
  • What happens when you string (even well
    understood) approximate techniques together?

5
Commodity Computing Platforms
  • Disk based storage is plentiful and cheap.
  • Current price: less than 1.5 dollars/gigabyte.
  • Memory hierarchies are complicated and pipelines
    are deep.
  • Large gap between random and sequential access,
    even within main memory.
  • Random access in main memory is slower than
    sequential access to disk (a micro-benchmark
    sketch follows this list).
  • CPU is underutilized, especially in data
    intensive applications.
  • Where have all the cycles gone?
  • Parallelism is great in theory, but hard in
    practice.
  • Naïve data partitioning is the only practical
    option to keep system complexity under control.
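A crude way to see the sequential-versus-random gap for yourself (a Python sketch, not from the talk; the measured ratio depends heavily on the runtime, the array size, and the cache hierarchy):

  # Sum the same array twice: once in sequential index order, once in a
  # shuffled order.  The work is identical; only the access pattern changes.
  import random
  import time

  N = 2_000_000
  data = list(range(N))
  seq_order = list(range(N))
  rand_order = seq_order[:]
  random.shuffle(rand_order)

  def timed_sum(order):
      start = time.perf_counter()
      total = 0
      for i in order:
          total += data[i]
      return total, time.perf_counter() - start

  _, t_seq = timed_sum(seq_order)
  _, t_rand = timed_sum(rand_order)
  print(f"sequential: {t_seq:.2f}s  random: {t_rand:.2f}s  "
        f"ratio: {t_rand / t_seq:.1f}x")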

6
The Streaming Model (Munro-Paterson)
  • Data is seen as a sequence of elements.
  • Proposed as a model for large data sets, as well
    as data from sensors.
  • Issues:
  • Memory usage.
  • Passes required over the data.
  • Many variations.
  • Sorting is hard. As are almost all interesting
    combinatorial/graph theoretic problems.
  • Exact computation of statistical functions is
    hard. Approximation is possible (a one-pass
    sketch follows this list).
  • Relationships to communication complexity and
    information complexity.
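As an illustration of the last two bullets, here is a one-pass sketch (an assumed example, not from the talk) that approximates a statistical quantity, the number of distinct elements in a stream (the frequency moment F0), using bounded memory: keep only the k smallest hash values seen so far.

  import hashlib
  import heapq

  def distinct_estimate(stream, k=256):
      heap = []      # max-heap (negated) of the k smallest hashes in [0, 1)
      held = set()   # hash values currently stored in the heap
      for item in stream:
          h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 2**160
          if h in held:
              continue
          if len(heap) < k:
              heapq.heappush(heap, -h)
              held.add(h)
          elif h < -heap[0]:                    # smaller than current k-th smallest
              held.discard(-heapq.heappushpop(heap, -h))
              held.add(h)
      if len(heap) < k:
          return len(heap)                      # fewer than k distinct items: exact
      return int((k - 1) / -heap[0])            # k-th smallest hash ~ k / (F0 + 1)

  stream = list(range(10_000)) * 3              # 10,000 distinct values, repeated
  print(distinct_estimate(stream))              # close to 10,000 (typically within ~10%)

Memory stays O(k) no matter how long the stream is, and the error shrinks as k grows; exact distinct counting, by contrast, needs memory proportional to the number of distinct items.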

7
Sorting
  • Sorting a large data set on commodity
    hardware is a solved problem.
  • Google sorts more than 100 B terms in its index.
  • SPSort.
  • Penny Sort, Minute Sort.
  • But Sorting well requires a great deal of care
    and customization to the hardware platform.
  • What is the cost of indexing the Web? 2 billion
    documents, each with 500 words: 1 trillion
    records. Going by Penny Sort figures, the cost of
    an index build is under 50 dollars (arithmetic
    sketched after this list).
  • Networking speeds make sharing large quantities
    of streaming data possible.
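A back-of-the-envelope reading of the numbers above (the per-record size and the exact Penny Sort benchmark figure are not stated here, so the last line is only the bound those numbers imply):

  2 \times 10^{9} \text{ documents} \times 500 \text{ words/document}
      = 10^{12} \text{ (word, document) records}

  \$50 = 5{,}000 \text{ pennies} \;\Rightarrow\; \text{implied sorting rate}
      \ge \tfrac{10^{12}}{5 \times 10^{3}} = 2 \times 10^{8} \text{ records per penny}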

8
Model 1: Stream Sort
  • Basic multi-pass data stream model with access to
    a Sort box.
  • Quite powerful.
  • Can do an entire index build (including
    PageRank-like calculations).
  • Spanning Tree, Connected Components, MinCut,
    STCONN, Bipartiteness.
  • Exact computations of order statistics and
    frequency moments.
  • Suffix Tree/Suffix Array build.
  • Red/Blue segment intersection.
  • So strictly stronger than just streaming.
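To give the flavor of these algorithms, here is a toy sketch (assumed, not the talk's code) of connected components written so that the only operations are sequential scans over flat record streams and calls to a Sort box (simulated here with Python's sorted); each round propagates the minimum label across edges.

  from itertools import groupby

  def connected_components(edges, nodes):
      label = [(v, v) for v in nodes]           # stream of (node, label) records
      changed = True
      while changed:
          changed = False
          lab = dict(label)                     # stand-in for a sorted merge
          # Stream out a candidate label along every edge, in both directions.
          candidates = [(v, lab[u]) for u, v in edges] + \
                       [(u, lab[v]) for u, v in edges] + label
          # Sort by node (the Sort box), then one scan keeps the minimum label.
          new_label = []
          for v, group in groupby(sorted(candidates), key=lambda r: r[0]):
              best = min(l for _, l in group)
              changed |= best != lab[v]
              new_label.append((v, best))
          label = new_label
      return label

  edges = [(0, 1), (1, 2), (5, 6)]
  print(connected_components(edges, range(7)))
  # [(0, 0), (1, 0), (2, 0), (3, 3), (4, 4), (5, 5), (6, 5)]

The number of rounds is bounded by the graph diameter; at web scale each round would be a constant number of external sorts plus sequential passes, never a random-access data structure.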

9
Questions About the Web Graph
  • Size: how big is the graph? How many links on a
    page (outdegree)? How many links to a page
    (indegree)?
  • Connectedness: can one browse from any web page
    to any other? How many clicks?
  • Sampling: can we pick a random page on the web?
  • Browsing: how different is browsing from a
    random walk?
  • Applications: can we exploit the structure of the
    web graph for searching and mining?
  • Models: what does the web graph reveal about
    social processes which result in its creation and
    dynamics?

10
A Picture of the Web Graph
Broder et al., 2000
11
Power Laws: A Curious Statistic About the Web
  • Indegree and outdegree distributions of the web
    graph follow a power law.
  • Component size distributions also follow a power
    law.

Broder et al., 2000
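Here "power law" means an inverse-polynomial tail; for indegree, Broder et al. report an exponent of roughly 2.1 (and roughly 2.7 for outdegree):

  \Pr[\text{indegree} = k] \;\propto\; k^{-\alpha}, \qquad \alpha \approx 2.1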
12
Power Laws
  • Inverse polynomial tail.
  • Word frequency in text: Yule's (later
    Mandelbrot's) statistical study of the literary
    vocabulary. Yule, 1944.
  • Citation analysis. Lotka, 1926.
  • Zipf: human behavior and the principle of least
    effort. Zipf, 1947.
  • Pareto: Cours d'économie politique. Pareto, 1897.
  • Network graph. Faloutsos-Faloutsos-Faloutsos,
    1999.
  • Oligonucleotide sequences. Martindale-Konopka,
    1996.
  • Access statistics for web pages (from server
    logs). Glassman, 1997.
  • User behavior (instrumented browsers and
    proxies). Lukose-Huberman, 1998; Crovella and
    others, 1997-99.
  • Many other instances.

13
The Web Is a Rich Source of Graphs
  • Co-citation graph C.
  • Nodes: web pages. Undirected (weighted) edges:
    co-citation (weight).
  • Bibliographic coupling graph B.
  • Nodes: web pages. Undirected (weighted) edges:
    bibliographic overlap.
  • Content similarity graph S.
  • Nodes: web pages. Undirected or directed
    (weighted) edges: similarity of text on the
    pages.
  • Namespace tree T.
  • Nodes: URLs. Directed (labeled) edges: parent
    to child (labeled by extension).
  • Host graph H.
  • Nodes: websites/webhosts. Directed (weighted)
    edges: (number of) links from one host to the
    other.
  • Data mining graph D.
  • Bipartite. Two nodes per web page, one per
    partition. Edges from LHS to RHS corresponding
    to each edge in the web graph.
  • Nodes on the LHS are adjacent vertex sets; on the
    RHS are items.
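A small sketch (toy data, assumed representation) of how C and B can be derived from ordinary out-link adjacency lists:

  from collections import defaultdict
  from itertools import combinations

  web = {"a": ["x", "y"], "b": ["x", "y", "z"], "c": ["z"]}   # page -> out-links

  def cocitation(out_links):
      # C: edge {p, q} weighted by the number of pages linking to both p and q.
      C = defaultdict(int)
      for targets in out_links.values():
          for p, q in combinations(sorted(set(targets)), 2):
              C[(p, q)] += 1
      return dict(C)

  def bibliographic_coupling(out_links):
      # B: edge {u, v} weighted by the number of pages both u and v link to.
      cited_by = defaultdict(set)
      for u, targets in out_links.items():
          for t in targets:
              cited_by[t].add(u)
      B = defaultdict(int)
      for citers in cited_by.values():
          for u, v in combinations(sorted(citers), 2):
              B[(u, v)] += 1
      return dict(B)

  print(cocitation(web))               # {('x', 'y'): 2, ('x', 'z'): 1, ('y', 'z'): 1}
  print(bibliographic_coupling(web))   # {('a', 'b'): 2, ('b', 'c'): 1}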

14
Applications of Graph Methods
  • Searching.
  • PageRank (Google).
  • HITS.
  • Clever and variants.
  • Data mining.
  • Communities.
  • Focused crawling.
  • Mirrors.
  • Estimation methods.
  • Sampling.
  • Random walks.
  • Browsing and foraging.
  • Back links.
  • Find similar.

15
Spectral Analysis: Matrices
  • Adjacency matrix A(G).
  • Markov chain M(G).
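In the notation assumed here (the slide does not spell it out), A(G) is the 0/1 link matrix and M(G) is its row-normalized random-walk version:

  A_{uv} = \begin{cases} 1 & \text{if page } u \text{ links to page } v \\
                         0 & \text{otherwise,} \end{cases}
  \qquad
  M_{uv} = \frac{A_{uv}}{\operatorname{outdeg}(u)}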

16
Search: Eigenvectors of M.
  • Page rank comes from the web graph. Brin and
    Page, 1998
  • Hub rank comes from bibliographic coupling.
    Kleinberg, 1998
  • Authority rank comes from co-citations.
    Kleinberg, 1998
  • Latent semantic indexing builds on textual
    similarity. Dumais et al.
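A minimal power-iteration sketch of a PageRank-style score (the damping value 0.85 and the uniform handling of dangling pages are common conventions, not taken from the talk):

  def pagerank(out_links, damping=0.85, iterations=50):
      nodes = list(out_links)
      n = len(nodes)
      rank = {v: 1.0 / n for v in nodes}
      for _ in range(iterations):
          new = {v: (1.0 - damping) / n for v in nodes}      # teleportation
          dangling = sum(rank[v] for v in nodes if not out_links[v])
          for v in nodes:
              new[v] += damping * dangling / n               # dangling-page mass
          for u in nodes:
              targets = out_links[u]
              for v in targets:                              # follow links
                  new[v] += damping * rank[u] / len(targets)
          rank = new
      return rank

  web = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a", "c"]}
  print(pagerank(web))     # "a" gets the largest score in this toy graph

Hub and authority ranks are computed analogously, as the leading eigenvectors of A A^T and A^T A respectively.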

17
Co-citation and Web Communities.
  • Social networks: Milgram, six degrees of
    separation, routing.
  • Bibliographic coupling thesis: frequently
    co-cited web pages are related. Pages with large
    bibliographic overlap are related.
  • CS problem: enumerate all frequently co-cited
    groups of web pages (complete bipartite
    subgraphs). We call these cores.
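Stated as toy code (brute force only; the web-scale trawling work relies on iterative pruning and sort-based passes, not this), an (i, j) core is i pages that all point to the same j pages, i.e. a complete bipartite subgraph K_{i,j}:

  from itertools import combinations

  def cores(out_links, i, j):
      found = []
      for fans in combinations(sorted(out_links), i):
          common = set.intersection(*(set(out_links[f]) for f in fans))
          if len(common) >= j:
              found.append((fans, sorted(common)))
      return found

  web = {
      "p1": ["guitar-a", "guitar-b", "guitar-c"],
      "p2": ["guitar-a", "guitar-b", "guitar-c", "news"],
      "p3": ["guitar-a", "guitar-b", "guitar-c"],
      "p4": ["news"],
  }
  print(cores(web, 3, 3))
  # [(('p1', 'p2', 'p3'), ['guitar-a', 'guitar-b', 'guitar-c'])]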

18
The Cores Are Interesting.
Explicit communities.
Implicit communities
  • Hotels in Costa Rica
  • Clipart
  • Japanese elementary schools
  • Turkish student associations
  • Oil spills off the coast of Japan
  • Australian fire brigades
  • Aviation/aircraft vendors
  • Guitar manufacturers

(1) Implicit communities are defined by
cores. (2) There are an order of magnitude more
implicit communities. (3) Very reliable: over
97% (sampled) make sense.
19
More Applications.
  • Find similar: look at the neighborhood in the
    co-citation graph (C).
  • Bidirectional browsing: edges are reversed, and a
    list of pages which point to the currently
    visible one is made available.
  • Mirror detection: find identical sub-trees in the
    namespace tree.

20
Random Graphs
  • Erdős and Rényi's model.
  • Graph with n vertices.
  • Each of the n(n-1) possible arcs appears
    independently with probability p.
  • Graphical evolution (Palmer): study properties of
    the resulting random graph as p is increased from
    0 to 1.
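A small simulation sketch of graphical evolution: sample G(n, p) at a few average degrees and watch the largest connected component jump from a vanishing fraction of n to most of n (the parameters here are illustrative only).

  import random

  def largest_component(n, p, seed=0):
      rng = random.Random(seed)
      adj = [[] for _ in range(n)]
      for u in range(n):                    # flip a coin for every pair
          for v in range(u + 1, n):
              if rng.random() < p:
                  adj[u].append(v)
                  adj[v].append(u)
      seen, best = [False] * n, 0
      for s in range(n):                    # DFS to measure component sizes
          if seen[s]:
              continue
          seen[s] = True
          stack, size = [s], 0
          while stack:
              u = stack.pop()
              size += 1
              for v in adj[u]:
                  if not seen[v]:
                      seen[v] = True
                      stack.append(v)
          best = max(best, size)
      return best

  n = 1000
  for avg_deg in (0.5, 1.0, 2.0, 4.0):
      print(avg_deg, largest_component(n, avg_deg / n))
  # average degree 0.5: a handful of vertices; 4.0: the vast majority of the 1000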

21
Facts About the Erdős-Rényi Model
  • A random graph with average degree 4 has a giant
    connected component containing almost all (90%)
    of the vertices.
  • Indegrees and outdegrees are concentrated around
    the mean and have exponentially declining tails.
  • Most vertices in the graph are close to most
    others (small world).

22
New Models
  • Ad hoc (Aiello, Chung and Lu, 2000): pick a
    random graph which satisfies degree distribution
    constraints.
  • Copying models. Barabási-Albert, KRRT, 2000-01.
  • Local optimization models. Papadimitriou et al.,
    2001.
  • See Mitzenmacher's 2000 survey.
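For concreteness, a minimal preferential-attachment sketch in the spirit of the Barabási-Albert model; the copying models (KRRT) use a different mechanism, with new pages copying links from a randomly chosen prototype, but also produce power-law degree sequences.

  import random
  from collections import Counter

  def preferential_attachment(n, m=3, seed=0):
      rng = random.Random(seed)
      endpoints = list(range(m))    # each node appears here once per incident edge
      edges = []
      for new in range(m, n):
          chosen = set()
          while len(chosen) < m:    # m distinct targets, chosen degree-proportionally
              chosen.add(rng.choice(endpoints))
          for t in chosen:
              edges.append((new, t))
              endpoints.extend((new, t))
      return edges

  edges = preferential_attachment(50_000)
  indegree = Counter(t for _, t in edges)
  histogram = Counter(indegree.values())
  for k in (1, 2, 4, 8, 16, 32, 64):
      print(k, histogram.get(k, 0))   # counts fall off roughly polynomially in k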