Title: The Web as a Graph, Models and Algorithms
1 The Web as a Graph, Models and Algorithms
- Sridhar Rajagopalan
- IBM Almaden Research Center
2 Graphs in Computer Science: How do they help?
- What do they model?
- An abstraction in core CS.
- Examples: VLSI circuits, communication networks, logical flow of a computer program, data structures.
- An abstraction for data and relationships.
- Examples: the Web, social networks, flows and flow networks, biological data, taxonomies, citations, explicit relations within a DB system.
- What aspects are studied?
- Algorithms, data structures and complexity theory.
- Characterization and modeling of graphs.
- Implementations of graph algorithms in specific contexts.
- What is new?
- Some graphs are getting very, very large.
- Several Web crawls have over 2 billion pages (nodes) and 10 times as many edges.
- Some social networks (telephone call graphs, for instance) are huge.
- Sensors and small devices will output enormous amounts of data.
3 Why and how does scale change the game?
- It is not about an instance, it is about THE instance.
- Systems and software are designed from scratch to solve one (or a few) instances of the problem.
- Properties of the instance are important. Generality and genericity of the code base are not critical.
- An answer to a single instance can be worth a lot.
- A new kind of algorithm and system design which is very engineering oriented.
- Non-viability of the random access model.
- Hardware and software co-design.
- Time-to-market considerations encourage careful approximations.
- Statistical approaches are valuable.
- Do not need the complete answer; even a partial high-quality result can be very valuable.
4 What does a system and programming model for processing large data sets have to do?
- System and programming model for processing large graphs.
- Exceptions will occur.
- Components (both hard and soft) will fail.
- Data structures will exceed available memory.
- Aware of statistical issues.
- Approximate or incomplete results are usually good enough.
- What happens when you string (even well understood) approximate techniques together?
5 Commodity Computing Platforms
- Disk based storage is plentiful and cheap.
- Current price less than 1.5 dollars/gigabyte.
- Memory hierarchies are complicated and pipelines are deep.
- Large gap between random and sequential access, even within main memory.
- Random access in main memory is slower than sequential access to disk.
- CPU is underutilized, especially in data intensive applications.
- Where have all the cycles gone?
- Parallelism is great in theory, but hard in practice.
- Naïve data partitioning is the only practical option to keep system complexity under control.
6 The Streaming Model (Munro-Paterson)
- Data is seen as a sequence of elements.
- Proposed as a model for large data sets, as well as data from sensors.
- Issues:
- Memory usage.
- Passes required over the data.
- Many variations.
- Sorting is hard, as are almost all interesting combinatorial/graph theoretic problems.
- Exact computation of statistical functions is hard. Approximation is possible.
- Relationships to communication complexity and information complexity.
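To make "approximation is possible" concrete, here is a minimal sketch of one well-known one-pass summary, the Misra-Gries frequent-items algorithm. The slide does not name this algorithm; it is chosen here only as an illustration. The function name `misra_gries` and the toy stream are assumptions.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Each reported count underestimates the true count by at most
    len(stream) / k, so any item occurring more often than that survives.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream; "a" occurs 5 times out of 9.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
```

Memory is bounded by k regardless of stream length, which is the trade the streaming model asks for: a single pass and small state in exchange for an approximate answer.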
7 Sorting
- Sorting a large data set on commodity hardware is a solved problem.
- Google sorts more than 100 B terms in its index.
- SPSort.
- Penny Sort, Minute Sort.
- But sorting well requires a great deal of care and customization to the hardware platform.
- What is the cost of indexing the Web? 2B documents, each with 500 words = 1 trillion records. Cost of index build per Penny Sort is under 50 bucks.
- Networking speeds make sharing large quantities of streaming data possible.
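As a rough illustration of sorting data that does not fit in memory, the following is a minimal external merge sort sketch: sorted runs are written to temporary files and then merged sequentially. It is not the method used by SPSort or the Penny Sort entries; `external_sort` and `run_size` are hypothetical names, and a tuned system would size runs and buffers to the hardware.

```python
import heapq
import os
import tempfile


def external_sort(lines, run_size):
    """Yield newline-free strings in sorted order using bounded memory.

    Phase 1 writes sorted runs of at most run_size lines to temp files;
    phase 2 streams a k-way merge over those runs.
    """
    run_paths = []
    buffer = []

    def flush():
        if not buffer:
            return
        buffer.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in buffer)
        run_paths.append(path)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) >= run_size:
            flush()
    flush()

    files = [open(path) for path in run_paths]
    try:
        # Each run is read sequentially; only one line per run is in memory.
        runs = [(line.rstrip("\n") for line in f) for f in files]
        for line in heapq.merge(*runs):
            yield line
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)


words = ["pear", "apple", "fig", "kiwi", "date"]
result = list(external_sort(words, run_size=2))
```

Both phases touch the disk only sequentially, which matches the earlier point about the gap between random and sequential access.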
8 Model 1: Stream Sort
- Basic multi-pass data stream model with access to a Sort box.
- Quite powerful.
- Can do entire index build (including PageRank-like calculations).
- Spanning Tree, Connected Components, MinCut, STCONN, Bipartiteness.
- Exact computations of order statistics and frequency moments.
- Suffix Tree/Suffix Array build.
- Red/Blue segment intersection.
- So strictly stronger than just streaming.
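One of the listed capabilities, exact frequency moments, can be sketched as one call to the Sort box followed by one streaming pass. In this sketch Python's built-in `sorted` stands in for the Sort box (an assumption, since a real Sort box would work on disk), and `frequency_moment` is a hypothetical name.

```python
from itertools import groupby


def frequency_moment(stream, k):
    """Exact F_k: the sum over distinct items of (occurrence count) ** k."""
    sorted_stream = sorted(stream)  # stand-in for one call to the Sort box
    total = 0
    # One streaming pass: equal items are now adjacent, so only the
    # length of the current run has to be remembered.
    for _, run in groupby(sorted_stream):
        count = sum(1 for _ in run)
        total += count ** k
    return total


stream = [3, 1, 3, 2, 3, 1]  # hypothetical toy stream
f0 = frequency_moment(stream, 0)  # number of distinct items
f1 = frequency_moment(stream, 1)  # stream length
f2 = frequency_moment(stream, 2)  # sum of squared counts
```

Without the sort, the pass would need a counter per distinct item; with it, constant working memory is enough.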
9 Questions About the Web Graph
- Size: how big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)?
- Connectedness: can one browse from any web page to any other? How many clicks?
- Sampling: can we pick a random page on the web?
- Browsing: how different is browsing from a random walk?
- Applications: can we exploit the structure of the web graph for searching and mining?
- Models: what does the web graph reveal about social processes which result in its creation and dynamics?
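The size and connectedness questions can be made concrete on a small in-memory example. The sketch below computes indegree and outdegree from an edge list and uses breadth-first search to count clicks. The toy edge list and the function name `clicks` are assumptions; a real crawl would not fit in memory like this.

```python
from collections import Counter, deque

# Hypothetical toy web graph: (source page, destination page) links.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

outdegree = Counter(src for src, _ in edges)
indegree = Counter(dst for _, dst in edges)


def clicks(edges, start):
    """Breadth-first search: fewest clicks from start to each reachable page."""
    links = {}
    for src, dst in edges:
        links.setdefault(src, []).append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


from_a = clicks(edges, "a")  # "d" is missing: it cannot be reached from "a"
from_d = clicks(edges, "d")
```

Pages absent from the result are unreachable by browsing, which is exactly the connectedness question on the slide.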
10 A Picture of the Web Graph
Broder et al., 2000
11 Power Laws: A Curious Statistic About the Web
- Indegree and outdegree distributions of the web graph follow a power law.
- Component size distributions follow a power law.
Broder et al., 2000
12 Power Laws
- Inverse polynomial tail.
- Word frequency in text. Yule (later Mandelbrot): statistical study of the literary vocabulary. Yule, 1944.
- Citation analysis. Lotka, 1926.
- Zipf: human behavior and the principle of least effort. Zipf, 1947.
- Pareto: Cours d'économie politique. Pareto, 1897.
- Network graph. Faloutsos-Faloutsos-Faloutsos, 1999.
- Oligonucleotide sequences. Martindale-Konopka, 1996.
- Access statistics for web pages (from server logs). Glassman, 1997.
- User behavior (instrument browsers and proxies). Lukose-Huberman, 1998; Crovella and others, 1997-99.
- Many other instances.
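An inverse polynomial tail means the density falls off like x^(-alpha) for some exponent alpha. As a hedged sketch of how such an exponent might be estimated from data, the code below draws synthetic samples and applies the standard maximum-likelihood estimator for a continuous power law. The exponent 2.5, the sample size, and the function names are illustrative assumptions, not values from the studies cited above.

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n samples with density proportional to x ** -alpha for x >= x_min."""
    # Inverse-transform sampling; 1 - rng.random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def estimate_exponent(samples, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_power_law(20000, alpha=2.5, x_min=1.0, rng=rng)
alpha_hat = estimate_exponent(samples, x_min=1.0)
```

Degree data is discrete and the lower cutoff `x_min` has to be chosen, so on real web data this estimator is only a starting point.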
13 The Web Is a Rich Source of Graphs
- Co-citation graph C.
- Nodes: web pages. Undirected (weighted) edges: co-citation (weight).
- Bibliographic coupling graph B.
- Nodes: web pages. Undirected (weighted) edges: bibliographic overlap.
- Content similarity graph S.
- Nodes: web pages. Undirected or directed (weighted) edges: similarity of text on the pages.
- Namespace tree T.
- Nodes: URLs. Directed (labeled) edges: parent to child (labeled by extension).
- Host graph H.
- Nodes: websites/webhosts. Directed (weighted) edges: (number of) links from one host to the other.
- Data mining graph D.
- Bipartite. Two nodes per web page, one per partition. Edges from LHS to RHS corresponding to each edge in the web graph.
- Nodes on LHS are adjacent vertex sets. On RHS are items.
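Two of these derived graphs, C and B, can be computed directly from out-link lists, as the sketch below shows on a toy example. In matrix terms, with A the adjacency matrix of the web graph, the co-citation weights are the off-diagonal entries of AᵀA and the coupling weights those of AAᵀ. The page names and function names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy web graph: page -> pages it links to.
links = {
    "p1": ["x", "y"],
    "p2": ["x", "y", "z"],
    "p3": ["y", "z"],
}


def cocitation(links):
    """Weight of {u, v} = number of pages that link to both u and v."""
    weights = Counter()
    for targets in links.values():
        for u, v in combinations(sorted(set(targets)), 2):
            weights[(u, v)] += 1
    return weights


def coupling(links):
    """Weight of {p, q} = number of pages that both p and q link to."""
    weights = Counter()
    for p, q in combinations(sorted(links), 2):
        overlap = len(set(links[p]) & set(links[q]))
        if overlap:
            weights[(p, q)] = overlap
    return weights


C = cocitation(links)
B = coupling(links)
```

The pairwise loop in `coupling` is quadratic in the number of pages, so it is only meant for small illustrations.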
14 Applications of Graph Methods
- Searching.
- PageRank (Google).
- HITS.
- Clever and variants.
- Data mining.
- Communities.
- Focused crawling.
- Mirrors.
- Estimation methods.
- Sampling.
- Random walks.
- Browsing and foraging.
- Back links.
- Find similar.
15 Spectral Analysis: Matrices
- Adjacency matrix A(G).
- Markov chain M(G).
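A minimal sketch of how the two matrices relate: M(G) is taken here to be A(G) with each row normalized to sum to 1, so that entry (i, j) is the probability of a random walk stepping from i to j. The handling of pages without out-links (a uniform jump) is an assumption, as are the function names and the toy graph.

```python
def adjacency_matrix(n, edges):
    """A[i][j] = 1 if there is a link from page i to page j, else 0."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def markov_chain(A):
    """Row-normalize A so that M[i][j] is the probability of stepping i -> j.

    Assumption: a page with no out-links jumps to a uniformly random page.
    """
    n = len(A)
    M = []
    for row in A:
        out = sum(row)
        M.append([a / out for a in row] if out else [1.0 / n] * n)
    return M


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # hypothetical 3-page graph
A = adjacency_matrix(3, edges)
M = markov_chain(A)
```

Dense matrices are used only for readability; a web-scale graph would be stored as sparse adjacency lists.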
16 Search: Eigenvectors of M
- PageRank comes from the web graph. Brin and Page, 1998.
- Hub rank comes from bibliographic coupling. Kleinberg, 1998.
- Authority rank comes from co-citations. Kleinberg, 1998.
- Latent semantic indexing builds on textual similarity. Dumais et al.
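As an illustration of ranking by a principal eigenvector, the sketch below runs power iteration on a random-surfer chain over a toy graph. The damping value 0.85, the uniform treatment of dangling pages, the iteration count, and the name `pagerank` are assumptions made for this example, not details taken from the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on the random-surfer Markov chain.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical toy graph: "c" is linked to by three of the four pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
best = max(rank, key=rank.get)
```

Each iteration is one sequential pass over the link lists, which is why this computation fits the multi-pass streaming style discussed earlier.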
17 Co-citation and Web Communities
- Social networks: Milgram, 6DOS, routing.
- Bibliographic coupling thesis: frequently co-cited web pages are related. Pages with large bibliographic overlap are related.
- CS problem: enumerate all frequently co-cited groups of web pages (complete bipartite subgraphs). We call these cores.
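The enumeration problem can be sketched by brute force on a toy graph: for every set of j target pages, find the pages that link to all of them, and report every group of i such pages. The names `find_cores`, `fans` and `centers` are assumptions, and the approach is exponential, so it only illustrates the definition rather than a method that scales to the web.

```python
from itertools import combinations


def find_cores(links, i, j):
    """Return (fans, centers) pairs that form complete bipartite subgraphs
    with i linking pages (fans) and j linked-to pages (centers)."""
    link_sets = {page: set(outs) for page, outs in links.items()}
    targets = sorted(set().union(*link_sets.values()))
    cores = []
    for centers in combinations(targets, j):
        # Pages that link to every one of the chosen centers.
        fans = sorted(p for p, outs in link_sets.items()
                      if outs.issuperset(centers))
        for fan_group in combinations(fans, i):
            cores.append((fan_group, centers))
    return cores


# Hypothetical toy graph: page -> pages it links to.
links = {
    "f1": ["c1", "c2"],
    "f2": ["c1", "c2", "c3"],
    "f3": ["c2", "c3"],
}
cores = find_cores(links, i=2, j=2)
```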
18 The Cores Are Interesting
Explicit communities.
Implicit communities.
- Hotels in Costa Rica
- Clipart
- Japanese elementary schools
- Turkish student associations
- Oil spills off the coast of Japan
- Australian fire brigades
- Aviation/aircraft vendors
- Guitar manufacturers
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more implicit communities.
(3) Very reliable. Over 97% (sampled) make sense.
19 More Applications
- Find similar: look at neighborhood in co-citation graph (C).
- Bidirectional browsing: edges are reversed and a list of pages which point to the currently visible one is made available.
- Mirror detection: find identical sub-trees in the namespace tree.
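Mirror detection by identical sub-trees can be sketched by giving each directory a canonical signature built from its children's names and signatures, then grouping directories that share one. This sketch compares names and shape only, not page content, and the function names and URL paths are hypothetical.

```python
from collections import defaultdict


def build_tree(paths):
    """Nested dict: each key is a path component, each value its subtree."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return root


def find_mirrors(root):
    """Group directories whose subtrees have identical names and shape."""
    groups = defaultdict(list)

    def signature(node, path):
        sig = tuple(sorted((name, signature(child, path + "/" + name))
                           for name, child in node.items()))
        if sig:  # skip leaves: every leaf trivially matches every other
            groups[sig].append(path)
        return sig

    signature(root, "")
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


paths = [
    "hostA/docs/intro.html",
    "hostA/docs/api/index.html",
    "hostB/mirror/intro.html",
    "hostB/mirror/api/index.html",
    "hostB/news.html",
]
mirrors = find_mirrors(build_tree(paths))
```

Here `hostA/docs` and `hostB/mirror` are reported as candidate mirrors, along with their matching `api` subdirectories.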
20 Random Graphs
- Erdős and Rényi's model.
- Graph with n vertices.
- Each of n(n-1) arcs appears independently with probability p.
- Graphical evolution (Palmer): study properties of the resulting random graph as p is increased from 0 to 1.
21 Facts About the Erdős-Rényi Model
- A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices.
- Indegrees and outdegrees are concentrated around the mean, and have exponentially declining tails.
- Most vertices in the graph are close to most others (small world).
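The giant-component behaviour can be observed in a small simulation. The sketch below samples a random graph and measures the largest connected component with a union-find structure. It ignores edge direction and treats "average degree" as the expected number of neighbours per vertex; both are simplifying assumptions, as are the function name, the graph size, and the seed.

```python
import random


def largest_component_fraction(n, avg_degree, rng):
    """Sample G(n, p) with p = avg_degree / (n - 1) and return the fraction
    of vertices in the largest connected component (directions ignored)."""
    p = avg_degree / (n - 1)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                parent[find(u)] = find(v)  # merge the two components

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


rng = random.Random(1)  # fixed seed so the run is repeatable
sparse = largest_component_fraction(1000, 0.5, rng)
dense = largest_component_fraction(1000, 4.0, rng)
```

Sweeping `avg_degree` from below 1 to above 1 shows the transition studied under graphical evolution: the largest component goes from a negligible fraction of the vertices to most of them.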
22 New Models
- Ad hoc. Aiello, Chung and Lu, 2000: pick a random graph which satisfies degree distribution constraints.
- Copying models. Barabasi-Albert, KRRT, 2000-01.
- Local optimization models. Papadimitriou et al., 2001.
- See Mitzenmacher 2000 survey.
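To give a feel for the copying idea, here is a simplified sketch in that spirit: each new page picks an earlier page as a prototype and, link by link, either copies the prototype's destination or picks a uniformly random earlier page. It is not a faithful reproduction of any of the cited models; the parameters `d` and `copy_prob`, the seed page, and the function name are all assumptions.

```python
import random
from collections import Counter


def copying_model(n, d, copy_prob, rng):
    """Grow a directed graph; out_links[v] lists the d destinations of page v."""
    # Seed: a single page with d self-loops (an arbitrary assumption).
    out_links = [[0] * d]
    for v in range(1, n):
        prototype = out_links[rng.randrange(v)]
        links = []
        for k in range(d):
            if rng.random() < copy_prob:
                links.append(prototype[k])      # copy the prototype's k-th link
            else:
                links.append(rng.randrange(v))  # uniform random earlier page
        out_links.append(links)
    return out_links


rng = random.Random(2)
graph = copying_model(n=2000, d=3, copy_prob=0.5, rng=rng)
indegree = Counter(dst for links in graph for dst in links)
```

Because copied links favour pages that are already linked to, the indegree counts come out far more skewed than in the Erdős-Rényi model, which is the qualitative behaviour these models aim to capture.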