1
The Web as a Graph, Models and Algorithms
  • Sridhar Rajagopalan
  • IBM Almaden Research Center

2
Graphs in Computer Science: How Do They Help?
  • What do they model?
  • An abstraction in Core CS.
  • Examples: VLSI Circuits, Communication Networks,
    Logical flow of a computer program, Data
    structures.
  • An abstraction for data and relationships.
  • Examples: The Web, Social Networks, Flows and
    Flow Networks, Biological Data, Taxonomies,
    Citations, Explicit relations within a DB system.
  • What aspects are studied?
  • Algorithms, Data Structures and Complexity
    Theory.
  • Characterization and Modeling of Graphs.
  • Implementations of Graph Algorithms in Specific
    contexts.
  • What is new?
  • Some graphs are getting very very large.
  • Several Web crawls have over 2 Billion pages
    (nodes) and 10 times as many edges.
  • Some social networks (Telephone call graphs, for
    instance) are huge.
  • Sensors and small devices will output enormous
    amounts of data.

3
Why and how does scale change the game?
  • It is not about an instance, it is about THE
    instance.
  • Systems and software are designed from scratch to
    solve one (or a few) instances of the problem.
  • Properties of the instance are important.
    Generality and genericity of the code base are
    not critical.
  • An answer to a single instance can be worth a
    lot.
  • A new kind of algorithm and system design, which
    is very engineering oriented.
  • Non-viability of the random access model.
  • Hardware and software co-design.
  • Time to market considerations encourage careful
    approximations.
  • Statistical approaches are valuable.
  • Do not need the complete answer, even a partial
    high quality result can be very valuable.

4
What does a system and programming model for
processing large data sets have to do?
  • System and programming model for processing large
    graphs.
  • Exceptions will occur.
  • Components (both hard and soft) will fail.
  • Data structures will exceed available memory.
  • Aware of statistical issues.
  • Approximate or incomplete results are usually
    good enough.
  • What happens when you string (even well
    understood) approximate techniques together?

5
Commodity Computing Platforms
  • Disk based storage is plentiful and cheap.
  • Current price: less than 1.5 dollars/gigabyte.
  • Memory hierarchies are complicated and pipelines
    are deep.
  • Large gap between random and sequential access,
    even within main memory.
  • Random access in main memory is slower than
    sequential access to disk (a micro-benchmark
    sketch follows this list).
  • CPU is underutilized, especially in data
    intensive applications.
  • Where have all the cycles gone?
  • Parallelism is great in theory, but hard in
    practice.
  • Naïve data partitioning is the only practical
    option to keep system complexity under control.
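A crude way to see the sequential-versus-random gap for yourself (a Python sketch, not from the talk; the measured ratio depends heavily on the runtime, the array size, and the cache hierarchy):

  # Sum the same array twice: once in sequential index order, once in a
  # shuffled order.  The work is identical; only the access pattern changes.
  import random
  import time

  N = 2_000_000
  data = list(range(N))
  seq_order = list(range(N))
  rand_order = seq_order[:]
  random.shuffle(rand_order)

  def timed_sum(order):
      start = time.perf_counter()
      total = 0
      for i in order:
          total += data[i]
      return total, time.perf_counter() - start

  _, t_seq = timed_sum(seq_order)
  _, t_rand = timed_sum(rand_order)
  print(f"sequential: {t_seq:.2f}s  random: {t_rand:.2f}s  "
        f"ratio: {t_rand / t_seq:.1f}x")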

6
The Streaming Model (Munro-Paterson)
  • Data is seen as a sequence of elements.
  • Proposed as a model for large data sets, as well
    as data from sensors.
  • Issues:
  • Memory usage.
  • Passes required over the data.
  • Many variations.
  • Sorting is hard. As are almost all interesting
    combinatorial/graph theoretic problems.
  • Exact computation of statistical functions is
    hard. Approximation is possible (a one-pass
    sketch follows this list).
  • Relationships to communication complexity and
    information complexity.
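As an illustration of the last two bullets, here is a one-pass sketch (an assumed example, not from the talk) that approximates a statistical quantity, the number of distinct elements in a stream (the frequency moment F0), using bounded memory: keep only the k smallest hash values seen so far.

  import hashlib
  import heapq

  def distinct_estimate(stream, k=256):
      heap = []      # max-heap (negated) of the k smallest hashes in [0, 1)
      held = set()   # hash values currently stored in the heap
      for item in stream:
          h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 2**160
          if h in held:
              continue
          if len(heap) < k:
              heapq.heappush(heap, -h)
              held.add(h)
          elif h < -heap[0]:                    # smaller than current k-th smallest
              held.discard(-heapq.heappushpop(heap, -h))
              held.add(h)
      if len(heap) < k:
          return len(heap)                      # fewer than k distinct items: exact
      return int((k - 1) / -heap[0])            # k-th smallest hash ~ k / (F0 + 1)

  stream = list(range(10_000)) * 3              # 10,000 distinct values, repeated
  print(distinct_estimate(stream))              # close to 10,000 (typically within ~10%)

Memory stays O(k) no matter how long the stream is, and the error shrinks as k grows; exact distinct counting, by contrast, needs memory proportional to the number of distinct items.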

7
Sorting
  • Sorting a large data set on commodity
    hardware is a solved problem.
  • Google sorts more than 100 B terms in its index.
  • SPSort.
  • Penny Sort, Minute Sort.
  • But Sorting well requires a great deal of care
    and customization to the hardware platform.
  • What is the cost of indexing the Web? 2 billion
    documents, each with 500 words: 1 trillion
    records. Going by Penny Sort figures, the cost of
    an index build is under 50 dollars (arithmetic
    sketched after this list).
  • Networking speeds make sharing large quantities
    of streaming data possible.
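A back-of-the-envelope reading of the numbers above (the per-record size and the exact Penny Sort benchmark figure are not stated here, so the last line is only the bound those numbers imply):

  2 \times 10^{9} \text{ documents} \times 500 \text{ words/document}
      = 10^{12} \text{ (word, document) records}

  \$50 = 5{,}000 \text{ pennies} \;\Rightarrow\; \text{implied sorting rate}
      \ge \tfrac{10^{12}}{5 \times 10^{3}} = 2 \times 10^{8} \text{ records per penny}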

8
Model 1: Stream Sort
  • Basic multi-pass data stream model with access to
    a Sort box.
  • Quite powerful.
  • Can do an entire index build (including
    PageRank-like calculations).
  • Spanning Tree, Connected Components, MinCut,
    STCONN, Bipartiteness.
  • Exact computations of order statistics and
    frequency moments.
  • Suffix Tree/Suffix Array build.
  • Red/Blue segment intersection.
  • So strictly stronger than just streaming.
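To give the flavor of these algorithms, here is a toy sketch (assumed, not the talk's code) of connected components written so that the only operations are sequential scans over flat record streams and calls to a Sort box (simulated here with Python's sorted); each round propagates the minimum label across edges.

  from itertools import groupby

  def connected_components(edges, nodes):
      label = [(v, v) for v in nodes]           # stream of (node, label) records
      changed = True
      while changed:
          changed = False
          lab = dict(label)                     # stand-in for a sorted merge
          # Stream out a candidate label along every edge, in both directions.
          candidates = [(v, lab[u]) for u, v in edges] + \
                       [(u, lab[v]) for u, v in edges] + label
          # Sort by node (the Sort box), then one scan keeps the minimum label.
          new_label = []
          for v, group in groupby(sorted(candidates), key=lambda r: r[0]):
              best = min(l for _, l in group)
              changed |= best != lab[v]
              new_label.append((v, best))
          label = new_label
      return label

  edges = [(0, 1), (1, 2), (5, 6)]
  print(connected_components(edges, range(7)))
  # [(0, 0), (1, 0), (2, 0), (3, 3), (4, 4), (5, 5), (6, 5)]

The number of rounds is bounded by the graph diameter; at web scale each round would be a constant number of external sorts plus sequential passes, never a random-access data structure.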

9
Questions About the Web Graph
  • Size: how big is the graph? How many links on a
    page (outdegree)? How many links to a page
    (indegree)?
  • Connectedness: can one browse from any web page
    to any other? How many clicks?
  • Sampling: can we pick a random page on the web?
  • Browsing: how different is browsing from a
    random walk?
  • Applications: can we exploit the structure of the
    web graph for searching and mining?
  • Models: what does the web graph reveal about
    social processes which result in its creation and
    dynamics?

10
A Picture of the Web Graph
Broder et al., 2000
11
Power Laws: A Curious Statistic About the Web
  • Indegree and outdegree distributions of the web
    graph follow a power law.
  • Component size distributions also follow a power
    law.

Broder et al., 2000
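Here "power law" means an inverse-polynomial tail; for indegree, Broder et al. report an exponent of roughly 2.1 (and roughly 2.7 for outdegree):

  \Pr[\text{indegree} = k] \;\propto\; k^{-\alpha}, \qquad \alpha \approx 2.1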
12
Power Laws
  • Inverse polynomial tail.
  • Word frequency in text: Yule's (later
    Mandelbrot's) statistical study of the literary
    vocabulary. Yule, 1944.
  • Citation analysis. Lotka, 1926.
  • Zipf: human behavior and the principle of least
    effort. Zipf, 1947.
  • Pareto: Cours d'économie politique. Pareto, 1897.
  • Network graph. Faloutsos-Faloutsos-Faloutsos,
    1999.
  • Oligonucleotide sequences. Martindale-Konopka,
    1996.
  • Access statistics for web pages (from server
    logs). Glassman, 1997.
  • User behavior (instrumented browsers and
    proxies). Lukose-Huberman, 1998; Crovella and
    others, 1997-99.
  • Many other instances.

13
The Web Is a Rich Source of Graphs
  • Co-citation graph C.
  • Nodes: web pages. Undirected (weighted) edges:
    co-citation (weight).
  • Bibliographic coupling graph B.
  • Nodes: web pages. Undirected (weighted) edges:
    bibliographic overlap.
  • Content similarity graph S.
  • Nodes: web pages. Undirected or directed
    (weighted) edges: similarity of text on the
    pages.
  • Namespace tree T.
  • Nodes: URLs. Directed (labeled) edges: parent
    to child (labeled by extension).
  • Host graph H.
  • Nodes: websites/webhosts. Directed (weighted)
    edges: (number of) links from one host to the
    other.
  • Data mining graph D.
  • Bipartite. Two nodes per web page, one per
    partition. Edges from LHS to RHS corresponding
    to each edge in the web graph.
  • Nodes on the LHS are adjacent vertex sets; on the
    RHS are items.
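A small sketch (toy data, assumed representation) of how C and B can be derived from ordinary out-link adjacency lists:

  from collections import defaultdict
  from itertools import combinations

  web = {"a": ["x", "y"], "b": ["x", "y", "z"], "c": ["z"]}   # page -> out-links

  def cocitation(out_links):
      # C: edge {p, q} weighted by the number of pages linking to both p and q.
      C = defaultdict(int)
      for targets in out_links.values():
          for p, q in combinations(sorted(set(targets)), 2):
              C[(p, q)] += 1
      return dict(C)

  def bibliographic_coupling(out_links):
      # B: edge {u, v} weighted by the number of pages both u and v link to.
      cited_by = defaultdict(set)
      for u, targets in out_links.items():
          for t in targets:
              cited_by[t].add(u)
      B = defaultdict(int)
      for citers in cited_by.values():
          for u, v in combinations(sorted(citers), 2):
              B[(u, v)] += 1
      return dict(B)

  print(cocitation(web))               # {('x', 'y'): 2, ('x', 'z'): 1, ('y', 'z'): 1}
  print(bibliographic_coupling(web))   # {('a', 'b'): 2, ('b', 'c'): 1}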

14
Applications of Graph Methods
  • Searching.
  • PageRank (Google).
  • HITS.
  • Clever and variants.
  • Data mining.
  • Communities.
  • Focused crawling.
  • Mirrors.
  • Estimation methods.
  • Sampling.
  • Random walks.
  • Browsing and foraging.
  • Back links.
  • Find similar.

15
Spectral Analysis: Matrices
  • Adjacency matrix A(G).
  • Markov chain M(G).
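In the notation assumed here (the slide does not spell it out), A(G) is the 0/1 link matrix and M(G) is its row-normalized random-walk version:

  A_{uv} = \begin{cases} 1 & \text{if page } u \text{ links to page } v \\
                         0 & \text{otherwise,} \end{cases}
  \qquad
  M_{uv} = \frac{A_{uv}}{\operatorname{outdeg}(u)}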

16
Search: Eigenvectors of M.
  • Page rank comes from the web graph. Brin and
    Page, 1998
  • Hub rank comes from bibliographic coupling.
    Kleinberg, 1998
  • Authority rank comes from co-citations.
    Kleinberg, 1998
  • Latent semantic indexing builds on textual
    similarity. Dumais et al.
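A minimal power-iteration sketch of a PageRank-style score (the damping value 0.85 and the uniform handling of dangling pages are common conventions, not taken from the talk):

  def pagerank(out_links, damping=0.85, iterations=50):
      nodes = list(out_links)
      n = len(nodes)
      rank = {v: 1.0 / n for v in nodes}
      for _ in range(iterations):
          new = {v: (1.0 - damping) / n for v in nodes}      # teleportation
          dangling = sum(rank[v] for v in nodes if not out_links[v])
          for v in nodes:
              new[v] += damping * dangling / n               # dangling-page mass
          for u in nodes:
              targets = out_links[u]
              for v in targets:                              # follow links
                  new[v] += damping * rank[u] / len(targets)
          rank = new
      return rank

  web = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a", "c"]}
  print(pagerank(web))     # "a" gets the largest score in this toy graph

Hub and authority ranks are computed analogously, as the leading eigenvectors of A A^T and A^T A respectively.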

17
Co-citation and Web Communities.
  • Social networks: Milgram, six degrees of
    separation, routing.
  • Bibliographic coupling thesis: frequently
    co-cited web pages are related. Pages with large
    bibliographic overlap are related.
  • CS problem: enumerate all frequently co-cited
    groups of web pages (complete bipartite
    subgraphs). We call these cores.
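Stated as toy code (brute force only; the web-scale trawling work relies on iterative pruning and sort-based passes, not this), an (i, j) core is i pages that all point to the same j pages, i.e. a complete bipartite subgraph K_{i,j}:

  from itertools import combinations

  def cores(out_links, i, j):
      found = []
      for fans in combinations(sorted(out_links), i):
          common = set.intersection(*(set(out_links[f]) for f in fans))
          if len(common) >= j:
              found.append((fans, sorted(common)))
      return found

  web = {
      "p1": ["guitar-a", "guitar-b", "guitar-c"],
      "p2": ["guitar-a", "guitar-b", "guitar-c", "news"],
      "p3": ["guitar-a", "guitar-b", "guitar-c"],
      "p4": ["news"],
  }
  print(cores(web, 3, 3))
  # [(('p1', 'p2', 'p3'), ['guitar-a', 'guitar-b', 'guitar-c'])]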

18
The Cores Are Interesting.
Explicit communities.
Implicit communities
  • Hotels in Costa Rica
  • Clipart
  • Japanese elementary schools
  • Turkish student associations
  • Oil spills off the coast of Japan
  • Australian fire brigades
  • Aviation/aircraft vendors
  • Guitar manufacturers

(1) Implicit communities are defined by
cores. (2) There are an order of magnitude more
implicit communities. (3) Very reliable: over
97% (sampled) make sense.
19
More Applications.
  • Find similar: look at the neighborhood in the
    co-citation graph (C).
  • Bidirectional browsing: edges are reversed, and a
    list of pages which point to the currently
    visible one is made available.
  • Mirror detection: find identical sub-trees in the
    namespace tree.

20
Random Graphs
  • Erdős and Rényi's model.
  • Graph with n vertices.
  • Each of the n(n-1) possible arcs appears
    independently with probability p.
  • Graphical evolution (Palmer): study properties of
    the resulting random graph as p is increased from
    0 to 1.
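A small simulation sketch of graphical evolution: sample G(n, p) at a few average degrees and watch the largest connected component jump from a vanishing fraction of n to most of n (the parameters here are illustrative only).

  import random

  def largest_component(n, p, seed=0):
      rng = random.Random(seed)
      adj = [[] for _ in range(n)]
      for u in range(n):                    # flip a coin for every pair
          for v in range(u + 1, n):
              if rng.random() < p:
                  adj[u].append(v)
                  adj[v].append(u)
      seen, best = [False] * n, 0
      for s in range(n):                    # DFS to measure component sizes
          if seen[s]:
              continue
          seen[s] = True
          stack, size = [s], 0
          while stack:
              u = stack.pop()
              size += 1
              for v in adj[u]:
                  if not seen[v]:
                      seen[v] = True
                      stack.append(v)
          best = max(best, size)
      return best

  n = 1000
  for avg_deg in (0.5, 1.0, 2.0, 4.0):
      print(avg_deg, largest_component(n, avg_deg / n))
  # average degree 0.5: a handful of vertices; 4.0: the vast majority of the 1000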

21
Facts About the Erdős-Rényi Model
  • A random graph with average degree 4 has a giant
    connected component containing almost all (90%)
    of the vertices.
  • Indegrees and outdegrees are concentrated around
    the mean and have exponentially declining tails.
  • Most vertices in the graph are close to most
    others (small world).

22
New Models
  • Ad hoc (Aiello, Chung and Lu, 2000): pick a
    random graph which satisfies degree distribution
    constraints.
  • Copying models. Barabási-Albert, KRRT, 2000-01.
  • Local optimization models. Papadimitriou et al.,
    2001.
  • See Mitzenmacher's 2000 survey.
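For concreteness, a minimal preferential-attachment sketch in the spirit of the Barabási-Albert model; the copying models (KRRT) use a different mechanism, with new pages copying links from a randomly chosen prototype, but also produce power-law degree sequences.

  import random
  from collections import Counter

  def preferential_attachment(n, m=3, seed=0):
      rng = random.Random(seed)
      endpoints = list(range(m))    # each node appears here once per incident edge
      edges = []
      for new in range(m, n):
          chosen = set()
          while len(chosen) < m:    # m distinct targets, chosen degree-proportionally
              chosen.add(rng.choice(endpoints))
          for t in chosen:
              edges.append((new, t))
              endpoints.extend((new, t))
      return edges

  edges = preferential_attachment(50_000)
  indegree = Counter(t for _, t in edges)
  histogram = Counter(indegree.values())
  for k in (1, 2, 4, 8, 16, 32, 64):
      print(k, histogram.get(k, 0))   # counts fall off roughly polynomially in k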