Title: Graph Structure in the Web
1Graph Structure in the Web
- AltaVista
- IBM Almaden
- Compaq SRC
- 9th WWW Conference, http://www9.org
Vamsi Vutukuru, Anuj Khare
2Presentation Structure
- Introduction
- Motivation
- Prior Work
- Experiments
- Infrastructure
- Algorithms run
- Results
- Interpretation
- Future Work
3Motivation
Why study the Web graph?
- Crawl strategies
- Behavior of Web algorithms (HITS, PageRank)
- Evolution of the Web (Webrings, bipartite cores)
4Prior Work
- Observations of power law distributions
- Kumar et al.: 40 million pages (1997 crawl)
- Barabasi et al.: 325K nodes in the nd.edu domain (1999); diameter of the web 19
- Graph-theoretic methods
- Kleinberg, PageRank: search
- Mendelzon and Wood: Web mining (1995)
5Terminology
- Directed Graph
- Out degree, in-degree
- Strongly Connected Component (SCC)
- Weakly Connected Component (WCC)
- BFS
- Diameter
- max Shortest_Path(u,v) for all u,v ∈ V
- average Shortest_Path(u,v) for all u,v ∈ V
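The two diameter notions above can be illustrated with a small sketch. It runs a BFS from every node and keeps only finite path lengths, since on a directed graph many ordered pairs have no path at all. The function names and the toy graph are hypothetical, and the all-pairs approach is only practical for tiny graphs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop count} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adj):
    """Max and mean finite shortest-path length over ordered pairs u != v."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    lengths = []
    for u in nodes:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                lengths.append(d)
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical toy graph: a cycle a -> b -> c -> a plus a dangling page d.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
longest, average = path_statistics(toy)
```

Here `longest` plays the role of the max definition and `average` the role of the average definition, both restricted to pairs that are actually connected.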
6Infrastructure
- Data
- 2 AltaVista crawls - May 99, October 99
- 203 million web-pages/nodes
- 1.5 billion links/edges
- Connectivity Server 2 (CS2)
- 465 MHz Compaq AlphaServer 4100
- 12GB RAM
- BFS reaching 100 million nodes takes about 4 minutes
7Algorithms Run
- BFS
- WCC algorithm
- SCC algorithm
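The slides do not say which SCC algorithm was run. As one possible illustration, the sketch below uses Kosaraju's two-pass algorithm, written iteratively so that long link chains do not exhaust the recursion limit. The function name and the toy graph are assumptions made for this example.

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm, iterative version. Returns a list of node sets."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    reverse = {u: [] for u in nodes}
    for u, targets in adj.items():
        for v in targets:
            reverse[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adj.get(nxt, ()))))
                    break
            else:
                # All out-links explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

# Hypothetical toy graph with three SCCs: {1, 2, 3}, {4, 5} and {6}.
toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [1]}
components = strongly_connected_components(toy)
```

The largest set in `components` corresponds to what the later slides call the SCC.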
8Results
Power law for in-degree: the probability that a node
has in-degree i is proportional to 1/i^x
In-degree: exponent 2.1
Out-degree: exponent 2.72
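A rough way to read an exponent off a degree distribution is a straight-line fit on log-log axes. The sketch below is only an illustration of that idea: the slides do not describe the fitting procedure actually used, the function names are hypothetical, and the input is synthetic rather than crawl data.

```python
import math
from collections import Counter

def in_degree_counts(adj):
    """Map each in-degree i >= 1 to the number of nodes with that in-degree."""
    indeg = Counter(v for targets in adj.values() for v in targets)
    return Counter(indeg.values())

def fit_power_law_exponent(count_by_degree):
    """Least-squares slope of log(count) against log(degree), negated."""
    points = [(math.log(i), math.log(c))
              for i, c in count_by_degree.items() if i > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx

# Synthetic counts that follow 1 / i**2.1 exactly (not data from the crawl).
synthetic = {i: 1e9 * i ** -2.1 for i in range(1, 200)}
estimate = fit_power_law_exponent(synthetic)
```

On real data the tail of the distribution is noisy, so a plain least-squares fit like this one should be treated as a first approximation.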
9(No Transcript)
10Undirected Connected Components
Giant WCC: 186 million nodes (91%)
Is this because of junctions? Remove nodes with in-degree 5 or more: the WCC still has 59 million nodes
WCC size distribution: exponent 2.5
Connectivity is resilient: hubs and authorities
are embedded in a graph that is well connected
even without them
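The resilience experiment can be sketched as follows: drop every node at or above a chosen in-degree threshold, then measure the largest weakly connected component of what remains. The union-find implementation, the function name and the toy edge list are assumptions made for this illustration.

```python
from collections import Counter

def largest_wcc_size(edges, max_in_degree=None):
    """Size of the largest weakly connected component.

    If max_in_degree is given, nodes whose in-degree is >= max_in_degree
    are dropped (together with their edges) before components are computed.
    """
    indeg = Counter(v for _, v in edges)
    nodes = {n for edge in edges for n in edge}
    if max_in_degree is not None:
        nodes = {n for n in nodes if indeg[n] < max_in_degree}

    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Edge direction is ignored: both endpoints join the same component.
    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values(), default=0)

# Hypothetical toy graph: hub "h" has in-degree 5; a -> b -> c is a side chain.
edges = [(x, "h") for x in "abcde"] + [("a", "b"), ("b", "c")]
with_hub = largest_wcc_size(edges)
without_hub = largest_wcc_size(edges, max_in_degree=5)
```

In this toy case the component shrinks once the hub is gone; the observation on the slide is that on the Web a giant component survives the same treatment.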
11Strongly Connected Components
Largest SCC: 56 million pages (28%)
Where have the other pages gone?
SCC size distribution: exponent 2.5
12 Bowtie
13Experiments
- Random-start BFS
- 570 randomly chosen start nodes
- Forward BFS
- Backward BFS
- Start nodes
- BFS dies out within 90 nodes
- BFS explodes to cover about 100 million nodes
- Both forward and backward BFS explode
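The experiment above can be sketched on a toy graph: run one BFS along out-links and one along in-links from the same start node, then check whether each search "explodes" past a size threshold. The threshold value, the function names and the graph are hypothetical and chosen only to make the three outcomes visible.

```python
from collections import deque

def reverse_graph(adj):
    """Return the graph with every edge direction flipped."""
    rev = {u: [] for u in adj}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev

def reach_count(adj, start):
    """Number of nodes a BFS from start visits, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

def bfs_outcome(adj, start, explode_at):
    """Return (forward explodes, backward explodes) for one start node."""
    rev = reverse_graph(adj)
    return (reach_count(adj, start) >= explode_at,
            reach_count(rev, start) >= explode_at)

# Hypothetical toy graph: i feeds a 3-cycle s1 -> s2 -> s3 -> s1, which feeds o.
toy = {"i": ["s1"], "s1": ["s2"], "s2": ["s3"], "s3": ["s1", "o"], "o": []}
from_i = bfs_outcome(toy, "i", explode_at=4)
from_s1 = bfs_outcome(toy, "s1", explode_at=4)
from_o = bfs_outcome(toy, "o", explode_at=4)
```

A start node where only the forward search explodes sits upstream of the cycle, one where only the backward search explodes sits downstream, and one where both explode sits on the cycle itself.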
14(No Transcript)
15Interpretation
WCC: 186 million nodes; SCC: 56 million nodes
DISC = TOTAL - WCC
SCC + IN: forward BFS explodes
SCC + OUT: backward BFS explodes
16IN, OUT and TENDRILS
- Every BFS start node in SCC reaches
- 99,807,161 nodes through in-link expansion
- Hence SCC + IN ≈ 100 million
- 99,630,178 nodes through out-link expansion
- Hence SCC + OUT ≈ 100 million
TENDRILS = WCC - (SCC + IN + OUT)
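The set arithmetic on this slide can be sketched as a small decomposition routine. Starting from one node assumed to lie in the giant SCC, a forward search gives SCC plus OUT, a backward search gives SCC plus IN, and an undirected search gives the WCC. The function names, the seed node and the toy graph are hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(adj, scc_seed):
    """Split nodes into SCC, IN, OUT, TENDRILS and DISC around scc_seed."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    rev = {n: set() for n in nodes}
    undirected = {n: set() for n in nodes}
    for u, targets in adj.items():
        for v in targets:
            rev[v].add(u)
            undirected[u].add(v)
            undirected[v].add(u)

    forward = reachable(adj, scc_seed)    # SCC + OUT
    backward = reachable(rev, scc_seed)   # SCC + IN
    wcc = reachable(undirected, scc_seed)

    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "TENDRILS": wcc - (forward | backward),
        "DISC": nodes - wcc,
    }

# Hypothetical toy graph: i -> {s1 <-> s2} -> o, a tendril t off i, and x -> y apart.
toy = {"i": ["s1", "t"], "s1": ["s2"], "s2": ["s1", "o"], "x": ["y"]}
parts = bow_tie(toy, "s1")
```

The five sets are disjoint and together cover every node, which mirrors the identity TENDRILS = WCC - (SCC + IN + OUT).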
17IN and OUT
128 start nodes in IN, 134 start nodes in OUT
OUT tends to encounter larger neighborhoods.
18SCC
136 start nodes in SCC
BFS depth
Directed diameter is at least 28
19More Observations
Given random start and finish pages, how likely
are we to get from the start page to the finish page? 24%
Maximum finite shortest path length
475430905
Average Connected distance
20Future Work
- Further analysis of SCC, IN, OUT, TENDRILS
- Is the structure stable?
- Mathematical models for evolving graphs
- Applicability to phone-call graphs, purchase/transaction graphs, etc.
- Explore other notions of connectivity
- co-citation
- bibliographic coupling
21(No Transcript)