Web structure - PowerPoint PPT Presentation

Slides: 19
Provided by: Miketh8

Transcript and Presenter's Notes

1
Web structure
  • Dynamics and Complexity
  • A simple web growth model
  • More complex models

2
Introduction
  • The web grows and changes every day
  • Search engines and other programs need to respond
    to this growth
  • Understanding web growth helps us to design
    efficient programs

3
1. A simple web growth model
4
Link frequencies
  • The number of links between web pages displays a
    striking pattern

[Figure: links from pages in Australian university web sites. A similar graph can be drawn for the whole web.]
Why do many web sites contain few links?
Why do few web sites contain many links?
5
Link frequencies
  • A few web pages contain many links
  • Many web pages contain few links
  • The 80/20 rule
  • 20% of pages have 80% of the links
  • The relationship between link counts and page
    frequencies is a power law

6
The Power Law Model
  • The number of web pages N having L links is
    proportional to $L^{-a}$, where a is a constant, often
    about 2.5
  • e.g. $N = 1000 \times L^{-2.5}$
  • For these values:
  • 1 link: $N = 1000 \times 1^{-2.5} = 1000$, so 1000 pages have 1 link
  • 2 links: $N = 1000 \times 2^{-2.5} \approx 177$, so 177 pages have 2 links
  • 3 links: $N = 1000 \times 3^{-2.5} \approx 64$, so 64 pages have 3 links
  • 4 links: $N = 1000 \times 4^{-2.5} \approx 31$, so 31 pages have 4 links
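The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `pages_with_links` and its parameter names `scale` and `exponent` are made up for illustration, with defaults taken from the slide's example values (1000 and 2.5).

```python
def pages_with_links(links, scale=1000.0, exponent=2.5):
    """Power law model: N = scale * links**(-exponent).

    `scale` and `exponent` are illustrative names; the defaults are the
    example values used on the slide.
    """
    if links < 1:
        raise ValueError("links must be at least 1")
    return scale * links ** (-exponent)


# Reproduce the slide's table, rounding to whole pages.
for links in (1, 2, 3, 4):
    print(links, "link(s):", round(pages_with_links(links)), "pages")
```

Calling the same function with a different `scale` and `exponent` evaluates any other power law of this form.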

7
Example
  • $N = 10{,}000 \times L^{-2.0}$
  • How many pages have:
  • 1 link?
  • 2 links?
  • 10 links?
  • You can do these without a calculator!

8
Rich get richer
  • Why do web page links follow a power law?
  • Pages that already contain many links are more
    likely to have extra links added to them
  • Pages that already have many links pointing to
    them are more likely to have additional links
    created to them
  • Mathematicians have shown that these "rich get
    richer" phenomena lead to a power law
  • Web pages, in general, tend to obey the "rich get
    richer" law
  • Partly because links are used in search engine
    ranking
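The "rich get richer" idea can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not in the slides: pages arrive one at a time, each new page creates exactly one link, and the target is chosen with probability proportional to its current inlinks plus one (the "plus one" gives unlinked pages a chance). The function name `simulate_rich_get_richer` is hypothetical.

```python
import random
from collections import Counter


def simulate_rich_get_richer(num_pages, seed=0):
    """Return a list of inlink counts, one per page.

    Assumption: each new page links to one existing page, picked with
    probability proportional to (inlinks + 1).
    """
    rng = random.Random(seed)
    inlinks = [0]   # page 0 starts with no inlinks
    targets = [0]   # each page appears once, plus once per inlink it has
    for new_page in range(1, num_pages):
        target = rng.choice(targets)  # pages listed more often win more often
        inlinks[target] += 1
        targets.append(target)        # the chosen page gets "richer"
        inlinks.append(0)
        targets.append(new_page)      # the new page becomes a possible target
    return inlinks


counts = Counter(simulate_rich_get_richer(5000))
for links in sorted(counts)[:6]:
    print(links, "inlinks:", counts[links], "pages")
```

Plotting the page counts against the link counts on log-log axes is one way to see whether the simulated distribution looks like a straight line, as a power law would.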

9
More complex models
  • More complex web growth models

10
The Pennock, Flake, Lawrence, Glover, Giles
(PFLGG) Model
  • Notice the hooked shape at the top left of
    the power law graph
  • This is a deviation from a power law
  • The PFLGG model accounts for this
  • part power law, part random attachment
  • New links are created partly at random, partly
    because of rich get richer
  • Different types of page have different
    combinations of rich get richer and random
    attachment
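The mix of random and "rich get richer" linking can be sketched as a simulation with a single mixing parameter. This is a simplified illustration, not the published PFLGG model: it assumes each new page creates one link, and the names `simulate_mixed_attachment` and `alpha` are hypothetical. With probability `alpha` the target is chosen in proportion to the links it already has; otherwise it is chosen uniformly at random.

```python
import random


def simulate_mixed_attachment(num_pages, alpha, seed=0):
    """Return the number of links (in plus out) attached to each page.

    alpha = 1.0 is pure rich get richer, alpha = 0.0 is pure random
    attachment. Both the one-link-per-page rule and alpha are assumptions
    made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)
    degree = [0]      # page 0 starts with no links
    link_ends = []    # every link contributes both of its endpoints
    for new_page in range(1, num_pages):
        if link_ends and rng.random() < alpha:
            # Rich get richer: pages with more links appear more often here.
            target = rng.choice(link_ends)
        else:
            # Random attachment: every existing page is equally likely.
            target = rng.randrange(new_page)
        degree[target] += 1
        degree.append(1)  # the new page's own outgoing link
        link_ends.extend((target, new_page))
    return degree


for alpha in (0.0, 0.5, 1.0):
    degrees = simulate_mixed_attachment(5000, alpha)
    print("alpha =", alpha, "largest link count:", max(degrees))
```

Comparing the link-count distributions for different values of `alpha` shows how the random component changes the shape at the low-link end of the graph.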

11
Two extreme examples
  • Company home pages
  • not much like a power law
  • Many links are given at random

[Figure: company home pages. Axes: number of pages against links in each page.]
  • Random pages
  • a pure power law
  • few links are given at random

[Figure: random pages. Axes: number of pages against links in each page.]
12
Clustering models
13
Component sizes
Sizes of components in a May 1999 AltaVista crawl
(Broder, Kumar, Maghoul et al., 2000)
14
Component definitions 1
  • SCC: Strongly Connected Component
  • The largest group of pages in which every page
    can reach every other member by following links
  • IN: Pages outside the SCC that can reach an SCC
    page by following links
  • OUT: Pages outside the SCC that can be reached
    from the SCC by following links

15
Component definitions 2
  • DISCONNECTED: Pages that are not connected to the
    SCC in any way, even ignoring link direction
  • TENDRILS: The rest
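The five component definitions can be turned into a small classifier. This is a sketch for tiny example graphs only (it is far too slow for a real crawl); the function names `reachable` and `bow_tie` and the toy link data are invented for illustration, and page names are assumed to be sortable strings.

```python
from collections import deque


def reachable(start_pages, neighbours):
    """Breadth-first search: all pages reachable from start_pages."""
    seen = set(start_pages)
    queue = deque(seen)
    while queue:
        page = queue.popleft()
        for nxt in neighbours.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links):
    """Classify pages given links: dict of page -> pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    forward = {p: set(links.get(p, ())) for p in pages}
    backward = {p: set() for p in pages}
    for source, targets in forward.items():
        for target in targets:
            backward[target].add(source)
    undirected = {p: forward[p] | backward[p] for p in pages}

    # Largest strongly connected component: pages that can reach a page
    # and be reached from it form that page's component.
    scc, assigned = set(), set()
    for page in sorted(pages):
        if page in assigned:
            continue
        component = reachable([page], forward) & reachable([page], backward)
        assigned |= component
        if len(component) > len(scc):
            scc = component

    in_part = reachable(scc, backward) - scc
    out_part = reachable(scc, forward) - scc
    connected = reachable(scc, undirected)  # ignores link direction
    return {
        "SCC": scc,
        "IN": in_part,
        "OUT": out_part,
        "TENDRILS": connected - scc - in_part - out_part,
        "DISCONNECTED": pages - connected,
    }


# Toy web (hypothetical): a, b, c form a cycle.
toy_links = {
    "a": ["b"], "b": ["c"], "c": ["a", "out1"],
    "in1": ["a", "t1"], "t2": ["out1"], "x": ["y"],
}
for name, members in bow_tie(toy_links).items():
    print(name, sorted(members))
```

In the toy graph, `t1` hangs off an IN page and `t2` points into an OUT page, so neither has a directed path to or from the SCC; both fall into "the rest".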

16
Why is this important?
  • A new search engine starting at a single good
    page in the SCC will find the SCC and OUT only,
    and miss the rest of the web
  • Need to find other methods to get DISCONNECTED,
    TENDRILS, IN
  • Remember old pages
  • Allow user submission of URLs
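The crawling limitation can be demonstrated on a toy graph. This is a minimal sketch: the `crawl` function and the link data are hypothetical, and a real crawler would fetch pages over the network rather than read a dictionary. Extra seeds stand in for remembered old pages or user-submitted URLs.

```python
from collections import deque


def crawl(seeds, links):
    """Follow links breadth-first from the seed pages; return pages found."""
    found = set(seeds)
    queue = deque(found)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found


# Toy web (hypothetical): a, b, c form the SCC, out1 is in OUT, in1 is in IN,
# t1 and t2 are tendrils, x and y are disconnected.
toy_web = {
    "a": ["b"], "b": ["c"], "c": ["a", "out1"],
    "in1": ["a", "t1"], "t2": ["out1"], "x": ["y"],
}
all_pages = set(toy_web) | {t for ts in toy_web.values() for t in ts}

from_scc = crawl(["a"], toy_web)
print("found from one SCC seed:", sorted(from_scc))
print("missed:", sorted(all_pages - from_scc))

# Adding submitted URLs as extra seeds recovers more of the web.
with_submissions = crawl(["a", "in1", "x"], toy_web)
print("missed with extra seeds:", sorted(all_pages - with_submissions))
```

Starting only from `a`, the crawl never sees the IN page, the tendrils, or the disconnected pair, because no chain of links leads to them from the SCC.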

17
Summary
  • Simple mathematical models help explain web
    linking
  • More complex models provide a better fit
  • Topological models allow the web to be visualised
  • All help the design of web crawling and mining
    algorithms

18
References
  • Barabasi, A. (2002). Linked: The new science of
    networks. Perseus Publishing, Cambridge, MA.
  • Broder, A., Kumar, R., Maghoul, F., Raghavan, P.,
    Rajagopalan, S., Stata, R., et al. (2000). Graph
    Structure in the Web. Journal of Computer
    Networks, 33(1-6), 309-320.
  • Pennock, D., Flake, G. W., Lawrence, S., Glover,
    E. J., & Giles, C. L. (2002). Winners don't take
    all: Characterizing the competition for links on
    the web. Proceedings of the National Academy of
    Sciences, 99, 5207-5211.