Extracting knowledge from the World Wide Web

Slides: 26
Provided by: zgrdeni

1
Extracting knowledge from the World Wide Web
  • by Monika Rauch Henzinger and Steve Lawrence

Presentation by Özgür Deniz Aydin
2
Overview
  • Sampling
  • Analyzing and Modeling Growth
  • Communities on Web
  • Summary

3
Abstract
  • Discuss methods for extracting knowledge from the
    web by randomly sampling and analyzing hosts and
    pages, and by analyzing the link structure of the
    web and how links accumulate over time.
  • Information can be collected on the distribution of web
    pages over domains, the distribution of interest in
    different areas, communities related to different topics,
    the nature of competition in different categories
    of sites, and the degree of communication between
    different communities or countries.

4
Sampling the Web
  • The web is too big to analyze in full, so it must be sampled.
  • It is important to preserve distribution
    characteristics.
  • Simplest method: random selection, from which one can estimate:
  • Fractions from various Internet domains
  • Fractions in various languages
  • Fractions indexed by certain search engines

5
Sampling: Random Walk
  • Take successive steps in random directions.
  • Main idea: a page is visited with probability
    proportional to its PageRank.
  • Method:
  • The initial page is chosen uniformly at random.
  • When at page p,
  • With prob. d, follow an outlink from p,
  • With prob. 1-d, jump to a random page.
  • The fraction of time spent at p defines its PageRank.
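
The walk described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy graph, the value d = 0.85, and the function name `random_walk_visits` are hypothetical and not from the presentation, and a page without outlinks is assumed to always jump. The visit frequencies approximate PageRank only on a graph where a uniformly random page can actually be drawn.

```python
import random
from collections import Counter

def random_walk_visits(graph, d=0.85, steps=100_000, seed=0):
    """Return the fraction of steps the walk spends on each page."""
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)              # uniformly random start page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outlinks = graph[current]
        if outlinks and rng.random() < d:
            current = rng.choice(outlinks)   # with prob. d follow an outlink
        else:
            current = rng.choice(pages)      # otherwise jump to a random page
    return {p: visits[p] / steps for p in pages}

# Hypothetical toy graph: page -> pages it links to
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
freq = random_walk_visits(toy_graph)
```

In this toy graph, page "c" has three inlinks and page "d" has none, so the walk should spend noticeably more time on "c".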

6
Sampling: Random Walk
  • Problems:
  • We assume that we can select a page at random,
    which is the very problem we are trying to solve.
  • Many pages have outlinks only to other pages in the
    same domain, so the walk is very likely to get stuck.

7
Modified Random Walk
  • Method:
  • The initial page is chosen uniformly at random.
  • When at page p,
  • With prob. d, follow an outlink from p, chosen
    uniformly at random.
  • With prob. 1-d, pick a random host from the
    hosts visited so far, and jump to a randomly
    selected page among the visited pages on that
    host.
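
A minimal sketch of the modified walk, assuming pages are identified by URL and hosts are taken from the URL's network location. The names `modified_walk` and `toy_web`, the example URLs, and d = 0.85 are hypothetical. The start page is passed in by the caller, since choosing it uniformly at random is exactly what is not available in practice.

```python
import random
from urllib.parse import urlparse

def modified_walk(graph, start, d=0.85, steps=1000, seed=0):
    """Random walk whose jumps only go to already visited hosts and pages."""
    rng = random.Random(seed)
    visited = {}                 # host -> distinct pages visited on that host
    current = start
    path = []
    for _ in range(steps):
        path.append(current)
        host = urlparse(current).netloc
        pages = visited.setdefault(host, [])
        if current not in pages:
            pages.append(current)
        outlinks = graph.get(current, [])
        if outlinks and rng.random() < d:
            current = rng.choice(outlinks)        # follow a random outlink
        else:
            host = rng.choice(sorted(visited))    # random visited host
            current = rng.choice(visited[host])   # random visited page on it
    return path, visited

# Hypothetical toy web: URL -> URLs it links to
toy_web = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://a.example/2": ["http://a.example/1"],
    "http://b.example/1": ["http://a.example/1"],
}
path, visited = modified_walk(toy_web, "http://a.example/1")
```

Picking the host first and the page second means a host with many visited pages is not favored over a host with few, which is the point of the two-stage jump.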

8
Random Walk - Results
9
IP Address Sampling
  • Idea: with IPv4, there are about 4.3 billion possible
    IP addresses ($256^4$).
  • It is possible to traverse all IPv4 addresses. This would
    not be possible with IPv6.
  • Possible Problems:
  • Sites may be temporarily unavailable: check more than
    once.
  • Will also find pages that are not linked directly.
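
The address-space arithmetic and the sampling step can be illustrated with Python's standard `ipaddress` module. This is a sketch only: the function name `sample_ipv4` is hypothetical, and no network probing is done here. A real study would still have to test each address for a web server, more than once.

```python
import ipaddress
import random

TOTAL_IPV4 = 256 ** 4        # 4,294,967,296 possible IPv4 addresses

def sample_ipv4(n, seed=0):
    """Draw n IPv4 addresses uniformly at random from the whole space."""
    rng = random.Random(seed)
    return [ipaddress.IPv4Address(rng.randrange(TOTAL_IPV4)) for _ in range(n)]

# Keep only publicly routable addresses as candidates for probing
candidates = [ip for ip in sample_ipv4(10) if ip.is_global]
```

Filtering with `is_global` drops private and reserved ranges, which can never host a public web site.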

10
IP Address Sampling
  • Possible Problems:
  • Firewalls and authentication requirements
  • Default page (no page) responses
  • "Coming soon" pages
  • Same site on multiple IPs (large or critical sites do
    this for load balancing and redundancy)
  • Multiple sites on a single IP (virtual hosting)
  • IPs that serve something other than web sites: fax servers, etc.

11
(No Transcript)
12
Discussion on Sampling
  • Current techniques do not offer a uniform random
    sample.
  • Idea: use a combination of methods.
  • Question: what should be counted?
  • e.g. a weather site with millions of pages.
  • Pages with no original content, sourced from elsewhere.

13
Analyzing and Modelling Web Growth
  • The link distribution of pages follows a power law.
  • Prob. that a randomly selected web page has k
    inlinks is proportional to $k^{-\gamma}$, where $\gamma = 2.1$.
  • Prob. that this web page has k outlinks is
    proportional to $k^{-\gamma}$, where $\gamma = 2.72$.
  • Models for discussion: Preferential Attachment
    and Competition Varies

14
Preferential Attachment
  • Rich get richer: a node becomes more likely to
    get an edge from a new node if it has a larger
    number of edges (undirected graph model).
  • Growth: starting with a small number $m_0$ of nodes, in
    every time step introduce a new node with m edges,
    $m \le m_0$.
  • $p(u)$: prob. that the new node will be connected to an
    existing node u. It depends on the degree $k_u$ such that
  • $p(u) = k_u / \sum_w k_w$
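
The growth rule above can be simulated roughly as follows. This is a sketch under assumptions that are not in the slides: the $m_0$ initial nodes are joined in a ring so that every node starts with nonzero degree (so $m_0 \ge 3$ is assumed), and the m targets of a new node are forced to be distinct by redrawing, which deviates slightly from exact proportional sampling. The function name is hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n, m0=3, m=2, seed=0):
    """Grow an undirected graph to n nodes; return (edges, degree counts)."""
    rng = random.Random(seed)
    # Assumed seed graph: m0 nodes in a ring
    edges = [(i, (i + 1) % m0) for i in range(m0)]
    # Each node appears once per incident edge, so a uniform draw from
    # this list picks node w with probability k_w / sum of all degrees.
    endpoints = [v for e in edges for v in e]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges, Counter(endpoints)

edges, degree = preferential_attachment(200)
```

Keeping a flat list of edge endpoints is a common trick: it avoids recomputing the normalizing sum at every step.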

15
Preferential Attachment - Results
  • Models the probability for k inlinks to be
    proportional to $k^{-3}$.
  • Older nodes get richer faster, which does not
    entirely match the real web.
  • Different link distributions are observed among
    pages of the same category.

16
Competition Varies
  • In earlier models, most members of a community
    fare poorly, with few or no inlinks.
  • In actual distributions, this is not the case.
  • e.g. the mode for inlinks can go up to 800 among
    universities.
  • A newer model by Pennock et al., mixing uniform and
    preferential attachment, accounts for
    connectivity distributions within communities as
    well as on the web itself.
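
One simple way to picture the mixture is a target-selection rule that is preferential part of the time and uniform the rest. The sketch below is an illustration only and is not claimed to be the exact published formulation: the parameter `pref_weight`, the function name, and the toy degree table are all hypothetical.

```python
import random

def choose_target(degrees, pref_weight, rng):
    """Pick a link target: preferential with prob. pref_weight, else uniform."""
    nodes = list(degrees)
    if rng.random() < pref_weight:
        # Preferential part: probability proportional to current degree
        return rng.choices(nodes, weights=[degrees[v] for v in nodes])[0]
    # Uniform part: every node is equally likely, whatever its degree
    return rng.choice(nodes)

# Hypothetical degree table with one dominant hub
toy_degrees = {"hub": 98, "a": 1, "b": 1}
rng = random.Random(0)
pure_pref = [choose_target(toy_degrees, 1.0, rng) for _ in range(1000)]
pure_unif = [choose_target(toy_degrees, 0.0, rng) for _ in range(1000)]
```

With a large uniform share, low-degree pages keep receiving links, which is how a mixture can avoid predicting that most community members have almost no inlinks.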

17
Competition Varies
18
The Hostgraph Model
  • Idea The web is a hierarchically nested graph,
    with domains, hosts, and pages introducing
    different levels of affiliation. Instead of
    modeling the web at the level of pages, one can
    also model it on the host or domain level.
  • Each node represents a host.
  • Bharat's model: power-law distribution with $\gamma = 1.62$
    for inlinks and $\gamma = 1.67$ for outlinks.
  • In reality, the number of hosts with few inlinks is
    considerably smaller than predicted by the model.

19
The Hostgraph Model
  • Bharat's model:
  • At each step, with prob. $\beta$, select a random
    already existing node u,
  • With prob. $1-\beta$, create a new node u. Then add d
    additional outlinks to u. The outlinks are
    chosen as follows:
  • First choose an existing node v at random.
  • Pick d random outgoing edges from v.
  • Then, for $j = 1, 2, \ldots, d$, the j-th link of u
    points to a random existing node with probability
    $\alpha$, and to the destination of v's j-th link with
    probability $1-\alpha$.
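
The copying steps above can be sketched as a small simulation. Several things here are assumptions, not part of the slides: the seed graph of d+1 fully interlinked hosts (so that every host has at least d outlinks to copy), the value beta = 0.5, and the function name. Self-links and duplicate links are not filtered out.

```python
import random

def hostgraph_copy_model(steps, d=7, alpha=0.05, beta=0.5, seed=0):
    """Return {host: list of outlink targets} after the given number of steps."""
    rng = random.Random(seed)
    # Assumed seed graph: d+1 hosts, each linking to all the others
    out = {u: [v for v in range(d + 1) if v != u] for u in range(d + 1)}
    for _ in range(steps):
        n = len(out)                        # hosts existing before this step
        if rng.random() < beta:
            u = rng.randrange(n)            # reuse an existing host
        else:
            u = n                           # create a new host
            out[u] = []
        v = rng.randrange(n)                # prototype host to copy from
        for target in rng.sample(out[v], d):
            if rng.random() < alpha:
                out[u].append(rng.randrange(n))   # random existing host
            else:
                out[u].append(target)             # copy v's link
    return out

graph = hostgraph_copy_model(2000)
```

Because most links are copied from a prototype, hosts that already have many inlinks tend to gain more, which produces a heavy-tailed inlink distribution.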

20
The Hostgraph Model
Model with 1,000,000 nodes, $d = 7$ and $\alpha = 0.05$
21
(No Transcript)
22
Communities on the Web
  • Identification of communities is valuable for
    several reasons.
  • Automatic web portals
  • Focused search engines
  • Content filtering
  • Definition by Flake: a web community is a collection
    of web pages such that each member page has more
    hyperlinks (in either direction) within the
    community than outside of the community (this
    definition may be generalized to identify
    communities with varying sizes and levels of
    cohesiveness).
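
The definition can be checked directly on a small link list. The sketch below assumes links are given as (source, destination) pairs and counts them in either direction, as the definition states. The function name and the toy data are hypothetical, and this is only a membership test, not a community-finding algorithm.

```python
def is_community(members, links):
    """True if every member has more links inside the set than outside it."""
    members = set(members)
    for page in members:
        inside = outside = 0
        for src, dst in links:
            if page == src:
                other = dst
            elif page == dst:
                other = src
            else:
                continue                  # link does not touch this page
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True

# Hypothetical toy links: a, b, c form a triangle; c also links out to x
toy_links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "x"), ("x", "y")]
```

Here `is_community({"a", "b", "c"}, toy_links)` holds because each page has two links inside the set and at most one outside, while `{"c", "x"}` fails the test.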

23
Communities on the Web
24
Communities on the Web
  • Problems with the regular approach:
  • Cannot cluster fully connected graphs/subgraphs
    (identical to non-connected graphs).
  • Introduce seed nodes and identify communities
    around the seeds with a polynomial-time algorithm.
  • Bipartite subgraph, cocitation, and bibliographic
    coupling methods are good for identifying narrow
    portions of the web only.
  • HITS and PageRank define large collections as
    communities, with less accuracy.

25
Summary
  • The World Wide Web offers both opportunities and
    challenges.
  • Many areas are open for further research.
  • Interesting results may come from updated
    analysis.
  • Sampling is still an important issue: which pages
    should be counted? How to reduce bias?
  • Growth models: how to model or refine for
    accuracy, while keeping models simple and easy to
    analyse?
  • Communities: how to define, how to model?