Extracting knowledge from the World Wide Web

Slides: 26

Provided by: zgrdeni

Transcript and Presenter's Notes

1
Extracting knowledge from the World Wide Web

by
Monika Rauch Henzinger and Steve Lawrence

Presentation by Özgür Deniz Aydin
2
Overview

Sampling
Analyzing and Modeling Growth
Communities on the Web
Summary

3
Abstract

The paper discusses methods for extracting
knowledge from the web by randomly sampling and
analyzing hosts and pages, and by analyzing the
link structure of the web and how links
accumulate over time.
Information that can be extracted includes the
distribution of web pages over domains, the
distribution of interest in different areas,
communities related to different topics, the
nature of competition in different categories of
sites, and the degree of communication between
different communities or countries.

4
Sampling the Web

The web is too big to analyze in full, so we
need to sample.
Important to preserve distribution
characteristics.
Simplest method: random selection, which lets us
estimate the
Fraction of pages from various Internet domains
Fraction of pages in various languages
Fraction of pages indexed by certain search engines.

5
Sampling: Random Walk

Take successive steps in random directions.
Main idea: a page is visited with probability
proportional to its PageRank.
Method:
The initial page is chosen uniformly at random.
When at page p,
With prob. d, follow an outlink from p,
With prob. 1-d, jump to a random page.
The fraction of time spent at p defines its
PageRank.
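
The walk above can be sketched on a toy graph. This is only an illustration under stated assumptions: the value d = 0.85 is a commonly used choice that does not come from the slides, pages without outlinks always take the random jump, and the full page list is assumed to be known (which is exactly what is not true for the real web). The function name and the toy graph are hypothetical.

```python
import random
from collections import Counter

def random_walk_visit_fractions(outlinks, steps, d=0.85, seed=0):
    """Estimate PageRank as the fraction of steps spent on each page.

    outlinks: dict mapping every page to a list of pages it links to.
    d: probability of following an outlink (assumed value, not from the slides).
    """
    rng = random.Random(seed)
    pages = sorted(outlinks)
    p = rng.choice(pages)          # initial page chosen uniformly at random
    visits = Counter()
    for _ in range(steps):
        visits[p] += 1
        if outlinks[p] and rng.random() < d:
            p = rng.choice(outlinks[p])   # follow an outlink from p
        else:
            p = rng.choice(pages)         # jump to a random page
    return {q: visits[q] / steps for q in pages}

# Hypothetical toy graph: three pages link to "hub", which links back to "a".
toy = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
fractions = random_walk_visit_fractions(toy, steps=20000)
```

In this toy graph the heavily linked page "hub" ends up with a much larger visit fraction than "b" or "c", which receive no inlinks.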

6
Sampling: Random Walk

Problems
We assume that we can select a page at random,
which is the very problem we are trying to
solve.
Many pages have outlinks to other pages in the
same domain, so the walk is very likely to get
stuck there.

7
Modified Random Walk

Method
Choice of initial page is uniformly random
When at page p,
With prob. d, follow an outlink from p, chosen
uniformly at random.
With prob. 1-d, pick a random host from the
hosts visited so far, and jump to a randomly
selected page out of the visited pages on that
host.
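
A minimal sketch of the modified walk, under a few assumptions: the start page is passed in by the caller rather than drawn uniformly, d = 0.85 is an assumed value, pages without outlinks always take the jump, and `host_of` is a hypothetical helper that maps a page to its host.

```python
import random

def modified_random_walk(outlinks, host_of, start, steps, d=0.85, seed=0):
    """Random walk whose jumps go only to already-visited hosts and pages.

    outlinks: dict mapping a page to the list of pages it links to.
    host_of:  function mapping a page to its host (hypothetical helper).
    """
    rng = random.Random(seed)
    visited = {}      # host -> list of distinct pages visited on that host
    path = []
    p = start
    for _ in range(steps):
        path.append(p)
        pages_on_host = visited.setdefault(host_of(p), [])
        if p not in pages_on_host:
            pages_on_host.append(p)
        if outlinks.get(p) and rng.random() < d:
            p = rng.choice(outlinks[p])       # outlink chosen uniformly at random
        else:
            host = rng.choice(list(visited))  # random host among visited hosts
            p = rng.choice(visited[host])     # random visited page on that host
    return path, visited

# Hypothetical toy web with two hosts.
toy = {
    "a.org/1": ["a.org/2", "b.org/1"],
    "a.org/2": ["a.org/1"],
    "b.org/1": ["b.org/2"],
    "b.org/2": [],
}
path, visited = modified_random_walk(
    toy, lambda page: page.split("/")[0], start="a.org/1", steps=500
)
```

Picking the host first and the page second means a host with many visited pages is not favored over a host with few, which is the point of the modification.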

8
Random Walk - Results
9
IP Address Sampling

Idea: with IPv4, there are about 4.3 billion
possible IP addresses ($256^4$).
It is possible to traverse all IP addresses. This
would not be possible with IPv6.
Possible Problems
Sites may be temporarily unavailable: check more
than once.
Will also find pages that are not directly
linked.
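
The address-space arithmetic and the drawing of random addresses can be sketched as follows. This is an illustration only: it draws candidate addresses and does not contact any server, and skipping addresses that Python's `ipaddress` module does not classify as globally reachable (for example private and loopback ranges) is an assumption that is not stated in the slides.

```python
import ipaddress
import random

TOTAL_IPV4 = 256 ** 4   # 4,294,967,296 possible addresses, about 4.3 billion

def sample_ipv4(n, seed=0):
    """Draw n random IPv4 addresses, keeping only globally reachable ones."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:   # assumed filter: drop private, loopback, etc.
            sample.append(addr)
    return sample

candidates = sample_ipv4(5)
```

Each candidate would then be probed for a web server, more than once if it does not answer, which is left out here.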

10
IP Address Sampling

Possible Problems
Firewalls and authentication requirements.
Default page (no page) responses
"Coming soon" pages
Same site on multiple IPs (large/critical sites
do this for load balancing and redundancy)
Multiple sites on a single IP (virtual hosting)
IPs that do not serve web sites: fax servers, etc.

11
(No Transcript)
12
Discussion on Sampling

Current techniques do not offer a uniform random
sample.
Idea: use a combination of methods.
Question: what should be counted?
e.g. a weather site with millions of pages.
Pages with no original content, whose source is
elsewhere.

13
Analyzing and Modelling Web Growth

The link distribution of pages follows a power
law.
Prob. that a randomly selected web page has k
inlinks is proportional to $k^{-\gamma}$, where
$\gamma = 2.1$.
Prob. that this web page has k outlinks is
proportional to $k^{-\gamma}$, where
$\gamma = 2.72$.
Models for discussion: Preferential Attachment
and Competition Varies
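
To make the power law concrete, the sketch below normalizes $k^{-\gamma}$ over a finite range of k. The cutoff `k_max` and the function name are assumptions for illustration; the exponents are the ones quoted above.

```python
def power_law_pmf(gamma, k_max):
    """Return [P(1), ..., P(k_max)] with P(k) proportional to k**(-gamma)."""
    weights = [k ** -gamma for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

inlink_pmf = power_law_pmf(2.1, k_max=10000)    # inlinks, gamma = 2.1
outlink_pmf = power_law_pmf(2.72, k_max=10000)  # outlinks, gamma = 2.72

# Doubling k divides the probability by 2**gamma, whatever k is.
ratio = inlink_pmf[0] / inlink_pmf[1]           # P(1) / P(2)
```

The larger exponent for outlinks puts more of the probability on small k, so very large outlink counts are rarer than very large inlink counts.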

14
Preferential Attachment

Rich get richer: a node becomes more likely to
get an edge from a new node if it already has a
larger number of edges (undirected graph model).
Growth: starting with a small number $m_0$ of
nodes, at every time step introduce a new node
with $m$ edges, $m \le m_0$.
$p$: prob. that the new node will be connected
to an existing node $u$. It depends on the
degree $k_u$ of $u$, summing over all existing
nodes $w$:
$p = k_u / \sum_w k_w$
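
A minimal simulation of this growth rule is sketched below. The slides do not say how the initial nodes are connected, so the sketch assumes they form a clique, which keeps every degree nonzero; the function name and default parameters are hypothetical.

```python
import random

def preferential_attachment(n, m0=3, m=2, seed=0):
    """Grow an undirected graph to n nodes by preferential attachment."""
    if not (m0 >= 2 and 1 <= m <= m0 <= n):
        raise ValueError("need m0 >= 2 and 1 <= m <= m0 <= n")
    rng = random.Random(seed)
    degree = [0] * n
    edges = []
    # Node u appears k_u times in `endpoints`, so a uniform draw from it
    # selects u with probability k_u / sum_w k_w.
    endpoints = []

    def add_edge(a, b):
        edges.append((a, b))
        degree[a] += 1
        degree[b] += 1
        endpoints.extend((a, b))

    for a in range(m0):                 # assumed seed graph: a clique
        for b in range(a + 1, m0):
            add_edge(a, b)
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:         # m distinct existing nodes
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            add_edge(new, t)
    return degree, edges

degree, edges = preferential_attachment(2000)
```

Keeping the endpoint list avoids recomputing the sum of degrees at every step; sampling from it is equivalent to the probability given above.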

15
Pref. Attch. - Results

The model predicts that the probability of k
inlinks is proportional to $k^{-3}$.
Older nodes get rich faster. This is not
entirely correct in real life.
Different link distributions are observed among
pages of the same category.

16
Competition Varies

In earlier models, most members of a community
fare poorly, with few or no inlinks.
In actual distributions, this is not the case,
e.g. the mode for inlinks can go up to 800 for
universities.
A newer model by Pennock et al., mixing uniform
and preferential attachment, accounts for
connectivity distributions within communities as
well as for the web as a whole.
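
The mixing idea can be sketched as follows: each new edge picks its target preferentially with some probability and uniformly otherwise. The parameter name `alpha`, the two-node seed graph, and the allowance for repeated targets are assumptions made for this illustration, not details taken from the slides.

```python
import random

def mixed_attachment(n, m=1, alpha=0.5, seed=0):
    """Grow a graph where each new edge is preferential with prob. alpha
    and uniform over existing nodes with prob. 1 - alpha."""
    rng = random.Random(seed)
    degree = [1, 1]      # assumed seed graph: two nodes joined by one edge
    endpoints = [0, 1]   # node u appears k_u times in this list
    for new in range(2, n):
        targets = []
        for _ in range(m):
            if rng.random() < alpha:
                targets.append(rng.choice(endpoints))  # preferential choice
            else:
                targets.append(rng.randrange(new))     # uniform choice
        degree.append(m)
        for t in targets:                # repeated targets are allowed here
            degree[t] += 1
            endpoints.extend((new, t))
    return degree

pure_preferential = mixed_attachment(5000, alpha=1.0)
pure_uniform = mixed_attachment(5000, alpha=0.0)
```

With `alpha` close to 1 a few nodes collect most of the links, while with `alpha` close to 0 the links are spread far more evenly, so intermediate values can be fitted to communities with different degrees of competition.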

17
Competition Varies
18
The Hostgraph Model

Idea: the web is a hierarchically nested graph,
with domains, hosts, and pages introducing
different levels of affiliation. Instead of
modeling the web at the level of pages, one can
also model it at the host or domain level.
Each node represents a host.
Bharat's model: power-law distribution, with
$\gamma = 1.62$ for inlinks and $\gamma = 1.67$
for outlinks.
In reality, the number of hosts with few inlinks
is considerably smaller than predicted by the
model.

19
The Hostgraph Model

Bharat's model:
At each step, with prob. $\beta$, select a random
already existing node $u$;
with prob. $1-\beta$, create a new node $u$. Add
$d$ additional outlinks to $u$. The outlinks are
chosen as follows:
First choose an existing node $v$ at random.
Pick $d$ random outgoing edges from $v$.
Then, for $j = 1, 2, \dots, d$, the $j$-th link
of $u$ points to a random existing node with
probability $\alpha$, and to the destination of
$v$'s $j$-th link with probability $1-\alpha$.
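
The steps above can be sketched as a short simulation. Several details are assumptions: the seed graph of d + 1 hosts that all link to each other (so that every host has at least d outlinks to copy from), the default `beta = 0.5`, and the function name. The defaults for d and alpha are the values quoted for the simulation on the following slide.

```python
import random

def hostgraph_copy_model(steps, d=7, alpha=0.05, beta=0.5, seed=0):
    """Return a list of outlink lists, one per host, after `steps` steps."""
    rng = random.Random(seed)
    # Assumed seed graph: d + 1 hosts, each linking to the d others.
    out = [[w for w in range(d + 1) if w != u] for u in range(d + 1)]
    for _ in range(steps):
        existing = len(out)
        v = rng.randrange(existing)          # prototype host v
        prototype = rng.sample(out[v], d)    # d random outgoing edges of v
        if rng.random() < beta:
            u = rng.randrange(existing)      # reuse an existing host u
        else:
            out.append([])                   # create a new host u
            u = existing
        for j in range(d):
            if rng.random() < alpha:
                out[u].append(rng.randrange(existing))  # random existing host
            else:
                out[u].append(prototype[j])  # copy v's j-th sampled link
    return out

graph = hostgraph_copy_model(2000)
```

Copying links from a prototype is what produces the rich-get-richer effect here: hosts that already have many inlinks are more likely to be the destination of a copied link.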

20
The Hostgraph Model
Model with 1,000,000 nodes, $d = 7$ and $\alpha = 0.05$
21
(No Transcript)
22
Communities on the Web

Identification of communities is valuable for
several reasons.
Automatic web portals
Focused search engines
Content filtering
Definition by Flake: a web community is a
collection of web pages such that each member
page has more hyperlinks (in either direction)
within the community than outside of the
community (this definition may be generalized to
identify communities with varying sizes and
levels of cohesiveness).
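
The definition can be checked directly for a candidate set of pages. The sketch below is a plain test of the condition, not an algorithm for finding communities; the function name and the toy graph are hypothetical, and self-links are ignored by assumption.

```python
def is_community(links, members):
    """Check that every member has more links inside the set than outside.

    links:   iterable of (source, destination) hyperlinks.
    members: candidate community, an iterable of pages.
    """
    members = set(members)
    inside = {page: 0 for page in members}
    outside = {page: 0 for page in members}
    for src, dst in links:
        if src == dst:
            continue                       # assumption: ignore self-links
        for page, other in ((src, dst), (dst, src)):   # either direction
            if page in members:
                if other in members:
                    inside[page] += 1
                else:
                    outside[page] += 1
    return all(inside[page] > outside[page] for page in members)

# Hypothetical toy web: two triangles joined by a single link 2 -> 3.
toy_links = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
triangle_ok = is_community(toy_links, {0, 1, 2})
pair_ok = is_community(toy_links, {0, 1})
```

Here each triangle satisfies the definition, while the pair {0, 1} does not, because page 0 has as many links to page 2 outside the pair as to page 1 inside it.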

23
Communities on the Web
24
Communities on the Web

Problems with the regular approach:
Cannot cluster fully connected graphs/subgraphs
(identical to non-connected graphs).
Introduce seed nodes and identify communities
around the seeds with a polynomial-time algorithm.
Bipartite subgraphs, co-citation, and
bibliographic coupling methods are good for
identifying narrow portions of the web only.
HITS and PageRank define large collections as
communities, with less accuracy.
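
For reference, co-citation and bibliographic coupling are both simple link counts, sketched below with hypothetical function names and toy data. Two pages are co-cited when some third page links to both; two pages are bibliographically coupled when both link to the same third page.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in outlinks.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

def coupling_count(outlinks, a, b):
    """Count how many pages are linked to by both a and b."""
    return len(set(outlinks.get(a, ())) & set(outlinks.get(b, ())))

# Hypothetical toy data: page -> pages it links to.
toy = {"p": ["x", "y"], "q": ["x", "y", "z"], "r": ["z"]}
cocited = cocitation_counts(toy)
```

Both measures only relate pages that share a direct neighbor, which is one way to read the remark that these methods identify narrow portions of the web only.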

25
Summary

The World Wide Web offers both opportunities and
challenges.
Many areas are open for further research.
Interesting results may come from updated
analysis.
Sampling is still an important issue: which pages
should be counted? How to reduce bias?
Growth models: how to model or refine for
accuracy, while keeping models simple and easy to
analyse?
Communities: how to define, how to model?
