Provided by: Jungh1

Learn more at: http://oak.cs.ucla.edu

Title: Crawling the Web

1
Crawling the Web

Discovery and Maintenance of
Large-Scale Web Data

Junghoo Cho, Stanford University
2
What is a Crawler?
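[Diagram: crawler loop. Initial urls seed (init) the to-visit urls; the crawler repeatedly gets the next url, gets the page from the web, and extracts urls from it, keeping the visited urls and the collected web pages.]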
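The loop in the diagram can be sketched in a few lines of Python. This is only an illustration, not the WebBase implementation: `crawl`, `get_page`, `extract_urls`, `max_pages`, and the in-memory `TOY_WEB` are hypothetical names, and the toy "web" stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque

def crawl(initial_urls, get_page, extract_urls, max_pages=1000):
    """Breadth-first crawl; returns a dict mapping url -> page."""
    to_visit = deque(initial_urls)       # "to visit urls", seeded by init
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)             # "get page"
        if page is None:                 # unreachable or missing page
            continue
        pages[url] = page
        for link in extract_urls(page):  # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages

# Toy in-memory "web" (an assumption for the example): a page is just
# the list of urls it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
pages = crawl(["a"], get_page=TOY_WEB.get, extract_urls=lambda page: page)
print(list(pages))
```

Using a FIFO queue gives breadth-first order; swapping the queue for a priority queue is one way to plug in a page selection policy.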
3
Applications

Internet Search Engines
Google, AltaVista
Comparison Shopping Services
mySimon, BizRate
Data mining
Stanford WebBase, IBM WebFountain

4
WebBase Crawler

WebBase Project
BackRub Crawler, PageRank
Google
New WebBase Crawler
20,000 lines in C/C++
130M pages collected

5
Crawling Issues (1)

Load at visited web sites
Space out requests to a site
Limit number of requests to a site per day
Limit depth of crawl
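A minimal sketch of how the three limits above might be enforced per site. The class `PolitenessPolicy` and its method names are hypothetical. The default delay and daily cap echo the 10 seconds and 3,000 pages per site quoted for the experiment later in the talk, while `max_depth=5` is an arbitrary assumption.

```python
class PolitenessPolicy:
    """Tracks per-site request history and decides whether a fetch is allowed."""

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s   # space out requests to a site
        self.max_per_day = max_per_day   # limit number of requests per day
        self.max_depth = max_depth       # limit depth of crawl
        self.last_request = {}           # site -> time of last request (seconds)
        self.requests_today = {}         # site -> requests made today

    def allows(self, site, depth, now):
        if depth > self.max_depth:
            return False
        if self.requests_today.get(site, 0) >= self.max_per_day:
            return False
        last = self.last_request.get(site)
        return last is None or now - last >= self.min_delay_s

    def record(self, site, now):
        """Call after each request actually sent to the site."""
        self.last_request[site] = now
        self.requests_today[site] = self.requests_today.get(site, 0) + 1

    def start_new_day(self):
        self.requests_today.clear()

policy = PolitenessPolicy(min_delay_s=10.0, max_per_day=2, max_depth=3)
```

The caller supplies `now` explicitly (for example from `time.monotonic()`), which keeps the policy easy to test without sleeping.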

6
Crawling Issues (2)

Load at crawler
Parallelize

[Diagram: the same crawler loop with the init, get next url, get page, and extract urls steps duplicated to run in parallel over one set of initial urls, to visit urls, visited urls, and web pages.]
7
Crawling Issues (3)

Scope of crawl
Not enough space for all pages
Not enough time to visit all pages

8
Crawling Issues (4)

Replication
Pages mirrored at multiple locations
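As a rough illustration of the problem, exact mirrors can be grouped by hashing page content. This is a deliberately simplified sketch and not the replicated page detection method from the talk: it only catches byte-identical copies, and `group_mirrors` and the sample URLs are hypothetical.

```python
import hashlib
from collections import defaultdict

def group_mirrors(pages):
    """pages: dict url -> content (bytes). Returns lists of urls whose
    content is byte-for-byte identical."""
    groups = defaultdict(list)
    for url, content in pages.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

sample = {
    "http://site-a.example/doc": b"<html>manual</html>",
    "http://mirror.example/doc": b"<html>manual</html>",
    "http://site-b.example/": b"<html>home</html>",
}
mirrors = group_mirrors(sample)
```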

9
Crawling Issues (5)

Incremental crawling
How do we avoid crawling from scratch?
How do we keep pages fresh?

10
Summary of My Research

Load on sites [PAWS00]
Parallel crawler [Tech Report 01]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Page freshness [SIGMOD00]
Crawler architecture [VLDB00]

11
Outline of This Talk

How can we keep pages fresh?
How does the Web change?
What do we mean by fresh pages?
How should we refresh pages?

12
Web Evolution Experiment

How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?

13
Experimental Setup

February 17 to June 24, 1999
270 sites visited (with permission)
identified 400 sites with highest PageRank
contacted administrators
720,000 pages collected
3,000 pages from each site daily
start at root, visit breadth first (get new and old pages)
ran only 9pm - 6am, 10 seconds between site requests

14
Average Change Interval
[Figure: fraction of pages vs. average change interval]
15
Change Interval By Domain
[Figure: fraction of pages vs. average change interval, by domain]
16
Modeling Web Evolution

Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
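A small sketch of what the exponential density implies, assuming the Poisson model holds. The function names are hypothetical, and the rate of one change per 10 days is chosen only as an example.

```python
import math

def interval_density(t, lam):
    """Density f_T(t) of the time to the next change, for rate lam."""
    return lam * math.exp(-lam * t) if t > 0 else 0.0

def prob_change_within(t, lam):
    """P(T <= t): probability that the page changes within time t."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0

lam = 1.0 / 10.0   # example rate: one change every 10 days on average
print(round(prob_change_within(10.0, lam), 3))   # 1 - e^-1, about 0.632
```

Under this model roughly 63% of the intervals between changes are shorter than the average interval, which is why short intervals dominate the distribution.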

17
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with given interval vs. interval in days, compared with the Poisson model]
18
Change Metrics

Freshness
Freshness of element $e_i$ at time $t$ is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

19
Change Metrics

Age
Age of element $e_i$ at time $t$ is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
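The two metrics can be written as small functions. In this sketch `first_change` is a hypothetical argument: the time at which the real page first changed after the local copy was last refreshed, or `None` if it has not changed. Taking "modification time" to mean that first change is an assumption about the slide's definition.

```python
def freshness(t, first_change):
    """F(e_i; t): 1 if the local copy is up-to-date at time t, else 0."""
    up_to_date = first_change is None or first_change > t
    return 1 if up_to_date else 0

def age(t, first_change):
    """A(e_i; t): 0 if up-to-date, else time elapsed since the change."""
    if first_change is None or first_change > t:
        return 0
    return t - first_change

# Copy refreshed at time 0, real page changed at time 3:
print(freshness(2, 3), age(2, 3))   # still fresh at t = 2
print(freshness(5, 3), age(5, 3))   # stale at t = 5, age 2
```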

20
Change Metrics
[Figure: F(ei), between 0 and 1, and A(ei), starting from 0, plotted against time, with update and refresh events marked]
21
Refresh Order

Fixed order
Explicit list of URLs to visit
Random order
Start from seed URLs and follow links
Purely random
Refresh pages on demand, as requested by user

[Diagram: elements ei on the web and their copies ei in the database]
22
Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
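For a sense of how freshness depends on this ratio, here is a sketch under two assumptions that the transcript does not spell out for this plot: changes follow the Poisson model, and the page is refreshed at regular intervals of length $1/f$. Averaging the probability $e^{-\lambda t}$ that the copy is still fresh over one interval gives $(1 - e^{-r})/r$. The function name is hypothetical.

```python
import math

def expected_freshness(r):
    """Time-averaged freshness for ratio r = lambda / f (Poisson changes,
    evenly spaced refreshes)."""
    if r == 0:
        return 1.0                    # page never changes: always fresh
    return -math.expm1(-r) / r        # (1 - e^-r) / r, stable for small r

for r in (0.1, 1.0, 10.0):
    print(r, round(expected_freshness(r), 3))
```

Freshness falls as $r$ grows: a page that changes ten times per visit is fresh only about a tenth of the time.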
23
Age vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
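A matching sketch for age, under the same assumptions (Poisson changes with rate $\lambda$, evenly spaced refreshes $1/f$ apart, age measured from the first change after a refresh). The expected age a time $t$ after a refresh is $t - (1 - e^{-\lambda t})/\lambda$; averaging over one refresh interval gives the closed form used here. The function name is hypothetical.

```python
import math

def expected_age(lam, f):
    """Time-averaged age for change rate lam > 0 and visit frequency f > 0.
    Not numerically robust for extremely small lam / f."""
    r = lam / f
    return (1.0 / f) * (0.5 - 1.0 / r + (1.0 - math.exp(-r)) / (r * r))

print(round(expected_age(1.0, 1.0), 3))   # 0.5 - e^-1, about 0.132
```

Unlike freshness, age is unbounded: it keeps growing as the visit frequency drops, so neglected pages are penalized more heavily under this metric.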
24
Trick Question

Two page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
e1 e2 e1 e2 e1 e2 e1 e2... uniform
e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 proportional
e1 e1 e1 e1 e1 e1 ...
e2 e2 e2 e2 e2 e2 ...
?

[Diagram: pages e1 and e2 on the web and their copies e1 and e2 in the database]
25
Proportional Often Not Good!

Visit fast-changing e1
→ get 1/2 day of freshness
Visit slow-changing e2
→ get 1/2 week of freshness
Visiting e2 is a better deal!
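The two-page example can be checked numerically. This sketch assumes Poisson changes and evenly spaced refreshes, so a page with change rate $\lambda$ visited $f$ times per week has average freshness $(1 - e^{-r})/r$ with $r = \lambda / f$, and a page that is never visited is treated as stale in the long run. All names are hypothetical, and the budget is one visit per week split between e1 and e2.

```python
import math

def expected_freshness(lam, f):
    """Average freshness for change rate lam and visit frequency f."""
    if f <= 0:
        return 0.0                   # never refreshed: stale in the long run
    r = lam / f
    return (1.0 - math.exp(-r)) / r

LAM_E1, LAM_E2 = 7.0, 1.0            # changes per week: e1 daily, e2 weekly

policies = {                         # (visits/week to e1, visits/week to e2)
    "uniform":      (0.5, 0.5),
    "proportional": (7.0 / 8.0, 1.0 / 8.0),
    "only e1":      (1.0, 0.0),
    "only e2":      (0.0, 1.0),
}

def average_freshness(f1, f2):
    return (expected_freshness(LAM_E1, f1) + expected_freshness(LAM_E2, f2)) / 2.0

results = {name: average_freshness(*freqs) for name, freqs in policies.items()}
for name, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the ranking is only e2 (about 0.316), uniform (about 0.252), proportional (about 0.125), only e1 (about 0.071), which agrees with the slide's point that the proportional policy does worse than the uniform one.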

26
Optimal Refresh Frequency

Problem
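Given the change frequencies $\lambda_i$ and the average visit frequency $f$,
find the visit frequencies $f_i$
that maximize the freshness of the database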

27
Solution

Compute
Lagrange multiplier method
All
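One way to carry out a Lagrange-multiplier solution numerically is sketched here. It is not taken from the slides: it assumes the objective is average freshness with per-page freshness $(f_i/\lambda_i)(1 - e^{-\lambda_i/f_i})$ (Poisson changes, evenly spaced refreshes) and a fixed total visit budget. At the optimum every visited page has the same marginal gain $\mu$, and pages whose best possible gain $1/\lambda_i$ is below $\mu$ get no visits. All function names are hypothetical.

```python
import math

def marginal_gain(r):
    """lambda * dF/df as a function of r = lambda / f; increases from 0 to 1."""
    return 1.0 - (1.0 + r) * math.exp(-r)

def visit_freq_for_multiplier(lam, mu):
    """Visit frequency at which a page with rate lam has marginal gain mu."""
    c = mu * lam
    if c >= 1.0:
        return 0.0                    # changes too fast to be worth visiting
    lo, hi = 0.0, 1.0
    while marginal_gain(hi) < c:      # bracket the root
        hi *= 2.0
    for _ in range(100):              # bisection on r
        mid = (lo + hi) / 2.0
        if marginal_gain(mid) < c:
            lo = mid
        else:
            hi = mid
    return lam / hi

def optimal_visit_freqs(lams, budget):
    """Split a total visit budget across pages with change rates lams."""
    lo, hi = 0.0, max(1.0 / lam for lam in lams)
    for _ in range(100):              # bisection on the multiplier mu
        mu = (lo + hi) / 2.0
        total = sum(visit_freq_for_multiplier(lam, mu) for lam in lams)
        if total > budget:
            lo = mu                   # over budget: demand a higher gain
        else:
            hi = mu
    return [visit_freq_for_multiplier(lam, hi) for lam in lams]

# Page changing daily vs. weekly, one visit per week in total:
f1, f2 = optimal_visit_freqs([7.0, 1.0], budget=1.0)
print(round(f1, 6), round(f2, 6))
```

For this two-page input the sketch assigns the whole budget to the slowly changing page, because with so few visits available the fast-changing page can never reach the required marginal gain.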

28
Optimal Refresh Frequency

Shape of curve is the same in all cases
Holds for any change frequency distribution

29
Optimal Refresh for Age

Shape of curve is the same in all cases
Holds for any change frequency distribution

30
Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month
31
Topics to Follow

Weighted Freshness
Non-Poisson Model
Change Frequency Estimation

32
Not Every Page is Equal!
→ Some pages are more important
e1
Accessed by users 10 times/day
e2
Accessed by users 20 times/day
33
Weighted Freshness
[Figure: f vs. λ for weights w = 1 and w = 2]
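A minimal sketch of a weighted freshness metric, assuming it is the weight-normalized average of per-page freshness. The function name is hypothetical, and the weights 1 and 2 are taken to mirror the 10 and 20 accesses per day of e1 and e2.

```python
def weighted_freshness(freshness_values, weights):
    """Weighted average of per-page freshness; weights must not sum to 0."""
    total_weight = sum(weights)
    return sum(w * f for w, f in zip(weights, freshness_values)) / total_weight

# e1 fresh, e2 stale; e2 is accessed twice as often as e1:
print(weighted_freshness([1, 0], [1, 2]))   # 1/3
# e1 stale, e2 fresh:
print(weighted_freshness([0, 1], [1, 2]))   # 2/3
```

With equal weights both cases would score 0.5; the weights make a stale copy of the more heavily accessed page cost more.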
34
Non-Poisson Model
[Figure: fraction of changes with given interval vs. interval in days]
35
Optimal Revisit Frequency for Heavy-Tail Distribution
36
Principle of Diminishing Return