Crawling the Web - PowerPoint PPT Presentation

Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu
1
Crawling the Web
  • Discovery and Maintenance of Large-Scale Web Data

Junghoo Cho Stanford University
2
What is a Crawler?
[Figure: crawler architecture. Components: initial urls, init, to visit urls, get next url, get page, extract urls, visited urls, web, web pages.]
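The loop named in the figure labels can be sketched roughly as follows. This is a minimal illustration, not the WebBase crawler: the function `crawl`, its parameters, and the in-memory `TOY_WEB` are hypothetical names introduced here so that the example runs without network access.

```python
from collections import deque

def crawl(initial_urls, get_page, extract_urls, max_pages=100):
    """Breadth-first crawl; returns {url: page} for the pages fetched."""
    to_visit = deque(initial_urls)       # "to visit urls"
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)             # get page
        if page is None:                 # unknown URL in the toy web
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

# Hypothetical toy "web": each page is represented by its list of outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

pages = crawl(["a"], get_page=TOY_WEB.get, extract_urls=lambda page: page)
print(sorted(pages))
```

A real crawler would replace `get_page` with an HTTP fetch and `extract_urls` with an HTML link parser; the queue and visited-set bookkeeping would stay the same.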
3
Applications
  • Internet Search Engines
  • Google, AltaVista
  • Comparison Shopping Services
  • mySimon, BizRate
  • Data mining
  • Stanford WebBase, IBM WebFountain

4
WebBase Crawler
  • WebBase Project
  • BackRub Crawler, PageRank
  • Google
  • New WebBase Crawler
  • 20,000 lines in C/C++
  • 130M pages collected

5
Crawling Issues (1)
  • Load at visited web sites
  • Space out requests to a site
  • Limit number of requests to a site per day
  • Limit depth of crawl
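The three limits above could be enforced with a small per-site policy object. The sketch below is only one possible arrangement: the class `PolitenessPolicy`, its method names, and the default `max_depth` are assumptions made for illustration, while the 10-second delay and 3,000 pages per day echo the figures given in the experimental setup later in the talk.

```python
SECONDS_PER_DAY = 86400

class PolitenessPolicy:
    """Hypothetical per-site limits: request spacing, daily quota, crawl depth."""

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s
        self.max_per_day = max_per_day
        self.max_depth = max_depth       # assumed value; the slide gives none
        self.last_request = {}           # site -> time of last request (seconds)
        self.requests_today = {}         # site -> (day index, request count)

    def _count_for(self, site, day):
        seen_day, count = self.requests_today.get(site, (day, 0))
        return count if seen_day == day else 0   # quota resets on a new day

    def may_fetch(self, site, depth, now_s):
        if depth > self.max_depth:               # limit depth of crawl
            return False
        last = self.last_request.get(site)
        if last is not None and now_s - last < self.min_delay_s:
            return False                         # space out requests
        day = int(now_s // SECONDS_PER_DAY)
        return self._count_for(site, day) < self.max_per_day

    def record_fetch(self, site, now_s):
        day = int(now_s // SECONDS_PER_DAY)
        self.requests_today[site] = (day, self._count_for(site, day) + 1)
        self.last_request[site] = now_s
```

Passing the clock in as `now_s` rather than reading it inside the class keeps the policy easy to test and to simulate.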

6
Crawling Issues (2)
  • Load at crawler
  • Parallelize

[Figure: parallel crawler. Several init / get next url / get page / extract urls pipelines run side by side, sharing the initial urls, to visit urls, visited urls, and web pages.]
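The slide says only "Parallelize" and does not state how URLs are divided among crawler processes. One common scheme, assumed here purely for illustration, assigns every URL of a host to the same process by hashing the host name, which also keeps per-site request spacing local to one process. The function name `assign_crawler` is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) based on its host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use a stable hash (not Python's per-process randomized hash()) so that
    # every crawler process computes the same assignment.
    return int.from_bytes(digest[:8], "big") % num_crawlers

for url in ["http://example.org/a", "http://example.org/b", "http://example.com/"]:
    print(url, "->", assign_crawler(url, 4))
```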
7
Crawling Issues (3)
  • Scope of crawl
  • Not enough space for all pages
  • Not enough time to visit all pages

8
Crawling Issues (4)
  • Replication
  • Pages mirrored at multiple locations
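The transcript does not describe the detection method from the cited work, so the following is a deliberately simpler stand-in: it groups URLs whose content is identical after whitespace normalization, using a content hash. The function name `group_replicas` and the example URLs are hypothetical.

```python
import hashlib

def group_replicas(pages):
    """pages: {url: content}. Returns groups of URLs with identical content."""
    by_digest = {}
    for url, content in pages.items():
        normalized = " ".join(content.split())   # ignore whitespace differences
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_digest.setdefault(digest, []).append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]

example = {
    "http://a.example/x": "Hello  world",
    "http://mirror.example/x": "Hello world\n",
    "http://a.example/y": "Other page",
}
print(group_replicas(example))
```

Exact-match hashing misses mirrors that differ in small details such as timestamps or banners, which is one reason replica detection is listed as a research issue.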

9
Crawling Issues (5)
  • Incremental crawling
  • How do we avoid crawling from scratch?
  • How do we keep pages fresh?

10
Summary of My Research
  • Load on sites [PAWS00]
  • Parallel crawler [Tech Report 01]
  • Page selection [WWW7]
  • Replicated page detection [SIGMOD00]
  • Page freshness [SIGMOD00]
  • Crawler architecture [VLDB00]

11
Outline of This Talk
  • How can we keep pages fresh?
  • How does the Web change?
  • What do we mean by fresh pages?
  • How should we refresh pages?

12
Web Evolution Experiment
  • How often does a Web page change?
  • How long does a page stay on the Web?
  • How long does it take for 50% of the Web to change?
  • How do we model Web changes?

13
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest PageRank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old pages)
  • ran only 9pm to 6am, with 10 seconds between requests to a site

14
Average Change Interval
[Figure: fraction of pages (y-axis) by average change interval (x-axis).]
15
Change Interval By Domain
[Figure: fraction of pages (y-axis) by average change interval (x-axis), broken down by domain.]
16
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is the time to the next event
  • $f_T(t) = \lambda e^{-\lambda t}$ for $t > 0$
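Under the Poisson model with rate λ (written `lam` below), the time to the next change is exponentially distributed, so the chance that a page has changed within time t is 1 - e^(-λt). The small sketch below evaluates that, plus the time by which a given fraction of pages would have changed if all pages shared one rate; the function names and the example rate are assumptions for illustration.

```python
import math

def prob_changed_within(lam, t):
    """P(T <= t) for an exponential time-to-next-change with rate lam."""
    return 1.0 - math.exp(-lam * t)

def time_until_fraction_changed(lam, fraction):
    """Time t at which P(T <= t) equals the given fraction (0 < fraction < 1)."""
    return -math.log(1.0 - fraction) / lam

lam = 1 / 10   # assumed example: one change every 10 days on average
print(round(prob_changed_within(lam, 10), 3))
print(round(time_until_fraction_changed(lam, 0.5), 2))
```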

17
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with a given interval (y-axis) vs. interval in days (x-axis), compared with the Poisson model.]
18
Change Metrics
  • Freshness
  • Freshness of element $e_i$ at time $t$ is
    $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, and $0$ otherwise

19
Change Metrics
  • Age
  • Age of element $e_i$ at time $t$ is
    $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, and $t - (\text{modification time of } e_i)$ otherwise
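The two metrics can be written as small functions over a page's change history. This sketch assumes that "modification time" means the time of the first change the local copy has not yet seen, that is, the first change after the last refresh; the function and parameter names are hypothetical.

```python
def freshness(t, last_refresh, change_times):
    """1 if the copy refreshed at last_refresh is still up-to-date at time t, else 0."""
    stale = any(last_refresh < c <= t for c in change_times)
    return 0 if stale else 1

def age(t, last_refresh, change_times):
    """0 if up-to-date at t; otherwise time elapsed since the first unseen change."""
    unseen = [c for c in change_times if last_refresh < c <= t]
    return t - min(unseen) if unseen else 0

# The page changed at times 1, 3 and 4; the copy was refreshed at time 2.
print(freshness(5, 2, [1, 3, 4]), age(5, 2, [1, 3, 4]))
```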

20
Change Metrics
[Figure: F(ei) (between 0 and 1) and A(ei) (starting at 0) plotted against time, with update and refresh events marked.]
21
Refresh Order
  • Fixed order
  • Explicit list of URLs to visit
  • Random order
  • Start from seed URLs and follow links
  • Purely random
  • Refresh pages on demand, as requested by user

[Figure: elements ei on the web and their copies ei in the database.]
22
Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
23
Age vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
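The curves from these two slides are not in the transcript. As a sketch of what they could look like, assume a page changes as a Poisson process with rate $\lambda$ and is refreshed at evenly spaced intervals of length $1/f$; then the time-averaged freshness and age have the closed forms below, with $r = \lambda / f$. The evenly spaced refresh is an assumption here, not something the slide states.

```latex
% Probability the copy is still fresh t time units after a refresh: e^{-\lambda t}
\bar{F}(\lambda, f) = f \int_0^{1/f} e^{-\lambda t}\, dt = \frac{1 - e^{-r}}{r}

% Expected age t time units after a refresh: t - (1 - e^{-\lambda t})/\lambda
\bar{A}(\lambda, f) = f \int_0^{1/f} \left( t - \frac{1 - e^{-\lambda t}}{\lambda} \right) dt
                    = \frac{1}{f} \left( \frac{1}{2} - \frac{1}{r} + \frac{1 - e^{-r}}{r^2} \right)
```

Freshness depends only on the ratio $r$, while age also scales with the refresh interval $1/f$.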
24
Trick Question
  • Two page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit one page per week
  • How should we visit pages?
  • e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • ?

[Figure: pages e1 and e2 on the web and their copies in the database.]
25
Proportional Often Not Good!
  • Visit fast changing e1
  • → get 1/2 day of freshness
  • Visit slow changing e2
  • → get 1/2 week of freshness
  • Visiting e2 is a better deal!
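The comparison can be checked numerically. The sketch below assumes Poisson changes and evenly spaced refreshes, so that a page with change rate `lam` and refresh rate `f` has time-averaged freshness (1 - e^(-r)) / r with r = lam / f. The slide reasons with "half a day" and "half a week" instead, so the exact numbers differ, but the ranking of the four schedules is what matters. All names are illustrative.

```python
import math

def avg_freshness(lam, f):
    """Time-averaged freshness of one page: change rate lam, refresh rate f."""
    if f == 0:
        return 0.0               # never refreshed: stale in the long run
    r = lam / f
    return (1.0 - math.exp(-r)) / r

LAM = {"e1": 1.0, "e2": 1.0 / 7.0}   # changes per day, as on the slide
BUDGET = 1.0 / 7.0                    # one visit per week in total

policies = {
    "uniform":      {"e1": BUDGET / 2,     "e2": BUDGET / 2},
    "proportional": {"e1": BUDGET * 7 / 8, "e2": BUDGET * 1 / 8},
    "e1 only":      {"e1": BUDGET,         "e2": 0.0},
    "e2 only":      {"e1": 0.0,            "e2": BUDGET},
}

results = {
    name: sum(avg_freshness(LAM[p], f[p]) for p in LAM) / len(LAM)
    for name, f in policies.items()
}
for name, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions, spending the whole budget on e2 gives the highest average freshness, and the proportional schedule scores below the uniform one.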

26
Optimal Refresh Frequency
  • Problem
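  • Given change frequencies $\lambda_i$ and average refresh frequency $f$,
  • find refresh frequencies $f_i$
  • that maximize average freshness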

27
Solution
  • Compute
  • Lagrange multiplier method
  • All
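The formulas on this slide did not survive in the transcript. As a hedged sketch of how a Lagrange multiplier argument would apply, assume $N$ pages with Poisson change rates $\lambda_i$, evenly spaced refreshes at rates $f_i$, and an average refresh budget $f$. The symbols $N$, $\mu$ and $r_i = \lambda_i / f_i$ are introduced here for illustration.

```latex
% Maximize average freshness subject to the refresh budget
\max_{f_1,\dots,f_N} \; \frac{1}{N} \sum_{i=1}^{N} \bar{F}(\lambda_i, f_i)
\quad \text{subject to} \quad \frac{1}{N} \sum_{i=1}^{N} f_i = f,
\qquad \bar{F}(\lambda_i, f_i) = \frac{f_i}{\lambda_i} \left( 1 - e^{-\lambda_i / f_i} \right)

% Lagrangian and stationarity condition
\mathcal{L} = \frac{1}{N} \sum_{i} \bar{F}(\lambda_i, f_i) - \mu \left( \frac{1}{N} \sum_{i} f_i - f \right),
\qquad
\frac{\partial \mathcal{L}}{\partial f_i} = 0
\;\Longrightarrow\;
\frac{\partial \bar{F}(\lambda_i, f_i)}{\partial f_i} = \mu \quad \text{for all } i \text{ with } f_i > 0

% Derivative under the assumed model
\frac{\partial \bar{F}(\lambda_i, f_i)}{\partial f_i} = \frac{1 - (1 + r_i)\, e^{-r_i}}{\lambda_i}
```

The condition says that at the optimum every refreshed page has the same marginal gain in freshness per extra refresh; pages whose marginal gain can never reach $\mu$ get $f_i = 0$.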

28
Optimal Refresh Frequency
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

29
Optimal Refresh for Age
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

30
Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month
31
Topics to Follow
  • Weighted Freshness
  • Non-Poisson Model
  • Change Frequency Estimation

32
Not Every Page is Equal!
→ Some pages are more important
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
33
Weighted Freshness
[Figure: f vs. λ, with curves for w = 2 and w = 1.]
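The exact definition used on the slide is not in the transcript. One natural reading, assumed here, is a weighted average of per-page freshness in which each page's weight reflects its importance, for example how often users access it. The function name and the choice of access counts as weights are illustrative.

```python
def weighted_freshness(fresh, weight):
    """Weighted average of per-page freshness values (each 0 or 1)."""
    total = sum(weight.values())
    return sum(weight[p] * fresh[p] for p in fresh) / total

# Weights taken from the access counts in the earlier example (an assumption).
weight = {"e1": 10, "e2": 20}

print(weighted_freshness({"e1": 1, "e2": 0}, weight))   # only e1 is fresh
print(weighted_freshness({"e1": 0, "e2": 1}, weight))   # only e2 is fresh
```

With these weights, keeping e2 fresh contributes twice as much to the metric as keeping e1 fresh.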
34
Non-Poisson Model
[Figure: fraction of changes with a given interval (y-axis) vs. interval in days (x-axis).]
35
Optimal Revisit Frequency for Heavy-Tail Distribution
36
Principle of Diminishing Return
  • $T$: time to next change
  • continuous, differentiable
  • Every page changes
  • Definition of change rate $\lambda$

37
Change Frequency Estimation
  • How to estimate change frequency?
  • Naïve estimator: X/T
  • X: number of detected changes
  • T: monitoring period
  • 2 changes in 10 days: 0.2 times/day
  • Incomplete change history
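The "incomplete change history" problem can be illustrated with a small simulation: a crawler that visits once a day can only tell whether a page changed at least once since the previous visit, so several changes within one day count as one. The sketch assumes Poisson changes; the function name, seed, and parameters are hypothetical.

```python
import random

def simulate_naive_estimate(lam, days, seed=0):
    """Naive estimate X/T (changes per day) from daily visits to a page with rate lam."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(days):
        # Time from the previous visit to the first change, in days. Because a
        # Poisson process is memoryless, each day can be sampled independently.
        first_change = rng.expovariate(lam)
        if first_change < 1.0:
            detected += 1        # one detection, however many changes occurred
    return detected / days

true_rate = 1.0                  # assumed: one change per day on average
estimate = simulate_naive_estimate(true_rate, days=10000)
print(f"true rate {true_rate:.2f}, naive estimate {estimate:.2f}")
```

The naive estimate settles near 1 - e^(-1), noticeably below the true rate, because extra changes between visits are never observed.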

38
Improved Estimator
  • Based on the Poisson model
  • X: number of detected changes
  • N: number of accesses
  • f: access frequency
  • 3 changes in 10 days: 0.36 times/day
  • → Accounts for missed changes
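The estimator's formula did not survive in the transcript. One form that is consistent with the Poisson model and reproduces the 0.36 figure, assuming 10 daily accesses, is -f ln(1 - X/N): under a Poisson process the probability of seeing no change between two accesses is e^(-λ/f), and (N - X)/N is the observed fraction of such accesses. The sketch below treats this form as an assumption; the author's published estimator may differ, for example by adding a bias correction.

```python
import math

def naive_estimate(x, n, f):
    """Detected changes per unit time: X / T with T = N / f."""
    return f * x / n

def poisson_estimate(x, n, f):
    """Assumed Poisson-based form: -f * ln(1 - X/N)."""
    if x >= n:
        raise ValueError("a change was seen at every access; the estimate is unbounded")
    return -f * math.log(1.0 - x / n)

# 3 detected changes in 10 daily accesses (f = 1 access per day).
print(round(naive_estimate(3, 10, 1.0), 2))
print(round(poisson_estimate(3, 10, 1.0), 2))
```

The Poisson-based value is always at least as large as the naive one, since it compensates for changes that fell between accesses.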

39
Improved Estimator
  • Bias
  • Efficiency
  • Consistency

40
Improvement Significant?
  • Application to a Web crawler
  • Visit pages once every week for 5 weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Uniform: do not adjust
  • Naïve: based on the naïve estimator
  • Ours: based on our improved estimator

41
Improvement from Our Estimator
(9,200,000 visits in total)
42
Other Estimators
  • Irregular access interval
  • Last-modified date
  • Categorization

43
Summary
  • Web evolution experiment
  • Change metric
  • Refresh policy
  • Frequency estimator

44
Contribution
  • Freshness [SIGMOD00]
  • Page selection [WWW7]
  • Replicated page detection [SIGMOD00]
  • Load on sites [PAWS00]
  • Parallel crawler [Tech Report 01]
  • Crawler architecture [VLDB00]

45
The End
  • Thank you for your attention
  • For more information visit
  • http://www-db.stanford.edu/~cho/