Title: Crawling the Web
1Crawling the Web
- Discovery and Maintenance of Large-Scale Web Data
Junghoo Cho, Stanford University
2What is a Crawler?
[Figure: crawler flow diagram with components: initial urls, init, to visit urls, get next url, get page, extract urls, visited urls, web, web pages]
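The loop named in the diagram can be sketched as follows. This is only an illustration under stated assumptions: `FAKE_WEB`, `get_page`, `extract_urls`, and the toy "links: ..." page format are hypothetical stand-ins, and a real crawler would fetch over HTTP and parse HTML instead.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> page text (assumption for the sketch).
FAKE_WEB = {
    "a": "links: b c",
    "b": "links: c",
    "c": "links: a d",
    "d": "no links here",
}

def get_page(url):
    """Stand-in for a network fetch; returns None for unknown URLs."""
    return FAKE_WEB.get(url)

def extract_urls(page):
    """Extract outgoing URLs from the toy 'links: x y z' format."""
    match = re.match(r"links:\s*(.*)", page)
    return match.group(1).split() if match else []

def crawl(initial_urls):
    to_visit = deque(initial_urls)      # "to visit urls"
    visited = set()                     # "visited urls"
    pages = {}                          # "web pages"
    while to_visit:
        url = to_visit.popleft()        # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)            # "get page"
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):  # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages
```

Calling `crawl(["a"])` on this toy web collects all four pages, visiting each URL once.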
3Applications
- Internet Search Engines
- Google, AltaVista
- Comparison Shopping Services
- mySimon, BizRate
- Data mining
- Stanford WebBase, IBM WebFountain
4WebBase Crawler
- WebBase Project
- BackRub Crawler, PageRank
- Google
- New WebBase Crawler
- 20,000 lines in C/C++
- 130M pages collected
5Crawling Issues (1)
- Load at visited web sites
- Space out requests to a site
- Limit number of requests to a site per day
- Limit depth of crawl
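The three limits above could be enforced by a small per-site gate like the one below. It is a minimal sketch, not the WebBase implementation: the class name `PolitenessPolicy`, its parameters, and the default `max_depth` are hypothetical, while the 10-second delay and the 3,000-pages-per-day cap are borrowed from the experimental setup described later in this talk.

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Hypothetical per-site gate: spacing, daily cap, and depth limit."""

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s
        self.max_per_day = max_per_day
        self.max_depth = max_depth
        self.last_request = {}   # site -> time of last request (seconds)
        self.daily_count = {}    # site -> (day index, requests that day)

    def try_acquire(self, url, depth, now):
        """Return True and record the request if it is allowed at time `now`."""
        if depth > self.max_depth:
            return False                      # limit depth of crawl
        site = urlsplit(url).netloc
        last = self.last_request.get(site)
        if last is not None and now - last < self.min_delay_s:
            return False                      # space out requests to a site
        day = int(now // 86400)
        counted_day, count = self.daily_count.get(site, (day, 0))
        if counted_day != day:
            count = 0                         # a new day resets the counter
        if count >= self.max_per_day:
            return False                      # limit requests per site per day
        self.last_request[site] = now
        self.daily_count[site] = (day, count + 1)
        return True
```

The caller passes the current time explicitly (for example `time.time()`), which keeps the policy easy to test with a simulated clock.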
6Crawling Issues (2)
- Load at crawler
- Parallelize
[Figure: parallel crawler diagram with two copies of init, get next url, get page, and extract urls, sharing initial urls, to visit urls, visited urls, and web pages]
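One simple way to parallelize is to fetch a whole batch of URLs concurrently while keeping the shared URL sets in a single thread. The sketch below makes that assumption; it is not the design from the talk, and `FAKE_WEB` and `get_page` are hypothetical stand-ins for network fetches.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory web: URL -> list of outgoing links (assumption).
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

def get_page(url):
    """Stand-in for a network fetch; returns the page's outgoing links."""
    return FAKE_WEB.get(url, [])

def parallel_crawl(initial_urls, workers=4):
    visited = set()                                # "visited urls"
    frontier = list(dict.fromkeys(initial_urls))   # "to visit urls", deduplicated
    pages = {}                                     # "web pages"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            visited.update(frontier)
            # Fetch the whole batch concurrently; map() preserves input order.
            results = list(pool.map(get_page, frontier))
            next_frontier = []
            for url, links in zip(frontier, results):
                pages[url] = links
                for link in links:
                    if link not in visited and link not in next_frontier:
                        next_frontier.append(link)
            frontier = next_frontier
    return pages
```

Only the fetch step runs in worker threads here, so the shared sets need no locking; a design with fully independent crawler processes would need to coordinate those sets explicitly.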
7Crawling Issues (3)
- Scope of crawl
- Not enough space for all pages
- Not enough time to visit all pages
8Crawling Issues (4)
- Replication
- Pages mirrored at multiple locations
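As a rough illustration of the problem, exact mirrors can be grouped by hashing page content. This sketch only catches byte-identical pages after whitespace and case normalization; it is a swapped-in simplification, not the replicated-collection detection method from the author's research, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Hash of the page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def group_mirrors(pages):
    """Return lists of URLs whose pages share the same fingerprint."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

For example, two URLs serving "Hello  World" and "hello world" end up in one group, while a page with different text stays out.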
9Crawling Issues (5)
- Incremental crawling
- How do we avoid crawling from scratch?
- How do we keep pages fresh?
10Summary of My Research
- Load on sites [PAWS00]
- Parallel crawler [Tech Report 01]
- Page selection [WWW7]
- Replicated page detection [SIGMOD00]
- Page freshness [SIGMOD00]
- Crawler architecture [VLDB00]
11Outline of This Talk
- How can we keep pages fresh?
- How does the Web change?
- What do we mean by fresh pages?
- How should we refresh pages?
12Web Evolution Experiment
- How often does a Web page change?
- How long does a page stay on the Web?
- How long does it take for 50% of the Web to change?
- How do we model Web changes?
13Experimental Setup
- February 17 to June 24, 1999
- 270 sites visited (with permission)
- identified 400 sites with highest PageRank
- contacted administrators
- 720,000 pages collected
- 3,000 pages from each site daily
- start at root, visit breadth first (get new & old pages)
- ran only 9pm - 6am, 10 seconds between site requests
14Average Change Interval
[Plot: fraction of pages versus average change interval]
15Change Interval By Domain
[Plot: fraction of pages versus average change interval, by domain]
16Modeling Web Evolution
- Poisson process with rate λ
- T is time to next event
- $f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
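Under this model the interval between changes is exponentially distributed, so the chance of a change within t time units is 1 - e^(-λt) and the mean interval is 1/λ. A small sketch to check this numerically; the helper names and the example rate of 0.1 changes per day are assumptions chosen for illustration.

```python
import math
import random

def prob_change_within(rate, t):
    """P(T <= t) = 1 - exp(-rate * t) for an exponential interval T."""
    return 1.0 - math.exp(-rate * t)

def sample_intervals(rate, n, seed=0):
    """Draw n change intervals from the exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]

rate = 0.1  # assumed example: one change every 10 days on average
intervals = sample_intervals(rate, 20000)
mean_interval = sum(intervals) / len(intervals)
```

With these values `mean_interval` comes out close to 10 days, and `prob_change_within(rate, 10)` is about 0.63.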
17Change Interval of Pages
[Plot: for pages that change every 10 days on average, fraction of changes with a given interval versus interval in days, compared with the Poisson model]
18Change Metrics
- Freshness
- Freshness of element $e_i$ at time $t$ is
  $F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$
19Change Metrics
- Age
- Age of element $e_i$ at time $t$ is
  $A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$
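The two metrics can be written as small functions of a page's history. This is a sketch under one stated assumption: the "modification time" in the age definition is taken to be the first modification that the local copy has not yet picked up. The function and parameter names are hypothetical.

```python
def freshness(last_sync, modification_times, t):
    """1 if the copy synced at last_sync is still up-to-date at time t, else 0."""
    stale = any(last_sync < m <= t for m in modification_times)
    return 0 if stale else 1

def age(last_sync, modification_times, t):
    """0 if up-to-date; otherwise time elapsed since the first missed modification."""
    missed = [m for m in modification_times if last_sync < m <= t]
    return t - min(missed) if missed else 0
```

For a copy synced at time 0 and a page modified at times 3 and 5, freshness is 1 at t = 2 and 0 at t = 4, and the age at t = 6 is 3.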
20Change Metrics
[Figure: F(ei) (values 1 and 0) and A(ei) (starting at 0) plotted over time, with update and refresh events marked]
21Refresh Order
- Fixed order
- Explicit list of URLs to visit
- Random order
- Start from seed URLs and follow links
- Purely random
- Refresh pages on demand, as requested by user
[Figure: elements ei on the web and their copies ei in the local database]
22Freshness vs. Revisit Frequency
r = λ / f = average change frequency / average visit frequency
23Age vs. Revisit Frequency
r = λ / f = average change frequency / average visit frequency
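For reference, both time averages have a closed form as a function of r if one assumes Poisson changes with rate λ and a page refreshed at regular intervals of length I = 1/f. The derivation below is a worked sketch under those assumptions, which may differ from the exact policy behind the plotted curves.

```latex
% Expected freshness t time units after a refresh is e^{-\lambda t};
% expected age is t - (1 - e^{-\lambda t})/\lambda. Averaging over one
% refresh interval I = 1/f, with r = \lambda / f:
\begin{align*}
\bar{F}(\lambda, f) &= \frac{1}{I}\int_0^I e^{-\lambda t}\,dt
                     = \frac{1 - e^{-r}}{r}, \\
\bar{A}(\lambda, f) &= \frac{1}{I}\int_0^I \left(t - \frac{1 - e^{-\lambda t}}{\lambda}\right) dt
                     = \frac{1}{f}\left(\frac{1}{2} - \frac{1}{r} + \frac{1 - e^{-r}}{r^2}\right).
\end{align*}
```

As r grows (the page changes much faster than it is visited), average freshness falls toward 0 and average age grows toward half the refresh interval.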
24Trick Question
- Two page database
- e1 changes daily
- e2 changes once a week
- Can visit one page per week
- How should we visit pages?
- e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
- e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
- e1 e1 e1 e1 e1 e1 ...
- e2 e2 e2 e2 e2 e2 ...
- ?
[Figure: pages e1 and e2 on the web and their copies e1 and e2 in the database]
25Proportional Often Not Good!
- Visit fast changing e1
- → get 1/2 day of freshness
- Visit slow changing e2
- → get 1/2 week of freshness
- Visiting e2 is a better deal!
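The same conclusion can be checked numerically. The sketch below swaps in a different model from the back-of-envelope argument above: it assumes Poisson changes and regular refreshes, for which the time-averaged freshness of a page is (1 - e^(-r)) / r with r = λ / f. Rates are in changes per week, and the policy names and helper functions are hypothetical.

```python
import math

def avg_freshness(change_rate, visit_freq):
    """Time-averaged freshness under Poisson changes and regular refreshes."""
    if visit_freq == 0:
        return 0.0                     # never refreshed: stale in the long run
    r = change_rate / visit_freq
    return -math.expm1(-r) / r         # (1 - e^-r) / r

def policy_freshness(change_rates, visit_freqs):
    """Average freshness over all pages for one allocation of visits."""
    values = [avg_freshness(lam, f) for lam, f in zip(change_rates, visit_freqs)]
    return sum(values) / len(values)

RATES = [7.0, 1.0]                     # e1 changes daily, e2 once a week
POLICIES = {                           # visits per week; each sums to 1
    "uniform": [0.5, 0.5],
    "proportional": [7 / 8, 1 / 8],
    "only e1": [1.0, 0.0],
    "only e2": [0.0, 1.0],
}
scores = {name: policy_freshness(RATES, freqs) for name, freqs in POLICIES.items()}
```

Under these assumptions the scores are roughly 0.32 for visiting only e2, 0.25 for uniform, 0.12 for proportional, and 0.07 for visiting only e1.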
26Optimal Refresh Frequency
- Problem
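- Given the change rates λi and the average visit frequency f,
- find the per-page visit frequencies fi
- that maximize the average freshness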
27Solution
- Compute
- Lagrange multiplier method
- All
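A numerical version of the Lagrange-multiplier idea can be sketched as follows. It rests on assumptions not stated on the slide: per-page freshness is taken to be (1 - e^(-r)) / r with r = λ / f (Poisson changes, regular refreshes), the budget is expressed as a total visit frequency, and all function names are hypothetical. At the optimum every visited page has the same marginal gain μ, and pages whose best possible marginal gain (1/λ) is below μ get no visits.

```python
import math

def marginal_gain(r):
    """lambda * dF/df at r = lambda / f, for F = (1 - e^-r) / r."""
    return -math.expm1(-r) - r * math.exp(-r)   # 1 - (1 + r) e^-r

def solve_ratio(target):
    """Find r > 0 with marginal_gain(r) == target, for 0 < target < 1."""
    lo, hi = 0.0, 1.0
    while marginal_gain(hi) < target:           # grow the bracket
        hi *= 2.0
    for _ in range(200):                        # bisection; gain increases with r
        mid = (lo + hi) / 2.0
        if marginal_gain(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def frequencies_for(mu, rates):
    """Visit frequencies at which each page's marginal gain equals mu."""
    result = []
    for lam in rates:
        target = mu * lam
        if target >= 1.0:
            result.append(0.0)                  # not worth visiting at this mu
        else:
            result.append(lam / solve_ratio(target))
    return result

def optimal_frequencies(rates, total_budget):
    """Bisect on the multiplier mu until the frequencies use up the budget."""
    lo, hi = 1e-12, max(1.0 / lam for lam in rates)
    for _ in range(200):
        mu = (lo + hi) / 2.0
        if sum(frequencies_for(mu, rates)) > total_budget:
            lo = mu                             # over budget: raise the threshold
        else:
            hi = mu
    return frequencies_for(hi, rates)
```

For a page changing 7 times a week, a page changing once a week, and a budget of one visit per week, `optimal_frequencies([7.0, 1.0], 1.0)` assigns no visits to the fast page and the whole budget to the slow one.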
28Optimal Refresh Frequency
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
29Optimal Refresh for Age
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
30Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month
31Topics to Follow
- Weighted Freshness
- Non-Poisson Model
- Change Frequency Estimation
32Not Every Page is Equal!
→ Some pages are more important
- e1: accessed by users 10 times/day
- e2: accessed by users 20 times/day
33Weighted Freshness
[Plot: revisit frequency f versus change rate λ, with curves for weights w = 2 and w = 1]
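One natural reading of weighted freshness is a weighted average of per-page freshness. The sketch below assumes that reading and uses user access counts as weights; the function name and the choice of weights are assumptions for illustration.

```python
def weighted_freshness(freshness_values, weights):
    """Weighted average of per-page freshness values."""
    total_weight = sum(weights)
    weighted_sum = sum(w * f for w, f in zip(weights, freshness_values))
    return weighted_sum / total_weight
```

With weights 10 and 20, keeping only the less-accessed page fresh scores 1/3, while keeping only the more-accessed page fresh scores 2/3.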
34Non-Poisson Model
[Plot: fraction of changes with a given interval versus interval in days]
35Optimal Revisit Frequency for Heavy-Tail Distribution
36Principle of Diminishing Return
- T: time to next change
- continuous, differentiable
- Every page changes
- Definition of change rate λ
37Change Frequency Estimation
- How to estimate change frequency?
- Naïve estimator: X/T
- X: number of detected changes
- T: monitoring period
- 2 changes in 10 days: 0.2 times/day
- Incomplete change history
38Improved Estimator
- Based on the Poisson model
-
- X: number of detected changes
- N: number of accesses
- f: access frequency
- 3 changes in 10 days: 0.36 times/day
- → Accounts for missed changes
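The two estimators can be compared in a few lines. The formula for the improved estimator is not preserved in this transcript, so the sketch assumes the Poisson maximum-likelihood form -f ln(1 - X/N), which reproduces the 0.36 figure quoted above for 3 detected changes in 10 daily accesses. Function names are hypothetical.

```python
import math

def naive_estimate(changes_detected, monitoring_period):
    """Naive estimator X / T."""
    return changes_detected / monitoring_period

def improved_estimate(changes_detected, num_accesses, access_freq):
    """Assumed Poisson-based form: -f * ln(1 - X / N)."""
    if changes_detected >= num_accesses:
        return math.inf    # a change seen on every access: rate is unbounded
    return -access_freq * math.log(1.0 - changes_detected / num_accesses)
```

Here `naive_estimate(2, 10)` gives 0.2 changes/day, and `improved_estimate(3, 10, 1.0)` gives about 0.36 changes/day. The improved value exceeds X/T because a detected change may hide several actual changes between two accesses.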
39Improved Estimator
- Bias
- Efficiency
- Consistency
40Improvement Significant?
- Application to a Web crawler
- Visit pages once every week for 5 weeks
- Estimate change frequency
- Adjust revisit frequency based on the estimate
- Uniform: do not adjust
- Naïve: based on the naïve estimator
- Ours: based on our improved estimator
41Improvement from Our Estimator
(9,200,000 visits in total)
42Other Estimators
- Irregular access interval
- Last-modified date
- Categorization
43Summary
- Web evolution experiment
- Change metric
- Refresh policy
- Frequency estimator
44Contribution
- Freshness [SIGMOD00]
- Page selection [WWW7]
- Replicated page detection [SIGMOD00]
- Load on sites [PAWS00]
- Parallel crawler [Tech Report 01]
- Crawler architecture [VLDB00]
45The End
- Thank you for your attention
- For more information visit
- http://www-db.stanford.edu/cho/