Title: Parallel Crawlers
1. Parallel Crawlers
- Junghoo Cho, Hector Garcia-Molina
- Presented by Punnawat Tadapak
2. Introduction
- Parallel crawler challenges:
  - Overlap: minimize downloading the same web page multiple times
  - Quality: overall quality depends on the crawl strategy of each crawling process
  - Communication bandwidth: minimize the communication overhead among processes
3. Introduction (2)
- Advantages of a parallel crawler:
  - Scalability: needed because of the enormous size of the web
  - Network-load dispersion: multiple crawling processes run at geographically distant locations
  - Network-load reduction: downloaded pages are transferred over a local network
4. Architecture of a Parallel Crawler
(Diagram: two configurations, the intra-site parallel crawler and the distributed crawler; each C-proc (crawling process) maintains a queue of URLs to visit and contributes to the collected pages.)
5. Data Transfer Methods
- Compression: compress the data before sending it
- Difference: send only the difference from a previously transferred page
- Summarization: extract and send only the information needed for index construction
(A compression sketch follows below.)
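To make the compression option concrete, here is a minimal Python sketch, not from the paper; the `send` callable is a hypothetical transport to the central page collection:

```python
import zlib

def send_page_compressed(url: str, html: bytes, send) -> None:
    # Compress the page body before it crosses the network; the receiver
    # restores it with zlib.decompress(). Level 6 trades CPU for bandwidth.
    payload = zlib.compress(html, 6)
    send(url, payload)
```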
6. To Avoid Overlap
- Independent: no coordination among crawling processes
(Diagram: sites S1, S2, S3 crawled by independent processes C1, C2, C3; the same page e is downloaded repeatedly, illustrating overlap.)
7. To Avoid Overlap (2)
- Dynamic assignment: a central coordinator partitions the web and assigns each partition to a c-proc
(Diagram: Partition 1 (c-proc 1) holds www.cnn.com, www.cnn.com/a.htm, www.cnn.com/b.htm, www.cnn.com/c.htm; Partition 2 (c-proc 2) holds www.nytimes.com.)
8. To Avoid Overlap (3)
- Static assignment: the URLs are divided among c-procs without a central coordinator
- Firewall Mode
- Cross-over Mode
- Exchange Mode
(A sketch of how each mode handles an inter-partition link follows below.)
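A minimal sketch, not from the paper, of how the three modes treat a newly discovered link; `partition_of` is a site-hash function (md5 is used so every process computes the same value), and `frontier`/`peers` are hypothetical queue and channel objects:

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 2
MY_PARTITION = 0

def partition_of(url: str) -> int:
    # Site-hash partitioning: a stable hash of the host name, so all
    # c-procs agree on which partition owns a URL.
    site = urlparse(url).netloc
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % NUM_PARTITIONS

def handle_link(url: str, mode: str, frontier, peers, out_of_work: bool) -> None:
    p = partition_of(url)
    if p == MY_PARTITION:
        frontier.append(url)          # our own partition: always crawl it
    elif mode == "firewall":
        pass                          # discard inter-partition links
    elif mode == "cross-over":
        if out_of_work:
            frontier.append(url)      # follow it anyway; overlap becomes possible
    elif mode == "exchange":
        peers[p].send(url)            # forward to the responsible c-proc
```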
9. Firewall Mode
(Diagram: C1 crawls S1 and C2 crawls S2; inter-partition links, marked X, are discarded, so pages reachable only through them are never downloaded.)
10. Cross-over Mode
- A c-proc downloads pages in its own partition first, but follows inter-partition links once it runs out of its own URLs (full coverage, but overlap becomes possible)
11. Exchange Mode
- C-procs neither follow nor discard inter-partition links; they forward the URLs to the responsible peer (no overlap, full coverage, but communication overhead)
(Diagram: C1 on S1 and C2 on S2 exchange the inter-partition URLs they discover.)
12. URL Exchange Minimization
- Batch communication: rather than sending each URL immediately, a c-proc
  - collects inter-partition URLs until it has downloaded k pages
  - partitions the collected URLs
  - sends each batch to the appropriate c-proc
(A batching sketch follows below.)
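A minimal sketch of batch communication, reusing the hypothetical `partition_of` and `peers` objects from the mode sketch above; inter-partition URLs are buffered and flushed only after every k downloads:

```python
from collections import defaultdict

class BatchingCProc:
    def __init__(self, my_partition: int, peers: dict, k: int):
        self.my_partition = my_partition
        self.peers = peers                  # partition id -> send channel
        self.k = k                          # flush after k page downloads
        self.downloaded = 0
        self.batches = defaultdict(list)    # partition id -> buffered URLs

    def on_link(self, url: str) -> None:
        p = partition_of(url)
        if p != self.my_partition:
            self.batches[p].append(url)     # buffer instead of sending now

    def on_page_downloaded(self) -> None:
        self.downloaded += 1
        if self.downloaded % self.k == 0:
            self.flush()

    def flush(self) -> None:
        for p, urls in self.batches.items():
            self.peers[p].send(urls)        # one message per peer per batch
        self.batches.clear()
```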
13. URL Exchange Minimization (2)
(Diagram, k = 4: C1 downloads a, b, c, m and batches the inter-partition URLs g, j for C2; C2 downloads f, g, h, i and batches d for C1.)
14. URL Exchange Minimization (3)
- Replication
  - Reduce URL exchanges by replicating the most popular URLs at each c-proc and no longer transferring them.
  - Identify the k most popular URLs from the image of the web collected in a previous crawl (or on the fly) and replicate them.
  - Because link popularity is highly skewed, replicating even a small number of URLs significantly reduces URL exchanges.
(A replication sketch follows below.)
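A minimal sketch of the replication filter, assuming a `popular` set holding the k most-linked URLs from a previous crawl; it sits in front of the batching logic above:

```python
def on_link_with_replication(cproc: "BatchingCProc", url: str, popular: set) -> None:
    # Replicated URLs are already known to every c-proc, so links to them
    # never need to be exchanged; everything else is batched as before.
    if url in popular:
        return
    cproc.on_link(url)
```

Because a few popular pages attract most inter-partition links, this filter removes a large share of the exchange traffic.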
15. Partitioning Function
- URL-Hash Based
- Site-Hash Based
- Hierarchical
16. Partitioning Function (1)
- URL-Hash Based: partition by the hash value of the full URL
  - http://www.cnn.com/index.html -> 67hazf
  - http://www.cnn.com/index.asp -> dj281s
17. Partitioning Function (2)
- Site-Hash Based: partition by the hash value of the site name
  - http://www.cnn.com/index.html -> 67hazf
  - http://www.cnn.com/index.asp -> 67hazf
18. Partitioning Function (3)
- Hierarchical: partition the web hierarchically based on the URLs of pages
  - Example: by top-level domain (.com, .net, and others)
(A sketch of all three functions follows below.)
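A minimal sketch of the three partitioning functions; the digests on the slides (67hazf, dj281s) are illustrative only, and md5 is an assumption chosen because it is stable across processes:

```python
import hashlib
from urllib.parse import urlparse

def _stable_hash(text: str) -> int:
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def url_hash_partition(url: str, n: int) -> int:
    # Pages of one site can scatter across c-procs, creating many
    # inter-partition links.
    return _stable_hash(url) % n

def site_hash_partition(url: str, n: int) -> int:
    # All pages of one site map to the same c-proc, so most links
    # (which point within a site) stay intra-partition.
    return _stable_hash(urlparse(url).netloc) % n

def hierarchical_partition(url: str) -> int:
    # Partition by top-level domain: .com, .net, everything else.
    host = urlparse(url).netloc
    if host.endswith(".com"):
        return 0
    if host.endswith(".net"):
        return 1
    return 2
```

Site-hash keeps most links intra-partition because most links point to pages on the same site, which is why it is the natural default.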
19. Summary of the Options
20. Evaluation Models
- Overlap = (N - I) / I
  - N: total number of pages downloaded
  - I: number of unique pages downloaded
- Example
  - Total pages downloaded: N = 100
  - Unique pages downloaded: I = 80
  - Overlap = (100 - 80) / 80 = 0.25
21. Evaluation Models (2)
- Coverage = I / U
  - U: total number of pages the overall crawler has to download
  - I: number of unique pages downloaded
- Example
  - Pages the overall crawler has to download: U = 100
  - Unique pages downloaded: I = 80
  - Coverage = 80 / 100 = 0.8
22. Evaluation Models (3)
- Quality = |A_N ∩ P_N| / |P_N|
  - P_N: set of pages downloaded by an oracle crawler (the N most important pages)
  - A_N: set of pages downloaded by the actual crawler
- Example
  - P_N = {A, B, C, D, E, F, G, H, I, J}
  - A_N = {A, B, C, D, E, F, Z, X, I, J}
  - Quality = 8/10 = 0.8
(A sketch computing all three metrics follows below.)
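A minimal sketch computing the three metrics, checked against the numbers in the examples above:

```python
def overlap(n_total: int, n_unique: int) -> float:
    # Overlap = (N - I) / I: redundant downloads per unique page.
    return (n_total - n_unique) / n_unique

def coverage(n_unique: int, n_universe: int) -> float:
    # Coverage = I / U: fraction of the required pages actually fetched.
    return n_unique / n_universe

def quality(actual: set, oracle: set) -> float:
    # Quality = |A_N intersect P_N| / |P_N|.
    return len(actual & oracle) / len(oracle)

print(overlap(100, 80))                                  # 0.25
print(coverage(80, 100))                                 # 0.8
print(quality(set("ABCDEFZXIJ"), set("ABCDEFGHIJ")))     # 0.8
```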
23. Evaluation Models (4)
24. Comparison of 3 Crawling Modes
Mode       | Coverage | Overlap | Quality | Communication
Firewall   | Bad      | Good    | Bad     | Good
Cross-over | Good     | Bad     | Bad     | Good
Exchange   | Good     | Good    | Good    | Bad
25. Dataset
- Stanford WebBase repository
  - Crawled in December 1999 over a period of 2 weeks
  - 40 million web pages
- Seed URLs
  - Open Directory (http://www.dmoz.org)
  - 1 million URLs
26. Firewall Mode and Coverage
27. Firewall Mode and Coverage (2)
28. Cross-over Mode and Overlap
29. Exchange Mode and Communication
30. Exchange Mode and Communication (2)
31. Quality and Batch Communication
(Charts for crawls of 500K pages (1.2% of the dataset), 2M pages (5%), and 8M pages (20%).)
32. Quality and Batch Communication (2)
(Charts for crawls of 500K pages (1.2% of the dataset), 2M pages (5%), and 8M pages (20%).)