Title: Parallel Crawlers
1. Parallel Crawlers
- Junghoo Cho, Hector Garcia-Molina
- Presented by Punnawat Tadapak
2. Introduction
- Parallel crawler challenges:
  - Overlap: minimize downloading the same web page multiple times
  - Quality: overall quality depends on the crawl strategy of each crawling process
  - Communication bandwidth: minimize the communication overhead among processes
3. Introduction (2)
- Advantages of a parallel crawler:
  - Scalability: needed because of the enormous size of the web
  - Network-load dispersion: multiple crawling processes run at geographically distant locations
  - Network-load reduction: downloaded pages are transferred over a local network
4. Architecture of a Parallel Crawler
(Diagram: two configurations, the intra-site parallel crawler and the distributed crawler; each C-proc (crawling process) maintains a queue of URLs to visit and contributes to the collected pages.)
5. Data Transfer Methods
- Compression: compress the data before sending it
- Difference: send only the difference from a previously transferred page
- Summarization: extract and send only the information needed for index construction
(A compression sketch follows below.)
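To make the compression option concrete, here is a minimal Python sketch, not from the paper; the `send` callable is a hypothetical transport to the central page collection:

```python
import zlib

def send_page_compressed(url: str, html: bytes, send) -> None:
    # Compress the page body before it crosses the network; the receiver
    # restores it with zlib.decompress(). Level 6 trades CPU for bandwidth.
    payload = zlib.compress(html, 6)
    send(url, payload)
```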
6. To Avoid Overlap
- Independent: no coordination among crawling processes
(Diagram: sites S1, S2, S3 crawled by independent processes C1, C2, C3; the same page e is downloaded repeatedly, illustrating overlap.)
7. To Avoid Overlap (2)
- Dynamic assignment: a central coordinator partitions the web and assigns each partition to a c-proc
(Diagram: Partition 1 (c-proc 1) holds www.cnn.com, www.cnn.com/a.htm, www.cnn.com/b.htm, www.cnn.com/c.htm; Partition 2 (c-proc 2) holds www.nytimes.com.)
8. To Avoid Overlap (3)
- Static assignment: the URLs are divided among c-procs without a central coordinator
- Firewall Mode
- Cross-over Mode
- Exchange Mode
(A sketch of how each mode handles an inter-partition link follows below.)
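A minimal sketch, not from the paper, of how the three modes treat a newly discovered link; `partition_of` is a site-hash function (md5 is used so every process computes the same value), and `frontier`/`peers` are hypothetical queue and channel objects:

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 2
MY_PARTITION = 0

def partition_of(url: str) -> int:
    # Site-hash partitioning: a stable hash of the host name, so all
    # c-procs agree on which partition owns a URL.
    site = urlparse(url).netloc
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % NUM_PARTITIONS

def handle_link(url: str, mode: str, frontier, peers, out_of_work: bool) -> None:
    p = partition_of(url)
    if p == MY_PARTITION:
        frontier.append(url)          # our own partition: always crawl it
    elif mode == "firewall":
        pass                          # discard inter-partition links
    elif mode == "cross-over":
        if out_of_work:
            frontier.append(url)      # follow it anyway; overlap becomes possible
    elif mode == "exchange":
        peers[p].send(url)            # forward to the responsible c-proc
```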
9. Firewall Mode
(Diagram: C1 crawls S1 and C2 crawls S2; inter-partition links, marked X, are discarded, so pages reachable only through them are never downloaded.)
10. Cross-over Mode
- A c-proc downloads pages in its own partition first, but follows inter-partition links once it runs out of its own URLs (full coverage, but overlap becomes possible)
11. Exchange Mode
- C-procs neither follow nor discard inter-partition links; they forward the URLs to the responsible peer (no overlap, full coverage, but communication overhead)
(Diagram: C1 on S1 and C2 on S2 exchange the inter-partition URLs they discover.)
12. URL Exchange Minimization
- Batch communication: rather than sending each URL immediately, a c-proc
  - collects inter-partition URLs until it has downloaded k pages
  - partitions the collected URLs
  - sends each batch to the appropriate c-proc
(A batching sketch follows below.)
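A minimal sketch of batch communication, reusing the hypothetical `partition_of` and `peers` objects from the mode sketch above; inter-partition URLs are buffered and flushed only after every k downloads:

```python
from collections import defaultdict

class BatchingCProc:
    def __init__(self, my_partition: int, peers: dict, k: int):
        self.my_partition = my_partition
        self.peers = peers                  # partition id -> send channel
        self.k = k                          # flush after k page downloads
        self.downloaded = 0
        self.batches = defaultdict(list)    # partition id -> buffered URLs

    def on_link(self, url: str) -> None:
        p = partition_of(url)
        if p != self.my_partition:
            self.batches[p].append(url)     # buffer instead of sending now

    def on_page_downloaded(self) -> None:
        self.downloaded += 1
        if self.downloaded % self.k == 0:
            self.flush()

    def flush(self) -> None:
        for p, urls in self.batches.items():
            self.peers[p].send(urls)        # one message per peer per batch
        self.batches.clear()
```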
13. URL Exchange Minimization (2)
(Diagram, k = 4: C1 downloads a, b, c, m and batches the inter-partition URLs g, j for C2; C2 downloads f, g, h, i and batches d for C1.)
14. URL Exchange Minimization (3)
- Replication
  - Reduce URL exchanges by replicating the most popular URLs at each c-proc and no longer transferring them.
  - Identify the k most popular URLs from the image of the web collected in a previous crawl (or on the fly) and replicate them.
  - Because link popularity is highly skewed, replicating even a small number of URLs significantly reduces URL exchanges.
(A replication sketch follows below.)
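A minimal sketch of the replication filter, assuming a `popular` set holding the k most-linked URLs from a previous crawl; it sits in front of the batching logic above:

```python
def on_link_with_replication(cproc: "BatchingCProc", url: str, popular: set) -> None:
    # Replicated URLs are already known to every c-proc, so links to them
    # never need to be exchanged; everything else is batched as before.
    if url in popular:
        return
    cproc.on_link(url)
```

Because a few popular pages attract most inter-partition links, this filter removes a large share of the exchange traffic.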
15. Partitioning Function
- URL-Hash Based
- Site-Hash Based
- Hierarchical
16. Partitioning Function (1)
- URL-Hash Based: partition by the hash value of the full URL
  - http://www.cnn.com/index.html -> 67hazf
  - http://www.cnn.com/index.asp -> dj281s
17. Partitioning Function (2)
- Site-Hash Based: partition by the hash value of the site name
  - http://www.cnn.com/index.html -> 67hazf
  - http://www.cnn.com/index.asp -> 67hazf
18. Partitioning Function (3)
- Hierarchical: partition the web hierarchically based on the URLs of pages
  - Example: by top-level domain (.com, .net, and others)
(A sketch of all three functions follows below.)
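A minimal sketch of the three partitioning functions; the digests on the slides (67hazf, dj281s) are illustrative only, and md5 is an assumption chosen because it is stable across processes:

```python
import hashlib
from urllib.parse import urlparse

def _stable_hash(text: str) -> int:
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def url_hash_partition(url: str, n: int) -> int:
    # Pages of one site can scatter across c-procs, creating many
    # inter-partition links.
    return _stable_hash(url) % n

def site_hash_partition(url: str, n: int) -> int:
    # All pages of one site map to the same c-proc, so most links
    # (which point within a site) stay intra-partition.
    return _stable_hash(urlparse(url).netloc) % n

def hierarchical_partition(url: str) -> int:
    # Partition by top-level domain: .com, .net, everything else.
    host = urlparse(url).netloc
    if host.endswith(".com"):
        return 0
    if host.endswith(".net"):
        return 1
    return 2
```

Site-hash keeps most links intra-partition because most links point to pages on the same site, which is why it is the natural default.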
19. Summary of the Options
20. Evaluation Models
- Overlap = (N - I) / I
  - N: total number of pages downloaded
  - I: number of unique pages downloaded
- Example
  - Total pages downloaded: N = 100
  - Unique pages downloaded: I = 80
  - Overlap = (100 - 80) / 80 = 0.25
21. Evaluation Models (2)
- Coverage = I / U
  - U: total number of pages the overall crawler has to download
  - I: number of unique pages downloaded
- Example
  - Pages the overall crawler has to download: U = 100
  - Unique pages downloaded: I = 80
  - Coverage = 80 / 100 = 0.8
22. Evaluation Models (3)
- Quality = |A_N ∩ P_N| / |P_N|
  - P_N: set of pages downloaded by an oracle crawler (the N most important pages)
  - A_N: set of pages downloaded by the actual crawler
- Example
  - P_N = {A, B, C, D, E, F, G, H, I, J}
  - A_N = {A, B, C, D, E, F, Z, X, I, J}
  - Quality = 8/10 = 0.8
(A sketch computing all three metrics follows below.)
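A minimal sketch computing the three metrics, checked against the numbers in the examples above:

```python
def overlap(n_total: int, n_unique: int) -> float:
    # Overlap = (N - I) / I: redundant downloads per unique page.
    return (n_total - n_unique) / n_unique

def coverage(n_unique: int, n_universe: int) -> float:
    # Coverage = I / U: fraction of the required pages actually fetched.
    return n_unique / n_universe

def quality(actual: set, oracle: set) -> float:
    # Quality = |A_N intersect P_N| / |P_N|.
    return len(actual & oracle) / len(oracle)

print(overlap(100, 80))                                  # 0.25
print(coverage(80, 100))                                 # 0.8
print(quality(set("ABCDEFZXIJ"), set("ABCDEFGHIJ")))     # 0.8
```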
23. Evaluation Models (4)
24. Comparison of 3 Crawling Modes
Mode       | Coverage | Overlap | Quality | Communication
Firewall   | Bad      | Good    | Bad     | Good
Cross-over | Good     | Bad     | Bad     | Good
Exchange   | Good     | Good    | Good    | Bad
25. Dataset
- Stanford WebBase repository
  - Crawled in December 1999 over a period of 2 weeks
  - 40 million web pages
- Seed URLs
  - Open Directory (http://www.dmoz.org)
  - 1 million URLs
26. Firewall Mode and Coverage
27. Firewall Mode and Coverage (2)
28. Cross-over Mode and Overlap
29. Exchange Mode and Communication
30. Exchange Mode and Communication (2)
31. Quality and Batch Communication
(Charts for crawls of 500K pages (1.2% of the dataset), 2M pages (5%), and 8M pages (20%).)
32. Quality and Batch Communication (2)
(Charts for crawls of 500K pages (1.2% of the dataset), 2M pages (5%), and 8M pages (20%).)