Parallel Crawlers, Junghoo Cho


1
Parallel Crawlers
Junghoo Cho and Hector Garcia-Molina
  • Presented By
  • Punnawat Tadapak

2
Introduction
  • Parallel crawler challenges
  • Overlap: minimize multiple downloads of the same
    web page
  • Quality: depends on the crawl strategy
  • Communication bandwidth: minimize the
    communication overhead

3
Introduction (2)
  • Advantages of a parallel crawler
  • Scalability: needed because of the enormous size
    of the web
  • Network-load dispersion: multiple crawling
    processes run at geographically distant locations
  • Network-load reduction: downloaded pages are
    transferred via the local network

4
Architecture of a Parallel Crawler
  • Intra-site parallel crawler
  • Distributed crawler
[Figure legend: collected pages; queues of URLs to visit; C-proc = crawling process]
5
Transfer data methods
  • Compression: compress the data before sending
    it
  • Difference: send only the difference between the
    previous image and the current one
  • Summarization: extract only the information
    necessary for index construction
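
The first two transfer methods can be illustrated with a minimal Python sketch. It assumes a page collection is a plain dict mapping URL to page text; the function names and the JSON-plus-zlib encoding are choices made for this example only and are not taken from the slides.

```python
import json
import zlib


def compress_pages(pages: dict) -> bytes:
    """Compression: serialize the collection and compress it before sending."""
    raw = json.dumps(pages, sort_keys=True).encode("utf-8")
    return zlib.compress(raw)


def decompress_pages(blob: bytes) -> dict:
    """Inverse of compress_pages, run on the receiving side."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def image_difference(previous: dict, current: dict) -> dict:
    """Difference: keep only pages that are new or changed since the last image."""
    return {url: body for url, body in current.items()
            if previous.get(url) != body}
```

Only the result of `image_difference` would need to be sent when most pages are unchanged between crawls; pages deleted since the previous image are not tracked in this sketch.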

6
To Avoid Overlap
  • Independent: no coordination

[Figure: independent crawling. Sites S1 (C1), S2 (C2), S3 (C3) with pages a to f; page e is labeled several times.]
7
To Avoid Overlap (2)
  • Dynamic assignment

[Figure: a central coordinator assigns partitions to C-procs.]
Partition 1 (C-proc 1): www.cnn.com, www.cnn.com/a.htm, www.cnn.com/b.htm, www.cnn.com/c.htm
Partition 2 (C-proc 2): www.nytimes.com
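
Dynamic assignment can be pictured as a coordinator that hands out partitions on request. The sketch below is illustrative only: the class name `Coordinator`, the method `request_partition`, and the choice of one site per partition are assumptions, not details from the slides.

```python
from collections import deque


class Coordinator:
    """Hypothetical central coordinator that assigns partitions to C-procs."""

    def __init__(self, partitions):
        self._pending = deque(partitions)  # partitions nobody has taken yet
        self.assigned = {}                 # partition -> C-proc id

    def request_partition(self, cproc_id):
        """Give the next unassigned partition to the requesting C-proc."""
        if not self._pending:
            return None                    # nothing left to crawl
        partition = self._pending.popleft()
        self.assigned[partition] = cproc_id
        return partition
```

Because every request goes through one object, no two C-procs receive the same partition, but the coordinator can become a bottleneck as the number of C-procs grows.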
8
To Avoid Overlap (3)
  • Static assignment: the URLs are divided without
    a central coordinator
  • Firewall Mode
  • Cross-over Mode
  • Exchange Mode
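
The three modes differ mainly in what a C-proc does with a link that points outside its own partition. The following Python sketch encodes that decision as understood from the parallel-crawler literature; the slides themselves do not spell out the rules, and the names `Mode` and `handle_link` and the returned action strings are made up for this example.

```python
from enum import Enum


class Mode(Enum):
    FIREWALL = "firewall"
    CROSS_OVER = "cross-over"
    EXCHANGE = "exchange"


def handle_link(mode, own_partition, link_partition, local_queue_empty=False):
    """Return the action a C-proc takes for a discovered link (illustrative)."""
    if link_partition == own_partition:
        return "follow"                    # intra-partition links are always followed
    if mode is Mode.FIREWALL:
        return "discard"                   # never leave the own partition
    if mode is Mode.CROSS_OVER:
        # Leave the partition only when no local URLs remain to visit.
        return "follow" if local_queue_empty else "defer"
    return "send"                          # exchange mode: hand the URL to its owner
```

Firewall mode therefore has no overlap but may miss pages, cross-over mode trades some overlap for coverage, and exchange mode avoids both at the cost of communication.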

9
Firewall Mode
[Figure: firewall mode. Sites S1 (C1) and S2 (C2) with pages a to i; several links are marked X.]
10
Cross-over Mode
11
Exchange Mode
[Figure: exchange mode. Sites S1 (C1) and S2 (C2) with pages a to i; label: "EXCHANGE URLs".]
12
URL Exchange Minimization
  • Batch communication
  • Collect inter-partition URLs while downloading k
    pages
  • Partition the collected URLs
  • Send each group to the appropriate C-proc
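
The batch steps above can be sketched in Python as follows. This is an illustration under stated assumptions: `BatchExchanger`, `partition_of`, and `send` are hypothetical names, and the caller is assumed to supply the partitioning function and the transport.

```python
from collections import defaultdict


class BatchExchanger:
    """Buffers inter-partition URLs and sends them after every k downloads."""

    def __init__(self, own_id, k, partition_of, send):
        self.own_id = own_id
        self.k = k
        self.partition_of = partition_of   # url -> id of the owning C-proc
        self.send = send                   # send(owner_id, list_of_urls)
        self._pending = defaultdict(set)   # owner id -> buffered URLs
        self._downloaded = 0

    def record_link(self, url):
        """Buffer a discovered URL if it belongs to another partition."""
        owner = self.partition_of(url)
        if owner != self.own_id:
            self._pending[owner].add(url)  # the set removes duplicates in a batch

    def page_downloaded(self):
        """Count a download and flush the batch once k pages are reached."""
        self._downloaded += 1
        if self._downloaded >= self.k:
            self.flush()

    def flush(self):
        for owner, urls in self._pending.items():
            self.send(owner, sorted(urls))
        self._pending.clear()
        self._downloaded = 0
```

A larger k means fewer messages, but other C-procs learn about their URLs later.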

13
URL Exchange Minimization (2)
[Figure: batch communication between S1 (C1) and S2 (C2) with k = 4, pages a to j and m. Labels: "Downloaded: f, g, h, i"; "Downloaded: a, b, c, m"; "Batch list: g, j"; "Batch list: d".]
14
URL Exchange Minimization (3)
  • Replication
  • Reduce URL exchanges by replicating the most
    popular URLs at each C-proc and no longer
    transferring them.
  • Identify the k most popular URLs based on the
    image of the web collected in a previous crawl
    (or on the fly) and replicate them.
  • This significantly reduces URL exchanges, even
    when only a small number of URLs are replicated.
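
A small Python sketch of the replication idea follows. It assumes popularity is measured by how often a URL appears as a link target in a previous crawl; the slides do not name the popularity measure, and both function names are hypothetical.

```python
from collections import Counter


def most_popular_urls(link_targets, k):
    """Pick the k URLs that occur most often as link targets (assumed measure)."""
    counts = Counter(link_targets)
    return {url for url, _ in counts.most_common(k)}


def urls_to_exchange(discovered, replicated):
    """Drop URLs that every C-proc already knows through replication."""
    return [url for url in discovered if url not in replicated]
```

Each C-proc would hold the same `replicated` set, so a link to one of those URLs never needs to be sent.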

15
Partitioning Function
  • URL-Hash Based
  • Site-Hash Based
  • Hierarchical

16
Partitioning Function
  • URL-hash based: the hash value of the URL
  • http://www.cnn.com/index.html -> 67hazf
  • http://www.cnn.com/index.asp -> dj281s

17
Partitioning Function (2)
  • Site-hash based: the hash value of the site name
  • http://www.cnn.com/index.html -> 67hazf
  • http://www.cnn.com/index.asp -> 67hazf
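
The contrast between the two hash-based functions can be shown in a few lines of Python. The sketch assumes MD5 as the hash and a fixed number of C-procs `n`; the slides do not specify either, and the hash strings shown on the slides (67hazf, dj281s) are illustrative values, not outputs of this code.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, n):
    """Map a string to one of n partitions using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def url_hash_partition(url, n):
    """URL-hash based: pages of one site may land in different partitions."""
    return _bucket(url, n)


def site_hash_partition(url, n):
    """Site-hash based: all pages of one site share a partition."""
    return _bucket(urlsplit(url).netloc.lower(), n)
```

With the site-hash function, links within a site never cross partitions, which keeps inter-partition links rare because most links on the web stay within a site.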

18
Partitioning Function (3)
  • Hierarchical: partition the web hierarchically
    based on the URLs of pages
  • Example: partition by domain name (.com, .net,
    and all others)
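
The domain-name example can be sketched as a partitioning function in Python. The three-way split into `.com`, `.net`, and `other` mirrors the slide; the function name and the use of the host-name suffix are assumptions for illustration.

```python
from urllib.parse import urlsplit


def hierarchical_partition(url, suffixes=(".com", ".net")):
    """Assign a URL to a partition by the top-level domain of its host."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix in suffixes:
        if host.endswith(suffix):
            return suffix
    return "other"                         # every remaining domain
```

Unlike the hash-based functions, this split can be very uneven, since some domains hold far more pages than others.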

19
Summary of the Options
20
Evaluation Models
  • Overlap = (N - I) / I
  • N: total number of pages downloaded
  • I: number of unique pages downloaded
  • Example
  • Total number of pages downloaded: 100
  • Number of unique pages downloaded: 80
  • Overlap: (100 - 80) / 80 = 0.25
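
The overlap example can be reproduced with a short Python function. It assumes the crawl log is a list of downloaded URLs in which a repeated URL means a repeated download; the function name is hypothetical.

```python
def overlap(downloaded_urls):
    """Overlap = (N - I) / I for a list of downloads that may contain repeats."""
    n = len(downloaded_urls)               # N: all downloads
    i = len(set(downloaded_urls))          # I: unique pages
    return (n - i) / i
```

For 100 downloads covering 80 unique pages this yields 0.25, matching the example.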

21
Evaluation Models (2)
  • Coverage = I / U
  • U: total number of pages that the overall
    crawler has to download
  • I: number of unique pages downloaded
  • Example
  • Total number of pages that the overall crawler
    has to download: 100
  • Number of unique pages downloaded: 80
  • Coverage: 80 / 100 = 0.8
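
Coverage can be computed the same way. This sketch assumes the set of pages the crawler has to download is known in advance, which is only realistic in an experiment on a fixed dataset; the function name is hypothetical.

```python
def coverage(downloaded_urls, target_urls):
    """Coverage = I / U: share of the target pages that were downloaded."""
    target = set(target_urls)              # U: pages the crawler has to download
    unique = set(downloaded_urls) & target # I: unique target pages downloaded
    return len(unique) / len(target)
```

With 80 of 100 target pages downloaded, the result is 0.8, as in the example.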

22
Evaluation Models (3)
  • Quality = |AN ∩ PN| / |PN|
  • PN: set of pages downloaded by the Oracle crawler
    (the most important N pages)
  • AN: set of N pages downloaded by the actual
    crawler (not necessarily the most important)
  • Example
  • PN = {A, B, C, D, E, F, G, H, I, J}
  • AN = {A, B, C, D, E, F, Z, X, I, J}
  • Quality: 8 / 10 = 0.8
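
The quality example is a set intersection, which a short Python function makes explicit. The function name is hypothetical, and the oracle set is assumed to be given.

```python
def quality(actual_pages, oracle_pages):
    """Share of the oracle's most important pages that the actual crawler got."""
    oracle = set(oracle_pages)
    return len(set(actual_pages) & oracle) / len(oracle)
```

For the sets in the example, the actual crawler misses G and H, so 8 of the 10 important pages are found and the quality is 0.8.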

23
Evaluation Models (4)
  • Communication Overhead

24
Comparison of 3 Crawling Modes
25
Dataset
  • Stanford WebBase repository
  • Collected in December 1999 over a period of 2
    weeks
  • 40 million web pages
  • Seed URLs
  • Open Directory (http://www.dmoz.org)
  • 1 million URLs

26
Firewall Mode and Coverage
27
Firewall Mode and Coverage (2)
28
Cross-over Mode and Overlap
29
Exchange Mode and Communication
30
Exchange Mode and Communication (2)
31
Quality and Batch Communication
500K pages (1.2%)
2M pages (5%)
8M pages (20%)
32
Quality and Batch Communication (2)
500K pages (1.2%)
2M pages (5%)
8M pages (20%)