Parallel Crawlers - PowerPoint PPT Presentation

1
Parallel Crawlers
  • Junghoo Cho
  • Hector Garcia-Molina
  • WWW2002

2
Introduction
  • Crawler: a program that downloads and stores web
    pages
  • Working mechanism: starts from an initial set of
    URLs, downloads the pages, adds the URLs found in
    each page to the URL queue, and repeats
  • The Web is huge, so a parallel crawler is needed
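The working mechanism above can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: the `TOY_WEB` dictionary, the `crawl` function, and the `fetch_links` callback are hypothetical names that stand in for real page downloads, and nothing here comes from the presentation itself.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs linked from that page.
TOY_WEB = {
    "a.com/": ["a.com/1", "b.com/"],
    "a.com/1": ["a.com/"],
    "b.com/": ["b.com/1", "a.com/1"],
    "b.com/1": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Download pages breadth-first, starting from the seed URLs."""
    queue = deque(seeds)      # the URL queue
    seen = set(seeds)         # URLs that were already queued once
    downloaded = []           # order in which pages were "stored"
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        downloaded.append(url)            # stands in for download + store
        for link in fetch_links(url):     # URLs found in the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded

order = crawl(["a.com/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A single process running this loop is bounded by one machine's bandwidth and memory, which is the motivation for splitting the work across several processes.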

3
Advantages of Parallel Crawler
  • Scalability
  • Network-load dispersion
  • Network-load reduction
    • Compression
    • Only send the difference
    • Only send the summarization

4
Challenge for Parallel Crawler
  • Overlap
    • Different processes may download the same page
      several times
  • Quality
    • Want to download the important pages
    • Processes don't have the whole image of the Web
  • Communication bandwidth
    • Processes may need to coordinate with each other
  • Other problems not discussed in this paper
    • When to revisit a page
    • Making sure a Web site is not flooded during a
      crawl
    • What to download

5
Architecture
  • Intra-site crawler
    • All processes run on the same local network and
      communicate through a high-speed interconnect
  • Distributed crawler
    • Processes run at geographically distant locations
      connected by the Internet

6
Coordination type
  • Independent
    • No coordination, no partition
    • Processes start from different URLs
  • Static
    • Partition the Web before crawling
  • Dynamic
    • Has a central coordinator
    • Inter-partition links are sent to the coordinator
      for later reallocation

7
Partition
  • URL-hash based
    • Hash each URL
  • Site-hash based
    • Hash the site name of the URL
  • Hierarchical
    • Based on the hierarchy of the URLs
    • E.g., .com, .net, and others
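The three partitioning schemes can be sketched as functions that map a URL to a process number. This is a minimal sketch under assumptions: the function names, the choice of SHA-256 as the hash, and the `tld_to_proc` mapping are all illustrative and are not taken from the presentation.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(text, n_procs):
    # A stable hash is used because Python's built-in hash() for strings
    # is salted per run, which would break a fixed partition.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_procs

def url_hash_partition(url, n_procs):
    """URL-hash based: hash the whole URL."""
    return _bucket(url, n_procs)

def site_hash_partition(url, n_procs):
    """Site-hash based: hash only the site name, so all pages of one
    site land in the same partition."""
    return _bucket(urlsplit(url).hostname or "", n_procs)

def hierarchical_partition(url, tld_to_proc, default_proc=0):
    """Hierarchical: assign by top-level domain (.com, .net, ...).
    tld_to_proc is a hypothetical, hand-written mapping."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld_to_proc.get(tld, default_proc)
```

With site-hash partitioning, links between pages of the same site never cross a partition boundary, which matters for the communication figures reported later in the talk.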

8
Crawler modes for static assignment
  • Firewall mode
    • Does not follow inter-partition links
  • Cross-over mode
    • Follows inter-partition links
  • Exchange mode
    • Does not follow inter-partition links
    • Transfers them to the appropriate partition
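The difference between the three modes comes down to what a process does with a link whose target belongs to another partition. The sketch below is illustrative only: `handle_link`, `owner_of`, `local_queue`, and `outbox` are hypothetical names, and cross-over mode is simplified to queue foreign links immediately, exactly as the bullet states.

```python
def handle_link(mode, link, my_proc, owner_of, local_queue, outbox):
    """Decide what to do with one extracted link.

    owner_of(link) returns the process that owns the link's partition.
    local_queue is this process's own URL queue (a list).
    outbox maps an owner process to the URLs waiting to be sent to it.
    """
    owner = owner_of(link)
    if owner == my_proc:
        local_queue.append(link)      # intra-partition: always followed
    elif mode == "firewall":
        pass                          # inter-partition link is dropped
    elif mode == "cross-over":
        local_queue.append(link)      # followed locally; may cause overlap
    elif mode == "exchange":
        outbox.setdefault(owner, []).append(link)  # handed to its owner
    else:
        raise ValueError(f"unknown mode: {mode}")
```

Firewall mode never produces overlap but can miss pages, cross-over mode can reach those pages at the cost of overlap, and exchange mode avoids both at the cost of communication.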

9
Exchange minimization
  • Batch communication
    • Send the inter-partition links in a batch
    • Purge the links after sending
  • Replication
    • Replicate the most popular URLs at each process
    • Reason: the number of incoming links to a page
      follows Zipf's law
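Both techniques can be combined in one small buffer that sits in front of the network. This is a hedged sketch: the `ExchangeBuffer` class, its `batch_size` trigger (counted in URLs), and the `sent_batches` list that stands in for real network sends are assumptions made for illustration, not details from the presentation.

```python
class ExchangeBuffer:
    """Collects inter-partition URLs and sends them in batches."""

    def __init__(self, batch_size, replicated_urls=()):
        self.batch_size = batch_size
        # Popular URLs assumed to be known to every process already.
        self.replicated = set(replicated_urls)
        self.pending = {}        # owner process -> set of URLs to send
        self.sent_batches = []   # (owner, urls); stands in for the network

    def add(self, owner, url):
        if url in self.replicated:
            return               # replication: no transfer needed
        self.pending.setdefault(owner, set()).add(url)
        if len(self.pending[owner]) >= self.batch_size:
            self.flush(owner)

    def flush(self, owner):
        # Purge the links after sending: nothing is remembered afterwards,
        # so a URL seen again in a later batch would be sent again.
        urls = self.pending.pop(owner, set())
        if urls:
            self.sent_batches.append((owner, sorted(urls)))
```

Because in-link counts follow Zipf's law, a short replicated list of very popular URLs can absorb a large share of the inter-partition links before they ever reach the buffer.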

10
Evaluation
  • Use crawled Web pages as test data
    • Seed URLs taken from the Open Directory (1 million)
    • 40 million pages
  • Evaluation metrics
    • Coverage
    • Overlap
    • Quality
    • Communication overhead
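Three of these metrics are simple ratios and can be sketched directly. The formulas below are assumptions stated from the usual definitions in the parallel-crawling literature, since the slides do not spell them out: overlap as (N - I) / I, with N the total number of downloads and I the number of unique pages, coverage as I / U, with U the number of pages the crawler was supposed to download, and communication overhead as URLs exchanged per downloaded page. The function names are hypothetical.

```python
def overlap(downloads_per_proc):
    """(N - I) / I: extra downloads relative to unique pages."""
    total = sum(len(d) for d in downloads_per_proc)        # N
    unique = len(set().union(*downloads_per_proc))         # I
    return (total - unique) / unique if unique else 0.0

def coverage(downloads_per_proc, pages_to_download):
    """I / U: fraction of the target pages that were downloaded."""
    unique = len(set().union(*downloads_per_proc))         # I
    return unique / pages_to_download                      # U given by caller

def communication_overhead(urls_exchanged, pages_downloaded):
    """Average number of URLs transferred per downloaded page."""
    return urls_exchanged / pages_downloaded

# Two processes; page "c" was downloaded by both.
downloads = [["a", "b", "c"], ["c", "d"]]
print(overlap(downloads), coverage(downloads, 8))
```

Quality is omitted here because it depends on an importance measure for pages, which the slides do not specify.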

11
Evaluation: firewall coverage
  • Good coverage for a few processes with a few seeds
  • Cannot download the entire Web
  • Doubling the Web size and doubling the number of
    processes gives the same coverage

12
Evaluation: cross-over overlap
  • Overlap is low in the beginning and better than in
    the independent model
  • Not a good choice for a crawler unless
    communication is forbidden

13
Evaluation: exchange communication
  • Communication overhead: average number of URLs
    transferred per page
  • A page has 10 out-links on average, and about 9
    of them point to pages in the same site
  • For site-hash based partition, the communication
    overhead is less than 1
  • The network bandwidth used for URL exchange is
    relatively small, but the impact of URL exchange
    on the overall system can be quite significant
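The "less than 1" figure follows from the link counts above with a short calculation. This is a back-of-the-envelope sketch that assumes sites are spread uniformly over n processes by the hash function; the symbols k, s, and n are introduced here for illustration, and the URL-hash line is added only as a contrast under the same uniformity assumption.

```latex
% k = average out-links per page, s = out-links that stay in the same site,
% n = number of crawling processes (all symbols are illustrative).
\begin{align*}
k &= 10, \qquad s = 9, \qquad k - s = 1 \text{ cross-site link per page} \\
\text{site-hash overhead} &\approx (k - s)\cdot\frac{n-1}{n}
  = \frac{n-1}{n} < 1 \\
\text{URL-hash overhead} &\approx k\cdot\frac{n-1}{n}
  = \frac{10\,(n-1)}{n}
\end{align*}
```

For example, with n = 4 this gives roughly 0.75 URLs per page under site-hash partitioning, compared with about 7.5 if every URL were hashed independently.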

14
Evaluation: exchange communication

15
Evaluation: batch exchange quality
  • Conclusions
    • Quality becomes worse as the number of
      processes increases
    • The difference in quality between firewall mode
      and a single process becomes less significant

16
Evaluation: batch exchange quality
  • Conclusions
    • Communication overhead does not increase linearly
      as the number of exchanges increases
    • Only a small number of exchanges is needed to
      achieve high quality

17
Questions
  • Can site-hash partitioning maintain a well-balanced
    workload among processes?
  • Is the partition function the best way to
    distinguish the different coordination types?
  • How should the downloaded links be dealt with?
  • Can we do something to improve the quality of the
    selected seeds?