1
Distributed Web Crawling over DHTs
  • Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
  • CS294-4

2
Search Today
[Diagram: the search pipeline, in which Crawl feeds Index and Index feeds Search]
3
What's Wrong?
  • Users have a limited search interface
  • Today's web is dynamic and growing
  • Timely re-crawls required.
  • Not feasible for all web sites.
  • Search engines control your search results
  • Decide which sites get crawled
  • 550 billion documents estimated in 2001
    (BrightPlanet)
  • Google indexes 3.3 billion documents.
  • Decide which sites get updated more frequently
  • May censor or skew result rankings.

Challenge: user-customizable searches that scale.
4
Our Solution: A Distributed Crawler
  • P2P users donate excess bandwidth and computation
    resources to crawl the web.
  • Organized using distributed hash tables (DHTs)
  • A DHT- and query-processor-agnostic crawler
  • Designed to work over any DHT
  • Crawls can be expressed as declarative recursive
    queries
  • Easy for users to customize.
  • Queries can be executed over PIER, a DHT-based
    relational P2P query processor

[Diagram: crawlees are web servers; crawlers are PIER nodes]
5
Potential
  • Infrastructure for crawl personalization
  • User-defined focused crawlers
  • Collaborative crawling/filtering (special
    interest groups)
  • Other possibilities
  • Bigger, better, faster web crawler
  • Enables new search and indexing technologies
  • P2P Web Search
  • Web archival and storage (with OceanStore)
  • Generalized crawler for querying distributed
    graph structures.
  • Monitor file-sharing networks, e.g. Gnutella.
  • P2P network maintenance
  • Routing information.
  • OceanStore meta-data.

6
Challenges that We Investigated
  • Scalability and Throughput
  • DHT communication overheads.
  • Balance network load on crawlers
  • Two components of network load: download and DHT
    bandwidth.
  • Network proximity: exploit the network locality of
    crawlers.
  • Limit download rates on web sites
  • Prevents denial-of-service attacks (a throttling
    sketch follows this list).
  • Main tradeoff: tension between coordination and
    communication
  • Balance load either on crawlers or on crawlees!
  • Exploit network proximity at the cost of
    communication.
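The per-site rate limit is the piece most easily shown in code. Below is a minimal sketch of per-hostname throttling, assuming one crawler acts as the control point for each host; the class name and interval are illustrative, not taken from the system.

```python
import time
from urllib.parse import urlparse

# Minimal per-hostname throttle (illustrative; the deck does not give
# the actual mechanism). Enforces a minimum interval between downloads
# from the same web server, preventing accidental denial of service.
class PerHostThrottle:
    def __init__(self, min_interval_s=1.0):
        self.min_interval_s = min_interval_s
        self.last_fetch = {}  # hostname -> time of last download

    def may_fetch(self, url):
        host = urlparse(url).hostname
        now = time.monotonic()
        if now - self.last_fetch.get(host, 0.0) < self.min_interval_s:
            return False      # too soon: requeue the URL for later
        self.last_fetch[host] = now
        return True
```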

7
Crawl as a Recursive Query
[Diagram: crawler pipeline. Seed URLs feed the input URL
queue of a crawler thread; the Downloader fetches each
page, the Extractor emits output links, and each
Link.destUrl is published as a new WebPage(url) tuple,
recursively driving the crawl. Work can also be passed
on via Redirect.]
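To make the recursive-query reading concrete, here is a sketch of the same fixpoint loop in plain Python; in the actual system this runs as a declarative query over PIER, and `download` / `extract_links` are stand-ins for the Downloader and Extractor boxes above.

```python
# Datalog-style reading of the slide's recursive rule (reconstructed):
#   WebPage(dest) :- WebPage(url), Link(url, dest)
def crawl(seed_urls, download, extract_links):
    seen = set(seed_urls)
    frontier = list(seed_urls)
    while frontier:                        # fixpoint of the recursive rule
        url = frontier.pop()
        page = download(url)               # Downloader
        for dest in extract_links(page):   # Extractor emits output links
            if dest not in seen:           # publish the new WebPage fact
                seen.add(dest)
                frontier.append(dest)
    return seen
```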
8
Crawl Distribution Strategies
  • Partition by URL
  • Ensures even distribution of crawler workload.
  • High DHT communication traffic.
  • Partition by Hostname
  • One crawler per hostname.
  • Creates a control point for per-server rate
    throttling.
  • May lead to uneven crawler load distribution
  • Single point of failure
  • Bad choice of crawler affects per-site crawl
    throughput.
  • Slight variation: X crawlers per hostname (both
    schemes are sketched after this list).
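A minimal sketch of how the two partitioning schemes could map a URL to a DHT key, assuming SHA-1 keys; the helper name and the salting used for the X-crawlers-per-hostname variant are illustrative, not from the system.

```python
import hashlib
from urllib.parse import urlparse

def dht_key(url, scheme="hostname", crawlers_per_host=1):
    # The DHT routes each key to the node responsible for it.
    if scheme == "url":
        data = url  # partition by URL: even spread, more DHT traffic
    else:
        data = urlparse(url).hostname  # one control point per server
        if crawlers_per_host > 1:
            # "X crawlers per hostname" variant: salt with a stable
            # hash of the URL so each host maps to X distinct keys.
            salt = int(hashlib.md5(url.encode()).hexdigest(), 16)
            data = f"{data}#{salt % crawlers_per_host}"
    return hashlib.sha1(data.encode()).hexdigest()
```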

9
Redirection
  • A simple technique that allows a crawler to
    redirect, or pass on, its assigned work to another
    crawler (and so on).
  • A second-chance distribution mechanism, orthogonal
    to the partitioning scheme.
  • Example: partition by hostname
  • Node responsible for google.com (red) dispatches
    work (by URL) to grey nodes
  • Load balancing benefits of partition by URL
  • Control benefits of partition by hostname
  • When? Policy-based
  • Crawler load (queue size)
  • Network proximity
  • Why not? Cost of redirection
  • Increased DHT control traffic
  • Hence, limit the number of redirections per URL
    (see the policy sketch below).

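A sketch of the policy check a crawler might run on each incoming URL, assuming queue length as the overload signal and a cap of one redirection (as in the experiments); the names and threshold are illustrative.

```python
import random

MAX_REDIRECTS = 1       # "one level redirection on overload"
QUEUE_THRESHOLD = 100   # hypothetical overload threshold

def handle(url, redirect_count, queue, peers, send_to):
    # Redirect only when overloaded and the per-URL cap allows it;
    # the cap bounds the extra DHT control traffic.
    if len(queue) > QUEUE_THRESHOLD and redirect_count < MAX_REDIRECTS:
        send_to(random.choice(peers), url, redirect_count + 1)
    else:
        queue.append(url)   # crawl it locally
```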
10
Experiments
  • Deployment
  • WebCrawler over PIER, Bamboo DHT, up to 80
    PlanetLab nodes
  • 3 Crawl Threads per crawler, 15 min crawl
    duration
  • Distribution (Partition) Schemes
  • URL
  • Hostname
  • Hostname with 8 crawlers per unique host
  • Hostname, one level redirection on overload.
  • Crawl Workload
  • Exhaustive crawl
  • Seed URL: http://www.google.com
  • 78,244 different web servers
  • Crawl of a fixed number of sites
  • Seed URL: http://www.google.com
  • 45 web servers within google.com
  • Crawl of a single site within
    http://groups.google.com

11
Crawl of Multiple Sites I
[Graph: CDF of per-crawler downloads (80 nodes)]
Partition by hostname shows poor balance (70% of crawlers
idle); throughput is better when more crawlers are busy.
[Graph: crawl throughput scale-up]
Hostname can exploit at most 45 crawlers (one per web
server). Redirect (the hybrid hostname/URL scheme) does
best.
12
Crawl of Multiple Sites II
[Graph: per-URL DHT overheads]
Redirect: the per-URL DHT overheads hit their maximum at
around 70 nodes. Redirection incurs higher overheads only
after the queue size exceeds a threshold. Hostname incurs
low overheads since the crawl only touches google.com,
which has many self-links.
13
Network Proximity
Sampled 5,100 crawl targets and measured ping times from
each of 80 PlanetLab hosts. Partition by hostname
approximates random assignment. Best-3 random is close
enough to Best-5 random (see the sketch below). Sanity
check: what if a single host crawls all targets?
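A sketch of the best-of-k random assignment measured above, assuming an rtt(crawler, target) ping probe is available; k=3 corresponds to the Best-3 random curve.

```python
import random

def assign_crawler(target, crawlers, rtt, k=3):
    # Probe k random candidates and pick the closest one; the slide's
    # data suggests k=3 already captures most of the proximity benefit.
    candidates = random.sample(crawlers, min(k, len(crawlers)))
    return min(candidates, key=lambda c: rtt(c, target))
```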
14
Summary of Schemes
15
Related Work
  • Herodotus, at MIT (Chord-based)
  • Partition by URL
  • Batching with ring-based forwarding.
  • Experimented on 4 local machines
  • Apoidea, at GaTech (Chord-based)
  • Partition by hostname.
  • Forwards crawl to DHT neighbor closest to
    website.
  • Experimented on 12 local machines.

16
Conclusion
  • Our main contributions
  • Propose a DHT- and query-processor-agnostic
    distributed crawler.
  • Express crawl as a query.
  • Permits user-customizable refinement of crawls
  • Discover important trade-offs in distributed
    crawling
  • Coordination comes with extra communication
    costs
  • Deployment and experimentation on PlanetLab.
  • Examine crawl distribution strategies under
    different workloads on live web sources
  • Measure the potential benefits of network
    proximity.

17
Backup slides
18
Existing Crawlers
  • Cluster-based crawlers
  • Google: a centralized dispatcher sends URLs to be
    crawled.
  • Hash-based parallel crawlers.
  • Focused Crawlers
  • BINGO!
  • Crawls the web given basic training set.
  • Peer-to-Peer
  • Grub: SETI@Home-style infrastructure.
  • 23,993 members.

19
Exhaustive Crawl
Partition by Hostname shows imbalance. Some
crawlers are over-utilized for downloads.
Little difference in throughput. Most crawler
threads are kept busy.
20
Single Site
URL is best, followed by redirect and hostname.
21
Future Work
  • Fault Tolerance
  • Security
  • Single-Node Throughput
  • Work-Sharing between Crawl Queries
  • Essential when users' crawls overlap.
  • Global Crawl Prioritization
  • A requirement of personalized crawls.
  • Online relevance feedback.
  • Deep web retrieval.