High Performance Crawling - PowerPoint PPT Presentation

Provided by: pb8

1
High Performance Crawling
  • MW Ch.2 Crawling the web
  • Mercator: A Scalable, Extensible Web Crawler (1999)
  • High-Performance Web Crawling (2001)

2
Why Crawling?
  • Information gathering
  • Info processing and service
  • Info mining
  • Info archiving
  • Information discovery
  • Web size? Statistics? Evolution?
  • Goal
  • Generalized web crawling
  • Fetch all the pages of the web

3
Just follow the links, right?

4
Basic Crawling Algorithm
  • Simple-Crawler(S0, D, E)
  • Q ← S0
  • while Q ≠ ∅
  • do u ← dequeue(Q)
  • d(u) ← fetch(u)
  • store(D, (d(u), u))
  • L ← parse(d(u))
  • for each v in L
  • do store(E, (u, v))
  • if (v ∉ D ∧ v ∉ Q)
  • then enqueue(Q, v)
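
The loop above can be sketched in Python roughly as follows. This is a minimal sketch, not the crawler from the cited papers: the `fetch` and `parse` callables and the in-memory `documents` and `edges` containers are assumptions made for illustration, standing in for D and E.

```python
from collections import deque


def simple_crawler(seeds, fetch, parse):
    """Breadth-first crawl starting from `seeds`.

    fetch(url) returns a document and parse(document) returns the URLs
    it links to; both are hypothetical callables supplied by the caller.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)                # Q
    queued = set(seeds)                 # fast "v in Q" test
    documents = {}                      # D: url -> document
    edges = []                          # E: (u, v) link pairs

    while queue:
        u = queue.popleft()
        queued.discard(u)
        doc = fetch(u)
        documents[u] = doc
        for v in parse(doc):
            edges.append((u, v))
            # Enqueue v only if it is neither stored nor already waiting.
            if v not in documents and v not in queued:
                queue.append(v)
                queued.add(v)
    return documents, edges
```

Keeping a separate `queued` set makes the membership test constant time; scanning the queue itself would be linear in its length.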

5
Let's think about it a little longer
  • The Web is huge
  • 1 billion pages per month ≈ 385 pages/sec
  • More than connecting to a host in fetch()
  • DNS and TCP connection/transfer overheads
  • Not enough memory to hold all data structures
  • Has a URL or page been seen? → time-space trade-off
  • The real world is not perfect
  • URL/HTML syntax errors, server traps
  • Server complaints, legal issues
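
The pages-per-second figure on this slide can be checked with a few lines of arithmetic. The 30-day month is an assumption; the slide does not say which month length it uses.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def pages_per_second(pages_per_month):
    """Sustained fetch rate needed to cover `pages_per_month` pages."""
    return pages_per_month / SECONDS_PER_MONTH


rate = pages_per_second(1_000_000_000)
print(f"{rate:.1f} pages/sec")  # prints "385.8 pages/sec"
```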

6
A high-performance crawler needs to be
  • Scalable
  • Parallel, distributed
  • Fast
  • Bottleneck? Network utilization
  • Polite
  • DoS, robots.txt
  • Robust
  • Traps, errors, crash recovery
  • Continuous
  • Batch or incremental
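
As an illustration of the "Polite" requirement, the sketch below checks robots.txt rules with Python's standard `urllib.robotparser` and spaces out requests to the same host. The user-agent name `example-crawler`, the `HostThrottle` class, and the one-second default interval are assumptions, not part of the slides.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class HostThrottle:
    """Allows at most one request per host every `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_hit = {}  # host -> time of the last permitted request

    def ready(self, host, now):
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; the caller should retry later
        self.last_hit[host] = now
        return True
```

A crawler loop would call `rules.can_fetch("example-crawler", url)` and `throttle.ready(host, time.monotonic())` before each download and requeue the URL when either check fails.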

7
DNS resolver
  • Fetch: http://ibook.ics.uci.edu/
  • Problem: synchronous and very slow!
  • Solution: a customized DNS component with ...

8
Custom client for DNS resolution
  • Tailored for concurrent handling of multiple
    outstanding requests
  • Allows issuing of many resolution requests
    together
  • Polling at a later time for completion of
    individual requests
  • Facilitates load distribution among many DNS
    servers.
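
The issue-now, poll-later pattern described above can be approximated with a thread pool wrapped around the blocking system resolver. This is only a stand-in for a purpose-built DNS client: it does not spread load across DNS servers, and the `ConcurrentResolver` class and its method names are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


class ConcurrentResolver:
    """Issue many lookups at once; poll each one for completion later."""

    def __init__(self, resolve=socket.gethostbyname, max_outstanding=32):
        self._resolve = resolve
        self._pool = ThreadPoolExecutor(max_workers=max_outstanding)
        self._pending = {}  # host -> Future

    def request(self, host):
        """Start resolving `host` and return immediately."""
        if host not in self._pending:
            self._pending[host] = self._pool.submit(self._resolve, host)

    def poll(self, host):
        """Return the address if resolution has finished, else None.

        A failed lookup re-raises its exception here.
        """
        future = self._pending.get(host)
        if future is None or not future.done():
            return None
        return future.result()

    def close(self):
        """Wait for outstanding lookups and release the worker threads."""
        self._pool.shutdown(wait=True)
```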

9
Caching server
  • With a large cache, persistent across DNS
    restarts
  • Residing largely in memory if possible.
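
A minimal sketch of a cache with these two properties, held in memory and persistent across restarts. The JSON file format and the `DnsCache` name are assumptions chosen for brevity; a production cache would also bound its size and expire stale entries.

```python
import json
import os


class DnsCache:
    """In-memory host -> address map that can be saved and reloaded."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            # Reload whatever the previous run saved.
            with open(path, encoding="utf-8") as f:
                self.entries = json.load(f)

    def get(self, host):
        return self.entries.get(host)

    def put(self, host, address):
        self.entries[host] = address

    def save(self):
        # Write to a temporary file first so a crash cannot leave a
        # half-written cache behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.entries, f)
        os.replace(tmp, self.path)
```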

10
Prefetching client
  • Steps
  • Parse a page that has just been fetched
  • extract host names from HREF targets
  • Make DNS resolution requests to the caching
    server
  • Usually implemented using UDP
  • User Datagram Protocol
  • connectionless, packet-based communication
    protocol
  • does not guarantee packet delivery
  • Does not wait for resolution to be completed.
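
The first two steps above (parse the page, extract host names from HREF targets) might look like this with Python's standard library. The function name `hosts_to_prefetch` is hypothetical, and only `<a href>` links are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class HrefHosts(HTMLParser):
    """Collects the host names that appear in <a href="..."> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                host = urlsplit(urljoin(self.base_url, value)).hostname
                if host:
                    self.hosts.add(host)


def hosts_to_prefetch(html, base_url):
    parser = HrefHosts(base_url)
    parser.feed(html)
    parser.close()
    return parser.hosts
```

Each returned host would then be handed to the resolver without waiting for the answer; the point is only to warm the DNS cache before the fetcher needs the address.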

11
Page fetching
  • Problem: network connection and transfer
    overheads
  • Solution: multiple concurrent fetches
  • Managing multiple concurrent connections
  • A single download may take several seconds
  • Open many socket connections to different HTTP
    servers simultaneously
  • Multi-CPU machines are of little help
  • Crawling performance is limited by network and disk
  • Two approaches
  • using multi-threading
  • using non-blocking sockets with event handlers

12
Multi-threading
  • logical threads
  • physical threads of control provided by the
    operating system (e.g. pthreads), or
  • concurrent processes
  • fixed number of threads allocated in advance
  • programming paradigm
  • create a client socket
  • connect the socket to the HTTP service on a
    server
  • Send the HTTP request header
  • read the socket (recv) until
  • no more characters are available
  • close the socket.
  • use blocking system calls
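
The programming paradigm above, sketched with blocking sockets and a fixed pool of threads. The helper names, the HTTP/1.0 request line, and the pool size of 8 are assumptions; error handling and response parsing are left out.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def build_request(host, path="/"):
    """Minimal HTTP/1.0 request; the server closes the socket when done."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def read_until_closed(sock):
    """Blocking recv loop: read until no more characters are available."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty read means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def fetch(host, path="/", port=80, timeout=10.0):
    # Create a client socket and connect it to the HTTP service.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))  # send the request header
        return read_until_closed(sock)           # socket closes on exit


def fetch_all(targets, num_threads=8):
    """Fetch (host, path) pairs using a fixed number of threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda t: fetch(*t), targets))
```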

13
Multi-threading Problems
  • performance penalty
  • mutual exclusion
  • concurrent access to data structures
  • slow disk seeks.
  • great deal of interleaved, random input-output on
    disk
  • Due to concurrent modification of document
    repository by multiple threads

14
Asynchronous I/O
  • non-blocking sockets
  • connect, send or recv call returns immediately
    without waiting for the network operation to
    complete.
  • poll the status of the network operation
    separately
  • select system call
  • lets application suspend until more data can be
    read from or written to the socket
  • timing out after a pre-specified deadline
  • Monitor polls several sockets at the same time
  • More efficient memory management
  • code that completes processing not interrupted by
    other completions
  • No need for locks and semaphores on the pool
  • only append complete pages to the log
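
A single-threaded sketch of the same read loop using non-blocking sockets. Python's `selectors` module wraps `select` and its platform equivalents. The function name `collect_pages` and the dict-of-sockets input are assumptions; connecting and sending requests are omitted so the polling logic stays visible.

```python
import selectors


def collect_pages(socks, timeout=5.0):
    """Read every socket in `socks` (name -> socket) to end of stream.

    Returns (name, bytes) pairs in completion order. Only complete pages
    are appended, and one thread does all the work, so no locks are needed.
    """
    sel = selectors.DefaultSelector()
    buffers = {}
    for name, sock in socks.items():
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=name)
        buffers[name] = bytearray()

    log = []
    while sel.get_map():                # sockets still registered
        events = sel.select(timeout)    # suspend until data or deadline
        if not events:
            break                       # timed out; stop waiting
        for key, _mask in events:
            try:
                chunk = key.fileobj.recv(4096)
            except BlockingIOError:
                continue                # spurious wake-up; poll again
            if chunk:
                buffers[key.data] += chunk
            else:
                # Peer closed the connection: the page is complete.
                sel.unregister(key.fileobj)
                key.fileobj.close()
                log.append((key.data, bytes(buffers.pop(key.data))))
    sel.close()
    return log
```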

15
More
  • How many connections are enough?
  • Probability that the CPU is busy
  • Network bandwidth utilization
  • How large is the context-switch overhead?
  • ms?
  • Differences between multi-threading and
    asynchronous I/O besides context switching?

16
IsUrlVisited?
  • Check each URL before enqueue(Q, v) to avoid
    downloading pages more than once.
  • Issue: time and space
  • Fingerprints of URLs
  • Hold the structure in memory?
  • Physical limitations of disk speed can reduce
    performance to 75 downloads per second.

17
Cache
  • Exploiting spatio-temporal locality of access
  • Popular url/host
  • Locality inside host
  • Implementation
  • Two-level hash function.
  • most significant bits (say, 24) derived by
    hashing the host name plus port
  • lower order bits (say, 40) derived by hashing the
    path
  • concatenated bits used as a key in a B-tree
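
A sketch of the two-level key with the bit widths from this slide. SHA-1 as the underlying hash is an assumption, and a sorted list searched with `bisect` stands in for the B-tree, since Python's standard library does not provide one.

```python
import bisect
import hashlib
from urllib.parse import urlsplit

HOST_BITS = 24  # high-order bits: hash of host name plus port
PATH_BITS = 40  # low-order bits: hash of the path


def _hash_bits(text, bits):
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


def url_key(url):
    """64-bit key; URLs on the same host and port share the top 24 bits."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_part = _hash_bits(f"{parts.hostname}:{port}", HOST_BITS)
    return (host_part << PATH_BITS) | _hash_bits(path, PATH_BITS)


class SeenUrls:
    """Sorted key list; a stand-in for the B-tree named on the slide."""

    def __init__(self):
        self._keys = []

    def add(self, url):
        """Return True if the URL is new, False if already seen."""
        key = url_key(url)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return False
        self._keys.insert(i, key)
        return True
```

Because the host hash occupies the high-order bits, keys for one host sort next to each other, which is what lets an ordered index exploit the locality inside a host.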

18
Mercator implementation

19
More
  • Probabilistic set-membership test
  • Bloom filter
  • Short fingerprints with collisions?
  • 4-byte fingerprint?
  • ???
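
For reference, a minimal Bloom filter sketch. The bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are assumptions; the slides do not specify a construction.

```python
import hashlib


class BloomFilter:
    """Set-membership test with no false negatives but possible false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A URL reported as absent has definitely not been added. A URL reported as present may be a false positive, in which case the crawler would wrongly skip a page it has not seen.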

20
Anatomy of a large-scale crawler

21
The End.