Mining the Web - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Mining the Web

1
Mining the Web
  • Crawling the Web

2
Schedule
  • Search engine requirements
  • Components overview
  • Specific modules: the crawler
  • Purpose
  • Implementation
  • Performance metrics

3
What does it do?
  • Processes users' queries
  • Find pages with related information
  • Return a list of resources
  • Is it really that simple?

4
What does it do?
  • Processes users' queries
  • How is a query represented?
  • Find pages with related information
  • Return a resources list
  • Is it really that simple?

5
What does it do?
  • Processes users' queries
  • Find pages with related information
  • How do we find pages?
  • Where in the web do we look?
  • How do we match query and documents?
  • Return a resources list
  • Is it really that simple?

6
What does it do?
  • Processes users' queries
  • Find pages with related information
  • Return a resources list
  • In what order?
  • How are the pages ranked?
  • Is it really that simple?

7
What does it do?
  • Processes users' queries
  • Find pages with related information
  • Return a resources list
  • Is it really that simple?
  • Limited resources
  • Time/quality tradeoff

8
Search Engine Structure
  • General Design
  • Crawling
  • Storage
  • Indexing
  • Ranking

9
Search Engine Structure
Page Repository
Indexer
Collection Analysis
Queries
Results
Ranking
Query Engine
Text
Structure
Utility
Crawl Control
Indexes
10
Is it an IR system?
  • The web is
  • Used by millions
  • Contains lots of information
  • Link based
  • Incoherent
  • Changes rapidly
  • Distributed
  • Traditional information retrieval was built with
    the exact opposite in mind

11
Web Dynamics
  • Size
  • 10 billion publicly indexable pages
  • 10 kB/page ⇒ 100 TB
  • Doubles every 18 months
  • Dynamics
  • 33% change weekly
  • 8% new pages every week
  • 25% new links every week

12
Weekly change
Fetterly, Manasse, Najork, Wiener 2003
13
Collecting all Web pages
  • For searching, for classifying, for mining, etc
  • Problems
  • No catalog of all accessible URLs on the Web
  • Volume, latency, duplication, dynamicity, etc.

14
The Crawler
  • A program that downloads and stores web pages
  • Starts off by placing an initial set of URLs, S0,
    in a queue, where all URLs to be retrieved are
    kept and prioritized.
  • From this queue, the crawler gets a URL (in some
    order), downloads the page, extracts any URLs in
    the downloaded page, and puts the new URLs in the
    queue.
  • This process is repeated until the crawler
    decides to stop.

15
Crawling Issues
  • How to crawl?
  • Quality: best pages first
  • Efficiency: avoid duplication (or near
    duplication)
  • Etiquette: robots.txt, server load concerns
  • How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we
    cover?
  • Relative coverage: how much do competitors have?
  • How often to crawl?
  • Freshness: how much has changed?
  • How much has really changed? (Why is this a
    different question?)

16
Before discussing crawling policies
  • Some implementation issues

17
HTML
  • HyperText Markup Language
  • Lets the author
  • specify layout and typeface
  • embed diagrams
  • create hyperlinks.
  • expressed as an anchor tag with an HREF attribute
  • HREF names another page using a Uniform Resource
    Locator (URL)
  • A URL consists of
  • a protocol field (HTTP)
  • a server hostname (www.cse.iitb.ac.in)
  • a file path (/, the 'root' of the published file
    system)

18
HTTP (HyperText Transfer Protocol)
  • Built on top of the Transport Control Protocol
    (TCP)
  • Steps (from the client end)
  • resolve the server host name to an Internet
    address (IP)
  • Use the Domain Name System (DNS)
  • DNS is a distributed database of name-to-IP
    mappings maintained at a set of known servers
  • contact the server using TCP
  • connect to default HTTP port (80) on the server.
  • Send the HTTP request header (e.g. GET)
  • Fetch the response header
  • MIME (Multipurpose Internet Mail Extensions)
  • A meta-data standard for email and Web content
    transfer
  • Fetch the HTML page
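These client-side steps can be traced with a short Python sketch (the host and path are illustrative; HTTP/1.1 with `Connection: close` is used so the read loop terminates):

```python
import socket

host, path = "www.cse.iitb.ac.in", "/"        # illustrative server and root path

ip = socket.gethostbyname(host)               # 1. resolve the host name via DNS
sock = socket.create_connection((ip, 80))     # 2. contact the server over TCP, port 80
request = (f"GET {path} HTTP/1.1\r\n"         # 3. send the HTTP request header
           f"Host: {host}\r\n"
           "Connection: close\r\n\r\n")
sock.sendall(request.encode("ascii"))

response = b""                                # 4. fetch response header and HTML body
while chunk := sock.recv(4096):
    response += chunk
sock.close()

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1"))           # includes the MIME Content-Type line
```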

19
Crawling procedure
  • Simple
  • Great deal of engineering goes into
    industry-strength crawlers
  • Industry crawlers crawl a substantial fraction of
    the Web
  • E.g. Google, Yahoo
  • No guarantee that all accessible Web pages will
    be located
  • The crawler may never halt:
  • pages will be added continually even as it is
    running.

20
Crawling overheads
  • Delays involved in
  • Resolving the host name in the URL to an IP
    address using DNS
  • Connecting a socket to the server and sending the
    request
  • Receiving the requested page in response
  • Solution: overlap the above delays by
  • fetching many pages at the same time

21
Anatomy of a crawler
  • Page fetching by (logical) threads
  • Starts with DNS resolution
  • Finishes when the entire page has been fetched
  • Each page
  • stored in compressed form to disk/tape
  • scanned for outlinks
  • Work pool of outlinks
  • maintain network utilization without overloading
    it
  • Dealt with by load manager
  • Continue till the crawler has collected a
    sufficient number of pages.

22
Typical anatomy of a large-scale crawler.
23
Large-scale crawlers: performance and reliability
considerations
  • Need to fetch many pages at same time
  • utilize the network bandwidth
  • single page fetch may involve several seconds of
    network latency
  • Highly concurrent and parallelized DNS lookups
  • Multi-processing or multi-threading impractical
    at low level
  • Use of asynchronous sockets
  • Explicit encoding of the state of a fetch context
    in a data structure
  • Polling socket to check for completion of network
    transfers
  • Care in URL extraction
  • Eliminating duplicates to reduce redundant
    fetches
  • Avoiding spider traps

24
DNS caching, pre-fetching and resolution
  • A customized DNS component with:
  • Custom client for address resolution
  • Caching server
  • Prefetching client

25
Custom client for address resolution
  • Tailored for concurrent handling of multiple
    outstanding requests
  • Allows issuing of many resolution requests
    together
  • polling at a later time for completion of
    individual requests
  • Facilitates load distribution among many DNS
    servers.

26
Caching server
  • With a large cache, persistent across DNS
    restarts
  • Residing largely in memory if possible.

27
Prefetching client
  • Steps
  • Parse a page that has just been fetched
  • extract host names from HREF targets
  • Make DNS resolution requests to the caching
    server
  • Usually implemented using UDP
  • User Datagram Protocol
  • connectionless, packet-based communication
    protocol
  • does not guarantee packet delivery
  • Does not wait for resolution to be completed.
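A rough sketch of the caching-plus-prefetching idea in Python, using a thread pool in place of a custom asynchronous UDP client (pool size and names are illustrative assumptions, not part of the original design):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

dns_cache = {}                                # host name -> IP, kept in memory
pool = ThreadPoolExecutor(max_workers=32)     # many outstanding lookups at once

def _lookup(host):
    try:
        dns_cache[host] = socket.gethostbyname(host)
    except socket.gaierror:
        dns_cache[host] = None                # negative caching

def prefetch(hosts):
    """Fire-and-forget resolution requests; do not wait for completion."""
    for host in hosts:
        if host not in dns_cache:
            pool.submit(_lookup, host)

def resolve(host):
    """Return a cached address, falling back to a blocking lookup."""
    if host not in dns_cache:
        _lookup(host)
    return dns_cache[host]
```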

28
Multiple concurrent fetches
  • Managing multiple concurrent connections
  • A single download may take several seconds
  • Open many socket connections to different HTTP
    servers simultaneously
  • Multi-CPU machines not useful
  • crawling performance limited by network and disk
  • Two approaches
  • using multi-threading
  • using non-blocking sockets with event handlers

29
Multi-threading
  • threads
  • physical thread of control provided by the
    operating system (E.g. pthreads) OR
  • concurrent processes
  • fixed number of threads allocated in advance
  • programming paradigm
  • create a client socket
  • connect the socket to the HTTP service on a
    server
  • Send the HTTP request header
  • read the socket (recv) until
  • no more characters are available
  • close the socket.
  • use blocking system calls
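A minimal sketch of this paradigm: a fixed number of worker threads, each using blocking socket calls (the worker count, port, and HTTP/1.0 request are illustrative):

```python
import queue
import socket
import threading

work = queue.Queue()                          # shared pool of (host, path) pairs

def fetch_worker():
    while True:
        host, path = work.get()               # blocks until work is available
        try:
            sock = socket.create_connection((host, 80), timeout=10)
            sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
            page = b""
            while chunk := sock.recv(4096):   # blocking recv until no more data
                page += chunk
            sock.close()
            # hand `page` to the parser / page repository here
        except OSError:
            pass                              # timeouts, refused connections, ...
        finally:
            work.task_done()

# fixed number of threads allocated in advance
for _ in range(8):
    threading.Thread(target=fetch_worker, daemon=True).start()
```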

30
Multi-threading Problems
  • performance penalty
  • mutual exclusion
  • concurrent access to data structures
  • slow disk seeks.
  • great deal of interleaved, random input-output on
    disk
  • Due to concurrent modification of document
    repository by multiple threads

31
Non-blocking sockets and event handlers
  • non-blocking sockets
  • connect, send or recv call returns immediately
    without waiting for the network operation to
    complete.
  • poll the status of the network operation
    separately
  • select system call
  • lets application suspend until more data can be
    read from or written to the socket
  • timing out after a pre-specified deadline
  • Monitor polls several sockets at the same time
  • More efficient memory management
  • code that completes processing not interrupted by
    other completions
  • No need for locks and semaphores on the pool
  • only append complete pages to the log
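A rough sketch of the same idea with Python's `selectors` module (a wrapper over `select`): each fetch's state lives in a small context object that is advanced whenever its socket is reported ready. The context field names are illustrative.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def start_fetch(host, path="/"):
    """Open a non-blocking connection and register it with the poller."""
    sock = socket.socket()
    sock.setblocking(False)
    sock.connect_ex((host, 80))               # returns immediately
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    ctx = {"host": host, "request": request, "response": b""}
    sel.register(sock, selectors.EVENT_WRITE, ctx)

def poll(timeout=1.0):
    """One polling step: advance whichever fetch contexts are ready."""
    for key, events in sel.select(timeout):
        sock, ctx = key.fileobj, key.data
        if events & selectors.EVENT_WRITE:    # connected: send the request
            sock.sendall(ctx["request"])
            sel.modify(sock, selectors.EVENT_READ, ctx)
        elif events & selectors.EVENT_READ:
            chunk = sock.recv(4096)
            if chunk:
                ctx["response"] += chunk
            else:                             # transfer complete
                sel.unregister(sock)
                sock.close()
                # append the completed page to the log here
```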

32
Link extraction and normalization
  • Goal: obtaining a canonical form of the URL
  • URL processing and filtering
  • Avoid multiple fetches of pages known by
    different URLs
  • many IP addresses
  • For load balancing on large sites
  • Mirrored contents/contents on same file system
  • Proxy pass
  • Mapping of different host names to a single IP
    address
  • need to publish many logical sites
  • Relative URLs
    need to be interpreted w.r.t. a base URL.

33
Canonical URL
  • Formed by
  • Using a standard string for the protocol
  • Canonicalizing the host name
  • Adding an explicit port number
  • Normalizing and cleaning up the path
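A minimal sketch of these four steps (normalization rules differ between crawlers; this is one plausible choice, not a canonical specification):

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                        # standard protocol string
    host = (parts.hostname or "").lower()                # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, "")   # explicit port number
    path = parts.path or "/"                             # clean up the path
    while "//" in path:
        path = path.replace("//", "/")                   # collapse duplicate slashes
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# canonical_url("HTTP://Example.COM//a//b") -> "http://example.com:80/a/b"
```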

34
Robot exclusion
  • Check
  • whether the server prohibits crawling a
    normalized URL
  • In the robots.txt file in the HTTP root directory
    of the server
  • specifies a list of path prefixes which crawlers
    should not attempt to fetch.
  • Meant for crawlers only
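Python's standard library includes a robots.txt parser, so a minimal check might look like the following (URL and user-agent string are illustrative; a real crawler would cache the parsed file per host):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="MyCrawler"):
    """Fetch <scheme>://<host>/robots.txt and test the URL against it."""
    parts = urlsplit(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                 # downloads and parses the file
    return rp.can_fetch(user_agent, url)

# allowed("http://example.com/private/page.html") -> True or False
```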

35
Eliminating already-visited URLs
  • Checking if a URL has already been fetched
  • Before adding a new URL to the work pool
  • Needs to be very quick.
  • Achieved by computing an MD5 hash of the URL
  • Exploiting spatio-temporal locality of access
  • Two-level hash function.
  • most significant bits (say, 24) derived by
    hashing the host name plus port
  • lower order bits (say, 40) derived by hashing the
    path
  • concatenated bits used as a key in a B-tree
  • qualifying URLs added to frontier of the crawl.
  • hash values added to B-tree.
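A sketch of the two-level key (24 host bits plus 40 path bits, as on the slide); an in-memory set stands in for the on-disk B-tree:

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """64-bit key: 24 bits hash the host name plus port, 40 bits hash the path."""
    parts = urlsplit(url)
    hostport = f"{parts.hostname}:{parts.port or 80}".encode()
    path = (parts.path or "/").encode()
    host_bits = int.from_bytes(hashlib.md5(hostport).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)
    return (host_bits << 40) | path_bits      # URLs on one server share a key prefix

visited = set()                               # stands in for the on-disk B-tree

def already_seen(url):
    """True if the URL was fetched before; otherwise record it as seen."""
    key = url_key(url)
    if key in visited:
        return True
    visited.add(key)
    return False
```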

36
Spider traps
  • Protecting from crashing on
  • Ill-formed HTML
  • E.g. page with 68 kB of null characters
  • Misleading sites
  • indefinite number of pages dynamically generated
    by CGI scripts
  • paths of arbitrary depth created using soft
    directory links and path remapping features in the
    HTTP server

37
Spider Traps Solutions
  • No automatic technique can be foolproof
  • Check for URL length
  • Guards
  • Preparing regular crawl statistics
  • Adding dominating sites to guard module
  • Disable crawling active content such as CGI form
    queries
  • Eliminate URLs with non-textual data types

38
Avoiding repeated expansion of links on duplicate
pages
  • Reduce redundancy in crawls
  • Duplicate detection
  • Mirrored Web pages and sites
  • Detecting exact duplicates
  • Checking against MD5 digests of stored URLs
  • Representing a relative link v (relative to
    aliases u1 and u2) as tuples (h(u1), v) and
    (h(u2), v)
  • Detecting near-duplicates
  • Even a single altered character will completely
    change the digest!
  • E.g. date of update/ name and email of the site
    administrator

39
Load monitor
  • Keeps track of various system statistics
  • Recent performance of the wide area network (WAN)
    connection
  • E.g. latency and bandwidth estimates.
  • Operator-provided/estimated upper bound on open
    sockets for a crawler
  • Current number of active sockets.

40
Thread manager
  • Responsible for
  • Choosing units of work from frontier
  • Scheduling the use of network resources
  • Distribution of these requests over multiple ISPs
    if appropriate.
  • Uses statistics from load monitor

41
Per-server work queues
  • Denial of service (DoS) attacks
  • limit the speed or frequency of responses to any
    fixed client IP address
  • Avoiding DoS
  • limit the number of active requests to a given
    server IP address at any time
  • maintain a queue of requests for each server
  • Use the HTTP/1.1 persistent socket capability.
  • Distribute attention relatively evenly between a
    large number of sites
  • Access locality vs. politeness dilemma
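A minimal sketch of per-server queues with a politeness delay (the delay value and data structures are illustrative; persistent HTTP/1.1 connections are left out):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

POLITENESS_DELAY = 2.0                        # seconds between hits on one host

queues = defaultdict(deque)                   # host -> pending URLs
next_allowed = defaultdict(float)             # host -> earliest next request time

def enqueue(url):
    queues[urlsplit(url).hostname].append(url)

def next_url():
    """Pick a URL from some host whose politeness delay has expired."""
    now = time.monotonic()
    for host, q in queues.items():
        if q and next_allowed[host] <= now:
            next_allowed[host] = now + POLITENESS_DELAY
            return q.popleft()
    return None                               # nothing is eligible right now
```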

42
Crawling Issues
  • How to crawl?
  • Quality: best pages first
  • Efficiency: avoid duplication (or near
    duplication)
  • Etiquette: robots.txt, server load concerns
  • How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we
    cover?
  • Relative coverage: how much do competitors have?
  • How often to crawl?
  • Freshness: how much has changed?
  • How much has really changed? (Why is this a
    different question?)

43
Crawl Order
  • Want best pages first
  • Potential quality measures
  • Final In-degree
  • Final PageRank
  • Crawl heuristics
  • Breadth First Search (BFS)
  • Partial Indegree
  • Partial PageRank
  • Random walk

44
Breadth-First Crawl
  • Basic idea
  • start at a set of known URLs
  • explore in concentric circles around these URLs

[Figure: concentric circles around the start pages:
start pages, distance-one pages, distance-two pages]
  • used by broad web search engines
  • balances load between servers

45
Web-wide crawl (328M pages) [Najo01]
BFS crawling brings in high quality pages early
in the crawl
46
Stanford WebBase (179K pages) [Cho98]
[Figure: overlap with the best x% of pages by in-degree,
as a function of the x% crawled, ordered by O(u)]
47
Queue of URLs to be fetched
  • What constraints dictate which queued URL is
    fetched next?
  • Politeness: don't hit a server too often, even
    from different threads of your spider
  • How far into a site you've crawled already
  • For most sites, stay within 5 levels of the URL
    hierarchy
  • Which URLs are most promising for building a
    high-quality corpus
  • This is a graph traversal problem
  • Given a directed graph you've partially visited,
    where do you visit next?

48
Where do we crawl next?
  • Complex scheduling optimization problem, subject
    to constraints
  • Plus operational constraints (e.g., keeping all
    machines load-balanced)
  • Scientific study limited to specific aspects
  • Which ones?
  • What do we measure?
  • What are the compromises in distributed crawling?

49
Page selection
  • Importance metric
  • Web crawler model
  • Crawler method for choosing page to download

50
Importance Metrics
  • Given a page P, define how good that page is
  • Several metric types
  • Interest driven
  • Popularity driven
  • Location driven
  • Combined

51
Interest Driven
  • Define a driving query Q
  • Find textual similarity between P and Q
  • Define a word vocabulary t1, ..., tn
  • Define a vector for P and for Q
  • Vp, Vq = <w1, ..., wn>
  • wi = 0 if ti does not appear in the document
  • wi = IDF(ti) = 1 / (number of pages containing ti)
  • Importance: IS(P) = Vp · Vq (cosine product)
  • Finding IDF requires going over the entire web
  • Estimate IDF from pages already visited to
    calculate IS
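A sketch of IS(P) using the slide's weighting, with IDF estimated from the pages visited so far (the tokenization is a simplification and not part of the original slides):

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def importance_is(page_text, query, visited_pages):
    """IS(P): similarity of the page and query vectors with estimated IDF weights."""
    df = Counter()                            # document frequency of each term
    for doc in visited_pages:
        df.update(set(tokenize(doc)))

    def vector(text):
        # w_i = IDF(t_i) = 1 / (number of pages containing t_i); 0 if t_i absent
        return {t: 1.0 / df[t] for t in set(tokenize(text)) if df[t]}

    vp, vq = vector(page_text), vector(query)
    dot = sum(w * vq[t] for t, w in vp.items() if t in vq)
    norm_p = sum(w * w for w in vp.values()) ** 0.5
    norm_q = sum(w * w for w in vq.values()) ** 0.5
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```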

52
Popularity Driven
  • How popular a page is
  • Backlink count
  • IB(P) = the number of pages containing a link to
    P
  • Estimate from previous crawls: IB'(P)
  • A more sophisticated metric, e.g. PageRank: IR(P)

53
Location Driven
  • IL(P): a function of the URL of P
  • Words appearing in the URL
  • Number of '/' characters in the URL
  • Easily evaluated, requires no data from previous
    crawls

54
Combined Metrics
  • IC(P): a function of several other metrics
  • Allows using local metrics for a first stage and
    estimated metrics for a second stage
  • IC(P) = a·IS(P) + b·IB(P) + c·IL(P)

55
Crawling Issues
  • How to crawl?
  • Quality: best pages first
  • Efficiency: avoid duplication (or near
    duplication)
  • Etiquette: robots.txt, server load concerns
  • How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we
    cover?
  • Relative coverage: how much do competitors have?
  • How often to crawl?
  • Freshness: how much has changed?
  • How much has really changed? (Why is this a
    different question?)

56
Crawler Models
  • A crawler
  • Tries to visit more important pages first
  • Only has estimates of importance metrics
  • Can only download a limited amount
  • How well does a crawler perform?
  • Crawl and Stop
  • Crawl and Stop with Threshold

57
Crawl and Stop
  • A crawler stops after visiting K pages
  • A perfect crawler
  • Visits pages with ranks R1, ..., Rk
  • These are called Top Pages
  • A real crawler
  • Visits only M < K top pages

58
Crawl and Stop with Threshold
  • A crawler stops after visiting T top pages
  • Top pages are pages with an importance metric
    higher than a threshold G
  • The crawler continues until T such pages have been
    visited

59
Ordering Metrics
  • The crawler's queue is prioritized according to an
    ordering metric
  • The ordering metric is based on an importance
    metric
  • Location metrics - directly
  • Popularity metrics - via estimates from previous
    crawls
  • Similarity metrics - via estimates from the anchor
    text

60
Focused Crawling (Chakrabarti)
  • Distributed federation of focused crawlers
  • Supervised topic classifier
  • Controls priority of unvisited frontier
  • Trained on document samples from Web directory
    (Dmoz)

61
Motivation
  • Let's relax the problem space
  • Focus on a restricted target space of Web pages
  • that may be of some type (e.g., homepages)
  • that may be of some topic (CS, quantum physics)
  • The focused crawling effort would
  • use much less resources,
  • be more timely,
  • be more qualified for indexing/searching
    purposes

62
Motivation
  • Goal Design and implement a focused Web crawler
    that would
  • gather only pages on a particular topic (or
    class)
  • use effective heuristics while choosing the next
    page to download

63
Focused crawling
  • "A focused crawler seeks and acquires ...
    pages on a specific set of topics representing a
    relatively narrow segment of the Web." (Soumen
    Chakrabarti)
  • The underlying paradigm is Best-First Search
    instead of Breadth-First Search

64
Breadth vs. Best First Search
65
Two fundamental questions
  • Q1: How to decide whether a downloaded page is
    on-topic or not?
  • Q2: How to choose the next page to visit?

66
Chakrabarti's crawler
  • Chakrabarti's focused crawler
  • A1: Determines the page relevance using a text
    classifier
  • A2: Adds URLs to a max-priority queue with their
    parent page's score and visits them in descending
    order!
  • What is original is the use of a text classifier!

67
Page relevance
  • Testing the classifier
  • User determines focus topics
  • Crawler calls the classifier and obtains a score
    for each downloaded page
  • Classifier returns a sorted list of classes and
    scores
  • (A: 80%, B: 10%, C: 7%, D: 1%, ...)
  • The classifier determines the page relevance!

68
Visit order
  • The radius-1 hypothesis If page u is an on-topic
    example and u links to v, then the probability
    that v is on-topic is higher than the probability
    that a randomly chosen Web page is on-topic.

69
Visit order case 1
  • Hard-focus crawling
  • If a downloaded page is off-topic, stops
    following hyperlinks from this page.
  • Assume the target is class B
  • And for page P, the classifier gives
  • A: 80%, B: 10%, C: 7%, D: 1%, ...
  • Do not follow P's links at all!

70
Visit order case 2
  • Soft-focus crawling
  • obtains a page's relevance score (a score on the
    page's relevance to the target topic)
  • assigns this score to every URL extracted from
    this particular page, and adds them to the priority
    queue
  • Example: A: 80%, B: 10%, C: 7%, D: 1%, ...
  • Insert P's links with score 0.10 into the priority
    queue (see the sketch below)
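A sketch of the soft-focus priority queue using Python's `heapq` (the classifier itself is assumed to exist elsewhere and to return a relevance score such as 0.10; the names here are illustrative):

```python
import heapq
from itertools import count

frontier = []                  # max-priority queue, emulated with negated scores
tiebreak = count()             # keeps heap ordering stable for equal scores

def add_links(parent_score, urls):
    """Soft focus: every extracted URL inherits its parent page's relevance score."""
    for url in urls:
        heapq.heappush(frontier, (-parent_score, next(tiebreak), url))

def next_url():
    return heapq.heappop(frontier)[2] if frontier else None

# page P scored 0.10 for the target class B:
add_links(0.10, ["http://example.com/a", "http://example.com/b"])
```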

71
Basic Focused Crawler
72
Comparisons
  • Start the baseline crawler from the URLs in one
    topic
  • Fetch up to 20,000-25,000 pages
  • For each pair of fetched pages (u,v), add an item
    to the training set of the apprentice
  • Train the apprentice
  • Start the enhanced crawler from the same set of
    pages
  • Fetch about the same number of pages

73
Results
74
Controversy
  • Chakrabarti claims the focused crawler is superior
    to breadth-first
  • Suel claims the contrary, and that the argument was
    based on experiments with poorly performing
    crawlers

75
Crawling Issues
  • How to crawl?
  • Quality: best pages first
  • Efficiency: avoid duplication (or near
    duplication)
  • Etiquette: robots.txt, server load concerns
  • How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we
    cover?
  • Relative coverage: how much do competitors have?
  • How often to crawl?
  • Freshness: how much has changed?
  • How much has really changed? (Why is this a
    different question?)

76
Determining page changes
  • Expires HTTP response header
  • For pages that come with an expiry date
  • Otherwise need to guess if revisiting that page
    will yield a modified version.
  • Score reflecting probability of page being
    modified
  • Crawler fetches URLs in decreasing order of
    score.
  • Assumption: the recent past predicts the future

77
Estimating page change rates
  • Brewington and Cybenko; Cho
  • Algorithms for maintaining a crawl in which most
    pages are fresher than a specified epoch.
  • Prerequisite
  • the average interval at which the crawler checks
    for changes is smaller than the inter-modification
    time of a page
  • Small scale intermediate crawler runs
  • to monitor fast changing sites
  • E.g. current news, weather, etc.
  • Intermediate indices patched into the master index

78
Refresh Strategy
  • Crawlers can refresh only a certain amount of
    pages in a period of time.
  • The page download resource can be allocated in
    many ways
  • The proportional refresh policy allocates the
    resource proportionally to the pages' change rates.

79
Average Change Interval
[Figure: fraction of pages vs. average change interval]
80
Change Interval By Domain
[Figure: fraction of pages vs. average change interval,
broken down by domain]
81
Modeling Web Evolution
  • Poisson process with rate λ
  • T is the time to the next event
  • fT(t) = λ e^(-λt)   (t > 0)

82
Change Interval
[Figure: fraction of changes with a given interval vs.
interval in days, for pages that change every 10 days
on average, compared against the Poisson model]
83
Change Metrics
  • Freshness
  • Freshness of element ei at time t is
    F(ei; t) = 1 if ei is up-to-date at time t,
               0 otherwise

84
Change Metrics
  • Age
  • Age of element ei at time t is
    A(ei; t) = 0 if ei is up-to-date at time t,
               t - (modification time of ei) otherwise
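Both metrics are easy to express directly from these definitions (a sketch; `last_modified` is the most recent change time of the live page up to t and `last_downloaded` is when our copy was fetched, both plain timestamps introduced here for illustration):

```python
def is_up_to_date(last_modified, last_downloaded):
    """Our copy is current if nothing changed since we downloaded it."""
    return last_downloaded >= last_modified

def freshness(last_modified, last_downloaded):
    """F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise."""
    return 1 if is_up_to_date(last_modified, last_downloaded) else 0

def age(last_modified, last_downloaded, t):
    """A(ei; t) = 0 if up-to-date, otherwise t minus the last modification time."""
    return 0 if is_up_to_date(last_modified, last_downloaded) else t - last_modified
```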

85
Example
  • The collection contains 2 pages
  • E1 changes 9 times a day
  • E2 changes once a day
  • Simplified change model
  • The day is split into 9 equal intervals, and E1
    changes once on each interval
  • E2 changes once during the entire day
  • The only unknown is when the pages change within
    the intervals
  • The crawler can download only one page a day.
  • Our goal is to maximize the freshness

86
Example (2)
87
Example (3)
  • Which page do we refresh?
  • If we refresh E2 at midday
  • If E2 changes in the first half of the day, and we
    refresh at midday, it remains fresh for the second
    half of the day.
  • 50% chance of a 0.5-day freshness increase
  • 50% chance of no increase
  • Expected freshness increase: 0.25 day
  • If we refresh E1 at midday
  • If E1 changes in the first half of its interval,
    and we refresh at midday (which is the middle of
    an interval), it remains fresh for the second half
    of the interval: 1/18 of a day.
  • 50% chance of a 1/18-day freshness increase
  • 50% chance of no increase
  • Expected freshness increase: 1/36 day
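The two expectations can be checked with one line of arithmetic each (a sketch of the reasoning, not part of the original slides):

```python
# E2 changes once somewhere in the day; a midday refresh helps only if the change
# fell in the first half, and then the copy stays fresh for half a day.
e2_gain = 0.5 * 0.5            # = 0.25 day of expected freshness

# E1 changes once per 1/9-day interval; a midday refresh helps only if the change
# fell in the first half of that interval, gaining half an interval.
e1_gain = 0.5 * (1 / 18)       # = 1/36 day of expected freshness

assert e2_gain > e1_gain       # refreshing the rarely changing page pays off more
```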

88
Example (4)
  • This gives a nice estimate
  • But things are more complex in real life
  • Not sure that a page will change within an
    interval
  • Have to worry about age
  • Using a Poisson model shows a uniform policy
    always performs better than a proportional one.

89
Example (5)
  • Studies have found the best policy for a similar
    example
  • Assume page changes follow a Poisson process
  • Assume 5 pages, which change 1, 2, 3, 4, 5 times a
    day

90
Distributed Crawling
91
Approaches
  • Centralized Parallel Crawler
  • Distributed
  • P2P

92
Distributed Crawlers
  • A distributed crawler consists of multiple
    crawling processes communicating via local
    network (intra-site distributed crawler) or
    Internet (distributed crawler)
  • http://www2002.org/CDROM/refereed/108/index.html
  • Setting: we have a number of c-procs
  • c-proc = crawling process
  • Goal: we wish to crawl the best pages with
    minimum overhead

93
Crawler-process distribution
[Figure: c-procs on the same local network form a
central parallel crawler; c-procs at geographically
distant locations form a distributed crawler]
94
Distributed model
  • Crawlers may be running in diverse geographic
    locations
  • Periodically update a master index
  • Incremental update so this is cheap
  • Compression, differential update etc.
  • Focus on communication overhead during the crawl

95
Issues and benefits
  • Issues
  • overlap minimization of multiple downloaded
    pages
  • quality depends on the crawl strategy
  • communication bandwidth minimization
  • Benefits
  • scalability for large-scale web-crawls
  • costs use of cheaper machines
  • network-load dispersion and reduction by
    dividing the web into regions and crawling only
    the nearest pages

96
Coordination
  • A parallel crawler consists of multiple crawling
    processes communicating via local network
    (intra-site parallel crawler) or Internet
    (distributed crawler)

97
Coordination
  • Independent
  • no coordination, every process follows its
    extracted links
  • Dynamic assignment
  • a central coordinator dynamically divides the web
    into small partitions and assigns each partition
    to a process
  • Static assignment
  • the Web is partitioned and assigned without a
    central coordinator before the crawl starts

98
c-procs crawling the web
[Figure: each c-proc holds URLs crawled and URLs in
queues; communication happens by URLs passed between
c-procs]
99
Static assignment
  • Links from one partition to another
    (inter-partition links) can be handled in one of
    three modes
  1. Firewall mode: a process does not follow any
    inter-partition link
  2. Cross-over mode: a process also follows
    inter-partition links and thereby discovers more
    pages in its partition
  3. Exchange mode: processes exchange
    inter-partition URLs; this mode needs communication

100
Classification of parallel crawlers
  • If exchange mode is used, communication can be
    limited by
  • Batch communication: every process collects some
    URLs and sends them in a batch
  • Replication: the k most popular URLs are
    replicated at each process and are not exchanged
    (determined from a previous crawl or on the fly)
  • Some ways to partition the Web
  • URL-hash based: many inter-partition links
  • Site-hash based: reduces the inter-partition
    links
  • Hierarchical: by domain, e.g. .com, .net

101
Static assignment comparison

             Coverage  Overlap  Quality  Communication
Firewall     Bad       Good     Bad      Good
Cross-over   Good      Bad      Bad      Good
Exchange     Good      Good     Good     Bad
102
UBI Crawler
  • 2002, Boldi, Codenotti, Santini, Vigna
  • Features
  • Full distribution: identical agents, no central
    coordinator
  • Balanced, locally computable assignment
  • each URL is assigned to one agent
  • each agent can compute the URL assignment
    locally
  • the distribution of URLs is balanced
  • Scalability
  • the number of crawled pages per second and per
    agent is independent of the number of agents
  • Fault tolerance
  • URLs are not statically distributed
  • a distributed reassignment protocol is not
    reasonable
103
UBI Crawler Assignment Function
  • A = set of agent identifiers
  • L = set of alive agents
  • m = total number of hosts
  • the assignment function δ assigns each host h to an
    alive agent in L
  • Requirements
  • Balance: each agent should be responsible for
    approximately the same number of hosts
  • Contravariance: if the number of agents grows,
    the portion of the web crawled by each agent must
    shrink

104
Consistent Hashing
  • Each bucket is replicated k times and each
    replica is mapped randomly on the unit circle
  • Hashing a key: compute a point on the unit circle
    and find the nearest replica
  • Example: L = {a,b}, L' = {a,b,c}, k = 3,
    hosts 0,1,...,9
  • Balancing: hash function and random number
    generator
  • δ^-1(a) = {4,5,6,8}, δ^-1(b) = {0,2,7},
    δ^-1(c) = {1,3,9} under L' = {a,b,c}
  • δ^-1(a) = {1,4,5,6,8,9}, δ^-1(b) = {0,2,3,7}
    under L = {a,b}
  • Contravariance: each agent's hosts under L' are a
    subset of its hosts under L
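A minimal sketch of consistent hashing with k replicas per agent (the hash choice, k, and the integer "circle" are illustrative assumptions; UbiCrawler's actual implementation differs in detail):

```python
import bisect
import hashlib

K = 3                                         # replicas per agent on the circle

def point(s):
    """Map a string to a point on the 'unit circle', here the range [0, 2**32)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(agents):
    """K replicas per agent, sorted by their position on the circle."""
    return sorted((point(f"{a}#{i}"), a) for a in agents for i in range(K))

def assign(ring, host):
    """delta(host): the agent owning the nearest replica clockwise from the host."""
    points = [p for p, _ in ring]
    idx = bisect.bisect_right(points, point(host)) % len(ring)
    return ring[idx][1]

ring_ab = build_ring(["a", "b"])
ring_abc = build_ring(["a", "b", "c"])
# Contravariance: adding agent "c" only moves hosts from "a"/"b" to "c", so the
# hosts assigned to "a" under {a,b,c} are a subset of those under {a,b}.
```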
105
UBI Crawler fault tolerance
  • Up to now no metrics for estimating the fault
    tolerance of distributed crawlers
  • Each agent has its own view of the set of alive
    agents (views can differ), but a host will never
    be dispatched to two different agents.
  • Agents can be added dynamically in a
    self-stabilizing way

106
Evaluation metrics (1)
  • Overlap
  • N = total number of fetched pages
  • I = number of distinct fetched pages
  • minimize the overlap
  • Coverage
  • U = total number of Web pages
  • maximize the coverage

107
Evaluation metrics (2)
  • Communication overhead
  • M = number of exchanged messages (URLs)
  • P = number of downloaded pages
  • minimize the overhead
  • Quality
  • maximize the quality
  • backlink count / oracle crawler
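A sketch of the three quantitative metrics as functions; the exact formulas (overlap = (N - I)/I, coverage = I/U, overhead = M/P) are reconstructed from the definitions above and should be treated as an assumption:

```python
def overlap(n_fetched, n_distinct):
    """(N - I) / I: redundant downloads per distinct page (minimize)."""
    return (n_fetched - n_distinct) / n_distinct

def coverage(n_distinct, n_web):
    """I / U: fraction of the Web actually downloaded (maximize)."""
    return n_distinct / n_web

def comm_overhead(n_messages, n_downloaded):
    """M / P: exchanged URLs per downloaded page (minimize)."""
    return n_messages / n_downloaded
```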

108
Experiments
  • 40M-URL graph from the Stanford WebBase
  • Open Directory (dmoz.org) URLs as seeds
  • Should be considered a small Web

109
Firewall mode coverage
  • The price of crawling in firewall mode

110
Crossover mode overlap
  • Demanding coverage drives up overlap

111
Exchange mode communication
  • Communication overhead is sublinear
    (measured per downloaded URL)
112
Cho's conclusion
  • < 4 crawling processes run in parallel: firewall
    mode provides good coverage
  • firewall mode is not appropriate when
  • > 4 crawling processes
  • downloading only a small subset of the Web and the
    quality of the downloaded pages is important
  • exchange mode
  • consumes < 1% of network bandwidth for URL
    exchanges
  • maximizes the quality of the downloaded pages
  • By replicating 10,000 - 100,000 popular URLs,
    communication overhead is reduced by 40%