Title: Mining the Web
1Mining the Web
2Schedule
- Search engine requirements
- Components overview
- Specific modules: the crawler
- Purpose
- Implementation
- Performance metrics
3What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
4What does it do?
- Processes user queries
  - How is a query represented?
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
5What does it do?
- Processes user queries
- Finds pages with relevant information
  - How do we find pages?
  - Where in the web do we look?
  - How do we match query and documents?
- Returns a list of resources
- Is it really that simple?
6What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
  - In what order?
  - How are the pages ranked?
- Is it really that simple?
7What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
  - Limited resources
  - Time/quality tradeoff
8Search Engine Structure
- General Design
- Crawling
- Storage
- Indexing
- Ranking
9Search Engine Structure
[Architecture diagram: Crawl Control drives the crawler(s); fetched pages go to the Page Repository; the Indexer and Collection Analysis modules build the Text, Structure, and Utility indexes; the Query Engine and Ranking module take Queries and produce Results.]
10Is it an IR system?
- The web is
- Used by millions
- Contains lots of information
- Link based
- Incoherent
- Changes rapidly
- Distributed
- Traditional information retrieval was built with
the exact opposite in mind
11Web Dynamics
- Size
  - ~10 billion publicly indexable pages
  - 10 KB/page → ~100 TB
  - doubles every 18 months
- Dynamics
  - 33% change weekly
  - 8% new pages every week
  - 25% new links every week
12Weekly change
Fetterly, Manasse, Najork, Wiener 2003
13Collecting all Web pages
- For searching, for classifying, for mining, etc.
- Problems
  - no catalog of all accessible URLs on the Web
  - volume, latency, duplication, dynamicity, etc.
14The Crawler
- A program that downloads and stores web pages
- Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.
- From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.
- This process is repeated until the crawler decides to stop.
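A minimal sketch of this loop (Python, standard library only; FIFO order and a fixed page budget are illustrative simplifications, not part of the slide):

```python
import collections
import urllib.parse, urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the HREF targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed_urls, max_pages=100):
    frontier = collections.deque(seed_urls)      # S0: URLs to be retrieved
    seen, pages = set(seed_urls), {}
    while frontier and len(pages) < max_pages:   # "until the crawler decides to stop"
        url = frontier.popleft()                 # get a URL (FIFO order here)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue
        pages[url] = html                        # download and store the page
        parser = LinkParser()
        parser.feed(html)                        # extract any URLs in the page
        for href in parser.links:
            new_url = urllib.parse.urljoin(url, href)
            if new_url not in seen:              # put new URLs in the queue
                seen.add(new_url)
                frontier.append(new_url)
    return pages
```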
15Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
16Before discussing crawling policies
- Some implementation issues
17HTML
- HyperText Markup Language
- Lets the author
  - specify layout and typeface
  - embed diagrams
  - create hyperlinks
- A hyperlink is expressed as an anchor tag with an HREF attribute
  - HREF names another page using a Uniform Resource Locator (URL)
- A URL consists of
  - a protocol field (HTTP)
  - a server hostname (www.cse.iitb.ac.in)
  - a file path (/, the 'root' of the published file system)
18HTTP (HyperText Transfer Protocol)
- Built on top of the Transmission Control Protocol (TCP)
- Steps (from the client end)
  - Resolve the server host name to an Internet address (IP)
    - uses the Domain Name System (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers
  - Contact the server using TCP
    - connect to the default HTTP port (80) on the server
  - Send the HTTP request header (e.g. GET)
  - Fetch the response header
    - MIME (Multipurpose Internet Mail Extensions): a metadata standard for email and Web content transfer
  - Fetch the HTML page
19Crawling procedure
- Conceptually simple
- A great deal of engineering goes into industry-strength crawlers
- Industry crawlers crawl a substantial fraction of the Web
  - e.g. Google, Yahoo
- No guarantee that all accessible Web pages will be located
- The crawler may never halt
  - pages are added continually even as it is running
20Crawling overheads
- Delays involved in
  - resolving the host name in the URL to an IP address using DNS
  - connecting a socket to the server and sending the request
  - receiving the requested page in response
- Solution: overlap the above delays by fetching many pages at the same time
21Anatomy of a crawler
- Page fetching by (logical) threads
  - starts with DNS resolution
  - finishes when the entire page has been fetched
- Each page
  - stored in compressed form to disk/tape
  - scanned for outlinks
- Work pool of outlinks
  - maintain network utilization without overloading it
  - dealt with by a load manager
- Continue till the crawler has collected a sufficient number of pages
22Typical anatomy of a large-scale crawler.
23Large-scale crawlers: performance and reliability considerations
- Need to fetch many pages at the same time
  - utilize the network bandwidth
  - a single page fetch may involve several seconds of network latency
- Highly concurrent and parallelized DNS lookups
- Multi-processing or multi-threading impractical at low level
  - use of asynchronous sockets
  - explicit encoding of the state of a fetch context in a data structure
  - polling sockets to check for completion of network transfers
- Care in URL extraction
  - eliminating duplicates to reduce redundant fetches
  - avoiding spider traps
24DNS caching, pre-fetching and resolution
- A customized DNS component with:
- Custom client for address resolution
- Caching server
- Prefetching client
25Custom client for address resolution
- Tailored for concurrent handling of multiple outstanding requests
- Allows issuing many resolution requests together
  - polling at a later time for completion of individual requests
- Facilitates load distribution among many DNS servers
26Caching server
- With a large cache, persistent across DNS restarts
- Residing largely in memory if possible
27Prefetching client
- Steps
  - parse a page that has just been fetched
  - extract host names from HREF targets
  - make DNS resolution requests to the caching server
- Usually implemented using UDP
  - User Datagram Protocol: a connectionless, packet-based communication protocol
  - does not guarantee packet delivery
- Does not wait for resolution to be completed
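A sketch of caching plus fire-and-forget prefetching (an assumed structure: real crawlers use a custom asynchronous DNS client, not the blocking `gethostbyname` wrapped in a thread pool as here):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    """In-memory DNS cache with prefetching that does not wait for completion."""
    def __init__(self, workers=16):
        self.cache = {}                       # host -> IP, kept in memory
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def _resolve(self, host):
        try:
            self.cache[host] = socket.gethostbyname(host)
        except socket.gaierror:
            self.cache[host] = None           # negative caching

    def prefetch(self, host):
        """Called for each host name extracted from HREF targets."""
        if host not in self.cache:
            self.pool.submit(self._resolve, host)   # returns immediately

    def lookup(self, host):
        """Blocking lookup; usually a cache hit if prefetch ran earlier."""
        if host not in self.cache:
            self._resolve(host)
        return self.cache[host]
```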
28Multiple concurrent fetches
- Managing multiple concurrent connections
  - a single download may take several seconds
  - open many socket connections to different HTTP servers simultaneously
- Multi-CPU machines not very useful
  - crawling performance is limited by network and disk
- Two approaches
  - multi-threading
  - non-blocking sockets with event handlers
29Multi-threading
- Threads
  - physical threads of control provided by the operating system (e.g. pthreads), OR
  - concurrent processes
- A fixed number of threads is allocated in advance
- Programming paradigm
  - create a client socket
  - connect the socket to the HTTP service on a server
  - send the HTTP request header
  - read the socket (recv) until no more characters are available
  - close the socket
- Uses blocking system calls
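The paradigm above, sketched with a fixed pool of threads making blocking calls (the queue contents and the eight-thread pool size are illustrative assumptions):

```python
import queue, socket, threading

frontier = queue.Queue()          # thread-safe work pool of hosts/URLs
NUM_THREADS = 8                   # fixed number of threads, allocated in advance

def fetch(host, path="/"):
    sock = socket.create_connection((host, 80))   # create + connect client socket
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    data = b""
    while True:
        chunk = sock.recv(4096)   # blocking recv until no more characters
        if not chunk:
            break
        data += chunk
    sock.close()
    return data

def worker():
    while True:
        host = frontier.get()     # blocks until work is available
        try:
            page = fetch(host)    # blocking system calls throughout
            # ... store page, extract outlinks, refill frontier ...
        except OSError:
            pass
        finally:
            frontier.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=worker, daemon=True).start()
```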
30Multi-threading: problems
- Performance penalty of mutual exclusion
  - concurrent access to shared data structures
- Slow disk seeks
  - a great deal of interleaved, random input-output on disk
  - due to concurrent modification of the document repository by multiple threads
31Non-blocking sockets and event handlers
- Non-blocking sockets
  - connect, send, or recv calls return immediately, without waiting for the network operation to complete
  - the status of the network operation is polled separately
- select system call
  - lets the application suspend until more data can be read from or written to a socket
  - times out after a pre-specified deadline
  - monitors several sockets at the same time
- More efficient memory management
- Code that completes processing of one page is not interrupted by other completions
  - no need for locks and semaphores on the work pool
  - only complete pages are appended to the log
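A sketch of the event-driven alternative using Python's `selectors` wrapper around `select` (assumed simplifications: one HTTP/1.0 GET per socket, tiny requests that fit the send buffer, no error handling):

```python
import selectors, socket

sel = selectors.DefaultSelector()

def start_fetch(host):
    sock = socket.socket()
    sock.setblocking(False)                 # connect/send/recv return immediately
    sock.connect_ex((host, 80))             # non-blocking connect (EINPROGRESS)
    # fetch context: state kept explicitly in a data structure, not a thread stack
    ctx = {"host": host, "response": b"",
           "request": f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()}
    sel.register(sock, selectors.EVENT_WRITE, ctx)

def event_loop(timeout=30):
    while sel.get_map():
        for key, events in sel.select(timeout):  # poll many sockets at once
            sock, ctx = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:   # connected: send the request
                sock.sendall(ctx["request"])
                sel.modify(sock, selectors.EVENT_READ, ctx)
            elif events & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    ctx["response"] += chunk
                else:                            # transfer complete
                    sel.unregister(sock)
                    sock.close()
                    # append the complete page to the log here; completions are
                    # handled one at a time, so no locks or semaphores are needed
```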
32Link extraction and normalization
- Goal: obtain a canonical form of each URL
- URL processing and filtering
  - avoid multiple fetches of pages known by different URLs
- Many IP addresses per host name
  - for load balancing on large sites
  - mirrored contents / contents on the same file system
  - proxy pass
- Mapping of different host names to a single IP address
  - need to publish many logical sites
- Relative URLs
  - need to be interpreted w.r.t. a base URL
33Canonical URL
- Formed by
  - using a standard string for the protocol
  - canonicalizing the host name
  - adding an explicit port number
  - normalizing and cleaning up the path
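A sketch of those four steps with `urllib.parse` (canonicalization rules vary between crawlers; these are illustrative choices):

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()                    # standard protocol string
    host = (parts.hostname or "").lower()            # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)  # explicit port number
    path = posixpath.normpath(parts.path or "/")     # clean up "." and ".." segments
    if path == ".":
        path = "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# e.g. canonicalize("HTTP://www.CSE.iitb.ac.in/a/./b/../c")
#      -> "http://www.cse.iitb.ac.in:80/a/c"
```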
34Robot exclusion
- Check whether the server prohibits crawling a normalized URL
  - in the robots.txt file in the HTTP root directory of the server
  - specifies a list of path prefixes which crawlers should not attempt to fetch
- Meant for crawlers only
35Eliminating already-visited URLs
- Checking if a URL has already been fetched
  - before adding a new URL to the work pool
  - needs to be very quick
- Achieved by computing an MD5 hash function on the URL
- Exploiting spatio-temporal locality of access: a two-level hash function
  - most significant bits (say, 24) derived by hashing the host name plus port
  - lower-order bits (say, 40) derived by hashing the path
  - concatenated bits used as a key in a B-tree
- Qualifying URLs are added to the frontier of the crawl
  - their hash values are added to the B-tree
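A sketch of the two-level key (an in-memory set stands in for the B-tree; the bit widths follow the slide):

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """24 bits from host+port, 40 bits from path: URLs from the same server
    get nearby keys, exploiting spatio-temporal locality of access."""
    parts = urlsplit(url)
    host = f"{parts.hostname}:{parts.port or 80}".encode()
    path = (parts.path or "/").encode()
    hi = int.from_bytes(hashlib.md5(host).digest()[:3], "big")   # 24 bits
    lo = int.from_bytes(hashlib.md5(path).digest()[:5], "big")   # 40 bits
    return (hi << 40) | lo            # concatenated 64-bit key

seen = set()                          # stands in for the B-tree of keys

def is_new(url):
    key = url_key(url)
    if key in seen:
        return False                  # already fetched or already queued
    seen.add(key)                     # qualifying URL: record key, add to frontier
    return True
```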
36Spider traps
- Protecting the crawler from crashing on
  - ill-formed HTML
    - e.g. a page with 68 kB of null characters
  - misleading sites
    - an indefinite number of pages dynamically generated by CGI scripts
    - paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server
37Spider traps: solutions
- No automatic technique can be foolproof
- Check for URL length
- Guards
  - preparing regular crawl statistics
  - adding dominating sites to a guard module
- Disable crawling of active content such as CGI form queries
- Eliminate URLs with non-textual data types
38Avoiding repeated expansion of links on duplicate pages
- Reduce redundancy in crawls
- Duplicate detection
  - mirrored Web pages and sites
- Detecting exact duplicates
  - checking against MD5 digests of stored URLs
  - representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
- Detecting near-duplicates
  - even a single altered character will completely change the digest!
  - e.g. date of update, name and email of the site administrator
39Load monitor
- Keeps track of various system statistics
  - recent performance of the wide area network (WAN) connection, e.g. latency and bandwidth estimates
  - an operator-provided/estimated upper bound on open sockets for the crawler
  - the current number of active sockets
40Thread manager
- Responsible for
  - choosing units of work from the frontier
  - scheduling the issue of network requests
  - distributing these requests over multiple ISPs, if appropriate
- Uses statistics from the load monitor
41Per-server work queues
- Servers defend against denial of service (DoS) attacks
  - they limit the speed or frequency of responses to any fixed client IP address
- Avoiding looking like a DoS attacker
  - limit the number of active requests to a given server IP address at any time
  - maintain a queue of requests for each server
  - use the HTTP/1.1 persistent socket capability
  - distribute attention relatively evenly between a large number of sites
- Access locality vs. politeness dilemma
42Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
43Crawl Order
- Want best pages first
- Potential quality measures
  - final in-degree
  - final PageRank
- Crawl heuristics
  - Breadth-First Search (BFS)
  - partial in-degree
  - partial PageRank
  - random walk
44Breadth-First Crawl
- Basic idea
  - start at a set of known URLs
  - explore in "concentric circles" around these URLs: start pages, then distance-one pages, then distance-two pages, ...
- Used by broad web search engines
- Balances load between servers
45Web Wide Crawl (328M pages) [Najork & Wiener 2001]
- BFS crawling brings in high-quality pages early in the crawl
46Stanford WebBase (179K pages) [Cho98]
[Chart: overlap with the best x pages by in-degree vs. x pages crawled, for different ordering metrics O(u)]
47Queue of URLs to be fetched
- What constraints dictate which queued URL is fetched next?
- Politeness: don't hit a server too often, even from different threads of your spider
- How far into a site you've crawled already
  - for most sites, stay within 5 levels of the URL hierarchy
- Which URLs are most promising for building a high-quality corpus?
- This is a graph traversal problem
  - given a directed graph you've partially visited, where do you visit next?
48Where do we crawl next?
- A complex scheduling optimization problem, subject to constraints
  - plus operational constraints (e.g., keeping all machines load-balanced)
- Scientific study limited to specific aspects
  - Which ones?
  - What do we measure?
  - What are the compromises in distributed crawling?
49Page selection
- Importance metric
- Web crawler model
- Crawler method for choosing page to download
50Importance Metrics
- Given a page P, define how good that page is
- Several metric types
- Interest driven
- Popularity driven
- Location driven
- Combined
51Interest Driven
- Define a driving query Q
- Find the textual similarity between P and Q
- Define a word vocabulary t1, ..., tn
- Define a vector for P and for Q
  - Vp, Vq = <w1, ..., wn>
  - wi = 0 if ti does not appear in the document
  - wi = IDF(ti) = 1 / (number of pages containing ti) otherwise
- Importance: IS(P) = Vp · Vq (cosine product)
- Finding IDF requires going over the entire web
  - estimate IDF from pages already visited, to calculate an estimated IS'(P)
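A sketch of IS'(P) with IDF estimated from already-visited pages (the regex tokenizer is a hypothetical stand-in; the weights follow the slide's definition, and the dot product is normalized as a cosine):

```python
import math, re
from collections import Counter

doc_freq = Counter()        # ti -> number of visited pages containing ti

def record_page(text):
    """Update the IDF estimate from a page already visited."""
    doc_freq.update(set(re.findall(r"\w+", text.lower())))

def vector(text):
    terms = set(re.findall(r"\w+", text.lower()))
    # wi = IDF(ti) ~ 1 / number of visited pages containing ti
    return {t: 1.0 / doc_freq[t] for t in terms if doc_freq[t] > 0}

def IS(page_text, query_text):
    vp, vq = vector(page_text), vector(query_text)
    dot = sum(w * vq[t] for t, w in vp.items() if t in vq)
    norm = (math.sqrt(sum(w * w for w in vp.values())) *
            math.sqrt(sum(w * w for w in vq.values())))
    return dot / norm if norm else 0.0      # cosine of Vp and Vq
```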
52Popularity Driven
- How popular a page is
- Backlink count
  - IB(P) = the number of pages containing a link to P
  - estimate from previous crawls: IB'(P)
- More sophisticated metrics, e.g. PageRank: IR(P)
53Location Driven
- IL(P): a function of the URL of P
  - words appearing in the URL
  - number of '/' characters in the URL
- Easily evaluated; requires no data from previous crawls
54Combined Metrics
- IC(P): a function of several other metrics
- Allows using local metrics for a first stage and estimated metrics for a second stage
- IC(P) = a·IS(P) + b·IB(P) + c·IL(P)
55Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
56Crawler Models
- A crawler
  - tries to visit more important pages first
  - only has estimates of importance metrics
  - can only download a limited number of pages
- How well does a crawler perform?
  - Crawl and Stop
  - Crawl and Stop with Threshold
57Crawl and Stop
- A crawler stops after visiting K pages
- A perfect crawler
  - visits pages with ranks R1, ..., Rk
  - these are called the top pages
- A real crawler
  - visits only M < K top pages
58Crawl and Stop with Threshold
- A crawler stops after visiting T top pages
  - top pages are pages with an importance metric higher than a threshold G
- The crawler continues until T top pages have been collected
59Ordering Metrics
- The crawler's queue is prioritized according to an ordering metric
- The ordering metric is based on an importance metric
  - location metrics: used directly
  - popularity metrics: via estimates from previous crawls
  - similarity metrics: via estimates from anchor text
60Focused Crawling (Chakrabarti)
- Distributed federation of focused crawlers
- Supervised topic classifier
- Controls priority of unvisited frontier
- Trained on document samples from a Web directory (Dmoz)
61Motivation
- Let's relax the problem space
- Focus on a restricted target space of Web pages
  - pages of some type (e.g., homepages)
  - pages on some topic (CS, quantum physics)
- The focused crawling effort would
  - use much fewer resources,
  - be more timely,
  - be better qualified for indexing/searching purposes
62Motivation
- Goal: design and implement a focused Web crawler that would
  - gather only pages on a particular topic (or class)
  - use effective heuristics while choosing the next page to download
63Focused crawling
- "A focused crawler seeks and acquires ... pages on a specific set of topics representing a relatively narrow segment of the Web." (Soumen Chakrabarti)
- The underlying paradigm is Best-First Search instead of Breadth-First Search
64Breadth vs. Best First Search
65Two fundamental questions
- Q1: How do we decide whether a downloaded page is on-topic or not?
- Q2: How do we choose the next page to visit?
66Chakrabarti's crawler
- Chakrabarti's focused crawler
  - A1: determines page relevance using a text classifier
  - A2: adds URLs to a max-priority queue with their parent page's score, and visits them in descending order
- What is original is the use of a text classifier!
67Page relevance
- Testing the classifier
  - the user determines the focus topics
  - the crawler calls the classifier and obtains a score for each downloaded page
  - the classifier returns a sorted list of classes and scores
    - e.g. (A: 80%, B: 10%, C: 7%, D: 1%, ...)
- The classifier determines the page relevance!
68Visit order
- The radius-1 hypothesis: if page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic
69Visit order case 1
- Hard-focus crawling
  - if a downloaded page is off-topic, stop following hyperlinks from this page
- Assume the target is class B, and for page P the classifier gives
  - (A: 80%, B: 10%, C: 7%, D: 1%, ...)
- Do not follow P's links at all!
70Visit order case 2
- Soft-focus crawling
  - obtains a page's relevance score (a score on the page's relevance to the target topic)
  - assigns this score to every URL extracted from this particular page, and adds them to the priority queue
- Example: (A: 80%, B: 10%, C: 7%, D: 1%, ...)
  - insert P's links with score 0.10 into the priority queue
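A sketch of the soft-focus queue with `heapq` (negated scores make Python's min-heap act as a max-priority queue; the keyword-fraction `classify` is a toy stand-in for Chakrabarti's text classifier, and `TOPIC_WORDS` is hypothetical):

```python
import heapq

TOPIC_WORDS = {"quantum", "physics"}   # hypothetical topic vocabulary

def classify(page_text):
    """Toy stand-in for the text classifier: fraction of on-topic words."""
    words = page_text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / max(len(words), 1)

pq = []   # max-priority queue via negated scores (heapq is a min-heap)

def enqueue_links(page_text, links):
    score = classify(page_text)            # e.g. 0.10 for a weakly relevant page
    for url in links:                      # soft focus: every extracted URL
        heapq.heappush(pq, (-score, url))  # inherits its parent page's score

def next_url():
    """Visit URLs in descending order of estimated relevance."""
    return heapq.heappop(pq)[1] if pq else None
```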
71Basic Focused Crawler
72Comparisons
- Start the baseline crawler from the URLs in one topic
- Fetch up to 20,000-25,000 pages
- For each pair of fetched pages (u, v), add an item to the training set of the apprentice
- Train the apprentice
- Start the enhanced crawler from the same set of pages
- Fetch about the same number of pages
73Results
74Controversy
- Chakrabarti claims the focused crawler is superior to breadth-first crawling
- Suel claims the contrary, and that the argument was based on experiments with poorly performing crawlers
75Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
76Determining page changes
- Expires HTTP response header
  - for pages that come with an expiry date
- Otherwise, need to guess whether revisiting the page will yield a modified version
  - maintain a score reflecting the probability that the page has been modified
  - the crawler fetches URLs in decreasing order of score
- Assumption: the recent past predicts the future
77Estimating page change rates
- Brewington and Cybenko; Cho
  - algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
- Prerequisite
  - the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page
- Small-scale intermediate crawler runs
  - to monitor fast-changing sites
  - e.g. current news, weather, etc.
  - intermediate indices patched into the master index
78Refresh Strategy
- Crawlers can refresh only a certain number of pages in a given period of time
- The page download resource can be allocated in many ways
- The proportional refresh policy allocates the resource proportionally to each page's change rate
79Average Change Interval
[Chart: fraction of pages vs. average change interval]
80Change Interval By Domain
[Chart: fraction of pages vs. average change interval, broken down by domain]
81Modeling Web Evolution
- Poisson process with rate λ
- T is the time to the next event (change)
- f_T(t) = λ e^{-λt}  (t > 0)
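Under this model, the probability that a page has changed within t days of the last visit is 1 − e^{−λt}; a small sketch (λ is the page's estimated change rate):

```python
import math, random

def prob_changed(lam, t):
    """P(at least one change in t days) for a Poisson process with rate lam."""
    return 1.0 - math.exp(-lam * t)

def sample_next_change(lam):
    """Draw a time-to-next-change T with density f_T(t) = lam * exp(-lam * t)."""
    return random.expovariate(lam)

# e.g. a page changing on average every 10 days (lam = 0.1):
# prob_changed(0.1, 7) ~ 0.50 -> about a 50% chance it changed since last week
```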
82Change Interval
[Chart: for pages that change every 10 days on average, the fraction of changes with a given interval (in days) closely follows the Poisson model]
83Change Metrics
- Freshness of element ei at time t:
  F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise
84Change Metrics
- Age of element ei at time t:
  A(ei; t) = 0 if ei is up-to-date at time t, t - (modification time of ei) otherwise
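The two metrics as functions (a sketch; the `last_modified` / `last_crawled` timestamps are assumed bookkeeping kept by the crawler):

```python
def freshness(last_modified, last_crawled):
    """F(e; t) = 1 if our copy is up-to-date, else 0."""
    return 1 if last_crawled >= last_modified else 0

def age(last_modified, last_crawled, now):
    """A(e; t) = 0 if up-to-date, else time elapsed since the page changed."""
    return 0.0 if last_crawled >= last_modified else now - last_modified

# Averaging freshness over all elements and over time gives the collection-level
# freshness that a refresh policy tries to maximize.
```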
85Example
- The collection contains 2 pages
  - E1 changes 9 times a day
  - E2 changes once a day
- Simplified change model
  - the day is split into 9 equal intervals, and E1 changes once in each interval
  - E2 changes once during the entire day
  - the only unknown is when the pages change within their intervals
- The crawler can download only one page per day
- Our goal is to maximize freshness
86Example (2)
87Example (3)
- Which page do we refresh?
- If we refresh E2 at midday
  - if E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day
  - 50% chance of a 0.5-day freshness increase
  - 50% chance of no increase
  - expected freshness increase: 0.25 day
- If we refresh E1 at midday
  - if E1 changes in the first half of its interval and we refresh at midday (the middle of an interval), it remains fresh for the remaining half of the interval = 1/18 of a day
  - 50% chance of a 1/18-day freshness increase
  - 50% chance of no increase
  - expected freshness increase: 1/36 day
88Example (4)
- This gives a nice estimation
- But things are more complex in real life
  - not sure that a page will change within an interval
  - have to worry about age
- Under a Poisson model, a uniform policy always performs better than a proportional one
89Example (5)
- Studies have found the best policy for a similar example
  - assume page changes follow a Poisson process
  - assume 5 pages, which change 1, 2, 3, 4, 5 times a day
90Distributed Crawling
91Approaches
- Centralized Parallel Crawler
- Distributed
- P2P
92Distributed Crawlers
- A distributed crawler consists of multiple crawling processes communicating via a local network (intra-site distributed crawler) or the Internet (distributed crawler)
  - http://www2002.org/CDROM/refereed/108/index.html
- Setting: we have a number of c-procs
  - c-proc = crawling process
- Goal: crawl the best pages with minimum overhead
93Crawler-process distribution
- Central parallel crawler: c-procs on the same local network
- Distributed crawler: c-procs at geographically distant locations
94Distributed model
- Crawlers may be running in diverse geographic locations
- Periodically update a master index
  - incremental update, so this is cheap
  - compression, differential update, etc.
- Focus on communication overhead during the crawl
95Issues and benefits
- Issues
  - overlap: minimizing multiple downloads of the same pages
  - quality: depends on the crawl strategy
  - communication bandwidth: minimization
- Benefits
  - scalability: necessary for large-scale web crawls
  - costs: use of cheaper machines
  - network-load dispersion and reduction: divide the web into regions and crawl only the nearest pages
96Coordination
- A parallel crawler consists of multiple crawling processes communicating via a local network (intra-site parallel crawler) or the Internet (distributed crawler)
97Coordination
- Independent
  - no coordination; every process follows its extracted links
- Dynamic assignment
  - a central coordinator dynamically divides the web into small partitions and assigns each partition to a process
- Static assignment
  - the web is partitioned and assigned without a central coordinator before the crawl starts
98c-procs crawling the web
[Diagram: c-procs crawling the web, each with its own URLs crawled and URLs in queues; communication consists of URLs passed between c-procs.]
99Static assignment
- Links from one partition to another (inter-partition links) can be handled in
  - Firewall mode: a process does not follow any inter-partition links
  - Cross-over mode: a process also follows inter-partition links, and so discovers more pages in its partition
  - Exchange mode: processes exchange inter-partition URLs; this mode needs communication
100Classification of parallel crawlers
- If exchange mode is used, communication can be limited by
  - batch communication: every process collects some URLs and sends them in a batch
  - replication: the k most popular URLs are replicated at each process and are not exchanged (known from a previous crawl or detected on the fly)
- Some ways to partition the Web
  - URL-hash based: many inter-partition links
  - site-hash based: reduces the inter-partition links
  - hierarchical: e.g. by domain (.com, .net, ...)
101Static assignment: comparison

Mode       | Coverage | Overlap | Quality | Communication
-----------|----------|---------|---------|--------------
Firewall   | Bad      | Good    | Bad     | Good
Cross-over | Good     | Bad     | Bad     | Good
Exchange   | Good     | Good    | Good    | Bad
102UbiCrawler
- 2002: Boldi, Codenotti, Santini, Vigna
- Features
  - full distribution: identical agents, no central coordinator
  - balanced, locally computable assignment
    - each URL is assigned to one agent
    - each agent can compute the URL assignment locally
    - the distribution of URLs is balanced
  - scalability
    - the number of crawled pages per second and per agent is independent of the number of agents
  - fault tolerance
    - URLs are not statically distributed
    - a distributed reassignment protocol is not reasonable
103UbiCrawler: assignment function
- A = set of agent identifiers
- L ⊆ A = set of alive agents
- m = total number of hosts
- δ assigns each host h to an alive agent in L
- Requirements
  - Balance: each agent should be responsible for approximately the same number of hosts
  - Contravariance: if the number of agents grows, the portion of the web crawled by each agent must shrink
104Consistent Hashing
- Each bucket is replicated k times, and each replica is mapped randomly onto the unit circle
- Hashing a key: compute a point on the unit circle and find the nearest replica
- Example: k = 3, hosts 0, 1, ..., 9
  - for L = {a, b, c}: δ_L^{-1}(a) = {4,5,6,8}, δ_L^{-1}(b) = {0,2,7}, δ_L^{-1}(c) = {1,3,9}
  - for L = {a, b}: δ_L^{-1}(a) = {1,4,5,6,8,9}, δ_L^{-1}(b) = {0,2,3,7}
  - removing agent c reassigns only c's hosts: contravariance
- Balancing relies on the hash function and random number generator
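A sketch of consistent hashing with k replicas per agent on the unit circle (using `bisect` over sorted replica points; the MD5-based point function is an illustrative assumption, not UbiCrawler's actual hash):

```python
import bisect, hashlib

K = 3  # replicas per agent (bucket)

def point(s):
    """Map a string to a pseudo-random point on the unit circle [0, 1)."""
    h = hashlib.md5(s.encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

class ConsistentHash:
    def __init__(self, agents):
        # each agent is replicated K times at pseudo-random circle positions
        self.ring = sorted((point(f"{a}#{i}"), a)
                           for a in agents for i in range(K))

    def assign(self, host):
        """delta(host): the nearest replica clockwise from the host's point."""
        p = point(host)
        idx = bisect.bisect(self.ring, (p,)) % len(self.ring)
        return self.ring[idx][1]

# Contravariance: dropping an agent only reassigns that agent's hosts, e.g.
#   before = ConsistentHash(["a", "b", "c"])
#   after  = ConsistentHash(["a", "b"])
# hosts previously assigned to "a" or "b" keep their assignment in `after`.
```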
105UbiCrawler: fault tolerance
- Up to now, there are no metrics for estimating the fault tolerance of distributed crawlers
- Each agent has its own view of the set of alive agents (views can differ), but two agents will never dispatch the same host to two different agents
- Agents can be added dynamically in a self-stabilizing way
106Evaluation metrics (1)
- Overlap = (N - I) / I
  - N = total number of fetched pages
  - I = number of distinct fetched pages
  - goal: minimize the overlap
- Coverage = I / U
  - U = total number of Web pages
  - goal: maximize the coverage
107Evaluation metrics (2)
- Communication overhead = M / P
  - M = number of exchanged messages (URLs)
  - P = number of downloaded pages
  - goal: minimize the overhead
- Quality
  - goal: maximize the quality
  - measured by backlink count, relative to an omniscient "oracle" crawler
108Experiments
- 40M-URL graph from the Stanford WebBase
- Open Directory (dmoz.org) URLs as seeds
- Should be considered a "small Web"
109Firewall mode coverage
- The price of crawling in firewall mode
110Crossover mode overlap
- Demanding coverage drives up overlap
111Exchange mode communication
- Communication overhead is sublinear per downloaded URL
112Cho's conclusions
- With < 4 crawling processes run in parallel, firewall mode provides good coverage
- Firewall mode is not appropriate when
  - > 4 crawling processes are used
  - we download only a small subset of the Web and the quality of the downloaded pages is important
- Exchange mode
  - consumes < 1% of network bandwidth for URL exchanges
  - maximizes the quality of the downloaded pages
- By replicating 10,000 - 100,000 popular URLs, communication overhead is reduced by 40%