Title: Mining the Web
1Mining the Web
2Schedule
- Search engine requirements
- Components overview
- Specific modules: the crawler
- Purpose
- Implementation
- Performance metrics
3What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
4What does it do?
- Processes user queries
  - How is a query represented?
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
5What does it do?
- Processes user queries
- Finds pages with relevant information
  - How do we find pages?
  - Where in the web do we look?
  - How do we match query and documents?
- Returns a list of resources
- Is it really that simple?
6What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
  - In what order?
  - How are the pages ranked?
- Is it really that simple?
7What does it do?
- Processes user queries
- Finds pages with relevant information
- Returns a list of resources
- Is it really that simple?
  - Limited resources
  - Time/quality tradeoff
8Search Engine Structure
- General Design
- Crawling
- Storage
- Indexing
- Ranking
9Search Engine Structure
[Architecture diagram: Crawl Control drives the crawler(s); fetched pages go to the Page Repository; the Indexer and Collection Analysis modules build the Text, Structure, and Utility indexes; the Query Engine and Ranking module take Queries and produce Results.]
10Is it an IR system?
- The web is
- Used by millions
- Contains lots of information
- Link based
- Incoherent
- Changes rapidly
- Distributed
- Traditional information retrieval was built with
the exact opposite in mind
11Web Dynamics
- Size
  - ~10 billion publicly indexable pages
  - 10 KB/page → ~100 TB
  - doubles every 18 months
- Dynamics
  - 33% change weekly
  - 8% new pages every week
  - 25% new links every week
12Weekly change
Fetterly, Manasse, Najork, Wiener 2003
13Collecting all Web pages
- For searching, for classifying, for mining, etc.
- Problems
  - no catalog of all accessible URLs on the Web
  - volume, latency, duplication, dynamicity, etc.
14The Crawler
- A program that downloads and stores web pages
- Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.
- From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.
- This process is repeated until the crawler decides to stop.
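A minimal sketch of this loop (Python, standard library only; FIFO order and a fixed page budget are illustrative simplifications, not part of the slide):

```python
import collections
import urllib.parse, urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the HREF targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed_urls, max_pages=100):
    frontier = collections.deque(seed_urls)      # S0: URLs to be retrieved
    seen, pages = set(seed_urls), {}
    while frontier and len(pages) < max_pages:   # "until the crawler decides to stop"
        url = frontier.popleft()                 # get a URL (FIFO order here)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue
        pages[url] = html                        # download and store the page
        parser = LinkParser()
        parser.feed(html)                        # extract any URLs in the page
        for href in parser.links:
            new_url = urllib.parse.urljoin(url, href)
            if new_url not in seen:              # put new URLs in the queue
                seen.add(new_url)
                frontier.append(new_url)
    return pages
```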
15Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
16Before discussing crawling policies
- Some implementation issues
17HTML
- HyperText Markup Language
- Lets the author
  - specify layout and typeface
  - embed diagrams
  - create hyperlinks
- A hyperlink is expressed as an anchor tag with an HREF attribute
  - HREF names another page using a Uniform Resource Locator (URL)
- A URL consists of
  - a protocol field (HTTP)
  - a server hostname (www.cse.iitb.ac.in)
  - a file path (/, the 'root' of the published file system)
18HTTP (HyperText Transfer Protocol)
- Built on top of the Transmission Control Protocol (TCP)
- Steps (from the client end)
  - Resolve the server host name to an Internet address (IP)
    - uses the Domain Name System (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers
  - Contact the server using TCP
    - connect to the default HTTP port (80) on the server
  - Send the HTTP request header (e.g. GET)
  - Fetch the response header
    - MIME (Multipurpose Internet Mail Extensions): a metadata standard for email and Web content transfer
  - Fetch the HTML page
19Crawling procedure
- Conceptually simple
- A great deal of engineering goes into industry-strength crawlers
- Industry crawlers crawl a substantial fraction of the Web
  - e.g. Google, Yahoo
- No guarantee that all accessible Web pages will be located
- The crawler may never halt
  - pages are added continually even as it is running
20Crawling overheads
- Delays involved in
  - resolving the host name in the URL to an IP address using DNS
  - connecting a socket to the server and sending the request
  - receiving the requested page in response
- Solution: overlap the above delays by fetching many pages at the same time
21Anatomy of a crawler
- Page fetching by (logical) threads
  - starts with DNS resolution
  - finishes when the entire page has been fetched
- Each page
  - stored in compressed form to disk/tape
  - scanned for outlinks
- Work pool of outlinks
  - maintain network utilization without overloading it
  - dealt with by a load manager
- Continue till the crawler has collected a sufficient number of pages
22Typical anatomy of a large-scale crawler.
23Large-scale crawlers: performance and reliability considerations
- Need to fetch many pages at the same time
  - utilize the network bandwidth
  - a single page fetch may involve several seconds of network latency
- Highly concurrent and parallelized DNS lookups
- Multi-processing or multi-threading impractical at low level
  - use of asynchronous sockets
  - explicit encoding of the state of a fetch context in a data structure
  - polling sockets to check for completion of network transfers
- Care in URL extraction
  - eliminating duplicates to reduce redundant fetches
  - avoiding spider traps
24DNS caching, pre-fetching and resolution
- A customized DNS component with:
- Custom client for address resolution
- Caching server
- Prefetching client
25Custom client for address resolution
- Tailored for concurrent handling of multiple outstanding requests
- Allows issuing many resolution requests together
  - polling at a later time for completion of individual requests
- Facilitates load distribution among many DNS servers
26Caching server
- With a large cache, persistent across DNS restarts
- Residing largely in memory if possible
27Prefetching client
- Steps
  - parse a page that has just been fetched
  - extract host names from HREF targets
  - make DNS resolution requests to the caching server
- Usually implemented using UDP
  - User Datagram Protocol: a connectionless, packet-based communication protocol
  - does not guarantee packet delivery
- Does not wait for resolution to be completed
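A sketch of caching plus fire-and-forget prefetching (an assumed structure: real crawlers use a custom asynchronous DNS client, not the blocking `gethostbyname` wrapped in a thread pool as here):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    """In-memory DNS cache with prefetching that does not wait for completion."""
    def __init__(self, workers=16):
        self.cache = {}                       # host -> IP, kept in memory
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def _resolve(self, host):
        try:
            self.cache[host] = socket.gethostbyname(host)
        except socket.gaierror:
            self.cache[host] = None           # negative caching

    def prefetch(self, host):
        """Called for each host name extracted from HREF targets."""
        if host not in self.cache:
            self.pool.submit(self._resolve, host)   # returns immediately

    def lookup(self, host):
        """Blocking lookup; usually a cache hit if prefetch ran earlier."""
        if host not in self.cache:
            self._resolve(host)
        return self.cache[host]
```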
28Multiple concurrent fetches
- Managing multiple concurrent connections
  - a single download may take several seconds
  - open many socket connections to different HTTP servers simultaneously
- Multi-CPU machines not very useful
  - crawling performance is limited by network and disk
- Two approaches
  - multi-threading
  - non-blocking sockets with event handlers
29Multi-threading
- Threads
  - physical threads of control provided by the operating system (e.g. pthreads), OR
  - concurrent processes
- A fixed number of threads is allocated in advance
- Programming paradigm
  - create a client socket
  - connect the socket to the HTTP service on a server
  - send the HTTP request header
  - read the socket (recv) until no more characters are available
  - close the socket
- Uses blocking system calls
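The paradigm above, sketched with a fixed pool of threads making blocking calls (the queue contents and the eight-thread pool size are illustrative assumptions):

```python
import queue, socket, threading

frontier = queue.Queue()          # thread-safe work pool of hosts/URLs
NUM_THREADS = 8                   # fixed number of threads, allocated in advance

def fetch(host, path="/"):
    sock = socket.create_connection((host, 80))   # create + connect client socket
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    data = b""
    while True:
        chunk = sock.recv(4096)   # blocking recv until no more characters
        if not chunk:
            break
        data += chunk
    sock.close()
    return data

def worker():
    while True:
        host = frontier.get()     # blocks until work is available
        try:
            page = fetch(host)    # blocking system calls throughout
            # ... store page, extract outlinks, refill frontier ...
        except OSError:
            pass
        finally:
            frontier.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=worker, daemon=True).start()
```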
30Multi-threading: problems
- Performance penalty of mutual exclusion
  - concurrent access to shared data structures
- Slow disk seeks
  - a great deal of interleaved, random input-output on disk
  - due to concurrent modification of the document repository by multiple threads
31Non-blocking sockets and event handlers
- Non-blocking sockets
  - connect, send, or recv calls return immediately, without waiting for the network operation to complete
  - the status of the network operation is polled separately
- select system call
  - lets the application suspend until more data can be read from or written to a socket
  - times out after a pre-specified deadline
  - monitors several sockets at the same time
- More efficient memory management
- Code that completes processing of one page is not interrupted by other completions
  - no need for locks and semaphores on the work pool
  - only complete pages are appended to the log
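A sketch of the event-driven alternative using Python's `selectors` wrapper around `select` (assumed simplifications: one HTTP/1.0 GET per socket, tiny requests that fit the send buffer, no error handling):

```python
import selectors, socket

sel = selectors.DefaultSelector()

def start_fetch(host):
    sock = socket.socket()
    sock.setblocking(False)                 # connect/send/recv return immediately
    sock.connect_ex((host, 80))             # non-blocking connect (EINPROGRESS)
    # fetch context: state kept explicitly in a data structure, not a thread stack
    ctx = {"host": host, "response": b"",
           "request": f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()}
    sel.register(sock, selectors.EVENT_WRITE, ctx)

def event_loop(timeout=30):
    while sel.get_map():
        for key, events in sel.select(timeout):  # poll many sockets at once
            sock, ctx = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:   # connected: send the request
                sock.sendall(ctx["request"])
                sel.modify(sock, selectors.EVENT_READ, ctx)
            elif events & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    ctx["response"] += chunk
                else:                            # transfer complete
                    sel.unregister(sock)
                    sock.close()
                    # append the complete page to the log here; completions are
                    # handled one at a time, so no locks or semaphores are needed
```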
32Link extraction and normalization
- Goal: obtain a canonical form of each URL
- URL processing and filtering
  - avoid multiple fetches of pages known by different URLs
- Many IP addresses per host name
  - for load balancing on large sites
  - mirrored contents / contents on the same file system
  - proxy pass
- Mapping of different host names to a single IP address
  - need to publish many logical sites
- Relative URLs
  - need to be interpreted w.r.t. a base URL
33Canonical URL
- Formed by
  - using a standard string for the protocol
  - canonicalizing the host name
  - adding an explicit port number
  - normalizing and cleaning up the path
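A sketch of those four steps with `urllib.parse` (canonicalization rules vary between crawlers; these are illustrative choices):

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()                    # standard protocol string
    host = (parts.hostname or "").lower()            # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)  # explicit port number
    path = posixpath.normpath(parts.path or "/")     # clean up "." and ".." segments
    if path == ".":
        path = "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# e.g. canonicalize("HTTP://www.CSE.iitb.ac.in/a/./b/../c")
#      -> "http://www.cse.iitb.ac.in:80/a/c"
```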
34Robot exclusion
- Check whether the server prohibits crawling a normalized URL
  - in the robots.txt file in the HTTP root directory of the server
  - specifies a list of path prefixes which crawlers should not attempt to fetch
- Meant for crawlers only
35Eliminating already-visited URLs
- Checking if a URL has already been fetched
  - before adding a new URL to the work pool
  - needs to be very quick
- Achieved by computing an MD5 hash function on the URL
- Exploiting spatio-temporal locality of access: a two-level hash function
  - most significant bits (say, 24) derived by hashing the host name plus port
  - lower-order bits (say, 40) derived by hashing the path
  - concatenated bits used as a key in a B-tree
- Qualifying URLs are added to the frontier of the crawl
  - their hash values are added to the B-tree
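A sketch of the two-level key (an in-memory set stands in for the B-tree; the bit widths follow the slide):

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """24 bits from host+port, 40 bits from path: URLs from the same server
    get nearby keys, exploiting spatio-temporal locality of access."""
    parts = urlsplit(url)
    host = f"{parts.hostname}:{parts.port or 80}".encode()
    path = (parts.path or "/").encode()
    hi = int.from_bytes(hashlib.md5(host).digest()[:3], "big")   # 24 bits
    lo = int.from_bytes(hashlib.md5(path).digest()[:5], "big")   # 40 bits
    return (hi << 40) | lo            # concatenated 64-bit key

seen = set()                          # stands in for the B-tree of keys

def is_new(url):
    key = url_key(url)
    if key in seen:
        return False                  # already fetched or already queued
    seen.add(key)                     # qualifying URL: record key, add to frontier
    return True
```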
36Spider traps
- Protecting the crawler from crashing on
  - ill-formed HTML
    - e.g. a page with 68 kB of null characters
  - misleading sites
    - an indefinite number of pages dynamically generated by CGI scripts
    - paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server
37Spider traps: solutions
- No automatic technique can be foolproof
- Check for URL length
- Guards
  - preparing regular crawl statistics
  - adding dominating sites to a guard module
- Disable crawling of active content such as CGI form queries
- Eliminate URLs with non-textual data types
38Avoiding repeated expansion of links on duplicate pages
- Reduce redundancy in crawls
- Duplicate detection
  - mirrored Web pages and sites
- Detecting exact duplicates
  - checking against MD5 digests of stored URLs
  - representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
- Detecting near-duplicates
  - even a single altered character will completely change the digest!
  - e.g. date of update, name and email of the site administrator
39Load monitor
- Keeps track of various system statistics
  - recent performance of the wide area network (WAN) connection, e.g. latency and bandwidth estimates
  - an operator-provided/estimated upper bound on open sockets for the crawler
  - the current number of active sockets
40Thread manager
- Responsible for
  - choosing units of work from the frontier
  - scheduling the issue of network requests
  - distributing these requests over multiple ISPs, if appropriate
- Uses statistics from the load monitor
41Per-server work queues
- Servers defend against denial of service (DoS) attacks
  - they limit the speed or frequency of responses to any fixed client IP address
- Avoiding looking like a DoS attacker
  - limit the number of active requests to a given server IP address at any time
  - maintain a queue of requests for each server
  - use the HTTP/1.1 persistent socket capability
  - distribute attention relatively evenly between a large number of sites
- Access locality vs. politeness dilemma
42Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
43Crawl Order
- Want best pages first
- Potential quality measures
  - final in-degree
  - final PageRank
- Crawl heuristics
  - Breadth-First Search (BFS)
  - partial in-degree
  - partial PageRank
  - random walk
44Breadth-First Crawl
- Basic idea
  - start at a set of known URLs
  - explore in "concentric circles" around these URLs: start pages, then distance-one pages, then distance-two pages, ...
- Used by broad web search engines
- Balances load between servers
45Web Wide Crawl (328M pages) [Najork & Wiener 2001]
- BFS crawling brings in high-quality pages early in the crawl
46Stanford WebBase (179K pages) [Cho98]
[Chart: overlap with the best x pages by in-degree vs. x pages crawled, for different ordering metrics O(u)]
47Queue of URLs to be fetched
- What constraints dictate which queued URL is fetched next?
- Politeness: don't hit a server too often, even from different threads of your spider
- How far into a site you've crawled already
  - for most sites, stay within 5 levels of the URL hierarchy
- Which URLs are most promising for building a high-quality corpus?
- This is a graph traversal problem
  - given a directed graph you've partially visited, where do you visit next?
48Where do we crawl next?
- A complex scheduling optimization problem, subject to constraints
  - plus operational constraints (e.g., keeping all machines load-balanced)
- Scientific study limited to specific aspects
  - Which ones?
  - What do we measure?
  - What are the compromises in distributed crawling?
49Page selection
- Importance metric
- Web crawler model
- Crawler method for choosing page to download
50Importance Metrics
- Given a page P, define how good that page is
- Several metric types
- Interest driven
- Popularity driven
- Location driven
- Combined
51Interest Driven
- Define a driving query Q
- Find the textual similarity between P and Q
- Define a word vocabulary t1, ..., tn
- Define a vector for P and for Q
  - Vp, Vq = <w1, ..., wn>
  - wi = 0 if ti does not appear in the document
  - wi = IDF(ti) = 1 / (number of pages containing ti) otherwise
- Importance: IS(P) = Vp · Vq (cosine product)
- Finding IDF requires going over the entire web
  - estimate IDF from pages already visited, to calculate an estimated IS'(P)
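A sketch of IS'(P) with IDF estimated from already-visited pages (the regex tokenizer is a hypothetical stand-in; the weights follow the slide's definition, and the dot product is normalized as a cosine):

```python
import math, re
from collections import Counter

doc_freq = Counter()        # ti -> number of visited pages containing ti

def record_page(text):
    """Update the IDF estimate from a page already visited."""
    doc_freq.update(set(re.findall(r"\w+", text.lower())))

def vector(text):
    terms = set(re.findall(r"\w+", text.lower()))
    # wi = IDF(ti) ~ 1 / number of visited pages containing ti
    return {t: 1.0 / doc_freq[t] for t in terms if doc_freq[t] > 0}

def IS(page_text, query_text):
    vp, vq = vector(page_text), vector(query_text)
    dot = sum(w * vq[t] for t, w in vp.items() if t in vq)
    norm = (math.sqrt(sum(w * w for w in vp.values())) *
            math.sqrt(sum(w * w for w in vq.values())))
    return dot / norm if norm else 0.0      # cosine of Vp and Vq
```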
52Popularity Driven
- How popular a page is
- Backlink count
  - IB(P) = the number of pages containing a link to P
  - estimate from previous crawls: IB'(P)
- More sophisticated metrics, e.g. PageRank: IR(P)
53Location Driven
- IL(P): a function of the URL of P
  - words appearing in the URL
  - number of '/' characters in the URL
- Easily evaluated; requires no data from previous crawls
54Combined Metrics
- IC(P): a function of several other metrics
- Allows using local metrics for a first stage and estimated metrics for a second stage
- IC(P) = a·IS(P) + b·IB(P) + c·IL(P)
55Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
56Crawler Models
- A crawler
  - tries to visit more important pages first
  - only has estimates of importance metrics
  - can only download a limited number of pages
- How well does a crawler perform?
  - Crawl and Stop
  - Crawl and Stop with Threshold
57Crawl and Stop
- A crawler stops after visiting K pages
- A perfect crawler
  - visits pages with ranks R1, ..., Rk
  - these are called the top pages
- A real crawler
  - visits only M < K top pages
58Crawl and Stop with Threshold
- A crawler stops after visiting T top pages
  - top pages are pages with an importance metric higher than a threshold G
- The crawler continues until T top pages have been collected
59Ordering Metrics
- The crawler's queue is prioritized according to an ordering metric
- The ordering metric is based on an importance metric
  - location metrics: used directly
  - popularity metrics: via estimates from previous crawls
  - similarity metrics: via estimates from anchor text
60Focused Crawling (Chakrabarti)
- Distributed federation of focused crawlers
- Supervised topic classifier
- Controls priority of unvisited frontier
- Trained on document samples from a Web directory (Dmoz)
61Motivation
- Let's relax the problem space
- Focus on a restricted target space of Web pages
  - pages of some type (e.g., homepages)
  - pages on some topic (CS, quantum physics)
- The focused crawling effort would
  - use much fewer resources,
  - be more timely,
  - be better qualified for indexing/searching purposes
62Motivation
- Goal: design and implement a focused Web crawler that would
  - gather only pages on a particular topic (or class)
  - use effective heuristics while choosing the next page to download
63Focused crawling
- "A focused crawler seeks and acquires ... pages on a specific set of topics representing a relatively narrow segment of the Web." (Soumen Chakrabarti)
- The underlying paradigm is Best-First Search instead of Breadth-First Search
64Breadth vs. Best First Search
65Two fundamental questions
- Q1: How do we decide whether a downloaded page is on-topic or not?
- Q2: How do we choose the next page to visit?
66Chakrabarti's crawler
- Chakrabarti's focused crawler
  - A1: determines page relevance using a text classifier
  - A2: adds URLs to a max-priority queue with their parent page's score, and visits them in descending order
- What is original is the use of a text classifier!
67Page relevance
- Testing the classifier
  - the user determines the focus topics
  - the crawler calls the classifier and obtains a score for each downloaded page
  - the classifier returns a sorted list of classes and scores
    - e.g. (A: 80%, B: 10%, C: 7%, D: 1%, ...)
- The classifier determines the page relevance!
68Visit order
- The radius-1 hypothesis: if page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic
69Visit order case 1
- Hard-focus crawling
  - if a downloaded page is off-topic, stop following hyperlinks from this page
- Assume the target is class B, and for page P the classifier gives
  - (A: 80%, B: 10%, C: 7%, D: 1%, ...)
- Do not follow P's links at all!
70Visit order case 2
- Soft-focus crawling
  - obtains a page's relevance score (a score on the page's relevance to the target topic)
  - assigns this score to every URL extracted from this particular page, and adds them to the priority queue
- Example: (A: 80%, B: 10%, C: 7%, D: 1%, ...)
  - insert P's links with score 0.10 into the priority queue
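A sketch of the soft-focus queue with `heapq` (negated scores make Python's min-heap act as a max-priority queue; the keyword-fraction `classify` is a toy stand-in for Chakrabarti's text classifier, and `TOPIC_WORDS` is hypothetical):

```python
import heapq

TOPIC_WORDS = {"quantum", "physics"}   # hypothetical topic vocabulary

def classify(page_text):
    """Toy stand-in for the text classifier: fraction of on-topic words."""
    words = page_text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / max(len(words), 1)

pq = []   # max-priority queue via negated scores (heapq is a min-heap)

def enqueue_links(page_text, links):
    score = classify(page_text)            # e.g. 0.10 for a weakly relevant page
    for url in links:                      # soft focus: every extracted URL
        heapq.heappush(pq, (-score, url))  # inherits its parent page's score

def next_url():
    """Visit URLs in descending order of estimated relevance."""
    return heapq.heappop(pq)[1] if pq else None
```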
71Basic Focused Crawler
72Comparisons
- Start the baseline crawler from the URLs in one topic
- Fetch up to 20,000-25,000 pages
- For each pair of fetched pages (u, v), add an item to the training set of the apprentice
- Train the apprentice
- Start the enhanced crawler from the same set of pages
- Fetch about the same number of pages
73Results
74Controversy
- Chakrabarti claims the focused crawler is superior to breadth-first crawling
- Suel claims the contrary, and that the argument was based on experiments with poorly performing crawlers
75Crawling Issues
- How to crawl?
  - Quality: best pages first
  - Efficiency: avoid duplication (or near-duplication)
  - Etiquette: robots.txt, server load concerns
- How much to crawl? How much to index?
  - Coverage: how big is the Web? How much do we cover?
  - Relative coverage: how much do competitors have?
- How often to crawl?
  - Freshness: how much has changed?
  - How much has really changed? (why is this a different question?)
76Determining page changes
- Expires HTTP response header
  - for pages that come with an expiry date
- Otherwise, need to guess whether revisiting the page will yield a modified version
  - maintain a score reflecting the probability that the page has been modified
  - the crawler fetches URLs in decreasing order of score
- Assumption: the recent past predicts the future
77Estimating page change rates
- Brewington and Cybenko; Cho
  - algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
- Prerequisite
  - the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page
- Small-scale intermediate crawler runs
  - to monitor fast-changing sites
  - e.g. current news, weather, etc.
  - intermediate indices patched into the master index
78Refresh Strategy
- Crawlers can refresh only a certain number of pages in a given period of time
- The page download resource can be allocated in many ways
- The proportional refresh policy allocates the resource proportionally to each page's change rate
79Average Change Interval
[Chart: fraction of pages vs. average change interval]
80Change Interval By Domain
[Chart: fraction of pages vs. average change interval, broken down by domain]
81Modeling Web Evolution
- Poisson process with rate λ
- T is the time to the next event (change)
- f_T(t) = λ e^{-λt}  (t > 0)
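Under this model, the probability that a page has changed within t days of the last visit is 1 − e^{−λt}; a small sketch (λ is the page's estimated change rate):

```python
import math, random

def prob_changed(lam, t):
    """P(at least one change in t days) for a Poisson process with rate lam."""
    return 1.0 - math.exp(-lam * t)

def sample_next_change(lam):
    """Draw a time-to-next-change T with density f_T(t) = lam * exp(-lam * t)."""
    return random.expovariate(lam)

# e.g. a page changing on average every 10 days (lam = 0.1):
# prob_changed(0.1, 7) ~ 0.50 -> about a 50% chance it changed since last week
```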
82Change Interval
[Chart: for pages that change every 10 days on average, the fraction of changes with a given interval (in days) closely follows the Poisson model]
83Change Metrics
- Freshness of element ei at time t:
  F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise
84Change Metrics
- Age of element ei at time t:
  A(ei; t) = 0 if ei is up-to-date at time t, t - (modification time of ei) otherwise
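The two metrics as functions (a sketch; the `last_modified` / `last_crawled` timestamps are assumed bookkeeping kept by the crawler):

```python
def freshness(last_modified, last_crawled):
    """F(e; t) = 1 if our copy is up-to-date, else 0."""
    return 1 if last_crawled >= last_modified else 0

def age(last_modified, last_crawled, now):
    """A(e; t) = 0 if up-to-date, else time elapsed since the page changed."""
    return 0.0 if last_crawled >= last_modified else now - last_modified

# Averaging freshness over all elements and over time gives the collection-level
# freshness that a refresh policy tries to maximize.
```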
85Example
- The collection contains 2 pages
  - E1 changes 9 times a day
  - E2 changes once a day
- Simplified change model
  - the day is split into 9 equal intervals, and E1 changes once in each interval
  - E2 changes once during the entire day
  - the only unknown is when the pages change within their intervals
- The crawler can download only one page per day
- Our goal is to maximize freshness
86Example (2)
87Example (3)
- Which page do we refresh?
- If we refresh E2 at midday
  - if E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day
  - 50% chance of a 0.5-day freshness increase
  - 50% chance of no increase
  - expected freshness increase: 0.25 day
- If we refresh E1 at midday
  - if E1 changes in the first half of its interval and we refresh at midday (the middle of an interval), it remains fresh for the remaining half of the interval = 1/18 of a day
  - 50% chance of a 1/18-day freshness increase
  - 50% chance of no increase
  - expected freshness increase: 1/36 day
88Example (4)
- This gives a nice estimation
- But things are more complex in real life
  - not sure that a page will change within an interval
  - have to worry about age
- Under a Poisson model, a uniform policy always performs better than a proportional one
89Example (5)
- Studies have found the best policy for a similar example
  - assume page changes follow a Poisson process
  - assume 5 pages, which change 1, 2, 3, 4, 5 times a day
90Distributed Crawling
91Approaches
- Centralized Parallel Crawler
- Distributed
- P2P
92Distributed Crawlers
- A distributed crawler consists of multiple crawling processes communicating via a local network (intra-site distributed crawler) or the Internet (distributed crawler)
  - http://www2002.org/CDROM/refereed/108/index.html
- Setting: we have a number of c-procs
  - c-proc = crawling process
- Goal: crawl the best pages with minimum overhead
93Crawler-process distribution
- Central parallel crawler: c-procs on the same local network
- Distributed crawler: c-procs at geographically distant locations
94Distributed model
- Crawlers may be running in diverse geographic locations
- Periodically update a master index
  - incremental update, so this is cheap
  - compression, differential update, etc.
- Focus on communication overhead during the crawl
95Issues and benefits
- Issues
  - overlap: minimizing multiple downloads of the same pages
  - quality: depends on the crawl strategy
  - communication bandwidth: minimization
- Benefits
  - scalability: necessary for large-scale web crawls
  - costs: use of cheaper machines
  - network-load dispersion and reduction: divide the web into regions and crawl only the nearest pages
96Coordination
- A parallel crawler consists of multiple crawling processes communicating via a local network (intra-site parallel crawler) or the Internet (distributed crawler)
97Coordination
- Independent
  - no coordination; every process follows its extracted links
- Dynamic assignment
  - a central coordinator dynamically divides the web into small partitions and assigns each partition to a process
- Static assignment
  - the web is partitioned and assigned without a central coordinator before the crawl starts
98c-procs crawling the web
[Diagram: c-procs crawling the web, each with its own URLs crawled and URLs in queues; communication consists of URLs passed between c-procs.]
99Static assignment
- Links from one partition to another (inter-partition links) can be handled in
  - Firewall mode: a process does not follow any inter-partition links
  - Cross-over mode: a process also follows inter-partition links, and so discovers more pages in its partition
  - Exchange mode: processes exchange inter-partition URLs; this mode needs communication
100Classification of parallel crawlers
- If exchange mode is used, communication can be limited by
  - batch communication: every process collects some URLs and sends them in a batch
  - replication: the k most popular URLs are replicated at each process and are not exchanged (known from a previous crawl or detected on the fly)
- Some ways to partition the Web
  - URL-hash based: many inter-partition links
  - site-hash based: reduces the inter-partition links
  - hierarchical: e.g. by domain (.com, .net, ...)
101Static assignment: comparison

Mode       | Coverage | Overlap | Quality | Communication
-----------|----------|---------|---------|--------------
Firewall   | Bad      | Good    | Bad     | Good
Cross-over | Good     | Bad     | Bad     | Good
Exchange   | Good     | Good    | Good    | Bad
102UbiCrawler
- 2002: Boldi, Codenotti, Santini, Vigna
- Features
  - full distribution: identical agents, no central coordinator
  - balanced, locally computable assignment
    - each URL is assigned to one agent
    - each agent can compute the URL assignment locally
    - the distribution of URLs is balanced
  - scalability
    - the number of crawled pages per second and per agent is independent of the number of agents
  - fault tolerance
    - URLs are not statically distributed
    - a distributed reassignment protocol is not reasonable
103UbiCrawler: assignment function
- A = set of agent identifiers
- L ⊆ A = set of alive agents
- m = total number of hosts
- δ assigns each host h to an alive agent in L
- Requirements
  - Balance: each agent should be responsible for approximately the same number of hosts
  - Contravariance: if the number of agents grows, the portion of the web crawled by each agent must shrink
104Consistent Hashing
- Each bucket is replicated k times, and each replica is mapped randomly onto the unit circle
- Hashing a key: compute a point on the unit circle and find the nearest replica
- Example: k = 3, hosts 0, 1, ..., 9
  - for L = {a, b, c}: δ_L^{-1}(a) = {4,5,6,8}, δ_L^{-1}(b) = {0,2,7}, δ_L^{-1}(c) = {1,3,9}
  - for L = {a, b}: δ_L^{-1}(a) = {1,4,5,6,8,9}, δ_L^{-1}(b) = {0,2,3,7}
  - removing agent c reassigns only c's hosts: contravariance
- Balancing relies on the hash function and random number generator
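A sketch of consistent hashing with k replicas per agent on the unit circle (using `bisect` over sorted replica points; the MD5-based point function is an illustrative assumption, not UbiCrawler's actual hash):

```python
import bisect, hashlib

K = 3  # replicas per agent (bucket)

def point(s):
    """Map a string to a pseudo-random point on the unit circle [0, 1)."""
    h = hashlib.md5(s.encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

class ConsistentHash:
    def __init__(self, agents):
        # each agent is replicated K times at pseudo-random circle positions
        self.ring = sorted((point(f"{a}#{i}"), a)
                           for a in agents for i in range(K))

    def assign(self, host):
        """delta(host): the nearest replica clockwise from the host's point."""
        p = point(host)
        idx = bisect.bisect(self.ring, (p,)) % len(self.ring)
        return self.ring[idx][1]

# Contravariance: dropping an agent only reassigns that agent's hosts, e.g.
#   before = ConsistentHash(["a", "b", "c"])
#   after  = ConsistentHash(["a", "b"])
# hosts previously assigned to "a" or "b" keep their assignment in `after`.
```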
105UbiCrawler: fault tolerance
- Up to now, there are no metrics for estimating the fault tolerance of distributed crawlers
- Each agent has its own view of the set of alive agents (views can differ), but two agents will never dispatch the same host to two different agents
- Agents can be added dynamically in a self-stabilizing way
106Evaluation metrics (1)
- Overlap = (N - I) / I
  - N = total number of fetched pages
  - I = number of distinct fetched pages
  - goal: minimize the overlap
- Coverage = I / U
  - U = total number of Web pages
  - goal: maximize the coverage
107Evaluation metrics (2)
- Communication overhead = M / P
  - M = number of exchanged messages (URLs)
  - P = number of downloaded pages
  - goal: minimize the overhead
- Quality
  - goal: maximize the quality
  - measured by backlink count, relative to an omniscient "oracle" crawler
108Experiments
- 40M-URL graph from the Stanford WebBase
- Open Directory (dmoz.org) URLs as seeds
- Should be considered a "small Web"
109Firewall mode coverage
- The price of crawling in firewall mode
110Crossover mode overlap
- Demanding coverage drives up overlap
111Exchange mode communication
- Communication overhead is sublinear per downloaded URL
112Cho's conclusions
- With < 4 crawling processes run in parallel, firewall mode provides good coverage
- Firewall mode is not appropriate when
  - > 4 crawling processes are used
  - we download only a small subset of the Web and the quality of the downloaded pages is important
- Exchange mode
  - consumes < 1% of network bandwidth for URL exchanges
  - maximizes the quality of the downloaded pages
- By replicating 10,000 - 100,000 popular URLs, communication overhead is reduced by 40%